# Exploratory Data Analysis (EDA)

## Fit.ly Customer Churn Analysis


The objective of this notebook is to validate data quality, understand the structure of the available datasets, 
and identify key patterns and limitations relevant to customer churn analysis. 

This step focuses on data validation and exploratory understanding rather than modeling or prediction.


In [1]:
# Libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)


In [3]:
# Ingest raw data:

account_info = pd.read_csv("../data/raw/da_fitly_account_info.csv")
customer_support = pd.read_csv("../data/raw/da_fitly_customer_support.csv")
user_activity = pd.read_csv("../data/raw/da_fitly_user_activity.csv")

In [7]:
# structural overview:

datasets = {
    "Account Info": account_info,
    "Customer Support": customer_support,
    "User Activity": user_activity
}

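for name, df in datasets.items():
    # Underlined section title for each dataset
    print(f"\n{name}")
    print("-" * len(name))
    print("Shape:", df.shape)
    display(df.head())  # first five rows, rendered by the notebook

    df.info()  # column dtypes and non-null counts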



Account Info
------------
Shape: (400, 6)


|   | customer_id | email | state | plan | plan_list_price | churn_status |
|---|---|---|---|---|---|---|
| 0 | C10000 | user10000@example.com | New Jersey | Enterprise | 105 | Y |
| 1 | C10001 | user10001@example.net | Louisiana | Basic | 22 | Y |
| 2 | C10002 | user10002@example.net | Oklahoma | Basic | 24 |  |
| 3 | C10003 | user10003@example.com | Michigan | Free | 0 |  |
| 4 | C10004 | user10004@example.com | Texas | Enterprise | 119 |  |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   customer_id      400 non-null    object
 1   email            400 non-null    object
 2   state            400 non-null    object
 3   plan             400 non-null    object
 4   plan_list_price  400 non-null    int64 
 5   churn_status     114 non-null    object
dtypes: int64(1), object(5)
memory usage: 18.9+ KB

Customer Support
----------------
Shape: (918, 7)


|   | ticket_time | user_id | channel | topic | resolution_time_hours | state | comments |
|---|---|---|---|---|---|---|---|
| 0 | 2025-06-13 05:55:17.154573 | 10125 | chat | technical | 11.48 | 1 |  |
| 1 | 2025-08-06 13:21:54.539551 | 10109 | chat | account | 1.01 | 0 |  |
| 2 | 2025-08-22 12:39:35.718663 | 10149 | chat | technical | 10.09 | 0 | Erase my data from your systems. |
| 3 | 2025-06-07 02:49:46.986055 | 10268 | phone | account | 9.1 | 1 |  |
| 4 | 2025-07-25 00:24:38.945079 | 10041 | phone | other | 2.28 | 1 |  |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   ticket_time            918 non-null    object 
 1   user_id                918 non-null    int64  
 2   channel                918 non-null    object 
 3   topic                  918 non-null    object 
 4   resolution_time_hours  918 non-null    float64
 5   state                  918 non-null    int64  
 6   comments               46 non-null     object 
dtypes: float64(1), int64(2), object(4)
memory usage: 50.3+ KB

User Activity
-------------
Shape: (445, 3)


|   | event_time | user_id | event_type |
|---|---|---|---|
| 0 | 2025-09-08 15:05:39.422721 | 10118 | watch_video |
| 1 | 2025-09-08 08:15:05.264103 | 10220 | watch_video |
| 2 | 2025-11-14 06:28:35.207671 | 10009 | share_workout |
| 3 | 2025-08-20 16:53:38.682901 | 10227 | read_article |
| 4 | 2025-07-24 16:47:31.728422 | 10123 | track_workout |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 445 entries, 0 to 444
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   event_time  445 non-null    object
 1   user_id     445 non-null    int64 
 2   event_type  445 non-null    object
dtypes: int64(1), object(2)
memory usage: 10.6+ KB


---
### Dataset 1: Account Information (account_info)

This dataset represents the customer master table and serves as the primary source of:
- Customer attributes
- Subscription plan information
- Churn labels (when available)

Structure:
- Rows: 400 customers
- Columns: 6
- One row per customer

Key fields:
- customer_id: primary customer identifier
- plan, plan_list_price: monetization indicators
- state: geographic segmentation
- churn_status: churn indicator (Y / missing)

Data quality observations:
- All customer attributes and pricing fields are complete.
- The churn_status field is populated for 114 of 400 customers (28.5%).
- The remaining 286 customers have missing churn labels.

Validation insight (critical):
- The partial availability of churn labels implies:
    - Churn data is incomplete, not negative: a missing label means "unknown", not "did not churn".
    - Customers without labels cannot be assumed to be retained.
    - Analysis will focus on patterns associated with churn, not precise churn rate estimation.

This limitation is explicitly acknowledged and accounted for in downstream analysis.
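
The label coverage described above can be checked directly. The following is a minimal sketch, not the notebook's actual validation code: the helper name `label_coverage` is hypothetical, and the sample frame reuses only the five rows shown in the `head()` output so that the block runs on its own.

```python
import pandas as pd

def label_coverage(accounts: pd.DataFrame, label_col: str = "churn_status") -> dict:
    """Report how many rows carry a churn label and how many are missing one."""
    total = len(accounts)
    labeled = int(accounts[label_col].notna().sum())
    return {
        "total": total,
        "labeled": labeled,
        "unlabeled": total - labeled,
        "labeled_share": labeled / total,
    }

# The five rows from the head() output above (illustration only, not the full table).
sample = pd.DataFrame({
    "customer_id": ["C10000", "C10001", "C10002", "C10003", "C10004"],
    "plan": ["Enterprise", "Basic", "Basic", "Free", "Enterprise"],
    "churn_status": ["Y", "Y", None, None, None],
})

coverage = label_coverage(sample)
print(coverage)

# dropna=False keeps missing labels visible as their own category
# instead of silently dropping them from the counts.
print(sample["churn_status"].value_counts(dropna=False))
```

Applied to the full `account_info` table, the non-null count reported by `df.info()` (114 of 400) corresponds to a `labeled_share` of 0.285.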

---
### Dataset 2: Customer Support (customer_support)

This dataset captures customer friction and service interactions, used to evaluate:
- Whether higher support usage correlates with churn
- Whether longer resolution times indicate churn risk

Structure:
- Rows: 918 support tickets
- Columns: 7
- Multiple records per customer

Key fields:
- user_id: customer identifier (to be joined with customer_id)
- channel: support channel used
- topic: issue category
- resolution_time_hours: proxy for service quality
- state: numeric indicator (interpreted as ticket status)
- comments: free-text field
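
The `head()` outputs show `customer_id` as strings such as `C10000` and `user_id` as integers such as `10125`, so the two keys cannot be joined as-is. The sketch below assumes that `user_id` is the numeric part of `customer_id` with the `C` prefix removed; the source does not confirm this mapping, so the match rate should be inspected before trusting the join. The helper name `add_customer_id` and the ticket rows are hypothetical.

```python
import pandas as pd

def add_customer_id(events: pd.DataFrame, id_col: str = "user_id") -> pd.DataFrame:
    """Build a string key ASSUMING customer_id == 'C' + str(user_id)."""
    out = events.copy()
    out["customer_id"] = "C" + out[id_col].astype(str)
    return out

# Two account rows taken from the head() output above.
accounts = pd.DataFrame({
    "customer_id": ["C10000", "C10003"],
    "plan": ["Enterprise", "Free"],
})

# Hypothetical ticket rows, chosen to show both matching and non-matching keys.
tickets = pd.DataFrame({"user_id": [10000, 10000, 10003, 99999]})

# indicator=True adds a _merge column that marks which rows found a partner.
joined = add_customer_id(tickets).merge(
    accounts, on="customer_id", how="left", indicator=True
)
match_rate = float((joined["_merge"] == "both").mean())
print(f"Tickets matched to an account: {match_rate:.0%}")
```

A low match rate on the real data would suggest that the assumed key format is wrong or that some tickets belong to users outside the account table.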

Data quality observations:
- All operational fields are complete.
- comments is populated for only 46 of 918 records (~5%).

Validation decisions:
- comments will be excluded from quantitative analysis due to high sparsity.
- state is treated as a ticket status indicator, not geographic data.
- Aggregation is required at the customer level (e.g., ticket count, average resolution time).
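
The customer-level aggregation mentioned above could look like the following sketch. The function name `support_features`, the output column names, and the ticket rows are all illustrative assumptions; only the input column names come from the dataset.

```python
import pandas as pd

def support_features(tickets: pd.DataFrame) -> pd.DataFrame:
    """Collapse ticket-level rows into one row per user."""
    return (
        tickets.groupby("user_id")
        .agg(
            ticket_count=("ticket_time", "size"),
            avg_resolution_hours=("resolution_time_hours", "mean"),
            max_resolution_hours=("resolution_time_hours", "max"),
        )
        .reset_index()
    )

# Hypothetical tickets (not real records): one user with two tickets, one with one.
tickets = pd.DataFrame({
    "ticket_time": pd.to_datetime(
        ["2025-06-13 05:55:17", "2025-07-01 09:00:00", "2025-08-06 13:21:54"]
    ),
    "user_id": [10125, 10125, 10109],
    "resolution_time_hours": [10.0, 4.0, 1.0],
})

per_user = support_features(tickets)
print(per_user)
```

Users with no tickets do not appear in this result, so a left join from the account table followed by filling `ticket_count` with 0 would be needed to keep them in the analysis.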

---
### Dataset 3: User Activity (user_activity)

This dataset represents user engagement behavior, which is central to churn analysis.

Structure:
- Rows: 445 activity events
- Columns: 3
- Multiple events per user

Key fields:
- user_id: customer identifier
- event_time: timestamp of activity
- event_type: type of engagement action

Supported engagement actions:
- track_workout
- watch_video
- read_article
- share_workout

Validation insight:
- Activity data supports deriving:
    - Engagement volume per customer
    - Engagement mix by action type
- Temporal features (recency, frequency) may be derived from event_time if needed.
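
As a rough illustration of these derived features, the sketch below computes event volume, the mix by action type, and recency per user. The function name `activity_features`, the reference date `as_of`, and the event rows are assumptions made for the example, not part of the original analysis.

```python
import pandas as pd

def activity_features(events: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Per-user engagement volume, mix by event type, and days since last event."""
    events = events.copy()
    events["event_time"] = pd.to_datetime(events["event_time"])

    # Engagement mix: one column per event_type, counts per user.
    mix = pd.crosstab(events["user_id"], events["event_type"])

    features = mix.copy()
    features["event_count"] = mix.sum(axis=1)  # engagement volume

    # Recency: whole days between the reference date and the latest event.
    last_seen = events.groupby("user_id")["event_time"].max()
    features["days_since_last_event"] = (as_of - last_seen).dt.days
    return features.reset_index()

# Hypothetical events (not real records) using event types from the dataset.
events = pd.DataFrame({
    "event_time": [
        "2025-09-08 15:05:39",
        "2025-09-10 12:00:00",
        "2025-09-08 08:15:05",
    ],
    "user_id": [10118, 10118, 10220],
    "event_type": ["watch_video", "track_workout", "watch_video"],
})

# as_of is an assumed analysis cutoff date; pick one that fits the real data.
features = activity_features(events, as_of=pd.Timestamp("2025-09-15"))
print(features)
```

The choice of `as_of` matters: recency measured against the latest timestamp in the data will differ from recency measured against a fixed reporting date.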