# Customer Churn - EDA

## 1. Business Understanding
- What: Predict whether a customer is likely to churn
- Why: Customer churn reduces revenue
- Impact: Better retention, which protects revenue

## 2. Load and First Look
- 284,807 transactions
- 492 frauds (0.17%) - highly imbalanced!
- 30 features (Time, Amount, V1-V28)
- No missing values ✓

## 3. Target Variable Analysis
[Bar chart showing class imbalance]
Key insight: We'll need to handle severe imbalance

## 4. Feature Distributions
[Histograms of Amount, Time]
Key insights:
- Most transactions are small (<$100)
- Fraud transactions tend to be smaller
- No clear time pattern (fraud happens 24/7)

## 5. Correlation Analysis
[Heatmap]
Key insights:
- V17, V14, V12 most correlated with fraud
- Amount has weak negative correlation
- Features are PCA-transformed (already decorrelated)

## 6. Time-Based Patterns
[Line plot: fraud rate by hour]
Key insight: Fraud rate slightly higher during night hours
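
A fraud-rate-by-hour series like the one plotted here can be computed with a group-by. The sketch below is only an illustration: it assumes `Time` holds seconds elapsed since the first transaction and that `Class` is 1 for fraud and 0 otherwise, neither of which is stated in this outline. The small `toy` frame is made-up data, and the derived `hour` is an offset within a 24-hour cycle rather than a confirmed clock hour.

```python
import pandas as pd

# Hypothetical toy data: Time in seconds since the first transaction (assumed),
# Class = 1 for fraud, 0 otherwise (assumed).
toy = pd.DataFrame({
    "Time":  [10, 3700, 3900, 7300, 90000, 90500],
    "Class": [0, 1, 0, 0, 1, 1],
})

# Convert elapsed seconds to an hour offset in the range 0-23.
toy["hour"] = (toy["Time"] // 3600) % 24

# The mean of a 0/1 label within each hour is that hour's fraud rate.
fraud_rate_by_hour = toy.groupby("hour")["Class"].mean()
print(fraud_rate_by_hour)
```

The same `hour` column could double as one of the engineered time features.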

## 7. Amount Analysis
[Box plot: Amount by Class]
Key insight: Fraud transactions have lower median amount

## 8. Feature Importance (Preliminary)
[Bar chart from simple Random Forest]
Key insights: V17, V14, V12, V10 are most predictive

## 9. Conclusions & Next Steps
- Severe class imbalance → use SMOTE or class weights
- Engineer time-based features (hour, day)
- Focus on precision (minimize false positives for customer experience)
- Try multiple algorithms (RF, XGBoost, Logistic Regression)
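
To make the precision focus concrete, here is a small worked example. The labels and predictions are made up for illustration and are not results from this notebook. Precision is TP / (TP + FP), so every false positive lowers it. Recall is TP / (TP + FN).

```python
import numpy as np

# Made-up ground truth and predictions, for illustration only
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 0])

# Count the confusion-matrix cells that precision and recall depend on
tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # flagged and truly positive
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # flagged but actually negative
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # missed positives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Raising the decision threshold usually trades recall for precision, so it helps to report both numbers together.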

In [4]:
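import pandas as pd
import seaborn as sns

# Load the dataset (forward slashes work on Windows, macOS, and Linux)
df = pd.read_csv('data/dataset.csv')

# Summarize column names, non-null counts, and dtypes.
# df.info() prints its report itself and returns None, so it is not wrapped in print().
df.info()

# Count customers in each class of the target column
print(df['Churn'].value_counts())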


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


- Customers who churn represent roughly 25% of the data, so the classes are imbalanced and we'll have to account for that in our model.
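
One common way to account for this kind of imbalance is to weight each class inversely to its frequency. The sketch below uses the `n_samples / (n_classes * class_count)` heuristic, which is the formula scikit-learn documents for `class_weight='balanced'`. The labels are hypothetical stand-ins chosen to give a 25% churn share. They are not the notebook's actual counts.

```python
import pandas as pd

# Hypothetical labels: 75 non-churners and 25 churners (25% churn), illustration only
churn = pd.Series(["No"] * 75 + ["Yes"] * 25, name="Churn")

counts = churn.value_counts()
share = counts / len(churn)

# Balanced heuristic: rarer classes get proportionally larger weights
class_weights = (len(churn) / (len(counts) * counts)).to_dict()

print(share.to_dict())
print(class_weights)
```

A dictionary like this could be passed as `class_weight` to scikit-learn estimators that accept it, such as `LogisticRegression` or `RandomForestClassifier`.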

In [None]:
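# Pearson correlation is defined only for numeric columns, so drop the object-typed ones first
corr = df.select_dtypes(include='number').corr(method='pearson')
# annot and cmap are heatmap arguments, not DataFrame.corr arguments
sns.heatmap(corr, annot=True, cmap='coolwarm')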
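
If `Churn` is stored as text, as most columns in the `df.info()` output are, a Pearson correlation can't include it directly. A possible workaround, sketched here, is to map it to 0/1 first. The `'Yes'`/`'No'` labels, the `Churn_flag` name, and the small `toy` frame are assumptions for illustration. Check `df['Churn'].unique()` before applying the mapping to the real data.

```python
import pandas as pd

# Made-up rows that mimic a few columns from df.info(); values are illustrative only
toy = pd.DataFrame({
    "tenure":        [1, 2, 40, 60, 3, 55],
    "SeniorCitizen": [0, 1, 0, 0, 1, 0],
    "Churn":         ["Yes", "Yes", "No", "No", "Yes", "No"],
})

# Encode the target as 0/1 (assumes the labels are exactly 'Yes' and 'No')
toy["Churn_flag"] = toy["Churn"].map({"Yes": 1, "No": 0})

# Correlation of each numeric feature with the encoded target
target_corr = (
    toy.select_dtypes(include="number")
       .corr(method="pearson")["Churn_flag"]
       .drop("Churn_flag")
)
print(target_corr.sort_values())
```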