# **Customer Churn Analysis**

This notebook preprocesses a customer churn dataset so that it is ready for modeling. It covers the following steps:

1. Loading the dataset
2. Normalizing numerical data
3. Encoding categorical data
4. Applying the transformations to the features
5. Handling class imbalance
6. Detecting outliers using the Z-score method

Let's get started!

## **Step 1: Load the Dataset**

We will first load the necessary libraries and our dataset, which is a CSV file containing customer churn information.
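
Column names in a CSV header can carry irregular whitespace; the column list used in Step 2 of this notebook, for example, contains names such as `'Call  Failure'` with two consecutive spaces. The sketch below shows one way to make such whitespace visible and, optionally, to normalize it. The CSV text and the names `toy_df` and `toy_clean` are made up for illustration; the rest of this notebook keeps the original column names.

```python
import io

import pandas as pd

# Hypothetical two-row CSV; note the double space in "Call  Failure".
csv_text = "Call  Failure,Seconds of Use,Churn\n8,4370,0\n0,318,0\n"
toy_df = pd.read_csv(io.StringIO(csv_text))

# repr() makes the extra whitespace in each name visible.
print([repr(name) for name in toy_df.columns])

# Optional: collapse runs of whitespace to one space and trim both ends.
toy_clean = toy_df.copy()
toy_clean.columns = (
    toy_df.columns.str.replace(r"\s+", " ", regex=True).str.strip()
)
print(list(toy_clean.columns))
```

If the names are normalized this way, every later column list has to use the normalized spelling as well.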


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from imblearn.over_sampling import SMOTE # type: ignore
from scipy import stats
import numpy as np

# Load the dataset
df = pd.read_csv('/content/Customer Churn.csv')

df.head()

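| Unnamed: 0 | Call Failure | Complains | Subscription Length | Charge Amount | Seconds of Use | Frequency of use | Frequency of SMS | Distinct Called Numbers | Age Group | Tariff Plan | Status | Age | Customer Value | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 0 | 38 | 0 | 4370 | 71 | 5 | 17 | 3 | 1 | 1 | 30 | 197.64 | 0 |
| 1 | 0 | 0 | 39 | 0 | 318 | 5 | 7 | 4 | 2 | 1 | 2 | 25 | 46.035 | 0 |
| 2 | 10 | 0 | 37 | 0 | 2453 | 60 | 359 | 24 | 3 | 1 | 1 | 30 | 1536.52 | 0 |
| 3 | 10 | 0 | 38 | 0 | 4198 | 66 | 1 | 35 | 1 | 1 | 1 | 15 | 240.02 | 0 |
| 4 | 3 | 0 | 38 | 0 | 2393 | 58 | 2 | 33 | 1 | 1 | 1 | 15 | 145.805 | 0 |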


## **Step 2: Scaling the Data (Normalization)**

We will normalize the numerical columns with Min-Max scaling, which rescales each feature to the range [0, 1]. Putting the features on a common scale helps algorithms that are sensitive to feature magnitude, such as distance-based and gradient-based methods.
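
As a small worked example of the Min-Max formula, the sketch below rescales the five `Seconds of Use` values from the preview above by hand with NumPy. It only illustrates the arithmetic; the notebook itself uses scikit-learn's `MinMaxScaler`, and the variable names here are chosen for illustration.

```python
import numpy as np

# The five 'Seconds of Use' values shown in the df.head() preview.
seconds = np.array([4370.0, 318.0, 2453.0, 4198.0, 2393.0])

# Min-Max scaling: x_scaled = (x - min) / (max - min)
value_range = seconds.max() - seconds.min()
scaled = (seconds - seconds.min()) / value_range

# The smallest value maps to 0, the largest to 1, the rest fall in between.
print(scaled.round(3))
```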

### **Numerical Columns**
The numerical columns in our dataset are:
- Call Failure
- Subscription Length
- Charge Amount
- Seconds of Use
- Frequency of use
- Frequency of SMS
- Distinct Called Numbers
- Age
- Customer Value


In [3]:
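# Step 2: Scaling the Data (Normalization)
# Note: some column names contain two consecutive spaces (e.g. 'Call  Failure');
# the names listed here must match the CSV header exactly.
num_columns = ['Call  Failure', 'Subscription  Length', 'Charge  Amount',
               'Seconds of Use', 'Frequency of use', 'Frequency of SMS',
               'Distinct Called Numbers', 'Age', 'Customer Value']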



## Step 3: Encoding Categorical Data (One-Hot Encoding)

Next, we will encode the categorical columns using One-Hot Encoding. This technique replaces each categorical column with one binary indicator column per category, so that a model does not interpret the category codes as ordered numeric values.

### Categorical Columns
The categorical columns in our dataset are:
- Age Group
- Tariff Plan


In [4]:
# Step 3: Encoding Categorical Data (One-Hot Encoding)
cat_columns = ['Age Group', 'Tariff Plan']

# Define OneHotEncoder for categorical columns and MinMaxScaler for numerical columns.
# Columns not listed in either group (e.g. 'Complains', 'Status') are dropped,
# because ColumnTransformer uses remainder='drop' by default.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', MinMaxScaler(), num_columns),
        ('cat', OneHotEncoder(sparse_output=False), cat_columns)  # dense output; the parameter was named `sparse` before scikit-learn 1.2
    ])

# Separate the features from the target variable 'Churn'
X = df.drop(columns='Churn')
y = df['Churn']



## Step 4: Transforming the Data

We will apply the scaling and encoding transformations to the features, excluding the target variable 'Churn'.


In [5]:
# Transform the data
X_processed = preprocessor.fit_transform(X)

# Convert the processed data back into a DataFrame
# Get feature names for OneHotEncoded columns from the fitted 'cat' transformer
encoded_columns = preprocessor.named_transformers_['cat'].get_feature_names_out(cat_columns)
all_columns = num_columns + list(encoded_columns)

# Step 4: Create a new DataFrame with transformed data
X_processed_df = pd.DataFrame(X_processed, columns=all_columns)



## Step 5: Handling Class Imbalance using SMOTE

In this step, we will split the data into training and testing sets, and then apply SMOTE (Synthetic Minority Over-sampling Technique) to handle class imbalance in the target variable 'Churn'.
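
SMOTE creates synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority-class neighbours. The sketch below illustrates that idea with NumPy on made-up points. The function `smote_sketch` and its parameters are hypothetical names, and this is a simplified illustration, not the `imblearn` implementation used in this notebook.

```python
import numpy as np


def smote_sketch(X_minority, n_new, k=2, seed=0):
    """Simplified SMOTE-style interpolation (illustrative only)."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)

    # Pairwise Euclidean distances between minority samples.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest minority neighbours of each sample.
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base sample and one of its k neighbours per new point.
    base = rng.integers(0, n, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Place the new point at a random position on the connecting segment.
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])


# Four made-up minority samples on the corners of the unit square.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(minority, n_new=6)
print(synthetic.shape)
```

Every synthetic point lies on a line segment between two existing minority samples, which is why SMOTE is normally applied to the training split only, as in the cell below.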


In [6]:

# Step 5: Handling Class Imbalance using SMOTE
X_train, X_test, y_train, y_test = train_test_split(X_processed_df, y, test_size=0.3, random_state=42)

# Apply SMOTE to handle class imbalance
smote = SMOTE(random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)




## Step 6: Outlier Detection using Z-Score

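Finally, we will detect outliers in the resampled training data using the Z-score method. We will consider a value an outlier if the absolute value of its Z-score is greater than 3, that is, if it lies more than three standard deviations from the mean of its feature.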
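
The sketch below applies the same rule to a small made-up array so the computation is easy to follow. It computes the Z-score by hand as `(x - mean) / std` and compares the result with `scipy.stats.zscore`, which uses the population standard deviation (`ddof=0`) by default. The data and variable names are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Made-up data: column 0 has one extreme value, column 1 has none.
values = np.column_stack([
    np.array([10.0] * 19 + [100.0]),
    np.arange(20, dtype=float),
])

# Z-score by hand, computed per column.
manual_z = (values - values.mean(axis=0)) / values.std(axis=0)

# The same computation with SciPy (column-wise by default).
scipy_z = stats.zscore(values)

# Flag values more than 3 standard deviations from the column mean.
outlier_mask = np.abs(scipy_z) > 3
outliers_per_column = outlier_mask.sum(axis=0)
rows_with_outlier = np.flatnonzero(outlier_mask.any(axis=1))
print(outliers_per_column, rows_with_outlier)
```

The boolean mask can be summed per column, as in the notebook, or reduced per row to find which samples contain at least one flagged value.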


In [7]:


# Step 6: Outlier Detection using Z-Score
z_scores = np.abs(stats.zscore(X_train_sm[num_columns]))  # Z-score for numeric columns
outliers = (z_scores > 3).sum(axis=0)

# Display the number of outliers detected per feature
outliers_per_feature = dict(zip(num_columns, outliers))
print("Outliers per feature:", outliers_per_feature)

Outliers per feature: {'Call  Failure': 20, 'Subscription  Length': 23, 'Charge  Amount': 66, 'Seconds of Use': 127, 'Frequency of use': 83, 'Frequency of SMS': 142, 'Distinct Called Numbers': 70, 'Age': 0, 'Customer Value': 102}


# Summary of Customer Churn Analysis Notebook

This notebook performs preprocessing on a customer churn dataset with the following steps:

1. **Loading the Dataset**: The dataset is loaded from a CSV file using pandas and previewed with `df.head()`.

2. **Scaling Numerical Data**: Numerical features are rescaled to the range [0, 1] using Min-Max scaling so that they share a consistent scale.

3. **Encoding Categorical Data**: Categorical variables, such as 'Age Group' and 'Tariff Plan', are transformed into a numerical format using One-Hot Encoding.

4. **Data Transformation**: The preprocessing steps are applied to the features, and a new DataFrame is created with transformed data.

5. **Handling Class Imbalance**: The dataset is split into training and testing sets, and SMOTE is applied to address class imbalance in the target variable 'Churn'.

6. **Outlier Detection**: Z-scores are calculated for the numeric features of the resampled training data, and values whose absolute Z-score exceeds 3 are counted as outliers per feature.

### Conclusion

The notebook prepares the customer churn dataset for modeling by normalizing numerical data, encoding categorical variables, handling class imbalance, and counting outliers per feature. These preprocessing steps lay the groundwork for building a robust predictive model.
