**<font size="5">AMM Handling Data Imbalance in Classification Models</font>**

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.utils import resample

In [4]:
churnData = pd.read_csv('Customer-Churn.csv')
print(churnData)

      gender  SeniorCitizen Partner Dependents  tenure PhoneService   
0     Female              0     Yes         No       1           No  \
1       Male              0      No         No      34          Yes   
2       Male              0      No         No       2          Yes   
3       Male              0      No         No      45           No   
4     Female              0      No         No       2          Yes   
...      ...            ...     ...        ...     ...          ...   
7038    Male              0     Yes        Yes      24          Yes   
7039  Female              0     Yes        Yes      72          Yes   
7040  Female              0     Yes        Yes      11           No   
7041    Male              1     Yes         No       4          Yes   
7042    Male              0      No         No      66          Yes   

     OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV   
0                No          Yes               No          No         

In [5]:
churnData['TotalCharges'] = pd.to_numeric(churnData['TotalCharges'], errors='coerce')

In [7]:
churnData.dropna(inplace=True)

In [8]:
features = ['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']
X = churnData[features]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [9]:
y = churnData['Churn']
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

In [10]:
model = LogisticRegression()
model.fit(X_train, y_train)

In [11]:
y_pred = model.predict(X_test)
accuracy_unbalanced = accuracy_score(y_test, y_pred)
print("Accuracy (Unbalanced):", accuracy_unbalanced)

Accuracy (Unbalanced): 0.7803837953091685


In [12]:
print("Class Distribution:\n", y_train.value_counts())

Class Distribution:
 Churn
No     4130
Yes    1495
Name: count, dtype: int64


The class distribution shows how many customers in the training set fall into each churn category:

**No**: 4,130 customers have not churned, which means they are still using the service.

**Yes**: 1,495 customers have churned, which means they have stopped using the service.

In simpler terms, customers who stayed outnumber customers who left by roughly 2.8 to 1 (about 73% versus 27% of the training set). This gap between the two categories is the imbalance in the data: one category (No) is much larger than the other (Yes).
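To put the imbalance in numbers, the short sketch below recomputes the class shares from the counts printed above. The variable names (`class_counts`, `majority_baseline`, `imbalance_ratio`) are illustrative and do not appear in the notebook. The majority share is also the accuracy a trivial model would get on this training split by always predicting "No", which is a useful reference point when reading accuracy scores.

```python
import pandas as pd

# Class counts copied from the y_train.value_counts() output above.
class_counts = pd.Series({"No": 4130, "Yes": 1495})

# Share of each class in the training set.
class_share = class_counts / class_counts.sum()

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = class_share.max()

# How many majority examples there are per minority example.
imbalance_ratio = class_counts.max() / class_counts.min()

print(class_share.round(3))
print("Majority-class baseline accuracy:", round(majority_baseline, 3))
print("Imbalance ratio (No : Yes):", round(imbalance_ratio, 2))
```

The baseline works out to about 73% on the training split; the test split may differ slightly because the split above is not stratified.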

In [13]:
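# Separate the majority ("No") and minority ("Yes") classes
df_majority = churnData[churnData['Churn'] == 'No']
df_minority = churnData[churnData['Churn'] == 'Yes']

# Upsample the minority class (sampling with replacement) to match the majority count
df_minority_upsampled = resample(df_minority, replace=True, n_samples=len(df_majority), random_state=42)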

# Combine the upsampled minority class with the majority class
churnData_upsampled = pd.concat([df_majority, df_minority_upsampled])

# Downsample the majority class
df_majority_downsampled = resample(df_majority, replace=False, n_samples=len(df_minority), random_state=42)

# Combine the downsampled majority class with the minority class
churnData_downsampled = pd.concat([df_minority, df_majority_downsampled])

In [14]:
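# Scale the resampled features.
# Note: fit_transform re-fits the scaler on each resampled set, so these
# features are scaled differently from X_test, which was transformed by the
# earlier fit on the original data.
X_upsampled = scaler.fit_transform(churnData_upsampled[features])
y_upsampled = churnData_upsampled['Churn']

X_downsampled = scaler.fit_transform(churnData_downsampled[features])
y_downsampled = churnData_downsampled['Churn']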

# Fit models on upsampled and downsampled data
model_upsampled = LogisticRegression()
model_upsampled.fit(X_upsampled, y_upsampled)

model_downsampled = LogisticRegression()
model_downsampled.fit(X_downsampled, y_downsampled)

# Evaluate accuracy
y_pred_upsampled = model_upsampled.predict(X_test)
accuracy_upsampled = accuracy_score(y_test, y_pred_upsampled)
print("Accuracy (Upsampled):", accuracy_upsampled)

y_pred_downsampled = model_downsampled.predict(X_test)
accuracy_downsampled = accuracy_score(y_test, y_pred_downsampled)
print("Accuracy (Downsampled):", accuracy_downsampled)

Accuracy (Upsampled): 0.6687988628287136
Accuracy (Downsampled): 0.6723525230987918


**Upsampled Model Accuracy:** The model trained on the upsampled data correctly predicted about 67% of the test outcomes. Balancing the number of "Churn" and "No Churn" cases therefore lowered accuracy compared with the unbalanced model's 78%; it did not improve it.

**Downsampled Model Accuracy:** The model trained on the downsampled data also achieved an accuracy of around 67%. Reducing the number of "No Churn" cases to match the number of "Churn" cases gave nearly the same result as upsampling.

In simpler terms, both the upsampled and downsampled models score about 11 percentage points below the initial unbalanced model on accuracy. A drop like this is common after rebalancing: a model trained on balanced data predicts "Churn" more often, so it misclassifies more of the majority class. Accuracy alone cannot show whether the balanced models catch more actual churners in return, so it is not enough to judge whether balancing helped.
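Because accuracy hides how each class is handled, it can help to look at per-class metrics as well. The following is a minimal sketch on synthetic data generated with `make_classification`, not on the churn dataset; the roughly 73/27 class split and all variable names here are assumptions chosen to mimic the distribution above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    classification_report,
    confusion_matrix,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data: roughly 73% class 0 ("No"), 27% class 1 ("Yes").
X_demo, y_demo = make_classification(
    n_samples=2000,
    n_features=4,
    n_informative=3,
    n_redundant=1,
    weights=[0.73, 0.27],
    random_state=42,
)

# Stratify so the train and test splits keep the same class proportions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

demo_model = LogisticRegression().fit(X_tr, y_tr)
y_hat = demo_model.predict(X_te)

# Overall accuracy versus the mean of per-class recall.
demo_accuracy = accuracy_score(y_te, y_hat)
demo_balanced_accuracy = balanced_accuracy_score(y_te, y_hat)
print("Accuracy:", demo_accuracy)
print("Balanced accuracy:", demo_balanced_accuracy)

# Rows are true classes, columns are predicted classes.
demo_cm = confusion_matrix(y_te, y_hat)
print(demo_cm)

# Precision, recall and F1 for each class.
print(classification_report(y_te, y_hat, target_names=["No", "Yes"]))
```

Applied to the notebook's own models, the same calls would take `y_test` together with `y_pred`, `y_pred_upsampled`, or `y_pred_downsampled`.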

**Conclusion**

In this analysis, we looked at customer churn data to identify trends that could help us predict whether customers will leave the service ("Churn") or continue using it ("No Churn"). We started with an unbalanced dataset, where there were more "No Churn" cases than "Churn" cases.

We balanced the data in two ways: by creating more "Churn" cases (upsampling) and by reducing "No Churn" cases (downsampling). After training models on these balanced datasets, test accuracy fell from about 78% for the unbalanced model to about 67%.
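One detail of the cells above is worth noting: the resampling draws from the full `churnData`, so rows that ended up in `X_test` can also appear in the resampled training data. A common way to avoid this is to split first and resample only the training split, then fit the scaler once on training data and reuse it for the test data. The sketch below illustrates that order of operations on a small made-up DataFrame; `toy`, its two columns, and the churn probabilities are hypothetical and are not taken from `Customer-Churn.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Hypothetical toy frame standing in for churnData.
rng = np.random.default_rng(42)
n_rows = 1000
toy = pd.DataFrame({
    "tenure": rng.integers(1, 73, size=n_rows),
    "MonthlyCharges": rng.uniform(20, 120, size=n_rows),
})
# Assumption for illustration: short-tenure customers churn more often.
churn_prob = 0.5 - 0.006 * toy["tenure"]
toy["Churn"] = np.where(rng.random(n_rows) < churn_prob, "Yes", "No")
toy_features = ["tenure", "MonthlyCharges"]

# 1. Split first, so test rows never take part in resampling.
train_df, test_df = train_test_split(
    toy, test_size=0.2, stratify=toy["Churn"], random_state=42
)

# 2. Upsample the "Yes" class inside the training split only.
train_no = train_df[train_df["Churn"] == "No"]
train_yes = train_df[train_df["Churn"] == "Yes"]
train_yes_up = resample(
    train_yes, replace=True, n_samples=len(train_no), random_state=42
)
train_balanced = pd.concat([train_no, train_yes_up])

# 3. Fit the scaler on training data and reuse the same fit for the test data.
toy_scaler = StandardScaler().fit(train_balanced[toy_features])
X_train_bal = toy_scaler.transform(train_balanced[toy_features])
X_test_toy = toy_scaler.transform(test_df[toy_features])

toy_model = LogisticRegression().fit(X_train_bal, train_balanced["Churn"])
print("Test accuracy:", toy_model.score(X_test_toy, test_df["Churn"]))
```

Downsampling follows the same pattern, with `resample(train_no, replace=False, n_samples=len(train_yes), ...)` in step 2.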

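Both the upsampled and downsampled models reached around 67% accuracy, below the unbalanced model's 78%. Accuracy is a weak yardstick on imbalanced data, so this drop does not show on its own whether balancing made the model better or worse at identifying customers who churn; metrics such as precision and recall for the "Churn" class are needed to judge that.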

In conclusion, balancing the data is a reasonable way to give the "Churn" class more weight during training, but it did not raise accuracy here. Building a better model might require evaluating with metrics suited to imbalanced classes, exploring other techniques, considering additional features, or trying different types of machine learning models. The complex nature of predicting customer churn suggests that a holistic approach is needed to build a more accurate predictive model.
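As one example of the "other techniques" mentioned above, scikit-learn's `LogisticRegression` accepts `class_weight="balanced"`, which reweights the classes during training instead of resampling rows. This is offered only as a possible direction, not as something the notebook tried. The sketch uses synthetic data from `make_classification`, with label 1 standing in for "Yes"; the data and variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data (label 1 plays the role of "Yes").
X_syn, y_syn = make_classification(
    n_samples=3000,
    n_features=4,
    n_informative=3,
    n_redundant=1,
    weights=[0.73, 0.27],
    random_state=0,
)
X_syn_tr, X_syn_te, y_syn_tr, y_syn_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

# Train the same model with and without class weighting and compare.
results = {}
for label, weighting in [("unweighted", None), ("balanced", "balanced")]:
    clf = LogisticRegression(class_weight=weighting).fit(X_syn_tr, y_syn_tr)
    preds = clf.predict(X_syn_te)
    results[label] = {
        "accuracy": accuracy_score(y_syn_te, preds),
        # Recall for the minority class: share of true positives that were found.
        "minority_recall": recall_score(y_syn_te, preds, pos_label=1),
    }
    print(label, results[label])
```

Comparing minority-class recall alongside accuracy shows the trade-off that a single accuracy figure conceals.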