### Scenario
#### You are working as an analyst for an internet service provider. You are given historical data about the company's customers and their churn trends. Your task is to build a machine learning model that helps the company identify customers who are more likely to default/churn, so that it can prevent losses from such customers.

### Instructions
#### In this lab, we will first look at the degree of imbalance in the data and then correct it using the techniques we learned in class.

#### Here is the list of steps to be followed (building a simple model without balancing the data):

1. Import the required libraries and modules.

In [300]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
# StandardScaler and LogisticRegression are used in step 5 and later
from sklearn.preprocessing import Normalizer, StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.utils import resample
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

2. Read the data into Python and call the dataframe churnData.

In [279]:
churnData = pd.read_csv('C:/Users/faeze/OneDrive/Desktop/Ironhack/Unit-7/Session-44/lab-handling-data-imbalance-classification/files_for_lab/Customer-Churn.csv')
churnData

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85,No
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.5,No
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15,Yes
3,Male,0,No,No,45,No,Yes,No,Yes,Yes,No,No,One year,42.30,1840.75,No
4,Female,0,No,No,2,Yes,No,No,No,No,No,No,Month-to-month,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,Male,0,Yes,Yes,24,Yes,Yes,No,Yes,Yes,Yes,Yes,One year,84.80,1990.5,No
7039,Female,0,Yes,Yes,72,Yes,No,Yes,Yes,No,Yes,Yes,One year,103.20,7362.9,No
7040,Female,0,Yes,Yes,11,No,Yes,No,No,No,No,No,Month-to-month,29.60,346.45,No
7041,Male,1,Yes,No,4,Yes,No,No,No,No,No,No,Month-to-month,74.40,306.6,Yes


3. Check the datatypes of all the columns in the data. You will see that the column TotalCharges is of object type. Convert this column to a numeric type using the pd.to_numeric function.

In [280]:
churnData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   OnlineSecurity    7043 non-null   object 
 7   OnlineBackup      7043 non-null   object 
 8   DeviceProtection  7043 non-null   object 
 9   TechSupport       7043 non-null   object 
 10  StreamingTV       7043 non-null   object 
 11  StreamingMovies   7043 non-null   object 
 12  Contract          7043 non-null   object 
 13  MonthlyCharges    7043 non-null   float64
 14  TotalCharges      7043 non-null   object 
 15  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(13)
memory

In [281]:
categoricals = churnData.select_dtypes('object')
categoricals

Unnamed: 0,gender,Partner,Dependents,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,TotalCharges,Churn
0,Female,Yes,No,No,No,Yes,No,No,No,No,Month-to-month,29.85,No
1,Male,No,No,Yes,Yes,No,Yes,No,No,No,One year,1889.5,No
2,Male,No,No,Yes,Yes,Yes,No,No,No,No,Month-to-month,108.15,Yes
3,Male,No,No,No,Yes,No,Yes,Yes,No,No,One year,1840.75,No
4,Female,No,No,Yes,No,No,No,No,No,No,Month-to-month,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,Male,Yes,Yes,Yes,Yes,No,Yes,Yes,Yes,Yes,One year,1990.5,No
7039,Female,Yes,Yes,Yes,No,Yes,Yes,No,Yes,Yes,One year,7362.9,No
7040,Female,Yes,Yes,No,Yes,No,No,No,No,No,Month-to-month,346.45,No
7041,Male,Yes,No,Yes,No,No,No,No,No,No,Month-to-month,306.6,Yes


In [282]:
churnData['TotalCharges'] = pd.to_numeric(churnData['TotalCharges'], errors='coerce')
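A small sketch of what `errors='coerce'` does in the conversion above. It assumes the object dtype comes from blank strings in the column; the toy values below are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy column: numbers stored as strings, plus one blank entry.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors='coerce' turns anything that cannot be parsed into NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)
print(converted.isna().sum())
```

Without `errors='coerce'`, the default `errors='raise'` would stop at the first unparseable entry.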

In [283]:
numericals = churnData.select_dtypes('number')
numericals

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,TotalCharges
0,0,1,29.85,29.85
1,0,34,56.95,1889.50
2,0,2,53.85,108.15
3,0,45,42.30,1840.75
4,0,2,70.70,151.65
...,...,...,...,...
7038,0,24,84.80,1990.50
7039,0,72,103.20,7362.90
7040,0,11,29.60,346.45
7041,1,4,74.40,306.60


4. Check for null values in the dataframe. Replace the null values.

In [284]:
nulls = pd.DataFrame(churnData.isna().sum()*100/len(churnData), columns=['percentage'])
nulls.sort_values('percentage', ascending = False)

Unnamed: 0,percentage
TotalCharges,0.156183
gender,0.0
SeniorCitizen,0.0
Partner,0.0
Dependents,0.0
tenure,0.0
PhoneService,0.0
OnlineSecurity,0.0
OnlineBackup,0.0
DeviceProtection,0.0


In [285]:
churnData.isna().sum().sort_values(ascending=False)

TotalCharges        11
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
MonthlyCharges       0
Churn                0
dtype: int64

In [286]:
churnData["TotalCharges"].value_counts()

20.20      11
19.75       9
20.05       8
19.90       8
19.65       8
           ..
6849.40     1
692.35      1
130.15      1
3211.90     1
6844.50     1
Name: TotalCharges, Length: 6530, dtype: int64

In [287]:
def log_transform(x):
    # Return log(x) for finite, non-zero x; otherwise return NaN
    if np.isfinite(x) and x != 0:
        return np.log(x)
    else:
        return np.nan  # np.NAN was removed in NumPy 2.0

In [288]:
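# Log-transform TotalCharges, then fill the remaining NaNs.
# Note: the fill value is the median of the raw (untransformed) column,
# so the imputed rows are not on the log scale.
churnData['TotalCharges'] = churnData['TotalCharges'].apply(log_transform).fillna(churnData['TotalCharges'].median())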
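The cell above fills the missing values with the median of the raw column, while the other values are log-transformed. One possible alternative, sketched here under the assumption that every value should end up on the log scale, is to transform first and then fill with the median of the transformed values. The toy series is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value.
total = pd.Series([29.85, 1889.5, np.nan, 108.15])

# np.log propagates NaN, so missing entries stay missing after the transform.
log_total = np.log(total)

# Fill with the median of the log-scaled values so all entries share one scale.
log_total = log_total.fillna(log_total.median())

print(log_total.round(3).tolist())
```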

5. Use the following features: tenure, SeniorCitizen, MonthlyCharges and TotalCharges:

Scale the features either by using normalizer or a standard scaler.

Split the data into a training set and a test set.

Fit a logistic regression model on the training data.

Check the accuracy on the test data.
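The four sub-steps above can be sketched as a single pipeline. This is only an illustration on synthetic data generated with `make_classification` (an assumption standing in for the four churn features); it splits before scaling so that the scaler is fitted on the training rows only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four churn features (not the real data).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.73, 0.27], random_state=42,
)

# Split first so the test rows never influence the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits StandardScaler on the training rows, then the classifier.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)

test_accuracy = model.score(X_te, y_te)
print("Test accuracy:", test_accuracy)
```

Wrapping the scaler and the model in one pipeline means `model.predict` applies the same training-set scaling to any new rows.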

In [290]:
#churnData['Churn'] = churnData['Churn'].apply(lambda x: 1 if x in ['Yes'] else '0')

#churnData['Churn'].value_counts()

In [291]:
#X = churnData[['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']]
#y = churnData['Churn']

#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

#X_train_num = X_train.select_dtypes(np.number)
#X_test_num = X_test.select_dtypes(np.number)

In [292]:
#transformer = Normalizer()
#transformer.fit(X)
#x_normalized = transformer.transform(X)

In [293]:
#lr = LinearRegression()
#lr.fit(X_train, y_train)

In [294]:
#y_pred = lr.predict(X_test)
#print("Accuracy:", accuracy_score(y_test, y_pred))
#print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
#print("Classification Report:\n", classification_report(y_test, y_pred))

In [295]:
X = churnData[['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']]
y = churnData['Churn']
scaler = StandardScaler()
X = scaler.fit_transform(X)

In [296]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [297]:
lr = LogisticRegression()
lr.fit(X_train, y_train)

LogisticRegression()

In [298]:
y_pred = lr.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.8062455642299503
Confusion Matrix:
 [[960  76]
 [197 176]]
Classification Report:
               precision    recall  f1-score   support

          No       0.83      0.93      0.88      1036
         Yes       0.70      0.47      0.56       373

    accuracy                           0.81      1409
   macro avg       0.76      0.70      0.72      1409
weighted avg       0.79      0.81      0.79      1409
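To make the report easier to read, the headline numbers can be recomputed by hand from the confusion matrix printed above. The sketch assumes scikit-learn's convention of rows for actual labels and columns for predicted labels, in the order No, Yes.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows = actual, columns = predicted, label order ['No', 'Yes'].
cm = np.array([[960, 76],
               [197, 176]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
recall_yes = tp / (tp + fn)      # share of actual churners that were caught
precision_yes = tp / (tp + fp)   # share of predicted churners that were right

print(round(float(accuracy), 4), round(float(recall_yes), 2), round(float(precision_yes), 2))
# 0.8062 0.47 0.7
```

The recall of 0.47 for the Yes class means the model misses more than half of the customers who actually churn, even though overall accuracy is about 0.81.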



6. Managing imbalance in the dataset

In [289]:
churnData["Churn"].value_counts()

No     5174
Yes    1869
Name: Churn, dtype: int64
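A quick way to quantify the imbalance is to compare the class shares. The sketch below uses only the counts printed above and also computes the accuracy of a trivial model that always predicts No, which is a useful baseline for any accuracy figure on this data.

```python
# Class counts taken from the value_counts() output above.
counts = {"No": 5174, "Yes": 1869}
total = sum(counts.values())

minority_share = counts["Yes"] / total
# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = counts["No"] / total

print(f"Churn share: {minority_share:.3f}")           # 0.265
print(f"Majority baseline: {baseline_accuracy:.3f}")  # 0.735
```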

7. Use the resampling strategies used in class for upsampling and downsampling to create a balance between the two classes.

In [301]:
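# Randomly duplicate minority-class rows until both classes have the same size
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)

# Note: the split happens after resampling, so copies of the same row
# can end up in both the training set and the test set.
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)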

lr = LogisticRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.7497584541062802
Confusion Matrix:
 [[766 255]
 [263 786]]
Classification Report:
               precision    recall  f1-score   support

          No       0.74      0.75      0.75      1021
         Yes       0.76      0.75      0.75      1049

    accuracy                           0.75      2070
   macro avg       0.75      0.75      0.75      2070
weighted avg       0.75      0.75      0.75      2070
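Because the cell above resamples before splitting, duplicated rows can appear on both sides of the split. A common alternative is to split first and oversample only the training part, so the test set keeps the original class ratio. The sketch below shows that order on synthetic data (an assumption, not the churn dataset), using `sklearn.utils.resample` in place of `RandomOverSampler`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced stand-in for the churn data (not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.73, 0.27], random_state=42,
)
demo = pd.DataFrame(X_demo, columns=["f1", "f2", "f3", "f4"])
demo["target"] = y_demo

# 1) Split first; stratify keeps the original class ratio in both parts.
train, test = train_test_split(
    demo, test_size=0.2, random_state=42, stratify=demo["target"]
)

# 2) Oversample the minority class (label 1 here) inside the training part only.
train_major = train[train["target"] == 0]
train_minor = train[train["target"] == 1]
train_minor_up = resample(
    train_minor, replace=True, n_samples=len(train_major), random_state=42
)
train_balanced = pd.concat([train_major, train_minor_up])

print(train_balanced["target"].value_counts())
print(test["target"].value_counts())
```

The test rows are untouched, so metrics computed on them reflect the class mix the model would see in practice.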



In [302]:
# Split the data by class: 'No' is the majority, 'Yes' the minority
major = churnData[churnData['Churn'] == 'No']
minor = churnData[churnData['Churn'] == 'Yes']

In [303]:
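# Note: with replace=False and n_samples=1869 (the minority size), this only
# shuffles the minority rows, so data_up keeps the original 5174/1869 ratio.
minor_up = resample(minor, replace=False, n_samples=1869, random_state=0)
data_up = pd.concat([major, minor_up])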

In [307]:
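# Draw 1869 majority rows to match the minority size.
# Note: replace=True samples with replacement, so some majority rows may repeat.
major_down = resample(major, replace=True, n_samples=1869, random_state=0)
data_down = pd.concat([major_down, minor])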
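For comparison, here is a sketch of how `resample` is typically parameterised for each direction: upsampling draws the minority class with replacement up to the majority size, and downsampling draws the majority class without replacement down to the minority size. The toy frame and the `toy_*` names are hypothetical.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy frame: 8 majority rows ('No') and 3 minority rows ('Yes').
toy = pd.DataFrame({"tenure": range(11), "Churn": ["No"] * 8 + ["Yes"] * 3})
toy_major = toy[toy["Churn"] == "No"]
toy_minor = toy[toy["Churn"] == "Yes"]

# Upsampling: draw minority rows WITH replacement up to the majority size.
toy_minor_up = resample(
    toy_minor, replace=True, n_samples=len(toy_major), random_state=0
)
toy_up = pd.concat([toy_major, toy_minor_up])

# Downsampling: draw majority rows WITHOUT replacement down to the minority size.
toy_major_down = resample(
    toy_major, replace=False, n_samples=len(toy_minor), random_state=0
)
toy_down = pd.concat([toy_major_down, toy_minor])

print(toy_up["Churn"].value_counts().to_dict())
print(toy_down["Churn"].value_counts().to_dict())
```

Using `len(toy_major)` and `len(toy_minor)` instead of hard-coded sizes keeps the two classes balanced if the data changes.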

8. Each time, fit the model and check its accuracy.

In [305]:
X_up = data_up[['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']]
y_up = data_up['Churn']

X_train_up, X_test_up, y_train_up, y_test_up = train_test_split(X_up, y_up, test_size=0.3, random_state=0)

lr_up = LogisticRegression()
lr_up.fit(X_train_up, y_train_up)

y_pred_up = lr_up.predict(X_test_up)
accuracy_up = accuracy_score(y_test_up, y_pred_up)

print('Accuracy of upsampled data:', round(accuracy_up,5))

Accuracy of upsampled data: 0.78609


In [308]:
X_down = data_down[['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']]
y_down = data_down['Churn']

X_train_down, X_test_down, y_train_down, y_test_down = train_test_split(X_down, y_down, test_size=0.3, random_state=0)

lr_down = LogisticRegression()
lr_down.fit(X_train_down, y_train_down)

y_pred_down = lr_down.predict(X_test_down)
accuracy_down = accuracy_score(y_test_down, y_pred_down)

print('Accuracy of downsampled data:', round(accuracy_down,5))

Accuracy of downsampled data: 0.72638
