
**Problem Statement**

Beta Bank is losing customers a few at a time, every month. The bankers have worked out that retaining existing customers is cheaper than attracting new ones.

We need to predict whether a customer will leave the bank soon. The data covers clients' past behavior and contract terminations with the bank.

Build a model with the highest possible F1 score. To pass the project, the F1 score on the test set must be at least 0.59. Additionally, measure the AUC-ROC metric and compare it with the F1 score.

1. Download and prepare the data. Explain the procedure.
2. Examine the balance of classes. Train the model without taking the imbalance into account. Briefly describe your findings.
3. Improve the quality of the model. Use at least two approaches to fixing class imbalance. Use the training set to pick the best parameters. Train different models on the training and validation sets and find the best one. Briefly describe your findings.
4. Perform the final testing.

**Reading the Data**

In [7]:
# Load the packages used throughout the notebook
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

# Read the data
df = pd.read_csv("https://bit.ly/2XZK7Bo")

df.head(10)  # preview the first 10 records

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0
5,6,15574012,Chu,645,Spain,Male,44,8.0,113755.78,2,1,0,149756.71,1
6,7,15592531,Bartlett,822,France,Male,50,7.0,0.0,2,1,1,10062.8,0
7,8,15656148,Obinna,376,Germany,Female,29,4.0,115046.74,4,1,0,119346.88,1
8,9,15792365,He,501,France,Male,44,4.0,142051.07,2,0,1,74940.5,0
9,10,15592389,H?,684,France,Male,27,2.0,134603.88,1,1,1,71725.73,0


**Examining the Class Ratios**


In [8]:
# Count and percentage share of each class in the target variable
s = df.Exited
counts = s.value_counts()
percent = s.value_counts(normalize=True)
percent100 = s.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'
pd.DataFrame({'counts': counts, 'per': percent, 'per100': percent100})

Unnamed: 0,counts,per,per100
0,7963,0.7963,79.6%
1,2037,0.2037,20.4%


The classes are imbalanced at roughly 4:1: about 79.6% of customers stayed (Exited = 0) and 20.4% left (Exited = 1).
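To see why this ratio matters for the choice of metric, here is a small arithmetic sketch using the counts from the table above. The "always predict 0" baseline is a hypothetical model introduced only for illustration; it is not part of the original notebook.

```python
# Class counts taken from the value_counts() output above.
n_stayed, n_exited = 7963, 2037
n_total = n_stayed + n_exited

# Hypothetical baseline that predicts "stayed" (0) for every customer.
accuracy = n_stayed / n_total

# For the positive class (Exited == 1) this baseline has no true positives,
# no false positives, and misses every customer who left.
true_pos, false_pos, false_neg = 0, 0, n_exited
denominator = 2 * true_pos + false_pos + false_neg
f1_exited = 2 * true_pos / denominator if denominator else 0.0

print(f"majority-class accuracy: {accuracy:.4f}")  # 0.7963
print(f"F1 for the Exited class: {f1_exited:.4f}")  # 0.0000
```

A model can therefore reach about 80% accuracy without identifying a single leaving customer, which is one reason the task is scored on F1 rather than accuracy.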

**Modeling with the Data As Is**

In [9]:
# Helper that reports the number and share of missing values per column

def missing(df):
    missing_number = df.isnull().sum().sort_values(ascending=False)
    missing_percent = df.isnull().mean().sort_values(ascending=False)
    missing_values = pd.concat([missing_number, missing_percent], axis=1, keys=['Missing_Number', 'Missing_Percent'])
    return missing_values

In [10]:
missing(df)

Unnamed: 0,Missing_Number,Missing_Percent
Tenure,909,0.0909
RowNumber,0,0.0
CustomerId,0,0.0
Surname,0,0.0
CreditScore,0,0.0
Geography,0,0.0
Gender,0,0.0
Age,0,0.0
Balance,0,0.0
NumOfProducts,0,0.0


In [11]:
# Fill in the missing Tenure values with the column median
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy="median")
df['Tenure'] = imputer.fit_transform(df['Tenure'].values.reshape(-1,1))[:,0]
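
The cell above computes the median over every row, including rows that later end up in the test set. A possible alternative, sketched here on a toy column, is to split first and learn the median from the training rows only. The `toy` DataFrame and its values are hypothetical stand-ins, not the bank data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy stand-in for the Tenure column (hypothetical values, not the bank data).
toy = pd.DataFrame(
    {"Tenure": [1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0, 8.0, 9.0, 10.0]}
)

# Split before imputing so the test rows do not influence the median.
train, test = train_test_split(toy, test_size=0.3, random_state=42)

# Fit the median on the training rows only, then reuse it for the test rows.
imputer = SimpleImputer(missing_values=np.nan, strategy="median")
train_filled = imputer.fit_transform(train[["Tenure"]])
test_filled = imputer.transform(test[["Tenure"]])

print("median learned from training rows:", imputer.statistics_[0])
```

With only one column and about 9% of values missing, the effect on the scores here is likely small, but the same fit-on-train pattern is already what the notebook uses for `MinMaxScaler`.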

In [12]:
# RowNumber, CustomerId and Surname are identifiers with no predictive value, so we drop them
df.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1, inplace=True)

#Convert categorical variable into dummy/indicator variables
df = pd.get_dummies(df, columns=['Geography', 'Gender'], drop_first=False)


#split the data
X = df.drop(columns=["Exited"])
y = df.Exited

from sklearn.model_selection import KFold, cross_val_predict, train_test_split,GridSearchCV,cross_val_score, cross_validate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


#scaling the data
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

#Modeling: plain logistic regression with no class weighting
from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression(class_weight=None).fit(X_train_scaled, y_train)

#Predicting Using Logistic Regression

y_test_pred_logistic_regression = log_model.predict(X_test_scaled)
y_pred_proba_logistic_regression = log_model.predict_proba(X_test_scaled)

test_data = pd.concat([X_test, y_test], axis=1)
test_data["pred"] = y_test_pred_logistic_regression
test_data["pred_proba"] = y_pred_proba_logistic_regression[:,1]

#Getting the F1 score for each class (average=None returns one value per class: [class 0, class 1])
from sklearn.metrics import f1_score
f1_score(y_test, y_test_pred_logistic_regression, average=None)

array([0.89276808, 0.28970775])
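
The output above gives an F1 of about 0.89 for class 0 and about 0.29 for class 1 (customers who left), well short of the 0.59 target. The problem statement also asks for AUC-ROC and for approaches to class imbalance. The following is a minimal sketch of how the two metrics could be computed side by side, with and without `class_weight="balanced"`. It runs on synthetic data from `make_classification`, so the names `X_demo`, `y_demo`, and `results` are illustrative assumptions and the numbers it prints say nothing about the bank data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly a 4:1 class ratio (a stand-in, not the bank data).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=42
)
# stratify keeps the class ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

results = {}
for weighting in (None, "balanced"):
    model = LogisticRegression(class_weight=weighting, max_iter=1000).fit(X_tr, y_tr)
    results[str(weighting)] = {
        # F1 depends on the 0.5 decision threshold used by predict().
        "f1_positive": f1_score(y_te, model.predict(X_te)),
        # AUC-ROC is computed from the ranking of predicted probabilities.
        "roc_auc": roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]),
    }

for name, scores in results.items():
    print(name, scores)
```

Because F1 is tied to a single threshold while AUC-ROC summarizes all thresholds, the two can move differently when class weights change, which is worth keeping in mind when comparing them.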