# Laboratory work 3

**Goals**: gain practice in building classification / prediction models, using cross-validation, and improving
model performance.

## Import libraries

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

## Dataset

The chosen dataset contains information about customers of an IT company. The goal is to predict customer churn.

Link: [IT Customers Churn (Kaggle.com)](https://www.kaggle.com/datasets/soheiltehranipour/it-customer-churn)


### Attributes

| Attribute | Type | Description |
| --------- | ---- | ----------- |
| gender | Category | Gender of the customer (male or female) |
| SeniorCitizen | Category | Senior citizen or not (1 or 0) |
| Partner | Category | Customer has a partner or not (Yes or No) |
| Dependents | Category | Customer has dependents or not (Yes or No) |
| tenure | Number (int) | Time the customer has been under contract (in months, judging by TotalCharges ≈ tenure × MonthlyCharges in the sample rows) |
| PhoneService | Category | Customer has a phone service or not (Yes or No) |
| MultipleLines | Category | Customer has multiple lines (Yes, No, No phone service) |
| InternetService | Category | Type of customer's Internet service provider (DSL, Fiber optic, No) |
| OnlineSecurity | Category | Customer has online security |
| OnlineBackup | Category | Customer has online backup |
| DeviceProtection | Category | Customer has device protection |
| TechSupport | Category | Customer has tech support |
| StreamingTV | Category | Customer subscribed to Streaming TV |
| StreamingMovies | Category | Customer subscribed to Streaming Movies |
| Contract | Category | Customer contract type (Month-to-month, One year, Two year) |
| PaperlessBilling | Category | Customer billing paperless or not (Yes or No) |
| PaymentMethod | Category | Customer payment method |
| MonthlyCharges | Number (float) | Customer monthly charges |
| TotalCharges | Number (float) | Customer total charges |
| Churn | Category | Customer left within the last month (Yes or No) |

### Load dataset

In [2]:
# Load dataset
df = pd.read_csv("IT_customer_churn.csv")
df.sample(10)

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
5673,Male,0,No,No,4,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,One year,No,Mailed check,19.7,117.8,No
5144,Female,1,Yes,No,3,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,30.75,82.85,No
4480,Female,0,Yes,No,1,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Mailed check,69.4,69.4,Yes
2271,Female,1,No,No,40,Yes,No,Fiber optic,No,No,No,No,Yes,Yes,Month-to-month,Yes,Electronic check,91.55,3673.6,No
822,Male,0,No,No,47,Yes,Yes,Fiber optic,No,Yes,Yes,No,Yes,Yes,One year,Yes,Electronic check,103.1,4889.3,No
1543,Female,0,Yes,No,5,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,No,Electronic check,69.95,330.15,Yes
6082,Female,0,Yes,No,59,Yes,Yes,Fiber optic,No,Yes,No,No,Yes,Yes,Month-to-month,No,Electronic check,101.1,6039.9,Yes
730,Female,0,No,No,45,Yes,Yes,Fiber optic,No,No,Yes,Yes,No,No,Month-to-month,Yes,Electronic check,87.25,3941.7,Yes
6654,Female,0,Yes,No,7,Yes,Yes,Fiber optic,No,No,No,No,Yes,No,Month-to-month,Yes,Electronic check,86.5,582.5,Yes
4343,Female,1,No,No,8,Yes,Yes,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,75.75,606.25,Yes


### Check datatypes and fix them

In [3]:
df.dtypes

gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

**Change SeniorCitizen to object**

In [4]:
df['SeniorCitizen'] = df['SeniorCitizen'].astype(object)

**Change TotalCharges to number**

In [5]:
df.TotalCharges.values

array(['29.85', '1889.5', '108.15', ..., '346.45', '306.6', '6844.5'],
      dtype=object)

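`TotalCharges` is stored as strings, so it has to be converted to a numeric type. Entries that cannot be parsed (blank strings) are located first, and the corresponding rows are dropped.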

In [6]:
pd.to_numeric(df.TotalCharges,errors='coerce').isnull()

0       False
1       False
2       False
3       False
4       False
        ...  
7038    False
7039    False
7040    False
7041    False
7042    False
Name: TotalCharges, Length: 7043, dtype: bool

In [7]:
df1 = df[df.TotalCharges != ' ']
df1.shape

(7032, 20)

In [8]:
df=df1

In [9]:
# ds = pd.to_numeric(df1.TotalCharges)
# df.TotalCharges = ds
# df.TotalCharges.values
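
The conversion that is commented out above could be sketched as follows, on a small made-up frame rather than the real file. `pd.to_numeric(..., errors='coerce')` turns unparseable entries such as `' '` into `NaN`, which can then be dropped. The `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical miniature of the real data: one blank TotalCharges entry.
toy = pd.DataFrame({
    "tenure": [1, 0, 34],
    "TotalCharges": ["29.85", " ", "1889.5"],
})

# Unparseable strings (here the single blank) become NaN instead of raising.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Drop the rows that could not be parsed and keep an independent copy,
# so later column assignments do not act on a slice of the original frame.
toy = toy.dropna(subset=["TotalCharges"]).copy()

print(toy.dtypes["TotalCharges"])  # float64
print(len(toy))                    # 2
```

Taking an explicit `.copy()` after filtering also avoids pandas' chained-assignment warnings when the column is overwritten later.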

In [10]:
df.shape

(7032, 20)

In [11]:
df.dtypes

gender               object
SeniorCitizen        object
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

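`SeniorCitizen` is now categorical (`object`). `TotalCharges` is still listed as `object` because the conversion cell above is commented out, but the rows with blank values have been removed, so the remaining entries can be parsed as numbers.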

## Choice of the target attribute

The target attribute is `Churn`: the goal is to predict whether a customer will churn.
It is a binary attribute (Yes or No).

## Data preprocessing

In [12]:
df1 = df.copy()

In [13]:
yes_no_columns = ['Partner','Dependents','PhoneService','MultipleLines','OnlineSecurity','OnlineBackup',
                  'DeviceProtection','TechSupport','StreamingTV','StreamingMovies','PaperlessBilling','Churn']
# Encode Yes/No as 1/0; "No phone service" / "No internet service" get their own code (2).
# Assigning the result back avoids relying on inplace replacement through a column selection.
for col in yes_no_columns:
    df1[col] = df1[col].replace({'No phone service':2, 'No internet service':2, 'Yes': 1,'No': 0}).astype('int64')

# Encode gender
df1['gender'] = df1['gender'].replace({'Female':1,'Male':0}).astype('int64')

# SeniorCitizen already holds 0/1 values; only the integer dtype has to be restored
df1['SeniorCitizen'] = df1['SeniorCitizen'].astype('int64')

# Encode the remaining categorical columns as integer codes
df1['InternetService'] = df1['InternetService'].replace({'DSL':2,'Fiber optic':1,'No':0}).astype('int64')
df1['Contract'] = df1['Contract'].replace({'Month-to-month':2,'One year':1,'Two year':0}).astype('int64')
df1['PaymentMethod'] = df1['PaymentMethod'].replace({'Electronic check':3, 'Mailed check':2,'Bank transfer (automatic)':1,'Credit card (automatic)':0}).astype('int64')


In [14]:
df1.dtypes

gender                int64
SeniorCitizen         int64
Partner               int64
Dependents            int64
tenure                int64
PhoneService          int64
MultipleLines         int64
InternetService       int64
OnlineSecurity        int64
OnlineBackup          int64
DeviceProtection      int64
TechSupport           int64
StreamingTV           int64
StreamingMovies       int64
Contract              int64
PaperlessBilling      int64
PaymentMethod         int64
MonthlyCharges      float64
TotalCharges         object
Churn                 int64
dtype: object
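
The integer codes used in the preprocessing cell impose an order on `InternetService`, `Contract` and `PaymentMethod` (for example, `Electronic check` = 3 is treated as larger than `Mailed check` = 2), which a payment method does not naturally have. One common alternative is one-hot encoding. The sketch below uses `pd.get_dummies` on a made-up frame; `toy` and its rows are assumptions, not the lab's data.

```python
import pandas as pd

# Hypothetical rows with two nominal columns and one numeric column.
toy = pd.DataFrame({
    "PaymentMethod": ["Electronic check", "Mailed check", "Electronic check"],
    "Contract": ["Month-to-month", "Two year", "One year"],
    "MonthlyCharges": [69.4, 19.7, 91.55],
})

# Each category becomes its own 0/1 column, so no artificial ordering is introduced.
encoded = pd.get_dummies(toy, columns=["PaymentMethod", "Contract"], dtype=int)

print(encoded.columns.tolist())
```

The price is a wider feature matrix: one column per category value instead of one column per attribute.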

## Normalize data

In [15]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [16]:
X = df1.drop('Churn',axis='columns')
y = testLabels = df1.Churn
X = scaler.fit_transform(X)
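
Calling `scaler.fit_transform(X)` on the full dataset before cross-validation means the scaling statistics of every fold are computed with its validation rows included. A possible way to avoid this is to put the scaler and the classifier in a pipeline, so that the scaler is re-fitted on each training fold. The sketch below uses synthetic data from `make_classification` as a stand-in for `X` and `y`; the layer size, fold count and iteration count are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn features and labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# The scaler is part of the estimator, so cross_val_score fits it on the
# training portion of each fold only and applies it to the validation portion.
pipe = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(19,), max_iter=200, random_state=0),
)

scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print("%0.3f accuracy with a standard deviation of %0.3f" % (scores.mean(), scores.std()))
```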



## Build and train the model

In [17]:
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

In [18]:
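def mlp(X, y, lr, activation, max_iter, hidden_layer_sizes=(100,)):
    """Run 10-fold cross-validation of an MLPClassifier and print the accuracy scores."""
    # hidden_layer_sizes has to be forwarded explicitly; otherwise MLPClassifier
    # uses its default of (100,) regardless of the argument passed to mlp().
    # The outputs recorded in this notebook were produced without this forwarding.
    model = MLPClassifier(hidden_layer_sizes=hidden_layer_sizes, learning_rate_init=lr,
                          activation=activation, max_iter=max_iter, random_state=0)

    scores = cross_val_score(model, X, y, cv=10)
    print("Scores", scores, "\n")

    print("%0.3f accuracy with a standard deviation of %0.3f" % (scores.mean(), scores.std()))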
    


In [19]:
lr=0.95
max_iter=50
hidden_layer_sizes=(19,)
mlp(X, y, lr, 'relu', max_iter=max_iter, hidden_layer_sizes=hidden_layer_sizes)

Scores [0.73153409 0.734375   0.79658606 0.77382646 0.68847795 0.65433855
 0.68136558 0.72688478 0.73684211 0.6742532 ] 

0.720 accuracy with a standard deviation of 0.043


## Change learning rate

In [20]:
max_iter=100

In [21]:
lr=0.0002
mlp(X, y, lr, 'relu', max_iter=max_iter, hidden_layer_sizes=hidden_layer_sizes)



Scores [0.8125     0.80539773 0.80227596 0.82219061 0.78662873 0.79231863
 0.81650071 0.80654339 0.79800853 0.81081081] 

0.805 accuracy with a standard deviation of 0.010




## Activation function

In [22]:
mlp(X, y, lr, 'logistic', max_iter=max_iter, hidden_layer_sizes=hidden_layer_sizes)

Scores [0.79971591 0.8125     0.80512091 0.82219061 0.78093883 0.78662873
 0.80227596 0.80369844 0.79943101 0.80512091] 

0.802 accuracy with a standard deviation of 0.011


In [23]:
mlp(X, y, lr, 'tanh', max_iter=max_iter, hidden_layer_sizes=hidden_layer_sizes)



Scores [0.80255682 0.80397727 0.80654339 0.82503556 0.78236131 0.78520626
 0.81650071 0.80938834 0.79943101 0.80227596] 

0.803 accuracy with a standard deviation of 0.012
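
The manual comparison of learning rates and activation functions can also be automated. Below is a minimal sketch with `GridSearchCV`, again on synthetic stand-in data. The grid values mirror the ones tried in this lab, while the dataset, `cv=3` and the hidden layer size are assumptions chosen to keep the run short.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the scaled churn features and labels.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Every combination of these values is cross-validated (2 x 3 = 6 candidates).
param_grid = {
    "learning_rate_init": [0.95, 0.0002],
    "activation": ["relu", "logistic", "tanh"],
}
search = GridSearchCV(
    MLPClassifier(hidden_layer_sizes=(19,), max_iter=100, random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)

# Short runs often stop before convergence; silence that warning for readability.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

`search.cv_results_` holds the mean and standard deviation of the score for each combination, which corresponds to the per-run summaries printed by `mlp`.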




## Change ANN structure

In [37]:
mlp(X, y, 0.00005, 'logistic', max_iter=5*max_iter, hidden_layer_sizes=(19,19))

Scores [0.80113636 0.82244318 0.79089616 0.81365576 0.79943101 0.79374111
 0.82219061 0.79089616 0.78805121 0.80654339] 

0.803 accuracy with a standard deviation of 0.012


## Rearrange dataset

In [25]:
df2=df1.copy()
df2.sample()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
93,1,0,0,0,65,1,1,1,1,1,1,0,1,1,2,1,0,111.05,7107,0


In [26]:
X = df2.drop('Churn',axis='columns')
y = testLabels = df2.Churn


### Oversampling

In [27]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy='minority')
X_sm, y_sm = smote.fit_resample(X, y)

In [38]:
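# Note: this call evaluates the original X, y from In [26] (not scaled, not resampled);
# the SMOTE output X_sm, y_sm is not used here.
mlp(X, y, lr, 'logistic', max_iter=2*max_iter, hidden_layer_sizes=(19,))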

Scores [0.80539773 0.8125     0.79516358 0.81081081 0.79516358 0.77951636
 0.81934566 0.80085349 0.78947368 0.80938834] 

0.802 accuracy with a standard deviation of 0.011
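
If oversampled data is evaluated with cross-validation, resampling before the split would place copies (or interpolations) of training rows into the validation folds. One way to avoid this is to oversample only the training part of each fold. The sketch below swaps SMOTE for plain random oversampling via `sklearn.utils.resample`, so it is not the SMOTE algorithm itself. The synthetic data, the helper name `oversample_minority` and the model settings are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample


def oversample_minority(X, y, random_state=0):
    """Randomly duplicate minority-class rows until both classes have equal size (binary y)."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    X_min, X_maj = X[y == minority], X[y != minority]
    X_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=random_state)
    X_bal = np.vstack([X_maj, X_up])
    y_bal = np.concatenate([y[y != minority], np.full(len(X_up), minority)])
    return X_bal, y_bal


# Synthetic, imbalanced stand-in for the churn data (about 25% positives).
X_demo, y_demo = make_classification(n_samples=400, n_features=10,
                                     weights=[0.75, 0.25], random_state=0)

scores = []
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in folds.split(X_demo, y_demo):
    # Oversample and fit the scaler on the training fold only.
    X_tr, y_tr = oversample_minority(X_demo[train_idx], y_demo[train_idx])
    scaler = StandardScaler().fit(X_tr)
    model = MLPClassifier(hidden_layer_sizes=(19,), max_iter=300, random_state=0)
    model.fit(scaler.transform(X_tr), y_tr)
    # The validation fold keeps its original class distribution.
    scores.append(model.score(scaler.transform(X_demo[val_idx]), y_demo[val_idx]))

print("%0.3f accuracy with a standard deviation of %0.3f" % (np.mean(scores), np.std(scores)))
```

With an imbalanced target, accuracy alone can hide poor recall on the minority class, so a metric such as recall or F1 on the churn class may be a useful complement.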
