### Machine Learning End-to-End with Logistic Regression

Today, I will be performing an end-to-end Logistic Regression analysis on the SyriaTel Customer Churn dataset from Kaggle.

First, we import the modules we'll use for this exercise, then load our dataset.

In [41]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [42]:
#open up the file
telecom_df = pd.read_csv('data/bigml_59c28831336c6604c800002a.csv')

telecom_df

Unnamed: 0,state,account length,area code,phone number,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,...,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.70,1,False
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.70,1,False
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.30,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.90,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3328,AZ,192,415,414-4276,no,yes,36,156.2,77,26.55,...,126,18.32,279.1,83,12.56,9.9,6,2.67,2,False
3329,WV,68,415,370-3271,no,no,0,231.1,57,39.29,...,55,13.04,191.3,123,8.61,9.6,4,2.59,3,False
3330,RI,28,510,328-8230,no,no,0,180.8,109,30.74,...,58,24.55,191.9,91,8.64,14.1,6,3.81,2,False
3331,CT,184,510,364-6381,yes,no,0,213.8,105,36.35,...,84,13.57,139.2,137,6.26,5.0,10,1.35,2,False


In [43]:
for column in telecom_df.columns:
    print('---- %s ---' % column)
    print(telecom_df[column].value_counts())

---- state ---
WV    106
MN     84
NY     83
AL     80
OH     78
OR     78
WI     78
VA     77
WY     77
CT     74
MI     73
VT     73
ID     73
TX     72
UT     72
IN     71
KS     70
MD     70
MT     68
NJ     68
NC     68
CO     66
WA     66
NV     66
MS     65
RI     65
MA     65
AZ     64
MO     63
FL     63
NM     62
ND     62
ME     62
OK     61
NE     61
DE     61
SD     60
SC     60
KY     59
IL     58
NH     56
AR     55
DC     54
GA     54
HI     53
TN     53
AK     52
LA     51
PA     45
IA     44
CA     34
Name: state, dtype: int64
---- account length ---
105    43
87     42
93     40
101    40
90     39
       ..
191     1
199     1
215     1
221     1
2       1
Name: account length, Length: 212, dtype: int64
---- area code ---
415    1655
510     840
408     838
Name: area code, dtype: int64
---- phone number ---
354-5764    1
332-9896    1
409-2917    1
417-9455    1
400-8375    1
           ..
337-9303    1
348-5567    1
377-1218    1
409-1244    1
402-5076    1
Name: 

Next, we check how the values of the churn column are distributed. Our goal is to predict the likelihood of a customer "churning", that is, leaving the company.

In [44]:
print(telecom_df.churn.value_counts())
print()
print(telecom_df.churn.value_counts(normalize=True))

False    2850
True      483
Name: churn, dtype: int64

False    0.855086
True     0.144914
Name: churn, dtype: float64


Here, around 14% of customers end up churning, so the two classes are imbalanced.

Next, we'll designate our X and y variables and then perform our train-test split. We'll use a 75-25 split with random_state=42.

In [45]:
# Here, we'll designate what our X and y are

X = telecom_df.drop('churn', axis=1)
y = telecom_df['churn']

# Then, we'll do a train-test split to separate the data into a training set and a testing set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
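
Because only about 14% of customers churn, a purely random split can leave the train and test sets with slightly different churn rates. One option, not used in this notebook, is the stratify argument of train_test_split. The sketch below uses made-up toy data (X_toy and y_toy are placeholders, not the SyriaTel data) just to show the idea.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 14.5% positives, roughly mimicking the churn rate above
rng = np.random.default_rng(42)
y_toy = np.array([True] * 145 + [False] * 855)
X_toy = rng.normal(size=(1000, 3))  # placeholder features, not the real data

# stratify=y_toy keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train churn rate: {train_rate:.3f}, test churn rate: {test_rate:.3f}")
```

With stratification, both rates should land very close to the overall rate. Adopting this in the notebook would change the split, and therefore the scores reported further down.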

### Preprocessing

Now that we have our train-test split, let's start preprocessing.

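For this example, we'll use a one-hot encoder to convert categorical variables to numeric and a standard scaler to put all of our features on the same scale. Note that the encoder below is fit on every column of X_train, so the numeric columns (and the phone number) are one-hot encoded as categories too.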

Let's instantiate our one hot encoder first.

In [46]:
from sklearn.preprocessing import OneHotEncoder

# Create the encoder
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(X_train)

# Apply the encoder
X_train = ohe.transform(X_train)
X_test = ohe.transform(X_test)

Next, we'll import StandardScaler, fit it on X_train, and apply it to both X_train and X_test. Because the one-hot encoder returns a sparse matrix, we pass with_mean=False.

In [47]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler(with_mean=False)
X_train_scaled = ss.fit_transform(X_train)
X_test_scaled = ss.transform(X_test)
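
The encoder above is fit on every column, including the numeric ones. A common alternative is to one-hot encode only the categorical columns and scale only the numeric ones with a ColumnTransformer. The sketch below is an illustration on a tiny hypothetical frame (toy_train, toy_test, and the column lists are assumptions chosen for the example), not the preprocessing that produced the results in this post.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny hypothetical frame with the same kinds of columns as the churn data
toy_train = pd.DataFrame({
    "state": ["KS", "OH", "NJ", "OH"],
    "international plan": ["no", "no", "yes", "yes"],
    "total day minutes": [265.1, 161.6, 243.4, 299.4],
    "customer service calls": [1, 1, 0, 2],
})
toy_test = pd.DataFrame({
    "state": ["WV"],  # a category the encoder never saw during fit
    "international plan": ["no"],
    "total day minutes": [200.0],
    "customer service calls": [3],
})

categorical_cols = ["state", "international plan"]
numeric_cols = ["total day minutes", "customer service calls"]

# Each transformer only sees the columns listed for it
preprocessor = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("numeric", StandardScaler(), numeric_cols),
])

train_prepared = preprocessor.fit_transform(toy_train)  # fit on train only
test_prepared = preprocessor.transform(toy_test)        # reuse the fitted transformers
print(train_prepared.shape, test_prepared.shape)
```

Here the three states and two plan values become five indicator columns, and the two numeric columns are scaled, giving seven features in total. The unseen state in toy_test is encoded as all zeros because of handle_unknown="ignore".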

And that completes our preprocessing. Now let's get into the modelling itself.

### Modelling

Let's create a baseline model.

In [51]:
# import the needed modules
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# instantiate our logistic regression instance
baseline_model = LogisticRegression(random_state=42)

# perform our cross validation

baseline_model_neg_log_loss_cv = cross_val_score(baseline_model, X_train_scaled, y_train, scoring='neg_log_loss')
baseline_model_neg_log_loss_cv

baseline_log_loss = -(baseline_model_neg_log_loss_cv.mean())
baseline_log_loss

0.8387899236355116

Here, our log loss is ~0.84. This is the baseline we'll compare the other models against to see how much we've improved (or not!).
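
To put a log loss value in context, it can help to compare it with a naive reference: a "model" that predicts the overall churn rate for every customer. The sketch below computes that reference from the class counts printed earlier. It uses the full dataset rather than the cross-validation folds, so treat it as a rough yardstick only; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import log_loss

# Class counts taken from the value_counts output above
n_stay, n_churn = 2850, 483
churn_rate = n_churn / (n_stay + n_churn)

# Labels in the same proportions; every prediction is just the churn rate
y_true = np.array([0] * n_stay + [1] * n_churn)
constant_probs = np.full(len(y_true), churn_rate)

prior_log_loss = log_loss(y_true, constant_probs)

# Same quantity from the definition: -[p*ln(p) + (1-p)*ln(1-p)]
entropy = -(churn_rate * np.log(churn_rate)
            + (1 - churn_rate) * np.log(1 - churn_rate))
print(prior_log_loss, entropy)
```

Both numbers come out a little above 0.41, which suggests that a log loss of ~0.84 is worse than simply predicting the base rate for everyone.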

#### Modified Logistic Model

Here, we'll use another Logistic Regression model, but with different parameters. Let's try modifying the solver, penalty, and class weights. 

In [57]:
modified_model = LogisticRegression(solver='saga', penalty='elasticnet', class_weight='balanced', l1_ratio=0.4, max_iter=20000)


modified_model_neg_log_loss_cv = cross_val_score(modified_model, X_train_scaled, y_train, scoring='neg_log_loss')

modified_model_log_loss = -(modified_model_neg_log_loss_cv.mean())

modified_model_log_loss

0.6156952083584286
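
In the cells above, the scaler was fit on all of X_train before cross-validation, so each validation fold has already influenced the scaling. One way to avoid this is to put the scaler and the model in a Pipeline, which refits the scaler inside every fold. The sketch below uses synthetic data from make_classification (X_demo and y_demo are stand-ins, not the churn data) and a smaller max_iter than the notebook, purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the churn data (about 15% positives)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.85, 0.15], random_state=42
)

# The scaler is refit on the training part of every fold
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(
        solver="saga", penalty="elasticnet", l1_ratio=0.4,
        class_weight="balanced", max_iter=5000, random_state=42,
    ),
)

neg_log_loss_scores = cross_val_score(
    pipeline, X_demo, y_demo, scoring="neg_log_loss", cv=5
)
pipeline_log_loss = -neg_log_loss_scores.mean()
print(pipeline_log_loss)
```

The value printed here says nothing about the churn data. The point is only the structure: cross_val_score receives the whole pipeline, not pre-scaled features.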

Now, let's compare our baseline to our modified model.

In [58]:
print("Previous Model")
print("Baseline average:", baseline_log_loss)
print("Current Model")
print("Modified Model average:", modified_model_log_loss)

Previous Model
Baseline average: 0.8387899236355116
Current Model
Modified Model average: 0.6156952083584286


As we're dealing with the log loss function, we are looking for the lowest amount of log loss (lowest error), so we know that we have made some improvement from our baseline model to our modified model.

So, let's try this out on our test data now.

In [61]:
from sklearn.metrics import log_loss

modified_model.fit(X_train_scaled, y_train)
log_loss(y_test, modified_model.predict_proba(X_test_scaled))

0.6585999836806087

Our model performed slightly worse on the test data (log loss ~0.66) than it did in cross-validation on the training data (~0.62). A small gap like this is expected, since models typically do a little worse on data they have not seen.

Now, let's go and see our evaluation metrics.

In [76]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Predict once on the test set and reuse the predictions for every metric
y_pred = modified_model.predict(X_test_scaled)

print(f" Our accuracy score is {accuracy_score(y_test, y_pred)}")
print(f" Our precision score is {precision_score(y_test, y_pred)}")
print(f" Our recall score is {recall_score(y_test, y_pred)}")

 Our accuracy score is 0.8309352517985612
 Our precision score is 0.2777777777777778
 Our recall score is 0.08


And there we have it! We now have our evaluation metrics. Our model's accuracy is approximately 83%, meaning it labels about 83% of the test customers correctly. That is less impressive than it sounds: about 85.5% of customers in the full dataset do not churn, so always predicting "no churn" would score about as well. Our precision and recall are abysmal (~28% and ~8%, respectively): only about 28% of the customers we flag as churners actually churn, and we catch only about 8% of the customers who do. Still, the goal of this exercise was to show how the process unfolds. Next time, we can try different models and/or parameters to improve our scores.
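
A confusion matrix makes the gap between accuracy and recall easier to see. The notebook does not print one, so the sketch below uses hypothetical counts (tp, fp, fn, tn) chosen only because they reproduce the three scores printed above. They are an assumption for illustration, not the model's actual confusion matrix.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, confusion_matrix, precision_score, recall_score
)

# Hypothetical counts (not shown in the notebook), picked to match the printed scores
tp, fp, fn, tn = 10, 26, 115, 683

# Build label arrays that contain exactly those outcomes
y_true = np.array([1] * tp + [0] * fp + [1] * fn + [0] * tn)
y_pred = np.array([1] * tp + [1] * fp + [0] * fn + [0] * tn)

print(confusion_matrix(y_true, y_pred))  # rows: actual, columns: predicted

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
print("accuracy :", acc)
print("precision:", prec)
print("recall   :", rec)

# Accuracy of always predicting the majority class ("no churn")
majority_accuracy = accuracy_score(y_true, np.zeros_like(y_true))
print("majority-class accuracy:", majority_accuracy)
```

Under these assumed counts, always predicting "no churn" would be right about 85% of the time, which is why accuracy alone is a weak yardstick on imbalanced data like this.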