
# Churn prediction with classification

In this project we predict churn: whether a customer will switch away from our company.

If we can identify customers who are likely to leave, we can send them promotional emails to persuade them to stay.

We will treat this as a binary classification problem.

The ML strategy applied to approach this problem is binary classification, which for one instance ($i^{th}$ customer), can be expressed as:

$$\large g\left(x_{i}\right) = y_{i}$$

In the formula, $y_i$ is the target for the $i^{th}$ customer and belongs to $\{0, 1\}$: 0 is the negative class (no churn) and 1 is the positive class (churn). The model's output $g(x_i)$ is a score between 0 and 1 that we read as the likelihood of churning.

In brief, the main idea behind this project is to build a model from historical customer data and use it to assign each customer a score for the likelihood of churning.

The closer the score is to 1, the more likely the customer is to churn; the closer it is to 0, the less likely.

The inputs $x_i$ are the customer attributes in the dataset, such as contract type, tenure, and monthly charges.

## 3.1 Data preparation

In [71]:
# let's import the modules we need
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mutual_info_score
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

import matplotlib.pyplot as plt

In [72]:
# now we load the data and take a first look

df = pd.read_csv('data/churn.csv')

df.head().T # transpose so each column becomes a row, which is easier to read for a wide table

,0,1,2,3,4
customerID,7590-VHVEG,5575-GNVDE,3668-QPYBK,7795-CFOCW,9237-HQITU
gender,Female,Male,Male,Male,Female
SeniorCitizen,0,0,0,0,0
Partner,Yes,No,No,No,No
Dependents,No,No,No,No,No
tenure,1,34,2,45,2
PhoneService,No,Yes,Yes,No,Yes
MultipleLines,No phone service,No,No,No phone service,No
InternetService,DSL,DSL,DSL,DSL,Fiber optic
OnlineSecurity,No,Yes,Yes,Yes,No


Now let's do our standard data preparation.

In [73]:
# lowercase all column names and replace spaces with underscores; do the same for the values of every string (object) column

df.columns = df.columns.str.lower().str.replace(' ', '_')

categorial_columns = list(df.dtypes[df.dtypes == 'object'].index)

for c in categorial_columns:
    df[c] = df[c].str.lower().str.replace(' ', '_')
    
df.head().T

,0,1,2,3,4
customerid,7590-vhveg,5575-gnvde,3668-qpybk,7795-cfocw,9237-hqitu
gender,female,male,male,male,female
seniorcitizen,0,0,0,0,0
partner,yes,no,no,no,no
dependents,no,no,no,no,no
tenure,1,34,2,45,2
phoneservice,no,yes,yes,no,yes
multiplelines,no_phone_service,no,no,no_phone_service,no
internetservice,dsl,dsl,dsl,dsl,fiber_optic
onlinesecurity,no,yes,yes,yes,no


In [74]:
# now we need to fix the types; running df.dtypes shows that some columns have unexpected types
# the columns to fix are totalcharges and churn

df.totalcharges = pd.to_numeric(df.totalcharges, errors = 'coerce')
# errors='coerce' turns values that cannot be parsed into NaN instead of raising an error
df.totalcharges = df.totalcharges.fillna(0)
# replace the resulting NaN values with 0


# churn needs to be binary (1/0) instead of 'yes'/'no'
df.churn = (df.churn == 'yes').astype(int)
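
To make the `errors='coerce'` step concrete, here is a minimal pure-Python sketch of the same idea applied to single values. The helper name `coerce_to_float` and the sample values are made up for illustration and are not part of pandas, which applies the equivalent logic to the whole column at once.

```python
import math

def coerce_to_float(value):
    """Parse value as a float; return NaN when it cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan

# '_' stands in for a blank cell after spaces were replaced with underscores
# (an assumed example value, not taken from the dataset)
raw_values = ["29.85", "1889.5", "_"]
parsed = [coerce_to_float(v) for v in raw_values]

# mirror fillna(0): replace every NaN with 0
cleaned = [0.0 if math.isnan(v) else v for v in parsed]
print(cleaned)
```

Filling with 0 is a modelling choice: it keeps the rows, at the cost of treating an unknown total as a zero total.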

## 3.2 Set up validation framework

- Previously we did the 60/20/20 split by hand with NumPy and pandas
- This time we will use scikit-learn

In [75]:
from sklearn.model_selection import train_test_split


# first split into full_train (80%) and test (20%), then split full_train into train and validation
df_full_train, df_test = train_test_split(df, test_size= 0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size= 0.25, random_state=1) # 0.25 of the remaining 80% is 20% of the full dataset


In [76]:
# check if the size is right
len(df_train), len(df_val), len(df_test)

(4225, 1409, 1409)
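
The reason the second split uses 0.25 rather than 0.2 can be checked with plain arithmetic. The sketch below is a hypothetical helper, not scikit-learn code; it assumes the held-out part of each split is rounded up, and it uses 7043 rows because that is the sum of the three sizes printed above.

```python
import math

def split_sizes(n_rows, test_frac=0.2, val_frac=0.2):
    """Sizes of a two-step train/val/test split; val_frac is relative to the full data."""
    # step 1: hold out the test set (assumption: the held-out part is rounded up)
    n_test = math.ceil(n_rows * test_frac)
    n_full_train = n_rows - n_test

    # step 2: rescale the validation share to the rows that remain
    val_frac_of_rest = val_frac / (1 - test_frac)  # 0.2 / 0.8 = 0.25
    n_val = math.ceil(n_full_train * val_frac_of_rest)
    n_train = n_full_train - n_val
    return n_train, n_val, n_test

print(split_sizes(7043))  # (4225, 1409, 1409)
```

With these assumptions the helper reproduces the sizes shown in the cell output.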

In [77]:
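# reset the index; note that reset_index returns a new DataFrame, so without assignment
# the originals keep their shuffled index and only the last result is displayed below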
df_train.reset_index(drop=True)
df_val.reset_index(drop=True)
df_test.reset_index(drop=True)

,customerid,gender,seniorcitizen,partner,dependents,tenure,phoneservice,multiplelines,internetservice,onlinesecurity,...,deviceprotection,techsupport,streamingtv,streamingmovies,contract,paperlessbilling,paymentmethod,monthlycharges,totalcharges,churn
0,8879-zkjof,female,0,no,no,41,yes,no,dsl,yes,...,yes,yes,yes,yes,one_year,yes,bank_transfer_(automatic),79.85,3320.75,0
1,0201-mibol,female,1,no,no,66,yes,yes,fiber_optic,yes,...,no,no,yes,yes,two_year,yes,bank_transfer_(automatic),102.40,6471.85,0
2,1600-dilpe,female,0,no,no,12,yes,no,dsl,no,...,no,no,no,no,month-to-month,yes,bank_transfer_(automatic),45.00,524.35,0
3,8601-qacrs,female,0,no,no,5,yes,yes,dsl,no,...,no,no,no,no,month-to-month,yes,mailed_check,50.60,249.95,1
4,7919-zodzz,female,0,yes,yes,10,yes,no,dsl,no,...,yes,no,no,yes,one_year,yes,mailed_check,65.90,660.05,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1404,5130-iekqt,male,1,no,no,25,yes,yes,fiber_optic,no,...,yes,no,yes,yes,month-to-month,no,mailed_check,105.95,2655.25,1
1405,4452-rohmo,female,0,no,no,15,yes,no,no,no_internet_service,...,no_internet_service,no_internet_service,no_internet_service,no_internet_service,two_year,no,mailed_check,19.60,331.60,0
1406,6164-haqtx,male,0,no,no,71,no,no_phone_service,dsl,yes,...,yes,yes,yes,no,two_year,no,bank_transfer_(automatic),53.95,3888.65,0
1407,3982-dqlus,male,1,yes,yes,65,yes,yes,fiber_optic,yes,...,no,no,no,no,month-to-month,yes,electronic_check,85.75,5688.45,0


In [78]:
# get the y values

y_train = df_train.churn.values
y_val = df_val.churn.values
y_test = df_test.churn.values

# delete the churn column from each DataFrame so the target cannot leak into the features

del df_train['churn']
del df_val['churn']
del df_test['churn']

## 3.3 EDA

- Look for missing values
- Look at target variable
- Look at numerical and categorical variables

In [79]:
# let's see the current churn rate
# normalize=True returns proportions instead of counts
df_full_train.churn.value_counts(normalize=True)

churn
0    0.730032
1    0.269968
Name: proportion, dtype: float64

In [80]:
# another way to get the churn rate is to compute the mean of the 0/1 target

global_churn_rate = df_full_train.churn.mean()
round(global_churn_rate, 2)

np.float64(0.27)

In [81]:
# now let's pick out the numerical variables

# df.dtypes shows three numerical columns that we care about:
# totalcharges, tenure, and monthlycharges

numerical = ['tenure', 'monthlycharges', 'totalcharges']

df_full_train.columns
categorial = ['gender', 'seniorcitizen', 'partner', 'dependents', 'phoneservice', 'multiplelines', 'internetservice', 'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport','streamingtv', 'streamingmovies', 'contract', 'paperlessbilling','paymentmethod']

df_full_train[categorial].nunique()
# number of unique values per categorical column; each value becomes its own feature after one-hot encoding

gender              2
seniorcitizen       2
partner             2
dependents          2
phoneservice        2
multiplelines       3
internetservice     3
onlinesecurity      3
onlinebackup        3
deviceprotection    3
techsupport         3
streamingtv         3
streamingmovies     3
contract            3
paperlessbilling    2
paymentmethod       4
dtype: int64

## 3.4 Feature importance
This is the part of EDA where we identify which features matter and which do not.
3 techniques:
- churn rate + risk ratio (for categorical) "does it matter?"
- mutual information (for categorical) "how much does it matter?"
- correlation (for numerical) "how much does it matter?"

In [82]:
# churn rate
# we need to see which of these features actually affect our result

churn_female = df_full_train[df_full_train.gender == 'female'].churn.mean()
churn_male = df_full_train[df_full_train.gender == 'male'].churn.mean()

print(f"churn_female: {churn_female}, churn_male: {churn_male}")


churn_female: 0.27682403433476394, churn_male: 0.2632135306553911


In [83]:
churn_partner = df_full_train[df_full_train["partner"] == "yes"].churn.mean()
churn_no_partner = df_full_train[df_full_train["partner"] == "no"].churn.mean()
print(f"churn_partner: {churn_partner}, churn_no_partner: {churn_no_partner}")

churn_partner: 0.20503330866025166, churn_no_partner: 0.3298090040927694


In [84]:
# now we take the difference from the global rate to see whether this feature affects churn

global_churn_rate - churn_no_partner

np.float64(-0.05984095297455855)

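Two things matter:
1. difference (global rate minus group rate, as in the cell above): < 0 means the group is more likely to churn, > 0 means less likely. The grouped tables below compute group minus global, so the sign flips there
2. risk ratio (group rate divided by global rate): > 1 means more likely to churn, < 1 means less likely


Together they give us intuition for how strongly a feature affects churn.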

In [85]:
# now we get the risk ratio
churn_no_partner / global_churn_rate, churn_partner/global_churn_rate



(np.float64(1.2216593879412643), np.float64(0.7594724924338315))
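
As a worked example, the two signals can be recomputed by hand from the rounded rates printed above. This is only a sketch: `churn_signals` is a made-up helper name, and the difference is taken as group minus global (the same convention as the `diff` column in the grouped tables), so a positive value means the group churns more than average.

```python
# rates copied (rounded) from the outputs above
global_churn_rate = 0.269968
group_rates = {"partner=yes": 0.205033, "partner=no": 0.329809}

def churn_signals(group_rate, global_rate):
    """Return (difference, risk ratio) of a group relative to the global rate."""
    diff = group_rate - global_rate  # > 0: the group churns more than average
    risk = group_rate / global_rate  # > 1: the group churns more than average
    return diff, risk

signals = {
    name: churn_signals(rate, global_churn_rate)
    for name, rate in group_rates.items()
}

for name, (diff, risk) in signals.items():
    print(f"{name}: diff={diff:+.3f}, risk={risk:.2f}")
```

Customers without a partner come out about 22% more likely to churn than average, and customers with a partner about 24% less likely.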

In [86]:
# now we can do the equivalent of a SQL GROUP BY to compute the churn rate per gender
# then we add two columns, diff and risk ratio

df_group = df_full_train.groupby('gender').churn.agg(['mean', 'count'])
df_group['diff'] = df_group['mean'] - global_churn_rate
df_group['risk'] = df_group['mean'] / global_churn_rate
df_group

gender,mean,count,diff,risk
female,0.276824,2796,0.006856,1.025396
male,0.263214,2838,-0.006755,0.97498


In [87]:
# let's put it in a loop to see which features are more strongly related to churn than others

for c in categorial:
    df_group = df_full_train.groupby(c).churn.agg(['mean', 'count'])
    df_group['diff'] = df_group['mean'] - global_churn_rate
    df_group['risk'] = df_group['mean'] / global_churn_rate
    # display(df_group) 

### Mutual information

Next we use mutual information to measure the importance of the categorical variables. It tells us how much we learn about one variable when we know the value of another.

In [88]:
# no need to implement it ourselves, scikit-learn provides it
from sklearn.metrics import mutual_info_score

In [89]:
mutual_info_score(df_full_train.contract, df_full_train.churn)

0.0983203874041556

In [90]:
mutual_info_score(df_full_train.gender, df_full_train.churn)

0.0001174846211139946

The score for contract is far larger than the score for gender, so contract tells us much more about churn than gender does.
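
For intuition about what the score measures, here is a from-scratch sketch of mutual information for two discrete variables. It is not the scikit-learn implementation; it assumes natural logarithms (so the result is in nats), and the toy lists are made up.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    count_x = Counter(xs)
    count_y = Counter(ys)

    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        # each pair contributes p(x, y) * log(p(x, y) / (p(x) * p(y)))
        mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi

# knowing x tells us y exactly: the score is log(2)
informative = mutual_information(["a", "a", "b", "b"], [0, 0, 1, 1])
# x and y are independent: the score is 0
uninformative = mutual_information(["a", "a", "b", "b"], [0, 1, 0, 1])
print(informative, uninformative)
```

The score is 0 when the feature says nothing about the target and grows as the feature becomes more informative, which is why it is useful for ranking features against each other.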

In [91]:
# let's apply this to all categorical columns
def mutual_info_churn_score(series):
    return mutual_info_score(series, df_full_train.churn)

mi = df_full_train[categorial].apply(mutual_info_churn_score)
mi.sort_values(ascending = False)

# we can see contract is the most important and gender is the least

contract            0.098320
onlinesecurity      0.063085
techsupport         0.061032
internetservice     0.055868
onlinebackup        0.046923
deviceprotection    0.043453
paymentmethod       0.043210
streamingtv         0.031853
streamingmovies     0.031581
paperlessbilling    0.017589
dependents          0.012346
partner             0.009968
seniorcitizen       0.009410
multiplelines       0.000857
phoneservice        0.000229
gender              0.000117
dtype: float64

Now let's look at correlation, which measures the importance of numerical variables.

The Pearson correlation coefficient ranges from -1 to 1 and measures the strength of the linear relationship between two variables:

- when the correlation is positive, the two variables increase together
- when it is negative, one decreases as the other increases
- as a rule of thumb for the absolute value: 0.0-0.2 is low, 0.2-0.5 is moderate, 0.5-1.0 is strong

In [92]:
df_full_train[numerical].corrwith(df_full_train.churn)

# this gets us the correlation coefficient for numerical features. Kinda cool right?

tenure           -0.351885
monthlycharges    0.196805
totalcharges     -0.196353
dtype: float64

We can see an inverse relationship between tenure and churn: the longer a customer has been with us, the less likely they are to churn. At 0.35 in absolute value the correlation is only moderate, but it is the strongest of the three.
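
The coefficient itself is short enough to compute by hand. The sketch below implements the Pearson formula in plain Python; the `tenure` and `churn` lists are made-up toy data, not rows from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # covariance and variances, each without the common 1/n factor (it cancels)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# toy data (made up): customers with short tenure churn, long-tenure ones stay
tenure = [1, 2, 5, 12, 34, 45, 66, 71]
churn = [1, 1, 1, 0, 0, 0, 0, 0]

r = pearson(tenure, churn)
print(round(r, 3))
```

A negative result means churn becomes less frequent as tenure grows, which is the same reading as the sign of the tenure row above.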

## 3.5 One-hot encoding

We are going to use scikit-learn to encode the categorical features, as we did in the previous lesson.

In [93]:
from sklearn.feature_extraction import DictVectorizer

# DictVectorizer only one-hot encodes string values; numerical values are passed through unchanged

In [94]:
# convert the rows into a list of dicts, one dict per row
dicts = df_train[['gender', 'contract']].iloc[:100].to_dict(orient = 'records')
dicts

[{'gender': 'female', 'contract': 'two_year'},
 {'gender': 'male', 'contract': 'month-to-month'},
 {'gender': 'female', 'contract': 'month-to-month'},
 {'gender': 'female', 'contract': 'month-to-month'},
 {'gender': 'female', 'contract': 'two_year'},
 {'gender': 'male', 'contract': 'month-to-month'},
 {'gender': 'male', 'contract': 'month-to-month'},
 {'gender': 'female', 'contract': 'month-to-month'},
 {'gender': 'female', 'contract': 'two_year'},
 {'gender': 'female', 'contract': 'month-to-month'},
 {'gender': 'female', 'contract': 'two_year'},
 {'gender': 'male', 'contract': 'month-to-month'},
 {'gender': 'female', 'contract': 'two_year'},
 {'gender': 'female', 'contract': 'month-to-month'},
 {'gender': 'female', 'contract': 'month-to-month'},
 {'gender': 'male', 'contract': 'month-to-month'},
 {'gender': 'female', 'contract': 'two_year'},
 {'gender': 'female', 'contract': 'month-to-month'},
 {'gender': 'male', 'contract': 'one_year'},
 {'gender': 'male', 'contract': 'two_year'},
 {

In [95]:
dv = DictVectorizer(sparse=False)
dv.fit(dicts)

dv.transform(dicts) # now we get the one-hot encoded matrix


array([[0., 0., 1., 1., 0.],
       [1., 0., 0., 0., 1.],
       [1., 0., 0., 1., 0.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 1., 0.],
       [1., 0., 0., 0., 1.],
       [1., 0., 0., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 1., 0.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 1., 0.],
       [1., 0., 0., 0., 1.],
       [0., 0., 1., 1., 0.],
       [1., 0., 0., 1., 0.],
       [1., 0., 0., 1., 0.],
       [1., 0., 0., 0., 1.],
       [0., 0., 1., 1., 0.],
       [1., 0., 0., 1., 0.],
       [0., 1., 0., 0., 1.],
       [0., 0., 1., 0., 1.],
       [1., 0., 0., 0., 1.],
       [0., 1., 0., 1., 0.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 1., 0.],
       [1., 0., 0., 0., 1.],
       [0., 0., 1., 0., 1.],
       [1., 0., 0., 1., 0.],
       [1., 0., 0., 1., 0.],
       [1., 0., 0., 1., 0.],
       [0., 1., 0., 1., 0.],
       [1., 0., 0., 0., 1.],
       [1., 0., 0., 0., 1.],
       [0., 1., 0., 1., 0.],
       [0., 1., 0., 1., 0.],
       [1., 0.

In [96]:
dv.get_feature_names_out() # this actually gets us the column names

array(['contract=month-to-month', 'contract=one_year',
       'contract=two_year', 'gender=female', 'gender=male'], dtype=object)
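
To show where the `key=value` column names and the 0/1 rows come from, here is a simplified pure-Python sketch of the encoding. It is an illustration, not how `DictVectorizer` is implemented; the function names `fit_feature_names` and `one_hot_rows` are invented for this example.

```python
def fit_feature_names(records):
    """Collect sorted feature names: 'key=value' for strings, 'key' for numbers."""
    names = set()
    for record in records:
        for key, value in record.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    return sorted(names)

def one_hot_rows(records, feature_names):
    """Encode each record as a list of floats aligned with feature_names."""
    column = {name: j for j, name in enumerate(feature_names)}
    rows = []
    for record in records:
        row = [0.0] * len(feature_names)
        for key, value in record.items():
            if isinstance(value, str):
                j = column.get(f"{key}={value}")  # None for values unseen during fit
                if j is not None:
                    row[j] = 1.0
            elif key in column:
                row[column[key]] = float(value)  # numbers pass through unchanged
        rows.append(row)
    return rows

records = [
    {"gender": "female", "contract": "two_year"},
    {"gender": "male", "contract": "month-to-month"},
    {"gender": "male", "contract": "one_year"},
]
names = fit_feature_names(records)
rows = one_hot_rows(records, names)
print(names)
print(rows[0])  # [0.0, 0.0, 1.0, 1.0, 0.0]
```

The first record encodes to the same pattern as the first row of the array above: a 1 in the `contract=two_year` and `gender=female` columns and 0 everywhere else.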

Now let's apply it to all the features instead of just 2 columns.

In [97]:
train_dicts = df_train[categorial + numerical].to_dict(orient = 'records')
dv = DictVectorizer(sparse=False)

X_train = dv.fit_transform(train_dicts)


In [98]:
# let's do the same for the validation dataset, reusing the fitted vectorizer (transform only, no fit)

val_dicts = df_val[categorial + numerical].to_dict(orient = 'records')
X_val = dv.transform(val_dicts) 

## 3.6 Logistic regression


In general, supervised models can be represented with this formula:  

$$
g(x_i) = y_i
$$

Depending on the type of target variable, the supervised task is regression or classification (binary or multiclass). In binary classification the target is either negative (0) or positive (1), and the model outputs the probability that $x_i$ belongs to the positive class.

Logistic regression is similar to linear regression because both models take into account the bias term and weighted sum of features. The difference between these models is that the output of linear regression is a real number, while logistic regression outputs a value between zero and one, applying the sigmoid function to the linear regression formula.  

$$
g(x_i) = \text{sigmoid}(w_0 + w_1 x_{i1} + w_2 x_{i2} + \dots + w_n x_{in})
$$

$$
\text{sigmoid}(z) = \frac{1}{1 + e^{-z}}
$$

In this way, the sigmoid function allows transforming a score into a probability.  

**In simple terms**:
you just wrap the linear regression formula in the sigmoid function, so the result falls between 0 and 1
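
The two formulas translate almost line for line into code. The sketch below scores a single customer by hand; the bias, weights, and feature values are made up for illustration, and the function is not hardened against overflow for very large negative scores.

```python
import math

def sigmoid(z):
    """Squash any real-valued score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_one(x, w0, w):
    """Logistic regression for one example: sigmoid(bias + weighted sum of features)."""
    score = w0 + sum(wj * xj for wj, xj in zip(w, x))
    return sigmoid(score)

# made-up weights and features: [month-to-month flag, tenure in months]
w0 = -1.0
w = [2.0, -0.05]
x = [1.0, 20.0]

p = predict_proba_one(x, w0, w)
print(p)  # the score is -1 + 2 - 1 = 0, so the probability is 0.5
```

A score of 0 maps to a probability of exactly 0.5, positive scores map above 0.5, and negative scores map below it.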


In [99]:
# let's train our model with scikit-learn

from sklearn.linear_model import LogisticRegression

In [100]:
model = LogisticRegression()
model.fit(X_train, y_train)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


parameter,value
penalty,'l2'
dual,False
tol,0.0001
C,1.0
fit_intercept,True
intercept_scaling,1
class_weight,None
random_state,None
solver,'lbfgs'
max_iter,100


Below are the ways to get the intercept and the coefficients.

In [101]:
model.intercept_[0]

np.float64(-0.10910575411731115)

In [102]:
model.coef_[0].round(3)

array([ 0.476, -0.175, -0.408, -0.03 , -0.078,  0.063, -0.089, -0.081,
       -0.034, -0.073, -0.336,  0.317, -0.089,  0.004, -0.258,  0.142,
        0.009,  0.063, -0.089, -0.081,  0.266, -0.089, -0.284, -0.231,
        0.123, -0.166,  0.059, -0.087, -0.032,  0.07 , -0.059,  0.142,
       -0.25 ,  0.216, -0.121, -0.089,  0.102, -0.071, -0.089,  0.052,
        0.213, -0.089, -0.232, -0.07 ,  0.   ])

In [103]:
# now we use the model
# predict returns hard predictions (0 or 1); predict_proba returns soft predictions (probabilities)

# it has 2 columns: the first is the probability of class 0, the second is the probability of class 1
# so the second column is the probability of churning

model.predict_proba(X_train)

# we only care about the probability of churning, so we take the second column
# let's do it for the validation set

y_pred = model.predict_proba(X_val)[:, 1]
churn_decision = (y_pred >= 0.5)
churn_decision.astype(int)

# let's get the customers predicted to churn
df_val[churn_decision].customerid

2504    8433-wxgna
4597    3440-jpscl
2343    2637-fkfsy
5591    7228-omtpn
4482    6711-fldfb
           ...    
2611    5976-jcjrh
4211    2034-cgrhz
3999    5276-kqwhg
6240    6521-yytyi
5282    3049-solay
Name: customerid, Length: 312, dtype: object

In [104]:
# let's see how accurate our model is
(y_val == churn_decision).mean()
# about 80% of the predictions are correct

np.float64(0.8026969481902059)
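
The accuracy computation is just "threshold the scores, then count matches". A minimal pure-Python sketch with made-up labels and scores follows; `accuracy_at_threshold` is a hypothetical helper name.

```python
def accuracy_at_threshold(y_true, y_score, threshold=0.5):
    """Share of examples whose thresholded score matches the true label."""
    decisions = [int(score >= threshold) for score in y_score]
    correct = sum(int(d == t) for d, t in zip(decisions, y_true))
    return correct / len(y_true)

# made-up example: 5 customers, their true labels and predicted churn probabilities
y_true = [0, 1, 0, 1, 0]
y_score = [0.1, 0.8, 0.6, 0.4, 0.2]

acc = accuracy_at_threshold(y_true, y_score)
print(acc)  # decisions are [0, 1, 1, 0, 0], so 3 of 5 are correct
```

For context, roughly 73% of customers in the training data do not churn, so a model that always predicts "no churn" would already score in that region; the 80% above should be judged against that baseline.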

## 3.7 Model interpretation

We will look at what the coefficients mean, and also train a smaller model with fewer features.

In [105]:
# let's first look at the features and their respective weights
# zip pairs element i of the first list with element i of the second

# the full mapping is hard to make sense of, so we will train a smaller model next
dict(zip(dv.get_feature_names_out(), model.coef_[0].round(3)))


In [106]:
# prep the data: keep only three features for the small model
small = ['contract', 'tenure', 'monthlycharges']

dicts_train_small = df_train[small].to_dict(orient = 'records')
dicts_val_small = df_val[small].to_dict(orient = 'records')

In [107]:
# fit the encoder on the small training dicts and build the feature matrix
dv_small = DictVectorizer(sparse=False)
dv_small.fit(dicts_train_small)
dv_small.get_feature_names_out()
X_train_small = dv_small.transform(dicts_train_small)

In [108]:
# train the small model on the small feature matrix (not on the full X_train)
model_small = LogisticRegression()
model_small.fit(X_train_small, y_train)

In [109]:
display(model_small.coef_[0])
display(model_small.intercept_[0])

In [110]:
# pair each feature name with its weight
dict(zip(dv_small.get_feature_names_out(), model_small.coef_[0].round(3)))

## 3.8 Using the model

In [111]:
# build the list of dicts from the full training set (train + validation)
dicts_full_train = df_full_train[categorial + numerical].to_dict(orient = 'records')

dv = DictVectorizer(sparse=False)
X_full_train = dv.fit_transform(dicts_full_train)
y_full_train = df_full_train.churn.values

model = LogisticRegression()
model.fit(X_full_train, y_full_train)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


parameter,value
penalty,'l2'
dual,False
tol,0.0001
C,1.0
fit_intercept,True
intercept_scaling,1
class_weight,None
random_state,None
solver,'lbfgs'
max_iter,100


In [112]:
# repeat the process for the test dataset
dicts_test = df_test[categorial + numerical].to_dict(orient = 'records')
X_test = dv.transform(dicts_test)

y_pred = model.predict_proba(X_test)[:, 1]
y_pred

array([0.05555219, 0.13116071, 0.32157923, ..., 0.00634677, 0.19263119,
       0.65842011], shape=(1409,))

In [None]:
churn_decision = (y_pred >= 0.5)
(churn_decision == y_test).mean()
# about 81% accurate on the test set
# if the accuracy on validation and test differed a lot, that would signal a problem

np.float64(0.8140525195173882)

In [124]:
# the goal is to find out whether a customer is likely to leave
# if so, we send them a promotional email

# let's try this with a single customer
customer = dicts_test[10]
display(customer)
# transform this to get the feature matrix
# (transform takes a list of dicts; we only have one customer, so we wrap it in a list of length 1)
customer_features = dv.transform([customer])
model.predict_proba(customer_features)[: , 1]


{'gender': 'male',
 'seniorcitizen': 1,
 'partner': 'yes',
 'dependents': 'yes',
 'phoneservice': 'yes',
 'multiplelines': 'no',
 'internetservice': 'fiber_optic',
 'onlinesecurity': 'no',
 'onlinebackup': 'yes',
 'deviceprotection': 'no',
 'techsupport': 'no',
 'streamingtv': 'yes',
 'streamingmovies': 'yes',
 'contract': 'month-to-month',
 'paperlessbilling': 'yes',
 'paymentmethod': 'mailed_check',
 'tenure': 32,
 'monthlycharges': 93.95,
 'totalcharges': 2861.45}

array([0.49654726])

In [None]:
# the predicted probability (about 0.497) is just below the 0.5 threshold, so we predict no churn

# now let's check whether this customer actually churned

y_test[10]

# it is 0, so the prediction was right

np.int64(0)

# Summary

- we used risk ratio, mutual information, and the correlation coefficient to measure feature importance
- we used DictVectorizer for one-hot encoding
- we used logistic regression for classification
- the model outputs a probability of churning
- we interpreted the weights and saw how logistic regression relates to linear regression