# Churn Analytics

# Telecom Churn Dataset

### Content
The Orange Telecom Churn Dataset consists of cleaned customer activity data (features) along with a churn label specifying whether a customer canceled the subscription. It will be used to develop predictive models. Two files are available for download: churn-80 and churn-20.

The two sets are from the same batch, but have been split by an 80/20 ratio. As more data is often desirable for developing ML models, let's use the larger set (that is, churn-80) for training and cross-validation purposes, and the smaller set (that is, churn-20) for final testing and model performance evaluation.
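
As a rough illustration of that workflow, the sketch below cross-validates on the larger portion and scores once on the held-out portion. It uses synthetic data from `make_classification` as a stand-in (an assumption, since the real features come from the CSV files), and the names `X_80`, `X_20`, and `model` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: the real data comes from the churn CSV files)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_80, y_80 = X[:400], y[:400]   # plays the role of churn-80
X_20, y_20 = X[400:], y[400:]   # plays the role of churn-20

model = DecisionTreeClassifier(max_depth=4, random_state=0)

# 5-fold cross-validation on the larger set only
cv_scores = cross_val_score(model, X_80, y_80, cv=5)
print("CV accuracy per fold:", cv_scores)

# Fit once on the full larger set, then score once on the held-out set
model.fit(X_80, y_80)
holdout_accuracy = model.score(X_20, y_20)
print("Hold-out accuracy:", holdout_accuracy)
```

Keeping the smaller set out of every fitting step is what makes the final score an honest estimate.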

The dataset is available [here](https://www.kaggle.com/datasets/mnassrib/telecom-churn-datasets?select=churn-bigml-80.csv).

In [1]:
import pandas as pd

In [3]:
df_train = pd.read_csv('churn-bigml-80.csv')

In [93]:
df_train.head()

Unnamed: 0,State,Account length,Area code,International plan,Voice mail plan,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls,Churn
0,KS,128,415,No,Yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,No,Yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,No,No,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,Yes,No,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,Yes,No,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


In [95]:
df_train.head(1)

Unnamed: 0,State,Account length,Area code,International plan,Voice mail plan,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls,Churn
0,KS,128,415,No,Yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False


In [6]:
df_train.shape

(2666, 20)

In [7]:
df_test = pd.read_csv('churn-bigml-20.csv')

In [8]:
df_test.head()

Unnamed: 0,State,Account length,Area code,International plan,Voice mail plan,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls,Churn
0,LA,117,408,No,No,0,184.5,97,31.37,351.6,80,29.89,215.8,90,9.71,8.7,4,2.35,1,False
1,IN,65,415,No,No,0,129.1,137,21.95,228.5,83,19.42,208.8,111,9.4,12.7,6,3.43,4,True
2,NY,161,415,No,No,0,332.9,67,56.59,317.8,97,27.01,160.6,128,7.23,5.4,9,1.46,4,True
3,SC,111,415,No,No,0,110.4,103,18.77,137.3,102,11.67,189.6,105,8.53,7.7,6,2.08,2,False
4,HI,49,510,No,No,0,119.3,117,20.28,215.1,109,18.28,178.7,90,8.04,11.1,1,3.0,1,False


In [13]:
df_test.shape

(667, 20)

# Concat them both

In [14]:
df = pd.concat([df_train,df_test])

In [15]:
df

Unnamed: 0,State,Account length,Area code,International plan,Voice mail plan,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls,Churn
0,KS,128,415,No,Yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.70,1,False
1,OH,107,415,No,Yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.70,1,False
2,NJ,137,415,No,No,0,243.4,114,41.38,121.2,110,10.30,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,Yes,No,0,299.4,71,50.90,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,Yes,No,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
662,WI,114,415,No,Yes,26,137.1,88,23.31,155.7,125,13.23,247.6,94,11.14,11.5,7,3.11,2,False
663,AL,106,408,No,Yes,29,83.6,131,14.21,203.9,131,17.33,229.5,73,10.33,8.1,3,2.19,1,False
664,VT,60,415,No,No,0,193.9,118,32.96,85.0,110,7.23,210.1,134,9.45,13.2,8,3.56,3,False
665,WV,159,415,No,No,0,169.8,114,28.87,197.7,105,16.80,193.7,82,8.72,11.6,4,3.13,1,False


# EDA


# Checking data types

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3333 entries, 0 to 666
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   State                   3333 non-null   object 
 1   Account length          3333 non-null   int64  
 2   Area code               3333 non-null   int64  
 3   International plan      3333 non-null   object 
 4   Voice mail plan         3333 non-null   object 
 5   Number vmail messages   3333 non-null   int64  
 6   Total day minutes       3333 non-null   float64
 7   Total day calls         3333 non-null   int64  
 8   Total day charge        3333 non-null   float64
 9   Total eve minutes       3333 non-null   float64
 10  Total eve calls         3333 non-null   int64  
 11  Total eve charge        3333 non-null   float64
 12  Total night minutes     3333 non-null   float64
 13  Total night calls       3333 non-null   int64  
 14  Total night charge      3333 non-null   f

# Fixing data types

In [19]:
df['Area code']

0      415
1      415
2      415
3      408
4      415
      ... 
662    415
663    408
664    415
665    415
666    510
Name: Area code, Length: 3333, dtype: int64

In [18]:
df['Area code'].astype('object')

0      415
1      415
2      415
3      408
4      415
      ... 
662    415
663    408
664    415
665    415
666    510
Name: Area code, Length: 3333, dtype: object

### Overwrite my previous column with new data

In [20]:
df['Area code'] = df['Area code'].astype('object')

In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3333 entries, 0 to 666
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   State                   3333 non-null   object 
 1   Account length          3333 non-null   int64  
 2   Area code               3333 non-null   object 
 3   International plan      3333 non-null   object 
 4   Voice mail plan         3333 non-null   object 
 5   Number vmail messages   3333 non-null   int64  
 6   Total day minutes       3333 non-null   float64
 7   Total day calls         3333 non-null   int64  
 8   Total day charge        3333 non-null   float64
 9   Total eve minutes       3333 non-null   float64
 10  Total eve calls         3333 non-null   int64  
 11  Total eve charge        3333 non-null   float64
 12  Total night minutes     3333 non-null   float64
 13  Total night calls       3333 non-null   int64  
 14  Total night charge      3333 non-null   f

# I want to separate my categorical and numerical columns

### This is my categorical data

In [22]:
df.select_dtypes(include=['object'])

Unnamed: 0,State,Area code,International plan,Voice mail plan
0,KS,415,No,Yes
1,OH,415,No,Yes
2,NJ,415,No,No
3,OH,408,Yes,No
4,OK,415,Yes,No
...,...,...,...,...
662,WI,415,No,Yes
663,AL,408,No,Yes
664,VT,415,No,No
665,WV,415,No,No


In [24]:
df.select_dtypes(include=['int64'])

Unnamed: 0,Account length,Number vmail messages,Total day calls,Total eve calls,Total night calls,Total intl calls,Customer service calls
0,128,25,110,99,91,3,1
1,107,26,123,103,103,3,1
2,137,0,114,110,104,5,0
3,84,0,71,88,89,7,2
4,75,0,113,122,121,3,3
...,...,...,...,...,...,...,...
662,114,26,88,125,94,7,2
663,106,29,131,131,73,3,1
664,60,0,118,110,134,8,3
665,159,0,114,105,82,4,1


In [27]:
df.select_dtypes(include=['float64'])

Unnamed: 0,Total day minutes,Total day charge,Total eve minutes,Total eve charge,Total night minutes,Total night charge,Total intl minutes,Total intl charge
0,265.1,45.07,197.4,16.78,244.7,11.01,10.0,2.70
1,161.6,27.47,195.5,16.62,254.4,11.45,13.7,3.70
2,243.4,41.38,121.2,10.30,162.6,7.32,12.2,3.29
3,299.4,50.90,61.9,5.26,196.9,8.86,6.6,1.78
4,166.7,28.34,148.3,12.61,186.9,8.41,10.1,2.73
...,...,...,...,...,...,...,...,...
662,137.1,23.31,155.7,13.23,247.6,11.14,11.5,3.11
663,83.6,14.21,203.9,17.33,229.5,10.33,8.1,2.19
664,193.9,32.96,85.0,7.23,210.1,9.45,13.2,3.56
665,169.8,28.87,197.7,16.80,193.7,8.72,11.6,3.13


### This is my numerical data

In [26]:
df.select_dtypes(include=['int64', 'float64'])

Unnamed: 0,Account length,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls
0,128,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.70,1
1,107,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.70,1
2,137,0,243.4,114,41.38,121.2,110,10.30,162.6,104,7.32,12.2,5,3.29,0
3,84,0,299.4,71,50.90,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2
4,75,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
662,114,26,137.1,88,23.31,155.7,125,13.23,247.6,94,11.14,11.5,7,3.11,2
663,106,29,83.6,131,14.21,203.9,131,17.33,229.5,73,10.33,8.1,3,2.19,1
664,60,0,193.9,118,32.96,85.0,110,7.23,210.1,134,9.45,13.2,8,3.56,3
665,159,0,169.8,114,28.87,197.7,105,16.80,193.7,82,8.72,11.6,4,3.13,1


In [28]:
df_cat = df.select_dtypes(include=['object'])
df_num = df.select_dtypes(include=['int64', 'float64'])

# Task done

# One hot encoding

In [103]:
df_cat

Unnamed: 0,State,Area code,International plan,Voice mail plan
0,KS,415,No,Yes
1,OH,415,No,Yes
2,NJ,415,No,No
3,OH,408,Yes,No
4,OK,415,Yes,No
...,...,...,...,...
662,WI,415,No,Yes
663,AL,408,No,Yes
664,VT,415,No,No
665,WV,415,No,No


In [104]:
df_cat.columns

Index(['State', 'Area code', 'International plan', 'Voice mail plan'], dtype='object')

In [30]:
from sklearn.preprocessing import OneHotEncoder # This is me importing
enc = OneHotEncoder() # Initialising

In [31]:
enc.fit(df_cat)

In [33]:
enc.transform(df_cat)

<3333x58 sparse matrix of type '<class 'numpy.float64'>'
	with 13332 stored elements in Compressed Sparse Row format>

When we transform, we don't get back an array, but a sparse matrix. To get a regular array out of the sparse matrix, we use `toarray()`.

In [34]:
enc.transform(df_cat).toarray()

array([[0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 1., 0.],
       ...,
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 1., 1., 0.]])
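
To make the sparse-versus-dense difference concrete, here is a minimal sketch on a made-up three-row frame (the frame `toy` and the names `toy_enc`, `sparse_out`, and `dense_out` are assumptions for illustration). A sparse matrix stores only the non-zero entries, which is why the output above reports 13332 stored elements: one `1` per categorical column per row, 3333 × 4.

```python
import pandas as pd
from scipy.sparse import issparse
from sklearn.preprocessing import OneHotEncoder

# Made-up toy data with two of the categorical columns
toy = pd.DataFrame({
    "International plan": ["No", "Yes", "No"],
    "Voice mail plan": ["Yes", "No", "No"],
})

toy_enc = OneHotEncoder()
sparse_out = toy_enc.fit_transform(toy)  # SciPy sparse matrix by default
dense_out = sparse_out.toarray()         # regular NumPy array

print(issparse(sparse_out))  # True
print(sparse_out.nnz)        # number of stored (non-zero) entries: 3 rows x 2 columns
print(dense_out)
```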

In [110]:
enc.get_feature_names_out()

array(['State_AK', 'State_AL', 'State_AR', 'State_AZ', 'State_CA',
       'State_CO', 'State_CT', 'State_DC', 'State_DE', 'State_FL',
       'State_GA', 'State_HI', 'State_IA', 'State_ID', 'State_IL',
       'State_IN', 'State_KS', 'State_KY', 'State_LA', 'State_MA',
       'State_MD', 'State_ME', 'State_MI', 'State_MN', 'State_MO',
       'State_MS', 'State_MT', 'State_NC', 'State_ND', 'State_NE',
       'State_NH', 'State_NJ', 'State_NM', 'State_NV', 'State_NY',
       'State_OH', 'State_OK', 'State_OR', 'State_PA', 'State_RI',
       'State_SC', 'State_SD', 'State_TN', 'State_TX', 'State_UT',
       'State_VA', 'State_VT', 'State_WA', 'State_WI', 'State_WV',
       'State_WY', 'Area code_408', 'Area code_415', 'Area code_510',
       'International plan_No', 'International plan_Yes',
       'Voice mail plan_No', 'Voice mail plan_Yes'], dtype=object)

# Saving my onehot encoder

In [100]:
from joblib import dump, load
dump(enc, 'my_encoder.joblib') # save the encoder
# enc = load('my_encoder.joblib') # load and reuse the encoder

['my_encoder.joblib']

# Let's make a DataFrame out of this.

In [44]:
pd.DataFrame(enc.transform(df_cat).toarray(), columns=enc.get_feature_names_out())

Unnamed: 0,State_AK,State_AL,State_AR,State_AZ,State_CA,State_CO,State_CT,State_DC,State_DE,State_FL,...,State_WI,State_WV,State_WY,Area code_408,Area code_415,Area code_510,International plan_No,International plan_Yes,Voice mail plan_No,Voice mail plan_Yes
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3328,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
3329,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
3330,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
3331,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0


In [61]:
df_cat_encoded = pd.DataFrame(enc.transform(df_cat).toarray(), columns=enc.get_feature_names_out())

In [62]:
df_cat_encoded

Unnamed: 0,State_AK,State_AL,State_AR,State_AZ,State_CA,State_CO,State_CT,State_DC,State_DE,State_FL,...,State_WI,State_WV,State_WY,Area code_408,Area code_415,Area code_510,International plan_No,International plan_Yes,Voice mail plan_No,Voice mail plan_Yes
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3328,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
3329,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
3330,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
3331,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0


# Now, let's work on our numeric data

In [42]:
df_num

Unnamed: 0,Account length,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls
0,128,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.70,1
1,107,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.70,1
2,137,0,243.4,114,41.38,121.2,110,10.30,162.6,104,7.32,12.2,5,3.29,0
3,84,0,299.4,71,50.90,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2
4,75,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
662,114,26,137.1,88,23.31,155.7,125,13.23,247.6,94,11.14,11.5,7,3.11,2
663,106,29,83.6,131,14.21,203.9,131,17.33,229.5,73,10.33,8.1,3,2.19,1
664,60,0,193.9,118,32.96,85.0,110,7.23,210.1,134,9.45,13.2,8,3.56,3
665,159,0,169.8,114,28.87,197.7,105,16.80,193.7,82,8.72,11.6,4,3.13,1


In [43]:
df_num.describe()

Unnamed: 0,Account length,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls
count,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0
mean,101.064806,8.09901,179.775098,100.435644,30.562307,200.980348,100.114311,17.08354,200.872037,100.107711,9.039325,10.237294,4.479448,2.764581,1.562856
std,39.822106,13.688365,54.467389,20.069084,9.259435,50.713844,19.922625,4.310668,50.573847,19.568609,2.275873,2.79184,2.461214,0.753773,1.315491
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,23.2,33.0,1.04,0.0,0.0,0.0,0.0
25%,74.0,0.0,143.7,87.0,24.43,166.6,87.0,14.16,167.0,87.0,7.52,8.5,3.0,2.3,1.0
50%,101.0,0.0,179.4,101.0,30.5,201.4,100.0,17.12,201.2,100.0,9.05,10.3,4.0,2.78,1.0
75%,127.0,20.0,216.4,114.0,36.79,235.3,114.0,20.0,235.3,113.0,10.59,12.1,6.0,3.27,2.0
max,243.0,51.0,350.8,165.0,59.64,363.7,170.0,30.91,395.0,175.0,17.77,20.0,20.0,5.4,9.0


# Combining my categorical and numerical columns

In [70]:
df_cat_encoded

Unnamed: 0,State_AK,State_AL,State_AR,State_AZ,State_CA,State_CO,State_CT,State_DC,State_DE,State_FL,...,State_WI,State_WV,State_WY,Area code_408,Area code_415,Area code_510,International plan_No,International plan_Yes,Voice mail plan_No,Voice mail plan_Yes
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3328,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
3329,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
3330,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
3331,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0


In [105]:
df_num

Unnamed: 0,Account length,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls
0,128,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.70,1
1,107,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.70,1
2,137,0,243.4,114,41.38,121.2,110,10.30,162.6,104,7.32,12.2,5,3.29,0
3,84,0,299.4,71,50.90,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2
4,75,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
662,114,26,137.1,88,23.31,155.7,125,13.23,247.6,94,11.14,11.5,7,3.11,2
663,106,29,83.6,131,14.21,203.9,131,17.33,229.5,73,10.33,8.1,3,2.19,1
664,60,0,193.9,118,32.96,85.0,110,7.23,210.1,134,9.45,13.2,8,3.56,3
665,159,0,169.8,114,28.87,197.7,105,16.80,193.7,82,8.72,11.6,4,3.13,1


In [106]:
df_num.columns

Index(['Account length', 'Number vmail messages', 'Total day minutes',
       'Total day calls', 'Total day charge', 'Total eve minutes',
       'Total eve calls', 'Total eve charge', 'Total night minutes',
       'Total night calls', 'Total night charge', 'Total intl minutes',
       'Total intl calls', 'Total intl charge', 'Customer service calls'],
      dtype='object')

In [73]:
df_X = df_num.join(df_cat_encoded)

In [74]:
df_X

Unnamed: 0,Account length,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,...,State_WI,State_WV,State_WY,Area code_408,Area code_415,Area code_510,International plan_No,International plan_Yes,Voice mail plan_No,Voice mail plan_Yes
0,128,25,265.1,110,45.07,197.4,99,16.78,244.7,91,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
0,117,0,184.5,97,31.37,351.6,80,29.89,215.8,90,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
1,107,26,161.6,123,27.47,195.5,103,16.62,254.4,103,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
1,65,0,129.1,137,21.95,228.5,83,19.42,208.8,111,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
2,137,0,243.4,114,41.38,121.2,110,10.30,162.6,104,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2661,79,0,134.7,98,22.90,189.7,68,16.12,221.4,128,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
2662,192,36,156.2,77,26.55,215.5,126,18.32,279.1,83,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
2663,68,0,231.1,57,39.29,153.4,55,13.04,191.3,123,...,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
2664,28,0,180.8,109,30.74,288.8,58,24.55,191.9,91,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0
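
Note the repeated index labels (0, 0, 1, 1, ...) in the output above. `join` matches rows by index label, and `df_num` still carries the labels of the two original files (0 to 2665 and 0 to 666), while `df_cat_encoded` is labelled 0 to 3332. The sketch below reproduces the effect on made-up frames and shows one positional alternative; `part_a`, `part_b`, `stacked`, and `encoded` are hypothetical names, not part of the notebook.

```python
import pandas as pd

part_a = pd.DataFrame({"x": [10, 11, 12]})
part_b = pd.DataFrame({"x": [20, 21]})

stacked = pd.concat([part_a, part_b])  # index labels: 0, 1, 2, 0, 1
encoded = pd.DataFrame({"flag": [1.0, 2.0, 3.0, 4.0, 5.0]})  # labels 0..4

# Label-based join: rows labelled 0 and 1 appear twice on the left,
# so they reuse the first rows of `encoded`; flags 4.0 and 5.0 are never used
by_label = stacked.join(encoded)

# Positional alternative: give both frames the same 0..n-1 index first
by_position = pd.concat([stacked.reset_index(drop=True), encoded], axis=1)

print(by_label)
print(by_position)
```

In the notebook this would correspond to something like `pd.concat([df_num.reset_index(drop=True), df_cat_encoded], axis=1)`, or to passing `ignore_index=True` when the two files are first concatenated. Either way, the rows of `df_X` then stay in the same order as the labels taken from `df['Churn']`.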


# Getting my Y ready

In [75]:
df['Churn']

0      0
1      0
2      0
3      0
4      0
      ..
662    0
663    0
664    0
665    0
666    0
Name: Churn, Length: 3333, dtype: int64

In [76]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

In [77]:
le.fit(df['Churn'])

In [109]:
le.transform(df['Churn'])

array([0, 0, 0, ..., 0, 0, 0])

# Saving my label encoder

In [102]:
from joblib import dump, load
dump(le, 'my_label_encoder.joblib') # save the label encoder
# le = load('my_label_encoder.joblib') # load and reuse the label encoder

['my_label_encoder.joblib']

In [80]:
y = le.transform(df['Churn'])

In [81]:
y

array([0, 0, 0, ..., 0, 0, 0])

In [82]:
len(y)

3333

# Let's split our data

In [83]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_X, y, test_size=0.3, random_state=42)

In [210]:
X_train

Unnamed: 0,Account length,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,...,State_WI,State_WV,State_WY,Area code_408,Area code_415,Area code_510,International plan_No,International plan_Yes,Voice mail plan_No,Voice mail plan_Yes
1349,93,32,218.7,117,37.18,115.0,61,9.78,192.7,85,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0
695,32,0,171.2,82,29.10,185.6,102,15.78,203.3,64,...,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
2003,79,17,236.7,95,40.24,263.5,56,22.40,259.6,107,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
1543,107,0,230.4,65,39.17,257.4,80,21.88,107.3,88,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
1179,1,0,196.1,107,33.34,296.5,82,25.20,211.5,91,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
547,130,0,252.0,101,42.84,170.2,105,14.47,209.2,64,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
565,100,38,177.1,88,30.11,163.7,108,13.91,242.7,72,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
647,80,0,105.8,110,17.99,43.9,88,3.73,189.6,87,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
430,194,0,48.4,101,8.23,281.1,138,23.89,218.5,87,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0


In [85]:
y_train

array([0, 0, 0, ..., 0, 0, 0])

# Let's model

In [86]:
from sklearn.tree import DecisionTreeClassifier

In [87]:
clf = DecisionTreeClassifier()

In [88]:
clf = clf.fit(X_train, y_train)

In [108]:
clf.score(X_train,y_train)

1.0

# Saving my decision tree

In [98]:
from joblib import dump, load
dump(clf, 'my_decision_tree.joblib') # save the model
# clf = load('filename.joblib') # load and reuse the model

['my_decision_tree.joblib']

# Let's look at our testing accuracy

In [90]:
# Getting predictions on our test data
test_predictions = clf.predict(X_test)

In [92]:
# Calculating accuracy
from sklearn.metrics import accuracy_score
accuracy_score(y_test, test_predictions)

0.759
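
A perfect training score next to a much lower test score usually means the tree has memorised the training rows (features and labels that are out of step can produce the same pattern). The sketch below compares an unconstrained tree with a depth-limited one. It uses synthetic data and hypothetical names (`X_demo`, `results`), so the numbers it prints say nothing about the churn data itself.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the churn dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=12, n_informative=4, random_state=42
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

results = {}
for depth in (None, 4):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(Xd_train, yd_train)
    results[depth] = (tree.score(Xd_train, yd_train), tree.score(Xd_test, yd_test))
    print("max_depth =", depth, "train/test accuracy:", results[depth])
```

Limiting `max_depth` is only one of several ways to constrain a tree; which setting works best here would have to be checked on the real data, for example with cross-validation.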

# Let's create a function

In [None]:
def get_prediction(model_path, encoder_path, label_encoder_path, user_input):

    # Let's load our model
    clf = load(model_path) # load and reuse the model

    # Let's load our encoder
    enc = load(encoder_path) # load and reuse the encoder

    # Let's load my label encoder
    le = load(label_encoder_path) # load and reuse the label encoder

    # 1. Firstly, create a DataFrame out of the user input
    # 2. Get your categorical df with df[['State', 'Area code', 'International plan', 'Voice mail plan']]
    # 3. Get your numerical df with df[['Account length', 'Number vmail messages', 'Total day minutes',
    #        'Total day calls', 'Total day charge', 'Total eve minutes',
    #        'Total eve calls', 'Total eve charge', 'Total night minutes',
    #        'Total night calls', 'Total night charge', 'Total intl minutes',
    #        'Total intl calls', 'Total intl charge', 'Customer service calls']]

    # 4. Encode your categorical columns
    # enc.transform()
    # 5. Save your encoded df.
    # 6. Combine your encoded df with your df_num

    # Think about this
    # 7. At this point, your data looks exactly like the data you used to train your model
    # 8. Can you not just do clf.predict(yourdata)?
    # 9. This will give you a label.
    # 10. You will have to convert that encoded label to the actual label. You can do that with your label encoder.
    

# What a user will input

In [107]:
['KS',128,'415','No','Yes',25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1]

['KS',
 128,
 '415',
 'No',
 'Yes',
 25,
 265.1,
 110,
 45.07,
 197.4,
 99,
 16.78,
 244.7,
 91,
 11.01,
 10.0,
 3,
 2.7,
 1]