Introduction to Regression Project

Problem Statement
Mobile carrier Megaline has found out that many of their subscribers use legacy plans.
They want to develop a model that would analyze subscribers' behavior and recommend
one of Megaline's newer plans: Smart or Ultra.
You have access to behavior data about subscribers who have already switched to the
new plans (from the project for the Statistical Data Analysis course). For this
classification task, you need to develop a model that will pick the right plan. Since you’ve
already performed the data preprocessing step, you can move straight to creating the
model.
Develop a model with the highest possible accuracy. In this project, the threshold for
accuracy is 0.75. Check the accuracy using the test dataset.
1. Open and look through the data file.
2. Split the source data into a training set, a validation set, and a test set.
3. Investigate the quality of different models by changing hyperparameters. Briefly
describe the findings of the study.
4. Check the quality of the model using the test set.
5. Additional task: sanity check the model. This data is more complex than what
you’re used to working with, so it's not an easy task. We'll take a closer look at it
later.


In [27]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression  # logistic regression classifier
from sklearn.tree import DecisionTreeClassifier      # decision tree classifier
from sklearn.ensemble import RandomForestClassifier  # random forest classifier
from sklearn.dummy import DummyClassifier            # baseline (dummy) classifier


df=pd.read_csv('https://bit.ly/UsersBehaviourTelco')

df.head()

   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0


In [28]:
df.shape

(3214, 5)

In [29]:
df.dtypes

calls       float64
minutes     float64
messages    float64
mb_used     float64
is_ultra      int64
dtype: object

In [30]:
df.isnull().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

In [31]:
#prepare the data: features and target
x = df.drop(['is_ultra'], axis=1)
y = df['is_ultra']

#split the source data into a training set, a validation set, and a test set (ratio 3:1:1)
#first hold out 20% for testing, then take 25% of the remaining 80% (20% overall) for validation
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

#confirm size of datasets
print(df.shape)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
print(X_valid.shape)
print(y_valid.shape)

(3214, 5)
(1928, 4)
(643, 4)
(1928,)
(643,)
(643, 4)
(643,)
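
The 3:1:1 split above can also be wrapped in a small helper that stratifies both splits, so that each subset keeps roughly the same share of Ultra subscribers. This is a minimal sketch on synthetic data; the helper name split_three_way and the generated values are assumptions made for illustration, not part of the project code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_three_way(features, target, seed=42):
    """Split into train/valid/test at a 3:1:1 ratio, stratifying both splits."""
    # hold out 20% of the rows for the test set
    X_rest, X_test, y_rest, y_test = train_test_split(
        features, target, test_size=0.2, random_state=seed, stratify=target)
    # 25% of the remaining 80% is 20% of the original data
    X_train, X_valid, y_train, y_valid = train_test_split(
        X_rest, y_rest, test_size=0.25, random_state=seed, stratify=y_rest)
    return X_train, X_valid, X_test, y_train, y_valid, y_test


# synthetic stand-in for the subscriber table (hypothetical values)
rng = np.random.default_rng(0)
demo_X = pd.DataFrame(rng.normal(size=(1000, 4)),
                      columns=['calls', 'minutes', 'messages', 'mb_used'])
demo_y = pd.Series((rng.random(1000) < 0.3).astype(int), name='is_ultra')

X_tr, X_va, X_te, y_tr, y_va, y_te = split_three_way(demo_X, demo_y)
print(len(X_tr), len(X_va), len(X_te))
```

Stratifying the second split as well is a design choice: it keeps the validation set's class balance close to that of the training and test sets.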


In [32]:
#create instances of the classifiers with default hyperparameters
logistic_classifier = LogisticRegression()
decision_classifier = DecisionTreeClassifier()
random_classifer = RandomForestClassifier()
dummy_classifer = DummyClassifier()

In [33]:
#train model
logistic_classifier.fit(X_train, y_train)
decision_classifier.fit(X_train, y_train)
random_classifer.fit(X_train, y_train)
dummy_classifer.fit(X_train, y_train)

DummyClassifier()

In [34]:
#predict test results
logistic_y_prediction = logistic_classifier.predict(X_test) 
decision_y_prediction = decision_classifier.predict(X_test) 
random_y_prediction = random_classifer.predict(X_test) 
dummy_y_prediction = dummy_classifer.predict(X_test)

In [35]:
#compare actual values with predicted values
df_logistic = pd.DataFrame({'Actual': y_test, 'Predicted': logistic_y_prediction })
df_logistic.head()

      Actual  Predicted
1201       0          0
3093       0          0
2995       1          0
2739       1          1
232        1          0


In [36]:
#compare actual values with predicted values
df_random = pd.DataFrame({'Actual': y_test, 'Predicted': random_y_prediction })
df_random.head()

      Actual  Predicted
1201       0          0
3093       0          1
2995       1          0
2739       1          1
232        1          0


In [37]:
from sklearn.metrics import accuracy_score
#print the accuracy of each classifier (accuracy_score expects y_true first, then y_pred)
print('Logistic classifier:')
print(accuracy_score(y_test, logistic_y_prediction))
print('Decision Tree classifier:')
print(accuracy_score(y_test, decision_y_prediction))
print('Random Forest classifier:')
print(accuracy_score(y_test, random_y_prediction))
print('Dummy classifier:')
print(accuracy_score(y_test, dummy_y_prediction))

#the most accurate classifier is the random forest, at about 0.81
#the dummy baseline, which always predicts the most frequent class, scores about 0.69

Logistic classifier:
0.7107309486780715
Decision Tree classifier:
0.7325038880248833
Random Forest classifier:
0.807153965785381
Dummy classifier:
0.6936236391912908
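
The dummy score is not a coincidence: a classifier that always predicts the most frequent class is correct exactly as often as that class occurs. A minimal sketch, assuming the test set holds 446 Smart and 197 Ultra subscribers (the support counts shown by classification_report) and that the dummy model behaves like the most_frequent strategy. For brevity the sketch fits and scores on the same labels.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# assumed test-set class counts: 446 Smart (0) and 197 Ultra (1)
y_true = np.array([0] * 446 + [1] * 197)
X_placeholder = np.zeros((len(y_true), 1))  # the dummy model ignores the features

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_placeholder, y_true)
baseline_accuracy = accuracy_score(y_true, baseline.predict(X_placeholder))

print(baseline_accuracy)  # share of the majority class, 446 / 643
```

Any real model has to beat this figure, not 0.5, to show that it has learned something from the features.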


Investigate the quality of different models by changing hyperparameters. Briefly describe the findings of the study.
Check the quality of the model using the test set.
Additional task: sanity check the model. This data is more complex than what you’re used to working with, so it's not an easy task. We'll take a closer look at it later.

In [38]:
from sklearn.metrics import classification_report

#classification report 
print('Logistic classifier:')
print(classification_report(y_test, logistic_y_prediction))

print('Decision Tree classifier:')
print(classification_report(y_test, decision_y_prediction)) 

print('Random Forest classifier:')
print(classification_report(y_test, random_y_prediction)) 

print('Dummy Classifier:')
print(classification_report(y_test, dummy_y_prediction))

#the random forest remains the most accurate classifier

Logistic classifier:
              precision    recall  f1-score   support

           0       0.71      0.99      0.83       446
           1       0.74      0.09      0.15       197

    accuracy                           0.71       643
   macro avg       0.72      0.54      0.49       643
weighted avg       0.72      0.71      0.62       643

Decision Tree classifier:
              precision    recall  f1-score   support

           0       0.80      0.82      0.81       446
           1       0.57      0.54      0.55       197

    accuracy                           0.73       643
   macro avg       0.68      0.68      0.68       643
weighted avg       0.73      0.73      0.73       643

Random Forest classifier:
              precision    recall  f1-score   support

           0       0.83      0.91      0.87       446
           1       0.74      0.57      0.64       197

    accuracy                           0.81       643
   macro avg       0.78      0.74      0.76       643
w



In [39]:
#confusion matrix
from sklearn.metrics import confusion_matrix

#predictions are passed first here, so rows are predicted labels and columns are actual labels
#(the transpose of the usual confusion_matrix(y_true, y_pred) layout)
print('Logistic Regression classifier:')
print(confusion_matrix(logistic_y_prediction, y_test))

print('Decision Tree classifier:')
print(confusion_matrix(decision_y_prediction, y_test))

print('Random Forest classifier:')
print(confusion_matrix(random_y_prediction, y_test))

#the random forest remains the best model

Logistic Regression classifier:
[[440 180]
 [  6  17]]
Decision Tree classifier:
[[365  91]
 [ 81 106]]
Random Forest classifier:
[[407  85]
 [ 39 112]]
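
The orientation of these matrices depends on argument order. scikit-learn's confusion_matrix(y_true, y_pred) puts actual labels in rows and predicted labels in columns; passing the predictions first yields the transpose. The sketch below rebuilds label arrays from the random forest counts printed above (the arrays are reconstructed for illustration, they are not the real test rows) and recovers recall and precision for the Ultra class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# (actual, predicted, count) cells taken from the random forest matrix
cells = [(0, 0, 407), (1, 0, 85), (0, 1, 39), (1, 1, 112)]
y_true = np.concatenate([np.full(n, actual) for actual, _, n in cells])
y_pred = np.concatenate([np.full(n, predicted) for _, predicted, n in cells])

standard = confusion_matrix(y_true, y_pred)  # rows: actual, columns: predicted
swapped = confusion_matrix(y_pred, y_true)   # rows: predicted, columns: actual

tn, fp, fn, tp = standard.ravel()
recall_ultra = tp / (tp + fn)      # 112 / 197
precision_ultra = tp / (tp + fp)   # 112 / 151

print(standard)
print(round(recall_ultra, 2), round(precision_ultra, 2))
```

The two rounded values agree with the Ultra row of the random forest classification report (recall 0.57, precision 0.74).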


In [40]:
#logistic regression tuning
LogisticRegression_model = LogisticRegression(
    random_state=12345, tol=1e-10, solver='liblinear', penalty='l2')

#train the model by calling the fit() method
LogisticRegression_model.fit(X_train, y_train)

#predict answers
logistic_y_pred = LogisticRegression_model.predict(X_test)

#classification report
print('Logistic Regression classifier:')
print(classification_report(y_test, logistic_y_pred))

#confusion matrix (predictions passed first: rows are predicted labels, columns are actual labels)
print('Logistic Regression classifier:')
print(confusion_matrix(logistic_y_pred, y_test))

#the accuracy of the logistic model has increased from 0.71 to 0.74, still below the 0.75 threshold

Logistic Regression classifier:
              precision    recall  f1-score   support

           0       0.74      0.98      0.84       446
           1       0.85      0.20      0.33       197

    accuracy                           0.74       643
   macro avg       0.79      0.59      0.59       643
weighted avg       0.77      0.74      0.68       643

Logistic Regression classifier:
[[439 157]
 [  7  40]]
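
One possible reason for the low Ultra recall is that the features live on very different scales (mb_used is in the tens of thousands, messages in the tens). The sketch below, on synthetic data, shows how a StandardScaler could be placed in front of the logistic regression with a pipeline. The data and variable names are assumptions, and whether this helps on the Megaline data is not tested here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic features on very different scales (hypothetical, not the Megaline data)
rng = np.random.default_rng(0)
n = 600
small = rng.normal(50, 30, size=n)        # a messages-like column
large = rng.normal(17000, 7000, size=n)   # an mb_used-like column
X_demo = np.column_stack([small, large])
y_demo = ((small / 30 + large / 7000 + rng.normal(0, 1, size=n)) > 4.2).astype(int)

X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo)

# the scaler is fitted on the training part only, inside the pipeline
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression())
scaled_logreg.fit(X_fit, y_fit)
holdout_accuracy = scaled_logreg.score(X_hold, y_hold)
print(holdout_accuracy)
```

Keeping the scaler inside the pipeline means new cases are transformed with the training statistics automatically when predict() is called.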



In [41]:
#Decision Tree Tuning
parameter = {
"criterion":["gini", "entropy"],
"max_depth":[1,3,5,7,15],
"min_samples_split":[2,4,8, 16],
"min_samples_leaf":[2,4,6]}

from sklearn.model_selection import GridSearchCV
Search = GridSearchCV(DecisionTreeClassifier(), parameter, cv=5).fit(X_train, y_train)
y_pred = Search.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred)) 

#with the best parameters found by the grid search, the decision tree's accuracy increases from 0.73 to 0.80

0.8009331259720062
[[431  15]
 [113  84]]
              precision    recall  f1-score   support

           0       0.79      0.97      0.87       446
           1       0.85      0.43      0.57       197

    accuracy                           0.80       643
   macro avg       0.82      0.70      0.72       643
weighted avg       0.81      0.80      0.78       643



In [42]:
#create a decision tree with the best parameters found by the grid search
dec_model = DecisionTreeClassifier(**Search.best_params_, random_state=12345)

#train the model by calling the fit() method
dec_model.fit(X_train, y_train)

#predict answers
dec_y_pred = dec_model.predict(X_test)

#classification report
print(classification_report(y_test, dec_y_pred))

#predictions are passed first, so rows are predicted labels and columns are actual labels
print(confusion_matrix(dec_y_pred, y_test))

              precision    recall  f1-score   support

           0       0.79      0.97      0.87       446
           1       0.85      0.43      0.57       197

    accuracy                           0.80       643
   macro avg       0.82      0.70      0.72       643
weighted avg       0.81      0.80      0.78       643

[[431 112]
 [ 15  85]]
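
The validation set created earlier (X_valid, y_valid) is not used in the tuning above. One common way to use it is to compare hyperparameter values on the validation set and touch the test set only once, for the final check. A minimal sketch of that workflow for max_depth, on synthetic data from make_classification; the data, the depth grid, and the variable names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data with four features
X_all, y_all = make_classification(n_samples=1000, n_features=4, n_informative=3,
                                   n_redundant=0, random_state=0)
X_rest, X_te, y_rest, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=0, stratify=y_all)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest)

# compare depths on the validation set only
valid_scores = {}
for depth in [1, 3, 5, 7, 15]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    valid_scores[depth] = accuracy_score(y_va, tree.predict(X_va))

best_depth = max(valid_scores, key=valid_scores.get)

# evaluate on the test set once, with the chosen depth
final_tree = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final_tree.fit(X_tr, y_tr)
test_accuracy = accuracy_score(y_te, final_tree.predict(X_te))
print(best_depth, valid_scores[best_depth], test_accuracy)
```

Because the test set plays no part in choosing the depth, the final accuracy is a less optimistic estimate of how the model would do on new subscribers.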


In [51]:
#random forest tuning: randomized search with 5-fold cross-validation
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint

#sp_randint(low, high) excludes high, so max_features is drawn from 1..3 here
param_dist = {"max_depth": [1,3,5,7,15],
              "max_features": sp_randint(1, X_train.shape[1]),
              "min_samples_split": sp_randint(2, 16),
              "bootstrap": [True, False],
              "n_estimators": sp_randint(10, 500)}

random_search = RandomizedSearchCV(RandomForestClassifier(), param_distributions=param_dist,
                                   n_iter=10, cv=5, random_state=42)


random_search.fit(X_train,y_train)   

random_y_pred = random_search.predict(X_test)

print(random_search.best_params_)
print(accuracy_score(y_test, random_y_pred))
print(confusion_matrix(y_test, random_y_pred))
print(classification_report(y_test, random_y_pred)) 


#with the best parameters found by the randomized search, the random forest's accuracy increases from 0.81 to 0.82

{'bootstrap': True, 'max_depth': 7, 'max_features': 1, 'min_samples_split': 12, 'n_estimators': 81}
0.8180404354587869
[[425  21]
 [ 96 101]]
              precision    recall  f1-score   support

           0       0.82      0.95      0.88       446
           1       0.83      0.51      0.63       197

    accuracy                           0.82       643
   macro avg       0.82      0.73      0.76       643
weighted avg       0.82      0.82      0.80       643
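
A detail of the search space is worth checking: scipy.stats.randint(low, high) samples from low up to high - 1, so with four features the max_features distribution above never proposes 4. A short sketch of the difference (n_features = 4 is taken from the four feature columns; the sample size is arbitrary):

```python
from scipy.stats import randint as sp_randint

n_features = 4  # calls, minutes, messages, mb_used

# the upper bound is exclusive: values come from {1, 2, 3}
current = sp_randint(1, n_features).rvs(size=2000, random_state=0)
# adding 1 to the upper bound lets the search also try all four features
inclusive = sp_randint(1, n_features + 1).rvs(size=2000, random_state=0)

print(sorted(set(current.tolist())))
print(sorted(set(inclusive.tolist())))
```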



In [52]:
#create a random forest with the best parameters found by the randomized search
Forest_model = RandomForestClassifier(**random_search.best_params_, random_state=42)

#train the model by calling the fit() method
Forest_model.fit(X_train, y_train)

#predict answers
forest_y_pred = Forest_model.predict(X_test)

#classification report
print(classification_report(y_test, forest_y_pred))

#predictions are passed first, so rows are predicted labels and columns are actual labels
print(confusion_matrix(forest_y_pred, y_test))

              precision    recall  f1-score   support

           0       0.82      0.95      0.88       446
           1       0.83      0.53      0.65       197

    accuracy                           0.82       643
   macro avg       0.82      0.74      0.76       643
weighted avg       0.82      0.82      0.81       643

[[424  92]
 [ 22 105]]


In [56]:
#testing our models on a new subscriber

from sklearn.preprocessing import MinMaxScaler
norm = MinMaxScaler().fit(X_train)
#new case in feature order: calls, minutes, messages, mb_used
megaline_new = [[38, 277.94, 0.0, 17402.54]]

#caution: the models above were fitted on unscaled features, so scaling the new case
#puts it on a different scale from the training data and makes these predictions unreliable
megaline_new = norm.transform(megaline_new)

print('Logistic Regression classifier', LogisticRegression_model.predict(megaline_new))
print('Decision Tree classifier:', dec_model.predict(megaline_new))
print('Random Forest:', Forest_model.predict(megaline_new))

Logistic Regression classifier [0]
Decision Tree classifier: [1]
Random Forest: [1]
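
A model fitted on a DataFrame of raw (unscaled) features expects new cases in the same form. The sketch below shows this with a stand-in model trained on synthetic data that uses the same four column names. The training values, the labelling rule, and the name demo_model are assumptions for illustration. Passing the new case as a DataFrame with matching columns also avoids scikit-learn's feature-name warning.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

columns = ['calls', 'minutes', 'messages', 'mb_used']

# synthetic training data (hypothetical values, not the Megaline file)
rng = np.random.default_rng(0)
train_X = pd.DataFrame({
    'calls': rng.integers(0, 200, size=300).astype(float),
    'minutes': rng.uniform(0, 1500, size=300),
    'messages': rng.integers(0, 200, size=300).astype(float),
    'mb_used': rng.uniform(0, 45000, size=300),
})
train_y = (train_X['mb_used'] > 20000).astype(int)  # made-up labelling rule

demo_model = RandomForestClassifier(n_estimators=20, random_state=0)
demo_model.fit(train_X, train_y)

# the new case keeps the raw units and the training column names
new_case = pd.DataFrame([[38, 277.94, 0.0, 17402.54]], columns=columns)
prediction = demo_model.predict(new_case)
print(prediction)
```

If scaling is wanted, fitting the scaler and the model together in one pipeline keeps training and prediction on the same scale.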



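The tuned Random Forest (test accuracy 0.82) and Decision Tree (0.80) classifiers are the most reliable choices for recommending a customer's plan: both clear the 0.75 accuracy threshold and beat the dummy baseline (0.69), while logistic regression (0.74) falls just short.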