
# **Problem Statement**

Mobile carrier Megaline has found that many of its subscribers still use legacy plans.
The company wants a model that analyzes subscribers' behavior and recommends
one of Megaline's newer plans: Smart or Ultra.

You have access to behavior data about subscribers who have already switched to the
new plans (from the project for the Statistical Data Analysis course). For this
classification task, you need to develop a model that will pick the right plan. Since you’ve
already performed the data preprocessing step, you can move straight to creating the
model.

Develop a model with the highest possible accuracy. In this project, the threshold for
accuracy is 0.75. Check the accuracy using the test dataset.

1. Open and look through the data file.
2. Split the source data into a training set, a validation set, and a test set.
3. Investigate the quality of different models by changing hyperparameters. Briefly
describe the findings of the study.
4. Check the quality of the model using the test set.
5. Additional task: sanity check the model. This data is more complex than what
you’re used to working with, so it's not an easy task. We'll take a closer look at it
later.
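
Steps 2 to 4 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's own pipeline: the data is a synthetic stand-in generated with `make_classification`, the 60/20/20 split ratio is an arbitrary choice, and `max_depth` is the only hyperparameter explored.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in for the Megaline data
# (4 features, roughly 30% of rows in the positive class).
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.7, 0.3], random_state=0,
)

# Step 2: split into 60% training, 20% validation, 20% test.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)

# Step 3: compare tree depths on the validation set only.
valid_scores = {}
for depth in range(1, 11):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    valid_scores[depth] = accuracy_score(y_valid, model.predict(X_valid))
best_depth = max(valid_scores, key=valid_scores.get)

# Step 4: score the chosen depth once on the untouched test set.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=42)
final_model.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, final_model.predict(X_test))
print(f"best max_depth={best_depth}, test accuracy={test_accuracy:.2f}")
```

Choosing the depth on the validation set keeps the test set out of model selection, so the final test accuracy is a fairer estimate of performance on unseen subscribers.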

In [None]:
# Load the packages used throughout the notebook

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

# Read the data

df = pd.read_csv("https://bit.ly/UsersBehaviourTelco")

df.sample(10)  # preview 10 random records


Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
3023,77.0,558.31,0.0,11467.55,0
742,136.0,999.09,84.0,23116.46,1
2879,0.0,0.0,6.0,22428.0,1
3170,72.0,447.4,105.0,27873.88,0
52,129.0,929.23,0.0,22508.96,1
313,15.0,104.41,17.0,4677.85,1
260,117.0,832.78,52.0,2949.68,1
1143,63.0,412.89,23.0,13945.79,0
2118,76.0,430.7,34.0,25138.49,0
563,33.0,164.87,26.0,6290.25,0


In [None]:
# Take a closer look at the data

df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
calls,3214.0,63.038892,33.236368,0.0,40.0,62.0,82.0,244.0
minutes,3214.0,438.208787,234.569872,0.0,274.575,430.6,571.9275,1632.06
messages,3214.0,38.281269,36.148326,0.0,9.0,30.0,57.0,224.0
mb_used,3214.0,17207.673836,7570.968246,0.0,12491.9025,16943.235,21424.7,49745.73
is_ultra,3214.0,0.306472,0.4611,0.0,0.0,0.0,1.0,1.0


-  Every column has a count of 3214, matching the number of rows, so there are no missing values

In [None]:
df['is_ultra'].value_counts()

0    2229
1     985
Name: is_ultra, dtype: int64

In [None]:
# Count and percentage share of each class in the target variable
s = df.is_ultra
counts = s.value_counts()
percent = s.value_counts(normalize=True)
percent100 = percent.mul(100).round(1).astype(str) + '%'
pd.DataFrame({'counts': counts, 'per': percent, 'per100': percent100})

Unnamed: 0,counts,per,per100
0,2229,0.693528,69.4%
1,985,0.306472,30.6%
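
The class proportions above also give a simple sanity-check baseline: a model that always predicts the majority class. The sketch below uses only the counts shown in the table; any classifier worth keeping should beat this figure.

```python
# Class counts taken from the value_counts() output above.
class_counts = {0: 2229, 1: 985}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a constant model that always predicts the majority class.
baseline_accuracy = class_counts[majority_class] / total
print(f"Always predicting {majority_class} gives accuracy {baseline_accuracy:.3f}")
```

Because this constant prediction already scores about 0.69 on the full data, the 0.75 project threshold leaves only a small margin above the trivial baseline.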


# **Modeling**

**Train | Test Split**

In [None]:
X = df.drop(columns=["is_ultra"])
y = df.is_ultra

from sklearn.model_selection import train_test_split

# Hold out 30% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

**Scaling**

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Fit the scaler on the training set only, then reuse its min/max on the test set
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# df2 =pd.DataFrame(X_train_scaled)
# df2.head()
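
To make the scaling step concrete, here is a plain-Python sketch of min-max scaling for a single column. The helper names `fit_min_max` and `transform_min_max` are hypothetical, and the assignment of `mb_used` values to "train" and "test" is arbitrary (taken from the preview rows, not from the notebook's actual split).

```python
def fit_min_max(column):
    """Learn the minimum and maximum from the training column."""
    return min(column), max(column)


def transform_min_max(column, lo, hi):
    """Map values to (v - lo) / (hi - lo); a constant column maps to 0."""
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    return [(v - lo) / span for v in column]


# Assumption: illustrative mb_used values from the data preview.
train_mb = [11467.55, 23116.46, 4677.85, 2949.68]
test_mb = [27873.88, 13945.79]

lo, hi = fit_min_max(train_mb)
train_scaled = transform_min_max(train_mb, lo, hi)
test_scaled = transform_min_max(test_mb, lo, hi)

print(train_scaled)  # training values land in [0, 1]
print(test_scaled)   # test values can fall outside [0, 1]
```

Since the minimum and maximum come from the training data alone, a test value larger than the training maximum scales to more than 1. That is expected and avoids leaking test-set information into the fit.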

**Model Training**

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# class_weight=None (the default) weights both classes equally
log_model = LogisticRegression(class_weight=None).fit(X_train_scaled, y_train)

dt_model = DecisionTreeClassifier(criterion="gini", random_state=42, max_depth=10, min_samples_leaf=10)
dt_model.fit(X_train_scaled, y_train)

rf_model = RandomForestClassifier(n_estimators=100)
rf_model.fit(X_train_scaled, y_train)

RandomForestClassifier()

**Predicting**

In [None]:
#Predicting Using Logistic Regression
y_test_pred_logistic_regression = log_model.predict(X_test_scaled)
y_pred_proba_logistic_regression = log_model.predict_proba(X_test_scaled)

test_data = pd.concat([X_test, y_test], axis=1)
test_data["pred"] = y_test_pred_logistic_regression
test_data["pred_proba"] = y_pred_proba_logistic_regression[:,1]
test_data.sample(10)

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra,pred,pred_proba
1562,14.0,95.94,0.0,2920.15,0,0,0.087541
80,118.0,843.3,69.0,28992.12,1,1,0.607088
1061,54.0,363.09,47.0,22974.04,0,0,0.328418
678,75.0,529.17,0.0,10435.47,0,0,0.198259
2651,81.0,495.5,13.0,17081.56,0,0,0.263958
1807,58.0,398.35,65.0,12097.96,0,0,0.285238
2144,44.0,324.86,0.0,18611.43,0,0,0.197874
605,24.0,135.33,33.0,15479.48,0,0,0.186597
1025,16.0,111.25,10.0,33347.09,1,0,0.258972
1073,80.0,498.91,51.0,20736.28,0,0,0.370404
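
The `pred` column relates to `pred_proba` through a cut-off: for a binary problem, the default prediction is equivalent to labelling a row 1 when its class-1 probability is at least 0.5. The sketch below replays that rule on the ten sampled rows above; the helper names and the alternative 0.25 cut-off are illustrative assumptions, not something the notebook uses.

```python
# (is_ultra, pred_proba) pairs copied from the logistic regression sample above.
rows = [
    (0, 0.087541), (1, 0.607088), (0, 0.328418), (0, 0.198259), (0, 0.263958),
    (0, 0.285238), (0, 0.197874), (0, 0.186597), (1, 0.258972), (0, 0.370404),
]
y_true = [label for label, _ in rows]
probabilities = [proba for _, proba in rows]


def predict_with_threshold(probas, threshold=0.5):
    """Label a row 1 (Ultra) when its class-1 probability reaches the threshold."""
    return [int(p >= threshold) for p in probas]


def count_hits(threshold):
    """Return (true positives, false positives) at the given threshold."""
    y_pred = predict_with_threshold(probabilities, threshold)
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return true_pos, false_pos


for threshold in (0.5, 0.25):
    tp, fp = count_hits(threshold)
    print(f"threshold={threshold}: true positives={tp}, false positives={fp}")
```

On these ten rows, lowering the cut-off recovers the Ultra subscriber that was missed at 0.5 but also flags several Smart subscribers, which is the usual trade-off between recall and precision.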


In [None]:
#Predicting using the decision tree algorithm
y_test_pred_decision_trees = dt_model.predict(X_test_scaled)
y_pred_proba_decision_trees = dt_model.predict_proba(X_test_scaled)

test_data = pd.concat([X_test, y_test], axis=1)
test_data["pred"] = y_test_pred_decision_trees
test_data["pred_proba"] = y_pred_proba_decision_trees[:,1]
test_data.sample(10)


Unnamed: 0,calls,minutes,messages,mb_used,is_ultra,pred,pred_proba
2545,96.0,729.47,27.0,20890.92,1,0,0.125436
2270,57.0,558.06,40.0,12270.33,0,0,0.125436
679,58.0,434.84,20.0,18910.83,0,0,0.125436
7,15.0,132.4,6.0,21911.6,0,0,0.164319
2816,61.0,450.84,61.0,13996.76,0,0,0.125436
761,32.0,266.95,0.0,23336.54,0,0,0.164319
1829,84.0,580.43,54.0,8668.24,0,0,0.222222
361,72.0,497.93,46.0,12651.41,0,0,0.125436
2861,9.0,29.31,25.0,28155.4,1,1,0.6
1988,69.0,522.16,9.0,15834.77,0,0,0.015873


In [None]:
#make predictions with Random Forest
y_test_pred_random_forest = rf_model.predict(X_test_scaled)
y_pred_proba_random_forest = rf_model.predict_proba(X_test_scaled)

test_data = pd.concat([X_test, y_test], axis=1)
test_data["pred"] = y_test_pred_random_forest
test_data["pred_proba"] = y_pred_proba_random_forest[:,1]
test_data.sample(10)

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra,pred,pred_proba
2628,89.0,635.86,0.0,8417.03,1,0,0.36
351,0.0,0.0,8.0,35525.61,1,1,0.9
2632,33.0,204.03,86.0,15724.48,0,0,0.07
679,58.0,434.84,20.0,18910.83,0,0,0.11
1048,57.0,418.81,44.0,17335.04,0,0,0.04
1642,87.0,583.02,1.0,11213.97,0,0,0.1
149,74.0,455.73,99.0,21694.92,0,0,0.05
3109,121.0,797.79,0.0,25789.68,1,1,0.67
952,87.0,518.1,17.0,13957.77,0,0,0.11
436,91.0,598.64,33.0,28524.79,1,1,0.93


In [None]:
from sklearn.metrics import confusion_matrix, classification_report

print("For Logistic Regression")
print(confusion_matrix(y_test,y_test_pred_logistic_regression))
print(classification_report(y_test,y_test_pred_logistic_regression))

print("For Decision Trees")
print(confusion_matrix(y_test,y_test_pred_decision_trees))
print(classification_report(y_test,y_test_pred_decision_trees))

print("For Random Forest")
print(confusion_matrix(y_test,y_test_pred_random_forest))
print(classification_report(y_test,y_test_pred_random_forest))

For Logistic Regression
[[647  13]
 [235  70]]
              precision    recall  f1-score   support

           0       0.73      0.98      0.84       660
           1       0.84      0.23      0.36       305

    accuracy                           0.74       965
   macro avg       0.79      0.60      0.60       965
weighted avg       0.77      0.74      0.69       965

For Decision Trees
[[618  42]
 [164 141]]
              precision    recall  f1-score   support

           0       0.79      0.94      0.86       660
           1       0.77      0.46      0.58       305

    accuracy                           0.79       965
   macro avg       0.78      0.70      0.72       965
weighted avg       0.78      0.79      0.77       965

For Random Forest
[[606  54]
 [141 164]]
              precision    recall  f1-score   support

           0       0.81      0.92      0.86       660
           1       0.75      0.54      0.63       305

    accuracy                           0.80       9

Random Forest performed best, with an accuracy of 80%.
The decision tree also met the 0.75 threshold, with an accuracy of 79%.
Logistic Regression reached 74%, just short of the threshold.
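
The figures in the random forest report can be rechecked from its confusion matrix alone. The sketch below assumes the scikit-learn layout, where rows are true classes and columns are predicted classes, and uses the matrix printed above.

```python
# Random forest confusion matrix from the output above:
# rows = true class (0, 1), columns = predicted class (0, 1).
cm = [[606, 54],
      [141, 164]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # share of predicted Ultra rows that are Ultra
recall = tp / (tp + fn)     # share of actual Ultra rows that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

The recall for class 1 shows that, even for the best model, close to half of the Ultra subscribers in the test set are still predicted as Smart.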