# Sprint 7 Chapter 6/7 Course Project

# **Project description**

  (from the project for the Statistical Data Analysis course). For this classification task, you need to develop a model that will pick the right plan. Since you’ve already performed the data preprocessing step, you can move straight to creating the model.

Develop a model with the highest possible _accuracy_. In this project, the threshold for _accuracy_ is 0.75. Check the _accuracy_ using the test dataset.

<div class="alert alert-block alert-success">✔️
    

__Reviewer's comment №1__
An excellent practice is to describe the goal and main steps in your own words (a skill that will help a lot on a final project). It would be good to add the progress and purpose of the study.

## Import libraries

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# added after revision V.1
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
)
from sklearn.model_selection import cross_validate

<div class="alert alert-block alert-success">✔️
    

__Reviewer's comment №1__
    
Great, the libraries are loaded    

## Import dataset

In [2]:
df = pd.read_csv('/datasets/users_behavior.csv')

## Exploratory data analysis

In [3]:
display(df[:5])

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


#### View summary of dataset

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [5]:
# check duplicated rows
print(df.duplicated().sum())

0


No missing values were found, and no rows are duplicated.
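The checks behind this statement can be sketched as follows. This is a minimal illustration that rebuilds only the five rows displayed above (the full file is not loaded here), and the names `sample`, `missing_per_column`, `n_duplicates` and `class_share` are illustrative.

```python
import pandas as pd

# The five rows shown by display(df[:5]); the real dataset has 3214 rows.
sample = pd.DataFrame({
    "calls": [40.0, 85.0, 77.0, 106.0, 66.0],
    "minutes": [311.9, 516.75, 467.66, 745.53, 418.74],
    "messages": [83.0, 56.0, 86.0, 81.0, 1.0],
    "mb_used": [19915.42, 22696.96, 21060.45, 8437.39, 14502.75],
    "is_ultra": [0, 0, 0, 1, 0],
})

# Count missing values in each column.
missing_per_column = sample.isna().sum()

# Count fully duplicated rows.
n_duplicates = int(sample.duplicated().sum())

# Share of each class in the target column.
class_share = sample["is_ultra"].value_counts(normalize=True)

print(missing_per_column)
print("duplicated rows:", n_duplicates)
print(class_share)
```

On the full dataset the same three calls would run on `df` directly. The class share is worth a look because accuracy is easier to interpret when the class balance is known.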

The dataset is split into two parts: features and target.

In [6]:
features = df.drop(['is_ultra'], axis=1)
target = df['is_ultra'] 

## Split data into training, validation, and test sets

Declare six variables to hold the splits:
    features: features_train, features_valid and features_test;
    target: target_train, target_valid and target_test.

In [7]:
ratio_train = 0.6  # not used below; the training set gets what remains after test and validation
ratio_val = 0.2
ratio_test = 0.2

# Produces test split.
remaining_train, features_test, remaining_target, target_test = train_test_split(
    features, target, test_size=ratio_test,random_state=12345)

# Adjusts val ratio, w.r.t. remaining dataset.
ratio_remaining = 1 - ratio_test
ratio_val_adjusted = ratio_val / ratio_remaining

# Produces train and val splits.
features_train, features_valid, target_train, target_valid = train_test_split(
    remaining_train, remaining_target, test_size=ratio_val_adjusted,random_state=12345)

Print the number of rows and columns in each subset (train, validation and test)

In [8]:
print(features_train.shape)
print(target_train.shape)
print(features_valid.shape)
print(target_valid.shape)
print(features_test.shape)
print(target_test.shape)

(1928, 4)
(1928,)
(643, 4)
(643,)
(643, 4)
(643,)
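As a quick check, the subset sizes above can be reproduced without the data file. The sketch below uses a placeholder array with the same number of rows (3214) and columns (4) as the dataset; the array contents and the names `X`, `y` and `sizes` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 3214  # number of rows reported by df.info()

# Placeholder features and target; only the shapes matter here.
X = np.arange(n_rows * 4).reshape(n_rows, 4)
y = np.arange(n_rows) % 2

# First split: hold out 20% of all rows for testing.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345)

# Second split: 0.2 / 0.8 = 0.25 of the remainder becomes the validation set.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=12345)

sizes = {"train": len(X_train), "valid": len(X_valid), "test": len(X_test)}
for name, n in sizes.items():
    print(name, n, round(n / n_rows, 3))
```

The resulting shares are roughly 60% / 20% / 20%, which is the 3:1:1 proportion.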


<div class="alert alert-block alert-success">✔️
    

__Reviewer's comment №1__

1. Good: random_state is fixed. This makes the split into training / validation / test samples reproducible, so the subsamples will be identical in all subsequent runs of the code.
    
2. Fraction of train/valid/test sizes 3:1:1 is good.


</div>

Decision tree classifier: tuning the max_depth hyperparameter on the validation dataset

In [9]:
for depth in range(1, 6):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train)
    predictions_valid = model.predict(features_valid)
    print('max_depth =', depth, ': ', end='')
    print(accuracy_score(target_valid, predictions_valid))

max_depth = 1 : 0.7387247278382582
max_depth = 2 : 0.7573872472783826
max_depth = 3 : 0.7651632970451011
max_depth = 4 : 0.7636080870917574
max_depth = 5 : 0.7589424572317263
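The best depth can also be picked programmatically instead of by reading the printout. The following is a minimal sketch on synthetic data from `make_classification`, not the project dataset, so the selected depth and score will differ from the values above; `scores` and `best_depth` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and target (assumption).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Validation accuracy for each candidate depth.
scores = {}
for depth in range(1, 6):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(X_train, y_train)
    scores[depth] = accuracy_score(y_valid, model.predict(X_valid))

# Depth with the highest validation accuracy (smallest depth wins ties).
best_depth = max(scores, key=scores.get)
print("best max_depth:", best_depth, "accuracy:", scores[best_depth])
```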


### Decision tree classifier 

Checking the same max_depth values on the test dataset (for reference only: the test set should not drive hyperparameter selection)

In [10]:
for depth in range(1, 6):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train)
    predictions_test = model.predict(features_test)
    print('TEST - max_depth =', depth, ': ', end='')
    print(accuracy_score(target_test, predictions_test))

TEST - max_depth = 1 : 0.7480559875583204
TEST - max_depth = 2 : 0.7838258164852255
TEST - max_depth = 3 : 0.7869362363919129
TEST - max_depth = 4 : 0.7869362363919129
TEST - max_depth = 5 : 0.7884914463452566


max_depth = 3 gave the best accuracy on the validation set, so it is used for the final model.

In [11]:
model1 = DecisionTreeClassifier(random_state=12345, max_depth=3)
model1.fit(features_train, target_train)
predictions_test = model1.predict(features_test)
mse = mean_squared_error(target_test, predictions_test)
# RMSE is the square root of MSE
rmse = mse ** 0.5
print('RMSE:', rmse)

RMSE: 0.46158830531988904


#### Evaluate the Model
Compare the predicted labels with the true labels from the validation set or test set. Calculate the chosen performance metric using appropriate functions like accuracy_score(), precision_score(), recall_score(), f1_score(), or roc_auc_score().

In [12]:
accuracy = accuracy_score(target_test, predictions_test)
precision = precision_score(target_test, predictions_test)
recall = recall_score(target_test, predictions_test)
f1 = f1_score(target_test, predictions_test)
roc_auc = roc_auc_score(target_test, predictions_test)
 
print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1-Score: {f1}')
print(f'ROC AUC: {roc_auc}')

Accuracy: 0.7869362363919129
Precision: 0.7521367521367521
Recall: 0.4489795918367347
F1-Score: 0.5623003194888179
ROC AUC: 0.6920513171711639
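One caveat on the ROC AUC above: it is computed from hard 0/1 predictions, which correspond to a single decision threshold. `roc_auc_score` also accepts scores, so passing the positive-class probabilities from `predict_proba` evaluates the ranking across all thresholds. The sketch below shows both calls on synthetic data (an assumption; it is not the project dataset, and no claim is made about which value would be larger there).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (assumption).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345)

model = DecisionTreeClassifier(random_state=12345, max_depth=3)
model.fit(X_train, y_train)

# ROC AUC from hard labels: one threshold only.
auc_from_labels = roc_auc_score(y_test, model.predict(X_test))

# ROC AUC from the probability of class 1: uses the full ranking.
auc_from_proba = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print("ROC AUC (labels):       ", auc_from_labels)
print("ROC AUC (probabilities):", auc_from_proba)
```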


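The RMSE is 0.46. For 0/1 labels, RMSE is the square root of the error rate (1 - accuracy), so it repeats the information in the accuracy score and is not a separate measure of fit.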
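For binary 0/1 labels, each squared error is either 0 or 1, so MSE equals the share of wrong predictions and RMSE equals the square root of (1 - accuracy). The sketch below checks this on a small made-up label vector and then on the accuracy reported above; `y_true` and `y_pred` are illustrative values, not project data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Made-up labels: 3 of the 10 predictions are wrong.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 0, 1, 1, 0, 0, 0])

accuracy = accuracy_score(y_true, y_pred)
rmse = mean_squared_error(y_true, y_pred) ** 0.5

# Both numbers are the same quantity.
print(rmse, (1 - accuracy) ** 0.5)

# The same relation applied to the decision tree's reported test accuracy.
reported_accuracy = 0.7869362363919129
rmse_from_accuracy = (1 - reported_accuracy) ** 0.5
print(rmse_from_accuracy)
```

The last value matches the RMSE printed for the decision tree, which is why RMSE does not add information beyond accuracy in a classification task.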

In [13]:
# Perform 5-fold cross-validation
cv_results = cross_validate(model1, features_train, target_train, cv=5,return_train_score=True)

# Print the results
print(cv_results)

{'fit_time': array([0.00578451, 0.00490141, 0.00518131, 0.00565577, 0.00487518]), 'score_time': array([0.0019803 , 0.03015614, 0.00223184, 0.00195837, 0.00189424]), 'test_score': array([0.79533679, 0.80829016, 0.80569948, 0.79220779, 0.82597403]), 'train_score': array([0.81582361, 0.80674449, 0.81322957, 0.81659106, 0.80816591])}


### Random Forest Classifier

In [14]:
best_score = 0
best_est = 0
for est in range(1, 11):
    model = RandomForestClassifier(random_state=54321, n_estimators=est)
    model.fit(features_train, target_train)
    score = model.score(features_valid, target_valid)
    if score > best_score:
        best_score = score
        best_est = est

print("Accuracy of the best model on the validation set (n_estimators = {}): {}".format(best_est, best_score))



Accuracy of the best model on the validation set (n_estimators = 10): 0.7853810264385692
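The loop above varies only n_estimators. A possible extension, sketched here on synthetic data, tunes max_depth together with it on the validation set. The depth grid `(2, 4, 6, 8)` is a hypothetical choice and the data is a stand-in, so the result says nothing about the best parameters for the project dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=54321)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=54321)

depth_grid = (2, 4, 6, 8)  # hypothetical grid, not from the project

best_score = 0.0
best_params = None
for est in range(1, 11):
    for depth in depth_grid:
        model = RandomForestClassifier(
            random_state=54321, n_estimators=est, max_depth=depth)
        model.fit(X_train, y_train)
        score = model.score(X_valid, y_valid)  # validation accuracy
        if score > best_score:
            best_score = score
            best_params = (est, depth)

print("best (n_estimators, max_depth):", best_params, "accuracy:", best_score)
```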


In [15]:
RFC_model = RandomForestClassifier(random_state=54321, n_estimators=best_est)
RFC_model.fit(features_train, target_train)

RandomForestClassifier(n_estimators=10, random_state=54321)

In [16]:
predictions_test_RFC = RFC_model.predict(features_test)
mse = mean_squared_error(target_test, predictions_test_RFC)
rmse = mse ** 0.5
print('RMSE:', rmse)

RMSE: 0.4732338040350594


In [17]:
accuracy = accuracy_score(target_test, predictions_test_RFC)
precision = precision_score(target_test, predictions_test_RFC)
recall = recall_score(target_test, predictions_test_RFC)
f1 = f1_score(target_test, predictions_test_RFC)
roc_auc = roc_auc_score(target_test, predictions_test_RFC)
 
print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1-Score: {f1}')
print(f'ROC AUC: {roc_auc}')

Accuracy: 0.776049766718507
Precision: 0.6733333333333333
Recall: 0.5153061224489796
F1-Score: 0.583815028901734
ROC AUC: 0.7028432178240424


The RMSE is 0.47. As with any 0/1 labels, this is just the square root of the error rate (1 - accuracy), so it adds nothing beyond the accuracy score.

### Logistic Regression


In [18]:
modelLR = LogisticRegression(
    random_state=54321, solver="liblinear"
)
modelLR.fit(features_train, target_train)
score_train = modelLR.score(features_train, target_train)
score_valid = modelLR.score(features_valid, target_valid)
score_test = modelLR.score(features_test, target_test)
print("Accuracy of the logistic regression model on the training set:", score_train)
print("Accuracy of the logistic regression model on the validation set:", score_valid)
print("Accuracy of the logistic regression model on the test set:", score_test)

Accuracy of the logistic regression model on the training set: 0.7422199170124482
Accuracy of the logistic regression model on the validation set: 0.7293934681181959
Accuracy of the logistic regression model on the test set: 0.7511664074650077
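The features here sit on very different scales (mb_used is in the tens of thousands, messages in the tens), and regularized logistic regression can be sensitive to that. One option to try is standardizing the features inside a pipeline. The sketch below is an assumption-laden illustration on synthetic data with one column deliberately inflated; whether scaling improves the scores on the project dataset has not been verified.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=54321)
X[:, 3] *= 10_000  # mimic one feature on a much larger scale

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=54321)

# Same estimator as above, without and with standardization.
plain_lr = LogisticRegression(random_state=54321, solver="liblinear")
plain_lr.fit(X_train, y_train)

scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=54321, solver="liblinear"),
)
scaled_lr.fit(X_train, y_train)

plain_score = plain_lr.score(X_valid, y_valid)
scaled_score = scaled_lr.score(X_valid, y_valid)
print("validation accuracy, unscaled:", plain_score)
print("validation accuracy, scaled:  ", scaled_score)
```

Putting the scaler inside the pipeline means it is fitted on the training data only, so the validation rows do not leak into the scaling statistics.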


In [19]:
predictions_test_LR = modelLR.predict(features_test)
mse = mean_squared_error(target_test, predictions_test_LR)
rmse = mse ** 0.5
print('RMSE:', rmse)

RMSE: 0.49883222884552303


In [20]:
accuracy = accuracy_score(target_test, predictions_test_LR)
precision = precision_score(target_test, predictions_test_LR)
recall = recall_score(target_test, predictions_test_LR)
f1 = f1_score(target_test, predictions_test_LR)
roc_auc = roc_auc_score(target_test, predictions_test_LR)
 
print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1-Score: {f1}')
print(f'ROC AUC: {roc_auc}')

Accuracy: 0.7511664074650077
Precision: 0.9285714285714286
Recall: 0.1989795918367347
F1-Score: 0.3277310924369748
ROC AUC: 0.5961340912203807


The table below summarizes the results of each model on the test subset.


**Model result comparison table summary**


<table>
  <tr>
   <td><strong>Model</strong></td>
   <td><strong>RMSE</strong></td>
   <td><strong>Accuracy Score</strong></td>
   <td><strong>Precision</strong></td>
   <td><strong>Recall</strong></td>
   <td><strong>F1-Score</strong></td>
   <td><strong>ROC AUC</strong></td>
  </tr>
  <tr>
   <td>Decision tree classifier</td>
   <td>0.462</td>
   <td>0.7869</td>
   <td>0.7521</td>
   <td>0.4490</td>
   <td>0.5623</td>
   <td>0.6921</td>
  </tr>
  <tr>
   <td>Random Forest Classifier</td>
   <td>0.473</td>
   <td>0.7760</td>
   <td>0.6733</td>
   <td>0.5153</td>
   <td>0.5838</td>
   <td>0.7028</td>
  </tr>
  <tr>
   <td>Logistic Regression</td>
   <td>0.499</td>
   <td>0.7512</td>
   <td>0.9286</td>
   <td>0.1990</td>
   <td>0.3277</td>
   <td>0.5961</td>
  </tr>
</table>



**Accuracy score**: the Decision tree classifier has the highest value, followed by the Random Forest classifier.

**Precision**: Logistic Regression has the highest value, followed by the Decision tree classifier.

**Recall**: the Random Forest classifier has the highest value, followed by the Decision tree classifier.

**F1-score**: the Random Forest classifier has the highest value, followed by the Decision tree classifier.

**ROC AUC**: the Random Forest classifier has the highest value, followed by the Decision tree classifier.

Based on the metric table above, the Random Forest Classifier gives the best overall results (highest recall, F1-score, and ROC AUC), followed by the Decision tree classifier, which has the highest accuracy.
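A useful sanity check when comparing accuracies is a constant baseline that always predicts the most frequent class. The sketch below uses scikit-learn's `DummyClassifier` on synthetic, imbalanced data (an assumption; the 70/30 class weights are illustrative and not taken from the project dataset).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real data (assumption).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, weights=[0.7, 0.3],
                           random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345)

# Baseline that ignores the features and predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_accuracy = baseline.score(X_test, y_test)

# The same number computed by hand for comparison.
majority_class = np.bincount(y_train).argmax()
manual_accuracy = float((y_test == majority_class).mean())

print("baseline accuracy:", baseline_accuracy)
```

A trained model is only informative to the extent that it beats this baseline, which also helps explain how a model can pass an accuracy threshold while having low recall.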


In [21]:
# Perform 5-fold cross-validation
results_DT = cross_validate(model1, features_train, target_train, cv=5,return_train_score=True)
results_RF = cross_validate(RFC_model, features_train, target_train, cv=5,return_train_score=True)
results_LG = cross_validate(modelLR, features_train, target_train, cv=5,return_train_score=True)

# Print the results
print(' Test Score results')
print('\nDecision tree classifier ',results_DT['test_score'],'\nRandom Forest Classifier ',results_RF['test_score'],'\nLogistic Regression      ',results_LG['test_score'])
print('\n Train Score results')
print('\nDecision tree classifier ',results_DT['train_score'],'\nRandom Forest Classifier ',results_RF['train_score'],'\nLogistic Regression      ',results_LG['train_score'])

 Test Score results

Decision tree classifier  [0.79533679 0.80829016 0.80569948 0.79220779 0.82597403] 
Random Forest Classifier  [0.79015544 0.81088083 0.79274611 0.7974026  0.8025974 ] 
Logistic Regression       [0.74611399 0.70725389 0.69948187 0.72987013 0.7038961 ]

 Train Score results

Decision tree classifier  [0.81582361 0.80674449 0.81322957 0.81659106 0.80816591] 
Random Forest Classifier  [0.98184176 0.98638132 0.98508431 0.99027868 0.9837978 ] 
Logistic Regression       [0.7464332  0.70881971 0.70363165 0.74789371 0.70187946]


The output above shows the cross-validation results for the three models.

On average, the Decision tree classifier has the best test score (about 0.806), followed by the Random Forest Classifier (about 0.799).
The Random Forest has the highest train score (about 0.985); the large gap to its test score suggests it overfits the training folds.
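The averages behind this comparison can be reproduced from the fold scores printed above. The sketch below copies those arrays by hand and reports the mean test score, the mean train score and the gap between them for each model; the dictionary names are illustrative.

```python
import numpy as np

# Fold scores copied from the cross-validation output above.
test_scores = {
    "Decision tree classifier": [0.79533679, 0.80829016, 0.80569948, 0.79220779, 0.82597403],
    "Random Forest Classifier": [0.79015544, 0.81088083, 0.79274611, 0.7974026, 0.8025974],
    "Logistic Regression": [0.74611399, 0.70725389, 0.69948187, 0.72987013, 0.7038961],
}
train_scores = {
    "Decision tree classifier": [0.81582361, 0.80674449, 0.81322957, 0.81659106, 0.80816591],
    "Random Forest Classifier": [0.98184176, 0.98638132, 0.98508431, 0.99027868, 0.9837978],
    "Logistic Regression": [0.7464332, 0.70881971, 0.70363165, 0.74789371, 0.70187946],
}

# Mean test score, mean train score, and train-test gap per model.
summary = {}
for name in test_scores:
    test_mean = float(np.mean(test_scores[name]))
    train_mean = float(np.mean(train_scores[name]))
    summary[name] = (test_mean, train_mean, train_mean - test_mean)
    print(f"{name:<25} test={test_mean:.4f} train={train_mean:.4f} "
          f"gap={train_mean - test_mean:.4f}")

best_on_test = max(summary, key=lambda name: summary[name][0])
print("best mean test score:", best_on_test)
```

A small gap between train and test means, as for the decision tree, points to a model that generalizes about as well as it fits; a large gap, as for the random forest, is a typical sign of overfitting.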

<div class="alert alert-block alert-success">✔️
    

__Reviewer's comment №4__


Otherwise it's great😊. Your project is begging for github =)   
    
Congratulations on the successful completion of the project 😊👍
And I wish you success in new works 😊