# Overview of Machine Learning Project

**The Goal:**

Develop a model that analyzes subscribers' behavior and recommends one of Megaline's newer plans: Smart or Ultra.

We've worked with this dataset before and have already performed the preprocessing needed to begin modeling. We want the model to achieve the highest possible accuracy, with a minimum acceptable accuracy of 0.75.

1. Split the source data into a training set, a validation set, and a test set.
2. Check the quality of different models by changing hyperparameters.
3. Check the quality of the model using the test set.
4. Perform a sanity check on the model.

**Breakdown of the dataset**

Every observation in the dataset contains monthly behavior information about one user.


1. calls: number of calls
2. minutes: total call duration in minutes
3. messages: number of text messages
4. mb_used: Internet traffic used in MB
5. is_ultra: plan for the current month (Ultra = 1, Smart = 0)


# Instructions

**1. Initialization:** Set up the environment by importing the necessary libraries.

**2. Reading the dataset:** The data has already been cleaned and is ready for modeling.

**3. Train / Split data:** Identify the features and target, and split the dataset.

**4. Model Preparation:** Set up the models and add error-count and accuracy helpers.

**5. Model Evaluation:** Tune each model's hyperparameters to achieve the highest accuracy score.

# Initialization

In [14]:
#Adding all the imports

import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression 

# Reading Dataset

In [15]:
#Reading the data set 
df = pd.read_csv('users_behavior.csv')

#Printing info
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB
None


# Train / Split data

In [16]:
#Setting my features and target
features = df.drop(['is_ultra'], axis = 1)
target = df['is_ultra']

In [17]:
#Splitting my data set into a training set (60%), a validation set (20%), and a test set (20%)

#First split: training set and a temporary set (which will be split further)
df_train, df_temp = train_test_split(df, test_size=0.40, random_state=12345)

#Second split: temporary set into validation set and test set
df_valid, df_test = train_test_split(df_temp, test_size=0.50, random_state=12345)

In [18]:
#Declaring new variables
features_train = df_train.drop(['is_ultra'], axis = 1)
target_train = df_train['is_ultra']
features_valid = df_valid.drop(['is_ultra'], axis = 1)
target_valid = df_valid['is_ultra']
features_test = df_test.drop(['is_ultra'], axis = 1)
target_test = df_test['is_ultra']

#Print shape of each feature and target
print(features_train.shape)
print(target_train.shape)
print(features_valid.shape)
print(target_valid.shape)
print(features_test.shape)
print(target_test.shape)



(1928, 4)
(1928,)
(643, 4)
(643,)
(643, 4)
(643,)
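As a side note, the same 60/20/20 split can be applied to the features and target directly, which avoids dropping `is_ultra` from each subset afterwards. The sketch below is only an illustration: it runs on a made-up DataFrame (`demo_df`) with the same column names rather than on `users_behavior.csv`, and the `stratify` argument is an optional addition, not something the notebook uses.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for users_behavior.csv (all values are made up)
rng = np.random.default_rng(12345)
n_rows = 1000
demo_df = pd.DataFrame({
    "calls": rng.integers(0, 200, n_rows).astype(float),
    "minutes": rng.uniform(0, 1500, n_rows),
    "messages": rng.integers(0, 200, n_rows).astype(float),
    "mb_used": rng.uniform(0, 40000, n_rows),
    "is_ultra": rng.integers(0, 2, n_rows),
})

demo_features = demo_df.drop(columns=["is_ultra"])
demo_target = demo_df["is_ultra"]

# First split: 60% training, 40% temporary.
# stratify keeps the share of each plan similar across the subsets.
X_train, X_temp, y_train, y_temp = train_test_split(
    demo_features, demo_target,
    test_size=0.40, random_state=12345, stratify=demo_target,
)

# Second split: the temporary part is halved into validation and test sets
X_valid, X_test, y_valid, y_test = train_test_split(
    X_temp, y_temp,
    test_size=0.50, random_state=12345, stratify=y_temp,
)

print(X_train.shape, X_valid.shape, X_test.shape)
```

Passing features and target together guarantees that rows stay aligned between `X_*` and `y_*`.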


# Model Preparation 

In [19]:
#Decision Tree Classifier model
decision_tree_model = DecisionTreeClassifier(random_state=12345, max_depth=10)
decision_tree_model.fit(features_train, target_train)
decision_tree_predictions = decision_tree_model.predict(features_test)

In [20]:
#Random Forest Classifier model
random_forest_model = RandomForestClassifier(random_state=12345, n_estimators=50, max_depth=10)
random_forest_model.fit(features_train, target_train)
random_forest_predictions = random_forest_model.predict(features_test)

In [21]:
#Logistic Regression model
logistic_regression_model = LogisticRegression(random_state=12345, max_iter=500)
logistic_regression_model.fit(features_train, target_train)
logistic_regression_predictions = logistic_regression_model.predict(features_test)

In [22]:
#Helper functions: number of wrong predictions and share of correct predictions
def error_count(answers, predictions):
    answers = np.array(answers)
    predictions = np.array(predictions)
    count = np.sum(answers != predictions)
    return count

def accuracy(answers, predictions):
    answers = np.array(answers)
    predictions = np.array(predictions)
    correct = np.sum(answers == predictions)
    return correct / len(answers)
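For comparison, scikit-learn ships `accuracy_score`, which computes the same quantities as the two helpers above. The short sketch below uses made-up labels (`demo_answers`, `demo_predictions`) purely for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels, for illustration only
demo_answers = np.array([1, 0, 0, 1, 0, 1, 0, 0])
demo_predictions = np.array([1, 0, 1, 1, 0, 0, 0, 0])

# Share of predictions that match the answers
demo_accuracy = accuracy_score(demo_answers, demo_predictions)

# With normalize=False, accuracy_score returns the number of correct
# predictions, so subtracting it from the total gives the error count
demo_correct = accuracy_score(demo_answers, demo_predictions, normalize=False)
demo_errors = len(demo_answers) - demo_correct

print(demo_accuracy, demo_errors)
```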

In [23]:
#Evaluate Decision Tree
print(f"Decision Tree Accuracy on the test set: {accuracy(target_test, decision_tree_predictions).round(3)}")
print(f"Decision Tree Error Count on the test set: {error_count(target_test, decision_tree_predictions)}")

#Evaluate Random Forest
print(f"Random Forest Accuracy on the test set: {accuracy(target_test, random_forest_predictions).round(3)}")
print(f"Random Forest Error Count on the test set: {error_count(target_test, random_forest_predictions)}")

#Evaluate Logistic Regression
print(f"Logistic Regression Accuracy on the test set: {accuracy(target_test, logistic_regression_predictions).round(3)}")
print(f"Logistic Regression Error Count on the test set: {error_count(target_test, logistic_regression_predictions)}")

Decision Tree Accuracy on the test set: 0.788
Decision Tree Error Count on the test set: 136
Random Forest Accuracy on the test set: 0.801
Random Forest Error Count on the test set: 128
Logistic Regression Accuracy on the test set: 0.739
Logistic Regression Error Count on the test set: 168
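Step 2 of the plan (checking model quality while changing hyperparameters) can also be carried out on the validation set, which keeps the test set for one final check. The following is a minimal sketch under assumptions: the data comes from `make_classification` instead of the real dataset, and the depth range 1 to 10 is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data with four features (not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0,
    random_state=12345,
)

# Same 60/20/20 scheme as in the notebook
X_train, X_temp, y_train, y_temp = train_test_split(
    X_demo, y_demo, test_size=0.40, random_state=12345
)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=12345
)

# Try several tree depths and keep the one with the best validation accuracy
best_depth, best_valid_accuracy = None, 0.0
for depth in range(1, 11):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(X_train, y_train)
    valid_accuracy = model.score(X_valid, y_valid)  # mean accuracy
    if valid_accuracy > best_valid_accuracy:
        best_depth, best_valid_accuracy = depth, valid_accuracy

# The test set is touched only once, with the selected depth
final_model = DecisionTreeClassifier(random_state=12345, max_depth=best_depth)
final_model.fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)

print(best_depth, round(best_valid_accuracy, 3), round(test_accuracy, 3))
```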


# Model Evaluation

In [26]:
#Define parameter grids for each model
param_grid_dt = {
    'max_depth': [None, 10, 20, 30]
}

param_grid_rf = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20, 30]
}

param_grid_lr = {
    'C': [0.01, 0.1, 1, 10, 100],
}


dt_model = DecisionTreeClassifier(random_state=12345)
rf_model = RandomForestClassifier(random_state=12345)
lr_model = LogisticRegression(random_state=12345, max_iter=500)

#Grid Search for each model
grid_search_dt = GridSearchCV(estimator=dt_model, param_grid=param_grid_dt, cv=5, scoring='accuracy', n_jobs=-1)
grid_search_rf = GridSearchCV(estimator=rf_model, param_grid=param_grid_rf, cv=5, scoring='accuracy', n_jobs=-1)
grid_search_lr = GridSearchCV(estimator=lr_model, param_grid=param_grid_lr, cv=5, scoring='accuracy', n_jobs=-1)

#Fit each grid search on the training set
grid_search_dt.fit(features_train, target_train)
grid_search_rf.fit(features_train, target_train)
grid_search_lr.fit(features_train, target_train)

#Best parameters and best score
print("Best parameters for Decision Tree:", grid_search_dt.best_params_)
print("Best accuracy score for Decision Tree:", grid_search_dt.best_score_)

print("Best parameters for Random Forest:", grid_search_rf.best_params_)
print("Best accuracy score for Random Forest:", grid_search_rf.best_score_)

print("Best parameters for Logistic Regression:", grid_search_lr.best_params_)
print("Best accuracy score for Logistic Regression:", grid_search_lr.best_score_)

Best parameters for Decision Tree: {'max_depth': 10}
Best accuracy score for Decision Tree: 0.794606015745912
Best parameters for Random Forest: {'max_depth': 10, 'n_estimators': 100}
Best accuracy score for Random Forest: 0.8205477424130274
Best parameters for Logistic Regression: {'C': 0.01}
Best accuracy score for Logistic Regression: 0.7484529977794226
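With the default `refit=True`, `GridSearchCV` retrains the best parameter combination on the full training data and exposes it as `best_estimator_`, so the winning model does not have to be rebuilt by hand. The sketch below illustrates this on synthetic data; the small grid and `cv=3` are assumptions chosen only to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the real features and target
X_demo, y_demo = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=0,
    random_state=12345,
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=12345
)

# Deliberately small grid so the example runs fast
demo_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=12345),
    param_grid={"n_estimators": [10, 20], "max_depth": [3, 5]},
    cv=3,
    scoring="accuracy",
)
demo_search.fit(X_train, y_train)

# best_estimator_ is already refit on the whole training set
tuned_model = demo_search.best_estimator_
test_accuracy = tuned_model.score(X_test, y_test)

print(demo_search.best_params_, round(test_accuracy, 3))
```

Note that `best_score_` is a mean cross-validated score on the training data, so it will generally differ from the accuracy measured on the held-out test set.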


In [25]:
best_rf_model = RandomForestClassifier(random_state=12345, n_estimators=100, max_depth=10)
best_rf_model.fit(features_train, target_train)
random_forest_predictions = best_rf_model.predict(features_test)


#Evaluate Random Forest
print(f"Random Forest Accuracy on the test set: {accuracy(target_test, random_forest_predictions).round(3)}")
print(f"Random Forest Error Count on the test set: {error_count(target_test, random_forest_predictions)}")

Random Forest Accuracy on the test set: 0.806
Random Forest Error Count on the test set: 125
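Step 4 of the plan calls for a sanity check. One common way to do this is to compare the model against a trivial baseline that always predicts the most frequent class; a useful model should clearly beat it. The sketch below shows the idea with `DummyClassifier` on synthetic, imbalanced data. The 70/30 class ratio is an assumption for illustration and is not taken from the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumed 70/30 ratio, not the real class balance)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.7, 0.3], random_state=12345,
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=12345
)

# Baseline: always predict the most frequent class seen during training
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_accuracy = baseline.score(X_test, y_test)

# Model under test, with the same settings as the tuned random forest
forest = RandomForestClassifier(random_state=12345, n_estimators=100, max_depth=10)
forest.fit(X_train, y_train)
forest_accuracy = forest.score(X_test, y_test)

print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Random forest accuracy: {forest_accuracy:.3f}")
```

On the real data, the same comparison would use `features_train`, `target_train`, `features_test`, and `target_test` in place of the synthetic arrays.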


# Conclusion

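Using GridSearchCV from sklearn.model_selection, I tuned the hyperparameters of each model with 5-fold cross-validation on the training set. Within the grids searched, the best cross-validated accuracy was about 0.795 for the Decision Tree (max_depth=10), 0.821 for the Random Forest (max_depth=10, n_estimators=100), and 0.748 for Logistic Regression (C=0.01).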


***OVERALL CONCLUSION***

I'd select the Random Forest model because it achieved the highest accuracy of the three models. On the test set it reached an accuracy of 0.806, with 125 errors out of 643 observations, which is above the 0.75 threshold.