__Developing a model that will pick the right mobile plan__

_The goal of this project is to explore datasets, find the best model with the best hyperparameters that will pick the right mobile plan for users of the mobile carrier._  

_To do this we'll:_  
* _preprocess the data identifying and filling in missing values, identifying and remove duplicate values,_ 
* _split the source data into a training set, a validation set, and a test set,_ 
* _try classifiers on the validation set:_  
    * _DecisionTreeClassifier with various max_depth hyperparameter,_  
    * _RandomForestClassifier with various n_estimators and max_depth hyperparameters,_  
    * _LogisticRegression with various solver hyperparameter,_
* _use the best hyperparameter on the test set,_  
* _make conclusions._

In [1]:
# importing libraries: pandas, train_test_split, mean_squared_error
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier

In [2]:
# loading dataframe
users_behavior = pd.read_csv('users_behavior.csv')

In [3]:
# saving raw data
users_behavior_raw = users_behavior

In [4]:
# creating function to get info of dataframe
def get_info(df):
    display(df.head(10))
    df.info()

In [5]:
get_info(users_behavior)

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0
5,58.0,344.56,21.0,15823.37,0
6,57.0,431.64,20.0,3738.9,1
7,15.0,132.4,6.0,21911.6,0
8,7.0,43.39,3.0,2538.67,1
9,90.0,665.41,38.0,17358.61,0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


__Checking for missing values__

In [6]:
# checking for missing values
print(users_behavior.isna().sum())

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64


_No missing values_

__Checking for duplicates__

In [7]:
# checking for duplicated rows
print(users_behavior.duplicated().sum())

0


_No duplicates_

In [8]:
# changing the types of the columns
users_behavior['calls'] = users_behavior['calls'].astype('int64')
users_behavior['messages'] = users_behavior['messages'].astype('int64')
users_behavior['is_ultra'] = users_behavior['is_ultra'].astype('category')

In [9]:
get_info(users_behavior)

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40,311.9,83,19915.42,0
1,85,516.75,56,22696.96,0
2,77,467.66,86,21060.45,0
3,106,745.53,81,8437.39,1
4,66,418.74,1,14502.75,0
5,58,344.56,21,15823.37,0
6,57,431.64,20,3738.9,1
7,15,132.4,6,21911.6,0
8,7,43.39,3,2538.67,1
9,90,665.41,38,17358.61,0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   calls     3214 non-null   int64   
 1   minutes   3214 non-null   float64 
 2   messages  3214 non-null   int64   
 3   mb_used   3214 non-null   float64 
 4   is_ultra  3214 non-null   category
dtypes: category(1), float64(2), int64(2)
memory usage: 103.8 KB


__Data preparing ended__

In [10]:
# splitting the source data into a training set (70%), a validation set (15%), and a test set (15%)
users_behavior_train, users_behavior_valid = train_test_split(users_behavior, test_size = 0.3, random_state = 101010)
users_behavior_valid, users_behavior_test = train_test_split(users_behavior_valid, test_size = 0.5, random_state = 101010)

In [11]:
users_behavior_train.info()
print()
users_behavior_valid.info()
print()
users_behavior_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2249 entries, 524 to 2494
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   calls     2249 non-null   int64   
 1   minutes   2249 non-null   float64 
 2   messages  2249 non-null   int64   
 3   mb_used   2249 non-null   float64 
 4   is_ultra  2249 non-null   category
dtypes: category(1), float64(2), int64(2)
memory usage: 90.2 KB

<class 'pandas.core.frame.DataFrame'>
Index: 482 entries, 107 to 3207
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   calls     482 non-null    int64   
 1   minutes   482 non-null    float64 
 2   messages  482 non-null    int64   
 3   mb_used   482 non-null    float64 
 4   is_ultra  482 non-null    category
dtypes: category(1), float64(2), int64(2)
memory usage: 19.4 KB

<class 'pandas.core.frame.DataFrame'>
Index: 483 entries, 37 to 1668
Data columns (total 5 columns):
 #  

In [12]:
# splitting the dataframes into a features and target
users_behavior_train_features = users_behavior_train.drop(['is_ultra'], axis = 1)
users_behavior_train_target = users_behavior_train['is_ultra']
users_behavior_valid_features = users_behavior_valid.drop(['is_ultra'], axis = 1)
users_behavior_valid_target = users_behavior_valid['is_ultra']
users_behavior_test_features = users_behavior_test.drop(['is_ultra'], axis = 1)
users_behavior_test_target = users_behavior_test['is_ultra']

In [13]:
# finding the best accuracy of the model with DecisionTreeClassifier with the max_depth from 1 to 20 on the validation set
best_score = 0
best_depth = 0
for depth in range(1, 21):
    model = DecisionTreeClassifier(max_depth = depth, random_state = 101010)
    model.fit(users_behavior_train_features, users_behavior_train_target)
    score = model.score(users_behavior_valid_features, users_behavior_valid_target)
    if score > best_score:
        best_score = score 
        best_depth = depth
print('Accuracy of the best model (DecisionTreeClassifier) on the validation set (max_depth = {}): {}'.format(best_depth, best_score))

Accuracy of the best model (DecisionTreeClassifier) on the validation set (max_depth = 8): 0.8049792531120332


In [14]:
# finding the accuracy of the model with DecisionTreeClassifier with the max_depth 8 on the test set
model = DecisionTreeClassifier(max_depth = 8, random_state = 101010)
model.fit(users_behavior_train_features, users_behavior_train_target)
score = model.score(users_behavior_test_features, users_behavior_test_target)
print('Accuracy of the best model (DecisionTreeClassifier) on the test set (max_depth = 8): {}'.format(score))

Accuracy of the best model (DecisionTreeClassifier) on the test set (max_depth = 8): 0.8157349896480331


In [15]:
# finding the best accuracy of the model with RandomForestClassifier with the max_depth from 1 to 20 and n_estimators from 1 to 20 on the validation set
best_score = 0
best_est = 0
best_depth = 0
for est in range(1, 21):
    for depth in range(1, 21):
        model = RandomForestClassifier(n_estimators = est, max_depth = depth, random_state = 101010)
        model.fit(users_behavior_train_features, users_behavior_train_target)
        score = model.score(users_behavior_valid_features, users_behavior_valid_target)
        if score > best_score:
            best_score = score 
            best_est = est
            best_depth = depth
print('Accuracy of the best model (RandomForestClassifier) on the validation set (n_estimators = {}, max_depth = {}): {}'.format(best_est, best_depth, best_score))

Accuracy of the best model (RandomForestClassifier) on the validation set (n_estimators = 5, max_depth = 12): 0.8298755186721992


In [16]:
# finding the accuracy of the model with RandomForestClassifier with the n_estimators 5  and max_depth 12 on the test set
model = RandomForestClassifier(n_estimators = 5, max_depth = 12, random_state = 101010)
model.fit(users_behavior_train_features, users_behavior_train_target)
score = model.score(users_behavior_test_features, users_behavior_test_target)
print('Accuracy of the best model (RandomForestClassifier) on the test set (n_estimators = 5, max_depth = 12): {}'.format(score))

Accuracy of the best model (RandomForestClassifier) on the test set (n_estimators = 5, max_depth = 12): 0.8178053830227743


In [17]:
# finding the best accuracy of the model with LogisticRegression with the various solvers on the validation set
best_score = 0
best_solver = 0
for solvers in ('liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga'):
    model = LogisticRegression(solver = solvers, random_state = 101010)
    model.fit(users_behavior_train_features, users_behavior_train_target)
    score = model.score(users_behavior_valid_features, users_behavior_valid_target)
    if score > best_score:
        best_score = score 
        best_solver = solvers
print('Accuracy of the best model (LogisticRegression) on the validation set (solver = {}): {}'.format(best_solver, best_score));

Accuracy of the best model (LogisticRegression) on the validation set (solver = newton-cg): 0.7614107883817427




In [18]:
# finding the  accuracy of the model with LogisticRegression with the solver 'newton-cg' on the test set
model = LogisticRegression(solver = solvers, random_state = 101010)
model.fit(users_behavior_train_features, users_behavior_train_target)
score = model.score(users_behavior_test_features, users_behavior_test_target)
print('Accuracy of the best model (LogisticRegression) on the validation set (solver = newton-cg): {}'.format(score))

Accuracy of the best model (LogisticRegression) on the validation set (solver = newton-cg): 0.7018633540372671




_The best accuracy of the model on the validation and the test sets is with RandomForestClassifier (n_estimators = 5, max_depth = 12) : 0.829875518672199 and 0.8178053830227743._

In [19]:
dummy_most_frequent = DummyClassifier(random_state=101010)
dummy_most_frequent.fit(users_behavior_train_features, users_behavior_train_target)
score = model.score(users_behavior_valid_features, users_behavior_valid_target)
print('Accuracy of the model (DummyClassifier) on the validation set: {}'.format(score))

Accuracy of the model (DummyClassifier) on the validation set: 0.6721991701244814


_Accuracy of the model (DummyClassifier) on the validation set is lower then accuracy of the model on the validation and the test sets is with RandomForestClassifier (n_estimators = 5, max_depth = 12)_