# Project description

Mobile carrier Megaline wants to develop a model that analyzes subscribers' behavior and recommends one of its newer plans: Smart or Ultra.

The goal of this project is to develop a model that picks the right plan based on a user's behavior.
The accuracy threshold for this project is 0.75, and accuracy must be checked on the test dataset.

Data description:
- calls, number of calls.
- minutes, total call duration in minutes.
- messages, number of text messages.
- mb_used, Internet traffic used in MB.
- is_ultra, plan for the current month (Ultra - 1, Smart - 0).

In [6]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier 
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from joblib import dump

#### Open and look through the data file:

In [7]:
df = pd.read_csv('/datasets/users_behavior.csv')
print(df.head())
print("------------------------------------------------------------------------------")
print('Percent of missing values:')
print(round(100*df.isnull().sum()/len(df),2))
print("------------------------------------------------------------------------------")
df.info()  # info() prints its report itself and returns None

   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0
------------------------------------------------------------------------------
Percent of missing values:
calls       0.0
minutes     0.0
messages    0.0
mb_used     0.0
is_ultra    0.0
dtype: float64
------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB

#### Split the source data into a training set, a validation set, and a test set:

In [8]:
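# Hold out half of the rows, then split that half evenly:
# roughly 50% training, 25% test, 25% validation.
df_train, df_holdout = train_test_split(df, test_size=0.5, random_state=12345)
df_test, df_valid = train_test_split(df_holdout, test_size=0.5, random_state=12345)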

features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']

features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']

features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']

features_all = df.drop(['is_ultra'], axis=1)
target_all = df['is_ultra']
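
As a quick sanity check, the sizes produced by the two-step split above can be confirmed on a stand-in array. This is only a sketch: `rows` is a hypothetical placeholder with the same row count as the dataset (3214), not the real data, and the part names are chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 3214-row dataset: only the row count matters here.
rows = np.arange(3214)

# Same two-step split as in the notebook: half for training,
# then the held-out half is divided between test and validation.
train, holdout = train_test_split(rows, test_size=0.5, random_state=12345)
test, valid = train_test_split(holdout, test_size=0.5, random_state=12345)

for name, part in [("train", train), ("test", test), ("valid", valid)]:
    print(f"{name}: {len(part)} rows ({len(part) / len(rows):.1%})")
```

The three parts do not overlap, so every row is used exactly once.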

#### Investigate the quality of different models by changing hyperparameters. 

In [12]:
def model_best_parameters(features_train, target_train, features_valid, target_valid,
                          estimators=100, depth=15, splits=20):
    """Tune n_estimators, max_depth and min_samples_split one at a time on the validation set."""
    best_estimator = 0
    best_depth = 0
    best_splits = 0

    # 1. Number of trees: 50, 55, ... up to (but not including) `estimators`.
    max_accuracy = 0
    for i in range(50, estimators, 5):
        model = RandomForestClassifier(random_state=12345, n_estimators=i)
        model.fit(features_train, target_train)
        valid_predictions = model.predict(features_valid)
        valid_accuracy = accuracy_score(target_valid, valid_predictions)

        if valid_accuracy > max_accuracy:
            max_accuracy = valid_accuracy
            best_estimator = i

    # 2. Tree depth, searched with a fixed forest of 30 trees (not best_estimator).
    max_accuracy = 0
    for i in range(1, depth):
        model = RandomForestClassifier(random_state=12345, n_estimators=30, max_depth=i)
        model.fit(features_train, target_train)
        valid_predictions = model.predict(features_valid)
        valid_accuracy = accuracy_score(target_valid, valid_predictions)

        if valid_accuracy > max_accuracy:
            max_accuracy = valid_accuracy
            best_depth = i

    # 3. Minimum samples per split, again with 30 trees, at the best depth found above.
    max_accuracy = 0
    for i in range(2, splits):
        model = RandomForestClassifier(random_state=12345, n_estimators=30,
                                       max_depth=best_depth, min_samples_split=i)
        model.fit(features_train, target_train)
        valid_predictions = model.predict(features_valid)
        valid_accuracy = accuracy_score(target_valid, valid_predictions)

        if valid_accuracy > max_accuracy:
            max_accuracy = valid_accuracy
            best_splits = i

    return best_estimator, best_depth, best_splits

In [13]:
%%time
pick_estimator, pick_depth, pick_splits = model_best_parameters(features_train, target_train, features_valid, target_valid)
print("Best model parameters")
print("n_estimators:     ", pick_estimator)
print("max_depth:        ", pick_depth)
print("min_samples_split:", pick_splits)

Best model parameters
n_estimators:      60
max_depth:         7
min_samples_split: 13
CPU times: user 7.32 s, sys: 22 ms, total: 7.34 s
Wall time: 7.5 s


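I wrote a function that tunes the three hyperparameters one at a time, keeping for each the value that gives the highest accuracy on the validation set. The depth and split searches run with a fixed 30 trees, so the three chosen values were never evaluated together before the final test.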
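
Because the function tunes each hyperparameter separately, one possible alternative is to score every combination against the same validation set. The following is a minimal sketch of that idea, not the method used in this notebook: `joint_search`, the small `grid`, and the synthetic data from `make_classification` are all assumptions made for illustration.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 4 features, binary target.
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)


def joint_search(X_train, y_train, X_valid, y_valid, grid):
    """Return (best_params, best_accuracy) over every combination in `grid`."""
    names = list(grid)
    best_params, best_accuracy = None, -1.0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        model = RandomForestClassifier(random_state=12345, **params)
        model.fit(X_train, y_train)
        accuracy = accuracy_score(y_valid, model.predict(X_valid))
        if accuracy > best_accuracy:
            best_params, best_accuracy = params, accuracy
    return best_params, best_accuracy


# Deliberately tiny grid so the sketch runs quickly.
grid = {"n_estimators": [10, 20], "max_depth": [3, 5], "min_samples_split": [2, 8]}
best_params, best_accuracy = joint_search(X_train, y_train, X_valid, y_valid, grid)
print(best_params, best_accuracy)
```

The cost grows with the product of the grid sizes, which is the trade-off against the cheaper one-at-a-time search.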

#### Check the quality of the model using the test set:

In [14]:
model = RandomForestClassifier(random_state=12345, n_estimators = pick_estimator, max_depth = pick_depth, min_samples_split = pick_splits)
model.fit(features_train, target_train)

train_predictions = model.predict(features_train)
test_predictions = model.predict(features_test)

training_accuracy = accuracy_score(target_train, train_predictions)
test_accuracy = accuracy_score(target_test, test_predictions)

print('Model Accuracy')
print('Training set:', training_accuracy )
print('Test set:    ', test_accuracy )

Model Accuracy
Training set: 0.8444306160547604
Test set:     0.7920298879202988


Training accuracy (0.844) is about 0.05 higher than test accuracy (0.792), so the model overfits only mildly, and its test accuracy is above the required 0.75.
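
An accuracy figure is easier to judge next to a trivial baseline: the accuracy of always predicting the most frequent plan. The sketch below shows the calculation on hypothetical labels, not the project data; `majority_baseline_accuracy` is a helper name introduced here for illustration.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical labels (not the project data): 7 Smart (0) and 3 Ultra (1).
example_labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
print(majority_baseline_accuracy(example_labels))  # 0.7
```

In the notebook, the same helper could be called as `majority_baseline_accuracy(target_test)` and compared with the test accuracy above.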

#### Use chosen parameters and full dataset to train the final model:

In [15]:
model_final = RandomForestClassifier(random_state=12345, n_estimators = pick_estimator, max_depth = pick_depth, min_samples_split = pick_splits) 
model_final.fit(features_all, target_all)
dump(model_final, 'model.joblib') 

['model.joblib']
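
A saved model can later be restored with `joblib.load` and used for prediction. The sketch below assumes a temporary file path and a tiny forest trained only on the five rows printed by `df.head()` above. The `new_user` row is invented for illustration, so the prediction itself carries no meaning.

```python
import os
import tempfile

import pandas as pd
from joblib import dump, load
from sklearn.ensemble import RandomForestClassifier

# The five rows shown by df.head() earlier in the notebook.
features = pd.DataFrame({
    "calls": [40.0, 85.0, 77.0, 106.0, 66.0],
    "minutes": [311.90, 516.75, 467.66, 745.53, 418.74],
    "messages": [83.0, 56.0, 86.0, 81.0, 1.0],
    "mb_used": [19915.42, 22696.96, 21060.45, 8437.39, 14502.75],
})
target = pd.Series([0, 0, 0, 1, 0], name="is_ultra")

# Tiny model for demonstration only.
model = RandomForestClassifier(random_state=12345, n_estimators=10)
model.fit(features, target)

# Round trip through a temporary file (assumed path, removed afterwards).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    dump(model, path)
    restored = load(path)

# Hypothetical new subscriber; columns must match the training order.
new_user = pd.DataFrame({
    "calls": [50.0], "minutes": [350.0], "messages": [20.0], "mb_used": [15000.0],
})
prediction = restored.predict(new_user)
print("Recommended plan:", "Ultra" if prediction[0] == 1 else "Smart")
```

Loading a joblib file executes pickled code, so only files from a trusted source should be loaded.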

#### Conclusions
- Test-set accuracy is ~0.79, above the required 0.75.

#### Recommendations
- Put the model into production.