**Review**

Hi, my name is Dmitry and I will be reviewing your project.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>
  
First of all, thank you for turning in the project! You did a great job! Good luck on the next sprint!

# Predictive Models for Megaline

Using data on customers that have already changed to using one of Megaline's two newer plans, Smart or Ultra, we will split the data into training, validation, and testing sets to create a number of predictive models and choose the best one that will aid in selling the new plans to exiting customers who are still on legacy plans.

Our minimum threshold for accuracy of the predictions is set at `0.75`, though we will be creating a number of different models to attempt and get the highest possible accuracy in order to better help Megaline in their sales and marketing.

### Importing Libraries and Data

#### Libraries

In [1]:
### Importing libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier 
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

#### Data

In [2]:
### Importing data
df = pd.read_csv('/datasets/users_behavior.csv')

### Preprocessing Data

#### Data Overview

In [3]:
### Preprocess Data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [4]:
df.head(10)

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0
5,58.0,344.56,21.0,15823.37,0
6,57.0,431.64,20.0,3738.9,1
7,15.0,132.4,6.0,21911.6,0
8,7.0,43.39,3.0,2538.67,1
9,90.0,665.41,38.0,17358.61,0


In [5]:
df.isna().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

It is verified that there are no missing values in any of the columns.

#### Data Type Conversions

In [6]:
df['calls'] = df['calls'].astype('int64')
df['messages'] = df['messages'].astype('int64')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   int64  
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   int64  
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 125.7 KB


We have converted `calls` and `messages` to integer data type to save on bulk and aid in accelerating processing speed when we begin to assess which model will be best to use.

<div class="alert alert-success">
<b>Reviewer's comment</b>

The data was loaded and inspected!

</div>

### Prepare for Learning Models

#### Define evaluation metrics

Evaluation metrics do not require a new column as the `is_ultra` column, with a Boolean 1 or 0 indicates such. This will be used as the target for upcoming model fitting.

#### Split into Training, Validation, and Testing Sets

In [7]:
### Split into training (60%), validation (20%), and testing (20%) sets

df_train, df_valid = train_test_split(df, test_size=0.2, random_state=759638)

df_train, df_test = train_test_split(df_train, test_size=0.25, random_state=759638)

We have split the data into three sets; training, validation, and testing, at a 3:1:1 ratio. First we took 20% of the data to create the validation set, then from the remaining 80%, we took 25% to make the testing set. This follows the ratio of 3:1:1 or 60%, 20%, and 20% since after the first split, 80% of the initial data remained, and 0.8 * 0.25 = 0.2, representing a second 20% share of the full data set.

<div class="alert alert-success">
<b>Reviewer's comment</b>

The data was split into train, validation and test sets, the proportions are reasonable

</div>

#### Declare Variables

In [8]:
# declare variables for features and targets

features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']
features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']
features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']

The training, validation, and testing data sets have been split into features; which include number of calls, minutes, texts, and data usage, and results; whether the plan is Smart or Ultra.

<div class="alert alert-success">
<b>Reviewer's comment</b>

Good, the target is correct

</div>

### Test Models

#### Model A - Decision Tree

In [22]:
best_model = None
best_result = 0
for depth in range(1, 6):
    model = DecisionTreeClassifier(random_state=759638, max_depth=depth) # create a model with the given depth
    model.fit(features_train, target_train) # train the model
    result = model.score(features_valid, target_valid) # calculate the accuracy
    if result > best_result:
        best_model = model
        best_result = result
        best_depth = depth
        
predictions_test = model.predict(features_test)
test_result = accuracy_score(target_test, predictions_test)
        
print("Depth of best model:", best_depth)
print("Accuracy of the decision tree model on the validation set:", best_result)
print("Accuracy of the decision tree model on the test set:", test_result)

Depth of best model: 3
Accuracy of the decision tree model on the validation set: 0.7978227060653188
Accuracy of the decision tree model on the test set: 0.8133748055987559


#### Model B - Random Forest

In [23]:
best_score = 0
best_est = 0
for est in range(1, 11): # choose hyperparameter range
    model = RandomForestClassifier(random_state=759638, n_estimators=est) # set number of trees
    model.fit(features_train, target_train) # train model on training set
    score = model.score(features_valid, target_valid) # calculate accuracy score on validation set
    if score > best_score:
        best_score = score
        best_est = est

print("n_estimator value of best model:", best_est)
print("Accuracy of the random forest on the model validation set:", best_score)
        
final_model = RandomForestClassifier(random_state=759638, n_estimators=est) # change n_estimators to get best model
final_model.fit(features_train, target_train)
test_score = model.score(features_test, target_test)

print("Accuracy of the random forest model on the test set:", test_score)

n_estimator value of best model: 4
Accuracy of the random forest on the model validation set: 0.7776049766718507
Accuracy of the random forest model on the test set: 0.8009331259720062


#### Model C - Logistic Regression

In [28]:
model = LogisticRegression(random_state=759638, solver='liblinear')  # initialize logistic regression constructor with parameters random_state=54321 and solver='liblinear'
model.fit(features_train, target_train)  # train model on training set
score_valid = model.score(features_valid, target_valid) # calculate accuracy score on validation set
score_test = model.score(features_test, target_test) # calculate accuracy score on validation set

print("Accuracy of the logistic regression model on the validation set:", score_valid)
print("Accuracy of the logistic regression model on the test set:", score_test)

Accuracy of the logistic regression model on the validation set: 0.7122861586314152
Accuracy of the logistic regression model on the test set: 0.7169517884914464


#### Model D - Decision Tree Regression

In [25]:
best_model = None
best_score = 10000
best_depth = 0
for depth in range(1, 6): # choose hyperparameter range
    model = DecisionTreeRegressor(max_depth=depth, random_state=759638) # train model on training set
    model.fit(features_train, target_train) # train model on training set
    predictions_valid = model.predict(features_valid) # get model predictions on validation set
    result = mean_squared_error(target_valid, predictions_valid)**0.5 # calculate RMSE on validation set
    if result < best_result:
        best_model = model
        best_result = result
        best_depth = depth

model = RandomForestClassifier(random_state=759638, n_estimators=best_depth) 
model.fit(features_train, target_train)
score_valid = model.score(features_valid, target_valid)
score_test = model.score(features_test, target_test)
        
print("Best model depth:", best_depth)
print("Accuracy of the decision tree regression model on the validation set:", score_valid)
print("Accuracy of the decision tree regression model on the test set:", score_test)


Best model depth: 5
Accuracy of the decision tree regression model on the validation set: 0.7465007776049767
Accuracy of the decision tree regression model on the test set: 0.7729393468118196


#### Model E - Random Forest Regression

In [26]:
best_model = None
best_result = 10000
best_est = 0
best_depth = 0
for est in range(10, 51, 10):
    for depth in range (1, 11):
        model = RandomForestRegressor(random_state=759638, n_estimators=est, max_depth=depth)
        model.fit(features_train, target_train) # train model on training set
        predictions_valid = model.predict(features_valid) # get model predictions on validation set
        result = mean_squared_error(target_valid, predictions_valid)**0.5 
        if result < best_result:
            best_model = model
            best_result = result
            best_est = est
            best_depth = depth

model = RandomForestRegressor(random_state=759638, n_estimators=est, max_depth=depth)
model.fit(features_train, target_train)
score_valid = model.score(features_valid, target_valid)
score_test = model.score(features_test, target_test)

print("Best model depth:", best_depth)
print("n_estimator value of best model:", best_est)
print("Accuracy of the random forest regression model on the validation set:", score_valid)
print("Accuracy of the random forest regression model on the test set:", score_test)

Best model depth: 7
n_estimator value of best model: 50
Accuracy of the random forest regression model on the validation set: 0.27790191886395654
Accuracy of the random forest regression model on the test set: 0.3308555611521775


#### Model F - Linear Regression

In [27]:
model = LinearRegression()
model.fit(features_train, target_train) # train model on training set
predictions_valid = model.predict(features_valid) # get model predictions on validation set
score_valid = model.score(features_valid, target_valid)
score_test = model.score(features_test, target_test)

print("Accuracy of the linear regression model on the validation set:", score_valid)
print("Accuracy of the linear regression model on the test set:", score_test)

Accuracy of the linear regression model on the validation set: 0.07155124406025604
Accuracy of the linear regression model on the test set: 0.06460707768249263


<div class="alert alert-success">
<b>Reviewer's comment</b>

Great, you tried a few different model, tuned their hyperparameters based on the validation set and evaluated the final models on the test set

</div>

<div class="alert alert-warning">
<b>Reviewer's comment</b>

Note that `LinearRegression` and `RandomForestRegressor` are regression models (for predicting continuous variables), and are not really suitable for classification problems (in particular, `score()` method calculates R^2 score for them, not accuracy). Funny enough, `LogisticRegression` is a classification model despite the name 

</div>

### Conclusion

#### Model A - Decision Tree

After testing the accuracy of six different models, the one with the highest accuracy was determined to the Model A - Decision Tree. With a depth of 3 trees, we were able to get an accuracy of `0.7978227060653188` with the validation data set, and a higher accuracy of `0.8133748055987559` with the test data set.

#### Application of the accepted model

As this model was developed using plan uasage from customers that had already swithced to Megaline's new plans, it should be able to give a fairly accurate prediction on which plan, Smart or Ultra, should be recommended to existing customers who still use a legacy plan. Accurately predicting which plan is more appealing to certain customers should aid in convincing them to upgrade to more more modern alternative to their present choice.

<div class="alert alert-success">
<b>Reviewer's comment</b>

Alright!

</div>