# Modeling Notebook

## This notebook picks up where the previous one left off. We train and evaluate the next model(s), recording each one in the metrics table near the top. Once all of the models are finished, we will select the ones to carry into the model stack and then move on to Bayesian optimization.

### First, as usual, let's import our libraries, read in our dataframe, and split X and y into training and test sets for the models to come.

In [3]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, auc
from sklearn.model_selection import train_test_split
from surprise import Dataset, Reader
from surprise import SVD
from surprise.model_selection import cross_validate
from surprise import accuracy


# The stored variables are not reading back in properly, so they are commented out for now; X and y are re-split below.
#%store -r X
#%store -r y
#%store -r X_train
#%store -r X_test
#%store -r y_train
#%store -r y_test

In [6]:
df_mod = pd.read_csv('/Users/ryanm/Desktop/df-mod.csv')
print(df_mod.shape)
df_mod.head(5)

(2580206, 14)


Unnamed: 0,user_id,order_number,order_id,order_dow,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,aisle_id,department_id,product_name_code,aisle_code,department_code
0,1,1,2539329,3,9,12.0,26405,5,0,54,17,31683,99,11
1,1,2,2398795,4,8,16.0,26088,6,1,23,19,980,103,20
2,1,3,473747,4,13,22.0,30450,5,1,88,13,7124,124,16
3,1,4,2254736,5,8,30.0,26405,5,1,54,17,31683,99,11
4,1,5,431534,5,16,29.0,41787,8,1,24,4,2419,50,19


In [7]:
X = df_mod.drop('reordered', axis = 1)
y = df_mod['reordered']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### The table below summarizes the metrics of each model we build in this project. We will use it both to see at a glance how each model performed and to pick which models go into our model stack later on.

In [8]:
models = ['Logistic Regression']
table_metrics = {'Model' : models, 'Precision' : 0.77, 'Accuracy' : 0.88, 'F-1' : 0.82}
# Other models will be added as they are completed and their outputs become available.
# TODO: improve the appearance of the table, since it is the main point of reference in this notebook.

table_metrics_df = pd.DataFrame(table_metrics)
print(table_metrics_df)



                 Model  Precision  Accuracy   F-1
0  Logistic Regression       0.77      0.88  0.82
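
As further models are completed, their rows need the same three columns. The following is a minimal sketch of how one row could be computed from true and predicted labels, assuming binary 0/1 labels. The helper names `binary_metrics` and `add_model_row` and the toy labels are made up for illustration and are not part of the original notebook.

```python
def binary_metrics(y_true, y_pred):
    """Return precision, accuracy and F1 for binary 0/1 labels, rounded to 2 places."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {'Precision': round(precision, 2),
            'Accuracy': round(accuracy, 2),
            'F-1': round(f1, 2)}


def add_model_row(rows, model_name, y_true, y_pred):
    """Append one metrics row (a dict) for model_name to rows and return rows."""
    row = {'Model': model_name}
    row.update(binary_metrics(y_true, y_pred))
    rows.append(row)
    return rows


# Toy labels, for illustration only (not results from this project).
toy_true = [1, 0, 1, 1, 0, 1]
toy_pred = [1, 0, 0, 1, 1, 1]
metric_rows = add_model_row([], 'Toy model', toy_true, toy_pred)
print(metric_rows)
```

A list of row dicts like `metric_rows` can be passed straight to `pd.DataFrame(...)` to get a table with the same columns as the one above. In the notebook itself, the imported `sklearn.metrics` functions could be used in place of the hand-written counts.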


In [9]:
# Note: the SVD model is not complete yet; it had issues running and still needs tuning.
reader = Reader(rating_scale=(0,1))
data = Dataset.load_from_df(df_mod[['user_id', 'product_name_code', 'reordered']], reader)

svd = SVD()
results = cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
print(results)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.3349  0.3353  0.3350  0.3351  0.3346  0.3350  0.0002  
MAE (testset)     0.2486  0.2495  0.2488  0.2493  0.2478  0.2488  0.0006  
Fit time          33.14   35.46   36.25   37.61   37.03   35.90   1.56    
Test time         5.56    5.71    5.87    7.10    6.44    6.14    0.57    
{'test_rmse': array([0.33491687, 0.33526842, 0.3350459 , 0.33505275, 0.33459822]), 'test_mae': array([0.24860911, 0.24947204, 0.24875975, 0.24933599, 0.24782499]), 'fit_time': (33.13718295097351, 35.4607367515564, 36.247190713882446, 37.60566806793213, 37.027615785598755), 'test_time': (5.558759927749634, 5.708536863327026, 5.870195388793945, 7.1048126220703125, 6.441673040390015)}
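
For reference, the two error measures reported above can be written out directly. This is a minimal sketch in plain Python, not Surprise's own implementation. The function names and the toy values are made up for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    absolute_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute_errors) / len(absolute_errors)


# Toy values, for illustration only: 0/1 targets and estimates in [0, 1].
toy_actual = [1, 0, 1, 0]
toy_estimates = [0.5, 0.5, 1.0, 0.0]
toy_rmse = rmse(toy_actual, toy_estimates)
toy_mae = mae(toy_actual, toy_estimates)
print(toy_rmse, toy_mae)
```

Because `reordered` is a 0/1 target, both measures are bounded by 1 here, and RMSE penalizes confident wrong estimates more heavily than MAE does.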


In [10]:
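# Refit on the full dataset; cross_validate only fits on the per-fold training splits.
svd.fit(data.build_full_trainset())

df_user_all = df_mod['user_id'].unique()
df_products_all = df_mod['product_name_code'].unique()

svd_predictions = []

# Caution: this scores every user against every product, which is very slow on a dataset this size.
for user in df_user_all:
    for product in df_products_all:
        # Pass the raw ids with the same type used in Dataset.load_from_df;
        # casting them to str would make every id unknown to the trained model.
        pred = svd.predict(user, product)
        svd_predictions.append((user, product, pred.est))

predictions_df = pd.DataFrame(svd_predictions, columns = ['user_id', 'product_name_code', 'predicted_rating'])

# Keep the five highest-scoring products for each user.
best_recommendations = predictions_df.groupby('user_id').apply(lambda x: x.nlargest(5, 'predicted_rating')).reset_index(drop=True)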

KeyboardInterrupt: 
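
One possible way around the interrupted cell is to avoid storing every (user, product) score and keep only the top n per user as they are computed. The sketch below shows that idea with the standard-library `heapq.nlargest`. The names `top_n_for_user` and `stub_predict` are hypothetical, and the stub scoring function stands in for a trained model so the example runs on its own.

```python
import heapq


def top_n_for_user(predict_fn, user, products, n=5):
    """Return the n (product, score) pairs with the highest score for one user.

    predict_fn is any callable taking (user, product) and returning a number.
    Scores are generated lazily, so only n candidates are held at a time.
    """
    scored = ((product, predict_fn(user, product)) for product in products)
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


def stub_predict(user, product):
    # Hypothetical stand-in for a trained model; returns a score in [0, 0.9].
    return ((user * 7 + product * 3) % 10) / 10


# Toy product ids, for illustration only.
toy_products = [101, 102, 103, 104, 105, 106]
toy_top = top_n_for_user(stub_predict, user=1, products=toy_products, n=2)
print(toy_top)
```

With the fitted Surprise model, `predict_fn` could be something like `lambda u, i: svd.predict(u, i).est`, assuming the ids passed in have the same type as those loaded into the dataset. This still makes one prediction per (user, product) pair, so running it on a sample of users first is a reasonable way to check timing.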

### Now let's move on to the RNN-LSTM model. This model will not be part of the model stack; however, it is very important because it will produce our n predictions and the final outputs when the project is completed.

### Great, now let's move on to our last model, XGBoost.

### With all of our independent models trained, let's review the metrics table and pick which ones we want in our model stack before continuing to the final phase of this notebook, Bayesian optimization.