![dvd_image](dataset/dvd_image.jpg)

A DVD rental company needs your help! They want to predict how many days a customer will rent a DVD for, based on a set of features, and have approached you to try out some regression models for this task. The company wants a model that yields an MSE of 3 or less on a test set. The model you build will help the company plan its inventory more efficiently.
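
To make the target concrete: mean squared error is the average of the squared differences between the actual and predicted rental lengths. A minimal sketch in plain Python, using made-up numbers rather than the company's data (the helper name `mse` is illustrative and not part of the notebook):

```python
def mse(y_true, y_pred):
    """Mean squared error: average of the squared prediction errors."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical rental lengths in days (not taken from the dataset)
actual = [3, 2, 7, 2]
predicted = [4, 2, 5, 3]

print(mse(actual, predicted))  # (1 + 0 + 4 + 1) / 4 = 1.5
```

Because the errors are squared, a few large misses raise the MSE much more than many small ones.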

The data they provided is in the csv file `rental_info.csv`. It has the following features:
- `"rental_date"`: The date (and time) the customer rents the DVD.
- `"return_date"`: The date (and time) the customer returns the DVD.
- `"amount"`: The amount paid by the customer for renting the DVD.
- `"amount_2"`: The square of `"amount"`.
- `"rental_rate"`: The rate at which the DVD is rented.
- `"rental_rate_2"`: The square of `"rental_rate"`.
- `"release_year"`: The year the movie being rented was released.
- `"length"`: Length of the movie being rented, in minutes.
- `"length_2"`: The square of `"length"`.
- `"replacement_cost"`: The amount it will cost the company to replace the DVD.
- `"special_features"`: Any special features, for example trailers/deleted scenes that the DVD also has.
- `"NC-17"`, `"PG"`, `"PG-13"`, `"R"`: These columns are dummy variables for the movie's rating. Each takes the value 1 if the movie has the rating named by the column and 0 otherwise. For your convenience, the reference dummy has already been dropped.
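
The rating dummies with a dropped reference category can be illustrated with a small sketch. The rating values below are hypothetical, and treating `"G"` as the dropped reference is an assumption; the source does not name the reference category.

```python
import pandas as pd

# Hypothetical ratings; "G" as the reference category is an assumption
ratings = pd.Series(["G", "PG", "PG-13", "R", "NC-17"], name="rating")

# drop_first=True drops the first category in sorted order,
# leaving one 0/1 column for each remaining rating
dummies = pd.get_dummies(ratings, drop_first=True, dtype=int)
print(dummies)
```

A row with all four dummies equal to 0 then corresponds to the reference rating.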

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error

# Models
from sklearn.linear_model import Lasso
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor

# tools
from sklearn.ensemble import VotingRegressor

df = pd.read_csv("dataset/rental_info.csv", parse_dates=['rental_date', 'return_date'])
df.head()

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401


# Exploring Dataset


In [2]:
# C1
#------
# Exploring datatypes and data size and if there are any missing values
#----------------------------------------------------------------------------
df.info()
df["special_features"].value_counts()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15861 entries, 0 to 15860
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   rental_date       15861 non-null  datetime64[ns, UTC]
 1   return_date       15861 non-null  datetime64[ns, UTC]
 2   amount            15861 non-null  float64            
 3   release_year      15861 non-null  float64            
 4   rental_rate       15861 non-null  float64            
 5   length            15861 non-null  float64            
 6   replacement_cost  15861 non-null  float64            
 7   special_features  15861 non-null  object             
 8   NC-17             15861 non-null  int64              
 9   PG                15861 non-null  int64              
 10  PG-13             15861 non-null  int64              
 11  R                 15861 non-null  int64              
 12  amount_2          15861 non-null  float64            
 13  l

{Trailers,Commentaries,"Behind the Scenes"}                     1308
{Trailers}                                                      1139
{Trailers,Commentaries}                                         1129
{Trailers,"Behind the Scenes"}                                  1122
{"Behind the Scenes"}                                           1108
{Commentaries,"Deleted Scenes","Behind the Scenes"}             1101
{Commentaries}                                                  1089
{Commentaries,"Behind the Scenes"}                              1078
{Trailers,"Deleted Scenes"}                                     1047
{"Deleted Scenes","Behind the Scenes"}                          1035
{"Deleted Scenes"}                                              1023
{Commentaries,"Deleted Scenes"}                                 1011
{Trailers,Commentaries,"Deleted Scenes","Behind the Scenes"}     983
{Trailers,Commentaries,"Deleted Scenes"}                         916
{Trailers,"Deleted Scenes","Behind

# Data Preprocessing

In [3]:
# C2
#-----
# Binary indicator columns derived from the categorical column "special_features"
#------------------------------------------------------------------
df_preprocessed = df.copy()  # copy so the original df is left unchanged
df_preprocessed["deleted_scenes"] = df_preprocessed["special_features"].str.contains("Deleted Scenes").astype(int)
df_preprocessed["behind_the_scenes"] = df_preprocessed["special_features"].str.contains("Behind the Scenes").astype(int)
df_preprocessed.drop(columns=["special_features"], inplace=True)
df_preprocessed.head()

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2,deleted_scenes,behind_the_scenes
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,0,0,0,1,8.9401,15876.0,8.9401,0,1
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,0,0,0,1,8.9401,15876.0,8.9401,0,1
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,0,0,0,1,8.9401,15876.0,8.9401,0,1
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,0,0,0,1,8.9401,15876.0,8.9401,0,1
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,0,0,0,1,8.9401,15876.0,8.9401,0,1
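
The indicator columns can be checked on a few of the `special_features` strings listed earlier. This is a small sketch on a toy Series, not the full dataset; `regex=False` is an addition here so that the pattern is matched as a literal substring.

```python
import pandas as pd

# Three values copied from the value_counts() output above
features = pd.Series([
    '{Trailers,"Behind the Scenes"}',
    '{Commentaries,"Deleted Scenes"}',
    '{Trailers}',
])

# 1 if the feature name appears in the string, 0 otherwise
deleted = features.str.contains("Deleted Scenes", regex=False).astype(int)
behind = features.str.contains("Behind the Scenes", regex=False).astype(int)

print(deleted.tolist(), behind.tolist())  # [0, 1, 0] [1, 0, 0]
```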


In [4]:
# C3
#----
# Creating the target variable column and removing columns leaking data about the target
#-----------------------------------------------------------------------------------------
df_preprocessed["rental_length_days"] = (df_preprocessed["return_date"] - df_preprocessed["rental_date"]).dt.days

df_preprocessed.drop(columns=["rental_date", "return_date"], inplace = True)
df_preprocessed.head()

Unnamed: 0,amount,release_year,rental_rate,length,replacement_cost,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2,deleted_scenes,behind_the_scenes,rental_length_days
0,2.99,2005.0,2.99,126.0,16.99,0,0,0,1,8.9401,15876.0,8.9401,0,1,3
1,2.99,2005.0,2.99,126.0,16.99,0,0,0,1,8.9401,15876.0,8.9401,0,1,2
2,2.99,2005.0,2.99,126.0,16.99,0,0,0,1,8.9401,15876.0,8.9401,0,1,7
3,2.99,2005.0,2.99,126.0,16.99,0,0,0,1,8.9401,15876.0,8.9401,0,1,2
4,2.99,2005.0,2.99,126.0,16.99,0,0,0,1,8.9401,15876.0,8.9401,0,1,4
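
Note that `.dt.days` keeps only the whole-day component of each duration, so a rental of 3 days and 20 hours counts as 3 days. A minimal sketch using the first two timestamp pairs from the table shown earlier; the fractional version is included only for comparison and is not used in the notebook.

```python
import pandas as pd

rental = pd.Series(pd.to_datetime([
    "2005-05-25 02:54:33+00:00",
    "2005-06-15 23:19:16+00:00",
]))
returned = pd.Series(pd.to_datetime([
    "2005-05-28 23:40:33+00:00",
    "2005-06-18 19:24:16+00:00",
]))

duration = returned - rental

whole_days = duration.dt.days                          # whole days only
fractional_days = duration.dt.total_seconds() / 86400  # days with fractions

print(whole_days.tolist())  # [3, 2]
print(fractional_days.round(2).tolist())
```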


In [5]:
# C4
#-----
# Defining the input variables and target variable
#-------------------------------------------------------
X = df_preprocessed.iloc[:,:-1]
y = df_preprocessed["rental_length_days"]

In [6]:
# C5
#-----
# Splitting Data
#--------------------
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state = 9)
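
As a rough check on the split sizes, the sketch below works out how many rows land in each set for `test_size=0.2`. It assumes scikit-learn rounds the test share up to a whole row, which is my reading of its behavior and not something the notebook shows.

```python
import math

n_samples = 15861  # row count reported by df.info()
test_size = 0.2

n_test = math.ceil(test_size * n_samples)  # test rows, rounded up (assumption)
n_train = n_samples - n_test               # remaining rows used for training

print(n_train, n_test)  # 12688 3173
```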

# Model Selection
In this step, I used 4-fold cross-validation on the training set to compare four base models by mean squared error.
`RandomForestRegressor` had the lowest cross-validated MSE, so I selected it.

In [7]:
# C6
#-----
# Defining the list of models to select from
#-------------------------------------------------
lasso = Lasso()
svr = SVR()
dt = DecisionTreeRegressor()
rf = RandomForestRegressor(random_state=9)
models = {
    "Lasso" : lasso,
    "SVR" : svr,
    "DecisionTreeRegressor" : dt,
    "RandomForestRegressor" : rf,
}

In [8]:
# C7
#-----
# Evaluating base model for selection
#------------------------------------------
# Scores are negated MSE, so values closer to zero are better
for name, model in models.items():
    cv = cross_val_score(model, X_train, y_train, cv=4, scoring='neg_mean_squared_error', n_jobs=-1)
    print(f"{name} : {np.mean(cv)}")

Lasso : -3.680857370286553
SVR : -6.961727014268216
DecisionTreeRegressor : -2.3104363077610026
RandomForestRegressor : -2.0929149492555776
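
Because the scoring is `neg_mean_squared_error`, the printed values are negated MSEs. A short sketch, using the scores printed above, that flips the sign and picks the model with the lowest MSE (the variable names are illustrative):

```python
# Mean cross-validated scores copied from the output above (negated MSE)
neg_mse = {
    "Lasso": -3.680857370286553,
    "SVR": -6.961727014268216,
    "DecisionTreeRegressor": -2.3104363077610026,
    "RandomForestRegressor": -2.0929149492555776,
}

# Flip the sign to get ordinary MSE, where lower is better
mse_by_model = {name: -score for name, score in neg_mse.items()}
best_name = min(mse_by_model, key=mse_by_model.get)

for name, value in sorted(mse_by_model.items(), key=lambda kv: kv[1]):
    print(f"{name}: {value:.3f}")
print("Lowest CV MSE:", best_name)
```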


# Hyperparameter Tuning

In [9]:
# C8
#-----
# Selected model's default hyperparameters
#-----------------------------------------
display(rf.get_params())

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 9,
 'verbose': 0,
 'warm_start': False}

In [10]:
# C9
#-----
# Selected model's hyperparameter grid for tuning
#---------------------------------------------------
rf_params = {
    'n_estimators' : [100, 200, 400],
    # Floats are fractions of the training samples; an int is an absolute count
    'min_samples_leaf' : [0.1, 0.5, 1],
    'max_depth' : [None, 4, 8],
}

In [11]:
# C10
#------
# GridSearchCV for hyperparameter tuning
#------------------------------------------
rf_cv = GridSearchCV(rf, param_grid= rf_params, cv= 4, scoring= 'neg_mean_squared_error', n_jobs= -1)
rf_cv.fit(X_train, y_train)
rf_best = rf_cv.best_estimator_
display(rf_cv.best_score_)
rf_cv.best_params_

-2.0886590394450857

{'max_depth': None, 'min_samples_leaf': 1, 'n_estimators': 400}
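
To get a sense of the search cost, the sketch below enumerates the grid with the standard library. With three values for each of three hyperparameters and 4-fold cross-validation, the search evaluates 27 candidates with 108 fits, plus one final refit of the best candidate, which `GridSearchCV` does by default.

```python
from itertools import product

rf_params = {
    'n_estimators': [100, 200, 400],
    'min_samples_leaf': [0.1, 0.5, 1],
    'max_depth': [None, 4, 8],
}
cv_folds = 4

# Every combination of one value per hyperparameter
candidates = [dict(zip(rf_params, combo)) for combo in product(*rf_params.values())]
n_fits = len(candidates) * cv_folds

print(len(candidates), n_fits)  # 27 108
```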

# Model Boosting


In [12]:
# C11
#------
# AdaBoost 
#-----------
# Note: `base_estimator` was renamed to `estimator` in scikit-learn 1.2 and removed in 1.4
ada_rf = AdaBoostRegressor(base_estimator= rf_best, n_estimators=100)
cv_result = cross_val_score(ada_rf, X_train, y_train, scoring = "neg_mean_squared_error", cv = 4, n_jobs=-1)
np.mean(cv_result)

-2.0672870771988228
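
The keyword for the wrapped model depends on the installed scikit-learn version: older releases use `base_estimator`, newer ones use `estimator`. The sketch below is one possible way to build the booster on either version. The helper `make_adaboost` is hypothetical and not part of the notebook, and the small forest is only there to keep the example light.

```python
import inspect

from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor


def make_adaboost(base, **kwargs):
    """Build an AdaBoostRegressor using whichever keyword this version accepts."""
    params = inspect.signature(AdaBoostRegressor.__init__).parameters
    key = "estimator" if "estimator" in params else "base_estimator"
    return AdaBoostRegressor(**{key: base}, **kwargs)


# Small hypothetical forest standing in for rf_best
small_rf = RandomForestRegressor(n_estimators=5, random_state=9)
booster = make_adaboost(small_rf, n_estimators=3, random_state=9)
print(type(booster).__name__)
```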

# Final Model

In [13]:
# C12
#-----
# Final Model
#---------------
best_model = ada_rf
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
best_mse = mean_squared_error(y_test, y_pred)
best_mse

2.0187494735685676
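
The test-set result can be compared with the company's target directly. The sketch below copies the reported MSE and also takes its square root, which expresses the typical error in days (the variable names are illustrative):

```python
import math

best_mse = 2.0187494735685676  # test-set MSE reported above
target_mse = 3.0               # threshold set by the company

meets_target = best_mse <= target_mse
rmse_days = math.sqrt(best_mse)  # root mean squared error, in days

print(meets_target, round(rmse_days, 2))  # True 1.42
```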