![dvd_image](dvd_image.jpg)

A DVD rental company needs your help! They want to figure out how many days a customer will rent a DVD for, based on some features, and have approached you for help. They want you to try out some regression models that predict the number of days a customer will rent a DVD for. The company wants a model that yields an MSE of 3 or less on a test set. The model you make will help the company become more efficient at inventory planning.

The data they provided is in the CSV file `rental_info.csv`. It has the following features:
- `"rental_date"`: The date (and time) the customer rents the DVD.
- `"return_date"`: The date (and time) the customer returns the DVD.
- `"amount"`: The amount paid by the customer for renting the DVD.
- `"amount_2"`: The square of `"amount"`.
- `"rental_rate"`: The rate at which the DVD is rented.
- `"rental_rate_2"`: The square of `"rental_rate"`.
- `"release_year"`: The year the movie being rented was released.
- `"length"`: Length of the movie being rented, in minutes.
- `"length_2"`: The square of `"length"`.
- `"replacement_cost"`: The amount it will cost the company to replace the DVD.
- `"special_features"`: Any special features the DVD also has, for example trailers or deleted scenes.
- `"NC-17"`, `"PG"`, `"PG-13"`, `"R"`: These columns are dummy variables for the rating of the movie. Each takes the value 1 if the movie is rated as the column name and 0 otherwise. For your convenience, the reference dummy has already been dropped.
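
As a rough illustration of how columns such as `amount_2` and the rating dummies could be derived, the sketch below builds them on a small made-up frame. The toy values and the `rating` column are assumptions for illustration only; the provided file already contains the engineered columns.

```python
import pandas as pd

# Hypothetical toy data; the real file already ships these engineered columns.
toy = pd.DataFrame({
    "amount": [2.99, 4.99, 0.99],
    "rating": ["R", "PG", "G"],
})

# Squared term, analogous to "amount_2".
toy["amount_2"] = toy["amount"] ** 2

# One dummy column per rating; drop_first removes the first category ("G"),
# which then acts as the reference level.
dummies = pd.get_dummies(toy["rating"], drop_first=True, dtype=int)
toy = pd.concat([toy.drop(columns="rating"), dummies], axis=1)

print(toy)
```

A row with zeros in every rating dummy therefore belongs to the dropped reference category.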

In [27]:
import pandas as pd
import numpy as np
import datetime as dt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)
pd.set_option('display.width', 500)

In [28]:
df_rental = pd.read_csv('rental_info.csv')
df_rental.head()

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401


In [29]:
def check_dataframe(dataframe):
    print('_HEAD_'.center(50, '*'))
    print(dataframe.head(), '\n')
    print('_TAIL_'.center(50, '*'))
    print(dataframe.tail(), '\n')
    print('_SHAPE_'.center(50, '*'))
    print(dataframe.shape, '\n')
    print('_DATAFRAME INFO_'.center(50, '*'))
    print(dataframe.info(), '\n')
    print('_COLUMNS_'.center(50, '*'))
    print(dataframe.columns, '\n')
    print('_ANY NULL VALUE_'.center(50, '*'))
    print(dataframe.isna().values.any(), '\n')
    print('_TOTAL NULL VALUES_'.center(50, '*'))
    print(dataframe.isna().sum(), '\n')
    print('_DESCRIBING DATAFRAME_'.center(50, '*'))
    print(dataframe.describe([0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99]).T)

check_dataframe(df_rental)

**********************_HEAD_**********************
                 rental_date                return_date  amount  release_year  rental_rate  length  replacement_cost                special_features  NC-17  PG  PG-13  R  amount_2  length_2  rental_rate_2
0  2005-05-25 02:54:33+00:00  2005-05-28 23:40:33+00:00    2.99        2005.0         2.99   126.0             16.99  {Trailers,"Behind the Scenes"}      0   0      0  1    8.9401   15876.0         8.9401
1  2005-06-15 23:19:16+00:00  2005-06-18 19:24:16+00:00    2.99        2005.0         2.99   126.0             16.99  {Trailers,"Behind the Scenes"}      0   0      0  1    8.9401   15876.0         8.9401
2  2005-07-10 04:27:45+00:00  2005-07-17 10:11:45+00:00    2.99        2005.0         2.99   126.0             16.99  {Trailers,"Behind the Scenes"}      0   0      0  1    8.9401   15876.0         8.9401
3  2005-07-31 12:06:41+00:00  2005-08-02 14:30:41+00:00    2.99        2005.0         2.99   126.0             16.99  {Trailers,"

In [30]:
df_rental['rental_lenght'] = pd.to_datetime(df_rental['return_date']) - pd.to_datetime(df_rental['rental_date'])
df_rental['rental_length_days'] = df_rental['rental_lenght'].dt.days
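
Note that `.dt.days` returns only the whole-day component of each timedelta, so for positive durations a partial day is truncated rather than rounded. A minimal sketch, using the timestamps from the first row of the preview above:

```python
import pandas as pd

# Timestamps copied from the first row of the data preview.
rental = pd.to_datetime(pd.Series(["2005-05-25 02:54:33+00:00"]))
returned = pd.to_datetime(pd.Series(["2005-05-28 23:40:33+00:00"]))

delta = returned - rental
print(delta.iloc[0])          # full duration, including hours and minutes
print(delta.dt.days.iloc[0])  # whole days only
```

Here the rental lasted 3 days and roughly 21 hours, but the target value is 3.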

In [31]:
df_rental['deleted_scenes'] = np.where(df_rental['special_features'].str.contains('Deleted Scenes'), 1, 0)
df_rental['behind_the_scenes'] = np.where(df_rental['special_features'].str.contains('Behind the Scenes'), 1, 0)
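
The substring test behind these two dummy columns can be tried on a few made-up strings in the same brace-delimited format. The example values below are assumptions, not rows from the file.

```python
import numpy as np
import pandas as pd

# Made-up examples in the same format as the "special_features" column.
features = pd.Series([
    '{Trailers,"Behind the Scenes"}',
    '{"Deleted Scenes"}',
    '{Trailers}',
])

# regex=False treats the pattern as a literal substring.
deleted = np.where(features.str.contains("Deleted Scenes", regex=False), 1, 0)
behind = np.where(features.str.contains("Behind the Scenes", regex=False), 1, 0)

print(deleted.tolist())
print(behind.tolist())
```

If the column could contain missing values, passing `na=False` to `str.contains` makes those rows count as not having the feature.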

In [32]:
droping_variables= ["special_features", "rental_lenght", "rental_length_days", 
                     "rental_date", "return_date"]
X = df_rental.drop(droping_variables, axis=1)
y = df_rental['rental_length_days']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=9)

In [33]:
lasso = Lasso(alpha=0.3, random_state=9)
lasso.fit(X_train, y_train)
lasso_coef = lasso.coef_
print(lasso_coef)
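# Keep only the columns whose Lasso coefficient is strictly positive
# (columns with a negative coefficient are dropped as well).
X_lasso_train, X_lasso_test = X_train.iloc[:, lasso_coef > 0], X_test.iloc[:, lasso_coef > 0]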

[ 5.84104424e-01  0.00000000e+00 -0.00000000e+00  0.00000000e+00
 -0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
 -0.00000000e+00  4.36220109e-02  3.01167812e-06 -1.52983561e-01
 -0.00000000e+00  0.00000000e+00]
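
The mask `lasso_coef > 0` keeps only positive coefficients, so the feature with coefficient -0.153 (position 11, which appears to be `rental_rate_2` given the column order in the preview) is excluded even though Lasso did not shrink it to zero. The sketch below compares that mask with `!= 0` on the coefficients printed above. Whether the alternative mask would improve the downstream models is not tested here.

```python
import numpy as np

# Coefficients copied from the printed Lasso output above.
coef = np.array([
    5.84104424e-01, 0.0, -0.0, 0.0,
    -0.0, 0.0, 0.0, 0.0,
    -0.0, 4.36220109e-02, 3.01167812e-06, -1.52983561e-01,
    -0.0, 0.0,
])

positive_mask = coef > 0   # the mask used in the cell above
nonzero_mask = coef != 0   # alternative: every feature Lasso kept

print(np.flatnonzero(positive_mask))  # positions selected by the notebook
print(np.flatnonzero(nonzero_mask))   # positions with any non-zero weight
```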


In [34]:
# Linear regression on the Lasso-selected features
lin_reg = LinearRegression()
lin_reg.fit(X_lasso_train, y_train)
y_pred = lin_reg.predict(X_lasso_test)
mse_for_lin_reg = mean_squared_error(y_test, y_pred)
print(f"The calculated MSE with Linear Regressor is {mse_for_lin_reg}.")

The calculated MSE with Linear Regressor is 4.812297241276244.
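
MSE is expressed in squared days, so taking the square root (RMSE) gives a figure in days that is easier to interpret. The sketch below computes MSE by hand on made-up values and converts the notebook's figures: an MSE of about 4.81 corresponds to an RMSE of roughly 2.19 days, and the target MSE of 3 to about 1.73 days.

```python
import numpy as np

# Made-up rental lengths (days) and predictions, for illustration only.
y_true = np.array([3, 2, 7, 2, 4])
y_hat = np.array([4.0, 2.5, 5.0, 3.0, 4.0])

mse = np.mean((y_true - y_hat) ** 2)  # mean of squared residuals
rmse = np.sqrt(mse)                   # same units as the target (days)
print(mse, rmse)

# RMSE equivalents of figures that appear in this notebook.
rmse_lin_reg = np.sqrt(4.812297241276244)  # linear regression MSE above
rmse_target = np.sqrt(3)                   # the company's MSE threshold
print(rmse_lin_reg, rmse_target)
```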


In [35]:
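# Random forest with default hyperparameters on the Lasso-selected features
# (no random_state is set, so the MSE can vary slightly between runs)
rf = RandomForestRegressor()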
rf.fit(X_lasso_train, y_train)
y_pred = rf.predict(X_lasso_test)
mse_for_rf = mean_squared_error(y_test, y_pred)
print(f"The calculated MSE with Random Forest Regressor without hyperparameter tuning is {mse_for_rf}.")

The calculated MSE with Random Forest Regressor without hyperparameter tuning is 3.640252985305122.


In [24]:
# Random forest with randomized hyperparameter search, trained on all features
params = {
    'n_estimators': np.arange(1, 50, 1),
    'max_depth': np.arange(1, 20, 1)
}
rf = RandomForestRegressor()
random_search_cv = RandomizedSearchCV(rf, params, cv=5, random_state=9)
random_search_cv.fit(X_train, y_train)
hyper_parameters = random_search_cv.best_params_
print(f"Determined hyperparameters after hyperparameter tuning are {hyper_parameters}.")

rf_final = RandomForestRegressor(**hyper_parameters, random_state=9)
rf_final.fit(X_train, y_train)
y_pred = rf_final.predict(X_test)
mse_for_rf_hyper_prameters = mean_squared_error(y_test, y_pred)
print(f"The calculated MSE with Random Forest Regressor with hyperparameter tuning is {mse_for_rf_hyper_prameters}.")
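
With its defaults, `RandomizedSearchCV` samples 10 parameter combinations (`n_iter=10`) and ranks them with the estimator's own `score` method, which is R² for a regressor. Since the goal here is stated in terms of MSE, one option is to score the search on MSE directly. The sketch below shows this on synthetic data; the data, the smaller parameter ranges, and the names `X_demo`, `y_demo`, and `param_dist` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for X_train / y_train.
rng = np.random.default_rng(9)
X_demo = rng.normal(size=(200, 4))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=200)

param_dist = {
    "n_estimators": np.arange(5, 30, 5),
    "max_depth": np.arange(2, 8, 1),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=9),
    param_distributions=param_dist,
    n_iter=5,                          # number of sampled combinations
    scoring="neg_mean_squared_error",  # higher is better, so MSE is negated
    cv=3,
    random_state=9,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)  # cross-validated MSE of the best combination
```

Setting `random_state` on the estimator inside the search, as done here, also makes the search itself repeatable.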

In [25]:
best_model = rf_final
best_mse = mse_for_rf_hyper_prameters
print(f"The best model is {best_model}.")
print(f"The calculated MSE with Random Forest Regressor with hyperparameter tuning is {mse_for_rf_hyper_prameters}.")

The best model is RandomForestRegressor(max_depth=16, n_estimators=33, random_state=9).
The calculated MSE with Random Forest Regressor with hyperparameter tuning is 2.0252434990290515.
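
The three results can be lined up against the MSE target of 3 from the brief. The sketch below uses the MSE values printed in this notebook; the dictionary keys are labels chosen here for readability. Keep in mind that the first two models were trained on the Lasso-selected features and the tuned forest on all features, so the comparison is not strictly like for like.

```python
# MSE values copied from the outputs in this notebook.
results = {
    "linear_regression": 4.812297241276244,
    "random_forest_default": 3.640252985305122,
    "random_forest_tuned": 2.0252434990290515,
}
TARGET_MSE = 3  # threshold stated in the brief

best_name = min(results, key=results.get)
meets_target = [name for name, mse in results.items() if mse <= TARGET_MSE]

for name, mse in sorted(results.items(), key=lambda kv: kv[1]):
    status = "meets target" if mse <= TARGET_MSE else "above target"
    print(f"{name}: {mse:.3f} ({status})")
print(f"Best: {best_name}")
```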
