![dvd_image](./Images/dvd_image.jpg)

A DVD rental company needs your help! They want to predict how many days a customer will rent a DVD for, based on some features, and have approached you for help. They want you to try out some regression models that predict the rental duration in days. The company wants a model that yields an MSE of 3 or less on a test set. The model you build will help the company plan its inventory more efficiently.
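
To make the target concrete: MSE is the mean of the squared differences between actual and predicted rental durations, so an MSE of 3 corresponds to a root-mean-squared error of roughly 1.73 days. The sketch below uses made-up durations and made-up predictions (neither comes from `rental_info.csv`) to show the calculation by hand and with scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical rental durations (days) and hypothetical predictions
y_true = np.array([3, 2, 7, 2, 4])
y_hat = np.array([3, 3, 5, 2, 4])

# MSE by hand: mean of the squared errors
mse_manual = np.mean((y_true - y_hat) ** 2)

# The same quantity via scikit-learn
mse_sklearn = mean_squared_error(y_true, y_hat)

# Square root puts the error back in units of days
rmse = np.sqrt(mse_manual)

print(mse_manual, mse_sklearn, rmse)
```

Because errors are squared, a single prediction that is off by two days contributes four times as much as one that is off by one day.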

The data they provided is in the csv file `rental_info.csv`. It has the following features:
- `"rental_date"`: The date (and time) the customer rents the DVD.
- `"return_date"`: The date (and time) the customer returns the DVD.
- `"amount"`: The amount paid by the customer for renting the DVD.
- `"amount_2"`: The square of `"amount"`.
- `"rental_rate"`: The rate at which the DVD is rented.
- `"rental_rate_2"`: The square of `"rental_rate"`.
- `"release_year"`: The year the movie being rented was released.
- `"length"`: Length of the movie being rented, in minutes.
- `"length_2"`: The square of `"length"`.
- `"replacement_cost"`: The amount it will cost the company to replace the DVD.
- `"special_features"`: Any special features, for example trailers/deleted scenes that the DVD also has.
- `"NC-17"`, `"PG"`, `"PG-13"`, `"R"`: These columns are dummy variables for the rating of the movie. Each takes the value 1 if the movie has the rating named by the column and 0 otherwise. For your convenience, the reference dummy has already been dropped.
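
As an illustration of what "the reference dummy has already been dropped" means, here is a minimal sketch on hypothetical ratings. It assumes the dropped reference category is `"G"`, which the data description does not state explicitly.

```python
import pandas as pd

# Hypothetical ratings; "G" is ASSUMED to be the dropped reference category
ratings = pd.Series(["G", "PG", "PG-13", "R", "NC-17"], name="rating")

# drop_first=True drops the first category in sorted order ("G" here),
# leaving one 0/1 column per remaining rating
rating_dummies = pd.get_dummies(ratings, drop_first=True, dtype=int)

print(rating_dummies)
```

A row with zeros in all four columns then represents the reference rating, so no information is lost by dropping it.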

In [98]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression


In [99]:
rental_df = pd.read_csv('./Dataset/rental_info.csv')
rental_df.head()

rental_df['special_features'] = rental_df['special_features'].str.lower().str.replace(r'[{}"\']', '', regex=True)

special_features_dummies = rental_df['special_features'].str.get_dummies(sep=',')
special_features_dummies = special_features_dummies.rename(columns={'deleted scenes': 'deleted_scenes', 'behind the scenes': 'behind_the_scenes'})
special_features_dummies = special_features_dummies.drop(['commentaries','trailers'], axis=1)

rental_df = pd.concat([rental_df, special_features_dummies], axis=1)

rental_df['rental_date'] = pd.to_datetime(rental_df['rental_date'])
rental_df['return_date'] = pd.to_datetime(rental_df['return_date'])

rental_df['rental_length_days'] = (rental_df['return_date']-rental_df['rental_date']).dt.days
rental_df = rental_df.drop(['special_features','rental_date','return_date'], axis=1)
rental_df = rental_df.drop(['length_2','rental_rate_2','amount_2'], axis=1)
rental_df.head()

Unnamed: 0,amount,release_year,rental_rate,length,replacement_cost,NC-17,PG,PG-13,R,behind_the_scenes,deleted_scenes,rental_length_days
0,2.99,2005.0,2.99,126.0,16.99,0,0,0,1,1,0,3
1,2.99,2005.0,2.99,126.0,16.99,0,0,0,1,1,0,2
2,2.99,2005.0,2.99,126.0,16.99,0,0,0,1,1,0,7
3,2.99,2005.0,2.99,126.0,16.99,0,0,0,1,1,0,2
4,2.99,2005.0,2.99,126.0,16.99,0,0,0,1,1,0,4
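
The two less obvious steps in the cell above, parsing `special_features` and computing the rental length, can be checked on a toy frame. This is only a sketch: the rows are hypothetical, and the raw `special_features` format (braces, optional quotes, comma-separated) is an assumption inferred from the characters the cleaning regex strips.

```python
import pandas as pd

# Hypothetical rows; the raw special_features format is an ASSUMPTION
toy = pd.DataFrame({
    "special_features": ['{Trailers,"Behind the Scenes"}', '{"Deleted Scenes"}'],
    "rental_date": ["2005-05-24 22:00:00", "2005-05-25 08:00:00"],
    "return_date": ["2005-05-26 21:00:00", "2005-05-28 08:00:00"],
})

# Same cleaning as in the cell above: lowercase, then strip braces and quotes
cleaned = toy["special_features"].str.lower().str.replace(r'[{}"\']', "", regex=True)

# One 0/1 column per comma-separated feature
feature_dummies = cleaned.str.get_dummies(sep=",")

# .dt.days keeps only whole days, so 1 day 23 hours counts as 1
delta = pd.to_datetime(toy["return_date"]) - pd.to_datetime(toy["rental_date"])
whole_days = delta.dt.days

print(feature_dummies)
print(whole_days)
```

Note that `.dt.days` truncates partial days rather than rounding, which is why the target in this notebook is an integer number of days.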


In [100]:
rental_df.isna().sum()

amount                0
release_year          0
rental_rate           0
length                0
replacement_cost      0
NC-17                 0
PG                    0
PG-13                 0
R                     0
behind_the_scenes     0
deleted_scenes        0
rental_length_days    0
dtype: int64

In [101]:
X = rental_df.drop('rental_length_days', axis=1)
y = rental_df['rental_length_days']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=9, test_size=0.2)

In [111]:
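params_rf = {
    'n_estimators':[100, 350, 500],
    # NOTE: 'auto' was removed in scikit-learn 1.3, so every fit that uses it
    # fails and is scored as NaN; max_features=1.0 is the equivalent setting.
    'max_features':['log2', 'auto', 'sqrt'],
    'min_samples_leaf':[2,10,30]
}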

rf = RandomForestRegressor()

grid_rf = GridSearchCV(estimator=rf,
                       param_grid=params_rf,
                       scoring='neg_mean_squared_error',
                       cv=3,
                       verbose=1,
                       n_jobs=-1)

grid_rf.fit(X_train, y_train)

rf_model = grid_rf.best_estimator_

y_pred = rf_model.predict(X_test)

rf_mse = MSE(y_test,y_pred)

print('Best rf model: {}'.format(rf_model)) 
print('Test MSE of best model: {:.3f}'.format(rf_mse)) 

Fitting 3 folds for each of 27 candidates, totalling 81 fits


27 fits failed out of a total of 81.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
27 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Work\AI\.venv\Lib\site-packages\sklearn\model_selection\_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Work\AI\.venv\Lib\site-packages\sklearn\base.py", line 1382, in wrapper
    estimator._validate_params()
  File "c:\Work\AI\.venv\Lib\site-packages\sklearn\base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "c:\Work\AI\.venv\Lib\site-packages\sklearn\utils\_param_validation.py", line 98, in validate_parameter_constraints
    raise InvalidParameterError(
sklearn.utils._param_vali

Best rf model: RandomForestRegressor(max_features='log2', min_samples_leaf=2, n_estimators=500)
Test MSE of best model: 2.021
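
The 27 failed fits above come from `max_features='auto'`, which recent scikit-learn releases reject. A minimal sketch of a grid that avoids the failures is shown below. It runs on synthetic data from `make_regression` with a deliberately small grid, so the `*_demo` names and the scores it prints are illustrative only and say nothing about the rental data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (NOT the rental data)
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=5.0, random_state=9)

params_demo = {
    'n_estimators': [20],                   # kept small so the demo runs quickly
    'max_features': ['log2', 'sqrt', 1.0],  # 1.0 = use all features, replaces 'auto'
    'min_samples_leaf': [2, 10],
}

grid_demo = GridSearchCV(estimator=RandomForestRegressor(random_state=9),
                         param_grid=params_demo,
                         scoring='neg_mean_squared_error',
                         cv=3,
                         error_score='raise')  # fail loudly instead of scoring NaN

grid_demo.fit(X_demo, y_demo)
mean_scores = grid_demo.cv_results_['mean_test_score']

print('Best params: {}'.format(grid_demo.best_params_))
print('Best CV MSE: {:.3f}'.format(-grid_demo.best_score_))
```

Setting `error_score='raise'` is a design choice for debugging: an invalid parameter stops the search immediately rather than silently shrinking the grid.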


In [103]:
lasso = Lasso(alpha=0.3, random_state=9) 

lasso.fit(X_train, y_train)
lasso_coef = lasso.coef_

# Perform feature selection by choosing columns with positive coefficients
X_lasso_train, X_lasso_test = X_train.iloc[:, lasso_coef > 0], X_test.iloc[:, lasso_coef > 0]

# Fit an OLS model on the Lasso-selected features
lin_reg = LinearRegression()
lin_reg = lin_reg.fit(X_lasso_train, y_train)
y_test_pred = lin_reg.predict(X_lasso_test)
mse_lin_reg = MSE(y_test, y_test_pred)

print('Best lin_reg model: {}'.format(lin_reg)) 
print('Test MSE of best model: {:.3f}'.format(mse_lin_reg)) 

Best lin_reg model: LinearRegression()
Test MSE of best model: 4.842
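
One thing to be aware of in the cell above: the mask `lasso_coef > 0` also discards features that Lasso kept with a negative coefficient. A common alternative is to keep every feature with a nonzero coefficient. The sketch below compares the two masks on synthetic data. The column names, coefficients, and `alpha` are hypothetical, the MSE is computed in-sample for brevity, and it is not a claim about how the rental data would behave.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data: one positive effect, one negative effect, one irrelevant column
rng = np.random.default_rng(9)
X_toy = pd.DataFrame(rng.normal(size=(500, 3)), columns=["pos", "neg", "noise"])
y_toy = 3 * X_toy["pos"] - 2 * X_toy["neg"] + rng.normal(scale=0.1, size=500)

coef = Lasso(alpha=0.2).fit(X_toy, y_toy).coef_

positive_mask = coef > 0    # selection rule used in the notebook
nonzero_mask = coef != 0    # alternative: keep anything Lasso did not zero out

def ols_mse(mask):
    """Fit OLS on the masked columns and return the in-sample MSE."""
    cols = X_toy.loc[:, mask]
    model = LinearRegression().fit(cols, y_toy)
    return mean_squared_error(y_toy, model.predict(cols))

mse_positive = ols_mse(positive_mask)
mse_nonzero = ols_mse(nonzero_mask)

print(dict(zip(X_toy.columns, coef.round(2))))
print(mse_positive, mse_nonzero)
```

In this toy setup the positive-only mask drops the `neg` column even though it carries real signal, which is why the nonzero mask gives the lower error here.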


In [113]:
best_model = rf_model
best_mse = rf_mse

print('Best model: {}'.format(best_model)) 
print('Test MSE of best model: {:.3f}'.format(best_mse)) 


Best model: RandomForestRegressor(max_features='log2', min_samples_leaf=2, n_estimators=500)
Test MSE of best model: 2.021
