![dvd_image](dvd_image.jpg)

A DVD rental company needs your help! They want to predict how many days a customer will rent a DVD for, based on some features, and have approached you for help. They want you to try out some regression models for this prediction. The company wants a model that yields an MSE of 3 or less on a test set. The model you build will help the company plan its inventory more efficiently.
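As a reminder of what the target metric measures, here is a minimal sketch that computes the mean squared error by hand and compares it with `mean_squared_error`. The arrays `y_true` and `y_hat` are made-up values for illustration, not data from `rental_info.csv`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted rental lengths in days (illustration only)
y_true = np.array([3, 2, 7, 2, 4])
y_hat = np.array([4, 2, 5, 3, 4])

# MSE is the mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)
mse_sklearn = mean_squared_error(y_true, y_hat)

# The company's requirement: MSE of 3 or less on the test set
meets_target = mse_manual <= 3
print(mse_manual, mse_sklearn, meets_target)
```

Because errors are squared, MSE is expressed in days squared, so a threshold of 3 corresponds to a typical error of roughly 1.7 days.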

The data they provided is in the csv file `rental_info.csv`. It has the following features:
- `"rental_date"`: The date (and time) the customer rents the DVD.
- `"return_date"`: The date (and time) the customer returns the DVD.
- `"amount"`: The amount paid by the customer for renting the DVD.
- `"amount_2"`: The square of `"amount"`.
- `"rental_rate"`: The rate at which the DVD is rented.
- `"rental_rate_2"`: The square of `"rental_rate"`.
- `"release_year"`: The year the movie being rented was released.
- `"length"`: Length of the movie being rented, in minutes.
- `"length_2"`: The square of `"length"`.
- `"replacement_cost"`: The amount it will cost the company to replace the DVD.
- `"special_features"`: Any special features, for example trailers/deleted scenes that the DVD also has.
- `"NC-17"`, `"PG"`, `"PG-13"`, `"R"`: These columns are dummy variables for the rating of the movie. Each takes the value 1 if the movie is rated as the column name and 0 otherwise. For your convenience, the reference dummy has already been dropped.

In [70]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Import any additional modules and start coding below
rental_info = pd.read_csv("rental_info.csv")
rental_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15861 entries, 0 to 15860
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   rental_date       15861 non-null  object 
 1   return_date       15861 non-null  object 
 2   amount            15861 non-null  float64
 3   release_year      15861 non-null  float64
 4   rental_rate       15861 non-null  float64
 5   length            15861 non-null  float64
 6   replacement_cost  15861 non-null  float64
 7   special_features  15861 non-null  object 
 8   NC-17             15861 non-null  int64  
 9   PG                15861 non-null  int64  
 10  PG-13             15861 non-null  int64  
 11  R                 15861 non-null  int64  
 12  amount_2          15861 non-null  float64
 13  length_2          15861 non-null  float64
 14  rental_rate_2     15861 non-null  float64
dtypes: float64(8), int64(4), object(3)
memory usage: 1.8+ MB


In [71]:
rental_info.head()

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401


In [72]:
rental_info['rental_length_days'] = (pd.to_datetime(rental_info['return_date']) - pd.to_datetime(rental_info['rental_date'])).dt.days
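The `.dt.days` accessor keeps only the whole-day component of each timedelta, so partial days are truncated. The sketch below illustrates this on the first two rows shown in `rental_info.head()`; the `toy` frame and the `fractional_days` alternative are added here for illustration only.

```python
import pandas as pd

# First two rows from the head() output above, copied by hand
toy = pd.DataFrame({
    "rental_date": ["2005-05-25 02:54:33+00:00", "2005-06-15 23:19:16+00:00"],
    "return_date": ["2005-05-28 23:40:33+00:00", "2005-06-18 19:24:16+00:00"],
})

delta = pd.to_datetime(toy["return_date"]) - pd.to_datetime(toy["rental_date"])

# Whole days only: 3 days 20:46:00 becomes 3
whole_days = delta.dt.days

# Alternative that keeps the fractional part of the day
fractional_days = delta.dt.total_seconds() / 86400
print(whole_days.tolist(), fractional_days.round(2).tolist())
```

The notebook uses the truncated whole-day value as the target; the fractional version is shown only to make the truncation visible.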

In [73]:
rental_info.head()

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2,rental_length_days
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,3
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,2
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,7
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,2
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,4


In [74]:
rental_info['special_features'].unique()

array(['{Trailers,"Behind the Scenes"}', '{Trailers}',
       '{Commentaries,"Behind the Scenes"}', '{Trailers,Commentaries}',
       '{"Deleted Scenes","Behind the Scenes"}',
       '{Commentaries,"Deleted Scenes","Behind the Scenes"}',
       '{Trailers,Commentaries,"Deleted Scenes"}',
       '{"Behind the Scenes"}',
       '{Trailers,"Deleted Scenes","Behind the Scenes"}',
       '{Commentaries,"Deleted Scenes"}', '{Commentaries}',
       '{Trailers,Commentaries,"Behind the Scenes"}',
       '{Trailers,"Deleted Scenes"}', '{"Deleted Scenes"}',
       '{Trailers,Commentaries,"Deleted Scenes","Behind the Scenes"}'],
      dtype=object)

In [75]:
rental_info['deleted_scenes'] = np.where(rental_info['special_features'].str.contains('Deleted Scenes'),1,0)

In [76]:
rental_info['behind_the_scenes'] = np.where(rental_info['special_features'].str.contains('Behind the Scenes'),1,0)
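The two cells above follow the same pattern, so they could also be written as a loop. The following is a sketch on a small hand-made Series rather than the real column; the names `toy`, `features`, and `flags` are hypothetical.

```python
import pandas as pd

# A few values taken from the unique() output above
toy = pd.Series([
    '{Trailers,"Behind the Scenes"}',
    '{Trailers}',
    '{Commentaries,"Deleted Scenes"}',
])

features = ["Deleted Scenes", "Behind the Scenes"]

# One 0/1 column per feature; regex=False treats the name as a literal substring
flags = pd.DataFrame({
    name.lower().replace(" ", "_"): toy.str.contains(name, regex=False).astype(int)
    for name in features
})
print(flags)
```

Casting the boolean mask with `.astype(int)` gives the same 0/1 values as `np.where(mask, 1, 0)`.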

In [77]:
rental_info.tail()

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2,rental_length_days,deleted_scenes,behind_the_scenes
15856,2005-08-22 10:49:15+00:00,2005-08-29 09:52:15+00:00,6.99,2009.0,4.99,88.0,11.99,"{Trailers,""Deleted Scenes"",""Behind the Scenes""}",0,0,0,1,48.8601,7744.0,24.9001,6,1,1
15857,2005-07-31 09:48:49+00:00,2005-08-04 10:53:49+00:00,4.99,2009.0,4.99,88.0,11.99,"{Trailers,""Deleted Scenes"",""Behind the Scenes""}",0,0,0,1,24.9001,7744.0,24.9001,4,1,1
15858,2005-08-20 10:35:30+00:00,2005-08-29 13:03:30+00:00,8.99,2009.0,4.99,88.0,11.99,"{Trailers,""Deleted Scenes"",""Behind the Scenes""}",0,0,0,1,80.8201,7744.0,24.9001,9,1,1
15859,2005-07-31 13:10:20+00:00,2005-08-08 14:07:20+00:00,7.99,2009.0,4.99,88.0,11.99,"{Trailers,""Deleted Scenes"",""Behind the Scenes""}",0,0,0,1,63.8401,7744.0,24.9001,8,1,1
15860,2005-08-18 06:33:55+00:00,2005-08-24 07:14:55+00:00,5.99,2009.0,4.99,88.0,11.99,"{Trailers,""Deleted Scenes"",""Behind the Scenes""}",0,0,0,1,35.8801,7744.0,24.9001,6,1,1


In [78]:
rental_info.columns

Index(['rental_date', 'return_date', 'amount', 'release_year', 'rental_rate',
       'length', 'replacement_cost', 'special_features', 'NC-17', 'PG',
       'PG-13', 'R', 'amount_2', 'length_2', 'rental_rate_2',
       'rental_length_days', 'deleted_scenes', 'behind_the_scenes'],
      dtype='object')

In [79]:
X = rental_info.drop(['rental_date', 'return_date', 'special_features', 'rental_length_days'], axis=1)
y = rental_info['rental_length_days']

In [80]:
X.shape

(15861, 14)

In [81]:
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=9, test_size=0.2)

In [82]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)
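The scaler is fitted on the training split only and then reused on the test split, so no test-set statistics leak into preprocessing. A tiny sketch with made-up one-column arrays shows the effect: the test value is standardized with the training mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data (illustration only)
X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_test_toy = np.array([[4.0]])

# Learn mean and standard deviation from the training data only
scaler_toy = StandardScaler().fit(X_train_toy)

scaled_train = scaler_toy.transform(X_train_toy)
scaled_test = scaler_toy.transform(X_test_toy)

# The test point is scaled with the training mean (2.0) and std (about 0.816)
print(scaler_toy.mean_, scaler_toy.scale_, scaled_test)
```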

In [83]:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.3, random_state=9)
lasso.fit(scaled_X_train, y_train)
lasso_coef = lasso.coef_

In [84]:
corr_metrics=X_train.corr()
corr_metrics.style.background_gradient()

Unnamed: 0,amount,release_year,rental_rate,length,replacement_cost,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2,deleted_scenes,behind_the_scenes
amount,1.0,0.01983,0.6868,0.017117,-0.031871,0.003028,-0.011695,0.014675,-0.007605,0.956008,0.016956,0.680358,-0.015622,-0.027357
release_year,0.01983,1.0,0.033244,0.032653,0.076798,0.032917,-0.022774,0.027612,-0.054273,0.014666,0.032343,0.020543,0.016546,-0.00148
rental_rate,0.6868,0.033244,1.0,0.052193,-0.069525,0.03442,0.001673,0.021151,-0.033447,0.588439,0.05046,0.982641,-0.04885,-0.010457
length,0.017117,0.032653,0.052193,1.0,0.030187,-0.02964,-0.048758,0.058511,0.066766,0.01454,0.987603,0.048146,0.003815,0.006715
replacement_cost,-0.031871,0.076798,-0.069525,0.030187,1.0,9.9e-05,-0.078286,0.049361,0.01121,-0.022557,0.033414,-0.070865,0.054474,0.007692
NC-17,0.003028,0.032917,0.03442,-0.02964,9.9e-05,1.0,-0.255736,-0.271201,-0.254238,0.001976,-0.028587,0.036707,0.02092,0.032937
PG,-0.011695,-0.022774,0.001673,-0.048758,-0.078286,-0.255736,1.0,-0.269231,-0.252391,-0.013965,-0.053306,9.5e-05,0.055481,-0.019984
PG-13,0.014675,0.027612,0.021151,0.058511,0.049361,-0.271201,-0.269231,1.0,-0.267653,0.010048,0.064254,0.021301,-0.025032,0.001457
R,-0.007605,-0.054273,-0.033447,0.066766,0.01121,-0.254238,-0.252391,-0.267653,1.0,-0.007234,0.057132,-0.033793,-0.041991,0.001593
amount_2,0.956008,0.014666,0.588439,0.01454,-0.022557,0.001976,-0.013965,0.010048,-0.007234,1.0,0.014529,0.598102,-0.001385,-0.019094
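The matrix above shows that each squared feature is strongly correlated with its base feature (for example 0.956 for `amount` and `amount_2`). A sketch of how such pairs could be listed programmatically follows; it runs on synthetic data, and the 0.9 threshold and all variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: one feature, its square, and unrelated noise
rng = np.random.default_rng(9)
x = rng.uniform(1, 10, size=200)
toy = pd.DataFrame({"x": x, "x_2": x ** 2, "noise": rng.normal(size=200)})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (feature, feature) pairs and keep the strongly correlated ones
pairs = upper.stack()
strong_pairs = pairs[pairs > 0.9]
print(strong_pairs)
```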


In [85]:
lasso_coef

array([ 1.78833765,  0.        , -0.8209491 ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        , -0.        ,  0.        ,
        0.        , -0.12245781, -0.        ,  0.        ])
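The coefficient array is easier to read when paired with the column names. The sketch below copies the coefficients printed above and the column order of `X` (the `rental_info.columns` output minus the four dropped columns) by hand; in the notebook itself, `pd.Series(lasso_coef, index=X.columns)` would presumably do the same job.

```python
import numpy as np
import pandas as pd

# Column order of X, copied from the notebook output
columns = [
    "amount", "release_year", "rental_rate", "length", "replacement_cost",
    "NC-17", "PG", "PG-13", "R", "amount_2", "length_2", "rental_rate_2",
    "deleted_scenes", "behind_the_scenes",
]

# Lasso coefficients as printed above
coef = np.array([1.78833765, 0.0, -0.8209491, 0.0, 0.0,
                 0.0, 0.0, 0.0, -0.0, 0.0,
                 0.0, -0.12245781, -0.0, 0.0])

# Keep only the features whose coefficient was not shrunk to zero
selected = pd.Series(coef, index=columns)
selected = selected[selected != 0]
print(selected)
```

With `alpha=0.3`, only `amount`, `rental_rate`, and `rental_rate_2` survive, which is why the filtered training matrix in the next cell has three columns.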

In [86]:
scaled_X_train[:,lasso_coef!=0]

array([[-0.52115626,  0.02498507, -0.24802528],
       [ 0.32713504, -1.18497897, -1.04188023],
       [ 1.17542634,  1.23494911,  1.34367384],
       ...,
       [ 0.75128069,  1.23494911,  1.34367384],
       [-0.52115626,  0.02498507, -0.24802528],
       [ 0.75128069,  0.02498507, -0.24802528]])

In [87]:
X_lasso_train, X_lasso_test = scaled_X_train[:,lasso_coef!=0], scaled_X_test[:, lasso_coef!=0]

In [88]:
from sklearn.linear_model import LinearRegression
LR = LinearRegression()
LR.fit(X_lasso_train, y_train)
LR_pred = LR.predict(X_lasso_test)

In [89]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
# Search space: number of trees and maximum tree depth
param = {'n_estimators': np.arange(1, 101, 1), 'max_depth': np.arange(1, 11, 1)}
forest = RandomForestRegressor()
# Sample hyperparameter combinations at random, scored with 5-fold cross-validation
rand_search = RandomizedSearchCV(forest, param_distributions=param, cv=5, random_state=9)
rand_search.fit(X_lasso_train, y_train)
hyper_params = rand_search.best_params_
# Refit a forest with the best hyperparameters found, then predict on the test set
rf = RandomForestRegressor(n_estimators=hyper_params["n_estimators"], max_depth=hyper_params["max_depth"], random_state=9)
rf.fit(X_lasso_train, y_train)
rf_pred = rf.predict(X_lasso_test)
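As a possible variation on the cell above, `RandomizedSearchCV` refits the best configuration on the full training data by default (`refit=True`), so `best_estimator_` can be used directly instead of rebuilding the forest. The sketch below uses synthetic data from `make_regression` and a smaller search space to keep it quick; the `scoring` choice and all `*_toy` names are assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for the three Lasso-selected features
X_toy, y_toy = make_regression(n_samples=200, n_features=3, noise=0.5, random_state=9)

param_toy = {"n_estimators": np.arange(5, 31, 5), "max_depth": np.arange(1, 6)}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=9),  # seed the forest so the search is reproducible
    param_distributions=param_toy,
    n_iter=5,
    cv=3,
    scoring="neg_mean_squared_error",  # select on the same metric used for evaluation
    random_state=9,
)
search.fit(X_toy, y_toy)

# Already refit on all of X_toy with the best hyperparameters
best_rf = search.best_estimator_
cv_mse = -search.best_score_
print(search.best_params_, cv_mse)
```

Note that the refit estimator inherits the seed passed to the base forest, whereas the notebook's `forest` is unseeded during the search and only the final `rf` uses `random_state=9`.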

In [90]:
# mean_squared_error and numpy were already imported in the first cell
LR_mse = mean_squared_error(y_test, LR_pred)
forest_mse = mean_squared_error(y_test, rf_pred)


In [91]:
best_model=rf

In [92]:
best_mse = forest_mse
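The notebook does not print `LR_mse` or `forest_mse`, so the choice of `rf` as the best model cannot be checked from the output shown. One way to make the selection explicit is sketched below. The MSE values in `mse_by_model` are placeholders invented for illustration, not the notebook's results; in practice they would be replaced with `LR_mse` and `forest_mse`.

```python
# Placeholder test-set MSEs (illustration only, NOT the notebook's actual results)
mse_by_model = {"linear_regression": 4.0, "random_forest": 2.5}

# Pick the model name with the lowest MSE
best_name = min(mse_by_model, key=mse_by_model.get)
best_value = mse_by_model[best_name]

# Check the company's requirement of an MSE of 3 or less
meets_target = best_value <= 3
print(best_name, best_value, meets_target)
```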