![dvd_image](dvd_image.jpg)

A DVD rental company needs your help! They want to figure out how many days a customer will rent a DVD for, based on some features, and have approached you for help. They want you to try out some regression models that predict the number of days a customer will rent a DVD. The company wants a model that yields an MSE of 3 or less on a test set. The model you build will help the company plan its inventory more efficiently.

The data they provided is in the CSV file `rental_info.csv`. It has the following features:
- `"rental_date"`: The date (and time) the customer rents the DVD.
- `"return_date"`: The date (and time) the customer returns the DVD.
- `"amount"`: The amount paid by the customer for renting the DVD.
- `"amount_2"`: The square of `"amount"`.
- `"rental_rate"`: The rate at which the DVD is rented.
- `"rental_rate_2"`: The square of `"rental_rate"`.
- `"release_year"`: The year the movie being rented was released.
- `"length"`: Length of the movie being rented, in minutes.
- `"length_2"`: The square of `"length"`.
- `"replacement_cost"`: The amount it will cost the company to replace the DVD.
- `"special_features"`: Any special features, for example trailers/deleted scenes that the DVD also has.
- `"NC-17"`, `"PG"`, `"PG-13"`, `"R"`: Dummy variables for the rating of the movie. Each takes the value 1 if the movie has the rating named by the column and 0 otherwise. For your convenience, the reference dummy has already been dropped.
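
To make the "reference dummy has been dropped" remark concrete, here is a minimal sketch on made-up ratings. It assumes the dropped reference category is `"G"`; the data description does not name it, so treat that as an illustration only.

```python
import pandas as pd

# Hypothetical ratings; "G" is ASSUMED to be the dropped reference category.
ratings = pd.Series(["G", "PG", "R", "NC-17", "PG-13", "G"], name="rating")

# One 0/1 column per rating, then drop the reference column by name.
dummies = pd.get_dummies(ratings, dtype=int).drop(columns="G")
print(dummies)

# A row of all zeros therefore stands for the reference rating.
all_zero_rows = dummies.index[dummies.sum(axis=1) == 0].tolist()
print(all_zero_rows)
```

Dropping one category avoids perfectly collinear columns in a linear model, since the four remaining dummies already determine the fifth.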

In [198]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor, GradientBoostingRegressor


# Import any additional modules and start coding below

In [199]:
df = pd.read_csv('rental_info.csv', parse_dates=['rental_date', 'return_date'])
df.head()

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401


In [200]:
df['release_year'] = df['release_year'].astype('int')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15861 entries, 0 to 15860
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   rental_date       15861 non-null  datetime64[ns, UTC]
 1   return_date       15861 non-null  datetime64[ns, UTC]
 2   amount            15861 non-null  float64            
 3   release_year      15861 non-null  int64              
 4   rental_rate       15861 non-null  float64            
 5   length            15861 non-null  float64            
 6   replacement_cost  15861 non-null  float64            
 7   special_features  15861 non-null  object             
 8   NC-17             15861 non-null  int64              
 9   PG                15861 non-null  int64              
 10  PG-13             15861 non-null  int64              
 11  R                 15861 non-null  int64              
 12  amount_2          15861 non-null  float64            
 13  l

In [201]:
df.describe()

Unnamed: 0,amount,release_year,rental_rate,length,replacement_cost,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2
count,15861.0,15861.0,15861.0,15861.0,15861.0,15861.0,15861.0,15861.0,15861.0,15861.0,15861.0,15861.0
mean,4.217161,2006.885379,2.944101,114.994578,20.224727,0.204842,0.200303,0.223378,0.198726,23.355504,14832.841876,11.389287
std,2.360383,2.025027,1.649766,40.114715,6.083784,0.403599,0.400239,0.416523,0.399054,23.503164,9393.431996,10.005293
min,0.99,2004.0,0.99,46.0,9.99,0.0,0.0,0.0,0.0,0.9801,2116.0,0.9801
25%,2.99,2005.0,0.99,81.0,14.99,0.0,0.0,0.0,0.0,8.9401,6561.0,0.9801
50%,3.99,2007.0,2.99,114.0,20.99,0.0,0.0,0.0,0.0,15.9201,12996.0,8.9401
75%,4.99,2009.0,4.99,148.0,25.99,0.0,0.0,0.0,0.0,24.9001,21904.0,24.9001
max,11.99,2010.0,4.99,185.0,29.99,1.0,1.0,1.0,1.0,143.7601,34225.0,24.9001


In [202]:
df["rental_length_days"] = (df['return_date'] - df['rental_date']).dt.total_seconds() / (24 * 60 * 60)
df["rental_length_days"].head()

0    3.865278
1    2.836806
2    7.238889
3    2.100000
4    4.045139
Name: rental_length_days, dtype: float64
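
Note that this cell keeps fractional days, while the later solution cell uses `.dt.days`, which keeps only the whole-day part. The two targets differ, so MSE values computed on them are not directly comparable. A small sketch of the difference, using the first row's timestamps from the output above:

```python
import pandas as pd

# Timestamps copied from the first row of the df.head() output.
rental = pd.Timestamp("2005-05-25 02:54:33", tz="UTC")
returned = pd.Timestamp("2005-05-28 23:40:33", tz="UTC")
duration = pd.Series([returned - rental])

# Fractional days: total seconds divided by the number of seconds in a day.
fractional_days = duration.dt.total_seconds() / (24 * 60 * 60)

# Whole days: the days component of the timedelta only.
whole_days = duration.dt.days

print(fractional_days.iloc[0], whole_days.iloc[0])
```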

In [203]:
df['special_features'].unique()

array(['{Trailers,"Behind the Scenes"}', '{Trailers}',
       '{Commentaries,"Behind the Scenes"}', '{Trailers,Commentaries}',
       '{"Deleted Scenes","Behind the Scenes"}',
       '{Commentaries,"Deleted Scenes","Behind the Scenes"}',
       '{Trailers,Commentaries,"Deleted Scenes"}',
       '{"Behind the Scenes"}',
       '{Trailers,"Deleted Scenes","Behind the Scenes"}',
       '{Commentaries,"Deleted Scenes"}', '{Commentaries}',
       '{Trailers,Commentaries,"Behind the Scenes"}',
       '{Trailers,"Deleted Scenes"}', '{"Deleted Scenes"}',
       '{Trailers,Commentaries,"Deleted Scenes","Behind the Scenes"}'],
      dtype=object)

In [204]:
df['deleted_scenes'] = np.where(df['special_features'].str.contains('Deleted Scenes'), 1, 0)
df['behind the scenes'] = np.where(df['special_features'].str.contains('Behind the Scenes'), 1, 0)

In [205]:
df['deleted_scenes'].value_counts()

0    7973
1    7888
Name: deleted_scenes, dtype: int64

In [206]:
df['behind the scenes'].value_counts()

1    8507
0    7354
Name: behind the scenes, dtype: int64
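
The flag construction can be checked on a handful of strings taken from the `unique()` output above. This sketch uses `.str.contains(..., regex=False)` with `.astype(int)` as an equivalent alternative to `np.where`; the names `features` and `flags` are illustrative only.

```python
import pandas as pd

# Three values copied from the special_features column.
features = pd.Series([
    '{Trailers,"Behind the Scenes"}',
    '{Trailers}',
    '{Commentaries,"Deleted Scenes"}',
])

# Plain substring match (regex=False), converted from bool to 0/1.
flags = pd.DataFrame({
    "deleted_scenes": features.str.contains("Deleted Scenes", regex=False).astype(int),
    "behind_the_scenes": features.str.contains("Behind the Scenes", regex=False).astype(int),
})
print(flags)
```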

In [207]:
X = df.drop(['rental_date', 'return_date', 'special_features', 'rental_length_days'], axis=1).values 
y = df["rental_length_days"].values

In [208]:
print(X.shape, y.shape)

(15861, 14) (15861,)


In [209]:
seed = 9
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

In [210]:
# Define the models and parameters
lr = LinearRegression()
dtr = DecisionTreeRegressor(random_state=seed)
rfr = RandomForestRegressor(random_state=seed)
br = BaggingRegressor(estimator=dtr, random_state=seed)
gbr = GradientBoostingRegressor(learning_rate=0.2, max_depth=7, n_estimators=87, random_state=9)

In [211]:
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)
lr_mse = mean_squared_error(y_test, lr_pred)
print(lr_mse)

2.724712176544294
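
The company's target is an MSE of 3 or less, so the linear model already meets it. As a reminder of what the metric measures, here is a small sketch on made-up numbers showing that `mean_squared_error` is the mean of the squared residuals:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical rental lengths (days) and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.5, 3.0, 6.0])

# Mean of squared residuals, computed by hand and with scikit-learn.
manual_mse = np.mean((y_true - y_pred) ** 2)
sklearn_mse = mean_squared_error(y_true, y_pred)

print(manual_mse, sklearn_mse)
```

Because the errors are squared, an MSE of 3 corresponds to a typical error of roughly 1.7 days.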


In [212]:
# dtr_params = {
#     'criterion': ['poisson', 'friedman_mse', 'squared_error', 'absolute_error'],
#     'max_depth':np.arange(1,11,1)
# }
# dtr_rscv = RandomizedSearchCV(estimator=dtr, param_distributions=dtr_params, cv=5, random_state=seed, n_jobs=-1)
# dtr_rscv.fit(X_train, y_train)
# dtr_best_params = dtr_rscv.best_params_
# dtr_best_model = dtr_rscv.best_estimator_
# print(dtr_best_params)

In [213]:
# rfr_params = {'n_estimators': np.arange(1,101,1),
#           'max_depth':np.arange(1,11,1)
# }
# rfr_rscv = RandomizedSearchCV(estimator=rfr, param_distributions=rfr_params, cv=5, random_state=seed, n_jobs=-1)
# rfr_rscv.fit(X_train, y_train)
# rfr_best_params = rfr_rscv.best_params_
# rfr_best_model = rfr_rscv.best_estimator_
# print(rfr_best_params)

In [214]:
# br_params = {
#     'n_estimators': np.arange(1,101,1)
# }
# br_rscv = RandomizedSearchCV(estimator=br, param_distributions=br_params, cv=5, random_state=seed, n_jobs=-1)
# br_rscv.fit(X_train, y_train)
# br_best_params = br_rscv.best_params_
# br_best_model = br_rscv.best_estimator_
# print(br_best_params)

In [215]:
# gbr_params = {
#     'n_estimators': np.arange(1,101,1),
#     'max_depth':np.arange(1,11,1),
#     'learning_rate': [0.01, 0.1, 0.2, 0.3],
# }
# gbr_rscv = RandomizedSearchCV(estimator=gbr, param_distributions=gbr_params, cv=5, random_state=seed, n_jobs=-1)
# gbr_rscv.fit(X_train, y_train)
# gbr_best_params = gbr_rscv.best_params_
# gbr_best_model = gbr_rscv.best_estimator_
# print(gbr_best_params)
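
The searches above are commented out because they are slow to rerun. A minimal runnable sketch of the same pattern follows, on synthetic data with a much smaller grid. The data, grid, and `n_iter` are assumptions chosen for speed. One detail worth knowing: without a `scoring` argument, `RandomizedSearchCV` ranks regressors by their default score (R²), so the sketch passes `scoring="neg_mean_squared_error"` to select on MSE directly.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the rental data (assumption, for illustration only).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=9)

# Small hypothetical search space so the example runs quickly.
param_dist = {"n_estimators": np.arange(10, 51, 10), "max_depth": np.arange(2, 7)}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=9),
    param_distributions=param_dist,
    n_iter=5,                          # number of sampled parameter settings
    cv=3,
    scoring="neg_mean_squared_error",  # higher is better, hence the negation
    random_state=9,
)
search.fit(X_demo, y_demo)

# best_score_ is the negated mean cross-validated MSE of the best setting.
cv_mse = -search.best_score_
print(search.best_params_, cv_mse)
```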

In [216]:
# mses = [lr_mse]
# models = [dtr_best_model, rfr_best_model, br_best_model, gbr_best_model]
# for model in models:
#     model.fit(X_train, y_train)
#     y_pred = model.predict(X_test)
#     mse = mean_squared_error(y_test, y_pred)
#     mses.append(f'{model} mse is {mse}')

In [217]:
mses

[2.724712176544294,
 "DecisionTreeRegressor(criterion='poisson', max_depth=8, random_state=9) mse is 2.2696562161125615",
 'RandomForestRegressor(max_depth=10, n_estimators=51, random_state=9) mse is 2.033207476501442',
 'BaggingRegressor(estimator=DecisionTreeRegressor(random_state=9),\n                 n_estimators=76, random_state=9) mse is 1.8185699040869152',
 'GradientBoostingRegressor(learning_rate=0.2, max_depth=7, n_estimators=87,\n                          random_state=9) mse is 1.6999264753480425']

In [218]:
# mses = [2.0180100862741534e-28, 3.574764119692528e-07, 4.682706976610501e-05, 1.0827765089265048e-07, 3.6588426863789486e-07]

In [219]:
gbr.fit(X_train, y_train)
gbr_y_pred = gbr.predict(X_test)
gbr_mse = mean_squared_error(y_test, gbr_y_pred)

In [220]:
gbr_mse

1.6999264753480425

In [221]:
best_model = GradientBoostingRegressor(learning_rate=0.2, max_depth=7, n_estimators=87, random_state=9)
best_mse = 1.6999264753480425

In [222]:
# For lasso
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler


# Read in data
df_rental = pd.read_csv("rental_info.csv")

# Add information on rental duration
df_rental["rental_length"] = pd.to_datetime(df_rental["return_date"]) - pd.to_datetime(df_rental["rental_date"])
df_rental["rental_length_days"] = df_rental["rental_length"].dt.days

### Add dummy variables
# Add dummy for deleted scenes
df_rental["deleted_scenes"] =  np.where(df_rental["special_features"].str.contains("Deleted Scenes"), 1, 0)
# Add dummy for behind the scenes
df_rental["behind_the_scenes"] =  np.where(df_rental["special_features"].str.contains("Behind the Scenes"), 1, 0)

# Choose columns to drop
cols_to_drop = ["special_features", "rental_length", "rental_length_days", "rental_date", "return_date"]

# Split into feature and target sets
X = df_rental.drop(cols_to_drop, axis=1)
y = df_rental["rental_length_days"]

# Further split into training and test data
X_train,X_test,y_train,y_test = train_test_split(X, 
                                                 y, 
                                                 test_size=0.2, 
                                                 random_state=9)

# Create the Lasso model
lasso = Lasso(alpha=0.3, random_state=9) 

# Train the model and access the coefficients
lasso.fit(X_train, y_train)
lasso_coef = lasso.coef_

# Perform feature selection by choosing columns with positive coefficients
X_lasso_train, X_lasso_test = X_train.iloc[:, lasso_coef > 0], X_test.iloc[:, lasso_coef > 0]

# Run an OLS model on the lasso-selected features
ols = LinearRegression()
ols = ols.fit(X_lasso_train, y_train)
y_test_pred = ols.predict(X_lasso_test)
mse_lin_reg_lasso = mean_squared_error(y_test, y_test_pred)

# Random forest hyperparameter space
param_dist = {'n_estimators': np.arange(1,101,1),
          'max_depth':np.arange(1,11,1)}

# Create a random forest regressor
rf = RandomForestRegressor()

# Use random search to find the best hyperparameters
rand_search = RandomizedSearchCV(rf, 
                                 param_distributions=param_dist, 
                                 cv=5, 
                                 random_state=9)

# Fit the random search object to the data
rand_search.fit(X_train, y_train)

# Store the best hyperparameters
hyper_params = rand_search.best_params_

# Run the random forest on the chosen hyper parameters
rf = RandomForestRegressor(n_estimators=hyper_params["n_estimators"], 
                           max_depth=hyper_params["max_depth"], 
                           random_state=9)
rf.fit(X_train,y_train)
rf_pred = rf.predict(X_test)
mse_random_forest= mean_squared_error(y_test, rf_pred)

# Random forest gives lowest MSE so:
best_model = rf
best_mse = mse_random_forest
print(best_mse)

2.225667528098759
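
One point about the lasso step above: it keeps only columns whose coefficient is positive, which also discards features with a useful negative effect. A common alternative is to keep every column with a nonzero coefficient. The sketch below contrasts the two masks on synthetic data. The data and `alpha` are assumptions, and it is not a rerun of the rental model.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(9)
X_demo = rng.normal(size=(500, 4))

# Hypothetical target: feature 0 raises it, feature 1 lowers it,
# features 2 and 3 are unrelated noise.
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=500)

lasso_demo = Lasso(alpha=0.1, random_state=9).fit(X_demo, y_demo)
coef = lasso_demo.coef_

keep_positive = coef > 0   # the rule used in the cell above
keep_nonzero = coef != 0   # alternative: keep anything lasso did not zero out

print(coef.round(2))
print(keep_positive.sum(), keep_nonzero.sum())
```

Here the positive-only mask drops feature 1 even though it carries signal, while the nonzero mask retains it.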
