![dvd_image](dvd_image.jpg)

A DVD rental company needs your help! They want to predict how many days a customer will rent a DVD for, based on some features, and have approached you for help. They want you to try out some regression models for this prediction. The company wants a model that yields an MSE of 3 or less on a test set. The model you make will help the company become more efficient at inventory planning.

The data they provided is in the csv file `rental_info.csv`. It has the following features:
- `'rental_date'`: The date (and time) the customer rents the DVD.
- `'return_date'`: The date (and time) the customer returns the DVD.
- `'amount'`: The amount paid by the customer for renting the DVD.
- `'amount_2'`: The square of `'amount'`.
- `'rental_rate'`: The rate at which the DVD is rented for.
- `'rental_rate_2'`: The square of `'rental_rate'`.
- `'release_year'`: The year the movie being rented was released.
- `'length'`: Length of the movie being rented, in minutes.
- `'length_2'`: The square of `'length'`.
- `'replacement_cost'`: The amount it will cost the company to replace the DVD.
- `'special_features'`: Any special features, for example trailers/deleted scenes that the DVD also has.
- `'NC-17'`, `'PG'`, `'PG-13'`, `'R'`: These columns are dummy variables for the rating of the movie. Each takes the value 1 if the movie is rated as the column name and 0 otherwise. For your convenience, the reference dummy has already been dropped.
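
The feature list has no rental-length column, so the target has to be derived from `'rental_date'` and `'return_date'`. A minimal standard-library sketch of that arithmetic, using the timestamps of one row from the data (the function name `rental_days` is illustrative and not part of the project):

```python
from datetime import datetime, timezone

def rental_days(rental_date: datetime, return_date: datetime) -> float:
    """Return the rental length in (fractional) days."""
    return (return_date - rental_date).total_seconds() / 86_400

# Timestamps taken from the first row of the data
rented = datetime(2005, 5, 25, 2, 54, 33, tzinfo=timezone.utc)
returned = datetime(2005, 5, 28, 23, 40, 33, tzinfo=timezone.utc)

days = rental_days(rented, returned)
print(round(days, 4))  # 3.8653
```

The same interval is 5566 minutes, so dividing a minutes-based duration by 1440 gives the value in days.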

## Test

In [10]:
def sum_lst(num, pow=1):
    # Sum the decimal digits of num**pow
    return sum(int(digit) for digit in str(num**pow))

In [20]:
num = 10
lst = []
while len(lst)<25:
    if sum_lst(num, 1)==sum_lst(num, 2)==sum_lst(num, 3): lst.append(num)
    num += 1

## Code

In [87]:
import pandas as pd
import numpy as np
import datetime as dtt

import sklearn.model_selection as skms # import train_test_split
import sklearn.metrics as skme #import mean_squared_error
import sklearn.ensemble as sken
import sklearn.linear_model as sklm

# Import any additional modules and start coding below

In [68]:
df = pd.read_csv('rental_info.csv')
df['rental_date'] = pd.to_datetime(df['rental_date'])
df['return_date'] = pd.to_datetime(df['return_date'])
df['rental_duration_(mins)'] = (df['return_date']-df['rental_date']) / pd.Timedelta(minutes=1)
df.head()

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2,rental_duration_(mins)
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,5566.0
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,4085.0
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,10424.0
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,3024.0
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,5825.0


In [69]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15861 entries, 0 to 15860
Data columns (total 16 columns):
 #   Column                  Non-Null Count  Dtype              
---  ------                  --------------  -----              
 0   rental_date             15861 non-null  datetime64[ns, UTC]
 1   return_date             15861 non-null  datetime64[ns, UTC]
 2   amount                  15861 non-null  float64            
 3   release_year            15861 non-null  float64            
 4   rental_rate             15861 non-null  float64            
 5   length                  15861 non-null  float64            
 6   replacement_cost        15861 non-null  float64            
 7   special_features        15861 non-null  object             
 8   NC-17                   15861 non-null  int64              
 9   PG                      15861 non-null  int64              
 10  PG-13                   15861 non-null  int64              
 11  R                       15861 non-null  i

In [88]:
SEED = 1
rf = sken.RandomForestRegressor(random_state=SEED)

rf.get_params()

In [89]:
X = df[['amount', 'release_year', 'rental_rate', 'length', 'replacement_cost', 'NC-17', 'PG', 'PG-13', 'R']]
y = df['rental_duration_(mins)']

In [90]:
# using the train test split function 
X_train, X_test, y_train, y_test = skms.train_test_split(
    X, y,
    random_state=SEED,
    test_size=0.3,
    shuffle=True) 

In [122]:
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
mse = skme.mean_squared_error(y_test, y_pred)
rmse = mse**(1/2)
print(mse, rmse)

1.844923085894828 1.3582794579521653
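
For reference, MSE is the mean of the squared residuals and RMSE is its square root, so RMSE is expressed in the same units as the target. A small worked sketch with made-up numbers (not taken from the rental data):

```python
import math

# Illustrative values only
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.5, 2.0, 8.0]

# Mean of the squared residuals
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
mse = sum(squared_errors) / len(squared_errors)
rmse = math.sqrt(mse)

# The project brief asks for a test-set MSE of 3 or less
meets_target = mse <= 3
print(mse, meets_target)  # 0.375 True
```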


In [123]:
# Hyperparameter grid: 9 * 9 * 3 * 2 = 486 candidates
params_rf = {
    'n_estimators': [10, 20, 30, 40, 50, 60, 70, 80, 90],
    'max_depth': [2, 4, 6, 8, 10, 12, 14, 16, 18],
    'min_samples_leaf': [0.1, 0.2, 0.3],
    'max_features': ['log2', 'sqrt']
}

grid_rf = skms.GridSearchCV(
    estimator=rf,
    param_grid=params_rf,
    cv=5,
    scoring='neg_mean_squared_error',
    verbose=1,
    n_jobs=1
)

In [124]:
grid_rf.fit(X_train, y_train)

Fitting 5 folds for each of 486 candidates, totalling 2430 fits
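
The fit count in this log is the number of hyperparameter combinations multiplied by the number of folds. A quick standard-library check, assuming the four-parameter grid (`n_estimators`, `max_depth`, `min_samples_leaf`, `max_features`) listed in the search setup:

```python
from itertools import product

grid = {
    'n_estimators': [10, 20, 30, 40, 50, 60, 70, 80, 90],
    'max_depth': [2, 4, 6, 8, 10, 12, 14, 16, 18],
    'min_samples_leaf': [0.1, 0.2, 0.3],
    'max_features': ['log2', 'sqrt'],
}
cv_folds = 5

# Every combination of one value per hyperparameter is one candidate
candidates = list(product(*grid.values()))
n_candidates = len(candidates)
total_fits = n_candidates * cv_folds
print(n_candidates, total_fits)  # 486 2430
```

By the same arithmetic, a two-parameter grid with 100 values of `n_estimators` and 10 values of `max_depth` would give 1000 candidates, or 5000 fits at `cv=5`.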


In [116]:
grid_rf.best_params_

{'max_depth': 4,
 'max_features': 'log2',
 'min_samples_leaf': 0.1,
 'n_estimators': 80}

In [102]:
grid_rf.best_params_

{'max_depth': 4,
 'max_features': 'log2',
 'min_samples_leaf': 0.1,
 'n_estimators': 10}

In [117]:
best_model = grid_rf.best_estimator_
y_pred = best_model.predict(X_test)
mse = skme.mean_squared_error(y_test, y_pred)
rmse = mse**(1/2)
print(mse, rmse)

4.174860294374446 2.0432474873040825


In [105]:
lr = sklm.LinearRegression()
lr.get_params()

{'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'positive': False}

In [108]:
# Start your coding from below
import pandas as pd
import numpy as np

import sklearn.model_selection as skms
import sklearn.metrics as skme
import sklearn.linear_model as sklm
import sklearn.preprocessing as skpp
import sklearn.ensemble as sken

In [120]:

# Read in data and make appropriate manipulations
df = pd.read_csv('rental_info.csv')
df['rental_date'] = pd.to_datetime(df['rental_date'])
df['return_date'] = pd.to_datetime(df['return_date'])
df['rental_duration_days'] = (df['return_date']-df['rental_date']) / pd.Timedelta(days=1)

df['deleted_scenes'] =  np.where(df['special_features'].str.contains('Deleted Scenes'), 1, 0)
df['behind_the_scenes'] =  np.where(df['special_features'].str.contains('Behind the Scenes'), 1, 0)


X = df.drop(['special_features', 'rental_duration_days', 'rental_date', 'return_date'], axis=1)
y = df['rental_duration_days']

X_train,X_test,y_train,y_test = skms.train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=1
)
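
The two `np.where(... .str.contains(...))` lines in this cell reduce to a substring test per row. A plain-Python sketch on the `special_features` string shown in the `df.head()` output (the helper name `has_feature` is hypothetical):

```python
def has_feature(special_features: str, feature: str) -> int:
    """Return 1 if the feature name occurs in the raw string, else 0."""
    return int(feature in special_features)

# Raw value as it appears in the first rows of the data
raw = '{Trailers,"Behind the Scenes"}'

deleted_scenes = has_feature(raw, 'Deleted Scenes')
behind_the_scenes = has_feature(raw, 'Behind the Scenes')
print(deleted_scenes, behind_the_scenes)  # 0 1
```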

In [121]:
# Create Random Forest Regression Model
rf = sken.RandomForestRegressor(random_state=1)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
mse = skme.mean_squared_error(y_test, y_pred)
rmse = mse**(1/2)
print(mse, rmse)

1.844923085894828 1.3582794579521653


In [113]:
# Lasso model
lasso = sklm.Lasso(alpha=0.3, random_state=9) 

# Train the model and access the coefficients
lasso.fit(X_train, y_train)
lasso_coef = lasso.coef_

# Perform feature selection by choosing columns with positive coefficients
X_lasso_train, X_lasso_test = X_train.iloc[:, lasso_coef > 0], X_test.iloc[:, lasso_coef > 0]

print(lasso_coef) #X_lasso_train, X_lasso_test)

[ 5.96845798e-01 -0.00000000e+00 -0.00000000e+00  0.00000000e+00
 -0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
 -0.00000000e+00  4.19008078e-02  1.65194560e-06 -1.51962291e-01
 -0.00000000e+00  0.00000000e+00]
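
To see which features the `lasso_coef > 0` mask keeps, the printed coefficients can be lined up with the column names. The column order below is an assumption: it follows the `df.head()` output, minus the dropped columns, with the two engineered dummies appended last.

```python
# Assumed column order of X (see the note above)
columns = ['amount', 'release_year', 'rental_rate', 'length',
           'replacement_cost', 'NC-17', 'PG', 'PG-13', 'R',
           'amount_2', 'length_2', 'rental_rate_2',
           'deleted_scenes', 'behind_the_scenes']

# Coefficients copied from the printed output
lasso_coef = [5.96845798e-01, -0.0, -0.0, 0.0,
              -0.0, 0.0, 0.0, 0.0,
              -0.0, 4.19008078e-02, 1.65194560e-06, -1.51962291e-01,
              -0.0, 0.0]

positive = [c for c, w in zip(columns, lasso_coef) if w > 0]
nonzero = [c for c, w in zip(columns, lasso_coef) if w != 0]
print(positive)  # ['amount', 'amount_2', 'length_2']
print(nonzero)   # ['amount', 'amount_2', 'length_2', 'rental_rate_2']
```

Under that assumed ordering, the `> 0` mask drops `rental_rate_2` even though Lasso gave it a nonzero (negative) coefficient; a `!= 0` mask would keep it.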


In [None]:

# Fit OLS on the Lasso-selected features
ols = sklm.LinearRegression()
ols = ols.fit(X_lasso_train, y_train)
y_test_pred = ols.predict(X_lasso_test)
mse_lin_reg_lasso = skme.mean_squared_error(y_test, y_test_pred)

# Random forest hyperparameter space
param_dist = {
    'n_estimators': np.arange(1,101,1),
    'max_depth':np.arange(1,11,1)
}

# Create a random forest regressor
rf = sken.RandomForestRegressor()

# Use random search to find the best hyperparameters
rand_search = skms.RandomizedSearchCV(
    rf,
    param_distributions=param_dist,
    cv=5,
    random_state=9)

# Fit the random search object to the data
rand_search.fit(X_train, y_train)

# Create a variable for the best hyper param
hyper_params = rand_search.best_params_

# Run the random forest on the chosen hyper parameters
rf = sken.RandomForestRegressor(n_estimators=hyper_params['n_estimators'], 
                           max_depth=hyper_params['max_depth'], 
                           random_state=9)
rf.fit(X_train,y_train)
rf_pred = rf.predict(X_test)
mse_random_forest = skme.mean_squared_error(y_test, rf_pred)

# Random forest gives lowest MSE so:
best_model = rf
best_mse = mse_random_forest
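
Unlike a grid search, a randomized search evaluates only a sample of the candidate combinations (scikit-learn's `RandomizedSearchCV` defaults to `n_iter=10`). A rough standard-library illustration of that idea, which does not reproduce scikit-learn's own sampler:

```python
import random
from itertools import product

n_estimators = range(1, 101)   # mirrors np.arange(1, 101, 1)
max_depth = range(1, 11)       # mirrors np.arange(1, 11, 1)

# Full search space: every (n_estimators, max_depth) pair
all_candidates = list(product(n_estimators, max_depth))

# Seed echoes random_state=9, but the draws will differ from sklearn's
rng = random.Random(9)
sampled = rng.sample(all_candidates, k=10)

print(len(all_candidates), len(sampled))  # 1000 10
```

With `cv=5`, ten sampled candidates mean 50 fits instead of the 5000 an exhaustive search over the same space would need.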

In [None]:

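# Imports for the bare names used in this cell
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error

# Lasso model
lasso = Lasso(alpha=0.3, random_state=9)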

# Train the model and access the coefficients
lasso.fit(X_train, y_train)
lasso_coef = lasso.coef_

# Perform feature selection by choosing columns with positive coefficients
X_lasso_train, X_lasso_test = X_train.iloc[:, lasso_coef > 0], X_test.iloc[:, lasso_coef > 0]

# Fit OLS on the Lasso-selected features
ols = LinearRegression()
ols = ols.fit(X_lasso_train, y_train)
y_test_pred = ols.predict(X_lasso_test)
mse_lin_reg_lasso = mean_squared_error(y_test, y_test_pred)

# Random forest hyperparameter space
param_dist = {'n_estimators': np.arange(1,101,1),
          'max_depth':np.arange(1,11,1)}

# Create a random forest regressor
rf = RandomForestRegressor()

# Use random search to find the best hyperparameters
rand_search = RandomizedSearchCV(rf, 
                                 param_distributions=param_dist, 
                                 cv=5, 
                                 random_state=9)

# Fit the random search object to the data
rand_search.fit(X_train, y_train)

# Create a variable for the best hyper param
hyper_params = rand_search.best_params_

# Run the random forest on the chosen hyper parameters
rf = RandomForestRegressor(n_estimators=hyper_params['n_estimators'], 
                           max_depth=hyper_params['max_depth'], 
                           random_state=9)
rf.fit(X_train,y_train)
rf_pred = rf.predict(X_test)
mse_random_forest= mean_squared_error(y_test, rf_pred)

# Random forest gives lowest MSE so:
best_model = rf
best_mse = mse_random_forest

In [107]:
best_mse

2.225667528098759