![dvd_image](dvd_image.jpg)

A DVD rental company needs your help! They want to predict how many days a customer will rent a DVD for, based on some features, and have approached you for help. They want you to try out some regression models that predict the rental duration in days. The company wants a model that yields an MSE of 3 or less on a test set. The model you build will help the company plan its inventory more efficiently.
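
To make the MSE target concrete, here is a minimal sketch of how the metric is computed and checked against the threshold of 3. The rental lengths and the helper name `mean_squared_error_manual` are made up for illustration and are not part of the company's data or of scikit-learn.

```python
import numpy as np

def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Made-up rental lengths in days, for illustration only
actual = [3, 5, 2, 7]
predicted = [4, 5, 1, 5]

mse = mean_squared_error_manual(actual, predicted)
rmse = mse ** 0.5          # square root puts the error back in units of days
meets_target = mse <= 3    # the company's requirement

print(f"MSE: {mse}, RMSE: {rmse:.2f} days, meets target: {meets_target}")
```

Because MSE is measured in squared days, a target of 3 corresponds to a root mean squared error of roughly 1.73 days.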

The data they provided is in the CSV file `rental_info.csv`. It has the following features:
- `"rental_date"`: The date (and time) the customer rents the DVD.
- `"return_date"`: The date (and time) the customer returns the DVD.
- `"amount"`: The amount paid by the customer for renting the DVD.
- `"amount_2"`: The square of `"amount"`.
- `"rental_rate"`: The rate at which the DVD is rented.
- `"rental_rate_2"`: The square of `"rental_rate"`.
- `"release_year"`: The year the movie being rented was released.
- `"length"`: Length of the movie being rented, in minutes.
- `"length_2"`: The square of `"length"`.
- `"replacement_cost"`: The amount it will cost the company to replace the DVD.
- `"special_features"`: Any special features, for example trailers/deleted scenes that the DVD also has.
- `"NC-17"`, `"PG"`, `"PG-13"`, `"R"`: These columns are dummy variables for the rating of the movie. Each takes the value 1 if the movie has the rating named by the column and 0 otherwise. For your convenience, the reference dummy has already been dropped.
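
As a rough illustration of how the squared and dummy columns described above could have been derived, here is a small sketch on a hypothetical three-row table. The `rating` column and the choice of `"G"` as the dropped reference category are assumptions; the provided file already contains the derived columns, so none of this needs to be run on the real data.

```python
import pandas as pd

# Hypothetical mini-dataset; values are made up for illustration
toy = pd.DataFrame({
    "amount": [2.99, 4.99, 0.99],
    "rating": ["G", "PG-13", "R"],
})

# Squared term, analogous to "amount_2"
toy["amount_2"] = toy["amount"] ** 2

# One dummy column per non-reference rating.
# Assumption: "G" is the reference category, so it gets no column.
for rating in ["NC-17", "PG", "PG-13", "R"]:
    toy[rating] = (toy["rating"] == rating).astype(int)

toy = toy.drop(columns="rating")
print(toy)
```

A row belonging to the reference category has 0 in all four dummy columns, which is why dropping that column loses no information.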

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn.metrics import r2_score, mean_squared_error

In [2]:
df = pd.read_csv('rental_info.csv')
df_copy = df.copy()  

In [3]:
# Convert date columns to datetime
df_copy['rental_date'] = pd.to_datetime(df_copy['rental_date'], errors='coerce')
df_copy['return_date'] = pd.to_datetime(df_copy['return_date'], errors='coerce')

# Compute rental length in whole days (.dt.days keeps only the days component, dropping any partial day)
df_copy['rental_length_days'] = (df_copy['return_date'] - df_copy['rental_date']).dt.days

# Feature engineering for special features
df_copy['deleted_scenes'] = df_copy['special_features'].str.contains('Deleted Scenes', na=False).astype(int)
df_copy['behind_the_scenes'] = df_copy['special_features'].str.contains('Behind the Scenes', na=False).astype(int)

# Convert release year to integer
df_copy['release_year'] = df_copy['release_year'].astype(int)

In [4]:
# Prepare data for modeling
drop_columns = ["special_features", "rental_date", "return_date", "rental_length_days"]
X = df_copy.drop(columns=drop_columns)
y = df_copy['rental_length_days']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=9)

In [5]:
# Random Forest hyperparameter tuning
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

rfr_cv = GridSearchCV(
    estimator=RandomForestRegressor(random_state=9),
    param_grid=param_grid,
    cv=3,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1
)
rfr_cv.fit(X_train, y_train)

# Best model
best_rfr = rfr_cv.best_estimator_

# Evaluate model
y_pred = best_rfr.predict(X_test)
test_mse = mean_squared_error(y_test, y_pred)
test_r2 = r2_score(y_test, y_pred)

print("Best hyperparameters: ", rfr_cv.best_params_)
print("Test Mean Squared Error (MSE):", test_mse)
print("Test R-squared (R2):", test_r2)

Fitting 3 folds for each of 36 candidates, totalling 108 fits
Best hyperparameters:  {'max_depth': 20, 'min_samples_split': 2, 'n_estimators': 200}
Test Mean Squared Error (MSE): 2.0250206688633443
Test R-squared (R2): 0.7147685678932152


In [6]:
# Gradient Boosting hyperparameter tuning
param_grid_gbr = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 10]
}

gbr_cv = GridSearchCV(
    estimator=GradientBoostingRegressor(random_state=9),
    param_grid=param_grid_gbr,
    cv=3,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1
)
gbr_cv.fit(X_train, y_train)

# Best model
best_gbr = gbr_cv.best_estimator_

# Evaluate model
y_pred_gbr = best_gbr.predict(X_test)
test_mse_gbr = mean_squared_error(y_test, y_pred_gbr)
test_r2_gbr = r2_score(y_test, y_pred_gbr)

print("Best hyperparameters for GradientBoostingRegressor: ", gbr_cv.best_params_)
print("Test Mean Squared Error (MSE) for GradientBoostingRegressor:", test_mse_gbr)
print("Test R-squared (R2) for GradientBoostingRegressor:", test_r2_gbr)


Fitting 3 folds for each of 27 candidates, totalling 81 fits
Best hyperparameters for GradientBoostingRegressor:  {'learning_rate': 0.2, 'max_depth': 5, 'n_estimators': 200}
Test Mean Squared Error (MSE) for GradientBoostingRegressor: 1.9350113693585484
Test R-squared (R2) for GradientBoostingRegressor: 0.7274467009095523


In [7]:
# AdaBoost hyperparameter tuning
param_grid_abr = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.5, 1.0]
}

abr_cv = GridSearchCV(
    estimator=AdaBoostRegressor(random_state=9),
    param_grid=param_grid_abr,
    cv=3,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1
)
abr_cv.fit(X_train, y_train)

# Best model
best_abr = abr_cv.best_estimator_

# Evaluate model
y_pred_abr = best_abr.predict(X_test)
test_mse_abr = mean_squared_error(y_test, y_pred_abr)
test_r2_abr = r2_score(y_test, y_pred_abr)

print("Best hyperparameters for AdaBoostRegressor: ", abr_cv.best_params_)
print("Test Mean Squared Error (MSE) for AdaBoostRegressor:", test_mse_abr)
print("Test R-squared (R2) for AdaBoostRegressor:", test_r2_abr)


Fitting 3 folds for each of 12 candidates, totalling 36 fits
Best hyperparameters for AdaBoostRegressor:  {'learning_rate': 0.5, 'n_estimators': 50}
Test Mean Squared Error (MSE) for AdaBoostRegressor: 3.165301976186815
Test R-squared (R2) for AdaBoostRegressor: 0.5541558515425068


In [8]:
# Decision Tree Regressor hyperparameter tuning
param_grid_dtr = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

dtr_cv = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=9),
    param_grid=param_grid_dtr,
    cv=3,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1
)
dtr_cv.fit(X_train, y_train)

# Best model
best_dtr = dtr_cv.best_estimator_

# Evaluate model
y_pred_dtr = best_dtr.predict(X_test)
test_mse_dtr = mean_squared_error(y_test, y_pred_dtr)
test_r2_dtr = r2_score(y_test, y_pred_dtr)

print("Best hyperparameters for DecisionTreeRegressor: ", dtr_cv.best_params_)
print("Test Mean Squared Error (MSE) for DecisionTreeRegressor:", test_mse_dtr)
print("Test R-squared (R2) for DecisionTreeRegressor:", test_r2_dtr)


Fitting 3 folds for each of 12 candidates, totalling 36 fits
Best hyperparameters for DecisionTreeRegressor:  {'max_depth': 10, 'min_samples_split': 2}
Test Mean Squared Error (MSE) for DecisionTreeRegressor: 2.4508709814127108
Test R-squared (R2) for DecisionTreeRegressor: 0.654786022342331


In [9]:
# XGB Regressor hyperparameter tuning
param_grid_xgb = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 10]
}

xgb_cv = GridSearchCV(
    estimator=XGBRegressor(random_state=9, eval_metric='rmse'),
    param_grid=param_grid_xgb,
    cv=3,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1
)
xgb_cv.fit(X_train, y_train)

# Best model
best_xgb = xgb_cv.best_estimator_

# Evaluate model
y_pred_xgb = best_xgb.predict(X_test)
test_mse_xgb = mean_squared_error(y_test, y_pred_xgb)
test_r2_xgb = r2_score(y_test, y_pred_xgb)

print("Best hyperparameters for XGBRegressor: ", xgb_cv.best_params_)
print("Test Mean Squared Error (MSE) for XGBRegressor:", test_mse_xgb)
print("Test R-squared (R2) for XGBRegressor:", test_r2_xgb)


Fitting 3 folds for each of 27 candidates, totalling 81 fits
Best hyperparameters for XGBRegressor:  {'learning_rate': 0.2, 'max_depth': 5, 'n_estimators': 200}
Test Mean Squared Error (MSE) for XGBRegressor: 1.939168070855501
Test R-squared (R2) for XGBRegressor: 0.7268612145789453


In [10]:
best_model = best_gbr
best_mse = test_mse_gbr
print(f"The best model is {best_model}.")
print(f"The calculated MSE with Gradient Boosting Regressor with hyperparameter tuning is {best_mse}.")

The best model is GradientBoostingRegressor(learning_rate=0.2, max_depth=5, n_estimators=200,
                          random_state=9).
The calculated MSE with Gradient Boosting Regressor with hyperparameter tuning is 1.9350113693585484.
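
The final cell picks the winner by hand. As a sketch of how the same choice could be made from the scores, the snippet below compares the test MSE values reported in the outputs above (rounded to four decimals) against the company's target. The dictionary and the names `test_mse_by_model` and `MSE_TARGET` are illustrative; in the notebook itself the variables `test_mse`, `test_mse_gbr`, and so on could be used instead of the copied numbers.

```python
# Test-set MSE values copied from the printed results above, rounded to 4 decimals
test_mse_by_model = {
    "RandomForestRegressor": 2.0250,
    "GradientBoostingRegressor": 1.9350,
    "AdaBoostRegressor": 3.1653,
    "DecisionTreeRegressor": 2.4509,
    "XGBRegressor": 1.9392,
}

MSE_TARGET = 3  # the company's requirement: test MSE of 3 or less

# Model with the lowest test MSE
best_name = min(test_mse_by_model, key=test_mse_by_model.get)

# Models that satisfy the requirement, ordered from best to worst
passing = sorted(
    (name for name, mse in test_mse_by_model.items() if mse <= MSE_TARGET),
    key=test_mse_by_model.get,
)

print(f"Lowest test MSE: {best_name} ({test_mse_by_model[best_name]})")
print("Models meeting the target:", passing)
```

On these numbers, four of the five models meet the target, and Gradient Boosting edges out XGBoost by a small margin.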
