![dvd_image](dvd_image.jpg)

A DVD rental company needs your help! They want to predict how many days a customer will rent a DVD for, based on some features, and have approached you for help. They want you to try out some regression models for this prediction. The company wants a model that yields an MSE of 3 or less on a test set. The model you build will help the company plan its inventory more efficiently.

The data they provided is in the csv file `rental_info.csv`. It has the following features:
- `"rental_date"`: The date (and time) the customer rents the DVD.
- `"return_date"`: The date (and time) the customer returns the DVD.
- `"amount"`: The amount paid by the customer for renting the DVD.
- `"amount_2"`: The square of `"amount"`.
- `"rental_rate"`: The rate at which the DVD is rented.
- `"rental_rate_2"`: The square of `"rental_rate"`.
- `"release_year"`: The year the movie being rented was released.
- `"length"`: Length of the movie being rented, in minutes.
- `"length_2"`: The square of `"length"`.
- `"replacement_cost"`: The amount it will cost the company to replace the DVD.
- `"special_features"`: Any special features, for example trailers/deleted scenes that the DVD also has.
- `"NC-17"`, `"PG"`, `"PG-13"`, `"R"`: These columns are dummy variables for the rating of the movie. Each takes the value 1 if the movie has the rating given by the column name and 0 otherwise. For your convenience, the reference dummy has already been dropped.
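
To make the "reference dummy" idea concrete, here is a minimal sketch of how such rating columns could be produced. The `rating` column and its values are hypothetical, and treating `"G"` as the dropped reference category is an assumption based on the four columns listed above, not something the data description states.

```python
import pandas as pd

# Hypothetical raw ratings; the provided CSV already contains the dummy columns.
ratings = pd.DataFrame({"rating": ["G", "PG", "R", "NC-17", "PG-13"]})

# drop_first=True drops the first category in sorted order ("G" here),
# which then acts as the reference level: a row of all zeros means "G".
rating_dummies = pd.get_dummies(ratings["rating"], drop_first=True, dtype=int)
print(rating_dummies)
```

With a reference level dropped, the remaining columns are not perfectly collinear with an intercept, which matters for the linear models tried later.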

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Import any additional modules and start coding below
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

# Load dataset
data = pd.read_csv('rental_info.csv')
data.head()

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401


In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15861 entries, 0 to 15860
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   rental_date       15861 non-null  object 
 1   return_date       15861 non-null  object 
 2   amount            15861 non-null  float64
 3   release_year      15861 non-null  float64
 4   rental_rate       15861 non-null  float64
 5   length            15861 non-null  float64
 6   replacement_cost  15861 non-null  float64
 7   special_features  15861 non-null  object 
 8   NC-17             15861 non-null  int64  
 9   PG                15861 non-null  int64  
 10  PG-13             15861 non-null  int64  
 11  R                 15861 non-null  int64  
 12  amount_2          15861 non-null  float64
 13  length_2          15861 non-null  float64
 14  rental_rate_2     15861 non-null  float64
dtypes: float64(8), int64(4), object(3)
memory usage: 1.8+ MB


In [3]:
data["special_features"].value_counts()

special_features
{Trailers,Commentaries,"Behind the Scenes"}                     1308
{Trailers}                                                      1139
{Trailers,Commentaries}                                         1129
{Trailers,"Behind the Scenes"}                                  1122
{"Behind the Scenes"}                                           1108
{Commentaries,"Deleted Scenes","Behind the Scenes"}             1101
{Commentaries}                                                  1089
{Commentaries,"Behind the Scenes"}                              1078
{Trailers,"Deleted Scenes"}                                     1047
{"Deleted Scenes","Behind the Scenes"}                          1035
{"Deleted Scenes"}                                              1023
{Commentaries,"Deleted Scenes"}                                 1011
{Trailers,Commentaries,"Deleted Scenes","Behind the Scenes"}     983
{Trailers,Commentaries,"Deleted Scenes"}                         916
{Trailers,"Delete

In [4]:
# Data preprocessing
# Target: rental duration in whole days (.dt.days keeps only the full-day component)
data["rental_length_days"] = (pd.to_datetime(data["return_date"]) - pd.to_datetime(data["rental_date"])).dt.days
# Dummy variables for two special_features values: "Deleted Scenes" and "Behind the Scenes"
dummies = pd.DataFrame({
    "deleted_scenes": data["special_features"].apply(lambda x: 1 if "Deleted Scenes" in str(x) else 0),
    "behind_the_scenes": data["special_features"].apply(lambda x: 1 if "Behind the Scenes" in str(x) else 0)
})
data = pd.concat([data, dummies], axis=1)
data.head()

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2,rental_length_days,deleted_scenes,behind_the_scenes
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,3,0,1
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,2,0,1
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,7,0,1
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,2,0,1
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,4,0,1
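
The preprocessing step above can be checked on a tiny frame. The sketch below uses two date pairs copied from the preview; the second `special_features` value is swapped in for illustration, so the mini-frame as a whole is hypothetical. It shows that `.dt.days` discards the partial day, and that `str.contains` with `regex=False` is a vectorized alternative to the `apply`/`lambda` approach.

```python
import pandas as pd

# Hypothetical mini-frame: dates copied from the preview, features chosen for illustration.
sample = pd.DataFrame({
    "rental_date": ["2005-05-25 02:54:33+00:00", "2005-07-10 04:27:45+00:00"],
    "return_date": ["2005-05-28 23:40:33+00:00", "2005-07-17 10:11:45+00:00"],
    "special_features": ['{Trailers,"Behind the Scenes"}', '{Trailers,"Deleted Scenes"}'],
})

duration = pd.to_datetime(sample["return_date"]) - pd.to_datetime(sample["rental_date"])

# Whole days only: 3 days 20:46:00 becomes 3, not 4.
whole_days = duration.dt.days
# Fractional days, in case the partial day should be kept.
fractional_days = duration.dt.total_seconds() / 86400

# Vectorized substring match; regex=False treats the pattern as a literal string.
deleted_scenes = sample["special_features"].str.contains("Deleted Scenes", regex=False).astype(int)

print(whole_days.tolist(), fractional_days.round(2).tolist(), deleted_scenes.tolist())
```

Whether to predict whole or fractional days is a modelling choice; the notebook uses whole days.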


In [5]:
data["deleted_scenes"].value_counts()
# data["behind_the_scenes"].value_counts()

deleted_scenes
0    7973
1    7888
Name: count, dtype: int64

In [6]:
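# Helper functions to evaluate models.
# Both return RMSE; the brief's target of MSE <= 3 corresponds to RMSE <= sqrt(3) (about 1.73).
def evaluate_model(model, X_test, y_test):
    """Return the RMSE of a fitted model on the test set."""
    y_pred = model.predict(X_test)
    return np.sqrt(mean_squared_error(y_test, y_pred))


def train_and_evaluate(model, X_train, y_train, X_test, y_test):
    """Fit the model on the training set and return its test RMSE."""
    model.fit(X_train, y_train)
    return evaluate_model(model, X_test, y_test)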

In [12]:

# Define features and target variable
y = data["rental_length_days"]
X = data.drop(columns=["rental_length_days", "rental_date", "return_date", "special_features"])

# Split the data
SEED = 9
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)
print("Standard deviation of y_train:", np.std(y_train))


# Linear Regression
lr_model = LinearRegression()
lr_rmse = train_and_evaluate(lr_model, X_train, y_train, X_test, y_test)
print(f"Linear Regression RMSE: {lr_rmse}")

# SVR
svr_model = SVR()
svr_rmse = train_and_evaluate(svr_model, X_train, y_train, X_test, y_test)
print(f"SVR RMSE: {svr_rmse}")

# KNeighbors Regressor
knn_model = KNeighborsRegressor()
knn_rmse = train_and_evaluate(knn_model, X_train, y_train, X_test, y_test)
print(f"KNN RMSE: {knn_rmse}")

# Lasso Regression
lasso_model = Lasso()
lasso_rmse = train_and_evaluate(lasso_model, X_train, y_train, X_test, y_test)
print(f"Lasso RMSE: {lasso_rmse}")

# Ridge Regression
ridge_model = Ridge()
ridge_rmse = train_and_evaluate(ridge_model, X_train, y_train, X_test, y_test)
print(f"Ridge RMSE: {ridge_rmse}")

# Random Forest Regressor
rf_model = RandomForestRegressor(random_state=SEED)
rf_rmse = train_and_evaluate(rf_model, X_train, y_train, X_test, y_test)
print(f"Random Forest RMSE: {rf_rmse}")

# Pipeline with StandardScaler and RandomForestRegressor
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestRegressor(random_state=SEED))
])
pipeline_rmse = train_and_evaluate(pipeline, X_train, y_train, X_test, y_test)
print(f"Pipeline RMSE: {pipeline_rmse}")

Standard deviation of y_train: 2.6275863852272217
Linear Regression RMSE: 1.7151454354361766
SVR RMSE: 2.6722527139274197
KNN RMSE: 1.6414424870784798
Lasso RMSE: 1.9508173695313487
Ridge RMSE: 1.7151555457392789
Random Forest RMSE: 1.4248304837478998
Pipeline RMSE: 1.4249644005513116
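
The helper reports RMSE, while the company's target is stated as an MSE of 3 or less. A minimal sketch of the conversion, using the RMSE values printed above (the dictionary and the `MSE_TARGET` name are introduced here for illustration only):

```python
# RMSE values copied from the output above.
rmse_by_model = {
    "Linear Regression": 1.7151454354361766,
    "SVR": 2.6722527139274197,
    "KNN": 1.6414424870784798,
    "Lasso": 1.9508173695313487,
    "Ridge": 1.7151555457392789,
    "Random Forest": 1.4248304837478998,
    "Pipeline": 1.4249644005513116,
}

MSE_TARGET = 3.0  # threshold stated in the brief

for name, rmse in rmse_by_model.items():
    mse = rmse ** 2  # MSE is the square of RMSE
    status = "meets" if mse <= MSE_TARGET else "misses"
    print(f"{name}: MSE = {mse:.3f} ({status} the target)")

passing = [name for name, rmse in rmse_by_model.items() if rmse ** 2 <= MSE_TARGET]
```

These figures come from a single train/test split, so models sitting close to the threshold may land on either side of it with a different seed.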


In [8]:
# Stochastic Gradient Boosting Regressor hyperparameter tuning
gb_model = GradientBoostingRegressor(random_state=SEED)
param_grid = {
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.01, 0.1],
    "max_depth": [3, 5, 7],
    "subsample": [0.8, 0.9, 1.0],
    "max_features": [0.8, 0.9, 1.0],
}
gb_grid = GridSearchCV(gb_model, param_grid, cv=5, scoring="neg_mean_squared_error", n_jobs=-1)
gb_grid.fit(X_train, y_train)
print(f"Best parameters for Gradient Boosting Regressor: {gb_grid.best_params_}")

Best parameters for Gradient Boosting Regressor: {'learning_rate': 0.1, 'max_depth': 7, 'max_features': 1.0, 'n_estimators': 200, 'subsample': 0.9}


In [9]:
best_score = np.sqrt(-gb_grid.best_score_)
print(f"Best cross-validated RMSE: {best_score}")

Best cross-validated RMSE: 1.3696097770347528


In [10]:
# RandomForest Hyperparameter tuning 
rf_param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"]
}
rf_grid = GridSearchCV(rf_model, rf_param_grid, cv=5, scoring="neg_mean_squared_error", n_jobs=-1)
rf_grid.fit(X_train, y_train)
print(f"Best parameters for Random Forest Regressor: {rf_grid.best_params_}")
rf_best_rmse = np.sqrt(-rf_grid.best_score_)
print(f"Best cross-validated RMSE for Random Forest: {rf_best_rmse}")

Best parameters for Random Forest Regressor: {'max_depth': 20, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 300}
Best cross-validated RMSE for Random Forest: 1.416918958797803
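
The scores above are cross-validated on the training set, whereas the target is defined on a test set. One possible final step is to score the refit best estimator on the held-out split. The sketch below runs on synthetic stand-in data with a deliberately tiny grid so that it executes quickly; `X_demo`, `y_demo` and `demo_grid` are hypothetical names, and in the notebook the same pattern would apply to `gb_grid` or `rf_grid` with `X_test` and `y_test`.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption): replace with the notebook's X and y.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=9)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=9)

# Deliberately tiny grid so the sketch runs in seconds.
demo_grid = GridSearchCV(
    GradientBoostingRegressor(random_state=9),
    {"n_estimators": [20, 40], "max_depth": [2, 3]},
    cv=3,
    scoring="neg_mean_squared_error",
)
demo_grid.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ has been refit on all of X_tr.
cv_mse = -demo_grid.best_score_
test_mse = mean_squared_error(y_te, demo_grid.best_estimator_.predict(X_te))
print(f"CV MSE: {cv_mse:.3f}, test MSE: {test_mse:.3f}")
```

Scoring the test set only once, after model selection is finished, keeps the reported MSE an honest estimate rather than one tuned against the test data.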
