# DVD Rental Duration Prediction
## Objective

Predict the number of days a customer will rent a DVD (rental_length_days) using available features. The company aims for a model with test MSE ≤ 3 to improve inventory planning efficiency.
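
To make the target concrete, the sketch below computes MSE by hand and compares it with the threshold of 3 from the objective. The helper name `meets_target` and the toy values are assumptions for illustration only; they are not part of the original notebook.

```python
import numpy as np

MSE_TARGET = 3.0  # threshold stated in the objective


def meets_target(y_true, y_pred, target=MSE_TARGET):
    """Return (mse, mse <= target) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # MSE is the mean of the squared prediction errors
    mse = float(np.mean((y_true - y_pred) ** 2))
    return mse, mse <= target


# Toy rental durations in days (illustrative values, not from the dataset)
mse, ok = meets_target([3, 5, 7], [4, 5, 5])
print(f"MSE={mse:.3f}, meets target: {ok}")
```

Because MSE is in squared days, a threshold of 3 corresponds to a root-mean-squared error of about 1.73 days.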

## Dataset Overview

The dataset `rental_info.csv` contains the following features:

`rental_date`, `return_date` → Rental start and end dates

`amount`, `amount_2` → Amount paid and its square

`rental_rate`, `rental_rate_2` → Rental rate and its square

`release_year` → Year the movie was released

`length`, `length_2` → Movie length in minutes and its square

`replacement_cost` → Cost to replace the DVD

`special_features` → e.g., trailers, deleted scenes

`NC-17`, `PG`, `PG-13`, `R` → Movie rating dummy variables

Additional feature engineering:

`rental_length_days` → Target variable (days between rental and return)

`deleted_scenes`, `behind_the_scenes` → Dummy variables derived from `special_features`

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [2]:
# For Lasso (train_test_split is already imported above)
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

In [3]:
# OLS (mean_squared_error is already imported above)
from sklearn.linear_model import LinearRegression

# Random forest
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

In [5]:
# Read in data
df_rental = pd.read_csv("rental_info.csv")
df_rental.head(10)

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
5,2005-05-29 16:51:44+00:00,2005-06-01 21:43:44+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
6,2005-06-17 19:42:42+00:00,2005-06-22 20:39:42+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
7,2005-07-09 18:23:46+00:00,2005-07-13 19:04:46+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
8,2005-07-27 13:16:28+00:00,2005-07-28 13:40:28+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
9,2005-08-21 13:53:52+00:00,2005-08-25 09:03:52+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401


In [6]:
# Add information on rental duration
df_rental["rental_length"] = pd.to_datetime(df_rental["return_date"]) - pd.to_datetime(df_rental["rental_date"])
df_rental["rental_length_days"] = df_rental["rental_length"].dt.days
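
Note that `.dt.days` keeps only the whole-day component of each timedelta, so partial days are truncated. The sketch below illustrates this using the dates from the first row shown in the preview; the variable names are illustrative, and the fractional alternative is shown only for comparison, not as the notebook's method.

```python
import pandas as pd

# Dates copied from the first previewed row of rental_info.csv
sample = pd.DataFrame({
    "rental_date": ["2005-05-25 02:54:33+00:00"],
    "return_date": ["2005-05-28 23:40:33+00:00"],
})

delta = pd.to_datetime(sample["return_date"]) - pd.to_datetime(sample["rental_date"])

whole_days = delta.dt.days                          # whole-day component only
fractional_days = delta.dt.total_seconds() / 86400  # includes the partial day

print(delta.iloc[0])            # 3 days 20:46:00
print(whole_days.iloc[0])       # 3
print(round(fractional_days.iloc[0], 3))
```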

In [7]:
# Dummy variables for special features
df_rental["deleted_scenes"] = np.where(df_rental["special_features"].str.contains("Deleted Scenes"), 1, 0)
df_rental["behind_the_scenes"] = np.where(df_rental["special_features"].str.contains("Behind the Scenes"), 1, 0)
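
As a quick check of how the substring match behaves, the sketch below applies the same idea to three hand-written `special_features` strings in the format shown in the preview. It uses `.astype(int)` instead of `np.where` and passes `regex=False` and `na=False` to make literal matching and missing-value handling explicit; these are illustrative choices, not what the notebook does.

```python
import pandas as pd

# Hand-written examples in the same format as the special_features column
features = pd.Series([
    '{Trailers,"Behind the Scenes"}',
    '{"Deleted Scenes"}',
    '{Trailers}',
])

# Literal substring match; na=False treats missing values as "not present"
deleted_scenes = features.str.contains("Deleted Scenes", regex=False, na=False).astype(int)
behind_the_scenes = features.str.contains("Behind the Scenes", regex=False, na=False).astype(int)

print(deleted_scenes.tolist())     # [0, 1, 0]
print(behind_the_scenes.tolist())  # [1, 0, 0]
```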

In [8]:
# Choose columns to drop
cols_to_drop = ["special_features", "rental_length", "rental_length_days", "rental_date", "return_date"]


In [9]:
# Split into feature and target sets
X = df_rental.drop(cols_to_drop, axis=1)
y = df_rental["rental_length_days"]

# Further split into training and test data
X_train,X_test,y_train,y_test = train_test_split(X,
                                                 y,
                                                 test_size=0.2,
                                                 random_state=9)

In [10]:
# Create the Lasso model
lasso = Lasso(alpha=0.3, random_state=9)

In [11]:
# Train the model and access the coefficients
lasso.fit(X_train, y_train)
lasso_coef = lasso.coef_
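
The notebook imports `StandardScaler` but fits Lasso on the raw features. Because the L1 penalty acts on coefficient magnitudes, which features survive can depend on each column's units (for example, `length_2` is on a far larger scale than `rental_rate`). One possible variant, sketched here on synthetic data rather than `rental_info.csv`, standardizes the features inside a pipeline before the Lasso step. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative only)
rng = np.random.default_rng(9)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

# Standardize first so the penalty treats every feature on the same footing
scaled_lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.3, random_state=9))
scaled_lasso.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name
coef_scaled = scaled_lasso.named_steps["lasso"].coef_
print(coef_scaled)
```

Whether scaling changes the selected features for this dataset would need to be checked by rerunning the notebook; the reported results below come from the unscaled fit.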

In [12]:
# Perform feature selection by choosing columns with positive coefficients
X_lasso_train, X_lasso_test = X_train.iloc[:, lasso_coef > 0], X_test.iloc[:, lasso_coef > 0]
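
The mask `lasso_coef > 0` keeps only features with positive coefficients, so any feature that Lasso retains with a negative coefficient is dropped as well. The sketch below contrasts this with a nonzero mask on a hypothetical coefficient vector; the numbers and column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical Lasso coefficients: positive, shrunk to zero, negative
coef = np.array([0.6, 0.0, -0.4])
X_demo = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

positive_only = X_demo.iloc[:, coef > 0]   # the notebook's rule
nonzero = X_demo.iloc[:, coef != 0]        # keeps every feature Lasso retained

print(list(positive_only.columns))  # ['a']
print(list(nonzero.columns))        # ['a', 'c']
```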

In [13]:
# Fit OLS on the Lasso-selected features and evaluate on the test set
ols = LinearRegression()
ols = ols.fit(X_lasso_train, y_train)
y_test_pred = ols.predict(X_lasso_test)
mse_lin_reg_lasso = mean_squared_error(y_test, y_test_pred)


In [14]:
# Random forest hyperparameter space
param_dist = {'n_estimators': np.arange(1, 101, 1),
              'max_depth': np.arange(1, 11, 1)}

In [15]:
# Create a random forest regressor
rf = RandomForestRegressor()

# Use random search to find the best hyperparameters
rand_search = RandomizedSearchCV(rf,
                                 param_distributions=param_dist,
                                 cv=5,
                                 random_state=9)

In [16]:
# Fit the random search object to the data
rand_search.fit(X_train, y_train)

# Store the best hyperparameters found by the search
hyper_params = rand_search.best_params_
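
With its defaults, `RandomizedSearchCV` samples `n_iter=10` candidates and, when `scoring` is not set, ranks them with the estimator's own `score` method, which is R² for regressors. Since the stated goal is an MSE threshold, one option is to score the search on negative MSE directly. The sketch below shows this on small synthetic data with a reduced search space so it runs quickly; the data, the smaller ranges, and `n_iter=5` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic regression problem (illustrative only)
rng = np.random.default_rng(9)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo[:, 0] ** 2 + rng.normal(scale=0.1, size=120)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=9),
    param_distributions={
        "n_estimators": np.arange(1, 21),
        "max_depth": np.arange(1, 6),
    },
    n_iter=5,                          # number of sampled candidates
    scoring="neg_mean_squared_error",  # higher is better, so MSE is negated
    cv=3,
    random_state=9,
)
search.fit(X_demo, y_demo)

cv_mse = -search.best_score_  # flip the sign back to a plain MSE
print(search.best_params_, round(cv_mse, 3))
```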

In [17]:
# Refit the random forest with the chosen hyperparameters and evaluate on the test set
rf = RandomForestRegressor(n_estimators=hyper_params["n_estimators"],
                           max_depth=hyper_params["max_depth"],
                           random_state=9)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
mse_random_forest = mean_squared_error(y_test, rf_pred)

In [18]:
# The random forest gives the lowest test MSE, so keep it as the best model
best_model = rf
best_mse = mse_random_forest

In [19]:
from sklearn.metrics import r2_score

# Lasso + OLS
rmse_lin_reg_lasso = np.sqrt(mse_lin_reg_lasso)
r2_lin_reg_lasso = r2_score(y_test, y_test_pred)

# Random Forest
rmse_rf = np.sqrt(mse_random_forest)
r2_rf = r2_score(y_test, rf_pred)

print(f"Lasso + OLS: MSE={mse_lin_reg_lasso:.3f}, RMSE={rmse_lin_reg_lasso:.3f}, R²={r2_lin_reg_lasso:.3f}")
print(f"Random Forest: MSE={mse_random_forest:.3f}, RMSE={rmse_rf:.3f}, R²={r2_rf:.3f}")


Lasso + OLS: MSE=4.812, RMSE=2.194, R²=0.322
Random Forest: MSE=2.226, RMSE=1.492, R²=0.687
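
Since R² = 1 − MSE / (mean squared deviation of `y_test` from its mean), each reported pair implies a value for that baseline, which is the MSE of always predicting the test-set mean. The sketch below runs this arithmetic on the printed figures and checks each model against the target of 3; it uses only the rounded numbers above, so small differences between the two implied values are expected.

```python
# Figures copied from the printed results above (rounded to three decimals)
reported = {
    "Lasso + OLS": {"mse": 4.812, "r2": 0.322},
    "Random Forest": {"mse": 2.226, "r2": 0.687},
}
MSE_TARGET = 3.0  # threshold from the objective

summary = {}
for name, m in reported.items():
    # Rearranging R^2 = 1 - MSE / baseline gives baseline = MSE / (1 - R^2)
    implied_baseline = m["mse"] / (1.0 - m["r2"])
    summary[name] = {
        "implied_baseline_mse": implied_baseline,
        "meets_target": m["mse"] <= MSE_TARGET,
    }
    print(f"{name}: implied baseline MSE={implied_baseline:.2f}, "
          f"meets target: {summary[name]['meets_target']}")
```

Both models imply a baseline MSE of roughly 7.1, so the two rows are mutually consistent, and only the random forest falls under the threshold.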


In [20]:
def predict_rental_days(new_data):
    """
    Predict rental duration (days) using the best model (the random forest).
    new_data: pandas DataFrame with the same columns, in the same order, as X_train
    """
    return best_model.predict(new_data)

In [23]:
# Define new DVD data
new_dvd = pd.DataFrame({
    "amount": [5.99],
    "amount_2": [5.99**2],
    "rental_rate": [1.99],
    "rental_rate_2": [1.99**2],
    "release_year": [2022],
    "length": [120],
    "length_2": [120**2],
    "replacement_cost": [20],
    "NC-17": [0],
    "PG": [0],
    "PG-13": [1],
    "R": [0],
    "deleted_scenes": [1],
    "behind_the_scenes": [0]
})

# Ensure the new_dvd DataFrame has the same column order as X_train
new_dvd = new_dvd[X_train.columns]

# Predict rental days
predicted_days = predict_rental_days(new_dvd)
print(f"Predicted rental duration: {predicted_days[0]:.2f} days")

Predicted rental duration: 7.92 days
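
Reordering with `new_dvd[X_train.columns]` raises a bare `KeyError` if a feature is missing. A small guard can report the missing names before prediction. The helper below, `align_features`, is a hypothetical addition for illustration and is demonstrated on a stand-in column list rather than the real `X_train.columns`.

```python
import pandas as pd


def align_features(new_data, expected_columns):
    """Return new_data restricted to expected_columns, in that order.

    Raises ValueError naming any expected column that is absent.
    """
    missing = [c for c in expected_columns if c not in new_data.columns]
    if missing:
        raise ValueError(f"Missing feature columns: {missing}")
    return new_data[list(expected_columns)]


# Stand-in for X_train.columns (illustrative subset of the feature names)
expected = ["amount", "length", "PG-13"]
demo = pd.DataFrame({"PG-13": [1], "amount": [5.99], "length": [120], "extra": [0]})

aligned = align_features(demo, expected)
print(list(aligned.columns))  # ['amount', 'length', 'PG-13']
```

In the notebook, the same call would take `X_train.columns` as `expected_columns` before passing the result to `predict_rental_days`.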
