![dvd_image](dvd_image.jpg)

A DVD rental company needs your help! They want to predict how many days a customer will rent a DVD for, based on some features, and have approached you for help. They want you to try out some regression models for this prediction. The company wants a model which yields an MSE of 3 or less on a test set. The model you make will help the company plan its inventory more efficiently.
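
The evaluation helpers later in this notebook report RMSE rather than MSE, so it may help to restate the target on that scale. A minimal sketch (the names `MSE_TARGET`, `RMSE_TARGET` and `meets_target` are illustrative and not part of the original notebook):

```python
import math

MSE_TARGET = 3.0  # the company's requirement: test MSE of 3 or less

# RMSE is the square root of MSE, so the same requirement on the RMSE scale is:
RMSE_TARGET = math.sqrt(MSE_TARGET)

def meets_target(rmse: float, mse_target: float = MSE_TARGET) -> bool:
    """Return True if a model with this RMSE satisfies the MSE requirement."""
    return rmse ** 2 <= mse_target

print(f"RMSE target: {RMSE_TARGET:.4f}")  # RMSE target: 1.7321
```

Any model whose test RMSE is at or below roughly 1.73 therefore satisfies the MSE requirement.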

The data they provided is in the csv file `rental_info.csv`. It has the following features:
- `"rental_date"`: The date (and time) the customer rents the DVD.
- `"return_date"`: The date (and time) the customer returns the DVD.
- `"amount"`: The amount paid by the customer for renting the DVD.
- `"amount_2"`: The square of `"amount"`.
- `"rental_rate"`: The rate at which the DVD is rented.
- `"rental_rate_2"`: The square of `"rental_rate"`.
- `"release_year"`: The year the movie being rented was released.
- `"length"`: Length of the movie being rented, in minutes.
- `"length_2"`: The square of `"length"`.
- `"replacement_cost"`: The amount it will cost the company to replace the DVD.
- `"special_features"`: Any special features, for example trailers or deleted scenes, that the DVD also has.
- `"NC-17"`, `"PG"`, `"PG-13"`, `"R"`: These columns are dummy variables for the rating of the movie. Each takes the value 1 if the movie has the rating in the column name and 0 otherwise. For your convenience, the reference dummy has already been dropped.

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
from itertools import product

# Scikit-learn imports
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Regression models
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, VotingRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

# Load dataset
data = pd.read_csv('rental_info.csv')
print("Dataset loaded successfully!")
print(f"Dataset shape: {data.shape}")
data.head()

Dataset loaded successfully!
Dataset shape: (15861, 15)


Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401


# 1. Data Exploration

In [2]:
# Dataset overview
print("Dataset Information:")
data.info()
print("\nDataset Description:")
data.describe()

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15861 entries, 0 to 15860
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   rental_date       15861 non-null  object 
 1   return_date       15861 non-null  object 
 2   amount            15861 non-null  float64
 3   release_year      15861 non-null  float64
 4   rental_rate       15861 non-null  float64
 5   length            15861 non-null  float64
 6   replacement_cost  15861 non-null  float64
 7   special_features  15861 non-null  object 
 8   NC-17             15861 non-null  int64  
 9   PG                15861 non-null  int64  
 10  PG-13             15861 non-null  int64  
 11  R                 15861 non-null  int64  
 12  amount_2          15861 non-null  float64
 13  length_2          15861 non-null  float64
 14  rental_rate_2     15861 non-null  float64
dtypes: float64(8), int64(4), object(3)
memory usage: 1.8+ MB

Dataset Description:

Unnamed: 0,amount,release_year,rental_rate,length,replacement_cost,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2
count,15861.0,15861.0,15861.0,15861.0,15861.0,15861.0,15861.0,15861.0,15861.0,15861.0,15861.0,15861.0
mean,4.217161,2006.885379,2.944101,114.994578,20.224727,0.204842,0.200303,0.223378,0.198726,23.355504,14832.841876,11.389287
std,2.360383,2.025027,1.649766,40.114715,6.083784,0.403599,0.400239,0.416523,0.399054,23.503164,9393.431996,10.005293
min,0.99,2004.0,0.99,46.0,9.99,0.0,0.0,0.0,0.0,0.9801,2116.0,0.9801
25%,2.99,2005.0,0.99,81.0,14.99,0.0,0.0,0.0,0.0,8.9401,6561.0,0.9801
50%,3.99,2007.0,2.99,114.0,20.99,0.0,0.0,0.0,0.0,15.9201,12996.0,8.9401
75%,4.99,2009.0,4.99,148.0,25.99,0.0,0.0,0.0,0.0,24.9001,21904.0,24.9001
max,11.99,2010.0,4.99,185.0,29.99,1.0,1.0,1.0,1.0,143.7601,34225.0,24.9001


In [3]:
# Explore special features column
print("Special Features Distribution:")
print(data["special_features"].value_counts())
print(f"\nUnique special features: {data['special_features'].nunique()}")

Special Features Distribution:
special_features
{Trailers,Commentaries,"Behind the Scenes"}                     1308
{Trailers}                                                      1139
{Trailers,Commentaries}                                         1129
{Trailers,"Behind the Scenes"}                                  1122
{"Behind the Scenes"}                                           1108
{Commentaries,"Deleted Scenes","Behind the Scenes"}             1101
{Commentaries}                                                  1089
{Commentaries,"Behind the Scenes"}                              1078
{Trailers,"Deleted Scenes"}                                     1047
{"Deleted Scenes","Behind the Scenes"}                          1035
{"Deleted Scenes"}                                              1023
{Commentaries,"Deleted Scenes"}                                 1011
{Trailers,Commentaries,"Deleted Scenes","Behind the Scenes"}     983
{Trailers,Commentaries,"Deleted Scenes"}               
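
Each value is a brace-delimited set of feature names, so a single row can carry several features at once. One possible way to turn such strings into 0/1 indicator columns is a literal substring test per feature. The sketch below uses a tiny hand-made frame (`sample` is illustrative, not the real data) and pandas' `Series.str.contains` with `regex=False`:

```python
import pandas as pd

# Illustrative rows in the same format as the special_features column
sample = pd.DataFrame({
    "special_features": [
        '{Trailers,"Behind the Scenes"}',
        '{Commentaries,"Deleted Scenes"}',
        '{Trailers}',
    ]
})

# Indicator column name -> label to look for inside the string
features = {
    "deleted_scenes": "Deleted Scenes",
    "behind_the_scenes": "Behind the Scenes",
    "commentaries": "Commentaries",
    "trailers": "Trailers",
}

# Literal (non-regex) substring match, cast from bool to 0/1
for column, label in features.items():
    sample[column] = (
        sample["special_features"].str.contains(label, regex=False).astype(int)
    )

print(sample.drop(columns="special_features"))
```

This works here because no feature label is a substring of another; with overlapping labels, splitting the string into exact tokens would be safer.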

# 2. Data Preprocessing

In [4]:
# Create target variable: rental length in days
data["rental_length_days"] = (pd.to_datetime(data["return_date"]) - pd.to_datetime(data["rental_date"])).dt.days

# Create dummy variables for special_features column
dummies = pd.DataFrame({
    "deleted_scenes": data["special_features"].apply(lambda x: 1 if "Deleted Scenes" in str(x) else 0),
    "behind_the_scenes": data["special_features"].apply(lambda x: 1 if "Behind the Scenes" in str(x) else 0),
    "commentaries": data["special_features"].apply(lambda x: 1 if "Commentaries" in str(x) else 0),
    "trailers": data["special_features"].apply(lambda x: 1 if "Trailers" in str(x) else 0),
})
data = pd.concat([data, dummies], axis=1)

print("Preprocessing completed!")
print(f"New dataset shape: {data.shape}")
print(f"\nTarget variable (rental_length_days) statistics:")
print(data["rental_length_days"].describe())
data.head()

Preprocessing completed!
New dataset shape: (15861, 20)

Target variable (rental_length_days) statistics:
count    15861.000000
mean         4.525944
std          2.635108
min          0.000000
25%          2.000000
50%          5.000000
75%          7.000000
max          9.000000
Name: rental_length_days, dtype: float64


Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2,rental_length_days,deleted_scenes,behind_the_scenes,commentaries,trailers
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,3,0,1,0,1
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,2,0,1,0,1
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,7,0,1,0,1
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,2,0,1,0,1
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,4,0,1,0,1
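
Note that `.dt.days` keeps only the whole-day component of each timedelta, so partial days are dropped rather than rounded. A small check using the timestamps from the first row shown above (the variable names are illustrative):

```python
import pandas as pd

rental = pd.to_datetime("2005-05-25 02:54:33+00:00")
returned = pd.to_datetime("2005-05-28 23:40:33+00:00")

delta = returned - rental                        # 3 days 20:46:00
whole_days = delta.days                          # whole-day component only
fractional_days = delta.total_seconds() / 86400  # duration including the partial day

print(delta, whole_days, round(fractional_days, 2))
```

The rental lasted about 3.87 days, but the target records it as 3, which matches `rental_length_days` for that row.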


# 3. Model Training and Evaluation

## 3.1 Data Preparation and Helper Functions

In [5]:
# Helper functions for model evaluation
def train_and_evaluate(model, X_train, y_train, X_test, y_test):
    """Train a model and return its RMSE on test set"""
    model.fit(X_train, y_train)
    return evaluate_model(model, X_test, y_test)

def evaluate_model(model, X_test, y_test):
    """Evaluate model and return RMSE"""
    y_pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    return rmse

# Prepare features and target
y = data["rental_length_days"]
X = data.drop(columns=["rental_length_days", "rental_date", "return_date", "special_features"])

# Train-test split
SEED = 9
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)
y_train_std = np.std(y_train)
print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")
print(f"Target variable standard deviation: {y_train_std:.2f}")

Training set size: (12688, 16)
Test set size: (3173, 16)
Target variable standard deviation: 2.63
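
The target's standard deviation gives a useful reference point: a model that always predicts the training mean has an MSE roughly equal to the target variance. A back-of-the-envelope sketch using the rounded value printed above (so the figures are approximate, and the variable names are illustrative):

```python
y_std = 2.63      # rounded standard deviation printed above
mse_target = 3.0  # company requirement

# MSE of a constant "predict the mean" model is approximately the variance
naive_mse = y_std ** 2

# Fraction of variance a model must explain to reach the target (an R^2-style figure)
required_r2 = 1 - mse_target / naive_mse

print(f"Naive MSE ~ {naive_mse:.2f}, required R^2 ~ {required_r2:.2f}")
# Naive MSE ~ 6.92, required R^2 ~ 0.57
```

In other words, a model needs to explain a little over half of the variance in rental length to meet the requirement.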


## 3.2 Baseline Models

In [6]:
# Train and evaluate baseline models
models = {
    'Linear Regression': LinearRegression(),
    'Lasso Regression': Lasso(random_state=SEED),
    'Ridge Regression': Ridge(random_state=SEED),
    'Random Forest': RandomForestRegressor(random_state=SEED),
    'SVR': SVR(),
    'K-Nearest Neighbors': KNeighborsRegressor()
}

baseline_results = {}
print("Baseline Model Results (RMSE):")
print("-" * 40)
for name, model in models.items():
    rmse = train_and_evaluate(model, X_train, y_train, X_test, y_test)
    baseline_results[name] = {'RMSE': rmse}
    print(f"{name}: {rmse:.4f}")

# Find best baseline model
best_baseline = min(baseline_results.items(), key=lambda x: x[1]['RMSE'])
print(f"\nBest baseline model: {best_baseline[0]} (RMSE: {best_baseline[1]['RMSE']:.4f})")

Baseline Model Results (RMSE):
----------------------------------------
Linear Regression: 1.7150
Lasso Regression: 1.9508
Ridge Regression: 1.7150
Random Forest: 1.4223
SVR: 2.6723
K-Nearest Neighbors: 1.6392

Best baseline model: Random Forest (RMSE: 1.4223)
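
SVR and K-Nearest Neighbors depend on distances between feature vectors, so features on very different scales (for example `length_2` in the thousands next to 0/1 dummies) can dominate the fit. The imports at the top include `StandardScaler` and `Pipeline`, but the baselines here are fit on unscaled data. A hedged sketch of how scaling could be wired in, shown on synthetic data (`X_demo` and `y_demo` are made up, and no result on the rental data is claimed):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(9)

# Synthetic features on very different scales (made up for illustration)
X_demo = np.column_stack([
    rng.uniform(0, 1, 400),         # small-scale feature
    rng.uniform(2000, 35000, 400),  # large-scale feature
])
y_demo = 3 * X_demo[:, 0] + X_demo[:, 1] / 10000 + rng.normal(0, 0.1, 400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=9
)

scaled_models = {
    "SVR (scaled)": Pipeline([("scaler", StandardScaler()), ("model", SVR())]),
    "KNN (scaled)": Pipeline([("scaler", StandardScaler()), ("model", KNeighborsRegressor())]),
}

demo_rmse = {}
for name, pipe in scaled_models.items():
    # The scaler is fit on the training split only, then applied to the test split
    pipe.fit(X_tr, y_tr)
    residuals = y_te - pipe.predict(X_te)
    demo_rmse[name] = float(np.sqrt(np.mean(residuals ** 2)))
    print(f"{name}: {demo_rmse[name]:.4f}")
```

Wrapping the scaler in a pipeline keeps the test split out of the scaling statistics, which also matters once cross-validation is involved.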


## 3.3 Hyperparameter Tuning

### Gradient Boosting Optimization

In [7]:
# Gradient Boosting Hyperparameter Tuning
print("Tuning Gradient Boosting hyperparameters...")
gb_param_grid = {
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.01, 0.1],
    "max_depth": [3, 5, 7],
    "subsample": [0.8, 0.9, 1.0],
    "max_features": [0.8, 0.9, 1.0]
}

gb_grid = GridSearchCV(
    GradientBoostingRegressor(random_state=SEED), 
    gb_param_grid, 
    cv=5, 
    scoring="neg_mean_squared_error", 
    n_jobs=-1
)
gb_grid.fit(X_train, y_train)

gb_best_rmse = np.sqrt(-gb_grid.best_score_)
print(f"Best Gradient Boosting parameters: {gb_grid.best_params_}")
print(f"Best cross-validated RMSE: {gb_best_rmse:.4f}")
print(f"Best cross-validated MSE: {gb_best_rmse**2:.4f}")

Tuning Gradient Boosting hyperparameters...
Best Gradient Boosting parameters: {'learning_rate': 0.1, 'max_depth': 7, 'max_features': 0.9, 'n_estimators': 200, 'subsample': 0.9}
Best cross-validated RMSE: 1.3675
Best cross-validated MSE: 1.8699
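
An exhaustive grid search fits every combination of the listed values once per fold, so the cost grows multiplicatively. A quick count for the Gradient Boosting grid above, using `itertools.product` (the names `cv_folds`, `n_candidates` and `n_fits` are illustrative):

```python
from itertools import product

gb_param_grid = {
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.01, 0.1],
    "max_depth": [3, 5, 7],
    "subsample": [0.8, 0.9, 1.0],
    "max_features": [0.8, 0.9, 1.0],
}
cv_folds = 5

# Every combination of one value per hyperparameter
combinations = list(product(*gb_param_grid.values()))
n_candidates = len(combinations)
n_fits = n_candidates * cv_folds  # excludes the final refit on the full training set

print(n_candidates, n_fits)  # 162 810
```

With several hundred fits, `n_jobs=-1` is what keeps this search practical; a randomized search would be one option if the grid grew further.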


### Random Forest Optimization

In [8]:
# Random Forest Hyperparameter Tuning
print("Tuning Random Forest hyperparameters...")
rf_param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"]
}

rf_grid = GridSearchCV(
    RandomForestRegressor(random_state=SEED), 
    rf_param_grid, 
    cv=5, 
    scoring="neg_mean_squared_error", 
    n_jobs=-1
)
rf_grid.fit(X_train, y_train)

rf_best_rmse = np.sqrt(-rf_grid.best_score_)
print(f"Best Random Forest parameters: {rf_grid.best_params_}")
print(f"Best cross-validated RMSE: {rf_best_rmse:.4f}")
print(f"Best cross-validated MSE: {rf_best_rmse**2:.4f}")

Tuning Random Forest hyperparameters...
Best Random Forest parameters: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
Best cross-validated RMSE: 1.4131
Best cross-validated MSE: 1.9967
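
scikit-learn scorers follow a higher-is-better convention, so `neg_mean_squared_error` returns the negated MSE and `best_score_` is the mean of those negated values across folds. Negating it and taking the square root, as done above, gives the root of the mean fold MSE, which is generally not the same as the mean of the per-fold RMSEs. A small numeric illustration with made-up fold scores:

```python
import numpy as np

# Hypothetical per-fold scores as returned with scoring="neg_mean_squared_error"
fold_scores = np.array([-1.8, -2.0, -2.2, -1.9, -2.1])

mean_score = fold_scores.mean()                   # analogous to best_score_
rmse_from_mean_mse = np.sqrt(-mean_score)         # what np.sqrt(-best_score_) computes
mean_of_fold_rmse = np.sqrt(-fold_scores).mean()  # average of per-fold RMSEs

print(f"{rmse_from_mean_mse:.4f} vs {mean_of_fold_rmse:.4f}")
```

The two figures are usually close, and the first is never smaller than the second. Squaring the first recovers the mean cross-validated MSE exactly, which is the quantity to compare against the target.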


## 3.4 Ensemble Methods

In [9]:
# Voting Regressor with optimized models
voting_model_final = VotingRegressor(estimators=[
    ("rf", rf_grid.best_estimator_),
    ("gb", gb_grid.best_estimator_),
])

voting_rmse_final = train_and_evaluate(voting_model_final, X_train, y_train, X_test, y_test)
print(f"Final Voting Regressor Results:")
print(f"RMSE: {voting_rmse_final:.4f}")
print(f"MSE: {voting_rmse_final**2:.4f}")

Final Voting Regressor Results:
RMSE: 1.3828
MSE: 1.9121
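
With default (equal) weights, `VotingRegressor` predicts the plain average of its fitted members' predictions. It also fits clones of the estimators passed in, so the tuned models are refit inside the ensemble rather than reused. A sketch on synthetic data (`X_demo`, `y_demo` and the small estimator settings are illustrative only):

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)

rng = np.random.default_rng(9)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo[:, 0] * 2 + X_demo[:, 1] ** 2 + rng.normal(0, 0.1, 200)

voter = VotingRegressor(estimators=[
    ("rf", RandomForestRegressor(n_estimators=20, random_state=9)),
    ("gb", GradientBoostingRegressor(n_estimators=20, random_state=9)),
])
voter.fit(X_demo, y_demo)

# Predictions of the fitted clones stored on the ensemble
member_preds = np.column_stack([est.predict(X_demo) for est in voter.estimators_])
manual_average = member_preds.mean(axis=1)

print(np.allclose(voter.predict(X_demo), manual_average))
```

Because the ensemble is a simple average, its error tends to land between or near its members' errors unless their mistakes are weakly correlated.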


# 4. Results Summary and Conclusion

In [10]:
# Final Model Comparison
print("=" * 60)
print("FINAL MODEL PERFORMANCE SUMMARY")
print("=" * 60)
print(f"Target: MSE < 3.0")
print("-" * 60)

# Test the best models on test set
final_results = {}

# Best Random Forest test performance
rf_test_rmse = evaluate_model(rf_grid.best_estimator_, X_test, y_test)
final_results['Random Forest (Tuned)'] = rf_test_rmse

# Best Gradient Boosting test performance  
gb_test_rmse = evaluate_model(gb_grid.best_estimator_, X_test, y_test)
final_results['Gradient Boosting (Tuned)'] = gb_test_rmse

# Final voting regressor test performance
final_results['Voting Regressor (Final)'] = voting_rmse_final

# Display results
for model_name, rmse in final_results.items():
    mse = rmse ** 2
    status = "✅ MEETS TARGET" if mse < 3.0 else "❌ ABOVE TARGET"
    print(f"{model_name:<30}: RMSE={rmse:.4f}, MSE={mse:.4f} {status}")

# Best model
best_model = min(final_results.items(), key=lambda x: x[1])
print(f"\n🏆 BEST MODEL: {best_model[0]}")
print(f"   Final Test RMSE: {best_model[1]:.4f}")
print(f"   Final Test MSE:  {best_model[1]**2:.4f}")

if best_model[1]**2 < 3.0:
    print("✅ SUCCESS: Model meets the company's MSE < 3.0 requirement!")
else:
    print("❌ Target not met - further optimization needed.")

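============================================================
FINAL MODEL PERFORMANCE SUMMARY
============================================================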
Target: MSE < 3.0
------------------------------------------------------------
Random Forest (Tuned)         : RMSE=1.4056, MSE=1.9757 ✅ MEETS TARGET
Gradient Boosting (Tuned)     : RMSE=1.3729, MSE=1.8850 ✅ MEETS TARGET
Voting Regressor (Final)      : RMSE=1.3828, MSE=1.9121 ✅ MEETS TARGET

🏆 BEST MODEL: Gradient Boosting (Tuned)
   Final Test RMSE: 1.3729
   Final Test MSE:  1.8850
✅ SUCCESS: Model meets the company's MSE < 3.0 requirement!
