## Movie Rental Duration Prediction
This Jupyter Notebook demonstrates a machine learning project to predict the duration of movie rentals. The project uses three scikit-learn estimators: Lasso Regression (for feature selection), Ordinary Least Squares (OLS) Regression fitted on the Lasso-selected features, and a Random Forest Regressor. The goal is to compare the Lasso + OLS approach against the Random Forest on held-out data and identify the better model for this prediction task.

### Project Description 💡
Time to put your regression knowledge to the test! A DVD rental company has approached you to make a regression model which will help predict the number of days a customer rents a DVD. The model will hopefully make their inventory planning much more efficient. You decide to help them by running some regression models and recommending the best-performing model to the company.


#### 1. Loading libraries 🛠️

In [2]:
import pandas as pd                                        # For data manipulation
import numpy as np                                         # For working with arrays
import matplotlib.pyplot as plt                            # For data visualization

from sklearn.model_selection import train_test_split       # For splitting the data into training and test sets
from sklearn.model_selection import RandomizedSearchCV     # For cross-validated hyperparameter search

from sklearn.metrics import mean_squared_error             # Evaluation metric

from sklearn.linear_model import Lasso, LinearRegression   # The regression models we will use
from sklearn.ensemble import RandomForestRegressor

#### 2. Exploratory Data Analysis

In [55]:
df_rental = pd.read_csv("rental_info.csv")

In [56]:
df_rental.shape

(15861, 15)

In [57]:
df_rental.columns

Index(['rental_date', 'return_date', 'amount', 'release_year', 'rental_rate',
       'length', 'replacement_cost', 'special_features', 'NC-17', 'PG',
       'PG-13', 'R', 'amount_2', 'length_2', 'rental_rate_2'],
      dtype='object')

- The rental data contains 15,861 rows and 15 columns.
- The columns are:
    `rental_date`,
    `return_date`,
    `amount`,
    `release_year`,
    `rental_rate`,
    `length`,
    `replacement_cost`,
    `special_features`,
    `NC-17`,
    `PG`,
    `PG-13`,
    `R`,
    `amount_2`,
    `length_2`,
    `rental_rate_2`
    
- The column of interest, the rental duration in days, is not present in the data, so we have to create it.
    

We calculate the rental duration in days by subtracting `rental_date` from `return_date` and keeping the whole-day part of the difference.
This new column, `rental_length_days`, will be our target variable (`y`).


In [58]:
# Add information on rental duration
df_rental["rental_length"] = pd.to_datetime(df_rental["return_date"]) - pd.to_datetime(df_rental["rental_date"])
df_rental["rental_length_days"] = df_rental["rental_length"].dt.days

df_rental.columns

Index(['rental_date', 'return_date', 'amount', 'release_year', 'rental_rate',
       'length', 'replacement_cost', 'special_features', 'NC-17', 'PG',
       'PG-13', 'R', 'amount_2', 'length_2', 'rental_rate_2', 'rental_length',
       'rental_length_days'],
      dtype='object')
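
To make the behaviour of `.dt.days` concrete, here is a minimal sketch on two made-up rentals (the timestamps are hypothetical, only the column names come from the dataset). It shows that `.dt.days` keeps only the whole-day component of each timedelta, so a rental lasting 3 days and 20 hours counts as 3 days. The `rental_length_frac` column is an illustrative alternative, not something the notebook uses.

```python
import pandas as pd

# Toy data with hypothetical timestamps; column names mirror rental_info.csv
toy = pd.DataFrame({
    "rental_date": ["2005-05-25 02:54:33", "2005-05-25 03:03:39"],
    "return_date": ["2005-05-28 23:40:33", "2005-06-01 22:12:39"],
})

# Subtracting two datetime columns yields a timedelta column
duration = pd.to_datetime(toy["return_date"]) - pd.to_datetime(toy["rental_date"])

# Whole days only (the fractional part is discarded)
toy["rental_length_days"] = duration.dt.days

# Illustrative alternative: fractional days, keeping the hours and minutes
toy["rental_length_frac"] = duration.dt.total_seconds() / 86400

print(toy[["rental_length_days", "rental_length_frac"]])
```

Whether to truncate or keep the fraction is a modelling choice; the notebook predicts whole days.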

In [59]:
df_rental["release_year"] = df_rental["release_year"].astype("string")

movies_by_year = df_rental.groupby('release_year').agg({'rental_length_days': 'sum'})

movies_by_year

| release_year | rental_length_days |
|---|---|
| 2004.0 | 11872 |
| 2005.0 | 9505 |
| 2006.0 | 11260 |
| 2007.0 | 11029 |
| 2008.0 | 8065 |
| 2009.0 | 9976 |
| 2010.0 | 10079 |


- Movies released in 2004 account for the most total rental days (11,872), followed by movies released in 2006 and 2007. Because this is a sum over all rentals, it reflects how many rentals each release year had as well as how long each rental lasted.
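
A total per release year mixes two effects: the number of rentals and the length of each one. The sketch below uses a small made-up frame (the values are hypothetical) to show how aggregating `count`, `sum`, and `mean` together separates them. Here the year with the largest total is not the year with the longest average rental.

```python
import pandas as pd

# Hypothetical toy data: 2004 has more rentals, 2005 has longer ones
toy = pd.DataFrame({
    "release_year": ["2004", "2004", "2004", "2004", "2005", "2005"],
    "rental_length_days": [2, 3, 4, 3, 5, 5],
})

# Aggregate several statistics at once for the same column
summary = toy.groupby("release_year")["rental_length_days"].agg(["count", "sum", "mean"])
print(summary)

print("Largest total:", summary["sum"].idxmax())
print("Longest average rental:", summary["mean"].idxmax())
```

Applying the same `agg(["count", "sum", "mean"])` call to `df_rental` would show whether the 2004 lead comes from volume or from longer rentals.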

#### 2. Data Preparation and Feature Engineering 🛠️
We perform some feature engineering to prepare the data for our models.

- We create two dummy variables (deleted_scenes and behind_the_scenes) from the special_features column to capture the presence of these features.

- We then split the data into training and testing sets, with 80% of the data used for training and 20% for testing.

In [60]:
# Add dummy variables
# Add dummy for deleted scenes
df_rental["deleted_scenes"] = np.where(df_rental["special_features"].str.contains("Deleted Scenes"), 1, 0)
# Add dummy for behind the scenes
df_rental["behind_the_scenes"] = np.where(df_rental["special_features"].str.contains("Behind the Scenes"), 1, 0)

# Choose columns to drop
cols_to_drop = ["special_features", "rental_length", "rental_length_days", "rental_date", "return_date"]

# Split into feature and target sets
X = df_rental.drop(cols_to_drop, axis=1)
y = df_rental["rental_length_days"]

# Further split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=9)

print("Data preparation complete.")

Data preparation complete.
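
The dummy-variable step can be sketched on a few made-up `special_features` strings (the exact string format in `rental_info.csv` is an assumption here). This variant uses `str.contains(..., na=False, regex=False)` so that a missing value becomes 0 rather than propagating as a missing result, which may matter if the column ever contains nulls. It is an alternative spelling of the same idea, not the notebook's code.

```python
import pandas as pd

# Hypothetical examples of special_features values, including a missing one
toy = pd.DataFrame({
    "special_features": [
        '{Trailers,"Behind the Scenes"}',
        '{"Deleted Scenes","Behind the Scenes"}',
        "{Trailers}",
        None,
    ]
})

# na=False maps missing values to False; regex=False does a plain substring match
toy["deleted_scenes"] = (
    toy["special_features"].str.contains("Deleted Scenes", na=False, regex=False).astype(int)
)
toy["behind_the_scenes"] = (
    toy["special_features"].str.contains("Behind the Scenes", na=False, regex=False).astype(int)
)

print(toy[["deleted_scenes", "behind_the_scenes"]])
```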


#### 3. Model Building 

#### Lasso and OLS Regression 📏
First, we use Lasso Regression for feature selection. Lasso adds an L1 penalty on the absolute size of the coefficients, which shrinks the coefficients of less useful features to exactly zero and so acts as a feature selector.

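- We train a Lasso model with `alpha=0.3` and keep only the features whose coefficient is strictly positive. Note that this filter also drops any feature with a negative coefficient, not only the features Lasso shrank to zero.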

- Next, we train a simple OLS Regression model on this reduced set of features.

- The performance of this combined approach is evaluated using the Mean Squared Error (MSE).

In [61]:
# Create the Lasso model
lasso = Lasso(alpha=0.3, random_state=9)

# Train the model and access the coefficients
lasso.fit(X_train, y_train)
lasso_coef = lasso.coef_

# Perform feature selection by choosing columns with positive coefficients
X_lasso_train, X_lasso_test = X_train.iloc[:, lasso_coef > 0], X_test.iloc[:, lasso_coef > 0]

# Fit an OLS model on the Lasso-selected features
ols = LinearRegression()
ols.fit(X_lasso_train, y_train)
y_test_pred = ols.predict(X_lasso_test)
mse_lin_reg_lasso = mean_squared_error(y_test, y_test_pred)

print(f"Lasso and OLS model MSE: {mse_lin_reg_lasso}")

Lasso and OLS model MSE: 4.812297241276237
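
The difference between selecting positive coefficients and selecting non-zero coefficients can be seen on synthetic data. In this minimal sketch (the data and the column names `pos`, `neg`, and `noise` are invented for illustration), the target depends positively on one feature, negatively on another, and not at all on the third. A `coef_ > 0` mask keeps only the first, while a `coef_ != 0` mask keeps both informative features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic data: y rises with "pos", falls with "neg", ignores "noise"
rng = np.random.default_rng(9)
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=["pos", "neg", "noise"])
y = 2.0 * X["pos"] - 2.0 * X["neg"] + rng.normal(scale=0.1, size=500)

lasso = Lasso(alpha=0.3).fit(X, y)

# Mask used in the notebook: strictly positive coefficients
kept_positive = list(X.columns[lasso.coef_ > 0])

# Alternative mask: every coefficient Lasso did not shrink to zero
kept_nonzero = list(X.columns[lasso.coef_ != 0])

print("coef_ > 0  keeps:", kept_positive)
print("coef_ != 0 keeps:", kept_nonzero)
```

Whether the non-zero mask would change the MSE reported above is not something this sketch can answer; it only illustrates what each mask selects.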


#### Random Forest Regression 🌳
Next, we train a Random Forest Regressor, an ensemble model that averages the predictions of many decision trees to reduce variance and improve prediction accuracy.

- To find a good configuration, we use randomized search with 5-fold cross-validation to tune two hyperparameters: `n_estimators` (the number of trees, sampled from 1 to 100) and `max_depth` (the maximum depth of each tree, sampled from 1 to 10).

- After identifying the best hyperparameters, we train the final Random Forest model and calculate its Mean Squared Error (MSE) on the test set.

In [62]:
# Random forest hyperparameter space
param_dist = {'n_estimators': np.arange(1, 101, 1),
              'max_depth': np.arange(1, 11, 1)}

# Create a random forest regressor
rf = RandomForestRegressor(random_state=9)

# Use random search to find the best hyperparameters
rand_search = RandomizedSearchCV(rf,
                                 param_distributions=param_dist,
                                 cv=5,
                                 random_state=9)

# Fit the random search object to the data
rand_search.fit(X_train, y_train)

# Store the best hyperparameters found by the search
hyper_params = rand_search.best_params_

# Refit a random forest with the chosen hyperparameters
rf = RandomForestRegressor(n_estimators=hyper_params["n_estimators"],
                           max_depth=hyper_params["max_depth"],
                           random_state=9)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
mse_random_forest = mean_squared_error(y_test, rf_pred)

print(f"Random Forest model MSE: {mse_random_forest}")
print(f"Best hyperparameters found: {hyper_params}")

Random Forest model MSE: 2.225667528098759
Best hyperparameters found: {'n_estimators': 51, 'max_depth': 10}
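
Two defaults of `RandomizedSearchCV` are worth keeping in mind when reading the result above: it samples `n_iter=10` candidate combinations unless told otherwise, and for a regressor it ranks them by the estimator's default score (R²) rather than by MSE. The sketch below, run on synthetic data from `make_regression` with a deliberately small search space (both are assumptions for illustration), shows how to rank candidates by MSE directly and how to reuse `best_estimator_`, which is already refit on the full training set when `refit=True` (the default).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the rental data
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=9)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=9)

# Small illustrative search space (the notebook's space is larger)
param_dist = {"n_estimators": np.arange(10, 51, 10),
              "max_depth": np.arange(2, 8)}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=9),
    param_distributions=param_dist,
    n_iter=5,                              # number of sampled combinations
    cv=3,
    scoring="neg_mean_squared_error",      # rank candidates by (negated) MSE
    random_state=9,
)
search.fit(X_train, y_train)

# best_estimator_ is already refit on all of X_train, so no manual refit is needed
best_rf = search.best_estimator_
mse_best = mean_squared_error(y_test, best_rf.predict(X_test))

print(f"Best hyperparameters: {search.best_params_}")
print(f"Test MSE: {mse_best:.3f}")
```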


#### 4. Model Comparison and Conclusion ✅
Finally, we compare the two approaches on their test-set MSE. The model with the lower MSE is considered the better-performing model.

In [6]:
# Compare the MSE scores
if mse_random_forest < mse_lin_reg_lasso:
    best_model = "Random Forest Regressor"
    best_mse = mse_random_forest
else:
    best_model = "Lasso + OLS Regression"
    best_mse = mse_lin_reg_lasso

print(f"The best model is the {best_model} with an MSE of {round(best_mse, 3)}.")

The best model is the Random Forest Regressor with an MSE of 2.226.
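
MSE is expressed in squared days, which is hard to interpret. Taking the square root gives the RMSE in days. The short calculation below uses the two MSE values printed earlier in this notebook and suggests a typical error of roughly 1.5 days for the Random Forest versus roughly 2.2 days for Lasso + OLS.

```python
import math

# MSE values copied from the outputs above
mse_lin_reg_lasso = 4.812297241276237
mse_random_forest = 2.225667528098759

# RMSE is the square root of MSE and is in the same unit as the target (days)
rmse_lin_reg_lasso = math.sqrt(mse_lin_reg_lasso)
rmse_random_forest = math.sqrt(mse_random_forest)

print(f"Lasso + OLS RMSE:   {rmse_lin_reg_lasso:.2f} days")
print(f"Random Forest RMSE: {rmse_random_forest:.2f} days")
```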


### Future Enhancements 💡

- Explore more advanced feature engineering, such as incorporating movie genres, audience ratings, or director information.

- Experiment with other regression models (e.g., Gradient Boosting, XGBoost, Support Vector Regressor).

- Perform more extensive hyperparameter tuning using GridSearchCV if computational resources allow.

- Implement cross-validation more broadly across all models for a more robust evaluation.

- Visualize the residuals and actual vs. predicted values to gain deeper insights into model performance.
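
As a starting point for the cross-validation item, here is a minimal sketch that scores two models with the same 5-fold procedure using `cross_val_score`. It runs on synthetic data from `make_regression` (an assumption for illustration, so the ranking it prints says nothing about the rental data), and the `models` dictionary and its hyperparameters are illustrative choices.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the rental features and target
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=9)

# Candidate models to compare under identical cross-validation
models = {
    "OLS": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, max_depth=10, random_state=9),
}

cv_mse = {}
for name, model in models.items():
    # scikit-learn maximizes scores, so MSE is returned negated
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[name] = -scores.mean()

for name, mse in cv_mse.items():
    print(f"{name}: mean CV MSE = {mse:.2f}")
```

Averaging over folds gives a less split-dependent estimate than a single train/test split.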