# DVD Rental Prediction Project

![dvd_image](dvd_image.jpg)

A DVD rental company needs your help! They want to predict how many days a customer will rent a DVD for, based on a set of features, and have asked you to try out regression models for the task. The company wants a model that yields an MSE of 3 or less on a test set. The resulting model will help the company plan its inventory more efficiently.
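
To make the target concrete, here is a minimal sketch of how MSE is computed. The numbers are hypothetical and are not taken from `rental_info.csv`. An MSE of 3 corresponds to a root-mean-squared error of about 1.73 days.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical rental lengths (days) and model predictions, for illustration only
y_true = np.array([3, 5, 2, 7])
y_pred = np.array([4.0, 5.0, 1.0, 4.0])

# MSE is the mean of the squared residuals
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)
```

Both values agree, which shows that `mean_squared_error` is just the average squared difference between actual and predicted values.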

## Step 1: Data Exploration and Preparation
The data they provided is in the CSV file `rental_info.csv`. It has the following features:

- `"rental_date"`: The date (and time) the customer rents the DVD.
- `"return_date"`: The date (and time) the customer returns the DVD.
- `"amount"`: The amount paid by the customer for renting the DVD.
- `"amount_2"`: The square of `"amount"`.
- `"rental_rate"`: The rate at which the DVD is rented.
- `"rental_rate_2"`: The square of `"rental_rate"`.
- `"release_year"`: The year the movie being rented was released.
- `"length"`: Length of the movie being rented, in minutes.
- `"length_2"`: The square of `"length"`.
- `"replacement_cost"`: The amount it will cost the company to replace the DVD.
- `"special_features"`: Any special features the DVD also has, for example trailers or deleted scenes.
- `"NC-17"`, `"PG"`, `"PG-13"`, `"R"`: Dummy variables for the movie's rating. Each takes the value 1 if the movie has the rating in the column name and 0 otherwise. For your convenience, the reference dummy has already been dropped.
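
As an illustration of what "the reference dummy has already been dropped" means, the sketch below builds rating dummies from a hypothetical raw `rating` column. The raw column is not part of `rental_info.csv`, and the assumption that the reference category is `"G"` is ours; the source does not name it.

```python
import pandas as pd

# Hypothetical raw ratings; the original "rating" column is not in rental_info.csv
ratings = pd.DataFrame({"rating": ["G", "PG", "R", "NC-17", "PG-13", "G"]})

# drop_first=True drops the first category in sorted order ("G" here),
# which becomes the reference level
dummies = pd.get_dummies(ratings["rating"], drop_first=True, dtype=int)

print(dummies.columns.tolist())
print(dummies.iloc[0].tolist())
```

A row whose rating is the reference category has 0 in all four dummy columns, so no information is lost by dropping it, and the perfect collinearity that a full set of dummies would introduce in a linear model is avoided.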

We'll start by importing the necessary libraries and reading the data.

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE

from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LinearRegression

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV


# Read the csv file into a DataFrame
rental_info = pd.read_csv("rental_info.csv", parse_dates=["return_date", "rental_date"])

# Calculate the rental length in days
rental_info["rental_length_days"] = (rental_info["return_date"] - rental_info["rental_date"]).dt.days

# Create dummy variables for special features
rental_info["deleted_scenes"] = np.where(rental_info["special_features"].str.contains("Deleted Scenes"), 1, 0)
rental_info["behind_the_scenes"] = np.where(rental_info["special_features"].str.contains("Behind the Scenes"), 1, 0)
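
The two transformations above can be checked on a small hand-made frame. The rows and the exact formatting of the `special_features` strings below are assumptions made for illustration; the real file may format them differently.

```python
import pandas as pd

# Hypothetical rows mimicking the columns of rental_info.csv
toy = pd.DataFrame({
    "rental_date": pd.to_datetime(["2005-05-25 02:54:33", "2005-05-25 10:00:00"]),
    "return_date": pd.to_datetime(["2005-05-28 23:40:33", "2005-05-26 09:00:00"]),
    "special_features": ['{Trailers,"Behind the Scenes"}', '{"Deleted Scenes"}'],
})

# Subtracting datetimes gives a Timedelta; .dt.days keeps only the whole days
toy["rental_length_days"] = (toy["return_date"] - toy["rental_date"]).dt.days

# str.contains returns booleans, which astype(int) turns into 0/1 dummies
toy["deleted_scenes"] = toy["special_features"].str.contains("Deleted Scenes").astype(int)
toy["behind_the_scenes"] = toy["special_features"].str.contains("Behind the Scenes").astype(int)

print(toy[["rental_length_days", "deleted_scenes", "behind_the_scenes"]])
```

Note that `.dt.days` truncates rather than rounds: a rental returned after 23 hours counts as 0 days, and one returned after 3 days and 20 hours counts as 3.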

## Step 2: Data Splitting
Next, we'll separate the DataFrame into feature and target sets, then split the data into training and test sets.

In [4]:
# Separate the DataFrame into feature and target sets
X = rental_info.drop(["special_features", "rental_length_days", "return_date", "rental_date"], axis=1)
y = rental_info["rental_length_days"]

# Further split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2, 
                                                    stratify=y,
                                                    random_state=9)
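
Stratifying on a regression target works here because `rental_length_days` takes only a handful of integer values. The sketch below uses a hypothetical target to show the effect of `stratify`: each value keeps the same share in the test set as in the full data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical target: 80 rentals of 2 days and 20 rentals of 5 days
y_toy = np.array([2] * 80 + [5] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=9
)

# With stratify, the 80/20 mix is reproduced in the 20-row test set
print((y_te == 2).sum(), (y_te == 5).sum())
```

One caveat: `stratify` raises an error if any target value occurs only once, so this approach would not carry over to a truly continuous target.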

## Step 3: Lasso Regression
We'll create and train a Lasso regression model to perform feature selection, then fit an ordinary least squares (OLS) regression on the selected features.

In [5]:
# Create and train the Lasso model
lasso = Lasso(alpha=0.3, random_state=9)
lasso.fit(X_train, y_train)
lasso_coef = lasso.coef_

# Perform feature selection by keeping columns with positive Lasso coefficients
# (features with zero or negative coefficients are dropped)
X_lasso_train, X_lasso_test = X_train.iloc[:, lasso_coef > 0], X_test.iloc[:, lasso_coef > 0]

# Perform OLS regression on the selected features 
ols = LinearRegression()
ols = ols.fit(X_lasso_train, y_train)
y_test_pred = ols.predict(X_lasso_test)
mse_lin_reg_lasso = MSE(y_test, y_test_pred)

print("OLS gives MSE of: {:.2f}".format(mse_lin_reg_lasso))

OLS gives MSE of: 4.85
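
The step above keeps only columns whose Lasso coefficient is positive. The following sketch, on synthetic data that we generate ourselves, shows how the L1 penalty zeroes out uninformative features and how a `> 0` mask differs from a `!= 0` mask. It is meant to illustrate the mechanism, not to replace the selection rule used above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(9)

# Synthetic data: only the first two of five features drive the target
X_syn = rng.normal(size=(500, 5))
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(scale=0.1, size=500)

lasso_syn = Lasso(alpha=0.3).fit(X_syn, y_syn)

# The L1 penalty shrinks coefficients of uninformative features to exactly zero
positive_mask = lasso_syn.coef_ > 0   # same rule as in the step above
nonzero_mask = lasso_syn.coef_ != 0   # alternative that also keeps negative effects

print(lasso_syn.coef_.round(2))
print(positive_mask.sum(), nonzero_mask.sum())
```

In this toy setup the second feature has a strong negative effect, so a `> 0` mask discards it while a `!= 0` mask keeps it. Which rule suits the rental data is a modelling choice worth checking against the test MSE.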


## Step 4: Random Forest Regression
We'll fit a Random Forest regressor and tune its hyperparameters with randomized search cross-validation (`RandomizedSearchCV`).

In [7]:
# Define hyperparameter space
param_dist = {'n_estimators': np.arange(1, 101, 1),
              'max_depth': np.arange(1, 11, 1)}

# Create a Random Forest Regressor object
rf = RandomForestRegressor()

# Select optimal hyperparameters by randomized search CV
rf_random = RandomizedSearchCV(rf,
                               param_distributions=param_dist,
                               cv=5,
                               random_state=9)

# Fit the random search object to the training data
rf_random.fit(X_train, y_train)

# Create a variable for the best hyperparameters
hyper_params = rf_random.best_params_

# Perform Random Forest Regression
rf = RandomForestRegressor(n_estimators=hyper_params["n_estimators"],
                           max_depth=hyper_params["max_depth"],
                           random_state=9)

rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
mse_random_forest = MSE(y_test, rf_pred)

print("Random Forest gives MSE of: {:.2f}".format(mse_random_forest))

best_model = rf
best_mse = mse_random_forest

Random Forest gives MSE of: 2.16
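
For reference, here is a compact variant of the search on synthetic data. It relies on two assumptions that differ from the cell above: it scores candidates by negative MSE instead of the default R² score, and it reuses `best_estimator_` rather than refitting by hand. Parameter ranges and data are hypothetical and kept small so the sketch runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(9)
X_syn = rng.uniform(size=(200, 3))
y_syn = 10 * X_syn[:, 0] + rng.normal(scale=0.5, size=200)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=9),
    param_distributions={"n_estimators": np.arange(5, 31),
                         "max_depth": np.arange(1, 6)},
    n_iter=5,                          # number of sampled combinations (default is 10)
    scoring="neg_mean_squared_error",  # assumption: rank candidates by MSE, not R^2
    cv=3,
    random_state=9,
)
search.fit(X_syn, y_syn)

# With refit=True (the default), best_estimator_ is already refit on all the data
best_rf = search.best_estimator_
cv_mse = -search.best_score_
print(search.best_params_, round(cv_mse, 2))
```

Setting `random_state` on the estimator inside the search, as done here, makes the cross-validation scores reproducible as well as the sampled parameter combinations.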


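## Results
The OLS regression on Lasso-selected features gave a test MSE of 4.85, while the Random Forest model achieved a test MSE of 2.16. Random Forest is therefore the better model for predicting DVD rental lengths, and it is the only one of the two that meets the company's target of an MSE of 3 or less.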
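
The comparison can be written as a small check against the company's threshold. The MSE values are the ones reported above; the variable names are ours.

```python
# Test MSE values reported above
results = {"OLS on Lasso-selected features": 4.85, "Random Forest": 2.16}
target_mse = 3.0  # threshold set by the company

# Pick the model with the lowest test MSE and compare it with the target
best_name = min(results, key=results.get)
meets_target = results[best_name] <= target_mse

print(best_name, results[best_name], meets_target)
```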