![dvd_image](dvd_image.jpg)

A DVD rental company needs your help! They want to predict how many days a customer will rent a DVD for, based on some features, and have approached you for help. They want you to try out some regression models for this prediction. The company wants a model that yields an MSE of 3 or less on a test set. The model you make will help the company plan its inventory more efficiently.
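
To make the target concrete, here is the standard definition of mean squared error, where $y_i$ is the actual rental length in days, $\hat{y}_i$ is the predicted length, and $n$ is the number of test rows (the symbols are introduced here for illustration and do not appear in the brief):

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{MSE} \le 3 \;\Longleftrightarrow\; \mathrm{RMSE} = \sqrt{\mathrm{MSE}} \le \sqrt{3} \approx 1.73 \text{ days}
$$

Read this way, the requirement says that a typical prediction should be off by no more than roughly 1.7 days.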

The data they provided is in the csv file `rental_info.csv`. It has the following features:
- `"rental_date"`: The date (and time) the customer rents the DVD.
- `"return_date"`: The date (and time) the customer returns the DVD.
- `"amount"`: The amount paid by the customer for renting the DVD.
- `"amount_2"`: The square of `"amount"`.
- `"rental_rate"`: The rate at which the DVD is rented.
- `"rental_rate_2"`: The square of `"rental_rate"`.
- `"release_year"`: The year the movie being rented was released.
- `"length"`: Length of the movie being rented, in minutes.
- `"length_2"`: The square of `"length"`.
- `"replacement_cost"`: The amount it will cost the company to replace the DVD.
- `"special_features"`: Any special features, for example trailers/deleted scenes that the DVD also has.
- `"NC-17"`, `"PG"`, `"PG-13"`, `"R"`: These columns are dummy variables for the rating of the movie. Each takes the value 1 if the movie has the rating in the column name and 0 otherwise. For your convenience, the reference dummy has already been dropped.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Import any additional modules and start coding below
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.linear_model import Lasso
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
import xgboost as xgb
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV


rent_df = pd.read_csv('rental_info.csv')

In [2]:
# Pre-processing
rent_df['rental_date'] = pd.to_datetime(rent_df['rental_date'])
rent_df['return_date'] = pd.to_datetime(rent_df['return_date'])
# Whole days between rental and return (the target variable)
rent_df["rental_length_days"] = (rent_df['return_date'] - rent_df['rental_date']).dt.days
rent_df['behind_the_scenes'] = rent_df['special_features'].apply(lambda x: 1 if 'Behind the Scenes' in x else 0)
rent_df['deleted_scenes'] = (rent_df['special_features'].apply(lambda x: 1 if 'Deleted Scenes' in x else 0))
# `columns=` already selects the axis, so `axis=1` is not needed
rent_df = rent_df.drop(columns=['special_features', 'rental_date', 'return_date'])
#rent_df = rent_df.drop(columns=['amount_2','length_2', 'rental_rate_2'], axis=1)
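
As a quick sanity check of the pre-processing above, the same steps can be run on a tiny hand-made frame. This is only a sketch: the two rows and the text format of `special_features` are made up for illustration and are not taken from `rental_info.csv`, and `toy` is a hypothetical name.

```python
import pandas as pd

# Hypothetical two-row sample; values and string format are illustrative only
toy = pd.DataFrame({
    "rental_date": ["2005-05-25 02:54:33", "2005-06-15 23:19:16"],
    "return_date": ["2005-05-28 23:40:33", "2005-06-18 19:24:16"],
    "special_features": ["Trailers, Behind the Scenes", "Deleted Scenes"],
})
toy["rental_date"] = pd.to_datetime(toy["rental_date"])
toy["return_date"] = pd.to_datetime(toy["return_date"])

# .dt.days keeps only the whole-day part of each Timedelta
toy["rental_length_days"] = (toy["return_date"] - toy["rental_date"]).dt.days

# 1 if the feature name appears in the text, 0 otherwise
toy["behind_the_scenes"] = toy["special_features"].str.contains("Behind the Scenes").astype(int)
toy["deleted_scenes"] = toy["special_features"].str.contains("Deleted Scenes").astype(int)

print(toy[["rental_length_days", "behind_the_scenes", "deleted_scenes"]])
```

The first row spans 3 days and about 21 hours, so its target is 3: partial days are truncated, not rounded.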

In [3]:
# Separating the features and the target variable
X = rent_df.drop('rental_length_days', axis=1)
y = rent_df['rental_length_days']

In [4]:
# Splitting into training and test sets (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=9)

In [5]:
# Making a pipeline for the Regressor models


pipelines  = {
    'linear_regression': Pipeline([('scaler', StandardScaler()), ('regressor', LinearRegression())]),
    'lasso_regression': Pipeline([('scaler', StandardScaler()), ('regressor', Lasso(alpha=0.1))]),
    'gradient_boosting': Pipeline([('scaler', StandardScaler()), ('regressor', GradientBoostingRegressor(n_estimators=100, random_state=9))]),
    'random_forest': Pipeline([('scaler', StandardScaler()), ('regressor', RandomForestRegressor(n_estimators=100, random_state=6, n_jobs=-1))]), # R search
    'ada_boosting': Pipeline([('scaler', StandardScaler()), ('regressor', AdaBoostRegressor(n_estimators=100, random_state=9))]),
    #'svr': Pipeline([('scaler', StandardScaler()), ('regressor', SVR(kernel='rbf'))]), # It takes more time compared to others
    'knn': Pipeline([('scaler', StandardScaler()), ('regressor', KNeighborsRegressor(n_neighbors=4, n_jobs=-1))]),
    'xgb': Pipeline([('scaler', StandardScaler()), ('regressor', xgb.XGBRegressor(random_state=9))]),
}



In [6]:
# Predicting and calculating MSE using the models in the pipeline
models_mse = []
models = []
for name,pipeline in pipelines.items():
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    mse = mean_squared_error(y_test,y_pred)
    models_mse.append(mse)
    models.append(pipeline.named_steps['regressor'])  # keep the fitted regressor, not the scaler
    print(f"Model {name} had an MSE of {mse}")

Model linear_regression had an MSE of 2.9417238646975976
Model lasso_regression had an MSE of 3.0784650026615705
Model gradient_boosting had an MSE of 2.4253464800253557
Model random_forest had an MSE of 2.0322543432946714
Model ada_boosting had an MSE of 3.1994572542061306
Model knn had an MSE of 2.7614836117239205
Model xgb had an MSE of 1.9058708283906145
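
The printed scores can be checked against the company's requirement of an MSE of 3 or less. The sketch below copies the values from the output above (rounded to four decimals); the names `reported_mse` and `MSE_TARGET` are introduced here for illustration.

```python
# MSE values copied (rounded) from the printed output above
reported_mse = {
    "linear_regression": 2.9417,
    "lasso_regression": 3.0785,
    "gradient_boosting": 2.4253,
    "random_forest": 2.0323,
    "ada_boosting": 3.1995,
    "knn": 2.7615,
    "xgb": 1.9059,
}

MSE_TARGET = 3.0  # threshold stated in the task description

# Split the models by whether they meet the target
passing = {name: mse for name, mse in reported_mse.items() if mse <= MSE_TARGET}
failing = sorted(set(reported_mse) - set(passing))
best_name = min(reported_mse, key=reported_mse.get)

print("Meets target:", sorted(passing, key=passing.get))
print("Misses target:", failing)
print("Lowest MSE:", best_name)
```

On these numbers, five of the seven models meet the target, while Lasso and AdaBoost (with the settings used here) fall just short.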


In [7]:
# The best model using MSE score
best_idx = int(np.argmin(models_mse))
best_model = models[best_idx]
best_mse = models_mse[best_idx]
print(f'The best model is {best_model} with an MSE of {best_mse}')

The best model is XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=None, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=None, n_jobs=None,
             num_parallel_tree=None, random_state=9, ...) with an MSE of 1.9058708283906145
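
The notebook imports `RandomizedSearchCV` but never uses it. A minimal sketch of how it could be applied to one of the pipelines is shown below. It assumes a random forest and runs on synthetic data so that it is self-contained; the parameter ranges, `X_demo`, and `y_demo` are illustrative assumptions, not tuned values or results for the rental data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 rows, 4 numeric features
rng = np.random.default_rng(9)
X_demo = rng.normal(size=(200, 4))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=9)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", RandomForestRegressor(random_state=9)),
])

# Step parameters are addressed as "<step name>__<parameter>"
param_distributions = {
    "regressor__n_estimators": [10, 20, 30],
    "regressor__max_depth": [2, 4, 6],
}

search = RandomizedSearchCV(
    pipe,
    param_distributions,
    n_iter=4,                          # number of sampled combinations
    cv=3,                              # 3-fold cross-validation on the training set
    scoring="neg_mean_squared_error",  # scikit-learn maximizes, hence the negated MSE
    random_state=9,
)
search.fit(X_tr, y_tr)

# With the default refit=True, `search` predicts with the best pipeline found
tuned_mse = mean_squared_error(y_te, search.predict(X_te))
print(search.best_params_, tuned_mse)
```

On the real data, the same pattern would use `X_train`, `y_train`, `X_test`, and `y_test` from the split above. Whether tuning improves on the default XGBoost score would have to be checked by running it.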
