![dvd_image](dvd_image.jpg)

A DVD rental company needs your help! They want to predict how many days a customer will rent a DVD for, based on some features, and have approached you for help. They want you to try out some regression models that predict the rental duration. The company wants a model that yields an MSE of 3 or less on a test set. The model you build will help the company plan its inventory more efficiently.
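
To make the target concrete, here is a minimal sketch of how MSE is computed. The rental lengths and predictions below are made-up values for illustration, not taken from `rental_info.csv`. Since MSE is in squared days, an MSE of 3 corresponds to an RMSE of roughly 1.73 days.

```python
import numpy as np

# Hypothetical rental lengths (days) and model predictions, for illustration only.
y_true = np.array([3, 5, 2, 7])
y_pred = np.array([4, 5, 0, 6])

# MSE is the mean of the squared differences between actual and predicted values.
squared_errors = (y_true - y_pred) ** 2
mse = squared_errors.mean()

# RMSE brings the error back to the unit of the target (days).
rmse = np.sqrt(mse)

print(f"MSE: {mse:.2f}, RMSE: {rmse:.2f} days")
```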

The data they provided is in the csv file `rental_info.csv`. It has the following features:
- `"rental_date"`: The date (and time) the customer rents the DVD.
- `"return_date"`: The date (and time) the customer returns the DVD.
- `"amount"`: The amount paid by the customer for renting the DVD.
- `"amount_2"`: The square of `"amount"`.
- `"rental_rate"`: The rate at which the DVD is rented.
- `"rental_rate_2"`: The square of `"rental_rate"`.
- `"release_year"`: The year the movie being rented was released.
- `"length"`: Length of the movie being rented, in minutes.
- `"length_2"`: The square of `"length"`.
- `"replacement_cost"`: The amount it will cost the company to replace the DVD.
- `"special_features"`: Any special features, for example trailers/deleted scenes that the DVD also has.
- `"NC-17"`, `"PG"`, `"PG-13"`, `"R"`: These columns are dummy variables for the rating of the movie. Each takes the value 1 if the movie has the rating given by the column name and 0 otherwise. For your convenience, the reference dummy has already been dropped.
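
The squared columns and rating dummies arrive precomputed, but a small sketch may help show how such columns are typically built. The `rating` column name, the toy values, and the idea that `G` is the dropped reference category are all assumptions made for this illustration; the dataset itself does not say which rating was dropped.

```python
import pandas as pd

# Hypothetical raw data: the "rating" column and its values are assumptions.
raw = pd.DataFrame({
    "amount": [2.99, 4.99, 0.99, 6.99, 2.99],
    "rating": ["G", "NC-17", "PG", "PG-13", "R"],
})

# Squared term, analogous to the provided "amount_2" column.
raw["amount_2"] = raw["amount"] ** 2

# drop_first=True removes the first category (here "G"), which becomes the
# reference level: a row with all four dummies equal to 0 is rated "G".
rating_dummies = pd.get_dummies(raw["rating"], drop_first=True, dtype=int)

features = pd.concat([raw.drop(columns="rating"), rating_dummies], axis=1)
print(features)
```

Dropping one level avoids perfect collinearity between the dummies and the intercept in a linear model.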

In [44]:
# Importing Libraries
import pandas as pd
import numpy as np

In [45]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

In [46]:
# Loading the data
df = pd.read_csv("rental_info.csv")
df.head()

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401


In [47]:
print(f'No. of records are {df.shape[0]}')
print(f'No. of columns are {df.shape[1]}')

No. of records are 15861
No. of columns are 15


In [48]:
# Checking null values
df.isnull().sum()

rental_date         0
return_date         0
amount              0
release_year        0
rental_rate         0
length              0
replacement_cost    0
special_features    0
NC-17               0
PG                  0
PG-13               0
R                   0
amount_2            0
length_2            0
rental_rate_2       0
dtype: int64

In [49]:
# Checking data types
df.dtypes

rental_date          object
return_date          object
amount              float64
release_year        float64
rental_rate         float64
length              float64
replacement_cost    float64
special_features     object
NC-17                 int64
PG                    int64
PG-13                 int64
R                     int64
amount_2            float64
length_2            float64
rental_rate_2       float64
dtype: object

In [50]:
# Getting the number of rental days
df["rental_length"] = pd.to_datetime(df["return_date"]) - pd.to_datetime(df["rental_date"])
df["rental_length_days"] = df["rental_length"].dt.days
df.head(3)

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2,rental_length,rental_length_days
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,3 days 20:46:00,3
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,2 days 20:05:00,2
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,7 days 05:44:00,7
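
Note that `.dt.days` keeps only the whole-day component of each timedelta, so a rental of 3 days 20:46:00 counts as 3 days. The sketch below reuses the timestamps from the first row above and contrasts this with a fractional-day value; the fractional variant is shown only for comparison and is not used in this notebook.

```python
import pandas as pd

# Timestamps copied from the first row of the data shown above.
rental = pd.to_datetime(pd.Series(["2005-05-25 02:54:33+00:00"]))
returned = pd.to_datetime(pd.Series(["2005-05-28 23:40:33+00:00"]))

delta = returned - rental  # a Series holding one timedelta: 3 days 20:46:00

# Whole-day component only (what the notebook uses as the target).
whole_days = delta.dt.days

# Alternative for comparison: duration expressed in fractional days.
fractional_days = delta.dt.total_seconds() / 86400

print(whole_days.iloc[0], round(fractional_days.iloc[0], 2))
```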


In [51]:
# Adding dummy variables using the special features column
# Dummy for deleted scenes

df["deleted_scenes"] =  np.where(df["special_features"].str.contains("Deleted Scenes"), 1, 0)
# Add dummy for behind the scenes
df["ehind_the_scenes"] =  np.where(df["special_features"].str.contains("Behind the Scenes"), 1, 0)

In [52]:
df.head()

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2,rental_length,rental_length_days,deleted_scenes,ehind_the_scenes
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,3 days 20:46:00,3,0,1
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,2 days 20:05:00,2,0,1
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,7 days 05:44:00,7,0,1
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,2 days 02:24:00,2,0,1
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,4 days 01:05:00,4,0,1
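
As a quick check of the substring approach, here is a minimal sketch on a few hand-written `special_features` strings. Only the first string comes from the data above; the others are hypothetical. It uses `.astype(int)` on the boolean mask, which gives the same 0/1 result as `np.where(mask, 1, 0)`.

```python
import pandas as pd

# The first value matches the rows shown above; the rest are made up for illustration.
special_features = pd.Series([
    '{Trailers,"Behind the Scenes"}',
    '{"Deleted Scenes"}',
    '{Trailers,"Deleted Scenes","Behind the Scenes"}',
    '{Commentaries}',
])

# regex=False treats the pattern as a plain substring.
flags = pd.DataFrame({
    "deleted_scenes": special_features.str.contains("Deleted Scenes", regex=False).astype(int),
    "behind_the_scenes": special_features.str.contains("Behind the Scenes", regex=False).astype(int),
})

print(flags)
```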


In [53]:
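# Dropping columns that would cause data leakage (the target and the dates it is derived from),
# plus the raw special_features text, which is already encoded as dummy variables
data_leakage_columns = ["special_features", "rental_length", "rental_length_days", "rental_date", "return_date"]

X = df.drop(data_leakage_columns, axis=1)   # Features
y = df['rental_length_days']                # Target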

# Train_Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=9)
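
The sketch below illustrates what `test_size=0.2` and `random_state=9` do, using placeholder arrays with the same number of rows as the dataset (15861). The arrays themselves are stand-ins, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 15861  # row count reported by df.shape above

# Placeholder data: the row index stands in for the real features and target.
X_demo = np.arange(n_rows).reshape(-1, 1)
y_demo = np.arange(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=9)

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=9)

print(f"train rows: {len(X_tr)}, test rows: {len(X_te)}")
print("same split on rerun:", np.array_equal(X_te, X_te2))
```

Fixing `random_state` matters here because every model is compared on the same held-out rows.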

In [54]:
X_train.head(3)

Unnamed: 0,amount,release_year,rental_rate,length,replacement_cost,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2,deleted_scenes,ehind_the_scenes
6682,2.99,2010.0,2.99,90.0,25.99,1,0,0,0,8.9401,8100.0,8.9401,0,1
8908,4.99,2008.0,0.99,53.0,25.99,1,0,0,0,24.9001,2809.0,0.9801,1,0
11827,6.99,2007.0,4.99,171.0,25.99,0,0,1,0,48.8601,29241.0,24.9001,0,1


In [55]:
X_test.head(3)

Unnamed: 0,amount,release_year,rental_rate,length,replacement_cost,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2,deleted_scenes,ehind_the_scenes
15067,4.99,2005.0,0.99,184.0,9.99,0,0,1,0,24.9001,33856.0,0.9801,1,1
3808,4.99,2005.0,4.99,179.0,29.99,0,0,0,1,24.9001,32041.0,24.9001,0,1
1015,4.99,2007.0,4.99,73.0,17.99,0,1,0,0,24.9001,5329.0,24.9001,1,1


In [56]:
y_train.head(3)

6682     4
8908     8
11827    5
Name: rental_length_days, dtype: int64

In [57]:
y_test.head(3)

15067    8
3808     1
1015     6
Name: rental_length_days, dtype: int64

In [58]:
# Perform feature selection using lasso (keeps only features with a positive coefficient)
lasso = Lasso(alpha=0.3, random_state=9) 
lasso.fit(X_train, y_train)
lasso_coef = lasso.coef_
X_train, X_test = X_train.iloc[:, lasso_coef > 0], X_test.iloc[:, lasso_coef > 0]
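
One detail worth seeing on toy data: the mask `lasso_coef > 0` keeps only features with positive coefficients, so a feature with a strong negative effect is dropped along with the ones Lasso shrinks to zero. The sketch below uses synthetic data (an assumption, unrelated to the rental data) to compare this with a `!= 0` mask, which is a common alternative.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: feature 0 has a positive effect, feature 1 a negative effect,
# and feature 2 is pure noise with no effect on the target.
rng = np.random.default_rng(9)
X_demo = rng.normal(size=(500, 3))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=500)

lasso_demo = Lasso(alpha=0.3).fit(X_demo, y_demo)
coef = lasso_demo.coef_

positive_mask = coef > 0   # the rule used in this notebook
nonzero_mask = coef != 0   # alternative: keep every feature Lasso did not zero out

print("coefficients:", np.round(coef, 2))
print("kept by  > 0:", positive_mask)
print("kept by != 0:", nonzero_mask)
```

Whether the `!= 0` rule would improve the results in this notebook is not tested here.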

In [59]:
# Building a function for fitting, predicting & model evaluation
def evaluate_regressor(name, regressor, params=None, X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test):
    # Perform grid search with cross-validation (an empty grid fits the regressor with its default settings)
    cv = GridSearchCV(regressor, params or {}, cv=5)
    cv.fit(X_train, y_train)

    # Predict the target variable for the test set
    y_pred = cv.predict(X_test)

    # Compute and print regression metrics
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)

    print(f'Regressor: {name}')
    print(f'Best Parameters: {cv.best_params_}')
    print(f'Mean Squared Error (MSE): {round(mse, 2)}')
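
When `scoring` is not set, `GridSearchCV` ranks candidates with the estimator's own `score` method, which is R² for scikit-learn regressors. Since the goal here is stated in terms of MSE, one option is to select hyperparameters on MSE directly. The sketch below shows this on synthetic data with a `Ridge` model; both are stand-ins chosen for illustration, not the setup used above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(9)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([1.5, -2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=200)

# scikit-learn maximizes scores, so MSE is exposed as its negative.
search = GridSearchCV(
    Ridge(),
    {"alpha": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

# Flip the sign to read the cross-validated MSE of the best candidate.
cv_mse = -search.best_score_

print(f"Best parameters: {search.best_params_}")
print(f"Cross-validated MSE: {cv_mse:.3f}")
```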

In [60]:
# Linear Regression
evaluate_regressor('Linear Regression', LinearRegression())

Regressor: Linear Regression
Best Parameters: {}
Mean Squared Error (MSE): 4.81


In [61]:
# Lasso Regression
evaluate_regressor('Lasso Regression', Lasso(),
                   {'alpha': [0.1, 1.0, 3.0, 5.0, 10.0],
                    'selection': ['cyclic', 'random']})

Regressor: Lasso Regression
Best Parameters: {'alpha': 0.1, 'selection': 'random'}
Mean Squared Error (MSE): 4.84


In [62]:
# Decision Tree Regression
evaluate_regressor('Decision Tree Regression', DecisionTreeRegressor(),
                   {'max_depth': [None, 5, 10, 15],
                    'min_samples_split': [2, 5, 10, 20]
})

Regressor: Decision Tree Regression
Best Parameters: {'max_depth': None, 'min_samples_split': 10}
Mean Squared Error (MSE): 3.65


In [63]:
# Random Forest Regression
evaluate_regressor('Random Forest Regression', RandomForestRegressor(),
                   {'n_estimators': [50, 100, 200],
                    'max_depth': [None, 10, 20],
                    'min_samples_split': [2, 5, 10]})

Regressor: Random Forest Regression
Best Parameters: {'max_depth': 20, 'min_samples_split': 10, 'n_estimators': 200}
Mean Squared Error (MSE): 3.62
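
To compare the runs against the company's target, the test-set MSE values printed above can be collected and checked against the threshold of 3. The numbers below are copied from the outputs in this notebook; the random forest is not seeded, so its value may differ slightly on a rerun.

```python
# Target from the problem statement and test-set MSEs reported above.
target_mse = 3.0
reported_mse = {
    "Linear Regression": 4.81,
    "Lasso Regression": 4.84,
    "Decision Tree Regression": 3.65,
    "Random Forest Regression": 3.62,
}

# Model with the lowest reported test-set MSE.
best_name = min(reported_mse, key=reported_mse.get)

# List the models from best to worst and flag whether each meets the target.
for name, mse in sorted(reported_mse.items(), key=lambda item: item[1]):
    status = "meets" if mse <= target_mse else "misses"
    print(f"{name}: MSE {mse:.2f} ({status} the target of {target_mse:.0f})")

print(f"Best so far: {best_name}")
```

With these numbers the random forest is the closest, but none of the four models reaches the target yet.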
