A DVD rental company needs your help! They want to predict how many days a customer will rent a DVD for, based on some features, and have approached you to try out some regression models for this. The company wants a model that yields an MSE of 3 or less on a test set. The model you make will help the company plan its inventory more efficiently.

The data they provided is in the csv file `rental_info.csv`. It has the following features:
- `"rental_date"`: The date (and time) the customer rents the DVD.
- `"return_date"`: The date (and time) the customer returns the DVD.
- `"amount"`: The amount paid by the customer for renting the DVD.
- `"amount_2"`: The square of `"amount"`.
- `"rental_rate"`: The rate at which the DVD is rented for.
- `"rental_rate_2"`: The square of `"rental_rate"`.
- `"release_year"`: The year the movie being rented was released.
- `"length"`: Length of the movie being rented, in minutes.
- `"length_2"`: The square of `"length"`.
- `"replacement_cost"`: The amount it will cost the company to replace the DVD.
- `"special_features"`: Any special features, for example trailers/deleted scenes that the DVD also has.
- `"NC-17"`, `"PG"`, `"PG-13"`, `"R"`: These columns are dummy variables for the rating of the movie. Each takes the value 1 if the movie has the rating in the column name and 0 otherwise. For your convenience, the reference dummy has already been dropped.

In [33]:
import pandas as pd
import numpy as np
from datetime import datetime

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

- [x] Read in the csv file rental_info.csv using pandas.
- [x] Create a column named "rental_length_days" using the columns "return_date" and "rental_date", and add it to the pandas DataFrame. This column should contain information on how many days a DVD has been rented by a customer.
- [x] Create two columns of dummy variables from "special_features", which take the value of 1 when:
  - [x] The value is "Deleted Scenes", storing as a column called "deleted_scenes".
  - [x] The value is "Behind the Scenes", storing as a column called "behind_the_scenes".
- [x] Make a pandas DataFrame called X containing all the appropriate features you can use to run the regression models, avoiding columns that leak data about the target.
- [x] Choose the "rental_length_days" as the target column and save it as a pandas Series called y.
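
The rental-length step in the checklist above can be sketched on a toy frame. The two rows below are made-up values, not taken from `rental_info.csv`, and the column names `whole_days` and `fractional_days` are illustrative only. The sketch contrasts whole days (`.dt.days`) with fractional days (`.dt.total_seconds() / 86400`), since the task text does not say which is intended.

```python
import pandas as pd

# Hypothetical two-row sample; the real data comes from rental_info.csv.
toy = pd.DataFrame({
    "rental_date": pd.to_datetime(["2005-05-25 02:54:33", "2005-06-15 23:19:16"]),
    "return_date": pd.to_datetime(["2005-05-28 23:40:33", "2005-06-18 19:24:16"]),
})

# Subtracting two datetime columns gives a Series of Timedeltas.
delta = toy["return_date"] - toy["rental_date"]

toy["whole_days"] = delta.dt.days                          # truncated to full days
toy["fractional_days"] = delta.dt.total_seconds() / 86400  # keeps part-days

print(toy[["whole_days", "fractional_days"]])
```

Whichever definition is chosen, the target changes with it, so any MSE is only comparable against the threshold under the same definition.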

EDA:
- [x] No nulls; all ranges look reasonable.
- [x] Datatypes: 'return_date' and 'rental_date' need to be parsed as datetimes (e.g. with `parse_dates` in `pd.read_csv`).
- [x] 'special_features' needs to be converted to dummy variables.
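
The EDA checks listed above might look roughly like the following. This is a minimal sketch on a hypothetical two-row frame (`sample` and its values are invented for illustration), not the checks actually run on the rental data.

```python
import pandas as pd

# Hypothetical miniature of the rental data, for illustration only.
sample = pd.DataFrame({
    "rental_date": ["2005-05-25 02:54:33", "2005-06-15 23:19:16"],
    "return_date": ["2005-05-28 23:40:33", "2005-06-18 19:24:16"],
    "amount": [2.99, 4.99],
    "special_features": ['{Trailers,"Behind the Scenes"}', "{Commentaries}"],
})

# Count missing values per column.
null_counts = sample.isna().sum()

# Dates read from a CSV arrive as strings unless they are parsed.
for col in ["rental_date", "return_date"]:
    sample[col] = pd.to_datetime(sample[col])

print(null_counts)
print(sample.dtypes)
print(sample[["amount"]].describe())  # quick range check on a numeric column
```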

In [34]:
from thefuzz import process

def create_binary_columns(value, string1, string2):
    """
    Flag whether each of two substrings occurs in a special_features value.
    
    Arguments:
    value: str, one entry of the "special_features" column
    string1: str, first substring to look for
    string2: str, second substring to look for
    
    Returns 
    pd.Series of two 0/1 flags, for string1 and string2 respectively
    """
    string1_bool = 1 if string1 in value else 0
    string2_bool = 1 if string2 in value else 0
    
    return pd.Series([string1_bool, string2_bool])

rental_df = pd.read_csv("rental_info.csv", parse_dates=['rental_date','return_date'])
rental_df['rental_length_days'] = (rental_df['return_date'] - rental_df['rental_date']).dt.total_seconds() / 86400

# Apply the function and create new columns
rental_df[['deleted_scenes', 'behind_the_scenes']] = rental_df['special_features'].apply(
    create_binary_columns, args=('Deleted Scenes', 'Behind the Scenes'))

rental_df.groupby('special_features')[['deleted_scenes', 
                                       'behind_the_scenes']].value_counts()

special_features                                              deleted_scenes  behind_the_scenes
{"Behind the Scenes"}                                         0               1                    1108
{"Deleted Scenes","Behind the Scenes"}                        1               1                    1035
{"Deleted Scenes"}                                            1               0                    1023
{Commentaries,"Behind the Scenes"}                            0               1                    1078
{Commentaries,"Deleted Scenes","Behind the Scenes"}           1               1                    1101
{Commentaries,"Deleted Scenes"}                               1               0                    1011
{Commentaries}                                                0               0                    1089
{Trailers,"Behind the Scenes"}                                0               1                    1122
{Trailers,"Deleted Scenes","Behind the Scenes"}               1         
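
The same two dummy columns can also be built without a row-wise `apply`. The sketch below is an alternative, not the approach used in the cell above; it assumes the `special_features` strings have the format shown in the output, and the `special` sample is hypothetical.

```python
import pandas as pd

# Hypothetical sample of special_features strings in the format shown above.
special = pd.Series([
    '{"Behind the Scenes"}',
    '{"Deleted Scenes","Behind the Scenes"}',
    '{Commentaries}',
    '{Trailers,"Deleted Scenes"}',
], name="special_features")

dummies = pd.DataFrame({
    # regex=False treats the pattern as a literal substring.
    "deleted_scenes": special.str.contains("Deleted Scenes", regex=False).astype(int),
    "behind_the_scenes": special.str.contains("Behind the Scenes", regex=False).astype(int),
})

print(dummies)
```

`str.contains` works on the whole column at once, which tends to be faster than calling a Python function per row on a frame of this size.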

- [x] Create X and y df
- [ ] Split data into X_train,...

In [35]:
from sklearn.model_selection import train_test_split
SEED = 9  # fixed seed so the train/test split is reproducible

features = [
    "amount",
    "release_year",
    "rental_rate",
    "length",
    "replacement_cost",
    "NC-17",
    "PG",
    "PG-13",
    "R",
    "deleted_scenes",
    "behind_the_scenes",
    "amount_2",
    "length_2",
    "rental_rate_2"
]
X_df = rental_df[features]

y_df = rental_df['rental_length_days']

print(X_df.shape)
X_train, X_test, y_train, y_test = train_test_split(X_df,
                                   y_df,
                                   test_size = 0.2, # size of test data
                                   random_state = SEED)

(15861, 14)


- Model type: regression
- Target: MSE of 3 or less on the test set
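
For reference, MSE is the mean of the squared differences between the true and predicted values, so an MSE of 3 corresponds to a root-mean-squared error of about 1.73 days. A small check with made-up numbers (the arrays below are illustrative, not model output):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions, in days.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.5, 4.0, 6.0])

mse_manual = np.mean((y_true - y_pred) ** 2)       # average squared error
mse_sklearn = mean_squared_error(y_true, y_pred)   # same quantity via scikit-learn

print(mse_manual, mse_sklearn)
```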

In [36]:
from sklearn.metrics import mean_squared_error as MSE
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

params_dt = {
  'max_depth' : [3,4,5,6],
  'min_samples_leaf': [0.04,0.06,0.08,0.1,0.12],
  'max_features': ['auto','sqrt','log2']  # 'auto' was removed in scikit-learn 1.3; use None there (all features)
}

dt = DecisionTreeRegressor(random_state=9)
grid_dt = GridSearchCV(estimator = dt,
                       param_grid = params_dt,
                       scoring = 'neg_mean_squared_error',
                       cv = 3,
                       n_jobs = -1,
                       verbose = 1)

grid_dt.fit(X_train, y_train)

best_hyperparams = grid_dt.best_params_
best_CV_score = grid_dt.best_score_

best_model = grid_dt.best_estimator_
best_mse = -best_CV_score

print("Best Estimator:", best_model)
print("Best Cross-Validation MSE:", best_mse)


Fitting 3 folds for each of 60 candidates, totalling 180 fits
Best Estimator: DecisionTreeRegressor(max_depth=6, max_features='auto', min_samples_leaf=0.04,
                      random_state=9)
Best Cross-Validation MSE: 2.3072463864816437
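
The score above is a cross-validation MSE on the training split, while the stated target is an MSE on the test set. A minimal sketch of that last step follows. It uses synthetic data from `make_regression` as a stand-in for `X_df` and `y_df`, so the names `X_demo`, `y_demo`, `grid` and `test_mse` are hypothetical and the numbers it prints say nothing about the rental data.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for X_df / y_df; not the rental data.
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=9)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=9)

grid = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=9),
    param_grid={"max_depth": [3, 4, 5, 6], "min_samples_leaf": [0.04, 0.08, 0.12]},
    scoring="neg_mean_squared_error",
    cv=3,
)
grid.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is refit on the whole training split.
y_hat = grid.best_estimator_.predict(X_te)
test_mse = mean_squared_error(y_te, y_hat)

print("CV MSE:  ", -grid.best_score_)
print("Test MSE:", test_mse)
```

In the notebook itself, the equivalent would be calling `best_model.predict(X_test)` and comparing against `y_test`.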
