# Lab | Comparing regression models


For this lab, we will use the same dataset as in the previous labs. We recommend working in the same notebook, since you will reuse the variables you created in those labs.

### Instructions

1. In this final lab, we will model our data. Import `train_test_split` from sklearn and split the data.

2. We will start by removing outliers, if you have not already done so. We have discussed several methods for this. Choose the one you are most comfortable with, define a function for it, and apply that function to the dataframe.

3. Create a copy of the dataframe for the data wrangling.

4. Normalize the continuous variables. You can use any one method you want.

5. Encode the categorical variables (see the hint below for encoding categorical data).

6. The time variable can be useful. Try to transform it into useful features. Hint: day, week, and month as integers might be useful.

7. Since the model only accepts numerical data, check that every column is numerical. If some are not, convert them using encoding.


## Hint for Categorical Variables

You should encode the categorical variables as shown below (sample code for ordinal encoding is provided as well):

Encoder Type | Column
-------------|-----------------
One hot | state
Ordinal | coverage
Ordinal | employmentstatus
Ordinal | location code
One hot | marital status
One hot | policy type
One hot | policy
One hot | renew offer type
One hot | sales channel
One hot | vehicle class
Ordinal | vehicle size

##### Dummy code
data["coverage"] = data["coverage"].map({"Basic" : 0, "Extended" : 1, "Premium" : 2})

This assumes that the column "coverage" in the dataframe "data" has three categories, "Basic", "Extended", and "Premium", and that the values are to be represented in that order.

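The table above combines two encoders, which can be sketched on a toy frame as follows. The sample rows are made up for illustration, and any category order other than the one given for `coverage` would be an assumption, so check `data[col].unique()` on the real data before fixing an order.

```python
import pandas as pd

# Toy rows for illustration only; the real dataframe has more columns and rows.
data = pd.DataFrame({
    "coverage": ["Basic", "Premium", "Extended", "Basic"],
    "state": ["California", "Oregon", "Arizona", "Oregon"],
})

# Ordinal: map each category to its rank, in the order given in the hint.
coverage_order = {"Basic": 0, "Extended": 1, "Premium": 2}
data["coverage"] = data["coverage"].map(coverage_order)

# Categories missing from the mapping silently become NaN, so check for them.
assert data["coverage"].notna().all()

# One hot: one indicator column per category; drop_first removes the redundant one.
encoded = pd.get_dummies(data, columns=["state"], drop_first=True, dtype=int)
print(encoded)
```

Checking for `NaN` after `map` is worthwhile because a misspelled or differently capitalized category does not raise an error.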


8. Try a simple linear regression with all the data to see whether we are getting good results.

9. Great! Now define a function that takes a list of models and trains (and tests) them, so we can try many models without repeating code.

10. Use the function to check `LinearRegression` and `KNeighborsRegressor`.

11. You can also check the `MLPRegressor` for this task!

12. Check and discuss the results.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [None]:
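# Parse the date column on load so that select_dtypes(np.datetime64) can pick it up later.
customer_df = pd.read_csv('we_fn_use_c_marketing_customer_value_analysis.csv',
                          parse_dates=['effective_to_date'])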
customer_df.head()

#### 1. In this final lab, we will model our data. Import `train_test_split` from sklearn and split the data.

In [None]:
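def remove_outliers(data):
    """Drop rows that fall outside the 1.5 * IQR fences of the numeric feature columns."""
    data2 = data.copy()
    numeric = data2.select_dtypes(np.number)
    for col in numeric.columns:
        # Leave the target column untouched.
        if col != 'total_claim_amount':
            q1, q3 = np.percentile(data2[col], [25, 75])
            iqr = q3 - q1
            # Skip columns with zero IQR; their fences would collapse to a single value.
            if iqr == 0:
                continue
            upper_limit = q3 + 1.5 * iqr
            lower_limit = q1 - 1.5 * iqr
            data2 = data2[(data2[col] >= lower_limit) & (data2[col] <= upper_limit)]
    # Return only after every column has been filtered.
    return data2

customer_df = remove_outliers(customer_df)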
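
As a quick check of the IQR rule used in the function above, the fences can be computed by hand on a small array. The values below are invented for illustration; only the 1.5 * IQR rule itself comes from the notebook.

```python
import numpy as np

# Toy values for illustration; 95.0 is the obvious outlier.
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 95.0])

# Quartiles and interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR below the first and above the third quartile.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Keep only the values inside the fences.
kept = values[(values >= lower_limit) & (values <= upper_limit)]
print(lower_limit, upper_limit, kept)
```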


In [None]:
X = customer_df.drop(['total_claim_amount'], axis=1)
y = customer_df['total_claim_amount']

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

X_train_discrete = X_train.select_dtypes(np.int64)
X_train_continuous = X_train.select_dtypes([np.float64, np.datetime64])
X_train_cat = X_train.select_dtypes(object)

X_test_discrete = X_test.select_dtypes(np.int64)
X_test_continuous = X_test.select_dtypes([np.float64, np.datetime64])
X_test_cat = X_test.select_dtypes(object)

In [None]:
X_train_discrete['day']   = pd.to_datetime(X_train_continuous['effective_to_date']).dt.day
X_train_discrete['month'] = pd.to_datetime(X_train_continuous['effective_to_date']).dt.month
X_train_discrete['year']  = pd.to_datetime(X_train_continuous['effective_to_date']).dt.year
X_train_continuous = X_train_continuous.drop(['effective_to_date'], axis=1)

X_test_discrete['day']   = pd.to_datetime(X_test_continuous['effective_to_date']).dt.day
X_test_discrete['month'] = pd.to_datetime(X_test_continuous['effective_to_date']).dt.month
X_test_discrete['year']  = pd.to_datetime(X_test_continuous['effective_to_date']).dt.year
X_test_continuous = X_test_continuous.drop(['effective_to_date'], axis=1)

#### 2. We will start by removing outliers, if you have not already done so. We have discussed several methods for this. Choose the one you are most comfortable with, define a function for it, and apply that function to the dataframe.

#### 3. Create a copy of the dataframe for the data wrangling.

In [None]:
customer_df = customer_df.copy()

#### 4. Normalize the continuous variables. You can use any one method you want.

In [None]:
from sklearn.preprocessing import PowerTransformer

pT = PowerTransformer()
pT.fit(X_train_continuous)

X_train_continuous_trans_np = pT.transform(X_train_continuous)
X_test_continuous_trans_np = pT.transform(X_test_continuous)


X_train_continuous_trans = pd.DataFrame(X_train_continuous_trans_np, columns=X_train_continuous.columns,
                                       index=X_train_continuous.index)
X_test_continuous_trans = pd.DataFrame(X_test_continuous_trans_np, columns=X_test_continuous.columns,
                                       index=X_test_continuous.index)

#### 5. Encode the categorical variables (see the hint below for encoding categorical data).

In [None]:
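def encode_categorical(data):
    # Drop the customer identifier before encoding.
    data = data.drop(['customer'], axis=1)
    return pd.get_dummies(data, drop_first=True, dtype=int)

X_train_cat_encoded = encode_categorical(X_train_cat)
X_test_cat_encoded = encode_categorical(X_test_cat)

# get_dummies runs on each split separately, so align the test columns with the
# training columns; categories absent from the test split are filled with 0.
X_test_cat_encoded = X_test_cat_encoded.reindex(columns=X_train_cat_encoded.columns,
                                                fill_value=0)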

In [None]:
X_train_final = pd.concat([X_train_discrete, X_train_continuous_trans, X_train_cat_encoded], axis=1)
X_test_final = pd.concat([X_test_discrete, X_test_continuous_trans, X_test_cat_encoded], axis=1)

#### 6. The time variable can be useful. Try to transform it into useful features. Hint: day, week, and month as integers might be useful.
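
One way to follow the hint is sketched below with the pandas `.dt` accessor. The three dates are made up and written in ISO format to avoid parsing ambiguity; the column name `effective_to_date` is the one used in this notebook, while the new column names are only suggestions.

```python
import pandas as pd

# Toy dates for illustration only.
df = pd.DataFrame({"effective_to_date": ["2011-02-24", "2011-01-31", "2011-02-19"]})

dates = pd.to_datetime(df["effective_to_date"])
df["day"] = dates.dt.day                               # day of the month
df["weekday"] = dates.dt.dayofweek                     # Monday=0 ... Sunday=6
df["week"] = dates.dt.isocalendar().week.astype(int)   # ISO week number
df["month"] = dates.dt.month

# The raw date column is no longer needed once the integer features exist.
df = df.drop(columns=["effective_to_date"])
print(df)
```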

In [None]:
X_train_final.dtypes

In [None]:
from sklearn.linear_model import LinearRegression

lm = LinearRegression()
lm.fit(X_train_final, y_train)
print(f'Train score: {lm.score(X_train_final, y_train)}')
print(f'Test score: {lm.score(X_test_final, y_test)}')

#### 7. Since the model only accepts numerical data, check that every column is numerical. If some are not, convert them using encoding.
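
A minimal sketch of such a check, assuming a toy frame in place of the final feature matrix: list the non-numeric columns with `select_dtypes`, encode them, and verify that none remain. The column names and values here are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the final feature matrix.
features = pd.DataFrame({
    "income": [100.0, 0.0, 250.0],
    "number_of_policies": [1, 8, 2],
    "gender": ["F", "F", "M"],
})

# Columns that are neither numeric nor boolean still need encoding.
non_numeric = features.select_dtypes(exclude=[np.number, "bool"]).columns.tolist()
print(non_numeric)

# Encode whatever is left, then confirm that only numeric columns remain.
features = pd.get_dummies(features, columns=non_numeric, drop_first=True, dtype=int)
assert features.select_dtypes(exclude=np.number).empty
```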



#### 8. Try a simple linear regression with all the data to see whether we are getting good results.

#### 9. Great! Now define a function that takes a list of models and trains (and tests) them, so we can try many models without repeating code.

In [None]:
def train_models(list_of_models, X_train, y_train):
    for model in list_of_models:
        model.fit(X_train, y_train)
    return list_of_models

In [None]:
from sklearn.metrics import r2_score


def evaluate_models(list_of_models, X_train, y_train, X_test, y_test):
    
    train_scores = []
    test_scores = []
    
    for model in list_of_models:
        y_train_pred = model.predict(X_train)
        train_scores.append(r2_score(y_train, y_train_pred))

        y_test_pred = model.predict(X_test)
        test_scores.append(r2_score(y_test, y_test_pred))
        
    return train_scores, test_scores

#### 10. Use the function to check `LinearRegression` and `KNeighborsRegressor`.

In [None]:
from sklearn.neighbors import KNeighborsRegressor

list_of_models = [
    LinearRegression(),
    KNeighborsRegressor(n_neighbors=4)
]

list_of_trained_models = train_models(list_of_models, X_train_final, y_train)
train_scores, test_scores = evaluate_models(list_of_trained_models,
                                            X_train_final, y_train,
                                            X_test_final, y_test)
for i in range(len(list_of_models)):
    print(f'Model: {list_of_trained_models[i]}')
    print(f'    Train-Score: {train_scores[i]}')
    print(f'    Test-Score:  {test_scores[i]}')

#### 11. You can also check the `MLPRegressor` for this task!

In [None]:
from sklearn.neural_network import MLPRegressor

list_of_models.append(MLPRegressor())

list_of_trained_models = train_models(list_of_models, X_train_final, y_train)
train_scores, test_scores = evaluate_models(list_of_trained_models,
                                            X_train_final, y_train,
                                            X_test_final, y_test)
for i in range(len(list_of_models)):
    print(f'Model: {list_of_trained_models[i]}')
    print(f'    Train-Score: {train_scores[i]}')
    print(f'    Test-Score:  {test_scores[i]}')

#### 12. Check and discuss the results.

##### In the recorded run, the best R² score on the training set is 0.76, obtained with the linear regression model. The best R² score on the test set is 0.75, also obtained with linear regression.
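
To make the discussion easier, the scores can be collected into one table, with the gap between train and test score as a rough overfitting signal. The helper `summarize_scores` below is hypothetical, and the demo numbers are placeholders, not results from this lab.

```python
import pandas as pd

def summarize_scores(model_names, train_scores, test_scores):
    """Collect R² scores into one table, sorted by test score (best first)."""
    summary = pd.DataFrame({
        "model": model_names,
        "train_r2": train_scores,
        "test_r2": test_scores,
    })
    # A large positive gap suggests the model fits the training data too closely.
    summary["gap"] = summary["train_r2"] - summary["test_r2"]
    return summary.sort_values("test_r2", ascending=False).reset_index(drop=True)

# Placeholder names and numbers, only to show the shape of the output.
demo = summarize_scores(["model_a", "model_b"], [0.60, 0.90], [0.58, 0.50])
print(demo)
```

With the notebook's variables, the call could look like `summarize_scores([type(m).__name__ for m in list_of_trained_models], train_scores, test_scores)`.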