# Instructions Lab_Comparing_regression_models 
For this lab, we will be using the same dataset we used in the previous labs. We recommend using the same notebook, since you will be reusing the variables you previously created in those labs.

In [56]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

In [57]:
customer_df = pd.read_csv("we_fn_use_c_marketing_customer_value_analysis.csv")

In [58]:
customer_df.columns = [col.replace(" ", "_").lower() for col in customer_df]

In [59]:
customer_df['effective_to_date'] = pd.to_datetime(customer_df['effective_to_date']) 

In [60]:
customer_df = customer_df.drop("customer", axis=1)

- In this final lab, we will model our data. Import sklearn train_test_split and separate the data.

In [61]:
from sklearn.model_selection import train_test_split

In [62]:
numerical_df = customer_df.select_dtypes(include = np.number)
numerical_df =numerical_df.drop(['total_claim_amount'],axis=1)
categorical_df = customer_df.select_dtypes(include="object")

- Try a simple linear regression with all the data to see whether we are getting good results.

In [63]:
transformer = StandardScaler().fit(numerical_df)
x_standardized= transformer.transform(numerical_df)

In [64]:
encoder = OneHotEncoder(handle_unknown='error',drop='first').fit(categorical_df)
encoded= encoder.transform(categorical_df).toarray()

In [65]:
y = customer_df['total_claim_amount']
x = np.concatenate((x_standardized, encoded), axis=1)

In [66]:
x_train,x_test,y_train,y_test = train_test_split(x,y, test_size = 0.4, random_state =100)
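The cells above fit the scaler and the encoder on the full dataset before splitting, so statistics from the test rows influence the preprocessing. One possible alternative, sketched here on synthetic stand-in data (the `demo` frame, its two columns, and `target` are assumptions, not the lab's dataset), is to split first and let a pipeline fit the transformers on the training rows only:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (assumption): one numerical and one categorical column.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "income": rng.normal(50_000, 10_000, size=100),
    "coverage": rng.choice(["Basic", "Extended", "Premium"], size=100),
})
target = demo["income"] * 0.01 + rng.normal(size=100)

# Scale the numerical column and one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["income"]),
    ("cat", OneHotEncoder(drop="first"), ["coverage"]),
])
pipeline = make_pipeline(preprocess, LinearRegression())

# Split first; fit() then learns scaling and encoding from the training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    demo, target, test_size=0.4, random_state=100
)
pipeline.fit(X_train, y_train)
r2 = pipeline.score(X_test, y_test)
print(r2)
```

Whether this changes the score noticeably on the lab's data is not something this sketch shows; it only illustrates the order of operations.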

In [67]:
model = LinearRegression()
model.fit(x_train, y_train)
predictions = model.predict(x_test)
score = model.score(x_test, y_test)
score

0.7695027443638532

- Great! Now define a function that takes a list of models and trains (and tests) them, so we can try many of them without repeating code.

In [68]:
def models_train(x,y):
    x_train,x_test,y_train,y_test = train_test_split(x,y, test_size = 0.4, random_state =100)
    
    model = LinearRegression()
    model.fit(x_train, y_train)
    predictions_LR = model.predict(x_test)
    score_LR = model.score(x_test, y_test)
    
    model = KNeighborsRegressor(n_neighbors=30)  
    model.fit(x_train, y_train)
    predictions_KNR = model.predict(x_test)
    score_KNR = model.score(x_test, y_test)
    
    return (score_LR, score_KNR)
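The function above hard-codes its two models. A minimal sketch of a variant that accepts a list of models, as the instruction asks, could look like the following. The name `train_models`, its default arguments, and the synthetic `x_demo`/`y_demo` data are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor


def train_models(models, x, y, test_size=0.4, random_state=100):
    """Fit every model on the same split and return {class name: test R^2}."""
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=test_size, random_state=random_state
    )
    scores = {}
    for model in models:
        model.fit(x_train, y_train)
        # Keyed by class name, so two models of the same class would overwrite each other.
        scores[type(model).__name__] = model.score(x_test, y_test)
    return scores


# Synthetic stand-in data (assumption, not the lab's dataset).
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 3))
y_demo = x_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

scores = train_models(
    [LinearRegression(), KNeighborsRegressor(n_neighbors=30)], x_demo, y_demo
)
print(scores)
```

In the notebook, the same call would take the lab's `x` and `y` instead of the synthetic arrays.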

- Use the function to check LinearRegression and KNeighborsRegressor.

In [69]:
R2_LR, R2_KNR = models_train(x,y)

In [70]:
print(R2_LR, R2_KNR)

0.7695027443638532 0.6348149429511091
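The two numbers are coefficients of determination (R²), which is what `score` returns for scikit-learn regressors. As a rough illustration of how to read them, here is a small hand-written version; the helper name `r_squared` and the toy values are assumptions.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


y_true_demo = [3.0, 5.0, 7.0, 9.0]

# Perfect predictions leave no residual error.
perfect = r_squared(y_true_demo, y_true_demo)

# Always predicting the mean (6.0) explains none of the variance.
baseline = r_squared(y_true_demo, [6.0, 6.0, 6.0, 6.0])

print(perfect, baseline)
```

A score of 1 means perfect predictions and 0 means no better than predicting the mean, so on this split the linear model explains more of the variance in `total_claim_amount` than the 30-neighbor KNN.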


- You can also check the MLPRegressor for this task!
- Check and discuss the results.

In [71]:
from sklearn.neural_network import MLPRegressor

# Same split as before; fit on the training part returned by this split.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=100)
regr = MLPRegressor(max_iter=500).fit(X_train, y_train)
regr.predict(X_test)

regr_score = regr.score(X_test, y_test)
regr_score



0.834878048369506


# Instructions Lab_Data_Cleaning_AND_Wrangling
So far we have worked on EDA. This lab will focus on data cleaning and wrangling from everything we noticed before.

- We will start with removing outliers. So far, we have discussed different methods to remove outliers. Use the one you feel most comfortable with and define a function for it. Then apply the function to the dataframe to remove the outliers.

In [72]:
def remove_outliers(data):
    # With no axis argument, np.percentile computes a single limit over the
    # values of all columns, so every column is clipped to the same range.
    lower_limit = np.percentile(data, 20)
    upper_limit = np.percentile(data, 80)
    # NumPy's clip() caps the values above and below the allowed range
    # instead of dropping rows.
    data_clean = np.clip(data, lower_limit, upper_limit)
    return data_clean

In [73]:
numerical_clean = remove_outliers(numerical_df)
numerical_clean



| | customer_lifetime_value | income | monthly_premium_auto | months_since_last_claim | months_since_policy_inception | number_of_open_complaints | number_of_policies |
|---|---|---|---|---|---|---|---|
| 0 | 2763.519279 | 4804.451968 | 69 | 32 | 5 | 1 | 1 |
| 1 | 4804.451968 | 1.000000 | 94 | 13 | 42 | 1 | 8 |
| 2 | 4804.451968 | 4804.451968 | 108 | 18 | 38 | 1 | 2 |
| 3 | 4804.451968 | 1.000000 | 106 | 18 | 65 | 1 | 7 |
| 4 | 2813.692575 | 4804.451968 | 73 | 12 | 44 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 9129 | 4804.451968 | 4804.451968 | 73 | 18 | 89 | 1 | 2 |
| 9130 | 3096.511217 | 4804.451968 | 79 | 14 | 28 | 1 | 1 |
| 9131 | 4804.451968 | 1.000000 | 85 | 9 | 37 | 3 | 2 |
| 9132 | 4804.451968 | 4804.451968 | 96 | 34 | 3 | 1 | 3 |
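In the output above, `customer_lifetime_value` and `income` share the same cap (4804.451968) and `income` has a floor of 1.0, which suggests the limits were computed across all columns at once. A minimal sketch of a per-column alternative, assuming the same 20th/80th percentile choice, is shown below; the name `clip_outliers_per_column` and the toy frame are hypothetical.

```python
import pandas as pd


def clip_outliers_per_column(df, lower_q=0.20, upper_q=0.80):
    """Clip each numerical column to its own quantile range."""
    lower = df.quantile(lower_q)  # one lower limit per column
    upper = df.quantile(upper_q)  # one upper limit per column
    # axis=1 aligns the per-column limits with the dataframe's columns.
    return df.clip(lower=lower, upper=upper, axis=1)


# Toy data (assumption): column "a" contains one extreme value.
toy = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [10, 20, 30, 40, 50]})
clipped = clip_outliers_per_column(toy)
print(clipped)
```

Clipping keeps every row and only caps the values. Dropping the rows instead is another option the lab leaves open.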


- Create a copy of the dataframe for the data wrangling.

- Normalize the continuous variables. You can use any one method you want.

In [74]:
transformer = StandardScaler().fit(numerical_clean)
x_standardized= transformer.transform(numerical_clean)

- Encode the categorical variables

In [75]:
encoder = OneHotEncoder(handle_unknown='error',drop='first').fit(categorical_df)
encoded= encoder.transform(categorical_df).toarray()

- The time variable can be useful. Try to transform its data into a useful form. Hint: day, week, and month as integers might be useful.

In [76]:
numerical_clean['month'] = customer_df['effective_to_date'].dt.month
numerical_clean['day'] = customer_df['effective_to_date'].dt.day
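The cell above extracts the month and the day of the month. The hint also mentions the week, so here is a small sketch that adds the day of the week as an integer through the pandas `.dt` accessor. The three sample dates and the `features` frame are assumptions for illustration.

```python
import pandas as pd

# Sample dates (assumption) standing in for customer_df['effective_to_date'].
dates = pd.Series(pd.to_datetime(["2011-02-24", "2011-01-31", "2011-02-19"]))

features = pd.DataFrame({
    "month": dates.dt.month,            # 1 = January ... 12 = December
    "day": dates.dt.day,                # day of the month
    "day_of_week": dates.dt.dayofweek,  # 0 = Monday ... 6 = Sunday
})
print(features)
```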

- Since the model will only accept numerical data, check that every column is numerical. If some are not, convert them using encoding.

In [77]:
transformer = StandardScaler().fit(numerical_clean)
x_standardized= transformer.transform(numerical_clean)
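One way to perform the check itself is to list the columns whose dtype is not numerical. This is a minimal sketch; the helper name `non_numeric_columns` and the toy frame are assumptions.

```python
import numpy as np
import pandas as pd


def non_numeric_columns(df):
    """Return the names of the columns that are not of a numerical dtype."""
    return df.select_dtypes(exclude=np.number).columns.tolist()


# Toy frame (assumption): one numerical and one text column.
demo = pd.DataFrame({"income": [1, 2], "coverage": ["Basic", "Premium"]})
remaining = non_numeric_columns(demo)
print(remaining)
```

An empty list means every column can be passed to the model. Any name that is returned still needs encoding.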

- Hint for Categorical Variables

You should deal with the categorical variables as shown below (for ordinal encoding, dummy code has been provided as well):

In [None]:
"""
# One hot to state
# Ordinal to coverage
# Ordinal to employmentstatus
# Ordinal to location code
# One hot to marital status
# One hot to policy type
# One hot to policy
# One hot to renew offer
# One hot to sales channel
# One hot vehicle class
# Ordinal vehicle size

data["coverage"] = data["coverage"].map({"Basic" : 0, "Extended" : 1, "Premium" : 2})
# given that column "coverage" in the dataframe "data" has three categories:
# "Basic", "Extended", and "Premium", and the values are to be represented in that order.

"""