# Lab | Customer Analysis Round 7

## Modeling

### Try to improve the linear regression model.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
%matplotlib inline

In [2]:
import warnings
warnings.filterwarnings('ignore') # no more warnings 

In [3]:
from sklearn.metrics import r2_score

# several scalers and feature transformers to try
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler, MaxAbsScaler
from sklearn.preprocessing import RobustScaler, QuantileTransformer, PowerTransformer, PolynomialFeatures

# a lot of different regression models to try
from sklearn.linear_model import LinearRegression, HuberRegressor, RANSACRegressor
from sklearn.linear_model import ElasticNet, SGDRegressor, BayesianRidge

# more regression models
from sklearn.svm import SVR
from sklearn.kernel_ridge import KernelRidge
from sklearn.ensemble import GradientBoostingRegressor 

In [4]:
data = pd.read_csv('files_for_lab/csv_files/marketing_customer_analysis.csv') # forward slashes avoid invalid escape sequences

In [5]:
poly = PolynomialFeatures()              # polynomial feature expansion (default degree 2)
use_poly = False                         # True: expand with poly; False: scale with numerical_transformer
reg = LinearRegression()                 # regression model to fit
numerical_transformer = StandardScaler() # scaler applied when use_poly is False
split_feature = 'total_claim_amount'     # target column to predict
train_test_split_ratio = 0.2             # fraction of rows held out as test data
outlier = ['number_of_policies']         # columns to be cleaned of outliers

# features/columns that had low coefficient values in the baseline model and will be dropped
drop_columns = ['customer', 'effective_to_date', 'sales_channel', 'state', 'education',
                'number_of_open_complaints', 'months_since_policy_inception', 'vehicle_size',
                'policy', 'months_since_last_claim']

In [6]:
data.columns = data.columns.str.lower()             # transform all column names to lower case
data.columns = data.columns.str.replace(' ','_')    # replace spaces in between with '_'

# compute the quartiles only on the columns to clean; calling quantile() on the
# whole frame fails in recent pandas versions when non-numeric columns are present
Q1 = data[outlier].quantile(0.25) # first quartile
Q3 = data[outlier].quantile(0.75) # third quartile
IQR = Q3 - Q1                     # interquartile range

data = data[~(                           # negation so we keep the datapoints within the whiskers
    (data[outlier] < (Q1 - 1.5 * IQR))   # datapoints left of the "left whisker"
    | (data[outlier] > (Q3 + 1.5 * IQR)) # datapoints right of the "right whisker"
).any(axis=1)]

data = data.drop(columns=drop_columns) # dropping low impact columns
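
The whisker filter above can be wrapped in a small helper. The following is only a sketch on a made-up toy frame: the function name `drop_iqr_outliers`, the parameter `k`, and the toy values are assumptions for illustration and are not part of the lab data.

```python
import pandas as pd

def drop_iqr_outliers(df, columns, k=1.5):
    """Drop rows that fall outside the IQR whiskers in any of `columns` (hypothetical helper)."""
    q1 = df[columns].quantile(0.25)   # first quartile per column
    q3 = df[columns].quantile(0.75)   # third quartile per column
    iqr = q3 - q1                     # interquartile range per column
    outside = (
        (df[columns] < (q1 - k * iqr))    # left of the lower whisker
        | (df[columns] > (q3 + k * iqr))  # right of the upper whisker
    ).any(axis=1)
    return df[~outside]

# toy data (assumed values): 50 is far outside the whiskers
toy = pd.DataFrame({
    "number_of_policies": [1, 2, 2, 3, 3, 3, 4, 50],
    "state": list("abcdefgh"),
})
cleaned = drop_iqr_outliers(toy, ["number_of_policies"])
print(cleaned)
```

Restricting the quartile computation to `columns` keeps the comparison between identically labeled objects, so the text column `state` never enters the calculation.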

In [7]:
x = data[data.columns.drop(split_feature)] # features
y = data[split_feature]                    # target feature
x = pd.get_dummies(x, drop_first = True)   # one-hot encode categorical columns (drop first level)

In [8]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = train_test_split_ratio, random_state = 42) # train test split

if use_poly:
    X_train = poly.fit_transform(X_train) # fit and transform X_train with PolynomialFeatures
    X_test = poly.transform(X_test)       # transform X_test with PolynomialFeatures
else:
    X_train = numerical_transformer.fit_transform(X_train) # fit and transform X_train
    X_test = numerical_transformer.transform(X_test)       # transform X_test with the fitted scaler
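
The cell above applies either the polynomial expansion or the scaler, never both. One possible alternative, which is not what this notebook does, is to chain both steps with scikit-learn's `make_pipeline` so they are fitted on the train data only. The sketch below uses synthetic data (an assumption) rather than the customer data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# synthetic data (assumption): the target depends on a squared feature
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# scale first, then expand to degree-2 terms, then fit the regression
model = make_pipeline(StandardScaler(), PolynomialFeatures(degree=2), LinearRegression())
model.fit(X_train, y_train)         # every step is fitted on the train data only
test_r2 = model.score(X_test, y_test)  # R2 on the held-out data
print('R2 value for test: {}'.format(test_r2))
```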

In [9]:
reg.fit(X_train, y_train) # train on the train data

LinearRegression()

In [10]:
predictions_train = reg.predict(X_train) # create predictions for our train data
predictions_test = reg.predict(X_test)   # create predictions for our test data

In [11]:
r2_train = r2_score(y_train, predictions_train) # calculate r2 score for train data
r2_test = r2_score(y_test, predictions_test)    # calculate r2 score for test data

print('R2 value for train: {}'.format(r2_train))
print('R2 value for test: {}'.format(r2_test))  

R2 value for train: 0.7696091117258597
R2 value for test: 0.7716572809167026
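
For reference, the R2 score is one minus the ratio of the residual sum of squares to the total sum of squares. The short sketch below recomputes it by hand on four made-up values (assumptions, not lab data) and compares the result with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# made-up values for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares: 1.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares: 20.0
r2_manual = 1 - ss_res / ss_tot                 # 1 - 1.5 / 20 = 0.925

print(r2_manual, r2_score(y_true, y_pred))
```

A negative score, as seen for some models later in this notebook, means the residual sum of squares is larger than the total sum of squares, so the model does worse than always predicting the mean.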


In [12]:
# Find features that are of lesser importance for the regression so they can be dropped in the next iteration

#features = x.columns.to_list()
#feature_importance = pd.DataFrame({"features": features,
#    "values":abs(reg.coef_)}) # get absolute values

#feature_importance.sort_values(["values"], ascending=False) # most important features on top
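
The commented-out ranking above can be tried on its own. The sketch below runs the same idea on synthetic data; the column names `strong`, `weak`, and `noise` are made up for illustration. Comparing absolute coefficients is only meaningful here because the features are standardized first.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# synthetic data (assumption): one strong, one weak, one irrelevant feature
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["strong", "weak", "noise"])
y = 5.0 * X["strong"] + 0.5 * X["weak"] + rng.normal(scale=0.1, size=300)

X_scaled = StandardScaler().fit_transform(X)  # put all features on the same scale
reg = LinearRegression().fit(X_scaled, y)

feature_importance = (
    pd.DataFrame({"features": X.columns, "values": np.abs(reg.coef_)})  # absolute coefficients
    .sort_values("values", ascending=False)  # most important features on top
    .reset_index(drop=True)
)
print(feature_importance)
```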

In [13]:
# save scores from previous attempts in a dictionary
rms = {'Model used' : ['LinearRegression', 'HuberRegressor', 'RANSACRegressor', 'ElasticNet', 
                       'SGDRegressor', 'BayesianRidge', 'SVR', 'KernelRidge', 'GradientBoostingRegressor'],
    'Train score' : [0.77, 0.75, 0.66, 0.66, 0.76, 0.76, 0.31, -1.4, 0.86],
    'Test score' : [0.77, 0.73, 0.67, 0.68, 0.74, 0.74, 0.32, -1.5, 0.82],
    'Train score using PolynomialFeatures' : [0.86, -0.25, 0.42, 0.84, -3, 0.83, 0.08, 0.59, 0.88],
    'Test score using PolynomialFeatures' : [0.80, -0.25, 0.42, 0.82, -3, 0.81, 0.09, 0.46, 0.82],
}

In [14]:
regression_model_scores = pd.DataFrame(data=rms)
regression_model_scores

|   | Model used | Train score | Test score | Train score using PolynomialFeatures | Test score using PolynomialFeatures |
|---|---|---|---|---|---|
| 0 | LinearRegression | 0.77 | 0.77 | 0.86 | 0.8 |
| 1 | HuberRegressor | 0.75 | 0.73 | -0.25 | -0.25 |
| 2 | RANSACRegressor | 0.66 | 0.67 | 0.42 | 0.42 |
| 3 | ElasticNet | 0.66 | 0.68 | 0.84 | 0.82 |
| 4 | SGDRegressor | 0.76 | 0.74 | -3.0 | -3.0 |
| 5 | BayesianRidge | 0.76 | 0.74 | 0.83 | 0.81 |
| 6 | SVR | 0.31 | 0.32 | 0.08 | 0.09 |
| 7 | KernelRidge | -1.4 | -1.5 | 0.59 | 0.46 |
| 8 | GradientBoostingRegressor | 0.86 | 0.82 | 0.88 | 0.82 |
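
The scores in the table were collected by hand from separate runs. As a rough sketch of how such a comparison could be automated, the loop below fits a few of the models on synthetic data from `make_regression` (an assumption, not the customer data) and gathers the scores in a DataFrame. The numbers it prints are unrelated to the table above.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import BayesianRidge, HuberRegressor, LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# synthetic regression problem (assumption)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit the scaler on the train data only
X_test = scaler.transform(X_test)

models = {
    "LinearRegression": LinearRegression(),
    "HuberRegressor": HuberRegressor(),
    "BayesianRidge": BayesianRidge(),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=42),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    rows.append({
        "Model used": name,
        "Train score": round(r2_score(y_train, model.predict(X_train)), 2),
        "Test score": round(r2_score(y_test, model.predict(X_test)), 2),
    })

scores = pd.DataFrame(rows)
print(scores)
```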


### Conclusion

First, I used the *LinearRegression* model with *StandardScaler* to scale the data.

Then I calculated the coefficients of the regression model and gradually dropped features that seemed to have a small impact on the model.

The test split ratio had little effect on the scores; I tried both 0.2 and 0.3.

This resulted in an R2 score of 0.77 for the train data and 0.77 for the test data.

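Afterwards, I tried different scalers and transformers on the features. The only one that had a larger impact was *PolynomialFeatures*, which expands the feature set with polynomial terms rather than rescaling it.

With *LinearRegression*, this resulted in an R2 score of 0.86 for the train data and 0.80 for the test data.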

I continued by trying different regression models.

The best result was achieved with the *GradientBoostingRegressor* together with *PolynomialFeatures*: an R2 score of 0.88 for the train data and 0.82 for the test data. The test score is the same 0.82 that the *GradientBoostingRegressor* reached without *PolynomialFeatures*.

Unfortunately, I was not able to retrieve the coefficients for that approach and decided to stop searching for a better model.
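
A possible follow-up, not carried out in this lab: tree-based ensembles such as *GradientBoostingRegressor* have no `coef_` attribute, but scikit-learn exposes `feature_importances_` on the fitted model instead. The sketch below shows this on synthetic data with made-up column names (assumptions). With *PolynomialFeatures* in front, the matching feature names could presumably be taken from `poly.get_feature_names_out()`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# synthetic data (assumption): one strong, one weak, one irrelevant feature
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(400, 3)), columns=["strong", "weak", "noise"])
y = 5.0 * X["strong"] + 0.5 * X["weak"] + rng.normal(scale=0.1, size=400)

gbr = GradientBoostingRegressor(random_state=42).fit(X, y)

# impurity-based importances, normalized so that they sum to 1
importances = (
    pd.Series(gbr.feature_importances_, index=X.columns)
    .sort_values(ascending=False)  # most important features on top
)
print(importances)
```

These importances rank features by how much they reduce the training loss in the trees. Unlike linear coefficients, they carry no sign, so they show which features matter but not in which direction.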