## Lab | Predicting Claim Amount with ML Linear Regression


### 01 - Problem (case study)
Familiarise yourself with Data Descriptions and the Goal.

**Goal:** We want to train a model that predicts the total claim amount for a new customer from that customer's features.

**Data Descriptions:**<br>
Unnamed: Index<br>
customer: Customer ID<br>
state: US state<br>
customer_lifetime_value: CLV, the economic value of the client to the company over their whole relationship<br>
response: Response to marketing calls (customer engagement)<br>
coverage: Customer coverage type<br>
education: Customer education level<br>
effective_to_date: Effective to date<br>
employmentstatus: Customer employment status<br>
gender: Customer gender<br>
income: Customer income<br>
location_code: Customer living zone<br>
marital_status: Customer marital status<br>
monthly_premium_auto: Monthly premium<br>
months_since_last_claim: Months since the customer's last claim<br>
months_since_policy_inception: Months since policy inception<br>
number_of_open_complaints: Number of open complaints<br>
number_of_policies: Number of policies<br>
policy_type: Policy type<br>
policy: Policy<br>
renew_offer_type: Renewal offer type<br>
sales_channel: Sales channel (customer-company first contact)<br>
total_claim_amount: Total claim amount<br>
vehicle_class: Vehicle class<br>
vehicle_size: Vehicle size<br>
vehicle_type: Vehicle type

### 02 - Getting Data

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
%matplotlib inline
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

In [5]:
data = pd.read_csv('marketing_customer_analysis.csv')
data.head()

### 03 - Cleaning/Wrangling/EDA

Rename the column headers. Check for NaN values and replace them using an appropriate method.

In [None]:
data.columns = data.columns.str.replace(' ', '_').str.lower()
data.head()

In [None]:
data.isnull().sum()

In [None]:
data.info()
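
The cells above only inspect the missing values. As a minimal sketch of one possible replacement strategy, assuming a hypothetical toy frame `toy` and a hypothetical helper `fill_missing` (neither is part of the lab data), numeric columns could receive the median and text columns the most frequent value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the lab dataset
toy = pd.DataFrame({
    "income": [100.0, np.nan, 300.0, 200.0],
    "state": ["Oregon", "Nevada", None, "Oregon"],
})

def fill_missing(df):
    """Return a copy with NaN replaced: median for numeric columns, mode otherwise."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

filled = fill_missing(toy)
print(filled.isnull().sum())
```

Whether median and mode are appropriate depends on the column; for a skewed column such as income the median is usually less affected by extreme values than the mean.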

#### Defining our dependent variable y

In [None]:
y = data['total_claim_amount']

#### Defining our X

In [None]:
X = data.drop(['total_claim_amount'], axis=1)

In [None]:
data = data.drop(['total_claim_amount'], axis=1)

#### Split categorical Features and Numerical Features

In [None]:
X_num = data.select_dtypes(include = np.number)
# np.object was removed in NumPy 1.24; the builtin object selects the same columns
X_cat = data.select_dtypes(include = object)

X_cat

In [None]:
X_num

#### Explore visually both sets of features, to identify next steps.

In [None]:
X_num.columns

In [None]:
data.hist(figsize = (15,20));
plt.show()

'customer_lifetime_value' -> skewed to the right<br>
'income' -> skewed to the right, with what look like outliers<br>
'monthly_premium_auto' -> skewed to the right<br>
'months_since_last_claim' -> skewed to the right<br>
'months_since_policy_inception' -> uniform<br>
'number_of_open_complaints' -> skewed to the right<br>
'number_of_policies' -> skewed to the right
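
The visual impression of skew can be backed by a number. The following is a minimal sketch, assuming a hypothetical helper `sample_skewness` and toy lists rather than the lab columns; it computes the mean of the cubed z-scores, which is positive for a right-skewed sample and zero for a symmetric one:

```python
import numpy as np

def sample_skewness(values):
    """Skewness as the mean of cubed z-scores (population standard deviation)."""
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

right_skewed = [1, 1, 2, 2, 3, 10]   # hypothetical sample with a long right tail
symmetric = [1, 2, 3, 4, 5]          # hypothetical symmetric sample

print(sample_skewness(right_skewed))  # positive
print(sample_skewness(symmetric))     # zero up to rounding
```

pandas also offers `DataFrame.skew()`, which uses a bias-corrected variant of this statistic, so its values differ slightly on small samples.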

In [None]:
# histplot with kde=True replaces the deprecated distplot
sns.histplot(data['customer_lifetime_value'], kde=True)
plt.show()

In [None]:
sns.boxplot(x=data['customer_lifetime_value'])

In [None]:
sns.histplot(data['income'], kde=True)
plt.show()

In [None]:
sns.histplot(data['monthly_premium_auto'], kde=True)
plt.show()

In [None]:
sns.boxplot(x=data['monthly_premium_auto'])

In [None]:
sns.histplot(data['months_since_last_claim'], kde=True)
plt.show()

In [None]:
sns.histplot(data['months_since_policy_inception'], kde=True)
plt.show()

In [None]:
sns.histplot(data['number_of_open_complaints'], kde=True)
plt.show()

In [None]:
sns.histplot(data['number_of_policies'], kde=True)
plt.show()

#### Multicollinearity
Look at potential multicollinearity using a correlation matrix or other approach.

In [None]:
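# correlate the numeric columns only; text columns cannot be correlated
correlations_matrix = X_num.corr()
# boolean mask that hides the upper triangle (including the diagonal)
mask = np.zeros_like(correlations_matrix, dtype=bool)
mask[np.triu_indices_from(mask)] = True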
fig, ax = plt.subplots(figsize=(10, 8))
ax = sns.heatmap(correlations_matrix, mask=mask, annot=True)
plt.show()

The only notable correlation is a mild one between monthly_premium_auto and customer_lifetime_value.
It is below 0.9, so we will not drop any column.
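
Reading the heatmap by eye can be complemented by listing the pairs that exceed the chosen threshold. This is a minimal sketch, assuming a hypothetical helper `correlated_pairs` and toy columns `a`, `b`, `c` (not the lab features); the 0.9 cut-off is the one used in the text above:

```python
import numpy as np

def correlated_pairs(matrix, names, threshold=0.9):
    """List (name_a, name_b, r) for column pairs whose |r| exceeds the threshold."""
    corr = np.corrcoef(np.asarray(matrix, dtype=float), rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Hypothetical toy features: b is almost a multiple of a, c is only weakly related
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = 2 * a + np.array([0.1, -0.1, 0.0, 0.1, -0.1])
c = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
toy = np.column_stack([a, b, c])

pairs = correlated_pairs(toy, ["a", "b", "c"])
print(pairs)
```

An empty list would correspond to the conclusion drawn here: no pair is correlated strongly enough to justify dropping a column.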

### 04 - Pre-Processing Data

Deal with outliers. Scaling: use the chosen scaler to bring the selected columns onto a comparable scale for the linear regression model (scaling changes the range of a column, not the shape of its distribution). Proposed: MinMax scaler on 'effective_to_date' and standard scaler on the numerical columns.

Encoding Categorical Data fields using OHE.

Bring categorical and numerical columns back together using pd.concat.

Define X and y, the y value you are seeking to predict is claim amount.

Splitting into train set and test dataset using random state, eg 80%:20%

Numeric features

Remove Outliers

In [None]:
iqr = np.percentile(data['monthly_premium_auto'],75) - np.percentile(data['monthly_premium_auto'],25)
upper_limit = np.percentile(data['monthly_premium_auto'],75) + 1.5*iqr
lower_limit = np.percentile(data['monthly_premium_auto'],25) - 1.5*iqr

In [None]:
data = data[(data['monthly_premium_auto']>lower_limit) & (data['monthly_premium_auto']<upper_limit)]

In [None]:
sns.histplot(data['monthly_premium_auto'], kde=True)
plt.show()

In [None]:
iqr = np.percentile(data['customer_lifetime_value'],75) - np.percentile(data['customer_lifetime_value'],25)
upper_limit = np.percentile(data['customer_lifetime_value'],75) + 1.5*iqr
lower_limit = np.percentile(data['customer_lifetime_value'],25) - 1.5*iqr

In [None]:
data = data[(data['customer_lifetime_value']>lower_limit) & (data['customer_lifetime_value']<upper_limit)]

In [6]:
sns.histplot(data['customer_lifetime_value'], kde=True)
plt.show()
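
Because `y`, `X_num` and `X_cat` were created before these filters run, the rows dropped from `data` are, as far as the cells shown go, still present in them. The following is a minimal sketch of keeping features and target aligned, assuming a hypothetical helper `iqr_mask` and toy arrays `premium` and `claim` (not the lab data): the same boolean mask is applied to both.

```python
import numpy as np

def iqr_mask(values, k=1.5):
    """Boolean mask that is True for values strictly inside the IQR fences."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x > q1 - k * iqr) & (x < q3 + k * iqr)

# Hypothetical toy feature and target
premium = np.array([60, 65, 70, 75, 80, 300])
claim = np.array([200, 210, 250, 260, 300, 2000])

keep = iqr_mask(premium)
premium_clean = premium[keep]
claim_clean = claim[keep]   # same mask, so the rows stay aligned

print(premium_clean, claim_clean)
```

In the notebook this would mean filtering before `y` and the feature subsets are derived, or applying the same mask to all of them.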

In [None]:
from sklearn.preprocessing import StandardScaler
transformer = StandardScaler().fit(X_num)
x_standardized = transformer.transform(X_num)
print(x_standardized.shape)
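
To make explicit what the two proposed scalers compute, here is a minimal NumPy sketch, assuming hypothetical helpers `standard_scale` and `min_max_scale` and a toy column `raw`. It mirrors the formulas behind `StandardScaler` and `MinMaxScaler` but is not their implementation:

```python
import numpy as np

def standard_scale(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    x = np.asarray(values, dtype=float)
    return (x - x.mean()) / x.std()

def min_max_scale(values):
    """Map the smallest value to 0 and the largest to 1."""
    x = np.asarray(values, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

raw = np.array([10.0, 20.0, 30.0, 40.0, 100.0])  # hypothetical right-skewed column
z = standard_scale(raw)
mm = min_max_scale(raw)

print(z.mean(), z.std())    # approximately 0 and 1
print(mm.min(), mm.max())   # 0.0 and 1.0
# Both are linear maps, so the correlation with the raw column is 1
print(np.corrcoef(raw, z)[0, 1], np.corrcoef(raw, mm)[0, 1])
```

Since both transformations are linear, a skewed column stays skewed after scaling; only its location and range change.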

In [None]:
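from sklearn.preprocessing import MinMaxScaler
# effective_to_date is stored as text, so parse it into datetime values first
data['effective_to_date'] = pd.to_datetime(data['effective_to_date'])

In [None]:
# check the parsed dates (%m is the month; %n is not a valid directive)
data['effective_to_date'].dt.strftime("%m/%d/%Y")

In [None]:
# MinMaxScaler needs numeric 2D input, so convert each date to its ordinal day number
date_ordinal = data['effective_to_date'].map(pd.Timestamp.toordinal)
data['effective_to_date'] = MinMaxScaler().fit_transform(date_ordinal.values.reshape(-1, 1)).ravel()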

Categoric features

In [None]:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(handle_unknown='error', drop='first').fit(X_cat)
encoded = encoder.transform(X_cat).toarray()
encoded
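
The effect of `drop='first'` is easier to see on a small example. The sketch below swaps in `pd.get_dummies` for `OneHotEncoder` purely for readability and uses a hypothetical toy column; with three categories, the full encoding produces three columns that always sum to 1, while the reduced encoding keeps two and treats the dropped category as the baseline.

```python
import pandas as pd

# Hypothetical toy column with three categories
toy = pd.DataFrame({"coverage": ["Basic", "Extended", "Premium", "Basic"]})

full = pd.get_dummies(toy, columns=["coverage"])
reduced = pd.get_dummies(toy, columns=["coverage"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one level avoids dummy columns that are perfectly collinear with the intercept of a linear model.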

#### Bringing the X Data back together

In [None]:
X = np.concatenate((x_standardized, encoded), axis=1)

In [None]:
X.shape

In [None]:
y.shape

#### Splitting into train set and test dataset using random state, eg 80%:20%

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# test_size=0.2 gives the 80%:20% split named above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=57)

### 05 - Modeling

#### Apply linear regression model from sklearn.linear_model.

In [None]:
lm = linear_model.LinearRegression()
model = lm.fit(X_train,y_train)

#### Fit over your train data and predict against X test

In [None]:
predictions  = lm.predict(X_test)

### 06 - Model Validation

Gather appropriate metrics to evaluate the model's accuracy on y_test, such as R2, MSE, RMSE and MAE.

In [None]:
r2 = r2_score(y_test, predictions)
print(r2)

R2 (the coefficient of determination) shows how well the regression line fits the data points.
<br><br>
It can be read as a proportion: it is the share of the variance in the target that the model explains. If the coefficient is 0.80, the model explains 80% of the variance in the observed values; it does not mean that 80% of the points lie on the regression line. A value of 1 indicates that the predictions match the observations exactly, a value of 0 indicates that the model does no better than always predicting the mean, and on test data the value can even be negative. A higher coefficient indicates a better goodness of fit for the observations.
<br><br>
As the r2 value is relatively high, we could say that the **model is a good fit to predict the total claim amount.**<br>
But let's also look at the other metrics.
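
R2 can be computed directly from its definition, $R^2 = 1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ the sum of squared deviations from the mean. The following is a minimal sketch with a hypothetical helper `r_squared` and toy values, not the lab results:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

y_true = [100, 200, 300, 400]          # hypothetical observed values
perfect = [100, 200, 300, 400]         # predictions that match exactly
mean_only = [250, 250, 250, 250]       # always predicting the mean
reversed_pred = [400, 300, 200, 100]   # worse than predicting the mean

print(r_squared(y_true, perfect))        # 1.0
print(r_squared(y_true, mean_only))      # 0.0
print(r_squared(y_true, reversed_pred))  # -3.0
```

The three cases show the reference points of the scale: exact predictions, a mean-only baseline, and a model that is worse than that baseline.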

In [None]:
from sklearn.metrics import mean_absolute_error

In [None]:
mae = mean_absolute_error(y_test, predictions)
print(mae)

The absolute error is the size of the difference between a measured (here: predicted) value and the “true” value. For example, if a scale states 90 pounds but you know your true weight is 89 pounds, then the scale has an absolute error of 90 lbs – 89 lbs = 1 lb.
<br>
<br>
The **Mean Absolute Error (MAE)** is the average of all absolute errors.
<br>
<br>
This is quite high. The **model seems not to be the right one OR the data was not pre-processed well enough.**
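
Whether an MAE counts as high depends on the scale of the target, so it can help to relate it to the mean of the observed values. This is a minimal sketch with a hypothetical helper `mean_abs_error` and toy values, not the lab results:

```python
import numpy as np

def mean_abs_error(y_true, y_pred):
    """Average of the absolute differences between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

y_true = [100, 200, 300, 400]   # hypothetical observed claim amounts
y_pred = [110, 190, 330, 380]   # hypothetical predictions

mae_value = mean_abs_error(y_true, y_pred)
relative = mae_value / float(np.mean(y_true))   # error as a share of the mean target

print(mae_value)   # 17.5
print(relative)
```

Here the predictions are off by 17.5 on average, which is about 7% of the mean observed value.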

In [None]:
mse = mean_squared_error(y_test, predictions)
print(mse)

The **mean squared error (MSE)** tells you how close a regression line is to a set of points. It takes the distances from the points to the regression line (these distances are the “errors”), squares them and averages the squares. Squaring removes negative signs and gives more weight to larger differences.
<br>
<br>
The smaller the mean squared error, the closer you are to the line of best fit. Depending on your data, it may be impossible to get a very small value for the mean squared error.
<br>
<br>
This seems **NOT to be the line of best fit**, as the value is quite high.
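
The extra weight that squaring gives to large errors shows up when two error patterns with the same MAE are compared. This is a minimal sketch with hypothetical helpers `mae` and `mse` and toy error vectors:

```python
import numpy as np

def mae(errors):
    """Mean of the absolute errors."""
    return float(np.mean(np.abs(errors)))

def mse(errors):
    """Mean of the squared errors."""
    return float(np.mean(np.square(errors)))

even_errors = np.array([10.0, 10.0, 10.0, 10.0])   # four moderate misses
one_big_error = np.array([0.0, 0.0, 0.0, 40.0])    # one large miss

print(mae(even_errors), mse(even_errors))        # 10.0 100.0
print(mae(one_big_error), mse(one_big_error))    # 10.0 400.0
```

Both patterns have the same average miss, but the single large error quadruples the MSE, which is why a few outliers can dominate this metric.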

In [None]:
import math 
rmse = math.sqrt(mse)
print(rmse)

**Root Mean Square Error (RMSE)** is the square root of the MSE; when the residuals (prediction errors) average to zero, it equals their standard deviation. Residuals measure how far the data points are from the regression line; RMSE measures how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit.
<br><br>
Lower values of RMSE indicate a better fit. RMSE is expressed in the same unit as the target, which makes it a good measure of how accurately the model predicts the response and an important criterion for fit if the main purpose of the model is prediction.
<br><br>
As the value here is also very high, I would say that again the **model is not the right fit or we need to redo the pre-processing.**
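
The relation between RMSE and the standard deviation of the residuals can be checked numerically: the squared RMSE equals the variance of the residuals plus the square of their mean (the bias). This is a minimal sketch with hypothetical toy values, not the lab results:

```python
import numpy as np

y_true = np.array([100.0, 200.0, 300.0, 400.0])   # hypothetical observed values
y_pred = np.array([110.0, 190.0, 330.0, 380.0])   # hypothetical predictions

residuals = y_true - y_pred
rmse = float(np.sqrt(np.mean(residuals ** 2)))
bias = float(residuals.mean())     # average residual
spread = float(residuals.std())    # population standard deviation of the residuals

print(rmse, spread, bias)
# rmse**2 == spread**2 + bias**2, so RMSE equals the spread only when the bias is zero
print(rmse ** 2, spread ** 2 + bias ** 2)
```

For an ordinary least squares fit with an intercept, the residuals on the training data average to zero, so the two quantities coincide there; on test data they can differ.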

**Overall** I would say the model might be a good fit for predicting the total claim amount accurately, but the pre-processing needs to be redone.

### Visualization
We will plot y_true vs. y_predicted. For an ideal model, these dots will form a line.

In [3]:
# generating value pairs for an ideal model
# which predicts the exact same y-value for a given test-y-value
line_x = line_y = np.linspace(min(y_test), max(y_test), num=len(y_test))

fig, ax = plt.subplots(figsize=(12,8))
# predictions holds the model output computed in the modeling step
plt.plot(y_test, predictions, ms=5, marker=".", ls='')

# plot the ideal model together with our dots
plt.plot(line_x, line_y, ms=0.1, marker=".", ls='-', c='r', label='ideal model')

# show legend
plt.legend();

plt.xlabel('y_test (total_claim_amount in $)');
plt.ylabel('y_predicted (total_claim_amount in $)');

In [None]:
# or easier, with sns.lmplot(). seaborn likes to be fed a dataframe
# so we create one from our y_test and predictions
df_y = pd.DataFrame({'y_test':y_test, 'y_predicted':predictions})

sns.lmplot(x='y_test',
           y='y_predicted',
           data=df_y,
           markers='.',
           line_kws={'color': 'red', 'lw':1},
           scatter_kws={'alpha':.8, 's':2},
           height=7,
           aspect=12/8,
          );
plt.xlabel('y_test (total_claim_amount in $)');
plt.ylabel('y_predicted (total_claim_amount in $)');