# lab-customer-analysis-round-4

In today's lesson we covered continuous distributions (mainly the normal distribution), linear regression, and how multicollinearity can affect a model. This lab tests those topics using the `marketing_customer_analysis.csv` file, the same data used in the previous labs (rounds 2 and 3). You can continue working in the same Jupyter notebook. The file can be found in the `files_for_lab` folder.


**Use the Jupyter notebook from the last lab (Customer Analysis Round 3)**

### 1. Check the data types of the columns. Put the numerical columns into a dataframe called `numerical` and the categorical columns into a dataframe called `categoricals`.
**Hint**: You can use `np.number` and `object` to select the numerical and categorical data types respectively (the `np.object` alias was removed in NumPy 1.24).


In [None]:
import pandas as pd
import numpy as np
import scipy.stats as stats
from scipy.stats import norm
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
from statsmodels.formula.api import ols
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

data = pd.read_csv('files_for_lab/csv_files/marketing_customer_analysis.csv')

number_col = list(data.select_dtypes(include=[np.number]).columns.values)
number_col
# Use data.select_dtypes(include=[np.number]) to view the numerical columns in full

In [None]:
object_col = list(data.select_dtypes(include=[object]).columns.values)  # np.object no longer exists in recent NumPy
object_col
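
The task asks for two dataframes, while the cells above collect only column names. A minimal sketch of the split follows, run on a small made-up frame: `demo` and its values are hypothetical stand-ins for `data`. It selects the categorical columns with `exclude=[np.number]` rather than `include=[object]`, an assumption that every non-numeric column counts as categorical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`; the values are made up for illustration.
demo = pd.DataFrame({
    "Income": [56274, 0, 48767],
    "Total Claim Amount": [384.8, 1131.5, 566.5],
    "State": ["Washington", "Arizona", "Nevada"],
})

# Check the data types first, then split by type.
print(demo.dtypes)

numerical = demo.select_dtypes(include=[np.number])
categoricals = demo.select_dtypes(exclude=[np.number])  # everything non-numeric

print(numerical.columns.tolist())
print(categoricals.columns.tolist())
```

In the lab notebook the same two `select_dtypes` calls would be applied to `data` instead of `demo`.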

### 2. Now we will try to check the normality of the numerical variables visually
  - Use the seaborn library to construct distribution plots for the numerical variables
  - Use Matplotlib to construct histograms
  - Do the distributions of the different numerical variables look like a normal distribution?


In [None]:
for column in number_col:
    plt.figure(figsize=(8,5))
    sns.histplot(data[column], kde=True, stat="density")  # sns.distplot is deprecated
    plt.show()

In [None]:
for column in number_col:
    plt.figure(figsize=(8,5))
    plt.hist(data[column], bins=50)
    plt.xlabel(column)
    plt.show()
    
# None of the variables appears normally distributed. The one that comes fairly close is total_claim_amount, although it is not really symmetrical and the frequencies of values below the mean do not rise continuously.
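
The visual check can be complemented with a number. The sketch below computes sample skewness with pandas on two synthetic columns (`demo`, `normal_like` and `right_skewed` are hypothetical names, and the data is randomly generated, not taken from the lab file). It only illustrates how a symmetric and a long-tailed distribution differ and is not a formal normality test.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical columns: one drawn from a normal distribution, one right-skewed.
demo = pd.DataFrame({
    "normal_like": rng.normal(loc=100, scale=15, size=5000),
    "right_skewed": rng.exponential(scale=100, size=5000),
})

# Sample skewness: close to 0 for a symmetric distribution,
# clearly positive when the right tail is long.
skewness = demo.skew()
print(skewness.round(2))
```

Applied to the lab data, `data[number_col].skew()` would give one value per numerical column to compare against the plots.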

### 3. For the numerical variables, check the multicollinearity between the features. Please note that we will use the column `total_claim_amount` later as the target variable. 


In [None]:
drop_col = object_col + ['Total Claim Amount']  # drop the categorical columns and the target
x = data.drop(drop_col, axis=1)
y = data['Total Claim Amount']

x = sm.add_constant(x)
model = sm.OLS(y,x).fit()

print(model.summary())

In [None]:
lm = LinearRegression()
model2 = lm.fit(x,y)
predictions = lm.predict(x)
rmse = np.sqrt(mean_squared_error(y, predictions))  # the `squared` argument is deprecated in recent scikit-learn

print("R2:", round(lm.score(x,y),2)) # or r2_score(y, predictions)
print("RMSE:", rmse)

# The condition number suggests possibly strong collinearity
# The low R**2 values suggest low predictive power of the models
# The RMSE is very high, meaning the prediction errors are large
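
The condition number is a single summary for the whole design matrix. A per-feature view is the variance inflation factor (VIF). The sketch below is an illustration on synthetic data (`demo`, `x1`, `x2`, `x3` are hypothetical), not the lab's required method. It relies on the identity that VIF_j = 1 / (1 - R_j^2) equals the j-th diagonal entry of the inverse of the predictors' correlation matrix.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)

# Hypothetical predictors: x2 is nearly a copy of x1, x3 is unrelated.
demo = pd.DataFrame({
    "x1": x1,
    "x2": x1 + 0.1 * rng.normal(size=n),
    "x3": rng.normal(size=n),
})

# The diagonal of the inverse correlation matrix holds the VIF of each predictor.
corr = demo.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=demo.columns, name="VIF")
print(vif.round(2))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic collinearity. statsmodels also provides `variance_inflation_factor` in `statsmodels.stats.outliers_influence` for the same purpose.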

### 4. Drop one of the two features in any pair that shows a high correlation (greater than 0.9). Write code for both the correlation matrix and the seaborn heatmap. If no pair of features has a high correlation, do not drop any features.

In [None]:
data_corr = data.drop(object_col, axis=1)
c_matrix = data_corr.corr()
c_matrix

In [None]:
mask = np.zeros_like(c_matrix)
mask[np.triu_indices_from(mask)] = True

fig, ax = plt.subplots(figsize=(10, 8))
ax = sns.heatmap(c_matrix, mask=mask, annot=True)
plt.show()
# No pair of features has a correlation above 0.9, so no feature is dropped
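
When a pair does exceed the threshold, the drop can be automated instead of read off the heatmap. The following is a minimal sketch on synthetic data, assuming that dropping the second column of each offending pair is acceptable (`demo`, `a`, `b`, `c` and `threshold` are hypothetical names).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)

# Hypothetical features: "b" is almost identical to "a", "c" is unrelated.
demo = pd.DataFrame({
    "a": a,
    "b": a + 0.05 * rng.normal(size=n),
    "c": rng.normal(size=n),
})

threshold = 0.9
abs_corr = demo.corr().abs()

# Keep only the strict upper triangle so each pair is counted once.
upper = abs_corr.where(np.triu(np.ones(abs_corr.shape, dtype=bool), k=1))

# For every pair above the threshold, mark the second column of the pair.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = demo.drop(columns=to_drop)

print(to_drop)
print(reduced.columns.tolist())
```

In the lab the target column should be excluded from `demo` before this step, so that it is never selected for dropping.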