# Multiple Regression

In this notebook we use a different data set to practice what we have learned so far:
* Upload and preprocess the data
* Write a function to compute the Multiple Regression weights
* Write a function to make predictions of the output given the input features
* Compare different models for predicting the output

Look at the Multiple Regression notebook on insurance data!

# Import all required libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
%matplotlib inline

# Upload, preprocess and check the data

The insurance dataset (insurance.csv) was obtained from the Machine Learning course website (Spring 2017) of Professor Eric Suess at http://www.sci.csueastbay.edu/~esuess/stat6620/#week-6.

In [None]:
# Import dataset
data = pd.read_csv('insurance.csv')

# Look at the table to check potential features
data[:10]

In [None]:
# Check if the dataset contains NaN values
data.isnull().any()

In [None]:
# Check some statistics of the data
data.describe()

In [None]:
# Plot a scatter matrix for the relevant variables
from pandas.plotting import scatter_matrix

plot_data_new = data[['expenses', 'age', 'bmi', 'children']]
sm = scatter_matrix(plot_data_new, figsize=(10, 10))

In [None]:
# Plot expenses against age
data.plot(x='age', y='expenses', style='o')
plt.title('expenses vs. age')
plt.xlabel('age')
plt.ylabel('expenses')
plt.show()

In [None]:
# Plot expenses against number of children
data.plot(x='children', y='expenses', style='o')
plt.title('expenses vs. children')
plt.xlabel('children')
plt.ylabel('expenses')
plt.show()

In [None]:
# Plot expenses against bmi
data.plot(x='bmi', y='expenses', style='o')
plt.title('expenses vs. bmi')
plt.xlabel('bmi')
plt.ylabel('expenses')
plt.show()

In [None]:
# Separate the data into input features (X) and the target (y)
X = data[['bmi','children', 'age']]
y = data['expenses']

# Split data into training and testing

In [None]:
# Split the data set into 80% training and 20% test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Look at the shape to check the split ratio
X_train.shape, X_test.shape, y_train.shape, y_test.shape

# Use a pre-built multiple regression function

In [None]:
# Train scikit-learn's built-in LinearRegression model
reg = LinearRegression()
reg.fit(X_train, y_train)

In [None]:
X.columns

In [None]:
# Show the intercept and coefficients fitted by the model
print('Intercept:', reg.intercept_)
coeff_df = pd.DataFrame({'Features': X.columns, 'Coefficients': reg.coef_}).set_index('Features')
coeff_df
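
The intercept and coefficients above come from scikit-learn. As a minimal sketch of how such weights can be computed by hand, the function below solves the ordinary least-squares problem with `numpy.linalg.lstsq`. The name `fit_linear_weights` and the toy data are hypothetical and are not part of the insurance data.

```python
import numpy as np

def fit_linear_weights(X, y):
    """Return (intercept, coefficients) that minimize the squared error."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first weight acts as the intercept
    X_design = np.column_stack([np.ones(len(X)), X])
    # lstsq minimizes ||X_design @ w - y||^2 without explicitly inverting X^T X
    w, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return w[0], w[1:]

# Toy data (made up for illustration): y = 1 + 2*x1 + 3*x2 exactly
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y_toy = 1.0 + 2.0 * X_toy[:, 0] + 3.0 * X_toy[:, 1]

intercept, coefs = fit_linear_weights(X_toy, y_toy)
print('Intercept:', intercept)
print('Coefficients:', coefs)
```

Calling `fit_linear_weights(X_train, y_train)` on the notebook's data should agree with `reg.intercept_` and `reg.coef_` up to floating-point error.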

In [None]:
# Make predictions on the test data
y_pred = reg.predict(X_test)

In [None]:
# Check differences between actual and predicted values
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred, 'Difference': y_pred - y_test},
                   columns=['Actual', 'Predicted', 'Difference']).astype(int)
df.head()

In [None]:
# Evaluate the performance
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
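
To make the three error measures above concrete, here is a small sketch that computes them directly from the residuals with NumPy. The helper name `regression_errors` and the toy values are made up for illustration; on the notebook's data the results should match the `metrics` calls above.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) computed from the residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_pred - y_true
    mae = np.mean(np.abs(residuals))   # average absolute deviation
    mse = np.mean(residuals ** 2)      # average squared deviation
    rmse = np.sqrt(mse)                # same unit as the target
    return mae, mse, rmse

# Toy values (made up): residuals are -1, 0, 2, 0
y_true_toy = [3.0, 5.0, 7.0, 9.0]
y_pred_toy = [2.0, 5.0, 9.0, 9.0]

mae, mse, rmse = regression_errors(y_true_toy, y_pred_toy)
print('Mean Absolute Error:', mae)   # 0.75
print('Mean Squared Error:', mse)    # 1.25
print('Root Mean Squared Error:', rmse)
```

Because squaring weights large residuals more heavily, MSE and RMSE react more strongly to a few large misses than MAE does.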

# Check for overfitting

In [None]:
# If R^2 and RMSE differ dramatically between train and test => overfitting!
from sklearn.metrics import r2_score

# "Prework" needed to do the comparison
y_pred_train = reg.predict(X_train)
y_pred_test = reg.predict(X_test)

# => Compare R^2 and RMSE on train and test!
print('RMSE Train:', np.sqrt(metrics.mean_squared_error(y_train, y_pred_train)))
print('RMSE Test: ', np.sqrt(metrics.mean_squared_error(y_test, y_pred_test)))
print('R^2 Train:', r2_score(y_train, y_pred_train))
print('R^2 Test: ', r2_score(y_test, y_pred_test))

# Additional task: Which combination of input features gives the lowest MSE?
* (1) bmi + age + children
* (2) bmi + age
* (3) bmi + children
* (4) age + children
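
One possible way to approach this task is to fit one `LinearRegression` per feature combination and compare the test MSE. The sketch below assumes a DataFrame with the columns `bmi`, `age`, `children` and `expenses`; the helper `compare_feature_sets` is hypothetical, and the synthetic `toy` frame only stands in for the insurance data so that the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def compare_feature_sets(data, feature_sets, target, test_size=0.2, random_state=0):
    """Fit one model per feature set and return the test MSEs, lowest first."""
    # Split once so every feature set is evaluated on the same test rows
    train, test = train_test_split(data, test_size=test_size, random_state=random_state)
    results = []
    for features in feature_sets:
        cols = list(features)
        model = LinearRegression().fit(train[cols], train[target])
        mse = mean_squared_error(test[target], model.predict(test[cols]))
        results.append({'features': ' + '.join(cols), 'test_mse': mse})
    return pd.DataFrame(results).sort_values('test_mse').reset_index(drop=True)

feature_sets = [
    ['bmi', 'age', 'children'],
    ['bmi', 'age'],
    ['bmi', 'children'],
    ['age', 'children'],
]

# Synthetic stand-in for the insurance data (values are made up)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'bmi': rng.normal(30, 5, n),
    'age': rng.integers(18, 65, n),
    'children': rng.integers(0, 5, n),
})
toy['expenses'] = 100 * toy['age'] + 50 * toy['bmi'] + rng.normal(0, 100, n)

result = compare_feature_sets(toy, feature_sets, target='expenses')
print(result)
```

With the notebook's data, the call would be `compare_feature_sets(data, feature_sets, target='expenses')`.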

# For more information on performance evaluation see also
* https://en.wikipedia.org/wiki/Mean_absolute_error
* https://en.wikipedia.org/wiki/Mean_squared_error
* https://en.wikipedia.org/wiki/Root-mean-square_deviation