<a href="https://colab.research.google.com/github/HuyenNguyenHelen/INFO-5505---Machine-learning/blob/main/HuyenNguyen_Assignment_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 1:  Linear Regression
Dataset: monet.csv

Dependent variable: PRICE

In [31]:
# Import primary libraries
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np


## Load the dataset

In [32]:
# Open and load dataset
data = pd.read_csv('/content/monet.csv')
print('data shape: ', data.shape)
data.head(5)



## Exploratory Analysis

### Explore missing values

In [None]:
# Investigate missing values
data.isnull().sum()

The output shows that there are no missing values, so no imputation step is needed.

### Descriptive analysis
A descriptive analysis gives us a sense of how the data vary for each variable in the dataset.

In [None]:
data.describe()

In [None]:
# Plot histograms for each variable to see how they vary
histograms = data.hist(grid=False, figsize=(10, 10))

The histograms show that the values of WIDTH and SALE are not normally distributed. We may consider a transformation for these variables.

### Explore the distribution of the dependent variable - PRICE

In [None]:
# Explore the distribution of the dependent variable - PRICE
data['PRICE'].describe()

In [None]:
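# distplot is deprecated in recent seaborn releases; histplot with kde=True draws the same kind of plot
sns.histplot(data['PRICE'], bins = 30, kde = True)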

The shape of the distribution shows that most of the density falls within the first three bins. The values from about bin 15 onward may be outliers.

### Create a new variable
For the simple LR and multivariate LR models that we are going to build, we can create a new variable, SIZE, that combines HEIGHT and WIDTH into the area of each picture:
SIZE = HEIGHT * WIDTH


In [None]:
# Create a new variable by combining HEIGHT AND WIDTH
data['SIZE'] = data['HEIGHT'] * data['WIDTH']
data.head(5)

### Select independent variables
To select potential predictors for the LR models, we can look at how strongly each variable is correlated with the target variable. We can visualize the correlations in a heatmap or a scatter plot as follows.

In [None]:
# Plot a heatmap with correlation score
plt.subplots(figsize = (12,8))
sns.heatmap(data.corr(), annot=True)    # get correlation score matrix

The correlation score ranges from -1 to 1. A score close to -1 indicates a strong negative correlation, whereas a score close to 1 indicates a strong positive correlation between two variables. A score close to 0 means the two variables are not linearly correlated.
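
As a small illustration of how the score behaves at its extremes, the sketch below computes the Pearson correlation by hand. The lists are made-up values (not taken from monet.csv) and `pearson_r` is a hypothetical helper name; `data.corr()` computes the same quantity for every pair of columns.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data, assumed for illustration only
x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2, 4, 6, 8, 10])   # y rises exactly in step with x
r_neg = pearson_r(x, [10, 8, 6, 4, 2])   # y falls exactly in step with x
r_zero = pearson_r(x, [2, 1, 0, 1, 2])   # V-shaped pattern with no linear trend

print(r_pos, r_neg, r_zero)
```

The third case is worth noting: y clearly depends on x, yet the score is essentially zero, because the coefficient only measures linear association.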

From the heatmap, no variable appears to be highly correlated with the dependent variable, PRICE. WIDTH and SIZE are the most correlated, with the same score (0.35), so either of them could be a potential predictor for the models. We can use scatter plots to see their correlations more clearly. Note that HEIGHT and WIDTH are necessarily collinear with SIZE, since SIZE was created from HEIGHT and WIDTH. Therefore, we will not input HEIGHT or WIDTH together with SIZE into the training model.

In [None]:
# Plot SIZE and PRICE 
sns.lmplot(x= 'SIZE', y = 'PRICE', data = data, ci = None)

In [None]:
# Plot WIDTH and PRICE
sns.lmplot(x= 'WIDTH', y = 'PRICE', data = data, ci = None)


In [None]:
# Plot HEIGHT and PRICE
sns.lmplot(x= 'HEIGHT', y = 'PRICE', data = data, ci = None)

PRICE appears to increase with SIZE, WIDTH, and HEIGHT; however, the points are widely scattered around the fitted lines, so none of the linear fits is tight.

## Linear Regression Models



X: the independent variables, also known as predictors or features

Y: the dependent variable, also known as the target variable

### Univariate LR Model

A univariate LR (or simple LR) model takes only one input variable as its single predictor. It fits a line to the data of the form:

y = ax + b

where a is the coefficient (slope) and b is the intercept.
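
For a single predictor, the least-squares values of a and b have a closed form: a is the co-variation of x and y divided by the variation of x, and b is mean(y) - a * mean(x). The sketch below applies that formula to made-up numbers (not the Monet data). `fit_line` is a hypothetical helper; the models in this notebook are fitted with scikit-learn's `LinearRegression` instead.

```python
def fit_line(xs, ys):
    """Return the least-squares slope a and intercept b for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    a = s_xy / s_xx
    b = mean_y - a * mean_x
    return a, b

# Toy data lying exactly on y = 2x + 1 (assumed for illustration)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a, b = fit_line(xs, ys)
print('slope:', a, 'intercept:', b)
```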

#### Model 1
Predictor/Independent variable (X): SIZE

Dependent variable (Y): PRICE

In [None]:
# Split dataset for training (80%) and testing (20%)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data[['SIZE']], data['PRICE'], train_size = 0.8)

print ('Shapes of X_train, y_train: ', X_train.shape, y_train.shape)
print ('Shapes of X_test, y_test: ', X_test.shape, y_test.shape)


In [None]:
# Build a LR model
from sklearn.linear_model import LinearRegression
slr = LinearRegression()

# Fit the model into the training data
slr.fit (X_train, y_train)

In [None]:
# Apply the model to predict y in the test set
y_test_pred = slr.predict (X_test) 

# Apply the model to predict y in the train set
y_train_pred = slr.predict(X_train)

In [None]:
# Print coefficient and intercept of the model
print ('Intercept of the model 1: ', slr.intercept_)
print ('Coefficient of the model 1: ', slr.coef_)

**Model Evaluation**

To evaluate the performance of an LR model, we can use Mean Squared Error (MSE) as the cost function.
MSE measures how far the model's predictions are from the actual values.

In general, we try to minimize the MSE cost function.
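
Concretely, MSE is the average of the squared differences between the actual and the predicted values. A minimal sketch with made-up numbers follows; `manual_mse` is a hypothetical helper that mirrors what `sklearn.metrics.mean_squared_error` computes for one-dimensional inputs.

```python
def manual_mse(y_true, y_pred):
    """Average of the squared prediction errors."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)

# Toy values, assumed for illustration only
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

# Errors are -1, 0 and 2, so the squared errors are 1, 0 and 4
error = manual_mse(actual, predicted)
print('MSE: {:.2f}'.format(error))
```

Because the errors are squared, MSE is expressed in squared units of the target, and a few large misses can dominate the score.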


In [None]:
# Evaluate the model performance in the training set
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error (y_train, y_train_pred)
print('Model 1 - Evaluation on MSE:')
print ('-'*30)
print('MSE in the training set: {:.2f}'.format(mse))


# Evaluate the model performance in the testing set
mse = mean_squared_error (y_test, y_test_pred)
print('\nMSE in the test set: {:.2f}'.format(mse))

The first model fits the data, but its MSE on the training set is still high. We can fine-tune this model with the goal of reducing its MSE. Looking back at the distribution of the dependent variable, we can see that it is substantially positively skewed. In this case, we can try transforming the target variable by applying a logarithmic function to it before training.

The plots below show how the logarithmic transformation makes the dependent variable less skewed:

In [None]:
# Plot the original and transformed dependent variable

# original dependent variable
y = data['PRICE'] 

# apply logarithmic function to transform the dependent variable 
y_trans = np.log(y.values.reshape(-1,1))       

f, (ax0, ax1) = plt.subplots(1, 2)

# Plot the original dependent variable
ax0.hist(y, bins=100)
ax0.set_ylabel('Count')
ax0.set_xlabel('Target values')
ax0.set_title('Target distribution')

# Plot the transformed dependent variable
ax1.hist(y_trans, bins=100)
ax1.set_ylabel('Count')
ax1.set_xlabel('Target values')
ax1.set_title('Transformed target distribution')
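
The effect of the transformation can also be checked numerically. The sketch below computes a moment-based skewness before and after taking logs of a small made-up list of right-skewed values (not the actual PRICE column). `skewness` is a hypothetical helper written out for transparency.

```python
import math

def skewness(values):
    """Moment-based skewness: mean cubed deviation over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up right-skewed values: many small ones and a few very large ones
prices = [1, 2, 2, 3, 3, 4, 5, 8, 20, 60]
log_prices = [math.log(p) for p in prices]

skew_raw = skewness(prices)
skew_log = skewness(log_prices)
print('skewness before log: {:.2f}'.format(skew_raw))
print('skewness after log:  {:.2f}'.format(skew_log))
```

On this toy list the skewness stays positive after the transform but is clearly smaller, which is the same qualitative effect the histograms show.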

Similarly, we can try log-transforming the independent variable SIZE, since it is also positively skewed.

In [None]:
# Plot the original and transformed independent variable

# original independent variable
x = data['SIZE']

# apply logarithmic function to transform the independent variable 
x_trans = np.log(x.values.reshape(-1,1))      

f, (ax0, ax1) = plt.subplots(1, 2)

# Plot the original independent variable
ax0.hist(x, bins=100)
ax0.set_ylabel('Count')
ax0.set_xlabel('Predictor values')
ax0.set_title('Predictor distribution')

# Plot the transformed independent variable
ax1.hist(x_trans, bins=100)
ax1.set_ylabel('Count')
ax1.set_xlabel('Predictor values')
ax1.set_title('Transformed predictor distribution')

In [None]:
# Plot a scatter plot of the transformed SIZE and the transformed PRICE
sns.regplot(x=x_trans, y=y_trans, ci = None)

The scatter plot above suggests that, after both transformations, the relationship between the two variables is closer to linear.

#### Model 2
This model refines the previous one by log-transforming both its independent and dependent variables.

X : log(SIZE)

Y: log(PRICE)

In [None]:
# Split data for training (80%) and testing (20%)
X_train_trans, X_test_trans, y_train_trans, y_test_trans = train_test_split(x_trans, y_trans, train_size = 0.8)

# Print the shapes of the transformed splits (not those of Model 1)
print ('Shape of X_train and y_train: ', X_train_trans.shape, y_train_trans.shape)
print ('Shape of X_test and y_test: ', X_test_trans.shape, y_test_trans.shape)

# Build a LR model
from sklearn.linear_model import LinearRegression
slr_trans = LinearRegression()

# Fit the model into the training data
slr_trans.fit (X_train_trans, y_train_trans)

# Apply the model to predict y in the test set
y_test_trans_pred = slr_trans.predict (X_test_trans) 

# Apply the model to predict y in the train set
y_train_trans_pred = slr_trans.predict(X_train_trans)

# Evaluate the model performance in the training set
from sklearn.metrics import mean_squared_error
mse_train = mean_squared_error (y_train_trans, y_train_trans_pred)

print ('\nModel 2 - evaluation on MSE: ')
print ('-'*30)
print('MSE in the training set: {:.2f}'.format(mse_train))

# Evaluate the model performance in the testing set
mse_test = mean_squared_error (y_test_trans, y_test_trans_pred)
print('\nMSE in the test set: {:.2f}'.format(mse_test))


In [None]:
# Print intercept and coefficient of the model
print ('Intercept of the model 2: ', slr_trans.intercept_)
print('Coefficient of the model 2: ', slr_trans.coef_)

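***Conclusion: Of the two simple LR models above, model 2, which uses the transformed SIZE and the transformed PRICE as its independent and dependent variables, achieved the lower MSE. Note that the MSE of model 2 is measured on the log scale, whereas the MSE of model 1 is in squared PRICE units, so the two values are not directly comparable unless the predictions are converted back to the original scale.***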
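
One way to put a raw-scale model and a log-scale model on the same footing is to exponentiate the log-scale predictions and compute the MSE in the original units. The sketch below does this on synthetic data that only stands in for SIZE and PRICE (the generating formula, the seed and all variable names are assumptions), fits both lines with `np.polyfit`, and evaluates in-sample for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for SIZE and PRICE (assumed data, not monet.csv):
# price follows a power of size with multiplicative noise.
size = rng.uniform(100, 2000, 200)
price = 0.01 * size ** 0.9 * np.exp(rng.normal(0, 0.5, 200))

# Model A: straight line on the raw scale
a_raw, b_raw = np.polyfit(size, price, 1)
pred_raw = a_raw * size + b_raw

# Model B: straight line on the log-log scale
a_log, b_log = np.polyfit(np.log(size), np.log(price), 1)
pred_log = a_log * np.log(size) + b_log

mse_raw = np.mean((price - pred_raw) ** 2)                 # squared price units
mse_log_scale = np.mean((np.log(price) - pred_log) ** 2)   # squared log units
mse_back = np.mean((price - np.exp(pred_log)) ** 2)        # squared price units again

print('Model A, raw scale:        {:.4f}'.format(mse_raw))
print('Model B, log scale:        {:.4f}'.format(mse_log_scale))
print('Model B, back-transformed: {:.4f}'.format(mse_back))
```

Only the first and third numbers share the same units, so they are the pair to compare when deciding which model predicts the original target better.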

### Multivariate LR Model
A multivariate LR model has more than one predictor.
For our dataset, the likely predictors are SIZE, HOUSE, and SIGNED. As mentioned previously, HEIGHT and WIDTH are collinear with SIZE, and PICTURE contains unique values. Therefore, they will not be used in the model.

The SIGNED and HOUSE variables contain discrete values:

In [None]:
# Explore the two discrete variables, SIGNED and HOUSE
var_names = [ 'SIGNED', 'HOUSE ']
for name in var_names: 
  print(name,'\n','-'*25)
  print(data[name].value_counts(),'\n')
  

SIGNED is a binary variable, but HOUSE has three classes. We can still use discrete variables in LR, but we need to dummy-code them before inputting them into the LR model.

In [None]:
# Create dummy columns of the HOUSE variable
dummies = pd.get_dummies(data['HOUSE '], prefix='HOUSE')

# Join the dummy columns with the dataset
data=data.join(dummies)
display(data)
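
A side note on dummy coding: when every category gets its own column, the dummy columns sum to 1 in every row, which duplicates the information carried by the model's intercept. A common alternative is to drop one category and treat it as the baseline. The sketch below shows both encodings on a toy column (made-up values, and a column name without the trailing space used above); it is an optional variant, not the encoding used in this notebook.

```python
import pandas as pd

# Toy column with three classes (assumed values, not monet.csv)
toy = pd.DataFrame({'HOUSE': [1, 2, 3, 1, 2]})

# Full encoding: one column per class
full = pd.get_dummies(toy['HOUSE'], prefix='HOUSE')

# Reduced encoding: the first class is dropped and acts as the baseline
reduced = pd.get_dummies(toy['HOUSE'], prefix='HOUSE', drop_first=True)

# In the full encoding every row sums to 1, mirroring the intercept column
row_sums = full.astype(int).sum(axis=1)

print(list(full.columns))
print(list(reduced.columns))
print(row_sums.tolist())
```

With the reduced encoding, each remaining coefficient reads as the difference from the baseline class.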

As mentioned previously, since the independent variable SIZE and the dependent variable PRICE are positively skewed, we will use their log-transformed values.

In [None]:
# Prepare data

# add the transformed SIZE variable into the dataset
data['SIZE_log']=x_trans

X = data[['SIZE_log', 'SIGNED', 'HOUSE_1', 'HOUSE_2', 'HOUSE_3']]
y = y_trans
print('X shape and y shape: ', X.shape, y.shape)

In [None]:
# Split data for training and test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8 )

print ('Shape of X_train and y_train: ', X_train.shape, y_train.shape)
print ('Shape of X_test and y_test: ', X_test.shape, y_test.shape)

In [None]:
# Build a multivariate LR model
mlr=LinearRegression()

# Fit the built model into training set
mlr.fit(X_train, y_train)

In [None]:
# Apply the model to predict the PRICE in the test set
y_test_pred = mlr.predict(X_test)

# Apply the model to predict the PRICE in the training set
y_train_pred = mlr.predict(X_train)

In [None]:
# Evaluate the model performance in the training set
from sklearn.metrics import mean_squared_error
mse_train = mean_squared_error (y_train, y_train_pred)

print ('\nMultivariate LR model - Evaluation on MSE: ')
print ('-'*40)
print('MSE in the training set: {:.2f}'.format(mse_train))

# Evaluate the model performance in the testing set
mse_test = mean_squared_error (y_test, y_test_pred)
print('\nMSE in the test set: {:.2f}'.format(mse_test))


The model achieved low MSE scores, keeping in mind that they are measured on the log scale of PRICE. However, the model achieved a lower (better) MSE on the test set than on the training set. This might raise the question of underfitting, but with a single random split a lower test MSE can also occur by chance. Overall, the low MSE suggests that the model performs well.
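
To get a feel for how much a single random split can matter, the sketch below repeats an 80/20 split many times on synthetic linear data and counts how often the test MSE comes out lower than the training MSE. The data, the seed and the number of repetitions are all assumptions chosen for illustration; nothing here uses monet.csv.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a known line plus noise (assumed for illustration)
n = 100
x = rng.uniform(0, 10, n)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, n)

n_splits = 200
test_lower = 0
for _ in range(n_splits):
    # Shuffle the row indices and take 80% for training, 20% for testing
    idx = rng.permutation(n)
    train, test = idx[:80], idx[80:]

    # Fit a line on the training rows only
    slope, intercept = np.polyfit(x[train], y[train], 1)

    mse_train = np.mean((y[train] - (slope * x[train] + intercept)) ** 2)
    mse_test = np.mean((y[test] - (slope * x[test] + intercept)) ** 2)
    if mse_test < mse_train:
        test_lower += 1

share = test_lower / n_splits
print('Test MSE was lower than training MSE in {:.0%} of the splits'.format(share))
```

Even though the model form is correct for this data, some splits give a lower test MSE purely by chance. Averaging over several splits (or using cross-validation) gives a steadier estimate than any single split.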

In [None]:
# Print coefficients and intercepts of the model
print ('Intercept of the multivariate model: ', mlr.intercept_)
print('Coefficient of the multivariate model: ', mlr.coef_)