<a href="https://www.kaggle.com/code/theikechukwu/price-prediction?scriptVersionId=193296296" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Linear Regression
Linear regression models are typically used to study the relationship between a single dependent variable $Y$ and one or more independent variables $X$.

They have an easy-to-interpret mathematical formula that can generate predictions. A simple linear regression model can be used when working with one independent variable, and a multiple regression model can be used when there is more than one independent variable.

We can start with a hypothesis in the form of a line, $Y=\theta_0+\theta_1X$, where $\theta_0$ (the intercept) and $\theta_1$ (the slope) are the regression coefficients.

Now how do we pick the values of ($\theta_0$) and ($\theta_1$) so that our model predictions are accurate?

We use an iterative optimization method (for example, gradient descent) to minimize the loss function, which reduces the error between the model predictions and the ground truth. We start by picking random values of ($\theta_0$) and ($\theta_1$), and keep updating the coefficients until convergence. When the loss function stops decreasing, we have reached a minimum; for a squared-error loss in linear regression, this is also the global minimum.
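
The update loop described above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes gradient descent on a mean-squared-error loss; the synthetic data, the `gradient_descent` helper, the learning rate, and the iteration count are all hypothetical choices and are not taken from the diamond analysis below.

```python
import numpy as np

# Synthetic data (hypothetical): y = 2 + 3x plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=200)
y = 2.0 + 3.0 * x + rng.normal(0, 0.05, size=200)

def gradient_descent(x, y, learning_rate=0.5, n_iters=2000):
    """Fit y ~ intercept + slope * x by minimizing the mean squared error."""
    intercept, slope = 0.0, 0.0  # fixed starting point; random values also work
    for _ in range(n_iters):
        error = (intercept + slope * x) - y
        # Gradients of mean(error**2) with respect to each coefficient
        grad_intercept = 2.0 * error.mean()
        grad_slope = 2.0 * (error * x).mean()
        intercept -= learning_rate * grad_intercept
        slope -= learning_rate * grad_slope
    return intercept, slope

intercept, slope = gradient_descent(x, y)
print(f"intercept ~ {intercept:.2f}, slope ~ {slope:.2f}")
```

The recovered values should land close to the coefficients used to generate the data. Too large a learning rate makes the updates diverge, so in practice it is tuned or replaced by a closed-form solver.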

In multiple linear regression, we use more than one independent feature ($X$) and a single dependent feature ($Y$). Instead of a vector of $m$ data entries, $X$ is now an $m \times n$ matrix, with one row per data entry and one column per feature. If we have $n$ features, the formula is as follows.

$Y=\theta_0+\theta_1X_1+\theta_2X_2+\theta_3X_3+\dots+\theta_nX_n$
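
In matrix form, the intercept can be absorbed by adding a column of ones to $X$, after which all coefficients can be estimated in one least-squares solve. The sketch below assumes synthetic data with made-up coefficients (`true_theta` is hypothetical) and uses `np.linalg.lstsq`; it illustrates the formula only and is not the fitting procedure used later in this notebook.

```python
import numpy as np

# Hypothetical data: m = 100 rows, n = 3 features, known coefficients.
rng = np.random.default_rng(1)
m, n = 100, 3
X = rng.normal(size=(m, n))
true_theta = np.array([4.0, 1.5, -2.0, 0.5])  # theta_0 (intercept), theta_1..theta_3
y = true_theta[0] + X @ true_theta[1:] + rng.normal(0, 0.01, size=m)

# Prepend a column of ones so theta_0 is handled like any other coefficient.
X_design = np.column_stack([np.ones(m), X])

# Least-squares solution of X_design @ theta ~ y
theta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(np.round(theta, 2))
```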

## Diamond price prediction 

Build a linear model that predicts the prices of diamonds based on their attributes.

## About the data

This classic dataset contains the prices and other attributes of almost 54,000 diamonds. It's a great dataset for beginners learning to work with data analysis and visualization. Here, however, I intend to build a model that predicts the price of the diamonds based on their attributes.

In [None]:
#load the necessary libraries

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

import statsmodels.api as sm
from statsmodels.formula.api import ols

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn import metrics

import warnings
warnings.filterwarnings("ignore")

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
#load the data

df = pd.read_csv('/kaggle/input/diamond-dataset/diamonds.csv')
df

## Data Cleaning and validation 
* Check for missing values
* Rename columns to more meaningful names
* Drop redundant or unnecessary columns 

In [None]:
df.info()

In [None]:
# check for missing values 
df.isna().sum()

The dataset has 53,940 observations and 11 columns, with no missing values.

In [None]:
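#drop duplicate rows
#note: this runs while the 'Unnamed: 0' index column is still present, so rows that
#differ only by that index are not treated as duplicates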
df.drop_duplicates(inplace = True)

In [None]:
#drop the unnamed column
df.drop('Unnamed: 0', axis = 1, inplace = True)

#rename columns to more meaningful names
colnames = {'x': 'length', 
           'y': 'width',
           'z': 'depth',
           'depth': 'total_depth'}
df.rename(columns = colnames, inplace = True)
df

## Exploratory analysis

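Description of the columns (original names; `x`, `y`, `z`, and `depth` were renamed above to `length`, `width`, `depth`, and `total_depth`):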
* price: price in US dollars (\$326--\$18,823)

* carat: weight of the diamond (0.2--5.01)

* cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)

* color: diamond colour, from J (worst) to D (best)

* clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

* x: length in mm (0--10.74)

* y: width in mm (0--58.9)

* z: depth in mm (0--31.8)

* depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)

* table: width of top of diamond relative to widest point (43--95)
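
The total depth formula in the list above can be checked by hand. The sketch below uses two hypothetical stones (the measurements are made up, not rows from the dataset) and the original column names `x`, `y`, and `z`. The factor of 100 is an assumption based on the listed range of 43 to 79, which suggests the value is stored as a percentage.

```python
import pandas as pd

# Hypothetical measurements in mm; not rows from the dataset.
sample = pd.DataFrame({
    "x": [4.0, 5.0],
    "y": [4.0, 5.2],
    "z": [2.4, 3.06],
})

# Total depth percentage = z / mean(x, y) = 2 * z / (x + y), scaled to percent
sample["depth_pct"] = 100 * 2 * sample["z"] / (sample["x"] + sample["y"])
print(sample)
```

In the notebook itself the same check would use the renamed columns (`length`, `width`, `depth`) and compare against `total_depth`.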

In [None]:
# Summary statistics of the data
df.describe()

In [None]:
display(pd.DataFrame(df['cut'].value_counts()))

plt.figure(figsize = (10, 8))
sns.countplot(x = df['cut'])
plt.title('Number of diamonds based on the cut')
plt.xlabel('Cut')
plt.show()

From the table and visualization above, Ideal is the most common cut in the dataset with 21,551 observations, followed by Premium and Very Good. Fair is the cut with the fewest diamonds.

In [None]:
display(pd.DataFrame(df.groupby('cut')['price'].mean()))

# The average price of diamonds based on their cut
plt.figure(figsize = (10, 8))
sns.barplot(x = df['cut'], y = df['price'])
plt.title('The average price of diamonds based on their cut')
plt.show()

There is a noticeable difference in the average price of diamonds based on their cut. Ideal has the lowest average price, which is counterintuitive: you would normally expect diamonds with an ideal cut to cost more on average than diamonds with a fair cut.

To confirm that the difference between the average prices of the categories is statistically significant, as the graph suggests, I performed a one-way ANOVA test on the groups.

In [None]:
# To confirm if there's significant difference in the mean price of diamonds based on cut

ideal_cut = np.array(df[df['cut'] == 'Ideal']['price'])
premium_cut = np.array(df[df['cut'] == 'Premium']['price'])
good_cut = np.array(df[df['cut'] == 'Good']['price'])
very_good_cut = np.array(df[df['cut'] == 'Very Good']['price'])
fair_cut = np.array(df[df['cut'] == 'Fair']['price'])

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(ideal_cut, premium_cut, good_cut, very_good_cut, fair_cut)

# Print the results
print("F-statistic:", f_statistic)
print("p-value:", p_value)

With a p-value of 8.43e-150, we can reject the null hypothesis of equal mean prices and infer that there is a significant difference in the mean price of diamonds across the cuts.
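
To make the F-statistic less of a black box, here is a minimal sketch of how a one-way ANOVA F value is computed: the variation between group means divided by the variation within groups, each scaled by its degrees of freedom. The `one_way_f` helper and the toy groups are hypothetical and only illustrate the calculation that `stats.f_oneway` performs.

```python
import numpy as np

def one_way_f(*groups):
    """Return the one-way ANOVA F-statistic for the given 1-D samples."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k, n = len(groups), all_values.size

    # Variation of the group means around the grand mean
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Toy groups (hypothetical values, not diamond prices)
f_stat = one_way_f([1, 2, 3], [2, 3, 4], [5, 6, 7])
print(f_stat)
```

A large F value means the group means are far apart relative to the spread inside each group. With tens of thousands of observations, even modest differences in means can produce a very small p-value.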

In [None]:
# count of diamonds based on their clarity
display(pd.DataFrame(df['clarity'].value_counts()))

plt.figure(figsize = (10, 8))
sns.countplot(x = df['clarity'])
plt.title('Number of diamonds based on clarity')
plt.show()

In [None]:
plt.figure(figsize = (10, 8))
sns.barplot(x = df['clarity'], y = df['price'])
plt.title('The average price of diamonds based on their clarity')
plt.show()

The visualizations above show that category SI1 has the highest number of diamonds in the dataset, and that category SI2 has the most expensive diamonds on average.

In [None]:
# count of diamonds based on their color
display(pd.DataFrame(df['color'].value_counts()))

plt.figure(figsize = (10, 8))
sns.countplot(x = df['color'])
plt.title('Number of diamonds based on color')
plt.show()

The diamonds' colors are ranked from D (best) to J (worst). The most common color rank in the dataset is G.

In [None]:
display(pd.DataFrame(df.groupby(['color', 'cut'])['price'].count()))

plt.figure(figsize = (10, 8))
sns.countplot(x = df['color'], hue = df['cut'])
plt.title('Number of diamonds based on color and cut')
plt.show()

From the visualization above, there seems to be a relationship between the color and cut variables, which might have played a role in fair-cut diamonds having a higher average price than ideal-cut diamonds, as shown earlier. A possible explanation is that fair-cut diamonds are relatively more common in some color ranks than in others.

In [None]:
display(pd.DataFrame(df.groupby(['color', 'cut'])['price'].mean()))

plt.figure(figsize = (10, 8))
sns.barplot(x = df['color'], y = df['price'])
plt.title('The average price of diamonds based on their color')
plt.show()

On average, J costs more than the rest, which is also unexpected. However, a closer look at the table further suggests there is a relationship between the two variables, cut and color.

To test whether there is an interaction effect between the two variables, I conducted a two-way ANOVA test.

In [None]:
model = ols('price ~ color + cut + color:cut', data = df).fit()
anova_table = sm.stats.anova_lm(model, typ = 2)
print(anova_table)

From the ANOVA table, at a 5% significance level, there is a statistically significant interaction between the two variables, cut and color.

In [None]:
# Visualizing the distribution of the continuous features 
# 'total_depth' is the depth percentage; 'depth' is the renamed z dimension
continuous_features = ['carat', 'total_depth', 'table', 'price', 'length', 'width', 'depth']

fig, axes = plt.subplots(nrows = 1, ncols = len(continuous_features), figsize =(15, 5))

for i, var in enumerate(continuous_features):
    sns.histplot(df[var],color = 'blue', kde = True, ax = axes[i])
    axes[i].set_title(f'Distribution of {var}')
    axes[i].set_xlabel(var)
    axes[i].set_ylabel('Frequency')

#adjust layout 
plt.tight_layout()

plt.show()

The distributions of the continuous variables show that most of them are skewed to the right, apart from the variable 'length'.

## Feature Selection

To select the important features for our model, we need to find out which features are strongly correlated with the target variable, price.

In [None]:
df['area'] = df['length'] * df['width']
df

In [None]:
sns.pairplot(df)
plt.show()

In [None]:
# Checking the correlation between the features 
correlation_matrix = df[['carat', 'total_depth', 'table', 'price', 'length', 'width', 'depth', 'area']].corr()
correlation_matrix

In [None]:
plt.figure(figsize = (10, 8))
sns.heatmap(correlation_matrix, annot = True, cmap = 'coolwarm', fmt = '.2f', linewidth = 0.5)
plt.title('Correlation heatmap')
plt.show()

From the heatmap and correlation matrix above, carat, length, width, area, and depth are strongly correlated with price, i.e. they have a roughly linear relationship with it.

## The Prediction Model

Since the target variable is continuous, linear regression is a natural first choice.

In [None]:
Y = df['price'].to_numpy(dtype = float)
Y

In [None]:
X = df[['carat','depth', 'color', 'cut', 'clarity', 'area', 'length', 'width']]
X = pd.get_dummies(X).to_numpy(dtype = float)
X

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 1)
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
#instantiate the model
lr = LinearRegression()

lr = lr.fit(X_train, Y_train)
print('Training R^2:', r2_score(Y_train, lr.predict(X_train)))

In [None]:
y_pred = lr.predict(X_test)
print('Coefficient of determination :', r2_score(Y_test, y_pred))
print('Root mean squared error:', np.sqrt(mean_squared_error(Y_test, y_pred)))

The model was able to explain 92% of the variability in the target variable. It performed well not only on the training data but also on the test data.
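
For reference, both metrics reported above follow directly from the residuals. The sketch below recomputes them by hand on four hypothetical price/prediction pairs; `r2_and_rmse` is an illustrative helper, not part of scikit-learn.

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """Compute the coefficient of determination and RMSE from scratch."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()         # unexplained variation
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # total variation
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(ss_res / y_true.size)
    return r2, rmse

# Hypothetical prices and predictions (not model output)
r2, rmse = r2_and_rmse([100, 200, 300, 400], [110, 190, 310, 390])
print(r2, rmse)
```

An $R^2$ of 0.92 therefore means the squared residuals add up to about 8% of the total squared variation of price around its mean, while the RMSE expresses the typical error in the same units as price (US dollars).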

## Ridge Regression

To see whether regularization can model the relationship between the variables better, I built a Ridge regression model, which adds an L2 penalty on the coefficients to linear regression.

In [None]:
ridge = Ridge()
parameters = {'alpha': [0.01, 0.1, 1.0, 10.0]}
grid_search = GridSearchCV(estimator = ridge, param_grid = parameters, cv = 5)
grid_search.fit(X_train, Y_train)
print('The best Alpha value:', grid_search.best_params_)

In [None]:
# refit with the best alpha found by the grid search
ridge = Ridge(alpha = grid_search.best_params_['alpha'])
ridge.fit(X_train, Y_train)
y_pred_ridge = ridge.predict(X_test)
print('Coefficient of determination :', r2_score(Y_test, y_pred_ridge))
print('Root mean squared error:', np.sqrt(mean_squared_error(Y_test, y_pred_ridge)))

Even after regularization, there was no significant improvement in the performance of the model. 
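
One way to see why the result barely changes is to look at the ridge solution directly: the penalty adds `alpha` to the diagonal of $X^TX$, whose entries grow with the number of rows. With standardized features and tens of thousands of training rows, an `alpha` of 10 or less is likely to be negligible. The sketch below is a NumPy illustration on hypothetical synthetic data; `ridge_closed_form` is an assumed helper using the centered-data formulation with an unpenalized intercept, not scikit-learn's implementation.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Ridge coefficients with an unpenalized intercept (centered-data form)."""
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean
    n_features = X.shape[1]
    # Solve (Xc^T Xc + alpha * I) coef = Xc^T yc
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    intercept = y_mean - X_mean @ coef
    return intercept, coef

# Hypothetical data: 200 rows, 3 features
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=200)

for alpha in [0.0, 10.0, 1000.0]:
    _, coef = ridge_closed_form(X, y, alpha)
    print(alpha, np.round(coef, 3), round(float(np.linalg.norm(coef)), 3))
```

The coefficient norm shrinks as `alpha` grows, but the shrinkage only becomes substantial once `alpha` is comparable to the number of rows, which suggests trying a wider grid of values if regularization is of interest.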

## Neural Network Model

In [None]:
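output_size=1                       # one regression output: price
hidden_layer=3                      # number of units in the single hidden layer
input_size=1                        # not used below; Keras infers the input shape from the data
learning_rate=0.01
loss_function='mean_squared_error'
epochs=50
batch_size=10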

In [None]:
model = keras.Sequential()
model.add(keras.layers.Dense(hidden_layer, activation = 'relu'))
model.add(keras.layers.Dense(output_size))
model.compile(keras.optimizers.Adam(learning_rate = learning_rate), loss_function)

In [None]:
history = model.fit(X_train, Y_train, epochs = epochs, batch_size = batch_size,
                    verbose = False, validation_split = 0.3)

In [None]:
def plot_loss(history):
    plt.plot(history.history['loss'], label='loss')
    plt.plot(history.history['val_loss'], label='val_loss')
    plt.legend()
    plt.grid(True)

In [None]:
plot_loss(history)

In [None]:
y_pred = model.predict(X_test)
print('Coefficient of determination: ', r2_score(Y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(Y_test, y_pred)))

The NN model performed noticeably better than both the linear regression and Ridge regression models, with an RMSE of 685 and a coefficient of determination of 0.97, meaning the NN model was able to explain 97% of the variability in the test data.

In [None]:
model = keras.Sequential()
model.add(keras.layers.Dense(10, activation = 'relu'))
model.add(keras.layers.Dense(output_size))
model.compile(keras.optimizers.Adam(learning_rate = learning_rate), loss_function)

In [None]:
history = model.fit(X_train, Y_train, epochs = epochs, batch_size = batch_size,
                    verbose = False, validation_split = 0.3)

In [None]:
plot_loss(history)
y_pred = model.predict(X_test)
print('Coefficient of determination: ', r2_score(Y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(Y_test, y_pred)))

Increasing the number of units in the hidden layer from 3 to 10 slightly improved the model's performance.
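
Both networks above have a single hidden layer; the value that changes from 3 to 10 is the number of units in that layer. A quick way to compare their capacity is to count trainable parameters. The sketch below does this by hand with a hypothetical helper, `dense_param_count`, and assumes 25 input columns after one-hot encoding; the actual width of `X` should be checked with `X.shape[1]`.

```python
def dense_param_count(n_inputs, hidden_units, n_outputs=1):
    """Trainable parameters in Dense(hidden_units) -> Dense(n_outputs)."""
    hidden = n_inputs * hidden_units + hidden_units  # weights + biases
    output = hidden_units * n_outputs + n_outputs    # weights + biases
    return hidden + output

# n_inputs = 25 is an assumption about the width of X after one-hot encoding
small = dense_param_count(n_inputs=25, hidden_units=3)
large = dense_param_count(n_inputs=25, hidden_units=10)
print(small, large)
```

Under that assumption the wider network has roughly three times as many parameters, which is still small compared with the number of training rows.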