# Hands On Project

A car price prediction model based on vehicle features.

We will use this dataset from Kaggle:
https://www.kaggle.com/datasets/CooperUnion/cardataset

<br>
Here is the plan:
- Prepare data and do EDA
- Use linear regression
- Understand the internals of linear regression
- Evaluate with RMSE
- Feature engineering
- Regularization
- Use the model


## 1: Data Prep



In [None]:
#imports
import pandas as pd
import numpy as np

In [None]:
# To load data from a CSV file, use pd.read_csv()


df = pd.read_csv('data/car-price.csv')
df.head()

Notice the inconsistencies in the column names: some are capitalized, and some use underscores while others contain spaces. Let's make them all lowercase and replace the spaces with underscores.

In [None]:
df.columns

In [None]:
df.columns = df.columns.str.lower().str.replace(' ','_')
df.head()

Now we normalize the values in the same way. First we need to find the string columns.

In [None]:
# get all types
df.dtypes

# we only care about the objects

In [None]:
strings = list(df.dtypes[df.dtypes == 'object'].index)
# we take the index (the column names), since the values are all 'object'
strings

In [None]:
# loop over to clean them up
for col in strings:
    df[col] = df[col].str.lower().str.replace(' ','_')
    
df.head()

# Exploratory Data Analysis


In [None]:
for col in df.columns:
    print(col)
    print(df[col].unique()) 
    print(df[col].nunique()) # see how many unique values there actually are per column
    print()

In [None]:
#now we visualize it

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline


In [None]:
# let's look at the price distribution
# bins is the number of bars in the histogram
sns.histplot(df.msrp[df.msrp < 100000], bins = 50)

# we can see most cars are cheap, but a few cost up to millions.
# this is a long-tail distribution; filtering to msrp < 100k hides the long tail in the plot

In [None]:
# let's get rid of this long tail with a log transformation.

np.log([1, 10, 1000, 100000]) 
# you can see it compresses the high values much more than the low ones
# we cannot take the log of 0
# so it is common to add one to each value before taking the log; np.log1p does this in one step

np.log1p([1, 10, 1000, 100000]) 

# use this on the prices

price_logs = np.log1p(df.msrp)
sns.histplot(price_logs, bins = 50) # notice how the tail is gone; the shape is now much closer to a normal distribution
# models tend to do better when the target is roughly normally distributed
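
To make the difference between `log` and `log1p` concrete, here is a minimal sketch. The prices are made up for illustration (they are not taken from the dataset) and include a zero on purpose.

```python
import numpy as np

# Hypothetical prices for illustration only, including a zero
prices = np.array([0, 1, 10, 1000, 100000])

# np.log(0) evaluates to -inf (NumPy also emits a warning, silenced here)
with np.errstate(divide="ignore"):
    plain_log = np.log(prices)

# np.log1p(x) computes log(1 + x), so zero maps to zero
shifted_log = np.log1p(prices)

print(plain_log[0])    # -inf
print(shifted_log[0])  # 0.0
```

A `-inf` target would break training, which is why the shifted version is the safer default when zeros are possible.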

In [None]:
# checking for missing values

df.isnull().sum()

# Setting up the validation framework

As mentioned before, we need to use some of the data for training, some for validation, and some for testing.

Let's do a 60/20/20 split.



In [None]:
n = len(df)

n_val = int(n*0.2)
n_test = int(n*0.2)
n_train = n - n_val - n_test
# computing n_train as the remainder avoids rounding errors, so every row is used

In [58]:
# now let's split the df (not shuffled yet); we can use iloc


df_train = df.iloc[:n_train]
df_val = df.iloc[n_train:n_train + n_val]
df_test = df.iloc[n_train+ n_val:]


In [59]:
# the rows need to be shuffled so that each split gets a similar mix of records
# here is how we do it

idx = np.arange(n)
np.random.seed(2) # so that it is reproducible
np.random.shuffle(idx)


df_train = df.iloc[idx[:n_train]]
df_val = df.iloc[idx[n_train:n_train + n_val]]
df_test = df.iloc[idx[n_train+ n_val:]]

# now we select rows through the shuffled index instead of slicing with iloc directly


# check to make sure the numbers are right
len(df_train), len(df_val), len(df_test)

(7150, 2382, 2382)
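
The shuffle-and-split steps can also be wrapped in one helper. The sketch below is only an illustration: `shuffle_split` is a hypothetical name, the toy DataFrame stands in for the car data, and it uses `np.random.default_rng` instead of `np.random.seed`, so the row order will differ from the cells above even with the same seed.

```python
import numpy as np
import pandas as pd


def shuffle_split(df, val_frac=0.2, test_frac=0.2, seed=2):
    """Shuffle rows, then split into train/validation/test DataFrames."""
    n = len(df)
    n_val = int(n * val_frac)
    n_test = int(n * test_frac)
    n_train = n - n_val - n_test  # the remainder goes to training

    # A local generator keeps the shuffle reproducible without touching global state
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)

    df_train = df.iloc[idx[:n_train]].reset_index(drop=True)
    df_val = df.iloc[idx[n_train:n_train + n_val]].reset_index(drop=True)
    df_test = df.iloc[idx[n_train + n_val:]].reset_index(drop=True)
    return df_train, df_val, df_test


# Toy data standing in for the car dataset
toy = pd.DataFrame({"msrp": np.arange(11)})
train, val, test = shuffle_split(toy)
print(len(train), len(val), len(test))  # 7 2 2
```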

In [60]:
# now we reset the index of each split

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)


In [61]:
# get the target variables (y), log-transformed with log1p

y_train = np.log1p(df_train.msrp.values)
y_val = np.log1p(df_val.msrp.values)
y_test = np.log1p(df_test.msrp.values)


In [62]:

# remove the target from the feature DataFrames so it cannot be used as a feature by accident

del df_train['msrp']
del df_val['msrp']
del df_test['msrp']

# Linear regression

Linear regression is a model for solving regression tasks, in which the objective is to fit a line to the data and use it to make predictions on new values. The input is the feature matrix X, and the output is a vector of predictions that should be as close as possible to the actual y values.

<br>

The formula is the sum of each feature multiplied by its corresponding weight: `x1*w1 + x2*w2` and so on.

<br>
There is a function g that takes in all the features of one record and produces a value close to y.

So the simple linear regression formula looks like:

$g(x_i) = w_0 + x_{i1} \cdot w_1 + x_{i2} \cdot w_2 + ... + x_{in} \cdot w_n$.

And that can be further simplified as:

$g(x_i) = w_0 + \displaystyle\sum_{j=1}^{n} w_j \cdot x_{ij}$

Here is a simple implementation of Linear Regression in python:

~~~~python
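w0 = 7.17  # intercept (bias term)
w = [0.01, 0.04, 0.002]  # one weight per feature

def linear_regression(xi):
    n = len(xi)

    # start from the intercept, then add each weighted feature
    pred = w0
    for j in range(n):
        pred = pred + w[j] * xi[j]
    return pred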
~~~~
        

In [65]:
# we are going to use one record from the training data to try this
df_train.iloc[10]

# let's take 3 features for now: horsepower, city mpg, and popularity
xi = [453, 11, 86]



In [72]:
#now we get the intercept and the weights:
w0 = 7.17 # intercept
w = [0.01, 0.04, 0.002]

In [70]:
#making the g function
def linear_regression(xi):
    n = len(xi)
    
    pred = w0
    for j in range(n):
        pred = pred + w[j] * xi[j]
    return pred

In [74]:
solution = linear_regression(xi)
# right now the result is not meaningful because the weights are arbitrary values
solution

12.312

In [76]:
# since the target was log-transformed, the prediction is on the log scale and we undo the log with the exponent
# expm1 also subtracts the one that log1p added to avoid log(0)
np.expm1(solution)

np.float64(222347.2221101062)
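
As a sanity check that `expm1` really undoes `log1p`, here is a small round-trip sketch. The price is a made-up value, not one from the dataset.

```python
import numpy as np

# Hypothetical price, not taken from the dataset
price = 25000.0

log_price = np.log1p(price)      # the scale the model is trained on
recovered = np.expm1(log_price)  # back to the original price scale

print(np.isclose(recovered, price))  # True
```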

# Linear regression vector form


The linear regression formula can be written compactly as the dot product between the features and the weights. The feature vector includes the *bias* term with an *x* value of one, so that $w_0$ is multiplied by $x_{i0}$, where $x_{i0} = 1$.

When all the records are included, the linear regression can be calculated with the dot product between the ***feature matrix*** and the ***vector of weights***, obtaining the `y` vector of predictions. 


**In plain words, this is the same calculation as before, but instead of multiplying and adding the numbers one at a time, we take the dot product of two vectors.**

<br>
notes: https://knowmledge.com/2023/09/20/ml-zoomcamp-2023-machine-learning-for-regression-part-5/

In [77]:
#making the dot product function
def dot(xi, w):
    n = len(xi)
    res = 0.0
    
    
    for j in range(n):
        res = res + xi[j] * w[j]
    return res

In [78]:
# linear regression
def linear_regression(xi):
    return w0 + dot(xi, w)

In [79]:
# to simplify the equation, we can imagine one more feature xi0 that is always equal to one; this way w0 does not need to be handled separately
w_new = [w0] + w
def linear_regression(xi):
    xi = [1] + xi
    return dot(xi, w_new)
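
To confirm that folding the intercept into the weight vector does not change the result, the sketch below compares both forms with `np.dot`, using the same example values as the cells above.

```python
import numpy as np

w0 = 7.17                      # intercept, as in the cells above
w = [0.01, 0.04, 0.002]        # weights
xi = [453, 11, 86]             # horsepower, city mpg, popularity

# Form 1: intercept added separately
pred_separate = w0 + np.dot(xi, w)

# Form 2: prepend a constant 1 to the features and w0 to the weights
pred_augmented = np.dot([1] + xi, [w0] + w)

print(np.isclose(pred_separate, pred_augmented))  # True
print(round(float(pred_augmented), 3))            # 12.312
```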

## A small example of how this calculation works

We have a matrix $X \in \mathbb{R}^{m \times (n+1)}$ and weight vector $w \in \mathbb{R}^{(n+1) \times 1}$:

$$
X =
\begin{bmatrix}
1 & x_{11} & \cdots & x_{1n} \\
1 & x_{21} & \cdots & x_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{m1} & \cdots & x_{mn}
\end{bmatrix}, 
\quad
w =
\begin{bmatrix}
w_0 \\ w_1 \\ \vdots \\ w_n
\end{bmatrix}
$$

Predictions are:

$$
y_{\text{pred}} = X w
$$

Each element is a dot product of a row of $X$ with $w$:

$$
y_{\text{pred},i} = x_i^T w = \sum_{j=0}^{n} x_{ij} w_j
$$

**Example:**

$$
X = \begin{bmatrix}1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix}, \quad
w = \begin{bmatrix}0.5 \\ 2 \end{bmatrix}
$$

Step-by-step calculation:

$$
\begin{aligned}
y_{\text{pred},1} &= 1 \cdot 0.5 + 2 \cdot 2 = 0.5 + 4 = 4.5 \\
y_{\text{pred},2} &= 1 \cdot 0.5 + 3 \cdot 2 = 0.5 + 6 = 6.5 \\
y_{\text{pred},3} &= 1 \cdot 0.5 + 4 \cdot 2 = 0.5 + 8 = 8.5
\end{aligned}
$$

So the prediction vector is:

$$
y_{\text{pred}} = \begin{bmatrix}4.5 \\ 6.5 \\ 8.5\end{bmatrix}
$$
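
The same worked example can be checked with NumPy. This is just a verification sketch of the numbers above.

```python
import numpy as np

# First column is the constant 1 for the bias term
X = np.array([[1, 2],
              [1, 3],
              [1, 4]])
w = np.array([0.5, 2])

# Matrix-vector product: one dot product per row of X
y_pred = X @ w
print(y_pred)  # [4.5 6.5 8.5]
```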




The general formula, as mentioned above, is:

$g(x_i) = w_0 + \displaystyle\sum_{j=1}^{n} w_j \cdot x_{ij}$

The summation can be written in another form:

$g(x_i) = w_0 + x_i^T w$

where $x_i^T$ is the transpose of $x_i$.

Why is this so?

- Both the weights and the features of a single record are vectors of size (n,1), where n is the number of features.
- The number of weights equals the number of features, so both vectors have the same size.
- For matrix multiplication to be defined, the first operand needs as many columns as the second has rows.
- We can transpose either the weights or the feature vector.
- In this case, we transpose $x_i$ so that we get a matrix of size (1,n).
- Since we want the inner product, we use $x_i^T w$, which has size (1,1) and is the prediction. The reverse order, $w x_i^T$, would give an (n,n) matrix instead.
- Note: whichever vector is transposed must come first, because the order of the operands affects the result in matrix multiplication.
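
The shape argument in the list above can be seen directly in NumPy. This sketch stores the example features and weights from this notebook as explicit column vectors, which is an illustrative choice (the other cells use plain lists).

```python
import numpy as np

# Column vectors of shape (n, 1), using the example values from this notebook
xi = np.array([[453], [11], [86]])
w = np.array([[0.01], [0.04], [0.002]])

inner = xi.T @ w   # (1, n) @ (n, 1) -> (1, 1): the weighted sum
outer = w @ xi.T   # (n, 1) @ (1, n) -> (n, n): not a prediction

print(inner.shape)  # (1, 1)
print(outer.shape)  # (3, 3)
```

Adding the intercept `w0` to `inner[0, 0]` gives the prediction for this record.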

In [None]:
# using the same data as an example, but with a matrix instead. Here is the base data
w0 = 7.17 # intercept
xi = [453, 11, 86]
w = [0.01, 0.04, 0.002]
w_new = [w0] + w

In [82]:
# now we build an example matrix from actual records in the data

x1 = [1, 148, 24, 1385] # note the leading 1, which is multiplied by the bias term w0
x2 = [1, 132, 25, 2031]
x10 = [1, 453, 11, 86]

X = [x1, x2, x10]
X = np.array(X)

In [84]:
X.dot(w_new) # these are the linear regression predictions, one per row of X

array([12.38 , 13.552, 12.312])
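
Writing the leading 1 by hand does not scale to a full dataset. One possible way to add the bias column programmatically is sketched below; it assumes the raw features are already in a NumPy array and reuses the same three rows and weights.

```python
import numpy as np

# Raw features without the bias column (same rows as above)
features = np.array([
    [148, 24, 1385],
    [132, 25, 2031],
    [453, 11, 86],
])

w0 = 7.17
w = np.array([0.01, 0.04, 0.002])

# Prepend a column of ones so that w0 becomes part of the dot product
ones = np.ones(features.shape[0])
X = np.column_stack([ones, features])
w_new = np.concatenate([[w0], w])

y_pred = X.dot(w_new)  # same values as the output above: 12.38, 13.552, 12.312
print(y_pred)
```

These predictions are still on the log scale, so `np.expm1(y_pred)` would convert them back to prices.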