## Task: Predict car price using linear regression


You are provided with a dataset about cars in the form of a CSV file.
The data contains details such as the car's name, year, and other attributes.

You are provided with the code to download the csv file that contains the dataset.

Your task is to **train a linear regression model** that predicts the car price.

**Divide the data** into a train (80%) and a validation data set (20%).

**Print train and validation losses**.


In [2]:
from IPython.display import clear_output

In [3]:
%pip install gdown
clear_output()

In [4]:
# Download the CSV file.
!gdown 1dail55JlMcsOlZKiSQeZSbTNHYiBH3U9

Downloading...
From: https://drive.google.com/uc?id=1dail55JlMcsOlZKiSQeZSbTNHYiBH3U9
To: /content/car_dataset.csv


In [5]:
import pandas as pd
import numpy as np

In [6]:
data_df = pd.read_csv('car_dataset.csv')

In [7]:
data_df.head()

| Unnamed: 0 | name | year | selling_price | km_driven | fuel | seller_type | transmission | owner |
|---|---|---|---|---|---|---|---|---|
| 0 | Maruti 800 AC | 2007 | 60000 | 70000 | Petrol | Individual | Manual | First Owner |
| 1 | Maruti Wagon R LXI Minor | 2007 | 135000 | 50000 | Petrol | Individual | Manual | First Owner |
| 2 | Hyundai Verna 1.6 SX | 2012 | 600000 | 100000 | Diesel | Individual | Manual | First Owner |
| 3 | Datsun RediGO T Option | 2017 | 250000 | 46000 | Petrol | Individual | Manual | First Owner |
| 4 | Honda Amaze VX i-DTEC | 2014 | 450000 | 141000 | Diesel | Individual | Manual | Second Owner |


## Pre-process the dataframe

In [8]:
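# Drop the columns that are not used as features.
data_df = data_df.drop(["name", "owner"], axis=1)

# Encode the categorical columns as 0/1.
fuel_type_map = {"Petrol": 1, "Diesel": 0}
seller_type_map = {"Dealer": 1, "Individual": 0}
transmission_map = {"Manual": 1, "Automatic": 0}

data_df["fuel"] = data_df["fuel"].map(fuel_type_map)
data_df["seller_type"] = data_df["seller_type"].map(seller_type_map)
data_df["transmission"] = data_df["transmission"].map(transmission_map)

# Drop rows with null values. This includes rows whose category is missing
# from the maps above, because .map() turns those values into NaN.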
data_df = data_df.dropna()
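
The `dropna()` call does more than remove values that were already missing: `Series.map` with a dictionary returns NaN for any value that has no key in the dictionary. The `data_y` output further down has a last index of 4339 but a length of 4174, so some rows were dropped, and the float values (`1.0`, `0.0`) in the `fuel` and `seller_type` columns are consistent with NaN having appeared in those columns. The sketch below reproduces the effect on a toy frame. The `"CNG"` value and the names `toy_df`, `mapped` and `cleaned` are assumptions made for illustration, not taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for the real dataset; "CNG" is a made-up example
# of a category that has no entry in the map.
toy_df = pd.DataFrame({"fuel": ["Petrol", "Diesel", "CNG", "Petrol"]})
fuel_type_map = {"Petrol": 1, "Diesel": 0}

mapped = toy_df["fuel"].map(fuel_type_map)

# Values without a key in the map become NaN.
n_unmapped = int(mapped.isna().sum())
print(f"unmapped rows: {n_unmapped}")

# dropna() then removes those rows and leaves a gap in the index.
cleaned = toy_df.assign(fuel=mapped).dropna()
print(len(toy_df), "->", len(cleaned))
```

Counting the NaN values before calling `dropna()` shows how many rows the encoding silently discards.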


## Split the target from the input features

In [10]:
data_y = data_df['selling_price'] # target
data_x = data_df.drop(['selling_price'], axis=1) # input features

data_x.head()

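| Unnamed: 0 | year | km_driven | fuel | seller_type | transmission |
|---|---|---|---|---|---|
| 0 | 2007 | 70000 | 1.0 | 0.0 | 1 |
| 1 | 2007 | 50000 | 1.0 | 0.0 | 1 |
| 2 | 2012 | 100000 | 0.0 | 0.0 | 1 |
| 3 | 2017 | 46000 | 1.0 | 0.0 | 1 |
| 4 | 2014 | 141000 | 0.0 | 0.0 | 1 |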


In [11]:
data_y

0        60000
1       135000
2       600000
3       250000
4       450000
         ...  
4335    409999
4336    409999
4337    110000
4338    865000
4339    225000
Name: selling_price, Length: 4174, dtype: int64

## Convert the input and target features to NumPy arrays

In [20]:
X = data_x.values
y = data_y.values

## Divide the data into a training set (80%) and a test set (20%)

In [21]:
train_size = int(0.8 * len(X))

X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]
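
The split above takes the first 80% of the rows in file order. If the file happens to be ordered in some way, the two parts may not be comparable, so a shuffled split is a common alternative. The following is a minimal sketch of that idea, not the notebook's method; the helper name `shuffled_split`, the seed, and the synthetic arrays are assumptions for illustration.

```python
import numpy as np

def shuffled_split(X, y, train_fraction=0.8, seed=0):
    """Shuffle rows with a fixed seed, then split into train and held-out parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))          # random row order
    train_size = int(train_fraction * len(X))
    train_idx, val_idx = order[:train_size], order[train_size:]
    # The same indices are applied to X and y so that rows stay aligned.
    return X[train_idx], X[val_idx], y[train_idx], y[val_idx]

# Small synthetic arrays, only to show the shapes involved.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)
X_tr, X_val, y_tr, y_val = shuffled_split(X_demo, y_demo)
print(X_tr.shape, X_val.shape)
```

Fixing the seed keeps the split reproducible between runs.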

## Use the closed-form linear regression solution to compute the coefficients (theta) and the predictions

In [22]:
# Add a column of ones to the data, which represents the bias (intercept) term b
X_train = np.c_[np.ones(X_train.shape[0]), X_train]
X_test = np.c_[np.ones(X_test.shape[0]), X_test]

# calculate theta
theta = np.linalg.inv(X_train.T @ X_train) @ (X_train.T @ y_train)

# calculate the predictions over the training set
y_train_pred = X_train @ theta
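
The cell above computes theta = (XᵀX)⁻¹ Xᵀy with an explicit matrix inverse. An explicit inverse can lose precision when XᵀX is close to singular, and `np.linalg.lstsq` solves the same least-squares problem without forming it. The sketch below compares the two on synthetic, noise-free data; `X_design`, `true_theta` and the random inputs are assumptions for illustration and are unrelated to the car dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(50, 3))

# Hypothetical coefficients, bias first, used to generate noise-free targets.
true_theta = np.array([2.0, -1.0, 0.5, 3.0])
X_design = np.c_[np.ones(len(X_raw)), X_raw]   # prepend the column of ones
y_demo = X_design @ true_theta

# Normal equation with an explicit inverse, as in the cell above.
theta_inv = np.linalg.inv(X_design.T @ X_design) @ (X_design.T @ y_demo)

# Least-squares solver; returns (solution, residuals, rank, singular values).
theta_lstsq, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)

print(np.allclose(theta_inv, theta_lstsq))
```

On well-conditioned data the two results agree; the solver mainly helps when features are strongly correlated or on very different scales.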

## Define the mean squared error function and report the loss of the linear regression model

In [23]:
# Defining the Mean Squared Error Function
def mean_squared_error(y, yhat):
  return 1/len(y) * np.sum(np.square(y - yhat))
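
As a quick sanity check, the function can be applied to a case small enough to verify by hand. The arrays below are made-up values for illustration: the errors are 0, 0 and -2, so the mean of the squared errors is (0 + 0 + 4) / 3.

```python
import numpy as np

def mean_squared_error(y, yhat):
    # Same definition as in the cell above.
    return 1 / len(y) * np.sum(np.square(y - yhat))

y_true = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.0, 2.0, 5.0])

mse_demo = mean_squared_error(y_true, y_hat)
print(mse_demo)
```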

In [24]:
# Calculate the MSE loss for the training set
train_mse = mean_squared_error(y_train, y_train_pred)
print(train_mse)

178411508695.8574


In [26]:
# calculate the predictions over the testing set
y_test_pred = X_test @ theta

In [27]:
# Calculate the MSE loss for the test set
test_mse = mean_squared_error(y_test, y_test_pred)
print(test_mse)

210865698440.9195


## Is the result of the loss function good?
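
One way to approach this question is to take the square root of the MSE. The result (RMSE) is in the same units as `selling_price`, so it can be compared directly with the prices shown earlier, which range from 60000 to 865000 in the displayed rows. The sketch below uses the two MSE values printed above; the variable names `train_rmse` and `test_rmse` are introduced here for illustration.

```python
import numpy as np

# MSE values copied from the printed outputs above.
train_mse = 178411508695.8574
test_mse = 210865698440.9195

# RMSE is expressed in the same units as selling_price.
train_rmse = np.sqrt(train_mse)
test_rmse = np.sqrt(test_mse)

print(f"train RMSE: {train_rmse:,.0f}")
print(f"test RMSE:  {test_rmse:,.0f}")
```

If the typical error is of the same order as many of the prices themselves, that suggests the linear model fits only roughly. Comparing against a simple baseline, such as always predicting the mean training price, would give a firmer answer.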
