<p align="center">
  <img width="800" height="600" alt="Hyundai Logo" src="https://p1.pxfuel.com/preview/251/598/497/car-hyundai-steering-wheel-vehicle.jpg">
</p>

# Introduction:

Hyundai is one of the leading car brands for middle-income buyers around the world. It was established in South Korea in 1967, and since then its cars have kept hitting the market with distinctive yet modest exterior designs and affordable prices.

Today we are going to explore the dataset of **Hyundai** used cars to answer the following questions:
- What is the relation between dataset features and how are they correlated?
- How does the age of the car affect its price?
- How does the transmission type affect the price?
- Can the mileage of the same car at the same age change the price?
- How does the fuel type affect the price?
- How is the road tax affected by car age and transmission type?
- How is the fuel consumption rate affected by engine size?
- How does the engine size affect the price?

Price is one of the most important things to study before selling or buying a used car, and so is its relation to the car's condition and specs.

After exploring our data and answering our questions, we are going to try to build a price prediction model and evaluate it.

#### Let's get started...
***

# Discover the data:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.rcParams['font.size']= 14
plt.rcParams['figure.figsize']= [10,7]



df = pd.read_csv('../input/used-car-dataset-ford-and-mercedes/hyundi.csv')
print(df.shape)
df.head(10)

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.duplicated().sum()

* We see that our data is clean and there are no null values.
* We have the proper data types for our features.
* Although we have **86 duplicate** records, we can proceed with our exploration normally without worrying.
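
If we ever wanted to remove the duplicates instead of keeping them, a minimal sketch could look like the one below. The `toy` frame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up rows for illustration only; the real data lives in `df`.
toy = pd.DataFrame({
    "model": [" I10", " I10", " Tucson", " I10"],
    "year": [2017, 2017, 2018, 2017],
    "price": [7500, 7500, 18000, 7500],
})

# duplicated() marks every row that is identical to an earlier row.
n_duplicates = toy.duplicated().sum()

# drop_duplicates() keeps the first copy of each repeated row.
deduplicated = toy.drop_duplicates()

print(n_duplicates)
print(len(toy), "->", len(deduplicated))
```

Note that two different cars can legitimately share the same model, year, price and specs, so dropping duplicates is a judgment call rather than a required step.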

We have three problems with our data:
1. We have some records with **0 engine size** that we need to remove.
2. The tax column name contains a pound sign (£), which makes it hard to index and call.
3. We have a weird range in mpg records as the minimum value is **1.1** and the maximum value is **256.8**.

Let's solve those problems, from the last to the first, before getting started.

In [None]:
df.mpg.value_counts()

We see that there are 4 cars with mpg equal to **1.1**, which is implausibly low, and 3 cars with mpg equal to **256.8**, which is implausibly high.

I choose to drop them as they seem to be outliers.
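
As an alternative to hard-coding the two values, here is a rough sketch of a rule-based filter using the common 1.5 × IQR rule. This is a swapped-in technique, not what this notebook does, and the `toy` values are made up for illustration.

```python
import pandas as pd

# Made-up mpg values for illustration; the two extremes mimic the odd records.
toy = pd.DataFrame({"mpg": [1.1, 45.6, 50.4, 55.4, 57.7, 60.1, 61.7, 256.8]})

# Interquartile range: the spread of the middle half of the values.
q1 = toy["mpg"].quantile(0.25)
q3 = toy["mpg"].quantile(0.75)
iqr = q3 - q1

# Anything further than 1.5 * IQR outside the quartiles is treated as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

kept = toy[toy["mpg"].between(lower, upper)]
print(kept["mpg"].tolist())
```

On the real data this rule may flag more rows than the two values we drop here, so the bounds would need a look before applying it.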

In [None]:
#problem 3
df.drop(df[(df.mpg == 1.1) | (df.mpg == 256.8)].index, inplace= True)
#problem 2
df.rename(columns= {'tax(£)': 'tax'}, inplace= True)
#problem 1
df.drop(df[df.engineSize == 0].index, inplace= True)

# EDA and Answers:

Before getting into our questions' answers let's first create a new column for our cars' age.

In [None]:
df['age'] = 2021 - df['year']
df.head()

## 1. What is the relation between dataset features and how are they correlated?

In [None]:
fig, ax = plt.subplots(figsize= (10,7))
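# numeric_only=True skips text columns such as model, which newer pandas versions refuse to correlate
sns.heatmap(df.corr(numeric_only= True), annot= True, linewidths= 1, ax=ax)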

We can see from above that there is a **positive correlation** between:
- Price and Engine Size.
- Mileage and Age.

and a **negative correlation** between:
- Price and Mileage.
- Price and Age.

A logical answer, isn't it?
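
To make the numbers in the heatmap less of a black box, here is a small sketch that computes the Pearson correlation by hand and compares it with pandas. The `toy` values and the `pearson` helper are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up cars for illustration: older cars have more miles and lower prices.
toy = pd.DataFrame({
    "age": [1, 2, 3, 5, 8],
    "mileage": [8000, 15000, 30000, 42000, 90000],
    "price": [16000, 14500, 12000, 9500, 5000],
})

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # centre both variables on their means
    yc = y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

manual = pearson(toy["age"], toy["price"])
builtin = toy["age"].corr(toy["price"])  # pandas uses Pearson by default
print(round(manual, 3), round(builtin, 3))
```

A value close to -1 means the two features move in opposite directions, a value close to +1 means they move together, and a value near 0 means there is no clear linear relation.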

## 2. How does the age of the car affect its price?

<p align="center">
  <img width="800" height="600" alt="Hyundai old car" src="https://upload.wikimedia.org/wikipedia/commons/b/bf/My_New_Car_Hyundai_i30_-_August_2009_%283831038685%29.jpg">
</p>

In [None]:
fig, ax = plt.subplots(figsize= (10,7))
ax.scatter(data= df, x='age', y='price')
ax.set_title('Age vs Price')
ax.set_ylabel('Price')
ax.set_xlabel('Age')

Here, as you can see, the older a car gets, the more its price drops.

This is what we would logically expect for cars.

*Note: we can observe an outlier at the age of 4 that has a very high price, actually the highest in our data, and we need to investigate it.*

In [None]:
df[df.price > 80000]

Let's see the maximum price of **I10 2017 models** in our dataset.

In [None]:
df[(df.model == ' I10') & (df.year == 2017)].price.max()

From our investigation we can now say that there was a human mistake in collecting the data. Let's correct it.

In [None]:
df.loc[df.price > 80000, 'price'] = 9200

In [None]:
df[df.price > 80000]

## 3. How does the transmission type affect the price?

<p align="center">
  <img width="800" height="600" alt="transmission stick" src="https://cdn.pixabay.com/photo/2015/07/31/11/36/shift-868980_1280.jpg">
</p>

In [None]:
fig, ax = plt.subplots(figsize= (10,7))
ax.bar(df.transmission, df.price)
ax.set_title('Transmission vs Price')
ax.set_ylabel('Price')
ax.set_xlabel('Transmission')
fig.show()

Transmission type is reflected in the price of the car: **Semi-Auto** cars have the highest prices, followed by **Automatic** cars.
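
One caveat: the bar chart above gets one bar per car, and the bars of each category are drawn on top of one another, so it mostly reflects the most expensive car in each group. A per-group summary is one way to cross-check the picture. The sketch below uses made-up `toy` prices, not values from the dataset.

```python
import pandas as pd

# Made-up prices for illustration only.
toy = pd.DataFrame({
    "transmission": ["Manual", "Manual", "Automatic", "Automatic", "Semi-Auto", "Semi-Auto"],
    "price": [7000, 9000, 12000, 16000, 18000, 22000],
})

# One row per transmission type, sorted from the highest to the lowest mean price.
summary = (
    toy.groupby("transmission")["price"]
       .agg(["count", "mean", "median", "max"])
       .sort_values("mean", ascending=False)
)
print(summary)
```

The same pattern should work on the real data by swapping `toy` for `df`, and `summary["mean"].plot.bar()` would then chart the averages instead of the maxima.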

## 4. Can the mileage of the same car at the same age change the price?

<p align="center">
  <img width="800" height="600" alt="transmission stick" src="https://live.staticflickr.com/1367/4606495023_7a7719a312_b.jpg">
</p>

Let's choose the year with the most cars, and the most common model within that year, to investigate this question.

In [None]:
most_year = df.year.value_counts().index[0]
most_model = df[df.year == most_year].model.value_counts().index[0]
print(most_year, most_model)

So the largest sample to check our answer is the **Tucson 2017 model**. Let's split off its data.

In [None]:
tucson_17 = df[(df.year == most_year) & (df.model == most_model)]
tucson_17

In [None]:
fig, ax = plt.subplots(figsize= (10,7))
ax.scatter(tucson_17.mileage, tucson_17.price)
ax.set_title('Mileage vs Price')
ax.set_ylabel('Price')
ax.set_xlabel('Mileage')
fig.show()

Mileage affects the price slightly: the higher the mileage, the lower the price.

However, mileage does not seem to be a sufficient factor on its own, as we can see from how scattered our data are.
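
To put a number on "slightly", one option is to fit a straight line and look at its slope and at the correlation. The sketch below uses made-up mileage and price values, not the Tucson subset.

```python
import numpy as np

# Made-up values for one model and year, for illustration only.
mileage = np.array([5000, 12000, 20000, 28000, 35000, 50000], dtype=float)
price = np.array([19500, 18800, 18900, 17600, 17900, 16200], dtype=float)

# Degree-1 polyfit returns the slope first, then the intercept.
slope, intercept = np.polyfit(mileage, price, deg=1)

# Correlation tells us how tightly the points follow that line.
r = np.corrcoef(mileage, price)[0, 1]

print(f"price change per 1,000 miles: {slope * 1000:.0f}")
print(f"correlation: {r:.2f}")
```

A negative slope with a correlation well short of -1 would match what the scatter plot suggests: mileage pulls the price down, but other factors matter too.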

## 5. How does fuel type affect the price?

<p align="center">
  <img width="800" height="600" alt="petrol and diesel fuel" src="https://p1.pxfuel.com/preview/492/488/949/petrol-gasoline-diesel-gas.jpg">
</p>

In [None]:
fig, ax = plt.subplots(figsize= (10,7))
ax.bar(df.fuelType, df.price)
ax.set_title('Fuel Type vs Price')
ax.set_ylabel('Price')
ax.set_xlabel('Fuel Type')
fig.show()

In [None]:
fig, ax = plt.subplots(figsize= (10,7))
sns.countplot(x='fuelType', data=df, ax=ax)
fig.show()

Diesel cars seem to be more expensive than petrol cars, although more cars use petrol as a fuel type.

That may be because diesel nowadays costs more than petrol.

## 6. How is the road tax affected by car age and transmission type?

<p align="center">
  <img width="800" height="600" alt="transmission stick" src="https://www.stockvault.net/data/2018/07/30/253496/preview16.jpg">
</p>

In [None]:
fig, ax = plt.subplots(figsize= (10,7))
ax.bar(df.age, df.tax)
ax.set_title('Age vs Tax')
ax.set_ylabel('Tax')
ax.set_xlabel('Age')
fig.show()

In [None]:
fig, ax = plt.subplots(figsize= (10,7))
ax.scatter(df.transmission, df.tax)
ax.set_title('Transmission vs Tax')
ax.set_ylabel('Tax')
ax.set_xlabel('Transmission')
fig.show()

The two charts above show that the road tax is not affected by either transmission type or age.

*Note: we can see an outlier car in our dataset. Let's take a closer look at it to make a decision about it.*

In [None]:
df[df.tax >= 500]

Let's check the road taxes for **Santa Fe** cars and for cars with an engine size of **2.4** or more.

In [None]:
df[df.model == ' Santa Fe'].tax.value_counts()

In [None]:
df[df.engineSize >= 2.4].tax.value_counts()

Now we can say that this car is clearly an outlier, and we need to remove it to avoid problems with our linear regression model.

In [None]:
df.drop(df[df.tax == 555].index, inplace= True)

## 7. How is the fuel consumption rate affected by engine size?

<p align="center">
  <img width="800" height="600" alt="fuel gauge" src="https://www.maxpixel.net/static/photo/1x/Petrol-Tank-Full-Fuel-Fuel-Gauge-Ad-Gas-Empty-70507.jpg">
</p>

In [None]:
fig, ax = plt.subplots(figsize= (10,7))
ax.scatter(df.engineSize, df.mpg)
ax.set_title('Engine Size vs Consumption')
ax.set_ylabel('MPG')
ax.set_xlabel('Engine Size')
ax.set_xlim(0.5,3)
fig.show()

We can tell that mpg is not simply driven by engine size in our data: our sample starts to show lower fuel consumption above an engine size of 1.6L, which may not happen in real life. We also see that the highest-consumption trophy goes to the cars with a 1.6L engine.

## 8. How does the engine size affect the price?

<p align="center">
  <img width="800" height="600" alt="transmission stick" src="https://live.staticflickr.com/6106/6309884152_6e4851b41a_b.jpg">
</p>

In [None]:
fig, ax = plt.subplots(figsize= (10,7))
ax.scatter(df.engineSize, df.price)
ax.set_title('Engine Size vs Price')
ax.set_ylabel('Price')
ax.set_xlabel('Engine Size')
ax.set_xlim(0.5,3)
fig.show()

Prices increase with engine size. However, we have some cars with large engines and lower prices, maybe because they are old or have high mileage.

Now let's start our journey to the price prediction model.

# Linear Regression Price Prediction Model:

We are interested in building a linear regression model that helps us predict the price of used Hyundai cars in the future.

Linear regression is a machine learning algorithm for predicting a numerical target. If a key feature is categorical, we need to create what are called dummy variables for it.
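
To show what fitting a linear regression actually does, here is a minimal sketch using plain NumPy least squares. The numbers are made up and the prices are generated from a known rule, so we can check that the fit recovers it. This is an illustration, not the model trained in this notebook.

```python
import numpy as np

# Made-up features for illustration only.
age = np.array([1, 2, 3, 4, 5, 6], dtype=float)
engine = np.array([1.0, 1.6, 1.2, 2.0, 1.0, 1.6])

# Prices built from a known rule: 20000 - 1500 per year + 3000 per litre.
price = 20000 - 1500 * age + 3000 * engine

# Design matrix: a column of ones for the intercept, then one column per feature.
X = np.column_stack([np.ones_like(age), age, engine])

# Least squares finds the coefficients that minimise the squared prediction error.
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
intercept, b_age, b_engine = coef

print(round(intercept), round(b_age), round(b_engine))
```

With real data the prices never follow a rule exactly, so the coefficients are the best compromise across all cars rather than an exact recovery.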

Also, we want to predict the total price, so we will create a new feature that sums the price of the car and its tax.

Let's start our job.

## Get the data ready for the model:

### 1. Total Price Feature:

In [None]:
df_model = df.copy()
df_model['total_price'] = df.price + df.tax
df_model.drop(columns = ['price', 'tax', 'year'], inplace= True)
df_model.head()

### 2. Dummies:

Let's first look at each categorical feature's unique values again to understand what shape our dataframe will have after creating the dummy variables.

In [None]:
df_model.model.value_counts()

In [None]:
df_model.transmission.value_counts()

In [None]:
df_model.fuelType.value_counts()

Creating dummy variables makes a new 0/1 feature for each value of a categorical feature. Our method also drops the original categorical features and the first dummy of each one.
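
Here is a tiny sketch of that behaviour on a made-up frame with a single categorical column, so the effect of `drop_first` is easy to see.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "mileage": [10000, 25000, 40000],
    "fuelType": ["Petrol", "Diesel", "Hybrid"],
})

# Categories are sorted alphabetically, so Diesel is the first dummy and gets dropped.
encoded = pd.get_dummies(toy, columns=["fuelType"], drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```

The dropped category becomes the baseline: a Diesel car is the row where every remaining `fuelType_` column is 0, so no information is lost and the dummies are not redundant with the intercept.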

In [None]:
df_model = pd.get_dummies(df_model, columns= ['model', 'transmission', 'fuelType'], drop_first= True)
df_model

## Separate Data into Train and Test Data:

In [None]:
#Our model features
X = df_model.drop(columns= 'total_price')
#Our predicted feature
y= df_model.total_price

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.4, random_state= 27)

## Train The model:

In [None]:
from sklearn.linear_model import LinearRegression

lr_model = LinearRegression()

lr_model.fit(X_train, y_train)

## Predict Total Price:

<p align="center">
  <img width="800" height="600" alt="money" src="https://cdn.pixabay.com/photo/2017/08/21/15/55/money-2665824_1280.jpg">
</p>

In [None]:
pred_total_price = lr_model.predict(X_test)

## Coefficients and Intercept of the Model:

In [None]:
coef = pd.DataFrame(lr_model.coef_, index= X_train.columns)
coef

In [None]:
lr_model.intercept_

## Evaluate our model:

In [None]:
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

mse = mean_squared_error(y_test, pred_total_price)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, pred_total_price)
r2 = r2_score(y_test, pred_total_price)

labels = ['Mean-Squared-Error','Root-Mean-Squared-Error','Mean-Absolute-Error','R^2 Score']

eval_model = pd.DataFrame([mse, rmse, mae, r2], index= labels)
pd.options.display.float_format = "{:f}".format
eval_model

We can see that our model's R-squared score is about **88%**, which is not bad for predicting the total price of **Hyundai cars**.
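
To see what these four metrics measure, here is a small sketch that computes them by hand with NumPy. The prices are made up and are not our model's predictions.

```python
import numpy as np

# Made-up true and predicted total prices, for illustration only.
y_true = np.array([10000.0, 12000.0, 15000.0, 20000.0])
y_pred = np.array([11000.0, 11500.0, 15500.0, 18000.0])

errors = y_true - y_pred

mse = np.mean(errors ** 2)      # average squared error, punishes large misses
rmse = np.sqrt(mse)             # same idea, back in price units
mae = np.mean(np.abs(errors))   # average miss in price units

# R^2 compares the model's squared error with that of always predicting the mean.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, rmse, mae, r2)
```

An R^2 of 1 would mean perfect predictions, while 0 would mean the model does no better than guessing the average price every time.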