# 🙋‍♂️ Preamble 🙋‍♂️ #

Name : Ayoub Choukri
Date : 13 October 2023
Subject : Car Prices Modeling and Prediction

**🤞Note🤞** : This notebook is part of a series that I will be publishing on my [Github]() and [Kaggle]() accounts. Feel free to check them out and share your feedback.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msn
import pandas as pd
import prince as pr
from mca import *

# 🛣️General Context🛣️

An Asian car company, **Auto_Asia**, is trying to enter the European market. They aspire to set up their manufacturing plant in Europe in order to produce locally and compete with European car manufacturers.

**Auto_Asia** has taken the first step and started to gather data about the European car market. They have collected car prices and car features from different sources. Analysing this data will help them understand the European market.


Technically, **Auto_Asia** wants to know the following:

1. Which variables are really significant in determining the price of a car?

2. How well do those variables describe the price of a car?

**Auto_Asia** has hired you as a Data Scientist to help them answer those questions.

# 📋Importing the DataSet📋

For this project, we will be using the [Car Price DataSet](https://www.kaggle.com/hellbuoy/car-price-prediction) from Kaggle.

Let's import the DataSet.

In [7]:
data=pd.read_csv("./Data/CarPrice_Assignment.csv")
pd.set_option('display.max_columns',None)

# 🧹Data cleaning🧹

## Data Dtypes

Let's first take a glimpse at the data.

In [35]:
data.sample(5)['CarName']

23                          dodge d200
174                   toyota celica gt
53                          mazda rx-4
154           toyota corolla 1600 (sw)
74     buick regal sport coupe (turbo)
Name: CarName, dtype: category
Categories (147, object): ['Nissan versa', 'alfa-romero Quadrifoglio', 'alfa-romero giulia', 'alfa-romero stelvio', ..., 'volvo 264gl', 'volvo diesel', 'vw dasher', 'vw rabbit']

We notice that our DataSet contains $205$ rows and $26$ columns.

Each column gives us information about a specific feature of the car.

We notice the existence of both numerical and categorical features.
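
The shape and the numerical/categorical mix can be checked directly. The snippet below is a minimal sketch on a toy frame: `toy` is a hypothetical stand-in built from three rows of the DataSet, and on the real data the same calls would presumably be applied to `data`.

```python
import pandas as pd

# Hypothetical three-row stand-in for the real DataFrame
toy = pd.DataFrame({
    "CarName": ["toyota corona liftback", "renault 12tl", "honda accord lx"],
    "fueltype": ["gas", "gas", "gas"],
    "horsepower": [116, 90, 76],
    "price": [8449.0, 9295.0, 7295.0],
})

# Number of rows and columns
n_rows, n_cols = toy.shape

# Columns pandas parsed as numbers versus everything else
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(n_rows, n_cols)
print(numeric_cols)
print(other_cols)
```

Note that `select_dtypes` only reports how pandas parsed each column. Integer-coded features such as `symboling` would still show up as numeric, which is why the types are assigned by hand in the next cell.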

Let's first cast every feature to its appropriate dtype.

In [14]:
df=data
# wheelbase is a continuous measurement (e.g. 88.6, 98.4), so it is listed with the numerical columns
categorical_columns = ['car_ID','symboling','CarName','fueltype','aspiration','doornumber','carbody','drivewheel','enginelocation','enginetype','cylindernumber','fuelsystem']


numerical_columns=['wheelbase','carlength','carwidth','carheight','curbweight','enginesize','boreratio','stroke','compressionratio','horsepower','peakrpm','citympg','highwaympg','price']


# Sanity check: the two lists together should account for all 26 columns
len(categorical_columns)+len(numerical_columns)

26

In [17]:
# Categorical features get the pandas 'category' dtype; numerical ones are parsed as numbers (unparseable entries become NaN)
df[categorical_columns] = df[categorical_columns].astype('category')
df[numerical_columns] = df[numerical_columns].apply(pd.to_numeric,errors='coerce')
display(df.dtypes)

car_ID              category
symboling           category
CarName             category
fueltype            category
aspiration          category
doornumber          category
carbody             category
drivewheel          category
enginelocation      category
wheelbase            float64
carlength            float64
carwidth             float64
carheight            float64
curbweight             int64
enginetype          category
cylindernumber      category
enginesize             int64
fuelsystem          category
boreratio            float64
stroke               float64
compressionratio     float64
horsepower             int64
peakrpm                int64
citympg                int64
highwaympg             int64
price                float64
dtype: object
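
Because `errors='coerce'` silently turns any unparseable entry into `NaN`, the conversion itself can introduce missing values. The sketch below illustrates this on a toy frame; the `"?"` entry is a made-up malformed value and does not come from the DataSet.

```python
import pandas as pd

# Hypothetical toy data; "?" is an invented malformed entry
toy = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas"],
    "horsepower": ["116", "90", "?"],
})

# Same two conversions as in the cell above
toy["fueltype"] = toy["fueltype"].astype("category")
toy["horsepower"] = pd.to_numeric(toy["horsepower"], errors="coerce")

# Count of entries that could not be parsed and became NaN
n_coerced = int(toy["horsepower"].isna().sum())

print(toy.dtypes)
print(n_coerced)
```

This is one reason to check for missing values after the type conversion rather than before it.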

## 🤷Missing Values🤷

Let's compute the percentage of missing values in each column.

In [25]:
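def percentage_missing(data=df,columns=None,rows=None):
    # Default to every column and every row of the DataFrame that was passed in
    columns = data.columns if columns is None else columns
    rows = data.index if rows is None else rows
    subset = data.loc[rows,columns]
    miss = subset.isnull()
    non_miss = subset.notna()
    print("Sum of Missing Values per Column")
    display(miss.sum())
    # Note: this is the ratio of missing to non-missing entries, not a true percentage
    print("Percentage of Missing Values per Column")
    display(miss.sum()/non_miss.sum())
    # Fraction of the selected rows that are missing, per column (between 0 and 1)
    print("Percentage of Missing Values")
    display(miss.sum() / subset.shape[0])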


In [26]:
percentage_missing()

Sum of Missing Values per Column


car_ID              0
symboling           0
CarName             0
fueltype            0
aspiration          0
doornumber          0
carbody             0
drivewheel          0
enginelocation      0
wheelbase           0
carlength           0
carwidth            0
carheight           0
curbweight          0
enginetype          0
cylindernumber      0
enginesize          0
fuelsystem          0
boreratio           0
stroke              0
compressionratio    0
horsepower          0
peakrpm             0
citympg             0
highwaympg          0
price               0
dtype: int64

Percentage of Missing Values per Column


car_ID              0.0
symboling           0.0
CarName             0.0
fueltype            0.0
aspiration          0.0
doornumber          0.0
carbody             0.0
drivewheel          0.0
enginelocation      0.0
wheelbase           0.0
carlength           0.0
carwidth            0.0
carheight           0.0
curbweight          0.0
enginetype          0.0
cylindernumber      0.0
enginesize          0.0
fuelsystem          0.0
boreratio           0.0
stroke              0.0
compressionratio    0.0
horsepower          0.0
peakrpm             0.0
citympg             0.0
highwaympg          0.0
price               0.0
dtype: float64

Percentage of Missing Values


car_ID              0.0
symboling           0.0
CarName             0.0
fueltype            0.0
aspiration          0.0
doornumber          0.0
carbody             0.0
drivewheel          0.0
enginelocation      0.0
wheelbase           0.0
carlength           0.0
carwidth            0.0
carheight           0.0
curbweight          0.0
enginetype          0.0
cylindernumber      0.0
enginesize          0.0
fuelsystem          0.0
boreratio           0.0
stroke              0.0
compressionratio    0.0
horsepower          0.0
peakrpm             0.0
citympg             0.0
highwaympg          0.0
price               0.0
dtype: float64
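
As an aside, the per-column missing rate can also be computed in a single expression. This is a minimal sketch on a hypothetical toy frame with one `NaN` inserted on purpose, not a replacement for the function above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one deliberately missing horsepower value
toy = pd.DataFrame({
    "horsepower": [116.0, np.nan, 76.0, 120.0],
    "price": [8449.0, 9295.0, 7295.0, 18280.0],
})

# isna() yields booleans; their mean is the fraction missing, so * 100 gives a percentage
missing_pct = toy.isna().mean() * 100

print(missing_pct)
```

Here one of the four `horsepower` entries is missing, so its rate is 25%, while `price` stays at 0%.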

We notice that our DataSet doesn't contain any missing values. We can now move to the next step.

## 🛠️Feature Engineering🛠️

Let's review the features of our DataSet.

In [36]:
df.sample(5)

Unnamed: 0,car_ID,symboling,CarName,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,wheelbase,carlength,carwidth,carheight,curbweight,enginetype,cylindernumber,enginesize,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
167,168,2,toyota corona liftback,gas,std,two,hardtop,rwd,front,98.4,176.2,65.6,52.0,2540,ohc,four,146,mpfi,3.62,3.5,9.3,116,4800,24,30,8449.0
130,131,0,renault 12tl,gas,std,four,wagon,fwd,front,96.1,181.5,66.5,55.2,2579,ohc,four,132,mpfi,3.46,3.9,8.7,90,5100,23,31,9295.0
35,36,0,honda accord lx,gas,std,four,sedan,fwd,front,96.5,163.4,64.0,54.5,2010,ohc,four,92,1bbl,2.91,3.41,9.2,76,6000,30,34,7295.0
65,66,0,mazda glc,gas,std,four,sedan,rwd,front,104.9,175.0,66.1,54.4,2670,ohc,four,140,mpfi,3.76,3.16,8.0,120,5000,19,27,18280.0
0,1,3,alfa-romero giulia,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495.0


We notice that the **`CarName`** column holds both the name of the company and the model of the car. We can split this column into two columns, **`CompanyName`** and **`CarModel`**.
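
The idea can be sketched on a few names taken from the DataSet. This assumes the company name is always the first space-separated token, which may not hold for every entry. The `names` and `parts` variables are hypothetical and used only for illustration.

```python
import pandas as pd

# A few car names from the DataSet, stored as a categorical Series like df['CarName']
names = pd.Series(
    ["dodge d200", "toyota celica gt", "mazda rx-4"], name="CarName"
).astype("category")

# Cast to plain strings, then split on the first space only (n=1)
# so that multi-word models such as "celica gt" stay in one piece
parts = names.astype(str).str.split(" ", n=1, expand=True)
parts.columns = ["CompanyName", "CarModel"]

print(parts)
```

Splitting only once keeps everything after the company name together as the model.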

In [44]:
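def split_one_carname(carname,sep=' '):
    # Split on the first separator only: company name first, the rest is the model
    return carname.split(sep=sep,maxsplit=1)


def split_carname_column(data=df,sep=' ',column='CarName',name_column_1 = 'CompanyName',name_column_2='CarModel'):
    # Applying split_one_carname directly to the categorical column raises
    # "TypeError: unhashable type: 'list'", so cast to plain strings and use the vectorised split
    parts = data[column].astype(str).str.split(sep,n=1,expand=True)
    data[name_column_1] = parts[0]
    data[name_column_2] = parts[1]
    return data


df = split_carname_column()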