# Relevant Information Paragraph:

-   Car Evaluation Database was derived from a simple hierarchical decision model originally developed for the demonstration of DEX (M. Bohanec, V. Rajkovic: Expert system for decision making. Sistemica 1(1), pp. 145-157, 1990.). The model evaluates cars according to the following concept structure:

   - CAR: car acceptability
     - PRICE: overall price
       - buying: buying price
       - maint: price of the maintenance
     - TECH: technical characteristics
       - COMFORT: comfort
         - doors: number of doors
         - persons: capacity in terms of persons to carry
         - lug_boot: the size of luggage boot
       - safety: estimated safety of the car

   Input attributes are printed in lowercase. Besides the target
   concept (CAR), the model includes three intermediate concepts:
   PRICE, TECH, COMFORT. In the original model, every concept is
   related to its lower-level descendants by a set of examples (for
   these example sets, see http://www-ai.ijs.si/BlazZupan/car.html).

   The Car Evaluation Database contains examples with the structural
   information removed, i.e., directly relates CAR to the six input
   attributes: buying, maint, doors, persons, lug_boot, safety.

   Because the underlying concept structure is known, this database
   may be particularly useful for testing constructive induction and
   structure discovery methods.
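
As a small illustration of the concept structure described above, the hierarchy can be written as a nested dictionary and walked to recover the six input attributes and the intermediate concepts. This is only a sketch: the names `CAR_MODEL`, `input_attributes`, and `concepts` are hypothetical helpers, not part of the dataset or of DEX.

```python
# Concept hierarchy from the description; a value of None marks an input attribute (leaf).
CAR_MODEL = {
    "CAR": {
        "PRICE": {"buying": None, "maint": None},
        "TECH": {
            "COMFORT": {"doors": None, "persons": None, "lug_boot": None},
            "safety": None,
        },
    }
}


def input_attributes(tree):
    """Return the leaf names (input attributes) in depth-first order."""
    leaves = []
    for name, children in tree.items():
        if children is None:
            leaves.append(name)
        else:
            leaves.extend(input_attributes(children))
    return leaves


def concepts(tree):
    """Return the names of all non-leaf nodes (target and intermediate concepts)."""
    names = []
    for name, children in tree.items():
        if children is not None:
            names.append(name)
            names.extend(concepts(children))
    return names


attrs = input_attributes(CAR_MODEL)
print(attrs)                # ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety']
print(concepts(CAR_MODEL))  # ['CAR', 'PRICE', 'TECH', 'COMFORT']
```

Flattening the tree this way mirrors what the database does: it drops PRICE, TECH, and COMFORT and relates CAR directly to the leaves.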

## Number of Instances: 1728
   (instances completely cover the attribute space)

## Number of Attributes: 6

## Attribute Values:

 - buying       v-high, high, med, low
 - maint        v-high, high, med, low
 - doors        2, 3, 4, 5-more
 - persons      2, 4, more
 - lug_boot     small, med, big
 - safety       low, med, high

Note: the data file loaded below spells these values without the hyphen (`vhigh`, `5more`).
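
The statement that the 1728 instances completely cover the attribute space can be checked by counting the Cartesian product of the attribute values. The sketch below uses only the standard library; the dictionary name `ATTRIBUTE_VALUES` is illustrative, and the values use the spellings that appear in the data (`vhigh`, `5more`).

```python
from itertools import product

# Attribute values as spelled in the data file.
ATTRIBUTE_VALUES = {
    "buying": ["vhigh", "high", "med", "low"],
    "maint": ["vhigh", "high", "med", "low"],
    "doors": ["2", "3", "4", "5more"],
    "persons": ["2", "4", "more"],
    "lug_boot": ["small", "med", "big"],
    "safety": ["low", "med", "high"],
}

# Every possible combination of the six attributes: 4 * 4 * 4 * 3 * 3 * 3.
combinations = list(product(*ATTRIBUTE_VALUES.values()))
print(len(combinations))  # 1728
```

Because the product has exactly 1728 elements, a dataset of 1728 distinct rows over these attributes contains each combination exactly once.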

## Missing Attribute Values: none

## Class Distribution (number of instances per class)

   | class  | N    | N[%]   |
   |--------|------|--------|
   | unacc  | 1210 | 70.023 |
   | acc    | 384  | 22.222 |
   | good   | 69   | 3.993  |
   | v-good | 65   | 3.762  |


In [3]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
#Loading the dataset
df = pd.read_csv("data")

In [5]:
df.shape

(1728, 7)
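
The file name `"data"` is kept as it appears in the notebook, and that file evidently carries a header row. If the source were instead the raw, header-less `car.data` file from the UCI repository (an assumption, not something this notebook shows), the column names would have to be supplied explicitly. A minimal sketch, using an in-memory stand-in built from the first rows shown in `df.head()`:

```python
import io

import pandas as pd

COLUMNS = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]

# In-memory stand-in for a header-less CSV; replace with a real file path when available.
raw = io.StringIO(
    "vhigh,vhigh,2,2,small,low,unacc\n"
    "vhigh,vhigh,2,2,small,med,unacc\n"
    "vhigh,vhigh,2,2,small,high,unacc\n"
)

# header=None stops pandas from treating the first data row as column names;
# dtype=str keeps 'doors' and 'persons' as text, since they mix digits with '5more'/'more'.
sample = pd.read_csv(raw, header=None, names=COLUMNS, dtype=str)
print(sample.shape)  # (3, 7)
```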

In [6]:
df.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,class
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


In [56]:
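# Executed after the encoding cells further down (note the cell number), so the feature columns are already numeric
df.describe(include="all")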

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,class
count,1728.0,1728.0,1728.0,1728.0,1728.0,1728.0,1728
unique,,,,,,,4
top,,,,,,,unacc
freq,,,,,,,1210
mean,2.5,2.5,3.5,3.666667,1.0,2.0,
std,1.118358,1.118358,1.118358,1.24758,0.816733,0.816733,
min,1.0,1.0,2.0,2.0,0.0,1.0,
25%,1.75,1.75,2.75,2.0,0.0,1.0,
50%,2.5,2.5,3.5,4.0,1.0,2.0,
75%,3.25,3.25,4.25,5.0,2.0,3.0,


In [8]:
df["class"].value_counts()/len(df)

unacc    0.700231
acc      0.222222
good     0.039931
vgood    0.037616
Name: class, dtype: float64

### Observation 
- The dataset is highly imbalanced.
- About 70% of the instances belong to `unacc`.
- Only about 3.8% of the instances belong to `vgood`, the rarest class.
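
To put a number on the imbalance, the class shares and the majority-to-minority ratio can be derived directly from the class counts in the dataset description. This is a plain-Python sketch; the variable names are illustrative.

```python
# Class counts from the dataset description.
counts = {"unacc": 1210, "acc": 384, "good": 69, "vgood": 65}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

majority = max(counts, key=counts.get)
minority = min(counts, key=counts.get)
imbalance_ratio = counts[majority] / counts[minority]

print(total)                     # 1728
print(majority, minority)        # unacc vgood
print(round(imbalance_ratio, 1))
```

One consequence: a model that always predicts `unacc` would already be right for about 70% of the rows, so plain accuracy is a weak yardstick here.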

In [9]:
df.isnull().sum()

buying      0
maint       0
doors       0
persons     0
lug_boot    0
safety      0
class       0
dtype: int64

### Observation
- There are no missing values in any column.

In [10]:
df.columns

Index(['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class'], dtype='object')

In [11]:
# Ordinal encoding (low < med < high < vhigh); assign the result back
# instead of calling replace(..., inplace=True) on a column selection
df["buying"] = df["buying"].map({"low": 1, "med": 2, "high": 3, "vhigh": 4})

In [12]:
df["buying"].value_counts()

4    432
3    432
2    432
1    432
Name: buying, dtype: int64

In [13]:
df["maint"].value_counts()

vhigh    432
med      432
high     432
low      432
Name: maint, dtype: int64

In [14]:
# Same ordinal encoding as for "buying"
df["maint"] = df["maint"].map({"low": 1, "med": 2, "high": 3, "vhigh": 4})

In [15]:
df["maint"].value_counts()

4    432
3    432
2    432
1    432
Name: maint, dtype: int64

In [16]:
df["doors"].value_counts()

3        432
5more    432
4        432
2        432
Name: doors, dtype: int64

In [17]:
# Values are strings because of "5more"; map them to integers, with "5more" coded as 5
df["doors"] = df["doors"].map({"2": 2, "3": 3, "4": 4, "5more": 5})

In [18]:
df["persons"].value_counts()

more    576
4       576
2       576
Name: persons, dtype: int64

In [23]:
# "more" is coded as 5, one step above 4
df["persons"] = df["persons"].map({"2": 2, "4": 4, "more": 5})

In [42]:
df["lug_boot"].value_counts()

small    576
med      576
big      576
Name: lug_boot, dtype: int64

In [43]:
from sklearn.preprocessing import LabelEncoder

In [44]:
le = LabelEncoder()

In [48]:
df["lug_boot"] = le.fit_transform(df["lug_boot"])

In [49]:
df["lug_boot"].unique()

array([2, 1, 0], dtype=int64)
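
`LabelEncoder` assigns codes in sorted (alphabetical) order of the labels, which is why `small` becomes 2 and `big` becomes 0 in the output above. The sketch below makes that mapping explicit on a three-value stand-in series and, as a possible alternative, shows an explicit mapping that follows the natural size order. The alternative is a suggestion, not what the notebook does.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in for the lug_boot column, in the order the values first appear.
lug_boot = pd.Series(["small", "med", "big"])

le = LabelEncoder()
codes = le.fit_transform(lug_boot)

# classes_ is sorted; the position of a label in classes_ is its code.
label_to_code = {label: code for code, label in enumerate(le.classes_)}
print(label_to_code)   # {'big': 0, 'med': 1, 'small': 2}
print(codes.tolist())  # [2, 1, 0]

# Alternative (assumption): an explicit mapping that follows small < med < big.
ordinal = lug_boot.map({"small": 1, "med": 2, "big": 3})
print(ordinal.tolist())  # [1, 2, 3]
```

The alphabetical codes still order the three sizes monotonically, only reversed, so tree-based models are unaffected; the explicit mapping mainly helps readability and keeps all features pointing the same way.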

In [50]:
df["safety"].value_counts()

med     576
high    576
low     576
Name: safety, dtype: int64

In [55]:
# Ordinal encoding (low < med < high); assign back rather than using inplace
df["safety"] = df["safety"].map({"low": 1, "med": 2, "high": 3})
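
The per-column encoding steps above could also be collected into one table of mappings and applied in a single pass. The following is a sketch under stated assumptions: `ORDINAL_MAPS` and `encode_ordinal` are hypothetical names, the codes reproduce the ones used in this notebook (including the alphabetical `lug_boot` codes produced by `LabelEncoder`), and the helper raises on any value it does not recognise instead of silently producing NaN.

```python
import pandas as pd

# Codes mirror the notebook: lug_boot uses the LabelEncoder result (big=0, med=1, small=2).
ORDINAL_MAPS = {
    "buying": {"low": 1, "med": 2, "high": 3, "vhigh": 4},
    "maint": {"low": 1, "med": 2, "high": 3, "vhigh": 4},
    "doors": {"2": 2, "3": 3, "4": 4, "5more": 5},
    "persons": {"2": 2, "4": 4, "more": 5},
    "lug_boot": {"big": 0, "med": 1, "small": 2},
    "safety": {"low": 1, "med": 2, "high": 3},
}


def encode_ordinal(frame, maps=ORDINAL_MAPS):
    """Return a copy of frame with the mapped columns encoded as integers."""
    out = frame.copy()
    for col, mapping in maps.items():
        # Cast to str first so numeric-looking columns ("2", "4") match the keys.
        encoded = out[col].astype(str).map(mapping)
        if encoded.isna().any():
            unknown = sorted(out.loc[encoded.isna(), col].astype(str).unique())
            raise ValueError(f"Unmapped values in {col!r}: {unknown}")
        out[col] = encoded.astype(int)
    return out


# Two illustrative rows: the first matches row 0 of df.head(), the second is made up.
sample = pd.DataFrame(
    {
        "buying": ["vhigh", "low"],
        "maint": ["vhigh", "med"],
        "doors": ["2", "5more"],
        "persons": ["2", "more"],
        "lug_boot": ["small", "big"],
        "safety": ["low", "high"],
    }
)
encoded = encode_ordinal(sample)
print(encoded.iloc[0].tolist())  # [4, 4, 2, 2, 2, 1]
print(encoded.iloc[1].tolist())  # [1, 2, 5, 5, 0, 3]
```

Keeping the mappings in one place makes the chosen order of each attribute visible at a glance and avoids re-running a cell on already-encoded data by accident, since a second call fails loudly on the integer values.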