Steps for Handling Categorical Data

1. Import Libraries
2. Load Data
3. Separate Input and Output attributes
4. Convert the categorical data into numerical data

In [1]:
# Step 1: Import Libraries

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Step 2: Load Data

datasets = pd.read_csv('Datasets/Exercise-CarData.csv')
print("\nData :\n", datasets)
# describe() summarises only the numeric columns by default
print("\nData statistics\n", datasets.describe())


Data :
       Unnamed: 0  Price   Age     KM FuelType   HP  MetColor  Automatic    CC  \
0              0  13500  23.0  46986   Diesel   90       1.0          0  2000   
1              1  13750  23.0  72937   Diesel   90       1.0          0  2000   
2              2  13950  24.0  41711   Diesel   90       NaN          0  2000   
3              3  14950  26.0  48000   Diesel   90       0.0          0  2000   
4              4  13750  30.0  38500   Diesel   90       0.0          0  2000   
...          ...    ...   ...    ...      ...  ...       ...        ...   ...   
1431        1431   7500   NaN  20544   Petrol   86       1.0          0  1300   
1432        1432  10845  72.0     ??   Petrol   86       0.0          0  1300   
1433        1433   8500   NaN  17016   Petrol   86       0.0          0  1300   
1434        1434   7250  70.0     ??      NaN   86       1.0          0  1300   
1435        1435   6950  76.0      1   Petrol  110       0.0          0  1600   

      Doors  Weigh

In [2]:
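# Step 3: Separate Input and Output attributes

# Input: every column from index 2 (Age) onward; column 0 is the saved row index
X = datasets.iloc[:, 2:].values

# Output: column 1 (Price)
Y = datasets.iloc[:, 1].values

print("\n\nInput : \n", X)
print("\n\nOutput: \n", Y)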



Input : 
 [[23.0 '46986' 'Diesel' ... 2000 'three' 1165]
 [23.0 '72937' 'Diesel' ... 2000 '3' 1165]
 [24.0 '41711' 'Diesel' ... 2000 '3' 1165]
 ...
 [nan '17016' 'Petrol' ... 1300 '3' 1015]
 [70.0 '??' nan ... 1300 '3' 1015]
 [76.0 '1' 'Petrol' ... 1600 '5' 1114]]


Output: 
 [13500 13750 13950 ...  8500  7250  6950]
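The printed input shows two kinds of dirty values that survive loading: the '??' marker in KM and the spelled-out 'three' in Doors. A minimal sketch of one way to clean them follows. It runs on a hypothetical three-row sample rather than the real file, and the replacement of 'three' with '3' is an assumption based only on the value visible above.

```python
import io
import pandas as pd

# Hypothetical three-row sample shaped like the printed data (not the real file).
csv_text = """Price,Age,KM,FuelType,Doors
13500,23.0,46986,Diesel,three
10845,72.0,??,Petrol,3
7250,70.0,??,,3
"""

# na_values tells read_csv to treat the '??' marker as missing (NaN).
sample = pd.read_csv(io.StringIO(csv_text), na_values=["??"])

# Spelled-out door counts become digits, then the column is made numeric.
sample["Doors"] = pd.to_numeric(sample["Doors"].replace({"three": "3"}))

print(sample.dtypes)
print(sample.isna().sum())
```

With the marker declared as missing, KM is parsed as a numeric column instead of strings, so later steps can count or fill its gaps like any other NaN.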


In [3]:
# Step 4a: Apply LabelEncoder to convert the FuelType names into numeric values

le = LabelEncoder()
# Column 2 of X holds FuelType; codes follow the sorted order of the labels
X[:, 2] = le.fit_transform(X[:, 2])
print("\n\nInput : \n", X)



Input : 
 [[23.0 '46986' 1 ... 2000 'three' 1165]
 [23.0 '72937' 1 ... 2000 '3' 1165]
 [24.0 '41711' 1 ... 2000 '3' 1165]
 ...
 [nan '17016' 2 ... 1300 '3' 1015]
 [70.0 '??' 3 ... 1300 '3' 1015]
 [76.0 '1' 2 ... 1600 '5' 1114]]
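In the output above Diesel becomes 1, Petrol becomes 2, and the row with a missing FuelType becomes 3, so the missing value is silently treated as a fourth fuel type. The sketch below shows how the mapping can be inspected through `classes_`, and one possible alternative that fits the encoder on known values only and leaves gaps as NaN. The five-value series is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up FuelType values, including one missing entry.
fuel = pd.Series(["Diesel", "Petrol", np.nan, "CNG", "Petrol"], name="FuelType")

le = LabelEncoder()
known = fuel.notna()

# Fit on the non-missing labels only, so NaN never becomes a class.
le.fit(fuel[known].tolist())

# classes_ is sorted, so the position of each label is its integer code.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)  # {'CNG': 0, 'Diesel': 1, 'Petrol': 2}

# Encode the known rows and keep the missing row as NaN.
codes = pd.Series(np.nan, index=fuel.index, name="FuelTypeCode")
codes[known] = le.transform(fuel[known].tolist())
print(codes.tolist())
```

Keeping the gap as NaN leaves the decision about missing fuel types (drop, fill, or flag) to a later, explicit step.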


In [4]:
# Step 4b: Use pandas dummy variables to create one column for each fuel type

# dtype=int keeps the 0/1 values on pandas versions that default to True/False
dummy = pd.get_dummies(datasets['FuelType'], dtype=int)
print("\n\nDummy :\n", dummy)

# Swap the original FuelType column for the dummy columns
datasets = datasets.drop(['FuelType'], axis=1)
datasets = pd.concat([dummy, datasets], axis=1)
print("\n\nFinal Data :\n", datasets)



Dummy :
       CNG  Diesel  Petrol
0       0       1       0
1       0       1       0
2       0       1       0
3       0       1       0
4       0       1       0
...   ...     ...     ...
1431    0       0       1
1432    0       0       1
1433    0       0       1
1434    0       0       0
1435    0       0       1

[1436 rows x 3 columns]
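Row 1434 in the dummy table is all zeros because its FuelType is missing, and `get_dummies` ignores NaN by default. If that information should be kept, `dummy_na=True` adds a separate indicator column. A small sketch on made-up values, not the real dataset:

```python
import numpy as np
import pandas as pd

# Made-up FuelType column with one missing entry at position 2.
sample = pd.DataFrame({"FuelType": ["Diesel", "Petrol", np.nan, "CNG"]})

# Default behaviour: the missing row gets 0 in every dummy column.
plain = pd.get_dummies(sample["FuelType"], dtype=int)

# dummy_na=True adds an extra column that flags the missing row.
with_na = pd.get_dummies(sample["FuelType"], dummy_na=True, dtype=int)

print(plain)
print(with_na)
```

Whether the extra column is worth having depends on how missing values are handled elsewhere; if they are filled in before encoding, the default output is enough.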


Final Data :
       CNG  Diesel  Petrol  Unnamed: 0  Price   Age     KM   HP  MetColor  \
0       0       1       0           0  13500  23.0  46986   90       1.0   
1       0       1       0           1  13750  23.0  72937   90       1.0   
2       0       1       0           2  13950  24.0  41711   90       NaN   
3       0       1       0           3  14950  26.0  48000   90       0.0   
4       0       1       0           4  13750  30.0  38500   90       0.0   
...   ...     ...     ...         ...    ...   ...    ...  ...       ...   
1431    0       0       1        1431   7500   NaN  20544   86       1.0   
1432    0       0       1   