In [27]:
import pandas as pd
import sklearn
from sklearn.preprocessing import (LabelEncoder, OneHotEncoder)
import numpy as np

In [16]:
data = pd.read_csv("C:\\Users\\Prabin\\clean_data.csv")

In [17]:
data.head()

Unnamed: 0,year,month,stateDescription,sectorName,customers,price,revenue,sales
0,2001,1,Wyoming,all sectors,,4.31,48.1284,1116.17208
1,2001,1,Wyoming,commercial,,5.13,12.67978,247.08691
2,2001,1,Wyoming,industrial,,3.26,19.60858,602.30484
3,2001,1,Wyoming,other,,4.75,0.76868,16.17442
4,2001,1,Wyoming,residential,,6.01,15.07136,250.60591


**Here `sectorName` is a categorical variable stored as strings. Most machine learning models expect numeric input, so we need to encode it before fitting.**

To transform this data we will use `Label Encoding` and `One-Hot Encoding`.

In [18]:
data.sectorName.unique()

array(['all sectors', 'commercial', 'industrial', 'other', 'residential',
       'transportation'], dtype=object)

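## Label Encoding
Label encoding replaces each category with an integer code. It works best on hierarchical or ordinal data, where the categories have a natural order that the integers can reflect.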
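Because the integer codes carry an implied order, it can help to state that order explicitly for ordinal data. The following is a minimal sketch, assuming a hypothetical `sizes` column (not part of the dataset above), that uses `pd.Categorical` so the codes follow small < medium < large:

```python
import pandas as pd

# Hypothetical ordinal column, used only for illustration.
sizes = pd.Series(["medium", "small", "large", "small"])

# Declare the category order explicitly so the codes follow it.
ordered = pd.Categorical(
    sizes, categories=["small", "medium", "large"], ordered=True
)
size_codes = ordered.codes
print(size_codes)  # [1 0 2 0]
```

Without the explicit `categories` list, the codes would follow alphabetical order, which does not match the intended ranking here.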

In [19]:
X = data['sectorName']

In [20]:
## Using pandas

encoded, rule = X.factorize()

In [21]:
encoded

array([0, 1, 2, ..., 2, 4, 5], dtype=int64)

In [24]:
## Sklearn way
lEncoder = LabelEncoder()

encoded = lEncoder.fit_transform(X)

In [25]:
lEncoder.inverse_transform([0,1,4,5])

array(['all sectors', 'commercial', 'residential', 'transportation'],
      dtype=object)

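**Why it is more convenient than pandas.**
`LabelEncoder()` is an object that stores the fitted mapping and exposes methods that use it. If we want the label names back instead of the numbers, `inverse_transform` does it in one call, while in pandas the reverse lookup has to be done by hand through the uniques returned by `factorize()` (`rule` above). Note that `LabelEncoder` assigns codes in sorted order, whereas `factorize()` assigns them in order of first appearance; the two agree here only because the sectors first appear in alphabetical order.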
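For comparison, in pandas the reverse mapping is manual: the second value returned by `factorize()` is an `Index` of the unique labels, and codes can be looked up in it by position. A small sketch, assuming a toy series `X_demo` as a stand-in for `data['sectorName']`:

```python
import pandas as pd

# Toy stand-in for data['sectorName'].
X_demo = pd.Series(["all sectors", "commercial", "residential", "commercial"])

codes, uniques = X_demo.factorize()

# uniques holds the labels in order of first appearance;
# taking positions from it maps codes back to labels.
labels = uniques.take([0, 2])
print(list(labels))  # ['all sectors', 'residential']
```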


## One-Hot Encoding

Use for nominal data, where the categories have no inherent order.

In [28]:
# pandas way
pd.get_dummies(X, dtype=int)


Unnamed: 0,all sectors,commercial,industrial,other,residential,transportation
0,1,0,0,0,0,0
1,0,1,0,0,0,0
2,0,0,1,0,0,0
3,0,0,0,1,0,0
4,0,0,0,0,1,0
...,...,...,...,...,...,...
85865,1,0,0,0,0,0
85866,0,1,0,0,0,0
85867,0,0,1,0,0,0
85868,0,0,0,0,1,0


In [32]:
# Sklearn way
# OneHotEncoder expects 2-D input, so reshape the column to (n_samples, 1).

one_hot_encoder = OneHotEncoder(sparse_output=False)
encoder = one_hot_encoder.fit_transform(X.values.reshape(-1, 1))


In [33]:
encoder

array([[1., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0.],
       ...,
       [0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 1.]])

In [37]:
one_hot_encoder.categories_


[array(['all sectors', 'commercial', 'industrial', 'other', 'residential',
        'transportation'], dtype=object)]

In [39]:
one_hot_encoder