![en2](https://user-images.githubusercontent.com/54896849/66270661-afcf6b00-e873-11e9-8157-69f3197611aa.PNG)

![en](https://user-images.githubusercontent.com/54896849/66270708-2a988600-e874-11e9-8319-64d4ea6a6b3f.jpg)

### The encoding techniques that I'll cover are the following:

```
1) Replacing values
2) Label encoding
3) One-Hot encoding
4) Binary encoding
5) Miscellaneous: Encoding features with ranges
```

# Importing Libraries

In [1]:
from sklearn import preprocessing
import pandas as pd

# Creating a Dataset

In [2]:
raw_data = {
    'patient': [1, 1, 1, 2, 2],
    'obs': [1, 2, 3, 1, 2],
    'treatment': [0, 1, 0, 1, 0],
    'score': ['strong', 'weak', 'normal', 'weak', 'strong'],
}
df = pd.DataFrame(raw_data, columns=['patient', 'obs', 'treatment', 'score'])
df

Unnamed: 0,patient,obs,treatment,score
0,1,1,0,strong
1,1,2,1,weak
2,1,3,0,normal
3,2,1,1,weak
4,2,2,0,strong


In [3]:
# Columns with object dtype are the likely categorical features in the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
patient      5 non-null int64
obs          5 non-null int64
treatment    5 non-null int64
score        5 non-null object
dtypes: int64(3), object(1)
memory usage: 240.0+ bytes


In [48]:
# Display all columns with object dtype
df.select_dtypes(include=['object'])

Unnamed: 0,score
0,strong
1,weak
2,normal
3,weak
4,strong
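
Before choosing an encoding, it can help to check how many distinct categories a column holds and how often each one occurs. The following is a minimal sketch that rebuilds only the `score` column from the dataset above; the variable names `n_unique` and `counts` are illustrative.

```python
import pandas as pd

# Rebuild just the categorical column from the example dataset
df = pd.DataFrame({'score': ['strong', 'weak', 'normal', 'weak', 'strong']})

# Number of distinct categories in the column
n_unique = df['score'].nunique()

# Frequency of each category
counts = df['score'].value_counts()

print(n_unique)
print(counts)
```

A column with only a few categories is usually a comfortable fit for one-hot encoding, while a column with many categories may be better served by a more compact scheme such as binary encoding.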


# Replacing values
- You choose which number to assign to each category, for example through a dictionary that maps category names to integers.

In [6]:
df2 = df.select_dtypes(include=['object']).copy()
df2

Unnamed: 0,score
0,strong
1,weak
2,normal
3,weak
4,strong


In [7]:
# Create mapper
scale_mapper = {'strong':1, 
                'weak':2,
                'normal':3}

In [9]:
# Map each category to its number; categories missing from the mapper become NaN
df2['score_new'] = df2['score'].map(scale_mapper)

# View data frame
df2

Unnamed: 0,score,score_new
0,strong,1
1,weak,2
2,normal,3
3,weak,2
4,strong,1
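
The numbers in the mapper above are arbitrary. If the categories are assumed to have a natural order (weak < normal < strong), the mapper can follow that order instead. The sketch below uses a hypothetical `ordered_mapper` and adds an extra `'unknown'` value, which is not in the original data, to show how values missing from the mapper can be detected.

```python
import pandas as pd

# The last value is deliberately absent from the mapper (illustrative only)
scores = pd.Series(['strong', 'weak', 'normal', 'weak', 'strong', 'unknown'])

# Hypothetical mapper that follows the assumed order weak < normal < strong
ordered_mapper = {'weak': 0, 'normal': 1, 'strong': 2}

# map() returns NaN for any category that the mapper does not cover
encoded = scores.map(ordered_mapper)

# Collect the categories that were left unmapped
unmapped = scores[encoded.isna()].unique().tolist()

print(encoded.tolist())
print(unmapped)
```

Checking for unmapped values right after the mapping makes it harder for a typo or a new category to slip through as a silent NaN.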


# Label encoding
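- Converts each distinct value in a column to an integer.
- The labels always lie between 0 and n_categories-1 and are assigned in sorted (alphabetical) order of the categories.
- scikit-learn documents `LabelEncoder` as intended for target labels; it is used on a feature column here for illustration.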

In [42]:
df2 = df.select_dtypes(include=['object']).copy()
df2

Unnamed: 0,score
0,strong
1,weak
2,normal
3,weak
4,strong


In [43]:
# Create a label (category) encoder object
le = preprocessing.LabelEncoder()

In [44]:
# Fit the encoder to the pandas column
le.fit(df2['score'])

LabelEncoder()

In [45]:
# View the labels (if you want)
list(le.classes_)

['normal', 'strong', 'weak']

In [46]:
# Transform the categories into integer labels
x = le.transform(df2['score'])

In [47]:
# Add the encoded labels as a new column
df2['score_new'] = x
df2

Unnamed: 0,score,score_new
0,strong,1
1,weak,2
2,normal,0
3,weak,2
4,strong,1
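
`LabelEncoder` numbers the categories alphabetically, so `normal` gets 0 and `weak` gets 2. When a specific order matters, one alternative is scikit-learn's `OrdinalEncoder` with an explicit category list. This is a sketch under the assumption that the ranking is weak < normal < strong; the column name `score_ordinal` is illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df2 = pd.DataFrame({'score': ['strong', 'weak', 'normal', 'weak', 'strong']})

# Explicit category order: weak -> 0, normal -> 1, strong -> 2 (assumed ranking)
oe = OrdinalEncoder(categories=[['weak', 'normal', 'strong']])

# OrdinalEncoder expects 2-D input, hence the double brackets
encoded = oe.fit_transform(df2[['score']])
df2['score_ordinal'] = encoded.ravel().astype(int)

# inverse_transform recovers the original strings from the codes
decoded = oe.inverse_transform(encoded).ravel().tolist()

print(df2)
print(decoded)
```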


# One-Hot encoding
- The basic strategy is to convert each category value into a new column and assign a 1 or 0 (True/False) value to the column.
- This avoids implying an order or magnitude between categories, which integer codes such as 0, 1, 2 can do.

## Method 1 (LabelBinarizer)

In [20]:
from sklearn.preprocessing import LabelBinarizer

In [21]:
df2 = df.select_dtypes(include=['object']).copy()
df2

Unnamed: 0,score
0,strong
1,weak
2,normal
3,weak
4,strong


In [24]:
one_hot = LabelBinarizer()

# One-hot encode the column
x = one_hot.fit_transform(df2['score'])

In [25]:
# View the labels (if you want)
one_hot.classes_

array(['normal', 'strong', 'weak'], dtype='<U6')

In [26]:
# Build a DataFrame with one column per class
df3 = pd.DataFrame(x, columns=one_hot.classes_)

In [27]:
df3

Unnamed: 0,normal,strong,weak
0,0,1,0
1,0,0,1
2,1,0,0
3,0,0,1
4,0,1,0


## Method 2 (dummies)

In [28]:
# dtype=int keeps the 0/1 output; recent pandas versions return booleans by default
pd.get_dummies(df2['score'], dtype=int)

Unnamed: 0,normal,strong,weak
0,0,1,0
1,0,0,1
2,1,0,0
3,0,0,1
4,0,1,0
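
Another option, sketched here as an addition to the two methods above, is scikit-learn's `OneHotEncoder`. It can be fitted once and then reused on new data, and with `handle_unknown='ignore'` a category that was not seen during fitting is encoded as an all-zero row. The `new` frame and its `'very strong'` value are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'score': ['strong', 'weak', 'normal', 'weak', 'strong']})
# 'very strong' does not occur in the training data (illustrative value)
new = pd.DataFrame({'score': ['weak', 'very strong']})

# handle_unknown='ignore' encodes unseen categories as all zeros
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(train[['score']])

# The default output is a sparse matrix; toarray() makes it dense for display
encoded_new = ohe.transform(new[['score']]).toarray()

# categories_ holds the learned categories, one array per input column
columns = list(ohe.categories_[0])
df_new = pd.DataFrame(encoded_new, columns=columns)

print(df_new)
```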


# Binary encoding
- In this technique, first the categories are encoded as ordinal,
- Then those integers are converted into binary code,
- Then the digits from that binary string are split into separate columns.
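
The three steps above can be sketched by hand with pandas and NumPy. This is not how `category_encoders` implements it; it is a minimal illustration that assumes ordinal codes are assigned in order of first appearance and start at 1, and the number of columns it produces may differ from the library's output.

```python
import numpy as np
import pandas as pd

scores = pd.Series(['strong', 'weak', 'normal', 'weak', 'strong'])

# Step 1: ordinal codes in order of first appearance, starting at 1 (assumption)
codes = pd.factorize(scores)[0] + 1

# Step 2: number of binary digits needed to represent the largest code
n_bits = int(codes.max()).bit_length()

# Step 3: extract each bit, most significant first, into its own column
shifts = np.arange(n_bits - 1, -1, -1)
bits = (codes[:, None] >> shifts) & 1

df_bits = pd.DataFrame(bits, columns=[f'score_{i}' for i in range(n_bits)])
print(df_bits)
```

With three categories, two binary columns are enough, whereas one-hot encoding needs three. The saving grows with the number of categories.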

In [29]:
df2 = df.select_dtypes(include=['object']).copy()
df2

Unnamed: 0,score
0,strong
1,weak
2,normal
3,weak
4,strong


In [31]:
!pip install category_encoders
import category_encoders as ce

encoder = ce.BinaryEncoder(cols=['score'])
df_binary = encoder.fit_transform(df2)

df_binary.head()


Unnamed: 0,score_0,score_1,score_2
0,0,0,1
1,0,1,0
2,0,1,1
3,0,1,0
4,0,0,1


# Miscellaneous: Encoding features with ranges
- Sometimes a categorical feature column holds ranges of values, such as age brackets, rather than single values.

In [39]:
dummy_df_age = pd.DataFrame({'age': ['0-20', '20-40', '40-60','60-80']})
dummy_df_age

Unnamed: 0,age
0,0-20
1,20-40
2,40-60
3,60-80


## Method 1
- Split the column into two separate columns

In [40]:
dummy_df_age = pd.DataFrame({'age': ['0-20', '20-40', '40-60', '60-80']})
# Split each range on '-' and store the bounds as integers rather than strings
dummy_df_age[['start', 'end']] = dummy_df_age['age'].str.split('-', expand=True).astype(int)

dummy_df_age.head()

Unnamed: 0,age,start,end
0,0-20,0,20
1,20-40,20,40
2,40-60,40,60
3,60-80,60,80


## Method 2
- Replace the column with a single summary value, such as the midpoint (mean of the two bounds) of each range

In [41]:
dummy_df_age = pd.DataFrame({'age': ['0-20', '20-40', '40-60', '60-80']})

def split_mean(x):
    """Return the midpoint of a 'low-high' range string."""
    low, high = x.split('-')
    return (float(low) + float(high)) / 2

dummy_df_age['age_mean'] = dummy_df_age['age'].apply(split_mean)

dummy_df_age.head()

Unnamed: 0,age,age_mean
0,0-20,10.0
1,20-40,30.0
2,40-60,50.0
3,60-80,70.0
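
Both methods can also be combined in one vectorized pass. The sketch below assumes every value has the form `low-high` with non-negative integer bounds; the `ages` frame and the `age_width` column are illustrative names that do not appear in the cells above.

```python
import pandas as pd

ages = pd.DataFrame({'age': ['0-20', '20-40', '40-60', '60-80']})

# Capture both bounds with named groups; assumes the form 'low-high'
bounds = ages['age'].str.extract(r'^(?P<start>\d+)-(?P<end>\d+)$').astype(int)

# Attach the numeric bounds, then derive the midpoint and the width
ages = ages.join(bounds)
ages['age_mean'] = (ages['start'] + ages['end']) / 2
ages['age_width'] = ages['end'] - ages['start']

print(ages)
```

Values that do not match the pattern, such as an open-ended bracket, would come back as missing from `str.extract` and make the integer conversion fail, so they would need separate handling.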
