Encoding using sklearn:
--
Encoding in sklearn is done with the __preprocessing module__, which offers a variety of options for manipulating data before analysis. We will focus on two forms of encoding for now: the __LabelEncoder__ and the __OneHotEncoder__.

#### 1. Label Encoder
- First, we have to import the preprocessing module.
- Let's create __a dummy dataframe__ named `data` with a column whose values we want to __transform from categories to integers__.

__Simply put, LabelEncoder assigns integer codes to non-consecutive numbers or to text labels.__

##### fit(y)
- Fit label encoder

| | |
|-----------|---------------|
|Parameters | __y__ : array-like of shape (n_samples,) Target values.|
|Returns|__self__ : returns an instance of self.|

In [23]:
import pandas as pd
from sklearn import preprocessing as pp

#sample data
sample_data = {'name':['Ray', 'Adam', 'Jason', 'Varun', 'Xiao'],
               'health':['fit', 'slim', 'obese', 'fit', 'slim']}
#store sample data in dataframe
data = pd.DataFrame(sample_data,columns = ['name','health'])
print(data)
#categories to integers
label_encoder = pp.LabelEncoder()
#label_encoder.fit_transform(data['health'])
label_encoder.fit(data['health'])
label_encoder.transform(data['health'])

    name health
0    Ray    fit
1   Adam   slim
2  Jason  obese
3  Varun    fit
4   Xiao   slim


array([0, 2, 1, 0, 2], dtype=int64)
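
To see which integer belongs to which category, one option is to inspect the fitted encoder. The sketch below rebuilds the `health` column on its own and assumes a reasonably recent scikit-learn; the variable names `health`, `mapping`, and `decoded` are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn import preprocessing as pp

# Same values as the 'health' column in the example above
health = pd.Series(['fit', 'slim', 'obese', 'fit', 'slim'])

label_encoder = pp.LabelEncoder()
codes = label_encoder.fit_transform(health)

# classes_ holds the sorted unique labels; a label's position is its code
mapping = {label: code for code, label in enumerate(label_encoder.classes_)}
print(mapping)

# inverse_transform maps the integer codes back to the original labels
decoded = label_encoder.inverse_transform(codes)
print(list(decoded))
```

Because the labels are sorted before codes are assigned, the numbering reflects alphabetical order only, not any meaningful ranking of the categories.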

One thing to keep in mind when encoding data is that the integers assigned to your categories can skew your analysis. In the example above, LabelEncoder sorts the categories alphabetically, so slim is assigned the value 2 and obese the value 1. This does not mean that slim should weigh twice as much as obese in your analysis, but a model that treats the codes as numbers may interpret them that way. In such situations it is better to one-hot encode your data: every category gets its own column holding a 0 or a 1, which removes the unwanted bias that can creep in if you simply label encode your data.
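
The artificial ordering can be made concrete by comparing distances between categories under each encoding. This is a minimal sketch with hand-written encodings that mirror the example above; the helper names `label_distance` and `one_hot_distance` are hypothetical and exist only for illustration.

```python
import numpy as np

# Integer codes as LabelEncoder assigns them in the example above
label_codes = {'fit': 0, 'obese': 1, 'slim': 2}

# One-hot vectors for the same three categories
one_hot = {
    'fit':   np.array([1, 0, 0]),
    'obese': np.array([0, 1, 0]),
    'slim':  np.array([0, 0, 1]),
}

def label_distance(a, b):
    """Absolute difference between the integer codes of two categories."""
    return abs(label_codes[a] - label_codes[b])

def one_hot_distance(a, b):
    """Euclidean distance between the one-hot vectors of two categories."""
    return float(np.linalg.norm(one_hot[a] - one_hot[b]))

for a, b in [('fit', 'obese'), ('fit', 'slim'), ('obese', 'slim')]:
    print(a, b, label_distance(a, b), round(one_hot_distance(a, b), 3))
```

With integer codes, fit appears closer to obese than to slim, purely because of alphabetical order. With one-hot vectors, every pair of categories is the same distance apart.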

#### 2. One-hot Encoder
If we were to apply the one-hot transformation to the same example we had above, we'd do it in __Pandas__ using __get_dummies__ as follows:

__OneHotEncoder expands categorical data into additional dimensions, one column per category.__

In [24]:
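# dtype=int keeps the 0/1 output on newer pandas versions, which return booleans by default
dummy = pd.get_dummies(data['health'], dtype=int)
print(dummy)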

#We could do this in sklearn on the label encoded data 
#using OneHotEncoder as follows:

ohe = pp.OneHotEncoder() # creating OneHotEncoder object
label_encoded_data = label_encoder.fit_transform(data['health'])
print(label_encoded_data)
# reshape(-1, 1) turns the 1-D codes into the single-column 2-D array OneHotEncoder expects
print(ohe.fit_transform(label_encoded_data.reshape(-1, 1)))


   fit  obese  slim
0    1      0     0
1    0      0     1
2    0      1     0
3    1      0     0
4    0      0     1
[0 2 1 0 2]
  (0, 0)	1.0
  (1, 2)	1.0
  (2, 1)	1.0
  (3, 0)	1.0
  (4, 2)	1.0
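
The last five lines of the output are a sparse matrix printed as (row, column) value entries. As a hedged alternative, the sketch below assumes scikit-learn 0.20 or later, where `OneHotEncoder` accepts string categories directly, so the intermediate `LabelEncoder` step can be skipped. The names `encoded` and `dense` are illustrative.

```python
import pandas as pd
from sklearn import preprocessing as pp

# Same 'health' column as in the example above
data = pd.DataFrame({'health': ['fit', 'slim', 'obese', 'fit', 'slim']})

ohe = pp.OneHotEncoder()

# Double brackets select a one-column DataFrame, giving the 2-D input the encoder expects
encoded = ohe.fit_transform(data[['health']])

# The default result is a sparse matrix; toarray() converts it to a dense NumPy array
dense = encoded.toarray()

print(ohe.categories_)  # categories found for each input column
print(dense)
```

The columns of `dense` follow the order in `ohe.categories_[0]`, which matches the column order that `get_dummies` produced above.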
