# Variable conversion

In this activity you will learn to convert variables from one type to another: numeric to categorical, and categorical to numeric.

## Numeric to categorical

Consider the wine dataset we used earlier:

In [1]:
import sklearn.datasets as datasets
import pandas as pd
import numpy as np

dataset = datasets.load_wine()
X = pd.DataFrame(data=dataset['data'], columns=dataset['feature_names'])

print(X.head())

   alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  \
0    14.23        1.71  2.43               15.6      127.0           2.80   
1    13.20        1.78  2.14               11.2      100.0           2.65   
2    13.16        2.36  2.67               18.6      101.0           2.80   
3    14.37        1.95  2.50               16.8      113.0           3.85   
4    13.24        2.59  2.87               21.0      118.0           2.80   

   flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  \
0        3.06                  0.28             2.29             5.64  1.04   
1        2.76                  0.26             1.28             4.38  1.05   
2        3.24                  0.30             2.81             5.68  1.03   
3        3.49                  0.24             2.18             7.80  0.86   
4        2.69                  0.39             1.82             4.32  1.04   

   od280/od315_of_diluted_wines  proline  
0                  

Let's first bin the variable 'flavanoids' into 5 bins using pandas:

In [4]:
# let's first look at the first 20 values of flavanoids
print(X['flavanoids'].head(20))

0     3.06
1     2.76
2     3.24
3     3.49
4     2.69
5     3.39
6     2.52
7     2.51
8     2.98
9     3.15
10    3.32
11    2.43
12    2.76
13    3.69
14    3.64
15    2.91
16    3.14
17    3.40
18    3.93
19    3.03
Name: flavanoids, dtype: float64


In [5]:
# and now we can create 5 bins (bands/groups) to put them in
flavanoids = pd.cut(X['flavanoids'], 5)
print(flavanoids.value_counts())

# this prints the range of each bin and the number of items in it
# (notice they are listed in count order, not in range order)

(2.236, 3.184]    64
(0.335, 1.288]    51
(1.288, 2.236]    43
(3.184, 4.132]    19
(4.132, 5.08]      1
Name: flavanoids, dtype: int64


Notice that the bins all have the same width, but the items are distributed unevenly across them.
We can use a different function, qcut, to obtain quantile-based bins that each hold roughly the same number of items:

In [6]:
flavanoids = pd.qcut(X['flavanoids'], 5)
print(flavanoids.value_counts())

(0.339, 0.872]    36
(1.738, 2.46]     36
(2.46, 2.98]      36
(0.872, 1.738]    35
(2.98, 5.08]      35
Name: flavanoids, dtype: int64
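
The difference between the two functions can be sketched on a small synthetic sample. The skewed data, the variable names, and the bin labels below are assumptions made for illustration and are not part of the wine dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical skewed sample, used only for illustration
rng = np.random.default_rng(0)
values = pd.Series(rng.exponential(scale=1.0, size=200), name="value")

# Equal-width bins: every interval spans (almost exactly) the same range of values
equal_width = pd.cut(values, 5)

# Equal-frequency bins: every interval holds roughly the same number of items.
# The optional labels argument replaces the interval notation with readable names.
equal_freq = pd.qcut(values, 5, labels=["very low", "low", "medium", "high", "very high"])

# sort_index() lists the bins in range order instead of count order
print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```

On skewed data like this, cut tends to put most items into the lowest bins, while qcut spreads them evenly at the cost of bins with very different widths.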


## Categorical to numeric

Let's create a categorical vehicle variable:

In [22]:
vehicles =     ['none', 'car', 'bicycle', 'skateboard']
probabilities = [0.5, 0.1, 0.3, 0.1]  # note: the probabilities must sum to 1

# by the way, this is a very useful way to quickly create a random dataset
# How does it work? Change the values and observe the outcome

vehicles_data = np.random.choice(vehicles, 100, p=probabilities)
print(vehicles_data)

['car' 'bicycle' 'none' 'none' 'bicycle' 'none' 'bicycle' 'bicycle' 'none'
 'bicycle' 'car' 'none' 'none' 'none' 'bicycle' 'bicycle' 'bicycle' 'car'
 'bicycle' 'none' 'none' 'bicycle' 'skateboard' 'none' 'none' 'bicycle'
 'skateboard' 'none' 'bicycle' 'bicycle' 'none' 'none' 'bicycle' 'bicycle'
 'bicycle' 'none' 'none' 'bicycle' 'none' 'none' 'none' 'bicycle' 'none'
 'bicycle' 'bicycle' 'bicycle' 'none' 'none' 'none' 'none' 'skateboard'
 'bicycle' 'bicycle' 'bicycle' 'none' 'bicycle' 'none' 'none' 'car' 'none'
 'bicycle' 'none' 'car' 'none' 'none' 'skateboard' 'bicycle' 'bicycle'
 'car' 'bicycle' 'none' 'bicycle' 'none' 'bicycle' 'skateboard' 'none'
 'bicycle' 'none' 'none' 'none' 'none' 'none' 'car' 'bicycle' 'none'
 'bicycle' 'none' 'car' 'none' 'car' 'bicycle' 'none' 'bicycle' 'none'
 'bicycle' 'none' 'bicycle' 'skateboard' 'none' 'none']
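
One way to see how the probabilities shape the result is to draw a larger sample and compare the observed share of each category with the values passed in. This is a minimal sketch: the seed, the sample size, and the use of NumPy's newer `default_rng` generator are choices made here for reproducibility, not something the cell above relies on:

```python
import numpy as np
import pandas as pd

vehicles = ['none', 'car', 'bicycle', 'skateboard']
probabilities = [0.5, 0.1, 0.3, 0.1]

# A seeded generator makes the sample reproducible (the seed 42 is arbitrary)
rng = np.random.default_rng(42)
sample = rng.choice(vehicles, size=10_000, p=probabilities)

# Share of each category in the sample; with many draws these should be
# close to the probabilities passed to choice()
observed = pd.Series(sample).value_counts(normalize=True)
print(observed)
```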


What are dummy variables (also called indicator variables)? Each one records, as a 0 or a 1, whether a given item has a given value. If the item at index 0 is 'car', then the first row will have 1 for vehicle_car and 0 for all other columns.

We can easily obtain dummies by using the following code:

In [25]:
dummy_vehicles = pd.get_dummies(vehicles_data, prefix='vehicle')
dummy_vehicles.head(10)

   vehicle_bicycle  vehicle_car  vehicle_none  vehicle_skateboard
0                0            1             0                   0
1                1            0             0                   0
2                0            0             1                   0
3                0            0             1                   0
4                1            0             0                   0
5                0            0             1                   0
6                1            0             0                   0
7                1            0             0                   0
8                0            0             1                   0
9                1            0             0                   0


In [26]:
# because each row contains exactly one 1, in some situations we simplify the data by removing the first column,
# since its value can be inferred (if no other column is 1, the first column would have been 1)

dummy_vehicles = pd.get_dummies(vehicles_data, prefix='vehicle', drop_first=True)
dummy_vehicles.head(10)

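# dropping a column makes more sense in some contexts than in others
# (for example in linear models, where the full set of dummy columns always sums to 1)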

   vehicle_car  vehicle_none  vehicle_skateboard
0            1             0                   0
1            0             0                   0
2            0             1                   0
3            0             1                   0
4            0             0                   0
5            0             1                   0
6            0             0                   0
7            0             0                   0
8            0             1                   0
9            0             0                   0


Notice that the first category (vehicle_bicycle) is no longer included. This is due to the ```drop_first``` parameter: the remaining columns are encoded relative to the dropped category, which is implied whenever all of them are 0.
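
To check that no information is lost, the dropped column can be rebuilt from the remaining ones. The sketch below uses a small hand-written sample rather than the random data above, and the variable names are illustrative. The `astype(int)` call is there because recent pandas versions return boolean columns from `get_dummies`:

```python
import pandas as pd

# Small hand-written sample, for illustration only
sample = pd.Series(['car', 'bicycle', 'none', 'skateboard', 'bicycle'])

full = pd.get_dummies(sample, prefix='vehicle').astype(int)
reduced = pd.get_dummies(sample, prefix='vehicle', drop_first=True).astype(int)

# The dropped column is 1 exactly where every remaining column is 0
recovered_bicycle = 1 - reduced.sum(axis=1)

print(reduced)
print(recovered_bicycle.tolist())
print(full['vehicle_bicycle'].tolist())
```

The last two printed lists should match, which shows that vehicle_bicycle is fully determined by the other three columns.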

We can also use scikit-learn:

In [28]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# We can use a label encoder to transform categories into numbers
enc = LabelEncoder()
vehicles_label = enc.fit_transform(vehicles_data)
print(vehicles_label)

[1 0 2 2 0 2 0 0 2 0 1 2 2 2 0 0 0 1 0 2 2 0 3 2 2 0 3 2 0 0 2 2 0 0 0 2 2
 0 2 2 2 0 2 0 0 0 2 2 2 2 3 0 0 0 2 0 2 2 1 2 0 2 1 2 2 3 0 0 1 0 2 0 2 0
 3 2 0 2 2 2 2 2 1 0 2 0 2 1 2 1 0 2 0 2 0 2 0 3 2 2]


You will notice that every vehicle category now has its own integer value, assigned in alphabetical order: bicycle is 0, car is 1, none is 2 and skateboard is 3.
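
The encoder keeps the mapping it learned, so the integers can be translated back. The sketch below, on a small hand-written sample, also shows one possible use of the `OneHotEncoder` imported above, which the original cell does not demonstrate. The variable names are illustrative:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Small hand-written sample, for illustration only
sample = np.array(['car', 'bicycle', 'none', 'skateboard', 'bicycle'])

enc = LabelEncoder()
labels = enc.fit_transform(sample)

# classes_ holds the categories in sorted order; the position is the integer code
mapping = {category: code for code, category in enumerate(enc.classes_)}
print(mapping)

# inverse_transform maps the integers back to the original categories
print(enc.inverse_transform(labels))

# OneHotEncoder expects a 2D array (one column per feature) and returns a
# sparse matrix by default; toarray() converts it to a dense array
onehot = OneHotEncoder().fit_transform(sample.reshape(-1, 1)).toarray()
print(onehot)
```

Keep in mind that the integer codes carry no meaning beyond identity: skateboard is not "greater than" car. When a model would otherwise read the codes as an ordering, a one-hot encoding is usually the safer choice.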