# Categorical Variable Embeddings

In this activity we will build two different models:

    For the first model, categorical attributes are converted into dummy numeric variables.
    For the second model, categorical attributes are not converted into numeric variables. Each level is given a unique number starting from zero, and a categorical embedding is used.
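
The difference between the two encodings can be sketched on a toy column. The array `levels` below is made up for illustration and is not part of the dataset.

```python
import numpy as np

# Hypothetical categorical column with levels coded 1..4 (illustration only)
levels = np.array([2, 1, 4, 3, 2])
n_levels = 4

# Encoding for the first model: one dummy (one-hot) column per level
dummies = np.eye(n_levels, dtype=int)[levels - 1]

# Encoding for the second model: a single integer index per row, starting at zero
indices = levels - 1

print(dummies)
print(indices)
```

The dummy encoding needs one column per level, while the index encoding keeps a single column that an embedding layer later maps to a learned vector.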

#### Load the required libraries

In [1]:
import pandas as pd
import numpy as np

from keras.models import Sequential, Model
from keras.layers import Dense, Activation, Embedding, concatenate, Flatten, Input

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

Using TensorFlow backend.


#### Read the data

In [2]:
df = pd.read_csv('cmc.data', header=None, names=['Age','Education','H_education',
                                                 'num_child','Religion', 'Employ',
                                                 'H_occupation','living_standard',
                                                 'Media_exposure','contraceptive'])

#### Understand the data

In [3]:
df.shape

(1473, 10)

Look at the first few records

In [4]:
df.head()

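| | Age | Education | H_education | num_child | Religion | Employ | H_occupation | living_standard | Media_exposure | contraceptive |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24 | 2 | 3 | 3 | 1 | 1 | 2 | 3 | 0 | 1 |
| 1 | 45 | 1 | 3 | 10 | 1 | 1 | 3 | 4 | 0 | 1 |
| 2 | 43 | 2 | 3 | 7 | 1 | 1 | 3 | 4 | 0 | 1 |
| 3 | 42 | 3 | 2 | 9 | 1 | 1 | 3 | 3 | 0 | 1 |
| 4 | 36 | 3 | 3 | 8 | 1 | 1 | 3 | 2 | 0 | 1 |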


Summary statistics 

In [5]:
df.describe(include='all')

| | Age | Education | H_education | num_child | Religion | Employ | H_occupation | living_standard | Media_exposure | contraceptive |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 1473.0 | 1473.0 | 1473.0 | 1473.0 | 1473.0 | 1473.0 | 1473.0 | 1473.0 | 1473.0 | 1473.0 |
| mean | 32.538357 | 2.958588 | 3.429735 | 3.261371 | 0.850645 | 0.749491 | 2.137814 | 3.133741 | 0.073999 | 1.919891 |
| std | 8.227245 | 1.014994 | 0.816349 | 2.358549 | 0.356559 | 0.433453 | 0.864857 | 0.976161 | 0.261858 | 0.876376 |
| min | 16.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| 25% | 26.0 | 2.0 | 3.0 | 1.0 | 1.0 | 0.0 | 1.0 | 3.0 | 0.0 | 1.0 |
| 50% | 32.0 | 3.0 | 4.0 | 3.0 | 1.0 | 1.0 | 2.0 | 3.0 | 0.0 | 2.0 |
| 75% | 39.0 | 4.0 | 4.0 | 4.0 | 1.0 | 1.0 | 3.0 | 4.0 | 0.0 | 3.0 |
| max | 49.0 | 4.0 | 4.0 | 16.0 | 1.0 | 1.0 | 4.0 | 4.0 | 1.0 | 3.0 |


Type of all the attributes

In [6]:
df.dtypes

Age                int64
Education          int64
H_education        int64
num_child          int64
Religion           int64
Employ             int64
H_occupation       int64
living_standard    int64
Media_exposure     int64
contraceptive      int64
dtype: object

Identify the unique values for each of the attributes

In [7]:
for i in df.columns.values:
    print (i)
    print (pd.value_counts(df[i].values))

Age
25    80
26    69
32    64
30    64
28    63
35    62
24    61
22    59
27    59
29    59
36    57
33    55
37    51
34    50
21    48
31    46
23    44
38    44
47    43
45    41
42    40
44    39
43    34
39    34
40    34
41    34
48    30
20    28
49    23
46    22
19    18
17     8
18     7
16     3
dtype: int64
Education
4    577
3    410
2    334
1    152
dtype: int64
H_education
4    899
3    352
2    178
1     44
dtype: int64
num_child
2     276
1     276
3     259
4     197
5     135
0      97
6      92
7      49
8      47
9      16
11     11
10     11
12      4
13      2
16      1
dtype: int64
Religion
1    1253
0     220
dtype: int64
Employ
1    1104
0     369
dtype: int64
H_occupation
3    585
1    436
2    425
4     27
dtype: int64
living_standard
4    684
3    431
2    229
1    129
dtype: int64
Media_exposure
0    1364
1     109
dtype: int64
contraceptive
1    629
3    511
2    333
dtype: int64


Observation:

    Education, H_education, Religion, Employ, H_occupation, living_standard, Media_exposure and contraceptive each take a fixed set of values. Even though they are stored as numbers, they are categorical attributes.

Note:

    Religion, Employ and Media_exposure are categorical attributes with only two values/levels.
    Dummification would leave them as 0/1 columns, so they are used directly as numeric attributes.
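
A small check of the note above, on a made-up binary column (`binary` is a hypothetical example, not a dataset column): one-hot encoding a 0/1 attribute produces two columns, and the column for level 1 is identical to the original values.

```python
import numpy as np

# Hypothetical 0/1 attribute (illustration only)
binary = np.array([1, 0, 1, 1, 0])

# One-hot encode: column 0 flags level 0, column 1 flags level 1
one_hot = np.eye(2, dtype=int)[binary]
level_0_column = one_hot[:, 0]
level_1_column = one_hot[:, 1]

# The level-1 column repeats the original data; the level-0 column is its complement
same_as_original = np.array_equal(level_1_column, binary)
complement = np.array_equal(level_0_column, 1 - binary)
print(same_as_original, complement)
```

Since the second dummy column carries no extra information, keeping the original 0/1 column is sufficient.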

##### The following two cells are for explanation purposes

In [8]:
pd.value_counts(df.contraceptive.values)

1    629
3    511
2    333
dtype: int64

In [9]:
pd.value_counts(df.contraceptive.values-1)

0    629
2    511
1    333
dtype: int64

For the second model, each level/value of each independent categorical attribute is represented as a number starting from zero.

    All categorical attributes have levels starting at 1, so subtract one.
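
The reason for the shift is that a lookup table with `n` rows can only be indexed with 0 to `n - 1`. A minimal NumPy sketch, where `table` is a stand-in for embedding weights (an assumption for illustration, not Keras code):

```python
import numpy as np

n_levels = 4
# Stand-in lookup table: 4 rows (one per level), 2-dimensional vectors
table = np.arange(n_levels * 2, dtype=float).reshape(n_levels, 2)

raw_levels = np.array([1, 2, 3, 4])  # levels as coded in the data
zero_based = raw_levels - 1          # shifted to 0..3

# Zero-based indices address every row of the table
vectors = table[zero_based]

# The raw codes overflow the table: index 4 does not exist
try:
    table[raw_levels]
    out_of_range = False
except IndexError:
    out_of_range = True

print(vectors.shape, out_of_range)
```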

In [10]:
cat_cols = ['Education', 'H_education', 'H_occupation', 'living_standard', 'contraceptive']
df[cat_cols] = df[cat_cols] - 1

In [11]:
edu_ind_attr = df.Education.values
h_edu_ind_attr = df.H_education.values
h_occ_ind_attr = df.H_occupation.values
liv_std_ind_attr = df.living_standard.values 

Convert the attributes to appropriate type

In [12]:
for col in ['Education', 'H_education', 'H_occupation', 'living_standard', 'contraceptive']:
    df[col] = df[col].astype('category')

In [13]:
df.dtypes

Age                   int64
Education          category
H_education        category
num_child             int64
Religion              int64
Employ                int64
H_occupation       category
living_standard    category
Media_exposure        int64
contraceptive      category
dtype: object

Summary Statistics

In [14]:
df.describe(include = "all")

| | Age | Education | H_education | num_child | Religion | Employ | H_occupation | living_standard | Media_exposure | contraceptive |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 1473.0 | 1473.0 | 1473.0 | 1473.0 | 1473.0 | 1473.0 | 1473.0 | 1473.0 | 1473.0 | 1473.0 |
| unique | | 4.0 | 4.0 | | | | 4.0 | 4.0 | | 3.0 |
| top | | 3.0 | 3.0 | | | | 2.0 | 3.0 | | 0.0 |
| freq | | 577.0 | 899.0 | | | | 585.0 | 684.0 | | 629.0 |
| mean | 32.538357 | | | 3.261371 | 0.850645 | 0.749491 | | | 0.073999 | |
| std | 8.227245 | | | 2.358549 | 0.356559 | 0.433453 | | | 0.261858 | |
| min | 16.0 | | | 0.0 | 0.0 | 0.0 | | | 0.0 | |
| 25% | 26.0 | | | 1.0 | 1.0 | 0.0 | | | 0.0 | |
| 50% | 32.0 | | | 3.0 | 1.0 | 1.0 | | | 0.0 | |
| 75% | 39.0 | | | 4.0 | 1.0 | 1.0 | | | 0.0 | |


#### Missing value imputation

In [15]:
df.isnull().sum()

Age                0
Education          0
H_education        0
num_child          0
Religion           0
Employ             0
H_occupation       0
living_standard    0
Media_exposure     0
contraceptive      0
dtype: int64

#### Select only numeric independent attributes

In [16]:
num_ind_attr_names = df.select_dtypes(include=['int64']).columns

In [17]:
num_ind_attr = df[num_ind_attr_names]

In [18]:
num_ind_attr.head()

| | Age | num_child | Religion | Employ | Media_exposure |
|---|---|---|---|---|---|
| 0 | 24 | 3 | 1 | 1 | 0 |
| 1 | 45 | 10 | 1 | 1 | 0 |
| 2 | 43 | 7 | 1 | 1 | 0 |
| 3 | 42 | 9 | 1 | 1 | 0 |
| 4 | 36 | 8 | 1 | 1 | 0 |


#### Min Max Scaling

In [19]:
scaler = MinMaxScaler()

scaled_num_ind_attr = scaler.fit_transform(num_ind_attr) 

In [20]:
scaled_num_ind_attr

array([[0.24242424, 0.1875    , 1.        , 1.        , 0.        ],
       [0.87878788, 0.625     , 1.        , 1.        , 0.        ],
       [0.81818182, 0.4375    , 1.        , 1.        , 0.        ],
       ...,
       [0.6969697 , 0.5       , 1.        , 0.        , 0.        ],
       [0.51515152, 0.25      , 1.        , 0.        , 0.        ],
       [0.03030303, 0.0625    , 1.        , 1.        , 0.        ]])
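
Min-max scaling maps each column to the range [0, 1] with `(x - min) / (max - min)`. The sketch below applies that formula by hand to the first three rows of `Age` and `num_child`, using the minima and maxima reported by `df.describe()` above, and reproduces the first two columns of the scaled array.

```python
import numpy as np

# First three rows of Age and num_child, copied from df.head()
rows = np.array([[24, 3], [45, 10], [43, 7]], dtype=float)

# Column minima and maxima, copied from df.describe()
col_min = np.array([16.0, 0.0])
col_max = np.array([49.0, 16.0])

# Min-max formula applied column-wise via broadcasting
scaled = (rows - col_min) / (col_max - col_min)
print(scaled)
```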

#### Select categorical attributes

In [21]:
cat_attr_names = df.select_dtypes(include=['category']).columns

cat_attr_names

Index([u'Education', u'H_education', u'H_occupation', u'living_standard',
       u'contraceptive'],
      dtype='object')

In [22]:
cat_ind_attr_names = cat_attr_names[cat_attr_names != 'contraceptive']

print(cat_ind_attr_names)

cat_tar_attr_names = ['contraceptive']

Index([u'Education', u'H_education', u'H_occupation', u'living_standard'], dtype='object')


#### Dummification

    For the first model, convert the independent categorical attributes to numeric using dummification.

    The target categorical attribute has more than two levels. For both models, the target categorical attribute is converted to numeric using dummification.

Convert the independent categorical attributes to numeric using dummification

In [23]:
cat_ind_attr = pd.get_dummies(df[cat_ind_attr_names]).values

In [24]:
cat_ind_attr

array([[0, 1, 0, ..., 0, 1, 0],
       [1, 0, 0, ..., 0, 0, 1],
       [0, 1, 0, ..., 0, 0, 1],
       ...,
       [0, 0, 1, ..., 0, 0, 1],
       [0, 0, 1, ..., 1, 0, 0],
       [0, 0, 1, ..., 0, 0, 1]], dtype=uint8)

In [25]:
cat_ind_attr.shape

(1473, 16)

Convert the target categorical attribute to numeric using dummification

In [26]:
cat_tar_attr = pd.get_dummies(df[cat_tar_attr_names]).values

In [27]:
cat_tar_attr

array([[1, 0, 0],
       [1, 0, 0],
       [1, 0, 0],
       ...,
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1]], dtype=uint8)

In [28]:
cat_tar_attr.shape

(1473, 3)

In [29]:
# Number of distinct levels per categorical attribute (used as Embedding input_dim)
n_edu_levels = np.unique(edu_ind_attr).size
n_h_edu_levels = np.unique(h_edu_ind_attr).size
n_h_occ_levels = np.unique(h_occ_ind_attr).size
n_liv_std_levels = np.unique(liv_std_ind_attr).size

In [30]:
print(n_edu_levels, n_h_edu_levels, n_h_occ_levels, n_liv_std_levels)

(4, 4, 4, 4)


In [31]:
(scaled_num_ind_attr_train, scaled_num_ind_attr_test,
 cat_ind_attr_train, cat_ind_attr_test,
 edu_ind_attr_train, edu_ind_attr_test,
 h_edu_ind_attr_train, h_edu_ind_attr_test,
 h_occ_ind_attr_train, h_occ_ind_attr_test,
 liv_std_ind_attr_train, liv_std_ind_attr_test,
 Y_train, Y_test) = train_test_split(scaled_num_ind_attr,
                                     cat_ind_attr,
                                     edu_ind_attr,
                                     h_edu_ind_attr,
                                     h_occ_ind_attr,
                                     liv_std_ind_attr,
                                     cat_tar_attr,
                                     test_size=0.1, random_state=123)
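
`train_test_split` accepts several arrays of equal length and splits all of them with the same row indices, which is why the features for both models and the target stay aligned. The idea can be sketched with NumPy alone; the code below illustrates the principle on made-up arrays and is not scikit-learn's implementation.

```python
import numpy as np

rng = np.random.default_rng(123)

# Hypothetical aligned arrays: features[i] always equals 10 * targets[i]
n_rows = 10
targets = np.arange(n_rows)
features = targets * 10

# One shared permutation decides which rows go to test and which to train
perm = rng.permutation(n_rows)
n_test = int(np.ceil(0.1 * n_rows))
test_idx, train_idx = perm[:n_test], perm[n_test:]

features_train, features_test = features[train_idx], features[test_idx]
targets_train, targets_test = targets[train_idx], targets[test_idx]

# Rows still correspond after the split
print(np.array_equal(features_train, targets_train * 10))
```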


### Build First Model

In [32]:
X_train = np.hstack((scaled_num_ind_attr_train, cat_ind_attr_train))
X_test = np.hstack((scaled_num_ind_attr_test, cat_ind_attr_test))

In [33]:
model = Sequential()
model.add(Dense(12, input_dim=X_train.shape[1], activation='relu'))
model.add(Dense(3, activation='softmax'))

In [34]:
model.compile(loss='categorical_crossentropy', optimizer='adagrad', metrics=['accuracy'])

In [35]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 12)                264       
_________________________________________________________________
dense_2 (Dense)              (None, 3)                 39        
Total params: 303
Trainable params: 303
Non-trainable params: 0
_________________________________________________________________
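
The parameter counts in the summary can be reproduced by hand. A dense layer has one weight per input-output pair plus one bias per output, and the input here is the 5 scaled numeric columns stacked with the 16 dummy columns.

```python
# Input width: 5 scaled numeric attributes + 16 dummy columns
n_inputs = 5 + 16
hidden_units = 12
n_classes = 3

# weights + biases for each dense layer
dense_1_params = n_inputs * hidden_units + hidden_units
dense_2_params = hidden_units * n_classes + n_classes
total_params = dense_1_params + dense_2_params

print(dense_1_params, dense_2_params, total_params)  # 264 39 303
```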


In [36]:
model.fit(X_train, Y_train, epochs=100, verbose=2)

Epoch 1/100
 - 0s - loss: 1.0502 - acc: 0.4204
Epoch 2/100
 - 0s - loss: 1.0231 - acc: 0.4377
Epoch 3/100
 - 0s - loss: 1.0128 - acc: 0.4702
Epoch 4/100
 - 0s - loss: 1.0068 - acc: 0.4762
Epoch 5/100
 - 0s - loss: 1.0024 - acc: 0.4845
Epoch 6/100
 - 0s - loss: 0.9988 - acc: 0.4830
Epoch 7/100
 - 0s - loss: 0.9961 - acc: 0.4891
Epoch 8/100
 - 0s - loss: 0.9934 - acc: 0.4906
Epoch 9/100
 - 0s - loss: 0.9909 - acc: 0.4974
Epoch 10/100
 - 0s - loss: 0.9888 - acc: 0.4928
Epoch 11/100
 - 0s - loss: 0.9870 - acc: 0.4936
Epoch 12/100
 - 0s - loss: 0.9853 - acc: 0.4966
Epoch 13/100
 - 0s - loss: 0.9835 - acc: 0.4921
Epoch 14/100
 - 0s - loss: 0.9825 - acc: 0.4996
Epoch 15/100
 - 0s - loss: 0.9809 - acc: 0.4981
Epoch 16/100
 - 0s - loss: 0.9799 - acc: 0.4958
Epoch 17/100
 - 0s - loss: 0.9786 - acc: 0.5019
Epoch 18/100
 - 0s - loss: 0.9776 - acc: 0.5011
Epoch 19/100
 - 0s - loss: 0.9765 - acc: 0.4974
Epoch 20/100
 - 0s - loss: 0.9755 - acc: 0.4958
Epoch 21/100
 - 0s - loss: 0.9746 - acc: 0.4928
...

<keras.callbacks.History at 0x7f32f42a7b50>

In [37]:
model.evaluate(X_test, Y_test)



[0.9902546824635686, 0.4797297305351979]

In [38]:
model.metrics_names

['loss', 'acc']

In [39]:
p = model.predict(X_test)
p[:5]

array([[0.6905642 , 0.1787483 , 0.13068743],
       [0.34813896, 0.28640974, 0.36545134],
       [0.4910512 , 0.1859255 , 0.32302326],
       [0.3624086 , 0.10890827, 0.5286831 ],
       [0.53287077, 0.13681106, 0.33031815]], dtype=float32)
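
Each row of the prediction is a softmax output, so it sums to one and the predicted class is the column with the largest probability. A short sketch using the five rows printed above; adding one undoes the earlier shift of the target levels to zero-based codes.

```python
import numpy as np

# The five predicted probability rows shown above
p_head = np.array([[0.6905642,  0.1787483,  0.13068743],
                   [0.34813896, 0.28640974, 0.36545134],
                   [0.4910512,  0.1859255,  0.32302326],
                   [0.3624086,  0.10890827, 0.5286831],
                   [0.53287077, 0.13681106, 0.33031815]])

row_sums = p_head.sum(axis=1)             # each is 1 up to rounding
predicted_index = p_head.argmax(axis=1)   # zero-based class index
predicted_label = predicted_index + 1     # original contraceptive codes 1..3

print(predicted_index, predicted_label)
```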

### Build Second Model

<img src='img/model_architecture.png' />

Categorical embedding for Education

In [40]:
edu_input = Input(shape=(1, ), name="edu")
edu_embed = Embedding(input_dim=n_edu_levels, output_dim=4)(edu_input)
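
Conceptually, an embedding layer is a trainable lookup table with one row per level. The NumPy sketch below uses a random matrix `W` as a stand-in for the learned weights (an assumption for illustration; it does not call Keras) and shows that the lookup gives the same result as multiplying a one-hot vector by `W`.

```python
import numpy as np

rng = np.random.default_rng(0)

n_levels, embed_dim = 4, 4
# Stand-in for the embedding weights: one row per level
W = rng.normal(size=(n_levels, embed_dim))

# Hypothetical zero-based Education codes for three rows
edu = np.array([1, 0, 3])

# Embedding lookup: pick the row of W that belongs to each code
looked_up = W[edu]

# Same result through one-hot encoding followed by a matrix product
one_hot = np.eye(n_levels)[edu]
via_matmul = one_hot @ W

print(np.allclose(looked_up, via_matmul))
```

Seen this way, the embedding is a dense layer on top of the dummy encoding, stored as a table so the one-hot vectors never have to be materialised.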

Categorical embedding for H_education

In [41]:
h_edu_input = Input(shape=(1, ), name="h_edu")
h_edu_embed = Embedding(input_dim=n_h_edu_levels, output_dim=4)(h_edu_input)

Categorical embedding for H_occupation

In [42]:
h_occ_input = Input(shape=(1, ), name="occ")
h_occ_embed = Embedding(input_dim=n_h_occ_levels, output_dim=4)(h_occ_input)

Categorical embedding for living_standard

In [43]:
liv_input = Input(shape=(1, ), name="Liv")
liv_embed = Embedding(input_dim=n_liv_std_levels, output_dim=4)(liv_input)

Concatenate the outputs of all four embedding layers and flatten the result

In [44]:
merge_cat_emb = concatenate([edu_embed, h_edu_embed, h_occ_embed, liv_embed])
merge_cat_emb_flat = Flatten()(merge_cat_emb)
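
The shapes involved can be traced with plain NumPy arrays. Each embedding of a length-1 input yields a tensor of shape `(batch, 1, 4)`; concatenating along the last axis gives `(batch, 1, 16)`, and flattening gives `(batch, 16)`. The zero arrays below are placeholders standing in for the embedding outputs.

```python
import numpy as np

batch = 2
# Placeholders for the four embedding outputs, each (batch, 1, 4)
edu_e = np.zeros((batch, 1, 4))
h_edu_e = np.zeros((batch, 1, 4))
h_occ_e = np.zeros((batch, 1, 4))
liv_e = np.zeros((batch, 1, 4))

# Concatenate on the last axis, then flatten everything except the batch axis
merged = np.concatenate([edu_e, h_edu_e, h_occ_e, liv_e], axis=-1)
flat = merged.reshape(batch, -1)

print(merged.shape, flat.shape)  # (2, 1, 16) (2, 16)
```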

In [45]:
num_input = Input(shape=(scaled_num_ind_attr_train.shape[1], ))

merge_two = concatenate([merge_cat_emb_flat, num_input])

merged_layer = Dense(8, activation= 'relu')(merge_two)
output_layer = Dense(3, activation='softmax')(merged_layer)

model = Model(inputs=[edu_input, h_edu_input, h_occ_input, liv_input, num_input], outputs=output_layer)
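
As a rough size check, the trainable parameters of this architecture can be counted by hand from the layer definitions. The figures below are derived from the code above, not read from a model summary, so treat them as a sketch.

```python
# Each embedding stores input_dim * output_dim weights (4 levels x 4 dimensions)
n_embeddings = 4
embedding_params = n_embeddings * (4 * 4)

# Dense(8) sees 16 flattened embedding values + 5 scaled numeric attributes
merged_width = 16 + 5
dense_hidden_params = merged_width * 8 + 8

# Dense(3) softmax output layer
dense_output_params = 8 * 3 + 3

total_params = embedding_params + dense_hidden_params + dense_output_params
print(embedding_params, dense_hidden_params, dense_output_params, total_params)
```

Under these assumptions the second model has slightly fewer parameters than the 303 of the first model, even though every level gets its own 4-dimensional vector.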

In [46]:
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
edu (InputLayer)                (None, 1)            0                                            
__________________________________________________________________________________________________
h_edu (InputLayer)              (None, 1)            0                                            
__________________________________________________________________________________________________
occ (InputLayer)                (None, 1)            0                                            
__________________________________________________________________________________________________
Liv (InputLayer)                (None, 1)            0                                            
__________________________________________________________________________________________________
...

In [47]:
model.compile(loss='categorical_crossentropy', optimizer='adagrad', metrics=['accuracy'])

In [48]:
model.fit([edu_ind_attr_train, h_edu_ind_attr_train,
           h_occ_ind_attr_train, liv_std_ind_attr_train, 
           scaled_num_ind_attr_train], 
          y=Y_train, 
          epochs=150, verbose=2)

Epoch 1/150
 - 0s - loss: 1.0540 - acc: 0.4317
Epoch 2/150
 - 0s - loss: 1.0360 - acc: 0.4392
Epoch 3/150
 - 0s - loss: 1.0257 - acc: 0.4558
Epoch 4/150
 - 0s - loss: 1.0175 - acc: 0.4566
Epoch 5/150
 - 0s - loss: 1.0114 - acc: 0.4604
Epoch 6/150
 - 0s - loss: 1.0063 - acc: 0.4694
Epoch 7/150
 - 0s - loss: 1.0020 - acc: 0.4702
Epoch 8/150
 - 0s - loss: 0.9986 - acc: 0.4830
Epoch 9/150
 - 0s - loss: 0.9960 - acc: 0.4815
Epoch 10/150
 - 0s - loss: 0.9933 - acc: 0.4830
Epoch 11/150
 - 0s - loss: 0.9912 - acc: 0.4868
Epoch 12/150
 - 0s - loss: 0.9889 - acc: 0.4943
Epoch 13/150
 - 0s - loss: 0.9870 - acc: 0.4936
Epoch 14/150
 - 0s - loss: 0.9853 - acc: 0.4981
Epoch 15/150
 - 0s - loss: 0.9837 - acc: 0.5049
Epoch 16/150
 - 0s - loss: 0.9824 - acc: 0.5004
Epoch 17/150
 - 0s - loss: 0.9811 - acc: 0.5019
Epoch 18/150
 - 0s - loss: 0.9799 - acc: 0.5042
Epoch 19/150
 - 0s - loss: 0.9787 - acc: 0.5042
Epoch 20/150
 - 0s - loss: 0.9776 - acc: 0.5094
Epoch 21/150
 - 0s - loss: 0.9765 - acc: 0.5072
...

<keras.callbacks.History at 0x7f32eaf5fa50>

In [49]:
model.evaluate([edu_ind_attr_test, h_edu_ind_attr_test, h_occ_ind_attr_test, liv_std_ind_attr_test, scaled_num_ind_attr_test], Y_test)



[0.970124280130541, 0.5067567567567568]
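
To put the two test accuracies in perspective, they can be converted into counts of correctly classified rows. The sketch assumes the test set has 148 rows, that is 10% of 1473 rounded up, and compares both models with always predicting the most frequent class (629 of 1473 rows in the full data).

```python
import math

n_rows = 1473
n_test = math.ceil(0.1 * n_rows)  # assumed test-set size: 148

# Test accuracies reported by model.evaluate for the two models
acc_dummy_model = 0.4797297305351979
acc_embedding_model = 0.5067567567567568

correct_dummy = round(acc_dummy_model * n_test)
correct_embedding = round(acc_embedding_model * n_test)

# Share of the most frequent class in the full dataset
majority_share = 629 / n_rows

print(n_test, correct_dummy, correct_embedding, round(majority_share, 3))
```

The gap between the models is a handful of test rows, so a single 10% split is weak evidence that either encoding is better.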

In [50]:
p = model.predict([edu_ind_attr_test, h_edu_ind_attr_test, h_occ_ind_attr_test, liv_std_ind_attr_test, scaled_num_ind_attr_test])
p[:5]

array([[0.7307578 , 0.16653378, 0.1027085 ],
       [0.44281942, 0.30038533, 0.25679523],
       [0.47137308, 0.17236717, 0.35625976],
       [0.41553497, 0.090563  , 0.49390203],
       [0.57974356, 0.13192008, 0.2883363 ]], dtype=float32)