<a href="https://colab.research.google.com/github/RaghadAlnouri/Deep-learning-projects/blob/master/R_Alnouri_NLP_char_rnn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Homework: classify the origin of names using a character-level RNN

In this homework we will use an RNN-based model to perform classification. The goal is threefold:

1. Get hands-on practice with the full preprocessing pipeline needed for text classification. No preprocessing is done for you!
2. Use embeddings and RNNs together at the character level to perform classification.
3. Write a function that takes a string as input and outputs the name of the predicted class.

Here are some guidelines to help you through the steps:

1. Figure out the number of classes, and map the classes to integers (or one-hot vectors). This is needed to fit the model and train it to do classification.
2. Use the Keras tokenizer at the character level to tokenize your input into integer sequences.
3. Pad your sequences using the Keras preprocessing tools.
4. Build a model that uses, at a minimum, an embedding layer, an RNN (of your choice), and a dense layer that outputs the logits or probabilities for the target classes (name origins).
5. Fit the model and evaluate it on the test set.
6. Write a function that takes a string as input and predicts the origin (as its original string value).
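
Guidelines 1 to 3 can be sketched in plain Python before reaching for the Keras utilities. This is only an illustration on made-up toy data: the helper names (`build_vocab`, `encode`, `pad`) and the three sample names are hypothetical, and reserving index 0 for padding is an assumption of this sketch.

```python
# Toy data: (name, origin) pairs, invented for illustration only.
samples = [("Abt", "German"), ("Favre", "French"), ("Koury", "Arabic")]

# Step 1: map each class to an integer (sorted for a reproducible order).
classes = sorted({origin for _, origin in samples})
class_to_id = {c: i for i, c in enumerate(classes)}

# Step 2: build a character-level vocabulary; index 0 is reserved for padding.
def build_vocab(names):
    chars = sorted({ch for name in names for ch in name})
    return {ch: i + 1 for i, ch in enumerate(chars)}

def encode(name, vocab):
    return [vocab[ch] for ch in name]

# Step 3: right-pad (or truncate) every sequence to a common length.
def pad(seq, max_len):
    return (seq + [0] * max_len)[:max_len]

names = [name for name, _ in samples]
vocab = build_vocab(names)
max_len = max(len(n) for n in names)
X = [pad(encode(n, vocab), max_len) for n in names]
y = [class_to_id[o] for _, o in samples]

print(X)
print(y)
```

The Keras tokenizer and padding tools named in the guidelines perform the same two steps on the full dataset.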

In [None]:
%tensorflow_version 2.x
import numpy as np
from glob import glob
from sklearn.model_selection import train_test_split
import tensorflow as tf

TensorFlow 2.x selected.


In [None]:
# install missing package
!pip install unidecode


Collecting unidecode
  Downloading https://files.pythonhosted.org/packages/d0/42/d9edfed04228bacea2d824904cae367ee9efd05e6cce7ceaaedd0b0ad964/Unidecode-1.1.1-py2.py3-none-any.whl (238kB)

In [None]:
# Download the data
!wget https://download.pytorch.org/tutorial/data.zip
!unzip data.zip

--2020-03-23 01:22:18--  https://download.pytorch.org/tutorial/data.zip
Resolving download.pytorch.org (download.pytorch.org)... 13.227.198.24, 13.227.198.30, 13.227.198.106, ...
Connecting to download.pytorch.org (download.pytorch.org)|13.227.198.24|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2882130 (2.7M) [application/zip]
Saving to: ‘data.zip’


2020-03-23 01:22:18 (91.6 MB/s) - ‘data.zip’ saved [2882130/2882130]

Archive:  data.zip
   creating: data/
  inflating: data/eng-fra.txt        
   creating: data/names/
  inflating: data/names/Arabic.txt   
  inflating: data/names/Chinese.txt  
  inflating: data/names/Czech.txt    
  inflating: data/names/Dutch.txt    
  inflating: data/names/English.txt  
  inflating: data/names/French.txt   
  inflating: data/names/German.txt   
  inflating: data/names/Greek.txt    
  inflating: data/names/Irish.txt    
  inflating: data/names/Italian.txt  
  inflating: data/names/Japanese.txt  
  inflating: data/names/Kore

Load the names into a list of (name, origin) pairs, one pair per line of each language file, then split them into training and test sets.


In [None]:
import os

# Collect (name, origin) pairs; the origin is the file name without ".txt"
data = []
for filename in glob('data/names/*.txt'):
  origin = os.path.splitext(os.path.basename(filename))[0]
  with open(filename, encoding='utf-8') as f:
    for line in f:
      data.append((line.strip(), origin))

names, origins = zip(*data)
# Hold out 25% of the names for testing, with a fixed seed for the shuffle
names_train, names_test, origins_train, origins_test = train_test_split(
    names, origins, test_size=0.25, shuffle=True, random_state=123)
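
The class label comes from the file name alone, so the extraction step can be checked in isolation. The helper below is a hypothetical sketch (the name `origin_from_path` is not part of the notebook) that uses `pathlib` instead of manual string splitting, assuming forward-slash paths as in the Colab environment.

```python
from pathlib import PurePosixPath

def origin_from_path(filename):
    """Return the file name without its directory and extension."""
    # "data/names/Arabic.txt" -> "Arabic"
    return PurePosixPath(filename).stem

print(origin_from_path("data/names/Arabic.txt"))
```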

# Let's look at the data

In [None]:
for name, origin in zip(names_train[:10], origins_train[:10]):
  print(name.ljust(10), origin)

Yusuf      English
Vikhrev    Russian
Ilyuhin    Russian
Paterson   English
Abt        German
Koury      Arabic
Mihalkin   Russian
Favre      French
Frolkov    Russian
Agabekov   Russian
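
Half of the ten rows printed above are Russian, which hints that the classes may be imbalanced. A quick way to check is to count labels; the sketch below does so for just those ten printed origins (copied by hand into the hypothetical list `origins_sample`). Running the same count on `origins_train` would show the real distribution.

```python
from collections import Counter

# The ten origins printed above, copied by hand for illustration.
origins_sample = ["English", "Russian", "Russian", "English", "German",
                  "Arabic", "Russian", "French", "Russian", "Russian"]

counts = Counter(origins_sample)
for origin, n in counts.most_common():
    print(origin.ljust(10), n)
```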


In [None]:
len(names_train)

15055

In [None]:
print(names_train[:100])

['Yusuf', 'Vikhrev', 'Ilyuhin', 'Paterson', 'Abt', 'Koury', 'Mihalkin', 'Favre', 'Frolkov', 'Agabekov', 'Awturkhanoff', 'Aglinskas', 'Jigailov', 'Vallah', 'Verber', 'Bishara', 'Amari', 'Onoda', 'Viola', 'Hlopkin', 'Lind', 'Baturov', 'Farrow', 'Veletsky', 'Djabrailov', 'Yablonovsky', 'Kikutake', 'Kalihov', 'Tubolkin', 'Mikhail', 'Fearghal', 'Borovski', 'Mills', 'Balandyuk', 'Schoettmer', 'Yushkevich', 'Sha', 'Balazowski', 'Padovano', 'Kilshaw', 'Agranovitch', 'Desmond', 'Orwin', 'Cameron', 'Wiggins', 'Leys', 'Aboimov', 'Xiang', 'Lumley', 'Xun', 'Bakhelov', 'Nijo', 'Ricci', 'Velikanov', 'Naifeh', 'Destunis', 'Sakoda', 'Horn', 'Said', 'Aihara', 'Dagher', 'Grout', 'Quraishi', 'Fischer', 'King', 'Garofalo', 'Alesci', 'Gladilin', 'Zasetsky', 'Jivoluk', 'Nahmanovich', 'Zheltov', 'Bui', 'Dubrovo', 'Agapiev', 'Yanovka', 'Ko', 'Becke', 'Granger', 'Shionoya', 'Zherebin', 'Maher', 'Fricker', 'Muhlfeld', 'Yamawaki', 'Asghar', 'Agrachev', 'Agafonoff', 'Kram', 'Sakiyaev', 'Goldman', 'an', 'Kachemaev'

In [None]:
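# Collect the distinct origins; these are the target classes.
# Note: the order of a set of strings is not stable across interpreter runs,
# so the integer assigned to each class can differ between sessions.
all_categories = list(set(origins))
n_categories = len(all_categories)
n_categories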

18

In [None]:
for i, c in enumerate(all_categories):
  print(i, ":", c)

0 : Irish
1 : Scottish
2 : Czech
3 : French
4 : Chinese
5 : Vietnamese
6 : German
7 : Portuguese
8 : Dutch
9 : Arabic
10 : Russian
11 : Italian
12 : Polish
13 : Greek
14 : Japanese
15 : Spanish
16 : English
17 : Korean


In [None]:
y_train = [all_categories.index(o) for o in origins_train]
y_test = [all_categories.index(o) for o in origins_test]

In [None]:
y_train[0], origins_train[0]

(16, 'English')
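
`list.index` scans the list on every call, which is fine for 18 classes but easy to replace with a dictionary lookup. The sketch below uses a toy subset of the categories; `category_to_index` and the sample values are hypothetical names introduced for illustration.

```python
# Toy subset of the categories, for illustration only.
categories = ["Arabic", "English", "Russian"]

# Build the lookup table once, then map labels in constant time each.
category_to_index = {c: i for i, c in enumerate(categories)}

origins_sample = ["English", "Russian", "Russian", "Arabic"]
y_sample = [category_to_index[o] for o in origins_sample]
print(y_sample)
```

Both approaches give the same integers as long as the category list itself is unchanged.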

In [None]:
len(y_train), len(y_test)

(15055, 5019)
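
These sizes follow from `test_size=0.25`. The arithmetic below reproduces them, assuming scikit-learn rounds a fractional test size up to the next whole sample and gives the remainder to the training set.

```python
import math

n_samples = 15055 + 5019          # total number of names
test_size = 0.25

n_test = math.ceil(test_size * n_samples)   # 5018.5 rounds up
n_train = n_samples - n_test
print(n_train, n_test)
```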

In [None]:
# Convert the integer labels to NumPy arrays for Keras
y_train, y_test = np.asarray(y_train), np.asarray(y_test)


Preparing the Data


In [None]:
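import string
import unicodedata

# Allowed characters; '0' sits at index 0 and serves as the padding symbol
all_letters = '0' + string.ascii_letters + " .,;'"
n_letters = len(all_letters)

# Turn a Unicode string into plain ASCII: decompose accented characters (NFD),
# drop the combining marks (category 'Mn'), and keep only allowed characters
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

print(unicodeToAscii('Ślusàrski'))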

Slusarski
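
The accents disappear because NFD normalization splits each accented letter into a base letter followed by a combining mark, and combining marks have the Unicode category `Mn`. The sketch below makes that visible; `decompose` is a hypothetical helper written for this illustration.

```python
import unicodedata

def decompose(ch):
    """Return (code point, category) pairs for the NFD form of ch."""
    return [(f"U+{ord(c):04X}", unicodedata.category(c))
            for c in unicodedata.normalize("NFD", ch)]

print(decompose("Ś"))
print(decompose("à"))
```

Filtering out the `Mn` entries leaves only the base letters `S` and `a`.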


In [None]:
names_train[:10]

['Yusuf',
 'Vikhrev',
 'Ilyuhin',
 'Paterson',
 'Abt',
 'Koury',
 'Mihalkin',
 'Favre',
 'Frolkov',
 'Agabekov']

In [None]:
X_train_clean = [unicodeToAscii(x) for x in names_train]
X_test_clean = [unicodeToAscii(x) for x in names_test]

In [None]:
X_train_clean[:10]

['Yusuf',
 'Vikhrev',
 'Ilyuhin',
 'Paterson',
 'Abt',
 'Koury',
 'Mihalkin',
 'Favre',
 'Frolkov',
 'Agabekov']

In [None]:
n_categories

18

In [None]:
import numpy as np
import string

# Length of the longest name in a list (at least 1)
def compute_max_len(data):
  return max([1] + [len(name) for name in data])

# Find letter index from all_letters, e.g. "a" = 1, "0" = 0 (padding value)
def letterToIndex(letter):
    ind = all_letters.find(letter)
    if ind < 0:
      raise ValueError('unknown letter: ' + letter)
    return ind

# Turn a line into an array of one-hot letter vectors
def lineToTensor(line, max_n):
    tensor = np.zeros((max_n, n_letters))
    for li, letter in enumerate(line):
        tensor[li][letterToIndex(letter)] = 1
    # Mark every padding position with the padding symbol (column 0)
    tensor[len(line):, 0] = 1
    return tensor

# Turn a line into an array of indices, right-padded with zeros
def lineToIndex(line, max_n):
    tensor = np.zeros(max_n, dtype=int)
    for li, letter in enumerate(line):
        tensor[li] = letterToIndex(letter)
    return tensor

max_len = max(compute_max_len(X_train_clean), compute_max_len(X_test_clean))

In [None]:
max_len


19

In [None]:

# lineToTensor('Jones', max_len).shape
print(lineToIndex('Joanes', max_len))

[36 15  1 14  5 19  0  0  0  0  0  0  0  0  0  0  0  0  0]
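
For the one-hot variant (`lineToTensor`), the padding positions are meant to carry the padding symbol in column 0. The two NumPy indexing forms below look similar but write to different cells, which is easy to get wrong. The array sizes are toy values chosen for illustration.

```python
import numpy as np

n_steps, n_symbols, length = 5, 4, 2   # toy sizes; `length` real characters

# Chained indexing: [length:] is a view of the padding rows, and [0] then
# selects the first of those rows, so one whole row is set to 1.
chained = np.zeros((n_steps, n_symbols))
chained[length:][0] = 1

# Tuple indexing: selects column 0 of every padding row.
column = np.zeros((n_steps, n_symbols))
column[length:, 0] = 1

print(chained)
print(column)
```

Only the second form produces a valid one-hot vector at each padded time step.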


In [None]:
# data_train = np.array([lineToTensor(x, max_len) for x in X_train_clean])
# data_test =  np.array([lineToTensor(x, max_len) for x in X_test_clean])
data_train = np.array([lineToIndex(x, max_len) for x in X_train_clean])
data_test =  np.array([lineToIndex(x, max_len) for x in X_test_clean])

(15055, 20)

Building the network

In [None]:

model = tf.keras.models.Sequential(layers=[
  # Index 0 is the padding symbol, so mask it out for the downstream layers
  tf.keras.layers.Embedding(input_dim=n_letters,
                            output_dim=32, mask_zero=True),
  # Return the output at every time step so it can be averaged over time
  tf.keras.layers.LSTM(units=64, return_sequences=True),
  tf.keras.layers.GlobalAveragePooling1D(),
  tf.keras.layers.Dense(n_categories, activation='softmax')
])
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.005),
              metrics=['accuracy'])
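
With `mask_zero=True`, padded time steps are meant to be ignored by the layers that follow. Assuming the pooling layer honours the mask propagated by the embedding, averaging over time amounts to a masked mean. The NumPy sketch below illustrates that idea on toy values; `masked_mean` is a hypothetical helper, not a Keras function.

```python
import numpy as np

def masked_mean(outputs, token_ids):
    """Average per-step outputs over the non-padding steps only."""
    mask = (token_ids != 0).astype(outputs.dtype)[:, None]
    return (outputs * mask).sum(axis=0) / np.maximum(mask.sum(), 1.0)

# Two real characters followed by two padding steps (toy values).
token_ids = np.array([36, 15, 0, 0])
outputs = np.array([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]])

pooled = masked_mean(outputs, token_ids)
print(pooled)                 # average of the first two rows only
print(outputs.mean(axis=0))   # unmasked average, skewed by the padding rows
```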


In [None]:
model.summary()

Model: "sequential_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, None, 32)          1856      
_________________________________________________________________
lstm_8 (LSTM)                (None, None, 64)          24832     
_________________________________________________________________
global_average_pooling1d_8 ( (None, 64)                0         
_________________________________________________________________
dense_8 (Dense)              (None, 18)                1170      
Total params: 27,858
Trainable params: 27,858
Non-trainable params: 0
_________________________________________________________________
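
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias, and takes `n_letters = 58` from the alphabet defined earlier (1 padding symbol, 52 letters, 5 punctuation characters).

```python
n_letters, embed_dim, lstm_units, n_categories = 58, 32, 64, 18

# One embedding vector per symbol.
embedding_params = n_letters * embed_dim

# Four gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * ((embed_dim + lstm_units + 1) * lstm_units)

# Dense layer: weight matrix plus one bias per class.
dense_params = lstm_units * n_categories + n_categories

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
```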


In [None]:
# for i in range(0, len(data_train), 32):
#   #print(i)
#   out = model.predict(data_train[i:i+32])
#   #print(out.shape)

history = model.fit(data_train, y_train, validation_data=(data_test, y_test), batch_size=32, shuffle=True, epochs=30)

Train on 15055 samples, validate on 5019 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


In [None]:
def predict_origin(name):
  """Return the predicted origin (as its original string label) for a name."""
  assert isinstance(name, str)
  # Apply the same cleaning as for the training data and truncate to max_len
  clean = unicodeToAscii(name)[:max_len]
  x = lineToIndex(clean, max_len)
  output = model.predict(x.reshape(1, -1))
  return all_categories[np.argmax(output)]

In [None]:
predict_origin('Alnouri')

'Arabic'

Successfully predicted my origin :)
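
Since the model outputs a probability for every class, it can be informative to look at the runners-up as well as the top prediction. The sketch below ranks classes by probability; `top_k`, the label list, and the probabilities are all made up for illustration and are not outputs of the trained model.

```python
def top_k(probabilities, labels, k=3):
    """Return the k most probable (label, probability) pairs."""
    ranked = sorted(zip(probabilities, labels), reverse=True)
    return [(label, p) for p, label in ranked[:k]]

# Hypothetical softmax output over four classes.
labels = ["Arabic", "English", "Russian", "French"]
probs = [0.70, 0.05, 0.15, 0.10]

print(top_k(probs, labels, k=2))
```

In the notebook, the same helper could be fed `model.predict(...)[0]` and `all_categories`.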