#### Reading this notebook:
This notebook uses little markdown but is well commented; reading through the comments should get you through the code!

> Let's extract our sentences and have a look at the data we will be dealing with

In [1]:
import numpy as np
import pandas as pd

data = pd.read_csv("all-sentences.txt", names=["sentence", "language"], header=None, delimiter="|")
data.describe()

| | sentence | language |
|---|---|---|
| count | 69347 | 69347 |
| unique | 69347 | 7 |
| top | Ernesto estió o primero en prener o títol. | Galego |
| freq | 1 | 10000 |


- We have a total of 69,347 distinct sentences in 7 languages. Our task is to train a recurrent neural network based on LSTMs to correctly identify which language a sentence is written in.
- At first this seems like a trivial task, but what's important here is that we are not defining any rules to identify the language; instead we just give the data to a neural network to train on so that it learns to identify the language correctly!

In [2]:
data.head()

| | sentence | language |
|---|---|---|
| 0 | Cuando llegaron los manifestantes, la escuela ... | Castellano |
| 1 | Experto Comisión Mundial de Áreas Protegidas –... | Castellano |
| 2 | Ya habían pasado tres años de la condena y seg... | Castellano |
| 3 | “Buscamos la adhesión porque todos tenemos una... | Castellano |
| 4 | La aplicación del Plan con los alumnos se real... | Castellano |


This is how the data is organised

#### Data cleaning and restructuring

In [3]:
import re

def process_sentence(sentence):
    '''Removes a fixed set of punctuation characters from the sentence.
    It also strips leading and trailing whitespace and makes the string lowercase.
    '''
    return re.sub(r'[\\\\/:*«`\'?¿";!<>,.|]', '', sentence.lower().strip())
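
As a quick check of what the cleaner does, the sketch below repeats `process_sentence` from the cell above and runs it on a made-up sentence (the sample string is an assumption, not taken from the dataset). Characters that are not listed in the character class, such as `»`, pass through unchanged, and whitespace inside the sentence is left as it is.

```python
import re

def process_sentence(sentence):
    # Same regex as the cell above: drop the listed punctuation,
    # lowercase the text and trim whitespace at both ends.
    return re.sub(r'[\\\\/:*«`\'?¿";!<>,.|]', '', sentence.lower().strip())

sample = "  ¿Dónde está «la escuela»?  "  # hypothetical input
cleaned = process_sentence(sample)
print(cleaned)
```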

In [4]:
from sklearn.model_selection import StratifiedShuffleSplit

# As our sentences in all-sentences.txt are in order, we need to shuffle them first.
# Note: n_splits defaults to 10, so the loop below ends up keeping the last of the 10 splits.
sss = StratifiedShuffleSplit(test_size=0.2, random_state=0)

# Clean the sentences
X = data["sentence"].apply(process_sentence)
y = data["language"]

# Split all our sentences into one flat list of words
elements = ' '.join(X).split()

X_train, X_test, y_train, y_test = None, None, None, None

for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Now that we have split our data into training and testing sets, our target is to achieve as good an accuracy as possible on the test data, which our neural network hasn't seen during training!
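
To make the idea of a stratified split concrete, here is a minimal pure-Python sketch. It is not how `StratifiedShuffleSplit` is implemented; the function name `stratified_split` and the toy labels are assumptions made for illustration. It shuffles the indices of each language separately and sends the same fraction of each to the test set, so both sets keep roughly the same language mix.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with a similar label mix in both."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)                      # shuffle within one language
        n_test = round(len(indices) * test_size)  # same fraction per language
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels (made up): 10 sentences of one language, 5 of another.
labels = ["Galego"] * 10 + ["Euskara"] * 5
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```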

#### Basic EDA

In [5]:
languages = set(y)
print("Languages in our dataset: {}".format(languages))

Languages in our dataset: {'Català', 'Euskara', 'Aragonés', 'Castellano', 'Asturianu', 'Occitan', 'Galego'}


In [6]:
print("Feature Shapes:")
print("\tTrain set: \t\t{}".format(X_train.shape),
      "\n\tTest set: \t\t{}".format(X_test.shape))
print("Totals:\n\tWords in our Dataset: {}\n\tLanguages: {}".format(len(elements), len(languages)))

Feature Shapes:
	Train set: 		(55477,) 
	Test set: 		(13870,)
Totals:
	Words in our Dataset: 1348815
	Languages: 7


#### Preprocessing

We need to convert our text data into a format that neural networks can consume (numbers).

In [7]:
def create_lookup_tables(text):
    """Create lookup tables for vocabulary
    :param text: The text split into words
    :return: A tuple of dicts (vocab_to_int, int_to_vocab)
    """
    vocab = set(text)
    
    vocab_to_int = {word: i for i, word in enumerate(vocab)}
    int_to_vocab = {v:k for k, v in vocab_to_int.items()}
    
    return vocab_to_int, int_to_vocab

In [8]:
elements.append("<UNK>")

# Map our vocabulary to int
vocab_to_int, int_to_vocab = create_lookup_tables(elements)
languages_to_int, int_to_languages = create_lookup_tables(y)

print("Vocabulary of our dataset: {}".format(len(vocab_to_int)))

Vocabulary of our dataset: 187734


In [9]:
def convert_to_int(data, data_int):
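    """Converts all our text to integers
    :param data: The sentences to be converted
    :param data_int: Dict mapping each word to its integer id
    :return: All sentences as lists of ints (words missing from data_int map to "<UNK>")
    """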
    all_items = []
    for sentence in data: 
        all_items.append([data_int[word] if word in data_int else data_int["<UNK>"] for word in sentence.split()])
    
    return all_items
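
The toy run below shows the lookup and encoding steps end to end on a handful of made-up words. It is a sketch under one deliberate change: the vocabulary is sorted so that the ids are reproducible, whereas the cells above enumerate a `set`, whose order can differ between runs. The word `playa` is not in the toy vocabulary, so it falls back to the `<UNK>` id.

```python
def create_lookup_tables(words):
    # sorted() gives stable ids; the notebook enumerates an unordered set instead.
    vocab = sorted(set(words))
    vocab_to_int = {word: i for i, word in enumerate(vocab)}
    int_to_vocab = {i: word for word, i in vocab_to_int.items()}
    return vocab_to_int, int_to_vocab

def convert_to_int(sentences, lookup):
    # Unknown words are replaced by the id of the "<UNK>" token.
    unk = lookup["<UNK>"]
    return [[lookup.get(word, unk) for word in s.split()] for s in sentences]

words = "la casa es grande".split() + ["<UNK>"]  # hypothetical corpus
vocab_to_int, int_to_vocab = create_lookup_tables(words)
encoded = convert_to_int(["la casa", "la playa es grande"], vocab_to_int)
print(vocab_to_int)
print(encoded)
```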

In [10]:
# Convert our inputs
X_test_encoded = convert_to_int(X_test, vocab_to_int)
X_train_encoded = convert_to_int(X_train, vocab_to_int)

y_data = convert_to_int(y_test, languages_to_int)

In [11]:
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()

# Fit the encoder once, on the training labels, and reuse it for both sets
y_train_int = convert_to_int(y_train, languages_to_int)
enc.fit(y_train_int)

# One hot encoding our outputs
y_train_encoded = enc.transform(y_train_int).toarray()
y_test_encoded = enc.transform(y_data).toarray()
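
For readers unfamiliar with one-hot encoding, this pure-Python sketch shows the transformation on three made-up labels. The helper `one_hot` and the three-language mapping are assumptions for illustration; the notebook itself relies on scikit-learn's `OneHotEncoder`.

```python
def one_hot(class_ids, n_classes):
    """Turn integer class ids into rows with a single 1.0 at the class position."""
    rows = []
    for class_id in class_ids:
        row = [0.0] * n_classes
        row[class_id] = 1.0
        rows.append(row)
    return rows

languages_to_int = {"Castellano": 0, "Euskara": 1, "Galego": 2}  # toy mapping
labels = ["Galego", "Castellano", "Euskara"]
encoded = one_hot([languages_to_int[name] for name in labels], len(languages_to_int))
print(encoded)
```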

#### Making sure the system configurations are right

In [12]:
from distutils.version import LooseVersion
import warnings
import tensorflow as tf

# Check TensorFlow Version
assert LooseVersion(tf.__version__) >= LooseVersion('1.0'), 'Please use TensorFlow version 1.0 or newer'
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
if not tf.test.gpu_device_name():
    warnings.warn('No GPU found. Please use a GPU to train your neural network.')
else:
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))



TensorFlow Version: 1.7.0
Default GPU Device: /device:GPU:0


#### Modeling

In [13]:
# Hyperparameters
max_sentence_length = 200
embedding_vector_length = 300
dropout = 0.5

In [14]:
# Import Keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence

Using TensorFlow backend.


In [15]:
# Truncate and pad input sentences
X_train_pad = sequence.pad_sequences(X_train_encoded, maxlen=max_sentence_length)
X_test_pad = sequence.pad_sequences(X_test_encoded, maxlen=max_sentence_length)
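
The sketch below mimics what `pad_sequences` does with what I understand to be its defaults (`padding='pre'`, `truncating='pre'`, `value=0`): shorter sequences get zeros on the left and longer ones lose their earliest items. The helper name `pad_or_truncate` is hypothetical and `maxlen` is assumed to be positive. Note also that `create_lookup_tables` starts its ids at 0, so one vocabulary entry shares its id with the padding value.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Keep the last `maxlen` items and left-pad shorter sequences with `value`."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

short = pad_or_truncate([7, 3, 9], maxlen=5)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)
print(long)
```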

In [16]:
# Create the model
model = Sequential()

model.add(Embedding(len(vocab_to_int), embedding_vector_length, input_length=max_sentence_length))
model.add(LSTM(256, return_sequences=True, dropout=dropout, recurrent_dropout=dropout))
model.add(LSTM(256, dropout=dropout, recurrent_dropout=dropout))
model.add(Dense(len(languages), activation='softmax'))

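# Note: for a 7-way softmax, 'categorical_crossentropy' is the conventional loss. With
# 'binary_crossentropy', Keras resolves 'accuracy' to per-output binary accuracy.
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])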


In [17]:
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 200, 300)          56320200  
_________________________________________________________________
lstm_1 (LSTM)                (None, 200, 256)          570368    
_________________________________________________________________
lstm_2 (LSTM)                (None, 256)               525312    
_________________________________________________________________
dense_1 (Dense)              (None, 7)                 1799      
Total params: 57,417,679
Trainable params: 57,417,679
Non-trainable params: 0
_________________________________________________________________
None
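
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers (the helper names are my own) with the sizes from this notebook: a vocabulary of 187,734 words, 300-dimensional embeddings, 256 LSTM units and 7 output classes.

```python
def embedding_params(vocab_size, dim):
    # One vector of length `dim` per vocabulary entry.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

counts = [
    embedding_params(187734, 300),  # embedding_1
    lstm_params(300, 256),          # lstm_1
    lstm_params(256, 256),          # lstm_2
    dense_params(256, 7),           # dense_1
]
print(counts, sum(counts))
```

Almost all of the parameters sit in the embedding layer, which is a direct consequence of the large vocabulary.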


In [18]:
# Train the model
model.fit(X_train_pad, y_train_encoded, epochs=2, batch_size=256)

# Final evaluation of the model
scores = model.evaluate(X_test_pad, y_test_encoded, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Epoch 1/2
Epoch 2/2
Accuracy: 99.64%
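
One caveat when reading this number: as far as I understand Keras 2.x, `metrics=['accuracy']` combined with `loss='binary_crossentropy'` is resolved to binary accuracy, which scores each of the 7 outputs separately rather than each sentence. The sketch below, on made-up predictions and with my own helper names, shows how the two measures can differ; it is an illustration, not a re-evaluation of the model.

```python
def categorical_accuracy(y_true, y_pred):
    """Fraction of rows whose largest predicted score is at the true class."""
    hits = sum(
        max(range(len(p)), key=p.__getitem__) == t.index(1)
        for t, p in zip(y_true, y_pred)
    )
    return hits / len(y_true)

def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of individual outputs that match after thresholding."""
    cells = [
        (p_i > threshold) == bool(t_i)
        for t, p in zip(y_true, y_pred)
        for t_i, p_i in zip(t, p)
    ]
    return sum(cells) / len(cells)

# Made-up predictions for two sentences over 7 languages; the second is wrong.
y_true = [[1, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0]]
y_pred = [[0.9, 0.1, 0, 0, 0, 0, 0], [0.6, 0.4, 0, 0, 0, 0, 0]]

cat_acc = categorical_accuracy(y_true, y_pred)
bin_acc = binary_accuracy(y_true, y_pred)
print(cat_acc, bin_acc)
```

In this toy case one of two sentences is misclassified, yet most individual outputs still match, so the per-output score comes out higher than the per-sentence one. Computing the per-sentence accuracy from `model.predict` with an argmax would give the figure that matches the task.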


#### Conclusions:
A simple 2-layer LSTM trained for only 2 epochs gives a commendable accuracy of 99.64% on unseen data in identifying the correct language of a given sentence. This shows how powerful RNNs are.

#### Next Steps:
The next step in this project is to take on the problem with a more challenging dataset; the Europarl dataset looks like a good one to pick!