# Language Detection
This project aims to build an NLP model capable of detecting up to 17 different languages using neural networks.

So first we start by importing the libraries we'll be using.

In [45]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import nltk
from sklearn.model_selection import train_test_split
from keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
from keras.preprocessing.text import Tokenizer
from keras import models
from keras import layers


%matplotlib inline

Then we load our dataset and preview it.

In [46]:
data = pd.read_csv('./dataset/Language Detection.csv')

#preview the first 10 rows of our data
data.head(10)

Unnamed: 0,Text,Language
0,"Nature, in the broadest sense, is the natural...",English
1,"""Nature"" can refer to the phenomena of the phy...",English
2,"The study of nature is a large, if not the onl...",English
3,"Although humans are part of nature, human acti...",English
4,[1] The word nature is borrowed from the Old F...,English
5,"[2] In ancient philosophy, natura is mostly us...",English
6,"[3][4] \nThe concept of nature as a whole, the...",English
7,During the advent of modern scientific method ...,English
8,"[5][6] With the Industrial revolution, nature ...",English
9,"However, a vitalist vision of nature, closer t...",English


Now, let's look at the shape and summary of the dataset.

In [47]:
print(data.info())
print('Shape of the data is', data.shape, f'meaning {data.shape[0]} rows and {data.shape[1]} columns.' )

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10337 entries, 0 to 10336
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Text      10337 non-null  object
 1   Language  10337 non-null  object
dtypes: object(2)
memory usage: 161.6+ KB
None
Shape of the data is (10337, 2) meaning 10337 rows and 2 columns.


In [48]:
data['Text'].unique()

array([' Nature, in the broadest sense, is the natural, physical, material world or universe.',
       '"Nature" can refer to the phenomena of the physical world, and also to life in general.',
       'The study of nature is a large, if not the only, part of science.',
       ...,
       "ಹೇಗೆ ' ನಾರ್ಸಿಸಿಸಮ್ ಈಗ ಮರಿಯನ್ ಅವರಿಗೆ ಸಂಭವಿಸಿದ ಎಲ್ಲವನ್ನೂ ಹೇಳಿದೆ ಮತ್ತು ಅವಳು ಆ ಸಮಯದಿಂದ ತುಂಬಾ ಬದಲಾಗಿದ್ದಾಳೆ.",
       'ಅವಳು ಈಗ ಹೆಚ್ಚು ಚಿನ್ನದ ಬ್ರೆಡ್ ಬಯಸುವುದಿಲ್ಲ ಎಂದು ನಾನು ess ಹಿಸಿದ್ದೇನೆ.',
       'ಟೆರ್ರಿ ನೀವು ನಿಜವಾಗಿಯೂ ಆ ದೇವದೂತನಂತೆ ಸ್ವಲ್ಪ ಕಾಣುತ್ತಿದ್ದೀರಿ ಆದರೆ ನಾನು ಏನು ನೋಡುತ್ತಿದ್ದೇನೆ ನೀವು ಹೇಗೆ ಅವನಾಗಬಹುದು ನೀವು ಇಬ್ಬರು ತುಂಬಾ ಒಳ್ಳೆಯವರು'],
      dtype=object)

## Data Preprocessing
Having familiarised ourselves with the data, let's now do some preprocessing.

We'll start by splitting it into `X` and `y`.

In [49]:
#y is our target
X = data['Text']
y = data['Language'] 

Next, let's split the data into `train` and `test` sets.

In [50]:
#splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123)

For our `train` set, we'll split it further into `training` and `validation` sets to help us down the road with `hyperparameter tuning`.

In [51]:
x_train_final, X_val, y_train_final, y_val = train_test_split(X_train, y_train, test_size=1000, random_state=23)
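
To make the two splits concrete, here is a small sketch of the row counts they should produce. It assumes scikit-learn rounds a fractional `test_size` up to the next whole row and gives the remainder to the training side; `split_sizes` is a hypothetical helper written for this illustration, not part of any library.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional or absolute test_size."""
    if isinstance(test_size, float):
        # assumption: a fractional share is rounded up to a whole row
        n_test = math.ceil(test_size * n_samples)
    else:
        # an integer test_size is taken as an absolute number of rows
        n_test = test_size
    return n_samples - n_test, n_test

# first split: 25% of the 10337 rows go to the test set
n_train, n_test = split_sizes(10337, 0.25)

# second split: 1000 rows of the training set become the validation set
n_train_final, n_val = split_sizes(n_train, 1000)

print(n_train_final, n_val, n_test)
```

Under these assumptions the model trains on 6752 rows, validates on 1000 and is tested on 2585.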

### One Hot Encoding
After splitting the data, we need to transform the text into numerical values for our neural network.

We'll start by `One Hot Encoding` the text: each sample becomes a vector of binary flags, one per word in the vocabulary.

In [52]:
#tokenize the text with multilingual BERT (token-id sequences, not used by the baseline model below)
from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

x_train_tokenized = [bert_tokenizer.encode(text, add_special_tokens=True) for text in x_train_final]
x_val_tokenized = [bert_tokenizer.encode(text, add_special_tokens=True) for text in X_val]
x_test_tokenized = [bert_tokenizer.encode(text, add_special_tokens=True) for text in X_test]

#one hot encode the text into a binary matrix with one column per vocabulary word
#the vocabulary is fitted on the training set only
tokenizer = Tokenizer()
tokenizer.fit_on_texts(x_train_final)

X_train_token = tokenizer.texts_to_matrix(x_train_final, mode='binary')
X_val_token = tokenizer.texts_to_matrix(X_val, mode='binary')
X_test_token = tokenizer.texts_to_matrix(X_test, mode='binary')

Token indices sequence length is longer than the specified maximum sequence length for this model (881 > 512). Running this sequence through the model will result in indexing errors
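
For intuition, the binary matrix that `texts_to_matrix(..., mode='binary')` is meant to produce can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: it only lowercases and splits on whitespace, and `fit_vocab` and `to_binary_matrix` are hypothetical helper names.

```python
from collections import Counter

def fit_vocab(texts):
    """Map each word to an index, most frequent first; index 0 stays unused."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_binary_matrix(texts, vocab):
    """One row per text, with a 1.0 in every column whose word occurs in it."""
    rows = []
    for text in texts:
        row = [0.0] * (len(vocab) + 1)
        for word in text.lower().split():
            idx = vocab.get(word)
            if idx is not None:  # words unseen during fitting are ignored
                row[idx] = 1.0
        rows.append(row)
    return rows

train_texts = ["the cat sat", "the dog"]
vocab = fit_vocab(train_texts)

# "a" and "bird" were never seen, so the second row stays all zeros
matrix = to_binary_matrix(["the cat", "a bird"], vocab)
print(matrix)
```

Each row has the same width regardless of the text length, which is why a `Dense` layer can consume it directly. Word order and word counts are lost in this representation.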


Now we'll encode the `language` column: first we'll transform the labels to `integer labels`, then convert those into a matrix of `binary flags`, one column for each of the different languages.

In [30]:
lb = LabelEncoder()
lb.fit(y_train_final)

y_train_lb = to_categorical(lb.transform(y_train_final))
y_val_lb = to_categorical(lb.transform(y_val))
y_test_lb = to_categorical(lb.transform(y_test))
print(y_val_lb.shape)

(1000, 17)
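
The two-step label encoding can be sketched without any library. The helpers `fit_labels` and `one_hot` are hypothetical names used only here, and the sketch assumes classes are numbered in sorted order, as `LabelEncoder` does.

```python
def fit_labels(labels):
    """Assign an integer to each distinct label, in sorted order."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}

def one_hot(labels, mapping):
    """Turn each label into a row of binary flags, one column per class."""
    rows = []
    for label in labels:
        row = [0.0] * len(mapping)
        row[mapping[label]] = 1.0
        rows.append(row)
    return rows

languages = ["French", "English", "Tamil", "English"]
mapping = fit_labels(languages)
encoded = one_hot(languages, mapping)

print(mapping)
print(encoded)
```

With 17 languages in the training set, each row would have 17 columns, which matches the `(1000, 17)` shape printed for the validation labels.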


In [40]:
print(x_train_final.tail(1))

2048    [9][10][11][12] மேலும், இது அலெக்சா இணையத்தளத்...
Name: Text, dtype: object


In [41]:
print(X_train_token[0])  # Print the tokenized sequence for the first sample


[0. 0. 0. ... 0. 0. 0.]


## Baseline Model
With our data ready for modelling, let's start with a basic model with only two hidden layers.

In [21]:
baseline_model = models.Sequential()

baseline_model.add(layers.Dense(128, activation='relu'))
baseline_model.add(layers.Dense(64, activation='relu'))

#output layer
baseline_model.add(layers.Dense(17, activation='softmax'))


### Compiling the model
After adding the layers, we compile and then train the model.

In [22]:
#metrics is passed as a list, which is the form Keras expects
baseline_model.compile(metrics=['acc'], optimizer='sgd', loss='categorical_crossentropy')
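
As a reference point for the loss values seen during training, here is a plain-Python sketch of the `softmax` output and the `categorical_crossentropy` loss for a single sample. It is an illustration of the formulas rather than the Keras implementation, and the `eps` clipping value is an assumption.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Negative log of the probability given to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

num_classes = 17

# a model with no preference gives every language the same score
uniform_probs = softmax([0.0] * num_classes)

# one-hot target where the first class is the true language
y_true = [1.0] + [0.0] * (num_classes - 1)

baseline_loss = categorical_crossentropy(y_true, uniform_probs)
print(round(baseline_loss, 3))
```

A model that guesses uniformly over 17 languages has a loss of ln(17), roughly 2.83, so a validation loss well below that suggests the network is learning something.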

For training, we'll start with `150` epochs and a batch size of `256`.

In [None]:
baseline_model.fit(X_train_token, y_train_lb, epochs=150, batch_size=256, validation_data=(X_val_token, y_val_lb))
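
To get a feel for how much work this call does, the number of weight updates can be estimated with simple arithmetic. The training-set size of 6752 rows is an assumption derived from the splits above (10337 rows, 25% held out for testing, then 1000 for validation).

```python
import math

n_train_final = 6752  # assumed size of the final training set
batch_size = 256
epochs = 150

# the last batch of an epoch may be smaller, so round up
steps_per_epoch = math.ceil(n_train_final / batch_size)
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, total_updates)
```

Under that assumption each epoch runs 27 batches, for 4050 gradient updates over the full 150 epochs.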