# ~ PoC AI Pool 2024 ~
- ## Day 5: NLP
    - ### Module 1: Language Detection with NLP
-----
Welcome to the final day of your PoC AI Pool !

In this module, we'll see a different way of using PyTorch: building a Natural Language Processing neural network capable of detecting the language of a given sentence.

## 1. Data Cleaning

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import sklearn
import torch
import torch.nn as nn

Let's import the language dataset from the `datasets` package 📦 :

>Datasets is a library for easily accessing and sharing datasets for Audio 🔉, Computer Vision 👁️ , and Natural Language Processing (NLP) 📖 tasks.

We will be using the [papluca/language-identification](https://huggingface.co/datasets/papluca/language-identification) dataset.

In [None]:
from datasets import load_dataset

dataset = load_dataset("papluca/language-identification")

The code below transforms the dataset into a pandas DataFrame, which we will use for the rest of this module.

In [None]:
def filter_dataset(data, languages):
    # Keep only the rows whose label is one of the requested language codes
    return data.filter(lambda x: x['labels'] in languages)

def process_dataset(data):
    return data.map(lambda x: {'data': (x['labels'], x['text'])})['data']

languages = {
    'fr': 'french',
    'en': 'english',
    'es': 'spanish',
    'de': 'german'
}

filtered_data = filter_dataset(dataset['train'], list(languages.keys()))
processed_data = process_dataset(filtered_data)

df = pd.DataFrame(processed_data, columns=["languages", "text"])
df

Your output should look like this:

![](images/expected_output_lang.png)

#### 1. Cleaning the data 🧹

<img src="images/data_cleaning.png" width=700 >

First off, you need to clean the data using natural language processing techniques.

However you achieve this, your cleaned data should be available inside a pandas dataframe.

The exact output doesn't matter, as long as the cleaning removes noise such as capitalization, punctuation and stop words.

As an example, the sentence "May The Force be with you." might become "may force" when cleaned.\
If your result looks like that, it means you've implemented the cleaning process correctly. 👍
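
As a rough illustration of the example above, here is a minimal sketch of one possible cleaning pipeline: lowercase the sentence, keep only alphabetic tokens, and drop stop words. The name `clean_sketch` and the tiny `DEMO_STOP_WORDS` set are assumptions made for this sketch; in the notebook, the stop words come from nltk. Other techniques (stemming, lemmatization, etc.) are equally valid.

```python
import re

# Hypothetical, tiny stop-word list for illustration only; the notebook
# builds the real one with nltk's `stopwords.words(...)`.
DEMO_STOP_WORDS = {"the", "be", "with", "you"}

def clean_sketch(sentence, stop_words=DEMO_STOP_WORDS):
    """Lowercase, keep only alphabetic tokens, and drop stop words."""
    # [^\W\d_]+ matches runs of Unicode letters, so accents such as "é" survive
    tokens = re.findall(r"[^\W\d_]+", sentence.lower())
    return " ".join(token for token in tokens if token not in stop_words)

print(clean_sketch("May The Force be with you."))  # may force
```

Returning a string keeps the column easy to read; returning the list of tokens instead would work just as well, as long as the later steps expect the same shape.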

In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")
nltk.download("popular")

import re

In [None]:
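# nltk expects full language names ("french", "english", ...).
# A separate variable keeps `languages` intact if this cell is run twice.
language_names = list(languages.values())
stop_words = stopwords.words(language_names)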

def clean(sentence):
    """
    You should clean the data inside this function by using
    different nlp techniques.
    """

    clean_data = sentence

    # Enter your code here




    #

    return clean_data

df["clean"] = df["text"].apply(clean)
df

#### 2. Count Vectorizer 💻


Now, in order to prepare the data for usage inside a neural network, you need to vectorize each word in the vocabulary and replace all usages inside your data with the corresponding tensors.

- Step 1: Build a vocabulary containing each word in the dataset (each word must only appear once)
- Step 2: Vectorize each sentence in the dataset 🔡 -> 🔢 by replacing it with an array containing the number of occurrences of each word in the vocabulary inside the sentence.
- Step 3: Vectorize your labels (for example, you can replace french 🇫🇷 with index 0, spanish 🇪🇸 with index 1, etc... )

If you implement all of these steps correctly, you will have a vectorized dataset which will be processable inside a neural network ! 

<img src="images/countvec.png" width=1000 >

You might first want to create a vocabulary comprised of all the words in your cleaned data.

>Build a vocabulary containing each word in the dataset (each word must only appear once)
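
One possible way to approach this step, shown as a sketch rather than the expected solution: collect every word into a set, then sort it so that each word keeps a stable position. The name `build_vocab_sketch` and the choice to accept both raw strings and token lists are assumptions made for this example.

```python
def build_vocab_sketch(sentences):
    """Return the sorted list of unique words found in `sentences`."""
    unique_words = set()
    for sentence in sentences:
        # Accept either a raw string or an already tokenized list of words
        words = sentence.split() if isinstance(sentence, str) else sentence
        unique_words.update(words)
    # Sorting gives every word a stable position from one run to the next
    return sorted(unique_words)

print(build_vocab_sketch(["may force", "force awakens"]))
# ['awakens', 'force', 'may']
```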

In [None]:
def build_vocab(sentences):
    """
    This function should return a vocabulary of all unique words in our dataframe
    """
    ### Enter your code here


    

    ###

    return None

If the `build_vocab()` function is implemented properly, you should be able to run the code below 👇 and see how many words were removed thanks to cleaning.

In [None]:
vocab_vanilla = build_vocab(df["text"].apply(nltk.word_tokenize))
vocab = build_vocab(df["clean"])

print(f"Number of words in unprocessed data: {len(vocab_vanilla)}")
print(f"Number of words in processed data: {len(vocab)}")

vocab

Now, for the fun part: implement the Count Vectorizer

>Vectorize each sentence in the dataset 🔡 -> 🔢 by replacing it with an array containing the number of occurrences of each word in the vocabulary inside the sentence.
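
To make the idea concrete, here is a hedged sketch of a count vectorizer on a toy vocabulary. Unlike the notebook's `vectorize`, this hypothetical `vectorize_sketch` takes the word-to-index mapping as a parameter so the example stays self-contained; `demo_word2idx` is made-up data. Words missing from the vocabulary are simply ignored, which is one choice among several.

```python
from collections import Counter

def vectorize_sketch(sentences, word2idx):
    """Turn each sentence into a list of word counts, one slot per vocabulary word."""
    vectorized = []
    for sentence in sentences:
        # Accept either a raw string or an already tokenized list of words
        words = sentence.split() if isinstance(sentence, str) else sentence
        vector = [0] * len(word2idx)
        for word, count in Counter(words).items():
            index = word2idx.get(word)
            if index is not None:  # skip words that are not in the vocabulary
                vector[index] = count
        vectorized.append(vector)
    return vectorized

demo_word2idx = {"awakens": 0, "force": 1, "may": 2}
print(vectorize_sketch(["may force force", "unknown awakens"], demo_word2idx))
# [[0, 2, 1], [1, 0, 0]]
```

Every vector has the same length as the vocabulary, which is what lets the sentences be stacked into a single tensor later on.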

In [None]:
word2idx = {}

for index, word in enumerate(vocab):
    word2idx[word] = index

def vectorize(sentences):
    vectorized = []

    ### Enter your code here





    ###

    return vectorized

df["vectorized"] = vectorize(df["clean"])

Now for the label vectorization:

>Vectorize your labels (for example, you can replace french 🇫🇷 with index 0, spanish 🇪🇸 with index 1, etc... )
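
Label vectorization boils down to a dictionary lookup per row. The sketch below uses a made-up list of language codes (`demo_language_column`) in place of the real `languages` column.

```python
# Same mapping as the notebook's `languages_dict`, repeated so the sketch runs alone
demo_languages_dict = {"fr": 0, "en": 1, "es": 2, "de": 3}

# Made-up stand-in for the dataframe's language column
demo_language_column = ["en", "fr", "de", "fr"]

# Replace every language code with its integer index
demo_labels = [demo_languages_dict[code] for code in demo_language_column]
print(demo_labels)  # [1, 0, 3, 0]
```

With a pandas Series, `Series.map` with the same dictionary should give an equivalent result.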

In [None]:
# Label Vectorizer

languages_dict = {
    "fr": 0,
    "en": 1,
    "es": 2,
    "de": 3,
}

labels = []

# Enter your code here

#

labels

## 2. Neural Network 🧠

<img src="images/nn.png" width=1000 >

In order to process the data with PyTorch, let's convert it into tensors:

In [None]:
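# Convert the column of count vectors to a plain list of lists first,
# so PyTorch does not have to interpret a pandas Series of Python lists
x = torch.tensor(df["vectorized"].tolist(), dtype=torch.float32)
y = torch.tensor(labels, dtype=torch.long)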

Now, you need to create your neural network and train a model on our data.

- Step 1: Build a network in [PyTorch](https://pytorch.org/docs/stable/generated/torch.nn.Module.html) (your model can be simple as long as it does the job)
- Step 2: Split your data into train and test subsets (you can use [sklearn's method](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) for this)
- Step 3: Train a model on your data until you reach a good accuracy (above 90%)
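
The steps above can be sketched on toy data as follows. This is only an illustration under stated assumptions: `TinyClassifier`, its layer sizes, the Adam optimizer, the learning rate and the toy tensors are all choices made for this example, and the train/test split and batching from Step 2 are left out to keep it short.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # make the toy run reproducible

class TinyClassifier(nn.Module):
    """Hypothetical two-layer network: count vector in, one score per language out."""

    def __init__(self, vocab_size, hidden_size, num_classes):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(vocab_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_classes),
        )

    def forward(self, x):
        # Return raw scores (logits); CrossEntropyLoss applies log-softmax itself
        return self.layers(x)

# Made-up data: 8 "sentences" over a 6-word vocabulary, 4 classes.
# Each sentence contains exactly one occurrence of its class-specific word.
toy_y = torch.tensor([0, 1, 2, 3, 0, 1, 2, 3])
toy_x = torch.zeros(8, 6)
toy_x[torch.arange(8), toy_y] = 1.0

model = TinyClassifier(vocab_size=6, hidden_size=16, num_classes=4)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

losses = []
for epoch in range(100):
    optimizer.zero_grad()                  # reset gradients from the previous step
    loss = criterion(model(toy_x), toy_y)  # forward pass + loss
    loss.backward()                        # backpropagation
    optimizer.step()                       # weight update
    losses.append(loss.item())

with torch.no_grad():
    predicted = model(toy_x).argmax(dim=1)
    accuracy = (predicted == toy_y).float().mean().item() * 100

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
print(f"accuracy on the toy data: {accuracy:.1f}%")
```

On the real data, the same loop would iterate over the batches of a `DataLoader`, and accuracy would be measured separately on the train and test subsets.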

In [None]:
### Neural Network

class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()

    def forward(self, x):
        pass

###

model = Network()

criterion = None
optimizer = None

from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split

class MyData(Dataset):
    """
    This class will be useful when working with batches
    """

    def __init__(self, x, y):
        self.data = x
        self.target = y

    def __getitem__(self, index):
        x = self.data[index]
        y = self.target[index]

        return x, y

    def __len__(self):
        return len(self.data)

### Training and Testing

def training_loop(x, y):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

    train_dataset = MyData(x_train, y_train)
    test_dataset = MyData(x_test, y_test)

    train_dataset = DataLoader(train_dataset, batch_size=32)
    test_dataset = DataLoader(test_dataset, batch_size=32)

    # Enter your code here






    #

    train_accuracy = None
    test_accuracy = None

    return train_accuracy, test_accuracy

###

# Store the training and testing accuracy (in %) inside `train_accuracy` and `test_accuracy`
train_accuracy, test_accuracy = training_loop(x, y)

print(f"Train accuracy: {train_accuracy}")
print(f"Test accuracy: {test_accuracy}")

If all went well, your accuracy should be close to 100%. 💯

Now, let's see how well the model guesses a language:
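
Turning model outputs into language codes usually means taking the index of the highest score for each sentence and looking it up in the index-to-language mapping. The sketch below uses hand-written scores (`demo_logits`) in place of a real model output; all `demo_` names are assumptions made for this example.

```python
import torch

# Same mapping as the notebook's `idx2lang`, repeated so the sketch runs alone
demo_idx2lang = {0: "fr", 1: "en", 2: "es", 3: "de"}

# Made-up scores for two sentences, one column per language
demo_logits = torch.tensor([[2.0, 0.1, -1.0, 0.3],
                            [0.0, 0.2, 0.1, 3.5]])

# argmax over the class dimension gives the most likely label index per row
demo_indices = demo_logits.argmax(dim=1).tolist()
demo_predictions = [demo_idx2lang[index] for index in demo_indices]
print(demo_predictions)  # ['fr', 'de']
```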

In [None]:
### Prediction

idx2lang = {
    0: "fr",
    1: "en",
    2: "es",
    3: "de",
}

def predict(x):
    predictions = []

    return predictions

predictions = predict(x)

df["predictions"] = predictions

df

In [None]:
sns.countplot(x='value', hue="variable", data=df[['languages', 'predictions']].melt())

### Awesome ! 😄

You've successfully created a language detection AI using Natural Language Processing and neural networks.

In [None]:
def predict_sentence(sentence):
    return predict(vectorize([clean(sentence)]))

predict_sentence("J'ai réussi à implémenter une intelligence artificielle !")