<a href="https://colab.research.google.com/github/MariamAtefMah/Colab-ML-Project/blob/main/NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Sequence Modeling**


*NLP (Natural Language Processing)*
  The main Keras layers used in most sequence projects:
  _SimpleRNN(): Recurrent neural network.
  _Embedding(): Maps each token index to a dense vector.
  _LSTM(): Long Short-Term Memory.
  _GRU(): Gated Recurrent Unit.
  _Bidirectional(): Wrapper that processes the sequence in both directions, left to right and vice versa.


Main architectures:
  _One to many, e.g. image captioning.
  _Many to one, e.g. sentiment classification.
  _Many to many, which has two cases:
    _Input length equals output length, e.g. named entity recognition.
    _Input length differs from output length, e.g. machine translation.
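
To make the recurrence behind these layers and architectures concrete, here is a minimal pure-Python sketch of a scalar recurrent cell. The helper `run_rnn` and the weights `w_x`, `w_h`, and `b` are hypothetical names chosen for illustration, not Keras API; a real `SimpleRNN` layer uses learned weight matrices rather than fixed scalars.

```python
import math

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Apply h_t = tanh(w_x * x_t + w_h * h_{t-1} + b) over a list of scalars."""
    h = 0.0  # initial hidden state
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

sequence = [1.0, -2.0, 0.5]

# Many to many (equal lengths): keep one output per input step.
all_states = run_rnn(sequence)

# Many to one: keep only the final state, e.g. as input to a classifier.
last_state = all_states[-1]

# Bidirectional idea: also run over the reversed sequence, then re-align by time step.
backward_states = run_rnn(sequence[::-1])[::-1]
bidirectional = list(zip(all_states, backward_states))
```

Pairing the forward and backward states per time step is similar in spirit to how a bidirectional wrapper combines the two passes, although the exact merge strategy here is an assumption of this sketch.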

In [None]:
import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import model_selection
import re  # Regular expressions, for matching text patterns such as phone numbers.
import tqdm  # Progress bars for loops and other iterables.

In [None]:
# Upload kaggle.json (the Kaggle API token) into the Colab session.
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"mariamatefmah","key":"<redacted>"}'}

**Machine Translation**

In [None]:
# Basic steps to set up the Kaggle API credentials before downloading data.
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
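
The same credential setup can be sketched in plain Python, which may be handy outside Colab where `!` shell commands are not available. The function name `install_kaggle_credentials` and its `home` parameter are hypothetical, introduced only for this example.

```python
import os
import shutil
import stat
from pathlib import Path

def install_kaggle_credentials(source, home=None):
    """Copy kaggle.json into <home>/.kaggle and restrict it to owner read/write."""
    home = Path(home) if home is not None else Path.home()
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)  # like `mkdir -p`, safe to re-run
    target = target_dir / "kaggle.json"
    shutil.copyfile(source, target)
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)  # same permissions as `chmod 600`
    return target
```

In a notebook this would be called as `install_kaggle_credentials("kaggle.json")` after the upload cell.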

In [None]:
# Kaggle API command that downloads the English-Spanish translation dataset.
!kaggle datasets download -d lonnieqin/englishspanish-translation-dataset
# The zip file is saved in the /content folder.

Downloading englishspanish-translation-dataset.zip to /content
  0% 0.00/2.72M [00:00<?, ?B/s]
100% 2.72M/2.72M [00:00<00:00, 184MB/s]


In [None]:
! unzip /content/englishspanish-translation-dataset.zip

Archive:  /content/englishspanish-translation-dataset.zip
  inflating: data.csv                


In [None]:
# Load the CSV file into a DataFrame.
data_df = pd.read_csv('/content/data.csv')
data_df

Unnamed: 0,english,spanish
0,Go.,Ve.
1,Go.,Vete.
2,Go.,Vaya.
3,Go.,Váyase.
4,Hi.,Hola.
...,...,...
118959,There are four main causes of alcohol-related ...,Hay cuatro causas principales de muertes relac...
118960,There are mothers and fathers who will lie awa...,Hay madres y padres que se quedan despiertos d...
118961,A carbon footprint is the amount of carbon dio...,Una huella de carbono es la cantidad de contam...
118962,Since there are usually multiple websites on a...,Como suele haber varias páginas web sobre cual...


In [None]:
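# Before working with the content, clean the text of unwanted symbols.
def clean_text(text):
  text = text.lower()
  text = re.sub(r'\[.*?\]', '', text)  # Replace any [bracketed] span with '' and store the result back in text.
  text = re.sub(r'https?://\S+|www\.\S+', '', text)  # Remove URLs.
  text = re.sub(r'\[<.*?>+]', '', text)  # Remove bracketed tags of the form [<...>].
  text = re.sub(r'\n', '', text)  # Remove newlines.
  # Remove every non-word character. Note that this also removes spaces, so a
  # multi-word sentence is joined into one token; r'[^\w\s]' would keep word boundaries.
  text = re.sub(r'[^\w]', '', text)
  text = re.sub(r'\w*\d\w*]', '', text)  # Needs a trailing ']', which the previous step already removed, so this never matches.
  return text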
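
Because `[^\w]` also strips whitespace, each multi-word sentence collapses into a single token, which is rarely what a translation model needs. A possible variant that keeps word boundaries is sketched below; the name `clean_text_keep_spaces` is hypothetical, and the set of cleaning steps is an assumption about what the original function intended.

```python
import re

def clean_text_keep_spaces(text):
    """Lowercase, drop bracketed spans, URLs, and punctuation, but keep spaces."""
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                # drop [bracketed] spans
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # drop URLs
    text = re.sub(r'[^\w\s]', '', text)                # drop punctuation, keep whitespace
    text = re.sub(r'\s+', ' ', text).strip()           # collapse runs of whitespace
    return text

print(clean_text_keep_spaces("There are four main causes."))
# prints: there are four main causes
```

Accented letters such as `á` count as word characters in Python 3, so Spanish text keeps its accents.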

In [None]:
# Apply clean_text to every entry in the english and spanish columns.
data_df.english = data_df.english.map(clean_text)
data_df.spanish = data_df.spanish.map(clean_text)


In [None]:
# Mark the start and the end of each text.
def add_start_end(text):
  # Add <start> at the beginning of the text and <end> at the end.
  text = f'<start> {text} <end>'
  return text

data_df.english = data_df.english.map(add_start_end)
data_df.spanish = data_df.spanish.map(add_start_end)

In [None]:
data_df

Unnamed: 0,english,spanish
0,<start> go <end>,<start> ve <end>
1,<start> go <end>,<start> vete <end>
2,<start> go <end>,<start> vaya <end>
3,<start> go <end>,<start> váyase <end>
4,<start> hi <end>,<start> hola <end>
...,...,...
118959,<start> therearefourmaincausesofalcoholrelated...,<start> haycuatrocausasprincipalesdemuertesrel...
118960,<start> therearemothersandfatherswhowilllieawa...,<start> haymadresypadresquesequedandespiertosd...
118961,<start> acarbonfootprintistheamountofcarbondio...,<start> unahuelladecarbonoeslacantidaddecontam...
118962,<start> sincethereareusuallymultiplewebsiteson...,<start> comosuelehabervariaspáginaswebsobrecua...




In [None]:
# Fit a tokenizer on the texts and encode them as padded integer sequences.
def Tokenize(lang):  # lang: the texts of one language
  lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(
      # Keras default filters with '<' and '>' removed, so the <start> and <end> markers survive.
      filters='!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n', oov_token='<OOV>'
  )
  lang_tokenizer.fit_on_texts(lang)
  tensor = lang_tokenizer.texts_to_sequences(lang)  # List of integer sequences, one per text.
  tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor, padding='post')  # Zero-pad the end of each sequence to a common length.
  return tensor, lang_tokenizer
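
To show what the encoding step produces, here is a simplified pure-Python sketch of frequency-based indexing with an out-of-vocabulary token and post-padding. It only approximates the Keras behaviour (no filters, no lowercasing), and the helper names `build_word_index` and `texts_to_padded` are hypothetical.

```python
from collections import Counter

def build_word_index(texts, oov_token='<OOV>'):
    """Index words by descending frequency; the OOV token takes index 1."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common() sorts by count; ties keep first-seen order.
    vocab = [oov_token] + [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate(vocab, start=1)}

def texts_to_padded(texts, word_index, oov_token='<OOV>'):
    """Encode texts as integer lists and right-pad with 0 to the longest length."""
    oov = word_index[oov_token]
    seqs = [[word_index.get(w, oov) for w in t.split()] for t in texts]
    max_len = max(len(s) for s in seqs)
    return [s + [0] * (max_len - len(s)) for s in seqs]

texts = ['<start> go <end>', '<start> hi there <end>']
word_index = build_word_index(texts)
padded = texts_to_padded(texts, word_index)
print(padded)
# prints: [[2, 4, 3, 0], [2, 5, 6, 3]]
```

Index 0 is reserved for padding, which is why the vocabulary starts at 1. Padding at the end keeps the `<start>` marker in the first position of every sequence.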

In [None]:
# Store each language as padded integer sequences, together with its fitted tokenizer.
eng_sequence, eng_tokenizer = Tokenize(data_df.english)
sp_sequence, sp_tokenizer = Tokenize(data_df.spanish)


In [None]:
eng_sequence



<keras.src.preprocessing.text.Tokenizer at 0x7ade6d143640>