# FastText

## Case Study - Using the fastText library to vectorize a corpus

- 1. Read up on the fastText library.
- 2. Open Colab and install the library.
- 3. Load and quickly clean the labeled_data.csv dataset (https://drive.google.com/drive/folders/1GCWcIvE3ZipWiV8567CswTNUaNzce3sT?usp=sharing).
- 4. Find out how to label data for fastText.
- 5. Train a supervised fastText model on the data above.
- 6. Display the words the model has learned.
- 7. Give the vector representation of the word 'guy' using the model trained above.

## Resources
- https://fasttext.cc/docs/en/supervised-tutorial.html
- https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html
- https://fasttext.cc/docs/en/options.html

### Question 1

- fastText is a library for efficient learning of word representations and sentence classification.
- Text classification is a core problem in many applications, such as spam detection, sentiment analysis, or smart replies. The fastText supervised tutorial describes how to build a text classifier with the tool.
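To make "word representations" more concrete: fastText can build a word's vector from its character n-grams, with `<` and `>` marking the word boundaries. The sketch below only illustrates that idea and is not the library's implementation. `char_ngrams` is a hypothetical helper, and the 3 to 6 range is assumed from the `minn`/`maxn` defaults for unsupervised models on the options page (supervised training uses no subwords by default).

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word, including boundary markers."""
    token = f"<{word}>"  # '<' and '>' mark the start and end of the word
    return [
        token[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(token) - n + 1)
    ]


print(char_ngrams("guy"))
```

Because rare or misspelled words share n-grams with common ones, a subword-enabled model can still produce a vector for a word it never saw during training.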





### Question 2

In [1]:
!pip list | grep fast

fastai                        1.0.61
fastdtw                       0.3.4
fastprogress                  1.0.0
fastrlock                     0.8


In [6]:
!pip install fasttext

Collecting fasttext
  Downloading fasttext-0.9.2.tar.gz (68 kB)
[?25l[K     |████▊                           | 10 kB 15.4 MB/s eta 0:00:01[K     |█████████▌                      | 20 kB 19.6 MB/s eta 0:00:01[K     |██████████████▎                 | 30 kB 7.7 MB/s eta 0:00:01[K     |███████████████████             | 40 kB 6.1 MB/s eta 0:00:01[K     |███████████████████████▉        | 51 kB 5.2 MB/s eta 0:00:01[K     |████████████████████████████▋   | 61 kB 5.4 MB/s eta 0:00:01[K     |████████████████████████████████| 68 kB 3.0 MB/s 
[?25hCollecting pybind11>=2.2
  Using cached pybind11-2.9.0-py2.py3-none-any.whl (210 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... [?25l[?25hdone
  Created wheel for fasttext: filename=fasttext-0.9.2-cp37-cp37m-linux_x86_64.whl size=3127634 sha256=a1cf880c5ab27dab0765b4a0e5b31659f68f8ffbda72dc6afb44f2293aa2dddb
  Stored in directory: /root/.cache/pip/wheels/4e/ca/bf/b020d2be95f7641801a6597

### Question 3

In [7]:
import pandas as pd
import fasttext
import re

In [13]:
df = pd.read_csv("https://raw.githubusercontent.com/BaptisteHurel/DeepLearning/main/NLP/labels.csv")

In [11]:
#df.isna().sum()

In [14]:
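# Lowercase each tweet and replace every run of non-letter characters with a single space
df['tweet'] = df['tweet'].apply(lambda tweet: re.sub('[^A-Za-z]+', ' ', tweet.lower()))

# Keep only the 'class' and 'tweet' columns
df = df.drop(['Unnamed: 0', 'count', 'hate_speech', 'offensive_language', 'neither'], axis=1)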

In [15]:
df.head()

|   | class | tweet |
|---|-------|-------|
| 0 | 2 | rt mayasolovely as a woman you shouldn t comp... |
| 1 | 1 | rt mleew boy dats cold tyga dwn bad for cuffi... |
| 2 | 1 | rt urkindofbrand dawg rt sbaby life you ever ... |
| 3 | 1 | rt c g anderson viva based she look like a tr... |
| 4 | 1 | rt shenikaroberts the shit you hear about me ... |
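The output above shows that the one-line cleanup keeps retweet markers (`rt`) and user handles as ordinary words. A slightly stricter cleaner could drop them before stripping non-letters. This is a minimal sketch of one option, not the notebook's method; `clean_tweet` and the regular expressions are assumptions made for illustration.

```python
import re

URL_RE = re.compile(r"https?://\S+")     # links
MENTION_RE = re.compile(r"@\w+")         # @user handles
RT_RE = re.compile(r"\brt\b")            # retweet marker (after lowercasing)
NON_LETTER_RE = re.compile(r"[^a-z]+")   # anything that is not a lowercase letter


def clean_tweet(text):
    """Lowercase a tweet and remove URLs, mentions, 'rt' and non-letter characters."""
    text = text.lower()
    for pattern in (URL_RE, MENTION_RE, RT_RE, NON_LETTER_RE):
        text = pattern.sub(" ", text)
    return text.strip()


print(clean_tweet("!!! RT @mayasolovely: As a woman you shouldn't complain http://t.co/abc"))
```

With a DataFrame like the one above it could be applied as `df['tweet'] = df['tweet'].apply(clean_tweet)`. Whatever cleaning is used for training should also be applied to text passed to `predict` later.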


### Question 4

In [16]:
all_texts = df['tweet'].tolist()
all_labels = df['class'].tolist()
prep_datapoints=[]

In [17]:

# fastText expects one example per line: '__label__<class> <text>'
for label, text in zip(all_labels, all_texts):
    prep_datapoints.append(f'__label__{label} {text}')
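The loop above produces the format fastText reads for supervised training: one example per line, with the label prefixed by `__label__` and followed by the text. Since a line break inside a tweet would split one example into two, it can help to collapse whitespace while formatting. The helper below is a hedged sketch; `to_fasttext_line` is a hypothetical name, not part of the fastText API.

```python
def to_fasttext_line(label, text, prefix="__label__"):
    """Format one example as '<prefix><label> <text>' on a single line."""
    single_line = " ".join(text.split())  # collapse newlines, tabs and repeated spaces
    return f"{prefix}{label} {single_line}"


labels = [2, 1]
texts = ["as a woman you", "boy dats\ncold"]
lines = [to_fasttext_line(label, text) for label, text in zip(labels, texts)]
print(lines)
```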

### Question 5

In [18]:
len(prep_datapoints)==len(df)

True

In [20]:
# The with block closes the file on exit, so no explicit close() is needed
with open('./test_train_fasttext.txt', 'w') as f:
    for datapoint in prep_datapoints:
        f.write(datapoint + '\n')
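The cell above writes every example to a single file, so nothing is left over to measure how well the model generalizes. One common option is to hold out part of the lines for validation. The sketch below assumes a random 80/20 split; `split_lines`, `write_lines`, the fraction, and the seed are illustrative choices, not part of the original notebook.

```python
import random


def split_lines(lines, valid_fraction=0.2, seed=42):
    """Shuffle the lines reproducibly and return (train, valid) lists."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]


def write_lines(path, lines):
    """Write one example per line."""
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(line + "\n" for line in lines)


# Toy stand-in for the prep_datapoints list built earlier
examples = [f"__label__{i % 3} example text {i}" for i in range(10)]
train, valid = split_lines(examples)
print(len(train), len(valid))
```

In the notebook, `train` and `valid` would then be written to two separate files with `write_lines`, and only the training file passed to `train_supervised`.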

In [22]:
model = fasttext.train_supervised('./test_train_fasttext.txt')
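The call above trains with default hyperparameters. The supervised tutorial listed in the resources discusses raising `epoch`, adjusting `lr`, and adding word bigrams with `wordNgrams`, then checking the result with `model.test`, which returns the number of examples, precision at 1, and recall at 1. The following is a rough sketch under those assumptions: the hyperparameter values are illustrative, the helper names are hypothetical, and `valid_path` assumes a held-out file exists.

```python
def summarize_test(result):
    """Turn the (n_examples, precision, recall) tuple from model.test() into a dict with F1."""
    n_examples, precision, recall = result
    total = precision + recall
    f1 = 0.0 if total == 0 else 2 * precision * recall / total
    return {"n": n_examples, "precision": precision, "recall": recall, "f1": f1}


def train_and_evaluate(train_path, valid_path):
    """Train a supervised model and score it on a held-out file."""
    import fasttext  # imported here so summarize_test stays usable without fastText installed

    model = fasttext.train_supervised(
        input=train_path,
        epoch=25,       # illustrative value; more passes over a small dataset
        lr=0.5,         # illustrative learning rate
        wordNgrams=2,   # also use word bigrams as features
    )
    return model, summarize_test(model.test(valid_path))


print(summarize_test((100, 0.5, 0.5)))
```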

In [23]:
# Unsupervised skip-gram model trained on the same file
model_skpg = fasttext.train_unsupervised('./test_train_fasttext.txt', model='skipgram')

In [24]:
model.predict("thanks you for your services ")

(('__label__2',), array([0.93107045]))
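`predict` returns a tuple of label strings and an array of probabilities; passing `k` (for example `model.predict(text, k=3)`) returns the top `k` labels. The sketch below turns that raw output into readable pairs. `parse_prediction` is a hypothetical helper, and the `CLASS_NAMES` mapping is an assumption based on the column names dropped earlier; it should be checked against the dataset's documentation.

```python
LABEL_PREFIX = "__label__"

# Assumed meaning of the 'class' column; verify before relying on it.
CLASS_NAMES = {"0": "hate_speech", "1": "offensive_language", "2": "neither"}


def parse_prediction(labels, probs):
    """Convert fastText predict() output into (class_name, probability) pairs."""
    parsed = []
    for label, prob in zip(labels, probs):
        key = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
        parsed.append((CLASS_NAMES.get(key, key), float(prob)))
    return parsed


# Values copied from the output above
print(parse_prediction(("__label__2",), [0.93107045]))
```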

### Question 6

In [25]:
print(model.words)   # vocabulary learned by the supervised model
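Printing the whole vocabulary can be long. The Python API also exposes `get_words(include_freq=True)`, which returns the words together with their counts, so only the most frequent ones need to be shown. The sketch below uses a toy stand-in for that return value; `most_frequent` is a hypothetical helper and the counts are made up for illustration.

```python
def most_frequent(words, freqs, n=10):
    """Pair each word with its count and return the n most frequent pairs."""
    pairs = sorted(zip(words, freqs), key=lambda pair: pair[1], reverse=True)
    return pairs[:n]


# Toy stand-in for: words, freqs = model.get_words(include_freq=True)
words = ["the", "rt", "guy"]
freqs = [120, 300, 45]
print(most_frequent(words, freqs, n=2))
```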



### Question 7

In [26]:
print(model['guy']) # get the vector of the word 'guy'

[ 3.3703824e-03  4.3903269e-02 -3.4703076e-02 -9.9609764e-03
  5.0250039e-02  4.1995343e-02 -3.4936883e-02 -6.5112044e-03
  6.2223069e-02  6.1684102e-02 -3.9118953e-02  3.3633668e-02
  7.5473441e-03  2.7089003e-02 -3.9821219e-02 -3.4406725e-03
 -1.1901334e-02  7.3567205e-03  2.3933628e-03 -4.7844607e-02
  7.9935160e-04  6.8490438e-02  4.0158965e-02 -1.6873037e-02
 -9.2896124e-05 -2.5723780e-02  3.5162613e-02 -2.5447959e-02
 -4.2506024e-02  4.1629177e-02 -2.7567688e-02  1.1498865e-03
 -3.3675984e-02 -3.2707334e-02 -7.8856125e-02 -3.3027441e-03
  1.3883647e-02 -4.0587652e-02 -2.3469882e-02 -8.1027839e-03
 -9.3367248e-04  7.5077318e-02  1.9625464e-02  1.8724816e-02
  3.8230278e-02 -9.2955790e-03  2.3734808e-02 -3.1857383e-02
 -2.9704809e-02  3.7635643e-02 -7.0228316e-02  2.6178515e-02
 -5.0400421e-02  3.6917452e-02 -3.5903707e-02 -5.1297572e-02
 -1.2404419e-02  2.4696223e-02  5.4116592e-02  2.6066661e-02
 -2.8540678e-02  4.4114240e-02  7.1061321e-02  2.4808416e-02
 -3.2681838e-02 -8.16837