# Challenge - Preprocessing pipeline

![](https://media.ouest-france.fr/v1/pictures/fe9603bace85f5c3339acb605cb31894-17133782.jpg?width=1400&client_id=eds&sign=9fb46757bc793cfe75ca6a14462ccbf26bbff31d9a7ce55d426c03ae31da2465)

## Objectives

First, the goal is to reduce the time needed to preprocess text data with spaCy.
Second, to classify French tweets as negative or positive.

## Guidelines

🚰 Preprocessing text can be time-consuming and costly for your computer, especially if your dataset is large. spaCy provides a [processing pipeline feature](https://spacy.io/usage/processing-pipelines) that can be used to make text preprocessing more efficient.

To measure preprocessing time, we will use the `tqdm` package to display a progress bar.
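As a complement to the progress bar, elapsed time can also be measured explicitly. The sketch below uses only the standard library; `timed_apply` is a hypothetical helper name introduced here for illustration, not part of tqdm or spaCy.

```python
import time

def timed_apply(func, items):
    """Apply func to every item and return (results, elapsed_seconds)."""
    start = time.perf_counter()
    results = [func(item) for item in items]
    elapsed = time.perf_counter() - start
    return results, elapsed

# toy usage: str.split stands in for a real preprocessing function
tweets = ["besoin d'un câlin", "Je meurs?"]
tokens, seconds = timed_apply(str.split, tweets)
print(f"{len(tokens)} tweets in {seconds:.6f} s")
```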

## Dataset

📥 In this exercise we will use a dataset of 1.5 million French tweets and their sentiment (negative or positive) from Kaggle: https://www.kaggle.com/hbaflast/french-twitter-sentiment-analysis

To avoid any GitHub issue, don't forget to store the dataset in your local `data` folder.

### GOAL.1: Reduce the time needed to preprocess text data with spaCy.
### GOAL.2: Classify French tweets as negative or positive.

In [2]:
# import libraries
import numpy as np
import pandas as pd
from sklearn.utils import resample
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from tqdm import tqdm
import spacy
from spacy import displacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# 1. Load data and explore it

In [3]:
# load the french small model
nlp = spacy.load("fr_core_news_sm")

In [4]:
# import data
df = pd.read_csv('../french_tweets/french_tweets.csv',header=0)
df.head()

Unnamed: 0,label,text
0,0,"- Awww, c'est un bummer. Tu devrais avoir davi..."
1,0,Est contrarié qu'il ne puisse pas mettre à jou...
2,0,J'ai plongé plusieurs fois pour la balle. A ré...
3,0,Tout mon corps a des démangeaisons et comme si...
4,0,"Non, il ne se comporte pas du tout. je suis en..."


In [5]:
# check the labels balance
df.label.value_counts() #balanced

0    771604
1    755120
Name: label, dtype: int64

> Explore some tweets: which label corresponds to which sentiment? Are the tweets properly labelled?

In [6]:
# explore some tweets and consider the pre-processing steps that will be required
## label 0
df[df.label==0].iloc[:10]

Unnamed: 0,label,text
0,0,"- Awww, c'est un bummer. Tu devrais avoir davi..."
1,0,Est contrarié qu'il ne puisse pas mettre à jou...
2,0,J'ai plongé plusieurs fois pour la balle. A ré...
3,0,Tout mon corps a des démangeaisons et comme si...
4,0,"Non, il ne se comporte pas du tout. je suis en..."
5,0,Pas l'équipage complet
6,0,besoin d'un câlin
7,0,"Bonjour pas de vue! Oui ... pleut un peu, just..."
8,0,"Non, ils ne l'avaient pas"
9,0,Je meurs?


In [7]:
## label 1: seems more positive
df[df.label==1].iloc[:10]

Unnamed: 0,label,text
771604,1,"Je vous aime, les gars sont les meilleurs!"
771605,1,Je me retrouve avec un de mes besties ce soir!...
771606,1,"Merci pour l'ajout de Twitter, sunisa! Je dois..."
771607,1,Être malade peut être vraiment bon marché quan...
771608,1,Il a cet effet sur tout le monde
771609,1,Vous pouvez lui dire que je viens d'éclater de...
771610,1,Merci pour votre réponse. J'avais déjà trouvé ...
771611,1,"Je suis tellement jaloux, j'espère que vous av..."
771612,1,"Ah, félicite mr fletcher pour finalement rejoi..."
771613,1,J'ai répondu que le chat stupide m'aidé à tape...


> 0 seems to mark the negative tweets and 1 the positive ones, **but the labels are not always very accurate...** This is not a big deal for our exercise, which aims to explore the possibilities of preprocessing a large dataset with spaCy.


For the moment, create a sample of just 20,000 tweets and process it with spaCy.

# 2. Sample the dataset and preprocess it 

In [8]:
n_samples = 20000

In [9]:
# sample the dataset (20000 tweets) in df_sample
# note: resample draws with replacement by default, so duplicate tweets are possible
df_sample = resample(df, n_samples=n_samples, random_state=42)
df_sample.shape

(20000, 2)
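Note that `sklearn.utils.resample` draws with replacement by default (`replace=True`), so the same tweet can appear more than once in the sample. A minimal sketch on toy data, assuming a sample without duplicates is wanted:

```python
import numpy as np
from sklearn.utils import resample

indices = np.arange(10)

# default behaviour: bootstrap sampling, duplicates are possible
with_replacement = resample(indices, n_samples=5, random_state=42)

# replace=False draws each row at most once
without_replacement = resample(indices, n_samples=5, replace=False, random_state=42)

print(with_replacement)
print(len(set(without_replacement.tolist())))  # number of distinct rows drawn
```

With pandas alone, `df.sample(n=n_samples, random_state=42)` also samples without replacement by default.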

Create a preprocessing function. For now, don't spend too long on the preprocessing steps themselves; keep it simple:
- remove the punctuation
- remove stopwords
- lemmatization
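The first two steps listed above can be sketched with the standard library alone. The stopword set here is a tiny hand-written illustration (an assumption, not a real French stopword list), lowercasing is added so that the stopword lookup matches, and lemmatization is left out because it needs a language-specific resource.

```python
import string

# illustrative subset only; a real run would use a full French stopword list
FRENCH_STOPWORDS = {"je", "ne", "pas", "de", "le", "la", "les", "un", "et", "à"}

def simple_preprocess(tokens):
    """Lowercase tokens, then drop punctuation-only tokens and stopwords."""
    kept = []
    for token in tokens:
        lowered = token.lower()
        if all(char in string.punctuation for char in lowered):
            continue  # token is made only of punctuation, e.g. "!" or "..."
        if lowered in FRENCH_STOPWORDS:
            continue
        kept.append(lowered)
    return kept

print(simple_preprocess(["Mais", "je", "ne", "peux", "pas", "envoyer", "!", "..."]))
```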

In [10]:
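# create the preprocessing function
def preprocess_nltk(sent):
    # tokenization
    tokens = word_tokenize(sent)
    # remove stopwords
    # NOTE: this is the English list, so French stopwords are kept;
    # stopwords.words('french') would match the language of the tweets
    stop_words = stopwords.words('english')
    tokens = [t for t in tokens if t not in stop_words]
    # lemmatization
    # NOTE: WordNetLemmatizer targets English, hence 'pas' -> 'pa' in the output
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return tokens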

In [11]:
# apply onto text
df_sample['preprocessed_text_nltk'] = df_sample.text.apply(preprocess_nltk)
df_sample.head()

Unnamed: 0,label,text,preprocessed_text_nltk
121958,0,Mais je ne peux pas envoyer de courrier électr...,"[Mais, je, ne, peux, pa, envoyer, de, courrier..."
671155,0,Se sent mal après avoir mangé & quot; cinema s...,"[Se, sent, mal, après, avoir, mangé, &, quot, ..."
131932,0,awwe..hey..i doivent chat lit g2 plus tard? :],"[awwe, .., hey, .., doivent, chat, lit, g2, pl..."
1414414,1,La famille et les amis sont partis après un be...,"[La, famille, et, le, amis, sont, partis, aprè..."
259178,0,"Oui, je commence définitivement à ressentir le...","[Oui, ,, je, commence, définitivement, à, ress..."


To measure how long preprocessing takes, build a list containing the token list of each tweet and time the loop. To do that, wrap the iterable in the `tqdm()` function, like this:

```python
tokens = list()
for tweet in tqdm(df_sample.text):
    tokens.append(preprocessing(tweet))
```

In [12]:
# create a new list with all tweets tokens
tokens = list()
for tweet in tqdm(df_sample.text):
    tokens.append(preprocess_nltk(tweet))
# took about 5.7 s
tokens
# linear extrapolation to 1.5M tweets: ~6 s * 75 = 450 s = 7.5 min

100%|██████████| 20000/20000 [00:07<00:00, 2511.61it/s]


[['Mais',
  'je',
  'ne',
  'peux',
  'pa',
  'envoyer',
  'de',
  'courrier',
  'électronique',
  'à',
  'personne',
  '!',
  'Ce',
  "n'est",
  'pa',
  'seulement',
  'votre',
  'adresse',
  '...'],
 ['Se',
  'sent',
  'mal',
  'après',
  'avoir',
  'mangé',
  '&',
  'quot',
  ';',
  'cinema',
  'sweet',
  '&',
  'quot',
  ';',
  'pop',
  'corn',
  '.',
  '.',
  '.',
  '.',
  '.',
  'Sans',
  'le',
  'cinéma'],
 ['awwe',
  '..',
  'hey',
  '..',
  'doivent',
  'chat',
  'lit',
  'g2',
  'plus',
  'tard',
  '?',
  ':',
  ']'],
 ['La',
  'famille',
  'et',
  'le',
  'amis',
  'sont',
  'partis',
  'après',
  'un',
  'beau',
  'week-end',
  'de',
  'fin',
  "d'études",
  'de',
  'deux',
  'jours',
  '!',
  "J'ai",
  'passé',
  'un',
  'bon',
  'moment',
  '!',
  'Merci',
  "d'être",
  'venu',
  '!',
  ';',
  ')',
  'Le',
  'week-end',
  'prochain',
  'pt',
  '2',
  '!'],
 ['Oui',
  ',',
  'je',
  'commence',
  'définitivement',
  'à',
  'ressentir',
  'le',
  'décalage',
  'maintenant',

How long did it take?

In the run above it took about 7 seconds for just 20,000 tweets; extrapolated linearly, that is several minutes for 1.5 million 🤯!
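The extrapolation is simple proportional arithmetic. A sketch, assuming processing time grows linearly with the number of tweets and using the 5.7 s noted in the cell above:

```python
def extrapolate_seconds(measured_seconds, n_measured, n_target):
    """Scale a measured duration linearly to a larger number of items."""
    return measured_seconds * n_target / n_measured

estimate = extrapolate_seconds(5.7, 20_000, 1_500_000)
print(f"{estimate:.1f} s, about {estimate / 60:.1f} min")
```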

Now let's try to optimize the process with a spaCy pipeline.

A spaCy pipeline takes texts as input and has some useful arguments for reducing preprocessing time (see the documentation: https://spacy.io/usage/processing-pipelines).

We will look at two in particular:
- **disable**: when we pass a text to a spaCy model, by default it runs many processing steps (named entity recognition, retrieving embedding vectors, etc.). With this parameter we can skip the steps that are unnecessary here.

- **batch_size**: the number of texts processed at a time
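A minimal sketch of both arguments follows. To stay runnable without downloading a model, it uses `spacy.blank("fr")` (tokenizer only) plus a `sentencizer` component; both are assumptions made for illustration. With `fr_core_news_sm`, the components that can be disabled are those listed by `nlp.pipe_names`.

```python
import spacy

# blank French pipeline: tokenizer only, no trained components
nlp_demo = spacy.blank("fr")
nlp_demo.add_pipe("sentencizer")  # one cheap component we can switch off
print(nlp_demo.pipe_names)

texts = ["besoin d'un câlin", "Je meurs?", "Pas l'équipage complet"]

# disable skips the named components; batch_size sets how many texts are buffered together
docs_demo = nlp_demo.pipe(texts, disable=["sentencizer"], batch_size=2)
tokenized = [[token.text for token in doc] for doc in docs_demo]
print(tokenized)
```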

In [13]:
## disable and batch_size params
docs = nlp.pipe(df_sample.text, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "ner"], batch_size=20)
docs ## careful: this is a generator, so it is empty after one loop

<generator object Language.pipe at 0x7f543b4f1a10>

`pipe` returns an iterator of spaCy docs. Write a new preprocessing function that takes a spaCy doc object as input rather than the raw tweet!
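Because the result is a generator, it can be consumed only once. The standard-library sketch below uses a plain generator expression as a stand-in for `nlp.pipe` to show the behaviour:

```python
texts = ["besoin d'un câlin", "Je meurs?"]

# stand-in for nlp.pipe(texts): a lazy generator over the texts
docs_like = (text.split() for text in texts)

first_pass = [tokens for tokens in docs_like]
second_pass = [tokens for tokens in docs_like]  # generator is already exhausted

print(len(first_pass), len(second_pass))
```

Wrapping the call in `list(...)` keeps the docs in memory so they can be reused, at the cost of holding all of them at once.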

In [14]:
# create a preprocessing_2 function
def preprocessing_2(my_doc):
    # my_doc is a spaCy Doc produced by nlp.pipe; keep the raw token texts
    tokens = [t.text for t in my_doc]
    # stop_words removal (NOTE: English list applied to French text)
    stop_words = stopwords.words('english')
    tokens = [t for t in tokens if t not in stop_words]
    # lemmatize (NOTE: WordNetLemmatizer targets English, hence 'pas' -> 'pa')
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return tokens



In [15]:
# create a new list with all tweets tokens from pipeline
## !!! re-run the cell that creates the docs generator before running this one !!!
new_tweets = []
for doc in docs:
    tokens = preprocessing_2(doc)
    new_tweets.append(tokens)
new_tweets

[['Mais',
  'je',
  'ne',
  'peux',
  'pa',
  'envoyer',
  'de',
  'courrier',
  'électronique',
  'à',
  'personne',
  '!',
  'Ce',
  "n'",
  'est',
  'pa',
  'seulement',
  'votre',
  'adresse',
  '...'],
 ['Se',
  'sent',
  'mal',
  'après',
  'avoir',
  'mangé',
  '&',
  'quot',
  ';',
  'cinema',
  'sweet',
  '&',
  'quot',
  ';',
  'pop',
  'corn',
  '.',
  '.',
  '.',
  '.',
  '.',
  'Sans',
  'le',
  'cinéma'],
 ['awwe',
  '..',
  'hey',
  '..',
  'doivent',
  'chat',
  'lit',
  'g2',
  'plus',
  'tard',
  '?',
  ':]'],
 ['La',
  'famille',
  'et',
  'le',
  'amis',
  'sont',
  'partis',
  'après',
  'un',
  'beau',
  'week',
  '-',
  'end',
  'de',
  'fin',
  "d'",
  'études',
  'de',
  'deux',
  'jours',
  '!',
  "J'",
  'ai',
  'passé',
  'un',
  'bon',
  'moment',
  '!',
  'Merci',
  "d'",
  'être',
  'venu',
  '!',
  ';)',
  'Le',
  'week',
  '-',
  'end',
  'prochain',
  'pt',
  '2',
  '!'],
 ['Oui',
  ',',
  'je',
  'commence',
  'définitivement',
  'à',
  'ressentir',
 

In [16]:
np.array(new_tweets, dtype=object).shape  # ragged lists need dtype=object


(20000,)

In [17]:
new_col = np.array(new_tweets, dtype=object)  # ragged lists need dtype=object
print(type(new_col))
new_col

df_sample['preprocessed_spacy'] = new_col # no need to convert to a pd.Series
df_sample.head()

<class 'numpy.ndarray'>



Unnamed: 0,label,text,preprocessed_text_nltk,preprocessed_spacy
121958,0,Mais je ne peux pas envoyer de courrier électr...,"[Mais, je, ne, peux, pa, envoyer, de, courrier...","[Mais, je, ne, peux, pa, envoyer, de, courrier..."
671155,0,Se sent mal après avoir mangé & quot; cinema s...,"[Se, sent, mal, après, avoir, mangé, &, quot, ...","[Se, sent, mal, après, avoir, mangé, &, quot, ..."
131932,0,awwe..hey..i doivent chat lit g2 plus tard? :],"[awwe, .., hey, .., doivent, chat, lit, g2, pl...","[awwe, .., hey, .., doivent, chat, lit, g2, pl..."
1414414,1,La famille et les amis sont partis après un be...,"[La, famille, et, le, amis, sont, partis, aprè...","[La, famille, et, le, amis, sont, partis, aprè..."
259178,0,"Oui, je commence définitivement à ressentir le...","[Oui, ,, je, commence, définitivement, à, ress...","[Oui, ,, je, commence, définitivement, à, ress..."


Compare the time with this method.
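Since a generator has no length, `tqdm` cannot infer the total on its own; passing `total=` restores the percentage and the remaining-time estimate. A sketch, with a plain generator standing in for `nlp.pipe`:

```python
from tqdm import tqdm

texts = ["besoin d'un câlin", "Je meurs?", "Pas l'équipage complet"]
docs_like = (text.split() for text in texts)  # stand-in for nlp.pipe(texts, ...)

processed = []
# total= tells tqdm how many items to expect from the generator
for tokens in tqdm(docs_like, total=len(texts)):
    processed.append(tokens)

print(len(processed))
```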

# 3. Sentiment classification

Now that you have the tools to pre-process large bodies of text, you can try to classify more than 20,000 tweets (100,000 for example) 🔥!

For this part, you are free to use the classification methods of your choice.
Focus on preprocessing by exploring the tweets further 🔬! 

In [18]:
## Build initial X and y
X = df_sample.preprocessed_spacy
y = df_sample.label.to_numpy()
X.shape

(20000,)

In [19]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print(f'X_train.shape: {X_train.shape}')
print(f'X_test.shape: {X_test.shape}')
print(f'y_train.shape: {y_train.shape}')
print(f'y_test.shape: {y_test.shape}')

X_train.shape: (15000,)
X_test.shape: (5000,)
y_train.shape: (15000,)
y_test.shape: (5000,)


In [20]:
X_train

518895                                           [suis, moi]
1195889                     [Merci, pour, votre, suppotoday]
825992       [Je, l', aime, ,, dwight, est, mon, préféré, .]
1505759            [Dit, que, l', amour, vivant, grandit, .]
1303175    [En, regardant, le, tweet, de, tht, ,, il, rap...
                                 ...                        
1087288    [Dîner, et, ensuite, jouer, au, softball, avec...
78343      [Je, viens, de, rentrer, de, la, pendaison, av...
1431728    [Uriel, !, Vous, êtes, en, train, de, faire, l...
1386465    [toutes, no, félicitations, !, Très, drôle, ,,...
685521     [Vous, n', étiez, pa, un, bon, camarade, de, n...
Name: preprocessed_spacy, Length: 15000, dtype: object

In [21]:
vectorizer = TfidfVectorizer(lowercase=False, analyzer=lambda x: x)
tf_idf_train = vectorizer.fit_transform(X_train).toarray()
tf_idf_train.shape

(15000, 20230)
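Passing a callable as `analyzer` tells `TfidfVectorizer` to treat each input as already tokenized. The toy sketch below also keeps the matrix sparse rather than calling `.toarray()`, a design choice assumed here to save memory; `LogisticRegression` accepts sparse input.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# each document is already a list of tokens
token_lists = [["je", "aime", "twitter"], ["je", "déteste", "twitter"]]

def identity(tokens):
    """Return the tokens unchanged so no further tokenization happens."""
    return tokens

vectorizer_demo = TfidfVectorizer(analyzer=identity)
tfidf_sparse = vectorizer_demo.fit_transform(token_lists)  # SciPy sparse matrix

print(tfidf_sparse.shape)
print(sorted(vectorizer_demo.vocabulary_))
```

Using a named function instead of a lambda also keeps the fitted vectorizer picklable.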

In [22]:
## Test set: transform only, reusing the vocabulary fitted on the train set
tf_idf_test = vectorizer.transform(X_test).toarray()

In [23]:
lr = LogisticRegression(max_iter=100)
lr.fit(tf_idf_train, y_train)

# predictions on train and test
y_pred_train = lr.predict(tf_idf_train)
y_pred_test  = lr.predict(tf_idf_test)

In [24]:
# F1-score on the test set, one value per label (computed on the 20000-tweet sample)
vect_f1_score_test = f1_score(y_test, y_pred_test, average=None)
print(f" F1-score on test - label 0: {vect_f1_score_test[0]}")
print(f" F1-score on test - label 1: {vect_f1_score_test[1]}")

 F1-score on test - label 0: 0.7455023246411967
 F1-score on test - label 1: 0.7508410845042549
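With `average=None`, `f1_score` returns one value per class. Each is the harmonic mean of that class's precision and recall, as this standard-library sketch on made-up labels illustrates (`f1_for_label` is a hypothetical helper, not a scikit-learn function):

```python
def f1_for_label(y_true, y_pred, label):
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# made-up labels, for illustration only
y_true_demo = [0, 0, 0, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1, 0]
print(f1_for_label(y_true_demo, y_pred_demo, 0), f1_for_label(y_true_demo, y_pred_demo, 1))
```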