# Challenge - Preprocessing pipeline

![](https://media.ouest-france.fr/v1/pictures/fe9603bace85f5c3339acb605cb31894-17133782.jpg?width=1400&client_id=eds&sign=9fb46757bc793cfe75ca6a14462ccbf26bbff31d9a7ce55d426c03ae31da2465)

## Objectives

First, reduce the time needed to preprocess text data with spaCy.
Second, classify French tweets as negative or positive.

## Guidelines

🚰 Preprocessing text can be slow and resource-intensive for your computer, especially if your dataset is large. spaCy provides a [feature](https://spacy.io/usage/processing-pipelines) for building a text preprocessing pipeline that makes the process more efficient.

To measure preprocessing time, we will use the `tqdm` package, which displays a progress bar.

## Dataset

📥 In this exercise we will use a Kaggle dataset of 1.5 million French tweets labelled with their sentiment (negative or positive): https://www.kaggle.com/hbaflast/french-twitter-sentiment-analysis

To avoid any GitHub issues, don't forget to store the dataset in your local `data` folder.

In [None]:
# import libraries
from tqdm import tqdm

# 1. Load data and explore it

In [None]:
# load the small French model

In [None]:
# import data

In [None]:
# check the labels balance

> Explore some tweets: which label corresponds to which sentiment? Are the tweets properly labelled?

In [None]:
# explore some tweets and consider the pre-processing steps that will be required

> 0 seems to mark the negative tweets and 1 the positive ones, **but the labels are not always very accurate...** This is not a big deal for our exercise, which aims to explore how a large dataset can be preprocessed with spaCy.


For now, create a sample of just 20,000 tweets and process it with spaCy.

# 2. Sample the dataset and preprocess it 

In [None]:
# sample the dataset (20000 tweets) in df_sample 

Create a preprocessing function. For now, don't spend too long on the preprocessing steps; keep it simple:
- remove the punctuation
- remove stopwords
- lemmatization
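
As a rough sketch, these three steps map onto the spaCy token attributes `is_punct`, `is_stop` and `lemma_`. The filter below works on any sequence of token-like objects. The helper names `keep_lemmas` and `fake_token`, and the `SimpleNamespace` stand-ins with hand-written lemmas, are assumptions made only so the snippet runs without a model. In the notebook you would pass the tokens of `nlp(tweet)` instead.

```python
from types import SimpleNamespace


def keep_lemmas(tokens):
    """Return lowercased lemmas of tokens that are neither punctuation nor stopwords."""
    return [
        tok.lemma_.lower()
        for tok in tokens
        if not tok.is_punct and not tok.is_stop
    ]


def fake_token(lemma, is_punct=False, is_stop=False):
    # Hypothetical stand-in exposing the same attribute names as a spaCy Token.
    return SimpleNamespace(lemma_=lemma, is_punct=is_punct, is_stop=is_stop)


# Hand-written tokens for illustration only (not real model output).
sample = [
    fake_token("je", is_stop=True),
    fake_token("adorer"),
    fake_token("le", is_stop=True),
    fake_token("chat"),
    fake_token("!", is_punct=True),
]
print(keep_lemmas(sample))
```

Lowercasing the lemmas is a design choice of this sketch, not a requirement of the exercise.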

In [None]:
# create the preprocessing function

To measure how long it takes to preprocess the tweets, build a list containing one list of tokens per tweet and time the loop. To do that, wrap the iterable in the `tqdm()` function, like this:

```python
tokens = list()
for tweet in tqdm(df_sample.text):
    tokens.append(preprocessing(tweet))
```

In [None]:
# create a new list with all tweets tokens 

How long did it take?

It takes several minutes for just 20,000 tweets; imagine that for 1.5 million 🤯!

Now let's try to optimize the process with a spaCy pipeline.

A spaCy pipeline takes texts as input and has some useful arguments for reducing preprocessing time (see the documentation: https://spacy.io/usage/processing-pipelines).

We will look at two in particular:
- **disable**: when we pass a text to a spaCy model, by default it runs many processing steps (named entity recognition, retrieving embedding vectors, etc.). With this parameter we can skip the steps we don't need here.

- **batch_size**: the number of texts processed at a time
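
To make `batch_size` more concrete, here is a plain-Python illustration of grouping an iterable of texts into fixed-size chunks. It is not spaCy's implementation, and the helper name `chunk_texts` is hypothetical. It only shows what "texts processed at a time" means.

```python
from itertools import islice


def chunk_texts(texts, batch_size):
    """Yield successive lists of at most `batch_size` texts."""
    iterator = iter(texts)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Seven toy tweets grouped three at a time: the last batch is smaller.
tweets = [f"tweet {i}" for i in range(7)]
batches = list(chunk_texts(tweets, batch_size=3))
print([len(batch) for batch in batches])
```

A good value usually has to be found by experiment, so it is worth timing a few batch sizes on your sample.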

In [None]:
docs = nlp.pipe(df_sample.text, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "ner"], batch_size=20)

`pipe` returns an iterator of spaCy docs. Write another preprocessing function that takes a spaCy doc object as input rather than a raw tweet!

In [None]:
# create a preprocessing_2 function

In [None]:
# create a new list with all tweets tokens from pipeline

Compare the time taken by this method with the time taken by the first one.
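
If you also want the durations as numbers rather than reading them off the progress bar, one option is a small helper built on `time.perf_counter`. The names `timed`, `approach_a` and `approach_b` and the toy workloads are assumptions for illustration. Replace the two toy functions with your own preprocessing loops.

```python
import time


def timed(func, *args, **kwargs):
    """Run `func` and return a (result, elapsed_seconds) tuple."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def approach_a(texts):
    # Toy stand-in for the per-tweet loop.
    return [text.lower().split() for text in texts]


def approach_b(texts):
    # Toy stand-in for the pipeline-based loop.
    return [text.lower().split() for text in texts]


texts = ["Bonjour tout le monde"] * 1000
tokens_a, seconds_a = timed(approach_a, texts)
tokens_b, seconds_b = timed(approach_b, texts)
print(f"first method: {seconds_a:.4f}s, second method: {seconds_b:.4f}s")
```

Checking that both methods return the same tokens before comparing their timings helps confirm that the comparison is fair.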

# 3. Sentiment classification

Now that you have the tools to pre-process large bodies of text, you can try to classify more than 20,000 tweets (100,000 for example) 🔥!

For this part, you are free to use the classification methods of your choice.
Focus on preprocessing by exploring the tweets further 🔬! 