# Data cleaning

This part cleans and prepares the tweets stored in `data/tweets.csv` for the topic extraction that follows. It assumes the tweets are in English.

Cleaning includes:
- Removing URLs,
- Removing tweets that are too short (the current threshold is 50 characters),
- Converting tweets to lowercase,
- Tokenizing (punctuation marks become separate tokens, most of which the short-word filter then drops),
- Removing words that are too short (fewer than 4 characters),
- Removing stop words,
- Lemmatization ([more here](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)).
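The steps listed above can be sketched on a single string using only the standard library. This is a rough illustration, not the notebook's implementation: the regular-expression tokenizer and the tiny `STOP_WORDS` set are hypothetical stand-ins for NLTK's `word_tokenize` and its English stop-word list, the URL pattern is an assumption, and lemmatization is left out because it needs the WordNet data.

```python
import re

# Thresholds taken from the list above.
MIN_TWEET_LENGTH = 50
MIN_WORD_LENGTH = 4

# Hypothetical stand-ins: a tiny stop-word set and a crude tokenizer pattern.
STOP_WORDS = {"this", "that", "with", "from", "have", "about"}
WORD_PATTERN = re.compile(r"[a-z0-9']+")

# Assumed URL pattern: a URL runs up to the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")


def clean_tweet(text):
    """Return the cleaned tokens of one tweet, or None if the tweet is too short."""
    text = URL_PATTERN.sub("", text)
    if len(text) < MIN_TWEET_LENGTH:
        return None
    tokens = WORD_PATTERN.findall(text.lower())
    tokens = [token for token in tokens if len(token) >= MIN_WORD_LENGTH]
    return [token for token in tokens if token not in STOP_WORDS]


example = (
    "Scientists have published new findings about Arctic ice loss "
    "this week https://example.com/a1"
)
print(clean_tweet(example))
print(clean_tweet("Too short https://example.com"))
```

The length check runs after URL removal, so a tweet that is mostly a link is discarded even if its raw text exceeds the threshold.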

In [15]:
import pandas as pd
import numpy as np
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
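The imports above rely on NLTK data packages that are not installed with the library itself; the `[nltk_data]` log further down shows WordNet being fetched. A minimal sketch of downloading them is given here. The resource list and the helper name `download_nltk_resources` are assumptions, and recent NLTK releases may ask for `punkt_tab` instead of `punkt`; the `LookupError` raised by NLTK names the missing resource.

```python
# Assumed resource names for the tokenizer, stop words and lemmatizer;
# adjust them to what your NLTK version asks for.
NLTK_RESOURCES = ["punkt", "stopwords", "wordnet"]


def download_nltk_resources(resources=NLTK_RESOURCES):
    """Fetch each NLTK data package in turn."""
    import nltk  # imported lazily so the list above can be inspected without NLTK

    for name in resources:
        nltk.download(name)


# Run once per environment before executing the cleaning cell:
# download_nltk_resources()
```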

In [17]:
tweets = pd.read_csv("data/tweets.csv")

# Remove URLs from tweets (a URL runs up to the next whitespace, so text after it is kept)
tweets["text"] = tweets["text"].apply(lambda tweet: re.sub(r'https?://\S+', '', tweet))

# Remove tweets with fewer than 50 characters (keep only those with 50 or more)
tweets = tweets[tweets["text"].str.len() >= 50].copy()

# Lower case
tweets["text"] = tweets["text"].str.lower()

# Tokenize
tweets["tokenized"] = tweets["text"].apply(word_tokenize)

# Remove too short words (<= 3 chars)
tweets["tokenized"] = tweets["tokenized"].apply(lambda tokens: [token for token in tokens if len(token) > 3])

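# Remove English stop words (build the set once instead of reloading the list for every token)
english_stop_words = set(stopwords.words('english'))
tweets["tokenized"] = tweets["tokenized"].apply(lambda tokens: [token for token in tokens if token not in english_stop_words])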

# Lemmatization
lemmatizer = WordNetLemmatizer()
tweets["tokenized"] = tweets["tokenized"].apply(lambda tokens: [lemmatizer.lemmatize(token) for token in tokens])

tweets.to_csv("data/out.csv")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\launeau\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.
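The cell above ends by writing the data frame to `data/out.csv`. CSV has no list type, so each list in the `tokenized` column is stored as its string representation, and a later step has to parse it back. A minimal sketch with the standard library, assuming a one-row file with the same two columns (the sample row is made up for illustration):

```python
import ast
import csv
import io

# Stand-in for the contents of data/out.csv: the token list is stored as text.
csv_text = """text,tokenized
"arctic ice loss","['arctic', 'loss']"
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
raw = rows[0]["tokenized"]       # a str, not a list
tokens = ast.literal_eval(raw)   # parsed back into a Python list

print(type(raw).__name__, type(tokens).__name__)
print(tokens)
```

With pandas, the same parser can be passed to `pd.read_csv` through its `converters` argument, for example `converters={"tokenized": ast.literal_eval}`.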


## Sources

https://arxiv.org/ftp/arxiv/papers/1608/1608.02519.pdf  
http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf  
https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html