# Text Preprocessing

Lab overview:


* Normalization
* Tokenization
* Lemmatization
* Stemming
* Stopwords removal





## Text normalization (cleaning)

Depending on the task you are cleaning the text for, you may perform one or more of: 

* Transform text to lowercase
* Remove emoticons ( :) :D) and emojis (💙 🐱)
* Remove punctuation
* Remove digits or transform them to words
* Correct spelling errors


Python Regular Expressions 
*   [re Python documentation](https://docs.python.org/3/library/re.html)
*   [Quick reference](https://www.computerhope.com/unix/regex-quickref.htm)
*   [Cheat Sheet](https://www.dataquest.io/wp-content/uploads/2019/03/python-regular-expressions-cheat-sheet.pdf)

![regular_expressions](https://res.cloudinary.com/practicaldev/image/fetch/s--_iE0KvdT--/c_imagga_scale,f_auto,fl_progressive,h_900,q_auto,w_1600/https://dev-to-uploads.s3.amazonaws.com/i/zpek00ubevoxvn458b01.png)

[Photo source](https://dev.to/mconner89/regular-expressions-grouping-and-string-methods-3ijn)

Here is our text sample, a short review of the movie [Jaws](https://en.wikipedia.org/wiki/Jaws_(film))

In [14]:
text = '" Jaws " 🦈🦈🦈 is a rare film that grabs your attention before it shows you a single image on screen. The movie opens with blackness, and only distant, alien-like underwater sounds. :) :D It deserves 5 stars, not 4 stars.'
text

'" Jaws " 🦈🦈🦈 is a rare film that grabs your attention before it shows you a single image on screen. The movie opens with blackness, and only distant, alien-like underwater sounds. :) :D It deserves 5 stars, not 4 stars.'

Transform text to lowercase

In [15]:
text = text.lower()
text

'" jaws " 🦈🦈🦈 is a rare film that grabs your attention before it shows you a single image on screen. the movie opens with blackness, and only distant, alien-like underwater sounds. :) :d it deserves 5 stars, not 4 stars.'

Import the [re](https://docs.python.org/3/library/re.html) library

In [16]:
import re

Remove digits

In [17]:
re.sub(r' \d+', '', text)

'" jaws " 🦈🦈🦈 is a rare film that grabs your attention before it shows you a single image on screen. the movie opens with blackness, and only distant, alien-like underwater sounds. :) :d it deserves stars, not stars.'
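The pattern used above only matches digits that follow a space, so a number at the very start of the text or glued to punctuation would survive. A minimal sketch of one alternative, assuming we want to drop standalone numbers wherever they occur (the helper name `remove_numbers` is hypothetical, not part of the lab):

```python
import re

def remove_numbers(text: str) -> str:
    """Drop standalone runs of digits, then collapse leftover spaces."""
    # \b\d+\b matches whole numbers only, so tokens like "mp3" are untouched.
    without_numbers = re.sub(r'\b\d+\b', '', text)
    # Removing a number can leave two spaces in a row; squeeze them.
    return re.sub(r' {2,}', ' ', without_numbers).strip()

sample = "5 stars, not 4 stars. my mp3 player"
print(remove_numbers(sample))
```

Whether digits inside tokens such as "mp3" should be kept depends on the task, so treat the word-boundary choice as an assumption.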

Convert numbers to words using [num2words](https://github.com/savoirfairelinux/num2words), which supports multiple languages.

We need to install the num2words library first.

In [18]:
!pip install num2words

Collecting num2words
  Downloading num2words-0.5.10-py3-none-any.whl (101 kB)
Installing collected packages: num2words
Successfully installed num2words-0.5.10


After installing, we can import it.

In [19]:
from num2words import num2words

text = ' '.join([num2words(word) if word.isdigit() else word for word in text.split()])
text


'" jaws " 🦈🦈🦈 is a rare film that grabs your attention before it shows you a single image on screen. the movie opens with blackness, and only distant, alien-like underwater sounds. :) :d it deserves five stars, not four stars.'
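The list comprehension above converts only tokens that consist entirely of digits, so a number attached to punctuation (for example `5/5`) is left unchanged. A possible alternative, sketched here under stated assumptions, lets `re.sub` call a conversion function on every match. `digit_to_word` is a hypothetical stand-in for num2words that only knows 0-9, so the example runs without installing anything:

```python
import re

# Hypothetical stand-in for num2words, covering 0-9 only.
_DIGIT_WORDS = ["zero", "one", "two", "three", "four",
                "five", "six", "seven", "eight", "nine"]

def digit_to_word(number: int) -> str:
    # Numbers outside 0-9 are returned unchanged, as digits.
    return _DIGIT_WORDS[number] if 0 <= number <= 9 else str(number)

def spell_out_numbers(text: str, converter=digit_to_word) -> str:
    """Replace each standalone number with converter(number)."""
    # re.sub accepts a function; it is called once per match object.
    return re.sub(r'\b\d+\b', lambda m: converter(int(m.group())), text)

print(spell_out_numbers("rated 5/5, maybe 4."))
```

Passing `converter=num2words` instead of the stand-in should extend this to larger numbers.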

Remove emoticons ( :) :D) and emojis (💙 🐱)

Using [emoji](https://github.com/carpedm20/emoji) library or the corresponding unicode characters.

We need to install the emoji library first.

In [20]:
!pip install emoji



After installing, we can import it.

In [21]:
import emoji

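emoji.replace_emoji(text, replace='')

'" jaws "  is a rare film that grabs your attention before it shows you a single image on screen. the movie opens with blackness, and only distant, alien-like underwater sounds. :) :d it deserves five stars, not four stars.'

The *replace_emoji()* function replaces every emoji in the string with the given replacement (here, the empty string). It supersedes *get_emoji_regexp()*, which returned a regex matching any emoji and was removed in version 2.0 of the library.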

Another way of removing emojis with regex:


In [22]:
emoj = re.compile("["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
    u"\U00002500-\U00002BEF"  # box drawing, geometric shapes, misc. symbols, arrows
    u"\U00002702-\U000027B0"  # dingbats
    u"\U000024C2-\U0001F251"  # very broad range: also covers CJK ideographs
    u"\U0001f926-\U0001f937"
    u"\U00010000-\U0010ffff"
    u"\u2640-\u2642" 
    u"\u2600-\u2B55"
    u"\u200d"
    u"\u23cf"
    u"\u23e9"
    u"\u231a"
    u"\ufe0f"
    u"\u3030"
    "]+", re.UNICODE)

text = re.sub(emoj, '', text)
text

'" jaws "  is a rare film that grabs your attention before it shows you a single image on screen. the movie opens with blackness, and only distant, alien-like underwater sounds. :) :d it deserves five stars, not four stars.'
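Some ranges in the pattern above are very broad. `\U000024C2-\U0001F251`, for instance, also contains CJK ideographs, so Chinese or Japanese text would be deleted along with the emojis. A narrower sketch, assuming it is acceptable to drop every character in the Unicode category `So` (Symbol, other), uses the standard `unicodedata` module. The helper name `strip_symbols` is hypothetical:

```python
import unicodedata

def strip_symbols(text: str) -> str:
    """Remove characters whose Unicode category is 'So' (Symbol, other)."""
    return ''.join(ch for ch in text if unicodedata.category(ch) != 'So')

print(strip_symbols('jaws 🦈🦈🦈 好 :)'))
```

This is not a complete emoji remover: it also drops other symbols such as ©, and it leaves skin-tone modifiers, variation selectors and zero-width joiners in place, since those belong to other categories.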

Removing emoticons (regex from [nltk Twitter Tokenizer](https://github.com/nltk/nltk/blob/develop/nltk/tokenize/casual.py))

In [23]:
emoticon_string = r"""
    (?:
      [<>]?
      [:;=8]                     # eyes
      [\-o\*\']?                 # optional nose
      [\)\]\(\[dDpP/\:\}\{@\|\\] # mouth
      |
      [\)\]\(\[dDpP/\:\}\{@\|\\] # mouth
      [\-o\*\']?                 # optional nose
      [:;=8]                     # eyes
      [<>]?
      |
      </?3                       # heart
    )"""
    
emoticon_re = re.compile(emoticon_string, re.VERBOSE | re.I | re.UNICODE)
text = re.sub(emoticon_re, '', text)
text

'" jaws "  is a rare film that grabs your attention before it shows you a single image on screen. the movie opens with blackness, and only distant, alien-like underwater sounds.   it deserves five stars, not four stars.'

## Tokenization


*   Word level: Split by whitespace, [nltk.word_tokenize](https://www.nltk.org/api/nltk.tokenize.html)
*   Sentence level: Split by punctuation, [nltk.sent_tokenize](https://www.nltk.org/api/nltk.tokenize.html)


In [24]:
print(text.split())

['"', 'jaws', '"', 'is', 'a', 'rare', 'film', 'that', 'grabs', 'your', 'attention', 'before', 'it', 'shows', 'you', 'a', 'single', 'image', 'on', 'screen.', 'the', 'movie', 'opens', 'with', 'blackness,', 'and', 'only', 'distant,', 'alien-like', 'underwater', 'sounds.', 'it', 'deserves', 'five', 'stars,', 'not', 'four', 'stars.']
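Splitting on whitespace leaves punctuation attached to the words (`'screen.'`, `'blackness,'`). A minimal regex tokenizer, sketched here as an illustration rather than a replacement for nltk, separates punctuation marks into their own tokens. The names `TOKEN_RE` and `regex_tokenize` are hypothetical:

```python
import re

# A word (optionally hyphenated) or a single non-space, non-word character.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def regex_tokenize(text: str) -> list:
    """Return words and punctuation marks as separate tokens."""
    return TOKEN_RE.findall(text)

print(regex_tokenize('only distant, alien-like underwater sounds.'))
```

Contractions such as "don't" are split into three tokens by this pattern, which is one reason to prefer a dedicated tokenizer for real data.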


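We first need to download the Punkt tokenizer models, which the nltk tokenizers rely on (recent nltk releases may ask for the 'punkt_tab' resource instead).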

In [25]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [26]:
from nltk import word_tokenize
tokenized_text_nltk = word_tokenize(text)
print(tokenized_text_nltk)

['``', 'jaws', '``', 'is', 'a', 'rare', 'film', 'that', 'grabs', 'your', 'attention', 'before', 'it', 'shows', 'you', 'a', 'single', 'image', 'on', 'screen', '.', 'the', 'movie', 'opens', 'with', 'blackness', ',', 'and', 'only', 'distant', ',', 'alien-like', 'underwater', 'sounds', '.', 'it', 'deserves', 'five', 'stars', ',', 'not', 'four', 'stars', '.']


Sentence tokenization using regex

In [27]:
re.split(r'(?<=[.!?]) +', text)

['" jaws "  is a rare film that grabs your attention before it shows you a single image on screen.',
 'the movie opens with blackness, and only distant, alien-like underwater sounds.',
 'it deserves five stars, not four stars.']

Sentence tokenization using nltk.sent_tokenize

In [28]:
nltk.sent_tokenize(text)

['" jaws "  is a rare film that grabs your attention before it shows you a single image on screen.',
 'the movie opens with blackness, and only distant, alien-like underwater sounds.',
 'it deserves five stars, not four stars.']

In [29]:
text_example = 'I was good.Thanks.'
re.split('(?<=[.!?]) +', text_example)

['I was good.Thanks.']

In [30]:
nltk.sent_tokenize(text_example)

['I was good.Thanks.']
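Both approaches fail on this example because no space follows the first full stop. One possible workaround, sketched under the assumption that the text contains no abbreviations or decimal numbers, is to split after sentence-ending punctuation whether or not whitespace follows (this relies on Python 3.7+, where `re.split` can split on empty matches). The helper name `split_sentences` is hypothetical:

```python
import re

def split_sentences(text: str) -> list:
    """Split after '.', '!' or '?' even when no space follows."""
    # (?<=[.!?])    : the previous character ends a sentence
    # \s*           : consume any whitespace between sentences
    # (?=[^\s.!?])  : split only if new text (not more punctuation) follows
    return re.split(r'(?<=[.!?])\s*(?=[^\s.!?])', text)

print(split_sentences('I was good.Thanks.'))
```

The same pattern would wrongly split "3.14" or "e.g.this", so it trades one kind of error for another.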

Removing punctuation


In [31]:
re.sub(r'[^\w\s]','', text)

' jaws   is a rare film that grabs your attention before it shows you a single image on screen the movie opens with blackness and only distant alienlike underwater sounds   it deserves five stars not four stars'

Using [string](https://docs.python.org/3/library/string.html) library. 

string.punctuation is a string constant that contains the ASCII punctuation characters; it is not a method and it does not include non-ASCII marks such as curly quotes.

We call str.maketrans('', '', string.punctuation) to build a translation table that maps each of those characters to None, and the translate() method then deletes them from the text.

In [32]:
import string
text = text.translate(str.maketrans('', '', string.punctuation))
text

' jaws   is a rare film that grabs your attention before it shows you a single image on screen the movie opens with blackness and only distant alienlike underwater sounds   it deserves five stars not four stars'
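Because the translation table is built from an ordinary string, it is easy to keep selected characters. The sketch below assumes we want to preserve hyphens so that "alien-like" stays intact; the variable names are illustrative only:

```python
import string

# Every ASCII punctuation mark except the hyphen.
punctuation_to_remove = string.punctuation.replace('-', '')

# maketrans('', '', chars) maps each character in chars to None (delete).
table = str.maketrans('', '', punctuation_to_remove)

sample = 'distant, alien-like underwater sounds.'
print(sample.translate(table))
```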

Removing multiple spaces between words

In [33]:
text = re.sub(' +', ' ', text)
text

' jaws is a rare film that grabs your attention before it shows you a single image on screen the movie opens with blackness and only distant alienlike underwater sounds it deserves five stars not four stars'
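The result above still starts with a space, and the pattern `' +'` does not touch tabs or newlines. A small sketch that combines lowercasing, punctuation removal and whitespace clean-up in one function, using only the standard library (the name `normalize` is hypothetical):

```python
import re
import string

_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and tidy whitespace."""
    text = text.lower().translate(_PUNCT_TABLE)
    # \s+ also covers tabs and newlines; strip() drops leading/trailing space.
    return re.sub(r'\s+', ' ', text).strip()

print(normalize('" Jaws "  is a RARE film.\nIt deserves five stars!'))
```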

## Removing stopwords

![stopwords.jpg](https://user.oc-static.com/upload/2021/01/06/16099626487943_P1C2.png) 

[Photo source](https://openclassrooms.com/en/courses/6532301-introduction-to-natural-language-processing/6980726-remove-stop-words-from-a-block-of-text)






### Why do we need to remove stopwords?

For tasks such as text classification, we may want to remove any unnecessary words and keep only words with meaning. 

Stopwords removal is not used in tasks such as machine translation or text summarization.
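Stopword lists usually contain negations such as "not", which can matter for tasks like sentiment classification. The sketch below uses a tiny hand-written stopword set (an assumption for illustration; real lists are much longer) to show how removing "not" changes what is left of a sentence, and how the set can be adjusted:

```python
# A toy stopword set used only for illustration.
toy_stop_words = {'it', 'is', 'a', 'not', 'the', 'this'}

def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in stop_words."""
    return [tok for tok in tokens if tok not in stop_words]

tokens = 'this is not a good movie'.split()

# With the full set, the negation disappears.
print(remove_stop_words(tokens, toy_stop_words))
# Subtracting 'not' from the set keeps the negation.
print(remove_stop_words(tokens, toy_stop_words - {'not'}))
```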

Using [nltk](https://www.nltk.org/index.html) and [spaCy](https://spacy.io/).

Stopwords removal using nltk

In [34]:
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words_nltk = set(stopwords.words('english'))
print(len(stop_words_nltk))
print(stop_words_nltk)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
179
{'won', 'my', 'being', 'under', "wasn't", "needn't", 'you', 'should', 'll', 'or', 'not', "didn't", "should've", "that'll", 'further', 'is', 'him', 'off', 'myself', 'of', 'mightn', 'didn', 'itself', 'so', 'which', 'most', 'an', 'i', 've', 'wouldn', 'do', 'both', 'but', 'into', 's', 'theirs', 'ourselves', 'above', 'd', 'have', "you'd", 'during', "mightn't", 'any', 'haven', 'they', 'ain', 'all', 'than', "mustn't", 'be', 'himself', 't', 'yours', 'for', 'hadn', 'what', "hadn't", 'does', 'here', 'just', 'couldn', "don't", "isn't", 'mustn', 'if', "it's", 'to', 'because', "shan't", 'ma', 'your', 'only', 'a', 'same', 'yourselves', 'too', 'y', 'me', 'has', 'out', 'there', 'once', 'needn', 'no', 'nor', 'why', 'from', 'can', 'until', 'herself', 'did', 'this', 'very', "wouldn't", 'its', 'when', 'don', 'own', 'hasn', 'few', 'and', 'having', 'hers', 'their', 'themselves', 'at', '

In [35]:
tokenized_text_without_stopwords = [i for i in tokenized_text_nltk if i not in stop_words_nltk]
print(tokenized_text_without_stopwords)

['``', 'jaws', '``', 'rare', 'film', 'grabs', 'attention', 'shows', 'single', 'image', 'screen', '.', 'movie', 'opens', 'blackness', ',', 'distant', ',', 'alien-like', 'underwater', 'sounds', '.', 'deserves', 'five', 'stars', ',', 'four', 'stars', '.']


Stopwords removal using spacy

In [36]:
import spacy
nlp = spacy.load('en_core_web_sm')
stop_words_spacy = nlp.Defaults.stop_words
print(len(stop_words_spacy))
print(stop_words_spacy)

326
{'enough', 'becoming', "'d", 'must', 'sometimes', 'since', '’ve', 'thereby', 'though', 'off', 'of', 'unless', 'due', 'next', 'various', 'amount', 'i', 'somewhere', 'do', 'but', 'thus', 'above', 'have', 'upon', 'during', 'all', 'nobody', 'yours', 'elsewhere', 'what', 'rather', 'together', 'another', 'whose', 'because', 'wherever', 'only', 'whereupon', 'too', 'across', 'often', 'no', 'really', 'besides', 'its', 'twelve', '’ll', 'at', 'our', 'either', 'whether', 'are', 'whither', 're', 'used', 'along', 'else', 'he', 'in', 'forty', 'as', "'ll", 'beyond', 'take', 'us', 'whole', 'put', 'becomes', 'name', 'whereafter', 'about', 'amongst', 'seem', 'many', 'call', 'sometime', 'throughout', 'alone', 'whatever', 'was', 'up', 'cannot', 'whom', 'nine', 'down', 'one', 'fifteen', 'front', 'toward', 'moreover', 'whence', 'under', 'not', 'almost', 'thence', 'around', 'is', 'someone', '’d', 'whenever', 'five', '‘ve', 'most', 'an', 'please', "'m", 'ourselves', 'by', 'than', 'be', 'himself', 'therefor

In [37]:
tokenized_text_spacy = nlp(text)
# Compare token.text (a string) with the stopword set: a spaCy Token object
# never equals a string, so testing the Token itself would remove nothing.
tokenized_text_without_stopwords = [token for token in tokenized_text_spacy if token.text not in stop_words_spacy]
print(tokenized_text_without_stopwords)


## Lemmatization/Stemming

![1_HLQgkMt5-g5WO5VpNuTl_g.jpeg](https://miro.medium.com/max/564/1*HLQgkMt5-g5WO5VpNuTl_g.jpeg)

[Photo source](https://tr.pinterest.com/pin/706854104005417976/)

Using [nltk](https://www.nltk.org/index.html) and [spaCy](https://spacy.io/).

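Lemmatization

Using the WordNetLemmatizer from nltk. By default, *lemmatize()* treats every word as a noun (pos='n'), which is why a verb such as "deserves" is left unchanged in the output; pass pos='v' to lemmatize verbs.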


In [38]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [39]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

words = word_tokenize(text)
for word in words:
    print(word, lemmatizer.lemmatize(word))

jaws jaw
is is
a a
rare rare
film film
that that
grabs grab
your your
attention attention
before before
it it
shows show
you you
a a
single single
image image
on on
screen screen
the the
movie movie
opens open
with with
blackness blackness
and and
only only
distant distant
alienlike alienlike
underwater underwater
sounds sound
it it
deserves deserves
five five
stars star
not not
four four
stars star


Using the [lemmatizer](https://spacy.io/api/lemmatizer) from spacy

In [40]:
!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download en_core_web_sm


In [41]:
import spacy

# Load English tokenizer, tagger, parser, etc.
nlp = spacy.load("en_core_web_sm")

doc = nlp(text)

for token in doc:
  print(token, token.lemma_)

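OSError: ignored

This error typically appears when spaCy was already imported before the upgrade above: restart the runtime and run the cell again so that the upgraded library and model are loaded.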

Stemming using nltk

In [None]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
for word in words:
    print(word, ps.stem(word))

[Other stemmers in nltk](https://www.nltk.org/api/nltk.stem.html)

The spacy library does not perform stemming, only lemmatization.

# Assignment

To be uploaded here: https://forms.gle/ygCNwFM4i5RMPtsC6

Preprocess texts from Twitter

## Data

We will use the Twitter corpus from nltk, which is often used for sentiment analysis.

The first step is to download the dataset using the *download* function.

In [3]:
import nltk
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.


True

In order to inspect our data, we look at the first 25 tweets from the dataset. The text contains a lot of mentions, hashtags and emoticons.

In [4]:
from nltk.corpus import twitter_samples

tweets = twitter_samples.strings('positive_tweets.json')
tweets = tweets[:25]
tweets

['#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)',
 '@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!',
 '@DespiteOfficial we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?!',
 '@97sides CONGRATS :)',
 'yeaaaah yippppy!!!  my accnt verified rqst has succeed got a blue tick mark on my fb profile :) in 15 days',
 '@BhaktisBanter @PallaviRuhail This one is irresistible :)\n#FlipkartFashionFriday http://t.co/EbZ0L2VENM',
 "We don't like to keep our lovely customers waiting for long! We hope you enjoy! Happy Friday! - LWWF :) https://t.co/smyYriipxI",
 '@Impatientraider On second thought, there’s just not enough time for a DD :) But new shorts entering system. Sheep must be buying.',
 'Jgh , but we have to go to Bayan :D bye',
 'As an act of mischievousness, am calling the ETL layer of our in-house warehousing 

**Given a list of tweets, preprocess each tweet from the list.**

**Instructions**: Implement the *preprocess* function. You can do the text cleaning in any order you prefer.

**Hint**: You may need to use regular expressions (see the resources provided above).


In [45]:
from nltk import word_tokenize
import emoji
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

emoticon_string = r"""
    (?:
      [<>]?
      [:;=8]                     # eyes
      [\-o\*\']?                 # optional nose
      [\)\]\(\[dDpP/\:\}\{@\|\\] # mouth
      |
      [\)\]\(\[dDpP/\:\}\{@\|\\] # mouth
      [\-o\*\']?                 # optional nose
      [:;=8]                     # eyes
      [<>]?
      |
      </?3                       # heart
    )"""
emoticon_re = re.compile(emoticon_string, re.VERBOSE | re.I | re.UNICODE)

stop_words_nltk = set(stopwords.words('english'))

lemmatizer = WordNetLemmatizer()


def preprocess(tweets):
    """
    Input:
        tweets: a list of tweets
    Output:
        preprocessed_tweets: a list of preprocessed tweets
    """

    # Some cleaning steps work at the document level, others at the word level.
    preprocessed_tweets = []

    for tweet in tweets:
        # Splitting on whitespace drops new line characters '\n'.
        # Then remove links (http://t.co/of3DyOzML0), mentions '@' and
        # hashtags '#', and lowercase the remaining words.
        tweet = ' '.join(word.lower() for word in tweet.split()
                         if not word.startswith(('http', '@', '#')))

        # Remove emojis and emoticons '👌 🍭 :) :D'
        # (replace_emoji supersedes get_emoji_regexp, removed in emoji 2.0)
        tweet = emoticon_re.sub('', emoji.replace_emoji(tweet, replace=''))

        # Remove digits
        tweet = re.sub(r' \d+', '', tweet)

        # Remove punctuation
        tweet = re.sub(r'[^\w\s]', '', tweet)

        # Tokenize the tweet into words, remove stopwords, then lemmatize
        tweet = ' '.join(lemmatizer.lemmatize(word) for word in word_tokenize(tweet)
                         if word not in stop_words_nltk)

        preprocessed_tweets.append(tweet)

    return preprocessed_tweets

preprocess(tweets)

['top engaged member community week',
 'hey james odd please call contact centre able assist many thanks',
 'listen last night bleed amazing track scotland',
 'congrats',
 'yeaaaah yippppy accnt verified rqst succeed got blue tick mark fb profile day',
 'one irresistible',
 'dont like keep lovely customer waiting long hope enjoy happy friday lwwf',
 'second thought there enough time dd new short entering system sheep must buying',
 'jgh go bayan bye',
 'act mischievousness calling etl layer inhouse warehousing app katamari well name implies',
 'top influencers community week',
 'wouldnt love bigjuicyselfies',
 'follow follow u back',
 'perfect already know whats waiting',
 'great new opportunity junior triathletes aged gatorade series get entry',
 'laying greeting card range print today love job',
 'friend lunch yummmm',
 'id conflict thanks help here screenshot working',
 'hi liv',
 'hello need know something u fm twitter sure thing dm x',
 'top new follower community week',
 'ive hea
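The solution above filters links, mentions and hashtags by looking at the first characters of each whitespace-separated word. An alternative sketch with regular expressions, assuming that anything of the form `@name` or `#tag` should go even when it is glued to punctuation, could look like this. The pattern and function names are hypothetical:

```python
import re

URL_RE = re.compile(r'https?://\S+')
MENTION_RE = re.compile(r'@\w+')
HASHTAG_RE = re.compile(r'#\w+')

def strip_twitter_markup(tweet: str) -> str:
    """Remove links, @mentions and #hashtags, then tidy whitespace."""
    # URLs go first so that a '#' or '@' inside a link is not handled twice.
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        tweet = pattern.sub('', tweet)
    # Collapsing \s+ also removes the '\n' characters found in some tweets.
    return re.sub(r'\s+', ' ', tweet).strip()

print(strip_twitter_markup('@97sides CONGRATS :)\n#Friday http://t.co/EbZ0L2VENM'))
```

Note that `@\w+` would also eat part of an e-mail address, so the word-based filter may be the safer choice for some corpora.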

Tools:

* [Preprocessing library for Twitter](https://github.com/s/preprocessor)
* [Emoji library](https://github.com/carpedm20/emoji)
* [Demoji library](https://github.com/bsolomon1124/demoji)
* [Gensim](https://radimrehurek.com/gensim/)


Further reading:

* [Lexical Normalization](https://arxiv.org/pdf/1710.03476.pdf)
* [On learning and representing social meaning in NLP: a sociolinguistic perspective](https://aclanthology.org/2021.naacl-main.50.pdf)






