# Text Preprocessing

Lab overview:


* Normalization
* Tokenization
* Lemmatization
* Stemming
* Stopwords removal





## Text normalization (cleaning)

Depending on the task you are cleaning the text for, you may perform one or more of: 

* Transform text to lowercase
* Remove emoticons ( :) :D) and emojis (💙 🐱)
* Remove punctuation
* Remove digits or transform them to words
* Correct spelling errors


Python Regular Expressions 
*   [re Python documentation](https://docs.python.org/3/library/re.html)
*   [Quick reference](https://www.computerhope.com/unix/regex-quickref.htm)
*   [Cheat Sheet](https://www.dataquest.io/wp-content/uploads/2019/03/python-regular-expressions-cheat-sheet.pdf)

![regular_expressions](https://res.cloudinary.com/practicaldev/image/fetch/s--_iE0KvdT--/c_imagga_scale,f_auto,fl_progressive,h_900,q_auto,w_1600/https://dev-to-uploads.s3.amazonaws.com/i/zpek00ubevoxvn458b01.png)

[Photo source](https://dev.to/mconner89/regular-expressions-grouping-and-string-methods-3ijn)

Here is our text sample, a short review of the movie [Jaws](https://en.wikipedia.org/wiki/Jaws_(film))

In [1]:
text = '" Jaws " 🦈🦈🦈 is a rare film that grabs your attention before it shows you a single image on screen. The movie opens with blackness, and only distant, alien-like underwater sounds. :) :D It deserves 5 stars, not 4 stars.'
text

'" Jaws " 🦈🦈🦈 is a rare film that grabs your attention before it shows you a single image on screen. The movie opens with blackness, and only distant, alien-like underwater sounds. :) :D It deserves 5 stars, not 4 stars.'

Transform text to lowercase

In [2]:
text = text.lower()
text

'" jaws " 🦈🦈🦈 is a rare film that grabs your attention before it shows you a single image on screen. the movie opens with blackness, and only distant, alien-like underwater sounds. :) :d it deserves 5 stars, not 4 stars.'

Import the [re](https://docs.python.org/3/library/re.html) library

In [3]:
import re

Remove digits

In [4]:
re.sub(r' \d+', '', text)

'" jaws " 🦈🦈🦈 is a rare film that grabs your attention before it shows you a single image on screen. the movie opens with blackness, and only distant, alien-like underwater sounds. :) :d it deserves stars, not stars.'
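The pattern above only matches digits that follow a space, so a number at the very start of the text would survive. The sketch below compares it with a word-boundary variant. The `sample` string and the `\b\d+\b` pattern are illustrative assumptions, not part of the lab.

```python
import re

sample = "5 stars, not 4 stars. route66"

# Pattern from the cell above: requires a space before the digits,
# so the leading "5" is kept.
leading_space = re.sub(r' \d+', '', sample)

# Word-boundary variant: removes standalone numbers wherever they appear,
# but keeps digits glued to letters (as in "route66").
standalone = re.sub(r'\b\d+\b', '', sample)

print(leading_space)
print(standalone)
```

Which variant is preferable depends on the task, for example on whether tokens such as "route66" should stay intact.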

Converting numbers to words using [num2words](https://github.com/savoirfairelinux/num2words) (it supports multiple languages)

We need to install the num2words library first.

In [6]:
!pip install num2words

Collecting num2words
  Downloading num2words-0.5.10-py3-none-any.whl (101 kB)
Installing collected packages: num2words
Successfully installed num2words-0.5.10


After installing, we can import it.

In [7]:
from num2words import num2words

text = ' '.join([num2words(word) if word.isdigit() else word for word in text.split()])
text


'" jaws " 🦈🦈🦈 is a rare film that grabs your attention before it shows you a single image on screen. the movie opens with blackness, and only distant, alien-like underwater sounds. :) :d it deserves five stars, not four stars.'
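The whitespace split used above only converts tokens that consist purely of digits, so a number with attached punctuation such as "4." would be skipped (`"4.".isdigit()` is `False`). A possible alternative is a regex substitution with a callback. In this sketch, `SMALL_NUMBERS` is a hypothetical stand-in for num2words (single digits only) so that the example runs without the library.

```python
import re

# Hypothetical stand-in for num2words, limited to single digits.
SMALL_NUMBERS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def digits_to_words(text):
    # Replace each standalone number found in the lookup table;
    # numbers missing from the table are left unchanged.
    return re.sub(
        r'\b\d+\b',
        lambda match: SMALL_NUMBERS.get(match.group(), match.group()),
        text,
    )

converted = digits_to_words("it deserves 5 stars, not 4.")
print(converted)
```

With num2words installed, the lambda could call `num2words(int(match.group()))` instead of the lookup table.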

Remove emoticons ( :) :D) and emojis (💙 🐱)

Using [emoji](https://github.com/carpedm20/emoji) library or the corresponding unicode characters.

We need to install the emoji library first.

In [8]:
!pip install emoji

Collecting emoji
  Downloading emoji-1.6.1.tar.gz (170 kB)

After installing, we can import it.

In [9]:
import emoji

emoji.get_emoji_regexp().sub(u'', text)

'" jaws "  is a rare film that grabs your attention before it shows you a single image on screen. the movie opens with blackness, and only distant, alien-like underwater sounds. :) :d it deserves five stars, not four stars.'

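The *get_emoji_regexp()* function returns a regex that matches any emoji. It belongs to the emoji 1.x API installed above; emoji 2.0 and later removed it, and those versions provide *emoji.replace_emoji(text, replace='')* instead.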

Another way of removing emojis with regex:


In [9]:
emoj = re.compile("["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
    u"\U00002500-\U00002BEF"  # box drawing, geometric shapes, misc. symbols, dingbats, arrows
    u"\U00002702-\U000027B0"  # dingbats
    u"\U000024C2-\U0001F251"
    u"\U0001f926-\U0001f937"
    u"\U00010000-\U0010ffff"
    u"\u2640-\u2642" 
    u"\u2600-\u2B55"
    u"\u200d"
    u"\u23cf"
    u"\u23e9"
    u"\u231a"
    u"\ufe0f"
    u"\u3030"
    "]+", re.UNICODE)

text = re.sub(emoj, '', text)
text

'" jaws "  is a rare film that grabs your attention before it shows you a single image on screen. the movie opens with blackness, and only distant, alien-like underwater sounds. :) :d it deserves five stars, not four stars.'
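A third option, sketched here with only the standard library, is to filter characters by their Unicode general category instead of listing code point ranges. This assumes it is acceptable to drop every character in category "So" (Symbol, other), which covers most pictographic emojis but also non-emoji symbols such as ©. Emoji components in other categories (for example variation selectors or the zero-width joiner) are not removed. The helper name `strip_symbols` is hypothetical.

```python
import unicodedata

def strip_symbols(text):
    # Keep every character whose Unicode general category is not "So".
    return ''.join(ch for ch in text if unicodedata.category(ch) != 'So')

sample = '" jaws " 🦈🦈🦈 is a rare film'
cleaned = strip_symbols(sample)
print(cleaned)
```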

Removing emoticons (regex from [nltk Twitter Tokenizer](https://github.com/nltk/nltk/blob/develop/nltk/tokenize/casual.py))

In [10]:
emoticon_string = r"""
    (?:
      [<>]?
      [:;=8]                     # eyes
      [\-o\*\']?                 # optional nose
      [\)\]\(\[dDpP/\:\}\{@\|\\] # mouth
      |
      [\)\]\(\[dDpP/\:\}\{@\|\\] # mouth
      [\-o\*\']?                 # optional nose
      [:;=8]                     # eyes
      [<>]?
      |
      </?3                       # heart
    )"""
    
emoticon_re = re.compile(emoticon_string, re.VERBOSE | re.I | re.UNICODE)
text = re.sub(emoticon_re, '', text)
text

'" jaws " 🦈🦈🦈 is a rare film that grabs your attention before it shows you a single image on screen. the movie opens with blackness, and only distant, alien-like underwater sounds.   it deserves five stars, not four stars.'

## Tokenization


*   Word level: Split by whitespace, [nltk.word_tokenize](https://www.nltk.org/api/nltk.tokenize.html)
*   Sentence level: Split by punctuation, [nltk.sent_tokenize](https://www.nltk.org/api/nltk.tokenize.html)


In [11]:
print(text.split())

['"', 'jaws', '"', '🦈🦈🦈', 'is', 'a', 'rare', 'film', 'that', 'grabs', 'your', 'attention', 'before', 'it', 'shows', 'you', 'a', 'single', 'image', 'on', 'screen.', 'the', 'movie', 'opens', 'with', 'blackness,', 'and', 'only', 'distant,', 'alien-like', 'underwater', 'sounds.', 'it', 'deserves', 'five', 'stars,', 'not', 'four', 'stars.']


We first need to download the Punkt tokenizer models.

In [12]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [13]:
from nltk import word_tokenize
tokenized_text_nltk = word_tokenize(text)
print(tokenized_text_nltk)

['``', 'jaws', '``', '🦈🦈🦈', 'is', 'a', 'rare', 'film', 'that', 'grabs', 'your', 'attention', 'before', 'it', 'shows', 'you', 'a', 'single', 'image', 'on', 'screen', '.', 'the', 'movie', 'opens', 'with', 'blackness', ',', 'and', 'only', 'distant', ',', 'alien-like', 'underwater', 'sounds', '.', 'it', 'deserves', 'five', 'stars', ',', 'not', 'four', 'stars', '.']
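To see why a tokenizer gives different results from a plain whitespace split, here is a rough regex sketch that separates punctuation from words while keeping hyphenated words together. The pattern and the `simple_tokenize` name are assumptions for illustration; this is not how `word_tokenize` is implemented, and its output differs in places (for example in how quotes and contractions are handled).

```python
import re

def simple_tokenize(text):
    # A token is either a word (optionally joined by "-" or "'")
    # or a single non-space, non-word character such as punctuation.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

tokens = simple_tokenize(
    "the movie opens with blackness, and only distant, alien-like underwater sounds."
)
print(tokens)
```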


Sentence tokenization using regex

In [14]:
re.split('(?<=[.!?]) +', text)

['" jaws " 🦈🦈🦈 is a rare film that grabs your attention before it shows you a single image on screen.',
 'the movie opens with blackness, and only distant, alien-like underwater sounds.',
 'it deserves five stars, not four stars.']

Sentence tokenization using nltk.sent_tokenize

In [11]:
nltk.sent_tokenize(text)

['" jaws " 🦈🦈🦈 is a rare film that grabs your attention before it shows you a single image on screen.',
 'the movie opens with blackness, and only distant, alien-like underwater sounds.',
 'it deserves five stars, not four stars.']

In [15]:
text_example = 'I was good.Thanks.'
re.split('(?<=[.!?]) +', text_example)

['I was good.Thanks.']

In [13]:
nltk.sent_tokenize(text_example)

['I was good.Thanks.']
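Both approaches leave 'I was good.Thanks.' as one sentence because there is no space after the first period. One possible workaround is to make the whitespace optional and require an uppercase letter after the boundary. This sketch assumes Python 3.7 or later (where `re.split` can split on empty matches) and text that has not been lowercased; it would still split abbreviations such as "U.S.A" incorrectly. The `split_sentences` name is hypothetical.

```python
import re

def split_sentences(text):
    # Split after ., ! or ? when the next non-space character is uppercase.
    return re.split(r'(?<=[.!?])\s*(?=[A-Z])', text)

sentences = split_sentences('I was good.Thanks.')
print(sentences)
```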

Removing punctuation


In [16]:
re.sub(r'[^\w\s]','', text)

' jaws   is a rare film that grabs your attention before it shows you a single image on screen the movie opens with blackness and only distant alienlike underwater sounds   it deserves five stars not four stars'

Using [string](https://docs.python.org/3/library/string.html) library. 

string.punctuation is a constant (not a method): a string that contains the ASCII punctuation characters.

We use str.maketrans('', '', string.punctuation) to build a translation table that maps every punctuation character to None, and the translate() method applies that table, deleting those characters from our string.

In [17]:
import string
text = text.translate(str.maketrans('', '', string.punctuation))
text

' jaws  🦈🦈🦈 is a rare film that grabs your attention before it shows you a single image on screen the movie opens with blackness and only distant alienlike underwater sounds   it deserves five stars not four stars'
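The sketch below shows what the translation table contains and one caveat: string.punctuation only covers ASCII punctuation, so typographic characters such as the curly apostrophe in "there’s" are left in place. The `sample` string is made up for illustration.

```python
import string

# Third argument of str.maketrans: characters to delete (mapped to None).
table = str.maketrans('', '', string.punctuation)

sample = "there’s just not enough time, isn't there?"
cleaned = sample.translate(table)

print(table[ord(',')])  # None: the comma is deleted during translation
print(cleaned)
```

The regex `[^\w\s]` used earlier also removes non-ASCII punctuation (and emojis), which is one reason to prefer it for tweets.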

Removing multiple spaces between words

In [18]:
text = re.sub(' +', ' ', text)
text

' jaws 🦈🦈🦈 is a rare film that grabs your attention before it shows you a single image on screen the movie opens with blackness and only distant alienlike underwater sounds it deserves five stars not four stars'
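The result above still starts with a space. If leading and trailing whitespace should go as well, a common alternative is to split on whitespace and join again, sketched here on a made-up `sample` string.

```python
sample = ' jaws  is   a rare film '

# str.split() without arguments splits on any run of whitespace
# and discards empty strings at both ends.
collapsed = ' '.join(sample.split())
print(collapsed)
```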

## Removing stopwords

![stopwords.jpg](https://user.oc-static.com/upload/2021/01/06/16099626487943_P1C2.png) 

[Photo source](https://openclassrooms.com/en/courses/6532301-introduction-to-natural-language-processing/6980726-remove-stop-words-from-a-block-of-text)






### Why do we need to remove stopwords?

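For tasks such as text classification, we may want to remove very frequent words that carry little meaning on their own (such as "the", "is", "on") and keep only the content words.

Stopword removal is usually avoided in tasks such as machine translation or text summarization, where function words are needed to produce fluent output.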

Using [nltk](https://www.nltk.org/index.html) and [spaCy](https://spacy.io/).

Stopwords removal using nltk

In [19]:
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words_nltk = set(stopwords.words('english'))
print(len(stop_words_nltk))
print(stop_words_nltk)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
179
{'yours', 'again', "isn't", 'but', 'most', 'aren', "she's", 'were', 'on', 'theirs', 'against', 'now', 'won', 'is', "shan't", 'nor', 'those', 'her', 'then', 'some', 'd', 'hadn', 'while', 'these', 'a', 'under', 'be', 'any', 'which', 'no', 've', 'weren', 'other', 'should', "aren't", 'don', 'hers', "wouldn't", 'over', 'having', "weren't", 'doing', 'up', 'into', 'had', 'such', 'haven', 'each', 'off', "shouldn't", 'themselves', 'who', 'our', 'she', 'or', 'once', 'how', 'with', 'out', 'they', 'from', "mustn't", 'm', 'the', "you'd", 't', 'doesn', "wasn't", "couldn't", 'll', 'yourselves', 'where', 'between', 'hasn', 'whom', 'before', 'its', 'been', 'for', "should've", "you'll", "you're", 'what', 'him', 'his', "didn't", 'more', 'very', 'we', 'do', 'yourself', 'all', 'this', 'when', 'there', 'below', "hasn't", 'an', 'being', 'isn', 'didn', 'can', 'myself', 'as', "it's", 'did', 'has',

In [20]:
tokenized_text_without_stopwords = [i for i in tokenized_text_nltk if i not in stop_words_nltk]
print(tokenized_text_without_stopwords)

['``', 'jaws', '``', '🦈🦈🦈', 'rare', 'film', 'grabs', 'attention', 'shows', 'single', 'image', 'screen', '.', 'movie', 'opens', 'blackness', ',', 'distant', ',', 'alien-like', 'underwater', 'sounds', '.', 'deserves', 'five', 'stars', ',', 'four', 'stars', '.']
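The filtered list still contains punctuation tokens, because they are not stopwords. A minimal sketch of filtering both in one pass follows. The `STOP_WORDS` set is a tiny illustrative subset (an assumption, not the nltk list) so that the example runs without downloading any corpus.

```python
import string

# Illustrative subset only; in the lab this would be stop_words_nltk.
STOP_WORDS = {"is", "a", "that", "it", "the", "with", "and", "only", "not"}

tokens = ['``', 'jaws', '``', 'is', 'a', 'rare', 'film', '.',
          'it', 'deserves', 'five', 'stars', ',', 'not', 'four', 'stars', '.']

def is_punctuation(token):
    # True when every character of the token is ASCII punctuation.
    return all(ch in string.punctuation for ch in token)

kept = [t for t in tokens if t not in STOP_WORDS and not is_punctuation(t)]
print(kept)
```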


Stopwords removal using spacy

In [21]:
import spacy
nlp = spacy.load('en_core_web_sm')
stop_words_spacy = nlp.Defaults.stop_words
print(len(stop_words_spacy))
print(stop_words_spacy)

326
{'most', 'on', 'thru', 'against', 'even', 'next', 'then', 'these', 'any', 'no', 'hence', 'whether', 'along', "'d", 'wherein', 'everyone', 'namely', 'others', 'somehow', 'up', 'around', 'although', 'such', 'bottom', 'who', 'our', 'with', 'out', 'they', 'from', 'using', 'much', 'anyway', 'since', 'nevertheless', 'various', 'former', 'beforehand', 'anywhere', "'m", 'before', 'third', 'for', 'we', 'all', 'becoming', 'serious', 'towards', '’s', 'below', 'except', 'being', 'can', 'get', 'i', 'at', 'three', 'of', 'whereupon', 'alone', 'rather', 'than', 'still', 'cannot', 'in', 'my', 'none', '‘s', 'without', 'me', 'whoever', 'noone', 'enough', '’ll', 'least', 'onto', 'until', 'eight', 'show', 'same', 'per', '‘ll', 'though', 'make', 'again', 'everything', 'top', 'were', 'five', 'twelve', 'thence', 'nor', 'together', 'some', 'beside', 'seem', 'be', 'unless', 'many', 'must', 'whereby', 'anyone', 'used', '‘ve', 'anyhow', 'hers', 'call', 'empty', 'however', 'had', 'nobody', '‘m', 'say', 'or', '

In [22]:
tokenized_text_spacy = nlp(text)
# Compare each token's text (a str) with the stop-word set. Comparing the Token
# objects themselves never matches a string, so no stopword would be removed.
tokenized_text_without_stopwords = [token for token in tokenized_text_spacy if token.text not in stop_words_spacy]
print(tokenized_text_without_stopwords)


## Lemmatization/Stemming

![1_HLQgkMt5-g5WO5VpNuTl_g.jpeg](https://miro.medium.com/max/564/1*HLQgkMt5-g5WO5VpNuTl_g.jpeg)

[Photo source](https://tr.pinterest.com/pin/706854104005417976/)

Using [nltk](https://www.nltk.org/index.html) and [spaCy](https://spacy.io/).

Lemmatization

Using the WordNetLemmatizer from nltk


In [26]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

words = word_tokenize(text)
for word in words:
    print(word, lemmatizer.lemmatize(word))

Using the [lemmatizer](https://spacy.io/api/lemmatizer) from spacy

In [None]:
!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download en_core_web_sm

In [None]:
import spacy

# Load English tokenizer, tagger, parser, etc.
nlp = spacy.load("en_core_web_sm")

doc = nlp(text)

for token in doc:
  print(token, token.lemma_)

Stemming using nltk

In [None]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
for word in words:
    print(word, ps.stem(word))

[Other stemmers in nltk](https://www.nltk.org/api/nltk.stem.html)

The spacy library does not perform stemming, only lemmatization.
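To make the difference concrete: a stemmer chops off suffixes by rule and may produce non-words, while a lemmatizer looks up the dictionary form. The toy function below is a deliberately crude suffix stripper written for illustration only. It is not the Porter algorithm, and the `crude_stem` name and its suffix list are assumptions.

```python
def crude_stem(word, suffixes=("ing", "ed", "es", "s")):
    # Strip the first matching suffix, as long as at least
    # three characters of the word remain.
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["opens", "entering", "deserves", "is"]:
    print(word, crude_stem(word))
```

Note that "deserves" becomes "deserv", which is not a dictionary word; a lemmatizer given the right part of speech would return "deserve".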

# Assignment

To be uploaded here: https://forms.gle/ygCNwFM4i5RMPtsC6

Preprocess texts from Twitter

## Data

We will use the twitter corpus from nltk, usually used in sentiment analysis.

The first step is downloading the dataset using the *download* function.

In [29]:
import nltk
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.


True

In order to inspect our data, we look at the first 25 tweets from the dataset. The text contains a lot of mentions, hashtags and emoticons.

In [122]:
from nltk.corpus import twitter_samples

tweets = twitter_samples.strings('positive_tweets.json')
tweets = tweets[:25]
tweets

['#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)',
 '@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!',
 '@DespiteOfficial we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?!',
 '@97sides CONGRATS :)',
 'yeaaaah yippppy!!!  my accnt verified rqst has succeed got a blue tick mark on my fb profile :) in 15 days',
 '@BhaktisBanter @PallaviRuhail This one is irresistible :)\n#FlipkartFashionFriday http://t.co/EbZ0L2VENM',
 "We don't like to keep our lovely customers waiting for long! We hope you enjoy! Happy Friday! - LWWF :) https://t.co/smyYriipxI",
 '@Impatientraider On second thought, there’s just not enough time for a DD :) But new shorts entering system. Sheep must be buying.',
 'Jgh , but we have to go to Bayan :D bye',
 'As an act of mischievousness, am calling the ETL layer of our in-house warehousing 

**Given a list of tweets, preprocess each tweet from the list.**

**Instructions**: Implement the *preprocess* function. You can do the text cleaning in any order you prefer.

**Hint**: You may need to use regular expressions (use the resources provided above).


In [124]:
def preprocess(tweets):
    """
    Input:
        tweets: a list of tweets
    Output:
        preprocessed_tweets: a list of preprocessed tweets, each a list of tokens
    """
    # Some cleaning steps work on the whole tweet (document level),
    # others on individual tokens (word level).
    preprocessed_tweets = []

    for tweet in tweets:
        # Remove newline characters
        preproc = re.sub(r'\n', '', tweet)
        # Remove mentions ('@user') and links ('http://t.co/of3DyOzML0'),
        # and strip the '#' sign from hashtags (the hashtag word is kept)
        preproc = re.sub(r'(@|https?)\S+|#', ' ', preproc)
        # Lowercase the text
        preproc = preproc.lower()
        # Remove emojis and emoticons such as '👌 🍭 :) :D'
        # (emoji.get_emoji_regexp() is the emoji 1.x API installed above)
        preproc = emoji.get_emoji_regexp().sub('', preproc)
        preproc = emoticon_re.sub('', preproc)
        # Remove digits that follow a space
        preproc = re.sub(r' \d+', '', preproc)
        # Remove punctuation
        preproc = re.sub(r'[^\w\s]', '', preproc)
        # Tokenize the tweet into separate words
        preproc = word_tokenize(preproc)
        # Remove stopwords
        preproc = [token for token in preproc if token not in stop_words_nltk]
        # Lemmatize each remaining token
        preproc = [lemmatizer.lemmatize(token) for token in preproc]
        preprocessed_tweets.append(preproc)

    return preprocessed_tweets

preprocess(tweets)

[['followfriday', 'top', 'engaged', 'member', 'community', 'week'],
 ['hey',
  'james',
  'odd',
  'please',
  'call',
  'contact',
  'centre',
  'able',
  'assist',
  'many',
  'thanks'],
 ['listen', 'last', 'night', 'bleed', 'amazing', 'track', 'scotland'],
 ['congrats'],
 ['yeaaaah',
  'yippppy',
  'accnt',
  'verified',
  'rqst',
  'succeed',
  'got',
  'blue',
  'tick',
  'mark',
  'fb',
  'profile',
  'day'],
 ['one', 'irresistible', 'flipkartfashionfriday'],
 ['dont',
  'like',
  'keep',
  'lovely',
  'customer',
  'waiting',
  'long',
  'hope',
  'enjoy',
  'happy',
  'friday',
  'lwwf'],
 ['second',
  'thought',
  'there',
  'enough',
  'time',
  'dd',
  'new',
  'short',
  'entering',
  'system',
  'sheep',
  'must',
  'buying'],
 ['jgh', 'go', 'bayan', 'bye'],
 ['act',
  'mischievousness',
  'calling',
  'etl',
  'layer',
  'inhouse',
  'warehousing',
  'app',
  'katamariwell',
  'name',
  'implies'],
 ['followfriday', 'top', 'influencers', 'community', 'week'],
 ['wouldnt',
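To see what the mention, link and hashtag pattern does on its own, the sketch below applies it to one tweet from the sample above. Note one limitation of this pattern: `(@|https?)\S+` removes everything from an `@` to the next whitespace, so the domain part of an email address would be dropped too.

```python
import re

tweet = ('@BhaktisBanter @PallaviRuhail This one is irresistible :)\n'
         '#FlipkartFashionFriday http://t.co/EbZ0L2VENM')

# Mentions and links are replaced by a space; for hashtags only the
# '#' sign is replaced, so the hashtag word itself is kept.
cleaned = re.sub(r'(@|https?)\S+|#', ' ', tweet)
words = cleaned.split()
print(words)
```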


Tools:

* [Preprocessing library for Twitter](https://github.com/s/preprocessor)
* [Emoji library](https://github.com/carpedm20/emoji)
* [Demoji library](https://github.com/bsolomon1124/demoji)
* [Gensim](https://radimrehurek.com/gensim/)


Further reading:

* [Lexical Normalization](https://arxiv.org/pdf/1710.03476.pdf)
* [On learning and representing social meaning in NLP: a sociolinguistic perspective](https://aclanthology.org/2021.naacl-main.50.pdf)






