# Hands-on: Text Processing
This hands-on will cover the necessary steps in a text processing pipeline for Human Language Technologies (HLT). Some examples of projects and tasks in which this pipeline will be useful are the following:
- **Language Translation** - translation of a sentence or body of text from one language to another
- **Word Sense Disambiguation** - determining the meaning and context of a polysemic word in a body of text
- **Sentiment Analysis** - determining the overall sentiment towards a certain topic or word, whether it's positive, negative, or neutral
- **Topic Modeling** - identifying the different topics discussed in a text and determining the most prevalent one

Other tasks include question answering, information extraction, and, more recently, detecting mis/disinformation.

## Pre-processing Pipeline

- **Tokenization** - splitting sentences into words and symbols
- **Converting to lowercase**
- **Removing unnecessary punctuation, tags, and emojis**
- **Removing stop words** - removing frequently occurring words (e.g. "the", "is") that carry little meaning on their own
- **Stemming** - reducing a word to its root form by removing inflectional endings, usually by dropping suffixes. The result is not always a dictionary word.

```
The stemmed form of cries is: cri
The stemmed form of crying is: cry
```

- **Lemmatization** - removing inflectional endings properly by determining the part of speech and performing morphological analysis. It transforms a word to its base or dictionary form.

```
The lemmatized form of cries is: cry
The lemmatized form of crying is: cry
```
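
The steps listed above can be sketched with the standard library alone. This is a toy illustration, not what `gensim` or `nltk` do internally: the stop-word set, the function names, and the suffix rules in `toy_stem()` are all assumptions made up for this example.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only
STOP_WORDS = {"the", "a", "an", "is", "and", "of", "to", "she"}

def tokenize(text):
    # Lowercase the text, then keep runs of letters;
    # punctuation and other symbols are dropped along the way
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def toy_stem(token):
    # Naive suffix stripping; real stemmers apply many more rules
    if token.endswith("ies"):
        return token[:-2]          # "cries" -> "cri"
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]          # "crying" -> "cry"
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token

tokens = remove_stop_words(tokenize("She cries, and the baby is crying!"))
stems = [toy_stem(t) for t in tokens]
print(tokens)  # ['cries', 'baby', 'crying']
print(stems)   # ['cri', 'baby', 'cry']
```

The stems `cri` and `cry` show why stemming alone can leave two forms of the same word unmerged, which is the gap lemmatization is meant to close.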

> **NOTE:** Not all HLT tasks/projects follow the same pipeline. For example, topic modeling has been found to perform better with stop words, so the removal of stop words is typically skipped.

In [49]:
import os
import pandas as pd
import numpy as np
import json #loading/writing json files
import re #regular expressions
import gensim
import nltk

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim import corpora, models

from nltk.stem import WordNetLemmatizer

### Load the JSON dataset
The file `allArticles.json` contains news articles. Each record has `author`, `title`, `date`, `source`, and `article_body` fields.

In [50]:
with open('allArticles.json') as json_file: #the file is closed automatically after loading
    json_data = json.load(json_file)
documents = pd.DataFrame(json_data)

In [51]:
documents.head()

|   | author | title | date | source | article_body |
|---|---|---|---|---|---|
| 0 | [] | Families risk it all to escape Myanmar's deadl... | 2021/03/24 | http://cnn.com/videos/world/2021/03/24/myanmar... | Indian officials say more than 400 refugees, d... |
| 1 | [] | Coronavirus: German Chancellor Angela Merkel a... | 2021/03/24 | http://cnn.com/videos/world/2021/03/24/angela-... | German Chancellor Angela Merkel has walked bac... |
| 2 | [] | Bulgaria accuses 6 citizens of spying for Russia | 2021/03/24 | http://cnn.com/videos/world/2021/03/24/bulgari... | Investigators have released video that alleged... |
| 3 | [] | Suez Canal blocked by container ship the lengt... | 2021/03/24 | http://cnn.com/videos/world/2021/03/24/suez-ca... | A large container ship is stuck in the souther... |
| 4 | [] | CNN speaks to family of Canadian detained by B... | 2021/03/23 | http://cnn.com/videos/world/2021/03/23/canada-... | Canada has labeled China's detention of Canadi... |


### Remove URLs from text
You can use [Regex101](https://regex101.com/) to check your regular expressions.

In [52]:
def removeURLFromText(text):
    result = re.sub(r"http\S+", "", text)
    result = result.strip()
    return result

In [53]:
processed_docs = documents['article_body'].map(removeURLFromText)
processed_docs.head()

0    Indian officials say more than 400 refugees, d...
1    German Chancellor Angela Merkel has walked bac...
2    Investigators have released video that alleged...
3    A large container ship is stuck in the souther...
4    Canada has labeled China's detention of Canadi...
Name: article_body, dtype: object
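
To see what the `http\S+` pattern does on a single string, here is a small standalone sketch. The function name `remove_urls` and the sample sentence are made up for illustration. The extra whitespace-collapsing step is an addition that the notebook's `removeURLFromText()` does not perform.

```python
import re

def remove_urls(text):
    # Drop anything starting with "http" up to the next whitespace character
    without_urls = re.sub(r"http\S+", "", text)
    # Collapse the double spaces left behind where a URL was removed
    return re.sub(r"\s+", " ", without_urls).strip()

sample = "Full story at https://example.com/news?id=1 and on http://example.org today"
print(remove_urls(sample))  # Full story at and on today
```

Because `\S+` matches every non-whitespace character, punctuation attached to the end of a URL (such as a closing period) is removed together with it.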

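### Pre-process the article body
`gensim`'s `simple_preprocess()` converts a text into a list of lowercase tokens.

This step also removes stop words and words with 3 or fewer characters. Each remaining token is passed to `lemmatize_stemming()`, which is defined in the next section and must be run before `preprocess()` is called.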

In [54]:
def preprocess(text):
    result = []
    for token in simple_preprocess(text): #tokenize and lowercase
        #skip stop words and tokens with 3 or fewer characters
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

In [55]:
preprocess("She cries,,, text")

['cry', 'text']

### Stem and lemmatize per text
Lemmatize each token, treating it as a verb (`pos='v'`). Despite its name, `lemmatize_stemming()` as written applies only the lemmatizer. A stemmer could be chained after it to handle words that the lemmatizer misses.

In [56]:
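nltk.download('wordnet')

#create the lemmatizer once instead of on every call
lemmatizer = WordNetLemmatizer()

def lemmatize_stemming(text):
    #lemmatize the token as a verb
    return lemmatizer.lemmatize(text, pos='v')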

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\carlo\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [57]:
processed_docs = processed_docs.map(preprocess)
processed_docs.head()

0    [indian, officials, refugees, desperate, flee,...
1    [german, chancellor, angela, merkel, walk, pla...
2    [investigators, release, video, allegedly, sho...
3    [large, container, ship, stick, southern, suez...
4    [canada, label, china, detention, canadians, m...
Name: article_body, dtype: object

### Create a `gensim` Dictionary
This maps each unique token to an integer id (a word <-> id mapping).

In [58]:
dictionary = gensim.corpora.Dictionary(processed_docs)

In [59]:
for x in range(0, 20):
    print(x,":",dictionary[x])

0 : border
1 : coup
2 : crackdown
3 : cross
4 : dangerous
5 : desperate
6 : flee
7 : india
8 : indian
9 : journey
10 : junta
11 : military
12 : officials
13 : refugees
14 : speak
15 : vedika
16 : angela
17 : apologize
18 : bell
19 : cause
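
The listing above suggests that ids are handed out document by document, in alphabetical order within each document. The sketch below mimics that behavior in plain Python. It is an assumption drawn from the printed output, not `gensim`'s actual implementation, and the two sample documents are made up.

```python
docs = [
    ["refugees", "flee", "border", "refugees"],
    ["border", "military", "coup"],
]

token2id = {}
for doc in docs:
    # New tokens from each document get the next free ids, in sorted order
    for token in sorted(set(doc)):
        if token not in token2id:
            token2id[token] = len(token2id)

# Reverse mapping, so an id can be turned back into its word
id2token = {i: t for t, i in token2id.items()}

print(token2id)     # {'border': 0, 'flee': 1, 'refugees': 2, 'coup': 3, 'military': 4}
print(id2token[3])  # coup
```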


### Filter words
Filter out tokens that appear in

- fewer than `no_below` documents (absolute number), or
- more than `no_above` documents (fraction of total corpus size, not absolute number).

After both filters are applied, keep only the `keep_n` most frequent tokens (or keep all if `None`).
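
The filtering rules can be illustrated with a short plain-Python sketch. The helper `filter_by_doc_freq()` and the sample documents are hypothetical, and details such as rounding of the `no_above` threshold may differ from what `gensim` does.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below, no_above, keep_n=None):
    # Document frequency: the number of documents each token appears in
    dfs = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)  # turn the fraction into a document count
    kept = [t for t, df in dfs.items() if no_below <= df <= max_docs]
    # Most frequent first; ties are broken alphabetically for a stable order
    kept.sort(key=lambda t: (-dfs[t], t))
    return kept if keep_n is None else kept[:keep_n]

docs = [["ship", "canal"], ["ship", "port"], ["ship", "canal", "crew"], ["ship"]]
print(filter_by_doc_freq(docs, no_below=2, no_above=0.5))  # ['canal']
```

Here `ship` is dropped for appearing in more than half of the documents, while `port` and `crew` are dropped for appearing in fewer than two.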

In [60]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

In [61]:
dictionary.cfs #how many times each word occurs across all documents

{0: 24,
 1: 22,
 2: 77,
 3: 77,
 4: 36,
 5: 150,
 7: 64,
 6: 36,
 8: 42,
 9: 21,
 11: 29,
 10: 22,
 12: 49,
 13: 43,
 14: 37,
 15: 47,
 17: 28,
 16: 48,
 18: 44,
 19: 43,
 29: 41,
 27: 97,
 26: 48,
 22: 57,
 23: 139,
 21: 25,
 24: 36,
 20: 19,
 25: 25,
 28: 42}

In [62]:
dictionary.dfs #how many documents each word appears in

{0: 20,
 1: 15,
 2: 29,
 3: 24,
 4: 21,
 5: 21,
 7: 15,
 6: 20,
 8: 18,
 9: 15,
 11: 17,
 10: 15,
 12: 21,
 13: 19,
 14: 15,
 15: 17,
 17: 15,
 16: 18,
 18: 19,
 19: 17,
 29: 16,
 27: 24,
 26: 19,
 22: 17,
 23: 21,
 21: 22,
 24: 17,
 20: 17,
 25: 15,
 28: 15}

In [63]:
dictionary.num_pos #how many processed words

14295

In [64]:
dictionary.num_docs #number of processed documents

61

### Map bag of words per document

So far, we have only counted the occurrence of each word across all documents. Next, we need to know how often each word appeared in each document, but now using the IDs generated in the previous step.

In [65]:
#bow = bag of words
#for every document in processed docs, we want to convert that into a bag of words array

bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
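
Conceptually, converting one document to a bag of words is a count of token ids. The following sketch uses a hypothetical three-word mapping and a made-up helper `doc_to_bow()` to show the idea. It is not `gensim`'s implementation.

```python
from collections import Counter

token2id = {"border": 0, "coup": 1, "military": 2}  # hypothetical mapping

def doc_to_bow(tokens, token2id):
    # Count ids of known tokens; tokens missing from the mapping are ignored
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

print(doc_to_bow(["military", "coup", "military", "junta"], token2id))
# [(1, 1), (2, 2)]
```

A document whose tokens were all filtered out of the dictionary ends up as an empty list.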

In [66]:
bow_corpus

[[(0, 1)],
 [(1, 1), (2, 1)],
 [(0, 1), (2, 1), (3, 1)],
 [(4, 1)],
 [],
 [(3, 1), (5, 1)],
 [(0, 1), (2, 1), (4, 1), (6, 1), (7, 1)],
 [(8, 1), (9, 1)],
 [(1, 1), (3, 2), (5, 1), (10, 1), (11, 1), (12, 1)],
 [(2, 1), (13, 1)],
 [(1, 1), (2, 1), (9, 1), (13, 1), (14, 2)],
 [],
 [(14, 1)],
 [(2, 1), (9, 1), (15, 1)],
 [(12, 1), (13, 3)],
 [(2, 1), (16, 1), (17, 1)],
 [(14, 1), (18, 1)],
 [(8, 1), (13, 2), (18, 1)],
 [(17, 1), (19, 1)],
 [(2, 1)],
 [],
 [(0, 2),
  (2, 3),
  (3, 5),
  (4, 1),
  (5, 2),
  (6, 5),
  (8, 1),
  (10, 1),
  (11, 4),
  (12, 1),
  (15, 3),
  (18, 1),
  (19, 3),
  (20, 1),
  (21, 1),
  (22, 2),
  (23, 1),
  (24, 1),
  (25, 1),
  (26, 1),
  (27, 2),
  (28, 1),
  (29, 7)],
 [(0, 1),
  (2, 3),
  (3, 3),
  (4, 2),
  (5, 2),
  (6, 1),
  (8, 1),
  (9, 1),
  (11, 1),
  (12, 3),
  (15, 4),
  (16, 1),
  (17, 1),
  (18, 3),
  (19, 3),
  (20, 1),
  (21, 1),
  (23, 1),
  (24, 1),
  (26, 2),
  (27, 6),
  (29, 4)],
 [(0, 2),
  (2, 2),
  (3, 2),
  (4, 1),
  (7, 1),
  (11, 3),
  

### Compute the TF-IDF per word in a document

- **Term Frequency (TF)** is the number of times token `t` appears in a document divided by the total number of tokens in the document.
- **Inverse Document Frequency (IDF)** is `log(N/n)`, where `N` is the total number of documents and `n` is the number of documents in which token `t` appears. A rarely used word has a high IDF, whereas a frequent word has a low IDF.

The TF-IDF value of a term is **TF * IDF**.

Example:
```
Document 1: "I worked my whole life, just to get right, just to be like"
[work, 1]
[whole, 1]
[life, 1]
[just, 2]
[right, 1]
[like, 1]
```

```
Document 2: "I worked my whole life, just to get high, just to realize"
[work, 1]
[whole, 1]
[life, 1]
[just, 2]
[high, 1]
[realize, 1]
```

```
TF('just', Document1) = 2/7,  IDF('just') = log10(2/2) = 0
TF('right', Document1) = 1/7, IDF('right') = log10(2/1) = 0.30

TF-IDF('just', Document1) = (2/7)*0 = 0
TF-IDF('right', Document1) = (1/7)*0.30 = 0.043
```
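
The same arithmetic can be checked in a few lines of Python. This sketch assumes a base-10 logarithm and uses the token lists from the two example documents. The helper names `tf()` and `idf()` are made up for illustration.

```python
import math
from collections import Counter

doc1 = ["work", "whole", "life", "just", "just", "right", "like"]
doc2 = ["work", "whole", "life", "just", "just", "high", "realize"]
docs = [doc1, doc2]

def tf(token, doc):
    # Occurrences of the token divided by the document length
    return Counter(doc)[token] / len(doc)

def idf(token, docs):
    # log10(N / n), where n is the number of documents containing the token
    n = sum(1 for d in docs if token in d)
    return math.log10(len(docs) / n)

for token in ("just", "right"):
    print(token, round(tf(token, doc1) * idf(token, docs), 3))
# just 0.0
# right 0.043
```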

In [67]:
tfidf = models.TfidfModel(bow_corpus)

In [68]:
corpus_tfidf = tfidf[bow_corpus]
for x in corpus_tfidf:
    print(x)

[(0, 1.0)]
[(1, 0.8835516613095219), (2, 0.4683337077311259)]
[(0, 0.6828902168036421), (2, 0.4553521895764523), (3, 0.5712401729936413)]
[(4, 1.0)]
[]
[(3, 0.6584096737775302), (5, 0.7526597514655384)]
[(0, 0.44985104858373065), (2, 0.29996133333795366), (4, 0.4301689681225383), (6, 0.44985104858373065), (7, 0.5659027527258935)]
[(8, 0.6563794119345652), (9, 0.7544309561440559)]
[(1, 0.4168895572814165), (3, 0.5544288154839636), (5, 0.316897116709507), (10, 0.4168895572814165), (11, 0.3796937153323194), (12, 0.316897116709507)]
[(2, 0.5375448969212121), (13, 0.8432351296014435)]
[(1, 0.3787134857957082), (2, 0.20074014767581275), (9, 0.3787134857957082), (13, 0.31489675636607695), (14, 0.7574269715914164)]
[]
[(14, 1.0)]
[(2, 0.3648657466293472), (9, 0.688350488695506), (15, 0.6269342801676002)]
[(12, 0.29149826386373445), (13, 0.9565713575914913)]
[(2, 0.3713044137564543), (16, 0.6094556297044739), (17, 0.70049758582489)]
[(14, 0.7689176128436329), (18, 0.6393478745243852)]
[(8, 0.42
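
The weights printed by `gensim` do not follow the plain TF * IDF example exactly. Assuming `TfidfModel`'s default settings (raw term counts, a base-2 logarithm for IDF, and each document vector scaled to unit length), the second document's weights can be reproduced by hand from the `dictionary.dfs` and `dictionary.num_docs` values shown earlier:

```python
import math

num_docs = 61              # dictionary.num_docs
dfs = {1: 15, 2: 29}       # document frequencies of ids 1 and 2 from dictionary.dfs
bow = [(1, 1), (2, 1)]     # second document in bow_corpus

# Raw weight: term count times log2(N / df)
raw = [(tid, count * math.log2(num_docs / dfs[tid])) for tid, count in bow]

# Scale the vector so that its Euclidean length is 1
norm = math.sqrt(sum(w * w for _, w in raw))
weights = [(tid, w / norm) for tid, w in raw]

print(weights)  # approximately [(1, 0.8836), (2, 0.4683)]
```

The normalization step also explains why a document with a single remaining token always gets a weight of `1.0`.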

### Placing it in a JSON file

In [74]:
allArticlesBOW_json = []

artAuthor = documents['author']
artTitle = documents['title']
artDate = documents['date']
artSource = documents['source']
artBody = documents['article_body']

#include the "date" field only when the date column has data
hasDate = not artDate.empty

for i in range(len(corpus_tfidf)):
    article = {"author": artAuthor[i], "title": artTitle[i]}
    if hasDate:
        article["date"] = artDate[i]
    article["source"] = artSource[i]
    article["article_body"] = artBody[i]
    #despite the key name, these are TF-IDF weights rather than raw counts
    article["article_body_bow"] = corpus_tfidf[i]
    allArticlesBOW_json.append(article)

with open('allArticlesBOW.json', 'w') as json_file:
    json.dump(allArticlesBOW_json, json_file)

allArticlesBOW_json

[{'author': [],
  'title': "Families risk it all to escape Myanmar's deadly junta",
  'date': '2021/03/24',
  'source': 'http://cnn.com/videos/world/2021/03/24/myanmar-india-police-refugee-flee-sud-pkg-intl-hnk-vpx.cnn/video/playlists/around-the-world/',
  'article_body': "Indian officials say more than 400 refugees, desperate to flee the military junta's crackdown, have crossed the border into India since the military coup. CNN's Vedika Sud speaks to refugees who made the dangerous journey.",
  'article_body_bow': [(0, 1.0)]},
 {'author': [],
  'title': 'Coronavirus: German Chancellor Angela Merkel apologizes for Easter restrictions confusion',
  'date': '2021/03/24',
  'source': 'http://cnn.com/videos/world/2021/03/24/angela-merkel-germany-coronavirus-easter-mistake-bell-ctw-intl-ldn-vpx.cnn/video/playlists/around-the-world/',
  'article_body': 'German Chancellor Angela Merkel has walked back on her plan to impose a new hard lockdown during the upcoming Easter holiday, apologizing fo