# Typical NLP-heavy problems

* *Classification*: Classifying documents into particular categories (giving labels).
* *Regression*: Predicting numerical values.
* *Clustering*: Separating documents into non-overlapping subsets.
* *Ranking*: For a given input, rank documents according to some metric.
* *Association rule mining*: Inferring likely association patterns in data.
* *Structured output*: Bounding boxes, parse trees, etc.
* *Sequence-to-sequence*: For a given input/source text, generate output annotations or another text.

To address many of the challenges above, several important things have to be clarified beforehand:

1. Set the research/exploration goal (e.g., how strong will the earthquake be on a given day).
2. Make a hypothesis (e.g., the strength of the earthquake is an informative signal).
3. Collect the data (e.g., collect the historical strength of the quakes on each day).
4. Test the hypothesis (e.g., train a model using the data).
5. Analyze the results (e.g., are the results better than those of existing systems).
6. Reach a conclusion (e.g., the model should or should not be used because of A, B, C).
7. Refine the hypothesis and repeat (e.g., the time of the year could be a useful signal).

Below is the table of the approaches that can be applied to the problems/applications above.

| Type                   | Input   | Clarifications|
|------------------------|---------|---------------|
|   Rule-based           |Explicit linguistic patterns, <br>lexicons, etc.| Such an approach always has predefined behaviour and usually cannot generalize. |
|   Supervised           | Training examples: typically <br>tuples of the form <br>(features, label) | Here the input features are usually a vectorized, transformed, normalized representation of the input.|
|   Semi-supervised or <br> Pseudo-relevance <br>feedback   | Same as in the supervised case, <br>but predictions <br>(e.g., those with the greatest confidence) <br>are also used as input pairs.   | Once we have trained the system on the input (feature, label) pairs, we label unseen examples <br>with the model. If the confidence of the label on an unseen example is high, <br>we add that example to the input set and retrain the model. |
|   Distant supervision  | Same as for supervised learning, <br>but the (feature, label) pairs do not <br> come from manual annotation, but from <br>some heuristic-based annotation.   | Rather than annotating thousands of examples (documents, sentences, words, etc.), <br>we take a few examples of the class and try to generalize and match those <br>in the domain-specific corpus. Such systems first generate or extract examples that are <br> very similar to, or slight modifications of, the labelled input examples. For example, if you need to <br> find all the sentences about Obama's marriage, you would search for all sentences that mention both <br> Michelle and Barack Obama. Furthermore, those examples could be used to find any sentence <br> about a marriage. Another example: for a given list of movies, match all the sentences that contain <br> movie names. Clearly, such heuristics result in noisy input sets.|
|   Unsupervised         | Unlabeled features   | The system is expected to find some dependencies and patterns without clearly stating <br>which patterns we are looking for. E.g., clustering of documents, topics extraction <br>from the documents, etc. |
|   Hybrid               | Varies for method <br>combinations   | Combines several approaches that are mentioned earlier for various purposes. |
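
The distant-supervision row can be made concrete with a small sketch. It assumes a hypothetical list of movie titles (`MOVIE_TITLES`) and a few made-up sentences. Every sentence that mentions a known title receives a heuristic positive label. All names and data here are illustrative and not part of any library.

```python
import re

# Hypothetical seed lexicon: a few movie titles we already know.
MOVIE_TITLES = ["The Matrix", "Titanic", "Inception"]

# One case-insensitive pattern that matches any title as a whole phrase.
TITLE_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(title) for title in MOVIE_TITLES) + r")\b",
    flags=re.IGNORECASE,
)

def distant_label(sentence):
    """Return 1 if the sentence mentions a known title, else 0."""
    return 1 if TITLE_PATTERN.search(sentence) else 0

sentences = [
    "I watched Inception again last night.",
    "The Titanic sank in 1912.",  # noisy match: the ship, not the movie
    "We had pasta for dinner.",
]
labeled = [(sentence, distant_label(sentence)) for sentence in sentences]
for sentence, label in labeled:
    print(label, sentence)
```

The second sentence shows the kind of noise such heuristics introduce: it is labelled positive although it is about the ship.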

# Evaluation Metrics

Depending on the problem you solve or on the data (e.g., whether the classes are balanced or not), different metrics
<br>might be more suitable.

* **Accuracy**
<br>
$ Accuracy = \frac{n_{correct}}{N}$, where $N$ is the total number of examples that we analysed and
<br>$n_{correct}$ is the number of examples whose label we predicted correctly.

*a.k.a. Classification*

* **Precision**
<br>
For classification, and in particular when the classes are unbalanced, precision is often a more informative measure than accuracy.
<br>
$ Precision = \frac{n_{correct\ class\ prediction}}{n_{class\ predictions}} = \frac{TP}{TP + FP}$,
where $TP, FP, FN, TN$ are explained [here](https://en.wikipedia.org/wiki/False_positives_and_false_negatives).

* **Precision@K**
<br>
When your task is not simply to classify examples but, e.g., to rank them, a popular metric is P@K.
<br>Here you pick K (typically 1, 3, 5, 10, etc.) and compute precision over the top K results in your ranking.
<br>For example, if you have a query for which you need to find similar documents,
<br>P@K is the precision, computed as described above, over the top K documents returned for that query.

* **Recall**
<br>
Another very important concept in NLP is Recall, which tells us how many of the class examples
<br>or positive examples (maybe documents) our system has managed to extract.
<br>
$ Recall = \frac{n_{correct\ class\ prediction}}{n_{positive\ class\ examples}} = \frac{TP}{TP + FN}$

* **F1**
<br>
[F1](https://en.wikipedia.org/wiki/F1_score) is the harmonic mean of precision and recall.
<br>It is usually used to compare different approaches when you are not optimizing for P or R in particular
<br>but rather for overall performance.
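
The classification and ranking metrics above can be sketched in plain Python. The function names and the toy labels are assumptions made for illustration. In practice you would probably use a library implementation instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def precision_at_k(ranked_relevance, k):
    """ranked_relevance: 0/1 relevance flags, best-ranked result first."""
    return sum(ranked_relevance[:k]) / k

# Toy data: 1 = relevant/positive, 0 = not relevant/negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1, 1, 0]

acc = accuracy(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
p_at_3 = precision_at_k([1, 0, 1, 1, 0], k=3)
print(acc, precision, recall, f1, p_at_3)
```

On this toy data the model over-predicts the positive class, so recall (0.75) is higher than precision (0.6).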

*a.k.a. Clustering* 

* [**Silhouette coefficient, Modularity, etc**](https://en.wikipedia.org/wiki/Cluster_analysis).
<br>
Once we move from classification problems to the clustering of documents, many things can be measured, depending
<br>on whether you have labels or not.
<br>
The case where we do not have labels:
  * For an already proposed clustering of the input, *modularity* measures how well nodes are assigned to the clusters.
  <br> In particular, we estimate how much our current assignment differs from an assumed random graph.
  * The silhouette coefficient estimates how the average distance between objects in the same cluster differs
  <br>from the average distance of those objects to the other clusters.
  * [Davies–Bouldin index](https://en.wikipedia.org/wiki/Davies-Bouldin_index) measures how intra-cluster scatter compares with inter-cluster separation.
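
As a rough sketch of the silhouette idea, the following plain-Python function compares, for each point, the mean distance to its own cluster (`a`) with the mean distance to the nearest other cluster (`b`). The function name, the Euclidean distance, and the toy points are assumptions. The sketch expects at least two clusters.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # Mean distance to the other members of the same cluster
        # (the distance of the point to itself is 0, hence len(own) - 1).
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # Mean distance to the nearest other cluster.
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        scores.append((b - a) / (max(a, b) or 1.0))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette(points, [0, 0, 1, 1])  # nearby points share a cluster
bad = silhouette(points, [0, 1, 0, 1])   # distant points share a cluster
print(good, bad)
```

Values close to 1 indicate compact, well-separated clusters, while negative values suggest that points sit in the wrong cluster.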

*a.k.a. Language models*

* **Perplexity**
<br>
For text generation tasks, where we typically do not have labelled examples or where multiple correct answers
<br>are possible, *perplexity* can be used to evaluate your model. In short, perplexity estimates how surprised the model
<br>is upon receiving an input, e.g., how surprised the model is that the next word after the current word
<br>"eat" is "me", or "meat", or anything else. Typically, the lower the perplexity, the more information about
<br>the input the model has (no surprises). Another interpretation is that we compare our probability
<br>distribution to a fair die: a perplexity of $k$ means the model is as uncertain as a roll of a fair $k$-sided die.
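
A minimal sketch of the computation, assuming we already have the probability that a model assigned to each observed token (the probabilities below are made up): perplexity is the exponential of the average negative log-probability.

```python
import math

def perplexity(token_probabilities):
    """exp of the average negative log-probability of the observed tokens."""
    n = len(token_probabilities)
    avg_neg_log_prob = -sum(math.log(p) for p in token_probabilities) / n
    return math.exp(avg_neg_log_prob)

# A model that is as uncertain as a fair six-sided die at every step.
fair_die = perplexity([1 / 6] * 10)

# A model that assigns high probability to the tokens that actually occur.
confident = perplexity([0.9, 0.8, 0.95, 0.9])

print(fair_die, confident)
```

The first value is 6 up to floating-point error, matching the fair-die interpretation. The second is only slightly above 1, which is the lowest possible perplexity.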


# Where text data comes from

Like anywhere in Data Science, it is important to first understand your data! Note: Of course, if you have it!

In case you do not have the data:

*Raw text data (Unlabeled/Non-annotated):*
* Pay for it :)
* Crawl it from the Web.
<br>
Examples: [Scrapy](https://scrapy.org/), [wiki/google crawler](http://www.netinstructions.com/how-to-make-a-web-crawler-in-under-50-lines-of-python-code/)
<br>
Of course, you might need to play with various IP addresses, throttling rates etc.
* Twitter
<br>
You can get either historical or live-stream data. In both cases, you can get 1% of the stream for free.
  * 1% of historical tweets from [archive.org](https://archive.org/details/twitterstream?sort=-date).
  * Similarly, you can get 1% of the full Twitter stream as tweets appear via the Twitter API.
  <br>You can also specify keywords (up to 2K) to match tweets in real time; as a result you get up to
  <br>1% of the total stream of messages. Note: if your query is not very popular, you might get all the tweets about it,
  <br>but of course, there are no guarantees.
* News media
<br>
[News Archive](https://archive.org),
[GDELT](https://www.gdeltproject.org)
* Wikipedia
<br>
[Wikipedia dumps](https://dumps.wikimedia.org/)

*Annotated data:*
* Annotate and/or even generate your data using crowdsourcing
  <br>
  [Crowdflower](https://www.figure-eight.com/), [MechanicalTurk](https://www.mturk.com/), etc.
* Annotate using auxiliary lexical resources
<br>
[LIWC](http://liwc.wpengine.com/) is a tool that analyses your textual input for the presence of various
<br>"shades": informal speech; syntactic structures; affect and social words; cognitive, perceptual, and
<br>biological processes; relativity; personal concerns, etc.

*Finally, you have the data!
... it is still not the final truth!*

The work of [A. Olteanu](http://www.aolteanu.com/) describes well the various [biases and pitfalls](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2886526) that come with data analysis.
* Biases
* Not representative
* Noisy data
* Incomplete data
* Incorrect data
* Missing data, etc.

So, first, get to know your data!

Note: In this class, we will be working with data that fits into memory.
<br>Once it does not, you should adopt other tools to scale your pipelines (Flume, Spark, HDFS, etc.),
<br>which are out of the scope of this class.


---

---

# Dataset for the course: Tweets about natural disasters

*Goal of our 1st application*: Analyze tweets about natural disasters. 

We will use publicly shared Twitter data with human annotations:

[Disasters on Social Media](https://data.world/crowdflower/disasters-on-social-media) from https://www.figure-eight.com/data-for-everyone/

Description from the website: `Contributors looked at over 10,000 tweets culled with a variety of searches like "ablaze", "quarantine", and "pandemonium", then noted whether the tweet referred to a disaster event (as opposed to a joke with the word or a movie review or something non-disastrous)`

In [None]:
# You can download these data from the following link and store somewhere locally:

csv_file = 'https://www.figure-eight.com/wp-content/uploads/2016/03/socialmedia-disaster-tweets-DFE.csv'

In [10]:
#    Example on how the data can be loaded here from the local file system.
#    More options here: https://colab.research.google.com/notebook#fileId=/v2/external/notebooks/io.ipynb&scrollTo=vz-jH8T_Uk2c
#    Downloaded the input csv.

from google.colab import files
uploaded = files.upload()

import pandas as pd
with open('socialmedia-disaster-tweets-DFE.csv',
          mode = 'r',
          encoding = 'ascii',
          errors = 'ignore'
         ) as csvfile:
  disasters_df = pd.read_csv(csvfile, header=0)
print ("Features:\n", disasters_df.keys())
print ("Number of entries/tweets and number of the corresponding data columns.", disasters_df.shape)
disasters_df.sample(2, random_state=432)

Saving socialmedia-disaster-tweets-DFE.csv to socialmedia-disaster-tweets-DFE.csv
Features:
 Index(['_unit_id', '_golden', '_unit_state', '_trusted_judgments',
       '_last_judgment_at', 'choose_one', 'choose_one:confidence',
       'choose_one_gold', 'keyword', 'location', 'text', 'tweetid', 'userid'],
      dtype='object')
Number of entries/tweets and number of the corresponding data columns. (10876, 13)


Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,choose_one,choose_one:confidence,choose_one_gold,keyword,location,text,tweetid,userid
7474,778252291,False,finalized,5,8/30/15 1:48,Not Relevant,0.7982,,obliteration,,@ashberxo @mind_mischief the removal of all tr...,6.27674e+17,321536657.0
8093,778252910,False,finalized,5,8/27/15 14:20,Not Relevant,1.0,,rescued,,Val rescued the sister but alone died. In the ...,6.2936e+17,282079015.0


Let's now check what we actually have in the dataset.
If you create a dataset yourself, you typically know what is in it; if you acquired it, you need to get familiar with it.

Note: It is important to look into your data and at particular examples to understand it.
<br>However, you should normally sample a representative set of data to avoid unintentional overfitting.

## Application-specific data preparation

Considering the nature of the data set (crowd-annotated), let's create some filters to separate golden and annotated examples. In particular, let's ignore the examples where annotators did not have enough confidence.

#### Helper Load Clean Data

In [None]:
#    Golden examples
golden = disasters_df['_unit_state'] == "golden"
golden_positive = disasters_df['choose_one_gold'] == "Relevant"
golden_negative = disasters_df['choose_one_gold'] == "Not Relevant"

#    Annotated examples
finalized = disasters_df['_unit_state'] == "finalized"
confident = disasters_df['choose_one:confidence'] > 0.8
finalized_positive = disasters_df['choose_one'] == "Relevant"
finalized_negative = disasters_df['choose_one'] == "Not Relevant"

In [12]:
print ("All golden examples:", disasters_df[golden].shape[0])
print ("Positive golden examples:", disasters_df[golden & golden_positive].shape[0])
print ("Annotated examples:", disasters_df[finalized].shape[0])
print ("Annotated examples positive and confident:",
       disasters_df[finalized & confident & finalized_positive].shape[0])
print ("Annotated examples negative and confident:",
       disasters_df[finalized & confident & finalized_negative].shape[0])

All golden examples: 87
Positive golden examples: 57
Annotated examples: 10789
Annotated examples positive and confident: 2950
Annotated examples negative and confident: 3445


In [13]:
for t in disasters_df[golden & golden_positive].text[0:10]:
  print (t)

Just happened a terrible car crash
Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
Heard about #earthquake is different cities, stay safe everyone.
there is a forest fire at spot pond, geese are fleeing across the street, I cannot save them all
Forest fire near La Ronge Sask. Canada
All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected
13,000 people receive #wildfires evacuation orders in California 
Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school 
#RockyFire Update => California Hwy. 20 closed in both directions due to Lake County fire - #CAfire #wildfires
#flood #disaster Heavy rain causes flash flooding of streets in Manitou, Colorado Springs areas


In [None]:
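# .copy() gives an independent DataFrame, so adding columns later
# does not trigger pandas' SettingWithCopyWarning.
clean_confident_entries = disasters_df[
    golden | (finalized & finalized_positive & confident) |
    (finalized & finalized_negative & confident)].copy()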

In [15]:
clean_confident_entries.shape

(6482, 13)

---

# Preprocessing

---

When we deal with text, we usually deal with raw input represented as a string or as lists of strings (sentences), etc.
<br>
First, those strings could be too long and too unique to be of any use as they are.
<br>
Second, your input could be rather noisy, contain outliers, or have mistakes or missing pieces.
<br>
It would be great to detect all of this and possibly remove it from your data.
<br>
Moreover, preprocessing and cleaning can significantly reduce the size of the input.

Some of the basic preprocessing steps we can perform on text are sentence splitting and word tokenization.
<br>Luckily for us, there are already tools that work for most cases or that your team has already agreed on.
<br>You can still implement and adjust the preprocessing manually, if needed.

## Tokenization + Sentence Splitting

Tokens are basically anything you encounter in the text: words (alpha), numbers (numerics), punctuation, various encodings, etc.
<br>
A token is any unit that contains no word boundary inside it; where those boundaries fall differs between languages.
<br>
Typically, the space " " together with some punctuation is a good enough approximation of word boundaries, assuming Latin-script input.
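
As a minimal sketch using only the standard library, the snippet below compares plain whitespace splitting with a regular expression that separates runs of word characters from runs of punctuation. The name `regex_tokenize` and the pattern are illustrative choices, not a reference implementation.

```python
import re

# Runs of word characters, or runs of characters that are
# neither word characters nor whitespace (i.e. punctuation).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")

def regex_tokenize(text):
    """Split text into word tokens and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

text = "Let's clean up a bit the dataset."
whitespace_tokens = text.split()
regex_tokens = regex_tokenize(text)

print(whitespace_tokens)
print(regex_tokens)
```

Whitespace splitting keeps punctuation glued to words (`dataset.`), while the regular expression separates it and also splits the contraction `Let's` into three tokens.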

In [1]:
#    Need to decide on tokenization - most of the time " " is a good guess.

# Imports
# Note: following nltk packages should be downloaded
import nltk
nltk.download('punkt')

!pip install pyicu

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Collecting pyicu
  Downloading https://files.pythonhosted.org/packages/e9/35/211ffb949c68e688ade7d40426de030a24eaec4b6c45330eeb9c0285f43a/PyICU-2.3.1.tar.gz (214kB)
Building wheels for collected packages: pyicu
  Building wheel for pyicu (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/3f/45/7e/ccee9f1fe52787595e92641b5645cdf2cb40096749b39b4422
Successfully built pyicu
Installing collected packages: pyicu
Successfully installed pyicu-2.3.1


In [16]:
from nltk import (
    sent_tokenize as splitter,
    wordpunct_tokenize as tokenizer
)
from nltk.tokenize import TweetTokenizer
import icu

# Splits a string into sentences and words.
def tokenize(text):
  return [tokenizer(sentence) for sentence in splitter(text)]

# In this exercise we do not care about the sentences (if any),
# so let's flatten the list.
def flatten(nested_list):
  return [item for sublist in nested_list for item in sublist]

# Let's see if it works ;)
# Also would be great to write some TESTS!
print (flatten(tokenize("Let's clean up a bit the dataset.")))

['Let', "'", 's', 'clean', 'up', 'a', 'bit', 'the', 'dataset', '.']


In [None]:
weird_text = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--"

In [None]:
def iterate_breaks(text, break_iterator):
    break_iterator.setText(text)
    lastpos = 0
    while True:
        next_boundary = break_iterator.nextBoundary()
        if next_boundary == -1: return
        yield text[lastpos:next_boundary]
        lastpos = next_boundary
        
icu_words = icu.BreakIterator.createWordInstance(icu.Locale('en_US'))

In [None]:
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
nlp = English()
# Create a blank Tokenizer with just the English vocab.
# Caution: this rebinds the name `tokenizer` imported from nltk above,
# so tokenize() uses the spaCy tokenizer once this cell has run.
tokenizer = Tokenizer(nlp.vocab)

**Note**: Look at the way this tokenization function handles words like "Let's".
<br>Experiment to see what happens with some other punctuation symbols, and consider using another library
<br>like ICU that provides both sentence and word break iterators.
<br>
To import `icu`, run `!pip install pyicu`; then you can create break iterators with, e.g., `BreakIterator.createWordInstance`.


#### *Exercise*: Compare and play with 2 different tokenizers

NLTK vs. ICU

In [6]:
print(weird_text)
print(flatten(tokenize(weird_text)))
print(TweetTokenizer().tokenize(weird_text))
print(list(iterate_breaks(weird_text, icu_words)))
print(weird_text.split())
print([word.text for word in tokenizer(weird_text)])

This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--
[This, is, a, cooool, #dummysmiley:, :-), :-P, <3, and, some, arrows, <, >, ->, <--]
['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--']
['This', ' ', 'is', ' ', 'a', ' ', 'cooool', ' ', '#', 'dummysmiley', ':', ' ', ':', '-', ')', ' ', ':', '-', 'P', ' ', '<', '3', ' ', 'and', ' ', 'some', ' ', 'arrows', ' ', '<', ' ', '>', ' ', '-', '>', ' ', '<', '-', '-']
['This', 'is', 'a', 'cooool', '#dummysmiley:', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--']
['This', 'is', 'a', 'cooool', '#dummysmiley:', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--']


#### Clean up the input

In [None]:
#    Let's clean up the dataset a bit.
#    Below are just some examples of what can be done to clean up the data.
#    Note: all these transformations are done at your own risk ;), since
#          they might introduce some bias into the data and remove some important
#          information.

def tokenize_flatten_df(row, field):
  return flatten(tokenize(row[field]))

In [18]:
clean_confident_entries['text_tokenized'] = clean_confident_entries.apply(lambda row: tokenize_flatten_df (row, 'text'), axis=1)



In [19]:
for line in clean_confident_entries['text_tokenized'][1000:1010]:
  print(line)

['Japan', 'Marks', '70th', 'Anniversary', 'of', 'Hiroshima', 'Atomic', 'Bombing', 'http', '://', 't', '.', 'co', '/', 'jzgxwRgFQg']
['@', 'rinkydnk2', '@', 'ZaibatsuNews', '@', 'NeoProgressive1', 'When', 'push2Left', 'talk', "='", 'ecology', "'&", 'amp', ";'", 'human', 'rts', "'&", 'amp', ";'", 'democracy', "'.", 'War', 'Afghetc', "='", 'Left', "'", 'humanitarian', 'bombing']
['Hiroshima', 'bombing', 'justified', ':', 'Majority', 'Americans', 'even', 'today', '-', 'Hindustan', 'Times', 'http', '://', 't', '.', 'co', '/', 'cC9z5asVZh']
['Japan', 'marks', 'the', '70th', 'anniversary', 'of', 'the', 'atomic', 'bombing', 'of', 'Hiroshima', '.', 'http', '://', 't', '.', 'co', '/', 'YmKn1IwPvF', 'http', '://', 't', '.', 'co', '/', 'mMmJ8Bo9y3']
["'", 'Japan', 'Marks', '70th', 'Anniversary', 'of', 'Hiroshima', 'Atomic', 'Bombing', "'", 'by', 'THE', 'ASSOCIATED', 'PRESS', 'via', 'NYT', 'http', '://', 't', '.', 'co', '/', 'kKULqGB9e3']
['Japan', 'marks', '70th', 'anniversary', 'of', 'Hiroshima',

Tokenizing the raw text is probably a bad idea: URLs get split into pieces, so we need to remove them first!

Thus, below are the things we can do before tokenization.

## Normalization

Normalization usually refers to the unification of terms in the document.
<br>
"Run" and "runs" are probably referring to the same word and for particular applications it does not make sense to
<br>treat them differently.

### Text normalization or cleaning options

Depending on the problem you are solving, your methods could be sensitive to the noise present in the data.
<br>
In order to reduce the input size and increase the importance of specific tokens in your documents,
<br>
some clean up could be useful. Cleaning the input usually includes the following:

* Removal of the sensitive or ambiguous information (urls, names, numbers, hashtags, emojis, non-alpha-numeric, etc.)
* Dealing with punctuation
* Lowercasing and/or stemming/lemmatizing words
* Removing tokens with low information gain (tokens with high document frequency or stop words)

### Removing Twitter-specific characters

In [None]:
import re

# remove urls
def remove_urls(text):
  return re.sub(r"(https?\://)\S+", "", text)

# remove mentions (@name) completely
def remove_mentions(text):
  return re.sub(r"@[^:| ]+:? ?", "", text)

# remove "RT:", if the tweet contains it.
def remove_rt(text):
  if text.lower().startswith("rt:"):
    return text[3:].strip()
  return text

In [21]:
print (remove_urls("RT: @julia: Calgary Police Flood Road Closures in Calgary. http://t.co/RLN09WKe9g"))
print (remove_mentions("RT: @julia Calgary Police Flood Road Closures in Calgary. http://t.co/RLN09WKe9g"))
print (remove_rt("RT: @julia Calgary Police Flood Road Closures in Calgary. http://t.co/RLN09WKe9g"))
print (remove_rt(
          remove_mentions(
              remove_urls(
                  "RT: @julia Calgary Police Flood Road Closures in Calgary. http://t.co/RLN09WKe9g"))))

RT: @julia: Calgary Police Flood Road Closures in Calgary. 
RT: Calgary Police Flood Road Closures in Calgary. http://t.co/RLN09WKe9g
@julia Calgary Police Flood Road Closures in Calgary. http://t.co/RLN09WKe9g
Calgary Police Flood Road Closures in Calgary.


In [22]:
def remove_urls_mentions_rt_df(row, field):
  return remove_rt(remove_mentions(remove_urls(row[field])))

clean_confident_entries['text_cleaned_from_url_mentions_rt'] = clean_confident_entries.apply(lambda row: remove_urls_mentions_rt_df (row, 'text'), axis=1)

clean_confident_entries['text_tokenized'] = clean_confident_entries.apply(lambda row: tokenize_flatten_df (row, 'text_cleaned_from_url_mentions_rt'), axis=1)

  


### *Exercise*: Hashtag removal

Write a function that replaces hashtags '#word' with just 'word' in texts!

Type your solution below:

In [23]:
def replace_hashtags_from_text(text):
  return re.sub(r"#+ ?", '', text)

print(replace_hashtags_from_text(text='Hello #world how are you'))

Hello world how are you
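
One possible variant, shown only as a sketch: strip the `#` only when it is directly attached to a word, so that a free-standing `#` is left untouched. The helper name `strip_hashtag_marks` is hypothetical.

```python
import re

def strip_hashtag_marks(text):
    """Remove '#' only where it directly precedes a word character."""
    return re.sub(r"#(?=\w)", "", text)

print(strip_hashtag_marks("Hello #world how are you"))
print(strip_hashtag_marks("We are # 1 in #NLP"))
```

Whether a lone `#` should be kept or dropped depends on the application, so neither behaviour is more correct in general.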


### Removing punctuation, Numbers, and others

Although some approaches can handle punctuation, it is occasionally useful to remove it.
<br>
Moreover, in simple scenarios, punctuation might have a very high document frequency and,
<br> as a result, would be removed anyway during the stop word removal that we cover later.
<br>
Of course, if we are working with complicated sequence models where punctuation is part of the result, it is crucial to keep it.

In [None]:
# remove standalone hashtag tokens ("#")
def replace_hashtags_from_list(tokens_list):
  return [token for token in tokens_list if token != "#"]

# remove tokens that are plain numbers (e.g. "123", "-4.5")
def remove_digits(tokens_list):
  return [token for token in tokens_list if not re.match(r"[-+]?\d+(\.[0-9]*)?$", token)]

# keep only purely alphabetic tokens; anything containing digits
# or punctuation is dropped (str.isalpha)
def remove_containing_non_alphanum(tokens_list):
  return [token for token in tokens_list if token.isalpha()]

### Lowercasing

Important: Letter case can carry meaning of its own, e.g., "the" is an article, while "THE" is possibly an abbreviation.
<br>
In Twitter and other informal messages, capitalized words can also indicate emphasis ('It was THE DAY!').

In [None]:
# lowercase everything
def lowercase_list(tokens_list):
  return [token.lower() for token in tokens_list]

### Remove stopwords

Stop words are typically seen as high-frequency (document-wise) noise, or as function words
<br>(words that do not convey a meaning of their own but rather perform some function).
* Curated *stop word lexicons* exist, e.g., `nltk.corpus.stopwords`.
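
Besides curated lexicons, stop words can also be derived from the corpus itself through document frequency. The sketch below uses a hypothetical helper `high_df_tokens`, an assumed threshold, and made-up documents. It returns the tokens that occur in at least a given share of the documents.

```python
from collections import Counter

def high_df_tokens(tokenized_docs, min_df_ratio=0.8):
    """Tokens that appear in at least `min_df_ratio` of the documents."""
    n_docs = len(tokenized_docs)
    document_frequency = Counter()
    for tokens in tokenized_docs:
        # set(): count each token at most once per document.
        document_frequency.update(set(tokens))
    return {
        token
        for token, count in document_frequency.items()
        if count / n_docs >= min_df_ratio
    }

docs = [
    ["the", "flood", "closed", "the", "road"],
    ["the", "fire", "is", "near", "the", "school"],
    ["police", "closed", "the", "bridge"],
    ["the", "storm", "is", "over"],
]
corpus_stopwords = high_df_tokens(docs)
print(corpus_stopwords)
```

Lowering the threshold makes the list more aggressive. With `min_df_ratio=0.5`, content-bearing words such as "closed" would be dropped as well.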

In [29]:
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
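# remove stopwords
# Build the set once: calling stopwords.words() for every token is slow.
ENGLISH_STOPWORDS = set(stopwords.words(u'english'))

def remove_stopwords(tokens_list):
  return [token for token in tokens_list if token not in ENGLISH_STOPWORDS]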

In [31]:
for token in stopwords.words(u'english')[0:20]:
  print(token)

i
me
my
myself
we
our
ours
ourselves
you
you're
you've
you'll
you'd
your
yours
yourself
yourselves
he
him
his


In [32]:
print (replace_hashtags_from_list(
            remove_digits(
                remove_containing_non_alphanum(
                    lowercase_list(remove_stopwords(
                        ["Calgary", "#", "Police", ",", "123", ",",
                         "?", "Flood", "#", "Road", "Closures", "in",
                         "Calgary", ".", "the", "13,000"]))))))

['calgary', 'police', 'flood', 'road', 'closures', 'calgary']


### Final full text cleaning

In [None]:
from tqdm import tqdm
tqdm.pandas()

In [34]:
# Iterates over the elements of the list with tokens and performs cleanup.
def clean_tokens(row, field):
  return replace_hashtags_from_list(
            remove_digits(
                remove_containing_non_alphanum(
                    lowercase_list(remove_stopwords(row[field])))))

clean_confident_entries['text_tokenized_cleaned'] = clean_confident_entries.progress_apply(lambda row: clean_tokens(row, 'text_tokenized'), axis=1)

100%|██████████| 6482/6482 [00:15<00:00, 427.64it/s]


### Stemming and lemmatization

In NLP, **stemming** or **lemmatization** is used to deal with the variety of surface forms that a word/token can take in a language.
<br>
Stemming is typically a faster, more naive shortening of words.
<br>
Lemmatization takes a deeper approach to extracting the lemmas of words
<br>(this includes identifying the part of speech and using a language-specific dictionary that maps words to lemmas).

*Examples:* 

am, are, is $\Rightarrow$ be

car, cars, car's, cars' $\Rightarrow$ car

<img src="https://nlp.stanford.edu/IR-book/html/htmledition/img102.png" width=50%>

(c) [Stanford NLP](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)
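
To illustrate the difference on the examples above, here is a deliberately naive sketch: a toy stemmer that only strips a few suffixes, and a toy lemmatizer that first looks the word up in a tiny hand-written dictionary. Both the rules and the dictionary are assumptions for illustration and are far simpler than real stemmers or lemmatizers.

```python
# Toy suffix rules, checked in this order.
SUFFIXES = ["'s", "s'", "s"]

# Tiny hand-written lemma dictionary for irregular forms.
TOY_LEMMAS = {"am": "be", "are": "be", "is": "be"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least two characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Dictionary lookup first, suffix stripping as a fallback."""
    return TOY_LEMMAS.get(word, toy_stem(word))

print([toy_stem(w) for w in ["car", "cars", "car's", "cars'"]])
print([toy_lemmatize(w) for w in ["am", "are", "is"]])
print(toy_stem("bus"))  # naive rules also damage words: "bu"
```

The last line shows why pure suffix stripping is called naive: it has no notion of which strings are actual words.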

Let's now lemmatize/stem the words we are using.

In [35]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [36]:
porter_stemmer = nltk.PorterStemmer()
lancaster_stemmer = nltk.LancasterStemmer()
snowball_stemmer = nltk.SnowballStemmer(u'english')
lemmatizer = nltk.WordNetLemmatizer()
lemmatizer.stem = lemmatizer.lemmatize
normalizers = [
    ('porter_stemmer', porter_stemmer),
    ('lancaster_stemmer', lancaster_stemmer),
    ('snowball_stemmer', snowball_stemmer),
    ('wordnet_lemmatizer', lemmatizer)
]

for stemmer_name, normalizer in normalizers:
    print (stemmer_name, [normalizer.stem(token) for token in tokenizer("Calgary Police Policy Flood Road Closures in Calgary".lower())])

porter_stemmer ['calgari', 'polic', 'polici', 'flood', 'road', 'closur', 'in', 'calgari']
lancaster_stemmer ['calg', 'pol', 'policy', 'flood', 'road', 'clos', 'in', 'calg']
snowball_stemmer ['calgari', 'polic', 'polici', 'flood', 'road', 'closur', 'in', 'calgari']
wordnet_lemmatizer ['calgary', 'police', 'policy', 'flood', 'road', 'closure', 'in', 'calgary']


### *Exercise*: POS integration

Try and do it similarly to the previous cell.

In [37]:
lemmatizer.lemmatize("Calgary Police Policy Flood Road Closures in Calgary".lower(), 'n')

'calgary police policy flood road closures in calgary'
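
One way to approach this exercise, sketched under assumptions: the WordNet lemmatizer expects a single word plus a one-letter part of speech (`'n'`, `'v'`, `'a'`, `'r'`), so Penn Treebank tags first have to be mapped to those letters. The helper `penn_to_wordnet` is hypothetical, and the tagged sentence below is written by hand in the `(word, tag)` format rather than produced by a tagger.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # default: noun

# Hand-written (word, tag) pairs; the tags are assumed, not tagger output.
tagged = [
    ("The", "DT"), ("flooded", "JJ"), ("roads", "NNS"),
    ("were", "VBD"), ("quickly", "RB"), ("closed", "VBN"),
]
pairs = [(word.lower(), penn_to_wordnet(tag)) for word, tag in tagged]
print(pairs)

# With NLTK and its WordNet data available, each pair could be passed on:
# lemmas = [lemmatizer.lemmatize(word, pos) for word, pos in pairs]
```

Lemmatizing token by token with a suitable part of speech avoids passing the whole sentence to the lemmatizer as if it were one word.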

## Text Annotation

### Part of speech

In many languages, a word's behaviour can be described by its part of speech (how the word behaves in a sequence of words).
<br>
The main POS categories, at least for English and very much simplified, are nouns, verbs, adjectives, and adverbs.
<br>
Check the [nltk practical examples](https://www.nltk.org/book/ch05.html) on how to get the tags for your input.
<br>
Moreover, nltk has an interface to the [Stanford POS tagger](https://nlp.stanford.edu/software/tagger.shtml).

Actual tag sets go beyond a simplistic noun vs. verb classification.
<br>
There are also proper nouns, count nouns, and common nouns.
<br>
Moreover, the following tags are available: adjectives, adverbs, locatives, degrees, prepositions, articles, pronouns,
<br>
phrasal verbs, auxiliaries, modal verbs, etc. Singular and plural tags are added as well.
![List of POS](https://d2vlcm61l7u1fs.cloudfront.net/media%2Ffef%2Ffef719c5-d0fe-4b32-846b-66b7b540e268%2Fphptcdp2B.png)

Let's try nltk POS tagger.

In [38]:
nltk.download('averaged_perceptron_tagger')
tagged = nltk.pos_tag("Time flies like an arrow".split())
print (tagged)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[('Time', 'NNP'), ('flies', 'NNS'), ('like', 'IN'), ('an', 'DT'), ('arrow', 'NN')]


### Parse trees

* Constituency-based parse trees
<br>
A parse tree in which the phrase structure of the grammar is preserved.
<br>The tree contains non-terminal nodes (phrases) and terminal nodes (actual words).
<br>The standard structure of a sentence is $ S = NP\ VP$, that is, a sentence should contain a noun phrase and a verb phrase.
* Dependency-based parse trees
<br>
In these trees, all nodes are terminal, and the edges specify the actual dependencies between the words.
<br>This means that there are fewer nodes in such trees (since we do not create non-terminal nodes).
<br>Constituency is also acknowledged in such graphs: any complete sub-tree is a constituent.
The tree can employ various kinds of dependencies: morphological, semantic, syntactic.
<br>The syntactic functions in such a tree are useful if you are interested in extracting particular phrases or constructions:
<br>e.g., you can find out whether something is an ATTRibute of an object, or whether a noun is a COMPlement TO the object;
<br> you can also discover what the subject and the object of the sentence are, etc.

![Constituency and Dependency parse trees](https://upload.wikimedia.org/wikipedia/commons/0/0d/Wearetryingtounderstandthedifference_%282%29.jpg)

(c) taken from [Wikipedia](https://en.wikipedia.org/wiki/Dependency_grammar).
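
To make the constituency idea concrete, here is a minimal sketch that stores a tree as nested tuples and walks it. The parse itself is written by hand (an assumption, not the output of any parser), and the helper names `leaves` and `phrases` are made up for this example.

```python
# Hand-written constituency parse (an assumption, not parser output).
# A phrase is (label, child, child, ...); a leaf is (POS tag, word).
tree = ('S',
        ('NP', ('NN', 'Time')),
        ('VP', ('VBZ', 'flies'),
               ('PP', ('IN', 'like'),
                      ('NP', ('DT', 'an'), ('NN', 'arrow')))))

def is_leaf(node):
    """A leaf has exactly one child and that child is the word itself."""
    return len(node) == 2 and isinstance(node[1], str)

def leaves(node):
    """Collect the words (terminal nodes) from left to right."""
    if is_leaf(node):
        return [node[1]]
    return [word for child in node[1:] for word in leaves(child)]

def phrases(node, wanted):
    """Return the text of every phrase whose label equals `wanted`."""
    if is_leaf(node):
        return []
    found = [' '.join(leaves(node))] if node[0] == wanted else []
    for child in node[1:]:
        found.extend(phrases(child, wanted))
    return found

print(leaves(tree))
print(phrases(tree, 'NP'))
```

The same traversal works for any label, e.g. `phrases(tree, 'PP')` returns the prepositional phrases.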

### *Exercise*: Extract parse trees

Try it out yourself during the break and let's discuss afterwards.

In order to use the Stanford parsers, follow the instructions [here](https://stanfordnlp.github.io/CoreNLP/download.html).
<br>Once you have downloaded and unzipped the CoreNLP libs, specify the jar and models_jar paths in the code below and you should be good to go :)

In [39]:
from nltk.parse.stanford import StanfordDependencyParser
dependency_parser = StanfordDependencyParser(path_to_jar="",
                                             path_to_models_jar="")

result = dependency_parser.raw_parse('Time flies like an arrow')
dependency_tree = next(result)  # raw_parse returns an iterator; .next() only exists in Python 2
list(dependency_tree.triples())

LookupError: ignored

Alternatively, if you run this code locally, you can start the Stanford CoreNLP server with the following command and then query it from Python:

`java -mx1g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9010 -timeout 15000`

In [40]:
from nltk.parse.corenlp import CoreNLPParser
parser = CoreNLPParser(url='http://localhost:9010')

next(
     parser.raw_parse('The quick brown fox sucks at jumping.')
     ).pretty_print()

next(
     parser.raw_parse('Time flies like an arrow.')
     ).pretty_print()


from nltk.parse.corenlp import CoreNLPDependencyParser
dep_parser = CoreNLPDependencyParser(url='http://localhost:9010')

parse, = dep_parser.raw_parse('Time flies like an arrow.')

print(parse.to_conll(4))

ConnectionError: ignored

There is also a Python wrapper around the library. Check [here](https://github.com/Lynten/stanford-corenlp). The cell below uses the separate `stanfordnlp` package.

In [41]:
!pip install stanfordnlp
import stanfordnlp

stanfordnlp.download('en')   # This downloads the English models for the neural pipeline
nlp = stanfordnlp.Pipeline() # This sets up a default neural pipeline in English

Collecting stanfordnlp
  Downloading https://files.pythonhosted.org/packages/41/bf/5d2898febb6e993fcccd90484cba3c46353658511a41430012e901824e94/stanfordnlp-0.2.0-py3-none-any.whl (158kB)
Installing collected packages: stanfordnlp
Successfully installed stanfordnlp-0.2.0
Using the default treebank "en_ewt" for language "en".
Would you like to download the models for: en_ewt now? (Y/n)
y

Default download directory: /root/stanfordnlp_resources
Hit enter to continue or type an alternate directory.


Downloading models for: en_ewt
Download location: /root/stanfordnlp_resources/en_ewt_models.zip


100%|██████████| 235M/235M [00:38<00:00, 4.99MB/s]



Download complete.  Models saved to: /root/stanfordnlp_resources/en_ewt_models.zip
Extracting models file for: en_ewt
Cleaning up...Done.
Use device: cpu
---
Loading: tokenize
With settings: 
{'model_path': '/root/stanfordnlp_resources/en_ewt_models/en_ewt_tokenizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
---
Loading: pos
With settings: 
{'model_path': '/root/stanfordnlp_resources/en_ewt_models/en_ewt_tagger.pt', 'pretrain_path': '/root/stanfordnlp_resources/en_ewt_models/en_ewt.pretrain.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
---
Loading: lemma
With settings: 
{'model_path': '/root/stanfordnlp_resources/en_ewt_models/en_ewt_lemmatizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
Building an attentional Seq2Seq model...
Using a Bi-LSTM encoder
Using soft attention for LSTM.
Finetune all embeddings.
[Running seq2seq lemmatizer with edit classifier]
---
Loading: depparse
With settings: 
{'model_path': '/root/stanfordnlp_resources

In [43]:
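doc = nlp("Barack Obama was born in Hawaii.  He was elected president in 2008.")
# print_tokens() writes to stdout itself and returns None,
# hence the extra "None" line in the output.
print(doc.sentences[0].print_tokens())
print('-' * 100)
print(doc.sentences[0].print_dependencies())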

<Token index=1;words=[<Word index=1;text=Barack;lemma=Barack;upos=PROPN;xpos=NNP;feats=Number=Sing;governor=4;dependency_relation=nsubj:pass>]>
<Token index=2;words=[<Word index=2;text=Obama;lemma=Obama;upos=PROPN;xpos=NNP;feats=Number=Sing;governor=1;dependency_relation=flat>]>
<Token index=3;words=[<Word index=3;text=was;lemma=be;upos=AUX;xpos=VBD;feats=Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin;governor=4;dependency_relation=aux:pass>]>
<Token index=4;words=[<Word index=4;text=born;lemma=bear;upos=VERB;xpos=VBN;feats=Tense=Past|VerbForm=Part|Voice=Pass;governor=0;dependency_relation=root>]>
<Token index=5;words=[<Word index=5;text=in;lemma=in;upos=ADP;xpos=IN;feats=_;governor=6;dependency_relation=case>]>
<Token index=6;words=[<Word index=6;text=Hawaii;lemma=Hawaii;upos=PROPN;xpos=NNP;feats=Number=Sing;governor=4;dependency_relation=obl>]>
<Token index=7;words=[<Word index=7;text=.;lemma=.;upos=PUNCT;xpos=.;feats=_;governor=4;dependency_relation=punct>]>
None
------------
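
The `governor` and `dependency_relation` fields printed above are enough to recover phrases: every complete sub-tree of the dependency tree is a constituent. The sketch below hard-codes those fields for the first sentence (copied from the output above) and collects the sub-tree under a given head. The helper name `subtree` is an assumption for illustration and is not part of `stanfordnlp`.

```python
# (index, text, governor, relation) copied from the output above;
# governor 0 marks the root of the sentence.
words = [
    (1, 'Barack', 4, 'nsubj:pass'),
    (2, 'Obama', 1, 'flat'),
    (3, 'was', 4, 'aux:pass'),
    (4, 'born', 0, 'root'),
    (5, 'in', 6, 'case'),
    (6, 'Hawaii', 4, 'obl'),
    (7, '.', 4, 'punct'),
]

def subtree(words, head_index):
    """Return the words of the complete sub-tree under head_index, in sentence order."""
    selected = {head_index}
    changed = True
    while changed:  # keep adding words whose governor is already selected
        changed = False
        for index, _, governor, _ in words:
            if governor in selected and index not in selected:
                selected.add(index)
                changed = True
    return [text for index, text, _, _ in words if index in selected]

# The subject is the sub-tree under the word attached with an nsubj relation.
subject_head = next(i for i, _, _, rel in words if rel.startswith('nsubj'))
print(subtree(words, subject_head))
print(subtree(words, 6))
```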

# What to do with the clean data?

# Document representation

Let's finally impose a vectorized representation on our documents so that the machine (and the math) can easily operate on them.

### Document representation as term counts (Bag-Of-Words, or BOW)

In [None]:
#    Now we need to convert our documents to the common representation.
#    You can do it manually, just for fun, or we can already use some libs.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = CountVectorizer()

corpus = []
for document_id, row in clean_confident_entries.iterrows():
  corpus.append(" ".join(row['text_tokenized_cleaned']))
  
document_term_matrix = vectorizer.fit_transform(corpus).toarray()

In [45]:
document_term_matrix[0]

array([0, 0, 0, ..., 0, 0, 0])
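
The comment in the cell above mentions that the conversion can also be done manually. A minimal sketch of that manual route, using only the standard library, could look as follows. The toy corpus and the helper name `to_count_vector` are assumptions for illustration; the real notebook data is not used here.

```python
from collections import Counter

# Toy corpus (an assumption for illustration); each document is already tokenized.
toy_corpus = [
    ['flood', 'road', 'closures', 'flood'],
    ['terrible', 'car', 'crash'],
    ['car', 'crash', 'road'],
]

# Vocabulary: one column per distinct token, in sorted order.
vocabulary = sorted({token for doc in toy_corpus for token in doc})

def to_count_vector(tokens, vocabulary):
    """Map a token list to a list of counts aligned with the vocabulary."""
    counts = Counter(tokens)
    return [counts[token] for token in vocabulary]

toy_matrix = [to_count_vector(doc, vocabulary) for doc in toy_corpus]
print(vocabulary)
for row in toy_matrix:
    print(row)
```

Each row has one entry per vocabulary word, which is why real document-term matrices are mostly zeros, as in the output above.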

## Representation

* We need to represent our data somehow
  * Bag-of-words
  <br>
  The concept comes from the fact that we can represent our input as a set of tokens or words without preserving any notion of order,
  <br> just like items in a bag. In such cases, we assume that each document is simply a collection of words and each
  <br> word can appear zero or more times in the bag.
  <br> In order to preserve some of the order in the documents, n-grams (n consecutive words/tokens in the document) can be introduced.
  <br> Though rather powerful, n-grams make the number of features/dimensions grow exponentially with n.
  * Document-term frequency
  <br>
  Similar to the BOW model, though instead of 0s and 1s for each word that appears in the document,
  <br> we place the TF or TF-IDF score of the word in the document.
  * Semantic vector space representation
  <br>
  The previous representations are rather simple but very high-dimensional, as the number of tokens in the "bag" can usually be
  <br> around 1M (regardless of whether they are present in a given document).
  <br> As you may have seen in the ML course, we can reduce the dimensionality by converting our
  <br> 1M-dimensional document representation to, let's say, 300 dimensions.
  <br> Such a transformation can be done for words (a special case of a document with a single 1 at the word's position and 0s elsewhere),
  <br> combinations of words, sentences, paragraphs, and whole documents.
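
To illustrate the n-gram idea and why the feature space explodes, here is a small sketch. The helper name `ngrams` and the vocabulary size are assumptions chosen for the example; the bound `V**n` is the number of possible n-grams over a vocabulary of `V` tokens, not the number actually observed in a corpus.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'time flies like an arrow'.split()
bigrams = ngrams(tokens, 2)
print(bigrams)

# Upper bound on the number of distinct n-grams for a vocabulary of size V.
vocabulary_size = 10_000  # hypothetical vocabulary size
for n in (1, 2, 3):
    print(n, vocabulary_size ** n)
```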

### Feature weighting and noise removal

Apart from the general stopword removal that you might decide to do, several other techniques exist.
<br>They can be used, first, for cleaning the overall corpus of too frequent/too sparse tokens
<br>and, second, for potentially better feature weights (rather than 0 and 1).

Note: these techniques tend to work better for BOW representations, but might work worse for sequence-to-sequence tasks.

*Weighting tokens/features*:

* *TF-IDF*
<br>
Term frequency * inverse document frequency. The intuition here is to preserve all words of the dataset in each document,
<br>but to **correct** their frequencies by word specificity, i.e., the more specific a word is (a word that appears in only a few
<br>documents might be quite specific), the more we want to promote the word's score.
<br>One way to compute this specificity is the inverse document frequency: the (logarithm of the) number of documents divided by the number of documents that contain the term.
<br>This way, if the term is very frequent across documents, its term frequency is multiplied by a small value, and vice versa.
<br>Note: the more frequent a word is in a document, the more likely it is still to be important, though its score is reduced if the word is not specific.

* Filter words with the highest difference between the *conditional probability* of a word within a document vs. the probability of the word within the whole corpus
<br>
Here we do not care that much about the raw word frequency in a document, but rather about how much the word's
<br>probability changes in a given document with respect to its overall probability in the entire dataset.
<br>
As a result, this score might rank more specific (but less frequent) words higher.
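
A small worked example may help to see both weightings with actual numbers. All counts below are toy values (assumptions), and the logarithm bases (10 for IDF, 2 for the probability ratio) are simply the ones used by the helper class in this notebook; other bases only rescale the scores.

```python
import math

# Toy numbers (assumptions for illustration).
number_of_documents = 1000      # documents in the corpus
tf = 2                          # occurrences of the word in this document
df_rare, df_common = 10, 1000   # documents containing a rare / a very common word

# TF-IDF = term frequency * log10(N / document frequency).
tfidf_rare = tf * math.log10(number_of_documents / df_rare)      # 2 * log10(100)
tfidf_common = tf * math.log10(number_of_documents / df_common)  # 2 * log10(1)
print(tfidf_rare, tfidf_common)

# Conditional-probability change: log2 of P(word | document) / P(word).
p_word_in_document = 1 / 8      # the word is 1 of 8 tokens in the document
p_word_in_corpus = 1 / 4096     # the word is 1 of every 4096 tokens in the corpus
log_ratio = math.log2(p_word_in_document / p_word_in_corpus)
print(log_ratio)
```

A word that occurs in every document gets a TF-IDF of zero no matter how often it appears, while the log ratio depends only on how much more likely the word is in this document than in the corpus overall.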

*Removing noise*:

* Filter out the tokens with the *highest document frequency or the largest part of the probability mass*
<br>
You might have heard that natural language vocabularies exhibit a frequency distribution similar to [Zipf's law](https://en.wikipedia.org/wiki/Zipfs_law)
<br>(a kind of long-tail distribution where the frequency of a word is roughly inversely proportional to its rank, e.g., the most frequent word is about twice as likely to appear in the corpus as the second most frequent one).
<br>Real corpora follow this only approximately, but the overall shape holds: a lot of tokens have low frequency and a few have very high frequency.

* *Background-foreground overlap idea*
<br>
The idea behind this method is to compare the frequency distributions of two independent datasets and remove their intersection.
<br>This de-emphasizes the effect of high-frequency noise. In more detail, we take the frequency distribution of one dataset,
<br>then the same for some other totally independent (unrelated) dataset, extract the top-n words of each distribution, and remove any items in the intersection.
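
The background-foreground overlap idea can be sketched in a few lines. Both toy corpora and the helper name `top_words` are assumptions for illustration; in practice the background would be a large, unrelated dataset and `n` would be much larger.

```python
from collections import Counter

def top_words(tokens, n):
    """Return the set of the n most frequent tokens."""
    return {word for word, _ in Counter(tokens).most_common(n)}

# Toy data (assumptions): a foreground corpus about floods
# and an unrelated background corpus.
foreground = 'the flood closed the road and the flood hit the city'.split()
background = 'the cat sat on the mat and the dog sat in the yard'.split()

# Words that are frequent in BOTH corpora are treated as noise.
noise = top_words(foreground, 2) & top_words(background, 2)
filtered = [word for word in foreground if word not in noise]
print(noise)
print(filtered)
```

Here "flood" is frequent only in the foreground, so it is kept, while "the" is frequent in both and is removed.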

In [None]:
#@title Document Frequency Helper Functionality
# Compute vector representation for each document in the collection.
# Term frequency
# TFIDF

from collections import defaultdict, Counter
import math

class TermDocumentCounts:
  def __init__(self):
    # Counters of all the words in the corpus
    self.total_word_counts = Counter()
    self.total_number_of_words = 0
    self.term_count_per_document = defaultdict(Counter)
    self.number_of_words_per_document = defaultdict(int)
    self.number_of_document = 0
    self.df = defaultdict(int)
    
  def update(self, document_id, tokens):
    self.number_of_document += 1
    num_tokens = len(tokens)
    self.total_word_counts.update(tokens)
    self.total_number_of_words += num_tokens
    self.term_count_per_document[document_id].update(tokens)
    self.number_of_words_per_document[document_id] += num_tokens
    for token in set(tokens):
      self.df[token] += 1
  
  def most_common_word_in_document(self, document_id, top_n = None):
    return self.term_count_per_document[document_id].most_common(top_n)
  
  def _compute_tfidf_for_word(self, document_id, word):
    tf = self.term_count_per_document[document_id][word]
    idf = math.log(self.number_of_document / self.df[word], 10)
    return tf * idf
  
  # Returns the list of words ranked according to TFIDF score.
  def ranked_document_words_tfidf(self, document_id, top_n=None):
    tfidfs = [
        (word, self._compute_tfidf_for_word(document_id, word))
        for word in self.term_count_per_document[document_id].keys()
    ]
    tfidfs.sort(key=lambda x: x[1], reverse=True)
    if not top_n:
      top_n = len(tfidfs)
    return tfidfs[:top_n]
      
  # Returns the list of words ranked according to max conditional probability
  # change.
  def ranked_document_words_conditional_probability(self, document_id,
                                                    top_n = None):
    word_posterior = [(word, self.compute_posterios(document_id, word)) \
              for word, count in \
                self.term_count_per_document[document_id].items()]
    word_prior = {word: self.compute_priors(word) for word, _ in word_posterior}
    conditional_probability = sorted(
        [(word, (math.log((probability / word_prior[word]), 2))) \
            for word, probability in word_posterior],
        key=lambda x: x[1],
        reverse=True
    )
    if not top_n:
        top_n = len(conditional_probability)
    return conditional_probability[:top_n]  
      
  # Computes posterior word distribution over given document.
  def compute_posterios(self, document_id, word):
    return self.term_count_per_document[document_id][word] \
              / self.number_of_words_per_document[document_id]
  
  # Computes prior word probability distribution
  def compute_priors(self, word):
    return self.total_word_counts[word] / self.total_number_of_words
    

In [None]:
corpus_counts = TermDocumentCounts()
for document_id, row in clean_confident_entries.iterrows():
  corpus_counts.update(document_id, row['text_tokenized_cleaned'])

Let's see which words are the most important in some of the documents.

In [None]:
for document_id in range(2):
  print("\nDocument", document_id + 1,
        "with top 5 important words by tfidf or conditional probability.")
  print(clean_confident_entries['text_tokenized'][document_id])
  print("TFIDF", "\t\t\t\t\t", "Conditional probability")
  for word_tfidf, word_conditional in zip(
      corpus_counts.ranked_document_words_tfidf(document_id, 5),
      corpus_counts.ranked_document_words_conditional_probability(document_id,
                                                                  5)):
    print(word_tfidf, "\t", word_conditional)
for document_id in range(1001,1003):
  print("\nDocument", document_id + 1, "with top 5 important words.")
  print(clean_confident_entries['text_tokenized'][document_id])
  print("TFIDF", "\t\t\t\t\t", "Conditional probability")
  for word_tfidf, word_conditional in zip(
      corpus_counts.ranked_document_words_tfidf(document_id, 5),
      corpus_counts.ranked_document_words_conditional_probability(document_id,
                                                                  5)):
    print(word_tfidf, "\t\t", word_conditional)


Document 1 with top 5 important words by tfidf or conditional probability.
['Just', 'happened', 'a', 'terrible', 'car', 'crash']
TFIDF 					 Conditional probability
('terrible', 2.966610986681934) 	 ('terrible', 10.79125593263882)
('happened', 2.697765674389354) 	 ('happened', 9.898171136555332)
('just', 2.033557776312547) 	 ('just', 7.691720259087907)
('car', 1.8874297406343095) 	 ('car', 7.122877423730027)
('crash', 1.7905197276262528) 	 ('crash', 6.778431892281237)

Document 2 with top 5 important words by tfidf or conditional probability.
['Our', 'Deeds', 'are', 'the', 'Reason', 'of', 'this', '#', 'earthquake', 'May', 'ALLAH', 'Forgive', 'us', 'all']
TFIDF 					 Conditional probability
('deeds', 3.811709026696191) 	 ('deeds', 12.920538949583786)
('forgive', 3.3345877719765284) 	 ('forgive', 11.33557644886263)
('allah', 3.1127390223601723) 	 ('allah', 10.598610854696425)
('our', 2.581260105317917) 	 ('our', 8.833076108333447)
('reason', 2.581260105317917) 	 ('reason', 8.833076108

## Feature extraction

Before coding, check modules of [scikit-learn](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction) and more examples [here](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction). 

*Potential features*:

* Words
* N-grams
* Character N-gram
* Skip-gram
* Part-of-Speech (POS)

*Potential values*:

* TF/Count vectors ([CountVectorizer scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), [HashingVectorizer scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html))
* TF-IDF ([TfidfVectorizer scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) or [TfidfTransformer scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html))
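
Two of the less familiar feature types from the list, character n-grams and skip-grams, are easy to produce by hand. The sketch below is a plain-Python illustration; the helper names and the exact skip-gram definition (ordered token pairs with at most `max_skip` tokens between them) are assumptions made for this example.

```python
def char_ngrams(word, n):
    """Return all substrings of length n, left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def skip_bigrams(tokens, max_skip):
    """Ordered token pairs with at most max_skip tokens skipped between them."""
    pairs = []
    for i, left in enumerate(tokens):
        last = min(i + 2 + max_skip, len(tokens))
        for j in range(i + 1, last):
            pairs.append((left, tokens[j]))
    return pairs

trigrams = char_ngrams('flood', 3)
pairs = skip_bigrams(['time', 'flies', 'like', 'an'], 1)
print(trigrams)
print(pairs)
```

With `max_skip=0` the skip-bigrams reduce to ordinary bigrams.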


## Preserving Word Order

Depending on your particular problem, you might need to care about the order of words in your input data.

* This can happen at the data preprocessing step, where you encode the order as part of the feature input,
<br>e.g., n-grams or specific features that correspond to the positions of the words.
<br>All those features are then sent to the methods of your choice.
* If you do not want to (or can't) pollute the input with this additional information, you might need to rely on methods that
<br>implicitly encode the order in the data as they read the input.
  * Hidden Markov Models ([HMM on wiki](https://en.wikipedia.org/wiki/Hidden_Markov_model), [HMM Fundamentals](http://cs229.stanford.edu/section/cs229-hmm.pdf)), Conditional Random Fields ([CRF](https://en.wikipedia.org/wiki/Conditional_random_field)), Recurrent neural networks, etc.
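
The first option (encoding order in the features) can be seen on a tiny example. The two sentences below are toy inputs (assumptions): they have the same bag of words but opposite meanings, and only order-aware features such as bigrams or (position, token) pairs tell them apart.

```python
from collections import Counter

def bigrams(tokens):
    """Pairs of adjacent tokens."""
    return list(zip(tokens, tokens[1:]))

def positional_features(tokens):
    """(position, token) pairs, a very simple way to encode order."""
    return list(enumerate(tokens))

first = 'dog bites man'.split()
second = 'man bites dog'.split()

same_bag = Counter(first) == Counter(second)
same_bigrams = bigrams(first) == bigrams(second)
same_positions = positional_features(first) == positional_features(second)
print(same_bag, same_bigrams, same_positions)
```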
