### Fintech_ML_Training_Wing_Session_6_Notebook_NLP
<hr>

## Learning Objectives
General objectives:

- define **corpus**, **documents** and **terms** as part of the study of Natural Language Processing

- define **tokenisation** as breaking a document into terms

- understand the definition of **root form** of a word for verbs and nouns

- identify **stemming** as a way to find the root form of a word

- list tags from a Part-of-Speech tagger related to verbs, nouns, prepositions, adverbs, adjectives and d

- interpret the output of a **Part-of-Speech tagger**

- define a **lemma** as a word that can be found in a dictionary

- identify **lemmatisation** as a way to find the root form of a word

- learn how to use **stop words** to filter out terms in a document that are not meaningful

- use **Jaccard Similarity** to find similar texts

- perform **sentiment analysis** using the `SentimentIntensityAnalyzer`



Guidelines:

- use `word_tokenize` from `nltk.tokenize` to break a document into a list of words

- use `PorterStemmer`'s `stem` from `nltk.stem` to perform stemming of words

- use `pos_tag` from `nltk` to perform Part-of-Speech (POS) tagging of a sentence

- given a word and its POS tag, use `WordNetLemmatizer`'s `lemmatize` from `nltk.stem` to find its corresponding lemma 

- retrieve a list of stopwords defined in `stopwords.words()` from `nltk.corpus`

- extend the existing implementation of Jaccard Similarity to find similar texts

- find additional corpora in the `nltk` library

- further understanding of `CountVectorizer`

### Datasets Required for this Self-Study
1. `songs-100.csv`

2. `loans-descs-1k.csv`

Adapted from Hackwagon DS102

#### Import Libraries

In [1]:
import pandas as pd
import nltk
import re

In [2]:
# Use this cell to download all the required corpora first. Then, comment out this block of code.
print("Downloading corpora...")
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
print("Corpora download complete.")

Downloading corpora...
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\simsh\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\simsh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\simsh\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\simsh\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Corpora download complete.


In [3]:
# If you are running this for the first time, use the previous cell to download all 
# the corpora before starting.
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

### Corpus, Documents and Terms

In linguistics (the study of language), a **corpus** is a collection of texts, each represented as a document. A **document** contains multiple words that, when strung together, produce meaning. Each word is called a **term**.

Consider the following corpus of 4 documents from the 100 Song Titles dataset:

In [4]:
song_titles = ['shape of you', 'paris', 'scared to be lonely', 
               'symphony feat zara larsson',]

Each <u>song title is a document</u>. The <u>collection of song titles forms the corpus</u>. The first document `shape of you` has <u>3 terms</u>. The second document `paris` has <u>1 term</u>. The third document `scared to be lonely` has <u>4 terms</u>.
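The term counts above can be checked with a few lines of Python. This is a minimal sketch that assumes terms are separated by whitespace, which holds for these titles but not for text containing punctuation. The name `term_counts` is introduced here purely for illustration.

```python
# Corpus: a list of documents (song titles).
song_titles = ['shape of you', 'paris', 'scared to be lonely',
               'symphony feat zara larsson']

# Split each document on whitespace to get its terms, then count them.
term_counts = {title: len(title.split()) for title in song_titles}

for title, n_terms in term_counts.items():
    print(f"{title!r}: {n_terms} term(s)")
```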

We now read from `songs-100.csv`, a CSV file which, in this case, is our corpus. There are $100$ documents in the song titles corpus.

In [5]:
# Read from 'songs-100.csv' into song_titles_df
#
song_titles_df = pd.read_csv('songs-100.csv')
# Use count() to find the number of documents in the corpus.
#
song_titles_df.count()

Unnamed: 0    100
name          100
dtype: int64

### Tokenisation using `nltk`

The Natural Language Toolkit or `nltk` library is a very powerful library used for natural language processing. We will be using `nltk.tokenize.word_tokenize()` to perform tokenisation.

In [6]:
loan_descs_df = pd.read_csv('loans-descs-1k.csv')

In [7]:
# Take the description of the loan with index label 4 as the raw text for processing
s1 = loan_descs_df.loc[4]['desc']
print(s1)

  Borrower added on 02/14/14 > I am consolidating credit card debt incurred over three years ago and having a concrete end in sight is more motivating.  I am eagerly striving towards becoming completely debt free.<br>


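Use `re.sub()` to replace the leading `Borrower added on <date> >` phrase and the trailing `<br>` tag with an empty string, then use `strip()` to remove the surrounding whitespace.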

In [8]:
print(s1)
r = r'Borrower added on \d+/\d+/\d+ >|<br>'  # raw string, so \d reaches the regex engine unchanged
s2 = re.sub(r, '', s1)
s2 = s2.strip()
print()
print(s2)

  Borrower added on 02/14/14 > I am consolidating credit card debt incurred over three years ago and having a concrete end in sight is more motivating.  I am eagerly striving towards becoming completely debt free.<br>

I am consolidating credit card debt incurred over three years ago and having a concrete end in sight is more motivating.  I am eagerly striving towards becoming completely debt free.


To tokenise a sentence, simply use `word_tokenize()` from the `nltk` library. This will convert the sentence into individual words AND special characters like full stops and commas.

In [9]:
# Use word_tokenize(string) to convert the string into a list of tokens.
# Assign this to a new variable called ts1
#
ts1 = word_tokenize(s1)
print(ts1)

['Borrower', 'added', 'on', '02/14/14', '>', 'I', 'am', 'consolidating', 'credit', 'card', 'debt', 'incurred', 'over', 'three', 'years', 'ago', 'and', 'having', 'a', 'concrete', 'end', 'in', 'sight', 'is', 'more', 'motivating', '.', 'I', 'am', 'eagerly', 'striving', 'towards', 'becoming', 'completely', 'debt', 'free.', '<', 'br', '>']
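To see roughly what a tokeniser does, the sketch below uses a single regular expression as a stand-in. It is not the algorithm behind `word_tokenize()` and it splits some inputs differently (dates and contractions, for example). `simple_tokenize` is a hypothetical helper, not part of `nltk`.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a word or a number; [^\w\s] matches one non-space symbol.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("I am eagerly striving towards becoming debt free."))
print(simple_tokenize("02/14/14"))
```

Unlike the output above, this sketch breaks the date `02/14/14` into five tokens, which shows why a purpose-built tokeniser is usually the better choice.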


### Linguistics - The root form of a word (verbs & nouns)
Stemming is one way to find the **root form of a word**. We limit our discussion to verbs (action words) and nouns (naming words). First, consider the following 4 sentences that use different forms of the word `watch`:

- `Larry watches television.` (singular present tense)

- `The children watch television.` (simple tense / plural, present tense)

- `My son is watching television.` (present participle tense / present continuous tense)

- `My mum watched television with me.` (past tense)

The word `watch` appears in 4 different <u>forms of the **verb**</u> because the sentences use different tenses. However, algorithms treat these forms as **separate words** during analysis. Hence, we need to find the root form of the verb, in this case `watch`, so that all four forms are treated as the same word, since they carry the same meaning.

`nltk` implements the **Porter Stemmer** and you can find the reference for the rules [here](http://www.nltk.org/howto/stem.html). Use `stemmer = PorterStemmer()` and then use the `stem()` method for each word to get its root form.

In [10]:
stemmer = PorterStemmer()

ss_verbs = ['Larry watches television.', 'The children watch television.', 
       'My son is watching television.', 'My mum watched television with me.']

for s in ss_verbs:
    for st in word_tokenize(s):
        print(stemmer.stem(st))
    print()

larri
watch
televis
.

the
children
watch
televis
.

My
son
is
watch
televis
.

My
mum
watch
televis
with
me
.



Now consider the next 2 sentences:

- `This is a very expensive vase.` (singular noun)

- `The third floor in this mall sells vases.` (plural noun)

Similarly, we need to find the root <u>form of the **noun**</u>, in this case `vase`. Although only differing in one letter, the ending `s`, algorithms treat them as distinct words. Hence, we need to find the root form of the noun so they can be treated as the same word as they refer to the same object in real life.

In [11]:
ss_nouns = ['This is a very expensive vase.', 
            'The third floor in this mall sells vases.', ]
# Exercise: Iterate through the list of sentences. Tokenise each sentence using word_tokenize().
# Then for every term, print out its stemmed form using stemmer.stem(term)
#
for s in ss_nouns:
    for st in word_tokenize(s):
        print(stemmer.stem(st))
    print()

thi
is
a
veri
expens
vase
.

the
third
floor
in
thi
mall
sell
vase
.



### Stemming

Now, we apply the stemming step on the initial loans sentence.

In [12]:
# Just to refresh our memory...
print(ts1)

['Borrower', 'added', 'on', '02/14/14', '>', 'I', 'am', 'consolidating', 'credit', 'card', 'debt', 'incurred', 'over', 'three', 'years', 'ago', 'and', 'having', 'a', 'concrete', 'end', 'in', 'sight', 'is', 'more', 'motivating', '.', 'I', 'am', 'eagerly', 'striving', 'towards', 'becoming', 'completely', 'debt', 'free.', '<', 'br', '>']


In [13]:
# Instantiate a stemmer
#
stemmer = PorterStemmer()
stemmed_words = []
for t in ts1:
    # Use stemmer.stem() to find the root form of the word
    #
    u = stemmer.stem(t)
    stemmed_words.append(u)
print(stemmed_words)

['borrow', 'ad', 'on', '02/14/14', '>', 'I', 'am', 'consolid', 'credit', 'card', 'debt', 'incur', 'over', 'three', 'year', 'ago', 'and', 'have', 'a', 'concret', 'end', 'in', 'sight', 'is', 'more', 'motiv', '.', 'I', 'am', 'eagerli', 'strive', 'toward', 'becom', 'complet', 'debt', 'free.', '<', 'br', '>']


Notice that some stemmed words are not valid English words. For example, `consolid`, `motiv` and `eagerli` are not English words. However, stemming is a relatively simple and fast algorithm, and many applications only need different forms of a word to map to the same string, whether or not that string is a real word. Search engines are one example: both the search term and the text can be stemmed, so that a query for `watching` also matches `watches`.
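The search-engine idea can be sketched without any library. The stemmer below is a deliberately crude toy that strips one common suffix. It is not the Porter algorithm, and both `crude_stem` and `matches` are hypothetical names used only for this illustration.

```python
def crude_stem(word):
    """Toy stemmer: lowercase, then strip one common suffix. Not Porter."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words such as 'is' survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def matches(query, document):
    """True if any stemmed query term appears among the stemmed document terms."""
    doc_stems = {crude_stem(term) for term in document.split()}
    return any(crude_stem(term) in doc_stems for term in query.split())

print(matches("watching", "Larry watches television"))
print(matches("watching", "Larry sells vases"))
```

Because the query and the document pass through the same stemmer, it does not matter that a stem such as `vas` is not a real word.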

The following shows the original form of the sentence and the result after stemming for easy comparison.

In [14]:
print("%15s   %15s   " % ("Raw", "Stemming"))
print("%15s-- %15s" % ("------------", "------------"))
# zip() pairs each raw token with its stem and covers every token, including the last.
for raw, stem in zip(ts1, stemmed_words):
    print("%15s   %15s" % (raw, stem))

            Raw          Stemming   
   --------------    ------------
       Borrower            borrow
          added                ad
             on                on
       02/14/14          02/14/14
              >                 >
              I                 I
             am                am
  consolidating          consolid
         credit            credit
           card              card
           debt              debt
       incurred             incur
           over              over
          three             three
          years              year
            ago               ago
            and               and
         having              have
              a                 a
       concrete           concret
            end               end
             in                in
          sight             sight
             is                is
           more              more
     motivating             motiv
              .                 .
           

We will consider another algorithm to find the root form of a word, called **Lemmatisation**. Before we start talking about Lemmatisation, we need to first understand **Part-of-Speech (POS) Tagging**. 

### Part-of-Speech (POS) Tagging

POS tagging is a way to assign a word to its grammatical **class**, such as noun, verb or adjective. Commonly, a word and tag pair is represented as a tuple. We will use the tag set of the *Penn Treebank Project*, one of `nltk`'s tagged corpora, to help us tag newly discovered words. To find the POS tags of a tokenised sentence, use `nltk.pos_tag(word_tokens)`.

Notice that the resulting value consists of many tuples. The first element in the tuple is the original word from the sentence and the second element is the assigned POS tag. Refer to this [link](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) to understand the meaning of each tag.

In [15]:
tagged_words_by_treebank = nltk.pos_tag(ts1)
print(tagged_words_by_treebank)
# Tags to be aware of: PRP, VBP, VBG, NN, VBN, NNS+

[('Borrower', 'NNP'), ('added', 'VBD'), ('on', 'IN'), ('02/14/14', 'CD'), ('>', 'NN'), ('I', 'PRP'), ('am', 'VBP'), ('consolidating', 'VBG'), ('credit', 'NN'), ('card', 'NN'), ('debt', 'NN'), ('incurred', 'VBN'), ('over', 'IN'), ('three', 'CD'), ('years', 'NNS'), ('ago', 'RB'), ('and', 'CC'), ('having', 'VBG'), ('a', 'DT'), ('concrete', 'JJ'), ('end', 'NN'), ('in', 'IN'), ('sight', 'NN'), ('is', 'VBZ'), ('more', 'RBR'), ('motivating', 'JJ'), ('.', '.'), ('I', 'PRP'), ('am', 'VBP'), ('eagerly', 'RB'), ('striving', 'JJ'), ('towards', 'NNS'), ('becoming', 'VBG'), ('completely', 'RB'), ('debt', 'NN'), ('free.', 'NN'), ('<', 'NNP'), ('br', 'NN'), ('>', 'NN')]
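Since each element is a `(word, tag)` tuple, the tagged list is easy to filter. The sketch below copies a few pairs from the output above (so that it runs without the tagger) and keeps only the nouns or only the verbs. The variable names are illustrative.

```python
# A few (word, tag) pairs copied from the tagger output above.
tagged = [('I', 'PRP'), ('am', 'VBP'), ('consolidating', 'VBG'),
          ('credit', 'NN'), ('card', 'NN'), ('debt', 'NN'),
          ('incurred', 'VBN'), ('over', 'IN'), ('three', 'CD'),
          ('years', 'NNS')]

# Each pair unpacks into the word and its Penn Treebank tag.
nouns = [word for word, tag in tagged if tag.startswith('NN')]
verbs = [word for word, tag in tagged if tag.startswith('VB')]

print(nouns)
print(verbs)
```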


In [16]:
tagged_words_by_treebank2 = nltk.pos_tag(word_tokenize("Mary leaves the room"))
# Print the tags for this sentence, where 'leaves' is used as a verb.
print(tagged_words_by_treebank2)


In [17]:
tagged_words_by_treebank3 = nltk.pos_tag(word_tokenize("Dew drops fall from the leaves"))
# Print the tags for this sentence, where 'leaves' is used as a noun.
print(tagged_words_by_treebank3)


Notice that tags sharing the same first letter belong to similar classes. In particular,

- the pattern `N[A-Z]+` represents nouns and

- the pattern `V[A-Z]+` represents verbs

Hence, we can take the first character and convert it to lower case. This first letter of the tag will be used for Lemmatisation.

In [18]:
tagged_words = []
for twt in tagged_words_by_treebank:
    # Get the first element of the tuple, and the first letter of the second element
    # of the tuple.
    tagged_words.append((twt[0], twt[1][0].lower()))
print(tagged_words)

[('Borrower', 'n'), ('added', 'v'), ('on', 'i'), ('02/14/14', 'c'), ('>', 'n'), ('I', 'p'), ('am', 'v'), ('consolidating', 'v'), ('credit', 'n'), ('card', 'n'), ('debt', 'n'), ('incurred', 'v'), ('over', 'i'), ('three', 'c'), ('years', 'n'), ('ago', 'r'), ('and', 'c'), ('having', 'v'), ('a', 'd'), ('concrete', 'j'), ('end', 'n'), ('in', 'i'), ('sight', 'n'), ('is', 'v'), ('more', 'r'), ('motivating', 'j'), ('.', '.'), ('I', 'p'), ('am', 'v'), ('eagerly', 'r'), ('striving', 'j'), ('towards', 'n'), ('becoming', 'v'), ('completely', 'r'), ('debt', 'n'), ('free.', 'n'), ('<', 'n'), ('br', 'n'), ('>', 'n')]


### Lemmatisation

A **lemma** is a word found in the dictionary. Hence, you can think of it as the root form of a word. Given a word and its corresponding tag, we can find the word's root form in English, which is easier for humans to interpret than a stem. Use `WordNetLemmatizer.lemmatize(term, pos=tag)` to find the root form of the word given the source word and its associated POS tag.

Note that if the POS tag cannot be found, a `KeyError` will be thrown. For example, the pronoun `I` with the tag `p` gives this result:

In [19]:
from nltk.stem import WordNetLemmatizer

# Uncomment the two lines below to see the KeyError.
# lemmatiser = WordNetLemmatizer()
# lemmatiser.lemmatize('I', pos='p')

Hence, wrap the lemmatisation step in a `try...except` block so the program continues even if a `KeyError` is encountered.

In [20]:
lemmatiser = WordNetLemmatizer()
lemmed_words = []
for tw_pair in tagged_words:
    tw_word, tw_tag = tw_pair[0], tw_pair[1]
    lemm_word = tw_word
    try:
        lemm_word = lemmatiser.lemmatize(tw_word, pos=tw_tag)
    except KeyError:
        # If an error is thrown, then the word is assumed to be in its root form.
        print("KeyError: " + tw_word)
        pass

    lemmed_words.append(lemm_word)
print(lemmed_words)

KeyError: on
KeyError: 02/14/14
KeyError: I
KeyError: over
KeyError: three
KeyError: and
KeyError: a
KeyError: concrete
KeyError: in
KeyError: motivating
KeyError: .
KeyError: I
KeyError: striving
['Borrower', 'add', 'on', '02/14/14', '>', 'I', 'be', 'consolidate', 'credit', 'card', 'debt', 'incur', 'over', 'three', 'year', 'ago', 'and', 'have', 'a', 'concrete', 'end', 'in', 'sight', 'be', 'more', 'motivating', '.', 'I', 'be', 'eagerly', 'striving', 'towards', 'become', 'completely', 'debt', 'free.', '<', 'br', '>']
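Several of the `KeyError` lines above (`concrete`, `motivating`, `striving`) come from adjectives, which the first-letter shortcut labels `j`. WordNet uses the letters `n` (noun), `v` (verb), `a` (adjective) and `r` (adverb), so an explicit mapping avoids those errors. The following is a minimal sketch of such a mapping. `penn_to_wordnet` is a hypothetical helper, not an `nltk` function.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if there is no match."""
    if tag.startswith('J'):
        return 'a'   # adjective (JJ, JJR, JJS)
    if tag.startswith('V'):
        return 'v'   # verb (VB, VBD, VBG, ...)
    if tag.startswith('N'):
        return 'n'   # noun (NN, NNS, NNP, ...)
    if tag.startswith('RB'):
        return 'r'   # adverb (RB, RBR, RBS)
    return None      # e.g. IN, CD, PRP, DT have no WordNet equivalent

for tag in ['JJ', 'VBG', 'NNS', 'RB', 'IN', 'PRP']:
    print(tag, '->', penn_to_wordnet(tag))
```

When the helper returns `None`, one reasonable choice is to skip lemmatisation and keep the word unchanged, which mirrors what the `except KeyError` branch does.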


The following shows the original form of the sentence, the result after stemming and the result after lemmatisation for easy comparison.

In [21]:
print("%15s   %15s   %15s" % ("Raw", "Stemming", "Lemmatisation"))
print("%15s-- %15s-- %15s" % ("------------", "------------", "--------------"))
# zip() walks the three lists in step and includes the final token.
for raw, stem, lemma in zip(ts1, stemmed_words, lemmed_words):
    print("%15s   %15s   %15s" % (raw, stem, lemma))

            Raw          Stemming     Lemmatisation
   --------------    --------------  --------------
       Borrower            borrow          Borrower
          added                ad               add
             on                on                on
       02/14/14          02/14/14          02/14/14
              >                 >                 >
              I                 I                 I
             am                am                be
  consolidating          consolid       consolidate
         credit            credit            credit
           card              card              card
           debt              debt              debt
       incurred             incur             incur
           over              over              over
          three             three             three
          years              year              year
            ago               ago               ago
            and               and               and
         hav

### Removing Stop Words

Finally, before performing analysis, remove **stop words** from the sentence. A stop word is a word that appears in almost every text, such as `the`, `is` or `and`, and hence carries little meaning for analysis. In signal processing language, this is referred to as <u>noise</u>. Refer to this [Github link](https://gist.github.com/sebleier/554280) for the list of stop words from `nltk`. `nltk.corpus.stopwords.words()` returns the list of stop words; if a term appears in this list, ignore it.

Recall that 

```python
    word in wordlist
``` 
is used to check if a word is in a list. It returns `True` if the word is found and `False` otherwise.

In [22]:
# Build the stop word set once instead of reloading the list for every term.
english_stop_words = set(stopwords.words('english'))
final_list_of_words = []
for term in stemmed_words:
    # If the term is not a stop word, append it to final_list_of_words.
    if term not in english_stop_words:
        final_list_of_words.append(term)
print(stemmed_words)
print()
print(final_list_of_words)
# Number of stop words removed
len(stemmed_words) - len(final_list_of_words)

['borrow', 'ad', 'on', '02/14/14', '>', 'I', 'am', 'consolid', 'credit', 'card', 'debt', 'incur', 'over', 'three', 'year', 'ago', 'and', 'have', 'a', 'concret', 'end', 'in', 'sight', 'is', 'more', 'motiv', '.', 'I', 'am', 'eagerli', 'strive', 'toward', 'becom', 'complet', 'debt', 'free.', '<', 'br', '>']

['borrow', 'ad', '02/14/14', '>', 'I', 'consolid', 'credit', 'card', 'debt', 'incur', 'three', 'year', 'ago', 'concret', 'end', 'sight', 'motiv', '.', 'I', 'eagerli', 'strive', 'toward', 'becom', 'complet', 'debt', 'free.', '<', 'br', '>']


10
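Notice that `I` survives the filter above: the comparison is case-sensitive and the entries in NLTK's English stop word list are lowercase. The sketch below lowercases each term before the check. The stop words here are a small hand-picked subset, an assumption made so the example runs without any download.

```python
# Hand-picked subset standing in for stopwords.words('english').
stop_words = {'i', 'am', 'on', 'and', 'a', 'in', 'is', 'over', 'more'}

terms = ['I', 'am', 'consolid', 'credit', 'card', 'debt', 'and', 'I', 'strive']

# Lowercase before the membership test so 'I' is recognised as a stop word.
filtered = [t for t in terms if t.lower() not in stop_words]
print(filtered)
```

Storing the stop words in a `set` also makes each `in` check fast, which matters once the corpus grows beyond a single sentence.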

## Summary of pipeline 
- Convert the sentence to lowercase
- Remove all special characters according to the pattern `[.®'&$’\"\-()]`
- Perform tokenisation, followed by stemming of your selected sentence
- Remove stop words from the list of stemmed words

Note: You can use `'''` to specify a multi-line string. 

In [23]:
s2 = '''I really need to consolidate my credit card debt so that I can become debt free. 
The interest is killing me and I'm just not getting anywhere with the balances. Help!'''

s3 = '''Hello, I just closed on the house of my dreams and I would like to 
use this loan to pay off my high interest credit cards and build a deck on my home.'''

In [24]:
# Step 1: Convert to lower case using lower(). We process s3 here.
#
s2_1 = s3.lower()
print(s2_1)

hello, i just closed on the house of my dreams and i would like to 
use this loan to pay off my high interest credit cards and build a deck on my home.


In [25]:
# Step 2: Perform regex substitution to remove special characters.
#
s2_2 = re.sub(r"[.®'&$’\"\-()]", " ", s2_1)  # raw string keeps the \- escape intact for the regex engine
print(s2_2)

hello, i just closed on the house of my dreams and i would like to 
use this loan to pay off my high interest credit cards and build a deck on my home 


In [26]:
# Step 3: use word_tokenize() to tokenise the sentence and get a list of terms.
#
s2_3 = nltk.word_tokenize(s2_2)
print(s2_3)

['hello', ',', 'i', 'just', 'closed', 'on', 'the', 'house', 'of', 'my', 'dreams', 'and', 'i', 'would', 'like', 'to', 'use', 'this', 'loan', 'to', 'pay', 'off', 'my', 'high', 'interest', 'credit', 'cards', 'and', 'build', 'a', 'deck', 'on', 'my', 'home']


In [27]:
# Step 4: Use stemmer.stem() to get the list of stemmed terms.
#
st = PorterStemmer()
s2_4 = [st.stem(s) for s in s2_3]
print(s2_4)

['hello', ',', 'i', 'just', 'close', 'on', 'the', 'hous', 'of', 'my', 'dream', 'and', 'i', 'would', 'like', 'to', 'use', 'thi', 'loan', 'to', 'pay', 'off', 'my', 'high', 'interest', 'credit', 'card', 'and', 'build', 'a', 'deck', 'on', 'my', 'home']


In [28]:
# Step 5: Remove stop words. Remove the word if it appears in stopwords.words('english')
#
s2_5 = [s for s in s2_4 if s not in stopwords.words('english')]

In [29]:
# Finally, print() the sentence.
#
print(s2_5)

['hello', ',', 'close', 'hous', 'dream', 'would', 'like', 'use', 'thi', 'loan', 'pay', 'high', 'interest', 'credit', 'card', 'build', 'deck', 'home']
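The five steps can be gathered into one reusable function. This is a sketch under stated assumptions: `preprocess` is a hypothetical name, and the tokeniser, stemmer and stop words are passed in as parameters with plain-Python defaults so that the block runs without `nltk`.

```python
import re

SPECIAL_CHARS = r"[.®'&$’\"\-()]"  # same pattern as in the pipeline summary

def preprocess(text, tokenize=str.split, stem=lambda term: term,
               stop_words=frozenset()):
    """Lowercase, strip special characters, tokenise, stem, drop stop words."""
    lowered = text.lower()                          # Step 1
    cleaned = re.sub(SPECIAL_CHARS, " ", lowered)   # Step 2
    tokens = tokenize(cleaned)                      # Step 3
    stems = [stem(token) for token in tokens]       # Step 4
    return [s for s in stems if s not in stop_words]  # Step 5

# With the defaults: whitespace tokenisation, no stemming, a tiny stop list.
print(preprocess("I'm paying off my cards.", stop_words={'i', 'm', 'my', 'off'}))
```

To reproduce the notebook's pipeline, pass `tokenize=word_tokenize`, `stem=PorterStemmer().stem` and `stop_words=set(stopwords.words('english'))`.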


In [30]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
import re

#If you are running this for the first time, use the next cell to download all the corpora first
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

In [31]:
# Download the VADER list of words / lexicon
# nltk.download('vader_lexicon')

### Jaccard Similarity

Jaccard Similarity is used to show how similar two documents are. Given two documents, $A$ and $B$, each treated as a set of unique words, the Jaccard Similarity Score is calculated as:

$$
\text{Jaccard Similarity Score} = \frac{|A \cap B|}{|A \cup B|}
$$

Simply put, the numerator is the number of unique words that **are common across both documents** and the denominator is the **total number of unique words found in either document**. A score of $0$ means the documents share no words, and a score of $1$ means they use exactly the same words.

The function below, `calculate_jaccard_score`, returns the similarity score of two documents, `d1` and `d2`. It converts each document into a `set`, so duplicate terms are dropped before counting. The documentation for sets and other data structures can be found [here](https://docs.python.org/3/tutorial/datastructures.html). Also, find out more about how multiple variables can be assigned in the same line [here](https://docs.python.org/3.6/tutorial/datastructures.html#tuples-and-sequences).

In [32]:
def calculate_jaccard_score(d1, d2):
    set_a, set_b = set(d1), set(d2)
    return len(set_a & set_b) / len(set_a | set_b)

In [33]:
s1 = 'I would like to consolidate a few of my higher interest rate credit cards.'.split()
s2 = 'this loan is to consolidate credit card debt and pay of debt'.split()
s3 = 'card'
s4 = 'Cards'
print(calculate_jaccard_score(s1, s2))
print(calculate_jaccard_score(s3, s4))
set(s3)

0.19047619047619047
0.5


{'a', 'c', 'd', 'r'}
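The `set(s3)` output shows that passing a plain string makes the function compare characters rather than terms. The sketch below contrasts the two cases and extends the idea to find the most similar text in a small corpus. The guard against an empty union, the helper `most_similar` and the three sample documents are all additions made here for illustration.

```python
def calculate_jaccard_score(d1, d2):
    set_a, set_b = set(d1), set(d2)
    union = set_a | set_b
    # Two empty documents share nothing; return 0.0 rather than divide by zero.
    return len(set_a & set_b) / len(union) if union else 0.0

def most_similar(query, corpus):
    """Return (score, document) for the corpus document closest to the query."""
    query_terms = query.lower().split()
    scored = [(calculate_jaccard_score(query_terms, doc.lower().split()), doc)
              for doc in corpus]
    return max(scored)

# Strings are compared character by character; lists are compared term by term.
print(calculate_jaccard_score('card', 'Cards'))      # characters
print(calculate_jaccard_score(['card'], ['Cards']))  # terms

corpus = ['consolidate credit card debt',
          'build a deck on my home',
          'pay off my credit cards']
print(most_similar('credit card debt consolidation', corpus))
```

Note that `card` and `cards` still count as different terms here, which is why stemming or lemmatising both documents first usually gives a more useful score.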

In [34]:
!pip install fuzzywuzzy
!pip install python-Levenshtein

from fuzzywuzzy import fuzz 
from fuzzywuzzy import process 



FuzzyWuzzy is a Python library for fuzzy string matching, that is, finding strings that approximately (rather than exactly) match a given pattern. It uses Levenshtein Distance, the number of single-character edits needed to turn one string into another, to calculate the differences between sequences.

In [35]:
fuzz.WRatio('card', 'Cards') 

89
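To make Levenshtein Distance concrete, here is a plain-Python sketch of the classic dynamic-programming calculation. It is not FuzzyWuzzy's implementation, and `WRatio` reports a 0 to 100 similarity score built from several ratios rather than this raw distance. The function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    # previous[j]: distance between the first i-1 characters of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]

print(levenshtein('card', 'Cards'))
print(levenshtein('card', 'cards'))
```

The first pair needs two edits (change `c` to `C`, add `s`), while the second needs only one, so lowercasing before matching makes the strings closer.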

### Sentiment Analysis with `nltk`

The `nltk` library has a sentiment analyser. It uses the VADER method, or **Valence Aware Dictionary and sEntiment Reasoner**. It is a lexicon (vocabulary) of words and their relative sentiment strength. For example:

- `Good` has a positive but weak score, while `Excellent` scores more
- `Bad` has a negative but weak score, while `Tragedy` scores more

Use `sid.polarity_scores(t)` to find the sentiment of a text. Then, use the `compound` value to determine the overall score. Note that `compound` gives a (normalised) value from $-1$ to $1$: a positive number indicates positive sentiment while a negative number indicates negative sentiment.

In [36]:
sid = SentimentIntensityAnalyzer()

Observe how the sentiment scores change based on the sentiment of a movie review.

In [37]:
# This is an example of a positive review, showing positive sentiment.
review_1 = """I thoroughly enjoyed this movie because there was a genuine sincerity in the acting."""
ss = sid.polarity_scores(review_1)
print(ss)
print(ss['compound'])

{'neg': 0.0, 'neu': 0.754, 'pos': 0.246, 'compound': 0.5563}
0.5563


In [38]:
review_2 = "I found it really boring and silly."
ss2 = sid.polarity_scores(review_2)
print(ss2)
print(ss2['compound'])
# This is an example of a negative review.

{'neg': 0.326, 'neu': 0.503, 'pos': 0.171, 'compound': -0.3025}
-0.3025


In [39]:
review_3 = "My personal favorite horror film."
ss3 = sid.polarity_scores(review_3)
print(ss3)
print(ss3['compound'])

{'neg': 0.381, 'neu': 0.309, 'pos': 0.309, 'compound': -0.1779}
-0.1779
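The `compound` value can be turned into a label with a simple rule. The sketch below uses a cut-off of 0.05 on either side of zero, a commonly used convention that should be treated as an assumption and tuned for your own data. `label_sentiment` is a hypothetical helper.

```python
def label_sentiment(compound, threshold=0.05):
    """Turn a compound score in [-1, 1] into a coarse sentiment label."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# Compound scores taken from the three reviews above.
for score in (0.5563, -0.3025, -0.1779):
    print(score, label_sentiment(score))
```

The third review, "My personal favorite horror film.", is labelled negative even though a human would read it as praise, presumably because the lexicon scores `horror` negatively. Lexicon-based scores are worth spot-checking on domain-specific text.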


### Convert a Document to a Vector using `CountVectorizer`

The following are the words in the corpus in the Naïve Bayes Classification example:

In [41]:
docs_as_s = ['enjoy like', 
             'enjoy funny happy', 
             'hate boring like', 
             'like happy', 
             'boring dull']

`fit_transform()` first learns the vocabulary from the corpus (the fit step) and then converts each document into a vector of term counts (the transform step).

In [42]:
count_vectorizer = CountVectorizer()
ft = count_vectorizer.fit_transform(docs_as_s)

First, let's see the vocabulary. Every unique term in the corpus is assigned a position in the dictionary. 

In [48]:
count_vectorizer.vocabulary_
#["boring","dull","enjoy","funny","happy","hate","like"]
#"enjoy like"
#[0,0,1,0,0,0,1]

{'enjoy': 2,
 'like': 6,
 'funny': 3,
 'happy': 4,
 'hate': 5,
 'boring': 0,
 'dull': 1}

The positions are useful when finding if the word exists in the particular document. For example, if the column with index `2` has a value greater than `0` then the document will contain the word `enjoy`. 

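**NOTE:** The dictionary above is printed in the order in which the terms were first encountered, but the positions themselves follow the alphabetical order of the terms (`boring` is `0`, `dull` is `1`, and so on).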

In [44]:
ft.toarray()  # dense view of the sparse count matrix

array([[0, 0, 1, 0, 0, 0, 1],
       [0, 0, 1, 1, 1, 0, 0],
       [1, 0, 0, 0, 0, 1, 1],
       [0, 0, 0, 0, 1, 0, 1],
       [1, 1, 0, 0, 0, 0, 0]], dtype=int64)
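To show where these numbers come from, the sketch below rebuilds the same matrix in plain Python. It is a hand-rolled illustration that assumes whitespace-separated lowercase terms, and it skips the tokenisation rules and options that `CountVectorizer` applies. `vocabulary`, `to_vector` and `matrix` are names introduced here.

```python
docs_as_s = ['enjoy like', 'enjoy funny happy', 'hate boring like',
             'like happy', 'boring dull']

# Give every unique term a column index, in alphabetical order.
unique_terms = sorted({term for doc in docs_as_s for term in doc.split()})
vocabulary = {term: index for index, term in enumerate(unique_terms)}

def to_vector(doc):
    """Count how often each vocabulary term occurs in the document."""
    vector = [0] * len(vocabulary)
    for term in doc.split():
        vector[vocabulary[term]] += 1
    return vector

matrix = [to_vector(doc) for doc in docs_as_s]
for row in matrix:
    print(row)
```

Each row is one document and each column is one term, so column `2` (`enjoy`) is non-zero only for the first two documents.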

### Further Exploration - `nltk.corpus`

`nltk.corpus` has many corpora (plural form of corpus) to allow you to download text to play with. The following are 2 more contemporary corpora. Before you access the corpus, ensure that you use `nltk.download()` to download the corpus to your local machine.

<div class="alert alert-info">The Brown corpus is curated by Brown University. It has 1 million words in English and contains text from 500 sources. Each source is categorised into a genre.</div>

In [45]:
from nltk.corpus import brown
try:
    #Use corpus.categories() to show all category tags of each document.
    print(brown.categories())
except LookupError:
    print("Downloading brown...")    
    nltk.download('brown')
    print("Download brown complete")

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']


<div class="alert alert-info">The Reuters Corpus contains 10,788 news documents totaling 1.3 million words. The documents have been classified into 90 topics.</div>

In [46]:
from nltk.corpus import reuters
try:
    #Use corpus.words() to show all words that appear in the corpus.
    print(reuters.words()[:20])
except LookupError:
    print("Downloading reuters...")      
    nltk.download('reuters')
    print("Download reuters complete")    

['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', '.', 'S', '.-', 'JAPAN', 'RIFT', 'Mounting', 'trade', 'friction', 'between', 'the', 'U', '.', 'S', '.']


**Credits**
- Hackwagon
- [sebleier](https://gist.github.com/sebleier/554280)
- [Kaggle (Billboard 1964-2015 Songs + Lyrics)](https://www.kaggle.com/rakannimer/billboard-lyrics)
- [Kaggle (Bag of Words Meets Bags of Popcorn)](https://www.kaggle.com/c/word2vec-nlp-tutorial/data)
- [Kaggle (Top tracks of 2017)](https://www.kaggle.com/nadintamer/top-tracks-of-2017)
- [Kaggle (Lending club loan data)](https://www.kaggle.com/wendykan/lending-club-loan-data)

**Footnote**

(1) : The reviews are partially processed. Only removal of special characters was performed. The remaining steps to be performed are stemming and removal of stop words.