## Text Mining and NLP

## Part 2

### Situation:

Priya works in the Europe division of an international PR firm. The firm's largest client has offices in Ibiza, Madrid, and Las Palmas. She needs to keep her boss informed of current events by providing a short weekly list of articles about political events in Spain. Reviewing articles on the BBC takes hours every week, and Priya is very busy! She wonders whether text mining could automate the process and save her time.

### **Goal**: to internalize the steps, challenges, and methodology of text mining
- explore text analysis by hand
- apply text mining steps in Jupyter with the Python library NLTK
- classify documents correctly

## Refresher on cleaning text
![gif](https://www.nyfa.edu/student-resources/wp-content/uploads/2014/10/furious-crazed-typing.gif)


In [4]:
from __future__ import print_function


import nltk
import sklearn

from nltk.collocations import *
from nltk import FreqDist, word_tokenize
import string, re
import urllib.request  # the request submodule must be imported explicitly in Python 3
from nltk.stem.snowball import SnowballStemmer

url_a = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/A.txt"
url_b = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/B.txt"
article_a = urllib.request.urlopen(url_a).read()
article_a_st = article_a.decode("utf-8")


In [5]:
# tokens
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
arta_tokens_raw = nltk.regexp_tokenize(article_a_st, pattern)

# lower case
arta_tokens = [i.lower() for i in arta_tokens_raw]

# stop words
from nltk.corpus import stopwords

# a set makes the membership test below fast
stop_words = set(stopwords.words('english'))
arta_tokens_stopped = [w for w in arta_tokens if w not in stop_words]

# stem words
stemmer = SnowballStemmer("english")
arta_stemmed = [stemmer.stem(word) for word in arta_tokens_stopped]

In [6]:
# repeat w second article
article_b = urllib.request.urlopen(url_b).read()
article_b_st = article_b.decode("utf-8")
artb_tokens_raw = nltk.regexp_tokenize(article_b_st, pattern)
artb_tokens = [i.lower() for i in artb_tokens_raw]
artb_tokens_stopped = [w for w in artb_tokens if w not in stop_words]
artb_stemmed = [stemmer.stem(word) for word in artb_tokens_stopped]
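
Because the same steps are applied to every article, they can be wrapped in one function. The sketch below is a standard-library illustration only: `clean_text` and `TOY_STOP_WORDS` are hypothetical names, the stop list is a tiny stand-in for NLTK's English list, and stemming is left out.

```python
import re

# Same token pattern as above: letters, optionally followed by an apostrophe suffix
PATTERN = r"[a-zA-Z]+(?:'[a-z]+)?"

# Tiny illustrative stop list (assumption); the notebook uses stopwords.words("english")
TOY_STOP_WORDS = {"the", "a", "in", "of", "and", "to", "is"}

def clean_text(text, stop_words=TOY_STOP_WORDS):
    """Tokenize, lower-case, and drop stop words (no stemming in this sketch)."""
    tokens = re.findall(PATTERN, text)
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in stop_words]

sample = "The Prime Minister of Spain didn't resign in 2019."
print(clean_text(sample))
```

Digits and punctuation never match the pattern, so `2019` and the final period drop out at the tokenizing step.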

### Document statistics

What's wrong with the table from yesterday? What does it not consider?


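### Term Frequency (TF)

$\begin{align}
 tf_{i,j} = \dfrac{n_{i,j}}{\displaystyle \sum_k n_{k,j} }
\end{align} $

Here $n_{i,j}$ is the number of times term $i$ occurs in document $j$, and the denominator is the total number of terms in document $j$.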
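
As a minimal sketch of term frequency (count of a term divided by the total number of terms in the document), assuming the document is already a list of cleaned tokens. The function name and the toy tokens are illustrative, not part of the lesson's data.

```python
from collections import Counter

def term_frequencies(tokens):
    """Return {term: count / total number of tokens} for one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: n / total for term, n in counts.items()}

toy_doc = ["vote", "parliament", "vote", "budget"]  # hypothetical stemmed tokens
tf = term_frequencies(toy_doc)
print(tf)
```

The frequencies of one document always sum to 1, which makes documents of different lengths comparable.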

### Inverse Document Frequency (IDF)

$\begin{align}
idf_i = \log \dfrac{N}{df_i}
\end{align} $
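
A small sketch of inverse document frequency, assuming each document is a list of cleaned tokens. The formula does not fix a log base; base 10 is used here as an assumption. The function name and the toy documents are illustrative.

```python
import math

def inverse_document_frequencies(docs):
    """docs: list of token lists. Returns {term: log10(N / df)}."""
    n_docs = len(docs)
    df = {}
    for tokens in docs:
        for term in set(tokens):  # count each document at most once per term
            df[term] = df.get(term, 0) + 1
    return {term: math.log10(n_docs / count) for term, count in df.items()}

toy_docs = [["vote", "budget"], ["vote", "softwar"]]  # hypothetical stemmed tokens
idf = inverse_document_frequencies(toy_docs)
print(idf)
```

A term that occurs in every document gets an IDF of 0, so it carries no weight for telling documents apart.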

### TF-IDF score

$ \begin{align}
w_{i,j} = tf_{i,j} \times \log \dfrac{N}{df_i} \\
tf_{i,j} = \text{frequency of term } i \text{ in document } j \\
df_i = \text{number of documents containing } i \\
N = \text{total number of documents}
\end{align} $
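
A worked example with made-up numbers may help. Assume a corpus of $N = 2$ documents, a term that occurs 3 times in a 100-term document and appears in that document only, and a base-10 logarithm (the formula itself does not fix the base):

$ \begin{align}
tf_{i,j} = \dfrac{3}{100} = 0.03 \\
\log_{10} \dfrac{N}{df_i} = \log_{10} \dfrac{2}{1} \approx 0.3010 \\
w_{i,j} \approx 0.03 \times 0.3010 \approx 0.0090
\end{align} $

A term that appears in both documents has $\log_{10} \dfrac{2}{2} = 0$, so its score is 0 no matter how often it occurs.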


### The from scratch method
![homemade](https://media2.giphy.com/media/LBZcXdG0eVBdK/giphy.gif?cid=3640f6095c2d7bb2526a424a4d97117c)


Please go through the code and comment what each section does

In [9]:
# Build the vocabulary: the set of unique stemmed words across both articles
wordSet = set(arta_stemmed).union(set(artb_stemmed))

{'abstain',
 'achiev',
 'action',
 'adopt',
 'affair',
 'affect',
 'agre',
 'agreement',
 'also',
 'amazon',
 'analysi',
 'announc',
 'back',
 'bank',
 'base',
 'begun',
 'believ',
 'biggest',
 'bn',
 'bring',
 'britain',
 'briton',
 'brown',
 'busi',
 'canada',
 'chair',
 'chanc',
 'chancellor',
 'click',
 'com',
 'come',
 'commiss',
 'committe',
 'compani',
 'complet',
 'comput',
 'concern',
 'controversi',
 'could',
 'countri',
 'court',
 'creditor',
 'critic',
 'current',
 'dead',
 'deal',
 'debat',
 'debt',
 'decis',
 'develop',
 'direct',
 'disast',
 'draft',
 'earlier',
 'effect',
 'eu',
 'europ',
 'european',
 'even',
 'exampl',
 'expect',
 'face',
 'fail',
 'favour',
 'fear',
 'field',
 'fight',
 'final',
 'financ',
 'financi',
 'firm',
 'first',
 'foreign',
 'freez',
 'friday',
 'fuller',
 'fund',
 'g',
 'gain',
 'germani',
 'give',
 'gordon',
 'govern',
 'group',
 'hammer',
 'happen',
 'hit',
 'hold',
 'hope',
 'hurt',
 'idea',
 'immens',
 'impact',
 'implement',
 'implic',


In [10]:
wordDictA = dict.fromkeys(wordSet, 0) 
# Set all dicts values to 0

{'adopt': 0,
 'secretari': 0,
 'offer': 0,
 'base': 0,
 'gain': 0,
 'lobbi': 0,
 'problem': 0,
 'intern': 0,
 'straw': 0,
 'intend': 0,
 'twice': 0,
 'servic': 0,
 'govern': 0,
 'biggest': 0,
 'exampl': 0,
 'interest': 0,
 'current': 0,
 'japan': 0,
 'wealthi': 0,
 'rewrit': 0,
 'month': 0,
 'friday': 0,
 'line': 0,
 'debat': 0,
 'expect': 0,
 'hammer': 0,
 'put': 0,
 'pressur': 0,
 'hope': 0,
 'fund': 0,
 'agre': 0,
 'repay': 0,
 'softwar': 0,
 'action': 0,
 'canada': 0,
 'happen': 0,
 'commiss': 0,
 'hurt': 0,
 'us': 0,
 'might': 0,
 'amazon': 0,
 'program': 0,
 'abstain': 0,
 'monetari': 0,
 'suffer': 0,
 'moratorium': 0,
 'critic': 0,
 'dead': 0,
 'submit': 0,
 'also': 0,
 'reach': 0,
 'bring': 0,
 'law': 0,
 'foreign': 0,
 'order': 0,
 'pound': 0,
 'give': 0,
 'mean': 0,
 'analysi': 0,
 'thought': 0,
 'larger': 0,
 'lead': 0,
 'firm': 0,
 'welcom': 0,
 'even': 0,
 'court': 0,
 'comput': 0,
 'click': 0,
 'play': 0,
 'affair': 0,
 'parliament': 0,
 'chancellor': 0,
 'earlier': 0,
 '

In [24]:
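# Start both count dictionaries at 0 for every vocabulary word.
# wordDictA is reset here too, so re-running this cell does not keep adding to its counts.
wordDictA = dict.fromkeys(wordSet, 0)
wordDictB = dict.fromkeys(wordSet, 0)

# Count how often each stemmed word occurs in article A
for word in arta_stemmed:
    wordDictA[word] += 1

# Count how often each stemmed word occurs in article B
for word in artb_stemmed:
    wordDictB[word] += 1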

# Term frequency: each word's count divided by the total number of words in the document
def computeTF(wordDict, bow):
    tfDict = {}
    bowCount = len(bow)
    for word, count in wordDict.items():
        tfDict[word] = count/bowCount
    return tfDict

# Store the term frequencies of each article in its own variable
tfbowA = computeTF(wordDictA,arta_stemmed)
tfbowB = computeTF(wordDictB,artb_stemmed)

In [25]:
tfbowA

{'adopt': 0.021739130434782608,
 'secretari': 0.0,
 'offer': 0.021739130434782608,
 'base': 0.043478260869565216,
 'gain': 0.021739130434782608,
 'lobbi': 0.021739130434782608,
 'problem': 0.0,
 'intern': 0.0,
 'straw': 0.0,
 'intend': 0.021739130434782608,
 'twice': 0.021739130434782608,
 'servic': 0.021739130434782608,
 'govern': 0.021739130434782608,
 'biggest': 0.0,
 'exampl': 0.021739130434782608,
 'interest': 0.0,
 'current': 0.021739130434782608,
 'japan': 0.0,
 'wealthi': 0.0,
 'rewrit': 0.021739130434782608,
 'month': 0.021739130434782608,
 'friday': 0.0,
 'line': 0.021739130434782608,
 'debat': 0.021739130434782608,
 'expect': 0.0,
 'hammer': 0.0,
 'put': 0.021739130434782608,
 'pressur': 0.021739130434782608,
 'hope': 0.0,
 'fund': 0.0,
 'agre': 0.0,
 'repay': 0.0,
 'softwar': 0.06521739130434782,
 'action': 0.021739130434782608,
 'canada': 0.0,
 'happen': 0.021739130434782608,
 'commiss': 0.021739130434782608,
 'hurt': 0.021739130434782608,
 'us': 0.06521739130434782,
 'mig

In [23]:
tfbowA

{'adopt': 0.016304347826086956,
 'secretari': 0.0,
 'offer': 0.016304347826086956,
 'base': 0.03260869565217391,
 'gain': 0.016304347826086956,
 'lobbi': 0.016304347826086956,
 'problem': 0.0,
 'intern': 0.0,
 'straw': 0.0,
 'intend': 0.016304347826086956,
 'twice': 0.016304347826086956,
 'servic': 0.016304347826086956,
 'govern': 0.016304347826086956,
 'biggest': 0.0,
 'exampl': 0.016304347826086956,
 'interest': 0.0,
 'current': 0.016304347826086956,
 'japan': 0.0,
 'wealthi': 0.0,
 'rewrit': 0.016304347826086956,
 'month': 0.016304347826086956,
 'friday': 0.0,
 'line': 0.016304347826086956,
 'debat': 0.016304347826086956,
 'expect': 0.0,
 'hammer': 0.0,
 'put': 0.016304347826086956,
 'pressur': 0.016304347826086956,
 'hope': 0.0,
 'fund': 0.0,
 'agre': 0.0,
 'repay': 0.0,
 'softwar': 0.04891304347826087,
 'action': 0.016304347826086956,
 'canada': 0.0,
 'happen': 0.016304347826086956,
 'commiss': 0.016304347826086956,
 'hurt': 0.016304347826086956,
 'us': 0.04891304347826087,
 'migh

In [26]:

def computeIDF(docList): # docList is a list of word-count dictionaries, one per document
    import math
    N = len(docList) # total number of documents

    # Start a document count of 0 for every word in the vocabulary
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    for doc in docList: # iterate over the documents (wordDictA, then wordDictB)
        for word, val in doc.items():
            if val > 0: # the word occurs in this document
                idfDict[word] += 1

    # Convert each document count into an inverse document frequency
    for word, val in idfDict.items():
        idfDict[word] = math.log10(N / float(val))

    return idfDict

In [27]:
idfs = computeIDF([wordDictA, wordDictB])

In [28]:
def computeTFIDF(tfBow, idfs):
    tfidf = {}
    for word, val in tfBow.items():
        tfidf[word] = val*idfs[word]
    return tfidf

In [29]:
tfidfBowA = computeTFIDF(tfbowA, idfs)
tfidfBowB = computeTFIDF(tfbowB, idfs)

In [30]:
import pandas as pd
pd.DataFrame([tfidfBowA, tfidfBowB])

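|   | abstain | achiev | action | adopt | affair | affect | agre | agreement | also | amazon | ... | vocal | vote | wealthi | week | welcom | without | word | world | would | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.006544 | 0.006544 | 0.006544 | 0.006544 | 0.006544 | 0.0 | 0.0 | 0.0 | 0.0 | 0.006544 | ... | 0.006544 | 0.006544 | 0.0 | 0.0 | 0.006544 | 0.006544 | 0.006544 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.002923 | 0.002923 | 0.002923 | 0.005845 | 0.0 | ... | 0.0 | 0.0 | 0.002923 | 0.002923 | 0.0 | 0.0 | 0.0 | 0.002923 | 0.0 | 0.002923 |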


## But yes, there is an easier way

![big deal](https://media0.giphy.com/media/xUA7aQOxkz00lvCAOQ/giphy.gif?cid=3640f6095c2d7c51772f47644d09cc8b)


In [33]:
# create a string again
cleaned_a = ' '.join(arta_stemmed)
cleaned_b = ' '.join(artb_stemmed)


from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
response = tfidf.fit_transform([cleaned_a, cleaned_b])

import pandas as pd
# on scikit-learn older than 1.0, use tfidf.get_feature_names() instead
df = pd.DataFrame(response.toarray(), columns=tfidf.get_feature_names_out())
print(df)

    abstain    achiev    action     adopt    affair    affect      agre  \
0  0.053285  0.053285  0.053285  0.053285  0.053285  0.000000  0.000000   
1  0.000000  0.000000  0.000000  0.000000  0.000000  0.084167  0.084167   

   agreement      also    amazon  ...     vocal      vote   wealthi      week  \
0   0.000000  0.000000  0.053285  ...  0.053285  0.053285  0.000000  0.000000   
1   0.084167  0.168334  0.000000  ...  0.000000  0.000000  0.084167  0.084167   

     welcom   without      word     world     would      year  
0  0.053285  0.053285  0.053285  0.000000  0.113738  0.000000  
1  0.000000  0.000000  0.000000  0.084167  0.059885  0.084167  

[2 rows x 200 columns]
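
These scores differ from the from-scratch ones because `TfidfVectorizer`, with its default settings, uses raw counts for tf, a smoothed natural-log idf, $\ln \dfrac{1 + N}{1 + df_i} + 1$, and then scales each document vector to unit Euclidean length. Below is a minimal standard-library sketch of that recipe. The function name and the toy documents are illustrative assumptions, not part of the notebook.

```python
import math
from collections import Counter

def sklearn_style_tfidf(docs):
    """docs: list of token lists. Follows the TfidfVectorizer default recipe:
    raw counts, smoothed idf = ln((1 + N) / (1 + df)) + 1, then L2 normalization."""
    n_docs = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(term for tokens in docs for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in docs:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # fall back to 1.0 for an empty document to avoid dividing by zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

toy_docs = [["vote", "vote", "budget"], ["vote", "softwar"]]  # hypothetical tokens
for row in sklearn_style_tfidf(toy_docs):
    print(row)
```

Because of the `+ 1`, a term that appears in every document (such as `would` above) keeps a non-zero weight, unlike in the from-scratch version.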


## Corpus Statistics 

How many non-zero elements are there?
- Adapt the code below: wherever it says `DATA`, substitute the `df` version of the `response` object
- Interpret the findings


In [34]:
# Edit code before running it
import numpy as np

new_val = np.array(df)

non_zero_vals = np.count_nonzero(new_val) / float(df.shape[0])
print("Average Number of Non-Zero Elements in Vectorized Articles: {}".format(non_zero_vals))

# Sparsity: the average share of zero entries in an article's vector (a fraction between 0 and 1)
percent_sparse = 1 - (non_zero_vals / float(df.shape[1]))
print('Percentage of columns containing 0: {}'.format(percent_sparse))

Average Number of Non-Zero Elements in Vectorized Articles: 103.5
Percentage of columns containing 0: 0.48250000000000004


### Next Steps:
- Create the tf-idf for the **whole** corpus of 12 articles
- What are _on average_ the most important words in the whole corpus?
- Add a column named "Target" to the dataset
- Set Target to 1 if the article is "Politics" and to 0 if it is "Not Politics"
- Do some exploratory analysis of the dataset
 - What are the average most important words for the "Politics" articles?
 - What are the average most important words for the "Not Politics" articles?
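
One possible way to approach the exploratory questions is to average each word's tf-idf score within a class and sort the result. The sketch uses plain dictionaries and made-up scores. `rows`, `targets`, and `top_words` are hypothetical names, and with the real corpus the rows would come from the tf-idf matrix.

```python
from collections import defaultdict

def top_words(rows, targets, label, n=3):
    """Average each word's tf-idf over the articles whose target equals `label`."""
    selected = [row for row, t in zip(rows, targets) if t == label]
    totals = defaultdict(float)
    for row in selected:
        for word, score in row.items():
            totals[word] += score
    means = {word: total / len(selected) for word, total in totals.items()}
    # highest average score first
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Made-up tf-idf rows for four articles (1 = Politics, 0 = Not Politics)
rows = [
    {"vote": 0.5, "parliament": 0.4, "softwar": 0.0},
    {"vote": 0.3, "parliament": 0.2, "softwar": 0.0},
    {"vote": 0.0, "parliament": 0.0, "softwar": 0.6},
    {"vote": 0.1, "parliament": 0.0, "softwar": 0.4},
]
targets = [1, 1, 0, 0]

print(top_words(rows, targets, label=1))
print(top_words(rows, targets, label=0))
```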

## Let's talk classification
- How would you split the data into train and test sets? What would the dataset be?

In [None]:
# Sample code
from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)  