## Text Mining and NLP

## Part 2

### Situation:

Priya works in the Europe division of an international PR firm. The firm's largest client has offices in Ibiza, Madrid, and Las Palmas. She needs to keep her boss aware of current events and provide a weekly short list of articles about political events in Spain. The problem is that reviewing articles on the BBC takes hours every week, and Priya is very busy! She wonders whether she could automate this process with text mining to save time.

### **Goal**: to internalize the steps, challenges, and methodology of text mining
- explore text analysis by hand
- apply text mining steps in Jupyter with the Python library NLTK
- classify documents correctly

## Refresher on cleaning text
![gif](https://www.nyfa.edu/student-resources/wp-content/uploads/2014/10/furious-crazed-typing.gif)


In [9]:
from __future__ import print_function
import nltk
import sklearn

from nltk.collocations import *
from nltk import FreqDist, word_tokenize
import string, re
import urllib.request  # urlopen lives in the urllib.request submodule
from nltk.stem.snowball import SnowballStemmer

url_a = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/A.txt"
url_b = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/B.txt"

article_a = urllib.request.urlopen(url_a).read()
article_a_st = article_a.decode("utf-8")


In [10]:
# tokens
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
arta_tokens_raw = nltk.regexp_tokenize(article_a_st, pattern)

# lower case
arta_tokens = [i.lower() for i in arta_tokens_raw]

# stop words
from nltk.corpus import stopwords
stopwords.words("english")

stop_words = set(stopwords.words('english'))
arta_tokens_stopped = [w for w in arta_tokens if not w in stop_words]

# stem words
stemmer = SnowballStemmer("english")
arta_stemmed = [stemmer.stem(word) for word in arta_tokens_stopped]
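
To see what the token pattern above keeps and drops, here is a small sketch that applies the same regular expression with the standard library's `re.findall` as a stand-in for `nltk.regexp_tokenize`. The sample sentence is made up for illustration and is not taken from the articles.

```python
import re

# same pattern as in the cell above: letters, optionally followed by an
# apostrophe and more lower-case letters (keeps contractions together)
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"

# hypothetical sample sentence, not from the corpus
sample = "The EU's draft law wasn't adopted in 2005."

# tokenize, then lower-case each token
tokens = [t.lower() for t in re.findall(pattern, sample)]
print(tokens)
```

Note that the digits and the final period disappear, while `eu's` and `wasn't` survive as single tokens.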

In [11]:
# repeat w second article
article_b = urllib.request.urlopen(url_b).read()
article_b_st = article_b.decode("utf-8")
artb_tokens_raw = nltk.regexp_tokenize(article_b_st, pattern)
artb_tokens = [i.lower() for i in artb_tokens_raw]
artb_tokens_stopped = [w for w in artb_tokens if not w in stop_words]
artb_stemmed = [stemmer.stem(word) for word in artb_tokens_stopped]

### Document statistics

What's wrong with the table from yesterday? What does it not consider?


### Term Frequency (TF)

$\begin{align}
 tf_{i,j} = \dfrac{n_{i,j}}{\displaystyle \sum_k n_{k,j} }
\end{align} $
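
As a quick illustration of the formula, the sketch below computes term frequencies for one tiny document. The function name `term_frequencies` and the toy token list are assumptions made for this example only.

```python
from collections import Counter

def term_frequencies(tokens):
    """Return {term: count / total number of tokens} for one document."""
    total = len(tokens)
    return {term: count / total for term, count in Counter(tokens).items()}

# hypothetical toy document (already tokenized)
toy_doc = ["patent", "law", "patent", "vote"]
tf = term_frequencies(toy_doc)
print(tf)
```

Here `patent` appears 2 times out of 4 tokens, so its term frequency is 0.5, and the frequencies of all terms in a document sum to 1.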

### Inverse Document Frequency (IDF)

$\begin{align}
idf_i = \log \dfrac{N}{df_i}
\end{align} $
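
A minimal sketch of the same idea in code, assuming a base-10 logarithm (the choice made by the from-scratch code later in this notebook). The function name and the two toy documents are hypothetical.

```python
import math

def inverse_document_frequencies(docs):
    """docs: list of token lists. Returns {term: log10(N / document frequency)}."""
    n_docs = len(docs)
    vocab = set().union(*docs)
    idf = {}
    for term in vocab:
        # number of documents that contain the term at least once
        doc_freq = sum(1 for doc in docs if term in doc)
        idf[term] = math.log10(n_docs / doc_freq)
    return idf

# hypothetical toy corpus of two tokenized documents
toy_docs = [["patent", "law"], ["debt", "law"]]
idf = inverse_document_frequencies(toy_docs)
print(idf)
```

A term that appears in every document (`law` here) gets an IDF of 0, so it contributes nothing to the final score, while a term found in only one of the two documents gets log10(2).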

### TF-IDF score

$ \begin{align}
w_{i,j} = tf_{i,j} \times \log \dfrac{N}{df_i} \\
tf_{i,j} = \text{term frequency of } i \text{ in document } j \\
df_i = \text{number of documents containing term } i \\
N = \text{total number of documents}
\end{align} $


### The from scratch method
![homemade](https://media2.giphy.com/media/LBZcXdG0eVBdK/giphy.gif?cid=3640f6095c2d7bb2526a424a4d97117c)


Please go through the code and add a comment explaining what each section does.

In [15]:
# build the shared vocabulary: the union of the stems from both articles
wordSet = set(arta_stemmed).union(set(artb_stemmed))

#Create 2 initial dictionaries with all zeroes as values and words as keys
wordDictA = dict.fromkeys(wordSet, 0) 
wordDictB = dict.fromkeys(wordSet, 0)


In [18]:
# add to count in wordDictA for word in arta_stemmed
for word in arta_stemmed:
    wordDictA[word]+=1

# add to count in wordDictB for word in artb_stemmed
for word in artb_stemmed:
    wordDictB[word]+=1    

def computeTF(wordDict, bow):
    tfDict = {}
    bowCount = len(bow)
    # iterate through each word and its count from provided wordDict
    for word, count in wordDict.items():
        # set stored dictionary word as key and count/totalwordcount as value
        tfDict[word] = count/float(bowCount)
    return tfDict

tfbowA = computeTF(wordDictA,arta_stemmed)
tfbowB = computeTF(wordDictB,artb_stemmed)

In [25]:
tfbowA

{'action': 0.010869565217391304,
 'hurt': 0.010869565217391304,
 'monetari': 0.0,
 'support': 0.021739130434782608,
 'dead': 0.0,
 'member': 0.021739130434782608,
 'busi': 0.010869565217391304,
 'vocal': 0.010869565217391304,
 'invent': 0.05434782608695652,
 'fuller': 0.010869565217391304,
 'believ': 0.0,
 'reject': 0.010869565217391304,
 'effect': 0.010869565217391304,
 'begun': 0.0,
 'read': 0.010869565217391304,
 'fear': 0.010869565217391304,
 'union': 0.010869565217391304,
 'fail': 0.010869565217391304,
 'financi': 0.010869565217391304,
 'give': 0.010869565217391304,
 'minist': 0.0,
 'let': 0.010869565217391304,
 'miss': 0.0,
 'thought': 0.0,
 'oppon': 0.010869565217391304,
 'creditor': 0.0,
 'suspend': 0.0,
 'submit': 0.010869565217391304,
 'shop': 0.010869565217391304,
 'complet': 0.0,
 'agre': 0.0,
 'number': 0.0,
 'hammer': 0.0,
 'program': 0.010869565217391304,
 'rewrit': 0.010869565217391304,
 'propos': 0.021739130434782608,
 'debt': 0.0,
 'save': 0.0,
 'fund': 0.0,
 'play': 

In [73]:
import math

def computeIDF(docList):
    # number of documents
    N = len(docList)
    # start every word's document count at zero
    idfDict = dict.fromkeys(docList[0].keys(), 0)

    # for each document
    for doc in docList:
        # count the documents in which each word appears at least once
        for word, val in doc.items():
            if val > 0:
                idfDict[word] += 1
    # for each word, take log10 of (number of documents / documents containing the word)
    for word, val in idfDict.items():
        idfDict[word] = math.log10(N / float(val))

    return idfDict

In [74]:
idfs = computeIDF([wordDictA, wordDictB])

In [75]:
def computeTFIDF(tfBow, idfs):
    tfidf = {}
    for word, val in tfBow.items():
        tfidf[word] = val*idfs[word]
    return tfidf

In [76]:
tfidfBowA = computeTFIDF(tfbowA, idfs)
tfidfBowB = computeTFIDF(tfbowB, idfs)

In [77]:
import pandas as pd
pd.DataFrame([tfidfBowA, tfidfBowB])

Unnamed: 0,abstain,achiev,action,adopt,affair,affect,agre,agreement,also,amazon,...,vocal,vote,wealthi,week,welcom,without,word,world,would,year
0,0.003272,0.003272,0.003272,0.003272,0.003272,0.0,0.0,0.0,0.0,0.003272,...,0.003272,0.003272,0.0,0.0,0.003272,0.003272,0.003272,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.005845,0.005845,0.005845,0.01169,0.0,...,0.0,0.0,0.005845,0.005845,0.0,0.0,0.0,0.005845,0.0,0.005845
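
The first row of this table can be checked by hand. The sketch below reuses two numbers printed earlier in the notebook: the term frequency of `action` in article A (0.010869565217391304, which equals 1/92) and the fact that `action` is non-zero in only one of the two documents. The variable names are chosen for this example.

```python
import math

# values read off the outputs above
tf_action_a = 1 / 92   # tfbowA['action'] = 0.010869565217391304
n_docs = 2             # articles A and B
df_action = 1          # 'action' is non-zero only in article A

# same formula as computeIDF and computeTFIDF: tf * log10(N / df)
tfidf_action_a = tf_action_a * math.log10(n_docs / df_action)
print(round(tfidf_action_a, 6))
```

The result rounds to 0.003272, matching the `action` column of row 0. By the same formula, any word that appears in both articles scores 0 in this version, because log10(2/2) = 0.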


## But yes, there is an easier way

![big deal](https://media0.giphy.com/media/xUA7aQOxkz00lvCAOQ/giphy.gif?cid=3640f6095c2d7c51772f47644d09cc8b)


In [81]:
# create a string again
cleaned_a = ' '.join(arta_stemmed)
cleaned_b = ' '.join(artb_stemmed)


from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
response = tfidf.fit_transform([cleaned_a, cleaned_b])

import pandas as pd
# get_feature_names() was removed in recent scikit-learn releases; use get_feature_names_out()
df = pd.DataFrame(response.toarray(), columns=tfidf.get_feature_names_out())
print(df)

    abstain    achiev    action     adopt    affair    affect      agre  \
0  0.053285  0.053285  0.053285  0.053285  0.053285  0.000000  0.000000   
1  0.000000  0.000000  0.000000  0.000000  0.000000  0.084167  0.084167   

   agreement      also    amazon  ...     vocal      vote   wealthi      week  \
0   0.000000  0.000000  0.053285  ...  0.053285  0.053285  0.000000  0.000000   
1   0.084167  0.168334  0.000000  ...  0.000000  0.000000  0.084167  0.084167   

     welcom   without      word     world     would      year  
0  0.053285  0.053285  0.053285  0.000000  0.113738  0.000000  
1  0.000000  0.000000  0.000000  0.084167  0.059885  0.084167  

[2 rows x 200 columns]
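
These numbers differ from the from-scratch table: for example, `would` scores 0.0 in both rows there but is non-zero here. A likely explanation, based on our reading of the scikit-learn documentation for the default `TfidfVectorizer` settings, is that it uses raw counts for tf, a smoothed natural-log idf of ln((1 + N) / (1 + df)) + 1, and then scales each row to unit length. The sketch below reimplements that recipe in plain Python on a made-up two-document corpus; the function name and toy data are assumptions, and it skips the vectorizer's own tokenization.

```python
import math
from collections import Counter

def sklearn_style_tfidf(docs):
    """docs: list of non-empty token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    # smoothed idf (assumed default): ln((1 + N) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = {t: c * idf[t] for t, c in counts.items()}
        # L2-normalize so each document vector has length 1
        norm = math.sqrt(sum(v * v for v in raw.values()))
        rows.append({t: v / norm for t, v in raw.items()})
    return rows

# hypothetical toy corpus; 'would' appears in both documents
toy_docs = [["patent", "law", "would"], ["debt", "would", "would"]]
weights = sklearn_style_tfidf(toy_docs)
print(weights)
```

Because of the `+ 1` in the idf, a word shared by every document keeps a positive weight instead of dropping to zero, which is consistent with `would` being non-zero in the output above.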


## Corpus Statistics 

How many non-zero elements are there?
- Adapt the code below so that it uses the `df` version of the `response` object wherever the template says `DATA`
- Interpret the findings


In [106]:
# Edit code before running it
import numpy as np

# total number of non-zero cells across all vectorized articles
non_zero_cells = np.sum(df.values > 0)
print("Total Number of Non-Zero Elements in Vectorized Articles: {}".format(non_zero_cells))

# share of cells in the document-term matrix that are zero
percent_sparse = 1 - (non_zero_cells / df.values.size)
print('Percentage of cells containing 0: {:.2%}'.format(percent_sparse))

Total Number of Non-Zero Elements in Vectorized Articles: 207
Percentage of cells containing 0: 48.25%
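
The sparsity figure follows directly from the shape of the matrix. As a small check, the sketch below redoes the arithmetic using only numbers printed above: 207 non-zero entries in a matrix of 2 rows by 200 columns.

```python
# values taken from the outputs above
non_zero = 207
n_rows, n_cols = 2, 200

total_cells = n_rows * n_cols            # 400 cells in the document-term matrix
percent_sparse = 1 - non_zero / total_cells
print("{:.2%}".format(percent_sparse))
```

So 207 is a count over the whole matrix (about 103 non-zero terms per article), and a little under half of the cells are zero. With only two articles that share few words, this is expected.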


In [107]:
url_a = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/A.txt"

letter_list = list(string.ascii_uppercase[0:12])
url_list = ["https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/{}.txt".format(ltr) for ltr in letter_list]

def url_to_stemmed(url):
    # download the article and decode the bytes to text
    article = urllib.request.urlopen(url).read()
    article_st = article.decode("utf-8")
    # tokenize, lower-case, drop stop words, and stem (same steps as above)
    tokens_raw = nltk.regexp_tokenize(article_st, pattern)
    tokens = [i.lower() for i in tokens_raw]
    tokens_stopped = [w for w in tokens if w not in stop_words]
    stemmed = [stemmer.stem(word) for word in tokens_stopped]
    # return one cleaned string per article, ready for TfidfVectorizer
    return ' '.join(stemmed)

In [114]:
stemmed_list = [url_to_stemmed(url) for url in url_list]
stemmed_list[0]

'reboot order eu patent law european parliament committe order rewrit propos controversi new european union rule govern comput base invent legal affair committe juri said commiss submit comput implement invent direct mep fail back vocal critic say could favour larg small firm impact open sourc softwar innov support say would let firm protect invent direct intend offer patent protect invent use softwar achiev effect word comput implement invent draft law suffer setback poland one largest eu member state reject adopt twice two month intens lobbi issu start gain momentum nation parliament put immens pressur two mep back draft law juri meet one vote abstain oppon draft direct welcom decis said new first read propos would give eu chanc fuller debat implic member state us patent comput program internet busi method permit mean us base amazon com hold patent one click shop servic exampl critic concern direct could lead similar model happen europ fear could hurt small softwar develop legal fina

In [109]:
tfidf = TfidfVectorizer()
response = tfidf.fit_transform(stemmed_list)


# get_feature_names() was removed in recent scikit-learn releases; use get_feature_names_out()
df = pd.DataFrame(response.toarray(), columns=tfidf.get_feature_names_out())
print(df)

        abat    abiyot       abl   abstain    access    accord   account  \
0   0.000000  0.000000  0.000000  0.057034  0.000000  0.000000  0.000000   
1   0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
2   0.000000  0.000000  0.000000  0.000000  0.000000  0.052499  0.000000   
3   0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
4   0.000000  0.000000  0.000000  0.000000  0.053013  0.053013  0.046830   
5   0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
6   0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
7   0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.045945   
8   0.000000  0.000000  0.093834  0.000000  0.080585  0.000000  0.035593   
9   0.080947  0.080947  0.000000  0.000000  0.000000  0.000000  0.000000   
10  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
11  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   

       accu

In [115]:
tfidf.vocabulary_

{'reboot': 593,
 'order': 521,
 'eu': 225,
 'patent': 534,
 'law': 415,
 'european': 227,
 'parliament': 530,
 'committe': 138,
 'rewrit': 609,
 'propos': 570,
 'controversi': 159,
 'new': 501,
 'union': 750,
 'rule': 615,
 'govern': 304,
 'comput': 149,
 'base': 65,
 'invent': 374,
 'legal': 418,
 'affair': 18,
 'juri': 392,
 'said': 618,
 'commiss': 136,
 'submit': 693,
 'implement': 350,
 'direct': 200,
 'mep': 462,
 'fail': 239,
 'back': 61,
 'vocal': 766,
 'critic': 175,
 'say': 623,
 'could': 163,
 'favour': 244,
 'larg': 408,
 'small': 657,
 'firm': 265,
 'impact': 349,
 'open': 517,
 'sourc': 668,
 'softwar': 659,
 'innov': 366,
 'support': 699,
 'would': 794,
 'let': 421,
 'protect': 571,
 'intend': 369,
 'offer': 511,
 'use': 755,
 'achiev': 8,
 'effect': 215,
 'word': 792,
 'draft': 205,
 'suffer': 695,
 'setback': 640,
 'poland': 551,
 'one': 516,
 'largest': 410,
 'member': 459,
 'state': 683,
 'reject': 600,
 'adopt': 17,
 'twice': 743,
 'two': 744,
 'month': 482,
 'inten

In [113]:
# Edit code before running it
import numpy as np

# total number of non-zero cells across all vectorized articles
non_zero_cells = np.sum(df.values > 0)
print("Total Number of Non-Zero Elements in Vectorized Articles: {}".format(non_zero_cells))

# share of cells in the document-term matrix that are zero
percent_sparse = 1 - (non_zero_cells / df.values.size)
print('Percentage of cells containing 0: {:.2%}'.format(percent_sparse))

Total Number of Non-Zero Elements in Vectorized Articles: 1088
Percentage of cells containing 0: 88.67%


### Next Steps:
- Create the tf-idf for the **whole** corpus of 12 articles
- What are _on average_ the most important words in the whole corpus?
- Add a column named "Target" to the dataset
- Set Target to 1 if the article is "Politics" and to 0 if it is "Not Politics"
- Do some exploratory analysis of the dataset
 - What are the average most important words for the "Politics" articles?
 - What are the average most important words for the "Not Politics" articles?
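
One possible way to approach the exploratory step is sketched below in plain Python: average each word's tf-idf score within each Target class and list the highest-scoring words. The helper names, the scores, and the labels are all made up for illustration; they are not results from the 12 articles.

```python
from collections import defaultdict

def mean_scores_by_target(rows, targets):
    """rows: list of {term: tfidf}; targets: parallel list of 0/1 labels.
    A term missing from a row counts as 0 for that article."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for row, target in zip(rows, targets):
        counts[target] += 1
        for term, score in row.items():
            sums[target][term] += score
    return {
        target: {term: total / counts[target] for term, total in terms.items()}
        for target, terms in sums.items()
    }

def top_terms(mean_scores, k=2):
    """Return the k terms with the highest average score."""
    return sorted(mean_scores, key=mean_scores.get, reverse=True)[:k]

# hypothetical tf-idf rows and labels (1 = Politics, 0 = Not Politics)
rows = [
    {"vote": 0.4, "parliament": 0.3},
    {"goal": 0.5, "match": 0.2},
    {"vote": 0.2, "minist": 0.8},
]
targets = [1, 0, 1]

means = mean_scores_by_target(rows, targets)
print(top_terms(means[1]))
print(top_terms(means[0]))
```

With the pandas `df` built in this notebook, roughly the same summary could be obtained by adding the `Target` column and calling `df.groupby("Target").mean()`, then sorting each row.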

## Let's talk classification
- How would you split the data into train and test sets? What would the dataset be?

In [None]:
# Sample code
# X and y are not defined yet: X would hold the tf-idf features, y the Target column
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
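
To get a feel for what a 20% test split means on a corpus of only 12 articles, here is a hand-rolled stand-in for the split, written in plain Python. It is a sketch, not scikit-learn's implementation: the function name is hypothetical, and rounding the test count up is an assumption about how a fractional size is handled.

```python
import math
import random

def split_indices(n_samples, test_size=0.2, seed=0):
    """Shuffle row indices and return (train_indices, test_indices)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # assumption: round the number of test rows up
    n_test = math.ceil(test_size * n_samples)
    return indices[n_test:], indices[:n_test]

# 12 articles in the corpus
train_idx, test_idx = split_indices(12)
print(len(train_idx), len(test_idx))
```

With 12 articles this leaves 9 for training and 3 for testing, so a single misclassified test article moves accuracy by a third. That is worth keeping in mind when interpreting any score on this dataset.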