## Text Mining and NLP

## Part 2

### Situation:

Priya works in the Europe division of an international PR firm. The firm's largest client has offices in Ibiza, Madrid, and Las Palmas. She needs to keep her boss aware of current events by providing a short weekly list of articles about political events in Spain. The problem is that reviewing articles on the BBC takes hours every week, and Priya is very busy! She wonders whether she could automate the process with text mining to save time.

### **Goal**: to internalize the steps, challenges, and methodology of text mining
- explore text analysis by hand
- apply text mining steps in Jupyter with the Python library NLTK
- classify documents correctly

## Refresher on cleaning text
![gif](https://www.nyfa.edu/student-resources/wp-content/uploads/2014/10/furious-crazed-typing.gif)


In [5]:
!pip install nltk

twisted 18.7.0 requires PyHamcrest>=1.9.0, which is not installed.
You are using pip version 10.0.1, however version 19.1.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [22]:
import nltk

In [24]:
from __future__ import print_function
import nltk
import sklearn

from nltk.collocations import *
from nltk import FreqDist, word_tokenize
import string, re
import urllib.request  # a bare "import urllib" does not reliably load the request submodule
from nltk.stem.snowball import SnowballStemmer
import nltk.corpus

url_a = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/A.txt"
url_b = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/B.txt"
url_c = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/C.txt"
url_d = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/D.txt"
url_e = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/E.txt"
url_f = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/F.txt"
url_g = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/G.txt"
url_h = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/H.txt"
url_i = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/I.txt"
url_j = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/J.txt"
url_k = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/K.txt"
url_l = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/L.txt"

article_a = urllib.request.urlopen(url_a).read()
article_b = urllib.request.urlopen(url_b).read()
article_c = urllib.request.urlopen(url_c).read()
article_d = urllib.request.urlopen(url_d).read()
article_e = urllib.request.urlopen(url_e).read()
article_f = urllib.request.urlopen(url_f).read()
article_g = urllib.request.urlopen(url_g).read()
article_h = urllib.request.urlopen(url_h).read()
article_i = urllib.request.urlopen(url_i).read()
article_j = urllib.request.urlopen(url_j).read()
article_k = urllib.request.urlopen(url_k).read()
article_l = urllib.request.urlopen(url_l).read()


article_a_st = article_a.decode("utf-8")
article_b_st = article_b.decode("utf-8")
article_c_st = article_c.decode("utf-8")
article_d_st = article_d.decode("utf-8")
article_e_st = article_e.decode("utf-8")
article_f_st = article_f.decode("utf-8")
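The twelve near-identical URL, download, and decode lines are easy to mistype. A minimal sketch of the same steps as a loop follows; the names `BASE_URL`, `article_urls`, `fetch_article`, and `fetch_all` are made up for this illustration and are not part of the original notebook. The download call is left commented out because it needs network access.

```python
import urllib.request

# Same folder as the url_a ... url_l variables above; only the letter changes.
BASE_URL = ("https://raw.githubusercontent.com/aapeebles/text_examples/"
            "master/Text%20examples%20folder/{}.txt")

# One URL per article, keyed by lower-case letter: 'a' ... 'l'
article_urls = {letter.lower(): BASE_URL.format(letter) for letter in "ABCDEFGHIJKL"}


def fetch_article(url):
    """Download one article and return it as a decoded string."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


def fetch_all(urls):
    """Download every article and return {letter: text}. Needs network access."""
    return {letter: fetch_article(url) for letter, url in urls.items()}


# articles = fetch_all(article_urls)   # uncomment to download all twelve
```

With this layout, `articles["a"]` would play the role of `article_a_st`.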


In [None]:
# Read and process article A
article_a = urllib.request.urlopen(url_a).read()
article_a_st = article_a.decode("utf-8")

# Tokenize: runs of letters, optionally followed by an apostrophe suffix (e.g. "don't")
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
arta_tokens_raw = nltk.regexp_tokenize(article_a_st, pattern)

In [26]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/Mango/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [27]:
# tokens
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
arta_tokens_raw = nltk.regexp_tokenize(article_a_st, pattern)

# lower case
arta_tokens = [i.lower() for i in arta_tokens_raw]

# stop words
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
arta_tokens_stopped = [w for w in arta_tokens if w not in stop_words]

# stem words
stemmer = SnowballStemmer("english")
arta_stemmed = [stemmer.stem(word) for word in arta_tokens_stopped]

In [28]:
# Repeat the same steps for the second article
article_b = urllib.request.urlopen(url_b).read()
article_b_st = article_b.decode("utf-8")
artb_tokens_raw = nltk.regexp_tokenize(article_b_st, pattern)
artb_tokens = [i.lower() for i in artb_tokens_raw]
artb_tokens_stopped = [w for w in artb_tokens if w not in stop_words]
artb_stemmed = [stemmer.stem(word) for word in artb_tokens_stopped]
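Since the same four steps (tokenize, lower-case, remove stop words, stem) are repeated for every article, they could be wrapped in one function. The sketch below is one possible way to do that; `clean_text`, `TOKEN_PATTERN`, and `demo_stop_words` are hypothetical names, and the tiny stop list is only a stand-in so the example runs without downloading the NLTK stopwords corpus.

```python
import nltk
from nltk.stem.snowball import SnowballStemmer

# Same token pattern as in the cells above
TOKEN_PATTERN = "([a-zA-Z]+(?:'[a-z]+)?)"


def clean_text(text, stop_words, stemmer=None):
    """Tokenize, lower-case, drop stop words, and stem a raw string."""
    stemmer = stemmer or SnowballStemmer("english")
    tokens = nltk.regexp_tokenize(text, TOKEN_PATTERN)
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in stop_words]
    return [stemmer.stem(t) for t in tokens]


# Stand-in stop list (an assumption for the demo); in the notebook you would
# pass the stop_words set built from stopwords.words('english') instead.
demo_stop_words = {"the", "a", "of", "in", "on"}
demo = clean_text("The Laws on inventions", demo_stop_words)
print(demo)
```

In the notebook, `clean_text(article_b_st, stop_words)` should then give the same list as `artb_stemmed`.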

### Document statistics

What's wrong with the table from yesterday? What does it not consider?


### Term Frequency (TF)

$\begin{align}
 tf_{i,j} = \dfrac{n_{i,j}}{\displaystyle \sum_k n_{k,j} }
\end{align} $
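To make the ratio concrete: the numerator is how often term $i$ appears in document $j$, and the denominator is the total number of terms in that document. A minimal sketch on a made-up four-token document (`term_frequency` and `toy_doc` are illustrative names, not from the notebook):

```python
from collections import Counter


def term_frequency(tokens):
    """Return {term: count / total number of tokens} for one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: n / total for term, n in counts.items()}


# Hypothetical document, already cleaned and stemmed
toy_doc = ["law", "invent", "law", "juri"]
tf = term_frequency(toy_doc)
print(tf)
```

Here "law" accounts for two of the four tokens, so its term frequency is 0.5, and the frequencies of all terms in a document always sum to 1.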

### Inverse Document Frequency (IDF)

$\begin{align}
idf_i = \log \dfrac{N}{df_i}
\end{align} $
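The inverse document frequency rewards terms that appear in few documents. The sketch below uses a made-up three-document corpus; the function name and data are illustrative, and base-10 logarithms are an assumption chosen to match the from-scratch code later in this notebook (the formula itself does not fix the base).

```python
import math


def inverse_document_frequency(docs):
    """Return {term: log10(N / df)} where df counts documents containing the term."""
    n_docs = len(docs)
    vocab = set().union(*docs)
    df = {term: sum(term in doc for doc in docs) for term in vocab}
    return {term: math.log10(n_docs / df[term]) for term in vocab}


# Hypothetical corpus of three already-cleaned documents
toy_docs = [["law", "invent"], ["law", "budget"], ["law", "budget", "vote"]]
idf = inverse_document_frequency(toy_docs)
print(idf)
```

A term found in every document ("law") gets an IDF of zero, while a term found in only one document ("vote") gets the largest value.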

### TF-IDF score

$ \begin{align}
w_{i,j} &= tf_{i,j} \times \log \dfrac{N}{df_i} \\
tf_{i,j} &= \text{frequency of term } i \text{ in document } j \\
df_i &= \text{number of documents containing term } i \\
N &= \text{total number of documents}
\end{align} $


### The from scratch method
![homemade](https://media2.giphy.com/media/LBZcXdG0eVBdK/giphy.gif?cid=3640f6095c2d7bb2526a424a4d97117c)


Please go through the code and add a comment describing what each section does.

In [29]:
# Build the shared vocabulary: every unique stem from articles A and B
wordSet = set(arta_stemmed).union(set(artb_stemmed))
# One count dictionary per article, with every vocabulary word starting at 0
wordDictA = dict.fromkeys(wordSet, 0)
wordDictB = dict.fromkeys(wordSet, 0)

# Count how often each word occurs in A
for word in arta_stemmed:
    wordDictA[word]+=1

# Count how often each word occurs in B
for word in artb_stemmed:
    wordDictB[word]+=1

# Calculate term frequency
def computeTF(wordDict, bow):
    tfDict = {}
    # bowCount = total number of tokens in this document's bag of words
    bowCount = len(bow)
    for word, count in wordDict.items():
        # occurrences of the word divided by the document's total token count
        tfDict[word] = count/float(bowCount)
    return tfDict

# Term frequency for both articles
tfbowA = computeTF(wordDictA,arta_stemmed)
tfbowB = computeTF(wordDictB,artb_stemmed)

In [30]:
tfbowA

{'ineffici': 0.005434782608695652,
 'mep': 0.010869565217391304,
 'action': 0.005434782608695652,
 'disast': 0.0,
 'sign': 0.0,
 'line': 0.005434782608695652,
 'exampl': 0.005434782608695652,
 'gain': 0.005434782608695652,
 'say': 0.016304347826086956,
 'two': 0.010869565217391304,
 'wealthi': 0.0,
 'law': 0.02717391304347826,
 'final': 0.0,
 'would': 0.016304347826086956,
 'invent': 0.02717391304347826,
 'reach': 0.0,
 'later': 0.0,
 'begun': 0.0,
 'meet': 0.005434782608695652,
 'fuller': 0.005434782608695652,
 'juri': 0.010869565217391304,
 'said': 0.010869565217391304,
 'jack': 0.0,
 'critic': 0.010869565217391304,
 'announc': 0.0,
 'affair': 0.005434782608695652,
 'agreement': 0.0,
 'propos': 0.010869565217391304,
 'brown': 0.0,
 'financ': 0.0,
 'freez': 0.0,
 'similar': 0.005434782608695652,
 'reconstruct': 0.0,
 'friday': 0.0,
 'govern': 0.005434782608695652,
 'britain': 0.0,
 'without': 0.005434782608695652,
 'state': 0.010869565217391304,
 'year': 0.0,
 'suffer': 0.005434782608

In [16]:
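import math

def computeIDF(docList):
    # N = number of documents in the corpus
    N = len(docList)

    # Document frequency: in how many documents does each word occur?
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                idfDict[word] += 1

    # Turn each document frequency into log10(N / df)
    for word, val in idfDict.items():
        idfDict[word] = math.log10(N / float(val))

    return idfDict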

In [31]:
idfs = computeIDF([wordDictA, wordDictB])

In [32]:
def computeTFIDF(tfBow, idfs):
    tfidf = {}
    for word, val in tfBow.items():
        tfidf[word] = val*idfs[word]
    return tfidf

In [33]:
tfidfBowA = computeTFIDF(tfbowA, idfs)
tfidfBowB = computeTFIDF(tfbowB, idfs)

In [34]:
import pandas as pd
pd.DataFrame([tfidfBowA, tfidfBowB])

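| | abstain | achiev | action | adopt | affair | affect | agre | agreement | also | amazon | ... | vocal | vote | wealthi | week | welcom | without | word | world | would | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.001636 | 0.001636 | 0.001636 | 0.001636 | 0.001636 | 0.0 | 0.0 | 0.0 | 0.0 | 0.001636 | ... | 0.001636 | 0.001636 | 0.0 | 0.0 | 0.001636 | 0.001636 | 0.001636 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.002923 | 0.002923 | 0.002923 | 0.005845 | 0.0 | ... | 0.0 | 0.0 | 0.002923 | 0.002923 | 0.0 | 0.0 | 0.0 | 0.002923 | 0.0 | 0.002923 |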
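As a rough sanity check, the values in this output can be reproduced by hand. In the `tfbowA` output, terms that occur once have a frequency of about 0.005435, which is 1/184, so article A appears to contain 184 stemmed tokens (an inference from the printed values, not something the notebook states). With $N = 2$ documents and base-10 logarithms, as in `computeIDF`:

$\begin{align}
w_{\text{action},A} &= \tfrac{1}{184} \times \log_{10} \tfrac{2}{1} \approx 0.005435 \times 0.30103 \approx 0.001636 \\
w_{\text{would},A} &= \tfrac{3}{184} \times \log_{10} \tfrac{2}{2} = 0
\end{align} $

The second line shows a limitation of a two-document corpus: any term that appears in both articles, such as "would", gets an IDF of zero, so only the words unique to one article receive a non-zero score.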


## But yes, there is an easier way

![big deal](https://media0.giphy.com/media/xUA7aQOxkz00lvCAOQ/giphy.gif?cid=3640f6095c2d7c51772f47644d09cc8b)


In [35]:
# create a string again
cleaned_a = ' '.join(arta_stemmed)
cleaned_b = ' '.join(artb_stemmed)


from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
response = tfidf.fit_transform([cleaned_a, cleaned_b])

import pandas as pd
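# get_feature_names() was removed in scikit-learn 1.2; on releases older than 1.0 use get_feature_names() instead
df = pd.DataFrame(response.toarray(), columns=tfidf.get_feature_names_out())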
print(df)

    abstain    achiev    action     adopt    affair    affect      agre  \
0  0.053285  0.053285  0.053285  0.053285  0.053285  0.000000  0.000000   
1  0.000000  0.000000  0.000000  0.000000  0.000000  0.084167  0.084167   

   agreement      also    amazon    ...        vocal      vote   wealthi  \
0   0.000000  0.000000  0.053285    ...     0.053285  0.053285  0.000000   
1   0.084167  0.168334  0.000000    ...     0.000000  0.000000  0.084167   

       week    welcom   without      word     world     would      year  
0  0.000000  0.053285  0.053285  0.053285  0.000000  0.113738  0.000000  
1  0.084167  0.000000  0.000000  0.000000  0.084167  0.059885  0.084167  

[2 rows x 200 columns]
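These numbers differ from the from-scratch table because `TfidfVectorizer` does not use the plain formula. As far as the scikit-learn documentation describes its defaults, it uses raw counts for the term frequency, a smoothed IDF of $\ln\frac{1+N}{1+df_i} + 1$, and then scales each row to unit L2 length. The sketch below re-implements that recipe in plain Python on made-up data; `sklearn_style_tfidf` and `toy` are illustrative names, and this is an approximation of the documented behaviour, not the library's actual code.

```python
import math
from collections import Counter


def sklearn_style_tfidf(docs):
    """Raw-count tf, smoothed idf, then L2 row normalisation."""
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    vocab = sorted(set().union(*docs))
    # Document frequency and smoothed idf for every vocabulary term
    df = {t: sum(t in c for c in counts) for t in vocab}
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for c in counts:
        raw = {t: c[t] * idf[t] for t in vocab}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        rows.append({t: v / norm for t, v in raw.items()})
    return rows


# Hypothetical stemmed documents: "would" is shared, the other terms are unique
toy = [["would", "would", "would", "law"], ["would", "week"]]
rows = sklearn_style_tfidf(toy)
print(rows)
```

Because of the "+ 1", a term shared by both documents keeps a non-zero weight. In the output above, article B's `would` (0.059885) divided by a term unique to B (0.084167) is about 0.71, which matches $1 / (\ln 1.5 + 1)$ for two terms that each occur once.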


## Corpus Statistics 

How many non-zero elements are there?
- Adapt the code below by replacing `DATA` with the vectorized articles. Note that `.nnz` is an attribute of the sparse `response` matrix; the `df` DataFrame built from it has no `.nnz`
- Interpret the findings


In [36]:
# Edit code before running it

non_zero_cols = DATA.nnz / float(DATA.shape[0])
print("Average Number of Non-Zero Elements in Vectorized Articles: {}".format(non_zero_cols))

percent_sparse = 1 - (non_zero_cols / float(DATA.shape[1]))
print('Fraction of entries equal to 0: {}'.format(percent_sparse))

NameError: name 'DATA' is not defined
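The `NameError` is expected until `DATA` is replaced. To see what the two statistics measure, here is a small sketch on a made-up dense matrix written as plain Python lists; `sparsity_stats` and `toy_matrix` are illustrative names. For the real data, the sparse matrix's `.nnz` attribute gives the non-zero count directly.

```python
def sparsity_stats(rows):
    """Return (average non-zero entries per row, fraction of entries that are zero)."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    # Same quantity a SciPy sparse matrix reports as .nnz
    nnz = sum(1 for row in rows for v in row if v != 0)
    avg_nonzero = nnz / n_rows
    fraction_zero = 1 - avg_nonzero / n_cols
    return avg_nonzero, fraction_zero


# Hypothetical 2-document, 4-term matrix
toy_matrix = [[0.5, 0.0, 0.5, 0.0],
              [0.0, 0.0, 1.0, 0.0]]
avg_nonzero, fraction_zero = sparsity_stats(toy_matrix)
print(avg_nonzero, fraction_zero)
```

Three of the eight entries are non-zero, so each document uses 1.5 terms on average and 62.5% of the matrix is zero.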

### Next Steps:
- Create the tf-idf matrix for the **whole** corpus of 12 articles
- Which words are, _on average_, the most important in the whole corpus?
- Add a column named "Target" to the dataset
- Set Target to 1 if the article is "Politics" and 0 if it is "Not Politics"
- Do some exploratory analysis of the dataset
  - What are the most important words, on average, in the "Politics" articles?
  - What are the most important words, on average, in the "Not Politics" articles?

## Let's talk classification
- How would you split the data into train and test sets? What would the dataset be?

In [None]:
# Sample code: X is the tf-idf feature matrix and y the Target column (both still to be built)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
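One point worth discussing is when to vectorize. A common approach, sketched below under the assumption that the dataset is the twelve cleaned articles plus their labels, is to split the raw documents first and fit the vectorizer on the training portion only, so that no vocabulary or IDF information leaks from the test articles. The documents and labels here are invented placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Twelve hypothetical cleaned documents with made-up labels
docs = ["vote law parti"] * 6 + ["goal team match"] * 6
labels = [1] * 6 + [0] * 6  # 1 = Politics, 0 = Not Politics

# stratify keeps both classes represented in the train and test sets
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.2, random_state=0, stratify=labels)

# Learn the vocabulary and IDF weights from the training documents only
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(docs_train)
# Re-use the fitted vocabulary for the held-out documents
X_test = vectorizer.transform(docs_test)

print(X_train.shape, X_test.shape)
```

With only twelve articles the test set is tiny, so any accuracy measured on it will be a very rough estimate.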