# NLP Practice - Rujuta Gandhi

You have been provided with a pickle file containing 100 news articles about some company. Use an appropriate topic modeling technique to identify the top N most important topics.

- read_pickle(directory+'file.pkl')
- Present top N most important topics in these news articles
- Select N to identify relevant topics, but minimize duplication
- Explain how you selected N


## Import Libraries and File

In [3]:
import pyforest
import time
import math
import re
from textblob import TextBlob
import pandas as pd

import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

import string

import gensim
from gensim import corpora, models
from gensim.models.ldamulticore import LdaMulticore
import pyLDAvis.gensim
import warnings

In [4]:
df = pd.read_pickle(r"C:\Users\gandh\Google Drive\UChicago\11_Quarter 10\Assignments\Assignment 6\webhose_cat.pkl")
df.head(5)

Unnamed: 0,crawled,language,text,title,url
0,2018-01-30T18:28:45.012+02:00,english,Avery Dennison's (AVY) Q4 results are likely t...,IRobot downgraded to neutral from buy at Sidot...,http://omgili.com/ri/.wHSUbtEfZQRfU.5KUm1RkeXy...
1,2018-01-30T18:29:07.001+02:00,french,"1m95, c’est trop grand. Et sa stature, Bertran...","""Bertrand Zibi Abeghe, encore prisonnier, et t...",http://omgili.com/ri/.wHSUbtEfZTpzFtnXyQJIwJ.j...
2,2018-01-30T18:29:40.000+02:00,english,Tuggers and Topper Industrial Carts Help Trans...,Tuggers and Topper Industrial Carts Help Trans...,http://omgili.com/ri/jHIAmI4hxg.zDiulpymXqU_n4...
3,2018-01-30T18:30:05.007+02:00,english,Currently adding the following games:\n100 (by...,,http://omgili.com/ri/.0rSU5LtMgyggHgoOVy9TMDWT...
4,2018-01-30T18:30:05.013+02:00,english,Quote: : » Currently adding the following game...,,http://omgili.com/ri/.0rSU5LtMgyggHgoOVy9TMDWT...


In [5]:
df.language.value_counts()

english    95
dutch       1
german      1
french      1
korean      1
italian     1
Name: language, dtype: int64

In [6]:
df.info()
# df.info() reports no null values in any column, but some title fields are still blank.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   crawled   100 non-null    object
 1   language  100 non-null    object
 2   text      100 non-null    object
 3   title     100 non-null    object
 4   url       100 non-null    object
dtypes: object(5)
memory usage: 4.0+ KB


## Begin Data Cleaning

In [7]:
# Keep only English rows and the text column

df = df[df.language=='english'].reset_index(drop=True)

df = df.drop(columns=['crawled','language','url','title'])

df.head(5)

Unnamed: 0,text
0,Avery Dennison's (AVY) Q4 results are likely t...
1,Tuggers and Topper Industrial Carts Help Trans...
2,Currently adding the following games:\n100 (by...
3,Quote: : » Currently adding the following game...
4,Quote: : » Currently adding the following game...


In [8]:
# Remove special characters to avoid problems with analysis.
# Keeps letters, digits, spaces, and @ . , : _ (hyphens are removed as well).
df['text'] = df['text'].map(lambda x: re.sub(r'[^a-zA-Z0-9 @.,:_]', '', str(x)))
# Numbers were left in during initial runs; remove them
df['text'] = df['text'].map(lambda x: re.sub('[0-9]', '', str(x)))
# Replacing special characters left stray colons behind; remove those
df['text'] = df['text'].map(lambda x: re.sub(': :| ::', '', str(x)))
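# Initial topic modeling with TextBlob kept surfacing "am"/"pm", so remove a.m./p.m. markers.
# The dots are escaped so the pattern does not also match inside ordinary words (e.g. "alm" in "palm").
df['text'] = df['text'].map(lambda x: re.sub(r'\b[ap]\.m\.?', '', str(x)))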
# Lowercase the text, since TextBlob is case-sensitive
df['text'] = df['text'].str.lower() 
# Many words of three characters or fewer appear in the topic model; remove them
df['text'] = df['text'].map(lambda x: re.sub(r'\b\w{1,3}\b', '', str(x)))
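
When removing a.m./p.m. markers with a regular expression, note that an unescaped `.` matches any character, so a pattern such as `a.m` can also match inside ordinary words. The sketch below compares an unescaped and an escaped pattern; the sample sentence and the escaped pattern are assumptions chosen for illustration, not part of the assignment data.

```python
import re

# Hypothetical sample sentence (digits already stripped, as in the cleaning step).
sample = "the palm farm opens at a.m. sharp"

# Unescaped dots match any character, so "alm" in "palm" and "arm" in "farm" are removed too.
loose = re.sub('a.m|p.m', '', sample)

# Escaped dots match only literal periods; \b anchors the match at a word start.
strict = re.sub(r'\b[ap]\.m\.?', '', sample)

print(loose)
print(strict)
```

The unescaped version damages "palm" and "farm", while the escaped version removes only the time marker.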

In [9]:
df.head()

Unnamed: 0,text
0,avery dennisons results likely gain back...
1,tuggers topper industrial carts help transpor...
2,currently adding following games: everything...
3,quote currently adding following games: eve...
4,quote currently adding following games: eve...


In [10]:
pd.set_option('display.max_colwidth', 100)
df.head(5)

Unnamed: 0,text
0,"avery dennisons results likely gain back solid momentum segments, focus productivity, ..."
1,tuggers topper industrial carts help transport materials between manufacturing plants warehous...
2,currently adding following games: everythingstaken free beetles space felony after death ha...
3,quote currently adding following games: everythingstaken free beetles space felony after de...
4,quote currently adding following games: everythingstaken free beetles space felony after de...


## Present N Top Topics in Articles

#### TextBlob

In [11]:
# http://stevenloria.com/finding-important-words-in-a-document-using-tf-idf/

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)
# tf(word, blob) computes "term frequency" which is the number of times a word appears in a document blob, 
# normalized by dividing by the total number of words in blob. We use TextBlob for breaking up the text into words 
# and getting the word counts.


def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob.words)
# n_containing(word, bloblist) returns the number of documents containing word. 
# A generator expression is passed to the sum() function.


def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))
# idf(word, bloblist) computes "inverse document frequency" which measures how common a word is 
# among all documents in bloblist. The more common a word is, the lower its idf. 
# We take the ratio of the total number of documents to the number of documents containing word, 
# then take the log of that. Add 1 to the divisor to prevent division by zero

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)
# tfidf(word, blob, bloblist) computes the TF-IDF score. It is simply the product of tf and idf.
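
To make the scoring concrete, here is a small worked example of the same formulas on a toy corpus. It is only a sketch: the three token lists are made up, and plain Python lists stand in for the TextBlob objects. It also shows a side effect of adding 1 to the divisor: a word that appears in every document gets a negative idf.

```python
import math

# Hypothetical toy corpus; each document is a list of tokens.
docs = [
    ["amazon", "sphere", "seattle", "company"],
    ["caterpillar", "share", "company"],
    ["amazon", "market", "company"],
]

def tf(word, doc):
    # Share of the document's tokens that are this word.
    return doc.count(word) / len(doc)

def n_containing(word, docs):
    # Number of documents that contain the word at least once.
    return sum(1 for doc in docs if word in doc)

def idf(word, docs):
    # Same formula as above, including the +1 in the divisor.
    return math.log(len(docs) / (1 + n_containing(word, docs)))

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# "sphere" is in 1 of 3 documents: idf = log(3 / 2), which is positive.
# "amazon" is in 2 of 3 documents: idf = log(3 / 3), which is zero.
# "company" is in 3 of 3 documents: idf = log(3 / 4), which is negative.
for word in ["sphere", "amazon", "company"]:
    print(word, round(tfidf(word, docs[0], docs), 5))
```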

In [12]:
# Wrap each cleaned article in a TextBlob
bloblist = [TextBlob(text) for text in df['text']]

len(bloblist)

95

In [32]:
# Print the five highest-scoring words for each of the first five documents
for i, blob in enumerate(bloblist[:5]):
    print("Top words {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:5]:
        print("\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))

Top words 1
	Word: zacks, TF-IDF: 0.10826
	Word: irobot, TF-IDF: 0.06315
	Word: motley, TF-IDF: 0.0504
	Word: globenewswire, TF-IDF: 0.0504
	Word: fool, TF-IDF: 0.04511
Top words 2
	Word: carts, TF-IDF: 0.11841
	Word: topper, TF-IDF: 0.09868
	Word: industrial, TF-IDF: 0.04664
	Word: operators, TF-IDF: 0.03947
	Word: handling, TF-IDF: 0.03671
Top words 3
	Word: super, TF-IDF: 0.02407
	Word: tower, TF-IDF: 0.02078
	Word: adding, TF-IDF: 0.01949
	Word: dead, TF-IDF: 0.01949
	Word: domi

#### LDA

In [14]:
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

doc_clean = [clean(doc).split() for doc in df['text']]     

In [15]:
# Creating the term dictionary of our corpus, where every unique term is assigned an index. 
dictionary = corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
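
As a rough picture of what the dictionary and the bag-of-words conversion produce, the sketch below rebuilds the idea in plain Python. The toy documents and the names `token2id`, `to_bow`, and `toy_matrix` are hypothetical, and gensim may assign ids in a different order; this only illustrates the shape of the data.

```python
from collections import Counter

# Hypothetical tokenized documents, standing in for doc_clean.
toy_docs = [
    ["plant", "case", "plant"],
    ["market", "company", "plant"],
]

# Assign each unique term an integer id in order of first appearance.
token2id = {}
for doc in toy_docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (term_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

# One list of (term_id, count) pairs per document.
toy_matrix = [to_bow(doc) for doc in toy_docs]

print(token2id)
print(toy_matrix)
```

Each document becomes a sparse list of term ids and counts, which is the form the LDA model consumes.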

In [16]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

In [17]:
warnings.filterwarnings(action='ignore')

In [33]:
# Running and training the LDA model on the document-term matrix for 1 to 6 topics.
for i in range(1,7):
    %time ldamodel = Lda(doc_term_matrix, num_topics=i, id2word = dictionary, passes=50)
    print(*ldamodel.print_topics(num_topics=i, num_words=5), sep='\n')

Wall time: 6.8 s
(0, '0.007*"market" + 0.006*"company" + 0.005*"caterpillar" + 0.005*"plant" + 0.004*"year"')
Wall time: 14.7 s
(0, '0.008*"company" + 0.007*"market" + 0.006*"share" + 0.005*"caterpillar" + 0.005*"city"')
(1, '0.009*"plant" + 0.007*"amazon" + 0.006*"market" + 0.006*"case" + 0.005*"sphere"')
Wall time: 11.3 s
(0, '0.013*"amazon" + 0.008*"sphere" + 0.007*"company" + 0.007*"seattle" + 0.005*"work"')
(1, '0.012*"plant" + 0.010*"caterpillar" + 0.008*"case" + 0.006*"share" + 0.006*"company"')
(2, '0.014*"market" + 0.006*"state" + 0.006*"city" + 0.005*"china" + 0.005*"company"')
Wall time: 13.5 s
(0, '0.017*"amazon" + 0.012*"sphere" + 0.011*"seattle" + 0.007*"monday" + 0.006*"grand"')
(1, '0.013*"caterpillar" + 0.009*"company" + 0.008*"share" + 0.007*"product" + 0.006*"stock"')
(2, '0.020*"plant" + 0.013*"case" + 0.009*"skid" + 0.007*"steer" + 0.007*"wardian"')
(3, '0.014*"market" + 0.008*"company" + 0.006*"city" + 0.006*"state" + 0.005*"china"')
Wall time: 10.9 s
(0, '0.012*"

#### Explain how you selected N and which method you chose.

Since this is a small corpus (just 95 rows), there shouldn't be more than 5 topics. I ran LDA and TextBlob for 5 topics each to compare. While running these models, I noticed that additional data cleaning was needed, so I added those steps to the pre-processing and reran the models.

In analyzing the results, I prefer LDA because its topics make more sense. It seems to take context into account more than TextBlob does.

With regard to the final number of topics, I like the LDA model with 5 topics. It is clearer what each topic is about, and there is no duplication.
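
One possible way to put a number on "duplication" is to measure how much the top words of different topics overlap. The sketch below is an assumption about how such a check could look, not the method used above: it computes the largest pairwise Jaccard overlap between topics, using the top-5 words copied from the 3-topic and 4-topic LDA output shown earlier. The helper names `jaccard` and `max_overlap` are hypothetical.

```python
from itertools import combinations

# Top-5 words per topic, copied from the 3- and 4-topic LDA output above.
topics_by_n = {
    3: [
        {"amazon", "sphere", "company", "seattle", "work"},
        {"plant", "caterpillar", "case", "share", "company"},
        {"market", "state", "city", "china", "company"},
    ],
    4: [
        {"amazon", "sphere", "seattle", "monday", "grand"},
        {"caterpillar", "company", "share", "product", "stock"},
        {"plant", "case", "skid", "steer", "wardian"},
        {"market", "company", "city", "state", "china"},
    ],
}

def jaccard(a, b):
    # Shared words divided by all distinct words across the two topics.
    return len(a & b) / len(a | b)

def max_overlap(topics):
    # Largest overlap between any two topics; lower means less duplication.
    return max(jaccard(a, b) for a, b in combinations(topics, 2))

for n, topics in topics_by_n.items():
    print(n, round(max_overlap(topics), 3))
```

A value near 0 suggests the topics are distinct, while a value near 1 suggests two topics are close to duplicates. Only five words per topic are compared here, so this is a coarse signal at best.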