## LDA Topic Modeling (5 points) ##

Your Name:

## Assignment Question ##

Using the data set below, perform LDA topic modeling to identify the latent topics it contains.

The number of topics is not fixed; it is up to you to decide how many topics to use.

## Grading Guidelines: ##

You need to show all the steps (code and outputs), from loading the data set to performing topic modeling and deriving topics with keywords.

DO NOT CLEAR THE OUTPUTS (Leave the outputs printed).


## Step 1: Load the dataset

The dataset we'll use is a list of news headlines published over a period of 15 years.

We'll start by loading it from the `abcnews-date-text.csv` file.

In [None]:
#code for step 1
from google.colab import drive
drive.mount('/content/gdrive/')
import os
import pandas as pd

abc = pd.read_csv('/CIS 4120 - NLP/Assignment 2/abcnews-date-text.csv')

abc.head()

Drive already mounted at /content/gdrive/; to attempt to forcibly remount, call drive.mount("/content/gdrive/", force_remount=True).


Unnamed: 0,publish_date,headline_text
0,20030219,aba decides against community broadcasting lic...
1,20030219,act fire witnesses must be aware of defamation
2,20030219,a g calls for infrastructure protection summit
3,20030219,air nz staff in aust strike for pay rise
4,20030219,air nz strike to affect australian travellers


## Step 2: Data Preprocessing ##

We will perform the following steps:

The pre-processing steps do not have to be performed in the order listed.

It is up to you whether you tokenize first, start with another step, or perform several steps at once.

HOWEVER, make sure that all of the steps below are performed and applied to the headline text.

* **Tokenization**
* **Lowercasing**
* **Punctuation removal**
* **Short-word removal** - words that have fewer than 3 characters are removed.
* **Stopword removal**
* **Lemmatization** - words in third person are changed to first person, and verbs in past and future tenses are changed to present tense.
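
As a rough illustration of how these steps combine, here is a minimal sketch that applies them to a single headline using only the standard library. The `preprocess` function and the tiny `STOPWORDS` set are hypothetical names introduced for this example; the notebook itself uses NLTK's full English stopword list. Lemmatization is left out because it needs NLTK's WordNet data.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only;
# the notebook uses NLTK's much larger English list.
STOPWORDS = {"the", "for", "and", "with", "against", "must"}

def preprocess(headline, stopwords=STOPWORDS, min_len=3):
    """Lowercase, strip punctuation, tokenize, then drop short words and stopwords."""
    text = headline.lower()
    text = re.sub(r"[^a-z]+", " ", text)  # replace every run of non-letters with a space
    tokens = text.split()                 # simple whitespace tokenization
    return [t for t in tokens if len(t) >= min_len and t not in stopwords]

print(preprocess("Air NZ staff in Aust strike for pay rise!"))
# ['air', 'staff', 'aust', 'strike', 'pay', 'rise']
```

Doing everything in one pass over each headline avoids rewriting the DataFrame column several times, at the cost of making the individual steps less visible.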

**The lemmatization code is given below; use it for lemmatization.**

In [None]:
# Remove punctuation
import re  # the standard-library module is sufficient for these patterns

def no_punc(text):
  text = re.sub('[^A-Za-z]+', ' ', text)
  return text

abc['headline_clean'] = abc['headline_text'].apply(no_punc)

# Lowercase
abc['headline_clean'] = abc['headline_clean'].str.lower()

# Tokenize
from nltk.tokenize import RegexpTokenizer

# Raw string avoids an invalid-escape warning for \w
re_tk = RegexpTokenizer(r"\w+")

abc['headline_clean'] = abc['headline_clean'].map(re_tk.tokenize)

# Remove words under 3 characters
abc['headline_clean'] = abc['headline_clean'].apply(lambda x: [word for word in x if len(word) >= 3])

# Remove stopwords
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk_stop = set(stopwords.words('english'))  # set gives fast membership tests

# Words under 3 characters were already removed above
abc['headline_clean'] = abc['headline_clean'].apply(lambda text: [word for word in text if word not in nltk_stop])

abc.head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,publish_date,headline_text,headline_clean
0,20030219,aba decides against community broadcasting lic...,"[aba, decides, community, broadcasting, licence]"
1,20030219,act fire witnesses must be aware of defamation,"[act, fire, witnesses, must, aware, defamation]"
2,20030219,a g calls for infrastructure protection summit,"[calls, infrastructure, protection, summit]"
3,20030219,air nz staff in aust strike for pay rise,"[air, staff, aust, strike, pay, rise]"
4,20030219,air nz strike to affect australian travellers,"[air, strike, affect, australian, travellers]"


In [None]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer
import numpy as np
np.random.seed(400)
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [None]:
# Lemmatize
nltk.download('averaged_perceptron_tagger')

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

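# Create the lemmatizer once instead of once per word
lemmatizer = WordNetLemmatizer()

def lem_text(text):
  # Lemmatize every token as a verb (pos='v'); nouns such as "travellers" stay unchanged
  return [lemmatizer.lemmatize(word, pos = 'v') for word in text]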

abc['headline_clean'] = abc['headline_clean'].apply(lem_text)

# Rejoin text
abc['clean_join'] = abc['headline_clean'].apply(lambda word: (' '.join(word)))

abc.head()

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,publish_date,headline_text,headline_clean,clean_join
0,20030219,aba decides against community broadcasting lic...,"[aba, decide, community, broadcast, licence]",aba decide community broadcast licence
1,20030219,act fire witnesses must be aware of defamation,"[act, fire, witness, must, aware, defamation]",act fire witness must aware defamation
2,20030219,a g calls for infrastructure protection summit,"[call, infrastructure, protection, summit]",call infrastructure protection summit
3,20030219,air nz staff in aust strike for pay rise,"[air, staff, aust, strike, pay, rise]",air staff aust strike pay rise
4,20030219,air nz strike to affect australian travellers,"[air, strike, affect, australian, travellers]",air strike affect australian travellers


## Step 3: Bag of words on the dataset

* 3-1. Dictionary

Create a dictionary from pre-processed headline texts containing the number of times a word appears in the training set.

To do that, let's pass your pre-processed headline texts to [`gensim.corpora.Dictionary()`](https://radimrehurek.com/gensim/corpora/dictionary.html) and call it '`dictionary`'.
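
To make the idea of a dictionary concrete, the following plain-Python sketch maps each distinct token to an integer id and counts how many documents contain it. It is not gensim's implementation: `build_vocab` is a hypothetical helper, and ids here follow order of first appearance, which may differ from the order gensim assigns.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to an integer id and count the documents containing it."""
    token2id = {}
    doc_freq = Counter()
    for doc in docs:
        for token in set(doc):            # count each token once per document
            doc_freq[token] += 1
        for token in doc:
            if token not in token2id:     # assign the next free id on first sight
                token2id[token] = len(token2id)
    return token2id, doc_freq

docs = [["air", "staff", "strike"], ["air", "strike", "affect"]]
token2id, doc_freq = build_vocab(docs)
print(token2id)         # {'air': 0, 'staff': 1, 'strike': 2, 'affect': 3}
print(doc_freq["air"])  # 2
```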

In [None]:
# Keep a reference to headline_clean (not a copy) for the next steps
tk_doc = abc['headline_clean']

from gensim import corpora
dictionary = corpora.Dictionary(tk_doc)

print(dictionary[0])

aba


* 3-2. Gensim filter_extremes

[`filter_extremes(no_below=i, no_above=j, keep_n=k)`](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.filter_extremes), where `i` and `k` are integers and `j` is a fraction between 0 and 1.

Filter out tokens that appear in

* fewer than `no_below` documents (absolute number) or
* more than `no_above` documents (fraction of total corpus size, not absolute number).
* After applying the two filters above, keep only the `keep_n` most frequent tokens (or keep all if `keep_n` is `None`).
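
The filtering rule described above can be sketched in plain Python. This is a simplified illustration, not gensim's code: `filter_by_doc_freq` is a hypothetical helper, the document frequencies are made-up toy numbers, and details such as rounding and id reassignment are ignored.

```python
def filter_by_doc_freq(doc_freq, num_docs, no_below=5, no_above=0.5, keep_n=None):
    """Return the tokens kept by a filter_extremes-style rule, most frequent first."""
    max_docs = no_above * num_docs  # convert the fraction into an absolute document count
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    kept.sort(key=lambda t: (-doc_freq[t], t))  # by document frequency, ties alphabetical
    return kept if keep_n is None else kept[:keep_n]

# Toy document frequencies for a hypothetical corpus of 100 documents
doc_freq = {"police": 40, "say": 95, "aba": 1, "court": 30}
print(filter_by_doc_freq(doc_freq, num_docs=100, no_below=5, no_above=0.8))
# ['police', 'court']
```

Here "aba" is dropped for being too rare (1 < 5) and "say" for being too common (95 > 80), which mirrors the intent of removing very rare and very common words before training.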

In [None]:
#Remove very rare and very common words using filter_extremes():
#code for step 3-2
print(len(dictionary))
print()
print(abc.shape)

dictionary.filter_extremes(no_below = 100, no_above = 0.80)

print()
print(len(dictionary))

80734

(1103665, 4)

6042


* 3-3. Gensim doc2bow

* Pass the tokenized words to Gensim `doc2bow` to convert them into bag-of-words vectors.

* Caution: do not perform any further preprocessing (tokenization, lemmatization, etc.) before this step.
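
As a plain-Python sketch of what a bag-of-words vector looks like, the hypothetical `to_bow` helper below turns a token list into sorted `(token_id, count)` pairs and skips tokens that are not in the vocabulary. It only imitates the shape of the `doc2bow` output; the `token2id` mapping is a toy example.

```python
from collections import Counter

def to_bow(tokens, token2id):
    """Convert a token list into sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Toy vocabulary; in the notebook this mapping comes from the gensim dictionary
token2id = {"air": 0, "strike": 1, "pay": 2}
print(to_bow(["air", "strike", "air", "nz"], token2id))
# [(0, 2), (1, 1)]
```

Tokens removed from the vocabulary (for example by frequency filtering) simply vanish from the vector, which is why some headlines can end up with very short or even empty vectors.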

In [None]:
#code for step 3-3

corpus = [dictionary.doc2bow(text) for text in tk_doc]

print(corpus[1])

[(4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)]


## Step 4: Running LDA using Bag of Words ##

Run the LDA model on your final corpus.


In [None]:
#Run LDA model on the final corpus.

#num_topics: the number of latent topics to be extracted from the corpus.
#id2word: mapping from word ids (integers) to words (strings).
# Other parameters are available; see the gensim documentation for details.

#code for step 4.

In [None]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 10, id2word = dictionary, passes = 15)
topics = ldamodel.print_topics(num_words = 5)
for topic in topics:
  print(topic)

(0, '0.028*"world" + 0.024*"change" + 0.016*"cup" + 0.015*"afl" + 0.015*"deal"')
(1, '0.039*"say" + 0.023*"house" + 0.021*"show" + 0.020*"could" + 0.019*"warn"')
(2, '0.040*"australian" + 0.029*"queensland" + 0.020*"coast" + 0.019*"help" + 0.019*"open"')
(3, '0.030*"call" + 0.025*"day" + 0.018*"melbourne" + 0.017*"live" + 0.017*"state"')
(4, '0.020*"north" + 0.019*"win" + 0.018*"donald" + 0.017*"first" + 0.016*"canberra"')
(5, '0.019*"get" + 0.015*"china" + 0.014*"mine" + 0.014*"life" + 0.013*"council"')
(6, '0.049*"australia" + 0.022*"one" + 0.020*"crash" + 0.017*"say" + 0.014*"arrest"')
(7, '0.047*"trump" + 0.031*"sydney" + 0.026*"fire" + 0.020*"perth" + 0.020*"home"')
(8, '0.051*"police" + 0.021*"sex" + 0.016*"jail" + 0.016*"man" + 0.015*"miss"')
(9, '0.035*"man" + 0.031*"new" + 0.031*"charge" + 0.031*"court" + 0.030*"government"')


## Step 5: Label the topics ##

Using the keywords in each topic, what topics were you able to infer?
You should write down the inferred topic labels below.

* 0:
* 1:
* 2:
* ...

In [None]:
for idx, topic in ldamodel.print_topics(-1, num_words=4):
    # Uncomment to print topic numbers and keywords with probabilities:
    # print('Topic: {} Word: {}'.format(idx, topic))

    # Print keywords only (without probabilities)
    key_words_only = " ".join(re.findall("[a-zA-Z]+", topic))
    print(idx, key_words_only)

0 world change cup afl
1 say house show could
2 australian queensland coast help
3 call day melbourne live
4 north win donald first
5 get china mine life
6 australia one crash say
7 trump sydney fire perth
8 police sex jail man
9 man new charge court
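
When labeling topics, it can help to keep each keyword's weight next to it rather than discarding it. The sketch below parses a topic string of the form printed in Step 4 into `(word, weight)` pairs using the standard library; `parse_topic` is a hypothetical helper, and the sample string is taken from topic 8 above.

```python
import re

def parse_topic(topic_string):
    """Split a string like '0.051*"police" + 0.021*"sex"' into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

topic = '0.051*"police" + 0.021*"sex" + 0.016*"jail"'
for word, weight in parse_topic(topic):
    print(f"{word:10s} {weight:.3f}")
```

If the trained model is still in memory, gensim's `show_topic` method should give the same word and probability pairs directly, without any string parsing.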
