<a href="https://colab.research.google.com/github/SproutCoder/text_mining_23/blob/main/project_2_lda_tasks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 2: Topic Modelling using BOW and Latent Dirichlet Allocation

For this project, imagine you are a group of data scientists who want to categorize news articles, without prior knowledge on their topics. Vectorize the documents using the *bag of words* method and apply *Latent Dirichlet Allocation* for topic modelling.


For this project, you might need the following python packages:
- sklearn
- pandas
- gensim
- string

### Enter names and mat. numbers:
- Sebastian Pirozhkov, 421892
- Christopher Kaschny, 447930
- name 3, mat 3
- name 4, mat 4
- name 5, mat 5

## Task 1: Preprocessing

Perform the following preprocessing steps on the given dataset `data.csv`.

1. remove **duplicates**
2. **Tokenization**
3. Remove **stopwords**
4. Remove **punctuations**
5. **Lemmatization**: change words in third person to first person and verbs in past and future tenses into present.
6. **Stemming**: use a porter stemmer


### 1.1 Import Packages and Data

In [1]:
import pandas as pd
import nltk #new line
nltk.download('punkt') #new line
nltk.download('stopwords') #new line
nltk.download('averaged_perceptron_tagger') #new line
nltk.download('wordnet') #new line
from nltk.corpus import stopwords, wordnet
from nltk.stem import porter, WordNetLemmatizer, PorterStemmer
from nltk import word_tokenize
import re
import string

from gensim.corpora import Dictionary
from gensim.matutils import corpus2dense
from gensim.models.ldamodel import LdaModel

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
# import data set
df = pd.read_csv('./articles.csv', encoding='latin')

#remove duplicates
print('shape with duplicates: ', df.shape)
df.drop_duplicates(subset=['Article'], inplace=True)
print('shape having removed duplicates:', df.shape)
df.head()

shape with duplicates:  (2692, 4)
shape having removed duplicates: (2584, 4)


Unnamed: 0,Article,Date,Heading,NewsType
0,KARACHI: The Sindh government has decided to b...,1/1/2015,sindh govt decides to cut public transport far...,business
1,HONG KONG: Asian markets started 2015 on an up...,1/2/2015,asia stocks up in new year trad,business
2,HONG KONG: Hong Kong shares opened 0.66 perce...,1/5/2015,hong kong stocks open 0.66 percent lower,business
3,HONG KONG: Asian markets tumbled Tuesday follo...,1/6/2015,asian stocks sink euro near nine year,business
4,NEW YORK: US oil prices Monday slipped below $...,1/6/2015,us oil prices slip below 50 a barr,business


### 1.2 Define Preprocessing Function

In [3]:
def preprocess(document: str):
  # tokenize
  document = re.sub(r'\W+', ' ', document)  # Replace non-alphanumeric characters with spaces so we also catch strings like "reported.Sources"
  tokens = word_tokenize(document)  # Tokenize the resulting string

  # remove stop words
  stop_words = set(stopwords.words('english')) # create a set of English stopwords
  filtered_tokens = [word for word in tokens if word.lower() not in stop_words] # remove stopwords

  # remove punctuation
  clean_tokens = [token for token in tokens if token not in string.punctuation]
  
  
  # perform POS tagging (needed for lemmatization)
  tagged_tokens = nltk.pos_tag(tokens) # => e.g. ('dogs', 'NNS'); uses Penn Treebank POS, which are not compatible with WordNet

  #convert Penn Treebank POS tags to WordNet POS tags 
  def convert_to_wordnet_pos(penn_pos):
      if penn_pos.startswith('J'):
          return wordnet.ADJ
      elif penn_pos.startswith('V'):
          return wordnet.VERB
      elif penn_pos.startswith('N'):
          return wordnet.NOUN
      elif penn_pos.startswith('R'):
          return wordnet.ADV
      else:
          return wordnet.NOUN  # noun as default

  # lemmatize each token based on its WordNet POS tag
  lemmatizer = WordNetLemmatizer()
  lemmatized_tokens = list(map(lambda x: lemmatizer.lemmatize(x[0], convert_to_wordnet_pos(x[1])), tagged_tokens))

  # stem tokens
  stemmer = PorterStemmer()
  stemmed_tokens = [stemmer.stem(token) for token in lemmatized_tokens]
  return stemmed_tokens

### 1.3 Apply Preprocessing Function to Data

In [4]:
pre_df = pd.DataFrame()
pre_df['Article'] = df['Article'].apply(preprocess)
pre_df['Heading'] = df['Heading'].apply(preprocess)
pre_df['Date'] = df['Date']
pre_df['NewsType'] = df['NewsType']
pre_df.head()

Unnamed: 0,Article,Heading,Date,NewsType
0,"[karachi, the, sindh, govern, have, decid, to,...","[sindh, govt, decid, to, cut, public, transpor...",1/1/2015,business
1,"[hong, kong, asian, market, start, 2015, on, a...","[asia, stock, up, in, new, year, trad]",1/2/2015,business
2,"[hong, kong, hong, kong, share, open, 0, 66, p...","[hong, kong, stock, open, 0, 66, percent, low]",1/5/2015,business
3,"[hong, kong, asian, market, tumbl, tuesday, fo...","[asian, stock, sink, euro, near, nine, year]",1/6/2015,business
4,"[new, york, us, oil, price, monday, slip, belo...","[u, oil, price, slip, below, 50, a, barr]",1/6/2015,business


## Task 2: Latent Dirichlet Allocation

Apply LDA to the tokenized dataset using `gensim`.

1. Build a vocabulary using the gensim `Dictionary`
2. filter outliers: words that occur in less than 20 documents of in more than 50% of the documents.
3. create a BOW matrix for the preprocessed data
4. Train the LDA model on the training data
5. create a dataframe in which you map each document to a) its topic b) the topic's keywords

### 2.1 Build Vocabulary (and filter outliers)

In [5]:
# build vocabulary
docs_list = pre_df['Article'].tolist()
dictionary = Dictionary(docs_list) # Create gensim dictionary 

# filter outliers: words that occur in less than 20 documents of in more than 50% of the documents
dictionary.filter_extremes(no_below=20, no_above=0.5) 

print(dictionary)

Dictionary<2820 unique tokens: ['7', 'add', 'applic', 'bring', 'cent']...>


### 2.3 Create a BOW matrix

The resulting bag-of-words is represented as a list of tuples, each tuple containing two values: the word ID and the frequency of that word in the document.

In [6]:
# this cell is not done
# create a BOW representation
bow_corpus = [dictionary.doc2bow(doc) for doc in docs_list]

print(bow_corpus[:3])

#convert it to a dense matrix 
bow_matrix = corpus2dense(bow_corpus, num_terms=len(dictionary), num_docs = len(bow_corpus))
print(bow_matrix)

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 3), (18, 4), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 2), (27, 1), (28, 1), (29, 1), (30, 1), (31, 2), (32, 1), (33, 2), (34, 2), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 4), (41, 1), (42, 1), (43, 1)], [(1, 1), (4, 2), (6, 3), (7, 1), (12, 1), (13, 1), (19, 1), (20, 1), (23, 1), (25, 1), (26, 1), (29, 1), (31, 3), (33, 1), (36, 1), (44, 13), (45, 1), (46, 2), (47, 15), (48, 2), (49, 1), (50, 1), (51, 1), (52, 2), (53, 3), (54, 2), (55, 1), (56, 1), (57, 1), (58, 4), (59, 1), (60, 2), (61, 1), (62, 1), (63, 2), (64, 6), (65, 4), (66, 1), (67, 2), (68, 1), (69, 1), (70, 2), (71, 1), (72, 6), (73, 1), (74, 2), (75, 6), (76, 3), (77, 2), (78, 2), (79, 1), (80, 2), (81, 1), (82, 1), (83, 1), (84, 3), (85, 1), (86, 3), (87, 2), (88, 2), (89, 1), (90, 1), (91, 1), (92, 2), (93, 1), (94, 1), (95, 1), (96, 1), 

### 2.4 create LDA model

In [7]:
num_topics = 10  # number of topics you want to identify
lda_model = LdaModel(
    corpus=bow_corpus,
    id2word=dictionary,
    num_topics=num_topics,
    passes=20,  # number of passes through the corpus during training
    alpha='auto',  # hyperparameter controlling the sparsity of the topic distribution
    per_word_topics=True,  # return topic distribution for each word in the input corpus
    random_state=42  # for reproducibility
)


In [8]:
# Print the topics learned by the model
for topic_id, topic in lda_model.show_topics(num_topics=num_topics, num_words=5, formatted=True):
    print(f"Topic {topic_id}: {topic}\n")

Topic 0: 0.023*"pakistan" + 0.010*"minist" + 0.010*"countri" + 0.009*"govern" + 0.009*"invest"

Topic 1: 0.020*"i" + 0.019*"hi" + 0.009*"who" + 0.009*"win" + 0.008*"final"

Topic 2: 0.010*"not" + 0.007*"state" + 0.006*"which" + 0.006*"presid" + 0.006*"we"

Topic 3: 0.026*"million" + 0.024*"billion" + 0.014*"percent" + 0.010*"1" + 0.010*"br"

Topic 4: 0.034*"oil" + 0.025*"price" + 0.016*"crude" + 0.014*"barrel" + 0.014*"opec"

Topic 5: 0.019*"run" + 0.018*"wicket" + 0.017*"over" + 0.014*"hi" + 0.014*"first"

Topic 6: 0.051*"pakistan" + 0.025*"team" + 0.017*"khan" + 0.017*"cricket" + 0.014*"match"

Topic 7: 0.033*"percent" + 0.017*"1" + 0.013*"0" + 0.013*"market" + 0.011*"dollar"

Topic 8: 0.025*"cricket" + 0.024*"india" + 0.021*"test" + 0.014*"we" + 0.013*"play"

Topic 9: 0.028*"england" + 0.024*"test" + 0.019*"hi" + 0.019*"pakistan" + 0.014*"first"



### 2.5 Map Topics to Docs 

In [9]:
# Map documents to topics
document_topics = []
for document in bow_corpus:  # Iterate over each document in the preprocessed corpus
    topic_distribution = lda_model.get_document_topics(document)  # Get the topic distribution for the document
    topic = max(topic_distribution, key=lambda x: x[1])[0]  # Find the topic with the highest probability
    document_topics.append(topic)
print('having mapped the topic ids on each doc, the TopicIDs of first five docs: ', document_topics[:5])

# Get topic keywords
num_keywords = 5  # Specify the number of keywords you want to retrieve for each topic
topic_keywords = []
for topic_id in range(lda_model.num_topics):
    keywords_prob = lda_model.show_topic(topic_id, topn=num_keywords)  # Get the top 5 keywords for the topic
    keywords = [keyword for keyword, _ in keywords_prob]
    print('topicID ', topic_id, ' keywords: ', keywords)
    topic_keywords.append(keywords) 

having mapped the topic ids on each doc, the TopicIDs of first five docs:  [0, 7, 7, 7, 4]
topicID  0  keywords:  ['pakistan', 'minist', 'countri', 'govern', 'invest']
topicID  1  keywords:  ['i', 'hi', 'who', 'win', 'final']
topicID  2  keywords:  ['not', 'state', 'which', 'presid', 'we']
topicID  3  keywords:  ['million', 'billion', 'percent', '1', 'br']
topicID  4  keywords:  ['oil', 'price', 'crude', 'barrel', 'opec']
topicID  5  keywords:  ['run', 'wicket', 'over', 'hi', 'first']
topicID  6  keywords:  ['pakistan', 'team', 'khan', 'cricket', 'match']
topicID  7  keywords:  ['percent', '1', '0', 'market', 'dollar']
topicID  8  keywords:  ['cricket', 'india', 'test', 'we', 'play']
topicID  9  keywords:  ['england', 'test', 'hi', 'pakistan', 'first']


### 2.5.1 Map TopicID

In [10]:
df['TopicID'] = document_topics
#df['TopicKeywords'] = topic_keywords

In [11]:
df.head()

Unnamed: 0,Article,Date,Heading,NewsType,TopicID
0,KARACHI: The Sindh government has decided to b...,1/1/2015,sindh govt decides to cut public transport far...,business,0
1,HONG KONG: Asian markets started 2015 on an up...,1/2/2015,asia stocks up in new year trad,business,7
2,HONG KONG: Hong Kong shares opened 0.66 perce...,1/5/2015,hong kong stocks open 0.66 percent lower,business,7
3,HONG KONG: Asian markets tumbled Tuesday follo...,1/6/2015,asian stocks sink euro near nine year,business,7
4,NEW YORK: US oil prices Monday slipped below $...,1/6/2015,us oil prices slip below 50 a barr,business,4


### 2.5.2 retrieve and map topics keywords on docs

In [12]:
# Create a function to retrieve topic keywords based on the topic ID
def get_topic_keywords(row):
    topic_id = row['TopicID']
    if topic_id is not None:
        return topic_keywords[topic_id]
    else:
        return None

# Add the 'TopicKeywords' column to the DataFrame using the function
df['TopicKeywords'] = df.apply(get_topic_keywords, axis=1)

# Now the 'TopicKeywords' column is added to the DataFrame 'df' and filled with the corresponding topic keywords based on the 'TopicID' column


In [13]:
df.head(5)

Unnamed: 0,Article,Date,Heading,NewsType,TopicID,TopicKeywords
0,KARACHI: The Sindh government has decided to b...,1/1/2015,sindh govt decides to cut public transport far...,business,0,"[pakistan, minist, countri, govern, invest]"
1,HONG KONG: Asian markets started 2015 on an up...,1/2/2015,asia stocks up in new year trad,business,7,"[percent, 1, 0, market, dollar]"
2,HONG KONG: Hong Kong shares opened 0.66 perce...,1/5/2015,hong kong stocks open 0.66 percent lower,business,7,"[percent, 1, 0, market, dollar]"
3,HONG KONG: Asian markets tumbled Tuesday follo...,1/6/2015,asian stocks sink euro near nine year,business,7,"[percent, 1, 0, market, dollar]"
4,NEW YORK: US oil prices Monday slipped below $...,1/6/2015,us oil prices slip below 50 a barr,business,4,"[oil, price, crude, barrel, opec]"
