[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github.com/ntl2222/HackathonAI/blob/nikos/topic_extractor.ipynb)

# Topic Extraction from an unsupervised dataset
---

### Resources

- [gensim.models.LdaMulticore](https://radimrehurek.com/gensim/models/ldamodel.html#)

# Table of Contents

- [Data](#Data)
- [Text Preprocessing](#Text-Preprocessing)
- [Creating Custom Vocabulary](#Creating-Custom-Vocabulary)
- [Latent Dirichlet Allocation (LDA) for Topic Extraction](#Latent-Dirichlet-Allocation-(LDA)-for-Topic-Extraction)
- [Topic Allocation using FastText](#Topic-Allocation-using-FastText)

---

In [1]:
# for colab

# Data

The dataset we used is: [10000 Restaurant Reviews](#https://www.kaggle.com/datasets/joebeachcapital/restaurant-reviews) from www.kaggle.com. 

In [164]:
# %%writefile scripts/data.py

from pathlib import Path
import requests
import zipfile
import os

def get_data():
    dataset_dir = Path('./data/raw')
    dataset_dir.mkdir(parents=True, exist_ok=True)
    
    notEmpty = any(dataset_dir.iterdir())
    
    if notEmpty:
        print('Dataset exists.')
        
    else:
        try:
            response = requests.get('https://storage.googleapis.com/kaggle-data-sets/3697155/6410731/bundle/archive.zip?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20240224%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240224T212811Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=3630dc54d8e2cee4459eceb6d3414ccb669f04b6996660b9d1a5e20d07f242fde686cf4609e222e2e0d4d34746a77c1c0115c550228a80bfb707e252614ae108f6e2b7f6fa206998100df0c3218b91bd5ad6ea64aa2921b4ecb170f123e0e9e36e9e20a0d772e1689d698fa53a1f1f0f673cc4b94b42919f970c6286bd3d2fa7ecf5e72a14a3c4ba8fd32e2074c97e178e922d8a44280914e36b8371ebc172e122d9db33e6bd83735ba3c3f106224e2eb6566d7885fd87dccd26156f7018ec0d1d4138b55b4d27ba205e5fd68e4b923b4ca8b64bced817e37f9164e3284bab015e05ec046bf635f90f18ebf1fcfcc2ab450851c441deea8700d717f33251be3a')
        
            if response.status_code == 200:
                print('Downloading dataset..')
                with open('archive.zip', 'wb') as f:
                    f.write(response.content)
        
                print('Unzipping...')
                with zipfile.ZipFile('archive.zip', 'r') as zip_ref:
                    zip_ref.extractall(dataset_dir)
                    print('Done.')
        
                os.remove('archive.zip')
    
            else:
                raise requests.exceptions.RequestException(f"Error downloading dataset. status code: {response.status_code}")
                
        except requests.exceptions.RequestException as e:
            print(e)


Overwriting scripts/data.py


In [2]:
download_data()

Dataset exists.


In [3]:
dataset_dir = Path('./data')
csv_dir = dataset_dir / 'raw' / 'Restaurant reviews.csv'

In [4]:
import pandas as pd

df = pd.read_csv(csv_dir, usecols=['Review', 'Rating']).dropna() # make sure we dont have null reviews
df

Unnamed: 0,Review,Rating
0,"The ambience was good, food was quite good . h...",5
1,Ambience is too good for a pleasant evening. S...,5
2,A must try.. great food great ambience. Thnx f...,5
3,Soumen das and Arun was a great guy. Only beca...,5
4,Food is good.we ordered Kodi drumsticks and ba...,5
...,...,...
9995,Madhumathi Mahajan Well to start with nice cou...,3
9996,This place has never disappointed us.. The foo...,4.5
9997,"Bad rating is mainly because of ""Chicken Bone ...",1.5
9998,I personally love and prefer Chinese Food. Had...,4


In [76]:
# remove rows from column Review if they are not of type string
df = df.drop(df[df['Review'].apply(lambda x: not isinstance(x, str))].index)
# remove rows that contain only special symbols and not words
df = df[~df['Review'].str.contains(r'^[\W_]+$')]

In [77]:
# check for duplicates or missing values
duplicate_index = df.index[df.index.duplicated()]
print('Duplicates:')
print(len(duplicate_index))

print('\nMissing indexes:')
missing_index = set(range(len(df))) - set(df.index)
print(len(missing_index))

Duplicates:
0

Missing indexes:
56


In [78]:
df = df.reindex(range(len(df)))
missing_index = set(range(len(df))) - set(df.index)
df = df.dropna()
print('Missing indexes:')
print(len(missing_index))

Missing indexes:
0


## Text Preprocessing

In [9]:
from typing import List
import re 

def remove_url(text: str) -> str:
    text = re.sub(r"http\S+", "", text)
    return text

* We would also like to handle the emojis that occur in the reviews but without deleting them completely, since they carry a great deal of information in their context. We will instead replace them with the corrensponding text.

In [10]:
import demoji

def replace_emoji(text: str) -> str:
    emojis = demoji.findall(text)

    for emoji in emojis:
        text = text.replace(emoji, ' ' + emojis[emoji].split(':')[0])

    return text

In [11]:
review = df.values[65][0]

print('Before:')
print(review)
print('\nAfter:')
print(replace_emoji(review))

Before:
Best place to hangout...😊
Food is really great...
Thanks Papiya for the service...😊
Staff was reallly co-operative...
Ambience is really great, especially PDR(Private Dining Room) is awesome...😍👌🏻

After:
Best place to hangout... smiling face with smiling eyes
Food is really great...
Thanks Papiya for the service... smiling face with smiling eyes
Staff was reallly co-operative...
Ambience is really great, especially PDR(Private Dining Room) is awesome... smiling face with heart-eyes OK hand


* Next we will perform some standard pre-processing steps (like tokenization, removing stop words, etc.) to prepare our reviews to be fed to the model.

In [12]:
def tokenize(text: str) -> List[str]:
    text = text.lower()
    text = text.split(' ')

    return text

In [13]:
review, _ = next(iter(df.values))

print('Before:')
print(review)
print('\nAfter:')
print(tokenize(review))    

Before:
The ambience was good, food was quite good . had Saturday lunch , which was cost effective .
Good place for a sate brunch. One can also chill with friends and or parents.
Waiter Soumen Das was really courteous and helpful.

After:
['the', 'ambience', 'was', 'good,', 'food', 'was', 'quite', 'good', '.', 'had', 'saturday', 'lunch', ',', 'which', 'was', 'cost', 'effective', '.\ngood', 'place', 'for', 'a', 'sate', 'brunch.', 'one', 'can', 'also', 'chill', 'with', 'friends', 'and', 'or', 'parents.\nwaiter', 'soumen', 'das', 'was', 'really', 'courteous', 'and', 'helpful.']


In [14]:
# ntlk.download()
from nltk.corpus import stopwords

def remove_stopwords(text: List[str]) -> List[str]:
    text = [words for words in text if words not in stopwords.words('english')]

    return text

In [15]:
print('Before:')
print(review)
print('\nAfter:')
print(remove_stopwords(tokenize(review)))     

Before:
The ambience was good, food was quite good . had Saturday lunch , which was cost effective .
Good place for a sate brunch. One can also chill with friends and or parents.
Waiter Soumen Das was really courteous and helpful.

After:
['ambience', 'good,', 'food', 'quite', 'good', '.', 'saturday', 'lunch', ',', 'cost', 'effective', '.\ngood', 'place', 'sate', 'brunch.', 'one', 'also', 'chill', 'friends', 'parents.\nwaiter', 'soumen', 'das', 'really', 'courteous', 'helpful.']


In [16]:
# python -m spacy download en_core_web_sm
import spacy

sp = spacy.load("en_core_web_sm")

In [17]:
def lemmatization(text: List[str]) -> List[str]:

    text = ' '.join(text)
    token = sp(text)
    text = [word.lemma_ for word in token]
    
    return text

In [18]:
print('Before:')
print(review)
print('\nAfter:')
print(lemmatization(tokenize(review)))     

Before:
The ambience was good, food was quite good . had Saturday lunch , which was cost effective .
Good place for a sate brunch. One can also chill with friends and or parents.
Waiter Soumen Das was really courteous and helpful.

After:
['the', 'ambience', 'be', 'good', ',', 'food', 'be', 'quite', 'good', '.', 'have', 'saturday', 'lunch', ',', 'which', 'be', 'cost', 'effective', '.', '\n', 'good', 'place', 'for', 'a', 'sate', 'brunch', '.', 'one', 'can', 'also', 'chill', 'with', 'friend', 'and', 'or', 'parent', '.', '\n', 'waiter', 'soumen', 'das', 'be', 'really', 'courteous', 'and', 'helpful', '.']


* Now putting everything together

In [34]:
# %%writefile src/utils/text_clean.py

import nltk
from nltk.corpus import stopwords
from nltk import pos_tag
# nltk.download('stopwords')
# nltk.download('averaged_perceptron_tagger')

import spacy
# python -m spacy download en_core_web_sm

import demoji
import re
from typing import List

class TextCleaner():
    '''Performs various transformation to a string to prepare it for nlp.
        
        Example usage:
            f = TextCleaner(remove_verbs=False)
            clean_text = f.clean('this is an example text')'''
    
    def __init__(self, remove_stopwords: bool = True, remove_verbs: bool = True, apply_lemma: bool = True):
        self.remove_verbs = remove_verbs
        self.remove_stopwords = remove_stopwords
        self.apply_lemma = apply_lemma
        self.tokens = []

        self.sp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

    def tokenizer(self, text: str) -> List[str]:
        '''Transforms input text to lowercase and splits it to tokens'''
        doc = self.sp(text)
        tokens = []

        for token in doc:
            if self.remove_verbs and not token.pos_.startswith('N'):
                continue
        
            if self.remove_stopwords and token.text.lower() in stopwords.words('english'):
                continue

            # Check if the token is not a punctuation or whitespace and is not empty
            if not token.is_punct and not token.is_space and token.text.strip():          
                tokens.append(token.text.lower())
        return tokens

    def lemmatize(self, text: str) -> List[str]:
        '''Returns lemmatized tokens if apply_lemma = True'''
        doc = self.sp(text)
        tokens = []

        for token in doc:
            if self.remove_verbs and not token.pos_.startswith('N'):
                continue
        
            if self.remove_stopwords and token.text.lower() in stopwords.words('english'):
                continue
                
            # Check if the token is not a punctuation or whitespace and is not empty
            if not token.is_punct and not token.is_space and token.text.strip():
                lemma_token = token.lemma_.lower()
                tokens.append(lemma_token)
        return tokens

    def _demoji_replace(self, text: str) -> str:
        '''Replaces emojis with text'''
        emojis = demoji.findall(text)
        for emoji in emojis:
            text = text.replace(emoji, ' ' + emojis[emoji].split(':')[0])    
        return text

    def clean(self, text: str) -> str:
        '''Performs a full transformation of the input text'''
        # Remove urls
        clean_text = re.sub(r"http\S+", "", text)
        # Replace emojis
        clean_text = self._demoji_replace(clean_text)
        # Tokenize & lemmatization
        if self.apply_lemma:
            self.tokens = self.lemmatize(clean_text)
        else:
            self.tokens = self.tokenizer(clean_text)
            
        # Join tokens back into a single string
        cleaned_text = " ".join(self.tokens)
        # self.tokens = tokens
        return cleaned_text


In [180]:
df['Review'] = df['Review'].astype(str)

# f = TextCleaner(remove_verbs=True)
# df['cleaned-reviews'] = df['Review'].map(lambda review: f.clean(review))
# df.to_csv(dataset_dir / 'processed' /'clean_no_verbs.csv', index=False)

# Creating Custom Vocabulary

In [118]:
import pandas as pd
df = pd.read_csv(dataset_dir / 'processed' /'clean_no_verbs.csv')
df

Unnamed: 0,Review,Rating,cleaned-reviews
0,"The ambience was good, food was quite good . h...",5,ambience food lunch cost place sate brunch fri...
1,Ambience is too good for a pleasant evening. S...,5,ambience evening service food experience kudo ...
2,A must try.. great food great ambience. Thnx f...,5,food ambience service recommendation music bac...
3,Soumen das and Arun was a great guy. Only beca...,5,guy behavior sincerety food course place
4,Food is good.we ordered Kodi drumsticks and ba...,5,food drumstick basket biryani thank ambience
...,...,...,...
9883,I am amazed at the quality of food and service...,4,quality food service place ambience location p...
9884,The food was amazing. Do not forget to try 'Mo...,4.5,food sizzler staff chicken town heart
9885,We ordered from here via swiggy:\n\nWe ordered...,4,swiggy mushroom quantity dish paneer gravy dis...
9886,I have been to this place on a sunday with my ...,1,place friend meal time friend moment 2:15pm ma...


In [119]:
df.loc[:, 'cleaned-reviews'] = df['cleaned-reviews'].astype(str)
reviews = df['cleaned-reviews'].values.tolist()

In [57]:
# %%writefile src/utils/vocab.py

import torchtext
from torchtext.vocab import vocab
import gensim.corpora as corpora

from collections import Counter, OrderedDict
from typing import List, Dict, Union

class CustomVocab(torchtext.vocab.Vocab):
    '''Creates a custom vocabulary from a list of strings and provides various information 
        about it in the form of attributes.'''
    
    def __init__(self, document: Union[List[str], str]):
        super(CustomVocab, self).__init__(None)
        
        self.rawText = document
        self.tokens = []
        self.word_freqs = []
        self.vocab = self._create_vocab(document)
        self.size = len(self.word_freqs)
        self.id2word = []
        self.bow = self._bag_of_words(document)

    def __len__(self):
        return len(self.word_freqs)

    def _create_vocab(self, document):
        tokens = self._get_tokens(document)
        orderedDict = self._get_word_freq(tokens)
        
        vocab = torchtext.vocab.vocab(ordered_dict=orderedDict, min_freq=1)
        vocab.set_default_index(-1)
        return vocab

    def _get_tokens(self, document):
        if isinstance(document, str):
            document = [document]

        tokens = []
        for word in document:
            token = word.split(' ')
            tokens.extend(token)

        self.tokens = tokens
        return tokens

    def _get_word_freq(self, tokens):       

        counter = Counter(tokens)
        sort_counter = sorted(counter.items(), key=lambda x: x[1], reverse=True)
        self.word_freqs = sort_counter
        return OrderedDict(counter)

    def _bag_of_words(self, document):
        words = []
        for doc in document:
            words.append([token for token in doc.split(' ')])
    
        self.id2word = corpora.Dictionary(words)
        
        return [self.id2word.doc2bow(word) for word in words]


In [58]:
vocab = CustomVocab(reviews)

In [59]:
tokens = vocab.tokens
freq = vocab.word_freqs
stoi = vocab.get_stoi()
bow = vocab.bow
id2word = vocab.id2word
vocab_size = vocab.size

In [60]:
len(bow), len(id2word)

(9888, 6944)

In [61]:
vocab_size

6944

# Latent Dirichlet Allocation (LDA) for Topic Extraction
We use an LDA model from gensim library to find relevant topics in our dataset.

In [78]:
# %%writefile src/models/lda_model.py

from gensim.models import LdaMulticore
from src.utils.vocab import CustomVocab

import os
from typing import List, Dict

def extraxt_topics(document: List[str],
                    num_topics: int = 3,
                    passes: int = 1,
                    iterations: int = 100) -> Dict[int, str]:
    '''Initializes and trains the lda model.
       Returns the top n topics of the document.'''

    vocab = CustomVocab(document)
    id2word = vocab.id2word
    bow = vocab.bow
    

    num_cores = os.cpu_count()

    # initialize and train lda model
    lda_model = LdaMulticore(corpus=bow, id2word=id2word,
                             num_topics=1,
                             passes=passes,
                             iterations=iterations,
                             workers=num_cores)
                             
    lda_model.save('data/saved_models/lda_model')

    # get the top n topics
    labels = {id: topic[0] for id, topic in enumerate(lda_model.show_topic(0, topn=num_topics))}

    return labels
                        

In [63]:
labels = extraxt_topics(reviews)

In [64]:
for _, topic in labels.items():
    print(topic)

food
place
service


# Topic Allocation using FastText

In [163]:
# %%writefile src/models/fastText_model.py

from gensim.models import FastText

from typing import List, Dict
import numpy as np
import os

def train_model():
    '''Initialize and train a FastText model.'''
    # initialize
    fasttext_model = FastText(vector_size=100, window=3, min_count=1, workers=os.cpu_count(), sg=0)
    fasttext_model.build_vocab_from_freq(word_freq=dict(freq))
    # train & save model
    fasttext_model.train(corpus_iterable=reviews, total_examples=len(reviews), epochs=20)
    # fasttext_model.save('data/saved_models/fastText_model')

    return fasttext_model


def match_topics(document: List[str], labels: Dict[int, str]) -> Dict[str, str]:
    '''Matches each review to the most relevant topic.'''
    topic_dict = {}
    for row, review in enumerate(reviews):
        prob = []
        for _, topic in labels.items():
            prob.append(fasttext_model.wv.n_similarity(review, topic))
            topic_dict[review] = labels[np.argmax(prob)]
            
    return topic_dict

Overwriting src/models/fastText_model.py


In [135]:
fasttext_model = train_model()
topic_dict = match_topics(reviews, labels)

In [158]:
data = list(topic_dict.items())
topic_df = pd.DataFrame(data=data, columns=['cleaned-reviews', 'topic'], dtype=str)
topic_df

Unnamed: 0,cleaned-reviews,topic
0,ambience food lunch cost place sate brunch fri...,place
1,ambience evening service food experience kudo ...,service
2,food ambience service recommendation music bac...,service
3,guy behavior sincerety food course place,service
4,food drumstick basket biryani thank ambience,service
...,...,...
8564,quality food service place ambience location p...,service
8565,food sizzler staff chicken town heart,place
8566,swiggy mushroom quantity dish paneer gravy dis...,service
8567,place friend meal time friend moment 2:15pm ma...,place


In [159]:
topic_df.to_csv(dataset_dir / 'processed' / 'topics.csv', index=False)