<a href="https://colab.research.google.com/github/MikelKN/new-phd-with-rawat/blob/main/topic_modeling_Google_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Source repository: https://github.com/MikelKN/new-phd-with-rawat.git

## Loading the data

In [1]:
from google.colab import drive
drive.mount('/content/drive')

from google.colab import userdata
# userdata.get('secretName')
import os

os.environ["KAGGLE_KEY"] = userdata.get('KAGGLE_KEY')
os.environ["KAGGLE_USERNAME"] = userdata.get('KAGGLE_USERNAME')

!kaggle datasets download -d elgringofrances/english-hate-speech-superset
!kaggle datasets download -d saurabhshahane/fake-news-classification

!unzip english-hate-speech-superset.zip
!unzip fake-news-classification.zip

Mounted at /content/drive
Dataset URL: https://www.kaggle.com/datasets/elgringofrances/english-hate-speech-superset
License(s): MIT
Downloading english-hate-speech-superset.zip to /content
 22% 5.00M/22.8M [00:00<00:00, 24.3MB/s]
100% 22.8M/22.8M [00:00<00:00, 86.7MB/s]
Dataset URL: https://www.kaggle.com/datasets/saurabhshahane/fake-news-classification
License(s): Attribution 4.0 International (CC BY 4.0)
Downloading fake-news-classification.zip to /content
 89% 82.0M/92.1M [00:00<00:00, 223MB/s]
100% 92.1M/92.1M [00:00<00:00, 189MB/s]
Archive:  english-hate-speech-superset.zip
  inflating: en_hf_102024.csv        
Archive:  fake-news-classification.zip
  inflating: WELFake_Dataset.csv     
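
Before running the topic model, it can help to confirm that the extracted CSVs expose the columns the script below relies on (`text` and `labels` for the hate-speech file, `text` and `label` for WELFake). The following is a minimal sketch, assuming pandas is installed; `check_columns` is a hypothetical helper that is not part of the original notebook, and the in-memory CSV is made-up sample data standing in for the real files.

```python
import io

import pandas as pd


def check_columns(csv_source, required):
    """Read only the header row and return the required columns that are missing."""
    header = pd.read_csv(csv_source, nrows=0)
    return [col for col in required if col not in header.columns]


# Hypothetical stand-in for a real file such as /content/WELFake_Dataset.csv.
sample_csv = io.StringIO("title,text,label\nA headline,Some body text,1\n")
missing = check_columns(sample_csv, ["text", "label"])
print("Missing columns:", missing)
```

With the real data, the same helper could be called with a file path instead of the `StringIO` object.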


## Running the script

## Notebook credit:

- [FASTopic by Wu et al. 2024](https://colab.research.google.com/drive/1bduHWL5_bvsl4EYOgimCOmU-7RfnXqrX?usp=sharing)

In [5]:
!pip install topmost
!pip install fastopic
# # !pip install plotly kaleido
# !pip install -U kaleido
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Maximum number of characters shown per column when displaying the datasets
pd.options.display.max_colwidth = 500

import os
import kaggle
import kagglehub
import matplotlib.pyplot as plt
# import plotly

import topmost
from fastopic import FASTopic

# from IPython.display import display

class Read:
    @staticmethod
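    def read_and_filter_dataset(filepath, text_column, max_length=1000):
        """Load a CSV and keep rows whose text has at most max_length characters."""
        data = pd.read_csv(filepath, low_memory=False)
        # Filter rows based on text length; rows with missing text fail the comparison and are dropped
        data = data[data[text_column].str.len() <= max_length]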
        return data

class Topic():
    def __init__(self, num_topics=50):
        self.num_topics = num_topics
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        print(f"Using device: {self.device}")

    def topic_modelling(self, dataset, dataset_name, text):
        # Convert the text column to a list of documents
        docs = dataset[text].tolist()

        # Initialize and fit the topic model
        topic_model = FASTopic(num_topics=self.num_topics, verbose=True)
        topic_top_words, doc_topic_dist = topic_model.fit_transform(docs)

        # Visualize topic hierarchy and topic weights
        fig_hierarchy = topic_model.visualize_topic_hierarchy()
        fig_hierarchy.show()
        # fig_hierarchy.write_image(hierarchy_filename)

        fig_weights = topic_model.visualize_topic_weights(top_n=20, height=500)
        fig_weights.show()

        return topic_top_words, doc_topic_dist, fig_hierarchy, fig_weights

def main():
    reader = Read()
    topic_model = Topic()

    # Authenticate with Kaggle API
    kaggle.api.authenticate()

    print('\nUploading and reading the datasets starts!')

    # Read the CSV files into DataFrames
    hate_speech = reader.read_and_filter_dataset('/content/en_hf_102024.csv', 'text')
    fake_news = reader.read_and_filter_dataset('/content/WELFake_Dataset.csv', 'text')

    # Drop missing values
    fake_news = fake_news.dropna()
    hate_speech = hate_speech.dropna()


    # Extract DataFrame to include only rows with label = 1
    hate_speech_super_df = hate_speech.loc[hate_speech['labels'] == 1, ['text', 'labels']].sample(n=6000, random_state=42).reset_index(drop=True)
    fake_news_df = fake_news.loc[fake_news['label'] == 1, ['text', 'label']].sample(n=6000, random_state=42).reset_index(drop=True)

    print('\n')
    print('Dataset reader completed!')
    print('\n')

    print('Starting senitment analysis!')
    print('\n')

    print('The code for topic modelling begins here...!')
    print('\n')

    top_topics_fake, topic_distribution_fake, figure_hierarchy_fake, figure_weights_fake = topic_model.topic_modelling(fake_news_df, 'fake_news', 'text')
    top_topics_hate, topic_distribution_hate, figure_hierarchy_hate, figure_weights_hate = topic_model.topic_modelling(hate_speech_super_df, 'hate_speech', 'text')

    print('\n')
    print("Script Has finished running!")

if __name__ == "__main__":
    main()


Using device: cuda

Uploading and reading the datasets starts!


2024-11-18 08:34:37,335 - FASTopic - use device: cuda




Dataset reader completed!


Starting senitment analysis!


The code for topic modelling begins here...!




loading train texts: 100%|██████████| 60/60 [00:00<00:00, 5069.05it/s]
parsing texts: 100%|██████████| 60/60 [00:00<00:00, 5077.54it/s]
2024-11-18 08:34:38,249 - TopMost - Real vocab size: 1624
2024-11-18 08:34:38,250 - TopMost - Real training size: 60 	 avg length: 61.750
Training FASTopic:   4%|▍         | 8/200 [00:00<00:02, 64.48it/s]2024-11-18 08:34:38,604 - FASTopic - Epoch: 010 loss: 455.652
Training FASTopic:   8%|▊         | 15/200 [00:00<00:04, 41.49it/s]2024-11-18 08:34:38,969 - FASTopic - Epoch: 020 loss: 432.913
Training FASTopic:  14%|█▍        | 28/200 [00:00<00:06, 26.12it/s]2024-11-18 08:34:39,424 - FASTopic - Epoch: 030 loss: 385.052
Training FASTopic:  18%|█▊        | 37/200 [00:01<00:06, 25.36it/s]2024-11-18 08:34:39,857 - FASTopic - Epoch: 040 loss: 352.167
Training FASTopic:  24%|██▍       | 49/200 [00:01<00:06, 22.53it/s]2024-11-18 08:34:40,315 - FASTopic - Epoch: 050 loss: 332.058
Training FASTopic:  29%|██▉       | 58/200 [00:02<00:06, 22.35it/s]2024-11-18 08:3

Topic 0: waters maxine jury talking matter dershowitz grand racist know ought racists alan calling better race
Topic 1: california falwell those specifically deplorable complete int taking lie sitting liberty jerry muslims referring tell
Topic 2: ease arguments smith plight reject final answer less intellectual enormous song rebekah subtle bury universal
Topic 3: hitler germany attempting reminds spent raid clean journal shut else born infowarsif anything dusseldorf lived
Topic 4: veterans bus cinquina rodgers hospital arnaldo jersey health together then everybody care fortunate legion problems
Topic 5: tehran aerospace exhibition hamid international products javanipress underway capital showcase milad event catching javani iconic
Topic 6: register ivanka voting eric didn york deadline advance votes guess guilty unaware win weekthe missed
Topic 7: sign front please take business until out you will because country does stop received dawn
Topic 8: campaign fundraisers barry miss end dinn

2024-11-18 08:34:47,748 - FASTopic - use device: cuda
loading train texts: 100%|██████████| 60/60 [00:00<00:00, 2346.07it/s]
parsing texts: 100%|██████████| 60/60 [00:00<00:00, 12599.29it/s]
2024-11-18 08:34:48,689 - TopMost - Real vocab size: 626
2024-11-18 08:34:48,693 - TopMost - Real training size: 60 	 avg length: 18.717
Training FASTopic:   2%|▎         | 5/200 [00:00<00:04, 40.71it/s]2024-11-18 08:34:49,639 - FASTopic - Epoch: 010 loss: 122.545
Training FASTopic:  10%|▉         | 19/200 [00:01<00:18,  9.91it/s]2024-11-18 08:34:50,744 - FASTopic - Epoch: 020 loss: 117.930
Training FASTopic:  14%|█▎        | 27/200 [00:02<00:16, 10.68it/s]2024-11-18 08:34:51,504 - FASTopic - Epoch: 030 loss: 108.547
Training FASTopic:  20%|█▉        | 39/200 [00:03<00:08, 19.75it/s]2024-11-18 08:34:51,922 - FASTopic - Epoch: 040 loss: 98.691
Training FASTopic:  24%|██▍       | 48/200 [00:03<00:06, 22.60it/s]2024-11-18 08:34:52,338 - FASTopic - Epoch: 050 loss: 90.224
Training FASTopic:  28%|██▊   

Topic 0: number many muslim kill how yesterday per year takes wonder democrats will terrorist killed terrorists
Topic 1: sounds chief chitownboystownbathhouseboi rahm emmanuel alinskyite looks kike like that insensitive disprove any shot cop
Topic 2: big cheeto ever miss ignorant worst president meant wished ripdonaldtrump matter death rest despise him
Topic 3: sympathy said exactly done receive mind should have boycottchina wuhanvirus spanish flu what chinese you
Topic 4: perps towelhead had radar police course did nothing about got now will battle angry nest
Topic 5: ass america entity chill upstairs realm god dark hate let niggers something from like but
Topic 6: strong comrades enough females without endangering combat self most sensitive relative positions liked extremely last
Topic 7: wuhan knocking originated doors criminal lab communist every country chinesevirus link coming from this chinavirus
Topic 8: million uyghurs decided recently han video definitely among forcing across






Script Has finished running!
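
The script collects the document-topic distribution returned by `fit_transform` but does not inspect it. One way to summarize it is to assign each document its highest-weight topic and count documents per topic. This is a minimal sketch, assuming the distribution is a 2-D array with one row per document and one column per topic; that shape is an assumption, `dominant_topic_counts` is a hypothetical helper, and the toy matrix is made-up data.

```python
import numpy as np


def dominant_topic_counts(doc_topic_dist, num_topics):
    """Return each document's highest-weight topic and the number of documents per topic."""
    dist = np.asarray(doc_topic_dist)
    dominant = dist.argmax(axis=1)  # index of the largest weight in each row
    counts = np.bincount(dominant, minlength=num_topics)
    return dominant, counts


# Toy distribution: 4 documents, 3 topics (illustrative values only).
toy = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
])
dominant, counts = dominant_topic_counts(toy, num_topics=3)
print("Dominant topic per document:", dominant.tolist())
print("Documents per topic:", counts.tolist())
```

To apply this to the notebook's results, `main()` would need to return `topic_distribution_fake` and `topic_distribution_hate`, or call the helper before it exits.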
