# Analysing Twitter Data on Insomnia

This is my third-year Computer Science project at the University of Manchester. 

This notebook contains the code for topic modelling with the BERTopic model on an existing file of tweets. It also provides several visualisations of the resulting topics to support analysis of the dataset.

To train the model, a sample of the whole dataset can be taken, because memory constraints may prevent using all of the data, depending on the available hardware. The sample can also be filtered by subjectivity or restricted to tweets predicted with a given sentiment label (e.g. only positively labelled tweets can be chosen for analysis).

The model can be trained either on the whole sample at once (which may run into memory constraints) or incrementally (online). However, online training seems to produce worse results than the first method. More details are provided in the corresponding sections below.

The code was developed using the **Google Colab** platform.

## Essential Things to Have to Run this Notebook

To run this notebook:
1. Set **BASE_PATH**, the project directory from which the dataset is accessed.
2. Make sure you have a JSON (or at least a CSV) file of tweets on insomnia. If not, first run the **TweetCollector** notebook to collect tweets. You can update the file names in the **Define Constants** section if preferred.
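
A quick pre-flight check can catch a wrong **BASE_PATH** or a missing tweets file before any heavy cell runs. The sketch below is only illustrative: `find_missing_files` is a hypothetical helper that is not part of the notebook, and a temporary directory stands in for the real project directory.

```python
import os
import tempfile


def find_missing_files(base_path: str, file_names: list) -> list:
    """Return the names from file_names that do not exist under base_path."""
    return [name for name in file_names
            if not os.path.isfile(os.path.join(base_path, name))]


# Usage with a throwaway directory standing in for BASE_PATH
with tempfile.TemporaryDirectory() as demo_base_path:
    with open(os.path.join(demo_base_path, "uni_data.json"), "w") as handle:
        handle.write("[]")

    missing = find_missing_files(demo_base_path, ["uni_data.json", "predictions_data.json"])
    print(f"Missing files: {missing}")
```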

This notebook was created by consulting the official [BERTopic documentation](https://maartengr.github.io/BERTopic/index.html). 

© 2023 Lukas Rimkus 

# Connect to the Google Drive

First, connect to Google Drive so that the tweets files can be read from and stored there.

If another platform is used to run the notebook, comment this cell out. 

In [1]:
from google.colab import drive, files

colab_path = '/content/drive'
drive.mount(colab_path)

Mounted at /content/drive


# Install and Import Required Libraries for Tweets Topic Modelling

## Install Libraries

In [2]:
!pip install bertopic

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bertopic
  Downloading bertopic-0.14.1-py2.py3-none-any.whl (120 kB)
Collecting sentence-transformers>=0.4.1
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting umap-learn>=0.5.0
  Downloading umap-learn-0.5.3.tar.gz (88 kB)
  Preparing metadata (setup.py) ... done
Collecting hdbscan>=0.8.29
  Downloading hdbscan-0.8.29.tar.gz (5.2 MB)

## Import Libraries

In [3]:
import os
import re
import time
from datetime import datetime
import json
import requests

from textblob import TextBlob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer, OnlineCountVectorizer
from bertopic.representation import KeyBERTInspired

from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.cluster import KMeans, MiniBatchKMeans, Birch
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import IncrementalPCA
from sentence_transformers import SentenceTransformer

# Define Constants and Configurations  

**Change BASE_PATH to your own location on Google Drive**.

In [4]:
BASE_PATH = "/content/drive/MyDrive/Third Year Project"

json_file_name = "uni_data.json"
model_name = "bertopic_model"
topics_file_name = "topic_modelling_tweets.json"
labelled_topics_file_name = "labelled_topic_modelling_tweets.json"
predicted_tweets_file_name = "predictions_data.json"

json_file_path = f"{BASE_PATH}/{json_file_name}"  # this contains all collected tweets
model_save_path = f"{BASE_PATH}/{model_name}"  # path to save a trained model
save_topic_tweets_paths = f"{BASE_PATH}/{topics_file_name}"  # this contains tweets used for training BERTopic model
labelled_topic_tweets_paths = f"{BASE_PATH}/{labelled_topics_file_name}"  # this contains all tweets labelled with topics
predicted_tweets_file_path = f"{BASE_PATH}/{predicted_tweets_file_name}"  # this contains the whole tweets dataset with predicted sentiments 

sentiments_labels = {"NEGATIVE": 0, "NEUTRAL": 1, "POSITIVE": 2}

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', 1500)

# Reading Dataset of Tweets 

In [5]:
def read_json_dataset(json_path: str) -> tuple:
    """
    This method reads a JSON tweets dataset at the given path. 
    A (success flag, dataframe) pair is returned; the dataframe is None if the file does not exist.     
    """
    file_exists = os.path.exists(json_path)
    if not file_exists:
        print(f"There is no file at: {json_path}")
        return False, None

    # Read the dataset
    tweets_df = pd.read_json(json_path, orient="records")

    return True, tweets_df
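
`pd.read_json(..., orient="records")` expects a JSON array in which each element is one row. The sketch below uses only the standard library to show that layout with the column names that appear in this notebook's output (`Publish Date`, `Location`, `Tweet`). The two rows are invented for illustration.

```python
import json

# Invented rows; "Publish Date" is a Unix timestamp in milliseconds
records = [
    {"Publish Date": 1675602806000, "Location": "Manchester", "Tweet": "example tweet one"},
    {"Publish Date": 1670388314000, "Location": None, "Tweet": "example tweet two"},
]

# Serialise as a JSON array of objects: the "records" orientation
serialised = json.dumps(records, indent=4)

# Reading it back gives one dict per row, with the keys acting as column names
loaded = json.loads(serialised)
columns = list(loaded[0].keys())
print(columns)
```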

# Subjectivity Filtering

In [6]:
def filter_subjective_tweets(tweets_df: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """
    This method removes tweets which are considered to not be subjective. It is done 
    using TextBlob subjectivity property which gives values [0, 1] where 1 is highly 
    subjective. Through experimentation, I chose to use a threshold of 0.1 for tweets  
    to be preserved in the dataframe.
    """
    number_of_original_tweets = len(tweets_df)

    subjectivities = np.zeros(number_of_original_tweets)

    for i, tweet in enumerate(tweets_df["Tweet"]):
        tweet_blob = TextBlob(tweet)
        subjectivity = tweet_blob.subjectivity
        
        subjectivities[i] = subjectivity

    tweets_df["Subjectivity"] = subjectivities

    subjective_tweets_df = tweets_df[tweets_df["Subjectivity"] > threshold].copy()
    tweets_df.drop(columns=["Subjectivity"], inplace=True)
    subjective_tweets_df = subjective_tweets_df.reset_index()

    return subjective_tweets_df 
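
The filter keeps a tweet only when its subjectivity score is strictly greater than the threshold. The sketch below isolates that logic without requiring TextBlob: `toy_subjectivity` is a made-up stand-in scorer (it is not how TextBlob computes subjectivity), `keep_subjective` is an illustrative name, and the example tweets are invented.

```python
from typing import Callable, List

# Made-up opinion words; a stand-in for a real subjectivity lexicon
OPINION_WORDS = {"hate", "love", "awful", "great", "terrible"}


def toy_subjectivity(text: str) -> float:
    """Fraction of words that are in OPINION_WORDS, a value in [0, 1]."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(word in OPINION_WORDS for word in words) / len(words)


def keep_subjective(texts: List[str], score_fn: Callable[[str], float], threshold: float) -> List[str]:
    """Keep only the texts whose score is strictly above the threshold."""
    return [text for text in texts if score_fn(text) > threshold]


tweets = [
    "i hate sleepless nights",   # 1 of 4 words is an opinion word
    "woke up at four again",     # no opinion words
]
kept = keep_subjective(tweets, toy_subjectivity, threshold=0.1)
print(kept)
```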

# Reading Tweets Dataset

In [7]:
def read_saved_dataset_for_topic_modelling(path: str=save_topic_tweets_paths) -> pd.DataFrame:
    """
    This method reads a dataset already tested and saved for topic modelling, which is of a tested size (150,000 works well) 
    for the model to handle.
    A dataframe of the dataset is returned. 
    """
    _, tweets_df = read_json_dataset(path)
    tweets_df.dropna(subset=['Tweet'], inplace=True)

    return tweets_df

def take_random_tweets_sample(path: str=json_file_path, sample_size: int=150000, random_state: int=10) -> tuple:
    """
    This method reads a big dataset of tweets and takes a sample from it, because the dataset 
    is too big to be handled as a whole due to memory concerns.
    Dataframes of the selected and unselected tweets are returned. 
    """
    # Use the given path rather than the global default
    _, tweets_df = read_json_dataset(path)

    # Remove rows with empty tweets
    tweets_df.dropna(subset=['Tweet'], inplace=True)

    selected_tweets_df, unselected_tweets_df = take_randomly_training_tweets(tweets_df, sample_size, random_state)

    return selected_tweets_df, unselected_tweets_df


def take_saved_subjective_tweets(threshold: float=0.1) -> pd.DataFrame:
    """
    This method takes subjective tweets from the dataset which was already tested and 
    saved for topic modelling. 
    A dataframe of the subjective tweets is returned. 
    """
    # The reader returns a single dataframe, so there is nothing to unpack
    tweets_df = read_saved_dataset_for_topic_modelling()

    tweets_df = filter_subjective_tweets(tweets_df, threshold=threshold)

    return tweets_df


def take_random_subjective_tweets_sample(path=json_file_path, sample_size=150000, threshold=0.1, random_state: int=10) -> tuple:
    """
    This method reads a big dataset of tweets, finds the subjective tweets and takes a 
    sample from them if there are enough tweets. The full dataset may be too big to be 
    handled as a whole due to memory concerns.
    Dataframes of the selected and unselected tweets are returned. 
    """
    _, tweets_df = read_json_dataset(path)
    tweets_df.dropna(subset=['Tweet'], inplace=True)

    tweets_df = filter_subjective_tweets(tweets_df, threshold=threshold)

    selected_tweets_df, unselected_tweets_df = take_randomly_training_tweets(tweets_df, sample_size, random_state)

    return selected_tweets_df, unselected_tweets_df
    

def read_tweets_by_sentiment(path: str=predicted_tweets_file_path, sample_size: int=150000, sentiment: int=sentiments_labels["NEUTRAL"], random_state: int=10) -> tuple:
    """
    This method reads the tweets dataset with predicted sentiments so that tweets 
    of the given one sentiment are extracted to analyse their topics.   
    Dataframes of the selected and unselected tweets are returned. 
    """
    _, tweets_df = read_json_dataset(path)

    tweets_df = tweets_df[tweets_df["Predicted Sentiment"] == sentiment].copy()

    selected_tweets_df, unselected_tweets_df = take_randomly_training_tweets(tweets_df, sample_size, random_state)

    return selected_tweets_df, unselected_tweets_df


def take_randomly_training_tweets(tweets_df: pd.DataFrame, sample_size: int=150000, random_state: int=10) -> tuple:
    """
    This method takes a random sample from a dataframe. If there are fewer tweets than 
    the sample size, then all of them are taken. A random state can be given to make results reproducible. 
    Dataframes of the selected and unselected tweets are returned. 
    """
    # Add random state for reproducibility
    tweets_df = tweets_df.sample(frac=1, random_state=random_state).reset_index(drop=True)

    if sample_size >= len(tweets_df):
        sample_size = len(tweets_df)
        selected_tweets_df = tweets_df.copy()
        
        # Create an empty dataframe 
        unselected_tweets_df = selected_tweets_df.sample(0).copy()
    else:
        selected_tweets_df = tweets_df[:sample_size].copy()
        unselected_tweets_df = tweets_df[sample_size:].reset_index(drop=True).copy()

    return selected_tweets_df, unselected_tweets_df
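
The sampling step can be summarised as: shuffle with a fixed seed, then split at `sample_size`. Below is a minimal standard-library sketch of the same idea on a plain list. `shuffle_and_split` is an illustrative name, and it uses `random.Random` rather than the pandas sampling used above, so the resulting order will differ from the notebook's.

```python
import random


def shuffle_and_split(items: list, sample_size: int, seed: int = 10) -> tuple:
    """Shuffle a copy of items reproducibly and split it into (selected, unselected)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Slicing past the end is safe, so a too-large sample_size selects everything
    return shuffled[:sample_size], shuffled[sample_size:]


tweet_ids = list(range(10))
selected, unselected = shuffle_and_split(tweet_ids, sample_size=4)
selected_again, _ = shuffle_and_split(tweet_ids, sample_size=4)

print(len(selected), len(unselected))   # sizes of the two parts
print(selected == selected_again)       # whether the same seed gave the same sample
```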

Choose which of the methods above to call, depending on how and what data should be read.

In [8]:
tweets_df, unselected_tweets_df = take_random_tweets_sample(path=json_file_path)
docs = tweets_df["Tweet"].values

tweets_df.head()

Unnamed: 0,Publish Date,Location,Tweet
0,1675602806000,On break/offline,I really feel like im spiraling down to old habbits again. Doubting myself and having sleepless nights bc I cant figure out what i did this time to make people hate me. Im sorry i cant be the Artist i was once. I guess last year just dragged me down too much +
1,1670388314000,,I can’t sleep witch pathetic little loser wants to give himself meaning? \n\nFindom paypig cashslave humanatm finsub cashcow brat goddess findomuk findomusa cuckold
2,1678231474000,Planet: Reach,Idk how it works but if I can't sleep I drink and energy drink and boom im relaxed and ready to fall asleep
3,1671647919000,"Cape Town, South Africa",i’ve taken a sleeping pill maybe thrice in my life. and subsequently passed out for 48 hours 🥴 they are defs not for me
4,1673624716000,Asa Norte,When I can’t sleep I imagine and visualize fantastic places and people. url


In [9]:
print(f"Number of tweets for training: {len(docs)}")

Number of tweets for training: 150000


In [10]:
print(f"Number of unselected tweets for training: {len(unselected_tweets_df)}")
unseen_docs = unselected_tweets_df["Tweet"]

Number of unselected tweets for training: 770095


# BERTopic Topic Modelling

## Topic Modelling Class Definition

In [11]:
class TopicModelling:
    """
    This class defines the topic modelling functionality using BERTopic model. 
    This is used to train the model, visualise and analyse the results.
    """
    def __init__(self, docs: list) -> None:
        """
        Set the docs parameter and initialise some values in the constructor. 
        """
        self.docs = docs
        self.stop_words = self.define_stop_words()
        self.topic_model = None
        self.embeddings = None
        self.loaded_model = False

    def define_stop_words(self):
        """
        Define the stop words to ignore due to their prevalence in the dataset to be used while training the model. 
        """
        my_defined_stop_words = ["url", "sleep", "sleeping", "insomnia", "sleepless", "slept", "asleep", "sleepy"]
        stop_words = list(text.ENGLISH_STOP_WORDS)
        stop_words.extend(my_defined_stop_words)
        return stop_words

    def load_saved_model(self, model_save_path: str=model_save_path) -> None:
        """
        Load the model from the local storage instead.
        """
        self.topic_model = BERTopic.load(model_save_path)
        self.loaded_model = True
        
    def train_model(self) -> None:
        """
        This method builds from the beginning till the end all components of the BERTopic model.

        How it works: https://maartengr.github.io/BERTopic/algorithm/algorithm.html#visual-overview
        """

        # Reduce dimensionality
        umap_model = UMAP(n_neighbors=40, n_components=10, min_dist=0.0, metric='cosine', random_state=42, low_memory=True) 

        # Cluster reduced embeddings
        hdbscan_model = HDBSCAN(min_cluster_size=200, metric='euclidean', cluster_selection_method='eom', prediction_data=True) 

        # Tokenise the documents into unigrams and bigrams, ignoring my defined stop words. 
        vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words=self.stop_words, min_df=10) # Using the CountVectorizer to extract all possible words 
        
        # Create topic representation and reduce the impact of frequent words
        ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

        # The model I use shows better results on average to extract embeddings, but it's slower and bigger
        # https://www.sbert.net/docs/pretrained_models.html
        sentence_model = SentenceTransformer("all-mpnet-base-v2")
        self.embeddings = sentence_model.encode(self.docs)

        # Fine-tune topic representations with KeyBERT
        representation_model = KeyBERTInspired()

        # The final model
        self.topic_model = BERTopic(embedding_model=sentence_model, umap_model=umap_model, hdbscan_model=hdbscan_model, 
                                    vectorizer_model=vectorizer_model, ctfidf_model=ctfidf_model, representation_model=representation_model, 
                                    calculate_probabilities=False, low_memory=True, top_n_words=15, min_topic_size=200, n_gram_range=(1, 2), 
                                    nr_topics="auto")

        # Fit the given documents with produced embeddings 
        self.topic_model.fit(self.docs, self.embeddings)

    def train_model_incrementally(self, step_size:int = 10000) -> None:
        """
        This method is similar to 'train_model' but the main difference is that this method 
        does not have a bottleneck of memory because this trains the model incrementally (online).
        However, the results seem to be worse compared to the 'train_model' because 
        it requires to use a different clustering algorithm (MiniBatchKMeans) and make other
        approximations. That is the tradeoff between performance and memory.  
        """
        # https://maartengr.github.io/BERTopic/getting_started/online/online.html
        
        umap_model = IncrementalPCA(n_components=10)
        
        cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)

        vectorizer_model = OnlineCountVectorizer(stop_words=self.stop_words, decay=0, delete_min_df=10)
        ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

        sentence_model = SentenceTransformer("all-mpnet-base-v2")
        self.embeddings = sentence_model.encode(self.docs)

        representation_model = KeyBERTInspired()

        self.topic_model = BERTopic(embedding_model=sentence_model, umap_model=umap_model, hdbscan_model=cluster_model,
                                    vectorizer_model=vectorizer_model, ctfidf_model=ctfidf_model, representation_model=representation_model, 
                                    calculate_probabilities=False, low_memory=True, top_n_words=15, min_topic_size=200, n_gram_range=(1, 2), 
                                    nr_topics="auto")

        topics = []
        
        # Incrementally fit the topic model by training on specified number of documents at a time
        # Use the documents stored on the instance rather than the global variable
        for i in range(0, len(self.docs), step_size):
            end_index = min(i + step_size, len(self.docs))
            self.topic_model.partial_fit(self.docs[i:end_index], embeddings=self.embeddings[i:end_index])
            topics.extend(self.topic_model.topics_)
        
        self.topic_model.topics_ = topics

    def reduce_topics(self, nr_topics: int=15) -> None:
        """
        This method reduces the overall number of extracted topics for the model.
        """
        self.topic_model.reduce_topics(self.docs, nr_topics=nr_topics)

    def merge_chosen_topics(self, topics_to_merge: list) -> None:
        """
        Merge chosen topics together. 
        """
        self.topic_model.merge_topics(self.docs, topics_to_merge)

    def get_document_info(self) -> pd.DataFrame:
        """
        Return a dataframe consisting of topic specific information for each tweet 
        used for training the model.
        """
        labelled_tweets_df = self.topic_model.get_document_info(self.docs)
        return labelled_tweets_df

    def get_representative_docs(self) -> dict:
        """
        Returns a dict with a topic number as key and value as a list of representative docs.
        """
        representative_docs = self.topic_model.get_representative_docs()
        return representative_docs

    def get_topic_info(self) -> pd.DataFrame:
        """
        Returns a dataframe with the number of rows which correspond to each extracted 
        topic. Gives information about counts per topic, most common words.   
        """
        freq = self.topic_model.get_topic_info()
        return freq

    def visualize_documents(self):
        """
        This visualises tweets on a 2D graph which gives a different colour for each topic.
        """
        # If the model was loaded from local storage, then there is no embeddings array  
        if self.loaded_model:
            return self.topic_model.visualize_documents(self.docs, hide_annotations=True, hide_document_hover=True)
        else:
            return self.topic_model.visualize_documents(self.docs, embeddings=self.embeddings, hide_annotations=True, hide_document_hover=True)

    def visualize_topics(self):
        """
        Show how topics represented are near/far to each other on a 2D graph. 
        """
        return self.topic_model.visualize_topics()

    def visualize_hierarchy(self):
        """
        Visualises a dendrogram of all topics, tells which topics are most similar to each 
        other and could be merged. 
        """
        hierarchical_topics = self.topic_model.hierarchical_topics(self.docs)
        return self.topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

    def visualize_barchart(self, top_n_topics: int=15):
        """
        Visualises the most common representable words for each topic with probabilities for each word.    
        """
        return self.topic_model.visualize_barchart(top_n_topics=top_n_topics)

    def visualize_heatmap(self, n_clusters: int=10):
        """
        Visualises how similar the topics are to each other in a heatmap
        using cosine similarity, where values close to 1 mean very similar.   
        """
        return self.topic_model.visualize_heatmap(n_clusters=n_clusters)
    
    def visualize_topics_over_time(self, date: np.ndarray):
        """
        Visualises all topics on a graph over time with number of tweets of that topic per day. 
        """
        transformed_date = list(map(lambda d: datetime.fromtimestamp(int(d)/1000).strftime("%Y-%m-%d"), date))

        topics_over_time = self.topic_model.topics_over_time(docs=self.docs, 
                                                        timestamps=transformed_date, 
                                                        global_tuning=True, 
                                                        evolution_tuning=True
                                                        )

        return self.topic_model.visualize_topics_over_time(topics_over_time)

    def search_topics_by_keyword(self, keyword: str, top_n: int=5) -> None:
        """
        Provide some statistics (like most probable topic) for the given keyword. 
        """
        similar_topics, similarity = self.topic_model.find_topics(keyword, top_n=top_n)
        print(f"Most similar topics with similarity to keyword '{keyword}': {list(zip(similar_topics, similarity))}")
        
        representative_docs = self.get_representative_docs()
        topic = similar_topics[0]
        print(f"Most similar topic {topic}: {self.topic_model.get_topic(topic)}")
        print(f"Representative documents of topic {topic}: {representative_docs[topic]}")

    def get_topics_representation(self, topic_num: int=0) -> list:
        """
        Gives the most common/representative words of the topic with their probabilities. 
        """
        print(f"Representative documents of topic {topic_num}:")
        
        representative_docs = self.get_representative_docs()
        for i, doc in enumerate(representative_docs[topic_num]):
            print(f"{i + 1}. {doc}")

        return self.topic_model.get_topic(topic_num)

    def get_top_n_words(self, topic_num: int):
        """
        Constructs and returns a string of n most common words for a topic.
        """
        n_words_with_probs = self.topic_model.get_topic(topic_num)

        n_words = " - ".join([word for (word, prob) in n_words_with_probs])
        return n_words
    
    def get_topic_tweets(self, topic_num: int) -> pd.DataFrame:
        """
        Obtain a dataframe of tweets with the specified topic number. 
        """
        labelled_tweets_df = self.get_document_info()
        topic_df = labelled_tweets_df[labelled_tweets_df["Topic"] == topic_num].copy()
        topic_df = topic_df.reset_index()
        return topic_df

    def allocate_topics_to_unseen_tweets(self, unseen_docs: list) -> tuple:
        """
        Predict the most likely topic, with its probability, for each of the given unseen tweets. 
        """
        topics, probs = self.topic_model.transform(unseen_docs)
        return topics, probs

    def construct_topics_df_with_unseen_tweets(self, unseen_docs: list, predicted_topics: np.ndarray, probs: np.ndarray) -> pd.DataFrame:
        """
        Construct a dataframe for predicted topics for new tweets of the same structure the method 
        "get_document_info" for the trained tweets returns. This dataframe is merged with the 
        one generated for the tweets used for training. 
        """
        data = {"Document": unseen_docs, "Topic": predicted_topics}
        predicted_topics_df = pd.DataFrame(data=data)

        freq = self.get_topic_info()
        labelled_tweets_df = self.get_document_info()

        predicted_topics_df["Name"] = predicted_topics_df["Topic"].apply(lambda topic: freq.loc[topic+1, "Name"])
        predicted_topics_df["Probability"] = probs
        predicted_topics_df["Top_n_words"] = predicted_topics_df["Topic"].apply(lambda topic: self.get_top_n_words(topic))
        predicted_topics_df["Representative_document"] = False

        labelled_tweets_topics_df = pd.concat([labelled_tweets_df, predicted_topics_df], axis=0)
        return labelled_tweets_topics_df
    
    def save_model(self, model_save_path: str=model_save_path) -> None:
        """
        Save the model at the given path.
        """
        self.topic_model.save(model_save_path)

    def save_labelled_tweets(self, labelled_tweets_df: pd.DataFrame, labelled_topic_tweets_paths: str=labelled_topic_tweets_paths) -> None:
        """
        Save the provided dataframe at the given path.
        """
        labelled_tweets_df.to_json(labelled_topic_tweets_paths, orient="records", indent=4)


In [12]:
topic_modelling = TopicModelling(docs)
topic_modelling.train_model()


Reduce the number of topics if there are too many of them to be analysed.

In [13]:
topic_modelling.reduce_topics(nr_topics=15)

Uncomment the code below to make predictions on the unselected tweets. However, do so with caution, as there may not be enough memory if hundreds of thousands of tweets or more are processed.  

In [14]:
# predicted_topics, probs = topic_modelling.allocate_topics_to_unseen_tweets(unseen_docs)
# print(len(predicted_topics))
# labelled_tweets_topics_df = topic_modelling.construct_topics_df_with_unseen_tweets(unseen_docs, predicted_topics, probs)
# print(f"Labelled tweets with topics in total: {len(labelled_tweets_topics_df)}")
# labelled_tweets_topics_df.head()
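
One way to limit memory use when labelling many unseen tweets is to pass them to the model in chunks and join the results. The sketch below shows only the chunking pattern: `predict_in_chunks` is a hypothetical helper, and `fake_transform` is a stand-in for a call such as `topic_modelling.allocate_topics_to_unseen_tweets`, assumed here to return one topic and one probability per document. Whether chunking is sufficient in practice depends on the model and the hardware.

```python
from typing import Callable, List, Sequence, Tuple


def predict_in_chunks(docs: Sequence[str],
                      transform: Callable[[Sequence[str]], Tuple[list, list]],
                      chunk_size: int = 10000) -> Tuple[List[int], List[float]]:
    """Apply transform to consecutive chunks of docs and join the outputs."""
    all_topics, all_probs = [], []
    for start in range(0, len(docs), chunk_size):
        chunk_topics, chunk_probs = transform(docs[start:start + chunk_size])
        all_topics.extend(chunk_topics)
        all_probs.extend(chunk_probs)
    return all_topics, all_probs


def fake_transform(chunk: Sequence[str]) -> Tuple[list, list]:
    """Stand-in predictor: topic = document length modulo 3, probability = 1.0."""
    return [len(doc) % 3 for doc in chunk], [1.0] * len(chunk)


unseen = ["a", "bb", "ccc", "dddd", "eeeee"]
demo_topics, demo_probs = predict_in_chunks(unseen, fake_transform, chunk_size=2)
print(demo_topics, demo_probs)
```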

## Get the List of Extracted Topics

In [15]:
freq = topic_modelling.get_topic_info()
freq

Unnamed: 0,Topic,Count,Name
0,-1,59205,-1_depression_tired_stress_awake
1,0,30377,0_took melatonin_melatonin_melatonin gummies_ambien
2,1,19254,1_nap_tired_took nap_cantsleep
3,2,18717,2_sounds_lost_hear_baby
4,3,7320,3_movie_movies_favorite_netflix
5,4,5201,4_headaches_headache_fatigue_tired
6,5,2489,5_hungry_breakfast_eat_eating
7,6,1422,6_excited_tomorrow_tonight_awake
8,7,1072,7_art_draw_creative_creating
9,8,1043,8_twitter_tweet_tweets_awake
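
In this table, topic `-1` collects the tweets that were not assigned to any topic (outliers), and each `Name` joins the topic number and its top words with underscores. As a small worked example, the sketch below splits a name into its parts and computes the outlier share from the counts shown above (59,205 of the 150,000 training tweets). `split_topic_name` is an illustrative helper and assumes that the top words themselves contain no underscores.

```python
def split_topic_name(name: str) -> tuple:
    """Split a name such as '3_movie_movies' into (3, ['movie', 'movies'])."""
    topic_id, *words = name.split("_")
    return int(topic_id), words


topic_id, words = split_topic_name("0_took melatonin_melatonin_melatonin gummies_ambien")
print(topic_id, words)

outlier_id, _ = split_topic_name("-1_depression_tired_stress_awake")

# Share of training tweets left in the outlier topic
outlier_share = 59205 / 150000
print(f"Topic {outlier_id} holds {outlier_share:.1%} of the tweets")
```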


## Visualise Documents on 2D Graph

In [16]:
topic_modelling.visualize_documents()

## Visualise Topics

In [17]:
topic_modelling.visualize_topics()

## Display a Hierarchical Structure of Topics (Dendrogram)

In [18]:
topic_modelling.visualize_hierarchy()

100%|██████████| 13/13 [00:02<00:00,  5.82it/s]


## Visualise Bar Charts of Topics with Most Representative Words

In [19]:
topic_modelling.visualize_barchart(top_n_topics=15)

## Visualise the Heatmap of Topics

In [20]:
topic_modelling.visualize_heatmap(n_clusters=10)
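
As the docstring of `visualize_heatmap` notes, the heatmap is based on cosine similarity between topics. For reference, a minimal pure-Python sketch of cosine similarity follows. The three vectors are invented toy values, not representations taken from the trained model.

```python
import math


def cosine_similarity(u: list, v: list) -> float:
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


topic_a = [1.0, 2.0, 0.0]
topic_b = [2.0, 4.0, 0.0]   # same direction as topic_a
topic_c = [0.0, 0.0, 3.0]   # orthogonal to topic_a

print(cosine_similarity(topic_a, topic_b))
print(cosine_similarity(topic_a, topic_c))
```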

## Visualise Topics Frequencies over Time

In [21]:
topic_modelling.visualize_topics_over_time(date=tweets_df["Publish Date"])
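
The `Publish Date` column holds Unix timestamps in milliseconds, which `visualize_topics_over_time` converts to day strings before counting tweets per day. A standalone sketch of that conversion follows. Unlike the method in the class, which uses the machine's local time zone, this sketch fixes the time zone to UTC so that the result does not depend on where the notebook runs; that choice is an assumption of the sketch, and `ms_to_day` is an illustrative name.

```python
from datetime import datetime, timezone


def ms_to_day(timestamp_ms: int) -> str:
    """Convert a millisecond Unix timestamp to a 'YYYY-MM-DD' string in UTC."""
    return datetime.fromtimestamp(int(timestamp_ms) / 1000, tz=timezone.utc).strftime("%Y-%m-%d")


# First 'Publish Date' value from the sample of tweets displayed earlier
print(ms_to_day(1675602806000))
```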

## Analyse the Selected Topic

In [22]:
topic_modelling.search_topics_by_keyword(keyword="test", top_n=5)

Most similar topics with similarity to keyword 'test': [(10, 0.35716307), (13, 0.33808145), (-1, 0.32690054), (5, 0.3255312), (11, 0.32349613)]
Most similar topic 10: [('financial', 0.42623153), ('pay', 0.4169234), ('beg', 0.3536514), ('afford', 0.32357457), ('paid', 0.31290692), ('need', 0.31110668), ('help', 0.30495873), ('job', 0.28710407), ('business', 0.27671182), ('really need', 0.27433473)]
Representative documents of topic 10: [" boss pls sir am having sleepless nights sir,needs to pay my house rent sir and have dropped my acct number severally,joined ur telegram also but am not favored,pls help me sir.U won't ever face hardship sir. Sodiya oluwashina GTB 0168034960 would be forever gratefu", ' Good evening, Pls sir help me cos I truly need financial help to use start a foodstuff business and pay this debt that has been giving me sleepless nights, pls a little help from you will help me 🙏 #Danlil', ' Pls sir I need money to pay my house rent🙏🏾I have some money but not complete 

In [23]:
topic_num = 3
topic_modelling.get_topics_representation(topic_num=topic_num)
topic_df = topic_modelling.get_topic_tweets(topic_num=topic_num)
topic_df.head(50)

Representative documents of topic 3:
1. Newy choosing "Titanic" as his favorite romantic movie? Not a bad choice.

I've never seen "The Proposal" though (sorry Dermy).

Personally, I'll take "Sleepless in Seattle".

#GoAvsGo
2. Hello Friends!

 suggested a rhythm game called Melatonin to play on stream so... that I will do! Chill Friday stream, come stop by at url
3.   Saving Private Ryan and Forrest Gump are my favorite. Castaway and Big are iconic. Sleepless in Seattle and You've got mail great date night movies and The DaVinci code movies are also very entertaining. Overall I think Tom Hanks has the best Library of movies out of all actors.


Unnamed: 0,index,Document,Topic,Name,Top_n_words,Probability,Representative_document
0,19,Playing vs in insomnia qualifiers lower bracket and then probably LPB lets shoot some heads\n\n#CTRLTheGame url url,3,3_movie_movies_favorite_netflix,movie - movies - favorite - netflix - favourite - watched - watch - stream - relaxing - fav,0.582156,False
1,42,Sleepless in Seattle 💚 url,3,3_movie_movies_favorite_netflix,movie - movies - favorite - netflix - favourite - watched - watch - stream - relaxing - fav,0.555052,False
2,45,I got obsessed with hide and seek by stormzy and I found out the chorus was sang by oxlade and I can’t sleep right now ‘cause I’m trying to hear how that’s oxlade.😭,3,3_movie_movies_favorite_netflix,movie - movies - favorite - netflix - favourite - watched - watch - stream - relaxing - fav,0.624312,False
3,70,I just be up listening to music late at night it don’t be shit else to do when I can’t sleep 😌,3,3_movie_movies_favorite_netflix,movie - movies - favorite - netflix - favourite - watched - watch - stream - relaxing - fav,1.0,False
4,78,"And may also starring Meg Ryan, like ""Sleepless in a Coffin in Seattle""",3,3_movie_movies_favorite_netflix,movie - movies - favorite - netflix - favourite - watched - watch - stream - relaxing - fav,1.0,False
5,135,Can't sleep cause I'm binging ke huy quan videos and crying,3,3_movie_movies_favorite_netflix,movie - movies - favorite - netflix - favourite - watched - watch - stream - relaxing - fav,0.526759,False
6,145,Trust Spotify to let me know a 9th Wonder song has the same beat as sleepless nights.,3,3_movie_movies_favorite_netflix,movie - movies - favorite - netflix - favourite - watched - watch - stream - relaxing - fav,1.0,False
7,256,RIP \nInsomnia and One Step Too Far with Dido are still part of my playlists after decades url,3,3_movie_movies_favorite_netflix,movie - movies - favorite - netflix - favourite - watched - watch - stream - relaxing - fav,1.0,False
8,280,can’t sleep so i’m watching a doc about a german cult in chile that’s making me mad,3,3_movie_movies_favorite_netflix,movie - movies - favorite - netflix - favourite - watched - watch - stream - relaxing - fav,0.541608,False
9,281,In memory of Maxi Jazz best of Faithless on today. 3 year old and 1 year old loving it. Guess they are kings of Insomnia. #MaxiJazz #insomnia,3,3_movie_movies_favorite_netflix,movie - movies - favorite - netflix - favourite - watched - watch - stream - relaxing - fav,1.0,False
