# **Table of Contents:**

* [Folder setup](#folders)
* [Reading data from MongoDB](#read)
* [Data preprocessing](#preprocessing)
* [Topic modelling](#modelling)

# Folder setup <a class="anchor" id="folders"></a>

In [1]:
import os
import sys

In [2]:
# Make the project's helper modules importable from this notebook
directory_path = os.path.dirname(os.getcwd())
for subfolder in ("utils", "scripts", "notebooks"):
    sys.path.append(os.path.join(directory_path, subfolder))

The MongoDB connection string is stored in a `.env` file.

from dotenv import load_dotenv
load_dotenv()
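
`os.environ.get` returns `None` when a variable is missing, so a missing `.env` entry would otherwise only surface later, when the connection is attempted. A minimal sketch of an early check follows; `require_env` is a hypothetical helper introduced here for illustration, not part of `python-dotenv` or of this project.

```python
import os

def require_env(name):
    """Return the value of environment variable `name`; raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check the .env file")
    return value
```

It could be called as `require_env("MONGODB_URI")` right after `load_dotenv()`, in place of a bare `os.environ.get('MONGODB_URI')`.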

# Reading data from MongoDB <a class="anchor" id="read"></a>

from read_docs import ReadDocs
data_access = ReadDocs(os.environ.get('MONGODB_URI'))

data_access.list_databases()

data_access.list_collections("tweets")

tweets_df = data_access.read_tweets_in_collection("tweets", "global")

# Data preprocessing <a class="anchor" id="preprocessing"></a>

The preprocessing functions standardize the tweets one word at a time.

In [3]:
project_base = os.path.dirname(os.getcwd())
data_folder = os.path.join(project_base, "data")

In [4]:
# Alternative to reading from the MongoDB cluster: load the tweets from a local JSON Lines file
import pandas as pd
tweets_df = pd.read_json(os.path.join(data_folder, "global_twitter_data.json"), lines=True)

In [5]:
from preprocessing import TweetsPreprocessing
tweets_prep = TweetsPreprocessing()

In [7]:
words_list = tweets_prep.preprocess_tweets_df(tweets_df, "full_text")
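
The internals of `TweetsPreprocessing` are not shown in this notebook. As a rough illustration of what word-level standardization of a tweet can look like, here is a minimal sketch using only the standard library. The function name `preprocess_tweet`, the tiny stop-word list, and the exact cleaning steps are all assumptions made for this example and may differ from what `preprocess_tweets_df` does.

```python
import re

# Illustrative stop-word list (assumption); a real pipeline would use a fuller one
STOP_WORDS = {"a", "an", "the", "and", "or", "is", "are", "to", "of", "in", "on", "for", "rt"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")

def preprocess_tweet(text):
    """Lower-case a tweet, strip URLs, mentions and punctuation, and drop stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # remove links
    text = MENTION_RE.sub(" ", text)     # remove @user mentions
    text = NON_LETTER_RE.sub(" ", text)  # remove '#', digits, punctuation and emoji
    return [word for word in text.split() if word not in STOP_WORDS]

sample = "RT @user: Loving the new #Python release! https://example.com"
print(preprocess_tweet(sample))  # ['loving', 'new', 'python', 'release']
```

Applying such a function to every row of the `full_text` column would yield one list of cleaned words per tweet.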

# Topic modelling <a class="anchor" id="modelling"></a>

In [None]:
from topic_modelling import TopicModelling
import pyLDAvis

In [None]:
topic_model = TopicModelling()
tweet_mappings = topic_model.make_dictionary(words_list)

In [None]:
topic_model.save_dictionary(tweet_mappings, 'global_mappings')

In [None]:
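# `processed_df` is never defined in this notebook; `words_list` from the
# preprocessing step is assumed to be the intended input
bow = topic_model.create_bow(words_list, tweet_mappings)
topic_model.serialize_bow('global_bow', bow)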
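
The internals of `TopicModelling` are not shown here. Conceptually, a dictionary maps each distinct word to an integer id, and a bag-of-words corpus stores, for each document, how often each id occurs. The pure-Python sketch below illustrates that idea only; `build_token_ids` and `doc_to_bow` are hypothetical names, and the project's `make_dictionary` and `create_bow` may work differently.

```python
from collections import Counter

def build_token_ids(docs):
    """Map each distinct word to an integer id, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for word in doc:
            token2id.setdefault(word, len(token2id))
    return token2id

def doc_to_bow(doc, token2id):
    """Turn one tokenized document into sorted (word_id, count) pairs."""
    counts = Counter(token2id[word] for word in doc if word in token2id)
    return sorted(counts.items())

docs = [["python", "release", "python"], ["new", "release"]]
token2id = build_token_ids(docs)
bow_corpus = [doc_to_bow(doc, token2id) for doc in docs]

print(token2id)    # {'python': 0, 'release': 1, 'new': 2}
print(bow_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Words that are missing from the mapping are skipped rather than raising an error, which is a design choice of this sketch.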

In [None]:
lda = topic_model.create_lda_model(bow, tweet_mappings)

**TODO**: Look into this visualization; consider passing the text of a single tweet instead of the list built from the entire dataframe.

pyLDAvis.enable_notebook()
display(topic_model.visualize_lda_results(lda, bow, tweet_mappings))

In [None]:
topic_model.save_lda_objects(lda, 'global_lda')

**TODO**: Try loading the saved objects, i.e. the LDA model, the mappings, and the bag-of-words corpus

**TODO**: Pick a topic of interest to investigate (Visualization-guided analysis)

**TODO**: Create a word cloud (visualize topic 2 with it)