# **Table of Contents:**

* [Folder setup](#folders)
* [Reading data from MongoDB](#read)
* [Data preprocessing](#preprocessing)
* [Topic modelling](#modelling)

# Folder setup <a class="anchor" id="folders"></a>

In [1]:
import os
import sys

In [2]:
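# Add the project's sibling folders to the import path.
# os.path.join keeps this portable; hard-coded "\\" separators only work on Windows.
directory_path = os.path.dirname(os.getcwd())
for subfolder in ("utils", "scripts", "notebooks"):
    sys.path.append(os.path.join(directory_path, subfolder))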

The MongoDB connection string is stored in a `.env` file and loaded into the environment here.

In [None]:
from dotenv import load_dotenv
load_dotenv()

# Reading data from MongoDB <a class="anchor" id="read"></a>

In [None]:
from read_docs import ReadDocs
data_access = ReadDocs(os.environ.get('MONGODB_URI'))
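
`os.environ.get` returns `None` when a variable is missing, so a missing or unloaded `.env` file would only surface later as a connection error. A minimal sketch of a guard that fails early with a clearer message; the helper name `require_env` is hypothetical and not part of this project:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message.

    Hypothetical helper for illustration; it is not part of read_docs.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; "
            "check that the .env file exists and was loaded."
        )
    return value
```

With such a helper, the cell above could read `data_access = ReadDocs(require_env("MONGODB_URI"))`.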

In [None]:
data_access.list_databases()

In [None]:
data_access.list_collections("tweets")

In [None]:
tweets_df = data_access.read_tweets_in_collection("tweets","global")

# Data preprocessing <a class="anchor" id="preprocessing"></a>

The preprocessing functions standardize each tweet one word at a time.
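
The `TweetsPreprocessing` class itself is not shown in this notebook. As a rough sketch of what word-by-word standardization might look like, the example below assumes three rules that are guesses based on the output further down: lowercase every word, drop URLs, and strip every character except ASCII letters, digits, `_` and `#`. The function names are hypothetical, and the real class may behave differently (for example, it appears to keep some newlines).

```python
import re

# A token that consists entirely of a URL is dropped (assumed rule).
URL_PATTERN = re.compile(r"https?://\S+")
# Anything other than lowercase ASCII letters, digits, "_" and "#" is stripped (assumed rule).
DISALLOWED = re.compile(r"[^a-z0-9_#]")


def standardize_word(word: str) -> str:
    """Lowercase a single token and strip characters outside the allowed set."""
    word = word.lower()
    if URL_PATTERN.fullmatch(word):
        return ""
    return DISALLOWED.sub("", word)


def standardize_tweet(text: str) -> str:
    """Apply standardize_word to each whitespace-separated token of a tweet."""
    words = (standardize_word(w) for w in text.split())
    # Tokens that became empty (URLs, pure punctuation) are discarded.
    return " ".join(w for w in words if w)
```

Processing one word at a time keeps each rule easy to test in isolation, at the cost of losing context that spans several words.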

In [3]:
project_base = os.path.dirname(os.getcwd())
# os.path.join avoids the Windows-only "\" separator.
data_folder = os.path.join(project_base, "data")

In [4]:
# Alternative to reading from the MongoDB cluster: load the tweets from a local JSON Lines file
import pandas as pd
tweets_df = pd.read_json(os.path.join(data_folder, "global_twitter_data.json"), lines=True)

In [5]:
from tweet_preprocessing import TweetsPreprocessing

In [6]:
tweets_prep = TweetsPreprocessing()
processed_df = tweets_prep.preprocess_tweets_df(tweets_df, "full_text", 'n')
display(processed_df)

|  | processed |
|---|---|
| 0 | rt i_ameztoy extra random image i\n\nlets focu... |
| 1 | rt indopac_info #chinas media explains the mil... |
| 2 | china even cut off communication they dont anw... |
| 3 | putin to #xijinping i told you my friend taiw... |
| 4 | rt chinauncensored im sorry i thought taiwan w... |
| ... | ... |
| 21995 | rt indopac_info a good infographic of #chinas ... |
| 21996 | rt indopac_info a good infographic of #chinas ... |
| 21997 | reuters thanks #pelosi smart move |
| 21998 | rt indopac_info #taiwan peoples desire for uni... |


# Topic modelling <a class="anchor" id="modelling"></a>

In [7]:
from topic_modelling import TopicModelling
import pyLDAvis

In [8]:
topic_model = TopicModelling()
tweet_mappings = topic_model.make_dictionary(processed_df.values.tolist())

In [9]:
parent_dir = os.path.dirname(os.getcwd())
topic_model.save_dictionary(tweet_mappings,'global_mappings')

True

In [10]:
bow = topic_model.create_bow(processed_df, tweet_mappings)
topic_model.serialize_bow('global_bow',bow)

True
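
The two steps above build a dictionary that maps each distinct token to an integer id, and a bag-of-words corpus that represents each document as `(token_id, count)` pairs. The `make_dictionary` and `create_bow` wrappers are not shown here, so the following is a plain-Python sketch of the idea with hypothetical function names, not the project's or any library's implementation:

```python
from collections import Counter


def build_token_ids(documents):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token_ids = {}
    for tokens in documents:
        for token in tokens:
            token_ids.setdefault(token, len(token_ids))
    return token_ids


def to_bow(tokens, token_ids):
    """Convert one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(token_ids[t] for t in tokens if t in token_ids)
    return sorted(counts.items())


# Each document must already be a list of tokens, not a single string.
example_docs = [
    "thanks pelosi smart move".split(),
    "taiwan taiwan smart".split(),
]
example_ids = build_token_ids(example_docs)
example_bow = [to_bow(doc, example_ids) for doc in example_docs]
```

Word order is discarded in this representation; only the per-document counts remain, and those counts are what an LDA model is trained on.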

In [11]:
lda = topic_model.create_lda_model(bow, tweet_mappings)


**TODO**: Look into this visualization. Consider passing a single tweet's text instead of the entire DataFrame converted to a list.

pyLDAvis.enable_notebook()
display(topic_model.visualize_lda_results(lda, bow, tweet_mappings))

In [13]:
topic_model.save_lda_objects(lda, 'global_lda')

True

**TODO**: Try loading the saved objects, i.e. the LDA model, the dictionary mappings, and the bag-of-words corpus.

**TODO**: Filter the tweets using the list comprehension code that he used. Save the formatted text as a list rather than as a DataFrame.
(Putting it all together)

**TODO**: Pick a topic of interest to investigate (Visualization-guided analysis)

**TODO**: Create a wordcloud (Visualize topic 2 with a word cloud)