# **Table of Contents:**

* [Folder setup](#folders)
* [Reading data from MongoDB](#read)
* [Data preprocessing](#preprocessing)
* [Pickling](#pickling)

# Folder setup <a class="anchor" id="folders"></a>

In [1]:
import os
import sys

In [2]:
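# Parent of the notebook's working directory, i.e. the project root
directory_path = os.path.dirname(os.getcwd())
# os.path.join keeps the paths portable across operating systems
for subfolder in ("utils", "scripts", "notebooks"):
    sys.path.append(os.path.join(directory_path, subfolder))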

The MongoDB connection string is stored in a `.env` file and loaded into the environment with `load_dotenv()`.

from dotenv import load_dotenv
load_dotenv()
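Because `os.environ.get` returns `None` when a variable is missing, a quick check after `load_dotenv()` can surface a misconfigured `.env` file before any connection attempt. The following is a minimal sketch, assuming the variable is named `MONGODB_URI` as in the next section. `require_env` is a hypothetical helper, not part of `python-dotenv`.

```python
import os


def require_env(name):
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add a line such as "
            f"{name}=<connection string> to the .env file"
        )
    return value
```

It could be called as `mongodb_uri = require_env("MONGODB_URI")` in place of the bare `os.environ.get` lookup.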

# Reading data from MongoDB <a class="anchor" id="read"></a>

from read_docs import ReadDocs
data_access = ReadDocs(os.environ.get('MONGODB_URI'))

data_access.list_databases()

data_access.list_collections("tweets")

tweets_df = data_access.read_tweets_in_collection("tweets","global")

# Data preprocessing <a class="anchor" id="preprocessing"></a>

The `TweetsPreprocessing` class provides functions that standardize the tweets one word at a time.
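The implementation of the preprocessing module is not shown in this notebook. As a rough illustration of what word-by-word standardization can look like, here is a minimal sketch that lowercases each token, strips punctuation (keeping a leading `#`), and replaces stop words with empty strings. The function names, the tiny stop-word set, and the exact rules are all assumptions made for illustration, not the project's actual code.

```python
import re

# Tiny illustrative stop-word set (an assumption); a real pipeline would use a fuller list.
STOP_WORDS = {"the", "a", "an", "is", "in", "on", "of", "to", "and"}


def standardize_word(word):
    """Lowercase a word, strip punctuation (keeping a leading '#'), blank out stop words."""
    word = word.lower()
    hashtag = word.startswith("#")
    word = re.sub(r"[^a-z0-9]", "", word)
    if word in STOP_WORDS:
        return ""
    return "#" + word if hashtag and word else word


def standardize_tweet(text):
    """Apply standardize_word to every whitespace-separated token of a tweet."""
    return [standardize_word(token) for token in text.split()]


print(standardize_tweet("The drill in the #Taiwan Strait!"))
# → ['', 'drill', '', '', '#taiwan', 'strait']
```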

In [3]:
project_base = os.path.dirname(os.getcwd())
data_folder = os.path.join(project_base, "data")

In [4]:
# Alternative to reading from the MongoDB cluster: load a local JSON Lines export
import pandas as pd
tweets_df = pd.read_json(os.path.join(data_folder, "global_twitter_data.json"), lines=True)

In [5]:
from preprocessing import TweetsPreprocessing
tweets_prep = TweetsPreprocessing()

In [6]:
words_list = tweets_prep.preprocess_tweets_df(tweets_df, "full_text")

**Processing the tweets DataFrame takes approximately 10 minutes.**

In [7]:
display(words_list)

[['rt',
  '',
  'extra',
  'random',
  'image',
  '',
  'let',
  'focus',
  '',
  'one',
  '',
  'specific',
  'zone',
  '',
  '',
  'western',
  'coast',
  '- gt',
  'longjing',
  'district',
  'taichung',
  '#city',
  '#ta'],
 ['rt',
  '',
  '#chinas',
  'medium',
  'explains',
  '',
  'military',
  'reason',
  '',
  '',
  'area',
  '',
  '',
  'drill',
  '',
  '',
  '#taiwan',
  'strait',
  'read',
  '',
  'label',
  '',
  '',
  'pi'],
 ['china',
  'even',
  'cut',
  '',
  'communication',
  '',
  'dont',
  'anwer',
  'phonecalls',
  '',
  '',
  'u',
  '',
  '',
  'clown',
  '',
  'enters',
  '',
  'stage',
  '',
  'ask',
  '#xijinping',
  '',
  'change',
  'putin',
  'mind'],
 ['putin',
  '',
  '#xijinping',
  '',
  '',
  'told',
  '',
  '',
  'friend',
  'taiwan',
  '',
  '',
  '',
  'vassal',
  'state',
  'include',
  'nuke',
  'much',
  'like',
  '',
  'ukrainian',
  'model',
  '',
  'warn',
  '',
  '',
  '',
  'take',
  'pelosi',
  '',
  'open',
  'china',
  'eye'],
 ['rt',
  '
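The displayed output contains many empty strings, presumably left where words were removed during preprocessing. If these placeholders are not needed downstream, they can be filtered out before pickling. The following is a minimal sketch, assuming `words_list` is a list of token lists as displayed. `drop_empty_tokens` is a hypothetical helper.

```python
def drop_empty_tokens(words_list):
    """Return a copy of words_list with empty-string tokens removed from each tweet."""
    return [[token for token in tweet if token] for tweet in words_list]


sample = [["rt", "", "extra", "random", "image", "", "let", "focus"]]
print(drop_empty_tokens(sample))
# → [['rt', 'extra', 'random', 'image', 'let', 'focus']]
```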

# Pickling <a class="anchor" id="pickling"></a>

In [8]:
import pickle

In [11]:
with open('words_list.pkl', 'wb') as f:
    pickle.dump(words_list, f)
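To reuse the result later without repeating the slow preprocessing step, the pickled list can be read back with `pickle.load`. The following is a minimal sketch, assuming the file name `words_list.pkl` used above. `load_words_list` is a hypothetical helper.

```python
import pickle


def load_words_list(path="words_list.pkl"):
    """Read a pickled object back from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Only load pickle files from trusted sources, since unpickling can execute arbitrary code.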