### Introduction

This notebook contains my Exploratory Data Analysis work on topic modeling. I will build Latent Dirichlet Allocation (LDA) topic vectors from a sample of my documents. To keep the cell output readable, I first suppress Deprecation Warnings.

In [1]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import os
import re
import numpy as np
import pandas as pd
import time

import nltk
import pyLDAvis
import pyLDAvis.gensim_models
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

from nltk.tokenize import RegexpTokenizer

from sklearn.feature_extraction.text import TfidfVectorizer
from wordcloud import WordCloud
from matplotlib import pyplot as plt

import gensim
import string
from gensim import corpora
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

import multiprocessing
num_cores = multiprocessing.cpu_count()

# Change this to where your datasets are located:
os.chdir("D:/Datasets/453_NLP_Final_Project/training/training_full_text")

# Check versions to ensure that there are no compatibility issues:
python_version = !python --version
print("Python Version: ", python_version)
print("Current Directory: ", os.getcwd())
print("Numpy version: ", np.__version__)
print("Pandas version: ", pd.__version__)


Python Version:  ['Python 3.8.8']
Current Directory:  D:\Datasets\453_NLP_Final_Project\training\training_full_text
Numpy version:  1.20.2
Pandas version:  1.2.3


### Functions

Here are some helper functions for cleaning the data before building topic vectors.<br>

For LDA, I use Gensim's multicore implementation (`LdaMulticore`), which spreads training across several CPU worker processes and greatly reduces the runtime required.

In [3]:
def remove_stopwords(tokenized_word_list):
    # Keep only the tokens that are not in the NLTK English stopword set.
    clean_list = [w for w in tokenized_word_list if w not in stop_words]
    return clean_list

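def clean_text(test_string):
    # Strip web addresses first, while the dots that identify them are still present.
    no_urls = re.sub(r'\w{3}\.[A-Za-z]+\.com', "", test_string)
    # Then strip punctuation and runs of digits.
    new_string = re.sub(r'[!"#$%&\'()*+,\-./:;<=>?@\[\\\]^_`{|}~]|\d+', "", no_urls)
    return new_string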

def tokenize_text(test_string):
    # Match words, dollar amounts, or any other run of non-whitespace characters.
    tokenizer = RegexpTokenizer(r'\w+|\$[0-9.]+|\S+')
    tokenized_string = tokenizer.tokenize(test_string)
    return tokenized_string

def get_LDA_model(text_data, num_topics, passes):
    start_time = time.time()
    dictionary = corpora.Dictionary(text_data)
    corpus = [dictionary.doc2bow(text) for text in text_data]
   
    LDA_model = gensim.models.ldamulticore.LdaMulticore(
                               corpus=corpus,
                               num_topics=num_topics,
                               id2word=dictionary,
                               workers=6,
                               passes=passes)
    
    topics = LDA_model.print_topics(num_words=10)
    for x in topics:
        print(x)
        print("\n")
    print("Time Taken: ", time.time() - start_time)
    return LDA_model, topics, corpus, dictionary
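
To make the order of the cleaning steps concrete, here is a minimal, standard-library-only sketch of the same clean, tokenize, filter sequence applied to one sample sentence. The `*_sketch` functions, the tiny `SAMPLE_STOP_WORDS` set, and the simplified regular expressions are assumptions made for illustration; they are not the notebook's functions or NLTK's full stopword list.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list (assumption for this sketch).
SAMPLE_STOP_WORDS = {"the", "of", "and", "to", "in", "our", "for"}

def clean_text_sketch(text):
    # Drop web addresses first, while the dots that identify them are still present.
    text = re.sub(r"www\.[A-Za-z0-9-]+\.com", " ", text)
    # Replace every run of characters that are neither letters nor whitespace with a space.
    return re.sub(r"[^A-Za-z\s]+", " ", text)

def tokenize_sketch(text):
    # Whitespace tokenization is enough once punctuation and digits are gone.
    return text.split()

def remove_stopwords_sketch(tokens, stop_words=SAMPLE_STOP_WORDS):
    # Case-sensitive membership test, mirroring remove_stopwords in the notebook.
    return [t for t in tokens if t not in stop_words]

sample = "Our strategy for growth in 2011: see www.medoilgas.com for the report."
tokens = remove_stopwords_sketch(tokenize_sketch(clean_text_sketch(sample)))
print(tokens)  # ['Our', 'strategy', 'growth', 'see', 'report']
```

Note that the capitalized "Our" survives because the stopword check is case-sensitive.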


#### Loading in Data

Here I load the text files from my working directory (set with `os.chdir` in the imports cell). Adjust the path as necessary.

In [4]:
texts = []
# Take the first 1000 directory entries (fewer if the directory holds fewer files).
filenames = os.listdir()[:1000]
for path in filenames:
    # The context manager closes each file even if reading fails.
    with open(path, encoding='utf8') as file:
        texts.append(file.read().replace('\n', ''))
    
tokenized_texts = []
for text in texts:
    temp_encode = text.encode('ascii', 'ignore')
    temp_decode = temp_encode.decode('ascii')
    x1 = clean_text(temp_decode)
    tokenized_texts.append(tokenize_text(x1))

clean_lists = []
for x in tokenized_texts:
    clean_lists.append(remove_stopwords(x))
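
As a possible alternative to changing the working directory, the loading step could be wrapped in a small function built on `pathlib`. This is only a sketch under stated assumptions: `load_texts`, its `limit` and `suffix` parameters, and the `.txt` extension are hypothetical choices, and joining lines with a space (rather than removing newlines outright) is a deliberate difference from the cell above.

```python
import tempfile
from pathlib import Path

def load_texts(directory, limit=1000, suffix=".txt"):
    """Read up to `limit` files ending in `suffix` from `directory`, in sorted name order."""
    paths = sorted(p for p in Path(directory).iterdir()
                   if p.is_file() and p.suffix == suffix)
    texts = []
    for path in paths[:limit]:
        # errors="ignore" skips bytes that are not valid UTF-8 instead of raising.
        raw = path.read_text(encoding="utf8", errors="ignore")
        # Collapse all whitespace (including newlines) to single spaces,
        # so words on adjacent lines do not fuse together.
        texts.append(" ".join(raw.split()))
    return texts

# Small demonstration on two throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("Annual\nReport", encoding="utf8")
    Path(tmp, "b.txt").write_text("Balanced\nPortfolio", encoding="utf8")
    demo_texts = load_texts(tmp)

print(demo_texts)  # ['Annual Report', 'Balanced Portfolio']
```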


In [5]:
string_unicode = texts[0]
string_encode = string_unicode.encode("ascii", "ignore")
string_decode = string_encode.decode()
print(string_decode)

  04Mediterranean Oil & Gas Plc Annual Report 2011 www.medoilgas.comOur strategy for growthRegional OperatorLeverage our competitive advantage  that lies in the breadth and depth of our Italy-based team that manages the full-value chain of our E&P business together with our AIM-listing, knowledgeable management team and strong support from our key shareholdersFinancial StrengthBeing debt free, we will use the steady income from our onshore and offshore gas production to underwrite our operating costs, support asset maturation and small capital programmes. 05Mediterranean Oil & Gas Plc Annual Report 2011 www.medoilgas.comBUSINESSREVIEWCORPORATEGOVERNANCEFINANCIALSTATEMENTSBalanced PortfolioUse our Resources Factory to our advantage. Grow production and move resources to reserves by maturing the portfolio in support of our production growth targets. Balance frontier exploration with asset maturation and good reservoir managementGrowth OpportunityPrudently invest to de-risk our attractive


In [6]:
LDA_model, topics, corpus, dictionary = get_LDA_model(clean_lists, 10, 10)


(0, '0.011*"The" + 0.008*"US" + 0.007*"financial" + 0.006*"year" + 0.005*"million" + 0.005*"assets" + 0.005*"Group" + 0.004*"value" + 0.004*"per" + 0.004*"cash"')


(1, '0.008*"The" + 0.007*"million" + 0.006*"Group" + 0.006*"financial" + 0.005*"December" + 0.005*"Company" + 0.005*"assets" + 0.004*"year" + 0.004*"value" + 0.004*"cash"')


(2, '0.011*"The" + 0.008*"Group" + 0.006*"December" + 0.005*"Company" + 0.005*"year" + 0.005*"assets" + 0.005*"financial" + 0.004*"value" + 0.004*"cash" + 0.004*"Groups"')


(3, '0.011*"The" + 0.009*"Group" + 0.006*"year" + 0.006*"Company" + 0.006*"Committee" + 0.005*"Directors" + 0.005*"Board" + 0.005*"performance" + 0.004*"value" + 0.004*"assets"')


(4, '0.084*"e" + 0.046*"n" + 0.044*"r" + 0.027*"c" + 0.021*"h" + 0.021*"l" + 0.016*"p" + 0.014*"f" + 0.014*"u" + 0.013*"g"')


(5, '0.009*"The" + 0.008*"year" + 0.006*"directors" + 0.006*"group" + 0.005*"March" + 0.005*"financial" + 0.005*"value" + 0.004*"share" + 0.004*"report" + 0.004*"assets"')


(6, 

### Visualizing Results

Using the pyLDAvis library, we can visualize the topic vectors and see which terms are most likely to constitute each one. This step may need revision: the text has to be cleaned more thoroughly before some of the words can be visualized meaningfully.

In [7]:
lda_display = pyLDAvis.gensim_models.prepare(LDA_model, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)
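
One cleaning step that might help here is case folding. In the topic listings above, "The" appears in most topics, and "Group" and "group" show up as separate terms; NLTK's English stopword list is lowercase, so a case-sensitive check lets "The" through. The sketch below compares the two approaches on a toy token list. The `stop_words_demo` set and the token list are made up for illustration.

```python
from collections import Counter

# Hypothetical stopword subset (assumption); the notebook uses NLTK's English list.
stop_words_demo = {"the", "of", "and"}

tokens = ["The", "Group", "group", "the", "assets", "of", "The", "Group"]

# Case-sensitive filtering, as in remove_stopwords: "The" survives.
case_sensitive = [t for t in tokens if t not in stop_words_demo]

# Lowercase first, then filter: "The" is removed and "Group"/"group" merge.
case_folded = [t.lower() for t in tokens if t.lower() not in stop_words_demo]

print(Counter(case_sensitive))
print(Counter(case_folded))
```

Whether to fold case is a judgment call: it merges variants of the same word, but it also erases the distinction between, say, "US" and "us".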


#### Summary

Some observations: <br>
- Based on the topics above, the stray single- and double-letter tokens appear to be abbreviations.
- We do not necessarily want to remove them; it would be a worthwhile exercise to work out what these acronyms or shorthand forms mean in this case.
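
As a starting point for that exercise, the short tokens could be counted across the tokenized documents to see which ones are frequent enough to investigate. This is a minimal sketch: `short_token_counts` and the `demo_lists` data are hypothetical, and in the notebook the function would presumably be called on `clean_lists`.

```python
from collections import Counter

def short_token_counts(token_lists, max_len=2):
    """Count tokens of at most `max_len` characters across all documents."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(t for t in tokens if len(t) <= max_len)
    return counts

# Toy data standing in for the notebook's tokenized documents.
demo_lists = [["e", "US", "Group", "n", "e"], ["plc", "US", "r", "e"]]
counts = short_token_counts(demo_lists)
print(counts.most_common(2))  # [('e', 3), ('US', 2)]
```

Looking up the most frequent short tokens in their original context should help separate genuine abbreviations from extraction debris.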