### Topic Modelling using LDA and NMF ###

Datasets: Wikipedia articles and Stratpoint articles. The Wikipedia articles are stored in a SQLite database, whereas the Stratpoint articles are stored in a .json file.

Flow:
1. Data extraction and pre-processing
         
         This includes creating bags of words that are usable for NMF and LDA.
2. Data analysis and visualization
        
        After applying both NMF and LDA, the results are analysed. Note that the main difference between the two datasets is their size, so we will essentially be comparing the results of LDA and NMF on a large corpus versus a relatively small one.
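
To make the "bag of words" idea from step 1 concrete, here is a minimal pure-Python sketch. It is not the notebook's actual implementation: the helper names `build_vocabulary` and `to_bag_of_words` and the toy corpus are assumptions made for illustration. Each tokenised document becomes a list of (token id, count) pairs, which is the kind of input that both LDA and NMF work from.

```python
from collections import Counter


def build_vocabulary(docs):
    """Map each distinct token to an integer id, in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc:
            # setdefault only assigns a new id when the token is unseen
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bag_of_words(doc, vocab):
    """Return sorted (token_id, count) pairs for one tokenised document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


# Hypothetical toy corpus, not taken from either dataset.
toy_docs = [["swift", "core", "data", "swift"], ["data", "team"]]
toy_vocab = build_vocabulary(toy_docs)
toy_bows = [to_bag_of_words(doc, toy_vocab) for doc in toy_docs]
print(toy_bows)
```

Word order is discarded on purpose: only how often each token occurs in a document is kept.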

In [1]:
# Extract stratpoint article data
import json

strat_data = []
with open('strat_articles.json') as json_file:  
    data = json.load(json_file)
    for entry in data:
        print(entry['title'][0],"\n\n",entry['body'],"\n\n")
        strat_data.append(entry['title'][0] + " " + entry['body'])

Swift PH #21: Swift 5, Localization, and Core Data 

 SwiftPH has always been an avenue for iOS developers to share knowledge, find colleagues, and learn more about technology which is why this every month there is a meet up! Last month’s  was hosted by Stratpoint.For this month, three topics were discussed.The first speaker was . She did her presentation using Xcode Playgrounds. Swift 5 was released March 2019, with it came Xcode 10.2.She showed us all of the new features that was introduced with Swift 5. There are no breaking changes in Swift 5 and that it is source code compatible with Swift 4.2 she said. Because of this, migrating legacy projects to Swift 5 will be easy.Because she was using Xcode Playgrounds for her presentation, she easily showed us the new features with comparison to Swift 4.2 syntax. is a Lead Software Engineer at Stratpoint. She’s been developing native iOS applications since 2013. She also completed Data Science bootcamp by an AI inclined company and now, she

In [2]:
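# Connect to the SQLite database that holds the Wikipedia articles
import sqlite3

try:
    connection = sqlite3.connect('../wiki-kaggle-17_18.db')
    cursor = connection.cursor()
    print("Established connection : ", connection)
except sqlite3.Error as error:
    # Report the failure instead of silently ignoring it
    print("Could not connect to the database:", error)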

Established connection :  <sqlite3.Connection object at 0x1103859d0>


In [3]:
# Fetch and check data
import pandas as pd

cursor.execute("SELECT TITLE,SECTION_TITLE,SECTION_TEXT FROM ARTICLES LIMIT 5")
rows = cursor.fetchall()

for row in rows:
    print(row, "\n\n\n")


('Anarchism', 'Introduction', "\n\n\n\n\n\n'''Anarchism''' is a political philosophy that advocates self-governed societies based on voluntary institutions. These are often described as stateless societies, although several authors have defined them more specifically as institutions based on non-hierarchical free associations. Anarchism holds the state to be undesirable, unnecessary, and harmful.\n\nWhile anti-statism is central, anarchism specifically entails opposing authority or hierarchical organisation in the conduct of all human relations, including, but not limited to, the state system.  Anarchism is usually considered an extreme left-wing ideology, and much of anarchist economics and anarchist legal philosophy reflects anti-authoritarian interpretations of communism, collectivism, syndicalism, mutualism, or participatory economics.\n\nAnarchism does not offer a fixed body of doctrine from a single particular world view, instead fluxing and flowing as a philosophy. Many types an

In [4]:
wiki_data = []

cursor.execute("SELECT TITLE,SECTION_TITLE,SECTION_TEXT FROM ARTICLES LIMIT 100000")
rows = cursor.fetchall()

for row in rows:
    wiki_data.append(row[0] + " " + row[1]+" "+row[2])
# Check if parsed correctly
print(wiki_data[69])

Achilles Namesakes * The name of Achilles has been used for at least nine Royal Navy warships since 1744 - both as HMS ''Achilles'' and with the French spelling HMS ''Achille''. A 60-gun ship of that name served at the Battle of Belleisle in 1761 while a 74-gun ship served at the Battle of Trafalgar. Other battle honours include Walcheren 1809. An armored cruiser of that name served in the Royal Navy during the First World War.
* HMNZS ''Achilles'' was a ''Leander''-class cruiser which served with the Royal New Zealand Navy in World War II. It became famous for its part in the Battle of the River Plate, alongside  and . In addition to earning the battle honour 'River Plate', HMNZS Achilles also served at Guadalcanal 1942–43 and Okinawa in 1945. After returning to the Royal Navy, the ship was sold to the Indian Navy in 1948 but when she was scrapped parts of the ship were saved and preserved in New Zealand.
* A species of lizard, ''Anolis achilles'', which has widened heel plates, is na
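
The cell above loads all 100,000 rows into memory with `fetchall()` before concatenating them. As a hedged alternative, the sketch below iterates over the cursor so that rows are consumed one at a time. The column names come from the queries above; the function name `iter_article_texts`, the in-memory database, and its two rows are hypothetical stand-ins for the real file.

```python
import sqlite3


def iter_article_texts(connection, limit):
    """Yield 'title section_title section_text' strings one row at a time."""
    cursor = connection.cursor()
    cursor.execute(
        "SELECT TITLE, SECTION_TITLE, SECTION_TEXT FROM ARTICLES LIMIT ?",
        (limit,),
    )
    # Iterating over the cursor avoids materialising every row up front
    for title, section_title, section_text in cursor:
        yield " ".join((title, section_title, section_text))


# Hypothetical in-memory stand-in for the real database file.
demo_connection = sqlite3.connect(":memory:")
demo_connection.execute(
    "CREATE TABLE ARTICLES (TITLE TEXT, SECTION_TITLE TEXT, SECTION_TEXT TEXT)"
)
demo_connection.executemany(
    "INSERT INTO ARTICLES VALUES (?, ?, ?)",
    [("Anarchism", "Introduction", "text a"), ("Achilles", "Namesakes", "text b")],
)
demo_texts = list(iter_article_texts(demo_connection, limit=1))
demo_connection.close()
print(demo_texts)
```

Because the generator yields joined strings, it could feed a cleaning step directly without an intermediate list of raw rows.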

#### Data Pre-processing ####
At this point, the raw data from the .json and .db files has been extracted and stored in lists. The next step is to pre-process these documents so that they are ready for analysis.

1. Impose various parameters

        As usual, remove stop words and words shorter than 3 characters.

2. Stemming
        
        This will be done with the SnowballStemmer (English) from Python's Natural Language Toolkit (NLTK).
3. Lemmatization

        This will be done with the use of the WordNetLemmatizer.
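
The filtering in step 1 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the notebook itself relies on gensim for this step, and `TOY_STOP_WORDS` and `filter_tokens` are hypothetical names with a deliberately tiny stop-word list.

```python
# Hypothetical, deliberately tiny stop-word list for illustration only.
TOY_STOP_WORDS = frozenset({"the", "is", "of", "and", "to", "in", "a"})


def filter_tokens(tokens, stop_words=TOY_STOP_WORDS, min_length=3):
    """Lower-case tokens, then drop stop words and tokens shorter than min_length."""
    kept = []
    for token in tokens:
        token = token.lower()
        if token in stop_words or len(token) < min_length:
            continue
        kept.append(token)
    return kept


toy_tokens = "Swift 5 is the latest version of the language".split()
toy_filtered = filter_tokens(toy_tokens)
print(toy_filtered)
```

With `min_length=3`, one- and two-character tokens such as "5" are removed even when they are not in the stop-word list.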

In [48]:
import gensim
import nltk
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import preprocess_string, remove_stopwords
from nltk.stem import SnowballStemmer, WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stemmer = SnowballStemmer("english")

strat_clean_data = []
wiki_clean_data = []


def clean(data):
    """Remove stop words, then tokenise each document with gensim's default filters."""
    print("Cleaning data...")
    temp_clean = []
    for entry in data:
        temp = remove_stopwords(entry)
        # With its default filters, preprocess_string lower-cases the text,
        # strips punctuation, numbers and short tokens, and stems each token
        temp = preprocess_string(temp)
        temp_clean.append(temp)
    print("Data cleaning finished")
    return temp_clean
  

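Presented below is a sample of the cleaned version of the raw data. Note that the explicit stemming and lemmatization steps have not been run yet, but the tokens are already stemmed (e.g. 'appl', 'engag') because the default filters of gensim's preprocess_string include a Porter stemmer.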

In [51]:
strat_clean_data = clean(strat_data)
print(strat_clean_data[33])
wiki_clean_data = clean(wiki_data)
print(wiki_clean_data[99999])

Cleaning data...
Data cleaning finished
['kudo', 'stratpoint’', 'appl', 'team', 'stratpoint’', 'engag', 'appl', 'start', 'octob', 'appl', 'team', 'work', 'project', 'make', 'sure', 'qualiti', 'work', 'deliv', 'time', 'client', 'appl', 'valu', 'importantli', 'todai', 'appl', 'team', 'work', 'project', 'apple’', 'global', 'depart', 'recent', 'finish', 'intern', 'project', 'receiv', 'acknowledg', 'recognit', 'appl', 'team', 'wayn', 'aono', 'appl', 'global', 'solutionstess', 'taft', 'manag', 'global', 'solut', 'appl', 'thank', 'team', 'continu', 'ey', 'appl', 'project', 'innov', 'deliv', 'inspir', 'copyright', 'stratpoint', 'technolog', 'right', 'reserv', 'post', 'tag', 'site', 'agre', 'updat']
Cleaning data...
Data cleaning finished
['nicola', 'chauvin', 'histor', 'histor', 'research', 'identifi', 'biograph', 'detail', 'real', 'nicola', 'chauvin', 'lead', 'claim', 'wholli', 'fiction', 'figur', 'research', 'gérard', 'puymèg', 'conclud', 'nicola', 'chauvin', 'exist', 'believ', 'legend', 'cr

In [56]:
# Sample of what happens during stemming
print(stemmer.stem(strat_clean_data[0][77]))

cross


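At this point, the data will be stemmed and lemmatized to normalise the vocabulary of each corpus. Note that stemming and lemmatization operate on individual words. Because the lemmatizer receives tokens that are already stemmed, it changes little: the Stratpoint sample printed after lemmatization is identical to the one printed after stemming.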
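
Since both steps map a function over every token of every document, the pattern can be sketched once as a generic helper. This is an assumption-laden illustration, not the notebook's code: `apply_per_token` and the toy normaliser `strip_plural_s` are hypothetical, and the latter is far cruder than a real stemmer or lemmatizer.

```python
def apply_per_token(docs, transform):
    """Apply transform to every token of every document, keeping document boundaries."""
    return [[transform(token) for token in doc] for doc in docs]


def strip_plural_s(token):
    """Toy normaliser (hypothetical): drop one trailing 's' from longer tokens."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


# Hypothetical sample documents, not taken from either dataset.
sample_docs = [["teams", "cross", "projects"], ["class", "games"]]
sample_normalised = apply_per_token(sample_docs, strip_plural_s)
print(sample_normalised)
```

In the same spirit, a call such as `apply_per_token(docs, stemmer.stem)` would pass each word to the stemmer individually while preserving the list-of-lists structure.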

In [59]:
strat_stemmed_data = []
wiki_stemmed_data = []

def stem_data(data):
    print("Stemming data...")
    temp_stemmed = []
    for entry in data:
        temp_doc = []
        for word in entry:
            temp_doc.append(stemmer.stem(word))
        temp_stemmed.append(temp_doc)
    print("Data stemming finished...")
    return temp_stemmed

strat_stemmed_data = stem_data(strat_clean_data)
print(strat_stemmed_data[50])
wiki_stemmed_data = stem_data(wiki_clean_data)
print(wiki_stemmed_data[9999])

Stemming data...
Data stemming finished...
['stratpoint', 'counter', 'strike', 'tournament', 'year', 'stratpoint', 'hold', 'sport', 'fest', 'team', 'sport', 'activ', 'tabl', 'game', 'week', 'await', 'activ', 'counter', 'stike', 'tournament', 'team', 'black', 'green', 'blue', 'red', 'repr', 'best', 'gamer', 'fun', 'stratpoint', 'cool', 'place', 'work', 'standbi', 'list', 'winner', 'innov', 'deliv', 'inspir', 'copyright', 'stratpoint', 'technolog', 'right', 'reserv', 'post', 'tag', 'site', 'agr', 'updat']
Stemming data...
Data stemming finished...
['abdur', 'rahman', 'khan', 'legaci', 'afghan', 'societi', 'mix', 'feel', 'rule', 'rememb', 'ruler', 'initi', 'program', 'modern', 'effect', 'prevent', 'countri', 'occupi', 'russia', 'britain', 'great', 'game', 'hand', 'sector', 'afghanistan', 'rememb', 'domest', 'violent', 'geopolit', 'weak', 'ruler', 'brought', 'power', 'british', 'declar', 'war', 'afghan', 'minor', 'instead', 'fight', 'british', 'decid', 'afghanistan', 'foreign', 'polici']


In [63]:
strat_lemmatized_data = []
wiki_lemmatized_data = []

def lemmatize_data(data):
    print("Lemmatizing data...")
    temp_lemmatized = []
    for entry in data:
        temp_doc = []
        for word in entry:
            temp_doc.append(lemmatizer.lemmatize(word))
        temp_lemmatized.append(temp_doc)
    print("Data lemmatization finished...")
    return temp_lemmatized

strat_lemmatized_data = lemmatize_data(strat_stemmed_data)
print(strat_lemmatized_data[50])
wiki_lemmatized_data = lemmatize_data(wiki_stemmed_data)
print(wiki_lemmatized_data[99999])

Lemmatizing data...
Data lemmatization finished...
['stratpoint', 'counter', 'strike', 'tournament', 'year', 'stratpoint', 'hold', 'sport', 'fest', 'team', 'sport', 'activ', 'tabl', 'game', 'week', 'await', 'activ', 'counter', 'stike', 'tournament', 'team', 'black', 'green', 'blue', 'red', 'repr', 'best', 'gamer', 'fun', 'stratpoint', 'cool', 'place', 'work', 'standbi', 'list', 'winner', 'innov', 'deliv', 'inspir', 'copyright', 'stratpoint', 'technolog', 'right', 'reserv', 'post', 'tag', 'site', 'agr', 'updat']
Lemmatizing data...
Data lemmatization finished...
['nicola', 'chauvin', 'histor', 'histor', 'research', 'identifi', 'biograph', 'detail', 'real', 'nicola', 'chauvin', 'lead', 'claim', 'wholli', 'fiction', 'figur', 'research', 'gérard', 'puymèg', 'conclud', 'nicola', 'chauvin', 'exist', 'believ', 'legend', 'crystal', 'restor', 'juli', 'monarchi', 'pen', 'songwrit', 'vaudevil', 'historian', 'argu', 'figur', 'chauvin', 'continu', 'long', 'tradit', 'mytholog', 'farmer', 'soldier', 