## Topic Modelling using LDA and NMF ##

Datasets: Wikipedia articles and Stratpoint articles. The Wikipedia articles are stored in a SQLite database, whereas the Stratpoint articles are stored in a .json file.

Flow:
1. Data extraction and pre-processing
         
         This includes creating bags of words that are usable for NMF and LDA.
2. Data analysis and visualization
        
        After applying both NMF and LDA, we analyze the results. The main difference between the two datasets is their size, so we are essentially comparing LDA and NMF on a large corpus versus a relatively small one.

In [111]:
# Extract stratpoint article data
import json

strat_data = []
with open('strat_articles.json') as json_file:  
    data = json.load(json_file)
    for entry in data:
        print(entry['title'][0],"\n\n",entry['body'],"\n\n")
        strat_data.append(entry['title'][0] + " " + entry['body'])

Swift PH #21: Swift 5, Localization, and Core Data 

 SwiftPH has always been an avenue for iOS developers to share knowledge, find colleagues, and learn more about technology which is why this every month there is a meet up! Last month’s  was hosted by Stratpoint.For this month, three topics were discussed.The first speaker was . She did her presentation using Xcode Playgrounds. Swift 5 was released March 2019, with it came Xcode 10.2.She showed us all of the new features that was introduced with Swift 5. There are no breaking changes in Swift 5 and that it is source code compatible with Swift 4.2 she said. Because of this, migrating legacy projects to Swift 5 will be easy.Because she was using Xcode Playgrounds for her presentation, she easily showed us the new features with comparison to Swift 4.2 syntax. is a Lead Software Engineer at Stratpoint. She’s been developing native iOS applications since 2013. She also completed Data Science bootcamp by an AI inclined company and now, she
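The extraction loop above assumes each entry stores its title as a list and its body as a string. A minimal sketch of that shape follows. It uses a made-up two-entry sample rather than the real `strat_articles.json`, and the helper name `entry_to_text` is hypothetical.

```python
import json

# Hypothetical sample mimicking the assumed layout of strat_articles.json:
# "title" is a list whose first item is the headline, "body" is a string.
sample_json = """
[
  {"title": ["First post"], "body": "Hello world."},
  {"title": [], "body": "Body without a title."}
]
"""

def entry_to_text(entry):
    """Join the first title (if any) and the body into one document string."""
    titles = entry.get("title") or [""]
    body = entry.get("body") or ""
    return (titles[0] + " " + body).strip()

sample_docs = [entry_to_text(e) for e in json.loads(sample_json)]
print(sample_docs)
```

Falling back to an empty string keeps one malformed entry from raising an `IndexError` or `TypeError` halfway through the file.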

In [112]:
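# Connect to the SQLite database holding the Wikipedia articles
import sqlite3

try:
    connection = sqlite3.connect('../wiki-kaggle-17_18.db')
    cursor = connection.cursor()
    print("Established connection : ", connection)
except sqlite3.Error as err:
    # Surface the failure instead of silently continuing without a cursor
    print("Could not connect to database:", err)
    raise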

Established connection :  <sqlite3.Connection object at 0x1ae56d4b90>


In [3]:
# Fetch and check data
import pandas as pd

cursor.execute("SELECT TITLE,SECTION_TITLE,SECTION_TEXT FROM ARTICLES LIMIT 5")
rows = cursor.fetchall()

for row in rows:
    print(row, "\n\n\n")


('Anarchism', 'Introduction', "\n\n\n\n\n\n'''Anarchism''' is a political philosophy that advocates self-governed societies based on voluntary institutions. These are often described as stateless societies, although several authors have defined them more specifically as institutions based on non-hierarchical free associations. Anarchism holds the state to be undesirable, unnecessary, and harmful.\n\nWhile anti-statism is central, anarchism specifically entails opposing authority or hierarchical organisation in the conduct of all human relations, including, but not limited to, the state system.  Anarchism is usually considered an extreme left-wing ideology, and much of anarchist economics and anarchist legal philosophy reflects anti-authoritarian interpretations of communism, collectivism, syndicalism, mutualism, or participatory economics.\n\nAnarchism does not offer a fixed body of doctrine from a single particular world view, instead fluxing and flowing as a philosophy. Many types an

In [113]:
wiki_data = []

cursor.execute("SELECT TITLE,SECTION_TITLE,SECTION_TEXT FROM ARTICLES LIMIT 100000")
rows = cursor.fetchall()

for row in rows:
    wiki_data.append(row[0] + " " + row[1]+" "+row[2])
# Check if parsed correctly
print(wiki_data[69])

Achilles Namesakes * The name of Achilles has been used for at least nine Royal Navy warships since 1744 - both as HMS ''Achilles'' and with the French spelling HMS ''Achille''. A 60-gun ship of that name served at the Battle of Belleisle in 1761 while a 74-gun ship served at the Battle of Trafalgar. Other battle honours include Walcheren 1809. An armored cruiser of that name served in the Royal Navy during the First World War.
* HMNZS ''Achilles'' was a ''Leander''-class cruiser which served with the Royal New Zealand Navy in World War II. It became famous for its part in the Battle of the River Plate, alongside  and . In addition to earning the battle honour 'River Plate', HMNZS Achilles also served at Guadalcanal 1942–43 and Okinawa in 1945. After returning to the Royal Navy, the ship was sold to the Indian Navy in 1948 but when she was scrapped parts of the ship were saved and preserved in New Zealand.
* A species of lizard, ''Anolis achilles'', which has widened heel plates, is na
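The string concatenation used to build `wiki_data` raises a `TypeError` if any of the three columns is NULL. A small sketch of a more defensive version is given below. It runs against an in-memory stand-in for the database: the table and column names follow the queries above, while the rows and the helper name `fetch_documents` are invented for illustration.

```python
import sqlite3

# In-memory stand-in for the real database; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ARTICLES (TITLE TEXT, SECTION_TITLE TEXT, SECTION_TEXT TEXT)")
conn.executemany(
    "INSERT INTO ARTICLES VALUES (?, ?, ?)",
    [
        ("Anarchism", "Introduction", "Anarchism is a political philosophy."),
        ("Achilles", None, "Namesakes of Achilles."),  # NULL section title
        ("Alabama", "History", "Alabama is a state."),
    ],
)

def fetch_documents(connection, limit):
    """Return one string per row, treating NULL columns as empty strings."""
    cur = connection.execute(
        "SELECT TITLE, SECTION_TITLE, SECTION_TEXT FROM ARTICLES LIMIT ?", (limit,)
    )
    return [" ".join(part or "" for part in row) for row in cur]

docs = fetch_documents(conn, 2)
print(docs)
conn.close()
```

Passing the limit as a bound parameter also makes it easy to try a smaller sample before loading all 100,000 rows.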

#### Data Pre-processing ####
At this point, the raw data from the .json and .db files has been extracted and stored in lists. The next step is to pre-process these documents so that they are ready for analysis.

1. Impose various parameters

        Remove stop words and words shorter than 3 characters.

2. Stemming
        
        This will be done with the English SnowballStemmer from Python's Natural Language Toolkit (NLTK).
3. Lemmatization

        This will be done with the use of the WordNetLemmatizer.
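The first step in the list above can be illustrated without any library. The sketch below is a simplified stand-in for what the gensim helpers do: the stop list is a tiny hypothetical one (gensim ships a much larger list), and the function name `basic_clean` is not part of any library.

```python
import re

# Tiny illustrative stop list; not gensim's real one.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "for"}
MIN_LENGTH = 3  # drop tokens shorter than this, as described above

def basic_clean(text):
    """Lowercase, split on non-letters, drop stop words and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) >= MIN_LENGTH]

print(basic_clean("Swift 5 is an easy upgrade for iOS developers in 2019."))
# -> ['swift', 'easy', 'upgrade', 'ios', 'developers']
```

Numbers vanish here because the pattern only matches letters. The real pipeline strips numerals through a separate filter.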

In [114]:
from __future__ import print_function
import gensim
import nltk
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import remove_stopwords
from gensim.parsing.preprocessing import preprocess_string
# Import only the stemmer and lemmatizer that are actually used
from nltk.stem import SnowballStemmer, WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stemmer = SnowballStemmer("english")

strat_clean_data = []
wiki_clean_data = []


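# Quick sanity check of the two cleaning steps on a single article
temp = remove_stopwords(strat_data[0])
temp = preprocess_string(temp)

def clean(data):
    """Remove stop words, then tokenize each document with gensim's default filters."""
    print("Cleaning data...")
    temp_clean = []
    for entry in data:
        temp = remove_stopwords(entry)
        # preprocess_string lowercases, strips punctuation, numbers and short words, and stems
        temp = preprocess_string(temp)
        temp_clean.append(temp)
    print("Data cleaning finished")
    return temp_clean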
  

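Presented below is a sample of the cleaned data. Note that `preprocess_string` applies gensim's default filters, which include Porter stemming, so the tokens are already stemmed at this point (for example 'appl' and 'engag'). The explicit Snowball stemming and lemmatization steps follow.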

In [115]:
strat_clean_data = clean(strat_data)
print(strat_clean_data[33])
wiki_clean_data = clean(wiki_data)
print(wiki_clean_data[99999])

Cleaning data...
Data cleaning finished
['kudo', 'stratpoint’', 'appl', 'team', 'stratpoint’', 'engag', 'appl', 'start', 'octob', 'appl', 'team', 'work', 'project', 'make', 'sure', 'qualiti', 'work', 'deliv', 'time', 'client', 'appl', 'valu', 'importantli', 'todai', 'appl', 'team', 'work', 'project', 'apple’', 'global', 'depart', 'recent', 'finish', 'intern', 'project', 'receiv', 'acknowledg', 'recognit', 'appl', 'team', 'wayn', 'aono', 'appl', 'global', 'solutionstess', 'taft', 'manag', 'global', 'solut', 'appl', 'thank', 'team', 'continu', 'ey', 'appl', 'project', 'innov', 'deliv', 'inspir', 'copyright', 'stratpoint', 'technolog', 'right', 'reserv', 'post', 'tag', 'site', 'agre', 'updat']
Cleaning data...
Data cleaning finished
['nicola', 'chauvin', 'histor', 'histor', 'research', 'identifi', 'biograph', 'detail', 'real', 'nicola', 'chauvin', 'lead', 'claim', 'wholli', 'fiction', 'figur', 'research', 'gérard', 'puymèg', 'conclud', 'nicola', 'chauvin', 'exist', 'believ', 'legend', 'cr

In [116]:
# Sample of what happens during stemming
print(stemmer.stem(strat_clean_data[0][77]))

cross


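At this point, the data is stemmed and lemmatized to normalize word variants across both corpora. Note that stemming and lemmatization operate on individual words. Because the lemmatizer receives already-stemmed tokens, it changes very little, as the matching sample outputs of the two steps show.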

In [117]:
strat_stemmed_data = []
wiki_stemmed_data = []

def stem_data(data):
    print("Stemming data...")
    temp_stemmed = []
    for entry in data:
        temp_doc = []
        for word in entry:
            temp_doc.append(stemmer.stem(word))
        temp_stemmed.append(temp_doc)
    print("Data stemming finished...")
    return temp_stemmed

strat_stemmed_data = stem_data(strat_clean_data)
print(strat_stemmed_data[50])
wiki_stemmed_data = stem_data(wiki_clean_data)
print(wiki_stemmed_data[99999])

Stemming data...
Data stemming finished...
['stratpoint', 'counter', 'strike', 'tournament', 'year', 'stratpoint', 'hold', 'sport', 'fest', 'team', 'sport', 'activ', 'tabl', 'game', 'week', 'await', 'activ', 'counter', 'stike', 'tournament', 'team', 'black', 'green', 'blue', 'red', 'repr', 'best', 'gamer', 'fun', 'stratpoint', 'cool', 'place', 'work', 'standbi', 'list', 'winner', 'innov', 'deliv', 'inspir', 'copyright', 'stratpoint', 'technolog', 'right', 'reserv', 'post', 'tag', 'site', 'agr', 'updat']
Stemming data...
Data stemming finished...
['nicola', 'chauvin', 'histor', 'histor', 'research', 'identifi', 'biograph', 'detail', 'real', 'nicola', 'chauvin', 'lead', 'claim', 'wholli', 'fiction', 'figur', 'research', 'gérard', 'puymèg', 'conclud', 'nicola', 'chauvin', 'exist', 'believ', 'legend', 'crystal', 'restor', 'juli', 'monarchi', 'pen', 'songwrit', 'vaudevil', 'historian', 'argu', 'figur', 'chauvin', 'continu', 'long', 'tradit', 'mytholog', 'farmer', 'soldier', 'mile', 'glorios

In [118]:
strat_lemmatized_data = []
wiki_lemmatized_data = []

def lemmatize_data(data):
    print("Lemmatizing data...")
    temp_lemmatized = []
    for entry in data:
        temp_doc = []
        for word in entry:
            temp_doc.append(lemmatizer.lemmatize(word))
        temp_lemmatized.append(temp_doc)
    print("Data lemmatization finished...")
    return temp_lemmatized

strat_lemmatized_data = lemmatize_data(strat_stemmed_data)
print(strat_lemmatized_data[50])
wiki_lemmatized_data = lemmatize_data(wiki_stemmed_data)
print(wiki_lemmatized_data[99999])

Lemmatizing data...
Data lemmatization finished...
['stratpoint', 'counter', 'strike', 'tournament', 'year', 'stratpoint', 'hold', 'sport', 'fest', 'team', 'sport', 'activ', 'tabl', 'game', 'week', 'await', 'activ', 'counter', 'stike', 'tournament', 'team', 'black', 'green', 'blue', 'red', 'repr', 'best', 'gamer', 'fun', 'stratpoint', 'cool', 'place', 'work', 'standbi', 'list', 'winner', 'innov', 'deliv', 'inspir', 'copyright', 'stratpoint', 'technolog', 'right', 'reserv', 'post', 'tag', 'site', 'agr', 'updat']
Lemmatizing data...
Data lemmatization finished...
['nicola', 'chauvin', 'histor', 'histor', 'research', 'identifi', 'biograph', 'detail', 'real', 'nicola', 'chauvin', 'lead', 'claim', 'wholli', 'fiction', 'figur', 'research', 'gérard', 'puymèg', 'conclud', 'nicola', 'chauvin', 'exist', 'believ', 'legend', 'crystal', 'restor', 'juli', 'monarchi', 'pen', 'songwrit', 'vaudevil', 'historian', 'argu', 'figur', 'chauvin', 'continu', 'long', 'tradit', 'mytholog', 'farmer', 'soldier', 

#### Bag of Words ####

At this point the documents can be converted to bags of words that can be fed to the LDA and NMF models.

In [119]:
strat_dict = gensim.corpora.Dictionary(strat_lemmatized_data)
wiki_dict = gensim.corpora.Dictionary(wiki_lemmatized_data)

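# Drop tokens found in fewer than 15 documents or in more than 50% of documents,
# then keep at most the 100,000 most frequent remaining tokens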
strat_dict.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
wiki_dict.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
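To make the effect of `no_below` and `no_above` concrete, here is a pure-Python sketch of document-frequency filtering on a toy corpus. It is a simplified illustration rather than gensim's implementation: `filter_vocabulary` is a hypothetical helper and `keep_n` is left out.

```python
from collections import Counter

def filter_vocabulary(docs, no_below, no_above):
    """Keep tokens found in >= no_below documents and <= no_above of all documents."""
    # Count each token once per document (document frequency, not term frequency).
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)

toy_docs = [
    ["app", "team", "swift"],
    ["app", "team", "event"],
    ["app", "cloud", "event"],
    ["app", "team", "data"],
]
kept = filter_vocabulary(toy_docs, no_below=2, no_above=0.75)
print(kept)
# -> ['event', 'team']
```

"app" is dropped for being in every document, and "swift", "cloud" and "data" for being in only one. The same thresholds are applied to both corpora above, and `no_below=15` is a much stricter cut on a small corpus than on 100,000 documents. This may explain why the Stratpoint bag of words printed in the next cell has only three entries.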

In [120]:
strat_bow_corpus = [strat_dict.doc2bow(doc) for doc in strat_lemmatized_data]
print(strat_bow_corpus[50])
wiki_bow_corpus = [wiki_dict.doc2bow(doc) for doc in wiki_lemmatized_data]
print(wiki_bow_corpus[99999])

[(10, 1), (11, 1), (13, 2)]
[(16, 1), (26, 1), (42, 1), (74, 1), (103, 1), (111, 1), (123, 1), (127, 1), (178, 1), (200, 1), (240, 1), (258, 1), (271, 1), (288, 1), (302, 2), (323, 1), (342, 1), (353, 1), (397, 1), (399, 1), (505, 2), (553, 1), (554, 4), (576, 2), (640, 1), (668, 1), (686, 1), (781, 1), (840, 1), (1008, 1), (1108, 1), (1126, 1), (1167, 1), (1170, 2), (1175, 1), (1281, 2), (1341, 1), (1385, 1), (1422, 1), (1473, 1), (1493, 1), (1519, 1), (1546, 1), (1568, 1), (1660, 1), (1662, 1), (1727, 1), (1823, 4), (1943, 1), (2086, 2), (2123, 2), (2157, 1), (2226, 1), (2288, 1), (2583, 1), (2615, 1), (2695, 2), (2759, 1), (2776, 1), (2792, 1), (2925, 1), (2995, 1), (3041, 1), (3082, 1), (3106, 3), (3118, 2), (3435, 1), (3841, 1), (3844, 1), (4012, 1), (4014, 1), (4267, 3), (4401, 1), (4410, 1), (4566, 1), (4739, 1), (5060, 1), (5203, 1), (5265, 1), (5303, 1), (5427, 1), (5835, 1), (6450, 1), (7128, 1), (7620, 1), (9907, 1), (9932, 1), (10173, 1), (10938, 1), (12134, 1), (16325, 1),

#### TF-IDF ####

From here, TF-IDF can be applied to reweight each word by how informative it is. A word's count within a document is scaled by how rare the word is across the whole corpus, so words that appear in many documents are down-weighted and words concentrated in a few documents are emphasized.
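A worked example may help here. The sketch below computes TF-IDF for one toy document, assuming the weighting is the raw count times log base 2 of (number of documents / document frequency), followed by normalization to unit length. That is believed to match the defaults of gensim's `TfidfModel`, but treat it as an assumption to check against the gensim documentation. The helper `tfidf_vector` and the toy statistics are hypothetical.

```python
import math

def tfidf_vector(doc_bow, doc_freq, num_docs):
    """Weight (term_id, count) pairs by log2(N / df), then L2-normalize."""
    weights = {tid: count * math.log2(num_docs / doc_freq[tid]) for tid, count in doc_bow}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return []
    # Terms with zero weight (present in every document) are dropped.
    return [(tid, w / norm) for tid, w in sorted(weights.items()) if w > 0]

# Toy corpus statistics: term 0 is in all 4 documents, term 1 in 2, term 2 in 1.
doc_freq = {0: 4, 1: 2, 2: 1}
vector = tfidf_vector([(0, 3), (1, 1), (2, 1)], doc_freq, num_docs=4)
print(vector)
```

Term 0 occurs three times in the document but is present in every document, so it carries no weight. Term 2, which is the rarest, ends up with twice the weight of term 1.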

In [121]:
from gensim import corpora, models

strat_tfidf = models.TfidfModel(strat_bow_corpus)
wiki_tfidf = models.TfidfModel(wiki_bow_corpus)

strat_corpus_tfidf = strat_tfidf[strat_bow_corpus]
wiki_corpus_tfidf = wiki_tfidf[wiki_bow_corpus]

print(strat_corpus_tfidf[50])
print(wiki_corpus_tfidf[99999])

[(10, 0.3449089442213748), (11, 0.3449089442213748), (13, 0.8729694384067481)]
[(16, 0.029431657114858904), (26, 0.03461809720935662), (42, 0.040275457979752426), (74, 0.034486138912769256), (103, 0.042340645355611356), (111, 0.02722373293943246), (123, 0.06243793936144729), (127, 0.02188611246294211), (178, 0.05842824041241621), (200, 0.046849492079414495), (240, 0.07960453145288589), (258, 0.04670854986112542), (271, 0.04901298332997398), (288, 0.04002059261578295), (302, 0.0766428853373004), (323, 0.05299450410025125), (342, 0.031239421586829415), (353, 0.029887379714942652), (397, 0.04950394765537014), (399, 0.04173101487712732), (505, 0.08733811032303278), (553, 0.02953138632206164), (554, 0.22855812439565953), (576, 0.09621355994691044), (640, 0.03918930058760039), (668, 0.03516930473776266), (686, 0.031891447834819786), (781, 0.027347498054988456), (840, 0.035106582157960003), (1008, 0.03565439616632901), (1108, 0.06993072321650994), (1126, 0.07922498454315628), (1167, 0.0579612

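With the TF-IDF corpora prepared, the data can now be fed into the models. Note that the LDA models below are trained on the raw bag-of-words corpora (`strat_bow_corpus` and `wiki_bow_corpus`), not on the TF-IDF-weighted ones.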

In [122]:
strat_lda_model = gensim.models.LdaMulticore(strat_bow_corpus, num_topics=10, id2word=strat_dict, passes=2, workers=2)
for idx, topic in strat_lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.236*"event" + 0.148*"busi" + 0.146*"held" + 0.136*"compani" + 0.070*"servic" + 0.043*"new" + 0.043*"help" + 0.043*"discus" + 0.040*"philippin" + 0.031*"time"
Topic: 1 
Words: 0.121*"time" + 0.119*"u" + 0.096*"app" + 0.084*"process" + 0.083*"talk" + 0.068*"softwar" + 0.047*"manag" + 0.047*"project" + 0.046*"build" + 0.046*"servic"
Topic: 2 
Words: 0.134*"learn" + 0.105*"team" + 0.087*"year" + 0.079*"work" + 0.066*"compani" + 0.061*"project" + 0.059*"new" + 0.048*"held" + 0.048*"u" + 0.047*"event"
Topic: 3 
Words: 0.140*"u" + 0.102*"work" + 0.098*"new" + 0.096*"time" + 0.093*"project" + 0.072*"year" + 0.065*"softwar" + 0.051*"help" + 0.048*"learn" + 0.045*"compani"
Topic: 4 
Words: 0.239*"servic" + 0.112*"philippin" + 0.090*"busi" + 0.074*"new" + 0.063*"help" + 0.057*"softwar" + 0.054*"manag" + 0.045*"year" + 0.033*"learn" + 0.031*"discus"
Topic: 5 
Words: 0.192*"discus" + 0.173*"manag" + 0.118*"held" + 0.102*"learn" + 0.064*"process" + 0.057*"softwar" + 0.048*"compani

In [123]:
wiki_lda_model = gensim.models.LdaMulticore(wiki_bow_corpus, num_topics=10, id2word=wiki_dict, passes=2, workers=2)
for idx, topic in wiki_lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.011*"island" + 0.008*"year" + 0.008*"album" + 0.007*"new" + 0.007*"star" + 0.006*"dai" + 0.006*"river" + 0.006*"band" + 0.006*"record" + 0.006*"north"
Topic: 1 
Words: 0.014*"game" + 0.011*"music" + 0.011*"plai" + 0.008*"film" + 0.007*"team" + 0.006*"time" + 0.006*"new" + 0.006*"player" + 0.006*"season" + 0.005*"year"
Topic: 2 
Words: 0.017*"citi" + 0.013*"univ" + 0.011*"new" + 0.011*"state" + 0.010*"school" + 0.009*"nation" + 0.006*"year" + 0.006*"colleg" + 0.006*"world" + 0.006*"art"
Topic: 3 
Words: 0.030*"american" + 0.013*"english" + 0.008*"player" + 0.008*"john" + 0.008*"politician" + 0.007*"author" + 0.006*"new" + 0.006*"actor" + 0.006*"footbal" + 0.006*"singer"
Topic: 4 
Words: 0.007*"centuri" + 0.005*"church" + 0.005*"dai" + 0.005*"god" + 0.005*"christian" + 0.005*"book" + 0.004*"work" + 0.004*"time" + 0.004*"year" + 0.004*"tradit"
Topic: 5 
Words: 0.006*"form" + 0.006*"theori" + 0.005*"exampl" + 0.005*"differ" + 0.005*"number" + 0.004*"u" + 0.004*"gener" + 

#### Test model against unseen data ####
Fetch a random Wikipedia article from WikiRoulette.com. The article has to be cleaned and transformed in the same way as the training data before it is fed to the models.

In [124]:
random_wiki_article = """
Marjorie Gordon
Marjorie Gordon (12 November 1893 – 14 October 1983) was an English actress and singer.
Gordon was born in Southsea, Portsmouth, Hampshire as Marjorie Kettlewell. Her professional stage career began in 1915 on tour in the chorus of the D'Oyly Carte Opera Company. The next season, she was given the roles of the Plaintiff in Trial by Jury and Lady Psyche in Princess Ida. She also understudied and occasionally played the title role in Patience and Yum-Yum in The Mikado. She left D'Oyly Carte in June 1916 to understudy the role of Sylvia Dale at the Adelphi Theatre in London in the Rudolf Friml musical High Jinks, sometimes appearing in the role until July 1917 and touring in the role later that year.[1]
Gordon starred in the title role in the romantic comic opera Valentine at the St. James's Theatre from January to April 1918. She next starred as Grace in the hit musical Going Up (1918–1919) at London's Gaiety Theatre. She then appeared in the Ivor Novellomusical Who's Hooper? at the Adelphi Theatre (1919–1920). Next, she appeared in My Nieces at the Aldwych Theatre, London, in 1921. She also recorded songs from Going Up and Who's Hooper?. For the next two decades, she appeared in both musicals and plays.[1] These included Stop Flirting (1923). She later appeared in a few films, playing Ruth Hopkins in Danger Trails (1935), Matron in All the Way to Paris (1967) and Passenger in Golden Rendezvous (1977).
Later stage roles included Midge in Tulip Time in 1935 at the Alhambra Theatre.[2] Her last appearance on stage in London was in the "revusical promenade" Let's All Go Down the Strand at the Adelphi in 1939.[1]
She died one month shy of her 90th birthday in Eastbourne, Sussex.
"""

print(random_wiki_article)


Marjorie Gordon
Marjorie Gordon (12 November 1893 – 14 October 1983) was an English actress and singer.
Gordon was born in Southsea, Portsmouth, Hampshire as Marjorie Kettlewell. Her professional stage career began in 1915 on tour in the chorus of the D'Oyly Carte Opera Company. The next season, she was given the roles of the Plaintiff in Trial by Jury and Lady Psyche in Princess Ida. She also understudied and occasionally played the title role in Patience and Yum-Yum in The Mikado. She left D'Oyly Carte in June 1916 to understudy the role of Sylvia Dale at the Adelphi Theatre in London in the Rudolf Friml musical High Jinks, sometimes appearing in the role until July 1917 and touring in the role later that year.[1]
Gordon starred in the title role in the romantic comic opera Valentine at the St. James's Theatre from January to April 1918. She next starred as Grace in the hit musical Going Up (1918–1919) at London's Gaiety Theatre. She then appeared in the Ivor Novellomusical Who's Ho

In [125]:
cleaned_random_wiki_article = remove_stopwords(random_wiki_article)
cleaned_random_wiki_article = preprocess_string(cleaned_random_wiki_article)
preprocessed_random_wiki_article = []

for word in cleaned_random_wiki_article:
    temp = stemmer.stem(word)
    temp = lemmatizer.lemmatize(temp)
    preprocessed_random_wiki_article.append(temp)

print(preprocessed_random_wiki_article)

['marjori', 'gordon', 'marjori', 'gordon', 'novemb', 'octob', 'english', 'actress', 'singer', 'gordon', 'born', 'southsea', 'portsmouth', 'hampshir', 'marjori', 'kettlewel', 'profess', 'stage', 'career', 'began', 'tour', 'choru', 'oyli', 'cart', 'opera', 'compani', 'season', 'given', 'role', 'plaintiff', 'trial', 'juri', 'ladi', 'psych', 'princess', 'ida', 'understudi', 'occas', 'plai', 'titl', 'role', 'patienc', 'yum', 'yum', 'mikado', 'left', 'oyli', 'cart', 'june', 'understudi', 'role', 'sylvia', 'dale', 'adelphi', 'theatr', 'london', 'rudolf', 'friml', 'music', 'high', 'jink', 'appear', 'role', 'juli', 'tour', 'role', 'later', 'year', 'gordon', 'star', 'titl', 'role', 'romant', 'comic', 'opera', 'valentin', 'jame', 'theatr', 'januari', 'april', 'star', 'grace', 'hit', 'music', 'go', 'london', 'gaieti', 'theatr', 'appear', 'ivor', 'novellomus', 'hooper', 'adelphi', 'theatr', 'appear', 'niec', 'aldwych', 'theatr', 'london', 'record', 'song', 'go', 'hooper', 'decad', 'appear', 'music'

In [126]:
# Use strat dict and wiki dict
strat_random_bow_vector = strat_dict.doc2bow(preprocessed_random_wiki_article)
wiki_random_bow_vector = wiki_dict.doc2bow(preprocessed_random_wiki_article)
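One detail worth keeping in mind is that `doc2bow` only counts tokens already present in the dictionary; anything else is ignored by default. The pure-Python sketch below mimics that behaviour with a hypothetical four-word vocabulary and a hypothetical helper `to_bow`. It shows how little of an off-domain article can survive the conversion.

```python
from collections import Counter

def to_bow(tokens, token2id):
    """Count known tokens as sorted (id, count) pairs; unknown tokens are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical vocabulary learned from a small tech-blog corpus.
token2id = {"app": 0, "compani": 1, "team": 2, "year": 3}
article = ["marjori", "gordon", "english", "actress", "compani", "year", "year"]

bow = to_bow(article, token2id)
# Fraction of the article's tokens that the vocabulary actually covers.
coverage = sum(count for _, count in bow) / len(article)
print(bow, round(coverage, 2))
# -> [(1, 1), (3, 2)] 0.43
```

Printing the bag-of-words vector or its coverage before scoring is a cheap way to tell whether a model is really seeing the document.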

In [127]:
# Check strat LDA
for index,score in sorted(strat_lda_model[strat_random_bow_vector], key = lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score,strat_lda_model.print_topic(index,5)))

Score: 0.774950385093689	 Topic: 0.140*"u" + 0.102*"work" + 0.098*"new" + 0.096*"time" + 0.093*"project"
Score: 0.02501062862575054	 Topic: 0.134*"learn" + 0.105*"team" + 0.087*"year" + 0.079*"work" + 0.066*"compani"
Score: 0.025008706375956535	 Topic: 0.236*"event" + 0.148*"busi" + 0.146*"held" + 0.136*"compani" + 0.070*"servic"
Score: 0.025008277967572212	 Topic: 0.281*"philippin" + 0.170*"event" + 0.121*"busi" + 0.119*"compani" + 0.081*"discus"
Score: 0.025005899369716644	 Topic: 0.305*"app" + 0.137*"busi" + 0.099*"build" + 0.062*"servic" + 0.054*"u"
Score: 0.02500445581972599	 Topic: 0.456*"philippin" + 0.169*"manag" + 0.065*"busi" + 0.062*"compani" + 0.053*"help"
Score: 0.025003574788570404	 Topic: 0.121*"time" + 0.119*"u" + 0.096*"app" + 0.084*"process" + 0.083*"talk"
Score: 0.025003381073474884	 Topic: 0.239*"servic" + 0.112*"philippin" + 0.090*"busi" + 0.074*"new" + 0.063*"help"
Score: 0.025003284215927124	 Topic: 0.192*"discus" + 0.173*"manag" + 0.118*"held" + 0.102*"learn" + 

In [128]:
# Check wiki LDA
for index,score in sorted(wiki_lda_model[wiki_random_bow_vector], key = lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score,wiki_lda_model.print_topic(index,5)))

Score: 0.4185808598995209	 Topic: 0.014*"game" + 0.011*"music" + 0.011*"plai" + 0.008*"film" + 0.007*"team"
Score: 0.39315298199653625	 Topic: 0.030*"american" + 0.013*"english" + 0.008*"player" + 0.008*"john" + 0.008*"politician"
Score: 0.0996396392583847	 Topic: 0.017*"citi" + 0.013*"univ" + 0.011*"new" + 0.011*"state" + 0.010*"school"
Score: 0.06109686940908432	 Topic: 0.011*"island" + 0.008*"year" + 0.008*"album" + 0.007*"new" + 0.007*"star"
Score: 0.023735657334327698	 Topic: 0.011*"forc" + 0.008*"war" + 0.008*"air" + 0.007*"oper" + 0.006*"ship"


So far, the results for the Wikipedia LDA model look pretty good. The Stratpoint LDA model does not perform well, and understandably so: most of its topics are saturated with the same themes (tech, software, etc.), whereas the Wikipedia database covers a wide variety of topics. Let's try this one more time with another random article.

In [129]:
random_wiki_article = """
Ferdinando Sforzini (born 4 December 1984) is an Italian footballer who currently plays as a striker.[1] He is a product of the famous Napoli youth academy.

During his stay in Grosseto, Sforzini was given the nickname Nandogoal[2] and also taglia gole [cutthroat], nicknamed for his traditional goal celebration.[3]
Early career
A tall forward, Sforzini started his career off with the youth system of Napoli and was loaned out to Lazio from Napoli's youth system in 2002.

Ferdinando Sforzini graduated from Lazio youth system in 2003 and immediately sold, to Sassuolo.

At Sassuolo he impressed, scoring 10 goals in 29 appearances in 2004–05 Serie C2 season.

Udinese
After impressing with U.S. Sassuolo Calcio, he was transferred to Serie A club Udinese Calcio in June 2005.

Ferdinandos Sforzini remained at Udine for 2 months, but loaned out to Serie B side Verona in order to gain more experience in August 2005.[4] In his season with Verona, Ferdinando Sforzini managed 5 goals in 35 Serie B appearances, which mainly as a substitutes player, but also made 16 league starts.

For the 2006–2007 season, Udinese opted to again loan out the player in July 2006, however this time, it was to Modena, another Serie B club. With Modena, Ferdinandos Sforzini scored 5 goals again, in 30 league appearances (started 17 times). On 1 July 2006 Sforzini returned to Udinese, however, he again was loaned out soon later.

In July 2007, Ferdinandos Sforzini transferred to Vicenza for yet another Serie B season.[5] With the club, Ferdinandos Sforzini made 14 league appearances (half of them as starting XI) in his six-month stay, scoring 2 goals. For the second half of the 2007–2008 season, Udinese loaned out the 23-year-old Sforzini to Ravenna, a club struggling to avoid relegation from the Serie B.[6] In the latter half of the season, Ferdinandos Sforzini scored an impressive 9 goals in 21 league appearances for Ravenna Calcio, but it was not enough to save them from relegation.

Ferdinandos Sforzini was again loaned out into the Serie B – this time with Grosseto. In his six-month loan period with Grosseto, Sforzini was a regular starter at the start of season, appearing 18 times (10 of them as starter) and scoring 3 goals in Serie B. In January 2009, he swapped places with Alessandro Pellicori of Avellino. With relegation frightening Avellino Sforzini impressed, hitting good form and scoring 7 goals in just 12 Serie B appearances.

After hitting good form with Avellino, Sforzini returned to Udinese on 1 July 2009, and received the call-up to the pre-season camp.[7] But he was loaned out again, this time to Serie B champion A.S. Bari, the Serie A newcomer.[8] He was the starting forward along with Vitaly Kutuzov in the opening match of Serie A, a shocking 1–1 draw with defending champion Internazionale. But in the next Serie A match the coach Giampiero Ventura preferred Riccardo Meggiorini as starting forward to partner with Kutuzov.

Sforzini was further hit by left leg injury soon after,[9] At first he would be rested a month[10] but hit by flu in November.[11] Sforzini then trained separately for ongoing injury.[12]

After Kutuzov's out of season injury, Bari signed José Ignacio Castillo, and Sforzini was call-up by Ventura several time to as an extra cover for their starting forwards Barreto, Meggiorini and Castillo. He played his return match on 30 January 2010, substituted Castillo at the second half.[13] He made a further 5 Serie A appearances as substitute.

On 1 July 2010, Sforzini returned to Udinese again, and played for Udinese B team (composite of players that pending loan) in pre-season friendlies.[14][15][16]

CFR Cluj loan
On 25 August 2010, he was loaned to Romanian side CFR Cluj.[17][18] He made his debut in Liga I 10 September 2010 against Rapid Bucureşti in a 2–0 defeat, replacing Cristian Bud eight minutes from the end of the game. On 15 September made his debut in the Champions League in Cluj- Basel (2-1), replacing Lacina Traoré as a substitute

Scored the first goal for Cluj on 27 October 2010 won the encounter against the Targu Mures (2-0 final), valid for the knockout stages of the Cupa României. However, after the sacking of Italian manager Andrea Mandorlini in September 2010, Sforzini lost his place in the side and he returned to Udinese Calcio at the start of the January window.

During the 2011 January winter transfer market he was transferred to Italian Serie B team U.S. Grosseto on loan with option to sign permanently.[19]

Grosseto
After scoring 8 goals in 20 games on loan at U.S. Grosseto in Serie B, Sforzini's move was made permanent leaving Udinese Calcio after a 6-year spell at the club. His form in his 2 seasons in Grosseto was impressive, in his first season he scored 21 goals in 40 appearances in all competitions with 20 of those coming in Serie B in the 2011/12 season. He finished 3rd top scorer in Serie B behind Ciro Immobile (28 goals for Pescara) and Marco Sau (21 goals for Juve Stabia).

On 10 August 2012, Grosseto were provisionally relegated, by the Disciplinary Commission set up for Scommessopoli scandal investigations, to Lega Pro Prima Divisione because of their involvement in Scommessopoli scandal. Furthermore, the president of Grosseto was at the time suspended from all football activities for five years.

However, on 22 August 2012, Grosseto and its president were acquitted by the Court of justice, completely eliminating the verdict of the first instance and so re-instated back to Serie B for the following 2012/13 season.[20]

His start to the 2012/13 season saw him score 11 goals in his first 21 Serie B appearances before his form attracted interest of Serie A club Pescara in the January transfer window.

Pescara
On 31 January 2013, Sforzini transferred to Serie A side Pescara for €400,000[21] plus Grosseto received Gastón Brugman and Danilo Soddimo on season long loan's from Pescara as part of the transfer.[22]

In his first half season he made 10 appearances im Serie A, without scoring a goal. He could not help Pescara escape from relegation to Serie B. During the 2013–14 season, Sforzini scored just 2 goals in 20 Serie B games.

Latina
On 1 September 2014 he was signed by fellow Serie B club U.S. Latina Calcio.

Entella
On 2 February 2015 he was swapped with Gianluca Litteri of Entella.[23]
"""

print(random_wiki_article)


Ferdinando Sforzini (born 4 December 1984) is an Italian footballer who currently plays as a striker.[1] He is a product of the famous Napoli youth academy.

During his stay in Grosseto, Sforzini was given the nickname Nandogoal[2] and also taglia gole [cutthroat], nicknamed for his traditional goal celebration.[3]
Early career
A tall forward, Sforzini started his career off with the youth system of Napoli and was loaned out to Lazio from Napoli's youth system in 2002.

Ferdinando Sforzini graduated from Lazio youth system in 2003 and immediately sold, to Sassuolo.

At Sassuolo he impressed, scoring 10 goals in 29 appearances in 2004–05 Serie C2 season.

Udinese
After impressing with U.S. Sassuolo Calcio, he was transferred to Serie A club Udinese Calcio in June 2005.

Ferdinandos Sforzini remained at Udine for 2 months, but loaned out to Serie B side Verona in order to gain more experience in August 2005.[4] In his season with Verona, Ferdinando Sforzini managed 5 goals in 35 Serie B

In [21]:
cleaned_random_wiki_article = remove_stopwords(random_wiki_article)
cleaned_random_wiki_article = preprocess_string(cleaned_random_wiki_article)
preprocessed_random_wiki_article = []

for word in cleaned_random_wiki_article:
    temp = stemmer.stem(word)
    temp = lemmatizer.lemmatize(temp)
    preprocessed_random_wiki_article.append(temp)

print(preprocessed_random_wiki_article)

['ferdinando', 'sforzini', 'born', 'decemb', 'italian', 'footbal', 'current', 'plai', 'striker', 'product', 'famou', 'napoli', 'youth', 'academi', 'stai', 'grosseto', 'sforzini', 'given', 'nicknam', 'nandogo', 'taglia', 'gole', 'cutthroat', 'nicknam', 'tradit', 'goal', 'celebr', 'ear', 'career', 'tall', 'forward', 'sforzini', 'start', 'career', 'youth', 'napoli', 'loan', 'lazio', 'napoli', 'youth', 'ferdinando', 'sforzini', 'graduat', 'lazio', 'youth', 'immedi', 'sold', 'sassuolo', 'sassuolo', 'impress', 'score', 'goal', 'appear', 'seri', 'season', 'udin', 'impress', 'sassuolo', 'calcio', 'transfer', 'seri', 'club', 'udin', 'calcio', 'june', 'ferdinando', 'sforzini', 'remain', 'udin', 'month', 'loan', 'seri', 'verona', 'order', 'gain', 'experi', 'august', 'season', 'verona', 'ferdinando', 'sforzini', 'manag', 'goal', 'seri', 'appear', 'main', 'substitut', 'player', 'leagu', 'start', 'season', 'udin', 'opt', 'loan', 'player', 'juli', 'time', 'modena', 'seri', 'club', 'modena', 'ferdinan

In [130]:
# Convert the preprocessed article to bag-of-words vectors with the strat and wiki dictionaries
strat_random_bow_vector = strat_dict.doc2bow(preprocessed_random_wiki_article)
wiki_random_bow_vector = wiki_dict.doc2bow(preprocessed_random_wiki_article)

In [131]:
# Check strat LDA
for index,score in sorted(strat_lda_model[strat_random_bow_vector], key = lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score,strat_lda_model.print_topic(index,5)))

Score: 0.7749496698379517	 Topic: 0.140*"u" + 0.102*"work" + 0.098*"new" + 0.096*"time" + 0.093*"project"
Score: 0.02501138299703598	 Topic: 0.134*"learn" + 0.105*"team" + 0.087*"year" + 0.079*"work" + 0.066*"compani"
Score: 0.025008708238601685	 Topic: 0.236*"event" + 0.148*"busi" + 0.146*"held" + 0.136*"compani" + 0.070*"servic"
Score: 0.02500828169286251	 Topic: 0.281*"philippin" + 0.170*"event" + 0.121*"busi" + 0.119*"compani" + 0.081*"discus"
Score: 0.025005901232361794	 Topic: 0.305*"app" + 0.137*"busi" + 0.099*"build" + 0.062*"servic" + 0.054*"u"
Score: 0.02500445581972599	 Topic: 0.456*"philippin" + 0.169*"manag" + 0.065*"busi" + 0.062*"compani" + 0.053*"help"
Score: 0.025003576651215553	 Topic: 0.121*"time" + 0.119*"u" + 0.096*"app" + 0.084*"process" + 0.083*"talk"
Score: 0.025003381073474884	 Topic: 0.239*"servic" + 0.112*"philippin" + 0.090*"busi" + 0.074*"new" + 0.063*"help"
Score: 0.025003284215927124	 Topic: 0.192*"discus" + 0.173*"manag" + 0.118*"held" + 0.102*"learn" + 

In [132]:
# Check wiki LDA
for index,score in sorted(wiki_lda_model[wiki_random_bow_vector], key = lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score,wiki_lda_model.print_topic(index,5)))

Score: 0.41857659816741943	 Topic: 0.014*"game" + 0.011*"music" + 0.011*"plai" + 0.008*"film" + 0.007*"team"
Score: 0.39315229654312134	 Topic: 0.030*"american" + 0.013*"english" + 0.008*"player" + 0.008*"john" + 0.008*"politician"
Score: 0.09963896870613098	 Topic: 0.017*"citi" + 0.013*"univ" + 0.011*"new" + 0.011*"state" + 0.010*"school"
Score: 0.06110285967588425	 Topic: 0.011*"island" + 0.008*"year" + 0.008*"album" + 0.007*"new" + 0.007*"star"
Score: 0.023735253140330315	 Topic: 0.011*"forc" + 0.008*"war" + 0.008*"air" + 0.007*"oper" + 0.006*"ship"
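The strat LDA model gives one topic a score near 0.77 and the remaining topics nearly identical scores around 0.025. One possible explanation is that few tokens of a football article exist in the strat dictionary, since `doc2bow` ignores tokens it does not know. A minimal sketch of such a coverage check follows; the `token2id` mapping and the token list are hypothetical stand-ins for `strat_dict.token2id` and `preprocessed_random_wiki_article`.

```python
# Hypothetical token-to-id mapping, standing in for strat_dict.token2id.
token2id = {"manag": 0, "product": 1, "time": 2, "start": 3}

# A few tokens taken from the preprocessed article above.
tokens = ["ferdinando", "sforzini", "striker", "product", "manag", "start", "season"]

# Tokens that the dictionary knows; all others would be dropped by doc2bow.
known = [t for t in tokens if t in token2id]
coverage = len(known) / len(tokens)

print(known)               # ['product', 'manag', 'start']
print(round(coverage, 2))  # 0.43
```

A low coverage value would suggest that the topic scores rest on only a handful of words and should be read with caution.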


### NMF ###

The data has already been pre-processed. What is left to do is to create an NMF model based on the data. 

Important note: the shape of the document-term matrix produced by TfidfVectorizer directly affects whether the data can be fit to the model. 

In particular, see the "max_features" and "min_df" parameters of TfidfVectorizer, which limit the number of columns (terms) in that matrix.
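To illustrate how these parameters change the matrix shape, here is a small sketch on a made-up corpus (the `docs` list is an assumption, not the notebook's data). The same mechanism presumably explains why the strat matrix ends up with 30 columns even though `max_features` allows 101: with only 73 documents, few terms appear in at least 15 of them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical), standing in for strat_data / wiki_data.
docs = [
    "mobile app development team",
    "mobile app design team",
    "cloud services for business",
    "business services and cloud team",
]

# No limits: every term that is not a stop word becomes a column.
full = TfidfVectorizer(stop_words="english").fit_transform(docs)

# min_df=2 keeps only terms that occur in at least 2 documents.
frequent = TfidfVectorizer(min_df=2, stop_words="english").fit_transform(docs)

# max_features=3 keeps the 3 terms with the highest corpus-wide counts.
capped = TfidfVectorizer(max_features=3, stop_words="english").fit_transform(docs)

print(full.shape)      # (4, 8)
print(frequent.shape)  # (4, 6)
print(capped.shape)    # (4, 3)
```

Rows are always documents; only the number of columns reacts to `min_df` and `max_features`.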

In [197]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
import numpy as np

# convert the text to a tf-idf weighted document-term matrix (rows = documents, columns = terms)
vectorizer_strat = TfidfVectorizer(max_features = 101, min_df=15, stop_words='english') 
X = vectorizer_strat.fit_transform(strat_data)
print("Fit transform strat data")
print(X)
idx_to_word_strat = np.array(vectorizer_strat.get_feature_names())

vectorizer_wiki = TfidfVectorizer(max_features = 101, min_df = 15, stop_words = 'english')
Y = vectorizer_wiki.fit_transform(wiki_data)
print("Fit wiki data")
print(Y)
idx_to_word_wiki = np.array(vectorizer_wiki.get_feature_names())

Fit transform strat data
  (0, 7)	0.15869605907484646
  (0, 20)	0.19268180593542822
  (0, 27)	0.38536361187085644
  (0, 12)	0.2628139440664661
  (0, 19)	0.5357040452535317
  (0, 8)	0.5065888775878767
  (0, 28)	0.1625898129467608
  (0, 26)	0.1550249302738178
  (0, 24)	0.15869605907484646
  (0, 29)	0.1625898129467608
  (0, 4)	0.14213639960716265
  (0, 10)	0.06510112251482157
  (0, 6)	0.06422726864514273
  (0, 11)	0.06422726864514273
  (0, 5)	0.06422726864514273
  (0, 0)	0.06422726864514273
  (0, 2)	0.06422726864514273
  (0, 23)	0.06422726864514273
  (0, 16)	0.06422726864514273
  (0, 15)	0.06422726864514273
  (0, 14)	0.06779636840383337
  (0, 18)	0.06422726864514273
  (0, 3)	0.06422726864514273
  (0, 25)	0.06422726864514273
  (1, 20)	0.7090528557405915
  :	:
  (71, 3)	0.09300035261793409
  (71, 25)	0.09300035261793409
  (71, 13)	0.1977197971153282
  (71, 17)	0.6893702518121967
  (72, 7)	0.0897988001154951
  (72, 20)	0.03634325699486472
  (72, 27)	0.03634325699486472
  (72, 12)	0.446142966

In [198]:
X.shape

(73, 30)

In [199]:
Y.shape

(100000, 101)

Create an NMF model based on strat data.

In [200]:
# apply NMF
nmf_strat = NMF(n_components=20, solver="mu")
W = nmf_strat.fit_transform(X)
H = nmf_strat.components_
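
As a reminder of what the two factors hold, the sketch below factorizes a small random matrix. The data, `init`, `random_state` and `max_iter` values are assumptions chosen only to make the example reproducible. For the strat data, the same logic would give W a shape of (73, 20) and H a shape of (20, 30).

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative document-term matrix: 6 documents, 5 terms.
rng = np.random.default_rng(0)
X_toy = rng.random((6, 5))

# Settings below are fixed only for reproducibility of this sketch.
nmf_toy = NMF(n_components=2, solver="mu", init="nndsvda",
              random_state=0, max_iter=500)
W_toy = nmf_toy.fit_transform(X_toy)  # document-topic weights
H_toy = nmf_toy.components_           # topic-term weights

print(W_toy.shape)  # (6, 2)
print(H_toy.shape)  # (2, 5)

# Frobenius norm of the residual X - WH; smaller means a closer fit.
print(np.linalg.norm(X_toy - W_toy @ H_toy))
```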

In [201]:
get_topics(nmf_strat, 20, vectorizer_strat)

Unnamed: 0,Topic # 01,Topic # 02,Topic # 03,Topic # 04,Topic # 05,Topic # 06,Topic # 07,Topic # 08,Topic # 09,Topic # 10,Topic # 11,Topic # 12,Topic # 13,Topic # 14,Topic # 15,Topic # 16,Topic # 17,Topic # 18,Topic # 19,Topic # 20
0,stratpoint,philippines,software,services,team,business,use,2016,developers,time,held,stratpoint,development,work,new,years,tagged,technologies,using,innovate
1,site,2016,development,business,stratpoint,work,time,business,development,new,stratpoint,years,posted,tagged,innovate,time,innovate,new,innovate,inspire
2,2017,services,developers,use,tagged,time,new,stratpoint,time,software,tagged,work,2017,time,site,using,technologies,using,2017,deliver
3,1998,business,business,stratpoint,software,inspire,development,innovate,new,stratpoint,philippines,development,time,posted,agree,1998,stratpoint,team,new,posted
4,rights,years,services,tagged,new,2017,services,development,innovate,technologies,developers,tagged,technologies,innovate,updated,tagged,new,development,stratpoint,technologies
5,updated,held,using,years,2016,posted,held,software,inspire,services,posted,software,tagged,using,copyright,deliver,business,2016,philippines,using
6,agree,2017,use,work,time,innovate,tagged,held,tagged,held,technologies,use,deliver,rights,reserved,updated,software,business,held,development
7,copyright,agree,philippines,1998,development,deliver,stratpoint,use,deliver,work,work,developers,reserved,updated,rights,copyright,time,work,tagged,team
8,reserved,copyright,new,held,services,copyright,team,years,using,years,time,agree,copyright,agree,inspire,agree,agree,developers,team,philippines
9,using,deliver,years,inspire,held,reserved,work,new,site,agree,business,business,rights,copyright,deliver,rights,2017,innovate,business,services


Create an NMF model based on the wiki data.

In [202]:
nmf_wiki = NMF(n_components=20, solver="mu")
W = nmf_wiki.fit_transform(Y)
H = nmf_wiki.components_

In [203]:
get_topics(nmf_wiki, 20, vectorizer_wiki)

Unnamed: 0,Topic # 01,Topic # 02,Topic # 03,Topic # 04,Topic # 05,Topic # 06,Topic # 07,Topic # 08,Topic # 09,Topic # 10,Topic # 11,Topic # 12,Topic # 13,Topic # 14,Topic # 15,Topic # 16,Topic # 17,Topic # 18,Topic # 19,Topic # 20
0,time,used,film,introduction,history,university,american,john,new,city,language,international,war,united,national,music,north,game,book,day
1,life,use,series,known,century,state,player,english,york,area,people,world,world,states,government,group,island,player,life,church
2,years,example,john,term,modern,york,english,french,john,population,example,countries,ii,law,state,country,south,series,work,year
3,known,using,new,including,early,new,french,church,history,west,european,country,military,countries,party,early,water,time,world,national
4,later,form,based,include,french,public,german,king,american,south,based,development,british,national,president,british,000,10,great,12
5,century,different,life,group,period,national,president,century,south,north,common,european,german,international,political,including,east,year,according,modern
6,early,common,york,german,european,high,united,german,began,east,use,million,king,public,country,include,west,12,people,king
7,work,called,set,central,british,law,general,modern,united,major,development,national,united,use,military,major,area,second,national,great
8,called,term,history,french,english,development,british,british,use,central,set,major,states,term,general,world,population,end,term,10
9,people,number,work,population,political,president,states,term,end,000,non,based,air,john,power,began,10,later,century,century


In [138]:
import pandas as pd

def get_topics(model, num_topics, vectorizer):
    """Return a DataFrame with the 20 highest-weighted words of each topic."""
    # On scikit-learn >= 1.0, get_feature_names_out() is the replacement for this call.
    feat_names = vectorizer.get_feature_names()

    word_dict = {}
    for i in range(num_topics):
        # For each topic, take the indices of the 20 largest weights
        # and map them to their words.
        words_ids = model.components_[i].argsort()[:-20 - 1:-1]
        words = [feat_names[key] for key in words_ids]
        word_dict['Topic # ' + '{:02d}'.format(i+1)] = words

    return pd.DataFrame(word_dict)
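
The slice `argsort()[:-20 - 1:-1]` is the least obvious part of `get_topics`. The sketch below shows the same idiom on a made-up weight vector with a smaller cut-off; `feat_names`, `topic_weights` and `n_top` are illustrative values, not taken from the models above.

```python
import numpy as np

# Hypothetical weights of one topic over a 6-term vocabulary.
feat_names = ["app", "team", "cloud", "data", "event", "design"]
topic_weights = np.array([0.10, 0.70, 0.05, 0.40, 0.00, 0.25])

n_top = 3
# argsort() sorts ascending; the slice [:-n_top - 1:-1] walks backwards
# from the end, yielding the indices of the n_top largest weights.
top_ids = topic_weights.argsort()[:-n_top - 1:-1]
top_words = [feat_names[i] for i in top_ids]

print(top_words)  # ['team', 'data', 'design']
```

Note that in `get_topics` the cut-off of 20 is hard-coded and is independent of `num_topics`.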

In [204]:
X.shape

(73, 30)

In [152]:
type(X)

scipy.sparse.csr.csr_matrix

In [205]:
Y.shape

(100000, 101)

In [102]:
type(Y)

scipy.sparse.csr.csr_matrix

Test both models on previously unseen data.

In [59]:
preprocessed_random_wiki_article

['ferdinando',
 'sforzini',
 'born',
 'decemb',
 'italian',
 'footbal',
 'current',
 'plai',
 'striker',
 'product',
 'famou',
 'napoli',
 'youth',
 'academi',
 'stai',
 'grosseto',
 'sforzini',
 'given',
 'nicknam',
 'nandogo',
 'taglia',
 'gole',
 'cutthroat',
 'nicknam',
 'tradit',
 'goal',
 'celebr',
 'ear',
 'career',
 'tall',
 'forward',
 'sforzini',
 'start',
 'career',
 'youth',
 'napoli',
 'loan',
 'lazio',
 'napoli',
 'youth',
 'ferdinando',
 'sforzini',
 'graduat',
 'lazio',
 'youth',
 'immedi',
 'sold',
 'sassuolo',
 'sassuolo',
 'impress',
 'score',
 'goal',
 'appear',
 'seri',
 'season',
 'udin',
 'impress',
 'sassuolo',
 'calcio',
 'transfer',
 'seri',
 'club',
 'udin',
 'calcio',
 'june',
 'ferdinando',
 'sforzini',
 'remain',
 'udin',
 'month',
 'loan',
 'seri',
 'verona',
 'order',
 'gain',
 'experi',
 'august',
 'season',
 'verona',
 'ferdinando',
 'sforzini',
 'manag',
 'goal',
 'seri',
 'appear',
 'main',
 'substitut',
 'player',
 'leagu',
 'start',
 'season',
 'ud

In [195]:
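# Caution: each list element is treated as a separate document, so the 147 tokens become 147 rows.
# This new vectorizer also learns its own vocabulary, independent of vectorizer_wiki.
vectorizer_wiki_test = TfidfVectorizer(min_df = 1, stop_words='english') 
vectorizer_wiki_test.fit(preprocessed_random_wiki_article)
test = vectorizer_wiki_test.transform(preprocessed_random_wiki_article)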

In [196]:
test.shape

(147, 101)
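
The 147 rows correspond to the 147 tokens, not to one article. An alternative, sketched here on a made-up corpus, is to pass the unseen article as a single string and reuse the vectorizer that was fitted on the training data, so that the columns line up with the NMF model. All names and data in the sketch are assumptions. In this notebook that would presumably mean calling `vectorizer_wiki.transform` on the raw article text, because `vectorizer_wiki` was fitted on unstemmed text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Hypothetical training corpus, standing in for wiki_data.
train_docs = [
    "football club season goal striker",
    "football league season player goal",
    "album music band song release",
    "music song singer album tour",
]

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_docs)

nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
nmf.fit(train_matrix)

# The unseen document is ONE string inside a list and is transformed
# with the already-fitted vectorizer (no second fit).
unseen = ["striker scored a goal for the club this season"]
unseen_matrix = vec.transform(unseen)
print(unseen_matrix.shape)    # one row, same columns as train_matrix

doc_topic = nmf.transform(unseen_matrix)
print(doc_topic.shape)        # (1, 2)
print(doc_topic[0].argmax())  # index of the strongest topic
```

The values returned by `transform` are non-negative weights rather than probabilities, so they do not need to sum to one.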

In [206]:
topic_probabilities = nmf_wiki.transform(test)

In [207]:
topic_probabilities.shape

(147, 20)

In [208]:
topic_probabilities

array([[8.07034207e-258, 0.00000000e+000, 0.00000000e+000, ...,
        1.40420363e-271, 3.78290319e-004, 7.11584178e-018],
       [7.51983879e-003, 7.55461911e-004, 0.00000000e+000, ...,
        2.24817939e-003, 0.00000000e+000, 2.06643681e-004],
       [8.07034207e-258, 0.00000000e+000, 0.00000000e+000, ...,
        1.40420363e-271, 3.78290319e-004, 7.11584178e-018],
       ...,
       [9.24821076e-003, 6.99705576e-003, 3.76511515e-003, ...,
        0.00000000e+000, 0.00000000e+000, 3.28790799e-045],
       [3.32258431e-016, 0.00000000e+000, 0.00000000e+000, ...,
        0.00000000e+000, 1.44378928e-138, 7.97580458e-037],
       [3.26289070e-002, 1.98224406e-014, 1.17093348e-010, ...,
        1.19730699e-002, 0.00000000e+000, 0.00000000e+000]])

In [209]:
topic = np.argmax(topic_probabilities)

In [210]:
print(topic)

835
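
The value 835 cannot be a topic index, because there are only 20 topics. Called without `axis`, `np.argmax` returns a position in the flattened array. The sketch below uses a small made-up array to show the difference and then decodes 835 for the (147, 20) shape seen above.

```python
import numpy as np

# Hypothetical document-topic weights: 3 rows, 4 topics.
weights = np.array([
    [0.0, 0.2, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.9],
    [0.3, 0.0, 0.0, 0.1],
])

# Without axis, argmax indexes the flattened array.
flat = np.argmax(weights)
print(flat)                                   # 7
print(np.unravel_index(flat, weights.shape))  # row 1, column 3

# With axis=1, argmax returns one topic index per row.
per_row = np.argmax(weights, axis=1)
print(per_row)                                # [1 3 0]

# The flat index 835 in a (147, 20) array is row 41, column 15.
print(np.unravel_index(835, (147, 20)))
```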


In [212]:
topic_probabilities.shape

(147, 20)

In [189]:
topic_probabilities

array([[5.81631627e-169, 0.00000000e+000, 7.21093992e-003, ...,
        1.07696429e-206, 0.00000000e+000, 5.31264720e-024],
       [8.74461865e-002, 0.00000000e+000, 0.00000000e+000, ...,
        0.00000000e+000, 0.00000000e+000, 0.00000000e+000],
       [5.81631627e-169, 0.00000000e+000, 7.21093992e-003, ...,
        1.07696429e-206, 0.00000000e+000, 5.31264720e-024],
       ...,
       [0.00000000e+000, 0.00000000e+000, 0.00000000e+000, ...,
        0.00000000e+000, 0.00000000e+000, 0.00000000e+000],
       [0.00000000e+000, 0.00000000e+000, 0.00000000e+000, ...,
        0.00000000e+000, 0.00000000e+000, 0.00000000e+000],
       [0.00000000e+000, 0.00000000e+000, 0.00000000e+000, ...,
        0.00000000e+000, 0.00000000e+000, 0.00000000e+000]])

In [190]:
topic = np.argmax(topic_probabilities)

In [191]:
print(topic)

99


At this point, we can shelve the notebook for the moment, because this setup lacks a simple way to infer topics for unseen documents. 