***

# A Jupyter Notebook for a topic model analysis
# and visualization of The Guardian articles
# containing the word 'DJ' between 1985 and 2005

***

***

# The first part of the analysis
### Exploring the initial corpus

***

In [1]:
from __future__ import print_function
import pyLDAvis
import pyLDAvis.sklearn
from glob import glob
pyLDAvis.enable_notebook()
import warnings
warnings.filterwarnings('ignore') # only use this when you know the script and want to suppress unnecessary warnings
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np
import os.path

In [2]:
# Import dataset consisting of separate txt files

articles=[]
print("Constructing dataset, total number of documents included:")
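# sorted() makes the document order reproducible, so it lines up with the sorted file list used later
for file in sorted(glob("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper aticles DJ/DJ Guardian/DJ Guardian_individual cleaned articels/*.txt")):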
    with open(file, errors="ignore") as fi:
        articles.append(fi.read())
length=len(articles)
print(length)

Constructing dataset, total number of documents included:
7755


In [3]:
print(articles[0]) 

April 1, 1994
THE GUARDIAN IN GLASGOW
SECTION: THE GUARDIAN FEATURES PAGE; Pg. 11
LENGTH: 259 words
THE Guardian is hosting three talks at Sound City in Glasgow next week on these
topical pop issues: Is MTV culture poisoning the world? Is club culture making
gigs redundant? Has black British music come of age?
MTV Culture: As MTV plays an increasingly vital role in the success of new
bands, is it in danger of creating a globally homogenised pop music at the
expense of indigenous styles? Or is MTV in the vanguard of a creative new
eclecticism? Brent Hanson, director of MTV Europe, Stuart Cosgrove, independent
music TV producer, Caroline Sullivan, Guardian pop critic (Monday April 4).
Dancing On Pop's Grave: Has dance supplanted pop as the main arena for inventive
new music? Have record companies been slow to realise that the new performance
forum is the club rather than the gig?  Speakers include: Muff Winwood, head of
A&R, CBS, Radio 1 DJ Steve Lamacq, Bobby Bluebell formerly of the Bl

In [4]:
# original vectorizer
tf_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                stop_words = 'english',
                                lowercase = True,
                                token_pattern = r'\b[a-zA-Z]{3,}\b', # keeps words of 3 or more characters
                                max_df = 0.5, 
                                min_df = 10)
dtm_tf = tf_vectorizer.fit_transform(articles) 
print(dtm_tf.shape)

# What about stemming (says, say, said)


(7755, 21843)
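
To make the vectorizer settings concrete, the sketch below mimics two of them with the standard library only: the `token_pattern` regular expression and the document-frequency pruning controlled by `max_df` and `min_df`. It is an illustration under simplifying assumptions (a made-up four-document corpus, a hypothetical helper `prune_vocabulary`, no stop-word removal), not a reimplementation of `CountVectorizer`.

```python
import re
from collections import Counter

# Same pattern as in the vectorizer cell: alphabetic tokens of 3+ characters
TOKEN_PATTERN = re.compile(r'\b[a-zA-Z]{3,}\b')

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens of 3 or more characters."""
    return TOKEN_PATTERN.findall(text.lower())

def prune_vocabulary(docs, max_df=0.5, min_df=1):
    """Hypothetical helper: keep terms whose document frequency lies between
    min_df (an absolute count) and max_df (a proportion of all documents)."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document
    n_docs = len(docs)
    return sorted(term for term, df in doc_freq.items()
                  if min_df <= df <= max_df * n_docs)

# Made-up mini corpus, for illustration only
toy_docs = [
    "Radio 1 DJ Steve Lamacq plays new music",
    "The club DJ plays house music",
    "New music from the club scene",
    "A DJ set at the festival",
]

tokens = tokenize(toy_docs[0])
vocabulary = prune_vocabulary(toy_docs, max_df=0.5, min_df=2)
print(tokens)      # tokens of the first toy document
print(vocabulary)  # terms that survive the document-frequency pruning
```

One side effect worth knowing: because the pattern requires at least three letters, the two-letter token `dj` never enters the vocabulary, even though every article in the corpus contains it.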


In [5]:
tfidf_vectorizer = TfidfVectorizer(**tf_vectorizer.get_params())
dtm_tfidf = tfidf_vectorizer.fit_transform(articles)
print(dtm_tfidf.shape)

(7755, 21843)
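
Both matrices have the same shape because `TfidfVectorizer` is built with the same parameters; only the cell values differ. The sketch below shows the reweighting on a made-up count matrix, assuming scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), for which idf(t) = ln((1 + n) / (1 + df(t))) + 1. The function name `tfidf_rows` is hypothetical.

```python
import math

def tfidf_rows(counts):
    """Turn a list of raw count rows into L2-normalised tf-idf rows.
    Assumes smoothed idf: ln((1 + n_docs) / (1 + df)) + 1."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: in how many rows does each term occur?
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted = []
    for row in counts:
        raw = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in raw))
        weighted.append([v / norm if norm else 0.0 for v in raw])
    return weighted

# Made-up counts: 3 documents (rows) x 3 terms (columns)
toy_counts = [
    [2, 1, 0],
    [1, 0, 0],
    [1, 0, 3],
]

toy_tfidf = tfidf_rows(toy_counts)
for row in toy_tfidf:
    print([round(v, 3) for v in row])
```

The first term occurs in every toy document, so it gets the lowest idf weight; the rarer terms are weighted up before each row is scaled to unit length.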


In [6]:
# for TF DTM
lda_tf = LatentDirichletAllocation(n_topics=20, random_state=0)
lda_tf.fit(dtm_tf)
# for TFIDF DTM
lda_tfidf = LatentDirichletAllocation(n_topics=20, random_state=0)
lda_tfidf.fit(dtm_tfidf)
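# DeprecationWarning: The default value for 'learning_method' will be changed from 'online' to 'batch' in the release 0.20. This warning was introduced in 0.18.
# Note: scikit-learn 0.19 and later call the n_topics parameter n_components.
# Hoffman, Blei and Bach (2010, "Online Learning for Latent Dirichlet Allocation") report that the online method is very fast.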

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_jobs=1, n_topics=20, perp_tol=0.1,
             random_state=0, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

In [7]:
# LDA tf visualization
pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer)

***

### My interpretation of the topics:
#### T4 = dance topic
#### T7 = clubbing topic
#### T13 = glamour topic
#### T17 = racial hip hop topic
#### T14 = base music genre topic
#### T15 = film and television topic
#### T6 = stock market topic
#### T12 = online business topic
#### T16 = currency trading topic
#### T3 = human interest + drugs topic

***

In [8]:
# pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer, mds='mmds') 


In [9]:
# Explore the topics in the conventional way (i.e. using a row of the 30 most common topic words)

n_top_words = 30

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
    
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda_tf, tf_feature_names, n_top_words) 

Topic #0:
art beach gallery sea noise museum style modern fashion design water world best beautiful place carnival artist video hhh contemporary light euros mix images set air ibiza designer oup pleasure
Topic #1:
labour party minister government blair political election president leader prime vote tony tory snp secretary iraq conservative campaign politics war politicians country power conference state hague ecology democratic hamilton parliament
Topic #2:
festival john london england peel poll day strauss west june smith cricket green south match season park players hall sunday event liverpool stage live july glastonbury play north august leeds
Topic #3:
ftse early change close high low dow yield euro hang nikkei seng markets share key city dax cac pages index nasdaq aim closed xxx eurotop cmp stx indl jones wall
Topic #4:
jazz bass music asian drum london funk hop hip african dub beats musicians indian latin jungle sound sounds album musical classical playing tracks contemporary fus
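
The slice `[:-n_top_words - 1:-1]` in `print_top_words` walks the ascending `argsort` result backwards, so it returns the indices of the `n_top_words` largest weights, largest first. A small pure-Python sketch with made-up weights (the `argsort` helper below imitates NumPy's method on a plain list):

```python
def argsort(values):
    """Indices that would sort `values` in ascending order (like numpy.argsort)."""
    return sorted(range(len(values)), key=values.__getitem__)

# Made-up topic-word weights and vocabulary, for illustration only
feature_names = ["club", "dance", "jazz", "market", "techno"]
topic_weights = [4.0, 9.5, 0.3, 0.1, 7.2]

n_top = 3
ascending = argsort(topic_weights)       # indices from smallest to largest weight
top_idx = ascending[:-n_top - 1:-1]      # last n_top indices, in reverse order
top_words = [feature_names[i] for i in top_idx]
print(" ".join(top_words))
```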

***

#### For subsequent analysis, I need the topic numbers given above, and not the topic numbers given in the visualization. 
#### The most dance-related topics in the visualization are: T4 and T7. These correspond to T8 and T13 (note that the content of the topics is identical, only their identification number differs). 
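
The renumbering comes from the visualization, not from the model. As far as I can tell, pyLDAvis by default orders topics by their share of the corpus and numbers them from 1, whereas `model.components_` keeps the model's own zero-based order; please treat that as an assumption and check it against the installed pyLDAvis version. Under that assumption, the mapping can be sketched in plain Python (the function name `visualization_order` and the toy numbers are made up):

```python
def visualization_order(doc_topic, doc_lengths):
    """Map 1-based visualization ids to 0-based model topic indices,
    ASSUMING topics are ranked by their token-weighted share of the corpus."""
    n_topics = len(doc_topic[0])
    topic_mass = [
        sum(row[k] * length for row, length in zip(doc_topic, doc_lengths))
        for k in range(n_topics)
    ]
    ranked = sorted(range(n_topics), key=lambda k: topic_mass[k], reverse=True)
    return {vis_id: model_idx for vis_id, model_idx in enumerate(ranked, start=1)}

# Made-up doc-topic proportions (rows sum to 1) and document lengths
toy_doc_topic = [
    [0.1, 0.7, 0.2],
    [0.2, 0.2, 0.6],
    [0.1, 0.8, 0.1],
]
toy_lengths = [100, 200, 100]

mapping = visualization_order(toy_doc_topic, toy_lengths)
print(mapping)  # visualization id -> model topic index
```

In this notebook the same kind of lookup links visualization topics T4 and T7 to model topics 8 and 13.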

***

In [10]:
# Explore corpus using a doctopic matrix

CORPUS_PATH = "C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper aticles DJ/DJ Guardian/DJ Guardian_individual cleaned articels/"

filenames = sorted([os.path.join(CORPUS_PATH, fn) for fn in os.listdir(CORPUS_PATH)])

print(filenames[0])

# Reuse the document-term matrix and the LDA model fitted above instead of refitting both
doctopic = lda_tf.transform(dtm_tf)

doctopic = doctopic / np.sum(doctopic, axis=1, keepdims=True)

# Write doctopic to a csv file

os.chdir("C:/Users/renswilderom/Documents/Machine learning") 

filenamesclean = [fn.split('/')[-1] for fn in filenames]
i=0
with open('doctopic.csv',mode='w') as fo:
    for rij in doctopic:
        fo.write('"'+filenamesclean[i]+'"')
        fo.write(',')
        for kolom in rij:
            fo.write(str(kolom))
            fo.write(',')
        fo.write('\n')
        i+=1

C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper aticles DJ/DJ Guardian/DJ Guardian_individual cleaned articels/The Guardian_April 1, 1994_259_368.txt
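
The hand-written loop above leaves a trailing comma on every row, which is presumably why the file is later read with `index_col=False`. As an alternative sketch, the standard `csv` module handles quoting and separators. The helper name `write_doctopic` and the toy values are made up; rows are normalised so that each sums to 1, as in the cell above.

```python
import csv
import io

def write_doctopic(handle, filenames, doctopic):
    """Write one row per document: quoted file name, then normalised loadings."""
    writer = csv.writer(handle, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    # zip() pairs names and rows directly, so no manual counter is needed
    for name, row in zip(filenames, doctopic):
        total = sum(row)
        writer.writerow([name] + [value / total for value in row])

# Two file names from the corpus with made-up, unnormalised loadings
toy_names = ["The Guardian_April 1, 1994_259_368.txt",
             "The Guardian_April 1, 1994_425_370.txt"]
toy_doctopic = [[2.0, 1.0, 1.0],
                [1.0, 3.0, 0.0]]

buffer = io.StringIO()          # stands in for open('doctopic.csv', 'w', newline='')
write_doctopic(buffer, toy_names, toy_doctopic)
csv_text = buffer.getvalue()
print(csv_text)
```

A file written this way has exactly one field per column name, so it could be read back without `index_col=False`.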


***

# The second part of the analysis
### Analyzing a subcorpus

***

***



#### Below I continue with the doctopic matrix produced above.

#### Dance music articles load strongly on topics 8 and 13. I selected individual articles as 'dance articles' if their loading on one of these topics is higher than the mean loading of that topic plus one standard deviation (this is the 'low cutoff' point). Alternatively, a 'high cutoff' point is set by including only topic loadings that are higher than the mean plus two standard deviations.
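
The selection rule can be sketched with the standard library as follows. The loadings are made up and the helper names `cutoffs` and `flag` are hypothetical; `statistics.stdev` is the sample standard deviation, which should match what pandas' `describe()` reports.

```python
import statistics

def cutoffs(loadings):
    """Return (low, high) thresholds: mean + 1 std and mean + 2 std."""
    mean = statistics.mean(loadings)
    std = statistics.stdev(loadings)   # sample standard deviation (n - 1)
    return mean + std, mean + 2 * std

def flag(loadings, threshold):
    """1 for every document whose loading exceeds the threshold, else 0."""
    return [1 if value > threshold else 0 for value in loadings]

# Made-up loadings of twelve documents on a single topic
toy_t8 = [0.55, 0.00, 0.08, 0.05, 0.02, 0.90,
          0.01, 0.03, 0.02, 0.04, 0.01, 0.02]

low, high = cutoffs(toy_t8)
dance_low = flag(toy_t8, low)
dance_high = flag(toy_t8, high)
print(dance_low)
print(dance_high)
```

The high criterion always selects a subset of the low one. In the notebook the flag is set when either of the two dance topics exceeds its own threshold, so the per-topic flags are combined with a logical OR.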


***

In [11]:
# Open doctopic.csv
# Create a new header row with column names: file and t_0 to t_19
# Create a descriptive statistics table of the variables t_0 to t_19 and calculate the threshold
# Then create a 'dance dummy' variable for articles within the selected topics positioned above the threshold (mean + 1 std, or mean + 2 std)
# Finally, use the list of article names to copy files from one directory to another, enabling subsequent analysis.

import pandas as pd
csv_file = pd.read_csv("C:/Users/renswilderom/Documents/Machine learning/doctopic.csv", header=None, index_col=False,
                       names=["file"] + ["t_%d" % i for i in range(20)])  # file, t_0 ... t_19

# When creating a row with new names, be careful not to overwrite the original first row.
# Load the CSV file into a DataFrame
df = csv_file
df

Unnamed: 0,file,t_0,t_1,t_2,t_3,t_4,t_5,t_6,t_7,t_8,...,t_10,t_11,t_12,t_13,t_14,t_15,t_16,t_17,t_18,t_19
0,"The Guardian_April 1, 1994_259_368.txt",0.000325,0.000325,0.040934,0.021634,0.015616,0.000325,0.000325,0.000325,0.546908,...,0.000325,0.018402,0.000325,0.000325,0.000325,0.000325,0.147581,0.018210,0.000325,0.000325
1,"The Guardian_April 1, 1994_3298_369.txt",0.000034,0.044571,0.000034,0.000034,0.000034,0.008427,0.000034,0.000034,0.000034,...,0.012728,0.000034,0.203479,0.063610,0.000034,0.100384,0.000034,0.000034,0.298328,0.000034
2,"The Guardian_April 1, 1994_3298_371.txt",0.000034,0.044571,0.000034,0.000034,0.000034,0.008427,0.000034,0.000034,0.000034,...,0.012728,0.000034,0.203479,0.063610,0.000034,0.100384,0.000034,0.000034,0.298328,0.000034
3,"The Guardian_April 1, 1994_425_370.txt",0.000279,0.010199,0.000279,0.000279,0.000279,0.010181,0.067423,0.000279,0.081635,...,0.000279,0.029502,0.000279,0.000279,0.000279,0.000279,0.000279,0.000279,0.797149,0.000279
4,"The Guardian_April 1, 1994_425_373.txt",0.000279,0.010199,0.000279,0.000279,0.000279,0.010181,0.067423,0.000279,0.081635,...,0.000279,0.029502,0.000279,0.000279,0.000279,0.000279,0.000279,0.000279,0.797149,0.000279
5,"The Guardian_April 1, 1994_733_367.txt",0.000148,0.037209,0.088214,0.000148,0.000148,0.000148,0.000148,0.000148,0.048574,...,0.017783,0.000148,0.018901,0.000148,0.000148,0.249042,0.000148,0.000148,0.053415,0.000148
6,"The Guardian_April 1, 1996_1654_15.txt",0.000098,0.000098,0.000098,0.000098,0.000098,0.000098,0.022031,0.000098,0.020958,...,0.000098,0.000098,0.000098,0.041480,0.000098,0.661640,0.000098,0.000098,0.062604,0.000098
7,"The Guardian_April 1, 2000_110_1875.txt",0.001515,0.001515,0.001515,0.971212,0.001515,0.001515,0.001515,0.001515,0.001515,...,0.001515,0.001515,0.001515,0.001515,0.001515,0.001515,0.001515,0.001515,0.001515,0.001515
8,"The Guardian_April 1, 2000_298_1876.txt",0.000397,0.000397,0.021451,0.025221,0.000397,0.000397,0.000397,0.000397,0.000397,...,0.000397,0.364719,0.000397,0.000397,0.000397,0.000397,0.000397,0.000397,0.176126,0.406530
9,"The Guardian_April 1, 2000_487_1877.txt",0.000213,0.000213,0.000213,0.000213,0.000213,0.000213,0.000213,0.000213,0.000213,...,0.147692,0.000213,0.030596,0.045004,0.000213,0.150185,0.053861,0.000213,0.277881,0.000213


In [12]:
df_1 = df.describe().loc[['mean','std']]
df2 = df_1.transpose()
df2['cutoff_high'] = df2['mean'] + 2*df2['std'] 
df2['cutoff_low'] = df2['mean'] + df2['std'] 
df2

Unnamed: 0,mean,std,cutoff_high,cutoff_low
t_0,0.015631,0.034845,0.085321,0.050476
t_1,0.005112,0.016749,0.03861,0.021861
t_2,0.033073,0.066705,0.166482,0.099777
t_3,0.076235,0.245264,0.566763,0.321499
t_4,0.013928,0.040718,0.095364,0.054646
t_5,0.003993,0.017212,0.038418,0.021206
t_6,0.021852,0.060399,0.142651,0.082251
t_7,0.009073,0.031719,0.072512,0.040792
t_8,0.115701,0.1579,0.4315,0.2736
t_9,0.122045,0.143624,0.409293,0.265669


In [13]:
# Select the appropriate cutoff point per topic from the table above
# .loc[row, column] replaces get_value, which newer pandas versions no longer provide
t_8_cutoff_high = df2.loc['t_8', 'cutoff_high']
t_8_cutoff_low = df2.loc['t_8', 'cutoff_low']

t_13_cutoff_high = df2.loc['t_13', 'cutoff_high']
t_13_cutoff_low = df2.loc['t_13', 'cutoff_low']

In [14]:
# These values are used to create new 'dance_high' and 'dance_low' dummies in the original df 

# Assign through .loc to avoid chained indexing, which may silently write to a copy
df['dance_high'] = '0'
df.loc[(df['t_8'] > t_8_cutoff_high) |
       (df['t_13'] > t_13_cutoff_high), 'dance_high'] = '1'

df['dance_low'] = '0'
df.loc[(df['t_8'] > t_8_cutoff_low) |
       (df['t_13'] > t_13_cutoff_low), 'dance_low'] = '1'

df

Unnamed: 0,file,t_0,t_1,t_2,t_3,t_4,t_5,t_6,t_7,t_8,...,t_12,t_13,t_14,t_15,t_16,t_17,t_18,t_19,dance_high,dance_low
0,"The Guardian_April 1, 1994_259_368.txt",0.000325,0.000325,0.040934,0.021634,0.015616,0.000325,0.000325,0.000325,0.546908,...,0.000325,0.000325,0.000325,0.000325,0.147581,0.018210,0.000325,0.000325,1,1
1,"The Guardian_April 1, 1994_3298_369.txt",0.000034,0.044571,0.000034,0.000034,0.000034,0.008427,0.000034,0.000034,0.000034,...,0.203479,0.063610,0.000034,0.100384,0.000034,0.000034,0.298328,0.000034,0,0
2,"The Guardian_April 1, 1994_3298_371.txt",0.000034,0.044571,0.000034,0.000034,0.000034,0.008427,0.000034,0.000034,0.000034,...,0.203479,0.063610,0.000034,0.100384,0.000034,0.000034,0.298328,0.000034,0,0
3,"The Guardian_April 1, 1994_425_370.txt",0.000279,0.010199,0.000279,0.000279,0.000279,0.010181,0.067423,0.000279,0.081635,...,0.000279,0.000279,0.000279,0.000279,0.000279,0.000279,0.797149,0.000279,0,0
4,"The Guardian_April 1, 1994_425_373.txt",0.000279,0.010199,0.000279,0.000279,0.000279,0.010181,0.067423,0.000279,0.081635,...,0.000279,0.000279,0.000279,0.000279,0.000279,0.000279,0.797149,0.000279,0,0
5,"The Guardian_April 1, 1994_733_367.txt",0.000148,0.037209,0.088214,0.000148,0.000148,0.000148,0.000148,0.000148,0.048574,...,0.018901,0.000148,0.000148,0.249042,0.000148,0.000148,0.053415,0.000148,0,0
6,"The Guardian_April 1, 1996_1654_15.txt",0.000098,0.000098,0.000098,0.000098,0.000098,0.000098,0.022031,0.000098,0.020958,...,0.000098,0.041480,0.000098,0.661640,0.000098,0.000098,0.062604,0.000098,0,0
7,"The Guardian_April 1, 2000_110_1875.txt",0.001515,0.001515,0.001515,0.971212,0.001515,0.001515,0.001515,0.001515,0.001515,...,0.001515,0.001515,0.001515,0.001515,0.001515,0.001515,0.001515,0.001515,0,0
8,"The Guardian_April 1, 2000_298_1876.txt",0.000397,0.000397,0.021451,0.025221,0.000397,0.000397,0.000397,0.000397,0.000397,...,0.000397,0.000397,0.000397,0.000397,0.000397,0.000397,0.176126,0.406530,0,0
9,"The Guardian_April 1, 2000_487_1877.txt",0.000213,0.000213,0.000213,0.000213,0.000213,0.000213,0.000213,0.000213,0.000213,...,0.030596,0.045004,0.000213,0.150185,0.053861,0.000213,0.277881,0.000213,0,0


In [15]:
# How many dance articles do I have according to the high criterion?
df3 = df[df.dance_high != '0']
df4 = df3[['file']]
df4.shape

(991, 1)

In [16]:
# How many dance articles do I have according to the low criterion?
df3 = df[df.dance_low != '0']
df5 = df3[['file']]
df5.shape

(1879, 1)

In [17]:
# Write both article lists to Excel. to_excel accepts a file name directly,
# so no separate ExcelWriter object (and no writer.save() call) is needed.

os.chdir("C:/Users/renswilderom/Documents/Machine learning") 

# List of articles selected by the high cutoff
df4.to_excel('dance_list_high.xlsx', sheet_name='Sheet1', engine='xlsxwriter')

# List of articles selected by the low cutoff
df5.to_excel('dance_list_low.xlsx', sheet_name='Sheet1', engine='xlsxwriter')

***

### Working with the dance high articles

***

In [18]:
# This step copies the identified dance music articles from their original folder to a new destination folder. 

import shutil
import os


# Create a folder for the dance articles if it does not exist yet.
dest1 = "C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper aticles DJ/DJ Guardian/dance articles high"
if not os.path.exists(dest1):
    os.makedirs(dest1)

os.chdir("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper aticles DJ/DJ Guardian/DJ Guardian_individual cleaned articels")

# the following list of articles are dance articles:
files_tocopy = pd.read_excel("C:/Users/renswilderom/Documents/Machine learning/dance_list_high.xlsx") 
files_tocopy = files_tocopy['file'].apply(lambda x: x.replace('"', "")).tolist()


for f in files_tocopy:
    shutil.copy(f, dest1)   
     
        
print ("Done with copying dance high files")    

Done with copying dance high files
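
The copy loop above stops with an exception at the first file name that cannot be found. A more defensive sketch, assuming nothing beyond the standard library (the function name `copy_listed_files` is hypothetical, and the demo uses temporary directories rather than the real corpus):

```python
import shutil
import tempfile
from pathlib import Path

def copy_listed_files(names, source_dir, dest_dir):
    """Copy each listed file from source_dir to dest_dir.
    Returns (copied, missing) so that skipped names can be inspected."""
    source_dir, dest_dir = Path(source_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)   # create the folder if needed
    copied, missing = [], []
    for name in names:
        src = source_dir / name
        if src.is_file():
            shutil.copy(src, dest_dir)
            copied.append(name)
        else:
            missing.append(name)
    return copied, missing

# Demo on throwaway directories with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "all_articles"
    source.mkdir()
    (source / "a.txt").write_text("article a")
    (source / "b.txt").write_text("article b")

    copied, missing = copy_listed_files(["a.txt", "c.txt"], source, Path(tmp) / "dance")
    copied_on_disk = sorted(p.name for p in (Path(tmp) / "dance").iterdir())

print("copied:", copied)
print("missing:", missing)
```

Comparing `len(copied)` with the number of rows in the Excel list gives the same sanity check as counting the files afterwards.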


In [19]:
# Import dataset consisting of separate txt files

articles=[]
print("Constructing dataset, total number of documents included:")
# sorted() makes the document order reproducible
for file in sorted(glob("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper aticles DJ/DJ Guardian/dance articles high/*.txt")):
    with open(file, errors="ignore") as fi:
        articles.append(fi.read())
length=len(articles)
print(length)
print("And check whether the numbers correspond...")

Constructing dataset, total number of documents included:
991
And check whether the numbers correspond...


In [20]:
# original vectorizer
tf_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                stop_words = 'english',
                                lowercase = True,
                                token_pattern = r'\b[a-zA-Z]{3,}\b', # keeps words of 3 or more characters
                                max_df = 0.5, 
                                min_df = 10)
dtm_tf = tf_vectorizer.fit_transform(articles) 
print(dtm_tf.shape)

# What about stemming (says, say, said)

(991, 3895)


In [21]:
tfidf_vectorizer = TfidfVectorizer(**tf_vectorizer.get_params())
dtm_tfidf = tfidf_vectorizer.fit_transform(articles)
print(dtm_tfidf.shape)

(991, 3895)


In [22]:
# for TF DTM
lda_tf = LatentDirichletAllocation(n_topics=20, random_state=0)
lda_tf.fit(dtm_tf)
# for TFIDF DTM
lda_tfidf = LatentDirichletAllocation(n_topics=20, random_state=0)
lda_tfidf.fit(dtm_tfidf)
# DeprecationWarning: The default value for 'learning_method' will be changed from 'online' to 'batch' in the release 0.20. This warning was introduced in 0.18.
# Blei wrote somewhere that the online method is very fast.

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_jobs=1, n_topics=20, perp_tol=0.1,
             random_state=0, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

In [23]:
# LDA tf visualization
pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer)

***

### Working with the dance low articles

***

In [24]:
# In this step I copy the identified dance music articles from their original folder to a new destination folder. 

import shutil
import os


# Create a folder for the dance articles if it does not exist yet.
dest1 = "C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper aticles DJ/DJ Guardian/dance articles low"
if not os.path.exists(dest1):
    os.makedirs(dest1)

os.chdir("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper aticles DJ/DJ Guardian/DJ Guardian_individual cleaned articels")

# the following list of articles are dance articles:
files_tocopy = pd.read_excel("C:/Users/renswilderom/Documents/Machine learning/dance_list_low.xlsx") 
files_tocopy = files_tocopy['file'].apply(lambda x: x.replace('"', "")).tolist()


for f in files_tocopy:
    shutil.copy(f, dest1)   
     
        
print ("Done with copying dance low files")   

Done with copying dance low files


In [25]:
# Import dataset consisting of separate txt files

articles=[]
print("Constructing dataset, total number of documents included:")
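# sorted() makes the document order reproducible, so it lines up with the sorted file list used later
for file in sorted(glob("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper aticles DJ/DJ Guardian/dance articles low/*.txt")):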
    with open(file, errors="ignore") as fi:
        articles.append(fi.read())
length=len(articles)
print(length)
print("And check whether the numbers correspond...")

Constructing dataset, total number of documents included:
1879
And check whether the numbers correspond...


In [26]:
# original vectorizer
tf_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                stop_words = 'english',
                                lowercase = True,
                                token_pattern = r'\b[a-zA-Z]{3,}\b', # keeps words of 3 or more characters
                                max_df = 0.5, 
                                min_df = 10)
dtm_tf = tf_vectorizer.fit_transform(articles) 
print(dtm_tf.shape)

# What about stemming (says, say, said)

(1879, 7565)


In [27]:
# tfidf_vectorizer = TfidfVectorizer(**tf_vectorizer.get_params())
# dtm_tfidf = tfidf_vectorizer.fit_transform(articles)
# print(dtm_tfidf.shape)

In [28]:
# for TF DTM
lda_tf = LatentDirichletAllocation(n_topics=20, random_state=0)
lda_tf.fit(dtm_tf)
# for TFIDF DTM
# lda_tfidf = LatentDirichletAllocation(n_topics=20, random_state=0)
# lda_tfidf.fit(dtm_tfidf)
# DeprecationWarning: The default value for 'learning_method' will be changed from 'online' to 'batch' in the release 0.20. This warning was introduced in 0.18.
# Blei wrote somewhere that the online method is very fast.

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_jobs=1, n_topics=20, perp_tol=0.1,
             random_state=0, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

In [29]:
# LDA tf visualization
pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer)

***

# The third part of the analysis
### Exploring the top articles per topic

***

In [30]:
# Say I would like to see the most prominent articles for topic 13 in the visualization, the Northern Soul topic.
# I first look up which number that topic has in the conventional topic ordering printed below:

n_top_words = 30

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
    
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda_tf, tf_feature_names, n_top_words)

# Topic 13, the Northern Soul topic, appears to correspond to topic 18

Topic #0:
guide clubs preview club night sat house tonight nick green baird patric london fri djs residents guest nights breaks funk set support live sees thu party decks venue october november
Topic #1:
eno pieces wheel quartet composer drumming premiere trains opera spooky bowie minimalist recipe fond adams electric fence cave bases ghosts loops issued final april club scheduled city tape sampling royalties
Topic #2:
said police people black road community year says local birmingham yesterday man told home pubs men violence british asian young night group women outside north street pub south area caribbean
Topic #3:
says people students london work school law business university women course children year hacienda young education job company pounds theatre wild training college time manager society skills centre kids management
Topic #4:
club dance house clubs people night djs ibiza ecstasy scene clubbing acid rave culture clubbers london party nights parties gay year techno ministry

In [38]:
# create a doctopic matrix

CORPUS_PATH = "C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper aticles DJ/DJ Guardian/dance articles low/"

filenames = sorted([os.path.join(CORPUS_PATH, fn) for fn in os.listdir(CORPUS_PATH)])

print(filenames[0])

# Reuse the document-term matrix and the LDA model fitted above instead of refitting both
doctopic = lda_tf.transform(dtm_tf)

doctopic = doctopic / np.sum(doctopic, axis=1, keepdims=True)

os.chdir("C:/Users/renswilderom/Documents/Machine learning")

filenamesclean = [fn.split('/')[-1] for fn in filenames]
i=0
with open('doctopic_articles.csv',mode='w') as fo:
    for rij in doctopic:
        fo.write('"'+filenamesclean[i]+'"')
        fo.write(',')
        for kolom in rij:
            fo.write(str(kolom))
            fo.write(',')
        fo.write('\n')
        i+=1

C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper aticles DJ/DJ Guardian/dance articles low/The Guardian_April 1, 1994_259_368.txt


In [39]:
# Open the CSV file produced in the cell above in order to explore the top articles related to the topic of interest

import pandas as pd
csv_file = pd.read_csv("C:/Users/renswilderom/Documents/Machine learning/doctopic_articles.csv", header=None, index_col=False,
                       names=["file"] + ["t_%d" % i for i in range(20)])  # file, t_0 ... t_19

# When creating a row with new names, be careful not to overwrite the original first row.
# Load the CSV file into a DataFrame
df = csv_file

print(df.shape)

(1879, 21)


In [40]:
# What is the topic of interest?

topic_of_interest = "t_18"

# Set the working directory to the corpus folder that was used to build the doctopic matrix

os.chdir(CORPUS_PATH)

In [41]:
df1 = df[['file', topic_of_interest]] 
df2 = df1.sort_values(topic_of_interest, ascending=False)  # DataFrame.sort was removed from pandas
df3 = df2.head(5)
df3

Unnamed: 0,file,t_18
70,"The Guardian_April 24, 1998_444_1708.txt",0.995759
71,"The Guardian_April 24, 1998_444_2120.txt",0.995759
82,"The Guardian_April 24, 2004_108_7343.txt",0.612723
81,"The Guardian_April 24, 2004_108_6032.txt",0.612723
853,"The Guardian_July 23, 1999_1825_1058.txt",0.255883


In [42]:
interest1 = df3['file'].iloc[0]
with open(interest1, errors="ignore") as fi:  # read-only; the file is closed automatically
    lines = fi.read().splitlines()
lines

['April 24, 1998',
 'Music: The A to Z of clubbing;',
 'Our cut-out-and-keep guide to pre -millennial nightlife. X is for . . .',
 'BYLINE: BEN OSBORNE',
 'SECTION: The Guardian Features Page; Pg. 15',
 'LENGTH: 444 words',
 'Xpress-2 Originally formed by Rocky and Diesel, two collaborators hanging around',
 'the Boys Own fanzine scene in the late eighties, Xpress-2 soon picked up a third',
 'member. Ashley Beedle, a young London DJ was introduced to Xpress-2 while they',
 'were recording their first single, Muzik X-Press. Further club classics followed',
 'with the release of Rock 2 House, Tranz Europe X-press, Say What and London',
 'X-press. Having played together as DJs at the legendary Boys Own parties, they',
 'teamed up as a house DJ outfit, pinching a hip-hop sound system trick and',
 'playing their sets over four decks and two mixers. Exhibiting their broad roots,',
 'the three joined with Dave Hill to form an eclectic, jazz -inspired,',
 'experimental offshoot, the Ballistic 

In [43]:
interest2 = df3['file'].iloc[1]
with open(interest2, errors="ignore") as fi:  # read-only; the file is closed automatically
    lines = fi.read().splitlines()
lines

['April 24, 1998',
 'Music: The A to Z of clubbing;',
 'Our cut-out-and-keep guide to pre -millennial nightlife. X is for . . .',
 'BYLINE: BEN OSBORNE',
 'SECTION: The Guardian Features Page; Pg. 15',
 'LENGTH: 444 words',
 'Xpress-2 Originally formed by Rocky and Diesel, two collaborators hanging around',
 'the Boys Own fanzine scene in the late eighties, Xpress-2 soon picked up a third',
 'member. Ashley Beedle, a young London DJ was introduced to Xpress-2 while they',
 'were recording their first single, Muzik X-Press. Further club classics followed',
 'with the release of Rock 2 House, Tranz Europe X-press, Say What and London',
 'X-press. Having played together as DJs at the legendary Boys Own parties, they',
 'teamed up as a house DJ outfit, pinching a hip-hop sound system trick and',
 'playing their sets over four decks and two mixers. Exhibiting their broad roots,',
 'the three joined with Dave Hill to form an eclectic, jazz -inspired,',
 'experimental offshoot, the Ballistic 

In [44]:
interest3 = df3['file'].iloc[2]
with open(interest3, errors="ignore") as fi:  # read-only; the file is closed automatically
    lines = fi.read().splitlines()
lines

['April 24, 2004',
 'The Guide: Clubs: * Digital T Festival BELFAST',
 'BYLINE: patric baird',
 'SECTION: The Guide, Pg. 33',
 'LENGTH: 108 words',
 "Northern Ireland's biggest celebration of electronic-based music returns,",
 'offering bands and DJs as diverse as Hawkwind, Faithless and Talvin Singh,',
 'combined with performances from artists shaping the local musical landscape.',
 "Tonight's event at The Pavilion features DJ Salvatore Principato of New York",
 'minimalist funksters Liquid Liquid, with support from Chris Caul and Greg',
 'McCann. Playing live are Glasgow-based band Engine, who can be relied upon to',
 'deliver the dancefloor goods, especially as they are celebrating the release of',
 "the Tom Findlay remix of their single, Startin' To Feel.",
 "Lavery's Pavilion Bar, Ormeau Road, Sat 24"]

In [45]:
interest4 = df3['file'].iloc[3]
with open(interest4, errors="ignore") as fi:  # read-only; the file is closed automatically
    lines = fi.read().splitlines()
lines

['April 24, 2004',
 'The Guide: Clubs: * Digital T Festival BELFAST',
 'BYLINE: patric baird',
 'SECTION: The Guide, Pg. 33',
 'LENGTH: 108 words',
 "Northern Ireland's biggest celebration of electronic-based music returns,",
 'offering bands and DJs as diverse as Hawkwind, Faithless and Talvin Singh,',
 'combined with performances from artists shaping the local musical landscape.',
 "Tonight's event at The Pavilion features DJ Salvatore Principato of New York",
 'minimalist funksters Liquid Liquid, with support from Chris Caul and Greg',
 'McCann. Playing live are Glasgow-based band Engine, who can be relied upon to',
 'deliver the dancefloor goods, especially as they are celebrating the release of',
 "the Tom Findlay remix of their single, Startin' To Feel.",
 "Lavery's Pavilion Bar, Ormeau Road, Sat 24"]

In [46]:
interest5 = df3['file'].iloc[4]
with open(interest5, errors="ignore") as fi:  # read-only; the file is closed automatically
    lines = fi.read().splitlines()
lines

['July 23, 1999',
 'All-time top 10;',
 "Northern Soul, one of Britain's most obsessive subcultures, is revealed in all",
 'its glory in a ten-hour documentary. Bob Dickinson talks to the DJ and producer',
 'behind it, Ian Levine',
 'BYLINE: ??',
 'SECTION: Guardian Friday Pages; Pg. 14',
 'LENGTH: 1825 words',
 "In the introduction to Ian Levine's film, The Strange World of Northern Soul, a",
 'DJ looks into the camera and recalls the times as a fan he spent all his',
 'available cash on records, as a result having to live in a diet of cold baked',
 "beans. Another DJ likens Wigan Casino, the club he loved, to a 'secret society'.",
 "The motive for club-going? 'Instant gratification live for the weekend'. These",
 'two are talking retrospectively, looking back across three decades at the effect',
 'of music from industrial towns in North America upon industrial towns in the',
 'north of England; a British working class obsession with black American working',
 "class dance music. It's 

***

# End of script

***

***

### More links and resources:

#### This is the original script on which this notebook is based: http://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/sklearn.ipynb

#### TF-IDF vectorizers: http://blog.christianperone.com/2011/09/machine-learning-text-feature-extraction-tf-idf-part-i/

#### MMDS is dimension reduction via Jensen-Shannon Divergence & Metric Multidimensional Scaling: http://bugra.github.io/work/notes/2014-03-16/jensen-shannon-divergence-matrix-multi-dimensional-scaling/

#### Working with Markdown: http://datascience.ibm.com/blog/markdown-for-jupyter-notebooks-cheatsheet/

#### Notebook shortcuts: https://www.cheatography.com/weidadeyue/cheat-sheets/jupyter-notebook/pdf_bw/

#### Number of topics I: https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set

#### Number of topics II: https://datasciencelab.wordpress.com/2013/12/27/finding-the-k-in-k-means-clustering/

#### Pandas: https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/

#### Fit transform: https://datascience.stackexchange.com/questions/12321/difference-between-fit-and-fit-transform-in-scikit-learn-models


***