# Exploring Wikipedia clickstream data: English Wiki in December 2018    

## Defining community topics with NLP

### 1. Introduction   

This notebook contains natural language processing of the article communities found in the English Wikipedia clickstream dataset for December 2018.  
This is the 5th part of a project about the usage patterns of Wikipedia.  
The preceding parts are:  
1. [Data quality analysis of available datasets](data_quality_analysis.ipynb)  
2. [Exploratory data analysis of the English Wikipedia clickstream dataset for December 2018](English_Wikipedia_EDA.ipynb)  
3. [Graph modeling of the English Wikipedia clickstream dataset for December 2018](English_Wikipedia_graph_modeling_AWS.ipynb)  
4. [Network analysis of the English Wikipedia clickstream dataset](English_Wikipedia_network_analysis_AWS.ipynb)

In the preceding notebook, we used Louvain community detection on the English Wikipedia clickstream graph to find clusters of Wikipedia articles that tend to be close to each other in terms of user clickstream traffic. Inspecting the article titles in the resulting communities suggests that the communities tend to cluster around certain topics. Since these communities appear to reflect topics of interest, it would be useful to know what these topics are without having to browse through the individual article titles in each community. We can use NLP to extract the community topics.  

In this notebook, we'll use natural language processing techniques to derive topic descriptions for the article communities we've found in the English Wikipedia clickstream graph.  

#### Notebook contents:  

1. [Introduction](#1.-Introduction)  
2. [Notebook setup](#2.-Notebook-setup)  
3. [Analysis](#3.-Analysis)  
   3.1 [Pre-process article titles](#3.1-Pre-process-article-titles)  
   3.2 [Named entity recognition](#3.2-Named-entity-recognition)  
      = [Named entity type descriptions](#Named-entity-type-descriptions)  
   3.3 [Lemmatizing and stemming](#3.3-Lemmatizing-and-stemming)  
   3.4 [Scaled traffic weights](#3.4-Scaled-traffic-weights)  
   3.5 [Bags of words](#3.5-Bags-of-words)  
      = [What are the topics of the top 20 communities?](#What-are-the-topics-of-the-top-20-communities?)
4. [Summary](#4.-Summary)
5. [Next steps](#5.-Next-steps)

### 2. Notebook setup  

#### Imports

In [1]:
import pandas as pd
import numpy as np

import re

from py2neo import authenticate, Graph, Node, Relationship


import os
import csv
import pickle

from time import sleep
from timeit import default_timer as timer
from datetime import datetime

from IPython.display import display, HTML

# custom general helper functions for this project
import custom_utils as cu
import importlib


In [2]:
from collections import defaultdict

In [3]:
# Note: not sure why, but on running this gensim import the kernel kept dying.
# Running the following in command line fixed it:
# conda install -f numpy
import gensim

In [4]:
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *

In [5]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/arinai/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [6]:
# reload imports as needed
importlib.reload(cu);

In [7]:
# set up Pandas options
pd.set_option('display.max_columns', 25)
pd.set_option('display.max_rows', 100)
pd.set_option('display.precision', 3)
pd.options.display.float_format = '{:.2f}'.format

In [8]:
pd.options.display.max_colwidth = 100

#### Connect to neo4j

In [None]:
n4j_cred = cu.read_n4jpass()

# set up authentication parameters
authenticate("localhost:7474", n4j_cred["user"], n4j_cred["password"])

# connect to authenticated graph database
graph = Graph("http://localhost:7474/db/data/")

# test query
r = graph.data('CALL db.indexes;')
pd.DataFrame(r)

#### Pull the input data from neo4j

In [None]:
# get a listing of all Louvain communities and their stats
start_time = timer()
q = """
    MATCH (a:Article)
    WHERE exists(a.louvain_community)
    RETURN 
        a.title as title,
        a.louvain_community as louvain_community, 
        a.external_search_traffic as external_search_traffic,
        a.link_in_traffic as link_in_traffic,
        a.search_in_traffic as search_in_traffic
    ORDER BY external_search_traffic IS NOT NULL desc, external_search_traffic desc;
    """
r = graph.data(q)
louvain_communities_for_NLP = pd.DataFrame(r)

print("Number of articles:",len(louvain_communities_for_NLP))

cu.printRunTime(start_time)

Output from the above:  
Number of articles: 2729767  
 Runtime: 3.84 min

In [None]:
# pickle the output
myoutfile = "pickles/en_1218_louvain_communities_for_NLP.pkl"
with open(myoutfile, 'wb') as picklefile:
    pickle.dump(louvain_communities_for_NLP, picklefile)

print("Pickle created: " + myoutfile)

Output from the above:  

Pickle created: pickles/en_1218_louvain_communities_for_NLP.pkl

In [59]:
# unpickle
with open("pickles/en_1218_louvain_communities_for_NLP.pkl", 'rb') as picklefile: 
    louvain_communities_for_NLP = pickle.load(picklefile)

louvain_communities_for_NLP.head(20)

Unnamed: 0,external_search_traffic,link_in_traffic,louvain_community,search_in_traffic,title
0,4576854.0,1108189.0,3,5630.0,George_H._W._Bush
1,3538068.0,639353.0,4,6451.0,Jason_Momoa
2,3475113.0,223635.0,9,23563.0,2.0_(film)
3,3251996.0,682992.0,4,10416.0,Bird_Box_(film)
4,3020671.0,31170.0,1,,Main_Page
5,2634665.0,408421.0,4,34309.0,Aquaman_(film)
6,2328884.0,200893.0,4,192.0,Bird_Box
7,2231176.0,575481.0,3,3945.0,Priyanka_Chopra
8,2226602.0,117115.0,5,958.0,List_of_most-disliked_YouTube_videos
9,2050628.0,336621.0,5,4161.0,Freddie_Mercury


In [60]:
len(louvain_communities_for_NLP)

2729767

### 3. Analysis

#### 3.1 Pre-process article titles

In [61]:
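louvain_communities_for_NLP_proc = louvain_communities_for_NLP.copy()
# index=str converts the index labels to strings, so later cells address rows with .at[str(i), ...]
louvain_communities_for_NLP_proc.rename(index=str, columns={'title':'title_raw'}, inplace=True)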
louvain_communities_for_NLP_proc.head(5)

Unnamed: 0,external_search_traffic,link_in_traffic,louvain_community,search_in_traffic,title_raw
0,4576854.0,1108189.0,3,5630.0,George_H._W._Bush
1,3538068.0,639353.0,4,6451.0,Jason_Momoa
2,3475113.0,223635.0,9,23563.0,2.0_(film)
3,3251996.0,682992.0,4,10416.0,Bird_Box_(film)
4,3020671.0,31170.0,1,,Main_Page


In [63]:
# the importance of words will be weighted by search and link traffic
# (fill any missing totals before casting to int; casting first would fail on NaN)
louvain_communities_for_NLP_proc["weight"] = louvain_communities_for_NLP_proc[[
        "external_search_traffic", "link_in_traffic", "search_in_traffic"]].sum(axis=1).fillna(0).astype('int64')

louvain_communities_for_NLP_proc.drop(["external_search_traffic", "link_in_traffic", "search_in_traffic"], 
                                      axis = 1,
                                      inplace=True)
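
As a quick sanity check of how the combined weight is formed, here is a plain-Python sketch (no pandas) that sums the three traffic counts for one article and treats a missing value as zero, which is what `sum(axis=1)` does by default. The `total_weight` helper is hypothetical and only for illustration; the numbers are the `Main_Page` and `George_H._W._Bush` rows from the table above.

```python
import math

def total_weight(*traffic_counts):
    """Sum traffic counts, skipping missing (None or NaN) values."""
    total = 0
    for value in traffic_counts:
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue  # missing traffic contributes nothing
        total += value
    return int(total)

# Main_Page: external_search_traffic, link_in_traffic, search_in_traffic (missing)
main_page_weight = total_weight(3020671.0, 31170.0, float("nan"))
# George_H._W._Bush: all three traffic types present
bush_weight = total_weight(4576854.0, 1108189.0, 5630.0)

print(main_page_weight, bush_weight)
```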

In [64]:
louvain_communities_for_NLP_proc.head(5)

Unnamed: 0,louvain_community,title_raw,weight
0,3,George_H._W._Bush,5690673
1,4,Jason_Momoa,4183872
2,9,2.0_(film),3722311
3,4,Bird_Box_(film),3945404
4,1,Main_Page,3051841


In [66]:
# clean up the title
louvain_communities_for_NLP_proc["title"] = \
    louvain_communities_for_NLP_proc.title_raw.str.replace('_', ' ')
    
louvain_communities_for_NLP_proc.head(5)

Unnamed: 0,louvain_community,title_raw,weight,title
0,3,George_H._W._Bush,5690673,George H. W. Bush
1,4,Jason_Momoa,4183872,Jason Momoa
2,9,2.0_(film),3722311,2.0 (film)
3,4,Bird_Box_(film),3945404,Bird Box (film)
4,1,Main_Page,3051841,Main Page


In [68]:
louvain_communities_for_NLP_proc.describe()

Unnamed: 0,louvain_community,weight
count,2729767.0,2729767.0
mean,11.48,1731.93
std,64.77,13116.59
min,0.0,0.0
25%,3.0,66.0
50%,5.0,200.0
75%,10.0,735.0
max,1697.0,5690673.0


#### 3.2 Named entity recognition  

We can use the [spaCy](https://spacy.io/) NLP library to identify named entities in article titles.  
SpaCy's installation instructions can be found [here](https://spacy.io/usage/).

In [None]:
import spacy

In [29]:
# Editing the pipeline to cut down on processing time (dataset too large to use default pipeline)
nlp = spacy.load('en', disable=['parser', 'tagger'])

In [30]:
nlp.pipeline

[('ner', <spacy.pipeline.EntityRecognizer at 0x7ff34717b780>)]

In [None]:
print("Started running at", datetime.now(), "UTC")

start_time = timer()

nrows = len(louvain_communities_for_NLP_proc)
print("Total number of rows to process:", nrows, "\n")

for i in range(nrows):
    txt = louvain_communities_for_NLP_proc.iloc[i].title

    doc = nlp(txt)

    ents_arr = []
    for ent in doc.ents:
        ents_arr.append(ent.label_)
    
    louvain_communities_for_NLP_proc.at[str(i), "named_entities"] = " ".join(ents_arr)
    
    if (i % 100000 == 0):
        print("Rows processed:", round(i * 100/len(louvain_communities_for_NLP_proc), 4), "%,", "count = ", i )
        print("Elapsed time:", round((timer() - start_time)/60, 4), "min\n")
        print("Last row's processing vars:")
        print("i=", i, "txt=", txt, "ents_arr=", ents_arr, "\n",
              "updated data row:\n", louvain_communities_for_NLP_proc.iloc[i], "\n")
    

cu.printRunTime(start_time)

Started running at 2019-03-01 19:57:26.153032 UTC
Total number of rows to process: 2729767 

Rows processed: 0.0 %, count =  0
Elapsed time: 0.0001 min

Last row's processing vars:
i= 0 txt= George H. W. Bush ents_arr= ['PERSON'] 
 updated data row:
 louvain_community                    3
title_raw            George_H._W._Bush
weight                         5690673
title                George H. W. Bush
named_entities                  PERSON
Name: 0, dtype: object 

Rows processed: 3.6633 %, count =  100000
Elapsed time: 6.8751 min

Last row's processing vars:
i= 100000 txt= List of cities and boroughs in Pennsylvania by population ents_arr= ['GPE'] 
 updated data row:
 louvain_community                                                           10
title_raw            List_of_cities_and_boroughs_in_Pennsylvania_by_population
weight                                                                    5628
title                List of cities and boroughs in Pennsylvania by population
named

The code above took a few hours to run and finished successfully. The browser tunnel was interrupted at some point, so the printed output above is incomplete, but the dataset has been fully populated with `named_entities`.

Double-checking that all of the data got NER-processed:

In [86]:
nrows

2729767

In [85]:
i

2729766

In [87]:
txt

'Christgau (disambiguation)'

In [93]:
louvain_communities_for_NLP_proc.iloc[i]

louvain_community                             5
title_raw            Christgau_(disambiguation)
weight                                       18
title                Christgau (disambiguation)
named_entities                              GPE
Name: 2729766, dtype: object

In [89]:
louvain_communities_for_NLP_proc.head(20)

Unnamed: 0,louvain_community,title_raw,weight,title,named_entities
0,3,George_H._W._Bush,5690673,George H. W. Bush,PERSON
1,4,Jason_Momoa,4183872,Jason Momoa,PERSON
2,9,2.0_(film),3722311,2.0 (film),CARDINAL
3,4,Bird_Box_(film),3945404,Bird Box (film),PERSON
4,1,Main_Page,3051841,Main Page,PERSON
5,4,Aquaman_(film),3077395,Aquaman (film),
6,4,Bird_Box,2529969,Bird Box,PERSON
7,3,Priyanka_Chopra,2810602,Priyanka Chopra,PERSON
8,5,List_of_most-disliked_YouTube_videos,2344675,List of most-disliked YouTube videos,ORG
9,5,Freddie_Mercury,2391410,Freddie Mercury,ORG


In [90]:
louvain_communities_for_NLP_proc.tail(20)

Unnamed: 0,louvain_community,title_raw,weight,title,named_entities
2729747,16,Mantle_of_Luís_I,15,Mantle of Luís I,
2729748,13,List_of_Zero:_Black_Blood_episodes,14,List of Zero: Black Blood episodes,CARDINAL EVENT
2729749,22,Moskvitch_404_Sport,15,Moskvitch 404 Sport,CARDINAL
2729750,3,Caridina_loehae,11,Caridina loehae,GPE
2729751,3,Tatu_Miettunen,11,Tatu Miettunen,
2729752,2,HIST2H3C,10,HIST2H3C,
2729753,6,"Muk,_Iran",17,"Muk, Iran",GPE GPE
2729754,10,"Wila,_Missouri",67,"Wila, Missouri",PERSON GPE
2729755,3,Isaac_Rochussen,11,Isaac Rochussen,PERSON
2729756,7,Raffaello_Bertieri,26,Raffaello Bertieri,PERSON


The named entity assignment is not perfect. For example, both the "Main Page" and "Bird Box" articles were labeled "PERSON". In many cases, though, it did quite well.  

##### Named entity type descriptions  
SpaCy's named entity type descriptions can be found [here](https://spacy.io/api/annotation#named-entities).  

The "Main Page" article seems to be an important node in the network, so let's fix its named_entities value.

In [95]:
louvain_communities_for_NLP_proc.iloc[4]

louvain_community            1
title_raw            Main_Page
weight                 3051841
title                Main Page
named_entities          PERSON
Name: 4, dtype: object

In [103]:
louvain_communities_for_NLP_proc.at[str(4), "named_entities"] = ''

In [104]:
louvain_communities_for_NLP_proc.iloc[4]

louvain_community            1
title_raw            Main_Page
weight                 3051841
title                Main Page
named_entities                
Name: 4, dtype: object

In [106]:
louvain_communities_for_NLP_proc.head(10)

Unnamed: 0,louvain_community,title_raw,weight,title,named_entities
0,3,George_H._W._Bush,5690673,George H. W. Bush,PERSON
1,4,Jason_Momoa,4183872,Jason Momoa,PERSON
2,9,2.0_(film),3722311,2.0 (film),CARDINAL
3,4,Bird_Box_(film),3945404,Bird Box (film),PERSON
4,1,Main_Page,3051841,Main Page,
5,4,Aquaman_(film),3077395,Aquaman (film),
6,4,Bird_Box,2529969,Bird Box,PERSON
7,3,Priyanka_Chopra,2810602,Priyanka Chopra,PERSON
8,5,List_of_most-disliked_YouTube_videos,2344675,List of most-disliked YouTube videos,ORG
9,5,Freddie_Mercury,2391410,Freddie Mercury,ORG


In [109]:
# pickle the output
myoutfile = "pickles/en_1218_louvain_communities_for_NLP_proc.pkl"
with open(myoutfile, 'wb') as picklefile:
    pickle.dump(louvain_communities_for_NLP_proc, picklefile)

print("Pickle created: " + myoutfile)

Pickle created: pickles/en_1218_louvain_communities_for_NLP_proc.pkl


In [37]:
# unpickle
with open("pickles/en_1218_louvain_communities_for_NLP_proc.pkl", 'rb') as picklefile: 
    louvain_communities_for_NLP_proc = pickle.load(picklefile)

louvain_communities_for_NLP_proc.head(20)

Unnamed: 0,louvain_community,title_raw,weight,title,named_entities
0,3,George_H._W._Bush,5690673,George H. W. Bush,PERSON
1,4,Jason_Momoa,4183872,Jason Momoa,PERSON
2,9,2.0_(film),3722311,2.0 (film),CARDINAL
3,4,Bird_Box_(film),3945404,Bird Box (film),PERSON
4,1,Main_Page,3051841,Main Page,
5,4,Aquaman_(film),3077395,Aquaman (film),
6,4,Bird_Box,2529969,Bird Box,PERSON
7,3,Priyanka_Chopra,2810602,Priyanka Chopra,PERSON
8,5,List_of_most-disliked_YouTube_videos,2344675,List of most-disliked YouTube videos,ORG
9,5,Freddie_Mercury,2391410,Freddie Mercury,ORG


In [30]:
louvain_communities_for_NLP_proc.named_entities.value_counts()

PERSON                                830645
                                      814109
ORG                                   390291
GPE                                   170318
DATE                                   46268
NORP                                   38910
GPE GPE                                38162
CARDINAL                               29662
ORG GPE                                28779
PERSON GPE                             21082
LOC                                    16358
DATE ORG                               15102
FAC                                    14702
PERSON PERSON                          14446
DATE EVENT                             14300
PERSON DATE                            14223
PERSON ORG                             11964
ORG ORG                                11261
EVENT                                  10618
DATE GPE                               10552
ORG DATE                                9295
WORK_OF_ART                             8606
ORG PERSON
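
Note that `value_counts()` counts whole label sequences, so `GPE GPE` and `GPE` show up as separate rows. To count individual entity types instead, the space-joined strings can be split first. A minimal sketch using `collections.Counter` on a hand-copied sample of `named_entities` values from the head and tail outputs above (not the full column):

```python
from collections import Counter

# sample of named_entities values copied from the head/tail outputs above
named_entities_sample = [
    "PERSON", "PERSON", "CARDINAL", "PERSON", "", "",
    "PERSON", "PERSON", "ORG", "ORG", "GPE GPE", "PERSON GPE",
]

# counts of whole label sequences (what value_counts() reports)
sequence_counts = Counter(named_entities_sample)

# counts of individual labels; "".split() is empty, so unlabeled titles add nothing
label_counts = Counter(
    label for value in named_entities_sample for label in value.split()
)

print(sequence_counts.most_common(3))
print(label_counts.most_common(3))
```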

In [36]:
louvain_communities_for_NLP_proc[louvain_communities_for_NLP_proc.louvain_community == 14][:20]

Unnamed: 0,louvain_community,title_raw,weight,title,named_entities
2175,14,List_of_NHL_statistical_leaders,102869,List of NHL statistical leaders,ORG
3019,14,2019_World_Junior_Ice_Hockey_Championships,117350,2019 World Junior Ice Hockey Championships,DATE ORG
3411,14,Wayne_Gretzky,97377,Wayne Gretzky,PERSON
3480,14,24Hours,74156,24Hours,
4259,14,IIHF_World_U20_Championship,78838,IIHF World U20 Championship,ORG
4548,14,List_of_Stanley_Cup_champions,73879,List of Stanley Cup champions,EVENT
5399,14,Alexander_Ovechkin,69756,Alexander Ovechkin,PERSON
5408,14,2018_World_Junior_Ice_Hockey_Championships,76122,2018 World Junior Ice Hockey Championships,DATE EVENT
5525,14,Sidney_Crosby,62898,Sidney Crosby,PERSON
5554,14,Spengler_Cup,56905,Spengler Cup,EVENT


#### 3.3 Lemmatizing and stemming

In [51]:
stemmer = SnowballStemmer("english")

In [39]:
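# create the lemmatizer once instead of once per word
lemmatizer = WordNetLemmatizer()

def parse_title(title):
    """Tokenize a title, drop stopwords, then lemmatize (as verbs) and stem each word."""
    # simple_preprocess lowercases the text and keeps alphabetic tokens only
    words = gensim.utils.simple_preprocess(title)
    stopwords = gensim.parsing.preprocessing.STOPWORDS
    
    parsed = []
    for word in words:
        if word not in stopwords:
            stemmed = stemmer.stem(lemmatizer.lemmatize(word, pos='v'))
            parsed.append(stemmed)
            
    return parsed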
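
To see why a title like `George H. W. Bush` loses its initials and `2.0 (film)` keeps only `film`, it helps to look at the tokenizing step on its own. The sketch below is a rough standard-library approximation of what `gensim.utils.simple_preprocess` does with its default settings (lowercase, keep alphabetic tokens of 2 to 15 characters). The `approx_tokenize` name and the regex are assumptions for illustration, not gensim's code, and stopword removal, lemmatizing, and stemming are left out.

```python
import re

# runs of word characters that are not digits (approximation of gensim's tokenizer)
TOKEN_PATTERN = re.compile(r"[^\W\d]+")

def approx_tokenize(title, min_len=2, max_len=15):
    """Lowercase a title and keep alphabetic tokens within the length bounds."""
    tokens = TOKEN_PATTERN.findall(title.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len and not t.startswith("_")]

for title in ["George H. W. Bush", "2.0 (film)", "List of most-disliked YouTube videos"]:
    print(title, "->", approx_tokenize(title))
```

Single-letter initials and numbers are dropped at this stage, before stopwords are even considered, which is why numeric titles contribute little to the community descriptors.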

In [40]:
print("Started running at", datetime.now(), "UTC")

start_time = timer()

nrows = len(louvain_communities_for_NLP_proc)
print("Num of rows:", nrows)

for i in range(nrows):
    txt = louvain_communities_for_NLP_proc.iloc[i].title
    parsed = parse_title(txt)
    
    louvain_communities_for_NLP_proc.at[str(i), "title_parsed"] = " ".join(parsed)
    
cu.printRunTime(start_time)

louvain_communities_for_NLP_proc.head(20)

Started running at 2019-03-03 03:58:15.950026 UTC
Num of rows: 2729767


Runtime: 12.51 min



Unnamed: 0,louvain_community,title_raw,weight,title,named_entities,title_parsed
0,3,George_H._W._Bush,5690673,George H. W. Bush,PERSON,georg bush
1,4,Jason_Momoa,4183872,Jason Momoa,PERSON,jason momoa
2,9,2.0_(film),3722311,2.0 (film),CARDINAL,film
3,4,Bird_Box_(film),3945404,Bird Box (film),PERSON,bird box film
4,1,Main_Page,3051841,Main Page,,main page
5,4,Aquaman_(film),3077395,Aquaman (film),,aquaman film
6,4,Bird_Box,2529969,Bird Box,PERSON,bird box
7,3,Priyanka_Chopra,2810602,Priyanka Chopra,PERSON,priyanka chopra
8,5,List_of_most-disliked_YouTube_videos,2344675,List of most-disliked YouTube videos,ORG,list dislik youtub video
9,5,Freddie_Mercury,2391410,Freddie Mercury,ORG,freddi mercuri


In [41]:
# pickle the output
myoutfile = "pickles/en_1218_louvain_communities_for_NLP_proc_2.pkl"
with open(myoutfile, 'wb') as picklefile:
    pickle.dump(louvain_communities_for_NLP_proc, picklefile)

print("Pickle created: " + myoutfile)

Pickle created: pickles/en_1218_louvain_communities_for_NLP_proc_2.pkl


In [9]:
# unpickle
with open("pickles/en_1218_louvain_communities_for_NLP_proc_2.pkl", 'rb') as picklefile: 
    louvain_communities_for_NLP_proc = pickle.load(picklefile)

louvain_communities_for_NLP_proc.head(20)

Unnamed: 0,louvain_community,title_raw,weight,title,named_entities,title_parsed
0,3,George_H._W._Bush,5690673,George H. W. Bush,PERSON,georg bush
1,4,Jason_Momoa,4183872,Jason Momoa,PERSON,jason momoa
2,9,2.0_(film),3722311,2.0 (film),CARDINAL,film
3,4,Bird_Box_(film),3945404,Bird Box (film),PERSON,bird box film
4,1,Main_Page,3051841,Main Page,,main page
5,4,Aquaman_(film),3077395,Aquaman (film),,aquaman film
6,4,Bird_Box,2529969,Bird Box,PERSON,bird box
7,3,Priyanka_Chopra,2810602,Priyanka Chopra,PERSON,priyanka chopra
8,5,List_of_most-disliked_YouTube_videos,2344675,List of most-disliked YouTube videos,ORG,list dislik youtub video
9,5,Freddie_Mercury,2391410,Freddie Mercury,ORG,freddi mercuri


#### 3.4 Scaled traffic weights

In the current dataset, each article shows up once. If we treat each article as a document for bag-of-words parsing, the result will reflect the words in the set of unique article titles, but it will not capture the differences in traffic between articles.  
We identified our article communities using community detection with edges weighted by traffic, so the amount of traffic an article receives is an important factor in how the communities are defined. To capture these differences, we'll weight each title by its traffic. Ideally, we would repeat each article title as many times as it was viewed and then proceed to bag-of-words parsing. But the dataset is too large for that, and for the purposes of this project we only need the relative order of the word frequencies. So instead, we take the natural log of the traffic weights, which gives much smaller and more gradual scaled weights that are feasible to use here.

In [None]:
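# note: describe() above shows a minimum weight of 0, and np.log(0) evaluates to -inf;
# the bag-of-words step in section 3.5 uses max(..., 1) so every title is counted at least once
louvain_communities_for_NLP_proc["scaled_weight"] = np.log(louvain_communities_for_NLP_proc.weight)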
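
To make the effect of the log scaling concrete, the sketch below converts a raw traffic weight into the number of times a title would be repeated in the bag-of-words step. It uses plain Python rather than pandas, `repeat_count` is a hypothetical helper, and the explicit zero check is an assumption added here because `math.log(0)` raises an error (NumPy's `np.log(0)` returns `-inf` instead).

```python
import math

def repeat_count(weight):
    """Number of copies of a title: rounded natural log of the weight, at least 1."""
    if weight <= 0:
        return 1  # log is undefined at 0; fall back to a single copy
    return max(round(math.log(weight)), 1)

# weights taken from rows shown in this notebook, plus the minimum weight of 0
for weight in (5690673, 5628, 18, 0):
    print(weight, "->", repeat_count(weight))
```

With this scaling, the most visited article is repeated 16 times and an article with 18 visits is repeated 3 times, so heavily visited titles still count for more, but by a factor of about 5 rather than several hundred thousand.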

In [11]:
louvain_communities_for_NLP_proc.head()

Unnamed: 0,louvain_community,title_raw,weight,title,named_entities,title_parsed,scaled_weight
0,3,George_H._W._Bush,5690673,George H. W. Bush,PERSON,georg bush,15.55
1,4,Jason_Momoa,4183872,Jason Momoa,PERSON,jason momoa,15.25
2,9,2.0_(film),3722311,2.0 (film),CARDINAL,film,15.13
3,4,Bird_Box_(film),3945404,Bird Box (film),PERSON,bird box film,15.19
4,1,Main_Page,3051841,Main Page,,main page,14.93


#### 3.5 Bags of words

To get a sense of what each Wikipedia article community is about, we can make simple bags of words and named entities for each community (weighted by traffic), and then take the top N most frequent words and named entities as a descriptor for each community.

Make community descriptors for all communities:

In [14]:
louvain_community_ids = louvain_communities_for_NLP_proc.louvain_community.unique().tolist()
louvain_community_ids[:5]

[3, 4, 9, 1, 5]

In [15]:
len(louvain_community_ids)

1698

In [40]:
print("Started running at", datetime.now(), "UTC")
start_time = timer()

community_topics_dict = {}
count = 0

for cid in louvain_community_ids:
    # get a community
    comm = louvain_communities_for_NLP_proc[louvain_communities_for_NLP_proc.louvain_community == cid]
    
    # put together title and NER doc lists per community
    nrows = len(comm)
    title_docs = []
    ner_docs = []

    for i in range(nrows):
        title_parsed_arr = comm.iloc[i].title_parsed.split()
        named_entities_arr = comm.iloc[i].named_entities.split()
        weight = max(round(comm.iloc[i].scaled_weight).astype("int64"), 1)

        for j in range(weight):
            title_docs.append(title_parsed_arr)
            ner_docs.append(named_entities_arr)
    
    # Make a gensim dictionary from title docs
    titles_dictionary = gensim.corpora.Dictionary(title_docs)
    # Get the 20 words with the highest weighted document frequency (dfs) in the community's article titles
    topic_word_ids = sorted(titles_dictionary.dfs, key=titles_dictionary.dfs.__getitem__, reverse=True)[:20]

    # store the top 20 title words and their weighted frequencies 
    topic_words_arr = []
    topic_words_weights = []
    for wid in topic_word_ids:
        topic_words_arr.append(titles_dictionary[wid])
        topic_words_weights.append(titles_dictionary.dfs[wid])
    
    # Make a gensim dictionary from NER docs
    ner_dictionary = gensim.corpora.Dictionary(ner_docs)
    # Get the 20 named entity types with the highest weighted document frequency (dfs) in the community
    topic_ner_ids = sorted(ner_dictionary.dfs, key=ner_dictionary.dfs.__getitem__, reverse=True)[:20]

    # store the top 20 title named entities and their weighted frequencies 
    topic_ner_arr = []
    topic_ner_weights = []
    for nid in topic_ner_ids:
        topic_ner_arr.append(ner_dictionary[nid])
        topic_ner_weights.append(ner_dictionary.dfs[nid])
    
    # store the community topic arrays in a dict
    community_topics_dict[cid] = {}
    community_topics_dict[cid]["topic_words"] = topic_words_arr
    community_topics_dict[cid]["topic_words_weights"] = topic_words_weights
    community_topics_dict[cid]["topic_ner"] = topic_ner_arr
    community_topics_dict[cid]["topic_ner_weights"] = topic_ner_weights
    
    
    if (count % 100 == 0):
        print("Communities processed:", round(count * 100/len(louvain_community_ids), 4), "%,", "count = ", count )
        print("Elapsed time:", round((timer() - start_time)/60, 4), "min\n")
        print("Last community's processing vars:")
        print("cid=", cid, "\ncommunity_topics_dict entry:\n", community_topics_dict[cid])
    count += 1
    
cu.printRunTime(start_time)

Started running at 2019-03-03 23:47:18.127757 UTC
Communities processed: 0.0 %, count =  0
Elapsed time: 4.5866 min

Last community's processing vars:
cid= 3 
community_topics_dict entry:
 {'topic_words': ['list', 'unit', 'state', 'footbal', 'elect', 'nation', 'th', 'constitu', 'parti', 'battl', 'john', 'film', 'st', 'district', 'war', 'cup', 'disambigu', 'india', 'leagu', 'world'], 'topic_words_weights': [86275, 62493, 55433, 49587, 41775, 39424, 30185, 29297, 26282, 23894, 23292, 22708, 22624, 20653, 19766, 17757, 16889, 16388, 15638, 15583], 'topic_ner': ['PERSON', 'ORG', 'GPE', 'DATE', 'NORP', 'CARDINAL', 'EVENT', 'ORDINAL', 'LOC', 'FAC', 'PRODUCT', 'LAW', 'WORK_OF_ART', 'LANGUAGE', 'QUANTITY', 'MONEY', 'TIME', 'PERCENT'], 'topic_ner_weights': [1088529, 559783, 426692, 171238, 112900, 43759, 37994, 30215, 28239, 14446, 5942, 3857, 3657, 2923, 1980, 811, 496, 112]}
Communities processed: 5.8893 %, count =  100
Elapsed time: 24.535 min

Last community's processing vars:
cid= 400 
com

Runtime: 24.78 min
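
Because `Dictionary.dfs` counts the number of documents that contain a token, and each title is added to the document list once per unit of its rounded scaled weight, the value for a word works out to the sum of the repeat counts of the titles that contain it. The sketch below computes that sum directly with `collections.Counter`, without building the repeated documents. It is an illustrative alternative under that assumption, not the code used to produce the results in this notebook; `top_words` and the sample rows are hypothetical.

```python
from collections import Counter

def top_words(rows, n=20):
    """rows: iterable of (title_parsed, repeat_count) pairs for one community."""
    counts = Counter()
    for title_parsed, repeat in rows:
        # set(): a word repeated inside one title still counts once per copy
        for word in set(title_parsed.split()):
            counts[word] += repeat
    return counts.most_common(n)

# hypothetical mini-community built from parsed titles shown in the tables above
sample_rows = [
    ("bird box film", 15),
    ("aquaman film", 15),
    ("bird box", 15),
    ("jason momoa", 15),
]
print(top_words(sample_rows, n=3))
```

Accumulating weights this way avoids holding many copies of each title in memory, which may matter for the largest communities.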



In [41]:
community_topics_dict[22]

{'topic_ner': ['ORG',
  'PERSON',
  'DATE',
  'GPE',
  'CARDINAL',
  'EVENT',
  'PRODUCT',
  'NORP',
  'TIME',
  'ORDINAL',
  'FAC',
  'LOC',
  'LAW',
  'WORK_OF_ART',
  'QUANTITY',
  'LANGUAGE',
  'PERCENT',
  'MONEY'],
 'topic_ner_weights': [65251,
  64819,
  23494,
  14602,
  11542,
  11079,
  5620,
  5000,
  1148,
  953,
  663,
  585,
  293,
  229,
  218,
  55,
  28,
  4],
 'topic_words': ['grand',
  'prix',
  'championship',
  'seri',
  'engin',
  'race',
  'car',
  'formula',
  'list',
  'ford',
  'honda',
  'motorcycl',
  'motor',
  'world',
  'automobil',
  'toyota',
  'driver',
  'merced',
  'bmw',
  'benz'],
 'topic_words_weights': [10329,
  9949,
  6733,
  6195,
  6132,
  6073,
  5593,
  4185,
  4161,
  4154,
  3680,
  3466,
  3441,
  3335,
  2688,
  2684,
  2427,
  2388,
  2303,
  2291]}

In [42]:
# pickle the output
myoutfile = "pickles/en_1218_louvain_community_topics_dict.pkl"
with open(myoutfile, 'wb') as picklefile:
    pickle.dump(community_topics_dict, picklefile)

print("Pickle created: " + myoutfile)

Pickle created: pickles/en_1218_louvain_community_topics_dict.pkl


In [43]:
# unpickle
with open("pickles/en_1218_louvain_community_topics_dict.pkl", 'rb') as picklefile: 
    community_topics_dict = pickle.load(picklefile)

community_topics_dict[22]

{'topic_ner': ['ORG',
  'PERSON',
  'DATE',
  'GPE',
  'CARDINAL',
  'EVENT',
  'PRODUCT',
  'NORP',
  'TIME',
  'ORDINAL',
  'FAC',
  'LOC',
  'LAW',
  'WORK_OF_ART',
  'QUANTITY',
  'LANGUAGE',
  'PERCENT',
  'MONEY'],
 'topic_ner_weights': [65251,
  64819,
  23494,
  14602,
  11542,
  11079,
  5620,
  5000,
  1148,
  953,
  663,
  585,
  293,
  229,
  218,
  55,
  28,
  4],
 'topic_words': ['grand',
  'prix',
  'championship',
  'seri',
  'engin',
  'race',
  'car',
  'formula',
  'list',
  'ford',
  'honda',
  'motorcycl',
  'motor',
  'world',
  'automobil',
  'toyota',
  'driver',
  'merced',
  'bmw',
  'benz'],
 'topic_words_weights': [10329,
  9949,
  6733,
  6195,
  6132,
  6073,
  5593,
  4185,
  4161,
  4154,
  3680,
  3466,
  3441,
  3335,
  2688,
  2684,
  2427,
  2388,
  2303,
  2291]}

##### What are the topics of the top 20 communities?

In [220]:
# unpickle
with open("pickles/en_1218_louvain_communities.pkl", 'rb') as picklefile: 
    louvain_communities = pickle.load(picklefile)

louvain_communities.head(20)

Unnamed: 0,articles_count,external_search_traffic,link_edges_count,link_traffic,louvain_community,search_edges_count,search_traffic,total_visits,avg_external_search_traffic,avg_link_traffic_per_edge,avg_visits_per_article,link_network_density,link_network_density_delta
0,507642,535227964,3385462,329625267,3,102124,4188154,1124655652,1054.34,97.36,2215.45,0.0,5.67
1,302120,288453340,2037310,140416238,7,41383,1257742,592415339,954.76,68.92,1960.86,0.0,5.74
2,267541,626988532,2540296,461475849,4,98857,3938984,1350876078,2343.52,181.66,5049.23,0.0,8.5
3,233396,150495954,1209989,84128861,10,25894,932688,322074785,644.81,69.53,1379.95,0.0,4.18
4,223864,219874886,1327054,151652674,5,32332,930331,468611692,982.18,114.28,2093.29,0.0,4.93
5,218001,279571288,1593048,95420595,1,39273,1056534,1095599823,1282.43,59.9,5025.66,0.0,6.31
6,209065,259036837,1333634,96546076,2,45927,1568306,503087845,1239.03,72.39,2406.37,0.0,5.38
7,110397,117206803,700021,56716133,11,23456,972333,220630202,1061.68,81.02,1998.52,0.0,5.34
8,81807,72472648,420037,28116882,6,10319,320477,139408430,885.9,66.94,1704.11,0.0,4.13
9,73172,37184350,542438,36767267,8,10775,286442,100143853,508.18,67.78,1368.61,0.0,6.41


In [221]:
top_20_louvain_communities_ids = louvain_communities[:20].louvain_community
top_20_louvain_communities_ids

0      3
1      7
2      4
3     10
4      5
5      1
6      2
7     11
8      6
9      8
10    13
11    17
12    12
13    21
14     9
15    22
16     0
17    15
18    19
19    14
Name: louvain_community, dtype: int64

In [222]:
# print out the topic descriptors
for cid in top_20_louvain_communities_ids:
    print("\nCommunity id =", cid)
    print("Community topic words:\n", " ".join(community_topics_dict[cid]["topic_words"]))
    print("Community topic NER:\n", " ".join(community_topics_dict[cid]["topic_ner"]))


Community id = 3
Community topic words:
 list unit state footbal elect nation th constitu parti battl john film st district war cup disambigu india leagu world
Community topic NER:
 PERSON ORG GPE DATE NORP CARDINAL EVENT ORDINAL LOC FAC PRODUCT LAW WORK_OF_ART LANGUAGE QUANTITY MONEY TIME PERCENT

Community id = 7
Community topic words:
 list disambigu church al john novel saint languag histori st film battl ii cathol roman book william art la christian
Community topic NER:
 PERSON ORG GPE NORP DATE CARDINAL LOC WORK_OF_ART FAC EVENT ORDINAL LANGUAGE PRODUCT LAW TIME MONEY QUANTITY PERCENT

Community id = 4
Community topic words:
 film list seri tv actor comic season award disambigu episod school john charact novel man star love song album david
Community topic NER:
 PERSON ORG DATE GPE CARDINAL NORP WORK_OF_ART EVENT LOC ORDINAL FAC PRODUCT TIME LAW LANGUAGE QUANTITY MONEY PERCENT

Community id = 10
Community topic words:
 state list new counti school unit california york station pa

##### Grouping communities by common terms  

We can search the community descriptor words for specific terms to find groups of communities on related topics.  
For example, we can find all communities about books, music, or the United States.

In [216]:
# choose some words to search for (uncomment one line)
# words = "novel book read"
# words = "album discography music song"
words = "united states"

# process the selected words
preprocessed = gensim.utils.simple_preprocess(words)
    
lemmatized = []
for w in preprocessed:
    stem = stemmer.stem(WordNetLemmatizer().lemmatize(w, pos='v'))
    lemmatized.append(stem)
    
print(lemmatized)

['unit', 'state']


In [218]:
# find communities that have the selected words

for cid in louvain_community_ids:
    #if (lemmatized in community_topics_dict[cid]["topic_words"]):
    #if not set(lemmatized).isdisjoint(set(community_topics_dict[cid]["topic_words"])):
    if set(lemmatized).issubset(set(community_topics_dict[cid]["topic_words"])):
        print("\nCommunity id =", cid)
        print("Community topic words:\n", " ".join(community_topics_dict[cid]["topic_words"]))
        print("Community topic NER:\n", " ".join(community_topics_dict[cid]["topic_ner"]))


Community id = 3
Community topic words:
 list unit state footbal elect nation th constitu parti battl john film st district war cup disambigu india leagu world
Community topic NER:
 PERSON ORG GPE DATE NORP CARDINAL EVENT ORDINAL LOC FAC PRODUCT LAW WORK_OF_ART LANGUAGE QUANTITY MONEY TIME PERCENT

Community id = 6
Community topic words:
 univers school list colleg bank institut compani state intern educ group law technolog scienc unit nation john mall new busi
Community topic NER:
 ORG PERSON GPE NORP DATE CARDINAL FAC LOC ORDINAL EVENT WORK_OF_ART LANGUAGE PRODUCT LAW TIME MONEY QUANTITY PERCENT

Community id = 10
Community topic words:
 state list new counti school unit california york station park district elect john citi texa high nation hous north river
Community topic NER:
 PERSON GPE ORG DATE LOC FAC CARDINAL NORP ORDINAL EVENT WORK_OF_ART PRODUCT LANGUAGE LAW TIME MONEY QUANTITY PERCENT

Community id = 2
Community topic words:
 list disambigu syndrom nation park diseas food b
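
The commented-out `isdisjoint` line and the active `issubset` line in the search cell correspond to two different matching rules: match a community if it shares any search term, or only if it contains all of them. A small sketch of both, using abridged topic words for communities 3, 6, 22 and 4 from the outputs in this notebook (the `find_communities` helper and the trimmed word lists are for illustration only):

```python
def find_communities(topic_words_by_id, terms, require_all=True):
    """Return ids of communities whose topic words match the stemmed search terms."""
    terms = set(terms)
    matches = []
    for cid, topic_words in topic_words_by_id.items():
        topic_set = set(topic_words)
        if require_all:
            hit = terms.issubset(topic_set)        # every term must be present
        else:
            hit = not terms.isdisjoint(topic_set)  # at least one term present
        if hit:
            matches.append(cid)
    return matches

# abridged topic words copied from the outputs in this notebook
topic_words_by_id = {
    3: ["list", "unit", "state", "footbal", "elect", "nation"],
    6: ["univers", "school", "list", "colleg", "state", "unit"],
    22: ["grand", "prix", "championship", "seri", "engin", "race", "world"],
    4: ["film", "list", "seri", "tv", "actor", "comic"],
}

print(find_communities(topic_words_by_id, ["unit", "state"]))
print(find_communities(topic_words_by_id, ["seri", "state"], require_all=False))
```

Requiring all terms suits multi-word concepts such as "united states", while the any-term rule is the more natural choice for a list of loosely related words such as "album discography music song".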

### 4. Summary  

In this notebook, we've extracted named entities from article titles in order to generalize articles about specific people, places, etc. We then created bags of words for each community, and extracted the top words and named entities as descriptors for each community. These community descriptor words can also be used to group communities based on common terms.

### 5. Next steps

Now that we have defined Wikipedia article communities, their stats, and a way to describe their topics, we can build a D3 visualization to summarize and illustrate these findings.  

But since we're looking at data that represents browsing behavior on Wikipedia, it would also be interesting to see what the more obscure browsing on Wikipedia looked like in December 2018. The [next notebook](English_Wikipedia_deepWiki.ipynb) in this project explores that.