# Interactive WordCloud

By:
 - Aditya Agrawal (adityaagrawa@cs.umass.edu) 
 - Sanjay Reddy S (ssatti@umass.edu)

#### Works on Python 3

    - Please ensure 'nysk.xml' is in the same directory.

This file presents the code and the plot for visualizing the top 'n' most frequently occurring words in our text (Source: http://archive.ics.uci.edu/ml/datasets/NYSK#), where the parameter 'n' can be selected interactively, limited to a maximum of 500 for demonstration purposes. The size and dimensions can be scaled to suit our main dataset from the VAST Challenge. This visualization can be used to find the most talked-about or most frequent terms in the articles and emails datasets from VAST Challenge 2014.

### Importing Libraries & Housekeeping Jobs

In [1]:
%load_ext autoreload
%autoreload 2

#Importing Libraries
import xml.etree.ElementTree as ET
import pandas as pd
from wordcloud import WordCloud
import pickle

#NLTK
import nltk
from nltk import FreqDist
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

#Bokeh
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource
from bokeh.layouts import column, row, widgetbox
from bokeh.models import HoverTool, Plot
from bokeh.palettes import Spectral6
from bokeh.io import output_notebook, show
from bokeh.models.glyphs import ImageURL
from bokeh.models.widgets import Slider
from bokeh.application import Application
from bokeh.application.handlers import FunctionHandler
output_notebook()

## Handling XML Data as a Pandas Frame

Our data is stored in an XML file, so we pre-process it into a flat table for exploration and further processing.

In [2]:
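with open('nysk.xml') as xml_file:
    xml_data = xml_file.read()

def xml2df(xml_data):
    """Flatten each child of the XML root into one row of a DataFrame."""
    root = ET.XML(xml_data)  # parse the XML string into an element tree
    all_records = []
    for child in root:
        # One record per document: map each sub-element tag to its text
        record = {}
        for subchild in child:
            record[subchild.tag] = subchild.text
        # Append once per document, after all of its fields are collected
        all_records.append(record)
    return pd.DataFrame(all_records)

data = xml2df(xml_data).drop_duplicates()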
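
To illustrate the flattening step on something small, the sketch below applies the same tag-to-text mapping to a made-up two-document XML string using only the standard library. The sample XML, its tag names (`document`, `title`, `text`), and the helper name `xml_to_records` are assumptions for illustration and are not taken from `nysk.xml`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real nysk.xml schema may differ.
SAMPLE_XML = """<documents>
  <document><title>First</title><text>Hello world</text></document>
  <document><title>Second</title><text>Another article</text></document>
</documents>"""

def xml_to_records(xml_string):
    """Return one dict per child of the root, mapping sub-element tags to text."""
    root = ET.fromstring(xml_string)
    records = []
    for child in root:
        record = {subchild.tag: subchild.text for subchild in child}
        records.append(record)  # one record per document
    return records

records = xml_to_records(SAMPLE_XML)
for record in records:
    print(record)
```

Passing `records` to `pd.DataFrame` would then give one row per document and one column per tag.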

### BOW and Word Frequency Exploration
 - Word Cloud: for visualizing the top 'n' words
 - Histogram: for looking at the top 'n' words (present as a separate file)

Here we explore the 'Bag of Words' (BOW) representation of our dataset and use it to create an interactive word cloud and an interactive histogram.

In [3]:
all_news = ' '.join(data.text.values)
all_news_lower = all_news.lower()

tokenizer = RegexpTokenizer(r'\w+')

news_frequency_distribution = FreqDist(tokenizer.tokenize(all_news_lower))
news_frequency_distribution.most_common(10)

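# Requires the NLTK stopword corpus; run nltk.download('stopwords') once if it is missing
stopWords = set(stopwords.words('english'))

# Removing stopwords (punctuation was already dropped by the \w+ tokenizer)
for stopword in stopWords:
    if stopword in news_frequency_distribution:
        del news_frequency_distribution[stopword]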
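
As a minimal, dependency-free sketch of the same bag-of-words steps, the example below uses `re.findall` and `collections.Counter` as stand-ins for `RegexpTokenizer` and `FreqDist`. The sample sentence, the tiny `SAMPLE_STOPWORDS` set, and the helper name `bag_of_words` are made up for illustration; the notebook itself uses NLTK's English stopword list.

```python
import re
from collections import Counter

# Hypothetical inputs for illustration only.
SAMPLE_TEXT = "The plane left Paris. The plane was late, and the crew was tired."
SAMPLE_STOPWORDS = {"the", "was", "and"}

def bag_of_words(text, stopwords):
    """Lowercase, tokenize on word characters, count, then drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    for word in stopwords:
        counts.pop(word, None)  # no error if the stopword never occurred
    return counts

sample_counts = bag_of_words(SAMPLE_TEXT, SAMPLE_STOPWORDS)
print(sample_counts.most_common(2))
```

Counting first and deleting stopwords afterwards mirrors the order used in the cell above; filtering the tokens before counting would give the same result.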

# Setting up the WordCloud

The code below can take a while to render, as the word cloud is created on the fly, saved locally, and then rendered back. After initialization, it gets a lot faster.

(It may throw an error on the first attempt or so; please try running it at least a few times. On our systems it almost always runs perfectly after a restart.)

In [4]:
wordcloud = WordCloud(
    width=1000,
    height=1000,
    max_words=500,
    scale=3,
)

# Initial cloud; 50 words matches the slider's starting value below
wordcloud.generate_from_frequencies(dict(news_frequency_distribution.most_common(50)))
wordcloud.to_file("news-wordcloud_1.png")

count = 1
    
def modify_doc1(doc):
    #Plotting in Bokeh
    def create_figure():
        global count
        p = figure(x_range=(0,1), y_range=(0,1))
        p.image_url(url = ["news-wordcloud_{}.png".format(count)], x=0, y=1, w=1, h=1)
        p.axis.visible = False
        return p
        #show(p)
    
    #Setting Up Widgets
    num_most_freq = Slider(title="Number of Most frequent words", value=50 , start=50, end=500 , step=50)
    inputs = widgetbox(num_most_freq)
    
    # Set up callbacks
    def update_data(attrname, old, new):
        global count
        # Get the current slider values
        n = num_most_freq.value
        count += 1
        # Generate New Word Cloud
        wordcloud.generate_from_frequencies(dict(news_frequency_distribution.most_common(n)))
        wordcloud.to_file("news-wordcloud_{}.png".format(count))
        # Replace the old figure with one that points at the newly written image
        layout.children[1] = create_figure()
        
    num_most_freq.on_change('value', update_data)
    layout = row(inputs, create_figure(), width=800)
    doc.add_root(layout)

handler1 = FunctionHandler(modify_doc1)
app1 = Application(handler1)
app1.create_document()
show(app1, notebook_url= 'localhost:8888')
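
The slider callback boils down to two steps: pick the top `n` entries from the frequency table, and write the image under a new file name so that each update points the figure at a different file. The sketch below isolates that logic without Bokeh or `wordcloud`. The helper names, the clamping of `n`, and the sample counts are assumptions for illustration, and `Counter` stands in for `FreqDist`.

```python
from collections import Counter
from itertools import count

MAX_WORDS = 500  # same cap as the slider's upper limit

def top_n_frequencies(frequencies, n, max_words=MAX_WORDS):
    """Return the n most common entries as a dict, with n clamped to [1, max_words]."""
    n = max(1, min(n, max_words))
    return dict(frequencies.most_common(n))

# Hypothetical counter yielding 1, 2, 3, ... for successive image files.
_image_ids = count(1)

def next_image_name(template="news-wordcloud_{}.png"):
    """Return a fresh file name on every call."""
    return template.format(next(_image_ids))

# Made-up counts for illustration only.
sample = Counter({"air": 5, "france": 4, "flight": 2, "crash": 1})
print(top_n_frequencies(sample, 2))
print(next_image_name())
print(next_image_name())
```

In the notebook, the dictionary returned by a helper like `top_n_frequencies` would be passed to `wordcloud.generate_from_frequencies`, and the fresh file name to `wordcloud.to_file`.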

In [5]:
# NER Tagging
from nltk import pos_tag, ne_chunk, word_tokenize
from collections import defaultdict
"""
myTree = ne_chunk(pos_tag(word_tokenize(all_news)))
NER_tags = []
for chunk in myTree:
    if hasattr(chunk, 'label'):
        NER_tags.append((chunk.label(), ' '.join(c[0] for c in chunk)))

#   TAKES A REALLY LONG TIME TO RUN!

NER_dict = defaultdict()         
for label, entity in NER_tags:
    if label not in NER_dict:
        NER_dict[label] = defaultdict(float)
    NER_dict[label][entity] += 1

#   SAVE THE NER_DICT OBJECT LOCALLY!

import pickle
def save_object(obj, filename):
    with open(filename, 'wb') as output:
        pickle.dump(obj, output, pickle.HIGHEST_PROTOCOL)
save_object(NER_dict, 'NER_dict.pkl')

"""
#   For Loading the NER_dict object:
    
with open(r"NER_dict.pkl", "rb") as input_file:
    NER_dict = pickle.load(input_file)

NER_dict

defaultdict(None,
            {'FACILITY': defaultdict(float,
                         {'60th Street': 1.0,
                          '845-548-5383 Check': 1.0,
                          'Air': 5.0,
                          'Air Chance': 2.0,
                          'Air Force': 2.0,
                          'Air Force One': 1.0,
                          'Air France': 594.0,
                          'Air France Flight': 1.0,
                          'Air France Jet': 1.0,
                          'Air France Passenger Names Probed': 1.0,
                          'Air France-KLM': 2.0,
                          'Air Malta': 3.0,
                          'Air Talent': 1.0,
                          'Air Talent P1Research Dubai': 1.0,
                          'Air Transport': 1.0,
                          'Al Jazeera': 1.0,
                          'Al Jazira Financial Services': 1.0,
                          'Al Manai': 3.0,
                          'Al Sadd Stadium': 1.0,