# CS690V Project - By: Divyesh Harit

# Vast Challenge - 2011
<br>
# Mini Challenge 1 - Characterization of an Epidemic Spread
<br>
Goal - A major epidemic has started in the city of Vastopolis. The task is to analyze the data and determine how the epidemic is being spread and whether or not it is contained. The origin of the epidemic must also be identified.

## Algorithms
<br>
**Clustering** - <br>
    **Based on Date/Time** - Gives a sense of how the messages change over time (either in count or in content)<br>
    **Based on Location** - Shows where people are more active or totally inactive (where they are active, we can see what kind of messages were coming in, whether the epidemic is spreading or contained, and how people are reacting)
<br>
These clustering results should give a good picture of the epidemic spread. We should be able to answer questions like how it started, how it is spreading, which areas are affected, and whether it is contained or not. <br>
<br>
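
As a rough illustration of the two groupings described above, here is a minimal sketch on made-up rows. The sample rows, the `grid_cell` helper and the 2-decimal rounding are assumptions for illustration only, and coarse grid binning stands in for the KMeans clustering used later in this notebook.

```python
from collections import Counter

# Hypothetical rows shaped like the notebook's columns: (Created_at, "lat long")
rows = [
    ("5/17/2011 09:15", "42.2271 93.3377"),
    ("5/18/2011 10:02", "42.2275 93.3381"),
    ("5/18/2011 11:47", "42.2912 93.4105"),
    ("5/18/2011 13:26", "42.2279 93.3369"),
]

# Group by date: the date is everything before the first space
by_date = Counter(created_at.split(' ')[0] for created_at, _ in rows)

def grid_cell(location, digits=2):
    """Round a 'lat long' string to a coarse grid cell (assumed cell size)."""
    lat, lon = (float(v) for v in location.split(' '))
    return (round(lat, digits), round(lon, digits))

# Group by location: count messages per grid cell
by_cell = Counter(grid_cell(loc) for _, loc in rows)

print(by_date.most_common())
print(by_cell.most_common())
```

Counting by date shows when activity changes, and counting by cell shows where it concentrates.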

In [1]:
import pandas as pd
import numpy as np
import csv
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn import cluster
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
import bokeh
#from bokeh.charts import Scatter
from bokeh.io import output_notebook,output_file
from bokeh.layouts import layout
from bokeh.models import Label
from bokeh.plotting import figure, show
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.simplefilter("ignore", UserWarning)
output_notebook()

# Data preprocessing

In [2]:
#Load dataset and format
df = pd.read_csv('vc2011_microblogs.csv', sep=',', encoding='ISO-8859-1')
temp_datetime = df[df.columns[1]]
temp_latlong = df[df.columns[2]]
tweets = df[df.columns[3]]
temp_latlong = list(temp_latlong)
latlong = []
for row in temp_latlong:
    s = row.split(' ')
    latlong.append(s)
latlong = np.array(latlong, dtype=float)

In [3]:
date_time = []
temp_datetime = list(temp_datetime)
for row in temp_datetime:
    d = row.split(' ')
    date_time.append(d[0])
date_time = np.array(date_time)

# Clustering by location

In [4]:
#assign a color to each label
def get_colors(labels):
    #palette index = cluster label (0-9), enough for the slider's maximum of 10 clusters
    palette = ['red', '#CAB2D6', 'blue', 'green', 'black',
               'brown', 'gray', 'navy', 'orange', 'olive']
    return [palette[int(i)] for i in labels]

In [5]:
from bokeh.io import push_notebook, show
from bokeh.layouts import widgetbox, row, column, layout
from bokeh.models import CustomJS, Select, Slider
from bokeh.models import ColumnDataSource, HoverTool, Legend, BoxZoomTool, ResetTool, LassoSelectTool, WheelZoomTool, PanTool

#Cluster
def clustering(number_of_clusters):
    kmeans = cluster.KMeans(n_clusters=number_of_clusters)
    return kmeans

#Initial, default clustering
clf = clustering(6)
clf.fit(latlong)

labels = clf.labels_
colors = get_colors(labels)

#Initial, default plot
source = ColumnDataSource(data=dict(x=latlong[:,0], y=latlong[:,1], colors=colors))
plot = figure(width=550, height=450, title='Clustering By Location', x_axis_label = "Latitude", y_axis_label = "Longitude",
              tools = [PanTool(), BoxZoomTool(), ResetTool(), LassoSelectTool(), WheelZoomTool()])
plot.circle('x','y', fill_color='colors', line_color='colors', source=source)

In [6]:
def update_clusters(value):
    clusters = value
    clf = clustering(clusters)
    clf.fit(latlong)
    labels = clf.labels_ 
    colors = get_colors(labels)
    source.data=dict(x=latlong[:,0], y=latlong[:,1], colors=colors)
    push_notebook(handle=bokeh_handle)

callback = CustomJS(code="""
if (IPython.notebook.kernel !== undefined) {
    var kernel = IPython.notebook.kernel;
    cmd = "update_clusters(" + cb_obj.value + ")";
    kernel.execute(cmd, {}, {});
}
""")
    
#Slider to change clusters
slider = Slider(title="Number of clusters", value=6, start=2, end=10, step=1, width=200,
                callback=callback)

#bokeh_handle = show(column(slider, plot), notebook_handle=True)

# Cluster analysis
Here, we see quite a lot of difference in results from HW5 and HW6. <br>
There, we had shortlisted only the tweets that had "#earthquake" in them. This was probably a mistake, as it greatly reduced the number of scatter points. <br>
Here, we have done no such shortlisting. We have simply clustered the tweets by their location, specifically their latitude and longitude. This, coupled with the larger number of tweets, gives us a better idea of the spatial pattern. We can infer that the tweets weren't very far from each other: they all lie within a degree or so on each axis. <br>
Further, for every choice of cluster count, we observe some very distinct gaps. This will help us leave out such regions when we try to narrow down the location where the epidemic started. We can also dig deeper into the tweets based on sub-clusters within a particularly popular region.
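
The gaps mentioned above could also be located programmatically. This is a minimal sketch, assuming a fixed grid of 0.1-degree cells; the cell size, the `empty_cells` helper and the sample points are hypothetical and not part of the notebook. It uses simple grid binning rather than KMeans.

```python
from collections import Counter

def empty_cells(points, cell=0.1):
    """Return grid cells inside the bounding box that contain no points."""
    # Integer cell indices avoid comparing float keys
    idx = [(int(lat // cell), int(lon // cell)) for lat, lon in points]
    counts = Counter(idx)
    lat_range = range(min(i for i, _ in idx), max(i for i, _ in idx) + 1)
    lon_range = range(min(j for _, j in idx), max(j for _, j in idx) + 1)
    return [(i, j) for i in lat_range for j in lon_range if counts[(i, j)] == 0]

# Hypothetical points: three occupied cells of a 2x2 grid, one left empty
points = [(42.05, 93.05), (42.05, 93.15), (42.15, 93.05)]
gaps = empty_cells(points)
print(gaps)
```

Cells returned by `empty_cells` are candidates to exclude when searching for the origin of the epidemic.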

# Clustering by Date

In [7]:
import collections
#Get the dates of messages, and get their counts
month = []
days = []
for date in date_time:
    x = date.split('/')[0]
    if x == '4':
        continue
    else:
        days.append(date.split('/')[1])
        
daycount={}
for day in days:
    if day not in daycount:
        daycount[day] = 1
    else:
        daycount[day] += 1

daycount = {int(k):int(v) for k,v in daycount.items()}
sorteddaycount = collections.OrderedDict(sorted(daycount.items()))
msg_days, msg_counts = zip(*sorteddaycount.items())

In [8]:
#Plot Frequency distribution
source = ColumnDataSource(data=dict(
    x=list(msg_days),  #one bar per day that actually has messages, so x and y have equal length
    y=msg_counts,
    day=msg_days,
))

hover = HoverTool(tooltips=[("(Date, May 2011)", "(@day)"), ("(# of Msgs)", "(@y)"),])

p = figure(plot_width=650, plot_height=500, title="Frequency distribution of msgs over various days",
          tools = [hover, PanTool(), BoxZoomTool(), ResetTool(), LassoSelectTool(), WheelZoomTool()])
p.vbar('x', width=0.8, top='y', color="firebrick", source = source)

show(p)

# Cluster analysis
Here, we filtered out the messages from April (those with month = '4'), as they made up only ~4% of the messages and removing them helps focus on the main trend. <br>
We see a consistent frequency of messages for the first 17 days; then, over the last 3 days, the number of messages shoots up and does not die down. <br>
This will help us zero in on the messages from those 3 days and look for epidemic-related ones. Maybe the epidemic started on the 18th and continued for the next 3 days?
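
One way to make the "shoots up" observation less subjective is to flag days whose count exceeds a baseline. This is a minimal sketch under assumptions: the rule (mean plus `k` standard deviations of the first `baseline_days` days), the `spike_days` helper and the counts below are all made up for illustration and are not the dataset's real numbers.

```python
from statistics import mean, stdev

def spike_days(day_counts, baseline_days=10, k=3.0):
    """Return the days whose count exceeds mean + k * stdev of the baseline."""
    days = sorted(day_counts)
    baseline = [day_counts[d] for d in days[:baseline_days]]
    threshold = mean(baseline) + k * stdev(baseline)
    return [d for d in days if day_counts[d] > threshold]

# Hypothetical counts: 17 quiet days followed by 3 busy ones
counts = {d: 1000 + (d % 3) * 10 for d in range(1, 18)}
counts.update({18: 4000, 19: 5200, 20: 4800})

print(spike_days(counts))
```

The same function could be applied to the `daycount` dictionary built above, with `baseline_days` and `k` tuned to the data.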

# Conclusion and Comparison with midterm approach
This approach helped us gain quite a few insights about the data: <br>
1. It told us that the messages aren't spread over a vast geographical area, but span only ~1 degree each of latitude and longitude.
2. It told us that the content of the messages follows a nice pattern when vectorized and represented in 2D.
3. It told us that the number of messages was consistent for 17 of the 20 days, but increased sharply over the last 3 days.
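
The geographical extent in the first point can be checked directly from the location strings. This is a minimal sketch, assuming the "lat long" string format used in this notebook; `coordinate_span` and the sample values are hypothetical.

```python
def coordinate_span(locations):
    """Return (latitude span, longitude span) in degrees for 'lat long' strings."""
    pairs = [tuple(float(v) for v in loc.split(' ')) for loc in locations]
    lats = [lat for lat, _ in pairs]
    lons = [lon for _, lon in pairs]
    return max(lats) - min(lats), max(lons) - min(lons)

# Hypothetical locations, made up for illustration
sample = ["42.20 93.25", "42.25 93.50", "42.30 93.75"]
lat_span, lon_span = coordinate_span(sample)
print(round(lat_span, 2), round(lon_span, 2))
```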

My midterm approach was quite different. I mostly focused on the tweets themselves, and not on the location and dates as done here: <br> 
After cleaning the tweets (removing stop words and non-alphanumeric characters), I stemmed the words, got their word counts, sorted them and looked at the top 100 most frequent stems. <br>
I clustered the tweets, performed topic modeling and observed the top 20 topics being talked about in those tweets. <br>
Finally, I clustered the tweets based on the actual tweet content and observed the trend when changing the number of clusters. <br>
My approach here performs better than the midterm one, and gives a better picture of the dataset. I'd attribute this to not restricting the analysis to just the messages but exploring the location and date as well.

In [9]:
from PIL import Image

background = Image.open("map.png")
overlay = Image.open("msgs.png")

background = background.convert("RGBA")
overlay = overlay.convert("RGBA")

new_img = Image.blend(background, overlay, 0.3)
new_img.save("tweets_on_map.png","PNG")
from IPython.display import Image
#Image(filename='tweets_on_map.png')

# Plot messages by date on map

In [10]:
day_17 = df.loc[df['Created_at'].str.contains("5/17/2011")]
day_18 = df.loc[df['Created_at'].str.contains("5/18/2011")]
day_19 = df.loc[df['Created_at'].str.contains("5/19/2011")]
day_20 = df.loc[df['Created_at'].str.contains("5/20/2011")]

In [11]:
temp_latlong = day_17[day_17.columns[2]]
temp_latlong = list(temp_latlong)
new_latlong = []
for row in temp_latlong:
    s = row.split(' ')
    new_latlong.append(s)
new_latlong = np.array(new_latlong, dtype=float)

In [12]:
source = ColumnDataSource(data=dict(x=new_latlong[:,0], y=new_latlong[:,1]))
plot = figure(width=850, height=600, title='Clustering By Location', x_axis_label = "Latitude", y_axis_label = "Longitude",
              tools = [PanTool(), BoxZoomTool(), ResetTool(), LassoSelectTool(), WheelZoomTool()])
plot.image_url(url=['map.png'], x=42.16, y=93.6, w=0.142, h=0.41)
plot.circle('x','y', fill_color='red', line_color='red', source=source)

def update_date(value):
    date = "day_" + str(value)
    if value == 17:
        temp_latlong = day_17[day_17.columns[2]]
    elif value == 18:
        temp_latlong = day_18[day_18.columns[2]]
    elif value == 19:
        temp_latlong = day_19[day_19.columns[2]]
    else:
        temp_latlong = day_20[day_20.columns[2]]
    temp_latlong = list(temp_latlong)
    latlong = []
    for row in temp_latlong:
        s = row.split(' ')
        latlong.append(s)
    latlong = np.array(latlong, dtype=float)
    source.data=dict(x=latlong[:,0], y=latlong[:,1])
    push_notebook(handle=bokeh_handle)

callback = CustomJS(code="""
if (IPython.notebook.kernel !== undefined) {
    var kernel = IPython.notebook.kernel;
    cmd = "update_date(" + cb_obj.value + ")";
    kernel.execute(cmd, {}, {});
}
""")
    
#Slider to change the date
slider = Slider(title="Date, May 2011", value=17, start=17, end=20, step=1, width=200,
                callback=callback)

bokeh_handle = show(column(slider, plot), notebook_handle=True)

# Plot messages by symptoms + date on map

In [13]:
from collections import defaultdict
symp = ['cold', 'sweats', 'headache', 'fatigue', 'fever', 'stomach', 'flu', 'chills', 'diarrhea', 'pneumonia']
symptoms = defaultdict(int)
for s in symp:
    symptoms[s] += 1
symptoms_17 = []; symptoms_18 = []; symptoms_19 = []; symptoms_20 = []
for symptom,value in symptoms.items():
    temp = day_17[day_17.apply(lambda r: r.str.contains(symptom, case=False).any(), axis=1)]
    temp = list(temp[temp.columns[2]])
    symptoms_17.append(temp)
    temp = day_18[day_18.apply(lambda r: r.str.contains(symptom, case=False).any(), axis=1)]
    temp = list(temp[temp.columns[2]])
    symptoms_18.append(temp)
    temp = day_19[day_19.apply(lambda r: r.str.contains(symptom, case=False).any(), axis=1)]
    temp = list(temp[temp.columns[2]])
    symptoms_19.append(temp)
    temp = day_20[day_20.apply(lambda r: r.str.contains(symptom, case=False).any(), axis=1)]
    temp = list(temp[temp.columns[2]])
    symptoms_20.append(temp)

In [14]:
symptoms_17_final = []
for ls in symptoms_17:
    tmp = []
    for lat in ls:
        tmp.append(list(map(float,lat.split(" "))))
    symptoms_17_final.append(tmp)
symptoms_18_final = []
for ls in symptoms_18:
    tmp = []
    for lat in ls:
        tmp.append(list(map(float,lat.split(" "))))
    symptoms_18_final.append(tmp)
symptoms_19_final = []
for ls in symptoms_19:
    tmp = []
    for lat in ls:
        tmp.append(list(map(float,lat.split(" "))))
    symptoms_19_final.append(tmp)
symptoms_20_final = []
for ls in symptoms_20:
    tmp = []
    for lat in ls:
        tmp.append(list(map(float,lat.split(" "))))
    symptoms_20_final.append(tmp)

In [15]:
lat_17 = []
for row in symptoms_17_final:
    tmp = []
    for tup in row:
        tmp.append(tup[0])
    lat_17.append(tmp)
long_17 = []
for row in symptoms_17_final:
    tmp = []
    for tup in row:
        tmp.append(tup[1])
    long_17.append(tmp)
    
lat_18 = []
for row in symptoms_18_final:
    tmp = []
    for tup in row:
        tmp.append(tup[0])
    lat_18.append(tmp)
long_18 = []
for row in symptoms_18_final:
    tmp = []
    for tup in row:
        tmp.append(tup[1])
    long_18.append(tmp)
    
lat_19 = []
for row in symptoms_19_final:
    tmp = []
    for tup in row:
        tmp.append(tup[0])
    lat_19.append(tmp)
long_19 = []
for row in symptoms_19_final:
    tmp = []
    for tup in row:
        tmp.append(tup[1])
    long_19.append(tmp)
    
lat_20 = []
for row in symptoms_20_final:
    tmp = []
    for tup in row:
        tmp.append(tup[0])
    lat_20.append(tmp)
long_20 = []
for row in symptoms_20_final:
    tmp = []
    for tup in row:
        tmp.append(tup[1])
    long_20.append(tmp)

In [16]:
def update_day(value):
    if value == 17:
        source_17_1.data = dict(x=lat_17[0], y=long_17[0])
        source_17_3.data = dict(x=lat_17[2], y=long_17[2])
        source_17_5.data = dict(x=lat_17[4], y=long_17[4])
        source_17_7.data = dict(x=lat_17[6], y=long_17[6])
        source_17_8.data = dict(x=lat_17[7], y=long_17[7])
        source_17_9.data = dict(x=lat_17[8], y=long_17[8])
        source_17_10.data = dict(x=lat_17[9], y=long_17[9])
    elif value == 18:
        source_17_1.data = dict(x=lat_18[0], y=long_18[0])
        source_17_3.data = dict(x=lat_18[2], y=long_18[2])
        source_17_5.data = dict(x=lat_18[4], y=long_18[4])
        source_17_7.data = dict(x=lat_18[6], y=long_18[6])
        source_17_8.data = dict(x=lat_18[7], y=long_18[7])
        source_17_9.data = dict(x=lat_18[8], y=long_18[8])
        source_17_10.data = dict(x=lat_18[9], y=long_18[9])
    elif value == 19:
        source_17_1.data = dict(x=lat_19[0], y=long_19[0])
        source_17_3.data = dict(x=lat_19[2], y=long_19[2])
        source_17_5.data = dict(x=lat_19[4], y=long_19[4])
        source_17_7.data = dict(x=lat_19[6], y=long_19[6])
        source_17_8.data = dict(x=lat_19[7], y=long_19[7])
        source_17_9.data = dict(x=lat_19[8], y=long_19[8])
        source_17_10.data = dict(x=lat_19[9], y=long_19[9])
    else:
        source_17_1.data = dict(x=lat_20[0], y=long_20[0])
        source_17_3.data = dict(x=lat_20[2], y=long_20[2])
        source_17_5.data = dict(x=lat_20[4], y=long_20[4])
        source_17_7.data = dict(x=lat_20[6], y=long_20[6])
        source_17_8.data = dict(x=lat_20[7], y=long_20[7])
        source_17_9.data = dict(x=lat_20[8], y=long_20[8])
        source_17_10.data = dict(x=lat_20[9], y=long_20[9])
    push_notebook(handle=bokeh_handle)
callback = CustomJS(code="""
if (IPython.notebook.kernel !== undefined) {
    var kernel = IPython.notebook.kernel;
    cmd = "update_day(" + cb_obj.value + ")";
    kernel.execute(cmd, {}, {});
}
""")

In [17]:
source_17_1 = ColumnDataSource(data=dict(x=lat_17[0], y=long_17[0]))
source_17_2 = ColumnDataSource(data=dict(x=lat_17[1], y=long_17[1]))
source_17_3 = ColumnDataSource(data=dict(x=lat_17[2], y=long_17[2]))
source_17_4 = ColumnDataSource(data=dict(x=lat_17[3], y=long_17[3]))
source_17_5 = ColumnDataSource(data=dict(x=lat_17[4], y=long_17[4]))
source_17_6 = ColumnDataSource(data=dict(x=lat_17[5], y=long_17[5]))
source_17_7 = ColumnDataSource(data=dict(x=lat_17[6], y=long_17[6]))
source_17_8 = ColumnDataSource(data=dict(x=lat_17[7], y=long_17[7]))
source_17_9 = ColumnDataSource(data=dict(x=lat_17[8], y=long_17[8]))
source_17_10 = ColumnDataSource(data=dict(x=lat_17[9], y=long_17[9]))
plot = figure(width=850, height=600, title='Clustering By Symptoms By location', x_axis_label = "Latitude", y_axis_label = "Longitude",
              tools = [PanTool(), BoxZoomTool(), ResetTool(), LassoSelectTool(), WheelZoomTool()])
plot.image_url(url=['map.png'], x=42.16, y=93.6, w=0.142, h=0.41)
plot.circle('x','y', fill_color='red', line_color='red', source=source_17_1, legend = "cold")
plot.circle('x','y', fill_color='blue', line_color='blue', source=source_17_3, legend = "headache")
plot.circle('x','y', fill_color='orange', line_color='orange', source=source_17_5, legend = "fever")
plot.circle('x','y', fill_color='navy', line_color='navy', source=source_17_7, legend = "flu")
plot.circle('x','y', fill_color='brown', line_color='brown', source=source_17_8, legend = "chills")
plot.circle('x','y', fill_color='olive', line_color='olive', source=source_17_9, legend = "diarrhea")
plot.circle('x','y', fill_color='gray', line_color='gray', source=source_17_10, legend = "pneumonia")

#Slider to change the date
slider = Slider(title="Date, May 2011", value=17, start=17, end=20, step=1, width=200,
                callback=callback)

bokeh_handle = show(column(slider, plot), notebook_handle=True)

# MINI CHALLENGE 3

In [18]:
# Create dictionary of terror related words
terror = ['terror', 'terrorist', 'terrorists', 'terrorism','threat','bomb', 'scare', 'explosion', 'torture','violence','shoot','shot','panic', 'gun', 'missile', 'horror','scary', 'attack']
terror_words = defaultdict(int)
for t in terror:
    terror_words[t] = 0

In [19]:
# Read articles
import io
path = './MC_3_Materials_4_4_2011/'
articles = ""
for i in range(1,4475):
    #file names are the article number zero-padded to 5 digits, e.g. 00001.txt
    full_path = path + str(i).zfill(5) + '.txt'
    #the with-block closes each file after reading
    with io.open(full_path, mode="r", encoding="ISO-8859-1") as file:
        articles += file.read()
articles = articles.split(' ')

In [20]:
# Get frequencies of these words from articles
for word in articles:
    if word in terror_words:
        terror_words[word] += 1
freq = []
for k, v in terror_words.items():
    freq.append(v)

In [21]:
# Plot distribution
source = ColumnDataSource(data=dict(
    x=np.arange(1,19),
    y=freq,
    word=terror,
))

hover = HoverTool(tooltips=[("(Word)", "(@word)"), ("(Freq.)", "(@y)"),])

p = figure(plot_width=800, plot_height=500, title="Frequency distribution of terror related words",
          tools = [hover, PanTool(), BoxZoomTool(), ResetTool(), LassoSelectTool(), WheelZoomTool()])
p.vbar(x='x', width=0.3, top='y', color="darkcyan", source = source)
from bokeh.models import LabelSet
labels = LabelSet(x='x', y='y', text='word', level='glyph',
                  x_offset=-24, source=source, render_mode='canvas')

p.add_layout(labels)
show(p)

# Topic Modeling

In [22]:
articles_str = ','.join(str(a) for a in articles)

In [23]:
from nltk.corpus import stopwords
import re
#Remove stop words
pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')
articles_str = pattern.sub('', articles_str)
#Remove URLs, non-alphanumeric characters and numbers
articles_list = articles_str.split(',')
for counter, t in enumerate(articles_list):
    articles_list[counter] = re.sub(r'[?|$|.|\\|!|#|\-|"|\n|,|@|(|)]',r'',articles_list[counter])
    articles_list[counter] = re.sub(r'https?:\/\/.*[\r\n]*', '', articles_list[counter])
    articles_list[counter] = re.sub(r'[0|1|2|3|4|5|6|7|8|9|:]',r'',articles_list[counter])

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

all_words = []
for row in articles_list:
    all_words += row.split(' ')
data_for_topic = list(articles_list)
full_wordcount={}
for word in all_words:
    if word not in full_wordcount:
        full_wordcount[word] = 1
    else:
        full_wordcount[word] += 1
#Get names of features
no_features = 1000
tfidf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(data_for_topic)
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
#Run LDA to model topics for the features (the model is stored in the variable named nmf)
no_topics = 20
nmf = LatentDirichletAllocation(n_topics=no_topics, max_iter=5, learning_method='online', learning_offset=50.,random_state=0).fit(tfidf)

In [25]:
no_top_words = 10
for topic_idx, topic in enumerate(nmf.components_):
        print ("Topic %d: " % (topic_idx) + " ".join([tfidf_feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))

Topic 0: exchange tuesday political tax derryberry going rates late chicago policy
Topic 1: group index expected higher use early products board director states
Topic 2: world national general unit city results set current costs large
Topic 3: shares codi york net chairman funds hong economy yen francs
Topic 4: mr state make lost issues bonds communications value dow employees
Topic 5: market like trading far month fund white game losses family
Topic 6: billion investors fell industry games thursday end including line pay
Topic 7: year law lower increase volume th european offer told today
Topic 8: sales week house news work deal major high number trade
Topic 9: company analysts chief officials financial right firm department point july
Topic 10: share companies money friday international securities kong small better used
Topic 11: new time cents day months campaign income power buy strong
Topic 12: bank federal growth home profit investment technology network rate ended
Topic 13: said

# Yeah, this didn't work. Nothing remotely terrorism-related is being talked about in the top 20 topics.
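
Note that `articles_list` above is built by splitting the corpus on spaces and commas, so each "document" given to the vectorizer is roughly a single word. One possible alternative is to keep one string per article and rank articles by how many terror-related keywords they contain. This is a minimal sketch under assumptions: `keyword_hits`, `rank_articles`, the shortened keyword set and the sample articles are all hypothetical.

```python
import re
from collections import Counter

# Shortened, assumed keyword set; the notebook's full `terror` list could be used instead
TERROR_WORDS = {'terror', 'terrorist', 'bomb', 'threat', 'explosion', 'attack'}

def keyword_hits(text, keywords=TERROR_WORDS):
    """Count keyword occurrences in one article, ignoring case and punctuation."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t in keywords)

def rank_articles(articles, keywords=TERROR_WORDS):
    """Return names of articles with at least one hit, most hits first."""
    scored = [(sum(keyword_hits(text, keywords).values()), name)
              for name, text in articles.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]

# Hypothetical articles, made up for illustration
sample_articles = {
    "a.txt": "Stocks fell on Tuesday.",
    "b.txt": "Police defused a bomb; the bomb threat caused panic.",
    "c.txt": "A terrorist attack was feared.",
}
ranking = rank_articles(sample_articles)
print(ranking)
```

Articles near the top of the ranking could then be read directly, or passed to the topic model as whole documents.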