# Explore mined tweets
In which we explore the tweets that we've mined to look for ambiguous entities.

In [10]:
import gzip
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import json
from ast import literal_eval
import pandas as pd

## Load data

We'll start with the historical data mined with hurricane-related hashtags between 8/17/17 and 9/12/17 using [this](https://github.com/Jefferson-Henrique/GetOldTweets-python) handy library.

In [20]:
# start with historical tweet data
tweet_file = '../../data/mined_tweets/hurricane_data_2017-08-17_2017-09-12.gz'
tweet_data = []
# open in text mode so that each line is a str that can be split on tabs
with gzip.open(tweet_file, 'rt', encoding='utf-8') as tweet_input:
    for l in tweet_input:
        tweet_data.append(l.strip().split('\t'))
print('%d total tweets'%(len(tweet_data)))
tweet_data_df = pd.concat([pd.Series(d) for d in tweet_data], axis=1).transpose()
# the header row is one column short: drop its NaN padding and
# move 'permalink' after the unlabeled 'id' column
tweet_cols = [c for c in tweet_data_df.iloc[0, :].tolist() if isinstance(c, str) and c != 'permalink']
tweet_cols += ['id', 'permalink']
tweet_data_df.columns = tweet_cols
tweet_data_df.drop(0, inplace=True, axis=0)
print(tweet_data_df.head())

207529 total tweets
         username              date retweets favorites  \
1        diane380  2017-09-11 19:59        0         0   
2   mccranie_paul  2017-09-11 19:59        0         6   
3  HartfordNHNews  2017-09-11 19:59        0         0   
4      jpreiser93  2017-09-11 19:59        2         7   
5     PoliticalQB  2017-09-11 19:59        0         0   

                                                text geo mentions  \
1  "Je sais pas mais je trouve qu il est stressé,...                
2                "I am wildly bored. #HurricaneIrma"                
3  "A 360-degree tour of the damage caused by #Hu...                
4  "I will never take electricity for granted eve...                
5  "Hurricane expert Klotzbach: #Irma at landfall...                

                                      hashtags                    id  \
1                                        #Irma  "907393316832071680"   
2                               #HurricaneIrma  "907393316463026177"   


## Look for ambiguity
Blunt filter: all tweets with a URL are probably news stories and will not contain ambiguity.

In [43]:
import re
url_matcher = re.compile(r'http\S+')
# url_matcher = re.compile('(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za
# |zm|zw)\b/?(?!@))')
tweet_data_df['urls'] = tweet_data_df['text'].apply(lambda x: ' '.join(url_matcher.findall(x)))

In [44]:
print(tweet_data_df['urls'].value_counts().sort_values(inplace=False, ascending=False))

                                                                                                                                                                                           109976
https://                                                                                                                                                                                    37225
http://                                                                                                                                                                                     31140
https://www.                                                                                                                                                                                18320
http://www.                                                                                                                                                                                  8129
http:// http://               

In [46]:
tweet_data_df_no_urls = tweet_data_df[tweet_data_df['urls'] == '']
print('%d tweets without URL'%(len(tweet_data_df_no_urls)))
print(tweet_data_df_no_urls['text'])

109976 tweets without URL
1         "Je sais pas mais je trouve qu il est stressé,...
2                       "I am wildly bored. #HurricaneIrma"
4         "I will never take electricity for granted eve...
6         "Wow @CNN coverage of #HurricaneIrma on Sunday...
9         "Viewer Eugene Spann thanks @CBS12 for continu...
10        "I pray for the safety of all people in Florid...
15        "@insideFPL is restoring power to homes along ...
16         "Thank you very much. It's a great team!! #Irma"
17                                   "SAFE. #hurricaneirma"
19        "Súmate y ayuda a nuestros hermanos afectados ...
20        "Grateful to have had power all day. #irma #ch...
22        "Really missing the portable AC I bought on @a...
25        "Some damage pics from #HurricaneIrma pic.twit...
26        "@pizzahut Horrendous how you treat employees ...
27        "Despite everything we have come out the other...
28        "Just directed an autism specialist to a famil...
32        "The

That cut the data roughly in half (109,976 tweets remain of roughly 207,500), and more of the remaining tweets look like eyewitness accounts.

Let's do some manual inspection.

In [50]:
sample_size = 100
ctr = 4
print('\n'.join(tweet_data_df_no_urls['text'].values[ctr*sample_size:(ctr+1)*sample_size]))

"President Bach starts with a reference to the damage caused by #Irma . IOC to contribute in repairing sport infrastructure. pic.twitter.com/hwbZ47sOe0"
"texas holdem gettin heated #HurricaneIrma #PowerOuttage"
"2) #hurricaneimra #SaintMarteen #SaintMartin #irmaaftermath #Irma pic.twitter.com/8OK9P5ky45"
"Open restaurant #CafeVico #Sunrise #FtLauderdale #Irma pic.twitter.com/vREFUHSVnz"
"As they go through what's left of #Irma in #Asheville , they are still having a diaper drive for #Harvey victims. @WFMY pic.twitter.com/ezX0y1upEi"
"6 million people without power. The sunshine state is the darkest state at the moment. #irma"
"Double rainbow appeared over Central Florida, after a long day of post #HurricaneIrma cleanup. pic.twitter.com/HT5t6jQqmn"
"Power has been out for over 12 hours... #tropicalstorms #hurricaneirma"
"Nothing like seeing this beauty appear nearly 24 hours later after #hurricaneirma ...and on #September11 . pic.twitter.com/vZU1MSynz7"
"#irma is a bitch #rip"
"#Hurrica

Possible candidates for vague entity resolution:

- "So, when I said I wanted to get rid of the Jacaranda tree, I didn't mean in the neighbor's yard. #hurricaneirma pic.twitter.com/uwGDT7Vwqj"
- "Damage minimal in our neighborhood. Extra day off due to roads and power outages for other employees. #Imsleepingin #Irma @kennethdockery"
- "Prayers for many families in my sweet town which seems to be under water at the moment. #hurricaneirma"
- "This is the place to be on the Blvd! Cars backed up for miles.. LOL - SW #swrocks #hurricaneirma @tacobell pic.twitter.com/DapRGK8RuA"
- "#Atlanta getting hit by #HurricaneIrma power out at home, but my @WaffleHouse in Grant Park is open #awesome @wsbtv #MJJAllDay @mjjfootball pic.twitter.com/tguUyw3CuR"
- "Video from Fleming Island in Clay County after #Irma . Flooding of Creighton Road by Doctors Inlet. Canoeing down Creighton Landing Road. pic.twitter.com/8S4BT6c7iG"
- "If @insideFPL thinks we'll be grateful for their effort they're wrong! We shouldn't have had outages in south FL! #irma was TS here!!"
- "Its 20 freaking 17!!! It should not be taking #FPL this long to turn our power back on! #Miami #HurricaneIrma #Disgusted #PissedOff"
- "We are still without power here. Over 12k without @scegnews power and 5k without @AikenElectric power in Aiken. #Irma"
- "Open restaurant #CafeVico #Sunrise #FtLauderdale #Irma pic.twitter.com/vREFUHSVnz"
- "Just drove all the way to Taco Bell & couldn't even get in line bc there was a wreck & 77 cars in line #merica #irma"
- "We're starving & we luckily found #WaffleHouse open. #loyalty #HurricaneIrma pic.twitter.com/mFuIp419LY"

Categories of lexical and phrasal ambiguity:

- abbreviations: `Blvd`, `FPL`
- combinations: `@WaffleHouse in Grant Park`
- names: `Creighton Road`, `Aiken`
- scope: `south FL`
- local: `my sweet town`, `our neighborhood`
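
The categories above could be turned into rough flags to triage tweets before manual inspection. The following is only a sketch: the `AMBIGUITY_PATTERNS` table, the `flag_ambiguity` helper, and the patterns themselves are illustrative assumptions based on the handful of examples listed here, not a validated classifier. It covers only the abbreviation, name, and local categories; combinations and scope would likely need more than a regular expression.

```python
import re

# Hypothetical patterns, one per category, tuned only on the examples above.
AMBIGUITY_PATTERNS = {
    # road-type abbreviations and short all-caps acronyms such as FPL
    'abbreviation': re.compile(r'\b(?:Blvd|Ave|Rd|St|Hwy)\b|(?<![@\w])[A-Z]{3,4}\b'),
    # capitalized word(s) followed by a place type, e.g. "Creighton Road"
    'name': re.compile(r'\b(?:[A-Z][a-z]+ )+(?:Road|Street|Avenue|Park|County|Island)\b'),
    # possessive reference to an unnamed place, e.g. "our neighborhood"
    'local': re.compile(r'\b(?:my|our)\s+(?:\w+\s+)?(?:town|city|neighborhood|street|area)\b', re.I),
}

def flag_ambiguity(text):
    """Return the sorted list of category names whose pattern matches the text."""
    return sorted(name for name, pattern in AMBIGUITY_PATTERNS.items()
                  if pattern.search(text))
```

Applied to the data frame, this might look like `tweet_data_df_no_urls['text'].apply(flag_ambiguity)`. The acronym pattern is deliberately loose and will also flag strings such as `LOL`, so the output is a starting point for manual review rather than a label.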

## Find locals
Can we find the individuals who actually went through Harvey/Irma, as opposed to outsiders who would have no reference for the vague locations?

In [55]:
print((tweet_data_df_no_urls['geo'] != '').any())

False


There are no geotagged tweets in the historical Twitter data! We need to re-mine the historical data for tweets actually written in the areas under consideration: 

- Houston
- Miami
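
Once tweets with coordinates are available, one simple way to keep only those written in these areas is a bounding-box test. This is a minimal sketch under stated assumptions: the `CITY_BOXES` coordinates are rough approximations chosen for illustration (not official city limits), and `city_for_point` is a hypothetical helper name.

```python
# Rough, illustrative bounding boxes: (min_lat, min_lon, max_lat, max_lon).
# The coordinates are approximate assumptions, not official city limits.
CITY_BOXES = {
    'Houston': (29.5, -95.8, 30.1, -95.0),
    'Miami': (25.6, -80.5, 25.9, -80.1),
}

def city_for_point(lat, lon, boxes=CITY_BOXES):
    """Return the name of the first box that contains (lat, lon), or None."""
    for city, (min_lat, min_lon, max_lat, max_lon) in boxes.items():
        if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
            return city
    return None
```

A box is crude compared with a real city polygon, but it is cheap to evaluate over millions of tweets and easy to widen if the surrounding metro area should count as local.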

## Look for geotagged tweets in archive
Do we have any geotagged tweets in the archive?

In [83]:
import gzip
import json
from collections import Counter
test_file = '/hg190/corpora/twitter-crawl/new-archive/tweets-Aug-17-17-04-04.gz'
# place_name = 'Houston'
all_cities = Counter()
geotag_count = 0
with gzip.open(test_file, 'r') as archive:
    for i, l in enumerate(archive):
        try:
            j = json.loads(l)
            j_place = j['place']
            if(j_place is not None):
                j_place_city = j_place['city']
                all_cities[j_place_city] += 1
            if(j['geo'] is not None):
                geotag_count += 1
            if(i % 100000 == 0):
                print('processed %d tweets'%(i))
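        except Exception:
            # NOTE: this also swallows a KeyError if the place dict has no
            # 'city' key, which skips the geo check and progress message too
            pass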

processed 200000 tweets
processed 300000 tweets
processed 400000 tweets
processed 500000 tweets
processed 700000 tweets
processed 800000 tweets
processed 900000 tweets
processed 1000000 tweets
processed 1100000 tweets
processed 1300000 tweets
processed 1400000 tweets
processed 1500000 tweets
processed 1700000 tweets
processed 1800000 tweets
processed 1900000 tweets
processed 2000000 tweets
processed 2100000 tweets
processed 2200000 tweets
processed 2300000 tweets
processed 2400000 tweets
processed 2600000 tweets
processed 2800000 tweets
processed 2900000 tweets
processed 3000000 tweets
processed 3100000 tweets
processed 3200000 tweets
processed 3400000 tweets
processed 3500000 tweets
processed 3600000 tweets
processed 3700000 tweets
processed 3800000 tweets
processed 3900000 tweets
processed 4100000 tweets
processed 4400000 tweets
processed 4500000 tweets
processed 4600000 tweets
processed 4700000 tweets
processed 4800000 tweets
processed 5100000 tweets


In [80]:
print("%d/%d geotagged tweets"%(geotag_count, i))

74/5238085 geotagged tweets


In [81]:
all_cities = pd.Series(all_cities).sort_values(inplace=False, ascending=False)
print(all_cities[:50])

Series([], dtype: float64)


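Weird! There are almost no geotagged tweets (74 of about 5.2 million) and literally no places. One caveat: the loop above silently skips any tweet that raises an exception, so the empty place counts could also result from `place` objects that lack a `city` key. Maybe user information is more reliable?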
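
To check the user-information idea, one option is to count the free-text location string that users put in their profiles. The sketch below assumes each archive line is a JSON tweet with a `user` object containing a `location` field, as in the standard Twitter tweet format; whether this archive follows that layout has not been verified here, and `count_user_locations` is a hypothetical helper name.

```python
import json
from collections import Counter

def count_user_locations(lines):
    """Count lower-cased `user.location` strings in an iterable of JSON tweet lines."""
    locations = Counter()
    for line in lines:
        try:
            tweet = json.loads(line)
        except ValueError:
            # skip lines that are not valid JSON
            continue
        if not isinstance(tweet, dict):
            continue
        # `user` or `location` may be missing or null (e.g. delete notices)
        user = tweet.get('user') or {}
        location = (user.get('location') or '').strip()
        if location:
            locations[location.lower()] += 1
    return locations
```

With the archive file from above, this could be run as `count_user_locations(gzip.open(test_file, 'rt'))` followed by `.most_common(50)`. Profile locations are self-reported, so strings like "earth" will show up alongside real city names.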

## Plot time series

## Plot locations
Where do these tweets fall on a map?

Plotting code is [here](https://stackoverflow.com/questions/40491340/plotting-a-map-with-geopy-and-matplotlib-in-jupyter-notebook#40494221).

## Find ambiguous tweets
Let's look for ambiguous tweets that relate to each crisis (Harvey and Irma) that were generated near the height of each crisis.
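
One way to select such tweets is to keep those that mention a storm by name and were posted within a few days of its landfall. This is a sketch under stated assumptions: the landfall dates (Harvey in Texas, Irma in Florida) stand in for the "height" of each crisis, the 3-day half-window is arbitrary, and `CRISIS_PEAKS` and `near_peak` are hypothetical names. The date format matches the `date` column shown earlier.

```python
from datetime import datetime, timedelta

# Approximate landfall dates used as a stand-in for the height of each crisis.
CRISIS_PEAKS = {
    'harvey': datetime(2017, 8, 25),
    'irma': datetime(2017, 9, 10),
}

def near_peak(text, date_str, window_days=3):
    """Return the crises named in the text whose peak is within window_days of the tweet."""
    tweet_time = datetime.strptime(date_str, '%Y-%m-%d %H:%M')
    window = timedelta(days=window_days)
    text = text.lower()
    return [name for name, peak in CRISIS_PEAKS.items()
            if name in text and abs(tweet_time - peak) <= window]
```

On the data frame this could be applied row by row, for example `tweet_data_df_no_urls.apply(lambda r: near_peak(r['text'], r['date']), axis=1)`, keeping the rows where the result is non-empty.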