# Data Collection 

Author: Preey Sawadmanod

---

The goal of our project was to map cities affected by Hurricane Sandy along the US East Coast and identify areas where survivors may have needed assistance. The first step was to gather data by scraping social media, but data acquisition was limited. For instance, Facebook and Instagram have very restrictive APIs due to security issues, so we did not use them. 

One way we collected data was with the Twitter API. Twitter offers opt-in geo-tagging of posts, but few users enable it, so we used the API to target specific users instead. For example, we targeted tweets from SandyAid, a non-profit organization founded in New York City that offered help during Hurricane Sandy, and we assumed that its location was New York City. 

TwitterScraper is another package we used, in this case to scrape historical tweets. It supports a geocode that ties each tweet to its nearest city. We used this feature to keep only tweets within a certain radius of chosen cities along the East Coast, sent during the time period of the hurricane, and containing select keywords. 

We used both the Twitter API and TwitterScraper to collect data, and combined everything gathered into a single data frame. 

## Table of Contents 
---
- [Import packages](#Import-packages)
- [Twitter API](#Twitter-API)
- [TwitterScraper](#TwitterScraper)
- [Combined tweets](#Combined-tweets)

### Import packages

In [11]:
#TwitterScraper is a separate package and can be installed with the code below
# !pip install twitterscraper
# !pip install tweepy
# !pip install unidecode

In [12]:
#Import miscellaneous
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

#Load packages from tweepy 
import tweepy
from tweepy               import Stream
from tweepy               import OAuthHandler
from tweepy.streaming     import StreamListener
from unidecode            import unidecode

#Load packages from twitterscraper
import twitterscraper
from twitterscraper       import query_tweets

#Load additional packages 
import json
import sqlite3
import time
import csv
import datetime
import pandas   as pd

### Twitter API 

Here we scrape tweets with the Twitter API to specifically target users in volunteer organizations in New York City. 
API keys are not included and therefore need to be generated. 
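
One way to supply the keys without writing them into the notebook is to read them from environment variables. The sketch below is only an illustration: the four variable names and the `load_twitter_keys` helper are hypothetical and are not part of the original project.

```python
import os

# Hypothetical environment variable names; the original notebook does not define these.
KEY_NAMES = ("TWITTER_CKEY", "TWITTER_CSECRET", "TWITTER_ATOKEN", "TWITTER_ASECRET")

def load_twitter_keys(env=os.environ):
    """Return (ckey, csecret, atoken, asecret), or raise KeyError if any is missing."""
    missing = [name for name in KEY_NAMES if not env.get(name)]
    if missing:
        raise KeyError(f"Missing Twitter API keys: {', '.join(missing)}")
    return tuple(env[name] for name in KEY_NAMES)
```

The four returned values can then be assigned to `ckey`, `csecret`, `atoken`, and `asecret` before authorizing.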

In [13]:
#Function to get tweets via API provided by Nick Read
def get_all_tweets(screen_name):
    
    #Add API keys here (they must be defined before authorizing)
#     ckey    = 
#     csecret =
#     atoken  =
#     asecret = 
    
    #authorize twitter, initialize tweepy
    auth = OAuthHandler(ckey, csecret)
    auth.set_access_token(atoken, asecret)
    api = tweepy.API(auth)
    
    #initialize a list to hold all the tweepy Tweets
    alltweets = []
    
    #make initial request for most recent tweets (200 is the maximum allowed count)
    new_tweets = api.user_timeline(screen_name = screen_name,count=200)
    
    #save most recent tweets
    alltweets.extend(new_tweets)
    
    #save the id of the oldest tweet less one
    oldest = alltweets[-1].id - 1
    
    #keep grabbing tweets until there are no tweets left to grab
    while len(new_tweets) > 0:
        print("getting tweets before %s" % (oldest))

        #all subsequent requests use the max_id param to prevent duplicates
        new_tweets = api.user_timeline(screen_name = screen_name,count=200,max_id=oldest)
    
        #save most recent tweets
        alltweets.extend(new_tweets)

        #update the id of the oldest tweet less one
        oldest = alltweets[-1].id - 1

        print ("...%s tweets downloaded so far" % (len(alltweets)))

    #transform the tweepy tweets into a 2D array that will populate the csv
    outtweets = [[tweet.id_str, tweet.created_at, tweet.text] for tweet in alltweets]
    
    #write the csv to the folder it is read from later
    #(the file is opened as utf-8, so the text is not encoded separately)
    with open('../data/tweets/%s_tweets.csv' % screen_name, mode='w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(["id","created_at","text"])
        writer.writerows(outtweets)
        print("CSV Was Created")
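
The `max_id` paging used in `get_all_tweets` can be tried offline. The following is a minimal sketch under stated assumptions: `fetch_page` is a hypothetical stand-in for `api.user_timeline` that works on a plain list of integer IDs, and `collect_all` mirrors the loop above without calling Twitter.

```python
def fetch_page(timeline, count, max_id=None):
    """Hypothetical stand-in for api.user_timeline: newest-first IDs no greater than max_id."""
    eligible = [i for i in timeline if max_id is None or i <= max_id]
    return eligible[:count]

def collect_all(timeline, count=200):
    """Page backwards through a timeline using max_id until a page comes back empty."""
    collected = []
    page = fetch_page(timeline, count)
    while page:
        collected.extend(page)
        # One below the oldest ID seen so far, so the next page cannot repeat it
        oldest = collected[-1] - 1
        page = fetch_page(timeline, count, max_id=oldest)
    return collected

# Toy timeline: 450 made-up tweet IDs, newest first
toy_timeline = list(range(1450, 1000, -1))
all_ids = collect_all(toy_timeline)
print(len(all_ids), len(set(all_ids)))
```

Subtracting one from the oldest ID is what keeps the boundary tweet from appearing on two consecutive pages.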

We can run the function by passing in the username of the account we want to scrape. By searching for specific organizations active during Hurricane Sandy, we were able to obtain usernames such as _"SandyAid", "Fema", "HurricaneSandyhelp"_. 

```python
if __name__ == '__main__':
    get_all_tweets("SandyAid")
```

After a few trials, we decided to move forward only with the tweets from `SandyAid` because they contained the least noise. `FEMA` and `HurricaneSandyHelp` tweets had too much noise: people talking _about_ Hurricane Sandy itself rather than asking for help.

### TwitterScraper


**The code block below was used to scrape tweets with TwitterScraper:**
1. Create a list of major cities to target along the East Coast, chosen by researching historical data on where Hurricane Sandy hit. 
2. Query each city within a 10-mile radius for keywords such as _rescue_, _help_, and _hurricanesandy_, during the period from October 1st through November 30th, 2012. 
3. Store information such as tweet text, ID, username, location, and timestamp in a data frame.
4. Combine all data frames into a single one.

```python 
#creating a list of cities we want to scrape 
cities_list = ['Boston', 'Philadelphia', 'Providence','Washington DC', 
               'Buffalo', 'Toronto', 'Montreal','Richmond', 'Long Beach']

#for loop to iterate through our city list and scrape Twitter for a selection of targeted key words
for city in cities_list:

    #Building the search query (adjacent string literals are joined into one string)
    query = ('"rescue" OR "RESCUE" OR "HELP" OR "urgent rescue" '
             'OR "urgent help" OR "help needed" OR '
             '"#HurricaneSandy" OR "#Hurricanesandyhelp" '
             f'-filter:retweets near:"{city}" within:10mi')

    #Querying tweets sent between October 1st and November 30th, 2012
    tweet_list = query_tweets(query,
                              begindate = datetime.date(2012,10,1),
                              enddate = datetime.date(2012,11,30),
                              poolsize = 10)

    #Extract features of tweets to populate the data frame
    #(note: the 'id' column is filled from tweet.user_id)
    rows = [{'id': tweet.user_id,
             'text': tweet.text,
             'timestamp': tweet.timestamp,
             'user': tweet.username,
             'location': city} for tweet in tweet_list]
    df_tweets = pd.DataFrame(rows, columns=['id','text','timestamp','user','location'])

    df_tweets.to_csv(f'../data/tweets/{city}_tweets.csv', index=False)

#reading in and combining all the tweet dataframes
def read_and_combine_df(cities_list):

    #list to hold one data frame per city
    city_dfs = []

    #looping through each city
    for city in cities_list:
        print(city)
        city_dfs.append(pd.read_csv(f"../data/tweets/{city}_tweets.csv"))

    #combining data frames
    return pd.concat(city_dfs, ignore_index=True)

#Combining all data frames into a single one
df_1 = read_and_combine_df(cities_list)

#Saving data frame as CSV file
df_1.to_csv(path_or_buf='../data/df_1.csv', index=False)
```
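
The query string from step 2 can also be assembled by a small helper, which makes it easier to change the keywords or the radius. This is a sketch only: `build_query` and its parameters are hypothetical names that do not appear in the original notebook, and the search operators are the ones already used in the query above.

```python
def build_query(city, keywords, radius_mi=10):
    """Build a Twitter search string for tweets near a city that match any keyword."""
    # Quote each keyword and join them with OR
    keyword_part = " OR ".join(f'"{kw}"' for kw in keywords)
    return f'{keyword_part} -filter:retweets near:"{city}" within:{radius_mi}mi'

print(build_query("Boston", ["rescue", "urgent help", "#HurricaneSandy"]))
```

The resulting string could be passed to `query_tweets` in place of the hand-written query.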

Using TwitterScraper, we were able to obtain tweets within a 10-mile radius of the following cities: 
- Boston
- Philadelphia
- Providence
- Washington DC 
- Buffalo
- Toronto
- Montreal
- Richmond
- Long Beach

### Combined tweets

Currently, we have two data frames, one from the Twitter API and one from TwitterScraper. These two data frames are going to be combined in the following steps.

In [14]:
#Reading in Twitter API file (SandyAid tweets)
df = pd.read_csv('../data/tweets/SandyAid_tweets.csv')

In order to combine these two data frames, we have to rename one column and add the missing columns in `SandyAid_tweets.csv`.

In [15]:
#Rename column
df.rename(columns={'created_at': 'timestamp'}, inplace=True)

In [16]:
#Adding location for New York City tweets
df['location'] = "New York City"

In [17]:
#Adding missing column
df['user'] = ""

In [18]:
#Saving it to another CSV file
df.to_csv('../data/df_2.csv', index=False)

In [19]:
#Combining both data frames
df_1 = pd.read_csv('../data/df_1.csv')
df_2 = pd.read_csv('../data/df_2.csv')
df = pd.concat([df_1, df_2])

In [20]:
#Saving combined data frames
df.to_csv('../data/df_new.csv', index=False)
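
The column alignment performed in the cells above can be checked on a small scale. The sketch below uses two toy data frames with made-up rows (not project data) to show that, once the columns are renamed and filled in, `pd.concat` stacks the two sources under one shared set of columns. The names `scraper_df`, `api_df`, and `combined` are illustrative only.

```python
import pandas as pd

# Toy stand-in for the TwitterScraper output (made-up row)
scraper_df = pd.DataFrame({'id': [1],
                           'text': ['need help'],
                           'timestamp': ['2012-10-30 01:00:00'],
                           'user': ['user_a'],
                           'location': ['Boston']})

# Toy stand-in for the Twitter API output (made-up row)
api_df = pd.DataFrame({'id': [2],
                       'created_at': ['2012-10-30 02:00:00'],
                       'text': ['volunteers needed']})

# Same three steps as above: rename, add location, add empty user column
api_df = api_df.rename(columns={'created_at': 'timestamp'})
api_df['location'] = 'New York City'
api_df['user'] = ''

# Columns are matched by name, so their order in api_df does not matter
combined = pd.concat([scraper_df, api_df], ignore_index=True)
print(combined.columns.tolist())
print(combined.shape)
```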