# Milestone 2 : Data Collection and Description
------
In this notebook we are going to review everything that was done so far in the project and evaluate what the remaining tasks are.

____
____


## 1.  Identifying Relevant Tweets
-----

### 1.1 Hashtags as Key Elements for Searching

On Twitter, hashtags are mainly used during events. In our case they are the perfect tool to evaluate awareness across the world: a hashtag is often specific to one event and tends to be in English even when the rest of the tweet is in a different language. To find all the tweets related to an event, we needed to find as many related hashtags as possible, in as many languages as possible.

### 1.2 Selection of Hashtags 

____
____

## 2.  Tweets Acquisition
We had originally planned to use the Twitter dataset provided in the course. Unfortunately, it contained only 10% of the tweets in a given time period and included neither the location of the user nor the user profile. Because of this, we decided to collect the tweets about specific events ourselves.

------
### 2.1 Twitter API 
Our initial idea was to get the information we needed with the Twitter API, but there again we encountered several problems : 

- The **Rate Limit** of the Twitter API : It would have taken a lot of time to get the tweets of a specific event, but we were ready to wait and to launch the code on several computers (or on clusters).
- The **Search Query** limitations : After writing code that would allow us to get the tweets by searching specific hashtags over a time interval, we discovered a huge limitation : tweets can only be searched with the API if they are *less than one week old*.

So we had to discard the idea of using the Twitter API.

------
### 2.2 Scraping the Tweets Manually
Fortunately, the Twitter HTML interface (the website) allows us to search for any query over any time interval, so we decided to get the data by scraping the website directly. For that we use **PhantomJS**, a browser without a user interface, and **Selenium**, a Python package that allows us to load URLs in this browser and scroll down the search page in order to load results. Once the results are loaded, we use **Beautiful Soup 4** with the **LXML** parser to extract every tweet on the page.

This was done using one script : [`tweet_acquisiton.py`](ADA2017_Homeworks/Project/TweetAcquisition/tweet_acquisition.py). 
For each event a new folder is created (for example here `Nigeria_1`). The logs of the tweet acquisition are saved in this folder under an obvious name (here `Nigeria_1.log`). Here is an example of the start of the log file :

-----
```text
------------------------------------------- ACQUISITION PARAMETERS -------------------------------------------
Started at : 2017-11-27 10:10:47.485905
Tweets saved in ./Nigeria_1/
Searching from 2016-01-29 to 2016-02-06
Hastags used : ['Dalori', 'Dalorilivesmatter', 'Nigeria', 'BokoHaram', 'Bokoharam', 'bokoharam', 'Borno', 'StopBokoHaram', 'PrayForNigeria']
------------------------------------------- STARTING ACQUISITION -------------------------------------------
1 - Tweets : 2772 - Total : 2772 - Date : 2016-02-05 07:39:06 - Elapsed Time : 810.799 s - Delay : 810.799 s - Rate : 3.419 tw/s - Executed at 2017-11-27 10:24:20.470199
     + First Tweet Time : 2016-02-05 22:11:24
     + Last Tweet Time : 2016-02-05 07:39:06
```

------
The query url is created using the list of hashtags specified inside the script. The explanations on how to use the scripts are in the [`README.md`](ADA2017_Homeworks/Project/TweetAcquisition/README.md) file.
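
As an illustration of how such a query URL can be assembled from a hashtag list and a date range, here is a minimal sketch. It is not the code of the script: the base URL, the function name `build_search_url`, and the use of `OR` together with the `since:` / `until:` search operators are assumptions made for this example.

```python
from datetime import date
from urllib.parse import urlencode

# Assumed address of the public search page; not taken from the script.
SEARCH_URL = "https://twitter.com/search"


def build_search_url(hashtags, since, until):
    """Build a search URL for tweets containing any of the hashtags in a date range."""
    # Join the hashtags with OR so that a tweet matching any of them is returned.
    terms = " OR ".join("#" + tag.lstrip("#") for tag in hashtags)
    # Bound the search in time with the since:/until: operators (assumed syntax).
    query = f"{terms} since:{since.isoformat()} until:{until.isoformat()}"
    # urlencode percent-encodes '#' and ':' and turns spaces into '+'.
    return SEARCH_URL + "?" + urlencode({"q": query})


url = build_search_url(["Dalori", "BokoHaram"], date(2016, 1, 29), date(2016, 2, 6))
print(url)
```

Keeping the hashtag list and the dates as parameters means the same function can be reused for every event folder.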


The tweets are acquired in segments : we scroll the page 500 times before parsing the HTML and saving a pickle containing the raw data. Each pickle contains about 7000 tweets on average. Here is an example of the structure of the acquired dataframe :


```python
import pickle

with open('TweetAcquisition/Nigeria_1/Tweets_1.pickle', 'rb') as f:
    df = pickle.load(f)
df.head(4)  # preview the first four tweets
```

| Unnamed: 0 | hashtags | id | language | text | time_stamp | user_id | user_name | date |
|---|---|---|---|---|---|---|---|---|
| 0 | [@FitzMP, #Biafrans, #Nigeria, #TyrantBuhari] | 695731474243440640 | en | @FitzMP,We #Biafrans have died enough, we don’... | 1454710284 | 354778701 | EmekaGift | 2016-02-05 22:11:24 |
| 1 | [#Nigeria, http://bit.ly/1SR2k89 ] | 695758765627281408 | es | A más de dos años de que comenzó la crisis, ¿q... | 1454716790 | 57683930 | MSF_Mexico | 2016-02-05 23:59:50 |
| 2 | [#PrayForNigeria] | 695758763517730816 | en | I wish I was a little kid again, where all I h... | 1454716790 | 518819812 | allthingselliej | 2016-02-05 23:59:50 |
| 3 | [#Nigeria, http://bit.ly/1odNc9y , #VOA] | 695758537289367552 | en | #Nigeria E-readers Help Thousands in Africa Le... | 1454716736 | 2468196914 | Vincecob | 2016-02-05 23:58:56 |
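
In the sample above, the `date` column is consistent with `time_stamp` read as a Unix timestamp in seconds and expressed in UTC. The short sketch below shows that conversion; the helper name `timestamp_to_utc` is introduced here for illustration only and is not part of the acquisition script.

```python
from datetime import datetime, timezone


def timestamp_to_utc(time_stamp):
    """Convert a Unix timestamp in seconds to a 'YYYY-MM-DD HH:MM:SS' string in UTC."""
    moment = datetime.fromtimestamp(time_stamp, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


# Values taken from the first two rows of the sample.
print(timestamp_to_utc(1454710284))  # 2016-02-05 22:11:24
print(timestamp_to_utc(1454716790))  # 2016-02-05 23:59:50
```

Passing `tz=timezone.utc` explicitly avoids depending on the time zone of the machine that runs the notebook.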


We have scraped as much information as possible from the HTML page of the search query, but we still miss the most important thing : the location of the tweet.
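
To summarize the acquisition loop of this section (scroll a fixed number of times, parse, save one pickle per segment), here is a rough sketch. It is not the actual script: `scroll` and `parse` are hypothetical callables standing in for the Selenium and Beautiful Soup steps, `max_segments` is an illustrative parameter, and `parse` is assumed to return only the tweets of the current segment.

```python
import pickle
from pathlib import Path


def acquire_in_segments(scroll, parse, folder, scrolls_per_segment=500, max_segments=3):
    """Alternate between scrolling and parsing, saving one pickle per segment.

    `scroll()` (hypothetical) scrolls the page once and returns False when no
    more results load; `parse()` (hypothetical) returns the tweets of the
    current segment.
    """
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    saved = []
    for segment in range(1, max_segments + 1):
        more = True
        for _ in range(scrolls_per_segment):
            more = scroll()
            if not more:
                break  # nothing left to load, stop scrolling early
        tweets = parse()
        if tweets:
            # Same naming pattern as the pickles shown in this notebook.
            path = folder / f"Tweets_{segment}.pickle"
            with open(path, "wb") as f:
                pickle.dump(tweets, f)
            saved.append(path)
        if not more:
            break  # the search results are exhausted
    return saved
```

Saving after every segment means that a crash in a long acquisition only loses the segment in progress.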

------
### 2.3 Scraping the Location of the Tweets
From each tweet we take the `user_name` field and go to the user's profile to get the location that the user has written there.
The script that does this is : [`location_acquisiton.py`](ADA2017_Homeworks/Project/TweetAcquisition/location_acquisiton.py). As we don't need to scroll down the page, we use the **requests** Python package directly, combined with **Beautiful Soup 4** and **LXML**. As the code is very slow, we launch several instances of the process in parallel in order to locate the tweets at the same rate as they are acquired.
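
The parallel lookup can be sketched as follows. Note that this is a different technique from the one described above: instead of launching the process several times, the sketch uses a thread pool inside a single process, which suits network-bound work. `fetch_location` is a hypothetical callable standing in for the request and parsing of one profile page, and `locate_users` is a name chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def locate_users(user_names, fetch_location, max_workers=8):
    """Return {user_name: location}, looking up each distinct user only once."""
    # Several tweets can come from the same user: de-duplicate, keeping order.
    unique_names = list(dict.fromkeys(user_names))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() runs fetch_location concurrently and yields results in input order.
        locations = pool.map(fetch_location, unique_names)
        return dict(zip(unique_names, locations))
```

The resulting dictionary can then be mapped onto the `user_name` column to fill a `location` column.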

Below we display the head of the *Located* version of the pickled dataframe.



```python
with open('TweetAcquisition/Nigeria_1/Located_Tweets_1.pickle', 'rb') as f:
    df = pickle.load(f)
df.head(4)  # preview the first four located tweets
```

| Unnamed: 0 | hashtags | id | language | text | time_stamp | user_id | user_name | date | location |
|---|---|---|---|---|---|---|---|---|---|
| 0 | [@FitzMP, #Biafrans, #Nigeria, #TyrantBuhari] | 695731474243440640 | en | @FitzMP,We #Biafrans have died enough, we don’... | 1454710284 | 354778701 | EmekaGift | 2016-02-05 22:11:24 | [www.radiobiafra.co] |
| 1 | [#Nigeria, http://bit.ly/1SR2k89 ] | 695758765627281408 | es | A más de dos años de que comenzó la crisis, ¿q... | 1454716790 | 57683930 | MSF_Mexico | 2016-02-05 23:59:50 | [Ciudad, de, Mexico] |
| 2 | [#PrayForNigeria] | 695758763517730816 | en | I wish I was a little kid again, where all I h... | 1454716790 | 518819812 | allthingselliej | 2016-02-05 23:59:50 | [SomewhereOnlyWeKnow] |
| 3 | [#Nigeria, http://bit.ly/1odNc9y , #VOA] | 695758537289367552 | en | #Nigeria E-readers Help Thousands in Africa Le... | 1454716736 | 2468196914 | Vincecob | 2016-02-05 23:58:56 | [Brussels,, Belgium] |


Now we have the raw location information for each event. We need to geocode it to the associated country. 
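
To give an idea of what this step involves, here is a deliberately naive sketch that works on the token lists shown in the `location` column. It is not the geocoding method of the project: `COUNTRY_NAMES` is a hypothetical and tiny lookup table, and `join_location` and `guess_country` are names introduced for this example.

```python
# Hypothetical, deliberately tiny lookup table (country name -> ISO 3166-1 alpha-2 code).
COUNTRY_NAMES = {"belgium": "BE", "mexico": "MX", "nigeria": "NG"}


def join_location(tokens):
    """Rebuild the free-text location from its whitespace-split tokens."""
    return " ".join(tokens)


def guess_country(tokens):
    """Return a country code if a known country name appears as a token, else None."""
    # Scan from the end, assuming the country is often written last.
    for token in reversed(tokens):
        name = token.strip(",.").lower()
        if name in COUNTRY_NAMES:
            return COUNTRY_NAMES[name]
    return None


print(join_location(["Brussels,", "Belgium"]))   # Brussels, Belgium
print(guess_country(["Brussels,", "Belgium"]))   # BE
print(guess_country(["SomewhereOnlyWeKnow"]))    # None
```

Entries such as a website address or a song title return `None`, which shows why self-reported locations can only be geocoded partially.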

------

## 3.  Geocoding the Tweets

------
### 3.1 



------
## 4.  Enriching the Data
------
### 4.1 



------
## 5. Data Visualization


------
## 6. Critical Assessment

- Twitter is biased by nature.
- Ideally we should have scraped data from several social media platforms.
- Locations will never be perfect : the location information is written by the users themselves and is not objective.
- We could have used the Google API, for example, but it is limited in the number of queries, and it won't be perfect either, because Google Maps usually uses contextual information to find the location you are looking for.


------
## 7. What's next ? 
