<a href="https://colab.research.google.com/github/SaturdaysAI-LATAM/Extraccion/blob/master/RecolectorTweets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tweet Collector

This notebook collects tweets using the Twitter API.

Its purpose is to gather data for the Saturdays AI LATAM edition project (Equipo Verde).


## Import libraries

First we import the libraries we will need.

- Tweepy: Wrapper for interacting with the Twitter API.
- Pandas: For building DataFrames from the data.
- Tqdm: Progress bar for tracking how far a loop has advanced.


In [1]:
#!pip install tweepy

In [12]:
import os
from datetime import datetime, timedelta
import tweepy as tw
import pandas as pd
from tqdm import tqdm, notebook

In [2]:
pd.set_option('display.max_colwidth', None)

## Twitter API Authentication

To search and retrieve tweets through the Twitter API, we create an `OAuthHandler` object with our personal credentials, which can be obtained from:
[https://developer.twitter.com/apps](https://developer.twitter.com/apps).

In [3]:
# Placeholders: replace with the consumer API key and secret of your Twitter developer app
consumer_api_key = "YOUR_CONSUMER_API_KEY"
consumer_api_secret = "YOUR_CONSUMER_API_SECRET"

In [4]:
auth = tw.OAuthHandler(consumer_api_key, consumer_api_secret)

In [5]:
# This "api" object gives authenticated access to the Twitter API
api = tw.API(auth, wait_on_rate_limit=True)
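
Hardcoding credentials in a notebook makes them easy to leak when it is shared. As a minimal sketch, the key and secret could instead be read from environment variables. The variable names `TWITTER_API_KEY` and `TWITTER_API_SECRET` and the helper `load_credentials` are assumptions made for illustration; they are not part of the original notebook or of Tweepy.

```python
import os

# Hypothetical variable names; use whatever names your environment defines.
KEY_VAR = "TWITTER_API_KEY"
SECRET_VAR = "TWITTER_API_SECRET"


def load_credentials(env=os.environ):
    """Return (key, secret) read from `env`, or raise if either is missing."""
    missing = [name for name in (KEY_VAR, SECRET_VAR) if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env[KEY_VAR], env[SECRET_VAR]


# Demonstration with a plain dict standing in for os.environ.
demo_env = {KEY_VAR: "demo-key", SECRET_VAR: "demo-secret"}
consumer_api_key, consumer_api_secret = load_credentials(demo_env)
```

In the notebook itself, calling `load_credentials()` with no argument would read from the real environment, and the returned pair could be passed to `tw.OAuthHandler` as before.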

## Accessing the API

### Search definitions


Before using the API, we should know what information we are looking for. We then create a `Cursor` object, which holds the filters and operators for the search and pages through the results returned by the Twitter API.

To search by city, we create a dictionary in which the city name is the key and the value is a string combining the `coordinates` and the `radius`. The value for the selected city is passed to the `Cursor` as the `geocode` argument.

In [6]:
places = {
    'New_York': "40.6643,-73.9385,10mi",
    'San_Francisco' : "37.781157,-122.398720,10mi",
    'Oklahoma': "35.4671,-97.5137,10mi",
    'London': "51.5072,-0.1275,10mi",
    'Vancouver': "49.3023,-123.107,10mi",
    'Houston': "29.7805,-95.3863,10mi",
    'Los_Angeles': "34.0194,-118.411,10mi",
    'Sidney': "-33.86785,151.20732,10mi",
    'Wellington': "-41.28664,174.77557,10mi",
    'Dublin':"53.333060,-6.248890,10mi"
}
city = "San_Francisco"
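
Each value in the dictionary above follows the pattern `latitude,longitude,radius`, with the radius given in `mi` or `km`. A small helper can build and sanity-check these strings rather than typing them by hand. This is only a sketch: `make_geocode` and `places_demo` are hypothetical names introduced here, not part of the original notebook.

```python
def make_geocode(lat, lon, radius, unit="mi"):
    """Format a 'latitude,longitude,radius' string such as '40.6643,-73.9385,10mi'."""
    if not -90 <= lat <= 90:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180 <= lon <= 180:
        raise ValueError(f"longitude out of range: {lon}")
    if radius <= 0:
        raise ValueError(f"radius must be positive: {radius}")
    if unit not in ("mi", "km"):
        raise ValueError(f"unit must be 'mi' or 'km': {unit}")
    return f"{lat},{lon},{radius}{unit}"


# Rebuild two of the entries from the dictionary above.
places_demo = {
    "New_York": make_geocode(40.6643, -73.9385, 10),
    "London": make_geocode(51.5072, -0.1275, 10),
}
```

The validation catches swapped or mistyped coordinates before a request is sent, which is cheaper than discovering an empty result set afterwards.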

The search is performed with the following filters:

- Retweets excluded
- Must include any variation of the word "Covid"
- Start date: in principle any date since the beginning of the pandemic, although the standard search API only returns tweets from roughly the last 7 days
- Published within a 10 mi radius of the city
- English only
- Number of tweets to collect (by default Twitter limits this to 7500)

**NOTE:** Retrieval time grows with the number of filters (parameters) used.

In [17]:
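# Query: English only, no retweets, any common spelling or hashtag for Covid
search_words = ("lang:en -filter:retweets covid OR COVID OR coronavirus OR #coronavirus OR #covid19 OR covid19 OR #covid-19 OR cov2 OR #COVID OR #covid")
# Start date in YYYY-MM-DD format, ten days before today
ten_days_ago = datetime.today() - timedelta(days=10)
date_since = ten_days_ago.strftime("%Y-%m-%d")
place = places[city]  # "lat,lon,radius" string for the selected city
total_items = 10      # maximum number of tweets to retrieve


# Cursor to retrieve tweets; tweet_mode='extended' is recommended to prevent the API from truncating the text.
# api.search is the Tweepy 3.x name; Tweepy 4.x renamed it to api.search_tweets.
tweets = tw.Cursor(api.search,
              q=search_words,
              geocode=place,
              lang="en",
              tweet_mode='extended',
              since=date_since).items(total_items)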
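
The query string and start date above can also be assembled from their parts, which makes it easier to change the term list or the look-back window. The following is a sketch only: `build_query` and `days_ago` are hypothetical helpers introduced for illustration, and the term list is shortened.

```python
from datetime import date, timedelta


def build_query(terms, lang="en", exclude_retweets=True):
    """Join search terms with OR and prepend the language and retweet filters."""
    parts = [f"lang:{lang}"]
    if exclude_retweets:
        parts.append("-filter:retweets")
    parts.append(" OR ".join(terms))
    return " ".join(parts)


def days_ago(n, today=None):
    """Return the date `n` days before `today` as a YYYY-MM-DD string."""
    today = today or date.today()
    return (today - timedelta(days=n)).isoformat()


covid_terms = ["covid", "coronavirus", "#covid19"]
search_words = build_query(covid_terms)
# A fixed reference date keeps this demonstration reproducible.
date_since = days_ago(10, today=date(2020, 12, 23))
```

Here `search_words` has the same shape as the handwritten query, and `date_since` replaces the string slicing with an explicit ISO date.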

## Store tweets in memory

In [15]:
tweets_copy = []
for tweet in tqdm(tweets):
     tweets_copy.append(tweet)

0it [00:00, ?it/s]


## Dataset construction

We store only the data relevant to our modeling objective.

In [10]:
# Collect one dict per tweet and build the DataFrame once at the end.
# (DataFrame.append was removed in pandas 2.0 and is slow inside a loop.)
rows = []
for tweet in tqdm(tweets_copy):
    rows.append({'user_name': tweet.user.name,
                 'user_location': tweet.user.location,
                 'user_verified': tweet.user.verified,
                 'date': tweet.created_at,
                 'text': tweet.full_text,
                 'is_retweet': tweet.retweeted})
tweets_df = pd.DataFrame(rows)

100%|██████████| 6500/6500 [00:39<00:00, 164.33it/s]
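
The field selection can be tried out without calling the API by using a stand-in object with the same attributes. This sketch assumes only the attribute names used in the cell above; `tweet_to_row` and `fake_tweet` are hypothetical names introduced here for illustration.

```python
from datetime import datetime
from types import SimpleNamespace


def tweet_to_row(tweet):
    """Pick the fields kept for modeling from a tweet-like object."""
    return {
        "user_name": tweet.user.name,
        "user_location": tweet.user.location,
        "user_verified": tweet.user.verified,
        "date": tweet.created_at,
        "text": tweet.full_text,
        "is_retweet": tweet.retweeted,
    }


# Stand-in for a tweet status object; the values are invented for the demo.
fake_tweet = SimpleNamespace(
    user=SimpleNamespace(name="Example User", location="San Francisco, CA", verified=False),
    created_at=datetime(2020, 12, 23, 14, 48, 48),
    full_text="Example text about covid",
    retweeted=False,
)

rows = [tweet_to_row(t) for t in [fake_tweet]]
```

Passing a list of such dicts to `pd.DataFrame(rows)` yields one row per tweet with these six columns.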


In [11]:
tweets_df=tweets_df.reset_index(drop=True)

In [12]:
tweets_df.head()

| | user_name | user_location | user_verified | date | text | is_retweet |
|---|---|---|---|---|---|---|
| 0 | San Francisco Chronicle | San Francisco, CA | True | 2020-12-23 14:48:48 | Dr. Deborah Birx, the White House’s coronavirus response coordinator, told the media outlet Newsy that she will retire after President-elect Joe Biden and his administration take office. https://t.co/UOfqrgv07D | False |
| 1 | Dan Earhart | San Francisco | False | 2020-12-23 14:47:15 | Chief nursing officer at California hospital: 'It's a disaster right now for our staff'\nhttps://t.co/WRz4AnbtzG | False |
| 2 | Henry Harteveldt | San Francisco | False | 2020-12-23 14:46:52 | Very cool to see @emilyKPIX report this AM on Bay Area #startup @xwinginc using its autonomous-piloted #aircraft to help deliver Covid vaccine to the Navajo Nation. #aviation #innovation | False |
| 3 | Matt Greenberg | Oakland, CA | False | 2020-12-23 14:45:27 | For most of covid my wife has done nighttime baby duties and I woke up with the first kid. I always have her coffee ready when she gets out of bed. We recently switched jobs and I’ll say, gifted coffee when you wake up tastes way better ☕️😴 | False |
| 4 | Jed Kolko | San Francisco, CA | True | 2020-12-23 14:44:37 | Huge Q post-covid (in 2021 ojala!) is whether this shift from services to goods persists, reverts, or overshoots. \n\nHave we permanently in-sourced more meals, exercise, and entertainment? \n\nOr will pent-up demand for going out and travel be unleashed? \n\nhttps://t.co/jviJXyX6qz https://t.co/5IbIMI3NLh | False |


### CSV Creation

Once all the information has been collected, we save the tweets for the selected city to disk.

In [13]:
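today = datetime.today().date()  # `date` was never imported; derive it from datetime
os.makedirs("Datos1_Kike", exist_ok=True)  # make sure the output folder exists
tweets_df.to_csv("Datos1_Kike/covid19_tweets_"+city+".csv", index=False)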
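
When collecting several cities, or the same city on different days, it can help to build the output path in one place. The sketch below follows the file naming used above and optionally appends a date so that repeated runs do not overwrite each other. The helper `city_csv_path` and its `day` parameter are assumptions introduced for illustration.

```python
import tempfile
from datetime import date
from pathlib import Path


def city_csv_path(city, out_dir, day=None):
    """Return the CSV path for `city` inside `out_dir`, creating the folder if needed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    suffix = f"_{day.isoformat()}" if day is not None else ""
    return out_dir / f"covid19_tweets_{city}{suffix}.csv"


# Demonstration inside a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    plain = city_csv_path("San_Francisco", Path(tmp) / "Datos1_Kike")
    dated = city_csv_path("San_Francisco", Path(tmp) / "Datos1_Kike", day=date(2020, 12, 23))
    plain_name = plain.name
    dated_name = dated.name
    folder_created = plain.parent.is_dir()
```

The returned path can be passed directly to `tweets_df.to_csv(path, index=False)`.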