# Data Collection

## Speech Scraper
Data scientists often need to find creative ways to obtain data relevant to an analysis. Web scraping is a common way to collect data from websites.

Here, we are going to obtain the Secretary of State's public speeches from 2014 through the present. These speeches are available online, but there are over 200 of them, so we will build a quick scraper to collect them.

First, let's import a few key packages:

1. requests: this allows us to make requests to webpages
2. BeautifulSoup: this is a handy tool for parsing websites
3. pandas: this allows us to manipulate tabular data

In [6]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

To start, our scraper needs to navigate to [https://www.state.gov/secretary/remarks/2018/](https://www.state.gov/secretary/remarks/2018/), which links to each of the Secretary's remarks for 2018.

In [7]:
def get_soup():
    url = "https://www.state.gov/secretary/remarks/2018/"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    return soup

Next, we need to extract the links (_to just the remarks_) from this page.

In [8]:
def get_links(soup):
    links = []
    content = soup.find('div', {'class': 'l-wrap'})
    # find_all is the current name for findAll; href=True skips anchors without a link
    for a in content.find_all('a', href=True):
        links.append(a['href'])
    print(str(len(links)) + " speeches were found")
    return links
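
`get_links` collects every anchor inside the `l-wrap` container, which may include navigation links as well as remarks. One way to narrow the list is to filter on the URL path. The sketch below assumes that remark pages live under `/secretary/remarks/` and end in `.htm`; that pattern, the helper name `filter_remark_links`, and the sample hrefs are illustrative assumptions that have not been checked against the live site.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "https://www.state.gov"

def filter_remark_links(hrefs, prefix="/secretary/remarks/", suffix=".htm"):
    """Keep unique paths that match the assumed remark-page pattern."""
    seen = set()
    kept = []
    for href in hrefs:
        # Normalise relative and absolute hrefs to a bare path
        path = urlparse(urljoin(BASE_URL, href)).path
        if path.startswith(prefix) and path.endswith(suffix) and path not in seen:
            seen.add(path)
            kept.append(path)
    return kept

# Hypothetical hrefs, made up for illustration
sample_hrefs = [
    "/secretary/remarks/2018/12/288274.htm",
    "/secretary/remarks/2018/12/288274.htm",   # duplicate
    "/secretary/remarks/2018/",                # index page, not a speech
    "https://www.state.gov/secretary/remarks/2018/12/288265.htm",
    "/contact/",
]
remark_links = filter_remark_links(sample_hrefs)
print(remark_links)
```

Because the helper returns paths, its output can be passed to `get_remarks` in the same way as the original list.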

Now, we can take any of the links we found and extract the following:

* Title
* Date
* Speech 
* URL

In [9]:
def get_remarks(link):
    url = "https://www.state.gov" + link
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    
    title = soup.find('h2', {'id': 'page-title'}).text
    date = soup.find('div', {'id': 'date_long'}).text
    content = soup.find('div', {'id': 'centerblock'})
    speech = ''
    for p in content.find_all('p'):
        speech += p.text + '\n'
    
    speech_object = {'url': url,
                     'title': title,
                     'date': date,
                     'speech': speech}
    return speech_object
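
One caveat: `soup.find` returns `None` when nothing matches, so `.text` raises an `AttributeError` on any page whose layout differs from what we expect. A small guard keeps the scraper running. This is a minimal sketch; `text_or_default` is a hypothetical helper name, and `SimpleNamespace` stands in for a parsed element so the example runs without a network call.

```python
from types import SimpleNamespace

def text_or_default(node, default=""):
    """Return the stripped text of a parsed element, or `default` if it is None."""
    if node is None:
        return default
    return node.text.strip()

# Stand-ins for the results of soup.find(...): one match and one miss
found = SimpleNamespace(text="  On the Occasion of Christmas \n")
missing = None

title = text_or_default(found)
date = text_or_default(missing, default="unknown")
print(title, "|", date)
```

Inside `get_remarks`, the same idea would look like `title = text_or_default(soup.find('h2', {'id': 'page-title'}))`.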

Finally, we should just run our scraper!

In [10]:
speeches = []
base = get_soup()
links = get_links(base)
for count, link in enumerate(links, start=1):
    speech = get_remarks(link)
    speeches.append(speech)
    print("{0}: {1}".format(count, speech['title']))

398 speeches were found
1: On the Occasion of Christmas
2: Interview With Steve Inskeep of NPR
3: Interview With Brian Grimmett of KMUW Wichita Public Radio
4: Interview With Bryce Dolan of 550 KFRM-AM
5: Interview With Laura Ingraham of The Laura Ingraham Show
6: Interview With Steve and Ted in the Morning of KNSS Radio


KeyboardInterrupt: 
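
With several hundred pages to fetch, it helps to pause between requests and to keep going when a single page fails. The sketch below shows one way to do that. `scrape_all` and `fake_fetch` are hypothetical names; in the notebook, `get_remarks` would be passed as `fetch`. The one-second default delay is an arbitrary choice, not a documented requirement of the site.

```python
import time

def scrape_all(links, fetch, delay=1.0):
    """Call fetch(link) for each link, pausing between requests and recording failures."""
    results, failures = [], []
    for link in links:
        try:
            results.append(fetch(link))
        except Exception as exc:
            # Record the failure and move on; Ctrl-C (KeyboardInterrupt) still stops the loop
            failures.append((link, repr(exc)))
        time.sleep(delay)
    return results, failures

# Stand-in for get_remarks so the sketch runs without network access
def fake_fetch(link):
    if link == "/broken":
        raise ValueError("page layout not recognised")
    return {"url": link, "title": "Title for " + link}

results, failures = scrape_all(["/a", "/broken", "/b"], fake_fetch, delay=0)
print(len(results), "scraped;", len(failures), "failed")
```

The failed links are returned rather than discarded, so they can be retried or inspected afterwards.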

Now we can make a dataframe (a table) from the speeches:

In [60]:
df = pd.DataFrame.from_records(speeches)

Let's keep only the speeches that are longer than 500 characters. Anything shorter is likely to be junk data.

In [63]:
df = df[df.speech.str.len() > 500]

Now we can see that 384 of the speeches we collected meet this criterion.

In [65]:
df.shape

(384, 4)

Finally, we can save these speeches as a .csv file for future use!

In [66]:
df.to_csv('datasets/SecState_Speeches.csv', index=False)

## Streaming data from an API
Let's connect to an API. We'll use the Twitter API for our example. First, we need to load a couple of Twitter-related Python libraries.

In [2]:
import twitter
import tweepy

Next, I need to load in my Twitter API credentials.

In [3]:
credentials = {}
with open('/Users/brandon/Google Drive/Python/Twitter Credentials/twitter_credentials.txt','r') as f:
    for line in f:
        cred_item = line.split(':')
        credentials[cred_item[0]] = cred_item[1].strip()
        
auth = tweepy.OAuthHandler(credentials['Consumer_Key'], credentials['Consumer_Secret'])

auth.set_access_token(credentials['Access_token'], credentials['Access_secret'])
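
The loop above assumes that every line of the credentials file has the form `Key: value`. A blank line would raise an `IndexError`, and a value containing a colon would be cut short. The sketch below is a slightly more forgiving parser. `parse_credentials` is a hypothetical helper, and the file contents are placeholders rather than real keys.

```python
import io

def parse_credentials(lines):
    """Parse 'Key: value' lines into a dict, skipping blank or malformed lines."""
    credentials = {}
    for line in lines:
        line = line.strip()
        if not line or ':' not in line:
            continue
        # Split on the first colon only, so values may contain ':'
        key, value = line.split(':', 1)
        credentials[key.strip()] = value.strip()
    return credentials

# Placeholder file contents, using the key names from the cell above
sample_file = io.StringIO(
    "Consumer_Key: abc123\n"
    "Consumer_Secret: def:456\n"
    "\n"
    "Access_token: 111-xyz\n"
    "Access_secret: s3cr3t\n"
)
creds = parse_credentials(sample_file)
print(sorted(creds))
```

An open file object can be passed in place of the `StringIO` buffer, since both yield one line at a time.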

Now, I instantiate a connection to the Twitter API.

In [4]:
api = tweepy.API(auth)

Finally, I can perform a search:

In [5]:
query = '#venezuela'
search_results = []

# Note: Tweepy 4.0 renamed api.search to api.search_tweets
for status in tweepy.Cursor(api.search, q=query).items(100):
    # status._json holds the raw JSON payload of the tweet as a dict
    tweet = status._json
    search_results.append(tweet)
    print("User: {}".format(tweet['user']['screen_name']))
    print("Tweet: {}".format(tweet['text']))    
    print('\n')

User: Threwlys1
Tweet: RT @RenovaMidia: O presidente interino da #Venezuela, Juan Guaidó, convocou dois novos protestos contra a ditadura de Nicolás Maduro.

http…


User: Patinahat2
Tweet: RT @OffGuardian0: Before you buy into the mainstream narrative on #Venezuela, you have to ask yourself:

"Why would the #US &amp; #UK govts, st…


User: Marte_Ven1
Tweet: Ninguno de esos vividores se quiere devolver a pasar roncha en #Venezuela. Aprovechen cuerda de chulos, cuando haya… https://t.co/mlxyixn5d8


User: Nimayee1
Tweet: RT @Mojahedineng: "... it is obvious that the Iranian regime’s main concern is about its own future."
#Iran
#Venezuela
https://t.co/QrOww3f…


User: oscar_a_f
Tweet: RT @micheldoueihi: -¡Buenos días!
#29Ene #FelizMartes # PDVSAEsDelPueblo Pedro Carreño Día del Trabajador Social #primerapagina #Venezuela…


User: pamavi66
Tweet: RT @bitMomentum: Trending ahora en Izquierda/Centro Izqda.:
➀ #psoe ↑ 
➁ #espana ↑ 
➂ #doshermanas ↑ 
➃ #venezuela ↓ 
➄ #felizmartes ↑ 
➅ #…


User: Giokica
Tweet: RT @sahouraxo: You just can’t make this stuff up. 

While brutally suppressing the protests of tens of thousands of #YellowVests at home, M…


User: MaxCoutinhoDS
Tweet: No que toca à #Venezuela, a União Europeia, mais uma vez, mostrou a sua natureza cobarde e resolveu dar o benefício… https://t.co/9n4WaXFZxn


User: Aliciaperbar
Tweet: RT @HimiobSantome: #29Ene #URGENTE Juez de #Yaracuy denuncia que la obligaron a mantener presos a 11 adolescentes  de la entidad. Es una un…


User: VladimArRamones
Tweet: RT @HimiobSantome: #29Ene #URGENTE Juez de #Yaracuy denuncia que la obligaron a mantener presos a 11 adolescentes  de la entidad. Es una un…


User: garlakat
Tweet: RT @garlakat: @diofantoo @MonicaAparicioA @CorreodeBsAs @OBarreraJ @ngotranslations @moo2n @Aledemar1M @cermenho85 @ZMimiba @PianoNocturno…


User: andyronzoni
Tweet: RT @cristiancrespoj: Cónsul de #Venezuela en Miami desconoce al usurpador @NicolasMaduro y acepta como único líder de la nación al Presi
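
Many of the results above are retweets. In the standard (v1.1) payload, a retweet carries a `retweeted_status` key, and its text begins with `RT @`. The sketch below flattens each tweet into a small record and flags retweets using both signals. `summarize_tweet` is a hypothetical helper, and the sample dicts are trimmed-down stand-ins for the items in `search_results`.

```python
def summarize_tweet(tweet):
    """Reduce a raw tweet dict to the fields we care about."""
    text = tweet["text"]
    return {
        "user": tweet["user"]["screen_name"],
        "text": text,
        # Either signal marks a retweet; the text prefix is only a fallback heuristic
        "is_retweet": "retweeted_status" in tweet or text.startswith("RT @"),
    }

# Trimmed-down stand-ins for items in search_results
sample_results = [
    {"user": {"screen_name": "Threwlys1"},
     "text": "RT @RenovaMidia: O presidente interino da #Venezuela..."},
    {"user": {"screen_name": "Marte_Ven1"},
     "text": "Ninguno de esos vividores se quiere devolver a pasar roncha en #Venezuela."},
]

summaries = [summarize_tweet(t) for t in sample_results]
originals = [s for s in summaries if not s["is_retweet"]]
print(len(originals), "original tweet(s) out of", len(summaries))
```

A list of flat records like `summaries` can also be handed to `pd.DataFrame.from_records`, as we did with the speeches.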

### Accessing Tweets by Location

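We can also look up the trending topics for the location closest to a given latitude and longitude. Here we use the coordinates of the U.S. Embassy in Caracas (10.478073, -66.871375).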

In [101]:
embassy = api.trends_closest('10.478073', '-66.871375')
print(embassy)

[{'country': 'Venezuela', 'parentid': 23424982, 'woeid': 395269, 'url': 'http://where.yahooapis.com/v1/place/395269', 'name': 'Caracas', 'placeType': {'name': 'Town', 'code': 7}, 'countryCode': 'VE'}]


In [103]:
geo_trends = api.trends_place(embassy[0]['woeid'])

for i in geo_trends[0]['trends']:
    print(i['name'])

#GuaidoChallenge
#GuaidoPatasCortas
#GuaidoLosTieneLocos
#GuaidóSeVacilóADiosdado
#apostilla
Jorge Rodríguez
Hotel Lido
Elliott Abrams
Valentina
diosdado y bernal
Plaza Alfredo Sadel
JULIANTINA VS HOMOFOBIA
Banco de Inglaterra
Cabello y Bernal
Asamblea Popular
"Pablito"
totalán
Diosdi
Guaidó y Bernal
Málaga
Consulado de Venezuela
Mr. Trump
Roberto Marrero
Héctor Manrique
Brumadinho
Pablo Casado
tesoro de eeuu
Franklin Virguez
Es Guaidó
TransMiranda
Alicia Machado
Metro de Oeste a Este
Plaza Bolívar de Chacao
RCTV
Juan Guaidó
Estados Unidos
Houston
#Julen
#concluGUAIDÓMADURO
#26Ene
#TodosSomosGuaido
#JorgeRodriguezCabezaeguebo
#GuaidoSeLosVacilo
#UnidadYLealtadConNicolas
#VzlaConLaANPorLaLibertad
#GuaidoEsMiPresidenteOK
#VenezuelaYElMundoConMaduro
#winstonmascaguebo
#Petare
#elmundoconvzla


### Checking our usage limits

In [104]:
api_status = api.rate_limit_status()

In [105]:
api_status

{'rate_limit_context': {'access_token': '188015457-qiXHARsYFRFecKwHWNAhZMusgoPQZarcITY0aNus'},
 'resources': {'account': {'/account/login_verification_enrollment': {'limit': 15,
    'remaining': 15,
    'reset': 1548476542},
   '/account/personalization/sync_optout_settings&POST': {'limit': 200,
    'remaining': 200,
    'reset': 1548476542},
   '/account/settings': {'limit': 15, 'remaining': 15, 'reset': 1548476542},
   '/account/update_profile': {'limit': 15,
    'remaining': 15,
    'reset': 1548476542},
   '/account/verify_credentials': {'limit': 75,
    'remaining': 75,
    'reset': 1548476542}},
  'account_activity': {'/account_activity/all/:instance_name/subscriptions': {'limit': 500,
    'remaining': 500,
    'reset': 1548476542},
   '/account_activity/all/:instance_name/webhooks': {'limit': 15,
    'remaining': 15,
    'reset': 1548476542},
   '/account_activity/all/webhooks': {'limit': 15,
    'remaining': 15,
    'reset': 1548476542},
   '/account_activity/direct_messages/:i
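
The full status dictionary is long, so it can help to list only the endpoints we have actually drawn on and to convert the `reset` timestamp (seconds since the Unix epoch) into a readable time. The sketch below walks the nested structure shown above. `used_endpoints` is a hypothetical helper, and the `/search/tweets` entry is a made-up example of a partially used endpoint.

```python
from datetime import datetime, timezone

def used_endpoints(status):
    """List (endpoint, remaining, limit, reset_time_utc) where some quota has been used."""
    rows = []
    for family in status["resources"].values():
        for endpoint, info in family.items():
            if info["remaining"] < info["limit"]:
                reset_at = datetime.fromtimestamp(info["reset"], tz=timezone.utc)
                rows.append((endpoint, info["remaining"], info["limit"], reset_at))
    return rows

# Same shape as api_status; the search entry is hypothetical
sample_status = {
    "resources": {
        "account": {
            "/account/settings": {"limit": 15, "remaining": 15, "reset": 1548476542},
        },
        "search": {
            "/search/tweets": {"limit": 180, "remaining": 79, "reset": 1548476542},
        },
    }
}

rows = used_endpoints(sample_status)
for endpoint, remaining, limit, reset_at in rows:
    print("{}: {}/{} left, resets at {}".format(endpoint, remaining, limit, reset_at.isoformat()))
```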

# NOAA Weather API

In [12]:
import requests

In [9]:
with open('/Users/brandon/Google Drive/Python/noaa_credentials.txt') as f:
    token = f.readline().strip()

In [13]:
locations = 'https://www.ncdc.noaa.gov/cdo-web/api/v2/locations'

In [14]:
response = requests.get(locations, headers={'token': token})

In [15]:
locations_json = response.json()

In [16]:
sample = locations_json['results'][2]

In [17]:
sample

{'datacoverage': 0.9991,
 'id': 'CITY:AE000003',
 'maxdate': '2019-01-28',
 'mindate': '1944-03-01',
 'name': 'Dubai, AE'}

In [18]:
gsom = 'https://www.ncdc.noaa.gov/cdo-web/api/v2/data?datasetid=GSOM&locationid={}&startdate=2016-01-01&enddate=2018-01-01'.format(sample['id'])
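
As an alternative to formatting the URL by hand, the query string can be assembled from a dictionary, which makes it easier to add paging parameters later. This is a sketch under assumptions: `build_data_url` and `page_offsets` are hypothetical helpers, `limit` and `offset` are taken to be the API's paging parameters, and offsets are assumed to start at 1.

```python
from urllib.parse import urlencode

DATA_ENDPOINT = "https://www.ncdc.noaa.gov/cdo-web/api/v2/data"

def build_data_url(locationid, offset=1, limit=25):
    """Build a GSOM data query for one location and one page of results."""
    params = {
        "datasetid": "GSOM",
        "locationid": locationid,
        "startdate": "2016-01-01",
        "enddate": "2018-01-01",
        "limit": limit,
        "offset": offset,
    }
    # urlencode percent-escapes characters such as ':' in the location id
    return DATA_ENDPOINT + "?" + urlencode(params)

def page_offsets(count, limit=25):
    """Starting offsets needed to cover `count` results, assuming 1-based offsets."""
    return list(range(1, count + 1, limit))

url = build_data_url("CITY:AE000003")
offsets = page_offsets(count=60, limit=25)
print(url)
print(offsets)
```

`requests.get` also accepts a `params` dictionary directly, which avoids building the string at all.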

In [19]:
response = requests.get(gsom, headers={'token': token})

In [20]:
gsom_json = response.json()

In [21]:
gsom_json

{'metadata': {'resultset': {'count': 10, 'limit': 25, 'offset': 1}},
 'results': [{'attributes': '5,S',
   'datatype': 'DP01',
   'date': '2017-04-01T00:00:00',
   'station': 'GHCND:AEM00041194',
   'value': 0},
  {'attributes': '5,S',
   'datatype': 'DP10',
   'date': '2017-04-01T00:00:00',
   'station': 'GHCND:AEM00041194',
   'value': 0},
  {'attributes': '5,S',
   'datatype': 'DP1X',
   'date': '2017-04-01T00:00:00',
   'station': 'GHCND:AEM00041194',
   'value': 0},
  {'attributes': '5,,S,30,+',
   'datatype': 'EMXP',
   'date': '2017-04-01T00:00:00',
   'station': 'GHCND:AEM00041194',
   'value': 0},
  {'attributes': '5,,,S',
   'datatype': 'PRCP',
   'date': '2017-04-01T00:00:00',
   'station': 'GHCND:AEM00041194',
   'value': 0},
  {'attributes': '4,S',
   'datatype': 'DX32',
   'date': '2017-07-01T00:00:00',
   'station': 'GHCND:AE000041196',
   'value': 0},
  {'attributes': '4,S',
   'datatype': 'DX70',
   'date': '2017-07-01T00:00:00',
   'station': 'GHCND:AE000041196',
   '
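
Each item in `results` is a single measurement: one datatype, at one station, for one month. For analysis it is often handier to have one row per station and date, with the datatypes as columns. The sketch below regroups the records with plain dictionaries. `pivot_by_date` is a hypothetical helper, and the sample rows are copied from the output above with the `attributes` field dropped.

```python
from collections import defaultdict

def pivot_by_date(results):
    """Group measurements into {(station, date): {datatype: value}}."""
    table = defaultdict(dict)
    for row in results:
        day = row["date"][:10]  # keep YYYY-MM-DD, drop the time part
        table[(row["station"], day)][row["datatype"]] = row["value"]
    return dict(table)

# A few rows copied from gsom_json['results'], without 'attributes'
sample_results = [
    {"datatype": "DP01", "date": "2017-04-01T00:00:00", "station": "GHCND:AEM00041194", "value": 0},
    {"datatype": "DP10", "date": "2017-04-01T00:00:00", "station": "GHCND:AEM00041194", "value": 0},
    {"datatype": "PRCP", "date": "2017-04-01T00:00:00", "station": "GHCND:AEM00041194", "value": 0},
    {"datatype": "DX32", "date": "2017-07-01T00:00:00", "station": "GHCND:AE000041196", "value": 0},
]

table = pivot_by_date(sample_results)
for key, values in table.items():
    print(key, values)
```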

# CoreNLP Service

In [1]:
from pycorenlp import StanfordCoreNLP
import json

In [2]:
text = 'Brandon is running for president of the U.S. in 2020.'

In [3]:
nlp = StanfordCoreNLP('http://52.20.193.246:9000')
res = nlp.annotate(text,
                   properties={
                       'annotators': 'ner',
                       'outputFormat': 'json',
                       'timeout': 1000
                   })

In [4]:
for token in res['sentences'][0]['tokens']:
    if token['ner'] in ['PERSON','LOCATION','DATE']:
        print('{0}: {1}'.format(token['ner'], token['word']))

PERSON: Brandon
LOCATION: U.S.
DATE: 2020
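
The loop above prints one token at a time, so a multi-word entity such as a full name would appear as several separate lines. One simple way to join them is to merge consecutive tokens that share the same tag. This is a minimal sketch: `group_entities` is a hypothetical helper, and the token list is hand-written in the shape of `res['sentences'][0]['tokens']` rather than real CoreNLP output.

```python
from itertools import groupby

def group_entities(tokens, keep=("PERSON", "LOCATION", "DATE")):
    """Merge runs of adjacent tokens with the same NER tag into (tag, phrase) pairs."""
    entities = []
    for label, run in groupby(tokens, key=lambda t: t["ner"]):
        if label in keep:
            entities.append((label, " ".join(t["word"] for t in run)))
    return entities

# Hand-written tokens for illustration; the tags are assigned by hand
sample_tokens = [
    {"word": "Elliott", "ner": "PERSON"},
    {"word": "Abrams", "ner": "PERSON"},
    {"word": "visited", "ner": "O"},
    {"word": "Caracas", "ner": "LOCATION"},
    {"word": "in", "ner": "O"},
    {"word": "January", "ner": "DATE"},
    {"word": "2019", "ner": "DATE"},
]

entities = group_entities(sample_tokens)
for label, phrase in entities:
    print("{0}: {1}".format(label, phrase))
```

Note that this approach also merges two different entities of the same type when they sit directly next to each other, so it is a heuristic rather than a full entity-mention extractor.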
