# Lede Summer 2019 Project - Part 7c
## Scrape article data for all athletes - restrict API searches to stories from the sports desk AND containing Olympic or Paralympic as keywords

* Use NYT API
* Make a df with the article results (headline, lede, url), number of hits, and athlete name
* Join the new df with the main df with all medal info, athlete's name, game_type, etc.

* I used keywords 'medal', 'olympic' or 'paralympic', and athlete name to narrow the search.

``` body:"{athlete}" AND body:"medal" AND body:"{game_type}" ```
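
A minimal sketch of how that query template could be filled in and URL-encoded for one athlete. The helper name `build_query` and the placeholder key `DEMO_KEY` are assumptions for illustration, not part of the notebook; `urlencode` from the standard library escapes the quotes and spaces that the raw f-string URLs leave as-is.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.nytimes.com/svc/search/v2/articlesearch.json?'

def build_query(athlete, game_type):
    """Fill in the search template for one athlete (hypothetical helper)."""
    return f'body:"{athlete}" AND body:"medal" AND body:"{game_type}"'

query = build_query('Sarah Will', 'paralympic')

# 'DEMO_KEY' is a placeholder, not a real API key
url = BASE_URL + urlencode({'q': query, 'api-key': 'DEMO_KEY'})
print(url)
```

With `requests`, the same effect is available by passing the dictionary as `params=` to `requests.get`, which encodes it in the same way.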

In [1]:
import requests
import pandas as pd
import re
import numpy as np
import os

import itertools

from bs4 import BeautifulSoup
from dotenv import load_dotenv
load_dotenv()

import time

pd.set_option('display.max_rows', None)

In [2]:
# !touch .env

In [3]:
SECRET_KEY = os.getenv("NYT_API_KEY")

## Take a look at the API response for one athlete (Sarah Will)

In [4]:
query = 'Sarah Will'
game_type = 'paralympic'
base_url = 'https://api.nytimes.com/svc/search/v2/articlesearch.json?'

url = f'{base_url}q=body:"Sarah Will" AND body:"medalist" AND body:"olympic" OR body:"paralympic"&api-key={SECRET_KEY}'

print(url)
print('---------')

response = requests.get(url)
doc = response.json()

articles = []
article = {}

hits_count = doc['response']['meta']['hits']
print(hits_count)

results = doc['response']['docs']
for result in results:
    article = {}
    article['headline'] = result['headline']['main']
    article['lede'] = result['lead_paragraph']
    article['url'] = result['web_url']
    articles.append(article)

articles

https://api.nytimes.com/svc/search/v2/articlesearch.json?q=body:"Sarah Will" AND body:"medalist" AND body:"olympic" OR body:"paralympic"&api-key=<REDACTED>
---------
5


[{'headline': 'As New Olympians Celebrate Victory, Families Contemplate Safety',
  'lede': 'OMAHA — The CenturyLink Center, where the United States Olympic swimming trials are being held, might as well be a Tupperware container given how little of the outside world seeps through its doors. But headlines from Rio de Janeiro about violence, economic unrest and the threat of the Zika virus have nonetheless penetrated the hermetically sealed atmosphere and threatened to contaminate the celebrations of newly minted Olympians.',
  'url': 'https://www.nytimes.com/2016/07/03/sports/olympics/as-new-olympians-celebrate-victory-families-contemplate-safety.html'},
 {'headline': 'Highlights: The Winter Olympics Opening Ceremony',
  'lede': 'The New York Times covered the Winter Olympics opening ceremony from inside of Fisht Olympic Stadium in Sochi — live as it happened, not on tape delay.',
  'url': 'https://sports.blogs.nytimes.com/2014/02/07/live-coverage-2014-winter-olympics-opening-ceremony/'}
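
The query above mixes `AND` and `OR` without parentheses, so how the terms are grouped is left to the search engine's parser. That may be why the results include articles that do not appear to be about Sarah Will. A minimal sketch of an explicitly grouped query string follows, assuming the API honors Lucene-style parentheses; `build_grouped_query` is a hypothetical helper, not part of the notebook.

```python
def build_grouped_query(athlete, keyword='medalist',
                        game_types=('olympic', 'paralympic')):
    """Require the athlete AND the keyword AND at least one game type."""
    games = ' OR '.join(f'body:"{g}"' for g in game_types)
    return f'body:"{athlete}" AND body:"{keyword}" AND ({games})'

grouped = build_grouped_query('Sarah Will')
print(grouped)
```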

## Do the same for Michael Phelps

In [5]:
query = 'Michael PHELPS'
base_url = 'https://api.nytimes.com/svc/search/v2/articlesearch.json?'

url = f'{base_url}q=body:"{query}" AND body:"medal"&api-key={SECRET_KEY}'

print(url)
print('---------')

response = requests.get(url)
doc = response.json()

articles = []
article = {}

hits_count = doc['response']['meta']['hits']
print(hits_count)

results = doc['response']['docs']
for result in results:
    article = {}
    article['headline'] = result['headline']['main']
    article['lede'] = result['lead_paragraph']
    article['url'] = result['web_url']
    articles.append(article)

articles
for article in articles:
    print(article['url'])

https://api.nytimes.com/svc/search/v2/articlesearch.json?q=body:"Michael PHELPS" AND body:"medal"&api-key=<REDACTED>
---------
1000
https://www.nytimes.com/2017/09/23/insider/michael-phelps-grant-hackett-friendship.html
https://www.nytimes.com/2017/09/21/sports/michael-phelps-grant-hackett-tiger-woods.html
https://www.nytimes.com/aponline/2019/06/30/us/ap-history.html
https://www.nytimes.com/2019/04/13/sports/tiger-woods-masters-augusta.html
https://www.nytimes.com/2018/07/27/sports/missy-franklin-swimming.html
https://www.nytimes.com/2018/08/20/style/olympics-reality-show.html
https://www.nytimes.com/video/sports/olympics/100000004582506/ryan-held-on-michael-phelps-and-4x100-win.html
https://www.nytimes.com/2017/08/11/your-money/senior-athletes-staying-in-shape.html
https://www.nytimes.com/2019/01/25/sports/gracie-gold-figure-skating-.html
https://www.nytimes.com/2017/02/01/sports/michael-phelps-enjoys-victory-lap-with-jordan-spieth-at-pro-am.html


## Import the data of athletes with their country codes and coordinates

In [6]:
df = pd.read_csv('athletes_with_coord.csv')

In [7]:
df.head(1)

Unnamed: 0,alternate_name,citizenship,event,first_name,full_name,game_type,gender,last_name,medals_bronze,medals_gold,medals_silver,medals_total,other_info,season,years,code,country_name,latitude,longitude,NOC
0,,SWE,Para shooting,Jonas,Jonas JAKOBSSON,Paralympic,Men,JAKOBSSON,8,17,2,27,,Summer,1980-2012,SE,Sweden,60.128161,18.643501,SWE


In [8]:
df.shape

(145, 20)

## Make lists of athlete names so I can loop over them and use them as keywords in the API queries

In [9]:
athletes = df.full_name.to_list()

In [10]:
athletes_para = athletes[:80]
# athletes_para
athletes_oly = athletes[80:]
# athletes_oly
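
Slicing at index 80 assumes the first 80 rows of the CSV are the Paralympic athletes. A sketch of an alternative that splits on the `game_type` column instead, so the result does not depend on row order. The toy frame and its names are made up for illustration; only the column names and the `'Paralympic'` / `'Olympic'` values come from the data shown in this notebook.

```python
import pandas as pd

# Toy stand-in for athletes_with_coord.csv (rows are invented)
toy = pd.DataFrame({
    'full_name': ['Athlete A', 'Athlete B', 'Athlete C'],
    'game_type': ['Paralympic', 'Olympic', 'Paralympic'],
})

# Select names by game type rather than by position
para_names = toy.loc[toy.game_type == 'Paralympic', 'full_name'].to_list()
oly_names = toy.loc[toy.game_type == 'Olympic', 'full_name'].to_list()
print(para_names, oly_names)
```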

## Scrape the NYT API for the article headlines, url, lede paragraph, and number of hits for each athlete. Make a list called "errors", which contains the names of athletes for whom the API request didn't work

### Scrape just the Paralympic athletes

In [11]:
base_url = 'https://api.nytimes.com/svc/search/v2/articlesearch.json?'

rows = []
errors = []
for athlete in athletes_para:
    row = {'name': athlete}
    url = f'{base_url}q=body:"{athlete}" AND body:"medal"&api-key={SECRET_KEY}'
    response = requests.get(url)

    try:
        doc = response.json()
        hits_count = doc['response']['meta']['hits']
        row['hits'] = hits_count
        if hits_count == 0:
            row['article_results'] = 0
        else:
            # Search results for each athlete, which includes the first 10 hits (first 10 headlines)
            articles = []
            results = doc['response']['docs']
            for result in results:
                article = {}
                article['headline'] = result['headline']['main']
                article['lede'] = result['lead_paragraph']
                article['url'] = result['web_url']
                articles.append(article)
            row['article_results'] = articles
        rows.append(row)
    except (ValueError, KeyError, TypeError):
        # The body was not JSON or lacked the expected fields; record the athlete for a later retry
        print('------')
        print(athlete)
        print(response.text)
        errors.append(athlete)
        print('-------')

    # NYT API rate limit is 10 calls/minute, so pause after every request, including failed ones
    time.sleep(6)
rows

[{'name': 'Jonas JAKOBSSON', 'hits': 0, 'article_results': 0},
 {'name': 'Roberto MARSON', 'hits': 0, 'article_results': 0},
 {'name': 'Mike KENNY',
  'hits': 1,
  'article_results': [{'headline': "CADETS' DEDMOND TIES DASH MARK",
    'lede': 'WEST POINT, N.Y., Jan. 17. —Manhattan dominated the distance‐running events, but gave away too much in the field events today as the Army trackmcn registered a 58‐51 victory. The cadets won seven of the 13 events while scoring 37 of a possible 45 points in the five field events. ',
    'url': 'https://www.nytimes.com/1970/01/18/archives/cadets-dedmond-ties-dash-mark-equals-meet-record-of-63army-wins-7.html'}]},
 {'name': 'Daniel DIAS', 'hits': 0, 'article_results': 0},
 {'name': 'Heinz FREI',
  'hits': 3,
  'article_results': [{'headline': 'Day 7: Second Gold for Pistorius; Iran Forfeits Before Potential Game vs. Israel',
    'lede': 'Oscar Pistorius, the South African “Blade Runner,” won his second gold medal of the Beijing Paralympics with a vi

In [12]:
len(rows)

80

### Scrape just the Olympic athletes

In [13]:
base_url = 'https://api.nytimes.com/svc/search/v2/articlesearch.json?'

for athlete in athletes_oly:
    row = {'name': athlete}
    url = f'{base_url}q=body:"{athlete}" AND body:"medal"&api-key={SECRET_KEY}'
    response = requests.get(url)

    try:
        doc = response.json()
        hits_count = doc['response']['meta']['hits']
        row['hits'] = hits_count
        if hits_count == 0:
            row['article_results'] = 0
        else:
            # Search results for each athlete, which includes the first 10 hits (first 10 headlines)
            articles = []
            results = doc['response']['docs']
            for result in results:
                article = {}
                article['headline'] = result['headline']['main']
                article['lede'] = result['lead_paragraph']
                article['url'] = result['web_url']
                articles.append(article)
            row['article_results'] = articles
        rows.append(row)
    except (ValueError, KeyError, TypeError):
        # The body was not JSON or lacked the expected fields; record the athlete for a later retry
        print('------')
        print(athlete)
        print(response.text)
        errors.append(athlete)
        print('-------')

    # NYT API rate limit is 10 calls/minute, so pause after every request, including failed ones
    time.sleep(6)
rows

[{'name': 'Jonas JAKOBSSON', 'hits': 0, 'article_results': 0},
 {'name': 'Roberto MARSON', 'hits': 0, 'article_results': 0},
 {'name': 'Mike KENNY',
  'hits': 1,
  'article_results': [{'headline': "CADETS' DEDMOND TIES DASH MARK",
    'lede': 'WEST POINT, N.Y., Jan. 17. —Manhattan dominated the distance‐running events, but gave away too much in the field events today as the Army trackmcn registered a 58‐51 victory. The cadets won seven of the 13 events while scoring 37 of a possible 45 points in the five field events. ',
    'url': 'https://www.nytimes.com/1970/01/18/archives/cadets-dedmond-ties-dash-mark-equals-meet-record-of-63army-wins-7.html'}]},
 {'name': 'Daniel DIAS', 'hits': 0, 'article_results': 0},
 {'name': 'Heinz FREI',
  'hits': 3,
  'article_results': [{'headline': 'Day 7: Second Gold for Pistorius; Iran Forfeits Before Potential Game vs. Israel',
    'lede': 'Oscar Pistorius, the South African “Blade Runner,” won his second gold medal of the Beijing Paralympics with a vi

In [14]:
len(rows)

145
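
The Paralympic and Olympic loops apply the same parsing steps to each response. A minimal sketch of how that shared logic could live in one function; `parse_search_response` is a hypothetical helper, and `sample_doc` is a made-up document that mimics only the fields the notebook reads.

```python
def parse_search_response(athlete, doc):
    """Build one row (name, hits, article_results) from an Article Search JSON document."""
    hits = doc['response']['meta']['hits']
    row = {'name': athlete, 'hits': hits}
    if hits == 0:
        row['article_results'] = 0
    else:
        # Keep only the headline, lede, and URL of each returned article
        row['article_results'] = [
            {
                'headline': result['headline']['main'],
                'lede': result['lead_paragraph'],
                'url': result['web_url'],
            }
            for result in doc['response']['docs']
        ]
    return row

# Invented response with a single article
sample_doc = {
    'response': {
        'meta': {'hits': 1},
        'docs': [{
            'headline': {'main': 'Example headline'},
            'lead_paragraph': 'Example lede.',
            'web_url': 'https://example.com/article',
        }],
    }
}

sample_row = parse_search_response('Athlete A', sample_doc)
empty_row = parse_search_response('Athlete B', {'response': {'meta': {'hits': 0}, 'docs': []}})
print(sample_row)
print(empty_row)
```

Each loop body would then reduce to one request, one call to the helper inside the `try`, and the `time.sleep(6)` pause.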

In [15]:
df_articles = pd.DataFrame(rows)
# df_articles

In [18]:
df_articles.article_results.isna().sum()

0

### Attempt the API request again for athletes where I ran into errors

In [16]:
errors

[]

In [None]:
# for error in errors:
#     print(df[df.full_name == error].game_type)

### If there is a single error:

In [None]:
# missing_rows = []
# athlete = errors[0]
# url = f'{base_url}q=body:"{athlete}" AND body:"medal"&api-key={SECRET_KEY}'
# response = requests.get(url)
# doc = response.json()

# row = {}
# row['name'] = athlete
# hits_count = doc['response']['meta']['hits']
# row['hits'] = hits_count
# if hits_count == 0:
#     row['article_results'] = 0
# else:
#     # Search results for each athlete, which includes the first 10 hits (first 10 headlines)
#     articles = []
#     results = doc['response']['docs']
#     for result in results:
#         article = {}
#         article['headline'] = result['headline']['main']
#         article['lede'] = result['lead_paragraph']
#         article['url'] = result['web_url']
#         articles.append(article)
#     row['article_results'] = articles

# missing_rows.append(row)
# missing_rows

### If multiple errors:

In [None]:
# missing_rows = []
# for error in errors:
#     row = {}
#     athlete = error
    
#     url = f'{base_url}q=body:"{athlete}" AND body:"medal"&api-key={SECRET_KEY}'
#     response = requests.get(url)
#     doc = response.json()
    
#     row = {}
#     row['name'] = error
#     hits_count = doc['response']['meta']['hits']
#     row['hits'] = hits_count
#     if hits_count == 0:
#         row['article_results'] = 0
#     else:
#         # Search results for each athlete, which includes the first 10 hits (first 10 headlines)
#         articles = []
#         results = doc['response']['docs']
#         for result in results:
#             article = {}
#             article['headline'] = result['headline']['main']
#             article['lede'] = result['lead_paragraph']
#             article['url'] = result['web_url']
#             articles.append(article)
#         row['article_results'] = articles

#     missing_rows.append(row)

In [None]:
# len(missing_rows)

In [None]:
# df_missing = pd.DataFrame(missing_rows)

In [None]:
# df_complete = pd.concat([df_articles, df_missing])  # DataFrame.append was removed in pandas 2.0
# df_complete.shape

In [17]:
# df_complete.article_results.isna().sum()

# Turn the table of article data into csv!

In [23]:
df_articles.shape

(145, 3)

In [19]:
# Rename df as df_complete, if appropriate

df_articles.to_csv('athletes_articles.csv', index=False)

## Merge the table of article data with the main df of athlete data, and save as csv

In [20]:
df_complete = pd.read_csv('athletes_articles.csv')

In [21]:
merged = df.merge(df_complete, left_on='full_name', right_on='name')
merged.shape

(149, 23)
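
The merge returns 149 rows from a 145-row df. That is consistent with two athletes each appearing twice in both tables: every duplicated name on the left matches every duplicated name on the right, so two rows become four. A sketch with toy data (names and values are invented) showing the effect and one way to guard against it with `drop_duplicates` and the `validate` argument of `merge`:

```python
import pandas as pd

# 'B' appears twice on both sides, like an athlete listed under two countries
left = pd.DataFrame({'full_name': ['A', 'B', 'B'],
                     'citizenship': ['SWE', 'KOR', 'RUS']})
right = pd.DataFrame({'name': ['A', 'B', 'B'],
                      'hits': [0, 5, 5]})

# Naive merge: the two 'B' rows on each side pair up into four rows
inflated = left.merge(right, left_on='full_name', right_on='name')
print(inflated.shape)

# Deduplicate the article table first; validate raises if the right keys are not unique
right_unique = right.drop_duplicates(subset='name')
fixed = left.merge(right_unique, left_on='full_name', right_on='name',
                   validate='many_to_one')
print(fixed.shape)
```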

In [22]:
merged.head(10)

Unnamed: 0,alternate_name,citizenship,event,first_name,full_name,game_type,gender,last_name,medals_bronze,medals_gold,...,season,years,code,country_name,latitude,longitude,NOC,article_results,hits,name
0,,SWE,Para shooting,Jonas,Jonas JAKOBSSON,Paralympic,Men,JAKOBSSON,8,17,...,Summer,1980-2012,SE,Sweden,60.128161,18.643501,SWE,0,0,Jonas JAKOBSSON
1,,ITA,Wheelchair fencing,Roberto,Roberto MARSON,Paralympic,Men,MARSON,3,16,...,Summer,1964-1976,IT,Italy,41.87194,12.56738,ITA,0,0,Roberto MARSON
2,,GBR,Para swimming,Mike,Mike KENNY,Paralympic,Men,KENNY,0,16,...,Summer,1976-1988,GB,United Kingdom,55.378051,-3.435973,GBR,"[{'headline': ""CADETS' DEDMOND TIES DASH MARK""...",1,Mike KENNY
3,,BRA,Para swimming,Daniel,Daniel DIAS,Paralympic,Men,DIAS,3,14,...,Summer,2008-2016,BR,Brazil,-14.235004,-51.92528,BRA,0,0,Daniel DIAS
4,,SUI,Para athletics,Heinz,Heinz FREI,Paralympic,Men,FREI,6,14,...,Summer,1984-2012,CH,Switzerland,46.818188,8.227512,SUI,[{'headline': 'Day 7: Second Gold for Pistoriu...,3,Heinz FREI
5,,SUI,Para athletics | handcycling,Franz,Franz NIETLISPACH,Paralympic,Men,NIETLISPACH,2,14,...,Summer,1980-2004,CH,Switzerland,46.818188,8.227512,SUI,0,0,Franz NIETLISPACH
6,,CAN,Para swimming,Michael,Michael EDGSON,Paralympic,Men,EDGSON,0,14,...,Summer,1984-1992,CA,Canada,56.130366,-106.346771,CAN,0,0,Michael EDGSON
7,,AUS,Para swimming,Matthew,Matthew COWDREY,Paralympic,Men,COWDREY,3,13,...,Summer,2004-2012,AU,Australia,-25.274398,133.775136,AUS,[{'headline': 'A Fifth Gold for Du Toit and a ...,4,Matthew COWDREY
8,,NOR,Para swimming,Erling,Erling TRONDSEN,Paralympic,Men,TRONDSEN,1,13,...,Summer,1976-1992,NO,Norway,60.472024,8.468946,NOR,0,0,Erling TRONDSEN
9,,USA,Para athletics,Bart,Bart DODSON,Paralympic,Men,DODSON,4,13,...,Summer,1984-2000,US,United States,37.09024,-95.712891,USA,0,0,Bart DODSON


## Drop the extraneous 'name' column

In [24]:
merged = merged.drop(columns='name')
merged.shape

(149, 22)

In [25]:
merged.to_csv('all_info.csv', index=False)

# Clean up all_info.csv and save it as all_info_cleaned.csv

### Manually rename events so that they are consistent (in text editor)
* Sometimes the Olympic PDF lists "short track speed skating" as "short track." Replace "short track" with "short track speed skating" in a text editor

``` short track, ```

replace with

``` short track speed skating, ```
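
The same rename could be done in pandas instead of a text editor. A minimal sketch on a toy Series, not the real `event` column: the negative lookahead keeps entries that already read "short track speed skating" from being expanded a second time.

```python
import pandas as pd

# Toy event values; the real column is df['event']
events = pd.Series(['short track', 'short track speed skating', 'Para swimming'])

# Replace "short track" only when it is not already followed by " speed skating"
cleaned = events.str.replace(
    r'short track(?! speed skating)',
    'short track speed skating',
    regex=True,
)
print(cleaned.to_list())
```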

### Manually get rid of the duplicate rows for Victor An, Margaret Harriman


### Manually re-edit some entries in the country_name column for all_info_cleaned.csv

I edited the citizenship (country code) in Part 6 so that every athlete's country code matches a current ISO code. The current ISO list has no codes for former countries such as the Soviet Union and Rhodesia.

Now that I have merged the dataframes of country codes, country names, athlete info, and country coordinates, I am editing the country_name of these athletes so that the proper country name displays in the final map.

I decided to keep FRG (West Germany) and GDR (East Germany) as just Germany. I also kept Katerina TEPLA as Czech Republic.

Change country_name to Soviet Union:
```
        Nikolay ANDRIANOV >> RUS
        Boris SHAKHLIN >> RUS
        Viktor CHUKARIN >> UKR
        Aleksandr DITYATIN >> RUS
        Larisa LATYNINA >> UKR
        Polina ASTAKHOVA >> UKR
        Galina KULAKOVA >> RUS
```
Other changes:
```        Margaret HARRIMAN (ZIM) to Rhodesia```
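
These manual edits could also be scripted. A sketch under stated assumptions: the toy frame below is a made-up excerpt, and its starting `country_name` values are guesses for illustration; the athlete names and target country names are the ones listed above.

```python
import pandas as pd

soviet_athletes = [
    'Nikolay ANDRIANOV', 'Boris SHAKHLIN', 'Viktor CHUKARIN',
    'Aleksandr DITYATIN', 'Larisa LATYNINA', 'Polina ASTAKHOVA',
    'Galina KULAKOVA',
]

# Toy excerpt standing in for all_info_cleaned.csv
toy = pd.DataFrame({
    'full_name': ['Larisa LATYNINA', 'Margaret HARRIMAN', 'Margaret HARRIMAN', 'Daniel DIAS'],
    'citizenship': ['UKR', 'ZIM', 'RSA', 'BRA'],
    'country_name': ['Ukraine', 'Zimbabwe', 'South Africa', 'Brazil'],
})

# Athletes who competed for the Soviet Union
toy.loc[toy.full_name.isin(soviet_athletes), 'country_name'] = 'Soviet Union'

# Only the ZIM row for Margaret HARRIMAN becomes Rhodesia
is_harriman_zim = (toy.full_name == 'Margaret HARRIMAN') & (toy.citizenship == 'ZIM')
toy.loc[is_harriman_zim, 'country_name'] = 'Rhodesia'

print(toy)
```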

In [26]:
df = pd.read_csv('all_info_cleaned.csv')
df.sort_values('hits', ascending=False)
df.shape

(145, 22)

## Make sure all athletes have coordinates

In [27]:
df.sort_values('latitude', na_position='first')

Unnamed: 0,alternate_name,citizenship,event,first_name,full_name,game_type,gender,last_name,medals_bronze,medals_gold,...,other_info,season,years,code,country_name,latitude,longitude,NOC,article_results,hits
39,,NZL,Para swimming,Sophie,Sophie PASCOE,Paralympic,Women,PASCOE,0,9,...,,Summer,2008-2016,NZ,New Zealand,-40.900557,174.885971,NZL,0,0
29,,RSA,Para swimming,Natalie,Natalie DU TOIT,Paralympic,Women,DU TOIT,0,13,...,,Summer,2004-2012,ZA,South Africa,-30.559482,22.937506,RSA,"[{'headline': 'In First for Olympics, Amputee ...",24
33,,RSA,Para archery | dartchery | lawn bowls,Margaret,Margaret HARRIMAN,Paralympic,Women,HARRIMAN,4,11,...,Also competed representing Rhodesia,Summer,1960-1996,ZA,South Africa,-30.559482,22.937506,RSA,[{'headline': 'U.S. PARAPLEGICS ADD FOUR MEDAL...,1
57,,AUS,Para alpine skiing | cycling,Michael,Michael MILTON,Paralympic,Men,MILTON,2,6,...,,Winter,1992-2006,AU,Australia,-25.274398,133.775136,AUS,0,0
7,,AUS,Para swimming,Matthew,Matthew COWDREY,Paralympic,Men,COWDREY,3,13,...,,Summer,2004-2012,AU,Australia,-25.274398,133.775136,AUS,[{'headline': 'A Fifth Gold for Du Toit and a ...,4
32,,ZIM,Para archery | dartchery | swimming (Summer Ol...,Margaret,Margaret HARRIMAN,Paralympic,Women,HARRIMAN,4,11,...,Also competed representing South Africa,Summer,1960-1996,ZW,Rhodesia,-19.015438,29.154857,ZIM,[{'headline': 'U.S. PARAPLEGICS ADD FOUR MEDAL...,1
3,,BRA,Para swimming,Daniel,Daniel DIAS,Paralympic,Men,DIAS,3,14,...,,Summer,2008-2016,BR,Brazil,-14.235004,-51.92528,BRA,0,0
24,,ISR,Para athletics | wheelchair basketball | swimm...,Zipora,Zipora RUBIN-ROSENBAUM,Paralympic,Women,RUBIN-ROSENBAUM,6,14,...,,Summer,1964-1988,IL,Israel,31.046051,34.851612,ISR,0,0
12,,ISR,Para swimming,Uri,Uri BERGMAN,Paralympic,Men,BERGMAN,1,12,...,,Summer,1976-1988,IL,Israel,31.046051,34.851612,ISR,0,0
105,Hyun-Soo Ahn,KOR,short track speed skating,AN,Victor AN,Olympic,Men,Victor,2,6,...,Also competed for Russia,Winter,2006-2014,KR,South Korea,35.907757,127.766922,KOR,[{'headline': 'Tiger Woods to Receive Presiden...,1769
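
Sorting with `na_position='first'` puts any missing latitudes at the top for a visual check. A sketch of a direct check that lists the athletes lacking either coordinate; the toy frame is invented, and on the real data an empty result would mean every athlete has coordinates.

```python
import pandas as pd

# Toy frame with one athlete missing coordinates
toy = pd.DataFrame({
    'full_name': ['Athlete A', 'Athlete B'],
    'latitude': [60.128161, None],
    'longitude': [18.643501, None],
})

# Rows where latitude or longitude is missing
missing = toy[toy[['latitude', 'longitude']].isna().any(axis=1)]
print(missing.full_name.to_list())
```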
