# Lede Summer 2019 Project - Part 8
## Scrape article data for all athletes - use BeautifulSoup because the API searches were not giving me exact matches (i.e. "Sarah Will" gave me anyone named "Sarah" that was mentioned in the article)

* Use BeautifulSoup
* Make a df with html, number of hits, and athlete name
* Join the new df with the main df with all medal info, athlete's name, game_type, etc

In [17]:
import requests
import pandas as pd
import re
import numpy as np
import os

from bs4 import BeautifulSoup
from dotenv import load_dotenv
load_dotenv()

import time

pd.set_option('display.max_rows', None)

## Scrape article data for athletes with common names

Athlete Sarah Will has more hits than she actually does because her name is very common, and NYT API is not very precise -- results for people named Sarah (but not Sarah Will) appear in the results.

The NYT search engine yields more accurate results for Sarah Will, so I am manually scraping the search engine result for Sarah Will using BeautifulSoup.

### Import the dataframe with all the athlete/article info so far

In [18]:
df = pd.read_csv('all_info_cleaned.csv')
df

Unnamed: 0,alternate_name,citizenship,event,first_name,full_name,game_type,gender,last_name,medals_bronze,medals_gold,...,other_info,season,years,code,country_name,latitude,longitude,NOC,article_results,hits
0,,SWE,Para shooting,Jonas,Jonas JAKOBSSON,Paralympic,Men,JAKOBSSON,8,17,...,,Summer,1980-2012,SE,Sweden,60.128161,18.643501,SWE,0,0
1,,ITA,Wheelchair fencing,Roberto,Roberto MARSON,Paralympic,Men,MARSON,3,16,...,,Summer,1964-1976,IT,Italy,41.87194,12.56738,ITA,0,0
2,,GBR,Para swimming,Mike,Mike KENNY,Paralympic,Men,KENNY,0,16,...,,Summer,1976-1988,GB,United Kingdom,55.378051,-3.435973,GBR,"[{'headline': ""CADETS' DEDMOND TIES DASH MARK""...",1
3,,BRA,Para swimming,Daniel,Daniel DIAS,Paralympic,Men,DIAS,3,14,...,,Summer,2008-2016,BR,Brazil,-14.235004,-51.92528,BRA,0,0
4,,SUI,Para athletics,Heinz,Heinz FREI,Paralympic,Men,FREI,6,14,...,,Summer,1984-2012,CH,Switzerland,46.818188,8.227512,SUI,[{'headline': 'Day 7: Second Gold for Pistoriu...,3
5,,SUI,Para athletics | handcycling,Franz,Franz NIETLISPACH,Paralympic,Men,NIETLISPACH,2,14,...,,Summer,1980-2004,CH,Switzerland,46.818188,8.227512,SUI,0,0
6,,CAN,Para swimming,Michael,Michael EDGSON,Paralympic,Men,EDGSON,0,14,...,,Summer,1984-1992,CA,Canada,56.130366,-106.346771,CAN,0,0
7,,AUS,Para swimming,Matthew,Matthew COWDREY,Paralympic,Men,COWDREY,3,13,...,,Summer,2004-2012,AU,Australia,-25.274398,133.775136,AUS,[{'headline': 'A Fifth Gold for Du Toit and a ...,4
8,,NOR,Para swimming,Erling,Erling TRONDSEN,Paralympic,Men,TRONDSEN,1,13,...,,Summer,1976-1992,NO,Norway,60.472024,8.468946,NOR,0,0
9,,USA,Para athletics,Bart,Bart DODSON,Paralympic,Men,DODSON,4,13,...,,Summer,1984-2000,US,United States,37.09024,-95.712891,USA,0,0


In [None]:
# print(df.article_results[4])

### Scrape the NYT search engine using BeautifulSoup for 'Sarah Will'

In [5]:
url = f'https://www.nytimes.com/search?query="sarah will" paralympic medal sports'
response = requests.get(url)
doc = BeautifulSoup(response.text)
results = doc.find_all(class_='css-1kl114x')
base_url = 'https://www.nytimes.com/'
rows = []

for result in results:
    row = {}
    headline = result.h4.text.strip()
    row['headline'] = headline
    author = result.find(class_='css-15w69y9').text.strip()
    row['author'] = author
    try:
        lede = result.find(class_='css-1dwgixl').text.strip()
        row['lede'] = lede
    except:
        lede = ''
        row['lede'] = lede
    endpoint = results[0].a['href']
    url = base_url + endpoint
    row['url'] = url
    rows.append(row)
rows

[{'headline': 'BOLDFACE NAMES',
  'author': 'By James Barron With Glenn Collins',
  'lede': '',
  'url': 'https://www.nytimes.com//2002/03/21/nyregion/boldface-names-165565.html?searchResultPosition=1'}]

## Do the same for Victor An/Hyun Soo Ahn. No NYT results come up with Victor An, but a few come up with Hyun Soo ahn

In [21]:
url = f'https://www.nytimes.com/search?query="Hyun-Soo Ahn"'
response = requests.get(url)
doc = BeautifulSoup(response.text)
results = doc.find_all(class_='css-1kl114x')
base_url = 'https://www.nytimes.com/'
rows = []

results

for result in results:
    row = {}
    headline = result.h4.text.strip()
    row['headline'] = headline
    author = result.find(class_='css-1lppelv').text.strip()
    row['author'] = author
    try:
        lede = result.find(class_='css-1lppelv').text.strip()
        row['lede'] = lede
    except:
        lede = ''
        row['lede'] = lede
    endpoint = results[0].a['href']
    url = base_url + endpoint
    row['url'] = url
    rows.append(row)
rows

[{'headline': 'Sports Briefing',
  'author': 'Sports Briefing',
  'lede': 'Sports Briefing',
  'url': 'https://www.nytimes.com//2004/02/15/sports/sports-briefing.html?searchResultPosition=1'},
 {'headline': 'In the arena: Last cup of bicerin before we head to 2010 Winter Games in Vancouver',
  'author': 'In the arena: Last cup of bicerin before we head to 2010 Winter Games in Vancouver',
  'lede': 'In the arena: Last cup of bicerin before we head to 2010 Winter Games in Vancouver',
  'url': 'https://www.nytimes.com//2004/02/15/sports/sports-briefing.html?searchResultPosition=1'},
 {'headline': 'Olympics: A toast to Torino, with a last bicerin',
  'author': 'Olympics: A toast to Torino, with a last bicerin',
  'lede': 'Olympics: A toast to Torino, with a last bicerin',
  'url': 'https://www.nytimes.com//2004/02/15/sports/sports-briefing.html?searchResultPosition=1'}]

### Manually replace the article data for Sarah Will and Victor An in text editor in the file all_info_cleaned.csv 

* Edit the article_info (list of dictionary)
* Edit the hits (number of articles)

Replace 

```
"[{'headline': 'Paralympians See a Big Welcome in a Small Title Change', 'lede': 'Jessica Long, a Paralympic swimmer, remembered feeling invisible at a large gathering with the news media in Chicago ahead of the 2008 Beijing Summer Games. She sat  in a corner with two other American Paralympians, watching reporters  interview their Olympic counterparts without paying any attention to the three of them.', 'url': 'https://www.nytimes.com/2019/06/29/sports/olympics/usoc-paralympians-.html'}, {'headline': 'Rob Matthews, 56, Blind Paralympian Who Won 8 Gold Medals, Dies', 'lede': 'Rob Matthews, a blind runner who won eight gold medals for Britain at the Paralympic Games and broke 22 world records, died on April 11 at a hospice in Auckland, New Zealand, where he had lived for the past decade. He was 56.', 'url': 'https://www.nytimes.com/2018/04/19/obituaries/rob-matthews-56-dies-blind-paralympian-won-8-gold-medals.html'}, {'headline': 'Rio Olympics Today: U.S. Swimmers Restock Their Trophy Case', 'lede': 'It was quite a night for the American swim team. Katie Ledecky, the most dominant female swimmer in Rio, demolished her own record in the 400-meter freestyle. Ledecky finished the 400 with a time of 3 minutes 56.46 seconds.', 'url': 'https://www.nytimes.com/2016/08/07/sports/olympics/schedule-rio-summer-games-results-watch.html'}, {'headline': 'Swimmer Is Fighting a Ruling: She Is Not Disabled Enough', 'lede': 'EXETER, N.H. — Racked by sudden spasms in her shoulders, back and hands — the things she most relies upon to offset her paralyzed legs — the American swimmer Victoria Arlen failed to qualify for the final in the 100-meter breaststroke at the Paralympics last summer. But she persevered in the freestyle, going on to become one of the competition’s breakout stars. When Arlen returned home to New Hampshire with four medals and a world record, Exeter threw her a parade. ', 'url': 'https://www.nytimes.com/2013/09/27/sports/swimmer-is-fighting-a-ruling-she-is-not-disabled-enough.html'}, {'headline': 'At Paralympics, First Thing Judged Is Disability', 'lede': 'LONDON — Anthony Dawson, who has cerebral palsy and little muscle function on his right side, rode for South Africa in the first round of the equestrian dressage competition at the Paralympics on Thursday, guiding his horse through an intricately choreographed series of movements. ', 'url': 'https://www.nytimes.com/2012/09/01/sports/at-paralympics-first-thing-judged-is-disability.html'}, {'headline': 'For Boston, a New Beginning After a Safe Ending to Its Marathon', 'lede': 'BOSTON — When the digital clocks along Boylston Street flashed 2:49 on Marathon Monday, nothing out of the ordinary happened. And that was reason for joyous celebration.', 'url': 'https://www.nytimes.com/2014/04/22/us/boston-marathon.html'}, {'headline': 'Highlights: The Winter Olympics Opening Ceremony', 'lede': 'The New York Times covered the Winter Olympics opening ceremony from inside of Fisht Olympic Stadium in Sochi — live as it happened, not on tape delay.', 'url': 'https://sports.blogs.nytimes.com/2014/02/07/live-coverage-2014-winter-olympics-opening-ceremony/'}, {'headline': 'Paralympians’ Equipment Raises Debate on Fairness', 'lede': 'LONDON — On the outskirts of the athletes’ village, across from the canteen where the competitors refuel and relax between events, there is a pile of dismembered feet, with a leg or two sticking out.', 'url': 'https://www.nytimes.com/2012/09/09/sports/equipment-used-by-disabled-athletes-fuels-debate-on-fairness.html'}, {'headline': 'Events in New Jersey for June 29-July 5, 2014', 'lede': 'A guide to cultural and recreational events in New Jersey. Items for the calendar should be sent at least three weeks in advance to njtowns@nytimes.com.', 'url': 'https://www.nytimes.com/2014/06/29/nyregion/events-in-new-jersey-for-june-29-july-5-2014.html'}, {'headline': 'A Deeper Look at Faster, Higher, Stronger', 'lede': 'GREENWICH, Conn.', 'url': 'https://www.nytimes.com/2012/08/05/sports/olympics/the-olympic-games-studies-art-and-athletics-at-the-bruce-museum-in-greenwich.html'}]"
```

with    

```
"[{'headline': 'BOLDFACE NAMES','author': 'By James Barron With Glenn Collins','lede': '','url': 'https://www.nytimes.com//2002/03/21/nyregion/boldface-names-165565.html?searchResultPosition=1'}]"
```

The new row is now:

``` ,USA,Para alpine skiing,Sarah,Sarah WILL,Paralympic,Women,WILL,0,12,1,13,,Winter,1992-2002,US,United States,37.09024,-95.712891,USA,"[{'headline': 'BOLDFACE NAMES','author': 'By James Barron With Glenn Collins','lede': '','url': 'https://www.nytimes.com//2002/03/21/nyregion/boldface-names-165565.html?searchResultPosition=1'}]",1
 ```

## For Victor An:

The article info should be:

```

[{'headline': 'Sports Briefing',
  'author': 'Sports Briefing',
  'lede': 'Sports Briefing',
  'url': 'https://www.nytimes.com//2004/02/15/sports/sports-briefing.html?searchResultPosition=1'},
 {'headline': 'In the arena: Last cup of bicerin before we head to 2010 Winter Games in Vancouver',
  'author': 'In the arena: Last cup of bicerin before we head to 2010 Winter Games in Vancouver',
  'lede': 'In the arena: Last cup of bicerin before we head to 2010 Winter Games in Vancouver',
  'url': 'https://www.nytimes.com//2004/02/15/sports/sports-briefing.html?searchResultPosition=1'},
 {'headline': 'Olympics: A toast to Torino, with a last bicerin',
  'author': 'Olympics: A toast to Torino, with a last bicerin',
  'lede': 'Olympics: A toast to Torino, with a last bicerin',
  'url': 'https://www.nytimes.com//2004/02/15/sports/sports-briefing.html?searchResultPosition=1'}]
 
 ```

In [7]:
df_cleaned = pd.read_csv('all_info_cleaned.csv')

In [8]:
df_cleaned.sort_values('hits', ascending=False).head(30)

Unnamed: 0,alternate_name,citizenship,event,first_name,full_name,game_type,gender,last_name,medals_bronze,medals_gold,...,other_info,season,years,code,country_name,latitude,longitude,NOC,article_results,hits
116,,USA,aquatics,Michael,Michael PHELPS,Olympic,Men,PHELPS,2,23,...,,Summer,2004-2016,US,United States,37.09024,-95.712891,USA,[{'headline': 'Gracie Gold’s Battle for Olympi...,962
131,,USA,athletics,Carl,Carl LEWIS,Olympic,Men,LEWIS,0,9,...,,Summer,1984-1996,US,United States,37.09024,-95.712891,USA,"[{'headline': 'Mel Rosen,Coach of Powerful ‘92...",815
106,Hyun-Soo Ahn,RUS,short track speed skating,AN,Victor AN,Olympic,Men,Victor,2,6,...,Also competed for South Korea,Winter,2006-2014,RU,Russia,61.52401,105.318756,RUS,"[{'headline': 'A 2012 Olympic Gold Medal,Final...",791
105,Hyun-Soo Ahn,KOR,short track speed skating,AN,Victor AN,Olympic,Men,Victor,2,6,...,Also competed for Russia,Winter,2006-2014,KR,South Korea,35.907757,127.766922,KOR,"[{'headline': 'A 2012 Olympic Gold Medal,Final...",791
127,,USA,aquatics,Mark,Mark SPITZ,Olympic,Men,SPITZ,1,9,...,,Summer,1968-1972,US,United States,37.09024,-95.712891,USA,[{'headline': 'Swimming Geek/Sports Reporter S...,466
124,,USA,aquatics,Ryan,Ryan LOCHTE,Olympic,Men,LOCHTE,3,6,...,,Summer,2004-2016,US,United States,37.09024,-95.712891,USA,[{'headline': 'Ryan Lochte Is Suspended for 14...,243
137,,USA,aquatics,Jenny,Jenny THOMPSON,Olympic,Women,THOMPSON,1,8,...,,Summer,1992-2004,US,United States,37.09024,-95.712891,USA,[{'headline': 'Rio Olympics: Claressa Shields ...,161
139,,USA,aquatics,Natalie,Natalie COUGHLIN,Olympic,Women,COUGHLIN,5,3,...,,Summer,2004-2012,US,United States,37.09024,-95.712891,USA,"[{'headline': 'So Far at U.S. Olympic Trials,M...",137
138,,USA,aquatics,Dara,Dara TORRES,Olympic,Women,TORRES,4,4,...,,Summer,1984-2008,US,United States,37.09024,-95.712891,USA,[{'headline': 'Serena Williams. New Mom. Elite...,136
104,,USA,short track speed skating,OHNO,Apolo Anton OHNO,Olympic,Men,Apolo Anton,4,2,...,,Winter,2002-2010,US,United States,37.09024,-95.712891,USA,[{'headline': 'Nigeria Has an Olympic-Level Bo...,103


In [None]:
df_cleaned.sort_values('medals_total', ascending=False)