# Homework: Intro Scraping Practice

In this assignment, we'll practice our scraping skills by examining the "Published Reproductions" section of Soma's Investigate.ai project: https://investigate.ai/

### 0) Setup

Import `requests`, `BeautifulSoup`, and `pandas`. Remember that, even though we installed the library with `pip install beautifulsoup4`, the import statement uses a slightly different name.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

### 1) Grab the HTML for https://investigate.ai/

Use `requests` to get the HTML, assigning it to a variable.

In [2]:
url = 'https://investigate.ai/'

response = requests.get(url)

html = response.text
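
A slightly more defensive version of this request might look like the sketch below. The `fetch_html` helper name and the 10-second timeout are assumptions made for illustration, not part of the assignment. `raise_for_status()` turns an HTTP error code into an exception, so a failed download is not silently parsed as if it were the real page.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, raising if the request fails.

    The helper name and the default timeout are illustrative choices.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text


# Usage (performs a real network request, so it is left commented out):
# html = fetch_html('https://investigate.ai/')
```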

### 2) Use `BeautifulSoup` to convert the HTML into its DOM representation

In [3]:
soup = BeautifulSoup(html, 'html.parser')

### 3) Use `.select(...)` to select *just* the "Published reproductions" section


You'll want to use "View Source" or pop open the Element Inspector to figure out which element to target.

Reminder: An element's `class` attribute can contain *multiple* classes, separated by spaces. For example: `<div class="potato hamburger">Hello</div>` has two classes, `potato` and `hamburger`. A CSS selector for *either* class, `soup.select(".potato")` *or* `soup.select(".hamburger")`, will match that element.
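
The reminder can be checked on a tiny, made-up snippet of HTML. This is only a sketch: the second `div` and the variable names are invented for the demonstration. It also shows that chaining the classes (`.potato.hamburger`) matches only elements that carry both.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the first div has two classes, the second has one.
demo_html = '''
<div class="potato hamburger">Hello</div>
<div class="potato">Just potato</div>
'''
demo_soup = BeautifulSoup(demo_html, 'html.parser')

potato = demo_soup.select('.potato')          # matches both divs
hamburger = demo_soup.select('.hamburger')    # matches only the first div
both = demo_soup.select('.potato.hamburger')  # requires both classes on one element

print(len(potato), len(hamburger), len(both))
```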

Hint: Look for a `div` with a particular class.

In [4]:
sections = soup.select('.section-projects')

### 4) Use `projects_section.select(...)` to select elements that represent a single project

Assign the list to a variable named `project_els`.

In [5]:
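# `sections` is a list of matches, so loop over it to reach the section element.
# With a single match, `project_els` ends up holding that section's <h5> elements.
for projects_section in sections:
    project_els = projects_section.select('h5')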
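
Calling `.select(...)` on an element rather than on the whole soup limits the search to that element's descendants. A minimal sketch on made-up markup follows; the `section-other` class and the project names are hypothetical. `select_one` is used here because it returns the first match directly instead of a list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely shaped like a page with two sections.
demo_html = '''
<div class="section-projects"><h5>Project A</h5><h5>Project B</h5></div>
<div class="section-other"><h5>Not a project</h5></div>
'''
demo_soup = BeautifulSoup(demo_html, 'html.parser')

# select_one returns the first matching element (or None) instead of a list.
demo_section = demo_soup.select_one('.section-projects')

# Selecting from the section searches only inside it.
scoped = demo_section.select('h5')
everywhere = demo_soup.select('h5')

print(len(scoped), len(everywhere))
```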

### 5) Count the number of matching elements, using `len`

Does it match the number of projects you see on the page? (It should.)

In [6]:
len(project_els)

26

### 6) For each project, print its title, publisher, summary, and link

You'll want to construct a `for` loop. In each iteration of the loop, print out something that looks like this:

```
Title: Searching for faulty airbags in vehicle complaints

Publisher: The New York Times

Summary: The National Highway Transportation Safety Administration receives thousands and thousands of vehicle complaints each year. Can we train a computer to filter out leads on Takata airbag malfunctions?

Link: https://investigate.ai/nyt-takata-airbags/
---
```

In [68]:
titles = []
publishers = []
summaries = []
weblinks = []

for projects_section in sections:
    # Titles are the link text inside each project's <h5>
    project_titles = projects_section.select('h5 a')
    for t in project_titles:
        title = t.get_text()
        titles.append(title)
    # Publishers sit in the card footer; strip the surrounding whitespace
    project_publishers = projects_section.select('.card-footer')
    for p in project_publishers:
        pub = p.get_text()
        pub2 = pub.strip()
        publishers.append(pub2)
    # Summaries are the paragraphs inside each card body
    project_summaries = projects_section.select('.card-body p')
    for s in project_summaries:
        summ = s.get_text()
        summaries.append(summ)
    # The same <a> tags hold the links; prepend the site URL to each href
    links = projects_section.select('h5 a')
    for a_tag in links:
        link = a_tag['href']
        webl = f'{url}{link}'
        weblinks.append(webl)

for title, publisher, summary, link in zip(titles, publishers, summaries, weblinks):
    print(f'Title: {title}')
    print('')
    print(f'Publisher: {publisher}')
    print('')
    print(f'Summary: {summary}')
    print('')
    # `link` already includes the site URL (added when building `weblinks`)
    print(f'Link: {link}')
    print('---')
    print('')


Title: Searching for faulty airbags in vehicle complaints

Publisher: The New York Times

Summary: The National Highway Transportation Safety Administration receives thousands and thousands of vehicle complaints each year. Can we train a computer to filter out leads on Takata airbag malfunctions?

Link: https://investigate.ai//nyt-takata-airbags/
---

Title: Building a crime classification engine

Publisher: Los Angeles Times

Summary: Using machine learning as an investigative tool to cast light on years of underreporting by the Los Angeles Police Department.

Link: https://investigate.ai//latimes-crime-classification/
---

Title: Chinese museum analysis

Publisher: Caixin

Summary: A word-count analysis of the names of around 4500 museums in China.

Link: https://investigate.ai//caixin-museum-word-count/
---
---

Title: Analyzing online safety through app store reviews

Publisher: The Washington Post

Summary: After dow
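
The doubled slash in the links comes from gluing a base URL that ends in `/` onto an `href` that starts with `/`. One way around it, sketched here, is `urljoin` from the standard library. The `href` value below is inferred from the printed links rather than copied from the page source, so treat it as an assumption.

```python
from urllib.parse import urljoin

base_url = 'https://investigate.ai/'
href = '/nyt-takata-airbags/'  # assumed shape of a relative href on the page

concatenated = f'{base_url}{href}'  # plain concatenation keeps both slashes
joined = urljoin(base_url, href)    # resolves the path against the base URL

# An href that is already absolute is returned unchanged by urljoin.
already_absolute = urljoin(base_url, 'https://example.com/page/')

print(concatenated)
print(joined)
print(already_absolute)
```

Because `urljoin` leaves absolute URLs alone, it is also safe to apply to a list of links where some are relative and some are not.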

### 7) Now, let's do the same thing, but storing the info in a `pandas` `DataFrame`

Specifically, a `DataFrame` with the columns `title`, `publisher`, `summary`, and `link`.

In [69]:
data = {
    'Title': titles,
    'Publisher': publishers,
    'Summary': summaries,
    'Link': weblinks
}
df = pd.DataFrame(data)

df

| | Title | Publisher | Summary | Link |
|---|---|---|---|---|
| 0 | Searching for faulty airbags in vehicle compla... | The New York Times | The National Highway Transportation Safety Adm... | https://investigate.ai//nyt-takata-airbags/ |
| 1 | Building a crime classification engine | Los Angeles Times | Using machine learning as an investigative too... | https://investigate.ai//latimes-crime-classifi... |
| 2 | Chinese museum analysis | Caixin | A word-count analysis of the names of around 4... | https://investigate.ai//caixin-museum-word-count/ |
| 3 | Analyzing online safety through app store reviews | The Washington Post | After downloading over a hundred thousand revi... | https://investigate.ai//wapo-app-reviews/ |
| 4 | Uncovering abusive doctors that were allowed t... | Atlanta Journal-Constitution | How to comb through 100,000 disciplinary docum... | https://investigate.ai//ajc-doctors-abuse/ |
| 5 | Analyzing the tone of Trump's speeches | The New York Times | Standard sentiment analysis scores a document ... | https://investigate.ai//upshot-trump-emolex/ |
| 6 | Detecting special interest model legislation i... | USA Today, The Arizona Republic, and the Cente... | Special interest groups use model legislation ... | https://investigate.ai//azcentral-text-reuse-m... |
| 7 | Detecting bots in FCC comment submissions | | The comment period on the FCC's net neutrality... | https://investigate.ai//fcc-comments/ |
| 8 | Figuring out what Democratic candidates care a... | Bloomberg | In the wide field of Democratic presidential c... | https://investigate.ai//bloomberg-tweet-topics/ |
| 9 | What does Trump tweet about? | The New York Times | What does Trump tweet about? An analysis of ov... | https://investigate.ai//nyt-trump-tweets/ |
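
Another way to assemble this kind of table is to build one dictionary per project and hand the list to `pd.DataFrame`. Parallel lists combined with `zip` can drift out of alignment (or be cut short) if one project lacks a field, whereas per-project dictionaries keep each row's values together. The sketch below runs on made-up HTML: the `card` wrapper class is an assumption, while the inner selectors mirror the ones used earlier in this notebook.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical markup: the `card` wrapper class is an assumption.
demo_html = '''
<div class="card">
  <div class="card-body"><h5><a href="/one/">First project</a></h5><p>Summary one.</p></div>
  <div class="card-footer"> Publisher A </div>
</div>
<div class="card">
  <div class="card-body"><h5><a href="/two/">Second project</a></h5><p>Summary two.</p></div>
  <div class="card-footer"></div>
</div>
'''
demo_soup = BeautifulSoup(demo_html, 'html.parser')

rows = []
for card in demo_soup.select('.card'):
    anchor = card.select_one('h5 a')  # the title link inside this card only
    rows.append({
        'Title': anchor.get_text(strip=True),
        'Publisher': card.select_one('.card-footer').get_text(strip=True),
        'Summary': card.select_one('.card-body p').get_text(strip=True),
        'Link': anchor['href'],
    })

# Each dictionary becomes one row; the keys become the column names.
demo_df = pd.DataFrame(rows)
print(demo_df)
```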


### 8) Using that `DataFrame`, calculate:

Who are the most-featured publishers?

In [83]:
top_pub = df['Publisher'].value_counts()

top_pub

Publisher
The New York Times                                                      3
ProPublica                                                              3
                                                                        2
Los Angeles Times                                                       1
Caixin                                                                  1
Atlanta Journal-Constitution                                            1
The Washington Post                                                     1
Bloomberg                                                               1
USA Today, The Arizona Republic, and the Center for Public Integrity    1
FiveThirtyEight                                                         1
Milwaukee Journal-Sentinel                                              1
Dallas Morning News                                                     1
The Associated Press                                                    1
Tampa Bay Times             
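
The unlabeled row in these counts is the group of projects whose publisher is an empty string. If those should not be ranked alongside real publishers, one option is to convert empty strings to missing values first, since `value_counts()` skips missing values by default. A small sketch with made-up publisher names:

```python
import pandas as pd

# Made-up publisher values; two entries are empty strings.
demo_publishers = pd.Series(['Paper A', '', 'Paper B', 'Paper A', ''], name='Publisher')

# mask() replaces values where the condition is True with a missing value.
cleaned = demo_publishers.mask(demo_publishers == '')

# value_counts() drops missing values unless dropna=False is passed.
counts = cleaned.value_counts()
print(counts)
```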

Which project has the longest summary, measured in characters? How long is it?

In [84]:
df['Summary-Length'] = df['Summary'].apply(len)

df.sort_values(by='Summary-Length', ascending=False)

| | Title | Publisher | Summary | Link | SummaryLength | Summary-Length |
|---|---|---|---|---|---|---|
| 21 | Bias in the jury selection process | APM Reports | When selecting a jury, both the defense and th... | https://investigate.ai//apm-reports-jury-bias/ | 243 | 243 |
| 0 | Searching for faulty airbags in vehicle compla... | The New York Times | The National Highway Transportation Safety Adm... | https://investigate.ai//nyt-takata-airbags/ | 198 | 198 |
| 5 | Analyzing the tone of Trump's speeches | The New York Times | Standard sentiment analysis scores a document ... | https://investigate.ai//upshot-trump-emolex/ | 193 | 193 |
| 10 | Examining life expectancy at the local level | The Associated Press | Combine geographically granular life expectanc... | https://investigate.ai//ap-regression-unemploy... | 167 | 167 |
| 24 | An analysis of racial bias in criminal sentencing | ProPublica | Can an algorithm be racist? An examination of ... | https://investigate.ai//propublica-criminal-se... | 167 | 167 |
| 19 | Uncovering surveillance planes with BuzzFeed | BuzzFeed News | From a list of points along a flight's path, h... | https://investigate.ai//buzzfeed-spy-planes/ | 161 | 161 |
| 7 | Detecting bots in FCC comment submissions | | The comment period on the FCC's net neutrality... | https://investigate.ai//fcc-comments/ | 157 | 157 |
| 17 | Investigating who gets a ticket and who gets a... | The Boston Globe | A classic piece of data journalism analyzing t... | https://investigate.ai//boston-globe-tickets/ | 150 | 150 |
| 14 | Measuring the impact of re-segregation on Flor... | Tampa Bay Times | Using race, income, and other data to predict ... | https://investigate.ai//tampa-bay-times-schools/ | 149 | 149 |
| 6 | Detecting special interest model legislation i... | USA Today, The Arizona Republic, and the Cente... | Special interest groups use model legislation ... | https://investigate.ai//azcentral-text-reuse-m... | 149 | 149 |
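
Sorting works, but the single longest row can also be pulled out directly. The sketch below uses made-up data and illustrative variable names: `.str.len()` computes the lengths and `idxmax()` returns the index label of the largest one.

```python
import pandas as pd

# Made-up titles and summaries for illustration.
demo_df = pd.DataFrame({
    'Title': ['Short one', 'Long one', 'Middle one'],
    'Summary': ['abc', 'abcdefghij', 'abcdef'],
})

# .str.len() gives the number of characters in each summary.
demo_df['Summary-Length'] = demo_df['Summary'].str.len()

# idxmax() returns the index label of the row with the largest value.
longest_idx = demo_df['Summary-Length'].idxmax()
longest = demo_df.loc[longest_idx]

print(longest['Title'], longest['Summary-Length'])
```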


How many times longer is it than the average summary?

In [85]:
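# "How many times longer" is a ratio, so divide rather than subtract.
# The median is used here as the "average" summary length.
median_length = df['Summary-Length'].median()
top_length = df['Summary-Length'].max()

round(top_length / median_length, 2)

1.75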
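
"How many times longer" asks for a ratio, while subtraction answers "how many more characters". The result also depends on which average is meant. A small sketch with made-up lengths shows how far the mean and the median can diverge when one value is much larger than the rest:

```python
import pandas as pd

# Made-up summary lengths; one value is much larger than the others.
demo_lengths = pd.Series([100, 120, 140, 160, 480])

longest = demo_lengths.max()

ratio_to_mean = longest / demo_lengths.mean()      # mean is 200.0
ratio_to_median = longest / demo_lengths.median()  # median is 140.0
difference = longest - demo_lengths.median()       # extra characters, not a ratio

print(ratio_to_mean, ratio_to_median, difference)
```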
