## Scraping without an API


#### What is scraping?

Scraping is the process of using code to automatically collect data from websites. It's like having a robot do the work of copying and pasting information from different pages on a website, except the robot is a program that we write.

#### Why do people do scraping?

People scrape websites for many reasons, including market research, lead generation, content creation, and data analysis. Scraping makes it possible to gather large amounts of data quickly and efficiently.

In journalism, a key benefit of web scraping is gathering data for investigative reporting. Journalists can scrape government databases, corporate websites, and other sources to uncover hidden information and identify potential sources for stories. Scraping can also reveal patterns and trends that are not immediately apparent, and it can be used to monitor news and social media for breaking stories or conversations around particular topics. The resulting data can feed visualizations that help readers understand complex information, and it can support fact-checking of statements made by politicians or other public figures. Overall, web scraping is a powerful tool for journalists to collect and analyze data, uncover stories, and hold those in power accountable.

Here are some examples of projects that involve web scraping:
1. Analyzing Political Campaign Finance: An investigative journalist might use web scraping to collect data on campaign finance reports of political candidates and parties to uncover patterns of financial support.
2. Mapping the Real Estate Market: A data journalist might use web scraping to gather information from real estate websites to analyze trends in property prices and rental rates in different parts of the country.
3. Tracking Health Statistics: A health reporter might use web scraping to collect data on public health indicators, such as vaccination rates or hospital admissions, to monitor trends and identify potential areas of concern.
4. Monitoring Environmental Risks: An environmental journalist might use web scraping to collect data on pollution levels and environmental hazards, such as industrial emissions or toxic waste sites, to track the impact of industry on the environment.
5. Investigating Social Media Networks: An investigative journalist might use web scraping to collect data on social media activity, such as hashtag trends and user behavior, to uncover potential connections between political groups or individuals.




#### What are some of the different methods?

In Python, there are two common ways to scrape webpages: using Beautiful Soup for static pages or using Selenium for dynamic pages. Beautiful Soup is a library that extracts data from HTML tags, while Selenium automates things we do in the browser, such as clicking and filling out forms. Beautiful Soup is simpler to use, but Selenium is more flexible for scraping dynamic pages. Here, we will cover methods for scraping static sites.
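
As a rough illustration of the static/dynamic distinction, the sketch below parses a small, made-up HTML string with Beautiful Soup. The markup, the `id` values, and the variable names are assumptions for illustration only. Content written directly in the HTML is available to the parser, while a container that JavaScript would fill in later in a browser is still empty.

```python
from bs4 import BeautifulSoup

# Hypothetical page: one paragraph is in the HTML itself, while the
# "prices" container would only be filled in by JavaScript in a browser.
html = """
<html>
  <body>
    <p id="intro">Welcome to the store.</p>
    <div id="prices"></div>
    <script>
      document.getElementById("prices").textContent = "Widget: $5";
    </script>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Static content can be read straight from the parsed HTML.
intro_text = soup.find("p", id="intro").get_text()

# Beautiful Soup does not run JavaScript, so this container stays empty.
prices_text = soup.find("div", id="prices").get_text()

print(repr(intro_text))
print(repr(prices_text))
```

Pages of the second kind are where a browser-driving tool such as Selenium becomes necessary.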


#### What are the hurdles?

Scraping can be a complex and challenging task, and there are several hurdles to consider, including:

1. Legal issues: Scraping can violate website terms of service and may be illegal in some cases.
2. Anti-scraping measures: Websites may use anti-scraping measures, such as CAPTCHAs, to prevent scraping.
3. Data quality: Scraped data may be incomplete, inaccurate, or outdated.
4. Website changes: Websites may change their structure or content, which can break scraping scripts and require updates.
5. Performance issues: Scraping large amounts of data can be resource-intensive and may cause performance issues for the scraping tool or website being scraped.
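
Some of these hurdles can be reduced with a little care. One common courtesy, sketched below, is to consult a site's robots.txt rules before fetching a page. The rules, the `my-scraper` user agent name, and the example.com URLs are made up for illustration; a real script would more likely load the rules from the site with `RobotFileParser.set_url()` and `read()`. Note that robots.txt only states a site's crawling preferences and does not settle the legal questions above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, parsed from a list of lines so the
# example runs without network access.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

user_agent = "my-scraper"  # assumed name for our script
urls = [
    "https://example.com/articles/1",
    "https://example.com/private/report",
]

# Keep only the URLs that the rules allow this user agent to fetch.
allowed = [url for url in urls if parser.can_fetch(user_agent, url)]
print(allowed)
```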

#### Retrieving & Examining HTML Code


Suppose we have a set of links for IMDb pages that each corresponds to a different movie. How do we extract the titles of those movies?

The first step is to write code to obtain the HTML page for a movie. We can use the requests library for this.

In [4]:
import requests
from bs4 import BeautifulSoup

In [2]:
url = 'https://www.imdb.com/title/tt0133093/'
response = requests.get(url)

html = response.content
display(html)

b'<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body>\r\n<center><h1>403 Forbidden</h1></center>\r\n</body>\r\n</html>\r\n'

We run into an issue here: the website detects that we are not accessing it through a web browser and blocks our request, so we get a "403 Forbidden" response. We can modify the headers of our request so that it appears to come from a browser:

In [22]:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
response = requests.get(url, headers=headers)
html = response.content
display(html)
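
Because a blocked request still returns HTML (just not the page we wanted), it can help to check the status code before parsing. The helper below is a minimal sketch and not part of the requests library: `body_or_error` is a hypothetical name, and the `SimpleNamespace` objects are stand-ins for `requests` responses so that the example runs offline.

```python
from types import SimpleNamespace


def body_or_error(response):
    """Return the response body on success, otherwise raise a descriptive error."""
    if response.status_code == 403:
        raise PermissionError("403 Forbidden: the site refused the request")
    if response.status_code != 200:
        raise RuntimeError(f"Unexpected status code: {response.status_code}")
    return response.content


# Stand-ins for requests.Response objects (assumed values, no network access).
blocked = SimpleNamespace(status_code=403, content=b"<html>403 Forbidden</html>")
ok = SimpleNamespace(status_code=200, content=b"<html><title>OK</title></html>")

print(body_or_error(ok))

try:
    body_or_error(blocked)
except PermissionError as error:
    print(error)
```

With a real response from `requests.get`, calling `response.raise_for_status()` is a built-in alternative that raises `requests.HTTPError` for 4xx and 5xx status codes.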



We've now successfully obtained the HTML page for the movie The Matrix. Next, we need to extract the relevant information from it. Examining the HTML, we notice that the title of the movie is contained within the first pair of \<title\> tags (see \<title\>The Matrix (1999) - IMDb\</title\> in the HTML output). We can therefore use soup.find('title'), which returns the first \<title\> element, and read its .text attribute to get the string inside it.

In [25]:
soup = BeautifulSoup(html, 'html.parser')

# extract the movie title
title = soup.find('title').text

# print the extracted data
print(f'Title: {title}')

Title: The Matrix (1999) - IMDb


To get the movie title without the "- IMDb" at the end, we can do the following:

In [26]:
title.split(' - ')[0]

'The Matrix (1999)'
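
One caveat: `split(' - ')[0]` keeps only the text before the first ' - ', so a title that itself contains ' - ' would be cut short. A slightly more defensive sketch splits once from the right instead. The second title string below is an illustrative value, not scraped output.

```python
titles = [
    "The Matrix (1999) - IMDb",
    # Illustrative title that contains ' - ' itself.
    "Mission: Impossible - Fallout (2018) - IMDb",
]

# rsplit with maxsplit=1 removes only the last ' - ' segment (the site name).
cleaned = [title.rsplit(" - ", 1)[0] for title in titles]
print(cleaned)
```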

Now that we've determined the process for extracting a movie title from an IMDb page, we can repeat it for other movies such as Avatar:

In [17]:
url = 'https://www.imdb.com/title/tt0499549/'
response = requests.get(url, headers=headers)
html = response.content

soup = BeautifulSoup(html, 'html.parser')

# extract the movie title
title = soup.find('title').text.split(' - ')[0]

# print the extracted data
print(f'Title: {title}')

Title: Avatar (2009)
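
To repeat the extraction for many movies, the steps can be wrapped in a function. The sketch below is one way to do that; `extract_title` is a hypothetical helper name, and the two HTML strings are tiny stand-ins for downloaded pages so that the example runs without network access. In practice each string would come from `requests.get(url, headers=headers).content`.

```python
from bs4 import BeautifulSoup


def extract_title(html):
    """Return the page title without the trailing ' - IMDb', or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("title")
    if title_tag is None:
        return None
    return title_tag.get_text().split(" - ")[0].strip()


# Stand-in pages keyed by the URLs used above.
pages = {
    "https://www.imdb.com/title/tt0133093/": "<html><head><title>The Matrix (1999) - IMDb</title></head></html>",
    "https://www.imdb.com/title/tt0499549/": "<html><head><title>Avatar (2009) - IMDb</title></head></html>",
}

titles = [extract_title(html) for html in pages.values()]
print(titles)
```

Returning `None` when no \<title\> tag is found keeps one malformed page from stopping the whole loop.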


#### Chrome Inspect Tool

To help us determine the HTML tags, we can use the Inspect tool in Google Chrome.

First, we can go to CNN.com in a web browser and right-click on the page to bring up the context menu. From there, we can select "Inspect" to open up the developer tools in the browser.

Once we have the developer tools open, we can use the mouse cursor to hover over different elements on the page and see how they are structured in the HTML code. We can also use the "Select element" tool (which looks like a cursor with a box around it) to click on an element and see its HTML code highlighted in the developer tools.

For example, let's say we want to extract the headlines from the front page of CNN.com. Using the "Select element" tool, we can click on one of the headlines and see its HTML code highlighted in the developer tools. We can see that the headline is contained within a \<span\> tag with the "data-editable" attribute set to "headline".

We can use Beautiful Soup to extract all of the headlines from the page by finding all of the \<span\> tags whose "data-editable" attribute is set to "headline" and extracting their text:

In [38]:

# Make a request to the CNN website
response = requests.get('https://www.cnn.com/')

# Create a BeautifulSoup object from the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Find all the headline elements using the 'span' tag and 'data-editable' attribute
headline_elements = soup.find_all('span', {'data-editable': 'headline'})

# Extract the text content of each headline element
headlines = [element.text.strip() for element in headline_elements]

# Print the headlines
print(headlines)

['Ukraine', 'Escaped inmates captured', 'Hunter Biden', 'Jamie Foxx', '‘Star Wars’ timeline', 'Coronation', 'Axe Files', 'A jury found them guilty for their role in trying to forcibly prevent the peaceful transfer of power from then-President Donald Trump to Joe Biden after the 2020 election', 'Proud Boys were ‘Donald Trump’s army,’ prosecutor said in closing arguments of seditious conspiracy trial', 'Proud Boy testified that talk of ‘stacking bodies’ was locker-room banter', 'Florida man charged with throwing explosive at Capitol riot', 'Former FBI supervisor arrested on January 6-related charges allegedly encouraged mob to ‘kill’ police', 'The only thing that makes sense about the drone attack on the Kremlin', 'Messages on Russian drones appear to reference alleged Putin assassination attempt', 'What we know about the murky drone attack and the questions that remain', 'CNN military analyst breaks down alleged Kremlin drone strike video', 'First Horizon stock tumbles after TD Bank mer
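
To see how the attribute filter behaves without depending on CNN's live markup (which may change), here is a small sketch on a made-up HTML fragment. The fragment and its text are invented for illustration. It also shows the same query written as a CSS selector with `select`, an alternative to `find_all`.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the structure described above.
html = """
<div>
  <span data-editable="headline"> First headline </span>
  <span data-editable="byline">A. Reporter</span>
  <span data-editable="headline">Second headline</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Match <span> tags whose data-editable attribute equals "headline".
by_attrs = [
    tag.get_text(strip=True)
    for tag in soup.find_all("span", attrs={"data-editable": "headline"})
]

# The same query written as a CSS attribute selector.
by_css = [
    tag.get_text(strip=True)
    for tag in soup.select('span[data-editable="headline"]')
]

print(by_attrs)
print(by_css)
```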

#### Further Examples

The following examples illustrate some common HTML tags and how to use them to extract information.

#### Example 1: Extracting Links

In [14]:
url = 'https://www.tasnimnews.com/fa'
response = requests.get(url)

soup = BeautifulSoup(response.content, 'html.parser')
links = soup.find_all('a')

for link in links:
    href = link.get('href')
    if href and href.startswith('http'):
        print(href)

https://t.me/Tasnimnews
https://www.instagram.com/tasnimnews_fa/
https://twitter.com/Tasnimnews_Fa
http://www.aparat.com/tasnim.video
https://splus.ir/tasnimnews
https://profile.igap.net/tasnimnews
https://gap.im/join/nZrdWLo9xXy4LCIQ4wXsDLY7PQWjGpIgtu0cxJ3ZhWz
https://ble.ir/tasnimnews
https://rubika.ir/tasnimnews
https://eitaa.com/tasnimnews
https://t.me/Tasnimnews
https://www.instagram.com/tasnimnews_fa/
https://twitter.com/Tasnimnews_Fa
http://www.aparat.com/tasnim.video
https://splus.ir/tasnimnews
https://profile.igap.net/tasnimnews
https://gap.im/join/nZrdWLo9xXy4LCIQ4wXsDLY7PQWjGpIgtu0cxJ3ZhWz
https://ble.ir/tasnimnews
https://rubika.ir/tasnimnews
https://eitaa.com/tasnimnews
http://creativecommons.org/licenses/by/4.0/
http://www.tasnimnews.com
http://creativecommons.org/licenses/by/4.0/


This code uses BeautifulSoup to parse the HTML code of the Tasnim News Agency website and find all the \<a\> elements. It then extracts the href attribute of each link and prints it, but only if it starts with "http", which skips relative links (typically links to other pages on the same site).
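
Links that do not start with "http" are relative to the current page. If they are wanted as well, one option is to resolve them against the page URL with `urljoin` and then compare host names to separate internal from external links. The `hrefs` values below are made-up examples, and splitting by host name is an assumption about what "internal" should mean.

```python
from urllib.parse import urljoin, urlparse

base_url = "https://www.tasnimnews.com/fa"

# Made-up href values of the kinds found in <a> tags.
hrefs = ["/fa/news/1", "https://t.me/Tasnimnews", "#top", "media/gallery"]

# Resolve every href to an absolute URL relative to the page it came from.
absolute = [urljoin(base_url, href) for href in hrefs]

# Treat links on the same host as internal and everything else as external.
base_host = urlparse(base_url).netloc
internal = [url for url in absolute if urlparse(url).netloc == base_host]
external = [url for url in absolute if urlparse(url).netloc != base_host]

print(internal)
print(external)
```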



#### Example 2: Extracting Images

In [15]:
url = 'https://www.tasnimnews.com/fa'
response = requests.get(url)

soup = BeautifulSoup(response.content, 'html.parser')
images = soup.find_all('img')

for image in images:
    src = image.get('src')
    if src and src.startswith('http'):
        print(src)

https://newsmedia.tasnimnews.com/Tasnim/Uploaded/Image/1402/02/18/1402021815342899127532583.jpeg
https://newsmedia.tasnimnews.com/Tasnim/Uploaded/Image/1401/11/30/1401113021073267127094812.jpg
https://newsmedia.tasnimnews.com/Tasnim/Uploaded/Image/1402/01/03/1402010312211070927284182.jpg
https://newsmedia.tasnimnews.com/Tasnim/Uploaded/Image/1402/01/24/1402012414513891727382192.jpg
https://newsmedia.tasnimnews.com/Tasnim/Uploaded/Image/1402/02/18/1402021813064847027531212.jpg
https://newsmedia.tasnimnews.com/Tasnim/Uploaded/Image/1402/02/18/1402021816545373627533312.jpg
https://newsmedia.tasnimnews.com/Tasnim/Uploaded/Image/1394/12/10/139412101603505917247702.jpg
https://newsmedia.tasnimnews.com/Tasnim/Uploaded/Image/1399/09/10/1399091012115956621719302.jpg
https://newsmedia.tasnimnews.com/Tasnim/Uploaded/Image/1402/02/17/1402021713475248827526362.jpg
https://newsmedia.tasnimnews.com/Tasnim/Uploaded/Image/1402/02/01/1402020114414992627439572.jpg
https://newsmedia.tasnimnews.com/Tasnim/

This code uses BeautifulSoup to parse the HTML code of the Tasnim News Agency website and find all the \<img\> elements. It then extracts the src attribute of each image and prints it, but only if it starts with "http", which skips images referenced by relative paths.


#### Example 3: Extracting Text


In [13]:
url = 'https://www.cnn.com/'
response = requests.get(url)

soup = BeautifulSoup(response.content, 'html.parser')
text = soup.get_text()

print(text)

 








Breaking News, Latest News and Videos | CNN

CNN values your feedback

1. How relevant is this ad to you?

2. Did you encounter any technical issues?

Video player was slow to load content

Video content never loaded

Ad froze or did not finish loading


This code uses BeautifulSoup to parse the HTML code of the CNN website and extract all of its text content with get_text(). We then print the text, which includes the text inside every HTML element on the page, from the page title to an ad feedback form.
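
The raw result of `get_text()` tends to contain long runs of blank lines and indentation left over from the markup. A simple cleanup, sketched here on a short made-up string rather than the real CNN output, is to strip each line and drop the empty ones. Beautiful Soup's `get_text` also accepts `separator` and `strip` arguments that can achieve a similar effect at extraction time.

```python
# Made-up sample imitating the whitespace-heavy output of get_text().
raw_text = (
    "\n\n\nBreaking News | Example\n\n\n\n"
    "      1. First question\n   \n\n"
    "      2. Second question\n"
)

# Strip surrounding whitespace from each line, then drop empty lines.
stripped_lines = (line.strip() for line in raw_text.splitlines())
cleaned = "\n".join(line for line in stripped_lines if line)

print(cleaned)
```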