# Data Hunting and Gathering (Part 1)

## INTRO

![Web Scraping](http://unadocenade.com/wp-content/uploads/2012/09/cavalls-de-valltorta.jpg)

Welcome to the first part of our journey into the world of web scraping. Web scraping, also known as web harvesting or web data extraction, is a technique used for extracting data from websites. This process involves fetching the web page and then extracting data from it.

### Why Learn Web Scraping?
Understanding how to scrape data from the web is a valuable skill for any data professional. In the digital era, data is the new gold, and web scraping is the mining equipment. Here's why it's essential:

- **Data Availability**: The internet is a vast source of data for all kinds of analyses, from market trends to academic research.
- **Automation**: Web scraping can automate the process of collecting data, saving time and effort.
- **Competitive Advantage**: In many fields, having timely and relevant data can be a game-changer.

### Real-World Applications
- **Market Research**: Analyzing competitors, understanding customer sentiments, and identifying market trends.
- **Price Comparison**: Aggregating pricing data from various websites for comparison shopping.
- **Social Media Analysis**: Gathering data from social networks for sentiment analysis or trend spotting.

### Ethical Considerations in Web Scraping

Web scraping, while a powerful technique for data extraction, comes with significant ethical and legal responsibilities. As budding data scientists and web scrapers, it's crucial to navigate this landscape with a deep understanding and respect for these considerations.

### Respecting Website Policies and Laws

- **Adhering to Terms of Service**: Most websites publish their own set of rules, usually outlined in their Terms of Service (ToS). Read and understand these rules before scraping, as violating them can have legal implications.

- **Following Copyright Laws**: The data you scrape is often copyrighted. Ensure that your use of scraped data complies with copyright laws and respects intellectual property rights.

- **Privacy Concerns**: Be mindful of personal data. Scraping and using personal information without consent can breach privacy laws and ethical standards.

### Example: Understanding Google's `robots.txt`

Google's `robots.txt` file is a good example of how websites communicate their crawling policies. Available at [Google's robots.txt](https://www.google.com/robots.txt), this file gives web crawlers directives about which paths they may or may not request.

#### Implications of Google's `robots.txt`

- **Selective Access**: Google allows certain parts of its site to be crawled while restricting others. For instance, crawling the search results pages is generally disallowed.

- **Dynamic Nature**: The content of `robots.txt` files can change, reflecting the website's evolving stance on web scraping. Regular checks are necessary for compliance.

- **Respecting the Limits**: Even if a `robots.txt` file allows scraping of some pages, it does not automatically mean all scraping activities are legally or ethically acceptable. It's a guideline, not a blanket permission.
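
As a minimal sketch of how such directives can be checked in code, Python's standard library includes `urllib.robotparser`. The rules below are hypothetical and written inline so the example runs offline; they are not Google's actual rules, and `www.example.com` is a placeholder domain.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; a real site's file will differ.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Allow: /about
"""

def build_parser(robots_text):
    """Return a RobotFileParser loaded from the text of a robots.txt file."""
    robots = RobotFileParser()
    robots.parse(robots_text.splitlines())
    return robots

robots_parser = build_parser(SAMPLE_ROBOTS_TXT)

# can_fetch(user_agent, url) reports whether the rules permit the request.
search_allowed = robots_parser.can_fetch("*", "https://www.example.com/search?q=python")
about_allowed = robots_parser.can_fetch("*", "https://www.example.com/about")

print("May fetch /search:", search_allowed)
print("May fetch /about:", about_allowed)
```

To check a live site instead, `RobotFileParser` also offers `set_url(...)` followed by `read()`, which downloads the file before `can_fetch` is called. A permitted path is still only a guideline, as noted above.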

### 1. Introduction to Data Hunting in the Digital Age

#### The Evolution of Data Sourcing

In this course, we focus on data as our foundational element. Traditionally, data has been sourced from structured formats like spreadsheets from scientific experiments or records in relational databases within organizations. But with the digital revolution, particularly the advent of the internet, our approach to data collection must evolve. The internet is a vast reservoir of unstructured data, presenting both challenges and opportunities for data retrieval and analysis.

#### Understanding the Landscape of Web Data

When seeking data from the internet, it's essential to first consider how the website in question provides access to its data. Many large-scale websites like Google, Facebook, and Twitter offer an **Application Programming Interface (API)**. APIs are designed to facilitate easy access to a website's data in a structured format, simplifying the process of data extraction.

##### The Role of APIs

- **APIs as a Primary Tool**: An API acts as a bridge between the data seeker and the website's database, allowing for streamlined data retrieval.
- **Limitations**: However, not all websites provide an API. Additionally, even when an API is available, it may not grant access to all the data a user might need.
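
To illustrate what "structured format" means in practice, here is a small sketch that parses a JSON payload of the kind many APIs return. The payload and its field names (`user`, `followers`, `topics`) are invented for this example and do not come from any real API.

```python
import json

# Hypothetical API response body; real APIs define their own fields.
api_response_body = (
    '{"user": "pyladies", "followers": 1200, "topics": ["python", "community"]}'
)

# json.loads turns the JSON text into ordinary Python dicts and lists,
# so individual fields can be read by name without any HTML parsing.
record = json.loads(api_response_body)
followers = record["followers"]
topics = record["topics"]

print(followers)
print(topics)
```

With scraping, by contrast, the same values would have to be located inside the page's HTML markup.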

##### The Need for Web Scraping

In cases where an API is absent or insufficient, we turn to **web scraping**. Web scraping involves extracting raw data directly from a website's frontend - essentially, the same information presented to users in their web browsers.

###### Diving into Scraping

- **Dealing with Unstructured Data**: Scraping requires us to interact with unstructured data, necessitating custom coding and data parsing techniques.
- **Legal and Ethical Considerations**: It's crucial to approach web scraping with an awareness of the legal and ethical implications, respecting website policies and user privacy.

## Starting Our Journey

Our first practical step in this journey will be to explore how to connect to the internet and retrieve a basic webpage. We'll begin by using Python's `urllib.request` module, a powerful tool for interacting with URLs and handling web requests.

Join us as we embark on this exciting journey to master the art of data hunting in the digital era, where we'll navigate the complexities of APIs, web scraping, and the ethical considerations that come with them.

In [3]:
# Import the 'urlopen' function from the 'urllib.request' module.
# This function is used for opening URLs, which is the first step in web scraping.
from urllib.request import urlopen

# Use the 'urlopen' function to open the URL 'http://www.google.com/'.
# The function returns a response object which can be used to read the content of the page.
# Here, 'source' is a variable that holds the response object from the URL.
source = urlopen("http://www.google.com/")

# Print the response object.
# This command does not print the content of the webpage.
# Instead, it prints a representation of the response object, 
# which includes information like the URL, HTTP response status, headers, etc.
print(source)

<http.client.HTTPResponse object at 0x10566e970>


In [2]:
# No installation is needed for 'urlopen': it is part of 'urllib.request',
# which ships with Python's standard library. The separate 'urlopen' package
# on PyPI is not required for this notebook.


## Exploring the Content Retrieved by `urlopen`

The `urlopen` snippet above shows the basic way to access a webpage. Note that `print(source)` displays the HTTP response object's representation, not the HTML of the page. To get the actual content of the webpage, we read from the `source` object with `source.read()`.

### Understanding `source.read()`

When you call `urlopen`, it returns an HTTPResponse object. This object, which we've named `source` in our example, holds various data and metadata about the webpage. To extract the actual HTML content of the page, we use the `read` method on this object.

### What Does `source.read()` Do?

- **Retrieves Webpage Content**: `source.read()` reads the entire content of the webpage to which the URL points. This content is usually in HTML format, which is the standard language for creating webpages.

- **Bytes, Not Text**: The data comes back as a `bytes` object rather than a string. To work with it as text in Python, decode it, for example with `.decode('utf-8')`.

- **One-time Operation**: The body of a response can be read only once. After `source.read()` has consumed it, further calls return an empty bytes object (`b''`). If you need the content again, keep the result in a variable or reopen the URL.

Here's a simple example to illustrate this:

In [4]:
# Read the response body and print the raw bytes
something = source.read()
print(something)

b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="es"><head><meta content="Google.es permite acceder a la informaci\xf3n mundial en castellano, catal\xe1n, gallego, euskara e ingl\xe9s." name="description"><meta content="noodp, " name="robots"><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="SWiCf7-dIev-GglZy0PI7Q">(function(){var _g={kEI:\'GMFFZuuAM5Pl5NoPkd2IwAw\',kEXPI:\'0,793110,3446935,2872,2891,3926,7828,31274,46127,230,107236,6642,49751,2,39760,6700,41949,84371,8155,23351,22435,9779,42459,20199,73178,2266,764,15816,1804,37682,9400,1635,13492,5254651,916,5992406,2839593,2,28,4,5,28,2,2,2,65,76,84,40,20718077,70,7280730,43886,3,318,4,1281,3,2121778,2585,24111,23005240,7950,1,4848,8408,3323,13342,20408,6,5709,5,1899,1700,21172,13998,1923,10958,4832,1575,2870,7621,3355,15164,7968,214,149,241,7526,5328,4501,5,
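
The behaviour of `read()` described above (bytes rather than text, and a body that can be consumed only once) can be checked without a network connection. This sketch assumes a `data:` URL as a stand-in for a real webpage; `urlopen` accepts such URLs, and the small HTML snippet is made up for the example.

```python
import base64
from urllib.request import urlopen

# A tiny made-up page, encoded into a data: URL so the example runs offline.
html_bytes = "<html><body><p>información</p></body></html>".encode("utf-8")
data_url = (
    "data:text/html;charset=utf-8;base64,"
    + base64.b64encode(html_bytes).decode("ascii")
)

response = urlopen(data_url)

first_read = response.read()    # the whole body, as bytes
second_read = response.read()   # the body is already consumed

# Decode the bytes to get a regular Python string.
text = first_read.decode("utf-8")

print(type(first_read))
print(second_read)
print(text)
```

Storing the first result in a variable, as done here with `first_read`, avoids having to reopen the URL later.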

## DEMO

Let's get hands-on with some initial exercises to warm up with web scraping!

### Exercises

1. **Python.org Content Check**: Does [https://www.python.org](https://www.python.org) contain the word `Python`?  
   _Hint: You can use the `in` keyword to check._

2. **Google.com Image Search**: Does [http://google.com](http://google.com) contain an image?  
   _Hint: Look for the `<img>` tag._

3. **First Characters of Python.org**: What are the first ten characters of [https://www.python.org](https://www.python.org)?

4. **Keyword Check in Pyladies.com**: Is there the word 'python' in [https://pyladies.com](https://pyladies.com)?

In [10]:
# EX1: Check if 'Python' is in the content of https://www.python.org/

# Import urlopen, which opens a URL and returns a response object
from urllib.request import urlopen

# Fetch the page; the HTTPResponse object is stored in the variable 'source'
source = urlopen("https://www.python.org/")

# read() returns the content of the webpage as bytes; decode it into a string.
# 'utf-8' is the most common encoding for webpages, so it is used here.
something = source.read().decode('utf-8')

# The 'in' keyword checks for the presence of a substring in a string.
# The result is a boolean value: True if "Python" is found, False otherwise.
"Python" in something

True
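
The remaining exercises follow the same pattern. As one possible sketch (not the only way to solve them), the checks can be written as small helper functions that take the decoded page text. The function names and the sample HTML below are invented for illustration, so the block runs without fetching anything.

```python
def contains_word(html, word, ignore_case=False):
    """Return True if 'word' occurs in the page text."""
    if ignore_case:
        return word.lower() in html.lower()
    return word in html

def has_image(html):
    """Return True if the page text contains an <img> tag."""
    return "<img" in html.lower()

def first_characters(html, count=10):
    """Return the first 'count' characters of the page text."""
    return html[:count]

# Made-up page text standing in for a downloaded webpage.
sample_html = (
    '<!doctype html><html><body>'
    '<img src="logo.png" alt="Python logo">'
    '</body></html>'
)

print(contains_word(sample_html, "Python"))
print(contains_word(sample_html, "python"))
print(contains_word(sample_html, "python", ignore_case=True))
print(has_image(sample_html))
print(first_characters(sample_html))
```

To run the helpers on a live page, pass them the decoded result of `urlopen(url).read()` instead of `sample_html`. Note that a plain `in` check is case-sensitive, which matters for the difference between `Python` and `python` in exercises 1 and 4.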

## Definitions: Request, Crawling and Scraping

### Using `urlopen` vs. `Request` in Web Scraping

When performing web scraping tasks in Python, you have the option to use either the `urlopen` function from the `urllib.request` module or the `Request` object in combination with `urlopen`. Here, we'll explain why you might choose one approach over the other.

### Using `urlopen` Directly

**Advantages**:

- **Simplicity**: It's a straightforward way to access a webpage and retrieve its content without the need for additional objects or customization.
  
- **Default Behavior**: `urlopen` uses default settings for the HTTP request, which is suitable for many common use cases.

- **Convenience**: For simple web scraping tasks, it provides a concise and readable solution.

### Using `Request` with `urlopen`

**Advantages**:

- **Customization**: You can set custom headers, send data in the request body, and use different HTTP methods (e.g., POST, PUT). Timeouts are passed to `urlopen` itself, while cookies and redirect handling are configured through `urllib.request` openers and handlers.

- **Fine-Grained Control**: It offers greater flexibility for handling complex scenarios.

In summary, the choice between using `urlopen` directly and creating a `Request` object depends on the complexity of your web scraping task. For simple tasks like fetching webpage content, `urlopen` is often sufficient and more straightforward. However, if you need to customize headers, use non-GET HTTP methods, or handle advanced scenarios, creating a `Request` object allows for fine-grained control over your HTTP requests.
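
The difference can be seen by building `Request` objects and inspecting them before anything is sent. This is a sketch only: the URLs, the form data, and the user-agent string `my-course-scraper/0.1` are placeholders chosen for the example.

```python
from urllib.request import Request

# A plain request: equivalent to what urlopen builds from a URL string.
simple_request = Request("https://www.example.com/")

# A customized request: explicit method, a request body, and a custom header.
custom_request = Request(
    "https://www.example.com/submit",
    data=b"name=value",
    headers={"User-Agent": "my-course-scraper/0.1"},
    method="POST",
)

# Nothing has been sent yet; we are only inspecting the request objects.
print(simple_request.get_method())
print(custom_request.get_method())
# Request stores header names capitalized, hence "User-agent".
print(custom_request.get_header("User-agent"))
```

Either object can then be passed to `urlopen`, for example `urlopen(custom_request, timeout=10)`; the timeout is an argument of `urlopen`, not of `Request`.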


### Crawling and Scraping: Unveiling the Web's Secrets

Crawling and scraping are two fundamental techniques in the world of web data acquisition. They form the backbone of many data-driven applications and are crucial skills for data analysts and web developers.

### Crawling: Navigating the Web

Crawling, often referred to as web crawling or spidering, is the process of systematically navigating the World Wide Web to retrieve web pages. Think of it as a web robot or spider, tirelessly traversing the internet to discover and index web content. This technique is at the heart of search engines like Google and Bing.

### Why Do We Crawl?

Crawling serves several important purposes:

- **Indexing**: It allows search engines to index and catalog web pages, making them searchable by users.
  
- **Link Discovery**: Crawlers extract links from web pages, helping build a vast network of interconnected web resources. This link structure is crucial for understanding the web's architecture.
  
- **Data Retrieval**: Crawlers may scrape or extract data from web pages, but their primary goal is to discover and navigate to other web pages.

### Scraping: Harvesting Data

Scraping is the process of extracting specific data or information from a single web page. Unlike crawling, which focuses on navigating the web, scraping zooms in on a single webpage to harvest valuable data.

### Use Cases of Scraping

Scraping is used for a variety of purposes, such as:

- **Data Extraction**: It allows us to extract structured data like product prices, news headlines, or stock market information from websites.

- **Content Monitoring**: Scraping can be employed to track changes in content on specific web pages, such as monitoring price changes on e-commerce sites or tracking news updates.

- **Competitor Analysis**: Businesses often use scraping to gather data on competitors, such as pricing strategies or product listings.

- **Research and Analysis**: Data analysts and researchers use scraping to collect data for studies, reports, and data-driven insights.

### Crawling and Scraping Synergy

In practice, crawling and scraping often work together. Crawlers traverse the web to find new pages, and once they reach a page of interest, scraping techniques are applied to extract valuable data. This synergy is what powers search engines, news aggregators, and data-driven applications on the internet.
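
The two steps can be sketched on a single in-memory page using only the standard library's `html.parser`. Collecting the links is the crawling side (where to go next), and pulling out the title is the scraping side (what to keep). The class name, the sample page, and the `example.com` / `example.org` URLs are all invented for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkAndTitleParser(HTMLParser):
    """Collect absolute link URLs (crawling) and the page title (scraping)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# Made-up page standing in for a downloaded webpage.
SAMPLE_PAGE = """
<html><head><title>Sample Shop</title></head>
<body>
<a href="/products">Products</a>
<a href="https://www.example.org/news">News</a>
</body></html>
"""

page_parser = LinkAndTitleParser("https://www.example.com/")
page_parser.feed(SAMPLE_PAGE)

print(page_parser.title)
print(page_parser.links)
```

A real crawler would repeat this for each discovered link, keeping track of pages already visited and respecting each site's `robots.txt`.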

### Conclusion

Understanding the concepts of crawling and scraping is essential for anyone looking to work with web data. Whether you want to build a search engine, gather market research, or simply automate data collection, these techniques are your gateway to unlocking the wealth of information available on the web.

## Requests vs Urllib

In [5]:
import urllib.request

# Define the URL to scrape
url = 'https://www.pyladies.com'

# Set up the request with a custom user-agent header
req = urllib.request.Request(url, headers={'User-Agent': 'Magic Browser'})

# Open the URL and retrieve the HTML content
con = urllib.request.urlopen(req)
html = con.read().decode()

# Check if 'Python' is in the HTML content
print('Python' in html)


True


In [14]:
## request and beautifulsoup

from bs4 import BeautifulSoup
import requests

ModuleNotFoundError: No module named 'requests'

In [13]:
# In a notebook, install packages with the %pip magic; a bare 'pip install'
# line can be parsed as Python code and raise a SyntaxError.
#%pip install bs4
%pip install requests

In [11]:
import requests

# Define the URL to scrape
url = 'https://www.pyladies.com'

# Send a GET request with a custom user-agent header
response = requests.get(url, headers={'User-Agent': 'Magic Browser'})

# Get the HTML content from the response
html = response.text

# Check if 'Python' is in the HTML content
print('Python' in html)


ModuleNotFoundError: No module named 'requests'

In [None]:
response.content

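In [None]:
# Parse the HTML; naming the parser explicitly avoids bs4's "no parser specified" warning
soup = BeautifulSoup(response.content, 'html.parser')
# Print the page's <title> element as a quick check of the parsed tree
print(soup.title)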