# Session 2: Functions and Basic Web Scraping in Python for SEO

## Agenda
- Functions
    - What is a Function?
    - How to Define and Call a Function
    - Return Statements
    - Components of a Function
    - Example: Function to Calculate Square of a Number
    - Understanding Docstrings
        - What is a Docstring?
- Setting Up a Virtual Environment for SEO Web Scraping
    - Why Use Virtual Environments?
    - Steps to Create the Conda Environment and Add to Jupyter Kernel List
- Basics of Web Scraping
    - What is Web Scraping?
    - Introduction to `requests`
    - Introduction to `BeautifulSoup`
    - Introduction to `newspaper3k`
- Build a Simple Web Scraper
    - Walkthrough: Fetching and Parsing a Web Page
    - Extracting SEO-relevant Information

## Functions

### What is a Function?
A function in Python is a reusable piece of code that performs a specific task. Functions are essential for code reusability and organization.

### How to Define and Call a Function
In Python, functions are defined using the `def` keyword followed by the function name and parentheses `()`. The code block within a function is executed when the function is called.

### Return Statements
The `return` keyword is used to exit a function and return a value.
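As a small illustration, the sketch below uses two hypothetical helpers (`title_length_label` and `show_title` are not part of the course material, and the 60-character threshold is an arbitrary choice). It shows that `return` ends the function immediately, and that a function with no `return` statement evaluates to `None`.

```python
def title_length_label(title, max_length=60):
    """Return a label for a page title's length.

    max_length=60 is an arbitrary threshold chosen for this illustration.
    """
    if len(title) > max_length:
        return "too long"  # the function exits here; the line below never runs
    return "ok"


def show_title(title):
    """Print the title. There is no return statement, so the call yields None."""
    print(title)


label = title_length_label("Python for SEO")
result = show_title("Python for SEO")

print(label)
print(result)
```

Here `label` holds the string `"ok"`, while `result` is `None` because `show_title` only prints.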

### Components of a Function
- `def` keyword: To start the function definition
- Function name: To identify the function
- Parameters: Variables to pass into the function (optional)
- Code block: The actions to perform
- `return` keyword: To return a value (optional)

#### Example: Function to Calculate Square of a Number
```python
# Definition
def square_number(num):
    """This function returns the square of a number."""
    result = num * num
    return result

# Calling the Function
square_of_five = square_number(5)
print(f"The square of 5 is {square_of_five}")
```

---
### Exercises

#### 1. Create a function that takes a number and returns its cube.




#### 2. Create a function that takes two numbers and returns their sum.

#### 3. Create a function that reverses a string.

---

### Understanding Docstrings

#### What is a Docstring?
A docstring is a string literal that occurs as the first statement in a module, function, class, or method definition. In Python, docstrings are used for documentation and can be accessed at runtime using the `help()` function. You can also view the docstring by placing your cursor inside the function's parentheses and pressing Shift + Tab in a Jupyter Notebook.

#### How to Write a Docstring
To write a docstring, enclose the descriptive text in triple quotes, either single (`''' ... '''`) or double (`""" ... """`), and place it immediately after the `def` line as the first statement in the function body.

#### Example:


```python
# Run this cell

def add_numbers(a, b):
    """
    This function takes two numbers and returns their sum.
    Parameters:
    - a: first number
    - b: second number

    Returns:
    Sum of a and b
    """
    return a + b
```

```python
# Press Shift + Tab with the cursor inside the parentheses to view the docstring.
# Running this cell as-is raises a TypeError because the arguments are missing.

add_numbers()
```
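Outside a notebook, the same text is reachable from plain Python. The sketch below repeats `add_numbers` so that it runs on its own, then reads the docstring through the `__doc__` attribute and through the standard library's `inspect.getdoc`, which also strips the indentation. Calling `help(add_numbers)` would print the same text together with the signature.

```python
import inspect


def add_numbers(a, b):
    """
    This function takes two numbers and returns their sum.

    Parameters:
    - a: first number
    - b: second number

    Returns:
    Sum of a and b
    """
    return a + b


# The raw docstring, exactly as written (including leading whitespace)
raw_doc = add_numbers.__doc__

# inspect.getdoc() returns the same text with the indentation cleaned up
clean_doc = inspect.getdoc(add_numbers)
print(clean_doc)
```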

---
## Setting Up a Virtual Environment for SEO Web Scraping

### Why Use Virtual Environments?

Virtual environments are isolated spaces where you can install software and Python packages independently of the system-wide Python installation. Using a virtual environment is advantageous for several reasons:

1. **Isolation**: You can isolate your project's dependencies to avoid version conflicts.
2. **Reproducibility**: Makes it easier to share your code and environment setup with others.
3. **Simplicity**: Simplifies your system Python setup by allowing you to only install packages needed for each specific project.

### Steps to Create the Conda Environment and Add to Jupyter Kernel List

#### Create the Conda Environment

First, save the following YAML configuration into a file named `environment.yml`.

```yaml
name: seo_webscraper
channels:
  - defaults
  - conda-forge
dependencies:
  - python=3.9
  - requests
  - beautifulsoup4
  - pandas
  - selenium
  - lxml
  - scikit-learn
  - matplotlib
  - jupyter
  - ipykernel
  - newspaper3k
```

After saving the `environment.yml`, open your terminal and navigate to the folder containing the file. Then run:

```bash
conda env create -f environment.yml
```

#### Activate the Conda Environment

To activate the environment, run:

```bash
conda activate seo_webscraper
```

#### Add the Environment to Jupyter Kernel List

To make this environment accessible in Jupyter Notebook or Jupyter Lab, add it to your kernel list:

```bash
python -m ipykernel install --user --name=seo_webscraper --display-name="SEO Webscraper"
```

Now, you should be able to switch to this kernel while working in Jupyter and have access to all the packages you've specified.
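One way to confirm that a notebook is really running in the new environment is to check, from a cell, which interpreter is active and whether the key packages can be found. The sketch below is an assumption about how you might do this, not part of the original setup. `check_packages` is a hypothetical helper, and the names listed are import names, which can differ from package names (for example `bs4` for `beautifulsoup4`, `sklearn` for `scikit-learn`, `newspaper` for `newspaper3k`).

```python
import importlib.util
import sys


def check_packages(import_names):
    """Map each import name to True if it can be found in this environment."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in import_names
    }


# Import names, which can differ from the package names in environment.yml
expected = [
    "requests", "bs4", "pandas", "selenium",
    "lxml", "sklearn", "matplotlib", "newspaper",
]

# With the "SEO Webscraper" kernel selected, this path should point
# inside the seo_webscraper environment
print(f"Interpreter: {sys.executable}")

for name, found in check_packages(expected).items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```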

<video controls src="media/verify_kernel.mp4" width="800" height="500" />

---
## Basics of Web Scraping

Web scraping is an essential skill for SEO and data analysis. It involves fetching a web page and then extracting necessary information. In Python, libraries like `requests` and `BeautifulSoup` are commonly used for web scraping.

### What is Web Scraping?

Web scraping is the process of extracting data from websites. It's essentially a way to collect information from the web programmatically, as opposed to manually browsing and collecting data.

### Introduction to `requests`

The `requests` library is one of the most popular Python libraries for making HTTP requests to a specified URL. With `requests`, you can send HTTP/1.1 requests and handle the response to get the web content.

---
#### Example of Using `requests`:

```python
import requests

response = requests.get('http://www.example.com')
print(response.text)
```
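Besides `text`, a response carries metadata that matters for SEO checks, such as the status code and the headers. The sketch below uses a hypothetical helper, `summarize_response`, and a stand-in object built with `SimpleNamespace` so that it runs without network access. The attribute names it reads (`url`, `status_code`, `headers`, `text`) mirror those available on a real `requests` response.

```python
from types import SimpleNamespace


def summarize_response(response):
    """Collect a few SEO-relevant facts from a response-like object."""
    return {
        "url": response.url,
        "status_code": response.status_code,
        "is_success": 200 <= response.status_code < 300,
        "content_type": response.headers.get("Content-Type", "unknown"),
        "html_length": len(response.text),
    }


# Stand-in for a real response (an assumption so the example runs offline)
fake_response = SimpleNamespace(
    url="http://www.example.com",
    status_code=200,
    headers={"Content-Type": "text/html; charset=UTF-8"},
    text="<html><head><title>Example Domain</title></head></html>",
)

summary = summarize_response(fake_response)
for key, value in summary.items():
    print(f"{key}: {value}")
```

With a live page you would pass the real object instead, for example `summarize_response(requests.get('http://www.example.com', timeout=10))`.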

---

### Introduction to `BeautifulSoup`

`BeautifulSoup` is a Python library used for web scraping purposes to pull data out of HTML and XML files. It creates a parse tree from the page source code that can be used to extract data in a hierarchical and more readable manner.

---
#### Example of using BeautifulSoup:

```python
from bs4 import BeautifulSoup
import requests

response = requests.get('http://www.example.com')

# Initialize BeautifulSoup object with 'lxml' parser for efficient HTML parsing
soup = BeautifulSoup(response.text, 'lxml')

# Header and separator for the first <h1> tag
print("------")
print("First <h1> Tag:")
print("------")

# Find the first <h1> tag on the page
first_h1 = soup.find('h1')
print(first_h1.text)

# Add some space for readability
print("\n")

# Header and separator for the first <p> tag
print("------")
print("First <p> Tag:")
print("------")

# Find the first <p> tag on the page
first_p = soup.find('p')
print(first_p.text)
```


---
In the example above, we used `requests` to fetch a web page and `BeautifulSoup` to parse it. We then demonstrated two operations:

1. Finding the first `<h1>` tag on the page using the `find()` method. The method returns the first occurrence of the specified HTML tag, or `None` if the page contains no such tag.

2. Finding the first `<p>` (paragraph) tag on the page, again using the `find()` method.

These two calls cover the basic pattern behind most scrapers: fetch the page, parse it, then locate the tags you need. With these techniques, you can start building your own web scrapers to gather data for SEO or other analytical purposes.
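Because `find()` returns `None` when a tag is absent, a line such as `first_h1.text` raises an `AttributeError` on a page without an `<h1>`. The sketch below shows one way to guard against that, and uses `find_all()` to collect every match rather than only the first. It parses an inline HTML string (made up for this example) with Python's built-in `html.parser`, so no network access is needed. `text_or_default` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; note that it has no <h1> tag
html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
    <a href="/about">About</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")


def text_or_default(tag, default="(not found)"):
    """Return the tag's stripped text, or a default when find() returned None."""
    return tag.get_text(strip=True) if tag is not None else default


h1_text = text_or_default(soup.find("h1"))
first_p_text = text_or_default(soup.find("p"))
print(f"h1: {h1_text}")
print(f"first p: {first_p_text}")

# find_all() returns every matching tag, not just the first
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
print(f"{len(paragraphs)} paragraphs: {paragraphs}")
```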

---

### Introduction to `newspaper3k`

`newspaper3k` is a Python library for extracting articles from news sites and other text-heavy web pages. Beyond downloading and parsing an article, it can extract metadata such as the title, authors, and image URLs, and it offers basic natural language processing (keywords and a summary). This makes it a useful tool for SEO analysis, data mining, or content aggregation.


---
#### Example of Using `newspaper3k`:


```python
from newspaper import Article

# Specify the URL of the article to scrape
url = "http://example.com/"

# Create an Article object
article = Article(url)

# Download and parse the article
article.download()
article.parse()

# Print the article's text
print(article.text)
```


---

In this example, we used the `newspaper3k` library to download, parse, and print the text of an article from a given URL. It shows how little code is needed to gather text-based data for SEO or other analytical purposes.

---

## Build a Simple Web Scraper

In this section, we will put our newfound knowledge into practice. We will build a simple web scraper that will scrape multiple websites.

### Objectives:

- Create a Python function to handle the web scraping.
- Use the function to scrape information from a list of websites.

### Exercise Steps:

1. **Find Three Websites**: List the URLs of three websites you're interested in scraping.
  
2. **Initialize URL List**: Put these URLs in a Python list.

3. **Create a Web Scraping Function**: Write a function that will take a URL as an input and perform the following tasks:
    - Fetch the web page using the `requests` library.
    - Parse the page using `BeautifulSoup`.
    - Extract and print the title of the web page.
    - Extract and print the first paragraph (`<p>` tag) of the web page.
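    - Count and print the number of links on the page (count the `<a>` tags). This total includes internal links; isolating outbound links means comparing each link's domain with the page's own domain.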
    
4. **Iterate Over URL List**: Use a loop to iterate over the list of URLs. For each URL, call the web scraping function you created.

5. **Observe the Output**: Review the data that your function has gathered from each website.
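Counting every `<a>` tag gives the total number of links, internal and external alike. If you want outbound links specifically, one approach is to compare each link's host with the page's host. The sketch below does this with the standard library only. `count_outbound_links` is a hypothetical helper, and the `href` values are made up, standing in for what you would collect from the `<a>` tags.

```python
from urllib.parse import urljoin, urlparse


def count_outbound_links(page_url, hrefs):
    """Count hrefs that resolve to a different host than page_url."""
    page_host = urlparse(page_url).netloc.lower()
    outbound = 0
    for href in hrefs:
        # Resolve relative links such as "/about" against the page URL
        absolute = urljoin(page_url, href)
        parsed = urlparse(absolute)
        # Skip non-web links such as mailto: or javascript:
        if parsed.scheme not in ("http", "https"):
            continue
        if parsed.netloc.lower() != page_host:
            outbound += 1
    return outbound


# Made-up href values for illustration
hrefs = [
    "/about",                      # internal (relative)
    "http://example.com/contact",  # internal (same host)
    "https://www.python.org/",     # outbound
    "mailto:hello@example.com",    # not a web link
    "#top",                        # internal (fragment)
]

print(count_outbound_links("http://example.com/first-website", hrefs))
```

This is a simplification: it treats `example.com` and `www.example.com` as different hosts, which you may or may not want for your own analysis.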

```python
# Import necessary libraries
import requests
from bs4 import BeautifulSoup

# Initialize the list of URLs
urls = [
    'http://example.com/first-website',
    'http://example.com/second-website',
    'http://example.com/third-website'
]

# Define the web scraping function
def scrape_website(url):
    # TODO: Fetch the web page using 'requests'

    # TODO: Parse the page using 'BeautifulSoup'

    # TODO: Extract and print the title of the web page

    # TODO: Extract and print the first paragraph (<p> tag) of the web page

    # TODO: Count and print the number of links on the page (count the <a> tags)

    pass  # Placeholder so the function is valid Python; remove once the TODOs are done

# Iterate over the list of URLs and call the web scraping function for each
for url in urls:
    print(f"Scraping {url}...")
    scrape_website(url)
    print("---------------------------")
```


---

### Completed Example with the Newspaper package

```python
# Import the newspaper library
from newspaper import Article

# List of URLs to scrape
urls = [
    'https://veteran.com/va-loans-disability-rating/',
    'https://veteran.com/housing-home-ownership/',
    'https://veteran.com/va-owned-for-sale/'
]

# Define the web scraping function using newspaper3k
def scrape_article(url):
    # Create an Article object
    article = Article(url)

    # Download and parse the article
    article.download()
    article.parse()

    # Print the article's title
    print(f"Title: {article.title}")

    # Print the number of images in the article
    print(f"Number of Images: {len(article.images)}")

# Loop through each URL in the list
for url in urls:
    print(f"Scraping {url}...")
    scrape_article(url)
    print("---------------------------")
```
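Real pages sometimes fail to download or parse, and an unhandled exception would stop the loop at the first bad URL. The sketch below wraps each call in `try`/`except` so the remaining URLs are still processed. `scrape_all` and `fake_scraper` are hypothetical names, and the fake scraper stands in for `scrape_article` so that the example runs offline.

```python
def scrape_all(urls, scrape_fn):
    """Call scrape_fn on each URL, collecting failures instead of stopping."""
    failures = {}
    for url in urls:
        print(f"Scraping {url}...")
        try:
            scrape_fn(url)
        except Exception as error:  # broad on purpose: a failure skips only this URL
            failures[url] = str(error)
            print(f"Failed: {error}")
        print("---------------------------")
    return failures


def fake_scraper(url):
    """Stand-in for scrape_article: fails on one made-up URL."""
    if "broken" in url:
        raise ValueError("could not download page")
    print(f"Title: page at {url}")


# Made-up URLs for illustration
failed = scrape_all(
    [
        "http://example.com/ok",
        "http://example.com/broken",
        "http://example.com/also-ok",
    ],
    fake_scraper,
)
print(f"{len(failed)} URL(s) failed")
```

With the real scraper you would call `scrape_all(urls, scrape_article)` and review the returned dictionary to see which pages need another look.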
