# Websites and APIs
**Web scraping** and **APIs** are popular methods for collecting data from websites. 

## Webscraping
Webscraping involves directly parsing a website's HTML. This method can extract a wide range of data available on a page. It can also introduce complexity to your project, especially if the website's data is not consistently structured. 

## API (Application Programming Interface)
APIs provide structured data and detailed documentation for querying the website and filtering results. They are generally easier to use but may come with restrictions, such as limits on the number of requests per day or the number of records you can retrieve. 

---

# Step 1: Copyright | Terms of Use
Before starting a webscraping or API project, you must complete each of the checks below.

## Review and understand the terms of use.

- Do the terms of service include any restrictions or guidelines?
- Are permissions/licenses needed to scrape data? If yes, have you obtained these permissions/licenses?
- Is the information publicly available?
- If a database, is the database protected by copyright? Or in the public domain?

## Fair Use 
Limited use of copyrighted materials is allowed under certain conditions for journalism, scholarship, and teaching. Use the [resources for determining fair use](https://library.osu.edu/copyright/fair-use) to verify that your project is within the scope of fair use. Contact University Libraries [Copyright Services](https://library.osu.edu/copyright/fair-use) if you have any questions.

## Check for robots.txt directives
robots.txt directives limit web-scraping or web-crawling. Respect these directives.
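
One way to check these directives in code is Python's standard-library `urllib.robotparser`. The sketch below parses a made-up robots.txt file; the rules, the `my-research-bot` user agent, and the `example.org` URLs are all assumptions for illustration, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# "my-research-bot" and the example.org URLs are placeholders.
allowed_public = parser.can_fetch("my-research-bot", "https://example.org/articles/1")
allowed_private = parser.can_fetch("my-research-bot", "https://example.org/private/data")
delay = parser.crawl_delay("my-research-bot")

print(allowed_public, allowed_private, delay)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` would download the site's own robots.txt instead of the sample text.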

---



# Step 2: Is an API available?
APIs can simplify data collection from a website by returning structured data (e.g., JSON, XML). Examples of APIs include:

- [PubMed eUtilities](https://www.ncbi.nlm.nih.gov/books/NBK25500/)
- [Elsevier APIs](https://dev.elsevier.com/)
- [Spotify Web API](https://developer.spotify.com/documentation/web-api)

To determine if an API is available, try searching for the name of the website along with "API" or "documentation." If an API is available, read the terms of use and consider factors like rate limits, costs, and access restrictions.

If an API is not available, that's okay. Data collection might be a bit more complex, but remember to respect copyright and terms of use.

## Activity 1: The Lantern - Part 1
Go to the [OSU Publication Archives website](https://osupublicationarchives.osu.edu/?a=p&p=home&e=-------en-20--1--txt-txIN-------). **Note** that The Ohio State University provides the online archives of Ohio State's student newspaper *The Lantern*, the student yearbook *The Makio*, and alumni magazines for research and educational purposes. This is explicitly stated at the bottom of this page.

Search for `homecoming parade`. On the left, adjust the filters to show results for the decade `1970-1979`, publication `The Lantern`, and category `Article`. 

![TheLantern_HomecomingParade.png](images/TheLantern_HomecomingParade.png "Screenshot of search result")

Once you've set your filters, you should have 51 results. The first 20 search results are displayed on page 1.

Take a look at the search URL.

![search_results_1.png](images/search_results_1.png "Screenshot of url, r=1")

Scroll to the bottom and click on page 2 to see search results 21-40. Note that the `r` parameter in the search URL has changed from `r=1` to `r=21`.

![search_results_2.png](images/search_results_2.png "Screenshot of url, r now equals 21")

Return to page 1 of your search, scroll to the right end of your search URL and add the characters `&f=XML` to the end of the string.

![search_results_3.png](images/search_results_3.png "Screenshot of url with &f=XML added to the end of the url")

Because you searched for `homecoming parade`, the decade `1970-1979`, publication `The Lantern`, and category `Article`, the software powering the OSU Publication Archives constructed a server request and inserted your parameters into the URL. By adding `&f=XML` to the request parameters, the server returns structured XML output. We can iterate through this output to gather our data.

### Search parameters
#### The Lantern
![parameter_publication_the_lantern.png](images/parameter_publication_the_lantern.png "Screenshot showing position of publication identifier in search url")

#### homecoming+parade
![parameter_homecoming_parade.png](images/parameter_homecoming_parade.png "Screenshot showing position of homecoming parade in search url")

#### Decade: 1970-1979
![parameter_decade.png](images/parameter_decade.png "Screenshot showing position of decade filter in search url")

#### Category: Article
![parameter_category_article.png](images/parameter_category_article.png "Screenshot showing position of category filter in search url")

#### XML
![search_results_3.png](images/search_results_3.png "Screenshot of url with &f=XML added to the end of the url")

![TheLantern_XML.png](images/TheLantern_XML.png "Screenshot of XML output")
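
The parameters above can also be assembled in code. The sketch below builds one XML search URL per page of results. It assumes that `r` is the 1-based position of the first result on a page (as the `r=1` and `r=21` URLs suggest) and that the third page therefore starts at `r=41`; the function name `build_search_urls` is hypothetical.

```python
# Search URL for "homecoming parade" in The Lantern, 1970-1979, articles only.
# {start} marks the r parameter, the position of the first result on the page.
BASE_URL = "https://osupublicationarchives.osu.edu/"
QUERY_TEMPLATE = (
    "?a=q&r={start}&results=1&tyq=ARTICLE"
    "&e=------197-en-20-LTN-1--txt-txIN-homecoming+parade------&f=XML"
)


def build_search_urls(total_results, page_size=20):
    """Return one XML search URL per page of results (hypothetical helper)."""
    return [
        BASE_URL + QUERY_TEMPLATE.format(start=start)
        for start in range(1, total_results + 1, page_size)
    ]


# 51 results at 20 per page gives pages starting at r=1, r=21, and r=41.
search_urls = build_search_urls(51)
for url in search_urls:
    print(url)
```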


---










# Step 3: Inspect the elements

XML and HTML are tree-structured documents. When you request a search URL, your browser retrieves an HTML or XML page from a server, downloads the page into local memory, and parses it for display.

The [Document Object Model (DOM)](https://en.wikipedia.org/wiki/Document_Object_Model) represents the overall tree structure of the XML or HTML document. For example, in the XML document shown in Step 2:
- `VeridianXMLResponse` is the root element, the top-level node that contains every other element.
- All XML elements within `VeridianXMLResponse` are element nodes.
- There is some HTML present in the `SearchResultSnippetHTML` node.
- There are no XML attribute nodes, but there are HTML attribute nodes in the `SearchResultSnippetHTML` node.
- The text between XML tags forms text nodes.

The tree is hierarchically structured, and each branch ends with a node. Each node is an object, and nodes can be nested within other nodes.
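
To see these node types directly, the sketch below walks a tiny XML document with the standard-library `xml.dom.minidom` module and labels each node. The sample XML is a hypothetical, heavily shortened stand-in for the real response, and `describe` is an illustrative helper name.

```python
from xml.dom import minidom, Node

# Hypothetical, shortened stand-in for the XML returned by the search.
SAMPLE_XML = """<VeridianXMLResponse>
  <LogicalSectionTitle>Homecoming parade</LogicalSectionTitle>
</VeridianXMLResponse>"""

NODE_TYPE_NAMES = {
    Node.DOCUMENT_NODE: "document",
    Node.ELEMENT_NODE: "element",
    Node.TEXT_NODE: "text",
}


def describe(node, depth=0):
    """Return one indented line per node, skipping whitespace-only text."""
    lines = []
    if node.nodeType == Node.TEXT_NODE:
        text = node.data.strip()
        if not text:
            return lines
        label = repr(text)
    else:
        label = node.nodeName
    kind = NODE_TYPE_NAMES.get(node.nodeType, "other")
    lines.append(f"{'  ' * depth}{kind}: {label}")
    for child in node.childNodes:
        lines.extend(describe(child, depth + 1))
    return lines


tree_lines = describe(minidom.parseString(SAMPLE_XML))
print("\n".join(tree_lines))
```

Each extra level of indentation in the printed output corresponds to one level of nesting in the tree.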

## Developer Tools

The XML returned for our `The Lantern` search is well structured.

- A unique identifier is provided for each article under the tag `<LogicalSectionID>`
- The title for each article appears in the tag `<LogicalSectionTitle>`
- The category type is included in the tag `<LogicalSectionType>`
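
Because the tags are consistent, pulling these three values out of the XML takes only a few lines. The sketch below uses the standard-library `xml.etree.ElementTree` rather than BeautifulSoup so that it runs without extra packages. The sample XML, its `LogicalSection` wrapper element, and the ID and title values are all invented for illustration; only the three `LogicalSection...` tag names come from the real response.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the wrapper element and all values are made up.
SAMPLE_XML = """<VeridianXMLResponse>
  <LogicalSection>
    <LogicalSectionID>EXAMPLE-ID-1</LogicalSectionID>
    <LogicalSectionTitle>Example title one</LogicalSectionTitle>
    <LogicalSectionType>ARTICLE</LogicalSectionType>
  </LogicalSection>
  <LogicalSection>
    <LogicalSectionID>EXAMPLE-ID-2</LogicalSectionID>
    <LogicalSectionTitle>Example title two</LogicalSectionTitle>
    <LogicalSectionType>ARTICLE</LogicalSectionType>
  </LogicalSection>
</VeridianXMLResponse>"""

root = ET.fromstring(SAMPLE_XML)

records = []
# Visit every element and keep those that directly contain an ID tag,
# so the code does not depend on the name of the wrapper element.
for parent in root.iter():
    if parent.find("LogicalSectionID") is not None:
        records.append({
            "unique_id": parent.findtext("LogicalSectionID"),
            "article_title": parent.findtext("LogicalSectionTitle"),
            "article_type": parent.findtext("LogicalSectionType"),
        })

for record in records:
    print(record)
```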

HTML responses are often less clear but can be navigated with persistence. Google offers a range of [Developer Tools](https://developer.chrome.com/docs/devtools/dom) that can help understand a webpage's DOM elements. For example, if we search the [OSU Publication Archives](https://osupublicationarchives.osu.edu/?a=p&p=home&e=-------en-20--1--txt-txIN-------) website for `homecoming parade`, decade `1970-1979`, publication `The Lantern`, and category `Article` (removing `&f=XML` from the end of your search URL), our browser renders our search results in HTML. 

Right-click on the HTML search results and select "Inspect" to explore the DOM elements.

![TheLantern_Inspect.png](attachment:f225f96f-9563-43da-b3bc-632c0410e7dd.png "Screenshot showing the word Inspect at the bottom of options in window that opens after right clicking on search results")

This action opens Google's Developer Tools on the right side of your screen. **Note:** The default tab, "Elements," displays the HTML elements for the page.

Click the inspect icon ![inspect_icon.png](images/inspect_icon.png "Decorative") and then select an HTML element on your page. ![TheLantern_Element.png](images/TheLantern_Element.png "Screenshot showing selected element The Lantern 23 October 1978")

The Developer Tools will highlight where the element is located in the HTML and provide a tooltip with additional information about the element.  

![TheLantern_DeveloperTools.png](images/TheLantern_DeveloperTools.png "Screenshot showing the selected element is highlighted, the location of selected element is highlighted in the Developer Tools and a tooltip now appears above the selected element in the HTML")

To view more details about the element's location in the DOM structure, right-click on the element in the Developer Tools, select **Copy > Copy element**, and then paste the text into Notepad or a similar text editor.

![TheLantern_CopyElement.png](images/TheLantern_CopyElement.png "Screenshot showing the div tag is selected for the element The Lantern 23 October 1978, the menu that appears after right clicking on the element, and the then location of Copy and Copy element in the menu")

![TheLantern_Notepad.png](images/TheLantern_Notepad.png "Screenshot of HTML snippet for selected element")
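
A copied element can also be examined in code. The sketch below feeds a small HTML snippet to the standard-library `html.parser.HTMLParser` and records each tag, its attributes, and any text. The snippet is a hypothetical stand-in for the copied element: its `class` and `href` values are invented, and `ElementCollector` is an illustrative class name.

```python
from html.parser import HTMLParser

# Hypothetical snippet; the class and href values are made up.
SAMPLE_HTML = (
    '<div class="example-result">'
    '<a href="/example-link">The Lantern 23 October 1978</a>'
    '</div>'
)


class ElementCollector(HTMLParser):
    """Collect (tag, attributes) pairs and non-empty text from HTML."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.elements.append((tag, dict(attrs)))

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


collector = ElementCollector()
collector.feed(SAMPLE_HTML)
collector.close()

print(collector.elements)
print(collector.text)
```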


---


## Activity 2: [Meet the Animals](https://nationalzoo.si.edu/animals/list) - Part 1

[Meet the Animals](https://nationalzoo.si.edu/animals/list) at the Smithsonian National Zoo & Conservation Biology Institute. Practice using Google's Developer Tools to find the following elements for one animal. Copy the elements you find to Notepad.

- Common name
- Scientific name
- Taxonomic information
     - Class
     - Order
     - Family
     - Genus and species
- Physical description
- Size
- Native habitat
- Conservation status
- Fun facts



# Step 4: Identify Python libraries for project
To gather XML and HTML data from websites and APIs, you'll need several Python libraries. Some libraries handle web server requests and responses, while others parse the retrieved content. Libraries like pandas and csv are used to store results and write them to .csv files.

## [requests](https://requests.readthedocs.io/en/latest/)
The [requests](https://requests.readthedocs.io/en/latest/) library retrieves HTML or XML documents from a server and processes the response. 

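In [None]:
import requests

url = "INSERT URL HERE"  # placeholder: replace with the page or API endpoint to fetch
response = requests.get(url, timeout=30)
response.raise_for_status()  # raise an error for 4xx/5xx responses
text = response.text  # the response body decoded as text (str)
content = response.content  # the raw response body as bytes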

## [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/)

[BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) parses HTML and XML documents, helping you search for and extract elements from the DOM. The first argument is the content to be parsed, and the second specifies the parsing library to use.

- `html.parser`: Python's built-in HTML parser; no extra installation is needed.
- `lxml`: a faster parser with more features.*
- `xml`: parses XML (requires `lxml`).
- `html5lib`: for HTML5 parsing.*

Additional keyword arguments (`**kwargs`) are available. See the [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) documentation for more information.


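In [None]:
import requests
from bs4 import BeautifulSoup

url = "INSERT URL HERE"  # placeholder: replace with the URL to parse
content = requests.get(url, timeout=30).content  # raw bytes of the response
soup = BeautifulSoup(content, 'xml')  # the 'xml' parser requires the lxml package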

Other Python libraries for parsing include:
- [lxml.html](https://lxml.de/lxmlhtml.html)
- [pyQuery](https://www.pyquery.org/pyquery-core-functions/)
- [Selenium](https://www.selenium.dev/documentation/)

Each library has its strengths and weaknesses. To learn more about different parsing tools, read Anish Chapagain's [Hands-On Web Scraping with Python, 2nd edition](https://library.ohio-state.edu/record=b10892054~S7).

*Verify that `lxml` or `html5lib` is installed in your Anaconda environment before using them.

## [pandas](https://pandas.pydata.org/docs/user_guide/index.html)
Pandas is a large Python library used for manipulating and analyzing tabular data. 

**Sample workflow**

1. `import pandas as pd`
2. Create a DataFrame to store results or data that you plan to gather. `DataFrame([data, index, columns, dtype, copy])`
3. Gather variables
4. Create a dictionary to store one row of variables.
5. Create a DataFrame to store the one row of variables.
6. Concatenate the DataFrame storing one row of variables to the DataFrame storing results that was constructed in step 2.
7. Export results to .csv



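In [1]:
import pandas as pd

# Step 2: empty DataFrame that will accumulate all results
results = pd.DataFrame(columns=['name', 'age', 'profession'])

# Step 3: gather variables
name = "Jordan"
age = 8
profession = "chipmunk control"

# Step 4: dictionary holding one row of variables
pet = {
    "name": name,
    "age": age,
    "profession": profession
    }

# Steps 5-6: build a one-row DataFrame and append it to the end of results
add_row_to_results = pd.DataFrame(pet, index=[0])
results = pd.concat([results, add_row_to_results], axis=0, ignore_index=True)

# Step 7: export to .csv without the index column
results.to_csv('pets2.csv', index=False)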

## [csv](https://docs.python.org/3/library/csv.html)
The csv module both writes and reads .csv data.

**Sample workflow**

1. `import csv`
2. Create an empty list named `dataSet`
3. Assign .csv headers to a list named `columns`
4. Define the `writeto_csv` function to write results to a .csv file
5. Gather variables
6. For each row of data, append a list of variables following the order of the .csv headers to the dataset list.
7. Use the writeto_csv function to write results to a .csv file 

In [None]:
import csv

# Create an empty dataset and define the CSV headings
dataSet = []
columns = ['name', 'age', 'profession']  # for CSV headings


# Define a function to write results to a CSV file
def writeto_csv(data, filename, columns):
    with open(filename, 'w', newline='', encoding="UTF-8") as file:
        writer = csv.writer(file)
        writer.writerow(columns)  # header row
        writer.writerows(data)  # one CSV row per list in data


# Gather variables
name = "Jordan"
age = 8
profession = "chipmunk control"

# Append one row, ordered to match the CSV headings
dataSet.append([name, age, profession])

writeto_csv(dataSet, 'pets.csv', columns)

## [time.sleep( )](https://docs.python.org/3/library/time.html#time.sleep)
Many APIs limit the number of requests you can make per second. `time.sleep()` suspends your program for the specified number of seconds, which lets you pause between requests.

In [None]:
import time
time.sleep(5)
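
In practice the pause usually sits inside a loop, between one request and the next. The sketch below shows that pattern with a stand-in `fake_fetch` function in place of a real request, so it runs offline. The names `fetch_all` and `fake_fetch` and the 0.1-second delay are assumptions for illustration; choose a real delay from the API's documented rate limit.

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive calls."""
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)  # wait before every request after the first
        results.append(fetch(url))
    return results


def fake_fetch(url):
    """Stand-in for a real request, e.g. requests.get(url).text."""
    return f"fetched {url}"


start = time.monotonic()
pages = fetch_all(["page-1", "page-2", "page-3"], fake_fetch, delay_seconds=0.1)
elapsed = time.monotonic() - start

print(pages)
```

Sleeping only between requests avoids a needless wait before the first call.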

## [datetime](https://docs.python.org/3/library/datetime.html#datetime.date.today)
It is good practice to include a `last_updated` column in any dataset you've created after gathering HTML or XML data. The datetime module can be used to identify the date you last ran your Python program.

In [None]:
from datetime import date
today = date.today()

last_updated=today
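
One possible way to carry the date into a dataset is to append it to every row before writing the file. The sketch below reuses the sample pet row from the csv example and writes to an in-memory buffer rather than a file; storing the date in ISO format (`YYYY-MM-DD`) is an assumption, not a requirement.

```python
import csv
import io
from datetime import date

# ISO format (YYYY-MM-DD) sorts correctly and is unambiguous.
last_updated = date.today().isoformat()

columns = ["name", "age", "profession", "last_updated"]
rows = [["Jordan", 8, "chipmunk control"]]

# Add the same last_updated value to the end of every row.
stamped_rows = [row + [last_updated] for row in rows]

# Write to an in-memory buffer; swap in open(filename, 'w', newline='') for a file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(columns)
writer.writerows(stamped_rows)
csv_text = buffer.getvalue()

print(csv_text)
```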

# Step 5: Write code

## [Spyder](https://docs.spyder-ide.org/current/index.html) 
[Spyder](https://docs.spyder-ide.org/current/index.html) is an integrated development environment (IDE) that offers quick feedback as you iteratively build your code. Designed by and for scientists, engineers, and data analysts, Spyder lets you write code interactively, explore your data, and more.

## Activity 3: The Lantern - Part 2
Using Spyder and our XML search for [homecoming parade](https://osupublicationarchives.osu.edu/?a=q&r=1&results=1&tyq=ARTICLE&e=------197-en-20-LTN-1--txt-txIN-homecoming+parade------&f=XML) in The Lantern, gather the following elements:

- unique_id
- article_title
- article_type

Then use the unique_id to gather the following elements for each publication:

- publication_date
- publication_text

Export the elements to a .csv file.

