# 2) Creating a DataFrame from JSON Data

In [1]:
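import pandas as pd
from io import StringIO

# Create a JSON string manually
json_data = '{"country":["Canada","England","Japan"],"region":["North America","Europe","Asia"]}'

# Convert the JSON string into a pandas DataFrame.
# Wrapping it in StringIO avoids the FutureWarning that newer pandas versions
# raise when a literal JSON string is passed to read_json.
df_country = pd.read_json(StringIO(json_data))
df_total_rows = df_country.shape[0]

# Print the DataFrame and its number of rows
print(df_country)
print(df_total_rows)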

   country         region
0   Canada  North America
1  England         Europe
2    Japan           Asia
3




# 3) Data Formats in API Retrieval

When we request data from an API, the response usually comes in a specific format, allowing easy data exchange and integration with different systems. The two most commonly used data formats are `JSON` and `XML`.

In the context of APIs, JSON data is structured to provide information in a neat and easily accessible manner. For example, the Global Economic Indicators API might return data in a format like this:

```json
{
  "Indicator": "GDP (current US$)",
  "Country": "United States",
  "Year": 2019,
  "Value": 21433226000000
}
```
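To make the two formats concrete, here is a minimal sketch that parses the record above from JSON and from an XML rendering of the same data. The XML layout (a `record` root with one child element per field) is an assumption for illustration only, not something the API is known to return.

```python
import json
import xml.etree.ElementTree as ET

import pandas as pd

# The JSON record shown above, as it might arrive in a response body.
json_text = """
{
  "Indicator": "GDP (current US$)",
  "Country": "United States",
  "Year": 2019,
  "Value": 21433226000000
}
"""

# Hypothetical XML version of the same record (element names are assumed).
xml_text = """
<record>
  <Indicator>GDP (current US$)</Indicator>
  <Country>United States</Country>
  <Year>2019</Year>
  <Value>21433226000000</Value>
</record>
"""

# json.loads returns a dict, and numbers keep their numeric types.
json_record = json.loads(json_text)

# XML stores every value as text, so numeric fields are converted by hand.
root = ET.fromstring(xml_text.strip())
xml_record = {child.tag: child.text for child in root}
xml_record["Year"] = int(xml_record["Year"])
xml_record["Value"] = int(xml_record["Value"])

# A single record becomes a one-row DataFrame.
df_record = pd.DataFrame([json_record])
print(df_record)
print(json_record == xml_record)
```

Both routes end with the same dictionary, but the JSON path needs no manual type conversion, which is one reason JSON tends to be the more convenient format to load into pandas.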

In [2]:
import requests

# Send a GET request to the API server
response = requests.get('https://api-server.dataquest.io/economic_data/countries')

# Decode the JSON response body
economic_data = response.json()

print(economic_data)

[{"country_code": "ABW", "short_name": "Aruba", "table_name": "Aruba", "long_name": "Aruba", "2-alpha_code": "AW", "currency_unit": "Aruban florin", "special_notes": null, "region": "Latin America & Caribbean", "income_group": "High income", "wb-2_code": "AW", "national_accounts_base_year": "2013", "national_accounts_reference_year": null, "sna_price_valuation": "Value added at basic prices (VAB)", "lending_category": null, "other_groups": null, "system_of_national_accounts": "Country uses the 1993 System of National Accounts methodology", "alternative_conversion_factor": null, "ppp_survey_year": null, "balance_of_payments_manual_in_use": "BPM6", "external_debt_reporting_status": null, "system_of_trade": "General trade system", "government_accounting_concept": null, "imf_data_dissemination_standard": "Enhanced General Data Dissemination System (e-GDDS)", "latest_population_census": "2020 (expected)", "latest_household_survey": null, "source_of_most_recent_income_and_expenditure_data": 

# 4) Reading JSON API Data into a DataFrame

In [3]:
import pandas as pd
import requests
from io import StringIO

# API endpoint URL
url = 'https://api-server.dataquest.io/economic_data/indicators'

# Send a GET request to the API server and decode the response body
response = requests.get(url)
data = response.json()

# Convert the JSON data into a pandas DataFrame.
# This endpoint appears to return a JSON-encoded string, so it is wrapped in
# StringIO to avoid the FutureWarning for literal strings in read_json.
df_economic = pd.read_json(StringIO(data))

# value_counts() counts each source; idxmax() returns the label with the highest count
most_frequent_source = df_economic['source'].value_counts().idxmax()

print(most_frequent_source)

International Monetary Fund, Balance of Payments Statistics Yearbook and data files.
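The `value_counts().idxmax()` pattern used in the cell above is easier to see on a small example. The sketch below uses made-up source labels (they are not taken from the API) to show that `value_counts()` returns a Series indexed by the distinct values, and `idxmax()` returns the index label with the largest count.

```python
import pandas as pd

# Toy data: hypothetical source labels, for illustration only.
sources = pd.Series(["IMF", "World Bank", "IMF", "OECD", "IMF", "World Bank"])

# value_counts() returns a Series: index = distinct values, values = counts.
counts = sources.value_counts()
print(counts)

# idxmax() returns the index label (not the position) of the largest count.
most_frequent = counts.idxmax()
print(most_frequent)
```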




# 5) Handling Data without APIs

Many websites and data sources don't provide APIs due to technical limitations, policy restrictions, or simply a lack of demand. However, this doesn't mean we can't extract the data.

One common approach is to extract data directly from a web page's HTML. **HTML**, which stands for **HyperText Markup Language**, is the standard language for creating web pages. It uses a series of elements to structure and style content. We can extract the underlying data by parsing a web page's HTML.

Fortunately, `pandas` provides a convenient function, `pd.read_html()`, which can extract tabular data from HTML content directly into a DataFrame. This function is particularly useful for scraping data from web pages that contain structured data in the form of HTML tables.
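As a quick offline illustration of `pd.read_html()`, the sketch below parses a small inline HTML table instead of a live page. The table contents are made up for the example, and the call assumes an HTML parser backend that pandas can use (such as `lxml`, or `bs4` together with `html5lib`) is installed.

```python
from io import StringIO

import pandas as pd

# A small, made-up HTML document containing a single table.
html = """
<table>
  <thead>
    <tr><th>country</th><th>region</th></tr>
  </thead>
  <tbody>
    <tr><td>Canada</td><td>North America</td></tr>
    <tr><td>England</td><td>Europe</td></tr>
    <tr><td>Japan</td><td>Asia</td></tr>
  </tbody>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(html))
df_table = tables[0]

print(len(tables))
print(df_table)
```

Because the function always returns a list, indexing with `[0]` is needed even when the page contains only one table.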

The code below, for example, sends an HTTP request to a URL, parses the returned HTML to find the first table, and then uses the pandas `read_html()` function to convert that table into a DataFrame, which it prints.

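```Python
import pandas as pd
import requests
from bs4 import BeautifulSoup
from io import StringIO

url = 'https://www.iana.org/help/example-domains'  # Example URL
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find('table')  # first <table> element, or None if the page has no table
# Newer pandas versions expect a file-like object rather than a literal HTML string
df = pd.read_html(StringIO(str(table)))[0]
print(df)
```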


In [2]:
import requests
from bs4 import BeautifulSoup
from io import StringIO
import pandas as pd

# URL of the webpage containing the table
url = 'https://dataquestio.github.io/web-scraping-pages/'
response = requests.get(url)

# Parse the HTML content of the webpage
soup = BeautifulSoup(response.content, 'html.parser')

# Find the first table in the HTML
table = soup.find('table')

# Convert the HTML table into a pandas DataFrame.
# read_html returns a list of DataFrames, so [0] selects the first one;
# StringIO avoids the FutureWarning for literal HTML strings.
df = pd.read_html(StringIO(str(table)))[0]

# Find the number of columns
df_columns = df.shape[1]

print(df)

                                Location  Population % of world         Date  \
0                                  World  8232000000       100%  13 Jun 2025   
1                                  India  1417492000      17.3%   1 Jul 2025   
2                                  China  1408280000      17.2%  31 Dec 2024   
3                          United States   340110988       4.1%   1 Jul 2024   
4                              Indonesia   284438782       3.5%  30 Jun 2025   
..                                   ...         ...        ...          ...   
237                   Niue (New Zealand)        1681         0%  11 Nov 2022   
238                Tokelau (New Zealand)        1647         0%   1 Jan 2019   
239                         Vatican City         882         0%  31 Dec 2024   
240  Cocos (Keeling) Islands (Australia)         593         0%  30 Jun 2020   
241                Pitcairn Islands (UK)          35         0%   1 Jul 2023   

    Source (official or from the United



# 6) Exploring HTML Pages

HTML pages include **paragraphs**, **links**, **lists**, and various elements with different attributes that can contain valuable data. To effectively scrape and analyze this data, we need to understand how to navigate and extract from these varied elements.

* **Paragraphs** (`p`): These are typically blocks of text. You can extract text content from them.

* **Links** (`a`): They have an `href` attribute that you can extract to get the URL they point to.

* **Classes and IDs**: Many HTML elements have `class` or `id` attributes that can be used to locate them. An ID is meant to be unique within a page, while a class can be shared by a group of related elements.

* **Other Elements**: Such as `span`, `div`, `header`, `footer`, etc., that might contain specific data or be used to structure the data in a certain way.
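The element types listed above can be sketched with BeautifulSoup on a small inline HTML snippet. The markup, class names, and ID here are invented for illustration; real pages will use their own.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only to demonstrate the lookups.
html = """
<div id="intro" class="section">
  <p class="lead">First paragraph.</p>
  <p>Second paragraph with a <a href="/wiki/Main_Page">link</a>.</p>
</div>
<div class="section">
  <span>Some inline text</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Paragraphs: collect the text of every <p> element.
paragraphs = [p.get_text().strip() for p in soup.find_all("p")]

# Links: read the href attribute of every <a> element.
hrefs = [a.get("href") for a in soup.find_all("a")]

# IDs: an id should match at most one element on the page.
intro = soup.find(id="intro")

# Classes: a class can match several elements (note the trailing underscore).
sections = soup.find_all("div", class_="section")
lead_text = soup.find("p", class_="lead").get_text()

# Other elements are looked up the same way, by tag name.
span_text = soup.find("span").get_text()

print(paragraphs)
print(hrefs)
print(intro.name, len(sections), lead_text, span_text)
```

The keyword is spelled `class_` because `class` is a reserved word in Python.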

Let's explore how we can create a DataFrame that contains all the **paragraph** texts from this Wikipedia page. 

```Python
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://dataquestio.github.io/web-scraping-pages/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')


# Extracting paragraphs and creating a DataFrame
wikipedia_paragraphs = []
for p in soup.find_all('p'):
    text = p.get_text().strip()  # strip() removes leading and trailing whitespace
    if text:  # Only append non-empty strings
        wikipedia_paragraphs.append(text)

wikipedia_df = pd.DataFrame(wikipedia_paragraphs)
print(wikipedia_df)

```

We can quickly determine the **number of paragraphs** on this page by reading the `.shape` attribute of our DataFrame, i.e., `wikipedia_df.shape[0]`.

In the following exercise, we'll scrape the same Wikipedia page to extract the `text` and `href` attributes from all **hyperlink (anchor)** elements.

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://dataquestio.github.io/web-scraping-pages/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Lists to store the text and href attribute of every hyperlink (anchor) element
link_texts = []
link_hrefs = []

# Loop over each anchor (<a>) element
for a in soup.find_all('a'):
    text = a.get_text().strip()  # strip() removes leading and trailing whitespace
    link = a.get('href')  # returns None if the anchor has no href attribute

    link_texts.append(text)
    link_hrefs.append(link)

#create dictionary to store link texts and hrefs
links_dict = {'Link Text':link_texts,'URL':link_hrefs}

#convert dictionary into pandas dataframe
df_links = pd.DataFrame(links_dict)

print(df_links.head())

         Link Text                          URL
0  Jump to content                 #bodyContent
1        Main page              /wiki/Main_Page
2         Contents     /wiki/Wikipedia:Contents
3   Current events  /wiki/Portal:Current_events
4   Random article         /wiki/Special:Random
