# 📂 Data Collection Techniques: Definitions

### 1. Public Datasets
**Definition:** Curated collections of data that are freely available for public access and research. These datasets are often pre-cleaned and formatted, serving as a standard resource for benchmarking machine learning models and statistical analysis.
* *Examples:* Kaggle, UCI Machine Learning Repository, Government Data Portals.

### 2. Databases
**Definition:** Systematically organized collections of data stored and accessed electronically. They allow efficient storage, retrieval, and manipulation of large volumes of data, whether it is structured in relational tables (SQL) or kept as semi-structured or unstructured records such as documents and key-value pairs (NoSQL).
* *Examples:* MySQL, PostgreSQL, MongoDB, SQLite.
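
As a rough illustration of the storage and retrieval described above, here is a minimal sketch using SQLite through Python's built-in `sqlite3` module. The `books` table and its three rows are made up for the example; they are not part of any dataset used later.

```python
import sqlite3

# In-memory database, so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO books (id, title, year) VALUES (?, ?, ?)",
    [(1, "Carrie", 1974), (2, "Salem's Lot", 1975), (3, "The Shining", 1977)],
)
conn.commit()

# Parameterized query: the ? placeholder keeps the value out of the SQL string.
rows = conn.execute(
    "SELECT title, year FROM books WHERE year >= ? ORDER BY year", (1975,)
).fetchall()
conn.close()

print(rows)
```

The same pattern (connect, execute, fetch) should carry over to server databases such as MySQL or PostgreSQL, although each needs its own driver package.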

### 3. APIs (Application Programming Interfaces)
**Definition:** A set of defined rules and protocols that enables different software applications to communicate with each other. In data collection, APIs allow developers to programmatically request and receive specific data from a server in a structured format (like JSON) without accessing the user interface.
* *Examples:* Twitter API, Google Maps API, OpenWeatherMap.
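
To make "structured format (like JSON)" concrete, the sketch below parses a JSON response body with the standard `json` module. The `raw_body` string is a hypothetical payload written for this example, not the output of a real API call.

```python
import json

# Hypothetical response body, shaped like a typical JSON API payload.
raw_body = (
    '{"data": ['
    '{"id": 1, "Title": "Carrie", "Year": 1974}, '
    '{"id": 4, "Title": "Rage", "Year": 1977}'
    ']}'
)

payload = json.loads(raw_body)   # JSON text -> Python dict
books = payload["data"]          # list of dicts, one per record
titles = [book["Title"] for book in books]

print(titles)
```

With the `requests` library, `res.json()` performs the same text-to-dict step on a live response.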

### 4. Web Scraping
**Definition:** The automated process of extracting large amounts of data from websites using software scripts. This technique parses the HTML code of web pages to gather specific information when no official API or dataset is available.
* *Examples:* Price monitoring bots, scraping news headlines, lead generation.
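
The idea of parsing HTML to pull out specific pieces can be sketched with nothing but the standard library. This is only an illustration, assuming a tiny hand-written HTML snippet; the `HeadingCollector` class is a hypothetical helper, and real pages are usually easier to handle with a dedicated parser such as BeautifulSoup.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collects the text found inside every <h3> element."""

    def __init__(self):
        super().__init__()
        self._in_h3 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._in_h3 = True
            self.headings.append("")  # start a new heading

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_h3 = False

    def handle_data(self, data):
        if self._in_h3:
            self.headings[-1] += data  # text may arrive in several chunks


# Hand-written snippet standing in for a downloaded page.
html = "<div><h3> Andorra </h3><p>ignored</p><h3>Anguilla</h3></div>"

parser = HeadingCollector()
parser.feed(html)
parser.close()

headings = [text.strip() for text in parser.headings]
print(headings)
```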

# Working with APIs

In [1]:
# Stephen King Books - API
import requests

In [8]:
URL = "https://stephen-king-api.onrender.com/api/books"

res = requests.get(URL)

In [9]:
res

<Response [200]>

In [10]:
json_data = res.json()

In [16]:
import pandas as pd

df = pd.json_normalize(json_data["data"])
df

Unnamed: 0,id,Year,Title,handle,Publisher,ISBN,Pages,Notes,created_at,villains
0,1,1974,Carrie,carrie,Doubleday,978-0-385-08695-0,199,[],2023-11-13T23:48:47.848Z,"[{'name': 'Tina Blake', 'url': 'https://stephe..."
1,2,1975,Salem's Lot,salem-s-lot,Doubleday,978-0-385-00751-1,439,"[Nominee, World Fantasy Award, 1976[2]]",2023-11-13T23:48:48.098Z,"[{'name': 'Kurt Barlow', 'url': 'https://steph..."
2,3,1977,The Shining,the-shining,Doubleday,978-0-385-12167-5,447,"[Runner-up (4th place), Locus Award for Best F...",2023-11-13T23:48:48.219Z,"[{'name': 'Horace M. Derwent', 'url': 'https:/..."
3,4,1977,Rage,rage,Signet Books,978-0-451-07645-8,211,[Published under pseudonym Richard Bachman],2023-11-13T23:48:48.339Z,[]
4,5,1978,The Stand,the-stand,Doubleday,978-0-385-12168-2,823,"[Nominee, World Fantasy Award, 1979, Runner-up...",2023-11-13T23:48:48.477Z,"[{'name': 'Donald Merwin Elbert', 'url': 'http..."
...,...,...,...,...,...,...,...,...,...,...
58,59,2018,The Outsider,the-outsider,Scribner,978-1-50118-098-9,576,"[First novel in the Holly Gibney Series, Runne...",2023-11-13T23:48:55.165Z,"[{'name': 'The Outsider (Creature)', 'url': 'h..."
59,60,2018,Elevation,elevation,Scribner,978-1-98210-231-9,144,[],2023-11-13T23:48:55.291Z,[]
60,61,2019,The Institute,the-institute,Scribner,978-1-98211-056-7,576,"[Nominee, British Fantasy Award's August Derle...",2023-11-13T23:48:55.457Z,"[{'name': 'Gladys Hickson', 'url': 'https://st..."
61,62,2021,Later,later,Hard Case Crime,978-1-78909-649-1,256,[],2023-11-13T23:48:55.579Z,[]
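
The `villains` column above still holds nested lists of dictionaries. One possible way to flatten it, sketched here in plain Python, is to emit one row per (book, villain) pair. The two records below are trimmed copies of rows shown above with the villain `url` left out; the `book_id` and `villain` key names are chosen for this example.

```python
# Two records shaped like entries of json_data["data"] (urls omitted).
books = [
    {"id": 1, "Title": "Carrie", "villains": [{"name": "Tina Blake"}]},
    {"id": 4, "Title": "Rage", "villains": []},
]

# One output row per (book, villain) pair.
# Books with an empty villains list contribute no rows.
villain_rows = [
    {"book_id": book["id"], "Title": book["Title"], "villain": villain["name"]}
    for book in books
    for villain in book["villains"]
]

print(villain_rows)
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(villain_rows)`. `pd.json_normalize` with its `record_path` and `meta` arguments is an alternative worth trying on the full response.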


In [17]:
df = df[['id', 'Title', 'Year', 'ISBN']]
df

Unnamed: 0,id,Title,Year,ISBN
0,1,Carrie,1974,978-0-385-08695-0
1,2,Salem's Lot,1975,978-0-385-00751-1
2,3,The Shining,1977,978-0-385-12167-5
3,4,Rage,1977,978-0-451-07645-8
4,5,The Stand,1978,978-0-385-12168-2
...,...,...,...,...
58,59,The Outsider,2018,978-1-50118-098-9
59,60,Elevation,2018,978-1-98210-231-9
60,61,The Institute,2019,978-1-98211-056-7
61,62,Later,2021,978-1-78909-649-1


In [58]:
# Stranger Things Quotes - API
URL = "https://strangerthings-quotes.vercel.app/api/quotes/100"

res = requests.get(URL)

In [59]:
res

<Response [200]>

In [60]:
json_data = res.json()

In [61]:
json_data

[{'quote': 'It’s called code shut-your-mouth.', 'author': 'Erica Sinclair'},
 {'quote': 'I asked if you wanted to be my friend. And you said yes. You said yes. It was the best thing I’ve ever done.',
  'author': 'Mike Wheeler'},
 {'quote': 'We’re talking about the destruction of our world as we know it.',
  'author': 'Lucas Sinclair'},
 {'quote': 'I’m the monster.', 'author': 'Eleven'},
 {'quote': 'It’s just, I know I can be a jerk like him sometimes, and I do not want to be like him. Ever.',
  'author': 'Max Mayfield'},
 {'quote': 'She is our friend and she is crazy.',
  'author': 'Dustin Henderson'},
 {'quote': 'Sometimes, I impress even myself.', 'author': 'Sam Owens'},
 {'quote': 'This is finger lickin’ good.', 'author': 'Steve Harrington'},
 {'quote': 'She is hotter than phoebe’s cats.', 'author': 'Dustin Henderson'},
 {'quote': 'I will never, ever let anything bad happen to you ever again. Whatever’s going on in you, we’re gonna fix it.',
  'author': 'Joyce Byer

In [62]:
import pandas as pd

df = pd.json_normalize(json_data)
df

Unnamed: 0,quote,author
0,It’s called code shut-your-mouth.,Erica Sinclair
1,I asked if you wanted to be my friend. And you...,Mike Wheeler
2,We’re talking about the destruction of our wor...,Lucas Sinclair
3,I’m the monster.,Eleven
4,"It’s just, I know I can be a jerk like him som...",Max Mayfield
...,...,...
95,"I just feel whole, like a piece of me was miss...",Kali Prasad
96,I felt this evil like it was looking at me.,Will Byers
97,I told you a million times my teeth are coming...,Dustin Henderson
98,We’re just friends.,Robin Buckley


# Web Scraping 
### Involves programmatically extracting data from websites

In [6]:
import requests

# https://toscrape.com/
# https://www.scrapethissite.com/

URL = "https://www.scrapethissite.com/pages/simple/"
res = requests.get(URL)

print(res.status_code)

200


In [5]:
res

# Common Status Codes
# 200 --> OK
# 301/302 --> Redirect
# 400 --> Bad Request
# 401 --> Unauthorized
# 403 --> Forbidden
# 404 --> Not Found
# 500 --> Server Error

<Response [200]>
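
The status codes listed above fall into families by their first digit. As a small sketch, the standard library's `http.HTTPStatus` enum can turn a numeric code into its official phrase; the `describe_status` helper and its family labels are hypothetical names made up for this example.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return e.g. '404 Not Found (client error)' for a standard status code."""
    families = {2: "success", 3: "redirect", 4: "client error", 5: "server error"}
    phrase = HTTPStatus(code).phrase  # raises ValueError for unknown codes
    family = families.get(code // 100, "informational")
    return f"{code} {phrase} ({family})"


for code in (200, 301, 403, 404, 500):
    print(describe_status(code))
```

In a scraper, the same check is often reduced to `res.status_code == 200`, or to `res.raise_for_status()`, which raises an exception for 4xx and 5xx responses.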

In [19]:
if res.status_code == 200:
    # print(res.text)
    # print(res.content)
    print(res.headers)
else:
    print("Failed to Fetch page")

{'Date': 'Tue, 30 Dec 2025 12:46:25 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Nel': '{"report_to":"heroku-nel","response_headers":["Via"],"max_age":3600,"success_fraction":0.01,"failure_fraction":0.1}', 'Report-To': '{"group":"heroku-nel","endpoints":[{"url":"https://nel.heroku.com/reports?s=TjWUTaZresfZApeYnYeybV0WXVmv8wkZiEzUk3dHbp8%3D\\u0026sid=67ff5de4-ad2b-4112-9289-cf96be89efed\\u0026ts=1767098785"}],"max_age":3600}', 'Reporting-Endpoints': 'heroku-nel="https://nel.heroku.com/reports?s=TjWUTaZresfZApeYnYeybV0WXVmv8wkZiEzUk3dHbp8%3D&sid=67ff5de4-ad2b-4112-9289-cf96be89efed&ts=1767098785"', 'Server': 'cloudflare', 'Via': '1.1 heroku-router', 'cf-cache-status': 'DYNAMIC', 'Content-Encoding': 'zstd', 'CF-RAY': '9b61a3ccd9b844ba-SIN', 'alt-svc': 'h3=":443"; ma=86400'}


In [20]:
# Downloading & Storing to Files
import os

os.makedirs("scraped_data", exist_ok=True)  # create the output folder if it is missing

with open("scraped_data/data1.html", "w", encoding="utf-8") as f:
    f.write(res.text)

In [21]:
# Working with BeautifulSoup
from bs4 import BeautifulSoup

# Read with the same encoding the file was written with
with open("scraped_data/data1.html", "r", encoding="utf-8") as f:
    html_content = f.read()

soup = BeautifulSoup(html_content, "lxml")

In [70]:
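# Different ways to look up tags; only all_h3 is used below
soup.find("h1")               # first <h1> tag
soup.h1                       # shorthand for soup.find("h1")
soup.find("h3")               # first <h3> tag
all_h3 = soup.find_all("h3")  # list of every <h3> tag (one per country)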

all_countries = []

for h3 in all_h3:
    name = h3.get_text(strip=True)
    # print(h3.find_parent("div")["class"])

    population = h3.find_next("div").select("span.country-population")[0].get_text(strip=True)
    # print(h3.find_next("div").select_one("span.country-population").get_text(strip=True))

    # print(h3.attrs)

    all_countries.append([name, population])

    # DOM --> Document Object Model
    # print(h3.parent)
    # print(h3.children) ...also h3.descendants / h3.next_sibling / h3.previous_sibling (attributes, not methods)

In [71]:
import pandas as pd

df = pd.DataFrame(all_countries, columns = ["Name", "Population"])
df

Unnamed: 0,Name,Population
0,Andorra,84000
1,United Arab Emirates,4975593
2,Afghanistan,29121286
3,Antigua and Barbuda,86754
4,Anguilla,13254
...,...,...
245,Yemen,23495361
246,Mayotte,159042
247,South Africa,49000000
248,Zambia,13460305
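
Values produced by `get_text()` are strings, so the `Population` column above holds text rather than numbers. A minimal sketch of the conversion in plain Python, using three rows copied from the table above:

```python
# A few [name, population] pairs as scraped; the populations are strings.
all_countries = [["Andorra", "84000"], ["Anguilla", "13254"], ["Zambia", "13460305"]]

# Convert the population text to int so it sorts numerically, not alphabetically.
typed = [(name, int(population)) for name, population in all_countries]
largest_first = sorted(typed, key=lambda pair: pair[1], reverse=True)

print(largest_first[0])
```

On the DataFrame itself, `df["Population"] = df["Population"].astype(int)` should do the same conversion, assuming every scraped value is a plain integer string.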


In [67]:
df.to_csv("cleaned_data/data1.csv", index = False)

In [66]:
# soup