## Collecting Data

## Learning Objectives
- Understand different data sources.
- Read and write data from flat files.
- Work with APIs.
- Scrape data from web pages.
- Connect to SQL databases.
- Manage large datasets efficiently.
- Understand basic data pipelines and versioning.

![DS_Process](./images/12_page.jpg)

![Collecting_Data](./images/13_page.jpg)

![Collecting_Data_2](./images/14_page.jpg)

![Collecting_Data_3](./images/15_page.jpg)

# 📊 Where to Find Data for Data Science

## Research & Institutional Data
- [**NIMH Data Archive (NDA)**](https://nda.nih.gov/) – Human subjects data from hundreds of research projects in mental health and neuroscience. *(requires application for access)*
- [**StatLib Datasets Archive**](http://lib.stat.cmu.edu/datasets/) – Classic collection of older but useful datasets hosted by Carnegie Mellon.
- [**UCI Machine Learning Repository**](https://archive.ics.uci.edu/) – Classic benchmark datasets for machine learning research.

---

## Competitions & Community
- [**Kaggle Datasets**](https://www.kaggle.com/datasets) – Thousands of community-contributed datasets, with notebooks and competitions.
- [**r/datasets (Reddit)**](https://www.reddit.com/r/datasets/) – A subreddit where users post and request datasets.
- [**Awesome Public Datasets (GitHub)**](https://github.com/awesomedata/awesome-public-datasets) – Curated list of dataset sources across domains.

---

## Open Data Portals
- [**Google Dataset Search**](https://datasetsearch.research.google.com/) – Like Google search, but just for datasets.
- [**Data.gov**](https://www.data.gov/) – US Government open data (health, climate, education, etc.).
- [**European Union Open Data Portal**](https://data.europa.eu/) – EU datasets across policy areas.
- [**UN Data**](http://data.un.org/) – Global statistics from the United Nations.
- [**World Bank Open Data**](https://data.worldbank.org/) – Economic, social, and development indicators.

---

## APIs & Crawls
- [**Programmable Web**](https://www.programmableweb.com) – Large directory of public APIs (some links are aging, but still useful for discovery).
- [**Common Crawl**](https://commoncrawl.org/) – Open web crawl data for text mining, NLP, and search projects.
- [**Internet Archive**](https://archive.org/) – Huge archive of books, websites, audio, video, and more.
- **Other APIs**: Twitter/X API, Reddit API, News API, Spotify API – Great for building projects around social or media data.

---

## Special Interest Data Sources
- [**IMDb Datasets**](https://www.imdb.com/interfaces/) – Movie/TV metadata.
- [**OpenStreetMap (OSM)**](https://www.openstreetmap.org/) – Global geospatial data (with [Overpass API](https://overpass-api.de/)).
- [**FiveThirtyEight Data**](https://data.fivethirtyeight.com/) – Datasets behind 538’s journalism.
- [**Our World in Data**](https://ourworldindata.org/) – Research-driven datasets on health, climate, energy, etc.
- [**Awesome ML Datasets (Wikipedia list)**](https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research) – Overview of standard datasets.

---

## Personal Data Projects
- **Export Your Own Data**: Google Takeout, Facebook/Instagram export, Fitbit/Apple Health, bank statements.  

## Data Acquisition

Data acquisition is the process of collecting and preparing data from various sources. It's often the most time-consuming part of a data science project. Without high-quality data, analysis and modeling will be flawed.

## Reading Flat Files

We start by reading common file types such as CSV, Excel, JSON, and Parquet.

In [1]:
import pandas as pd

# CSV
# df_csv = pd.read_csv('data.csv')

# Excel
# df_excel = pd.read_excel('data.xlsx')

# JSON
# df_json = pd.read_json('data.json')

# Parquet
# df_parquet = pd.read_parquet('data.parquet')
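
The cell above only reads files. As a minimal sketch of the write side, the example below round-trips a small DataFrame through CSV and JSON in a temporary directory. The column names and values are made up for illustration. Excel and Parquet follow the same pattern with `to_excel` and `to_parquet`, but they need an optional engine installed (for example `openpyxl` or `pyarrow`), so they are left out here.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({"city": ["Lisbon", "Porto"], "temp": [18.5, 16.0]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "data.csv"
    json_path = Path(tmp) / "data.json"

    # Write: index=False avoids saving the row index as an extra column
    df.to_csv(csv_path, index=False)
    df.to_json(json_path, orient="records")

    # Read the files back into new DataFrames
    df_csv = pd.read_csv(csv_path)
    df_json = pd.read_json(json_path, orient="records")

print(df_csv)
print(df_json)
```

Forgetting `index=False` is a common source of stray `Unnamed: 0` columns when the CSV is read back.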

## Accessing APIs

APIs (Application Programming Interfaces) allow us to **communicate with other applications or services** over the internet.  
One of the most common types of APIs is the **REST API** (Representational State Transfer).

## 🔹 What is a REST API?
- A REST API exposes **endpoints** (URLs) that clients can send requests to.
- Clients typically use these **HTTP methods**:
  - `GET` → retrieve data
  - `POST` → send new data (create a resource)
  - `PUT` / `PATCH` → update data (`PUT` replaces a resource, `PATCH` changes part of it)
  - `DELETE` → remove data
- The server responds with a status code and data, often in **JSON format**, which is easy to parse in Python.
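
To make the methods above concrete, here is a small sketch that builds one request per method with `requests` without sending anything over the network. The endpoint `https://api.example.com/books`, the ID `42`, and the payloads are hypothetical, chosen only to show how each method maps to a URL and a body.

```python
import requests

# Hypothetical endpoint, used only for illustration
BASE_URL = "https://api.example.com/books"

# prepare() builds the exact request that would be sent, without sending it
get_req = requests.Request("GET", BASE_URL, params={"limit": 5}).prepare()
post_req = requests.Request("POST", BASE_URL, json={"title": "Dune"}).prepare()
patch_req = requests.Request("PATCH", f"{BASE_URL}/42", json={"price": 9.99}).prepare()
delete_req = requests.Request("DELETE", f"{BASE_URL}/42").prepare()

for req in (get_req, post_req, patch_req, delete_req):
    # GET and DELETE carry no body here; POST and PATCH carry JSON
    print(req.method, req.url, req.body)
```

Query parameters end up in the URL, while the `json=` argument is serialized into the request body and sets the `Content-Type` header to `application/json`.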

## 🔹 Example 1: Using the `requests` library

In [2]:
import requests

# API endpoint (a RESTful URL)
url = 'https://catfact.ninja/fact'

# Send a GET request; the timeout keeps the call from hanging indefinitely
response = requests.get(url, timeout=10)

# Raise an exception if the server returned an error status (4xx or 5xx)
response.raise_for_status()

# Convert the response (JSON) into a Python dictionary
data = response.json()

# Print the result
print("Random cat fact:", data['fact'])

Random cat fact: The claws on the cat’s back paws aren’t as sharp as the claws on the front paws because the claws in the back don’t retract and, consequently, become worn.


In [10]:
# Example 2: fetch a random dog image URL
import requests

url = 'https://dog.ceo/api/breeds/image/random'
response = requests.get(url)
data = response.json()

print("Random dog image URL:", data['message'])

Random dog image URL: https://images.dog.ceo/breeds/schnauzer-miniature/n02097047_151.jpg
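
API responses are often nested, while analysis usually needs a flat table. The sketch below flattens a list of nested records into a DataFrame with `pd.json_normalize`. The payload is hypothetical and only mimics the kind of structure `response.json()` might return.

```python
import pandas as pd

# Hypothetical payload, shaped like what response.json() might return
payload = [
    {"id": 1, "breed": {"name": "schnauzer", "size": "small"}},
    {"id": 2, "breed": {"name": "husky", "size": "large"}},
]

# Nested keys become columns such as "breed_name" and "breed_size"
df_api = pd.json_normalize(payload, sep="_")
print(df_api)
```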


## Web Scraping

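Use BeautifulSoup to extract data from web pages. Scrape responsibly: check the site's terms of service and `robots.txt`, and keep your request rate low so you do not overload the server.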

In [11]:
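import requests
from bs4 import BeautifulSoup

# Download the page and parse its HTML
html = requests.get('https://books.toscrape.com/', timeout=10).text
soup = BeautifulSoup(html, 'html.parser')

# Each book is an element with class "product_pod"; the title is an attribute of the link inside its <h3>
titles = [book.h3.a['title'] for book in soup.select('.product_pod')]
titles[:5]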

['A Light in the Attic',
 'Tipping the Velvet',
 'Soumission',
 'Sharp Objects',
 'Sapiens: A Brief History of Humankind']
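
One way to act on "scrape responsibly" is to check a site's `robots.txt` before fetching a page. The sketch below uses the standard library's `urllib.robotparser` on a hypothetical `robots.txt` supplied as a string, so it runs offline. The rules, the user agent name `course-scraper`, and the `example.com` URLs are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at https://<site>/robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "course-scraper"  # hypothetical name for our scraper

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether this user agent may fetch each URL
allowed = parser.can_fetch(USER_AGENT, "https://example.com/catalogue/page-1.html")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/report.html")

# Seconds to wait between requests, if the site specifies it
delay = parser.crawl_delay(USER_AGENT)

print(allowed, blocked, delay)
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` would download the real file instead of parsing a string.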

## SQL Databases

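Use SQLite for local testing. In production, you'd use PostgreSQL, MySQL, or similar. Note that `sqlite3.connect` creates the database file if it does not exist, so listing the tables of a fresh `example.db` returns an empty result.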

In [12]:
import sqlite3

conn = sqlite3.connect('example.db')

# List all tables; SQL string literals use single quotes (double quotes denote identifiers)
df_sql = pd.read_sql_query("SELECT name FROM sqlite_master WHERE type='table';", conn)
df_sql

Unnamed: 0,name
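
Since a fresh database has nothing to show, here is a minimal sketch that creates a table in an in-memory SQLite database, inserts a few rows, and reads a filtered result into pandas. The `books` table and its prices are hypothetical. The `?` placeholder passes the threshold as a parameter instead of formatting it into the SQL string.

```python
import sqlite3

import pandas as pd

# ":memory:" creates a temporary database that disappears when the connection closes
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, used only for illustration
conn.execute("CREATE TABLE books (title TEXT, price REAL)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [
        ("A Light in the Attic", 51.77),
        ("Sharp Objects", 47.82),
        ("Soumission", 50.10),
    ],
)
conn.commit()

# Parameterized query: the value for "?" is supplied through params
df_books = pd.read_sql_query(
    "SELECT title, price FROM books WHERE price > ? ORDER BY price",
    conn,
    params=(50,),
)
conn.close()

print(df_books)
```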


## Large Dataset Handling

When a file is too large to load comfortably into memory, read it in chunks or use smaller data types.

In [15]:
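# General pattern: process one chunk at a time instead of loading the whole file
# for chunk in pd.read_csv('./data/forestfires.csv', chunksize=10000):
#     process(chunk)

# Concrete example: average the 'temp' column without holding the full file in memory
total_temp = 0.0
total_rows = 0

for chunk in pd.read_csv('./data/forestfires.csv', chunksize=1000):
    total_temp += chunk['temp'].sum()
    # count() skips missing values, matching what sum() does
    total_rows += chunk['temp'].count()

average_temp = total_temp / total_rows
print("Average temperature:", average_temp)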

Average temperature: 18.88916827852998
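
The other option mentioned above, optimizing data types, can be sketched as follows. The DataFrame is synthetic and its column names are made up. The idea is to store repeated strings as `category` and to downcast numbers to the smallest type that still holds their values.

```python
import numpy as np
import pandas as pd

# Synthetic data, used only for illustration
df = pd.DataFrame({
    "month": ["jan", "feb", "mar", "apr"] * 2500,  # few distinct strings, repeated often
    "day": np.arange(10_000) % 31,                 # small integers
    "temp": np.linspace(0, 35, 10_000),            # floats stored as float64
})

before = df.memory_usage(deep=True).sum()

optimized = df.copy()
# Repeated strings are stored once each, plus small integer codes
optimized["month"] = optimized["month"].astype("category")
# Downcast to the smallest integer / float type that fits the values
optimized["day"] = pd.to_numeric(optimized["day"], downcast="integer")
optimized["temp"] = pd.to_numeric(optimized["temp"], downcast="float")

after = optimized.memory_usage(deep=True).sum()

print(optimized.dtypes)
print(f"Memory: {before} bytes -> {after} bytes")
```

Downcasting floats trades some precision for memory, so it is worth checking that the reduced precision is acceptable for the analysis.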


# 🔄 Data Pipelines

## 🔹 What is a Data Pipeline?
A **data pipeline** is a series of steps that data goes through from its **source** to its **destination**.  
It is designed to **move, transform, and deliver** data so that it becomes useful for analysis, reporting, or applications.


## 🔹 Typical Stages of a Data Pipeline
1. **Ingestion** – Collecting raw data from sources such as databases, sensors, files, APIs, or logs.
2. **Transformation** – Cleaning, filtering, normalizing, aggregating, or enriching the data.
3. **Storage** – Saving data in a database, data warehouse, data lake, or other storage system.
4. **Analysis / Consumption** – Using the data for reporting, dashboards, machine learning, or business applications.



## 🔹 Example (Conceptual)
Imagine an e-commerce company:
- **Ingestion**: Sales transactions are captured from the website.  
- **Transformation**: Data is cleaned (e.g., fixing missing values, converting currencies).  
- **Storage**: Processed data is stored in a data warehouse.  
- **Analysis**: Analysts and data scientists build dashboards to track revenue trends.  
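
The four stages can be sketched in a few lines of standard-library Python. This is a toy illustration, not a production design: the sales records, the exchange rates, and the function names `ingest`, `transform`, `store`, and `analyze` are all hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from the website; order 3 has a missing amount
RAW_CSV = """\
order_id,day,amount,currency
1,2024-01-01,100.0,USD
2,2024-01-01,50.0,EUR
3,2024-01-02,,USD
4,2024-01-02,80.0,USD
"""

# Hypothetical fixed exchange rates to USD
RATES_TO_USD = {"USD": 1.0, "EUR": 1.1}


def ingest(text):
    """Ingestion: parse raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: drop rows without an amount and convert to USD."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip records with a missing amount
        amount_usd = float(row["amount"]) * RATES_TO_USD[row["currency"]]
        cleaned.append((int(row["order_id"]), row["day"], round(amount_usd, 2)))
    return cleaned


def store(rows, conn):
    """Storage: save the cleaned rows in a SQL table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id INTEGER PRIMARY KEY, day TEXT, amount_usd REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


def analyze(conn):
    """Analysis: total revenue per day."""
    query = "SELECT day, SUM(amount_usd) FROM sales GROUP BY day ORDER BY day"
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
store(transform(ingest(RAW_CSV)), conn)
daily_revenue = analyze(conn)
conn.close()

print(daily_revenue)
```

Keeping each stage in its own function means the same steps run the same way every time, and any single stage can be tested or replaced on its own.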


## 🔹 Why Data Pipelines Are Important
- **Automation**: Reduces manual work by ensuring data flows automatically.  
- **Scalability**: Handles large and growing volumes of data.  
- **Consistency**: Applies the same transformations each time.  
- **Reproducibility**: Results can be trusted because the steps are standardized.  
- **Collaboration**: Shared data flows make it easier for teams to work with the same data.  


## 🔹 Tools Commonly Used
- **Batch Pipelines**: Apache Airflow, Luigi, Prefect.
- **Streaming Pipelines**: Apache Kafka, Apache Spark Streaming, Apache Flink.
- **Cloud Pipelines**: AWS Glue, Google Cloud Dataflow, Azure Data Factory.


👉 In summary, a data pipeline is like an **assembly line for data**: raw material goes in, gets processed, and comes out ready for use.