# Data Collection

### 📦 What is Data Collection? 
- Data collection is the process of gathering raw information from various sources to be analyzed, modeled, or used for decision-making. It includes acquiring, importing, and storing data in a format ready for wrangling.

### 🧷 Types of Data Sources
| Type                | Examples                                                |
| ------------------- | ------------------------------------------------------- |
| **Structured**      | CSV, SQL databases, Excel                               |
| **Semi-structured** | XML, JSON (files or API responses), logs, NoSQL         |
| **Unstructured**    | Images, PDFs, audio, videos, social media text          |
| **Real-time**       | IoT sensors, financial data streams, Kafka, sockets     |
| **Big Data**        | Distributed sources like HDFS, AWS S3, cloud data lakes |

# 🛠️ Methods of Data Collection (in detail)

### 1. 🗃️ From Local Files
| File Type | Function                       |
| --------- | ------------------------------ |
| CSV       | `pd.read_csv("file.csv")`      |
| Excel     | `pd.read_excel("file.xlsx")`   |
| JSON      | `pd.read_json("file.json")`    |
| Text      | `open("file.txt").readlines()` |

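The one-liners in the table leave file handling implicit. As a minimal sketch using only the standard library, the following reads a text file with a context manager so the handle is closed promptly. The helper name `read_lines` and the throwaway sample file are hypothetical, introduced only for illustration.

```python
import os
import tempfile


def read_lines(path, encoding="utf-8"):
    """Return the lines of a text file without trailing newlines."""
    # The context manager closes the file even if reading raises an error.
    with open(path, encoding=encoding) as handle:
        return [line.rstrip("\n") for line in handle]


# Write a small throwaway file so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "file.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("first line\nsecond line\n")
    lines = read_lines(sample_path)

print(lines)
```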
---

### 2. From Databases (SQL/NoSQL)

##### a. SQL Databases
| Database   | Libraries                              |
| ---------- | -------------------------------------- |
| MySQL      | `mysql-connector-python`, `SQLAlchemy` |
| PostgreSQL | `psycopg2`, `pg8000`                   |
| SQLite     | `sqlite3`                              |

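To make the SQL path concrete, here is a small sketch using `sqlite3` from the standard library with an in-memory database. The `orders` table, its columns, and the sample rows are made up for illustration; a real project would connect to an existing database instead.

```python
import sqlite3

# An in-memory database keeps the example self-contained (no file, no server).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (item, amount) VALUES (?, ?)",
    [("keyboard", 49.5), ("mouse", 19.0), ("monitor", 199.0)],
)
conn.commit()

# Parameterized query: the driver binds the "?" placeholder safely.
min_amount = 20.0
rows = conn.execute(
    "SELECT item, amount FROM orders WHERE amount > ? ORDER BY amount",
    (min_amount,),
).fetchall()
conn.close()

print(rows)
```

If the result should land in a DataFrame, the same query and an open connection can be passed to `pd.read_sql_query(sql, conn)`.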

##### b. NoSQL (MongoDB Example)
```
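from pymongo import MongoClient

# Connect to a local MongoDB server on the default port.
client = MongoClient("mongodb://localhost:27017/")
db = client["ecommerce"]
collection = db["orders"]

# find() returns a lazy cursor; list() loads every document into memory.
data = list(collection.find())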
```

---
### 3. 🌐 From Web APIs
- Most web APIs return JSON or XML in response to HTTP requests.
```
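import requests
import pandas as pd

url = "https://api.exchangerate-api.com/v4/latest/USD"

# Set a timeout so the call cannot hang, and fail loudly on HTTP errors.
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse the JSON body; nested payloads may need pd.json_normalize instead.
data = response.json()
df = pd.DataFrame(data)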
```
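
API payloads are often nested, so it can help to flatten them into one record per row before building a table. The sketch below assumes a response shaped like the hypothetical `raw` string shown; the field names and values are assumptions for illustration, not the documented response of any particular API. The helper `rates_to_rows` is likewise hypothetical, and the resulting list of dicts can be passed directly to `pd.DataFrame`.

```python
import json

# Hypothetical payload for illustration only; field names and values are
# assumptions, not the actual response of any particular API.
raw = '{"base": "USD", "rates": {"EUR": 0.9, "GBP": 0.8, "JPY": 150.0}}'
payload = json.loads(raw)


def rates_to_rows(payload):
    """Flatten a {"base": ..., "rates": {...}} payload into one row per currency."""
    base = payload["base"]
    return [
        {"base": base, "currency": code, "rate": rate}
        for code, rate in sorted(payload["rates"].items())
    ]


rows = rates_to_rows(payload)
for row in rows:
    print(row)
```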
---

### 4. 🌍 Web Scraping
| Tool            | Use Case             |
| --------------- | -------------------- |
| `BeautifulSoup` | HTML parsing         |
| `Selenium`      | Dynamic JS content   |
| `Scrapy`        | Large-scale scraping |

```
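from bs4 import BeautifulSoup
import requests

url = "https://example.com"

# Fetch the page, with a timeout and an explicit check for HTTP errors.
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse the HTML and collect the text of every <h2> element.
soup = BeautifulSoup(response.content, "html.parser")
titles = [tag.get_text(strip=True) for tag in soup.find_all("h2")]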
```
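
For experimenting without network access or third-party packages, the same heading extraction can be sketched with the standard library's `html.parser`. This swaps in a different technique from the `BeautifulSoup` snippet, and it is far less forgiving of messy markup. The class name `HeadingCollector` and the inline sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text content of every <h2> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_h2 = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self._parts = []

    def handle_data(self, data):
        # Text inside nested tags (e.g. <em>) still arrives here.
        if self._in_h2:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_h2:
            self.titles.append("".join(self._parts).strip())
            self._in_h2 = False


# Inline sample HTML (made up) so the example runs offline.
html = "<html><body><h2>First <em>post</em></h2><p>Body</p><h2> Second post </h2></body></html>"
parser = HeadingCollector()
parser.feed(html)
parser.close()
titles = parser.titles
print(titles)
```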

---

### 5. 🛰️ Real-Time Data (Streaming)
| Tool       | Description                                |
| ---------- | ------------------------------------------ |
| Kafka      | Distributed event streaming platform       |
| Spark      | Stream processing via Structured Streaming |
| MQTT       | Lightweight IoT protocol                   |
| WebSockets | Browser/server bidirectional communication |


```
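# Consume messages from a Kafka topic (requires the kafka-python package
# and a broker listening on localhost:9092).
from kafka import KafkaConsumer

consumer = KafkaConsumer('sensor-data', bootstrap_servers='localhost:9092')

# Iterating blocks and yields messages as they arrive; value is raw bytes.
for message in consumer:
    print(message.value)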
```
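
Unless a deserializer is configured, `message.value` arrives as raw bytes. The sketch below shows one way to decode it, assuming producers send UTF-8 encoded JSON; that format is an assumption, since the message format is not stated here. The function name `decode_sensor_message` and the sample byte strings are hypothetical.

```python
import json


def decode_sensor_message(raw):
    """Decode a UTF-8 JSON message body, or return None if it is malformed."""
    try:
        return json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        # Skip malformed messages rather than crashing a long-running consumer.
        return None


# Made-up byte strings standing in for message.value from a consumer.
samples = [b'{"sensor": "t1", "temp": 21.5}', b"not json"]
decoded = [decode_sensor_message(raw) for raw in samples]
print(decoded)
```

A function like this could also be passed to `KafkaConsumer` through its `value_deserializer` argument, so that each `message.value` is already decoded.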