# Module 2: Data Collection
---
This notebook is part of the **Accelerated Data Science Teaching Kit** adapted for our cohort at Eldohub.

## Learning Objectives
- Understand different ways to collect data.
- Practice collecting data from APIs.
- Explore scraping techniques.
- Learn about data annotation and quality.
- Get introduced to SQLite and refresh SQL basics.

Duration: ~2 hours (Hands-on Lecture)


## Lecture 2.1 – Collecting Data

### Data You Can Download
- NYC Taxi data
- StackOverflow (XML)
- Wikipedia (data dump)
- Open data portals: [data.gov](https://data.gov), [data.nasa.gov](https://data.nasa.gov)

### Collect Data via APIs
- Google Data API
- Twitter API
- The Movie DB (TMDb) API (we’ll use this today)


In [None]:
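import os
import requests

# Example: Collecting data from The Movie Database (TMDb) API
# Create an account at https://www.themoviedb.org/account/signup,
# then request an API key at https://www.themoviedb.org/settings/api

# Do not hard-code a real key in a shared notebook; read it from the environment instead.
# TMDB_API_KEY is an assumed variable name, pick any name you like.
API_KEY = os.environ.get("TMDB_API_KEY", "YOUR_TMDB_API_KEY")  # replace with your TMDb API key
BASE_URL = "https://api.themoviedb.org/3"

def get_movie(movie_id):
    """Return the TMDb details for one movie as a dict."""
    url = f"{BASE_URL}/movie/{movie_id}"
    response = requests.get(url, params={"api_key": API_KEY}, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors such as 401 (invalid key)
    return response.json()

# Example: Get details for the movie 'Fight Club' (ID: 550)
movie = get_movie(550)
print(movie)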


{'adult': False, 'backdrop_path': '/hZkgoQYus5vegHoetLkCJzb17zJ.jpg', 'belongs_to_collection': None, 'budget': 63000000, 'genres': [{'id': 18, 'name': 'Drama'}, {'id': 53, 'name': 'Thriller'}], 'homepage': 'http://www.foxmovies.com/movies/fight-club', 'id': 550, 'imdb_id': 'tt0137523', 'origin_country': ['US'], 'original_language': 'en', 'original_title': 'Fight Club', 'overview': 'A ticking-time-bomb insomniac and a slippery soap salesman channel primal male aggression into a shocking new form of therapy. Their concept catches on, with underground "fight clubs" forming in every town, until an eccentric gets in the way and ignites an out-of-control spiral toward oblivion.', 'popularity': 29.0064, 'poster_path': '/pB8BM7pdSp6B6Ih7QZ4DrQ3PmJK.jpg', 'production_companies': [{'id': 711, 'logo_path': '/tEiIH5QesdheJmDAqQwvtN60727.png', 'name': 'Fox 2000 Pictures', 'origin_country': 'US'}, {'id': 508, 'logo_path': '/4sGWXoboEkWPphI6es6rTmqkCBh.png', 'name': 'Regency Enterprises', 'origin_cou
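
The response is a nested dictionary, so the next step is usually to pull out the fields you need. Below is a minimal sketch that flattens a few of them. It uses a hand-copied subset of the output above so it runs without a network call, and `summarize_movie` and the `budget_musd` field are hypothetical names introduced for illustration.

```python
# A hand-copied subset of the TMDb response shown above (runs offline).
movie = {
    "id": 550,
    "original_title": "Fight Club",
    "budget": 63000000,
    "genres": [{"id": 18, "name": "Drama"}, {"id": 53, "name": "Thriller"}],
    "belongs_to_collection": None,
}

def summarize_movie(movie):
    """Flatten the nested response into a small record (hypothetical helper)."""
    return {
        "id": movie.get("id"),
        "title": movie.get("original_title"),
        # Budget in millions of US dollars
        "budget_musd": movie.get("budget", 0) / 1_000_000,
        # 'genres' is a list of {'id', 'name'} dicts; keep only the names
        "genres": [g["name"] for g in movie.get("genres", [])],
    }

summary = summarize_movie(movie)
print(summary)
```

Using `.get()` with a default keeps the helper from raising `KeyError` when a field is missing from a particular movie.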

## Lecture 2.2 – Scraping Data
When no API is available, we can scrape the data from web pages instead.

### Tools for scraping:
- BeautifulSoup
- Scrapy
- Selenium

Below is a simple example with **BeautifulSoup**.


In [None]:
from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/Data_science"

# Send a browser-like User-Agent, because Wikipedia may reject the default one sent by requests
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0.0.0 Safari/537.36"
}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")

# Find all paragraphs, then pick the first non-empty one
paragraphs = soup.find_all('p')

first_paragraph = None
for p in paragraphs:
    # Strip only the ends; get_text(strip=True) would also glue together the words around links
    text = p.get_text().strip()
    if text:
        first_paragraph = text
        break

print(first_paragraph)


Data science is an interdisciplinary academic field[1] that uses statistics, scientific computing, scientific methods, processing, scientific visualization, algorithms and systems to extract or extrapolate knowledge from potentially noisy, structured, or unstructured data.[2]
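
To see the parsing step in isolation, the sketch below runs on an inline HTML string instead of a live page. It uses Python's built-in `html.parser` rather than BeautifulSoup, so treat it as a dependency-free stand-in for "find the paragraphs, take the first non-empty one", not as how BeautifulSoup works internally. The class name `ParagraphExtractor` and the sample HTML are made up for this example.

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect the text of every <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph texts
        self._in_p = False     # True while we are inside a <p>
        self._chunks = []      # text pieces of the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Join the pieces as they appear, then trim only the ends
            self.paragraphs.append("".join(self._chunks).strip())
            self._chunks = []
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)

html = """
<html><body>
  <p></p>
  <p><b>Data science</b> is an <a href="/wiki/Interdisciplinary">interdisciplinary</a> field.</p>
  <p>Second paragraph.</p>
</body></html>
"""

parser = ParagraphExtractor()
parser.feed(html)

# Same rule as the scraper above: take the first non-empty paragraph
first_paragraph = next((p for p in parser.paragraphs if p), None)
print(first_paragraph)
```

Because the text pieces are joined without stripping each one, the spaces around the bold and linked words survive.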


## Lecture 2.4 – Data Annotation & Data Quality

### Annotation Examples
- Image classification (cat, dog)
- NLP: Sentiment classification, Named Entity Recognition (NER)

### Data Quality Properties
- Relevance
- Accuracy
- Timeliness
- Completeness

👉 High-quality data = better models.
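
Some of these properties can be measured directly. As a minimal sketch, completeness can be estimated as the share of non-missing values per field. The records and the function name `completeness` below are made up for illustration.

```python
# Toy records with some missing values (None or empty string)
records = [
    {"title": "Fight Club", "year": 1999, "budget": 63000000},
    {"title": "Movie B", "year": None, "budget": 1000000},
    {"title": "", "year": 2010, "budget": None},
    {"title": "Movie D", "year": 2015, "budget": None},
]

def completeness(records, field):
    """Fraction of records whose value for `field` is present (not None, not empty)."""
    present = sum(1 for r in records if r.get(field) not in (None, ""))
    return present / len(records)

report = {field: completeness(records, field) for field in ("title", "year", "budget")}
print(report)  # {'title': 0.75, 'year': 0.75, 'budget': 0.5}
```

A field with low completeness, such as `budget` here, is a candidate for re-collection or imputation before modeling.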


## Module 2 Lab 1 – Data Annotation in Active Learning

We’ll implement **Active Learning** on the MNIST dataset with `modAL`.


### Why This Code Is Useful for You

- It demonstrates the concept of Active Learning hands-on.

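- Shows that with just 100 labeled samples you can already reach roughly 70–75% accuracy (about 75% in the run below; the exact figure varies with the random initial sample), so labeling everything is not always necessary.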

- It’s a teaching demo: it helps you visualize data annotation challenges and the efficiency gains from Active Learning.

- It ties theory → practice → real-world use cases.


### Why Active Learning?

Normally:

- You label all data → train model → done.

With Active Learning:

- The model starts with a small labeled set.

- Then it asks for labels only on the most uncertain samples (the ones it’s struggling with).

- Saves time, money, and human effort.
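
One common way to pick "the most uncertain samples" is least-confidence sampling: rank unlabeled samples by how low the model's top class probability is. The sketch below shows only that selection step on made-up probabilities; `least_confident` is a hypothetical helper, not part of any library. modAL ships a comparable strategy as `modAL.uncertainty.uncertainty_sampling`.

```python
def least_confident(probabilities, n_queries):
    """Return the indices of the n_queries samples the model is least sure about."""
    # Uncertainty = 1 - probability of the most likely class
    uncertainty = [1.0 - max(p) for p in probabilities]
    ranked = sorted(range(len(probabilities)), key=lambda i: uncertainty[i], reverse=True)
    return ranked[:n_queries]

# Made-up class probabilities for four unlabeled samples (three classes each)
pool_probabilities = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.40, 0.35, 0.25],  # unsure
    [0.70, 0.20, 0.10],  # fairly confident
    [0.34, 0.33, 0.33],  # almost a coin toss
]

to_label = least_confident(pool_probabilities, n_queries=2)
print(to_label)  # [3, 1]
```

Only samples 3 and 1 would be sent to a human annotator in this round; the model is then retrained and the ranking is repeated.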


### Real-World Applications

- Healthcare – Doctors label only tricky medical scans instead of all scans. Saves cost & time.

- Finance – Fraud detection models ask human analysts to review only the transactions they are unsure about.

- Customer Support – Chatbot sentiment classifiers ask humans to label only confusing customer messages.

- Autonomous Vehicles – Self-driving cars focus annotation on ambiguous road situations instead of labeling every frame.


In [None]:
!pip install git+https://github.com/modAL-python/modAL.git


Collecting git+https://github.com/modAL-python/modAL.git
  Cloning https://github.com/modAL-python/modAL.git to /tmp/pip-req-build-ggm0v_8p
  Running command git clone --filter=blob:none --quiet https://github.com/modAL-python/modAL.git /tmp/pip-req-build-ggm0v_8p
  Resolved https://github.com/modAL-python/modAL.git to commit bba6f6fd00dbb862b1e09259b78caf6cffa2e755
  Preparing metadata (setup.py) ... done


In [None]:
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from modAL.models import ActiveLearner
import numpy as np

# Load MNIST (70,000 handwritten digits, 0–9).
# Scale pixel values from 0–255 to 0–1 for better model performance.
mnist = fetch_openml('mnist_784', version=1)
X, y = mnist.data / 255.0, mnist.target.astype(int)

# Train/Test Split
# Train = 80%, Test = 20%.
# Train is for learning, Test is for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Select small initial dataset
# Instead of labeling all 56,000 training samples (80% of MNIST's 70,000), we start with just 100 labeled samples.
# This simulates real-world cost constraints (labeling is expensive).
n_initial = 100
initial_idx = np.random.choice(range(len(X_train)), size=n_initial, replace=False)
X_initial, y_initial = X_train.iloc[initial_idx], y_train.iloc[initial_idx]

# Create the ActiveLearner
# Uses Logistic Regression as the base model.
# Learns only from the 100 labeled samples.
learner = ActiveLearner(
    estimator=LogisticRegression(max_iter=1000),
    X_training=X_initial, y_training=y_initial
)

# Evaluate
# Accuracy is roughly 70–75% (it varies with the random initial sample), even with only 100 labels!
# Pretty good, but not as high as training on the full training set (above 90%).
print("Initial score:", learner.score(X_test, y_test))


Initial score: 0.7492142857142857


## Lecture 2.5 – SQLite as Simple Storage

We’ll store our data using SQLite.

👉 If you cannot run SQL queries directly in Jupyter, I recommend using **[SQLite Online](https://sqliteonline.com/)** (free and open-source) for hands-on SQL practice.


In [None]:
from google.colab import files
uploaded = files.upload()


Saving movie_metadata.csv to movie_metadata (2).csv


In [None]:
import sqlite3
import pandas as pd

# Use the uploaded filename. Colab renames repeat uploads (e.g. "movie_metadata (2).csv"),
# so make sure the name below matches the file you want to load.
df = pd.read_csv("movie_metadata.csv")

# Connect to SQLite
conn = sqlite3.connect('movies.db')

# Save to SQLite
df.to_sql("movies", conn, if_exists="replace", index=False)

# Check
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM movies")
print("Total rows:", cursor.fetchone()[0])

conn.close()

from google.colab import files
files.download("movies.db")



Total rows: 5043
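
As a quick SQL refresher, the sketch below runs a filter query and an aggregate query against a tiny in-memory table. The table contents and the column names `title`, `year`, and `rating` are made up for illustration and are not taken from `movie_metadata.csv`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary database, nothing is written to disk
conn.execute("CREATE TABLE movies (title TEXT, year INTEGER, rating REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("Movie A", 1999, 8.5),
        ("Movie B", 1999, 6.5),
        ("Movie C", 2010, 7.5),
    ],
)

# Filter and sort: movies rated above a threshold, best first.
# The ? placeholder passes the value safely instead of formatting it into the SQL string.
top = conn.execute(
    "SELECT title, rating FROM movies WHERE rating > ? ORDER BY rating DESC",
    (7.0,),
).fetchall()

# Aggregate: number of movies and average rating per year.
per_year = conn.execute(
    "SELECT year, COUNT(*), AVG(rating) FROM movies GROUP BY year ORDER BY year"
).fetchall()

print(top)       # [('Movie A', 8.5), ('Movie C', 7.5)]
print(per_year)  # [(1999, 2, 7.5), (2010, 1, 7.5)]
conn.close()
```

The same `SELECT ... WHERE ... GROUP BY` patterns apply to the `movies` table created above once you substitute its real column names.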

