# Chapter 14: Data Collection, Integration, and Understanding

Welcome to **Chapter 14**! This chapter is a crucial step in your data analytics journey. Before you can analyze data, you need to **find it**, **bring it together**, and **understand its quality**.

---

## What You Will Learn

In this chapter, you will learn how to:

1. **Identify data sources** — Know where to find the data you need (files, databases, APIs, etc.)
2. **Distinguish structured vs unstructured data** — Understand different data formats and how to work with them
3. **Combine internal and external data** — Enrich your analysis with data from multiple sources
4. **Apply data integration techniques** — Use merging, joining, and concatenation to combine datasets
5. **Assess data quality dimensions** — Check for completeness, validity, uniqueness, consistency, and timeliness
6. **Document your data** — Create metadata and data dictionaries for clarity and reproducibility
7. **Perform initial data assessment** — Run quick checks before diving into deep analysis

---

## Why This Chapter Matters

> "Garbage in, garbage out."

No matter how sophisticated your analysis or model is, if the underlying data is incomplete, incorrect, or poorly understood, your results will be unreliable. This chapter teaches you to **ask the right questions about your data** before you start analyzing it.

---

## Prerequisites

Before starting this chapter, you should be comfortable with:
- Basic Python syntax (Chapter 2)
- Pandas DataFrames and basic operations (Chapter 4)

---

Let's begin!

---

## 14.0 Setup: Import Libraries and Create Example Data

Before we explore data collection and integration concepts, let's set up our environment and create some realistic example datasets.

We'll use:
- **pandas** — for working with tabular data
- **numpy** — for numeric operations
- **matplotlib** — for quick visualizations

> **Tip:** If you see `ModuleNotFoundError`, install the required packages with:
> ```
> pip install pandas numpy matplotlib
> ```
> (or use Anaconda, which includes these by default)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 120)

In [None]:
# Create two small, realistic datasets: customers and orders
customers = pd.DataFrame({
    'customer_id': [101, 102, 103, 104, 105, 106],
    'name': ['Asha', 'Bilal', 'Chen', 'Dina', 'Evan', 'Fatima'],
    'email': ['asha@example.com', 'bilal@example.com', None, 'dina@example.com', 'evan@example.com', 'fatima@example.com'],
    'country': ['PK', 'PK', 'CN', 'US', 'US', 'PK'],
    'signup_date': ['2025-10-01', '2025-10-05', '2025-10-05', '2025-10-12', '2025-10-20', '2025-11-02']
})

orders = pd.DataFrame({
    'order_id': [5001, 5002, 5003, 5004, 5005, 5006, 5007],
    'customer_id': [101, 102, 102, 104, 999, 105, 106],  # note: 999 doesn't exist in customers
    'order_date': ['2025-10-03', '2025-10-06', '2025-10-06', '2025-10-19', '2025-10-21', '2025-10-25', '2025-11-05'],
    'amount': [120.50, 49.99, 49.99, 220.00, 15.00, -10.00, 80.00],  # note: negative amount is suspicious
    'channel': ['web', 'mobile', 'mobile', 'web', 'web', 'web', 'mobile']
})

# Convert date columns to real datetime types (important for time logic!)
customers['signup_date'] = pd.to_datetime(customers['signup_date'])
orders['order_date'] = pd.to_datetime(orders['order_date'])

customers, orders

---

## 14.1 Identifying Data Sources

Before you analyze anything, you need to answer: **Where will the data come from?**

### Common Data Sources for Analytics Projects

| Source Type | Examples | Typical Format |
|-------------|----------|----------------|
| **Files** | CSV, Excel, JSON, Parquet | Structured/Semi-structured |
| **Databases** | SQLite, PostgreSQL, MySQL, SQL Server | Structured (SQL) |
| **APIs** | REST APIs, web services | JSON, XML |
| **Logs/Events** | Application logs, clickstream | Semi-structured |
| **Manual inputs** | Surveys, forms, spreadsheets | Structured |

### How to Choose a Data Source (Beginner Checklist)

Ask these questions before selecting a data source:

1. **Does it contain the fields you need?** (e.g., `customer_id`, `date`, `amount`)
2. **How often is it updated?** (daily, real-time, monthly)
3. **Is it trustworthy?** (authoritative system vs. unofficial copy)
4. **Can you access it legally and safely?** (permissions, privacy)
5. **How much data is it?** (small file vs. billions of rows)

> **Tip:** In real projects, it's common to use *multiple* sources (e.g., CRM + web analytics + sales DB).

### Example: loading data from common file formats
In this chapter we created DataFrames directly, but in real work you usually load from files.

Below are examples you can adapt. (They won’t run unless the files exist on your machine.)

In [None]:
# Examples (uncomment and edit paths to run)
# df_csv = pd.read_csv('data/customers.csv')
# df_excel = pd.read_excel('data/customers.xlsx', sheet_name='Sheet1')
# df_json = pd.read_json('data/customers.json')

# Best practice: always inspect after loading
# print(df_csv.head())
# print(df_csv.info())

#### Common beginner mistakes (data sources)
- **Assuming** the export file is the “truth” (it might be outdated).
- Not checking **encoding** (text files can break on special characters).
- Not checking **types** (dates imported as strings, numbers as text).
- Loading the *wrong* sheet/tab from Excel.
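
The encoding and type mistakes above can often be avoided at load time. The sketch below is one possible approach, not the only one: it reads a small made-up CSV from memory (the file contents and the `00101`-style IDs are assumptions for illustration) and states the encoding, key type, and date column explicitly.

```python
import io

import pandas as pd

# Hypothetical file contents, held in memory so the example runs without a file on disk.
raw_bytes = (
    "customer_id,name,signup_date\n"
    "00101,Zoë,2025-10-01\n"
    "00102,Bilal,2025-10-05\n"
).encode("utf-8")

df = pd.read_csv(
    io.BytesIO(raw_bytes),
    encoding="utf-8",                 # state the encoding; other exports may need e.g. "latin-1"
    dtype={"customer_id": "string"},  # keep IDs as text so leading zeros survive
    parse_dates=["signup_date"],      # parse dates while loading
)

print(df.dtypes)
```

If `customer_id` were left to type inference here, it would most likely be read as an integer and the leading zeros would be lost.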

Exercise: Write down 3 possible data sources for a problem you care about (e.g., sales analysis, student performance, social media). For each source, note one risk (missing data, outdated, access restrictions).

---

## 14.2 Structured vs Unstructured Data

Understanding the **shape** of data helps you choose the right tools and methods.

### Types of Data Structure

| Type | Description | Examples | Tools |
|------|-------------|----------|-------|
| **Structured** | Organized in rows/columns (tables) | CSV, Excel, SQL tables | pandas, SQL |
| **Semi-structured** | Has structure, but not fixed columns | JSON, XML, logs | `pd.json_normalize()`, parsers |
| **Unstructured** | No consistent table-like structure | Emails, PDFs, images, audio | NLP, OCR, specialized libraries |

In beginner analytics, you'll spend most of your time with **structured** or **semi-structured** data.

### Example: semi-structured JSON → structured table
Imagine you receive API data like this (nested JSON). We can normalize it into a DataFrame.

Why this matters: most analysis needs tabular columns like `customer_id`, `order_id`, `amount`.

In [None]:
api_like_json = [
    {
        'customer': {'customer_id': 101, 'country': 'PK'},
        'order': {'order_id': 7001, 'amount': 35.5, 'channel': 'web'},
        'tags': ['promo', 'new_user']
    },
    {
        'customer': {'customer_id': 102, 'country': 'PK'},
        'order': {'order_id': 7002, 'amount': 120.0, 'channel': 'mobile'},
        'tags': []
    }
]

# pd.json_normalize flattens nested dictionaries into columns
normalized = pd.json_normalize(api_like_json)
normalized

Tip: Arrays/lists inside JSON (like `tags`) are not automatically “tabular”.
- If each row can have multiple tags, you may need a **separate table** (one row per tag).
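
One way to build that separate table is `DataFrame.explode`, which turns each list element into its own row. This is a minimal sketch with made-up records shaped like the API example; the column names `order_id` and `tag` in the result are choices made here, not a fixed convention.

```python
import pandas as pd

records = [
    {"order": {"order_id": 7001}, "tags": ["promo", "new_user"]},
    {"order": {"order_id": 7002}, "tags": []},
]

flat = pd.json_normalize(records)

# One row per (order, tag); an empty list becomes a missing value, so drop it.
order_tags = (
    flat[["order.order_id", "tags"]]
    .explode("tags")
    .dropna(subset=["tags"])
    .rename(columns={"order.order_id": "order_id", "tags": "tag"})
    .reset_index(drop=True)
)
print(order_tags)
```

The result can be joined back to the orders table on `order_id` whenever tag-level analysis is needed.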

Exercise: Create a JSON-like list with 3 records. Include a nested object and convert it using `pd.json_normalize`.

In [None]:
# Exercise starter: edit this JSON and normalize it
exercise_json = [
    {'user': {'id': 1, 'name': 'Sam'}, 'event': {'type': 'click', 'value': 10}},
    {'user': {'id': 2, 'name': 'Rita'}, 'event': {'type': 'purchase', 'value': 99.99}},
]

pd.json_normalize(exercise_json)

---

## 14.3 Internal vs External Data

Data can come from **inside your organization** or **outside**. Both can be valuable.

### Internal Data
- **Examples:** Sales transactions, customer profiles, support tickets, app logs
- **Pros:** Usually detailed, aligned to your business
- **Cons:** May have missing fields, messy historical changes, siloed systems

### External Data
- **Examples:** Census data, market prices, weather, competitor data, public APIs
- **Pros:** Adds context and comparability
- **Cons:** Different definitions, update schedules, licensing restrictions

> ⚠️ **Warning:** External data can be **legally restricted**. Always check terms of use and privacy rules.

### Example: enriching internal data with external mapping
Suppose our internal data has `country` codes, and we want readable country names (external reference table).

Why we do this: readable labels help analysis and reporting.

In [None]:
country_lookup = pd.DataFrame({
    'country': ['PK', 'US', 'CN'],
    'country_name': ['Pakistan', 'United States', 'China'],
    'region': ['South Asia', 'North America', 'East Asia']
})

customers_enriched = customers.merge(country_lookup, on='country', how='left')
customers_enriched

Common mistake: If you join and see **missing `country_name`**, it usually means:
- The key values don’t match (e.g., `pk` vs `PK`, extra spaces)
- The lookup table is incomplete

Exercise: Intentionally break a key (change `PK` to `pk`) and see what happens. Then fix it using `.str.upper()`.

In [None]:
# Exercise: break and fix join keys
customers_bad = customers.copy()
customers_bad.loc[customers_bad['country'] == 'PK', 'country'] = 'pk'

broken = customers_bad.merge(country_lookup, on='country', how='left')
print('Broken join (notice missing country_name):')
display(broken)

# Fix by standardizing keys
customers_fixed = customers_bad.copy()
customers_fixed['country'] = customers_fixed['country'].str.upper().str.strip()
fixed = customers_fixed.merge(country_lookup, on='country', how='left')
print('Fixed join:')
fixed

---

## 14.4 Data Integration Techniques

**Data integration** means combining data from multiple sources into a form you can analyze.

### Common Integration Patterns

| Pattern | When to Use | pandas Method |
|---------|-------------|---------------|
| **Merge/Join** | Combine columns using a key (e.g., `customer_id`) | `pd.merge()` |
| **Append/Concatenate** | Stack rows of similar tables | `pd.concat()` |
| **Union with schema alignment** | Combine tables after matching column names/types | `pd.concat()` after alignment |
| **Deduplication** | Remove repeated records after combining | `.drop_duplicates()` |
| **Mapping/Standardization** | Make categories consistent | `.map()`, `.replace()` |

> **Key Concept:** A join is only as good as its **join key quality**. If keys are missing, duplicated, or inconsistent, results can be wrong.
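
A quick way to act on this is to inspect the key column before joining. The helper below is a hypothetical sketch (the name `key_report` and the demo table are invented for illustration); it reports the key's dtype, how many values are missing, and how many are repeated.

```python
import pandas as pd


def key_report(df: pd.DataFrame, key: str) -> dict:
    """Summarize common join-key problems for one column."""
    col = df[key]
    return {
        "dtype": str(col.dtype),
        "missing": int(col.isna().sum()),
        "duplicated": int(col.dropna().duplicated().sum()),
    }


# Hypothetical lookup table with one repeated and one missing key
customers_demo = pd.DataFrame({"customer_id": [101, 102, 102, None]})
report = key_report(customers_demo, "customer_id")
print(report)
```

Running the same report on both tables before a merge makes dtype mismatches (for example, integer keys on one side and text keys on the other) visible early.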

### 14.4.1 Merge (join) basics
We’ll join `orders` with `customers` using `customer_id`.

Why: Orders alone tell us *what was bought*, but customers tell us *who bought it* (country, signup date, etc.).

In [None]:
orders_with_customers = orders.merge(customers, on='customer_id', how='left', indicator=True)
orders_with_customers

Notice the `_merge` column: it tells us whether the join found a match.
- `both` means matching `customer_id` existed in both tables
- `left_only` means the order had a `customer_id` not found in customers

This is an **early warning** sign of integration issues.

In [None]:
# Which orders didn't match a customer?
unmatched = orders_with_customers[orders_with_customers['_merge'] != 'both']
unmatched
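
Besides listing unmatched rows, it can help to express the match rate as a single number. The sketch below uses small made-up tables (not the chapter's `orders` and `customers`) and `Series.isin` to measure join coverage in both directions.

```python
import pandas as pd

orders_demo = pd.DataFrame({"order_id": [1, 2, 3, 4], "customer_id": [101, 102, 999, 101]})
customers_demo = pd.DataFrame({"customer_id": [101, 102, 103]})

# Share of order rows whose customer_id appears in the customers table
matched = orders_demo["customer_id"].isin(customers_demo["customer_id"])
coverage = matched.mean()
print(f"Join coverage: {coverage:.0%}")

# The reverse direction: customers with no orders at all
no_orders = customers_demo[~customers_demo["customer_id"].isin(orders_demo["customer_id"])]
print(no_orders)
```

What counts as acceptable coverage depends on the project; the point is to record the number rather than discover the gap later.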

#### Common join mistakes
- Joining on the wrong key (e.g., name instead of id)
- Forgetting that keys can have different formats (strings vs integers)
- **Many-to-many joins** causing duplicated rows

Tip: Always check row counts before and after merge and validate with a quick sanity check.
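
pandas can also enforce the expected relationship for you through the `validate` argument of `merge`. The sketch below assumes a made-up customers table in which one customer was loaded twice, and shows the difference between a silent row increase and an explicit error.

```python
import pandas as pd

orders_demo = pd.DataFrame({"order_id": [1, 2], "customer_id": [101, 102]})
# Hypothetical customers table where 102 was accidentally loaded twice
customers_dup = pd.DataFrame({
    "customer_id": [101, 102, 102],
    "name": ["Asha", "Bilal", "Bilal"],
})

# Without validation the merge silently returns more rows than there are orders.
silent = orders_demo.merge(customers_dup, on="customer_id", how="left")
print(len(orders_demo), "->", len(silent))

# With validate=..., pandas raises instead of duplicating rows.
try:
    orders_demo.merge(customers_dup, on="customer_id", how="left", validate="many_to_one")
    merge_failed = False
except pd.errors.MergeError as exc:
    merge_failed = True
    print("Merge rejected:", exc)
```

`many_to_one` says that many orders may point to one customer, but each `customer_id` must appear only once on the right side.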

In [None]:
print('Rows in orders:', len(orders))
print('Rows after merge:', len(orders_with_customers))

# Sanity: order_id should still be unique if each order is one row
print('Unique order_id:', orders_with_customers['order_id'].nunique())

### 14.4.2 Append/Concatenate basics
Concatenation is used when you have **the same kind of table** split into multiple parts.
Example: `orders_october.csv` + `orders_november.csv` → one `orders` table.

In [None]:
orders_part_1 = orders.iloc[:4].copy()
orders_part_2 = orders.iloc[4:].copy()

combined_orders = pd.concat([orders_part_1, orders_part_2], ignore_index=True)
combined_orders

Warning: If columns don’t match, `concat` keeps the union of all columns and fills the gaps with missing values (`NaN`).
This is good (it prevents silent data loss), but it means you must align schemas.

In [None]:
# Example of mismatched columns
orders_part_2_mismatch = orders_part_2.drop(columns=['channel']).copy()
mismatch_concat = pd.concat([orders_part_1, orders_part_2_mismatch], ignore_index=True)
mismatch_concat
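
One possible way to align schemas before concatenating is sketched below. The second table, its `total` column, and the `'unknown'` placeholder are all assumptions made for illustration; in a real project the right fix depends on why the columns differ.

```python
import pandas as pd

part_1 = pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 20.0], "channel": ["web", "mobile"]})
# Hypothetical second export: different column name, no channel column
part_2 = pd.DataFrame({"order_id": [3], "total": [30.0]})

expected_columns = ["order_id", "amount", "channel"]

# Step 1: match names, then report what is still missing or unexpected.
renamed = part_2.rename(columns={"total": "amount"})
missing_cols = [c for c in expected_columns if c not in renamed.columns]
extra_cols = [c for c in renamed.columns if c not in expected_columns]
print("Missing:", missing_cols, "Unexpected:", extra_cols)

# Step 2: an explicit decision for this sketch: label the unknown channel
# instead of leaving NaN, then put the columns in the expected order.
part_2_aligned = renamed.assign(channel="unknown")[expected_columns]

combined = pd.concat([part_1, part_2_aligned], ignore_index=True)
print(combined)
```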

Exercise: Create two small DataFrames with mostly the same columns, then concatenate.
- Identify which column becomes missing
- Decide whether to fill missing values or fix the schema before concatenation

---

## 14.5 Data Quality Dimensions

Data quality answers: **Can we trust this data enough to use it?**

### The Six Dimensions of Data Quality

| Dimension | Question | Example Issue |
|-----------|----------|---------------|
| **Completeness** | Are required fields present? | Missing email addresses |
| **Validity** | Do values follow rules? | Negative amounts, future dates |
| **Accuracy** | Are values correct in reality? | Wrong customer address |
| **Consistency** | Are values the same across systems? | `PK` vs `pk` vs `Pakistan` |
| **Uniqueness** | Are there duplicates? | Same order recorded twice |
| **Timeliness** | Is the data up to date? | Data from 6 months ago |

> **Important:** You can't always fix quality issues immediately, but you must at least **detect and document** them.

### 14.5.1 Completeness: missing values
We start with missing values because they are common and easy to measure.
Why: Missing values can break calculations or bias results (e.g., if only some customers have emails).

In [None]:
def missingness_report(df: pd.DataFrame) -> pd.DataFrame:
    missing_count = df.isna().sum()
    missing_pct = (missing_count / len(df) * 100).round(1)
    report = pd.DataFrame({
        'missing_count': missing_count,
        'missing_pct': missing_pct
    }).sort_values('missing_pct', ascending=False)
    return report

missingness_report(customers)

In [None]:
# Simple visual: missing percentage by column
report = missingness_report(customers)
ax = report['missing_pct'].plot(kind='bar', title='Missing % by Column (customers)', ylabel='Missing %')
ax.set_ylim(0, max(5, report['missing_pct'].max() + 5))
plt.tight_layout()
plt.show()

Tip: For critical columns (like keys), missing values are often **not acceptable**.

Exercise: Check missingness for `orders`.
- Which columns have missing values?
- If `amount` were missing, what would you do (drop, fill, investigate)?

In [None]:
missingness_report(orders)

### 14.5.2 Validity: rule checks
Validity means values follow **business rules** or **logical rules**.
Examples:
- `amount` should be ≥ 0
- `order_date` should not be before `signup_date`
- `country` should be one of allowed codes

We’ll implement simple rule checks and flag violations.

In [None]:
# Join first so we can compare order_date vs signup_date
joined = orders.merge(customers, on='customer_id', how='left')

invalid_amount = joined[joined['amount'] < 0]
invalid_customer = joined[joined['name'].isna()]  # no customer matched (assumes every real customer has a name)
invalid_date_logic = joined[(~joined['signup_date'].isna()) & (joined['order_date'] < joined['signup_date'])]

print('Invalid amounts (amount < 0):')
display(invalid_amount)
print('Orders with missing customer record:')
display(invalid_customer)
print('Orders before signup_date (date logic issue):')
display(invalid_date_logic)

Warning: A rule violation doesn’t automatically mean “delete the row”.
It means you must **investigate**:
- Is it a refund (negative amount)?
- Is the customer table incomplete?
- Was the signup date recorded incorrectly?
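
To investigate rather than delete, it helps to record each violation as a flag next to the data. The sketch below is one way to organize that: a dictionary of named rules (the rule names and the demo table are made up here) where each rule returns `True` for rows that pass.

```python
import pandas as pd

orders_demo = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [120.5, -10.0, 80.0],
    "channel": ["web", "web", "phone"],
})

# Each rule returns True where the row PASSES the check.
rules = {
    "amount_nonnegative": lambda df: df["amount"] >= 0,
    "channel_allowed": lambda df: df["channel"].isin({"web", "mobile"}),
}

flagged = orders_demo.copy()
for rule_name, rule in rules.items():
    # Store failures as a boolean column; no rows are removed.
    flagged[f"fails_{rule_name}"] = ~rule(flagged)

violation_counts = flagged.filter(like="fails_").sum()
print(violation_counts)
```

Because every row is kept, you can review the flagged records with the data owner before deciding what to do with them.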

Exercise: Add a new rule check: channel must be either `web` or `mobile`. Then intentionally insert a bad value and see if your check catches it.

In [None]:
# Exercise starter
orders_test = orders.copy()
orders_test.loc[0, 'channel'] = 'phone'  # invalid

allowed_channels = {'web', 'mobile'}
invalid_channel = orders_test[~orders_test['channel'].isin(allowed_channels)]
invalid_channel

### 14.5.3 Uniqueness and duplicates
Duplicates can happen when you:
- Import the same file twice
- Merge incorrectly
- Get repeated events from logs

We’ll check uniqueness for `order_id` and also check for duplicated rows.

In [None]:
print('order_id unique?', orders['order_id'].is_unique)
print('customer_id unique in customers?', customers['customer_id'].is_unique)

# Find duplicate rows (exact duplicates)
duplicate_rows = orders[orders.duplicated()]
duplicate_rows

Exercise: Create a duplicate order row (append it), then use `.duplicated()` to find it.
Tip: Use `keep=False` to mark *all* duplicates, not just later copies.

In [None]:
orders_dup = pd.concat([orders, orders.iloc[[1]]], ignore_index=True)
dups_all = orders_dup[orders_dup.duplicated(keep=False)]
dups_all
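
Exact-duplicate checks miss records that differ only in their ID. In the chapter's `orders` data, orders 5002 and 5003 share the same customer, date, amount, and channel; whether that is a double submission or two genuine purchases is something only the business can confirm. The sketch below rebuilds a few of those rows and uses the `subset` argument to surface such candidates.

```python
import pandas as pd

orders_demo = pd.DataFrame({
    "order_id": [5002, 5003, 5004],
    "customer_id": [102, 102, 104],
    "order_date": pd.to_datetime(["2025-10-06", "2025-10-06", "2025-10-19"]),
    "amount": [49.99, 49.99, 220.00],
})

# Exact-duplicate check: finds nothing, because order_id differs
exact = orders_demo[orders_demo.duplicated(keep=False)]

# Compare everything except the ID to surface possible double submissions
business_cols = ["customer_id", "order_date", "amount"]
suspects = orders_demo[orders_demo.duplicated(subset=business_cols, keep=False)]
print(suspects)
```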

### 14.5.4 Consistency and standardization
Consistency means the same concept is recorded the same way across datasets.
Examples:
- Country codes are always uppercase (`PK`, not `pk`)
- Date formats are consistent
- Categories are standardized (`mobile` vs `Mobile`)

A simple habit: standardize text columns early using `.str.strip()` and consistent casing.

In [None]:
orders_messy = orders.copy()
orders_messy.loc[2, 'channel'] = ' Mobile '  # messy value

print('Before standardization:', orders_messy['channel'].unique())

orders_clean = orders_messy.copy()
orders_clean['channel'] = orders_clean['channel'].str.strip().str.lower()

print('After standardization:', orders_clean['channel'].unique())
orders_clean.head()
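
Trimming and casing do not help when the same concept is spelled differently (`PK` vs `Pakistan`). For that case a small mapping table is a common approach. The mapping below is a made-up example, and the sketch keeps unmapped values visible instead of guessing.

```python
import pandas as pd

raw_country = pd.Series(["PK", "pk", " Pakistan ", "US", "Narnia"])

# Hypothetical mapping from known variants (after trimming and upper-casing) to one code
to_code = {"PK": "PK", "PAKISTAN": "PK", "US": "US", "UNITED STATES": "US"}

normalized = raw_country.str.strip().str.upper()
country_code = normalized.map(to_code)  # unknown variants become NaN

unmapped = raw_country[country_code.isna()]
print(country_code.tolist())
print("Needs review:", unmapped.tolist())
```

Reviewing the unmapped list regularly is usually safer than adding a catch-all default.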

### 14.5.5 Timeliness (freshness)
Timeliness asks: **Is the data recent enough for the decision?**
Example: If you’re monitoring daily sales, data from 30 days ago might be too old.

A simple freshness check is to look at the latest date in the dataset.

In [None]:
latest_order_date = orders['order_date'].max()
earliest_order_date = orders['order_date'].min()

print('Earliest order:', earliest_order_date)
print('Latest order:', latest_order_date)

# Example: fix "today" to a known date so the demonstration is reproducible
today = pd.Timestamp('2025-11-10')  # pretend this is today
freshness_days = (today - latest_order_date).days
print('Freshness (days since latest):', freshness_days)
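
Whether a given freshness value is acceptable depends on the decision the data supports. The sketch below wraps the check in a small hypothetical helper (`is_stale` and the thresholds are assumptions for illustration) and applies two different tolerances to the same dates.

```python
import pandas as pd


def is_stale(latest: pd.Timestamp, as_of: pd.Timestamp, max_age_days: int) -> bool:
    """Return True when the newest record is older than the allowed age."""
    return (as_of - latest).days > max_age_days


latest = pd.Timestamp("2025-11-05")
as_of = pd.Timestamp("2025-11-10")  # fixed date so the result is reproducible

stale_for_daily = is_stale(latest, as_of, max_age_days=1)     # daily monitoring
stale_for_monthly = is_stale(latest, as_of, max_age_days=30)  # monthly reporting
print(stale_for_daily, stale_for_monthly)
```

The same five-day gap fails a daily-monitoring tolerance but passes a monthly one.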

---

## 14.6 Data Documentation and Metadata

Professional analytics work is not just code—it's also **communication**.

### What is Metadata?

**Metadata** is "data about data". It answers questions like:
- What does each column mean?
- What are the allowed values?
- Where did the data come from and when was it extracted?
- Who owns the dataset?
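
Answers to these questions can be stored in a small, machine-readable record next to the data. The sketch below uses a plain dictionary serialized to JSON; every field name and value is a placeholder invented for illustration, not a standard schema.

```python
import json

# Every value below is a made-up placeholder for illustration.
dataset_metadata = {
    "table": "orders",
    "source": "hypothetical sales database export",
    "extracted_at": "2025-11-10",
    "owner": "hypothetical analytics team",
    "columns": {
        "amount": {"meaning": "order total", "allowed": ">= 0"},
        "channel": {"meaning": "where the order was placed", "allowed": ["web", "mobile"]},
    },
}

metadata_json = json.dumps(dataset_metadata, indent=2)
print(metadata_json)
```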

### The Data Dictionary

A **data dictionary** is a table describing each column in your dataset. It's one of the most beginner-friendly documentation tools.

| Column | Data Type | Description | Allowed Values/Rules |
|--------|-----------|-------------|----------------------|
| `order_id` | int | Unique identifier for each order | Must be unique |
| `amount` | float | Order total in USD | `>= 0` (or negative for refunds) |

> **Tip:** When you return to a project later (or share it with a teammate), documentation saves hours.

### Example: building a simple data dictionary
We’ll generate a starting data dictionary automatically, then you can fill in descriptions and rules.

Why: The generated skeleton fills in column names, types, and an example value for you, so the only manual work left is writing descriptions and rules.

In [None]:
def make_data_dictionary(df: pd.DataFrame, table_name: str) -> pd.DataFrame:
    return pd.DataFrame({
        'table': table_name,
        'column': df.columns,
        'dtype': [str(t) for t in df.dtypes],
        'example_value': [df[c].dropna().iloc[0] if df[c].notna().any() else None for c in df.columns],
        'description': ['' for _ in df.columns],
        'allowed_values_or_rules': ['' for _ in df.columns]
    })

dd_customers = make_data_dictionary(customers, 'customers')
dd_orders = make_data_dictionary(orders, 'orders')

pd.concat([dd_customers, dd_orders], ignore_index=True)

Tip: Keep your data dictionary near your analysis (in the notebook or as a CSV/Markdown file).
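
Saving the dictionary as CSV is a one-liner. The sketch below uses a tiny stand-in table (not the output of `make_data_dictionary`) and keeps the CSV in memory so it runs anywhere; passing a file path to `to_csv` would write a file instead.

```python
import io

import pandas as pd

data_dictionary = pd.DataFrame({
    "column": ["order_id", "amount"],
    "dtype": ["int64", "float64"],
    "description": ["Unique identifier for each order", "Order total in USD"],
})

# to_csv() with no path returns the CSV text; pass a filename to write a file instead.
csv_text = data_dictionary.to_csv(index=False)
print(csv_text)

# Reading it back should give the same table.
restored = pd.read_csv(io.StringIO(csv_text))
```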

Exercise: Fill in descriptions for 3 columns in `dd_orders` and add at least one rule (e.g., `amount >= 0`).

In [None]:
dd_orders_filled = dd_orders.copy()

dd_orders_filled.loc[dd_orders_filled['column'] == 'order_id', 'description'] = 'Unique identifier for each order'
dd_orders_filled.loc[dd_orders_filled['column'] == 'customer_id', 'description'] = 'ID of the customer who placed the order'
dd_orders_filled.loc[dd_orders_filled['column'] == 'amount', 'description'] = 'Order total amount in USD'

dd_orders_filled.loc[dd_orders_filled['column'] == 'amount', 'allowed_values_or_rules'] = 'amount >= 0 (unless refunds are represented as negatives)'
dd_orders_filled.loc[dd_orders_filled['column'] == 'channel', 'allowed_values_or_rules'] = "Must be one of: 'web', 'mobile'"

dd_orders_filled

---

## 14.7 Initial Data Assessment

An **initial data assessment** is a quick, structured review before doing deep analysis.

### Goal
Find obvious problems early so you don't build a full analysis on broken assumptions.

### First-Pass Checklist

| Check | What to Look For | pandas Method |
|-------|------------------|---------------|
| **Shape** | How many rows/columns? | `.shape` |
| **Types** | Are dates really dates? | `.dtypes`, `.info()` |
| **Missing values** | Where are the gaps? | `.isna().sum()` |
| **Duplicates** | Are keys unique? | `.duplicated()`, `.is_unique` |
| **Ranges** | Are values in reasonable ranges? | `.describe()`, `.min()`, `.max()` |
| **Join coverage** | Do keys match across tables? | `merge(..., indicator=True)` |
| **Distributions** | Quick histograms / value counts | `.value_counts()`, `.hist()` |

Let's implement a simple assessment report.
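
A minimal sketch of such a report is shown here. The function name `quick_assessment` and the demo table are made up for illustration; the function gathers several of the checklist items into one dictionary so they can be printed or logged together.

```python
import pandas as pd


def quick_assessment(df: pd.DataFrame, key: str) -> dict:
    """Collect first-pass facts about a table; `key` is the expected unique ID column."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_by_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "key_is_unique": bool(df[key].is_unique),
    }


demo = pd.DataFrame({
    "order_id": [1, 2, 2],
    "amount": [10.0, None, 5.0],
})
summary = quick_assessment(demo, key="order_id")
for check, result in summary.items():
    print(f"{check}: {result}")
```

In the demo table no full row is repeated, yet the key is not unique. That is why the two checks are reported separately.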

### A quick numeric summary and distribution check
Why: Sometimes you catch errors instantly (negative amounts, impossible ages, etc.).

In [None]:
orders[['amount']].describe()

In [None]:
# Histogram of amounts
orders['amount'].plot(kind='hist', bins=10, title='Order Amount Distribution', xlabel='amount')
plt.tight_layout()
plt.show()

Exercise: Run `.value_counts()` on `channel` and `country`.
- Which category is most common?
- If you saw an unexpected category, what would you do?

In [None]:
print('Order channels:')
display(orders['channel'].value_counts())

print('Customer countries:')
display(customers['country'].value_counts())

---

## 14.8 Mini-Project: Build an Integrated, Quality-Checked Dataset

In this mini-project you will practice everything from this chapter:

1. **Integrate** orders + customers using a merge
2. **Flag quality issues** (missing customers, invalid amounts)
3. **Produce a clean dataset** for analysis
4. **Create a data dictionary** for the final table

> **The goal is not "perfect data"** — it's building a **repeatable workflow** that you can apply to any dataset.

In [None]:
# Step 1: integrate
integrated = orders.merge(customers, on='customer_id', how='left')

# Step 2: add quality flags
integrated['flag_missing_customer'] = integrated['name'].isna()
integrated['flag_invalid_amount'] = integrated['amount'] < 0

# A simple 'status' column can help summarize issues
conditions = [
    integrated['flag_missing_customer'],
    integrated['flag_invalid_amount']
]
choices = [
    'missing_customer',
    'invalid_amount'
]
integrated['quality_status'] = np.select(conditions, choices, default='ok')

integrated.sort_values(['quality_status', 'order_date'])

### Decide what “clean” means
Cleaning is a decision based on your context. Here’s a simple approach for beginners:
- Exclude rows where `customer_id` is unknown (cannot analyze customer behavior)
- Keep negative amounts but treat them separately as potential refunds (do not hide them!)

We’ll build a dataset that is usable for typical revenue analysis.

In [None]:
clean_for_revenue = integrated[~integrated['flag_missing_customer']].copy()

# For revenue totals, clip negative amounts to 0 so they add nothing; the original 'amount' stays available for refund analysis
clean_for_revenue['amount_nonnegative'] = clean_for_revenue['amount'].clip(lower=0)

clean_for_revenue[['order_id', 'customer_id', 'name', 'order_date', 'amount', 'amount_nonnegative', 'quality_status']].sort_values('order_date')
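
If negative amounts really are refunds (an assumption you would need to confirm with the data owner), reporting them separately is often clearer than clipping them away. The sketch below uses made-up amounts to show gross sales, refunds, and net revenue side by side.

```python
import pandas as pd

amounts = pd.Series([120.50, 49.99, -10.00, 80.00], name="amount")

gross_sales = amounts.clip(lower=0).sum()  # negatives counted as 0
refunds = -amounts.clip(upper=0).sum()     # size of the negative part, as a positive number
net_revenue = amounts.sum()                # sales minus refunds

print(f"gross={gross_sales:.2f} refunds={refunds:.2f} net={net_revenue:.2f}")
```

Keeping all three numbers makes the refund policy visible in the report instead of hiding it inside a cleaning step.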

### Quick analysis check (sanity)
Let’s do one small analysis to confirm the integrated dataset behaves as expected.

In [None]:
revenue_by_country = (
    clean_for_revenue
    .groupby('country', dropna=False)['amount_nonnegative']
    .sum()
    .sort_values(ascending=False)
)
revenue_by_country

In [None]:
revenue_by_country.plot(kind='bar', title='Revenue by Country (non-negative amounts)')
plt.ylabel('Revenue')
plt.tight_layout()
plt.show()

### Document the final dataset
We’ll generate a data dictionary for the `clean_for_revenue` table.
Then you can extend it with business meaning.

In [None]:
dd_final = make_data_dictionary(clean_for_revenue, 'clean_for_revenue')
dd_final

### Mini-Project Exercises

Try these exercises to reinforce what you've learned:

1. **Add a new flag:** `flag_missing_email` (customer email is missing)
2. **Create a summary report:** Count rows by `quality_status`
3. **Decide a policy:** How will you handle negative amounts (refunds)?
4. **Document your work:** Add 3 meaningful descriptions to the final data dictionary

---

## Additional Resources (Optional Reading)

Expand your knowledge with these resources:

- **pandas Merge/Join Guide:** https://pandas.pydata.org/docs/user_guide/merging.html
- **pandas Missing Data Guide:** https://pandas.pydata.org/docs/user_guide/missing_data.html
- **JSON Normalization:** https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html
- **Intro to Data Quality (Wikipedia):** https://en.wikipedia.org/wiki/Data_quality

> **Tip:** When you start working with real APIs and databases, you'll also want to learn about authentication, rate limits, and SQL joins (covered in Chapters 9 and 10).

---

## Summary / Key Takeaways

Here's what you learned in this chapter:

| Topic | Key Insight |
|-------|-------------|
| **Data Sources** | Know where your data comes from and verify it's trustworthy and accessible |
| **Structured vs Unstructured** | Most analytics uses structured (tables) or semi-structured (JSON) data |
| **Internal vs External** | Combine internal data with external sources for richer insights |
| **Data Integration** | Use `merge()` for joins and `concat()` for appending — watch your keys! |
| **Data Quality** | Check completeness, validity, uniqueness, consistency, and timeliness |
| **Documentation** | Create data dictionaries to make your work reusable and trustworthy |
| **Initial Assessment** | Always do a quick assessment before advanced analysis |

---

### What's Next?

In **Chapter 15: Data Cleaning, Transformation, and Preprocessing**, you'll learn how to:
- Handle missing data
- Detect and treat outliers
- Normalize and scale data
- Encode categorical variables
- Create new features

---

**Congratulations!** You now have the foundation to collect, integrate, and understand data before diving into analysis.