# Notebook 1: Data Loading

Welcome to the first notebook in our Data Wizardry with Python series! This notebook covers the fundamentals of loading data into pandas DataFrames.

## What You'll Learn

- Reading CSV files with `pd.read_csv()`
- Inspecting DataFrame shape, columns, and data types
- Using `.head()`, `.tail()`, `.info()`, and `.describe()`
- Understanding DataFrames, Series, and Index objects
- Basic column selection and renaming
- Identifying missing values

## Prerequisites

Make sure you have:
- Python 3.11+ installed
- pandas library installed (`uv pip install -e .`)
- VS Code with Jupyter extension (or Jupyter Lab)
- The sample datasets in the `../data/` folder

Let's get started!

In [None]:
# Setup: imports and display options
import pandas as pd
import numpy as np

# Set display options for better readability
pd.options.display.max_columns = 50
pd.options.display.width = 120
pd.options.display.max_rows = 20

print("pandas version:", pd.__version__)
print("numpy version:", np.__version__)
print("\nSetup complete! Ready to load data.")

## 1. Reading CSV Files

CSV (Comma-Separated Values) files are one of the most common data formats. pandas makes it easy to read them with `pd.read_csv()`.

We'll load two datasets:
- **media_contacts.csv**: Media contact data with TV, online, and social media metrics
- **socio_demos.csv**: Demographic information including age, gender, and household details

In [None]:
# Load the media contacts dataset
media_df = pd.read_csv('../data/media_contacts.csv')

print("Media contacts data loaded successfully!")
print(f"Shape: {media_df.shape}")  # (rows, columns)
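
Real CSV files do not always use commas, and you may only want a quick peek at a large file. The sketch below shows two `pd.read_csv()` parameters, `sep` and `nrows`, on a small in-memory CSV. The text in `csv_text` and its column names are made up for illustration and are not taken from the course datasets.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited CSV text standing in for a file on disk.
csv_text = """Person ID;TV Total;Online Total
1;12.5;3
2;;7
3;8.0;1
"""

# sep=";" tells pandas which delimiter to split on.
# nrows=2 reads only the first two data rows, handy for previewing big files.
sample_df = pd.read_csv(io.StringIO(csv_text), sep=";", nrows=2)

print(sample_df.shape)
print(sample_df)
```

The empty field in the second row is read as `NaN`, which is how pandas represents a missing numeric value.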

### Viewing the First Few Rows

Use `.head()` to see the first 5 rows (or specify a number like `.head(10)`):

In [None]:
media_df.head()

### Inspecting the Data Structure

Use `.info()` to get a quick overview of:
- Number of rows and columns
- Column names and data types
- Memory usage
- Non-null count (helps identify missing data)

In [None]:
media_df.info()

### Checking Columns and Data Types

View all column names and their data types:

In [None]:
print("Column names:")
print(media_df.columns.tolist())
print("\nData types:")
print(media_df.dtypes)

### Quick Statistical Summary

Use `.describe()` to get summary statistics for numerical columns:

In [None]:
media_df.describe()
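
By default `.describe()` skips text columns when numeric columns are present. A minimal sketch of how `include="all"` brings them back, using a made-up `toy_df` rather than the course data:

```python
import pandas as pd

# Hypothetical toy table with one text column and one numeric column.
toy_df = pd.DataFrame({
    "Gender": ["F", "M", "F", "F"],
    "Age": [34, 51, 28, 45],
})

# Default behaviour: only the numeric column is summarized.
numeric_summary = toy_df.describe()

# include="all" adds count/unique/top/freq rows for the text column.
full_summary = toy_df.describe(include="all")

print(numeric_summary)
print(full_summary)
```

Cells that do not apply to a column (for example `mean` for `Gender`) are shown as `NaN` in the combined summary.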

## 2. Loading a Second Dataset

Let's load the demographic data and inspect it:

In [None]:
# Load demographic data
demo_df = pd.read_csv('../data/socio_demos.csv')

print("Demographic data loaded!")
print(f"Shape: {demo_df.shape}")

In [None]:
demo_df.head()

In [None]:
demo_df.info()

## 3. Understanding DataFrames, Series, and Index

### DataFrame
A DataFrame is a 2D table with labeled rows and columns. Think of it as a spreadsheet or SQL table.

### Series
A Series is a one-dimensional labeled array. Each column of a DataFrame is a Series, so selecting one column returns a Series:

In [None]:
# Select a single column - returns a Series
gender_series = demo_df['Gender']
print(type(gender_series))
print("\nFirst 5 values:")
print(gender_series.head())
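
The kind of brackets used for selection decides what comes back. The sketch below, on a made-up `toy_df`, contrasts single brackets (a Series) with double brackets (a one-column DataFrame):

```python
import pandas as pd

# Hypothetical toy table for illustration only.
toy_df = pd.DataFrame({"Gender": ["F", "M"], "Age": [34, 51]})

# Single brackets with one column name return a Series (1D).
as_series = toy_df["Gender"]

# Double brackets (a list of names) return a DataFrame (2D), even for one column.
as_frame = toy_df[["Gender"]]

print(type(as_series), as_series.shape)
print(type(as_frame), as_frame.shape)
```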

### Index
The Index holds the row labels. By default, pandas assigns consecutive integers (0, 1, 2, ...), but you can set a meaningful column as the index. The cell below prints the current row index and the column labels:

In [None]:
print("Current index:")
print(demo_df.index)

print("\nColumn names (also an Index object):")
print(demo_df.columns)
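
Setting a meaningful column as the index can be sketched with `set_index()`. The table and the `person_id` values here are invented for illustration; `set_index()` returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

# Hypothetical toy table; the IDs are made up.
toy_df = pd.DataFrame({
    "person_id": [101, 102, 103],
    "Gender": ["F", "M", "F"],
})

# Use person_id as the row labels instead of the default 0, 1, 2.
indexed_df = toy_df.set_index("person_id")

print(indexed_df.index)

# Rows can now be looked up by their label with .loc.
print(indexed_df.loc[102, "Gender"])
```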

## 4. Handling Column Names

Column names with spaces or special characters can be inconvenient. Let's clean them:

In [None]:
# View current column names
print("Original column names:")
print(demo_df.columns.tolist())

# Rename columns - method 1: rename specific columns
demo_df = demo_df.rename(columns={
    'Person ID': 'person_id',
    'Number_of children': 'num_children',
    'People_in_Household': 'household_size'
})

print("\nRenamed columns:")
print(demo_df.columns.tolist())

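You can also rename all columns at once by assigning new labels to `.columns`. Here the labels are built with vectorized string methods instead of being typed out as a list: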

In [None]:
# Let's standardize the media_df columns to lowercase with underscores
media_df.columns = media_df.columns.str.replace(' ', '_').str.lower()
print("Media columns (standardized):")
print(media_df.columns.tolist())
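
Replacing every column label can be done with an explicit list or with string methods. The sketch below compares both on a made-up table; the column names are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy table with spaces and capitals in the column names.
toy_df = pd.DataFrame({"TV Total": [1, 2], "Online Total": [3, 4]})

# Option 1: assign a full list. Its length must match the number of columns.
by_list = toy_df.copy()
by_list.columns = ["tv_total", "online_total"]

# Option 2: transform the existing names with vectorized string methods.
by_str = toy_df.copy()
by_str.columns = by_str.columns.str.strip().str.replace(" ", "_").str.lower()

print(by_list.columns.tolist())
print(by_str.columns.tolist())
```

The string-method version scales to many columns and does not depend on their order, while the explicit list silently mislabels data if the order is wrong.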

## 5. Checking for Missing Values

Missing data is common in real datasets. Use `.isna()` or its alias `.isnull()` to detect it (the two methods behave identically):

In [None]:
# Check for missing values - returns True/False for each cell
print("Missing values in media_df:")
print(media_df.isnull().sum())  # Sum counts True values per column

In [None]:
print("Missing values in demo_df:")
print(demo_df.isnull().sum())

### Percentage of Missing Data

It's often helpful to see what percentage of each column is missing:

In [None]:
missing_pct = (demo_df.isnull().sum() / len(demo_df)) * 100
print("Percentage missing:")
print(missing_pct)
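
The same percentage can be computed more compactly, because the mean of a True/False column is the fraction of True values. A minimal sketch on invented data, sorted so the worst columns come first:

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with some missing cells.
toy_df = pd.DataFrame({
    "age": [34, np.nan, 28, 45],
    "gender": ["F", "M", None, None],
    "household_size": [2, 3, 1, 4],
})

# .isna() gives True/False per cell; .mean() turns that into a fraction per column.
missing_pct = toy_df.isna().mean() * 100

# Sort so the columns with the most missing data appear at the top.
missing_pct = missing_pct.sort_values(ascending=False)

print(missing_pct)
```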

## 6. Other Useful Inspection Methods

### Viewing the Last Rows

Use `.tail()` to see the last few rows:

In [None]:
demo_df.tail()

### Random Sample

View a random sample of rows:

In [None]:
media_df.sample(5)  # 5 random rows
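
`.sample()` draws different rows each time it runs. If you want the same rows every time, for example to discuss them with a colleague, you can pass `random_state`. A small sketch on a made-up table (the seed value 42 is arbitrary):

```python
import pandas as pd

# Hypothetical toy table with 100 rows.
toy_df = pd.DataFrame({"value": range(100)})

# The same random_state yields the same rows on every call.
first = toy_df.sample(5, random_state=42)
second = toy_df.sample(5, random_state=42)

print(first.equals(second))
```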

### Unique Values

Check unique values in categorical columns:

In [None]:
print("Unique genders:")
print(demo_df['Gender'].unique())

print("\nValue counts for Gender:")
print(demo_df['Gender'].value_counts())

In [None]:
print("Household size counts:")
print(demo_df['household_size'].value_counts())
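
By default `.value_counts()` leaves out missing values, which can hide gaps in a categorical column. The sketch below, on an invented Series, shows how `dropna=False` includes them in the tally:

```python
import pandas as pd

# Hypothetical Series with two missing entries.
genders = pd.Series(["F", "M", None, "F", None, "F"], name="Gender")

# Default: missing values are excluded from the counts.
default_counts = genders.value_counts()

# dropna=False adds a row for the missing values.
with_missing = genders.value_counts(dropna=False)

print(default_counts)
print(with_missing)
```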

## Summary

In this notebook, you learned:

✅ How to load CSV files with `pd.read_csv()`  
✅ Inspecting DataFrames with `.head()`, `.tail()`, `.info()`, `.describe()`  
✅ Understanding DataFrame, Series, and Index objects  
✅ Renaming columns for cleaner code  
✅ Checking for missing values with `.isnull().sum()`  
✅ Exploring unique values and value counts

### Next Steps

In the next notebook (**02_data_cleaning.ipynb**), we'll learn how to:
- Handle missing values (fill, drop, interpolate)
- Remove duplicates
- Fix data types
- Clean and transform text data

### Key Takeaways

1. **Always inspect your data first**: Use `.head()`, `.info()`, `.describe()` before analysis
2. **Check for missing data early**: Use `.isnull().sum()` to identify columns with missing values
3. **Standardize column names**: Lowercase with underscores makes coding easier
4. **Understand the shape**: Knowing (rows, columns) helps you understand your dataset size

## 🎯 Practice Exercises

Try these on your own:

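1. Load both datasets again, but this time set `person_id` as the index using the `index_col` parameter (note that `index_col` expects the column's name as it appears in the CSV file, before any renaming)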
2. Find which column in `media_df` has the highest maximum value
3. Calculate the percentage of people in each household size category
4. Check if there are any duplicate `person_id` values in either dataset
5. Create a new column in `demo_df` that extracts the birth year from `BIRTHDAY` (hint: convert to string and slice)

### Bonus Challenges

6. Load the media contacts data with only specific columns (use `usecols` parameter)
7. Find the correlation between `TV_Total` and `Online Total` in the media dataset (these become `tv_total` and `online_total` after the standardization in Section 4)
8. Export the cleaned `demo_df` (with renamed columns) to a new CSV file in `../outputs/` folder

**Note**: Some of these exercises use techniques covered in later notebooks, so don't worry if you can't solve them all yet!