# FRE 521D: Data Analytics in Climate, Food and Environment
## Lab 3: Python Wrangling - Tidy Data, Types, and Validation

**Program:** UBC Master of Food and Resource Economics  
**Instructor:** Asif Ahmed Neloy

---

<div style="background-color: #FFF3CD; border-left: 4px solid #E6A23C; padding: 15px; margin: 15px 0;">
    <h3 style="margin-top: 0; color: #856404;">Submission Deadline</h3>
    <p style="margin-bottom: 0; font-size: 1.2em;"><strong>Wednesday, January 21, 2026 - 11:59 PM (End of Day)</strong></p>
</div>

---

## Lab Objectives

In this lab, you will apply Python wrangling techniques to the climate-agriculture data from Assignment 1. You will:

1. **Read** data from your MySQL database tables
2. **Check and convert** data types appropriately
3. **Reshape** data from wide to long format using `pd.melt()`
4. **Analyze and handle** missing data
5. **Validate** data quality with range, null, and uniqueness checks

---

## Prerequisites

- Assignment 1 completed (tables created in your MySQL database)
- Docker container running with MySQL
- Conda environment activated

---

## Setup: Import Libraries and Connect to Database

In [64]:
# Import libraries
import pandas as pd
import numpy as np
import mysql.connector
from mysql.connector import Error
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 50)
pd.set_option('display.width', None)

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")

Libraries imported successfully!
Pandas version: 2.3.3


In [65]:
# Database connection configuration
# Update these values if your setup is different

DB_CONFIG = {
    "host": "localhost",
    "port": 3306,
    "user": "mfre521d_user",
    "password": "mfre521d_user_pw",
    "database": "mfre521d",
}

def get_connection():
    """Create and return a database connection."""
    return mysql.connector.connect(**DB_CONFIG)

def read_table(query):
    """Execute a SQL query and return results as DataFrame."""
    conn = get_connection()
    try:
        df = pd.read_sql(query, conn)
        return df
    finally:
        conn.close()

# Test connection
try:
    conn = get_connection()
    print("Connected to MySQL successfully!")
    conn.close()
except Error as e:
    print(f"Error: {e}")
    print("Make sure your Docker container is running.")

Connected to MySQL successfully!


---
---

# Question 1: Read Data and Check Types (15 points)

## Task: Load Data from A1 Tables and Inspect Data Types

As discussed in Lecture 6, **always check `df.dtypes` after loading data**. Different sources may encode the same information differently.
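
As a minimal illustration of why this check matters, the sketch below uses a small made-up DataFrame (the values are assumptions, not the lab data) in which numbers arrive as text, as can happen when a database column is stored as VARCHAR. It inspects `dtypes` before and after converting with `pd.to_numeric`.

```python
import pandas as pd

# Hypothetical toy data: numeric values stored as text
raw = pd.DataFrame({
    "year": ["2001", "1993", "n/a"],
    "yield_kg_ha": ["3,208.43", "5385.02", ""],
})
print(raw.dtypes)  # both columns are text, not numeric

clean = raw.copy()
# errors="coerce" turns unparseable entries such as "n/a" or "" into NaN
clean["year"] = pd.to_numeric(clean["year"], errors="coerce")
clean["yield_kg_ha"] = pd.to_numeric(
    clean["yield_kg_ha"].str.replace(",", "", regex=False),
    errors="coerce",
)
print(clean.dtypes)  # both columns are now numeric
print(clean)
```

Note that `year` becomes a float column here because it contains a `NaN`; an integer dtype is only possible once every value parses.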

### Part A: Read the tables from your database

The solution for reading from the database is provided below.

---

In [66]:
# SOLUTION PROVIDED: Read data from A1 tables

# Read countries table
df_countries = read_table("SELECT * FROM dim_country")
print(f"Countries: {len(df_countries)} rows")

# Read crop_production table
df_crops = read_table("SELECT * FROM Crop_2")
print(f"Crop Production: {len(df_crops)} rows")

# Read temperature_anomalies table
df_temp = read_table("SELECT * FROM Temp_2")
print(f"Temperature Anomalies: {len(df_temp)} rows")

print("\nData loaded successfully!")

Countries: 0 rows
Crop Production: 83740 rows
Temperature Anomalies: 1137 rows

Data loaded successfully!


In [67]:

# 1) clean year to numeric
df_crops['year'] = pd.to_numeric(df_crops['year'], errors='coerce')

# 2) quick check
df_crops['year'].dtype
df_crops[['year']].head()


Unnamed: 0,year
0,2001
1,1993
2,1995
3,2018
4,2013


In [68]:
num_cols = [
    'area_harvested_ha',
    'production_tonnes',
    'yield_kg_ha',
    'fertilizer_use_kg_ha',
    'irrigation_pct'
]

for c in num_cols:
    df_crops[c] = (
        df_crops[c].astype(str)
        .str.replace(",", "", regex=False)
        .str.replace("%", "", regex=False)
        .str.strip()
    )
    df_crops[c] = pd.to_numeric(df_crops[c], errors='coerce')

df_crops.dtypes


country                  object
iso3_code                object
region                   object
income_group             object
year                      int64
crop                     object
area_harvested_ha         int64
production_tonnes       float64
yield_kg_ha             float64
fertilizer_use_kg_ha    float64
irrigation_pct          float64
notes                    object
dtype: object

In [69]:
# View sample of each table
print("=" * 60)
print("COUNTRIES TABLE")
print("=" * 60)
df_countries.head()

COUNTRIES TABLE


Unnamed: 0,country_id,iso3_code,country_name,region,income_group


In [70]:
print("=" * 60)
print("CROP PRODUCTION TABLE")
print("=" * 60)
df_crops.head()

CROP PRODUCTION TABLE


Unnamed: 0,country,iso3_code,region,income_group,year,crop,area_harvested_ha,production_tonnes,yield_kg_ha,fertilizer_use_kg_ha,irrigation_pct,notes
0,China,CHN,East Asia,Upper middle income,2001,Soybeans,3751494,12036421.75,3208.43,100.9,,
1,Nepal,NPL,South Asia,Low income,1993,Maize,2112762,11377270.55,5385.02,19.14,9.8,
2,South Korea,KOR,East Asia,High income,1995,Soybeans,1650777,7474101.16,4527.63,193.84,56.6,
3,United States,USA,North America,High income,2018,Wheat,4782989,32397951.41,6773.58,205.12,62.5,
4,Japan,JPN,East Asia,High income,2013,Rice,5434696,58322509.35,10731.51,211.64,61.4,


In [71]:
print("=" * 60)
print("TEMPERATURE ANOMALIES TABLE")
print("=" * 60)
df_temp.head()

TEMPERATURE ANOMALIES TABLE


Unnamed: 0,country,year,annual_anomaly_c,jan,feb,mar,apr,may,jun,jul,aug,sep,oct,nov,dec
0,United States of America,1990,0.07,-0.02,,,-0.09,-0.11,0.44,-0.44,0.3,-0.08,-0.03,-0.31,-0.39
1,United States of America,1991,0.2,0.36,0.4,0.6,0.74,0.22,0.34,0.22,0.7,,-0.44,-0.5,-0.22
2,United States of America,1992,0.54,0.46,0.76,0.85,0.68,0.6,0.98,0.52,0.92,0.53,0.01,0.09,0.07
3,United States of America,1993,0.43,0.28,0.22,0.74,0.6,,1.3,0.81,0.35,-0.3,0.04,0.55,-0.13
4,United States of America,1994,0.87,0.75,0.86,0.65,1.14,0.45,1.71,1.27,1.04,0.01,0.62,1.05,0.88


### Part B: Check and Document Data Types

**YOUR TASK:** 
1. Use `.dtypes` to check the data types of each DataFrame
2. Use `.info()` to get a summary including non-null counts
3. Answer the questions in the markdown cell below

---

In [72]:
# ============================================
# YOUR CODE HERE: Check data types for df_crops
# ============================================

# Print the data types

print(df_crops.dtypes)
# Print info (includes non-null counts)

df_crops.info()

country                  object
iso3_code                object
region                   object
income_group             object
year                      int64
crop                     object
area_harvested_ha         int64
production_tonnes       float64
yield_kg_ha             float64
fertilizer_use_kg_ha    float64
irrigation_pct          float64
notes                    object
dtype: object
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83740 entries, 0 to 83739
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   country               83740 non-null  object 
 1   iso3_code             83740 non-null  object 
 2   region                83740 non-null  object 
 3   income_group          83740 non-null  object 
 4   year                  83740 non-null  int64  
 5   crop                  83740 non-null  object 
 6   area_harvested_ha     83740 non-null  int64  
 7   production_tonnes     81240 n

In [73]:
# ============================================
# YOUR CODE HERE: Check data types for df_temp
# ============================================

# Print the data types

print(df_temp.dtypes)
# Print info
df_temp.info()


country              object
year                  int64
annual_anomaly_c    float64
jan                 float64
feb                 float64
mar                 float64
apr                 float64
may                 float64
jun                 float64
jul                 float64
aug                 float64
sep                 float64
oct                 float64
nov                 float64
dec                 float64
dtype: object
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1137 entries, 0 to 1136
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   country           1137 non-null   object 
 1   year              1137 non-null   int64  
 2   annual_anomaly_c  1137 non-null   float64
 3   jan               1114 non-null   float64
 4   feb               1114 non-null   float64
 5   mar               1114 non-null   float64
 6   apr               1115 non-null   float64
 7   may               1115 non-null

In [77]:
# A notebook cell only displays its last expression, so print both dtypes explicitly
print(df_crops['year'].dtype)
print(df_crops['yield_kg_ha'].dtype)


int64
float64

### Part C: Answer These Questions (write your answers below)

1. What is the data type of the `year` column in `df_crops`?

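   **Your answer:** `int64`

2. What is the data type of the `yield_kg_ha` column? Why is it this type?

   **Your answer:** `float64`. In `Crop_2` these fields are stored as VARCHAR, so they load as `object`. After cleaning and conversion with `pd.to_numeric`, the decimal values (and `NaN` wherever a value is missing or cannot be parsed) are stored as `float64`.

3. How many monthly temperature columns are there in `df_temp`? List them.

   **Your answer:** 12: `jan`, `feb`, `mar`, `apr`, `may`, `jun`, `jul`, `aug`, `sep`, `oct`, `nov`, `dec`.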

4. Which column in `df_crops` has the most NULL values? How many?

   **Your answer:** 

---

---
---

# Question 2: Reshape Data - Wide to Long (25 points)

## Task: Convert Monthly Temperature Data from Wide to Long Format

The `temperature_anomalies` table has **monthly data stored in wide format** (columns: jan, feb, mar, ..., dec). This is a classic case where we need to reshape the data to make it **tidy**.

### Tidy Data Principle
- Each **variable** should have its own **column**
- Each **observation** should have its own **row**
- Each **value** should have its own **cell**

### Current Structure (Wide - NOT Tidy)
```
country_id | year | annual_anomaly_c | jan  | feb  | mar  | ... | dec
-----------+------+------------------+------+------+------+-----+-----
    1      | 2020 |      1.5         | 1.2  | 1.8  | 1.4  | ... | 1.6
```

### Target Structure (Long - Tidy)
```
country_id | year | month | monthly_anomaly_c
-----------+------+-------+------------------
    1      | 2020 |  jan  |       1.2
    1      | 2020 |  feb  |       1.8
    1      | 2020 |  mar  |       1.4
    ...    | ...  |  ...  |       ...
```
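
The wide-to-long idea can be tried on a toy frame first. The sketch below uses made-up values and only three month columns (the data is an assumption, not taken from the lab tables). It melts the frame and then pivots it back, which is one way to see that no information is lost in either direction.

```python
import pandas as pd

# Hypothetical toy data with three month columns (values are made up)
wide = pd.DataFrame({
    "country": ["A", "B"],
    "year": [2020, 2020],
    "jan": [1.2, 0.4],
    "feb": [1.8, None],
    "mar": [1.4, 0.9],
})

# Wide -> long: one row per country-year-month
long_df = pd.melt(
    wide,
    id_vars=["country", "year"],
    value_vars=["jan", "feb", "mar"],
    var_name="month",
    value_name="monthly_anomaly_c",
)
print(long_df.shape)  # 2 rows x 3 months = 6 rows, 4 columns

# Long -> wide again, e.g. for a country-by-month display table
back = long_df.pivot(
    index=["country", "year"],
    columns="month",
    values="monthly_anomaly_c",
).reset_index()
print(back)
```

Missing values survive the melt as `NaN` rows, so the row count after melting is always the original row count times the number of melted columns.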

---

In [75]:
# First, let's look at the current structure
print("Current temperature table structure:")
print(f"Shape: {df_temp.shape}")
print(f"\nColumns: {df_temp.columns.tolist()}")

Current temperature table structure:
Shape: (1137, 15)

Columns: ['country', 'year', 'annual_anomaly_c', 'jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec']


### Part A: Use pd.melt() to Reshape the Data

**YOUR TASK:** Complete the code below to:
1. Use `pd.melt()` to convert monthly columns to rows
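2. Keep `country`, `year`, and `annual_anomaly_c` as identifier columns (`df_temp` as loaded above identifies countries by `country`; use `country_id` instead if your table has that column)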
3. Melt the month columns (jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec)
4. Name the new columns: `month` and `monthly_anomaly_c`

**Hint:** Use the syntax:
```python
pd.melt(df, 
        id_vars=['cols', 'to', 'keep'], 
        value_vars=['cols', 'to', 'unpivot'],
        var_name='new_col_name', 
        value_name='value_col_name')
```

---

In [76]:
# ============================================
# YOUR CODE HERE: Reshape temperature data
# ============================================

# Define the month columns to unpivot
month_columns = ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 
                 'jul', 'aug', 'sep', 'oct', 'nov', 'dec']

# Use pd.melt() to reshape from wide to long
df_temp_long = pd.melt(
    df_temp,
    id_vars=['country', 'year', 'annual_anomaly_c'],  # columns to keep as identifiers
    value_vars=month_columns,  # columns to unpivot
    var_name='month',  # name for the new 'month' column
    value_name='monthly_anomaly_c'  # name for the new value column
)

# ============================================

In [None]:
# Verify the reshape
print("Reshaped temperature data:")
print(f"Original shape: {df_temp.shape}")
print(f"New shape: {df_temp_long.shape}")
print(f"\nExpected rows: {len(df_temp)} × 12 months = {len(df_temp) * 12}")
print(f"Actual rows: {len(df_temp_long)}")

print("\nSample of reshaped data:")
df_temp_long.head(15)

### Part B: Add Month Number for Sorting

**YOUR TASK:** Create a `month_num` column that converts month names to numbers (jan=1, feb=2, ..., dec=12)

**Hint:** Create a dictionary mapping and use `.map()`

---

In [None]:
# ============================================
# YOUR CODE HERE: Add month number column
# ============================================

# Create a mapping dictionary
month_to_num = {
    'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4,
    'may': 5, 'jun': 6, 'jul': 7, 'aug': 8,
    'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12
}

# Add the month_num column using .map()
df_temp_long['month_num'] = df_temp_long['month'].map(month_to_num)

# Sort by country, year, month_num
df_temp_long = df_temp_long.sort_values(['country', 'year', 'month_num'])

# ============================================

In [None]:
# Verify the month numbers
print("Temperature data with month numbers:")
df_temp_long[['country', 'year', 'month', 'month_num', 'monthly_anomaly_c']].head(15)

### Part C: Answer These Questions

1. Why is the long format considered "tidy" for this data?

   **Your answer:** 

2. What is the formula to calculate the expected number of rows after melting?

   **Your answer:** 

3. When would you want to convert BACK from long to wide format?

   **Your answer:** 

---

---
---

# Question 3: Missing Data Analysis (25 points)

## Task: Analyze and Handle Missing Data

As discussed in Lecture 6, understanding **why** data is missing helps decide how to handle it:

| Type | Description | Strategy |
|------|-------------|----------|
| **MCAR** | Missing Completely At Random | Drop or impute |
| **MAR** | Missing At Random (depends on other variables) | Impute with care |
| **MNAR** | Missing Not At Random | Cannot ignore |
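
One rough way to probe the mechanism is to compare missing rates across groups of an observed variable. The sketch below does this on made-up data (the values are assumptions). Clearly different rates between groups make MCAR less plausible, although this check alone cannot separate MAR from MNAR.

```python
import pandas as pd

# Hypothetical toy data: yield is missing more often for one crop
toy = pd.DataFrame({
    "crop": ["Rice", "Rice", "Rice", "Rice",
             "Wheat", "Wheat", "Wheat", "Wheat"],
    "yield_kg_ha": [4000.0, None, None, 4200.0,
                    3000.0, 3100.0, 2900.0, None],
})

# Share of missing yield values within each crop
missing_rate = toy["yield_kg_ha"].isna().groupby(toy["crop"]).mean()
print(missing_rate)
```

Here half of the Rice rows and a quarter of the Wheat rows are missing, which hints that missingness is related to crop type in this toy example.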

---

### Part A: Calculate Missing Data Statistics

**YOUR TASK:** Create a function that calculates missing data statistics for any DataFrame.

The function should return a DataFrame with:
- Column name
- Total count
- Missing count
- Missing percentage

---

In [None]:
# ============================================
# YOUR CODE HERE: Create missing data function
# ============================================

def missing_data_report(df):
    """
    Generate a missing data report for a DataFrame.
    
    Parameters:
    -----------
    df : pd.DataFrame
        Input DataFrame to analyze
    
    Returns:
    --------
    pd.DataFrame with columns: column, total, missing, missing_pct
    """
    # Calculate total rows
    total = len(df)
    
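    # Calculate missing for each column
    missing = df.isnull().sum()  # Series indexed by column name
    
    report = pd.DataFrame({
        'column': missing.index.tolist(),
        'total': total,
        'missing': missing.values,
        # Guard against division by zero for an empty DataFrame
        'missing_pct': (missing.values / total * 100) if total > 0 else 0.0
    })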
    
    # Round percentage to 2 decimal places
    report['missing_pct'] = report['missing_pct'].round(2)
    
    # Sort by missing_pct descending
    report = report.sort_values('missing_pct', ascending=False)
    
    return report

# ============================================

In [None]:
# Test your function on crop_production
print("Missing Data Report: Crop Production")
print("=" * 50)
missing_crops = missing_data_report(df_crops)
missing_crops

In [None]:
# Test on temperature (long format)
print("Missing Data Report: Temperature Anomalies (Long)")
print("=" * 50)
missing_temp = missing_data_report(df_temp_long)
missing_temp

### Part B: Analyze Missing Patterns

**YOUR TASK:** Investigate which countries/years have the most missing temperature data.

---

In [None]:
# ============================================
# YOUR CODE HERE: Find countries with most missing monthly data
# ============================================

# Group by country and count missing values
# Hint: use .isnull() and .sum() after groupby

missing_by_country = df_temp_long.groupby('country')['monthly_anomaly_c'].apply(
    lambda x: x.isnull().sum()  # count null values
).reset_index()

missing_by_country.columns = ['country', 'missing_months']

# Sort by missing count descending and show top 10
missing_by_country = missing_by_country.sort_values('missing_months', ascending=False)
print("Countries with most missing monthly temperature data:")
missing_by_country.head(10)

# ============================================

### Part C: Handle Missing Data

**YOUR TASK:** For the `df_crops` DataFrame, handle missing `yield_kg_ha` values using **group-based imputation** (fill with the mean yield for each crop type).

This is appropriate when we believe the missing mechanism is **MAR** - missing values depend on the crop type.
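
A small worked example may help before applying this to the full table. The sketch below uses made-up values (an assumption, not the lab data) and sets the flag before filling, so the record of which values were originally missing is kept.

```python
import pandas as pd

# Hypothetical toy data with one missing yield per crop
toy = pd.DataFrame({
    "crop": ["Rice", "Rice", "Rice", "Wheat", "Wheat"],
    "yield_kg_ha": [4000.0, None, 5000.0, 3000.0, None],
})

# Flag first, so the record of what was missing survives the fill
toy["yield_imputed"] = toy["yield_kg_ha"].isna()

# transform("mean") returns a Series aligned row-by-row with the original frame
crop_mean = toy.groupby("crop")["yield_kg_ha"].transform("mean")
toy["yield_kg_ha"] = toy["yield_kg_ha"].fillna(crop_mean)
print(toy)
```

The missing Rice value is filled with 4500.0 (the mean of 4000 and 5000) and the missing Wheat value with 3000.0, rather than with a single overall mean.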

---

In [None]:
# ============================================
# YOUR CODE HERE: Impute missing yield with crop-specific mean
# ============================================

# Make a copy to avoid modifying original
df_crops_imputed = df_crops.copy()

# Count missing before
missing_before = df_crops_imputed['yield_kg_ha'].isnull().sum()
print(f"Missing yield values before: {missing_before}")

# Calculate mean yield for each crop
crop_mean_yield = df_crops_imputed.groupby('crop')['yield_kg_ha'].transform('mean')

# Fill missing values with crop-specific mean
# Hint: use fillna()
df_crops_imputed['yield_kg_ha'] = df_crops_imputed['yield_kg_ha'].fillna(crop_mean_yield)

# Count missing after
missing_after = df_crops_imputed['yield_kg_ha'].isnull().sum()
print(f"Missing yield values after: {missing_after}")

# ============================================

In [None]:
# Add a flag column to track which values were imputed
df_crops_imputed['yield_imputed'] = df_crops['yield_kg_ha'].isnull()

print(f"Rows with imputed yield: {df_crops_imputed['yield_imputed'].sum()}")

# Show some imputed rows
print("\nSample of imputed rows:")
df_crops_imputed[df_crops_imputed['yield_imputed']][['country', 'year', 'crop', 'yield_kg_ha', 'yield_imputed']].head(10)

### Part D: Answer These Questions

1. What percentage of `yield_kg_ha` values were missing in the original data?

   **Your answer:** 

2. Why did we use crop-specific mean instead of overall mean for imputation?

   **Your answer:** 

3. Why is it important to create a flag column (`yield_imputed`) to track imputed values?

   **Your answer:** 

---

---
---

# Question 4: Data Validation (25 points)

## Task: Implement Validation Checks

As discussed in Lecture 6, validation catches problems early:

| Type | Check | Example |
|------|-------|----------|
| **Range** | Values within bounds | Year between 1900-2100 |
| **Null** | Required fields present | country_id not null |
| **Type** | Correct data type | Year is integer |
| **Uniqueness** | No duplicates | Unique country-year-crop combo |
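
The parts below cover range, null, and uniqueness checks. For the type check, a minimal sketch could look like the following. The helper `validate_dtype` and the toy data are hypothetical names introduced here for illustration; the dtype tests themselves come from `pd.api.types`.

```python
import pandas as pd

def validate_dtype(df, column, kind):
    """Hypothetical helper: True if df[column] has the expected kind of dtype."""
    checks = {
        "integer": pd.api.types.is_integer_dtype,
        "float": pd.api.types.is_float_dtype,
        "numeric": pd.api.types.is_numeric_dtype,
    }
    return bool(checks[kind](df[column]))

# Made-up example data
toy = pd.DataFrame({
    "year": [2001, 1993],
    "yield_kg_ha": [3208.43, 5385.02],
    "crop": ["Rice", "Maize"],
})

print(validate_dtype(toy, "year", "integer"))       # year holds whole numbers
print(validate_dtype(toy, "yield_kg_ha", "float"))  # yield holds decimals
print(validate_dtype(toy, "crop", "numeric"))       # crop is text, so this fails
```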

---

### Part A: Range Validation

**YOUR TASK:** Check that values are within expected ranges:
- `year`: between 1900 and 2100
- `yield_kg_ha`: between 0 and 50000 (reasonable crop yield)
- `irrigation_pct`: between 0 and 100

---

In [None]:
# ============================================
# YOUR CODE HERE: Implement range validation
# ============================================

def validate_range(df, column, min_val, max_val):
    """
    Check if values in a column are within the specified range.
    
    Returns: DataFrame with rows that FAIL validation
    """
    # Find rows where value is outside range (excluding nulls)
    # Comparisons with NaN evaluate to False, so null rows are never flagged here
    
    mask = (df[column] < min_val) | (df[column] > max_val)
    
    invalid_rows = df[mask]
    return invalid_rows

# ============================================

# Test range validations
print("Range Validation Results")
print("=" * 50)

# Check year range
invalid_year = validate_range(df_crops, 'year', 1900, 2100)
print(f"Invalid year values (outside 1900-2100): {len(invalid_year)}")

# Check yield range
invalid_yield = validate_range(df_crops, 'yield_kg_ha', 0, 50000)
print(f"Invalid yield values (outside 0-50000): {len(invalid_yield)}")

# Check irrigation range
invalid_irrigation = validate_range(df_crops, 'irrigation_pct', 0, 100)
print(f"Invalid irrigation values (outside 0-100): {len(invalid_irrigation)}")

### Part B: Null Validation

**YOUR TASK:** Check that required fields are not null:
- `country`: Required (the loaded `df_crops` has `country`, not `country_id`)
- `year`: Required
- `crop`: Required

---

In [None]:
# ============================================
# YOUR CODE HERE: Implement null validation
# ============================================

def validate_not_null(df, columns):
    """
    Check that specified columns have no null values.
    
    Parameters:
    -----------
    df : pd.DataFrame
    columns : list of column names
    
    Returns: dict with column name and count of nulls
    """
    results = {}
    for col in columns:
        # Count null values in each column
        null_count = int(df[col].isnull().sum())
        results[col] = null_count
    return results

# ============================================

# Test null validation
required_columns = ['country', 'year', 'crop']
null_results = validate_not_null(df_crops, required_columns)

print("Null Validation Results")
print("=" * 50)
for col, count in null_results.items():
    status = "PASS" if count == 0 else "FAIL"
    print(f"{col}: {count} nulls [{status}]")

### Part C: Uniqueness Validation

**YOUR TASK:** Check that there are no duplicate records for the same country-year-crop combination.

---

In [None]:
# ============================================
# YOUR CODE HERE: Check for duplicates
# ============================================

def validate_unique(df, columns):
    """
    Check that combinations of columns are unique.
    
    Parameters:
    -----------
    df : pd.DataFrame
    columns : list of columns that should be unique together
    
    Returns: DataFrame with duplicate rows
    """
    # Find duplicates
    # Hint: use df.duplicated(subset=columns, keep=False)
    
    duplicates = df[df.duplicated(subset=columns, keep=False)]
    return duplicates

# ============================================

# Test uniqueness validation
unique_cols = ['country', 'year', 'crop']
duplicates = validate_unique(df_crops, unique_cols)

print("Uniqueness Validation Results")
print("=" * 50)
print(f"Duplicate country-year-crop combinations: {len(duplicates)}")

if len(duplicates) > 0:
    print("\nSample duplicates:")
    print(duplicates[unique_cols + ['production_tonnes']].head(10))

### Part D: Create a Complete Validation Report

**YOUR TASK:** Combine all validations into a summary report.

---

In [None]:
# ============================================
# YOUR CODE HERE: Create validation summary
# ============================================

print("=" * 60)
print("DATA VALIDATION SUMMARY - CROP PRODUCTION")
print("=" * 60)
print(f"Total rows: {len(df_crops)}")
print(f"Generated at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print()

# 1. Range checks
print("1. RANGE CHECKS")
print("-" * 40)
print(f"   Year (1900-2100): {len(validate_range(df_crops, 'year', 1900, 2100))} failures")
print(f"   Yield (0-50000): {len(validate_range(df_crops, 'yield_kg_ha', 0, 50000))} failures")
print(f"   Irrigation (0-100): {len(validate_range(df_crops, 'irrigation_pct', 0, 100))} failures")
print()

# 2. Null checks
print("2. NULL CHECKS (Required Fields)")
print("-" * 40)
for col, count in validate_not_null(df_crops, ['country', 'year', 'crop']).items():
    status = "PASS" if count == 0 else f"FAIL ({count} nulls)"
    print(f"   {col}: {status}")
print()

# 3. Uniqueness check
print("3. UNIQUENESS CHECK")
print("-" * 40)
dup_count = len(validate_unique(df_crops, ['country', 'year', 'crop']))
status = "PASS" if dup_count == 0 else f"FAIL ({dup_count} duplicates)"
print(f"   country + year + crop: {status}")
print()

print("=" * 60)

### Part E: Answer These Questions

1. Why is range validation important for data quality?

   **Your answer:** 

2. What would you do if you found duplicate country-year-crop records?

   **Your answer:** 

3. List one additional validation check that would be useful for this dataset.

   **Your answer:** 

---

---

## Submission Checklist

Before submitting, make sure:

- [ ] **Question 1**: Checked data types and answered all questions
- [ ] **Question 2**: Successfully reshaped temperature data from wide to long
- [ ] **Question 3**: Created missing data report and implemented imputation
- [ ] **Question 4**: Implemented all three validation checks (range, null, unique)
- [ ] All markdown questions have been answered

### How to Submit

1. Save this notebook
2. Export as PDF or HTML
3. Submit via Canvas by **Wednesday, January 21, 2026 at 11:59 PM**

---

## Grading Rubric

| Question | Points | Description |
|----------|--------|-------------|
| Q1 | 15 | Data type inspection and questions |
| Q2 | 25 | Reshape (melt) implementation |
| Q3 | 25 | Missing data analysis and imputation |
| Q4 | 25 | Validation checks implementation |
| **Style** | 10 | Code quality, comments, formatting |
| **Total** | **100** | |

---