# Lecture 1: Introduction to Pandas and Matplotlib

**Applied Mathematics - Data Science Module**

Welcome to your first lecture on using Python for data analysis! In this session, we will introduce two of the most important libraries in the data science world: **Pandas** and **Matplotlib**. These tools are essential for anyone working with data, including applied mathematicians, engineers, and scientists.

## 📚 Learning Objectives

By the end of this lecture, you will be able to:

- ✅ Understand the basic data science workflow
- ✅ Load and inspect datasets using Pandas
- ✅ Select and filter data in a Pandas DataFrame
- ✅ Compute basic statistics on data
- ✅ Create fundamental plots (line, scatter, bar) using Matplotlib
- ✅ Customize plots with labels, titles, and colors
- ✅ Interpret visualizations in the context of real-world data

## 🔄 The Data Science Pipeline

A typical data science project follows a pipeline of steps. Understanding this workflow is crucial for effective data analysis:

```
┌─────────────────┐
│ Data Acquisition│  ← Getting data from files, databases, APIs
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Data Cleaning   │  ← Handling missing values, removing duplicates
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Exploration     │  ← Visualizing, summarizing, understanding patterns
│ (EDA)           │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Modeling        │  ← Statistical/ML models (not covered today)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Communication   │  ← Presenting results through visualizations
└─────────────────┘
```

In this lecture, we focus on **Data Acquisition**, **Exploration**, and **Communication**.

## 1. Introduction to Pandas

**Pandas** is a powerful library for data manipulation and analysis. It provides data structures and functions needed to work with structured data seamlessly.

The primary data structure in Pandas is the **DataFrame**, which is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table.
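
As a minimal sketch of that idea, the toy table below (invented values, not taken from the lecture's dataset) builds a DataFrame from a dictionary. The name `toy` is only for illustration; the point is that each column keeps its own type.

```python
import pandas as pd

# Toy data for illustration only; the values are invented
toy = pd.DataFrame({
    "country": ["A", "B", "C"],      # text column
    "year": [2000, 2000, 2000],      # integer column
    "co2": [1.5, 20.0, 300.25],      # floating-point column
})

print(toy.dtypes)   # one dtype per column
print(toy.shape)    # (rows, columns)
```

Each key of the dictionary becomes a column label, and the row labels default to 0, 1, 2.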

In [None]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set display options for better output
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 100)

### 1.1 Loading Data

We will be using a dataset of **CO2 emissions** from [Our World in Data](https://ourworldindata.org/co2-emissions). This dataset contains information about CO2 emissions, GDP, population, and other metrics for countries around the world from 1750 to the present.

**Why this dataset?** Climate change is one of the most pressing issues of our time, and understanding CO2 emissions is crucial for addressing it. This dataset allows us to explore trends, compare countries, and understand the relationship between economic development and environmental impact.

In [None]:
# Load the CSV file into a Pandas DataFrame
df = pd.read_csv('data/owid-co2-data.csv')

print("Dataset loaded successfully!")
print(f"Shape: {df.shape[0]} rows × {df.shape[1]} columns")
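
When a file has many columns, it can help to load only the ones you need. The sketch below uses the `usecols` parameter of `pd.read_csv` on a tiny in-memory CSV that stands in for the real file; the values and the variable names `csv_text` and `small` are invented for illustration.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for data/owid-co2-data.csv (values invented)
csv_text = """country,year,co2,population,gdp
A,2000,1.5,1000,50000
A,2001,1.6,1010,
B,2000,20.0,50000,900000
"""

# usecols keeps only the listed columns; empty fields are read as NaN
small = pd.read_csv(io.StringIO(csv_text), usecols=["country", "year", "co2", "gdp"])

print(small.shape)
print(small["gdp"].isna().sum())  # number of missing GDP values
```

With a real file, the first argument would simply be the path, as in the cell above.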

### 1.2 Inspecting the Data

Before we start analyzing data, we need to understand its structure. Let's look at the first few rows using the `head()` method.

In [None]:
# Display the first 5 rows
df.head()

**What do we see?**

- `country`: The name of the country or region
- `year`: The year of observation
- `co2`: Annual CO2 emissions (in million tonnes)
- `population`: Population of the country
- `gdp`: Gross Domestic Product
- And many more columns...

Now let's get more information about the dataset structure:

In [None]:
# Get a concise summary of the DataFrame
df.info()

**Key observations:**
- We can see the data types of each column (float64, int64, object)
- Some columns have missing values (non-null count < total entries)
- This is normal in real-world datasets!
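
To quantify the missing values rather than read them off `info()`, one option is to count them per column. This is a small sketch on invented data; on the lecture's DataFrame the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame with some gaps (values invented)
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "co2": [1.0, np.nan, 3.0, 4.0],
    "gdp": [np.nan, np.nan, 10.0, 12.0],
})

missing_count = toy.isna().sum()          # missing entries per column
missing_share = toy.isna().mean() * 100   # the same, as a percentage of rows

print(missing_count)
print(missing_share.round(1))
```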

In [None]:
# Get basic descriptive statistics
df.describe()

**Interpretation:**
- The `describe()` method gives us statistics like mean, standard deviation, min, max, and quartiles
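- For example, mean annual CO2 emissions are around 200 million tonnes, but the maximum is over 10,000 million tonnes!
- This suggests a highly skewed distribution with some countries emitting much more than others (keep in mind that the dataset also contains aggregate rows such as `World`, which inflate the maximum)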
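
A quick way to check for skew is to compare the mean with the median. The sketch below uses invented numbers (many small values and one very large one) to show how a single extreme value pulls the mean upward while the median barely moves.

```python
import pandas as pd

# Invented values: many small numbers and one very large one
emissions = pd.Series([1, 2, 2, 3, 4, 5, 1000], dtype=float)

mean_value = emissions.mean()
median_value = emissions.median()
skewness = emissions.skew()   # sample skewness; positive means a long right tail

print(f"mean = {mean_value:.2f}, median = {median_value:.2f}, skew = {skewness:.2f}")
```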

### 1.3 Selecting Data

One of the most common operations in data analysis is selecting specific columns or rows. Let's explore different ways to do this.

In [None]:
# Select a single column
countries = df['country']
print(f"Type: {type(countries)}")
print(f"\nFirst 10 countries:")
print(countries.head(10))

In [None]:
# Select multiple columns
subset = df[['country', 'year', 'co2', 'population']]
subset.head()
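
Square brackets select columns; to select rows and columns together, Pandas offers `.loc` (by label) and `.iloc` (by position). A minimal sketch on invented data:

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["A", "A", "B"],
    "year": [2000, 2001, 2000],
    "co2": [1.5, 1.6, 20.0],
})

# Label-based: rows whose index label is 0 or 2, and two named columns
by_label = toy.loc[[0, 2], ["country", "co2"]]

# Position-based: first two rows, first two columns
by_position = toy.iloc[:2, :2]

print(by_label)
print(by_position)
```

Here the row labels happen to equal the positions; after filtering or sorting they generally differ, which is when the distinction matters.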

### 1.4 Filtering Data

Filtering allows us to select rows that meet certain conditions. This is extremely useful for focusing on specific subsets of data.

In [None]:
# Filter: Get data for the entire world
world_df = df[df['country'] == 'World']
print(f"World data: {len(world_df)} rows")
world_df.head()

In [None]:
# Filter: Get data for recent years (after 2000)
recent_df = df[df['year'] > 2000]
print(f"Recent data (after 2000): {len(recent_df)} rows")

In [None]:
# Multiple conditions: World data after 2000
world_recent = df[(df['country'] == 'World') & (df['year'] > 2000)]
print(f"World data after 2000: {len(world_recent)} rows")
world_recent.head()
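
Note that each condition needs its own parentheses, because `&` binds more tightly than `==` and `>` in Python. As a sketch on invented data, the same filter can also be written with `query()`, and `between()` covers range conditions:

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["World", "World", "A", "B"],
    "year": [1999, 2005, 2005, 2010],
    "co2": [100.0, 120.0, 1.5, 20.0],
})

# Boolean mask, as in the cells above
mask_result = toy[(toy["country"] == "World") & (toy["year"] > 2000)]

# The same filter written as a query string
query_result = toy.query("country == 'World' and year > 2000")

# between() is inclusive at both ends by default
in_range = toy[toy["year"].between(2005, 2010)]

print(mask_result)
print(in_range)
```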

## 2. Introduction to Matplotlib

**Matplotlib** is the most widely used Python library for creating static, animated, and interactive visualizations. It provides a MATLAB-like interface and is highly customizable.

The basic workflow is:
1. Create a figure and axes
2. Plot data on the axes
3. Customize (labels, titles, colors, etc.)
4. Display or save the figure
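
The four steps above can be sketched with Matplotlib's object-oriented interface, where the figure and axes are explicit objects. The data values are invented, and saving to an in-memory buffer is only a choice made here so that the example leaves no file behind.

```python
import io
import matplotlib.pyplot as plt

years = [2000, 2005, 2010, 2015, 2020]   # invented sample values
values = [1.0, 1.4, 1.9, 2.1, 2.0]

# 1. Create a figure and axes
fig, ax = plt.subplots(figsize=(8, 4))

# 2. Plot data on the axes
ax.plot(years, values, color="darkred", linewidth=2)

# 3. Customize
ax.set_xlabel("Year")
ax.set_ylabel("Value (arbitrary units)")
ax.set_title("Toy line plot")
ax.grid(True, alpha=0.3)

# 4. Save the figure (here to an in-memory buffer; a filename such as
#    "plot.png" would write to disk instead)
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=100)
plt.close(fig)
```

The `plt.xlabel(...)`-style calls used in the rest of this lecture act on the current axes and do the same job; the explicit form becomes convenient once a figure has several subplots.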

### 2.1 Line Plot

A **line plot** is ideal for showing trends over time. Let's visualize how global CO2 emissions have changed over the years.

In [None]:
# Create a line plot of global CO2 emissions over time
plt.figure(figsize=(12, 6))
plt.plot(world_df['year'], world_df['co2'], linewidth=2, color='darkred')
plt.xlabel('Year', fontsize=12)
plt.ylabel('Annual CO2 Emissions (million tonnes)', fontsize=12)
plt.title('Global CO2 Emissions Over Time', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

**Interpretation:**

- We can see a dramatic increase in CO2 emissions, especially after 1950
- This corresponds to the post-WWII industrial boom and rapid economic growth
- The curve looks roughly exponential for much of this period, which is concerning for climate change
- There's a slight dip around 2020 (likely due to the COVID-19 pandemic)

### 2.2 Scatter Plot

A **scatter plot** is useful for examining the relationship between two variables. Let's explore the relationship between GDP and CO2 emissions.

In [None]:
# Remove rows with missing GDP or CO2 data
world_clean = world_df.dropna(subset=['gdp', 'co2'])

# Create a scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(world_clean['gdp'], world_clean['co2'], alpha=0.6, s=50, color='steelblue')
plt.xlabel('GDP (international-$)', fontsize=12)  # OWID reports GDP in international-$, not in billions
plt.ylabel('Annual CO2 Emissions (million tonnes)', fontsize=12)
plt.title('Global CO2 Emissions vs. GDP', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

**Interpretation:**

- There's a clear positive correlation between GDP and CO2 emissions
- As economies grow, they tend to emit more CO2
- This raises an important question: Can we achieve economic growth while reducing emissions?
- This is the concept of "decoupling" that many countries are trying to achieve
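
One way to make decoupling measurable is to track emissions per unit of GDP and to check whether emissions fall while GDP rises. The sketch below uses invented numbers; the column names `intensity` and `absolute_decoupling` are hypothetical and not part of the dataset.

```python
import pandas as pd

# Invented values: GDP grows every period, emissions rise and then fall
toy = pd.DataFrame({
    "year": [2000, 2005, 2010, 2015],
    "gdp": [100.0, 120.0, 150.0, 180.0],
    "co2": [50.0, 57.0, 60.0, 58.0],
})

# Emissions per unit of GDP (lower means less CO2 per unit of output)
toy["intensity"] = toy["co2"] / toy["gdp"]

# Absolute decoupling in a period: GDP went up while emissions went down
toy["absolute_decoupling"] = (toy["gdp"].diff() > 0) & (toy["co2"].diff() < 0)

print(toy)
```

In this toy example the intensity falls throughout, but emissions themselves only fall in the last period.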

### 2.3 Bar Chart

A **bar chart** is excellent for comparing values across categories. Let's compare CO2 emissions of different countries in a specific year.

In [None]:
# Select a hand-picked set of major emitters and keep their 2020 rows
countries_of_interest = ['United States', 'China', 'India', 'Russia', 'Japan']
year_2020_df = df[(df['year'] == 2020) & (df['country'].isin(countries_of_interest))]

# Sort by CO2 emissions for better visualization
year_2020_df = year_2020_df.sort_values('co2', ascending=False)

# Create a bar chart
plt.figure(figsize=(10, 6))
bars = plt.bar(year_2020_df['country'], year_2020_df['co2'], color=['#d62728', '#ff7f0e', '#2ca02c', '#1f77b4', '#9467bd'])
plt.xlabel('Country', fontsize=12)
plt.ylabel('Annual CO2 Emissions (million tonnes)', fontsize=12)
plt.title('CO2 Emissions by Country in 2020', fontsize=14, fontweight='bold')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

**Interpretation:**

- China is by far the largest emitter, followed by the United States
- However, it's important to note that China has a much larger population
- To get a fairer comparison, we should look at **per capita emissions** (emissions per person)
- Let's explore that next!

In [None]:
# Compare per capita emissions
year_2020_df_clean = year_2020_df.dropna(subset=['co2_per_capita'])

plt.figure(figsize=(10, 6))
bars = plt.bar(year_2020_df_clean['country'], year_2020_df_clean['co2_per_capita'], 
               color=['#d62728', '#ff7f0e', '#2ca02c', '#1f77b4', '#9467bd'])
plt.xlabel('Country', fontsize=12)
plt.ylabel('CO2 Emissions Per Capita (tonnes per person)', fontsize=12)
plt.title('CO2 Emissions Per Capita by Country in 2020', fontsize=14, fontweight='bold')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

**Interpretation:**

- The picture changes dramatically when we look at per capita emissions!
- The United States has much higher per capita emissions than China
- India has relatively low per capita emissions despite its large population
- This highlights the importance of choosing the right metric for comparison
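
The per capita figure is simple arithmetic: since `co2` is given in million tonnes, multiply by one million and divide by the population to get tonnes per person. A sketch with invented numbers (the column name `per_capita` is chosen here for illustration):

```python
import pandas as pd

# Invented values; co2 is in million tonnes, as in the dataset
toy = pd.DataFrame({
    "country": ["A", "B"],
    "co2": [5000.0, 500.0],                      # million tonnes
    "population": [1_000_000_000, 50_000_000],   # people
})

# Convert million tonnes to tonnes, then divide by the number of people
toy["per_capita"] = toy["co2"] * 1e6 / toy["population"]

print(toy)
```

Country A emits ten times as much in total, yet country B emits twice as much per person.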

### 2.4 Customizing Plots

Let's create a more sophisticated plot with multiple customizations.

In [None]:
# Select data for a few countries over time
countries_to_plot = ['United States', 'China', 'India']
recent_years = df[df['year'] >= 1990]

plt.figure(figsize=(12, 7))

for country in countries_to_plot:
    country_data = recent_years[recent_years['country'] == country]
    plt.plot(country_data['year'], country_data['co2'], marker='o', markersize=4, linewidth=2, label=country)

plt.xlabel('Year', fontsize=12)
plt.ylabel('Annual CO2 Emissions (million tonnes)', fontsize=12)
plt.title('CO2 Emissions Trends: USA, China, and India (1990-Present)', fontsize=14, fontweight='bold')
plt.legend(fontsize=11, loc='upper left')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

**Interpretation:**

- China's emissions have grown dramatically since the 1990s, overtaking the US around 2005
- US emissions have remained relatively stable and even decreased slightly in recent years
- India's emissions are growing but at a slower rate than China's
- These trends reflect different stages of economic development and industrialization
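
To find a crossover year numerically instead of reading it off the plot, one approach is to reshape the data so that each country is a column. The sketch below uses invented data with hypothetical countries `X` and `Y`.

```python
import pandas as pd

# Invented long-format data: one row per (country, year)
toy = pd.DataFrame({
    "country": ["X", "X", "X", "Y", "Y", "Y"],
    "year": [2000, 2001, 2002, 2000, 2001, 2002],
    "co2": [10.0, 11.0, 12.0, 5.0, 11.5, 20.0],
})

# Wide format: one row per year, one column per country
wide = toy.pivot(index="year", columns="country", values="co2")

# Years in which Y's emissions exceed X's, and the first such year
overtake = wide[wide["Y"] > wide["X"]]
first_overtake = overtake.index.min() if len(overtake) else None

print(wide)
print(first_overtake)
```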

## 3. Exercises

Now it's your turn! Try these exercises to practice what you've learned:

**Exercise 1:** Create a line plot showing the population growth of the world over time.

**Exercise 2:** Create a scatter plot showing the relationship between population and CO2 emissions for all countries in 2020.

**Exercise 3:** Create a bar chart comparing the GDP of the top 5 economies in 2020.

**Exercise 4:** Filter the dataset to show only European countries and create a line plot of their combined CO2 emissions over time.

In [None]:
# Your code for Exercise 1 here


In [None]:
# Your code for Exercise 2 here


In [None]:
# Your code for Exercise 3 here


In [None]:
# Your code for Exercise 4 here


## 6. Exploratory Data Analysis (EDA)

Exploratory Data Analysis is the process of investigating datasets to discover patterns, spot anomalies, test hypotheses, and check assumptions. Let's perform a comprehensive EDA on our CO2 emissions dataset!

### 6.1 Understanding Data Distribution

Let's examine how CO2 emissions are distributed across countries.

In [None]:
# Distribution of CO2 emissions in 2020
exclude_regions = ['World', 'Asia', 'Africa', 'Europe', 'North America', 'South America', 'Oceania',
                   'European Union (27)', 'High-income countries', 'Low-income countries',
                   'Lower-middle-income countries', 'Upper-middle-income countries']

countries_2020 = df[(df['year'] == 2020) & (~df['country'].isin(exclude_regions))].dropna(subset=['co2'])

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Histogram
axes[0].hist(countries_2020['co2'], bins=50, color='steelblue', edgecolor='black', alpha=0.7)
axes[0].set_xlabel('CO2 Emissions (million tonnes)', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Number of Countries', fontsize=12, fontweight='bold')
axes[0].set_title('Distribution of CO2 Emissions (2020)', fontsize=13, fontweight='bold')
axes[0].grid(axis='y', alpha=0.3)
axes[0].text(0.7, 0.95, f'Total countries: {len(countries_2020)}', transform=axes[0].transAxes,
            fontsize=10, verticalalignment='top', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

# Same histogram with a logarithmic y-axis, so bins with few countries stay visible
axes[1].hist(countries_2020['co2'], bins=50, color='coral', edgecolor='black', alpha=0.7)
axes[1].set_xlabel('CO2 Emissions (million tonnes)', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Number of Countries', fontsize=12, fontweight='bold')
axes[1].set_title('Distribution (Log-Scaled Counts)', fontsize=13, fontweight='bold')
axes[1].set_yscale('log')
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nStatistics for CO2 emissions in 2020:")
print(f"Mean: {countries_2020['co2'].mean():.2f} million tonnes")
print(f"Median: {countries_2020['co2'].median():.2f} million tonnes")
print(f"Std Dev: {countries_2020['co2'].std():.2f} million tonnes")
print(f"Min: {countries_2020['co2'].min():.2f} million tonnes")
print(f"Max: {countries_2020['co2'].max():.2f} million tonnes")

**Key Observations:**
- The distribution is highly skewed (mean >> median)
- Most countries have relatively low emissions
- A few countries dominate global emissions
- This is typical of a **heavy-tailed distribution**; showing that it follows a **power law** specifically would require a formal fit
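
For data this skewed, equally wide bins put almost everything into the first bar. A common alternative is to use bin edges that are equally spaced on a logarithmic scale. The sketch below illustrates this with NumPy only, on an invented log-normal sample rather than the real emissions data.

```python
import numpy as np

# Invented, strongly right-skewed sample (log-normal draws, fixed seed)
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=2.0, sigma=2.0, size=200)

# 21 edges (20 bins) equally spaced on a log scale between the min and max
log_edges = np.geomspace(sample.min(), sample.max(), num=21)

linear_counts, _ = np.histogram(sample, bins=20)
log_counts, _ = np.histogram(sample, bins=log_edges)

# With linear bins most points land in the first bin; log bins spread them out
print(linear_counts[:3])
print(log_counts)
```

The same edges could be passed to a histogram call as `bins=log_edges`, together with `set_xscale('log')` on the axes so the bars appear equally wide.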

### 6.2 Correlation Analysis

Let's explore relationships between different variables using correlation.

In [None]:
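# Calculate correlation matrix for key variables (countries only:
# aggregate rows such as 'World' would otherwise dominate the result)
correlation_vars = ['co2', 'population', 'gdp', 'co2_per_capita']
corr_data = df[(df['year'] == 2020) & (~df['country'].isin(exclude_regions))][correlation_vars].dropna()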
correlation_matrix = corr_data.corr()

print("Correlation Matrix (2020):")
print(correlation_matrix.round(3))

# Visualize correlation matrix
fig, ax = plt.subplots(figsize=(10, 8))
im = ax.imshow(correlation_matrix, cmap='coolwarm', aspect='auto', vmin=-1, vmax=1)

# Set ticks and labels
ax.set_xticks(np.arange(len(correlation_vars)))
ax.set_yticks(np.arange(len(correlation_vars)))
ax.set_xticklabels(correlation_vars, fontsize=11)
ax.set_yticklabels(correlation_vars, fontsize=11)

# Rotate the tick labels
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")

# Add correlation values as text
for i in range(len(correlation_vars)):
    for j in range(len(correlation_vars)):
        text = ax.text(j, i, f'{correlation_matrix.iloc[i, j]:.2f}',
                      ha="center", va="center", color="black", fontsize=12, fontweight='bold')

ax.set_title('Correlation Matrix of Key Variables (2020)', fontsize=14, fontweight='bold', pad=20)
plt.colorbar(im, ax=ax, label='Correlation Coefficient')
plt.tight_layout()
plt.show()

**Interpretation:**
- **Strong positive correlation** between CO2 and GDP (0.9+): Economic activity drives emissions
- **Moderate correlation** between CO2 and population: More people ≠ proportionally more emissions
- **Weak/negative correlation** between population and CO2 per capita: Large populations don't always mean high per capita emissions
- This suggests that **wealth matters more than population size** for emissions
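
The default `corr()` computes Pearson correlation, which measures linear association and is sensitive to extreme values. For skewed data it can be worth comparing it with the rank-based Spearman correlation. A minimal sketch on invented data where `y` always increases with `x`, but not linearly:

```python
import pandas as pd

# Invented data: y grows monotonically with x, but not along a straight line
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
toy["y"] = toy["x"] ** 4

pearson = toy.corr(method="pearson").loc["x", "y"]     # linear association
spearman = toy.corr(method="spearman").loc["x", "y"]   # rank (monotonic) association

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Spearman reaches 1 because the ordering of `x` and `y` agrees perfectly, while Pearson stays below 1 because the relationship is curved. Neither coefficient says anything about causation.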

### 6.3 Temporal Trends Analysis

How have emissions changed over different time periods?

In [None]:
# Analyze emissions by decade
# .copy() gives an independent DataFrame, so adding a column raises no SettingWithCopyWarning
world_data = df[df['country'] == 'World'].dropna(subset=['co2', 'year']).copy()
world_data['decade'] = (world_data['year'] // 10) * 10

# Calculate average emissions per decade
decade_avg = world_data.groupby('decade')['co2'].mean().reset_index()
decade_avg = decade_avg[decade_avg['decade'] >= 1900].copy()

plt.figure(figsize=(14, 7))
bars = plt.bar(decade_avg['decade'], decade_avg['co2'], width=8, 
               color=plt.cm.Reds(np.linspace(0.3, 0.9, len(decade_avg))),
               edgecolor='black', linewidth=1.5)

# Add value labels on bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
            f'{height:.0f}',
            ha='center', va='bottom', fontsize=9, fontweight='bold')

plt.xlabel('Decade', fontsize=13, fontweight='bold')
plt.ylabel('Average Annual CO2 Emissions (million tonnes)', fontsize=13, fontweight='bold')
plt.title('Global CO2 Emissions by Decade (1900-Present)', fontsize=15, fontweight='bold')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

# Calculate growth rates
decade_avg['growth_rate'] = decade_avg['co2'].pct_change() * 100
print("\nDecadal Growth Rates:")
print(decade_avg[['decade', 'co2', 'growth_rate']].to_string(index=False))

**Key Findings:**
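- Rapid, sustained growth from 1950 onwards (check the printed growth rates: truly exponential growth would show a roughly constant percentage per decade)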
- 1970s-1980s: Rapid industrialization
- 2000s-2010s: Continued growth despite climate awareness
- Recent decades show slowing growth rate (but still growing!)
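
A decadal growth rate can be converted into an equivalent constant yearly rate with $(1 + g)^{1/10} - 1$, where $g$ is the growth over the decade as a fraction. The sketch below applies this to invented decade averages; the column names `growth_pct` and `annual_pct` are chosen here for illustration.

```python
import pandas as pd

# Invented decade averages (million tonnes)
decades = pd.DataFrame({
    "decade": [1950, 1960, 1970],
    "co2": [1000.0, 2000.0, 3000.0],
})

# Growth relative to the previous decade, in percent
decades["growth_pct"] = decades["co2"].pct_change() * 100

# Equivalent constant yearly rate over 10 years: (1 + g)^(1/10) - 1
decades["annual_pct"] = ((1 + decades["growth_pct"] / 100) ** (1 / 10) - 1) * 100

print(decades.round(2))
```

A doubling over ten years corresponds to about 7.2% per year, and a 50% increase to about 4.1% per year. The first row has no previous decade, so its growth is `NaN`.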

### 6.4 Regional Patterns

Let's compare emissions across continents.

In [None]:
# Compare continents over time
continents = ['Asia', 'Europe', 'North America', 'South America', 'Africa', 'Oceania']
continent_data = df[(df['country'].isin(continents)) & (df['year'] >= 1950)]

plt.figure(figsize=(14, 8))
colors_cont = ['#e74c3c', '#3498db', '#2ecc71', '#f39c12', '#9b59b6', '#1abc9c']

for continent, color in zip(continents, colors_cont):
    cont_data = continent_data[continent_data['country'] == continent]
    plt.plot(cont_data['year'], cont_data['co2'], linewidth=2.5, 
            label=continent, color=color, marker='o', markersize=3, markevery=10)

plt.xlabel('Year', fontsize=13, fontweight='bold')
plt.ylabel('Annual CO2 Emissions (million tonnes)', fontsize=13, fontweight='bold')
plt.title('CO2 Emissions by Continent (1950-Present)', fontsize=15, fontweight='bold')
plt.legend(fontsize=11, loc='upper left', framealpha=0.9)
plt.grid(True, alpha=0.3, linestyle='--')
plt.tight_layout()
plt.show()

# Calculate 2020 shares
continent_2020 = df[(df['country'].isin(continents)) & (df['year'] == 2020)][['country', 'co2']].copy()
continent_2020['percentage'] = (continent_2020['co2'] / continent_2020['co2'].sum() * 100).round(1)
print("\nContinental Share of Emissions (2020):")
print(continent_2020.sort_values('co2', ascending=False).to_string(index=False))

**Regional Insights:**
- **Asia** has overtaken all other continents combined
- **Europe** shows declining trend since 1990
- **North America** peaked around 2005
- **Africa** remains lowest despite large population
- This reflects the shift of manufacturing to Asia

### 6.5 Outlier Detection

Let's identify countries with unusual emission patterns.

In [None]:
# Find outliers using IQR method for per capita emissions
pc_2020 = df[(df['year'] == 2020) & (~df['country'].isin(exclude_regions))].dropna(subset=['co2_per_capita'])

Q1 = pc_2020['co2_per_capita'].quantile(0.25)
Q3 = pc_2020['co2_per_capita'].quantile(0.75)
IQR = Q3 - Q1

# Define outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = pc_2020[(pc_2020['co2_per_capita'] < lower_bound) | (pc_2020['co2_per_capita'] > upper_bound)]
high_outliers = outliers[outliers['co2_per_capita'] > upper_bound].nlargest(15, 'co2_per_capita')

print(f"Outlier Detection Results:")
print(f"Q1 (25th percentile): {Q1:.2f}")
print(f"Q3 (75th percentile): {Q3:.2f}")
print(f"IQR: {IQR:.2f}")
print(f"Upper bound for outliers: {upper_bound:.2f}")
print(f"\nTop 15 High-Emission Outliers (Per Capita):")
print(high_outliers[['country', 'co2_per_capita', 'population']].to_string(index=False))

# Visualize outliers
plt.figure(figsize=(14, 7))
plt.barh(range(len(high_outliers)), high_outliers['co2_per_capita'].values,
        color='darkred', alpha=0.7, edgecolor='black', linewidth=1.5)
plt.yticks(range(len(high_outliers)), high_outliers['country'].values, fontsize=10)
plt.xlabel('CO2 Per Capita (tonnes/person)', fontsize=13, fontweight='bold')
plt.title('Countries with Exceptionally High Per Capita Emissions (2020)', fontsize=15, fontweight='bold')
plt.axvline(x=upper_bound, color='blue', linestyle='--', linewidth=2, label=f'Outlier Threshold ({upper_bound:.1f})')
plt.legend(fontsize=11)
plt.grid(axis='x', alpha=0.3)
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

**Outlier Analysis:**
- Many outliers are **oil-rich nations** (Qatar, Kuwait, UAE)
- Small populations but massive energy consumption
- Some are **industrial hubs** with high manufacturing
- These countries need different policy approaches than large emitters
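
The IQR rule used above can be wrapped in a small helper so it is easy to reuse on other columns. This is a sketch on invented numbers; the function name `iqr_bounds` is hypothetical.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Invented sample with one value far from the rest
sample = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 50.0])

lower, upper = iqr_bounds(sample)
flagged = sample[(sample < lower) | (sample > upper)]

print(f"fences: [{lower}, {upper}]")
print(flagged)
```

Here Q1 = 3 and Q3 = 7, so the fences are -3 and 13 and only the value 50 is flagged. The rule is a convention for spotting unusual values; a flagged point is not necessarily an error.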

### 6.6 Time Series Decomposition

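Rather than a classical trend/seasonal decomposition, we look at the emission trend through its year-over-year change (first difference) and the change in that change (second difference).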

In [None]:
# Analyze growth acceleration
world_ts = df[df['country'] == 'World'].dropna(subset=['co2', 'year']).sort_values('year')
world_ts = world_ts[world_ts['year'] >= 1900].copy()

# Calculate year-over-year change and acceleration
world_ts['co2_change'] = world_ts['co2'].diff()
world_ts['co2_acceleration'] = world_ts['co2_change'].diff()

fig, axes = plt.subplots(3, 1, figsize=(14, 12), sharex=True)

# Plot 1: Absolute emissions
axes[0].plot(world_ts['year'], world_ts['co2'], linewidth=2.5, color='darkred')
axes[0].fill_between(world_ts['year'], world_ts['co2'], alpha=0.3, color='darkred')
axes[0].set_ylabel('CO2 Emissions\n(million tonnes)', fontsize=11, fontweight='bold')
axes[0].set_title('A) Absolute Emissions', loc='left', fontsize=12, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Plot 2: Year-over-year change
axes[1].plot(world_ts['year'], world_ts['co2_change'], linewidth=2, color='blue')
axes[1].axhline(y=0, color='black', linestyle='--', linewidth=1)
axes[1].fill_between(world_ts['year'], world_ts['co2_change'], alpha=0.3, color='blue')
axes[1].set_ylabel('Annual Change\n(million tonnes/year)', fontsize=11, fontweight='bold')
axes[1].set_title('B) Rate of Change (First Derivative)', loc='left', fontsize=12, fontweight='bold')
axes[1].grid(True, alpha=0.3)

# Plot 3: Acceleration
axes[2].plot(world_ts['year'], world_ts['co2_acceleration'], linewidth=2, color='green')
axes[2].axhline(y=0, color='black', linestyle='--', linewidth=1)
axes[2].fill_between(world_ts['year'], world_ts['co2_acceleration'], 
                     where=(world_ts['co2_acceleration'] >= 0), alpha=0.3, color='green', label='Accelerating')
axes[2].fill_between(world_ts['year'], world_ts['co2_acceleration'], 
                     where=(world_ts['co2_acceleration'] < 0), alpha=0.3, color='red', label='Decelerating')
axes[2].set_ylabel('Acceleration\n(million tonnes/year²)', fontsize=11, fontweight='bold')
axes[2].set_xlabel('Year', fontsize=12, fontweight='bold')
axes[2].set_title('C) Acceleration (Second Derivative)', loc='left', fontsize=12, fontweight='bold')
axes[2].legend(fontsize=10)
axes[2].grid(True, alpha=0.3)

fig.suptitle('Time Series Decomposition of Global CO2 Emissions', fontsize=16, fontweight='bold', y=0.995)
plt.tight_layout()
plt.show()

**Mathematical Insight:**
- **First derivative** (rate of change) shows emissions are still increasing
- **Second derivative** (acceleration) reveals periods of rapid growth and slowdown
- Negative acceleration doesn't mean emissions are falling—just growing slower
- Only a negative first derivative means emissions are actually declining, and because CO2 accumulates in the atmosphere, a decline slows the build-up rather than reversing it
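
The distinction between the two derivatives is easy to verify on a small invented series that keeps growing, but by less each year:

```python
import pandas as pd

# Invented series: still increasing every step, but more slowly each time
slowing = pd.Series([0.0, 5.0, 9.0, 12.0, 14.0])

change = slowing.diff()   # y[t] - y[t-1], the discrete first derivative
accel = change.diff()     # change of the change, the discrete second derivative

print(change.tolist())
print(accel.tolist())
```

Every change is positive (5, 4, 3, 2), so the series is still rising, while every second difference is -1: the growth is decelerating without the level ever falling.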

### 6.7 Comparative Box Plots

Let's compare the distribution of per capita emissions across a few selected countries.

In [None]:
# Compare per capita emissions for selected countries over time
comparison_countries = ['United States', 'China', 'India', 'Germany', 'Brazil', 'Japan']
box_data_df = df[(df['country'].isin(comparison_countries)) & (df['year'] >= 1990)].dropna(subset=['co2_per_capita'])

# Prepare data for box plot
data_for_box = [box_data_df[box_data_df['country'] == country]['co2_per_capita'].values 
                for country in comparison_countries]

plt.figure(figsize=(14, 7))
bp = plt.boxplot(data_for_box, patch_artist=True,
                 boxprops=dict(facecolor='lightblue', alpha=0.7),
                 medianprops=dict(color='red', linewidth=2),
                 whiskerprops=dict(linewidth=1.5),
                 capprops=dict(linewidth=1.5))

plt.ylabel('CO2 Per Capita (tonnes/person)', fontsize=13, fontweight='bold')
plt.title('Distribution of Per Capita Emissions (1990-2022)', fontsize=15, fontweight='bold')
# Boxes sit at x = 1..N; label them here rather than via boxplot's `labels` argument,
# which newer Matplotlib releases deprecate in favor of `tick_labels`
plt.xticks(range(1, len(comparison_countries) + 1), comparison_countries,
           rotation=45, ha='right', fontsize=11)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

# Print summary statistics
print("\nSummary Statistics by Country (1990-2022):")
for country in comparison_countries:
    country_data = box_data_df[box_data_df['country'] == country]['co2_per_capita']
    print(f"{country:15s}: Median={country_data.median():.2f}, Mean={country_data.mean():.2f}, "
          f"Std={country_data.std():.2f}, Range=[{country_data.min():.2f}, {country_data.max():.2f}]")

**Box Plot Interpretation:**
- **Box**: Contains middle 50% of data (IQR)
- **Red line**: Median value
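- **Whiskers**: Extend to the most extreme data points within 1.5×IQR of the box edges; points beyond that are drawn individually as outliers
- **US**: High values with a comparatively narrow spread
- **China**: Wide range, consistent with rapid growth over the period
- **India**: Consistently low
- **Germany**: Spread reflects change over the period; a box plot pools all years, so use a line plot to confirm the declining trend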
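
To see how the whiskers are placed, the sketch below recomputes them by hand for a small invented sample, using the default rule of 1.5×IQR.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])  # invented values

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that are still inside the fences
whisker_low = data[data >= lower_fence].min()
whisker_high = data[data <= upper_fence].max()

# Points outside the fences are drawn individually
fliers = data[(data < lower_fence) | (data > upper_fence)]

print(q1, q3, iqr)
print(whisker_low, whisker_high, fliers)
```

For this sample Q1 = 2.25 and Q3 = 4.75, so the upper fence is 8.5; the upper whisker stops at 5 (the largest point below the fence) and 100 is drawn as a separate point.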

## 7. Summary

In this lecture, you've learned:

1. **The data science pipeline** and where Pandas and Matplotlib fit in
2. **Pandas basics**: loading data, inspecting it, selecting columns, and filtering rows
3. **Matplotlib basics**: creating line plots, scatter plots, and bar charts
4. **Customization**: adding labels, titles, colors, and legends to make plots more informative
5. **Interpretation**: understanding what visualizations tell us about the data
6. **Exploratory data analysis**: distributions, correlations, temporal trends, regional patterns, outliers, and box plots

These are fundamental skills that you'll use throughout your data science journey. In the next lecture, we'll look more closely at techniques such as groupby operations, pivot tables, subplots, and heatmaps.

## References

- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [Matplotlib Documentation](https://matplotlib.org/stable/index.html)
- [Our World in Data - CO2 Emissions](https://ourworldindata.org/co2-emissions)