# 01 - Exploratory Data Analysis (EDA)

Welcome to the first notebook! Here, you'll explore the carbon emission dataset, understand its variables, and gain initial insights with data visualization.

## 1. Setup and Data Loading

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
%matplotlib inline

In [None]:
# Load the dataset
df = pd.read_csv('../data/carbon_emission_ml_dataset.csv')
df.head()

## 2. Overview of the Data

In [None]:
# Check dataset shape
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
# Show info and data types
df.info()

In [None]:
# Show basic statistics for numerical columns
df.describe().T

In [None]:
# List unique countries and years
print("Countries:", df['country'].unique())
print("Years:", df['year'].unique())
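
Listing unique values shows which countries and years appear, but not whether every country covers the same years. The sketch below is one way to check that, assuming the `country` and `year` columns used above; `year_coverage` is a hypothetical helper name, not part of the dataset or any library.

```python
import pandas as pd

def year_coverage(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-country row count plus the first and last year observed."""
    return (
        frame.groupby('country')['year']
        .agg(n_rows='count', first_year='min', last_year='max')
        .sort_values('n_rows')  # countries with the fewest rows come first
    )
```

Calling `year_coverage(df)` in the notebook puts the sparsest countries at the top, which makes gaps in the panel easier to spot.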

## 3. Missing Values and Duplicates

In [None]:
# Check for missing values
df.isnull().sum()

In [None]:
# Check for duplicate rows
df.duplicated().sum()
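
Raw counts are easier to judge as a share of the rows, and `df.duplicated()` only flags rows that are identical in every column. The following is a minimal sketch of two complementary checks. The helper names `missing_summary` and `duplicate_keys` are hypothetical, and treating (`country`, `year`) as the natural key of the dataset is an assumption.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, only for columns that have any."""
    n_missing = frame.isnull().sum()
    summary = pd.DataFrame({
        'n_missing': n_missing,
        'pct_missing': (n_missing / len(frame) * 100).round(2),
    })
    return summary[summary['n_missing'] > 0].sort_values('n_missing', ascending=False)

def duplicate_keys(frame: pd.DataFrame, keys=('country', 'year')) -> pd.DataFrame:
    """Rows whose key columns repeat, even if other columns differ.

    Assumes (country, year) should identify a row uniquely.
    """
    return frame[frame.duplicated(subset=list(keys), keep=False)]
```

In the notebook, `missing_summary(df)` and `duplicate_keys(df)` can be run as separate cells. An empty result from either means there is nothing to clean up for that check.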

## 4. Univariate Analysis

Let's look at the distribution of key features.

In [None]:
# Histograms for main numeric columns
numeric_cols = ['gdp_per_capita','energy_consumption','population',
                'renewable_energy_pct','urban_pct','co2_emission_per_capita']
df[numeric_cols].hist(bins=20, figsize=(15,10))
plt.suptitle('Histograms of Numeric Features', fontsize=16)
plt.tight_layout()
plt.show()
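
Histograms show skew visually; a number per column makes it easier to compare features. The sketch below reports skewness before and after a `log1p` transform, a common option for right-skewed, non-negative variables. `skew_report` is a hypothetical helper, and the non-negativity of the columns is an assumption to verify against `df.describe()`.

```python
import numpy as np
import pandas as pd

def skew_report(frame: pd.DataFrame, cols) -> pd.DataFrame:
    """Sample skewness of each column, raw and after log1p.

    Assumes the columns are non-negative; log1p is undefined at or below -1.
    """
    raw = frame[cols].skew()
    logged = np.log1p(frame[cols]).skew()
    return pd.DataFrame({'skew_raw': raw, 'skew_log1p': logged})
```

Running `skew_report(df, numeric_cols)` gives one row per feature. A large `skew_raw` with a `skew_log1p` near zero suggests the log scale may be a better fit for plotting and modeling that feature.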

In [None]:
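# Boxplots to check for outliers; one panel per feature because the scales differ widely
# (e.g. population counts vs. percentages), which would flatten most boxes on a shared axis
fig, axes = plt.subplots(1, len(numeric_cols), figsize=(18, 5))
for ax, col in zip(axes, numeric_cols):
    sns.boxplot(y=df[col], ax=ax)
    ax.set_title(col)
    ax.set_ylabel('')
fig.suptitle('Boxplots of Numeric Features')
plt.tight_layout()
plt.show()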
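
The points drawn beyond the whiskers follow the 1.5 × IQR rule that boxplots use by default. To get counts instead of reading them off the plot, the same rule can be sketched directly in pandas. `iqr_outlier_counts` is a hypothetical helper, and the multiplier `k` is exposed only for illustration.

```python
import pandas as pd

def iqr_outlier_counts(frame: pd.DataFrame, cols, k: float = 1.5) -> pd.Series:
    """Number of values per column outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[cols].quantile(0.25)
    q3 = frame[cols].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Compare each column against its own bounds; NaN values count as not outside
    outside = frame[cols].lt(lower, axis='columns') | frame[cols].gt(upper, axis='columns')
    return outside.sum().sort_values(ascending=False)
```

`iqr_outlier_counts(df, numeric_cols)` returns one count per feature. A flagged value is not necessarily an error: very large countries, for instance, may legitimately sit far from the rest.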

## 5. Bivariate Analysis

This section looks at how the features relate to each other and to the target.

In [None]:
# Pairplot to visualize feature relationships
sns.pairplot(df, vars=numeric_cols, hue='country', corner=True)
plt.suptitle('Pairplot of Features by Country', y=1.02)
plt.show()

In [None]:
# Correlation heatmap
plt.figure(figsize=(10,7))
sns.heatmap(df[numeric_cols].corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()
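
The heatmap shows every pair at once. For modeling, the column that matters most is usually the one for the target. The sketch below pulls out that column and adds Spearman rank correlation next to Pearson, since a monotonic but non-linear relationship shows up more strongly in the former. It assumes `co2_emission_per_capita` is the target; `target_correlations` is a hypothetical helper.

```python
import pandas as pd

def target_correlations(frame: pd.DataFrame, cols,
                        target: str = 'co2_emission_per_capita') -> pd.DataFrame:
    """Pearson and Spearman correlation of each column with the target.

    `target` must be one of `cols`; its own row is dropped from the result.
    """
    out = pd.DataFrame({
        'pearson': frame[cols].corr(method='pearson')[target],
        'spearman': frame[cols].corr(method='spearman')[target],
    }).drop(index=target)
    # Strongest linear relationships first, regardless of sign
    return out.sort_values('pearson', key=lambda s: s.abs(), ascending=False)
```

With the notebook's variables this would be called as `target_correlations(df, numeric_cols)`. A large gap between the two columns for a feature is a hint that its relationship with the target is not linear.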

## 6. Interactive Visualizations

Use Plotly for interactive exploration.

In [None]:
# Interactive scatter plot: GDP vs CO2 emissions
fig = px.scatter(df, x='gdp_per_capita', y='co2_emission_per_capita', color='country',
                 size='population', hover_data=['year'],
                 title='GDP per Capita vs CO2 Emission per Capita')
fig.show()

In [None]:
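# Interactive line plot: CO2 emission over time by country
# Sort by country and year first so each line is drawn in chronological order
fig = px.line(df.sort_values(['country', 'year']), x='year', y='co2_emission_per_capita',
              color='country', markers=True,
              title='CO2 Emission per Capita Over Time')
fig.show()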
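
The line plot shows the trends. A small table of the change between each country's first and last observation makes them easier to rank. This is a minimal sketch under the same column-name assumptions as above; `emission_change` is a hypothetical helper.

```python
import pandas as pd

def emission_change(frame: pd.DataFrame,
                    value: str = 'co2_emission_per_capita') -> pd.DataFrame:
    """First and last observed value per country and the percent change between them."""
    ordered = frame.sort_values(['country', 'year'])
    grouped = ordered.groupby('country')[value]
    # first()/last() skip missing values, so these are the first and last non-null entries
    out = pd.DataFrame({'first': grouped.first(), 'last': grouped.last()})
    out['pct_change'] = (out['last'] - out['first']) / out['first'] * 100
    return out.sort_values('pct_change')
```

`emission_change(df)` lists the countries with the largest relative decreases first. Countries whose first value is zero produce infinite or undefined percentages and need separate handling.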

## 7. Insights and Next Steps

- Summarize your observations (e.g., which countries have higher/lower emissions, trends, correlations).
- Note any data quality issues or features of interest for modeling.
- Ready to move to modeling? Continue to the `02_model_training_and_comparison.ipynb` notebook.