# 🌍 CO₂ Emissions Dashboard #

A Python/Dash application for interactively exploring global and national CO₂ emissions data, with AI-generated insights.

## 🚀 Features ##
1. **Dual Datasets**
- ***Global:*** Annual CO₂ from fossil fuels & cement (million metric tons of C)
- ***National:*** Per-country CO₂ breakdown (thousand metric tons of C)
2. **Preprocessing & Cleaning**
- Year-based sorting
- Linear interpolation & back-filling of missing numeric values
- Outlier capping via IQR
- Automatic renaming of original dataset columns to a uniform schema
3. **Feature Engineering**
- Lag feature (`lag1`) and year-over-year % change (`YoY_pct`)
- Grouped computation for each nation
4. **Interactive Dashboard (Dash + Plotly)**
- ***Dataset selector:*** Switch between global and national data
- ***Country filter*** for national data
- ***Year range slider***
- ***Five charts*** on a single grid:
    1. Total emissions time series
    2. Emissions by source type (area chart)
    3. Per capita emissions (line chart)
    4. Correlation heatmap
    5. Top 10 bar chart (by country or recent years)

- ***AI Insights:*** A summary panel at the end of the grid, generated by Google Gemini
5. **LLM Integration**
- Uses Google Generative AI (`gemini-2.0-flash`) to generate 3–4 concise key findings
- Built-in retry/backoff logic
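
The gap-filling step listed under "Preprocessing & Cleaning" can be illustrated without pandas. The sketch below is a dependency-free approximation, not the notebook's code: `fill_gaps` is a hypothetical helper that interpolates linearly between known values, back-fills leading gaps, and forward-fills trailing ones, which is the same order of operations the processor class applies.

```python
from typing import List, Optional


def fill_gaps(values: List[Optional[float]]) -> List[float]:
    """Fill None entries: linear interpolation inside, nearest known value at the edges."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("at least one known value is required")

    out = list(values)
    # Interior gaps: straight line between the two surrounding known points.
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            out[i] = values[left] + (values[right] - values[left]) * (i - left) / span
    # Leading gap: back-fill with the first known value.
    for i in range(known[0]):
        out[i] = values[known[0]]
    # Trailing gap: forward-fill with the last known value.
    for i in range(known[-1] + 1, len(values)):
        out[i] = values[known[-1]]
    return out


print(fill_gaps([None, 10.0, None, None, 40.0, None]))
```
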

### Imports ###

In [3]:
import os
import pandas as pd
import numpy as np
import plotly.express as px
import dash
from dash import dcc, html
from dash.dependencies import Input, Output
import warnings
import time
import google.generativeai as genai

warnings.filterwarnings('ignore')

In [4]:
# Configure the Gemini client
GEMINI_MODEL = 'gemini-2.0-flash'
GOOGLE_API_KEY = 'google_api_key'  # placeholder; replace with a real key
genai.configure(api_key=GOOGLE_API_KEY)
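
One possible alternative to a hard-coded key is to read it from the environment. This is a minimal sketch under that assumption; `load_api_key` is a hypothetical helper that does not appear in the notebook, and the variable name `GOOGLE_API_KEY` is simply reused from the cell above.

```python
import os


def load_api_key(env_var: str = "GOOGLE_API_KEY") -> str:
    """Return the API key stored in an environment variable, or fail loudly."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before starting the app.")
    return key


# Intended use in the configuration cell (not executed here):
# genai.configure(api_key=load_api_key())
```

Failing early with a clear message tends to be easier to debug than a generic "insights unavailable" fallback later on.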

### Data Loading ###

In [6]:
df_global = pd.read_csv('global.1751_2021.csv')
df_global

Unnamed: 0,Year,Total carbon emissions from fossil fuel consumption and cement production (million metric tons of C),Carbon emissions from solid fuel consumption,Carbon emissions from liquid fuel consumption,Carbon emissions from gas fuel consumption,Carbon emissions from cement production,Carbon emissions from gas flaring,Per capita carbon emissions (metric tons of carbon; after 1949 only)
0,1751,3,3,0,0,0,0,
1,1752,3,3,0,0,0,0,
2,1753,3,3,0,0,0,0,
3,1754,3,3,0,0,0,0,
4,1755,3,3,0,0,0,0,
...,...,...,...,...,...,...,...,...
266,2017,9865,3975,3431,1986,387,87,1.297627
267,2018,10120,4095,3473,2060,405,86,1.316166
268,2019,10193,4067,3473,2147,414,91,1.311672
269,2020,9764,3939,3187,2127,426,86,1.243612


In [7]:
df_global.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271 entries, 0 to 270
Data columns (total 8 columns):
 #   Column                                                                                                Non-Null Count  Dtype  
---  ------                                                                                                --------------  -----  
 0   Year                                                                                                  271 non-null    int64  
 1   Total carbon emissions from fossil fuel consumption and cement production (million metric tons of C)  271 non-null    int64  
 2   Carbon emissions from solid fuel consumption                                                          271 non-null    int64  
 3   Carbon emissions from liquid fuel consumption                                                         271 non-null    int64  
 4   Carbon emissions from gas fuel consumption                                                            271

In [8]:
df_nation = pd.read_csv('nation.1751_2021.csv')
df_nation

Unnamed: 0,Nation,Year,Total CO2 emissions from fossil-fuels and cement production (thousand metric tons of C),Emissions from solid fuel consumption,Emissions from liquid fuel consumption,Emissions from gas fuel consumption,Emissions from cement production,Emissions from gas flaring,Per capita CO2 emissions (metric tons of carbon),Emissions from bunker fuels (not included in the totals)
0,AFGHANISTAN,1949,4,4.0,0.0,0.0,0.0,,,0.0
1,AFGHANISTAN,1950,23,6.0,18.0,0.0,0.0,,0.003025,0.0
2,AFGHANISTAN,1951,25,7.0,18.0,0.0,0.0,,0.003172,0.0
3,AFGHANISTAN,1952,25,9.0,17.0,0.0,0.0,,0.003206,0.0
4,AFGHANISTAN,1953,29,10.0,18.0,0.0,0.0,,0.003551,0.0
...,...,...,...,...,...,...,...,...,...,...
18986,ZIMBABWE,2017,2683,1638.0,916.0,,129.0,,0.182450,36.0
18987,ZIMBABWE,2018,3057,1770.0,1135.0,,151.0,,0.204890,44.0
18988,ZIMBABWE,2019,2800,1641.0,1031.0,,128.0,,0.184866,44.0
18989,ZIMBABWE,2020,2326,1347.0,836.0,,143.0,,0.151097,15.0


In [9]:
df_nation.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18991 entries, 0 to 18990
Data columns (total 10 columns):
 #   Column                                                                                   Non-Null Count  Dtype  
---  ------                                                                                   --------------  -----  
 0   Nation                                                                                   18991 non-null  object 
 1   Year                                                                                     18991 non-null  int64  
 2   Total CO2 emissions from fossil-fuels and cement production (thousand metric tons of C)  18991 non-null  int64  
 3   Emissions from solid fuel consumption                                                    13209 non-null  float64
 4   Emissions from liquid fuel consumption                                                   18366 non-null  float64
 5   Emissions from gas fuel consumption                         

### Configuration ###

In [11]:
# Column mapping for uniform naming
COLUMN_MAPPING = {
    'global': {
        'Year': 'Year',
        'Total carbon emissions from fossil fuel consumption and cement production (million metric tons of C)': 'Total',
        'Carbon emissions from solid fuel consumption': 'Solid',
        'Carbon emissions from liquid fuel consumption': 'Liquid',
        'Carbon emissions from gas fuel consumption': 'Gas',
        'Carbon emissions from cement production': 'Cement',
        'Carbon emissions from gas flaring': 'Flaring',
        'Per capita carbon emissions (metric tons of carbon; after 1949 only)': 'PerCapita'
    },
    'nation': {
        'Nation': 'Nation',
        'Year': 'Year',
        'Total CO2 emissions from fossil-fuels and cement production (thousand metric tons of C)': 'Total',
        'Emissions from solid fuel consumption': 'Solid',
        'Emissions from liquid fuel consumption': 'Liquid',
        'Emissions from gas fuel consumption': 'Gas',
        'Emissions from cement production': 'Cement',
        'Emissions from gas flaring': 'Flaring',
        'Per capita CO2 emissions (metric tons of carbon)': 'PerCapita',
        'Emissions from bunker fuels (not included in the totals)': 'Bunker'
    }
}
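
The processor class below only renames columns that are actually present in the loaded file. As a rough illustration of that filtering, here is a sketch on plain lists of column names; `applicable_mapping` and `unmapped_columns` are hypothetical helpers, and the `Notes` column is invented purely for the example.

```python
from typing import Dict, Iterable, List


def applicable_mapping(mapping: Dict[str, str], columns: Iterable[str]) -> Dict[str, str]:
    """Keep only the mapping entries whose original name exists in the data."""
    present = set(columns)
    return {orig: new for orig, new in mapping.items() if orig in present}


def unmapped_columns(mapping: Dict[str, str], columns: Iterable[str]) -> List[str]:
    """List columns that would keep their original name after renaming."""
    return [c for c in columns if c not in mapping]


# A shortened copy of the 'global' mapping, for illustration only.
sample_mapping = {
    "Year": "Year",
    "Carbon emissions from solid fuel consumption": "Solid",
    "Carbon emissions from gas flaring": "Flaring",
}
sample_columns = ["Year", "Carbon emissions from solid fuel consumption", "Notes"]

print(applicable_mapping(sample_mapping, sample_columns))
print(unmapped_columns(sample_mapping, sample_columns))
```

Checking the unmapped list after loading a new file is a cheap way to notice when a source header has changed and a chart would silently disappear.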

### Data Processor Class ###

In [13]:
class CO2Processor:
    """
    Handles preprocessing and feature engineering for CO2 datasets.
    """
    def __init__(self, data: pd.DataFrame, mode: str):
        mapping = {orig: new for orig, new in COLUMN_MAPPING[mode].items() if orig in data.columns}
        self.df = data.rename(columns=mapping).copy()
        self.mode = mode

    def preprocess(self):
        """
        Sort, interpolate, and clean missing values.
        """
        if 'Year' in self.df:
            self.df = self.df.sort_values('Year')

        nums = self.df.select_dtypes(include=[np.number]).columns

        if not nums.empty:
            self.df[nums] = (
                self.df[nums]
                    .interpolate()
                    .bfill()
                    .ffill()
                    .clip(lower=0)
            )

        for col in self.df.select_dtypes(include=['object']):
            self.df[col] = self.df[col].fillna('Unknown')

    def cap_outliers(self):
        """
        Cap extreme values using IQR.
        """
        for col in ['Total','Solid','Liquid','Gas','Cement','Flaring']:
            if col in self.df:
                Q1, Q3 = self.df[col].quantile([0.25, 0.75])
                IQR = Q3 - Q1
                self.df[col] = self.df[col].clip(Q1-1.5*IQR, Q3+1.5*IQR)

    def engineer_features(self):
        """
        Generate lag and YoY percent change features.
        """
        if 'Total' not in self.df or 'Year' not in self.df:
            return
        if self.mode=='nation' and 'Nation' in self.df:
            self.df = self.df.groupby('Nation',group_keys=False).apply(
                lambda g: g.assign(
                    lag1=g['Total'].shift(1).fillna(0),
                    YoY_pct=((g['Total'] - g['Total'].shift(1))/(g['Total'].shift(1) + 1e-8) * 100).fillna(0)
                )
            )
        else:
            df = self.df.sort_values('Year')
            prev = df['Total'].shift(1)
            df['lag1'] = prev.fillna(0)
            # Use the unfilled shift so the first year yields NaN -> 0, as in the nation branch
            df['YoY_pct'] = ((df['Total'] - prev) / (prev + 1e-8) * 100).fillna(0)
            self.df = df
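
To make the arithmetic in `cap_outliers()` concrete, here is a small stdlib-only sketch. `iqr_bounds` and `cap` are hypothetical names; the sketch assumes that `statistics.quantiles(..., method="inclusive")` interpolates between data points the same way the default pandas quantile does, which should hold but is worth verifying against real data.

```python
from statistics import quantiles
from typing import List, Sequence, Tuple


def iqr_bounds(values: Sequence[float], k: float = 1.5) -> Tuple[float, float]:
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap(values: Sequence[float], k: float = 1.5) -> List[float]:
    """Clip every value into the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [min(max(v, lower), upper) for v in values]


# Q1 = 2, Q3 = 4, IQR = 2, so the fences are -1 and 7 and only 100 is clipped.
print(cap([1, 2, 3, 4, 100]))
```

On a strongly trending series, such as global totals that grow from single digits to roughly ten thousand, fences computed over the whole period may clip the most recent years, so it is worth inspecting the capped output before charting it.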

### Initialize and process datasets ###

In [15]:
proc_global = CO2Processor(df_global, 'global')
proc_global.preprocess()
proc_global.cap_outliers()
proc_global.engineer_features()

In [16]:
proc_nation = CO2Processor(df_nation, 'nation')
proc_nation.preprocess()
proc_nation.cap_outliers()
proc_nation.engineer_features()
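
As a worked example of the `YoY_pct` feature, the sketch below recomputes year-over-year change for the last four global totals shown in the data preview (raw values, before outlier capping). `yoy_pct` is a hypothetical helper; unlike the class above, it returns `None` when the previous value is missing or zero instead of adding a small epsilon to the denominator.

```python
from typing import List, Optional


def yoy_pct(totals: List[float]) -> List[Optional[float]]:
    """Year-over-year percent change; None where no valid previous value exists."""
    changes: List[Optional[float]] = [None]  # first year has no predecessor
    for prev, cur in zip(totals, totals[1:]):
        changes.append(None if prev == 0 else (cur - prev) / prev * 100)
    return changes


# Global totals for 2017-2020 (million metric tons of C), taken from the preview above.
totals_2017_2020 = [9865, 10120, 10193, 9764]
for year, change in zip(range(2017, 2021), yoy_pct(totals_2017_2020)):
    label = "n/a" if change is None else f"{change:+.2f}%"
    print(year, label)
```
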

### LLM Insights Functions ###

In [18]:
def generate_insights(prompt: str) -> str:
    """
    Query LLM with retry and exponential backoff.
    """
    backoff = 1
    for _ in range(3):
        try:
            model = genai.GenerativeModel(GEMINI_MODEL)
            return model.generate_content(prompt).text
        except Exception:
            time.sleep(backoff)
            backoff *= 2
    return 'Insights unavailable.'

def get_insights(df: pd.DataFrame) -> str:
    """
    Build prompt and get insights for a DataFrame.
    """
    if df.empty:
        return "No data."
    
    summary_stats = df.describe()
    recent_data = df.tail(10) if len(df) > 10 else df
            
    prompt = f"""
        Analyze CO2 emissions data and provide brief practical insights:
        
        Statistics: {summary_stats.to_string()}
        
        Latest data: {recent_data.to_string()}
        
        Provide 3-4 key findings on CO2 emission trends.
        """
    
    return generate_insights(prompt)

insights = get_insights(proc_nation.df if not proc_nation.df.empty else proc_global.df)
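
The retry pattern used in `generate_insights()` can be separated from the Gemini call so that it is testable on its own. The following is a sketch under that assumption; `retry_with_backoff` and its parameters are hypothetical, and the `sleep` argument exists only so that tests can avoid real waiting. One difference from the function above: it does not sleep after the final failed attempt.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    attempts: int = 3,
    initial_delay: float = 1.0,
    factor: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    fallback: Optional[T] = None,
) -> Optional[T]:
    """Run `call`, retrying on any exception with exponentially growing delays."""
    delay = initial_delay
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                break  # no point waiting after the last attempt
            sleep(delay)
            delay *= factor
    return fallback
```

With this helper, the LLM call could be wrapped as `retry_with_backoff(lambda: model.generate_content(prompt).text, fallback='Insights unavailable.')`, where `model` and `prompt` are the objects from the cell above.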

### Dash App Setup ###

In [20]:
app = dash.Dash(__name__)

In [21]:
def create_layout():
    """
    Build the layout with controls and graph container.
    """
    yrs = proc_global.df.Year if 'Year' in proc_global.df else pd.Series([0])
    mi,ma = int(yrs.min()),int(yrs.max())
    marks = {y:str(y) for y in range(mi,ma+1,max((ma-mi)//10,1))}

    return html.Div([
        html.H1(
            '🌍 CO2 Emissions Dashboard',
            style={'textAlign':'center'}
        ),
        dcc.Dropdown(
            id='dataset',
            options=[
                {'label':'Global','value':'global'},
                {'label':'Nation','value':'nation'}
            ],
            value='global',
            clearable=False,
            style={
                'width':'300px',
                'margin':'10px auto',
                'color': 'black'
            }
        ),
        dcc.Dropdown(
            id='country',
            multi=True,
            placeholder='Select countries',
            style={
                'width':'400px',
                'margin':'10px auto',
                'color': 'black'
            }
        ),
        dcc.RangeSlider(
            id='year_slider',
            min=mi,max=ma,
            value=[mi,ma],
            marks=marks,
            tooltip={'always_visible':True}
        ),
        html.Div(
            id='graphs',
            style={
                'display':'grid',
                'gridTemplateColumns':'repeat(auto-fit,minmax(400px,1fr))',
                'gap':'20px',
                'padding':'20px'
            }
        )
    ],style={'background':'#2c2f33','color':'#fff','fontFamily':'Arial','padding':'20px'})

app.layout = create_layout()
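
The slider marks are spaced so that roughly ten labels appear regardless of the year span. The sketch below isolates that calculation; `build_marks` is a hypothetical helper that mirrors the expression in `create_layout()`.

```python
from typing import Dict


def build_marks(year_min: int, year_max: int, target: int = 10) -> Dict[int, str]:
    """Return slider marks spaced so that about `target` intervals cover the range."""
    step = max((year_max - year_min) // target, 1)
    return {year: str(year) for year in range(year_min, year_max + 1, step)}


marks = build_marks(1751, 2021)
print(len(marks), min(marks), max(marks))
```

When the span is not a multiple of the step, the last year does not get a label of its own (for example, a 1949 to 2020 range ends its marks at 2019), although the slider can still be dragged to the maximum.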

### Callbacks ###

In [23]:
@app.callback(
    [Output('country','options'),
     Output('country','style'),
     Output('year_slider','min'),
     Output('year_slider','max'),
     Output('year_slider','value'),
     Output('year_slider','marks')],
    Input('dataset','value')
)
def update_controls(dataset):
    """
    Update filters depending on dataset selection.
    """
    if dataset=='nation' and 'Nation' in proc_nation.df:
        df = proc_nation.df
        countries = sorted(df.Nation.dropna().unique())
        opts = [{'label': c, 'value': c} for c in countries]
    else:
        df = proc_global.df
        opts = []
    mi, ma = int(df.Year.min()), int(df.Year.max())
    step = max((ma - mi) // 10, 1)
    marks = {y: str(y) for y in range(mi, ma + 1, step)}
    # Show the country filter only when there are countries to choose from
    style = (
        {'display': 'block', 'width': '400px', 'margin': 'auto'}
        if opts else {'display': 'none'}
    )
    return opts, style, mi, ma, [mi, ma], marks

@app.callback(
    Output('graphs','children'),
    [Input('dataset','value'),Input('country','value'),Input('year_slider','value')]
)
def render_graphs(dataset,countries,year_range):
    """Generate graph components based on filters."""
    df=proc_global.df if dataset=='global' else proc_nation.df
    if dataset=='nation' and countries: df=df[df.Nation.isin(countries)]
    df=df[(df.Year>=year_range[0])&(df.Year<=year_range[1])]
    if df.empty:
        return [html.Div("No data to display",style={'textAlign':'center'})]

    comps=[]
    # Total Emissions
    if 'Total' in df:
        fig=px.line(df,x='Year',y='Total',color=('Nation' if dataset=='nation' else None),template='plotly_dark')
        comps.append(dcc.Graph(figure=fig))
    # Source breakdown
    srcs=[c for c in ['Solid','Liquid','Gas','Cement','Flaring'] if c in df]
    if srcs:
        dfsrc=df.groupby('Year')[srcs].sum().reset_index() if dataset=='nation' else df[['Year']+srcs]
        dflong=dfsrc.melt(id_vars=['Year'],value_vars=srcs,var_name='Source',value_name='Emissions')
        fig=px.area(dflong,x='Year',y='Emissions',color='Source',template='plotly_dark')
        comps.append(dcc.Graph(figure=fig))
    # Per Capita
    if 'PerCapita' in df:
        fig=px.line(df,x='Year',y='PerCapita',color=('Nation' if dataset=='nation' else None),template='plotly_dark')
        comps.append(dcc.Graph(figure=fig))
    # Correlation
    corr=df.select_dtypes(include=np.number).corr()
    fig=px.imshow(corr,text_auto=True,template='plotly_dark')
    comps.append(dcc.Graph(figure=fig))
    # Top Bar
    if dataset=='nation':
        top=df.groupby('Nation')['Total'].sum().nlargest(10)
        fig=px.bar(x=top.values,y=top.index,orientation='h',template='plotly_dark')
    else:
        recent=df[df.Year>=df.Year.max()-20]
        top=recent.nlargest(10,'Total').set_index('Year')['Total']
        fig=px.bar(x=top.values,y=top.index,orientation='h',template='plotly_dark')
    comps.append(dcc.Graph(figure=fig))
    # AI Insights
    comps.append(html.Div([html.H3('🤖 AI Insights'),html.Pre(insights, style={'whiteSpace':'pre-wrap'})],style={'gridColumn':'span 2'}))
    return comps
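
The filtering at the top of `render_graphs()` follows two rules: an empty country selection means "all countries", and the year range is inclusive at both ends. The sketch below restates those rules on plain dictionaries so that it runs without pandas; `filter_rows` is a hypothetical helper, and the sample rows reuse values from the national data preview.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Row = Dict[str, object]


def filter_rows(
    rows: List[Row],
    year_range: Tuple[int, int],
    countries: Optional[Sequence[str]] = None,
) -> List[Row]:
    """Keep rows inside the inclusive year range and, if given, the selected countries."""
    first, last = year_range
    selected = set(countries) if countries else None  # None or [] disables the filter
    return [
        row for row in rows
        if first <= row["Year"] <= last
        and (selected is None or row["Nation"] in selected)
    ]


sample_rows = [
    {"Nation": "AFGHANISTAN", "Year": 1950, "Total": 23},
    {"Nation": "ZIMBABWE", "Year": 2018, "Total": 3057},
    {"Nation": "ZIMBABWE", "Year": 2020, "Total": 2326},
]
print(filter_rows(sample_rows, (2000, 2020), ["ZIMBABWE"]))
```
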


In [24]:
if __name__ == '__main__':
    # app.run replaces the deprecated app.run_server in recent Dash releases
    app.run(debug=True)