<a href="https://colab.research.google.com/github/RajBhadani/GHG_Emission_Prediction/blob/main/Green_House_Gas_Emission_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Supply Chain Emissions Modeling Using Industry and Commodity Data (2010–2016)**

Problem Statement:

The data consist of annual supply chain emission factors for 2010–2016, categorized by industry and commodity. The goal is to build a regression model that predicts the Supply Chain Emission Factors with Margins from descriptive and data-quality metrics (substance, unit, reliability, and temporal, geographical, technological, and data collection correlations).

**🌱 Greenhouse Gas Emission Prediction Project**

Project Goal:
To analyze and predict greenhouse gas (GHG) emissions from various U.S. industries and commodities using the official dataset from data.gov.

Source:
Supply Chain Greenhouse Gas Emission Factors

Tools: Python, Pandas, Scikit-learn, Matplotlib, Seaborn

## **📂 Dataset Overview**

This dataset contains supply chain emission factors associated with various U.S. industries and commodities.

**Key Columns:**

*   Code: Industry classification code
*   Industry_Name: Name of the industry
*   Commodity: Item or commodity name
*   GHG_Emissions_kgCO2e: GHG emissions per unit (kg CO2 equivalent)
*   Units: Measurement units (e.g., [kg/2018 USD, purchaser price])

## **🧹 Data Preprocessing**

**Steps:**

* Handle missing values
* Convert units where needed
* Encode categorical features
* Normalize/scale numeric columns
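
The steps above can be sketched on a small synthetic table. This is a minimal illustration rather than the notebook's actual pipeline: the column names (`Substance`, `Factor`) and values are made up, and median imputation, one-hot encoding, and z-score scaling are assumed choices. Unit conversion is omitted because it depends on the units present in the real data.

```python
import pandas as pd

# Hypothetical toy data; the real dataset's columns and values differ.
toy = pd.DataFrame({
    "Substance": ["carbon dioxide", "methane", "carbon dioxide", "nitrous oxide"],
    "Factor": [0.40, None, 0.20, 0.10],
})

# 1. Handle missing values: fill numeric gaps with the column median (assumed strategy).
toy["Factor"] = toy["Factor"].fillna(toy["Factor"].median())

# 2. Encode categorical features: one-hot encode the text column.
encoded = pd.get_dummies(toy, columns=["Substance"], dtype=int)

# 3. Scale numeric columns: z-score standardization (zero mean, unit variance).
mean = encoded["Factor"].mean()
std = encoded["Factor"].std(ddof=0)
encoded["Factor_scaled"] = (encoded["Factor"] - mean) / std

print(encoded)
```

Tree-based models such as Random Forest do not require scaling, so the third step mainly matters for models like Linear Regression with regularization or distance-based methods.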

## **🤖 Model Building & Evaluation**

We aim to predict `GHG_Emissions_kgCO2e` using regression models.

**Models to try:**

* Linear Regression
* Random Forest

**Evaluation Metrics:**

* RMSE (Root Mean Squared Error)
* MAE (Mean Absolute Error)
* R² Score
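
To make the three metrics concrete, here is a minimal pure-Python sketch that computes them from their standard definitions. The function name `regression_metrics` and the sample numbers are illustrative assumptions; in practice `sklearn.metrics` offers equivalent functions.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (rmse, mae, r2) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # RMSE: square root of the mean squared error; penalizes large errors more.
    mse = sum(e * e for e in errors) / n
    # MAE: mean of the absolute errors; less sensitive to outliers.
    mae = sum(abs(e) for e in errors) / n

    # R^2: 1 - (residual sum of squares / total sum of squares).
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    ss_res = sum(e * e for e in errors)
    r2 = 1.0 - ss_res / ss_tot  # undefined when all y_true values are identical

    return math.sqrt(mse), mae, r2

# Made-up targets and predictions, for illustration only.
rmse, mae, r2 = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
```

RMSE and MAE are in the same units as the target, so lower is better; R² is unitless, and a value closer to 1 indicates a better fit.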

**Steps:**

* Step 1: Import Required Libraries
* Step 2: Load Dataset
* Step 3: Data Preprocessing (EDA + Cleaning + Encoding)
* Step 4: Training
* Step 5: Prediction and Evaluation
* Step 6: Hyperparameter Tuning
* Step 7: Comparative Study and Selecting the Best Model

**Step 1: Import Required Libraries**

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import joblib

**Step 2: Load Dataset**

In [None]:
# Define the path to the Excel file
excel_file = 'SupplyChainEmissionFactorsforUSIndustriesCommodities.xlsx'  # Replace with actual path

In [None]:
# Define the range of years to process
years = range(2010, 2017)

In [None]:
# Inspect one element of the 'years' range (optional; years[2] is 2012)
years[2]

In [None]:
# Load data for the first year from the 'Commodity' sheet
df_1 = pd.read_excel(excel_file, sheet_name=f'{years[0]}_Detail_Commodity')
# Display the first few rows of the loaded DataFrame
df_1.head()

In [None]:
# Load data for the first year from the 'Industry' sheet
df_2 = pd.read_excel(excel_file, sheet_name=f'{years[0]}_Detail_Industry')
# Display the first few rows of the loaded DataFrame
df_2.head()

In [None]:
# Initialize an empty list to store DataFrames from each year
all_data = []

# Loop through each year in the defined range
for year in years:
    try:
        # Load data from 'Commodity' and 'Industry' sheets for the current year
        df_com = pd.read_excel(excel_file, sheet_name=f'{year}_Detail_Commodity')
        df_ind = pd.read_excel(excel_file, sheet_name=f'{year}_Detail_Industry')

        # Add a 'Source' column to indicate whether the data is from 'Commodity' or 'Industry'
        df_com['Source'] = 'Commodity'
        df_ind['Source'] = 'Industry'
        # Add a 'Year' column to identify the year of the data
        df_com['Year'] = df_ind['Year'] = year

        # Clean up column names by removing leading/trailing whitespace
        df_com.columns = df_com.columns.str.strip()
        df_ind.columns = df_ind.columns.str.strip()

        # Rename specific columns for consistency across data sources
        df_com.rename(columns={
            'Commodity Code': 'Code',
            'Commodity Name': 'Name'
        }, inplace=True)

        df_ind.rename(columns={
            'Industry Code': 'Code',
            'Industry Name': 'Name'
        }, inplace=True)

        # Concatenate the commodity and industry data for the current year and append to the list
        all_data.append(pd.concat([df_com, df_ind], ignore_index=True))

    except Exception as e:
        print(f"Error processing year {year}: {e}")
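
The effect of the loop body can be seen on a toy example. This is a sketch under stated assumptions: the two small DataFrames below are made-up stand-ins for one year's sheets (the real sheets have more columns), with stray whitespace added to the column names to show why they are stripped before renaming.

```python
import pandas as pd

# Hypothetical stand-ins for one year's 'Commodity' and 'Industry' sheets.
df_com = pd.DataFrame({
    " Commodity Code": ["C1", "C2"],
    "Commodity Name ": ["Commodity A", "Commodity B"],
})
df_ind = pd.DataFrame({
    "Industry Code": ["I1"],
    "Industry Name": ["Industry A"],
})

# Label each frame with its origin and year, then strip whitespace from headers.
for frame, source in ((df_com, "Commodity"), (df_ind, "Industry")):
    frame["Source"] = source
    frame["Year"] = 2010
    frame.columns = frame.columns.str.strip()

# Map the sheet-specific headers onto shared names so the rows line up.
df_com = df_com.rename(columns={"Commodity Code": "Code", "Commodity Name": "Name"})
df_ind = df_ind.rename(columns={"Industry Code": "Code", "Industry Name": "Name"})

combined = pd.concat([df_com, df_ind], ignore_index=True)
print(combined)
```

Without the renaming step, `pd.concat` would keep the commodity and industry headers as separate columns and fill the gaps with `NaN`, so the shared `Code` and `Name` columns are what allow both sources to sit in one table.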

In [None]:
# Accessing an element from the list of DataFrames (this line can be removed as it's just for inspection)
all_data[3]

In [None]:
# Check the number of DataFrames in the list (this line can be removed as it's just for inspection)
len(all_data)

In [None]:
# Concatenate all DataFrames in the 'all_data' list into a single DataFrame
df = pd.concat(all_data, ignore_index=True)
# Display the first 10 rows of the combined DataFrame
df.head(10)

In [None]:
# Check the number of rows in the concatenated DataFrame (this line can be removed as it's just for inspection)
len(df)

**Step 3: Data Preprocessing**

In [None]:
# Check the column names of the combined DataFrame (this line can be removed as it's just for inspection)
df.columns

In [None]:
# Check for missing values in each column of the DataFrame
df.isnull().sum()
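
Raw counts are easier to act on when shown next to the share of rows they represent. The following is a minimal sketch on a made-up frame; the column names (`Code`, `Factor`, `Empty`) are illustrative and not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame; column names are illustrative only.
toy = pd.DataFrame({
    "Code": ["A", "B", "C", "D"],
    "Factor": [0.1, None, 0.3, None],
    "Empty": [None, None, None, None],
})

# Count missing values per column and express them as a percentage of rows.
missing = toy.isnull().sum()
summary = pd.DataFrame({
    "missing": missing,
    "percent": (missing / len(toy) * 100).round(1),
}).sort_values("missing", ascending=False)
print(summary)

# Columns with no values at all carry no information and are candidates for dropping.
all_empty = summary.index[summary["missing"] == len(toy)].tolist()
print(all_empty)
```

Columns that are entirely empty can usually be dropped, while partially missing columns call for a decision between imputing values and removing rows.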