# Supply Chain Emissions Modeling Using Industry and Commodity Data (2010–2016)

Problem Statement:

You have annual supply chain emission data from 2010–2016 categorized into industries and commodities. The goal is to develop a regression model that can predict the Supply Chain Emission Factors with Margins based on descriptive and quality metrics (substance, unit, reliability, temporal/geographical/technological/data collection correlations, etc.).

# 🌱 Greenhouse Gas Emission Prediction Project

![GHG Emissions](https://www.shalom-education.com/wp-content/uploads/2022/12/Shutterstock_1667551381-1-1024x1006.jpg)

**Project Goal:**  
To analyze and predict greenhouse gas (GHG) emissions from various U.S. industries and commodities using the official dataset from [data.gov](https://catalog.data.gov/dataset/supply-chain-greenhouse-gas-emission-factors-for-us-industries-and-commodities).

![EPA logo](https://edg.epa.gov/EPALogo.svg)

**Source:**  
[Supply Chain Greenhouse Gas Emission Factors](https://catalog.data.gov/dataset/supply-chain-greenhouse-gas-emission-factors-for-us-industries-and-commodities)

  
**Tools:** Python, Pandas, Scikit-learn, Matplotlib, Seaborn  


## 📂 Dataset Overview

This dataset contains supply chain emission factors associated with various U.S. industries and commodities.

**Key Columns:**
- `Code`: Commodity or industry classification code
- `Name`: Commodity or industry name
- `Substance`: Greenhouse gas the factor applies to (carbon dioxide, methane, nitrous oxide, other GHGs)
- `Unit`: Measurement unit (e.g., kg/2018 USD, purchaser price)
- `Supply Chain Emission Factors with Margins`: Emission factor per unit, including margins (the prediction target)



## 🧹 Data Preprocessing

Steps:
- Handle missing values
- Convert units where needed
- Encode categorical features
- Normalize/scale numeric columns
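
The steps above can be sketched on a small made-up frame. This is a minimal illustration, not the notebook's actual preprocessing: the `toy` frame, its `Factor` column, and the median fill are assumptions chosen for the example, and the unit-conversion step is left out because it depends on the units present in the real sheets.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the real sheets (values are illustrative only)
toy = pd.DataFrame({
    "Substance": ["carbon dioxide", "methane", None, "carbon dioxide"],
    "Factor": [0.40, 0.01, 0.25, None],
})

# 1. Handle missing values: drop rows with no category, fill numeric gaps with the median
toy = toy.dropna(subset=["Substance"]).copy()
toy["Factor"] = toy["Factor"].fillna(toy["Factor"].median())

# 2. Encode the categorical column as integers
toy["Substance"] = toy["Substance"].map({"carbon dioxide": 0, "methane": 1})

# 3. Scale the numeric column to zero mean and unit variance
toy_scaled = StandardScaler().fit_transform(toy[["Factor"]])
print(toy)
print(toy_scaled.ravel())
```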

## 🤖 Model Building & Evaluation

We aim to predict `Supply Chain Emission Factors with Margins` using regression models.

Models to try:
- Linear Regression
- Random Forest

**Evaluation Metrics:**
- RMSE (Root Mean Squared Error)
- MAE (Mean Absolute Error)
- R² Score
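
A small helper can compute all three metrics in one call. This is a sketch: `regression_report` is a hypothetical name, and the numbers passed in are toy values rather than results from this dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """Return RMSE, MAE, and R² for one set of predictions."""
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    mae = float(mean_absolute_error(y_true, y_pred))
    r2 = float(r2_score(y_true, y_pred))
    return {"RMSE": rmse, "MAE": mae, "R2": r2}

# Toy values, not results from this dataset
report = regression_report([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(report)
```

RMSE penalizes large errors more heavily than MAE, so reporting both gives a rough sense of whether a few large misses dominate the error.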


##### Steps:
- Step 1: Import Required Libraries
- Step 2: Load Dataset
- Step 3: Data Preprocessing (EDA+Cleaning+Encoding)
- Step 4: Training
- Step 5: Prediction and Evaluation
- Step 6: Hyperparameter Tuning
- Step 7: Comparative Study and Selecting the Best Model


# Step 1: Import Required Libraries

In [1]:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import joblib



# Step 2: Load Dataset

In [None]:
excel_file = 'SupplyChainEmissionFactorsforUSIndustriesCommodities.xlsx'  # Replace with actual path
years = range(2010, 2017)

In [None]:
years[0]

In [None]:
df_1 = pd.read_excel(excel_file, sheet_name=f'{years[0]}_Detail_Commodity')
df_1.head()

In [None]:
df_2 = pd.read_excel(excel_file, sheet_name=f'{years[0]}_Detail_Industry')
df_2.head()

In [None]:
all_data = []

for year in years:
    try:
        df_com = pd.read_excel(excel_file, sheet_name=f'{year}_Detail_Commodity')
        df_ind = pd.read_excel(excel_file, sheet_name=f'{year}_Detail_Industry')
        
        df_com['Source'] = 'Commodity'
        df_ind['Source'] = 'Industry'
        df_com['Year'] = df_ind['Year'] = year
        
        df_com.columns = df_com.columns.str.strip()
        df_ind.columns = df_ind.columns.str.strip()

        df_com.rename(columns={
            'Commodity Code': 'Code',
            'Commodity Name': 'Name'
        }, inplace=True)
        
        df_ind.rename(columns={
            'Industry Code': 'Code',
            'Industry Name': 'Name'
        }, inplace=True)
        
        all_data.append(pd.concat([df_com, df_ind], ignore_index=True))
        
    except Exception as e:
        print(f"Error processing year {year}: {e}")

In [None]:
all_data[3]

In [None]:
len(all_data)

In [None]:
df = pd.concat(all_data, ignore_index=True)
df.head()

In [None]:
len(df)
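
The per-year loading logic can be tried without the Excel file by applying the same steps to in-memory frames. This is a sketch under assumptions: `combine_sheets` is a hypothetical helper, and the toy frames contain placeholder codes and names, not rows from the dataset.

```python
import pandas as pd

def combine_sheets(df_com, df_ind, year):
    """Tag, rename, and stack one year's commodity and industry sheets."""
    # Strip stray whitespace from headers, then unify the code/name columns
    df_com = df_com.rename(columns=str.strip).rename(
        columns={"Commodity Code": "Code", "Commodity Name": "Name"})
    df_ind = df_ind.rename(columns=str.strip).rename(
        columns={"Industry Code": "Code", "Industry Name": "Name"})
    # Record where each row came from
    df_com = df_com.assign(Source="Commodity", Year=year)
    df_ind = df_ind.assign(Source="Industry", Year=year)
    return pd.concat([df_com, df_ind], ignore_index=True)

# Placeholder rows (note the trailing space in one header)
toy_com = pd.DataFrame({"Commodity Code ": ["C1"], "Commodity Name": ["Toy commodity"]})
toy_ind = pd.DataFrame({"Industry Code": ["I1"], "Industry Name": ["Toy industry"]})

combined = combine_sheets(toy_com, toy_ind, 2010)
# One row count per (Year, Source) pair makes a missing sheet easy to spot
print(combined.groupby(["Year", "Source"]).size())
```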

# Step 3: Data Preprocessing

In [None]:
df.columns # Checking columns

In [None]:
df.isnull().sum()

In [None]:
# The 'Unnamed: 7' column contains no data, so we drop it
df.drop(columns=['Unnamed: 7'],inplace=True)

In [None]:
df.columns

In [None]:
print(df.info())   # Checking data types and non-null counts 

In [None]:
df.describe().T # Checking summary statistics 

In [None]:
df.isnull().sum() # Checking for null values in each column 

In [None]:
# Visualize distribution
sns.histplot(df['Supply Chain Emission Factors with Margins'], bins=50, kde=True)
plt.title('Target Variable Distribution')
plt.show()

In [None]:
# Check categorical variables
print(df['Substance'].value_counts())

In [None]:
print(df['Unit'].value_counts()) # Checking unique values in 'Unit' with count 

In [None]:
print(df['Unit'].unique()) # Checking unique values in 'Unit'

In [None]:
print(df['Source'].value_counts()) # Checking unique values in 'Source' with count 

In [None]:
df['Substance'].unique() # Checking unique values in 'Substance' 

In [None]:
substance_map={'carbon dioxide':0, 'methane':1, 'nitrous oxide':2, 'other GHGs':3} # Mapping substances to integers 

In [None]:
df['Substance']=df['Substance'].map(substance_map) 

In [None]:
df['Substance'].unique() # Checking unique values in 'Substance' 

In [None]:
print(df['Unit'].unique()) # Checking unique values in 'Unit' 

In [None]:
unit_map={'kg/2018 USD, purchaser price':0, 'kg CO2e/2018 USD, purchaser price':1} # Mapping units to integers 

In [None]:
df['Unit']=df['Unit'].map(unit_map)

In [None]:
print(df['Unit'].unique()) # Checking unique values in 'Unit' 

In [None]:
print(df['Source'].unique()) # Checking unique values in 'Source' 

In [None]:
source_map={'Commodity':0, 'Industry':1} # Mapping sources to integers 

In [None]:
df['Source']=df['Source'].map(source_map)   # applying the mapping to 'Source' column 

In [None]:
print(df['Source'].unique()) # Checking unique values in 'Source' 
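
`Series.map` with a dictionary silently turns any label missing from the dictionary into `NaN`, which is why the cells above re-check `unique()` after each mapping. The check can also be made explicit. This is a sketch: `map_strict` is a hypothetical helper, and the toy series only mimics the `Source` column.

```python
import pandas as pd

def map_strict(series, mapping):
    """Map labels to integers and fail loudly on labels missing from the mapping."""
    unknown = set(series.dropna().unique()) - set(mapping)
    if unknown:
        raise ValueError(f"Unmapped labels: {sorted(unknown)}")
    return series.map(mapping)

# Toy series mimicking the 'Source' column
toy_source = pd.Series(["Commodity", "Industry", "Commodity"])
encoded = map_strict(toy_source, {"Commodity": 0, "Industry": 1})
print(encoded.tolist())
```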

In [None]:
df.info() # Checking data types and non-null counts after mapping 

In [None]:
df.Code.unique() # Checking unique values in 'Code' 

In [None]:
df.Name.unique() # Checking unique values in 'Name' 

In [None]:
len(df.Name.unique()) # Checking number of unique values in 'Name' 

##### Top 10 Emitting Industries

In [None]:
top_emitters = df[['Name', 'Supply Chain Emission Factors with Margins']].groupby('Name').mean().sort_values(
    'Supply Chain Emission Factors with Margins', ascending=False).head(10) 

# Resetting index for better plotting
top_emitters = top_emitters.reset_index()

In [None]:
top_emitters

In [None]:
# Plotting the top 10 emitting industries


plt.figure(figsize=(10,6))
# Example: Top emitting industries (already grouped)
sns.barplot(
    x='Supply Chain Emission Factors with Margins',
    y='Name',
    data=top_emitters,
    hue='Name',
    palette='pastel'  # Use 'Blues', 'viridis', etc., for other color maps
)

# Add ranking labels (1, 2, 3...) next to bars
for i, value in enumerate(top_emitters['Supply Chain Emission Factors with Margins'], start=1):
    plt.text(value + 0.01, i - 1, f'#{i}', va='center', fontsize=11, fontweight='bold', color='black')

plt.title('Top 10 Emitting Industries', fontsize=14, fontweight='bold') # Title of the plot 
plt.xlabel('Emission Factor (kg CO2e/unit)') # X-axis label
plt.ylabel('Industry') # Y-axis label
plt.grid(axis='x', linestyle='--', alpha=0.6) # Adding grid lines for better readability
plt.tight_layout() # Adjust layout to prevent overlap

plt.show()

##### Drop the non-numeric columns that are not needed
##### Also drop the Code and Year columns, since neither is needed

In [None]:

df.drop(columns=['Name','Code','Year'], inplace=True) 

In [None]:
df.head(1)

In [None]:
df.shape

##### Define features and target

In [None]:
X = df.drop(columns=['Supply Chain Emission Factors with Margins']) # Feature set excluding the target variable
y = df['Supply Chain Emission Factors with Margins'] # Target variable 

In [None]:
X.head()

In [None]:
y.head()

### Univariate Analysis

In [None]:
# Count plot for Substance
plt.figure(figsize=(6, 3))
sns.countplot(x=df["Substance"])
plt.title("Count Plot: Substance")
plt.xticks()
plt.tight_layout()
plt.show()

In [None]:
# Count plot for Unit
plt.figure(figsize=(6, 3))
sns.countplot(x=df["Unit"])
plt.title("Count Plot: Unit")
plt.tight_layout()
plt.show()


In [None]:
# Count plot for Source
plt.figure(figsize=(6, 4))
sns.countplot(x=df["Source"])
plt.title("Count Plot: Source (Industry vs Commodity)")
plt.tight_layout()
plt.show()

In [None]:
df.columns

### Multivariate Analysis

##### Correlation heatmap

In [None]:
df.select_dtypes(include=np.number).corr() # Checking correlation between numerical features 

In [None]:
df.info() # Checking data types and non-null counts after mapping 

In [None]:
# Correlation matrix 
plt.figure(figsize=(12, 8))
sns.heatmap(df.select_dtypes(include=np.number).corr(), annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()

## Normalize features

In [None]:
X.describe().T

In [None]:
# Normalize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [None]:
X_scaled[0].min(),X_scaled[0].max()

In [None]:
np.round(X_scaled.mean()),np.round(X_scaled.std())

#### Divide the data into train and test

In [None]:
X.shape

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42) # Splitting data into training and testing sets 

In [None]:
X_train.shape

In [None]:
X_test.shape
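
The cells above fit `StandardScaler` on the full feature matrix before splitting, so test-set statistics influence the scaling. A common alternative is to split first and fit the scaler on the training portion only. The sketch below shows that order on random toy data; all names ending in `_toy`, `_tr`, or `_te` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random toy features and target, for illustration only
rng = np.random.default_rng(42)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_toy = rng.normal(size=100)

# Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# ... then learn the scaling from the training rows only and reuse it on the test rows
toy_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = toy_scaler.transform(X_tr)
X_te_scaled = toy_scaler.transform(X_te)

# Per-column check: training columns have mean 0 and std 1; test columns are only close to that
print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```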

### Select the model for training

In [None]:
RF_model = RandomForestRegressor(random_state=42) # Initializing Random Forest Regressor 

# Step 4: Training

In [None]:
RF_model.fit(X_train, y_train) # Fitting the model on training data 

# Step 5: Prediction and Evaluation

In [None]:
RF_y_pred = RF_model.predict(X_test) # Making predictions on the test set 

In [None]:
RF_y_pred[:20]

In [None]:
RF_mse = mean_squared_error(y_test, RF_y_pred) # Calculating Mean Squared Error (MSE)
RF_rmse = np.sqrt(RF_mse) # Calculating Root Mean Squared Error (RMSE)
# Calculating R² score
RF_r2 = r2_score(y_test, RF_y_pred)

print(f'RMSE: {RF_rmse}')
print(f'R² Score: {RF_r2}')

In [None]:
from sklearn.linear_model import LinearRegression # Importing Linear Regression model 
LR_model = LinearRegression() # Initializing Linear Regression model
# Fitting the Linear Regression model on training data

LR_model.fit(X_train, y_train)

LR_y_pred = LR_model.predict(X_test) # Making predictions on the test set using Linear Regression model 


LR_mse = mean_squared_error(y_test, LR_y_pred) # Calculating Mean Squared Error (MSE) for Linear Regression model
LR_rmse = np.sqrt(LR_mse) # Calculating Root Mean Squared Error (RMSE) for Linear Regression model 
LR_r2 = r2_score(y_test, LR_y_pred) # Calculating R² score for Linear Regression model 

print(f'RMSE: {LR_rmse}')
print(f'R² Score: {LR_r2}')

# Step 6: Hyperparameter Tuning

In [None]:
# Hyperparameter tuning for Random Forest Regressor using GridSearchCV 
# Define the parameter grid for hyperparameter tuning 
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5]
}

# Perform grid search with cross-validation to find the best hyperparameters 
grid_search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=3, n_jobs=-1)

# Fit the grid search model on the training data 
grid_search.fit(X_train, y_train)

# Best model from grid search
best_model = grid_search.best_estimator_
print("Best Parameters:", grid_search.best_params_)


### Use best parameters for prediction

In [None]:
# Use the best model to make predictions on the test set 
y_pred_best = best_model.predict(X_test)


HP_mse = mean_squared_error(y_test, y_pred_best)
HP_rmse = np.sqrt(HP_mse)
HP_r2 = r2_score(y_test, y_pred_best)

print(f'RMSE: {HP_rmse}')
print(f'R² Score: {HP_r2}')
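
When no `scoring` argument is given, `GridSearchCV` ranks candidates by the estimator's default score, which is R² for regressors. If RMSE is the metric of interest, it can be requested directly. The sketch below runs a reduced grid on random toy data; the data, the smaller grid, and the `toy_search` name are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Random toy regression problem, for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = X_toy[:, 0] * 2.0 + rng.normal(scale=0.1, size=60)

toy_search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=42),
    {"max_depth": [None, 5]},
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes scores, hence the sign
    cv=3,
)
toy_search.fit(X_toy, y_toy)

# Flip the sign to recover the cross-validated RMSE of the best candidate
cv_rmse = -toy_search.best_score_
print(toy_search.best_params_, cv_rmse)
```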


# Step 7: Comparative Study and Selecting the Best Model

In [None]:
# Create a comparative DataFrame for all models
results = {
    'Model': ['Random Forest (Default)', 'Linear Regression', 'Random Forest (Tuned)'],
    'MSE': [RF_mse, LR_mse, HP_mse],
    'RMSE': [RF_rmse, LR_rmse, HP_rmse],
    'R2': [RF_r2, LR_r2, HP_r2]
}

# Create a DataFrame to compare the results of different models
comparison_df = pd.DataFrame(results)
print(comparison_df)

Comparing the three models above, Linear Regression performs better than the Random Forest regressor, so we will save the Linear Regression model.
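
The selection can also be made programmatically by sorting a comparison table on the chosen metric. This is a sketch with placeholder numbers: the model names and scores below are invented for illustration and are not results from this notebook.

```python
import pandas as pd

# Placeholder scores, not results from this notebook
toy_results = pd.DataFrame({
    "Model": ["Model A", "Model B", "Model C"],
    "RMSE": [0.30, 0.10, 0.20],
    "R2": [0.70, 0.95, 0.85],
})

# Lower RMSE is better, so the first row after sorting is the preferred model
ranked = toy_results.sort_values("RMSE").reset_index(drop=True)
best_name = ranked.loc[0, "Model"]
print(ranked)
print("Selected:", best_name)
```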


### Save model and scaler

In [None]:
# Create a directory to save the models if it doesn't exist
import os
os.makedirs('models', exist_ok=True)

In [None]:
# Save the selected model and the scaler
joblib.dump(LR_model, 'models/LR_model.pkl')    # Save the Linear Regression model chosen in Step 7
joblib.dump(scaler, 'models/scaler.pkl') # Save the scaler used for normalization
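
A saved model is only useful together with the scaler it was trained with, so both need to be reloaded and applied in the same order at prediction time. The sketch below shows the round trip on a toy model written to a temporary directory; the toy data, file names, and variable names are assumptions and do not touch the `models/` folder.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Toy training data following y = 2x + 1, for illustration only
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = 2.0 * X_toy.ravel() + 1.0

toy_scaler = StandardScaler().fit(X_toy)
toy_model = LinearRegression().fit(toy_scaler.transform(X_toy), y_toy)

# Save both artifacts to a temporary directory
out_dir = tempfile.mkdtemp()
joblib.dump(toy_model, os.path.join(out_dir, "model.pkl"))
joblib.dump(toy_scaler, os.path.join(out_dir, "scaler.pkl"))

# Reload and predict: scale new inputs with the saved scaler before calling predict
loaded_model = joblib.load(os.path.join(out_dir, "model.pkl"))
loaded_scaler = joblib.load(os.path.join(out_dir, "scaler.pkl"))
prediction = loaded_model.predict(loaded_scaler.transform(np.array([[4.0]])))
print(prediction)
```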
