# Plan
1. Create an ML model based on this dataset
2. Create models for specific industries

# Project Structure:
1. Abstract
2. Data collection
3. Preprocess data
4. Feature engineering
5. Train, test split
6. Modelling
7. Model evaluation
8. Hyperparameter tuning 
9. Model Interpretation 
10. Results and conclusions
11. References and Acknowledgments

# Abstract 

Title: Predicting Scope 3 Emissions using Machine Learning: A Novel Approach

This research builds on the study by Serafeim, George and Velez Caicedo, Gladys. 2022. "Machine Learning Models for Prediction of Scope 3 Carbon Emissions." Harvard Business School Working Paper, 2022.

I would like to thank the authors for sharing their methodology and data. This allows me to conduct independent research and modelling, compare my results with the authors' conclusions, and take the research in new directions by updating the model.

In [1]:
import polars as pl
import numpy as np
import pandas as pd
from ydata_profiling import ProfileReport
import plotly.graph_objects as go
import plotly.express as px

## Data Guidance

Copyright © 2022 by George Serafeim and Gladys Velez Caicedo. "Machine Learning Models for Prediction of Scope 3 Carbon Emissions." Harvard Business School Working Paper, 2022		
Funding for this research was provided in part by Harvard Business School.		
		
Data source: 		
Serafeim, George and Velez Caicedo, Gladys. 2022. "Machine Learning Models for Prediction of Scope 3 Carbon Emissions." Harvard Business School Working Paper, 2022
		
GUIDANCE:		
Column A "Year" is the year in which the environmental impact was incurred by the firm's operations.		
		
Column B "Company Name" is the name of the issuer.		
		
Column C "Country" is the country in which the companies' headquarters are located.		
		
Column D "Industry" refers to the Exiobase industry category to which the firm belongs: We provide only the Exiobase industries here as they are open source, but in our paper, we use GICS taxonomy as fixed effects.		
All Exiobase industries are based on the International Standard Industrial Classification Revision 3.1 (ISIC). To learn more about ISIC and a comprehensive list of industries included, please refer to: unstats.un.org/unsd/statcom/doc02/isic.pdf
For example, the term "nec" refers to "not elsewhere classified."
		
Column E "GHG Intensity (Sales)" is the monetized GHG impact of the firm's operations during the specific year indicated in column A divided by revenue in that year		
		
Column F "GHG Intensity (Op Inc)" is the monetized GHG impact of the firm's operations during the specific year indicated in column A divided by operating income in that year		
		
Column G "Total GHG Environmental Cost (Scope 1, 2, 3) " is the total monetized GHG environmental impact of Scope 1, 2, and 3 emissions of the firm's operations during the specific year indicated in Column A.		
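The two intensity columns are simple ratios, so they can be sanity-checked by hand. Below is a minimal sketch with made-up figures; the column names (`total_ghg_cost`, `revenue`, `operating_income`) are my own illustrative assumptions, not the names used in the Excel file, and no sign convention for the monetized impact is assumed.

```python
import pandas as pd

# Hypothetical figures for two firm-years; names and values are illustrative only.
firms = pd.DataFrame(
    {
        "total_ghg_cost": [50.0, 120.0],     # monetized GHG impact (USD)
        "revenue": [1_000.0, 2_400.0],       # sales in the same year (USD)
        "operating_income": [200.0, 300.0],  # operating income in the same year (USD)
    }
)

# GHG Intensity (Sales): monetized impact divided by revenue.
firms["ghg_intensity_sales"] = firms["total_ghg_cost"] / firms["revenue"]

# GHG Intensity (Op Inc): monetized impact divided by operating income.
firms["ghg_intensity_op_inc"] = firms["total_ghg_cost"] / firms["operating_income"]

print(firms)
```

Operating income can be zero or negative in real data, so the second ratio probably needs extra care before it is used as a feature.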
		
Columns H-J are Scope 1, 2, 3 Emissions		
Each scope of emissions is defined by the GHG Protocol. More information can be found at the Greenhouse Gas Protocol: https://ghgprotocol.org/		
	Column H:	 Scope 1 Emissions: emissions from direct operations that occur from sources that are controlled or owned by the firm 
	Column I:	 Scope 2 Emissions: emissions associated with the purchase of electricity, steam, heat, or cooling as a result of the firm's energy use 
	Column J:	 Scope 3 Emissions: emissions from 15 categories that are a result of activities from assets not owned or controlled by the reporting firm, that are not within a firm's Scope 1 and 2 boundary, and that occur through the value chain.
		
Columns K-BC are fifteen Scope 3 emissions category types in alphabetical order followed by an indicator variable denoting if the data point is company reported (0) or if the data point is predicted via machine learning (1)
	Column K:	Business Travel
	Column N:	Capital Goods
	Column Q:	Downstream Leased Assets
	Column T:	Downstream Transportation and Distribution
	Column W:	Employee Commuting
	Column Z:	End of Life Treatment of Sold Products
	Column AC:	Franchises
	Column AF:	Fuel-and-energy-related activities (not included in Scope 1 or 2)
	Column AI:	Investments
	Column AL:	Processing of Sold Products
	Column AO:	Purchased Goods and Services
	Column AR:	Upstream Leased Assets
	Column AU:	Upstream Transportation and Distribution
	Column AX:	Use of Sold Products
	Column BA:	Waste Generated in Operations
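The column letters above advance in steps of three (K, N, Q, ...), which suggests that each category occupies three adjacent columns. The sketch below reproduces the layout under that assumption; reading the three columns as value, "Imputed", and "Test" is my inference from the guidance, and `column_letter` is a helper I introduce only for illustration.

```python
def column_letter(index: int) -> str:
    """Convert a 1-based column index to an Excel-style letter (1 -> A, 27 -> AA)."""
    letters = ""
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


CATEGORIES = [
    "Business Travel",
    "Capital Goods",
    "Downstream Leased Assets",
    "Downstream Transportation and Distribution",
    "Employee Commuting",
    "End of Life Treatment of Sold Products",
    "Franchises",
    "Fuel-and-energy-related activities",
    "Investments",
    "Processing of Sold Products",
    "Purchased Goods and Services",
    "Upstream Leased Assets",
    "Upstream Transportation and Distribution",
    "Use of Sold Products",
    "Waste Generated in Operations",
]

FIRST_COLUMN = 11          # column K
COLUMNS_PER_CATEGORY = 3   # assumed: value, "Imputed" flag, "Test" label

# First column of each category block.
layout = {
    name: column_letter(FIRST_COLUMN + i * COLUMNS_PER_CATEGORY)
    for i, name in enumerate(CATEGORIES)
}

# Last column covered by the fifteen blocks.
last_column = column_letter(FIRST_COLUMN + len(CATEGORIES) * COLUMNS_PER_CATEGORY - 1)

for name, letter in layout.items():
    print(f"{letter:>2}  {name}")
print("last column:", last_column)
```

Fifteen blocks of three columns starting at K end at BC, which matches the "Columns K-BC" range stated in the guidance.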
		
The dataset is a combination of primary firm reported emissions data supplemented with Scope 3 predictions by category.		
Our methodology takes firm reported values first and incorporates imputations only when companies' self-reported emissions data are not publicly available.		
If the data point is imputed, the Scope 3 category "Imputed" value is 1.		
If the data point is company reported, the Scope 3 category "Imputed" value is 0.		
The Scope 3 category "Test" column indicates if the data point was used to "train" or "test" the machine learning model. If no company value is reported, the value is set to "none".
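To reproduce the authors' train/test partition, the "Imputed" and "Test" columns can be used as filters. The sketch below uses a toy frame; the exact column names (`Business Travel Imputed`, `Business Travel Test`) are assumptions on my part and should be checked against the real headers in the Excel file.

```python
import pandas as pd

# Toy rows for one Scope 3 category; column names and values are hypothetical.
sample = pd.DataFrame(
    {
        "Company Name": ["A", "B", "C", "D"],
        "Business Travel": [10.0, 12.5, 7.0, 3.2],
        "Business Travel Imputed": [0, 0, 1, 0],
        "Business Travel Test": ["train", "test", "none", "train"],
    }
)

# Keep only company-reported values (Imputed == 0).
reported = sample[sample["Business Travel Imputed"] == 0]

# Recover the partition recorded in the "Test" column.
train = reported[reported["Business Travel Test"] == "train"]
test = reported[reported["Business Travel Test"] == "test"]

print(len(reported), "reported rows:", len(train), "train,", len(test), "test")
```

Training only on company-reported rows avoids fitting a new model to the earlier model's own predictions.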
		
Other Notes:		
The "Final Raw Sample(0%)" tab includes all raw outputs, discounted at 0%, from our environmental impact calculation methodology. The Social Cost of Carbon discounted at 0% applied here is roughly $300 USD per metric ton of emissions.		
The "Final Raw Sample(3%)" tab includes all raw outputs, discounted at 3%, from our environmental impact calculation methodology. The Social Cost of Carbon discounted at 3% applied here is roughly $100 USD per metric ton of emissions.		
All observations in the tabs are sorted by 1) Year in descending order, 2) Industry in alphabetical order, and 3) Environmental Intensity (Sales) in descending order.		
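The two tabs differ only in the Social Cost of Carbon applied, so a monetized value can be converted back to an approximate tonnage. This is a minimal sketch using the rounded figures from the notes above ($300 and $100 per metric ton); because the notes say "roughly", the results are approximations, and the function names are my own.

```python
# Approximate Social Cost of Carbon (USD per metric ton) by discount rate,
# taken from the rounded figures in the notes.
SCC_USD_PER_TON = {"0%": 300.0, "3%": 100.0}


def monetized_cost(emissions_tons: float, discount_rate: str) -> float:
    """Approximate monetized impact of a given tonnage."""
    return emissions_tons * SCC_USD_PER_TON[discount_rate]


def implied_emissions(cost_usd: float, discount_rate: str) -> float:
    """Approximate tonnage implied by a monetized impact."""
    return cost_usd / SCC_USD_PER_TON[discount_rate]


tons = 1_000.0
print(monetized_cost(tons, "3%"))  # cost under the 3% discount rate
print(monetized_cost(tons, "0%"))  # cost under the 0% discount rate
print(implied_emissions(monetized_cost(tons, "0%"), "0%"))  # back to tons
```

With these rounded values, the same emissions are worth about three times more in the 0% tab than in the 3% tab.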
		
		
Also, if you are a researcher planning to use the data in an academic research project, please email us and we will send you a file including ISINs to facilitate merging with other datasets.		
Our team can be reached at: ImpactWeightedAccounts@hbs.edu

# Data collection

In [2]:
# Training dataset
df = pl.read_excel(
    "../data/01_raw/IWA-External-Scope-3-Data.xlsx",
    sheet_name="3%",
)

# 3% tab includes all raw outputs, discounted at 3%, from our environmental impact calculation methodology. 
# The Social Cost of Carbon discounted at 3% applied here is roughly $100 USD per metric ton of emissions.

In [3]:
# Dataset with 0% rate

df_0_percent = pl.read_excel(
    "../data/01_raw/IWA-External-Scope-3-Data.xlsx",
    sheet_name="0%",
)

# 0% tab includes all raw outputs, discounted at 0%, from our environmental impact calculation methodology. 
# The Social Cost of Carbon discounted at 0% applied here is roughly $300 USD per metric ton of emissions

### 2. AI Agent for EDA 

In [13]:
# TODO: create in the future

### 3. AI suggestions 

In [14]:
# Variable scaling in preprocessing phase 
# Time series analysis at the end of EDA
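For the scaling step, one common option is standardization (zero mean, unit variance), with the statistics fitted on the training rows only so that no information leaks from the test set. The sketch below uses plain NumPy and made-up numbers; it is one possible approach, not a decision already made for this project, and both function names are hypothetical.

```python
from typing import Tuple

import numpy as np


def fit_standardizer(train: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """Compute per-column mean and standard deviation on the training data."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid division by zero for constant columns
    return mean, std


def standardize(values: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Apply previously fitted statistics to any data split."""
    return (values - mean) / std


# Made-up feature matrix: two columns on very different scales.
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

mean, std = fit_standardizer(train)
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)  # reuses the training statistics

print(train_scaled)
print(test_scaled)
```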

### 4. Data types and structures

In [5]:
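# Function to check data types
# Expects a pandas DataFrame: polars `DataFrame.dtypes` is a plain list without `value_counts()`.
def check_data_types(df: pd.DataFrame) -> pd.Series:
    """Count how many columns there are of each dtype."""
    return df.dtypes.value_counts()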

In [21]:
check_data_types(df_eda)

float64    21
object     18
int64      16
Name: count, dtype: int64

### Missing values

In [None]:
# First missing values are in GHG Intensity
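A quick way to see where values are missing is to count nulls per column. The sketch below runs on a toy pandas frame with made-up values; on the real data the same calls would be applied to the pandas version of the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in the two intensity columns; the values are invented.
sample = pd.DataFrame(
    {
        "Year": [2020, 2020, 2019],
        "GHG Intensity (Sales)": [0.05, np.nan, 0.02],
        "GHG Intensity (Op Inc)": [np.nan, np.nan, 0.10],
    }
)

# Count and share of missing values per column.
summary = pd.DataFrame(
    {
        "missing": sample.isna().sum(),
        "share": sample.isna().mean(),
    }
)

# Keep only columns with gaps, worst first.
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```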

# Preprocess data

In [None]:
# Functions used in preprocessing 
def _is_true(x: pd.Series) -> pd.Series:
    return x == "t"


def _parse_percentage(x: pd.Series) -> pd.Series:
    x = x.str.replace("%", "")
    x = x.astype(float) / 100
    return x

# Template function to be implemented in nodes.py of the data_processing pipeline
def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:
    """Preprocesses the data for companies.

    Args:
        companies: Raw data.
    Returns:
        Preprocessed data, with `company_rating` converted to a float and
        `iata_approved` converted to boolean.
    """
    companies["iata_approved"] = _is_true(companies["iata_approved"])
    companies["company_rating"] = _parse_percentage(companies["company_rating"])
    return companies
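The template above still refers to columns from a different dataset (`iata_approved`, `company_rating`). A version adapted to the emissions data might look like the sketch below. Everything in it is an assumption on my part: the function name `preprocess_emissions`, the rule that indicator columns end with "Imputed", and the choice to store "Industry" as a categorical.

```python
import pandas as pd


def preprocess_emissions(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical preprocessing node for the Scope 3 dataset.

    Converts 0/1 "Imputed" indicators to booleans and tidies the industry labels.
    """
    data = raw.copy()  # work on a copy so the caller's frame is not mutated

    # Assumed naming rule: indicator columns end with "Imputed".
    for column in data.columns:
        if column.endswith("Imputed"):
            data[column] = data[column].astype(bool)

    # Strip stray whitespace and store the labels as a categorical.
    data["Industry"] = data["Industry"].str.strip().astype("category")
    return data


raw = pd.DataFrame(
    {
        "Industry": [" Air transport", "Manufacture of textiles "],
        "Business Travel": [10.0, 7.0],
        "Business Travel Imputed": [0, 1],
    }
)
clean = preprocess_emissions(raw)
print(clean.dtypes)
```

Unlike the template, this version returns a modified copy, which makes the node easier to re-run in a notebook.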

# Feature engineering

# Train, Test split

In [None]:
# Template for the function used to split data in the data_science pipeline (nodes.py)
from typing import Dict, Tuple

from sklearn.model_selection import train_test_split


def split_data(data: pd.DataFrame, parameters: Dict) -> Tuple:
    """Splits data into features and targets training and test sets.

    Args:
        data: Data containing features and target.
        parameters: Parameters defined in parameters/data_science.yml.
    Returns:
        Split data.
    """
    X = data[parameters["features"]]
    # "price" is the template's placeholder target; swap in the Scope 3 target column
    y = data["price"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=parameters["test_size"], random_state=parameters["random_state"]
    )
    return X_train, X_test, y_train, y_test
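To show how the `parameters` dictionary drives the split, here is a small usage sketch. It reads the target column from a `"target"` key instead of hard-coding it; that key, the feature names, and the toy data are my assumptions and are not part of the template or of parameters/data_science.yml.

```python
from typing import Dict, Tuple

import pandas as pd
from sklearn.model_selection import train_test_split


def split_data(data: pd.DataFrame, parameters: Dict) -> Tuple:
    """Split features and target into training and test sets."""
    X = data[parameters["features"]]
    y = data[parameters["target"]]  # assumed key, not present in the template
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=parameters["test_size"], random_state=parameters["random_state"]
    )
    return X_train, X_test, y_train, y_test


# Hypothetical parameters, mirroring what a YAML file could contain.
parameters = {
    "features": ["scope_1", "scope_2"],
    "target": "scope_3",
    "test_size": 0.2,
    "random_state": 42,
}

# Toy data: ten rows with made-up values.
data = pd.DataFrame(
    {
        "scope_1": [float(i) for i in range(10)],
        "scope_2": [float(2 * i) for i in range(10)],
        "scope_3": [float(3 * i) for i in range(10)],
    }
)

X_train, X_test, y_train, y_test = split_data(data, parameters)
print(X_train.shape, X_test.shape)
```

Fixing `random_state` keeps the split reproducible between runs.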

# Modelling

In [None]:
# Template for the function used for training in the data_science pipeline (nodes.py)
from sklearn.linear_model import LinearRegression


def train_model(X_train: pd.DataFrame, y_train: pd.Series) -> LinearRegression:
    """Trains the linear regression model.

    Args:
        X_train: Training data of independent features.
        y_train: Training data for price.

    Returns:
        Trained model.
    """
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)
    return regressor

# Model evaluation

In [None]:
# Template for the function used for model evaluation in the data_science pipeline (nodes.py)
import logging

from sklearn.metrics import r2_score

def evaluate_model(
    regressor: LinearRegression, X_test: pd.DataFrame, y_test: pd.Series
):
    """Calculates and logs the coefficient of determination.

    Args:
        regressor: Trained model.
        X_test: Testing data of independent features.
        y_test: Testing data for price.
    """
    y_pred = regressor.predict(X_test)
    score = r2_score(y_test, y_pred)
    logger = logging.getLogger(__name__)
    logger.info("Model has a coefficient R^2 of %.3f on test data.", score)
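The coefficient of determination logged above is defined as R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the total sum of squares around the mean of the true values. The sketch below computes it by hand with NumPy on made-up numbers, as a way to check what the score means; `r_squared` is an illustrative helper, not part of the pipeline.

```python
import numpy as np


def r_squared(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return float(1.0 - ss_res / ss_tot)


# Made-up targets and predictions.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

score = r_squared(y_true, y_pred)
perfect = r_squared(y_true, y_true)
print(f"R^2 = {score:.3f}, perfect fit = {perfect:.1f}")
```

A score of 1 means a perfect fit, 0 means the model is no better than predicting the mean, and negative values are possible on test data.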

# Hyperparameter tuning

# Model Interpretation

# Results and conclusions

In [None]:
# FURTHER MODELLING
# Modelling for companies in the same industry, or in several similar industries
# Modelling for companies with similar process and activity characteristics
# Modelling for specific continents
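One simple way to start on per-industry modelling is to fit a separate model for each industry group. The sketch below fits a one-feature linear trend per group with `np.polyfit` on toy data; the column names, the single `revenue` feature, and the linear form are illustrative assumptions rather than a chosen design.

```python
import numpy as np
import pandas as pd

# Toy data: two industries with different linear relationships (made-up values).
data = pd.DataFrame(
    {
        "industry": ["A", "A", "A", "B", "B", "B"],
        "revenue": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
        "scope_3": [2.0, 4.0, 6.0, 5.0, 6.0, 7.0],
    }
)

# Fit one straight line (slope, intercept) per industry.
models = {}
for industry, group in data.groupby("industry"):
    slope, intercept = np.polyfit(
        group["revenue"].to_numpy(), group["scope_3"].to_numpy(), deg=1
    )
    models[industry] = (float(slope), float(intercept))

for industry, (slope, intercept) in models.items():
    print(f"{industry}: scope_3 = {slope:.2f} * revenue + {intercept:.2f}")
```

The same loop could group by continent or by a cluster of similar industries, provided each group has enough observations to fit a model.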

# References and Acknowledgments