## Load data & organize for training/inference
We load the outputs of the `basic_feature_engineering.ipynb` notebook.

In [1]:
import pandas as pd

# Load data
train = pd.read_pickle('final_train.pkl')
test = pd.read_pickle('final_test.pkl')

# Separate features, targets
x_train = train.drop(columns=['target_scope_1', 'target_scope_2', 'entity_id'])
y_scope1 = train['target_scope_1']
y_scope2 = train['target_scope_2']

# Features passed to the final models at inference time to create predictions
x_test = test.drop(columns=['entity_id'])
# entity_id must accompany the predictions in the submission file
entity_id_test = test['entity_id']

## Train and validate model
We are using a linear regression model for simplicity.

In [2]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline


def validate_model(x, y):
    """Cross-validate a scaled linear regression and return the per-fold RMSE scores."""
    # Build the pipeline: standardize features, then fit a linear regression
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', LinearRegression())
    ])
    # Set up 5-fold cross-validation with a fixed shuffle for reproducibility
    kfold = KFold(n_splits=5, shuffle=True, random_state=42)

    # cross_val_score reports negative MSE (higher is better), one value per fold
    cv_scores = cross_val_score(
        pipeline,
        x,
        y,
        cv=kfold,
        scoring='neg_mean_squared_error'
    )

    # Convert to RMSE (positive values)
    rmse_scores = np.sqrt(-cv_scores)

    # Print cross-validation results
    print(f"Cross-Validation RMSE Scores: {rmse_scores}")
    print(f"Mean RMSE: {rmse_scores.mean():.4f}")
    print(f"Standard Deviation RMSE: {rmse_scores.std():.4f}")

    return rmse_scores
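
The sign flip in `validate_model` comes from scikit-learn's convention that higher scores are better, so error metrics are exposed in negated form. As a minimal sketch on synthetic data (the `*_demo` names and the `make_regression` data are illustrative assumptions, not the notebook's data), the built-in `neg_root_mean_squared_error` scorer should give the same per-fold RMSE as taking the square root of the negated MSE:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for x_train / y_scope1 (not the notebook's data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
kfold_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# Negated MSE per fold, converted to RMSE by hand
neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo, cv=kfold_demo,
                          scoring='neg_mean_squared_error')
rmse_from_mse = np.sqrt(-neg_mse)

# Negated RMSE per fold from the built-in scorer; only the sign needs flipping
neg_rmse = cross_val_score(LinearRegression(), X_demo, y_demo, cv=kfold_demo,
                           scoring='neg_root_mean_squared_error')
rmse_direct = -neg_rmse

print(np.allclose(rmse_from_mse, rmse_direct))
```

Either route works; the explicit square root keeps the conversion visible, while the built-in scorer removes one step.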




There is a lot of room for improvement in the results below. Mean RMSE is roughly 104,000 for Scope 1 and 159,000 for Scope 2, and the Scope 2 fold scores range from about 80,000 to 275,000, which suggests the error depends heavily on which companies land in each fold. The poor performance likely stems from how skewed the data is between smaller companies and much larger ones. Scaling techniques such as taking the log of heavily skewed columns could address this, but we do not explore them further in this notebook.

In [3]:
s1_cv_rmse = validate_model(x_train, y_scope1)

Cross-Validation RMSE Scores: [133422.78132096  98572.95659937 106891.89022782  91938.02672004
  91389.65056992]
Mean RMSE: 104443.0611
Standard Deviation RMSE: 15537.3517


In [4]:
s2_cv_rmse = validate_model(x_train, y_scope2)

Cross-Validation RMSE Scores: [167419.49212111 185236.19964815  86912.76537692  80307.22373449
 274623.85227414]
Mean RMSE: 158899.9066
Standard Deviation RMSE: 71438.3280
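
The log-scaling idea mentioned before these results could look roughly like the sketch below. It is not the notebook's method: the data is synthetic (a log-normal `revenue_demo` and a target roughly proportional to it are assumptions made purely for illustration), and it uses scikit-learn's `TransformedTargetRegressor` to fit on `log1p(y)` and map predictions back with `expm1`, which assumes non-negative targets.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical skewed data: log-normal "revenue" and a noisy target proportional to it
revenue_demo = rng.lognormal(mean=10.0, sigma=1.5, size=300)
y_demo = revenue_demo * rng.lognormal(mean=0.0, sigma=0.3, size=300)

# Log-scale the skewed feature before it reaches the scaler
X_demo = np.log1p(revenue_demo).reshape(-1, 1)

log_target_model = TransformedTargetRegressor(
    regressor=Pipeline([
        ('scaler', StandardScaler()),
        ('model', LinearRegression()),
    ]),
    func=np.log1p,          # the regressor is fit on log(1 + y)
    inverse_func=np.expm1,  # predictions are mapped back to original units
)

kfold_demo = KFold(n_splits=5, shuffle=True, random_state=42)
neg_mse = cross_val_score(log_target_model, X_demo, y_demo, cv=kfold_demo,
                          scoring='neg_mean_squared_error')

# RMSE is still measured in the original units because of inverse_func
rmse_log_target = np.sqrt(-neg_mse)
print(f"Mean RMSE with log-scaled feature and target: {rmse_log_target.mean():.4f}")
```

Because the inverse transform is applied before scoring, the RMSE stays comparable with the baseline numbers above, so the same cross-validation setup can be reused to judge whether the transform helps on the real data.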


In [5]:
# For testing: list the column names
print(train.columns.tolist())

['revenue', 'overall_score', 'environmental_score', 'social_score', 'governance_score', 'region_code_NAM', 'region_code_WEU', 'sect_C_pct', 'sect_G_pct', 'sect_J_pct', 'env_score_adjustment', 'sdg_id_3', 'sdg_id_9', 'revenue_log', ...

## Added - Kahnishga

In [6]:
# --- Reconstruct region from one-hot columns ---
train["region"] = (
    train[["region_code_NAM", "region_code_WEU"]]
    .idxmax(axis=1)
    .str.replace("region_code_", "")
)

# --- Group by region ---
region_scores = train.groupby("region")["overall_score"].mean().sort_values()

# --- ASCII Plot ---
max_val = region_scores.max()
scale = 40 / max_val   # bar width scale

print("Average Overall Score by Region\n")

for region_name, value in region_scores.items():
    bar = "█" * int(value * scale)
    print(f"{region_name:<20} {bar}  {value:.2f}")


Average Overall Score by Region

WEU                  ███████████████████████████████████████  2.97
NAM                  ████████████████████████████████████████  3.02
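
One caveat with rebuilding a label through `idxmax`: it returns the first column on ties, so a row where both flags are 0 would be labeled `NAM`. Whether such rows exist in `train` is not shown here, so the sketch below uses a small made-up frame (`demo`) and a hypothetical placeholder label `"OTHER"` to illustrate one way to guard against that case.

```python
import pandas as pd

# Made-up rows: the last one belongs to neither region
demo = pd.DataFrame({
    "region_code_NAM": [1, 0, 0],
    "region_code_WEU": [0, 1, 0],
})
region_cols = ["region_code_NAM", "region_code_WEU"]

# Same reconstruction as above: an all-zero row falls back to the first column
naive = (
    demo[region_cols]
    .idxmax(axis=1)
    .str.replace("region_code_", "", regex=False)
)

# Keep the label only where at least one flag is set ("OTHER" is a hypothetical name)
has_region = demo[region_cols].sum(axis=1) > 0
safe = naive.where(has_region, "OTHER")

print(naive.tolist())  # ['NAM', 'WEU', 'NAM']
print(safe.tolist())   # ['NAM', 'WEU', 'OTHER']
```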


In [7]:
# organize data first then run this cell

cols = ["revenue", "environmental_score", "social_score", "governance_score", "overall_score"]
corr = train[cols].corr()

print("Correlation Heatmap\n")
for row in corr.index:
    line = ""
    for val in corr.loc[row]:
        shade = int((val + 1) / 2 * 10)  # map -1..1 to 0..10
        block = "░▒▓█"[min(shade // 3, 3)]
        line += block * 3
    print(f"{row:<20} {line}")


Correlation Heatmap

revenue              ███▒▒▒▒▒▒▒▒▒▒▒▒
environmental_score  ▒▒▒███▒▒▒▒▒▒███
social_score         ▒▒▒▒▒▒███▒▒▒▓▓▓
governance_score     ▒▒▒▒▒▒▒▒▒███▓▓▓
overall_score        ▒▒▒███▓▓▓▓▓▓███
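
To read the heatmap it helps to know which correlation ranges share a shade. The sketch below isolates the mapping used in the cell above into a helper (the name `corr_to_block` is introduced here for illustration) and prints it for a few sample values. With this rule, `█` is reserved for correlations of about 0.8 and above, and values from roughly -0.4 up to 0.2 all render as `▒`.

```python
def corr_to_block(val):
    """Map a correlation in [-1, 1] to one of four shade characters."""
    shade = int((val + 1) / 2 * 10)    # map -1..1 to 0..10
    return "░▒▓█"[min(shade // 3, 3)]  # bucket 0..10 into four shades


# Print a small legend for representative correlation values
for example in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(f"{example:+.1f} -> {corr_to_block(example)}")
```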


In [8]:
def ascii_hist(series, bins=10):
    counts, bin_edges = np.histogram(series, bins=bins)
    max_count = counts.max()
    scale = 40 / max_count
    
    for i in range(bins):
        bar = "█" * int(counts[i] * scale)
        print(f"{bin_edges[i]:5.1f} – {bin_edges[i+1]:5.1f} | {bar} ({counts[i]})")

print("\nEnvironmental Score Distribution\n")
ascii_hist(train["environmental_score"])

## print("\nSocial Score Distribution\n")
## ascii_hist(train["social_score"])

print("\nGovernance Score Distribution\n")
ascii_hist(train["governance_score"])



Environmental Score Distribution

  1.0 –   1.4 | ██ (6)
  1.4 –   1.8 |  (2)
  1.8 –   2.2 | ██ (7)
  2.2 –   2.6 | ██████ (20)
  2.6 –   3.0 | █████████ (28)
  3.0 –   3.4 | ████████████████████████████████████████ (120)
  3.4 –   3.8 | ████████████████████████████████ (98)
  3.8 –   4.2 | ██████████████████████████████████████ (115)
  4.2 –   4.5 | ████ (13)
  4.5 –   4.9 | ██████ (20)

Governance Score Distribution

  1.1 –   1.4 | ████ (8)
  1.4 –   1.7 | █████████████████████████ (48)
  1.7 –   2.0 | ███████████████████████████ (51)
  2.0 –   2.2 | ████████████████████████████████████████ (75)
  2.2 –   2.5 | ████████████████████████████████████ (68)
  2.5 –   2.8 | ███████████████████████████████ (59)
  2.8 –   3.1 | ███████████████████████████████████████ (74)
  3.1 –   3.3 | ███████████████████ (36)
  3.3 –   3.6 | ███ (7)
  3.6 –   3.9 | █ (3)


## Final model inference

In [30]:
# Train final models on all training data
s1_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
    ])
s1_pipeline.fit(x_train,y_scope1)

s2_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
    ])
s2_pipeline.fit(x_train,y_scope2)

# Predict
s1_predictions = s1_pipeline.predict(x_test)
s2_predictions = s2_pipeline.predict(x_test)

Model 2 using RandomForestRegressor

In [None]:
# Final pattern-based models for Scope 1 and Scope 2 using RandomForestRegressor

from sklearn.ensemble import RandomForestRegressor

# Scope 1 Pattern Model 
s1_pattern_model = RandomForestRegressor(
    n_estimators=600,
    max_depth=12,
    min_samples_split=4,
    min_samples_leaf=2,
    random_state=42,
    n_jobs=-1
)
s1_pattern_model.fit(x_train, y_scope1)

# Scope 2 Pattern Model 
s2_pattern_model = RandomForestRegressor(
    n_estimators=600,
    max_depth=12,
    min_samples_split=4,
    min_samples_leaf=2,
    random_state=42,
    n_jobs=-1
)
s2_pattern_model.fit(x_train, y_scope2)

# Predict pattern-based outputs 
s1_pattern_predictions = s1_pattern_model.predict(x_test)
s2_pattern_predictions = s2_pattern_model.predict(x_test)

print("Pattern-based Scope 1 predictions (first 10):")
print(s1_pattern_predictions[:10])

print("\nPattern-based Scope 2 predictions (first 10):")
print(s2_pattern_predictions[:10])


Pattern-based Scope 1 predictions (first 10):
[ 90927.7523139   13996.62511465  80984.22591555 130684.74460901
  18170.16653097  39670.67142552  28537.82829084  27515.95634325
  49770.963677    57371.08368473]

Pattern-based Scope 2 predictions (first 10):
[ 78741.89293949  14719.50217852  56006.50714538 158614.95178889
  26107.97681273  13528.65965054  23985.4018331   22427.36801157
 168245.61276249  54848.40315661]
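
The random forest above is fit on all training data without a validation score, so it cannot be compared with the linear baseline directly. A minimal sketch of such a comparison, assuming the same 5-fold setup and RMSE metric, is shown below. It runs on synthetic `make_regression` data and uses fewer trees (`n_estimators=50`) to stay quick; both choices are assumptions for illustration, and the resulting numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for x_train / y_scope1 (not the notebook's data)
X_demo, y_demo = make_regression(n_samples=150, n_features=6, noise=20.0, random_state=0)
kfold_demo = KFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "linear": Pipeline([
        ('scaler', StandardScaler()),
        ('model', LinearRegression()),
    ]),
    "random_forest": RandomForestRegressor(
        n_estimators=50,  # reduced from 600 to keep the sketch fast
        max_depth=12,
        min_samples_split=4,
        min_samples_leaf=2,
        random_state=42,
    ),
}

# Score every candidate on identical folds so the RMSE values are comparable
mean_rmse = {}
for name, model in candidates.items():
    neg_mse = cross_val_score(model, X_demo, y_demo, cv=kfold_demo,
                              scoring='neg_mean_squared_error')
    mean_rmse[name] = float(np.sqrt(-neg_mse).mean())
    print(f"{name}: mean RMSE = {mean_rmse[name]:.4f}")
```

Reusing one `KFold` object with a fixed `random_state` keeps the splits identical across models, so differences in RMSE reflect the models and not the partitioning.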


Creating the submission for the RandomForestRegressor model

In [35]:
submission = pd.DataFrame({
    "entity_id": entity_id_test,
    "s1_predictions": s1_pattern_predictions,
    "s2_predictions": s2_pattern_predictions
})
submission.to_csv("submission_pattern.csv", index=False)


## Create your submission file

In [34]:
submission = pd.DataFrame({
    'entity_id': entity_id_test,
    's1_predictions': s1_predictions,
    's2_predictions': s2_predictions
})

# Exclude the row index so the file contains only the three submission columns
submission.to_csv('submission.csv', index=False)

submission

Unnamed: 0,entity_id,s1_predictions,s2_predictions
0,1076,37056.263192,40174.828246
1,2067,58002.419505,35322.087395
2,910,76265.572092,106384.876228
3,4082,113703.913695,102490.485434
4,4102,26894.692547,12794.286425
5,1535,31112.238727,32314.365748
6,4213,40518.88576,13657.458861
7,107,44412.657023,52360.028431
8,2301,30945.341291,54399.12533
9,1463,45546.407009,34748.201501
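
`to_csv` writes the DataFrame index as an extra unnamed column unless `index=False` is passed, and reading the file back then shows an `Unnamed: 0` column. A quick way to check a submission file is to read it back and inspect its columns. The sketch below does this in a temporary directory with two made-up rows (`demo_submission` is illustrative, not the real predictions).

```python
import os
import tempfile

import pandas as pd

# Made-up rows with the same column names as the submission file
demo_submission = pd.DataFrame({
    "entity_id": [101, 202],
    "s1_predictions": [1000.0, 2000.0],
    "s2_predictions": [1500.0, 2500.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index_path = os.path.join(tmp_dir, "with_index.csv")
    without_index_path = os.path.join(tmp_dir, "without_index.csv")

    # Default behavior: the index is written as a leading column with no header
    demo_submission.to_csv(with_index_path)
    # index=False keeps only the DataFrame's own columns
    demo_submission.to_csv(without_index_path, index=False)

    cols_with_index = pd.read_csv(with_index_path).columns.tolist()
    cols_without_index = pd.read_csv(without_index_path).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'entity_id', 's1_predictions', 's2_predictions']
print(cols_without_index)  # ['entity_id', 's1_predictions', 's2_predictions']
```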
