<div style="background-color: #FFEB3B; padding: 20px; border-radius: 10px; box-shadow: 2px 2px 10px #888888;">
  <h1 style="text-align: center; font-family: Arial, sans-serif; color: #333333;">🏭 Carbon Majors Emissions Data Analysis and Modeling 🌍</h1>
</div>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import shapiro
from sklearn.preprocessing import PowerTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

## 🔍 Initial Data Exploration

In [None]:
df = pd.read_csv("/kaggle/input/carbon-majors-emissions-data/emissions_high_granularity.csv")

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.shape

## 📊 Data Preprocessing

In [None]:
df.isnull().sum()

In [None]:
# Count exact duplicate rows (this cell only reports them; nothing is dropped)
duplicates = df.duplicated().sum()
print(f'Duplicates: {duplicates}')

In [None]:
# Drop rows where any numeric column falls outside the 1.5 * IQR fences
numeric_columns = df.select_dtypes(include=[np.number]).columns
Q1 = df[numeric_columns].quantile(0.25)
Q3 = df[numeric_columns].quantile(0.75)
IQR = Q3 - Q1
outside_fences = (df[numeric_columns] < (Q1 - 1.5 * IQR)) | (df[numeric_columns] > (Q3 + 1.5 * IQR))
df = df[~outside_fences.any(axis=1)]

In [None]:
# Check normality with the Shapiro-Wilk test
stat, p = shapiro(df['production_value'])
print(f'Statistics={stat:.3f}, p={p:.3f}')

# p < 0.05 rejects normality, so apply a power transform (Yeo-Johnson by default)
if p < 0.05:
    transformer = PowerTransformer()
    df['production_value'] = transformer.fit_transform(df[['production_value']])

## 📈 Exploratory Data Analysis (EDA)

In [None]:
# Line Plot for Trends over Years
plt.figure(figsize=(14, 7))
plt.plot(df['year'], df['total_emissions_MtCO2e'], marker='o', linestyle='-')
plt.title('Total Emissions Over the Years')
plt.xlabel('Year')
plt.ylabel('Total Emissions (MtCO2e)')
plt.grid(True)
plt.show()


### Short Interpretation

The line plot shows a clear upward trend in total emissions (MtCO2e) from the early 1900s to 2020.

- **Pre-1940s**: Emissions were relatively low and sporadic.
- **1940s to 1980s**: There was a steady increase in emissions, indicating industrial growth.
- **1980s to 2020**: Emissions continued to rise significantly, with higher and more frequent spikes, reflecting rapid industrialization and economic growth globally.

Overall, the plot highlights a significant increase in emissions over the past century.
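
Because the dataset holds several rows per year (one per entity and commodity), a per-year total can make the trend easier to read than the raw rows. A minimal sketch on made-up numbers (the `toy` frame and its values are hypothetical; only the column names come from the dataset):

```python
import pandas as pd

# Hypothetical rows: several records share the same year.
toy = pd.DataFrame({
    "year": [1980, 1950, 2020, 1950, 1980, 2020],
    "total_emissions_MtCO2e": [5.0, 1.0, 20.0, 2.0, 7.0, 30.0],
})

# Sum emissions within each year; groupby returns the years in sorted order.
yearly_total = toy.groupby("year")["total_emissions_MtCO2e"].sum()
print(yearly_total)
```

Plotting `yearly_total` instead of the raw rows gives one point per year, in chronological order.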


In [None]:
# Scatter Plot for Production Value vs Total Emissions
plt.figure(figsize=(10, 6))
plt.scatter(df['production_value'], df['total_emissions_MtCO2e'], alpha=0.7, color='g')
plt.title('Production Value vs Total Emissions')
plt.xlabel('Production Value')
plt.ylabel('Total Emissions (MtCO2e)')
plt.show()

### Short Interpretation

The scatter plot illustrates the relationship between production value and total emissions (MtCO2e). Here are the key observations:

- **Positive Correlation**: There is a clear positive correlation between production value and total emissions. As production value increases, total emissions also tend to increase.
- **Curved Relationship**: The data points form a curved pattern, suggesting a non-linear relationship where emissions rise more sharply at higher production values.
- **Clusters of Data**: The plot shows distinct clusters of data points, indicating groups of entities with similar production values and emissions levels.

Overall, the plot highlights that higher production values are associated with significantly higher total emissions.
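
A curved pattern like the one noted above means a linear (Pearson) coefficient can understate how tightly two variables move together. A minimal sketch on synthetic data (the `x` and `y` series are made up for illustration, not taken from the dataset) compares Pearson with a rank-based Spearman coefficient:

```python
import numpy as np
import pandas as pd

# Synthetic relationship that is monotonic but strongly curved.
x = pd.Series(np.linspace(0.0, 5.0, 50))
y = np.exp(x)

# Pearson measures linear association.
pearson = x.corr(y)
# Spearman is Pearson computed on ranks, so it measures monotonic association.
spearman = x.rank().corr(y.rank())

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

When the two coefficients differ noticeably, the relationship is likely monotonic but not linear.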

In [None]:
# Pie Chart for Commodity Distribution
commodity_counts = df['commodity'].value_counts()
plt.figure(figsize=(10, 6))
plt.pie(commodity_counts, labels=commodity_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Commodity Distribution')
plt.axis('equal')
plt.show()


### Short Interpretation

The pie chart shows each commodity's share of the dataset's rows (record counts, not emissions). Here are the key observations:

- **Oil & NGL**: Represents the largest share at 30.2% of the total commodity distribution.
- **Natural Gas**: Accounts for 18.8%, making it the second largest commodity.
- **Metallurgical Coal**: Holds 11.0% of the distribution.
- **Bituminous Coal**: Comprises 10.7% of the commodities.
- **Lignite Coal**: Makes up 10.1% of the distribution.
- **Other Commodities**: Include Thermal Coal, Sub-Bituminous Coal, Anthracite Coal, and Cement, each with smaller shares ranging from 3.2% to 6.7%.

Overall, the chart highlights that Oil & NGL and Natural Gas are the predominant commodities in the distribution.
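
The percentages on the pie chart can also be printed as a table, which is easier to quote than reading labels off the figure. A minimal sketch with made-up labels (the `toy` series is hypothetical and its proportions are not the dataset's):

```python
import pandas as pd

# Hypothetical commodity labels: 5 + 3 + 2 rows.
toy = pd.Series(
    ["Oil & NGL"] * 5 + ["Natural Gas"] * 3 + ["Cement"] * 2,
    name="commodity",
)

# normalize=True returns fractions of the row count; scale to percent.
shares = toy.value_counts(normalize=True).mul(100).round(1)
print(shares)
```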

In [None]:
# Histograms for Emission Types in Subplots
emissions_columns = ['product_emissions_MtCO2', 'flaring_emissions_MtCO2', 'venting_emissions_MtCO2', 
                     'own_fuel_use_emissions_MtCO2', 'fugitive_methane_emissions_MtCO2e', 'total_operational_emissions_MtCO2e']

fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(18, 12))
axes = axes.flatten()

for i, col in enumerate(emissions_columns):
    axes[i].hist(df[col], bins=30, edgecolor='k', alpha=0.7)
    axes[i].set_title(f'Distribution of {col}')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Frequency')

plt.tight_layout()
plt.show()


### Short Interpretation

The subplots display the distribution of various emission types. Here are the key observations:

1. **Product Emissions**:
   - Most values are clustered between 0 and 20 MtCO2, with a long tail extending to higher values.

2. **Flaring Emissions**:
   - Predominantly low values, mostly concentrated around 0 MtCO2, with very few instances exceeding 0.1 MtCO2.

3. **Venting Emissions**:
   - Similar to flaring emissions, values are mostly close to 0 MtCO2, with few occurrences above 0.05 MtCO2.

4. **Own Fuel Use Emissions**:
   - Almost all values are close to 0 MtCO2, indicating minimal emissions from own fuel use.

5. **Fugitive Methane Emissions**:
   - The majority of values are below 2.5 MtCO2e, with some extending up to 15 MtCO2e.

6. **Total Operational Emissions**:
   - The distribution is skewed towards lower values, with most emissions below 2.5 MtCO2e and a long tail up to 15 MtCO2e.

Overall, the histograms reveal that emissions are generally low for most types, with a few high-emission outliers.
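
The long right tails described above can be quantified with the sample skewness, and a log transform is one common way to reduce them. A minimal sketch on synthetic values (the `values` series is randomly generated and only stands in for an emissions column):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed values: many small ones and a few large ones.
values = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=2000))

raw_skew = values.skew()
# log1p(x) = log(1 + x) is defined at 0, which suits near-zero emissions.
log_skew = np.log1p(values).skew()

print(f"skew (raw): {raw_skew:.2f}")
print(f"skew (log1p): {log_skew:.2f}")
```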

In [None]:
# Ensure only numeric columns are included for the correlation matrix
numeric_df = df.select_dtypes(include=[np.number])

# Correlation matrix heatmap
plt.figure(figsize=(12, 8))
correlation_matrix = numeric_df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix Heatmap')
plt.show()


### Short Interpretation 

The correlation matrix heatmap displays the relationships between different numeric variables in the dataset. Here are the key observations:

- **High Correlation**: 
  - `product_emissions_MtCO2` has a very high correlation (0.96) with both `fugitive_methane_emissions_MtCO2e` and `total_operational_emissions_MtCO2e`.
  - `own_fuel_use_emissions_MtCO2` and `venting_emissions_MtCO2` are also highly correlated (0.83).
  - `total_emissions_MtCO2e` is highly correlated with `total_operational_emissions_MtCO2e` (0.97), indicating these variables are closely related.

- **Moderate Correlation**:
  - `production_value` has moderate correlations with `own_fuel_use_emissions_MtCO2` (0.51) and `venting_emissions_MtCO2` (0.68).

- **Low/Negative Correlation**: 
  - There are low or negative correlations among other pairs, such as `flaring_emissions_MtCO2` with most other variables, indicating minimal or inverse relationships.

Overall, the heatmap highlights that certain types of emissions are strongly correlated, especially `product_emissions_MtCO2`, `fugitive_methane_emissions_MtCO2e`, and `total_operational_emissions_MtCO2e`.
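
Reading the strongest pairs off an annotated heatmap is error-prone, so it can help to list them in sorted order. A minimal sketch on synthetic columns (`a`, `b`, and `c` are made-up names and values, not dataset columns):

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                 # unrelated noise
})

corr = toy.corr()
# One entry per unordered column pair, so the diagonal and mirror pairs are skipped.
pairs = pd.Series({(u, v): corr.loc[u, v] for u, v in combinations(corr.columns, 2)})
# Sort by absolute value so strong negative correlations also rank high.
pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(pairs)
```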

In [None]:
# Bar Plot for Parent Entity Distribution
parent_entity_counts = df['parent_entity'].value_counts().head(10)  # Top 10 entities
plt.figure(figsize=(14, 7))
parent_entity_counts.plot(kind='bar')
plt.title('Top 10 Parent Entity Distribution')
plt.xlabel('Parent Entity')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()


### Short Interpretation

The bar plot displays the distribution of the top 10 parent entities based on frequency. Here are the key observations:

- **Chevron**: Has the highest frequency, meaning it appears in more rows than any other parent entity.
- **Poland, Occidental Petroleum, Westmoreland Mining, ConocoPhillips, BP, Singareni Collieries**: These entities have similar frequencies, showcasing their significant presence.
- **Former Soviet Union, CONSOL Energy, ExxonMobil**: These entities have relatively lower frequencies compared to the others but are still among the top 10.

Overall, Chevron stands out as the most frequent parent entity, followed by a mix of investor-owned companies, state-owned entities, and nation states.
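
Row frequency and emitted volume can rank entities differently, so it may be worth checking both. A minimal sketch with made-up entities `A` and `B` (hypothetical values; only the column names come from the dataset):

```python
import pandas as pd

# Hypothetical records: "A" appears often with small values, "B" once with a large one.
toy = pd.DataFrame({
    "parent_entity": ["A", "A", "A", "B"],
    "total_emissions_MtCO2e": [1.0, 1.0, 1.0, 50.0],
})

# Ranking by number of rows (what the bar plot above counts).
row_counts = toy["parent_entity"].value_counts()

# Ranking by summed emissions.
emission_totals = (
    toy.groupby("parent_entity")["total_emissions_MtCO2e"]
    .sum()
    .sort_values(ascending=False)
)

print(row_counts)
print(emission_totals)
```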


In [None]:
# Swarm Plot for Flaring Emissions by Commodity
plt.figure(figsize=(14, 7))
sns.swarmplot(x='commodity', y='flaring_emissions_MtCO2', data=df)
plt.title('Flaring Emissions by Commodity')
plt.xlabel('Commodity')
plt.ylabel('Flaring Emissions (MtCO2)')
plt.xticks(rotation=45)
plt.show()


### Short Interpretation 

The swarm plot displays the distribution of flaring emissions (MtCO2) by different commodities. Here are the key observations:

- **Oil & NGL**: This commodity has the highest flaring emissions, with values ranging from 0 to 0.5 MtCO2. The data points are densely packed, indicating frequent flaring emissions in this category.
- **Natural Gas**: Shows flaring emissions close to 0 MtCO2, with very few instances of higher emissions.
- **Other Commodities**: Sub-Bituminous Coal, Metallurgical Coal, Bituminous Coal, Thermal Coal, Anthracite Coal, Cement, and Lignite Coal all have minimal to no flaring emissions, with values clustered around 0 MtCO2.

Overall, the plot highlights that Oil & NGL is the predominant source of flaring emissions, whereas other commodities contribute negligibly.

In [None]:
# Count Plot for Parent Type
plt.figure(figsize=(14, 7))
sns.countplot(y='parent_type', data=df, order=df['parent_type'].value_counts().index)
plt.title('Count Plot for Parent Type')
plt.xlabel('Count')
plt.ylabel('Parent Type')
plt.show()


### Short Interpretation 

The count plot displays the distribution of different parent types in the dataset. Here are the key observations:

- **Investor-owned Company**: This is the most common parent type, with a significantly higher count compared to other types.
- **State-owned Entity**: The second most common parent type, though with considerably fewer instances than investor-owned companies.
- **Nation State**: The least common parent type among the three, with the lowest count.

Overall, the plot highlights that investor-owned companies dominate the dataset, followed by state-owned entities and nation states.

In [None]:
# Facet Grid for Emissions by Year and Commodity
g = sns.FacetGrid(df, col='commodity', col_wrap=4, height=4)
g.map(sns.boxplot, 'year', 'total_emissions_MtCO2e')
g.add_legend()
plt.show()


### Short Interpretation 

The facet grid displays the distribution of total emissions (MtCO2e) by year for each commodity. Here are the key observations:

- **Oil & NGL**: Shows relatively consistent emissions over the years, with some variability but generally lower compared to coal commodities.
- **Natural Gas**: Exhibits low emissions with little variation over the years.
- **Sub-Bituminous Coal, Metallurgical Coal, Bituminous Coal, Thermal Coal, Anthracite Coal, Lignite Coal**: These coal commodities display significant variability in emissions, with noticeable increases in certain periods. The emissions for these commodities are generally higher and show distinct peaks.
- **Cement**: Has moderate emissions with some variability over the years.

Overall, the plot highlights that coal commodities have the highest and most variable emissions, while oil & NGL and natural gas have relatively lower and more consistent emissions.

In [None]:
# Strip Plot for Emissions by Commodity
plt.figure(figsize=(14, 7))
sns.stripplot(x='commodity', y='total_emissions_MtCO2e', data=df, jitter=True)
plt.title('Strip Plot for Emissions by Commodity')
plt.xlabel('Commodity')
plt.ylabel('Total Emissions (MtCO2e)')
plt.xticks(rotation=45)
plt.show()


### Short Interpretation

The strip plot displays the distribution of total emissions (MtCO2e) by different commodities. Here are the key observations:

- **Oil & NGL**: Exhibits a relatively lower range of emissions, mostly below 40 MtCO2e.
- **Natural Gas**: Shows even lower emissions, with values clustered around 0 to 20 MtCO2e.
- **Sub-Bituminous Coal, Metallurgical Coal, Bituminous Coal, Thermal Coal, Anthracite Coal, Lignite Coal**: These coal commodities have a wide range of emissions, often reaching up to 160 MtCO2e, indicating higher and more variable emissions compared to other commodities.
- **Cement**: Displays moderate emissions, with values spread across a smaller range compared to coal commodities.

Overall, the plot highlights that coal commodities are the major contributors to higher emissions, whereas oil & NGL and natural gas have relatively lower emissions.

In [None]:
# One-hot encode the categorical columns (drop_first removes one redundant level per column)
df = pd.get_dummies(df, columns=['parent_entity', 'parent_type', 'reporting_entity', 'commodity', 'production_unit', 'source'], drop_first=True)

## 🏋️‍♂️ Model Training

In [None]:
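# Train-Test Split
# Note: X keeps the component emission columns (e.g. total_operational_emissions_MtCO2e),
# which the heatmap shows are highly correlated with the target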
X = df.drop(columns=['year', 'total_emissions_MtCO2e'])
y = df['total_emissions_MtCO2e']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Model Training and Evaluation
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

## 📊 Model Evaluation

In [None]:
# Evaluation Metrics
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mse)

print(f'Mean Squared Error: {mse}')
print(f'Mean Absolute Error: {mae}')
print(f'R-squared: {r2}')
print(f'Root Mean Squared Error: {rmse}')

In [None]:
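# Plotting the results
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.7, color='b')
# Reference line y = x: perfect predictions fall exactly on it
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
plt.plot(lims, lims, color='black', linestyle='--')
plt.xlabel('Actual Emissions')
plt.ylabel('Predicted Emissions')
plt.title('Actual vs Predicted Emissions')
plt.show()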


### Short Interpretation

The plot compares the actual emissions with the predicted emissions. Here are the key observations:

- **Linear Relationship**: The points lie almost perfectly along the diagonal (predicted = actual), indicating a strong linear relationship between actual and predicted values.
- **High Accuracy**: The close alignment of points with the diagonal suggests that the model predicts emissions with high accuracy.
- **Minimal Deviation**: There is minimal deviation from the diagonal, highlighting that the predicted emissions closely match the actual emissions across the entire range.

Overall, the plot shows that the model predicts emissions on the held-out test set with high accuracy and minimal errors. This is partly expected: the features still include component emission columns such as `total_operational_emissions_MtCO2e`, which the heatmap shows is highly correlated (0.97) with the target.
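
One way to gauge how much of that accuracy comes from features that are themselves parts of the target is to refit without them. A minimal sketch on synthetic data, assuming a target that is an exact sum of two feature columns (an assumption made for illustration; `component_a`, `component_b`, `unrelated`, and `holdout_r2` are hypothetical names):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "component_a": rng.uniform(0, 10, n),  # hypothetical part of the target
    "component_b": rng.uniform(0, 10, n),  # hypothetical part of the target
    "unrelated": rng.uniform(0, 10, n),    # carries no information about the target
})
toy["target"] = toy["component_a"] + toy["component_b"]


def holdout_r2(feature_names):
    """Fit a random forest on the given columns and return R^2 on a 20% hold-out."""
    X_train, X_test, y_train, y_test = train_test_split(
        toy[feature_names], toy["target"], test_size=0.2, random_state=42
    )
    model = RandomForestRegressor(n_estimators=50, random_state=42)
    model.fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))


r2_with = holdout_r2(["component_a", "component_b", "unrelated"])
r2_without = holdout_r2(["unrelated"])
print(f"R^2 with components: {r2_with:.3f}")
print(f"R^2 without components: {r2_without:.3f}")
```

A large gap between the two scores indicates that the model is mostly reconstructing the target from its parts.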

In [None]:
# Residuals Plot
residuals = y_test - y_pred
plt.figure(figsize=(10, 6))
plt.scatter(y_pred, residuals, alpha=0.7, color='r')
plt.xlabel('Predicted Emissions')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Emissions')
plt.axhline(y=0, color='black', linestyle='--')
plt.show()


### Short Interpretation 

The plot displays the residuals (differences between actual and predicted values) against the predicted emissions. Here are the key observations:

- **Centered Residuals**: Most of the residuals are centered around 0, indicating that the model's predictions are generally accurate.
- **No Clear Pattern**: The residuals do not show any clear pattern or trend, suggesting that the model does not suffer from heteroscedasticity (changing variability).
- **Outliers**: There are some residuals that deviate significantly from 0, indicating the presence of outliers or instances where the model's predictions were less accurate.

Overall, the plot suggests that the model performs well, with residuals mostly centered around 0 and no evident pattern, though there are a few outliers.
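
A visual check for changing variability can be backed by a number: the rank correlation between the predictions and the absolute residuals. A value near 0 is consistent with constant spread, while a clearly positive value suggests the spread grows with the prediction. A minimal sketch on synthetic residuals (all values are randomly generated, not model output):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
pred = pd.Series(rng.uniform(1.0, 100.0, 1000))

# Synthetic residuals whose standard deviation is 10% of the prediction.
resid = pd.Series(rng.normal(loc=0.0, scale=0.1 * pred.to_numpy()))

# Spearman correlation (Pearson on ranks) between prediction and |residual|.
spread_trend = pred.rank().corr(resid.abs().rank())
print(f"Rank correlation of |residual| with prediction: {spread_trend:.3f}")
```

With the notebook's own variables, the same check would use `y_pred` and `residuals` in place of `pred` and `resid`.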
