
In [1]:
# Install the LightGBM library
!pip install lightgbm

# Import necessary libraries
import pandas as pd
import numpy as np
import lightgbm as lgb
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Set plot style
sns.set_style('whitegrid')



-----

## 2\. Data Loading

Load the training and test datasets. The CSV files must be in the same directory as your Jupyter Notebook.

In [None]:
# Load the datasets from your local directory
try:
    train_df = pd.read_csv('train.csv')
    test_df = pd.read_csv('test.csv')
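except FileNotFoundError:
    print("Files not found: make sure 'train.csv' and 'test.csv' are in the correct folder.")
    # Fall back to empty frames so the check below is skipped.
    # Note: this does not halt the notebook; later cells will still fail without data.
    train_df = pd.DataFrame()
    test_df = pd.DataFrame()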

# Display shapes to confirm they loaded correctly
if not train_df.empty:
    print(f"Train shape: {train_df.shape}")
    print(f"Test shape: {test_df.shape}")

-----

## 3\. Advanced Preprocessing & Feature Engineering

This is where we prepare the data for modeling: we log-transform the skewed target, drop columns that are almost entirely empty, and impute the remaining missing values.

### Step 3.1: Handle Skewed Target Variable

Your EDA showed the `emission` data is highly skewed. A log transform makes its distribution more symmetric, which usually helps the model fit the bulk of the data instead of being dominated by a few very large values. We use `np.log1p`, which computes `log(1 + x)` and therefore handles zero values.

In [None]:
if 'emission' in train_df.columns:
    # Visualize original vs. transformed distribution
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    sns.histplot(train_df['emission'], kde=True, ax=axes[0], bins=50)
    axes[0].set_title('Original Emission Distribution')

    # Apply log transformation
    train_df['emission_log'] = np.log1p(train_df['emission'])

    sns.histplot(train_df['emission_log'], kde=True, ax=axes[1], bins=50)
    axes[1].set_title('Log-Transformed Emission Distribution')
    plt.show()

The transformed distribution should be much closer to a bell curve, which suits many regression models better than a long-tailed target.
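
To make the effect of `np.log1p` concrete, here is a minimal sketch on synthetic data. The lognormal sample is only an assumed stand-in for the real `emission` column; it illustrates that the transform accepts zeros, reduces skewness, and is inverted exactly by `np.expm1`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample standing in for `emission` (assumption, not real data)
rng = np.random.default_rng(0)
emission = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5_000))
emission.iloc[:10] = 0.0  # include exact zeros, which a plain log could not handle

emission_log = np.log1p(emission)

skew_before = emission.skew()
skew_after = emission_log.skew()
print(f"Skewness before: {skew_before:.2f}, after: {skew_after:.2f}")

# expm1 is the inverse of log1p, so the original scale is recoverable
roundtrip_ok = np.allclose(np.expm1(emission_log), emission)
print(f"Round trip recovers original values: {roundtrip_ok}")
```

A skewness near zero after the transform indicates a roughly symmetric distribution.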

### Step 3.2: Feature Selection and Cleaning

Your EDA revealed that many columns are almost entirely empty (99% missing). It's better to remove these than to try to fill them, so the code below drops every column with more than 90% missing values. For the rest, we will fill missing values using the **median**.

In [None]:
# Identify and drop the ID column and the original emission target
train_df = train_df.drop(columns=['ID', 'emission'])
test_ids = test_df['ID'] # Keep test IDs for submission
test_df = test_df.drop(columns=['ID'])

# Identify columns with a high percentage of missing values from the training data
missing_pct = train_df.isnull().sum() / len(train_df)
cols_to_drop = missing_pct[missing_pct > 0.90].index

train_df = train_df.drop(columns=cols_to_drop)
test_df = test_df.drop(columns=cols_to_drop)

print(f"Dropped {len(cols_to_drop)} columns with >90% missing values.")

# Select only numeric features for imputation and modeling
numeric_features = train_df.select_dtypes(include=np.number).columns.tolist()
numeric_features.remove('emission_log') # Remove the target

# Use SimpleImputer to fill remaining NaNs with the median
imputer = SimpleImputer(strategy='median')

# Fit on training data and transform both train and test data
train_df[numeric_features] = imputer.fit_transform(train_df[numeric_features])
test_df[numeric_features] = imputer.transform(test_df[numeric_features])

print("Filled remaining missing values using the median.")
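
The same drop-then-impute logic can be checked on a toy example. This is a sketch with hypothetical column names (`mostly_empty`, `sensor_a`), not the real dataset; the point is that both the drop decision and the median are learned from the training frame only and then reused on the test frame.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames with hypothetical column names
train = pd.DataFrame({
    "mostly_empty": [np.nan] * 19 + [1.0],      # 95% missing
    "sensor_a": [1.0, 2.0, np.nan, 4.0] * 5,    # 25% missing
})
test = pd.DataFrame({
    "mostly_empty": [np.nan, 2.0],
    "sensor_a": [np.nan, 100.0],
})

# Fraction of missing values per column, measured on the training data only
missing_pct = train.isnull().mean()
cols_to_drop = missing_pct[missing_pct > 0.90].index
train = train.drop(columns=cols_to_drop)
test = test.drop(columns=cols_to_drop)

# The median is fitted on train and reused for test, so no test statistics leak in
imputer = SimpleImputer(strategy="median")
train[["sensor_a"]] = imputer.fit_transform(train[["sensor_a"]])
test[["sensor_a"]] = imputer.transform(test[["sensor_a"]])

train_median = imputer.statistics_[0]
print(f"Dropped: {list(cols_to_drop)}, training median used for filling: {train_median}")
```

The missing test value is filled with the training median, even though the test column itself contains a very different value.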

-----

## 4\. Advanced Modeling with LightGBM and Cross-Validation

We'll now train the LightGBM model. **K-Fold Cross-Validation** gives a more reliable measure of performance than a single train/validation split: the data is divided into subsets ("folds"), and each fold serves once as the validation set while the model trains on the others.
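
The fold mechanics can be sketched on synthetic data before applying them to the real model. In this sketch `LinearRegression` is a lightweight stand-in for LightGBM and the data is made up; what it illustrates is that every row receives exactly one out-of-fold prediction, made by a model that never saw that row during training.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic regression data (assumption, for illustration only)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

folds = KFold(n_splits=5, shuffle=True, random_state=42)
oof_preds = np.zeros(len(y))
times_validated = np.zeros(len(y), dtype=int)

for train_idx, valid_idx in folds.split(X):
    # Train on four folds, predict the held-out fold
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    oof_preds[valid_idx] = model.predict(X[valid_idx])
    times_validated[valid_idx] += 1

oof_rmse = np.sqrt(np.mean((y - oof_preds) ** 2))
print(f"Out-of-fold RMSE: {oof_rmse:.4f}")
```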

In [None]:
# Define features (X) and target (y)
X = train_df[numeric_features]
y = train_df['emission_log']
X_test = test_df[numeric_features]

# Model parameters for LightGBM
# These are a good starting point; they can be tuned for even better performance
params = {
    'objective': 'regression_l1', # L1 loss is robust to outliers
    'metric': 'rmse',
    'n_estimators': 2000,
    'learning_rate': 0.01,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 1,
    'lambda_l1': 0.1,
    'lambda_l2': 0.1,
    'num_leaves': 31,
    'verbose': -1,
    'n_jobs': -1,
    'seed': 42,
    'boosting_type': 'gbdt',
}

# Set up K-Fold cross-validation
NFOLDS = 5
folds = KFold(n_splits=NFOLDS, shuffle=True, random_state=42)

oof_preds = np.zeros(X.shape[0]) # To store out-of-fold predictions
sub_preds = np.zeros(X_test.shape[0]) # To store test predictions

for n_fold, (train_idx, valid_idx) in enumerate(folds.split(X, y)):
    X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
    X_valid, y_valid = X.iloc[valid_idx], y.iloc[valid_idx]

    # Define the model
    model = lgb.LGBMRegressor(**params)

    # Train the model
    model.fit(X_train, y_train,
              eval_set=[(X_valid, y_valid)],
              eval_metric='rmse',
              callbacks=[lgb.early_stopping(100, verbose=False)])

    # Store predictions
    oof_preds[valid_idx] = model.predict(X_valid)
    sub_preds += model.predict(X_test) / folds.n_splits

    print(f"Fold {n_fold+1} validation RMSE: {np.sqrt(mean_squared_error(y_valid, oof_preds[valid_idx])):.4f}")

# Overall validation score
overall_rmse = np.sqrt(mean_squared_error(y, oof_preds))
print(f"\nOverall Cross-Validation RMSE: {overall_rmse:.4f}")
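
The loop above also accumulates `sub_preds`, the fold-averaged test predictions, and the preprocessing step kept `test_ids`. Because the model was trained on `emission_log`, `sub_preds` is on the log scale. A minimal sketch of turning it into a submission table follows; the helper name `build_submission`, the output column names, the file name, and the clipping at zero are all assumptions rather than part of the original pipeline.

```python
import numpy as np
import pandas as pd

def build_submission(ids, log_preds):
    """Hypothetical helper: map log-scale predictions back to the original scale."""
    preds = np.expm1(np.asarray(log_preds, dtype=float))
    preds = np.clip(preds, 0.0, None)  # assumes emissions cannot be negative
    return pd.DataFrame({"ID": list(ids), "emission": preds})

# Toy usage with made-up IDs and log-scale predictions
submission = build_submission(["ID_0", "ID_1"], [0.0, np.log1p(10.0)])
print(submission)

# In the notebook this would be called as build_submission(test_ids, sub_preds),
# followed by something like:
# submission.to_csv("submission.csv", index=False)
```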

-----

## 5\. Advanced Evaluation and Feature Importance

Now let's evaluate our results and see which features the model found most important.

### Step 5.1: Performance Metrics

Since we transformed our target variable, we must transform the predictions back to the original scale using `np.expm1` before calculating the final error metrics. This makes the results interpretable.

In [None]:
# Transform predictions and actuals back to the original scale
y_original = np.expm1(y)
oof_preds_original = np.expm1(oof_preds)

# Calculate final metrics
rmse = np.sqrt(mean_squared_error(y_original, oof_preds_original))
mae = mean_absolute_error(y_original, oof_preds_original)
r2 = r2_score(y_original, oof_preds_original)

print("Overall Metrics on Original Scale:")
print(f"  Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"  Mean Absolute Error (MAE): {mae:.2f}")
print(f"  R-squared (R2): {r2:.2f}")
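
It can help to see how the log-scale score from cross-validation relates to these original-scale metrics. The sketch below uses made-up numbers to show that RMSE computed on `log1p`-transformed values equals the root mean squared log error (RMSLE) of the original values, and that it can differ greatly from RMSE on the original scale.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Made-up actuals and predictions on the original scale
y_true = np.array([0.0, 5.0, 50.0, 500.0])
y_pred = np.array([1.0, 4.0, 60.0, 300.0])

# RMSE on the log1p scale, which is what the cross-validation loop reports
rmse_log_scale = np.sqrt(mean_squared_error(np.log1p(y_true), np.log1p(y_pred)))

# RMSLE computed directly from original-scale values
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))

# RMSE on the original scale is dominated by the largest absolute error
rmse_original = np.sqrt(mean_squared_error(y_true, y_pred))

print(f"RMSE (log scale): {rmse_log_scale:.4f}")
print(f"RMSLE:            {rmsle:.4f}")
print(f"RMSE (original):  {rmse_original:.4f}")
```

The log-scale score penalizes relative errors, while the original-scale RMSE penalizes absolute errors, so the two can rank models differently.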

### Step 5.2: Actual vs. Predicted Plot

This plot helps visualize the model's accuracy. A perfect model would have all points lying on the red diagonal line.

In [None]:
plt.figure(figsize=(8, 8))
plt.scatter(y_original, oof_preds_original, alpha=0.3)
plt.plot([y_original.min(), y_original.max()], [y_original.min(), y_original.max()], '--r', linewidth=2)
plt.xlabel('Actual Emission')
plt.ylabel('Predicted Emission')
plt.title('Actual vs. Predicted Emissions (Out-of-Fold)')
plt.show()

### Step 5.3: Feature Importance

Let's see which features the model relied on most. This is a key part of "understanding" climate change factors as per the project goal.

In [None]:
# Re-train a single model on all data to get feature importances
final_model = lgb.LGBMRegressor(**params)
final_model.fit(X, y)

# Get and plot feature importances
importances = pd.DataFrame({'feature': X.columns, 'importance': final_model.feature_importances_})
importances = importances.sort_values('importance', ascending=False).head(20)

plt.figure(figsize=(10, 8))
sns.barplot(x='importance', y='feature', data=importances)
plt.title('Top 20 Feature Importances')
plt.show()

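This chart ranks the sensor readings and location/time features by LightGBM's default split-based importance, which counts how often each feature was used to split a node. A high rank means the model relied on that feature for prediction; it does not by itself show that the feature causes emissions.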
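
As a cross-check on the ranking, permutation importance measures how much a validation score drops when one feature's values are shuffled. This is a different technique from the built-in importances plotted above. The sketch below runs it on synthetic data with a `RandomForestRegressor` as a stand-in model; both the data and the model choice are assumptions, and a fitted `LGBMRegressor` could be passed to the same function.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only the first column drives the target, the other two are noise
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Shuffle each feature on held-out data and record the drop in R^2
result = permutation_importance(model, X_valid, y_valid, n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]

for idx in ranking:
    print(f"feature {idx}: mean importance {result.importances_mean[idx]:.3f}")
```

Features that rank highly under both methods are more credible candidates for interpretation than features that rank highly under only one.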