# **Summer Analytics 2025 - First Hackathon**
*This notebook contains my solution for the NDVI-based Land Cover Classification Hackathon hosted by Consulting & Analytics Club IIT Guwahati.*

## Objective
Predict land cover type (Water, Forest, Grass, etc.) using NDVI time-series data from satellite imagery.

# 1.1 Load & Explore the Data

Here, we load the training and test CSVs and drop the stray index column (Unnamed: 0) if it is present. We then inspect the training set with .head(), .info() and .describe() to understand its structure and distribution.

In [43]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn import metrics

# Fix: Load files with proper names
train_df = pd.read_csv("/content/drive/MyDrive/Other Stuffs/Summer Analytics 2025 IIT-G/First Hackathon /hacktrain.csv")
test_df = pd.read_csv("/content/drive/MyDrive/Other Stuffs/Summer Analytics 2025 IIT-G/First Hackathon /hacktest.csv")

# Drop unnecessary column if exists
if "Unnamed: 0" in train_df.columns:
    train_df = train_df.drop(columns=["Unnamed: 0"])
if "Unnamed: 0" in test_df.columns:
    test_df = test_df.drop(columns=["Unnamed: 0"])

# Store Test IDs for final submission
test_ids = test_df["ID"]

# Basic sanity checks
print(train_df.head())
print(train_df.info())
print(train_df.describe())

   ID  class  20150720_N  20150602_N  20150517_N  20150501_N  20150415_N  \
0   1  water    637.5950     658.668   -1882.030    -1924.36     997.904   
1   2  water    634.2400     593.705   -1625.790    -1672.32     914.198   
2   4  water     58.0174   -1599.160         NaN    -1052.63         NaN   
3   5  water     72.5180         NaN     380.436    -1256.93     515.805   
4   8  water   1136.4400         NaN         NaN     1647.83    1935.800   

   20150330_N  20150314_N  20150226_N  ...  20140610_N  20140525_N  \
0   -1739.990     630.087         NaN  ...         NaN   -1043.160   
1    -692.386     707.626   -1670.590  ...         NaN    -933.934   
2   -1564.630         NaN     729.790  ...    -1025.88     368.622   
3   -1413.180    -802.942     683.254  ...    -1813.95     155.624   
4         NaN    2158.980         NaN  ...     1535.00    1959.430   

   20140509_N  20140423_N  20140407_N  20140322_N  20140218_N  20140202_N  \
0   -1942.490     267.138         NaN        

# 1.2 Preprocessing: Feature Selection & Imputation
In this section, we:

- Select the NDVI columns as features

- Separate the input X and output y

- Handle missing values in the NDVI columns using mean imputation

- Prepare the test dataset in the same way

- Encode the class labels numerically using LabelEncoder for model compatibility

In [44]:
# Identify feature columns (those ending with '_N')
ndvi_columns = [col for col in train_df.columns if col.endswith('_N')]

# X = features, y = label
X = train_df[ndvi_columns]
y = train_df['class']
X_test = test_df[ndvi_columns]

# Use mean imputation to handle NaNs
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
X_test_imputed = imputer.transform(X_test)

# Encode class labels (e.g., Water → 0)
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# 1.3 Model Training: Logistic Regression (Multiclass)
Here, we train a Logistic Regression model for multiclass classification using the imputed NDVI values.

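- We use the lbfgs solver with the multinomial setting, which suits multiple land cover classes.

- max_iter=1000 raises the iteration limit, but the warning below shows that lbfgs still reaches it on the unscaled NDVI values; standardizing the features is the usual remedy.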

In [24]:
# Train Logistic Regression (multiclass)
model = LogisticRegression(max_iter=1000, multi_class='multinomial', solver='lbfgs')
model.fit(X_imputed, y_encoded)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
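
The warning above suggests scaling the data. As a minimal sketch of that suggestion (not the code used for the submission), the example below fits the same kind of model on standardized features inside a pipeline and counts convergence warnings. The synthetic array `X_demo`, its labels and the missing-value rate are assumptions chosen only to mimic large-magnitude NDVI values with gaps.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the NDVI table (assumption): large values with gaps.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=0.0, scale=2000.0, size=(300, 10))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan  # roughly 10% missing

# Impute, standardize, then fit; the pipeline keeps the steps together.
scaled_lr = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)

# Record any warnings raised during fitting instead of printing them.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    scaled_lr.fit(X_demo, y_demo)

n_convergence_warnings = sum(
    issubclass(w.category, ConvergenceWarning) for w in caught
)
print("ConvergenceWarnings raised:", n_convergence_warnings)
```

Putting the scaler inside the pipeline also means the test set is transformed with statistics learned from the training set only.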


# 1.4 Predict on Test Data
The test features were already imputed in 1.2 with the imputer fitted on the training data, so here we predict on them, decode the labels back to class names, and build the submission DataFrame.

In [25]:
# Predict encoded labels on test set
test_preds_encoded = model.predict(X_test_imputed)

# Decode labels back to class names
test_preds = le.inverse_transform(test_preds_encoded)

# Final submission DataFrame
submission = pd.DataFrame({
    'ID': test_ids,
    'class': test_preds
})
submission.head()

   ID    class
0   1  orchard
1   2   forest
2   3  orchard
3   4   forest
4   5   forest


# 1.5 Save CSV for Submission
We save the decoded predictions to a .csv file for submission on Kaggle.

In [26]:
submission.to_csv("submission.csv", index=False)
print("Submission file created successfully!")

Submission file created successfully!


# 2.1 Improve Data Quality (Denoising & Imputation)
To handle noisy and missing NDVI values caused by cloud cover and other artifacts, we apply the following:
- Interpolation to fill missing values
- Savitzky-Golay smoothing filter to reduce noise and highlight vegetation patterns
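
To make the two steps concrete, here is a small sketch on a single toy NDVI row (the values are made up). It also shows one detail worth checking: row-wise interpolation with default settings leaves a gap at the start of a row unfilled, while `limit_direction="both"` fills it. That option is an assumption for illustration and is not used in the cells of this notebook.

```python
import numpy as np
import pandas as pd
from scipy.signal import savgol_filter

# One toy NDVI row (hypothetical values) with gaps at the start, middle and end.
row = pd.DataFrame(
    [[np.nan, 0.2, np.nan, 0.6, 0.8, 0.7, np.nan, 0.3, np.nan]]
)

# Default row-wise linear interpolation: interior and trailing gaps are filled,
# the leading gap is not.
filled_default = row.interpolate(axis=1)

# limit_direction="both" also fills the leading gap.
filled_both = row.interpolate(axis=1, limit_direction="both")

# Smooth the fully filled series with a 5-point, 2nd-order Savitzky-Golay filter.
smoothed = savgol_filter(
    filled_both.to_numpy(), window_length=5, polyorder=2, axis=1
)

print("NaNs left (default):", int(filled_default.isna().sum().sum()))
print("NaNs left (both):   ", int(filled_both.isna().sum().sum()))
print("NaNs after smoothing:", bool(np.isnan(smoothed).any()))
```

Any NaN that survives interpolation spreads to its neighbours inside the filter window, so it is worth confirming that the series is complete before smoothing.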


In [27]:
from scipy.signal import savgol_filter

ndvi_cols = [col for col in train_df.columns if col.endswith('_N')]

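# Interpolate missing values along each row (the time axis).
# Note: with the default settings, a gap at the start of a row stays NaN.
train_df[ndvi_cols] = train_df[ndvi_cols].interpolate(axis=1)

# Apply Savitzky-Golay filter to smooth each NDVI series
# (the index is passed so rows stay aligned with train_df on assignment)
train_df[ndvi_cols] = pd.DataFrame(
    savgol_filter(train_df[ndvi_cols], window_length=5, polyorder=2, axis=1),
    columns=ndvi_cols,
    index=train_df.index,
)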

# 2.2 Feature Engineering from NDVI Series
Now we extract statistical features from the NDVI sequence of each location:

- Mean, max, min, Std. Dev.
- Amplitude (Max - Min)
- Peak NDVI date (argmax)

In [28]:
X_features = pd.DataFrame()
X_features["mean_ndvi"] = train_df[ndvi_cols].mean(axis=1)
X_features["std_ndvi"] = train_df[ndvi_cols].std(axis=1)
X_features["min_ndvi"] = train_df[ndvi_cols].min(axis=1)
X_features["max_ndvi"] = train_df[ndvi_cols].max(axis=1)
X_features["amplitude"] = X_features["max_ndvi"] - X_features["min_ndvi"]
X_features["peak_time"] = train_df[ndvi_cols].idxmax(axis=1).apply(lambda x: ndvi_cols.index(x))
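
Note that `peak_time` above is a position in the column list, and the columns in the data preview run from the latest date to the earliest. A possible refinement, sketched here on a toy frame, is to parse the `YYYYMMDD` prefix of each column name and sort the columns chronologically before taking the argmax. The frame `demo` and its values are hypothetical, and `peak_month` is an extra feature that the notebook itself does not compute.

```python
import pandas as pd

# Column names follow the YYYYMMDD_N pattern seen in the training data;
# the values are invented for illustration.
cols = ["20150720_N", "20150602_N", "20150517_N", "20140202_N"]
demo = pd.DataFrame([[0.1, 0.9, 0.4, 0.2], [0.7, 0.3, 0.2, 0.5]], columns=cols)

# Parse the date prefix and order the columns from earliest to latest.
dates = pd.to_datetime([c.split("_")[0] for c in cols], format="%Y%m%d")
ordered_cols = [c for _, c in sorted(zip(dates, cols))]

# Column holding the maximum NDVI of each row.
peak_col = demo[ordered_cols].idxmax(axis=1)

# Position of the peak in chronological order.
peak_time = peak_col.map(ordered_cols.index)

# Calendar month of the peak (hypothetical additional feature).
peak_month = pd.to_datetime(
    peak_col.str.split("_").str[0], format="%Y%m%d"
).dt.month

print(peak_time.tolist(), peak_month.tolist())
```

With chronological ordering, a larger `peak_time` always means a later acquisition date, which makes the feature easier to interpret.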

# 2.3 Encode Labels and Split Dataset
We encode class labels (like “Forest”, “Water”, etc.) into numerical format and split the data for training and validation.

In [29]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

y = train_df["class"]
le = LabelEncoder()
y_encoded = le.fit_transform(y)

X_train, X_val, y_train, y_val = train_test_split(X_features, y_encoded, test_size=0.2, random_state=42)
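
If some land cover classes are rare, a plain random split can leave them under-represented in the validation set. The sketch below shows the `stratify` argument of `train_test_split` on toy labels. The 90/10 class balance is an assumption for illustration, since the notebook does not report the actual class counts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels (assumption): 90 rows of class 0, 10 rows of class 1.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the class proportions the same in both splits.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Class counts in the 20-row validation split.
val_counts = np.bincount(y_va)
print(val_counts)
```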

# 2.4 Train the Logistic Regression Model
We now train a multiclass logistic regression model using the engineered features.

In [30]:
model = LogisticRegression(max_iter=2000, multi_class='multinomial')
model.fit(X_train, y_train)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


# 2.5 Validation Data Evaluation

In [31]:
from sklearn.metrics import accuracy_score

val_preds = model.predict(X_val)
accuracy_score(y_val, val_preds)

0.81125
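
Accuracy alone can hide weak performance on minority classes. As a hedged sketch, the example below computes a confusion matrix and macro-averaged F1 next to accuracy. The label arrays `y_true` and `y_hat` are hypothetical and are not the notebook's validation results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical encoded labels and predictions, for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
y_hat = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 2])

# Share of correct predictions over all rows.
acc = accuracy_score(y_true, y_hat)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)

# Unweighted mean of the per-class F1 scores.
macro_f1 = f1_score(y_true, y_hat, average="macro")

print(acc)
print(cm)
print(round(macro_f1, 3))
```

In this toy case the majority class is predicted perfectly, so accuracy sits above macro F1; the same comparison on `y_val` and `val_preds` would show whether the engineered features struggle with particular classes.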

# 2.6 Prepare the Test Set and Predict & Save CSV for Submission
We apply the same preprocessing and feature engineering steps to the test data, then use the trained model to predict the land cover class.

In [32]:
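# Same interpolation and smoothing as applied to the training data
test_df[ndvi_cols] = test_df[ndvi_cols].interpolate(axis=1)
test_df[ndvi_cols] = pd.DataFrame(
    savgol_filter(test_df[ndvi_cols], window_length=5, polyorder=2, axis=1),
    columns=ndvi_cols,
    index=test_df.index,  # keep rows aligned with test_df on assignment
)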

X_test = pd.DataFrame()
X_test["mean_ndvi"] = test_df[ndvi_cols].mean(axis=1)
X_test["std_ndvi"] = test_df[ndvi_cols].std(axis=1)
X_test["min_ndvi"] = test_df[ndvi_cols].min(axis=1)
X_test["max_ndvi"] = test_df[ndvi_cols].max(axis=1)
X_test["amplitude"] = X_test["max_ndvi"] - X_test["min_ndvi"]
X_test["peak_time"] = test_df[ndvi_cols].idxmax(axis=1).apply(lambda x: ndvi_cols.index(x))

# Predict
preds = model.predict(X_test)
pred_labels = le.inverse_transform(preds)

# Export submission
submission = pd.DataFrame({"ID": test_df["ID"], "class": pred_labels})
submission.to_csv("submission2.csv", index=False)

# 3. Random Forest Classifier

In [33]:
# Importing Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier

# Load & Prepare Dataset
train_df = pd.read_csv("/content/drive/MyDrive/Other Stuffs/Summer Analytics 2025 IIT-G/First Hackathon /hacktrain.csv")
test_df = pd.read_csv("/content/drive/MyDrive/Other Stuffs/Summer Analytics 2025 IIT-G/First Hackathon /hacktest.csv")

# Drop unwanted column and separate test IDs
train_df = train_df.drop(columns=["Unnamed: 0"])
test_ids = test_df["ID"]
test_df = test_df.drop(columns=["Unnamed: 0"])

# Data Preprocessing
ndvi_columns = [col for col in train_df.columns if '_N' in col]
X = train_df[ndvi_columns]
y = train_df["class"]
X_test = test_df[ndvi_columns]

# Handle missing values
imputer = SimpleImputer(strategy="mean")
X = imputer.fit_transform(X)
X_test = imputer.transform(X_test)

# Label encoding
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# Random Forest Classifier
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X, y_encoded)

# Predict test data
y_pred_encoded = rf.predict(X_test)
y_pred = le.inverse_transform(y_pred_encoded)

# Generate submission file
submission = pd.DataFrame({
    "ID": test_ids,
    "class": y_pred
})

submission.to_csv("submission3.csv", index=False)
submission.head()

   ID   class
0   1  forest
1   2  forest
2   3  forest
3   4  forest
4   5  forest
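
The Random Forest above is trained on all rows, so the notebook has no local estimate of its accuracy before submission. One way to get such an estimate is k-fold cross-validation, sketched here on synthetic data from `make_classification`. The dataset, the fold count and the smaller `n_estimators` are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 3-class data standing in for the imputed NDVI features (assumption).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=12, n_informative=6, n_classes=3, random_state=42
)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=42)

# Five stratified folds: each fold keeps the overall class proportions.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(rf_demo, X_syn, y_syn, cv=cv, scoring="accuracy")

print("mean accuracy:", cv_scores.mean())
print("std across folds:", cv_scores.std())
```

Replacing `X_syn` and `y_syn` with the notebook's `X` and `y_encoded` would give a comparable estimate for the real data.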


# 4. SVM (Support Vector Machine) Based Land Cover Classification


In [34]:
## 1. Load Libraries and Dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline

# Load datasets
train_df = pd.read_csv("/content/drive/MyDrive/Other Stuffs/Summer Analytics 2025 IIT-G/First Hackathon /hacktrain.csv")
test_df = pd.read_csv("/content/drive/MyDrive/Other Stuffs/Summer Analytics 2025 IIT-G/First Hackathon /hacktest.csv")

# Drop unwanted column
train_df = train_df.drop(columns=["Unnamed: 0"])
test_ids = test_df["ID"]
test_df = test_df.drop(columns=["Unnamed: 0"])

## 2. Preprocess the Data
# Extract features (NDVI columns)
ndvi_columns = [col for col in train_df.columns if '_N' in col]

X = train_df[ndvi_columns]
y = train_df['class']
X_test = test_df[ndvi_columns]

# Handle missing values using mean imputation
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
X_test_imputed = imputer.transform(X_test)

# Standardize features (important for SVM)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)
X_test_scaled = scaler.transform(X_test_imputed)

# Encode target labels
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

## 3. Train the SVM Classifier
# Use RBF kernel (default), with regularization C=1
svm_model = SVC(kernel='rbf', C=1, probability=False, random_state=42)
svm_model.fit(X_scaled, y_encoded)

## 4. Predict and Generate Submission File
y_pred_encoded = svm_model.predict(X_test_scaled)
y_pred = label_encoder.inverse_transform(y_pred_encoded)

submission_df = pd.DataFrame({
    'ID': test_ids,
    'class': y_pred
})

submission_df.to_csv("submission4.csv", index=False)
print("SVM submission file saved as 'submission4.csv'")

SVM submission file saved as 'submission4.csv'
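
The cell above imports `Pipeline` but applies the imputer and scaler by hand. As a sketch of how the same steps could be wrapped in a pipeline, the example below chains imputation, scaling and the SVM, then cross-validates the whole chain so that each fold fits its preprocessing on its own training part. The synthetic data and the step names are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic features with a few missing values (assumption).
rng = np.random.default_rng(42)
X_raw = rng.normal(size=(200, 8))
y_lab = (X_raw[:, 0] > 0).astype(int)
X_raw[rng.random(X_raw.shape) < 0.05] = np.nan

# Imputation and scaling are refit inside every cross-validation fold.
svm_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("svc", SVC(kernel="rbf", C=1)),
])

fold_acc = cross_val_score(svm_pipe, X_raw, y_lab, cv=5)
print(fold_acc.mean())
```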


# 5. Neural Network (MLP Classifier)
Next, we implement a Multi-Layer Perceptron (MLP) classifier from sklearn.neural_network. MLPs suit small-to-medium tabular datasets like this one and can capture non-linear patterns in the NDVI time series that logistic regression cannot.


In [35]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.neural_network import MLPClassifier

# Load Datasets
train_df = pd.read_csv("/content/drive/MyDrive/Other Stuffs/Summer Analytics 2025 IIT-G/First Hackathon /hacktrain.csv")
test_df = pd.read_csv("/content/drive/MyDrive/Other Stuffs/Summer Analytics 2025 IIT-G/First Hackathon /hacktest.csv")

# Preprocess
train_df = train_df.drop(columns=["Unnamed: 0"])
test_df = test_df.drop(columns=["Unnamed: 0"])
test_ids = test_df["ID"]

ndvi_columns = [col for col in train_df.columns if '_N' in col]
X = train_df[ndvi_columns]
y = train_df['class']
X_test = test_df[ndvi_columns]

# Imputation
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
X_test_imputed = imputer.transform(X_test)

# Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)
X_test_scaled = scaler.transform(X_test_imputed)

# Encode class labels
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# Classifier
mlp = MLPClassifier(hidden_layer_sizes=(128, 64), activation='relu', solver='adam', max_iter=500, random_state=42)
mlp.fit(X_scaled, y_encoded)

# Predict on Test Data
preds = mlp.predict(X_test_scaled)
final_preds = le.inverse_transform(preds)

# Prepare Submission
submission = pd.DataFrame({
    'ID': test_ids,
    'class': final_preds
})

submission.to_csv("submission5.csv", index=False)
print("Submission file 'submission5.csv' created successfully.")

Submission file 'submission5.csv' created successfully.


# 6. Gradient Boosting with XGBoost
In this section, we implement a Gradient Boosting model using **XGBoost**, a powerful ensemble technique effective on structured data.

**Steps:**
- Preprocessed NDVI features with imputation and scaling
- Encoded target labels for classification
- Trained XGBoost classifier
- Generated predictions for submission

Because boosted trees can capture non-linear relationships that linear models miss, this approach often improves performance over logistic regression.

## 6.1 Install and Import Required Libraries

In [36]:
# If not already installed
!pip install xgboost

import xgboost as xgb
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np



## 6.2 Load and Prepare the Data

In [37]:
# Load data
train_df = pd.read_csv('/content/drive/MyDrive/Other Stuffs/Summer Analytics 2025 IIT-G/First Hackathon /hacktrain.csv')
test_df = pd.read_csv('/content/drive/MyDrive/Other Stuffs/Summer Analytics 2025 IIT-G/First Hackathon /hacktest.csv')

# Drop unwanted columns
train_df.drop(columns=['Unnamed: 0'], inplace=True)
test_df.drop(columns=['Unnamed: 0'], inplace=True)

# Store test IDs
test_ids = test_df['ID']

## 6.3 Preprocess the Data


In [38]:
# Extract features (NDVI columns end with '_N')
ndvi_cols = [col for col in train_df.columns if '_N' in col]

# Prepare X and y
X = train_df[ndvi_cols]
y = train_df['class']
X_test = test_df[ndvi_cols]

# Impute missing values
imputer = SimpleImputer(strategy='mean')
X = imputer.fit_transform(X)
X_test = imputer.transform(X_test)

# Scale the features
scaler = StandardScaler()
X = scaler.fit_transform(X)
X_test = scaler.transform(X_test)

# Encode target labels
le = LabelEncoder()
y_encoded = le.fit_transform(y)

## 6.4 Train the XGBoost Model

In [39]:
# Initialize and train the model
model = xgb.XGBClassifier(use_label_encoder=False, eval_metric='mlogloss', random_state=42)
model.fit(X, y_encoded)

Parameters: { "use_label_encoder" } are not used.



## 6.5 Make Predictions & CSV Submission

In [40]:
# Predict and convert back to original labels
preds_encoded = model.predict(X_test)
preds = le.inverse_transform(preds_encoded)

# Prepare submission DataFrame
submission = pd.DataFrame({'ID': test_ids, 'class': preds})
submission.to_csv('submission6.csv', index=False)
submission.head()

   ID    class
0   1     farm
1   2   forest
2   3   forest
3   4     farm
4   5  orchard


# 7. Soft Voting Classifier
This section combines predictions from several of the models used so far:
- Logistic Regression
- Random Forest Classifier
- Support Vector Machine (SVM)
- MLP Classifier (Neural Network)

All of these will be part of an ensemble using soft voting, which averages the predicted probabilities from all classifiers and selects the class with the highest average probability.
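
The averaging step can be illustrated with plain NumPy. In the sketch below, three hypothetical classifiers output probabilities for one sample over the classes farm, forest and orchard; the numbers are invented to show a case where soft voting and majority (hard) voting disagree.

```python
import numpy as np

classes = np.array(["farm", "forest", "orchard"])

# Hypothetical predicted probabilities for one sample, one row per classifier.
proba = np.array([
    [0.55, 0.45, 0.00],
    [0.55, 0.45, 0.00],
    [0.05, 0.90, 0.05],
])

# Soft voting: average the probabilities, then take the most probable class.
mean_proba = proba.mean(axis=0)
soft_choice = classes[mean_proba.argmax()]

# Hard voting: each classifier casts one vote for its own top class.
hard_votes = np.bincount(proba.argmax(axis=1), minlength=len(classes))
hard_choice = classes[hard_votes.argmax()]

print("soft:", soft_choice)
print("hard:", hard_choice)
```

Two classifiers narrowly prefer farm, but the third is very confident about forest, so the averaged probabilities favour forest while the majority vote picks farm.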

In [41]:
#Imports
import pandas as pd
import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

#Load the Data
train_df = pd.read_csv("/content/drive/MyDrive/Other Stuffs/Summer Analytics 2025 IIT-G/First Hackathon /hacktrain.csv")
test_df = pd.read_csv("/content/drive/MyDrive/Other Stuffs/Summer Analytics 2025 IIT-G/First Hackathon /hacktest.csv")

#Drop Unnamed column if exists
train_df.drop(columns=[col for col in train_df.columns if 'Unnamed' in col], inplace=True)
test_df.drop(columns=[col for col in test_df.columns if 'Unnamed' in col], inplace=True)

#Save test IDs for final submission
test_ids = test_df["ID"]

#Features and Labels
ndvi_cols = [col for col in train_df.columns if '_N' in col]

X = train_df[ndvi_cols]
y = train_df['class']
X_test = test_df[ndvi_cols]

#Missing value handling
imputer = SimpleImputer(strategy='mean')
X = imputer.fit_transform(X)
X_test = imputer.transform(X_test)

#Label Encoding
le = LabelEncoder()
y_encoded = le.fit_transform(y)

#Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_test_scaled = scaler.transform(X_test)

#Define Base Models
lr = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000, random_state=42)
rf = RandomForestClassifier(n_estimators=200, random_state=42)
svm = SVC(probability=True, kernel='rbf', C=1.0, gamma='scale', random_state=42)
mlp = MLPClassifier(hidden_layer_sizes=(128, 64), activation='relu', solver='adam', max_iter=300, random_state=42)

#Voting Classifier (Soft Voting)
voting_clf = VotingClassifier(
    estimators=[('lr', lr), ('rf', rf), ('svm', svm), ('mlp', mlp)],
    voting='soft'
)

#Fit model
voting_clf.fit(X_scaled, y_encoded)

#Predict
y_pred = voting_clf.predict(X_test_scaled)
y_pred_labels = le.inverse_transform(y_pred)

#Submission File
submission = pd.DataFrame({
    'ID': test_ids,
    'class': y_pred_labels
})
submission.to_csv('submission7*.csv', index=False)



# 8. Stacking Classifier
In this section, we implement a **Stacking Ensemble Classifier** to improve prediction accuracy by combining multiple models:

- **Base Models**: Logistic Regression, Random Forest, and SVM
- **Meta Model**: Logistic Regression

Stacking helps capture diverse patterns from each model and often generalizes better than a single model alone.
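
To show what the meta model actually learns from, the sketch below builds the stacked features by hand: each base model produces out-of-fold class probabilities, and these are concatenated into the input of the meta model. This is a simplified illustration of the idea on synthetic data, not a replacement for `StackingClassifier`. The dataset and the smaller `n_estimators` are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the scaled NDVI features (assumption).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

base = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=50, random_state=0),
    SVC(probability=True, random_state=0),
]

# Out-of-fold probabilities: every row is predicted by a model fitted
# on the other four folds, so the meta model never sees leaked labels.
oof = [
    cross_val_predict(m, X_syn, y_syn, cv=5, method="predict_proba")
    for m in base
]

# 3 base models x 3 classes = 9 meta-features per row.
meta_X = np.hstack(oof)

# The meta model learns how to weigh the base models' probabilities.
meta_model = LogisticRegression(max_iter=1000).fit(meta_X, y_syn)
print(meta_X.shape)
```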

In [42]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Load datasets
train_df = pd.read_csv("/content/drive/MyDrive/Other Stuffs/Summer Analytics 2025 IIT-G/First Hackathon /hacktrain.csv")
test_df = pd.read_csv("/content/drive/MyDrive/Other Stuffs/Summer Analytics 2025 IIT-G/First Hackathon /hacktest.csv")

# Drop unwanted column
train_df = train_df.drop(columns=["Unnamed: 0"])
test_ids = test_df["ID"]
test_df = test_df.drop(columns=["Unnamed: 0"])

# Extract features
ndvi_cols = [col for col in train_df.columns if '_N' in col]
X = train_df[ndvi_cols]
y = train_df["class"]
X_test = test_df[ndvi_cols]

# Handle missing values
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
X_test_imputed = imputer.transform(X_test)

# Feature scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)
X_test_scaled = scaler.transform(X_test_imputed)

# Encode labels
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# Define base models
base_models = [
    ('lr', LogisticRegression(max_iter=1000, solver='lbfgs', multi_class='multinomial')),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('svm', SVC(probability=True, kernel='rbf', C=1, gamma='scale'))
]

# Define meta model
meta_model = LogisticRegression(max_iter=1000)

# Build the Stacking Classifier
stacking_clf = StackingClassifier(
    estimators=base_models,
    final_estimator=meta_model,
    cv=5
)

# Train
stacking_clf.fit(X_scaled, y_encoded)

# Predict
predictions_encoded = stacking_clf.predict(X_test_scaled)
predictions = le.inverse_transform(predictions_encoded)

# Create submission
submission = pd.DataFrame({
    "ID": test_ids,
    "class": predictions
})

# Save
submission.to_csv("submission8*.csv", index=False)
print("Submission file 'submission8*.csv' generated successfully.")



Submission file 'submission8*.csv' generated successfully.
