# Hackathon 1


Here is my solution code for the first hackathon problem of the Summer Analytics 2025 course.

# The Problem Statement


Hackathon Problem Statement: NDVI-based Land Cover Classification

## Key Concepts

1. NDVI (Normalized Difference Vegetation Index)
Measures vegetation health using satellite data:

$$\mathrm{NDVI} = \frac{\mathrm{NIR} - \mathrm{Red}}{\mathrm{NIR} + \mathrm{Red}}$$

where NIR is the near-infrared reflectance and Red is the red reflectance.
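
NDVI is conventionally computed as (NIR - Red) / (NIR + Red), which keeps the value between -1 and 1 for non-negative reflectances. The sketch below is only an illustration of that ratio: the hackathon dataset already ships precomputed NDVI values, and the function name `compute_ndvi` and the reflectance numbers are made up for this example.

```python
import numpy as np

def compute_ndvi(nir, red):
    """Return NDVI = (NIR - Red) / (NIR + Red), element-wise.

    Entries where NIR + Red == 0 are returned as NaN instead of raising.
    """
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    denom = nir + red
    out = np.full(denom.shape, np.nan)
    # Divide only where the denominator is non-zero; other entries stay NaN
    np.divide(nir - red, denom, out=out, where=denom != 0)
    return out

# Illustrative reflectance values (not taken from the dataset)
nir = np.array([0.50, 0.30, 0.05, 0.0])
red = np.array([0.10, 0.30, 0.20, 0.0])
ndvi = compute_ndvi(nir, red)
print(ndvi)
```

Dense vegetation reflects strongly in the near-infrared and absorbs red light, so it tends toward high positive values, while water usually yields values near or below zero.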


2. Data Challenges
Noise: The main challenge with the dataset is that both the imagery and the crowdsourced data contain noise (due to cloud cover in the images and inaccurate labeling/digitizing of polygons).

Missing Data: Certain NDVI values are missing because of cloud cover obstructing the satellite view.

Temporal Variations: NDVI values vary seasonally, requiring careful feature engineering to extract meaningful trends.
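
One possible way to address the missing-data and noise challenges is sketched below. It is an assumption-laden alternative and not the approach used in the solution code further down, which relies on median imputation. It records the share of missing readings per row before anything is filled, interpolates gaps along the time axis, and then applies a rolling median to damp isolated spikes. The helper names, the window size, and the toy values are hypothetical, and the columns are assumed to be in chronological order.

```python
import numpy as np
import pandas as pd

def missing_fraction(ndvi):
    """Share of missing NDVI readings per row, measured before any filling."""
    return ndvi.isna().mean(axis=1)

def fill_and_smooth(ndvi, window=3):
    """Interpolate gaps along time, then apply a centered rolling median."""
    by_time = ndvi.T  # rows are dates, columns are samples
    filled = by_time.interpolate(method='linear', limit_direction='both')
    smoothed = filled.rolling(window, center=True, min_periods=1).median()
    return smoothed.T

# Toy series with hypothetical column names in the YYYYMMDD_N format
toy = pd.DataFrame(
    [[0.2, np.nan, 0.4, 0.9, 0.5],
     [0.1, 0.1, np.nan, np.nan, 0.1]],
    columns=['20150101_N', '20150201_N', '20150301_N',
             '20150401_N', '20150501_N'],
)
miss = missing_fraction(toy)
clean = fill_and_smooth(toy)
print(miss)
print(clean)
```

Linear interpolation here treats the time points as evenly spaced, which is a simplification if the acquisition dates are irregular.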

Important Note:
The training and public leaderboard test data may contain noisy observations, while the private leaderboard data is clean and free of noise. This design helps evaluate how well your model generalizes beyond noisy training conditions.

## Dataset

Each row in the dataset contains:

class: Ground truth label of the land cover type — one of {Water, Impervious, Farm, Forest, Grass, Orchard}

ID: Unique identifier for the sample

27 NDVI Time Points: Columns labeled in the format YYYYMMDD_N (e.g., 20150720_N, 20150602_N) represent NDVI values collected on different dates. These values form a time series representing vegetation dynamics for each location.
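
Because the date is encoded in each column name, the columns can be put in chronological order before any trend feature is computed. The sketch below assumes every NDVI column ends in `_N` and starts with a valid `YYYYMMDD` date. The helper name `sort_ndvi_columns` and the column `20140111_N` are hypothetical; the other two names come from the examples above.

```python
import pandas as pd

def sort_ndvi_columns(columns):
    """Return the NDVI column names ordered by the date encoded in each name."""
    ndvi_cols = [c for c in columns if c.endswith('_N')]
    # Strip the trailing '_N' and parse the remaining YYYYMMDD string
    return sorted(ndvi_cols, key=lambda c: pd.to_datetime(c[:-2], format='%Y%m%d'))

cols = ['ID', 'class', '20150720_N', '20150602_N', '20140111_N']
ordered = sort_ndvi_columns(cols)

# Days elapsed since the first acquisition, usable as a time axis
dates = pd.to_datetime([c[:-2] for c in ordered], format='%Y%m%d')
days = (dates - dates[0]).days
print(ordered)
print(list(days))
```

Using elapsed days rather than column position as the time axis may matter when the acquisition dates are unevenly spaced.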

## Rules

Model: Logistic Regression only (multiclass).

Preprocessing: Denoising, imputation, and feature engineering allowed.

Leaderboard:

Public (89% test data): Immediate feedback.

Private (11% test data): Final ranking (avoids overfitting).

## Evaluation

Submissions will be evaluated on the basis of the accuracy score of the predicted class.

Submission format:

```
ID,class
1,water
2,water
3,grass
4,impervious
..
```
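
A quick sanity check before uploading can catch layout mistakes in the submission file. The sketch below is a hypothetical helper, not part of the competition tooling. It only verifies the `ID,class` column layout, the absence of missing predictions, and that the IDs match the test set. It deliberately does not check label casing, since the statement lists class names capitalized but shows lowercase labels in the sample.

```python
import pandas as pd

def check_submission(sub, expected_ids):
    """Raise ValueError if the frame does not match the ID,class layout."""
    if list(sub.columns) != ['ID', 'class']:
        raise ValueError(f"unexpected columns: {list(sub.columns)}")
    if sub['class'].isna().any():
        raise ValueError("missing class predictions")
    if sub['ID'].duplicated().any():
        raise ValueError("duplicate IDs")
    if set(sub['ID']) != set(expected_ids):
        raise ValueError("IDs do not match the test set")
    return True

# Rows copied from the sample submission above
sub = pd.DataFrame({'ID': [1, 2, 3, 4],
                    'class': ['water', 'water', 'grass', 'impervious']})
ok = check_submission(sub, expected_ids=[1, 2, 3, 4])
print(ok)
```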

# The Solution Code

```python
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data (paths as used in the Colab session)
train = pd.read_csv('/content/hacktrain.csv')
test = pd.read_csv('/content/hacktest.csv')

# Identify NDVI columns (named YYYYMMDD_N)
ndvi_cols = [col for col in train.columns if '_N' in col]

# Impute missing NDVI values with the per-column median
# (fitted on the training data, then reused on the test data)
imputer = SimpleImputer(strategy='median')
train_ndvi = pd.DataFrame(imputer.fit_transform(train[ndvi_cols]), columns=ndvi_cols)
test_ndvi = pd.DataFrame(imputer.transform(test[ndvi_cols]), columns=ndvi_cols)

# Feature engineering: per-row summary statistics of the NDVI time series
def extract_features(df):
    feats = pd.DataFrame(index=df.index)
    feats['ndvi_mean'] = df.mean(axis=1)
    feats['ndvi_median'] = df.median(axis=1)
    feats['ndvi_std'] = df.std(axis=1)
    feats['ndvi_min'] = df.min(axis=1)
    feats['ndvi_max'] = df.max(axis=1)
    feats['ndvi_range'] = feats['ndvi_max'] - feats['ndvi_min']
    feats['ndvi_skew'] = df.skew(axis=1)
    feats['ndvi_kurtosis'] = df.kurtosis(axis=1)
    # NOTE: this function is called on already-imputed data below, so this
    # feature is 0 for every row. Compute it on the raw columns before
    # imputation if the missingness signal is wanted.
    feats['ndvi_missing_prop'] = df.isna().sum(axis=1) / df.shape[1]
    feats['ndvi_pos_count'] = (df > 0).sum(axis=1)
    feats['ndvi_neg_count'] = (df < 0).sum(axis=1)
    # Slope of a least-squares line through the series, using the column
    # position as the time axis (assumes columns are in chronological order)
    time_idx = np.arange(df.shape[1])
    feats['ndvi_slope'] = df.apply(lambda row: np.polyfit(time_idx, row, 1)[0], axis=1)
    return feats

X_train_feats = extract_features(train_ndvi)
X_test_feats = extract_features(test_ndvi)

# Concatenate the raw NDVI values with the engineered features
# (drop the raw values if the model overfits)
X_train = pd.concat([train_ndvi, X_train_feats], axis=1)
X_test = pd.concat([test_ndvi, X_test_feats], axis=1)

# Scaling (fitted on the full training set, before the validation split)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Encode target
le = LabelEncoder()
y = le.fit_transform(train['class'])

# Stratified 80/20 validation split
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train_scaled, y, test_size=0.2, stratify=y, random_state=42
)

# Logistic Regression. With the lbfgs solver, multiclass targets are fitted
# as a multinomial model by default; the explicit multi_class argument is
# deprecated in recent scikit-learn releases, so it is omitted here.
clf = LogisticRegression(solver='lbfgs', max_iter=2000, random_state=42)
clf.fit(X_tr, y_tr)

# Validation accuracy
val_pred = clf.predict(X_val)
print("Validation Accuracy:", accuracy_score(y_val, val_pred))

# Predict on test set and map encoded labels back to class names
test_pred = le.inverse_transform(clf.predict(X_test_scaled))

# Submission
submission = pd.DataFrame({'ID': test['ID'], 'class': test_pred})
submission.to_csv('submission7.csv', index=False)
```


Output:

Validation Accuracy: 0.914375
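
A single 80/20 split gives one accuracy estimate, and in the code above the scaler is fitted before that split. A possible refinement, sketched here under stated assumptions, is to wrap imputation, scaling, and the classifier in one pipeline and score it with stratified k-fold cross-validation, so each fold fits its preprocessing on its own training part only. The data below is a synthetic stand-in generated with `make_classification`, not the hackathon data, and the fold count is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the NDVI feature matrix (not the hackathon data)
X, y = make_classification(n_samples=300, n_features=27, n_informative=10,
                           n_classes=3, random_state=42)
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.05] = np.nan  # blank out roughly 5% of the values

# Imputer and scaler are refitted inside every fold by the pipeline
pipe = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler(),
    LogisticRegression(max_iter=2000),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='accuracy')
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

The spread across folds gives a rough sense of how much a single validation score could move with a different split.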
