# Drinking Water Feasibility Prediction (GAMMAFEST)

This notebook is a cleaned, publishable version of our competition code for **predicting household drinking-water feasibility** using machine learning.

**Highlights**
- End-to-end pipeline: EDA → preprocessing → model selection → tuning → submission
- Evaluation focus: **F1-score** (imbalance-aware)
- Final model family: tree-based boosting (e.g., XGBoost / LightGBM / CatBoost)

> Note: Dataset files are expected under `data/` (see repository structure).

## 0. Environment setup

If you run this locally, install dependencies first:

```bash
pip install -r requirements.txt
```

In Colab, install any package that is missing (for example `pycaret`) from a notebook cell with `!pip install <package>`.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import gc

In [None]:
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

PROJECT_NAME = "GammaFest"
DATASET_PATH = "data"

## 1. Load data

In [None]:
df_train = pd.read_csv(f"{DATASET_PATH}/train.csv")
df_test  = pd.read_csv(f"{DATASET_PATH}/test.csv")
sample   = pd.read_csv(f"{DATASET_PATH}/sample_submission.csv")

print(df_train.shape, df_test.shape)

In [None]:
df_train.head()

In [None]:
df_test.head()

## 2. Quick EDA

In [None]:
df_train.describe()

In [None]:
df_train.dtypes

In [None]:
round(df_train.isnull().mean()*100,2)

In [None]:
df_train.info()

In [None]:
for col in df_train:
  print(col, df_train[col].unique())

In [None]:
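# Cast every feature column to pandas' nullable integer dtype (missing values stay as <NA>);
# the target column DC201 is still text at this point and is encoded in the next cell.
for col in df_train: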
  if col == 'DC201':
    continue
  df_train[col] = df_train[col].astype('Int64')
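
The cast above uses the nullable `Int64` dtype rather than plain `int`. A minimal sketch on a toy Series (illustrative values, not competition data) shows why this matters when a column has gaps: the nullable dtype keeps missing entries as `<NA>`, while a plain integer cast fails.

```python
import pandas as pd

# Toy column with one missing value; pandas stores it as float64.
col = pd.Series([1.0, None, 3.0])

# Nullable integer dtype: values become integers, the gap stays missing.
as_nullable = col.astype("Int64")
print(as_nullable.dtype, int(as_nullable.isna().sum()))

# A plain NumPy integer cast cannot represent the missing value.
try:
    col.astype("int64")
    plain_cast_failed = False
except ValueError:
    plain_cast_failed = True
print(plain_cast_failed)
```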

In [None]:
# Binary target: 1 for 'Layak Minum', 0 for every other value
df_train['DC201'] = (df_train['DC201'] == 'Layak Minum').astype('int')
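
As a quick check of this encoding, the sketch below applies it to a toy Series and then decodes it with the same label mapping the notebook uses when it writes the submission. The three sample labels are illustrative only.

```python
import pandas as pd

# Toy labels for illustration; the real column is df_train['DC201'].
labels = pd.Series(["Layak Minum", "Tidak Layak Minum", "Layak Minum"])

# Encode: 1 for 'Layak Minum', 0 for anything else.
encoded = (labels == "Layak Minum").astype(int)

# Decode with the mapping used for the submission file.
decoded = encoded.map({1: "Layak Minum", 0: "Tidak Layak Minum"})

print(encoded.tolist())
print(decoded.tolist() == labels.tolist())
```

Note that the round trip is only lossless if the raw column contains exactly these two labels; any third value would be encoded as 0 and decoded as `Tidak Layak Minum`.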

In [None]:
df_train.head()

In [None]:
df_train.info()

In [None]:
for col in df_train:
  print(col, df_train[col].unique())

In [None]:
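# NOTE: frac=1 only shuffles the full table, so data_unseen below is empty.
# Use a smaller fraction (e.g. frac=0.9) if a separate hold-out set is wanted;
# setup() still reserves 20% of `data` for validation via train_size=0.8.
data = df_train.sample(frac=1, random_state=42)
data_unseen = df_train.drop(data.index)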
data.reset_index(inplace=True, drop=True)
data_unseen.reset_index(inplace=True, drop=True)
print('Data for Modeling: ' + str(data.shape))

In [None]:
print('Unseen Data For Predictions: ' + str(data_unseen.shape))
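
The effect of the sampling fraction on the two shapes printed above can be reproduced on a toy frame. This is a sketch only: `toy` stands in for `df_train`, and the 0.9 fraction is an assumed value, not one taken from the competition code.

```python
import pandas as pd

toy = pd.DataFrame({"x": range(10)})  # stand-in for df_train

# frac=1 samples every row (a shuffle), so dropping those indices leaves nothing.
shuffled = toy.sample(frac=1, random_state=42)
leftover = toy.drop(shuffled.index)

# frac=0.9 (assumed ratio) keeps 10% of the rows aside as unseen data.
modeling = toy.sample(frac=0.9, random_state=42)
unseen = toy.drop(modeling.index)

print(len(shuffled), len(leftover))
print(len(modeling), len(unseen))
```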

## 3. Modeling with PyCaret

In [None]:
from pycaret.classification import *

In [None]:
latih = setup(
    data=data,
    target='DC201',
    session_id=42,
    normalize=True,
    transformation=True,
    log_experiment=True,
    remove_multicollinearity=True,  # drop one of each pair of highly correlated features
    numeric_imputation='median',
    fix_imbalance=True,
    train_size=0.8,
    # Disabled options:
    # handle_unknown_categorical=True,
    # unknown_categorical_method='most_frequent',
    # ignore_low_variance=True,  # remove categorical features with statistically insignificant variance
    # combine_rare_levels=True,  # merge categorical levels below rare_level_threshold into a single level
    # ignore_features=['FKP02'],
    # date_features=['FKP03', 'FKP04'],
)

In [None]:
models()

In [None]:
compare_models(exclude = ['lr', 'knn', 'nb', 'dt', 'svm', 'rbfsvm', 'gpc', 'rf', 'qda', 'ada', 'et', 'dummy'])

In [None]:
xgboost  = create_model('xgboost')

In [None]:
tuned_xgboost = tune_model(xgboost)

In [None]:
plot_model(estimator = tuned_xgboost, plot = 'learning')

In [None]:
plot_model(estimator = tuned_xgboost, plot = 'feature')

In [None]:
evaluate_model(tuned_xgboost)
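
Because the evaluation focus is the F1-score, it helps to see how it differs from accuracy on imbalanced labels. The sketch below is a plain-Python illustration with made-up labels; `f1_score_binary` and `accuracy` are hypothetical helpers written for this example, not PyCaret functions.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Made-up imbalanced labels: 2 positives out of 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
all_negative = [0] * 10                        # never predicts the positive class
one_hit = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]       # 1 TP, 1 FP, 1 FN

print(accuracy(y_true, all_negative), f1_score_binary(y_true, all_negative))
print(accuracy(y_true, one_hit), f1_score_binary(y_true, one_hit))
```

Both prediction vectors reach the same accuracy, yet only the second one finds any positive case, which is what F1 rewards.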

## 4. Evaluation & selection

In [None]:
predict_model(tuned_xgboost, data=df_test)

In [None]:
preds = predict_model(tuned_xgboost, data=df_test)

In [None]:
preds.head()

## 5. Generate submission

In [None]:
sample.head()

In [None]:
sample['DC201'] = round(preds['prediction_label']).astype(int)

In [None]:
sample.head()

In [None]:
sample['DC201'].value_counts()

In [None]:
sample['DC201'] = sample['DC201'].map({1:'Layak Minum', 0:'Tidak Layak Minum'})

In [None]:
sample.head()

In [None]:
sample.to_csv('submission.csv',index=False)

In [None]:
!head -n20 "submission.csv"
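
Beyond eyeballing the first lines, the file can be checked programmatically before upload. The following is a minimal sketch using only the standard library; `check_submission` is a hypothetical helper, and the `id` column in the toy CSV is an assumption, since the real column names come from `sample_submission.csv`.

```python
import csv
import io

# The two label strings the notebook writes into DC201.
ALLOWED = {"Layak Minum", "Tidak Layak Minum"}


def check_submission(csv_text, target="DC201"):
    """Return the row count and any target values outside the allowed labels."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    bad = [row[target] for row in rows if row[target] not in ALLOWED]
    return len(rows), bad


# Toy file content; the last row still holds an unmapped numeric label.
toy_csv = "id,DC201\n1,Layak Minum\n2,Tidak Layak Minum\n3,1\n"
n_rows, bad_labels = check_submission(toy_csv)
print(n_rows, bad_labels)
```

For the real file, the same helper could be fed `open('submission.csv').read()`, and the row count compared with `len(df_test)`.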

In [None]:
sample["DC201"] = sample["DC201"].astype(str)

In [None]:
sample.to_csv('submission1.csv',index=False)

In [None]:
!head -n20 "submission1.csv"

In [None]:
sample.info()

In [None]:
import csv

# Quote every non-numeric field; index=False keeps the row index out of the file
sample.to_csv('subs1.csv', index=False, quoting=csv.QUOTE_NONNUMERIC)

In [None]:
!head -n20 "subs1.csv"

In [None]:
sample.head()

In [None]:
sample.to_csv('subss1.csv',index=False)

In [None]:
!head -n20 "subss1.csv"

In [None]:
sample.info()

In [None]:
sample["DC201"] = sample["DC201"].astype(str)

In [None]:
sample.info()

In [None]:
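# DataFrame.update modifies in place and returns None, so build the copy explicitly
sampley = sample.copy()
sampley['DC201'] = sampley['DC201'].astype(str)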