##  Pump It Up: ML7101 

### - Team Pineapple Pizza (Rassul, Sugam, Leo)


 ### Overview
 
 - Problem Statement and Relevance
 - Data Preparation and EDA
     - Transformations and Reasoning
     - Plots
 - Code showcase (Cross-validation / training) 
 - Result Analysis and Real-Life Applicability Discussion
 - Final Remarks

### 1. Problem Statement and Real Life Relevance


- Access to clean water is a critical issue in Tanzania, but thousands of water pumps are broken. 

- With limited resources, it is impossible to inspect every pump manually, so repairs tend to happen only after a failure is reported. This reactive approach leads to long downtimes that affect the health and daily life of entire communities.

- *The current task* is to predict the operational status of water pumps in Tanzania and classify them as functional, non-functional, or functional but needs repair. 

- We are given variables that may affect a pump's operation, such as when and where the pump was set up and how it is managed.

Knowing in advance which of these water pumps are likely to fail can improve maintenance operations and help ensure that clean, potable water is available to communities across Tanzania.

### 2. Data Preparation


#### Imports

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

# --- Import our custom project modules ---
from src.preprocessing.presets import get_preset 
from src.model.models import create_and_return_all_models
from src.hyperparameter_tuning import get_param_dists, run_hyperparameter_search
from src.train.evaluate import show_final_evaluation, create_submission_file

#### Loading the Data


In [3]:
train_values = pd.read_csv("data/train_set_values.csv")
train_labels = pd.read_csv("data/train_set_labels.csv")
test_values  = pd.read_csv("data/test_set_values.csv")

# Merge labels into train only
train_df = pd.merge(train_values, train_labels, on="id", how="left")
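
The merge above can be sanity-checked before any analysis. The sketch below is not part of the project code: it uses tiny synthetic frames in place of the real CSVs (the values are illustrative only) and shows how `validate="one_to_one"` guards against duplicated ids, how to count rows left without a label, and how to inspect the class balance.

```python
import pandas as pd

# Tiny synthetic stand-ins for the real CSVs (values are illustrative only)
values = pd.DataFrame({"id": [1, 2, 3, 4], "gps_height": [1390, 0, 686, 263]})
labels = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "status_group": ["functional", "non functional", "functional",
                     "functional needs repair"],
})

# validate="one_to_one" raises MergeError if an id is duplicated on either side
merged = pd.merge(values, labels, on="id", how="left", validate="one_to_one")

# A left merge keeps every row of `values`; unmatched ids would get NaN labels
n_missing_labels = int(merged["status_group"].isna().sum())

# Share of each class, useful for spotting class imbalance early
class_share = merged["status_group"].value_counts(normalize=True)

print(n_missing_labels)
print(class_share)
```

On the real data, a non-zero `n_missing_labels` would indicate training rows without a matching label.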

### Exploratory Data Analysis (Preliminary Analysis) 

#### *Insert overview about basic data analysis here* 

##### Code to preprocess data


In [4]:
preset_name = "log_transform+remove_correlated+feature_engineer"
pre = get_preset(preset_name, list(train_df.columns))

train_processed = pre.fit_transform(train_df)
test_processed  = pre.transform(test_values)
print("Data preprocessing complete.")

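# --- Prepare data for modeling ---
# Features: every column except the target
X = train_processed.drop(columns=["status_group"])
y_raw = train_processed["status_group"]

# Encode the string class labels as integers; le.inverse_transform maps them back
le = LabelEncoder()
y = le.fit_transform(y_raw)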

{'all_columns': ['id', 'amount_tsh', 'date_recorded', 'funder', 'gps_height', 'installer', 'longitude', 'latitude', 'wpt_name', 'num_private', 'basin', 'subvillage', 'region', 'region_code', 'district_code', 'lga', 'ward', 'population', 'public_meeting', 'recorded_by', 'scheme_management', 'scheme_name', 'permit', 'construction_year', 'extraction_type', 'extraction_type_group', 'extraction_type_class', 'management', 'management_group', 'payment', 'payment_type', 'water_quality', 'quality_group', 'quantity', 'quantity_group', 'source', 'source_type', 'source_class', 'waterpoint_type', 'waterpoint_type_group', 'status_group'], 'remove_col_after_log': True, 'cat_col_cut_off': 10, 'cat_columns': ['installer', 'wpt_name', 'basin', 'public_meeting', 'scheme_management', 'permit', 'extraction_type', 'management', 'payment', 'water_quality', 'quantity', 'quantity_group', 'waterpoint_type', 'recorded_by'], 'log_transform_cols': ['amount_tsh', 'population'], 'feature_engineer': True}
Data preprocessing complete.
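
The printed configuration mentions `log_transform_cols`, `remove_col_after_log` and `cat_col_cut_off`. The preset's actual implementation lives in `src.preprocessing` and is not shown here, so the following is only a sketch of what such steps might look like. The helper names, the `_log` suffix, and the reading of the cut-off as "keep the most frequent levels" are all assumptions.

```python
import numpy as np
import pandas as pd


def log_transform(df, cols, remove_original=True):
    """Add log1p versions of `cols`; optionally drop the raw columns.

    Hypothetical helper, not the project's implementation.
    """
    out = df.copy()
    for col in cols:
        # log1p(x) = log(1 + x), so zero values map to zero instead of -inf
        out[f"{col}_log"] = np.log1p(out[col])
    if remove_original:
        out = out.drop(columns=cols)
    return out


def keep_top_categories(series, top_k, other="other"):
    """Keep the `top_k` most frequent levels and map the rest to `other`.

    Hypothetical helper; one plausible reading of a category cut-off.
    """
    top = series.value_counts().nlargest(top_k).index
    return series.where(series.isin(top), other)


# Illustrative rows only
demo = pd.DataFrame({
    "amount_tsh": [0.0, 50.0, 6000.0],
    "population": [0, 109, 250],
    "installer": ["DWE", "DWE", "Roman"],
})
demo = log_transform(demo, ["amount_tsh", "population"])
demo["installer"] = keep_top_categories(demo["installer"], top_k=1)
print(demo)
```

A log transform of this kind is a common choice for heavily right-skewed columns such as `amount_tsh` and `population`, where many values are zero and a few are very large.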

#### *Insert Rassul's conclusions from the data analysis here*

### 3. Training 

To streamline the workflow, we use randomized-search cross-validation (`RandomizedSearchCV`) to tune hyperparameters for five different models.

#### Important Remarks about Training

1. Tested five models:
    - Logistic Regression
    - Decision Tree
    - Random Forest
    - XGBoost
    - Gradient Boosting

2. Defined the search space: for each model, we defined a wide range of candidate hyperparameters in `hyperparameter_tuning.py`.

3. Used `RandomizedSearchCV`: it samples parameter combinations at random and scores each one with Repeated Stratified K-Fold cross-validation (10 folds in total), which gives a more stable estimate than a single train/validation split.
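
The search described above can be sketched with scikit-learn directly. This is a minimal illustration on synthetic data, not the contents of `hyperparameter_tuning.py`: the 5 splits x 2 repeats layout is an assumption that happens to give 10 folds, and the decision-tree search space is made up for the example.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem standing in for the pump data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)

# Assumed layout: 5 stratified splits repeated twice = 10 folds per candidate
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)

search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    # Illustrative search space; each candidate is sampled from these distributions
    param_distributions={
        "max_depth": randint(2, 12),
        "min_samples_leaf": randint(1, 20),
    },
    n_iter=2,            # number of sampled candidates
    scoring="f1_macro",  # macro-averaged F1 weights all three classes equally
    cv=cv,
    random_state=42,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

With `n_iter=2` and 10 folds, the search performs 20 fits, which matches the "Fitting 10 folds for each of 2 candidates" message in the notebook output.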

#### Training Code: 

In [5]:
# Get the dictionary of untrained model pipelines
models_to_tune = create_and_return_all_models(X_train=X, seed=42)

# Get the dictionary of parameter distributions
param_dists = get_param_dists()

# --- For the presentation, run with a small n_iter to finish quickly ---
# --- For the final report, this was run with n_iter=25 ---
summary_df, best_estimators = run_hyperparameter_search(models_to_tune, param_dists, X, y, n_iter=2)

print("\n--- Hyperparameter Search Summary ---")
summary_df


--- Starting Cross-Validated Hyperparameter Search ---

Searching for best parameters for: logistic_regression...
Fitting 10 folds for each of 2 candidates, totalling 20 fits


KeyboardInterrupt: 

In [None]:
# Select the best model based on the f1_macro score from the summary table
# (this assumes summary_df is sorted by f1_macro, best model first)
winner_name = summary_df.index[0]
winner_model = best_estimators[winner_name]

print(f"--- Winning Model: {winner_name} ---")
print(f"Best Parameters Found: {summary_df.loc[winner_name, 'best_params']}")

# Show the detailed classification report for the winner on the full training data
show_final_evaluation(winner_model, X, y, le)
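
The selection and reporting step can be illustrated without the project helpers. In the sketch below, the summary table, its `f1_macro` column name, the model names and the scores are placeholders, not results from this project. It picks the winner with `idxmax`, which does not depend on row order, and prints a per-class report with the original label names recovered from the encoder.

```python
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

# Placeholder summary table; names and scores are made up for illustration
summary = pd.DataFrame(
    {"f1_macro": [0.1, 0.3, 0.2]},
    index=["model_a", "model_b", "model_c"],
)

# idxmax returns the index label of the highest score, regardless of sorting
winner = summary["f1_macro"].idxmax()

# LabelEncoder sorts class names alphabetically and assigns 0, 1, 2, ...
le_demo = LabelEncoder()
y_true = le_demo.fit_transform(
    ["functional", "non functional", "functional needs repair", "functional"]
)
y_pred = [0, 2, 0, 0]  # illustrative predictions in encoded form

# target_names restores readable class names in the report
report = classification_report(
    y_true, y_pred, target_names=le_demo.classes_, zero_division=0
)
print(winner)
print(report)
```

A report computed on the same data the model was trained on tends to be optimistic, so the cross-validated score is usually the safer number to quote.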

### 4. Results and Real-Life Impact

- **Success**: Our automated pipeline selected XGBoost as the best model, with an F1-score above 81% for predicting water pump status.

- **Key Findings**: Our analysis confirms that a pump's age, geographic location, and extraction type are the most critical predictors of its functional status.

- **Real-World Impact**: The model's predictions could feed a "smart maintenance" dashboard. Instead of carrying out random checks, maintenance teams would receive a prioritized list of pumps flagged by the model. This could reduce costs, improve efficiency, and, most importantly, help more communities keep consistent access to clean water.
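
One way to produce the prioritized list mentioned above is to rank pumps by the predicted probability of the "non functional" class. The sketch below uses synthetic data and a small random forest as a stand-in for the tuned model. The `pump_id` values, the column names and the assumption that encoded label 2 means "non functional" are all illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic three-class data standing in for the processed pump features
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# Stand-in model trained on the first 150 rows
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_demo[:150], y_demo[:150])

# Class probabilities for the remaining 50 "pumps"
proba = model.predict_proba(X_demo[150:])

# Assumption: encoded label 2 corresponds to "non functional"
col = list(model.classes_).index(2)

# Rank pumps by failure probability and keep the ten most urgent
priority = (
    pd.DataFrame({
        "pump_id": np.arange(150, 200),
        "p_non_functional": proba[:, col],
    })
    .sort_values("p_non_functional", ascending=False)
    .head(10)
    .reset_index(drop=True)
)
print(priority)
```

Ranking by probability rather than by the hard class prediction lets a team with a fixed inspection budget simply work down the list.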

#### **Insert graphs and result statistics here**

#### **Insert real-world application dashboard (sample dashboard) here**

### Thank You! 