## Phase 1: Data Preparation

In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix
import numpy as np

In [15]:
# 1. Load Data and Apply Imputation (from EDA step)

data_path = '../data/water_potability.csv'
df = pd.read_csv(data_path)

In [17]:
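# Apply median imputation to handle missing values found in EDA.
# Note: the medians are computed on the full dataset (before the split),
# so test rows influence the fill values.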
median_values = df[['ph', 'Sulfate', 'Trihalomethanes']].median()
df.fillna(median_values, inplace=True)
print("Data loaded and median imputation applied successfully.")

Data loaded and median imputation applied successfully.


In [19]:
# 2. Separate Features (X) and Target (y)
X = df.drop('Potability', axis=1)
y = df['Potability']

In [21]:
# 3. Stratified Train-Test Split (80% Train, 20% Test)
# stratify=y ensures the 61:39 imbalance is maintained in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Data split: Train samples={len(X_train)}, Test samples={len(X_test)}")

Data split: Train samples=2620, Test samples=656


In [23]:
# 4. Feature Scaling (StandardScaler)
scaler = StandardScaler()


In [25]:
# Fit the scaler ONLY on the training data to prevent data leakage
scaler.fit(X_train) 

In [27]:
# Transform both training and test data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("Data scaling complete.")

Data scaling complete.


## Phase 2: Model Implementation and Training

In [49]:
# 5. Initialize Logistic Regression Model (Baseline)
model = LogisticRegression(
    random_state=42, 
    # Use 'liblinear' solver: suitable for small datasets and handles the L2 penalty well.
    solver='liblinear',
    class_weight='balanced'
)

In [51]:
# 6. Train the Model
model.fit(X_train_scaled, y_train)
print("Logistic Regression model trained.")

Logistic Regression model trained.


## Phase 3: Evaluation and Reporting

In [54]:
# 7. Generate Predictions (Classes and Probabilities)
y_pred = model.predict(X_test_scaled)
# Get probabilities for ROC-AUC calculation (using the positive class, index 1)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]

In [56]:
# 8. Evaluation Metrics

print("\n--- BASELINE MODEL PERFORMANCE (LOGISTIC REGRESSION) ---")

# a) Classification Report (includes Precision, Recall, F1-Score)
print("\nClassification Report:")
report = classification_report(y_test, y_pred, output_dict=True)
print(classification_report(y_test, y_pred))

# b) ROC-AUC Score (Critical for imbalanced data)
auc = roc_auc_score(y_test, y_pred_proba)
print(f"\nROC-AUC Score: {auc:.4f}")

# Extract F1-Score for the Potable (1) class for the report
f1_score_potable = report['1']['f1-score']
print(f"F1-Score (Potable=1): {f1_score_potable:.4f}")


--- BASELINE MODEL PERFORMANCE (LOGISTIC REGRESSION) ---

Classification Report:
              precision    recall  f1-score   support

           0       0.64      0.52      0.57       400
           1       0.42      0.53      0.47       256

    accuracy                           0.53       656
   macro avg       0.53      0.53      0.52       656
weighted avg       0.55      0.53      0.53       656


ROC-AUC Score: 0.5475
F1-Score (Potable=1): 0.4666


In [None]:
# Completed baseline model report

# BASELINE MODEL REPORT: LOGISTIC REGRESSION (BALANCED)

## Model & Methodology
* **Model Used:** Logistic Regression with **`class_weight='balanced'`**.
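* **Data Preparation:** Missing values in `ph`, `Sulfate`, and `Trihalomethanes` were filled with column medians. Data was split using **Stratified Sampling** (80/20 ratio). Features were standardized using **StandardScaler**, fitted on the training set only.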
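
The notebook computes the imputation medians before the split, so the test rows contribute to them. A possible leak-free variant is sketched below: wrapping imputation, scaling, and the classifier in a `Pipeline` means every statistic is learned from the training rows only. The synthetic frame, the `_demo` names, and the 10% missing rate are assumptions for illustration, not the real CSV.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the water data (hypothetical values, not the real CSV).
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(
    rng.normal(size=(500, 3)), columns=["ph", "Sulfate", "Trihalomethanes"]
)
X_demo = X_demo.mask(rng.random(X_demo.shape) < 0.1)  # blank out ~10% of cells
y_demo = pd.Series(rng.choice([0, 1], size=500, p=[0.61, 0.39]))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Each step is fitted on the training split only when pipeline.fit is called.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(
        solver="liblinear", class_weight="balanced", random_state=42
    )),
])
pipeline.fit(X_tr, y_tr)

# The test split is imputed and scaled with the training statistics.
test_proba = pipeline.predict_proba(X_te)[:, 1]
learned_medians = pipeline.named_steps["impute"].statistics_
```

Here `learned_medians` holds the training-set medians, which can be compared against `X_tr.median()` to confirm that no test row was involved.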

### Impact of Imbalance Fix
The use of `class_weight='balanced'` successfully addressed the initial model failure (where F1-Score was 0.0000), compelling the model to predict the minority class.
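
As a rough illustration of what `class_weight='balanced'` does, scikit-learn documents the weight of each class as `n_samples / (n_classes * count_of_class)`. The sketch below reproduces that formula by hand. The 400/256 counts are the test-set supports from the report, used here only because they are the class counts this notebook prints; the model itself derives its weights from the training labels. The helper name `balanced_class_weights` is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return n_samples / (n_classes * class_count) for every class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Illustrative labels with the same 400:256 ratio as the test set.
labels = [0] * 400 + [1] * 256
weights = balanced_class_weights(labels)

# Total weight carried by each class after reweighting.
total_weight_0 = weights[0] * 400
total_weight_1 = weights[1] * 256
```

With these counts the minority class receives the larger per-sample weight, and both classes end up with the same total weight, which is what pushes the model to stop ignoring class 1.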

## Results
The balanced model was evaluated on the held-out test set (656 samples).

| Metric | Score | Detail/Interpretation |
| :--- | :--- | :--- |
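| **Accuracy** | **0.53** | Below the majority-class baseline of 0.61 (always predicting class 0 is correct for 400 of 656 test samples). Balanced class weights trade accuracy for minority-class recall. |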
| **F1-Score (Potable=1)** | **0.4666** | The harmonic mean of precision and recall for the positive class. This is the **primary benchmark**. |
| **Recall (Potable=1)** | **0.53** | The model correctly identifies 53% of all actual Potable samples. |
| **Precision (Potable=1)** | **0.42** | When the model predicts water is Potable, it is correct only 42% of the time. |
| **ROC-AUC** | **0.5475** | The score is still very close to 0.50, confirming that the simple linear model struggles significantly to separate the two classes. |
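
The relationships between the figures in the table can be checked with a few lines of arithmetic. This sketch recomputes F1 from the rounded precision and recall, and derives the accuracy of a classifier that always predicts the majority class from the supports in the classification report. The function name is hypothetical.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded values taken from the classification report for class 1.
f1_rounded = f1_from_precision_recall(0.42, 0.53)

# Accuracy of always predicting class 0, using the reported supports.
majority_baseline_accuracy = 400 / (400 + 256)
```

The recomputed F1 is close to, but not exactly, the reported 0.4666, because the report's precision and recall are rounded to two decimals before they are displayed.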

## Conclusion: Establishing the Benchmark
The Logistic Regression baseline achieved an F1-Score (Potable=1) of **0.4666** and an ROC-AUC score of **0.5475**.

This result now serves as the **quantitative benchmark**. The poor performance suggests that the features have no strong linear relationship with potability, so the next phase should test whether **non-linear models** (tree-based, neural nets) can reach acceptable performance.

***

## Next Steps for Intermediate Report (17 Oct)

Your immediate focus for the next milestone is to apply non-linear models to beat this **0.4666 F1-Score**.

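1.  **Implement Advanced Models:** Build and evaluate a **Random Forest Classifier** and a **Gradient Boosting Classifier**, keeping the imbalance correction in place. `RandomForestClassifier` accepts `class_weight='balanced'`; `GradientBoostingClassifier` has no `class_weight` parameter, so pass balanced `sample_weight` values to its `fit` method instead.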
2.  **Spatio-Temporal Data:** Begin investigating how to handle the `.mat` data for the pH forecasting task.
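
One way step 1 could be set up is sketched below. It is a starting point under stated assumptions, not a tuned solution: the data comes from `make_classification` rather than the water CSV, and the feature count, `n_estimators`, and all `_demo` names are arbitrary choices. The Random Forest takes `class_weight='balanced'` directly, while the Gradient Boosting model receives equivalent per-sample weights from `compute_sample_weight`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic data with roughly a 61:39 class ratio (not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=9, n_informative=5,
    weights=[0.61, 0.39], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Random Forest supports class weighting through its constructor.
rf = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=42
)
rf.fit(X_tr, y_tr)

# Gradient Boosting has no class_weight argument; weight the samples in fit.
gb = GradientBoostingClassifier(random_state=42)
gb.fit(X_tr, y_tr, sample_weight=compute_sample_weight("balanced", y_tr))

# Report the same two metrics used for the baseline.
results = {}
for name, fitted in [("random_forest", rf), ("gradient_boosting", gb)]:
    results[name] = {
        "f1_potable": f1_score(y_te, fitted.predict(X_te), pos_label=1),
        "roc_auc": roc_auc_score(y_te, fitted.predict_proba(X_te)[:, 1]),
    }
```

Reporting F1 for class 1 and ROC-AUC keeps the comparison with the 0.4666 and 0.5475 baseline figures like for like once the real data is substituted.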