# Tanzania Water Wells – Documented Notebook

**Author:** Calvine Dasilver  
**Project:** Machine Learning to Assess Water Well Performance in Tanzania

This notebook includes a clean, documented version of the original analysis with:
- Function-level docstrings
- Markdown explanations
- Step-by-step logic

It is structured to improve readability and reusability for educational or open-source purposes.



### Step 0: Import Required Libraries

We import all necessary libraries for data manipulation, visualization, preprocessing, and model building. Scikit-learn is used for ML tasks and evaluation metrics.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Scikit-learn imports
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report, precision_score, recall_score, mean_squared_error
)
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier


### Step 1: Define Data Loading Function

We define a reusable function to load a CSV dataset, inspect its shape, and display basic structure info.


In [None]:
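import io

def load_data_and_check_info(file_path):
    """
    Load a CSV file indexed by `id` and collect basic structure information.

    Parameters:
    - file_path (str): Path to the CSV file.

    Returns:
    - df_1 (pd.DataFrame): The loaded data.
    - df_info (str): Text produced by `DataFrame.info()` (dtypes and non-null counts).
    - df_head (pd.DataFrame): First five rows.
    - df_shape (tuple): (n_rows, n_columns).
    """
    df_1 = pd.read_csv(file_path, index_col='id')
    df_head = df_1.head()
    # DataFrame.info() prints and returns None, so capture its text in a buffer.
    buffer = io.StringIO()
    df_1.info(buf=buffer)
    df_info = buffer.getvalue()
    df_shape = df_1.shape
    print("Data Shape:", df_shape)
    return df_1, df_info, df_head, df_shape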
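
Because the loader uses `id` as the index, it can be worth confirming that the ids are unique before merging on them later. The following is a minimal sketch on a hypothetical in-memory CSV (the rows and the `csv_text` variable are made up for illustration, not taken from the real file):

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for the real training file.
# The id 2 is deliberately repeated to show what the check reports.
csv_text = "id,amount_tsh,basin\n1,50.0,Pangani\n2,0.0,Rufiji\n2,10.0,Rufiji\n"

demo_df = pd.read_csv(io.StringIO(csv_text), index_col="id")

# Index.duplicated() flags every repeat after the first occurrence.
duplicate_ids = demo_df.index[demo_df.index.duplicated()].unique().tolist()

print("Index name:", demo_df.index.name)
print("Index is unique:", demo_df.index.is_unique)
print("Duplicated ids:", duplicate_ids)
```

If the list of duplicated ids is empty, an index-based merge should produce one row per well.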


### Step 2: Define Function to Check Data Types & Missing Values

This function summarizes the data types and missing values for all features, helping us plan data cleaning strategies.


In [None]:
def check_data_types_and_missing_values(data):
    """
    Analyze data types and missing values for a given DataFrame.

    Args:
        data (pd.DataFrame): Dataset to inspect.

    Returns:
        dict: Two entries:
            "data_types" (dict): dtype name -> number of columns with that dtype.
            "missing_values" (pd.Series): number of nulls in each column.
    """
    # Convert dtype objects to their names first so the summary keys are plain strings.
    dtype_names = data.dtypes.astype(str).replace({'object': 'string'})
    data_types = dtype_names.value_counts().to_dict()
    missing_values = data.isnull().sum()
    return {"data_types": data_types, "missing_values": missing_values}
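
Raw null counts are easier to act on when paired with the share of rows they represent. The sketch below shows one way to build such a table. The toy frame borrows column names from the dataset, but its values are invented:

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "funder": ["A", None, "B", None],
    "gps_height": [1390, 0, 686, 263],
    "permit": [True, False, None, True],
})

# Count of nulls and fraction of rows that are null, per column.
missing_summary = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "share_missing": toy.isnull().mean(),
}).sort_values("n_missing", ascending=False)

print(missing_summary)
```

A column where most rows are missing is a candidate for dropping, while a small share is usually better handled by imputation.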


### Step 3: Load Training Data

We use the earlier function to load the training feature dataset and preview its structure and shape.


In [None]:
file_path_1 = "Data/Training_set_values.csv"
df_1, df_info, df_head, df_shape = load_data_and_check_info(file_path_1)  # already prints the shape
print(df_info)
df_head


### Step 4: Explore Missing Data and Column Types

We apply our function to evaluate how many columns have missing data and what types they are. This informs our cleaning strategy.


In [None]:
dtypes_and_mv_df_1 = check_data_types_and_missing_values(df_1)
print("Data types of each column:")
print(dtypes_and_mv_df_1["data_types"])

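missing = dtypes_and_mv_df_1["missing_values"]
if missing.sum() == 0:
    print("\nNo missing values found.")
else:
    # Show only the columns that actually contain nulls, largest first.
    print("\nColumns with missing values:")
    print(missing[missing > 0].sort_values(ascending=False))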


### Step 5: Load and Merge Labels with Feature Set

We now load the label dataset and merge it with the feature dataset (`df_1`) using the index (`id`).


In [None]:
df_2 = pd.read_csv("Data/Training_set_labels.csv", index_col='id')
merged_df = df_1.merge(df_2, how='inner', left_index=True, right_index=True)
print("Shape after merge:", merged_df.shape)
merged_df.head()
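
An inner merge silently drops any `id` that appears in only one of the two files. As a rough check, pandas can report unmatched rows through the `indicator` and `validate` options of `merge`. The sketch below uses two hypothetical toy frames (`features` and `labels`) rather than the real data:

```python
import pandas as pd

# Hypothetical toy frames; id 3 has features but no label.
features = pd.DataFrame(
    {"amount_tsh": [50.0, 0.0, 25.0]},
    index=pd.Index([1, 2, 3], name="id"),
)
labels = pd.DataFrame(
    {"status_group": ["functional", "non functional"]},
    index=pd.Index([1, 2], name="id"),
)

# how="left" keeps every feature row; indicator=True adds a `_merge` column
# saying where each row was found; validate raises if any id is repeated.
checked = features.merge(
    labels,
    how="left",
    left_index=True,
    right_index=True,
    indicator=True,
    validate="one_to_one",
)

unmatched_ids = checked.index[checked["_merge"] == "left_only"].tolist()
print("Ids without a label:", unmatched_ids)
```

An empty list here would mean the inner merge loses no feature rows.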


### Step 6: Target Variable Distribution

Understanding class balance helps guide our modeling strategy. Here we inspect the distribution of the target variable `status_group`.


In [None]:
sns.countplot(data=merged_df, x='status_group')
plt.title('Target Class Distribution')
plt.xlabel('Status Group')
plt.ylabel('Count')
plt.show()
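
The plot shows counts; the same information as proportions makes imbalance easier to quantify. A minimal sketch, using a hypothetical ten-row target series whose class counts are invented and do not reflect the real distribution:

```python
import pandas as pd

# Hypothetical target values; the counts are made up for illustration.
status = pd.Series(
    ["functional"] * 5
    + ["non functional"] * 4
    + ["functional needs repair"] * 1,
    name="status_group",
)

# normalize=True returns each class as a fraction of all rows.
class_share = status.value_counts(normalize=True)
print(class_share.round(2))
print("Rarest class:", class_share.idxmin())
```

When one class is much rarer than the others, accuracy alone can hide poor performance on it, so per-class precision and recall deserve attention.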


### Step 7: Drop Redundant or Sparse Features

Based on domain knowledge or null values, we drop columns that are unlikely to contribute meaningfully to predictions.


In [None]:
columns_to_drop = ['scheme_name', 'recorded_by', 'region_code', 'wpt_name']
merged_df = merged_df.drop(columns=columns_to_drop)
merged_df.head()
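
One rough way to shortlist drop candidates is to count distinct values per column: a column with a single value carries no signal, and a column that is unique on every row behaves like an identifier. The sketch below applies this to a hypothetical toy frame. The column names echo the dataset but the values are invented, and the thresholds are an assumption rather than the rule used in the original analysis:

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "recorded_by": ["X", "X", "X", "X"],
    "wpt_name": ["a", "b", "c", "d"],
    "basin": ["Pangani", "Rufiji", "Pangani", "Rufiji"],
})

n_unique = toy.nunique()

# Columns with one distinct value, and columns distinct on every row.
constant_cols = n_unique[n_unique == 1].index.tolist()
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()

print("Constant columns:", constant_cols)
print("Identifier-like columns:", id_like_cols)
```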


### Step 8: Fill Missing Values

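We treat a `construction_year` of 0 as missing and replace it with the median of the recorded years. Gaps in `public_meeting` and `permit` are forward-filled, which copies the value from the previous row and therefore depends on row order. Handling these gaps up front helps avoid errors during model training.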


In [None]:
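# Assign results back instead of calling inplace=True on a column selection,
# which recent pandas versions warn about and may not apply to the DataFrame.
merged_df['construction_year'] = merged_df['construction_year'].replace(0, np.nan)
merged_df['construction_year'] = merged_df['construction_year'].fillna(
    merged_df['construction_year'].median()
)

# .ffill() replaces the deprecated fillna(method='ffill').
merged_df['public_meeting'] = merged_df['public_meeting'].ffill()
merged_df['permit'] = merged_df['permit'].ffill()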
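
To make the two strategies concrete, here is a small sketch on a hypothetical five-row frame (values invented). It also illustrates one limitation of forward fill: a missing value in the first row has nothing before it to copy, so it stays missing unless a second step such as a backward fill is added. The backward fill is an assumption for illustration, not part of the original analysis:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "construction_year": [0, 1990, 2000, 0, 2010],
    "permit": [None, True, None, False, None],
})

# Zeros become NaN, then NaN is replaced by the median of the known years.
years = toy["construction_year"].replace(0, np.nan)
toy["construction_year"] = years.fillna(years.median())

# Forward fill leaves the first row missing because no earlier value exists.
permit_ffill = toy["permit"].ffill()
n_still_missing = int(permit_ffill.isna().sum())

# A backward fill afterwards closes that leading gap.
permit_filled = permit_ffill.bfill()

print(toy["construction_year"].tolist())
print("Missing after ffill:", n_still_missing)
print(list(permit_filled))
```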


### Step 9: Encode Categorical Variables

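We convert string-based categorical columns into integer codes using Label Encoding so that scikit-learn models can consume them. The codes imply an arbitrary ordering, which tree-based models tolerate well but which linear models such as Logistic Regression may read as a real ranking.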


In [None]:
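from sklearn.preprocessing import LabelEncoder

# One encoder per column, so any column (including the target) can be decoded later.
encoders = {}
for col in merged_df.select_dtypes(include='object').columns:
    encoders[col] = LabelEncoder()
    # Cast to str so remaining NaNs become their own 'nan' category instead of mixing types.
    merged_df[col] = encoders[col].fit_transform(merged_df[col].astype(str))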
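
Since the loop also encodes the target, it helps to be able to map the integer codes back to the original class names when reading reports. The sketch below keeps one fitted encoder per column in a dictionary (the `encoders` name and the toy frame are assumptions for illustration) and uses `inverse_transform` to recover the labels:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "basin": ["Rufiji", "Pangani", "Rufiji"],
    "status_group": ["functional", "non functional", "functional"],
})

# Fit a separate encoder for each column and keep it for later decoding.
encoders = {}
for col in ["basin", "status_group"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col].astype(str))

# LabelEncoder assigns codes in sorted order of the class names.
print(toy)
print(list(encoders["status_group"].classes_))

# Map the integer codes back to the original strings.
decoded = encoders["status_group"].inverse_transform(toy["status_group"])
print(list(decoded))
```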


### Step 10: Split Features and Target

We prepare the independent variables `X` and dependent variable `y` before training.


In [None]:
X = merged_df.drop('status_group', axis=1)
y = merged_df['status_group']


### Step 11: Split Dataset into Training and Testing Sets

We split the dataset into training and test subsets using an 80/20 ratio, stratified on the target so that both subsets keep the same class proportions. This allows us to evaluate model performance on unseen data.


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
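
The effect of `stratify` can be checked directly by comparing class counts before and after the split. A minimal sketch on hypothetical synthetic labels (a 60/30/10 class mix chosen for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels with a 60/30/10 class mix, plus a dummy feature.
y_toy = pd.Series([0] * 60 + [1] * 30 + [2] * 10)
X_toy = pd.DataFrame({"x": np.arange(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification the test set keeps the same class mix as the full data.
test_counts = y_te.value_counts().sort_index().tolist()
train_counts = y_tr.value_counts().sort_index().tolist()
print("Test counts per class:", test_counts)
print("Train counts per class:", train_counts)
```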


### Step 12: Train Baseline Models

We train multiple machine learning classifiers to compare performance:
- Logistic Regression
- Decision Tree
- Random Forest


In [None]:
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42)
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"\n{name} Results")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))
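
A single train/test split gives one accuracy figure per model; cross-validation gives a spread. The sketch below shows how `cross_val_score`, `Pipeline` and `StandardScaler` (all imported in Step 0) could be combined so that Logistic Regression sees scaled features. It runs on synthetic data from `make_classification`, so the scores say nothing about the wells dataset, and the pipeline itself is an assumption rather than part of the original analysis:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for X_train / y_train.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=42
)

# Scaling inside the pipeline is refit on each training fold,
# so no information leaks from the validation fold.
scaled_logreg = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv_scores = cross_val_score(scaled_logreg, X_syn, y_syn, cv=5, scoring="accuracy")
print("Fold accuracies:", cv_scores.round(3))
print("Mean:", cv_scores.mean().round(3), "Std:", cv_scores.std().round(3))
```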


### Step 13: Visualize Confusion Matrix for Best Model

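A confusion matrix shows, for each true class, how many wells the model assigned to each predicted class, which reveals which classes it tends to confuse. We inspect the Random Forest here; check the accuracy figures from the previous step to confirm it is the strongest of the three.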


In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

best_model = models["Random Forest"]
y_pred_best = best_model.predict(X_test)

disp = ConfusionMatrixDisplay.from_estimator(
    best_model, X_test, y_test, cmap='Blues', xticks_rotation='vertical'
)
disp.ax_.set_title("Confusion Matrix - Random Forest")
plt.show()
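
With imbalanced classes, raw counts in the matrix can be hard to compare across rows. Normalizing each row by the number of true samples puts per-class recall on the diagonal. A minimal sketch on hypothetical labels and predictions (invented for illustration, and kept as strings for readability even though the notebook encodes them as integers):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions.
y_true = [
    "functional", "functional", "functional", "functional",
    "non functional", "non functional",
    "functional needs repair", "functional needs repair",
]
y_hat = [
    "functional", "functional", "functional", "non functional",
    "non functional", "non functional",
    "functional", "functional needs repair",
]
labels = ["functional", "functional needs repair", "non functional"]

# normalize="true" divides each row by the number of true samples in that class.
cm = confusion_matrix(y_true, y_hat, labels=labels, normalize="true")

# The diagonal of the row-normalized matrix is the recall of each class.
per_class_recall = dict(zip(labels, np.diag(cm)))
print(cm.round(2))
print(per_class_recall)
```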


### Step 14: Feature Importance from Random Forest

We rank features by the Random Forest's impurity-based `.feature_importances_` and plot the ten highest-scoring ones.


In [None]:
importances = best_model.feature_importances_
features = X.columns
indices = np.argsort(importances)[::-1]

# Plot
plt.figure(figsize=(10, 6))
plt.title("Feature Importances")
sns.barplot(x=importances[indices][:10], y=features[indices][:10])
plt.xlabel("Importance Score")
plt.ylabel("Feature")
plt.tight_layout()
plt.show()
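
Impurity-based importances are computed on the training data and can favor features with many distinct values. As a cross-check, scikit-learn offers `permutation_importance`, which measures how much the held-out score drops when one feature is shuffled. The sketch below runs on synthetic data, so its ranking is illustrative only and is not a result for the wells dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the encoded wells features.
X_syn, y_syn = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42, stratify=y_syn
)

forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature several times on the held-out set and record the score drop.
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=42)

# Feature indices ordered from most to least important.
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f}")
```

Features that rank high under both measures are the more trustworthy candidates for interpretation.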
