# **Predicting Access to Clean Drinking Water**
##**ESG Issue:**
### This analysis covers both Environmental (E) and Social (S) factors, as clean drinking water access is critical for public health, sanitation, and social equity (S) while also being influenced by water availability, infrastructure, and contamination risks (E).

##**Objective of the Analysis:**

### To analyze global access to clean drinking water using machine learning techniques. By using WHO/UNICEF JMP global dataset (Households) on water availability, infrastructure, and demographic factors, we look to predict which regions are most at risk of inadequate water access. The analysis will would also help policymakers and organizations develop data-driven interventions to address water scarcity and inequality.

##**Data Source:**

### https://washdata.org/data

##**Student Name:**
### Abi Joshua George (46656697)


#**Loading Libraries**

In [1]:
# Install required package (if not already installed)
!pip install pycountry_convert

Collecting pycountry_convert
  Downloading pycountry_convert-0.7.2-py3-none-any.whl.metadata (7.2 kB)
Collecting pprintpp>=0.3.0 (from pycountry_convert)
  Downloading pprintpp-0.4.0-py2.py3-none-any.whl.metadata (7.9 kB)
Collecting pycountry>=16.11.27.1 (from pycountry_convert)
  Downloading pycountry-24.6.1-py3-none-any.whl.metadata (12 kB)
Collecting pytest-mock>=1.6.3 (from pycountry_convert)
  Downloading pytest_mock-3.14.0-py3-none-any.whl.metadata (3.8 kB)
Collecting pytest-cov>=2.5.1 (from pycountry_convert)
  Downloading pytest_cov-6.0.0-py3-none-any.whl.metadata (27 kB)
Collecting repoze.lru>=0.7 (from pycountry_convert)
  Downloading repoze.lru-0.7-py3-none-any.whl.metadata (1.1 kB)
Collecting coverage>=7.5 (from coverage[toml]>=7.5->pytest-cov>=2.5.1->pycountry_convert)
  Downloading coverage-7.6.12-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.5 kB)
Downloading pycountry_convert-0.7.2-py3-none-any.whl (13 kB)


In [2]:
# Basic Libraries
import numpy as np
import pandas as pd

# For Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# For Handling Missing Values
from sklearn.impute import SimpleImputer

# For Feature Scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder

# For converting Countires to Continents
from pycountry_convert import country_name_to_country_alpha2, country_alpha2_to_continent_code, convert_continent_code_to_continent_name

# For Machine Learning Models
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from scipy.cluster.hierarchy import dendrogram, linkage

# For Model Evaluation
from sklearn.metrics import accuracy_score, classification_report, silhouette_score, mean_squared_error, mean_absolute_error, r2_score, confusion_matrix, silhouette_score

# **Loading the Dataset**

In [4]:
# Loading the dataset
df = pd.read_csv("clean_drinking_water-4.csv")

# Displaying the first few rows to verify
df.head()

Unnamed: 0,Country,ISO3,Year,Population_Thousands,Urban_Percentage,Rural_At_Least_Basic,Rural_Limited,Rural_Unimproved,Rural_Surface_Water,Annual_Change_Basic,Safely_Managed_Available,Safely_Managed_Contamination_Free,Piped_Water,Non_Piped_Water,SDG_Region,WHO_Region,UNICEF_Reporting_Region
0,Afghanistan,AFG,2000.0,19542.982,22.078001,19.745178,3.134352,47.091285,30.029184,,3.881026,32.440917,9.071579,,3.299203087763079,43.85677699732687,-0.7590817809104919
1,Afghanistan,AFG,2001.0,19688.632,22.169001,19.745178,3.134352,47.091285,30.029184,,3.881026,32.440917,9.071579,,3.299882622392103,43.84344412729671,-0.7590817809104919
2,Afghanistan,AFG,2002.0,21000.256,22.260998,21.981893,3.489409,45.528727,28.999972,,4.018444,30.846839,8.594757,,3.607177205556275,42.26039350298537,-0.7590817809104919
3,Afghanistan,AFG,2003.0,22645.13,22.352999,24.218607,3.844465,43.966168,27.97076,,4.155862,29.252761,8.117936,,3.914071507878746,40.67728187353875,-0.7590817809104919
4,Afghanistan,AFG,2004.0,23553.551,22.5,26.455321,4.199522,42.40361,26.941547,,4.293279,27.658683,7.641115,,4.220616980978035,39.0860017596566,-0.7590817809104919


In [None]:
# Checking basic info about the dataset
df.info()

# Checking for missing values
df.isnull().sum()

# **Data Cleaning**

### **Handling Missing Values:**

In [None]:
# Dropping rows where essential fields are missing
df_cleaned = df.dropna(subset=["Country", "ISO3", "Year"]).copy()

# Imputing missing numerical values using median
num_cols_to_impute = [
    "Rural_At_Least_Basic", "Rural_Limited", "Rural_Unimproved",
    "Annual_Change_Basic", "Safely_Managed_Available", "Safely_Managed_Contamination_Free",
    "Piped_Water", "Non_Piped_Water"
]

for col in num_cols_to_impute:
    df_cleaned[col] = df_cleaned[col].fillna(df_cleaned[col].median())

# Dropping or Imputing Highly Sparse Columns
sparse_cols = ["Rural_Surface_Water", "Annual_Change_Basic"]

for col in sparse_cols:
    if df_cleaned[col].isnull().sum() / len(df_cleaned) > 0.5:  # In case more than 50% is missing
        df_cleaned = df_cleaned.drop(columns=[col])
    else:
        df_cleaned[col] = df_cleaned[col].fillna(df_cleaned[col].median())

# Checking if missing values are resolved
df_cleaned.isnull().sum()


### **Fixing remaining missing values & Converting Data Types:**

In [None]:
# Imputing the remaining missing values with median
df_cleaned["Population_Thousands"] = df_cleaned["Population_Thousands"].fillna(df_cleaned["Population_Thousands"].median())
df_cleaned["Urban_Percentage"] = df_cleaned["Urban_Percentage"].fillna(df_cleaned["Urban_Percentage"].median())

# Converting 'Year' and 'Population_Thousands' to integer data type
df_cleaned["Year"] = df_cleaned["Year"].astype(int)
df_cleaned["Population_Thousands"] = df_cleaned["Population_Thousands"].astype(int)

# Converting categorical regions into string categories
categorical_cols = ["SDG_Region", "WHO_Region", "UNICEF_Reporting_Region"]
df_cleaned[categorical_cols] = df_cleaned[categorical_cols].astype(str)

# Verifying missing values and data types
print(df_cleaned.isnull().sum())  # To ensure no missing values are left
df_cleaned.dtypes  # To verify correct data types


### **Detecting & Handling Outliers:**

In [None]:
# Selecting only numerical columns for outlier detection
num_cols = [
    "Population_Thousands", "Urban_Percentage", "Rural_At_Least_Basic",
    "Rural_Limited", "Rural_Unimproved", "Annual_Change_Basic",
    "Safely_Managed_Available", "Safely_Managed_Contamination_Free",
    "Piped_Water", "Non_Piped_Water"
]

# Plottting boxplots to visualize outliers
plt.figure(figsize=(15,8))
df_cleaned[num_cols].boxplot(rot=45)
plt.title("Boxplot for Outlier Detection")
plt.show()

### **Fixing outliers using capping (Winsorization):**

In [None]:
# Defining the function to cap outliers
def cap_outliers(df, cols, lower_percentile=5, upper_percentile=95):
    for col in cols:
        lower_bound = df[col].quantile(lower_percentile / 100)
        upper_bound = df[col].quantile(upper_percentile / 100)
        df[col] = df[col].clip(lower_bound, upper_bound)
    return df

# Applying capping to only numerical columns
df_cleaned = cap_outliers(df_cleaned, num_cols)

# Verifying whether the outliers are handled
plt.figure(figsize=(15,8))
df_cleaned[num_cols].boxplot(rot=45)
plt.title("Boxplot After Handling Outliers")
plt.show()


# **Exploratory Data Analysis (EDA)**

In [None]:
# Displaying the summary statistics for numerical columns
df_cleaned.describe()

### **Key Insights from Summary Statistics:**
**Population Sizes Vary Greatly,**

**Mean:** 16,794,000 people per country.

**Max:** 103 million which suggests some large nations are included.

**Min:** 14, indicating smaller regions are included.


**Water Access Distribution,**

**Rural_At_Least_Basic:** Mean = 74%, but some regions have as low as 33.86% access.

**Safely_Managed_Available:** Mean = 4.97%, suggesting very few countries have fully safe drinking water.

In [None]:
# Counting the number of unique values in the categorical columns
df_cleaned[["SDG_Region", "WHO_Region", "UNICEF_Reporting_Region"]].nunique()

### **Histogram to View Distributions:**

In [None]:
# Selecting only numerical columns for visualization
num_cols = [
    "Urban_Percentage", "Rural_At_Least_Basic", "Safely_Managed_Available",
    "Safely_Managed_Contamination_Free", "Piped_Water", "Non_Piped_Water"
]

# Plotting multiple histograms for numerical features
df_cleaned[num_cols].hist(figsize=(12, 8), bins=20, edgecolor='black')
plt.suptitle("Distribution of Water Access Features")
plt.show()


### **Key Observations:**

**Urban Percentage,**
- Bimodal distribution: Peaks at low (20%) and high (100%) urbanization levels.
- Suggests some countries are highly urbanized while others remain rural.

**Rural At Least Basic:**
- Strong peak around 75 to 80%: Suggests many regions have moderate water access.
- Some outliers near 40% and 100% which indicates extreme disparities in rural areas.

**Safely Managed Available & Contamination-Free:**
- Data appears highly skewed, with most values clustering around 4-5%.
- Indicates very few regions have widespread access to safely managed water.

**Piped Water & Non-Piped Water:**
- Piped water is concentrated around 2.6%, indicating that most regions have very limited piped water.
- Non-Piped Water includes some negative values

### **Correlation Heatmap:**

In [None]:
# Generating a correlation matrix
plt.figure(figsize=(12, 6))
sns.heatmap(df_cleaned.select_dtypes(include=np.number).corr(), annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5) # Select only numerical features for correlation
plt.title("Feature Correlation Heatmap")
plt.show()


### **Key Observations:**

**Strong Negative Correlation (-0.85) between Rural_At_Least_Basic & Rural_Unimproved,**
- This Makes sense as rural areas have basic water access, they are less likely to have unimproved sources.
- This supports using these variables in predictive models.
Moderate Positive Correlation Between Rural_Limited & Rural_Unimproved (0.35).
- If water access is limited, it is more likely to be unimproved.

**Annual Change in basic water access positively correlates with Non-Piped Water (0.79),**
- Countries increasing water access rely on non-piped sources first before transitioning to piped water.

**Weak Correlation Between Piped Water & Other Factors,**
- Suggests that piped water supply isn’t growing significantly in rural areas.
- This may indicate infrastructure challenges in many countries.

### **Comparing Urban vs Rural Water Access:**

In [None]:
# Plotting a Scatterplot
plt.figure(figsize=(10,6))
sns.scatterplot(x=df_cleaned["Urban_Percentage"], y=df_cleaned["Rural_At_Least_Basic"], alpha=0.5)
plt.title("Urbanization vs. Rural Water Access")
plt.xlabel("Urbanization Percentage")
plt.ylabel("Rural At Least Basic Water Access")
plt.show()

### **Key Observations:**
**No Strong Linear Trend,**

- High urbanization (80 to 100%) does not necessarily mean better rural water access.
- Many highly urbanized regions still have low rural water access (40%).

**Clusters at 100% Water Access,**
- Many countries have perfect water access (100%), regardless of urbanization.
- Likely represents developed countries or those with strong water policies.

**Struggling Regions (40 to 60% Water Access),**
- These countries seem spread across all urbanization levels, suggesting water access issues are not just a rural problem.
- Indicates policy or infrastructure challenges rather than just urbanization effects.

### **Classifying Every Country by Continent:**

In [None]:
# Using a function to get continent from country name
def get_continent(country):
    try:
        country_code = country_name_to_country_alpha2(country)
        continent_code = country_alpha2_to_continent_code(country_code)
        return convert_continent_code_to_continent_name(continent_code)
    except:
        return "Other"  # If country is not found, classify as "Other"

# Applying continent mapping
df_cleaned["Continent"] = df_cleaned["Country"].apply(get_continent)

# Checking for unique continent values
df_cleaned["Continent"].value_counts()



In [None]:
# Plotting a Boxplot
plt.figure(figsize=(10,6))
sns.boxplot(x="Continent", y="Rural_At_Least_Basic", data=df_cleaned)
plt.xticks(rotation=45)  # Rotate labels for better readability
plt.title("Distribution of Rural Water Access Across Continents")
plt.xlabel("Continent")
plt.ylabel("Rural At Least Basic Water Access (%)")
plt.show()

###**Key Observations:**


- **Europe & North America:** Highest median water access with least variation (consistent high access).
- **Africa:** Lowest median access with highest variation (huge disparity among countries). The widest interquartile range (IQR) shows extreme disparities in access.
- **Asia & South America:** Moderate water access, but South America has slightly more variation. Moderate access but several outliers indicating inequality.
- **Oceania:** Mixed access, with some low outliers indicating disparities. Has both high and very low values, suggesting regional inconsistency.
- **Other:** Some countries that are still not assigned a continent.



### **Time trends in Water Access:**

In [None]:
# Plotting a Line Chart for both Rural and Urban Water Access
plt.figure(figsize=(10,6))

# Rural Water Access Trend
sns.lineplot(x="Year", y="Rural_At_Least_Basic", data=df_cleaned, estimator="mean", label="Rural Access", linewidth=2)

# Urban Water Access Trend
sns.lineplot(x="Year", y="Urban_Percentage", data=df_cleaned, estimator="mean", label="Urban Access", linewidth=2)
plt.title("Trend of Rural and Urban Water Access Over Time")
plt.xlabel("Year")
plt.ylabel("Average Water Access (%)")
plt.legend()
plt.show()


### **Key Obervations:**

- Rural Water Access has been increasing steadily over the years, reaching nearly 78-80% in recent years.
- Urban Water Access is consistently lower than rural access, but it is also improving over time, reaching around 65% in recent years.
- The gap between rural and urban access remains significant despite overall improvements.
- The shaded regions indicate variability, meaning that while most countries follow this trend, there are still disparities across regions.

# **Modeling (Regression)**

### **Linear Regression, Decision Tree, Random Forest, K Nearest Neighbors (KNN) Models:**

In [None]:
# Selecting the features (X) and target variable (y)
X = df_cleaned.drop(columns=['Rural_At_Least_Basic', 'Country', 'ISO3', 'WHO_Region', 'SDG_Region', 'UNICEF_Reporting_Region', 'Continent'])  # Dropping 'Continent'
y = df_cleaned['Rural_At_Least_Basic']

# Splitting data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Creating a dictionary to store models and results
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "KNN Regressor": KNeighborsRegressor(n_neighbors=5)  # Added KNN model
}

# Training and testing models
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)

    # Calculating the evaluation metrics
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    print(f"Model: {name}")
    print(f"Mean Absolute Error: {mae:.2f}")
    print(f"Mean Squared Error: {mse:.2f}")
    print(f"R2 Score: {r2:.2f}\n")


### **Visualizing the Results:**

In [None]:
# Defining the models and their evaluation metrics
models = ["Linear Regression", "Decision Tree", "Random Forest", "KNN Regressor"]

# Updated evaluation metrics
mae_scores = [4.66, 0.80, 0.70, 0.92]  # Mean Absolute Error
mse_scores = [50.78, 7.23, 4.22, 5.18]  # Mean Squared Error
r2_scores = [0.82, 0.97, 0.99, 0.98]  # R2 Score

# Creating a figure with subplots for MAE and MSE
fig, ax = plt.subplots(1, 2, figsize=(12, 5))

# Plotting the Mean Absolute Error
ax[0].bar(models, mae_scores, color=['red', 'blue', 'green', 'orange'])
ax[0].set_title("Mean Absolute Error (Lower is Better)")
ax[0].set_ylabel("MAE")
ax[0].set_ylim(0, max(mae_scores) + 1)
ax[0].set_xticklabels(models, rotation=45)

# Plotting the Mean Squared Error
ax[1].bar(models, mse_scores, color=['red', 'blue', 'green', 'orange'])
ax[1].set_title("Mean Squared Error (Lower is Better)")
ax[1].set_ylabel("MSE")
ax[1].set_ylim(0, max(mse_scores) + 5)
ax[1].set_xticklabels(models, rotation=45)

plt.show()

# Creating a separate figure for R2 Scores
plt.figure(figsize=(8, 5))
plt.bar(models, r2_scores, color=['red', 'blue', 'green', 'orange'])
plt.title("R2 Score (Higher is Better)")
plt.ylabel("R2 Score")
plt.ylim(0, 1.1)
plt.xticks(rotation=45)

plt.show()

### **Key Observations:**
- **Random Forest Regressor** performed the best overall. It had the lowest Mean Absolute Error (MAE) (0.70) and Mean Squared Error (MSE) (4.22). The R² score (0.99) indicates an excellent fit to the data, making it the most accurate model.
- **K-Nearest Neighbors (KNN) Regressor** showed competitive performance with an MAE of 0.92 and an MSE of 5.18. The R² score (0.98) suggests that KNN captured the data patterns well, but it was slightly weaker compared to the Random Forest model.
- **Decision Tree** is the second-best model, with an MAE of 0.80 and an MSE of 7.23. The R² score (0.97) is still very strong, but the higher error values suggest it does not generalize as well as Random Forest.
- **Linear Regression** was the weakest performer. It had the highest MAE (4.66) and MSE (50.78), indicating large prediction errors. The R² score (0.82) was significantly lower than the other models, showing that it struggles to capture complex patterns in the data.

**Insights from the Visualization:**
- The bar charts confirm that Linear Regression had the highest error rates, making it the least suitable for this task.
- Random Forest and Decision Tree had the lowest errors, making them the most reliable models.
- KNN performed well, but its errors were slightly higher than the Decision Tree and Random Forest models.

**Conclusion:**
- Random Forest Regressor is the best model for predicting clean drinking water access due to its superior accuracy and minimal errors.
- Decision Tree performed well and is a strong alternative, but it may not generalize as effectively as Random Forest.
- KNN showed strong performance, but slightly higher errors make it less optimal than Random Forest.
- Linear Regression is not suitable for this dataset as it struggles to capture complex relationships.

# **Modeling (Classification)**

### **Logistic Regression, Decision Tree, Random Forest, K Nearest Neighbors (KNN) Models:**

In [None]:
# Converting the target variable into binary classification
threshold = df_cleaned["Rural_At_Least_Basic"].median()  # Use median as the threshold
df_cleaned["Water_Access_Class"] = (df_cleaned["Rural_At_Least_Basic"] >= threshold).astype(int)

# Selecting the Features (X) and Target (Y)
X = df_cleaned.drop(columns=['Rural_At_Least_Basic', 'Water_Access_Class', 'Country', 'ISO3', 'WHO_Region', 'SDG_Region', 'UNICEF_Reporting_Region', 'Continent'])
y = df_cleaned["Water_Access_Class"]

# Splitting the Data into Train and Test Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling the Features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Defining the Classification Models
classification_models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "KNN Classifier": KNeighborsClassifier(n_neighbors=5)
}

# Training and Evaluating the Models
for name, model in classification_models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)

# Evaluating the Model Performance
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Model: {name}")
    print(f"Accuracy: {accuracy:.2f}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    print("-" * 50)

### **Visualizing the Results:**

In [None]:
# Defining the models
models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "KNN Classifier": KNeighborsClassifier(n_neighbors=5)
}

# Training the models and plotting confusion matrices
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()

for i, (name, model) in enumerate(models.items()):
    model.fit(X_train_scaled, y_train)  # Train
    y_pred = model.predict(X_test_scaled)  # Predict

    # Computing the confusion matrix
    cm = confusion_matrix(y_test, y_pred)

    # Plotting the confusion matrix
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", ax=axes[i])
    axes[i].set_title(f"{name} Confusion Matrix")
    axes[i].set_xlabel("Predicted Label")
    axes[i].set_ylabel("True Label")

plt.tight_layout()
plt.show()

### **Key Observations:**
**Overall Model Performance,**

All models (Logistic Regression, Decision Tree, Random Forest, and KNN) demonstrate high accuracy, exceeding 95% in classification performance.
The precision, recall, and F1-scores for both classes (0 and 1) are consistently high, indicating strong predictive capabilities.

**Confusion Matrix Insights,**
- Random Forest has the lowest classification errors (2 false positives and 4 false negatives), making it the most reliable model in terms of precision and recall.
- Decision Tree and KNN Classifiers show very similar performance, with slightly more false positives and false negatives than Random Forest.
- Logistic Regression has the highest number of false negatives (49) compared to other models, suggesting that it struggles to correctly classify some instances of the minority class.

**Comparison of Models,**
- Logistic Regression achieves 95% accuracy, but has higher false negatives (49) compared to tree-based models. It might struggle to learn complex decision boundaries. But is still a good choice when computational efficiency is a priority.
- Decision Tree & KNN Classifier: Both achieve 99% accuracy with very few misclassified points, making them robust choices. Decision Tree is highly interpretable but prone to overfitting. KNN adapts well to complex data but can be computationally expensive.
- Random Forest: Outperforms all models with the highest precision and recall, showing its strength in reducing overfitting and capturing intricate patterns in data.

**Trade-offs Between Models,**
- Logistic Regression is computationally efficient but sacrifices accuracy, especially for complex patterns.
- Decision Tree is interpretable but prone to overfitting.
- Random Forest achieves the highest precision and recall, making it the best choice for accuracy and reliability.
- KNN provides strong performance but may not scale well for large datasets due to computational complexity.

**Conclusion,**
- Random Forest is the best model due to its highest precision, recall, and lowest misclassification rate.
- If interpretability is crucial, Decision Tree is a good choice.
- KNN is also reliable, but it can be computationally expensive on large datasets.
- Logistic Regression should be used if a simpler, linear model is preferred, despite its slightly lower accuracy.

# **Modeling (Clustering)**

### **K-Means, Hierarchical, and DBSCAN Clustering:**

In [None]:
# Selecting the relevant numerical features for clustering
features = ['Urban_Percentage', 'Rural_At_Least_Basic', 'Safely_Managed_Available', 'Piped_Water']
df_clustering = df_cleaned[features].dropna()  # Drop missing values

# Standardizing the features
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_clustering)

In [None]:
# Using Elbow Method to find optimal K for K-Means
inertia = []
silhouette_scores = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(df_scaled)
    inertia.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(df_scaled, kmeans.labels_))

# Plotting the Elbow Method
plt.figure(figsize=(10, 5))
plt.plot(K_range, inertia, marker='o', linestyle='-', color='b')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal K')
plt.show()

# Plotting the Silhouette Score
plt.figure(figsize=(10, 5))
plt.plot(K_range, silhouette_scores, marker='o', linestyle='-', color='g')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score for Optimal K')
plt.show()

In [None]:
# Applying K-Means with optimal K (assuming K=3 based on elbow method)
optimal_k = 3
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
df_clustering['KMeans_Cluster'] = kmeans.fit_predict(df_scaled)

# Visualizing the 3 K-Means Clusters
plt.figure(figsize=(8, 6))
sns.scatterplot(x=df_clustering['Urban_Percentage'], y=df_clustering['Rural_At_Least_Basic'],
                hue=df_clustering['KMeans_Cluster'], palette='viridis', alpha=0.7)
plt.xlabel('Urban Percentage')
plt.ylabel('Rural At Least Basic Water Access')
plt.title('K-Means Clustering')
plt.legend(title='Cluster')
plt.show()


In [None]:
# Performing hierarchical clustering
linked = linkage(df_scaled, method='ward')

# Dendrogram for Hierarchical Clustering
plt.figure(figsize=(12, 6))
dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=True)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Data Points')
plt.ylabel('Euclidean Distance')
plt.show()

In [None]:
# Applying DBSCAN clustering
dbscan = DBSCAN(eps=1, min_samples=5)
df_clustering['DBSCAN_Cluster'] = dbscan.fit_predict(df_scaled)

# Visualizing DBSCAN Clusters
plt.figure(figsize=(8, 6))
sns.scatterplot(x=df_clustering['Urban_Percentage'], y=df_clustering['Rural_At_Least_Basic'],
                hue=df_clustering['DBSCAN_Cluster'], palette='deep', alpha=0.7)
plt.xlabel('Urban Percentage')
plt.ylabel('Rural At Least Basic Water Access')
plt.title('DBSCAN Clustering')
plt.legend(title='Cluster')
plt.show()


In [None]:
# Evaluating K-Means Clustering Performance
kmeans_silhouette = silhouette_score(df_scaled, df_clustering['KMeans_Cluster'])
print(f"K-Means Silhouette Score: {kmeans_silhouette:.2f}")

# Evaluating DBSCAN Clustering Performance
if len(set(df_clustering['DBSCAN_Cluster'])) > 1:
    dbscan_silhouette = silhouette_score(df_scaled, df_clustering['DBSCAN_Cluster'])
    print(f"DBSCAN Silhouette Score: {dbscan_silhouette:.2f}")
else:
    print("DBSCAN did not form meaningful clusters.")


### **Key Observations:**
**Optimal Number of Clusters (Elbow & Silhouette Score),**
- The Elbow Method suggests that the optimal number of clusters is around 3 or 4, as the inertia curve starts to level off after this point.
- The Silhouette Score confirms that K = 3 is the best choice, yielding the highest score of 0.55.
- Clustering with a higher number of clusters (K > 4) results in diminishing silhouette scores, indicating less defined clusters.

**K-Means Clustering,**
- The K-Means clustering plot shows three well-separated clusters based on Urban Percentage and Rural Basic Water Access.
- One cluster represents high rural water access with high urbanization.
- Another cluster includes moderate rural water access with varying urbanization.
- The final cluster represents low rural water access, generally corresponding to lower urbanization.

**Hierarchical Clustering,**
- The dendrogram suggests a natural grouping into three or four clusters, aligning with the findings from K-Means.
- The hierarchical structure reveals that some countries share strong similarities in water access patterns before diverging into subgroups.
- This method provides an interpretable tree-like structure that may be useful for policymakers analyzing regional disparities.

**DBSCAN Clustering,**
- DBSCAN did not perform well, as seen in the clustering plot.
- Most data points were assigned to a single cluster, indicating poor separation between regions.
- The silhouette score for DBSCAN is only 0.12, confirming that this approach struggles with the dataset’s structure.

**Conclusion,**
- K-Means is the best clustering method for this dataset, as it provides well-defined groups with a high silhouette score.
- Hierarchical clustering is useful for interpretation but does not significantly outperform K-Means.
- DBSCAN is not suitable for this dataset, as it fails to capture meaningful clusters.

# **Evaluation & Conclusion**

In [None]:
# Model Names
models = ["Random Forest (Reg)", "KNN (Reg)", "Decision Tree (Reg)", "Linear Regression",
          "Random Forest (Class)", "KNN (Class)", "Decision Tree (Class)", "Logistic Regression",
          "K-Means (Cluster)", "DBSCAN (Cluster)"]

# Regression Metrics (Lower is Better for MAE & MSE, Higher is Better for R²)
mae_scores = [0.70, 0.92, 0.80, 4.66, None, None, None, None, None, None]  # Regression only
mse_scores = [4.22, 5.18, 7.23, 50.78, None, None, None, None, None, None]  # Regression only
r2_scores = [0.99, 0.98, 0.97, 0.82, None, None, None, None, None, None]  # Regression only

# Classification Metrics (Higher is Better)
accuracy_scores = [None, None, None, None, 0.99, 0.99, 0.99, 0.95, None, None]  # Classification only

# Clustering Metrics (Higher Silhouette Score is Better)
silhouette_scores = [None, None, None, None, None, None, None, None, 0.55, 0.12]  # Clustering only

# Plotting Comparison
fig, ax = plt.subplots(2, 2, figsize=(14, 10))

# Regression Metrics
ax[0, 0].bar(models[:4], mae_scores[:4], color=['green', 'orange', 'blue', 'red'])
ax[0, 0].set_title("Mean Absolute Error (Lower is Better)")
ax[0, 0].set_ylabel("MAE")
ax[0, 0].set_ylim(0, max(mae_scores[:4]) + 1)

ax[0, 1].bar(models[:4], mse_scores[:4], color=['green', 'orange', 'blue', 'red'])
ax[0, 1].set_title("Mean Squared Error (Lower is Better)")
ax[0, 1].set_ylabel("MSE")
ax[0, 1].set_ylim(0, max(mse_scores[:4]) + 10)

ax[1, 0].bar(models[:4], r2_scores[:4], color=['green', 'orange', 'blue', 'red'])
ax[1, 0].set_title("R² Score (Higher is Better)")
ax[1, 0].set_ylabel("R²")
ax[1, 0].set_ylim(0, 1.1)

# Classification Accuracy
ax[1, 1].bar(models[4:8], accuracy_scores[4:8], color=['green', 'orange', 'blue', 'red'])
ax[1, 1].set_title("Classification Accuracy (Higher is Better)")
ax[1, 1].set_ylabel("Accuracy")
ax[1, 1].set_ylim(0.9, 1)

plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Clustering Silhouette Scores
plt.figure(figsize=(8, 5))
plt.bar(models[8:], silhouette_scores[8:], color=['purple', 'brown'])
plt.title("Clustering Silhouette Scores (Higher is Better)")
plt.ylabel("Silhouette Score")
plt.ylim(0, 1)
plt.xticks(rotation=45)
plt.show()


### **Selecting the Overall Best Model:**

**Regression Models,**
- Random Forest Regressor performed the best, with the lowest Mean Absolute Error (MAE) and Mean Squared Error (MSE), and the highest R² score, indicating strong predictive power.
- K-Nearest Neighbors (KNN) Regressor also performed well but had slightly higher error values.
- Linear Regression performed the worst, struggling to capture the complex relationships in the data.
- **Best Regression Model:** *Random Forest Regressor*

**Classification Models,**
- Random Forest, Decision Tree, and KNN classifiers all performed excellently, achieving 99% accuracy.
- Logistic Regression had the lowest accuracy (around 95%), making it less suitable for this task.
- **Best Classification Model:** *Random Forest Classifier*

**Clustering Models,**
- K-Means Clustering had a significantly higher silhouette score (0.55) compared to DBSCAN (0.12), indicating that K-Means formed more well-defined clusters.
- DBSCAN struggled with identifying meaningful clusters, likely due to the nature of the dataset.
- **Best Clustering Model:** *K-Means Clustering*

### **Overall Best Model:**
- Random Forest models dominated in both Regression and Classification tasks, proving to be the most reliable across different types of predictions.
- For clustering, K-Means was the most effective method.
- Thus, if we need to recommend a single most effective modeling approach, **Random Forest** emerges as the best option due to its strong performance in both regression and classification.



###**Using the best overall model on the dataset:**

In [None]:
# Using the best model: Random Forest Regressor
best_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Training the model on the full dataset
best_model.fit(X_train_scaled, y_train)

# Making the predictions
y_pred = best_model.predict(X_test_scaled)

# Evaluating the model performance
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Printing evaluation results
print("Final Model Performance on Test Data:")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R² Score: {r2:.2f}")

# Plotting actual vs predicted values
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.5, color="blue", label="Predictions")
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], linestyle="--", color="red", label="Perfect Fit")
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs Predicted Values (Random Forest Regressor)")
plt.legend()
plt.show()

###**Conclusion:**
After selecting the Random Forest Regressor as the best-performing model, it was applied to the dataset to generate predictions for rural water access levels. The model evaluation gave: an MAE of 0.02, MSE of 0.01, and R2 score of 0.97, which all represent a very accurate prediction.

The plot of actual versus predicted values confirmed that the model was successful in capturing the distribution of water access levels, with most predictions hovering around the predicted values. A near-perfect diagonal line on the plot indicates low error and strong model fit.

Overall, this results show that the Random Forest Regressor is a viable means of assessing water accessibility,providing policymakers with valuable insights for resource allocation and infrastructure planning. urther enhancements, such as incorporating additional socioeconomic and climate-related factors, could improve the model’s robustness for long-term decision-making.