# Q3 – Recent Water Data and pH Classification

This project analyzes multi-location water pH data from the Florida WaterAtlas dataset.
The goal is to identify safe and unsafe water conditions using AI classification models
and visualize unsafe locations geographically.


## 1. Data Preparation

The dataset contains water quality measurements across multiple locations.  
Only pH measurements are selected for this analysis. A new column called **SAFE-PH** is created based on environmental standards:

- Safe range: 6.5 – 8.5  
- Unsafe: < 6.5 or > 8.5


In [1]:
import pandas as pd
import numpy as np

url = "https://raw.githubusercontent.com/biplav-s/course-tai/main/sample-code/common-data/water/WaterAtlas-ManySites.csv"

df_q3 = pd.read_csv(url, engine='python', on_bad_lines='skip')

df_q3.head()



Unnamed: 0,DataSourceName,DataSourceCode,StationID,ActualStationID,Latitude_DD,Longitude_DD,SampleDate,SampleTime,ActivityDepth,ActivityDepthUnit,Characteristic,ResultValue,ResultUnit,ValueQualifier,ResultComment,WaterbodyID,WaterbodyName
0,LAKEWATCH Supplemental Water Quality Sampling,LAKEWATCH_SUPP,Bugg Springs-Lake,,28.75361,-81.90444,1991-08-18 00:00:00.000,00:00:00,,,pH,7.5,,,,8509,Bugg Spring
1,LAKEWATCH Supplemental Water Quality Sampling,LAKEWATCH_SUPP,Bugg Springs-Lake,,28.75361,-81.90444,1991-08-18 00:00:00.000,00:00:00,,,Phosphorus as P,70.0,ug/l,,,8509,Bugg Spring
2,LAKEWATCH Supplemental Water Quality Sampling,LAKEWATCH_SUPP,Bugg Springs-Lake,,28.75361,-81.90444,1991-08-18 00:00:00.000,00:00:00,,,Specific conductance,270.0,umho,,,8509,Bugg Spring
3,LAKEWATCH Supplemental Water Quality Sampling,LAKEWATCH_SUPP,Bugg Springs-Lake,,28.75361,-81.90444,1991-08-18 00:00:00.000,00:00:00,,,Nitrogen,670.0,ug/l,,,8509,Bugg Spring
4,LAKEWATCH Supplemental Water Quality Sampling,LAKEWATCH_SUPP,Bugg Springs-Lake,,28.75361,-81.90444,1991-08-18 00:00:00.000,00:00:00,,,Sodium,5.0,mg/l,,,8509,Bugg Spring
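The `on_bad_lines='skip'` option drops malformed rows (for example, rows with more fields than the header) without raising an error, so it can be worth checking how many rows were lost. A minimal sketch on a small in-memory CSV; the sample rows are made up for illustration and are not taken from the WaterAtlas file:

```python
import io
import pandas as pd

# Hypothetical three-column CSV; the third data row has an extra field.
raw = (
    "StationID,Characteristic,ResultValue\n"
    "A,pH,7.5\n"
    "B,pH,6.1\n"
    "C,pH,8.0,unexpected\n"
)

# Same reader options as the notebook: malformed rows are dropped without error.
df_demo = pd.read_csv(io.StringIO(raw), engine="python", on_bad_lines="skip")

# Compare parsed rows with the number of data lines to see how many were skipped.
n_data_lines = len(raw.strip().splitlines()) - 1  # exclude the header line
n_skipped = n_data_lines - len(df_demo)
print(f"parsed={len(df_demo)}, skipped={n_skipped}")
```

Passing `on_bad_lines='warn'` instead reports each dropped row, which may be preferable while exploring a new file.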


In [2]:
# Keep only pH measurements
df_ph = df_q3[df_q3['Characteristic'] == 'pH'].copy()

df_ph.head()


Unnamed: 0,DataSourceName,DataSourceCode,StationID,ActualStationID,Latitude_DD,Longitude_DD,SampleDate,SampleTime,ActivityDepth,ActivityDepthUnit,Characteristic,ResultValue,ResultUnit,ValueQualifier,ResultComment,WaterbodyID,WaterbodyName
0,LAKEWATCH Supplemental Water Quality Sampling,LAKEWATCH_SUPP,Bugg Springs-Lake,,28.75361,-81.90444,1991-08-18 00:00:00.000,00:00:00,,,pH,7.5,,,,8509,Bugg Spring
13,LAKEWATCH Supplemental Water Quality Sampling,LAKEWATCH_SUPP,Bugg Springs-Lake,,28.75361,-81.90444,1991-03-10 00:00:00.000,00:00:00,,,pH,7.7,,,,8509,Bugg Spring
40,LAKEWATCH Supplemental Water Quality Sampling,LAKEWATCH_SUPP,Bugg Springs-Lake,,28.75361,-81.90444,1991-08-18 00:00:00.000,00:00:00,,,pH,7.6,,,,8509,Bugg Spring
59,LAKEWATCH Supplemental Water Quality Sampling,LAKEWATCH_SUPP,Church-Lake,,28.64625,-81.84342,1995-02-13 00:00:00.000,00:00:00,,,pH,5.9,,,,7844,Church Lake
70,LAKEWATCH Supplemental Water Quality Sampling,LAKEWATCH_SUPP,Turkey-Lake,,28.70128,-81.85039,1995-02-13 00:00:00.000,00:00:00,,,pH,6.1,,,,8186,Turkey Lake


In [3]:
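# Label each sample: 'yes' if pH is within [6.5, 8.5] (inclusive), otherwise 'no'.
# Note: a missing value (NaN) fails the comparison and is therefore labeled 'no'.
df_ph['SAFE-PH'] = df_ph['ResultValue'].apply(
    lambda x: 'yes' if 6.5 <= x <= 8.5 else 'no'
)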

df_ph[['ResultValue','SAFE-PH']].head()


Unnamed: 0,ResultValue,SAFE-PH
0,7.5,yes
13,7.7,yes
40,7.6,yes
59,5.9,no
70,6.1,no
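The labeling rule can also be written without a row-wise `apply`. The sketch below is one possible vectorized variant, not the notebook's method: the helper name `label_safe_ph` is hypothetical, and it assumes that missing or non-numeric readings should stay unlabeled rather than be counted as unsafe.

```python
import numpy as np
import pandas as pd

SAFE_LOW, SAFE_HIGH = 6.5, 8.5  # inclusive bounds of the safe range

def label_safe_ph(values: pd.Series) -> pd.Series:
    """Return 'yes'/'no' labels; missing or non-numeric readings stay missing."""
    numeric = pd.to_numeric(values, errors="coerce")  # non-numeric -> NaN
    in_range = numeric.between(SAFE_LOW, SAFE_HIGH)   # inclusive on both ends
    labels = pd.Series(np.where(in_range, "yes", "no"),
                       index=values.index, dtype="object")
    return labels.mask(numeric.isna())  # blank out rows without a valid reading

sample = pd.Series([7.5, 5.9, 6.5, 8.5, 8.6, None, "n/a"])
print(label_safe_ph(sample).tolist())
```

Keeping invalid readings unlabeled makes it possible to drop or inspect them before training instead of silently adding them to the unsafe class.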


### Distribution of Safe vs Unsafe pH

Understanding the class balance helps interpret model performance.


In [4]:
df_ph['SAFE-PH'].value_counts()


SAFE-PH
yes    386
no      86
Name: count, dtype: int64
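With roughly four safe samples for every unsafe one, a classifier that always predicts "yes" already looks accurate. The short calculation below uses the counts printed above to derive that majority-class baseline, which any trained model should beat:

```python
# Class counts copied from the value_counts() output above.
counts = {"yes": 386, "no": 86}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(f"total samples: {total}")
print(f"majority class: {majority_label}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
```

The baseline is about 0.818, so accuracy alone is less informative here than the per-class precision and recall.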

## 2. Model Training

Two classification algorithms are used:

- Logistic Regression
- Decision Tree

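20% of the data is held out for testing, as required. The split is stratified so that the training and test sets keep the same ratio of safe to unsafe samples.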


In [5]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

X = df_ph[['ResultValue']]
y = df_ph['SAFE-PH']

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded,
    test_size=0.2,
    random_state=42,
    stratify=y_encoded
)
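To see what `stratify` does, the sketch below repeats the split on synthetic labels that have the same class counts as this dataset (386 safe, 86 unsafe). The feature column is a placeholder; only the label proportions matter here. `LabelEncoder` sorts labels alphabetically, so `no` maps to 0 and `yes` to 1.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the notebook's class counts (1 = safe, 0 = unsafe).
y_demo = np.array([1] * 386 + [0] * 86)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The safe fraction should be nearly identical in both subsets.
print("train size:", len(y_tr), "safe fraction:", round(float(y_tr.mean()), 3))
print("test size: ", len(y_te), "safe fraction:", round(float(y_te.mean()), 3))
```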



In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

model_lr = LogisticRegression(max_iter=2000)
model_lr.fit(X_train, y_train)

y_pred_lr = model_lr.predict(X_test)

print("Logistic Regression Results:")
print(classification_report(y_test, y_pred_lr))


Logistic Regression Results:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        17
           1       1.00      1.00      1.00        78

    accuracy                           1.00        95
   macro avg       1.00      1.00      1.00        95
weighted avg       1.00      1.00      1.00        95
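One general caveat, stated as a property of the model rather than a finding about this dataset: logistic regression on a single raw feature has only one decision threshold, so it cannot by itself represent "unsafe below 6.5 and unsafe above 8.5". The sketch below uses synthetic pH values spread on both sides of the safe range and a hypothetical engineered feature, `dist_from_mid`, to illustrate the difference.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic pH readings on both sides of the safe range (illustration only).
ph = rng.uniform(4.0, 11.0, size=600)
safe = ((ph >= 6.5) & (ph <= 8.5)).astype(int)

# Raw pH: a single threshold cannot capture a two-sided rule.
X_raw = ph.reshape(-1, 1)
acc_raw = LogisticRegression(max_iter=2000).fit(X_raw, safe).score(X_raw, safe)

# Hypothetical engineered feature: distance from the midpoint of the safe range.
dist_from_mid = np.abs(ph - 7.5).reshape(-1, 1)
acc_dist = (
    LogisticRegression(max_iter=2000)
    .fit(dist_from_mid, safe)
    .score(dist_from_mid, safe)
)

# Both scores are training accuracy, which is enough to show the contrast.
print(f"raw pH accuracy:           {acc_raw:.3f}")
print(f"distance-feature accuracy: {acc_dist:.3f}")
```

If the unsafe samples in a dataset fall almost entirely on one side of the range, the raw feature can still score very well, which may be the case here.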



In [7]:
from sklearn.tree import DecisionTreeClassifier

model_dt = DecisionTreeClassifier(random_state=42)
model_dt.fit(X_train, y_train)

y_pred_dt = model_dt.predict(X_test)

print("Decision Tree Results:")
print(classification_report(y_test, y_pred_dt))


Decision Tree Results:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        17
           1       1.00      1.00      1.00        78

    accuracy                           1.00        95
   macro avg       1.00      1.00      1.00        95
weighted avg       1.00      1.00      1.00        95
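A perfect score is expected for a decision tree when the label is a threshold rule on the only input feature. As a sanity check, the learned split points can be printed with `export_text`. The sketch below does this on synthetic pH values labeled with the same rule; it is an illustration, not output from the notebook's `model_dt`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)

# Synthetic pH readings labeled with the same rule as SAFE-PH (illustration only).
ph = rng.uniform(4.0, 11.0, size=500).reshape(-1, 1)
safe = ((ph[:, 0] >= 6.5) & (ph[:, 0] <= 8.5)).astype(int)

tree = DecisionTreeClassifier(random_state=42).fit(ph, safe)

# Print the learned split points; they should sit close to 6.5 and 8.5.
rules = export_text(tree, feature_names=["ResultValue"])
print(rules)
print("training accuracy:", tree.score(ph, safe))
```

Running `export_text(model_dt, feature_names=['ResultValue'])` on the fitted notebook model would show its thresholds in the same format.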



### Cross Validation

To check that the results do not depend on a single train/test split, 10-fold cross-validation is applied to the full pH dataset.


In [8]:
from sklearn.model_selection import cross_val_score

scores_lr = cross_val_score(model_lr, X, y_encoded, cv=10)
scores_dt = cross_val_score(model_dt, X, y_encoded, cv=10)

print("LR 10-fold accuracy:", scores_lr.mean())
print("DT 10-fold accuracy:", scores_dt.mean())


LR 10-fold accuracy: 0.9936170212765958
DT 10-fold accuracy: 1.0
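The mean hides how much the folds differ. With an integer `cv` and a classifier, `cross_val_score` already uses stratified folds, but without shuffling. The sketch below is a variant on synthetic data (not the WaterAtlas samples) that passes an explicit shuffled `StratifiedKFold` and reports the per-fold scores and their spread:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)

# Synthetic stand-in for the pH feature and SAFE-PH label (illustration only).
ph = rng.uniform(5.0, 9.5, size=472).reshape(-1, 1)
safe = ((ph[:, 0] >= 6.5) & (ph[:, 0] <= 8.5)).astype(int)

# Shuffled, stratified folds with a fixed seed so the result is repeatable.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(random_state=42), ph, safe, cv=cv)

print("per-fold accuracy:", np.round(scores, 3))
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```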


## 3. Geographic Analysis of Unsafe Water by Location

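To understand the sustainability impact of water quality, the unsafe samples were analyzed geographically using each sample's latitude and longitude.

Unsafe pH occurrences were counted per waterbody and sampling coordinates, so a waterbody with several sampling sites (for example, Church Lake) appears in more than one row. Sites with no unsafe samples do not appear in the table. The counts show which sites have the most and the fewest unsafe samples, which helps identify areas that need environmental monitoring and intervention.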


In [9]:
# Count unsafe occurrences by waterbody and sampling coordinates
unsafe_counts = (
    df_ph[df_ph['SAFE-PH'] == 'no']
    .groupby(['WaterbodyName','Latitude_DD','Longitude_DD'])
    .size()
    .reset_index(name='UnsafeCount')
)

unsafe_counts.head()



Unnamed: 0,WaterbodyName,Latitude_DD,Longitude_DD,UnsafeCount
0,Bugg Spring Run,28.7525,-81.901667,1
1,Church Lake,28.642111,-81.840597,13
2,Church Lake,28.645,-81.8464,11
3,Church Lake,28.64625,-81.84342,1
4,Kess Lake,28.663611,-81.849444,1
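Raw counts favor sites that were sampled more often. One way to account for sampling effort is to report the share of samples that are unsafe. The sketch below shows the idea on a made-up mini-table with the same column names as `df_ph`; the waterbody names and values are hypothetical.

```python
import pandas as pd

# Hypothetical mini-sample with the same columns as df_ph (values are made up).
demo = pd.DataFrame({
    "WaterbodyName": ["Lake A", "Lake A", "Lake A", "Lake A", "River B", "River B"],
    "SAFE-PH":       ["no",     "yes",    "yes",    "yes",    "no",      "no"],
})

summary = (
    demo.assign(is_unsafe=demo["SAFE-PH"].eq("no"))
        .groupby("WaterbodyName")
        .agg(Samples=("is_unsafe", "size"),      # all samples per waterbody
             UnsafeCount=("is_unsafe", "sum"))   # samples labeled 'no'
)
summary["UnsafeRate"] = summary["UnsafeCount"] / summary["Samples"]

print(summary.sort_values("UnsafeRate", ascending=False))
```

Unlike the count table above, this summary also keeps waterbodies whose samples are all safe, with a rate of 0.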


In [10]:
# For each waterbody, keep its sampling site with the highest unsafe count
print("Top locations with MOST unsafe water:")
display(unsafe_counts.sort_values('UnsafeCount', ascending=False).drop_duplicates(subset=['WaterbodyName']))

# For each waterbody, keep its sampling site with the lowest unsafe count
print("\nLocations with LEAST unsafe water:")
display(unsafe_counts.sort_values('UnsafeCount', ascending=True).drop_duplicates(subset=['WaterbodyName']))


Top locations with MOST unsafe water:


Unnamed: 0,WaterbodyName,Latitude_DD,Longitude_DD,UnsafeCount
14,Palatlakaha River,28.748033,-81.874853,24
1,Church Lake,28.642111,-81.840597,13
5,Kess Lake,28.664713,-81.84201,2
17,Turkey Lake,28.70128,-81.85039,2
0,Bugg Spring Run,28.7525,-81.901667,1
6,Moon Lake,28.632746,-81.859896,1



Locations with LEAST unsafe water:


Unnamed: 0,WaterbodyName,Latitude_DD,Longitude_DD,UnsafeCount
0,Bugg Spring Run,28.7525,-81.901667,1
3,Church Lake,28.64625,-81.84342,1
6,Moon Lake,28.632746,-81.859896,1
4,Kess Lake,28.663611,-81.849444,1
11,Palatlakaha River,28.720303,-81.884551,1
17,Turkey Lake,28.70128,-81.85039,2
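The tables above show one sampling site per waterbody. To rank whole waterbodies, the site-level counts can be summed instead. The sketch below uses only the five site rows printed by `unsafe_counts.head()` earlier in this section, so the totals for waterbodies with further sites (such as Kess Lake) are partial:

```python
import pandas as pd

# The five site-level rows shown by unsafe_counts.head() in this section.
stations = pd.DataFrame({
    "WaterbodyName": ["Bugg Spring Run", "Church Lake", "Church Lake",
                      "Church Lake", "Kess Lake"],
    "Latitude_DD":   [28.7525, 28.642111, 28.645, 28.64625, 28.663611],
    "Longitude_DD":  [-81.901667, -81.840597, -81.8464, -81.84342, -81.849444],
    "UnsafeCount":   [1, 13, 11, 1, 1],
})

per_waterbody = (
    stations.groupby("WaterbodyName")
            .agg(Sites=("UnsafeCount", "size"),            # number of sampling sites
                 TotalUnsafe=("UnsafeCount", "sum"),       # unsafe samples across sites
                 MaxAtOneSite=("UnsafeCount", "max"))      # worst single site
            .sort_values("TotalUnsafe", ascending=False)
)
print(per_waterbody)
```

Applying the same `groupby` to the full `unsafe_counts` table would give complete totals for every waterbody.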


### Geographic Distribution of Unsafe pH Measurements in Florida



In [11]:
import folium

# Center the map roughly on Florida
m = folium.Map(location=[28.5, -81.5], zoom_start=7)

# Add unsafe locations as markers
for _, row in unsafe_counts.iterrows():
    folium.CircleMarker(
        location=[row['Latitude_DD'], row['Longitude_DD']],
        radius=4 + row['UnsafeCount'],   # size reflects severity
        popup=f"{row['WaterbodyName']} | UnsafeCount={row['UnsafeCount']}",
        color='red',
        fill=True,
        fill_opacity=0.6
    ).add_to(m)

m
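Because a circle's area grows with the square of its radius, a radius that increases linearly with the count can make the largest sites look disproportionately severe. One alternative, sketched below, scales the radius with the square root of the count and caps it. The helper `marker_radius` and its default parameters are hypothetical choices, not part of the original notebook.

```python
import math

def marker_radius(unsafe_count: int, base: float = 4.0, scale: float = 3.0,
                  max_radius: float = 25.0) -> float:
    """Hypothetical helper: radius grows with sqrt(count) and is capped."""
    if unsafe_count < 0:
        raise ValueError("unsafe_count must be non-negative")
    return min(base + scale * math.sqrt(unsafe_count), max_radius)

# Compare radii for a few of the counts that appear in this section.
for count in (1, 11, 24):
    print(count, round(marker_radius(count), 1))
```

In the loop above, `radius=marker_radius(row['UnsafeCount'])` would replace `4 + row['UnsafeCount']`. Calling `m.save("unsafe_ph_map.html")` (the file name is only an example) writes the map to an HTML file for viewing outside the notebook.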



In [12]:
# Show top unsafe locations
print("Top locations with most unsafe pH occurrences:")
display(unsafe_counts.sort_values(by="UnsafeCount", ascending=False).head())

print("\nLocations with least unsafe pH occurrences:")
display(unsafe_counts.sort_values(by="UnsafeCount", ascending=True).head())


Top locations with most unsafe pH occurrences:


Unnamed: 0,WaterbodyName,Latitude_DD,Longitude_DD,UnsafeCount
14,Palatlakaha River,28.748033,-81.874853,24
1,Church Lake,28.642111,-81.840597,13
13,Palatlakaha River,28.7442,-81.8728,11
2,Church Lake,28.645,-81.8464,11
7,Palatlakaha River,28.679167,-81.884778,10



Locations with least unsafe pH occurrences:


Unnamed: 0,WaterbodyName,Latitude_DD,Longitude_DD,UnsafeCount
0,Bugg Spring Run,28.7525,-81.901667,1
3,Church Lake,28.64625,-81.84342,1
6,Moon Lake,28.632746,-81.859896,1
4,Kess Lake,28.663611,-81.849444,1
11,Palatlakaha River,28.720303,-81.884551,1


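This analysis demonstrates how machine learning can be applied to real-world environmental monitoring tasks using recent multi-location water quality data. A SAFE-PH indicator was created from established environmental standards, and Logistic Regression and Decision Tree classifiers were trained to predict whether a water sample falls within the safe pH range. Both models scored 1.00 on precision, recall, and F1 for the 95 held-out samples, and 10-fold cross-validation gave mean accuracies of about 0.994 (Logistic Regression) and 1.00 (Decision Tree). Because SAFE-PH is computed directly from the pH value that serves as the only input feature, these scores show that both models recovered the threshold rule; they do not measure how well safety could be predicted from other measurements.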

Beyond model performance, the spatial analysis showed that some water bodies, particularly Palatlakaha River and Church Lake, had more unsafe pH samples than other locations; their most affected sampling sites recorded 24 and 13 unsafe samples, respectively. Mapping these occurrences provided geographic insight, allowing unsafe sites to be identified visually and prioritized for monitoring or intervention. Overall, this workflow highlights how data science supports sustainability by turning environmental measurements into actionable information, so that stakeholders can better understand water quality risks and direct resources to the locations with the greatest need.