# Water Quality Prediction System (Rwanda Chapter)

### Life cycle of a machine learning project

* Research on previous work
* Data Collection
* Exploratory Data Analysis
* Preprocessing and feature engineering
* Model Development
* Model Training
* Model Analysis and Interpretation
* App Development


## 1.) Project Problem Statement

* Access to clean water is a critical challenge in many parts of the world, including Rwanda. Water quality prediction is important for ensuring the availability of safe and clean water for drinking, agriculture, and other purposes.
* However, traditional methods for water quality prediction are often time-consuming and costly, and they may not provide accurate and timely information. 
* To address this challenge, the Omdena Rwanda Chapter has initiated a project to develop an automated water quality prediction system using machine learning.

## 2.) Data Collection

* Data Source: https://drive.google.com/drive/folders/1_KJ09bHckVYVG_2ZHbRHIDCVtM04wpZI?usp=sharing
* The data consists of 18 columns and 10,000 rows.

## 2.1 Import Data and Required Libraries

#### NumPy, Pandas, Matplotlib, Seaborn, Warnings, and scikit-learn

In [1]:
import numpy as np 
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt 
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split

## Import the CSV Data as Pandas DataFrame

In [2]:
df = pd.read_csv("Synthetic_Data_Water_Quality.csv")

## Show Top 5 Records

In [3]:
df.head()

Unnamed: 0,Colour (TCU),Turbidity (NTU),pH,Conductivity (uS/cm),Total Dissolved Solids (mg/l),Total Hardness (mg/l as CaCO3),Aluminium (mg/l),Chloride (mg/l),Total Iron (mg/l),Sodium (mg/l),Sulphate (mg/l),Zinc (mg/l),Magnesium (mg/l),Calcium (mg/l),Potassium (mg/l),Nitrate (mg/l),Phosphate (mg/l),Potability
0,8.34,3.39,8.06,819.0,787.15,279.89,0.09,129.3,0.22,13.13,81.01,2.24,12.69,107.95,17.5,22.23,0.41,potable
1,14.45,3.36,8.28,1371.1,779.66,112.04,0.2,163.73,0.13,127.48,307.99,4.05,52.01,107.12,45.28,16.06,0.68,potable
2,3.87,4.23,6.86,202.75,485.1,113.17,0.15,66.68,0.29,142.97,16.7,0.86,88.47,127.47,4.9,19.81,0.91,potable
3,14.57,1.75,7.0,696.16,409.71,140.39,0.06,102.42,0.15,194.07,393.09,2.6,61.36,99.16,36.73,42.82,0.02,potable
4,9.01,2.2,6.73,129.24,343.55,6.52,0.07,140.47,0.28,3.77,170.65,0.04,47.22,107.17,44.79,14.35,2.08,potable


## Shape of the dataset

In [4]:
df.shape

(10000, 18)

## 2.2.) Dataset information
* pH
* Aluminium (mg/l)
* Chloride (mg/l)
* Total Iron (mg/l)
* Sulphate (mg/l)
* Zinc (mg/l)
* Magnesium (mg/l)	
* Calcium (mg/l)	
* Potassium (mg/l)	
* Nitrate (mg/l)	
* Phosphate (mg/l)
* Potability -> (Potable/non-potable)
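
Since `Potability` is the target column, a later modelling step would typically separate it from the feature columns before calling `train_test_split`. The sketch below shows one way to do that. The toy values are made up for illustration and are not taken from the dataset, and the exact spelling of the negative label (`"non-potable"`) is an assumption based on the description above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with made-up values; only the column names mirror the dataset.
# The label string "non-potable" is assumed, not confirmed by the output above.
toy = pd.DataFrame({
    "pH": [7.0, 7.2, 6.8, 7.5, 3.1, 11.9, 2.4, 12.5],
    "Turbidity (NTU)": [1.0, 2.0, 1.5, 0.5, 8.0, 9.5, 7.0, 9.0],
    "Potability": ["potable"] * 4 + ["non-potable"] * 4,
})

# Features are every column except the target.
X = toy.drop(columns="Potability")

# Encode the target as 1 (potable) and 0 (anything else).
y = (toy["Potability"] == "potable").astype(int)

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```

On the real data, `toy` would be replaced by `df`; the `test_size` and `random_state` values here are arbitrary choices.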

## 3.) Data Checks to Perform


* Check Missing values
* Check Duplicates
* Check data type
* Check the number of unique values of each column
* Check statistics of data set
* Check the categories present in each categorical column
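
The checks listed above can also be gathered into a single helper, which is convenient when the same checks need to be rerun after cleaning. This is a minimal sketch: `run_data_checks` is a hypothetical name, and the three-row `sample` frame only reuses a few values from the `df.head()` output so that the example runs without the CSV file.

```python
import pandas as pd

def run_data_checks(df: pd.DataFrame) -> dict:
    """Collect the basic data-quality checks into one dictionary."""
    # Treat every non-numeric column as categorical.
    categorical = df.select_dtypes(exclude="number").columns
    return {
        "missing_per_column": df.isna().sum(),
        "duplicate_rows": int(df.duplicated().sum()),
        "dtypes": df.dtypes,
        "unique_per_column": df.nunique(),
        "statistics": df.describe(),
        "categories": {col: sorted(df[col].unique()) for col in categorical},
    }

# Small illustrative frame built from the first rows shown earlier.
sample = pd.DataFrame({
    "pH": [8.06, 8.28, 6.86],
    "Turbidity (NTU)": [3.39, 3.36, 4.23],
    "Potability": ["potable", "potable", "potable"],
})

report = run_data_checks(sample)
print(report["missing_per_column"])
print("Duplicate rows:", report["duplicate_rows"])
print("Categories:", report["categories"])
```

Calling `run_data_checks(df)` on the full DataFrame would return the same quantities that the individual cells below compute one at a time.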


## 3.1 Check Missing values

In [7]:
df.isna().sum()

Colour (TCU)                      0
Turbidity (NTU)                   0
pH                                0
Conductivity (uS/cm)              0
Total Dissolved Solids (mg/l)     0
Total Hardness (mg/l as CaCO3)    0
Aluminium (mg/l)                  0
Chloride (mg/l)                   0
Total Iron (mg/l)                 0
Sodium (mg/l)                     0
Sulphate (mg/l)                   0
Zinc (mg/l)                       0
Magnesium (mg/l)                  0
Calcium (mg/l)                    0
Potassium (mg/l)                  0
Nitrate (mg/l)                    0
Phosphate (mg/l)                  0
Potability                        0
dtype: int64

***There are no missing values in the dataset***

## 3.2 Check Duplicates

In [8]:
df.duplicated().sum()

0

***There are no duplicate rows in the dataset***

## 3.3 Check data types

In [9]:
# Check non-null counts and dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Colour (TCU)                    10000 non-null  float64
 1   Turbidity (NTU)                 10000 non-null  float64
 2   pH                              10000 non-null  float64
 3   Conductivity (uS/cm)            10000 non-null  float64
 4   Total Dissolved Solids (mg/l)   10000 non-null  float64
 5   Total Hardness (mg/l as CaCO3)  10000 non-null  float64
 6   Aluminium (mg/l)                10000 non-null  float64
 7   Chloride (mg/l)                 10000 non-null  float64
 8   Total Iron (mg/l)               10000 non-null  float64
 9   Sodium (mg/l)                   10000 non-null  float64
 10  Sulphate (mg/l)                 10000 non-null  float64
 11  Zinc (mg/l)                     10000 non-null  float64
 12  Magnesium (mg/l)                1

## 3.4 Checking the number of unique values of each column

In [10]:
df.nunique()

Colour (TCU)                      2907
Turbidity (NTU)                   1001
pH                                1375
Conductivity (uS/cm)              9824
Total Dissolved Solids (mg/l)     9745
Total Hardness (mg/l as CaCO3)    9218
Aluminium (mg/l)                    41
Chloride (mg/l)                   9049
Total Iron (mg/l)                   61
Sodium (mg/l)                     8811
Sulphate (mg/l)                   9386
Zinc (mg/l)                       1001
Magnesium (mg/l)                  7931
Calcium (mg/l)                    8488
Potassium (mg/l)                  6293
Nitrate (mg/l)                    6033
Phosphate (mg/l)                   441
Potability                           2
dtype: int64
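
`Potability` is the only column with just two unique values, so it is worth looking at how the rows are distributed between them. The sketch below is one possible way to do that with `value_counts`. `class_distribution` is a hypothetical helper, the four labels are illustrative rather than taken from the dataset, and the `"non-potable"` spelling is an assumption.

```python
import pandas as pd

def class_distribution(labels: pd.Series) -> pd.DataFrame:
    """Return the absolute count and relative share of each category."""
    counts = labels.value_counts()
    share = labels.value_counts(normalize=True)
    return pd.DataFrame({"count": counts, "share": share})

# Illustrative labels only; the real column would be df["Potability"].
labels = pd.Series(
    ["potable", "potable", "potable", "non-potable"], name="Potability"
)

dist = class_distribution(labels)
print(dist)
```

A strongly uneven split would suggest using stratified sampling and metrics other than plain accuracy later on.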

## 3.5 Check statistics of data set

In [12]:
df.describe()

Unnamed: 0,Colour (TCU),Turbidity (NTU),pH,Conductivity (uS/cm),Total Dissolved Solids (mg/l),Total Hardness (mg/l as CaCO3),Aluminium (mg/l),Chloride (mg/l),Total Iron (mg/l),Sodium (mg/l),Sulphate (mg/l),Zinc (mg/l),Magnesium (mg/l),Calcium (mg/l),Potassium (mg/l),Nitrate (mg/l),Phosphate (mg/l)
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,15.006526,5.003388,7.337763,1502.148272,1001.183584,300.613398,0.200808,249.491721,0.300165,200.793553,402.124054,5.004608,100.026299,149.944522,49.838229,45.162176,2.205561
std,8.717615,2.906118,3.101412,869.812955,578.522848,171.478482,0.115359,144.526095,0.174572,116.039382,230.187867,2.89789,57.979525,87.162086,28.79552,25.861234,1.274395
min,0.01,0.0,0.0,0.12,0.05,0.03,0.0,0.0,0.0,0.01,0.03,0.0,0.03,0.02,0.0,0.03,0.0
25%,7.5175,2.49,6.3975,741.635,494.59,154.98,0.1,122.7075,0.15,99.89,205.91,2.46,49.3625,74.4175,24.49,23.15,1.09
50%,15.0,5.0,7.47,1500.095,1000.03,300.005,0.2,249.915,0.3,199.995,400.125,5.0,100.015,150.01,49.99,45.01,2.2
75%,22.66,7.53,8.48,2259.74,1497.88,448.605,0.3,374.76,0.45,303.645,601.925,7.5,150.13,226.245,75.04,67.7725,3.32
max,30.0,10.0,14.0,2999.91,1999.96,599.97,0.4,499.87,0.6,399.98,799.88,10.0,199.98,299.97,100.0,90.0,4.4


#### Insight

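* The numerical features are on very different scales: column means range from about 0.20 (Aluminium) to about 1502.15 (Conductivity).
* Standard deviations vary just as widely, from about 0.12 (Aluminium) to about 869.81 (Conductivity).
* Across all columns, the smallest value is 0 and the largest is 2999.91 (Conductivity); pH spans the full range from 0 to 14.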
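
To compare the scale of each feature side by side, the `describe()` output can be cut down to a few statistics per column. The sketch below also shows standardization with scikit-learn's `StandardScaler`, one common option for scale-sensitive models. It is an illustration rather than the project's chosen preprocessing, and the `sample` frame reuses only three columns from the `df.head()` output.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Three columns from the first five rows shown earlier, for illustration.
sample = pd.DataFrame({
    "Conductivity (uS/cm)": [819.0, 1371.1, 202.75, 696.16, 129.24],
    "Aluminium (mg/l)": [0.09, 0.2, 0.15, 0.06, 0.07],
    "pH": [8.06, 8.28, 6.86, 7.0, 6.73],
})

# One row per feature, restricted to the statistics that reveal scale.
scale_summary = sample.describe().T[["mean", "std", "min", "max"]]
print(scale_summary)

# Standardize each column to zero mean and unit (population) variance.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(sample), columns=sample.columns)
print(scaled.round(3))
```

In a modelling pipeline the scaler would normally be fitted on the training split only and then applied to the test split, to avoid leaking test statistics into training.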

## 3.6 Exploring Data

In [13]:
df.head()

Unnamed: 0,Colour (TCU),Turbidity (NTU),pH,Conductivity (uS/cm),Total Dissolved Solids (mg/l),Total Hardness (mg/l as CaCO3),Aluminium (mg/l),Chloride (mg/l),Total Iron (mg/l),Sodium (mg/l),Sulphate (mg/l),Zinc (mg/l),Magnesium (mg/l),Calcium (mg/l),Potassium (mg/l),Nitrate (mg/l),Phosphate (mg/l),Potability
0,8.34,3.39,8.06,819.0,787.15,279.89,0.09,129.3,0.22,13.13,81.01,2.24,12.69,107.95,17.5,22.23,0.41,potable
1,14.45,3.36,8.28,1371.1,779.66,112.04,0.2,163.73,0.13,127.48,307.99,4.05,52.01,107.12,45.28,16.06,0.68,potable
2,3.87,4.23,6.86,202.75,485.1,113.17,0.15,66.68,0.29,142.97,16.7,0.86,88.47,127.47,4.9,19.81,0.91,potable
3,14.57,1.75,7.0,696.16,409.71,140.39,0.06,102.42,0.15,194.07,393.09,2.6,61.36,99.16,36.73,42.82,0.02,potable
4,9.01,2.2,6.73,129.24,343.55,6.52,0.07,140.47,0.28,3.77,170.65,0.04,47.22,107.17,44.79,14.35,2.08,potable
