# **AAPG WEEK 5**

**PROJECT 1 : WATER QUALITY ANALYSIS**

13th July 2024

**Facilitators:**
1. Giles Twiss
2. Promise Ekeh




# Data Description and Task Overview

## Data Description

This dataset contains measurements of various water quality parameters collected from different water sources over time. Below is a detailed description of each column in the dataset:

1. **Index**: Unique identifier for each data entry.
2. **pH**: The pH level of the water, indicating its acidity or alkalinity.
3. **Iron**: Concentration of iron in the water (in mg/L).
4. **Nitrate**: Concentration of nitrate in the water (in mg/L).
5. **Chloride**: Concentration of chloride in the water (in mg/L).
6. **Lead**: Concentration of lead in the water (in mg/L).
7. **Zinc**: Concentration of zinc in the water (in mg/L).
8. **Color**: The color of the water sample (e.g., Colorless, Faint Yellow).
9. **Turbidity**: The cloudiness or haziness of the water, measured in Nephelometric Turbidity Units (NTU).
10. **Fluoride**: Concentration of fluoride in the water (in mg/L).
11. **Copper**: Concentration of copper in the water (in mg/L).
12. **Odor**: Descriptive term for the odor of the water (e.g., odorless, faint odor).
13. **Sulfate**: Concentration of sulfate in the water (in mg/L).
14. **Conductivity**: The ability of the water to conduct electricity, measured in microsiemens per centimeter (µS/cm).
15. **Chlorine**: Concentration of chlorine in the water (in mg/L).
16. **Manganese**: Concentration of manganese in the water (in mg/L).
17. **Total Dissolved Solids (TDS)**: Total concentration of dissolved substances in the water (in mg/L).
18. **Source**: Source of the water sample (e.g., Lake, River, Ground).
19. **Water Temperature**: Temperature of the water at the time of measurement (in °C).
20. **Air Temperature**: Temperature of the air at the time of measurement (in °C).
21. **Month**: Month when the sample was taken.
22. **Day**: Day of the month when the sample was taken.
23. **Time of Day**: Time of day when the sample was taken (e.g., 0 for midnight, 12 for noon).
24. **Target**: Binary target variable indicating whether the water quality meets a certain standard (0 for meets standard, 1 for does not meet standard).


## Task for Participants

Participants are tasked with analyzing the water quality data to determine patterns, correlations, and insights that can help in understanding and managing water quality. Specific tasks include:

1. **Data Cleaning and Preprocessing**:
    - Handle missing or inconsistent data.
    - Convert categorical data into numerical format if necessary.
    - Normalize or standardize numerical features.

2. **Exploratory Data Analysis (EDA)**:
    - Generate summary statistics for each feature.
    - Visualize data distributions and relationships between variables using plots (e.g., histograms, scatter plots, box plots).
    - Identify any trends or anomalies in the data.

3. **Feature Engineering**:
    - Create new features that might be useful for prediction, such as interaction terms or aggregates.
    - Evaluate the importance of different features for predicting water quality.

4. **Model Building**:
    - Split the data into training and testing sets.
    - Train machine learning models to predict the target variable (water quality standard).
    - Evaluate model performance using appropriate metrics (e.g., accuracy, precision, recall, F1-score).

5. **Model Interpretation and Insights**:
    - Interpret the model to understand which factors most influence water quality.
    - Provide actionable insights based on model findings.
    - Discuss potential interventions or policy recommendations to improve water quality.

# **Data Access**

Download the `water_quality_dataset_100k.csv` dataset from Google Drive:

https://drive.google.com/drive/folders/1k89P3amXikq3Fd_40GUVKjUav1YO6Zyl

Thanks to one of you for pointing out that the `Target` column in the original file contained only one value (all zeros). An updated dataset has now been added to the same Google Drive folder:

`water_quality_dataset_100k_new.csv`

## **Import Necessary Libraries**

In [None]:
# write your code

## **Read Data**
![](image/data_snapshot.png)

In [None]:
# write your code
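
As a starting point, the import and read steps might look like the sketch below. The helper name `load_water_data` is hypothetical, and the in-memory sample (including its column names) is an assumption based on the data description, used only so the snippet runs without the real file.

```python
import io

import pandas as pd


def load_water_data(path_or_buffer):
    """Read the water quality CSV (a file path or file-like object) into a DataFrame."""
    return pd.read_csv(path_or_buffer)


# Tiny in-memory stand-in for the real file; column names are assumed.
sample_csv = io.StringIO(
    "Index,pH,Iron,Color,Source,Target\n"
    "0,7.1,0.02,Colorless,Lake,0\n"
    "1,6.4,0.30,Faint Yellow,River,1\n"
)
df = load_water_data(sample_csv)

print(df.shape)
print(df.head())
```

With the downloaded file in your working directory, the call would be `load_water_data("water_quality_dataset_100k_new.csv")`.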

# **Task 1. Data Cleaning and Preprocessing**


## Step 1: Handle Missing or Inconsistent Data

1. **Identify Missing Values:**
   - Detect missing values in the dataset.


In [None]:
# Write your code for step 1.1


2. **Impute or Remove Missing Values:**
   - For numerical features, you can impute missing values using mean, median, or mode.
   - For categorical features, you can impute missing values using the most frequent category or a placeholder such as 'Unknown'.
   - Alternatively, remove rows with missing values if they are few and do not significantly impact the dataset.


In [None]:
# Write your code for step 1.2
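
One possible way to cover steps 1.1 and 1.2 with plain pandas is sketched below. The toy DataFrame is made up for illustration, and the choice of median (numerical) and most frequent category (categorical) is only one of the options listed above.

```python
import numpy as np
import pandas as pd

# Toy data with gaps; values are illustrative, not taken from the real dataset.
df = pd.DataFrame(
    {
        "pH": [7.1, np.nan, 6.8, 7.4],
        "Iron": [0.02, 0.30, np.nan, 0.10],
        "Color": ["Colorless", None, "Faint Yellow", "Colorless"],
    }
)

# Step 1.1: count missing values per column.
missing_counts = df.isna().sum()
print(missing_counts)

# Step 1.2: impute numerical columns with the median and
# categorical columns with the most frequent category.
numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.select_dtypes(exclude="number").columns

df_clean = df.copy()
df_clean[numeric_cols] = df_clean[numeric_cols].fillna(df_clean[numeric_cols].median())
for col in categorical_cols:
    df_clean[col] = df_clean[col].fillna(df_clean[col].mode().iloc[0])

print(df_clean)
```

If only a handful of rows have gaps, `df.dropna()` is the simpler alternative.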

## Step 2: Convert Categorical Data into Numerical Format

1. **Identify Categorical Features:**
   - Features like `Color`, `Odor`, and `Source` are categorical.



2. **Encode Categorical Features:**
   - Use one-hot encoding for nominal categorical features (e.g., `Color`, `Odor`, `Source`).
   - Use label encoding for ordinal categorical features, if any.

In [None]:
# write your code
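
A minimal sketch of one-hot encoding with `pandas.get_dummies` follows. The category values are assumptions taken from the examples in the data description; the real file may contain others.

```python
import pandas as pd

# Toy data; category values are assumed from the data description.
df = pd.DataFrame(
    {
        "pH": [7.1, 6.4, 6.8],
        "Color": ["Colorless", "Faint Yellow", "Colorless"],
        "Source": ["Lake", "River", "Ground"],
    }
)

categorical_cols = ["Color", "Source"]

# Each category becomes its own 0/1 column, named <feature>_<category>.
df_encoded = pd.get_dummies(df, columns=categorical_cols, dtype=int)
print(df_encoded.columns.tolist())
```

One-hot encoding avoids implying an order between categories, which is why it suits nominal features such as `Source`.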

## Step 3: Normalize or Standardize Numerical Features

1. **Identify Numerical Features:**
   - Features like `pH`, `Iron`, `Nitrate`, `Chloride`, etc., are numerical.

2. **Normalize or Standardize:**
   - Either normalize the features using Min-Max scaling (which rescales each feature to the range [0, 1]) or standardize them to have a mean of 0 and a standard deviation of 1.

In [None]:
# write your code
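
The two scaling options can be compared side by side with scikit-learn, as in this sketch on made-up values:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy numerical data; values are illustrative only.
df = pd.DataFrame({"pH": [6.0, 7.0, 8.0], "Iron": [0.0, 0.1, 0.4]})
numeric_cols = ["pH", "Iron"]

# Min-Max scaling: each column is rescaled to the range [0, 1].
minmax = pd.DataFrame(
    MinMaxScaler().fit_transform(df[numeric_cols]), columns=numeric_cols
)

# Standardization: each column gets mean 0 and (population) standard deviation 1.
standard = pd.DataFrame(
    StandardScaler().fit_transform(df[numeric_cols]), columns=numeric_cols
)

print(minmax)
print(standard)
```

When you later build models, fit the scaler on the training split only and reuse it on the test split, so that no information from the test data leaks into training.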

## Explanation:
1. SimpleImputer is used to handle missing values.
2. OneHotEncoder is used for encoding categorical variables.
3. MinMaxScaler is used to normalize numerical features.
4. ColumnTransformer is used to apply different preprocessing steps to different columns.

Be sure to replace 'your_data.csv' with the actual file path of your dataset. Save the preprocessed data to a new file called cleaned_data.csv so that it is ready for further analysis or modeling.


# **Task 2. Exploratory Data Analysis (EDA)**


## Step 1: Generate Summary Statistics for Each Feature

1. **Summary Statistics:**
   - Use descriptive statistics to summarize the central tendency, dispersion, and shape of the dataset’s distribution for each feature.

In [None]:
# Write your code
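
A short sketch of summary statistics on toy data is shown below; on the real dataset, the same calls would run against the loaded DataFrame.

```python
import pandas as pd

# Toy data; values are illustrative only.
df = pd.DataFrame(
    {
        "pH": [6.0, 7.0, 8.0, 7.0],
        "Turbidity": [0.5, 1.5, 0.7, 9.0],
        "Source": ["Lake", "River", "Lake", "Ground"],
    }
)

# Central tendency and dispersion: count, mean, std, min, quartiles, max.
numeric_summary = df.describe()
print(numeric_summary)

# Shape of each numerical distribution (positive skew means a long right tail).
skewness = df.skew(numeric_only=True)
print(skewness)

# Frequency of each category for a categorical feature.
source_counts = df["Source"].value_counts()
print(source_counts)
```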

## Step 2: Visualize Data Distributions and Relationships Between Variables

1. **Histograms:**
   - Visualize the distribution of numerical features.

2. **Box Plots:**
   - Identify outliers and understand the distribution of numerical features.

3. **Scatter Plots:**
   - Examine relationships between pairs of numerical features.

4. **Line Plots:**
   - Examine trends in numerical features.

5. **Correlation Matrix and Heatmap:**
   - Show the correlation between numerical features.

6. **Count Plots:**
   - Visualize the frequency of categorical features.

In [None]:
# Write your code
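
The plot types listed above could be drafted with Matplotlib alone, as in this sketch on randomly generated stand-in data. Seaborn's `pairplot()`, `heatmap()`, and `countplot()` are convenient alternatives if Seaborn is installed. The column names are assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in data; not drawn from the real dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "pH": rng.normal(7.0, 0.5, 200),
        "Iron": rng.exponential(0.1, 200),
        "Turbidity": rng.exponential(1.0, 200),
        "Source": rng.choice(["Lake", "River", "Ground"], 200),
    }
)
numeric_cols = ["pH", "Iron", "Turbidity"]

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Histogram: distribution of one numerical feature.
axes[0, 0].hist(df["pH"], bins=30)
axes[0, 0].set_title("pH distribution")

# Box plots: spread and outliers, one box per column.
axes[0, 1].boxplot(df[numeric_cols].to_numpy())
axes[0, 1].set_xticklabels(numeric_cols)
axes[0, 1].set_title("Box plots")

# Scatter plot: relationship between two numerical features.
axes[1, 0].scatter(df["Iron"], df["Turbidity"], s=8)
axes[1, 0].set_xlabel("Iron")
axes[1, 0].set_ylabel("Turbidity")

# Correlation matrix drawn as a heatmap.
corr = df[numeric_cols].corr()
im = axes[1, 1].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
axes[1, 1].set_xticks(range(len(numeric_cols)))
axes[1, 1].set_xticklabels(numeric_cols)
axes[1, 1].set_yticks(range(len(numeric_cols)))
axes[1, 1].set_yticklabels(numeric_cols)
fig.colorbar(im, ax=axes[1, 1])
fig.tight_layout()

# Count plot: frequency of each category.
counts = df["Source"].value_counts()
fig2, ax2 = plt.subplots()
ax2.bar(counts.index.tolist(), counts.to_numpy())
ax2.set_title("Samples per source")
```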

## Step 3: Identify Trends or Anomalies in the Data

1. **Trend Analysis:**
   - Look for patterns or trends over time or other categorical dimensions.

2. **Anomaly Detection:**
   - Identify any anomalies or unusual observations in the dataset.

In [1]:
# write your code
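
A rough sketch of both ideas follows: a monthly average for trend analysis, and the 1.5 × IQR rule (the same rule a box plot uses for its whiskers) for flagging anomalies. The column names and the month format are assumptions; check them against the real file.

```python
import pandas as pd

# Toy data; the last Iron value is deliberately extreme.
df = pd.DataFrame(
    {
        "Month": ["January", "January", "February", "February", "March", "March"],
        "Water Temperature": [10.0, 12.0, 14.0, 16.0, 18.0, 20.0],
        "Iron": [0.05, 0.06, 0.04, 0.05, 0.07, 3.00],
    }
)

# Trend analysis: average water temperature per month, in order of appearance.
monthly_temp = df.groupby("Month", sort=False)["Water Temperature"].mean()
print(monthly_temp)

# Anomaly detection: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = df["Iron"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["Iron"] < q1 - 1.5 * iqr) | (df["Iron"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]
print(outliers)
```

Calling `monthly_temp.plot()` would turn the monthly averages into a line plot.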

## Explanation:
1. Descriptive Statistics: Use the describe() method to generate summary statistics for each feature.
2. Histograms: Use the hist() method to visualize the distribution of numerical features.
3. Box Plots: Use the boxplot() function to visualize outliers and understand the distribution of numerical features.
4. Scatter Plots: Use the pairplot() function from Seaborn to examine relationships between pairs of numerical features.
5. Correlation Matrix and Heatmap: Use the corr() method and the heatmap() function from Seaborn to visualize the correlation between numerical features.
6. Count Plots: Use the countplot() function from Seaborn to visualize the frequency of categorical features.
7. Trend Analysis: Group data by month and plot average water temperature over time to identify trends.
8. Anomaly Detection: Use box plots to identify outliers in numerical features.
9. Be sure to replace 'your_data.csv' with the actual file path of your dataset. These visualizations and analyses will help you understand the distributions, relationships, and potential anomalies in your data.


# **Task 3. Feature Engineering** (OPTIONAL - Increase the quality of the dataset)
    


## Step 1: Create New Features

1. **Interaction Terms:**
   - Create new features by multiplying or combining existing features to capture interactions between them.

2. **Aggregate Features:**
   - Create aggregate features such as mean, median, sum, or count of other features.


In [2]:
# write your code
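
Interaction and aggregate features might be built as sketched here. The feature names follow the examples used in this notebook's explanation (`pH_Iron`, `Nitrate_Chloride`, `Total_Metals`, `Mean_Metals`); which columns count as "metals" is an assumption.

```python
import pandas as pd

# Toy data; values are illustrative only.
df = pd.DataFrame(
    {
        "pH": [7.0, 6.5],
        "Iron": [0.1, 0.3],
        "Nitrate": [2.0, 4.0],
        "Chloride": [100.0, 150.0],
        "Lead": [0.0, 0.01],
        "Zinc": [0.5, 1.0],
        "Copper": [0.2, 0.3],
        "Manganese": [0.05, 0.1],
    }
)

# Interaction terms: products of pairs of existing features.
df["pH_Iron"] = df["pH"] * df["Iron"]
df["Nitrate_Chloride"] = df["Nitrate"] * df["Chloride"]

# Aggregate features: row-wise sum and mean over the metal concentrations.
metal_cols = ["Iron", "Lead", "Zinc", "Copper", "Manganese"]  # assumed grouping
df["Total_Metals"] = df[metal_cols].sum(axis=1)
df["Mean_Metals"] = df[metal_cols].mean(axis=1)

print(df[["pH_Iron", "Nitrate_Chloride", "Total_Metals", "Mean_Metals"]])
```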

## Step 2: Evaluate the Importance of Different Features for Predicting Water Quality

1. **Feature Importance Using Tree-Based Models:**
   - Use tree-based models like Random Forest to evaluate the importance of different features.

2. **Correlation with Target:**
   - Calculate the correlation of each feature with the target variable to evaluate their importance.

In [None]:
# write your code
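
Both importance measures can be sketched on synthetic data where the answer is known in advance. Here the target depends only on `Iron`, a made-up rule chosen purely so the ranking can be checked; it says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic features; not drawn from the real dataset.
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame(
    {
        "Iron": rng.exponential(0.1, n),
        "pH": rng.normal(7.0, 0.5, n),
        "Turbidity": rng.exponential(1.0, n),
    }
)
# Made-up rule: the sample fails the standard (1) whenever Iron exceeds 0.1.
y = (X["Iron"] > 0.1).astype(int)

# Tree-based importance: how much each feature reduces impurity across the forest.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)

# Linear correlation of each feature with the target, ranked by absolute value.
target_corr = X.corrwith(y).sort_values(key=abs, ascending=False)
print(target_corr)
```

Correlation only captures linear relationships, so a feature can rank low there and still matter to a tree-based model.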

## Explanation:
1. **Interaction Terms**: Create new features by combining existing features to capture interactions (e.g., pH_Iron, Nitrate_Chloride).
2. **Aggregate Features**: Create new aggregate features such as Total_Metals and Mean_Metals by summing or averaging relevant columns.
3. **Feature Importance Using Tree-Based Models**: Use a Random Forest model to evaluate the importance of each feature.
4. **Correlation with Target**: Calculate the correlation of each feature with the target variable to understand their individual impact on the prediction.


# **Task 4. Model Building**
## Step 1: Split the Data into Training and Testing Sets
   - Use a function to split the dataset into training and testing sets.


In [None]:
# write your code
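
A minimal sketch of the split with scikit-learn's `train_test_split` is shown below. The 80/20 ratio and the `stratify` option are choices made for this illustration, and the data is a toy stand-in.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with a 70/30 class balance; not drawn from the real dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "pH": rng.normal(7.0, 0.5, 100),
        "Iron": rng.exponential(0.1, 100),
        "Target": [0] * 70 + [1] * 30,
    }
)

X = df.drop(columns="Target")
y = df["Target"]

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```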

## Step 2: Train Machine Learning Models to Predict the Target Variable

1. **Train Models:**
   - Train different machine learning models such as Logistic Regression, Random Forest, and Support Vector Machine.

In [None]:
# write your code
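
Training the three model types could look like this sketch. It uses `make_classification` to generate stand-in data, so the scores mean nothing for the water dataset, and the hyperparameters are defaults rather than tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed features and target.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Logistic Regression and SVM are scale-sensitive, so they get a scaler in front.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC()),
}

fitted = {name: model.fit(X_train, y_train) for name, model in models.items()}
for name, model in fitted.items():
    print(name, round(model.score(X_test, y_test), 3))
```

A kernel SVM can be slow on the full 100k rows; training it on a random subsample first is a reasonable way to gauge the cost.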

## Step 3: Evaluate Model Performance Using Appropriate Metrics

1. **Evaluate Models:**
   - Evaluate the performance of the models using metrics like accuracy, precision, recall, and F1-score.

In [None]:
# write your code
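
The four metrics can be computed as in the sketch below, using hand-written labels so the numbers are easy to verify. In this dataset 1 means "does not meet standard", so precision and recall here describe how well failing samples are detected.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hand-written labels: 4 true negatives, 1 false positive,
# 1 false negative, 4 true positives.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

If the target classes are imbalanced, accuracy alone can be misleading, so compare recall and F1-score as well.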

# **Task 5. Model Interpretation and Insights**


## Step 1: Interpret the Model to Understand Which Factors Most Influence Water Quality
1. **Feature Importance:**
   - Use the feature importance scores from tree-based models like Random Forest to identify which factors most influence water quality.
   - Use coefficients from Logistic Regression to understand the direction and magnitude of influence.
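
Reading direction and magnitude from Logistic Regression coefficients might look like the sketch below. The data and the rule that generates the target are synthetic assumptions, chosen so that the expected signs are known; they are not findings about real water quality.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features; not drawn from the real dataset.
rng = np.random.default_rng(1)
n = 1000
X = pd.DataFrame(
    {
        "Iron": rng.exponential(0.1, n),
        "pH": rng.normal(7.0, 0.5, n),
        "Chlorine": rng.normal(3.0, 0.5, n),
    }
)
# Made-up rule: higher Iron raises the chance of failing (Target = 1),
# higher Chlorine lowers it, and pH has no effect.
logit = 20 * (X["Iron"] - 0.1) - 2 * (X["Chlorine"] - 3.0)
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Standardize first so coefficient sizes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression(max_iter=1000).fit(X_scaled, y)

coefs = pd.Series(model.coef_[0], index=X.columns)
coefs = coefs.sort_values(key=abs, ascending=False)
print(coefs)
```

A positive coefficient pushes predictions toward class 1 (does not meet standard) and a negative one toward class 0. Coefficients are only comparable in size when the features share a scale, which is why the sketch standardizes them first.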

## Step 2: Provide Actionable Insights Based on Model Findings

1. **Identify Key Factors:**
   - Determine the most important factors influencing water quality based on the model interpretation.

2. **Actionable Insights:**
   - Provide specific recommendations for addressing key factors to improve water quality.

## Step 3: Discuss Potential Interventions or Policy Recommendations to Improve Water Quality

1. **Interventions:**
   - Suggest practical measures that can be taken to address the key factors influencing water quality.

2. **Policy Recommendations:**
   - Propose policy changes or initiatives to support the interventions and improve water quality on a broader scale.


## Explanation:
1. **Feature Importance**: Identify important features using Random Forest and Logistic Regression models.
2. **Actionable Insights**: Provide recommendations based on the most influential factors.
3. **Interventions and Policy Recommendations**: Suggest practical measures and policy changes to improve water quality.