### Final Project Submission

 <li> Student Names: Calvine Dasilver
 <li> Student Pace: Full - Time
 <li> Scheduled Review Date/Time: 
 <li> Instructor: Nikita Njoroge

## Utilizing Machine Learning to Assess Water Well Performance in Tanzania

## **Project Overview**

 ### <li>**Business Understanding**

 Millions in Tanzania face a daily struggle for clean water, forced to walk long distances for water of questionable quality. This water scarcity contributes to a cycle of illness, high infant mortality, and economic stagnation. Uneven water distribution due to Tanzania's diverse climate and geology, coupled with rising demand from agriculture, domestic needs, and other sectors, further complicates the issue. Water wells have been a lifeline for rural communities, but their effectiveness can vary. This project combines machine learning and data visualization to pinpoint potential causes of well failure, predict the success of new wells, and ensure resources are directed to areas with the greatest need.

##### **Challenges facing water access in Tanzania**:

* **Limited Infrastructure**: Tanzania lacks sufficient water infrastructure, particularly in rural areas. This includes a shortage of wells, pipelines, and proper sanitation facilities, leading to reliance on potentially contaminated sources.
* **Uneven Distribution**:  Tanzania's geography and climate create uneven water distribution. Some regions experience frequent droughts, while others have limited groundwater reserves. This disparity leaves many communities struggling despite national averages.
* **Water Quality**: Contaminated water sources are a significant health risk. Lack of proper sanitation and treatment facilities contribute to the spread of waterborne diseases, further impacting public health outcomes.
* **Climate Change**:  More frequent and more intense droughts driven by climate change further exacerbate water scarcity. Erratic weather disrupts traditional rainfall patterns, reducing both surface water and groundwater availability.
* **Population Growth**: Tanzania's growing population puts increasing pressure on existing water resources. Rising demand for domestic and agricultural water use threatens to outpace sustainable management practices.
* **Funding and Management**: Insufficient funding for infrastructure development, maintenance, and water management programs limits progress.

#### **Proposed Solution**

1. Build More & Smarter: Construct new wells and water systems (pumps, rainwater harvesting) in areas with greatest need, guided by data and community input.

2. Monitor & Maintain: Track water quality with sensors and mobile tech, while training locals on well maintenance and responsible water use.

3. Empower & Collaborate: Partner with local communities, NGOs, and the private sector to share knowledge, manage resources efficiently, and secure funding.

4. Sustainable Financing: Explore innovative funding models like microfinance, user fees, and public-private partnerships for long-term project viability.

#### **Conclusion**

This analysis pinpointed geographic location, construction details, environmental factors, and maintenance practices as crucial influences on well functionality.

#### **Problem Statement**

Rural Tanzania lacks sufficient infrastructure for safe drinking water, which makes clean water hard to access. The government and NGOs work together to build wells, but keeping those wells working in the long run remains a problem. As wells break down, progress toward reliable access to clean water for everyone slows.

#### **Objectives**
###### *main*
Implement a machine learning model to assess the operational status of water wells in Tanzania.
###### *specific*
*  Data Preprocessing and Analysis
*  Model Development and Training
*  Insights Generation and Recommendation Development

### <li> **Data Understanding**
**Data Sources**

Data provided by Taarifa and the Tanzanian Ministry of Water is organized into three separate CSV files:
 https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/data/

1. Test Set Values: This file contains the independent variables (features) for which we need the model to predict the water well condition.
2. Training Set Labels: This file contains the dependent variable (status_group) for each row in the Training Set Values file. This variable represents the actual condition of the well (functional, non functional, or functional needs repair).
3. Training Set Values: This file contains the independent variables (features) used to train the machine learning model. These features will be used by the model to learn how to predict the condition of the wells in the Test Set Values file.
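
As a minimal sketch of how these files fit together, the training values and training labels can be joined on their shared `id` column before any modelling. The helper name `merge_values_and_labels` is hypothetical, and the small in-memory frames below are made-up stand-ins rather than real rows from the dataset.

```python
import pandas as pd


def merge_values_and_labels(values: pd.DataFrame, labels: pd.DataFrame) -> pd.DataFrame:
    """Join features and target on `id`, requiring exactly one label per row."""
    # validate="one_to_one" raises a MergeError if an id is duplicated on either side
    return values.merge(labels, on="id", how="inner", validate="one_to_one")


# Made-up stand-in rows (illustrative only, not taken from the dataset)
values = pd.DataFrame({"id": [1, 2, 3], "gps_height": [1390, 0, 686]})
labels = pd.DataFrame(
    {"id": [3, 1, 2], "status_group": ["functional", "non functional", "functional"]}
)

train = merge_values_and_labels(values, labels)
print(train)
```

With the real files, `values` and `labels` would come from `pd.read_csv` on the downloaded CSVs. The `validate` argument is a cheap guard against silently duplicating rows during the join.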

**To facilitate understanding of the data, the column names and their descriptions are listed below**:
* amount_tsh - Total static head (amount of water available to waterpoint)
* date_recorded - The date the row was entered
*  funder - Who funded the well
*  gps_height - Altitude of the well
*  installer - Organization that installed the well
*  longitude - GPS coordinate
*  latitude - GPS coordinate
*  wpt_name - Name of the waterpoint if there is one
*  num_private -
*  basin - Geographic water basin
*  subvillage - Geographic location
*  region - Geographic location
*  region_code - Geographic location (coded)
*  district_code - Geographic location (coded)
*  lga - Geographic location
*  ward - Geographic location
*  population - Population around the well
*  public_meeting - True/False
*  recorded_by - Group entering this row of data
* scheme_management - Who operates the waterpoint
* scheme_name - Who operates the waterpoint
* permit - If the waterpoint is permitted
* construction_year - Year the waterpoint was constructed
* extraction_type - The kind of extraction the waterpoint uses
* extraction_type_group - The kind of extraction the waterpoint uses
* extraction_type_class - The kind of extraction the waterpoint uses
* management - How the waterpoint is managed
* management_group - How the waterpoint is managed
* payment - What the water costs
* payment_type - What the water costs
* water_quality - The quality of the water
* quality_group - The quality of the water
* quantity - The quantity of water
* quantity_group - The quantity of water
* source - The source of the water
* source_type - The source of the water
* source_class - The source of the water
* waterpoint_type - The kind of waterpoint
* waterpoint_type_group - The kind of waterpoint
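
Several columns in the list above share the same description (for example `extraction_type`, `extraction_type_group`, and `extraction_type_class`), which suggests that some may be coarser groupings of others. The sketch below shows one way this could be checked: it tests whether each value of a finer column maps to exactly one value of a coarser column. The helper name `is_determined_by` is hypothetical, and the demo rows are illustrative stand-ins, not real records.

```python
import pandas as pd


def is_determined_by(df: pd.DataFrame, fine: str, coarse: str) -> bool:
    """Return True if every value of `fine` maps to exactly one value of `coarse`."""
    return bool((df.groupby(fine)[coarse].nunique() <= 1).all())


# Illustrative stand-in rows (assumed values, not real records)
demo = pd.DataFrame(
    {
        "extraction_type": ["gravity", "nira/tanira", "swn 80", "gravity"],
        "extraction_type_class": ["gravity", "handpump", "handpump", "gravity"],
    }
)

fine_to_coarse = is_determined_by(demo, "extraction_type", "extraction_type_class")
coarse_to_fine = is_determined_by(demo, "extraction_type_class", "extraction_type")
print(fine_to_coarse, coarse_to_fine)
```

If a finer column fully determines a coarser one on the real data, keeping both adds little information, and one of them could be dropped during preprocessing.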


### Importing relevant libraries

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, precision_score, recall_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

First, I will define a function that loads a dataset and reports its structure and data types.


In [18]:
#create a function that loads data and gets the info about the dataset
import io

def load_data_and_check_info(file_path):
    """
    Load data from a CSV file and get information about the DataFrame.

    Parameters:
    - file_path (str): Path to the CSV file.

    Returns:
    - df_1 (pd.DataFrame): The loaded data.
    - df_info (str): Output of DataFrame.info() captured as text.
    - df_head (pd.DataFrame): The first five rows of the DataFrame.
    """
    # Load data
    df_1 = pd.read_csv(file_path)

    # Keep the first few rows of the DataFrame
    df_head = df_1.head()

    # DataFrame.info() prints and returns None, so capture its output in a buffer
    buffer = io.StringIO()
    df_1.info(buf=buffer)
    df_info = buffer.getvalue()

    return df_1, df_info, df_head

#### 1.1 Importing Training Data: We begin by loading the 'Training Set Values' dataset

In [19]:
file_path_1 = "Data/Training_set_values.csv"  # Forward slash avoids the invalid "\T" escape and works across platforms
df1, data_info, data_head = load_data_and_check_info(file_path_1)
print(data_info)
print("\nFirst few rows of the DataFrame:")
data_head

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59400 entries, 0 to 59399
Data columns (total 40 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     59400 non-null  int64  
 1   amount_tsh             59400 non-null  float64
 2   date_recorded          59400 non-null  object 
 3   funder                 55765 non-null  object 
 4   gps_height             59400 non-null  int64  
 5   installer              55745 non-null  object 
 6   longitude              59400 non-null  float64
 7   latitude               59400 non-null  float64
 8   wpt_name               59400 non-null  object 
 9   num_private            59400 non-null  int64  
 10  basin                  59400 non-null  object 
 11  subvillage             59029 non-null  object 
 12  region                 59400 non-null  object 
 13  region_code            59400 non-null  int64  
 14  district_code          59400 non-null  int64  
 15  lg

| | id | amount_tsh | date_recorded | funder | gps_height | installer | longitude | latitude | wpt_name | num_private | ... | payment_type | water_quality | quality_group | quantity | quantity_group | source | source_type | source_class | waterpoint_type | waterpoint_type_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 69572 | 6000.0 | 2011-03-14 | Roman | 1390 | Roman | 34.938093 | -9.856322 | none | 0 | ... | annually | soft | good | enough | enough | spring | spring | groundwater | communal standpipe | communal standpipe |
| 1 | 8776 | 0.0 | 2013-03-06 | Grumeti | 1399 | GRUMETI | 34.698766 | -2.147466 | Zahanati | 0 | ... | never pay | soft | good | insufficient | insufficient | rainwater harvesting | rainwater harvesting | surface | communal standpipe | communal standpipe |
| 2 | 34310 | 25.0 | 2013-02-25 | Lottery Club | 686 | World vision | 37.460664 | -3.821329 | Kwa Mahundi | 0 | ... | per bucket | soft | good | enough | enough | dam | dam | surface | communal standpipe multiple | communal standpipe |
| 3 | 67743 | 0.0 | 2013-01-28 | Unicef | 263 | UNICEF | 38.486161 | -11.155298 | Zahanati Ya Nanyumbu | 0 | ... | never pay | soft | good | dry | dry | machine dbh | borehole | groundwater | communal standpipe multiple | communal standpipe |
| 4 | 19728 | 0.0 | 2011-07-13 | Action In A | 0 | Artisan | 31.130847 | -1.825359 | Shuleni | 0 | ... | never pay | soft | good | seasonal | seasonal | rainwater harvesting | rainwater harvesting | surface | communal standpipe | communal standpipe |
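
The `info()` output above reports fewer than 59400 non-null entries for `funder` (55765), `installer` (55745), and `subvillage` (59029), which means 3635, 3655, and 371 values are missing in those columns. A minimal sketch of how the missing values could be summarised per column follows. The helper name `missing_summary` is hypothetical, and the demo frame is a made-up stand-in rather than the real data.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, largest first.

    Columns with no missing values are left out.
    """
    counts = df.isna().sum()
    summary = pd.DataFrame(
        {"missing": counts, "percent": (counts / len(df) * 100).round(2)}
    )
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Made-up stand-in frame (illustrative only)
demo = pd.DataFrame(
    {
        "funder": ["Roman", None, "Unicef", None],
        "installer": ["Roman", "GRUMETI", None, "Artisan"],
        "gps_height": [1390, 1399, 686, 263],
    }
)

summary = missing_summary(demo)
print(summary)
```

Calling `missing_summary(df1)` on the loaded training values would give the same kind of overview for all 40 columns and can guide the choice between imputing and dropping.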
