# Predicting Water Well Conditions in Tanzania


## 1. Business Understanding

### Problem Statement

The scarcity of clean and potable water remains a significant problem in many communities throughout Tanzania. In response, the Tanzanian Ministry of Water has established numerous water wells nationwide. Unfortunately, not all of these wells are operating effectively, leaving some communities without access to clean water.

### Goal

The aim of this project is to develop a predictive model that can accurately assess the condition of water wells in Tanzania using data from Taarifa and the Tanzanian Ministry of Water. The model will predict which pumps are fully operational, which require repairs, and which are non-functional.

The ultimate goal is to enhance maintenance efforts and ensure that communities across Tanzania have reliable access to clean and potable water.

### Objectives

#### Main Objective


To predict the condition of water wells in Tanzania so that clean, potable water is available to communities across the country.

#### Specific Objectives

1. To understand the problem statement and the goal of the project.
2. To identify the variables that can affect the functionality of water wells.
3. To determine the target variable (functional, functional needs repair, or non functional).

### Metric of Success

The model will be considered successful if it achieves both an accuracy and an F1 score of at least 0.8.
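As a minimal sketch of how this criterion could be checked, the block below computes accuracy and F1 with scikit-learn on hand-made labels. The helper name `meets_success_criterion`, the toy labels, and the choice of weighted averaging for the three-class F1 score are assumptions for illustration; the notebook does not state which averaging is intended.

```python
from sklearn.metrics import accuracy_score, f1_score

THRESHOLD = 0.8  # minimum acceptable accuracy and F1 score


def meets_success_criterion(y_true, y_pred, threshold=THRESHOLD):
    """Return (accuracy, f1, passed) for a set of predictions."""
    accuracy = accuracy_score(y_true, y_pred)
    # Multi-class F1 needs an averaging strategy; 'weighted' is an assumption
    f1 = f1_score(y_true, y_pred, average="weighted")
    passed = accuracy >= threshold and f1 >= threshold
    return accuracy, f1, passed


# Toy labels, illustrative only (not results from this project)
y_true = ["functional", "functional", "functional",
          "non functional", "non functional", "functional needs repair"]
y_pred = ["functional", "functional", "functional",
          "non functional", "functional", "functional needs repair"]

accuracy, f1, passed = meets_success_criterion(y_true, y_pred)
print(f"accuracy={accuracy:.3f}, weighted F1={f1:.3f}, passed={passed}")
```

Weighted averaging tends to track accuracy closely when one class dominates; macro averaging (`average="macro"`) would give the rarer classes equal weight and may be the stricter check.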

### Data Description

amount_tsh - Total static head (amount of water available to the waterpoint)

date_recorded - The date the row was entered

funder - Who funded the well

gps_height - Altitude of the well

installer - Organization that installed the well

longitude - GPS coordinate

latitude - GPS coordinate

wpt_name - Name of the waterpoint if there is one

num_private -

basin - Geographic water basin

subvillage - Geographic location

region - Geographic location

region_code - Geographic location (coded)

district_code - Geographic location (coded)

lga - Geographic location

ward - Geographic location

population - Population around the well

public_meeting - True/False

recorded_by - Group entering this row of data

scheme_management - Who operates the waterpoint

scheme_name - Who operates the waterpoint

permit - If the waterpoint is permitted

construction_year - Year the waterpoint was constructed

extraction_type - The kind of extraction the waterpoint uses

extraction_type_group - The kind of extraction the waterpoint uses

extraction_type_class - The kind of extraction the waterpoint uses

management - How the waterpoint is managed

management_group - How the waterpoint is managed

payment - What the water costs

payment_type - What the water costs

water_quality - The quality of the water

quality_group - The quality of the water

quantity - The quantity of water

quantity_group - The quantity of water

source - The source of the water

source_type - The source of the water

source_class - The source of the water

waterpoint_type - The kind of waterpoint

waterpoint_type_group - The kind of waterpoint

### Distribution of Target Column

The target column has three possible values:

1. functional - the waterpoint is operational and no repairs are needed

2. functional needs repair - the waterpoint is operational but needs repairs

3. non functional - the waterpoint is not operational
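A quick way to see how these three classes are distributed is to count them with pandas. The sketch below uses a small hand-made DataFrame rather than the real labels file, and the column name `status_group` is an assumption about how the target is named.

```python
import pandas as pd

# Toy labels, illustrative only; the real labels come from the training labels file
labels = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "status_group": [  # column name is an assumption
        "functional", "non functional", "functional",
        "functional needs repair", "functional",
    ],
})

# Absolute count and relative share of each class
counts = labels["status_group"].value_counts()
shares = labels["status_group"].value_counts(normalize=True)
print(pd.DataFrame({"count": counts, "share": shares.round(2)}))
```

If one class turns out to be much rarer than the others, accuracy alone can look good while that class is poorly predicted, which is one reason to report F1 alongside it.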

### Technique Used (CRISP-DM)


1. Business Understanding

2. Data Understanding

3. Data Preparation

4. Modelling and Evaluation

5. External Validation

6. Challenging the Solution

7. Conclusions and Recommendations

## 2. Data Understanding

### 2.1 Importing libraries

In [2]:
# Importing necessary libraries
# Data Manipulation and Visualization
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Preprocessing and Data Transforming
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.impute import SimpleImputer

# Building Machine Learning Models
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

# Hyperparameter Tuning
from sklearn.model_selection import RandomizedSearchCV

# Performance Evaluation
from sklearn.metrics import accuracy_score, f1_score, classification_report

import warnings
warnings.filterwarnings('ignore')

### 2.2 Loading the Dataset

In [15]:
# Load the training set values and training set labels
train_values = pd.read_csv('.Downloads/TrainingSetValues.csv', encoding='ISO-8859-1')
train_labels = pd.read_csv('0bf8bc6e-30d0-4c50-956a-603fc693d966.csv')
test_df = pd.read_csv('702ddfc5-68cd-4d1d-a0de-f5f566f76d91.csv')

# Merge the training values with their labels on the shared 'id' column
train_df = pd.merge(train_values, train_labels, on='id')
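Because an inner merge silently drops rows whose `id` has no match, it can be worth checking the result. The sketch below shows one possible check on small hand-made frames; the toy values and the `status_group` column name are assumptions, and `validate='one_to_one'` is an optional pandas safeguard that the notebook itself does not use.

```python
import pandas as pd

# Toy stand-ins for the values and labels files (illustrative data only)
values = pd.DataFrame({"id": [10, 11, 12], "gps_height": [1390, 1399, 686]})
labels = pd.DataFrame({
    "id": [10, 11, 12],
    "status_group": ["functional", "functional", "non functional"],  # assumed column name
})

# validate='one_to_one' raises pandas.errors.MergeError if an id repeats on either side
merged = pd.merge(values, labels, on="id", how="inner", validate="one_to_one")

# An inner merge drops unmatched ids, so compare row counts before and after
if len(merged) != len(values):
    print(f"Warning: {len(values) - len(merged)} rows had no matching label")
print(merged.shape)
```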