# 1. Business Understanding

#### 1.1. Project Overview

Tanzania, a country with a population of 57 million, faces significant challenges in providing clean and reliable water to its people.
The country has established numerous water points to meet this need, but many of them are not fully functional: some require repairs and others have failed entirely.
This project aims to predict the functionality of these water pumps, distinguishing between those that are fully functional, those that need repairs, and those that do not work at all.

#### 1.2. Objective
The primary objective is to develop a predictive model that can accurately classify the operational status of water pumps into one of three categories:
- Functional: The water pump is fully operational and provides clean water.
- Needs repair: The water pump is operational but requires some maintenance or repair to ensure optimal performance.
- Non-Functional: The water pump has failed and is not providing water.
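
As a rough illustration of how these three categories could become a model target, the sketch below maps status strings to integer class codes and reports the share of each class. The label spellings, the `STATUS_CODES` mapping, and the `encode_status` helper are assumptions made for this example; the actual labels file may name the classes differently, and the toy series is not project data.

```python
import pandas as pd

# Hypothetical label spellings based on the three categories listed above;
# the real labels file may use different strings.
STATUS_CODES = {
    "functional": 0,
    "needs repair": 1,
    "non-functional": 2,
}


def encode_status(labels: pd.Series) -> pd.Series:
    """Map status strings to integer class codes, failing on unknown labels."""
    normalized = labels.str.strip().str.lower()
    unknown = set(normalized.dropna().unique()) - set(STATUS_CODES)
    if unknown:
        raise ValueError(f"Unexpected status labels: {sorted(unknown)}")
    return normalized.map(STATUS_CODES)


# Toy data for illustration only, not taken from the project dataset.
example = pd.Series(["Functional", "Non-Functional", "Needs repair", "Functional"])
encoded = encode_status(example)

# Share of each class, useful for spotting class imbalance before modelling.
class_share = encoded.value_counts(normalize=True).sort_index()
print(encoded.tolist())
print(class_share)
```

Checking the class shares early matters because a heavily under-represented class (for example, pumps needing repair) can be hard for a classifier to learn without resampling or class weighting.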

#### 1.3. Stakeholders
- Non-Governmental Organizations (NGOs): Various NGOs that support the repair of wells in Tanzania.
- Government of Tanzania, through the Tanzanian Ministry of Water: The government wants to find patterns in non-functional wells to inform how new wells are built.

#### 1.4. Key Questions
1. What are the critical factors influencing the functionality of water pumps in Tanzania?
2. How can we use historical data to predict the current operational status of a water pump?
3. What are the cost implications of accurately predicting pump functionality?
4. How can this model be integrated into existing maintenance workflows to maximize its impact?


# 2. Data Understanding

#### Generating summary statistics 

In [8]:
import pandas as pd

# Load data
train_values = pd.read_csv('Data/Training-set-values.csv')
train_labels = pd.read_csv('Data/Training-set-labels.csv')

# Merge data
train_data = pd.merge(train_values, train_labels, on='id')

# Summary statistics for numerical variables
numerical_summary = train_data.describe()

# Summary statistics for categorical variables
categorical_summary = train_data.describe(include=['object'])

# Check for missing values
missing_values = train_data.isnull().sum()
missing_percentage = (missing_values / len(train_data)) * 100

# Display the summaries and missing-value counts
print("Numerical Summary : \n", numerical_summary)
print("\nCategorical Summary : \n", categorical_summary)
print("\nMissing Values:\n", missing_values)
print("\nMissing Percentage:\n", missing_percentage)

Numerical Summary : 
                  id     amount_tsh    gps_height     longitude      latitude  \
count  59400.000000   59400.000000  59400.000000  59400.000000  5.940000e+04   
mean   37115.131768     317.650385    668.297239     34.077427 -5.706033e+00   
std    21453.128371    2997.574558    693.116350      6.567432  2.946019e+00   
min        0.000000       0.000000    -90.000000      0.000000 -1.164944e+01   
25%    18519.750000       0.000000      0.000000     33.090347 -8.540621e+00   
50%    37061.500000       0.000000    369.000000     34.908743 -5.021597e+00   
75%    55656.500000      20.000000   1319.250000     37.178387 -3.326156e+00   
max    74247.000000  350000.000000   2770.000000     40.345193 -2.000000e-08   

        num_private   region_code  district_code    population  \
count  59400.000000  59400.000000   59400.000000  59400.000000   
mean       0.474141     15.297003       5.629747    179.909983   
std       12.236230     17.587406       9.633649    471.482