# Tanzania Water Well Classification Project: Does it Need Repair?

## **About**

#### This notebook summarizes the methodology used in this project: the project goals, the factors considered during the Exploratory Data Analysis (EDA) phase, examples from the model-building process, the final model and its evaluation, and the considerations and decisions made along the way.

### ***Project Goals***

#### The goal of this project is to build a classification model that predicts whether a water pump is in need of repair.

#### The problem this project is addressing is access to water by way of community waterpoints. Access to water is an important issue that has reverberations in the social and economic aspects of a society. The [World Health Organization writes](https://www.who.int/news-room/fact-sheets/detail/drinking-water#:~:text=Safe%20and%20readily%20available%20water,contribute%20greatly%20to%20poverty%20reduction.), “Safe and readily available water is important for public health, whether it is used for drinking, domestic use, food production or recreational purposes. Improved water supply and sanitation, and better management of water resources, can boost countries’ economic growth and can contribute greatly to poverty reduction.” 

#### The water pumps we are looking at in our modeling are meant to provide potable water. If these water pumps fail, that community’s availability of drinking water is impacted. Fewer working water pumps means heavier use of the functional ones, which could shorten those pumps' lifespan before they need repairs. Knowing which waterpoints need maintenance can help keep interruptions of service in that community to a minimum.

#### This model will be used by the Tanzanian Ministry of Water to assess which water pumps need to be repaired, in support of the Ministry's goal of providing its citizens with access to water. By knowing which water pumps need repairs, the Ministry can implement better maintenance strategies. Moreover, this model and the actions taken on its outputs can contribute to the goal of [The Tanzanian Development Vision 2025](https://mof.go.tz/mofdocs/overarch/vision2025.htm): universal access to safe water by the year 2025.

### ***The Data***

#### The **data comes from** [Taarifa](http://taarifa.org/), which sources it from the [Tanzanian Ministry of Water](https://www.maji.go.tz/). The datasets were downloaded from [DrivenData’s “Pump it Up: Data Mining the Water Table”](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/page/23/) competition.

#### Our target is to classify the water pumps into one of three possible categories:
1. `functional` - the waterpoint is operational and there are no repairs needed
2. `functional needs repair` - the waterpoint is operational, but needs repairs
3. `non functional` - the waterpoint is not operational


## **Exploratory Data Analysis (EDA)**

#### A few of the questions we explored before going into our modeling were:
- What features are available to us? Do we need all of them?
- What format are our features in?
- Are there missing values in our datasets? How will we account for those?

ADD CODE HERE
- column names: some examples of the categorical columns; looked for redundant variables in the categorical data
- dtypes: write out thinking
- checked for missing values; discuss how we will take care of them when building the model
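
As a rough illustration of the checks listed above, the sketch below runs them on a small made-up DataFrame rather than on the real data. The column names (`amount_tsh`, `construction_year`, `quantity`, `quantity_group`, `permit`) and all values are assumptions chosen for the example; swap in `x_train` to run the same checks on the project data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training features; every value here is made up.
toy = pd.DataFrame({
    "amount_tsh": [0.0, 25.0, 0.0, 500.0],
    "construction_year": [1999, 0, 2010, 0],
    "quantity": ["enough", "dry", "enough", "insufficient"],
    "quantity_group": ["enough", "dry", "enough", "insufficient"],
    "permit": [True, np.nan, False, True],
})

# 1. What features are available, and in what format?
print(toy.dtypes)

# 2. Do two categorical columns carry the same information?
redundant = (toy["quantity"] == toy["quantity_group"]).all()
print("quantity duplicates quantity_group:", redundant)

# 3. How many explicit missing values does each column have?
missing = toy.isna().sum()
print(missing)

# 4. Numeric columns may also use 0 as a placeholder, which isna() will not catch.
zero_years = (toy["construction_year"] == 0).sum()
print("construction_year entries equal to 0:", zero_years)
```

A column that duplicates another is a candidate for dropping, and placeholder zeros may need to be treated as missing before imputation.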


## **First Simple Model**

In [None]:
# Make the project's src/ directory importable, then import the custom functions
import sys
import pathlib

src_path = pathlib.Path().absolute().parent / "src"
sys.path.append(str(src_path))
import data_functions

# Standard library and third-party imports
import pickle

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

## **Data Cleaning**

In [None]:
# Import data using custom function
x_train, x_test, y_train = data_functions.get_dataframes()

In [None]:
# Drop the columns we will not be using
cols_to_drop = ['date_recorded', 'installer', 'funder', 'wpt_name', 'subvillage', 'ward', 'recorded_by', 'scheme_name', 'scheme_management', 'extraction_type',
                'extraction_type_class', 'payment', 'public_meeting', 'permit', 'management', 'management_group', 'source', 'source_class',
                'waterpoint_type_group', 'latitude', 'longitude', 'num_private', 'region_code', 'district_code']

x_train.drop(columns=cols_to_drop, inplace=True)
x_test.drop(columns=cols_to_drop, inplace=True)

# Split the remaining features into numeric and categorical columns
x_train_nums = x_train.select_dtypes(exclude="object")
x_train_cat = x_train.select_dtypes(include="object")

In [None]:
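# One-hot encode the categorical features.
# Written for scikit-learn >= 1.2; older releases use sparse=False and ohe.get_feature_names(...) instead.
ohe = OneHotEncoder(sparse_output=False)
x_train_ohe = pd.DataFrame(ohe.fit_transform(x_train_cat), columns=ohe.get_feature_names_out(x_train_cat.columns), index=x_train_cat.index)
# Sanity check: total number of missing values after encoding
x_train_ohe.isna().sum().sum()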

In [None]:
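# Fill missing numeric values with the column mean (SimpleImputer's default strategy)
si = SimpleImputer()
x_nums_si = pd.DataFrame(si.fit_transform(x_train_nums), index=x_train_nums.index, columns=x_train_nums.columns)

In [None]:
# Standardize the numeric features to zero mean and unit variance
scale = StandardScaler()
x_train_nums_scaled = pd.DataFrame(scale.fit_transform(x_nums_si), index=x_nums_si.index, columns=x_nums_si.columns)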


In [None]:
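# Recombine the scaled numeric features with the one-hot encoded categorical features
x_final = x_train_nums_scaled.join(x_train_ohe)

In [None]:
# Hold out part of the labeled data for evaluation.
# Despite the names, x_val / y_val are the training split and x_val_test / y_val_test are the hold-out split.
x_val, x_val_test, y_val, y_val_test = train_test_split(x_final, y_train, random_state=2020)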
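
The cells above fit the encoder, imputer, and scaler on all of the labeled data before splitting it, so the hold-out rows influence the preprocessing. One alternative, sketched below and not the approach used in this notebook, is to wrap the same steps in a `Pipeline` with a `ColumnTransformer` so that they are fitted on the training split only. The toy data and the column names `gps_height`, `population`, and `quantity` are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the cleaned features and labels; all values are made up.
toy_x = pd.DataFrame({
    "gps_height": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120],
    "population": [5, 0, 15, 25, 0, 35, 45, 55, 0, 65, 75, 85],
    "quantity": ["enough", "dry", "insufficient"] * 4,
})
toy_y = pd.Series(["functional", "non functional"] * 6, name="status_group")

num_cols = ["gps_height", "population"]
cat_cols = ["quantity"]

# Numeric columns: impute then scale. Categorical columns: one-hot encode,
# ignoring categories that were not seen during fitting.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler())]), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
model = Pipeline([("prep", preprocess), ("knn", KNeighborsClassifier(n_neighbors=3))])

# Split first, then fit: the preprocessing only ever sees the training rows.
x_tr, x_te, y_tr, y_te = train_test_split(toy_x, toy_y, random_state=2020, stratify=toy_y)
model.fit(x_tr, y_tr)
preds = model.predict(x_te)
print(preds)
```

The fitted pipeline can also be applied directly to `x_test`, which keeps the train and test transformations identical.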

## **Initial Model Testing**

ADD CODE for example models with the 3 classifications
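
As one possible shape for a three-class example, the sketch below trains a decision tree on synthetic, imbalanced data and prints a per-class report. The data comes from `make_classification`, the class proportions are arbitrary, and the model choice is an assumption; none of it reproduces the project's actual models or results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Label names in sorted order, matching the integer classes 0, 1, 2 below.
class_names = ["functional", "functional needs repair", "non functional"]

# Synthetic three-class data with one small minority class (arbitrary proportions).
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3,
    weights=[0.55, 0.10, 0.35], random_state=0,
)
x_tr, x_te, y_tr, y_te = train_test_split(X_toy, y_toy, stratify=y_toy, random_state=2020)

tree = DecisionTreeClassifier(random_state=0).fit(x_tr, y_tr)

# Per-class precision and recall show how the minority class fares,
# which a single accuracy number would hide.
report = classification_report(y_te, tree.predict(x_te), target_names=class_names)
print(report)
```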

## **Model Iteration**

ADD discussion about how the minority class was not getting picked up, even when using techniques like SMOTE to oversample it; the issues with reproducibility; and the decision to combine the `functional needs repair` and `non functional` classes into a single class.

In [None]:
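# Collapse the three classes into a binary target:
# 1 = functional, 0 = functional needs repair or non functional
bin_y = lambda x: 1 if x == 'functional' else 0
y_tr_final = y_val['status_group'].apply(bin_y)
y_te_final = y_val_test['status_group'].apply(bin_y)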
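
The same binary mapping can be written without a lambda. The sketch below is an equivalent vectorized form on a made-up label column, assuming the labels live in a `status_group` column as they do above.

```python
import pandas as pd

# Toy labels covering all three original classes.
labels = pd.DataFrame({
    "status_group": ["functional", "functional needs repair", "non functional", "functional"]
})

# True where the pump is functional, then cast to 1/0.
y_binary = labels["status_group"].eq("functional").astype(int)
print(y_binary.tolist())  # [1, 0, 0, 1]
```

With this encoding, class `0` is the one of interest to a maintenance team (pumps that need attention), so its precision and recall are the rows to watch in the classification reports.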


ADD CODE for some examples (initial and hypertuned) with 2 classifications

### ***KNeighbors Classifier***

ADD COMMENTS RE WHAT CHANGED, AND EXPLAIN THE MODELS IN BETWEEN.

In [None]:
# Earlier iteration: 15 neighbors with the default uniform weights and Euclidean distance
kn_2 = KNeighborsClassifier(n_neighbors=15)
kn_2.fit(x_val, y_tr_final)
y_pred_kn_2 = kn_2.predict(x_val_test)
print(classification_report(y_te_final, y_pred_kn_2))

In [None]:
# Later iteration: 2 neighbors, distance-weighted votes, Manhattan distance (p=1), brute-force search
kn_8 = KNeighborsClassifier(n_neighbors=2, weights='distance', p=1, algorithm='brute')
kn_8.fit(x_val, y_tr_final)
y_pred_kn_8 = kn_8.predict(x_val_test)
print(classification_report(y_te_final, y_pred_kn_8))
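
A search over KNN settings like the ones above could be run with `GridSearchCV`, which the notebook already imports. The sketch below is a minimal example on synthetic data; the parameter grid, the `f1_macro` scoring choice, and the 3-fold setting are assumptions, not the search that produced `kn_8`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data standing in for (x_val, y_tr_final).
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical grid covering the parameters varied between kn_2 and kn_8.
param_grid = {
    "n_neighbors": [2, 5, 15],
    "weights": ["uniform", "distance"],
    "p": [1, 2],  # 1 = Manhattan distance, 2 = Euclidean distance
}

# Macro-averaged F1 weights both classes equally regardless of their size.
search = GridSearchCV(KNeighborsClassifier(), param_grid, scoring="f1_macro", cv=3)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is refit on all of the data passed to `fit` and can be scored on the hold-out split.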


### ***Logistic Regression***

Hypertuned for ADD, but the first model performed best.


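In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay

# First logistic regression model; max_iter is raised from the default of 100 to give the solver room to converge
lr_1 = LogisticRegression(max_iter=2000)
lr_1.fit(x_val, y_tr_final)
y_pred_lr_1 = lr_1.predict(x_val_test)
print(classification_report(y_te_final, y_pred_lr_1))
# scikit-learn >= 1.0; older releases used sklearn.metrics.plot_confusion_matrix instead
ConfusionMatrixDisplay.from_estimator(lr_1, x_val_test, y_te_final)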

## **Final Model**

In [None]:
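# Random forest selected as the final model
new_best = RandomForestClassifier(max_features=75, min_samples_leaf=3, n_estimators=200)

In [None]:
# Pass the label column as a 1-D Series rather than the whole DataFrame.
# Note: these are the original three-class labels; use y_tr_final / y_te_final for the binary target.
new_best.fit(x_val, y_val['status_group'])

In [None]:
# Mean accuracy on the hold-out split
new_best.score(x_val_test, y_val_test['status_group'])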

## **Conclusion**