# Kisima: Analysis and Model for Predicting the Functionality of Tanzanian Waterwells

A model and analysis by Karim Oliver, Johnhoy Stephens, and Luluva Lakdawala

## Setting the Scene:

This project, Kisima, investigates the factors that determine the functionality of waterwells in the various regions of Tanzania. Our model and investigations are directed towards the Tanzanian Ministry of Water, Taarifa, and other stakeholders interested in the functionality of waterwells in Tanzania. Many features contribute to the functionality of the waterwells; we aim to discover some of them and build a model that uses these features to correctly predict the status of these wells. A clear understanding of which waterpoints will fail can improve maintenance operations and help ensure that clean, potable water is available to communities across Tanzania.

### Goals:

Our project aims to:
- Build a model that predicts the status of the waterwells with a focus on correctly identifying the non-functional wells
- Investigate some of the features that appear to have a relationship with the status of the waterwells
- Validate the following claims made during our investigation with the features:
    - Quantity of water available to the wells determines their status
    - The Extraction type used to pull water from the wells has a bearing on the functionality of the wells
    - The waterpoint type has an influence on the status of the wells
    - The kind of pump operating the wells, the year of installation, and how the wells are managed are related to the functionality of the wells
    

### Definitions:

- Class:
    - Class is defined as the status of the waterwells in our dataset.
        - Class 0 relates to Functional - the waterpoints that are operational and there are no repairs needed
        - Class 1 relates to Functional but Needs Repair - the waterpoints that are operational but need repairs
        - Class 2 relates to Nonfunctional - the waterpoints that are not operational
- Features:
    - Features refer to the independent variables in each record of the dataset we use to build our model
- Model:
    - Throughout this project, the term model refers to the classification model we build to predict the status of the waterwells
- Recall score:
    - The metric we use to compare the various models we build. For a given class, it is the percentage of records truly in that class that the model correctly identifies (true positives divided by the sum of true positives and false negatives)
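
To make the Recall definition concrete, here is a minimal sketch in plain Python that computes Recall for a single class. The function name `recall_for_class` and the toy labels are illustrative assumptions, not part of the project code; scikit-learn's `recall_score`, which the notebook imports, computes the same kind of quantity.

```python
def recall_for_class(y_true, y_pred, positive_class):
    """Recall for one class: true positives / (true positives + false negatives)."""
    # Records that truly belong to the class and were predicted as that class
    true_positives = sum(
        1 for t, p in zip(y_true, y_pred)
        if t == positive_class and p == positive_class
    )
    # All records that truly belong to the class
    actual_positives = sum(1 for t in y_true if t == positive_class)
    if actual_positives == 0:
        raise ValueError("No records of the requested class in y_true")
    return true_positives / actual_positives


# Toy labels (hypothetical): 0 = functional, 1 = needs repair, 2 = nonfunctional
y_true = [2, 2, 2, 2, 0, 0, 1, 0]
y_pred = [2, 2, 2, 0, 0, 2, 1, 0]

# 3 of the 4 truly nonfunctional wells are caught
print(recall_for_class(y_true, y_pred, positive_class=2))  # 0.75
```

A high Recall for Class 2 means few nonfunctional wells are missed, which is why it suits a project focused on identifying those wells.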

### Data:

The data used in this project is from the DrivenData website and can be found [here](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/data/). You will need to sign up for DrivenData to access these files:
- Training set values
- Training set labels


Additional information about the features in the dataset can be found [here](../../references/data_dictionary_waterpoints.pdf).

Note that this modeling analysis contains the streamlined version of our iterations, from our first simple model to our final model. For a more in-depth view of our exploration process, mistakes, and triumphs, please refer to [Luluva's](../../notebooks/ll_notebooks), [Johnhoy's](../../notebooks/js_notebooks) and [Karim's](../../notebooks/ko_notebooks) notebooks.

### Analysis Takeaways:  

- Our final model, a Random Forest Classifier, achieves a Recall score for Class 2 of about 77% and an overall Accuracy of about 80%
- Our analysis finds that the geographical aspects, the longitude and latitude of the waterwells, were the most significant features that helped classify the waterwells
- We find that the population around the waterwells and the year in which the wells were constructed are among the important features our model uses to classify the records
- Water quantity is the most significant categorical feature, with 'dry', 'enough', and 'insufficient' being the specific encoded categories that determined the classification of the waterwells
- We find that extraction type is also an important categorical feature for determining the classes, within which 'other', 'gravity', and 'hand pump' are the top classifiers
- We also find the waterpoint type 'communal standpipe' to be one of the most important features in our final model
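
As an illustration of how feature rankings like these can be read from a Random Forest, the sketch below fits a small forest on toy data and sorts its impurity-based `feature_importances_`. The data, the column names (`quantity_dry`, `noise_a`, `noise_b`) and the hyperparameters are assumptions made up for this example; it does not reproduce the project's model or results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2020)
n_rows = 400

# Toy one-hot style features (hypothetical column names)
X = pd.DataFrame({
    "quantity_dry": rng.integers(0, 2, n_rows),
    "noise_a": rng.integers(0, 2, n_rows),
    "noise_b": rng.integers(0, 2, n_rows),
})

# Toy target: a well is Class 2 exactly when it is dry, otherwise Class 0
y = np.where(X["quantity_dry"] == 1, 2, 0)

model = RandomForestClassifier(n_estimators=50, random_state=2020)
model.fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Because the toy target depends only on `quantity_dry`, that column ranks first. On real data, impurity-based importances are a guide to what the model relies on rather than proof of cause.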

### Future Investigations:

- Could we build another model that would focus on predicting the Class 1 waterwells, the ones that need repair?
- Which features directly affect the functional wells that need repair?

### Recommendations:

- Regular inspection of high-risk areas can be undertaken to limit the number of nonfunctional waterwells
- Wells that rely on motor pumps or wind power are found to be more susceptible to becoming nonfunctional. Maintenance procedures can be set in place to ensure that they remain functional.

# Data Cleaning and Exploratory Data Analysis:

In [None]:
%load_ext autoreload
%autoreload 2

## Get data:

First, to download the data, go to the DrivenData website [here](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/data/). You need to sign up before you can access these files:
- Training set values
- Training set labels

Once on the data download page, download the two csv files locally and move them to the `data` folder in the root directory. Then:
- Rename the `Training set values` file to `training_set_values.csv`
- Rename the `Training set labels` file to `training_set_labels.csv`
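
Before running the rest of the notebook, it can help to confirm that both renamed files are where the code expects them. The following is a small sketch using only the standard library; the helper name `missing_data_files` is hypothetical, and the `../../data` path in the usage comment assumes the notebook runs from a folder two levels below the project root.

```python
from pathlib import Path

# File names the notebook expects after renaming
EXPECTED_FILES = ("training_set_values.csv", "training_set_labels.csv")


def missing_data_files(data_dir):
    """Return the expected csv file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (data_dir / name).is_file()]


# Example usage (path is an assumption about where the notebook is run from):
# missing = missing_data_files("../../data")
# if missing:
#     print("Please download and rename:", missing)
```

An empty list means both files are in place.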


### Imports

In [9]:
#import
import os, sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split, cross_validate, cross_val_score
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.metrics import confusion_matrix, plot_confusion_matrix
from sklearn.metrics import classification_report, recall_score, make_scorer
from sklearn.dummy import DummyClassifier
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import GridSearchCV

module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)
    
#import customized functions
from src.data_cleaning import cleaning_functions as cfs
from src.data_cleaning import exploration_functions as efs
from src.data_cleaning import processing_functions as pfs
from src.data_cleaning import modeling_functions as mfs

We use our custom function to read in our files. It uses pandas to read the csv files and convert them into dataframes.

We then split the dataframes into training and testing sets, setting the random_state to 2020.
We do this right away so that we do not learn anything from the designated test sets; our exploration and models learn only from the training sets.

We then combine X_train and y_train into one dataframe and, in the end, return X_train, X_test, y_train, y_test and the merged dataframe.

**Note**: Throughout this project, every time we had to set a random_state, we chose 2020.
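
The custom loader lives in the project's `src` package and is not reproduced here. As a rough sketch of the steps described above, the function below reads the two csv files, splits them, and rebuilds a merged training dataframe. The name `load_and_split`, the shared `id` column in both files, and the label column name `status_group` are assumptions for illustration; the actual `cfs.load_data_files` may differ.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def load_and_split(values_path, labels_path, random_state=2020):
    """Read values and labels, split into train/test, and merge the training part."""
    values = pd.read_csv(values_path)
    labels = pd.read_csv(labels_path)

    # Align features and labels on the shared id column (assumed to exist in both)
    data = values.merge(labels, on="id", how="inner")
    X = data.drop(columns=["status_group"])  # assumed label column name
    y = data["status_group"]

    # Split before any exploration so nothing is learned from the test set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=random_state
    )

    # Merged training dataframe used for exploration
    df = pd.concat([X_train, y_train], axis=1)
    return X_train, X_test, y_train, y_test, df
```

With scikit-learn's default `test_size`, a quarter of the records are held out for testing; the sketch does not assume the project used that default.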

In [15]:
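# Machine-specific path; note that sys.path only affects module imports,
# not where the csv files are read from
sys.path.append('/Users/lulualakdawala/Documents/DS_course/Mod_3/project/Tanzania/data')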

In [16]:
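# Show the working directory; the relative data paths are resolved against it
curdir = os.getcwd()
print(curdir)
# Load the data, split it, and get the merged training dataframe
X_train, X_test, y_train, y_test, df = cfs.load_data_files()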

/Users/lulualakdawala/Documents/DS_course/Mod_3/project/Tanzania


FileNotFoundError: [Errno 2] File ../../data/training_set_values.csv does not exist: '../../data/training_set_values.csv'

| | id | amount_tsh | date_recorded | funder | gps_height | installer | longitude | latitude | wpt_name | num_private | ... | payment_type | water_quality | quality_group | quantity | quantity_group | source | source_type | source_class | waterpoint_type | waterpoint_type_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 69572 | 6000.0 | 2011-03-14 | Roman | 1390 | Roman | 34.938093 | -9.856322 | none | 0 | ... | annually | soft | good | enough | enough | spring | spring | groundwater | communal standpipe | communal standpipe |
| 1 | 8776 | 0.0 | 2013-03-06 | Grumeti | 1399 | GRUMETI | 34.698766 | -2.147466 | Zahanati | 0 | ... | never pay | soft | good | insufficient | insufficient | rainwater harvesting | rainwater harvesting | surface | communal standpipe | communal standpipe |
| 2 | 34310 | 25.0 | 2013-02-25 | Lottery Club | 686 | World vision | 37.460664 | -3.821329 | Kwa Mahundi | 0 | ... | per bucket | soft | good | enough | enough | dam | dam | surface | communal standpipe multiple | communal standpipe |
| 3 | 67743 | 0.0 | 2013-01-28 | Unicef | 263 | UNICEF | 38.486161 | -11.155298 | Zahanati Ya Nanyumbu | 0 | ... | never pay | soft | good | dry | dry | machine dbh | borehole | groundwater | communal standpipe multiple | communal standpipe |
| 4 | 19728 | 0.0 | 2011-07-13 | Action In A | 0 | Artisan | 31.130847 | -1.825359 | Shuleni | 0 | ... | never pay | soft | good | seasonal | seasonal | rainwater harvesting | rainwater harvesting | surface | communal standpipe | communal standpipe |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 59395 | 60739 | 10.0 | 2013-05-03 | Germany Republi | 1210 | CES | 37.169807 | -3.253847 | Area Three Namba 27 | 0 | ... | per bucket | soft | good | enough | enough | spring | spring | groundwater | communal standpipe | communal standpipe |
| 59396 | 27263 | 4700.0 | 2011-05-07 | Cefa-njombe | 1212 | Cefa | 35.249991 | -9.070629 | Kwa Yahona Kuvala | 0 | ... | annually | soft | good | enough | enough | river | river/lake | surface | communal standpipe | communal standpipe |
| 59397 | 37057 | 0.0 | 2011-04-11 | | 0 | | 34.017087 | -8.750434 | Mashine | 0 | ... | monthly | fluoride | fluoride | enough | enough | machine dbh | borehole | groundwater | hand pump | hand pump |
| 59398 | 31282 | 0.0 | 2011-03-08 | Malec | 0 | Musa | 35.861315 | -6.378573 | Mshoro | 0 | ... | never pay | soft | good | insufficient | insufficient | shallow well | shallow well | groundwater | hand pump | hand pump |
