# Tanzania Water Pump: Machine Learning Analysis

---

### Explore several ML classification algorithms 

Predict whether a pump is functional, functional but in need of repair, or non-functional, using data from [Taarifa](http://taarifa.org/) and the [Tanzania Ministry of Water](http://maji.go.tz/). The features describe what kind of pump is operating, when it was installed, and how it is managed. Knowing which water pumps are likely to fail could help optimize maintenance operations and provide Tanzanian citizens with potable water more reliably.

This predictive modeling challenge comes from [DrivenData](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/), an organization that helps non-profits by hosting data science competitions for social impact. The competition has open licensing: "The data is available for use outside of DrivenData." The data was provided for a private Kaggle competition held as part of BloomTech's Data Science curriculum.

### Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, validation_curve
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.inspection import permutation_importance
from xgboost import XGBClassifier
pd.set_option('display.max_columns', None)

# I. Load and Clean Data
Using what we learned in the EDA notebook, we wrote a `wrangle()` function that applies the same data cleaning operations to both the training and the testing data.

In [2]:
def wrangle(feature_path, target_path=None):
    """
        Load and clean the feature matrix (and, if given, the target vector)
        from csv files. The cleaning tasks include:

            - Replace zeros and the near-zero placeholder -2e-08 with NaNs
            - Convert datatypes (parse 'date_recorded' as a date)
            - Remove unnecessary columns (duplicate, redundant, constant,
              mostly null)
            - Remove high-cardinality categorical features
            - Create a 'pump_age' feature and drop 'date_recorded'
        
        Parameters
        ----------
        feature_path (str): path to feature matrix csv file
        target_path (str): path to target vector csv file (optional)

        Returns
        -------
        DataFrame indexed by 'id'
    """

    if target_path:
        df = pd.merge(pd.read_csv(feature_path,
                                  na_values=[0, -2.000000e-08],
                                  parse_dates=['date_recorded']),
                      pd.read_csv(target_path)).set_index('id')

    else:
        # Use the same na_values as the training branch so both datasets
        # are cleaned identically
        df = pd.read_csv(feature_path,
                         na_values=[0, -2.000000e-08],
                         parse_dates=['date_recorded'],
                         index_col='id')

    # Remove unnecessary columns
    df.drop(columns=['num_private',
                     'region_code',
                     'district_code', ###
                     'recorded_by',
                     'scheme_management', ###
                     'scheme_name',
                     'extraction_type_group', ###
                     'extraction_type_class', ###
                     'payment_type', ###
                     'quality_group', ###
                     'quantity_group',
                     'source',
                     'source_class',
                     'waterpoint_type_group'], inplace=True)

    # Remove high-cardinality categorical columns (more than 100 unique values)
    cutoff = 100
    drop_cols = [col for col in df.select_dtypes('object').columns 
                if df[col].nunique() > cutoff]
    df.drop(columns=drop_cols, inplace=True)

    # Create age feature
    df['pump_age'] = df['date_recorded'].dt.year - df['construction_year']
    df.drop(columns='date_recorded', inplace=True)


    return df
                     

In [3]:
# Load data
df = wrangle(feature_path='train_features.csv',
             target_path='train_labels.csv')

X_test = wrangle(feature_path='test_features.csv')

In [4]:
# View first 5 rows of df and X_test
display(df.head())
X_test.head()

| id | amount_tsh | gps_height | longitude | latitude | basin | region | population | public_meeting | permit | construction_year | extraction_type | management | management_group | payment | water_quality | quantity | source_type | waterpoint_type | status_group | pump_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 454.0 | 50.0 | 2092.0 | 35.42602 | -4.227446 | Internal | Manyara | 160.0 | True | True | 1998.0 | gravity | water board | user-group | pay per bucket | soft | insufficient | spring | communal standpipe | functional | 15.0 |
| 510.0 | NaN | NaN | 35.510074 | -5.724555 | Internal | Dodoma | NaN | True | True | NaN | india mark ii | vwc | user-group | never pay | soft | enough | shallow well | hand pump | functional | NaN |
| 14146.0 | NaN | NaN | 32.499866 | -9.081222 | Lake Rukwa | Mbeya | NaN | True | False | NaN | other | vwc | user-group | never pay | soft | enough | shallow well | other | non functional | NaN |
| 47410.0 | NaN | NaN | 34.060484 | -8.830208 | Rufiji | Mbeya | NaN | True | True | NaN | gravity | vwc | user-group | pay monthly | soft | insufficient | river/lake | communal standpipe | non functional | NaN |
| 1288.0 | 300.0 | 1023.0 | 37.03269 | -6.040787 | Wami / Ruvu | Morogoro | 120.0 | True | True | 1997.0 | other | vwc | user-group | pay when scheme fails | salty | enough | shallow well | other | non functional | 14.0 |


| id | amount_tsh | gps_height | longitude | latitude | basin | region | population | public_meeting | permit | construction_year | extraction_type | management | management_group | payment | water_quality | quantity | source_type | waterpoint_type | pump_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 37098 | NaN | NaN | 31.985658 | -3.59636 | Lake Tanganyika | Shinyanga | NaN | True | True | NaN | other | wug | user-group | unknown | soft | dry | shallow well | other | NaN |
| 14530 | NaN | NaN | 32.832815 | -4.944937 | Lake Tanganyika | Tabora | NaN | True | True | NaN | india mark ii | vwc | user-group | never pay | milky | insufficient | shallow well | hand pump | NaN |
| 62607 | 10.0 | 1675.0 | 35.488289 | -4.242048 | Internal | Manyara | 148.0 | True | True | 2008.0 | gravity | water board | user-group | pay per bucket | soft | insufficient | spring | communal standpipe | 5.0 |
| 46053 | NaN | NaN | 33.140828 | -9.059386 | Lake Rukwa | Mbeya | NaN | False | False | NaN | nira/tanira | vwc | user-group | never pay | soft | seasonal | shallow well | hand pump | NaN |
| 47083 | 50.0 | 1109.0 | 34.217077 | -4.430529 | Internal | Singida | 235.0 | True | True | 2011.0 | mono | wua | user-group | pay per bucket | soft | enough | borehole | communal standpipe multiple | 2.0 |


Many columns still contain null values. So that the training and testing datasets undergo exactly the same pre-processing steps, we will impute these values inside a pipeline.
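
As a rough illustration of why imputation belongs in a pipeline, the sketch below fits a `SimpleImputer` on a tiny made-up training frame and reuses the learned fill values on a test frame. The frames `toy_train` and `toy_test` and their values are assumptions for demonstration only; they are not the pump data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric frames with made-up values (not the pump data)
toy_train = pd.DataFrame({'gps_height': [100.0, np.nan, 300.0],
                          'population': [10.0, 20.0, np.nan]})
toy_test = pd.DataFrame({'gps_height': [np.nan],
                         'population': [np.nan]})

# The imputer learns the column means from the training frame only
prep = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())
prep.fit(toy_train)

# The test row is filled with the *training* means
imputer = prep.named_steps['simpleimputer']
filled_test = imputer.transform(toy_test)
print('Learned training means:', imputer.statistics_)
print('Imputed test row:', filled_test)
```

Because the fill values are stored in the fitted pipeline, nothing is learned from the test set, which is the property we want when the same steps are later applied to `X_test`.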

# II. Split Data

In [5]:
# Create feature matrix and target vector for training data
target = 'status_group'

y = df[target]
X = df.drop(columns=target)

In [8]:
# Split training data into training and validation sets
# (random_state is fixed so the split is reproducible)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=42)


In [11]:
# Sanity check
display(X_train.shape, y_train.shape, X_val.shape, y_val.shape)


(38015, 19)

(38015,)

(9504, 19)

(9504,)

# III. Establish Baseline
We will use the share of the most frequent class as our baseline accuracy: this is the score we would get by always predicting that class. Our model must outperform this baseline for its predictions to have any merit.

In [12]:
# Baseline accuracy score
print('Baseline accuracy:', y.value_counts(normalize=True).max())

Baseline accuracy: 0.5429828068772491
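
The same majority-class baseline can also be expressed as a model. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels (`toy_y` and `toy_X` are illustrative assumptions, not the pump data) and checks that its accuracy equals the largest class share.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Made-up labels for illustration only
toy_y = pd.Series(['functional'] * 6 + ['non functional'] * 3
                  + ['functional needs repair'])
toy_X = pd.DataFrame({'placeholder': range(len(toy_y))})

# Always predicts the most frequent class seen during fit
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(toy_X, toy_y)

dummy_acc = dummy.score(toy_X, toy_y)
majority_share = toy_y.value_counts(normalize=True).max()
print(dummy_acc, majority_share)
```

Wrapping the baseline in an estimator makes it possible to score it with the same code path as any real model.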


# IV. Build and Train Model

### Random Forest Classifier
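
The notebook does not show the model code for this step, so the following is only a minimal sketch of one way a random forest pipeline could be assembled with the imports above. It runs on a synthetic stand-in frame (`demo_X`, `demo_y`, with made-up values), and the column lists and hyperparameters are assumptions rather than the settings actually used.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Synthetic stand-in for the wrangled data (values are made up);
# in the notebook, X_train / y_train would be used instead
rng = np.random.default_rng(42)
n = 300
demo_X = pd.DataFrame({
    'quantity': rng.choice(['enough', 'insufficient', 'dry'], size=n),
    'gps_height': rng.normal(1000, 300, size=n),
})
demo_y = np.where(demo_X['quantity'] == 'dry', 'non functional', 'functional')
demo_X.loc[::10, 'gps_height'] = np.nan  # inject some missing values

Xd_train, Xd_val, yd_train, yd_val = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42)

categorical_cols = ['quantity']  # assumed list of categorical columns

# Encode only the categorical columns; numeric columns pass through unchanged.
# Categories unseen during fit are encoded as -1 instead of raising an error.
encode = make_column_transformer(
    (OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1),
     categorical_cols),
    remainder='passthrough',
)

model_rf = make_pipeline(
    encode,
    SimpleImputer(strategy='mean'),
    RandomForestClassifier(n_estimators=50, random_state=42),
)
model_rf.fit(Xd_train, yd_train)
val_score = model_rf.score(Xd_val, yd_val)
print('Validation accuracy:', val_score)
```

Keeping the encoder, imputer, and classifier in one pipeline means a single `fit` call learns every transformation from the training split, and `predict` applies them unchanged to validation or test data.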

# V. Check Evaluation Metrics

Compare the model's training and validation accuracy with the baseline accuracy.
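
A hedged sketch of that comparison is shown here on synthetic three-class data from `make_classification`, since the fitted model is not part of this section. All names (`Xs`, `ys`, `clf`) and settings are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the pump features
Xs, ys = make_classification(n_samples=600, n_features=8, n_informative=5,
                             n_classes=3, random_state=0)
Xs_train, Xs_val, ys_train, ys_val = train_test_split(
    Xs, ys, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(Xs_train, ys_train)

# Baseline: share of the most frequent class in the training labels
baseline_acc = np.bincount(ys_train).max() / len(ys_train)
train_acc = accuracy_score(ys_train, clf.predict(Xs_train))
val_acc = accuracy_score(ys_val, clf.predict(Xs_val))

print(f'Baseline accuracy:   {baseline_acc:.3f}')
print(f'Training accuracy:   {train_acc:.3f}')
print(f'Validation accuracy: {val_acc:.3f}')
```

A large gap between training and validation accuracy suggests overfitting, while a validation accuracy close to the baseline suggests the model has learned little beyond the class frequencies.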

# VI. Tune Model
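
No tuning code is recorded in this section. As one possible approach, the sketch below runs a small `RandomizedSearchCV` over a random forest on synthetic data; the search space, `n_iter`, and `cv` values are illustrative guesses, not tuned recommendations.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic 3-class data standing in for the pump features
Xt, yt = make_classification(n_samples=300, n_features=8, n_informative=5,
                             n_classes=3, random_state=0)

# Hypothetical search space; the ranges are illustrative only
param_distributions = {
    'n_estimators': randint(20, 60),
    'max_depth': [5, 10, None],
    'max_features': ['sqrt', 0.5],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # number of sampled parameter combinations
    cv=3,                # 3-fold cross-validation per combination
    scoring='accuracy',
    random_state=0,
)
search.fit(Xt, yt)

print('Best params:', search.best_params_)
print('Best CV accuracy:', search.best_score_)
```

When the estimator is a pipeline built with `make_pipeline`, the parameter names need the step prefix, for example `randomforestclassifier__max_depth`.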


# VII. Communicate Results

One way to improve our model in the future would be to change how we handle high-cardinality categorical features. Rather than dropping these columns outright, we could reduce the cardinality of each feature by keeping its most frequent categories and aggregating the rest into a single "other" category.
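
A minimal sketch of that idea follows. The helper `keep_top_categories` and the sample values are hypothetical, introduced only for illustration. In practice the list of top categories would be learned from the training data and reused on the test data.

```python
import pandas as pd

def keep_top_categories(series, top_n=10, other_label='other'):
    """Keep the top_n most frequent categories; relabel the rest."""
    top = series.value_counts().nlargest(top_n).index
    # where() keeps values that satisfy the condition and replaces the others;
    # NaNs are left untouched so they can still be imputed later
    return series.where(series.isin(top) | series.isna(), other_label)

# Made-up category values for illustration
toy = pd.Series(['gov', 'gov', 'gov', 'danida', 'danida',
                 'unicef', 'kkkt', None])
reduced = keep_top_categories(toy, top_n=2)
print(reduced.tolist())
print('Unique categories:', reduced.nunique())
```

With `top_n=2`, the two rare values are merged into "other" while the missing entry stays missing, so the column drops from four categories to three.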

Another option is to try undersampling or oversampling, because the classes are unbalanced: the proportion of functional to non-functional pumps is uneven.
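
As a rough sketch of oversampling, the example below uses `sklearn.utils.resample` to draw each class (with replacement) up to the size of the largest class. The frame `toy_df` is made up for illustration, and this is only one of several possible resampling strategies.

```python
import pandas as pd
from sklearn.utils import resample

# Made-up imbalanced training frame
toy_df = pd.DataFrame({
    'gps_height': range(10),
    'status_group': (['functional'] * 6 + ['non functional'] * 3
                     + ['functional needs repair']),
})

majority_size = toy_df['status_group'].value_counts().max()

# Oversample every class (with replacement) up to the majority class size
balanced = pd.concat([
    resample(group, replace=True, n_samples=majority_size, random_state=42)
    for _, group in toy_df.groupby('status_group')
])
print(balanced['status_group'].value_counts())
```

Resampling should be applied to the training split only, so that the validation scores still reflect the original class proportions. A lighter-weight alternative is passing `class_weight='balanced'` to `RandomForestClassifier`.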