# Tanzania Water Pump: Machine Learning Analysis

---

### Explore several ML classification algorithms 

The goal is to predict whether a water pump is functional, functional but in need of repair, or non-functional. The data comes from [Taarifa](http://taarifa.org/) and the [Tanzania Ministry of Water](http://maji.go.tz/) and describes what kind of pump is operating, when it was installed, and how it is managed. Knowing which pumps are likely to fail could help prioritize maintenance and provide Tanzanian citizens with potable water more reliably.

This predictive modeling challenge comes from [DrivenData](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/), an organization that helps non-profits by hosting data science competitions for social impact. The competition has open licensing: "The data is available for use outside of DrivenData." The data was provided for a private Kaggle competition held as part of BloomTech's Data Science curriculum.

### Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency


# from pandas_profiling import ProfileReport
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, validation_curve
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.inspection import permutation_importance
from xgboost import XGBClassifier
pd.set_option('display.max_columns', None)

# I. Load and Clean Data
Using the insights gained from the EDA notebook, a `wrangle()` function was written to apply the same pre-processing operations to both the training and testing data.

In [None]:
def wrangle(feature_path, target_path=None):
    """
        Load and clean the feature matrix and (optionally) the target vector
        from .csv files. The cleaning tasks include:
            - Replace placeholder values (e.g. the near-zero latitude -2e-08) with NaNs
            - Convert datatypes
            - Remove unnecessary and duplicate columns
            - Remove high-cardinality categorical columns (HCCCs)

        Input: feature_path (str), target_path (str, optional; training data only)
        Output: pandas DataFrame indexed by 'id'
    """

    if target_path:
        df = pd.merge(pd.read_csv(feature_path,
                                  na_values=[0, -2.000000e-08],
                                  parse_dates=['date_recorded']),
                      pd.read_csv(target_path)).set_index('id')

    else:
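        # Use the same NaN placeholders as the training branch so both
        # datasets are cleaned identically
        df = pd.read_csv(feature_path,
                         na_values=[0, -2.000000e-08],
                         parse_dates=['date_recorded'],
                         index_col='id')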

    # Remove unnecessary columns
    df.drop(columns=['region_code',
                     'district_code',
                     'recorded_by',
                     'scheme_management',
                     'extraction_type_group',
                     'extraction_type_class',
                     'payment_type',
                     'quality_group',
                     'quantity_group',
                     'source',
                     'source_group',
                     'waterpoint_type_group'], inplace=True)

    # Remove HCCCs (columns with over 100 different categories)
    cutoff = 100
    drop_cols = [col for col in df.select_dtypes('object').columns 
                if df[col].nunique() > cutoff]
    df.drop(columns=drop_cols, inplace=True)


    return df
                     

# II. Split Data
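
A minimal sketch of how this step might look, assuming a stratified hold-out split so that each pump status keeps roughly the same share in both subsets. The toy DataFrame stands in for the output of `wrangle()`, and the target column name `status_group`, the feature columns, and the 80/20 ratio are assumptions rather than choices confirmed by this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the DataFrame returned by wrangle()
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    'gps_height': rng.integers(0, 2000, size=n),
    'quantity': rng.choice(['enough', 'insufficient', 'dry'], size=n),
    'status_group': rng.choice(
        ['functional', 'functional needs repair', 'non functional'],
        size=n, p=[0.55, 0.10, 0.35]),
})

target = 'status_group'  # assumed name of the target column
X = df.drop(columns=target)
y = df[target]

# Stratify on the target so class proportions are preserved in both subsets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(X_train.shape, X_val.shape)
```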

# III. Establish Baseline

The baseline is the accuracy score obtained by always predicting the most frequent class.
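
One common way to establish such a baseline is the majority-class accuracy, sketched below. The label counts are made up for illustration and are not the real class distribution; in practice `y_train` would be the training target from the split.

```python
import pandas as pd

# Hypothetical training labels (made-up counts); replace with the real y_train
y_train = pd.Series(['functional'] * 54
                    + ['non functional'] * 38
                    + ['functional needs repair'] * 8)

# Majority-class baseline: the accuracy of always predicting the most
# frequent label equals that label's share of the data
majority_class = y_train.mode()[0]
baseline_acc = y_train.value_counts(normalize=True).max()

print('Majority class:', majority_class)
print('Baseline accuracy:', round(baseline_acc, 3))
```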

# IV. Build and Train Model
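
The imports at the top suggest a scikit-learn pipeline with an encoder, an imputer, and a tree-based classifier. The sketch below shows one possible arrangement under that assumption. The toy data, the column names, and the hyperparameters are hypothetical, and the `ColumnTransformer` (not among the notebook's imports) is added here so that only the categorical column is ordinal-encoded.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training data: one categorical and one numeric feature
rng = np.random.default_rng(0)
n = 300
X_train = pd.DataFrame({
    'quantity': rng.choice(['enough', 'insufficient', 'dry'], size=n),
    'construction_year': rng.choice([np.nan, 1995.0, 2005.0, 2010.0], size=n),
})
# Toy target: dry pumps are labelled non functional
y_train = np.where(X_train['quantity'] == 'dry', 'non functional', 'functional')

cat_cols = ['quantity']  # categorical columns to encode

# Encode categoricals as integers; categories unseen at fit time map to -1
preprocess = make_column_transformer(
    (OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1),
     cat_cols),
    remainder='passthrough')

model = make_pipeline(
    preprocess,
    SimpleImputer(strategy='median'),  # fill the remaining NaNs
    RandomForestClassifier(n_estimators=50, random_state=42))

model.fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
print('Training accuracy:', train_acc)
```

Keeping every step in a single pipeline means the encoder and imputer are fit on the training data only and then reapplied unchanged to validation and test data.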

# V. Check Evaluation Metrics

Compare the model's validation accuracy with the baseline accuracy.
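
As a rough illustration of that comparison, the sketch below scores a set of predictions and a majority-class guess against the same validation labels. The labels and predictions are invented for the example and are not results from this project.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical validation labels and model predictions (not real results)
y_val = np.array(['functional'] * 6
                 + ['non functional'] * 3
                 + ['functional needs repair'])
y_pred = np.array(['functional'] * 6
                  + ['non functional'] * 2
                  + ['functional'] * 2)

# Baseline: always predict the majority class (assumed here to be 'functional';
# in practice it should be taken from the training labels)
baseline_pred = np.full(len(y_val), 'functional')

model_acc = accuracy_score(y_val, y_pred)
baseline_acc = accuracy_score(y_val, baseline_pred)

print('Model accuracy:   ', model_acc)
print('Baseline accuracy:', baseline_acc)
print('Improvement:      ', round(model_acc - baseline_acc, 3))
```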

# VI. Tune Model
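
Since `RandomizedSearchCV` is imported above, tuning could follow the sketch below. The synthetic dataset from `make_classification` and the parameter ranges are assumptions chosen only to keep the example small and fast; they are not the search space used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic three-class data standing in for the wrangled pump features
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=42)

# Hypothetical search space; real ranges depend on the data
param_distributions = {
    'n_estimators': [25, 50],
    'max_depth': [3, 5, None],
    'max_features': ['sqrt', 0.5],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,            # number of sampled parameter combinations
    cv=3,                # 3-fold cross-validation
    scoring='accuracy',
    random_state=42)
search.fit(X, y)

print('Best parameters:', search.best_params_)
print('Best CV accuracy:', round(search.best_score_, 3))
```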


# VII. Communicate Results

One future improvement would be to change how HCCCs are handled. Rather than dropping these columns outright, we could reduce the cardinality of each feature by keeping its most frequent categories and aggregating the rest into an "other" category.
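
A minimal sketch of that idea follows. The helper `collapse_rare` and its parameters are hypothetical names introduced for illustration, and the toy Series stands in for a real high-cardinality column.

```python
import pandas as pd

def collapse_rare(series, top_n=10, other_label='other'):
    """Keep the top_n most frequent categories; relabel the rest as other_label.

    Missing values are left as-is so they can be imputed later.
    """
    top = series.value_counts().nlargest(top_n).index
    keep = series.isin(top) | series.isna()
    return series.where(keep, other_label)

# Toy high-cardinality column (hypothetical values)
s = pd.Series(['a'] * 5 + ['b'] * 3 + ['c', 'd', None])
reduced = collapse_rare(s, top_n=2)

print(reduced.value_counts(dropna=False))
```

The set of top categories should be computed on the training data only and then reused for the test data, so that both are mapped onto the same categories.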

It would also be worth trying undersampling or oversampling, because the proportion of functional to non-functional pumps is uneven.
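
As a hedged illustration, random oversampling can be done with `sklearn.utils.resample`, as sketched below. The toy frame, the `status_group` column name, and the class counts are assumptions; resampling should be applied to the training split only, so that the validation data keeps its natural class distribution.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced training data
train = pd.DataFrame({
    'feature': range(10),
    'status_group': ['functional'] * 6
                    + ['non functional'] * 3
                    + ['functional needs repair'],
})
target = 'status_group'  # assumed name of the target column

# Oversample every class (with replacement) up to the size of the largest one
max_size = train[target].value_counts().max()
balanced = pd.concat([
    resample(group, replace=True, n_samples=max_size, random_state=42)
    for _, group in train.groupby(target)
])

print(balanced[target].value_counts())
```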