# Data Summary and Goal

## Summary

Tanzania, as a developing country, struggles with providing clean water to its population of over 57,000,000. There are many waterpoints already established in the country, but some are in need of repair while others have failed altogether.

## Goal

Build a classifier to predict the condition of a water well, using information provided in the data. This information includes:
- Date
- Location
- Source
- Funder
- And more!

This data is from the DrivenData.org website. It is part of the "Pump It Up: Data Mining the Water Table" competition. DrivenData split the data into two sets, the "Training Set" and the "Test Set".

The names imply that the training set is for building our models and the test set is for testing them. For this project, we considered merging the two dataframes to have more data to work with; however, the training set alone has 59,400 entries, which should be more than enough to make good predictions.

If our models are subpar, we may merge the tables to acquire more data points and potentially improve model efficacy.

# Data Cleaning

In [8]:
# Import Pandas
import pandas as pd
from warnings import simplefilter

# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

# Load data into pandas DataFrames
status_groups = pd.read_csv('status_groups.csv')
testset = pd.read_csv('test_set.csv')
trainset = pd.read_csv('training_set.csv')

# Analyze shape of datasets
print(f'Training Set Shape: {trainset.shape}') # Will use this set for our analysis
print(f'Testing Set Shape: {testset.shape}')


Training Set Shape: (59400, 40)
Testing Set Shape: (14850, 40)
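
The labels are loaded separately into `status_groups`, so they need to be attached to the training features before modelling. A minimal sketch of that join follows, assuming (this is not shown in the cells above) that `status_groups` has an `id` column matching `trainset['id']` and a label column named `status_group`. The helper name `attach_labels` and the tiny demo frames are hypothetical stand-ins.

```python
import pandas as pd


def attach_labels(features: pd.DataFrame, labels: pd.DataFrame) -> pd.DataFrame:
    """Left-join labels onto features by `id` (assumed shared key column)."""
    # validate="one_to_one" raises if an id is duplicated on either side
    return features.merge(labels, on="id", how="left", validate="one_to_one")


# Tiny synthetic frames standing in for trainset and status_groups
demo_features = pd.DataFrame({"id": [1, 2, 3], "amount_tsh": [0.0, 25.0, 50.0]})
demo_labels = pd.DataFrame(
    {"id": [3, 1, 2], "status_group": ["functional", "non functional", "functional"]}
)
demo_merged = attach_labels(demo_features, demo_labels)
```

With the real data this would be called as `attach_labels(trainset, status_groups)`, provided the column-name assumptions hold.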


## Missing Values

In [10]:
trainset.isna().sum()

id                           0
amount_tsh                   0
date_recorded                0
funder                    3635
gps_height                   0
installer                 3655
longitude                    0
latitude                     0
wpt_name                     0
num_private                  0
basin                        0
subvillage                 371
region                       0
region_code                  0
district_code                0
lga                          0
ward                         0
population                   0
public_meeting            3334
recorded_by                  0
scheme_management         3877
scheme_name              28166
permit                    3056
construction_year            0
extraction_type              0
extraction_type_group        0
extraction_type_class        0
management                   0
management_group             0
payment                      0
payment_type                 0
water_quality                0
quality_
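
Raw counts are easier to judge relative to the number of rows: for example, `scheme_name` is missing in 28,166 of 59,400 rows, roughly 47%. The sketch below shows one way to report counts and percentages together, restricted to columns that have missing values. The helper name `missing_summary` and the demo frame are hypothetical.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame(
        {"n_missing": counts, "pct_missing": (counts / len(df) * 100).round(2)}
    )
    # Keep only columns with at least one missing value, largest first
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Small synthetic frame standing in for trainset
demo = pd.DataFrame(
    {
        "funder": ["A", None, "C", None],
        "permit": [True, False, None, True],
        "basin": ["x", "y", "z", "w"],
    }
)
demo_summary = missing_summary(demo)
```

On the real data, `missing_summary(trainset)` would list only the columns with non-zero counts in the output above.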

## Dates

In [4]:
# Parse the recorded date strings into pandas datetime values
trainset['date_recorded'] = pd.to_datetime(trainset['date_recorded'])
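
Once the column is a datetime, its parts can be pulled out as numeric features. The sketch below is one possible follow-up, not a step taken in this notebook. The helper name `add_date_parts` and the derived column names are hypothetical, and `errors="coerce"` is an added assumption so that unparseable strings become `NaT` instead of raising.

```python
import pandas as pd


def add_date_parts(df: pd.DataFrame, col: str = "date_recorded") -> pd.DataFrame:
    """Return a copy of df with `col` parsed and year/month columns added."""
    out = df.copy()
    # errors="coerce" turns unparseable strings into NaT instead of raising
    out[col] = pd.to_datetime(out[col], errors="coerce")
    out[f"{col}_year"] = out[col].dt.year
    out[f"{col}_month"] = out[col].dt.month
    return out


# Synthetic dates standing in for trainset['date_recorded']
demo = pd.DataFrame({"date_recorded": ["2011-03-14", "2013-02-04", "not a date"]})
demo_dates = add_date_parts(demo)
```

Rows that fail to parse end up with missing year and month values, so they can be counted with `isna()` like any other missing data.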

## Categories

In [9]:
# We should wait until after dealing with missing values, duplicates, outliers, etc. 
# before splitting these up into our features and categories

In [5]:
object_cols = trainset.select_dtypes('object').columns.tolist()
object_cols

['funder',
 'installer',
 'wpt_name',
 'basin',
 'subvillage',
 'region',
 'lga',
 'ward',
 'public_meeting',
 'recorded_by',
 'scheme_management',
 'scheme_name',
 'permit',
 'extraction_type',
 'extraction_type_group',
 'extraction_type_class',
 'management',
 'management_group',
 'payment',
 'payment_type',
 'water_quality',
 'quality_group',
 'quantity',
 'quantity_group',
 'source',
 'source_type',
 'source_class',
 'waterpoint_type',
 'waterpoint_type_group']

In [6]:
trainset['funder'].nunique()

980

In [7]:
number_unique = [int(trainset[col].nunique()) for col in object_cols]
index_cat = [i for i,x in enumerate(number_unique) if x < 20]
cats = [trainset.columns[x] for x in index_cat]
cat_df = trainset[cats]
cat_df.head()

| | funder | wpt_name | num_private | basin | region | region_code | district_code | lga | ward | population | ... | recorded_by | scheme_management | scheme_name | permit | construction_year | extraction_type | extraction_type_group | extraction_type_class | management | management_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Dmdd | Dinamu Secondary School | 0 | Internal | Manyara | 21 | 3 | Mbulu | Bashay | 321 | ... | GeoData Consultants Ltd | Parastatal | NaN | True | 2012 | other | other | other | parastatal | parastatal |
| 1 | Government Of Tanzania | Kimnyak | 0 | Pangani | Arusha | 2 | 2 | Arusha Rural | Kimnyaki | 300 | ... | GeoData Consultants Ltd | VWC | TPRI pipe line | True | 2000 | gravity | gravity | gravity | vwc | user-group |
| 2 | NaN | Puma Secondary | 0 | Internal | Singida | 13 | 2 | Singida Rural | Puma | 500 | ... | GeoData Consultants Ltd | VWC | P | NaN | 2010 | other | other | other | vwc | user-group |
| 3 | Finn Water | Kwa Mzee Pange | 0 | Ruvuma / Southern Coast | Lindi | 80 | 43 | Liwale | Mkutano | 250 | ... | GeoData Consultants Ltd | VWC | NaN | True | 1987 | other | other | other | vwc | user-group |
| 4 | Bruder | Kwa Mzee Turuka | 0 | Ruvuma / Southern Coast | Ruvuma | 10 | 3 | Mbinga | Mbinga Urban | 60 | ... | GeoData Consultants Ltd | Water Board | BRUDER | True | 2000 | gravity | gravity | gravity | water board | user-group |
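
The preview above includes `num_private` and `population`, which are not in `object_cols`, and `funder`, which has 980 unique values. The positions in `index_cat` refer to `object_cols`, but the cell looks them up in `trainset.columns`, so the selection is shifted. A minimal sketch of selecting low-cardinality object columns by name instead is shown below. The helper `low_cardinality_columns` and its `max_unique` parameter are hypothetical names, and the default threshold of 20 is taken from the cell above.

```python
import pandas as pd


def low_cardinality_columns(df: pd.DataFrame, max_unique: int = 20) -> list:
    """Names of object-dtype columns with fewer than `max_unique` distinct values."""
    # nunique() on a DataFrame returns one count per column, indexed by column name
    n_unique = df.select_dtypes("object").nunique()
    return n_unique[n_unique < max_unique].index.tolist()


# Synthetic frame standing in for trainset; the explicit object dtype keeps the
# demo consistent across pandas versions
demo = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "funder": ["a", "b", "c", "d"],
        "basin": ["Internal", "Pangani", "Internal", "Pangani"],
    }
).astype({"funder": "object", "basin": "object"})

demo_cats = low_cardinality_columns(demo, max_unique=3)
demo_cat_df = demo[demo_cats]
```

Because the result is a list of names, `trainset[low_cardinality_columns(trainset)]` would select exactly the object columns that pass the threshold, regardless of their position in the frame.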
