# Data Summary and Goal

## Summary

Tanzania, as a developing country, struggles with providing clean water to its population of over 57,000,000. There are many waterpoints already established in the country, but some are in need of repair while others have failed altogether.

## Goal

Build a classifier to predict the condition of a water well, using information provided in the data. This information includes:
- Date
- Location
- Source
- Funder
- And more!

This data comes from the DrivenData.org website, as part of the "Pump It Up: Data Mining the Water Table" competition. DrivenData split the data into two sets, the "Training Set" and the "Test Set".

The names imply that the training set is meant for building our models and the test set for testing them. For this project, we considered merging the two dataframes to have more data to work with. However, the training set alone has 59,400 entries, which should be more than enough to make good predictions.

If our models are subpar, we may merge the tables to acquire more data points and potentially improve model efficacy.
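One way to judge whether the models are subpar without touching the test set is to hold out part of the training data for validation. The sketch below is only an illustration on a made-up frame: `toy`, the `status_group` column name, and the label values are assumptions, not taken from the competition files.

```python
import pandas as pd

# Toy stand-in for the training set (the real one has 59,400 rows).
# The label column name and its values are placeholders.
toy = pd.DataFrame({
    "id": range(10),
    "status_group": ["works", "failed"] * 5,
})

# Hold out 20% of the rows for validation; random_state makes the split repeatable.
valid = toy.sample(frac=0.2, random_state=42)
train = toy.drop(valid.index)

print(len(train), len(valid))
```

The held-out rows never take part in fitting, so their score gives a rough estimate of how a model might do on unseen waterpoints.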

# Data Cleaning

In [6]:
# Import pandas and NumPy
import pandas as pd
import numpy as np
from warnings import simplefilter

# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

# Load data into pandas dataframes
status_groups = pd.read_csv('status_groups.csv')
testset = pd.read_csv('test_set.csv')
df = pd.read_csv('training_set.csv')

# Analyze shape of dataset
print(f'Shape of dataset: {df.shape}')


Shape of dataset: (59400, 40)
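The labels are loaded separately into `status_groups`, so at some point they need to be lined up with the feature rows. A minimal sketch of one way to do that, assuming both files share the `id` column and that the label column is called `status_group` (that name and the toy values are assumptions):

```python
import pandas as pd

# Toy stand-ins for the feature table and the label table.
features = pd.DataFrame({
    "id": [1, 2, 3],
    "basin": ["Pangani", "Lake Nyasa", "Lake Victoria"],
})
labels = pd.DataFrame({
    "id": [3, 1, 2],
    "status_group": ["a", "b", "c"],  # placeholder label values
})

# Join on id; validate raises if an id is duplicated on either side.
merged = features.merge(labels, on="id", how="left", validate="one_to_one")
print(merged)
```

Merging on the key instead of relying on row order guards against the two files being sorted differently.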


## Missing Values

In [7]:
df.isna().sum()

id                           0
amount_tsh                   0
date_recorded                0
funder                    3635
gps_height                   0
installer                 3655
longitude                    0
latitude                     0
wpt_name                     0
num_private                  0
basin                        0
subvillage                 371
region                       0
region_code                  0
district_code                0
lga                          0
ward                         0
population                   0
public_meeting            3334
recorded_by                  0
scheme_management         3877
scheme_name              28166
permit                    3056
construction_year            0
extraction_type              0
extraction_type_group        0
extraction_type_class        0
management                   0
management_group             0
payment                      0
payment_type                 0
water_quality                0
quality_

In [8]:
df.loc[df['funder'].isna(), 'funder']

34       NaN
43       NaN
47       NaN
65       NaN
71       NaN
        ... 
59357    NaN
59366    NaN
59370    NaN
59376    NaN
59397    NaN
Name: funder, Length: 3635, dtype: object
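Raw counts are easier to compare as shares of the 59,400 rows: `funder` is missing in roughly 6% of rows, while `scheme_name` is missing in about 47%. The sketch below shows how such shares could be computed and how a text column might be filled with a placeholder. The toy frame and the `"Unknown"` label are assumptions for illustration, not a decision made in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in a text column and a boolean-like column.
toy = pd.DataFrame({
    "funder": ["Roman", np.nan, "Unicef", np.nan],
    "permit": [True, False, np.nan, True],
})

# Share of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Keep the rows but mark the gap explicitly ("Unknown" is a hypothetical label).
toy["funder"] = toy["funder"].fillna("Unknown")
```

Filling with an explicit category keeps the rows available for modeling and lets a classifier treat "not recorded" as its own signal.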

In [9]:
# dropna returns a new dataframe, so assign the result back
df = df.dropna(how='all')
df.shape

(59400, 40)
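`dropna` returns a new dataframe and does not modify `df` in place. `how='all'` removes only rows in which every column is missing, which is consistent with the shape staying at (59400, 40). A small sketch on a made-up frame (`toy` is illustrative) contrasting `how='all'` with `how='any'`:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": ["x", np.nan, "z"],
})

all_dropped = toy.dropna(how="all")  # drops only the fully empty row
any_dropped = toy.dropna(how="any")  # drops every row with at least one gap

print(len(toy), len(all_dropped), len(any_dropped))
```

With as many partially filled columns as this dataset has, `how='any'` would discard a large share of the rows, so it is worth choosing between the two deliberately.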

## Dates

In [10]:
# Parse the recording date into a pandas datetime column
df['date_recorded'] = pd.to_datetime(df['date_recorded'])
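Once `date_recorded` is a datetime column, calendar parts can be pulled out as numeric features. The sketch below also derives a rough waterpoint age. It assumes ISO-formatted date strings and treats a `construction_year` of 0 as "unknown"; both are assumptions (a 0 does appear in the data preview, but its meaning is not documented here), and the toy values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "date_recorded": ["2011-03-14", "2013-02-04"],
    "construction_year": [1999, 0],
})
toy["date_recorded"] = pd.to_datetime(toy["date_recorded"], format="%Y-%m-%d")

# Calendar features from the recording date.
toy["year_recorded"] = toy["date_recorded"].dt.year
toy["month_recorded"] = toy["date_recorded"].dt.month

# Treat year 0 as missing (assumption) so it does not yield ages of ~2000 years.
built = toy["construction_year"].where(toy["construction_year"] > 0)
toy["age_years"] = toy["year_recorded"] - built

print(toy)
```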

## Categories

In [11]:
# We should wait until after dealing with missing values, duplicates, outliers, etc. 
# before splitting these up into our features and categories

In [12]:
object_cols = list(df.select_dtypes('object').columns)
object_cols

['funder',
 'installer',
 'wpt_name',
 'basin',
 'subvillage',
 'region',
 'lga',
 'ward',
 'public_meeting',
 'recorded_by',
 'scheme_management',
 'scheme_name',
 'permit',
 'extraction_type',
 'extraction_type_group',
 'extraction_type_class',
 'management',
 'management_group',
 'payment',
 'payment_type',
 'water_quality',
 'quality_group',
 'quantity',
 'quantity_group',
 'source',
 'source_type',
 'source_class',
 'waterpoint_type',
 'waterpoint_type_group']

In [13]:
number_unique = [int(df[col].nunique()) for col in object_cols]
index_cat = [i for i,x in enumerate(number_unique) if x < 20]
cats = [df.columns[x] for x in index_cat]
cat_df = df[cats]
cat_df.head()

Unnamed: 0,funder,wpt_name,num_private,basin,region,region_code,district_code,lga,ward,population,...,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group
0,Roman,none,0,Lake Nyasa,Iringa,11,5,Ludewa,Mundindi,109,...,GeoData Consultants Ltd,VWC,Roman,False,1999,gravity,gravity,gravity,vwc,user-group
1,Grumeti,Zahanati,0,Lake Victoria,Mara,20,2,Serengeti,Natta,280,...,GeoData Consultants Ltd,Other,,True,2010,gravity,gravity,gravity,wug,user-group
2,Lottery Club,Kwa Mahundi,0,Pangani,Manyara,21,4,Simanjiro,Ngorika,250,...,GeoData Consultants Ltd,VWC,Nyumba ya mungu pipe scheme,True,2009,gravity,gravity,gravity,vwc,user-group
3,Unicef,Zahanati Ya Nanyumbu,0,Ruvuma / Southern Coast,Mtwara,90,63,Nanyumbu,Nanyumbu,58,...,GeoData Consultants Ltd,VWC,,True,1986,submersible,submersible,submersible,vwc,user-group
4,Action In A,Shuleni,0,Lake Victoria,Kagera,18,1,Karagwe,Nyakasimbi,0,...,GeoData Consultants Ltd,,,True,0,gravity,gravity,gravity,other,other
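One thing to watch in the cell above: `index_cat` holds positions within `object_cols`, while the lookup goes through `df.columns`, so the two lists may not line up (the preview includes `funder`, which has 1,897 distinct values). A minimal sketch of selecting low-cardinality text columns by name instead, on a toy frame; `toy` and `max_levels` are illustrative names and values, not part of the notebook:

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "funder": ["Roman", "Grumeti", "Unicef", "Lottery Club"],
    "basin": ["Pangani", "Pangani", "Lake Nyasa", "Lake Nyasa"],
    "permit": [True, False, True, True],
})

max_levels = 3  # hypothetical threshold; the notebook uses 20

# Non-numeric columns, found by dtype rather than by position.
text_cols = [c for c in toy.columns
             if not pd.api.types.is_numeric_dtype(toy[c])]

# Number of distinct values per text column, indexed by column name.
n_unique = toy[text_cols].nunique()

# Keep the names of columns below the threshold.
cats = n_unique[n_unique < max_levels].index.tolist()
print(cats)
```

Because the counts are indexed by column name, the filter cannot drift out of step with the dataframe's column order.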


In [50]:
df['funder'].nunique()

1897

In [48]:
# Count the funders that appear exactly once
n_single = int((df['funder'].value_counts() == 1).sum())
n_single
# It might be worth dropping funder categories that have only 1 occurrence

974
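With 974 of the 1,897 funders appearing only once, an alternative to dropping those rows is to fold rare funders into a single bucket. The sketch below is one possible approach on a toy series, not the method this notebook settles on; the `"Other"` label and the threshold of 2 are assumptions.

```python
import pandas as pd

funder = pd.Series(["Roman", "Unicef", "Unicef", "Grumeti", None, "Unicef"])

# Funders seen fewer than 2 times count as rare (hypothetical threshold).
counts = funder.value_counts()
rare = counts[counts < 2].index

# Replace rare funders with "Other"; missing values are left untouched.
lumped = funder.where(~funder.isin(rare), "Other")
print(lumped.value_counts())
```

This keeps every row while shrinking the number of categories, which matters if the column is later one-hot encoded.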