# Data Summary and Goal

## Summary

Tanzania, as a developing country, struggles with providing clean water to its population of over 57,000,000. There are many waterpoints already established in the country, but some are in need of repair while others have failed altogether.

## Goal

Build a classifier to predict the condition of a water well, using information provided in the data. This information includes:
- Date
- Location
- Source
- Funder
- And more!

This data is from the DrivenData.org website. It is part of the "Pump It Up: Data Mining the Water Table" competition. DrivenData split the data into two sets, the "Training Set" and the "Test Set".

The names imply that we should use the training set to build our models and the test set to evaluate them. For this project, we considered merging the two dataframes to have more data to work with. However, the training set alone has 59,400 entries, which should be more than enough to make good predictions.

If our models are subpar, we may merge the tables to acquire more data points and potentially improve model efficacy.

# Data Cleaning

In [1]:
# Import libraries
import pandas as pd
import numpy as np
from warnings import simplefilter

# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

# Load data into Pandas dataframes
status_groups = pd.read_csv('status_groups.csv')
testset = pd.read_csv('test_set.csv')
df = pd.read_csv('training_set.csv')


# Let's add our target series to the dataframe!
status_groups.drop(['id'], axis=1, inplace=True)
df = pd.concat([df, status_groups], axis=1)

# Identify features and target
features = df.drop(['status_group'], axis=1)
target = df['status_group']

# Analyze shape of dataset
print(f'Shape of dataset: {df.shape}')
# display(df.head())


Shape of dataset: (59400, 41)


| | id | amount_tsh | date_recorded | funder | gps_height | installer | longitude | latitude | wpt_name | num_private | ... | water_quality | quality_group | quantity | quantity_group | source | source_type | source_class | waterpoint_type | waterpoint_type_group | status_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 69572 | 6000.0 | 2011-03-14 | Roman | 1390 | Roman | 34.938093 | -9.856322 | none | 0 | ... | soft | good | enough | enough | spring | spring | groundwater | communal standpipe | communal standpipe | functional |
| 1 | 8776 | 0.0 | 2013-03-06 | Grumeti | 1399 | GRUMETI | 34.698766 | -2.147466 | Zahanati | 0 | ... | soft | good | insufficient | insufficient | rainwater harvesting | rainwater harvesting | surface | communal standpipe | communal standpipe | functional |
| 2 | 34310 | 25.0 | 2013-02-25 | Lottery Club | 686 | World vision | 37.460664 | -3.821329 | Kwa Mahundi | 0 | ... | soft | good | enough | enough | dam | dam | surface | communal standpipe multiple | communal standpipe | functional |
| 3 | 67743 | 0.0 | 2013-01-28 | Unicef | 263 | UNICEF | 38.486161 | -11.155298 | Zahanati Ya Nanyumbu | 0 | ... | soft | good | dry | dry | machine dbh | borehole | groundwater | communal standpipe multiple | communal standpipe | non functional |
| 4 | 19728 | 0.0 | 2011-07-13 | Action In A | 0 | Artisan | 31.130847 | -1.825359 | Shuleni | 0 | ... | soft | good | seasonal | seasonal | rainwater harvesting | rainwater harvesting | surface | communal standpipe | communal standpipe | functional |
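The loading cell attaches the labels with `pd.concat(..., axis=1)`, which only lines up correctly if both files list the wells in the same row order. A minimal sketch of a key-based alternative, assuming both tables carry an `id` column (the cell above drops `id` from `status_groups`, which suggests they do). The toy frames `values` and `labels` are hypothetical stand-ins, not the competition data:

```python
import pandas as pd

# Hypothetical toy frames standing in for the training values and the labels
values = pd.DataFrame({'id': [3, 1, 2], 'amount_tsh': [0.0, 25.0, 6000.0]})
labels = pd.DataFrame({
    'id': [1, 2, 3],
    'status_group': ['functional', 'non functional', 'functional'],
})

# Join on the shared key instead of relying on row order.
# validate='one_to_one' raises if an id is repeated on either side.
merged = values.merge(labels, on='id', how='left', validate='one_to_one')

print(merged)
```

If the two files are already in the same order, this gives the same pairing as the concatenation, while failing loudly when an `id` is duplicated.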


## Drop unneeded columns and deal with missing values

In [None]:
# EXPLORATORY! We are analyzing the columns in order to remove redundancies
# **Uncomment if viewing the unique values is desired**

# df.region.unique()
# df.scheme_management

# display(df['extraction_type_class'].unique())
# display(df['extraction_type'].unique())
# display(df['extraction_type_group'].unique())

# display(df['source_class'].unique())
# display(df['source'].unique())
# display(df['source_type'].unique())

# display(df.waterpoint_type.unique())
# display(df.waterpoint_type_group.unique())

# display(df.water_quality.unique())
# display(df.quality_group.unique())

# display(df.management.unique())
# display(df.management_group.unique())

# display(df.payment.unique())
# display(df.payment_type.unique())

# display(df.quantity.unique())
# display(df.quantity_group.unique())

# Show only the columns that contain missing values
missing_counts = df.isna().sum()
missing_counts[missing_counts > 0]
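Comparing `unique()` output by eye is one way to spot redundant column families such as `source` and `source_type`. As a rough complement, the sketch below checks whether each value of a detailed column maps to exactly one value of its grouped counterpart. The helper `is_determined_by` and the toy frame are hypothetical and only illustrate the idea:

```python
import pandas as pd

# Hypothetical toy data mimicking a detailed column and its grouped version
toy = pd.DataFrame({
    'source': ['spring', 'dam', 'machine dbh', 'hand dtw', 'dam'],
    'source_type': ['spring', 'dam', 'borehole', 'borehole', 'dam'],
})

def is_determined_by(frame, fine, coarse):
    """Return True if every value of `fine` maps to exactly one value of `coarse`."""
    return bool((frame.groupby(fine)[coarse].nunique() <= 1).all())

# The detailed column fully determines the grouped one...
fine_to_coarse = is_determined_by(toy, 'source', 'source_type')
# ...but not the other way around, because 'borehole' covers two sources
coarse_to_fine = is_determined_by(toy, 'source_type', 'source')

print(fine_to_coarse, coarse_to_fine)
```

When the first check holds, the grouped column adds no information beyond the detailed one, so keeping both is a matter of choosing the granularity the model should see.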

In [3]:
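# Group funders with fewer than 100 records into a single 'other' category
funder_counts = df['funder'].value_counts()
low_occurrence = funder_counts[funder_counts < 100].index
df['funder'] = df['funder'].mask(df['funder'].isin(low_occurrence), 'other')

# Columns judged redundant or unneeded after the exploration above
to_drop = ['scheme_name', 'recorded_by', 'wpt_name', 'extraction_type', 'extraction_type_group', 'gps_height',
           'region_code', 'district_code', 'latitude', 'longitude', 'lga', 'ward', 'public_meeting', 'date_recorded',
           'source', 'source_type', 'waterpoint_type', 'water_quality', 'management_group', 'payment', 'quantity_group',
           'subvillage', 'num_private']
df = df.drop(columns=to_drop)

# Treat missing or 'None' scheme management as one placeholder category.
# Assign the result back: in-place edits on a selected column are unreliable in recent pandas.
df['scheme_management'] = df['scheme_management'].replace({'None': 'Ignore', np.nan: 'Ignore'})

# Fill missing permit values with False
df['permit'] = df['permit'].fillna(False)

# Drop any rows that still contain missing values
df = df.dropna(axis=0)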
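The funder step above collapses infrequent categories into `'other'`. The same idea can be wrapped in a small helper so it can be reused for other high-cardinality columns. This is a sketch under assumptions: the function name `group_rare`, its parameters, and the toy series are hypothetical, and the threshold of 2 is chosen only to suit the toy data:

```python
import pandas as pd

def group_rare(series, min_count, other_label='other'):
    """Replace values occurring fewer than `min_count` times with `other_label`."""
    counts = series.value_counts()            # NaN is excluded from the counts
    rare = counts[counts < min_count].index   # labels below the threshold
    # Keep values that are not rare (missing values are left untouched)
    return series.where(~series.isin(rare), other_label)

# Hypothetical toy column with one frequent funder, two rare ones, and a missing value
funder = pd.Series(['Roman', 'Roman', 'Roman', 'Unicef', 'Grumeti', None])
grouped = group_rare(funder, min_count=2)

print(grouped.tolist())
```

Leaving missing values untouched keeps the decision about imputing or dropping them separate from the grouping step.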

## Duplicates

In [11]:
df.population.value_counts().head(10)

0      19219
1       6116
150     1872
200     1860
250     1633
300     1425
50      1125
100     1119
500      994
350      970
Name: population, dtype: int64
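Repeated values within one column, such as the 19,219 zeros in `population`, are not the same thing as duplicated rows. A minimal sketch of a row-level check follows, assuming `id` is a unique identifier that should be ignored when comparing records. The toy frame is hypothetical:

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 1 match on every column except id
toy = pd.DataFrame({
    'id': [10, 11, 12, 13],
    'funder': ['Roman', 'Roman', 'Unicef', 'Roman'],
    'population': [0, 0, 58, 109],
})

# Compare rows on everything except the identifier
feature_cols = [col for col in toy.columns if col != 'id']
dupe_mask = toy.duplicated(subset=feature_cols, keep='first')

n_duplicates = int(dupe_mask.sum())
deduped = toy.loc[~dupe_mask]

print(n_duplicates)
print(deduped)
```

Whether such rows should be dropped depends on whether two wells can legitimately share every recorded attribute, so the count is worth inspecting before removing anything.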

## Dates

In [8]:
# 'date_recorded' is in the to_drop list above, so convert it only if it is still present
if 'date_recorded' in df.columns:
    df['date_recorded'] = pd.to_datetime(df['date_recorded'])

## Categories

In [None]:
df.info()

In [12]:
df.head()

| | id | amount_tsh | funder | installer | num_private | basin | region | population | scheme_management | permit | construction_year | extraction_type_class | management | payment_type | quality_group | quantity | source_class | waterpoint_type_group | status_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 69572 | 6000.0 | Roman | Roman | 0 | Lake Nyasa | Iringa | 109 | VWC | False | 1999 | gravity | vwc | annually | good | enough | groundwater | communal standpipe | functional |
| 1 | 8776 | 0.0 | other | GRUMETI | 0 | Lake Victoria | Mara | 280 | Other | True | 2010 | gravity | wug | never pay | good | insufficient | surface | communal standpipe | functional |
| 2 | 34310 | 25.0 | other | World vision | 0 | Pangani | Manyara | 250 | VWC | True | 2009 | gravity | vwc | per bucket | good | enough | surface | communal standpipe | functional |
| 3 | 67743 | 0.0 | Unicef | UNICEF | 0 | Ruvuma / Southern Coast | Mtwara | 58 | VWC | True | 1986 | submersible | vwc | never pay | good | dry | groundwater | communal standpipe | non functional |
| 4 | 19728 | 0.0 | other | Artisan | 0 | Lake Victoria | Kagera | 0 | Ignore | True | 0 | gravity | other | never pay | good | seasonal | surface | communal standpipe | functional |


In [None]:
# df.status_group.unique()
status_labels = {'status_group':{'non functional': 0, 'functional': 1, 'functional needs repair': 2}}
df = df.replace(status_labels)
df.status_group.value_counts()
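The cell above encodes the three status labels as integers with `DataFrame.replace`. One possible variant uses `Series.map`, which turns any label missing from the dictionary into `NaN` and so makes typos or unexpected categories easy to detect. The toy series below is hypothetical; the mapping mirrors `status_labels`:

```python
import pandas as pd

# Same label-to-integer mapping as in status_labels
status_map = {'non functional': 0, 'functional': 1, 'functional needs repair': 2}

# Hypothetical toy target column
status = pd.Series(['functional', 'non functional', 'functional needs repair', 'functional'])

encoded = status.map(status_map)

# map() yields NaN for labels not in the dictionary, so check before casting
if encoded.isna().any():
    raise ValueError('Found status labels that are not in status_map')
encoded = encoded.astype(int)

print(encoded.tolist())
```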

In [None]:
# We should wait until after dealing with missing values, duplicates, outliers, etc. 
# before splitting these up into our features and categories

In [None]:
# Names of all columns stored as strings (object dtype)
object_cols = list(df.select_dtypes('object').columns)
object_cols

In [None]:
# Keep object columns with fewer than 20 unique values as categorical features.
# Select names from object_cols directly; positions in object_cols do not match df.columns.
cats = [col for col in object_cols if df[col].nunique() < 20]
cat_df = df[cats]
cat_df.columns

In [None]:
df.population.value_counts()

In [None]:
df['funder'].nunique()

In [None]:
# Count the funders that appear only once
(df.funder.value_counts() == 1).sum()
# It might be worth dropping funder categories that only have 1 occurrence

In [None]:
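import matplotlib.pyplot as plt

df.hist(figsize=(11,11))
plt.tight_layout()

In [None]:
# DataFrame.hist only plots numeric columns, so chart value counts for each categorical column
for col in cat_df.columns:
    cat_df[col].value_counts().plot(kind='bar', title=col, figsize=(11, 4))
    plt.tight_layout()
    plt.show()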

In [None]:
type(df.installer[1])

In [None]:
df.installer.value_counts()