# Project Title

## Overview

A one-paragraph overview of the project, including the business problem, data, methods, results and recommendations.

## Business Problem

Summary of the business problem you are trying to solve, and the data questions that you plan to answer to solve them.

Questions to consider:

- Who are your stakeholders?
- What are your stakeholders' pain points related to this project?
- Why are your predictions important from a business perspective?
- What exactly is your deliverable: your analysis, or the model itself?
- Does your business understanding/stakeholder require a specific type of model?
    - For example: a highly regulated industry would require a very transparent/simple/interpretable model, whereas a situation where the model itself is your deliverable would likely benefit from a more complex and thus stronger model
   

Additional questions to consider for classification:

- What does a false positive look like in this context?
- What does a false negative look like in this context?
- Which is worse for your stakeholder?
- What metric are you focusing on optimizing, given the answers to the above questions?

## Data Understanding

Describe the data being used for this project.

Questions to consider:

- Where did the data come from, and how do they relate to the data analysis questions?
- What do the data represent? Who is in the sample and what variables are included?
- What is the target variable?
- What are the properties of the variables you intend to use?

In [35]:
# code here to explore your data
import pandas as pd
df_l = pd.read_csv('../../data/Training Set Labels.csv')
df_v = pd.read_csv('../../data/Training Set Values.csv')

## Data Preparation

Describe and justify the process for preparing the data for analysis.

Questions to consider:

- Were there variables you dropped or created?
- How did you address missing values or outliers?
- Why are these choices appropriate given the data and the business problem?
- Can you pipeline your preparation steps to use them consistently in the modeling process?

In [36]:
df = pd.merge(df_v, df_l, on='id')
del df_l

In [37]:
df

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,...,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,...,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,...,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe,functional
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,...,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,non functional
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,...,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59395,60739,10.0,2013-05-03,Germany Republi,1210,CES,37.169807,-3.253847,Area Three Namba 27,0,...,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
59396,27263,4700.0,2011-05-07,Cefa-njombe,1212,Cefa,35.249991,-9.070629,Kwa Yahona Kuvala,0,...,soft,good,enough,enough,river,river/lake,surface,communal standpipe,communal standpipe,functional
59397,37057,0.0,2011-04-11,,0,,34.017087,-8.750434,Mashine,0,...,fluoride,fluoride,enough,enough,machine dbh,borehole,groundwater,hand pump,hand pump,functional
59398,31282,0.0,2011-03-08,Malec,0,Musa,35.861315,-6.378573,Mshoro,0,...,soft,good,insufficient,insufficient,shallow well,shallow well,groundwater,hand pump,hand pump,functional


In [38]:
df.isna().sum()

id                           0
amount_tsh                   0
date_recorded                0
funder                    3635
gps_height                   0
installer                 3655
longitude                    0
latitude                     0
wpt_name                     0
num_private                  0
basin                        0
subvillage                 371
region                       0
region_code                  0
district_code                0
lga                          0
ward                         0
population                   0
public_meeting            3334
recorded_by                  0
scheme_management         3877
scheme_name              28166
permit                    3056
construction_year            0
extraction_type              0
extraction_type_group        0
extraction_type_class        0
management                   0
management_group             0
payment                      0
payment_type                 0
water_quality                0
quality_

In [39]:
df.funder.value_counts()

Government Of Tanzania           9084
Danida                           3114
Hesawa                           2202
Rwssp                            1374
World Bank                       1349
                                 ... 
Artisan                             1
Nyeisa                              1
Overnment                           1
Aqua Blues Angels                   1
Resolute Golden Pride Project       1
Name: funder, Length: 1897, dtype: int64

In [40]:
def funder_top5(row):  
    '''Making top 5 values and setting the rest to 'other'''

    if row['funder']=='Government Of Tanzania':
        return 'Gov'
    elif row['funder']=='Danida':
        return 'Danida'
    elif row['funder']=='Hesawa':
        return 'Hesawa'
    elif row['funder']=='Rwssp':
        return 'Rwssp'
    elif row['funder']=='World Bank':
        return 'World_bank'    
    else:
        return 'other'
    
df['funder'] = df.apply(lambda row: funder_top5(row), axis=1)

In [41]:
str_to_num = {'functional':2, 'functional needs repair':1,
                   'non functional':0}

df['status_group_new']  = df['status_group'].replace(str_to_num)

In [42]:
piv_table = pd.pivot_table(df,index=['funder','status_group'],
                           values='status_group_new', aggfunc='count')
piv_table

Unnamed: 0_level_0,Unnamed: 1_level_0,status_group_new
funder,status_group,Unnamed: 2_level_1
Danida,functional,1713
Danida,functional needs repair,159
Danida,non functional,1242
Gov,functional,3720
Gov,functional needs repair,701
Gov,non functional,4663
Hesawa,functional,936
Hesawa,functional needs repair,232
Hesawa,non functional,1034
Rwssp,functional,805


In [43]:
# code here to prepare your data
total_danida = piv_table.loc[('Danida','functional')] + piv_table.loc[('Danida','functional needs repair')] + piv_table.loc[('Danida','non functional')]
percent_functional_danida = (piv_table.loc[('Danida','functional')] / total_danida) * 100

total_gov = piv_table.loc[('Gov','functional')] + piv_table.loc[('Gov','functional needs repair')] + piv_table.loc[('Danida','non functional')]
percent_functional_gov = (piv_table.loc[('Gov','functional')] / total_gov) * 100

total_hesawa = piv_table.loc[('Hesawa','functional')] + piv_table.loc[('Hesawa','functional needs repair')] + piv_table.loc[('Hesawa','non functional')]
percent_functional_hesawa = (piv_table.loc[('Hesawa','functional')] / total_hesawa) * 100

total_rwssp = piv_table.loc[('Rwssp','functional')] + piv_table.loc[('Rwssp','functional needs repair')] + piv_table.loc[('Rwssp','non functional')]
percent_functional_rwssp = (piv_table.loc[('Rwssp','functional')] / total_rwssp) * 100

total_world_bank = piv_table.loc[('World_bank','functional')] + piv_table.loc[('World_bank','functional needs repair')] + piv_table.loc[('World_bank','non functional')]
percent_functional_world_bank = (piv_table.loc[('World_bank', 'functional')] / total_world_bank) * 100

total_other = piv_table.loc[('other', 'functional')] + piv_table.loc[('other', 'functional needs repair')] + piv_table.loc[('other','non functional')]
percent_functional_other = (piv_table.loc[('other','functional')] / total_other) * 100

print('Percent functional danida: ', round(percent_functional_danida,3))
print('Percent functional gov: ', round(percent_functional_gov,3))
print('Percent functional hesawa: ', round(percent_functional_hesawa,3))
print('Percent functional other: ', round(percent_functional_other,3))
print('Percent functional rwssp: ', round(percent_functional_rwssp,3))
print('Percent functional world bank: ', round(percent_functional_world_bank,3))

Percent functional danida:  status_group_new    55.01
dtype: float64
Percent functional gov:  status_group_new    65.69
dtype: float64
Percent functional hesawa:  status_group_new    42.507
dtype: float64
Percent functional other:  status_group_new    58.046
dtype: float64
Percent functional rwssp:  status_group_new    58.588
dtype: float64
Percent functional world bank:  status_group_new    40.4
dtype: float64


In [44]:
df.installer.value_counts()


DWE                    17402
Government              1825
RWE                     1206
Commu                   1060
DANIDA                  1050
                       ...  
VILLAGE                    1
Kalitesi                   1
Mr Kwi                     1
UNICRF                     1
WINAM  CONSTRUCTION        1
Name: installer, Length: 2145, dtype: int64

In [45]:
def installer_top5(row):
    '''Keep top 5 values and set the rest to 'other'''
    if row['installer']=='DWE':
        return 'DWE'
    elif row['installer']=='Government':
        return 'Gov'
    elif row['installer']=='RWE':
        return 'RWE'
    elif row['installer']=='Commu':
        return 'Commu'
    elif row['installer']=='DANIDA':
        return 'Danida'
    else:
        return 'other'  

df['installer'] = df.apply(lambda row: installer_top5(row), axis=1)

In [12]:
piv_table2 = pd.pivot_table(df,index=['installer','status_group'],
                           values='status_group_new', aggfunc='count')
piv_table2

Unnamed: 0_level_0,Unnamed: 1_level_0,status_group_new
installer,status_group,Unnamed: 2_level_1
Commu,functional,724
Commu,functional needs repair,32
Commu,non functional,304
DWE,functional,9433
DWE,functional needs repair,1622
DWE,non functional,6347
Danida,functional,542
Danida,functional needs repair,83
Danida,non functional,425
Gov,functional,535


In [13]:
total_commu = piv_table2.loc[('Commu', 'functional')] + piv_table2.loc[('Commu', 'functional needs repair')] + piv_table2.loc[('Commu', 'non functional')]
percent_functional_commu = (piv_table2.loc[('Commu', 'functional')] / total_commu) * 100

total_dwe = piv_table2.loc[('DWE', 'functional')] + piv_table2.loc[('DWE', 'functional needs repair')] + piv_table2.loc[('DWE', 'non functional')]
percent_functional_dwe = (piv_table2.loc[('DWE', 'functional')] / total_dwe) * 100

total_rwe = piv_table2.loc[('RWE', 'functional')] + piv_table2.loc[('RWE', 'functional needs repair')] + piv_table2.loc[('RWE', 'non functional')]
percent_functional_rwe = (piv_table2.loc[('Commu', 'functional')] / total_rwe) * 100

total_other = piv_table2.loc[('other', 'functional')] + piv_table2.loc[('other', 'functional needs repair')] + piv_table2.loc[('other', 'non functional')]
percent_functional_other = (piv_table2.loc[('other', 'functional')] / total_other) * 100

print('Percent functional commu: ', round(percent_functional_commu,3))
print('Percent functional dwe: ', round(percent_functional_dwe,3))
print('Percent functional rwe: ', round(percent_functional_rwe,3))
print('Percent functional other: ', round(percent_functional_other,3))



Percent functional commu:  status_group_new    68.302
dtype: float64
Percent functional dwe:  status_group_new    54.206
dtype: float64
Percent functional rwe:  status_group_new    60.033
dtype: float64
Percent functional other:  status_group_new    56.22
dtype: float64


In [14]:
df.subvillage.value_counts()

Madukani     508
Shuleni      506
Majengo      502
Kati         373
Mtakuja      262
            ... 
Chikaluri      1
S/Center       1
Irembo         1
Butihama       1
Ushashili      1
Name: subvillage, Length: 19287, dtype: int64

In [15]:
print(len(df.subvillage.value_counts()))

19287


In [16]:
df = df.drop('subvillage', axis=1)

In [17]:
df.public_meeting.value_counts()


True     51011
False     5055
Name: public_meeting, dtype: int64

In [18]:
df.public_meeting = df.public_meeting.fillna('Unknown')

In [19]:
df.scheme_management.value_counts()

VWC                 36793
WUG                  5206
Water authority      3153
WUA                  2883
Water Board          2748
Parastatal           1680
Private operator     1063
Company              1061
Other                 766
SWC                    97
Trust                  72
None                    1
Name: scheme_management, dtype: int64

In [20]:
def scheme_top5(row):
    '''Keep top 5 values and set the rest to 'other'. '''
    if row['scheme_management']=='VWC':
        return 'VWC'
    elif row['scheme_management']=='WUG':
        return 'WUG'
    elif row['scheme_management']=='Water authority':
        return 'Water Authority'
    elif row['scheme_management']=='WUA':
        return 'WUA'
    elif row['scheme_management']=='Water Board':
        return 'Water Board'
    else:
        return 'other'

df['scheme_management'] = df.apply(lambda row: scheme_top5(row), axis=1)

In [21]:
piv_table3 = pd.pivot_table(df, index=['scheme_management', 'status_group'],
                           values='status_group_new', aggfunc='count')
piv_table3

Unnamed: 0_level_0,Unnamed: 1_level_0,status_group_new
scheme_management,status_group,Unnamed: 2_level_1
VWC,functional,18960
VWC,functional needs repair,2334
VWC,non functional,15499
WUA,functional,1995
WUA,functional needs repair,239
WUA,non functional,649
WUG,functional,3006
WUG,functional needs repair,672
WUG,non functional,1528
Water Authority,functional,1618


In [22]:
total_vwc = piv_table3.loc[('VWC', 'functional')] + piv_table3.loc[('VWC','functional needs repair')] + piv_table3.loc[('VWC','non functional')]
percent_functional_vwc = (piv_table3.loc[('VWC', 'functional')] / total_vwc) * 100

total_wua = piv_table3.loc[('WUA', 'functional')] + piv_table3.loc[('WUA','functional needs repair')] + piv_table3.loc[('WUA','non functional')]
percent_functional_wua = (piv_table3.loc[('WUA', 'functional')] / total_wua) * 100

total_wug = piv_table3.loc[('WUG', 'functional')] + piv_table3.loc[('WUG','functional needs repair')] + piv_table3.loc[('WUG','non functional')]
percent_functional_wug = (piv_table3.loc[('WUG', 'functional')] / total_wug) * 100

total_wtr_auth = piv_table3.loc[('Water Authority', 'functional')] + piv_table3.loc[('Water Authority','functional needs repair')] + piv_table3.loc[('Water Authority','non functional')]
percent_functional_wtr_auth = (piv_table3.loc[('Water Authority', 'functional')] / total_wtr_auth) * 100

total_wtr_brd = piv_table3.loc[('Water Board', 'functional')] + piv_table3.loc[('Water Board', 'functional needs repair')] + piv_table3.loc[('Water Board', 'non functional')]
percent_functional_wtr_brd = (piv_table3.loc[('Water Authority', 'functional')] / total_wtr_brd) * 100

total_other = piv_table3.loc[('other', 'functional')] + piv_table3.loc[('other', 'functional needs repair')] + piv_table3.loc[('other', 'non functional')]
percent_functional_other = (piv_table3.loc[('other', 'functional')] / total_other) * 100

print('Percent functional other: ', round(percent_functional_other,3))
print('Percent functional vwc: ', round(percent_functional_vwc,3))
print('Percent functional water authority: ', round(percent_functional_wtr_auth,3))
print('Percent functional water board: ', round(percent_functional_wtr_brd,3))
print('Percent functional wua: ', round(percent_functional_wua,3))
print('Percent functional wug: ', round(percent_functional_wug,3))

Percent functional other:  status_group_new    53.696
dtype: float64
Percent functional vwc:  status_group_new    51.532
dtype: float64
Percent functional water authority:  status_group_new    51.316
dtype: float64
Percent functional water board:  status_group_new    58.879
dtype: float64
Percent functional wua:  status_group_new    69.199
dtype: float64
Percent functional wug:  status_group_new    57.741
dtype: float64


In [23]:
df.scheme_name.value_counts()

K                             682
None                          644
Borehole                      546
Chalinze wate                 405
M                             400
                             ... 
BL Kilimasimba                  1
Ntang'whale                     1
TM part Three water supply      1
Heka water supply               1
Merali Juu line                 1
Name: scheme_name, Length: 2696, dtype: int64

In [24]:
len(df.scheme_name.unique())

# Lots of factors and the top 5 or so only represent a fraction of the total values. Probably 
# safe to drop this column.

df = df.drop('scheme_name', axis=1)

In [25]:
df.permit.value_counts()

True     38852
False    17492
Name: permit, dtype: int64

In [26]:
df.permit = df.permit.fillna('Unknown')

In [27]:
df.isna().sum()

id                       0
amount_tsh               0
date_recorded            0
funder                   0
gps_height               0
installer                0
longitude                0
latitude                 0
wpt_name                 0
num_private              0
basin                    0
region                   0
region_code              0
district_code            0
lga                      0
ward                     0
population               0
public_meeting           0
recorded_by              0
scheme_management        0
permit                   0
construction_year        0
extraction_type          0
extraction_type_group    0
extraction_type_class    0
management               0
management_group         0
payment                  0
payment_type             0
water_quality            0
quality_group            0
quantity                 0
quantity_group           0
source                   0
source_type              0
source_class             0
waterpoint_type          0
w

## Modeling

Describe and justify the process for analyzing or modeling the data.

Questions to consider:

- How will you analyze the data to arrive at an initial approach?
- How will you iterate on your initial approach to make it better?
- What model type is most appropriate, given the data and the business problem?

## Evaluation

The evaluation of each model should accompany the creation of each model, and you should be sure to evaluate your models consistently.

Evaluate how well your work solves the stated business problem. 

Questions to consider:

- How do you interpret the results?
- How well does your model fit your data? How much better is this than your baseline model? Is it over or under fit?
- How well does your model/data fit any relevant modeling assumptions?

For the final model, you might also consider:

- How confident are you that your results would generalize beyond the data you have?
- How confident are you that this model would benefit the business if put into use?
- What does this final model tell you about the relationship between your inputs and outputs?

### Baseline Understanding

- What does a baseline, model-less prediction look like?

In [28]:
# code here to arrive at a baseline prediction

### First $&(@# Model

Before going too far down the data preparation rabbit hole, be sure to check your work against a first 'substandard' model! What is the easiest way for you to find out how hard your problem is?

In [29]:
# code here for your first 'substandard' model

In [30]:
# code here to evaluate your first 'substandard' model

### Modeling Iterations

Now you can start to use the results of your first model to iterate - there are many options!

In [31]:
# code here to iteratively improve your models

In [32]:
# code here to evaluate your iterations

### 'Final' Model

In the end, you'll arrive at a 'final' model - aka the one you'll use to make your recommendations/conclusions. This likely blends any group work. It might not be the one with the highest scores, but instead might be considered 'final' or 'best' for other reasons.

In [33]:
# code here to show your final model

In [34]:
# code here to evaluate your final model

## Conclusions

Provide your conclusions about the work you've done, including any limitations or next steps.

Questions to consider:

- What would you recommend the business do as a result of this work?
- How could the stakeholder use your model effectively?
- What are some reasons why your analysis might not fully solve the business problem?
- What else could you do in the future to improve this project (future work)?
