# Tanzanian Wells

## 1. Overview

This notebook examines a dataset of water wells in Tanzania and builds a classifier to predict whether a well is functional. <br>
The notebook is organized according to the Cross-Industry Standard Process for Data Mining (CRISP-DM), a process model that serves as the base for a data science process.

## 2. Business Understanding

Tanzania, a developing country with a population of over 57,000,000, faces significant challenges in providing clean and reliable water sources to its citizens. The country has a substantial number of existing water points, including water wells, but a considerable portion of these wells either require maintenance or have completely failed, resulting in limited access to clean water.

The objective of this project is to develop a machine learning classifier that can predict the condition of water wells in Tanzania. By analyzing various factors such as the type of pump, installation date, and other relevant attributes, we aim to categorize wells into different conditions, such as 'functional' or 'non-functional'. 

This predictive model will serve as a valuable tool for organizations and government agencies involved in water resource management and infrastructure development in Tanzania.

The target audience for this project is Non-Governmental Organizations (NGOs) focusing on improving access to clean water in Tanzania such as [WaterAid](https://www.wateraid.org/where-we-work/tanzania), [Charity Water](https://www.charitywater.org/our-projects/tanzania) or [Tanzania Water Project](https://www.tanzaniawaterproject.org/). 

**Goal**: predict whether a well is non functional.

## 3. Data Understanding

The data comes from drivendata.org, a platform which hosts data science competitions with a focus on social impact. The data provided by DrivenData comes from the Tanzanian Ministry of Water and is stored by Taarifa. 

The actual dataset can be found [here](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/page/23/) under the 'Data download' section. 

Four files are listed there. The three files below were downloaded and renamed as follows:
- Training set values: training_set_values
- Training set labels: training_set_labels
- Test set values: test_set_values

These are the files used for the main modeling and predictive analysis. 
<br>
The test set values file is the one used to measure the accuracy of the model.

In [1]:
# Importing the necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import matplotlib.ticker as ticker
from matplotlib.patches import Rectangle
import seaborn as sns
import warnings
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

%matplotlib inline

In [2]:
# Loading training_set_values dataset and saving it as df_values
df_values = pd.read_csv('data/training_set_values.csv')

In [3]:
# Inspecting df_values
df_values

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,...,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,...,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,...,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,...,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,...,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59395,60739,10.0,2013-05-03,Germany Republi,1210,CES,37.169807,-3.253847,Area Three Namba 27,0,...,per bucket,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
59396,27263,4700.0,2011-05-07,Cefa-njombe,1212,Cefa,35.249991,-9.070629,Kwa Yahona Kuvala,0,...,annually,soft,good,enough,enough,river,river/lake,surface,communal standpipe,communal standpipe
59397,37057,0.0,2011-04-11,,0,,34.017087,-8.750434,Mashine,0,...,monthly,fluoride,fluoride,enough,enough,machine dbh,borehole,groundwater,hand pump,hand pump
59398,31282,0.0,2011-03-08,Malec,0,Musa,35.861315,-6.378573,Mshoro,0,...,never pay,soft,good,insufficient,insufficient,shallow well,shallow well,groundwater,hand pump,hand pump


The training set values has 59,400 rows, with 39 feature columns and 1 id column.

* `amount_tsh`: Total static head (amount water available to waterpoint)
* `date_recorded`: The date the row was entered
* `funder`: Who funded the well
* `gps_height`: Altitude of the well
* `installer`: Organization that installed the well
* `longitude`: GPS coordinate
* `latitude`: GPS coordinate
* `wpt_name`: Name of the waterpoint if there is one
* `num_private`: No description was provided for this feature
* `basin`: Geographic water basin
* `subvillage`: Geographic location
* `region`: Geographic location
* `region_code`: Geographic location (coded)
* `district_code`: Geographic location (coded)
* `lga`: Geographic location
* `ward`: Geographic location
* `population`: Population around the well
* `public_meeting`: True/False
* `recorded_by`: Group entering this row of data
* `scheme_management`: Who operates the waterpoint
* `scheme_name`: Who operates the waterpoint
* `permit`: If the waterpoint is permitted
* `construction_year`: Year the waterpoint was constructed
* `extraction_type`: The kind of extraction the waterpoint uses
* `extraction_type_group`: The kind of extraction the waterpoint uses
* `extraction_type_class`: The kind of extraction the waterpoint uses
* `management`: How the waterpoint is managed
* `management_group`: How the waterpoint is managed
* `payment`: What the water costs
* `payment_type`: What the water costs
* `water_quality`: The quality of the water
* `quality_group`: The quality of the water
* `quantity`: The quantity of water
* `quantity_group`: The quantity of water
* `source`: The source of the water
* `source_type`: The source of the water
* `source_class`: The source of the water
* `waterpoint_type`: The kind of waterpoint
* `waterpoint_type_group`: The kind of waterpoint

In [4]:
# Loading training_set_labels dataset and saving it as df_labels
df_labels = pd.read_csv('data/training_set_labels.csv')

In [5]:
# Inspecting df_labels
df_labels

Unnamed: 0,id,status_group
0,69572,functional
1,8776,functional
2,34310,functional
3,67743,non functional
4,19728,functional
...,...,...
59395,60739,functional
59396,27263,functional
59397,37057,functional
59398,31282,functional


In [6]:
# Checking the unique values of the target column
df_labels['status_group'].unique()

array(['functional', 'non functional', 'functional needs repair'],
      dtype=object)

The training set labels file has the same number of rows as the training set values and contains two columns:
* `id`: the identifier of the waterpoint
* `status_group`: the target column

The status group can be defined as: 

1. functional: the waterpoint is operational and there are no repairs needed
2. functional needs repair: the waterpoint is operational, but needs repairs
3. non functional: the waterpoint is not operational

## 4. Data Preparation

### 4. a. Joining values and labels datasets together 

The first step of preparing the data is to merge both df_values and df_labels, as the latter contains the target value.
<br>
Both datasets are merged on the 'id' column.

In [7]:
# Merging both dataframes on the column 'id'
raw_df = df_values.merge(df_labels, on='id')

In [8]:
# Inspecting the new dataframe
raw_df

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,...,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,...,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,...,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe,functional
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,...,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,non functional
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,...,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59395,60739,10.0,2013-05-03,Germany Republi,1210,CES,37.169807,-3.253847,Area Three Namba 27,0,...,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
59396,27263,4700.0,2011-05-07,Cefa-njombe,1212,Cefa,35.249991,-9.070629,Kwa Yahona Kuvala,0,...,soft,good,enough,enough,river,river/lake,surface,communal standpipe,communal standpipe,functional
59397,37057,0.0,2011-04-11,,0,,34.017087,-8.750434,Mashine,0,...,fluoride,fluoride,enough,enough,machine dbh,borehole,groundwater,hand pump,hand pump,functional
59398,31282,0.0,2011-03-08,Malec,0,Musa,35.861315,-6.378573,Mshoro,0,...,soft,good,insufficient,insufficient,shallow well,shallow well,groundwater,hand pump,hand pump,functional


The new joined dataframe contains the same number of rows as the previous datasets: 59,400. It has 1 id column, 1 target column: status_group, and 39 feature columns.
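
The row-count check above can also be made explicit in code. The following is a minimal sketch on toy data (the frames `toy_values` and `toy_labels` are hypothetical stand-ins, not the real files), assuming each `id` should appear exactly once in both frames:

```python
import pandas as pd

# Toy stand-ins for df_values and df_labels (hypothetical data)
toy_values = pd.DataFrame({"id": [3, 1, 2], "amount_tsh": [0.0, 25.0, 6000.0]})
toy_labels = pd.DataFrame({
    "id": [1, 2, 3],
    "status_group": ["functional", "non functional", "functional"],
})

# validate="one_to_one" raises pandas.errors.MergeError if 'id' is
# duplicated on either side of the merge
toy_merged = toy_values.merge(toy_labels, on="id", how="inner", validate="one_to_one")

# An inner join silently drops ids missing from either side,
# so the row counts are compared explicitly
assert len(toy_merged) == len(toy_values) == len(toy_labels)
print(toy_merged)
```

The same `validate` argument could be passed to the merge on `df_values` and `df_labels` if a stricter check is wanted.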

### 4. b. Preprocessing data

Preprocessing is an important step in a data science pipeline because it transforms raw data into a format suitable for training models. It also helps improve model accuracy and performance by handling missing values, removing unnecessary columns, scaling numeric features, and encoding categorical variables.

#### 4. b. 1. Verifying and handling missing data 

In [9]:
# Checking for null values
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 59400 entries, 0 to 59399
Data columns (total 41 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     59400 non-null  int64  
 1   amount_tsh             59400 non-null  float64
 2   date_recorded          59400 non-null  object 
 3   funder                 55765 non-null  object 
 4   gps_height             59400 non-null  int64  
 5   installer              55745 non-null  object 
 6   longitude              59400 non-null  float64
 7   latitude               59400 non-null  float64
 8   wpt_name               59400 non-null  object 
 9   num_private            59400 non-null  int64  
 10  basin                  59400 non-null  object 
 11  subvillage             59029 non-null  object 
 12  region                 59400 non-null  object 
 13  region_code            59400 non-null  int64  
 14  district_code          59400 non-null  int64  
 15  lg

The column `scheme_name` has a high number of null values and contains the same information as `scheme_management`: who operates the waterpoint.  
As a consequence, it will be dropped entirely.
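
As an illustration of how sparse columns can be spotted, here is a minimal sketch on toy data. The frame `toy_df` and the cut-off `max_null_share` are assumptions made for the example and are not values taken from this notebook:

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately sparse data (hypothetical values)
toy_df = pd.DataFrame({
    "scheme_name": ["Roman", np.nan, np.nan, np.nan],
    "funder": ["Roman", "Grumeti", np.nan, "Unicef"],
    "basin": ["Pangani", "Rufiji", "Rufiji", "Internal"],
})

# Share of missing values per column, largest first
null_share = toy_df.isnull().mean().sort_values(ascending=False)
print(null_share)

# Hypothetical cut-off: columns above it are candidates for dropping
max_null_share = 0.4
too_sparse = null_share[null_share > max_null_share].index.tolist()
print(too_sparse)
```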

In [10]:
# Dropping the column scheme_name
raw_df.drop(['scheme_name'], axis=1, inplace=True)

In [11]:
# Inspecting the values of columns containing null information 
columns_with_null = raw_df.columns[raw_df.isnull().any()].tolist()

columns_with_null

for column in columns_with_null:
    print(column)
    print(raw_df[column].unique())
    print()

funder
['Roman' 'Grumeti' 'Lottery Club' ... 'Dina' 'Brown' 'Samlo']

installer
['Roman' 'GRUMETI' 'World vision' ... 'Dina' 'brown' 'SELEPTA']

subvillage
['Mnyusi B' 'Nyamara' 'Majengo' ... 'Itete B' 'Maore Kati' 'Kikatanyemba']

public_meeting
[True nan False]

scheme_management
['VWC' 'Other' nan 'Private operator' 'WUG' 'Water Board' 'WUA'
 'Water authority' 'Company' 'Parastatal' 'Trust' 'SWC' 'None']

permit
[False True nan]



The null values in the remaining columns will be replaced by 'Unknown', as these columns contain relatively few missing values and the 'Unknown' category itself could help predict whether a well is functional or not.  

In [12]:
# Filling null values with 'Unknown'
for column in columns_with_null:
    raw_df[column] = raw_df[column].fillna('Unknown')

In [13]:
# Verifying the dataset no longer contains any null value
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 59400 entries, 0 to 59399
Data columns (total 40 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     59400 non-null  int64  
 1   amount_tsh             59400 non-null  float64
 2   date_recorded          59400 non-null  object 
 3   funder                 59400 non-null  object 
 4   gps_height             59400 non-null  int64  
 5   installer              59400 non-null  object 
 6   longitude              59400 non-null  float64
 7   latitude               59400 non-null  float64
 8   wpt_name               59400 non-null  object 
 9   num_private            59400 non-null  int64  
 10  basin                  59400 non-null  object 
 11  subvillage             59400 non-null  object 
 12  region                 59400 non-null  object 
 13  region_code            59400 non-null  int64  
 14  district_code          59400 non-null  int64  
 15  lg

#### 4. b. 2. Removing unnecessary columns

The columns below will be removed for one of the following reasons:

1. Irrelevant for predictions (e.g. date the row was entered, waterpoint name)
2. Contains similar information to another column (e.g. extraction_type, water_quality) 
3. Contains information which would require additional conversion (e.g. region_code, district_code)

* `id`: the identification number assigned to the water well 
* `date_recorded`: The date the row was entered
* `longitude`: GPS coordinate
* `latitude`: GPS coordinate
* `wpt_name`: Name of the waterpoint if there is one
* `num_private`: No description was provided for this feature
* `subvillage`: Geographic location
* `region_code`: Geographic location (coded)
* `district_code`: Geographic location (coded)
* `lga`: Geographic location
* `ward`: Geographic location
* `recorded_by`: Group entering this row of data
* `scheme_management`: Who operates the waterpoint
* `extraction_type`: The kind of extraction the waterpoint uses
* `extraction_type_group`: The kind of extraction the waterpoint uses
* `management_group`: How the waterpoint is managed
* `payment`: What the water costs
* `water_quality`: The quality of the water
* `quantity_group`: The quantity of water
* `source`: The source of the water
* `source_class`: The source of the water
* `waterpoint_type`: The kind of waterpoint
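
One way to check the second reason (a column repeating the information of another) is to test whether every value of the finer column maps to a single value of the coarser one. The sketch below uses a few `source` / `source_type` pairs seen in the rows above. The helper `is_coarser_grouping` is a hypothetical name introduced for illustration:

```python
import pandas as pd

# A few (source, source_type) pairs as they appear in the sample rows above
toy = pd.DataFrame({
    "source": ["spring", "machine dbh", "river", "shallow well", "machine dbh", "dam"],
    "source_type": ["spring", "borehole", "river/lake", "shallow well", "borehole", "dam"],
})

def is_coarser_grouping(frame, fine, coarse):
    """Return True if every value of `fine` maps to exactly one value of `coarse`."""
    return bool(frame.groupby(fine)[coarse].nunique().max() == 1)

# True means `source_type` is a pure regrouping of `source`,
# so keeping both columns adds no information
print(is_coarser_grouping(toy, "source", "source_type"))
```

When the check holds, only one of the two columns needs to be kept. Which level of detail to keep remains a modeling choice.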

In [14]:
# Storing the columns defined above into a list 
columns_to_drop = ['id', 'date_recorded','longitude','latitude','wpt_name','num_private','subvillage','region_code','district_code','lga','ward','recorded_by','scheme_management','extraction_type','extraction_type_group','management_group','payment','water_quality','quantity_group','source','source_class','waterpoint_type']

In [15]:
# Dropping the columns from the dataframe and creating a new one
df = raw_df.drop(columns_to_drop, axis=1)

In [16]:
# Inspecting the new df
df

Unnamed: 0,amount_tsh,funder,gps_height,installer,basin,region,population,public_meeting,permit,construction_year,extraction_type_class,management,payment_type,quality_group,quantity,source_type,waterpoint_type_group,status_group
0,6000.0,Roman,1390,Roman,Lake Nyasa,Iringa,109,True,False,1999,gravity,vwc,annually,good,enough,spring,communal standpipe,functional
1,0.0,Grumeti,1399,GRUMETI,Lake Victoria,Mara,280,Unknown,True,2010,gravity,wug,never pay,good,insufficient,rainwater harvesting,communal standpipe,functional
2,25.0,Lottery Club,686,World vision,Pangani,Manyara,250,True,True,2009,gravity,vwc,per bucket,good,enough,dam,communal standpipe,functional
3,0.0,Unicef,263,UNICEF,Ruvuma / Southern Coast,Mtwara,58,True,True,1986,submersible,vwc,never pay,good,dry,borehole,communal standpipe,non functional
4,0.0,Action In A,0,Artisan,Lake Victoria,Kagera,0,True,True,0,gravity,other,never pay,good,seasonal,rainwater harvesting,communal standpipe,functional
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59395,10.0,Germany Republi,1210,CES,Pangani,Kilimanjaro,125,True,True,1999,gravity,water board,per bucket,good,enough,spring,communal standpipe,functional
59396,4700.0,Cefa-njombe,1212,Cefa,Rufiji,Iringa,56,True,True,1996,gravity,vwc,annually,good,enough,river/lake,communal standpipe,functional
59397,0.0,Unknown,0,Unknown,Rufiji,Mbeya,0,True,False,0,handpump,vwc,monthly,fluoride,enough,borehole,hand pump,functional
59398,0.0,Malec,0,Musa,Rufiji,Dodoma,0,True,True,0,handpump,vwc,never pay,good,insufficient,shallow well,hand pump,functional


In [17]:
# Inspecting the new df's info
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 59400 entries, 0 to 59399
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   amount_tsh             59400 non-null  float64
 1   funder                 59400 non-null  object 
 2   gps_height             59400 non-null  int64  
 3   installer              59400 non-null  object 
 4   basin                  59400 non-null  object 
 5   region                 59400 non-null  object 
 6   population             59400 non-null  int64  
 7   public_meeting         59400 non-null  object 
 8   permit                 59400 non-null  object 
 9   construction_year      59400 non-null  int64  
 10  extraction_type_class  59400 non-null  object 
 11  management             59400 non-null  object 
 12  payment_type           59400 non-null  object 
 13  quality_group          59400 non-null  object 
 14  quantity               59400 non-null  object 
 15  so

The new dataframe still has 59,400 rows, but now contains 17 feature columns and 1 target column. 
14 of the columns, including the target variable, are categorical, so they will need to be encoded before modeling.

#### 4. b. 3. Transforming the classification into a binary one

The target column contains 3 categories. 

Converting a ternary classification problem into a binary one can simplify modeling, handle imbalanced data, prioritize specific class distinctions, and enable the use of binary-focused algorithms, but it may lose some information and context. The choice depends on the problem, data, and project goals.

In [18]:
# Inspecting the values inside the column status_group
print(df['status_group'].unique())

['functional' 'non functional' 'functional needs repair']


In [19]:
# Checking for data imbalance
print('Raw counts')
print(df['status_group'].value_counts())
print()
print()
print('Percentages')
print(df['status_group'].value_counts(normalize=True))

Raw counts
functional                 32259
non functional             22824
functional needs repair     4317
Name: status_group, dtype: int64


Percentages
functional                 0.543081
non functional             0.384242
functional needs repair    0.072677
Name: status_group, dtype: float64


The dataset is highly imbalanced, and the status 'functional needs repair' only represents 7% of the rows. 


A well that is functional but needs repair can be considered non-functional because it does not reliably provide safe and consistent access to water, which is the primary function of a well. <br> 
As a consequence, all 'functional needs repair' statuses will be replaced by 'non functional'.

Merging these two classes turns the ternary problem into a binary one and reduces the class imbalance. 

In [20]:
# Replacing 'functional needs repair' by 'non functional'
df['status_group'] = df['status_group'].replace('functional needs repair', 
                                                'non functional')


In [21]:
# Verifying the replacement was correctly applied
print('Raw counts')
print(df['status_group'].value_counts())
print()
print()
print('Percentages')
print(df['status_group'].value_counts(normalize=True))

Raw counts
functional        32259
non functional    27141
Name: status_group, dtype: int64


Percentages
functional        0.543081
non functional    0.456919
Name: status_group, dtype: float64


If we had a model that *always* said that the well was non functional, we would get an accuracy score of 0.456919, i.e. 45.7% accuracy.
<br> 
This is because about 45.7% of all wells are currently non functional. Conversely, a model that always predicted 'functional' would reach 54.3% accuracy.
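
The baseline figures can be reproduced directly from the class counts printed above. This is a small sketch and the variable names are illustrative:

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({"functional": 32259, "non functional": 27141})
shares = counts / counts.sum()

# Accuracy of a model that always predicts 'non functional'
always_non_functional = shares["non functional"]
# Accuracy of a model that always predicts the most frequent class
majority_baseline = shares.max()

print(f"Always 'non functional': {always_non_functional:.3f}")
print(f"Majority-class baseline: {majority_baseline:.3f}")
```

Any trained classifier should beat the majority-class baseline to be considered useful.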

#### 4. b. 4. Converting other binary columns

Some categorical features are binary (true or false), so they will be replaced by:
* 0 if false
* 1 if true
<br>Some rows contain the value 'Unknown'. These will be treated as false. 

The target column is not technically true or false but is binary as well, so it will be converted as the following:
* functional: 1 
* non functional: 0 

In [22]:
# Storing the names of the binary columns in a list
binary_columns = ['public_meeting', 'permit', 'status_group']

In [23]:
# Converting public_meeting, permit and status_group to binary encoding
for column in binary_columns:
    print(column, df[column].unique())
    df[column] = df[column].map({
        False: 0,
        True: 1,
        'Unknown': 0,
        'non functional': 0,
        'functional': 1
    })
    print(column, df[column].unique())


public_meeting [True 'Unknown' False]
public_meeting [1 0]
permit [False True 'Unknown']
permit [0 1]
status_group ['functional' 'non functional']
status_group [1 0]


In [24]:
# Verifying the new data types
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 59400 entries, 0 to 59399
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   amount_tsh             59400 non-null  float64
 1   funder                 59400 non-null  object 
 2   gps_height             59400 non-null  int64  
 3   installer              59400 non-null  object 
 4   basin                  59400 non-null  object 
 5   region                 59400 non-null  object 
 6   population             59400 non-null  int64  
 7   public_meeting         59400 non-null  int64  
 8   permit                 59400 non-null  int64  
 9   construction_year      59400 non-null  int64  
 10  extraction_type_class  59400 non-null  object 
 11  management             59400 non-null  object 
 12  payment_type           59400 non-null  object 
 13  quality_group          59400 non-null  object 
 14  quantity               59400 non-null  object 
 15  so

There are now 11 categorical columns. 

#### 4. b. 5. Categorizing Values With Too Many Details

##### First step: Counting Values in Categorical Variables

Some categorical variables such as funder or installer cannot be one-hot encoded directly, as they contain too many distinct values. Further data transformation is required.

In [25]:
# Creating the dataframe categoricals to handle the categorical columns
categoricals = df.select_dtypes(include=[object])
categoricals

Unnamed: 0,funder,installer,basin,region,extraction_type_class,management,payment_type,quality_group,quantity,source_type,waterpoint_type_group
0,Roman,Roman,Lake Nyasa,Iringa,gravity,vwc,annually,good,enough,spring,communal standpipe
1,Grumeti,GRUMETI,Lake Victoria,Mara,gravity,wug,never pay,good,insufficient,rainwater harvesting,communal standpipe
2,Lottery Club,World vision,Pangani,Manyara,gravity,vwc,per bucket,good,enough,dam,communal standpipe
3,Unicef,UNICEF,Ruvuma / Southern Coast,Mtwara,submersible,vwc,never pay,good,dry,borehole,communal standpipe
4,Action In A,Artisan,Lake Victoria,Kagera,gravity,other,never pay,good,seasonal,rainwater harvesting,communal standpipe
...,...,...,...,...,...,...,...,...,...,...,...
59395,Germany Republi,CES,Pangani,Kilimanjaro,gravity,water board,per bucket,good,enough,spring,communal standpipe
59396,Cefa-njombe,Cefa,Rufiji,Iringa,gravity,vwc,annually,good,enough,river/lake,communal standpipe
59397,Unknown,Unknown,Rufiji,Mbeya,handpump,vwc,monthly,fluoride,enough,borehole,hand pump
59398,Malec,Musa,Rufiji,Dodoma,handpump,vwc,never pay,good,insufficient,shallow well,hand pump


In [26]:
# Storing categorical columns to a list
categorical_columns = categoricals.columns.tolist()
# Creating an empty dictionary to store value counts for each column
value_counts_dict = {}

# Iterating through each categorical column and calculating value counts
for column in categorical_columns:
    value_counts = categoricals[column].value_counts()
    value_counts_dict[column]=value_counts

In [27]:
# Reviewing columns with the highest number of categories within each of them 
value_counts_dict

{'funder': Government Of Tanzania    9084
 Unknown                   3639
 Danida                    3114
 Hesawa                    2202
 Rwssp                     1374
                           ... 
 Muwasa                       1
 Msigw                        1
 Rc Mofu                      1
 Overland High School         1
 Samlo                        1
 Name: funder, Length: 1897, dtype: int64,
 'installer': DWE           17402
 Unknown        3658
 Government     1825
 RWE            1206
 Commu          1060
               ...  
 EWE               1
 SCHOO             1
 Got               1
 Fabia             1
 SELEPTA           1
 Name: installer, Length: 2145, dtype: int64,
 'basin': Lake Victoria              10248
 Pangani                     8940
 Rufiji                      7976
 Internal                    7785
 Lake Tanganyika             6432
 Wami / Ruvu                 5987
 Lake Nyasa                  5085
 Ruvuma / Southern Coast     4493
 Lake Rukwa             

`funder` and `installer` are the two columns with the highest number of categories (1,897 and 2,145 unique values respectively) and appear to contain very similar values. We will first focus on `funder`. 


<u>Funders</u>: like any column, `funder` contains 59,400 rows, spread over 1,897 unique values. To organize them into similar categories, the largest funders were researched to identify which of the following groups each one belongs to:
1. **Bilateral**: the government from another country funded the water well
2. **Government**: the government of Tanzania, or a programme funded by the government, or local, governmental agencies funded the water well  
3. **NPO_NGO**: the water well is funded by a non-profit organization or a non-governmental organization
4. **Private**: the fund comes from a private source: either individual or a company 
5. **Religious**: a religious organization funded the well
6. **Unknown**: the funder was not or could not be identified
7. **Minor funder**: funders which had funded 150 wells or fewer

The research was divided into two groups: 
- funders which had funded more than 150 wells were researched individually
- the others were categorized as minor funders

The goal of this threshold was to categorize about 75% of the rows individually, leaving only the long tail of small funders in a single group. <br>
The column contains 59,400 rows, of which 3,639 are already labeled 'Unknown', leaving 55,761 rows of funders to be categorized. Covering 75% of those would mean classifying about 41,821 rows. 


Setting the threshold at more than 150 wells per funder covers 43,177 rows (a count which includes the 3,639 'Unknown' rows), or about 73% of the full dataset, which is close to the 75% objective. 
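
To illustrate how a threshold translates into row coverage, here is a minimal sketch on toy data. The series `toy_funder` and the helper `coverage` are hypothetical and only mirror the logic described above:

```python
import pandas as pd

# Toy funder column: one large funder, two medium ones, one single-well funder
toy_funder = pd.Series(
    ["Gov"] * 6 + ["Danida"] * 3 + ["Roman"] * 2 + ["Samlo"], name="funder"
)

def coverage(series, threshold):
    """Share of rows whose funder appears more than `threshold` times."""
    counts = series.value_counts()
    frequent = counts[counts > threshold]
    return frequent.sum() / len(series)

# Coverage shrinks as the threshold rises
for threshold in (1, 2, 5):
    print(threshold, round(coverage(toy_funder, threshold), 3))
```

Running the same helper on the real `funder` column with several thresholds would show how sensitive the coverage is to the choice of 150.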

In [28]:
categoricals['funder'].value_counts()

Government Of Tanzania    9084
Unknown                   3639
Danida                    3114
Hesawa                    2202
Rwssp                     1374
                          ... 
Muwasa                       1
Msigw                        1
Rc Mofu                      1
Overland High School         1
Samlo                        1
Name: funder, Length: 1897, dtype: int64

In [29]:
# Printing all rows for categorization, but commented out for the rest of the code for better memory use
# pd.set_option('display.max_rows', None)

In [31]:
# Identifying funders that funded up to 150 water wells 
categoricals[['funder']].groupby('funder').filter(lambda x: len(x) <= 150).value_counts()

funder                       
Mkinga Distric Coun              150
Lvia                             147
Concern World Wide               145
Unhcr                            137
No                               134
                                ... 
Makori                             1
Makonder                           1
Makondakonde Water Population      1
Makona                             1
Zingibali Secondary                1
Length: 1837, dtype: int64

In [32]:
# Counting funders that funded over 150 water wells 
categoricals[['funder']].groupby('funder').filter(lambda x: len(x) > 150).value_counts()

funder                        
Government Of Tanzania            9084
Unknown                           3639
Danida                            3114
Hesawa                            2202
Rwssp                             1374
World Bank                        1349
Kkkt                              1287
World Vision                      1246
Unicef                            1057
Tasaf                              877
District Council                   843
Dhv                                829
Private Individual                 826
Dwsp                               811
0                                  777
Norad                              765
Germany Republi                    610
Tcrs                               602
Ministry Of Water                  590
Water                              583
Dwe                                484
Netherlands                        470
Hifab                              450
Adb                                448
Lga                              

In [33]:
# Ensuring enough data is categorized by counting the rows covered by funders that funded more than 150 wells
categoricals[['funder']].groupby('funder').filter(lambda x: len(x) > 150).value_counts().sum()

43177

A copy of the column `funder`, named `funder_organization`, will be created so that each funder in it can then be replaced by one of the categories defined above.

In [34]:
# Creating column funder_organization with values from funder 
categoricals['funder_organization'] = categoricals['funder']
categoricals['funder_organization']

0                  Roman
1                Grumeti
2           Lottery Club
3                 Unicef
4            Action In A
              ...       
59395    Germany Republi
59396        Cefa-njombe
59397            Unknown
59398              Malec
59399         World Bank
Name: funder_organization, Length: 59400, dtype: object

In [35]:
# Verifying that the copy of the column was correctly done
assert (categoricals['funder_organization'] == categoricals['funder']).all(), "Columns are not equal."

# If the assertion passes, it will not raise an error.
print("Columns are equal.")

Columns are equal.
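
One possible way to perform the replacement is to invert per-category lists into a single lookup and apply it with `Series.map`. The sketch below is a hedged illustration: `toy_groups` is a shortened, hypothetical stand-in for the full lists, and the fallback to 'Minor funder' is an assumption about how unlisted funders should be handled:

```python
import pandas as pd

# Hypothetical, shortened category lists (the real lists are much longer)
toy_groups = {
    "Bilateral": ["Danida", "Norad"],
    "Government": ["Government Of Tanzania"],
    "Religious": ["Kkkt"],
    "Unknown": ["Unknown", "0"],
}

# Invert to a funder -> category lookup; strip() guards against
# stray leading or trailing spaces in the list entries
funder_to_category = {
    name.strip(): category
    for category, names in toy_groups.items()
    for name in names
}

toy_funder = pd.Series(["Danida", "Kkkt", "Samlo", "Unknown", "Government Of Tanzania"])

# Funders missing from the lookup fall back to 'Minor funder' (assumed default)
toy_org = toy_funder.str.strip().map(funder_to_category).fillna("Minor funder")
print(toy_org.tolist())
```

Stripping whitespace on both sides matters because a list entry such as `"Lvia "` would otherwise never match the value `"Lvia"` in the column.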


In [89]:
# Storing each of the identified funder into the corresponding list to then replace them
bilateral = ["Danida","Hesawa","Norad","Germany Republi","Netherlands","Rudep","Nethalan","World Bank","W.B"]
government = ["Government Of Tanzania","Rwssp","District Council","Dwsp","Water","Dwe","Lga","Private","Jaica","Rural Water Supply And Sanitat","Jica","Wsdp","Rc"]
NPO_NGO = ["World Vision","Unicef","Tasaf","Ministry Of Water","Amref","Oxfam","Wateraid","Mission","Shipo","Ded","Plan Int","Oxfarm","Oikos E.Afrika"]
private = ["Dhv","Hifab","Adb","Fini Water","Isf","Ces(gmbh)","Fw","Ces (gmbh)","Private Individual","Lawatefuka Water Supply","Magadini-makiwaru Water"]
religious = ["Kkkt","Tcrs","Rc Church","Adra","Dmdd","Kkkt_makwale","Wvt","Roman"]
unknown = ["Unknown","0","Finw","Dh","Kiliwater","Go"]
minor_funders = ["Mkinga Distric Coun ","Lvia ","Concern World Wide ","Unhcr ","No ","Swedish ","African ","Community ","Anglican Church ","He ","Is ","Ki ","Tardo ","Ir ","Wananchi ","Snv ","Roman Catholic ","Wua ","Unice ","Bsf ","Tassaf ","Co ","Dfid ","Lamp ","Concern ","Muwsa ","Villagers ","Village Council ","Ru ","Halmashauri Ya Wilaya Sikonge ","Hsw ","Germany ","Twe ","Idc ","Tanza ","Missi ","H ","Mdrdp ","Undp ","Aict ","Gtz ","Japan ","Cmsr ","Rc Ch ","Ndrdp ","Vwc ","Lwi ","Kuwait ","Fin Water ","Caritas ","Kkkt Church ","Cdtf ","Padep ","Kaemp ","Kibaha Town Council ","Marafip ","Conce ","Water Aid /sema ","Cefa ","Ncaa ","Losaa-kia Water Supply ","National Rural ","Commu ","Mkinga Distric Cou ","Md ","Sabemo ","Irish Ai ","Plan International ","Twesa ","St ","Kirde ","Gen ","Idara Ya Maji ","Tlc ","Grumeti ","Finida German Tanzania Govt ","Solidarm ","Kilindi District Co ","Cocen ","Wfp ","Ta ","Tanapa ","China Government ","Dwe/norad ","Sema ","Tabora Municipal Council ","Solidame ","European Union ","Dwssp ","Ridep ","Dads ","Acra ","Devon Aid Korogwe ","Miziriol ","Shawasa ","Red Cross ","Redep ","Tz Japan ","Abasia ","Giz ","Ms ","Tdft ","Donor ","Cafod ","Serikali ","Tredep ","Soda ","Un ","Ka ","Kuwasa ","Songea District Council ","Jbg ","Dasip ","Fathe ","W ","Kidp ","Urt ","Songea Municipal Counci ","Islamic Found ","Watu Wa Ujerumani ","Mbiuwasa ","African Development Bank ","Si ","Water User As ","Ilo ","Holland ","Oikos E.Africa/european Union ","Finn Water ","Aar ","Ics ","Kiuma ","Tuwasa ","Dhv\norp ","Peters ","The Desk And Chair Foundat ","Biore ","Kanisa Katoliki Lolovoni ","Po ","Cspd ","Swiss If ","Kalta ","Churc ","Ga ","Cg ","Peter Tesha ","Jika ","Happy Watoto Foundation ","Cocern ","World Vision/adra ","Finwater ","Kidep ","Il ","Save The Rain Usa ","Undp/ilo ","Ereto ","Sida ","Water Aid/sema ","Dw ","Village Government ","Lips ","Songas ","Kanisa La Menonite ","Halmashauri ","I Wash ","Partage ","Not Known ","Ifad ","Hw/rc 
","Cdcg ","Mileniam Project ","Nethe ","Plan Internatio ","Quwkwin ","Wwf ","Tacare ","Rc Churc ","Total Land Care ","Village ","Dadis ","Adp ","Kilwater ","Hans ","Hewasa ","Rished ","Tanzakesho ","Bgm ","Kalitasi ","Missionaries ","Roman Cathoric-same ","Care International ","Sabodo ","Wd And Id ","Cmcr ","Ilct ","Cefa-njombe ","Village Community ","Rcchurch/cefa ","Luthe ","Mwaya Mn ","Bank ","Singida Yetu ","Killflora ","Swisland/ Mount Meru Flowers ","Cipro/government ","Vifafi ","Msf ","Norad/ Kidep ","Islamic ","Sekei Village Community ","Norad /government ","Maji Mugumu ","Rotary Club ","Magoma Adp ","Efg ","Kuamu ","Makonde Water Population ","Sowasa ","P ","Dhv Moro ","Dct ","Water Board ","Peace Cope ","Kadp ","One Un ","Gain ","The People Of Japan ","Cct ","Ham ","Eu/acra ","Mbunge ","Ubalozi Wa Marekani ","Tanzania ","Pidp ","Asb ","Dasp ","Mdc ","Bened ","Tasaf/dmdd ","Sumbawanga Munici ","Sdg ","Msabi ","Idydc ","Bahewasa ","Father Bonifasi ","Mkuyu ","Ukiligu ","Drdp Ngo ","Pmo ","Quick Wings ","Lwiji Italy ","Livin ","Rwsp ","Finland Government ","Bruder ","Rdc ","Msikiti ","Tado ","Tahea ","Isf/government ","Ruthe ","Mi ","Sda ","Us Embassy ","Fpct ","Elct ","Shule ","Finland ","A/co Germany ","African Muslim Agency ","Roman Cathoric Same ","Quick Wins ","African Relie ","Resolute Mining ","Dar Al Ber ","Roman Catholic Rulenge Diocese ","Imf ","Sao H ","Living Water International ","Rc Church/centr ","Isingiro Ho ","Isf/tacare ","Roman Church ","Olgilai Village Community ","Adp Mombo ","Dassip ","Private Owned ","Mamad ","Wfp/tnt/usaid ","Jeica ","Water Project Mbawala Chini ","The Isla ","Halmashauri Ya Manispa Tabora ","Aco/germany ","Undp/aict ","Maxavella ","Ba As ","Benguka ","Lgcdg ","Franc ","Mosque ","Village Council/ Haydom Luther ","Rural Water Supply And Sanita ","Msf/tacare ","Qwiqwi ","Tassaf I ","Rada ","Aic ","Bulyahunlu Gold Mine ","Prf ","Af ","Oxfam Gb ","Mh An ","Tridep ","Kinapa ","Rips ","Qwickwin ","Canada ","Lifetime 
","Lutheran Church ","Lowasa ","Institution ","National Rural And Hfa ","Secondary ","Kwikwiz ","Ai ","Wfp/tnt ","Idea ","Germany Misionary ","Government/ Community ","Finidagermantanzania Govt ","Dak ","Unicef/central ","Millenium ","Dhv/gove ","Suwasa ","Auwasa ","De ","Kkkt-dioces Ya Pare ","Belgian Government ","Chamavita ","Kmcl ","Sijm ","Snv Ltd ","Nado ","Muslims ","Tasafu ","Tag ","Kanisa","Williamson Diamond Ltd","Government/ World Bank","Regional Water Engineer Arusha","Mzinga A","Water User Group","Bffs","Women For Partnership","Ikela Wa","Kadres Ngo","Serikali Ya Kijiji","Pci","Tassaf Ii","Mem","Hortanzia","Action Contre La Faim","Koica","Holla","Council","Kingupira S","W.D & I.","Cobashec","People Of Japan","Healt","Desk And Chair Foundation","Aimgold","Professor Ben Ohio University","France","Morovian Church","Lottery Club","Angrikana","Government /tassaf","Nk","Tltc","Do","Sengerema District Council","Mamlaka Ya Maji Ngara","Duwas","D","Camavita","Rundu Man","Mtuwasa","Kilomber","Milenia","British Colonial Government","Roman Cathoric -kilomeni","Br","Halmashaur","Ebaha","Olumuro","I.E.C","Member Of Parliament","Action Aid","Moroil","Summit For Water","Abd","Kyela Council","El","Local","Kanisa Katoliki","School","Lawate Fuka Water Suppl","Bs","Mws","Parastatal","Village Govt","Cipro/care/tcrs","Hapa","Menon","Baric","Baptist Church","Africare","Rwsssp","Mbozi District Council","Sangea District Council","Unicef/ Csp","Sauwasa","Tassaf/ Danida","Quick","In","Pataji","Gt","Obc","African Development Foundation","Adf","Robert Loyal","World Vision/ Kkkt","Wate Aid/sema","Moslem Foundation","U.S.A","Islam","Ministry Of Education","Kimkuma","W.D &","Mheza Distric Counc","Rwssp/wsdp","Vttp","Clause","Mboma","Sanje Wa","Concern /govern","Villa","Secondary Schoo","Semaki K","Mac","Caltus","Makonde","Jeshi La Wokovu","Ngos","Gaica","Dawasco","Filo","Total Landcare","Re","Wfp/usaid/tnt","Mzungu Paul","Jimbo Fund","Lions Club","Krp","Mp 
Mzeru","Tcrs.Tlc","Nsc","Tanap","Hesawz","Hesaw","Losakia Water Supply","Japan Food","Mdgwc","Safari Roya","Hospital","Ilkeri Village","Mgm","Kirdep","Rc/mission","Tabraki","Runduman","Regwa Company Of Egypt","Priva","Kigoma Municipal","Ola","St Ph","Jgb","Greec","Brdp","Trachoma","Dbspe","Ddp","Wrssp","Handeni Trunk Main(","Tumaini Fund","Domestic Rural Development Pro","Tulawaka Gold Mine","Bilila","Walokole","Ea","Wanan","Asdp","Balo","Caltas","Cbhi","Usaid/wfp","Cpro","Dmk","Church","Uhai Wa Mama Na Mtoto","W.C.S","Institutional","Nrwssp","Tag Church","Konoike","Rural Drinking Water Supply","Cpps","Quickwi","Dwe/bamboo Projec","Wizara","Concen","Minis","Netherland","Mavuno Ngo","Nyamongo Gold Mining","Totoland Care","Cdg","Geochaina","Rc Cathoric","Rc Mission","Tanesco","Christian Outrich","Kijiji","Mwanga Town Water Authority","Saleh Zaharani","Moravian","Loliondo Secondary","Hiap","Adap","Pad","Tgts","Free Pentecoste Church Of Tanz","Lutheran","Lench","Lions","Bridge North","Lcgd","Diocese Of Geita","Longido Sec School","Farm Africa","Sweden","T","Oldonyolengai","Musilim Agency","Cast","Cartas Tanzania","Kashwas","Magadini Makiwaru Water","Roman Ca","Tcrs /government","Oikos E .Africa/european Union","Wvc","Loliondo Parish","Simmors","Wama","Rvemp","Stantons","Enyueti","Cheni","W.D.&.I.","Mianz","Morovian","F","Rural Water Supply","Eu","Minjingu","Miomb","Sw","Missionary","Mmem","Mh Kapuya","Floresta","Engin","Lvemp","L","Vi","Lake Tanganyika","Serena","Dbfpe","Serikari","Lwi & Central Government","Lcdg","Simavi","Uyoge","Selous G","Shanta","Lgcbg","Ustawi","Diwani","Lgdcg","Lidep","Liuwassa","Usa Embassy","Usaid","Danida /government","Vickfis","Vodacom","Rotte","Udc/sema","Uhoranzi","Sswp","Ukida","Dwt","Cipro/care","Mchukwi Hos","Colonial Government","Kiwanda Cha Tangawizi","Soliderm","Dadp","Unesco","Kmt","Koica And Tanzania Government","Concern/governm","Schoo","Korea","Malec","Makonde Water Supply","Doddea","Rssp","Dhinu","Hydom Luthelani","Halmashauri 
Ya Wilaya","Bingo Foundation","Geita Goldmain","Japan Embassy","Japan Aid","Mzee Don","Oldadai Village Community","H/w","Belgij","Italy Government","Oikos","Tanroad","Nassor Fehed","Wfp/usaid","Awf","Ggm","Tanz Egypt Technical Cooper","Wssp","Nduku Village","Greinaker","Irevea Sister","Ifakara","Quick Win Project","Ngiresi Village Community","Rotary I","Quick Win Project /council","Tirdo","Anjuman E Seifee","Presadom","Quik","Pr","Water Sector Development","Mtuwasa And Community","Mtibwa S","Acord","Ro","Patuu","Tcrs /care","Zao Water Spring","Tpp","Bobby","Hindu","Fpct Church","Mwelia Estate","Gachuma Ginery","Pentecosta Church","Wspd","Magereza","Aic Church","Vicfish Ltd","Nwssp","Pentecostal Church","Abood","Scott","Tcrs/care","Makuru","Vifaf","Makoye Masanzu","Wsdp & Sdg","Peter","Vicfish","Vififi","Denish","Lotary Club","Senapa","Q-sem Ltd","Others","Adp/w","Africa","Uvimaki","Lwf","Africa 2000 Network/undp","Omary Issa","Africa Amini Alama","School Adm9nstrarion","African 2000 Network","Adp Bungu","African Barrick Gold","African Realief Committe Of Ku","Dawasa","Denat","Acord Ngo","Oda","Wua And Ded","Shear Muslim","Ox","Zaben","Calvary Connect","Mashaka","Norad/ Tassaf Ii","Redcross","Chacha","Mwita Muremi","Cgi","Cdtfdistrict Council","Redeso","Regwa Company Of Egpty","Bio Fuel Company","Ministry Of Healthy","Mwigicho","Rural Water Department","Missio","Ruped","Resolute","Bokera W","Wajerumani","Rocci Ross","Bread For The Wor","Mmg Gold Mine","Moradi","Muhameid Na","Care/cipro","Mtc","Roman Cathoric","British Tanza","Rombo Dalta","C","Mp","Mwl. Nyerere Sec. 
School","Redap","Norad/ Tassaf","Chani","Mama Mery Nagu","Sawaka","Village Fund","Cope","None","Rafik","Railway","Nirad","Village Office","Maro","Niger","Mataro","Arabs Community","Neemia Mission","Company","Compa","Ardhi Instute","Nddp","Nchagwa","Nazareth Church","Cipro","Cip","Mzung","Mzee Omari","S","Wb / District Council","Rwssp Shinyanga","Villaers","People From Japan","Father W","Domestic Water Supply Project","Kata","Ikeuchi Towels Japan","Fao","Greineker","Solar Villa","Kombe Foundation","Socie","Svn","Snv-swash","Gwitembe","Unicef/cspd","Government/tassaf","Irc","Swash","Kambi Migoko","Tasaf And Lga","Swidish","Italy","Sister Francis","The Islamic","Halmashauli","Gesawa","Dsp","Islamic Agency Tanzania","Halmashauri Wil","Tadepa","Government","Eno","Irish Government","Tlc/john Majala","Investor","Egypt Government","Insututional","Kibo Brewaries","Egypt","Kibo","Global Fund","Kkkt Dme","Kibaha Independent School","Kfw","Kerebuka","Kiwanda Cha Samaki","Kdrdp Ngo","Tom","Government Of Misri","Sua","Un/wfp","Dwe And Veo","Htm","Iran Gover","K","Dmo","Friends Of Kibara Foundation","Tag Patmo's","Lisa","Lga And Adb","G.D&i.D","Holand","Tempo","Swisland/mount Meru Flowers","Game Fronti","Singasinga","Kyariga","Juhibu","Dmmd","Jimmy","Judge Mchome","Hpa","Foreigne","Ten Degree Hotel","Lench Taramai","Tasef","Dioce","Laramatak","Totoland","Kwik","Fptc - Pent","Wilson","Tlc/thimotheo Masunga","Tlc/sorri","Resolute Mininggolden Pride","Revocatus Mahatane","Winkyens","Rhobi","Water Se","Totaland Care","Tanz/egypt Technical Co-op","Rhobi Wambura","Richard M.Kyore","Water Department","Rilayo Water Project","William Acleus","Tlc/seleman Mang'ombe","Ringo","Rashid Seng'ombe","Rashid Mahongwe","Rashid","Taees","Tlc/samora","Tlc/nyengesa Masanja","Ras","Raurensia","Tansi","Resolute Golden Pride Project","Taes","Tancan","Tancro","Redekop Digloria","Tanedaps Society","Tajiri Jumbe Lila","Rdws","Wbk","Taipo","Wcst","Redet","Wdp","Regina Group","Rc/dwe","Rc 
Njoro","Tove","Wdsp","Tag Church Ub","Tanga Cement","Rc Msufi","Tanload","Weepers","Rc Mofu","Rc Missionary","People From Egypt","Toronto-estate","Watu Wa Marekani","Rc Missi","Rc Mi","Town Council","Shirika La Kinamama Na Watot","Tanzania /egypt","Rarymond Ekura","Tasaf/tlc","Plan Tanzania","Yasi Naini","Tgrs","Tasf","Tgt","Poland Sec School","Yaole","Pori La Akiba Kigosi","Tgz","Wwf / Fores","Raramataki","Primo Zunda","Prince Medium School","Private Co","Private Individul","Private Institutions","Theo","Theonas Mnyama","Tasaf Ii","Tasaf And Mmem","Yasini","Yasini Selemani","Plan","Piusi","Tcrst","Zinduka","People Of Sweden","Tdrs","Perusi Bhoke","Tcrs Kibondo","Zao Water Spring X","Peter Mayiro","Team Rafiki","Peter Ngereka","Tbl","Zao","Zaburi And Neig","Petro Patrice","Teonas Wambura","Teresa Munyama","Piscop","Piscope","Pius Msekwa","Private Person","Tasaf 1","Prodap","Qwckwin","Qwick Win","Woyege","Worldvision","R","World Vision/rc Church","Tanzania Egypt Technical Co Op","Tanzania Compasion","World Bank/government","Tlc/emmanuel Kasoga","Rafael Michael","Women Fo Partnership","Tlc/jenus Malecha","Tanzaling","Rajab Seleman","Rajabu Athumani","Ramadhani M. Mvugalo","Wirara Ya Maji","Ramadhani Nyambizi","Ramsar","Qwekwin","Tanzania Journey","Thomasi Busigaye","Tlc/community","Timothy Shindika","Wug And Ded","Prof. Saluati","Tasae","Pwagu","Pwc","Tasad","Quick Win","Tasa","Tina/africare","Quick Win/halmashauri","Tareto","Tingatinga Sec School","Quick Wins Scheme","Quicklw","Tkc","Quickwins","Wsdo","Tarangire Park","Ripati","Wanginyi Water","Water Authority","Uniceg","Unhcr/government","Unice/ Cspd","Saudia","Soko La Magomeni","Unicef/african Muslim Agency","Scharnhorstgymnasium","Sobodo","Scholastica Pankrasi","Village Council/ Rose Kawala","Siza Mayengo","Water Aid/dwe","Siter Fransis","Sister Makulata","Sda Church","Village Contributio","Unicet","Village Communi","Sdp","Siss M. 
Minghetti","Sisal Estste Hale","Village Res","Village Water Commission","Unhcr/danida","San Pellegrino","Sagaswe","Said Hashim","Said Omari","Villagers Mpi","Un Habitat","Said Salum Ally","Saidi Halfani","Sakwidi","Salamu Kita","Salehe","Salim Ahmed Salim","Salum Tambalizeni","Songa Hospi","Samlo","Samsoni","Samwel","Samweli","Samweli Kitana","Samweli Mshosha","Segera Estate","Seif Ndago","Sisa","Simango Kihengu","Silvester Shilingi","Silinda Yetu","Unp/aict","Serikaru","Seronera","Shabani Dunia","Shamte Said","Sharifa Athuman","Sido","Upendo Primary School","Upper Ruvu","Ur","Shekhe","Shule Ya Sekondari Ipuli","Shule Ya Msingi Ufala","Usambala Sister","Shelisheli Commission","Shule Ya Msingi","Shinyanga Shallow Wells","Vc","Simav","Sekondari","Simba Lodge","Seleman Masoud","Sipdo","Seleman Rashid","Seleman Seif","Selestine Mganga","Selikali Ya Kijiji","Singsinga","Unicrf","Sema S","Semaki","Sindida Yetu","Sent Tho","Seram","Vgovernment","Simone","Serian","Simon Lusambi","Uniseg","Veo","Villages","Villege Council","Safari Camp","Tree Ways German","Rotary Club Australia","Wame Mbiki","Rotary Club Kitchener","Rotary Club Of Chico And Moshi","Rotary Club Of Usa And Moshi","Swifti","Rotaty Club","Rotery C","Treedap","Wamarekani","Sweeden","Wamakapuchini","Swalehe Rajab","Ruangwa Lga","Rudep /dwe","Rudep/norad","Rudri","Sunamco","Rumaki","Wamisionari Wa Kikatoriki","Wamissionari Wa Kikatoriki","Runda","Wanakijiji","Robert Kampala","Tadeo","Robert Mosi","Tacri","Rodri","Romam Catholic","Tquick Wings","Trach","Taboma","Water /sema","Tabea","Taasaf","Trc","Roman Cathoric Church","Warento","Wards","Tredsp","Swiss Tr","Rotary","Sun-ja Na","Sumriy","Sadaqatun Jar","Rwi","Vwt","Vwcvwc","Vwcvc","Stabex","Rwsso","St Magreth Church","Vw","Uaacc","St Gasper","S. 
Kumar","Ubalozi Wa Japani","S.P.C Pre-primary School","Vn","S.S Mohamed","Villlage Contributi","Villegers","St Elizabeth Majengo","Umoja","Sophia Wazir","Staford Higima","Stansilaus","Sumo","Stephano","Waitaliano","Twende Pamoja","Subvillage","Suasa","Wahidi","Waheke","Rural","Su-ki Jang","Stp-sustainable Tan","Wafidhi Wa Ziwa T","Twice","W.F.D.P","Steven Nyangarika","Twig","Rusumo Game Reserve","Stephano Paulo","Tz As","Ruvu Darajani","Rv","Tcrs/village Community","Mgaya","Pentekoste","Government/tcrs","Haam","H4ccp","Gurdians","Grazie Grouppo Padre Fiorentin","Grazie Franco Lucchini","Grail Mission Kiseki Bar","Gra Na Halmashauri","Government/school","Halimashau","Government And Community","Government /world Vision","Government /sda","Goldwill Foundation","Goldmain","Godii","Giovan Disinistra Per Salve","Haidomu Lutheran Church","Halimashauli","Gg","Hassan Gulam","Hery","Heri Mission","Henure Dema","Hearts Helping Hands.Inc.","Health Ministry","Hdv","Haydom Lutheran Hospital","Hasnein Murij","Hamref","Hasnein Muij Mbunge","Hasnan Murig (mbunge)","Hashi","Hasawa","Haruna Mpog","Hapa Singida","Handeni Trunk Maini","Gil Cafe'church'","Getekwe","Hesawwa","Enyuati","Fabia","Eung-am Methodist Church","Eung Am Methodist Church","Ester Ndege","Esawa","Erre Kappa","Ermua","Engineers Without Border","Fdc","Embasy Of Japan In Tanzania","Elca","Egypt Technical Co Operation","Education Funds","Eco Lodge","Eater","Eastmeru Medium School","Farm-africa","Fida","Getdsc00","Friends Of Ulambo And Mwanhala","Germany Missionary","Germany Cristians","German Missionary","Gerald Tuseko Gro","Gdp","Game Division","Full Gospel Church","Friend From Un","Fiwater","Friedkin Conservation Fund","Fresh Water Plc England","Fredked Conservation","Frankfurt","Fpct Mulala","Fosecu","Folac","Hesawa And Concern World Wide","Hesawza","Kauzeni","Ju","Jwtz","Justine Marwa","Jumanne Siabo","Jumanne","Jumaa","Juma","Ju-sarang Church' And Bugango","John Skwese","Kadip","John Gileth","John 
Fund","Jipa","Jeshi Lawokovu","Jeshi La Wokovu [cida]","Japan Government","Japan Food Aid","Kaaya","Kagera","Jamal Abdallah","Kando","Karadea Ngo","Kapelo","Kanisani","Kanisa La Tag","Kanisa La Neema","Kanisa La Mitume","Kanis","Kanamama","Kagera Mine","Kamata Project","Kamama","Kalitesi","Kalebejo Parish","Kajima","Kahema","Kagunguli Secondary","Japan Food Aid Counter Part","Jamal","Hesswa","Huches","Igolola Community","If","Idf","Icf","Icdp","Icap","Iado","Hotels And Loggs Tz Ltd","Ilo/undp","Hotels And Lodge Tanzania","Hongoli","Holili Water Supply","Hilfe Fur Brunder","Hhesawa","Hez","Hewawa","Ilaramataki","Ilwilo Community","Jafary Mbaga","Isf/gvt","Jacobin","Iucn","Italian","Issa Mohamedi Tumwanga","Isnashia And","Islamic Society","Islamic Community","Isf / Tasaff","In Memoria Di Albeto","Irevea Sister Water","Iom","International Aid Services","Internal Drainage Basin","Insititutiona","Inkinda","Incerto","Dwst","Dwsdp","Dwe/ubalozi Wa Marekani","Bgssws","Boma Saving","Boazi /o","Boazi","Bkhws","Birage","Bingo Foundation Germany","Bhws","Bgss","Bonite Bottles Ltd","Bfwd","Batist Church","Bathlomew Vicent","Bao","Banca Reale","Balyehe","Ballo","Bong-kug Ohh/choonlza Lee","Bra","Bakari Chimkube","Busoga Trust","Carmatech","Care/dwe","Care Int","Canada Aid","Camartec","Caltaz Kahama","Caltas Tanzania","Buptist","Brad","Bumabu","Buluga Subvillage Community","Bukwang Church Saints","Bukwang Church Saint","Bukumbi","Brown","Bread Of The Worl","Bakwata","Bahresa","Dwe/rudep","Action In A","Afroz Ismail","Afriican Reli","Africaone Ltd","African Reflections Foundation","Africa Project Ev Germany","Afric","Afdp","Act Mara","Agape Churc","Act","Abs","Abdul","Abddwe","Abdala","Abc-ihushi Development Cent","Abas Ka","Afya Department Lindi Rural","Agt Church","Babtist","Aqua Blues Angels","Babtest","B.A.P","Asgerali N Bharwan","Artisan","Area","Arabi","Arab Community","Apm[africa Precious Metals Lt","Ahmadia","Apm","Answeer Muslim Grou","Amrefe","Ambwene 
Mwaikek","Alia","Aixos","Aic Kij","Cathoric","Cc Motor Day 2010","Ccp","Ded_rwsp","District Medical","Diocese Of Mount Kilimanjaro","Dina","Dimon","Dhv\swis","Dgv","Deogratius Kasima","Ded/rwssp","Dmd","Ded Kilo","Ddca","Dbsp","Dasp Ltd","Dasiip","Dar Es Salaam Round Table","Daldo","District Rural Project","Dmdd/solider","Ccpk","Drdp","Dwe/anglican Church","Dwarf","Dv","Duka","Dsdp","Drwssp","Drv Na Idara","Dqnida","Dmk Anglican","Doner And Ded","Doner And Com","Dominiki Simwen","Domestic Rural Development Pr","Dom","Dokta Mwandulam","Doctor Mwambi","Daida","Dagida","Dae Yeol And Chae Lynn","Chama Cha Ushirika","Cida","Church Of Disciples","Chuo","Christan Outrich","Chongolo","Chmavita","Charlotte Well","Chai Wazir","Dadid","Chacha Issame","Ch","Cgc","Cg/rc","Cefa/rcchurch","Cdft","Ccps","Cocu","College","Community Bank","Compasion International","Dacp","Da Unoperaio Siciliano","D Ct","Cvs Miss","Csf","Crs","Cristan Outrich","Craelius","Cpps Mission","Cper","Cpar","Costantine Herman","Comunity Construction Fund","Comunedi Roma","Comune Di Roma","Kassim","Kayempu Ltd","Pentecostal Hagana Sweeden","Mwalimu Omari","Mwinjuma Mzee","Mwingereza","Mwanza","Mwanamisi Ally","Mwanaisha Mwidadi","Mwamvita Rajabu","Mwamama","Mwalimu Muhenza","Mwita Kichere","Mwalimu Maneromango Muhenzi","Mwakifuna","Mwakalinga","Mwakabalula","Muwasa","Muslimu Society(shia)","Muslimehefen International","Mwita","Mwita Lucas","Muslim Society","Mzee Shindika","Nassan","Nasan","Namungo Miners","Mzungu","Mzee Yassin Naya","Mzee Waziri Tajari","Mzee Smith","Mzee Sh","Mwita Machota","Mzee Salum Bakari Darus","Mzee Ngwatu","Mzee Mkungata","Mzee Mabena","Mzee Lesilali","Mwl.Mwita","Mwita Mahiti","Muslim World","Muniko","Natio","Mnyambe","Motiba Manyanya","Mosqure","Moshono Adp","Moses","Morrovian","Morad","Mnyamisi Jumaa","Mnyama","Mow","Mmanya Abdallah","Mkurugenzi","Mkuluku","Mkulima","Mitema","Misri Government","Misheni","Motiba Wambura","Moyowosi Basin","Municipal 
Council","Msudi","Mungaya","Muivaru","Muislam","Muhochi Kissaka","Muhindi","Mtewe","Mtambo","Mstiiti","Mp Mloka","Msikitini","Msikiti Masji","Msiki","Msigwa","Msigw","Ms-danish","Mrtc","Natherland","National Park","Kc","Opec","Oxfarm Gb","Owner Pingo C","Overnment","Overland High School","Othod","Otelo Bussiness Company","Orphanage","One Desk One Chair","Padi","Omar Rafael","Omar Ally","Old Nyika Company","Okutu Village Community","Oikos E.Africa/ European Union","Obadia","Oak'zion' And Bugango B' Commu","Padep(mifugo)","Padri","Nyitamboka","Paskali","Pentecostal","Pentecosta Seela","Pentecost","Pema","Pdi","Paulo Sange","Patrick","Parastatal An","Padri K","Panone","Pankrasi","Pangadeco","Pancrasi","Pag Church","Paffect Mwanaindi","Padri Matayo","O","Nyeisa","National Rural (wb)","Ngelepo Group","Nmdc India","Njula","Nipon & Panoco","Nimrodi Mkono[mb]","Ngumi","Ngo","Nginila","Netherla","Norad/government","Nerthlands","Ndorobo Tours","Ndolezi","Ndm","Ncs","Nazaleti","Nazalet Church","Noeli Mahobokela","Norad/japan","Nyanza Road","Nyabibuye Islamic Center","Nyangere","Nyamuhanga Maro","Nyamingu Subvillage","Nyamasagi","Nyakaho Mwita","Nyahale","Nyabweta","Nyabarongo Kegoro","Norad/rudep","Nssf","Noshadi","Noshad","Norway Aid","Norplan","Nordic","Norani","Misana George","Ministry Of Agricultura","Mikumi G","Lake Tanganyika Prodap","Leopad Abeid","Legeza Legeza","Lee Kang Pyung's Family","Ldcgd","Ldcdd","Lc","Latfu","Lake Tanganyika Basin","Lgcd","Laizer","Kyela-morogoro","Kwasenenge Group","Kwaruhombo He","Kwang-nam Middle-school","Kwamdulu Estate","Kwa Mzee Waziri","Lg","Lgcgd","Kwa Ditriki Cho","Lottery","Lusajo","Lungwe","Luke Samaras Ltd","Luka","Luchelegu Primary School","Luali Kaima","Louise Elucas Sala","Lotary International","Lgdbg","Loocip","Long Ga","Lizad","Liz","Lions Club Kilimanjaro","Lions C","Lion Clu","Kwa Makala","Kurrp Ki","Miab","Kigwa","Kindoroko Water Project","Kilol","Kilimo","Kilimarondo Parish","Kikundi Cha Akina Mama","Kikom","Kijij","Kigoma 
Municipal Council","Kipo Potry","Kidika","Kibara Foundation","Kenyans Company","Kegocha","Kdpa","Kdc","Kcu","Kinga","Kitiangare Village Community","Kurrp","Kokornel","Kuji Foundation","Ku","Kopwe Khalifa","Kondo Primary","Kondela","Kome Parish","Kolopin","Koico","Kiwanda Cha Ngozi","Kkkt Usa","Kkkt Ndrumangeni","Kkkt Mareu","Kkkt Leguruki","Kkkt Canal","Kizenga","Kizego Jumaa","M","M And P","Ma","Matogoro","Mbozi Hospital","Mboni Salehe","Mbiusa","Mbeje","Mazaro Kabula","Mayiro","Matyenye","Matimbwa Sec","Mbuzi Mawe","Matata Selemani","Maswi Drilling Co. Ltd","Masista","Masese","Maseka Community","Masai Land","Marumbo Community","Mbozi Secondary School","Mbwana Omari","Maajabu Pima","Mfuko Wa Jimbo La Magu","Mhuzu","Mhoranzi","Mhina","Mh.J S Sumari","Mh.Chiza","Mgaya Masese","Mganga","Mfuko Wa Jimbo","Mbwiro","Methodist Church","Meru Concrete","Member Of Perliament Ahmed Ali","Member O","Meko Balo","Medicine","Meco","Maro Kyariga","Marke","Marafin","Magu Food Security","Makanga","Maju Mugumu","Majengo Prima","Mahita","Mahemba","Magutu Maro","Magul","Magige","Manyovu Agriculture Institute","Magani","Mafwimbo","Maerere","Madra","Madaraweshi","Machibya Guma","Maashumu Mohamed","Makanya Sisal Estate","Makapuchini","Makli","Makombo","Manyota Primary School","Mango Tree","Mamvua Kakungu","Mambe","Mamaz","Mama Ku","Malola","Maliasili","Males","Makusa","Makundya","Makori","Makonder","Makondakonde Water Population","Makona","Zingibali Seconda"]

In [90]:
# Replacing each list of funders by their assigned category
categoricals['funder_organization'] = categoricals['funder_organization'].replace(bilateral, 'bilateral')
categoricals['funder_organization'] = categoricals['funder_organization'].replace(government, 'government')
categoricals['funder_organization'] = categoricals['funder_organization'].replace(NPO_NGO, 'NPO_NGO')
categoricals['funder_organization'] = categoricals['funder_organization'].replace(private, 'private')
categoricals['funder_organization'] = categoricals['funder_organization'].replace(religious, 'religious')
categoricals['funder_organization'] = categoricals['funder_organization'].replace(unknown, 'unknown')
categoricals['funder_organization'] = categoricals['funder_organization'].replace(minor_funders, 'minor_funders')
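The `unique()` output in the next cell still lists raw names such as 'Grumeti' and 'Mkinga Distric Coun'. One likely cause is that many entries in `minor_funders` carry a trailing space (for example `"Grumeti "`), so `replace` never matches them exactly. A minimal sketch of a more forgiving approach follows, assuming shortened, hypothetical category lists and the hypothetical names `category_lists`, `funder_to_category` and `funders`. Sending unmatched names to 'minor_funders' is also an assumption made for this illustration.

```python
import pandas as pd

# Hypothetical, shortened versions of the category lists
category_lists = {
    "bilateral": ["Danida", "World Bank"],
    "religious": ["Kkkt", "Roman"],
    "minor_funders": ["Grumeti ", "Lvia ", "Kanisa"],  # note the trailing spaces
}

# Build one lookup from cleaned funder name to category
funder_to_category = {
    name.strip(): category
    for category, names in category_lists.items()
    for name in names
}

funders = pd.Series(["Roman", "Grumeti", "World Bank", "Lvia", "Somebody Else"])

# Map known funders; anything not in the lookup falls back to 'minor_funders'
funder_organization = funders.str.strip().map(funder_to_category).fillna("minor_funders")
print(funder_organization.tolist())
```

A single lookup dictionary also replaces the seven separate `replace` calls with one `map`, which makes it easier to see which names were never assigned a category.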


In [91]:
# Inspecting the unique values to find funder names that still need to be replaced
categoricals['funder_organization'].unique()

array(['religious', 'Grumeti', 'minor_funders', 'NPO_NGO',
       'Mkinga Distric Coun', 'government', 'Isingiro Ho', 'bilateral',
       'private', 'Biore', 'Twe', 'African Development Bank', 'Undp',
       'unknown', 'Not Known', 'Kirde', 'Cefa', 'European Union', 'Muwsa',
       'Dwe/norad', 'Olgilai Village Community', 'Roman Catholic', 'Sema',
       'Swisland/ Mount Meru Flowers', 'Ifad', 'Swedish', 'Idc', 'He',
       'Isf/tacare', 'Aict', 'Kiuma', 'Ruthe', 'Concern World Wide',
       'Wfp', 'Lips', 'Sida', 'Tanza', 'Village Council', 'Fpct', 'Ir',
       'Anglican Church', 'Peters', 'Donor', 'Jbg', 'Dadis', 'Germany',
       'Kibaha Town Council', 'Dfid', 'Af', 'Wananchi', 'No', 'Dct',
       'Norad /government', 'Co', 'Ridep', 'Tassaf', 'Hans', 'Fin Water',
       'Plan International', 'African Muslim Agency', 'Cdtf', 'Shawasa',
       'Un', 'Commu', 'Community', 'Save The Rain Usa', 'Tlc', 'Rc Churc',
       'Lvia', 'Songea District Council', 'Rc Ch',
       'Makonde Water P

#### 4. b. 5. Encoding Categorical Variables

In [26]:
# One-hot encoding the categorical columns
one_hot_df = pd.get_dummies(df)
one_hot_df.columns

Index(['amount_tsh', 'gps_height', 'population', 'public_meeting', 'permit',
       'construction_year', 'status_group', 'funder_0', 'funder_A/co Germany',
       'funder_Aar',
       ...
       'source_type_rainwater harvesting', 'source_type_river/lake',
       'source_type_shallow well', 'source_type_spring',
       'waterpoint_type_group_cattle trough',
       'waterpoint_type_group_communal standpipe', 'waterpoint_type_group_dam',
       'waterpoint_type_group_hand pump',
       'waterpoint_type_group_improved spring', 'waterpoint_type_group_other'],
      dtype='object', length=4129)
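The 4129 columns above arise because `pd.get_dummies` creates one indicator column per distinct value of every categorical column, so high-cardinality columns such as `funder` dominate the width. The sketch below shows this on a hypothetical toy frame. The data and the names `toy`, `encoded` and `expected_columns` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy frame: one numeric column and two categorical columns
toy = pd.DataFrame({
    "population": [100, 250, 0, 40],
    "funder": ["Unicef", "Kkkt", "Unicef", "Roman"],
    "source_type": ["spring", "river/lake", "spring", "spring"],
})

# Each distinct category value becomes its own indicator column
encoded = pd.get_dummies(toy)

# Number of columns = numeric columns + distinct values per categorical column
expected_columns = 1 + toy["funder"].nunique() + toy["source_type"].nunique()
print(encoded.shape[1], expected_columns)
print(list(encoded.columns))
```

By the same arithmetic, encoding a grouped column with a handful of categories in place of the raw funder names would be expected to produce far fewer columns.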

## 5. Modeling

What modeling techniques should we apply?

Begin with a basic model, evaluate it, and then provide justification for and proceed to a new model. 



Be sure to explore:

1. Model features and preprocessing approaches
2. Different kinds of models (logistic regression, k-nearest neighbors, decision trees, etc.)
3. Different model hyperparameters

At minimum you must build three models:

* A simple, interpretable baseline model (logistic regression or single decision tree)
* A more-complex model (e.g. random forest)
* A version of either the simple model or more-complex model with tuned hyperparameters

### 5. a. Logistic Regression

#### 5. a. 1. Performing a Train-Test Split

In [27]:
# Splitting df into X and y
X = df.drop('status_group', axis=1)
y = df['status_group']

The dataset is divided into two subsets: a training set and a testing (or validation) set. The held-out set makes it possible to assess the model's performance on data it was not trained on.

Two parameters are assigned when dividing the dataset:
* random_state=42 
   - setting a random seed of 42 ensures that the data split is reproducible
* stratify=y 
   - stratified sampling ensures the class distribution is maintained in both sets to address potential class imbalance issues
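The effect of these two parameters can be seen on a small example. The sketch below uses a hypothetical imbalanced target (`y_toy`, 20% positives) rather than the wells data. All variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 zeros and 20 ones
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# Same call pattern as in the notebook; the default test size is 25%
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42, stratify=y_toy)

# Both subsets keep the 20% share of class 1
print(y_tr.mean(), y_te.mean())

# Repeating the call with the same seed returns the same split
X_tr2, _, _, _ = train_test_split(X_toy, y_toy, random_state=42, stratify=y_toy)
print(np.array_equal(X_tr, X_tr2))
```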

In [28]:
# Performing train-test split with random_state=42 and stratify=y 
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

Because stratify=y was applied, the percentages of non functional water wells in the train and test target should be similar. 

In [29]:
# Inspecting the percentages of non functional water wells in train and test targets: 
print("Train percent of non functional wells:", y_train.value_counts(normalize=True)[1])
print("Test percent of non functional wells:", y_test.value_counts(normalize=True)[1])

Train percent of non functional wells: 0.5430751964085297
Test percent of non functional wells: 0.5430976430976431


#### 5. a. 2. Building and Evaluating a Baseline Model

We will begin by creating a classifier using scikit-learn's LogisticRegression model, setting the random_state to 42. Next, we will employ cross_val_score with the scoring metric "neg_log_loss" to compute the average log loss through cross-validation on our training data, X_train and y_train.

Note that, as with Root Mean Squared Error, scikit-learn only exposes this metric to cross_val_score in negated form ("neg_log_loss"), because its scorers follow the convention that higher values are better. The code therefore negates the averaged result to report an ordinary, positive log loss.
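A minimal, self-contained sketch of this evaluation step follows. It uses synthetic data from `make_classification` plus a hypothetical string column, not the wells dataset, and the names `X_toy`, `y_toy`, `X_encoded` and `mean_log_loss` are assumptions. Because `LogisticRegression` requires numeric input, the sketch one-hot encodes the string column before fitting. The traceback further down ("could not convert string to float: 'Unicef'") suggests the notebook's `X` still contained raw string columns at this point. Setting `max_iter=1000` is also an assumption, made here to avoid convergence warnings.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical numeric features plus one string column, standing in for df
X_num, y_toy = make_classification(n_samples=200, n_features=4, random_state=42)
X_toy = pd.DataFrame(X_num, columns=["f0", "f1", "f2", "f3"])
X_toy["funder"] = ["Unicef", "Kkkt"] * 100

# LogisticRegression needs numeric input, so string columns are one-hot encoded first
X_encoded = pd.get_dummies(X_toy)

model = LogisticRegression(random_state=42, max_iter=1000)

# scikit-learn scorers follow "higher is better", so log loss comes back negated
neg_log_loss = cross_val_score(model, X_encoded, y_toy, scoring="neg_log_loss")

# Flip the sign to report an ordinary (positive) mean log loss
mean_log_loss = -neg_log_loss.mean()
print(mean_log_loss)
```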

In [30]:
# Importing the relevant class and function
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


In [31]:
# Instantiating a LogisticRegression with random_state=42
baseline_model = LogisticRegression(random_state=42)

In [32]:
# Using cross_val_score with scoring="neg_log_loss" to evaluate the model 
# on X_train and y_train
baseline_neg_log_loss_cv = cross_val_score(baseline_model, X_train, y_train, scoring="neg_log_loss")

baseline_neg_log_loss_cv = -(baseline_neg_log_loss_cv.mean())
baseline_neg_log_loss_cv

ValueError: 
All the 5 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
1 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\albane.colmenares\AppData\Local\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\albane.colmenares\AppData\Local\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1196, in fit
    X, y = self._validate_data(
           ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\albane.colmenares\AppData\Local\anaconda3\Lib\site-packages\sklearn\base.py", line 584, in _validate_data
    X, y = check_X_y(X, y, **check_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\albane.colmenares\AppData\Local\anaconda3\Lib\site-packages\sklearn\utils\validation.py", line 1106, in check_X_y
    X = check_array(
        ^^^^^^^^^^^^
  File "C:\Users\albane.colmenares\AppData\Local\anaconda3\Lib\site-packages\sklearn\utils\validation.py", line 879, in check_array
    array = _asarray_with_order(array, order=order, dtype=dtype, xp=xp)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\albane.colmenares\AppData\Local\anaconda3\Lib\site-packages\sklearn\utils\_array_api.py", line 185, in _asarray_with_order
    array = numpy.asarray(array, order=order, dtype=dtype)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\albane.colmenares\AppData\Local\anaconda3\Lib\site-packages\pandas\core\generic.py", line 2070, in __array__
    return np.asarray(self._values, dtype=dtype)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: could not convert string to float: 'Unicef'

--------------------------------------------------------------------------------
4 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\albane.colmenares\AppData\Local\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\albane.colmenares\AppData\Local\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1196, in fit
    X, y = self._validate_data(
           ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\albane.colmenares\AppData\Local\anaconda3\Lib\site-packages\sklearn\base.py", line 584, in _validate_data
    X, y = check_X_y(X, y, **check_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\albane.colmenares\AppData\Local\anaconda3\Lib\site-packages\sklearn\utils\validation.py", line 1106, in check_X_y
    X = check_array(
        ^^^^^^^^^^^^
  File "C:\Users\albane.colmenares\AppData\Local\anaconda3\Lib\site-packages\sklearn\utils\validation.py", line 879, in check_array
    array = _asarray_with_order(array, order=order, dtype=dtype, xp=xp)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\albane.colmenares\AppData\Local\anaconda3\Lib\site-packages\sklearn\utils\_array_api.py", line 185, in _asarray_with_order
    array = numpy.asarray(array, order=order, dtype=dtype)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\albane.colmenares\AppData\Local\anaconda3\Lib\site-packages\pandas\core\generic.py", line 2070, in __array__
    return np.asarray(self._values, dtype=dtype)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: could not convert string to float: 'Germany Republi'
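
The traceback shows that every fit failed because X_train still contains string-valued columns (values such as 'Unicef' and 'Germany Republi'), which LogisticRegression cannot convert to floats. One possible remedy is to encode categorical columns inside a pipeline so the encoder is fit on each training fold only. The sketch below uses a tiny hypothetical DataFrame; the column names `funder` and `gps_height`, the toy values, and all variable names are assumptions for illustration, not the notebook's actual preprocessing.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny hypothetical frame standing in for X_train / y_train
X_toy = pd.DataFrame({
    "funder": ["Unicef", "Germany Republi", "Unicef", "Other"] * 10,
    "gps_height": [1390, 1399, 686, 263] * 10,
})
y_toy = pd.Series([0, 1, 0, 1] * 10)

# Split columns by type: numeric ones are scaled, the rest are one-hot encoded
numeric_cols = X_toy.select_dtypes(include="number").columns.tolist()
categorical_cols = [col for col in X_toy.columns if col not in numeric_cols]

preprocess = ColumnTransformer([
    # handle_unknown="ignore" avoids errors on categories unseen during fitting
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("num", StandardScaler(), numeric_cols),
])

toy_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(random_state=42)),
])

# The pipeline is refit on every fold, so encoding never sees validation rows
toy_scores = cross_val_score(toy_pipeline, X_toy, y_toy, scoring="neg_log_loss")
toy_log_loss = -toy_scores.mean()
print(toy_log_loss)
```

With high-cardinality columns, one-hot encoding can produce a very wide matrix, so grouping rare categories or dropping such columns first may be worth considering.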


#### 5. a. 3. Writing a Custom Cross Validation Function
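
A custom cross-validation function gives control over what happens inside each fold, for example fitting preprocessing on the training split only. The following is a minimal sketch, assuming stratified K-fold splitting, NumPy array inputs, and log loss as the metric. The function name `custom_cross_val_log_loss` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler


def custom_cross_val_log_loss(model, X, y, n_splits=5, random_state=42):
    """Return the mean validation log loss across stratified folds."""
    kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    fold_losses = []
    for train_idx, val_idx in kfold.split(X, y):
        X_tr, X_val = X[train_idx], X[val_idx]
        y_tr, y_val = y[train_idx], y[val_idx]

        # Fit the scaler on the training fold only, then apply it to both splits
        scaler = StandardScaler().fit(X_tr)

        # clone() gives a fresh, unfitted copy of the model for each fold
        fold_model = clone(model).fit(scaler.transform(X_tr), y_tr)

        val_probs = fold_model.predict_proba(scaler.transform(X_val))
        fold_losses.append(log_loss(y_val, val_probs, labels=fold_model.classes_))
    return float(np.mean(fold_losses))


# Synthetic stand-in data (an assumption; not the wells dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
demo_loss = custom_cross_val_log_loss(LogisticRegression(random_state=42), X_demo, y_demo)
print(demo_loss)
```

For pandas inputs, the indexing inside the loop would need `.iloc[train_idx]` and `.iloc[val_idx]` instead of plain bracket indexing.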

#### 5. a. 4. Building and Evaluating Additional Logistic Regression Models

#### 5. a. 5. Choosing and Evaluating a Final Model

## 6. Evaluation

Which model best meets the business objectives?

After you finish refining your models, you should provide 1-3 paragraphs in the notebook discussing your final model.

Choosing the right **classification metrics**
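
Because the stated goal is to predict whether a well is non functional, recall and precision on that class may be more informative than overall accuracy. The sketch below computes both on hypothetical labels; `y_true` and `y_pred` are made-up values for illustration, not model results.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = non functional, 0 = functional
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Recall: share of truly non functional wells that the model flags
recall = recall_score(y_true, y_pred)

# Precision: share of flagged wells that are truly non functional
precision = precision_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes
matrix = confusion_matrix(y_true, y_pred)

print(recall, precision)
print(matrix)
```

Which metric to prioritize depends on the relative cost of missing a broken well versus sending a repair crew to a working one.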

## 7. Findings & Recommendations

**Predictive** approach

A predictive finding might include:

* How well your model is able to predict the target
* What features are most important to your model
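
For a logistic regression, one rough way to gauge feature importance is to rank the coefficients by absolute value after standardizing the inputs. The sketch below assumes synthetic data and hypothetical feature names; it illustrates the idea and does not report findings for the wells dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with hypothetical feature names
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=42
)
feature_names = ["feature_a", "feature_b", "feature_c", "feature_d"]

# Standardize so coefficient magnitudes are on a comparable scale
X_scaled = StandardScaler().fit_transform(X_demo)
model = LogisticRegression(random_state=42).fit(X_scaled, y_demo)

# Rank coefficients from largest to smallest absolute value
coefs = pd.Series(model.coef_[0], index=feature_names)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

Coefficient magnitudes are only comparable when features share a scale, and correlated features can split or mask each other's apparent importance.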


A predictive recommendation might include:

* The contexts/situations where the predictions made by your model would and would not be useful for your stakeholder and business problem
* Suggestions for how the business might modify certain input variables to achieve certain target results

## 8. Limits & Next Steps

## Appendix