### Notebook 2_3: Preprocessing - Handling Categorical Features
**Author:<br>
Tashi T. Gurung**<br>
**hseb.tashi@gmail.com**

### About the project:
The **objective** of this project is to **predict the failure of water points** spread across Tanzania before it occurs.

50% of Tanzania's population does not have access to safe water. Among other sources, Tanzanians depend on water points, mostly pumps (~60K), spread across the country. Compared to other infrastructure projects, water point projects consist of a huge number of inspection points that are geographically spread out. Gathering data on the condition of these pumps has been a challenge. Approaches ranging from working with local agencies to mobile-based crowdsourcing projects have not produced satisfactory results.

The lack of quality data creates a number of problems for stakeholders such as the Tanzanian Government, specifically the Ministry of Water. Consequences include not only higher maintenance costs, but also the problems and nuanced issues communities face when their access to water is compromised or threatened.

While better data collection infrastructure should be built over time, this project (with its models, analyses, and insights) will be key for efficient resource allocation to maximize the number of people and communities with access to water.
In the long run, it will assist stakeholders in project planning and in local, regional, and national policy formation.

---

### Import libraries and datasets

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
import warnings

# Filter out FutureWarnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [5]:
# import json file with desired data type information 
json_file_path = '../data/processed/data_types.json'

# Read and load the JSON file into a dictionary
with open(json_file_path, 'r') as json_file:
    data_types_dict = json.load(json_file)

data_types_dict['longitude'] = 'float64'
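
The dictionary loaded above maps column names to dtype strings, which `read_csv` accepts through its `dtype` argument. The contents of `data_types.json` are not shown in this notebook, so the sketch below uses a made-up mapping and a tiny in-memory CSV purely to illustrate the effect.

```python
import io
import json

import pandas as pd

# Hypothetical stand-in for data_types.json (the real file is not shown here)
toy_types = json.loads(
    '{"region_code": "int64", "basin": "object", "longitude": "float64"}'
)

# Tiny in-memory CSV; the longitude values are deliberately written as integers
toy_csv = io.StringIO(
    "region_code,basin,longitude\n"
    "11,Lake Nyasa,34\n"
    "20,Lake Victoria,35\n"
)

toy_df = pd.read_csv(toy_csv, dtype=toy_types)

# longitude is parsed as float64 because the mapping says so,
# even though the raw text would otherwise be inferred as integers
print(toy_df.dtypes)
```

Overriding a single entry after loading, as done for `longitude` above, is a simple way to patch the mapping without editing the JSON file.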

In [6]:
df = pd.read_csv('../data/processed/preprocessed_data.csv', dtype=data_types_dict)
df.head(2)

| Unnamed: 0 | amount_tsh | date_recorded | funder | gps_height | installer | longitude | latitude | wpt_name | basin | subvillage | ... | water_quality | quality_group | quantity | quantity_group | source | source_type | source_class | waterpoint_type | waterpoint_type_group | status_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6000.0 | 2011-03-14 | Roman | 1390 | Roman | 34.938093 | -9.856322 | none | Lake Nyasa | Mnyusi B | ... | soft | good | enough | enough | spring | spring | groundwater | communal standpipe | communal standpipe | functional |
| 1 | 0.0 | 2013-03-06 | Grumeti | 1399 | GRUMETI | 34.698766 | -2.147466 | Zahanati | Lake Victoria | Nyamara | ... | soft | good | insufficient | insufficient | rainwater harvesting | rainwater harvesting | surface | communal standpipe | communal standpipe | functional |


---

## Handling Categorical Features

In [None]:
# Change data types to object
columns_to_convert_to_object = ['district_code', 'region_code']
df[columns_to_convert_to_object] = df[columns_to_convert_to_object].astype('object')
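
The cast above matters because the codes are labels rather than quantities, and the next cell selects categorical columns by `object` dtype. A minimal sketch on toy data (the values are made up) shows a numeric code column being skipped before the cast and picked up after it.

```python
import pandas as pd

# Toy frame: region_code is stored as an integer, like a numeric feature
toy = pd.DataFrame({
    "region_code": [11, 20, 11],
    "amount_tsh": [6000.0, 0.0, 25.0],
})

# Before the cast, an object-dtype filter finds nothing
before = toy.select_dtypes(include=["object"]).columns.tolist()

# After the cast, region_code is treated as a categorical column
toy["region_code"] = toy["region_code"].astype("object")
after = toy.select_dtypes(include=["object"]).columns.tolist()

print(before, after)
```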

In [None]:
categorical_columns = df.select_dtypes(include=['object']).columns.to_list()
categorical_columns.remove('status_group')

In [None]:
# Calculate cardinality (number of unique values) for each categorical column
cardinality = df[categorical_columns].nunique()

# Display the cardinality of each categorical column
print(cardinality)


funder                    1896
installer                 2145
wpt_name                 37399
basin                        9
subvillage               19287
region                      21
region_code                 27
district_code               19
lga                        125
ward                      2092
public_meeting               3
recorded_by                  1
scheme_management           11
scheme_name               2696
permit                       3
extraction_type             18
extraction_type_group       13
extraction_type_class        7
management                  12
management_group             5
payment                      7
payment_type                 7
water_quality                8
quality_group                6
quantity                     5
quantity_group               5
source                      10
source_type                  7
source_class                 3
waterpoint_type              7
waterpoint_type_group        6
dtype: int64
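
The cardinalities above span several orders of magnitude, so it can help to bucket columns before choosing an encoding. The sketch below copies a subset of the printed values and splits them with a cut-off of 1000, which is a hypothetical threshold chosen for illustration and not one stated in this notebook.

```python
import pandas as pd

# Subset of the cardinalities printed above
cardinality_sample = pd.Series({
    "wpt_name": 37399,
    "subvillage": 19287,
    "scheme_name": 2696,
    "installer": 2145,
    "ward": 2092,
    "funder": 1896,
    "lga": 125,
    "basin": 9,
    "recorded_by": 1,
})

HIGH_CARDINALITY_THRESHOLD = 1000  # hypothetical cut-off

# Columns with too many levels for plain one-hot encoding
high_cardinality = cardinality_sample[
    cardinality_sample > HIGH_CARDINALITY_THRESHOLD
].index.tolist()

# Columns with a single value carry no information
constant_columns = cardinality_sample[cardinality_sample == 1].index.tolist()

print("high cardinality:", high_cardinality)
print("constant:", constant_columns)
```

With this cut-off, the high-cardinality bucket contains the four columns that are target encoded below plus funder and installer, which are handled separately.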


In [None]:
df['status_group'] = df['status_group'].map({
    'functional': 2,
    'non functional': 0,
    'functional needs repair': 1
})
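
`Series.map` with a dictionary returns `NaN` for any label that is not a key, without raising an error. A small check like the one below, run on toy labels here (the capitalised variant is invented for the example), can catch unexpected spellings before they turn into missing targets.

```python
import pandas as pd

status_map = {
    "non functional": 0,
    "functional needs repair": 1,
    "functional": 2,
}

# Toy labels; "Functional" is a deliberately misspelled variant
toy_status = pd.Series([
    "functional",
    "non functional",
    "Functional",
    "functional needs repair",
])

mapped = toy_status.map(status_map)

# Labels that did not match any key end up as NaN after mapping
unmapped = toy_status[mapped.isna()].unique().tolist()
print(unmapped)
```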


In [None]:
# Import the required library
from category_encoders import TargetEncoder

# Initialize the target encoder
encoder = TargetEncoder()

# Encode the "wpt_name" column using the target encoder
df['wpt_name_encoded'] = encoder.fit_transform(df['wpt_name'], df['status_group'])

# Drop the original "wpt_name" column
df.drop('wpt_name', axis=1, inplace=True)

# Print the first few rows to check the result
print(df[['wpt_name_encoded', 'status_group']].head())


   wpt_name_encoded  status_group
0          1.496774             2
1          1.130120             2
2          1.288769             2
3          1.008064             0
4          1.064645             2
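
The cell above fits the encoder on every row of df together with status_group, so each encoded value is partly derived from that row's own label. If the encoded columns later feed a model that is evaluated on held-out rows, one common precaution is to learn the category means on the training rows only and then apply them to the rest. The sketch below illustrates that idea in plain pandas with additive smoothing. It is a simplified stand-in and not necessarily the formula `TargetEncoder` uses; the helper names, the `smoothing` value, and the toy rows are all hypothetical.

```python
import pandas as pd


def fit_target_means(categories, target, smoothing=10.0):
    """Learn smoothed per-category target means from training data only."""
    global_mean = target.mean()
    stats = target.groupby(categories).agg(["sum", "count"])
    # Pull small categories towards the global mean
    smoothed = (stats["sum"] + smoothing * global_mean) / (stats["count"] + smoothing)
    return smoothed, global_mean


def apply_target_means(categories, smoothed, global_mean):
    """Encode categories; unseen categories fall back to the global mean."""
    return categories.map(smoothed).fillna(global_mean)


# Toy training and held-out rows (values are made up)
train = pd.DataFrame({
    "wpt_name": ["none", "none", "Shuleni", "Zahanati"],
    "status_group": [2, 0, 2, 1],
})
test = pd.DataFrame({"wpt_name": ["none", "Msikitini"]})

means, prior = fit_target_means(train["wpt_name"], train["status_group"], smoothing=1.0)
train["wpt_name_encoded"] = apply_target_means(train["wpt_name"], means, prior)
test["wpt_name_encoded"] = apply_target_means(test["wpt_name"], means, prior)

print(test)
```

The held-out rows never contribute their own labels, and a name that was not seen during fitting ("Msikitini" here) receives the global training mean.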


In [None]:
# Apply the same target encoding to the remaining high-cardinality features

# Encode the "subvillage" column using the target encoder
df['subvillage_encoded'] = encoder.fit_transform(df['subvillage'], df['status_group'])

# Encode the "ward" column using the target encoder
df['ward_encoded'] = encoder.fit_transform(df['ward'], df['status_group'])

# Encode the "scheme_name" column using the target encoder
df['scheme_name_encoded'] = encoder.fit_transform(df['scheme_name'], df['status_group'])

# Drop the original categorical columns
df.drop(['subvillage', 'ward', 'scheme_name'], axis=1, inplace=True)


In [None]:
# Calculate cardinality (number of unique values) for each categorical column

new_categorical_columns = df.select_dtypes(include=['object']).columns.to_list()
cardinality = df[new_categorical_columns].nunique()

# Display data type and cardinality side by side
for column in new_categorical_columns:
    print(f"{column}: {df[column].dtype}, Cardinality: {cardinality[column]}")


funder: object, Cardinality: 1896
installer: object, Cardinality: 2145
basin: object, Cardinality: 9
region: object, Cardinality: 21
region_code: object, Cardinality: 27
district_code: object, Cardinality: 19
lga: object, Cardinality: 125
public_meeting: object, Cardinality: 3
recorded_by: object, Cardinality: 1
scheme_management: object, Cardinality: 11
permit: object, Cardinality: 3
extraction_type: object, Cardinality: 18
extraction_type_group: object, Cardinality: 13
extraction_type_class: object, Cardinality: 7
management: object, Cardinality: 12
management_group: object, Cardinality: 5
payment: object, Cardinality: 7
payment_type: object, Cardinality: 7
water_quality: object, Cardinality: 8
quality_group: object, Cardinality: 6
quantity: object, Cardinality: 5
quantity_group: object, Cardinality: 5
source: object, Cardinality: 10
source_type: object, Cardinality: 7
source_class: object, Cardinality: 3
waterpoint_type: object, Cardinality: 7
waterpoint_type_group: object, Cardinality: 6

**Observation:**
- Two features, funder and installer, still have relatively high cardinality.

**Action:**
- Let's look into these.

### funder

In [None]:
# Keep "Government Of Tanzania" and collapse every other funder
# (including missing values) into "other"
df["funder"] = np.where(df["funder"] == "Government Of Tanzania", df["funder"], "other")
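
To make the effect of this rule concrete, the sketch below applies the same `np.where` expression to a handful of toy funders, including a missing value. The funder names other than "Government Of Tanzania" are only examples.

```python
import numpy as np
import pandas as pd

# Toy funders, including one missing value
toy_funder = pd.Series([
    "Government Of Tanzania",
    "Roman",
    None,
    "Grumeti",
    "Government Of Tanzania",
])

# Same rule as above: keep the one named funder, collapse everything else
collapsed = pd.Series(
    np.where(toy_funder == "Government Of Tanzania", toy_funder, "other")
)

print(collapsed.value_counts())
```

The comparison is false for a missing value, so missing funders are also labelled "other". The column ends up with two levels.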

### installer

In [None]:
pumps_per_installer = df["installer"].value_counts()

In [None]:
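def categorize_installer(installer):
    # Number of water points attributed to this installer; falls back to 0
    # when the name is absent from the counts (value_counts() skips NaN)
    n_pumps = pumps_per_installer.get(installer, 0)
    if n_pumps > 10_000:
        return "large"
    elif n_pumps > 1:
        return "mid"
    else:
        return "small"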

In [None]:
# Apply the categorization function to the 'installer' column
df['installer_category'] = df['installer'].apply(categorize_installer)
df.drop(columns = ['installer'] , inplace = True)
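
The same bucketing can be written without a row-wise `apply` by mapping each installer to its count and selecting a label with `np.select`. This is an alternative sketch on toy data, not the approach used in this notebook; the thresholds are scaled down for the example and the names `LARGE_MIN` and `MID_MIN` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy installers: one frequent, one occasional, one seen once
toy_installer = pd.Series(["DWE"] * 5 + ["Roman"] * 2 + ["GRUMETI"])

# Count of rows per installer, broadcast back onto every row
counts = toy_installer.map(toy_installer.value_counts()).to_numpy()

# Hypothetical thresholds for the toy data (the notebook uses 10_000 and 1)
LARGE_MIN, MID_MIN = 4, 1

# Conditions are checked in order; rows matching none get the default
category = pd.Series(
    np.select([counts > LARGE_MIN, counts > MID_MIN], ["large", "mid"], default="small"),
    index=toy_installer.index,
)

print(category.value_counts())
```

In this version a missing installer maps to a `NaN` count, fails both comparisons, and falls into "small".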

---

In [None]:
columns_to_drop = [
    'region',  # region_code already provides that information
    'lga',  # need to look into it further
    'recorded_by',  # all rows have the same value
    'extraction_type_group',
    'extraction_type_class',  # for both of these, extraction_type already provides the info
    'payment_type',  # info provided by payment
    'quality_group',  # info provided by water_quality
    'source_type',  # info provided by source
    'source_class',  # info provided by source
    'waterpoint_type_group',  # info provided by waterpoint_type
]

df.drop(columns=columns_to_drop, inplace=True)
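
Several columns are dropped above because a finer-grained column already carries the same information. One way to check such a claim is to verify that every value of the finer column maps to exactly one value of the coarser one. The helper below is a hypothetical sketch on toy rows, not a check that was run in this notebook.

```python
import pandas as pd


def is_determined_by(frame, fine, coarse):
    """True when each value of `fine` maps to at most one value of `coarse`."""
    return bool((frame.groupby(fine)[coarse].nunique() <= 1).all())


# Toy rows (made up for illustration)
toy = pd.DataFrame({
    "source": ["spring", "rainwater harvesting", "spring", "river"],
    "source_class": ["groundwater", "surface", "groundwater", "surface"],
    "basin": ["Lake Nyasa", "Lake Victoria", "Lake Victoria", "Lake Nyasa"],
})

# source_class is fully determined by source, so it adds no new information
redundant = is_determined_by(toy, "source", "source_class")

# basin is not determined by source: "spring" appears in two basins
not_redundant = is_determined_by(toy, "source", "basin")

print(redundant, not_redundant)
```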


## Encoding Categorical Variables

In [None]:
# List of categorical columns to one-hot encode
categorical_columns_to_encode  = df.select_dtypes(include=['object']).columns.tolist()

# Perform one-hot encoding for the specified columns
df_encoded = pd.get_dummies(df, columns=categorical_columns_to_encode)
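
`pd.get_dummies` replaces each listed column with one indicator column per category and leaves the other columns in place. A minimal sketch on toy data (the third row is made up) shows the resulting column names.

```python
import pandas as pd

toy = pd.DataFrame({
    "quantity": ["enough", "insufficient", "enough"],
    "gps_height": [1390, 1399, 686],
})

# One indicator column per category, named <column>_<category>
toy_encoded = pd.get_dummies(toy, columns=["quantity"])

print(toy_encoded.columns.tolist())
```

The number of new columns equals the total number of categories across the encoded columns, which is why the high-cardinality features were reduced beforehand.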

## Feature Engineering

Create a new feature called *age*: the number of years between a water point's construction year and the year it was recorded.

In [None]:
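# Parse date_recorded as a datetime (a no-op if it already is one) before taking the year
df_encoded['year_recorded'] = pd.to_datetime(df_encoded['date_recorded']).dt.year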

In [None]:
# Calculate the 'age' by subtracting 'construction_year' from 'year_recorded'
df_encoded['age'] = df_encoded['year_recorded'] - df_encoded['construction_year']
df_encoded.drop(columns=['date_recorded', 'year_recorded'], inplace=True)
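
The age calculation is a plain subtraction, so any placeholder in construction_year would flow straight into the result. The sketch below repeats the calculation on toy rows and flags implausible ages. The construction years, the placeholder value 0, and the 100-year bound are all assumptions made for illustration and are not taken from this dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "date_recorded": ["2011-03-14", "2013-03-06", "2012-10-01"],
    "construction_year": [1999, 2010, 0],  # 0 is a hypothetical placeholder
})

# Same steps as above: extract the year, then subtract
toy["year_recorded"] = pd.to_datetime(toy["date_recorded"]).dt.year
toy["age"] = toy["year_recorded"] - toy["construction_year"]

# Hypothetical sanity check: negative or very large ages deserve a second look
implausible = toy[(toy["age"] < 0) | (toy["age"] > 100)]

print(toy[["age"]])
print(implausible.index.tolist())
```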

In [None]:
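# Note: this overwrites the input file read at the top of the notebook.
# index=False avoids writing the row index as an extra unnamed column.
df_encoded.to_csv('../data/processed/preprocessed_data.csv', index=False)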
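
The `Unnamed: 0` column in the `df.head(2)` output near the top of this notebook is what pandas produces when a CSV that was written with its index is read back. A minimal round trip through an in-memory buffer shows the difference that `index=False` makes.

```python
import io

import pandas as pd

toy = pd.DataFrame({"amount_tsh": [6000.0, 0.0]})

# Default: the row index is written as a first column with an empty header
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# index=False: only the data columns are written
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```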