Final Project Submission
Student Name: Benson Kamau
Student Pace: Full-Time
Instructor: Nikita Njoroge

## Title: Predicting the Functionality of Water Wells in Tanzania

#### Overview

Tanzania, a country of nearly 57 million people, has a serious problem ensuring that its people have access to dependable and safe water sources. Although there are many water stations around the nation that have been created by the government and non-governmental organizations (NGOs), a significant number of these facilities are either non-operational or in critical need of repair. The objective of guaranteeing all Tanzanians long-term access to this essential resource is put in jeopardy by this circumstance. Ensuring the upkeep and efficiency of the current water infrastructure is of utmost importance in order to protect the welfare and advancement of people throughout this country in East Africa.

#### Challenges 

1. Environmental factors: Droughts and seasonal variations can have a big impact on well water supply. Extended dry spells can lower water levels, which can lead to wells being unreliable.
2. Contamination: A number of things, such as being next to a latrine, agricultural runoff, and naturally occurring contaminants in the groundwater, can pollute water wells. Communities that depend on these wells run the danger of health problems due to contaminated water.
3. Maintenance and repair: Wells often fall into disrepair because of a lack of funding for regular upkeep and a lack of the technical skills needed to maintain and repair them.
4. Public Awareness: The necessity of keeping water wells maintained and the potential health risks of drinking contaminated water are sometimes not well understood.

#### Proposed solution

1. Invest in boreholes and deeper wells that are more resilient to drought and seasonal fluctuations. Complementing these wells with other water sources would also help the community during the dry season.
2. Establish community-based water monitoring systems with the assistance of environmental and health organizations to guarantee routine supervision and timely reaction to pollution.
3. Encourage partnerships between the government, private sector, and NGOs to pool resources and share the responsibility of maintaining water wells.
4. Educate the community on the importance of maintaining the wells and risks associated with contaminated water.

#### Conclusion

The analysis identified key factors influencing well functionality, including geographic location, construction details, environmental conditions, and maintenance practices.

#### Problem Statement

Access to safe and consistent drinking water is a major challenge in Tanzania, especially in rural regions with minimal infrastructure. To solve this issue, the Tanzanian government has made investments in the building of water wells in collaboration with a number of non-governmental organizations. However, the sustainability and functionality of these wells remain uncertain, with many of them falling into disrepair or becoming non-functional over time.

#### Objectives

Main objective 
 1. To develop a machine learning classifier that predicts the condition of water wells in Tanzania.

Specific objectives
1. Prepare and analyse data for modeling
2. Develop a predictive model
3. Provide actionable insights and recommendations based on the findings of the model

### Importing relevant libraries 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report,precision_score, recall_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

Define a class for model building

In [None]:
# Define my class
class Modelbuilder:
    def __init__(self, model):
        self.model = model

    def preprocessing(self, X_train, X_test):
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        return X_train_scaled, X_test_scaled

    def fit(self, X_train, y_train):
        self.model.fit(X_train, y_train)    

    def predict(self, X_test):
        y_pred = self.model.predict(X_test)
        return y_pred
    
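    def evaluate(self, y_test, y_pred, average="weighted"):
        # status_group has three classes, so precision and recall need an
        # explicit averaging strategy; "weighted" is an assumed default.
        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred, average=average)
        recall = recall_score(y_test, y_pred, average=average)
        # Use a distinct local name so the imported confusion_matrix function is not shadowed.
        conf_matrix = confusion_matrix(y_test, y_pred)
        return accuracy, precision, recall, conf_matrix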
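
Since the target has three classes, it may help to see how the evaluation metrics behave in a multiclass setting. The sketch below uses a handful of made-up labels (not taken from the dataset) and assumes weighted averaging, which is one reasonable choice among several (`macro` and `micro` are alternatives).

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical true and predicted labels for six wells (illustration only).
y_true = ["functional", "functional", "non functional",
          "functional needs repair", "non functional", "functional"]
y_pred = ["functional", "non functional", "non functional",
          "functional", "non functional", "functional"]

# Fix the class order so the confusion matrix rows/columns are predictable.
labels = ["functional", "functional needs repair", "non functional"]

accuracy = accuracy_score(y_true, y_pred)
# average="weighted" weights each class by its support; zero_division=0
# avoids a warning when a class is never predicted.
precision = precision_score(y_true, y_pred, average="weighted", zero_division=0)
recall = recall_score(y_true, y_pred, average="weighted", zero_division=0)
cm = confusion_matrix(y_true, y_pred, labels=labels)

print(accuracy, precision, recall)
print(cm)
```

With weighted averaging, recall equals overall accuracy, so precision and the confusion matrix are the more informative outputs when the minority class matters.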

### 1. Data Loading and Understanding

The data is sourced from Taarifa and the Tanzanian Ministry of Water. Data utilized can be found here: https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/data/

For the purposes of our evaluation, we are utilizing the Training Set Labels and Training Set Values, which include data from 59,400 pumps. 
We will then use the Test set values to test our models.

The following is a list of column names and descriptions:

* amount_tsh - Total static head (amount of water available to waterpoint)
* date_recorded - The date the row was entered
* funder - Who funded the well
* gps_height - Altitude of the well
* installer - Organization that installed the well
* longitude - GPS coordinate
* latitude - GPS coordinate
* wpt_name - Name of the waterpoint if there is one
* num_private -
* basin - Geographic water basin
* subvillage - Geographic location
* region - Geographic location
* region_code - Geographic location (coded)
* district_code - Geographic location (coded)
* lga - Geographic location
* ward - Geographic location
* population - Population around the well
* public_meeting - True/False
* recorded_by - Group entering this row of data
* scheme_management - Who operates the waterpoint
* scheme_name - Who operates the waterpoint
* permit - If the waterpoint is permitted
* construction_year - Year the waterpoint was constructed
* extraction_type - The kind of extraction the waterpoint uses
* extraction_type_group - The kind of extraction the waterpoint uses
* extraction_type_class - The kind of extraction the waterpoint uses
* management - How the waterpoint is managed
* management_group - How the waterpoint is managed
* payment - What the water costs
* payment_type - What the water costs
* water_quality - The quality of the water
* quality_group - The quality of the water
* quantity - The quantity of water
* quantity_group - The quantity of water
* source - The source of the water
* source_type - The source of the water
* source_class - The source of the water
* waterpoint_type - The kind of waterpoint
* waterpoint_type_group - The kind of waterpoint

In [17]:
#create a function that loads data and gets the info about the data.
def load_and_get_info(file_path):
    """
    Load data from a CSV file and summarize the resulting DataFrame.

    Parameters:
    - file_path (str): Path to the CSV file.

    Returns:
    - df (DataFrame): The loaded data.
    - df_info (None): DataFrame.info() prints its summary and returns None.
    - df_head (DataFrame): The first five rows of the DataFrame.
    """
    # Load data
    df = pd.read_csv(file_path)

    # Keep the first few rows of the DataFrame
    df_head = df.head()

    # Print the column summary; info() writes to stdout and returns None
    df_info = df.info()

    return df, df_info, df_head

# A function that checks the data types of DataFrame columns and return the count of columns for each data type category.
def check_data_types(df):
    """
    Check the data types of DataFrame columns and return the count of columns for each data type category.

    Parameters:
    - df (DataFrame): Input DataFrame.

    Returns:
    - data_type_counts (dict): Dictionary containing the count of columns for each data type category.
    """
    data_type_counts = df.dtypes.replace({'object': 'string'}).value_counts().to_dict()
    return data_type_counts
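
`DataFrame.info()` prints its summary and returns `None`, which is why printing its return value shows `None` after the table. If the summary is needed as a string, one option is to pass a text buffer through the `buf` parameter. The helper name `info_as_text` and the small demo frame below are illustrative assumptions, not part of the notebook.

```python
import io

import pandas as pd


def info_as_text(df):
    """Return the DataFrame.info() summary as a string instead of printing it."""
    buffer = io.StringIO()
    df.info(buf=buffer)  # info() writes into the buffer rather than stdout
    return buffer.getvalue()


# Tiny stand-in frame (hypothetical values) to show the helper in use.
demo = pd.DataFrame({"id": [1, 2, 3], "funder": ["Roman", None, "Unicef"]})
summary = info_as_text(demo)
print(summary)
```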

#### 1.1 Loading our first dataset - Training set values

In [12]:
file_path1 = 'data/training set values.csv'
df1,data_info, data_head = load_and_get_info(file_path1)
print(data_info)
print("\nFirst few rows of the DataFrame:")
data_head #data_head

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59400 entries, 0 to 59399
Data columns (total 40 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     59400 non-null  int64  
 1   amount_tsh             59400 non-null  float64
 2   date_recorded          59400 non-null  object 
 3   funder                 55763 non-null  object 
 4   gps_height             59400 non-null  int64  
 5   installer              55745 non-null  object 
 6   longitude              59400 non-null  float64
 7   latitude               59400 non-null  float64
 8   wpt_name               59398 non-null  object 
 9   num_private            59400 non-null  int64  
 10  basin                  59400 non-null  object 
 11  subvillage             59029 non-null  object 
 12  region                 59400 non-null  object 
 13  region_code            59400 non-null  int64  
 14  district_code          59400 non-null  int64  
 15  lg

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,6000.0,14/03/2011,Roman,1390,Roman,34.938093,-9.856322,none,0,...,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
1,8776,0.0,06/03/2013,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,...,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
2,34310,25.0,25/02/2013,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,...,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
3,67743,0.0,28/01/2013,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,...,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
4,19728,0.0,13/07/2011,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,...,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe


The training set values dataset contains 59,400 rows and 40 columns.

In [16]:
#check the data types of DataFrame columns in our training set values.
data_type_counts = check_data_types(df1)
print("Count of columns for each data type category:")
print(data_type_counts)

Count of columns for each data type category:
{'string': 30, dtype('int64'): 7, dtype('float64'): 3}


The columns fall into three data type categories:
1. String (object type): 30 columns, e.g., funder, installer.
2. Integer type: 7 columns, e.g., gps_height, region_code, district_code.
3. Float type: 3 columns, e.g., longitude and latitude.

#### 1.2 Loading our second dataset - Training set labels

In [33]:
file_path2 = 'data/training set labels.csv'
df2,data_info, data_head = load_and_get_info(file_path2)
print(data_info)
print("\nFirst few rows of the DataFrame:")
data_head #data_head

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59400 entries, 0 to 59399
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            59400 non-null  int64 
 1   status_group  59400 non-null  object
dtypes: int64(1), object(1)
memory usage: 928.3+ KB
None

First few rows of the DataFrame:


Unnamed: 0,id,status_group
0,69572,functional
1,8776,functional
2,34310,functional
3,67743,non functional
4,19728,functional


The dataset has 2 columns (id and status_group) and 59,400 rows.

In [20]:
#check the data types of DataFrame columns in our training set labels.
data_type_counts = check_data_types(df2)
print("Count of columns for each data type category:")
print(data_type_counts)


Count of columns for each data type category:
{dtype('int64'): 1, 'string': 1}


#### 1.3 Loading the third dataset - Test set values

In [64]:
file_path3 = 'data/Test set values.csv'
test_data,data_info, data_head = load_and_get_info(file_path3)
print(data_info)
print("\nFirst few rows of the DataFrame:")
data_head #data_head

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14850 entries, 0 to 14849
Data columns (total 40 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     14850 non-null  int64  
 1   amount_tsh             14850 non-null  float64
 2   date_recorded          14850 non-null  object 
 3   funder                 13980 non-null  object 
 4   gps_height             14850 non-null  int64  
 5   installer              13973 non-null  object 
 6   longitude              14850 non-null  float64
 7   latitude               14850 non-null  float64
 8   wpt_name               14850 non-null  object 
 9   num_private            14850 non-null  int64  
 10  basin                  14850 non-null  object 
 11  subvillage             14751 non-null  object 
 12  region                 14850 non-null  object 
 13  region_code            14850 non-null  int64  
 14  district_code          14850 non-null  int64  
 15  lg

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,50785,0.0,04/02/2013,Dmdd,1996,DMDD,35.290799,-4.059696,Dinamu Secondary School,0,...,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,other,other
1,51630,0.0,04/02/2013,Government Of Tanzania,1569,DWE,36.656709,-3.309214,Kimnyak,0,...,never pay,soft,good,insufficient,insufficient,spring,spring,groundwater,communal standpipe,communal standpipe
2,17168,0.0,01/02/2013,,1567,,34.767863,-5.004344,Puma Secondary,0,...,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,other,other
3,45559,0.0,22/01/2013,Finn Water,267,FINN WATER,38.058046,-9.418672,Kwa Mzee Pange,0,...,unknown,soft,good,dry,dry,shallow well,shallow well,groundwater,other,other
4,49871,500.0,27/03/2013,Bruder,1260,BRUDER,35.006123,-10.950412,Kwa Mzee Turuka,0,...,monthly,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe


The dataset has 14,850 rows and 40 columns.

In [25]:
#check the data types of DataFrame columns in our test set values.
data_type_counts = check_data_types(test_data)
print("Count of columns for each data type category:")
print(data_type_counts)

Count of columns for each data type category:
{'string': 30, dtype('int64'): 7, dtype('float64'): 3}


The columns fall into three data type categories:
1. String (object type): 30 columns
2. Integer type: 7 columns
3. Float type: 3 columns

#### 1.4 Merging the training set values and training set labels

In [39]:
# merging the training set values and training set labels on the 'id' column
train_data = pd.merge(df1, df2, on='id')
train_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59400 entries, 0 to 59399
Data columns (total 41 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     59400 non-null  int64  
 1   amount_tsh             59400 non-null  float64
 2   date_recorded          59400 non-null  object 
 3   funder                 55763 non-null  object 
 4   gps_height             59400 non-null  int64  
 5   installer              55745 non-null  object 
 6   longitude              59400 non-null  float64
 7   latitude               59400 non-null  float64
 8   wpt_name               59398 non-null  object 
 9   num_private            59400 non-null  int64  
 10  basin                  59400 non-null  object 
 11  subvillage             59029 non-null  object 
 12  region                 59400 non-null  object 
 13  region_code            59400 non-null  int64  
 14  district_code          59400 non-null  int64  
 15  lg
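
A merge on `id` silently duplicates rows if an id appears more than once on either side. As a hedged sketch, `pd.merge` can check this with `validate` and report unmatched rows with `indicator`. The three rows below reuse the ids, amounts, and labels shown in the previews above; the variable names are illustrative.

```python
import pandas as pd

# Three rows copied from the previews above, standing in for df1 and df2.
values = pd.DataFrame({"id": [69572, 8776, 34310],
                       "amount_tsh": [6000.0, 0.0, 25.0]})
labels = pd.DataFrame({"id": [69572, 8776, 34310],
                       "status_group": ["functional", "functional", "functional"]})

# validate="one_to_one" raises MergeError if an id repeats on either side;
# indicator=True adds a "_merge" column showing where each row came from.
merged = pd.merge(values, labels, on="id", how="left",
                  validate="one_to_one", indicator=True)

unmatched = int((merged["_merge"] != "both").sum())
merged = merged.drop(columns="_merge")
print(f"Rows without a label: {unmatched}")
```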

In [38]:
#Check the distribution of the target variable(status_group)
train_data['status_group'].value_counts()

status_group
functional                 32259
non functional             22824
functional needs repair     4317
Name: count, dtype: int64
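
The counts above are easier to interpret as percentages. The short sketch below recomputes the class shares from the printed counts; on the full DataFrame, `train_data['status_group'].value_counts(normalize=True)` would give the same proportions.

```python
import pandas as pd

# Class counts copied from the output above.
counts = pd.Series(
    {"functional": 32259, "non functional": 22824, "functional needs repair": 4317},
    name="count",
)

# Share of each class as a percentage of all 59,400 wells.
shares = (counts / counts.sum() * 100).round(1)
print(shares)
```

The "functional needs repair" class makes up only about 7% of the data, so the target is imbalanced. A stratified train/test split and per-class metrics are likely worth considering when modeling.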

### Data Cleaning

To start our data cleaning process, we will check the num_private column because no description has been provided.

In [40]:
train_data['num_private'].value_counts()

num_private
0       58643
6          81
1          73
5          46
8          46
        ...  
42          1
23          1
136         1
698         1
1402        1
Name: count, Length: 65, dtype: int64

About 98.7% of the values (58,643 of 59,400) are 0. I presume 0 means the data was not available, so I drop the column entirely.
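
The same check can be applied to any column to flag near-constant features. The helper below is a minimal sketch; `dominant_value_share` is a hypothetical name and the toy series is made up for illustration.

```python
import pandas as pd


def dominant_value_share(series):
    """Return the most frequent value and the fraction of rows holding it."""
    shares = series.value_counts(normalize=True, dropna=False)
    return shares.index[0], float(shares.iloc[0])


# Toy column (hypothetical values) where one value dominates.
toy = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 6, 1], name="num_private")
value, share = dominant_value_share(toy)
print(f"Most common value: {value} ({share:.1%} of rows)")
```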

In [42]:
#drop the column 'num_private'
train_data.drop(columns=['num_private'],inplace=True)

Next is the recorded_by column. I will assume that whoever recorded the data did so truthfully, so this column should not affect our target, and I drop it.

In [44]:
#drop the column 'recorded_by'
train_data.drop(columns=['recorded_by'], inplace=True)

Next, I explore columns with similar names and descriptions to see if there is any overlap in information.

In [45]:
#Explore columns with similar column names and descriptions to see if there is any overlap in information.
def explore_similar_columns(df, col1, col2):
    """
    Explore columns with similar names in the DataFrame and return the value counts of the specified columns.

    Parameters:
    - df (DataFrame): The DataFrame containing the columns.
    - col1 (str): The name of the first column.
    - col2 (str): The name of the second column.

    Returns:
    - col1_value_counts (Series): Value counts of the first column.
    - col2_value_counts (Series): Value counts of the second column.
    """
    # Ensure the specified columns exist in the DataFrame
    if col1 not in df.columns or col2 not in df.columns:
        raise ValueError(f"One or both of the columns '{col1}' and '{col2}' do not exist in the DataFrame")
    
    # Get value counts for both columns
    col1_value_counts = df[col1].value_counts()
    col2_value_counts = df[col2].value_counts()

    return col1_value_counts, col2_value_counts
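
Matching value counts suggest that two columns overlap, but they do not prove that the columns agree row by row. A complementary check, sketched below, tests whether each value in one column always pairs with a single value in the other. The helper name `is_relabeling` is hypothetical, and the toy rows only borrow category labels that appear in this dataset.

```python
import pandas as pd


def is_relabeling(df, col1, col2):
    """True if each value of col1 maps to exactly one value of col2 and vice versa.

    Note: groupby ignores rows where the grouping column is null.
    """
    forward = df.groupby(col1)[col2].nunique().max() == 1
    backward = df.groupby(col2)[col1].nunique().max() == 1
    return bool(forward and backward)


# Toy rows (illustrative pairing) using labels seen in the dataset.
toy = pd.DataFrame({
    "payment": ["never pay", "pay per bucket", "pay monthly", "never pay"],
    "payment_type": ["never pay", "per bucket", "monthly", "never pay"],
    "waterpoint_type": ["communal standpipe", "communal standpipe multiple",
                        "hand pump", "hand pump"],
    "waterpoint_type_group": ["communal standpipe", "communal standpipe",
                              "hand pump", "hand pump"],
})

same_info = is_relabeling(toy, "payment", "payment_type")
grouped = is_relabeling(toy, "waterpoint_type", "waterpoint_type_group")
print(same_info, grouped)
```

A result of `True` means one column is a pure relabeling of the other and can be dropped without losing information. `False` means at least one column groups several values of the other.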

1. Payment and Payment_type

In [49]:
col1_value_counts, col2_value_counts = explore_similar_columns(train_data, 'payment', 'payment_type')
print(col1_value_counts)

print(col2_value_counts)

payment
never pay                25348
pay per bucket            8985
pay monthly               8300
unknown                   8157
pay when scheme fails     3914
pay annually              3642
other                     1054
Name: count, dtype: int64
payment_type
never pay     25348
per bucket     8985
monthly        8300
unknown        8157
on failure     3914
annually       3642
other          1054
Name: count, dtype: int64


The two columns contain the same information with slightly different labels, so I will drop payment_type.

In [50]:
train_data.drop(columns=['payment_type'], inplace=True)

2. Quantity and quantity group

In [52]:
col1_value_counts, col2_value_counts = explore_similar_columns(train_data, 'quantity', 'quantity_group')
print(col1_value_counts)

print(col2_value_counts)

quantity
enough          33186
insufficient    15129
dry              6246
seasonal         4050
unknown           789
Name: count, dtype: int64
quantity_group
enough          33186
insufficient    15129
dry              6246
seasonal         4050
unknown           789
Name: count, dtype: int64


The quantity and quantity_group columns have identical value counts, so I will drop quantity_group.

In [53]:
train_data.drop(columns=['quantity_group'], inplace=True)

3. Waterpoint_type and waterpoint_type_group

In [54]:
col1_value_counts, col2_value_counts = explore_similar_columns(train_data, 'waterpoint_type', 'waterpoint_type_group')
print(col1_value_counts)

print(col2_value_counts)

waterpoint_type
communal standpipe             28522
hand pump                      17488
other                           6380
communal standpipe multiple     6103
improved spring                  784
cattle trough                    116
dam                                7
Name: count, dtype: int64
waterpoint_type_group
communal standpipe    34625
hand pump             17488
other                  6380
improved spring         784
cattle trough           116
dam                       7
Name: count, dtype: int64


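The two columns share a description, but waterpoint_type distinguishes "communal standpipe" (28,522) from "communal standpipe multiple" (6,103), while waterpoint_type_group merges them into a single "communal standpipe" category (34,625). Because the detailed column carries extra information, I will keep both columns.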

4. Source, source_type and source_class

In [55]:
col1_value_counts, col2_value_counts = explore_similar_columns(train_data, 'source', 'source_type')
print(col1_value_counts)

print(col2_value_counts)

source
spring                  17021
shallow well            16824
machine dbh             11075
river                    9612
rainwater harvesting     2295
hand dtw                  874
lake                      765
dam                       656
other                     212
unknown                    66
Name: count, dtype: int64
source_type
spring                  17021
shallow well            16824
borehole                11949
river/lake              10377
rainwater harvesting     2295
dam                       656
other                     278
Name: count, dtype: int64


In [57]:
col1_value_counts, col2_value_counts = explore_similar_columns(train_data, 'source', 'source_class')
print(col1_value_counts)

print(col2_value_counts)

source
spring                  17021
shallow well            16824
machine dbh             11075
river                    9612
rainwater harvesting     2295
hand dtw                  874
lake                      765
dam                       656
other                     212
unknown                    66
Name: count, dtype: int64
source_class
groundwater    45794
surface        13328
unknown          278
Name: count, dtype: int64


I drop the source_type and source_class columns because the source column carries the same information at a finer level of detail.

In [58]:
train_data.drop(columns=['source_type', 'source_class'], inplace=True)

5. Extraction_type, extraction_type_group and extraction_type_class

In [59]:
col1_value_counts, col2_value_counts = explore_similar_columns(train_data, 'extraction_type', 'extraction_type_group')
print(col1_value_counts)

print(col2_value_counts)

extraction_type
gravity                      26780
nira/tanira                   8154
other                         6430
submersible                   4764
swn 80                        3670
mono                          2865
india mark ii                 2400
afridev                       1770
ksb                           1415
other - rope pump              451
other - swn 81                 229
windmill                       117
india mark iii                  98
cemo                            90
other - play pump               85
walimi                          48
climax                          32
other - mkulima/shinyanga        2
Name: count, dtype: int64
extraction_type_group
gravity            26780
nira/tanira         8154
other               6430
submersible         6179
swn 80              3670
mono                2865
india mark ii       2400
afridev             1770
rope pump            451
other handpump       364
other motorpump      122
wind-powered         117
india 

In [61]:
col1_value_counts, col2_value_counts = explore_similar_columns(train_data, 'extraction_type', 'extraction_type_class')
print(col1_value_counts)

print(col2_value_counts)

extraction_type
gravity                      26780
nira/tanira                   8154
other                         6430
submersible                   4764
swn 80                        3670
mono                          2865
india mark ii                 2400
afridev                       1770
ksb                           1415
other - rope pump              451
other - swn 81                 229
windmill                       117
india mark iii                  98
cemo                            90
other - play pump               85
walimi                          48
climax                          32
other - mkulima/shinyanga        2
Name: count, dtype: int64
extraction_type_class
gravity         26780
handpump        16456
other            6430
submersible      6179
motorpump        2987
rope pump         451
wind-powered      117
Name: count, dtype: int64


I will keep the extraction_type and extraction_type_class columns and drop the extraction_type_group column.

In [63]:
train_data.drop(columns=['extraction_type_group'], inplace=True)

Next, we explore null values and decide how to clean nulls.

In [106]:
def check_nulls_and_duplicates(df):
    # Calculate the number of null values in each column
    null_counts = df.isnull().sum()

    # Calculate the total number of rows
    total_rows = len(df)

    # Calculate the percentage of null values in each column
    null_percentage = (null_counts / total_rows) * 100 
    
    # Display message about columns with null values
    columns_with_null = null_counts[null_counts > 0]
    if not columns_with_null.empty:
        print("Columns with null values and their count/percentage:")
        for column, count in columns_with_null.items():
            percentage = null_percentage[column]
            print(f"{column}: {count} ({percentage:.2f}%)")
    else:
        print("No null values")

    # Calculate the number of duplicate rows
    num_duplicates = df.duplicated().sum()

    # Display the number of duplicate rows
    print(f"Number of duplicate rows: {num_duplicates}")

    return num_duplicates

In [88]:
check_nulls_and_duplicates(train_data)

Columns with null values and their count/percentage:
funder: 3637 (6.12%)
installer: 3655 (6.15%)
wpt_name: 2 (0.00%)
subvillage: 371 (0.62%)
public_meeting: 3334 (5.61%)
scheme_management: 3878 (6.53%)
scheme_name: 28810 (48.50%)
permit: 3056 (5.14%)
Number of duplicate rows: 0


0

From the above output, I will explore scheme_management and scheme_name further by checking the value counts of each.

In [92]:
col1_value_counts, col2_value_counts = explore_similar_columns(train_data, 'scheme_name', 'scheme_management')
print(col1_value_counts)

print(col2_value_counts)

scheme_name
K                       682
Borehole                546
Chalinze wate           405
M                       400
DANIDA                  379
                       ... 
Mradi wa maji Vijini      1
Villagers                 1
Magundi water supply      1
Saadani Chumv             1
Mtawanya                  1
Name: count, Length: 2695, dtype: int64
scheme_management
VWC                 36793
WUG                  5206
Water authority      3153
WUA                  2883
Water Board          2748
Parastatal           1680
Private operator     1063
Company              1061
Other                 766
SWC                    97
Trust                  72
Name: count, dtype: int64


I will drop scheme_name: it is missing in 48.5% of rows and has 2,695 distinct values, whereas scheme_management captures similar information more cleanly with far fewer nulls.

In [94]:
train_data.drop(columns=['scheme_name'], inplace=True)

I will replace the null values with "Unknown" in the columns where roughly 5% to 6% of values are missing.

In [97]:
columns_to_fill = ['funder', 'installer', 'public_meeting', 'scheme_management']

for column in columns_to_fill:
    train_data[column] = train_data[column].fillna(value='Unknown')
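
One side effect worth noting: public_meeting holds True/False values, so filling its nulls with a string leaves the column with mixed boolean and string entries. The sketch below, using a made-up series, shows the effect and one possible way to make the column uniformly textual before encoding. Whether to do this is a modeling choice, not something the notebook prescribes.

```python
import pandas as pd

# Toy stand-in for the public_meeting column (hypothetical values).
public_meeting = pd.Series([True, False, None, True], name="public_meeting")

filled = public_meeting.fillna("Unknown")
# Some entries are now strings and others are booleans.
has_mixed_types = filled.map(lambda v: isinstance(v, str)).nunique() == 2

# Casting to str gives a single, consistent category type.
as_text = filled.astype(str)
print(has_mixed_types)
print(as_text.tolist())
```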

I will also drop the rows with null values in the wpt_name and subvillage columns.

In [98]:

train_data.dropna(subset=['wpt_name', 'subvillage'], inplace=True)

Finally, I will check the values in the permit column.

In [102]:
train_data['permit'].value_counts()

permit
True     38791
False    17180
Name: count, dtype: int64

I will replace the null values with True, which is the most common value.

In [104]:
train_data['permit'] = train_data['permit'].fillna(value=True)
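
Rather than hard-coding `True`, the most frequent value can be looked up with `Series.mode()`, which keeps the step correct if the data changes. This is a minimal sketch on a made-up series; the final cast to `bool` is an assumption about the desired column type.

```python
import pandas as pd

# Toy stand-in for the permit column (hypothetical values).
permit = pd.Series([True, False, True, None, True], name="permit")

# mode() ignores nulls by default and returns the most frequent value(s).
most_common = permit.mode(dropna=True).iloc[0]

# Fill the gaps, then cast so the column is a clean boolean type.
filled = permit.fillna(most_common).astype(bool)
print(most_common, filled.tolist())
```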

In [107]:
check_nulls_and_duplicates(train_data)

No null values
Number of duplicate rows: 0


0