<font color="green">*To start working on this notebook, or any other notebook that we will use in the Moringa Data Science Course, we will need to save our own copy of it. We can do this by clicking File > Save a Copy in Drive. We will then be able to make edits to our own copy of this notebook.*</font>

# Python Data Cleaning and Analysis - Population by Type of Disability

## 1.0 Importing our Libraries

In [1]:
# Importing the libraries we will need 

# Importing the pandas library
# 
import pandas as pd

# Importing the numpy library
#
import numpy as np

## 1.1 Reading the Dataset from our CSV file





The dataset we will use contains the percentage distribution of population by type of disability for Kenyan counties. 


*   Dataset link for download or access: [Link](https://drive.google.com/a/moringaschool.com/file/d/13twDwhbJqBr1Dvmlwzv6Mou-irul77cS/view?usp=sharing)





In [None]:
# Let's read the data from the CSV file and create the dataframe to be used
# 
df = pd.read_csv('Percentage_Distribution_of_Population_by_type_of_Disability_County_Estimates2005_6.csv')

## 1.2 Previewing our Dataset


In [None]:
# Let's preview the first 5 rows of our data (the default for head())
# 
df.head()

Unnamed: 0,County,Missing_Hand,Missing_Foot,Lame,Blind,Deaf,Dumb,Mental,Paralyzed,Other,Total_Count,Location_1,OBJECTID
0,Baringo,0.11,0.03,0.31,0.03,0.01,0.04,0.12,0.54,0.11,6512.1,"(0.512912, 35.952537)",0
1,Bomet,0.0,0.0,0.29,0.1,0.08,0.08,0.14,0.0,0.39,6538.0,"(-0.690131, 35.278005)",1
2,Bungoma,0.12,0.0,0.49,0.0,0.02,0.21,0.13,0.0,0.22,13170.6,"(0.737046, 34.672536)",2
3,Busia,0.0,0.0,0.14,0.12,0.05,0.36,0.05,0.15,0.31,6655.5,"(0.428414, 34.210571)",3
4,Elgeyo Marakwet,0.0,0.0,0.32,0.0,0.36,0.0,0.2,0.13,0.07,3599.9,"(0.806011, 35.564093)",4


## 1.3 Accessing Information about our Dataset

In [None]:
# Getting to know more about the dataset by accessing its information
# 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   County        48 non-null     object 
 1   Missing_Hand  48 non-null     float64
 2   Missing_Foot  48 non-null     float64
 3   Lame          48 non-null     float64
 4   Blind         48 non-null     float64
 5   Deaf          48 non-null     float64
 6   Dumb          48 non-null     float64
 7   Mental        48 non-null     float64
 8   Paralyzed     48 non-null     float64
 9   Other         48 non-null     float64
 10  Total_Count   48 non-null     float64
 11  Location_1    47 non-null     object 
 12  OBJECTID      48 non-null     int64  
dtypes: float64(10), int64(1), object(2)
memory usage: 5.0+ KB


## 1.4 Cleaning our Dataset

Let us perform the data cleaning procedures below, documenting each action and the reason for it. We will apply as many procedures as we think suitable for each dimension of data quality.

### 1.41) Validity: <font color="green">Challenges</font>

In [None]:
# Procedure 1: Irrelevant Data Observation
# Data Cleaning Action: Dropping Location_1 attribute
# Explanation: We won't need it during Analysis. No question to be answered requires that column.
#
df1 = df.drop(columns=['Location_1'])
df1.head()

Unnamed: 0,County,Missing_Hand,Missing_Foot,Lame,Blind,Deaf,Dumb,Mental,Paralyzed,Other,Total_Count,OBJECTID
0,Baringo,0.11,0.03,0.31,0.03,0.01,0.04,0.12,0.54,0.11,6512.1,0
1,Bomet,0.0,0.0,0.29,0.1,0.08,0.08,0.14,0.0,0.39,6538.0,1
2,Bungoma,0.12,0.0,0.49,0.0,0.02,0.21,0.13,0.0,0.22,13170.6,2
3,Busia,0.0,0.0,0.14,0.12,0.05,0.36,0.05,0.15,0.31,6655.5,3
4,Elgeyo Marakwet,0.0,0.0,0.32,0.0,0.36,0.0,0.2,0.13,0.07,3599.9,4


In [None]:
# Procedure 2: Irrelevant Data Observation
# Data Cleaning Action: Drop OBJECTID attribute
# Explanation: It is irrelevant to our analysis, i.e. the attribute is just a running row identifier and carries no meaning.
df2 = df1.drop(columns=['OBJECTID'])
df2.head()

Unnamed: 0,County,Missing_Hand,Missing_Foot,Lame,Blind,Deaf,Dumb,Mental,Paralyzed,Other,Total_Count
0,Baringo,0.11,0.03,0.31,0.03,0.01,0.04,0.12,0.54,0.11,6512.1
1,Bomet,0.0,0.0,0.29,0.1,0.08,0.08,0.14,0.0,0.39,6538.0
2,Bungoma,0.12,0.0,0.49,0.0,0.02,0.21,0.13,0.0,0.22,13170.6
3,Busia,0.0,0.0,0.14,0.12,0.05,0.36,0.05,0.15,0.31,6655.5
4,Elgeyo Marakwet,0.0,0.0,0.32,0.0,0.36,0.0,0.2,0.13,0.07,3599.9


### 1.42) Accuracy <font color="green">Challenges</font>

In [None]:
# Procedure 1: Accuracy
# Data Cleaning Action:  No action taken
# Explanation: There were no conflicting or formula-generated columns

### 1.43) Completeness <font color="green">Challenges</font>

In [None]:
# Procedure 1: Checking Completeness
# Data Cleaning Action: Using count(), get the number of non-missing entries per column
# Explanation: To ensure all data points are complete
#
df2.count()

County          48
Missing_Hand    48
Missing_Foot    48
Lame            48
Blind           48
Deaf            48
Dumb            48
Mental          48
Paralyzed       48
Other           48
Total_Count     48
dtype: int64
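
The difference between counting missing and non-missing entries can be sketched on a small made-up frame (the `toy` data below is an assumption for illustration, not the course dataset). `count()` gives non-missing entries per column, `isnull().sum()` gives missing entries per column, and `isnull().count()` simply returns the number of rows.

```python
import numpy as np
import pandas as pd

# Made-up toy data with one missing value in each column
toy = pd.DataFrame({
    "County": ["A", "B", None],
    "Deaf": [0.1, np.nan, 0.3],
})

non_missing = toy.count()          # non-null entries per column
missing = toy.isnull().sum()       # null entries per column
rows_total = toy.isnull().count()  # row count: the boolean mask itself has no nulls

print(non_missing, missing, rows_total, sep="\n\n")
```

A column is complete when its `missing` value is 0, or equivalently when its `non_missing` value equals the number of rows.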

### 1.44) Consistency: <font color="green">Challenges</font>

In [None]:
# Procedure 1: Removing Unnecessary Row
# Data Cleaning Action: Drop the "Kenya Average" row (index 47)
# Explanation: This row is a national summary rather than a county, so keeping it would distort the county-level analysis
df3 = df2.drop(index=47)
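
As an alternative to a hard-coded index, the summary row could be filtered out by its label. This is a minimal sketch on toy data: it assumes the row is labelled exactly `Kenya Average` in the `County` column, and the average value shown is made up.

```python
import pandas as pd

# Two county rows from the preview plus a made-up summary row
toy = pd.DataFrame({
    "County": ["Baringo", "Bomet", "Kenya Average"],
    "Total_Count": [6512.1, 6538.0, 6525.05],
})

# Keep every row whose County is not the summary label
counties_only = toy[toy["County"] != "Kenya Average"].reset_index(drop=True)
print(counties_only)
```

Filtering by label keeps working even if the row order of the source file changes.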

### 1.45) Uniformity: <font color="green">Challenges</font>

In [None]:
# Procedure 1: Checking data types
# Data Cleaning Action: None
# Explanation: Verify if the data types will allow analysis
#
df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 47 entries, 0 to 46
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   County        47 non-null     object 
 1   Missing_Hand  47 non-null     float64
 2   Missing_Foot  47 non-null     float64
 3   Lame          47 non-null     float64
 4   Blind         47 non-null     float64
 5   Deaf          47 non-null     float64
 6   Dumb          47 non-null     float64
 7   Mental        47 non-null     float64
 8   Paralyzed     47 non-null     float64
 9   Other         47 non-null     float64
 10  Total_Count   47 non-null     float64
dtypes: float64(10), object(1)
memory usage: 4.4+ KB


### Exporting the Cleaned Dataset

In [None]:
# Let's export our dataframe to a CSV file, following the pattern
#   dataframe.to_csv('example.csv')
# where dataframe is the dataframe we would like to export and
# to_csv creates a CSV file named example.csv.
#
df3.to_csv('population_type.csv')
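
By default `to_csv` also writes the row index as an unnamed first column. The sketch below, on toy data and using an in-memory buffer instead of a file, shows the effect of passing `index=False`; whether to keep the index is a choice that depends on how the file will be used.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "County": ["Baringo", "Bomet"],
    "Total_Count": [6512.1, 6538.0],
})

with_index = toy.to_csv()                # first column holds the row index
without_index = toy.to_csv(index=False)  # only the data columns are written

# Reading the first version back shows the index as an extra column
reloaded = pd.read_csv(io.StringIO(with_index))
print(list(reloaded.columns))
print(without_index.splitlines()[0])
```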

## 1.5 Answering Questions

Let's answer the following questions with our dataset using python.

In [None]:
# Challenge 1
# Which county had the highest no. of registered deaf persons?
df3['def_pop'] = df3['Deaf']*df3['Total_Count']
df4 = df3[['County','def_pop']].sort_values(by='def_pop', ascending=False).head(1)
df4

Unnamed: 0,County,def_pop
44,Vihiga,3053.778
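
The same kind of question can also be answered without sorting, using `idxmax` to locate the row with the largest value. This is a sketch on the first three preview rows only, so its answer applies to that subset and not to the full dataset.

```python
import pandas as pd

# First three rows of the preview (Deaf share and Total_Count)
toy = pd.DataFrame({
    "County": ["Baringo", "Bomet", "Bungoma"],
    "Deaf": [0.01, 0.08, 0.02],
    "Total_Count": [6512.1, 6538.0, 13170.6],
})

# Estimated number of deaf persons per county
deaf_pop = toy["Deaf"] * toy["Total_Count"]

# idxmax returns the index label of the largest value
top_label = deaf_pop.idxmax()
top_county = toy.loc[top_label, "County"]
print(top_county)
```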


In [None]:
# Challenge 2
# Which county had the highest no. of registered persons with a missing hand?
#
df3['mhand_pop'] = df3['Missing_Hand']*df3['Total_Count']
df4 = df3[['County','mhand_pop']].sort_values(by='mhand_pop', ascending=False).head(1)
df4

Unnamed: 0,County,mhand_pop
16,Kisumu,1602.104


In [None]:
# Challenge 3
# Which county had the highest no. of registered persons with a missing foot?
# 
df3['mfoot_pop'] = df3['Missing_Foot']*df3['Total_Count']
df4 = df3[['County','mfoot_pop']].sort_values(by='mfoot_pop', ascending=False).head(1)
df4

Unnamed: 0,County,mfoot_pop
29,Nairobi,1941.706


In [None]:
# Challenge 4
# Which county had the highest no. of registered lame persons?
# 
df3['lame_pop'] = df3['Lame']*df3['Total_Count']
df4 = df3[['County','lame_pop']].sort_values(by='lame_pop', ascending=False).head(1)
df4

Unnamed: 0,County,lame_pop
30,Nakuru,11878.9


In [None]:
# Challenge 5
# Which county had the lowest no. of registered blind persons?
# 
df3['blind_pop'] = df3['Blind']*df3['Total_Count']
df4 = df3[['County','blind_pop']].sort_values(by='blind_pop', ascending=False)
df5 = df4[df4['blind_pop'] !=0].tail(1)
df5

Unnamed: 0,County,blind_pop
11,Kericho,152.79


In [None]:
# Challenge 6
# Which county had the third-highest no. of registered deaf persons?
# 
df3['def_pop'] = df3['Deaf']*df3['Total_Count']
df4 = df3[['County','def_pop']].sort_values(by='def_pop', ascending=False).head(3)
# The last row of the top 3 is the third-highest county
df4['County'].iloc[2]

'Nyamira'
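
Picking the n-th highest row by position avoids hard-coding an index label. A minimal sketch using `nlargest` on the five preview rows (so the result describes that subset only):

```python
import pandas as pd

# The five preview rows (Deaf share and Total_Count)
toy = pd.DataFrame({
    "County": ["Baringo", "Bomet", "Bungoma", "Busia", "Elgeyo Marakwet"],
    "Deaf": [0.01, 0.08, 0.02, 0.05, 0.36],
    "Total_Count": [6512.1, 6538.0, 13170.6, 6655.5, 3599.9],
})
toy["deaf_pop"] = toy["Deaf"] * toy["Total_Count"]

# nlargest returns the top rows already sorted in descending order
top3 = toy.nlargest(3, "deaf_pop")

# The last of the top three is the third-highest
third_county = top3["County"].iloc[-1]
print(third_county)
```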

In [None]:
# Challenge 7
# In descending order, which top 5 counties had the highest no. of registered dumb persons?
# 
df3['dumb_pop'] = df3['Dumb']*df3['Total_Count']
df4 = df3[['County','dumb_pop']].sort_values(by='dumb_pop', ascending=False)
df5 = df4[df4['dumb_pop'] !=0].head(5)
df5

Unnamed: 0,County,dumb_pop
28,Murang'a,3489.961
2,Bungoma,2765.826
33,Nyamira,2600.015
3,Busia,2395.98
21,Machakos,2346.04


In [None]:
# Challenge 8
# In ascending order, which top 5 counties had the highest no. of registered persons with a mental disability? 
# 
df3['mental_pop'] = df3['Mental']*df3['Total_Count']
df4 = df3[['County','mental_pop']].sort_values(by='mental_pop', ascending=True).tail(5)
df4

Unnamed: 0,County,mental_pop
18,Kwale,4336.53
21,Machakos,4457.476
29,Nairobi,4480.86
12,Kiambu,7170.929
30,Nakuru,7343.32


In [None]:
# Challenge 9
# Which counties had neither registered blind persons nor registered deaf persons?
# 
df5 = df3[(df3['Deaf'] == 0) & (df3['Blind'] == 0)]
df5.County

5             Embu
6          Garissa
14       Kirinyaga
16          Kisumu
19        Laikipia
20            Lamu
21        Machakos
25            Meru
30          Nakuru
38    Taita Taveta
43     Uasin Gishu
Name: County, dtype: object
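
Combining two conditions into a single boolean mask with `&` keeps the mask aligned with the frame being filtered. A small sketch follows; the first three rows come from the preview, while the `Example` row is made up so that one row satisfies both conditions.

```python
import pandas as pd

toy = pd.DataFrame({
    "County": ["Baringo", "Bungoma", "Elgeyo Marakwet", "Example"],
    "Blind": [0.03, 0.0, 0.0, 0.0],
    "Deaf": [0.01, 0.02, 0.36, 0.0],  # last row is hypothetical
})

# Each comparison needs its own parentheses because & binds tighter than ==
mask = (toy["Blind"] == 0) & (toy["Deaf"] == 0)
neither = toy.loc[mask, "County"]
print(neither.tolist())
```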

In [None]:
# Challenge 10
# Which disability was the most registered across all the counties?
# We count, for each disability, the number of counties with a non-zero value.
# Note: the 'Deaf' column is not included in this comparison.
disabilities = ['Missing_Hand', 'Missing_Foot', 'Lame', 'Blind',
                'Dumb', 'Mental', 'Paralyzed', 'Other']
for col in disabilities:
    print("The", col, "was registered in", np.count_nonzero(df3[col]), "Counties")

The Missing_Hand was registered in 13 Counties
The Missing_Foot was registered in 17 Counties
The Lame was registered in 40 Counties
The Blind was registered in 29 Counties
The Dumb was registered in 33 Counties
The Mental was registered in 40 Counties
The Paralyzed was registered in 35 Counties
The Other was registered in 39 Counties
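
The per-column counts can also be computed in one vectorised step and then queried for the smallest value. This is a sketch on the five preview rows and a subset of the columns, so the counts differ from the full-dataset output above.

```python
import pandas as pd

# The five preview rows, restricted to five disability columns
toy = pd.DataFrame({
    "Missing_Hand": [0.11, 0.0, 0.12, 0.0, 0.0],
    "Missing_Foot": [0.03, 0.0, 0.0, 0.0, 0.0],
    "Lame": [0.31, 0.29, 0.49, 0.14, 0.32],
    "Blind": [0.03, 0.1, 0.0, 0.12, 0.0],
    "Deaf": [0.01, 0.08, 0.02, 0.05, 0.36],
})

# True where a county has a non-zero share; summing counts the Trues per column
counties_with_cases = (toy != 0).sum()

# Column with the fewest counties reporting a non-zero share
least = counties_with_cases.idxmin()
print(counties_with_cases)
print(least)
```

When several columns tie for the highest count, as `Lame` and `Mental` do in the full output above, ties should be reported explicitly and not reduced to a single answer.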


In [None]:
# Challenge 11
# Which disability was the least registered across all the counties?
#
print("The Missing_Hand was registered in",np.count_nonzero(df3['Missing_Hand'], axis=0),"Counties")

The Missing_Hand was registered in 13 Counties


In [None]:
# Challenge 12
# What was the average no. of registered persons with a disability?
df3.Total_Count.mean()

8565.491489361699

In [None]:
# Challenge 13
# Which three counties had least registered persons with disabilities?
#
df4 = df3[['County','Total_Count']].sort_values(by='Total_Count', ascending=False).tail(3)
df4

Unnamed: 0,County,Total_Count
42,Turkana,1733.1
20,Lamu,524.9
40,Tharaka Nithi,420.1


In [None]:
# Challenge 14
# What was the total no of registered persons with a disability across all counties?
#
df3.Total_Count.sum()

402578.1

In [None]:
# Challenge 15 
# Which top 3 counties had the highest no. of registered persons with a disability?
# 
df4 = df3[['County','Total_Count']].sort_values(by='Total_Count', ascending=False).head(3)
df4

Unnamed: 0,County,Total_Count
10,Kakamega,27468.9
7,Homa Bay,23696.8
44,Vihiga,23490.6
