# Philippine Poverty Statistics - EDA

## Data Preparation

### Reading data

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Read and inspect data
df = pd.read_csv('povstat_processed.csv')
df

,Unnamed: 0,Variable,Year,province,value,adm_level,region,unit
0,0,Annual Per Capita Poverty Threshold (in Pesos),1991,1st District,,District,NCR,in Pesos
1,1,Annual Per Capita Poverty Threshold (in Pesos),2006,1st District,15699,District,NCR,in Pesos
2,2,Annual Per Capita Poverty Threshold (in Pesos),2009,1st District,19227,District,NCR,in Pesos
3,3,Annual Per Capita Poverty Threshold (in Pesos),2012,1st District,20344,District,NCR,in Pesos
4,4,Annual Per Capita Poverty Threshold (in Pesos),2015,1st District,25007,District,NCR,in Pesos
...,...,...,...,...,...,...,...,...
3600,3600,Magnitude of Subsistence Poor Population,1991,Zamboanga del Sur,,Province,Region IX,population
3601,3601,Magnitude of Subsistence Poor Population,2006,Zamboanga del Sur,268576,Province,Region IX,population
3602,3602,Magnitude of Subsistence Poor Population,2009,Zamboanga del Sur,261992,Province,Region IX,population
3603,3603,Magnitude of Subsistence Poor Population,2012,Zamboanga del Sur,209765,Province,Region IX,population


In [3]:
# TODO: Move this, only describes numerical columns
df.describe()

,Unnamed: 0,Year
count,3605.0,3605.0
mean,1802.0,2006.6
std,1040.818188,8.358192
min,0.0,1991.0
25%,901.0,2006.0
50%,1802.0,2009.0
75%,2703.0,2012.0
max,3604.0,2015.0
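
As the TODO notes, `describe()` summarizes only numeric columns by default. A minimal sketch of summarizing the text columns instead, using the `exclude` parameter; the `sample` frame and its rows are hypothetical stand-ins for the real data:

```python
import pandas as pd

# Hypothetical miniature of the dataset; the real frame comes from povstat_processed.csv
sample = pd.DataFrame({
    "Variable": [
        "Magnitude of Poor Families",
        "Magnitude of Poor Families",
        "Poverty Incidence among Families (%)",
    ],
    "Year": [2006, 2009, 2006],
    "region": ["NCR", "NCR", "CAR"],
})

# Excluding numeric columns leaves only the text columns, which are
# summarized by count, number of unique values, most frequent value (top)
# and that value's frequency (freq)
text_summary = sample.describe(exclude="number")
print(text_summary)
```

Passing `include="all"` instead would summarize every column in one table, with NaN wherever a statistic does not apply.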


In [4]:
# Drop the first column, a leftover index from the CSV; pandas already provides its own index
df.drop(columns=df.columns[0], inplace=True)
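
An alternative to dropping the column afterwards is to tell `read_csv` to use it as the index. A small sketch on hypothetical CSV text (the `csv_text` string is invented for illustration and only mimics a file that was saved together with its index):

```python
import io
import pandas as pd

# Hypothetical CSV text; the empty first header is what a saved index looks like
csv_text = (
    ",Variable,Year\n"
    "0,Magnitude of Poor Families,2006\n"
    "1,Magnitude of Poor Families,2009\n"
)

# Default read: the saved index comes back as a column named "Unnamed: 0"
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that first column as the index instead
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(with_extra.columns.tolist())
print(without_extra.columns.tolist())
```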

In [5]:
# Rename columns to consistent capitalized names
# ('Variable' and 'Year' already follow this style and need no entry)
df.rename(columns={
    'province': 'Province',
    'value': 'Value',
    'adm_level': 'Admin_Lvl',
    'region': 'Region',
    'unit': 'Unit'
}, inplace=True)

In [16]:
df

,Variable,Year,Province,Value,Admin_Lvl,Region,Unit
0,Annual Per Capita Poverty Threshold (in Pesos),1991,1st District,,District,NCR,in Pesos
1,Annual Per Capita Poverty Threshold (in Pesos),2006,1st District,15699,District,NCR,in Pesos
2,Annual Per Capita Poverty Threshold (in Pesos),2009,1st District,19227,District,NCR,in Pesos
3,Annual Per Capita Poverty Threshold (in Pesos),2012,1st District,20344,District,NCR,in Pesos
4,Annual Per Capita Poverty Threshold (in Pesos),2015,1st District,25007,District,NCR,in Pesos
...,...,...,...,...,...,...,...
3600,Magnitude of Subsistence Poor Population,1991,Zamboanga del Sur,,Province,Region IX,population
3601,Magnitude of Subsistence Poor Population,2006,Zamboanga del Sur,268576,Province,Region IX,population
3602,Magnitude of Subsistence Poor Population,2009,Zamboanga del Sur,261992,Province,Region IX,population
3603,Magnitude of Subsistence Poor Population,2012,Zamboanga del Sur,209765,Province,Region IX,population


### Inspect unique values

In [6]:
# Print counts of unique values per column
print("Unique value counts per column:")
df.nunique()

Unique value counts per column:


Variable        7
Year            5
Province      103
Value        2168
Admin_Lvl       4
Region         18
Unit            4
dtype: int64

In [7]:
# List unique values per column
unique_vars = df['Variable'].unique().tolist()
unique_years = df['Year'].unique().tolist()
unique_admins = df['Admin_Lvl'].unique().tolist()
unique_regions = df['Region'].unique().tolist()
unique_units = df['Unit'].unique().tolist()

In [8]:
print("Unique values in Variable:")
print("\n".join(map(str, unique_vars)))

Unique values in Variable:
Annual Per Capita Poverty Threshold (in Pesos)
Poverty Incidence among Families (%)
Magnitude of Poor Families
Poverty Incidence among Population (%)
Magnitude of Poor Population
Subsistence Incidence among Population (%)
Magnitude of Subsistence Poor Population


In [9]:
print("Unique values in Year:")
print("\n".join(map(str, unique_years)))

Unique values in Year:
1991
2006
2009
2012
2015


In [10]:
print("Unique values in Admin_Lvl:")
print("\n".join(map(str, unique_admins)))

Unique values in Admin_Lvl:
District
Region
Province
nan
Country


In [11]:
print("Unique values in Region:")
print("\n".join(map(str, unique_regions)))

Unique values in Region:
NCR
ARMM
CAR
CARAGA
Region VI
Region V
nan
Region III
Region VI-A
Region VIII
Region VII
Region X
Region II
Region XI
Region I
Region IV-B
Region XII
Philippines
Region IX


In [12]:
print("Unique values in Unit:")
print("\n".join(map(str, unique_units)))

Unique values in Unit:
in Pesos
%
families
population
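
The cells above repeat the same print pattern for each column. A possible way to condense this is a single loop over a dictionary of unique values; the `sample` frame below is a hypothetical stand-in for `df`:

```python
import pandas as pd

# Hypothetical miniature frame using column names from this notebook
sample = pd.DataFrame({
    "Year": [1991, 2006, 2006],
    "Admin_Lvl": ["District", "Province", None],
    "Unit": ["in Pesos", "%", "%"],
})

# Collect the unique values of every column in one dictionary
unique_values = {col: sample[col].unique().tolist() for col in sample.columns}

# Print them in the same layout as the individual cells
for col, values in unique_values.items():
    print(f"Unique values in {col}:")
    print("\n".join(map(str, values)))
    print()
```

For the real data, the loop would likely be restricted to the low-cardinality columns, since `Value` and `Province` have many distinct entries.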


### Inspect missing values

To cast the `Value` column to a numeric data type, I first inspect which entries would cause trouble during type conversion.

In [14]:
# List the Value entries that cannot be parsed as numbers as-is
# (errors='coerce' turns every unparseable entry into NaN)
df[pd.to_numeric(df['Value'], errors='coerce').isnull()]['Value'].unique()

array([nan, '  15,699 ', '  19,227 ', ..., '  261,992 ', '  209,765 ',
       '  143,740 '], dtype=object)
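
The array above mixes two kinds of entries: values that are missing in the source and values that are present but formatted with padding and thousands separators. A rough sketch of telling them apart, using a hypothetical `raw` series written in the same style as the `Value` column:

```python
import numpy as np
import pandas as pd

# Hypothetical raw entries imitating the formatting seen in the Value column
raw = pd.Series([np.nan, "  15,699 ", "  19,227 ", "12.5"], dtype="object")

# Entries that cannot be parsed without cleaning become NaN here
parsed = pd.to_numeric(raw, errors="coerce")

# Missing in the source versus present but unparseable as written
truly_missing = raw.isna()
needs_cleaning = parsed.isna() & raw.notna()

print("Missing in the source:", int(truly_missing.sum()))
print("Present but unparseable:", int(needs_cleaning.sum()))
```

Only the second group needs string cleaning; the first group stays missing regardless of how the column is converted.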

In [13]:
#df['Value'] = df['Value'].str.strip()
#df['Value'] = df['Value'].str.replace(",", "")

In [15]:
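# In the .csv file, Value entries are strings padded with whitespace and
# containing thousands separators (or are empty), so a direct astype(float) fails.
# Strip the whitespace and remove the commas before converting;
# anything still unparseable becomes NaN
cleaned = df['Value'].str.strip().str.replace(',', '', regex=False)
df['Value'] = pd.to_numeric(cleaned, errors='coerce')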

In [None]:
# Convert object datatypes into more relevant ones
df = df.astype({
    'Variable': 'category',
    'Year': 'int64',
    'Province': 'category',
    'Value': 'float64',
    'Admin_Lvl': 'category',
    'Region': 'category',
    'Unit': 'category'
})

In [None]:
# Inspect general information about dataset
df.info()

In [None]:
# Show the count and share of missing values per column
total = df.isnull().sum().sort_values(ascending=False)
percent = df.isnull().mean().sort_values(ascending=False)  # fraction between 0 and 1
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data

Judging from the output above, the missing values appear to be linked either to the variable a value refers to or to particular regions and administrative levels.
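
One way to check this impression is to compute the share of missing values within each group. A minimal sketch, assuming a cleaned numeric `Value` column; the `sample` frame and its values are hypothetical and only illustrate the grouping:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; variable and admin-level names follow this notebook,
# but the pattern of missing values is invented for illustration
sample = pd.DataFrame({
    "Variable": [
        "Annual Per Capita Poverty Threshold (in Pesos)",
        "Annual Per Capita Poverty Threshold (in Pesos)",
        "Magnitude of Poor Families",
        "Magnitude of Poor Families",
    ],
    "Admin_Lvl": ["District", "District", "Province", "Province"],
    "Value": [np.nan, 15699.0, np.nan, np.nan],
})

# Share of missing Value entries per variable and per administrative level
is_missing = sample["Value"].isna()
missing_by_variable = is_missing.groupby(sample["Variable"]).mean()
missing_by_admin = is_missing.groupby(sample["Admin_Lvl"]).mean()

print(missing_by_variable)
print(missing_by_admin)
```

Groups with a share close to 1 would point to values that were never reported for that variable or level, rather than gaps scattered at random.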