## Main Data Sources
##### Estimated Population
Home -> Population Estimates -> Annual Population Estimates -> PEA04 - Estimated Population (Persons in April)
##### Households with Internet access - 2011
Home -> SAPMAP 2011 -> Theme 15: PC and Internet Access ->
    SAP2011T15T3CTY - Households with Internet access
##### Households with Internet access - 2016
Home -> SAPMAP 2016 -> Theme 15: Motor Car Availability, PC Ownership and Internet Access ->
    SAP2016T15T3CTY - Households with Internet access
##### Households with Internet access - 2022
Home -> SAPMAP 2022 -> Theme 15: Motor Car Availability, PC Ownership and Internet Access ->
    SAP2022T15T2NUTS - Households with Internet access
## Auxiliary Data Sources
##### Average Number of Persons per Private Household 
Home -> Census 2022 -> Summary Results -> FY004B - Average Number of Persons per Private Household
##### IRISH REGIONS
https://www.cso.ie/en/media/csoie/releasespublications/documents/ep/censuspreliminaryresults/2022/backgroundnotes/NUTS3_Region.xlsx

In [19]:
##IMPORTING LIBRARIES
import pandas as pd
import statistics as stats
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Estimated Population file overview

In [3]:
#creating dataframe for Estimated Population file and validating it
df_pop = pd.read_csv("PEA04.csv")
df_pop.head()

Unnamed: 0,STATISTIC Label,Year,Age Group,Sex,Region,UNIT,VALUE
0,Estimated Population (Persons in April),2011,0 - 4 years,Both sexes,State,Thousand,356.0
1,Estimated Population (Persons in April),2011,0 - 4 years,Both sexes,Border,Thousand,30.7
2,Estimated Population (Persons in April),2011,0 - 4 years,Both sexes,West,Thousand,32.6
3,Estimated Population (Persons in April),2011,0 - 4 years,Both sexes,Mid-West,Thousand,35.0
4,Estimated Population (Persons in April),2011,0 - 4 years,Both sexes,South-East,Thousand,32.0


In [5]:
# checking dataframe short summary: data types, null values, nr of the rows
df_pop.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6669 entries, 0 to 6668
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   STATISTIC Label  6669 non-null   object 
 1   Year             6669 non-null   int64  
 2   Age Group        6669 non-null   object 
 3   Sex              6669 non-null   object 
 4   Region           6669 non-null   object 
 5   UNIT             6669 non-null   object 
 6   VALUE            6669 non-null   float64
dtypes: float64(1), int64(1), object(5)
memory usage: 364.8+ KB


**Brief Overview**

* There are no null values.
* The 'Age Group' field is not needed for this analysis and can be removed.
* The 'Year', 'Sex' and 'Region' fields should be examined through their unique values.
* The 'VALUE' field should be renamed to 'Estimated population' so that it can be used as a clearly named column in the final DataFrame.
* Each value in the 'VALUE' field should be multiplied by 1000 to get absolute numbers of persons, after which the 'UNIT' field can be dropped.
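
The renaming and rescaling steps above could be sketched as follows. This is only an illustration on a two-row frame that mimics the PEA04 layout; the helper name `clean_population` is an assumption and not part of the original notebook. 'Age Group' is left in place here, because dropping it only makes sense once the rows have been filtered or aggregated.

```python
import pandas as pd

def clean_population(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: rename VALUE, convert thousands to persons, drop UNIT."""
    out = df.copy()
    # Values are stored in thousands; round to avoid floating-point residue
    out["VALUE"] = (out["VALUE"] * 1000).round().astype("int64")
    out = out.rename(columns={"VALUE": "Estimated population"})
    # Once values are absolute numbers, UNIT carries no information
    return out.drop(columns=["UNIT"])

# Two rows copied from the df_pop.head() output above
sample = pd.DataFrame({
    "STATISTIC Label": ["Estimated Population (Persons in April)"] * 2,
    "Year": [2011, 2011],
    "Age Group": ["0 - 4 years", "0 - 4 years"],
    "Sex": ["Both sexes", "Both sexes"],
    "Region": ["State", "Border"],
    "UNIT": ["Thousand", "Thousand"],
    "VALUE": [356.0, 30.7],
})

cleaned = clean_population(sample)
print(cleaned[["Region", "Estimated population"]])
```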



remove age groups
filter sex
aggregate data
correlate with FY004B?

In [54]:
df_pop['STATISTIC Label'].unique()

array(['Estimated Population (Persons in April)'], dtype=object)

In [6]:
df_pop.Year.unique()

array([2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021,
       2022, 2023], dtype=int64)

In [7]:
df_pop.Sex.unique()

array(['Both sexes', 'Male', 'Female'], dtype=object)

In [8]:
df_pop.Region.unique()

array(['State', 'Border', 'West', 'Mid-West', 'South-East', 'South-West',
       'Dublin', 'Mid-East', 'Midland'], dtype=object)

**Additional overview**
* The 'STATISTIC Label' column has only 1 unique value.
* We have full data on the estimated population for the years 2011-2023.
* For the 'Sex' field, only the "Both sexes" rows will be used. The split by 'Male'/'Female' adds nothing to this analysis, since a person can have internet access regardless of sex or gender identity.
* The 'Region' field contains 8 large regions plus "State", which covers the whole country. For clarity, "State" will be renamed to "Ireland", and "Midland" will be renamed to "Midlands" to match the official name.
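
A minimal sketch of the filtering and renaming described above. The frame is a toy stand-in for `df_pop`: only the first value comes from the output shown earlier, the others are made up for illustration.

```python
import pandas as pd

# Toy stand-in for df_pop; the Male/Female/Midland values are invented
sample = pd.DataFrame({
    "Sex": ["Both sexes", "Male", "Female", "Both sexes"],
    "Region": ["State", "State", "State", "Midland"],
    "VALUE": [356.0, 182.3, 173.7, 20.1],
})

# Keep only the combined rows, then align region names with the SAP2022 file
both = sample[sample["Sex"] == "Both sexes"].copy()
both["Region"] = both["Region"].replace({"State": "Ireland", "Midland": "Midlands"})
print(both)
```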

In [45]:
# Tabulate population by region and year (input for a year-by-year plot for the whole country)

filtered_df = df_pop.query('Sex == "Both sexes"')

# Create a pivot table with Year as columns, Region as rows, and Value as values
pivot1 = filtered_df.pivot_table(values='VALUE', index='Region', columns='Year', aggfunc="sum", fill_value=0)
pivot1

Region,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
Border,784.0,784.2,786.2,784.1,783.8,786.7,800.4,811.5,820.2,824.2,834.1,844.5,864.3
Dublin,2523.0,2521.3,2531.7,2572.8,2630.6,2671.9,2702.9,2746.2,2797.8,2844.9,2869.4,2936.6,3003.0
Mid-East,1314.9,1329.3,1332.9,1342.8,1358.6,1381.9,1411.9,1438.6,1463.8,1494.7,1510.5,1547.8,1554.2
Mid-West,933.6,941.3,941.6,943.6,946.7,945.2,960.4,977.8,986.8,995.8,999.2,1017.5,1039.1
Midland,567.6,574.1,578.6,579.6,582.3,586.8,598.4,609.5,618.0,619.8,628.2,640.6,649.0
South-East,820.6,823.2,827.6,826.9,831.2,842.2,853.0,863.0,880.3,890.0,902.3,921.4,937.3
South-West,1324.6,1334.7,1347.8,1358.9,1357.8,1369.9,1388.1,1405.9,1421.1,1446.5,1454.7,1486.0,1514.8
State,9149.7,9187.3,9229.3,9291.0,9375.6,9479.4,9621.8,9769.8,9917.0,10059.7,10149.4,10367.9,10563.2
West,881.6,878.7,883.4,882.3,884.5,895.1,906.5,917.5,929.4,944.4,951.0,973.2,1001.6
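
One thing worth checking in the table above: the "State" total for 2011 is 9149.7 thousand, about twice Ireland's population of roughly 4.6 million at the time. A likely cause, assuming the 'Age Group' column also contains an aggregate row such as 'All ages' (that label is an assumption, it does not appear in the outputs above), is that summing over every age group counts each person twice. The toy example below reproduces the effect with made-up numbers and shows one way to avoid it.

```python
import pandas as pd

# Made-up numbers; the 'All ages' and '5 years and over' labels are hypothetical
toy = pd.DataFrame({
    "Year": [2011] * 3,
    "Age Group": ["0 - 4 years", "5 years and over", "All ages"],
    "Sex": ["Both sexes"] * 3,
    "Region": ["State"] * 3,
    "VALUE": [10.0, 90.0, 100.0],
})

# Summing every row adds the aggregate row on top of its own components
naive = toy.pivot_table(values="VALUE", index="Region", columns="Year", aggfunc="sum")

# Keeping only the aggregate row (or only the detailed rows) avoids double counting
fixed = toy[toy["Age Group"] == "All ages"].pivot_table(
    values="VALUE", index="Region", columns="Year", aggfunc="sum"
)

print(naive.loc["State", 2011])  # twice the true total
print(fixed.loc["State", 2011])  # the true total
```

If the real file has no such aggregate row, the doubled totals need a different explanation, so `df_pop['Age Group'].unique()` is worth inspecting first.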


# PLOT a graph?

# Households with Internet access 
# Overview of 3 files in bulk

In [81]:
#creating dataframe for SAP2011 file and validating it

df_it11 = pd.read_csv("SAP2011.csv")
df_it11.head()

Unnamed: 0,Statistic Label,Census Year,Internet,County,UNIT,VALUE
0,Households with Internet access,2011,Broadband,Carlow County,Number,11158
1,Households with Internet access,2011,Broadband,Dublin City,Number,137669
2,Households with Internet access,2011,Broadband,South Dublin,Number,68306
3,Households with Internet access,2011,Broadband,Fingal,Number,73868
4,Households with Internet access,2011,Broadband,Dún Laoghaire-Rathdown,Number,59750


In [62]:
# checking dataframe short summary: data types, null values, nr of the rows

df_it11.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170 entries, 0 to 169
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Statistic Label  170 non-null    object
 1   Census Year      170 non-null    int64 
 2   Internet         170 non-null    object
 3   County           170 non-null    object
 4   UNIT             170 non-null    object
 5   VALUE            170 non-null    int64 
dtypes: int64(2), object(4)
memory usage: 8.1+ KB


In [42]:
#creating dataframe for SAP2016 file and validating it

df_it16 = pd.read_csv("SAP2016.csv")
df_it16.head()

Unnamed: 0,Statistic Label,Census Year,County,Internet,UNIT,VALUE
0,Households with Internet access,2016,Carlow,Broadband,Number,13539
1,Households with Internet access,2016,Carlow,Other,Number,1852
2,Households with Internet access,2016,Carlow,No,Number,4432
3,Households with Internet access,2016,Carlow,Not Stated,Number,642
4,Households with Internet access,2016,Carlow,Total,Number,20465


In [63]:
# checking dataframe short summary: data types, null values, nr of the rows

df_it16.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155 entries, 0 to 154
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Statistic Label  155 non-null    object
 1   Census Year      155 non-null    int64 
 2   County           155 non-null    object
 3   Internet         155 non-null    object
 4   UNIT             155 non-null    object
 5   VALUE            155 non-null    int64 
dtypes: int64(2), object(4)
memory usage: 7.4+ KB


In [43]:
#creating dataframe for SAP2022 file and validating it

df_it22 = pd.read_csv("SAP2022.csv")
df_it22.head()

Unnamed: 0,Statistic Label,Census Year,Internet,NUTS 3 Region,UNIT,VALUE
0,Households with Internet access,2022,Broadband,Ireland,Number,1457883
1,Households with Internet access,2022,Broadband,Border,Number,116928
2,Households with Internet access,2022,Broadband,West,Number,134086
3,Households with Internet access,2022,Broadband,Mid-West,Number,137622
4,Households with Internet access,2022,Broadband,South-East,Number,124415


In [64]:
# checking dataframe short summary: data types, null values, nr of the rows

df_it22.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45 entries, 0 to 44
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Statistic Label  45 non-null     object
 1   Census Year      45 non-null     int64 
 2   Internet         45 non-null     object
 3   NUTS 3 Region    45 non-null     object
 4   UNIT             45 non-null     object
 5   VALUE            45 non-null     int64 
dtypes: int64(2), object(4)
memory usage: 2.2+ KB


**Brief overview**

* All 3 files have the same structure: 2 integer fields and 4 fields of object type.
* In SAP2016 the columns 'County' and 'Internet' are in a different order from the other files. This needs to be aligned.
* Column 'VALUE' will be renamed to 'Households with Internet access' so that it can be used as a clearly named column in the final DataFrame.
* SAP2022 uses NUTS3 regions, while the other 2 files use counties. The county and region columns of all 3 files need closer analysis to find a strategy for unifying them.
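
The alignment described above could look roughly like this. It is a sketch on a few rows copied from the `head()` outputs; the helper `harmonise` and the neutral column name "Area" are assumptions introduced for illustration (counties still have to be mapped to regions in a later step).

```python
import pandas as pd

COMMON_COLUMNS = [
    "Statistic Label", "Census Year", "Internet", "Area",
    "UNIT", "Households with Internet access",
]

def harmonise(df: pd.DataFrame, area_column: str) -> pd.DataFrame:
    """Hypothetical helper: give each SAP file the same column names and order."""
    out = df.rename(columns={
        area_column: "Area",
        "VALUE": "Households with Internet access",
    })
    return out[COMMON_COLUMNS]

# SAP2016 layout: 'County' comes before 'Internet'
it16 = pd.DataFrame({
    "Statistic Label": ["Households with Internet access"] * 2,
    "Census Year": [2016, 2016],
    "County": ["Carlow", "Carlow"],
    "Internet": ["Broadband", "Other"],
    "UNIT": ["Number", "Number"],
    "VALUE": [13539, 1852],
})

# SAP2022 layout: 'Internet' comes before 'NUTS 3 Region'
it22 = pd.DataFrame({
    "Statistic Label": ["Households with Internet access"] * 2,
    "Census Year": [2022, 2022],
    "Internet": ["Broadband", "Broadband"],
    "NUTS 3 Region": ["Ireland", "Border"],
    "UNIT": ["Number", "Number"],
    "VALUE": [1457883, 116928],
})

combined = pd.concat(
    [harmonise(it16, "County"), harmonise(it22, "NUTS 3 Region")],
    ignore_index=True,
)
print(combined)
```

Note that `pd.concat` aligns frames by column name, so the reordering mainly helps readability; the renaming is the part that matters.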

In [56]:
#analysing unique value in the 'Statistic Label' field for all 3 files, 
#in order to make sure that we have stats only for "Households with Internet access"

#2011

df_it11['Statistic Label'].unique()

array(['Households with Internet access'], dtype=object)

In [74]:
df_it11['Internet'].unique()

array(['Broadband', 'Other', 'No', 'Not Stated', 'Total'], dtype=object)

In [57]:
#2016

df_it16['Statistic Label'].unique()

array(['Households with Internet access'], dtype=object)

In [75]:
df_it16['Internet'].unique()

array(['Broadband', 'Other', 'No', 'Not Stated', 'Total'], dtype=object)

In [59]:
#2022

df_it22['Statistic Label'].unique()

array(['Households with Internet access'], dtype=object)

In [76]:
df_it22['Internet'].unique()

array(['Broadband', 'No', 'Not stated', 'Other', 'Total'], dtype=object)

**Additional overview**
* Column 'Statistic Label' has only 1 value in all 3 files, so no changes are needed.
* Column 'Internet' contains a 'Total' value, which would double the numbers in any aggregation, so all rows with 'Total' need to be removed.
* 'Broadband' and 'Other' both describe households that have Internet access, so a possible future aggregation is to combine them as 'Yes', as the opposite of households without access.
* 'Not Stated' is an undefined category that would need a separate, deeper analysis.
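
A small sketch of these steps, using the five Carlow rows from the SAP2016 output above. The `ACCESS_MAP` dictionary and the 'Access' column are illustrative names, not part of the original notebook. The outputs above also show 'Not Stated' in the 2011 and 2016 files but 'Not stated' in 2022, so the labels are normalised before mapping.

```python
import pandas as pd

# Hypothetical grouping of connection types into access categories
ACCESS_MAP = {
    "Broadband": "Yes",
    "Other": "Yes",
    "No": "No",
    "Not stated": "Not stated",
}

sample = pd.DataFrame({
    "County": ["Carlow"] * 5,
    "Internet": ["Broadband", "Other", "No", "Not Stated", "Total"],
    "VALUE": [13539, 1852, 4432, 642, 20465],
})

# Drop the 'Total' rows so they are not added on top of their components
no_total = sample[sample["Internet"] != "Total"].copy()

# 'Not Stated' vs 'Not stated': capitalize() lower-cases everything after the first letter
no_total["Internet"] = no_total["Internet"].str.capitalize()
no_total["Access"] = no_total["Internet"].map(ACCESS_MAP)

summary = no_total.groupby("Access")["VALUE"].sum()
print(summary)
```

As a sanity check, the remaining categories should add up to the removed 'Total' row.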

In [65]:
#analysing regions and counties

#2011

df_it11.County.unique()

array(['Carlow County', 'Dublin City', 'South Dublin', 'Fingal',
       'Dún Laoghaire-Rathdown', 'Kildare County', 'Kilkenny County',
       'Laois County', 'Longford County', 'Louth County', 'Meath County',
       'Offaly County', 'Westmeath County', 'Wexford County',
       'Wicklow County', 'Clare County', 'Cork City', 'Cork County',
       'Kerry County', 'Limerick City', 'Limerick County',
       'North Tipperary', 'South Tipperary', 'Waterford City',
       'Waterford County', 'Galway City', 'Galway County',
       'Leitrim County', 'Mayo County', 'Roscommon County',
       'Sligo County', 'Cavan County', 'Donegal County',
       'Monaghan County'], dtype=object)

In [66]:
#2016

df_it16.County.unique()

array(['Carlow', 'Cavan', 'Clare', 'Cork City', 'Cork County', 'Donegal',
       'Dublin City', 'Dún Laoghaire-Rathdown', 'Fingal', 'Galway City',
       'Galway County', 'Kerry', 'Kildare', 'Kilkenny', 'Laois',
       'Leitrim', 'Limerick City and County', 'Longford', 'Louth', 'Mayo',
       'Meath', 'Monaghan', 'Offaly', 'Roscommon', 'Sligo',
       'South Dublin', 'Tipperary', 'Waterford City and County',
       'Westmeath', 'Wexford', 'Wicklow'], dtype=object)

In [68]:
#2022

df_it22['NUTS 3 Region'].unique()

array(['Ireland', 'Border', 'West', 'Mid-West', 'South-East',
       'South-West', 'Dublin', 'Mid-East', 'Midlands'], dtype=object)

**Additional overview**

* SAP2022 already contains aggregated regions, as does the Estimated Population file, so the values in this column will not change. Only the column name changes: 'NUTS 3 Region' will be renamed to "Region".
* To unify the counties in the SAP2011 and SAP2016 files, the following steps will be applied:
  1. Remove redundant words such as "City", "County", "City and County", and extra spaces (if needed).
  2. Create a new column "Region" and fill each row with the region that the county belongs to.
  3. Replace the values in the "County" column with the values from the "Region" column.
  4. Remove the "Region" column.
  5. Rename the "County" column to "Region".

This gives 3 files with identical structure and unified values.
The last step is to combine all 3 files into 1 dataframe.
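
The county clean-up could be sketched as below. The mapping is deliberately partial: it only contains the counties visible in the regions file output in this notebook, and the helper `clean_county` is a hypothetical name. The sample rows are the 2011 Broadband figures for the four Dublin areas shown earlier.

```python
import re
import pandas as pd

# Partial, illustrative mapping built from the rows of the regions file shown in this notebook
COUNTY_TO_REGION = {
    "Cavan": "Border", "Donegal": "Border", "Leitrim": "Border",
    "Louth": "Border", "Monaghan": "Border", "Sligo": "Border",
    "Dublin": "Dublin", "Dún Laoghaire-Rathdown": "Dublin",
    "Fingal": "Dublin", "South Dublin": "Dublin",
    "Kildare": "Mid-East", "Meath": "Mid-East",
}

def clean_county(name: str) -> str:
    """Strip a trailing 'City and County', 'County' or 'City' from a county name."""
    return re.sub(r"\s+(City and County|County|City)$", "", name.strip())

sample = pd.DataFrame({
    "County": ["Dublin City", "South Dublin", "Fingal", "Dún Laoghaire-Rathdown"],
    "VALUE": [137669, 68306, 73868, 59750],
})

sample["Region"] = sample["County"].map(clean_county).map(COUNTY_TO_REGION)

# Any county missing from the mapping shows up as NaN and should be reviewed
unmapped = sample[sample["Region"].isna()]

by_region = sample.groupby("Region")["VALUE"].sum()
print(by_region)
```

Names such as "North Tipperary" and "South Tipperary" in the 2011 file are not changed by this pattern and would need their own entries in the mapping.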

# Irish Regions file overview

In [70]:
df_reg = pd.read_csv("Regions.csv")
df_reg.head()

Unnamed: 0,Name of region,Constituent counties,Type of area
0,Border,Cavan,Administrative county
1,,Donegal,Administrative county
2,,Leitrim,Administrative county
3,,Louth,Administrative county
4,,Monaghan,Administrative county


In [71]:
# checking dataframe short summary: data types, null values, nr of the rows

df_reg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42 entries, 0 to 41
Data columns (total 3 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Name of region        8 non-null      object
 1   Constituent counties  35 non-null     object
 2    Type of area         35 non-null     object
dtypes: object(3)
memory usage: 1.1+ KB


In [73]:
#print all rows to see the full view
print(df_reg)

   Name of region     Constituent counties           Type of area
0           Border                   Cavan  Administrative county
1              NaN                 Donegal  Administrative county
2              NaN                 Leitrim  Administrative county
3              NaN                   Louth  Administrative county
4              NaN                Monaghan  Administrative county
5              NaN                   Sligo  Administrative county
6              NaN                     NaN                    NaN
7           Dublin                  Dublin                   City
8              NaN  Dún Laoghaire-Rathdown  Administrative county
9              NaN                  Fingal  Administrative county
10             NaN            South Dublin  Administrative county
11             NaN                     NaN                    NaN
12        Mid-East                 Kildare  Administrative county
13             NaN                   Meath  Administrative county
14        

**Overview**
* To have the corresponding region for every county, all NaNs in 'Name of region' need to be filled with the region name. I will use the "fill with previous valid value" (forward fill) method.
* All empty separator rows will be removed.
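
A minimal sketch of this clean-up on a few rows shaped like the output above. The `info()` output appears to show a leading space in the ' Type of area' column name, so the sketch also strips column names; that reading of the output is an assumption.

```python
import numpy as np
import pandas as pd

# Rows shaped like the regions file: region named once, then NaN, plus an empty separator row
reg = pd.DataFrame({
    "Name of region": ["Border", np.nan, np.nan, "Dublin", np.nan],
    "Constituent counties": ["Cavan", "Donegal", np.nan, "Dublin", "Fingal"],
    " Type of area": ["Administrative county", "Administrative county", np.nan,
                      "City", "Administrative county"],
})

# Remove stray whitespace from column names
reg.columns = reg.columns.str.strip()

# Drop the separator rows (no county), then carry each region name forward
reg = reg.dropna(subset=["Constituent counties"]).copy()
reg["Name of region"] = reg["Name of region"].ffill()
reg = reg.reset_index(drop=True)
print(reg)
```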

# Auxiliary file overview

# Private households vs Number of people per household	

In [77]:
df_aux = pd.read_csv("FY004B.csv")
df_aux.head()

Unnamed: 0,Statistic Label,CensusYear,County and City,UNIT,VALUE
0,Private households,2011,State,Number,1654208.0
1,Private households,2011,Carlow,Number,19436.0
2,Private households,2011,Dublin City,Number,208008.0
3,Private households,2011,Dún Laoghaire-Rathdown,Number,75819.0
4,Private households,2011,Fingal,Number,93146.0


In [78]:
# checking dataframe short summary: data types, null values, nr of the rows

df_aux.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 279 entries, 0 to 278
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Statistic Label  279 non-null    object 
 1   CensusYear       279 non-null    int64  
 2   County and City  279 non-null    object 
 3   UNIT             279 non-null    object 
 4   VALUE            279 non-null    float64
dtypes: float64(1), int64(1), object(3)
memory usage: 11.0+ KB


In [21]:
df_aux.CensusYear.unique()

array([2011, 2016, 2022], dtype=int64)

In [79]:
df_aux['Statistic Label'].unique()

array(['Private households', 'Persons in private households',
       'Average number of persons in private households'], dtype=object)

In [80]:
df_aux['County and City'].unique()

array(['State', 'Carlow', 'Dublin City', 'Dún Laoghaire-Rathdown',
       'Fingal', 'South Dublin', 'Kildare', 'Kilkenny', 'Laois',
       'Longford', 'Louth', 'Meath', 'Offaly', 'Westmeath', 'Wexford',
       'Wicklow', 'Clare', 'Cork City and Cork County', 'Kerry',
       'Limerick City and County', 'Tipperary',
       'Waterford City and County', 'Galway City', 'Galway County',
       'Leitrim', 'Mayo', 'Roscommon', 'Sligo', 'Cavan', 'Donegal',
       'Monaghan'], dtype=object)

**Overview**
* The file covers the same 3 census years as the SAP files (2011, 2016, 2022).
* The 'Average number of persons in private households' statistic is redundant (it can be derived from the other two), so it will be removed.
* A similar approach needs to be applied to 'County and City', i.e. counties will be replaced with their regions and the data will be aggregated.
* Field 'VALUE' will be replaced with 2 fields, 'Private households' and 'Persons in private households', so that they can be used in the final dataframe as clearly named values.
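
The reshaping described above could be sketched with a pivot. Only the 'Private households' figure for Carlow (19436) comes from the output above; the other two values are made up for illustration. The average is dropped because averages cannot be summed across counties, but it can be recomputed from the two totals after aggregation.

```python
import pandas as pd

AVERAGE_LABEL = "Average number of persons in private households"

aux = pd.DataFrame({
    "Statistic Label": ["Private households", "Persons in private households", AVERAGE_LABEL],
    "CensusYear": [2011] * 3,
    "County and City": ["Carlow"] * 3,
    "UNIT": ["Number"] * 3,
    "VALUE": [19436.0, 54000.0, 2.78],  # last two values are invented
})

# Drop the average, then turn the remaining statistics into columns
keep = aux[aux["Statistic Label"] != AVERAGE_LABEL]
wide = keep.pivot_table(
    index=["CensusYear", "County and City"],
    columns="Statistic Label",
    values="VALUE",
    aggfunc="sum",
).reset_index()
wide.columns.name = None

# The average can be rebuilt from the two totals whenever it is needed
wide["Persons per household"] = (
    wide["Persons in private households"] / wide["Private households"]
)
print(wide)
```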

Create the final file with the columns:
Year
Region
Type of internet connection
Nr of estimated people
Nr of people with Internet connection
