### <b>In this notebook I will combine the data from the 2011, 2016 and 2022 files into one dataset.</b> <br></br>
From the Central Statistics Office website I got 3 separate files (SAP2011.csv, SAP2016.csv, SAP2022.csv) that contain the number of households with Internet access in each region, broken down by type of Internet connection. <br></br>

Performing a first brief exploration

In [69]:
##IMPORTING LIBRARIES
import pandas as pd
import statistics as stats
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [70]:
#Exploring data
df = pd.read_csv("SAP2011.csv")
df.head()

Unnamed: 0,Statistic Label,Census Year,Internet,County,UNIT,VALUE
0,Households with Internet access,2011,Broadband,Carlow County,Number,11158
1,Households with Internet access,2011,Broadband,Dublin City,Number,137669
2,Households with Internet access,2011,Broadband,South Dublin,Number,68306
3,Households with Internet access,2011,Broadband,Fingal,Number,73868
4,Households with Internet access,2011,Broadband,Dún Laoghaire-Rathdown,Number,59750


In [71]:
#Exploring data
df = pd.read_csv("SAP2016.csv")
df.head()

Unnamed: 0,Statistic Label,Census Year,County,Internet,UNIT,VALUE
0,Households with Internet access,2016,Carlow,Broadband,Number,13539
1,Households with Internet access,2016,Carlow,Other,Number,1852
2,Households with Internet access,2016,Carlow,No,Number,4432
3,Households with Internet access,2016,Carlow,Not Stated,Number,642
4,Households with Internet access,2016,Carlow,Total,Number,20465


In [72]:
#Exploring data
df = pd.read_csv("SAP2022.csv")
df.head()

Unnamed: 0,Statistic Label,Census Year,Internet,NUTS 3 Region,UNIT,VALUE
0,Households with Internet access,2022,Broadband,Ireland,Number,1457883
1,Households with Internet access,2022,Broadband,Border,Number,116928
2,Households with Internet access,2022,Broadband,West,Number,134086
3,Households with Internet access,2022,Broadband,Mid-West,Number,137622
4,Households with Internet access,2022,Broadband,South-East,Number,124415


After this brief data exploration I notice that: <br></br>
1. The geographical classification differs between the files, i.e. <br> - For 2011 and 2016 we have counties, and for 2022 we have NUTS3 region names. <br>
2. The column names differ and the columns are in a different order.<br></br>
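The two observations above can also be checked programmatically. This is only a minimal sketch: it uses empty toy frames that reuse the column headers printed above, and the names `headers`, `frames` and `report` are my own, not part of the notebook.

```python
import pandas as pd

# Empty toy frames that reuse the column headers shown in the outputs above;
# in the notebook the frames read from the three CSV files would be used instead.
headers = {
    2011: ["Statistic Label", "Census Year", "Internet", "County", "UNIT", "VALUE"],
    2016: ["Statistic Label", "Census Year", "County", "Internet", "UNIT", "VALUE"],
    2022: ["Statistic Label", "Census Year", "Internet", "NUTS 3 Region", "UNIT", "VALUE"],
}
frames = {year: pd.DataFrame(columns=cols) for year, cols in headers.items()}

# Compare every file against the 2011 layout
reference = list(frames[2011].columns)
report = {}
for year, frame in frames.items():
    cols = list(frame.columns)
    report[year] = {
        "same_names": set(cols) == set(reference),   # same columns, any order
        "same_order": cols == reference,             # same columns, same order
        "extra": sorted(set(cols) - set(reference)), # columns missing from 2011
    }
print(report)
```

With these headers, 2016 has the same names in a different order, and 2022 has one column that needs renaming.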

The other data used for the CA are split by region, so the 2011 and 2016 data should be converted from counties to regions as well.
For this I will use additional information from the Central Statistics Office website that contains the regions classification framework -<br>
1. https://www.cso.ie/en/methods/informationnotefordatausersrevisiontotheirishnuts2andnuts3regions/ <br>
2. https://www.cso.ie/en/media/csoie/releasespublications/documents/er/newdwellingcompletions/q42020/NDC2020Q4TBL5_NUTS3.xlsx<br></br>
Based on this regions classification I have created <b> NUTS3_Region.csv</b>. This file will be used to help transform the data from counties to regions. <br></br> 
<b>Note:</b> While working with the data from 2011 I noticed that Tipperary is divided into North and South, while for 2016 there is only one county, Tipperary, which was included in the relevant region. <br></br>

Column names and their order will also be unified.

# Following EDA will be structured as:<br>
####     &nbsp;1. Handling data for file with Regions classification<br>
####     &nbsp;2. Handling data for file with Internet Types data from 2011 year<br>
####     &nbsp;3. Handling data for file with  Internet Types data from 2016 year<br>
####     &nbsp;4. Handling data for file with  Internet Types data from 2022 year<br>
####     &nbsp;5. Creating 1 DataFrame and exporting it in the csv file<br>

# 1. Handling data for file with Regions classification

In [73]:
#Creating DataFrame with Regions classification

df_reg = pd.read_csv("NUTS3_Region.csv")
df_reg

Unnamed: 0,Name of region,Constituent counties,Type of area
0,Border,Cavan,Administrative county
1,,Donegal,Administrative county
2,,Leitrim,Administrative county
3,,Louth,Administrative county
4,,Monaghan,Administrative county
5,,Sligo,Administrative county
6,,,
7,Dublin,Dublin,City
8,,Dún Laoghaire-Rathdown,Administrative county
9,,Fingal,Administrative county


#### There are NA values in the column with region names and an empty row after each set of counties

In [74]:
#Using forward fill in the column with Regions to fill NAs with the last valid value
#(Series.ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas)
df_reg["Name of region "] = df_reg["Name of region "].ffill()
df_reg.sample(8)

Unnamed: 0,Name of region,Constituent counties,Type of area
24,Mid-West,North Tipperary,Administrative county
2,Border,Leitrim,Administrative county
5,Border,Sligo,Administrative county
28,South-East,Kilkenny,Administrative county
31,South-East,Waterford,Administrative county
25,Mid-West,Tipperary,Administrative county
17,Midland,Longford,Administrative county
41,West,Roscommon,Administrative county


In [75]:
#checking nr of rows with NA values
df_reg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42 entries, 0 to 41
Data columns (total 3 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Name of region        42 non-null     object
 1   Constituent counties  35 non-null     object
 2    Type of area         35 non-null     object
dtypes: object(3)
memory usage: 1.1+ KB


In [76]:
#Removing rows with NAs and making sure that there are no null values left
df_reg = df_reg.dropna()
df_reg.info()

<class 'pandas.core.frame.DataFrame'>
Index: 35 entries, 0 to 41
Data columns (total 3 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Name of region        35 non-null     object
 1   Constituent counties  35 non-null     object
 2    Type of area         35 non-null     object
dtypes: object(3)
memory usage: 1.1+ KB
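The order of the two cleaning steps matters: the region name has to be filled forward before the rows with NAs are dropped, otherwise every county except the first one in each region would be lost. A minimal sketch on a toy frame (the frame `toy` is made up and only mimics the layout of NUTS3_Region.csv):

```python
import pandas as pd

# Toy frame with the same layout as the regions file:
# the region name appears only on the first row of each block,
# and row 2 is an empty separator row.
toy = pd.DataFrame({
    "Name of region ": ["Border", None, None, "Dublin", None],
    "Constituent counties": ["Cavan", "Donegal", None, "Dublin", "Fingal"],
})

# Step 1: fill the region name forward, so every county row gets its region
toy["Name of region "] = toy["Name of region "].ffill()

# Step 2: drop the separator rows, which still have no county
cleaned = toy.dropna(subset=["Constituent counties"]).reset_index(drop=True)
print(cleaned)
```

After forward fill the separator row also carries a region name, which is why the region column shows 42 non-null values while the county column shows only 35.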


In [77]:
#Checking list of counties
df_reg["Constituent counties"].unique()

array(['Cavan', 'Donegal', 'Leitrim', 'Louth', 'Monaghan', 'Sligo',
       'Dublin', 'Dún Laoghaire-Rathdown', 'Fingal', 'South Dublin',
       'Kildare', 'Meath', 'Wicklow', 'Laois', 'Longford', 'Offaly',
       'Westmeath', 'Clare', 'Limerick', 'North Tipperary ', 'Tipperary',
       'Carlow', 'Kilkenny', 'South Tipperary ', 'Waterford', 'Wexford',
       'Cork', 'Kerry', 'Galway', 'Mayo', 'Roscommon'], dtype=object)

In [78]:
# North Tipperary and South Tipperary contain an additional space at the end, so a couple more lines of code are needed to remove it

In [79]:
#the frame returned by dropna() can be flagged as a copy of a slice,
#so an explicit copy is taken before assigning to avoid SettingWithCopyWarning
df_reg = df_reg.copy()
df_reg["Constituent counties"] = df_reg["Constituent counties"].str.strip()


In [80]:
#final check
df_reg["Constituent counties"].unique()

array(['Cavan', 'Donegal', 'Leitrim', 'Louth', 'Monaghan', 'Sligo',
       'Dublin', 'Dún Laoghaire-Rathdown', 'Fingal', 'South Dublin',
       'Kildare', 'Meath', 'Wicklow', 'Laois', 'Longford', 'Offaly',
       'Westmeath', 'Clare', 'Limerick', 'North Tipperary', 'Tipperary',
       'Carlow', 'Kilkenny', 'South Tipperary', 'Waterford', 'Wexford',
       'Cork', 'Kerry', 'Galway', 'Mayo', 'Roscommon'], dtype=object)
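Stray spaces appear not only in the values: the code above refers to the region column as "Name of region " with a trailing space, and the `info()` output suggests that " Type of area" has a leading one. A possible way to tidy both headers and values in one go is sketched below. It is an assumption on my side rather than what this notebook does, and if it were applied, the later cells would have to use "Name of region" without the trailing space.

```python
import pandas as pd

# Toy frame with the same kind of stray spaces as the regions file
toy = pd.DataFrame({
    "Name of region ": ["Mid-West", "South-East"],
    "Constituent counties": ["North Tipperary ", "South Tipperary "],
    " Type of area": ["Administrative county", "Administrative county"],
})

tidy = toy.copy()
# Strip spaces around the column names
tidy.columns = tidy.columns.str.strip()
# Strip spaces around the values; all columns in this file are text
tidy = tidy.apply(lambda s: s.str.strip())
print(tidy.columns.tolist())
```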

# 2. Handling data for file with data from 2011 year

In [81]:
#Creating dataset with internet type data per region for 2011 year 
df_it11 = pd.read_csv("SAP2011.csv")
df_it11.head(8)

Unnamed: 0,Statistic Label,Census Year,Internet,County,UNIT,VALUE
0,Households with Internet access,2011,Broadband,Carlow County,Number,11158
1,Households with Internet access,2011,Broadband,Dublin City,Number,137669
2,Households with Internet access,2011,Broadband,South Dublin,Number,68306
3,Households with Internet access,2011,Broadband,Fingal,Number,73868
4,Households with Internet access,2011,Broadband,Dún Laoghaire-Rathdown,Number,59750
5,Households with Internet access,2011,Broadband,Kildare County,Number,50093
6,Households with Internet access,2011,Broadband,Kilkenny County,Number,19816
7,Households with Internet access,2011,Broadband,Laois County,Number,16003


In [82]:
#Checking unique values
df_it11.County.unique()

array(['Carlow County', 'Dublin City', 'South Dublin', 'Fingal',
       'Dún Laoghaire-Rathdown', 'Kildare County', 'Kilkenny County',
       'Laois County', 'Longford County', 'Louth County', 'Meath County',
       'Offaly County', 'Westmeath County', 'Wexford County',
       'Wicklow County', 'Clare County', 'Cork City', 'Cork County',
       'Kerry County', 'Limerick City', 'Limerick County',
       'North Tipperary', 'South Tipperary', 'Waterford City',
       'Waterford County', 'Galway City', 'Galway County',
       'Leitrim County', 'Mayo County', 'Roscommon County',
       'Sligo County', 'Cavan County', 'Donegal County',
       'Monaghan County'], dtype=object)

##### County names in this file contain the words "County" and "City", which are not included in the main classification file.
##### In order to use regions instead of counties, as a first step I will remove those words, as well as any remaining spaces.
##### Other files might need a similar modification, so it makes sense to create a simple function that removes unnecessary words and/or symbols.

In [83]:
#Method 'replace' will be used
def remove_word(df, word):
    #removes the given word together with the space in front of it, e.g. " County"
    df['County'] = df['County'].str.replace(f' {word}', '', regex=False)
    return df

In [84]:
#removing unnecessary words and making sure we have no additional spaces
df_it11 = remove_word(df_it11.copy(), "County")
df_it11 = remove_word(df_it11.copy(), "City")
df_it11 = remove_word(df_it11.copy(), " ")
df_it11.County.unique()

array(['Carlow', 'Dublin', 'South Dublin', 'Fingal',
       'Dún Laoghaire-Rathdown', 'Kildare', 'Kilkenny', 'Laois',
       'Longford', 'Louth', 'Meath', 'Offaly', 'Westmeath', 'Wexford',
       'Wicklow', 'Clare', 'Cork', 'Kerry', 'Limerick', 'North Tipperary',
       'South Tipperary', 'Waterford', 'Galway', 'Leitrim', 'Mayo',
       'Roscommon', 'Sligo', 'Cavan', 'Donegal', 'Monaghan'], dtype=object)
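As an alternative to calling `remove_word` several times, the suffixes could be removed in a single pass with a regular expression. This is a sketch, not the approach used in this notebook; the series `counties` holds a few sample names of the kind found in the census files, and `suffix` is my own name.

```python
import pandas as pd

# A few sample county names of the kind found in the census files
counties = pd.Series([
    "Carlow County", "Dublin City", "South Dublin",
    "Limerick City and County", "Waterford City and County",
    "Dún Laoghaire-Rathdown", "North Tipperary",
])

# The pattern is anchored to the end of the string, so only a trailing
# "City and County", "County" or "City" (plus the space before it) is removed.
suffix = r"\s+(?:City and County|County|City)\s*$"
cleaned = counties.str.replace(suffix, "", regex=True).str.strip()
print(cleaned.tolist())
```

Because the pattern only matches at the end of the name, "South Dublin" and "Dún Laoghaire-Rathdown" are left unchanged.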

In [85]:
#checking DataFrame for NAs
df_it11.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170 entries, 0 to 169
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Statistic Label  170 non-null    object
 1   Census Year      170 non-null    int64 
 2   Internet         170 non-null    object
 3   County           170 non-null    object
 4   UNIT             170 non-null    object
 5   VALUE            170 non-null    int64 
dtypes: int64(2), object(4)
memory usage: 8.1+ KB


###### Data frame has no NAs and ready for further modification

Next I will create a function that adds a new column to the current data frame. In this column each row will hold the region that corresponds to its county. <br> </br>
The rationale for a separate function is that the same transformation will be needed for the 2016 file.

In [86]:
#column Region will be created in the DataFrame

#The function takes 2 DataFrames as arguments:
#for each element of column "County" in df_it it checks whether the same name exists in
#the "Constituent counties" column of df_reg
#and if so, the value from column "Name of region " is taken for the new column;
#otherwise the county name itself is kept

def create_region_column(df_it, df_reg):
    df_it['Region'] = None
    for i in range(len(df_it)):
        county = df_it.loc[i, 'County']
        if county in df_reg['Constituent counties'].to_list():
            region = df_reg.loc[df_reg['Constituent counties'] == county, 'Name of region '].iloc[0]
        else:
            region = county
        df_it.loc[i, 'Region'] = region
    return df_it
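The loop above relies on the frame having the default 0..n-1 index, because it reads rows with `.loc[i, ...]`. The same lookup could be written without a loop using `Series.map`. The sketch below is an alternative under that reading, not the notebook's function; the toy frames and the county "Atlantis" are made up to show what happens with an unmatched name.

```python
import pandas as pd

# Toy lookup table and toy census rows; "Atlantis" is a made-up county
# that is deliberately missing from the lookup table.
df_reg_toy = pd.DataFrame({
    "Name of region ": ["South-East", "Dublin", "Dublin"],
    "Constituent counties": ["Carlow", "Dublin", "Fingal"],
})
df_it_toy = pd.DataFrame({
    "County": ["Carlow", "Dublin", "Fingal", "Atlantis"],
    "VALUE": [11158, 137669, 73868, 1],
})

# County -> region lookup; keeping the first match mirrors .iloc[0] in the loop
lookup = (
    df_reg_toy.drop_duplicates("Constituent counties")
    .set_index("Constituent counties")["Name of region "]
)

mapped = df_it_toy["County"].map(lookup)

# Names that were not found in the classification file
unmatched = sorted(df_it_toy.loc[mapped.isna(), "County"].unique())

# Same fallback as the loop: keep the county name when no region is found
df_it_toy["Region"] = mapped.fillna(df_it_toy["County"])
print(df_it_toy)
print("unmatched:", unmatched)
```

Listing the unmatched names makes it easier to spot a county that silently kept its own name instead of getting a region.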

In [87]:
#applying function to current dataframe and checking if new column is created with correct values
df_it11 = create_region_column(df_it11, df_reg)
df_it11.head()

Unnamed: 0,Statistic Label,Census Year,Internet,County,UNIT,VALUE,Region
0,Households with Internet access,2011,Broadband,Carlow,Number,11158,South-East
1,Households with Internet access,2011,Broadband,Dublin,Number,137669,Dublin
2,Households with Internet access,2011,Broadband,South Dublin,Number,68306,Dublin
3,Households with Internet access,2011,Broadband,Fingal,Number,73868,Dublin
4,Households with Internet access,2011,Broadband,Dún Laoghaire-Rathdown,Number,59750,Dublin


In [88]:
#replacing the counties with regions and checking 
df_it11["County"] = df_it11["Region"]
df_it11.head()

Unnamed: 0,Statistic Label,Census Year,Internet,County,UNIT,VALUE,Region
0,Households with Internet access,2011,Broadband,South-East,Number,11158,South-East
1,Households with Internet access,2011,Broadband,Dublin,Number,137669,Dublin
2,Households with Internet access,2011,Broadband,Dublin,Number,68306,Dublin
3,Households with Internet access,2011,Broadband,Dublin,Number,73868,Dublin
4,Households with Internet access,2011,Broadband,Dublin,Number,59750,Dublin


In [89]:
#renaming column to the new name and checking
df_it11 = df_it11.rename(columns={'County': 'Name of Region'})
df_it11.head()

Unnamed: 0,Statistic Label,Census Year,Internet,Name of Region,UNIT,VALUE,Region
0,Households with Internet access,2011,Broadband,South-East,Number,11158,South-East
1,Households with Internet access,2011,Broadband,Dublin,Number,137669,Dublin
2,Households with Internet access,2011,Broadband,Dublin,Number,68306,Dublin
3,Households with Internet access,2011,Broadband,Dublin,Number,73868,Dublin
4,Households with Internet access,2011,Broadband,Dublin,Number,59750,Dublin


In [90]:
#removing last column to have initial view for the table
df_it11 = df_it11.drop(columns=["Region"])
df_it11.head()

Unnamed: 0,Statistic Label,Census Year,Internet,Name of Region,UNIT,VALUE
0,Households with Internet access,2011,Broadband,South-East,Number,11158
1,Households with Internet access,2011,Broadband,Dublin,Number,137669
2,Households with Internet access,2011,Broadband,Dublin,Number,68306
3,Households with Internet access,2011,Broadband,Dublin,Number,73868
4,Households with Internet access,2011,Broadband,Dublin,Number,59750


In [91]:
#last check of the regions[just to be completely sure]
df_it11["Name of Region"].unique()

array(['South-East', 'Dublin', 'Mid-East', 'Midland', 'Border',
       'Mid-West', 'South-West', 'West'], dtype=object)

# 3. Handling data for file with data from 2016 year

In [92]:
#Creating new DataFrame and exploring it
df_it16 = pd.read_csv("SAP2016.csv")
df_it16.head()

Unnamed: 0,Statistic Label,Census Year,County,Internet,UNIT,VALUE
0,Households with Internet access,2016,Carlow,Broadband,Number,13539
1,Households with Internet access,2016,Carlow,Other,Number,1852
2,Households with Internet access,2016,Carlow,No,Number,4432
3,Households with Internet access,2016,Carlow,Not Stated,Number,642
4,Households with Internet access,2016,Carlow,Total,Number,20465


In [93]:
#re-ordering columns to match the 2011 file
df_it16 = df_it16[['Statistic Label','Census Year','Internet','County','UNIT','VALUE']]
df_it16.head()

Unnamed: 0,Statistic Label,Census Year,Internet,County,UNIT,VALUE
0,Households with Internet access,2016,Broadband,Carlow,Number,13539
1,Households with Internet access,2016,Other,Carlow,Number,1852
2,Households with Internet access,2016,No,Carlow,Number,4432
3,Households with Internet access,2016,Not Stated,Carlow,Number,642
4,Households with Internet access,2016,Total,Carlow,Number,20465


In [94]:
#checking unique values
df_it16.County.unique()

array(['Carlow', 'Cavan', 'Clare', 'Cork City', 'Cork County', 'Donegal',
       'Dublin City', 'Dún Laoghaire-Rathdown', 'Fingal', 'Galway City',
       'Galway County', 'Kerry', 'Kildare', 'Kilkenny', 'Laois',
       'Leitrim', 'Limerick City and County', 'Longford', 'Louth', 'Mayo',
       'Meath', 'Monaghan', 'Offaly', 'Roscommon', 'Sligo',
       'South Dublin', 'Tipperary', 'Waterford City and County',
       'Westmeath', 'Wexford', 'Wicklow'], dtype=object)

In [95]:
#removing all irrelevant words
df_it16 = remove_word(df_it16.copy(), "City and County")
df_it16 = remove_word(df_it16.copy(), "County")
df_it16 = remove_word(df_it16.copy(), "City")
df_it16 = remove_word(df_it16.copy(), " ")
df_it16.County.unique()

array(['Carlow', 'Cavan', 'Clare', 'Cork', 'Donegal', 'Dublin',
       'Dún Laoghaire-Rathdown', 'Fingal', 'Galway', 'Kerry', 'Kildare',
       'Kilkenny', 'Laois', 'Leitrim', 'Limerick', 'Longford', 'Louth',
       'Mayo', 'Meath', 'Monaghan', 'Offaly', 'Roscommon', 'Sligo',
       'South Dublin', 'Tipperary', 'Waterford', 'Westmeath', 'Wexford',
       'Wicklow'], dtype=object)

In [96]:
#using function for Regions
df_it16 = create_region_column(df_it16, df_reg)
df_it16.head()

Unnamed: 0,Statistic Label,Census Year,Internet,County,UNIT,VALUE,Region
0,Households with Internet access,2016,Broadband,Carlow,Number,13539,South-East
1,Households with Internet access,2016,Other,Carlow,Number,1852,South-East
2,Households with Internet access,2016,No,Carlow,Number,4432,South-East
3,Households with Internet access,2016,Not Stated,Carlow,Number,642,South-East
4,Households with Internet access,2016,Total,Carlow,Number,20465,South-East


In [97]:
#replacing the counties with regions and checking 
df_it16["County"] = df_it16["Region"]
df_it16.head()

Unnamed: 0,Statistic Label,Census Year,Internet,County,UNIT,VALUE,Region
0,Households with Internet access,2016,Broadband,South-East,Number,13539,South-East
1,Households with Internet access,2016,Other,South-East,Number,1852,South-East
2,Households with Internet access,2016,No,South-East,Number,4432,South-East
3,Households with Internet access,2016,Not Stated,South-East,Number,642,South-East
4,Households with Internet access,2016,Total,South-East,Number,20465,South-East


In [98]:
#renaming column to the new name and checking
df_it16 = df_it16.rename(columns={'County': 'Name of Region'})
df_it16.head()

Unnamed: 0,Statistic Label,Census Year,Internet,Name of Region,UNIT,VALUE,Region
0,Households with Internet access,2016,Broadband,South-East,Number,13539,South-East
1,Households with Internet access,2016,Other,South-East,Number,1852,South-East
2,Households with Internet access,2016,No,South-East,Number,4432,South-East
3,Households with Internet access,2016,Not Stated,South-East,Number,642,South-East
4,Households with Internet access,2016,Total,South-East,Number,20465,South-East


In [99]:
#removing last column to have initial view for the table

df_it16 = df_it16.drop(columns=["Region"])
df_it16.head()

Unnamed: 0,Statistic Label,Census Year,Internet,Name of Region,UNIT,VALUE
0,Households with Internet access,2016,Broadband,South-East,Number,13539
1,Households with Internet access,2016,Other,South-East,Number,1852
2,Households with Internet access,2016,No,South-East,Number,4432
3,Households with Internet access,2016,Not Stated,South-East,Number,642
4,Households with Internet access,2016,Total,South-East,Number,20465


In [100]:
#Last check for obtained data
df_it16["Name of Region"].unique()

array(['South-East', 'Border', 'Mid-West', 'South-West', 'Dublin', 'West',
       'Mid-East', 'Midland'], dtype=object)

# 4. Handling data for file with data from 2022 year

In [101]:
#Creating new DataFrame and exploring it

df_it22 = pd.read_csv("SAP2022.csv")
df_it22.head()

Unnamed: 0,Statistic Label,Census Year,Internet,NUTS 3 Region,UNIT,VALUE
0,Households with Internet access,2022,Broadband,Ireland,Number,1457883
1,Households with Internet access,2022,Broadband,Border,Number,116928
2,Households with Internet access,2022,Broadband,West,Number,134086
3,Households with Internet access,2022,Broadband,Mid-West,Number,137622
4,Households with Internet access,2022,Broadband,South-East,Number,124415


This file already has region names and the correct column order, so only the region column needs renaming

In [103]:
#renaming column
df_it22 = df_it22.rename(columns={'NUTS 3 Region': 'Name of Region'})
df_it22.head()

Unnamed: 0,Statistic Label,Census Year,Internet,Name of Region,UNIT,VALUE
0,Households with Internet access,2022,Broadband,Ireland,Number,1457883
1,Households with Internet access,2022,Broadband,Border,Number,116928
2,Households with Internet access,2022,Broadband,West,Number,134086
3,Households with Internet access,2022,Broadband,Mid-West,Number,137622
4,Households with Internet access,2022,Broadband,South-East,Number,124415


In [104]:
#checking that the regions match those in the other files
df_it22["Name of Region"].unique()

array(['Ireland', 'Border', 'West', 'Mid-West', 'South-East',
       'South-West', 'Dublin', 'Mid-East', 'Midlands'], dtype=object)

# 5. Creating 1 DataFrame and exporting it in the csv file "SAP.csv"

In [138]:
#checking nr of rows in the 1st dataset
df_it11.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170 entries, 0 to 169
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Statistic Label  170 non-null    object
 1   Census Year      170 non-null    int64 
 2   Internet         170 non-null    object
 3   Name of Region   170 non-null    object
 4   UNIT             170 non-null    object
 5   VALUE            170 non-null    int64 
dtypes: int64(2), object(4)
memory usage: 8.1+ KB


In [139]:
#checking nr of rows in the 2nd dataset

df_it16.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155 entries, 0 to 154
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Statistic Label  155 non-null    object
 1   Census Year      155 non-null    int64 
 2   Internet         155 non-null    object
 3   Name of Region   155 non-null    object
 4   UNIT             155 non-null    object
 5   VALUE            155 non-null    int64 
dtypes: int64(2), object(4)
memory usage: 7.4+ KB


In [140]:
#checking nr of rows in the 3rd dataset

df_it22.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45 entries, 0 to 44
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Statistic Label  45 non-null     object
 1   Census Year      45 non-null     int64 
 2   Internet         45 non-null     object
 3   Name of Region   45 non-null     object
 4   UNIT             45 non-null     object
 5   VALUE            45 non-null     int64 
dtypes: int64(2), object(4)
memory usage: 2.2+ KB


In [141]:
#brief calc for expected result
170+155+45

370

In [142]:
#applying concatenation for all 3 DataFrames
df_it = pd.concat([df_it11, df_it16,df_it22], ignore_index=True)
df_it.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 370 entries, 0 to 369
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Statistic Label  370 non-null    object
 1   Census Year      370 non-null    int64 
 2   Internet         370 non-null    object
 3   Name of Region   370 non-null    object
 4   UNIT             370 non-null    object
 5   VALUE            370 non-null    int64 
dtypes: int64(2), object(4)
memory usage: 17.5+ KB
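`pd.concat` does not complain when the column names differ between frames; it adds the missing columns and fills them with NaN. A small guard before concatenating can catch this. The sketch below uses made-up one-row frames (`part_a`, `part_b`, `part_c`) in place of the three real DataFrames.

```python
import pandas as pd

cols = ["Census Year", "Internet", "Name of Region", "VALUE"]

# Made-up one-row frames standing in for the 2011, 2016 and 2022 data
part_a = pd.DataFrame([[2011, "Broadband", "Dublin", 1]], columns=cols)
part_b = pd.DataFrame([[2016, "Broadband", "Dublin", 2]], columns=cols)
part_c = pd.DataFrame([[2022, "Broadband", "Dublin", 3]], columns=cols)
parts = [part_a, part_b, part_c]

# Guard: all frames must have identical column names in identical order
assert all(list(p.columns) == cols for p in parts)

combined = pd.concat(parts, ignore_index=True)

# The row count must equal the sum of the parts (170 + 155 + 45 = 370 in the notebook)
assert len(combined) == sum(len(p) for p in parts)
print(combined)
```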


In [143]:
#Checking that there are no mistakes or misspellings in the region names
df_it["Name of Region"].unique()

array(['South-East', 'Dublin', 'Mid-East', 'Midland', 'Border',
       'Mid-West', 'South-West', 'West', 'Ireland', 'Midlands'],
      dtype=object)
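The output above shows two things worth handling before grouping: the 2022 file spells the region "Midlands" while the 2011 and 2016 data use "Midland", and "Ireland" is a national total that exists only for 2022. One possible way to harmonise this is sketched below. Treating "Midland" (the spelling in NUTS3_Region.csv) as the canonical label is my assumption, and the values in the toy frame are made up.

```python
import pandas as pd

# Toy rows with the two spellings and the national total; VALUE is made up
toy = pd.DataFrame({
    "Census Year": [2016, 2022, 2022],
    "Name of Region": ["Midland", "Midlands", "Ireland"],
    "VALUE": [10, 12, 100],
})

# Assumption: "Midland" is the canonical label, so the 2022 spelling is mapped to it
toy["Name of Region"] = toy["Name of Region"].replace({"Midlands": "Midland"})

# "Ireland" is a national total, not a region, so it is separated out here
# rather than compared against regional rows
is_national_total = toy["Name of Region"].eq("Ireland")
regional_only = toy.loc[~is_national_total]
print(sorted(toy["Name of Region"].unique()))
```

Without this step the same region would appear under two different names depending on the census year.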

In [144]:
# Group the DataFrame by all columns except VALUE
#Rationale: to have 1 row/value for each group of categories
df_grouped = df_it.groupby(['Statistic Label', 'Census Year', 'Internet', 'Name of Region', 'UNIT'])

In [145]:
# Aggregate the data by summing the values in each group
df_grouped = df_grouped.agg({'VALUE': 'sum'})

In [146]:
# Reset the indexes of the grouped DataFrame
df_grouped = df_grouped.reset_index()
df_grouped.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125 entries, 0 to 124
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Statistic Label  125 non-null    object
 1   Census Year      125 non-null    int64 
 2   Internet         125 non-null    object
 3   Name of Region   125 non-null    object
 4   UNIT             125 non-null    object
 5   VALUE            125 non-null    int64 
dtypes: int64(2), object(4)
memory usage: 6.0+ KB
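The three steps above (group, aggregate, reset index) can also be written as one chain with `as_index=False`, and the result can be checked by comparing totals before and after. A minimal sketch on a toy frame; the first two values are the 2011 Broadband figures for Dublin City and South Dublin shown earlier, the other two are placeholders.

```python
import pandas as pd

# Two 2011 Broadband rows that both map to the Dublin region,
# plus two placeholder rows for other categories
toy = pd.DataFrame({
    "Census Year": [2011, 2011, 2011, 2016],
    "Internet": ["Broadband", "Broadband", "No", "Broadband"],
    "Name of Region": ["Dublin", "Dublin", "Dublin", "Dublin"],
    "VALUE": [137669, 68306, 5, 7],
})

keys = ["Census Year", "Internet", "Name of Region"]

# as_index=False keeps the grouping keys as ordinary columns,
# so no separate reset_index() call is needed
grouped = toy.groupby(keys, as_index=False)["VALUE"].sum()

# Summing within groups must not change the grand total
assert grouped["VALUE"].sum() == toy["VALUE"].sum()
print(grouped)
```

In the notebook the expected number of rows can be reasoned the same way: 5 Internet categories for 8 regions in 2011 and in 2016, plus 5 categories for 9 region labels in 2022, gives 40 + 40 + 45 = 125.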


In [147]:
#selecting random rows for wider observation
df_grouped.sample(15)

Unnamed: 0,Statistic Label,Census Year,Internet,Name of Region,UNIT,VALUE
86,Households with Internet access,2022,Broadband,South-East,Number,124415
17,Households with Internet access,2011,Not Stated,Dublin,Number,12359
84,Households with Internet access,2022,Broadband,Mid-West,Number,137622
73,Households with Internet access,2016,Total,Dublin,Number,479159
42,Households with Internet access,2016,Broadband,Mid-East,Number,140854
92,Households with Internet access,2022,No,Mid-East,Number,18400
59,Households with Internet access,2016,Not Stated,Mid-West,Number,4919
113,Households with Internet access,2022,Other,South-East,Number,11507
30,Households with Internet access,2011,Other,South-West,Number,19639
14,Households with Internet access,2011,No,South-West,Number,66925


In [319]:
#Export DataFrame to a csv File
df_grouped.to_csv("SAP.csv", index = False)
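The `index = False` argument matters when the file is read back: without it the row index is written as an extra unnamed column, which is where the "Unnamed: 0" header comes from. A small round-trip sketch using an in-memory buffer instead of a real file (the one-row frame `toy` reuses a value from the sample above):

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "Census Year": [2022],
    "Name of Region": ["South-East"],
    "VALUE": [124415],
})

# Round trip without the index: the columns come back unchanged
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Round trip with the index: an extra "Unnamed: 0" column appears
buffer_with_index = io.StringIO()
toy.to_csv(buffer_with_index)
buffer_with_index.seek(0)
with_index = pd.read_csv(buffer_with_index)

print(restored.columns.tolist())
print(with_index.columns.tolist())
```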