# Preprocessing/EDA for Small area income estimates for middle layer super output areas, England & Wales, 2011/12

Dataset link: https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/earningsandworkinghours/datasets/smallareaincomeestimatesformiddlelayersuperoutputareasenglandandwales

1.1 Opening the dataset's .xlsx file with Pandas and displaying the names of its sheets to identify those that hold relevant information (i.e. variables/potential features)

In [5]:
import pandas as pd

# Path to the Excel file for MSOA Income Estimates from ONS
file_path = "../../data/raw/msoa_income.xlsx"

# Retrieving and displaying the sheet names
sheets = pd.ExcelFile(file_path).sheet_names
print("The sheet names/tables for MSOA Income Estimate data:", sheets)

The sheet names/tables for MSOA Income Estimate data: ['Contents', 'Metadata', 'Terms and Conditions', 'Total weekly income', 'Net weekly income', 'Net income before housing costs', 'Net income after housing costs', 'Related Publications']
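
As a minimal sketch, the data sheets could also be picked out of the list above programmatically instead of being typed by hand. The name `non_data_sheets` is a hypothetical helper, and treating those four sheets as documentation only is an assumption based on their titles.

```python
# Sheet names exactly as printed in the output above.
sheets = ['Contents', 'Metadata', 'Terms and Conditions', 'Total weekly income',
          'Net weekly income', 'Net income before housing costs',
          'Net income after housing costs', 'Related Publications']

# Assumption: these sheets hold documentation rather than data tables.
non_data_sheets = {'Contents', 'Metadata', 'Terms and Conditions', 'Related Publications'}

# Keep the workbook order and drop the documentation sheets.
data_sheets = [name for name in sheets if name not in non_data_sheets]
print(data_sheets)
```

Filtering against a set keeps each sheet name exactly once, which avoids listing the same sheet twice by mistake.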


1.2 Loading sheets 4 to 7 as DataFrames. The other sheets (contents, metadata, terms and conditions, related publications) contain no data and are not needed for cleaning or analysis.

In [19]:
# Defining the relevant sheets now that the names are known
relevant_sheets = [
    "Total weekly income",
    "Net weekly income",
    "Net income before housing costs",
    "Net income after housing costs",
]

# Loading each relevant sheet, using the row at zero-based index 4 as the
# header (column names); the rows above it are skipped
adjust_data = {}
for sheet in relevant_sheets:
    msoa_income_data = pd.read_excel(file_path, sheet_name=sheet, header=4)

    # Removing the last row as it is irrelevant
    msoa_income_data = msoa_income_data.iloc[:-1]

    # Storing the adjusted DataFrame
    adjust_data[sheet] = msoa_income_data

    # Displaying the number of missing values in each column of the sheet
    print(f"Column Information {sheet}:")
    print(msoa_income_data.isnull().sum())
    print("-" * 40)

Column Information Total weekly income:
MSOA code                     0
MSOA name                     0
Local authority code          0
Local authority name          0
Region code                   0
Region name                   0
Total weekly income (£)       0
Upper confidence limit (£)    0
Lower confidence limit (£)    0
Confidence interval (£)       0
dtype: int64
----------------------------------------
Column Information Net weekly income:
MSOA code                     0
MSOA name                     0
Local authority code          0
Local authority name          0
Region code                   0
Region name                   0
Net weekly income (£)         0
Upper confidence limit (£)    0
Lower confidence limit (£)    0
Confidence interval (£)       0
dtype: int64
----------------------------------------

There are no missing values in any column of the loaded sheets.
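
The per-sheet printouts can be condensed into a single table, which makes a claim like this easier to check at a glance. The sketch below is illustrative only: `toy_sheets` and `missing_summary` are hypothetical names, and the toy values are made up to stand in for the `adjust_data` dictionary rather than taken from the ONS file.

```python
import pandas as pd

# Hypothetical stand-in for adjust_data: two tiny sheets with made-up values.
toy_sheets = {
    "Sheet A": pd.DataFrame({"MSOA code": ["X1", "X2"], "Income (£)": [500.0, 620.0]}),
    "Sheet B": pd.DataFrame({"MSOA code": ["X1", "X2"], "Income (£)": [410.0, None]}),
}

def missing_summary(sheets):
    """Return missing-value counts with one row per sheet and one column per field."""
    counts = {name: df.isnull().sum() for name, df in sheets.items()}
    # Fields that a sheet does not have come out as NaN after alignment;
    # they are shown as 0 here because there is nothing to be missing.
    return pd.DataFrame(counts).T.fillna(0).astype(int)

summary = missing_summary(toy_sheets)
print(summary)

# Total number of missing cells across all sheets; 0 means nothing is missing.
total_missing = int(summary.to_numpy().sum())
print("Total missing values:", total_missing)
```

Calling `missing_summary(adjust_data)` on the real sheets should give a total of zero if the statement above holds for every sheet.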