# Preprocessing/EDA of the LSOA Atlas Data from the Greater London Authority
The dataset contains census data for the 2011 LSOA boundaries. Some figures are indicative only: they show trends but are not official ONS statistics, because certain areas were merged.

Dataset source: https://data.london.gov.uk/dataset/lsoa-atlas

1.1 Loading the dataset from an .xlsx file into a Pandas DataFrame, then displaying the names of the sheets that have relevant information (i.e. variables/potential features)

In [37]:
import pandas as pd

# Loading the Excel file for LSOA atlas data, which I will name "lsoa_census_data" from now on.
file_path = "../../data/raw/lsoa_census.xlsx"

# Retrieving and displaying the sheet names
sheets = pd.ExcelFile(file_path).sheet_names
print("The sheet names/tables for the LSOA Atlas data:", sheets)

The sheet names/tables for the LSOA Atlas data: ['iadatasheet1', 'iadatasheet2', 'iadatasheet3', 'iadatasheet4', 'iadatasheet5', 'iadatasheet6', 'Metadata']


1.2 Loading Sheets 1 to 6 as DataFrames. The remaining sheet (Metadata) is not needed for the cleaning and analysis processes.

In [38]:
# Loading the six data sheets as DataFrames (everything except 'Metadata')
relevant_sheets = sheets[0:6]
lsoa_census_data = {sheet: pd.read_excel(file_path, sheet_name = sheet) 
                    for sheet in relevant_sheets}

# Exploring each DataFrame by looping through each sheet
for sheet, census in lsoa_census_data.items():
    print(f"Sheet: {sheet}")
    print(census.head())
    census.info()  # info() prints its summary directly and returns None

Sheet: iadatasheet1
  Unnamed: 0           Unnamed: 1 Mid-year Population Estimates  Unnamed: 3  \
0        NaN                  NaN                      All Ages         NaN   
1      Codes                Names                          2001      2002.0   
2  E01000001  City of London 001A                          1615      1571.0   
3  E01000002  City of London 001B                          1493      1452.0   
4  E01000003  City of London 001C                          1573      1547.0   

   Unnamed: 4  Unnamed: 5  Unnamed: 6  Unnamed: 7  Unnamed: 8  Unnamed: 9  \
0         NaN         NaN         NaN         NaN         NaN         NaN   
1      2003.0      2004.0      2005.0      2006.0      2007.0      2008.0   
2      1578.0      1559.0      1461.0      1474.0      1538.0      1504.0   
3      1401.0      1398.0      1402.0      1430.0      1467.0      1417.0   
4      1506.0      1487.0      1536.0      1524.0      1602.0      1499.0   

   ...                       Unnamed: 200 

As the output above shows, there are several header rows, some with merged cells and some without. This dataset is more complex than the lsoa_crime and msoa_income datasets. I need to look carefully at the first 3 rows to understand their order and structure, then go through each sheet manually to decide how to combine the headers into a single layer (i.e. three rows of headers into one).
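For reference, the header rows could also be collapsed in code. The sketch below is a minimal illustration on a toy frame shaped like the output above, not the approach used in this notebook. `flatten_headers`, `_label` and the ` | ` separator are hypothetical names and choices, and the function assumes that a blank header cell means "same as the cell to its left", which would need checking sheet by sheet.

```python
import numpy as np
import pandas as pd


def _label(value):
    """Return a clean string label, or None for a blank header cell."""
    if pd.isna(value):
        return None
    if isinstance(value, float) and value.is_integer():
        return str(int(value))  # 2002.0 -> "2002"
    return str(value).strip()


def flatten_headers(raw, n_header_rows=2, sep=" | "):
    """Collapse the column index plus the first n_header_rows rows into one header row."""
    top = [None if str(c).startswith("Unnamed") else _label(c) for c in raw.columns]
    levels = [top] + [[_label(v) for v in raw.iloc[i]] for i in range(n_header_rows)]

    names = []
    carried = [None] * (len(levels) - 1)  # last label seen at each merged level
    for col in range(raw.shape[1]):
        for depth in range(len(carried)):
            if levels[depth][col] is not None:
                carried[depth] = levels[depth][col]
                # A new label at this level resets the deeper carried labels
                carried[depth + 1:] = [None] * (len(carried) - depth - 1)
        parts = [p for p in carried if p is not None]
        if levels[-1][col] is not None:
            parts.append(levels[-1][col])  # the last header row is never carried over
        names.append(sep.join(parts))

    flat = raw.iloc[n_header_rows:].reset_index(drop=True)
    flat.columns = names
    return flat


# Toy frame mimicking how pandas reads the first columns of iadatasheet1
raw = pd.DataFrame({
    "Unnamed: 0": [np.nan, "Codes", "E01000001"],
    "Unnamed: 1": [np.nan, "Names", "City of London 001A"],
    "Mid-year Population Estimates": ["All Ages", 2001, 1615],
    "Unnamed: 3": [np.nan, 2002.0, 1571.0],
})
flat = flatten_headers(raw)
print(list(flat.columns))
```

The combined names are long, so they would still need shortening by hand afterwards.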

1.3 Manual Approach to Handling Multi-level Headers:

In [39]:
# Iterating over each sheet, taking the first three rows to display them.
for sheet, census in lsoa_census_data.items():
    header = census.iloc[:3] 
    print(f"Sheet: {sheet}\n{header}") 

Sheet: iadatasheet1
  Unnamed: 0           Unnamed: 1 Mid-year Population Estimates  Unnamed: 3  \
0        NaN                  NaN                      All Ages         NaN   
1      Codes                Names                          2001      2002.0   
2  E01000001  City of London 001A                          1615      1571.0   

   Unnamed: 4  Unnamed: 5  Unnamed: 6  Unnamed: 7  Unnamed: 8  Unnamed: 9  \
0         NaN         NaN         NaN         NaN         NaN         NaN   
1      2003.0      2004.0      2005.0      2006.0      2007.0      2008.0   
2      1578.0      1559.0      1461.0      1474.0      1538.0      1504.0   

   ...                       Unnamed: 200  Unnamed: 201  Unnamed: 202  \
0  ...  % Dwellings in Council Tax Band H           NaN           NaN   
1  ...                               2005       2006.00       2007.00   
2  ...                               0.57          0.57          0.56   

   Unnamed: 203 Unnamed: 204  Unnamed: 205  Unnamed: 206  Hou

I want to simplify the sheets one by one, as much as I can. Working from the first sheet to the sixth, this is what I am going to remove:
1. First sheet:
   - Column 3 to 86 - I do not need every age band for multiple years. I will use the 2011 Census Population columns instead, which summarise age into a few bands.
   - Column 117 to 125 - 2011 Religion data is not of interest to me. But I will keep Ethnic groups/Country of Birth for now.
   - Column 137 to 151 - Vacant Dwelling/Stock Total is not needed as I am more interested in the Dwelling type (2011).
   - Column 152 to 207 - I do not need Dwelling by Council Tax Data
   - Column 208 to 210 - House price sales data, which is highly aggregated and not directly correlated with the levels of income in a given area.
2. Second sheet:
   - Column 3 to 44 - Crime counts. I am keeping the crime rates because they are normalised.
   - Column 103 - I do not think apprenticeships will give clear insights about income levels, and there are other interesting categories regarding qualifications and education levels.
   - Column 108 to 124 and 126 to 131 - Workplace Employment is not as insightful as Employment rates. The Claimant Count for Jobseeker's Allowance will be removed, except for May-11 and May-12.
3. I am removing the whole third sheet, because this is about Pensions etc.
4. Fourth sheet:
   - Column 3 to 72 - Disability Living Allowance, which is not of interest.
   - Column 143 to 168 - 2008/2010 data that is not within the range of interest (i.e. 2011/12)
   - Column 171 to 186 - 2005/2008, same reason as above
   - Column 228 to 251 - I am only interested in the total number of children whose parents have claimed benefits.
5. Fifth sheet:
   - Column 28 to 32 - Free School Meals can be removed except the total amount.
   - Column 34 to 51 - Pupil Absences will be removed except the number for Unauthorised Absence in All Schools and Persistent Absentees in All Schools.
   - Column 64 to 103 - Early Stages and KS1 will be removed to simplify the dataset, because older children will provide more insight into income levels and educational development. Therefore KS2 and onwards will be used as features.
   - Column 104 to 136 - KS2 will be removed except for Average Point Score of Pupils
   - Column 146 to 181 - GCSE columns will be removed except Average Capped GCSE, which is more useful to the research, as it disregards additional subjects and focuses on the top 9 performing subjects/grades of students.
   - Column 192 to 203 - A/AS Level removed except for Level 3 QCDA Point Score Per Student.
   - Remove all columns from Column 3 onwards that are not about "2011", "2012" or "2011/12"
6. Remove the whole of the sixth sheet because it is about 2010 data.

Many of these columns are removed to reduce multicollinearity, as a preliminary cleaning/preprocessing step.

In [41]:
# Defining a function that can clean each sheet 
def clean_sheet(lsoa_data, sheet_name):
    if sheet_name == 'iadatasheet1':
        # Dropping the columns by range
        lsoa_data = lsoa_data.drop(columns=lsoa_data.columns[2:86], errors='ignore')  # Columns 3 to 86
        lsoa_data = lsoa_data.drop(columns=lsoa_data.columns[116:125], errors='ignore')  # Columns 117 to 125
        lsoa_data = lsoa_data.drop(columns=lsoa_data.columns[136:210], errors='ignore')  # Columns 137 to 210

    elif sheet_name == 'iadatasheet2':
        lsoa_data = lsoa_data.drop(columns=lsoa_data.columns[2:44], errors='ignore')  # Columns 3 to 44
        lsoa_data = lsoa_data.drop(columns=[102], errors='ignore')  # Column 103
        lsoa_data = lsoa_data.drop(columns=lsoa_data.columns[107:124], errors='ignore')  # Columns 108 to 124
        lsoa_data = lsoa_data.drop(columns=lsoa_data.columns[125:131], errors='ignore')  # Columns 126 to 131

    elif sheet_name == 'iadatasheet3':
        # Removing the entire sheet
        return None

    elif sheet_name == 'iadatasheet4':
        lsoa_data = lsoa_data.drop(columns=lsoa_data.columns[2:72], errors='ignore')  # Columns 3 to 72
        lsoa_data = lsoa_data.drop(columns=lsoa_data.columns[142:168], errors='ignore')  # Columns 143 to 168
        lsoa_data = lsoa_data.drop(columns=lsoa_data.columns[170:186], errors='ignore')  # Columns 171 to 186
        lsoa_data = lsoa_data.drop(columns=lsoa_data.columns[227:251], errors='ignore')  # Columns 228 to 251, keep total for children

    elif sheet_name == 'iadatasheet5':
        lsoa_data = lsoa_data.drop(columns=lsoa_data.columns[27:32], errors='ignore')  # Columns 28 to 32
        lsoa_data = lsoa_data.drop(columns=lsoa_data.columns[33:51], errors='ignore')  # Columns 34 to 51
        lsoa_data = lsoa_data.drop(columns=lsoa_data.columns[63:103], errors='ignore')  # Columns 64 to 103
        lsoa_data = lsoa_data.drop(columns=lsoa_data.columns[103:136], errors='ignore')  # Columns 104 to 136
        lsoa_data = lsoa_data.drop(columns=lsoa_data.columns[145:181], errors='ignore')  # Columns 146 to 181
        lsoa_data = lsoa_data.drop(columns=lsoa_data.columns[191:203], errors='ignore')  # Columns 192 to 203

    elif sheet_name == 'iadatasheet6':
        # Removing the entire sheet
        return None

    # Removing the first column
    lsoa_data = lsoa_data.drop(columns=lsoa_data.columns[0], errors='ignore')
    
    return lsoa_data

# Iterating over sheets and cleaning them
lsoa_cleaned_sheets = {}
for sheet_name, lsoa_data in lsoa_census_data.items():
    lsoa_cleaned_data = clean_sheet(lsoa_data, sheet_name)
    if lsoa_cleaned_data is not None:
        lsoa_cleaned_sheets[sheet_name] = lsoa_cleaned_data

# Saving the cleaned sheets to a new Excel file
output_path = '../../data/processed/lsoa_data_cleaned.xlsx'
with pd.ExcelWriter(output_path) as writer:
    for sheet_name, lsoa_data in lsoa_cleaned_sheets.items():
        lsoa_data.to_excel(writer, sheet_name=sheet_name, index=False)

print(f"The cleaned sheets have been saved to {output_path} successfully")

The cleaned sheets have been saved to ../../data/processed/lsoa_data_cleaned.xlsx successfully
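One thing to watch with the function above: every `drop` call re-indexes the remaining columns, so a later slice such as `columns[116:125]` is measured on the already reduced frame rather than on the original sheet (the council tax columns that still appear in the following output would be consistent with that). Similarly, `drop(columns=[102])` looks for a column labelled `102` rather than the 103rd column. A minimal sketch of one alternative, assuming the ranges are the 1-indexed, inclusive column numbers from the list above; `drop_original_positions` is a hypothetical helper and the frame is a toy:

```python
import pandas as pd


def drop_original_positions(df, ranges):
    """Drop 1-indexed, inclusive column ranges, all measured on the original frame."""
    to_drop = set()
    for first, last in ranges:
        to_drop.update(range(first - 1, last))  # convert to 0-indexed positions
    keep = [pos for pos in range(df.shape[1]) if pos not in to_drop]
    return df.iloc[:, keep]


# Toy frame with ten columns named c1..c10
toy = pd.DataFrame([range(10)], columns=[f"c{i}" for i in range(1, 11)])

# Intended result: remove columns 3 to 4 and 7 to 8
one_pass = drop_original_positions(toy, [(3, 4), (7, 8)])
print(list(one_pass.columns))

# Chained positional drops, for comparison: the second slice is shifted by the first drop
chained = toy.drop(columns=toy.columns[2:4])
chained = chained.drop(columns=chained.columns[6:8])
print(list(chained.columns))
```

In the toy, the one-pass version removes c3, c4, c7 and c8, while the chained version removes c9 and c10 in its second step.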


To save time, I will rearrange the columns and headers manually, because doing this in Python code would take too long.

In [42]:
# Reading the cleaned Excel file and displaying the first three rows of each sheet
output_path = '../../data/processed/lsoa_data_cleaned.xlsx'

# Loading the cleaned sheets into a dictionary
cleaned_lsoa_data = pd.read_excel(output_path, sheet_name=None)

# Displaying the first three rows for each sheet
for sheet_name, lsoa_data in cleaned_lsoa_data.items():
    print(f"The first three rows of sheet: {sheet_name}")
    print(lsoa_data.head(3))
    print("\n" + "-" * 50 + "\n")

The first three rows of sheet: iadatasheet1
            Unnamed: 1 2011 Census Population Unnamed: 87 Unnamed: 88  \
0                  NaN          Age Structure         NaN         NaN   
1                Names               All Ages        0-15       16-29   
2  City of London 001A                   1465         115         216   

  Unnamed: 89 Unnamed: 90 Unnamed: 91  Unnamed: 92 Population Density (2011)  \
0         NaN         NaN         NaN          NaN       Persons per hectare   
1       30-44       45-64         65+  Working-age                       NaN   
2         379         487         268         1082                112.865948   

  Households (2011)  ...                       Unnamed: 191  \
0    All households  ...  % Dwellings in Council Tax Band F   
1               NaN  ...                               2010   
2               876  ...                              13.81   

                        Unnamed: 192                       Unnamed: 193  \
0  % Dwellings

This is looking better, but I will fill in the rest of the column headers in MS Excel and make sure the headers are left in a single row.

In [60]:
# Reading the manually edited Excel file again and displaying the first three rows of each sheet
output_path = '../../data/processed/lsoa_data_cleaned.xlsx'

# Loading the cleaned sheets into a dictionary
cleaned_lsoa_data = pd.read_excel(output_path, sheet_name=None)

# Displaying the first three rows for each sheet
for sheet_name, lsoa_data in cleaned_lsoa_data.items():
    print(f"The first three rows of sheet: {sheet_name}")
    print(lsoa_data.head(3))
    print("\n" + "-" * 50 + "\n")

The first three rows of sheet: iadatasheet1
   lsoa_code                names  all_ages_count_2011  \
0  E01000001  City of London 001A               1465.0   
1  E01000002  City of London 001B               1436.0   
2  E01000003  City of London 001C               1346.0   

   ages_65_plus_count_2011  working_age_count_2011  persons_per_hectare_2011  \
0                    268.0                  1082.0                112.865948   
1                    269.0                  1024.0                 62.872154   
2                    254.0                   988.0                227.749577   

   couples_with_children_percent_2011  couples_without_children_percent_2011  \
0                            7.648402                              24.771689   
1                           10.361446                              26.024096   
2                            6.242350                              16.156671   

   lone_parent_household_percent_2011  race_white_count_2011  ...  \
0           

This is what I refined:
- The first three rows are now consolidated into a single layer of relevant column names.
- Data for years other than 2011 and 2012 was removed.
- I changed the column names into lowercase and replaced the spaces between words with underscores.
- I extracted the metrics that are relevant to the research questions and project goals (i.e. crime, housing, education and employment rates for income estimates)
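The renaming convention in the list above (lowercase, underscores instead of spaces) was applied by hand in Excel. As a rough sketch of how the same convention could be applied in pandas, assuming the headers are already in a single row; `to_snake_case` is a hypothetical helper and the column names are toy examples:

```python
import pandas as pd


def to_snake_case(columns):
    """Lowercase the names and turn every run of non-alphanumeric characters into '_'."""
    return (
        pd.Index(columns)
        .str.strip()
        .str.lower()
        .str.replace(r"[^0-9a-z]+", "_", regex=True)
        .str.strip("_")
    )


toy = pd.DataFrame(columns=["Codes", "All Ages 2011", "Persons per hectare (2011)"])
toy.columns = to_snake_case(toy.columns)
print(list(toy.columns))
```

Symbols such as `%` are dropped by this pattern, so a word like "percent" would still have to be added by hand.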

1.3 Missing Values, Data Types, Unit Consistency, and Data Integrity
- Firstly, I will merge the sheets in the refined dataset.
- Then I will inspect the data types and check for duplicates, non-numerical values, irrelevant columns etc.

In [61]:
# Updating relevant sheets because some of them have been deleted
relevant_sheets = ['iadatasheet1', 'iadatasheet2', 'iadatasheet4', 'iadatasheet5']

# Merging the DataFrame sheets in the cleaned_lsoa_data dictionary
merged_data = cleaned_lsoa_data[relevant_sheets[0]]

for sheet in relevant_sheets[1:]:
    merged_data = merged_data.merge(
        cleaned_lsoa_data[sheet],
        how='left',
        on='lsoa_code',
        suffixes=('', f'_{sheet}')  # Handling the duplicate column names
    )


# Resolving the duplicate column names 
# Keeping the first occurrence of a column with duplicates
merged_data = merged_data.loc[:, ~merged_data.columns.duplicated()]

# Validating the merge (info() prints its summary directly and returns None)
merged_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4768 entries, 0 to 4767
Data columns (total 68 columns):
 #   Column                                           Non-Null Count  Dtype  
---  ------                                           --------------  -----  
 0   lsoa_code                                        4765 non-null   object 
 1   names                                            4768 non-null   object 
 2   all_ages_count_2011                              4768 non-null   float64
 3   ages_65_plus_count_2011                          4768 non-null   float64
 4   working_age_count_2011                           4768 non-null   float64
 5   persons_per_hectare_2011                         4768 non-null   float64
 6   couples_with_children_percent_2011               4768 non-null   float64
 7   couples_without_children_percent_2011            4768 non-null   float64
 8   lone_parent_household_percent_2011               4768 non-null   float64
 9   race_white_count_2011         
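The summary above shows three rows without an `lsoa_code`. pandas treats missing keys as matching one another in a merge, so rows without a code can be paired up and multiplied across sheets. A minimal sketch of a stricter merge, assuming rows without a code are not wanted; `merge_on_code` and `toy_rate_2011` are hypothetical, and the frames are toy data built from values shown in this notebook:

```python
import pandas as pd

sheet_a = pd.DataFrame({
    "lsoa_code": ["E01000001", "E01000002", None],
    "names": ["City of London 001A", "City of London 001B", "LSOA average for London"],
    "all_ages_count_2011": [1465.0, 1436.0, 1715.27],
})
sheet_b = pd.DataFrame({
    "lsoa_code": ["E01000001", "E01000002", None],
    "names": ["City of London 001A", "City of London 001B", "LSOA average for London"],
    "toy_rate_2011": [10.0, 20.0, 15.0],  # made-up values for illustration
})


def merge_on_code(left, right, key="lsoa_code"):
    """Left-merge two sheets on the LSOA code, ignoring rows that have no code."""
    left = left.dropna(subset=[key])
    # The repeated 'names' column is dropped so that no suffixed copy is created
    right = right.dropna(subset=[key]).drop(columns=["names"], errors="ignore")
    # validate raises a MergeError if a code appears more than once on either side
    return left.merge(right, how="left", on=key, validate="one_to_one")


merged_toy = merge_on_code(sheet_a, sheet_b)
print(merged_toy)
```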

Rows 43 to 60 of the output above share the same names because they are quarterly rates. I renamed them this way because I planned to aggregate them into a single value representing 2011/12, to simplify the dataset.

In [62]:
# Calculating the annual averages
merged_data['rate_of_people_income_support_2011_avg'] = merged_data[['rate_of_people_income_support_2011', 
                                                           'rate_of_people_income_support_2011.1', 
                                                           'rate_of_people_income_support_2011.2', 
                                                           'rate_of_people_income_support_2011.3']].mean(axis=1)

merged_data['rate_of_people_income_support_2012_avg'] = merged_data[['rate_of_people_income_support_2012', 
                                                           'rate_of_people_income_support_2012.1', 
                                                           'rate_of_people_income_support_2012.2', 
                                                           'rate_of_people_income_support_2012.3']].mean(axis=1)

# Aggregating into a single variable
merged_data['rate_of_people_income_support_2011_12'] = merged_data[['rate_of_people_income_support_2011_avg', 
                                                          'rate_of_people_income_support_2012_avg']].mean(axis=1)

# Dropping the original quarterly columns and placeholder columns
columns_to_drop = ['names_iadatasheet2', 'names_iadatasheet4', 'names_iadatasheet5', 
                   'rate_of_people_income_support_2011.1', 'rate_of_people_income_support_2011.2', 
                   'rate_of_people_income_support_2011.3', 'rate_of_people_income_support_2011.1', 
                   'rate_of_people_income_support_2011.2', 'rate_of_people_income_support_2011.3']
merged_data.drop(columns=columns_to_drop, inplace=True)
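The drop list above names the 2011 `.1` to `.3` columns twice, so the 2012 `.1` to `.3` columns stay in the frame. Building the column lists from a pattern avoids this kind of copy-paste slip. A minimal sketch on a one-row toy frame with made-up values, assuming the same naming scheme; `quarter_columns` is a hypothetical helper:

```python
import pandas as pd

base = "rate_of_people_income_support"


def quarter_columns(year):
    """Names of the four quarterly columns for one year: base, .1, .2, .3."""
    return [f"{base}_{year}"] + [f"{base}_{year}.{i}" for i in (1, 2, 3)]


# One-row toy frame: 2011 quarters are 1..4, 2012 quarters are 3..6
values = dict(zip(quarter_columns(2011) + quarter_columns(2012),
                  [1.0, 2.0, 3.0, 4.0, 3.0, 4.0, 5.0, 6.0]))
toy = pd.DataFrame([values])

# Annual averages, then a single 2011/12 value
for year in (2011, 2012):
    toy[f"{base}_{year}_avg"] = toy[quarter_columns(year)].mean(axis=1)
toy[f"{base}_2011_12"] = toy[[f"{base}_2011_avg", f"{base}_2012_avg"]].mean(axis=1)

# Dropping the suffixed quarterly columns for both years
suffixed = [c for year in (2011, 2012) for c in quarter_columns(year)[1:]]
toy = toy.drop(columns=suffixed)
print(toy.T)
```

Whether the unsuffixed first-quarter columns should also be dropped is a separate choice; the sketch keeps them, as the cell above does.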

1.4 KNN imputation was used instead of mean or median imputation, because it takes the relationships with other variables into account and distorts the variance less. Mean and median imputation rely only on a column's central tendency and do not adapt well to complex data such as this dataset. KNN imputation is also more realistic: the imputed values stay closer to a true representation of the data, which usually has some noise or outliers.
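To illustrate the difference on a small scale, the sketch below imputes one missing value in a toy frame with two strongly related columns. The column names and values are made up, and `n_neighbors=2` is chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# feature_b is roughly ten times feature_a; the last value of feature_b is missing
toy = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0, 8.0, 9.0, 10.0],
    "feature_b": [10.0, 20.0, 30.0, 80.0, 90.0, np.nan],
})

# Mean imputation uses the column average, ignoring feature_a entirely
mean_filled = SimpleImputer(strategy="mean").fit_transform(toy)

# KNN imputation averages feature_b over the rows whose feature_a is closest
knn_filled = KNNImputer(n_neighbors=2).fit_transform(toy)

mean_value = mean_filled[-1, 1]
knn_value = knn_filled[-1, 1]
print(f"mean imputation: {mean_value}, KNN imputation: {knn_value}")
```

Here the mean fills in 46, while KNN averages the two nearest rows (feature_a of 8 and 9) and fills in 85, which is much closer to the pattern in the data.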

In [64]:
# Before I begin KNN Imputing, I need to change some column names to lowercase and correct some mistakes
merged_data.rename(columns={'avg_lvl_3_QCDA_per_student_2011_12': 'avg_lvl_3_qcda_per_student_2011_12'}, inplace=True)
merged_data.rename(columns={'lone_parent_household_percent_2011_iadatasheet2': 'lone_parent_household_percent_2011'}, inplace=True)

In [67]:
from sklearn.impute import KNNImputer
import pandas as pd

# 1. Converting the non-numeric values in numeric columns to NaN
numeric_columns = ['avg_point_ks2_percent_2011', 'avg_point_ks2_percent_2012', 'avg_capped_gcse_points_per_pupil_2011_12', 'avg_lvl_3_qcda_per_student_2011_12']
for col in numeric_columns:
    merged_data[col] = pd.to_numeric(merged_data[col], errors='coerce')

# 2. Handling missing values with KNN imputation
# Select the numeric columns (exclude columns like the target variable if needed)
numeric_columns_for_imputation = merged_data.select_dtypes(include=['float64']).columns

# Initialising KNNImputer (you can adjust n_neighbors based on your data)
knn_imputer = KNNImputer(n_neighbors = 5)

# Applying KNN imputation to the numeric columns
merged_data[numeric_columns_for_imputation] = knn_imputer.fit_transform(merged_data[numeric_columns_for_imputation])

# 3. Converting object columns like 'child_out_of_work_benefit_percent_2011' to numeric (if applicable)
merged_data['child_out_of_work_benefit_percent_2011'] = pd.to_numeric(merged_data['child_out_of_work_benefit_percent_2011'], errors='coerce')

# Checking to see if there are any columns left as objects
object_columns = merged_data.select_dtypes(include=['object']).columns
print(f"Object columns remaining: {object_columns}")

# Checking to see if there are any missing values after imputation
missing_values_after_imputation = merged_data.isnull().sum()
print(f"Missing values after KNN imputation: \n{missing_values_after_imputation}")


Object columns remaining: Index(['lsoa_code', 'names', 'child_tax_credit_lone_parent_percent_2011',
       'child_out_of_work_benefit_percent_2012',
       'persistent_absentees__percent_2011_12'],
      dtype='object')
Missing values after KNN imputation: 
lsoa_code                                   3
names                                       0
all_ages_count_2011                         0
ages_65_plus_count_2011                     0
working_age_count_2011                      0
                                           ..
avg_lvl_3_qcda_per_student_2011_12          0
avg_capped_gcse_points_per_pupil_2011_12    0
rate_of_people_income_support_2011_avg      0
rate_of_people_income_support_2012_avg      0
rate_of_people_income_support_2011_12       0
Length: 65, dtype: int64
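One consideration for this step: `KNNImputer` finds neighbours with a Euclidean-type distance, so columns with large numeric ranges (counts in the thousands) weigh more than percentages when neighbours are chosen. A possible variation, sketched below on toy data and not what the cell above does, is to standardise the columns first and convert back afterwards. The column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({
    "toy_count": [1500.0, 1600.0, 2400.0, 2500.0, 1550.0],
    "toy_percent_a": [5.0, 6.0, 20.0, 21.0, 5.5],
    "toy_percent_b": [10.0, 11.0, 40.0, 42.0, np.nan],
})

# StandardScaler ignores NaNs when fitting and keeps them when transforming
scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

# Imputing in the standardised space, where every column has a comparable range
imputed_scaled = KNNImputer(n_neighbors=2).fit_transform(scaled)

# Converting back to the original units
imputed = pd.DataFrame(
    scaler.inverse_transform(imputed_scaled),
    columns=toy.columns,
    index=toy.index,
)
print(imputed)
```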


There are some missing values in the lsoa_code column. I want to check the bottom of the dataset, as I think it holds totals for every column except the first one.

In [69]:
# Displaying the last five rows
print(merged_data.tail())

      lsoa_code                    names  all_ages_count_2011  \
4763  E01004764         Westminster 013C          2410.000000   
4764  E01004765         Westminster 013D          2023.000000   
4765        NaN  LSOA average for London          1715.270409   
4766        NaN  LSOA average for London          1715.270409   
4767        NaN  LSOA average for London          1715.270409   

      ages_65_plus_count_2011  working_age_count_2011  \
4763               286.000000             1912.000000   
4764               190.000000             1647.000000   
4765               189.851731             1184.471668   
4766               189.851731             1184.471668   
4767               189.851731             1184.471668   

      persons_per_hectare_2011  couples_with_children_percent_2011  \
4763                 38.808374                            5.133333   
4764                 75.456919                            6.074343   
4765                 94.853380                          

Almost as I thought: the last three rows are LSOA averages for London rather than totals. I will delete these rows.

In [71]:
# Removing the last three rows
merged_data = merged_data.iloc[:-3]

# Checking to see if it changed or not
print(merged_data.tail())

      lsoa_code             names  all_ages_count_2011  \
4757  E01004758  Westminster 010D               1270.0   
4758  E01004759  Westminster 010E               1947.5   
4759  E01004760  Westminster 014E               1630.0   
4760  E01004761  Westminster 018D               1945.0   
4761  E01004762  Westminster 011E               2070.0   

      ages_65_plus_count_2011  working_age_count_2011  \
4757                     95.5                   819.5   
4758                    201.0                  1236.0   
4759                    160.0                  1184.0   
4760                    254.0                  1532.0   
4761                    323.0                  1531.0   

      persons_per_hectare_2011  couples_with_children_percent_2011  \
4757                151.100535                           20.020222   
4758                175.925926                           18.316195   
4759                227.019499                            9.208820   
4760                 37.3966
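The output above ends at index 4761 (E01004762), whereas the earlier tail ended with E01004765 at index 4764. That would be consistent with `iloc[:-3]` having been applied more than once, since each run of the cell removes three more rows. A minimal sketch of a filter that is safe to re-run, assuming the summary rows are exactly the rows without an `lsoa_code`; `drop_summary_rows` is a hypothetical helper and the frame is a toy:

```python
import pandas as pd

toy = pd.DataFrame({
    "lsoa_code": ["E01004764", "E01004765", None, None, None],
    "names": ["Westminster 013C", "Westminster 013D"] + ["LSOA average for London"] * 3,
})


def drop_summary_rows(df, key="lsoa_code"):
    """Keep only the rows that have an LSOA code."""
    return df[df[key].notna()].copy()


once = drop_summary_rows(toy)
twice = drop_summary_rows(once)  # running it again changes nothing
print(len(toy), len(once), len(twice))
```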

In [79]:
# Checking for missing values in each column
missing_values_per_column = merged_data.isnull().sum()

# Sending it to a csv file
missing_values_per_column.to_csv("../../data/raw/missing_values.csv")

This is what I saw was missing:
- child_out_of_work_benefit_percent_2011: 37 missing (float column)
- persistent_absentees__percent_2011_12: 1072 missing (object column of percentages)

I will apply KNN imputation again, but I have to convert the object column to numeric first.

In [83]:
# Taking an explicit copy so the assignments below modify an independent DataFrame, not a slice
merged_data = merged_data.copy()

# 1. Handling the missing values in 'child_out_of_work_benefit_percent_2011' (float column)
# Convert non-numeric values in this column to NaN
merged_data.loc[:, 'child_out_of_work_benefit_percent_2011'] = pd.to_numeric(merged_data['child_out_of_work_benefit_percent_2011'], errors='coerce')

# Applying KNN imputation to the column
knn_imputer = KNNImputer(n_neighbors = 5)
merged_data.loc[:, ['child_out_of_work_benefit_percent_2011']] = knn_imputer.fit_transform(merged_data[['child_out_of_work_benefit_percent_2011']])

# 2. Handling missing values in 'persistent_absentees__percent_2011_12' (object column)
# First, removing any '%' characters and converting to numeric
merged_data.loc[:, 'persistent_absentees__percent_2011_12'] = merged_data['persistent_absentees__percent_2011_12'].replace('%', '', regex=True)
merged_data.loc[:, 'persistent_absentees__percent_2011_12'] = pd.to_numeric(merged_data['persistent_absentees__percent_2011_12'], errors='coerce')

# Applying KNN imputation to the column
merged_data.loc[:, ['persistent_absentees__percent_2011_12']] = knn_imputer.fit_transform(merged_data[['persistent_absentees__percent_2011_12']])

# 3. Checking for missing values after imputation
# Checking missing values only for the specified columns
columns_of_interest = ['child_out_of_work_benefit_percent_2011', 'persistent_absentees__percent_2011_12']
missing_values_for_selected_columns = merged_data[columns_of_interest].isnull().sum()

print(f"Missing values for selected columns: \n{missing_values_for_selected_columns}")


Missing values for selected columns: 
child_out_of_work_benefit_percent_2011    0
persistent_absentees__percent_2011_12     0
dtype: int64
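A caveat on the cell above: when `KNNImputer` is fitted on a single column, a row with a missing value has no other feature from which to measure distance. As far as I understand scikit-learn's documented fallback, the column average is used in that case, so the result is equivalent to mean imputation. The toy sketch below contrasts this with fitting on the column together with a related one; the column names and values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

toy = pd.DataFrame({
    "toy_context": [1.0, 2.0, 3.0, 9.0, 10.0],
    "toy_target": [10.0, 20.0, 30.0, 90.0, np.nan],
})

# Fitted on the target column alone: no distances can be computed for the missing row
alone = KNNImputer(n_neighbors=2).fit_transform(toy[["toy_target"]])

# Fitted with a context column: the two nearest rows by toy_context are used
with_context = KNNImputer(n_neighbors=2).fit_transform(toy)

alone_value = alone[-1, 0]
context_value = with_context[-1, 1]
print(f"single column: {alone_value}, with context: {context_value}")
```

In the toy, the single-column fit returns 37.5, the mean of the observed values, while the fit with a context column returns 60, the average of the two nearest rows.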


Before Feature Engineering, I am going to convert the two remaining object columns to float columns.

In [85]:
# Converting 'child_tax_credit_lone_parent_percent_2011' and 'child_out_of_work_benefit_percent_2012' to float
merged_data['child_tax_credit_lone_parent_percent_2011'] = pd.to_numeric(merged_data['child_tax_credit_lone_parent_percent_2011'], errors='coerce')
merged_data['child_out_of_work_benefit_percent_2012'] = pd.to_numeric(merged_data['child_out_of_work_benefit_percent_2012'], errors='coerce')
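Because `errors='coerce'` silently turns anything it cannot parse into NaN, it can help to look at which values are affected before converting. A minimal sketch on a toy Series with made-up values (the name `toy_percent` is hypothetical):

```python
import pandas as pd

toy = pd.Series(["12.5", "7", "x", "n/a", None, "3.0"], name="toy_percent")

converted = pd.to_numeric(toy, errors="coerce")

# Values that were present as text but became NaN after the conversion
newly_missing = converted.isna() & toy.notna()
print(toy[newly_missing].value_counts())

print(f"Missing before: {toy.isna().sum()}, missing after: {converted.isna().sum()}")
```

Any column where the count of missing values grows after conversion would then need imputing or another treatment.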

In [86]:
merged_data.info()  # info() prints its summary directly and returns None

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4762 entries, 0 to 4761
Data columns (total 65 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   lsoa_code                                     4762 non-null   object 
 1   names                                         4762 non-null   object 
 2   all_ages_count_2011                           4762 non-null   float64
 3   ages_65_plus_count_2011                       4762 non-null   float64
 4   working_age_count_2011                        4762 non-null   float64
 5   persons_per_hectare_2011                      4762 non-null   float64
 6   couples_with_children_percent_2011            4762 non-null   float64
 7   couples_without_children_percent_2011         4762 non-null   float64
 8   lone_parent_household_percent_2011            4762 non-null   float64
 9   race_white_count_2011                         4762 non-null   f

So all the columns are float columns except the first two columns, which is fine. I will save the lsoa_census data (i.e. merged_data) in the processed data folder.

In [87]:
# Saving the lsoa_census (merged_data) to the processed folder
lsoa_census_path = '../../data/processed/lsoa_census.csv'
merged_data.to_csv(lsoa_census_path, index=False)