## Merging the two datasets before feature engineering

1. Merge the lsoa_census data with msoa_income_net_income_after_housing_costs.csv

The key for linking the two datasets is the names column (LSOA) and the msoa_name column (MSOA). An LSOA name is its MSOA name plus a trailing letter (for example, City of London 001A belongs to City of London 001), so the prefix of names should match msoa_name. Some MSOAs and LSOAs do not match, however, because of the many boundary rearrangements at the end of 2011. I therefore extracted the rows that do not match into a separate CSV file, and I will inspect them individually using https://findthatpostcode.uk/, which lists both active and inactive MSOAs and LSOAs. Some MSOAs were split into two, so their LSOAs will share the same income values. Other LSOAs will carry the average of two MSOAs that were consolidated in 2011.
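The prefix matching described above can be tried out on a toy example first. This is a minimal sketch, not the notebook's actual code: the two small tables and the `income` column are made-up stand-ins, and using an outer merge with `indicator=True` is just one possible way to see unmatched rows on both sides at once.

```python
import pandas as pd

# Toy stand-ins for the real tables; the rows and the 'income' column are made up.
lsoa = pd.DataFrame({
    "lsoa_code": ["E01000001", "E01000002", "E01000003"],
    "names": ["City of London 001A", "City of London 001B", "Camden 029C"],
})
msoa = pd.DataFrame({
    "msoa_name": ["City of London 001", "Camden 028"],
    "income": [900.0, 600.0],
})

# Drop the trailing LSOA letter: 'City of London 001A' -> 'City of London 001'.
lsoa["msoa_prefix"] = lsoa["names"].str.extract(r"^(.* \d+)", expand=False)

# indicator=True adds a '_merge' column recording which table each row came from.
merged = lsoa.merge(
    msoa,
    left_on="msoa_prefix",
    right_on="msoa_name",
    how="outer",
    indicator=True,
)

# LSOAs that found no MSOA income row, and MSOA rows that matched no LSOA.
lsoa_without_income = merged[merged["_merge"] == "left_only"]
msoa_without_lsoa = merged[merged["_merge"] == "right_only"]

print(lsoa_without_income[["names", "msoa_prefix"]])
print(msoa_without_lsoa[["msoa_name", "income"]])
```

Both unmatched sets are worth checking: an MSOA with no LSOA and an LSOA with no income value can each point to a boundary change.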

In [6]:
# Loading the data
msoa_income_path = '../../data/processed/msoa_income_net_income_after_housing_costs.csv'
msoa_income_data = pd.read_csv(msoa_income_path)

# Extract the MSOA prefix from 'names' in the LSOA data
# (e.g. 'City of London 001A' -> 'City of London 001')
merged_data['msoa_prefix'] = merged_data['names'].str.extract(r'^(.* \d+)', expand=False)

# Merge based on the prefix
merged_data_with_income = pd.merge(
    merged_data, 
    msoa_income_data, 
    left_on='msoa_prefix', 
    right_on='msoa_name', 
    how='left'
)

# Extracting unmatched rows from the MSOA income data
unmatched_msoa = msoa_income_data[~msoa_income_data['msoa_name'].isin(merged_data['msoa_prefix'].unique())]

# Saving the results
merged_data_path = '../../data/processed/lsoa_census_merged_with_income.csv'
unmatched_msoa_path = '../../data/processed/unmatched_msoa.csv'

# Saving the merged data
merged_data_with_income.to_csv(merged_data_path, index=False)

# Saving the unmatched MSOA data
unmatched_msoa.to_csv(unmatched_msoa_path, index=False)

# Displaying the outputs to check
print("Merged Data Preview:")
print(merged_data_with_income.head())

print("\nUnmatched MSOA Data Preview:")
print(unmatched_msoa.head())

Merged Data Preview:
   lsoa_code                names  all_ages_count_2011  \
0  E01000001  City of London 001A               1465.0   
1  E01000002  City of London 001B               1436.0   
2  E01000003  City of London 001C               1346.0   
3  E01000004  City of London 001D               2143.0   
4  E01000005  City of London 001E                985.0   

   ages_65_plus_count_2011  working_age_count_2011  persons_per_hectare_2011  \
0                    268.0                  1082.0                112.865948   
1                    269.0                  1024.0                 62.872154   
2                    254.0                   988.0                227.749577   
3                    117.0                  1932.0                  9.354402   
4                    127.0                   694.0                 51.951477   

   couples_with_children_percent_2011  couples_without_children_percent_2011  \
0                            7.648402                              24

After I went through each unmatched row, the missing values in the merged dataset were filled in at their correct places.
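The fixes themselves were worked out by hand, but the two rules mentioned earlier (a split MSOA passes its value on, a consolidated MSOA takes the average of its two sources) could be applied with a lookup table. The sketch below is purely illustrative: every area name, the `income` column, and the `manual_lookup` dictionary are hypothetical and do not come from the real data.

```python
import pandas as pd

# Hypothetical rows: one matched LSOA and two that are still missing income.
merged = pd.DataFrame({
    "names": ["Example 001A", "Example 002A", "Example 003A"],
    "msoa_prefix": ["Example 001", "Example 002", "Example 003"],
    "income": [500.0, None, None],
})

# Hypothetical income values of the old (pre-change) MSOAs.
old_income = {"Old 010": 400.0, "Old 011": 420.0, "Old 012": 460.0}

# Manual lookup built from inspecting each unmatched row:
# new MSOA prefix -> the old MSOA(s) it was derived from.
manual_lookup = {
    "Example 002": ["Old 010"],             # split: inherits the old value
    "Example 003": ["Old 011", "Old 012"],  # consolidation: average of both
}

# Averaging a single source leaves it unchanged, so one rule covers both cases.
resolved = {
    prefix: sum(old_income[name] for name in sources) / len(sources)
    for prefix, sources in manual_lookup.items()
}

# Only fill rows that are still missing; already matched rows stay untouched.
merged["income"] = merged["income"].fillna(merged["msoa_prefix"].map(resolved))
print(merged)
```

Keeping the lookup in code, rather than editing the CSV directly, would make the manual decisions reproducible.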

In [8]:
# Reloading the updated dataset
updated_data_path = '../../data/processed/lsoa_census_merged_with_income.csv'
merged_data_with_income = pd.read_csv(updated_data_path)

print("Updated dataset loaded successfully.")

# Checking for missing values
missing_values = merged_data_with_income.isnull().sum()

# Displaying columns with missing values
print("Columns with missing values:")
print(missing_values[missing_values > 0])

Updated dataset loaded successfully.
Columns with missing values:
child_tax_credit_lone_parent_percent_2011    1
child_out_of_work_benefit_percent_2012       1
dtype: int64


Using KNN imputation to fill the remaining missing values
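To show what KNN imputation does, here is a small sketch on made-up numbers (the columns `a` and `b` are hypothetical). `KNNImputer` finds the rows closest to the incomplete row, measured on the features that are present, and fills the gap with the mean of those neighbors' values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data: row 2 is missing its 'b' value.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 10.0],
    "b": [1.0, 2.0, np.nan, 10.0],
})

# With n_neighbors=2, the two rows nearest to row 2 (judged on 'a') are rows 1 and 0,
# so the missing 'b' becomes the mean of their 'b' values.
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```

Because distances are computed on the raw values, columns with large ranges (such as population counts) can dominate the neighbor search. Scaling the features before imputing is a common option if that becomes a concern.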

In [9]:
from sklearn.impute import KNNImputer

# Creating a KNN imputer instance
imputer = KNNImputer(n_neighbors=5)  # You can adjust the number of neighbors if needed

# Applying KNN Imputation (excluding non-numeric columns)
numeric_columns = merged_data_with_income.select_dtypes(include=['float64', 'int64']).columns
merged_data_with_income[numeric_columns] = imputer.fit_transform(merged_data_with_income[numeric_columns])

# Displaying completion message
print("KNN imputation completed.")


# Generating the descriptive statistics for numeric columns
summary_stats = merged_data_with_income.describe()

print("Quick Statistical Summary:")
print(summary_stats)

KNN imputation completed.
Quick Statistical Summary:
       all_ages_count_2011  ages_65_plus_count_2011  working_age_count_2011  \
count          4762.000000              4762.000000             4762.000000   
mean           1714.973436               189.850168             1184.058484   
std             325.491626                75.091826              285.467505   
min             623.250000                24.000000              387.750000   
25%            1536.000000               135.000000             1012.000000   
50%            1661.000000               178.000000             1131.500000   
75%            1826.750000               234.000000             1287.750000   
max            6289.000000               599.000000             5023.000000   

       persons_per_hectare_2011  couples_with_children_percent_2011  \
count               4762.000000                         4762.000000   
mean                  94.873933                           18.558813   
std                   

Looking at the statistical summary, I can see that the confidence intervals are consistent with the net income after housing costs. Some columns have a minimum value of 0, which could indicate outliers. I will do more checks in the feature engineering part of my notebook (feature_poc.ipynb).
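A quick way to see which columns contain zeros, and how many, is sketched below. The small DataFrame and its column names are made-up stand-ins. In the notebook the same two lines would run on merged_data_with_income.

```python
import pandas as pd

# Hypothetical stand-in for the merged dataset.
df = pd.DataFrame({
    "population": [1500, 1700, 1600],
    "benefit_percent": [0.0, 3.2, 0.0],
    "income": [520.0, 610.0, 480.0],
})

# Count exact zeros per numeric column and keep only columns that have any.
numeric = df.select_dtypes(include="number")
zero_counts = (numeric == 0).sum()
zero_counts = zero_counts[zero_counts > 0]

print(zero_counts)
```

A zero in a percentage column may be a genuine value, while a zero in a count or income column is more likely to need a closer look.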