# Preparation of 2011 Census Data
Load several datasets from the 2011 UK Census.
- Age structure: The number of people in each age group in each area.
- Qualification: The number of people with each level of qualification in each area.
- Health: The number of people in each health category in each area.
- Occupation: The number of people in each occupation category in each area.

Then merge the datasets and use different algorithms to project the data into 2D space.

## 1. Restructure the datasets
Restructure the datasets to make them easier to work with. Rename the columns and create new columns to group the data in a more useful way.

In [2]:
import numpy as np
import pandas as pd

In [49]:
# load age data for 2011
age_structure_df=pd.read_csv('E:\\Bristol\\VA\\DATA\\2011\\Age_structure.csv')

# Sum the source age columns into a 0-14 band
age_structure_df['0-14'] = (age_structure_df['Age: Age 0 to 4; measures: Value'] +
                            age_structure_df['Age: Age 5 to 7; measures: Value'] +
                            age_structure_df['Age: Age 8 to 9; measures: Value'] +
                            age_structure_df['Age: Age 10 to 14; measures: Value'])

# Sum the source age columns into a 15-24 band
age_structure_df['15-24'] = (age_structure_df['Age: Age 15; measures: Value'] +
                             age_structure_df['Age: Age 16 to 17; measures: Value'] +
                             age_structure_df['Age: Age 18 to 19; measures: Value'] +
                             age_structure_df['Age: Age 20 to 24; measures: Value'])

# Sum the source age columns into a 25-44 band
age_structure_df['25-44'] = (age_structure_df['Age: Age 25 to 29; measures: Value'] +
                             age_structure_df['Age: Age 30 to 44; measures: Value'])

# Sum the source age columns into a 45-64 band
age_structure_df['45-64'] = (age_structure_df['Age: Age 45 to 59; measures: Value'] +
                             age_structure_df['Age: Age 60 to 64; measures: Value'])

# Sum the source age columns into a 65+ band
age_structure_df['65+'] = (age_structure_df['Age: Age 65 to 74; measures: Value'] +
                           age_structure_df['Age: Age 75 to 84; measures: Value'] +
                           age_structure_df['Age: Age 85 to 89; measures: Value'] +
                           age_structure_df['Age: Age 90 and over; measures: Value'])

# Shorten the names of the mean and median age columns
age_structure_df.rename(columns={'Age: Mean Age; measures: Value': 'Mean Age', 'Age: Median Age; measures: Value': 'Median Age'}, inplace=True)

# Select the relevant columns; .copy() gives an independent DataFrame rather than a slice of the original
age_restructured = age_structure_df[['geography','geography code', '0-14', '15-24', '25-44', '45-64', '65+', 'Mean Age', 'Median Age']].copy()

# Prefix the age-band columns with 'Age: '
age_restructured.columns = ['geography', 'geography code', 'Age: 0-14', 'Age: 15-24', 'Age: 25-44', 'Age: 45-64', 'Age: 65+', 'Mean Age', 'Median Age']

# save the restructured DataFrame
# age_restructured.to_csv('./datasets/Age_structure.csv', index=False)
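
The band sums above can also be driven by a mapping from band label to source columns, which keeps the grouping in one place. The following is a minimal sketch, assuming the same column naming pattern as in the cell above. The `AGE_BANDS` dictionary, the helpers `source_column` and `add_age_bands`, and the toy rows are illustrative and not part of the original notebook.

```python
import pandas as pd

# Hypothetical mapping: band label -> census age ranges that belong to it
AGE_BANDS = {
    '0-14': ['0 to 4', '5 to 7', '8 to 9', '10 to 14'],
    '15-24': ['15', '16 to 17', '18 to 19', '20 to 24'],
    '25-44': ['25 to 29', '30 to 44'],
    '45-64': ['45 to 59', '60 to 64'],
    '65+': ['65 to 74', '75 to 84', '85 to 89', '90 and over'],
}

def source_column(age_range):
    """Build the source column name for one census age range."""
    return f'Age: Age {age_range}; measures: Value'

def add_age_bands(df, bands=AGE_BANDS):
    """Return a copy of df with one summed column per age band."""
    out = df.copy()
    for label, ranges in bands.items():
        out[label] = out[[source_column(r) for r in ranges]].sum(axis=1)
    return out

# Toy data (made up): two areas, every source column holds 1 and 2
all_ranges = [r for ranges in AGE_BANDS.values() for r in ranges]
toy = pd.DataFrame({source_column(r): [1, 2] for r in all_ranges})
banded = add_age_bands(toy)
print(banded[list(AGE_BANDS)])
```

One difference from the explicit `+` version: `sum(axis=1)` skips missing values by default, whereas `+` propagates them, so the two only agree when the source columns have no gaps.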

In [28]:
# load qualification data for 2011
qualification_df=pd.read_csv('E:\\Bristol\\VA\\DATA\\2011\\Qualification.csv')
# rename columns
qualification_df.columns = ['date', 'geography', 'geography code','Rural', 'All categories: Highest level of qualification', 'Qualification: No','Qualification: Level 1', 'Qualification: Level 2', 'Qualification: Apprenticeship', 'Qualification: Level 3', 'Qualification: Level 4 and above', 'Qualification: Other']

restructured_qualification = qualification_df[['geography', 'geography code', 'Qualification: No','Qualification: Level 1', 'Qualification: Level 2', 'Qualification: Apprenticeship', 'Qualification: Level 3', 'Qualification: Level 4 and above', 'Qualification: Other']]

# save the restructured DataFrame
# restructured_qualification.to_csv('./datasets/Qualification.csv', index=False)

In [35]:
# load health data for 2011
health_df=pd.read_csv('E:\\Bristol\\VA\\DATA\\2011\\Health.csv')

health_df=health_df[['geography', 'geography code', 'General Health: Very good health; measures: Value', 'General Health: Good health; measures: Value', 'General Health: Fair health; measures: Value', 'General Health: Bad health; measures: Value', 'General Health: Very bad health; measures: Value']]

health_df.columns = ['geography', 'geography code', 'Health: Very good', 'Health: Good', 'Health: Fair', 'Health: Bad', 'Health: Very bad']

# health_df.to_csv('./datasets/Health.csv', index=False)

In [46]:
# load occupation data for 2011
occupation_df=pd.read_csv('E:\\Bristol\\VA\\DATA\\2011\\Occupation_Region.csv')
occupation_df=occupation_df.drop(columns=['date', 'Rural Urban'])

# rename columns
new_column_names = [
    'geography', 
    'geography code', 
    'All_Occupation: All categories',
    'All_Occupation: Managers, directors and senior officials', 
    'All_Occupation: Professional occupations', 
    'All_Occupation: Associate professional and technical occupations', 
    'All_Occupation: Administrative and secretarial occupations', 
    'All_Occupation: Skilled trades occupations', 
    'All_Occupation: Caring, leisure and other service occupations', 
    'All_Occupation: Sales and customer service occupations', 
    'All_Occupation: Process, plant and machine operatives', 
    'All_Occupation: Elementary occupations',
    'Male_Occupation: All categories',
    'Male_Occupation: Managers, directors and senior officials',
    'Male_Occupation: Professional occupations',
    'Male_Occupation: Associate professional and technical occupations',
    'Male_Occupation: Administrative and secretarial occupations',
    'Male_Occupation: Skilled trades occupations',
    'Male_Occupation: Caring, leisure and other service occupations',
    'Male_Occupation: Sales and customer service occupations',
    'Male_Occupation: Process, plant and machine operatives',
    'Male_Occupation: Elementary occupations',
    'Female_Occupation: All categories',
    'Female_Occupation: Managers, directors and senior officials',
    'Female_Occupation: Professional occupations',
    'Female_Occupation: Associate professional and technical occupations',
    'Female_Occupation: Administrative and secretarial occupations',
    'Female_Occupation: Skilled trades occupations',
    'Female_Occupation: Caring, leisure and other service occupations',
    'Female_Occupation: Sales and customer service occupations',
    'Female_Occupation: Process, plant and machine operatives',
    'Female_Occupation: Elementary occupations'
]
occupation_df.columns = new_column_names

# occupation_df.to_csv('./datasets/Occupation.csv', index=False)
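
The occupation column list repeats the same ten categories for three groups, so it can also be generated rather than typed out. This is a minimal sketch that reproduces the same names in the same order; `GROUPS`, `CATEGORIES`, and `occupation_column_names` are illustrative names, not part of the original notebook.

```python
from itertools import product

GROUPS = ['All', 'Male', 'Female']
CATEGORIES = [
    'All categories',
    'Managers, directors and senior officials',
    'Professional occupations',
    'Associate professional and technical occupations',
    'Administrative and secretarial occupations',
    'Skilled trades occupations',
    'Caring, leisure and other service occupations',
    'Sales and customer service occupations',
    'Process, plant and machine operatives',
    'Elementary occupations',
]

def occupation_column_names():
    """Return the two geography keys followed by one name per (group, category) pair."""
    names = ['geography', 'geography code']
    # product() varies the category fastest, matching the hand-written order
    names += [f'{group}_Occupation: {category}'
              for group, category in product(GROUPS, CATEGORIES)]
    return names

generated_names = occupation_column_names()
print(len(generated_names))
```

Because the names are assigned by position, it is worth checking that `len(generated_names)` equals `len(occupation_df.columns)` before assigning, and that the CSV columns really appear in this order.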

## 2. Merge the datasets
Merge the restructured datasets to create a single dataset for projection.

In [51]:
# Load the restructured datasets
age=pd.read_csv('./datasets2011/Age_structure.csv')
qualification=pd.read_csv('./datasets2011/Qualification.csv')
health=pd.read_csv('./datasets2011/Health.csv')
occupation=pd.read_csv('./datasets2011/Occupation.csv')

# Merge the datasets
merged = pd.merge(age, qualification, on=['geography', 'geography code'])
merged = pd.merge(merged, health, on=['geography', 'geography code'])
merged = pd.merge(merged, occupation, on=['geography', 'geography code'])

# merged.to_csv('./datasets/Merged.csv', index=False)
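
`pd.merge` defaults to an inner join, so an area missing from any one table is dropped without warning. The sketch below shows one way to surface that, using the `validate` and `indicator` parameters of `pd.merge` on two small made-up tables; the toy frames and the names `KEYS`, `checked`, and `unmatched` are illustrative only.

```python
import pandas as pd

KEYS = ['geography', 'geography code']

# Toy stand-ins for two of the restructured tables (values are made up)
age_toy = pd.DataFrame({
    'geography': ['Area A', 'Area B'],
    'geography code': ['X1', 'X2'],
    'Age: 0-14': [10, 20],
})
health_toy = pd.DataFrame({
    'geography': ['Area A', 'Area B'],
    'geography code': ['X1', 'X2'],
    'Health: Good': [7, 9],
})

# validate raises MergeError if a key is duplicated on either side;
# the outer join plus indicator keeps unmatched rows and labels them
checked = pd.merge(age_toy, health_toy, on=KEYS, how='outer',
                   validate='one_to_one', indicator=True)
unmatched = checked[checked['_merge'] != 'both']
print(f'{len(unmatched)} area(s) present in only one table')

# Drop the helper column once the check has passed
merged_toy = checked.drop(columns='_merge')
```

If `unmatched` is empty, the outer join and the inner join used above give the same rows.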

In [5]:
merged=pd.read_csv('./datasets2011/Merged.csv')
merged.isnull().sum()  # there are no missing values

geography                                                              0
geography code                                                         0
Age: 0-14                                                              0
Age: 15-24                                                             0
Age: 25-44                                                             0
Age: 45-64                                                             0
Age: 65+                                                               0
Mean Age                                                               0
Median Age                                                             0
Qualification: No                                                      0
Qualification: Level 1                                                 0
Qualification: Level 2                                                 0
Qualification: Apprenticeship                                          0
Qualification: Level 3                             

## 3. Projection
Project the merged dataset into 2D space using several methods, such as PCA, t-SNE, and UMAP.
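
As an illustration of the first of these methods, the sketch below implements PCA directly with NumPy's SVD on standardized features. It is not the notebook's projection code: the function `pca_2d` and the random toy matrix are assumptions made for the example, and in practice a library implementation would normally be used. t-SNE and UMAP are not shown.

```python
import numpy as np

def pca_2d(features):
    """Project rows of `features` onto their first two principal components."""
    X = np.asarray(features, dtype=float)
    # Standardize each column so large-count columns do not dominate
    X = X - X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns at zero instead of dividing by 0
    X = X / std
    # Rows of Vt are the principal directions, ordered by explained variance
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T

# Toy stand-in for the numeric columns of the merged table (random values)
rng = np.random.default_rng(0)
toy_features = rng.normal(size=(50, 6))
embedding = pca_2d(toy_features)
print(embedding.shape)  # (50, 2)
```

For the merged census table, the input would be the numeric columns only, with `geography` and `geography code` kept aside as labels for the projected points.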