# GG4257 - Urban Analytics: A Toolkit for Sustainable Urban Development
## Lab Assignment No 2: Handling and GeoVisualisation of Urban Data

# General Information
- **Student ID No:** 200013533
- **Degree Programme:** Geography MA
- **Deadline Date:** 4 April 2024

# GitHub Repository
- **GitHub Link:** https://github.com/IskrenDinev/UA_Assignment_2

# Declaration

In submitting this assignment, I hereby confirm that I have read the University's statement on Good Academic Practice. The following work is my own. Significant academic debts and borrowings have been properly acknowledged and referenced.

## How To Read This WorkBook
Structure: This workbook is divided according to the Lab Workbooks completed in the labs, and within each, the challenges are clearly labelled (the workbook can easily be navigated using the table of contents). For each challenge, the question/challenge goals are copied and pasted exactly from the workbooks and labelled as '*TASK*'. If the question/challenge does not require additional code from the workbook to complete, I simply write my code under it, with the code description located under the code. However, if the question/challenge does require additional code from the workbook which is not mine, below the question/challenge you will see '*CODE NEEDED FOR CHALLENGE*'; this code has been copied and pasted directly from the workbook. My own response/code is located after this and labelled as '*ANSWER TO CHALLENGE*'. It is important to remember that my description of the code is located under the corresponding code; there should be a description under every code cell.

Descriptions: If you see (described before) in the descriptions for the code, then I have gone into more detail before on the same type of method or code. For instance, I created a GeoDataFrame multiple times, therefore to prevent repetition, I have only described it with detail once or twice before.

## Data for This WorkBook
A file for the data needed for this workbook is found on a Google Drive (will need to be downloaded): https://drive.google.com/drive/folders/13ROZyEd9n7Vrf3ut-X2omaKbuZ_HvWZO?usp=sharing 

Make sure that, once downloaded, the data folder is renamed to "data" within the repo.

## Lab Workbook 6

### Challenge 1

It's time for you to apply everything you learned by analyzing a case study of GitHub's collaborator network data.

- **Data**: `github_users.p` (available in Moodle)

> This dataset is a GitHub user collaboration network. As you already know, GitHub is a social coding site where users can collaborate on code repositories. In this network, nodes are users, and edges indicate that two users are collaborators on at least one GitHub repository.

1. Read the GitHub network dataset.
2. Describe using the basic functions of the graph's size. Explore nodes and edges. Provide how many nodes and edges are present in the network
3. Calculate the **degree centrality** of the GitHub collaboration network G. Using the .values() method of the network (e.g. G), extract the degree centrality values and convert them into a list. Then, plot a histogram to visualize the distribution of node degrees in the network.
4. Make a subset of the initial network (e.g. Gh_sub), where you include at least five nodes and their corresponding edges. Experiment with multiple nodes so you have a graph with enough edges to work on.
5. Plot the subset graph created.
6. Now calculate another relevant measure of the network -- **betweenness centrality**. Plot the betweenness centrality distribution of the subset you created. Tip: Same steps from the previous step, but use `nx.betweenness_centrality()`
7. Plot the Matrix, Arc and Circos from the subset.

In [None]:
#Libraries needed for this part
import pickle

#Reading the GitHub network dataset
with open('data/network_analysis/github_users.p', 'rb') as f:
    G = pickle.load(f)
G

The code above imports pickle and then reads the GitHub network dataset. This is named G

In [None]:
#Finding the size of G
print(len(G))

#Type of node
print(type(G.nodes()))
#Attribute associated with the first element of the node list
print(list(G.nodes(data=True))[0])

#Type of edge
print(type(G.edges()))
#Attribute associated with the first element of the edge list
print(list(G.edges(data=True))[0])

The code above aims to explore basic information about the network by printing the first element from the node and edge lists. The size of G is also found, which is 56519. The attribute associated with edge is bipartite

In [None]:
num_nodes = G.number_of_nodes()
num_edges = G.number_of_edges()
print(f'Number of nodes: {num_nodes}')
print(f'Number of edges: {num_edges}')

The code above calculates the number of edges and nodes in the dataset. As you can see, the number of edges is 72900 and the number of nodes is 56519.

In [None]:
#Importing libraries for this part of the code
import networkx as nx
import matplotlib.pyplot as plt

#Getting the degree centrality
degree_centrality = nx.degree_centrality(G)

#Creating a degree centrality list with the .values() method
degree_centrality_values = list(degree_centrality.values())

# Plot a histogram of the centrality
plt.figure(figsize=(6, 4))
plt.hist(degree_centrality_values, bins=40, color='#FF4500', alpha=0.7)
plt.title('Centrality')
plt.xlabel('Centrality')
plt.ylabel('Frequency')
plt.show()

The code above calculates the degree centrality of the GitHub collaboration network G. Firstly, the nx.degree_centrality() function is used to find the degree centrality of every node, which is stored in degree_centrality. Next, using the .values() method, the degree centrality results are converted into a list. The results are then plotted as a histogram using plt.hist(). From the histogram, it is clear that the degree centrality of most nodes is very low, indicating that each user collaborates with only a very small fraction of the other users in the network.
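
To make the measure concrete, the sketch below computes degree centrality by hand as a node's degree divided by the number of other nodes (n - 1), which is the same normalisation nx.degree_centrality() applies. The edge list is a hypothetical toy network, not the GitHub data.

```python
from collections import Counter

# Hypothetical toy collaboration network (undirected edge list)
edges = [("u1", "u2"), ("u1", "u3"), ("u1", "u4"), ("u4", "u5")]

# Count how many edges touch each node
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# Degree centrality = degree / (number of other nodes)
n = len(degree)
degree_centrality = {node: d / (n - 1) for node, d in degree.items()}
print(degree_centrality)
```

With 56519 nodes the denominator is 56518, so even a user with 10 collaborators would score only about 0.00018, which is why the histogram is concentrated near zero.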

In [None]:
#Making a subset called Gh_sub where I include at least 5 nodes and corresponding edges. 
#Listing to check what nodes I can select
list(G.edges)

#Selecting the 5 nodes
edges_from_G=G.edges(['u10','u11','u14','u17','u31'])

#Creating an empty graph so to put the selected nodes
Gh_sub = nx.DiGraph()

#Adding the selected nodes into the empty graph
Gh_sub.add_edges_from(edges_from_G)

The code above selects 5 nodes with their corresponding edges. This was made into a subset called Gh_sub.

In [None]:
#Displaying the newly created graph with the nodes and edges
plt.figure(figsize=(8, 8))
nx.draw(Gh_sub, with_labels=True)
plt.show()

The code above displays the newly created subset of nodes and edges. A figure was created using the .figure() method and then drawn using the .draw() method. The graph shows which nodes have the most edges. For instance, u10 has fewer edges than u11.

In [None]:
betweenness_centrality_Gh = nx.betweenness_centrality(Gh_sub)
betweenness_centrality_Gh

The code above uses the .betweenness_centrality() method to find the betweenness centrality of the nodes

In [None]:
#Creating a betweenness centrality list with the .values() method
betweenness_centrality_Gh_values = list(betweenness_centrality_Gh.values())

# Plot a histogram of the centrality
plt.figure(figsize=(6, 4))
plt.hist(betweenness_centrality_Gh_values, bins=40, color='#FF4500', alpha=0.7)
plt.title('Centrality')
plt.xlabel('Centrality')
plt.ylabel('Frequency')
plt.show()

The code above aims to graph the betweenness centrality values. Firstly, the betweenness centrality values are converted into a list. A histogram is then created using the .hist() method. A title and labels are also added. As you can see, centrality is virtually 0 for all nodes.
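
One possible reason for the near-zero values, offered as an assumption rather than a verified diagnosis: Gh_sub is a DiGraph filled from G.edges([...]), so most of its edges point away from the selected nodes, and a node can only lie between two others if paths both enter and leave it. The sketch below uses a hypothetical four-node toy graph (not the GitHub data) to compare betweenness centrality on the same edges stored as a directed and as an undirected graph.

```python
import networkx as nx

# Hypothetical toy edges, all starting from one selected node "a"
edges_from_a = [("a", "b"), ("a", "c"), ("a", "d")]

# Same edges stored as a directed graph (as in Gh_sub) ...
directed = nx.DiGraph()
directed.add_edges_from(edges_from_a)

# ... and as an undirected graph (collaboration has no direction)
undirected = nx.Graph()
undirected.add_edges_from(edges_from_a)

bc_directed = nx.betweenness_centrality(directed)
bc_undirected = nx.betweenness_centrality(undirected)
print(bc_directed)
print(bc_undirected)
```

In the directed version no path passes through any node, so every value is 0, whereas in the undirected version every path between two outer nodes passes through "a". Building the subset with nx.Graph() instead of nx.DiGraph() may therefore give a more informative distribution.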

In [None]:
#Importing the library for this section of the code
import nxviz as nv

#Plotting the subset with Matrix
nv.MatrixPlot(Gh_sub)
plt.show()

The code above creates a matrix plot 

In [None]:
#Plotting the subset with Arc
nv.ArcPlot(Gh_sub)
plt.show()

The code above creates an arc plot

In [None]:
#Plotting the subset with Circos
nv.CircosPlot(Gh_sub)
plt.show()

The code above creates a circos plot

### Challenge 2

This challenge is about OSMnx. You will explore and analyze a city's street network using the OSMnx Python library.

1. Use OSMnx to download the street network of a city of your choice. You can specify the city name, BBox or a Dict.
2. Calculate basic statistics for the street network, such as the number of nodes, edges, average node degree, etc.
3. Use OSMnx to plot the street network. Customize the plot to make it visually appealing, including node size, edge color. See the potential options here: https://osmnx.readthedocs.io/en/stable/user-reference.html#module-osmnx.plot
4. Utilize the routing capabilities of OSMnx to find the shortest path between two points in the street network. Plot the route on top of the street network.
5. Calculate the centrality measures (e.g., degree centrality and betweenness_centrality) for nodes in the street network.
6. Create the figure-ground from the selected city
7. Create interactive maps to plot nodes, edges, nodes+edges and one of the centrality measures.
8. Export the street network to a GeoPackage (.gpkg) file. Ensure that the exported file contains both node and edge attributes. Demonstrate that the new GeoPackage can be used and read in Python using any of the libraries we have seen in the class to create a simple and interactive map.
9. Finally, use OSMnx to extract other urban elements (e.g., buildings, parks) and plot them.

In [None]:
#Importing libraries needed for this section
import networkx as nx
import osmnx as ox

#Downloading street network of Yambol, Bulgaria by passing the place name as a string query
Yambol = ox.graph_from_place("Yambol", network_type="drive", truncate_by_edge=True)

The code above uses OSMnx to download the drivable street network of Yambol by passing the place name to .graph_from_place() as a string query (rather than a BBox or a Dict).

In [None]:
#Calculating basic statistics
num_nodes = Yambol.number_of_nodes()
num_edges = Yambol.number_of_edges()
avg_node_degree = sum(dict(Yambol.degree()).values()) / num_nodes

print(f'Number of nodes: {num_nodes}')
print(f'Number of edges: {num_edges}')
print(f'Average node degree: {avg_node_degree}')

The code above calculates basic statistics. Firstly, .number_of_nodes() is used to calculate the number of nodes and .number_of_edges() the number of edges. These values are printed, which shows that the number of nodes is 1334 and the number of edges is 3605. Secondly, the average node degree is calculated by dividing the sum of the degree values by the number of nodes. This comes back as 5.4. Because the network is a directed graph, the degree counts both incoming and outgoing edges (a two-way street counts as two edges), so on average each node has approximately 5.4 edges attached to it rather than 5.4 distinct neighbouring nodes.
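
As a quick arithmetic check of the reported figure: every edge adds one to the degree of each of its two endpoints, so the sum of all degrees equals twice the number of edges. The sketch below simply plugs in the node and edge counts reported above.

```python
# Counts reported for the Yambol network above
num_nodes = 1334
num_edges = 3605

# Each edge contributes to the degree of two nodes, so sum(degrees) = 2 * edges
avg_node_degree = 2 * num_edges / num_nodes
print(round(avg_node_degree, 2))
```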

In [None]:
#Picking origin and destination points (X is the longitude, Y is the latitude)
orig = ox.distance.nearest_nodes(Yambol, X=20.473502, Y=22.501300)
dest = ox.distance.nearest_nodes(Yambol, X=42.484282, Y=40.508822)

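The code above selects an origin and destination point using .nearest_nodes(), which returns the network node closest to each coordinate pair (X is the longitude and Y is the latitude). These points are outside Yambol; the idea behind this is to see which way to go if people travelling needed to pass through Yambol.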

In [None]:
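# add edge speeds and travel times so that the "travel_time" weight exists on every edge
Yambol = ox.add_edge_speeds(Yambol)
Yambol = ox.add_edge_travel_times(Yambol)
# find the shortest path between nodes, minimizing travel time, then plot it
route = ox.shortest_path(Yambol, orig, dest, weight="travel_time")
fig, ax = ox.plot_graph_route(Yambol, route, node_size=0)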

The code above creates the map you see with the red route that runs through Yambol. This is done using the .shortest_path() method with a weight of travel_time, so that the route minimises travel time rather than distance (this relies on the edges carrying a travel_time attribute). This is then plotted using the .plot_graph_route() method.
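
To illustrate why the choice of weight matters, the sketch below runs a small Dijkstra search on a hypothetical four-node network, once weighted by length and once by travel time (length divided by speed). The node names, lengths and speeds are made up for illustration, and the helper functions are not part of OSMnx.

```python
import heapq

def travel_time_seconds(length_m, speed_kph):
    """Seconds needed to traverse an edge of length_m metres at speed_kph km/h."""
    return length_m / (speed_kph * 1000 / 3600)

def shortest_path(graph, orig, dest):
    """Dijkstra over {node: {neighbour: weight}}; returns the cheapest node list."""
    queue = [(0.0, orig, [orig])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dest:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, weight in graph[node].items():
            if neighbour not in visited:
                heapq.heappush(queue, (cost + weight, neighbour, path + [neighbour]))
    return None

# Hypothetical toy network: (length in metres, speed in km/h) per directed edge
edges = {
    ("A", "B"): (1000, 30),  # short but slow residential street
    ("B", "D"): (1000, 30),
    ("A", "C"): (1500, 90),  # longer but fast main road
    ("C", "D"): (1500, 90),
}

# Build one adjacency dict per weighting
by_length, by_time = {"D": {}}, {"D": {}}
for (u, v), (length_m, speed_kph) in edges.items():
    by_length.setdefault(u, {})[v] = length_m
    by_time.setdefault(u, {})[v] = travel_time_seconds(length_m, speed_kph)

route_by_length = shortest_path(by_length, "A", "D")
route_by_time = shortest_path(by_time, "A", "D")
print(route_by_length)
print(route_by_time)
```

The shortest route by distance goes through B (2000 m), while the quickest route goes through C (120 seconds instead of 240), so the two weights can produce different routes on the same network.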

In [None]:
#Degree centrality
#Importing libraries for this part of the code
import networkx as nx
import matplotlib.pyplot as plt

#Getting the degree centrality
degree_centrality = nx.degree_centrality(Yambol)

#Creating a degree centrality list with the .values() method
degree_centrality_values = list(degree_centrality.values())

# Plot a histogram of the centrality
plt.figure(figsize=(6, 4))
plt.hist(degree_centrality_values, bins=40, color='#FF4500', alpha=0.7)
plt.title('Degree Centrality in Yambol')
plt.xlabel('Degree Centrality')
plt.ylabel('Frequency')
plt.show()

The code above generates a histogram of the degree centrality, very similar to Challenge 1. First degree_centrality() method is used to calculate the degree centrality and then it is made into a list based on the values. Secondly, plt.hist() is then used to plot it with a title and label.

In [None]:
#Betweenness centrality
betweenness_centrality_Yambol = nx.betweenness_centrality(Yambol)
#Creating a betweenness centrality list with the .values() method
betweenness_centrality_Yambol_values = list(betweenness_centrality_Yambol.values())

# Plot a histogram of the centrality
plt.figure(figsize=(6, 4))
plt.hist(betweenness_centrality_Yambol_values, bins=40, color='#FF4500', alpha=0.7)
plt.title('Betweenness Centrality in Yambol')
plt.xlabel('Betweenness Centrality')
plt.ylabel('Frequency')
plt.show()

The code above generates a betweenness centrality histogram. First, the betweenness centrality is calculated and then made into a list based on the values. Secondly, a histogram is created using plt.hist() with a title and labels.

In [None]:
#Importing the python warnings module to suppress DeprecationWarning messages during execution
import warnings
warnings.simplefilter('ignore', DeprecationWarning)

#Converting MultiDiGraph to an undirected MultiGraph
M = ox.utils_graph.get_undirected(Yambol)

#Converting MultiDiGraph to a DiGraph without parallel edges
D = ox.utils_graph.get_digraph(Yambol)

#Converting graph to node and edge GeoDataFrames
gdf_nodes, gdf_edges = ox.graph_to_gdfs(Yambol)

The code above uses the osmnx library to convert between different types of graphs and creates GeoDataFrames for the nodes and edges for further analysis and visualisation. M is an undirected version of the directed graph, while D is a directed graph with parallel edges removed.

In [None]:
#Displaying the node GeoDataFrame
gdf_nodes

The code above simply displays the node GeoDataFrame so that the nodes can be checked

In [None]:
#Importing library for this section
from IPython.display import Image

#Figure Ground of Yambol 
# configure the inline image display
img_folder = "images_yam"
extension = "png"
size = 480
dpi = 80

The code above configures how the figure-ground image will be saved and displayed. img_folder defines the folder where the images related to Yambol will be stored, and extension sets the format to png. size sets the displayed width and height to 480 pixels. Finally, dpi sets the resolution of the saved image.

In [None]:
place = "yambol"
point = (42.480064, 26.522020)
fp = f"./{img_folder}/{place}.{extension}"
fig, ax = ox.plot_figure_ground(
    point=point,
    network_type="all",
    default_width=3.3,
    filepath=fp,
    dpi=dpi,
    save=True,
    show=False,
    close=True,
)
Image(fp, height=size, width=size)

The code above produces a figure-ground diagram centred on the coordinates (42.480064, 26.522020). fp defines the file path where the image is saved, the figure is created using the .plot_figure_ground() method, and the saved image is then displayed with Image().

In [None]:
#Creating an interactive map of the edges
ox.graph_to_gdfs(Yambol, nodes=False).explore()

![image.png](attachment:d78e57dd-de0b-4270-a40c-b95f95dd48eb.png)

The code above creates an interactive map based on the edges

In [None]:
#Creating an interactive map of the nodes
nodes = ox.graph_to_gdfs(Yambol, edges=False)
nodes.explore(tiles="cartodbpositron", marker_kwds={"radius": 2})

![image.png](attachment:af72b205-5e1b-487b-91bd-0a36cf1df37d.png)

The code above creates an interactive map based on the nodes

In [None]:
#Creating an interactive map to plot nodes+edges
nodes, edges = ox.graph_to_gdfs(Yambol)
m = edges.explore(color="skyblue", tiles="cartodbdarkmatter")
nodes.explore(m=m, color="pink", marker_kwds={"radius": 2})

![image.png](attachment:718cb55b-1fc5-4b2e-b174-796e8cafc5bb.png)

The code above creates an interactive map of both the nodes and the edges. The .explore() method sets a colour for the edges and for the nodes, with a node radius of 2.

In [None]:
#Creating an interactive map of the edges coloured by length
edges.explore(tiles="cartodbdarkmatter", column="length", cmap="plasma")

![image.png](attachment:f6e16f73-2bb9-4633-8419-458f4c2030f4.png)

The code above creates an interactive map based on the length of the edges

In [None]:
#Creating an interactive map to plot a centrality measure
nx.set_node_attributes(Yambol, nx.betweenness_centrality(Yambol, weight="length"), name="bc")
nodes = ox.graph_to_gdfs(Yambol, edges=False)
nodes.explore(tiles="cartodbdarkmatter", column="bc", marker_kwds={"radius": 8})

![image.png](attachment:401aa94b-9f14-429d-a0ac-e7182f72e3d5.png)

The code above creates an interactive map based on the betweenness centrality values

In [None]:
#Exporting the graph in 2 different ways
ox.save_graph_geopackage(Yambol, filepath="./data/Yambol.gpkg")
ox.save_graphml(Yambol, filepath="./data/Yambol.graphml")

The code above exports the graph as both a GeoPackage and a GraphML file so that it can be visualised elsewhere. This is what the image below shows. I opened this data in QGIS to show that it was saved properly.

![image.png](attachment:dea9eb89-2f9e-4ab6-b041-48d5ca9bbe97.png)
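
As a complementary check from within Python, the layers inside the exported file can be listed directly. A GeoPackage is an SQLite database whose gpkg_contents table records its layers, so the sketch below queries that table with the standard library. list_gpkg_layers is a hypothetical helper name, and the call only runs if the exported file exists at the path used above.

```python
import sqlite3
from contextlib import closing
from pathlib import Path

def list_gpkg_layers(path):
    """Return the layer names recorded in a GeoPackage's gpkg_contents table."""
    with closing(sqlite3.connect(path)) as conn:
        rows = conn.execute(
            "SELECT table_name FROM gpkg_contents ORDER BY table_name"
        ).fetchall()
    return [name for (name,) in rows]

# Path used when exporting the graph above
gpkg_path = Path("./data/Yambol.gpkg")
if gpkg_path.exists():
    print(list_gpkg_layers(gpkg_path))
```

The export is expected to contain one layer for the nodes and one for the edges. If both are listed, each layer could then be loaded with geopandas.read_file(path, layer=...) and passed to .explore() to produce an interactive map.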

In [None]:
#Importing libraries for this section
import networkx as nx
import osmnx as ox

# get all building footprints in some neighborhood
place = "Yambol"
tags = {"building": True}
gdf = ox.features_from_place(place, tags)
gdf.shape

The code above retrieves building features from OpenStreetMap for Yambol. First the place is stated, then the feature I want to map (buildings) is defined with a tag. The .features_from_place() method then retrieves the data as a GeoDataFrame, and finally .shape returns its dimensions (the number of rows and columns).

In [None]:
#Ignoring warnings
import warnings
warnings.simplefilter('ignore', DeprecationWarning)

#Creating a figure for the buildings in Yambol
fig, ax = ox.plot_footprints(gdf, figsize=(12, 10))

The code above creates a figure of all the buildings in Yambol. From the figure, it was clear that there are buildings missing, when compared to Google Maps, in the North East of the city. This area has a high Roma population who live in informal settlements. 

## Lab Workbook 7

### Challenge 1

In this challenge, you will replicate the process of creating a geodemographic classification using the k-means clustering algorithm. Please select any city in the UK except London, Liverpool, or Glasgow. The main goal is to generate a meaningful and informative classification that captures the diversity of areas in your dataset using the census data ( For England, you can try to use the 2021 or 2011 census, and for Scotland, you need to use the 2011 census data) 

1. Define the main goal for the geodemographic classification (marketing, retail and service planning). 
2. Look for census data from the selected city for which you would like to generate the geodemographic classification.
3. The census data at the Output Area OA level. Select multiple topics of at least four topics (socio-demographics, economics, health, and so on). Describe your topic selection accordingly based on the goal of your geodemographic classification. For example, if your geodemographics are related to marketing, Economic variables might be the appropriate selection. 
4. Identify the variables that will be crucial for effectively segmenting neighbourhoods. Evaluate how this choice may impact the classification results, including a DEA analysis.
5. Prepare, adjust or clean the dataset addressing any missing values or outliers that could distort the clustering results.
6. Include standardisation between areas and variables. Make an appropriate analysis and adjust the variable selection accordingly for any multicollinearity.
7. Utilize the k-means clustering algorithm to create a classification based on the selected variables.
8. Define the optimum number of clusters (i.e., using the Elbow method). Experiment with different values of k.
9. Evaluate your cluster groups (e.g., using PCA) and interpret your cluster centres. Describe your results and repeat the process to adjust the variable selection and cluster groups to provide more meaningful results for your geodemographic goal. Interpret the characteristics of each cluster. What demographic patterns or similarities are prevalent within each group?
10. Map the final cluster groups
11. Finish the analysis by naming the final clusters and plotting a final map that includes the census values and the provided names.
12. Finally, acknowledge the subjective nature of classification and make analytical decisions to produce an optimum classification for your specific purpose. Reflect on the challenges and insights gained during the classification process. Ensure you document your analytical decisions and the rationale behind any important decision. Once your geodemographics are constructed, describe the potential use cases for the geodemographic classification you have built based on your initial goal.

The main goal of the geodemographic classification is to explore employment prospects and the ability to access the labour market. The census data included in this study, listed below, aims to take into account multiple social, health, education and economic factors that can predict one's ability to access the labour market. For instance, having a car or van to get to work can make a huge difference in being able to get a job. Additionally, if a person suffers from a health condition, they might not be able to access certain jobs, which reduces their chances of employment. I decided to form this geodemographic classification because the number of economically inactive persons rose from 8,503k to 8,912k between 2019 and 2022 and is expected to rise to 9,229k by 2026 (ONS, 2023). This could signal that people are giving up on accessing the labour market. Therefore, finding areas which have reduced chances or lower prospects of employment is important to reverse this trend, which can cause economic and productivity stagnation.

ONS (2023) Population changes and economic inactivity trends, UK: 2019 to 2026. Available at: https://www.ons.gov.uk/employmentandlabourmarket/peoplenotinwork/economicinactivity/articles/populationchangesandeconomicinactivitytrendsuk2019to2026/2023-03-03#:~:text=For%20those%20aged%2018%20to,changes%2C%20between%202019%20and%202022. (Accessed: 8 March 2024).

Data Included:

KS102SC: Age Group (Social)

KS201SC: Ethnic Group (Social) 

KS404SC: Car or van availability (Social)

KS601SC: Economic Status (Economic)

KS608SC: Occupation (Economic)

QS118SC: Families with Dependents (Social)

QS302SC: General Health (Health)

QS304SC: Long-term Sickness (Health)

QS501SC: Highest Qualifications (Education)

The topics are: social, economic, health and education

All the data was found using the Scottish Census website: https://www.scotlandscensus.gov.uk/census-results/at-a-glance/

The boundary shape file for Dundee was found using the UK Data Service: https://borders.ukdataservice.ac.uk/easy_download.html

The census variables were chosen based on whether they can help determine employment prospects and the ability to access the labour market. For instance, being young could be a limitation because of a lack of experience.

In [None]:
#Importing libraries for this section
import pandas as pd
import geopandas as gpd
import os

#Creating a directory for the csvs in my file
csv_directory = "data/Dundee_data"

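# List of all csvs in the folder (sorted for a reproducible column order; the merged output is skipped so re-runs do not read it back in)
csv_files = [file for file in sorted(os.listdir(csv_directory))
             if file.endswith(".csv") and file != "merged_census_data.csv"]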

# Creating an empty DataFrame to store the merged data
merged_data = pd.DataFrame()

# Loop through each CSV file
for csv_file in csv_files:
    csv_path = os.path.join(csv_directory, csv_file)
    df_csv = pd.read_csv(csv_path, low_memory=False)
    merged_data = pd.concat([merged_data, df_csv], axis=1)

# Saving the merged dataset
merged_data.to_csv("data/Dundee_data/merged_census_data.csv", index=False)

The code above merges all the census data from the csv files within the Dundee_data folder. First, a variable called csv_directory is created which holds the path to the directory. Then, the csv_files variable keeps only the files that end with .csv. merged_data will then store the contents of all the csvs. A loop is used to fill the merged_data dataframe, joining each file side by side as new columns. Finally, the merged dataframe is saved to a new csv file called merged_census_data.
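
Joining files side by side with pd.concat(axis=1) pairs rows by their position, so it relies on every census table listing the output areas in the same order. The sketch below uses two hypothetical three-row tables (made-up codes and counts) to show how a positional join and a key-based merge on oa_code can differ when the order is not the same.

```python
import pandas as pd

# Hypothetical stand-ins for two census tables keyed by output area code
age = pd.DataFrame({"oa_code": ["S001", "S002", "S003"], "18 to 19": [5, 8, 2]})
health = pd.DataFrame({"oa_code": ["S003", "S001", "S002"], "Good health": [40, 55, 61]})

# Positional join: rows are paired by index, so S001 receives S003's health value
side_by_side = pd.concat([age, health], axis=1)

# Key-based merge: rows are paired by oa_code regardless of row order
merged = age.merge(health, on="oa_code", how="left")
print(merged)
```

The positional join also keeps one oa_code column per file. When such a file is saved and read back with pd.read_csv, repeated column names are given numeric suffixes, which is where names such as 'All people.2' and 'All people.3' come from.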

In [None]:
#Creating a path for the Dundee shape file
shp_path = "data/Dundee_data/scotland_oa_2011.shp"
#Reading the shape file path
gdf = gpd.read_file(shp_path)

#Creating a path for the merged census data
csv_path = "data/Dundee_data/merged_census_data.csv"
#Reading the census data
csv_data = pd.read_csv(csv_path, low_memory=False)

# Merging the GeoDataFrame with the DataFrame based on the oa_code
merged_data = gdf.merge(csv_data, left_on='code', right_on='oa_code', how='left')

The code above required some initial editing of the census data within Excel. The first rows needed to be deleted so that the data could be loaded properly. The OA polygon code column was also renamed oa_code so that it could be merged with the shapefile. Everything else was left as it was, apart from QS118SC: Families with Dependents, for which I merged a few different variables. I did this in Excel as the columns were named oddly and I had to do a few calculations.

After the initial editing, the code loads both the shapefile and the census data. They are then merged using the .merge() method, which states which columns should be used to join each census row to the correct polygon (code for the shapefile and oa_code for the census).

In [None]:
#Picking variables that will be needed:
selected_columns = ['objectid', 'code', 'hhcount', 'popcount', 'masterpc', 'easting', 'northing', 'label', 'name', 'geometry', 'oa_code', 
                    'All people','18 to 19', '20 to 24', '25 to 29', '65 to 74', '75 to 84', '85 to 89', '90 and over', 
                    'White', 'Asian, Asian Scottish or Asian British','African', 
                    'All people aged 16 to 74', 'Economically inactive: Retired',
                    'Economically inactive: Looking after home or family', 'Economically inactive: Long-term sick or disabled',
                    'Economically inactive: Other', 'Economically inactive: Student', 'Unemployed people aged 16 to 74: Aged 16 to 24', 'Unemployed people aged 16 to 74: Aged 50 to 74', 'Unemployed people aged 16 to 74: Long-term unemployed',
                    'All people aged 16 to 74 in employment', '1. Managers, directors and senior officials', '2. Professional occupations',
                    '6. Caring, leisure and other service occupations', '7. Sales and customer service occupations',
                    '8. Process, plant and machine operatives', '9. Elementary occupations', 'All families in households',
                    'No dependent children in family', 'Household with dependent child', 'All people.2',
                    'Very good health', 'Good health', 'Bad health', 'Very bad health', 'All people.3', 'No condition',
                    'One or more conditions', 'All people aged 16 and over', 'All people aged 16 and over: No qualifications',
                    'All people aged 16 and over: Level 4 and above', 'All households', 'Number of cars or vans in household: No cars or vans']

selected_data = merged_data[selected_columns].copy()

The above code selects some of the variables within the merged_data dataframe. This is because, once merged, there were a few hundred variables which I didn't need. A new dataframe is created by selecting only the stated variables and copying them with the .copy() method.

In [None]:
#Replacing all - in the selected data to a 0 so that it can be converted to an integer
selected_data.replace('-', 0, inplace=True)


The code above replaces all - with 0 so that the variables could be converted into int

In [None]:
#Merging different variables to create new variables- but first they need to be converted to int to be able to add properly
#Merging 18 to 19, 20 to 24 and 25 to 29 variables
selected_data['18 to 19'] = selected_data['18 to 19'].astype(int)
selected_data['20 to 24'] = selected_data['20 to 24'].astype(int)
selected_data['25 to 29'] = selected_data['25 to 29'].astype(int)
selected_data['18 to 29'] = selected_data['20 to 24'] + selected_data['25 to 29'] + selected_data['18 to 19']

#Merging the 65 to 74, 75 to 84, 85 to 89 and 90 and over variables
selected_data['65 to 74'] = selected_data['65 to 74'].astype(int)
selected_data['75 to 84'] = selected_data['75 to 84'].astype(int)
selected_data['85 to 89'] = selected_data['85 to 89'].astype(int)
selected_data['90 and above'] = selected_data['90 and over'].astype(int)
selected_data['65 and above'] = selected_data['65 to 74'] + selected_data['75 to 84'] + selected_data['85 to 89'] + selected_data['90 and above']

#Merging both good health variables
selected_data['Very good health'] = selected_data['Very good health'].astype(int)
selected_data['Good health'] = selected_data['Good health'].astype(int)
selected_data['Good Health'] = selected_data['Very good health'] + selected_data['Good health']

#Merging both bad health variables
selected_data['Bad health'] = selected_data['Bad health'].astype(int)
selected_data['Very bad health'] = selected_data['Very bad health'].astype(int)
selected_data['Bad Health'] = selected_data['Bad health'] + selected_data['Very bad health']

#Merging all economically inactive people
selected_data['Economically inactive: Retired'] = selected_data['Economically inactive: Retired'].astype(int)
selected_data['Economically inactive: Looking after home or family'] = selected_data['Economically inactive: Looking after home or family'].astype(int)
selected_data['Economically inactive: Long-term sick or disabled'] = selected_data['Economically inactive: Long-term sick or disabled'].astype(int)
selected_data['Economically inactive: Other'] = selected_data['Economically inactive: Other'].astype(int)
selected_data['Economically inactive: Student'] = selected_data['Economically inactive: Student'].astype(int)
selected_data['Economically inactive'] = (selected_data['Economically inactive: Retired'] +
                                        selected_data['Economically inactive: Looking after home or family'] +
                                        selected_data['Economically inactive: Long-term sick or disabled'] +
                                        selected_data['Economically inactive: Other'] +
                                        selected_data['Economically inactive: Student'])

#Merging higher income earners
selected_data['1. Managers, directors and senior officials'] = selected_data['1. Managers, directors and senior officials'].astype(int)
selected_data['2. Professional occupations'] = selected_data['2. Professional occupations'].astype(int)
selected_data['Higher Income'] = selected_data['1. Managers, directors and senior officials'] + selected_data['2. Professional occupations']

#Merging lower income earners
selected_data['6. Caring, leisure and other service occupations'] = selected_data['6. Caring, leisure and other service occupations'].astype(int)
selected_data['7. Sales and customer service occupations'] = selected_data['7. Sales and customer service occupations'].astype(int)
selected_data['8. Process, plant and machine operatives'] = selected_data['8. Process, plant and machine operatives'].astype(int)
selected_data['9. Elementary occupations'] = selected_data['9. Elementary occupations'].astype(int)
selected_data['Lower Income'] = (selected_data['6. Caring, leisure and other service occupations'] +
                                         selected_data['7. Sales and customer service occupations'] +
                                         selected_data['8. Process, plant and machine operatives'] +
                                         selected_data['9. Elementary occupations'])

The long code above adds different variables together to create new, combined variables. Importantly, the variables were first converted to int so that they could be added properly.
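
The same convert-then-add pattern could be written more compactly with a dictionary that maps each new variable to the columns it combines. The sketch below is one possible shortening, shown on a hypothetical three-row sample that reuses column names from the census tables; the values and the groups dictionary are made up for illustration.

```python
import pandas as pd

# Hypothetical three-row sample using column names from the census tables
df = pd.DataFrame({
    "18 to 19": ["3", "-", "7"],
    "20 to 24": ["10", "4", "-"],
    "25 to 29": ["6", "5", "9"],
})

# Map each new variable to the source columns it combines
groups = {"18 to 29": ["18 to 19", "20 to 24", "25 to 29"]}

for new_name, columns in groups.items():
    # Treat '-' placeholders as zero, then convert the text values to integers
    df[columns] = df[columns].replace("-", "0").astype(int)
    # Add the columns together row by row
    df[new_name] = df[columns].sum(axis=1)

print(df["18 to 29"].tolist())
```

Adding further entries to groups (for example the health or occupation columns) would extend the same loop without repeating the conversion lines.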

In [None]:
#Selecting columns again to remove ones that I will not need
selected_columns2 = ['objectid', 'code', 'hhcount', 'popcount', 'masterpc', 'easting', 'northing',
                'label', 'name', 'geometry', 'oa_code', 
                'All people', 'White', 'Asian, Asian Scottish or Asian British', 'African', 
                'All people aged 16 to 74', 'Economically inactive', 'Unemployed people aged 16 to 74: Long-term unemployed',
                'All families in households', 'No dependent children in family', 'Household with dependent child', 
                'No condition', 'One or more conditions', 
                'All people aged 16 and over', 'All people aged 16 and over: No qualifications', 'All people aged 16 and over: Level 4 and above',
                'Good Health',  'Bad Health', 
                'All people aged 16 to 74 in employment', 'Higher Income', 'Lower Income',
                '18 to 29','65 and above', 
                'All households', 'Number of cars or vans in household: No cars or vans']

selected_data2 = selected_data[selected_columns2].copy()

The code above selects only the variables I still need and copies them into a new table, so that the variables I no longer need are removed for simplicity.

In [None]:
#Important libraries for this section
import matplotlib.pyplot as plt
import seaborn as sns

#Creating a new list with only variables 
attributes_to_plot = ['18 to 29', '65 and above', 
                      'Good Health', 'Bad Health', 'No condition', 'One or more conditions',
                      'Higher Income', 'Lower Income', 'Economically inactive', 'Unemployed people aged 16 to 74: Long-term unemployed', 
                      'No dependent children in family', 'Household with dependent child', 
                      'White', 'Asian, Asian Scottish or Asian British','African', 
                      'All people aged 16 and over: No qualifications', 'All people aged 16 and over: Level 4 and above',
                     'Number of cars or vans in household: No cars or vans']

#Creating a figure
plt.figure(figsize=(15, 40))

#Plotting histograms
for i, attribute in enumerate(attributes_to_plot, 1):
    plt.subplot(9, 2, i)
    sns.histplot(selected_data2[attribute].astype(float), kde=True)
    plt.title(attribute)

plt.tight_layout()
plt.show()

The code above first creates a list that contains the names of all the variables I want to plot as histograms. A new figure is then created with the size (15, 40) so that all the variables can fit properly. The histograms are then plotted using a loop, which places each histogram in a grid of 9 rows by 2 columns.

In [None]:
#Defining a function 
def calculate_percentages(dataframe, total_columns, value_columns):

    result_df = pd.DataFrame()

    for total_col, value_col in zip(total_columns, value_columns):
        percentage_col_name = f"{value_col}_percentage"

        if total_col not in dataframe.columns:
            raise ValueError(f"Total column '{total_col}' not found in the DataFrame.")
            
        dataframe[value_col] = pd.to_numeric(dataframe[value_col], errors='coerce')
        dataframe[total_col] = pd.to_numeric(dataframe[total_col], errors='coerce')
        
        result_df[percentage_col_name] = (dataframe[value_col] / dataframe[total_col]) * 100

    return result_df

# List of the corresponding totals.
total_cols = ['All people',
              'All people',
              'All people',
              'All people',
              'All people',
              'All people',
              'All people aged 16 to 74 in employment',
              'All people aged 16 to 74 in employment',
              'All people aged 16 to 74',
              'All people aged 16 to 74',
              'All families in households',
              'All families in households',
              'All people',
              'All people',
              'All people',
              'All people aged 16 and over', 
              'All people aged 16 and over',
             'All households']

# List of the corresponding values. 
value_cols = ['18 to 29', 
              '65 and above', 
              'Good Health', 
              'Bad Health', 
              'No condition', 
              'One or more conditions',
              'Higher Income', 
              'Lower Income', 
              'Economically inactive',
              'Unemployed people aged 16 to 74: Long-term unemployed',
              'No dependent children in family',
              'Household with dependent child', 
              'White', 
              'Asian, Asian Scottish or Asian British',
              'African', 
              'All people aged 16 and over: No qualifications',
              'All people aged 16 and over: Level 4 and above',
             'Number of cars or vans in household: No cars or vans']

# Calculating percentages based on the corresponding values
result_dataframe = calculate_percentages(selected_data2, total_cols, value_cols)

The code above starts by defining a function called calculate_percentages with three arguments: dataframe contains the data, total_columns is the list of column names holding the total counts used as denominators, and value_columns is the list of column names that I need to convert into percentages. For each pair, the function creates a new column of percentages named after the value column with the suffix _percentage.

The total_cols and value_cols lists were then defined, with each total matching the value in the same position.

Finally, the calculate_percentages function calculates the percentages based on the two lists.
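To make the percentage step concrete, here is a small worked sketch on toy numbers. The helper `to_percentage` is hypothetical (it is not the workbook's function), and treating a zero total as missing is my own assumption about how empty areas could be handled.

```python
import numpy as np
import pandas as pd

def to_percentage(values, totals):
    """Percentage of `values` in `totals`; a zero total gives NaN instead of inf."""
    values = pd.to_numeric(values, errors="coerce")
    totals = pd.to_numeric(totals, errors="coerce").replace(0, np.nan)
    return values / totals * 100

# Toy data: the third area has no residents at all
toy = pd.DataFrame({"All people": [200, 50, 0], "18 to 29": [50, 10, 0]})
toy["18 to 29_percentage"] = to_percentage(toy["18 to 29"], toy["All people"])
print(toy)
```

An area with a total of zero ends up with a missing percentage, which is one way that NaN values can appear later in the standardised table.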

In [None]:
# Concatenate the resulting tables.
concatenated_df = pd.concat([selected_data2, result_dataframe], axis=1, ignore_index=False)
concatenated_df.head()

The code above uses the .concat() method to concatenate (meaning link) selected_data2 and result_dataframe based on the columns (axis=1). 
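It is worth noting that, with axis=1, .concat() matches rows by their index labels rather than by their position. A minimal sketch with toy tables (the values are made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({"popcount": [120, 95, 140]}, index=[0, 1, 2])
# Same index labels, deliberately stored in a different row order
right = pd.DataFrame({"18 to 29_percentage": [30.0, 10.0, 20.0]}, index=[2, 0, 1])

# Rows are paired by index label, so label 0 receives 10.0 and label 2 receives 30.0
combined = pd.concat([left, right], axis=1)
print(combined)
```

This is why the two tables can be joined safely here: result_dataframe was built from selected_data2 and therefore shares its index.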

In [None]:
keep_cols = ['objectid', 'code', 'hhcount', 'popcount', 'masterpc', 'easting', 'northing',
            'label', 'name', 'geometry', 'oa_code',
             '18 to 29_percentage', 
              '65 and above_percentage', 
              'Good Health_percentage', 
              'Bad Health_percentage', 
              'No condition_percentage', 
              'One or more conditions_percentage',
              'Higher Income_percentage', 
              'Lower Income_percentage', 
              'Economically inactive_percentage',
              'Unemployed people aged 16 to 74: Long-term unemployed_percentage',
              'No dependent children in family_percentage',
              'Household with dependent child_percentage', 
              'White_percentage', 
              'Asian, Asian Scottish or Asian British_percentage',
              'African_percentage', 
              'All people aged 16 and over: No qualifications_percentage',
              'All people aged 16 and over: Level 4 and above_percentage',
             'Number of cars or vans in household: No cars or vans_percentage']


Dundee_census_data = concatenated_df[keep_cols]

The code above keeps only the identifier, geometry and percentage columns, removing any unnecessary columns.

In [None]:
# For more easy manipulation we define short column names
short_column_names = {
              '18 to 29_percentage': '18to29' , 
              '65 and above_percentage': '65+', 
              'Good Health_percentage': 'good_health', 
              'Bad Health_percentage': 'bad_health', 
              'No condition_percentage': 'No_condition', 
              'One or more conditions_percentage': 'condition',
              'Higher Income_percentage': 'higher_income', 
              'Lower Income_percentage': 'lower_income', 
              'Economically inactive_percentage': 'economically_inactive', 
              'Unemployed people aged 16 to 74: Long-term unemployed_percentage': 'long-term unemployed', 
              'No dependent children in family_percentage': 'no_dependent',
              'Household with dependent child_percentage': 'dependent', 
              'White_percentage': 'white', 
              'Asian, Asian Scottish or Asian British_percentage': 'asian',
              'African_percentage': 'african', 
              'All people aged 16 and over: No qualifications_percentage': 'no_qualifications',
              'All people aged 16 and over: Level 4 and above_percentage': 'level4_qualifications',
              'Number of cars or vans in household: No cars or vans_percentage': 'no_car'
}

Dundee_census_data = Dundee_census_data.rename(columns=short_column_names)

The code above renames the columns using the .rename() method, after creating a dictionary that maps each long column name to a short one.

In [None]:
#Calculating the z_score
numeric_columns = Dundee_census_data.select_dtypes(include='float64')
z_score_df = (numeric_columns - numeric_columns.mean()) / numeric_columns.std(ddof=0)
z_score_df.head()

The code above selects only the columns of Dundee_census_data that have the float64 data type. It then creates z_score_df by subtracting each column's mean and dividing by its standard deviation, which is the calculation for the z-score.
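As a quick check of what the z-score does, the sketch below standardises a toy table (made-up values) and confirms that each column ends up with a mean of 0 and a standard deviation of 1. Using ddof=0 means the standard deviation divides by n, as in the code above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"no_car": [10.0, 20.0, 30.0], "65+": [5.0, 5.0, 20.0]})

# ddof=0 divides by n (population standard deviation)
z = (toy - toy.mean()) / toy.std(ddof=0)

# After standardising, every column has mean 0 and standard deviation 1
print(np.allclose(z.mean(), 0), np.allclose(z.std(ddof=0), 1))
```

Standardising in this way puts all the variables on the same scale, so no single variable dominates the distance calculations used later by KMeans.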

In [None]:
#Calculating and displaying matrix
corr = z_score_df.corr()
corr.style.background_gradient(cmap='coolwarm')

The code above uses the .corr() method to calculate the correlation matrix for the dataframe z_score_df. The .style.background_gradient() method is then used to display the matrix with each value shaded according to its size.

In [None]:
#Deciding the threshold
threshold = 0.79 

#Identifying highly correlated variables
highly_correlated = (corr.abs() > threshold) & (corr.abs() < 1.0)

#Creating the heatmap 
plt.figure(figsize=(10, 8))
sns.heatmap(highly_correlated, cmap='coolwarm', cbar=False, annot=True)
plt.title('Highly Correlated Variables')
plt.show()

The code above aims to find variables which are highly correlated based on a threshold of 0.79. I chose this because the rule of thumb is a correlation of 0.8 (80%). As you can see from the matrix values, one pair of variables has a correlation of -0.796105, which in my opinion was too close to 0.8 to keep, so I lowered the threshold to 0.79.

The code creates a boolean dataframe called highly_correlated, which is True where the absolute correlation value is above 0.79 and below 1, and False otherwise. A heat map is then created from highly_correlated, so that all the highly correlated pairs share one colour.
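Alongside the heat map, the highly correlated pairs can also be listed by name. This is only a sketch on toy data; the helper `correlated_pairs` is hypothetical and the numbers are made up.

```python
import numpy as np
import pandas as pd

def correlated_pairs(corr, threshold):
    """Return (var_a, var_b, r) for each pair with |r| above `threshold`."""
    # Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    # Long format; masked cells are NaN and never pass the filter below
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs.abs() > threshold]
    return [(a, b, float(r)) for (a, b), r in pairs.items()]

toy = pd.DataFrame({
    "higher_income": [1.0, 2.0, 3.0, 4.0],
    "lower_income": [4.0, 3.1, 2.0, 1.2],
    "no_car": [1.0, 3.0, 2.0, 2.5],
})

pairs = correlated_pairs(toy.corr(), threshold=0.79)
for a, b, r in pairs:
    print(f"{a} vs {b}: r = {r:.3f}")
```

A list like this makes it easier to decide which variable of each pair to drop, because both names and the sign of the correlation are visible at once.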


In [None]:
#Dropping three variables which are highly correlated
z_score_df.drop(['good_health', 'higher_income', 'white'], axis=1, inplace=True)
z_score_df.info()

Based on the threshold, I have chosen to remove the good_health, higher_income and white variables so that there are no highly correlated variables left.

In [None]:
#Calculating and displaying the matrix 
corr_2 = z_score_df.corr()
corr_2.style.background_gradient(cmap='coolwarm')

The code above uses the .corr() method to calculate the correlation matrix for the reduced dataframe z_score_df. The .style.background_gradient() method is then used to display the matrix with each value shaded according to its size.

In [None]:
#Deciding the threshold
threshold = 0.79

#Identifying highly correlated variables
highly_correlated_2 = (corr_2.abs() > threshold) & (corr_2.abs() < 1.0)

#Creating a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(highly_correlated_2, cmap='coolwarm', cbar=False, annot=True)
plt.title('New Highly Correlated Variables')
plt.show()

The code above is the same as before, but with the removed highly correlated variables to check that no other variables are highly correlated. 

In [None]:
#Checking if there are any NaN values
contains_nan = z_score_df.isna().any().any()

#Creating a reply to check if there are na values
if contains_nan:
    print("NA present")
else:
    print("No Na present")

The code above checks for any NaN values, and an if statement then prints a response depending on whether any are present.

In [None]:
#Filling the na value with the mean value
z_score_df.fillna(z_score_df.mean(), inplace=True)

The code above replaces all the NaN values using the .fillna() method, filling each one with the mean of its column in z_score_df, calculated with the .mean() method.

In [None]:
#Libraries needed for this section
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist, pdist

# KMeans with 10 clusters
kmeans = KMeans(n_clusters=10)
kmeans.fit(z_score_df)
labels = kmeans.predict(z_score_df)
cluster_centres = kmeans.cluster_centers_
z_score_df['Cluster'] = kmeans.labels_

The code above performs a KMeans clustering of the data. Firstly, the number of clusters is defined as 10. Then kmeans.fit() is used to fit the KMeans model to z_score_df. The labels variable is then created, which assigns each data point in z_score_df to one of the 10 clusters. Then cluster_centres retrieves the cluster centres found by KMeans. Finally, a Cluster column is created with the cluster labels.

In [None]:
plt.hist(labels)

The code above creates a histogram of the labels variable created before

In [None]:
#Creating empty list
Sum_of_squared_distances = []

#Stating that the range of k is between 1 and 14
K_range = range(1,15)

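#Fitting KMeans for each k and storing the sum of squared distances (inertia)
#The Cluster column from the previous cell is excluded so that only the variables are clustered
features = z_score_df.drop(columns='Cluster', errors='ignore')
for k in K_range:
    km = KMeans(n_clusters=k)
    km = km.fit(features)
    Sum_of_squared_distances.append(km.inertia_)

#Plotting the elbow method from the sum of squared distances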
plt.plot(K_range, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()

The code above performs the elbow method to determine the optimal number of clusters for the KMeans clustering algorithm. First, an empty list is created to store the sum of squared distances for each k. The range of k is set from 1 to 14. Next, a loop fits a KMeans model for each k and stores its sum of squared distances. This is then displayed in a plot with relevant labels and a title.

The elbow above indicates that k should be either 6 or 8.
After attempting 8, I found that three of the clusters heavily overlapped, and therefore I went for 6 clusters instead.
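Reading an elbow by eye can be supported by looking at how much the sum of squared distances falls at each step. The sketch below uses synthetic data with three obvious groups, not the Dundee data; the seed and n_init values are my own assumptions, added only to make the example repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in: three well-separated blobs in two dimensions
toy = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 5, 10)])

inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy)
    inertias.append(km.inertia_)

# Relative drop in inertia when going from k to k + 1 clusters
drops = [1 - b / a for a, b in zip(inertias, inertias[1:])]
for k, drop in zip(range(1, 6), drops):
    print(f"k={k} -> k={k + 1}: inertia falls by {drop:.0%}")
```

On this toy data the improvement becomes small once k passes 3, which is where the elbow sits. On real data the drops are usually less clear cut, which is why the choice between 6 and 8 above still needed a judgement.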

In [None]:
#Importing the needed library
import numpy as np

#Defining the elbow
def elbow(dataframe, n):
    k_values = range(1, n)
    kMeansVar = [KMeans(n_clusters=k).fit(dataframe.values) for k in k_values]
    centroids = [X.cluster_centers_ for X in kMeansVar]
    k_euclid = [cdist(dataframe.values, cent) for cent in centroids]
    dist = [np.min(ke, axis=1) for ke in k_euclid]
    wcss = np.array([sum(d**2) for d in dist])
    tss = sum(pdist(dataframe.values)**2)/dataframe.values.shape[0]
    bss = tss - wcss
    #Plotting against k so that the x-axis shows the number of clusters
    plt.plot(k_values, bss)
    plt.show()

#Displaying elbow (the Cluster column is excluded so that only the variables are used)
elbow(z_score_df.drop(columns='Cluster', errors='ignore'), 15)

The code above defines a function called elbow. The kMeansVar line creates a list of k-means clustering models for k from 1 to n-1. Centroids extracts the centroids of each model. K_euclid calculates the Euclidean distance between each data point and each centroid. Dist calculates the minimum distance from each data point to its nearest cluster centroid for each k-means model. Wcss calculates the within-cluster sum of squares for each k-means model. Tss calculates the total sum of squares. Finally, bss calculates the between-cluster sum of squares as tss minus wcss. This is then displayed in the graph you see above for up to 14 clusters (n is 15). As you can see from the graph, 6 clusters is a reasonable choice for this investigation.
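The quantities in the elbow function can be checked against each other on a small example. The sketch below uses random synthetic data (an assumption, not the census table) to show that the pairwise-distance formula for tss matches the usual sum of squared deviations from the mean, and that the manually calculated wcss matches the inertia reported by KMeans.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))

# Total sum of squares, computed in two equivalent ways
tss_pairwise = (pdist(X) ** 2).sum() / X.shape[0]
tss_direct = ((X - X.mean(axis=0)) ** 2).sum()

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Within-cluster sum of squares: squared distance of each point to its nearest centre
nearest = cdist(X, km.cluster_centers_).min(axis=1)
wcss_manual = (nearest ** 2).sum()

# Between-cluster sum of squares is what remains of the total
bss = tss_direct - wcss_manual
print(f"TSS {tss_direct:.2f} = WCSS {wcss_manual:.2f} + BSS {bss:.2f}")
```

Because tss is fixed for a given dataset, plotting bss gives the mirror image of the inertia curve from the previous cell, so both graphs point to the same elbow.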

In [None]:
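#KMeans Clustering (the existing Cluster column is excluded from the input)
features = z_score_df.drop(columns='Cluster', errors='ignore')
kmeans = KMeans(n_clusters=6)
kmeans.fit(features)
labels = kmeans.predict(features)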
cluster_centres = kmeans.cluster_centers_

#Creating new column called cluster
z_score_df['Cluster'] = kmeans.labels_

The code above creates a new attribute in z_score_df called Cluster based on the decided cluster number (6). The cluster assignment is decided by the fit of the data.
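Because KMeans starts from randomly chosen centres, the cluster numbers can change each time the model is fitted. The sketch below, on synthetic data, shows one way to make the assignment repeatable: fixing a seed and keeping the list of input columns separate from the label column. The value random_state=42 and the column names are my own assumptions, not part of the workbook.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
toy = pd.DataFrame(rng.normal(size=(40, 3)), columns=["a", "b", "c"])

# Remember the input columns before any label column is added
feature_cols = list(toy.columns)
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(toy[feature_cols])
toy["Cluster"] = model.labels_

# Refitting on the same input columns with the same seed reproduces the labels
again = KMeans(n_clusters=3, n_init=10, random_state=42).fit(toy[feature_cols])
print((again.labels_ == toy["Cluster"].to_numpy()).all())

# Cluster centres as a table with one column per input variable
centres = pd.DataFrame(model.cluster_centers_, columns=feature_cols)
```

Keeping the labels stable matters for the later steps, because the cluster names are attached to the cluster numbers.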

In [None]:
#Exploring the data
z_score_df.head()

The code above merely checks to see if the cluster assignment worked.

In [None]:
#Importing libraries for this part of the code
import plotly.express as px
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

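#Kmeans Clustering (the existing Cluster column is excluded from the input)
features = z_score_df.drop(columns='Cluster', errors='ignore')
kmeans = KMeans(n_clusters=6)
clusters = kmeans.fit_predict(features)

#Assigning cluster number
z_score_df['Cluster'] = clusters

# Standardize the data for PCA (only the variables, not the cluster labels)
scaler = StandardScaler()
stand_data_scaled = scaler.fit_transform(features)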

# PCA
pca = PCA(n_components=2).fit(stand_data_scaled)
pca_result = pca.transform(stand_data_scaled)

#Percentage of variance explained by each of the selected components.
variance_ratio = pca.explained_variance_ratio_

#Creating a figure to display this
plt.figure(figsize=(10, 6))
sns.scatterplot(x=pca_result[:, 0], y=pca_result[:, 1], hue=clusters, palette='viridis', s=50, alpha=0.7)
plt.title('Cluster Plot against 1st 2 Principal Components')
plt.xlabel(f'Principal Component 1 variation: {variance_ratio[0]*100:.2f}%')
plt.ylabel(f'Principal Component 2 variation: {variance_ratio[1]*100:.2f}%')
plt.legend(title='Clusters')
plt.show()

The code above creates a scatter plot of the first 2 principal components. The data is standardised and the principal component scores are computed. The percentage of variance explained is then calculated for the two selected principal components. The clusters are then plotted in a scatter plot, with the variance ratios shown in the axis labels. These are 35.42% for PC 1 and 21.15% for PC 2. This shows that together the two principal components account for 56.57% of the total variance, meaning that 43.43% is not explained.
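The explained and unexplained shares come straight from explained_variance_ratio_. A minimal sketch on synthetic data (made-up values, so the percentages will differ from those reported above):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Synthetic stand-in: 100 rows, 5 variables, the first two strongly related
X = rng.normal(size=(100, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=100)

pca = PCA(n_components=2).fit(X)
ratios = pca.explained_variance_ratio_

# Share of the total variance kept by the two components, and the share left out
explained = ratios.sum()
unexplained = 1 - explained
print(f"PC1 {ratios[0]:.2%}, PC2 {ratios[1]:.2%}, unexplained {unexplained:.2%}")
```

The larger the unexplained share, the more caution is needed when judging cluster overlap from a two-dimensional plot, since clusters that look mixed there may still be separated along other components.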

In [None]:
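# KMeans clustering (the existing Cluster column is excluded from the input)
features = z_score_df.drop(columns='Cluster', errors='ignore')
kmeans = KMeans(n_clusters=6)
clusters = kmeans.fit_predict(features)

# Storing these labels so that the Cluster column matches the rows of cluster_centers
z_score_df['Cluster'] = clusters

# Get the cluster centers as a table with one column per variable
cluster_centers = pd.DataFrame(kmeans.cluster_centers_, columns=features.columns)

#Exploring the data in the table
cluster_centers.head(5)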

The code above creates a dataframe of the cluster centres so that they can be displayed in charts, to visualise the characteristics of each cluster group for interpretation and naming.

In [None]:
#Choosing to display the first cluster
first_row_centers = cluster_centers.iloc[0, :]

# len of features
num_features = len(first_row_centers)

#Polar coordinates (endpoint=False so that the first and last variables do not share an angle)
theta = np.linspace(0, 2 * np.pi, num_features, endpoint=False)

#Plotting the cluster variable values
fig, ax = plt.subplots(subplot_kw={'projection': 'polar'})
ax.plot(theta, first_row_centers, linewidth=1, color='blue', marker='o', label='Centers')

#Adding an extra red line at the 0.0 value
ax.plot(theta, np.zeros_like(first_row_centers), color='red', linestyle='--', label='Average')

ax.set_xticks(theta)
ax.set_xticklabels(cluster_centers.columns, rotation=45, ha='right')

plt.show()

The code above visualises cluster 1 in a polar plot, with each variable represented by a point on the plot. This shows how each variable relates to the average, which is the red line. First, the first row of cluster_centers is selected using the .iloc[] method. Num_features uses len to calculate the number of features by finding the length of the first row of centres. Theta generates an array of evenly spaced angles, one per variable. The cluster variable values are then plotted alongside a red line which represents the average at 0.0.

As observed from this plot, this cluster is generally more highly educated, and its households are generally more likely not to have a dependent child and more likely to have a higher income.

In [None]:
#Choosing to display the second cluster
second_row_centers = cluster_centers.iloc[1, :]

# len of features
num_features = len(second_row_centers)

# polar coordinates (endpoint=False so that the first and last variables do not share an angle)
theta = np.linspace(0, 2 * np.pi, num_features, endpoint=False)

fig, ax = plt.subplots(subplot_kw={'projection': 'polar'})
ax.plot(theta, second_row_centers, linewidth=1, color='blue', marker='o', label='Centers')
# Add an extra red line at the 0.0 value
ax.plot(theta, np.zeros_like(second_row_centers), color='red', linestyle='--', label='Average')

ax.set_xticks(theta)
ax.set_xticklabels(cluster_centers.columns, rotation=45, ha='right')

plt.show()

The code above is almost identical to the code before apart from .iloc[1,:] was chosen instead of .iloc[0,:]. The rest has been described before.

For this cluster, generally, more people will be older, have bad health and have a medical condition of some kind. Additionally, they are more likely not to have dependent children, to be economically inactive, and to have a level 4+ qualification. This cluster seems to be populated by older middle-class people and possibly illustrates areas which are less deprived.

In [None]:
#Choosing to display the third cluster
third_row_centers = cluster_centers.iloc[2, :]

# len of features
num_features = len(third_row_centers)

# polar coordinates (endpoint=False so that the first and last variables do not share an angle)
theta = np.linspace(0, 2 * np.pi, num_features, endpoint=False)

fig, ax = plt.subplots(subplot_kw={'projection': 'polar'})
ax.plot(theta, third_row_centers, linewidth=1, color='blue', marker='o', label='Centers')
# Add an extra red line at the 0.0 value
ax.plot(theta, np.zeros_like(third_row_centers), color='red', linestyle='--', label='Average')

ax.set_xticks(theta)
ax.set_xticklabels(cluster_centers.columns, rotation=45, ha='right')

plt.show()

As before, the code is identical but the third cluster is selected.

When compared to the average, every variable in this cluster is within 0.75 standard deviations of it, meaning that most are close to the average. However, this cluster generally has more lower-income people, more people who are long-term unemployed and more households with children.

In [None]:
#Choosing to display the fourth cluster
fourth_row_centers = cluster_centers.iloc[3, :]

# len of features
num_features = len(fourth_row_centers)

# polar coordinates (endpoint=False so that the first and last variables do not share an angle)
theta = np.linspace(0, 2 * np.pi, num_features, endpoint=False)

fig, ax = plt.subplots(subplot_kw={'projection': 'polar'})
ax.plot(theta, fourth_row_centers, linewidth=1, color='blue', marker='o', label='Centers')
# Add an extra red line at the 0.0 value
ax.plot(theta, np.zeros_like(fourth_row_centers), color='red', linestyle='--', label='Average')

ax.set_xticks(theta)
ax.set_xticklabels(cluster_centers.columns, rotation=45, ha='right')

plt.show()

As before, the code is identical but the fourth cluster is selected.

For this cluster, there is generally a higher number of people with an Asian or African background. They are also generally highly educated, with no conditions and overwhelmingly between the ages of 18-29. This could be a student cluster related to Dundee's university. 

In [None]:
#Choosing to display the fifth cluster
fifth_row_centers = cluster_centers.iloc[4, :]

# len of features
num_features = len(fifth_row_centers)

# polar coordinates (endpoint=False so that the first and last variables do not share an angle)
theta = np.linspace(0, 2 * np.pi, num_features, endpoint=False)

fig, ax = plt.subplots(subplot_kw={'projection': 'polar'})
ax.plot(theta, fifth_row_centers, linewidth=1, color='blue', marker='o', label='Centers')
# Add an extra red line at the 0.0 value
ax.plot(theta, np.zeros_like(fifth_row_centers), color='red', linestyle='--', label='Average')

ax.set_xticks(theta)
ax.set_xticklabels(cluster_centers.columns, rotation=45, ha='right')

plt.show()

As before, the code is identical but the fifth cluster is selected.

For this cluster, there is generally a higher number of dependent children, and people are in good health with no conditions, are highly economically active and are generally younger but over the age of 29. This cluster seems to be working-age adults in the labour market who are highly educated, in highly paid jobs and economically active.

In [None]:
#Choosing to display the sixth cluster
sixth_row_centers = cluster_centers.iloc[5, :]

# len of features
num_features = len(sixth_row_centers)

# polar coordinates (endpoint=False so that the first and last variables do not share an angle)
theta = np.linspace(0, 2 * np.pi, num_features, endpoint=False)

fig, ax = plt.subplots(subplot_kw={'projection': 'polar'})
ax.plot(theta, sixth_row_centers, linewidth=1, color='blue', marker='o', label='Centers')
# Add an extra red line at the 0.0 value
ax.plot(theta, np.zeros_like(sixth_row_centers), color='red', linestyle='--', label='Average')

ax.set_xticks(theta)
ax.set_xticklabels(cluster_centers.columns, rotation=45, ha='right')

plt.show()

As before, the code is identical but the sixth cluster is selected.

For this cluster, there is generally a higher number of people who are economically inactive, have lower incomes, are lower educated, are in worse health and have conditions. This cluster illustrates areas of higher deprivation.
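Since the six plotting cells differ only in the row that is selected, the same charts could be drawn with one helper and a loop. This is only a sketch on a toy table of centres (two clusters, four variables, made-up values); the helper name `plot_cluster_profile` is hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plot_cluster_profile(centres, cluster_id, ax):
    """Draw one cluster centre as a closed line on a polar axis."""
    values = centres.loc[cluster_id].to_numpy()
    # endpoint=False gives every variable its own angle
    theta = np.linspace(0, 2 * np.pi, len(values), endpoint=False)
    # Repeat the first point at the end so that the outline closes
    ax.plot(np.append(theta, theta[0]), np.append(values, values[0]), marker="o")
    # Red dashed circle at 0.0, the average of every standardised variable
    ax.plot(np.linspace(0, 2 * np.pi, 100), np.zeros(100), color="red", linestyle="--")
    ax.set_xticks(theta)
    ax.set_xticklabels(centres.columns)
    ax.set_title(f"Cluster {cluster_id}")
    return theta

# Toy centres in z-score units
toy_centres = pd.DataFrame(
    [[0.5, -0.2, 1.0, -0.8], [-0.4, 0.3, -0.9, 0.6]],
    columns=["18to29", "65+", "no_car", "dependent"],
)

fig, axes = plt.subplots(1, 2, subplot_kw={"projection": "polar"}, figsize=(10, 5))
thetas = [plot_cluster_profile(toy_centres, i, ax) for i, ax in enumerate(axes)]
```

In a notebook the figure is displayed at the end of the cell; when run as a script, plt.show() would be needed. Placing the charts side by side also makes it easier to compare clusters when naming them.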

In [None]:
cluster_df = z_score_df[['Cluster']].copy()

The code above creates a new dataframe by copying only the cluster variable in z_score_df

In [None]:
final_df = pd.concat([Dundee_census_data, cluster_df], axis=1, ignore_index=False)
final_df

The code above concatenates the two dataframes: Dundee_census_data and cluster_df to create final_df. This was done so that it can be explored later on.

In [None]:
#Creation of a function to rename the 6 clusters
def rename_column(x): 
    x = str(x)
    x = x.replace("0", "Higher Middle Aged Professionals Cluster")
    x = x.replace("1", "Higher Deprivation Cluster")
    x = x.replace("2", "Average Cluster")
    x = x.replace("3", "Lower Deprivation Cluster")
    x = x.replace("4", "Older Adult (sixty-five and above) Cluster")
    x = x.replace("5", "Young Adult (eighteen to twenty-nine) and Student Cluster")
    return x

#Applying this function to the cluster column
final_df['Cluster'] = final_df['Cluster'].apply(rename_column)

The above code creates a function so that the 6 clusters can be renamed based on their characteristics. This function is then applied to the Cluster column. The names given correspond to the type of population which lives there and are also based on visual observation of the final_df cluster map which will be created later.
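The chained .replace() calls work here because none of the cluster names contain a digit, which is why the ages are written out in words. An alternative that avoids this restriction is to look the names up in a dictionary. A minimal sketch with toy labels (the names "Cluster A" to "Cluster C" are placeholders, not my actual cluster names):

```python
import pandas as pd

# Placeholder names for a toy three-cluster solution
cluster_names = {0: "Cluster A", 1: "Cluster B", 2: "Cluster C"}

toy = pd.DataFrame({"Cluster": [2, 0, 1, 0]})

# .map() looks up each cluster number in the dictionary
toy["Cluster name"] = toy["Cluster"].map(cluster_names)
print(toy)
```

With a dictionary, each number is replaced exactly once, so a name such as "Young Adult (18 to 29)" could include digits without being altered by a later replacement.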

In [None]:
final_df.explore(column='Cluster', cmap='Set1', tiles='CartoDB positron')

![image.png](attachment:715122ef-80f3-4545-9770-bb176bd6c485.png)

The code above creates an interactive map to display the clusters in Dundee. As you can see, some of the clusters are clearer than others. For instance, the older adult cluster seems to be a bit more randomly distributed across Dundee. However, the young adult and student cluster is predominantly found around the centre of the city near the bridge. This is where most students that study at Dundee university live. 

#### Final Remarks on this challenge:

##### The classification and decision making process of this challenge was subject to some subjectivity that can limit these kinds of investigations. 

Firstly, the selection of variables to represent certain groups within the geodemographic classification was based on my understanding of what can limit an individual's access to the labour market. Thus, social, educational, health and economic themes were chosen to build this classification. Some crucial variables were age, whether a household has a dependent child, education level, and general health (to name just a few). They were chosen to cover a large range of circumstances for individuals. For instance, younger adults may not have the experience to access certain jobs; if a household has a dependent child, an individual might need to work part-time instead; if the education level is low, obtaining a higher paid job will be more difficult; and finally, if a person is in bad health, they may not be able to work altogether. However, this classification is subject to my own personal understanding because of my positionality and experiences. Ultimately, the variables that I have chosen are not exhaustive and many others could have been chosen instead, such as whether a person is a first generation migrant or not. Equally, I decided to exclude many of the variables or group some. For instance, I only focused on the lowest and highest educational attainment when I could have chosen something in between. Plus, I grouped everyone between the ages of 18-29 when, in reality, young adults between 18-22 could be in university whilst those between 23-29 could be working, meaning that they are in a different life stage but are grouped as the same. These groupings and variable selections can make a huge difference to the results as Dundee does have a university, and as the interactive map above shows, students are generally clustered in one area.

Secondly, this challenge was based on my subjective decision to use a threshold of 0.79 to find any multicollinearity between the variables. The rule of thumb is stated as 80%, meaning 0.8, and therefore I am within this rule. However, if I had gone with 0.8, higher income would have been included in the geodemographic classification. Higher income correlated with lower income at -0.796105, so I made the threshold 0.79 as it was too close to 0.8 for my liking. However, if I had made it 0.7, many more of the variables would have been removed. Ultimately, this decision could have changed which polygons were included in my 6 clusters, thus if another geographer was using this method with my variables, the results could have been different.

Thirdly, during this challenge, I chose to focus on 6 clusters based on my elbow method graphs. I did attempt it with 8, but three of the clusters heavily overlapped, so I stuck with 6. A limitation is that I could have attempted it with 7 or even 5 clusters to check whether 6 is the best, though I didn't see a need for that. This decision was based on how I understood my elbow method graph and how attempting 8 and 6 clusters looked to me, therefore if someone else was replicating my methodology, they could understand it differently and choose a different number of clusters. Ultimately, this would change the interactive map that is seen above and could group together areas which are currently shown as different clusters, depicting a completely different story.

Finally, the naming of the clusters was very subjective, and each geographer will have a different name for each of the clusters based on how they understood the cluster results and the final interactive map. I did not name the clusters based on location as I am not very familiar with Dundee, therefore I named the clusters based on their results, focusing predominantly on the economic, age and education themes.


##### Challenges and insights gained during the classification process

One of the challenges during the classification process was trying to understand Dundee itself. You need to know an area and its context before analysing the cluster results (or even choosing the variables). This is a challenge as each city/region will have different factors influencing why people cannot access jobs. In the case of Dundee, it is simultaneously a university town and an old industrial town. Therefore, the difference in education is major, especially between the age groups, which influences what jobs the local population (who did not go to university) can access compared to the students at the university who have migrated to Dundee. This context was important in deciding which variables I chose to include.

An insight gained during the classification process was the major inequality observed across the city. There are clear borders between where students live and where the local population lives, which makes this city unequal in opportunity. This has given me an insight into some major socio-economic indicators, such as health, education, and age, that can be used to address disparities and support deprived areas. I will discuss how this could be done in my next section. 


##### Potential cases for the geodemographic classification that I have built based on my initial goal

The initial goal of this geodemographic classification was to classify areas whose populations are more likely to struggle to access the labour market or to have poorer prospects. Thus, I chose to focus on a plethora of variables ranging from health to children, to current job (or lack thereof). Based on my final interactive map, it is clear that there are some major pockets of deprivation and there are high levels of inequality across the city. For instance, areas where people earn less, are less educated and have poorer health could be (and were) identified (blue cluster). This can ultimately be used by the Scottish government and local council to distribute opportunities better across the city, making it easier for people in such areas to rise within society through a better job. Equally, such areas could do with higher investment which can drive these jobs. Better transport infrastructure to get to a job, or the construction of new offices, for example, could help these areas improve so that there isn't such economic inequality in the city. One of the variables which stood out was education, which is a major factor behind being able to access a high paying job. Using the classification, the council could even set up retraining programmes in those areas so that people can learn new skills and then have the opportunity to join an industry with those skills. This could potentially even out the income generated across the city even if there is high inequality in education level across Dundee.
