This script matches the names in the ONS 'built up areas' lookup table against the town names found in the various funds available to the subnational expenditure team. Once all required town names are matched, the final product is a lookup table of every town name used in the funds, along with the ONS code and the official ONS name for that code.

This is achieved by first merging all the files containing funding data into one dataframe, which is then fuzzy matched against the ONS lookup table. A new dataframe is created from the rows that didn't match; each row is cleaned and run through the fuzzy match algorithm again. The result is merged with the matched dataframe and saved as a csv file. The final step was to manually check any errors, as well as values that didn't get matched. This was done by checking the town name in the fund file and cross-referencing it with the lookup table.
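As a rough illustration of the two-pass flow described above, here is a minimal sketch on made-up data. It uses the standard library's difflib.get_close_matches as a stand-in for the fuzzywuzzy scorer used later in this script, so the scores are not identical; the function name match and the toy town lists are assumptions for illustration only.

```python
from difflib import get_close_matches

def match(names, lookup, cutoff=0.9):
    """Map each name to its closest lookup entry, if one clears the cutoff."""
    by_lower = {entry.lower(): entry for entry in lookup}
    out = {}
    for name in names:
        hits = get_close_matches(name.lower().strip(), list(by_lower), n=1, cutoff=cutoff)
        if hits:
            out[name] = by_lower[hits[0]]
    return out

# Hypothetical toy data, not taken from the real fund files.
fund_towns = ["Barrow-in-Furness", "Leeds city centre", "Nowhere-on-Sea"]
ons_names = ["Barrow-in-Furness", "Leeds", "Bristol"]

# Pass 1: match the names as they are.
first = match(fund_towns, ons_names)

# Pass 2: clean the leftovers and try again, keeping the original spelling as the key.
leftover = [t for t in fund_towns if t not in first]
cleaned = {
    t: t.lower().replace("city centre", "").replace("town centre", "").strip()
    for t in leftover
}
second_raw = match(cleaned.values(), ons_names)
second = {orig: second_raw[c] for orig, c in cleaned.items() if c in second_raw}

matched = {**first, **second}
still_unmatched = [t for t in fund_towns if t not in matched]  # left for the manual check
print(matched, still_unmatched)
```

Anything left in still_unmatched corresponds to the names that would need the manual cross-reference against the lookup table.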


In [None]:
# imports
import pandas as pd
from fuzzywuzzy import fuzz

Create dataframe of town names that will be fuzzy matched.

In [None]:
# import dataframe containing town names that need to be matched
df_0 = pd.read_csv('D:/Users/daniel.godden/Data/output/towns_fund.csv')
df_0['town']=df_0['geography_name']
df_0 = df_0['town']
df_0.head(5)

In [None]:
# import dataframe containing town names that need to be matched
df_1 = pd.read_csv('D:/Users/daniel.godden/Data/output/future_high_street_fund.csv')
df_1 = df_1['town']
df_1.head(5)

In [None]:
# import dataframe containing town names that need to be matched
df_2 = pd.read_csv('D:/Users/daniel.godden/Data/data/fuzzy/towns_from_internal_data.csv')
df_2['town'] = df_2['town_or_high_street']
df_2 = df_2['town']
df_2.head(5)

In [None]:
# import dataframe containing town names that need to be matched
df_3 = pd.read_csv('D:/Users/daniel.godden/Data/data/fuzzy/Towns Fund Expenditure Data - March 2023 - fhsf - clean.csv')
df_3['town'] = df_3['geography_name']
df_3 = df_3['town']
df_3.head(5)

In [None]:
# import dataframe containing town names that need to be matched
df_4 = pd.read_csv('D:/Users/daniel.godden/Data/data/fuzzy/Towns Fund Expenditure Data - March 2023 - towns - clean.csv')
df_4['town'] = df_4['geography_name']
df_4 = df_4['town']
df_4.head(5)

In [None]:
# merge into single dataframe
df0 = pd.merge(df_0,df_1.drop_duplicates(), how='outer')
df0 = pd.merge(df0,df_2.drop_duplicates(), how='outer')
df0 = pd.merge(df0,df_3.drop_duplicates(), how='outer')
df0 = pd.merge(df0,df_4.drop_duplicates(), how='outer')
df0.head(5)
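For comparison, a similar combined list of names can be built by stacking the series and dropping repeats. This is only a sketch on hypothetical data (the series a and b stand in for the fund series above), and unlike the chain of outer merges it also removes duplicates that sit inside the first source.

```python
import pandas as pd

# Hypothetical stand-ins for the fund series, each named 'town'.
a = pd.Series(["Leeds", "Hull", "Leeds"], name="town")
b = pd.Series(["Hull", "Bolton"], name="town")

# Stack every source, drop repeats, and reset to a clean 0..n-1 index.
towns = (
    pd.concat([a, b], ignore_index=True)
    .drop_duplicates()
    .reset_index(drop=True)
    .to_frame()
)
print(towns["town"].tolist())
```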

Create dataframe of lookup names for fuzzy matching.

In [None]:
# import lookup table
df1 = pd.read_csv('D:/Users/daniel.godden/Data/data/fuzzy/Built-up_Areas_(December_2022)_Names_and_Codes_in_England_and_Wales.csv')
df1['lookup_town_code'] = df1['BUA22CD']
df1['lookup_town'] = df1['BUA22NM']
df1 = df1[['lookup_town_code','lookup_town']]
df1['lookup_town_index'] = df1.index
df1.head(5)

Now that we have a dataframe of town names and a dataframe of lookup town names, we can use fuzzy matching to compare them.
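To show what a token-sort score measures, the sketch below approximates it with the standard library only. The helper names plain_ratio and token_sorted_ratio are hypothetical, and the real fuzz.token_sort_ratio does some extra string normalisation and may use a different backend, so exact scores can differ slightly.

```python
from difflib import SequenceMatcher

def plain_ratio(a: str, b: str) -> int:
    """Character-level similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def token_sorted_ratio(a: str, b: str) -> int:
    """Same score after lower-casing and sorting the whitespace-separated words."""
    norm = lambda s: " ".join(sorted(s.lower().split()))
    return plain_ratio(norm(a), norm(b))

# Word order lowers the plain score but not the token-sorted one.
print(plain_ratio("upon tyne newcastle", "newcastle upon tyne"))
print(token_sorted_ratio("Upon Tyne Newcastle", "newcastle upon tyne"))  # 100

# Extra words such as 'city centre' pull the score well below a threshold of 90.
print(token_sorted_ratio("leeds city centre", "leeds"))
```

The last case suggests why names carrying 'city centre' or 'town centre' fail a high threshold until those words are removed.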

In [None]:
# Clean the data

column1 = 'town'
column2 = 'lookup_town'

df0[column1] = df0[column1].str.lower().str.strip()
df1[column2] = df1[column2].str.lower().str.strip()

# Define a function to calculate the similarity score
def fuzzy_match(row1, row2):
    return fuzz.token_sort_ratio(row1[column1], row2[column2])

# Calculate the similarity score for each pair of rows
similarity_scores = []
for i, row1 in df0.iterrows():
    for j, row2 in df1.iterrows():
        similarity_scores.append({
            'town_index': i,
            'lookup_town_index': j,
            'score': fuzzy_match(row1, row2)
        })

# Convert the similarity scores to a dataframe
similarity_df = pd.DataFrame(similarity_scores)

# Keep only the pairs with a similarity score at or above the threshold
threshold = 90
matches = similarity_df[similarity_df['score'] >= threshold]
matches.head(5)
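Because every pair at or above the threshold is kept, one town can end up matched to several lookup names. If only the single best candidate per town is wanted, one possible approach is sketched below on a hypothetical similarity_df; this is an optional refinement, not part of the original method.

```python
import pandas as pd

# Hypothetical scores: town 0 has two candidates above the threshold.
similarity_df = pd.DataFrame({
    "town_index": [0, 0, 1, 1, 2],
    "lookup_town_index": [10, 11, 20, 21, 30],
    "score": [100, 92, 95, 40, 60],
})

threshold = 90
matches = similarity_df[similarity_df["score"] >= threshold]

# For each town, keep the row holding its highest score.
best = matches.loc[matches.groupby("town_index")["score"].idxmax()]
print(best)
```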


Create final dataframe for analysis.

In [None]:
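# df0 has no 'town_index' column, so join its row index to matches['town_index']
DF0 = pd.merge(df0, matches, left_index=True, right_on='town_index', how='inner')
DF0 = pd.merge(df1,DF0, on='lookup_town_index', how='inner')
DF0 = DF0.drop(['town_index','lookup_town_index'], axis=1)
DF0.head(5)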


Create a dataframe of the rows from 'df0' that don't match.

In [None]:
DF1 = pd.merge(df0,DF0.drop_duplicates(), on='town', how='left', indicator=True)
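non_match = DF1.loc[DF1['_merge']!= 'both']
# keep a one-column dataframe (not a Series) so that new columns can be added below
non_match = non_match[['town']].copy()
non_match.head(5)

In [None]:
# keep the original name, then remove the suffixes that stop the names from matching
non_match['non_matched_town'] = non_match['town']
non_match['town'] = non_match['non_matched_town'].str.replace('city centre', '', regex=False).str.replace('town centre','', regex=False)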
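The two cells above can be sketched on toy data as follows. The frames towns and matched are hypothetical stand-ins for df0 and DF0, and the sketch assumes the names are already lower-cased, as they are at this point in the script.

```python
import pandas as pd

# Hypothetical toy frames: 'towns' plays df0, 'matched' plays DF0.
towns = pd.DataFrame({"town": ["leeds", "hull city centre", "bolton town centre"]})
matched = pd.DataFrame({"town": ["leeds"], "lookup_town": ["leeds"]})

# indicator=True adds a '_merge' column saying where each row was found.
merged = towns.merge(matched, on="town", how="left", indicator=True)
non_match = merged.loc[merged["_merge"] == "left_only", ["town"]].copy()

# Keep the original spelling, then strip the suffixes before re-matching.
non_match["non_matched_town"] = non_match["town"]
non_match["town"] = (
    non_match["town"]
    .str.replace("city centre", "", regex=False)
    .str.replace("town centre", "", regex=False)
    .str.strip()
)
print(non_match)
```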


Fuzzy match the non_match dataframe with df1.

In [None]:
# Clean the data

column1 = 'town'
column2 = 'lookup_town'

non_match[column1] = non_match[column1].str.lower().str.strip()
df1[column2] = df1[column2].str.lower().str.strip()

# Define a function to calculate the similarity score
def fuzzy_match(row1, row2):
    return fuzz.token_sort_ratio(row1[column1], row2[column2])

# Calculate the similarity score for each pair of rows
similarity_scores = []
for i, row1 in non_match.iterrows():
    for j, row2 in df1.iterrows():
        similarity_scores.append({
            'town_index': i,
            'lookup_town_index': j,
            'score': fuzzy_match(row1, row2)
        })

# Convert the similarity scores to a dataframe
similarity_df = pd.DataFrame(similarity_scores)

# Keep only the pairs with a similarity score at or above the threshold
threshold = 90
matched = similarity_df[similarity_df['score'] >= threshold]
matched.head(5)

In [None]:
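# build the second-pass matches from non_match, then append them to the first-pass matches
DF2 = pd.merge(non_match, matched, left_index=True, right_on='town_index', how='inner')
DF2 = pd.merge(df1, DF2, on='lookup_town_index', how='inner')
DF2 = DF2.drop(['town_index','lookup_town_index'], axis=1)
DF0 = pd.concat([DF0, DF2], ignore_index=True)
DF0.head(5)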

In [None]:
# specify the folder and file name to save the csv file
folder_path = 'D:/Users/daniel.godden/Data/output'
file_name = 'town_names_match.csv'

DF0.to_csv(folder_path + '/' + file_name, index=False)