# Preparing the dataset

The goal of this notebook is to prepare a dataset of place names in Durham that can be used to train a model to generate new names. This requires cleaning the dataset with the `pandas` library, and along the way we introduce some fundamental data-cleaning techniques.

## Loading the raw data

We extract the place names in Durham from `IPN_GB_2023.csv`, the Index of Place Names published by the Office for National Statistics, which lists place names across Great Britain.

In [306]:
import pandas as pd

# Load the CSV file as a DataFrame
df = pd.read_csv('IPN_GB_2023.csv', encoding='latin-1')

# Display the first few rows of the DataFrame
print(df.head())

   placeid   place22cd    placesort          place22nm  splitind descnm  \
0      382  IPN0000419       achill           A' Chill         0    LOC   
1    17379  IPN0018256  crianlarich  A' Chrìon Làraich         0    LOC   
2    26626  IPN0027987    glasphein  A' Glas Pheighinn         0    LOC   
3    46448  IPN0048777         moss      A' Mhointeach         0    LOC   
4     6291  IPN0006615    blackpark   A' Phairce Dhubh         0    LOC   

         ctyhistnm    cty61nm   cty91nm               ctyltnm  ...    eer22cd  \
0      Argyllshire  Inverness  Highland             Inverness  ...  S15000001   
1       Perthshire      Perth   Central  Stirling and Falkirk  ...  S15000001   
2  Inverness-shire  Inverness  Highland     Ross and Cromarty  ...  S15000001   
3  Inverness-shire  Inverness  Highland             Inverness  ...  S15000001   
4  Inverness-shire  Inverness  Highland     Ross and Cromarty  ...  S15000001   

    eer22nm    pfa22cd   pfa22nm      gridgb1m  gridgb1e  grid

From the DataFrame `df` we only want the place names in County Durham. This is easy to do in `pandas`: we filter on the historic county column `ctyhistnm`.

In [307]:
# extract all names that are in County Durham
durham_df = df[df['ctyhistnm'] == 'Durham']

# keep only the place names; despite the `_df` suffix the result is a Series
durham_df = pd.Series(durham_df['place22nm'])
durham_df.name = None # removes the column name from the series object
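
The row filter and the column selection can also be written as a single `.loc` call. The following is a minimal sketch on a toy DataFrame: only the column names `ctyhistnm` and `place22nm` come from the dataset, while the rows and the variable names `toy_df` and `toy_names` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the ONS table; the rows are invented for illustration.
toy_df = pd.DataFrame({
    'place22nm': ['Addison', 'Aislaby', "A' Chill", 'Aislaby'],
    'ctyhistnm': ['Durham', 'Durham', 'Argyllshire', 'Durham'],
})

# Boolean mask selects the rows, the column label selects the column.
# Selecting a single column returns a Series rather than a DataFrame.
toy_names = toy_df.loc[toy_df['ctyhistnm'] == 'Durham', 'place22nm']
toy_names.name = None  # drop the column name, as in the cell above

print(toy_names.tolist())
```

The original index labels are kept, which is why the row numbers in later outputs are not consecutive.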

## A first look at `durham_df`

A first step is to have a quick look at our data series `durham_df` using `.head()`.

In [308]:
durham_df.head()

765           Addison
782     Adelaide Bank
1002          Aislaby
1008          Aislaby
1030           Albany
dtype: object

We already see that there are duplicate entries in `durham_df` (`Aislaby` appears twice). To remove duplicate entries we use `.drop_duplicates()`.

In [309]:
print('length of durham_df before removing duplicates: ', len(durham_df))
durham_df = durham_df.drop_duplicates()
print('length of durham_df after removing duplicates: ', len(durham_df))

length of durham_df before removing duplicates:  2047
length of durham_df after removing duplicates:  1446


Roughly 29% of the entries in `durham_df` (601 of 2047) were duplicates, so this step alone shrinks the dataset considerably.
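
Before dropping duplicates it can be useful to count them. The sketch below uses a small made-up Series (`toy_names` is a hypothetical name) to show how `.duplicated()` relates to `.drop_duplicates()`.

```python
import pandas as pd

# Made-up sample with one name repeated three times.
toy_names = pd.Series(['Addison', 'Aislaby', 'Aislaby', 'Albany', 'Aislaby'])

# duplicated() marks every repeat after the first occurrence as True.
n_repeats = int(toy_names.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each value by default.
unique_names = toy_names.drop_duplicates()

print(n_repeats, len(unique_names))
```

The number of repeats plus the number of unique names always equals the original length, which gives a quick sanity check on the counts printed above.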

## Removing unnecessary elements in names

To ensure our input to the model is high quality we need to remove superfluous parts of place names. For example, scanning through the data we see that certain place names end with the string `', unparished area'`, which appears to be an artefact of how the data was originally stored. We see these entries below.

In [310]:
result = durham_df[durham_df.str.contains(', unparished area')]
print(result)

24176       County Durham, unparished area
27112          Darlington, unparished area
37326           Gateshead, unparished area
43175          Hartlepool, unparished area
86693      South Tyneside, unparished area
89886    Stockton-on-Tees, unparished area
91432          Sunderland, unparished area
dtype: object


To remove the unnecessary part of the string we can use `.str.replace` to replace the unwanted string with an empty string.

In [311]:
durham_df = durham_df.str.replace(', unparished area', '', regex=False).str.strip()

In [312]:
result = durham_df[durham_df.str.contains(', unparished area')]
print(result)

Series([], dtype: object)
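
The remove-then-verify pattern used in the last two cells can be sketched on a few sample names. The Series below is a small illustrative sample and `toy_names`, `cleaned` and `remaining` are hypothetical names. Passing `regex=False` makes explicit that the pattern is plain text rather than a regular expression.

```python
import pandas as pd

toy_names = pd.Series(['Darlington, unparished area', 'Durham', 'St. Helen Auckland'])

# regex=False treats the pattern as plain text, so no escaping is needed.
cleaned = toy_names.str.replace(', unparished area', '', regex=False).str.strip()

# Count how many entries still carry the suffix; this should be zero.
remaining = int(cleaned.str.contains(', unparished area', regex=False).sum())

print(cleaned.tolist(), remaining)
```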


## Using regular expressions to clean entries

Similarly, when checking which place names contain `Durham` we see that some end with `(County Durham)`, which is unnecessary.

In [313]:
result = durham_df[durham_df.str.contains('Durham')]
print(result)

11902        Brandon (County Durham)
20119        Chilton (County Durham)
20949                 City of Durham
24176                  County Durham
24182                  County Durham
29968                   Durham South
29969                    Durham, Old
29970                         Durham
30397      Easington (County Durham)
66335         Murton (County Durham)
71127                     Old Durham
83936       Sherburn (County Durham)
88986        Stanley (County Durham)
104113     Whitehill (County Durham)
105046    Willington (County Durham)
dtype: object


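We use `str.replace` with a regular expression to remove the unnecessary inclusion of `(County Durham)` in some entries. The parentheses are escaped with backslashes because they are special characters in regular expressions. Note that `(Hartlepool)` is included in some entries and we remove these similarly.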

In [314]:
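# regex=True is required: without it, recent versions of pandas treat the pattern as literal text
durham_df = durham_df.str.replace(r' \(County Durham\)', '', regex=True).str.strip()
durham_df = durham_df.str.replace(r' \(Hartlepool\)', '', regex=True).str.strip()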
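
If more qualifiers of this kind turn up, the separate calls could be folded into one pattern built from a list. This is only a sketch of that idea, not the approach used in this notebook: the list `qualifiers`, the Series `toy_names` and the entry `Hart (Hartlepool)` are assumptions made for illustration.

```python
import re

import pandas as pd

toy_names = pd.Series(['Brandon (County Durham)', 'Hart (Hartlepool)', 'Old Durham'])

# Hypothetical list of bracketed qualifiers to strip.
qualifiers = ['County Durham', 'Hartlepool']

# re.escape guards any special characters in the qualifiers;
# '$' anchors the match to the end of the name.
pattern = r'\s*\((?:' + '|'.join(re.escape(q) for q in qualifiers) + r')\)$'

cleaned = toy_names.str.replace(pattern, '', regex=True).str.strip()

print(cleaned.tolist())
```

Listing the qualifiers explicitly, rather than stripping every trailing bracket, leaves entries such as `Carlton (Stockton-on-Tees)` untouched so that they can be judged individually.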


In [315]:
result = durham_df[durham_df.str.contains('Durham')]
print(result)

20949    City of Durham
24176     County Durham
24182     County Durham
29968      Durham South
29969       Durham, Old
29970            Durham
71127        Old Durham
dtype: object


We have removed every `(County Durham)` suffix. Observe that there are two entries for `County Durham`, neither of which we want to keep. Moreover, our local knowledge of Durham tells us that `Durham South` and `City of Durham` are not distinct from `Durham`, so we remove those as well.

In [316]:
durham_df = durham_df[durham_df != 'City of Durham']
durham_df = durham_df[durham_df != 'County Durham']
durham_df = durham_df[durham_df != 'Durham South']

In [318]:
result = durham_df[durham_df.str.contains('Durham')]
print(result)

29969    Durham, Old
29970         Durham
71127     Old Durham
dtype: object


## Rearranging entries in `durham_df`

`Durham, Old` and `Old Durham` are exactly the same place! Additionally, we adopt the convention that names in our dataset contain no commas, so for each such entry we swap the parts before and after the comma. First we view the entries in `durham_df` that need this treatment.

In [319]:
result = durham_df[durham_df.str.contains(',')]
print(result)

3755                                    Auckland, St. Helen
3756                                         Auckland, West
3980                                       Aycliffe, Newton
3981                                       Aycliffe, School
5280                                         Barmston, High
                                ...                        
103556                                           Wham, High
105056                                      Willington, Low
105472                                         Wingate, Old
105563                                         Winning, Esh
106176    Wolsingham Park Moor lands cmn to Stanhope, To...
Length: 164, dtype: object


The following code finds the elements of `durham_df` that contain a comma, removes the comma and swaps the parts of the string on either side of it.

In [320]:
ix = durham_df.str.contains(',')

durham_df.loc[ix] = durham_df.loc[ix].str.split(', ').apply(lambda x: f'{x[1]} {x[0]}')
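
The split-and-apply step above assumes each matching entry contains exactly one `', '`. As an alternative sketch, the same swap can be written as a single regular expression with backreferences, which leaves entries without a comma untouched. The names `toy_names` and `swapped` are hypothetical, and the sample entries are taken from the outputs above.

```python
import pandas as pd

toy_names = pd.Series(['Auckland, West', 'Durham, Old', 'Durham'])

# Group 1 is the text before the first ', ' and group 2 is everything after it.
# The replacement string writes group 2 first, then a space, then group 1.
swapped = toy_names.str.replace(r'^(.+?), (.+)$', r'\2 \1', regex=True)

print(swapped.tolist())
```

Because the first group is non-greedy, a name with several commas is split at the first one only, so such entries would still need a manual check.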

In [321]:
print(durham_df[ix])

3755                                     St. Helen Auckland
3756                                          West Auckland
3980                                        Newton Aycliffe
3981                                        School Aycliffe
5280                                          High Barmston
                                ...                        
103556                                            High Wham
105056                                       Low Willington
105472                                          Old Wingate
105563                                          Esh Winning
106176    Tow Law and Wolsingham Wolsingham Park Moor la...
Length: 164, dtype: object


The last entry looks rather suspicious! It turns out that there is indeed a parish called `Wolsingham Park Moor lands cmn to Stanhope, Tow Law and Wolsingham` that corresponds to an area around Wolsingham, Tow Law and Stanhope. However, Wolsingham, Tow Law and Stanhope are all in `durham_df` and we can safely remove this strange entry. Inspecting `durham_df` further, there are two similar entries that we also remove.

In [322]:
durham_df = durham_df[durham_df != 'Tow Law and Wolsingham Wolsingham Park Moor lands cmn to Stanhope']
durham_df = durham_df[durham_df != 'Lands Common to Brancepeth and Brandon and Byshottles']
durham_df = durham_df[durham_df != 'Lynesack and Softley and South Bedburn Lands common to Hamsterley']

## Deeper into `durham_df`

Inspecting `durham_df` we find `Northern Parishes` and `Western Parishes` that correspond to areas rather than places. Consequently we remove these from `durham_df`.

In [323]:
durham_df = durham_df[durham_df != 'Northern Parishes']
durham_df = durham_df[durham_df != 'Western Parishes']

Looking through `durham_df` we see many entries that are misspelt, contain superfluous details or cannot be found in online searches. We remove these entries and briefly justify each removal in a comment.

In [324]:
durham_df = durham_df[durham_df != 'Auk Side'] # already has Aukside
durham_df = durham_df[durham_df != 'Carlton (Stockton-on-Tees)']
durham_df = durham_df[durham_df != 'Chilton Buildings'] # old name for Chilton
durham_df = durham_df[durham_df != 'Claravale'] # Should be Clara Vale which is in durham_df
durham_df = durham_df[durham_df != 'Copthill'] # Copt Hill
durham_df = durham_df[durham_df != 'Cornforth Lane'] # Cornforth, can't find anything other than a street
durham_df = durham_df[durham_df != 'Dunston Hill and Whickham East'] # can't find Whickham East, Dunston Hill is already there
durham_df = durham_df[durham_df != 'Eden and Lyvennet Vale'] # can't find Lyvennet Vale but Eden is there
durham_df = durham_df[durham_df != 'Fighting Cocks'] # a pub?
durham_df = durham_df[durham_df != 'Hardwick and Salters Lane'] # can't find Salters Lane
durham_df = durham_df[durham_df != 'Houghton le Side'] # Houghton-le-Side is already in durham_df
durham_df = durham_df[durham_df != 'Mill Dam Valley'] # seems not to be a place
durham_df = durham_df[durham_df != 'Middleton in Teesdale'] # Middleton-in-Teesdale already in durham_df
durham_df = durham_df[durham_df != 'Northumberland'] # that's a county!
durham_df = durham_df[durham_df != 'Nova Scotia'] # ??
durham_df = durham_df[durham_df != 'Number One'] # ?????
durham_df = durham_df[durham_df != 'Pelaw Main'] # same as Pelaw
durham_df = durham_df[durham_df != 'Heworth Lane'] # can't find on map
durham_df = durham_df[durham_df != 'Heworth Shore'] # can't find
durham_df = durham_df[durham_df != 'Upper Heworth'] # same as High Heworth
durham_df = durham_df[durham_df != 'Crookhill and Stella Ryton'] # can't find
durham_df = durham_df[durham_df != 'Sherburn Road'] # can't find
durham_df = durham_df[durham_df != 'West Sherburn'] # can't find
durham_df = durham_df[durham_df != 'Stillington (Stockton-on-Tees)'] # don't want two
durham_df = durham_df[durham_df != 'Whitburn (South Tyneside)']
durham_df = durham_df[durham_df != 'Whinny Hill'] # Whinney hill
durham_df = durham_df[durham_df != 'Witton le Wear'] # already has Witton-le-Wear
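
A long run of one-line filters like this can also be expressed as a single lookup. The sketch below is an alternative under that assumption, shown on a toy Series: the dictionary `to_drop` maps each rejected name to its reason (copied from three of the comments above), and `toy_names` and `kept` are hypothetical names.

```python
import pandas as pd

toy_names = pd.Series(['Aukside', 'Auk Side', 'Pelaw', 'Pelaw Main', 'Clara Vale', 'Claravale'])

# Each rejected name is stored together with the reason for dropping it.
to_drop = {
    'Auk Side': 'already has Aukside',
    'Pelaw Main': 'same as Pelaw',
    'Claravale': 'should be Clara Vale, which is already present',
}

# isin() tests membership for every entry at once; '~' inverts the mask.
kept = toy_names[~toy_names.isin(list(to_drop))]

print(kept.tolist())
```

Keeping the reasons in a dictionary means the justification for each removal stays next to the name, much like the inline comments do.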

Another issue is that the same place can have several entries with a direction such as `North` attached. Where such an entry does not correspond to a distinct place, we remove it.

In [325]:
durham_df = durham_df[durham_df != 'Barnard Castle East']
durham_df = durham_df[durham_df != 'Barnard Castle West']
durham_df = durham_df[durham_df != 'Chester-le-Street East']
durham_df = durham_df[durham_df != 'Chester-le-Street North']
durham_df = durham_df[durham_df != 'Chester-le-Street South']
durham_df = durham_df[durham_df != 'Chester-le-Street West Central']
durham_df = durham_df[durham_df != 'Consett North']
durham_df = durham_df[durham_df != 'Consett South']
durham_df = durham_df[durham_df != 'North Grindon']
durham_df = durham_df[durham_df != 'South Grindon']
durham_df = durham_df[durham_df != 'West Hartlepool']
durham_df = durham_df[durham_df != 'The Hartlepools']
durham_df = durham_df[durham_df != 'Norton North']
durham_df = durham_df[durham_df != 'Norton South']
durham_df = durham_df[durham_df != 'Norton West']
durham_df = durham_df[durham_df != 'Washington North']
durham_df = durham_df[durham_df != 'Washington South']
durham_df = durham_df[durham_df != 'Washington Station']
durham_df = durham_df[durham_df != 'Washington West']

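We manually correct entries that are inaccurate or misleading. I know `Newton Hall` well, and `Framwellgate` refers to `Framwellgate Moor`, which is already included in `durham_df`, so the combined entry becomes simply `Newton Hall`. Similarly, `Old (or Painshaw) Penshaw` becomes `Old Penshaw`.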

In [326]:
durham_df[durham_df == 'Framwellgate and Newton Hall'] = 'Newton Hall' 
durham_df[durham_df == 'Old (or Painshaw) Penshaw'] = 'Old Penshaw'
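
When there are several corrections of this kind, they can be collected in a dictionary and applied in one step. The following sketch uses `Series.replace` on a toy Series. The names `toy_names`, `corrections` and `corrected` are hypothetical, and the two corrections are the ones made above.

```python
import pandas as pd

toy_names = pd.Series(['Framwellgate and Newton Hall', 'Old (or Painshaw) Penshaw', 'Durham'])

# Map each inaccurate entry to its corrected form.
corrections = {
    'Framwellgate and Newton Hall': 'Newton Hall',
    'Old (or Painshaw) Penshaw': 'Old Penshaw',
}

# Series.replace with a dict swaps whole values that match a key exactly;
# the brackets are not treated as regex because regex matching is off by default.
corrected = toy_names.replace(corrections)

print(corrected.tolist())
```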

Throughout `durham_df` there is an inconsistent usage of `St` and `St.` in the names of places as an abbreviation of Saint. We adopt the convention of writing `St`, which we can implement with a regular expression.

In [327]:
durham_df = durham_df.str.replace(r'St\.', 'St', regex=True)


Another inconsistency is the use of both `&` and `and`. We choose to use the string `and`. 

In [328]:
durham_df = durham_df.str.replace(r'&', 'and')
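
Both normalisations can be tried out on a few sample names. In this sketch the Series `toy_names` is made up (in particular `Brandon & Byshottles` is an invented spelling used only to exercise the `&` rule), and the word boundary `\b` is an extra safeguard that the cell above does not use.

```python
import pandas as pd

toy_names = pd.Series(['St. Helen Auckland', "St John's Chapel", 'Brandon & Byshottles'])

# '\b' requires 'St' to start a word and '\.' matches a literal full stop.
normalised = toy_names.str.replace(r'\bSt\.', 'St', regex=True)

# '&' has no special meaning, so a literal replacement is enough.
normalised = normalised.str.replace('&', 'and', regex=False)

print(normalised.tolist())
```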

In [329]:
result = durham_df[durham_df.str.contains(' and ')]
print(result)

1663               Alston and Fellside
3976      Aycliffe North and Middridge
4875            Bank Top and Lascelles
6294                  Beacon and Bents
7344             Benwell and Scotswood
                      ...             
103760    Whickham South and Sunniside
103866            Whitburn and Marsden
105053          Willington and Hunwick
105433       Windy Nook and Whitehills
105546          Winlaton and High Spen
Length: 61, dtype: object


Many entries of the form `X and Y` have both `X` and `Y` as separate entries in `durham_df`. In these cases we remove the combined entry.

In [330]:
# Find entries in the form 'X and Y' that are both present in the Series
entries_to_remove = []
for entry in durham_df:
    parts = entry.split(' and ')
    if len(parts) == 2 and parts[0] in durham_df.values and parts[1] in durham_df.values:
        entries_to_remove.append(entry)

# Remove the identified entries from the Series
durham_df = durham_df[~durham_df.isin(entries_to_remove)]
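
To see what this rule does, here is a small worked sketch on made-up data. It follows the same logic as the loop above but looks names up in a `set`, which is faster than searching `durham_df.values` for every entry. The helper `both_parts_known` and the Series `toy_names` are hypothetical.

```python
import pandas as pd

toy_names = pd.Series(['Whitburn', 'Marsden', 'Whitburn and Marsden', 'Alston and Fellside'])

# A set gives fast membership tests.
known = set(toy_names)

def both_parts_known(entry):
    """Return True for 'X and Y' entries where X and Y are both listed separately."""
    parts = entry.split(' and ')
    return len(parts) == 2 and all(part in known for part in parts)

compound = toy_names[toy_names.apply(both_parts_known)]
kept = toy_names[~toy_names.isin(compound)]

print(kept.tolist())
```

In this sample `Alston and Fellside` survives because neither `Alston` nor `Fellside` is listed on its own.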

After cleaning, some entries may be duplicated again. For example, `Durham, Old` became `Old Durham`, which was already present. To be safe we remove any duplicates one last time.

In [331]:
print('length of durham_df before removing duplicates again: ', len(durham_df))
durham_df = durham_df.drop_duplicates()
print('length of durham_df after removing duplicates again: ', len(durham_df))

length of durham_df before removing duplicates again:  1349
length of durham_df after removing duplicates again:  1182


For the final version we relabel the indices from 1 to `len(durham_df)` for clarity.

In [332]:
durham_df.index = range(1, len(durham_df) + 1)

We convert every name to lower case and take one final look before saving!

In [336]:
durham_df = durham_df.str.lower()

In [337]:
print(durham_df.to_string(max_rows=None))

1                            addison
2                      adelaide bank
3                            aislaby
4                             albany
5                        albert hill
6                aldin grange bridge
7                 allendale cottages
8                         allensford
9                       allenshields
10                allerdene colliery
11                         allerdene
12               alston and fellside
13                       alum waters
14                    annfield plain
15                 archdeacon newton
16                         ashbrooke
17                     auckland park
18                 st helen auckland
19                     west auckland
20                           aukside
21                       auton stile
22                       axwell park
23                     aycliffe east
24      aycliffe north and middridge
25                  aycliffe village
26                     aycliffe west
27                   newton aycliffe
2

In [338]:
durham_df.to_csv('place_names_durham.txt', sep='\t', index=False, header=False)
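
A quick way to check the saved file is to read it back and compare it with the Series. The sketch below does this round trip in a temporary directory so that it does not touch the real output. The file name `toy_place_names.txt` and the Series `toy_names` are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy_names = pd.Series(['addison', 'adelaide bank', 'st helen auckland'])

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / 'toy_place_names.txt'  # hypothetical file name

    # Same options as above: one name per line, no index and no header.
    toy_names.to_csv(out_path, sep='\t', index=False, header=False)

    # Read the file back as plain lines of text.
    lines = out_path.read_text(encoding='utf-8').splitlines()

print(lines == toy_names.tolist())
```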

In [18]:
with open('place_names_UK.txt', 'r', encoding='utf-8') as f:
    places = f.read().splitlines()