## Setup notebook

In [0]:
# Core
import requests
import re
import warnings
import logging

# Data handling
import pandas as pd

### For illustrative purposes, we'll look at a sample dataset of 10,000 UK jobs scraped from Q4 2024 below
The location data is already visibly 'dirty' in three ways:
- **Inconsistent format**: How specific a job's location is varies, e.g. 'London' vs 'London, GB, W1A 1AA'
- **Multiple locations advertised**: Some jobs hire for multiple locations. We'll handle these by exploding them into separate rows, geolocating each one, then merging them back together, e.g. 'LUTON AIRPORT, UK | SOUTHAMPTON INTL AIRPORT, UK'
- **Unnecessary words dirtying up the location**: Words like 'hybrid' often appear in the location fields of job descriptions. We'll strip these with a regular expression, e.g. 'Hybrid - Crawley | Hybrid - Reading'

In [0]:
df = pd.read_csv('/git1.csv')
df.head(5)

Unnamed: 0,id,location
0,13629709,London
1,12872996,"London, GB, W1A 1AA"
2,12045332,"LUTON AIRPORT, UK | SOUTHAMPTON INTL AIRPORT, UK | NORTHOLT, UK | ..."
3,9468938,"116/118 Market Street St Andrews, United Kingdom"
4,11085160,Hybrid - Crawley | Hybrid - Reading | Hybrid - Cheadle | Hybrid - ...



## Dataset manipulation to clean location data, starting with removing common words that 'dirty' up the location data

In [0]:
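# Remove words in the 'location' column that would otherwise confuse the LocationIQ API.
# Longer alternatives come first: regex alternation is ordered, so 'location' listed
# before 'locations' would leave a stray 's' behind.
# Note: the bare '-' also strips hyphens inside place names (e.g. 'Stoke-on-Trent').
df["location"] = df["location"].str.replace(
    r"multiple locations?|locations?|remote|offsite|work from home|flexible|virtual|full[ -]time|part[ -]time|hybrid|-",
    "",
    regex=True,
    flags=re.IGNORECASE,
)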

# Remove bracketed text only when the brackets themselves contain 'store' wording.
# Most other bracketed text carries useful location info that improves the API's accuracy, so we keep it.
# The pattern matches a single pair of parentheses with 'store' somewhere inside; NaN values pass through untouched.
df["location"] = df["location"].str.replace(
    r"\([^)]*store[^)]*\)", "", regex=True, flags=re.IGNORECASE
)
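
A quick sanity check on a few sample strings helps confirm that a noise-word pattern behaves as intended. The sketch below is illustrative only: `NOISE_PATTERN` is a shortened, hypothetical word list and the sample values are made up. It lists longer alternatives before shorter ones, because regex alternation takes the first alternative that matches.

```python
import re
import pandas as pd

# Hypothetical, shortened noise-word pattern for illustration only.
# Longer alternatives come first: alternation is ordered, so listing
# "location" before "locations" would leave a stray "s" behind.
NOISE_PATTERN = r"multiple locations?|locations?|remote|hybrid"

# Made-up sample values resembling the raw 'location' column
sample = pd.Series([
    "Hybrid - Crawley",
    "Multiple Locations, UK",
    "Remote",
])

cleaned = (
    sample.str.replace(NOISE_PATTERN, "", regex=True, flags=re.IGNORECASE)
    .str.strip(" -,")  # trim leftover separators and whitespace at the edges
)
print(cleaned.tolist())  # ['Crawley', 'UK', '']
```

Rows that end up empty, like the last one, are dropped in the final cleaning step.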

## Handle situations where **multiple locations are advertised**

In [0]:
# Split locations on '|', ';' and '•', since these separate multiple different locations, then process each location as its own row.
# Without this, the API cannot tell that a field contains several locations
df["location"] = df["location"].str.split(r"[\|;•]")
df = df.explode("location").reset_index(drop=True)


# Final bit of cleaning
# Strip leading & trailing whitespace first, so whitespace-only entries become empty strings
df["location"] = df["location"].str.strip()

# Drop NaN values and empty strings
df = df[df["location"].notna() & (df["location"] != "")].reset_index(drop=True)
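
To see the split-and-explode step in isolation, here is a minimal sketch on a hypothetical two-row frame (`demo` and its values are made up for illustration). It assumes the same delimiters as above and strips whitespace before filtering, so that whitespace-only fragments are dropped too.

```python
import pandas as pd

# Hypothetical two-row sample mirroring the multi-location rows above
demo = pd.DataFrame({
    "id": [1, 2],
    "location": ["Crawley | Reading | ", "London"],
})

# Split on the same delimiters, then give each location its own row
demo["location"] = demo["location"].str.split(r"[\|;•]")
demo = demo.explode("location")

# Strip whitespace, then keep only non-empty locations
demo["location"] = demo["location"].str.strip()
demo = demo[demo["location"].notna() & (demo["location"] != "")].reset_index(drop=True)

print(demo)
```

Each exploded row keeps its original `id`, which is what lets the geolocated results be linked back to the job later.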

## Now our 'location' field is much cleaner!

In [0]:
df.head(10)

Unnamed: 0,id,location
0,13629709,London
1,12872996,"London, GB, W1A 1AA"
2,12045332,"LUTON AIRPORT, UK"
3,12045332,"SOUTHAMPTON INTL AIRPORT, UK"
4,12045332,"NORTHOLT, UK"
5,12045332,United Kingdom
6,9468938,"116/118 Market Street St Andrews, United Kingdom"
7,11085160,Crawley
8,11085160,Reading
9,11085160,Cheadle


## Final step is to drop duplicates of the 'location' column
Since each LocationIQ API call incurs cost and latency, **we need to minimize redundant requests by querying only unique locations**

In [0]:
# Save the original dataset before removing duplicate locations, so we can later link geolocated addresses back to each ID
original_dataset_path = "/df_original.csv"
df.to_csv(original_dataset_path, index=False)

# Drop duplicate locations to reduce redundant API calls, and remove the 'id' column as it's no longer needed.
df_no_duplicates = df.drop_duplicates(subset=["location"]).drop("id", axis=1)

# Save and proceed with geolocation
df_no_duplicates.to_csv('/no_duplicates.csv', index=False)
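
Once the unique locations have been geolocated, the results can be linked back to each job ID by joining on `location`. The sketch below shows one possible way to do that, not the notebook's actual later step: `df_original` and `geolocated` are hypothetical stand-ins for the two saved files, and the `lat`/`lon` column names and values are made up rather than real API output.

```python
import pandas as pd

# Hypothetical stand-in for the saved original dataset (one row per id/location pair)
df_original = pd.DataFrame({
    "id": [101, 102, 103],
    "location": ["London", "Reading", "London"],
})

# Hypothetical geolocation results, one row per unique location.
# The 'lat'/'lon' names and values are illustrative, not API output.
geolocated = pd.DataFrame({
    "location": ["London", "Reading"],
    "lat": [51.5, 51.45],
    "lon": [-0.13, -0.97],
})

# A left join keeps every (id, location) row, even if a location failed to geolocate.
# validate="many_to_one" raises if 'geolocated' contains duplicate locations.
merged = df_original.merge(
    geolocated, on="location", how="left", validate="many_to_one"
)
print(merged)
```

Joining on the cleaned `location` string only works if both frames went through the same cleaning, which is why the original dataset is saved after cleaning rather than before.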