# Analysing food hygiene data

This notebook contains the code for analysing food hygiene data to establish the scale and nature of uninspected establishments.

We need to:

* Compile: fetch data from the FSA
* Clean: convert these from XML files to dataframes
* Question: find out how many haven't been inspected in 2 years or more (or other timeframes)
* Question: find out how many haven't yet been inspected
* Context: work this out as a percentage
* Context: establish the makeup of those establishments (e.g. how many are rated below 3? How many just 3?)
* Combine: repeat this for all authorities
* Combine: fetch Google Places API data
* Context: what's the average rating of those places not inspected?

## Import the libraries

In [None]:
#import the libraries we'll need
import requests
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup  # parses the XML (the 'xml' parser requires lxml)
import pandas as pd

In [None]:
#import a library for downloading files
from google.colab import files

In [None]:
#for using the isnan() function to check for missing values
import math

## Fetch the XML file

Each authority's XML file has its own URL, listed at https://ratings.food.gov.uk/open-data. We start by trying one.

In [None]:
#Brentwood's FSA data URL
url = "https://ratings.food.gov.uk/api/open-data-files/FHRS111en-GB.xml"

# Fetch the URL
response = requests.get(url)
# Store the content
xml_data = response.content

## Parse into a 'soup' and then into a dataframe

Now that we have the content of the response, we convert it to a BeautifulSoup object so that we can parse it as structured data.

As we parse it we store the info in a `pandas` dataframe.

In [None]:
# Parse the XML data
soup = BeautifulSoup(xml_data, 'xml')  # Use BeautifulSoup

# Create empty lists to store data
establishments = []
business_names = []
address_line_1s = []
address_line_2s = []
address_line_3s = []
address_line_4s = []
post_codes = []
rating_values = []
rating_dates = []
business_types = []
las = []
nrps = []
lats = []
lngs = []

# Find all establishment details
establishments_data = soup.find_all('EstablishmentDetail')

# Extract data for each establishment
for establishment in establishments_data:
  business_names.append(establishment.find('BusinessName').text.strip() or "")
  las.append(establishment.find('LocalAuthorityName').text.strip() or "")
  nrps.append(establishment.find('NewRatingPending').text.strip() or "")
  #if it is there
  if establishment.find('AddressLine1') != None:
    address_line_1s.append(establishment.find('AddressLine1').text.strip() or "")
  else:
    address_line_1s.append('')
  if establishment.find('AddressLine2') != None:
    address_line_2s.append(establishment.find('AddressLine2').text.strip() or "")
  else:
    address_line_2s.append('')
  if establishment.find('AddressLine3') != None:
    address_line_3s.append(establishment.find('AddressLine3').text.strip() or "")
  else:
    address_line_3s.append('')
  if establishment.find('AddressLine4') != None:
    address_line_4s.append(establishment.find('AddressLine4').text.strip() or "")
  else:
    address_line_4s.append('')
  if establishment.find('PostCode') != None:
    post_codes.append(establishment.find('PostCode').text.strip() or "")
  else:
    post_codes.append('')
  rating_values.append(establishment.find('RatingValue').text.strip() or "")
  rating_dates.append(establishment.find('RatingDate').text.strip() or "")
  business_types.append(establishment.find('BusinessType').text.strip() or "")

  # Find Geocode data (the tag, or its Latitude, might not exist)
  geocode = establishment.find('Geocode')
  #check the Geocode tag itself before searching inside it
  if geocode is not None and geocode.find('Latitude') is not None:
    lats.append(geocode.find('Latitude').text.strip())
    lngs.append(geocode.find('Longitude').text.strip())
  else:
    lats.append("")
    lngs.append("")

# Create a dictionary from lists
data = {
    "BusinessName": business_names,
    "Authority": las,
    "AddressLine1": address_line_1s,
    "AddressLine2": address_line_2s,
    "AddressLine3": address_line_3s,
    "AddressLine4": address_line_4s,
    "PostCode": post_codes,
    "RatingValue": rating_values,
    "NewRatingPending": nrps,
    "RatingDate": rating_dates,
    "BusinessType": business_types,
    "Lat": lats,
    "Lng": lngs
}

# Create pandas dataframe
df = pd.DataFrame(data)

# Print
df


Unnamed: 0,BusinessName,Authority,AddressLine1,AddressLine2,AddressLine3,AddressLine4,PostCode,RatingValue,NewRatingPending,RatingDate,BusinessType,Lat,Lng
0,:ROSEBANK NURSING HOMES LTD T/A Ardtully Retir...,Brentwood,Ardtully Retirement Home,Station Lane,Ingatestone,Essex,CM4 0BL,5,False,2023-11-20,Hospitals/Childcare/Caring Premises,51.666644,0.380843
1,124 (Essex) Transport Squadron Rlc Army Reserv...,Brentwood,Territorial Army Centre,Clive Road,Warley,Brentwood,CM13 3UJ,5,False,2023-07-31,Pub/bar/nightclub,51.599399,0.296689
2,55 Above Ltd,Brentwood,,,,,,5,False,2020-01-21,Retailers - other,,
3,A & S,Brentwood,13 Eastham Crescent,Brentwood,Essex,,CM13 2BN,5,False,2023-02-14,Retailers - other,51.612116,0.327679
4,A B Roots,Brentwood,Spring Farm,Blackmore Road,Ingatestone,Essex,CM4 0NP,5,False,2019-07-15,Retailers - other,51.677724,0.355835
...,...,...,...,...,...,...,...,...,...,...,...,...,...
660,Yiamas & NYX,Brentwood,Restaurant,Yiamas,Ongar Road,Pilgrims Hatch,CM15 9SS,5,False,2023-11-17,Other catering premises,51.6544384,0.2683294
661,Yorkies,Brentwood,186 Warley Hill,Warley,Essex,,CM14 5HF,5,False,2023-02-13,Restaurant/Cafe/Canteen,51.6057156,0.2965792
662,ZEBRANO,Brentwood,161 Kings Road,Brentwood,Essex,,CM14 4EG,5,False,2023-12-06,Other catering premises,51.6150512695313,0.299198001623154
663,Zizzi,Brentwood,72-74 High Street,Brentwood,Essex,,CM14 4AN,5,False,2022-06-17,Other catering premises,51.619841,0.301203
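
The repeated "is the tag there?" checks above could be wrapped in one small helper. The sketch below is only an illustration: it swaps in the standard library's `ElementTree` (already imported above) instead of BeautifulSoup so that it runs without extra packages, and the helper name `field` and the two sample establishments are invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented sample data, loosely following the tags used above
SAMPLE_XML = """
<FHRSEstablishment>
  <EstablishmentCollection>
    <EstablishmentDetail>
      <BusinessName>Example Cafe</BusinessName>
      <PostCode>CM14 4AN</PostCode>
      <RatingValue>5</RatingValue>
      <Geocode><Latitude>51.6</Latitude><Longitude>0.3</Longitude></Geocode>
    </EstablishmentDetail>
    <EstablishmentDetail>
      <BusinessName>Example Van</BusinessName>
      <RatingValue>AwaitingInspection</RatingValue>
      <Geocode/>
    </EstablishmentDetail>
  </EstablishmentCollection>
</FHRSEstablishment>
""".strip()


def field(establishment, path):
    """Return the stripped text at `path`, or '' if the tag is missing or empty."""
    # findtext() returns None when the tag is absent, so fall back to ''
    return (establishment.findtext(path) or "").strip()


root = ET.fromstring(SAMPLE_XML)
rows = []
for est in root.iter("EstablishmentDetail"):
    rows.append({
        "BusinessName": field(est, "BusinessName"),
        "PostCode": field(est, "PostCode"),
        "RatingValue": field(est, "RatingValue"),
        "Lat": field(est, "Geocode/Latitude"),
        "Lng": field(est, "Geocode/Longitude"),
    })

print(rows)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame()`, which avoids keeping a separate list per column.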


### Identify the business types

We are not going to look at all businesses, so we need a list of types that we might choose from.

In [None]:
#show the unique values, and the count of each
df['BusinessType'].value_counts()

Restaurant/Cafe/Canteen                  122
Other catering premises                  121
Retailers - other                        116
Pub/bar/nightclub                         89
Takeaway/sandwich shop                    52
School/college/university                 49
Hospitals/Childcare/Caring Premises       44
Mobile caterer                            36
Retailers - supermarkets/hypermarkets     18
Manufacturers/packers                      8
Hotel/bed & breakfast/guest house          5
Importers/Exporters                        2
Distributors/Transporters                  2
Farmers/growers                            1
Name: BusinessType, dtype: int64

### Filtering to select categories

The categories we are going to go with as fitting into our criteria of 'places a person might eat out' are:

* Restaurant/Cafe/Canteen
* Pub/bar/nightclub
* Takeaway/sandwich shop
* Mobile caterer


We are going to exclude 'Other catering premises' as inspection suggests this is almost entirely sports clubs and home-based cake/baking operations.

In [None]:
filtereddf = df[
    (df['BusinessType'] == 'Restaurant/Cafe/Canteen') |
    (df['BusinessType'] == 'Pub/bar/nightclub') |
    (df['BusinessType'] == 'Takeaway/sandwich shop') |
    (df['BusinessType'] == 'Mobile caterer')
 ]

filtereddf['BusinessType'].value_counts()

Restaurant/Cafe/Canteen    122
Pub/bar/nightclub           89
Takeaway/sandwich shop      52
Mobile caterer              36
Name: BusinessType, dtype: int64
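
The same filter can be written more compactly with `.isin()`, which keeps the category list in one place. This is a minimal sketch on an invented sample dataframe; the names `eating_out_types` and `sample` are assumptions for the example.

```python
import pandas as pd

# The four 'eating out' categories chosen above
eating_out_types = [
    "Restaurant/Cafe/Canteen",
    "Pub/bar/nightclub",
    "Takeaway/sandwich shop",
    "Mobile caterer",
]

# Invented sample rows
sample = pd.DataFrame({
    "BusinessName": ["Cafe A", "Shop B", "Van C"],
    "BusinessType": ["Restaurant/Cafe/Canteen", "Retailers - other", "Mobile caterer"],
})

# Keep only the rows whose BusinessType is in the list
filtered_sample = sample[sample["BusinessType"].isin(eating_out_types)]
print(filtered_sample)
```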

### FUNCTION: parsefsaxml

We are going to need to do this repeatedly, so let's store in a function.

In [None]:
#define the function - it takes one parameter we call 'url'
def parsefsaxml(url):
  # Fetch the URL
  response = requests.get(url)
  # Store the content
  xml_data = response.content
  # Parse the XML data
  soup = BeautifulSoup(xml_data, 'xml')
  # Create empty lists to store data
  establishments = []
  business_names = []
  address_line_1s = []
  address_line_2s = []
  address_line_3s = []
  address_line_4s = []
  post_codes = []
  rating_values = []
  rating_dates = []
  business_types = []
  las = []
  nrps = []
  lats = []
  lngs = []

  # Find all establishment details
  establishments_data = soup.find_all('EstablishmentDetail')

  # Extract data for each establishment
  for establishment in establishments_data:
    business_names.append(establishment.find('BusinessName').text.strip() or "")
    las.append(establishment.find('LocalAuthorityName').text.strip() or "")
    nrps.append(establishment.find('NewRatingPending').text.strip() or "")
    #if it is there
    if establishment.find('AddressLine1') != None:
      address_line_1s.append(establishment.find('AddressLine1').text.strip() or "")
    else:
      address_line_1s.append('')
    if establishment.find('AddressLine2') != None:
      address_line_2s.append(establishment.find('AddressLine2').text.strip() or "")
    else:
      address_line_2s.append('')
    if establishment.find('AddressLine3') != None:
      address_line_3s.append(establishment.find('AddressLine3').text.strip() or "")
    else:
      address_line_3s.append('')
    if establishment.find('AddressLine4') != None:
      address_line_4s.append(establishment.find('AddressLine4').text.strip() or "")
    else:
      address_line_4s.append('')
    if establishment.find('PostCode') != None:
      post_codes.append(establishment.find('PostCode').text.strip() or "")
    else:
      post_codes.append('')
    #this trips up on https://ratings.food.gov.uk/api/open-data-files/FHRS527en-GB.xml
    if establishment.find('RatingValue') != None:
      rating_values.append(establishment.find('RatingValue').text.strip() or "")
    else:
      rating_values.append('')
    rating_dates.append(establishment.find('RatingDate').text.strip() or "")
    business_types.append(establishment.find('BusinessType').text.strip() or "")

    # Find Geocode data (the tag, or its Latitude, might not exist)
    geocode = establishment.find('Geocode')
    #check the Geocode tag itself before searching inside it
    if geocode is not None and geocode.find('Latitude') is not None:
      lats.append(geocode.find('Latitude').text.strip())
      lngs.append(geocode.find('Longitude').text.strip())
    else:
      lats.append("")
      lngs.append("")

  # Create a dictionary from lists
  data = {
      "BusinessName": business_names,
      "Authority": las,
      "AddressLine1": address_line_1s,
      "AddressLine2": address_line_2s,
      "AddressLine3": address_line_3s,
      "AddressLine4": address_line_4s,
      "PostCode": post_codes,
      "RatingValue": rating_values,
      "NewRatingPending": nrps,
      "RatingDate": rating_dates,
      "BusinessType": business_types,
      "Lat": lats,
      "Lng": lngs
  }
  # Create pandas dataframe
  df = pd.DataFrame(data)
  #return to whatever called the function
  return df


### Extract the year separately

The `RatingDate` column is currently a text string. As we want to filter on year, we can extract that into a dedicated column.

In [None]:
#use .to_datetime() from pandas to convert the column to datetime
#add the method .dt.year to extract the year from the resulting list of datetime objects
df['ratingYear'] = pd.to_datetime(df['RatingDate']).dt.year
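
Rows with no rating date become missing values when converted, which is why the years appear as decimals such as `2018.0` in the results. The sketch below, using an invented series of dates, shows that behaviour and one optional way to keep whole-number years with pandas' nullable `Int64` type.

```python
import pandas as pd

# Invented sample: the middle value mimics an establishment with no rating date
dates = pd.Series(["2018-02-12", "", "2023-11-20"])

# errors="coerce" turns anything unparseable into a missing value (NaT)
years = pd.to_datetime(dates, errors="coerce").dt.year
print(years.dtype)   # a float type, because of the missing value

# Optional: a nullable integer type keeps 2018 rather than 2018.0
years_int = years.astype("Int64")
print(years_int)
```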

## Filter to those inspected before 2019

In [None]:
#filter to rows where the rating year is before 2019
before2019 = df[df['ratingYear'] < 2019]
before2019

Unnamed: 0,BusinessName,AddressLine1,AddressLine2,PostCode,RatingValue,RatingDate,BusinessType,Lat,Lng,ratingYear
9,A.S.K. Wines,88 Church Lane,Doddinghurst,CM15 0NG,Exempt,2018-02-12,Retailers - other,51.667835,0.298024,2018.0
15,Adele Bywater Cakes,,,,5,2016-01-25,Other catering premises,,,2016.0
35,Bar Bar.Co,159 Kings Road,Brentwood,CM14 4EG,5,2017-11-13,Retailers - other,51.6150512695313,0.299198001623154,2017.0
44,Bentley District Village Club,Bentley Village Hall,Ongar Road,CM15 9RZ,5,2014-10-21,Pub/bar/nightclub,51.640834,0.275657,2014.0
59,Boots UK Ltd,51 High Street,Brentwood,CM14 4RH,5,2017-07-03,Retailers - other,51.620859,0.302945,2017.0
...,...,...,...,...,...,...,...,...,...,...
626,Travis Perkins,41 Coxtie Green Road,Pilgrims Hatch,CM14 5PN,5,2017-12-08,Retailers - other,51.639362,0.27502,2017.0
629,Vaporetto,,,,5,2015-10-07,Mobile caterer,,,2015.0
636,W H Smith Ltd,1 - 2 Baytree Centre,Brentwood,CM14 4BX,Exempt,2015-07-06,Retailers - other,51.619507,0.302212,2015.0
643,Well Ltd,201 Rayleigh Road,Hutton,CM13 1LZ,Exempt,2017-12-27,Retailers - other,51.6329536437988,0.351101011037827,2017.0


### Show that as a percentage

We can express that as a proportion by dividing the length (number of rows) of the filtered dataset by the length of the unfiltered dataset. Multiplying the result by 100 gives the percentage.

In [None]:
len(before2019)/len(df)

0.13922155688622753

## What are these establishments like?

We use `.value_counts()` to generate a frequency table of how many rows there are in each category (rating value).

In [None]:
before2019['RatingValue'].value_counts()

5         71
Exempt    17
4          5
Name: RatingValue, dtype: int64

In [None]:
#divide all by the rows to get as %
before2019['RatingValue'].value_counts()/len(before2019)

5         0.763441
Exempt    0.182796
4         0.053763
Name: RatingValue, dtype: float64
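
As an alternative to dividing by the number of rows, `.value_counts()` accepts `normalize=True` to return shares directly. A minimal sketch on invented ratings:

```python
import pandas as pd

# Invented sample ratings
ratings = pd.Series(["5", "5", "5", "Exempt"])

# normalize=True returns each category's share of the total instead of a count
shares = ratings.value_counts(normalize=True)
print(shares)
```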

## Repeat for 'Awaiting inspection'

Some establishments don't have any date because they are 'Awaiting inspection'. Let's look at them:

In [None]:
len(df[df['RatingValue'] == 'AwaitingInspection' ])

17

In [None]:
#divide the part by the whole
len(df[df['RatingValue'] == 'AwaitingInspection' ])/len(df)

0.025449101796407185

In [None]:
#add the two together
ai_perc = len(df[df['RatingValue'] == 'AwaitingInspection' ])/len(df)
pre19perc = len(before2019)/len(df)
ai_perc+pre19perc

0.16467065868263472

## Fetch the codes for each authority

We have collected the codes for each authority covered by the FSA and stored them in a Google Sheet, published as a CSV, which is imported below.

Because the ID codes are numeric, they will be imported as numbers unless we specify otherwise, so we add the `dtype=str` parameter below to ensure all data is imported as strings.

In [None]:
#store the URL we've published the Google Sheet at (as a CSV)
fsacodesurl = "https://docs.google.com/spreadsheets/d/e/2PACX-1vT56kwmL6BGdve2HLvPqazY9qIOC9R9OC6-yzcmwnaKgca3MrImKe2-tPF7ltlE29OkPn9ioiSBuDSi/pub?gid=1903756103&single=true&output=csv"
#import, all fields as strings
fsacodedf = pd.read_csv(fsacodesurl, dtype=str)
#show
fsacodedf

Unnamed: 0,ID,LA only
0,297,Babergh
1,109,Basildon
2,701,Bedford
3,110,Braintree
4,227,Breckland
...,...,...
358,567,Rhondda Cynon Taf
359,568,Swansea
360,569,Torfaen
361,570,Vale of Glamorgan
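
To illustrate why `dtype=str` matters, the sketch below reads a tiny invented CSV both ways. A code with a leading zero, such as `027`, loses the zero when imported as a number, which would then produce the wrong URL.

```python
import io
import pandas as pd

# Invented two-row CSV in the same shape as the published sheet
csv_text = "ID,LA only\n027,Example A\n297,Example B\n"

# Default import: the ID column is treated as numeric
as_numbers = pd.read_csv(io.StringIO(csv_text))
# With dtype=str every column stays as text
as_strings = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(as_numbers["ID"].tolist())
print(as_strings["ID"].tolist())
```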


### Generate a list of URLs

These codes mean we can now generate URLs for the API endpoint for each authority.

The URLs look like this:

`https://ratings.food.gov.uk/api/open-data-files/FHRS561en-GB.xml`

The only bit that changes is the three-digit code after `FHRS`.

In [None]:
#create a list to store the urls
apiurl_list = []
#loop through the codes
for i in fsacodedf['ID']:
  #form the URL with that in the middle
  apiurl = "https://ratings.food.gov.uk/api/open-data-files/FHRS"+i+"en-GB.xml"
  #append to the list
  apiurl_list.append(apiurl)

#show the first 5 results
apiurl_list[:5]

['https://ratings.food.gov.uk/api/open-data-files/FHRS297en-GB.xml',
 'https://ratings.food.gov.uk/api/open-data-files/FHRS109en-GB.xml',
 'https://ratings.food.gov.uk/api/open-data-files/FHRS701en-GB.xml',
 'https://ratings.food.gov.uk/api/open-data-files/FHRS110en-GB.xml',
 'https://ratings.food.gov.uk/api/open-data-files/FHRS227en-GB.xml']
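
The same list could also be built in one line with a list comprehension and a template string. This is just an equivalent sketch; `codes` here is a short invented stand-in for the `ID` column.

```python
# Stand-in for fsacodedf['ID'] (codes must be strings)
codes = ["297", "109", "027"]

# Template with a placeholder where the authority code goes
base = "https://ratings.food.gov.uk/api/open-data-files/FHRS{code}en-GB.xml"

# Fill the template once per code
urls = [base.format(code=code) for code in codes]
print(urls)
```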

### Loop through in groups of 100 - FILTER by business type

Now we test our function on multiple XML files from the API.

We start by testing 5 by adding an index slice in:

`for i in apiurl_list[:5]:`

Then we change it to `[:100]`, then to `[100:200]` and finally `[200:]`, each time storing the results in a different dataframe so we can recombine it later.

Working in batches helps us contain problems: at one point an empty rating cell tripped up the process, and batching limited the impact to roughly one third of the total.
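
Another way to limit the damage from one bad file is to catch the error and carry on, keeping a note of which URLs failed. The sketch below is an assumption about how that might look, not what this notebook does: `parse_each` and `fake_parser` are hypothetical names, and `fake_parser` stands in for `parsefsaxml` so that the example runs without fetching anything.

```python
def parse_each(urls, parser):
    """Run `parser` on every URL, recording failures instead of stopping."""
    results, failures = [], []
    for url in urls:
        try:
            results.append(parser(url))
        except Exception as error:
            # Keep the URL and the error so the file can be checked later
            failures.append((url, repr(error)))
    return results, failures


def fake_parser(url):
    """Hypothetical stand-in for parsefsaxml: fails on one 'file'."""
    if "527" in url:
        raise AttributeError("missing RatingValue")
    return {"url": url}


demo_urls = ["FHRS526", "FHRS527", "FHRS528"]
results, failures = parse_each(demo_urls, fake_parser)
print(len(results), "parsed;", len(failures), "failed")
```

In the notebook itself the call would be along the lines of `parse_each(apiurl_list, parsefsaxml)`, with the successful dataframes then passed to `pd.concat()`.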

#### The first 100

In [None]:
#create an empty list to store the results
df_list = []

#loop through api URLs
for i in apiurl_list[:100]:
  print(i)
  idf = parsefsaxml(i)
  #filter out exempt inspections
  idf = idf[idf['RatingValue'] != 'Exempt']
  #filter to the categories
  idf = idf[
    (idf['BusinessType'] == 'Manufacturers/packers') |
    (idf['BusinessType'] == 'Importers/Exporters') |
    (idf['BusinessType'] == 'Distributors/Transporters')
  ]
  # Append the new DataFrame to the list
  df_list.append(idf)

# Concatenate all DataFrames in the list into a single DataFrame
alldf = pd.concat(df_list)

alldf

#store this 100 in one data frame
alldf0_99 = alldf

https://ratings.food.gov.uk/api/open-data-files/FHRS297en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS109en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS701en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS110en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS227en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS111en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS228en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS155en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS027en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS112en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS702en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS113en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS114en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS156en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS028en-GB.xml
https://ratings.food.gov.

In [None]:
#export the selection
alldf0_99.to_csv('alldf0_99.csv')
#download the file
files.download('alldf0_99.csv')


#### The second 100

In [None]:
#create an empty list to store the results
df_list = []

#loop through api URLs
for i in apiurl_list[100:200]:
  print(i)
  idf = parsefsaxml(i)
  #filter out exempt inspections
  idf = idf[idf['RatingValue'] != 'Exempt']
  #filter to the manufacturer, importer/exporter and distributor categories
  idf = idf[
    (idf['BusinessType'] == 'Manufacturers/packers') |
    (idf['BusinessType'] == 'Importers/Exporters') |
    (idf['BusinessType'] == 'Distributors/Transporters')
  ]
  # Append the new DataFrame to the list
  df_list.append(idf)

# Concatenate all DataFrames in the list into a single DataFrame
alldf = pd.concat(df_list)

alldf

#store this 100 in one data frame
alldf100_199 = alldf

https://ratings.food.gov.uk/api/open-data-files/FHRS521en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS522en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS523en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS524en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS525en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS526en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS527en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS528en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS529en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS530en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS531en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS532en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS533en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS874en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS706en-GB.xml
https://ratings.food.gov.

In [None]:
#export the selection
alldf100_199.to_csv('alldf100_199.csv')
#download the file
files.download('alldf100_199.csv')


#### The third batch

In [None]:
#create an empty list to store the results
df_list = []

#loop through api URLs
for i in apiurl_list[200:]:
  print(i)
  idf = parsefsaxml(i)
  #filter out exempt inspections
  idf = idf[idf['RatingValue'] != 'Exempt']
  #filter to the manufacturer, importer/exporter and distributor categories
  idf = idf[
    (idf['BusinessType'] == 'Manufacturers/packers') |
    (idf['BusinessType'] == 'Importers/Exporters') |
    (idf['BusinessType'] == 'Distributors/Transporters')
  ]
  # Append the new DataFrame to the list
  df_list.append(idf)

# Concatenate all DataFrames in the list into a single DataFrame
alldf = pd.concat(df_list)

alldf

#store this batch in one data frame
alldf200_ = alldf

https://ratings.food.gov.uk/api/open-data-files/FHRS106en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS310en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS140en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS187en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS885en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS270en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS877en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS311en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS312en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS189en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS313en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS142en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS190en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS191en-GB.xml
https://ratings.food.gov.uk/api/open-data-files/FHRS192en-GB.xml
https://ratings.food.gov.

In [None]:
#export the selection
alldf200_.to_csv('alldf200_.csv')
#download the file
files.download('alldf200_.csv')


### Combine all 3 dataframes and export

Now we can combine the three dataframes we've created for the three slices of the list.

In [None]:
#combine the 3 data frames in the list
alldf363 = pd.concat([alldf0_99,alldf100_199,alldf200_])
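
Note that `pd.concat()` keeps each dataframe's original row labels by default, so the combined dataframe can contain repeated index values. If unique row labels are wanted, `ignore_index=True` renumbers them. A small sketch with two invented batches:

```python
import pandas as pd

# Two invented batches, each with its own 0-based index
batch_a = pd.DataFrame({"BusinessName": ["A", "B"]})
batch_b = pd.DataFrame({"BusinessName": ["C", "D"]})

# Default: original row labels are kept, so 0 and 1 appear twice
kept = pd.concat([batch_a, batch_b])
# ignore_index=True renumbers the rows from 0
renumbered = pd.concat([batch_a, batch_b], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```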


In [None]:
#export as a CSV
alldf363.to_csv('alldf363_wholesalers.csv')
#download the file
files.download('alldf363_wholesalers.csv')


#### Get an overview of business types

Let's check how many there are in each business type.

In [None]:
#get a count of each type of business
alldf363['BusinessType'].value_counts()

BusinessType
Manufacturers/packers        10159
Distributors/Transporters     3129
Importers/Exporters            517
Name: count, dtype: int64

### Add the years

Each time we also extract the year of inspection into a new column.

In [None]:
#use .to_datetime() from pandas to convert the column to datetime
#add the method .dt.year to extract the year from the resulting list of datetime objects
alldf363['ratingYear'] = pd.to_datetime(alldf363['RatingDate']).dt.year
alldf363['ratingYear'].value_counts()

ratingYear
2023.0    4061
2022.0    2502
2024.0    1533
2021.0    1298
2019.0     719
2020.0     590
2018.0     515
2017.0     252
2016.0     160
2015.0     109
2014.0      73
2013.0      47
2012.0      44
2011.0      37
2010.0      25
2009.0      15
2008.0       7
2007.0       5
1999.0       3
2004.0       2
2005.0       1
2003.0       1
Name: count, dtype: int64

### Add T/F columns for pre-2022

We are interested in how many haven't been inspected for at least two years. A rough approximation is the number whose year of inspection is before 2022. This misses some establishments inspected in the first few months of 2022, which have also now gone more than two years without an inspection, but we just want a rough idea for now.



In [None]:
alldf363['pre2022'] = alldf363['ratingYear'] < 2022
alldf363['pre2022'].value_counts()

pre2022
False    9902
True     3903
Name: count, dtype: int64

## Identify six months ago

In [None]:
from datetime import date
date.today().strftime("%d/%m/%Y")

'25/04/2024'

In [None]:
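#NOTE: six months before 25/04/2024 is 25/10/2023; the cutoff used here is 25/10/2024,
#which is in the future, so every row that has a date returns True
pd.to_datetime(alldf363['RatingDate'])[:5] < '25/10/2024'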

36      True
278     True
296     True
392    False
534    False
Name: RatingDate, dtype: bool

In [None]:
alldf363['before25oct2024'] = pd.to_datetime(alldf363['RatingDate']) < '25/10/2024'
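
Rather than typing the cutoff by hand, it can be calculated from a reference date with `pd.DateOffset`. The sketch below is an illustration on invented dates, fixing the reference date at 25 April 2024 (the date shown by `date.today()` above), which puts six months ago at 25 October 2023. The variable names are assumptions for the example.

```python
import pandas as pd

# Fixed reference date so the example is reproducible
reference_date = pd.Timestamp("2024-04-25")
# Step back six calendar months
six_months_ago = reference_date - pd.DateOffset(months=6)

# Invented sample: the empty string mimics a row with no rating date
rating_dates = pd.to_datetime(
    pd.Series(["2023-12-01", "2022-03-02", ""]), errors="coerce"
)

# Missing dates (NaT) compare as False, so they need counting separately
before_cutoff = rating_dates < six_months_ago
print(six_months_ago.date())
print(before_cutoff.tolist())
```

Swapping `months=6` for `years=2` would give an exact two-year cutoff instead of the year-based approximation.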

## Export with years

In [None]:
#export as a CSV
alldf363.to_csv('alldf363.csv')
#download the file
files.download('alldf363.csv')


### Check awaiting inspection

We have a number of records where no date is given - these also return `False` for the year being before 2022.

In [None]:
len(alldf363[alldf363['RatingValue'] == 'AwaitingInspection' ])

1556

## Generate a pivot table showing numbers for each authority

We can get an idea of those awaiting inspection by using the `pivot_table()` function from pandas.

Note that there are two ways this is stored: as 'Awaiting Inspection' and 'AwaitingInspection' (no space).

Note also that Scottish authorities use a different rating system which has three levels: pass, pass and eat safe, and improvement required.
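
Because the label is stored in two ways, a count of only one spelling will miss the other. One option, sketched here on invented values, is to standardise the label before counting. Treating the two spellings as the same category is an assumption of this example.

```python
import pandas as pd

# Invented sample containing both spellings
ratings = pd.Series(["AwaitingInspection", "Awaiting Inspection", "5", "Pass"])

# Map the spaced version onto the unspaced one
normalised = ratings.replace({"Awaiting Inspection": "AwaitingInspection"})

# Count the rows awaiting inspection under either spelling
awaiting_count = (normalised == "AwaitingInspection").sum()
print(awaiting_count)
```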

In [None]:
alldf363.pivot_table(index="Authority",
                        values="BusinessName",
                        columns="RatingValue",
                        margins=True,
                        aggfunc="count").fillna(0)

RatingValue,0,1,2,3,4,5,Awaiting Inspection,AwaitingInspection,Improvement Required,Pass,Pass and Eat Safe,All
Authority,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Aberdeen City,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,3.0,25.0,0.0,31
Aberdeenshire,0.0,0.0,0.0,0.0,0.0,0.0,59.0,0.0,14.0,158.0,0.0,231
Adur,0.0,0.0,0.0,0.0,2.0,7.0,0.0,0.0,0.0,0.0,0.0,9
Amber Valley,0.0,0.0,0.0,0.0,4.0,19.0,0.0,2.0,0.0,0.0,0.0,25
Anglesey,0.0,1.0,1.0,2.0,8.0,59.0,0.0,0.0,0.0,0.0,0.0,71
...,...,...,...,...,...,...,...,...,...,...,...,...
Wychavon,0.0,0.0,0.0,1.0,1.0,14.0,0.0,1.0,0.0,0.0,0.0,17
Wyre,0.0,1.0,0.0,0.0,7.0,16.0,0.0,1.0,0.0,0.0,0.0,25
Wyre Forest,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,3
York,0.0,0.0,0.0,3.0,5.0,14.0,0.0,1.0,0.0,0.0,0.0,23


In [None]:
#store in a dataframe
pivot_rating_la = alldf363.pivot_table(index="Authority",
                        values="BusinessName",
                        columns="RatingValue",
                        margins=True,
                        aggfunc="count").fillna(0)
#export as a CSV
pivot_rating_la.to_csv('pivot_rating_la.csv')

### Generate a pivot showing number of inspections pre-2022 for each authority

We can repeat this for the pre2022 column, to show the numbers in each authority which are `True` (pre-2022 inspections) and `False` (not pre-2022).

Note that those Awaiting Inspection will be counted as False here, so we need to combine the previous pivot table with this to get a more accurate figure.

In [None]:
pivot_pre22_la = alldf363.pivot_table(index="Authority",
                        values="BusinessName",
                        columns="pre2022",
                        margins=True,
                        aggfunc="count").fillna(0).astype(int)

pivot_pre22_la


pre2022,False,True,All
Authority,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Aberdeen City,390,687,1077
Aberdeenshire,367,297,664
Adur,195,5,200
Amber Valley,390,181,571
Anglesey,293,57,350
...,...,...,...
Wychavon,438,37,475
Wyre,406,62,468
Wyre Forest,321,38,359
York,789,228,1017


In [None]:
#export as a CSV
pivot_pre22_la.to_csv('pivot_pre22_la.csv')

In [None]:
#Calculate the pre22 numbers as %
pivot_pre22_la[True]/pivot_pre22_la['All']

Authority
Aberdeen City    0.637883
Aberdeenshire    0.447289
Adur             0.025000
Amber Valley     0.316988
Anglesey         0.162857
                   ...   
Wychavon         0.077895
Wyre             0.132479
Wyre Forest      0.105850
York             0.224189
All              0.195954
Length: 364, dtype: float64

In [None]:
pivot_pre22_la['percPre22'] = pivot_pre22_la[True]/pivot_pre22_la['All']
pivot_pre22_la

pre2022,False,True,All,percPre22
Authority,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Aberdeen City,390,687,1077,0.637883
Aberdeenshire,367,297,664,0.447289
Adur,195,5,200,0.025000
Amber Valley,390,181,571,0.316988
Anglesey,293,57,350,0.162857
...,...,...,...,...
Wychavon,438,37,475,0.077895
Wyre,406,62,468,0.132479
Wyre Forest,321,38,359,0.105850
York,789,228,1017,0.224189


## Data checking: duplicates

Let's see if there are any duplicates.

In [None]:
#create a list of True/False values indicating whether a row is a duplicate
dupes = alldf363.duplicated()
#Get a count of T/F
dupes.value_counts()

False    13799
True         6
Name: count, dtype: int64

Is there any pattern to the authorities involved?

In [None]:
#Use that list to filter to duplicate entries, and get a count of the authorities
alldf363['Authority'][dupes].value_counts()

Authority
Mid Suffolk        1
Lewisham           1
Epsom and Ewell    1
Somerset           1
South Hams         1
West Devon         1
Name: count, dtype: int64

In [None]:
#Use that list to filter to duplicate entries, and get a count of the business names
alldf363['BusinessName'][dupes].value_counts()

BusinessName
Broughton Hall Dairy    1
Phlox                   1
Park Farm Honey         1
The Preserving Pan      1
Namoh                   1
Fusion Cuisine          1
Name: count, dtype: int64

Let's just look at them.

In [None]:
#Use that list to filter to the duplicate rows and show them in full
alldf363[dupes]

Unnamed: 0,BusinessName,Authority,AddressLine1,AddressLine2,AddressLine3,AddressLine4,PostCode,RatingValue,NewRatingPending,RatingDate,BusinessType,Lat,Lng,ratingYear,pre2022,before25oct2024
99,Broughton Hall Dairy,Mid Suffolk,Broughton Hall,Stowmarket Road,Stonham Aspal,STOWMARKET,IP14 6AD,AwaitingInspection,False,,Manufacturers/packers,52.1918855,1.1190541,,False,False
1637,Phlox,Lewisham,,,,,,5,False,2022-03-02,Manufacturers/packers,,,2022.0,False,True
360,Park Farm Honey,Epsom and Ewell,,,,,,AwaitingInspection,False,,Manufacturers/packers,,,,False,False
5197,The Preserving Pan,Somerset,,,,,,5,False,2023-12-01,Manufacturers/packers,,,2023.0,False,True
707,Namoh,South Hams,,,,,,5,False,2020-02-12,Manufacturers/packers,,,2020.0,True,True
236,Fusion Cuisine,West Devon,,,,,,AwaitingInspection,False,,Manufacturers/packers,,,,False,False


## Export deduplicated

The numbers involved here are around 0.04% but we will remove them anyway.

In [None]:
alldf363deduplicated = alldf363.drop_duplicates()
#export as a CSV
alldf363deduplicated.to_csv('alldf363deduplicated.csv')
#download the file
files.download('alldf363deduplicated.csv')


## Import and deduplicate again by BusinessName/latlong

Although we have deduplicated where rows are identical, there might also be rows which are duplicates in the sense of being the same business, but with a different rating or inspection date.

Pilar Thomas conducted some spot-checking on exported data on a different set of categories: first, a preliminary conditional formatting in Excel to have a general view, and then using OpenRefine.

The process followed was:

1. Reorder the spreadsheet so the Lat column is sorted a-z, and select Reorder Rows Permanently.
2. Apply the Duplicates facet to Lat and select True, so rows with exactly the same latitude appear consecutively.
3. Blank down cells in the BusinessName column. This identifies consecutive rows with the same BusinessName and deletes the name in the second row so it is easier to filter later. The reasoning is that rows with the same latitude and the same business name could easily be duplicates.
4. Apply Facet by blank to the BusinessName column. Star these rows (732 in total) so they are easier to check later on, and then remove the Facet by blank.
5. Sort A-Z by BusinessName and drag Blanks (the starred rows) to the top of the column. Apply a Text filter to the AddressLine1 column.

Then random spot-checking:

6. Starting with the first row (number 1222), search for its AddressLine1 (Bickels Yard Cafe ( Fusion )) in the Text filter box.
7. Two identical rows appear with the same latitude, so we flag the starred one (with the blank BusinessName), as Bickels Yard Cafe is clearly duplicated.
8. Reset the AddressLine1 Text filter and repeat this process with other random rows.

We now try to codify this process in Python.
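
Before using `drop_duplicates()`, it may help to see how the OpenRefine steps map onto pandas. The sketch below is only an approximation on made-up rows: `toy`, `ordered` and the `starred` column are hypothetical names, and it sorts by both Lat and BusinessName (the OpenRefine process sorted by Lat only) so that matching rows are guaranteed to sit next to each other.

```python
import pandas as pd

# Hypothetical rows; the second Bickels entry mimics a suspected duplicate
toy = pd.DataFrame({
    "BusinessName": ["Bickels Yard Cafe", "Namoh", "Bickels Yard Cafe"],
    "Lat": ["50.1", "50.3", "50.1"],
})

# Step 1: sort by Lat (as text), then BusinessName, so matching rows sit together
ordered = toy.sort_values(["Lat", "BusinessName"], kind="stable").reset_index(drop=True)

# Steps 2-4: flag a row when both Lat and BusinessName repeat the row above,
# roughly what blanking down and faceting by blank pick out in OpenRefine
same_lat = ordered["Lat"].eq(ordered["Lat"].shift())
same_name = ordered["BusinessName"].eq(ordered["BusinessName"].shift())
ordered["starred"] = same_lat & same_name
print(ordered)
```

The starred rows are candidates for checking by eye, not confirmed duplicates.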


In [None]:
#check how many rows
len(alldf363deduplicated)

13799

In [None]:
#remove duplicates - specifying which columns we want to deduplicate on
deduped_BNLL = alldf363deduplicated.drop_duplicates(subset=['BusinessName','Lat','Lng'])
#how many does that leave
len(deduped_BNLL)

13748

### Count how many duplicates - and check

We now look at what rows have been identified as duplicates.

In [None]:
#create a list of True/False values indicating whether a row is a duplicate
dupes = alldf363deduplicated.duplicated(subset=['BusinessName','Lat','Lng'],
                            keep = False) #mark every row in a duplicate group as True
#Get a count of T/F
dupes.value_counts()

False    13706
True        93
Name: count, dtype: int64

An inspection of the results flags a potential cause of false positives in the deduplication: rows where Lat and Lng are empty strings. In these cases, only the BusinessName field is left to deduplicate on, so Bake A Wish in two different places is treated as a duplicate.
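
The effect can be reproduced on a small scale. This is a minimal sketch using made-up rows modelled on the Bake A Wish case; `toy` and `flagged` are hypothetical names.

```python
import pandas as pd

# Hypothetical rows: same name in two authorities, no coordinates for either
toy = pd.DataFrame({
    "BusinessName": ["Bake A Wish", "Bake A Wish", "Namoh"],
    "Authority": ["Sunderland", "Rushcliffe", "South Hams"],
    "Lat": ["", "", "50.3"],
    "Lng": ["", "", "-3.8"],
})

# Empty strings compare as equal, so both Bake A Wish rows are flagged
flagged = toy.duplicated(subset=["BusinessName", "Lat", "Lng"], keep=False)
print(flagged.tolist())
```

Because `''` equals `''`, pandas has no way of knowing that the two locations are different.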

In [None]:
#Use that list to filter to duplicate entries, and sort it by BusinessName so we can see them together
alldf363deduplicated[dupes].sort_values(by = ['BusinessName'])

Unnamed: 0,BusinessName,Authority,AddressLine1,AddressLine2,AddressLine3,AddressLine4,PostCode,RatingValue,NewRatingPending,RatingDate,BusinessType,Lat,Lng,ratingYear,pre2022,before25oct2024
78,Alpha Food Service Ltd,Cardiff,"Unit 11, Wholesale Fruit Centre Bessemer Road",Leckwith,Cardiff,,CF11 8BB,AwaitingInspection,False,,Distributors/Transporters,51.4658512,-3.1983039,,False,False
79,Alpha Food Service Ltd,Cardiff,"Unit 16, Wholesale Fruit Centre Bessemer Road",Leckwith,Cardiff,,CF11 8BB,3,False,2023-11-09,Distributors/Transporters,51.4658512,-3.1983039,2023.0,False,True
135,Bake A Wish,Sunderland,,,,,,4,False,2020-02-18,Manufacturers/packers,,,2020.0,True,True
40,Bake A Wish,Rushcliffe,,,,,,5,False,2023-09-18,Manufacturers/packers,,,2023.0,False,True
53,Baked,Ceredigion,,,,,,5,False,2023-11-22,Manufacturers/packers,,,2023.0,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5159,W J Lean,Highland,,Mains of Kilravock,Croy,,IV2 7PJ,Awaiting Inspection,False,,Manufacturers/packers,,,,False,False
1282,Wiltshire Farm Foods,Flintshire,Wiltshire Farm Foods,Wiltshire Farm Foods Pendle Court,Evans Way Shotton,Deeside Flintshire,CH5 1QJ,5,False,2023-06-22,Distributors/Transporters,,,2023.0,False,True
836,Wiltshire Farm Foods,Rushcliffe,Wiltshire Farm Foods Loughborough,Unit 9,Wolds Farm Business Park,Kinoulton Lane,NG12 3EQ,5,False,2023-12-13,Distributors/Transporters,,,2023.0,False,True
4740,Wiltshire Farm Foods,Dorset,,,,,,5,False,2021-11-12,Distributors/Transporters,,,2021.0,True,True


### Check for missing values in Lat

There are a lot of missing values which could give us false positives where a row is marked as a duplicate because it has the same name and the same (missing) latitude.

We can do a count to see how many there are of each Lat.

In [None]:
#Use that list to filter to duplicate entries
#count how many of each value
alldf363deduplicated['Lat'][dupes].value_counts()

Lat
              81
51.5294841     2
51.1492536     2
51.3954042     2
54.9022469     2
51.4658512     2
51.5514689     2
Name: count, dtype: int64

### Deduplicating by name and postcode

Let's try doing it by postcode instead.

In [None]:
#create a list of True/False values indicating whether a row is a duplicate
dupes = alldf363deduplicated.duplicated(subset=['BusinessName','PostCode'],
                            keep = False) #mark every row in a duplicate group as True
#Get a count of T/F
dupes.value_counts()

False    13726
True        73
Name: count, dtype: int64

In [None]:
#Use that list to filter to duplicate entries, and sort it by BusinessName so we can see them together
alldf363deduplicated[dupes].sort_values(by = ['BusinessName'])

Unnamed: 0,BusinessName,Authority,AddressLine1,AddressLine2,AddressLine3,AddressLine4,PostCode,RatingValue,NewRatingPending,RatingDate,BusinessType,Lat,Lng,ratingYear,pre2022,before25oct2024
79,Alpha Food Service Ltd,Cardiff,"Unit 16, Wholesale Fruit Centre Bessemer Road",Leckwith,Cardiff,,CF11 8BB,3,False,2023-11-09,Distributors/Transporters,51.4658512,-3.1983039,2023.0,False,True
78,Alpha Food Service Ltd,Cardiff,"Unit 11, Wholesale Fruit Centre Bessemer Road",Leckwith,Cardiff,,CF11 8BB,AwaitingInspection,False,,Distributors/Transporters,51.4658512,-3.1983039,,False,False
215,Ashra Foods,Leicester City,Unit 49,Vulcan House,Vulcan Road,Leicester,LE5 3EF,2,False,2023-08-14,Manufacturers/packers,52.6389674,-1.1145069,2023.0,False,True
214,Ashra Foods,Leicester City,Unit 48,Vulcan House,Vulcan Road,Leicester,LE5 3EF,2,False,2023-08-14,Manufacturers/packers,52.639239,-1.114848,2023.0,False,True
135,Bake A Wish,Sunderland,,,,,,4,False,2020-02-18,Manufacturers/packers,,,2020.0,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
864,The Home Kitchen,Cotswold,,,,,,5,False,2021-04-08,Distributors/Transporters,,,2021.0,True,True
2793,The Robin Collective,Tower Hamlets,"Studio 6, Cornelius Drebbel House",5 Empson Street,London,,E3 3LT,5,False,2023-06-30,Manufacturers/packers,,,2023.0,False,True
2794,The Robin Collective,Tower Hamlets,"Studio 8, Cornelius Drebbel House",5 Empson Street,London,,E3 3LT,5,False,2023-06-30,Manufacturers/packers,51.521525,-0.013943,2023.0,False,True
1868,Y Sied Laeth,Gwynedd,"Bryn Derwen, Bryn Hynog",,Llannor,Gwynedd,LL53 5UG,5,False,2023-05-12,Manufacturers/packers,,,2023.0,False,True


### Check for missing PostCode field

Again, could this be caused by them having no postcode at all?

Let's count the values.

In [None]:
#Use that list to filter to duplicate entries
#count how many of each value
alldf363deduplicated['PostCode'][dupes].value_counts()

PostCode
            45
NR5 8BF      2
LE5 3EF      2
NW10 6HJ     2
KT3 3NW      2
SE22 9NA     2
E2 9FP       2
E3 3LT       2
TN23 1EF     2
RG19 4ZA     2
DG9 7HJ      2
CF11 8BB     2
LL53 5UG     2
NP26 3DE     2
CF72 9FQ     2
Name: count, dtype: int64

Here we see 45 empty strings, with the rest in pairs. Checking the type with `isinstance` doesn't separate the empty values here as it did in the other notebook, because an empty postcode is still a string.

In [None]:
#loop through each value in PostCode and use in isinstance() function to return a list of True/False
#create a data frame from that list
#apply value_counts() to get a total of True and False
pd.DataFrame([isinstance(i, str) for i in alldf363deduplicated['PostCode'][dupes]]).value_counts()

True    73
Name: count, dtype: int64
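
One possible alternative, not used in this notebook, is to turn the empty strings into missing values first so that the usual pandas missing-value checks apply. The sketch below assumes a made-up `postcodes` Series; the names are hypothetical.

```python
import pandas as pd

# Hypothetical postcodes: two empty strings and two real values
postcodes = pd.Series(["", "LE5 3EF", "", "E3 3LT"], name="PostCode")

# isinstance(i, str) is True for every entry, including the empty ones
all_strings = all(isinstance(i, str) for i in postcodes)

# mask() replaces entries with NaN wherever the condition is True
as_missing = postcodes.mask(postcodes == "")
# isna() can now tell the empty entries apart
n_missing = int(as_missing.isna().sum())
print(all_strings, n_missing)
```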

## Deduplicating only where there is not an empty string

We need to do the following:

* Sort by inspection date and BusinessName so that when we remove duplicates we remove the older inspection record
* Create a T/F column identifying duplicates based on BusinessName/Lat/Lng
* Create a T/F column identifying empty string entries in the Lat column
* Filter out duplicates based on BusinessName/Lat/Lng where the Lat NaN (empty string) column is False
* Create a T/F column identifying duplicates based on BusinessName/PostCode
* Create a T/F column identifying empty string entries in the PostCode column
* Filter out duplicates based on BusinessName/PostCode where the PostCode NaN (empty string) column is False
* Inspect the duplicate-but-empty string results to identify any other obvious duplicates
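
The postcode half of that plan can be sketched end to end on a few made-up rows. Everything here (`toy`, `kept`, the shortened business names and dates) is hypothetical and only illustrates the order of the steps; the lat/long half follows the same pattern.

```python
import pandas as pd

# Hypothetical records: one genuine repeat (Alpha) and one empty-postcode pair
toy = pd.DataFrame({
    "BusinessName": ["Alpha", "Alpha", "Bake A Wish", "Bake A Wish"],
    "PostCode": ["CF11 8BB", "CF11 8BB", "", ""],
    "RatingDate": ["", "2023-11-09", "2020-02-18", "2023-09-18"],
})

# Sort so the most recent record of each business comes last
toy = toy.sort_values(by=["RatingDate"])
# Mark older repeats of the same name and postcode
toy["duplicateBNP"] = toy.duplicated(subset=["BusinessName", "PostCode"], keep="last")
# Mark rows with nothing in the postcode column
toy["postCodeEMPTY"] = toy["PostCode"].eq("")
# Drop a row only when it is a duplicate and has a real postcode
kept = toy[~(toy["duplicateBNP"] & ~toy["postCodeEMPTY"])]
print(kept[["BusinessName", "PostCode", "RatingDate"]])
```

Only the older Alpha record is dropped; both Bake A Wish rows survive because their postcodes are empty.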


### Sorting by inspection date: `sort_values()`

In [None]:
#sort by RatingDate
exportdf = alldf363deduplicated.sort_values(by = ['RatingDate'])
exportdf.head(3)

Unnamed: 0,BusinessName,Authority,AddressLine1,AddressLine2,AddressLine3,AddressLine4,PostCode,RatingValue,NewRatingPending,RatingDate,BusinessType,Lat,Lng,ratingYear,pre2022,before25oct2024
381,Cornish Garlic Co.,Plymouth City,,2 Julian Street,Plymouth,,PL4 0PR,AwaitingInspection,False,,Manufacturers/packers,50.3681212,-4.1214395,,False,False
179,Ceva logistics,Rugby,"ceva House, Excelsior road",,,Ashby de la zouch,LE65 1NU,AwaitingInspection,False,,Distributors/Transporters,52.7488518,-1.450756,,False,False
1155,Future Forward/Camp Knak,Hackney,,,,,,AwaitingInspection,False,,Distributors/Transporters,,,,False,False


In [None]:
#show the last rows
exportdf.tail(3)

Unnamed: 0,BusinessName,Authority,AddressLine1,AddressLine2,AddressLine3,AddressLine4,PostCode,RatingValue,NewRatingPending,RatingDate,BusinessType,Lat,Lng,ratingYear,pre2022,before25oct2024
1280,Louise's Farm Kitchen,Aberdeenshire,,Milton Of Auchenhove,Lumphanan,Aberdeenshire,AB31 4QR,Pass,False,2024-04-23,Manufacturers/packers,,,2024.0,False,True
2007,Suzi Bakes,Aberdeenshire,,Home Bakery Business,,,,Awaiting Inspection,False,2024-04-23,Manufacturers/packers,,,2024.0,False,True
587,Handmade Scotch Egg Company,Herefordshire,The Egg Shed,The Hop Pocket,,Bishops Frome,WR6 5BT,5,False,2024-04-23,Manufacturers/packers,52.122103,-2.494203,2024.0,False,True




Note that rows with an empty RatingDate (typically those awaiting inspection) sort before any real date, so they are treated as older than an actual inspection.
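
This ordering follows from the dates being stored as text. A minimal sketch on made-up values, showing the text sort alongside one possible alternative (parsing to datetimes, which this notebook does not do):

```python
import pandas as pd

# Hypothetical dates stored as text, with '' for an establishment with no date
dates = pd.Series(["2023-11-09", "", "2020-02-18"])

# As text, '' sorts before any date
text_order = dates.sort_values().tolist()

# Parsed as datetimes, '' becomes NaT; na_position='first' keeps it at the top
parsed = pd.to_datetime(dates, errors="coerce")
parsed_order = parsed.sort_values(na_position="first")
print(text_order)
print(parsed_order.tolist())
```

With real datetimes the default is to sort missing values last, so `na_position` would need setting explicitly to keep the same behaviour.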

### Create True/False columns for duplicates

This time we set the `keep =` parameter to `'last'` rather than `False`. This ensures that we will only mark rows as a duplicate if they are 'older' entries (as the data is sorted by RatingDate so the last entry will be the latest).

In [None]:
#add a T/F column identifying duplicates based on BusinessName/lat/long
exportdf['duplicateBNLL'] = exportdf.duplicated(subset=['BusinessName','Lat','Lng'],
                            keep = 'last') #Mark duplicates as True except for the last occurrence.

#show the first few rows
exportdf.head(3)

Unnamed: 0,BusinessName,Authority,AddressLine1,AddressLine2,AddressLine3,AddressLine4,PostCode,RatingValue,NewRatingPending,RatingDate,BusinessType,Lat,Lng,ratingYear,pre2022,before25oct2024,duplicateBNLL
381,Cornish Garlic Co.,Plymouth City,,2 Julian Street,Plymouth,,PL4 0PR,AwaitingInspection,False,,Manufacturers/packers,50.3681212,-4.1214395,,False,False,False
179,Ceva logistics,Rugby,"ceva House, Excelsior road",,,Ashby de la zouch,LE65 1NU,AwaitingInspection,False,,Distributors/Transporters,52.7488518,-1.450756,,False,False,False
1155,Future Forward/Camp Knak,Hackney,,,,,,AwaitingInspection,False,,Distributors/Transporters,,,,False,False,False
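
The difference between the two `keep` settings is easiest to see side by side. This is a small sketch on made-up rows (the second postcode is invented), assuming the data is already sorted oldest first:

```python
import pandas as pd

# Hypothetical business with two records, already sorted oldest first
toy = pd.DataFrame({
    "BusinessName": ["Alpha", "Alpha", "Namoh"],
    "PostCode": ["CF11 8BB", "CF11 8BB", "TQ9 1AA"],
})
cols = ["BusinessName", "PostCode"]

# keep=False flags every member of a duplicate group
flag_all = toy.duplicated(subset=cols, keep=False).tolist()
# keep='last' flags all but the final (most recent) member
flag_older = toy.duplicated(subset=cols, keep="last").tolist()
print(flag_all, flag_older)
```

`keep=False` suits inspection, because it shows the whole group; `keep='last'` suits removal, because one record per business is left unflagged.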


### Creating a T/F column for empty strings

There are no NaN values in this dataset, unlike the other notebook.

Instead we need to identify empty cells.

In [None]:
#create an empty list to keep track
latempty = []

#loop through the lats
for i in exportdf['Lat']:
  #print(type(i))
  #if it's an empty string
  if i == '':
    #add True to the list
    latempty.append(True)
  #otherwise
  else:
    #add False
    latempty.append(False)

#show how many of each
pd.Series(latempty).value_counts()

False    8981
True     4818
Name: count, dtype: int64

We can also write this code like this:

In [None]:
#loop through each item and test if it's '', store in a list
latempty = [i == '' for i in exportdf['Lat']]
#convert to a pandas Series object to use value_counts()
pd.Series(latempty).value_counts()

False    8981
True     4818
Name: count, dtype: int64
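
A third option is to let pandas do the comparison element by element, which returns a Series that keeps the original index. The sketch below uses made-up values; the whitespace check is an extra assumption, since the source data may not contain whitespace-only cells.

```python
import pandas as pd

# Hypothetical Lat values, including one that is only whitespace
lats = pd.Series(["50.3681212", "", " ", "52.7488518"], name="Lat")

# Element-wise comparison, equivalent to the list comprehension
lat_empty = lats.eq("")
# Optional stricter check that also treats whitespace-only cells as empty
lat_blank = lats.str.strip().eq("")
print(lat_empty.sum(), lat_blank.sum())
```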

In [None]:
#add a T/F column identifying duplicates based on BusinessName/PostCode
exportdf['duplicateBNP'] = exportdf.duplicated(subset=['BusinessName','PostCode'],
                            keep = 'last') #only mark older entries as duplicates
#add a T/F column identifying empty PostCode
exportdf['postCodeEMPTY'] = [i == '' for i in exportdf['PostCode']]
#add a T/F column identifying empty lat
exportdf['latEMPTY'] = [i == '' for i in exportdf['Lat']]

#show the first few rows
exportdf.head(3)

Unnamed: 0,BusinessName,Authority,AddressLine1,AddressLine2,AddressLine3,AddressLine4,PostCode,RatingValue,NewRatingPending,RatingDate,BusinessType,Lat,Lng,ratingYear,pre2022,before25oct2024,duplicateBNLL,duplicateBNP,postCodeEMPTY,latEMPTY
381,Cornish Garlic Co.,Plymouth City,,2 Julian Street,Plymouth,,PL4 0PR,AwaitingInspection,False,,Manufacturers/packers,50.3681212,-4.1214395,,False,False,False,False,False,False
179,Ceva logistics,Rugby,"ceva House, Excelsior road",,,Ashby de la zouch,LE65 1NU,AwaitingInspection,False,,Distributors/Transporters,52.7488518,-1.450756,,False,False,False,False,False,False
1155,Future Forward/Camp Knak,Hackney,,,,,,AwaitingInspection,False,,Distributors/Transporters,,,,False,False,False,False,True,True


In [None]:
exportdf.to_csv('exportdf.csv')

### How many should be filtered out?

Of the 13799 rows, how many of those flagged as name-and-postcode duplicates have an empty postcode?

In [None]:
#filter to those with duplicate name and postcodes
#then count the values in the empty column
exportdf[exportdf['duplicateBNP']]['postCodeEMPTY'].value_counts()

postCodeEMPTY
True     23
False    14
Name: count, dtype: int64

And for name-and-latlong?

In [None]:
#filter to those with duplicate lat-long
#then count the values in the empty column
exportdf[exportdf['duplicateBNLL']]['latEMPTY'].value_counts()

latEMPTY
True     45
False     6
Name: count, dtype: int64

We can now download the data and check for false positives (duplicate but empty) as well as double-counting (duplicate postcode and duplicate latlong for the same row).

That manual exploration of the data finds that the six name-latlong duplicates are also listed among the 14 name-postcode duplicates. So in total, we only have 14 duplicates to remove - and we only need to use the postcode method because that covers all of them.
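
That overlap check could also be done in code rather than by eye. This is a hedged sketch on made-up flags in the shape of the four T/F columns created above; `real_bnll`, `real_bnp`, `only_latlong` and `overlap` are hypothetical names.

```python
import pandas as pd

# Hypothetical flags in the shape of the columns created above
toy = pd.DataFrame({
    "duplicateBNLL": [True, False, False, True],
    "latEMPTY":      [False, False, False, True],
    "duplicateBNP":  [True, True, False, False],
    "postCodeEMPTY": [False, False, False, False],
})

# Genuine duplicates are flagged rows whose matching field is not empty
real_bnll = toy["duplicateBNLL"] & ~toy["latEMPTY"]
real_bnp = toy["duplicateBNP"] & ~toy["postCodeEMPTY"]

# Rows caught by the lat/long method but missed by the postcode method
only_latlong = int((real_bnll & ~real_bnp).sum())
# Cross-tabulate the two methods to see the overlap at a glance
overlap = pd.crosstab(real_bnll, real_bnp, rownames=["latlong"], colnames=["postcode"])
print(only_latlong)
print(overlap)
```

If `only_latlong` is zero, the postcode method on its own covers every genuine duplicate.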

In [None]:
exportdf_deduped = exportdf[
    ~exportdf['duplicateBNP'] | #all not marked as duplicate name-postcode
    (exportdf['duplicateBNP'] & exportdf['postCodeEMPTY']) #those duplicate but no postcode
  ]

len(exportdf_deduped)

13785

## Export df with duplicate businesses removed

In [None]:
exportdf_deduped.to_csv("wholesalers_deduped.csv")
files.download('wholesalers_deduped.csv')
