## Detecting Names that Appear in Dataset Using SpaCy

### Package installation
The cell below installs SpaCy and downloads the English language model. Uncomment the lines to run them.

In [None]:
#pip install -U spacy
#!python -m spacy download en_core_web_sm

### Example 1
The section below shows how SpaCy can be applied to a dataframe to detect potential names. From here onwards, we assume that the SpaCy package and its English model have been installed. In this example, we use the "Attendances at Accident & Emergency Departments, Specialist Outpatient Clinics, Polyclinics and Public Sector Dental Clinics" dataset, which is readily available on [data.gov.sg](https://data.gov.sg/dataset/attendances-at-accident-emergency-departments-specialist-outpatient-clinics-and-polyclinics).

In [1]:
import numpy as np
import pandas as pd
import spacy
# Load the English language model
nlp = spacy.load("en_core_web_sm")
df = pd.read_csv("outpatient-attendances-a-e-socs-polyclinics-dental.csv")

As shown below, the dataset is a simple dataframe with three columns, where the 'type' column holds short text labels for the different categories. Since the purpose of using SpaCy here is to detect names, I will slip some names into a few rows so that detection can be tested later. The names are:

- Chan Sek Keong
- S Dhanabalan
- Joanna Dong
- Abdul Halim bin Haron
- Russell Lee
- Li Li

In [14]:
df.at[0,'type']='Chan Sek Keong'
df.at[23,'type']='S Dhanabalan'
df.at[49,'type']='Joanna Dong'
df.at[11,'type']='Abdul Halim bin Haron'
df.at[52,'type']='Russell Lee'
df.at[7,'type']='Li Li'
df.head()

Unnamed: 0,year,type,count
0,2006,Chan Sek Keong,676763
1,2006,Specialist Outpatient Clinics,3624976
2,2006,Polyclinics,3769989
3,2006,Public Sector Dental Clinics,838466
4,2007,A&E,752122


From here, we use SpaCy to detect all possible names that appear inside the dataset. It's important to note that the column needs to be in string format before SpaCy can read the text; otherwise, it will only result in errors.

In [24]:
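# SpaCy expects strings, so cast the column first
df['type'] = df['type'].astype(str)

def nameDetection(txt):
    # Run the pipeline and keep only entities labelled PERSON
    doc = nlp(txt)
    persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
    return persons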

namesExist = []
for index, row in df.iterrows():
    namesExist.extend(nameDetection(row['type']))

# Remove duplicates while preserving first-seen order
namesExist = list(dict.fromkeys(namesExist))
print(f'Names Detected are {namesExist}')

Names Detected are ['Chan Sek Keong', 'Li Li', 'Abdul Halim bin Haron', 'Joanna Dong', 'Russell Lee']


As we can see, SpaCy detects all of the inserted names except one (S Dhanabalan). Thus, while not foolproof, it is a good base to help with the compliance process.
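The loop above collects only the name strings. For a compliance check it is often useful to know which rows they came from as well. Below is a minimal sketch of one way to do that. The helper `names_with_rows` and the stand-in `fake_detect` are hypothetical names introduced for illustration; `fake_detect` exists only so the sketch runs without the SpaCy model, and in the notebook `nameDetection` would be passed instead.

```python
from collections import defaultdict

def names_with_rows(texts, detect):
    """Map each detected name to the list of row indexes it appears in.

    texts  : iterable of (row_index, text) pairs, e.g. df['type'].items()
    detect : callable returning a list of names for one text,
             e.g. the nameDetection function defined above
    """
    found = defaultdict(list)
    for idx, text in texts:
        for name in detect(str(text)):
            found[name].append(idx)
    return dict(found)

# Hypothetical stand-in detector (an assumption, not SpaCy itself):
# it simply "detects" a text if it is one of two known names.
def fake_detect(txt):
    known = {'Chan Sek Keong', 'Li Li'}
    return [txt] if txt in known else []

rows = [(0, 'Chan Sek Keong'), (1, 'Polyclinics'), (7, 'Li Li')]
result = names_with_rows(rows, fake_detect)
print(result)
```

With the dataframe from this example, the call would look like `names_with_rows(df['type'].items(), nameDetection)`.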

### Example 2
Since the dataset used above was small, let's try another example with a much bigger dataset and expand the set of test names for the SpaCy model to detect in the column. We will use [Amazon's Food Review Data](https://www.kaggle.com/snap/amazon-fine-food-reviews), which has about 500k records, and limit it to about 200k records. To suit the free-text column available, we will slip the names into the reviews, in most cases with additional sentences around them.

- Chan Sek Keong
- S Dhanabalan
- Joanna Dong
- Abdul Halim 
- Haron
- Russell Lee
- Philip Jeyaretnam
- Kwa Geok Choo
- Devan Nair
- Salmah

In [42]:
df = pd.read_csv("Reviews.csv")
df = df.sample(frac=0.36, replace=True, random_state=1).reset_index()
df.at[0,'Text']='Chan Sek Keong and I went to check out the coffee tea shop and we felt the ambience was amazing!'
df.at[12123,'Text']='I decided to visit S Dhanabalan to celebrate his birthday, loved the cake they had. '
df.at[4799,'Text']='I hired a singer Joanna Dong to sing Count on Me Singapore on National Day at the cafe. Had a wonderful time!'
df.at[1100,'Text']='Abdul Halim and Haron was outside the cafe when we saw the beautiful nasi lemak stall'
df.at[200100,'Text']='Goodness gracious, I did not expect to meet my favourite idol Russell Lee at the coffee tea house'
df.at[34722,'Text']='Philip Jeyaretnam'
df.at[100987,'Text']='Kwa Geok Choo'
df.at[323,'Text']='Whats the difference between a sentence and a word with Devan Nair in the first sentence'
df.at[7827,'Text']='Salmah and I had an amazing breakfast in East Coast'

In [43]:
df.head()

Unnamed: 0,index,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,128037,128038,B001717U9U,A34TQDJ94475AO,"Jay Endo ""Jay Endo""",1,1,2,1337817600,Not worth the money,Chan Sek Keong and I went to check out the cof...
1,491755,491756,B001SAX0EE,AYFENUF5IEAZQ,ellamar,1,1,5,1296864000,"Try it, you'll love it.",If you have never tried these Telma cubes in m...
2,470924,470925,B001ELL60W,A3DSWKQ2CR9H1G,yvonne,1,1,5,1330819200,best waffel mix ever,My family loves this product. We have done ta...
3,491263,491264,B008114GDW,A1SLAL64ORLT41,Bonnie Muffin,0,1,5,1314576000,Cinnamon nom nom noms,These taste exactly like cinnamon toast crunch...
4,371403,371404,B000EVNYQM,A2E2GGQGZOG4TD,"Wisdom ""seeker""",1,1,5,1296604800,great cereal!,I discovered this cereal at a local market and...


In [44]:
%%time
df.Text = df.Text.astype(str)
def nameDetection(txt):
    doc = nlp(txt)
    persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
    return persons

namesExist = []
for index, row in df.iterrows():
    namesExist.extend(nameDetection(row['Text']))

# Remove duplicates while preserving first-seen order
namesExist = list(dict.fromkeys(namesExist))
print(f'Names Detected are {namesExist}')

Wall time: 1h 27min 9s


As we can see, after sweeping through about 200k records, other potential names in the reviews were flagged as well. There were also instances where words that are not names were flagged, particularly because they follow the capitalisation convention of names (e.g. Blueberry Muffins). Thus, some eyeballing is required on the user's part to pick out the actual names.
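The row-by-row loop above calls `nlp` once per record, and the run took close to an hour and a half. SpaCy also provides `nlp.pipe`, which processes texts in batches and is generally the recommended way to handle large volumes. The sketch below shows the same PERSON extraction written around `nlp.pipe`. The helper name `detect_persons_batched` and the `batch_size=256` default are assumptions for illustration, and the actual time saved was not measured on this dataset.

```python
def detect_persons_batched(nlp, texts, batch_size=256):
    """Return unique PERSON entity texts, streaming texts through nlp.pipe."""
    names = []
    # nlp.pipe processes the texts in batches instead of one call per row
    for doc in nlp.pipe(texts, batch_size=batch_size):
        names.extend(ent.text for ent in doc.ents if ent.label_ == 'PERSON')
    # Remove duplicates while preserving first-seen order
    return list(dict.fromkeys(names))

# Usage in the notebook (assumes nlp and df from the cells above):
# namesExist = detect_persons_batched(nlp, df['Text'].astype(str))
```

The function takes `nlp` as a parameter so that the same code can be reused with a different or retrained model.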

In [45]:
# 'Abdul Halim' and 'Haron' were inserted as two separate names, so check them separately
names = ['Chan Sek Keong', 'S Dhanabalan', 'Joanna Dong', 'Abdul Halim', 'Haron',
         'Russell Lee', 'Philip Jeyaretnam', 'Kwa Geok Choo', 'Devan Nair', 'Salmah']

for name in names:
    if name in namesExist:
        print(f'{name} was detected in SpaCy')

Chan Sek Keong was detected in SpaCy
S Dhanabalan was detected in SpaCy
Joanna Dong was detected in SpaCy
Russell Lee was detected in SpaCy
Philip Jeyaretnam was detected in SpaCy
Kwa Geok Choo was detected in SpaCy
Devan Nair was detected in SpaCy


As observed, full names tend to be detected more easily than partial names. 'Abdul Halim', 'Haron' and 'Salmah' were not detected in the column. Nevertheless, accuracy is still fairly good if we intend to use this method to support compliance checks on datasets. To improve the accuracy of name detection further, we can consider training the SpaCy model on the names that we want it to recognise. Refer to [this article](https://towardsdatascience.com/custom-named-entity-recognition-using-spacy-7140ebbb3718) for a useful reference on how to train the model further.
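The check in the previous cell prints only exact matches, so a name that SpaCy returned as part of a longer entity would be reported as missing, and the misses themselves are never listed. The sketch below is one possible alternative that reports both groups and accepts case-insensitive partial matches. The helper `check_detection` is a hypothetical name, and the sample lists are illustrative inputs rather than actual SpaCy output.

```python
def check_detection(expected, detected):
    """Split expected names into found and missed.

    A name counts as found if it appears, ignoring case, inside any
    detected entity string (so 'Haron' matches 'Abdul Halim bin Haron').
    """
    detected_lower = [d.lower() for d in detected]
    found, missed = [], []
    for name in expected:
        if any(name.lower() in d for d in detected_lower):
            found.append(name)
        else:
            missed.append(name)
    return found, missed

# Illustrative inputs only, not the actual SpaCy output from the notebook
expected = ['Joanna Dong', 'Haron', 'Salmah']
detected = ['Joanna Dong', 'Abdul Halim bin Haron']
found, missed = check_detection(expected, detected)
print('Found:', found)
print('Missed:', missed)
```

Substring matching is deliberately lenient, so very short names may match unrelated entities; the found list still deserves a quick manual review.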