<a href="https://colab.research.google.com/github/JGillette71/BERT-Inferred_Minor_Status/blob/main/Storyteller_Data_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Global Giving Storytelling Project Data Exploration 

NER Research Project 

Jason Gillette 

Source: https://data.world/marcmaxmeister/globalgiving-storytelling-project

Tasks 
  - [x] Read data file 
  - [x] Summary Stats 
  - [x] Remove entries with insufficient length 
  - [x] Clean leading or trailing whitespace 
  - [x] General NER analysis via BERT 
    - [x] export copy 
  - [ ] Remove entries w/o sufficient entities 
  - [ ] Automated spelling corrections???
  - [ ] Publish cleaned data for custom entity tagging 

In [None]:
import pandas as pd

In [None]:
# read data 
data = pd.read_csv('/content/drive/MyDrive/Grad School Projects /globalgiving-storytelling-project-QueryResult.csv')
data.head()

Unnamed: 0,created_date,story,storyteller_age_lt_16,storyteller_gender_f,storyteller_gender_m,title,organization_name,revised_organization_name,story_location_country,story_location_city,story_location_neighborhood,latitude,longitude,location
0,2011-04-07T07:08:59,Many people in different countries have been k...,0,0,1,Anominous Felony,JUNK,,Kenya,Nairobi,Kibera 42,-1.318426,36.796496,POINT(36.79649600000000 -1.31842600000000)
1,2011-04-29T06:03:22,- In Kenya many people died because of violenc...,0,1,0,Violence,people died,,KENYA,NAIROBI,Kibera D.C /kianda,-1.13333,34.55,POINT(34.55000000000000 -1.13333000000000)
2,2011-04-29T11:25:59,An organization known as the 'VMS' meaning vol...,0,0,1,UNPLEASING PERSONALITY,VOLUNTEER MINISTERS,Volunteer Ministers,KENYA,NAIROBI,SCOUT CAMP ROWALAND,-1.13333,34.55,POINT(34.55000000000000 -1.13333000000000)
3,2011-05-16T04:19:16,Safaricom is one of the leading telecommunicat...,0,0,0,Interaction through sports,SAFARICOM MOBILE NETWORK,Safaricom Company,KENYA,NAIROBI,KIBERA MAKINA,-1.318426,36.796496,POINT(36.79649600000000 -1.31842600000000)
4,2011-07-16T18:40:36,politics and development are things that go ha...,0,1,0,political absurd,individual,Individual,Kenya,kakamega,intisia,0.26675,34.89439,POINT(34.89439000000000 0.26675000000000)


In [None]:
# check schema and size 
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57818 entries, 0 to 57817
Data columns (total 14 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   created_date                 57818 non-null  object 
 1   story                        57818 non-null  object 
 2   storyteller_age_lt_16        57818 non-null  int64  
 3   storyteller_gender_f         57818 non-null  int64  
 4   storyteller_gender_m         57818 non-null  int64  
 5   title                        57660 non-null  object 
 6   organization_name            57438 non-null  object 
 7   revised_organization_name    33072 non-null  object 
 8   story_location_country       56985 non-null  object 
 9   story_location_city          56788 non-null  object 
 10  story_location_neighborhood  55680 non-null  object 
 11  latitude                     38521 non-null  float64
 12  longitude                    38521 non-null  float64
 13  location        

'created_date' --> date story was recorded

**'story' --> target text**

**'storyteller_age_lt_16' --> Is the storyteller under 16 y.o.**

'storyteller_gender_f' --> storyteller_gender 

'storyteller_gender_m' --> storyteller_gender

'title' --> topic? 

'organization_name' --> 

'revised_organization_name' -->

'story_location_country'  --> country

'story_location_city'  --> city 

'story_location_neighborhood'  --> neighborhood

'latitude' --> loc data

'longitude' --> loc data

'location' --> loc data

In [None]:
# counts if the storyteller is under 16 years old 
print('values counts:')
print(data['storyteller_age_lt_16' ].value_counts())
print('percentage:')
print(data['storyteller_age_lt_16' ].value_counts(normalize=True))

values counts:
0    52087
1     5731
Name: storyteller_age_lt_16, dtype: int64
percentage:
0    0.900879
1    0.099121
Name: storyteller_age_lt_16, dtype: float64


Roughly 10% of entries are stories told by minors. A minor storyteller suggests, but does not guarantee, that the associated story mentions minor entities.

In [None]:
# create column w/ story length in words (a per-story word count; not an average, despite the name)
data['avg_story_length'] = data['story'].apply(lambda x: len(x.split()))
print(data['avg_story_length'].head())
print(data['avg_story_length'].describe())

0     68
1     90
2    135
3    133
4     64
Name: avg_story_length, dtype: int64
count    57818.000000
mean        69.378308
std         36.909739
min          1.000000
25%         42.000000
50%         61.000000
75%         88.000000
max        592.000000
Name: avg_story_length, dtype: float64


In [None]:
# which stories have over 500 words
# BERT cannot accept over 512 tokens (subword pieces, not words); longer inputs get truncated
data[data['avg_story_length'] > 500]

Unnamed: 0,created_date,story,storyteller_age_lt_16,storyteller_gender_f,storyteller_gender_m,title,organization_name,revised_organization_name,story_location_country,story_location_city,story_location_neighborhood,latitude,longitude,location,avg_story_length
29095,2012-03-18T22:21:23,I’m writing this letter to share with everyone...,0,0,1,Hope for our children. How our community was t...,Sumando Manos Foundation,,Argentina,Misiones,"El Soberbio, El Boton. La Flor. School # 269",,,,592


In [None]:
# how many stories have low word counts / unlikely to contain contextual training value 
data[data['avg_story_length'] < 10]

Unnamed: 0,created_date,story,storyteller_age_lt_16,storyteller_gender_f,storyteller_gender_m,title,organization_name,revised_organization_name,story_location_country,story_location_city,story_location_neighborhood,latitude,longitude,location,avg_story_length
33728,2012-07-09T07:21:35,You understand about food security we are wish...,0,0,1,Food,Care International,,Kenya,Nairobi,Ngando,,,,9
33872,2012-07-09T09:07:56,How about the support from outside?thanks,0,0,0,Thanks,care international,,Kenya,Nairobi,Ngando,,,,6
33890,2012-07-09T11:04:55,Who best assists the poor. \r\n\r\nCare Inter...,0,0,0,Priority,Care International,,Kenya,Nairobi,Nairobi Slum,,,,7
34443,2012-07-10T07:21:22,Proudly we need your support.Thanks. Kibera Y...,0,0,1,Is it you,World Vision,,Kenya,Nairobi,Kibera,,,,9
44705,2011-06-13T20:44:38,"mrembo program educated me about adolescence,p...",0,1,0,Orphans in kamukunji,mrembo program,Mrembo Program,KENYE,NAIROBI,muthurua,-1.13333,34.55,POINT(34.55000000000000 -1.13333000000000),8
46359,2012-12-19T08:16:34,..,0,0,1,IN PURSUE FOR PEACE,AMANI KENYA,,KENYA,KISUMU,KISUMU TOWN,,,,1
46362,2012-12-19T08:27:19,...,0,0,1,IMPROVING HYGIENE,KISUMU TOWN WELFARE,,KENYA,KISUMU,NYALENDA,,,,1
46379,2012-12-20T05:12:46,.,0,0,1,.,.,,KENYA,NAIROBI,MASIMBA STAGE,,,,1
46380,2012-12-20T05:14:25,After the post election violence..,0,0,0,.,.,,,,,,,,5
46386,2012-12-20T06:51:43,.,0,0,0,.,.,,,,,,,,1


Begin cleaning up the data

In [None]:
# keep stories with 11 to 459 words: under 460 leaves a roughly 10% buffer below BERT's 512-token limit
data = data[(data['avg_story_length'] < 460) & (data['avg_story_length'] > 10)]
data.reset_index(inplace=True)
print(len(data))
print(data['avg_story_length'].describe())

57755
count    57755.000000
mean        69.433157
std         36.765103
min         11.000000
25%         42.000000
50%         61.000000
75%         88.000000
max        459.000000
Name: avg_story_length, dtype: float64


Dropped 63 entries (57,818 to 57,755).

In [None]:
# strip leading non-alphabetic chars (incl. whitespace) and trailing spaces from stories using regex
import re 
data['story'] = data['story'].replace(r"^[^a-zA-Z]+| +$", r"", regex=True)
data['story'] 

0        Many people in different countries have been k...
1        In Kenya many people died because of violence,...
2        An organization known as the 'VMS' meaning vol...
3        Safaricom is one of the leading telecommunicat...
4        politics and development are things that go ha...
                               ...                        
57750    Fighting among children in a responsive charac...
57751    corruption in our communities is like a diseas...
57752    Its adevelopmental small metre finnechel busin...
57753    Samaritan blinds an individual growth healthwi...
57754    Considered as the earliest kind of business in...
Name: story, Length: 57755, dtype: object
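
To see what the pattern above does to individual strings, here is a minimal sketch using Python's `re` module directly. The sample strings are made up for illustration, and `strip_edges` is a hypothetical helper name, not part of the notebook.

```python
import re

# Same pattern as the cell above: a leading run of characters that are
# not ASCII letters, or a trailing run of spaces
EDGE_PATTERN = re.compile(r"^[^a-zA-Z]+| +$")

def strip_edges(text):
    """Remove leading non-alphabetic characters and trailing spaces."""
    return EDGE_PATTERN.sub("", text)

# Illustrative inputs (assumptions, not rows from the dataset)
samples = ["- In Kenya many people died  ", "..", "2011 was a hard year"]
cleaned = [strip_edges(s) for s in samples]
print(cleaned)
```

Note that a story opening with a number loses that number, and a story made only of punctuation becomes an empty string, so it may be worth re-checking lengths after this step.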

In [None]:
# fix missing spaces after periods and commas
data['story'] = data['story'].replace(r'(?<=[.,])(?=[^\s])', r' ', regex=True)
# See "whole of Africa.it" for confirmation of correction
data.at[3, 'story'] 

'Safaricom is one of the leading telecommunication network company in the whole of Africa. it has over a billion subscriber and half of them are enrolled members to its savings branch.\r\nSafaricom was established over five years ago and has managed to be the best and mobile phone subscriber and has managed to be the best and most profitable business firm in east and central Africa\r\nThey saw the need of helping the community rather than being a service provider alone and taking away peoples money. So the came to the community and organised and funded sporting tournaments for the youth and it was a very successful idea\r\nWinners were awarded and also the loosers were never left out. It also proved to be an important tool to the company the organisationand started using its services.'
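
The lookbehind/lookahead pattern above matches a zero-width position, so it inserts a space without consuming any characters. The sketch below applies the same pattern to a few made-up strings to show both the intended fix and two likely side effects; `add_missing_spaces` is a hypothetical helper name.

```python
import re

# Same pattern as the cell above: the position right after '.' or ','
# when the next character is not whitespace
GAP_PATTERN = re.compile(r"(?<=[.,])(?=[^\s])")

def add_missing_spaces(text):
    """Insert a space after periods and commas that lack one."""
    return GAP_PATTERN.sub(" ", text)

# Illustrative inputs (assumptions, not rows from the dataset)
fixed = add_missing_spaces("the whole of Africa.it has grown,fast")
decimal = add_missing_spaces("about 3.5 million")
ellipsis = add_missing_spaces("wait...")
print(fixed)
print(decimal)
print(ellipsis)
```

Decimals and ellipses are split as well, which may or may not matter for the NER step.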

In [None]:
!pip install transformers



In [None]:
# example token classification to identify general entities 
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy='simple')
example = data.at[3, 'story'] 

ner_results = nlp(example)
print(len(example.split()))
print(example)
print(ner_results)

136
Safaricom is one of the leading telecommunication network company in the whole of Africa. it has over a billion subscriber and half of them are enrolled members to its savings branch.
Safaricom was established over five years ago and has managed to be the best and mobile phone subscriber and has managed to be the best and most profitable business firm in east and central Africa
They saw the need of helping the community rather than being a service provider alone and taking away peoples money. So the came to the community and organised and funded sporting tournaments for the youth and it was a very successful idea
Winners were awarded and also the loosers were never left out. It also proved to be an important tool to the company the organisationand started using its services.
[{'entity_group': 'ORG', 'score': 0.9977357, 'word': 'Safaricom', 'start': 0, 'end': 9}, {'entity_group': 'LOC', 'score': 0.9996039, 'word': 'Africa', 'start': 82, 'end': 88}, {'entity_group': 'ORG', 'score'

In [None]:
# split with regex instead of .split() to separate punctuation like the tokenizer 
# not needed here, may need for later token alignment 
split = re.findall(r"[\w']+|[.,!?;]", example)
print(len(split))
print(split)

141
['Safaricom', 'is', 'one', 'of', 'the', 'leading', 'telecommunication', 'network', 'company', 'in', 'the', 'whole', 'of', 'Africa', '.', 'it', 'has', 'over', 'a', 'billion', 'subscriber', 'and', 'half', 'of', 'them', 'are', 'enrolled', 'members', 'to', 'its', 'savings', 'branch', '.', 'Safaricom', 'was', 'established', 'over', 'five', 'years', 'ago', 'and', 'has', 'managed', 'to', 'be', 'the', 'best', 'and', 'mobile', 'phone', 'subscriber', 'and', 'has', 'managed', 'to', 'be', 'the', 'best', 'and', 'most', 'profitable', 'business', 'firm', 'in', 'east', 'and', 'central', 'Africa', 'They', 'saw', 'the', 'need', 'of', 'helping', 'the', 'community', 'rather', 'than', 'being', 'a', 'service', 'provider', 'alone', 'and', 'taking', 'away', 'peoples', 'money', '.', 'So', 'the', 'came', 'to', 'the', 'community', 'and', 'organised', 'and', 'funded', 'sporting', 'tournaments', 'for', 'the', 'youth', 'and', 'it', 'was', 'a', 'very', 'successful', 'idea', 'Winners', 'were', 'awarded', 'and', '
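
One way the regex split could later support token alignment is to compare each token's character span with the `start`/`end` offsets the pipeline returns. The following is a sketch under assumptions: `tag_tokens` is a hypothetical helper, and the entity dicts are hand-written stand-ins that only mimic the shape of the pipeline output shown earlier.

```python
import re

# Same token pattern as the cell above
TOKEN_PATTERN = re.compile(r"[\w']+|[.,!?;]")

def tag_tokens(text, entities):
    """Label each token with the entity_group whose span covers it, else 'O'."""
    tagged = []
    for match in TOKEN_PATTERN.finditer(text):
        label = "O"
        for ent in entities:
            # A token belongs to an entity if it lies fully inside its span
            if ent["start"] <= match.start() and match.end() <= ent["end"]:
                label = ent["entity_group"]
                break
        tagged.append((match.group(), label))
    return tagged

text = "Safaricom is a company in Africa."
africa_start = text.index("Africa")
# Hand-written stand-ins for pipeline output (assumption); only the keys used here
entities = [
    {"entity_group": "ORG", "start": 0, "end": 9},
    {"entity_group": "LOC", "start": africa_start, "end": africa_start + len("Africa")},
]
tagged = tag_tokens(text, entities)
print(tagged)
```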

In [None]:
!pip install swifter



In [None]:
import swifter
# Swifter automatically decides which is faster: Dask parallel processing or Pandas apply.

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy='simple')

# id general entities in all stories 
data['entities'] = data['story'].swifter.apply(lambda x: nlp(x))

data['entities'] 

Pandas Apply:   0%|          | 0/57755 [00:00<?, ?it/s]

0        [{'entity_group': 'MISC', 'score': 0.99714357,...
1        [{'entity_group': 'LOC', 'score': 0.9998136, '...
2        [{'entity_group': 'ORG', 'score': 0.76395184, ...
3        [{'entity_group': 'ORG', 'score': 0.9977357, '...
4                                                       []
                               ...                        
57750    [{'entity_group': 'ORG', 'score': 0.9783759, '...
57751    [{'entity_group': 'ORG', 'score': 0.77378774, ...
57752    [{'entity_group': 'MISC', 'score': 0.9950283, ...
57753    [{'entity_group': 'ORG', 'score': 0.5690524, '...
57754    [{'entity_group': 'MISC', 'score': 0.995856, '...
Name: entities, Length: 57755, dtype: object

In [None]:
data.at[0, 'entities'] 

[{'end': 245,
  'entity_group': 'MISC',
  'score': 0.99714357,
  'start': 242,
  'word': 'HIV'},
 {'end': 250,
  'entity_group': 'MISC',
  'score': 0.7982759,
  'start': 246,
  'word': 'AIDS'}]

In [None]:
def extract_entity_groups(i):
  return([dictionary['entity_group'] for dictionary in i if 'entity_group' in dictionary])

data['entities'] = data['entities'].swifter.apply(lambda x: extract_entity_groups(x))

data['entities'] 

Pandas Apply:   0%|          | 0/57755 [00:00<?, ?it/s]

0                   [MISC, MISC]
1              [LOC, MISC, MISC]
2                          [ORG]
3           [ORG, LOC, ORG, LOC]
4                             []
                  ...           
57750                 [ORG, ORG]
57751                      [ORG]
57752               [MISC, MISC]
57753                      [ORG]
57754    [MISC, MISC, ORG, MISC]
Name: entities, Length: 57755, dtype: object

### NER Analysis Checkpoint

In [None]:
# export data so I don't have to run nlp pipeline again 
data.to_csv('/content/drive/MyDrive/Grad School Projects /storytelling-ner-added.csv')

In [None]:
# read saved data 
data = pd.read_csv('/content/drive/MyDrive/Grad School Projects /storytelling-ner-added.csv')

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57755 entries, 0 to 57754
Data columns (total 18 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Unnamed: 0                   57755 non-null  int64  
 1   index                        57755 non-null  int64  
 2   created_date                 57755 non-null  object 
 3   story                        57755 non-null  object 
 4   storyteller_age_lt_16        57755 non-null  int64  
 5   storyteller_gender_f         57755 non-null  int64  
 6   storyteller_gender_m         57755 non-null  int64  
 7   title                        57597 non-null  object 
 8   organization_name            57376 non-null  object 
 9   revised_organization_name    33069 non-null  object 
 10  story_location_country       56930 non-null  object 
 11  story_location_city          56733 non-null  object 
 12  story_location_neighborhood  55626 non-null  object 
 13  latitude        

In [None]:
# convert entities column back to list type 
# csv import converts to string 
from ast import literal_eval

data['entities'] = data['entities'].apply(lambda x: literal_eval(x))

type(data.at[0,'entities'])

list
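
The round trip above can be illustrated without pandas. This sketch assumes the CSV cell holds roughly the `str()` form of the list; it also shows that `literal_eval` only parses literals and refuses arbitrary expressions, which is why it is preferable to `eval` here.

```python
from ast import literal_eval

entities = ["ORG", "LOC", "ORG", "LOC"]
as_text = str(entities)            # roughly what the CSV cell contains
restored = literal_eval(as_text)   # parse the literal back into a list

# literal_eval rejects anything that is not a plain Python literal
try:
    literal_eval("__import__('os')")
    rejected = False
except ValueError:
    rejected = True

print(type(restored).__name__, restored, rejected)
```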

In [None]:
# average entities per story
print(data['entities'].apply(len).mean())

# max entities in stories  
print(data['entities'].apply(len).max())

# counts  
print(data['entities'].apply(len).value_counts())

2.2076876460912476
34
0     14353
1     13364
2     10360
3      7165
4      4618
5      2961
6      1830
7      1101
8       733
9       426
10      290
11      192
12       99
13       86
14       53
15       37
16       25
17       16
20       11
18       10
19        7
21        4
26        4
22        3
28        2
25        2
23        1
27        1
34        1
Name: entities, dtype: int64


In [None]:
# keep stories with 6 to 20 entities (arbitrary range) 
clean_data = data[(data['entities'].map(len) > 5) & (data['entities'].map(len) < 21)]

# Need to reset index 
clean_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4916 entries, 31 to 57746
Data columns (total 18 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Unnamed: 0                   4916 non-null   int64  
 1   index                        4916 non-null   int64  
 2   created_date                 4916 non-null   object 
 3   story                        4916 non-null   object 
 4   storyteller_age_lt_16        4916 non-null   int64  
 5   storyteller_gender_f         4916 non-null   int64  
 6   storyteller_gender_m         4916 non-null   int64  
 7   title                        4898 non-null   object 
 8   organization_name            4879 non-null   object 
 9   revised_organization_name    2969 non-null   object 
 10  story_location_country       4822 non-null   object 
 11  story_location_city          4796 non-null   object 
 12  story_location_neighborhood  4665 non-null   object 
 13  latitude        

In [None]:
# count instances of each entity 
clean_data['entities'].explode().value_counts()

PER     11016
LOC     10179
ORG      9569
MISC     7441
Name: entities, dtype: int64

In [None]:
# count total entities  
clean_data['entities'].explode().count()

38205
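
For reference, the same per-label and total counts can be computed on plain lists with the standard library. This is a minimal sketch, not the notebook's method; the three entity lists are small hand-written samples.

```python
from collections import Counter
from itertools import chain

# Small hand-written sample of per-story entity labels (assumption)
story_entities = [
    ["ORG", "MISC", "ORG", "PER", "PER", "LOC", "MISC"],
    ["ORG", "ORG", "MISC", "ORG", "ORG", "ORG"],
    ["LOC", "ORG", "LOC", "ORG", "ORG", "LOC"],
]

# Flatten the lists (the plain-Python counterpart of explode) and count labels
label_counts = Counter(chain.from_iterable(story_entities))
total_entities = sum(label_counts.values())
print(label_counts.most_common())
print(total_entities)
```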

In [None]:
final_cols = ['story',
              'storyteller_age_lt_16',
              'storyteller_gender_f',
              'storyteller_gender_m', 
              'story_location_country',
              'story_location_city',
              'entities']

clean_data = clean_data[final_cols].reset_index(drop=True)
clean_data.head()    

Unnamed: 0,story,storyteller_age_lt_16,storyteller_gender_f,storyteller_gender_m,story_location_country,story_location_city,entities
0,The law should be amended to give the truth Ju...,0,1,0,KENYA,KAKAMEGA,"[ORG, MISC, ORG, PER, PER, LOC, MISC]"
1,Baraka Youth Group in Emuhaya District empowe...,0,0,1,kenya,Emuhaya,"[ORG, LOC, PER, PER, ORG, MISC, MISC, LOC, PER..."
2,M-Learning is the short form of the word Mobil...,0,0,1,KENYA,NAIROBI,"[ORG, ORG, MISC, ORG, ORG, ORG]"
3,It was one of those glorous evenings students ...,1,1,0,KENYA,KAKAMEGA,"[LOC, ORG, LOC, ORG, ORG, LOC]"
4,Food donations by the Kenya Red Cross society ...,0,0,1,KENYA,NAIROBI,"[ORG, LOC, LOC, ORG, MISC, LOC, LOC]"


In [None]:
# export the curated subset so I don't have to run the nlp pipeline and filtering again 
clean_data.to_csv('/content/drive/MyDrive/Grad School Projects /clean-storytelling-data.csv')

In [None]:
# collapse tabs, newlines, and runs of whitespace into a single space
clean_data['story'] = clean_data['story'].replace(r'\s+', r' ', regex=True)

clean_data.head(10)

Unnamed: 0,story,storyteller_age_lt_16,storyteller_gender_f,storyteller_gender_m,story_location_country,story_location_city,entities
0,The law should be amended to give the truth Ju...,0,1,0,KENYA,KAKAMEGA,"[ORG, MISC, ORG, PER, PER, LOC, MISC]"
1,Baraka Youth Group in Emuhaya District empowe...,0,0,1,kenya,Emuhaya,"[ORG, LOC, PER, PER, ORG, MISC, MISC, LOC, PER..."
2,M-Learning is the short form of the word Mobil...,0,0,1,KENYA,NAIROBI,"[ORG, ORG, MISC, ORG, ORG, ORG]"
3,It was one of those glorous evenings students ...,1,1,0,KENYA,KAKAMEGA,"[LOC, ORG, LOC, ORG, ORG, LOC]"
4,Food donations by the Kenya Red Cross society ...,0,0,1,KENYA,NAIROBI,"[ORG, LOC, LOC, ORG, MISC, LOC, LOC]"
5,Western HIV/AIDS Network - WEHAK- mainly a cap...,0,0,1,KENYA,KAKAMEGA,"[ORG, MISC, ORG, MISC, ORG, ORG, ORG, ORG, ORG..."
6,Western HIV/AIDs Networking concentrates to HI...,0,1,0,Kenya,Kakamega,"[MISC, MISC, MISC, ORG, MISC, MISC]"
7,Western HIV/AIDS Network - WEHAK while working...,0,1,0,KENYA,KAKAMEGA,"[ORG, MISC, ORG, PER, MISC, MISC, MISC, MISC]"
8,Western HIV/AIDS Network- WEHAK Works with com...,0,1,0,KENYA,KAKAMEGA,"[ORG, MISC, ORG, MISC, ORG, MISC]"
9,Our organisation- Western HIV/AIDS Network has...,0,1,0,KENYA,KAKAMEGA,"[ORG, MISC, MISC, MISC, MISC, MISC, MISC]"


In [None]:

# write curated stories to a text file, one story per line, for use in an annotation tool 
# (to_csv with sep='\n' did not write one line per entry, so write the lines directly)
with open('/content/drive/MyDrive/Grad School Projects /currated_story_data_chunk.txt', 'w') as outfile:
  outfile.write('\n'.join(clean_data['story']) + '\n')

In [None]:
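chunk_size = 100
# read back the one-story-per-line file (written by the cell below) and split it into files of chunk_size stories
with open("/content/drive/MyDrive/Grad School Projects /currated_story_data_onesep.txt") as infile:
  stories = [line.rstrip('\n') for line in infile]
for counter, start in enumerate(range(0, len(stories), chunk_size), start=1):
  with open(f'/content/drive/MyDrive/Grad School Projects /ner_data/ner_data_chunk{counter}.txt', 'w') as outfile:
    outfile.write('\n'.join(stories[start:start + chunk_size]) + '\n')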

In [None]:
filename = "/content/drive/MyDrive/Grad School Projects /currated_story_data_onesep.txt"

# 'w' opens the file for writing; the with block closes it when we're done
with open(filename, 'w') as outfile:
  for story in clean_data['story']:
    outfile.write(story + "\n")

In [None]:

filename = '/content/drive/MyDrive/Grad School Projects /currated_story_data_chunk'+'A'+'.txt'
# chunk 'A': the first 100 stories
with open(filename, 'w') as outfile:
  for story in clean_data['story'][:100]:
    outfile.write(story + "\n")

In [None]:
for i in range(1,11):
  print(i*500)

500
1000
1500
2000
2500
3000
3500
4000
4500
5000
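
The boundaries printed above could drive the chunked export directly. A minimal sketch, assuming chunks of 500 over the 4,916 curated stories; `chunk_bounds` is a hypothetical helper name.

```python
def chunk_bounds(n_items, chunk_size):
    """Yield (start, stop) index pairs covering n_items in steps of chunk_size."""
    for start in range(0, n_items, chunk_size):
        # Clamp the last chunk so it does not run past the end
        yield start, min(start + chunk_size, n_items)

# 4916 is the row count reported by clean_data.info(); 500 is an assumed chunk size
bounds = list(chunk_bounds(4916, 500))
print(len(bounds), bounds[0], bounds[-1])
```

Each pair can then be used as a slice, for example `clean_data['story'][start:stop]`, so the final chunk holds the remaining 416 stories rather than a full 500.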


In [None]:
clean_data[(clean_data['story'].str.contains(' '))&(clean_data['story'].str.contains('age'))]

Unnamed: 0,story,storyteller_age_lt_16,storyteller_gender_f,storyteller_gender_m,story_location_country,story_location_city,entities
1,Baraka Youth Group in Emuhaya District empowe...,0,0,1,kenya,Emuhaya,"[ORG, LOC, PER, PER, ORG, MISC, MISC, LOC, PER..."
2,M-Learning is the short form of the word Mobil...,0,0,1,KENYA,NAIROBI,"[ORG, ORG, MISC, ORG, ORG, ORG]"
8,Western HIV/AIDS Network- WEHAK Works with com...,0,1,0,KENYA,KAKAMEGA,"[ORG, MISC, ORG, MISC, ORG, MISC]"
10,Water is the most precious commodity in the wo...,0,0,1,Kenya,Vihiga,"[MISC, LOC, LOC, LOC, ORG, LOC]"
25,Twitter delivers a miracle for a family in nee...,0,0,1,KENYA,NAIROBI,"[ORG, PER, PER, PER, LOC, LOC, PER, ORG]"
...,...,...,...,...,...,...,...
4888,The formation of RIDOKAM self Help group in Ur...,0,0,1,KENYA,MIGORI,"[ORG, ORG, LOC, LOC, LOC, ORG]"
4897,"In Hamisi District, Western Kenya, there is a ...",0,0,0,KENYA,CHEPTULU,"[LOC, LOC, LOC, LOC, LOC, MISC]"
4901,Unemployment is a key contributor to Kenya's p...,0,0,1,KENYA,CHEPTULU,"[LOC, LOC, LOC, LOC, LOC, PER, PER, PER, LOC]"
4906,Hillary Bilali is my name. I am a youth 17 yrs...,0,0,1,Kenya,Kakamega,"[PER, LOC, MISC, MISC, MISC, MISC, MISC, MISC]"


In [None]:
# count stories that contain at least one PER entity
per_count = clean_data['entities'].apply(lambda ents: 'PER' in ents).sum()
print(per_count)


2754
