# **Import Libraries and Show an Example**

Submitted by: **Rukshar Alam**

[spaCy's named entity recognition](https://spacy.io/api/data-formats#section-named-entities) is trained on the OntoNotes 5 corpus and supports the following entity types:
<img src="https://miro.medium.com/max/3000/1*qQggIPMugLcy-ndJ8X_aAA.png">


Upgrade spaCy so that the transformer-based pretrained model [en_core_web_trf](https://spacy.io/models/en#en_core_web_trf) can be used instead of `en_core_web_sm`.

In [12]:
!pip install -U spacy  
!python -m spacy validate


Requirement already up-to-date: spacy in /usr/local/lib/python3.7/dist-packages (3.0.6)
2021-07-07 05:36:29.906828: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
✔ Loaded compatibility table

ℹ spaCy installation: /usr/local/lib/python3.7/dist-packages/spacy

NAME              SPACY            VERSION
en_core_web_trf   >=3.0.0,<3.1.0   3.0.0   ✔



In [13]:
!python -m spacy download en_core_web_trf

2021-07-07 05:36:34.991531: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_trf')


In [14]:
!pip install spacy[transformers]



In [15]:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
import en_core_web_trf
from pprint import pprint
import os
import pandas as pd
import spacy_transformers

#nlp = en_core_web_sm.load()
nlp = spacy.load("en_core_web_trf")

One of the nice things about spaCy is that we only need to call `nlp` once on a text: the whole processing pipeline runs in the background and returns a `Doc` object with all annotations attached.

In [16]:
##Example NER
doc = nlp('European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices')
pprint([(X.text, X.label_) for X in doc.ents])

[('European', 'NORP'),
 ('Google', 'ORG'),
 ('$5.1 billion', 'MONEY'),
 ('Wednesday', 'DATE')]
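
The notebook imports `Counter` but does not use it. As a minimal sketch, the entity list printed above can be tallied by label. The `entities` literal below is copied from that output rather than recomputed, so no model is needed to run it.

```python
from collections import Counter

# (text, label) pairs copied from the example output above.
entities = [
    ("European", "NORP"),
    ("Google", "ORG"),
    ("$5.1 billion", "MONEY"),
    ("Wednesday", "DATE"),
]

# Count how many entities were found for each label.
label_counts = Counter(label for _, label in entities)
print(label_counts.most_common())
```

With a real `Doc`, the same tally could be built from `doc.ents` by counting `ent.label_`.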


# **Mounting Drive, Loading Data, and Merging Text Data**

In [17]:
from google.colab import drive #mounting google drive where I keep my files
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [18]:
data_directory =  '/content/gdrive/My Drive/Omdena/Data/'
files = os.listdir(data_directory)
pprint(files)
files_directory = [data_directory+file for file in files]
#pprint(files_directory)
news_df = [] #load all csv files in this array
for file_directory in files_directory[:4]: #leaving out daily sun data because different encoding has to be used for it
  
  df = pd.read_csv(file_directory)
  print(file_directory)
  news_df.append(df)

df = pd.read_csv(files_directory[4],  encoding='cp1252') #loading daily sun data with different encoding
news_df.append(df)

#print the column names of each df
for df in news_df:
  print(df.columns)

['prothom_alo_modified.csv',
 'prothom_alo_1jan2019-31dec2020_v1.csv',
 'RoadAccidentsTheDailyObserver.csv',
 'Dhaka Tribune Complete Data.csv',
 'The_Daily_Sun.csv']
/content/gdrive/My Drive/Omdena/Data/prothom_alo_modified.csv
/content/gdrive/My Drive/Omdena/Data/prothom_alo_1jan2019-31dec2020_v1.csv
/content/gdrive/My Drive/Omdena/Data/RoadAccidentsTheDailyObserver.csv
/content/gdrive/My Drive/Omdena/Data/Dhaka Tribune Complete Data.csv
Index(['date_of_incident', 'time_of_incident', 'incident_type', 'location',
       'death_count', 'injury_count', 'type_of_vehicle1', 'type_of_vehicle2',
       'description_text', 'published-time', 'full_text'],
      dtype='object')
Index(['date_of_incident', 'time_of_incident', 'incident_type', 'location',
       'death_count', 'injury_count', 'type_of_vehicle1', 'type_of_vehicle2',
       'driver_age', 'description_text', 'published-time', 'link',
       'full_text'],
      dtype='object')
Index(['links', 'titles', 'Year', 'News'], dtype='object')
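
The loading cell relies on the order returned by `os.listdir`, which is not guaranteed, and on the Daily Sun file sitting at index 4. One possible alternative, sketched below, picks the encoding by file name instead. `plan_csv_loads`, `ENCODING_OVERRIDES`, and `DEFAULT_ENCODING` are hypothetical names, and treating UTF-8 as the default for the other files is an assumption.

```python
import os

# Assumption: only the Daily Sun export needs a non-default encoding.
ENCODING_OVERRIDES = {"The_Daily_Sun.csv": "cp1252"}
DEFAULT_ENCODING = "utf-8"

def plan_csv_loads(data_directory, file_names):
    """Return (path, encoding) pairs in a stable, sorted order."""
    plan = []
    for name in sorted(file_names):
        if not name.lower().endswith(".csv"):
            continue  # skip anything that is not a CSV file
        encoding = ENCODING_OVERRIDES.get(name, DEFAULT_ENCODING)
        plan.append((os.path.join(data_directory, name), encoding))
    return plan

# File names taken from the listing printed above.
files = [
    "prothom_alo_modified.csv",
    "prothom_alo_1jan2019-31dec2020_v1.csv",
    "RoadAccidentsTheDailyObserver.csv",
    "Dhaka Tribune Complete Data.csv",
    "The_Daily_Sun.csv",
]
load_plan = plan_csv_loads("/content/gdrive/My Drive/Omdena/Data/", files)
for path, encoding in load_plan:
    print(encoding, path)
```

Each pair could then be passed to `pd.read_csv(path, encoding=encoding)`. Sorting changes the position of each dataframe in the resulting list, so any code that indexes the list by position would need to be adjusted accordingly.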

Based on the columns, we keep only the two text columns of each dataframe and rename them to `title` and `description`.

In [19]:
df_1 = news_df[0][['description_text', 'full_text']]
df_2 = news_df[1][['description_text', 'full_text']]
df_3 = news_df[2][['titles', 'News']]
df_4 = news_df[3][['News title','Header']]
df_5 = news_df[4][['headline', 'summary']]
news_df_only_text = [df_1, df_2, df_3,df_4, df_5]
for df in news_df_only_text:
  df.columns = ['title', 'description'] #change the column names to make them same across all dfs
#concatenate all the dfs
single_news_df = pd.concat(news_df_only_text, axis = 0, ignore_index=True)
print(single_news_df.shape) #(2029, 2)

(2029, 2)


# **Named Entity Recognition of News Data**

In [24]:
def named_entity_recognition(news_description):
  """Return the named entities of a news article as (text, label) tuples."""
  ner_news = nlp(str(news_description))  # str() guards against non-string values such as NaN
  return [(ent.text, ent.label_) for ent in ner_news.ents]

# run NER on every description and store the result in a new column
single_news_df['NER'] = single_news_df['description'].apply(named_entity_recognition)
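
Calling `nlp` once per row works, but spaCy also provides `nlp.pipe`, which processes texts in batches and is usually faster for a few thousand articles. A minimal sketch follows. `extract_entities` is a hypothetical helper, the batch size of 32 is an arbitrary assumption, and `nlp` is passed in as an argument so the function works with whichever pipeline is loaded.

```python
def extract_entities(nlp, texts, batch_size=32):
    """Return one list of (text, label) tuples per input text."""
    # str() mirrors the original cell and guards against NaN descriptions.
    docs = nlp.pipe((str(text) for text in texts), batch_size=batch_size)
    return [[(ent.text, ent.label_) for ent in doc.ents] for doc in docs]
```

In this notebook it could be used as `single_news_df['NER'] = extract_entities(nlp, single_news_df['description'])`.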

In [26]:
##save this news dataframe with named entities in a csv file
single_news_df.to_csv('news_title_description_ner.csv', index=False)
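
Because the `NER` column holds Python lists, `to_csv` stores each cell as its string representation, and reading the file back yields strings rather than lists. One way to recover the lists, sketched here with the standard library, is `ast.literal_eval`. `parse_ner_cell` is a hypothetical helper name, and the sample string is an assumption about what a saved cell looks like.

```python
import ast

def parse_ner_cell(cell):
    """Convert a stringified list of (text, label) tuples back into a list."""
    return ast.literal_eval(cell)

# The string below has the same shape as a saved NER cell.
saved_cell = "[('Four', 'CARDINAL'), ('Sunday', 'DATE')]"
restored = parse_ner_cell(saved_cell)
print(restored[0])
```

After reloading the file with `pd.read_csv`, the helper could be applied with `df['NER'].apply(parse_ner_cell)`.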

In [25]:
single_news_df

| | title | description | NER |
|---|---|---|---|
| 0 | Four people killed in Jashore road accident | Four people were killed and another injured on... | [(Four, CARDINAL), (Sunday, DATE), (Dhopakhola... |
| 1 | Three killed in Mymensingh road accident | Three people, including a motorcycle rider, we... | [(Three, CARDINAL), (Mymensingh, GPE), (Rasulp... |
| 2 | 2 killed in Fatullah road accident | Two passengers of a battery-run auto-rickshaw ... | [(Two, CARDINAL), (Dhaka, GPE), (Masdair Amena... |
| 3 | 7 killed in three road accidents in Chattogram... | At least seven people, including women and chi... | [(At least seven, CARDINAL), (three, CARDINAL)... |
| 4 | ASI killed in Chattogram road accident | An assistant sub-inspector (ASI) of police, Ka... | [(Kazi Md Salahuddin, PERSON), (early Friday, ... |
| ... | ... | ... | ... |
| 2024 | Pedestrian killed in Ctg road accident | CHATTOGRAM: A pedestrian was killed in a road ... | [(CHATTOGRAM, GPE), (Sitakunda Upazila, GPE), ... |
| 2025 | Man killed in Ctg road accident | CHITTAGONG: A person was killed in a road acci... | [(CHITTAGONG, GPE), (Sitakunda, GPE), (Wednesd... |
| 2026 | Child killed in Manikganj road accident | MANIKGANJ: An infant schoolgirl was killed in ... | [(MANIKGANJ, GPE), (SP, ORG), (Manikganj Sadar... |
| 2027 | Awami League leader killed in Savar road accident | A local Awami League leader was killed after a... | [(Awami League, ORG), (Baliapur, LOC), (Hemaye... |


In [22]:
doc = nlp(single_news_df.iloc[5,1])
pprint([(X.text, X.label_) for X in doc.ents])

[('two-year-old', 'DATE'),
 ('Dhaka', 'GPE'),
 ('Sylhet', 'GPE'),
 ('Sunday', 'DATE'),
 ('UNB', 'ORG'),
 ('Five', 'CARDINAL'),
 ('Beauty Roy', 'PERSON'),
 ('Gopesh Roy', 'PERSON'),
 ('Nabiganj upazila', 'ORG'),
 ('Habiganj', 'GPE'),
 ('Rupak Roy', 'PERSON'),
 ('Satmail', 'GPE'),
 ('around 1:30pm when the Dhaka', 'TIME'),
 ('Sylhet', 'GPE'),
 ('Beauty Roy', 'PERSON'),
 ('Monirul Islam', 'PERSON'),
 ('South Surma', 'GPE')]
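
The output above repeats some entities, for example `Sylhet` and `Beauty Roy`. If only the distinct entities per label are of interest, they can be grouped as in the sketch below. `group_unique_entities` is a hypothetical helper, and `sample` is a hand-copied subset of the pairs printed above.

```python
from collections import defaultdict

def group_unique_entities(entities):
    """Map each label to its distinct entity texts, in first-seen order."""
    grouped = defaultdict(list)
    for text, label in entities:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)

# A subset of the (text, label) pairs printed above.
sample = [
    ("Dhaka", "GPE"),
    ("Sylhet", "GPE"),
    ("Beauty Roy", "PERSON"),
    ("Gopesh Roy", "PERSON"),
    ("Sylhet", "GPE"),
    ("Beauty Roy", "PERSON"),
]
grouped = group_unique_entities(sample)
print(grouped)
```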
