## Recognizing Top Organizations and Locations with SpaCy and NLTK

**Author - Lasya Nayani Bhatta**

In [46]:
import pandas as pd

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 500)

In [47]:
pip install nltk

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.



#### Read news data

In [48]:
news_path = 'https://storage.googleapis.com/msca-bdp-data-open/news/nlp_a_5_news.json'
news_df = pd.read_json(news_path, orient='records', lines=True)

print(f'Sample contains {news_df.shape[0]:,.0f} news articles')
news_df.head(2)

Sample contains 10,012 news articles


Unnamed: 0,url,date,language,title,text
0,http://kokomoperspective.com/obituaries/jon-w-horton/article_b6ba8e1e-cb9c-11eb-9868-fb11b88b9778.html,2021-06-13,en,Jon W. Horton | Obituaries | kokomoperspective.com,Jon W. Horton | Obituaries | kokomoperspective.comYou have permission to edit this article. EditCloseSign Up Log In Dashboard LogoutMy Account Dashboard Profile Saved items LogoutCOVID-19Click here for the latest local news on COVID-19HomeAbout UsContact UsNewsLocalOpinionPoliticsNationalStateAgricultureLifestylesEngagements/Anniversaries/WeddingsAutosEntertainmentHealthHomesOutdoorsSportsNFLNCAAVitalsObituariesAutomotivee-EditionCouponsGalleries74°...
1,https://auto.economictimes.indiatimes.com/news/auto-components/birla-precision-to-ramp-up-capacity-to-tap-emerging-opportunities-in-india/81254902,2021-02-28,en,"Birla Precision to ramp up capacity to tap emerging opportunities in India, Auto News, ET Auto","Birla Precision to ramp up capacity to tap emerging opportunities in India, Auto News, ET Auto We have updated our terms and conditions and privacy policy Click ""Continue"" to accept and continue with ET AutoAccept the updated privacy & cookie policyDear user, ET Auto privacy and cookie policy has been updated to align with the new data regulations in European Union. Please review and accept these changes below to continue using the website.You can see our privacy policy & our cookie ..."


**Cleaning News Data**

**a) Cleaning Title**

In [49]:
import re 
# Check if mentions exist
mentions_exist_title = any('@' in title for title in news_df['title'])

# Check if hashtags exist
hashtags_exist_title = any('#' in title for title in news_df['title'])

# Check if numbers exist
numbers_exist_title = any(re.search(r'\d', title) for title in news_df['title'])

# Check if links exist
links_exist_title = any(re.search(r'https?://\S+', title) for title in news_df['title'])

print("Mentions exist in titles:", mentions_exist_title)
print("Hashtags exist in titles:", hashtags_exist_title)
print("Numbers exist in titles:", numbers_exist_title)
print("Links exist in titles:", links_exist_title)


Mentions exist in titles: True
Hashtags exist in titles: True
Numbers exist in titles: True
Links exist in titles: False


In [50]:
import re

def cleanTitle(title):
    # Remove mentions if they exist
    title = re.sub('@[A-Za-z0-9_]+', '', title)
    
    # Remove hashtags if they exist
    title = re.sub('#', '', title)
    
    # Remove hyperlinks if they exist
    title = re.sub(r'https?://\S+', '', title)
    
    # Remove pipe characters if they exist
    title = title.replace('|', '')
    
    # Remove any remaining non-alphanumeric characters and extra whitespaces
    title = re.sub(r'[^a-zA-Z0-9\s]', '', title)
    title = re.sub(r'\s+', ' ', title).strip()
    
    return title.strip()

# Apply the updated cleaning function to the 'title' column in news_df
news_df['cleanedTitle'] = news_df['title'].apply(cleanTitle)
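
A quick sanity check helps confirm what this cleaning does to a title. The sketch below re-implements the same steps in a standalone function (the name `clean_title_demo` is hypothetical) and runs it on the first title from the sample. It is an illustration only, not a replacement for `cleanTitle`.

```python
import re

def clean_title_demo(title: str) -> str:
    """Standalone copy of the title-cleaning steps, for illustration only."""
    title = re.sub(r'@[A-Za-z0-9_]+', '', title)   # drop @mentions
    title = re.sub(r'https?://\S+', '', title)     # drop hyperlinks
    title = re.sub(r'[^a-zA-Z0-9\s]', '', title)   # drop punctuation, including '#' and '|'
    return re.sub(r'\s+', ' ', title).strip()      # collapse repeated whitespace

sample = "Jon W. Horton | Obituaries | kokomoperspective.com"
print(clean_title_demo(sample))
```

Because every punctuation character is removed, domain names and hyphenated names are fused into single tokens (for example `kokomoperspectivecom`), which is worth keeping in mind when reading the entity lists later on.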


**Cleaning Text**

In [51]:
import re 
# Check if mentions exist
mentions_exist = any('@' in text for text in news_df['text'])

# Check if hashtags exist
hashtags_exist = any('#' in text for text in news_df['text'])

# Check if numbers exist
numbers_exist = any(re.search(r'\d', text) for text in news_df['text'])

# Check if links exist
links_exist = any(re.search(r'https?://\S+', text) for text in news_df['text'])

print("Mentions exist:", mentions_exist)
print("Hashtags exist:", hashtags_exist)
print("Numbers exist:", numbers_exist)
print("Links exist:", links_exist)


Mentions exist: True
Hashtags exist: True
Numbers exist: True
Links exist: True


In [52]:
import re

def cleanNews(text):
    # Remove mentions (@username)
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)
    
    # Remove hashtags (#)
    text = re.sub(r'#', '', text)
    
    # Remove hyperlinks
    text = re.sub(r'https?://\S+', '', text)
    
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    
    # Remove everything except letters and whitespace (this also strips punctuation)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    #text = re.sub(r'\s+', ' ', text).strip()
    
    return text

# Apply the updated cleaning function to the 'text' column in news_df
news_df['cleanedText'] = news_df['text'].apply(cleanNews)
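
`cleanNews` removes every character that is not a letter or whitespace, so sentence-ending periods, digits, and hyphens all disappear. That matters later: sentence tokenizers rely on punctuation, and names such as COVID-19 collapse to COVID. A gentler variant could look like the sketch below. The function name `clean_for_ner` and the choice of which characters to keep are assumptions for illustration, not part of the original notebook.

```python
import re

def clean_for_ner(text: str) -> str:
    """Hypothetical gentler cleaner: keeps digits, hyphens and sentence punctuation."""
    text = re.sub(r'https?://\S+', ' ', text)           # drop hyperlinks
    text = re.sub(r'@[A-Za-z0-9_]+', ' ', text)         # drop @mentions
    text = re.sub(r"[^A-Za-z0-9\s.,!?'-]", ' ', text)   # keep basic punctuation only
    return re.sub(r'\s+', ' ', text).strip()            # collapse repeated whitespace

raw = "Ford unveils #EV, see https://example.com today. COVID-19 delays hit Mercedes-Benz!"
print(clean_for_ner(raw))
```

Whether this helps the entity counts would need to be checked by re-running the extraction on text cleaned this way.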


In [53]:
news_df.head(3)

Unnamed: 0,url,date,language,title,text,cleanedTitle,cleanedText
0,http://kokomoperspective.com/obituaries/jon-w-horton/article_b6ba8e1e-cb9c-11eb-9868-fb11b88b9778.html,2021-06-13,en,Jon W. Horton | Obituaries | kokomoperspective.com,Jon W. Horton | Obituaries | kokomoperspective.comYou have permission to edit this article. EditCloseSign Up Log In Dashboard LogoutMy Account Dashboard Profile Saved items LogoutCOVID-19Click here for the latest local news on COVID-19HomeAbout UsContact UsNewsLocalOpinionPoliticsNationalStateAgricultureLifestylesEngagements/Anniversaries/WeddingsAutosEntertainmentHealthHomesOutdoorsSportsNFLNCAAVitalsObituariesAutomotivee-EditionCouponsGalleries74°...,Jon W Horton Obituaries kokomoperspectivecom,Jon W Horton Obituaries kokomoperspectivecomYou have permission to edit this article EditCloseSign Up Log In Dashboard LogoutMy Account Dashboard Profile Saved items LogoutCOVIDClick here for the latest local news on COVIDHomeAbout UsContact UsNewsLocalOpinionPoliticsNationalStateAgricultureLifestylesEngagementsAnniversariesWeddingsAutosEntertainmentHealthHomesOutdoorsSportsNFLNCAAVitalsObituariesAutomotiveeEditionCouponsGalleriesFair ...
1,https://auto.economictimes.indiatimes.com/news/auto-components/birla-precision-to-ramp-up-capacity-to-tap-emerging-opportunities-in-india/81254902,2021-02-28,en,"Birla Precision to ramp up capacity to tap emerging opportunities in India, Auto News, ET Auto","Birla Precision to ramp up capacity to tap emerging opportunities in India, Auto News, ET Auto We have updated our terms and conditions and privacy policy Click ""Continue"" to accept and continue with ET AutoAccept the updated privacy & cookie policyDear user, ET Auto privacy and cookie policy has been updated to align with the new data regulations in European Union. Please review and accept these changes below to continue using the website.You can see our privacy policy & our cookie ...",Birla Precision to ramp up capacity to tap emerging opportunities in India Auto News ET Auto,Birla Precision to ramp up capacity to tap emerging opportunities in India Auto News ET Auto We have updated our terms and conditions and privacy policy Click Continue to accept and continue with ET AutoAccept the updated privacy cookie policyDear user ET Auto privacy and cookie policy has been updated to align with the new data regulations in European Union Please review and accept these changes below to continue using the websiteYou can see our privacy policy our cookie policy We...
2,https://ca.sports.yahoo.com/news/global-hydrogen-fueling-station-markets-104800330.html?src=rss,2021-12-07,en,Global Hydrogen Fueling Station Markets to 2035: Current State and Future Prognosis of Passenger Hydrogen Fuel Cell Vehicles (FCVs),Global Hydrogen Fueling Station Markets to 2035: Current State and Future Prognosis of Passenger Hydrogen Fuel Cell Vehicles (FCVs) HOME MAIL NEWS SPORTS FINANCE CELEBRITY STYLE MOVIES WEATHER MOBILE Yahoo Sports Sign in Mail Sign in to view your mail Sports Home Sports Home Fantasy Fantasy Fantasy FootballFantasy Football Fantasy HockeyFantasy Hockey Fantasy BasketballFantasy Basketball Fantasy Auto RacingFantasy Auto Racing Fantasy Go...,Global Hydrogen Fueling Station Markets to 2035 Current State and Future Prognosis of Passenger Hydrogen Fuel Cell Vehicles FCVs,Global Hydrogen Fueling Station Markets to Current State and Future Prognosis of Passenger Hydrogen Fuel Cell Vehicles FCVs HOME MAIL NEWS SPORTS FINANCE CELEBRITY STYLE MOVIES WEATHER MOBILE Yahoo Sports Sign in Mail Sign in to view your mail Sports Home Sports Home Fantasy Fantasy Fantasy FootballFantasy Football Fantasy HockeyFantasy Hockey Fantasy BasketballFantasy Basketball Fantasy Auto RacingFantasy Auto Racing Fantasy GolfFanta...


**News Title - Without Segmentation**

**Using Spacy**

In [54]:
import pandas as pd
from tqdm.notebook import tqdm
import spacy
from nltk import sent_tokenize, word_tokenize
import nltk
tqdm.pandas()

In [55]:
nlp = spacy.load("en_core_web_md")
nlp.select_pipes(enable=['ner'])

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer']

In [56]:
def extract_entities_spacy(text):
    doc = nlp(text)
    organizations = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
    locations = [ent.text for ent in doc.ents if ent.label_ in ["LOC", "GPE"]]
    return organizations, locations

In [57]:
news_df['organizations_spacy_spacy'], news_df['locations_spacy_spacy'] = zip(*news_df['cleanedTitle'].progress_apply(extract_entities_spacy))



  0%|          | 0/10012 [00:00<?, ?it/s]

In [58]:
all_organizations_spacy = [org for sublist in news_df['organizations_spacy_spacy'] for org in sublist]
all_locations_spacy = [loc for sublist in news_df['locations_spacy_spacy'] for loc in sublist]


In [59]:
top_20_org = pd.Series(all_organizations_spacy).value_counts().head(20)


In [60]:
print("Top 20 organizations:")
top_20_org

Top 20 organizations:


Daily Mail                       1068
Ford                              360
Toyota                            203
Chevrolet                         195
Hyundai                           180
Honda                             155
Daily Mail Online                 151
Star News                         136
Nissan                            108
Automotive News                    98
EV                                 92
BMW                                90
Car Dealer Magazine                78
Dodge                              76
Jeep                               68
Otago Daily Times Online News      67
CoventryLive                       61
Mazda                              58
Volkswagen                         55
Lexus                              54
dtype: int64

In [61]:
top_20_loc = pd.Series(all_locations_spacy).value_counts().head(20)

In [62]:
print("Top 20 Location:")
top_20_loc

Top 20 Location:


Manitoba       140
UK             140
US             139
Winnipeg       137
Toronto        120
London         101
Cambridge       91
India           79
North York      72
Calgary         63
Taiwan          56
Mississauga     55
Scarborough     42
Oakville        42
Innisfil        41
Ottawa          39
China           37
Australia       36
Russia          33
Hamilton        32
dtype: int64

**Using NLTK**

In [63]:
from tqdm.notebook import tqdm
import nltk
tqdm.pandas()

In [64]:
def extract_entities_nltk(text):
    entities = []
    labels = []
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)), binary=False):
        if hasattr(chunk, 'label'):
            entities.append(' '.join(c[0] for c in chunk))  # Add space between multi-token entities
            labels.append(chunk.label())
    entities_labels = list(set(zip(entities, labels)))  # unique entities
    organizations = [entity for entity, label in entities_labels if label == "ORGANIZATION"]
    locations = [entity for entity, label in entities_labels if label == "GPE"]
    return organizations, locations
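
One detail affects how these counts compare with the spaCy counts: `set(zip(entities, labels))` keeps each entity only once per title, so this function measures how many titles mention an entity, while the spaCy function above counts every mention. A toy illustration with made-up entity lists:

```python
from collections import Counter

# Made-up entity mentions for three documents.
docs = [["Ford", "Ford", "Toyota"], ["Ford"], ["Toyota", "Toyota", "Toyota"]]

mention_freq = Counter(e for doc in docs for e in doc)        # every mention counts
document_freq = Counter(e for doc in docs for e in set(doc))  # at most once per document

print(mention_freq)
print(document_freq)
```

Titles are short, so the two measures stay close there, but on full article text the gap between them can be large.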

In [65]:
news_df['organizations_nltk_new'], news_df['locations_nltk_new'] = zip(*news_df['cleanedTitle'].progress_apply(extract_entities_nltk))


  0%|          | 0/10012 [00:00<?, ?it/s]

In [66]:
all_organizations_nltk = [org for orgs in news_df['organizations_nltk_new'] for org in orgs]
all_locations_nltk = [loc for locs in news_df['locations_nltk_new'] for loc in locs]

In [67]:
top_20_org_nltk = pd.Series(all_organizations_nltk).value_counts().head(20)

In [68]:
print("Top 20 organizations:")
top_20_org_nltk

Top 20 organizations:


Star News            174
Daily Mail Online    110
Shropshire Star       92
MercedesBenz          88
Automotive News       85
PHL17com              59
Business Live         51
GMC                   49
SUVs                  45
NewsBreak             43
BMW                   42
UK                    42
Hindu                 41
COVID19               39
CarWale               38
News Driven           36
SUV                   35
Auto News             34
Fast Lane Car         34
Express Star          33
dtype: int64

In [69]:
top_20_loc_nltk = pd.Series(all_locations_nltk).value_counts().head(20)

In [70]:
print("Top 20 Location:")
top_20_loc_nltk

Top 20 Location:


Sale           1961
Prince          147
Winnipeg        137
New             129
Toronto         118
London          109
India           102
China            92
Cambridge        91
North York       72
British          65
Calgary          63
Taiwan           56
Mississauga      55
Land             50
Kitchener        45
Oakville         42
Innisfil         41
Scarborough      41
Ottawa           37
dtype: int64
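
Several entries in this list are not places on their own: `Sale`, `Prince`, `New`, and `Land` are most likely fragments of longer phrases. One low-effort mitigation is to filter the counts against a small stoplist before ranking. The stoplist below is hypothetical and hand-picked; the counts are copied from the first rows of the output above.

```python
from collections import Counter

# Counts copied from the first six rows of the NLTK output above.
loc_counts = Counter({"Sale": 1961, "Prince": 147, "Winnipeg": 137,
                      "New": 129, "Toronto": 118, "London": 109})

# Hypothetical stoplist of tokens that are not places on their own.
NOT_A_PLACE = {"Sale", "Prince", "New", "Land", "British"}

filtered = Counter({k: v for k, v in loc_counts.items() if k not in NOT_A_PLACE})
print(filtered.most_common(3))
```

A stoplist only hides the symptom. It does not recover the full names (such as New York) that the chunker split apart.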

**News Title - With Sentence Segmentation**

**Using Spacy**

In [112]:
def extract_entities_spacy(text):
    organizations = []
    locations_gpe = []
    
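    # Process each sentence individually.
    # Note: doc.sents needs sentence boundaries, so the pipeline must include a
    # component that sets them (for example the parser or a sentencizer).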
    doc = nlp(text)
    for sent in doc.sents:
        sent_ents = nlp(sent.text).ents
        organizations.extend([ent.text for ent in sent_ents if ent.label_ == "ORG"])
        locations_gpe.extend([ent.text for ent in sent_ents if ent.label_ in ["LOC", "GPE"]])
    
    return organizations, locations_gpe


In [114]:
news_df['organizations_spacy_segment_spacy'], news_df['locations_spacy_segment_spacy'] = zip(*news_df['cleanedTitle'].progress_apply(extract_entities_spacy))

  0%|          | 0/10012 [00:00<?, ?it/s]

In [120]:
all_organizations_segment = [org for orgs in news_df['organizations_spacy_segment_spacy'] for org in orgs]
all_locations_segment = [loc for locs in news_df['locations_spacy_segment_spacy'] for loc in locs]

In [122]:
top_20_org_segment = pd.Series(all_organizations_segment).value_counts().head(20)

In [123]:
top_20_org_segment

Daily Mail                       1078
Ford                              357
Toyota                            204
Chevrolet                         196
Hyundai                           179
Honda                             156
Daily Mail Online                 151
Star News                         135
Automotive News                    98
EV                                 93
BMW                                90
Nissan                             87
Car Dealer Magazine                78
Jeep                               77
Otago Daily Times Online News      67
Dodge                              64
CoventryLive                       61
Mazda                              56
Lexus                              54
Volkswagen                         47
dtype: int64

In [125]:
top_20_loc_segment = pd.Series(all_locations_segment).value_counts().head(20)

In [126]:
top_20_loc_segment

UK             140
US             140
Manitoba       140
Winnipeg       137
Toronto        120
London         101
Cambridge       91
India           79
North York      72
Calgary         63
Taiwan          56
Mississauga     55
Oakville        42
Scarborough     42
Innisfil        41
Ottawa          39
China           37
Australia       36
Russia          33
Hamilton        32
dtype: int64

**Using NLTK**

In [325]:
from tqdm.notebook import tqdm
import nltk

# Download NLTK resources if not already downloaded
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

def extract_orgs_locs_nltk_segment(text):
    organizations = []
    locations = []
    for sent in nltk.sent_tokenize(text):
        for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)), binary=False):
            if hasattr(chunk, 'label'):
                entity = ' '.join(c[0] for c in chunk)  # Combine multi-token entities
                if chunk.label() == 'ORGANIZATION':
                    organizations.append(entity)
                elif chunk.label() == 'GPE':
                    locations.append(entity)
    return organizations, locations



In [326]:

news_df['organizations_nltk_segment'], news_df['locations_nltk_segment'] = zip(*news_df['cleanedTitle'].progress_apply(extract_orgs_locs_nltk_segment))

  0%|          | 0/10012 [00:00<?, ?it/s]

In [327]:
all_organizations_nltk_segment = [org for orgs in news_df['organizations_nltk_segment'] for org in orgs]
all_locations_nltk_segment = [loc for locs in news_df['locations_nltk_segment'] for loc in locs]

In [328]:
top_20_org_segment_nltk = pd.Series(all_organizations_nltk_segment).value_counts().head(20)

In [329]:
print("Top 20 Organizations in Title:")
top_20_org_segment_nltk

Top 20 Organizations in Title:


Star News            174
Daily Mail Online    110
Shropshire Star       92
MercedesBenz          88
Automotive News       85
PHL17com              59
Business Live         51
GMC                   49
SUVs                  45
BMW                   43
NewsBreak             43
UK                    42
Hindu                 41
COVID19               39
CarWale               38
News Driven           36
SUV                   35
Fast Lane Car         34
Auto News             34
Express Star          33
dtype: int64

In [330]:
top_20_loc_segment_nltk = pd.Series(all_locations_nltk_segment).value_counts().head(20)

In [331]:
print("Top 20 Location in Title:")
top_20_loc_segment_nltk

Top 20 Location in Title:


Sale           1961
Prince          147
Winnipeg        137
New             131
Toronto         118
London          109
India           103
China            92
Cambridge        91
North York       72
British          65
Calgary          63
Taiwan           56
Mississauga      55
Land             50
Kitchener        45
Oakville         42
Innisfil         41
Scarborough      41
Ottawa           37
dtype: int64

**Comparison of Entity Counts Between SpaCy and NLTK**

**Organizations**

- SpaCy vs NLTK: difference in counts of organizations identified by SpaCy compared to NLTK.
- SpaCy Segment vs SpaCy: difference in counts of organizations identified by SpaCy with sentence segmentation compared to without segmentation.

**Locations**

- SpaCy vs NLTK: difference in counts of locations identified by SpaCy compared to NLTK.
- SpaCy Segment vs SpaCy: difference in counts of locations identified by SpaCy with sentence segmentation compared to without segmentation.

**Why?**

Comparing entity counts across different extraction methods or configurations helps identify which method performs better for a given dataset. This insight guides decision-making to optimize entity extraction strategies, improving overall data processing and downstream applications.

*Note: I used this comparison because the top-20 counts are almost the same across methods, so per-entity differences show the contrast more clearly.*
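
When the rankings look almost identical, a single overlap score can complement the per-entity differences. The sketch below computes the Jaccard overlap of two top-k lists. The helper name `top_k_overlap` and the toy counts are assumptions for illustration, not part of the analysis.

```python
from collections import Counter

def top_k_overlap(counts_a: Counter, counts_b: Counter, k: int = 20) -> float:
    """Jaccard overlap between the k most common keys of two counters."""
    top_a = {name for name, _ in counts_a.most_common(k)}
    top_b = {name for name, _ in counts_b.most_common(k)}
    union = top_a | top_b
    # Two empty rankings are treated as identical.
    return len(top_a & top_b) / len(union) if union else 1.0

# Toy counts (made up) standing in for two extraction methods.
method_a = Counter({"Ford": 10, "Toyota": 8, "BMW": 5})
method_b = Counter({"Ford": 9, "Toyota": 7, "Sale": 6})
print(top_k_overlap(method_a, method_b, k=3))
```

A score near 1 means the two methods surface the same entities; a low score means they disagree on which entities matter, regardless of the exact counts.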

In [129]:
import pandas as pd
from collections import Counter

# Function to compute the difference in counts
def compute_difference_counts(column1, column2):
    # Flatten the columns and count occurrences
    counts1 = Counter([word for sublist in column1 for word in sublist])
    counts2 = Counter([word for sublist in column2 for word in sublist])
    
    # Get the top 30 words from each column
    top_30_column1 = counts1.most_common(30)
    top_30_column2 = counts2.most_common(30)
    
    # Calculate the difference in counts
    difference_counts = {}
    for word, count in top_30_column1:
        difference_counts[word] = count - counts2.get(word, 0)
    for word, count in top_30_column2:
        if word not in difference_counts:
            difference_counts[word] = counts1.get(word, 0) - count
    
    return difference_counts

# Compute difference in counts for organizations and locations between different columns
difference_counts_org_spacy_vs_nltk = compute_difference_counts(news_df['organizations_spacy_spacy'], news_df['organizations_nltk_new'])
difference_counts_loc_spacy_vs_nltk = compute_difference_counts(news_df['locations_spacy_spacy'], news_df['locations_nltk_new'])
difference_counts_org_spacy_segment_vs_spacy = compute_difference_counts(news_df['organizations_spacy_segment_spacy'], news_df['organizations_spacy_spacy'])
difference_counts_loc_spacy_segment_vs_spacy = compute_difference_counts(news_df['locations_spacy_segment_spacy'], news_df['locations_spacy_spacy'])

# Create DataFrames to store comparison results
org_spacy_vs_nltk_df = pd.DataFrame(difference_counts_org_spacy_vs_nltk.items(), columns=['Organization', 'Difference_Count'])
loc_spacy_vs_nltk_df = pd.DataFrame(difference_counts_loc_spacy_vs_nltk.items(), columns=['Location', 'Difference_Count'])
org_spacy_segment_vs_spacy_df = pd.DataFrame(difference_counts_org_spacy_segment_vs_spacy.items(), columns=['Organization', 'Difference_Count'])
loc_spacy_segment_vs_spacy_df = pd.DataFrame(difference_counts_loc_spacy_segment_vs_spacy.items(), columns=['Location', 'Difference_Count'])

# Display the comparison results
print("SpaCy vs NLTK - Organizations:")
print(org_spacy_vs_nltk_df)
print("\nSpaCy vs NLTK - Locations:")
print(loc_spacy_vs_nltk_df)
print("\nSpaCy Segment vs SpaCy - Organizations:")
print(org_spacy_segment_vs_spacy_df)
print("\nSpaCy Segment vs SpaCy - Locations:")
print(loc_spacy_segment_vs_spacy_df)


SpaCy vs NLTK - Organizations:
                              Organization  Difference_Count
0                               Daily Mail              1068
1                                     Ford               353
2                                   Toyota               200
3                                Chevrolet               195
4                                  Hyundai               178
5                                    Honda               154
6                        Daily Mail Online                41
7                                Star News               -38
8                                   Nissan               108
9                          Automotive News                13
10                                      EV                77
11                                     BMW                48
12                     Car Dealer Magazine                78
13                                   Dodge                76
14                                    Jeep            

**Therefore, I chose SpaCy with sentence segmentation for the news titles.**

**Observations (News Title):**

**Most Frequent Organization: Daily Mail**

**Most Frequent Location: UK (count of 140, tied with US and Manitoba)**

**News Text - Without Sentence Segmentation**

**Using Spacy**

In [131]:
def extract_entities_spacy(text):
    doc = nlp(text)
    organizations = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
    locations = [ent.text for ent in doc.ents if ent.label_ in ["LOC", "GPE"]]
    return organizations, locations

In [356]:
news_df['organizations_spacy'], news_df['locations_spacy'] = zip(*news_df['cleanedText'].progress_apply(extract_entities_spacy))


  0%|          | 0/10012 [00:00<?, ?it/s]

In [357]:
all_organizations_spacy = [org for sublist in news_df['organizations_spacy'] for org in sublist]
all_locations_spacy = [loc for sublist in news_df['locations_spacy'] for loc in sublist]

In [358]:
top_20_org_text = pd.Series(all_organizations_spacy).value_counts().head(20)

In [359]:
print("Top 20 organization:")
top_20_org_text

Top 20 organization:


Ford              6420
Prince Philips    5968
Toyota            5749
Netflix           5434
Hyundai           4229
Honda             4022
EV                3983
Facebook          3583
Amazon            3509
Land Rover        3325
Instagram         3219
Royal             3085
Chevrolet         2908
BMW               2878
Nissan            2540
BBC               2296
Tesla             2188
TikTok            2065
Palace            1991
House             1971
dtype: int64

In [360]:
top_20_loc_text = pd.Series(all_locations_spacy).value_counts().head(20)

In [361]:
print("Top 20 Location:")
top_20_loc_text

Top 20 Location:


LA                20156
US                13124
NYC               12063
Los Angeles       10812
UK                10239
London             9229
New York City      8005
Hollywood          5672
Miami              5345
West Hollywood     5185
Paris              5045
Beverly Hills      4415
New York           4272
California         4182
India              3868
Australia          3785
Malibu             3526
Sydney             3116
Texas              3014
Las Vegas          2913
dtype: int64
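
The list above counts LA and Los Angeles, and NYC and New York City, as separate places. If the goal is to rank places rather than surface forms, aliases can be merged before counting. A minimal sketch, assuming a hand-written alias map (hypothetical) and using counts copied from the output above:

```python
from collections import Counter

# Counts copied from the first rows of the spaCy output above.
loc_counts = Counter({"LA": 20156, "US": 13124, "NYC": 12063,
                      "Los Angeles": 10812, "New York City": 8005})

# Hypothetical alias map: surface form -> canonical name.
ALIASES = {"LA": "Los Angeles", "NYC": "New York City"}

merged = Counter()
for name, count in loc_counts.items():
    merged[ALIASES.get(name, name)] += count  # unmapped names keep their own key

print(merged.most_common(3))
```

The alias map is a judgment call: `LA` could also stand for Louisiana, so merged counts should be read as an approximation.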

**Using NLTK**

In [362]:
def extract_entities_nltk(text):
    entities = []
    labels = []
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)), binary=False):
        if hasattr(chunk, 'label'):
            entities.append(' '.join(c[0] for c in chunk))  # Add space between multi-token entities
            labels.append(chunk.label())

    entities_labels = list(set(zip(entities, labels)))  # unique entities
    organizations = [entity for entity, label in entities_labels if label == "ORGANIZATION"]
    locations = [entity for entity, label in entities_labels if label == "GPE"]
    return organizations, locations

In [363]:
news_df['organizations_nltk_new'], news_df['locations_nltk_new'] = zip(*news_df['cleanedText'].progress_apply(extract_entities_nltk))

  0%|          | 0/10012 [00:00<?, ?it/s]

In [364]:
all_organizations_nltk = [org for sublist in news_df['organizations_nltk_new'] for org in sublist]
all_locations_nltk = [loc for sublist in news_df['locations_nltk_new'] for loc in sublist]

In [365]:
top_20_org_nltk = pd.Series(all_organizations_nltk).value_counts().head(20)

In [366]:
print("Top 20 organization:")
top_20_org_nltk

Top 20 organization:


COVID                          3621
RomeoAston                     2224
UK                             2063
MakeAcuraAlfa                  2030
Daily                          2024
US                             2024
insuranceCar                   1984
vehicleGet                     1963
DealerSite Account             1962
Autopath Technologies Inc      1962
TopMore                        1961
Carpagesca Terms Conditions    1961
MailMail                       1948
siteReader                     1948
PrintsOur                      1948
pageDaily                      1948
MoneyMetroJobsiteMail          1940
usHow                          1919
Associated Newspapers          1913
UsTermsDo                      1905
dtype: int64

In [367]:
top_20_loc_nltk = pd.Series(all_locations_nltk).value_counts().head(20)

In [368]:
print("Top 20 Location:")
top_20_loc_nltk

Top 20 Location:


London           2443
British          2305
New              2173
Sale             2048
Los Angeles      1890
New York         1775
New York City    1648
West             1634
Miami            1619
French           1614
California       1613
American         1608
Malibu           1545
Paris            1533
Australian       1487
Australia        1448
Mexico           1447
Facebook         1432
Covid            1407
China            1307
dtype: int64

**News Text - With Sentence Segmentation**

**Using Spacy**

In [132]:
def extract_entities_spacy(text):
    organizations = []
    locations_gpe = []
    
    # Process each sentence individually
    for sent in sent_tokenize(text):
        doc = nlp(sent)
        organizations.extend([ent.text for ent in doc.ents if ent.label_ == "ORG"])
        locations_gpe.extend([ent.text for ent in doc.ents if ent.label_ in ["LOC", "GPE"]])
    
    return organizations, locations_gpe

In [134]:
def extract_entities_nltk(text):
    entities = []
    labels = []
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)), binary=False):
        if hasattr(chunk, 'label'):
            entities.append(' '.join(c[0] for c in chunk))  # Add space between multi-token entities
            labels.append(chunk.label())
    entities_labels = list(set(zip(entities, labels)))  # unique entities
    organizations = [entity for entity, label in entities_labels if label == "ORGANIZATION"]
    # NLTK's chunker labels locations as LOCATION (not LOC, which is spaCy's label)
    locations_gpe = [entity for entity, label in entities_labels if label in ["GPE", "LOCATION"]]
    return organizations, locations_gpe


In [135]:
news_df['organizations_spacy_segment'], news_df['locations_spacy_segment'] = zip(*news_df['cleanedText'].progress_apply(extract_entities_spacy))

  0%|          | 0/10012 [00:00<?, ?it/s]

In [136]:
all_organizations_spacy_segment = [org for sublist in news_df['organizations_spacy_segment'] for org in sublist]
all_locations_spacy_segment = [loc for sublist in news_df['locations_spacy_segment'] for loc in sublist]


In [137]:
top_20_org_seg = pd.Series(all_organizations_spacy_segment).value_counts().head(20)

In [138]:
print("Top 20 Organization:")
top_20_org_seg

Top 20 Organization:


COVID             8414
MailOnline        8281
KM                4924
Toyota            4461
Britney Spears    4334
Hyundai           3873
Instagram         3811
Honda             3466
Ford              3289
Amazon            3042
Prince Philips    3041
Palace            2825
Trump             2689
EV                2538
Netflix           2443
BMW               2302
BBC               2164
Royal             2147
House             2117
PrintsOur         1948
dtype: int64

In [139]:
top_20_loc_seg = pd.Series(all_locations_spacy_segment).value_counts().head(20)

In [140]:
print("Top 20 Location:")
top_20_loc_seg

Top 20 Location:


LA                18149
US                12392
NYC               10645
UK                10311
Los Angeles       10033
London             9047
New York City      7232
Hollywood          5570
Meghan             5125
Miami              4805
West Hollywood     4631
New York           4405
Beverly Hills      4273
California         4013
India              3974
Australia          3686
Malibu             3655
Paris              3301
Sydney             3093
Mexico             2985
dtype: int64

**Using NLTK**

In [141]:
def extract_orgs_locs_nltk_segment(text):
    organizations = []
    locations = []
    for sent in nltk.sent_tokenize(text):
        for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)), binary=False):
            if hasattr(chunk, 'label'):
                entity = ' '.join(c[0] for c in chunk)  # Combine multi-token entities
                if chunk.label() == 'ORGANIZATION':
                    organizations.append(entity)
                elif chunk.label() == 'GPE':
                    locations.append(entity)
    return organizations, locations

In [142]:
news_df['organizations_nltk_segment'], news_df['locations_nltk_segment'] = zip(*news_df['cleanedText'].progress_apply(extract_orgs_locs_nltk_segment))

  0%|          | 0/10012 [00:00<?, ?it/s]

In [144]:
all_organizations = [org for orgs in news_df['organizations_nltk_segment'] for org in orgs]
all_locations = [loc for locs in news_df['locations_nltk_segment'] for loc in locs]

In [145]:
top_20_org = pd.Series(all_organizations).value_counts().head(20)

In [146]:
print("Top 20 Organizations:")
top_20_org

Top 20 Organizations:


COVID             10570
MailOnline         8255
NYC                8215
VERY               6856
LA                 4973
US                 4422
Duke               4291
UK                 3992
Duchess            3573
THE                2593
Queen              2364
PDA                2337
RomeoAston         2309
SUV                2242
Prince Philips     2214
Princess Diana     2165
MakeAcuraAlfa      2087
insuranceCar       2064
Daily              2050
BBC                2024
dtype: int64

In [147]:
top_20_loc = pd.Series(all_locations).value_counts().head(20)

In [149]:
print("Top 20 Locations:")
top_20_loc

Top 20 Locations:


Los Angeles      9561
London           8177
New York City    7507
British          5848
New York         4967
West             4943
Miami            4290
India            3889
Malibu           3887
California       3426
New              3394
American         3360
Paris            3281
Australia        3045
China            3034
Australian       2987
Sydney           2792
Mexico           2769
Sale             2528
French           2505
dtype: int64

**Observation**

**Most Frequent Organization: Ford**

**Most Frequent Place: LA (Los Angeles)**

In my analysis of the news text, spaCy without sentence segmentation gave better results than the runs that split each article into sentences first. Processing an article as a single document lets spaCy draw on context from neighboring sentences when it labels an entity, which helps when a name is introduced in one sentence and referred to in another. The NLTK lists are noticeably noisier: the top organizations include tokens such as "VERY", "THE", and "SUV", and the top locations include adjectives and fragments such as "British", "West", and "New".
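
The headline above treats "LA" and "Los Angeles" as one place, while the frequency tables count each surface form separately. A minimal sketch of merging such variants before ranking is shown here. The counts are copied from the spaCy location output above; the `ALIASES` table and the `merge_aliases` helper are hypothetical and not part of the original notebook, and which forms to merge is a judgment call (for example, "New York" is left alone because it may refer to the state).

```python
from collections import Counter

# Counts copied from the spaCy (with sentence segmentation) location output above.
location_counts = Counter({
    "LA": 18149,
    "US": 12392,
    "NYC": 10645,
    "UK": 10311,
    "Los Angeles": 10033,
    "London": 9047,
    "New York City": 7232,
})

# Hypothetical alias table: maps a short form to the name it should be counted under.
ALIASES = {"LA": "Los Angeles", "NYC": "New York City"}


def merge_aliases(counts, aliases):
    """Add the count of every alias to its canonical name."""
    merged = Counter()
    for name, n in counts.items():
        merged[aliases.get(name, name)] += n
    return merged


merged_counts = merge_aliases(location_counts, ALIASES)
for name, n in merged_counts.most_common(3):
    print(f"{name:<15}{n:>7}")
```

With these inputs, Los Angeles (28,182) and New York City (17,877) move ahead of US (12,392), so the ranking depends on whether abbreviations are merged.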

#### Read Tweets data

In [181]:
tweets_path = 'https://storage.googleapis.com/msca-bdp-data-open/tweets/nlp_a_5_tweets.json'
tweets_df = pd.read_json(tweets_path, orient='records', lines=True)
print(f'Sample contains {tweets_df.shape[0]:,.0f} tweets')
tweets_df.head(2)

Sample contains 10,105 tweets


Unnamed: 0,id,lang,date,name,retweeted,text
0,1534565117614084096,en,2022-06-08,Low Orbit Tourist 🌍📷,,"Body &amp; Assembly - Halewood - United Kingdom\n🌍53.3504,-2.8352296,402m\n\nHalewood Body &amp; Assembly is a Jaguar Land Rover factory in Halewood, England, and forms the major part of the Halewood complex which is shared with Ford who manufacture transmissions at the site. [Wikipedia] https://t.co/LPmCnZIaVt"
1,1534565743429394439,en,2022-06-08,CompleteCar.ie,RT,"Land Rover Ireland has announced that the new Range Rover Sport starts at €114,150, now on @completecar:\n\nhttps://t.co/TjGUkL3FYr https://t.co/QdVaEiJkjO"


In [182]:
pip install emoji beautifulsoup4

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [183]:
import warnings
warnings.simplefilter("ignore")

In [184]:
import re

def cleanTweets(text):
    # Remove mentions, then every character that is not a letter or whitespace
    # (digits, punctuation, emojis, and the '#' of hashtags)
    text = re.sub(r'@[A-Za-z0-9_]+|[^a-zA-Z\s]', '', text)

    # Note: the substitution above has already stripped '#', ':', '/', and '.',
    # so the two steps below no longer match anything; links survive as tokens
    # such as 'httpstcoLPmCnZIaVt' in the cleaned text
    text = text.replace('#', '')
    text = re.sub(r'(https?://\S+)|(www\.\S+)', '', text)

    # Replace newline characters with spaces
    text = text.replace('\n', ' ')

    return text

tweets_df['cleanedTweets'] = tweets_df['text'].apply(cleanTweets)

In [185]:
tweets_df.head(3)

Unnamed: 0,id,lang,date,name,retweeted,text,cleanedTweets
0,1534565117614084096,en,2022-06-08,Low Orbit Tourist 🌍📷,,"Body &amp; Assembly - Halewood - United Kingdom\n🌍53.3504,-2.8352296,402m\n\nHalewood Body &amp; Assembly is a Jaguar Land Rover factory in Halewood, England, and forms the major part of the Halewood complex which is shared with Ford who manufacture transmissions at the site. [Wikipedia] https://t.co/LPmCnZIaVt",Body amp Assembly Halewood United Kingdom m Halewood Body amp Assembly is a Jaguar Land Rover factory in Halewood England and forms the major part of the Halewood complex which is shared with Ford who manufacture transmissions at the site Wikipedia httpstcoLPmCnZIaVt
1,1534565743429394439,en,2022-06-08,CompleteCar.ie,RT,"Land Rover Ireland has announced that the new Range Rover Sport starts at €114,150, now on @completecar:\n\nhttps://t.co/TjGUkL3FYr https://t.co/QdVaEiJkjO",Land Rover Ireland has announced that the new Range Rover Sport starts at now on httpstcoTjGUkLFYr httpstcoQdVaEiJkjO
2,1529341557580652545,en,2022-05-25,Exmoor Trim,,New Land Rover Range Rover Hits Top Speed With Ease On Autobahn\n\nhttps://t.co/19QOgAIu3v,New Land Rover Range Rover Hits Top Speed With Ease On Autobahn httpstcoQOgAIuv
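
The `cleanedTweets` column above still contains tokens such as `httpstcoLPmCnZIaVt` and `amp`, because punctuation is stripped before the URL pattern runs and HTML entities are never decoded. A minimal sketch of a reordered cleaner follows. The name `clean_tweet_ordered` and the compiled pattern names are hypothetical, and all results recorded in this notebook were produced with the original `cleanTweets`, not with this variant.

```python
import html
import re

URL_RE = re.compile(r'https?://\S+|www\.\S+')
MENTION_RE = re.compile(r'@[A-Za-z0-9_]+')
NON_ALPHA_RE = re.compile(r'[^a-zA-Z\s]')
WHITESPACE_RE = re.compile(r'\s+')


def clean_tweet_ordered(text):
    """Hypothetical variant of cleanTweets that removes URLs before punctuation."""
    text = html.unescape(text)           # decode entities such as "&amp;" first
    text = URL_RE.sub(' ', text)         # drop links while '://' and '.' still exist
    text = MENTION_RE.sub('', text)      # drop @mentions while '@' still exists
    text = NON_ALPHA_RE.sub('', text)    # then drop digits, punctuation, emojis
    return WHITESPACE_RE.sub(' ', text).strip()  # collapse newlines and repeated spaces


sample = ("New Land Rover Range Rover Hits Top Speed With Ease On Autobahn"
          "\n\nhttps://t.co/19QOgAIu3v")
print(clean_tweet_ordered(sample))
# New Land Rover Range Rover Hits Top Speed With Ease On Autobahn
```

Like the original, this sketch deletes non-letter characters outright, so "India's" still becomes "Indias"; only the order of the steps and the entity decoding differ.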


**Tweets - Without Sentence Segmentation**

**Using spaCy**

In [232]:
import pandas as pd
from tqdm.notebook import tqdm
import spacy
from nltk import sent_tokenize, word_tokenize
import nltk
tqdm.pandas()

In [233]:
nlp = spacy.load("en_core_web_md")
nlp.select_pipes(enable=['ner'])

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer']

In [234]:
def extract_entities_spacy(text):
    doc = nlp(text)
    organizations = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
    locations_gpe = [ent.text for ent in doc.ents if ent.label_ in ["LOC", "GPE"]]
    return organizations, locations_gpe

In [235]:
tweets_df['organizations_spacy'], tweets_df['locations_spacy'] = zip(*tweets_df['cleanedTweets'].progress_apply(extract_entities_spacy))

  0%|          | 0/10105 [00:00<?, ?it/s]

In [236]:
top_organizations_spacy_tweets = tweets_df['organizations_spacy'].explode().value_counts().head(20)
print("Top 20 Organizations in Tweets (SpaCy):")
top_organizations_spacy_tweets.head(20)

Top 20 Organizations in Tweets (SpaCy):


Land Rover                             1855
Jaguar Land Rover                       822
eBay                                    412
Land Rover Defender                     359
Audi Jaguar Land                        290
General Motors                          284
Jaguar                                  121
Ford                                    112
Volvo                                   106
Land Rover Range                         98
SHAMELESS                                93
Land Rovers                              80
TEKNOOFFICIAL                            77
Sikosana                                 76
BaT Auctions                             72
Rover                                    72
BMW                                      66
Land Rover Range Rover                   66
the SHAMELESS Health Services Board      64
Defender                                 64
Name: organizations_spacy, dtype: int64

In [237]:
top_locations_spacy_tweets = tweets_df['locations_spacy'].explode().value_counts().head(20)
print("\nTop 20 Locations in Tweets (SpaCy):")
top_locations_spacy_tweets


Top 20 Locations in Tweets (SpaCy):


Russia               471
UK                   353
httpstcocsoUdvcHc    190
NigelAndArron        190
Zimbabwe              87
Cambridge             85
India                 64
Hagley                62
Jamaica               60
LandRover             51
London                48
Sussex                43
weekChase             40
US                    38
China                 35
New Zealand           25
France                25
Ukraine               24
Indias                24
Hollywood             24
Name: locations_spacy, dtype: int64

**Using NLTK**

In [238]:
def extract_entities_nltk(text):
    entities = []
    labels = []
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)), binary=False):
        if hasattr(chunk, 'label'):
            entities.append(' '.join(c[0] for c in chunk))  # Add space between multi-token entities
            labels.append(chunk.label())

    entities_labels = list(set(zip(entities, labels)))  # unique entities
    organizations = [entity for entity, label in entities_labels if label == "ORGANIZATION"]
    locations = [entity for entity, label in entities_labels if label == "GPE"]
    return organizations, locations


In [239]:
tweets_df['organizations_nltk_new'], tweets_df['locations_nltk_new'] = zip(*tweets_df['cleanedTweets'].progress_apply(extract_entities_nltk))

  0%|          | 0/10105 [00:00<?, ?it/s]

In [240]:
all_organizations = [org for orgs in tweets_df['organizations_nltk_new'] for org in orgs]
all_locations = [loc for locs in tweets_df['locations_nltk_new'] for loc in locs]

In [241]:
top_20_org = pd.Series(all_organizations).value_counts().head(20)
top_20_loc = pd.Series(all_locations).value_counts().head(20)

In [242]:
print("Top 20 Organizations:")
top_20_org

Top 20 Organizations:


Land Rover                         1371
Land                                817
eBay                                478
Land Rover Discovery                371
General Motors                      283
httpstcocsoUdvcHc                   190
NigelAndArron                       190
LAND                                171
Duke                                163
Duchess                             146
UK                                  145
ROVER                               135
Jaguar Land Rover                   124
Jaguar Land                         123
SUV                                 118
Rover                                77
SHAMELESS                            75
InvictusGames                        71
SHAMELESS Health Services Board      64
UKmfg                                64
dtype: int64

In [243]:
print("\nTop 20 Locations:")
top_20_loc


Top 20 Locations:


Land           1774
Russia          436
British         184
LAND            110
Sussex           96
Russian          83
New              78
Zimbabwe         75
Car              69
Cambridge        68
Paracetamol      64
Hagley           62
Meghan           55
India            51
Jamaica          50
Prince           50
Indian           49
Kenyan           46
Meet             39
London           38
dtype: int64

**Tweets - With Sentence Segmentation**

**Using spaCy**

In [244]:
# Sentence-level variant; named separately so it does not overwrite
# extract_entities_spacy defined earlier
def extract_entities_spacy_segment(text):
    organizations = []
    locations_gpe = []

    # Run the NER model on each sentence separately
    for sent in sent_tokenize(text):
        doc = nlp(sent)
        organizations.extend([ent.text for ent in doc.ents if ent.label_ == "ORG"])
        locations_gpe.extend([ent.text for ent in doc.ents if ent.label_ in ["LOC", "GPE"]])

    return organizations, locations_gpe

In [245]:
tweets_df['organizations_spacy_segment_spacy'], tweets_df['locations_spacy_segment_spacy'] = zip(*tweets_df['cleanedTweets'].progress_apply(extract_entities_spacy_segment))

  0%|          | 0/10105 [00:00<?, ?it/s]

In [246]:
all_organizations_spacy_segment = [org for sublist in tweets_df['organizations_spacy_segment_spacy'] for org in sublist]
all_locations_spacy_segment = [loc for sublist in tweets_df['locations_spacy_segment_spacy'] for loc in sublist]


In [247]:
top_20_org_seg = pd.Series(all_organizations_spacy_segment).value_counts().head(20)

In [248]:
print("Top 20 Organizations:")
top_20_org_seg

Top 20 Organizations:


Land Rover                             1853
Jaguar Land Rover                       822
eBay                                    412
Land Rover Defender                     359
Audi Jaguar Land                        290
General Motors                          284
Jaguar                                  121
Ford                                    112
Volvo                                   106
Land Rover Range                         98
SHAMELESS                                93
Land Rovers                              80
TEKNOOFFICIAL                            77
Sikosana                                 76
BaT Auctions                             72
Rover                                    72
Land Rover Range Rover                   66
BMW                                      66
Defender                                 64
the SHAMELESS Health Services Board      64
dtype: int64

In [249]:
top_20_loc_seg = pd.Series(all_locations_spacy_segment).value_counts().head(20)

In [250]:
print("Top 20 Locations:")
top_20_loc_seg

Top 20 Locations:


Russia               471
UK                   353
NigelAndArron        190
httpstcocsoUdvcHc    190
Zimbabwe              87
Cambridge             85
India                 64
Hagley                62
Jamaica               60
LandRover             51
London                48
Sussex                43
weekChase             40
US                    38
China                 35
France                25
New Zealand           25
Hollywood             24
Indias                24
Ukraine               24
dtype: int64

**Using NLTK**

In [251]:
from tqdm.notebook import tqdm
import nltk

# Download NLTK resources if not already downloaded
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

def extract_orgs_locs_nltk_segment(text):
    organizations = []
    locations = []
    for sent in nltk.sent_tokenize(text):
        for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)), binary=False):
            if hasattr(chunk, 'label'):
                entity = ' '.join(c[0] for c in chunk)  # Combine multi-token entities
                if chunk.label() == 'ORGANIZATION':
                    organizations.append(entity)
                elif chunk.label() == 'GPE':
                    locations.append(entity)
    return organizations, locations



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\MANOJ\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\MANOJ\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\MANOJ\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\MANOJ\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


In [252]:
tweets_df['organizations_nltk_segment'], tweets_df['locations_nltk_segment'] = zip(*tweets_df['cleanedTweets'].progress_apply(extract_orgs_locs_nltk_segment))


  0%|          | 0/10105 [00:00<?, ?it/s]

In [253]:
all_organizations = [org for orgs in tweets_df['organizations_nltk_segment'] for org in orgs]
all_locations = [loc for locs in tweets_df['locations_nltk_segment'] for loc in locs]

In [254]:
top_20_org = pd.Series(all_organizations).value_counts().head(20)

In [255]:
print("\nTop 20 Organizations:")
top_20_org


Top 20 Organizations:


Land Rover                         1385
Land                                822
eBay                                478
Land Rover Discovery                371
General Motors                      283
NigelAndArron                       190
httpstcocsoUdvcHc                   190
LAND                                178
Duke                                177
SUV                                 170
Duchess                             146
UK                                  146
ROVER                               135
Jaguar Land Rover                   124
Jaguar Land                         123
SHAMELESS                            93
Rover                                77
InvictusGames                        71
SHAMELESS Health Services Board      64
httpstcohvbiNbZFs                    64
dtype: int64

In [256]:
top_20_loc = pd.Series(all_locations).value_counts().head(20)

In [257]:
print("\nTop 20 Locations:")
top_20_loc


Top 20 Locations:


Land           1785
Russia          464
British         188
LAND            110
Sussex          109
Zimbabwe         86
Russian          84
New              78
Car              69
Cambridge        68
Paracetamol      64
India            63
Hagley           62
Meghan           55
Indian           53
Jamaica          51
Prince           50
Kenyan           46
Meet             39
London           38
dtype: int64

**Observation**

**Most Frequent Organization: Land Rover**

**Most Frequent Location: Russia**

In my analysis of the tweet data, spaCy with sentence segmentation is the approach I prefer. spaCy's pre-trained model keeps multi-word names intact more often: with segmentation it finds "Jaguar Land Rover" 822 times, against 124 for NLTK, while the NLTK chunker reports fragments such as "Land", "LAND", and "ROVER" as organizations or locations. On this dataset the two spaCy runs are almost identical (Land Rover is counted 1,853 times with segmentation and 1,855 times without), probably because the cleaning step removes sentence-ending punctuation and leaves the sentence tokenizer little to split on. Both spaCy location lists also contain residue such as "httpstcocsoUdvcHc", which comes from links that survived cleaning.
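
One way to make the comparison between methods concrete is to measure how much two ranked lists overlap. The sketch below does this for the top five organization names, copied from the tweet outputs above. The helper `top_k_overlap` is hypothetical and not part of the original notebook, and Jaccard similarity is only one of several reasonable measures.

```python
def top_k_overlap(first, second):
    """Return (shared names, Jaccard similarity) for two lists of entity names."""
    a, b = set(first), set(second)
    shared = a & b
    return shared, len(shared) / len(a | b)


# Top-5 organization names copied from the tweet outputs above.
spacy_no_seg = ["Land Rover", "Jaguar Land Rover", "eBay",
                "Land Rover Defender", "Audi Jaguar Land"]
spacy_seg = ["Land Rover", "Jaguar Land Rover", "eBay",
             "Land Rover Defender", "Audi Jaguar Land"]
nltk_seg = ["Land Rover", "Land", "eBay",
            "Land Rover Discovery", "General Motors"]

_, sim_spacy = top_k_overlap(spacy_no_seg, spacy_seg)
shared, sim_nltk = top_k_overlap(spacy_seg, nltk_seg)

print(sim_spacy)                          # 1.0
print(sorted(shared), round(sim_nltk, 2))  # ['Land Rover', 'eBay'] 0.25
```

The two spaCy runs agree completely on their top five, while spaCy and NLTK share only two of eight distinct names, which matches the qualitative reading of the tables.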




