<a href="https://colab.research.google.com/github/2002hk/NLP/blob/main/Named_entity_and_sentiment_analysis_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np


In [2]:
!pip install spacy newspaper3k plotly dash transformers



#### Pipeline Overview
The pipeline runs in five steps:

- Scrape multiple news articles.
- Perform NER using spaCy.
- Perform sentiment analysis using a pre-trained model from Hugging Face.
- Aggregate entities and analyze the frequency.
- Visualize results using Dash.


In [3]:
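from newspaper import Article

def scrape_multiple_articles(urls):
  """Download and parse each URL; return a DataFrame with title and text."""
  articles_data=[]
  for url in urls:
    article=Article(url)
    article.download()  # fetch the raw HTML
    article.parse()     # extract title and body text
    articles_data.append({'title':article.title,'text':article.text})
  return pd.DataFrame(articles_data)  # pd is imported in the first cell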

urls=[
    'https://www.bbc.com/news/world-us-canada-55353178',
    'https://edition.cnn.com/videos/world/2020/12/25/lookback-2020-world-news-review-ward-pkg-intl-hnk-vpx.cnn',
    'https://www.bbc.com/news/articles/c93pdlg4dlno',
    'https://www.bbc.com/news/world-54337098',
    'https://www.cnn.com/2024/10/08/politics/bob-woodward-book-war-joe-biden-putin-netanyahu-trump/index.html',
    'https://www.cnn.com/politics'
]

new_df=scrape_multiple_articles(urls)



In [4]:
new_df.head()

Unnamed: 0,title,text
0,The year 2020: A time when everything changed,The year 2020: A time when everything changed\...
1,The major events that changed the world in 2020,1. How relevant is this ad to you?\n\nVideo pl...
2,Trump 'resorted to crimes' to overturn 2020 re...,Trump 'resorted to crimes' to overturn 2020 el...
3,Covid-19: Milestones of the global pandemic,Covid-19: Milestones of the global pandemic\n\...
4,‘That son of a bitch’: New Woodward book revea...,Editor’s Note: The story below contains explic...
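
The scraping loop above lets an exception from any single URL abort the whole run, and it keeps pages that yield no usable text. A minimal sketch of a more forgiving loop follows. It assumes a hypothetical `fetch` callable (not part of the original notebook) that would wrap the `Article` download and parse steps and return a dict with `title` and `text`; `fake_fetch` is a stand-in used only to show the behaviour without network access.

```python
def scrape_safely(urls, fetch):
    """Collect articles, skipping URLs that fail or return no text.

    `fetch` is a hypothetical callable: url -> {'title': str, 'text': str}.
    In the notebook it would wrap Article(url).download() and .parse().
    """
    rows, failures = [], []
    for url in urls:
        try:
            data = fetch(url)
        except Exception as exc:  # network or parsing errors
            failures.append((url, repr(exc)))
            continue
        if not data.get('text', '').strip():
            failures.append((url, 'empty text'))
            continue
        rows.append({'url': url, **data})
    return rows, failures


# Stand-in fetcher for illustration only; it does no network access.
def fake_fetch(url):
    if 'bad' in url:
        raise ValueError('download failed')
    if 'empty' in url:
        return {'title': 'Empty', 'text': '   '}
    return {'title': 'Title for ' + url, 'text': 'Body text'}


rows, failures = scrape_safely(['ok-1', 'bad-1', 'empty-1'], fake_fetch)
print(len(rows), len(failures))
```

Keeping the failed URLs in a separate list makes it easy to see which sources need a different scraping approach.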


In [5]:
# Perform named entity recognition
import spacy
nlp=spacy.load('en_core_web_sm')
def perform_ner(text):
  doc=nlp(text)
  entities=[(ent.text,ent.label_) for ent in doc.ents]
  return entities

new_df['entities']=new_df['text'].apply(perform_ner)



In [6]:
new_df[['title', 'entities']]

Unnamed: 0,title,entities
0,The year 2020: A time when everything changed,"[(The year 2020, DATE), (BBC, ORG), (2020, DAT..."
1,The major events that changed the world in 2020,"[(1, CARDINAL), (Audio, PERSON)]"
2,Trump 'resorted to crimes' to overturn 2020 re...,"[(Trump, ORG), (2020, DATE), (Donald Trump, PE..."
3,Covid-19: Milestones of the global pandemic,"[(Reuters, ORG), (last year, DATE), (China, GP..."
4,‘That son of a bitch’: New Woodward book revea...,"[(CNN, ORG), (Bob Woodward, PERSON), (Joe Bide..."
5,CNN Politics,"[(1, CARDINAL), (Audio, PERSON)]"
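
The entity lists above mix labels such as `DATE` and `CARDINAL` with names of people, organisations, and places. If only the latter are of interest, one option is to filter by label before any counting. The sketch below assumes a hypothetical `KEEP_LABELS` set chosen for illustration; the notebook itself does not specify which labels to keep.

```python
# Labels treated as "named" entities here; this set is an assumption
# chosen for illustration, not something the notebook specifies.
KEEP_LABELS = {'PERSON', 'ORG', 'GPE'}


def filter_entities(entities, keep_labels=KEEP_LABELS):
    """Keep only (text, label) pairs whose label is in keep_labels."""
    return [(text, label) for text, label in entities if label in keep_labels]


sample = [('The year 2020', 'DATE'), ('BBC', 'ORG'), ('1', 'CARDINAL'),
          ('China', 'GPE'), ('Bob Woodward', 'PERSON')]
filtered = filter_entities(sample)
print(filtered)
# [('BBC', 'ORG'), ('China', 'GPE'), ('Bob Woodward', 'PERSON')]
```

In the notebook this could be applied with `new_df['entities'].apply(filter_entities)`.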


In [7]:
# Perform sentiment analysis
from transformers import pipeline

# Load pre-trained sentiment analysis model
sentiment_analyzer = pipeline('sentiment-analysis')

def analyze_sentiment(text):
    result = sentiment_analyzer(text[:512])[0]  # Score only the first 512 characters of the article
    return result['label'], result['score']

# Add sentiment to the DataFrame
new_df['sentiment'] = new_df['text'].apply(analyze_sentiment)
# print(new_df[['title', 'sentiment']])


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [8]:
new_df[['title', 'sentiment']]

Unnamed: 0,title,sentiment
0,The year 2020: A time when everything changed,"(NEGATIVE, 0.9962599277496338)"
1,The major events that changed the world in 2020,"(NEGATIVE, 0.9981604218482971)"
2,Trump 'resorted to crimes' to overturn 2020 re...,"(NEGATIVE, 0.998304009437561)"
3,Covid-19: Milestones of the global pandemic,"(NEGATIVE, 0.9516258835792542)"
4,‘That son of a bitch’: New Woodward book revea...,"(POSITIVE, 0.9973061084747314)"
5,CNN Politics,"(NEGATIVE, 0.9981604218482971)"
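
Because `analyze_sentiment` scores only the first 512 characters, each label above reflects the opening of the article rather than the whole text. One possible alternative is to split the text into fixed-size chunks, score each chunk, and average the signed scores. The sketch below assumes a hypothetical `score_chunk` callable that returns `(label, score)` in the same shape as `analyze_sentiment`; the signed averaging and the `NEUTRAL` result for empty text are illustrative choices, not part of the `transformers` API. `fake_scorer` is a stand-in so the example runs without a model.

```python
def chunk_text(text, size=512):
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def overall_sentiment(text, score_chunk, size=512):
    """Average signed chunk scores: POSITIVE counts as +score, else -score.

    `score_chunk` is a hypothetical callable: str -> (label, score).
    """
    chunks = chunk_text(text, size)
    if not chunks:
        return 'NEUTRAL', 0.0
    signed = []
    for chunk in chunks:
        label, score = score_chunk(chunk)
        signed.append(score if label == 'POSITIVE' else -score)
    mean = sum(signed) / len(signed)
    return ('POSITIVE' if mean >= 0 else 'NEGATIVE'), abs(mean)


# Stand-in scorer for illustration only; a real one would call the model.
def fake_scorer(chunk):
    return ('POSITIVE', 1.0) if 'good' in chunk else ('NEGATIVE', 0.5)


label, strength = overall_sentiment('good' * 2 + 'bad' * 2, fake_scorer, size=8)
print(label, strength)
```

Character-based chunks can cut words and sentences in half, so splitting on sentence boundaries may give cleaner inputs at the cost of a little more code.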


In [9]:
# Count entity frequencies across all articles
from collections import Counter

def aggregate_entities(new_df):
    entity_counter = Counter()
    for entities in new_df['entities']:
        entity_counter.update([ent[0] for ent in entities])
    return entity_counter.most_common(10)

# Get the most frequent entities
most_common_entities = aggregate_entities(new_df)
print(most_common_entities)

[('Biden', 58), ('Trump', 57), ('Woodward', 54), ('Putin', 28), ('US', 22), ('2020', 19), ('China', 16), ('Netanyahu', 14), ('one', 13), ('Israel', 12)]
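
The counts above are total mentions, so a single long article can dominate the ranking. A complementary view is document frequency: the number of articles in which an entity appears at least once. The sketch below is one way to compute it; the function name `document_frequency` and the sample data are hypothetical and used only for illustration.

```python
from collections import Counter


def document_frequency(entity_lists, top_n=10):
    """Count in how many articles each entity text appears at least once."""
    counter = Counter()
    for entities in entity_lists:
        # A set removes repeated mentions within the same article.
        counter.update({text for text, _label in entities})
    return counter.most_common(top_n)


sample = [
    [('Biden', 'PERSON'), ('Biden', 'PERSON'), ('Putin', 'PERSON')],
    [('Biden', 'PERSON'), ('China', 'GPE')],
    [('China', 'GPE'), ('China', 'GPE')],
]
df_counts = dict(document_frequency(sample))
print(df_counts)
```

In the notebook this could be called as `document_frequency(new_df['entities'])`.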


In [10]:
# Interactive Visualization using Dash
import dash
from dash import dcc, html
import plotly.express as px

# Prepare data for plotting
entity_names, entity_counts = zip(*most_common_entities)

# Create a bar chart for entity frequency
fig = px.bar(x=entity_names, y=entity_counts, labels={'x': 'Entity', 'y': 'Count'}, title='Top Entities in News Articles')

# Initialize Dash app
app = dash.Dash(__name__)

app.layout = html.Div(children=[
    html.H1(children='NER & Sentiment Analysis Dashboard'),

    html.H2(children='Entity Frequency'),
    dcc.Graph(
        id='entity-frequency',
        figure=fig
    ),

    html.H2(children='News Article Sentiments'),
    dcc.Graph(
        id='sentiment-analysis',
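        # 'sentiment' holds (label, score) tuples, so plot only the label
        figure=px.histogram(x=new_df['sentiment'].str[0].tolist(), labels={'x': 'Sentiment'}, title="Sentiment Analysis")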
    )
])

# Run the Dash app
if __name__ == '__main__':
    app.run(debug=True)  # app.run replaces the deprecated app.run_server

