<a href="https://colab.research.google.com/github/Amy-Elena/LiveNewsUpdate-WebData-Analysis-Python/blob/main/cnn_web_scrape.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [23]:
# import library
from bs4 import BeautifulSoup
import requests

In [24]:
# URL of the news website
url = 'https://edition.cnn.com/europe/live-news/russia-ukraine-war-news-06-26-23/index.html'

In [25]:
# Send an HTTP GET request; fail fast on a timeout or an HTTP error status
response = requests.get(url, timeout=10)
response.raise_for_status()

In [26]:
# Parse the HTML content with BeautifulSoup
response = BeautifulSoup(response.content, "html.parser")

print(response)

<!DOCTYPE html>
<html lang="en"><head><meta charset="utf-8"/><meta content="ie=edge" http-equiv="x-ua-compatible"/><meta content="width=device-width, initial-scale=1" name="viewport"/><link href="/cnn-live-story/static/favicon.ico" rel="shortcut icon" type="image/x-icon"/><title data-rh="true">June 26, 2023 - Russia-Ukraine, Wagner rebellion news</title><meta charset="utf-8" data-rh="true"/><meta content="width=device-width, initial-scale=1" data-rh="true" name="viewport"/><meta content="europe" data-rh="true" name="section"/><meta content='By &lt;a href="/profiles/kathleen-magramo"&gt;Kathleen Magramo&lt;/a&gt;, Christian Edwards, &lt;a href="/profiles/aditi-sandal"&gt;Aditi Sangal&lt;/a&gt;, Mike Hayes, &lt;a href="/profiles/maureen-chowdhury"&gt;Maureen Chowdhury&lt;/a&gt; and &lt;a href="/profiles/amir-vera"&gt;Amir Vera&lt;/a&gt;, CNN' data-rh="true" name="author"/><meta content="europe, June 26, 2023 - Russia-Ukraine, Wagner rebellion news" data-rh="true" name="keywords"/><meta c

In [27]:
print(response.prettify())  # This prints the parsed HTML for debugging purposes (more readable format)

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="ie=edge" http-equiv="x-ua-compatible"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <link href="/cnn-live-story/static/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
  <title data-rh="true">
   June 26, 2023 - Russia-Ukraine, Wagner rebellion news
  </title>
  <meta charset="utf-8" data-rh="true"/>
  <meta content="width=device-width, initial-scale=1" data-rh="true" name="viewport"/>
  <meta content="europe" data-rh="true" name="section"/>
  <meta content='By &lt;a href="/profiles/kathleen-magramo"&gt;Kathleen Magramo&lt;/a&gt;, Christian Edwards, &lt;a href="/profiles/aditi-sandal"&gt;Aditi Sangal&lt;/a&gt;, Mike Hayes, &lt;a href="/profiles/maureen-chowdhury"&gt;Maureen Chowdhury&lt;/a&gt; and &lt;a href="/profiles/amir-vera"&gt;Amir Vera&lt;/a&gt;, CNN' data-rh="true" name="author"/>
  <meta content="europe, June 26, 2023 - Russia-Ukraine, Wagner rebellion news" 

The HTML output above suggests that the article data is embedded in the page in a structured **JSON-LD format**.  
**JSON-LD** is a way to encode Linked Data using JSON. In this case, each article's information is enclosed in a JSON object within a script tag whose type is application/ld+json.  
To scrape this data, locate these script tags and parse the JSON they contain.
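
As a rough illustration of that structure, the sketch below pulls a JSON-LD block out of a tiny, made-up HTML page. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs without third-party packages; `SAMPLE_HTML` and `JsonLdCollector` are hypothetical names used only for this example.

```python
import json
from html.parser import HTMLParser

# Hypothetical miniature page; the real CNN page is far larger.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">{"@type": "NewsArticle", "headline": "Example headline"}</script>
<script type="text/javascript">var x = 1;</script>
</head><body></body></html>
"""


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # Only JSON-LD script tags are of interest; other scripts are ignored.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self.blocks.append("")

    def handle_data(self, data):
        # Script content may arrive in several chunks, so append to the current block.
        if self._inside:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False


collector = JsonLdCollector()
collector.feed(SAMPLE_HTML)
articles = [json.loads(block) for block in collector.blocks]
print(articles[0]["headline"])
```

Only the script tag with the JSON-LD type is collected; the ordinary JavaScript tag is skipped.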

---

In [28]:
import json

# Find all <script> tags with type="application/ld+json"

script_tags = response.find_all("script", type="application/ld+json")

print("Number of script tags found:", len(script_tags))

Number of script tags found: 3


**Inspect the JSON-LD Data:** Manually inspect the contents of the script_tags list to see if the JSON-LD data is being correctly extracted. You can add a print statement to check:

In [29]:
for script_tag in script_tags:
    print(script_tag.string)

{"@context":"http://schema.org/","@type":"NewsArticle","mainEntityOfPage":{"@type":"WebPage","@id":"https://google.com/article"},"description":"Russian President Vladimir Putin said Monday night that \"the armed rebellion would have been suppressed anyway,\" referring to the insurrection launched by the Wagner Group over the weekend.","publisher":{"@type":"Organization","name":"CNN","logo":{"@type":"ImageObject","url":"https://dynaimage.cdn.cnn.com/cnn/q_auto,h_60/%2F%2Fcdn.cnn.com%2Fcnn%2F.e%2Fimg%2F4.0%2Flogos%2Fcnn_logo_social.jpg"}},"url":"https://www.cnn.com/europe/live-news/russia-ukraine-war-news-06-26-23/index.html","headline":"June 26, 2023 - Russia-Ukraine, Wagner rebellion news","image":{"@type":"ImageObject","url":"https://cdn.cnn.com/cnnnext/dam/assets/230625112553-prigozhin-putin-split-super-tease.jpg","height":"1100","width":"619"},"datePublished":"2023-06-26T04:00:26Z","dateModified":"2023-06-27T05:16:37Z","author":{"@type":"Person","name":"By <a href=\"/profiles/kathle

**Print JSON Data:** If the JSON data extraction is successful, print the extracted json_data for each article within the loop:

In [30]:
for script_tag in script_tags:
    try:
        json_data = json.loads(script_tag.string)
        print(json_data)  # Print the extracted JSON data
        # Rest of the extraction code
    except json.JSONDecodeError as e:
        print("Error decoding JSON:", e)

{'@context': 'http://schema.org/', '@type': 'NewsArticle', 'mainEntityOfPage': {'@type': 'WebPage', '@id': 'https://google.com/article'}, 'description': 'Russian President Vladimir Putin said Monday night that "the armed rebellion would have been suppressed anyway," referring to the insurrection launched by the Wagner Group over the weekend.', 'publisher': {'@type': 'Organization', 'name': 'CNN', 'logo': {'@type': 'ImageObject', 'url': 'https://dynaimage.cdn.cnn.com/cnn/q_auto,h_60/%2F%2Fcdn.cnn.com%2Fcnn%2F.e%2Fimg%2F4.0%2Flogos%2Fcnn_logo_social.jpg'}}, 'url': 'https://www.cnn.com/europe/live-news/russia-ukraine-war-news-06-26-23/index.html', 'headline': 'June 26, 2023 - Russia-Ukraine, Wagner rebellion news', 'image': {'@type': 'ImageObject', 'url': 'https://cdn.cnn.com/cnnnext/dam/assets/230625112553-prigozhin-putin-split-super-tease.jpg', 'height': '1100', 'width': '619'}, 'datePublished': '2023-06-26T04:00:26Z', 'dateModified': '2023-06-27T05:16:37Z', 'author': {'@type': 'Person'

In [31]:
# Process and save JSON-LD data into a text file
with open("json_ld_data.txt", "w") as file:
    for script_tag in script_tags:
        try:
            json_data = json.loads(script_tag.string)
            json_str = json.dumps(json_data, indent=4)  # Pretty-print JSON
            file.write(json_str + "\n")  # Write JSON data to the file
        except json.JSONDecodeError as e:
            print("Error decoding JSON:", e)

print("JSON-LD data saved to json_ld_data.txt")

JSON-LD data saved to json_ld_data.txt
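
Several pretty-printed objects written back to back do not form one valid JSON document, so the resulting file cannot be read with a single `json.loads` call. One possible alternative, sketched below under assumptions, is to collect the parsed objects in a list and write that list once. The `raw_blocks` strings are hypothetical stand-ins for the `script_tag.string` values, and the temporary directory is used only to keep the example self-contained.

```python
import json
import os
import tempfile

# Hypothetical raw strings, standing in for the script_tag.string values.
raw_blocks = [
    '{"@type": "NewsArticle", "headline": "A"}',
    '{"@type": "WebPage", "name": "B"}',
    "not valid json",
]

parsed = []
for raw in raw_blocks:
    try:
        parsed.append(json.loads(raw))
    except (TypeError, json.JSONDecodeError) as e:
        # TypeError covers a tag whose .string is None.
        print("Skipping block:", e)

# Write all objects as one JSON array, which is a single valid document.
out_path = os.path.join(tempfile.mkdtemp(), "json_ld_data.json")
with open(out_path, "w", encoding="utf-8") as f:
    json.dump(parsed, f, indent=4)

# Reading the file back needs no manual editing.
with open(out_path, encoding="utf-8") as f:
    reloaded = json.load(f)
print(len(reloaded))
```

With this layout the loaded value is a list of dictionaries, so the article of interest would be selected by index or by its `@type` field.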


---

To **save the JSON data** from your text file into a **proper JSON file**, you can follow these steps:

In [34]:
# Note: I had to manually edit the txt file, used a json parser to find the errors

# Read the content of the text file
with open('/content/drive/MyDrive/Colab Notebooks/My Colab Projects/json_ld_edited.txt', 'r') as file:
    text_content = file.read()

# Parse the JSON data from the text content
json_data = json.loads(text_content)

# Write the JSON data to a new JSON file
with open('json_ld_data.json', 'w') as output_file:
    json.dump(json_data, output_file, indent=4)  # 'indent' parameter adds formatting for better readability


The JSON file you have is not automatically loaded as a Python dictionary. Instead, it's currently just a textual representation of JSON data.  
**To work with the data in Python as a dictionary**, you need to load it using the json.load() function, which will convert the JSON data into a Python dictionary.

In [36]:
# Note: the JSON file is stored in my Google Drive, so it cannot be opened
# by name alone here; the full path is required.

# Load the JSON data from the file
with open('/content/drive/MyDrive/Colab Notebooks/My Colab Projects/json_ld_data.json', 'r') as json_file:
    json_data = json.load(json_file)

# Now 'json_data' is a Python dictionary
# You can access and analyze its contents like any other dictionary
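
A small sketch of how JSON values map onto Python types may help here. The `text` string is made up for the example, and `io.StringIO` stands in for an open file so that nothing has to exist on disk.

```python
import io
import json

# Hypothetical JSON text with an object, an array and a null value.
text = '{"headline": null, "liveBlogUpdate": [{"headline": "First update"}]}'

from_string = json.loads(text)            # parse from a str
from_file = json.load(io.StringIO(text))  # parse from a file-like object

print(type(from_string).__name__)                    # JSON object -> dict
print(type(from_string["liveBlogUpdate"]).__name__)  # JSON array  -> list
print(from_string["headline"] is None)               # JSON null   -> None
```

Both calls return the same structure; `json.load` reads from a file object, while `json.loads` reads from a string.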

---
**Some analysis of the scraped data**

In [37]:
# Count the top-level entries in the JSON data.
# Note: len() on a dictionary counts its keys, not the number of articles.
num_entries = len(json_data)
print(f"Number of entries: {num_entries}")

Number of entries: 16


The data is **nested within several levels of dictionaries** and lists. To access a field such as the headlines, you need to navigate through these levels.  
If the JSON data contains multiple articles, loop through them and access each headline in the same way.  
Here's how you can do it:

In [38]:
# Accessing the main article's headline
main_headline = json_data['headline']
print("Main Headline:", main_headline)

Main Headline: June 26, 2023 - Russia-Ukraine, Wagner rebellion news


In [39]:
# Accessing the headlines of live blog updates
live_blog_updates = json_data['liveBlogUpdate']
for update in live_blog_updates:
    update_headline = update['headline']
    print("Update Headline:", update_headline)

Update Headline: What we covered here
Update Headline: None
Update Headline: China's foreign minister touts Beijing and Moscow as a force for "global peace"
Update Headline: Putin addresses insurrection and Prigozhin's whereabouts are unknown. Here's what you need to know
Update Headline: Ukrainian fighters have advanced in all directions of frontline, Zelensky says
Update Headline: During Wagner rebellion, allies reached out to Ukraine advising not to strike inside Russia
Update Headline: Wagner uprising "was almost hiding in plain sight," US Sen. Mark Warner says
Update Headline: US gathered detailed intelligence on Wagner chief's rebellion plans but kept it secret, sources say
Update Headline: Analysis: Here's how Ukraine will seek to take advantage following rebellion in Russia
Update Headline: Putin speaks to Emirati counterpart about Wagner rebellion, the Kremlin says
Update Headline: Putin is holding meeting with top security officials, including defense minister, Kremlin says
U
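
Some updates carry no headline, which is why `None` shows up in the output. A minimal sketch of a more defensive lookup follows, assuming a made-up `article` dictionary shaped like the scraped data; `update_headlines` and its `placeholder` argument are hypothetical names.

```python
# Hypothetical data shaped like the scraped JSON-LD.
article = {
    "headline": "Main headline",
    "liveBlogUpdate": [
        {"headline": "First update"},
        {"headline": None},  # key present, value missing
        {},                  # key absent altogether
    ],
}


def update_headlines(article, placeholder="(no headline)"):
    """Return one headline per update, substituting a placeholder when missing."""
    headlines = []
    for update in article.get("liveBlogUpdate", []):
        # .get() avoids a KeyError; "or" also replaces None and empty strings.
        headlines.append(update.get("headline") or placeholder)
    return headlines


for headline in update_headlines(article):
    print("Update Headline:", headline)
```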

In [40]:
# Extract main article details
main_article = json_data
main_headline = main_article['headline']
main_description = main_article['description']
main_date_published = main_article['datePublished']
main_author = main_article['author']['name']

print("Main Article Headline:", main_headline)
print("Main Article Description:", main_description)
print("Main Article Date Published:", main_date_published)
print("Main Article Author:", main_author)

Main Article Headline: June 26, 2023 - Russia-Ukraine, Wagner rebellion news
Main Article Description: Russian President Vladimir Putin said Monday night that "the armed rebellion would have been suppressed anyway," referring to the insurrection launched by the Wagner Group over the weekend.
Main Article Date Published: 2023-06-26T04:00:26Z
Main Article Author: Kathleen Magramo, Christian Edwards, Aditi Sangal, Mike Hayes, Maureen Chowdhury and Amir Vera, CNN


In [41]:
# Extract live blog update details
live_blog_updates = main_article['liveBlogUpdate']
for update in live_blog_updates:
    update_headline = update['headline']
    update_date_published = update['datePublished']
    update_author = update['author']['name']
    print("\nUpdate Headline:", update_headline)
    print("Update Date Published:", update_date_published)
    print("Update Author:", update_author)


Update Headline: What we covered here
Update Date Published: 2023-06-26T01:47:43.693Z
Update Author: Kathleen Magramo, Christian Edwards, Aditi Sangal, Mike Hayes, Maureen Chowdhury and Amir Vera, CNN

Update Headline: None
Update Date Published: 2023-06-27T04:00:15.479Z
Update Author: Kathleen Magramo, Christian Edwards, Aditi Sangal, Mike Hayes, Maureen Chowdhury and Amir Vera, CNN

Update Headline: China's foreign minister touts Beijing and Moscow as a force for "global peace"
Update Date Published: 2023-06-27T03:50:30.920Z
Update Author: Kathleen Magramo, Christian Edwards, Aditi Sangal, Mike Hayes, Maureen Chowdhury and Amir Vera, CNN

Update Headline: Putin addresses insurrection and Prigozhin's whereabouts are unknown. Here's what you need to know
Update Date Published: 2023-06-26T22:03:43.281Z
Update Author: Kathleen Magramo, Christian Edwards, Aditi Sangal, Mike Hayes, Maureen Chowdhury and Amir Vera, CNN

Update Headline: Ukrainian fighters have advanced in all directions of

In [42]:
# Count the number of live blog updates
num_updates = len(live_blog_updates)
print("\nNumber of Live Blog Updates:", num_updates)


Number of Live Blog Updates: 66


In [53]:
'''
This part of the code is focused on extracting and analyzing unique author names from both the main article
and its associated live blog updates.

authors = set(): This initializes an empty set called authors.
A set is a data structure in Python that stores unique elements, meaning each element can only appear once in the set.

authors.add(main_author): This adds the name of the author of the main article to the authors set.
This ensures that if the main article author is also one of the authors of the live blog updates,
they won't be duplicated in the set.

for update in live_blog_updates:: This initiates a loop that iterates through each live blog update
in the live_blog_updates list.

authors.add(update['author']['name']): Inside the loop, this line adds the name of the author
of the current live blog update to the authors set. This is done for every update in the list.

print("\nUnique Authors:", authors): After the loop finishes, this line prints out the set of unique author names.
Since a set only stores unique values, each distinct author string from the main article
and its updates appears once. Each 'name' field here holds the whole byline as a single string.
'''
# Analyze author names
authors = set()
authors.add(main_author)
for update in live_blog_updates:
    authors.add(update['author']['name'])
print("\nUnique Authors:", authors)


Unique Authors: {'Kathleen Magramo, Christian Edwards, Aditi Sangal, Mike Hayes, Maureen Chowdhury and Amir Vera, CNN'}
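
To count individual writers rather than whole bylines, the combined string could be split into names. The sketch below assumes names are separated by commas or the word "and", and that a trailing "CNN" is the outlet rather than a person; `split_byline` is a hypothetical helper.

```python
import re


def split_byline(byline, outlet="CNN"):
    """Split a combined byline into individual names, dropping the outlet."""
    # Split on commas or on the word "and" surrounded by whitespace.
    parts = re.split(r",\s*|\s+and\s+", byline)
    return [p.strip() for p in parts if p.strip() and p.strip() != outlet]


byline = ("Kathleen Magramo, Christian Edwards, Aditi Sangal, Mike Hayes, "
          "Maureen Chowdhury and Amir Vera, CNN")
names = split_byline(byline)
print(len(names), "authors:", names)
```

Adding the result of `split_byline` to a set with `update()` instead of `add()` would then give one entry per person.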


In [52]:
# Word Frequency Analysis: Analyze the frequency of words in the article bodies:

'''
1. We first combine all the text content from various fields
in the JSON data into a single string called all_text.

2. We use regular expressions (re.findall(r'\w+', all_text.lower())) to split the text into words.
This also converts the text to lowercase to ensure case-insensitive word counting.

3. We create a Counter object called word_counter,
which automatically counts the frequency of each word.

4. Finally, we print the 20 most common words
and their frequencies using the most_common() method of the Counter.
'''
from collections import Counter
import re

# Combine all text content into a single string
# Not all updates have headlines or article bodies.
# To avoid the error, you should check if those fields exist before concatenating them.
all_text = ""
all_text += main_article['headline'] + " "
all_text += main_article['description'] + " "
for update in live_blog_updates:
    if 'headline' in update and update['headline']:
        all_text += update['headline'] + " "
    if 'articleBody' in update and update['articleBody']:
        all_text += update['articleBody'] + " "

# Remove special characters and split the text into words
words = re.findall(r'\w+', all_text.lower())

# Create a word frequency counter
word_counter = Counter(words)

# Print the 20 most common words and their frequencies
print("Top 20 most common words:")
for word, count in word_counter.most_common(20):
    print(f"{word}: {count}")

Top 20 most common words:
the: 889
to: 432
in: 335
of: 313
and: 299
a: 289
s: 234
that: 198
on: 189
said: 180
russia: 173
russian: 154
prigozhin: 132
wagner: 130
with: 124
was: 118
ukraine: 102
for: 100
putin: 97
president: 93
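
The list above is dominated by function words such as "the" and "to", plus the stray "s" left over when possessives are split. A rough sketch of filtering them out before counting is shown below; the `STOP_WORDS` set is a small hand-picked assumption, not a standard list, and the sample sentence is made up.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "to", "in", "of", "and", "a", "s", "that",
              "on", "with", "was", "for"}


def content_word_counts(text):
    """Count words after lowercasing and dropping stop words and single letters."""
    words = re.findall(r"\w+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS and len(w) > 1)


sample = "The Wagner chief said the rebellion was over. Putin's address said little."
counts = content_word_counts(sample)
for word, count in counts.most_common(3):
    print(f"{word}: {count}")
```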


In [45]:
# Print the summary of the main article
print("Summary of the Main Article:")
print(main_article['description'])

Summary of the Main Article:
Russian President Vladimir Putin said Monday night that "the armed rebellion would have been suppressed anyway," referring to the insurrection launched by the Wagner Group over the weekend.


---

To perform **Sentiment Analysis** on the text content of the articles, you can use a **Natural Language Processing (NLP)** library like TextBlob.  
**TextBlob** is a simple library that provides easy-to-use methods for various NLP tasks, including sentiment analysis.

In [46]:
from textblob import TextBlob

In [51]:
# Text data for sentiment analysis: only the main article's description is used here.
# To include the live blog updates as well, loop over live_blog_updates and append
# each non-empty 'headline' and 'articleBody' to text_data (not all updates have them).
text_data = main_article['description'] + " "
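
A minimal sketch of gathering the description together with every update's text into one string, skipping fields that are missing or empty, could look like this. The `collect_text` helper and the sample `article` are hypothetical and only mirror the shape of the scraped data.

```python
def collect_text(article):
    """Join the description with each update's headline and body, skipping gaps."""
    parts = [article.get("description")]
    for update in article.get("liveBlogUpdate", []):
        parts.append(update.get("headline"))
        parts.append(update.get("articleBody"))
    # Drop None and empty strings before joining.
    return " ".join(p for p in parts if p)


# Hypothetical data shaped like the scraped JSON-LD.
article = {
    "description": "Main description.",
    "liveBlogUpdate": [
        {"headline": None, "articleBody": "Body one."},
        {"headline": "Second headline"},
    ],
}
text_for_analysis = collect_text(article)
print(text_for_analysis)
```

The resulting string can be passed to a sentiment analyzer in place of the description alone.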

In [48]:
# Perform sentiment analysis
blob = TextBlob(text_data)

# Calculate polarity and subjectivity
polarity = blob.sentiment.polarity
subjectivity = blob.sentiment.subjectivity

In [49]:
# Print sentiment analysis results
print("Sentiment Polarity:", polarity)
print("Sentiment Subjectivity:", subjectivity)

Sentiment Polarity: 0.0
Sentiment Subjectivity: 0.0


In [50]:
# Determine sentiment label
if polarity > 0:
    sentiment_label = "Positive"
elif polarity < 0:
    sentiment_label = "Negative"
else:
    sentiment_label = "Neutral"

print("Sentiment Label:", sentiment_label)

Sentiment Label: Neutral
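
TextBlob polarity ranges from -1.0 to 1.0, so comparing against exactly zero labels even a very faint score as positive or negative. One option, sketched here, is a small neutral band around zero; the `neutral_band` value of 0.05 is an arbitrary assumption for illustration, and `sentiment_label` is a hypothetical helper.

```python
def sentiment_label(polarity, neutral_band=0.05):
    """Map a polarity score in [-1, 1] to a label, treating near-zero scores as neutral."""
    if polarity > neutral_band:
        return "Positive"
    if polarity < -neutral_band:
        return "Negative"
    return "Neutral"


for score in (0.0, 0.01, 0.3, -0.3):
    print(score, sentiment_label(score))
```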
