Task: Web Scraping and Data Visualization from BBC News

Objective: Scrape data from the BBC News website, store the data in a CSV file, and create visualizations based on the collected data.

Instructions:

Website to Scrape:
Visit the BBC News website at https://www.bbc.com/news.


Data to Collect:
1 Headlines of news articles,
2 Author names (if available),
3 Summary or description of the articles,
4 URLs of the articles.

Data Storage:
Store the scraped data in a CSV file with the following columns: headline, publication_date, author, summary, url.

Documentation:
Document the steps taken in the scraping process and any challenges faced.
Include comments in the code for clarity.

Submission:
Submit the CSV file and visualizations, along with the source code and documentation.



In [1]:
import requests

data=requests.get("https://www.bbc.com/news")
page_contents=data.text


In [2]:
print(data.status_code)

200


# Fetch Headlines of news articles

In [3]:
import requests
from bs4 import BeautifulSoup

url='https://www.bbc.com/news'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
#print(soup)
def get_headlines_newsarticles(soup):
    headlines = soup.find('body').find_all('h2')
    headlines_news=[]
    for x in headlines:
        headlines_news.append(x.text.strip())
    return headlines_news
        

In [4]:
get_headlines_newsarticles(soup)

['Bowen: Year of killing and broken assumptions has taken Middle East to edge of deeper, wider war',
 'Israeli air strikes hit Gaza and Beirut as 7 October attacks remembered',
 'Maldives president in Delhi to seek aid and reboot ties',
 'Orla Gartland: US tour will cost me thousands',
 'New hurricane threatens Florida as it reels from devastation',
 'Israeli air strikes hit Gaza and Beirut as 7 October attacks remembered',
 "Blast kills two Chinese near Pakistan's Karachi airport",
 'Russian opposition activist killed fighting for Ukraine',
 'Climbers rescued after three days missing in Himalayas',
 'Maldives president in Delhi to seek aid and reboot ties',
 'Antisemitic incidents in US surge to record high - report',
 'Policewoman killed and 10 injured in shooting in Israel',
 'New Zealand loses first naval ship to sea since WW2',
 "Judi Dench speaks of grief after Maggie Smith's death",
 'Tool promised to help non-verbal people - but did it manipulate them instead?',
 'Conflict in M

# Fetch Article URL

In [29]:
import requests
from bs4 import BeautifulSoup

url = 'https://www.bbc.com/news'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

def get_article_url(soup):
    article_urls = set()  
    base_url = 'https://www.bbc.com'

    
    for tag in soup.find_all("a", href=True):
        href = tag['href']

        # Check if the link is an internal link (starts with "/")
        if href.startswith("/news"):  # Filter for news articles only
            full_url = base_url + href
            article_urls.add(full_url)
        
    return list(article_urls)

article_links = get_article_url(soup)
for link in article_links:
    print(link)


https://www.bbc.com/news/bbcindepth
https://www.bbc.com/news/war-in-ukraine
https://www.bbc.com/news/world/europe
https://www.bbc.com/news/articles/c8xe17pegrno
https://www.bbc.com/news/uk
https://www.bbc.com/news/articles/c8dj0833g99o
https://www.bbc.com/news/topics/cwywpe90nnyt
https://www.bbc.com/news/articles/cg7825rk8j9o
https://www.bbc.com/news/articles/cyvye9l43dgo
https://www.bbc.com/news/reality_check
https://www.bbc.com/news/articles/ce3we9n0v79o
https://www.bbc.com/news/topics/c79wd85wg8et
https://www.bbc.com/news/scotland
https://www.bbc.com/news/world/australia
https://www.bbc.com/news/articles/c5y0jm7jpx8o
https://www.bbc.com/news/northern_ireland
https://www.bbc.com/news/videos/c625nlxjpz7o
https://www.bbc.com/news/articles/crl8e084r9yo
https://www.bbc.com/news/scotland/scotland_politics
https://www.bbc.com/news/world/africa
https://www.bbc.com/news/world/middle_east
https://www.bbc.com/news/articles/ced0wlqnvlno
https://www.bbc.com/news/world/latin_america
https://www.b

# Author Name

In [45]:
import requests
from bs4 import BeautifulSoup
import json

# Function to fetch author name from the article URL
def fetch_author_name(article_url):
    try:
        # Fetch the webpage content
        response = requests.get(article_url)
        response.raise_for_status()  # Raise an error for bad responses

        # Parse the webpage using BeautifulSoup
        article_soup = BeautifulSoup(response.text, 'html.parser')

        # Attempt to find author name in the structured data (JSON-LD)
        script_tag = article_soup.find('script', type='application/ld+json')

        author_name = "Author not found"  # Default value if author is not found
        if script_tag:
            try:
                json_data = json.loads(script_tag.string)

                # Extract author name if available in the JSON-LD data
                if 'author' in json_data and isinstance(json_data['author'], list) and 'name' in json_data['author'][0]:
                    author_name = json_data['author'][0]['name']
            except (KeyError, json.JSONDecodeError):
                pass  # Handle cases where the JSON data does not have the required structure

        return author_name

    except requests.RequestException as e:
        return "Error", f"Error fetching article: {e}"


# Iterate over article links and fetch author names
for url in article_links:
    author_name = fetch_author_name(url)
    print(f"URL: {url}\nAuthor: {author_name}\n")


URL: https://www.bbc.com/news/bbcindepth
Author: Author not found

URL: https://www.bbc.com/news/war-in-ukraine
Author: Author not found

URL: https://www.bbc.com/news/world/europe
Author: Author not found

URL: https://www.bbc.com/news/uk
Author: Author not found

URL: https://www.bbc.com/news/articles/c8dj0833g99o
Author: Graeme Baker

URL: https://www.bbc.com/news/topics/cwywpe90nnyt
Author: Author not found

URL: https://www.bbc.com/news/articles/cyvye9l43dgo
Author: Robert Greenall

URL: https://www.bbc.com/news/articles/ce3we9n0v79o
Author: Lucy Manning 

URL: https://www.bbc.com/news/reality_check
Author: Author not found

URL: https://shop.bbc.com/
Author: Author not found

URL: https://www.bbc.com/news/topics/c79wd85wg8et
Author: Author not found

URL: https://www.bbc.com/news/scotland
Author: Author not found

URL: https://www.bbc.com/newsletters
Author: Author not found

URL: https://www.bbc.com/news/world/australia
Author: Author not found

URL: https://www.bbc.com/news/art

# Author name for Single URL

In [42]:
import json
def fetch_author_names(article_url):
    
    try:
        response = requests.get(article_url)
        response.raise_for_status()  # Raise an error for bad responses
        article_soup = BeautifulSoup(response.text, 'html.parser')
        

        script_tag = article_soup.find('script', type='application/ld+json')

        # Load the JSON content from the script tag
        json_data = json.loads(script_tag.string)

        # Extract the author name
        author_name = json_data['author'][0]['name']

        # Print the author name
        print(f"Author: {author_name}")

          
        
        #author_tag=article_soup.find('div', {'data-testid': 'byline-new-contributors'})
        #print(author_tag)
        description_tag = article_soup.find('meta', property='og:description')  
        
                
        if description_tag:
            return description_tag['content'].strip()  # Get the content of the meta tag
        else:
            return "Description not found"

    except requests.RequestException as e:
        return f"Error fetching article: {e}"

for url in article_links:
    description = fetch_author_names(url)
    print(f"URL: {url}\nDescription: {description}\n")
#article_url="https://www.bbc.com/news/articles/c8dj0833g99o"
#fetch_article_description(article_url)


Author: Graeme Baker


'Alejandro Arcos is the second politician to be killed in a week in the city of Chilpachingo.'

# Summary or description of the articles

In [47]:
import requests
from bs4 import BeautifulSoup

def fetch_article_description(article_url):
    try:
        response = requests.get(article_url)
        response.raise_for_status()  # Raise an error for bad responses
        article_soup = BeautifulSoup(response.content, 'html.parser')
        
        # Find the description or summary
        # Adjust the selector based on the actual HTML structure of the article
        description_tag = article_soup.find('meta', property='og:description')  # Example for Open Graph description
        
        if description_tag:
            return description_tag['content'].strip()  # Get the content of the meta tag
        else:
            return "Description not found"

    except requests.RequestException as e:
        return f"Error fetching article: {e}"

# Example of usage
# Assuming you already have a BeautifulSoup object for a BBC news page
# soup = BeautifulSoup(your_html_content, 'html.parser')
# article_urls = get_article_url(soup)

for url in article_links:
    description = fetch_article_description(url)
    print(f"URL: {url}\nDescription: {description}\n")


URL: https://www.bbc.com/news/bbcindepth
Description: Thought-provoking analysis from our top journalists that informs, feeds your curiosity, and helps you make sense of a complex world.

URL: https://www.bbc.com/news/war-in-ukraine
Description: Follow the latest news about the Russia Ukraine war. Find reports from the ground, verified videos, maps and expert analysis by BBC correspondents across the world.

URL: https://www.bbc.com/news/world/europe
Description: Get all the latest news, live updates and content about Europe from across the BBC.

URL: https://www.bbc.com/news/uk
Description: Get all the latest news, live updates and content about the UK from across the BBC.

URL: https://www.bbc.com/news/articles/c8dj0833g99o
Description: Alejandro Arcos is the second politician to be killed in a week in the city of Chilpachingo.

URL: https://www.bbc.com/news/topics/cwywpe90nnyt
Description: Description not found

URL: https://www.bbc.com/news/articles/cyvye9l43dgo
Description: Immigr

In [55]:
# Main scraping function
def scrape_multiple_pages(n):
    base_url = 'https://www.bbc.com/news'
    bbc_headlines, bbc_author, bbc_url, bbc_description = [], [], [], []

    for page in range(1, n + 1):
        doc = get_doc(base_url)
        
        # Extract data from each page
        headlines = list(get_headlines_newsarticles(doc))  
        authors = list(fetch_author_name(doc))            
        urls = list(get_article_url(doc))                  
        descriptions = list(fetch_article_description(doc)) 
        
        # Ensure lists are of the same length by appending placeholder values if needed
        max_len = max(len(headlines), len(authors), len(urls), len(descriptions))
        
        # Adjust all lists to the same length
        headlines += ['Missing Headline'] * (max_len - len(headlines))
        authors += ['Unknown Author'] * (max_len - len(authors))
        urls += ['Unknown URL'] * (max_len - len(urls))
        descriptions += ['No Description'] * (max_len - len(descriptions))
        
        # Append data to the main lists
        bbc_headlines.extend(headlines)
        bbc_author.extend(authors)
        bbc_url.extend(urls)
        bbc_description.extend(descriptions)

    # Create a dictionary with the scraped data
    bbc_data = {
        'HEADLINES': bbc_headlines,
        'AUTHOR': bbc_author,
        'LINK': bbc_url,
        'DESCRIPTION': bbc_description
    }

    # Convert the dictionary to a DataFrame
    bbc_df = pd.DataFrame(bbc_data)

    # Save the DataFrame to a CSV file
    bbc_df.to_csv('bbc_news_data.csv', index=False)

    return bbc_df

# Example usage: Scrape 1 page of BBC news and save to CSV
scraped_data = scrape_multiple_pages(1)
print(scraped_data.head())


                                           HEADLINES  \
0  'Far too many civilians have suffered,' Biden ...   
1  Bowen: Year of killing and broken assumptions ...   
2  Hezbollah rockets hit northern Israeli city of...   
3        My son's not a monster, says Diddy's mother   
4    Japan's government admits editing cabinet photo   

                                              AUTHOR  \
0                                              Error   
1  Error fetching article: No connection adapters...   
2                                     Unknown Author   
3                                     Unknown Author   
4                                     Unknown Author   

                                             LINK DESCRIPTION  
0             https://www.bbc.com/news/bbcindepth           E  
1         https://www.bbc.com/news/war-in-ukraine           r  
2           https://www.bbc.com/news/world/europe           r  
3                     https://www.bbc.com/news/uk           o  
4  htt