# Web Scraping

Importing the necessary libraries

In [83]:
import pandas as pd 
import numpy as np

import time 
import json

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import NoSuchElementException, TimeoutException

from newspaper import Article, ArticleException

The goal of this project is to categorize news articles, so the first step is to gather the data that will later be used to train the classification models. I collected it through web scraping, that is, by programmatically extracting data from a website.

I implemented a scraping function that uses the **Selenium** library to start a Chrome web driver. For each selected category, the function builds a search URL from the provided base _URL_, the category name, and a page offset, and visits the specified number of result pages. The output is a dictionary that maps each category to the list of links collected for it.
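
The URL pattern that the function iterates over can be sketched in isolation. This is only an illustration: the helper name `build_search_url`, the constant `RESULTS_PER_PAGE`, and the use of `quote_plus` are assumptions and not part of the function defined in this notebook. The query parameters are the ones used in `scraping`.

```python
from urllib.parse import quote_plus

RESULTS_PER_PAGE = 10  # assumption: mirrors the `page * 10` offset used in scraping()

def build_search_url(url_base, category, page):
    """Hypothetical helper: build the search URL for one category and page index."""
    offset = page * RESULTS_PER_PAGE
    # quote_plus keeps multi-word queries (e.g. 'World News') valid inside a URL
    return f'{url_base}{quote_plus(category)}&from={offset}&types=article&sort=relevance'

urls = [build_search_url('https://www.cnn.com/search?q=', 'Politics', page) for page in range(3)]
for url in urls:
    print(url)
```

For the single-word categories used here, the result is the same string that `scraping` builds; the encoding step only matters if a category contains spaces or other reserved characters.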

In [85]:
def scraping(url_base, n_pages, categories):
    ''' 
    Perform Web-Scraping: 
    
    Args: 
        url_base: Base URL for the web scraping. This is the starting point for constructing page URLs.
        n_pages:  Number of pages to scrape. Specifies how many pages of the website to visit and extract data from.
        categories: List of categories to focus on during scraping. These can be used to filter and extract specific information.

    Output:
        all_links_dict: A dictionary containing URLs categorized by the specified categories. Each key represents a category, and the associated value is a list of URLs 
        extracted from the specified number of pages for that category.
    '''

    # Initialize the web driver
    driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))
    all_links_dict = {}  # Dictionary to store URLs by category

    for category in categories:
        all_links = []

        for page in range(0, n_pages):
            url = f'{url_base}{category}&from={page * 10}&types=article&sort=relevance'
            driver.get(url)

            # Handle the cookie banner, if one is shown
            try:
                accept_button = WebDriverWait(driver, 10).until(
                    EC.element_to_be_clickable((By.ID, 'onetrust-accept-btn-handler'))
                )
                accept_button.click()
            except (TimeoutException, NoSuchElementException):
                # No banner appeared within 10 seconds
                # (for example, it was already accepted on an earlier page)
                pass

            # Find and collect links
            links = driver.find_elements(By.CSS_SELECTOR, '.container_list-images-with-description__link')
            for link in links:
                href = link.get_attribute('href')

                # Skip elements without an href and links already collected
                if href and href not in all_links:
                    all_links.append(href)
            
            time.sleep(2)

        all_links_dict[category] = all_links
        print(f'Finished processing {category} category')
    
    # Close the driver when done
    driver.quit()
    return all_links_dict

To use the function, I chose a search query on CNN's website, a well-known source of news and information. 
The categories I am interested in are stored in the list **categories_list**. 

In [87]:
categories_list = ['Politics', 'Business', 'Health', 'Entertainment']
links = scraping('https://www.cnn.com/search?q=', n_pages = 30, categories = categories_list)

Finished processing Politics category
Finished processing Business category
Finished processing Health category
Finished processing Entertainment category


In [92]:
links['Business']

['https://www.cnn.com/2023/12/30/business/small-businesses-big-retailers-impact/index.html',
 'https://www.cnn.com/2022/08/19/economy/worker-shortage-small-business/index.html',
 'https://www.cnn.com/2021/07/06/business/money/small-business-retirement/index.html',
 'https://www.cnn.com/2022/04/05/success/starting-a-business-retirement-finances/index.html',
 'https://www.cnn.com/2023/08/09/business/dentons-dacheng-china-business-split-intl-hnk/index.html',
 'https://www.cnn.com/2023/11/21/business/small-business-owners-gear-up-for-the-biggest-shopping-weekend-of-the-holiday-season/index.html',
 'https://www.cnn.com/2023/06/13/economy/nfib-small-business-optimism/index.html',
 'https://www.cnn.com/cnn-underscored/deals/best-small-business-deals-amazon-prime-day-2023-07-11',
 'https://www.cnn.com/cnn-underscored/deals/best-small-business-deals-amazon-prime-day-2023-07-12',
 'https://www.cnn.com/2020/04/13/business/businesses-transition-online-trnd/index.html']

The **store_news** function iterates through each category and its associated URLs, using the **Article** class from the **newspaper3k** library to download and parse each article. After calling **.nlp()**, it reads an automatically generated summary of the article text from the **.summary** attribute. It then gathers the category, the summary (stored as **Full_description**), and the title into a DataFrame. Rows with missing values are dropped, and the DataFrame is converted into a dictionary. Finally, this dictionary is saved to a JSON file and returned.

The function is designed to handle exceptions during downloading and parsing.
If an article cannot be downloaded or parsed, the function prints the error and records empty values for that article's description and title. These empty rows are removed when missing values are dropped, so a single failed article does not stop the run.
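
To make the shape of the saved file concrete, here is a small sketch that uses only the standard library. It assumes the default orientation of `DataFrame.to_dict()`, which nests values as `{column: {row index: value}}`. The article texts are shortened placeholders, not real scraped data.

```python
import json

# Same shape as DataFrame.to_dict() with its default orientation:
# {column name: {row index: value}}. The values are placeholders.
news_texts = {
    'Category': {0: 'Business', 1: 'Health'},
    'Full_description': {0: 'Summary of the first article', 1: 'Summary of the second article'},
    'Title': {0: 'First title', 1: 'Second title'},
}

serialized = json.dumps(news_texts)
restored = json.loads(serialized)

# JSON object keys are always strings, so the integer row labels
# come back as strings after a round trip
print(list(restored['Category'].keys()))

# Rebuild one record per article, ordered by the original integer row label
records = [
    {column: values[row] for column, values in restored.items()}
    for row in sorted(restored['Category'], key=int)
]
print(records[0])
```

One consequence is that row labels need to be converted back to integers (as the `key=int` sort does here) if their numeric order matters after reloading the file.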

In [89]:
def store_news(urls_dict, categories):
    '''
    Downloads news texts from provided URLs and stores the information.    
    
    Args:
    urls_dict: A dictionary where the keys are the categories and the values are lists of URLs.
    categories: List of categories. Currently unused: the categories are taken from the keys of urls_dict.
    
    Output:
    news_texts: A dictionary with the category, summary, and title of each successfully downloaded article.
                The dictionary is also saved in a JSON file on the computer.
    '''
    news_texts_list = []

    for category, urls in urls_dict.items():
        for url in urls:
            try:
                article = Article(url=url)
                article.download()
                article.parse()
                article.nlp()
                description = article.summary
                title = article.title
                news_texts_list.append({
                    'Category': category,  
                    'Full_description': description,
                    'Title': title
                })
            except ArticleException as e:
                # Download or parsing failed: keep an empty placeholder row
                print(f"ArticleException: {str(e)}")
                news_texts_list.append({
                    'Category': category,
                    'Full_description': '',
                    'Title': ''
                })
            except Exception as e:
                # Any other error: keep an empty placeholder row as well
                print(f"Error: {str(e)}")
                news_texts_list.append({
                    'Category': category,
                    'Full_description': '',
                    'Title': ''
                })
                
    news_texts = pd.DataFrame(news_texts_list)
    news_texts = news_texts.replace('', np.nan)  
    news_texts = news_texts.dropna(subset=['Full_description', 'Title'])
    
    news_texts = news_texts.to_dict()

    json_file_path = 'C:/Users/Lenovo/Desktop/IT coding/IT project/news_trial.json'
    # Save the article data to a JSON file
    with open(json_file_path, 'w') as json_file:
        json.dump(news_texts, json_file)
        
    return news_texts

In [90]:
# Use a separate name so the store_news function is not overwritten by its result
news_data = store_news(links, categories_list)

ArticleException: Article `download()` failed with 403 Client Error: Forbidden for url: https://www.cnn.com/cnn-underscored/deals/best-small-business-deals-amazon-prime-day-2023-07-11 on URL https://www.cnn.com/cnn-underscored/deals/best-small-business-deals-amazon-prime-day-2023-07-11
ArticleException: Article `download()` failed with 403 Client Error: Forbidden for url: https://www.cnn.com/cnn-underscored/deals/best-small-business-deals-amazon-prime-day-2023-07-12 on URL https://www.cnn.com/cnn-underscored/deals/best-small-business-deals-amazon-prime-day-2023-07-12
ArticleException: Article `download()` failed with 403 Client Error: Forbidden for url: https://www.cnn.com/2021/02/01/cnn-underscored/best-home-entertainment-systems on URL https://www.cnn.com/2021/02/01/cnn-underscored/best-home-entertainment-systems


In [93]:
pd.DataFrame(news_data).head()
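
All three failures logged above are for URLs containing `cnn-underscored`, which appear to be shopping guides rather than news articles. One possible way to avoid these requests is to filter such links out before downloading. The sketch below is an assumption about how that could look; the helper name `drop_underscored` and its `marker` parameter are hypothetical, and the sample URLs are taken from the outputs shown in this notebook.

```python
def drop_underscored(urls_dict, marker='cnn-underscored'):
    """Hypothetical helper: remove URLs that contain `marker` from every category."""
    return {
        category: [url for url in urls if marker not in url]
        for category, urls in urls_dict.items()
    }

sample = {
    'Business': [
        'https://www.cnn.com/2023/06/13/economy/nfib-small-business-optimism/index.html',
        'https://www.cnn.com/cnn-underscored/deals/best-small-business-deals-amazon-prime-day-2023-07-11',
    ],
    'Entertainment': [
        'https://www.cnn.com/2021/02/01/cnn-underscored/best-home-entertainment-systems',
    ],
}

filtered = drop_underscored(sample)
print(filtered)
```

Filtering is optional here, because the exception handling already skips these pages; it mainly saves the failed requests and keeps the log shorter.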

| | Category | Full_description | Title |
|---|---|---|---|
| 0 | Business | The case of UNO and AMC 24 Hamilton speaks to ... | What happens to small businesses when big reta... |
| 1 | Business | Small business owners across the United States... | America’s small businesses are running out of ... |
| 2 | Business | “When you are a small business owner, almost e... | How to navigate retirement as a small business... |
| 3 | Business | For starters, about a third of small businesse... | How to protect your personal finances when you... |
| 4 | Business | Hong Kong CNN —Dentons, the world’s biggest la... | Dentons: Global law firm splits off Dacheng, i... |
