# Web Scraping: The Guardian
* [Reference GitHub repository](https://github.com/miguelfzafra/Latest-News-Classifier)

In [1]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import time

## Obtain the list of news from the cover page
URL definition:

In [2]:
# url definition
url = "https://www.theguardian.com/uk"

List of news:

In [3]:
# Request the cover page
r1 = requests.get(url)
r1.status_code  # 200 means the request succeeded

# Save the cover page content in coverpage
coverpage = r1.content

# Soup creation
soup1 = BeautifulSoup(coverpage, 'html5lib')

# News identification
coverpage_news = soup1.find_all('h3', class_='fc-item__title')
len(coverpage_news)

125
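
The request above never acts on `r1.status_code`, so a failed download would be parsed as if it were the cover page. A minimal sketch of one way to guard against that follows. The helper name `fetch_html` and its `get`, `retries` and `delay` parameters are hypothetical and not part of the original notebook; `get` is any callable that behaves like `requests.get`, which keeps the sketch runnable without network access.

```python
import time


def fetch_html(url, get, retries=3, delay=1.0):
    """Return the response body for url, retrying on non-200 responses.

    `get` is any callable with the shape of requests.get: it takes a URL
    and returns an object with `status_code` and `content` attributes.
    """
    last_status = None
    for attempt in range(retries):
        response = get(url)
        last_status = response.status_code
        if last_status == 200:
            return response.content
        # Wait before the next attempt, but not after the last one
        if attempt < retries - 1:
            time.sleep(delay)
    raise RuntimeError(f"GET {url} failed with status {last_status}")
```

In the notebook this could be called as `coverpage = fetch_html(url, requests.get)`.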

## Let's extract the text from the articles:
First, we'll define the number of articles we want:

In [4]:
# Cap at the number of headlines actually found to avoid an IndexError
number_of_articles = min(125, len(coverpage_news))

In [5]:
# Empty lists for content, links and titles
news_contents = []
list_links = []
list_titles = []

for n in np.arange(0, number_of_articles):

    # Getting the link of the article
    link = coverpage_news[n].find('a')['href']

    # We need to ignore "live" pages since they are not articles
    if "live" in link:
        continue
    list_links.append(link)

    # Getting the title
    title = coverpage_news[n].find('a').get_text()
    list_titles.append(title)

    # Reading the content (it is divided in paragraphs)
    article = requests.get(link)
    article_content = article.content
    soup_article = BeautifulSoup(article_content, 'html5lib')
    body = soup_article.find_all('p', class_='dcr-s23rjr')

    # Unifying the paragraphs
    list_paragraphs = []
    for p in body:
        paragraph = p.get_text()
        list_paragraphs.append(paragraph)

    # Join once, after the loop, so that an article with no matching
    # paragraphs yields an empty string instead of reusing the previous text
    final_article = " ".join(list_paragraphs)

    news_contents.append(final_article)
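
The substring test for "live" in the loop above also matches any URL that merely contains those letters, for example a slug with "delivery" or "Liverpool". A stricter check could compare whole path segments. The sketch below assumes that live blogs carry a `live` segment in the URL path; that assumption and the helper name `is_live_page` are not from the original notebook, and the sample URLs are made up.

```python
from urllib.parse import urlparse


def is_live_page(link):
    """Return True if any path segment of the URL is exactly 'live'."""
    segments = urlparse(link).path.split("/")
    return "live" in segments


print(is_live_page("https://www.theguardian.com/politics/live/2021/oct/16/example"))
# True
print(is_live_page("https://www.theguardian.com/football/2021/oct/16/liverpool-example"))
# False
```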

## Let's put them into:

* a dataset which will be the input of the models (df_features)

In [18]:
news_contents[6]

'The Conservative MP Sir David Amess died on Friday after being stabbed multiple times at a constituency surgery. Here is what we know so far:'

In [6]:
# df_features
df_features = pd.DataFrame(
     {'Article Content': news_contents 
})

In [21]:
for news in news_contents:
    print(news)

The Speaker of the House of Commons, Sir Lindsay Hoyle, has demanded “an end to hatred” against MPs and a kinder form of political discourse following the fatal stabbing of Sir David Amess, as evidence mounts of the shocking scale of intimidation and threats suffered daily by elected politicians and their staff. In a highly unusual intervention for a Commons Speaker, Hoyle makes the appeal as he writes in the Observer. He describes the late Tory MP and father of five as a friend who would regularly drop into his office for a chat, and as “a man who found a connection with everyone, no matter their background”. On Saturday, political leaders put on a defiant show of unity and solidarity, with Boris Johnson and Sir Keir Starmer, along with Hoyle and the home secretary, Priti Patel, together laying wreaths at Belfairs Methodist church in Leigh-on-Sea, Essex, where Amess died after being stabbed repeatedly as he held his constituency surgery on Friday. A man was arrested shortly after the 

In [22]:
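all_content = []
for news in news_contents:
    # Strip curly quotes and apostrophes, then split into sentences
    tmp_content = news.replace('“', '').replace('”', '').replace('’', '').split(". ")

    # Split each sentence further at commas
    for tmp in tmp_content:
        for tmp2 in tmp.split(", "):
            all_content.append(tmp2)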

In [35]:
final_content = []
statistic = []  # length of each unique fragment before truncation
for unique_content in np.unique(all_content):
    stripped = unique_content.strip()
    # Cap each fragment at 499 characters
    final_content.append(stripped[:499])
    statistic.append(len(stripped))
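
The two cells above can be folded into one function, which makes the cleaning steps easier to try on a small input. This is a sketch rather than the notebook's code: `split_fragments` and `max_len` are hypothetical names, and `sorted(set(...))` stands in for `np.unique` so the example needs no third-party packages. Unlike the cells above, it strips whitespace before deduplicating and drops empty fragments.

```python
def split_fragments(articles, max_len=499):
    """Split article texts into unique, stripped fragments of bounded length."""
    fragments = set()
    for text in articles:
        # Remove curly quotes and apostrophes
        for mark in ("“", "”", "’"):
            text = text.replace(mark, "")
        # Split into sentences, then split each sentence at commas
        for sentence in text.split(". "):
            for piece in sentence.split(", "):
                piece = piece.strip()
                if piece:  # skip empty fragments
                    fragments.add(piece[:max_len])
    return sorted(fragments)


sample = ["For his part, he agreed. For his part, he left."]
print(split_fragments(sample))
# ['For his part', 'he agreed', 'he left.']
```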

# DB Manipulation
* https://datatofish.com/how-to-connect-python-to-sql-server-using-pyodbc/

# Connect to DB

In [40]:
import pyodbc 

# Run SELECT @@SERVERNAME in SQL Server to find the server name

# The Server line is a raw string so the backslash is not read as an escape
conn = pyodbc.connect('Driver={SQL Server};'
                      r'Server=LAPTOP-Q0DB1ITI\SQLEXPRESS;'
                      'Database=DB;'
                      'Trusted_Connection=yes;')

cursor = conn.cursor()
cursor.execute('SELECT * FROM ATData')

for i in cursor:
    print(i)

(1, 'Hi, my name is Eric.', 'C:default/path/to/your/audio/file')
(2, '', 'C:default/path/to/your/audio/file')
(3, 'According to West', 'C:default/path/to/your/audio/file')
(4, 'All the snakes were northern Pacific rattlesnakes', 'C:default/path/to/your/audio/file')
(5, 'CNN first reported in February 2021 that court documents showed that the two private jets used by the Saudi assassination squad were owned by a company that had previously been seized by Prince Mohammed', 'C:default/path/to/your/audio/file')
(6, 'Edwards is now considering suing Santander for indirect discrimination', 'C:default/path/to/your/audio/file')
(7, 'Floella Benjamin', 'C:default/path/to/your/audio/file')
(8, 'For his part', 'C:default/path/to/your/audio/file')
(9, 'He feels he and his commission colleagues did their best', 'C:default/path/to/your/audio/file')
(10, 'He spent 12 years teaching in London', 'C:default/path/to/your/audio/file')
(11, 'I kept finding snakes for the next almost four hours', 'C:default

(686, 'I received a call … The crown prince heard of me', 'C:default/path/to/your/audio/file')
(687, 'I reckon I reported over 1,000 death threats', 'C:default/path/to/your/audio/file')
(688, 'I relived everything I have not been able to do in the last 15 years', 'C:default/path/to/your/audio/file')
(689, 'I remember first reading the report and thinking', 'C:default/path/to/your/audio/file')
(690, 'I see it becoming – my old enemy Rupert Murdochs dream made real', 'C:default/path/to/your/audio/file')
(691, 'I swear even in the past when weve had our differences and weve argued and stuff she would always call me or text me and say', 'C:default/path/to/your/audio/file')
(692, 'I think it would still take years for the scar tissue to heal with the employees', 'C:default/path/to/your/audio/file')
(693, 'I think news prefers people that sort of stay within the system and people of color who are assimilatory and want equal rights – thats something we understand', 'C:default/path/to/your/aud

(1375, 'The Casalesi clan', 'C:default/path/to/your/audio/file')
(1376, 'The Committee on Climate Change', 'C:default/path/to/your/audio/file')
(1377, 'The Conservative MP Sir David Amess died on Friday after being stabbed multiple times at a constituency surgery', 'C:default/path/to/your/audio/file')
(1378, 'The FDA usually takes the advice of its expert panels', 'C:default/path/to/your/audio/file')
(1379, 'The Foreign Office has been contacted for comment.', 'C:default/path/to/your/audio/file')
(1380, 'The Great Penrhyn Quarry Strike of 1900-3', 'C:default/path/to/your/audio/file')
(1381, 'The Guardian has found a variety of harmful pro-anorexia hashtags remain searchable on the popular video-sharing app TikTok', 'C:default/path/to/your/audio/file')
(1382, 'The Guardian has not been able to independently verify these claims', 'C:default/path/to/your/audio/file')
(1383, 'The Harvard Business School-educated executive is reportedly due to attend his first Newcastle match on Sunday in h

(2076, 'an increase of more than 30%', 'C:default/path/to/your/audio/file')
(2077, 'an internalising of angst – that makes the country appear relatively unshaken by the ongoing toll of the pandemic? There definitely isnt', 'C:default/path/to/your/audio/file')
(2078, 'an investigation by the Washington Post echoed the issues raised by Abrams and painted a picture of an organization riddled with distrust of its leadership', 'C:default/path/to/your/audio/file')
(2079, 'an irony for a company whose motto is Gradatim Ferociter', 'C:default/path/to/your/audio/file')
(2080, 'and #bodyacceptance', 'C:default/path/to/your/audio/file')
(2081, 'and Carlsons colleagues have previously praised the companys six weeks of paid leave on-air', 'C:default/path/to/your/audio/file')
(2082, 'and Charging Crow served as primary caretaker for her mother as she dealt with the medical complications of end-stage kidney failure', 'C:default/path/to/your/audio/file')
(2083, 'and I knew it would be explosive', 'C:d

(2811, 'of Arizona State University', 'C:default/path/to/your/audio/file')
(2812, 'of Notting Hill', 'C:default/path/to/your/audio/file')
(2813, 'of course', 'C:default/path/to/your/audio/file')
(2814, 'of course Id like to see my home town move away from coal', 'C:default/path/to/your/audio/file')
(2815, 'of which Pedro Maria Martinez Ocio', 'C:default/path/to/your/audio/file')
(2816, 'officers from the Metropolitan Polices counter-terrorism unit said they were investigating a potential motivation linked to Islamist extremism', 'C:default/path/to/your/audio/file')
(2817, 'officials ordered more than 70 mines in Inner Mongolia to increase coal production by almost 100m tonnes early this month', 'C:default/path/to/your/audio/file')
(2818, 'offshore wind and schemes to decarbonise homes.', 'C:default/path/to/your/audio/file')
(2819, 'on an almost daily basis', 'C:default/path/to/your/audio/file')
(2820, 'on average 4,870 people died in mine accidents every year', 'C:default/path/to/your/

(3457, 'which finds we are almost out of time to act in any sort of meaningful way on the climate crisis', 'C:default/path/to/your/audio/file')
(3458, 'which forces power companies to lower their opening offers to customers', 'C:default/path/to/your/audio/file')
(3459, 'which government MPs support and non-government MPs do not', 'C:default/path/to/your/audio/file')
(3460, 'which has 15.3bn views', 'C:default/path/to/your/audio/file')
(3461, 'which has 214m views', 'C:default/path/to/your/audio/file')
(3462, 'which has been delayed by a year because of the pandemic', 'C:default/path/to/your/audio/file')
(3463, 'which has only exacerbated its already considerable delays', 'C:default/path/to/your/audio/file')
(3464, 'which has raised $11.5m', 'C:default/path/to/your/audio/file')
(3465, 'which has secured $4.5m in seed funding', 'C:default/path/to/your/audio/file')
(3466, 'which has sold more than 10 million copies worldwide', 'C:default/path/to/your/audio/file')
(3467, 'which have been a

In [38]:
from tqdm import tqdm

# Insert data into DB

In [39]:
# Do the insert
for content in tqdm(final_content):
    cursor.execute("insert into ATData(TEXT) values (?)", str(content))
    conn.commit()

100%|███████████████████████████████████████████████████████████████████| 3656/3656 [00:02<00:00, 1256.91it/s]
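
Committing after every row, as the loop above does, runs one transaction per insert. The DB-API method `executemany` followed by a single commit is a common alternative. The sketch below uses the standard-library `sqlite3` module and an in-memory database as a stand-in for the SQL Server connection, so it runs anywhere. The two-column table layout (`ID`, `TEXT`) and the helper name `insert_fragments` are assumptions; `pyodbc` cursors expose `executemany` with the same `?` placeholders.

```python
import sqlite3


def insert_fragments(conn, fragments):
    """Insert all fragments in one batch and commit once."""
    cursor = conn.cursor()
    # Each row is a one-element tuple matching the single ? placeholder
    rows = [(str(fragment),) for fragment in fragments]
    cursor.executemany("insert into ATData(TEXT) values (?)", rows)
    conn.commit()
    return len(rows)


# In-memory stand-in for the real database (assumed table layout)
conn = sqlite3.connect(":memory:")
conn.execute("create table ATData (ID integer primary key, TEXT text)")
insert_fragments(conn, ["For his part", "of course"])
print(conn.execute("select ID, TEXT from ATData order by ID").fetchall())
# [(1, 'For his part'), (2, 'of course')]
```

With a single commit, a failure partway through leaves nothing half-written once the transaction is rolled back, which may or may not be what is wanted for a scraping run.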


In [19]:

import pyodbc 

# Run SELECT @@SERVERNAME in SQL Server to find the server name

# The Server line is a raw string so the backslash is not read as an escape
conn = pyodbc.connect('Driver={SQL Server};'
                      r'Server=LAPTOP-Q0DB1ITI\SQLEXPRESS;'
                      'Database=DB;'
                      'Trusted_Connection=yes;')

cursor = conn.cursor()
cursor.execute("""SELECT TOP(1) FROM user""")
for i in cursor:
    print(i)

ProgrammingError: ('42000', "[42000] [Microsoft][ODBC SQL Server Driver][SQL Server]Incorrect syntax near the keyword 'FROM'. (156) (SQLExecDirectW)")
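
The statement fails because `SELECT TOP(1)` is followed directly by `FROM` with no column list. `USER` is also a reserved keyword in T-SQL, so a table with that name has to be delimited. The sketch below builds a corrected statement as a string. `quote_identifier` and `top_rows_query` are hypothetical helpers, and whether a `user` table exists in this database is not shown in the notebook.

```python
def quote_identifier(name):
    """Delimit a T-SQL identifier with brackets, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def top_rows_query(table, n=1):
    """Build a SELECT TOP (n) * statement for the given table name."""
    return f"SELECT TOP ({int(n)}) * FROM {quote_identifier(table)}"


print(top_rows_query("user"))
# SELECT TOP (1) * FROM [user]
```

The resulting string can be passed to `cursor.execute` in place of the failing statement.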