# News Scraping Tool for Scraping Multiple Media Websites 

#### By Joyce Jiang | Code by Joyce

Acknowledgement: This code is adapted from holwech's blog post "Automatic news scraping with Python, Newspaper and Feedparser", available at https://holwech.github.io/blog/Automatic-news-scraper/. The author also provides the code on GitHub at https://github.com/holwech/NewsScraper. The scraper uses the Newspaper library to scrape multiple news websites in a single run. The Newspaper documentation is at https://github.com/codelucas/newspaper.

Thanks to holwech for sharing the Python code. With the help of Newspaper's documentation, I modified the script to collect additional fields from each article, such as authors, keywords, and summary. I also developed a method to convert the scraper's JSON export file into a dataframe. This script is published for study and research purposes only and is not intended for commercial use.

I tested holwech's method with an example JSON file of Ugandan newspapers, `NewsPapers_Uganda.json`. I will also share this file in my GitHub repo.

In [1]:
import feedparser as fp
import json
import newspaper
from newspaper import Article
from time import mktime
from datetime import datetime

### Set the limit for number of articles to download

In [18]:

LIMIT = 30

data = {}
data['newspapers'] = {}

In [19]:
with open('NewsPapers_Uganda.json') as data_file:
    companies = json.load(data_file)
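
For reference, the loop below reads only two fields per outlet: a `link` and an optional `rss`. Here is a minimal sketch of a config file in that shape. The file name `NewsPapers_example.json` is hypothetical, and the two entries are borrowed from the sample output further down, since the exact contents of `NewsPapers_Uganda.json` are not shown here.

```python
import json

# Hypothetical config in the shape the scraping loop expects:
# outlet name -> {"link": homepage URL, "rss": optional feed URL}.
example_companies = {
    "theguardian": {
        "link": "https://www.theguardian.com/international",
        "rss": "https://www.theguardian.com/uk/rss",
    },
    # No "rss" key: this outlet would go through the newspaper.build() fallback.
    "theonion": {"link": "http://www.theonion.com/"},
}

# Write the config, then read it back the same way the notebook does.
with open("NewsPapers_example.json", "w") as config_file:
    json.dump(example_companies, config_file, indent=4)

with open("NewsPapers_example.json") as data_file:
    companies_example = json.load(data_file)

# Show which scraping path each outlet would take.
for company, value in companies_example.items():
    source = "RSS feed" if value.get("rss") else "newspaper.build()"
    print(company, "->", source)
```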

### Iterate through each news company

In [20]:
count = 1

# Iterate through each news company
for company, value in companies.items():
    # If an RSS link is provided in the JSON file, it is the first choice,
    # because RSS feeds often give more consistent and correct data.
    # If you do not want to scrape from the RSS feed, leave the rss attribute empty (or omit it) in the JSON file.
    if value.get('rss'):
        d = fp.parse(value['rss'])
        print("Downloading articles from ", company)
        newsPaper = {
            "rss": value['rss'],
            "link": value['link'],
            "articles": []
        }
        for entry in d.entries:
            # Check that a parsed publish date is available; if not, the article is skipped.
            # This keeps the data consistent and keeps the script from crashing.
            if getattr(entry, 'published_parsed', None):
                if count > LIMIT:
                    break
                article = {}
                article['link'] = entry.link
                date = entry.published_parsed
                article['published'] = datetime.fromtimestamp(mktime(date)).isoformat()
                try:
                    content = Article(entry.link)
                    content.download()
                    content.parse()
                except Exception as e:
                    # If the download fails for some reason (e.g. a 404), the script continues
                    # with the next article.
                    print(e)
                    print("continuing...")
                    continue
                article['title'] = content.title
                article['author'] = content.authors
                # keywords and summary are only populated after calling content.nlp()
                # article['keywords'] = content.keywords
                # article['summary'] = content.summary
                article['text'] = content.text
                newsPaper['articles'].append(article)
                print(count, "articles downloaded from", company, ", url: ", entry.link)
                count = count + 1
                
    else:
        # This is the fallback method if an RSS feed link is not provided.
        # It uses the Python newspaper library to extract articles.
        print("Building site for ", company)
        paper = newspaper.build(value['link'], memoize_articles=False)
        newsPaper = {
            "link": value['link'],
            "articles": []
        }
        noneTypeCount = 0
        for content in paper.articles:
            if count > LIMIT:
                break
            try:
                content.download()
                content.parse()
            except Exception as e:
                print(e)
                print("continuing...")
                continue
            # Again, for consistency, an article with no publish date is skipped.
            # After more than 10 consecutive articles without a publish date, the rest of this newspaper is skipped.
            if content.publish_date is None:
                print(count, " Article has date of type None...")
                noneTypeCount = noneTypeCount + 1
                if noneTypeCount > 10:
                    print("Too many noneType dates, aborting...")
                    noneTypeCount = 0
                    break
                count = count + 1
                continue
            article = {}
            article['title'] = content.title
            article['author'] = content.authors
            # keywords and summary are only populated after calling content.nlp()
            # article['keywords'] = content.keywords
            # article['summary'] = content.summary
            article['text'] = content.text
            article['link'] = content.url
            article['published'] = content.publish_date.isoformat()
            newsPaper['articles'].append(article)
            print(count, "articles downloaded from", company, " using newspaper, url: ", content.url)
            count = count + 1
            noneTypeCount = 0
            
    count = 1
    data['newspapers'][company] = newsPaper

Building site for  Daily Monitor
1  Article has date of type None...
2  Article has date of type None...
3  Article has date of type None...
4  Article has date of type None...
5  Article has date of type None...
6  Article has date of type None...
7 articles downloaded from Daily Monitor  using newspaper, url:  https://www.monitor.co.ug/News/National/Report-projects-fall-tourism-fortunes--/688334-5577420-10te9ge/index.html
8 articles downloaded from Daily Monitor  using newspaper, url:  https://www.monitor.co.ug/News/National/Politics-alliances-against-Museveni-past-elections/688334-5577370-3m3cs4/index.html
9 articles downloaded from Daily Monitor  using newspaper, url:  https://www.monitor.co.ug/News/National/688334-5577396-mm2s7nz/index.html
10 articles downloaded from Daily Monitor  using newspaper, url:  https://www.monitor.co.ug/News/National/Top-aviation-bosses-under-investigation-minister/688334-5577390-fw7itfz/index.html
11  Article has date of type None...
12  Article has da

2 articles downloaded from CHIMP REPORTS  using newspaper, url:  https://chimpreports.com/uganda-must-strengthen-its-institutions-to-realize-socio-economic-transformation/
3 articles downloaded from CHIMP REPORTS  using newspaper, url:  https://chimpreports.com/mother-of-triplets-camps-at-district-officers-as-husband-flees/
4 articles downloaded from CHIMP REPORTS  using newspaper, url:  https://chimpreports.com/eacj-court-concludes-its-inception-online-sessions-successfully/
5 articles downloaded from CHIMP REPORTS  using newspaper, url:  https://chimpreports.com/private-school-owners-ask-govt-to-create-education-recovery-fund/
6 articles downloaded from CHIMP REPORTS  using newspaper, url:  https://chimpreports.com/masaza-cup-gomba-seals-17th-signing/
7 articles downloaded from CHIMP REPORTS  using newspaper, url:  https://chimpreports.com/4-arrested-for-killing-buffalo-calf/
8 articles downloaded from CHIMP REPORTS  using newspaper, url:  https://chimpreports.com/court-orders-taxi-b

1  Article has date of type None...
2  Article has date of type None...
3  Article has date of type None...
4  Article has date of type None...
5  Article has date of type None...
6  Article has date of type None...
7  Article has date of type None...
8  Article has date of type None...
9  Article has date of type None...
10  Article has date of type None...
11  Article has date of type None...
Too many noneType dates, aborting...
Building site for  THE INDEPENDENT
1 articles downloaded from THE INDEPENDENT  using newspaper, url:  https://www.independent.co.ug/german-virus-hunters-track-down-corona-outbreaks/
2 articles downloaded from THE INDEPENDENT  using newspaper, url:  https://www.independent.co.ug/a-world-redrawn-nobel-winner-deaton-warns-virus-could-worsen-inequality/
3 articles downloaded from THE INDEPENDENT  using newspaper, url:  https://www.independent.co.ug/open-letter-to-president-yoweri-museveni-decision-not-to-reopen-schools-may-do-more-harm-than-good/
4 articles downl

19 articles downloaded from BUKEDDE  using newspaper, url:  https://www.bukedde.co.ug/news/1520710/ebiri-mu-kiraamo-kya-kasirye-ggwanga
20 articles downloaded from BUKEDDE  using newspaper, url:  https://www.bukedde.co.ug/amawulire/1520743/ebyewuunyisa-ku-kasirye-gwanga-obadde-tomanyi
21 articles downloaded from BUKEDDE  using newspaper, url:  https://www.bukedde.co.ug/news/1520669/ekyagaanyi-kasirye-ggwanga-okuziikibwa-ku-kiggya-kya-ffamire
22 articles downloaded from BUKEDDE  using newspaper, url:  https://www.bukedde.co.ug/ssenga/1520933/engeri-abafumbo-gye-babadde-bakukuta-ku-ssimu-mu-kalantiini
23 articles downloaded from BUKEDDE  using newspaper, url:  https://www.bukedde.co.ug/ssenga/1520931/kikyamu-abaana-bo-okusiibanga-ku-muliraano-olw-obutagula-byakulya-waka
24 articles downloaded from BUKEDDE  using newspaper, url:  https://www.bukedde.co.ug/ssenga/1520935/njagala-mwami-alina-empisa
Article `download()` failed with 500 Server Error: Internal Server Error for url: https://www

19 articles downloaded from OBSERVER  using newspaper, url:  https://observer.ug/appnews/59403-unrest-as-banks-deny-govt-agencies-loans
20 articles downloaded from OBSERVER  using newspaper, url:  https://observer.ug/life/59383-mother-kevin-embarks-on-long-journey-to-beatification
21 articles downloaded from OBSERVER  using newspaper, url:  https://observer.ug/appnews/59189-bou-fighting-for-most-critical-asset-reputation
22 articles downloaded from OBSERVER  using newspaper, url:  https://observer.ug/businessnews/65254-mtn-pays-shs-372bn-for-operating-licence-for-next-12-years
23 articles downloaded from OBSERVER  using newspaper, url:  https://observer.ug/businessnews/65203-standard-bank-predicts-tough-future-for-uganda-s-economy
24 articles downloaded from OBSERVER  using newspaper, url:  https://observer.ug/businessnews/65202-business-leaders-call-for-more-digital-platforms
25 articles downloaded from OBSERVER  using newspaper, url:  https://observer.ug/businessnews/65226-mtn-discus

10 articles downloaded from PC TECH  using newspaper, url:  https://pctechmag.com/2020/06/airtel-10-years-in-uganda/
11 articles downloaded from PC TECH  using newspaper, url:  https://pctechmag.com/2020/06/appointing-nira-boss-with-it-backgroun/
12 articles downloaded from PC TECH  using newspaper, url:  https://pctechmag.com/2020/06/eversend-donation-feature/
13 articles downloaded from PC TECH  using newspaper, url:  https://pctechmag.com/2020/06/undp-offer-govt-ict-equipment-to-combat-covid19/
14 articles downloaded from PC TECH  using newspaper, url:  https://pctechmag.com/2020/06/hiil-uganda-kampala-legal-hackathon/
15 articles downloaded from PC TECH  using newspaper, url:  https://pctechmag.com/2020/06/u-s-huawei-ban-is-about-5g-leadership-race/
16 articles downloaded from PC TECH  using newspaper, url:  https://pctechmag.com/2020/06/mtn-clarifies-on-ruling-to-pay-ugx-24-billion-in-taxes-to-ura/
17 articles downloaded from PC TECH  using newspaper, url:  https://pctechmag.com/2

10 articles downloaded from BIG EYE  using newspaper, url:  https://bigeye.ug/sheilah-gashumba-states-the-things-she-loves-about-gods-plan/
11 articles downloaded from BIG EYE  using newspaper, url:  https://bigeye.ug/all-you-need-to-know-about-rising-star-revboy/
12 articles downloaded from BIG EYE  using newspaper, url:  https://bigeye.ug/chamili-congratulates-spice-daina-upon-receiving-youtube-award/
13 articles downloaded from BIG EYE  using newspaper, url:  https://bigeye.ug/mpaka-records-dre-cali-drops-new-song-ebisooka-ne-bisembayo/
14 articles downloaded from BIG EYE  using newspaper, url:  https://bigeye.ug/video-iry-tina-releases-drops-new-hold-me/
15 articles downloaded from BIG EYE  using newspaper, url:  https://bigeye.ug/video-alert-vivian-tendo-praises-imaginary-lover-in-mu-kati/
16 articles downloaded from BIG EYE  using newspaper, url:  https://bigeye.ug/video-you-are-a-legend-in-this-game-rude-boy-tells-chameleone/
17 articles downloaded from BIG EYE  using newspaper,

13 articles downloaded from SQOOP  using newspaper, url:  https://www.sqoop.co.ug/202006/four-one-one/ratata-its-zex-bilangilangi.html
14 articles downloaded from SQOOP  using newspaper, url:  https://www.sqoop.co.ug/202006/four-one-one/eddy-kenzo-spice-diana-eat-big.html
15 articles downloaded from SQOOP  using newspaper, url:  https://www.sqoop.co.ug/202006/four-one-one/cinemas-open-then-asked-to-close-again.html
16 articles downloaded from SQOOP  using newspaper, url:  https://www.sqoop.co.ug/202006/news/bad-blacks-advert-was-voluntary-says-government.html
17 articles downloaded from SQOOP  using newspaper, url:  https://www.sqoop.co.ug/202006/news/bad-black-sues-govt-over-failure-to-pay-for-covid-19-advert.html
18 articles downloaded from SQOOP  using newspaper, url:  https://www.sqoop.co.ug/202006/news/sacked-sanyu-fm-employees-blame-managers-over-strike.html
19 articles downloaded from SQOOP  using newspaper, url:  https://www.sqoop.co.ug/202006/features/celebrity-profiles/i-migh
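
The `keywords` and `summary` lines are commented out in the loop above. In Newspaper, those two attributes are only filled in after calling `nlp()` on an article that has already been downloaded and parsed, and `nlp()` needs NLTK's tokenizer data installed. Below is a minimal sketch of a helper that builds the same article dictionary with the NLP fields optional. `article_to_record` and the `FakeArticle` stand-in are hypothetical names used so the example runs offline; they are not part of Newspaper.

```python
def article_to_record(content, with_nlp=False):
    """Build the article dict used in the loop from a parsed Article-like object."""
    record = {
        "title": content.title,
        "author": content.authors,
        "text": content.text,
        "link": content.url,
    }
    if with_nlp:
        # In Newspaper, nlp() populates .keywords and .summary.
        content.nlp()
        record["keywords"] = content.keywords
        record["summary"] = content.summary
    return record


class FakeArticle:
    """Hypothetical offline stand-in with the attributes the helper reads."""

    def __init__(self):
        self.title = "Example title"
        self.authors = ["Example Author"]
        self.text = "Example body text."
        self.url = "https://example.com/article"
        self.keywords = []
        self.summary = ""

    def nlp(self):
        # A real Article computes these from the text; here they are fixed values.
        self.keywords = ["example", "news"]
        self.summary = "Example summary."


plain_record = article_to_record(FakeArticle())
nlp_record = article_to_record(FakeArticle(), with_nlp=True)
print(sorted(plain_record))
print(sorted(nlp_record))
```

With a real `newspaper.Article`, the call would come after `content.download()` and `content.parse()`, inside the same `try` block.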

#### *The number of articles that can be scraped depends on each website. In my experience, the limit varies from 25 to 100 articles per site. With an RSS feed URL, you will be able to scrape more articles.
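
A note on the date handling in the RSS branch: feedparser describes `published_parsed` as a time tuple normalized to UTC, while `time.mktime` interprets its argument as local time, so the `mktime`/`fromtimestamp` round trip can be off by an hour around daylight-saving changes. A hedged alternative, assuming timezone-aware UTC timestamps are acceptable in the output, uses `calendar.timegm`. `struct_time_to_iso` is a hypothetical helper name.

```python
import calendar
import time
from datetime import datetime, timezone


def struct_time_to_iso(parsed):
    """Convert a UTC time tuple (as feedparser is assumed to return) to an ISO string."""
    # timegm treats the tuple as UTC, unlike mktime, which assumes local time.
    epoch_seconds = calendar.timegm(parsed)
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


# Build a time tuple by hand instead of fetching a live feed.
example = time.strptime("2020-06-16 13:43:52", "%Y-%m-%d %H:%M:%S")
print(struct_time_to_iso(example))
```

Unlike the original conversion, the resulting string carries an explicit `+00:00` offset.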

In [7]:
try:
    with open('scraped_articles.json', 'w') as outfile:
        json.dump(data, outfile)
except Exception as e:
    print(e)

### Convert json file to dataframe

In [8]:
import csv 
import pandas as pd
import pprint
import pprintpp

In [9]:
with open('scraped_articles.json') as json_file: 
    data = json.load(json_file) 
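
The nested `data['newspapers']` dictionary can be flattened to one row per article before building a dataframe. The following is a sketch under that assumption, not necessarily the conversion used in this notebook. `flatten_articles` and the inline `sample` are hypothetical, with values borrowed from the printed output and the article text left out.

```python
def flatten_articles(data):
    """Return one flat dict per article, tagged with its newspaper name."""
    rows = []
    for company, paper in data["newspapers"].items():
        for article in paper["articles"]:
            row = {"newspaper": company}
            row.update(article)
            # Authors are a list; join them so each cell holds a single string.
            row["author"] = ", ".join(article.get("author", []))
            rows.append(row)
    return rows


sample = {
    "newspapers": {
        "theguardian": {
            "link": "https://www.theguardian.com/international",
            "rss": "https://www.theguardian.com/uk/rss",
            "articles": [
                {
                    "author": ["Kate Connolly"],
                    "link": "https://www.theguardian.com/world/2020/jun/16/germany-appeals-to-nation-to-download-coronavirus-app",
                    "published": "2020-06-16T14:21:11",
                    "title": "Germany appeals to nation to download coronavirus app",
                }
            ],
        },
        # An outlet with no scraped articles contributes no rows.
        "theonion": {"articles": [], "link": "http://www.theonion.com/"},
    }
}

rows = flatten_articles(sample)
print(rows)
```

With pandas imported as `pd`, `pd.DataFrame(rows)` accepts this list of dictionaries directly, giving one column per key.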

In [10]:
pprintpp.pprint(data)

{
    'newspapers': {
        'bbc': {
            'articles': [
                {
                    'author': ['Mark Piesing'],
                    'link': 'https://www.bbc.com/future/article/20200615-the-record-breaking-jet-which-still-haunts-a-country',
                    'published': '2020-06-15T00:00:00',
                    'text': 'In the early years of the Cold War, Canada decided to design and build the most advanced fighter aircraft in the world.\n\nCanada is well known for its rugged bush planes, capable of rough landings and hair-raising take-offs in the wilderness. From the late 1930s, the North American country had also started to manufacture British-designed planes for the Allied war effort. Many of these planes were iconic wartime designs like the Hawker Hurricane fighter and Avro Lancaster bomber.\n\nAmbitious Canadian politicians and engineers weren’t satisfied with this. They decided to forge a world-leading aircraft manufacturing industry out of the factories and

                    'text': 'The COVID Tracking Project reported on Monday that COVID-19 deaths continue to fall in the United States but also noted in a series of tweets the number of cases is up slightly over the past seven days and that they are watching closely five states where current hospitalizations have increased over the past two weeks.\n\nThese are the five states we’re watching most closely. pic.twitter.com/cHnFPHUg9M — The COVID Tracking Project (@COVID19Tracking) June 15, 2020\n\nStates reported 18,521 positives today, up from 17,123 last Monday. pic.twitter.com/XcIQed7QbD — The COVID Tracking Project (@COVID19Tracking) June 15, 2020\n\nNate Silver added his own analysis of the recent COVID-19 data:\n\nThere continues to be this ebb-and-flow where people alternatively become too optimistic or too pessimistic in ways that are somewhat detached from the data. Certainly, the death numbers have been better, lately. 7-day average now at ~700, down from ~2,000 at the peak. — Na

                    'text': "The Supreme Court’s (SCOTUS) reinterpretation of a federal prohibition against employment discrimination based on sex — which now includes sexual orientation and “gender identity” — will “create a tsunami of new litigation” against religious organizations, explained Carrie Severino, president of the Judicial Crisis Network, offering her remarks on SiriusXM’s Breitbart News Tonight with host Rebecca Mansour.\n\nThe Supreme Court’s legal redefinition of “sex” to include sexual orientation and “gender identity” opens the door for further left-wing lawfare against religious organizations, Severino noted.\n\nSeverino said, “The Supreme Court left a lot of really important issues open, like, how do you balance this with religious freedom? How do you balance it with freedom of speech? If you’ve got a law, for example, saying that using someone’s preferred pronoun is mandatory — or you can be fined [for non-compliance], how do we balance that with some of these oth

                    'text': 'In an interview with Politico on Sunday, Chicago Democrat Congressman Bobby Rush attacked Chicago law enforcement.\n\nLong-time activist and career politician Bobby L. Rush unleashed a scathing review of law enforcement officers on Sunday, comparing them to the Ku Klux Klan. “The number-one cause that prevents police accountability, that promotes police corruption, that protects police lawlessness, is a culprit called the Fraternal Order of Police,” he told Politico, calling the two organizations “kissing, hugging and law-breaking cousins”:\n\nThey’re the organized guardians of continuous police lawlessness, of police murder and police brutality. The Chicago Fraternal Order of Police is the most rabid, racist body of criminal lawlessness by police in the land. It stands shoulder to shoulder with the Ku Klux Klan then and the Ku Klux Klan now.\n\nRush’s words were part of his response to a leaked video showing police at leisure in his office on the weekend o

                    'link': 'http://edition.cnn.com/2020/06/11/health/regeneron-covid-19-antibody-trial-starts/index.html',
                    'published': '2020-06-11T00:00:00',
                    'text': '(CNN) A medicine that may treat and prevent Covid-19 is now being tested in patients in multiple sites around the United States, according to an announcement Thursday from Regeneron Pharmaceuticals Inc.\n\nIt is the first trial of a Covid-19 antibody cocktail in the United States. If successful, Regeneron hopes it could be available by the fall.\n\nThe clinical trial started Wednesday. Regeneron said its antibody cocktail will be tested in four separate study populations: people who are hospitalized with Covid-19; people who have symptoms for the disease, but are not hospitalized; people who are healthy but are at a high risk for getting sick; and healthy people who have come into close contact with a person who is sick.\n\n"We have created a unique anti-viral antibody cocktail wi

                    'text': '(CNN) The widely available steroid drug dexamethasone may be key in helping to treat the sickest Covid-19 patients who require ventilation or oxygen, according to researchers in the United Kingdom.\n\nTheir findings are preliminary, still being compiled and have not been published in a peer-reviewed journal -- but some not involved with the study called the results a breakthrough\n\nThe two lead investigators of the Recovery Trial , a large UK-based trial investigating potential Covid-19 treatments, announced to reporters in a virtual press conference on Tuesday that a low-dose regimen of dexamethasone for 10 days was found to reduce the risk of death by a third among hospitalized patients requiring ventilation in the trial.\n\n"That\'s a highly statistically significant result," Martin Landray, deputy chief investigator of the trial and a professor at the University of Oxford , said on Tuesday.\n\n"This is a completely compelling result. If one looks at th

                    'text': '(CNN) The United States could see more than 200,000 deaths from Covid-19 by October 1, a closely watched model predicted Monday as states continue to reopen.\n\nMore than 2 million have been infected by the virus and 116,125 have died, according to data from Johns Hopkins University . Though many states are seeing improved conditions, the pandemic has not yet reached its conclusion. The projection comes as 18 states are still seeing an upward trend in new cases.\n\n"Increased mobility and premature relaxation of social distancing led to more infections, and we see it in Florida, Arizona, and other states," said Ali Mokdad, one of the creators of the model from the Institute for Health Metrics and Evaluation (IHME) at the University of Washington.\n\n"This means more projected deaths."\n\nAlthough daily death rates are expected in drop in June and July, the model forecasts a second hike in deaths through September, culminating in 201,129 by October 1.\n\nThe

                    'text': 'Violent leftists hijacked a healthcare worker protest in Paris, prompting running battles with riot police as chaos ensued.\n\nThe original demonstration by healthcare workers sought to guarantee better pay and increased funding, but it was soon usurped by ‘black bloc’ Antifa-style militants who attacked cops with missiles and overturned vehicles.\n\nThe far left are rampaging in Paris now, France is facing internal wars on many grounds. pic.twitter.com/75iK5I03dv — Jack Dawkins (@DawkinsReturns) June 16, 2020\n\nSeveral thousand workers descended on the Ministry of Health, but attention was dragged elsewhere in the city as left-wing radicals staged violent riots.\n\n“These more violent demonstrators hurled bottles and fireworks at the police, who responded by firing tear gas at the mob, some of whom flipped vehicles and set fires,” reports RT.\n\nAt some of the more peaceful demonstrations, police applauded the healthcare workers, but their colleagues were

                    'text': "Watch and share this bombshell Tuesday edition of the most banned broadcast in the world! You do not want to miss this one!\n\n\n\nFollow Alex Jones on Telegram:\n\n\n\nDavid Knight Show: Gorsuch Comes Out As a Statist & Why Trump Should’ve Listened To Rand Paul\n\nRemember to share this banned broadcast!\n\nOn this Tuesday edition of The David Knight Show, we’ll examine recent rulings from the Supreme Court that show Judge Neil Gorsuch has aligned himself with the establishment.\n\nAlso, what is the “right way” to use deadly force to protect yourself and why doesn’t President Trump listen to Rand Paul?\n\nToday's News LIVE 9AM EASTERN\n\n➡\ufe0f#SCOTUS: Gorsuch Comes Out — as Statist\n\nas court takes a hard left turn on #2A & other issues\n\n➡\ufe0fDeadly Force — the right and wrong way illustrated by citizen & cop\n\n➡\ufe0fBarbarians are INSIDE the Gate\n\n➡\ufe0f#Bolton: Trump should've listened to @RandPaul — David Knight (@libertytarian) June 16, 202

                    'text': 'Demand for luxury properties in Aspen, Colorado, and Park City, Utah, is “through the roof,” explained Mauricio Umansky, CEO of real estate firm The Agency, who recently spoke with Fox Business.\n\nUmansky said the pandemic had accelerated the trend of wealthy vacationers staying year-round in rural, resort communities.\n\n“A lot of traveling to Europe this year is probably nonexistent … And so I think a lot of Americans are looking where to enjoy the summer,” he said.\n\nAs we’ve previously noted, “there’s a mad rush” of wealthy folks leaving big cities due to the virus pandemic, economic crash, and social unrest. It was noted by Sotheby’s relators that people in the San Francisco Bay Area are fleeing the city for rural communities, such as Marin County, Napa wine country, and south to Monterey’s Carmel Valley.\n\nSome rich people have also fled to their luxury doomsday bunkers — but it seems, for the mainstream household with a couple million dollars in t

                    'published': '2020-06-16T13:43:52+00:00',
                    'text': 'Get breaking news alerts and special reports. The news and stories that matter, delivered weekday mornings.\n\nWall Street started the day with massive gains across all three major averages, after positive retail sales data boosted confidence in a swifter economic recovery, and a drug trial showed "breakthrough" results for COVID-19 treatment.\n\nThe Dow Jones Industrial Average traded higher by around 850 points, with the S&P 500 up by 2.7 percent and the Nasdaq up by 2.3 percent at the opening bell.\n\nMonthly retail sales data released Tuesday morning from the Census Bureau showed a 17.7 percent increase, the biggest jump on record. Economists were expecting a gain of around 8 percent. Data for April registered the the largest monthly drop ever, down 14.7 percent.\n\nLet our news meet your inbox. The news and stories that matters, delivered weekday mornings. This site is protected by recaptcha

                    'text': "A patient almost died after being misdiagnosed and sent home from hospital on the first day of the lockdown as the NHS curtailed many normal services to focus on Covid-19.\n\nThe NHS trust involved has admitted that its failings led to the man suffering excruciating pain, developing life-threatening blood poisoning, and contracting the flesh-eating bug necrotising fasciitis. He needed eight operations to remedy the damage caused by his misdiagnosis.\n\nPrivate hospitals can help the NHS recover from Covid-19 - here's how | Jim Mackey Read more\n\nThe man, his wife and his GP spent three weeks after his discharge trying to get him urgent medical care. However, St Mary’s hospital on the Isle of Wight rejected repeated pleas by them for doctors to help him, even though his health was deteriorating sharply.\n\nThe man, who does not want to be named, said his experience of seeking NHS care for something other than Covid-19 during the pandemic had been “debilitat

                    'text': 'Four workers describe how they’ve coped since losing their jobs as the pandemic hit\n\nUK unemployment is expected to climb over the next few months from one of the lowest in the developed world, at 3.9%, to a much higher level. The Bank of England has warned that the rate could more than double to 9%.\n\nOfficial figures released on Tuesday, covering the jobs market in April and May, showed the number of UK payrolls fell by more than 600,000 between March and May, as the impact of the Covid-19 crisis begins to feed through to the official jobs figures.\n\nThe Guardian has spoken to workers who were made unemployed and lost work in March and April.\n\nIzabela Ceckowska, 32, waitress, Oxfordshire\n\nFacebook Twitter Pinterest Izabela Ceckowska has been existing on jobseeker’s allowance: £74 per week.\n\nCeckowska said she took a week off from 16 March and was planning to visit Poland to see her family but her company instructed her not to because of the risk

                    'title': 'Charities for deaf people call for more see-through face masks',
                },
                {
                    'author': ['Kalyeena Makortoff'],
                    'link': 'https://www.theguardian.com/business/2020/jun/16/greggs-reopen-stores-sales-coronavirus-crisis',
                    'published': '2020-06-16T10:26:34',
                    'text': 'Greggs is pressing ahead with plans to reopen 800 stores on Thursday but has said it expects sales to be lower than normal for some time, prompting the bakery chain to suspend plans to open new stores and seek rent reductions from its landlords.\n\nThe Newcastle-based chain, which has 2,050 shops, is planning to reopen the remainder of its stores in early July, with Covid-19 safety measures in place such as protective screens at its counters and markings to help ensure customers maintain physical distancing.\n\n“We are not able to predict the impact of social distancing on our ability to trade or

                    'title': 'New Zealand ends Covid-free run with two cases from UK',
                },
                {
                    'author': ['Kate Connolly'],
                    'link': 'https://www.theguardian.com/world/2020/jun/16/germany-appeals-to-nation-to-download-coronavirus-app',
                    'published': '2020-06-16T14:21:11',
                    'title': 'Germany appeals to nation to download coronavirus app',
                },
            ],
            'link': 'https://www.theguardian.com/international',
            'rss': 'https://www.theguardian.com/uk/rss',
        },
        'theonion': {'articles': [], 'link': 'http://www.theonion.com/'},
        'washingtonpost': {
            'articles': [
                {
                    'author': [
                        'William Booth',
                        'London Bureau Chief',
                        'Quentin Aries',
                        'Karla Adam',
                        'London Correspond

                    'text': 'It is fair to say that Britain, with the third-highest coronavirus death toll in the world and its prime minister forced into an intensive care unit to fight for his life, has been distracted from long-running Brexit debates over cod fisheries, customs duties in the Irish Sea and shared aviation standards.\n\nAD\n\nBut Brexit is still a big deal for Prime Minister Boris Johnson, whose political career is built upon delivering it, even if pollsters say most Britons now care much more about the economic crisis and the virus.\n\nAD\n\nOn Monday, Johnson and European Commission President Ursula von der Leyen, alongside leaders of the European Council and European Parliament, held their first high-level talks in months — via a videoconference call between London and Brussels.\n\nAfter the closed session, Johnson cheerily said a deal could be reached by July with “a bit of oomph.”\n\nEuropean leaders pushed back and said there would be no deal unless Britain agre

                    'text': 'An officer and two soldiers died during a “violent faceoff” late Monday that caused “casualties on both sides,” the Indian Army said in a statement. It did not say how they were killed.\n\nSenior military officers from both countries are meeting to “defuse the situation,” the army added.\n\nAD\n\nChinese Foreign Ministry spokesman Zhao Lijian did not confirm any Chinese casualties at a news briefing Tuesday.\n\nAD\n\nZhao told reporters that Indian troops had twice crossed the Line of Actual Control, the unofficial border that divides the two countries. He accused India of “provoking and attacking Chinese personnel, which led to serious physical conflicts between the two sides.”\n\nThe clashes come at a time when China is flexing its muscles across the region amid a global pandemic. In recent weeks, it has confronted Malaysian and Vietnamese vessels in the South China Sea and twice sailed an aircraft carrier through the Taiwan Strait.\n\nIndia and China fou

                    'text': 'They would talk from several feet apart — Smith, 27, firmly planted on the U.S. side in Washington state and Bosello, 31, on the Canadian side. Border officers eavesdropped, and trucks sped by, drowning out their already muffled conversations, they said.\n\nOther couples and families started showing up at the 0 Avenue ditch, too. For months, it was a strange and dusty meetup spot where couples would go to see each other and would often notice others doing the same.\n\nAD\n\nAD\n\nMeeting this way was painful, Bosello said, “but it was still better than nothing.”\n\nAs the weather warmed and shutdowns lifted, a superior reunion spot emerged in mid-May: Peace Arch Park. There, cross-national couples and families could actually embrace — at long last.\n\nThe recently reopened park is situated between Blaine, Wash., and Surrey, B.C.\n\n“When I finally hugged him again, it felt like it was the first time I ever did,” Bosello said.\n\nIn fact, the couple, who has

                },
                {
                    'author': [
                        'Ishaan Tharoor',
                        'Columnist Covering Foreign Affairs',
                        'Geopolitics',
                    ],
                    'link': 'https://www.washingtonpost.com/world/2020/06/16/israel-annexation-is-saying-quiet-part-loud/?utm_source=rss&utm_medium=referral&utm_campaign=wp_world',
                    'published': '2020-06-16T17:00:00',


                    'title': 'For Israel, West Bank annexation is saying the quiet part loud',
                },
                {
                    'author': [
                        'Isabelle Khurshudyan',
                        'Foreign Correspondent Based In Moscow',
                    ],
                    'link': 'https://www.washingtonpost.com/world/europe/verdict-for-american-paul-whelan-held-in-russia-on-spy-charges-expected-monday/2020/06/15/175fc7a4-aa5e-11ea-a43b-be9f6494a87d_story.html?utm_source=rss&utm_medium=referral&utm_campaign=wp_world',
                    'published': '2020-06-15T07:42:00',


                    'text': 'Whelan, arrested on Dec. 28, 2018, has said he thought the flash drive that he received from an acquaintance contained holiday photos. He said Monday that he plans to appeal the court’s decision.\n\nAD\n\nNow that Whelan has been convicted, speculation is rife about a possible prisoner exchange with the United States. Zherebenkov said Monday that “Paul expected this decision because even when he was detained, he was told [by Russian security service agents] that he would be exchanged.”\n\nAD\n\nWithout revealing his source, Zherebenkov said he was told Konstantin Yaroshenko, a pilot who was arrested in 2010 for conspiracy to smuggle cocaine into the United States, and Viktor Bout, a gun runner who inspired the 2005 Hollywood film “Lord of War,” are the people the Kremlin is focused on as possible trades for Whelan’s release.\n\n“I heard talk that, why should we waste time on the appeal if we can just go ahead with the exchange?” Zherebenkov said. “I can’t g

                    'title': 'A violent hailstorm and flooding struck Calgary, Canada, on Saturday',
                },
                {
                    'author': [
                        'Miriam Berger',
                        'Reporter Covering Middle East',
                        'Foreign Affairs',
                    ],
                    'link': 'https://www.washingtonpost.com/world/2020/06/15/coronavirus-surge-new-global-hotspots/?utm_source=rss&utm_medium=referral&utm_campaign=wp_world',
                    'published': '2020-06-15T17:58:54',


                    'title': 'As control measures lift, the coronavirus pandemic continues to grow. Here are the global hot spots.',
                },
            ],
            'link': 'https://www.washingtonpost.com/',
            'rss': 'http://feeds.washingtonpost.com/rss/world',
        },
    },
}


In [12]:
import pandas as pd

df = pd.read_json('scraped_articles.json')
df

Unnamed: 0,newspapers
bbc,{'link': 'https://www.bbc.co.uk/search?q=covid...
breitbart,"{'link': 'http://www.breitbart.com/', 'article..."
cnn,"{'link': 'http://edition.cnn.com/', 'articles'..."
foxnews,"{'link': 'http://www.foxnews.com/', 'articles'..."
infowars,"{'link': 'https://www.infowars.com/', 'article..."
nbcnews,"{'link': 'http://www.nbcnews.com/', 'articles'..."
theguardian,"{'rss': 'https://www.theguardian.com/uk/rss', ..."
theonion,"{'link': 'http://www.theonion.com/', 'articles..."
washingtonpost,{'rss': 'http://feeds.washingtonpost.com/rss/w...
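
The frame above keeps each outlet's whole record, including its list of articles, inside a single cell. As a rough sketch of one way to get one row per article instead, the helper below walks the nested dictionary and returns flat records. The name `flatten_articles` and the sample data are made up for illustration; only the key names (`newspapers`, `articles`, `title`, `author`, `text`, `link`, `published`) come from the scraper output shown earlier.

```python
def flatten_articles(scraped):
    """Return one flat dict per article from the nested scraper output."""
    rows = []
    for outlet, info in scraped.get('newspapers', {}).items():
        for article in info.get('articles', []):
            rows.append({
                'newspaper': outlet,
                'title': article.get('title'),
                'authors': article.get('author', []),
                'text': article.get('text'),
                'link': article.get('link'),
                'published': article.get('published'),
            })
    return rows


# Made-up sample with the same shape as the scraper output (not real data).
sample = {
    'newspapers': {
        'outlet_a': {
            'link': 'https://example.com/',
            'articles': [
                {'title': 'First', 'author': ['A. Writer'], 'text': 'Body one',
                 'link': 'https://example.com/1',
                 'published': '2020-06-16T00:00:00'},
                {'title': 'Second', 'author': [], 'text': 'Body two',
                 'link': 'https://example.com/2',
                 'published': '2020-06-15T00:00:00'},
            ],
        },
        # An outlet with no downloaded articles contributes no rows.
        'outlet_b': {'rss': 'https://example.com/rss', 'articles': []},
    }
}

rows = flatten_articles(sample)
print(len(rows), rows[0]['newspaper'])
```

A list of dicts like `rows` can be passed straight to `pd.DataFrame(rows)`, which would give the same columns that are built by hand further down.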


In [13]:
#pprintpp.pprint(data['newspapers']['cnn']['articles'][3]['author'])

In [14]:
outlets = []
titles = []
authors = []
texts = []
links = []
publisheds = []

def get_list(name):
    # Append one entry per article of the outlet `name` to the lists above.
    for article in data['newspapers'][name]['articles']:
        outlets.append(name)
        titles.append(article['title'])
        authors.append(article['author'])
        texts.append(article['text'])
        links.append(article['link'])
        publisheds.append(article['published'])

get_list('cnn')
get_list('bbc')
get_list('theguardian')
get_list('breitbart')
get_list('infowars')
get_list('foxnews')
get_list('nbcnews')
get_list('washingtonpost')
get_list('theonion')

get_list('Daily Monitor')
get_list('New Vision')
get_list('CHIMP REPORTS')
get_list('NILE POST')
get_list('UGANDA RADIO NETWORK')
get_list('THE INDEPENDENT')
get_list('BUKEDDE')
get_list('RED PEPPER')
get_list('OBSERVER')
get_list('PML DAILY')
get_list('SOFT POWER NEWS')
get_list('PC TECH')
get_list('TECH JAJA')
get_list('BIG EYE')
get_list('HIPIPO')
get_list('SQOOP')
get_list('HOWWEBIZ')
get_list('KAWOWO SPORTS')
get_list('NTV UGANDA')
get_list('NBS TV')
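
`get_list` indexes `data['newspapers'][name]` directly, so a name that is not present in the scraped data would raise a `KeyError` and stop the cell. The sketch below is one possible guard, not part of the original script: it splits a list of requested names into those that exist and those that do not. The helper name `split_known_outlets` and the sample dictionary are hypothetical.

```python
def split_known_outlets(data, names):
    """Split `names` into outlets present in data['newspapers'] and absent ones."""
    available = data.get('newspapers', {})
    found = [name for name in names if name in available]
    missing = [name for name in names if name not in available]
    return found, missing


# Made-up stand-in for the scraped `data` dict (assumption: same top-level shape).
sample_data = {'newspapers': {'cnn': {'articles': []}, 'bbc': {'articles': []}}}

found, missing = split_known_outlets(sample_data, ['cnn', 'bbc', 'NILE POST'])
print('will collect:', found)
print('not in data:', missing)
```

With the notebook's own objects, the names in `found` could then be passed to `get_list` one by one in a loop, and `missing` printed as a reminder of which outlets were not scraped.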

In [15]:
get_list('cnn')
get_list('bbc')
get_list('foxnews')
#get_list('NILE POST')
#get_list('UGANDA RADIO NETWORK')
#get_list('THE INDEPENDENT')
#get_list('BUKEDDE')
#get_list('RED PEPPER')
#get_list('OBSERVER')
#get_list('PML DAILY')
#get_list('SOFT POWER NEWS')
#get_list('PC TECH')
#get_list('TECH JAJA')
#get_list('BIG EYE')
#get_list('HIPIPO')
#get_list('SQOOP')
#get_list('HOWWEBIZ')
#get_list('KAWOWO SPORTS')
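
Because `get_list` appends to module-level lists, calling it for the same outlet in more than one cell (for example `'cnn'`) would add that outlet's articles a second time. A minimal sketch of removing such repeats, assuming the article link is a good enough identity for an article, is shown below. The function name `drop_duplicate_links` and the sample rows are made up for illustration.

```python
def drop_duplicate_links(rows):
    """Keep the first row seen for each 'link' value, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row['link'] in seen:
            continue  # same article collected earlier
        seen.add(row['link'])
        unique.append(row)
    return unique


# Made-up rows: the first article was collected twice.
sample_rows = [
    {'newspaper': 'cnn', 'link': 'https://example.com/a'},
    {'newspaper': 'cnn', 'link': 'https://example.com/b'},
    {'newspaper': 'cnn', 'link': 'https://example.com/a'},
]

unique_rows = drop_duplicate_links(sample_rows)
print(len(sample_rows), '->', len(unique_rows))
```

Once the data is in a pandas frame, `DataFrame.drop_duplicates(subset='link')` should achieve the same result. Resetting the six lists to empty before re-running the collection cells avoids the problem in the first place.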

In [16]:
dic={
    'newspaper':outlets,
    'title':titles,
    'authors':authors,
    'text':texts,
    'link':links,
    'published':publisheds
}

df2 = pd.DataFrame(dic, columns = ['newspaper','title', 'authors','text', 'link','published'])
df2

Unnamed: 0,newspaper,title,authors,text,link,published
0,cnn,US surgeons successfully perform double lung t...,[Jacqueline Howard],(CNN) A young woman in the United States whose...,http://edition.cnn.com/2020/06/11/health/lung-...,2020-06-11T00:00:00
1,cnn,US human trials begin for first antibody cockt...,[Jen Christensen],(CNN) A medicine that may treat and prevent Co...,http://edition.cnn.com/2020/06/11/health/regen...,2020-06-11T00:00:00
2,cnn,Grocery stores and universities should reopen ...,[Jen Christensen],"(CNN) New research suggests grocery stores, ba...",http://edition.cnn.com/2020/06/10/health/groce...,2020-06-10T00:00:00
3,cnn,Asymptomatic coronavirus spread: WHO clarifies...,[Jacqueline Howard],(CNN) The World Health Organization tried on T...,http://edition.cnn.com/2020/06/09/health/who-c...,2020-06-09T00:00:00
4,cnn,Dexamethasone and Covid: Steroid reduces risk ...,[Jacqueline Howard],(CNN) The widely available steroid drug dexame...,http://edition.cnn.com/2020/06/16/health/dexam...,2020-06-16T00:00:00
5,cnn,"Model projects 200,000 people in the US could ...",[Madeline Holcombe],(CNN) The United States could see more than 20...,http://edition.cnn.com/2020/06/16/health/us-co...,2020-06-16T00:00:00
6,cnn,Children with ADHD can now be prescribed a vid...,"[Naomi Thomas, Amy Woodyatt]",(CNN) The first video game-based treatment for...,http://edition.cnn.com/2020/06/16/health/adhd-...,2020-06-16T00:00:00
7,cnn,Try this total-body summer workout inspired by...,[Stephanie Mansour],(CNN) Whether you're a child or an adult who's...,http://edition.cnn.com/2020/06/16/health/playg...,2020-06-16T00:00:00
8,bbc,The record-breaking jet which still haunts a c...,[Mark Piesing],"In the early years of the Cold War, Canada dec...",https://www.bbc.com/future/article/20200615-th...,2020-06-15T00:00:00


for x in range(len(data['newspapers']['cnn']['articles'])):
    outlets.append('CNN')
    titles.append(data['newspapers']['cnn']['articles'][x]['title'])
    # Also append the authors so that all six lists stay the same length.
    authors.append(data['newspapers']['cnn']['articles'][x]['author'])
    texts.append(data['newspapers']['cnn']['articles'][x]['text'])
    links.append(data['newspapers']['cnn']['articles'][x]['link'])
    publisheds.append(data['newspapers']['cnn']['articles'][x]['published'])

In [17]:
df2.to_excel('Output/Newspaper_update.xlsx')
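
The export above writes into an `Output` folder, and the call would likely fail if that folder does not exist yet. The sketch below shows one way to create the parent folder first, using only the standard library. The helper name `ensure_parent_dir` is hypothetical, and the demo runs in a temporary directory so it does not touch the real `Output` folder.

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(path):
    """Create the parent folder of `path` if needed and return it as a Path."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    return target


# Demo in a throwaway directory (assumption: same relative layout as above).
with tempfile.TemporaryDirectory() as tmp:
    out_path = ensure_parent_dir(Path(tmp) / 'Output' / 'Newspaper_update.xlsx')
    created = out_path.parent.is_dir()

print(out_path.name, created)
```

In the notebook this could be used as `df2.to_excel(ensure_parent_dir('Output/Newspaper_update.xlsx'))`. Writing `.xlsx` files with pandas also needs an Excel writer engine such as `openpyxl` to be installed.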