### 1. Codeup Blog Articles

Visit Codeup's blog and record the URLs for at least 5 distinct blog posts. For each post, you should scrape at least the post's title and content.

Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

{
    'title': 'the title of the article',
    'content': 'the full text content of the article'
}

Plus any additional properties you think might be helpful.

Bonus: Scrape the text of all the articles linked on Codeup's blog page.
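
For the bonus, one possible approach is to collect every article link from the blog index page first and then pass those URLs to the scraper. The sketch below runs on a small inline HTML sample. The `h2.entry-title a` selector, the sample markup, and the helper name `extract_article_links` are assumptions for illustration, so inspect the real page to find the right selector.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a blog index page (not the real Codeup HTML).
SAMPLE_INDEX_HTML = """
<div id="main-content">
  <article><h2 class="entry-title"><a href="https://example.com/post-one/">Post One</a></h2></article>
  <article><h2 class="entry-title"><a href="https://example.com/post-two/">Post Two</a></h2></article>
  <article><h2 class="entry-title"><a href="https://example.com/post-one/">Post One again</a></h2></article>
</div>
"""

def extract_article_links(html, selector="h2.entry-title a"):
    """Return the unique hrefs matched by `selector`, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.select(selector):
        href = anchor.get("href")
        # Skip anchors without an href and links already collected
        if href and href not in links:
            links.append(href)
    return links

links = extract_article_links(SAMPLE_INDEX_HTML)
print(links)
```

The resulting list could then be passed to a scraper that accepts a list of URLs.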

In [15]:
from requests import get
from bs4 import BeautifulSoup
import os
import re
import pandas as pd
import json

* Codeup blog articles:
    1. https://codeup.com/tips-for-prospective-students/mental-health-first-aid-training/
    2. https://codeup.com/featured/5-reasons-to-attend-our-new-cloud-administration-program/
    3. https://codeup.com/data-science/jobs-after-a-coding-bootcamp-part-1-data-science/
    4. https://codeup.com/featured/what-jobs-can-you-get-after-a-coding-bootcamp-part-2-cloud-administration/
    5. https://codeup.com/codeup-news/codeup-tv-commercial/

In [2]:
urls = ['https://codeup.com/tips-for-prospective-students/mental-health-first-aid-training/',
       'https://codeup.com/featured/5-reasons-to-attend-our-new-cloud-administration-program/',
       'https://codeup.com/data-science/jobs-after-a-coding-bootcamp-part-1-data-science/',
       'https://codeup.com/featured/what-jobs-can-you-get-after-a-coding-bootcamp-part-2-cloud-administration/',
       'https://codeup.com/codeup-news/codeup-tv-commercial/']

In [3]:
url = 'https://codeup.com/tips-for-prospective-students/mental-health-first-aid-training/'
headers = {'User-Agent': 'Codeup Data Science'} # codeup.com doesn't like our default user-agent
response = get(url, headers=headers)
response

<Response [200]>

In [4]:
soup = BeautifulSoup(response.content, 'html.parser')

In [5]:
title = soup.title.text
title

'Mental Health First Aid Training - Codeup'

In [7]:
article = soup.find('div', class_='entry-content')
text = article.text

In [8]:
soup.select('h1.entry-title')

[<h1 class="entry-title">Mental Health First Aid Training</h1>]

In [9]:
date = soup.select('span.published')[0].text
date

'May 31, 2022'

In [10]:
websites = []
websites.append({"title": title, "content": text})
websites

[{'title': 'Mental Health First Aid Training - Codeup',
  'content': '\nAs a student of Codeup, going through a massive career transition can be mentally taxing. Did you know that members of our student-facing staff and human resources team are trained in Mental Health First Aid? Let’s dive into what that means for Codeup!\xa0\nMental Health First Aid Training\xa0\nSome of our Codeup staff that works directly with our students are trained in Mental Health First Aid. This includes members of our student experience team, career coaches, and the human resources department. This training was courtesy of the Center for Health Care Services in San Antonio. They graciously provided the funding and training for our team in this pilot training program.\xa0\nWhat is Mental Health First Aid? According to mentalhhealthfirstaid.org, “is a course that teaches you how to identify, understand, and respond to signs of mental illness and substance use disorders. The training gives you the skills you nee
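
The scraped content above still contains newlines and non-breaking spaces (`\xa0`). A minimal cleanup sketch, assuming that collapsing all whitespace into single spaces is acceptable for later text processing; `clean_text` is a hypothetical helper name:

```python
import re

def clean_text(raw):
    """Replace non-breaking spaces and collapse whitespace runs into single spaces."""
    # Make the non-breaking spaces explicit before collapsing whitespace
    text = raw.replace("\xa0", " ")
    # Collapse newlines, tabs, and repeated spaces
    text = re.sub(r"\s+", " ", text)
    return text.strip()

sample = "\nAs a student of Codeup,\xa0\nMental Health First Aid Training\xa0\n"
cleaned = clean_text(sample)
print(cleaned)
```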

In [53]:
def get_blog_articles():
    filename = "blog_posts.csv"
    # Reuse the cached CSV if the posts have already been scraped
    if os.path.isfile(filename):
        return pd.read_csv(filename, index_col=[0])

    urls = ['https://codeup.com/tips-for-prospective-students/mental-health-first-aid-training/',
            'https://codeup.com/featured/5-reasons-to-attend-our-new-cloud-administration-program/',
            'https://codeup.com/data-science/jobs-after-a-coding-bootcamp-part-1-data-science/',
            'https://codeup.com/featured/what-jobs-can-you-get-after-a-coding-bootcamp-part-2-cloud-administration/',
            'https://codeup.com/codeup-news/codeup-tv-commercial/']
    headers = {'User-Agent': 'Codeup Data Science'}
    articles = []
    for url in urls:
        response = get(url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')
        title = soup.title.text
        text = soup.find('div', id='main-content').text
        date = soup.select('span.published')[0].text
        articles.append({"title": title, "content": text, "date_published": date})
    # Build the DataFrame and write the cache once, after every URL has been scraped
    websites = pd.DataFrame(articles)
    websites.to_csv(filename)
    return websites

In [54]:
websites = get_blog_articles()
websites

Unnamed: 0,title,content,date_published
0,Mental Health First Aid Training - Codeup,\n\n\n\n\n\nMental Health First Aid Training\n...,"May 31, 2022"
1,5 Reasons To Attend Our New Cloud Administrati...,\n\n\n\n\n\n5 Reasons To Attend Our New Cloud ...,"May 17, 2022"
2,What Jobs Can You Get After a Coding Bootcamp?,\n\n\n\n\n\nWhat Jobs Can You Get After a Codi...,"Jul 7, 2022"
3,What Jobs Can You Get After a Coding Bootcamp?...,\n\n\n\n\n\nWhat Jobs Can You Get After a Codi...,"Jul 14, 2022"
4,Codeup TV Commercial - Codeup News,"\n\n\n\n\n\nCodeup TV Commercial\nJul 20, 2022...","Jul 20, 2022"
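
The exercise asks for a list of dictionaries, while the function above returns a DataFrame. One way to convert between the two is `DataFrame.to_dict(orient="records")`. The sketch below uses made-up rows rather than the scraped data:

```python
import pandas as pd

# Made-up rows with the same columns as the scraped blog posts
df = pd.DataFrame([
    {"title": "Example post", "content": "Example body", "date_published": "May 31, 2022"},
    {"title": "Another post", "content": "More text", "date_published": "Jul 7, 2022"},
])

# Each row becomes one dictionary keyed by column name
records = df.to_dict(orient="records")
print(records[0]["title"])
```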


### Review
def get_blog_articles(article_list):
    file = 'blog_posts.json'
    # Return the cached results if the articles were already scraped
    if os.path.exists(file):
        with open(file) as f:
            return json.load(f)

    headers = {'User-Agent': 'Codeup Data Science'}
    article_info = []
    for article in article_list:
        response = get(article, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')
        info_dict = {}
        info_dict['title'] = soup.find('h1').text
        info_dict['date_published'] = soup.find('span', class_='published').text
        info_dict['content'] = soup.find('div', class_='entry-content').text
        article_info.append(info_dict)

    # Cache the scraped articles so later calls skip the network requests
    with open(file, 'w') as f:
        json.dump(article_info, f)
    return article_info
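
`soup.find` returns `None` when nothing matches, so calling `.text` on the result raises an `AttributeError` if a page lacks one of the expected elements. A small defensive sketch, where `text_or_none` is a hypothetical helper and the HTML is an inline sample:

```python
from bs4 import BeautifulSoup

def text_or_none(soup, name, **attrs):
    """Return the stripped text of the first matching element, or None if absent."""
    element = soup.find(name, **attrs)
    return element.get_text(strip=True) if element is not None else None

# Inline sample with a title and content but no published date
html = '<h1>Example title</h1><div class="entry-content"> Body text. </div>'
soup = BeautifulSoup(html, "html.parser")

title = text_or_none(soup, "h1")
content = text_or_none(soup, "div", class_="entry-content")
date_published = text_or_none(soup, "span", class_="published")
print(title, content, date_published)
```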

### 2. News Articles

We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

* Business
* Sports
* Technology
* Entertainment

The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}

Hints:

    a. Start by inspecting the website in your browser. Figure out which elements will be useful.
    b. Start by creating a function that handles a single article and produces a dictionary like the one above.
    c. Next create a function that will find all the articles on a single page and call the function you created in the last step for every article on the page.
    d. Now create a function that will use the previous two functions to scrape the articles from all the pages that you need, and do any additional processing that needs to be done.
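
The three layers described in hints b through d can be sketched as follows. This runs on inline sample HTML instead of live requests. The `news-card` class, the sample markup, and the function names are assumptions for illustration; the `itemprop` values are the ones used later in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one topic page
SAMPLE_PAGE_HTML = """
<div class="news-card">
  <span itemprop="headline">First headline</span>
  <div itemprop="articleBody">First body.</div>
</div>
<div class="news-card">
  <span itemprop="headline">Second headline</span>
  <div itemprop="articleBody">Second body.</div>
</div>
"""

def parse_article(card, category):
    """Hint b: turn a single article element into a dictionary."""
    return {
        "title": card.find("span", itemprop="headline").get_text(strip=True),
        "content": card.find("div", itemprop="articleBody").get_text(strip=True),
        "category": category,
    }

def parse_page(html, category):
    """Hint c: find every article on one page and parse each of them."""
    soup = BeautifulSoup(html, "html.parser")
    cards = soup.find_all("div", class_="news-card")
    return [parse_article(card, category) for card in cards]

def parse_all(pages):
    """Hint d: combine the articles from every page into one flat list."""
    articles = []
    for category, html in pages.items():
        articles.extend(parse_page(html, category))
    return articles

articles = parse_all({"business": SAMPLE_PAGE_HTML})
print(articles)
```

In a real scraper, the `pages` dictionary would be filled by fetching each topic page before parsing.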

### Get specific information from one news article

In [36]:
url = 'https://inshorts.com/en/news/cancelling-ac-firstclass-confirmed-train-tickets-to-now-attract-5-gst-1661858617350'
headers = {'User-Agent': 'Codeup Data Science'}
response = get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

In [37]:
response

<Response [200]>

In [51]:
title = soup.find('span', itemprop = 'headline').text
title

'Cancelling AC, first-class confirmed train tickets to now attract 5% GST'

In [53]:
body = soup.find('div', itemprop = 'articleBody').text
body

'The Finance Ministry stated that cancellation of confirmed first-class and AC coach tickets will now attract 5% GST. As per their circular, the booking of tickets is a "contract", under which the service provider (IRCTC/Indian Railways) promises services to the customer. And if the contract is breached by the passenger, the service provider is compensated with a small amount. '

In [57]:
author = soup.find('span', class_ = 'author').text
author

'Ridham Gambhir'

In [63]:
date = soup.find("span", class_="time")["content"]
date

'2022-08-30T11:23:37.000Z'
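
The date comes back as an ISO 8601 string, where the trailing `Z` denotes UTC. A minimal sketch for turning it into a timezone-aware `datetime`, assuming every timestamp follows this exact pattern; `parse_timestamp` is a hypothetical helper name:

```python
from datetime import datetime, timezone

def parse_timestamp(value):
    """Parse a timestamp such as '2022-08-30T11:23:37.000Z' into an aware datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    # The literal Z means UTC, so attach that timezone explicitly
    return parsed.replace(tzinfo=timezone.utc)

published = parse_timestamp("2022-08-30T11:23:37.000Z")
print(published.isoformat())
```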

In [43]:
def get_one_news():
    news = []
    url = ['https://inshorts.com/en/news/cancelling-ac-firstclass-confirmed-train-tickets-to-now-attract-5-gst-1661858617350']
    for x in url:
        headers = {'User-Agent': 'Codeup Data Science'}
        response = get(x, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')
        title = soup.find('span', itemprop = 'headline').text
        body = soup.find('div', itemprop = 'articleBody').text
        author = soup.find('span', class_ = 'author').text
        date = soup.find("span", class_="time")["content"]
        news.append({"title": title, "content": body, "author": author, "date": date})
    return news

In [44]:
one_news = get_one_news()
one_news

[{'title': 'Cancelling AC, first-class confirmed train tickets to now attract 5% GST',
  'content': 'The Finance Ministry stated that cancellation of confirmed first-class and AC coach tickets will now attract 5% GST. As per their circular, the booking of tickets is a "contract", under which the service provider (IRCTC/Indian Railways) promises services to the customer. And if the contract is breached by the passenger, the service provider is compensated with a small amount. ',
  'author': 'Ridham Gambhir',
  'date': '2022-08-30T11:23:37.000Z'}]

In [46]:
one_news = pd.DataFrame(one_news)

In [47]:
one_news.to_csv("one_news.csv")

In [49]:
df = pd.read_csv("blog_posts.csv",index_col=[0])
df

Unnamed: 0,title,content,date_published
0,Mental Health First Aid Training - Codeup,\n\n\n\n\n\nMental Health First Aid Training\n...,"May 31, 2022"
1,5 Reasons To Attend Our New Cloud Administrati...,\n\n\n\n\n\n5 Reasons To Attend Our New Cloud ...,"May 17, 2022"
2,What Jobs Can You Get After a Coding Bootcamp?,\n\n\n\n\n\nWhat Jobs Can You Get After a Codi...,"Jul 7, 2022"
3,What Jobs Can You Get After a Coding Bootcamp?...,\n\n\n\n\n\nWhat Jobs Can You Get After a Codi...,"Jul 14, 2022"
4,Codeup TV Commercial - Codeup News,"\n\n\n\n\n\nCodeup TV Commercial\nJul 20, 2022...","Jul 20, 2022"


### Get the URLs from one topic

In [70]:
url = "https://inshorts.com/en/read/business"
headers = {'User-Agent': 'Codeup Data Science'}
response = get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

In [68]:
soup.find('li', class_ = 'active-category')

<li class="active-category selected">All News</li>

In [71]:
urls = []
for link in soup.find_all("a", href=True):
    urls.append(link["href"])
lines = pd.Series(urls)
urls = lines[lines.str.contains(r"^/en/news")].tolist()
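new_urls = []
for i in urls:
    # Note: "http://" has no host, so this yields "http:///en/news/..."; the function in the next cell prepends "https://inshorts.com" instead
    new_urls.append("http://" + i)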
new_urls

['http:///en/news/indias-gdp-grows-at-135-in-first-quarter-of-fy23-fastest-in-a-year-1661948679998',
 'http:///en/news/musk-seeks-to-delay-twitter-trial-to-nov-amid-whistleblowers-claims-1661915789584',
 'http:///en/news/2-top-executives-at-snap-quit-hours-after-report-about-20-layoffs-emerges-1661919101133',
 'http:///en/news/musk-cites-whistleblowers-claims-in-new-notice-as-reason-to-end-twitter-deal-1661860951964',
 'http:///en/news/viral-video-shows-amazon-parcels-thrown-out-of-train-at-station-railways-clarifies-1661934085226',
 'http:///en/news/dell-among-firms-conducting-stay-interviews-to-contain-high-attrition-rates-report-1661941889565',
 'http:///en/news/worlds-3rd-richest-person-adanis-wealth-surged-over-13-times-in-25-years-1661925711652',
 'http:///en/news/russias-gazprom-halts-gas-supply-to-europe-via-major-pipeline-1661923360139',
 'http:///en/news/infosys-divests-entire-stake-in-usbased-trifacta-for-$12-million-1661935784212',
 'http:///en/news/softbank-corporate-offic
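
The URLs in the output above start with `http:///en/news/...`, so they have no host and cannot be fetched. One way to build absolute URLs from relative paths is `urllib.parse.urljoin`. In this sketch the paths are made-up examples and `to_absolute` is a hypothetical helper name:

```python
from urllib.parse import urljoin

BASE_URL = "https://inshorts.com"

def to_absolute(paths, base=BASE_URL):
    """Resolve each relative path against the site's base URL."""
    return [urljoin(base, path) for path in paths]

# Made-up relative paths in the same style as the scraped hrefs
paths = ["/en/news/example-article-1", "/en/news/example-article-2"]
absolute_urls = to_absolute(paths)
print(absolute_urls)
```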

In [82]:
def get_url_business():
    url = "https://inshorts.com/en/read/business"
    headers = {'User-Agent': 'Codeup Data Science'}
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Collect every link on the topic page
    hrefs = [link["href"] for link in soup.find_all("a", href=True)]
    # Keep only the article links, which start with /en/news
    urls = [href for href in hrefs if href.startswith("/en/news")]
    # Prepend the host so each relative path becomes a full URL
    return ["https://inshorts.com" + i for i in urls]

### Get articles from one topic

In [96]:
def get_business_new():
    urls = get_url_business()
    news = []
    for x in urls:
        headers = {'User-Agent': 'Codeup Data Science'}
        response = get(x, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')
        title = soup.find('span', itemprop = 'headline').text
        body = soup.find('div', itemprop = 'articleBody').text
        author = soup.find('span', class_ = 'author').text
        date = soup.find("span", class_="time")["content"]
        news.append({"title": title, "body": body, "author": author, "date": date, 'topic':'business'})
    return pd.DataFrame(news)

In [97]:
business_news = get_business_new()
business_news

Unnamed: 0,title,body,author,date,topic
0,India's GDP grows at 13.5% in first quarter of...,India's GDP grew at 13.5% in the first quarter...,Anmol Sharma,2022-08-31T12:24:39.000Z,business
1,Musk seeks to delay Twitter trial to Nov amid ...,Tesla CEO Elon Musk is seeking to delay the tr...,Ridham Gambhir,2022-08-31T03:16:29.000Z,business
2,2 top executives at Snap quit hours after repo...,Two senior advertising executives at Snap quit...,Ridham Gambhir,2022-08-31T04:11:41.000Z,business
3,Viral video shows Amazon parcels thrown out of...,A video from Guwahati railway station has gone...,Apaar Sharma,2022-08-31T08:21:25.000Z,business
4,Dell among firms conducting stay interviews to...,"To contain the high attrition rates, some comp...",Ridham Gambhir,2022-08-31T10:31:29.000Z,business
5,World's 3rd richest person Adani's wealth surg...,Adani Group Chairman Gautam Adani on Tuesday b...,Ridham Gambhir,2022-08-31T06:01:51.000Z,business
6,Russia's Gazprom halts gas supply to Europe vi...,Russia stopped gas supplies via a major pipeli...,Srishty Choudhury,2022-08-31T05:22:40.000Z,business
7,Netflix hires 2 top advertising executives fro...,Netflix on Tuesday announced that it has hired...,Ashley Paul,2022-08-31T10:43:16.000Z,business
8,Japan calls for $24 bn investment to boost bat...,"Japan's Ministry of Economy, Trade and Industr...",Purnima Rajput,2022-08-31T10:44:52.000Z,business
9,Most of crypto still junk: JP Morgan's digital...,JP Morgan's digital assets division head Umar ...,Ashley Paul,2022-08-30T15:10:39.000Z,business


### Get articles from Business, Sports, Technology, and Entertainment

In [109]:
def get_url(topic):
    url = f"https://inshorts.com/en/read/{topic}"
    headers = {'User-Agent': 'Codeup Data Science'}
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Collect every link on the topic page
    hrefs = [link["href"] for link in soup.find_all("a", href=True)]
    # Keep only the article links, which start with /en/news
    urls = [href for href in hrefs if href.startswith("/en/news")]
    # Prepend the host so each relative path becomes a full URL
    return ["https://inshorts.com" + i for i in urls]

def get_news_info(new_urls, topic):
    news = []
    for new_url in new_urls:
        headers = {'User-Agent': 'Codeup Data Science'}
        response = get(new_url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')
        title = soup.find('span', itemprop = 'headline').text
        body = soup.find('div', itemprop = 'articleBody').text
        author = soup.find('span', class_ = 'author').text
        date = soup.find("span", class_="time")["content"]
        news.append({"title": title, "content": body, "author": author, "date": date, 'category': topic})
    return news

def get_all_news(topics=None):
    # Use None instead of a mutable default argument; None means "no topics"
    all_news = []
    for topic in topics or []:
        new_urls = get_url(topic)
        # extend keeps all_news a flat list of article dictionaries
        all_news.extend(get_news_info(new_urls, topic))
    return pd.DataFrame(all_news)

In [110]:
news = get_all_news(topics = ["business", "sports", "technology", "entertainment"])
news.head()

Unnamed: 0,title,content,author,date,category
0,India's GDP grows at 13.5% in first quarter of...,India's GDP grew at 13.5% in the first quarter...,Anmol Sharma,2022-08-31T12:24:39.000Z,business
1,Musk seeks to delay Twitter trial to Nov amid ...,Tesla CEO Elon Musk is seeking to delay the tr...,Ridham Gambhir,2022-08-31T03:16:29.000Z,business
2,2 top executives at Snap quit hours after repo...,Two senior advertising executives at Snap quit...,Ridham Gambhir,2022-08-31T04:11:41.000Z,business
3,Viral video shows Amazon parcels thrown out of...,A video from Guwahati railway station has gone...,Apaar Sharma,2022-08-31T08:21:25.000Z,business
4,World's 3rd richest person Adani's wealth surg...,Adani Group Chairman Gautam Adani on Tuesday b...,Ridham Gambhir,2022-08-31T06:01:51.000Z,business


In [112]:
news.groupby(["author", "category"])[["category"]].count()

Unnamed: 0_level_0,Unnamed: 1_level_0,category
author,category,Unnamed: 2_level_1
Aishwarya Awasthi,technology,1
Amartya Sharma,entertainment,9
Anisha Joneja,business,3
Ankur Taliyan,sports,11
Anmol Sharma,business,1
Anmol Sharma,sports,7
Anmol Sharma,technology,1
Apaar Sharma,business,1
Apaar Sharma,entertainment,3
Arnab Mukherji,entertainment,5
