# Walking through Curriculum Example

In [71]:
import pandas as pd
from requests import get
from bs4 import BeautifulSoup
import os

In [11]:
url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'
headers = {'User-Agent': 'Codeup Data Science'} # Some websites don't accept the python-requests default user-agent
response = get(url, headers=headers)

In [12]:
print(response.text[:400])

<!DOCTYPE html><html lang="en-US"><head >	<meta charset="UTF-8" />
	<meta name="viewport" content="width=device-width, initial-scale=1" />
	
	<!-- This site is optimized with the Yoast SEO plugin v15.2 - https://yoast.com/wordpress/plugins/seo/ -->
	<title>Codeup’s Data Science Career Accelerator is Here! - Codeup</title>
	<meta name="description" content="The rumors are true! The time has arrived


In [13]:
# Make a soup variable holding the response content
soup = BeautifulSoup(response.content, 'html.parser')

## Beautiful Soup Methods and Properties

```soup.title.string``` gets the page's title (the text shown in the browser tab; this is the ```<title>``` element).

```soup.prettify()``` returns the HTML as an indented string, which is useful to print when you want to inspect the markup.

```soup.find_all("a")``` finds all the anchor tags, or whatever tag name is passed as the argument.

```soup.find("h1")``` finds the first matching element.

```soup.get_text()``` gets the text from within a matching piece of soup/HTML.

```soup.select()``` takes a CSS selector as a string and returns all matching elements.
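The methods above can be tried without any network access. The following is a minimal sketch that runs them against a small hand-written HTML string; the page content and the `demo_soup` name are made up for illustration and are not part of the Codeup page.

```python
from bs4 import BeautifulSoup

# A small hand-written page (hypothetical, for illustration only)
html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1>First Heading</h1>
    <div class="post"><p>Hello, soup.</p></div>
    <a href="/one">One</a>
    <a href="/two">Two</a>
  </body>
</html>
"""

demo_soup = BeautifulSoup(html, 'html.parser')

# text of the <title> element
page_title = demo_soup.title.string
# find() returns only the first matching element
first_h1 = demo_soup.find('h1').get_text()
# find_all() returns every matching element
links = [a['href'] for a in demo_soup.find_all('a')]
# select() takes a CSS selector and returns a list of matches
post_text = demo_soup.select('div.post p')[0].get_text()

print(page_title, first_h1, links, post_text)
```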

In [37]:
# see also `soup.find_all`
#
# beautiful soup uses `class_` as the keyword argument for searching
# for a class because `class` is a reserved word in python
# we'll use the class name that we identified from looking in the inspector in chrome
article = soup.find('div', class_='jupiterx-post-content')
article.text

'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.\nData Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.\nOur program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspace, along with input from dozens of practitioners and hiring partners. Student
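The `\xa0` sequences in the output above are non-breaking spaces carried over from the HTML. If plain spaces are preferred downstream, one possible cleanup is sketched below; `clean_whitespace` is a hypothetical helper name, not something the notebook defines.

```python
def clean_whitespace(text):
    """Replace non-breaking spaces with regular spaces (hypothetical helper)."""
    return text.replace('\xa0', ' ')

# a short excerpt of the scraped text shown above
sample = 'land a job in\xa0Glassdoor’s #1 Best Job in America.'
cleaned = clean_whitespace(sample)
print(cleaned)
```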

In [44]:
soup.find('a').text

'Skip to content'

In [15]:
with open('article.txt', 'w') as f:
    f.write(article.text)

In [16]:
def get_article_text():
    # if we already have the data, read it locally
    if os.path.exists('article.txt'):
        with open('article.txt') as f:
            return f.read()

    # otherwise go fetch the data
    url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'
    headers = {'User-Agent': 'Codeup Data Science'}
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    article = soup.find('div', class_='jupiterx-post-content')

    # save it for next time
    with open('article.txt', 'w') as f:
        f.write(article.text)

    return article.text

# Exercises
1. Codeup Blog Articles: Scrape the article text from the following pages:
    - https://codeup.com/codeups-data-science-career-accelerator-is-here/
    - https://codeup.com/data-science-myths/
    - https://codeup.com/data-science-vs-data-analytics-whats-the-difference/
    - https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/
    - https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/  
    
Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

```
{
    'title': 'the title of the article',
    'content': 'the full text content of the article'
}
``` 

Plus any additional properties you think might be helpful.
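As a rough sketch of the dictionary shape described above, the helper below builds one record from an HTML string. `parse_article` and the extra `url` key are assumptions made for illustration; the `jupiterx-post-content` class is the one identified earlier in this notebook, and the sample HTML is made up.

```python
from bs4 import BeautifulSoup

def parse_article(html, url=None):
    """Build one article dictionary from raw HTML (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    body = soup.find('div', class_='jupiterx-post-content')
    return {
        'title': soup.title.string if soup.title is not None else None,
        'content': body.get_text() if body is not None else None,
        'url': url,  # an example of an additional property
    }

# made-up HTML standing in for a downloaded blog post
sample_html = (
    '<html><head><title>Demo Post</title></head><body>'
    '<div class="jupiterx-post-content">Demo body text.</div>'
    '</body></html>'
)
record = parse_article(sample_html, url='https://example.com/demo-post/')
print(record)
```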

Bonus:

Scrape the text of all the articles linked on Codeup's blog page.
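One way to approach the bonus is to collect the article links from the listing page first and then scrape each link. The sketch below shows only the link-collection step on made-up HTML. The `article a[href]` selector and the `collect_article_links` name are assumptions; the real blog page's markup should be checked in the browser inspector.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def collect_article_links(listing_html, base_url):
    """Return unique absolute links found inside <article> elements."""
    soup = BeautifulSoup(listing_html, 'html.parser')
    links = []
    for a in soup.select('article a[href]'):  # assumed markup, verify in the inspector
        full = urljoin(base_url, a['href'])   # resolve relative links
        if full not in links:                 # skip repeated links to the same post
            links.append(full)
    return links

# made-up listing page with one duplicated link
listing_html = """
<article><a href="/post-one/">Post one</a></article>
<article><a href="/post-one/">Read more</a></article>
<article><a href="https://example.com/post-two/">Post two</a></article>
"""
found = collect_article_links(listing_html, 'https://example.com/blog/')
print(found)
```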

In [53]:
# fetch the article and parse it
url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'
headers = {'User-Agent': 'Codeup Data Science'}
response = get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
article = soup.find('div', class_='jupiterx-post-content')

article_dict = {'title':[], 'content':[]}

In [54]:
soup.title.string

'Codeup’s Data Science Career Accelerator is Here! - Codeup'

In [55]:
article_dict['title'] = soup.title.string

In [56]:
article_dict['content'] = article.text

In [57]:
article_dict

{'title': 'Codeup’s Data Science Career Accelerator is Here! - Codeup',
 'content': 'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.\nData Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.\nOur program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and R

In [74]:
def get_blog_articles(urls, cached=False):

    # if we already have the data and cached == True, read it locally
    if cached:
        blogs = pd.read_json('blogs.json')

    # if we don't have the data or we want to resave with any new data
    else:
        blogs = []

        # loops through urls passed in function
        for blog in urls:

            # web scraping
            headers = {'User-Agent': 'Codeup Data Science'}
            response = get(blog, headers=headers)
            # parses the response text into a soup object
            soup = BeautifulSoup(response.text, 'html.parser')
            article = soup.find('div', class_='jupiterx-post-content')

            # builds a dictionary holding the article title and content
            article_dict = {'title': soup.title.string,
                            'content': article.text}

            # adds this dict of the article to the blog list
            blogs.append(article_dict)

        # save it for next time
        blogs = pd.DataFrame(blogs)
        blogs.to_json('blogs.json')

    return blogs

In [75]:
urls = ['https://codeup.com/codeups-data-science-career-accelerator-is-here/',
        'https://codeup.com/data-science-myths/',
        'https://codeup.com/data-science-vs-data-analytics-whats-the-difference/',
        'https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/',
        'https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/']

get_blog_articles(urls)

| | title | content |
|---|---|---|
| 0 | Codeup’s Data Science Career Accelerator is He... | The rumors are true! The time has arrived. Cod... |
| 1 | Data Science Myths - Codeup | By Dimitri Antoniou and Maggie Giust\nData Sci... |
| 2 | Data Science VS Data Analytics: What’s The Dif... | By Dimitri Antoniou\nA week ago, Codeup launch... |
| 3 | 10 Tips to Crush It at the SA Tech Job Fair - ... | SA Tech Job Fair\nThe third bi-annual San Anto... |
| 4 | Competitor Bootcamps Are Closing. Is the Model... | Competitor Bootcamps Are Closing. Is the Model... |


In [76]:
pd.read_json('blogs.json')

| | title | content |
|---|---|---|
| 0 | Codeup’s Data Science Career Accelerator is He... | The rumors are true! The time has arrived. Cod... |
| 1 | Data Science Myths - Codeup | By Dimitri Antoniou and Maggie Giust\nData Sci... |
| 2 | Data Science VS Data Analytics: What’s The Dif... | By Dimitri Antoniou\nA week ago, Codeup launch... |
| 3 | 10 Tips to Crush It at the SA Tech Job Fair - ... | SA Tech Job Fair\nThe third bi-annual San Anto... |
| 4 | Competitor Bootcamps Are Closing. Is the Model... | Competitor Bootcamps Are Closing. Is the Model... |


2. News Articles

We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

- Business
- Sports
- Technology
- Entertainment

The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

```
{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}
```

Hints:

- Start by inspecting the website in your browser. Figure out which elements will be useful.
- Start by creating a function that handles a single article and produces a dictionary like the one above.
- Next create a function that will find all the articles on a single page and call the function you created in the last step for every article on the page.
- Now create a function that will use the previous two functions to scrape the articles from all the pages that you need, and do any additional processing that needs to be done.
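The three hints above suggest three layers: one function per article, one per page, and one that loops over everything. The sketch below shows that layering on made-up HTML so it runs offline. The `itemprop` attributes match the ones found later in this notebook, but the `news-card` wrapper class and all three function names are assumptions; check the live page in the inspector before relying on them.

```python
from bs4 import BeautifulSoup

def parse_card(card, category):
    """Layer 1: turn a single article element into a dictionary."""
    headline = card.find(attrs={'itemprop': 'headline'})
    body = card.find(attrs={'itemprop': 'articleBody'})
    return {
        'title': headline.get_text(strip=True) if headline is not None else None,
        'content': body.get_text(strip=True) if body is not None else None,
        'category': category,
    }

def parse_page(html, category):
    """Layer 2: find every article on one page and parse each one."""
    soup = BeautifulSoup(html, 'html.parser')
    cards = soup.find_all('div', class_='news-card')  # assumed wrapper class
    return [parse_card(card, category) for card in cards]

def parse_all(pages):
    """Layer 3: pages maps a category name to that page's HTML."""
    articles = []
    for category, html in pages.items():
        articles.extend(parse_page(html, category))
    return articles

# made-up pages standing in for downloaded HTML
sample_pages = {
    'business': '<div class="news-card"><span itemprop="headline">Markets rise</span>'
                '<div itemprop="articleBody">Stocks closed higher.</div></div>',
    'sports': '<div class="news-card"><span itemprop="headline">Final score</span>'
              '<div itemprop="articleBody">The home team won.</div></div>',
}
news = parse_all(sample_pages)
print(news)
```

Keeping the download step out of these functions makes them easy to test; a separate loop can fetch each category page with `get` and pass the HTML in.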

In [119]:
response = get('https://inshorts.com/en/news/modernas-early-data-shows-its-covid19-vaccine-is-945-effective-1605529838483', headers={'User-Agent': 'Inshorts'})

In [120]:
response.ok

True

In [121]:
response.text

'<!doctype html>\n<html lang="en">\n\n<head>\n  <meta charset="utf-8" />\n  <style>\n    /* The Modal (background) */\n    .modal_contact {\n        display: none; /* Hidden by default */\n        position: fixed; /* Stay in place */\n        z-index: 8; /* Sit on top */\n        left: 0;\n        top: 0;\n        width: 100%; /* Full width */\n        height: 100%;\n        overflow: auto; /* Enable scroll if needed */\n        background-color: rgb(0,0,0); /* Fallback color */\n        background-color: rgba(0,0,0,0.4); /* Black w/ opacity */\n    }\n\n    /* Modal Content/Box */\n    .modal-content {\n        background-color: #fefefe;\n        margin: 15% auto;\n        padding: 20px !important;\n        padding-top: 0 !important;\n        /* border: 1px solid #888; */\n        text-align: center;\n        position: relative;\n        border-radius: 6px;\n    }\n\n    /* The Close Button */\n    .close {\n      left: 90%;\n      color: #aaa;\n      float: right;\n      font-size: 2

In [127]:
url = 'https://inshorts.com/en/news/modernas-early-data-shows-its-covid19-vaccine-is-945-effective-1605529838483'
news_category = url.split('/')[-2]  # second-to-last path segment ('news'); [-1] would be the article slug
data = get(url)
soup = BeautifulSoup(data.content, 'html.parser')

In [133]:
# finding article headline
soup.find('span', attrs={"itemprop": "headline"}).string

"Moderna's early data shows its COVID-19 vaccine is 94.5% effective"

In [135]:
# finding article text
soup.find('div', attrs={"itemprop": "articleBody"}).string

"American biotechnology company Moderna on Monday announced its experimental vaccine was 94.5% effective in preventing COVID-19 based on interim data from a late-stage clinical trial. Moderna's interim analysis was based on 95 infections among trial participants who received either a placebo or the vaccine. Among those, only five infections occurred in those who received the vaccine."

In [145]:
# finding author
soup.find('span', class_='author').string

'Pragya Swastik'

In [151]:
# finding date
soup.find('span', class_='date').string

'16 Nov'

In [138]:
# category as assigned in url
url.split('/')[-2]

'news'
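For an individual article URL, the second-to-last path segment is just `'news'`, which is not one of the topic names from the exercise. The sketch below splits URL paths with `urllib.parse` to make that visible. The `/en/read/business` listing URL is an assumed pattern for a category page, not something confirmed in this notebook, and `path_segments` is a hypothetical helper.

```python
from urllib.parse import urlparse

def path_segments(url):
    """Return the non-empty pieces of a URL's path (hypothetical helper)."""
    return [segment for segment in urlparse(url).path.split('/') if segment]

article_url = 'https://inshorts.com/en/news/modernas-early-data-shows-its-covid19-vaccine-is-945-effective-1605529838483'
listing_url = 'https://inshorts.com/en/read/business'  # assumed category-page URL pattern

article_segment = path_segments(article_url)[-2]   # 'news'
listing_segment = path_segments(listing_url)[-1]   # 'business'
print(article_segment, listing_segment)
```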

In [165]:
def get_inshorts_dataset(urls, cached=False):
    '''
    Function to scrape articles from Inshorts.com. If cached == False, runs code to scrape data
    from the chosen article urls, adds each to a dictionary, and saves the result as a df in a
    json file. If cached == True, reads the saved json file to a df.
    '''
    # if cached, we read already saved json file to df
    if cached:
        articles = pd.read_json('inshorts_articles.json')

    # cached == False, if we don't have the data or we want to resave with any new data
    else:

        # empty list to add individual article dictionaries to
        articles = []

        # loops through selected articles from Inshorts
        for article in urls:

            # dictionary for article and information we are going to find
            article_dict = {'headline':'','author':'','date':'','article':'','category':''}

            # web scraping
            headers = {'User-Agent': 'Inshorts'}
            # headers must be passed by keyword; the second positional argument of get() is params
            data = get(article, headers=headers)
            # parses the response text into a soup object
            soup = BeautifulSoup(data.text, 'html.parser')

            # specific article information to add to dictionary
            article_dict['headline'] = soup.find('span', attrs={"itemprop": "headline"}).string
            article_dict['author'] = soup.find('span', class_='author').string
            article_dict['date'] = soup.find('span', class_='date').string
            article_dict['article'] = soup.find('div', attrs={"itemprop": "articleBody"}).string
            # category comes from this article's own url, not the global `url`
            article_dict['category'] = article.split('/')[-2]

            # adding dictionary to list
            articles.append(article_dict)

        # converting list of dictionaries to a df
        articles = pd.DataFrame(articles)
        articles = articles[['headline', 'author','date','article', 'category']]
        # Write df to a json file for faster access
        articles.to_json('inshorts_articles.json')

    return articles

In [166]:
# the commas between the urls matter: without them Python joins adjacent string literals
# into one invalid url, so soup.find can return None and the call fails with
# AttributeError: 'NoneType' object has no attribute 'string'
urls = ['https://inshorts.com/en/news/15-countries-sign-worlds-biggest-freetrade-pact-which-india-left-last-year-1605430233640',
        'https://inshorts.com/en/news/olympic-medalist-working-as-delivery-boy-to-generate-income-amid-covid19-1605507634906',
        'https://inshorts.com/en/news/michelle-would-leave-me-obama-on-taking-a-position-in-bidens-cabinet-1605513396532']

get_inshorts_dataset(urls)
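`soup.find` returns `None` when nothing matches, so chaining `.string` onto the result raises this kind of `AttributeError`. A minimal sketch of a guard is shown below; `text_or_none` is a hypothetical helper name, and the one-line HTML is made up for the demonstration.

```python
from bs4 import BeautifulSoup

def text_or_none(soup, name, **kwargs):
    """Return the stripped text of the first match, or None if nothing matches."""
    tag = soup.find(name, **kwargs)
    return tag.get_text(strip=True) if tag is not None else None

# a fragment that has an author span but no date span
page = BeautifulSoup('<span class="author">Pragya Swastik</span>', 'html.parser')

author = text_or_none(page, 'span', class_='author')
date = text_or_none(page, 'span', class_='date')   # no match, so None instead of an error
print(author, date)
```

With a guard like this, a page with unexpected markup produces a record with missing fields rather than stopping the whole scrape.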

3. Bonus: cache the data

Write your code so that the acquired data is saved locally in some form. Your functions that retrieve the data should prefer to read the local data instead of making all the requests every time the function is called. Include a boolean flag in the functions to allow the data to be acquired "fresh" from the actual sources (rewriting your local cache).
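One possible shape for this is sketched below using only the standard library. `load_or_fetch`, the `fresh` flag name, and `fake_fetch` are assumptions for illustration; in practice the `fetch` argument would be a function that performs the actual scraping and returns a list of dictionaries.

```python
import json
import os
import tempfile

def load_or_fetch(cache_path, fetch, fresh=False):
    """Read cached JSON if present, otherwise call fetch() and cache its result."""
    if not fresh and os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    data = fetch()                      # the expensive step (e.g. web requests)
    with open(cache_path, 'w') as f:
        json.dump(data, f)              # rewrite the local cache
    return data

# stand-in for a real scraper; records how many times it was called
calls = []
def fake_fetch():
    calls.append(1)
    return [{'title': 'Demo', 'content': 'Demo text'}]

cache_file = os.path.join(tempfile.mkdtemp(), 'articles.json')
first = load_or_fetch(cache_file, fake_fetch)               # no cache yet: fetches
second = load_or_fetch(cache_file, fake_fetch)              # cache exists: reads the file
third = load_or_fetch(cache_file, fake_fetch, fresh=True)   # forced refresh: fetches again
print(len(calls))
```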