In [1]:
import pandas as pd
import numpy as np
from requests import get
from bs4 import BeautifulSoup
import os

In [2]:
# define the url and headers
url = 'https://ryanorsinger.com'
headers = {'User-Agent': 'Codeup Data Science'} 

In [3]:
# request the url with get, passing the headers, and store the result in the variable response
response = get(url, headers=headers)

In [4]:
# check the status code of the response, 200=ok
response

<Response [200]>

In [5]:
# or check the boolean .ok attribute (True when the status code is below 400)
response.ok

True

In [7]:
# look at the text we got
response.text
# this is the HTML string our request returned

'<!DOCTYPE html>\n<html lang="en">\n<head>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="X-UA-Compatible" content="IE=edge" />\n\n    <title>Programming Log</title>\n    <meta name="HandheldFriendly" content="True" />\n    <meta name="viewport" content="width=device-width, initial-scale=1.0" />\n\n    <link rel="stylesheet" type="text/css" href="/assets/built/screen.css?v=25fe62e8ab" />\n\n    <meta name="description" content="Thoughts, stories and ideas." />\n    <link rel="shortcut icon" href="/favicon.png" type="image/png" />\n    <link rel="canonical" href="https://ryanorsinger.com/" />\n    <meta name="referrer" content="no-referrer-when-downgrade" />\n    \n    <meta property="og:site_name" content="Programming Log" />\n    <meta property="og:type" content="website" />\n    <meta property="og:title" content="Programming Log" />\n    <meta property="og:description" content="Thoughts, stories and ideas." />\n    <meta property="og:url" content="https://ryanorsinger.com/" /

In [8]:
# create a soup object by passing the HTML text from the response and the name of the parser
soup = BeautifulSoup(response.text, 'html.parser')

In [9]:
soup

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<title>Programming Log</title>
<meta content="True" name="HandheldFriendly"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<link href="/assets/built/screen.css?v=25fe62e8ab" rel="stylesheet" type="text/css"/>
<meta content="Thoughts, stories and ideas." name="description"/>
<link href="/favicon.png" rel="shortcut icon" type="image/png"/>
<link href="https://ryanorsinger.com/" rel="canonical"/>
<meta content="no-referrer-when-downgrade" name="referrer"/>
<meta content="Programming Log" property="og:site_name"/>
<meta content="website" property="og:type"/>
<meta content="Programming Log" property="og:title"/>
<meta content="Thoughts, stories and ideas." property="og:description"/>
<meta content="https://ryanorsinger.com/" property="og:url"/>
<meta content="https://casper.ghost.org/v1.0.0/images/blog-cover.jpg" property="og:image"/>
<meta conte

In [10]:
# create a function to get HTML and return soup

def make_soup(url):
    '''
    This helper function takes in a url, requests the page, parses the
    returned HTML, and returns a soup object.
    '''
    headers = {'User-Agent': 'Codeup Data Science'} 
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup


In [11]:
# Filter the Soup to obtain the elements from the first article card on the web page.

card = soup.find('article')
card

<article class="post-card post no-image">
<div class="post-card-content">
<a class="post-card-content-link" href="/air-fried-tofu/">
<header class="post-card-header">
<h2 class="post-card-title">Air Fried Marinated Tofu</h2>
</header>
<section class="post-card-excerpt">
<p>Drain then press tofu for 30-45 minutes. A tofu press is a good idea, but not necessary. Prepare the Marinade Prepare the marinade while the tofu is pressing 1-2 ounces of brown rice</p>
</section>
</a>
<footer class="post-card-meta">
<ul class="author-list">
<li class="author-list-item">
<div class="author-name-tooltip">
                        Ryan Orsinger
                    </div>
<a class="static-avatar" href="/author/ryan/"><img alt="Ryan Orsinger" class="author-profile-image" src="//www.gravatar.com/avatar/87aa917618034fcac7d589652b2110a5?s=250&amp;d=mm&amp;r=x"/></a>
</li>
</ul>
<span class="reading-time">1 min read</span>
</footer>
</div>
</article>

In [12]:
# Filter the article card to obtain the title (just like you filtered the Soup object).

title = card.find('h2').text
title

'Air Fried Marinated Tofu'

In [13]:
# Filter the article card to obtain the text preview of the article.

content = card.find('p').text
content

'Drain then press tofu for 30-45 minutes. A tofu press is a good idea, but not necessary. Prepare the Marinade Prepare the marinade while the tofu is pressing 1-2 ounces of brown rice'

In [14]:
# create a dictionary holding title and content
card_dict = {'title': title, 'content': content}

In [17]:
# use find_all to find all of the articles
card_list = soup.find_all('article')
card_list[:2]

[<article class="post-card post no-image">
 <div class="post-card-content">
 <a class="post-card-content-link" href="/air-fried-tofu/">
 <header class="post-card-header">
 <h2 class="post-card-title">Air Fried Marinated Tofu</h2>
 </header>
 <section class="post-card-excerpt">
 <p>Drain then press tofu for 30-45 minutes. A tofu press is a good idea, but not necessary. Prepare the Marinade Prepare the marinade while the tofu is pressing 1-2 ounces of brown rice</p>
 </section>
 </a>
 <footer class="post-card-meta">
 <ul class="author-list">
 <li class="author-list-item">
 <div class="author-name-tooltip">
                         Ryan Orsinger
                     </div>
 <a class="static-avatar" href="/author/ryan/"><img alt="Ryan Orsinger" class="author-profile-image" src="//www.gravatar.com/avatar/87aa917618034fcac7d589652b2110a5?s=250&amp;d=mm&amp;r=x"/></a>
 </li>
 </ul>
 <span class="reading-time">1 min read</span>
 </footer>
 </div>
 </article>,
 <article class="post-card post no

In [19]:
# Create your list of card dictionaries.
dict_list = []
for card in card_list:
    title = card.find('h2').text
    content = card.find('p').text
    card_dict = {'title':title, 'content':content}
    dict_list.append(card_dict)

In [20]:
# Convert your list of dictionaries into a pandas df.
df = pd.DataFrame(dict_list)
df

   title                                              content
0  Air Fried Marinated Tofu                           Drain then press tofu for 30-45 minutes. A tof...
1  Using Gitigore                                     Gitignore Exercises Exercise 1 Change director...
2  Air Fried Avocados                                 Ingredients: 2 large avacados Salt, pepper Dir...
3  How to Run Julia Language on Jupyter Notebooks     Rationale Julia is high performance general pu...
4  Intro to Binary                                    Intro to Binary Numbers - Why should you care?...
5  Effective Practice with Flash Cards                Effective practice with flash-cards is about s...
6  Gentle Introduction Test Driven Development (TDD)  My Intro to Testing JS tutorial is hot off th...
7  Effectively practice a 5-10 minute group prese...  If you're still working on memorizing your scr...
8  Cheesy Garlic Brussel Sprouts                      Dependencies: oven stovetop range microwave pa...
9  How to get started on a programming exercise       Scenario: You're learning to code, learning sy...


In [21]:
# `card` still holds the last article card from the loop above.
# Get its relative url, which can be added to the base url to reach the article page.

relative_url = card.find('a').get('href')
relative_url

'/fresh-air/'

In [22]:
# Concatenate the base url with the relative_url to get the url for the blog.

blog_url = url + relative_url
blog_url

'https://ryanorsinger.com/fresh-air/'
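
Plain concatenation works here because `url` has no trailing slash and `relative_url` starts with one. As a hedged alternative, the standard library's `urljoin` handles a base url with or without the trailing slash. This is a minimal sketch; `base_url` and `blog_url_trailing` are illustrative names that do not appear in the notebook above.

```python
from urllib.parse import urljoin

base_url = 'https://ryanorsinger.com'
relative_url = '/fresh-air/'

# urljoin resolves the relative path against the base url.
blog_url = urljoin(base_url, relative_url)

# With a trailing slash on the base, plain `+` would produce a double slash;
# urljoin still returns a single clean url.
blog_url_trailing = urljoin(base_url + '/', relative_url)

print(blog_url)
print(blog_url_trailing)
```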

In [23]:
# title and content still hold the values from the last pass through the loop
article_dict = {'title': title, 'content': content, 'blog_url': blog_url}
article_dict

{'title': 'Breaths of fresh air!',
 'content': "It's a great pleasure to share a few things that are rejuvinating breaths of very fresh air. These are game changers: new ways to fall in love with the essense of things all",
 'blog_url': 'https://ryanorsinger.com/fresh-air/'}

In [24]:
# Use the new make_soup function to request and parse the HTML and return a Soup object.

soup = make_soup(blog_url)

In [25]:
# Filter the Soup to obtain the title of the article.

blog_title = soup.find('h1').text
blog_title

'Breaths of fresh air!'

In [26]:
# Filter the Soup to obtain the content of the article.

blog_content = soup.find('div', class_="kg-card-markdown").text
print(blog_content)

It's a great pleasure to share a few things that are rejuvinating breaths of very fresh air. These are game changers: new ways to fall in love with the essense of things all over again.
The new CSS tools are rejuvinating!
The fresh air in CSS is CSS Grid, flexbox, and all the new properties that provide tremendous leverage with layout.

Jen Simmons's Layout Land videos have been rejuvinating, insightful, and inspiring. Her Layout Land videos are information rich and motivating to action.
The New CSS Layout by Rachel Andrew is a jewel. So far this book:

Helped me understand the traditional layout tools like float and absolute/relative positioning better than ever.
Opens the door to CSS Grid and Flexbox with super quick, crystal clear examples.
Provide native CSS code that does the work of an 3rd party CSS framework in a dozen lines of CSS.


These additions to CSS help me feel completely rejuvinated. Thanks to these  changes, I'm coming to CSS with a beginner's mind and a reinvigorated

In [27]:
# Filter the Soup to obtain the datetime value from the blog.

blog_date = soup.find('time').attrs['datetime']
blog_date

'2018-08-27'
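
The `datetime` attribute comes back as a plain string. If a date object would be more convenient later on (an assumption about what you might want, not something the notebook itself does), one option is to parse it with the standard library. `parsed_date` is an illustrative name.

```python
from datetime import date

# The value scraped from the <time> element's datetime attribute.
blog_date = '2018-08-27'

# fromisoformat parses a YYYY-MM-DD string into a date object.
parsed_date = date.fromisoformat(blog_date)

print(parsed_date.year, parsed_date.month, parsed_date.day)
```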

In [29]:
soup

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<title>Breaths of fresh air!</title>
<meta content="True" name="HandheldFriendly"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<link href="/assets/built/screen.css?v=25fe62e8ab" rel="stylesheet" type="text/css"/>
<link href="/favicon.png" rel="shortcut icon" type="image/png"/>
<link href="https://ryanorsinger.com/fresh-air/" rel="canonical"/>
<meta content="no-referrer-when-downgrade" name="referrer"/>
<link href="https://ryanorsinger.com/fresh-air/amp/" rel="amphtml"/>
<meta content="Programming Log" property="og:site_name"/>
<meta content="article" property="og:type"/>
<meta content="Breaths of fresh air!" property="og:title"/>
<meta content="It's a great pleasure to share a few things that are rejuvinating breaths of very fresh air. These are game changers: new ways to fall in love with the essense of things all over again. The new CSS 

# Exercises

By the end of this exercise, you should have a file named acquire.py that contains the specified functions. If you wish, you may break your work into separate files for each website (e.g. acquire_codeup_blog.py and acquire_news_articles.py), but the final functions should still be available from acquire.py (that is, acquire.py should import get_blog_articles from the acquire_codeup_blog module).

1. Codeup Blog Articles
- Scrape the article text from the following pages:
    - https://codeup.com/codeups-data-science-career-accelerator-is-here/
    - https://codeup.com/data-science-myths/
    - https://codeup.com/data-science-vs-data-analytics-whats-the-difference/
    - https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/
    - https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/
- Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

{
    'title': 'the title of the article',
    'content': 'the full text content of the article'
}
- Plus any additional properties you think might be helpful.

**Bonus:**
- Scrape the text of all the articles linked on Codeup's blog page.

2. News Articles
- We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.
- Write a function that scrapes the news articles for the following topics:
    - Business
    - Sports
    - Technology
    - Entertainment
- The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}

**Hints:**
- Start by inspecting the website in your browser. Figure out which elements will be useful.
- Start by creating a function that handles a single article and produces a dictionary like the one above.
- Next create a function that will find all the articles on a single page and call the function you created in the last step for every article on the page.
- Now create a function that will use the previous two functions to scrape the articles from all the pages that you need, and do any additional processing that needs to be done.
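
The three layers described in the hints can be sketched as follows. This is only an outline under stated assumptions: the sample markup, the tag and class names (`news-card`, `headline`, `body`), and the function names `parse_article`, `parse_page`, and `scrape_pages` are all hypothetical placeholders, not the real structure of any website. Inspect the page in your browser to find the actual elements.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page. The tag and
# class names are placeholders, not the structure of a real site.
SAMPLE_HTML = """
<div class="news-card">
  <span class="headline">First headline</span>
  <div class="body">First body text.</div>
</div>
<div class="news-card">
  <span class="headline">Second headline</span>
  <div class="body">Second body text.</div>
</div>
"""


def parse_article(card, category):
    """Step 1: turn one article element into a dictionary."""
    return {
        'title': card.find('span', class_='headline').get_text(strip=True),
        'content': card.find('div', class_='body').get_text(strip=True),
        'category': category,
    }


def parse_page(soup, category):
    """Step 2: find every article on one page and parse each one."""
    cards = soup.find_all('div', class_='news-card')
    return [parse_article(card, category) for card in cards]


def scrape_pages(pages):
    """Step 3: combine the articles from every page.

    `pages` maps a category name to that page's HTML string. In the
    exercise the HTML would come from a request instead of a constant.
    """
    articles = []
    for category, html in pages.items():
        soup = BeautifulSoup(html, 'html.parser')
        articles.extend(parse_page(soup, category))
    return articles


articles = scrape_pages({'business': SAMPLE_HTML})
print(articles)
```

Keeping the parsing functions separate from the request step means each one can be checked against a saved HTML string before any network calls are made.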

**Bonus: cache the data**
- Write your code so that the acquired data is saved locally (for example, as a JSON file). Your functions that retrieve the data should prefer to read the local data instead of making all the requests every time the function is called. Include a boolean flag in the functions that allows the data to be acquired "fresh" from the actual sources (overwriting your local cache).
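
One possible shape for the caching bonus is sketched below. It assumes the articles are JSON-serializable (a list of dictionaries of strings) and that the scraping work is wrapped in a callable. The names `get_cached_articles`, `fetch`, `cache_file`, and `fake_fetch` are hypothetical, and a JSON file is just one of several reasonable storage choices.

```python
import json
import os
import tempfile


def get_cached_articles(fetch, cache_file='articles.json', fresh=False):
    """Return cached articles unless `fresh` is True or no cache exists.

    `fetch` is any zero-argument callable that returns a list of
    dictionaries; in the exercise it would do the actual scraping.
    """
    if not fresh and os.path.exists(cache_file):
        with open(cache_file) as f:
            return json.load(f)
    articles = fetch()
    # Overwrite the local cache with the newly acquired data.
    with open(cache_file, 'w') as f:
        json.dump(articles, f)
    return articles


# Usage with a stand-in for the scraper, so no requests are made.
calls = []


def fake_fetch():
    calls.append(1)
    return [{'title': 'a title', 'content': 'some content'}]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'articles.json')
    first = get_cached_articles(fake_fetch, path)               # no cache yet, so fetch runs
    second = get_cached_articles(fake_fetch, path)              # read from the cache file
    third = get_cached_articles(fake_fetch, path, fresh=True)   # forces a new fetch

print(len(calls))
```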