<a href="https://colab.research.google.com/github/RyvynYoung/natural-language-processing-exercises/blob/main/Copy_of_web_scraping_workflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Getting Started
1. Create your own notebook to work in.
2. Click the 'File' tab in the upper-left corner of the menu bar and click 'Save a copy in Drive' to create your own copy of this Google Colab notebook that you can edit and save.
3. As you complete exercises, be sure to save your work by either clicking the 'File' tab in the menu bar and choosing 'Save' or pressing `cmd+S` (`ctrl+S` on Windows or Linux).

### Orientation:
- This notebook is composed of cells. Each cell will contain either text or Python code.
- To run a Python code cell, click the "play" button next to the cell or click your cursor inside the cell and press "Shift + Enter" on your keyboard.
- Run the code cells in order from top to bottom, because later cells depend on the imports and variables created in earlier cells.

### Troubleshooting
- If the notebook does not appear to be working correctly, restart the environment by opening the **Runtime** menu and selecting **Restart Runtime**.
- If the notebook is running correctly, but you need a fresh copy of this original notebook, go [here](https://colab.research.google.com/drive/1EX6EjkVw7BEo85hQIwbv7TPqsr-SuGG5?usp=sharing) and repeat the steps above in 'Getting Started'.
- Save frequently (`cmd+S`)!

### Push to GitHub
- You can push your notebook to GitHub by clicking on the 'File' tab in the upper-left corner of the menu bar and choosing 'Save a copy in GitHub.'

## Notebook Objectives
This notebook...
- models the Web Scraping Workflow.
- walks you through acquiring data from a web page using the BeautifulSoup and requests libraries.
- prepares you for the NLP curriculum exercises that culminate in an `acquire.py` file with functions that make your process reproducible.
- is a safe place to dig and explore!

In [1]:
# Imports

import pandas as pd
import numpy as np
from requests import get
from bs4 import BeautifulSoup
import os

## Workflow

### Inspect the Web Page

Here is where we make our plan for scraping our target website. We can inspect the HTML of a web page in a couple of ways:

- We can inspect the source code of a web page by prefixing the url in the address bar of our browser with 'view-source:' like in the example below. This method displays the HTML exactly as it is returned from your request, before any JavaScript has run and changed the page.
```text
view-source:https://ryanorsinger.com
```
OR

- We can right-click on the part of the page we are interested in scraping and choose 'Inspect.' 

>**Right-click on the card and choose 'Inspect.'**

![Image of Inspecting](https://i.pinimg.com/564x/ae/ff/e5/aeffe504496de41a48e4f694f47fd4d8.jpg)

___

### Make the Request

Here is where we use the Python `requests` library to **obtain the HTML** from our target website. Remember, we did this step in obtaining data from an API before.

In [2]:
url = 'https://ryanorsinger.com'
headers = {'User-Agent': 'Codeup Data Science'} 

In [3]:
response = get(url, headers=headers)

In [4]:
response.ok

True

In [5]:
# Take a look at the string of HTML returned from our request to the main blog page.

response.text

'<!DOCTYPE html>\n<html lang="en">\n<head>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="X-UA-Compatible" content="IE=edge" />\n\n    <title>Programming Log</title>\n    <meta name="HandheldFriendly" content="True" />\n    <meta name="viewport" content="width=device-width, initial-scale=1.0" />\n\n    <link rel="stylesheet" type="text/css" href="/assets/built/screen.css?v=25fe62e8ab" />\n\n    <meta name="description" content="Thoughts, stories and ideas." />\n    <link rel="shortcut icon" href="/favicon.png" type="image/png" />\n    <link rel="canonical" href="https://ryanorsinger.com/" />\n    <meta name="referrer" content="no-referrer-when-downgrade" />\n    \n    <meta property="og:site_name" content="Programming Log" />\n    <meta property="og:type" content="website" />\n    <meta property="og:title" content="Programming Log" />\n    <meta property="og:description" content="Thoughts, stories and ideas." />\n    <meta property="og:url" content="https://ryanorsinger.com/" /

___

### Make the Soup

Here is where we pass our HTML and choice of parser to `BeautifulSoup` to **parse our HTML and create our Soup object** for searching and navigating in the next step. Beautiful Soup accepts either a string of HTML or an open file handle to an HTML file.

In [6]:
# Create the Soup object by passing our string of HTML and choice of parser to BeautifulSoup.

soup = BeautifulSoup(response.text, 'html.parser')
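
Because `BeautifulSoup` only needs a string of HTML, you can practice parsing without making a request at all. Here is a minimal sketch, assuming a made-up snippet (`sample_html` is hypothetical and does not come from the blog):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML string standing in for response.text.
sample_html = '<html><head><title>Demo Page</title></head><body><p>Hello, soup!</p></body></html>'

# Parse the string with Python's built-in HTML parser.
sample_soup = BeautifulSoup(sample_html, 'html.parser')

page_title = sample_soup.title.text            # text of the <title> element
first_paragraph = sample_soup.find('p').text   # text of the first <p> element
print(page_title, '|', first_paragraph)
```

The same two lookups work on the real `soup` object created above.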

#### Reproducible Soup

We can make the above steps into a function for future use if we like.

In [7]:
# Create a helper function that requests and parses HTML returning a soup object.

def make_soup(url):
    '''
    This helper function takes in a url and requests and parses HTML
    returning a soup object.
    '''
    headers = {'User-Agent': 'Codeup Data Science'} 
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup
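
The helper above parses whatever comes back, even an error page. One possible variation, sketched here under assumptions, stops early when the request fails. The name `make_soup_checked` and the `fetch` and `timeout` parameters are illustrative additions and are not part of the original helper.

```python
from bs4 import BeautifulSoup
from requests import get

def make_soup_checked(url, fetch=get, timeout=10):
    '''
    Variant of make_soup that raises an error on a failed request
    instead of parsing an error page.
    `fetch` and `timeout` are hypothetical parameters added for illustration.
    '''
    headers = {'User-Agent': 'Codeup Data Science'}
    response = fetch(url, headers=headers, timeout=timeout)
    # raise_for_status() raises an HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')
```

Passing the request function in as `fetch` is only a convenience so the helper can be tried with a stand-in function when you are offline.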

___

## Extract the Data

### Scrape Article Cards on Blog Homepage

Here, we **extract the data** we want from our soup using BeautifulSoup methods with element names, attributes, etc.

**Find the HTML element that holds all of the elements for the entire article card.**

![Image of HTML code for title](https://i.pinimg.com/564x/0a/89/dc/0a89dc531a65d286614f88ffd374cd04.jpg)

In [8]:
# Filter the Soup to obtain the elements from the first article card on the main blog page.

card = soup.find('article') 
card

<article class="post-card post no-image">
<div class="post-card-content">
<a class="post-card-content-link" href="/air-fried-tofu/">
<header class="post-card-header">
<h2 class="post-card-title">Air Fried Marinated Tofu</h2>
</header>
<section class="post-card-excerpt">
<p>Drain then press tofu for 30-45 minutes. A tofu press is a good idea, but not necessary. Prepare the Marinade Prepare the marinade while the tofu is pressing 1-2 ounces of brown rice</p>
</section>
</a>
<footer class="post-card-meta">
<ul class="author-list">
<li class="author-list-item">
<div class="author-name-tooltip">
                        Ryan Orsinger
                    </div>
<a class="static-avatar" href="/author/ryan/"><img alt="Ryan Orsinger" class="author-profile-image" src="//www.gravatar.com/avatar/87aa917618034fcac7d589652b2110a5?s=250&amp;d=mm&amp;r=x"/></a>
</li>
</ul>
<span class="reading-time">1 min read</span>
</footer>
</div>
</article>

___
**Find the HTML element that holds the title of the article card.**

![Image HTML for Title](https://i.pinimg.com/564x/f6/64/4c/f6644c79fb6198a1602071093160866e.jpg)

In [9]:
# Filter the article card to obtain the title. (just like you filtered the Soup object).

title = card.find('h2').text
title

'Air Fried Marinated Tofu'

___
**Find the HTML element that holds the text we want.**

![Image for HTML paragraph](https://i.pinimg.com/564x/14/39/9a/14399a28bd9f23fa03964416f6d2cd05.jpg)

In [10]:
# Filter the article card to obtain the text preview.

content = card.find('p').text
content

'Drain then press tofu for 30-45 minutes. A tofu press is a good idea, but not necessary. Prepare the Marinade Prepare the marinade while the tofu is pressing 1-2 ounces of brown rice'

In [11]:
# Create a dictionary holding the title and content for the article card.

card_dict = {'title': title, 'content': content}
card_dict

{'content': 'Drain then press tofu for 30-45 minutes. A tofu press is a good idea, but not necessary. Prepare the Marinade Prepare the marinade while the tofu is pressing 1-2 ounces of brown rice',
 'title': 'Air Fried Marinated Tofu'}

In [12]:
# Use .find_all() to filter the soup to obtain a list of the HTML for ALL of the article cards on the blog page.

card_list = soup.find_all('article')
card_list[:2]

[<article class="post-card post no-image">
 <div class="post-card-content">
 <a class="post-card-content-link" href="/air-fried-tofu/">
 <header class="post-card-header">
 <h2 class="post-card-title">Air Fried Marinated Tofu</h2>
 </header>
 <section class="post-card-excerpt">
 <p>Drain then press tofu for 30-45 minutes. A tofu press is a good idea, but not necessary. Prepare the Marinade Prepare the marinade while the tofu is pressing 1-2 ounces of brown rice</p>
 </section>
 </a>
 <footer class="post-card-meta">
 <ul class="author-list">
 <li class="author-list-item">
 <div class="author-name-tooltip">
                         Ryan Orsinger
                     </div>
 <a class="static-avatar" href="/author/ryan/"><img alt="Ryan Orsinger" class="author-profile-image" src="//www.gravatar.com/avatar/87aa917618034fcac7d589652b2110a5?s=250&amp;d=mm&amp;r=x"/></a>
 </li>
 </ul>
 <span class="reading-time">1 min read</span>
 </footer>
 </div>
 </article>, <article class="post-card post no-

In [13]:
# Here's that first article card we worked with above, but now we have a list containing the elements from all 11 cards on the page.

print(card_list[0])
print()
print(f'There are {len(card_list)} cards in our list.')

<article class="post-card post no-image">
<div class="post-card-content">
<a class="post-card-content-link" href="/air-fried-tofu/">
<header class="post-card-header">
<h2 class="post-card-title">Air Fried Marinated Tofu</h2>
</header>
<section class="post-card-excerpt">
<p>Drain then press tofu for 30-45 minutes. A tofu press is a good idea, but not necessary. Prepare the Marinade Prepare the marinade while the tofu is pressing 1-2 ounces of brown rice</p>
</section>
</a>
<footer class="post-card-meta">
<ul class="author-list">
<li class="author-list-item">
<div class="author-name-tooltip">
                        Ryan Orsinger
                    </div>
<a class="static-avatar" href="/author/ryan/"><img alt="Ryan Orsinger" class="author-profile-image" src="//www.gravatar.com/avatar/87aa917618034fcac7d589652b2110a5?s=250&amp;d=mm&amp;r=x"/></a>
</li>
</ul>
<span class="reading-time">1 min read</span>
</footer>
</div>
</article>

There are 11 cards in our list.
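
The difference between `.find()` and `.find_all()` is easier to see on a tiny page. This is a small sketch using made-up HTML (`demo_html` is hypothetical and only loosely modeled on the cards above):

```python
from bs4 import BeautifulSoup

# Hypothetical two-card page, loosely modeled on the markup above.
demo_html = '''
<article><h2>First Post</h2><p>First preview.</p></article>
<article><h2>Second Post</h2><p>Second preview.</p></article>
'''
demo_soup = BeautifulSoup(demo_html, 'html.parser')

first_card = demo_soup.find('article')       # a single element: the first match
all_cards = demo_soup.find_all('article')    # a list-like collection of every match
missing = demo_soup.find('table')            # None, because no <table> exists

print(first_card.find('h2').text)
print(len(all_cards))
print(missing)
```

Note that `.find()` returns `None` when nothing matches, so calling `.text` on a missing element raises an `AttributeError`.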


___

### Your Turn

- Reference the process we went through above to obtain a list of dictionaries, one for each of the eleven cards in our `card_list` variable. Each dictionary should contain the keys `title` and `content` like the `card_dict` dictionary we created above for our first card.
- That list of dictionaries can easily be turned into that beloved and familiar pandas DataFrame, so end this stage by converting your list to a df you can use in the future.

#### List of Dictionaries

- You might want to use a loop to run through the above steps to obtain the title and content from a card, create a dictionary for that card, and add that dictionary to a list.



In [14]:
# Create a helper function that builds a dictionary, card_dict, for a single article card.
def get_card_info(card):
  title = card.find('h2').text
  content = card.find('p').text
  relative_url = card.find('a').get('href')
  # Note: without .text this stores the span element itself rather than a plain string.
  read_time = card.find('span', class_='reading-time')
  card_dict = {'title':title, 'content':content, 'relative_url':relative_url, 'read_time':read_time}
  return card_dict

In [15]:
# Build the list of dictionaries, one per card, then convert it into a pandas df.
dict_list = []

for card in card_list:
  card_info = get_card_info(card)
  dict_list.append(card_info)

card_df = pd.DataFrame(dict_list)
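
To see how a list of dictionaries maps onto a DataFrame, here is a minimal sketch with made-up records (`records` is hypothetical and stands in for the dictionaries returned by `get_card_info`):

```python
import pandas as pd

# Hypothetical records standing in for the dictionaries returned by get_card_info.
records = [
    {'title': 'First Post', 'content': 'First preview.'},
    {'title': 'Second Post', 'content': 'Second preview.'},
]

# Each dictionary becomes a row; the keys become the column names.
demo_df = pd.DataFrame(records)
print(demo_df.shape)
print(list(demo_df.columns))
```

The loop above could also be written as a list comprehension, `[get_card_info(card) for card in card_list]`, if you prefer a single line.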


In [16]:
card_df

Unnamed: 0,title,content,relative_url,read_time
0,Air Fried Marinated Tofu,Drain then press tofu for 30-45 minutes. A tof...,/air-fried-tofu/,[1 min read]
1,Using Gitigore,Gitignore Exercises Exercise 1 Change director...,/using-gitigore/,[1 min read]
2,Air Fried Avocados,"Ingredients: 2 large avacados Salt, pepper Dir...",/air-fried-avacados/,[1 min read]
3,How to Run Julia Language on Jupyter Notebooks,Rationale Julia is high performance general pu...,/run-julia-language-on-jupyter-notebook/,[1 min read]
4,Intro to Binary,Intro to Binary Numbers - Why should you care?...,/intro-to-binary/,[2 min read]
5,Effective Practice with Flash Cards,Effective practice with flash-cards is about s...,/effective-practice-with-flash-cards/,[3 min read]
6,Gentle Introduction Test Driven Development (TDD),My Intro to Testing JS tutorial is hot off th...,/intro-to-testing-and-test-driven-development/,[1 min read]
7,Effectively practice a 5-10 minute group prese...,If you're still working on memorizing your scr...,/how-to-practice-a-5-minute-group-presentation/,[1 min read]
8,Cheesy Garlic Brussel Sprouts,Dependencies: oven stovetop range microwave pa...,/cheesy-garlicky/,[1 min read]
9,How to get started on a programming exercise,"Scenario: You're learning to code, learning sy...",/how-to-start-a-programming-exercise/,[2 min read]


#### Scrape More

- You can try to grab other elements from the cards and create an even more interesting DataFrame, too.

In [17]:
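# Looking at the HTML below, what else do you think you could scrape from the article cards?
# Note: `card` now holds the last card from the loop above, not the first card we explored earlier.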

card

<article class="post-card post">
<a class="post-card-image-link" href="/fresh-air/">
<div class="post-card-image" style="background-image: url(/content/images/2018/08/Salem-Willows-Waterfront.jpg)"></div>
</a>
<div class="post-card-content">
<a class="post-card-content-link" href="/fresh-air/">
<header class="post-card-header">
<h2 class="post-card-title">Breaths of fresh air!</h2>
</header>
<section class="post-card-excerpt">
<p>It's a great pleasure to share a few things that are rejuvinating breaths of very fresh air. These are game changers: new ways to fall in love with the essense of things all</p>
</section>
</a>
<footer class="post-card-meta">
<ul class="author-list">
<li class="author-list-item">
<div class="author-name-tooltip">
                        Ryan Orsinger
                    </div>
<a class="static-avatar" href="/author/ryan/"><img alt="Ryan Orsinger" class="author-profile-image" src="//www.gravatar.com/avatar/87aa917618034fcac7d589652b2110a5?s=250&amp;d=mm&amp;r=x
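
Two candidates are the author name and the reading time in the card footer. The sketch below parses a trimmed copy of that footer markup (pasted in as the string `footer_html`, a name chosen here for illustration) and uses `get_text(strip=True)` to drop the surrounding whitespace:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the footer markup from the article card shown above.
footer_html = '''
<footer class="post-card-meta">
<div class="author-name-tooltip">
                        Ryan Orsinger
                    </div>
<span class="reading-time">1 min read</span>
</footer>
'''
footer = BeautifulSoup(footer_html, 'html.parser')

# get_text(strip=True) removes the leading and trailing whitespace around the name.
author = footer.find('div', class_='author-name-tooltip').get_text(strip=True)
read_time = footer.find('span', class_='reading-time').get_text(strip=True)
print(author, '|', read_time)
```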

___
**Find the HTML element responsible for holding the relative url for the article page.**

![Image for article relative url](https://i.pinimg.com/564x/48/a8/f3/48a8f3373bc9c48a0339bb61c17f7106.jpg)

In [18]:
# Get the relative url to add to the base url to access and scrape the article page referenced by the article card.

relative_url = card.find('a').get('href')
relative_url

'/fresh-air/'

In [19]:
# Concatenate the base url with the relative_url to get the url for the article page.

article_url = url + relative_url
article_url

'https://ryanorsinger.com/fresh-air/'
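
String concatenation works here because the base url has no trailing slash. As an alternative sketch (not the approach used in this notebook), the standard library's `urljoin` sorts out the slash for you; `base_url` is just an illustrative name for the same address.

```python
from urllib.parse import urljoin

base_url = 'https://ryanorsinger.com'

# urljoin produces a single slash between the parts, whether or not
# the base url ends with '/'.
joined = urljoin(base_url, '/fresh-air/')
joined_with_trailing_slash = urljoin(base_url + '/', '/fresh-air/')
print(joined)
print(joined_with_trailing_slash)
```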

In [20]:
# Add the base url to card_df as its own column so it can be concatenated with relative_url later.
card_df['base_url'] = url
card_df


Unnamed: 0,title,content,relative_url,read_time,base_url
0,Air Fried Marinated Tofu,Drain then press tofu for 30-45 minutes. A tof...,/air-fried-tofu/,[1 min read],https://ryanorsinger.com
1,Using Gitigore,Gitignore Exercises Exercise 1 Change director...,/using-gitigore/,[1 min read],https://ryanorsinger.com
2,Air Fried Avocados,"Ingredients: 2 large avacados Salt, pepper Dir...",/air-fried-avacados/,[1 min read],https://ryanorsinger.com
3,How to Run Julia Language on Jupyter Notebooks,Rationale Julia is high performance general pu...,/run-julia-language-on-jupyter-notebook/,[1 min read],https://ryanorsinger.com
4,Intro to Binary,Intro to Binary Numbers - Why should you care?...,/intro-to-binary/,[2 min read],https://ryanorsinger.com
5,Effective Practice with Flash Cards,Effective practice with flash-cards is about s...,/effective-practice-with-flash-cards/,[3 min read],https://ryanorsinger.com
6,Gentle Introduction Test Driven Development (TDD),My Intro to Testing JS tutorial is hot off th...,/intro-to-testing-and-test-driven-development/,[1 min read],https://ryanorsinger.com
7,Effectively practice a 5-10 minute group prese...,If you're still working on memorizing your scr...,/how-to-practice-a-5-minute-group-presentation/,[1 min read],https://ryanorsinger.com
8,Cheesy Garlic Brussel Sprouts,Dependencies: oven stovetop range microwave pa...,/cheesy-garlicky/,[1 min read],https://ryanorsinger.com
9,How to get started on a programming exercise,"Scenario: You're learning to code, learning sy...",/how-to-start-a-programming-exercise/,[2 min read],https://ryanorsinger.com


In [21]:
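# Add that article_url to the article_dict along with the title and content we scraped from the first article card.
# Note: article_url was built from the card currently stored in `card`, so check that all three values describe the same article.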

article_dict = {'title': title, 'content': content, 'article_url': article_url}
article_dict

{'article_url': 'https://ryanorsinger.com/fresh-air/',
 'content': 'Drain then press tofu for 30-45 minutes. A tofu press is a good idea, but not necessary. Prepare the Marinade Prepare the marinade while the tofu is pressing 1-2 ounces of brown rice',
 'title': 'Air Fried Marinated Tofu'}

___
### Scrape Article Page
- Now that I have the url to the actual blog article, I can scrape information from that article page.

In [41]:
# Use my new function to `make the soup`. (request and parse HTML and return Soup object)
article_url = 'https://ryanorsinger.com/air-fried-tofu/'
soup = make_soup(article_url)

___
**Find the HTML element responsible for holding the article title.**
![Image for article title](https://i.pinimg.com/564x/24/cc/37/24cc37ffbcc228de7c23d6bbccae2c8f.jpg)

In [42]:
# Filter the Soup to obtain the title of the article.

article_title = soup.find('h1').text
article_title

'Air Fried Marinated Tofu'

___
**Find the HTML element responsible for holding the content or text of the article.**

![Image for article content](https://i.pinimg.com/564x/f9/36/61/f9366187a852182e39a66d68341b10a6.jpg)

In [43]:
# Filter the Soup to obtain the content of the article.

blog_content = soup.find('div', class_="kg-card-markdown").text
print(blog_content)

Drain then press tofu for 30-45 minutes. A tofu press is a good idea, but not necessary.
Prepare the Marinade
Prepare the marinade while the tofu is pressing

1-2 ounces of brown rice vinegar (aged, seasoned)
2 ounces of rice wine vinegar (white, seasoned)
1 oz soy sauce
onion powder
garlic powder
powdered mustard
msg
salt
3 spoons of sugar
generous amount of shakes of garlic tabasco sauce
stir vigorously to blend the spices

Marinade the Tofu
Once the tofu is pressed, cut tofu blocks into 1 inch cubes (roughly).
Make sure that the tofu squares are not completely stuck together as you transfer them into a plastic baggie.
Pour the marinade over the tofu.
Set the tofu for to marinade 10-15 minutes. Then flip the bag over. Check on it and flip in another 15-20 minutes. The tofu should have soaked up most or all of the marinade.
Cooking the Tofu
Put avocado oil on a paper towel and wipe down the air fryer basket.
Put the empty air fryer in for 400 degrees for 5 minutes.
Spread the tofu cub

___
**Find the HTML element responsible for holding the datetime for the article.**

![Image for datetime of Article](https://i.pinimg.com/564x/c8/8b/83/c88b83c96405fb9c8708fdf7c6ae8daf.jpg)

In [25]:
# Filter the Soup to obtain the datetime value from the article.

article_date = soup.find('time').attrs['datetime']
article_date

'2018-08-27'
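
The value we pulled out is a plain string. If you want a real date to sort or filter on, one option is the standard library's `date.fromisoformat`. This is a minimal sketch, and `article_date_str` is an illustrative name for the string shown above.

```python
from datetime import date

article_date_str = '2018-08-27'   # the datetime attribute value shown above

# date.fromisoformat parses a YYYY-MM-DD string into a date object.
parsed_date = date.fromisoformat(article_date_str)
print(parsed_date.year, parsed_date.month, parsed_date.day)
```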

### Your Turn

In [26]:
# Can you filter the Soup to obtain some other data from the blog article?
art_copy = soup.find('section', class_="copyright").text


In [27]:
# Can you create a dictionary from the information you scraped above from the blog article? 
# (like we did for the article card above but this time for the blog article)
single_art_dict = {'art_title': article_title, 'art_content': blog_content, 'art_date': article_date, 'art_copyright': art_copy}
single_art_dict


{'art_content': 'It\'s a great pleasure to share a few things that are rejuvinating breaths of very fresh air. These are game changers: new ways to fall in love with the essense of things all over again.\nThe new CSS tools are rejuvinating!\nThe fresh air in CSS is CSS Grid, flexbox, and all the new properties that provide tremendous leverage with layout.\n\nJen Simmons\'s Layout Land videos have been rejuvinating, insightful, and inspiring. Her Layout Land videos are information rich and motivating to action.\nThe New CSS Layout by Rachel Andrew is a jewel. So far this book:\n\nHelped me understand the traditional layout tools like float and absolute/relative positioning better than ever.\nOpens the door to CSS Grid and Flexbox with super quick, crystal clear examples.\nProvide native CSS code that does the work of an 3rd party CSS framework in a dozen lines of CSS.\n\n\nThese additions to CSS help me feel completely rejuvinated. Thanks to these  changes, I\'m coming to CSS with a beg

In [28]:
# Can you create a list of urls for all of the articles from the blog homepage?
# (like we did for this single article above but a list of urls for every article referenced by an article card)

card_df['article_url'] = card_df.base_url + card_df.relative_url
art_url_list = card_df.article_url.to_list()
art_url_list

['https://ryanorsinger.com/air-fried-tofu/',
 'https://ryanorsinger.com/using-gitigore/',
 'https://ryanorsinger.com/air-fried-avacados/',
 'https://ryanorsinger.com/run-julia-language-on-jupyter-notebook/',
 'https://ryanorsinger.com/intro-to-binary/',
 'https://ryanorsinger.com/effective-practice-with-flash-cards/',
 'https://ryanorsinger.com/intro-to-testing-and-test-driven-development/',
 'https://ryanorsinger.com/how-to-practice-a-5-minute-group-presentation/',
 'https://ryanorsinger.com/cheesy-garlicky/',
 'https://ryanorsinger.com/how-to-start-a-programming-exercise/',
 'https://ryanorsinger.com/fresh-air/']

In [48]:
# Can you use a loop to create a list of dictionaries for all of the blog articles (article_title, article_content, article_date)?
# (like the dictionary we created for the single blog article above)

def get_article_info(url):
  soup = make_soup(url)
  article_title = soup.find('h1').text
  article_content = soup.find('div', class_="kg-card-markdown").text
  article_date = soup.find('time').attrs['datetime']
  art_copy = soup.find('section', class_="copyright").text
  art_dict = {'art_title': article_title, 'art_content': article_content, 'art_date': article_date, 'art_copyright': art_copy}
  return art_dict


In [49]:
art_dict_list = []

# Use a distinct loop variable so the base url stored in `url` is not overwritten.
for art_url in art_url_list:
  art_info = get_article_info(art_url)
  art_dict_list.append(art_info)
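
If any article page is missing one of the elements, `.find()` returns `None` and the loop stops with an `AttributeError`. A small guard like the one sketched below can keep the loop going. `safe_text` is a hypothetical helper name, and the page it runs on is made up.

```python
from bs4 import BeautifulSoup

def safe_text(soup, name, **attrs):
    '''
    Hypothetical helper: return the text of the first matching element,
    or None when the element is missing.
    '''
    element = soup.find(name, **attrs)
    return element.text if element is not None else None

# Hypothetical article page that has a title and content but no <time> element.
page = BeautifulSoup('<h1>Some Post</h1><div class="kg-card-markdown">Body</div>', 'html.parser')

found_title = safe_text(page, 'h1')
found_content = safe_text(page, 'div', class_='kg-card-markdown')
found_date = safe_text(page, 'time')
print(found_title, found_content, found_date)
```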


In [50]:
# Can you now convert your list of dictionaries containing the title, content, and date of all of the blog articles to a DataFrame?


article_df = pd.DataFrame(art_dict_list)



In [51]:
article_df


Unnamed: 0,art_title,art_content,art_date,art_copyright
0,Air Fried Marinated Tofu,Drain then press tofu for 30-45 minutes. A tof...,2020-10-13,Programming Log © 2020
1,Using Gitigore,Gitignore Exercises\nExercise 1\n\nChange dire...,2020-07-20,Programming Log © 2020
2,Air Fried Avocados,"Ingredients:\n\n2 large avacados\nSalt, pepper...",2020-01-16,Programming Log © 2020
3,How to Run Julia Language on Jupyter Notebooks,Rationale\n\nJulia is high performance general...,2019-08-03,Programming Log © 2020
4,Intro to Binary,Intro to Binary Numbers - Why should you care?...,2019-07-18,Programming Log © 2020
5,Effective Practice with Flash Cards,Effective practice with flash-cards is about s...,2019-03-04,Programming Log © 2020
6,Gentle Introduction Test Driven Development (TDD),My Intro to Testing JS tutorial is hot off th...,2019-02-12,Programming Log © 2020
7,Effectively practice a 5-10 minute group prese...,\nIf you're still working on memorizing your s...,2019-02-01,Programming Log © 2020
8,Cheesy Garlic Brussel Sprouts,Dependencies:\n\noven\nstovetop range\nmicrowa...,2018-12-19,Programming Log © 2020
9,How to get started on a programming exercise,"Scenario:\nYou're learning to code, learning s...",2018-08-29,Programming Log © 2020


___

#### Save the Content

Here, we will **save our DataFrame** to a JSON file because JSON handles the multi-line text data that we are scraping better than a CSV file.
```python
# Save df to a json file in the current directory (to_json is a DataFrame method)
df.to_json('df_name.json')
```

```python
# Read a json file into pandas DataFrame
df = pd.read_json('df_name.json')
```

The important part of this step is that we have our scraped data in a format that is usable in the next step of our particular project.
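
To check that multi-line text survives the trip to disk and back, here is a minimal sketch on a made-up frame. `demo_df` is hypothetical, and the file is written to a temporary directory so nothing is left behind.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame with multi-line text, similar in shape to article_df.
demo_df = pd.DataFrame({
    'art_title': ['First Post', 'Second Post'],
    'art_content': ['Line one\nLine two', 'Only one line'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_df.json')
    demo_df.to_json(path)             # write the frame to disk
    restored_df = pd.read_json(path)  # read it back into a new frame

# Compare the text column before and after the round trip.
print(restored_df['art_content'].tolist() == demo_df['art_content'].tolist())
```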


In [53]:
article_df.to_json('article_df.json')

In [55]:
df = pd.read_json('article_df.json')

___

In [56]:
df

Unnamed: 0,art_title,art_content,art_date,art_copyright
0,Air Fried Marinated Tofu,Drain then press tofu for 30-45 minutes. A tof...,2020-10-13,Programming Log © 2020
1,Using Gitigore,Gitignore Exercises\nExercise 1\n\nChange dire...,2020-07-20,Programming Log © 2020
2,Air Fried Avocados,"Ingredients:\n\n2 large avacados\nSalt, pepper...",2020-01-16,Programming Log © 2020
3,How to Run Julia Language on Jupyter Notebooks,Rationale\n\nJulia is high performance general...,2019-08-03,Programming Log © 2020
4,Intro to Binary,Intro to Binary Numbers - Why should you care?...,2019-07-18,Programming Log © 2020
5,Effective Practice with Flash Cards,Effective practice with flash-cards is about s...,2019-03-04,Programming Log © 2020
6,Gentle Introduction Test Driven Development (TDD),My Intro to Testing JS tutorial is hot off th...,2019-02-12,Programming Log © 2020
7,Effectively practice a 5-10 minute group prese...,\nIf you're still working on memorizing your s...,2019-02-01,Programming Log © 2020
8,Cheesy Garlic Brussel Sprouts,Dependencies:\n\noven\nstovetop range\nmicrowa...,2018-12-19,Programming Log © 2020
9,How to get started on a programming exercise,"Scenario:\nYou're learning to code, learning s...",2018-08-29,Programming Log © 2020


## Make Your Work Reproducible

Here, you may want to make your process from above reproducible by creating functions that make your scraping repeatable for this site.

In [32]:
# Create a helper function that obtains the url for each blog article and returns all of them in a list.

def scrape_urls():
  # use the process above as the guts for this function
  pass
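
One way to fill in that stub is sketched below. It is only a suggestion: `extract_article_urls` is a hypothetical name, and it takes the homepage HTML as an argument so you can try it on a made-up string before pointing it at the real page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

def extract_article_urls(html, base_url):
    '''
    Hypothetical helper: parse homepage HTML and return the absolute url
    of the first link inside each <article> card.
    '''
    soup = BeautifulSoup(html, 'html.parser')
    urls = []
    for card in soup.find_all('article'):
        link = card.find('a')
        # Skip cards without a link rather than raising an error.
        if link is not None and link.get('href'):
            urls.append(urljoin(base_url, link.get('href')))
    return urls

# Made-up homepage with two linked cards and one card without a link.
demo_html = '''
<article><a href="/air-fried-tofu/">Tofu</a></article>
<article><a href="/fresh-air/">Air</a></article>
<article><p>No link here</p></article>
'''
demo_urls = extract_article_urls(demo_html, 'https://ryanorsinger.com')
print(demo_urls)
```

Inside `scrape_urls`, you could pass `response.text` from a request to the blog homepage as the `html` argument.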


### Concat the card and article dataframes?

Can the card_df and article_df be made into one dataframe?

In [57]:
# view card_df
card_df

Unnamed: 0,title,content,relative_url,read_time
0,Air Fried Marinated Tofu,Drain then press tofu for 30-45 minutes. A tof...,/air-fried-tofu/,[1 min read]
1,Using Gitigore,Gitignore Exercises Exercise 1 Change director...,/using-gitigore/,[1 min read]
2,Air Fried Avocados,"Ingredients: 2 large avacados Salt, pepper Dir...",/air-fried-avacados/,[1 min read]
3,How to Run Julia Language on Jupyter Notebooks,Rationale Julia is high performance general pu...,/run-julia-language-on-jupyter-notebook/,[1 min read]
4,Intro to Binary,Intro to Binary Numbers - Why should you care?...,/intro-to-binary/,[2 min read]
5,Effective Practice with Flash Cards,Effective practice with flash-cards is about s...,/effective-practice-with-flash-cards/,[3 min read]
6,Gentle Introduction Test Driven Development (TDD),My Intro to Testing JS tutorial is hot off th...,/intro-to-testing-and-test-driven-development/,[1 min read]
7,Effectively practice a 5-10 minute group prese...,If you're still working on memorizing your scr...,/how-to-practice-a-5-minute-group-presentation/,[1 min read]
8,Cheesy Garlic Brussel Sprouts,Dependencies: oven stovetop range microwave pa...,/cheesy-garlicky/,[1 min read]
9,How to get started on a programming exercise,"Scenario: You're learning to code, learning sy...",/how-to-start-a-programming-exercise/,[2 min read]


In [58]:
# view article_df
article_df

Unnamed: 0,art_title,art_content,art_date,art_copyright
0,Air Fried Marinated Tofu,Drain then press tofu for 30-45 minutes. A tof...,2020-10-13,Programming Log © 2020
1,Using Gitigore,Gitignore Exercises\nExercise 1\n\nChange dire...,2020-07-20,Programming Log © 2020
2,Air Fried Avocados,"Ingredients:\n\n2 large avacados\nSalt, pepper...",2020-01-16,Programming Log © 2020
3,How to Run Julia Language on Jupyter Notebooks,Rationale\n\nJulia is high performance general...,2019-08-03,Programming Log © 2020
4,Intro to Binary,Intro to Binary Numbers - Why should you care?...,2019-07-18,Programming Log © 2020
5,Effective Practice with Flash Cards,Effective practice with flash-cards is about s...,2019-03-04,Programming Log © 2020
6,Gentle Introduction Test Driven Development (TDD),My Intro to Testing JS tutorial is hot off th...,2019-02-12,Programming Log © 2020
7,Effectively practice a 5-10 minute group prese...,\nIf you're still working on memorizing your s...,2019-02-01,Programming Log © 2020
8,Cheesy Garlic Brussel Sprouts,Dependencies:\n\noven\nstovetop range\nmicrowa...,2018-12-19,Programming Log © 2020
9,How to get started on a programming exercise,"Scenario:\nYou're learning to code, learning s...",2018-08-29,Programming Log © 2020


In [59]:
# merge dataframes on title and article title
full_df = card_df.merge(article_df, left_on='title', right_on='art_title')
full_df

Unnamed: 0,title,content,relative_url,read_time,art_title,art_content,art_date,art_copyright
0,Air Fried Marinated Tofu,Drain then press tofu for 30-45 minutes. A tof...,/air-fried-tofu/,[1 min read],Air Fried Marinated Tofu,Drain then press tofu for 30-45 minutes. A tof...,2020-10-13,Programming Log © 2020
1,Using Gitigore,Gitignore Exercises Exercise 1 Change director...,/using-gitigore/,[1 min read],Using Gitigore,Gitignore Exercises\nExercise 1\n\nChange dire...,2020-07-20,Programming Log © 2020
2,Air Fried Avocados,"Ingredients: 2 large avacados Salt, pepper Dir...",/air-fried-avacados/,[1 min read],Air Fried Avocados,"Ingredients:\n\n2 large avacados\nSalt, pepper...",2020-01-16,Programming Log © 2020
3,How to Run Julia Language on Jupyter Notebooks,Rationale Julia is high performance general pu...,/run-julia-language-on-jupyter-notebook/,[1 min read],How to Run Julia Language on Jupyter Notebooks,Rationale\n\nJulia is high performance general...,2019-08-03,Programming Log © 2020
4,Intro to Binary,Intro to Binary Numbers - Why should you care?...,/intro-to-binary/,[2 min read],Intro to Binary,Intro to Binary Numbers - Why should you care?...,2019-07-18,Programming Log © 2020
5,Effective Practice with Flash Cards,Effective practice with flash-cards is about s...,/effective-practice-with-flash-cards/,[3 min read],Effective Practice with Flash Cards,Effective practice with flash-cards is about s...,2019-03-04,Programming Log © 2020
6,Gentle Introduction Test Driven Development (TDD),My Intro to Testing JS tutorial is hot off th...,/intro-to-testing-and-test-driven-development/,[1 min read],Gentle Introduction Test Driven Development (TDD),My Intro to Testing JS tutorial is hot off th...,2019-02-12,Programming Log © 2020
7,Effectively practice a 5-10 minute group prese...,If you're still working on memorizing your scr...,/how-to-practice-a-5-minute-group-presentation/,[1 min read],Effectively practice a 5-10 minute group prese...,\nIf you're still working on memorizing your s...,2019-02-01,Programming Log © 2020
8,Cheesy Garlic Brussel Sprouts,Dependencies: oven stovetop range microwave pa...,/cheesy-garlicky/,[1 min read],Cheesy Garlic Brussel Sprouts,Dependencies:\n\noven\nstovetop range\nmicrowa...,2018-12-19,Programming Log © 2020
9,How to get started on a programming exercise,"Scenario: You're learning to code, learning sy...",/how-to-start-a-programming-exercise/,[2 min read],How to get started on a programming exercise,"Scenario:\nYou're learning to code, learning s...",2018-08-29,Programming Log © 2020
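
By default, `merge` performs an inner join, so any card without a matching article title is dropped without warning. The sketch below, built on made-up miniature frames (`cards` and `articles` are hypothetical), shows one way to surface rows that did not match:

```python
import pandas as pd

# Hypothetical miniature versions of card_df and article_df.
cards = pd.DataFrame({'title': ['A', 'B', 'C'], 'relative_url': ['/a/', '/b/', '/c/']})
articles = pd.DataFrame({'art_title': ['A', 'B'], 'art_date': ['2020-10-13', '2020-07-20']})

# how='left' keeps every card; indicator=True adds a _merge column
# showing whether a matching article row was found.
merged = cards.merge(articles, left_on='title', right_on='art_title',
                     how='left', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'title'].tolist()
print(unmatched)
```

Titles are not guaranteed to be unique, so merging on the article url may be a sturdier key if you add it to both DataFrames.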
