# Data Acquisition Web Scraping Exercises
### Kwame V. Taylor

By the end of this exercise, you should have a file named ```acquire.py``` that contains the specified functions. If you wish, you may break your work into separate files for each website (e.g. ```acquire_codeup_blog.py``` and ```acquire_news_articles.py```), but the final functions should still be available from ```acquire.py``` (that is, ```acquire.py``` should import ```get_blog_articles``` from the ```acquire_codeup_blog``` module).

1. **Codeup Blog Articles**

Scrape the article text from the following pages:

* https://codeup.com/codeups-data-science-career-accelerator-is-here/
* https://codeup.com/data-science-myths/
* https://codeup.com/data-science-vs-data-analytics-whats-the-difference/
* https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/
* https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/

Encapsulate your work in a function named ```get_blog_articles``` that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

>```
>{
>    'title': 'the title of the article',
>    'content': 'the full text content of the article'
>}
>```

Plus any additional properties you think might be helpful.

**Bonus:**
Scrape the text of all the articles linked on Codeup's blog page.
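
One way to approach the bonus is to collect the article URLs from the blog index page first and then pass them to ```get_blog_articles```. The following is only a sketch: the helper name `extract_links`, the `prefix` filter, and the sample HTML are made up for illustration, and the real page will likely also contain navigation links that need further filtering.

```python
from bs4 import BeautifulSoup


def extract_links(html, prefix='https://codeup.com/'):
    """Return the unique hrefs in html that start with prefix, in page order."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    # href=True skips <a> tags that have no href attribute
    for anchor in soup.find_all('a', href=True):
        href = anchor['href']
        if href.startswith(prefix) and href not in links:
            links.append(href)
    return links


# Made-up HTML standing in for the blog index page
sample_html = """
<a href="https://codeup.com/data-science-myths/">Data Science Myths</a>
<a href="https://codeup.com/data-science-myths/">Read more</a>
<a href="https://example.com/elsewhere/">Elsewhere</a>
<a>No href</a>
"""

print(extract_links(sample_html))
```

The duplicate check keeps the first occurrence of each URL, so an article that is linked from both its title and a "Read more" button is requested only once.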

In [11]:
import numpy as np
import pandas as pd

from requests import get
import re
from bs4 import BeautifulSoup
import os

In [2]:
url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'
headers = {'User-Agent': 'Codeup Data Science'} # Some websites don't accept the python-requests default user-agent
response = get(url, headers=headers)

In [3]:
print(response.text[:400])

<!DOCTYPE html><html lang="en-US"><head >	<meta charset="UTF-8" />
	<meta name="viewport" content="width=device-width, initial-scale=1" />
	<style type="text/css" id="nab-alternative-loader-style"></style>
<script type="text/javascript" id="nelio-ab-testing-kickoff">/* <![CDATA[ */( function() { var ua = window.navigator.userAgent || ''; if ( -1 !== ua.indexOf( 'MSIE ' ) || -1 !== ua.indexOf( 'Tri


In [4]:
# Make a soup variable holding the response content
soup = BeautifulSoup(response.content, 'html.parser')

In [5]:
print(type(soup))

<class 'bs4.BeautifulSoup'>


In [8]:
title = soup.find('h1').text
content = soup.find('p').text
content

'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.'

In [24]:
def get_blog_articles(urls):
    articles = []

    for url in urls:
        # Make request and soup object
        headers = {'User-Agent': 'Codeup Data Science'} 
        response = get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')

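        # get title and paragraph/content
        # NOTE: find() returns only the first matching tag,
        # so content holds just the first <p> on the page
        title = soup.find('h1').text
        content = soup.find('p').text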

        # store articles
        article = {'title': title, 'content': content}
        articles.append(article)
            
    # save as a DataFrame
    df = pd.DataFrame(articles)
    
    return df

In [25]:
df = get_blog_articles(['https://codeup.com/codeups-data-science-career-accelerator-is-here/',
        'https://codeup.com/data-science-myths/',
        'https://codeup.com/data-science-vs-data-analytics-whats-the-difference/',
        'https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/',
        'https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/'])
df

| | title | content |
|---|---|---|
| 0 | Codeup’s Data Science Career Accelerator is Here! | The rumors are true! The time has arrived. Cod... |
| 1 | Data Science Myths | By Dimitri Antoniou and Maggie Giust |
| 2 | Data Science VS Data Analytics: What’s The Dif... | By Dimitri Antoniou |
| 3 | 10 Tips to Crush It at the SA Tech Job Fair | The third bi-annual San Antonio Tech Job Fair ... |
| 4 | Competitor Bootcamps Are Closing. Is the Model... | |
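
In the output above, rows 1 and 2 contain only a byline and row 4 is empty. That is consistent with `soup.find('p')` returning just the first `<p>` tag on each page. Below is a minimal sketch of one alternative that joins the text of every paragraph instead. The helper name `parse_article` and the sample HTML are made up for illustration, and the real Codeup markup may call for a narrower selector than "every `<p>` on the page".

```python
from bs4 import BeautifulSoup


def parse_article(html):
    """Return {'title': ..., 'content': ...} for one page of HTML (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')

    # find() returns only the first match, or None when nothing matches
    h1 = soup.find('h1')
    title = h1.get_text(strip=True) if h1 else ''

    # find_all() returns every match, so all paragraphs are kept
    paragraphs = [p.get_text(' ', strip=True) for p in soup.find_all('p')]
    content = '\n'.join(text for text in paragraphs if text)

    return {'title': title, 'content': content}


# Made-up HTML standing in for response.text
sample_html = """
<html><body>
  <h1>Data Science Myths</h1>
  <p>By Dimitri Antoniou and Maggie Giust</p>
  <p>First paragraph of the body.</p>
  <p>Second paragraph of the body.</p>
</body></html>
"""

article = parse_article(sample_html)
print(article['title'])
print(article['content'])
```

Returning a plain dictionary also matches the shape the exercise asks for; a list of these can still be passed to `pd.DataFrame` afterwards.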


2. **News Articles**

We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

* Business
* Sports
* Technology
* Entertainment

The end product of this should be a function named ```get_news_articles``` that returns a list of dictionaries, where each dictionary has this shape:

>```
>{
>    'title': 'The article title',
>    'content': 'The article content',
>    'category': 'business' # for example
>}
>```

Hints:

a. Start by inspecting the website in your browser. Figure out which elements will be useful.

b. Start by creating a function that handles a single article and produces a dictionary like the one above.

c. Next create a function that will find all the articles on a single page and call the function you created in the last step for every article on the page.

d. Now create a function that will use the previous two functions to scrape the articles from all the pages that you need, and do any additional processing that needs to be done.
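
Hints b through d can be sketched as three small functions. Everything about the markup here is an assumption: the `news-card` class and the `itemprop` values `headline` and `articleBody` are placeholders that have to be checked against the live page in the browser (hint a). The `fetch` parameter is also an illustrative choice, used so the sketch can run on made-up HTML without a network request.

```python
from bs4 import BeautifulSoup

BASE_URL = 'https://inshorts.com/en/read/'


def parse_card(card, category):
    """Hint b: turn one article element into a dictionary."""
    # ASSUMPTION: headline and body are marked with these itemprop values
    headline = card.find(attrs={'itemprop': 'headline'})
    body = card.find(attrs={'itemprop': 'articleBody'})
    return {
        'title': headline.get_text(strip=True) if headline else '',
        'content': body.get_text(strip=True) if body else '',
        'category': category,
    }


def parse_page(html, category):
    """Hint c: find every article on one page and parse each of them."""
    soup = BeautifulSoup(html, 'html.parser')
    # ASSUMPTION: each article sits in an element with class "news-card"
    return [parse_card(card, category) for card in soup.find_all(class_='news-card')]


def get_news_articles(categories, fetch):
    """Hint d: collect the articles for every category.

    fetch is any callable that takes a URL and returns the page HTML.
    """
    articles = []
    for category in categories:
        articles.extend(parse_page(fetch(BASE_URL + category), category))
    return articles


# Made-up HTML so the sketch runs without a network request
sample_html = """
<div class="news-card">
  <span itemprop="headline">Example headline</span>
  <div itemprop="articleBody">Example body text.</div>
</div>
<div class="news-card">
  <span itemprop="headline">Second headline</span>
  <div itemprop="articleBody">More body text.</div>
</div>
"""

articles = get_news_articles(['business'], fetch=lambda url: sample_html)
print(articles)
```

In the notebook, `fetch` could be something like `lambda url: get(url, headers=headers).text`, reusing the `get` import and `headers` dictionary from the cells above.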

In [38]:
# I think h1 and h2 will be helpful info to scrape.
# The navigation links are in div, and start with 'action_links'

#def get_article(url):
url = 'https://inshorts.com/en/read/business'
headers = {'User-Agent': 'Codeup Data Science'} # Some websites don't accept the python-requests default user-agent
response = get(url, headers=headers)
#    return

In [39]:
print(response.text[:400])

<!doctype html>
<html lang="en">

<head>
  <meta charset="utf-8" />
  <style>
    /* The Modal (background) */
    .modal_contact {
        display: none; /* Hidden by default */
        position: fixed; /* Stay in place */
        z-index: 8; /* Sit on top */
        left: 0;
        top: 0;
        width: 100%; /* Full width */
        height: 100%;
        overflow: auto; /* Enable scroll if ne


In [40]:
# Make a soup variable holding the response content
soup = BeautifulSoup(response.content, 'html.parser')
print(type(soup))

<class 'bs4.BeautifulSoup'>


In [46]:
title = soup.find('span').text
content = soup.find('content')
title

'toggle menu'

In [None]:
def find_articles():
    return

In [43]:
def get_news_articles(urls):
    articles = []

    for url in urls:
        # Make request and soup object
        headers = {'User-Agent': 'Codeup Data Science'} 
        response = get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')

        # get title and paragraph/content
        title = soup.find('h1').text
        content = soup.find('p').text

        # store articles
        article = {'title': title, 'content': content}
        articles.append(article)
            
    # save as a DataFrame
    df = pd.DataFrame(articles)
    
    return df

In [44]:
df = get_news_articles(['https://inshorts.com/en/read/business',
        'https://inshorts.com/en/read/sports',
        'https://inshorts.com/en/read/technology',
        'https://inshorts.com/en/read/entertainment'])
df

AttributeError: 'NoneType' object has no attribute 'text'
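
The `AttributeError` above is what happens when one of the `soup.find(...)` calls matches nothing: `find` returns `None`, and `None` has no `.text` attribute. A small guard makes the missing tag visible instead of crashing. The helper name `first_text` and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup


def first_text(soup, name, default=None):
    """Text of the first <name> tag, or default when the tag is absent."""
    tag = soup.find(name)
    return tag.get_text(strip=True) if tag is not None else default


# Made-up page with a paragraph but no <h1>
soup = BeautifulSoup('<html><body><p>Some text</p></body></html>', 'html.parser')

print(soup.find('h1'))         # None: calling .text on this raises AttributeError
print(first_text(soup, 'h1'))  # None, returned without an exception
print(first_text(soup, 'p'))   # Some text
```

A `None` result is a signal that the selector does not match the page's markup, so the tag names copied from the blog scraper need to be replaced with ones that actually appear on the inshorts pages.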