### 1. Codeup Blog Articles - Scrape the article text from the following pages
- https://codeup.com/codeups-data-science-career-accelerator-is-here
- https://codeup.com/data-science-myths/
- https://codeup.com/data-science-vs-data-analytics-whats-the-difference/
- https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/
- https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/

**Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:**

`{
    'title': 'the title of the article',
    'content': 'the full text content of the article'
}`

In [9]:
from requests import get
import pandas as pd
from bs4 import BeautifulSoup

In [10]:
# Create a helper function that requests and parses HTML returning a soup object.

def make_soup(url):
    '''
    This helper function takes in a url, requests the page, and parses
    the HTML, returning a soup object.
    '''
    headers = {'User-Agent': 'Codeup Data Science'}
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup

In [11]:
def get_blog_articles(urls, cached=False):
    '''
    This function takes in a list of Codeup blog urls and a cached flag.
    With the default, cached=False, it scrapes the title and text for each url,
    creates a list of dictionaries with the title and content for each blog,
    converts the list to a df, writes the df to a json file, and returns the df.
    If cached=True, the function instead returns a df read from that json file.
    '''
    if cached:
        df = pd.read_json('big_blogs.json')
        
    # cached == False completes a fresh scrape for df     
    else:

        # Create an empty list to hold dictionaries
        articles = []

        # Loop through each url in our list of urls
        for url in urls:

            # Make request and soup object using helper
            soup = make_soup(url)

            # Save the title of each blog in variable title
            title = soup.find('h1').text

            # Save the text in each blog to variable content
            content = soup.find('div', class_="jupiterx-post-content").text

            # Create a dictionary holding the title and content for each blog
            article = {'title': title, 'content': content}

            # Add each dictionary to the articles list of dictionaries
            articles.append(article)
            
        # convert our list of dictionaries to a df
        df = pd.DataFrame(articles)

        # Write df to a json file for faster access
        df.to_json('big_blogs.json')
    
    return df
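
The cached/fresh switch above can also be sketched without pandas. This is a minimal illustration, not the notebook's implementation: `load_or_scrape`, `fake_scrape`, and the temporary file name are hypothetical, and unlike the function above it falls back to a fresh scrape when the cache file does not exist yet.

```python
import json
import tempfile
from pathlib import Path


def load_or_scrape(cache_path, scrape, cached=False):
    """Return cached records if requested and present; otherwise scrape and cache."""
    path = Path(cache_path)

    # Only trust the cache when the caller asks for it and the file exists.
    if cached and path.exists():
        return json.loads(path.read_text(encoding='utf-8'))

    # Otherwise run the scraper and write the result for next time.
    records = scrape()
    path.write_text(json.dumps(records, ensure_ascii=False, indent=2), encoding='utf-8')
    return records


# Stand-in scraper (assumption) so the sketch runs without network access.
def fake_scrape():
    return [{'title': 'Example title', 'content': 'Example content'}]


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / 'blogs.json'
    fresh = load_or_scrape(cache_file, fake_scrape, cached=False)  # scrapes, writes file
    reread = load_or_scrape(cache_file, fake_scrape, cached=True)  # reads file back
    print(fresh == reread)
```

Passing the scraper in as a callable keeps the caching logic separate from the scraping logic, so the same helper could wrap either the blog scraper or a news scraper.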

In [12]:
# Here cached == False, so the function will do a fresh scrape of the urls and write data to a json file.

urls = ['https://codeup.com/codeups-data-science-career-accelerator-is-here/',
        'https://codeup.com/data-science-myths/',
        'https://codeup.com/data-science-vs-data-analytics-whats-the-difference/',
        'https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/',
        'https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/']

blogs = get_blog_articles(urls=urls, cached=False)
blogs

| | title | content |
|---|---|---|
| 0 | Codeup’s Data Science Career Accelerator is Here! | The rumors are true! The time has arrived. Cod... |
| 1 | Data Science Myths | By Dimitri Antoniou and Maggie Giust\nData Sci... |
| 2 | Data Science VS Data Analytics: What’s The Dif... | By Dimitri Antoniou\nA week ago, Codeup launch... |
| 3 | 10 Tips to Crush It at the SA Tech Job Fair | SA Tech Job Fair\nThe third bi-annual San Anto... |
| 4 | Competitor Bootcamps Are Closing. Is the Model... | Competitor Bootcamps Are Closing. Is the Model... |


### Bonus question

In [13]:
url = 'https://codeup.com/resources/#blog'
soup = make_soup(url)

In [14]:
# I'm filtering my soup to return a list of all anchor elements from my HTML.

urls_list = soup.find_all('a', class_='jet-listing-dynamic-link__link')

In [15]:
# Extract the href attribute value from each anchor element in my list; the list holds 40 links.

# I'm using a set comprehension to return only unique urls because there are two links for each article.
urls = {link.get('href') for link in urls_list}

# I'm converting my set to a list of urls.
urls = list(urls)

print(f'There are {len(urls)} unique links in our urls list.')
print()
urls 


There are 20 unique links in our urls list.



['https://codeup.com/what-is-machine-learning/',
 'https://codeup.com/introducing-salary-refund-guarantee/',
 'https://codeup.com/codeup-in-houston/',
 'https://codeup.com/codeup-wins-civtech-datathon/',
 'https://codeup.com/codeup-alumni-make-water/',
 'https://codeup.com/transition-into-data-science/',
 'https://codeup.com/what-data-science-career-is-for-you/',
 'https://codeup.com/what-to-expect-at-codeup/',
 'https://codeup.com/what-is-python/',
 'https://codeup.com/education-is-an-investment/',
 'https://codeup.com/covid-19-data-challenge/',
 'https://codeup.com/codeup-inc-5000/',
 'https://codeup.com/from-slacker-to-data-scientist/',
 'https://codeup.com/succeed-in-a-coding-bootcamp/',
 'https://codeup.com/codeups-application-process/',
 'https://codeup.com/new-scholarship/',
 'https://codeup.com/build-your-career-in-tech/',
 'https://codeup.com/math-in-data-science/',
 'https://codeup.com/how-were-celebrating-world-mental-health-day-from-home/',
 'https://codeup.com/journey-into

In [16]:
def get_all_urls():
    '''
    This function scrapes all of the Codeup blog urls from
    the main Codeup blog page and returns a list of urls.
    '''
    # The base url for the main Codeup blog page
    url = 'https://codeup.com/resources/#blog' 
    
    # Make request and soup object using helper
    soup = make_soup(url)
    
    # Create a list of the anchor elements that hold the urls.
    urls_list = soup.find_all('a', class_='jet-listing-dynamic-link__link')
    
    # I'm using a set comprehension to return only unique urls because the list contains duplicate urls.
    urls = {link.get('href') for link in urls_list}

    # I'm converting my set to a list of urls.
    urls = list(urls)
        
    return urls
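
A set removes the duplicate links but does not keep the order in which they appear on the page. If page order matters, one possible alternative is `dict.fromkeys`, which preserves first-seen order. The helper name `unique_in_order` and the sample list are illustrative assumptions; anchors without an `href` come back as `None` from `link.get('href')`, so the sketch skips them.

```python
def unique_in_order(hrefs):
    """Drop duplicate and missing hrefs while keeping first-seen order."""
    # dict keys are unique and remember insertion order (Python 3.7+).
    return list(dict.fromkeys(href for href in hrefs if href))


# Illustrative input: each article linked twice, plus one anchor with no href.
sample_hrefs = [
    'https://codeup.com/what-is-machine-learning/',
    'https://codeup.com/what-is-machine-learning/',
    None,
    'https://codeup.com/what-is-python/',
    'https://codeup.com/what-is-python/',
]

print(unique_in_order(sample_hrefs))
```

Inside `get_all_urls`, this would replace the set comprehension and the `list(urls)` conversion, for example `unique_in_order(link.get('href') for link in urls_list)`.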

In [17]:
# Now I can use my same function with my new function.
# cached == False does a fresh scrape.

big_blogs = get_blog_articles(urls=get_all_urls(), cached=False)

In [18]:
big_blogs.head(10)

| | title | content |
|---|---|---|
| 0 | What is Machine Learning? | There’s a lot we can learn about machines, and... |
| 1 | Introducing Our Salary Refund Guarantee | Here at Codeup, we believe it’s time to revolu... |
| 2 | Codeup Launches Houston! | Houston, we have a problem: there aren’t enoug... |
| 3 | Codeup Grads Win CivTech Datathon | Many Codeup alumni enjoy competing in hackatho... |
| 4 | How Codeup Alumni are Helping to Make Water | Imagine having a kit mailed to you with all th... |
| 5 | What is the Transition into Data Science Like? | Alumni Katy Salts and Brandi Reger joined us a... |
| 6 | What Data Science Career is For You? | If you’re struggling to see yourself as a data... |
| 7 | What to Expect at Codeup | Setting Expectations for Life Before, During, ... |
| 8 | What is Python? | If you’ve been digging around our website or r... |
| 9 | Your Education is an Investment | You have many options regarding educational ro... |


### 2. News Articles

**We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.**

**Write a function that scrapes the news articles for the following topics:**

- Business
- Sports
- Technology
- Entertainment

**The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:**

`{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}`

In [21]:
# Make the soup object using my function.

url = 'https://inshorts.com/en/read/entertainment'
soup = make_soup(url)

In [22]:
# Scrape a ResultSet of all the news cards on the page and inspect the elements on the first card.

cards = soup.find_all('div', class_='news-card')

print(f'There are {len(cards)} news cards on this page.')
print()
cards[0]

There are 24 news cards on this page.



<div class="news-card z-depth-1" itemscope="" itemtype="http://schema.org/NewsArticle">
<span content="" itemid="https://inshorts.com/en/news/govinda-didnt-visit-when-one-of-my-twins-was-fighting-for-life-nephew-krushna-1605501525227" itemprop="mainEntityOfPage" itemscope="" itemtype="https://schema.org/WebPage"></span>
<span itemprop="author" itemscope="itemscope" itemtype="https://schema.org/Person">
<span content="Daisy Mowke" itemprop="name"></span>
</span>
<span content="Govinda didn't visit when one of my twins was fighting for life: Nephew Krushna " itemprop="description"></span>
<span itemprop="image" itemscope="" itemtype="https://schema.org/ImageObject">
<meta content="https://static.inshorts.com/inshorts/images/v1/variants/jpg/m/2020/11_nov/16_mon/img_1605501016018_425.jpg?" itemprop="url"/>
<meta content="864" itemprop="width"/>
<meta content="483" itemprop="height"/>
</span>
<span itemprop="publisher" itemscope="itemscope" itemtype="https://schema.org/Organization">
<span 

In [23]:
# Create a list of titles from each card's span element with itemprop='headline', using the .text attribute.

titles = [card.find('span', itemprop='headline').text for card in cards]
titles[:5]

["Govinda didn't visit when one of my twins was fighting for life: Nephew Krushna ",
 'Tamil TV series actor hacked to death, CCTV footage shows argument with gang',
 "Big B's father's statue in Poland honoured with diya on Diwali, actor shares pic",
 "Salman is everything to me: Amaal on trolling by his fans for saying he's SRK fan",
 'Dwayne Johnson fails to get into Porsche for chase sequence in movie, shares pic']

In [24]:
# Create a list of authors from each card's span element with class 'author', using the .text attribute.

authors = [card.find('span', class_='author').text for card in cards]
authors[:5]

['Daisy Mowke', 'Daisy Mowke', 'Daisy Mowke', 'Daisy Mowke', 'Anmol Sharma']

In [25]:
# Create a list of content strings from each card's div element with itemprop='articleBody', using the .text attribute.

content = [card.find('div', itemprop='articleBody').text for card in cards]
content[:5]

['Krushna Abhishek has opted out of an episode of \'The Kapil Sharma Show\' which will feature his uncle, Govinda, as guest. "The enmity has affected me badly. When...relationship between...people is strained, it\'s difficult to perform comedy," he stated. Krushna added, "[He] didn\'t even come to see my twins in hospital, not even when one of them was fighting for life."',
 'Selvarathinam, an actor who played a villain in a Tamil TV series was hacked to death. "[On Sunday]...he received a call after which he left...Later, his roommate received the information [about his death]," police said. CCTV footage shows 4 suspicious men moving about near the murder spot. The actor could be found involved in a brief argument with the gang.',
 'Amitabh Bachchan has shared a photo of a diya being lit near the statue of his late father, poet Harivansh Rai Bachchan, in a square in the Polish city of Wroclaw which has been named after his father. "They honour Babuji by placing a \'diya\' for Deepaval

In [26]:
# Create an empty list, articles, to hold the dictionaries for each article.
articles = []

# Loop through each news card on the page and get what we want
for card in cards:
    title = card.find('span', itemprop='headline' ).text
    author = card.find('span', class_='author').text
    content = card.find('div', itemprop='articleBody').text
    
    # Create a dictionary, article, for each news card
    article = {'title': title, 'author': author, 'content': content}
    
    # Add the dictionary, article, to our list of dictionaries, articles.
    articles.append(article)
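
The loop above assumes every card contains all three elements. `find` returns `None` when nothing matches, so a card without, say, an author span would raise an `AttributeError` on `.text`. A small guard is one way to handle that. This is a sketch only: `text_or_default` is a hypothetical helper, and it relies on the element offering Beautiful Soup's `get_text(strip=True)`.

```python
def text_or_default(element, default=''):
    """Return the element's stripped text, or `default` if `find` returned None."""
    if element is None:
        return default
    # get_text(strip=True) is Beautiful Soup's method for whitespace-trimmed text.
    return element.get_text(strip=True)


# Intended use inside the loop (assumes `card` is a Beautiful Soup Tag):
#     author = text_or_default(card.find('span', class_='author'), default='unknown')
```

Whether to substitute a default or skip the card entirely depends on how the missing field should be treated downstream.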

In [27]:
# Here we see our list contains one dictionary per news card (24 on this page).

print(len(articles))
articles[0]

24


{'title': "Govinda didn't visit when one of my twins was fighting for life: Nephew Krushna ",
 'author': 'Daisy Mowke',
 'content': 'Krushna Abhishek has opted out of an episode of \'The Kapil Sharma Show\' which will feature his uncle, Govinda, as guest. "The enmity has affected me badly. When...relationship between...people is strained, it\'s difficult to perform comedy," he stated. Krushna added, "[He] didn\'t even come to see my twins in hospital, not even when one of them was fighting for life."'}
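
The title above ends with a trailing space, and some of the blog content scraped earlier contains newline characters. If that whitespace is unwanted, it could be normalized before the dictionary is built. `clean_text` below is a hypothetical helper, not part of the original functions.

```python
def clean_text(text):
    """Collapse runs of whitespace (spaces, tabs, newlines) into single spaces and trim the ends."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces.
    return ' '.join(text.split())


title = "Govinda didn't visit when one of my twins was fighting for life: Nephew Krushna "
print(repr(clean_text(title)))
```

Applying it at scrape time, for example `title = clean_text(card.find('span', itemprop='headline').text)`, keeps the cached json file clean as well.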

In [28]:
def get_news_articles(cached=False):
    '''
    With the default cached=False, this function does a fresh scrape of the inshorts pages for the topics
    business, sports, technology, and entertainment and writes the resulting df to a json file.
    With cached=True, it returns a df read in from that json file.
    '''
    # option to read in a json file instead of scraping for the df
    if cached:
        df = pd.read_json('articles.json')
        
    # cached == False completes a fresh scrape for df    
    else:
    
        # Set base_url that will be used in get request
        base_url = 'https://inshorts.com/en/read/'
        
        # List of topics to scrape
        topics = ['business', 'sports', 'technology', 'entertainment']
        
        # Create an empty list, articles, to hold our dictionaries
        articles = []

        for topic in topics:
            
            # Create url with topic endpoint
            topic_url = base_url + topic
            
            # Make request and soup object using helper
            soup = make_soup(topic_url)

            # Scrape a ResultSet of all the news cards on the page
            cards = soup.find_all('div', class_='news-card')

            # Loop through each news card on the page and get what we want
            for card in cards:
                title = card.find('span', itemprop='headline' ).text
                author = card.find('span', class_='author').text
                content = card.find('div', itemprop='articleBody').text

                # Create a dictionary, article, for each news card
                article = ({'topic': topic, 
                            'title': title, 
                            'author': author, 
                            'content': content})

                # Add the dictionary, article, to our list of dictionaries, articles.
                articles.append(article)
            
        # Create a DataFrame from list of dictionaries
        df = pd.DataFrame(articles)
        
        # Write df to json file for future use
        df.to_json('articles.json')
    
    return df
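
The exercise asks for a list of dictionaries with the keys `title`, `content`, and `category`, while the function above returns a DataFrame whose rows use `topic` and also carry `author`. A sketch of reshaping the rows into the requested form follows; `to_spec_records` is a hypothetical helper and the sample row is made up for illustration.

```python
def to_spec_records(records):
    """Reshape scraped rows into title/content/category dictionaries."""
    return [
        {
            'title': row['title'],
            'content': row['content'],
            'category': row['topic'],  # the exercise calls this field 'category'
        }
        for row in records
    ]


# Made-up row with the same keys the scraper produces.
rows = [
    {
        'topic': 'business',
        'title': 'Example headline',
        'author': 'Example Author',
        'content': 'Example body text.',
    }
]

print(to_spec_records(rows))
```

With pandas, `df.to_dict('records')` yields a list of row dictionaries that can be passed straight to this helper.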

In [29]:
# Test our function with cached == False to do a fresh scrape and create `articles.json` file.

df = get_news_articles(cached=False)
df.head()

| | topic | title | author | content |
|---|---|---|---|---|
| 0 | business | Moderna's early data shows its COVID-19 vaccin... | Pragya Swastik | American biotechnology company Moderna on Mond... |
| 1 | business | 15 countries sign world's biggest free-trade p... | Pragya Swastik | Fifteen Asia-Pacific countries signed the Regi... |
| 2 | business | How does Moderna's COVID-19 vaccine candidate ... | Pragya Swastik | Moderna's initial results of late-stage trial ... |
| 3 | business | Reliance Retail buys 96% stake in Urban Ladder... | Rishabh Bhatnagar | Reliance Industries' retail arm Reliance Retai... |
| 4 | business | Reduce foreign funding to 26% by Oct 15, 2021:... | Pragya Swastik | The I&B Ministry on Monday asked digital media... |


In [30]:
df.topic.value_counts()

technology       25
sports           25
business         25
entertainment    24
Name: topic, dtype: int64

In [31]:
# Test our function to read in the df from `articles.json`

df = get_news_articles(cached=True)
df.head()

| | topic | title | author | content |
|---|---|---|---|---|
| 0 | business | Moderna's early data shows its COVID-19 vaccin... | Pragya Swastik | American biotechnology company Moderna on Mond... |
| 1 | business | 15 countries sign world's biggest free-trade p... | Pragya Swastik | Fifteen Asia-Pacific countries signed the Regi... |
| 2 | business | How does Moderna's COVID-19 vaccine candidate ... | Pragya Swastik | Moderna's initial results of late-stage trial ... |
| 3 | business | Reliance Retail buys 96% stake in Urban Ladder... | Rishabh Bhatnagar | Reliance Industries' retail arm Reliance Retai... |
| 4 | business | Reduce foreign funding to 26% by Oct 15, 2021:... | Pragya Swastik | The I&B Ministry on Monday asked digital media... |
