# Exercises
- By the end of this exercise, you should have a file named acquire.py that contains the specified functions. If you wish, you may break your work into separate files for each website (e.g. acquire_codeup_blog.py and acquire_news_articles.py), but the final functions should be available from acquire.py (that is, acquire.py should import get_blog_articles from the acquire_codeup_blog module).

#1 Codeup Blog Articles

- Scrape the article text from the following pages:

    - https://codeup.com/codeups-data-science-career-accelerator-is-here/
    - https://codeup.com/data-science-myths/
    - https://codeup.com/data-science-vs-data-analytics-whats-the-difference/
    - https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/
    - https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/

- Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

Plus any additional properties you think might be helpful.
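A minimal sketch of that shape, assuming the two keys `title` and `content` that the notebook cells below fill in. The values are placeholders, not scraped data.

```python
# Assumed shape of one article dictionary; the keys mirror the ones used
# later in this notebook, and the values are placeholders only.
example_article = {
    "title": "the title of the article",
    "content": "the full text content of the article",
}

# get_blog_articles is expected to return a list of such dictionaries,
# one per scraped page.
example_articles = [example_article]
```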


#2 News Articles

- We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

- Write a function that scrapes the news articles for the following topics:

    - Business
    - Sports
    - Technology
    - Entertainment


- The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

- Hints:

    - a. Start by inspecting the website in your browser. Figure out which elements will be useful.
    - b. Start by creating a function that handles a single article and produces a dictionary like the one above.
    - c. Next create a function that will find all the articles on a single page and call the function you created in the last step for every article on the page.
    - d. Now create a function that will use the previous two functions to scrape the articles from all the pages that you need, and do any additional processing that needs to be done.
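
Hints b through d describe three layers of functions. The sketch below shows only how they compose. The names `parse_article`, `parse_page`, and `scrape_all` are hypothetical, and the "cards" are plain dictionaries standing in for parsed HTML elements so the example runs offline. The `category` key is an assumption based on the functions written later in this notebook.

```python
def parse_article(card, category):
    """Hint b: turn one article element into a dictionary."""
    return {
        "title": card["headline"].strip(),
        "content": card["body"].strip(),
        "category": category,
    }


def parse_page(cards, category):
    """Hint c: apply parse_article to every article found on one page."""
    return [parse_article(card, category) for card in cards]


def scrape_all(pages):
    """Hint d: combine the articles from every page into one list."""
    articles = []
    for category, cards in pages.items():
        articles.extend(parse_page(cards, category))
    return articles


# Hypothetical stand-in data: one page per category, one card per page.
pages = {
    "business": [{"headline": " Example headline ", "body": "Example body."}],
    "sports": [{"headline": "Another headline", "body": "Another body."}],
}
articles = scrape_all(pages)
```

In the real scraper each card would be a Beautiful Soup tag, and only `parse_article` would need to know how to read it.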

In [135]:
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import os.path

# 1. Codeup Blog Articles

In [86]:
#take a look at an article from Codeup's blog
url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'
headers = {'User-Agent': 'Codeup Data Science'} # Some websites don't accept the python-requests default user-agent
response = get(url, headers=headers)

**Note: sending a custom User-Agent header is the polite way to request a page: it identifies who is making the request instead of showing the default python-requests value in the site's logs.**

In [87]:
#After making the request, we'll perform a quick sanity check to make sure what we are looking at is indeed HTML data.
#Can see this is an html string
print(response.text[:400])

<!DOCTYPE html><html lang="en-US"><head >	<meta charset="UTF-8" />
	<meta name="viewport" content="width=device-width, initial-scale=1" />
	<meta name='robots' content='index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1' />
<style type="text/css" id="nab-alternative-loader-style"></style>
<script type="text/javascript" id="nelio-ab-testing-kickoff">/*nelio-ab-testing-kick


In [88]:
#a status code of 200 tells us the request succeeded
response.status_code

200
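
Checking `status_code` by eye works in a notebook, but a reusable function may want to fail loudly instead. The sketch below is one way to do that with `raise_for_status()`. The function name `fetch_html` and its `http_get` parameter are hypothetical; `http_get` exists only so the function can be exercised without a network connection.

```python
from requests import get


def fetch_html(url, user_agent="Codeup Data Science", http_get=get):
    """Return the page HTML, raising an error on a 4xx or 5xx response."""
    response = http_get(url, headers={"User-Agent": user_agent}, timeout=10)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx status codes
    # and does nothing for a successful response such as 200.
    response.raise_for_status()
    return response.text
```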

In [89]:
# response.content

In [90]:
# Make a soup variable holding the response content
#'html.parser' tells Beautiful Soup to use Python's built-in HTML parser
soup = BeautifulSoup(response.content, 'html.parser')

In [91]:
# see also `soup.find_all`
#
# beautiful soup uses `class_` as the keyword argument for searching
# for a class because `class` is a reserved word in python
# we'll use the class name that we identified from looking in the inspector in chrome
article = soup.find('div', class_='jupiterx-post-content')
article.text

'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.\nData Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.\nOur program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspace, along with input from dozens of practitioners and hiring partners. Student
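
The text above still contains `\xa0` (non-breaking spaces) and `\n` characters. A minimal cleanup sketch follows, assuming paragraph breaks do not need to be preserved; the helper name `clean_text` is hypothetical.

```python
import re


def clean_text(text):
    """Replace non-breaking spaces and collapse runs of whitespace."""
    # \xa0 is a non-breaking space, which often appears in scraped HTML.
    text = text.replace("\xa0", " ")
    # Collapse newlines, tabs, and repeated spaces into single spaces.
    return re.sub(r"\s+", " ", text).strip()


sample = "land a job in\xa0Glassdoor’s #1 Best Job in America.\nData Science is a method"
cleaned = clean_text(sample)
```

If later analysis needs the paragraph structure, replacing only `\xa0` and leaving `\n` alone would be the safer choice.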

In [92]:
#another way to get title, commented out 
# soup.find(class_='jupiterx-post-title').text

In [93]:
#creating a dictionary with placeholder values for the title and content
article_dictionary = {'title': [], 'content': []}

In [94]:
#Use the title attribute of beautiful soup to pull out the page's <title> string
soup.title.string

'Codeup’s Data Science Career Accelerator is Here! - Codeup'
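
`soup.title.string` returns the contents of the page's `<title>` tag, which here ends with the site name. A small sketch for trimming that suffix, assuming it is always the literal string `" - Codeup"`; the helper name is hypothetical.

```python
def strip_site_suffix(page_title, suffix=" - Codeup"):
    """Remove a trailing site name from a <title> string, if present."""
    if page_title.endswith(suffix):
        return page_title[: -len(suffix)]
    return page_title


article_title = strip_site_suffix(
    "Codeup’s Data Science Career Accelerator is Here! - Codeup"
)
```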

In [95]:
#setting title key and content key to their values
article_dictionary['title']= soup.title.string
article_dictionary['content']= article.text

In [96]:
#printing dictionary with their keys and values
article_dictionary

{'title': 'Codeup’s Data Science Career Accelerator is Here! - Codeup',
 'content': 'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.\nData Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.\nOur program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and R

In [97]:
#take a look at an article from Codeup's blog (data science myths)
url = 'https://codeup.com/data-science-myths/'
headers = {'User-Agent': 'Codeup Data Science Myths'} # Some websites don't accept the python-requests default user-agent
response = get(url, headers=headers)

In [98]:
#After making the request, we'll perform a quick sanity check to make sure what we are looking at is indeed HTML data.
#Can see this is an html string
print(response.text[:400])

<!DOCTYPE html><html lang="en-US"><head >	<meta charset="UTF-8" />
	<meta name="viewport" content="width=device-width, initial-scale=1" />
	<meta name='robots' content='index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1' />
<style type="text/css" id="nab-alternative-loader-style"></style>
<script type="text/javascript" id="nelio-ab-testing-kickoff">/*nelio-ab-testing-kick


In [99]:
# Make a soup variable holding the response content
#'html.parser' tells Beautiful Soup to use Python's built-in HTML parser
soup = BeautifulSoup(response.content, 'html.parser')

In [100]:
# see also `soup.find_all`
#
# beautiful soup uses `class_` as the keyword argument for searching
# for a class because `class` is a reserved word in python
# we'll use the class name that we identified from looking in the inspector in chrome
article = soup.find('div', class_='jupiterx-post-content')
article.text

'By Dimitri Antoniou and Maggie Giust\nData Science, Big Data, Machine Learning, NLP, Neural Networks…these buzzwords have rapidly spread into mainstream use over the last few years. Unfortunately, definitions are varied and sources of truth are limited. Data Scientists are in fact not magical unicorn wizards who can snap their fingers and turn a business around! Today, we’ll take a cue from our favorite Mythbusters to tackle some common myths and misconceptions in the field of Data Science.\n\xa0\nMyth #1: Data Science = Statistics\nAt first glance, this one doesn’t sound unreasonable. Statistics is defined as, “A branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data.” That sounds a lot like our definition of Data Science: a method of drawing actionable intelligence from data. \nIn truth, statistics is actually one small piece of Data Science. As our Senior Data Scientist puts it, “Statistics forces us to make assumpt

In [101]:
#creating a dictionary with placeholder values for the title and content
article_dictionary = {'title': [], 'content': []}

In [102]:
#Use the title attribute of beautiful soup to pull out the page's <title> string
soup.title.string

'Data Science Myths - Codeup'

In [103]:
#setting title key and content key to their values
article_dictionary['title']= soup.title.string
article_dictionary['content']= article.text

In [104]:
#printing dictionary with their keys and values
article_dictionary

{'title': 'Data Science Myths - Codeup',
 'content': 'By Dimitri Antoniou and Maggie Giust\nData Science, Big Data, Machine Learning, NLP, Neural Networks…these buzzwords have rapidly spread into mainstream use over the last few years. Unfortunately, definitions are varied and sources of truth are limited. Data Scientists are in fact not magical unicorn wizards who can snap their fingers and turn a business around! Today, we’ll take a cue from our favorite Mythbusters to tackle some common myths and misconceptions in the field of Data Science.\n\xa0\nMyth #1: Data Science = Statistics\nAt first glance, this one doesn’t sound unreasonable. Statistics is defined as, “A branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data.” That sounds a lot like our definition of Data Science: a method of drawing actionable intelligence from data. \nIn truth, statistics is actually one small piece of Data Science. As our Senior Data Sci

In [105]:
#take a look at an article from Codeup's blog (data science vs data analytics)
url = 'https://codeup.com/data-science-vs-data-analytics-whats-the-difference/'
headers = {'User-Agent': 'Codeup Data Science vs Data Analytics'} # Some websites don't accept the python-requests default user-agent
response = get(url, headers=headers)

In [106]:
#After making the request, we'll perform a quick sanity check to make sure what we are looking at is indeed HTML data.
#Can see this is an html string
print(response.text[:400])

<!DOCTYPE html><html lang="en-US"><head >	<meta charset="UTF-8" />
	<meta name="viewport" content="width=device-width, initial-scale=1" />
	<meta name='robots' content='index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1' />
<style type="text/css" id="nab-alternative-loader-style"></style>
<script type="text/javascript" id="nelio-ab-testing-kickoff">/*nelio-ab-testing-kick


In [107]:
# Make a soup variable holding the response content
#'html.parser' tells Beautiful Soup to use Python's built-in HTML parser
soup = BeautifulSoup(response.content, 'html.parser')

In [108]:
# see also `soup.find_all`
#
# beautiful soup uses `class_` as the keyword argument for searching
# for a class because `class` is a reserved word in python
# we'll use the class name that we identified from looking in the inspector in chrome
article = soup.find('div', class_='jupiterx-post-content')
article.text

'By Dimitri Antoniou\nA week ago, Codeup launched our immersive Data Science career accelerator! With our first-class kicking off in February and only 25 seats available, we’ve been answering a lot of questions from prospective students. One, in particular, has come up so many times we decided to dedicate a blog post to it. What is the difference between data science and data analytics?\nFirst, let’s define some of our terms! Take a look at this blog to understand what Data Science is. In short, it is a method of turning raw data into action, leading to the desired outcome. Big Data refers to data sets that are large and complex, usually exceeding the capacity of computers and normal processing power to deal with. Machine Learning is the process of ‘learning’ underlying patterns of data in order to automate the extraction of intelligence from that data.\n\xa0\n\xa0\nNow, let’s look at the data pipeline that data scientists work through to reach the actionable insights and outcomes we m

In [109]:
#creating a dictionary with placeholder values for the title and content
article_dictionary = {'title': [], 'content': []}

In [110]:
#Use the title attribute of beautiful soup to pull out the page's <title> string
soup.title.string

'Data Science VS Data Analytics: What’s The Difference? - Codeup'

In [111]:
#setting title key and content key to their values
article_dictionary['title']= soup.title.string
article_dictionary['content']= article.text

In [112]:
#printing dictionary with their keys and values
article_dictionary

{'title': 'Data Science VS Data Analytics: What’s The Difference? - Codeup',
 'content': 'By Dimitri Antoniou\nA week ago, Codeup launched our immersive Data Science career accelerator! With our first-class kicking off in February and only 25 seats available, we’ve been answering a lot of questions from prospective students. One, in particular, has come up so many times we decided to dedicate a blog post to it. What is the difference between data science and data analytics?\nFirst, let’s define some of our terms! Take a look at this blog to understand what Data Science is. In short, it is a method of turning raw data into action, leading to the desired outcome. Big Data refers to data sets that are large and complex, usually exceeding the capacity of computers and normal processing power to deal with. Machine Learning is the process of ‘learning’ underlying patterns of data in order to automate the extraction of intelligence from that data.\n\xa0\n\xa0\nNow, let’s look at the data pipe

In [113]:
#take a look at an article from Codeup's blog (10 tips for SA Tech Job Fair)
url = 'https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/'
headers = {'User-Agent': 'Codeup 10 tips for SA Tech Job Fair'} # Some websites don't accept the python-requests default user-agent
response = get(url, headers=headers)

In [114]:
#After making the request, we'll perform a quick sanity check to make sure what we are looking at is indeed HTML data.
#Can see this is an html string
print(response.text[:400])

<!DOCTYPE html><html lang="en-US"><head >	<meta charset="UTF-8" />
	<meta name="viewport" content="width=device-width, initial-scale=1" />
	<meta name='robots' content='index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1' />
<style type="text/css" id="nab-alternative-loader-style"></style>
<script type="text/javascript" id="nelio-ab-testing-kickoff">/*nelio-ab-testing-kick


In [115]:
# Make a soup variable holding the response content
#'html.parser' tells Beautiful Soup to use Python's built-in HTML parser
soup = BeautifulSoup(response.content, 'html.parser')

In [116]:
# see also `soup.find_all`
#
# beautiful soup uses `class_` as the keyword argument for searching
# for a class because `class` is a reserved word in python
# we'll use the class name that we identified from looking in the inspector in chrome
article = soup.find('div', class_='jupiterx-post-content')
article.text

'SA Tech Job Fair\nThe third bi-annual San Antonio Tech Job Fair is just around the corner. Over 25 companies will be at The Jack Guenther Pavilion\xa0on April 10th, and they are hungry for new tech team members!\nAt the job fair, companies want to quickly source a list of new talent leads. AKA they need to find qualified employees they can begin interviewing for jobs. Recruiters will represent their organization at tables with informational handouts and company swag. Your goal at a job fair is to set yourself apart from other candidates and ensure your name makes it to the top of those lead lists.\nThink of your interaction with the company as a mini screening interview. The company rep will subtly evaluate basic qualities like your professionalism, communication and interpersonal skills, work experience, and interest level in the organization. Job fairs are also an opportunity for you to gain information about companies that may not be easily accessible online. \xa0\nAt Codeup, we’re

In [117]:
#creating a dictionary with placeholder values for the title and content
article_dictionary = {'title': [], 'content': []}

In [118]:
#Use the title attribute of beautiful soup to pull out the page's <title> string
soup.title.string

'10 Tips to Crush It at the SA Tech Job Fair - Codeup'

In [119]:
#setting title key and content key to their values
article_dictionary['title']= soup.title.string
article_dictionary['content']= article.text

In [120]:
#printing dictionary with their keys and values
article_dictionary

{'title': '10 Tips to Crush It at the SA Tech Job Fair - Codeup',
 'content': 'SA Tech Job Fair\nThe third bi-annual San Antonio Tech Job Fair is just around the corner. Over 25 companies will be at The Jack Guenther Pavilion\xa0on April 10th, and they are hungry for new tech team members!\nAt the job fair, companies want to quickly source a list of new talent leads. AKA they need to find qualified employees they can begin interviewing for jobs. Recruiters will represent their organization at tables with informational handouts and company swag. Your goal at a job fair is to set yourself apart from other candidates and ensure your name makes it to the top of those lead lists.\nThink of your interaction with the company as a mini screening interview. The company rep will subtly evaluate basic qualities like your professionalism, communication and interpersonal skills, work experience, and interest level in the organization. Job fairs are also an opportunity for you to gain information ab

In [121]:
#take a look at an article from Codeup's blog (competitor bootcamps are closing)
url = 'https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/'
headers = {'User-Agent': 'Codeup Competitor Bootcamps Are Closing. Is the Model in Danger?'} # Some websites don't accept the python-requests default user-agent
response = get(url, headers=headers)

In [122]:
#After making the request, we'll perform a quick sanity check to make sure what we are looking at is indeed HTML data.
#Can see this is an html string
print(response.text[:400])

<!DOCTYPE html><html lang="en-US"><head >	<meta charset="UTF-8" />
	<meta name="viewport" content="width=device-width, initial-scale=1" />
	<meta name='robots' content='index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1' />
<style type="text/css" id="nab-alternative-loader-style"></style>
<script type="text/javascript" id="nelio-ab-testing-kickoff">/*nelio-ab-testing-kick


In [123]:
# Make a soup variable holding the response content
#'html.parser' tells Beautiful Soup to use Python's built-in HTML parser
soup = BeautifulSoup(response.content, 'html.parser')

In [124]:
# see also `soup.find_all`
#
# beautiful soup uses `class_` as the keyword argument for searching
# for a class because `class` is a reserved word in python
# we'll use the class name that we identified from looking in the inspector in chrome
article = soup.find('div', class_='jupiterx-post-content')
article.text

'Competitor Bootcamps Are Closing. Is the Model in Danger?\n\xa0\n\nIs the programming bootcamp model in danger?\nIn recent news, DevBootcamp and The Iron Yard announced that they are closing their doors. This is big news. DevBootcamp was the first programming bootcamp model and The Iron Yard is a national player with 15 campuses across the U.S. In both cases, the companies cited an unsustainable business model. Does that mean the boot-camp model is dead?\n\ntl;dr “Nope!”\nBootcamps exist because traditional education models have failed to provide students job-ready skills for the 21st century. Students demand better employment options from their education. Employers demand skilled and job ready candidates. Big Education’s failure to meet those needs through traditional methods created the fertile ground for the new business model of the programming bootcamp.\nEducation giant Kaplan and Apollo Education Group (owner of University of Phoenix) bought their way into this new educational m

In [125]:
#creating a dictionary with placeholder values for the title and content
article_dictionary = {'title': [], 'content': []}

In [126]:
#Use the title attribute of beautiful soup to pull out the page's <title> string
soup.title.string

'Competitor Bootcamps Are Closing. Is the Model in Danger? - Codeup'

In [127]:
#setting title key and content key to their values
article_dictionary['title']= soup.title.string
article_dictionary['content']= article.text

In [128]:
#printing dictionary with their keys and values
article_dictionary

{'title': 'Competitor Bootcamps Are Closing. Is the Model in Danger? - Codeup',
 'content': 'Competitor Bootcamps Are Closing. Is the Model in Danger?\n\xa0\n\nIs the programming bootcamp model in danger?\nIn recent news, DevBootcamp and The Iron Yard announced that they are closing their doors. This is big news. DevBootcamp was the first programming bootcamp model and The Iron Yard is a national player with 15 campuses across the U.S. In both cases, the companies cited an unsustainable business model. Does that mean the boot-camp model is dead?\n\ntl;dr “Nope!”\nBootcamps exist because traditional education models have failed to provide students job-ready skills for the 21st century. Students demand better employment options from their education. Employers demand skilled and job ready candidates. Big Education’s failure to meet those needs through traditional methods created the fertile ground for the new business model of the programming bootcamp.\nEducation giant Kaplan and Apollo E

In [129]:
def get_codeup_blog(url):
    
    # Set the headers to an old Mozilla/4.5-style user-agent (HTTrack on Windows 98), b/c I feel like creating an anomaly in the logs
    headers = {"User-Agent": "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"}

    # Get the http response object from the server
    response = get(url, headers=headers)
    
    # Name the parser explicitly so Beautiful Soup does not have to guess one
    soup = BeautifulSoup(response.text, "html.parser")
    
    title = soup.find("h1").text
    published_date = soup.time.text
    
    if len(soup.select(".jupiterx-post-image")) > 0:
        blog_image = soup.select(".jupiterx-post-image")[0].picture.img["data-src"]
    else:
        blog_image = None
        
    content = soup.select(".jupiterx-post-content")[0].text
    
    output = {}
    output["title"] = title
    output["published_date"] = published_date
    output["blog_image"] = blog_image
    output["content"] = content
    
    return output

In [130]:
def get_blog_articles(urls):
    # Build a list of dictionaries, one per blog post
    posts = [get_codeup_blog(url) for url in urls]
    
    # Return the posts as a DataFrame, one row per post
    return pd.DataFrame(posts)
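
The imports cell brings in `os.path`, which the notebook does not otherwise use. One plausible use, stated here as an assumption, is caching scraped results on disk so repeated runs do not hit the site again. In the sketch below the helper name `load_or_scrape`, its parameters, and the JSON file format are all hypothetical.

```python
import json
import os.path


def load_or_scrape(path, scrape):
    """Return cached articles from `path`, scraping only if no cache exists."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    # `scrape` is any zero-argument callable returning a list of dictionaries.
    articles = scrape()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(articles, f, ensure_ascii=False, indent=2)
    return articles
```

Because `get_blog_articles` returns a DataFrame, it could be passed in as `lambda: get_blog_articles(urls).to_dict("records")` to get the JSON-friendly list of dictionaries this helper expects.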

In [136]:
urls = [
    "https://codeup.com/codeups-data-science-career-accelerator-is-here/",
    "https://codeup.com/data-science-myths/",
    "https://codeup.com/data-science-vs-data-analytics-whats-the-difference/",
    "https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/",
    "https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/"
]

In [137]:
df = get_blog_articles(urls)

In [138]:
df.head()

| | title | published_date | blog_image | content |
|---|---|---|---|---|
| 0 | Codeup’s Data Science Career Accelerator is Here! | September 30, 2018 | https://codeup.com/wp-content/uploads/2018/10/... | The rumors are true! The time has arrived. Cod... |
| 1 | Data Science Myths | October 31, 2018 | https://codeup.com/wp-content/uploads/2018/10/... | By Dimitri Antoniou and Maggie Giust\nData Sci... |
| 2 | Data Science VS Data Analytics: What’s The Dif... | October 17, 2018 | https://codeup.com/wp-content/uploads/2018/10/... | By Dimitri Antoniou\nA week ago, Codeup launch... |
| 3 | 10 Tips to Crush It at the SA Tech Job Fair | August 14, 2018 | | SA Tech Job Fair\nThe third bi-annual San Anto... |
| 4 | Competitor Bootcamps Are Closing. Is the Model... | August 14, 2018 | | Competitor Bootcamps Are Closing. Is the Model... |
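
The `published_date` values in this output are plain strings such as `September 30, 2018`. A minimal sketch for turning them into date objects with the standard library, assuming every date follows that "Month day, year" pattern and that month names are in English; the helper name is hypothetical.

```python
from datetime import datetime


def parse_published_date(text):
    """Parse a date string like 'September 30, 2018' into a date object."""
    # %B is the full month name, %d the day of the month, %Y the 4-digit year.
    return datetime.strptime(text.strip(), "%B %d, %Y").date()


first_date = parse_published_date("September 30, 2018")
```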


# 2. News Articles
- https://inshorts.com/en/read/business
- https://inshorts.com/en/read/sports
- https://inshorts.com/en/read/technology
- https://inshorts.com/en/read/entertainment
- https://inshorts.com/en/read/science
- https://inshorts.com/en/read/world

In [139]:
categories = ['business', 'sports', 'technology', 'entertainment', 'science', 'world']

In [140]:
base_url = 'https://inshorts.com/en/read/'

In [141]:
first_cat = categories[0]

In [142]:
first_page = base_url + first_cat

In [143]:
first_page

'https://inshorts.com/en/read/business'
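
The same concatenation can be applied to every category at once. The sketch below builds one URL per category and strips whitespace first, since a stray space in a category name would end up inside the URL. The `category_urls` name is hypothetical.

```python
from urllib.parse import urljoin

base_url = "https://inshorts.com/en/read/"
categories = ["business", "sports", "technology", "entertainment", "science", "world"]

category_urls = {}
for category in categories:
    # strip() guards against accidental leading or trailing whitespace.
    name = category.strip()
    # base_url ends with "/", so urljoin appends the name to the path.
    category_urls[name] = urljoin(base_url, name)
```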

In [144]:
headers

{'User-Agent': 'Codeup Competitor Bootcamps Are Closing. Is the Model in Danger?'}

In [145]:
# step one: get our content
response = get(first_page, headers=headers)

In [146]:
response.text[:400]

'<!doctype html>\n<html lang="en">\n\n<head>\n  <meta charset="utf-8" />\n  <style>\n    /* The Modal (background) */\n    .modal_contact {\n        display: none; /* Hidden by default */\n        position: fixed; /* Stay in place */\n        z-index: 8; /* Sit on top */\n        left: 0;\n        top: 0;\n        width: 100%; /* Full width */\n        height: 100%;\n        overflow: auto; /* Enable scroll if ne'

In [147]:
# make our soup
soup = BeautifulSoup(response.text, 'html.parser')

In [148]:
articles = soup.select('.news-card')

In [149]:
articles[0]

<div class="news-card z-depth-1" itemscope="" itemtype="http://schema.org/NewsArticle">
<span content="" itemid="https://inshorts.com/en/news/amazon-job-posting-fuels-speculations-about-plan-to-accept-payments-in-crypto-1627312165039" itemprop="mainEntityOfPage" itemscope="" itemtype="https://schema.org/WebPage"></span>
<span itemprop="author" itemscope="itemscope" itemtype="https://schema.org/Person">
<span content="Pragya Swastik" itemprop="name"></span>
</span>
<span content="Amazon job posting fuels speculations about plan to accept payments in crypto" itemprop="description"></span>
<span itemprop="image" itemscope="" itemtype="https://schema.org/ImageObject">
<meta content="https://static.inshorts.com/inshorts/images/v1/variants/jpg/m/2021/07_jul/26_mon/img_1627309467319_923.jpg?" itemprop="url"/>
<meta content="864" itemprop="width"/>
<meta content="483" itemprop="height"/>
</span>
<span itemprop="publisher" itemscope="itemscope" itemtype="https://schema.org/Organization">
<span 

In [150]:
articles[1].select("[itemprop='articleBody']")[0]

<div itemprop="articleBody">China's Larry Chen, a former teacher who became a billionaire with edtech company Gaotu Techedu, lost his billionaire status after his company's shares fell 98%. Chen, Gaotu Techedu's Founder and CEO, is now worth $336 million according to Bloomberg. The development comes as China's new regulations banned companies teaching school curriculums from making profits, raising capital or going public.</div>

In [151]:
def get_article(article, category):
    # Attribute selector
    title = article.select("[itemprop='headline']")[0].text
    
    # article body
    content = article.select("[itemprop='articleBody']")[0].text
    
    output = {}
    output["title"] = title
    output["content"] = content
    output["category"] = category
    
    return output
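
`get_article` indexes `[0]` into each `select` result, which raises an `IndexError` if a card lacks one of the elements. The sketch below shows the same attribute selectors with `select_one`, which returns `None` when nothing matches. The HTML is hypothetical, simplified markup modeled on the news-card output above, not a copy of the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modeled on an inshorts news card.
html = """
<div class="news-card">
  <span itemprop="headline">Example headline</span>
  <div itemprop="articleBody">Example body text.</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
card = soup.select_one(".news-card")

# [itemprop='...'] matches any tag whose itemprop attribute has that value.
headline = card.select_one("[itemprop='headline']")
body = card.select_one("[itemprop='articleBody']")

# select_one returns None when nothing matches, so guard before using .text.
title = headline.text if headline is not None else None
content = body.text if body is not None else None
```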

In [152]:
def get_articles(category, base="https://inshorts.com/en/read/"):
    """
    Takes in a category as a string. The category must be one that is available on inshorts.
    Returns a list of dictionaries, where each dictionary represents a single inshorts article.
    """
    
    # We concatenate our base_url with the category
    url = base + category
    
    # Set the headers
    headers = {"User-Agent": "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"}

    # Get the http response object from the server
    response = get(url, headers=headers)

    # Make soup out of the raw html
    soup = BeautifulSoup(response.text, "html.parser")
    
    # Ignore everything, focusing only on the news cards
    articles = soup.select(".news-card")
    
    output = []
    
    # Iterate through every article tag/soup 
    for article in articles:
        
        # Returns a dictionary of the article's title, body, and category
        article_data = get_article(article, category) 
        
        # Append the dictionary to the list
        output.append(article_data)
    
    # Return the list of dictionaries
    return output

In [153]:
# Example of using the get_articles function sending in the category name that's part of the URL
# get_articles("business")

In [154]:
def get_all_news_articles(categories):
    """
    Takes in a list of categories where the category is part of the URL pattern on inshorts
    Returns a dataframe of every article from every category listed
    Each row in the dataframe is a single article
    """
    all_inshorts = []

    for category in categories:
        all_category_articles = get_articles(category)
        all_inshorts = all_inshorts + all_category_articles

    df = pd.DataFrame(all_inshorts)
    return df
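
The exercise statement asks for a function named `get_news_articles` that returns a list of dictionaries, whereas `get_all_news_articles` returns a DataFrame. A sketch of a list-returning variant follows. Passing the per-category scraper in as the `get_articles` argument and the optional `delay` between requests are both assumptions made so the example is self-contained.

```python
import time


def get_news_articles(categories, get_articles, delay=0.0):
    """Return one flat list of article dictionaries across all categories."""
    all_articles = []
    for category in categories:
        # get_articles(category) is expected to return a list of dictionaries.
        all_articles.extend(get_articles(category))
        # Optional pause between categories to go easy on the server.
        time.sleep(delay)
    return all_articles
```

Wrapping the result in `pd.DataFrame(...)` afterwards gives the same table as before, so both shapes stay available.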

In [155]:
categories = ["business", "sports", "technology", "entertainment", "science", "world"]
df = get_all_news_articles(categories)

In [156]:
df

| | title | content | category |
|---|---|---|---|
| 0 | Amazon job posting fuels speculations about pl... | A new job posting by Amazon has fuelled specul... | business |
| 1 | China's ex-teacher turned billionaire no more ... | China's Larry Chen, a former teacher who becam... | business |
| 2 | Musk takes a jibe at rival car companies, says... | Tesla CEO and the world's second-richest perso... | business |
| 3 | Unemployment rate rises in both urban, rural a... | India's unemployment rate soared to 7.14% in t... | business |
| 4 | Govt paid Infosys ₹164.5 crore for new Income ... | The government paid ₹164.5 crore to Infosys to... | business |
| ... | ... | ... | ... |
| 142 | Lebanese lawmakers pick billionaire Najib Mika... | Lebanese lawmakers during parliamentary consul... | world |
| 143 | Ugandan govt spends $30 mn on cars for lawmake... | The Ugandan government was criticised after it... | world |
| 144 | New Zealand agrees to accept alleged Islamic S... | New Zealand on Monday agreed to repatriate an ... | world |
| 145 | UAE extends ban on passenger flights from Indi... | The UAE has extended a ban on passenger flight... | world |
