# News API

Using the [News API](https://newsapi.org/), we can retrieve articles on a wide range of subjects from all around the world. Although the API limits how much of each article's text we can retrieve, pairing it with a web scraping library such as Beautiful Soup allows us to retrieve the full text of those news articles.

This API can be particularly useful for topic modelling or text classification.

We will start by installing the necessary libraries for this example.

*Beautiful Soup* will allow us to scrape the articles' web pages, and *newsapi-python* is a Python client library for the News API. The GitHub page of the library can be found here: https://github.com/mattlisiv/newsapi-python.

In [None]:
!pip install beautifulsoup4
!pip install newsapi-python

Collecting newsapi-python
  Downloading newsapi_python-0.2.7-py2.py3-none-any.whl (7.9 kB)
Installing collected packages: newsapi-python
Successfully installed newsapi-python-0.2.7


The very first step is to create an account on the [News API](https://newsapi.org/) website. Once the account is created, you will be able to retrieve an API key from your account page. Assign the API key to the *api_key* variable.
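Rather than pasting the key directly into the notebook, one option is to read it from an environment variable so that it does not end up in shared copies of the notebook. The sketch below assumes a variable named `NEWS_API_KEY`; that name and the `load_api_key` helper are illustrative choices and are not part of the News API or its client library.

```python
import os


def load_api_key(env_var: str = "NEWS_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The variable name is an assumption made for this example.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your News API key")
    return key
```

With the variable set in the environment that launches the notebook, `api_key = load_api_key()` can then stand in for the hard-coded string.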

Since we are on the *Developer* subscription, we only have access to 100 API calls per day. This should be more than enough to test the API and retrieve all the information we want.
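Because each request returns one page of results, the number of calls a search consumes can be estimated up front. The sketch below assumes 100 articles per page (the page size this example relies on) and the quota of 100 calls per day mentioned above; `calls_needed` and `fits_daily_quota` are illustrative helpers, not part of the client library.

```python
import math

PAGE_SIZE = 100    # articles per request, as assumed in this example
DAILY_CALLS = 100  # daily quota quoted above for the Developer subscription


def calls_needed(total_results: int, page_size: int = PAGE_SIZE) -> int:
    """Number of paginated requests required to fetch every result."""
    if total_results <= 0:
        return 1  # one call is still spent discovering that nothing matched
    return math.ceil(total_results / page_size)


def fits_daily_quota(total_results: int, calls_already_made: int = 0) -> bool:
    """True if fetching every result stays within the daily call quota."""
    return calls_already_made + calls_needed(total_results) <= DAILY_CALLS


print(calls_needed(321))  # 4 requests for a search with 321 results
```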

The next step is to set the search query (the *q* variable) to the main subject of the news articles you are looking for.

There are other parameters available for the search queries, such as the language in which the news article is written. A full list of those parameters is available here: https://newsapi.org/docs/client-libraries/python.
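The search parameters can be collected in a dictionary and passed to `get_everything` as keyword arguments, which makes it easy to add or drop filters. The sketch below is a minimal illustration under a few assumptions: `fetch_all` is a hypothetical helper, the client is passed in as an argument, and the stand-in `FakeClient` exists only so that the pagination logic can be run without an API key. The `q`, `language`, `sort_by`, `page` and `page_size` keyword names follow the client library's `get_everything` method.

```python
def fetch_all(client, page_size=100, max_pages=5, **params):
    """Collect articles page by page until a short page signals the end.

    `client` is any object exposing get_everything(**kwargs) and returning a
    dict with an 'articles' list, as the newsapi-python client does.
    `max_pages` caps the number of API calls spent on one search.
    """
    collected = []
    for page in range(1, max_pages + 1):
        response = client.get_everything(page=page, page_size=page_size, **params)
        batch = response.get("articles", [])
        collected.extend(batch)
        if len(batch) < page_size:
            break  # a page that is not full is the last one
    return collected


class FakeClient:
    """Stand-in for NewsApiClient, used only to exercise the loop offline."""

    def __init__(self, total):
        self.total = total
        self.calls = 0

    def get_everything(self, page=1, page_size=100, **params):
        self.calls += 1
        start = (page - 1) * page_size
        stop = min(start + page_size, self.total)
        articles = [{"title": f"article {i}"} for i in range(start, stop)]
        return {"status": "ok", "totalResults": self.total, "articles": articles}


params = {"q": "neuralink", "language": "en", "sort_by": "publishedAt"}
client = FakeClient(total=321)
articles = fetch_all(client, **params)
print(len(articles), client.calls)
```

Capping the number of pages is a design choice made here to protect the daily quota; with a real client, `fetch_all(newsapi, **params)` would be called the same way.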

In [None]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
from newsapi import NewsApiClient

# Lists that will hold the content of the articles
titles = []
authors = []
sources = []
contents = []

# Copy and paste your API key here
api_key = ''

# Authentication process facilitated by the library
newsapi = NewsApiClient(api_key=api_key)

# The different parameters for the Api request
q = 'neuralink'
language = 'en'
page = 1

# Number of articles returned by the last request; a full page (100) means more pages may follow
articleCount = 100

while articleCount == 100:
  # Return the list and content of articles on the current page of our search
  articles = newsapi.get_everything(q=q, language=language, page=page, page_size=100)
  newsDf = pd.DataFrame(articles)
  articleCount = len(newsDf)
  # Read the total from the response itself; indexing row 1 of the dataframe fails when fewer than two articles are returned
  print('Number of articles : ', articleCount + 100 * (page - 1), ' out of ', articles['totalResults'])
  for article in newsDf['articles']:
    # We need a try/except in case an exception is raised, for example when the connection is aborted
    try:
      # Retrieve the HTML code of the news article page (the timeout keeps a stalled site from blocking the loop)
      response = requests.get(article['url'], timeout=10)
      # Initialize the beautifulsoup html parser
      soup = BeautifulSoup(response.text, 'html.parser')
      # Look for the first <article> element on the page and skip to the next article if there is none
      result = soup.find('article')
      if not result:
        continue
      # Now we are looking for all elements with a 'p' tag inside the result of the previous search
      texts = result.find_all('p')
      if not texts:
        continue
      # Join the text of every paragraph, separated by a space so that words do not run together at paragraph boundaries
      articleContent = ' '.join(text.get_text() for text in texts)
      # A costly check, but we have to verify that the article actually mentions the search term
      if q.lower() not in articleContent.lower() and q.lower() not in article['title'].lower():
        continue
      # At this point we should have the content of the article, so we can add all the important information to our lists
      contents.append(articleContent)
      titles.append(article['title'])
      authors.append(article['author'])
      sources.append(article['source']['name'])
    except Exception:
      pass
  page += 1

# We create a dataframe from the content lists that we have
finalDf = pd.DataFrame({
    'article': contents,
    'title': titles,
    'author': authors,
    'source': sources
})

# This part produces a CSV file and compresses it into a zip archive to make the file easier to download and transfer
compression_opts = dict(method='zip', archive_name='News_Api_Articles.csv')
finalDf.to_csv('News_Api_Articles.zip', index=False, compression=compression_opts)

Number of articles :  100  out of  321
Number of articles :  200  out of  321
Number of articles :  300  out of  321
Number of articles :  321  out of  321
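
To check the export, the archive can be read back. The sketch below uses only the standard library (`zipfile` and `csv`) rather than pandas, and first writes a tiny stand-in archive to a temporary folder so that it runs on its own; `read_articles` is an illustrative helper and the sample row is made up for the demonstration. With pandas, `pd.read_csv('News_Api_Articles.zip')` should give the same rows as a dataframe.

```python
import csv
import io
import os
import tempfile
import zipfile


def read_articles(zip_path, member="News_Api_Articles.csv"):
    """Return the rows of the CSV stored inside the zip archive as dicts."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


# Build a small stand-in archive with the same columns as finalDf (sample data only)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["article", "title", "author", "source"])
writer.writeheader()
writer.writerow({"article": "Body text", "title": "A title", "author": "Someone", "source": "Example"})

demo_path = os.path.join(tempfile.mkdtemp(), "demo_articles.zip")
with zipfile.ZipFile(demo_path, "w") as archive:
    archive.writestr("News_Api_Articles.csv", buffer.getvalue())

rows = read_articles(demo_path)
print(rows[0]["title"])
```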
