## **Data Collection From API**

### Web API: News API

### Student Number: 21202384

In [5]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [6]:
# Importing Needed Libraries
import requests # Library for Getting data Via GET method
import json # Json Library
from time import time,sleep
import os
import numpy as np
import pandas as pd

In [7]:
import os
os.listdir('/content/gdrive/My Drive')

['Colab Notebooks',
 'Documents',
 'Local Notebooks',
 'Books_Hack',
 'Figure 1.pdf',
 'Customer_Churn.ipynb',
 'Okechukwu Joshua Ifeanyi Results Transcript.pdf',
 'Copy of StarterNotebook.ipynb',
 'Trust Wallet',
 'Copy of Hamoye Data Science Internships Handbook 2022.pdf',
 'FoodBalanceSheets_E_Africa_NOFLAG.csv',
 'Copy of FoodBalanceSheets_E_Africa_NOFLAG.csv',
 'data',
 'News_Data',
 'News_Data_Merged.json',
 'Josh_300Level_Result.pdf']

We want the top news headlines from major news outlets. To extract them, we first need a list of available sources. The sources endpoint is queried for information about the available major news outlets, and the ID of each source is parsed out of the returned data. The headlines for each of these IDs are then retrieved from the News Data API endpoint.

### News Sources API

This endpoint returns the subset of news publishers for which top headlines (/v2/top-headlines) are available. It is mainly a convenience endpoint that you can use to keep track of the publishers available on the API, and you can pipe it straight through to your users.

In [8]:
url_path='https://newsapi.org/v2/top-headlines/sources?apiKey=0995bb58a5ab449a8e8399a9162c2df7' # Path to get data
# Define a function to query the data
def Query_API(url_path=url_path,parameters=None):
    # Keep retrying until the request goes through
    while True:
        try:
            response = requests.get(url_path,params=parameters)
            break
        except requests.exceptions.RequestException: # Only catch request failures, not e.g. KeyboardInterrupt
            print("sleep for 5 seconds")
            sleep(5) # Delays execution for 5 seconds before retrying
    return response
# Call the Query_API function
response_sources=Query_API(url_path=url_path)
print('Response Status Code: {}'.format(response_sources.status_code))

Response Status Code: 200


Sending many requests at very short intervals from the same IP address may cause the server to reject some of them. The function therefore wraps requests.get in a try/except block, so that a failed request does not terminate the code. After a failure, execution is delayed by 5 seconds before the request is sent again. The while loop repeats until a request succeeds, at which point the break statement exits the loop. Note that requests.get raises an exception only for connection-level failures; an HTTP error status is returned as a normal response, which is why the status code is printed afterwards.
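The retry loop described above runs forever if the server never becomes reachable. As a minimal sketch, assuming a cap on the number of attempts is acceptable, the same idea can be written with a bounded loop. The helper name `call_with_retry` and all of its parameters are hypothetical and not part of this notebook; `sleep_fn` is injectable only so that the demo runs without real delays.

```python
from time import sleep

def call_with_retry(func, max_attempts=5, delay=5, exceptions=(Exception,), sleep_fn=sleep):
    """Call func() until it succeeds or max_attempts attempts have failed.

    Hypothetical helper for illustration; not part of the notebook.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except exceptions as error:
            last_error = error
            print('Attempt {} failed: {}'.format(attempt, error))
            if attempt < max_attempts:
                sleep_fn(delay)  # Wait before the next attempt
    raise RuntimeError('Gave up after {} attempts'.format(max_attempts)) from last_error


# Demo with a stand-in function that fails twice before succeeding
attempts = {'count': 0}

def flaky_request():
    attempts['count'] += 1
    if attempts['count'] < 3:
        raise ConnectionError('simulated network failure')
    return 'response'

result = call_with_retry(flaky_request, delay=0, sleep_fn=lambda seconds: None)
print(result, attempts['count'])
```

With requests, this could be called as `call_with_retry(lambda: requests.get(url_path, params=parameters), exceptions=(requests.exceptions.RequestException,))`.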

In [9]:
response_sources=response_sources.json() # Parses the JSON response body into a Python object
print('Retrieved data is of type {}'.format(type(response_sources)))

Retrieved data is of type <class 'dict'>


In [10]:
print('Length of Data Set {}'.format(len(response_sources)))
response_sources.keys()

Length of Data Set 2


dict_keys(['status', 'sources'])

The response is a dict with the key 'status', which shows the outcome of our data request, and the key 'sources', which holds the list of news sources we want to extract.
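Because the 'status' key reports whether the request worked, it can be checked before the 'sources' list is used. The sketch below assumes that a successful response carries the status value 'ok' and that an error response carries 'code' and 'message' keys; the function name `extract_sources` and the sample payload are made up for illustration.

```python
def extract_sources(payload):
    """Return the list of sources, or raise if the API reported an error."""
    if payload.get('status') != 'ok':
        # 'code' and 'message' are assumed to describe the error
        raise ValueError('API error: {} {}'.format(payload.get('code'), payload.get('message')))
    return payload.get('sources', [])


# Made-up payload with the same shape as the response above
sample_payload = {
    'status': 'ok',
    'sources': [{'id': 'abc-news', 'name': 'ABC News'}],
}
source_ids = [source['id'] for source in extract_sources(sample_payload)]
print(source_ids)
```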

In [11]:
response_sources=response_sources['sources']
print('Response  Sources Data Type: {}'.format(type(response_sources)))
response_sources[:5]

Response  Sources Data Type: <class 'list'>


[{'category': 'general',
  'country': 'us',
  'description': 'Your trusted source for breaking news, analysis, exclusive interviews, headlines, and videos at ABCNews.com.',
  'id': 'abc-news',
  'language': 'en',
  'name': 'ABC News',
  'url': 'https://abcnews.go.com'},
 {'category': 'general',
  'country': 'au',
  'description': "Australia's most trusted source of local, national and world news. Comprehensive, independent, in-depth analysis, the latest business, sport, weather and more.",
  'id': 'abc-news-au',
  'language': 'en',
  'name': 'ABC News (AU)',
  'url': 'http://www.abc.net.au/news'},
 {'category': 'general',
  'country': 'no',
  'description': 'Norges ledende nettavis med alltid oppdaterte nyheter innenfor innenriks, utenriks, sport og kultur.',
  'id': 'aftenposten',
  'language': 'no',
  'name': 'Aftenposten',
  'url': 'https://www.aftenposten.no'},
 {'category': 'general',
  'country': 'us',
  'description': 'News, analysis from the Middle East and worldwide, multimedi

In [12]:
print('Total Number of News Sources: {:d}'.format(len(response_sources))) # The length of the response list equals the number of data sources
# We will use list comprehension to Iterate over the content of the response list and extract the 'id' from each dict.
sources=[x['id'] for x in response_sources]
sources[:5]

Total Number of News Sources: 128


['abc-news', 'abc-news-au', 'aftenposten', 'al-jazeera-english', 'ansa']

The sources variable contains the list of available news sources in the API. The output above shows the first 5 news sources in sorted order.

The API can return news from multiple sources in one request. Its sources parameter takes a comma-separated string of identifiers (maximum 20) for the news sources or blogs you want headlines from. The code below joins the source IDs into comma-separated strings of at most 20 IDs each and stores them in a list.

In [13]:
# Groups the source IDs into comma-separated strings of at most 20 IDs each.
sources_list=[] # List to hold the joined strings
for x in range(0,len(sources),20):
    # Slicing past the end of the list just returns the remaining entries
    sources_list.append(','.join(sources[x:x+20]))
print('Length of Sources List: {}'.format(len(sources_list)))
sources_list[:1]

Length of Sources List: 7


['abc-news,abc-news-au,aftenposten,al-jazeera-english,ansa,argaam,ars-technica,ary-news,associated-press,australian-financial-review,axios,bbc-news,bbc-sport,bild,blasting-news-br,bleacher-report,bloomberg,breitbart-news,business-insider,business-insider-uk']

No special handling is needed for the last batch. The sources list has 128 entries, so after 6 iterations of 20 entries each, 8 entries are left. Python slicing past the end of a list does not raise an exception, so no try/except is required: the final slice, sources[120:140], simply returns the remaining 8 entries, which gives 7 strings in total.
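The batching arithmetic can be checked on its own. The following sketch uses a hypothetical helper, `chunk_ids`, and 128 made-up identifiers in place of the real source IDs.

```python
def chunk_ids(ids, size=20):
    """Join ids into comma-separated strings of at most `size` ids each."""
    return [','.join(ids[start:start + size]) for start in range(0, len(ids), size)]


# 128 made-up identifiers standing in for the real source IDs
demo_ids = ['source-{}'.format(n) for n in range(128)]
batches = chunk_ids(demo_ids)

print(len(batches))                 # 7 batches
print(len(batches[-1].split(',')))  # 8 ids in the last batch
```

Taking the step from `len(ids)` rather than a hard-coded 128 keeps the helper correct if the number of sources changes.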

### News Data API 

Only news from these sources is selected, by passing one batch of up to 20 sources to the API at each iteration.
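One way to make the request for each batch easier to read is to build the query string from a dict instead of concatenating strings. This is only a sketch: the function name `build_everything_url` and the placeholder key 'YOUR_API_KEY' are assumptions, while the endpoint and parameter names are the ones used in this notebook.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = 'https://newsapi.org/v2/everything'

def build_everything_url(source_batch, api_key, date_from, date_to, page=1):
    """Build the request URL for one batch of sources and one page."""
    query = urlencode({
        'sources': source_batch,   # Comma-separated string of up to 20 source IDs
        'from': date_from,
        'to': date_to,
        'sortBy': 'popularity',
        'page': page,
        'apiKey': api_key,         # Placeholder; supply your own key
    })
    return BASE_URL + '?' + query


url = build_everything_url('abc-news,bbc-news', 'YOUR_API_KEY', '2022-02-01', '2022-02-28', page=5)
print(url)

# Parse the URL back to confirm the parameters survive the round trip
parsed = parse_qs(urlsplit(url).query)
print(parsed['sources'])
```

The same dict could also be passed to `requests.get` through its `params` argument, which performs the encoding itself.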

In [18]:
news_list=[]
now=time() # Start time, used to calculate the duration of data retrieval.
for no,source_batch in enumerate(sources_list): # Iterates over the batches of source IDs
    url_path='https://newsapi.org/v2/everything?sources='+source_batch+'&from=2022-02-01&to=2022-02-28&sortBy=popularity&apiKey=0995bb58a5ab449a8e8399a9162c2df7' # API path
    for page in range(5,10): # Gets pages 5 to 9; the range is changed between runs
        response_articles=Query_API(url_path,parameters={'page':page}) # Uses the Query_API function to query the API.
        news_list.append(response_articles.json())  # Parse the JSON response into a dict
        sleep(0.1)
    sleep(2) # Delays further execution by 2 seconds
    print('Data Batch {} received'.format(no))
print(news_list[0].keys())
elapsed=time()-now
print('Time Taken to extract data is {} Mins {} Seconds'.format(elapsed//60,elapsed%60)) # Prints the time taken to execute the cell

Data Batch 0 received
Data Batch 1 received
Data Batch 2 received
Data Batch 3 received
Data Batch 4 received
Data Batch 5 received
Data Batch 6 received
dict_keys(['status', 'totalResults', 'articles'])
Time Taken to extract data is 0.0 Mins 23.760852575302124 Seconds


The API query is delayed by 2 seconds after each batch of 20 data sources has been collected.

The News API is queried multiple times over a period to get enough data for analysis. This is done by changing the range of the page parameter between runs.
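Each entry of news_list is one page response with the keys 'status', 'totalResults' and 'articles'. For analysis it may be convenient to combine all pages into a single list of articles. The sketch below assumes that pages with a status other than 'ok' should be skipped; the helper name `flatten_articles` and the sample pages are made up for illustration.

```python
def flatten_articles(pages):
    """Collect the articles of all successful pages into one list.

    Returns the combined list and the number of pages that were skipped.
    """
    articles = []
    skipped = 0
    for page in pages:
        if page.get('status') == 'ok':
            articles.extend(page.get('articles', []))
        else:
            skipped += 1  # Page reported an error, so it has no articles to add
    return articles, skipped


# Made-up pages with the same top-level keys as the real responses
sample_pages = [
    {'status': 'ok', 'totalResults': 2, 'articles': [{'title': 'A'}, {'title': 'B'}]},
    {'status': 'error', 'code': 'exampleError', 'message': 'made-up error page'},
    {'status': 'ok', 'totalResults': 1, 'articles': [{'title': 'C'}]},
]
all_articles, skipped_pages = flatten_articles(sample_pages)
print(len(all_articles), skipped_pages)
```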

The needed data has now been extracted: the news sources data and the news article data. Rather than saving them separately, the two are merged into a dict and saved as a single JSON file.

### Saving Data To Path

In [60]:
# Merging the two data sets into a single dict
data={'sources':response_sources,'articles':news_list}
data.keys()

dict_keys(['sources', 'articles'])

In [61]:
# Save Data To Path
path_to_save='/content/gdrive/My Drive'
os.makedirs(os.path.join(path_to_save,'News_Data'),exist_ok=True) # Create the folder if it does not exist yet
with open(os.path.join(path_to_save,'News_Data_Merged (1).json'),'w') as file_open:
    json.dump(data,fp=file_open)
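After saving, it can be worth confirming that the file reads back into the same structure. The sketch below does this round trip in a temporary directory instead of Google Drive; the helper name `save_and_reload` and the sample data are hypothetical.

```python
import json
import os
import tempfile

def save_and_reload(data, directory, filename):
    """Write data as JSON into directory/filename and read it back."""
    os.makedirs(directory, exist_ok=True)  # Create the folder if it is missing
    path = os.path.join(directory, filename)
    with open(path, 'w') as file_open:
        json.dump(data, fp=file_open)
    with open(path) as file_open:
        return json.load(file_open)


# Made-up data with the same two top-level keys as the merged dict
sample_data = {
    'sources': [{'id': 'abc-news'}],
    'articles': [{'status': 'ok', 'totalResults': 0, 'articles': []}],
}
with tempfile.TemporaryDirectory() as tmp_dir:
    reloaded = save_and_reload(sample_data, os.path.join(tmp_dir, 'News_Data'), 'News_Data_Merged.json')

print(reloaded == sample_data)
```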