# News Data Gathering
This notebook contains the logic for gathering the news datasets for the project. Data is gathered from the AlphaVantage API, which limits the number of requests per minute; the code below pauses after every 75 requests. Helper functions download news for a given asset over a specified date range and pause automatically to avoid hitting the rate limit.

**_getNewsData_**: Gets news articles for a specific asset between two timestamps (`time_from` and `time_to`), with at most 1000 articles returned per request.

**_getDataForPeriod_**: Calls _getNewsData_ repeatedly over consecutive three-week windows to gather data for a specified date range (here 03-2022 to 09-2023).
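
The windowing idea can be illustrated in isolation. The following is a minimal sketch, not the notebook's exact loop: `split_into_windows` is a hypothetical helper, and it assumes each window should start the day after the previous one ends so that no day is requested twice.

```python
from datetime import datetime, timedelta


def split_into_windows(date_start, date_end, weeks=3):
    """Split an inclusive 'YYYYMMDD' date range into (time_from, time_to) pairs.

    Hypothetical helper for illustration; windows do not overlap.
    """
    start = datetime.strptime(date_start, "%Y%m%d")
    end = datetime.strptime(date_end, "%Y%m%d")
    windows = []
    current = start
    while current <= end:
        # A window spans `weeks` full weeks, clipped to the end date.
        window_end = min(current + timedelta(weeks=weeks) - timedelta(days=1), end)
        windows.append(
            (current.strftime("%Y%m%dT0000"), window_end.strftime("%Y%m%dT2359"))
        )
        # The next window starts on the following day.
        current = window_end + timedelta(days=1)
    return windows


windows = split_into_windows("20220301", "20230930")
print(len(windows), windows[0], windows[-1])
```

Each pair can then be passed as `time_from` and `time_to` to a single request.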

In [1]:
import requests
import pandas as pd
import numpy as np
import time
from datetime import datetime
from dateutil.relativedelta import relativedelta

In [2]:
def getNewsData(API_KEY, tickers, time_from, time_to, count=0):
    # Optional query parameters are only added to the URL when provided.
    tickerString = ""
    fromString = ""
    toString = ""

    # Pause first if the per-minute request budget has been used up.
    count = checkRequests(count)

    if tickers is not None:
        tickerString = f"&tickers={tickers}"
    if time_from is not None:
        fromString = f"&time_from={time_from}"
    if time_to is not None:
        toString = f"&time_to={time_to}"

    URL = f"https://www.alphavantage.co/query?function=NEWS_SENTIMENT{tickerString}{fromString}{toString}&apikey={API_KEY}&limit=1000&sort=EARLIEST"
    response = requests.get(URL)
    data = response.json()
    # The articles are returned under the 'feed' key.
    data = data['feed']
    df = pd.DataFrame.from_dict(data)
    return df, count

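def checkRequests(count):
    # After 75 requests, wait just over a minute before sending the next one.
    if count == 75:
        print("API requests reached for min... Waiting")
        time.sleep(62)
        count = 0
    # Count the request that is about to be sent.
    count += 1
    return count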

def getDataForPeriod(API_KEY, tickers, dateStart, dateEnd, count):
    # Dates are given as 'YYYYMMDD' strings.
    start = datetime.strptime(dateStart, '%Y%m%d')
    end = datetime.strptime(dateEnd, '%Y%m%d')
    dfList = []

    current = start
    print(f"Getting data for {tickers} from {start.date()} to {end.date()}")
    while current < end:
        # Each request covers a window of up to three weeks, clipped to the end date.
        # Note: the next window starts on the day this one ends, so that day is requested twice.
        time_from = current.strftime('%Y%m%dT0001')
        if (current + relativedelta(weeks=3)) > end:
            current = end
        else:
            current += relativedelta(weeks=3)
        time_to = current.strftime('%Y%m%dT2359')
        df, count = getNewsData(API_KEY=API_KEY, tickers=tickers, time_from=time_from, time_to=time_to, count=count)
        dfList.append(df)
    # Combine the per-window results into one DataFrame.
    full_df = pd.concat(dfList, ignore_index=True)
    return full_df, count
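
The counter-based pause in `checkRequests` assumes the requests are sent back to back. As an alternative, here is a minimal sketch of a rolling-window limiter that tracks request timestamps instead. `MinuteRateLimiter` is a hypothetical class, not part of the notebook, and the default of 75 calls per 60 seconds simply mirrors the threshold used above.

```python
import time
from collections import deque


class MinuteRateLimiter:
    """Hypothetical helper: allow at most `max_calls` in any rolling `period` seconds."""

    def __init__(self, max_calls=75, period=60.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock    # injectable so the limiter can be tested without waiting
        self.sleep = sleep
        self.calls = deque()  # timestamps of recent requests, oldest first

    def _forget_old_calls(self, now):
        # Drop timestamps that have left the rolling window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()

    def wait(self):
        """Block until another request is allowed, then record it."""
        now = self.clock()
        self._forget_old_calls(now)
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest recorded request leaves the window.
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self._forget_old_calls(now)
        self.calls.append(now)


# Usage: call limiter.wait() immediately before each HTTP request.
limiter = MinuteRateLimiter()
```

Because the limiter only sleeps for as long as the oldest request remains inside the window, it waits less than a fixed 62-second pause when the requests were spread out.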

### Data Gathering

News articles that mention an asset in either the headline or the content are retrieved for all 5 assets (**AAPL**, **TSLA**, **AMZN**, **BTC**, **ETH**).

In [3]:
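import os

# Read the key from the environment instead of hard-coding it (the variable name is an assumption).
API_KEY = os.environ["ALPHAVANTAGE_API_KEY"]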
count = 0
aapl_df, count = getDataForPeriod(API_KEY=API_KEY, tickers="AAPL", dateStart="20220301", dateEnd="20230930", count=count)
tsla_df, count = getDataForPeriod(API_KEY=API_KEY, tickers="TSLA", dateStart="20220301", dateEnd="20230930", count=count)
amzn_df, count = getDataForPeriod(API_KEY=API_KEY, tickers="AMZN", dateStart="20220301", dateEnd="20230930", count=count)
btc_df, count = getDataForPeriod(API_KEY=API_KEY, tickers="CRYPTO:BTC", dateStart="20220301", dateEnd="20230930", count=count)
eth_df, count = getDataForPeriod(API_KEY=API_KEY, tickers="CRYPTO:ETH", dateStart="20220301", dateEnd="20230930", count=count)

Getting data for AAPL from 2022-03-01 to 2023-09-30
Getting data for TSLA from 2022-03-01 to 2023-09-30
Getting data for AMZN from 2022-03-01 to 2023-09-30
API requests reached for min... Waiting
Getting data for CRYPTO:BTC from 2022-03-01 to 2023-09-30
Getting data for CRYPTO:ETH from 2022-03-01 to 2023-09-30


In [4]:
print(f"AAPL Data: {aapl_df.shape}")
print(f"TSLA Data: {tsla_df.shape}")
print(f"AMZN Data: {amzn_df.shape}")
print(f"BTC Data: {btc_df.shape}")
print(f"ETH Data: {eth_df.shape}")

AAPL Data: (16391, 13)
TSLA Data: (19613, 13)
AMZN Data: (11988, 13)
BTC Data: (20150, 13)
ETH Data: (19167, 13)
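
Consecutive request windows in `getDataForPeriod` share their boundary day, so an article published on such a day may appear twice in these counts. The following is a minimal sketch of a duplicate check. It assumes each article carries a `url` column that identifies it uniquely; `drop_duplicate_articles` and the demo data are hypothetical.

```python
import pandas as pd


def drop_duplicate_articles(df, key="url"):
    """Keep the first occurrence of each article, identified by `key` (assumed column)."""
    return df.drop_duplicates(subset=key).reset_index(drop=True)


# Small made-up frame standing in for one of the news DataFrames.
demo = pd.DataFrame(
    {
        "title": ["A", "B", "B"],
        "url": [
            "https://example.com/a",
            "https://example.com/b",
            "https://example.com/b",
        ],
    }
)
deduped = drop_duplicate_articles(demo)
print(f"{len(demo)} rows, {len(deduped)} unique articles")
```

Applied to the real data, this would look like `aapl_df = drop_duplicate_articles(aapl_df)`, provided the column name matches.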


### Store Datasets

The news datasets retrieved from the API are now saved to the datasets directory.

In [5]:
import os

directories = ["./datasets/stocks/news", "./datasets/crypto/news"]
for directory in directories:
    os.makedirs(directory, exist_ok=True)

aapl_df.to_csv("datasets/stocks/news/AAPL_news.csv")
tsla_df.to_csv("datasets/stocks/news/TSLA_news.csv")
amzn_df.to_csv("datasets/stocks/news/AMZN_news.csv")
eth_df.to_csv("datasets/crypto/news/ETH_news.csv")
btc_df.to_csv("datasets/crypto/news/BTC_news.csv")
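
One detail worth knowing when reloading these files: `to_csv` writes the row index as an extra unnamed column by default. The sketch below shows the difference on a small made-up DataFrame written to a temporary directory; the file name and columns are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up data standing in for a news DataFrame.
demo = pd.DataFrame({"title": ["A", "B"], "source": ["X", "Y"]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_news.csv"

    # Default behaviour: the index is written as the first column.
    demo.to_csv(path)
    with_index = pd.read_csv(path)

    # index=False keeps only the data columns.
    demo.to_csv(path, index=False)
    without_index = pd.read_csv(path)

print(list(with_index.columns))
print(list(without_index.columns))
```

When reading the files saved above, passing `index_col=0` to `pd.read_csv` restores the index instead of creating an `Unnamed: 0` column.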