# Accessing Datasets

For Homework-3 you'll need to work with text data from social media posts and news headlines.

To get you started, there are two samples prepared for this class:

* `nyt_20200401_20210401` -- NYT article metadata retrieved from the [NYTimes Archive API](https://developer.nytimes.com/docs/archive-product/1/overview)
* `wsb_20200401_20210401` -- WallStreetBets submissions retrieved from [Pushshift.io](https://pushshift.io/) using [the psaw Python client](https://pypi.org/project/psaw/).

These are raw data dumps; this notebook shows how to extract the dates and headlines from them.


## NYTimes Dataset


1. Make sure you have a shortcut to [nyt_20200401_20210401](https://drive.google.com/drive/folders/1PaSu6Rhrx9E6cwfgsAt-EAE2efoxd1N2) available in your Google-Drive account.
1. Also make sure you have your Google-Drive mounted to your Colab instance.  
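
If the Drive folder is not visible yet, it can be mounted from a code cell. The sketch below assumes a Colab runtime, where `google.colab.drive.mount` is available. The helper name `mount_google_drive` and the `/content/drive` mount point are illustrative choices, not part of the class materials. Outside Colab the helper only reports that nothing was mounted.

```python
def mount_google_drive(mount_point="/content/drive"):
    """Mount Google Drive when running inside Colab.

    Returns True if the mount call was made, False when the
    google.colab package is unavailable (i.e. not running in Colab).
    `mount_google_drive` is a hypothetical helper for illustration.
    """
    try:
        # This module is only importable inside a Colab runtime.
        from google.colab import drive
    except ImportError:
        return False
    drive.mount(mount_point)
    return True


mounted = mount_google_drive()
print("Drive mounted:", mounted)
```

Colab notebooks start in `/content`, so after mounting at `/content/drive` the relative paths used in this notebook (`drive/MyDrive/...`) should resolve.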



In [1]:
import itertools
import json
from pathlib import Path

import pandas as pd

In [2]:
# You'll need to change this to match where you've saved the shortcut to the class
# folder in your Google Drive.
NYT_PATH = Path("drive/MyDrive/baruch-nlp/math-9796-2022/share-2022-mth-9796/2022-homework-3/nyt-20200401_20210401/")
assert NYT_PATH.exists(), f"Can't find {NYT_PATH}. Did you remember to mount your Google-Drive?"

In [3]:
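def iter_nytimes_docs(root):
  """Iterate over all "doc" dictionaries in the NYT news-story repository."""
  # Sort the paths so the files are read in a stable order.
  for path in sorted(root.glob("*.json")):
    # Use a context manager so each file handle is closed after loading.
    with path.open() as f:
      dd = json.load(f)
    for doc in dd['response']['docs']:
      yield doc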

def extract_dated_headline(doc):
  """Extract just the (pub_date, headline)-tuple from a "doc" dictionary."""
  return (
      pd.Timestamp(doc['pub_date']),
      doc['headline']['main']
  )

# Show the first 25 articles in the repository
N = 25
ii = iter_nytimes_docs(NYT_PATH)
for _ in range(N):
  print(extract_dated_headline(next(ii)))

(Timestamp('2020-04-01 00:00:07+0000', tz='UTC'), 'Human Rights Group Says Two U.S. Strikes Killed Somali Civilians')
(Timestamp('2020-04-01 00:01:20+0000', tz='UTC'), '‘Never Thought I Would Need It’: Americans Put Pride Aside to Seek Aid')
(Timestamp('2020-04-01 00:43:34+0000', tz='UTC'), '$30 Million in Illegal Drugs Seized From Cross-Border Tunnel in San Diego, U.S. Says')
(Timestamp('2020-04-01 01:38:10+0000', tz='UTC'), 'As Furloughs Grow, Kennedy Center Defends Use of $25 Million in Aid')
(Timestamp('2020-04-01 02:00:03+0000', tz='UTC'), 'Historic Town in Veszprém County')
(Timestamp('2020-04-01 02:19:39+0000', tz='UTC'), 'Trump Calls New Fuel Economy Rule a Boon. Some Experts See Steep Costs.')
(Timestamp('2020-04-01 03:00:55+0000', tz='UTC'), 'This Broccoli-Dill Pasta Has a Hippie Twist. Your Kids Will Love It.')
(Timestamp('2020-04-01 03:01:03+0000', tz='UTC'), 'Quotation of the Day: Cases Spiral Aboard an Aircraft Carrier, and a Commander Pleads for Help')
(Timestamp('2020-0
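
For the homework it is usually more convenient to hold the dated headlines in a DataFrame than to print tuples. The following is a minimal sketch, not the class's reference solution: `headlines_to_frame` is a hypothetical helper, and the two sample dictionaries only mimic the `pub_date` and `headline.main` fields used above (the raw date-string format is an assumption). With the real data you would pass `iter_nytimes_docs(NYT_PATH)` instead of `sample_docs`.

```python
import pandas as pd


def headlines_to_frame(docs):
    """Collect (pub_date, headline) pairs from "doc" dicts into a DataFrame."""
    rows = [
        {
            "pub_date": pd.Timestamp(doc["pub_date"]),
            "headline": doc["headline"]["main"],
        }
        for doc in docs
    ]
    frame = pd.DataFrame(rows, columns=["pub_date", "headline"])
    # Sort chronologically and renumber the rows from zero.
    return frame.sort_values("pub_date").reset_index(drop=True)


# Illustrative stand-ins for the dictionaries yielded by iter_nytimes_docs.
sample_docs = [
    {
        "pub_date": "2020-04-01T00:43:34+0000",
        "headline": {"main": "$30 Million in Illegal Drugs Seized From "
                             "Cross-Border Tunnel in San Diego, U.S. Says"},
    },
    {
        "pub_date": "2020-04-01T00:00:07+0000",
        "headline": {"main": "Human Rights Group Says Two U.S. Strikes "
                             "Killed Somali Civilians"},
    },
]

frame = headlines_to_frame(sample_docs)
print(frame)
```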

## WallStreetBets 


1. Make sure you have a shortcut to [wsb_20200401_20210401](https://drive.google.com/drive/folders/1FUf15yDGHrcVCytSJYIVPBmEIVeZ41j3) available in your Google-Drive account.
1. Also make sure you have your Google-Drive mounted to your Colab instance.  



In [4]:
# You'll need to change this to match where you've saved the shortcut to the class
# folder in your Google Drive.
WSB_PATH = Path("drive/MyDrive/baruch-nlp/math-9796-2022/share-2022-mth-9796/2022-homework-3/wsb-20200401_20210401/")
assert WSB_PATH.exists(), f"Can't find {WSB_PATH}. Did you remember to mount your Google-Drive?"

In [5]:
def iter_wsb_dataframes(root):
  """Iterate over all wsb dataframes."""
  for path in root.glob("*.csv"):
    yield pd.read_csv(path)


# Concatenate the first 10 WSB CSVs into one big DataFrame.
# Note: the variable is named wsb25, but only 10 files are loaded.
# This takes a while to run.
wsb25 = pd.concat(itertools.islice(iter_wsb_dataframes(WSB_PATH), 10))

In [6]:
wsb25[['created_utc_dt','title']]

| | created_utc_dt | title |
|---|---|---|
| 0 | 2020-09-23 18:51:09 | Carnival Earnings Play |
| 1 | 2020-09-23 18:50:16 | Deluxe loss porn made for you! 😎😎😎Thanks for t... |
| 2 | 2020-09-23 18:50:05 | $POLA Hitch a ride on the flight |
| 3 | 2020-09-23 18:49:13 | Should I just sell LAC and take the loss? |
| 4 | 2020-09-23 18:49:11 | Is This The Next Big Short? Michael Burry Goin... |
| ... | ... | ... |
| 95 | 2020-09-22 01:29:49 | Am I doing this right? |
| 96 | 2020-09-22 01:29:32 | Deleted Elon Tweet |
| 97 | 2020-09-22 01:27:47 | Deleted Elon Tweet |
| 98 | 2020-09-22 01:27:27 | Little startup trading group |
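
The row labels in the output above restart for every CSV, and `created_utc_dt` is read from CSV as plain text. The sketch below shows one way to tidy this up. It assumes the column holds UTC timestamps (as the name suggests); `tidy_wsb` is a hypothetical helper, and the two small frames stand in for the real CSV contents.

```python
import pandas as pd


def tidy_wsb(frames):
    """Concatenate WSB frames and return dated titles in time order."""
    # ignore_index=True gives one continuous 0..n-1 index across all files.
    wsb = pd.concat(list(frames), ignore_index=True)
    # Parse the text timestamps; treating them as UTC is an assumption.
    wsb["created_utc_dt"] = pd.to_datetime(wsb["created_utc_dt"], utc=True)
    dated = wsb[["created_utc_dt", "title"]].sort_values("created_utc_dt")
    return dated.reset_index(drop=True)


# Illustrative stand-ins for two CSV files read with pd.read_csv.
first = pd.DataFrame({
    "created_utc_dt": ["2020-09-23 18:51:09", "2020-09-23 18:50:05"],
    "title": ["Carnival Earnings Play", "$POLA Hitch a ride on the flight"],
})
second = pd.DataFrame({
    "created_utc_dt": ["2020-09-22 01:29:49", "2020-09-22 01:27:27"],
    "title": ["Am I doing this right?", "Little startup trading group"],
})

wsb_titles = tidy_wsb([first, second])
print(wsb_titles)
```

With the real data, `tidy_wsb(itertools.islice(iter_wsb_dataframes(WSB_PATH), 10))` would play the same role as the `pd.concat` call above.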


## Downloading Stock Data

In [7]:
!pip install yfinance

Collecting yfinance
  Downloading yfinance-0.1.70-py2.py3-none-any.whl (26 kB)
Collecting requests>=2.26
  Downloading requests-2.27.1-py2.py3-none-any.whl (63 kB)
     |████████████████████████████████| 63 kB 1.0 MB/s
Collecting lxml>=4.5.1
  Downloading lxml-4.8.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (6.4 MB)
     |████████████████████████████████| 6.4 MB 11.1 MB/s
Installing collected packages: requests, lxml, yfinance
  Attempting uninstall: requests
    Found existing installation: requests 2.23.0
    Uninstalling requests-2.23.0:
      Successfully uninstalled requests-2.23.0
  Attempting uninstall: lxml
    Found existing installation: lxml 4.2.6
    Uninstalling lxml-4.2.6:
      Successfully uninstalled lxml-4.2.6
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests

In [8]:
import yfinance

In [9]:
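# Daily bars are requested with `interval`; `period` is a lookback window and
# is not needed when explicit start and end dates are given.
tsla = yfinance.download(['TSLA'], start=pd.Timestamp('2020-04-01'), end=pd.Timestamp('2021-04-01'), interval="1d")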

[*********************100%***********************]  1 of 1 completed


In [10]:
tsla.head()

| Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 2020-04-01 | 100.800003 | 102.790001 | 95.019997 | 96.311996 | 96.311996 | 66766000 |
| 2020-04-02 | 96.206001 | 98.851997 | 89.279999 | 90.893997 | 90.893997 | 99292000 |
| 2020-04-03 | 101.900002 | 103.098 | 93.678001 | 96.001999 | 96.001999 | 112810500 |
| 2020-04-06 | 102.239998 | 104.199997 | 99.592003 | 103.248001 | 103.248001 | 74509000 |
| 2020-04-07 | 109.0 | 113.0 | 106.468002 | 109.089996 | 109.089996 | 89599000 |
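
When relating headlines to price moves, a daily return series is often easier to work with than raw prices. The sketch below is one possible starting point, not a required step of the homework. It rebuilds the five `Adj Close` values shown above by hand so that it runs without a network call; with the downloaded data you would use the corresponding column of `tsla` instead (the column layout can differ between yfinance versions).

```python
import pandas as pd

# The five adjusted closing prices from the tsla.head() output above.
prices = pd.DataFrame(
    {"Adj Close": [96.311996, 90.893997, 96.001999, 103.248001, 109.089996]},
    index=pd.to_datetime(
        ["2020-04-01", "2020-04-02", "2020-04-03", "2020-04-06", "2020-04-07"]
    ),
)
prices.index.name = "Date"

# Simple daily return: today's price divided by the previous one, minus 1.
# The first day has no previous price, so its NaN is dropped.
returns = prices["Adj Close"].pct_change().dropna()
print(returns)
```

Because the index holds trading days only, the return for 2020-04-06 spans the weekend after 2020-04-03.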
