# Homework exercise 1
## Deadline: upload to Moodle by 22 May 18:00 h

__Please submit your homework either as a Jupyter Notebook or using .py files.__

If you use .py files, please also include a PDF containing the output of your code and your explanations. Either way, the code needs to be in a form that can be easily run on another computer.

__Name 1:__

The file that you upload should be named *Homework1_YourLastName_YourStudentID*.

Reminder: you are required to attend class on 23 May to earn points for this homework exercise unless you have a valid reason for your absence.

You are expected to work on this exercise individually. If any part of the questions is unclear, please ask on the Moodle forum.

__SEC EDGAR__

Filings made by companies to the regulator are a useful source of text data. The most important source in this regard is the US Securities and Exchange Commission (SEC).

The SEC provides information on how to access their filings here: https://www.sec.gov/edgar/searchedgar/accessing-edgar-data.htm . Please read that page to find information on how to access the HTML pages listing all files included in a filing.
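As a rough illustration of the directory layout, the sketch below builds the address of a daily crawler index from a date. The helper name `crawler_index_url` is hypothetical, and the path pattern (year, quarter, `crawler.YYYYMMDD.idx`) is an assumption that should be checked against the SEC page above.

```python
from datetime import date

BASE = "https://www.sec.gov/Archives/edgar/daily-index"


def crawler_index_url(day: date) -> str:
    """Return the assumed URL of the daily crawler index for `day`."""
    quarter = (day.month - 1) // 3 + 1  # Jan-Mar -> 1, ..., Oct-Dec -> 4
    return f"{BASE}/{day.year}/QTR{quarter}/crawler.{day:%Y%m%d}.idx"


print(crawler_index_url(date(2019, 4, 29)))
```

Deriving the quarter from the month keeps the caller from having to pass it separately.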

1. Write a function `get_index` that downloads the "crawler" index file for a particular day in the years from 2015 to 2022.
2. Write a function `get_filing_urls` that retrieves the URLs of the filings from the downloaded index file. An optional argument should specify the form type of the filings you want to access.
3. Write a function `get_filing_files` that uses the URLs obtained in 2. to download the corresponding pages. Within each page, find the URL of the main document, which is the first among the files listed on the page.

4. Use the functions from 1. to 3. to download the HTML documents of form type S-1 for some date. Then

    * remove all tables and images from the files
    * separately save the text contained (i.e. excluding all HTML tags) in each item (an item corresponds to one entry in the Table of Contents)

  Note that there will be days not containing any S-1 files. Please test your code for days comprising a total of at least 10 filings.
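The two sub-steps of task 4 can be sketched on a toy document. To stay free of third-party dependencies, this sketch uses the standard library's `html.parser` rather than BeautifulSoup. The class and function names are hypothetical, and it assumes that Table of Contents entries are in-page links (`href="#..."`) whose targets carry a matching `id` or `name` attribute, which real filings may not always follow.

```python
from html.parser import HTMLParser


class TocLinks(HTMLParser):
    """First pass: collect the targets of in-page links (href="#...")."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""
        if tag == "a" and len(href) > 1 and href.startswith("#"):
            if href[1:] not in self.targets:
                self.targets.append(href[1:])


class ItemText(HTMLParser):
    """Second pass: gather text per item, ignoring tables and images."""

    def __init__(self, targets):
        super().__init__()
        self.targets = set(targets)
        self.items = {}       # anchor name -> list of text fragments
        self.current = None   # anchor of the item currently being read
        self.table_depth = 0  # > 0 while inside a <table>

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1
        attr = dict(attrs)
        anchor = attr.get("id") or attr.get("name")
        if self.table_depth == 0 and anchor in self.targets:
            self.current = anchor
            self.items[anchor] = []

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1

    def handle_data(self, data):
        # <img> tags carry no text, so images drop out automatically.
        if self.table_depth == 0 and self.current and data.strip():
            self.items[self.current].append(data.strip())


def split_items(html: str) -> dict:
    """Return a mapping from TOC anchor to the plain text of that item."""
    toc = TocLinks()
    toc.feed(html)
    body = ItemText(toc.targets)
    body.feed(html)
    return {name: " ".join(parts) for name, parts in body.items.items()}


sample_html = """
<table><tr><td><a href="#risk">Risk Factors</a></td></tr>
<tr><td><a href="#use">Use of Proceeds</a></td></tr></table>
<p><a name="risk"></a>RISK FACTORS</p><p>Investing involves risk.</p>
<table><tr><td>1</td><td>2</td></tr></table><img src="logo.png">
<p id="use">USE OF PROCEEDS</p><p>We intend to repay debt.</p>
"""
result = split_items(sample_html)
for name, text in result.items():
    print(name, "->", text)
```

The Table of Contents is read in a separate first pass because it usually sits inside a table, which the second pass discards.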

Ensure that your code respects the SEC's constraints with respect to the number of requests and the identification of your crawler via the User Agent.
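One possible way to meet both constraints is a small throttle that spaces out requests, combined with a header that identifies the crawler. This is only a sketch: the `Throttle` class is hypothetical, the contact details are placeholders, and the assumed ceiling of 10 requests per second and the "name plus e-mail address" User-Agent format should be verified against the SEC's current guidance.

```python
import time


class Throttle:
    """Block so that successive calls are at least `min_interval` seconds apart."""

    def __init__(self, max_per_second: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous request was released

    def wait(self) -> float:
        """Sleep if needed and return the number of seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay:
            self._sleep(delay)
        self._last = now + delay
        return delay


# Placeholder identification: replace with your own name and e-mail address.
HEADERS = {"User-Agent": "Jane Doe jane.doe@example.com"}

throttle = Throttle(max_per_second=5)  # stay well below the assumed limit
for _ in range(3):
    throttle.wait()
    # requests.get(url, headers=HEADERS) would go here
```

Passing the clock and sleep functions as arguments makes the throttle testable without real waiting.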

In [39]:
import requests
import pandas as pd
from datetime import datetime
import time
import os
import json
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys 
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from numpy import random
rng = random.default_rng(12345)
from bs4 import BeautifulSoup
import re
from io import StringIO
from typing import Optional  # used in the function signatures below

In [None]:
path = '/Users/esatmazreku/Documents/MSc Banking and Finance/2023S/Python 2/Course Material-20230502/data'

In [38]:
'''
1. Write a function get_index that downloads the "crawler" index file for a 
particular day in the years from 2015 to 2022.
'''

def get_index(year: int, month: int, day: int) -> str:
    """Download the crawler index for one day and return its text."""
    if year < 2015 or year > 2022:
        raise ValueError("year must be between 2015 and 2022")
    date = datetime(year, month, day)  # raises ValueError for impossible dates
    quarter = (month - 1) // 3 + 1
    url = f"https://www.sec.gov/Archives/edgar/daily-index/{year}/QTR{quarter}/crawler.{date.strftime('%Y%m%d')}.idx"
    headers = {
        "User-Agent": "(Macintosh; Apple M1 Pro OS Ventura 13.1) AppleWebKit/18614.3.7.1.5 (KHTML, a11842436@unet.univie.ac.at) Safari/16.2",
        "Referer": "https://www.sec.gov/edgar/searchedgar/accessing-edgar-data.htm"
    }

    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        # Fail loudly (e.g. no index exists for that day) instead of returning None
        raise RuntimeError(f"Failed to download index. Status code: {response.status_code}")
    # Return the text so that other functions can parse it
    return response.text

print(get_index(2019, 4, 29))

Description:           Daily Crawler Index of EDGAR Dissemination Feed by Company Name
Last Data Received:    Apr 29, 2019
Comments:              webmaster@sec.gov
 
 
 
 
Company Name                                                  Form Type   CIK         Date Filed  URL
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 800 FLOWERS COM INC                                         8-K         1084869     20190429    http://www.sec.gov/Archives/edgar/data/1084869/0001157523-19-000924-index.htm         
1253 NEWTON, LLLP                                             D           1774239     20190429    http://www.sec.gov/Archives/edgar/data/1774239/0001774239-19-000001-index.htm         
180 DEGREE CAPITAL CORP. /NY/                                 DEF 14A     893739      20190429    http://www.sec.gov/Archives/edgar/data/893739/0000893739-19-000012-index.htm   
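
The listing above is whitespace-aligned rather than delimiter-separated, so one possible way to read a data row is to split on runs of two or more spaces and count fields from the right. This is only a sketch: `parse_index_line` and `IndexRow` are hypothetical names, and it assumes that form types never contain a double space and that every data row ends with the URL.

```python
import re
from typing import NamedTuple, Optional


class IndexRow(NamedTuple):
    company: str
    form_type: str
    cik: str
    date_filed: str
    url: str


def parse_index_line(line: str) -> Optional[IndexRow]:
    """Parse one data row of a crawler index; return None for other lines."""
    fields = re.split(r"\s{2,}", line.strip())
    if len(fields) < 5 or not fields[-1].startswith("http"):
        return None  # header, separator or blank line
    # Count from the right so that a company name containing a double
    # space cannot shift the remaining columns.
    company = "  ".join(fields[:-4])
    return IndexRow(company, *fields[-4:])


sample = (
    "180 DEGREE CAPITAL CORP. /NY/                                 "
    "DEF 14A     893739      20190429    "
    "http://www.sec.gov/Archives/edgar/data/893739/0000893739-19-000012-index.htm   "
)
row = parse_index_line(sample)
print(row.form_type, row.cik, row.url)
```

Splitting on single spaces would break form types such as `DEF 14A`, which is why the pattern requires at least two.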

In [32]:
'''
Write a function get_filing_urls that retrieves the urls of the filings from the downloaded index file. 
An optional argument should specify the form type of the filings you want to access.
'''
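def get_filing_urls(index_content: str, form_type: Optional[str] = None):
    urls = []
    for line in index_content.splitlines():
        # The crawler index is whitespace-aligned, not pipe-delimited. Form types
        # may contain single spaces (e.g. "DEF 14A"), so split on 2+ spaces.
        components = re.split(r'\s{2,}', line.strip())
        # Data rows end with the URL of the filing's index page; this check
        # also skips the header and separator lines.
        if len(components) >= 5 and components[-1].startswith('http'):
            # Count from the right: ..., form type, CIK, date filed, URL
            if form_type is None or components[-4] == form_type:
                urls.append(components[-1])  # the index already stores full URLs

    return urls


# Year and quarter in the path must match the date in the file name
index_url = 'https://www.sec.gov/Archives/edgar/daily-index/2016/QTR1/crawler.20160229.idx'
# Identify the crawler: requests without a User-Agent may be rejected
headers = {"User-Agent": "(Macintosh; Apple M1 Pro OS Ventura 13.1) AppleWebKit/18614.3.7.1.5 (KHTML, a11842436@unet.univie.ac.at) Safari/16.2"}
index_content = requests.get(index_url, headers=headers).text
urls = get_filing_urls(index_content, form_type='S-1')
print(urls)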


In [33]:
'''
3. Write a function get_filing_files that uses the urls obtained in 2. to download the corresponding pages. 
Within each page find the url of the main document, which is the first among the files listed on the page.
'''
from urllib.parse import urljoin

HEADERS = {
    "User-Agent": "(Macintosh; Apple M1 Pro OS Ventura 13.1) AppleWebKit/18614.3.7.1.5 (KHTML, a11842436@unet.univie.ac.at) Safari/16.2",
    "Referer": "https://www.sec.gov/edgar/searchedgar/accessing-edgar-data.htm"
}

def get_filing_files(urls, pause: float = 0.2):
    """Return the URL of the main document for each filing index page in `urls`."""
    document_urls = []
    for url in urls:
        response = requests.get(url, headers=HEADERS)
        time.sleep(pause)  # space out requests to respect the SEC's rate limit
        if response.status_code != 200:
            continue
        soup = BeautifulSoup(response.content, 'html.parser')
        table = soup.find('table', {'class': 'tableFile', 'summary': 'Document Format Files'})
        if table is None:
            continue
        for row in table.find_all('tr')[1:]:  # Skip the header row
            columns = row.find_all('td')
            link = columns[2].find('a') if len(columns) > 2 else None
            if link is not None and link.get('href'):
                # Use the link target rather than the cell text; links to the
                # inline viewer are assumed to be prefixed with "/ix?doc="
                href = link['href'].replace('/ix?doc=', '')
                document_urls.append(urljoin(url, href))
                break  # Stop after the first document
    return document_urls

In [30]:
'''
Use the functions from 1. to 3. to download the HTML documents of form type S-1 for some date.
'''
def download_and_process_files(index_content: str, form_type: Optional[str] = None):
    urls = get_filing_files(get_filing_urls(index_content, form_type))
    for url in urls:
        response = requests.get(url, headers=HEADERS)
        time.sleep(0.2)  # space out requests to respect the SEC's rate limit
        if response.status_code != 200:
            continue
        soup = BeautifulSoup(response.content, 'html.parser')

        # HTML has no <item> tag. Heuristic: Table of Contents entries are
        # in-page links (href="#..."), so collect their targets before the
        # tables that hold the Table of Contents are removed.
        targets = []
        for a in soup.find_all('a', href=True):
            name = a['href'][1:]
            if a['href'].startswith('#') and name and name not in targets:
                targets.append(name)

        # a. Remove tables and images
        for table in soup.find_all('table'):
            table.decompose()
        for img in soup.find_all('img'):
            img.decompose()

        # b. Mark where each item starts, then split the plain text there
        for name in targets:
            start = soup.find(id=name) or soup.find('a', attrs={'name': name})
            if start is not None:
                start.insert_before('\n@@ITEM@@\n')
        items = soup.get_text().split('@@ITEM@@')[1:]  # drop text before the first item

        # Include the filing's folder name so that filings do not overwrite each other
        filing_id = url.rstrip('/').split('/')[-2]
        for i, text in enumerate(items):
            with open(f"{form_type}_{filing_id}_{i}.txt", "w", encoding="utf-8") as f:
                f.write(text.strip())


# index_content is the index text downloaded in the cell for task 2
download_and_process_files(index_content, form_type='S-1')