# HW3 - HTML Processing
### Quentin Phillips
### DATA 440-02
### 10/27/23

# 1. Data Collection


### First, I iterated through the text file of final URIs and saved the content of each one as an HTML file. I did this with a Python script run from PowerShell.


```python
import os
import hashlib
import requests

# Hash generator
def generate_hash(url):
    hash_object = hashlib.md5(url.encode())
    return hash_object.hexdigest()

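def save_html(url, folder_path):
    # A timeout and an exception handler keep one dead link from stopping the whole run
    try:
        response = requests.get(url, timeout=30)
    except requests.RequestException as error:
        print(f"Failed to retrieve HTML from {url}: {error}")
        return

    if response.status_code == 200:
        # Name the file after the MD5 hash of its URL
        filename = generate_hash(url)
        file_path = os.path.join(folder_path, filename + ".html")
        with open(file_path, 'w', encoding='utf-8') as file:
            file.write(response.text)

        print(f"Saved HTML from {url} to {file_path}")
    else:
        print(f"Failed to retrieve HTML from {url} (status {response.status_code})")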

def process_url(file_path, folder_path):
    with open(file_path, 'r', encoding='utf-8') as url_file:
        for line in url_file:
            url = line.strip()
            if url:  # skip blank lines in the URL list
                save_html(url, folder_path)

if __name__ == "__main__":
    url_file_path = "C:\\Users\\quill\\Downloads\\resolvedURLS.txt"  
    output_folder = "C:\\Users\\quill\\Downloads\\HTMLs"  
    
    # Create the output folder if it does not already exist
    os.makedirs(output_folder, exist_ok=True)
    
    process_url(url_file_path, output_folder)
```

### This Python script saved the HTML content of each URI to a separate file, named with a hash of the URI, in a new folder I created.
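
### Because MD5 is deterministic, the same URI always maps to the same file name, so the original URI behind a saved file can be recovered by hashing the URI list again. Below is a minimal sketch of that lookup. `url_to_filename` and `build_index` are hypothetical helpers that were not part of my original script, and the example URIs are placeholders.

```python
import hashlib


def url_to_filename(url):
    """Return the MD5-based file name that a URL is saved under."""
    return hashlib.md5(url.encode()).hexdigest() + ".html"


def build_index(urls):
    """Map each hashed file name back to the URL it came from."""
    return {url_to_filename(url): url for url in urls}


# Placeholder URIs, used only to illustrate the mapping
index = build_index(["https://example.com/", "https://example.org/"])
for filename, url in index.items():
    print(filename, url)
```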

### Next, I used a second script that uses boilerpy3 to remove the boilerplate from the HTML files, leaving just the relevant text.

### I used `ArticleExtractor` from boilerpy3 instead of `DefaultExtractor`, as the documentation suggested it and it gave better results on the few files I tested. I also set the encoding to UTF-8 explicitly, since I ran this from PowerShell and wanted to avoid Windows text-encoding errors. Finally, I had to set `raise_on_failure` to `False`: it defaults to `True`, and the script would stop any time a file could not be fully processed. This may result in some empty or broken output files, but it was necessary in order to process all of the HTML files.

```python
import os
from boilerpy3 import extractors

def remove_boilerplate(html_content):
    extractor = extractors.ArticleExtractor(raise_on_failure=False)
    extracted_text = extractor.get_content(html_content)
    return extracted_text

def process_html(input_folder, output_folder):
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    for filename in os.listdir(input_folder):
        if filename.endswith(".html"):
            input_file_path = os.path.join(input_folder, filename)
            output_file_path = os.path.join(output_folder, filename)

            with open(input_file_path, 'r', encoding='utf-8') as input_file:
                html_content = input_file.read()
                cleaned_content = remove_boilerplate(html_content)

                with open(output_file_path, 'w', encoding='utf-8') as output_file:
                    output_file.write(cleaned_content)

                print(f"Cleaned HTML saved to {output_file_path}")

if __name__ == "__main__":
    input_folder = "C:\\Users\\quill\\Downloads\\HTMLs"
    output_folder = "C:\\Users\\quill\\Downloads\\HTML_Proc"  

    process_html(input_folder, output_folder)
```

# Q

### 74 of the 432 files that were processed without error were 0-byte files. This is fewer than I expected: the links were all scraped from Twitter, so I got a lot of odd and unexpected ones, and I expected more of them to contain broken or unusable content.
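
### A count like this can be checked with a short script. Below is a minimal sketch, assuming the processed files all sit in one folder and end in `.html`. `count_empty_files` is a hypothetical helper, and the demo runs on a temporary folder rather than on my actual output.

```python
import os
import tempfile


def count_empty_files(folder, extension=".html"):
    """Return (empty, total) counts for files in folder with the given extension."""
    empty = 0
    total = 0
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if name.endswith(extension) and os.path.isfile(path):
            total += 1
            if os.path.getsize(path) == 0:
                empty += 1
    return empty, total


# Demo on a throwaway folder holding one empty and one non-empty file
with tempfile.TemporaryDirectory() as demo_folder:
    open(os.path.join(demo_folder, "empty.html"), "w").close()
    with open(os.path.join(demo_folder, "full.html"), "w", encoding="utf-8") as f:
        f.write("some extracted text")
    empty, total = count_empty_files(demo_folder)
    print(f"{empty} of {total} files are 0 bytes")
```

### Pointing `count_empty_files` at the processed-output folder instead of the temporary one would report the real totals.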


# Q2

## I chose the word "news" as many of the websites I scraped had news articles about a wide variety of topics.

### For IDF I got a value of 0.48: searching Google for "news" returned 25,200,000,000 relevant results, and I used this together with Google's corpus size of 35 billion. I took the log base 2 of the ratio to keep the value small. For TF, I calculated the values manually, using Ctrl+F to get the number of occurrences in each document and the word count function to get the total number of words.
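
### The manual count could also be scripted. Below is a minimal sketch, assuming whole-word, case-insensitive matching. `term_counts` is a hypothetical helper, not something I used for the numbers in this report. Its counts can differ from Ctrl+F, which also matches substrings such as "newsletter", and from an editor's word count, which may split words differently.

```python
import re


def term_counts(text, term):
    """Return (occurrences of term, total words) for a block of text."""
    # Treat runs of letters, digits, and apostrophes as words
    words = re.findall(r"[a-z0-9']+", text.lower())
    return words.count(term.lower()), len(words)


occurrences, total_words = term_counts("News about news. Newsletter!", "news")
print(occurrences, total_words)  # prints "2 4"
```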

|TF-IDF |TF |IDF  |URI
|------:|--:|---:|---
|0.013 |58/2099  |0.48 |file:///C:/Users/quill/AppData/Local/Temp/249a2e64-4f7e-4259-968a-d72e62dbf72f_HTML_Proc.zip.72f/HTML_Proc/43a65a9cd1e47791362d99f82e57fdca.html
|0.0014  |5/1675 | 0.48 |file:///C:/Users/quill/AppData/Local/Temp/e3bbd45b-24b7-425d-ae8f-2f8f040055af_HTML_Proc.zip.5af/HTML_Proc/778f6966d5cc5213706e21a5a034ce5f.html
|0.0086  |7/389  |0.48 |file:///C:/Users/quill/AppData/Local/Temp/17a6895d-cd01-47db-aa05-07001eb5e4eb_HTML_Proc.zip.4eb/HTML_Proc/933a3bd0520e2e230faeef1bd2968fb2.html
|0.0024 |6/1177 | 0.48 |file:///C:/Users/quill/AppData/Local/Temp/32b46924-f23a-488b-8a09-d0fa550029a1_HTML_Proc.zip.9a1/HTML_Proc/4167a0e6b3392d11b967a0f8c8ce6225.html
|0.0025  |3/576 |0.48 |file:///C:/Users/quill/AppData/Local/Temp/0c1474e8-c3f6-469e-9a8e-45ce8065bd95_HTML_Proc.zip.d95/HTML_Proc/9370d41017cd0d1d5b0ef983dc29807c.html
|0.014 |1/33 | 0.48 |file:///C:/Users/quill/AppData/Local/Temp/985b0871-2fea-4c62-8e3f-596b8cff68bb_HTML_Proc.zip.8bb/HTML_Proc/45620d2aedba60cae7fff1d9cd4e2fdd.html
|0.0016  |2/590  |0.48 |file:///C:/Users/quill/AppData/Local/Temp/aee92727-ad68-4b1f-9954-aa27989b96b0_HTML_Proc.zip.6b0/HTML_Proc/298212ece0c467f09d49cd3f411d0a28.html
|0.002  |1/238 | 0.48 |file:///C:/Users/quill/AppData/Local/Temp/1fabc64e-310f-4d81-87ce-ad5616272e48_HTML_Proc.zip.e48/HTML_Proc/618698084b7217b08621c1646452ead2.html
|0.0017  |2/581  |0.48 |file:///C:/Users/quill/AppData/Local/Temp/9ff3fce2-3366-402c-b64a-a6cecdc74a11_HTML_Proc.zip.a11/HTML_Proc/a1d480d890d23bf1766d3e28b34b0153.html
|0.00014  |3/10318 | 0.48 |file:///C:/Users/quill/AppData/Local/Temp/cc9fdc81-c389-4922-8025-41630e8e8bea_HTML_Proc.zip.bea/HTML_Proc/a5995da9821c094165e511129ed467d3.html

### The formula I used for IDF was: log2(total documents in corpus / documents containing the keyword)

### The formula I used for TF was: number of occurrences of the keyword / total number of words in the document

### For TF-IDF, I multiplied the TF and IDF values together.
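
### The three formulas can be written as a short script. This is a minimal sketch; the function names `idf`, `tf`, and `tf_idf` are my own, not from any library. Evaluating it with the figures reported above gives an unrounded IDF of about 0.474, which rounds to 0.47, slightly below the 0.48 used in the table, and it reproduces the 0.013 TF-IDF of the first row.

```python
import math


def idf(corpus_size, docs_with_term):
    """Inverse document frequency using log base 2."""
    return math.log2(corpus_size / docs_with_term)


def tf(occurrences, total_words):
    """Term frequency: share of the document's words that are the term."""
    return occurrences / total_words


def tf_idf(occurrences, total_words, corpus_size, docs_with_term):
    """TF-IDF as the product of the two values above."""
    return tf(occurrences, total_words) * idf(corpus_size, docs_with_term)


# Figures from this report: 35 billion documents, 25.2 billion results for "news"
news_idf = idf(35_000_000_000, 25_200_000_000)

# First table row: 58 occurrences in 2099 words
first_row = tf_idf(58, 2099, 35_000_000_000, 25_200_000_000)

print(round(news_idf, 2), round(first_row, 3))  # prints "0.47 0.013"
```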

# References

* boilerpy3 package page on PyPI, <https://pypi.org/project/boilerpy3/>
