# Problem 1 - Josh

Read Shannon’s 1948 paper ’A Mathematical Theory of Communication’.  
Focus on pages 1-19 (up to Part II), the remaining part is more relevant for communication.
https://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf

*Q: Summarize what you learned briefly (e.g. half a page).*

\<Summary\>

# Problem 2 - Jackson  

ICML is a top research conference in Machine learning. Scrape all the pdfs of all ICML 2017 papers from http://proceedings.mlr.press/v70/.
1. What are the top 10 common words in the ICML papers?
2. Let Zbe a randomly selected word in a randomly selected ICML paper. Estimate the entropy
of Z.
3. Synthesize a random paragraph using the marginal distribution over words.
4. (Extra credit) Synthesize a random paragraph using an n-gram model on words. Synthesize
a random paragraph using any model you want. Top five synthesized text paragraphs win
bonus (+30 points).

In [None]:
# Scraper
import requests
import logging
import os
from bs4 import BeautifulSoup as bs

def scrape(dump_folder, source):
    #Create folder to dump into
    if not os.path.isdir(dump_folder):
        os.mkdir(dump_folder)

    #Set up logging
    f = open(f'{dump_folder}log.txt', 'w') #Open the file if its not already opened
    f.close()
    logging.basicConfig(level=logging.INFO, filename=f'{dump_folder}log.txt')
    
    #Get list of links
    html = requests.get(source)
    soup = bs(html.content, 'html.parser')
    links = soup.findAll('a')

    #Scrape all PDFs
    names = []
    for l in links:
        if l.decode_contents() == 'Download PDF' or l.decode_contents() == 'Supplementary PDF':
            src = l.get('href').replace('ı', 'i') # Fix small error in one of the scraped links
            fname = src[src.rindex('/')+1:]
            if fname in names:
                logging.CRITICAL(f'OVERWRITING FILE WITH NAME {fname}')
            names.append(fname)
            logging.info(f'Scraping pdf from {src} into {fname}')

            pdf = requests.get(src)
            with open(f'{dump_folder}{fname}', 'wb') as f:
                f.write(pdf.content)

In [12]:
import fitz
import pandas as pd
import numpy as np
from tqdm import tqdm

def parse_pdf(fp):
    df = pd.DataFrame(columns=['word', 'prev'])
    pdf = fitz.open(fp)
    for page in pdf:
        _blocks = pdf[0].get_text_blocks()
        blocks = []
        for b in _blocks:
            if b[-1] != 1:
                blocks.append(b)

        #Create 2D array of words by block
        data = [x[4].replace('. ', ' . ').split() +['\end'] for x in blocks]
        mlen = max([len(d) for d in data])
        data = [d + ['' for i in range(mlen - len(d) + 1)] for d in data]

        #Create array of previous words by block
        words = np.array(data)
        prev = np.roll(words, 1, axis=1)
        prev[:, 0] = '\start'

        words = words.flatten()
        prev = prev.flatten()

        df = pd.concat(df, pd.DataFrame(data = np.column_stack([words, prev]), columns=['word', 'prev']))

    return df


def parse_pdfs(dump_folder):
    df = pd.DataFrame(columns=['word', 'prev', 'count'])

    for fp in tqdm(os.listdir(dump_folder)):
        if not fp.endswith('pdf'):
            continue #Only process PDFs

        df = parse_pdf(dump_folder+fp, df)
    
    return df
        

In [13]:
out = parse_pdf('./scraped/achab17a-supp.pdf')

TypeError: parse_pdf() got an unexpected keyword argument 'dump_folder'

In [None]:
#Text Synthesizing

#Choose the highest probability word at each step until it chooses \end

In [None]:
def problem2(source='http://proceedings.mlr.press/v70/', 
                dump_folder = './scraped/'):
    
    # Part 1 - Scrape PDFs (if not already done)
    if dump_folder[-1] != '/':
        dump_folder += '/'
    files = os.listdir(dump_folder)
    if len(files) < 10:
        #Probably haven't scraped
        scrape(dump_folder, source)

    # Part 2 - Load and process PDFs
    df = parse_pdfs(dump_folder)


    # Part 3 - Synthesize Text

# Problem 3 - Jhanvi

Continue building your toolbox on Kaggle. Work on submissions for the same competition
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/
1. What is the best Kaggle forum post that you found? Briefly describe what you learned from
it.
2. What is the best public leader board (LB) score you can achieve? Describe your approach.
3. Submit a model that is definitely overfitting and a model that is definitely underfitting.


Overfitting means that your training error is much smaller compared to your test error (and LB score).   
Underfitting means that your model is too simple and even the training error is very large (and so will the test error).  
You can experiment with depth of decision trees in random forests or XGBoost classifiers as the metric of complexity for your models, or any other family of models you want.