# **Welcome to the Demo Notebook for Webscraping**

In this demo notebook we go through the `thesis_scraper.py` module used in the project and demonstrate its functionality.

**Disclaimer:** The notebook was run by the authors on the "mavis" computing server (1024 GB memory; 40 physical cores at 3.1 GHz) of the Humboldt Lab for Empirical and Quantitative Research. Execution time may be significantly longer for other users.

### **The Dependencies**


First, import some basic libraries, then install the requirements to set up the environment the project needs.

In [1]:
# Basic libraries
import re
import os
import gc
import warnings
warnings.filterwarnings("ignore")
import time

import sys
sys.path.append("..")
# Custom function to measure runtime
from measure_time import measure_time
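
The `measure_time` helper imported above lives in the project's own `measure_time.py` and is not shown in this notebook. As a rough illustration only, a helper of this kind could look like the sketch below. The names `format_elapsed` and `measure_time_sketch` and the `H:MM:SS` output format are assumptions made for this example, not the project's actual code.

```python
import time


def format_elapsed(seconds: float) -> str:
    """Format a duration in seconds as H:MM:SS (hypothetical format)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def measure_time_sketch(start: float) -> str:
    """Return the time elapsed since `start`, a value taken from time.time()."""
    return f"Elapsed time: {format_elapsed(time.time() - start)}"


# Usage mirrors the pattern in this notebook: store a start time, then report.
st = time.time()
print(measure_time_sketch(st))
```

Keeping the formatting in a separate pure function makes the helper easy to check without waiting for real time to pass.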

In [2]:
os.getcwd()

'D:\\Seafile\\Моя библиотека\\2 semester\\DEDA\\GitHub\\Bacha fork\\DEDA_class_SoSe2023\\DEDA_class_SoSe2023_LDA_Theses\\DEDA_class_SoSe2023_LDA_MSc_Theses\\LDA_MSc_1_Webscraping'

In [None]:
# Install Requirements
!pip install -r ../requirements.txt

### **The Scraper**
Import `master_theses_scraper` from the `thesis_scraper.py` module; its docstring is shown below. The function deliberately prints many status messages so that users can see which stage the scraper has reached and what the entry currently being processed looks like.

Note that the scraper is written specifically for the HU website, so its internals will need adapting before it can be reused for another site.
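
To give a feel for the kind of site-specific logic that would need adapting, here is a minimal sketch of collecting `<a>` elements from a page and keeping only those that point to the edoc server. It is not the code in `thesis_scraper.py`: the class `AnchorCollector`, the function `keep_edoc_links`, the host-based filter rule, and the two extra sample anchors are all assumptions made for illustration, and only the standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class AnchorCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element in a document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_anchor = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")  # None if the attribute is missing
            self._text = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self.links.append((self._href, "".join(self._text).strip()))
            self._in_anchor = False


def keep_edoc_links(links, host="edoc.hu-berlin.de"):
    """Keep only links whose href points at `host` (assumed validity rule)."""
    return [(href, text) for href, text in links
            if href and urlparse(href).netloc == host]


# The first anchor is taken from the scraper output; the other two are made up.
sample_html = """
<a data-linktype="external" href="https://edoc.hu-berlin.de/handle/18452/24455">Comparing Cryptocurrency Indices to Traditional Indices</a>
<a href="#top">Back to top</a>
<a>No target</a>
"""

collector = AnchorCollector()
collector.feed(sample_html)
valid = keep_edoc_links(collector.links)
print(f"{len(collector.links)} entries found, {len(valid)} kept.")
```

Repurposing the scraper for another site would mostly mean changing rules like the host check here.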


In [2]:
# Import the custom function and inspect
from thesis_scraper import master_theses_scraper 
master_theses_scraper?

Signature: master_theses_scraper(url, down_dir, headers)
Docstring:
Scrapes master's theses from a specified URL, retrieves download links, and downloads the theses.

Args:
    url (str): The URL of the webpage containing the LvB theses.
    down_dir (str): The directory where the scraped PDFs will be downloaded.
    headers (dict): HTTP headers to be used in the requests.
File:      d:\seafile\моя библиотека\2 semester\deda\github\bacha fork\deda_class_sose2023\deda_class_sose2023_lda_theses\deda_class_sose2023_lda_msc_theses\lda_msc_1_webscraping\thesis_scraper.py
Type:      function

In [3]:
# Specify the link to scrape
url = 'https://www.wiwi.hu-berlin.de/de/forschung/irtg/lvb/research/dmb'

# Sets the directory for downloading our scraped pdfs
down_dir = 'OCRed PDFs/'

# Makes the directory in case it does not exist already
os.makedirs(down_dir, exist_ok = True)


# Set your own user agent here
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US'
}
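
As a quick sanity check before scraping, the sketch below attaches a headers dictionary of this shape to a request object and reads the values back without sending anything over the network. It uses the standard library's `urllib.request.Request` purely for illustration; which HTTP client `thesis_scraper.py` uses internally is not shown in this notebook, and the user agent string here is a made-up placeholder.

```python
from urllib.request import Request

url = "https://www.wiwi.hu-berlin.de/de/forschung/irtg/lvb/research/dmb"

# Placeholder values; substitute your own user agent string.
custom_headers = {
    "User-Agent": "my-thesis-scraper/0.1",
    "Accept-Language": "en-US",
}

# Building the Request does not open a connection.
request = Request(url, headers=custom_headers)

# urllib stores header names with only the first letter capitalized.
print(request.get_header("User-agent"))
print(request.get_header("Accept-language"))
```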

In [4]:
# Record the start time
st = time.time()

# Run the function:
master_theses_scraper(url = url,
                      down_dir = down_dir,
                      headers = headers
)

# Measure time spent on execution of the function
measure_time(st)


Web page accessed.

244 entries found.

An example entry in our links container looks like:
 <a data-linktype="external" data-val="https://edoc.hu-berlin.de/handle/18452/24455" href="https://edoc.hu-berlin.de/handle/18452/24455">Comparing Cryptocurrency Indices to Traditional Indices</a>

Identifying invalid links...

51 invalid links identified.

193 entries remain.

Identifying Master's Theses...

124 Master's Theses identified.

A sample entry looks as follows:
 <a href="https://edoc.hu-berlin.de/handle/18452/23881">App-based Forecasting of CRIX Index Returns Using R and R-Shiny</a>

Retrieving download links...

Due to missing link, dropped entry: <a href="http://edoc.hu-berlin.de/master/ristig-alexander-2012-02-03">Modelling of Vector MEM with Hierarchical Archimedean Copula</a>

Due to missing link, dropped entry: <a href="http://edoc.hu-berlin.de/master/schelisch-martin-2011-06-10">Jumps in High Frequency Data</a>

Due to missing link, dropped entry: <a href="http://edoc.hu-ber

### Note

Physically stored MSc theses were also added to the corpus, so we manually added their information to `thesis_info.pkl`.

This was done with the following code:

```python
import pickle

# Unpickling data from Corpus Maker

with open('thesis_info.pkl', 'rb') as file:
    thesis_info = pickle.load(file)
    
# Add the physically stored theses
added_theses = {
    'Alexander Hölzer_2022-07-25.pdf': ['Supervised Machine Learning Sentiment Measures', 'Hölzer, Alexander'],
    'Franziska Sabine Wehrmann_2021-08-08.pdf': ['Trading Strategies for Bitcoin Options based on Deviations in Risk Neutral and Historical Densities', 'Wehrmann, Franziska Sabine'],
    'Ivan Kotik_2022-09-28.pdf': ['Indexing, interfaces & searching in dynamic knowledge platforms', 'Kotik, Ivan'],
    'Judith Bender_2022-08-05.pdf': ['Portfolio Diversification based on Risk Profile Clustering', 'Bender, Judith'],
    'Kevin Noessler_2020-11-12.pdf': ['In search for stability in crypto-assets: An Index-Pegged Stablecoin', 'Noessler, Kevin'],
    'Lucas Valentin Umann_2023-02-13.pdf': ['Blockchain Characteristics and Systematic Risk: A Neural Network Based Factor Model for Cryptocurrencies', 'Umann, Lucas Valentin'],
    'Man Yuan_2022-11-10.pdf': ['Private Equity Premium Puzzle Revisited with Beta Coefficient', 'Yuan, Man'],
    'Marius Sterling_2020-03-07.pdf': ['Forecasting Stock Prices of Limit Order Book Data with Deep Neural Networks', 'Sterling, Marius'],
    'Thomas Georg Herrdum_2020-11-04.pdf': ['CRIX the Coin: A Crypto Collateralized Index Coin', 'Herrdum, Thomas Georg'],
    'Xun Gong_2020-01-21.pdf': ['Personalized Recipe Recommender System using Recurrent Neural Network', 'Gong, Xun'],
    'Yarong Yang_2021-12-10.pdf': ['The Financial Risk Meter and its application for Singapore', 'Yang, Yarong']
}

thesis_info.update(added_theses)

# Rename the entry for a cropped thesis (see notes in LDA_MSc_2_Preprocessing.ipynb)
thesis_info['aydinli cropped_2004-07-15.pdf'] = thesis_info.pop('113.aydinli.pdf_2004-07-15.pdf')

# Save the updated pickle file
with open('thesis_info.pkl', 'wb') as file:
    pickle.dump(thesis_info, file)
```
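
One caveat with the snippet above: `dict.pop` raises a `KeyError` if the old key is no longer present, so running the rename a second time against an already updated `thesis_info.pkl` would fail. A minimal sketch of a re-runnable variant follows. The helper name `rename_key` and the placeholder entry are assumptions for illustration and are not part of the project code.

```python
def rename_key(mapping: dict, old_key: str, new_key: str) -> dict:
    """Move mapping[old_key] to mapping[new_key]; do nothing if old_key is absent."""
    if old_key in mapping:
        mapping[new_key] = mapping.pop(old_key)
    return mapping


# Placeholder entry standing in for the real contents of thesis_info.pkl.
info = {"113.aydinli.pdf_2004-07-15.pdf": ["placeholder title", "placeholder author"]}

rename_key(info, "113.aydinli.pdf_2004-07-15.pdf", "aydinli cropped_2004-07-15.pdf")
# A second call is a no-op instead of raising KeyError.
rename_key(info, "113.aydinli.pdf_2004-07-15.pdf", "aydinli cropped_2004-07-15.pdf")

print(list(info))
```

The `thesis_info.update(added_theses)` step is already safe to repeat, since it simply overwrites the same keys with the same values.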