# NLP Project (Document Classifier)

Solution for the project assignment in the Natural Language Processing (NLP) course.

**Group: Lukas Kölbl, Fabian Schmidt**

<a id="0"></a>
# Outline
1. [Data Collection](#1)
    1. [Manual Data Exploration](#2)
    2. [Automated Data Acquisition](#3)
2. [Data Preparation](#4)
3. [Data Processing, Visualization and Evaluation/Assessment](#5)
    1. [Transformer Models](#6)
    2. [Extraction of Sentence & Word Embeddings](#7)
    3. [Clustering, Visualization and Evaluation of the "Clusters"](#8)
    4. [Topic Classification & Assignment](#9)


## Subtask 2.1 Data Collection (Crawler for the OTH Website) <a class="anchor" id="1"></a>


### Possible Problems / Challenges
- Some of the text is available in two languages (English and German).
- Abbreviations in the text itself (POC, ECTS, TOEIC, etc.)
- Special characters (+, (, ), *, :, etc.)
- Numbers
- Hyperlinks
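
Several of the challenges listed above (hyperlinks, numbers, special characters, abbreviations) can be handled with plain regular expressions before any model sees the text. The following is a minimal sketch, not the cleaning step used in this project: the function name `clean_text`, the regex patterns, and the allow-list `KNOWN_ABBREVIATIONS` are assumptions chosen for illustration. It does not address the mixed English/German content.

```python
import re

# Hypothetical allow-list of abbreviations that should keep their casing
# (taken from the examples in the list above).
KNOWN_ABBREVIATIONS = {"POC", "ECTS", "TOEIC"}

URL_RE = re.compile(r"(?:https?://|www\.)\S+")   # hyperlinks
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")       # integers and decimals
SPECIAL_RE = re.compile(r"[^\w\s]")              # anything but letters, digits, "_" and whitespace


def clean_text(text):
    """Remove hyperlinks, numbers and special characters, then lowercase
    every token that is not a known abbreviation."""
    text = URL_RE.sub(" ", text)
    text = NUMBER_RE.sub(" ", text)
    text = SPECIAL_RE.sub(" ", text)
    tokens = text.split()  # also collapses repeated whitespace
    return " ".join(t if t in KNOWN_ABBREVIATIONS else t.lower() for t in tokens)


sample = "The TOEIC test (min. 720 points): see https://www.oth-aw.de/info + 5 ECTS*"
print(clean_text(sample))
```

The order of the substitutions matters: URLs are removed first, because stripping special characters beforehand would break them into word fragments that survive the cleaning.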

In [2]:
import requests, re, json, time, pdfplumber, io
from bs4 import BeautifulSoup
from urllib.parse import urlparse

class Crawler:
    """
    Crawler class to crawl a website and extract all links and text content
    Parameters:
        start_url: The URL to start crawling from
        url_pattern: A regex pattern to filter out all links that do not match the pattern
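        link_thresh: Upper bound for self.depth. Since depth grows by one per crawled
            page and is never decremented, this effectively caps the number of pages crawled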
        delay_seconds: The delay between each request
    """
    def __init__(self, start_url, url_pattern, link_thresh=5, delay_seconds=3):
        self.depth = 0
        self.start_url = start_url
        self.url_pattern = re.compile(url_pattern)
        self.link_thresh = link_thresh
        self.delay_seconds = delay_seconds
        self.visited_links = set()
        self.link_tree = {}
        self.link_contents = []

    def add_hostname(self, link, prefix):
        """
        Adds a hostname to a link if it is missing
        Parameters:
            link: The link to add the hostname to
            prefix: The prefix to add to the link
        """
        #Plain string comparison; compiling the prefix as a regex would treat "." as a wildcard
        if not link.startswith(prefix):
            link = prefix + link
        return link
    
    def crawl_url(self, url):
        """
        Crawls a single URL and extracts all text content and links.
        If the url links to a pdf, the content of the pdf is extracted
        as text and returned, just like the html content for a normal website.
        Parameters:
            url: The URL to crawl
        Returns:
            text: The text/pdf content of the URL
            links: All links found on the URL
        """
        response = requests.get(url)
        response.raise_for_status()
        #Fall back to an empty string if the server sends no content-type header
        content_type = response.headers.get('content-type', '')
        
        #If the url links to a pdf, download it and read the content
        if 'application/pdf' in content_type:
            try:
                text = ""
                with pdfplumber.open(io.BytesIO(response.content)) as pdf:
                    for page in pdf.pages:
                        #Guard against pages for which no text could be extracted
                        text += page.extract_text() or ""

                #A pdf contributes no further links to follow
                return text, []
            except Exception as e:
                error = f"An error occurred while processing the pdf content: {e}"
                print(error)

                return error, []
        elif 'text/html' in content_type:
            try:
                soup = BeautifulSoup(response.text, "html.parser")  #Parse the html to extract text and links
                text = soup.get_text()
                links = [link.get('href') for link in soup.find_all('a') if link.get('href') is not None]

                return text, links
            except Exception as e:
                error = f"An error occurred while parsing the html content: {e}"
                print(error)

                return error, []
        else:
            return 'unknown content type', []
    
    def crawl_urls(self, url):
        """
        Crawls a URL and extracts all text content and links
        Parameters:
            url: The URL to crawl
            
        """
        #Return if the depth limit is reached, or the url was already visited
        if self.depth >= self.link_thresh or url in self.visited_links:
            return
            
        try:    
            #Extract scheme/hostname to be able to crawl links that are missing these
            print("Currently crawling: " + url)
            scheme = urlparse(url).scheme
            hostname = urlparse(url).hostname
            prefix = scheme + "://" + hostname
            time.sleep(self.delay_seconds)
            text, links = self.crawl_url(url)
            self.link_contents.append({url: text})
            
            #Mark the current URL as visited
            self.visited_links.add(url)
        
            #Filter out all topic irrelevant links (correct choice of url_pattern is important!) and add hostname to the link if needed
            valid_urls = list(set([self.add_hostname(link, prefix) for link in links if self.url_pattern.search(link) is not None and link != url]))
        
            #Update the link tree
            if self.depth not in self.link_tree:
                self.link_tree[self.depth] = []
            self.link_tree[self.depth].append({url: valid_urls})
            self.depth = self.depth + 1

            for curr_url in valid_urls:
                if curr_url not in self.visited_links:
                    self.crawl_urls(curr_url)

        except Exception as e:
            print(f"Crawling failed on {url}: {e}")
    
        #Save the crawled data 
        with open ("crawler_result.json", 'w') as fh:
            json.dump(self.link_contents, fh)

        return 

    def start_crawling(self):
        """
        Starts the crawling process
        """        
        self.crawl_urls(self.start_url)
        print("Crawling complete, resulting link tree:")
        print(self.link_tree)

start_url = "https://www.oth-aw.de/en/study-programmes-and-educational-opportunities/study-programmes/master-degree-programs/international-energy-engineering/program-international-energy-engineering/"
url_pattern = "international-energy-engineering"

crawler = Crawler(start_url, url_pattern)
crawler.start_crawling()
text, url = crawler.crawl_url("https://www.oth-aw.de/files/oth-aw/Einrichtungen/Bib/Suchen_und_Finden/Infoblatt_E-Books.pdf")
print(text)

Currently crawling: https://www.oth-aw.de/en/study-programmes-and-educational-opportunities/study-programmes/master-degree-programs/international-energy-engineering/program-international-energy-engineering/
Currently crawling: https://www.oth-aw.de/en/study-programmes-and-educational-opportunities/study-programmes/master-degree-programs/international-energy-engineering/contacts/
Currently crawling: https://www.oth-aw.de/en/study-programmes-and-educational-opportunities/study-programmes/master-degree-programs/international-energy-engineering/course-content/
Currently crawling: https://www.oth-aw.de/studium/studienangebote/studiengaenge/master/international-energy-engineering/studieninhalte/
Currently crawling: https://www.oth-aw.de/studium/studienangebote/studiengaenge/master/international-energy-engineering/ansprechpartner/
Crawling complete, resulting link tree:
{0: [{'https://www.oth-aw.de/en/study-programmes-and-educational-opportunities/study-programmes/master-degree-programs/inter
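
In the `Crawler` class above, `self.depth` is incremented once per successfully crawled page and never decremented, which matches the five pages in the output for `link_thresh=5`. If a limit on the actual link distance from the start page is wanted instead, a breadth-first traversal is one option. The sketch below is an illustration under assumptions, not part of the project code: `crawl_breadth_first`, the `get_links` callable, and the toy link graph are hypothetical, and no network requests are made.

```python
from collections import deque


def crawl_breadth_first(start_url, get_links, max_depth):
    """Explore pages level by level and stop following links at max_depth.

    get_links is a hypothetical callable returning the outgoing links of a URL;
    in the notebook it could wrap Crawler.crawl_url.
    Returns a dict mapping each discovered URL to its link distance from start_url.
    """
    depths = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if depths[url] >= max_depth:
            continue  # pages at the limit are recorded but not expanded
        for link in get_links(url):
            if link not in depths:  # each URL is recorded at most once, so cycles terminate
                depths[link] = depths[url] + 1
                queue.append(link)
    return depths


# Toy link graph with a cycle (A <-> B), standing in for real pages.
toy_graph = {"A": ["B", "C"], "B": ["A", "D"], "C": [], "D": ["E"], "E": []}
result = crawl_breadth_first("A", lambda u: toy_graph.get(u, []), max_depth=2)
print(result)
```

Because the depth is stored per URL rather than in one shared counter, sibling pages on the same level are all reachable, and only links beyond `max_depth` are cut off.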

### Questions on Subtask 2.2.1 - Manual Data Exploration <a class="anchor" id="2"></a>
- What do the content-related text structures look like?
    - Structurally, running text but also enumerations / bullet points
    - In terms of content, mainly general information about the study program
    - But also a description of why the program is important
    - Cookie information is extracted as well (e.g. lifetime, etc.)
- How and in what form is all the text information presented?
    - The text is displayed partly in English and partly in German
- Redundant structures / content?
    - In some cases the same web page is merely linked in other languages (Czech, Russian, etc.)
    - Following these links naturally leads to the same links again, including one back to the previous page (possibly an endless loop if the crawler is not stopped by a limit or similar)
    - Otherwise, the layout is similar in this case as well (headings followed by running text with enumerations in between)

### Automated Data Acquisition <a class="anchor" id="3"></a>

## Data Preparation <a class="anchor" id="4"></a>

## Data Processing, Visualization and Evaluation/Assessment <a class="anchor" id="5"></a>

### Transformer Models <a class="anchor" id="6"></a>

### Extraction of Sentence & Word Embeddings <a class="anchor" id="7"></a>

### Clustering, Visualization and Evaluation of the "Clusters" <a class="anchor" id="8"></a>

### Topic Classification & Assignment <a class="anchor" id="9"></a>