
**Introduction**

As a first step, we load the provided CSV files with the Pandas library. The training CSV file contains 100 rows and three columns: url, doc_id, and label. The test CSV file has 48 rows and two columns: url and doc_id. A third file, keyword2tumor_type.csv, maps 126 keywords to tumor types. The goal is to train a machine learning model that predicts a label for each document in the test CSV, using the labeled data in the training CSV.

In [1]:
import pandas as pd

In [2]:
train_csv = pd.read_csv(filepath_or_buffer="train.csv")
print("Training set shape", train_csv.shape)
train_csv.head()

Training set shape (100, 3)


Unnamed: 0,url,doc_id,label
0,http://elbe-elster-klinikum.de/fachbereiche/ch...,1,1
1,http://klinikum-bayreuth.de/einrichtungen/zent...,3,3
2,http://klinikum-braunschweig.de/info.php/?id_o...,4,1
3,http://klinikum-braunschweig.de/info.php/?id_o...,5,1
4,http://klinikum-braunschweig.de/zuweiser/tumor...,6,3


In [3]:
test_csv = pd.read_csv(filepath_or_buffer="test.csv")
print("Test set shape", test_csv.shape)
test_csv.head()

Test set shape (48, 2)


Unnamed: 0,url,doc_id
0,http://chirurgie-goettingen.de/medizinische-ve...,0
1,http://evkb.de/kliniken-zentren/chirurgie/allg...,2
2,http://krebszentrum.kreiskliniken-reutlingen.d...,7
3,http://marienhospital-buer.de/mhb-av-chirurgie...,15
4,http://marienhospital-buer.de/mhb-av-chirurgie...,16


In [4]:
tumor_keywords = pd.read_csv(filepath_or_buffer="keyword2tumor_type.csv")
print("Tumor keywords set shape", tumor_keywords.shape)
tumor_keywords.head()

Tumor keywords set shape (126, 2)


Unnamed: 0,keyword,tumor_type
0,senologische,Brust
1,brustzentrum,Brust
2,breast,Brust
3,thorax,Brust
4,thorakale,Brust


In [5]:
train_csv.groupby(by="label").size()

label,count
1,32
2,59
3,9


In [6]:
train_csv['label'].unique()

array([1, 3, 2])

We have 100 documents in the training set and 48 in the test set. In the training set, 32 documents mention no tumor board (label = 1), 59 documents mention a tumor board without our being certain that it is the main focus of the page (label = 2), and 9 documents are certainly dedicated to tumor boards (label = 3). The classes are therefore imbalanced, with label 3 by far the rarest.
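
To put the imbalance into numbers, here is a small sketch that uses only the counts printed above. It computes each class share and the accuracy a trivial model would reach by always predicting the most frequent label. The variable names are illustrative and not part of the original notebook.

```python
from collections import Counter

# Label counts as reported by the groupby output above.
label_counts = Counter({1: 32, 2: 59, 3: 9})
total = sum(label_counts.values())

# Share of each class in the training set.
proportions = {label: count / total for label, count in label_counts.items()}

# Accuracy of a trivial model that always predicts the most frequent label.
majority_label, majority_count = label_counts.most_common(1)[0]
baseline_accuracy = majority_count / total

print(proportions)
print(majority_label, baseline_accuracy)
```

Any trained model should at least beat this majority-class baseline on held-out data, and plain accuracy will say little about how well the rare label 3 is recognized.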

**Loading Data**

In [14]:
def read_html(doc_id: int) -> str:
    # Construct the file path using the doc_id
    html_file_path = f"/content/HTML/{doc_id}.html"
    try:
        with open(file=html_file_path,
                  mode="r",
                  encoding="latin1") as f:
            html = f.read()
    except FileNotFoundError:
        # Handle cases where the file for a specific doc_id is not found
        print(f"Warning: HTML file not found for doc_id {doc_id}")
        html = "" # Or some other placeholder like None
    return html

# Load the raw HTML for every training document
train_csv["html"] = train_csv["doc_id"].apply(read_html)

In [12]:
train_csv.sample(n=5, random_state=42)

Unnamed: 0,url,doc_id,label,html
83,http://www.sbk-vs.de/de/medizin/leistungen-und...,125,1,"\n\n<!DOCTYPE HTML>\n<html dir=""ltr"" lang=""de_..."
53,http://www.klinikum-esslingen.de/kliniken-und-...,85,2,"<!DOCTYPE html>\n<html xmlns=""http://www.w3.or..."
70,http://www.malteser-kliniken-rhein-ruhr.de/med...,107,2,"<!DOCTYPE html>\n<html lang=""de"">\n<head>\n\n<..."
45,http://www.klilu.de/medizin__pflege/kliniken_u...,73,2,"<!DOCTYPE html>\n<html lang=""de""><head>\n\t<me..."
44,http://www.kk-bochum.de/de/kliniken_zentren_be...,72,1,"<!DOCTYPE html PUBLIC ""-//W3C//DTD HTML 4.01 T..."


In [15]:
import warnings

from bs4 import BeautifulSoup

warnings.filterwarnings(action="ignore")


def extract_html_text(html):
    bs = BeautifulSoup(markup=html, features="lxml")
    for script in bs(name=["script", "style"]):
        script.decompose()
    return bs.get_text(separator=" ")


train_csv["html_text"] = train_csv["html"].apply(extract_html_text)
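
To make the extraction step concrete, the following sketch shows the same idea on a tiny, made-up page: drop everything inside script and style tags and keep the remaining text. It deliberately uses Python's built-in html.parser module instead of BeautifulSoup, so it is a dependency-free stand-in rather than a copy of the cell above. The names TextExtractor and visible_text, as well as the sample page, are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while we are inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style content.
        if self._skip_depth == 0:
            self.chunks.append(data)


def visible_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join with a space, mirroring get_text(separator=" ") in the cell above.
    return " ".join(parser.chunks)


# Hypothetical sample page, not taken from the dataset.
sample = (
    "<html><head><style>p {color: red;}</style></head>"
    "<body><p>Tumorboard</p><script>var x = 1;</script><p>Kontakt</p></body></html>"
)
text = visible_text(sample)
print(text)
```

The CSS rule and the JavaScript statement do not appear in the result, which is the effect the decompose() calls have in the BeautifulSoup version.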

In [16]:
train_csv.sample(n=5, random_state=42)

Unnamed: 0,url,doc_id,label,html,html_text
83,http://www.sbk-vs.de/de/medizin/leistungen-und...,125,1,"\n\n<!DOCTYPE HTML>\n<html dir=""ltr"" lang=""de_...",\n \n \n \n \n Prostata-Karzinom-Zentrum - Sch...
53,http://www.klinikum-esslingen.de/kliniken-und-...,85,2,"<!DOCTYPE html>\n<html xmlns=""http://www.w3.or...",\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \...
70,http://www.malteser-kliniken-rhein-ruhr.de/med...,107,2,"<!DOCTYPE html>\n<html lang=""de"">\n<head>\n\n<...",\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \...
45,http://www.klilu.de/medizin__pflege/kliniken_u...,73,2,"<!DOCTYPE html>\n<html lang=""de""><head>\n\t<me...",\n \n \n \n Darmzentrum Rheinpfalz » Zentren A...
44,http://www.kk-bochum.de/de/kliniken_zentren_be...,72,1,"<!DOCTYPE html PUBLIC ""-//W3C//DTD HTML 4.01 T...",\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \...


We are making progress, but one issue is immediately visible: each document starts with a long run of newline characters (\n). Ideally, the text we pass on should be clean and human-readable, with no special characters and no redundant whitespace.

In [34]:
import re
from bs4 import BeautifulSoup

def preprocessed_html_text(html_text: str) -> str:
    text = BeautifulSoup(html_text, "html.parser").get_text()
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)  # Remove non-alphanumeric
    text = re.sub(r'\s+', ' ', text)  # Remove multiple whitespaces
    text = text.strip().lower()
    return text
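
One side effect of this cleaning step is worth checking on German text. The character class [^A-Za-z0-9\s] keeps ASCII letters only, so umlauts are deleted along with punctuation. The sketch below applies the same regex steps, without the BeautifulSoup pass, to a made-up sentence and compares them with a hypothetical variant based on \w, which matches Unicode letters in Python 3. The names clean_ascii and clean_unicode are illustrative only.

```python
import re


def clean_ascii(text: str) -> str:
    # Same regex steps as the cell above, minus the BeautifulSoup pass.
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()


def clean_unicode(text: str) -> str:
    # Hypothetical variant: \w also matches non-ASCII letters such as umlauts.
    text = re.sub(r"[^\w\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()


# Made-up sample with leading newlines, similar in shape to the html_text column.
sample = "\n \n \n Tumorkonferenz für Ärzte - Übersicht \n"

ascii_result = clean_ascii(sample)
unicode_result = clean_unicode(sample)

print(ascii_result)    # tumorkonferenz fr rzte bersicht
print(unicode_result)  # tumorkonferenz für ärzte übersicht
```

Whether the lost umlauts matter depends on the downstream model. For keyword matching against German terms, the Unicode-aware variant may be the safer choice.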


In [35]:
train_csv.sample(n=5, random_state=42)

Unnamed: 0,url,doc_id,label,html,html_text
83,http://www.sbk-vs.de/de/medizin/leistungen-und...,125,1,"\n\n<!DOCTYPE HTML>\n<html dir=""ltr"" lang=""de_...",\n \n \n \n \n Prostata-Karzinom-Zentrum - Sch...
53,http://www.klinikum-esslingen.de/kliniken-und-...,85,2,"<!DOCTYPE html>\n<html xmlns=""http://www.w3.or...",\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \...
70,http://www.malteser-kliniken-rhein-ruhr.de/med...,107,2,"<!DOCTYPE html>\n<html lang=""de"">\n<head>\n\n<...",\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \...
45,http://www.klilu.de/medizin__pflege/kliniken_u...,73,2,"<!DOCTYPE html>\n<html lang=""de""><head>\n\t<me...",\n \n \n \n Darmzentrum Rheinpfalz » Zentren A...
44,http://www.kk-bochum.de/de/kliniken_zentren_be...,72,1,"<!DOCTYPE html PUBLIC ""-//W3C//DTD HTML 4.01 T...",\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \...


**Exploratory Data Analysis**

In [36]:
import plotly.express as px
import plotly.offline as pyo

# Enable offline rendering of Plotly figures inside the notebook
pyo.init_notebook_mode(connected=True)

In [45]:
import re
from bs4 import BeautifulSoup

def preprocessed_html_text(html_text: str) -> str:
    # Check if html_text is a string before processing
    if not isinstance(html_text, str):
        return "" # Return an empty string or handle appropriately

    # html_text has already been extracted with the lxml parser above. This second
    # pass with html.parser only strips any markup that survived that extraction.
    try:
        text = BeautifulSoup(html_text, "html.parser").get_text()
    except Exception as e:
        print(f"Warning: Error parsing HTML text: {e}")
        text = html_text # Fallback to original text if parsing fails

    text = re.sub(r'[^A-Za-z0-9\s]', '', text)  # Remove non-alphanumeric
    text = re.sub(r'\s+', ' ', text)  # Remove multiple whitespaces
    text = text.strip().lower()
    return text

# Apply the preprocessed_html_text function to the 'html_text' column
train_csv["preprocessed_html_text"] = train_csv["html_text"].apply(preprocessed_html_text)

# Now the 'preprocessed_html_text' column exists and can be used for plotting
px.histogram(x=train_csv["preprocessed_html_text"].apply(len), title="Distribution of Text Length (Character Count)")

In [42]:
px.histogram(x=train_csv["preprocessed_html_text"].apply(lambda text: text.split(" ")).apply(len),
             title="Distribution of Text Length (Word Count)")

In [43]:
px.histogram(x=train_csv["preprocessed_html_text"].apply(lambda text: set(text.split(" "))).apply(len),
             title="Unique Words Count")
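
A caveat when reading the word-count histograms: splitting on a literal space counts an empty document as one word, because "".split(" ") returns a list containing one empty string. Documents whose HTML file was missing end up as empty strings, so they appear in the first bin with a count of 1 rather than 0. The sketch below illustrates the difference on two made-up texts.

```python
# Hypothetical cleaned texts; the second mimics a document whose HTML file was missing.
docs = ["tumorboard jeden dienstag", ""]

# Splitting on a single space, as in the histogram cells above.
counts_space = [len(d.split(" ")) for d in docs]

# Splitting on any whitespace, which returns an empty list for an empty string.
counts_whitespace = [len(d.split()) for d in docs]

print(counts_space)       # [3, 1]
print(counts_whitespace)  # [3, 0]
```

Since the preprocessed text already has collapsed whitespace, the two approaches agree on every non-empty document.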