# Keyword Detection on Websites



## Assignment
Your task is to create an algorithm that takes an HTML page as input and infers whether the page contains information about a cancer tumor board. What is a tumor board? A tumor board is a consilium of doctors (usually from different disciplines) who discuss the cancer cases in their departments. If you want to know more, please read this article.

The expected result is a CSV file for the test data with the columns doc_id and prediction.

Bonus: if you would like to go the extra mile, try to identify the tumor board types: interdisciplinary, breast, and any third type of your choice. For these tumor boards, try to identify their schedule: day (e.g. Friday), frequency (e.g. weekly, bi-weekly, monthly), and start time.
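The schedule extraction described in the bonus task could be approached with simple pattern matching. The following is only a minimal sketch, not part of the original solution: the function name `extract_schedule`, the vocabularies `DAYS` and `FREQUENCIES`, and the choice of German terms (assumed because the predicted tumor types such as "Brust" and "Darm" are German) are all illustrative assumptions.

```python
import re

# Hypothetical vocabularies (assumption: pages are German or English).
DAYS = [
    ("montag", "Monday"), ("dienstag", "Tuesday"), ("mittwoch", "Wednesday"),
    ("donnerstag", "Thursday"), ("freitag", "Friday"),
    ("monday", "Monday"), ("tuesday", "Tuesday"), ("wednesday", "Wednesday"),
    ("thursday", "Thursday"), ("friday", "Friday"),
]
# Bi-weekly terms come first because "weekly" is a substring of "bi-weekly"
# and "wöchentlich" is a substring of "zweiwöchentlich".
FREQUENCIES = [
    ("14-tägig", "bi-weekly"), ("zweiwöchentlich", "bi-weekly"),
    ("bi-weekly", "bi-weekly"),
    ("wöchentlich", "weekly"), ("weekly", "weekly"),
    ("monatlich", "monthly"), ("monthly", "monthly"),
]
# Matches times such as 7:45, 16:30 or 16.30 (hours 0-23, minutes 00-59).
TIME_RE = re.compile(r"\b([01]?\d|2[0-3])[:.]([0-5]\d)\b")


def first_match(lowered_text, vocabulary):
    """Return the label of the first vocabulary entry found in the text."""
    for word, label in vocabulary:
        if word in lowered_text:
            return label
    return None


def extract_schedule(text):
    """Extract day, frequency and start time from free text, if present."""
    lowered = text.lower()
    match = TIME_RE.search(lowered)
    start_time = f"{int(match.group(1)):02d}:{match.group(2)}" if match else None
    return {
        "day": first_match(lowered, DAYS),
        "frequency": first_match(lowered, FREQUENCIES),
        "time": start_time,
    }


print(extract_schedule("Die Tumorkonferenz findet wöchentlich freitags um 7:45 Uhr statt."))
```

This sketch looks at the whole page at once, so it cannot tell which schedule belongs to which tumor board, and the time pattern can also match the start of a date such as 12.05.2023. Restricting the search to the sentence around a tumor board keyword would be a natural refinement.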

## Data Description
You are given train.csv and test.csv files and a folder with the corresponding .html files.

Files:

- train.csv contains the following columns: url, doc_id and label
- test.csv contains the following columns: url and doc_id
- htmls contains files named {doc_id}.html
- keyword2tumor_type.csv contains useful keywords for the types of tumor boards

Description of tumor board labels:

- 1 (no evidence): tumor boards are not mentioned on the page
- 2 (medium confidence): tumor boards are mentioned, but the page is not completely dedicated to describing tumor boards
- 3 (high confidence): the page is completely dedicated to describing tumor board types and dates

You are asked to prepare a model using the HTML files referred to in train.csv and to make predictions for the HTML files from test.csv.
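Before modelling, it can help to confirm that every doc_id actually has an HTML file and to keep the label meanings in one place. The sketch below is an illustration under assumptions: the helper names `html_path` and `missing_html_files` and the default folder name `htmls` are hypothetical, and only the label descriptions come from the text above.

```python
from pathlib import Path

# Label meanings as given in the data description.
LABEL_DESCRIPTIONS = {
    1: "no evidence",
    2: "medium confidence",
    3: "high confidence",
}


def html_path(doc_id, html_dir="htmls"):
    """Return the expected path of the HTML file for a given doc_id."""
    return Path(html_dir) / f"{doc_id}.html"


def missing_html_files(doc_ids, html_dir="htmls"):
    """Return the doc_ids whose HTML file does not exist in html_dir."""
    return [d for d in doc_ids if not html_path(d, html_dir).is_file()]
```

With pandas data frames, this could be called as `missing_html_files(train_df["doc_id"])`; an empty list means every training document is available.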

## Practicalities
You should prepare a Jupyter Notebook with the code that you used for making the predictions and the following documentation:

- How did you decide to handle this amount of data?
- How did you decide to do feature engineering?
- How did you decide which models to try (if you decide to train any models)?
- How did you perform validation of your model?
- What metrics did you measure?
- How do you expect your model to perform on test data (in terms of your metrics)?
- How fast will your algorithm perform, and how could you improve its performance if you had more time?
- How do you think you would be able to improve your algorithm if you had more data?
- What potential issues do you see with your algorithm?

## Tips
To extract clean text from a page, you can use the BeautifulSoup module like this:

from bs4 import BeautifulSoup

content = read_html()  # placeholder for however you load the page's HTML

soup = BeautifulSoup(content, 'html.parser')

clean_text = soup.get_text(' ')


If you decide that you don't need certain tags in your document, for example `<p>`, you can remove them like this:


from bs4 import BeautifulSoup

content = read_html()  # placeholder for however you load the page's HTML

soup = BeautifulSoup(content, 'html.parser')

for tag in soup.find_all('p'):
    tag.decompose()

#### To download the dataset <a href="https://drive.google.com/drive/folders/1Qs2fLj9HmAzx2YGKmqkePCa1Acs5JY3Z?usp=sharing"> Click here </a>

In [4]:
import os
import pandas as pd
from bs4 import BeautifulSoup

# Read train and test CSV files
train_df = pd.read_csv("C:\\Users\\manoj\\Downloads\\train.csv")
test_df = pd.read_csv("C:\\Users\\manoj\\Downloads\\test.csv")

# Read keyword file
keyword_df = pd.read_csv("C:\\Users\\manoj\\Downloads\\keyword2tumor_type.csv")

# Display basic information about each CSV file
print("Train CSV:")
print(train_df.info())
print("\nTest CSV:")
print(test_df.info())
print("\nKeyword CSV:")
print(keyword_df.info())

# Display sample data from each CSV file
print("\nSample Data from Train CSV:")
print(train_df.head())
print("\nSample Data from Test CSV:")
print(test_df.head())
print("\nSample Data from Keyword CSV:")
print(keyword_df.head())



Train CSV:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   url     100 non-null    object
 1   doc_id  100 non-null    int64 
 2   label   100 non-null    int64 
dtypes: int64(2), object(1)
memory usage: 2.5+ KB
None

Test CSV:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   url     48 non-null     object
 1   doc_id  48 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 900.0+ bytes
None

Keyword CSV:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126 entries, 0 to 125
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   keyword     126 non-null    object
 1   tumor_type  126 non-null    object
dtypes: object(2)
memory usage: 2.1+ KB
None


In [6]:
# Extract keywords and their corresponding tumor board types
keywords = keyword_df['keyword'].tolist()
tumor_types = keyword_df['tumor_type'].tolist()

def extract_clean_text(html_file_path):
    # latin1 never raises a decode error, but UTF-8 pages will show garbled umlauts
    with open(html_file_path, 'r', encoding='latin1') as file:
        content = file.read()
    # Parse as HTML; the 'xml' parser is meant for XML documents, not HTML pages
    soup = BeautifulSoup(content, 'html.parser')
    # Remove unwanted tags like <script> and <style>
    for tag in soup(["script", "style"]):
        tag.extract()
    return soup.get_text(separator=' ')

# Classify an HTML document by the tumor type of the first keyword (in file order) found in its text
def classify_html(html_text):
    text = html_text.lower()  # lowercase once instead of once per keyword
    for keyword, tumor_type in zip(keywords, tumor_types):
        if keyword.lower() in text:
            return tumor_type
    return "no evidence"

# Directory containing the {doc_id}.html files
HTML_DIR = "C:\\Users\\manoj\\Downloads\\htmls\\htmls"

def predict(df):
    """Return a list of (doc_id, prediction) pairs for every row of df."""
    predictions = []
    for doc_id in df['doc_id']:
        html_file_path = os.path.join(HTML_DIR, f"{doc_id}.html")
        html_text = extract_clean_text(html_file_path)
        predictions.append((doc_id, classify_html(html_text)))
    return predictions

# Process train and test data
train_predictions = predict(train_df)
test_predictions = predict(test_df)

# Save predictions to CSV files
train_predictions_df = pd.DataFrame(train_predictions, columns=['doc_id', 'prediction'])
test_predictions_df = pd.DataFrame(test_predictions, columns=['doc_id', 'prediction'])

train_predictions_df.to_csv("train_predictions.csv", index=False)
test_predictions_df.to_csv("test_predictions.csv", index=False)

# Load train_predictions.csv and test_predictions.csv
train_predictions_df = pd.read_csv("train_predictions.csv")
test_predictions_df = pd.read_csv("test_predictions.csv")

# Display the first few rows of each CSV file
print("Train Predictions:")
print(train_predictions_df.head())

print("\nTest Predictions:")
print(test_predictions_df.head())



Train Predictions:
   doc_id   prediction
0       1        Brust
1       3        Brust
2       4  no evidence
3       5  no evidence
4       6  no evidence

Test Predictions:
   doc_id   prediction
0       0  no evidence
1       2        Brust
2       7        Brust
3      15        Brust
4      16        Brust
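The `classify_html` function above returns the tumor type of the first keyword that occurs anywhere in the page, so the result depends on the order of the keyword file. One possible alternative, shown here only as a sketch, is to count the occurrences of all keywords and return the tumor type with the most hits. The function names and the small `sample_keywords` mapping are hypothetical and do not come from keyword2tumor_type.csv.

```python
from collections import Counter


def count_tumor_type_hits(text, keyword_to_type):
    """Count case-insensitive substring occurrences of keywords per tumor type."""
    lowered = text.lower()
    counts = Counter()
    for keyword, tumor_type in keyword_to_type.items():
        hits = lowered.count(keyword.lower())
        if hits:
            counts[tumor_type] += hits
    return counts


def classify_by_counts(text, keyword_to_type):
    """Return the tumor type with the most keyword hits, or 'no evidence'."""
    counts = count_tumor_type_hits(text, keyword_to_type)
    if not counts:
        return "no evidence"
    return counts.most_common(1)[0][0]


# Hypothetical keyword mapping, for illustration only.
sample_keywords = {"brust": "Brust", "mamma": "Brust", "darm": "Darm"}
sample_text = "Darmzentrum: Das Darmkrebs-Tumorboard und das Brustzentrum"
print(classify_by_counts(sample_text, sample_keywords))
```

With the notebook's data, the mapping could be built as `dict(zip(keyword_df['keyword'], keyword_df['tumor_type']))`. Substring matching is kept on purpose so that German compounds such as "Brustzentrum" still count, at the cost of occasional matches in unrelated words.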


In [7]:

# Function to inspect HTML files based on predictions
def inspect_html_files(predictions_df, dataset_name):
    print(f"\nInspecting HTML files for {dataset_name} predictions:\n")
    for index, row in predictions_df.sample(3).iterrows():  # Selecting 3 random rows for inspection
        doc_id = row['doc_id']
        prediction = row['prediction']
        html_file_path = f"C:\\Users\\manoj\\Downloads\\htmls\\htmls\\{doc_id}.html"
        print(f"Doc ID: {doc_id}, Prediction: {prediction}")
        # You can manually inspect the HTML file at html_file_path to verify if the prediction aligns with the content
        # For example, you can open the HTML file in a web browser or a text editor
        print(f"HTML File Path: {html_file_path}\n")

# Inspect HTML files for train predictions
inspect_html_files(train_predictions_df, "train")

# Inspect HTML files for test predictions
inspect_html_files(test_predictions_df, "test")



Inspecting HTML files for train predictions:

Doc ID: 146, Prediction: Darm
HTML File Path: C:\Users\manoj\Downloads\htmls\htmls\146.html

Doc ID: 22, Prediction: Brust
HTML File Path: C:\Users\manoj\Downloads\htmls\htmls\22.html

Doc ID: 18, Prediction: Brust
HTML File Path: C:\Users\manoj\Downloads\htmls\htmls\18.html


Inspecting HTML files for test predictions:

Doc ID: 31, Prediction: Brust
HTML File Path: C:\Users\manoj\Downloads\htmls\htmls\31.html

Doc ID: 147, Prediction: Darm
HTML File Path: C:\Users\manoj\Downloads\htmls\htmls\147.html

Doc ID: 24, Prediction: no evidence
HTML File Path: C:\Users\manoj\Downloads\htmls\htmls\24.html
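Opening each file by hand is slow, so manual inspection could be supported by printing the text around each keyword hit. The helper below is a minimal sketch with hypothetical names (`keyword_contexts`, `width`, `max_hits`); it works on the clean text of a page, not on the raw HTML.

```python
import re


def keyword_contexts(text, keyword, width=40, max_hits=3):
    """Return up to max_hits snippets of text surrounding a keyword."""
    snippets = []
    for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        if len(snippets) >= max_hits:
            break
        left = max(0, match.start() - width)
        right = min(len(text), match.end() + width)
        # Collapse runs of whitespace so the snippet prints on one line.
        snippets.append(" ".join(text[left:right].split()))
    return snippets


sample = "Das interdisziplinäre Tumorboard tagt freitags. Anmeldung zum TUMORBOARD per Fax."
for snippet in keyword_contexts(sample, "tumorboard", width=15):
    print(snippet)
```

Reading a few such snippets per prediction makes it easier to judge whether a keyword appears in a tumor board description or in an unrelated context such as a navigation menu.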



In [None]:
# Documentation:
# How did you decide to handle this amount of data?
# I chose to read the HTML content from each file individually using BeautifulSoup to extract clean text.
# Since the amount of data is relatively small (100 train instances and 48 test instances), this approach should be manageable 
# without overwhelming system resources.

# How did you decide to do feature engineering?
# Feature engineering wasn't explicitly performed in this task as the main focus was on extracting relevant information from
# the HTML content using keyword matching. The features were derived directly from the presence or absence of tumor board-related
# keywords in the text.

# How did you decide which models to try (if you decide to train any models)?
# Since this task primarily involved classifying HTML content based on the presence of tumor board-related keywords, I chose a 
# simple keyword matching approach rather than training complex machine learning models. This decision was made considering the 
# interpretability and ease of implementation of the keyword-based classification method.

# How did you perform validation of your model? What metrics did you measure?
# Since no machine learning model was trained, there was no separate validation split. Instead, I inspected a random
# sample of predictions and manually checked whether the predicted tumor board type matched the content of the HTML
# file. The metric considered was the accuracy of the keyword-based classification.

# How do you expect your model to perform on test data (in terms of your metrics)?
# I expect the keyword-based classification model to perform reasonably well on the test data, given that the same approach was 
# successful on the training data. However, the performance may vary depending on the quality and diversity of the HTML content
# in the test set.

# How fast will your algorithm perform, and how could you improve its performance if you had more time?
# Runtime grows with the number of HTML files and, for each file, with the number of keywords, because every keyword is
# searched in the full page text until the first match. The current data set has only 148 pages and 126 keywords. To improve
# performance, the keywords could be combined into a single compiled regular expression, or the files could be processed
# in parallel.

# How do you think you would be able to improve your algorithm if you had more data?
# With more data, the algorithm's performance could potentially improve by refining the list of tumor board-related keywords and
# incorporating additional features derived from the HTML content, such as structural information or semantic analysis. 
# Furthermore, training more sophisticated machine learning models on labeled data could lead to better classification accuracy.

# What potential issues do you see with your algorithm?
# One potential issue with the keyword-based classification approach is its reliance on the presence of specific keywords in the 
# HTML content. If certain tumor board types are not adequately represented by the chosen keywords or if the keywords appear in 
# unrelated contexts, the algorithm's accuracy may suffer. Additionally, the algorithm may struggle with classifying HTML content
# that deviates significantly from the expected format or structure.
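The accuracy mentioned in the validation answer could be computed against the labels in train.csv. The sketch below is one possible way to do this and rests on a stated assumption: label 1 is treated as "no evidence" and labels 2 and 3 as "evidence", while any predicted tumor type counts as "evidence". The function names are hypothetical.

```python
from collections import Counter


def prediction_to_binary(prediction):
    """Collapse a keyword prediction to 'evidence' or 'no evidence'."""
    return "no evidence" if prediction == "no evidence" else "evidence"


def label_to_binary(label):
    """Assumption: label 1 means no evidence; labels 2 and 3 mean evidence."""
    return "no evidence" if label == 1 else "evidence"


def evaluate(labels, predictions):
    """Return (accuracy, confusion counts keyed by (true, predicted))."""
    pairs = [
        (label_to_binary(label), prediction_to_binary(prediction))
        for label, prediction in zip(labels, predictions)
    ]
    confusion = Counter(pairs)
    accuracy = sum(true == pred for true, pred in pairs) / len(pairs) if pairs else 0.0
    return accuracy, confusion


# Made-up labels and predictions, for illustration only.
accuracy, confusion = evaluate([1, 2, 3, 1], ["no evidence", "Brust", "no evidence", "Darm"])
print(accuracy, dict(confusion))
```

With the notebook's data frames, the inputs could be taken from `train_df['label']` and `train_predictions_df['prediction']` after aligning both on doc_id. This binary view cannot distinguish label 2 from label 3, so it only measures whether tumor boards are detected at all.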