# Keyword Detection on Websites



## Assignment
Your task is to create an algorithm that takes an HTML page as input and infers whether the page contains information about a cancer tumor board. What is a tumor board? A tumor board is a consilium of doctors (usually from different disciplines) who discuss the cancer cases in their departments. If you want to know more please read this article.

The expected result is a CSV file for test data with columns [doc_id and prediction].
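
A minimal sketch of writing such a file with the standard library's `csv` module. The helper name `write_predictions` and the sample rows are assumptions made for illustration; the assignment only fixes the two column names.

```python
import csv
import io


def write_predictions(rows, handle):
    """Write (doc_id, prediction) pairs as CSV with a header row."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["doc_id", "prediction"])
    for doc_id, prediction in rows:
        writer.writerow([doc_id, prediction])


# Hypothetical predictions: (doc_id, predicted label in {1, 2, 3})
sample = [(0, 1), (2, 3), (7, 2)]

# An in-memory buffer keeps the example self-contained;
# pass an opened file handle instead to write to disk.
buffer = io.StringIO()
write_predictions(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

With pandas, `test_df[['doc_id', 'prediction']].to_csv('predictions.csv', index=False)` should give the same layout once a `prediction` column exists; the file name there is also an assumption.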

Bonus: if you would like to go the extra mile, try to identify the tumor board types: interdisciplinary, breast, and a third type of your choice. For each of these tumor boards, try to identify its schedule: day (e.g. Friday), frequency (e.g. weekly, bi-weekly, monthly), and start time.
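
One possible starting point for the schedule part is plain pattern matching. The sketch below assumes the pages are written in German and that times appear as `15:30 Uhr` or `8.00 Uhr`; the vocabulary tables and the function name `extract_schedule` are hypothetical and would need to be extended from the real pages.

```python
import re

# Assumed German vocabulary (not taken from the dataset)
DAYS = {
    "montag": "Monday", "dienstag": "Tuesday", "mittwoch": "Wednesday",
    "donnerstag": "Thursday", "freitag": "Friday",
}
# Longer terms first, because "zweiwöchentlich" contains "wöchentlich"
FREQUENCIES = {
    "zweiwöchentlich": "bi-weekly", "14-tägig": "bi-weekly",
    "wöchentlich": "weekly", "monatlich": "monthly",
}
TIME_RE = re.compile(r"\b([01]?\d|2[0-3])[:.]([0-5]\d)\s*Uhr\b", re.IGNORECASE)


def extract_schedule(text):
    """Return the first known day, frequency, and start time found in text."""
    lowered = text.lower()
    # Picks the first entry in table order, not the first mention in the text
    day = next((en for de, en in DAYS.items() if de in lowered), None)
    frequency = next((en for de, en in FREQUENCIES.items() if de in lowered), None)
    match = TIME_RE.search(text)
    start = f"{int(match.group(1)):02d}:{match.group(2)}" if match else None
    return {"day": day, "frequency": frequency, "time": start}


example = "Die Tumorkonferenz findet wöchentlich am Donnerstag um 15:30 Uhr statt."
schedule = extract_schedule(example)
print(schedule)
```

A page that lists several boards would need the text split per board before calling the function, since it only returns one match per field.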

## Data Description
You have train.csv and test.csv files and a folder with the corresponding .html files.

Files:

- train.csv contains the columns url, doc_id and label
- test.csv contains the columns url and doc_id
- htmls contains files named {doc_id}.html
- keyword2tumor_type.csv contains useful keywords for types of tumor boards

Description of tumor board labels:

- 1 (no evidence): tumor boards are not mentioned on the page
- 2 (medium confidence): tumor boards are mentioned, but the page is not completely dedicated to tumor board description
- 3 (high confidence): the page is completely dedicated to the description of tumor board types and dates

You are asked to prepare a model using the HTML files referred to in train.csv and to make predictions for the HTML files from test.csv.

## Practicalities
You should prepare a Jupyter Notebook with the code that you used for making the predictions and the following documentation:

How did you decide to handle this amount of data?
How did you decide to do feature engineering?
How did you decide which models to try (if you decide to train any models)?
How did you perform validation of your model?
What metrics did you measure?
How do you expect your model to perform on test data (in terms of your metrics)?
How fast will your algorithm perform and how could you improve its performance if you had more time?
How do you think you would be able to improve your algorithm if you had more data?
What potential issues do you see with your algorithm?

## Tips
To extract clean text from the page, you can use the BeautifulSoup module like this:

from bs4 import BeautifulSoup

content = read_html()

soup = BeautifulSoup(content, 'html.parser')

clean_text = soup.get_text(' ')


If you decide that you don't need, for example, the `<p>` tags in your document, you can do this:


from bs4 import BeautifulSoup

content = read_html()

soup = BeautifulSoup(content, 'html.parser')

for tag in soup.find_all('p'):
    tag.decompose()

#### To download the dataset <a href="https://drive.google.com/drive/folders/1Qs2fLj9HmAzx2YGKmqkePCa1Acs5JY3Z?usp=sharing"> Click here </a>

In [25]:
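import pandas as pd
from bs4 import BeautifulSoup
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import re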

In [26]:


# Load train and test datasets
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# Function to read and clean HTML content with encoding handling
def read_html(doc_id, folder='htmls'):
    file_path = os.path.join(folder, f"{doc_id}.html")
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()
    except UnicodeDecodeError:
        try:
            with open(file_path, 'r', encoding='ISO-8859-1') as file:
                content = file.read()
        except UnicodeDecodeError:
            print(f"File encoding issue: {file_path}")
            content = ""
    except FileNotFoundError:
        print(f"File not found: {file_path}")
        content = ""
    
    if content:
        soup = BeautifulSoup(content, 'html.parser')
        clean_text = soup.get_text(' ')
    else:
        clean_text = ""
    
    return clean_text

# Extract and clean text for training data
train_df['text'] = train_df['doc_id'].apply(read_html)
test_df['text'] = test_df['doc_id'].apply(read_html)

# Display the first few rows to ensure the data is loaded correctly
print(train_df.head())
print(test_df.head())




                                                 url  doc_id  label  \
0  http://elbe-elster-klinikum.de/fachbereiche/ch...       1      1   
1  http://klinikum-bayreuth.de/einrichtungen/zent...       3      3   
2  http://klinikum-braunschweig.de/info.php/?id_o...       4      1   
3  http://klinikum-braunschweig.de/info.php/?id_o...       5      1   
4  http://klinikum-braunschweig.de/zuweiser/tumor...       6      3   

                                                text  
0  \n \n \n \n \n \n \n \n Elbe-Elster Klinikum -...  
1  \n \n \n \n \n \n \n \n Onkologisches Zentrum ...  
2  \n \n \n Zentrum - Sozialpädiatrisches Zentrum...  
3  \n \n \n Leistung - Spezielle Unterstützung be...  
4  \n \n \n Zuweiser - Tumorkonferenzen - Tumorko...  
                                                 url  doc_id  \
0  http://chirurgie-goettingen.de/medizinische-ve...       0   
1  http://evkb.de/kliniken-zentren/chirurgie/allg...       2   
2  http://krebszentrum.kreiskliniken-reutlingen.d..
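
The `text` column above starts with long runs of `\n` and spaces left over from the HTML layout. A small sketch of collapsing them; the helper name `normalize_whitespace` and the sample string are assumptions for illustration.

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()


raw = "\n \n \n Zuweiser - Tumorkonferenzen \n\t- Termine "
cleaned = normalize_whitespace(raw)
print(cleaned)
```

It could be applied with `train_df['text'].apply(normalize_whitespace)`. The default TF-IDF tokenizer ignores whitespace anyway, so this mainly helps readability and any regex or keyword matching done on the text.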

In [27]:
# Vectorize text using TF-IDF with ngrams
vectorizer = TfidfVectorizer(stop_words='english', max_features=2000, ngram_range=(1, 3))
X = vectorizer.fit_transform(train_df['text'])
y = train_df['label']

# Split data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the resulting matrices
print(f"X_train shape: {X_train.shape}")
print(f"X_val shape: {X_val.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_val shape: {y_val.shape}")


X_train shape: (80, 2000)
X_val shape: (20, 2000)
y_train shape: (80,)
y_val shape: (20,)
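
Two details of the cell above may be worth revisiting: the vectorizer is fitted on all labeled pages before the split, so the IDF weights also see the validation pages, and `stop_words='english'` is applied to pages that appear to be mostly German. The following is only a sketch under assumptions: the stop-word list is a tiny hand-picked sample, the texts and labels are toy stand-ins for `train_df`, and `LogisticRegression` is swapped in as a simple classifier that the notebook does not use.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical, deliberately tiny German stop-word list (not exhaustive)
GERMAN_STOP_WORDS = ["der", "die", "das", "und", "im", "am", "um", "wir", "für", "des"]

# Toy stand-ins for train_df['text'] and train_df['label']
texts = [
    "Wir informieren über Parken und Anfahrt",
    "Die Cafeteria ist am Montag geschlossen",
    "Unsere Klinik nimmt an der Tumorkonferenz teil",
    "Fälle werden im Tumorboard besprochen und dokumentiert",
    "Tumorkonferenz Termine: Tumorboard jeden Donnerstag um 15:30 Uhr",
    "Tumorboard Brustzentrum: Tumorkonferenz wöchentlich am Dienstag",
]
labels = [1, 1, 2, 2, 3, 3]

# Vectorizer and classifier in one estimator, so each CV fold refits the TF-IDF
pipeline = make_pipeline(
    TfidfVectorizer(stop_words=GERMAN_STOP_WORDS, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)

# For classifiers, an integer cv uses stratified folds
scores = cross_val_score(pipeline, texts, labels, cv=2, scoring="accuracy")
print(scores.mean())

# Fit once on everything to inspect the learned vocabulary
pipeline.fit(texts, labels)
vocabulary = pipeline.named_steps["tfidfvectorizer"].vocabulary_
```

Because the pipeline is a single estimator, it can be passed to `GridSearchCV` in the same way as a bare classifier.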


In [28]:
# Train a baseline Random Forest model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Validate the baseline model
y_pred = model.predict(X_val)
print("Accuracy:", accuracy_score(y_val, y_pred))
print("Precision:", precision_score(y_val, y_pred, average='weighted'))
print("Recall:", recall_score(y_val, y_pred, average='weighted'))
print("F1 Score:", f1_score(y_val, y_pred, average='weighted'))


Accuracy: 0.5
Precision: 0.425
Recall: 0.5
F1 Score: 0.39920634920634923


  _warn_prf(average, modifier, msg_start, len(result))


In [29]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid to search
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}

# Instantiate the GridSearchCV object
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring='accuracy')

# Perform grid search on the training data
grid_search.fit(X_train, y_train)

# Get the best parameters and best score
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)


Best Parameters: {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 200}
Best Score: 0.7125


In [30]:
# Retrain a Random Forest model with the best parameters
best_model = RandomForestClassifier(random_state=42, max_depth=None, min_samples_split=2, n_estimators=200)
best_model.fit(X_train, y_train)

# Validate the retrained model
y_pred = best_model.predict(X_val)
print("Accuracy:", accuracy_score(y_val, y_pred))
print("Precision:", precision_score(y_val, y_pred, average='weighted'))
print("Recall:", recall_score(y_val, y_pred, average='weighted'))
print("F1 Score:", f1_score(y_val, y_pred, average='weighted'))


Accuracy: 0.45
Precision: 0.23684210526315788
Recall: 0.45
F1 Score: 0.3103448275862069


  _warn_prf(average, modifier, msg_start, len(result))


In [31]:
from sklearn.svm import SVC

# Train SVM classifier
svm_model = SVC(kernel='linear', random_state=42)
svm_model.fit(X_train, y_train)

# Validate the SVM model
y_pred_svm = svm_model.predict(X_val)
print("SVM Accuracy:", accuracy_score(y_val, y_pred_svm))
print("SVM Precision:", precision_score(y_val, y_pred_svm, average='weighted'))
print("SVM Recall:", recall_score(y_val, y_pred_svm, average='weighted'))
print("SVM F1 Score:", f1_score(y_val, y_pred_svm, average='weighted'))


SVM Accuracy: 0.4
SVM Precision: 0.23529411764705882
SVM Recall: 0.4
SVM F1 Score: 0.29629629629629634


  _warn_prf(average, modifier, msg_start, len(result))


In [32]:
from sklearn.ensemble import GradientBoostingClassifier

# Train Gradient Boosting classifier
gb_model = GradientBoostingClassifier(random_state=42)
gb_model.fit(X_train, y_train)

# Validate the Gradient Boosting model
y_pred_gb = gb_model.predict(X_val)
print("Gradient Boosting Accuracy:", accuracy_score(y_val, y_pred_gb))
print("Gradient Boosting Precision:", precision_score(y_val, y_pred_gb, average='weighted'))
print("Gradient Boosting Recall:", recall_score(y_val, y_pred_gb, average='weighted'))
print("Gradient Boosting F1 Score:", f1_score(y_val, y_pred_gb, average='weighted'))


Gradient Boosting Accuracy: 0.5
Gradient Boosting Precision: 0.4076923076923077
Gradient Boosting Recall: 0.5
Gradient Boosting F1 Score: 0.4478260869565217


  _warn_prf(average, modifier, msg_start, len(result))


In [35]:
from sklearn.ensemble import GradientBoostingClassifier

# Initialize Gradient Boosting classifier
gb_model = GradientBoostingClassifier(random_state=42)

# Train the model
gb_model.fit(X_train, y_train)

# Make predictions on the validation set
y_pred_gb = gb_model.predict(X_val)

# Evaluate the model
accuracy_gb = accuracy_score(y_val, y_pred_gb)
precision_gb = precision_score(y_val, y_pred_gb, average='weighted')
recall_gb = recall_score(y_val, y_pred_gb, average='weighted')
f1_gb = f1_score(y_val, y_pred_gb, average='weighted')

print("Gradient Boosting Classifier Evaluation:")
print("Accuracy:", accuracy_gb)
print("Precision:", precision_gb)
print("Recall:", recall_gb)
print("F1 Score:", f1_gb)


Gradient Boosting Classifier Evaluation:
Accuracy: 0.5
Precision: 0.4076923076923077
Recall: 0.5
F1 Score: 0.4478260869565217


  _warn_prf(average, modifier, msg_start, len(result))


In [36]:
from sklearn.ensemble import RandomForestClassifier

# Initialize Random Forest classifier
rf_model = RandomForestClassifier(random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Make predictions on the validation set
y_pred_rf = rf_model.predict(X_val)

# Evaluate the model
accuracy_rf = accuracy_score(y_val, y_pred_rf)
precision_rf = precision_score(y_val, y_pred_rf, average='weighted')
recall_rf = recall_score(y_val, y_pred_rf, average='weighted')
f1_rf = f1_score(y_val, y_pred_rf, average='weighted')

print("Random Forest Classifier Evaluation:")
print("Accuracy:", accuracy_rf)
print("Precision:", precision_rf)
print("Recall:", recall_rf)
print("F1 Score:", f1_rf)


Random Forest Classifier Evaluation:
Accuracy: 0.5
Precision: 0.425
Recall: 0.5
F1 Score: 0.39920634920634923


  _warn_prf(average, modifier, msg_start, len(result))


1.  I handled the data by loading the provided train and test datasets into memory. Since the dataset is small (100 labeled pages), I loaded it all at once rather than using batching or streaming. I also split the labeled data into training and validation sets so that I could evaluate the models.


2.  For feature engineering, I used TF-IDF vectorization (up to 2000 features, 1- to 3-grams) to convert the text of the HTML pages into numerical features. I chose TF-IDF because it captures how important a word is within a document while accounting for its frequency across the entire corpus. I also considered experimenting with keyword extraction techniques to identify relevant terms related to tumor boards.
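
A keyword-based feature of the kind mentioned in this answer could look roughly like the sketch below. The keyword table is hypothetical; in the task it would presumably be built from keyword2tumor_type.csv, whose column layout is not shown here.

```python
import re
from collections import Counter

# Hypothetical keyword -> tumor board type mapping (an assumption for illustration)
KEYWORD_TO_TYPE = {
    "tumorboard": "interdisciplinary",
    "tumorkonferenz": "interdisciplinary",
    "brustzentrum": "breast",
    "mammakarzinom": "breast",
}


def keyword_counts(text):
    """Count keyword hits per tumor board type (case-insensitive substring match)."""
    lowered = text.lower()
    counts = Counter()
    for keyword, board_type in KEYWORD_TO_TYPE.items():
        counts[board_type] += len(re.findall(re.escape(keyword), lowered))
    return dict(counts)


page = "Die Tumorkonferenzen des Brustzentrums: Tumorboard jeden Montag."
features = keyword_counts(page)
print(features)
```

These counts could be appended to the TF-IDF matrix as extra columns, or used on their own as a rule-based baseline to compare the trained models against.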


3.  I decided to try three different models for classification: Support Vector Machines (SVM), Gradient Boosting, and Random Forest. I chose these models based on their versatility and effectiveness in handling classification tasks with text data. Additionally, these models offer a good balance between interpretability and performance, which was important for this task.


4.  To validate the models, I held out 20% of the labeled data as a validation set (80 pages for training, 20 for validation) with a fixed random seed. This split was not stratified, so the class proportions may differ between the two sets. I trained each model on the training set and evaluated it on the validation set. For the Random Forest, I also used 5-fold cross-validation inside a grid search to tune the hyperparameters.
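
A stratified hold-out split keeps the class proportions the same in both parts, which matters with only three classes and few pages. A minimal sketch with toy labels; the class counts below are invented for the example and are not the real label distribution.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy labels with the task's three classes, deliberately imbalanced
labels = [1] * 50 + [2] * 30 + [3] * 20
doc_ids = list(range(len(labels)))

# stratify=labels makes both parts follow the 50/30/20 class proportions
ids_train, ids_val, y_train, y_val = train_test_split(
    doc_ids, labels, test_size=0.2, random_state=42, stratify=labels
)

val_distribution = dict(Counter(y_val))
print(val_distribution)
```

In the notebook, the equivalent change would be adding `stratify=y` to the existing `train_test_split` call.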


5.  I measured accuracy, precision, recall, and F1 score, with the last three averaged over the three classes and weighted by class support. These metrics give insight into different aspects of a model's performance: overall correctness, positive predictive value, sensitivity, and the balance between precision and recall.
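
The `_warn_prf` lines in the outputs above come from scikit-learn's undefined-metric warning, which is raised when a class is never predicted and its precision has no defined value. The sketch below uses toy labels (not the notebook's predictions) to show how a confusion matrix exposes that case, and how macro and weighted averaging differ.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Toy validation labels: the model never predicts class 2
y_true = [1, 1, 1, 2, 2, 3]
y_pred = [1, 1, 1, 1, 3, 3]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3])
print(cm)

# zero_division=0 scores the undefined precision as 0 without a warning
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
print(round(macro_f1, 3), round(weighted_f1, 3))
```

Macro averaging gives every class the same weight, so a class that is never predicted pulls the score down more visibly than under weighted averaging.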


6. On the validation set the models reached an accuracy of 0.40 to 0.50 and a weighted F1 score of roughly 0.30 to 0.45, so I expect similarly modest performance on the test data. Some variation is likely, because the validation set is small (20 pages) and the test pages may be distributed differently from the training and validation pages.
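
To get a feeling for how uncertain a validation score on 20 pages is, one can use the normal approximation to the binomial, $\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}$. The sketch below plugs in the accuracy of 0.5 and the validation size of 20 from the outputs above; the function name `accuracy_interval` is an assumption, and the approximation itself is rough for such a small sample.

```python
import math


def accuracy_interval(accuracy, n, z=1.96):
    """Approximate 95% interval for an accuracy measured on n samples."""
    std_err = math.sqrt(accuracy * (1 - accuracy) / n)
    low = max(0.0, accuracy - z * std_err)
    high = min(1.0, accuracy + z * std_err)
    return low, high


low, high = accuracy_interval(0.5, 20)
print(f"{low:.2f} to {high:.2f}")
```

An interval this wide means the differences between the models tried here (0.40 to 0.50 accuracy) are within the noise of a single 20-page validation set.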


7. The algorithms ran reasonably fast on the provided dataset, though their speed may vary with the complexity of the data and the computational resources available. I tried several approaches but could not improve the accuracy before the deadline, so I am submitting what I have so far. After submitting, I plan to keep working on the accuracy and find out what I missed.


8.  With more data, I could train the models on a larger and more diverse dataset, which may lead to improved generalization and performance.


9.  One potential issue with the algorithm is its sensitivity to the quality and representativeness of the training data. If the dataset is biased or contains noisy or irrelevant information, the model's performance could suffer. Regularization techniques and careful model selection can help mitigate these issues. I need more practice with this kind of problem, so I will keep working on it after submitting. Thank you.