# Project Phase 2

Phase 2 of our project.

# Section 2. Compute Agreement Between Annotators

We will use Cohen's Kappa to compute agreement, since it is designed for the case of two annotators.  The other options (Fleiss' Kappa, Krippendorff's Alpha, and Percentage Agreement) do not fit our case as well.

As a side note, the label 'None' in the 'pay' column had to be changed to 'NoPay' because Pandas was interpreting it as NaN.  A workaround in Pandas was possible, but I decided to change the label name instead to make the dataset easier for others to work with.

## Cohen's Kappa

Cohen's Kappa measures pairwise agreement between two annotators.  We will compute it for each of the two labels separately, first for 'hire', then for 'pay'.

### Compute for Hire

In [22]:
import pandas as pd
from sklearn.metrics import cohen_kappa_score

# Load the CSV files from the GitHub repository
url_annotator1 = 'https://raw.githubusercontent.com/RyanS974/RyanS974/main/datasets/phase2/zoiya.csv'
url_annotator2 = 'https://raw.githubusercontent.com/RyanS974/RyanS974/main/datasets/phase2/meriem.csv'

# Read the data
annotator1 = pd.read_csv(url_annotator1)
annotator2 = pd.read_csv(url_annotator2)

# Ensure both dataframes are sorted and indexed in the same way
annotator1 = annotator1.sort_values(by="id").reset_index(drop=True)
annotator2 = annotator2.sort_values(by="id").reset_index(drop=True)

# Extract labels
labels_annotator1 = annotator1['hire']
labels_annotator2 = annotator2['hire']

# Compute Cohen's Kappa
kappa_score = cohen_kappa_score(labels_annotator1, labels_annotator2)

# Print the result and interpretation
print(f"Cohen's Kappa Score: {kappa_score:.2f}")

# Interpret the Kappa score
if kappa_score >= 0.75:
    print("Interpretation: Strong agreement")
elif 0.6 <= kappa_score < 0.75:
    print("Interpretation: Moderate agreement")
elif 0.4 <= kappa_score < 0.6:
    print("Interpretation: Fair agreement")
else:
    print("Interpretation: Poor agreement")

Cohen's Kappa Score: 0.68
Interpretation: Moderate agreement


This score indicates moderate agreement between the annotators on the 'hire' label.

### Compute for Pay

In [25]:
import pandas as pd
from sklearn.metrics import cohen_kappa_score

# Load the CSV files from the GitHub repository
url_annotator1 = 'https://raw.githubusercontent.com/RyanS974/RyanS974/main/datasets/phase2/zoiya.csv'
url_annotator2 = 'https://raw.githubusercontent.com/RyanS974/RyanS974/main/datasets/phase2/meriem.csv'

# Read the data
annotator1 = pd.read_csv(url_annotator1)
annotator2 = pd.read_csv(url_annotator2)

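# Ensure both dataframes are sorted and indexed in the same way (as in the 'hire' cell)
annotator1 = annotator1.sort_values(by="id").reset_index(drop=True)
annotator2 = annotator2.sort_values(by="id").reset_index(drop=True)

# Extract labels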
labels_annotator1 = annotator1['pay']
labels_annotator2 = annotator2['pay']

# Compute Cohen's Kappa
kappa_score = cohen_kappa_score(labels_annotator1, labels_annotator2)

# Print the result and interpretation
print(f"Cohen's Kappa Score: {kappa_score:.2f}")

# Interpret the Kappa score
if kappa_score >= 0.75:
    print("Interpretation: Strong agreement")
elif 0.6 <= kappa_score < 0.75:
    print("Interpretation: Moderate agreement")
elif 0.4 <= kappa_score < 0.6:
    print("Interpretation: Fair agreement")
else:
    print("Interpretation: Poor agreement")

Cohen's Kappa Score: 0.38
Interpretation: Poor agreement


For the 'pay' label, we have poor agreement between the annotators.

## Interpretation

Even though the 'pay' label is rated as 'poor agreement', this is not as bad as it sounds.  The 'pay' label leaves a lot of room for interpretation.  Cohen's Kappa also corrects for the agreement expected by chance: because 'NoPay' makes up a large share of the labels, the expected chance agreement is high, and the many 'NoPay' matches count for less than the raw agreement rate would suggest.  I believe this is why the score is lower than it would otherwise be.  This is the first time I have worked with Cohen's Kappa, so this explanation is tentative.
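To illustrate how a dominant label can pull the score down, here is a minimal sketch that computes Cohen's Kappa by hand as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance.  The toy labels below, including the 'Low' and 'High' pay values, are hypothetical and are not taken from our dataset.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Return (observed agreement, chance agreement, kappa) for two annotators."""
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators chose the same label
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability of a match if each annotator labeled
    # independently according to their own label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((freq_a[label] / n) * (freq_b[label] / n) for label in freq_a)
    # Note: undefined (division by zero) if p_e == 1, which this sketch does not handle
    kappa = (p_o - p_e) / (1 - p_e)
    return p_o, p_e, kappa

# Hypothetical toy data: both annotators agree on 8 'NoPay' items
# and disagree on the two remaining items.
annotator_a = ["NoPay"] * 8 + ["Low", "High"]
annotator_b = ["NoPay"] * 8 + ["High", "Low"]

p_o, p_e, kappa = cohen_kappa(annotator_a, annotator_b)
print(f"Observed agreement: {p_o:.2f}")
print(f"Chance agreement:   {p_e:.2f}")
print(f"Cohen's kappa:      {kappa:.2f}")
```

In this toy case the annotators agree on 80% of the items, yet the kappa is only about 0.41, because most of that agreement is already expected by chance when 'NoPay' dominates.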

# Section 3. Determine Ground Truth Labels

Our ground truth labels are the final labels for the dataset.  There are several options available, but the best choice for this case was the 'majority voting' method, which in practice meant adding a third annotator, myself.

## Majority Voting

I annotated the data as a third annotator, which resolved any ties between the first two annotators.  The full_dataset.csv file contains the final 'ground truth' labels for the dataset.
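As a rough sketch of how majority voting over three annotators could be implemented, consider the function below.  The function name, the example rows, and the fallback rule for a three-way split (use the third annotator's label) are assumptions for illustration, not a record of how full_dataset.csv was actually built.

```python
from collections import Counter

def majority_vote(labels, tie_breaker=None):
    """Return the label chosen by more than half of the annotators.

    If no label has a strict majority, fall back to `tie_breaker`
    (hypothetical rule for this sketch).
    """
    label, count = Counter(labels).most_common(1)[0]
    if count > len(labels) / 2:
        return label
    return tie_breaker

# Hypothetical rows: (annotator 1, annotator 2, third annotator)
rows = [
    ("Yes", "Yes", "No"),
    ("No", "Interview", "No"),
    ("Yes", "No", "Interview"),  # three-way split, no majority
]

# Assumption: the third annotator's label settles a three-way split
ground_truth = [majority_vote(row, tie_breaker=row[2]) for row in rows]
print(ground_truth)
```

With three annotators and a two-class label, a strict majority always exists; the fallback only matters for labels with three or more classes, such as 'hire'.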

# Section 4. Analyze Data

# Section 5. Set Up Codabench Page for Task

## Task Definition

Codabench has a basic Task Definition concept, which is given below.

### Actual Task Definition

Task Name: Job Candidate Hiring and Salary Prediction

Objective: Predict whether a candidate will be hired based on their skills, experience, and other features, and estimate their salary if they are hired.

Dataset: The dataset includes information on candidate skills, years of experience, grades, number of completed projects, and involvement in extracurriculars. Labels include "hire" (Yes, No, Interview) and "pay" (a value or "NoPay" if not hired).

Expected Output: A class prediction for "hire" (Yes, No, or Interview) and, if the candidate is hired, a prediction for "pay."

Evaluation Metrics: Models will be evaluated based on F1-score.

Example Input/Output:

Input: Skills: SQL, Machine Learning, Java; Experience: 3 years; Grades: 92; Projects: 3; Extra: 2
Output: Hire: No, Pay: NoPay

## Split Data

We will use a standard 60/20/20 dataset split for training, validation, and testing.

In [30]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
dataset1 = 'https://raw.githubusercontent.com/RyanS974/RyanS974/main/datasets/phase2/full_dataset.csv'
data = pd.read_csv(dataset1)

# Separate the features and labels
X = data.drop(columns=["hire", "pay"])  # Drop target columns to get features
y = data[["hire", "pay"]]  # Target columns (multi-label)

# First, split the data into training and temp (which will be split into validation and test sets)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)

# Now split the temp set equally into validation and test sets (20% each of the original data)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Print shapes to verify the split
print("Training set:", X_train.shape, y_train.shape)
print("Validation set:", X_val.shape, y_val.shape)
print("Test set:", X_test.shape, y_test.shape)


Training set: (600, 7) (600, 2)
Validation set: (200, 7) (200, 2)
Test set: (200, 7) (200, 2)


## Evaluation Metric

We will use the F1-score for both 'hire' and 'pay'.  It is a good fit for both classification labels because it balances precision and recall.  The scores reported below use the weighted average across classes (average="weighted").
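To make the weighted F1-score concrete, here is a small hand-rolled sketch: it computes the F1-score for each class and averages them, weighting each class by its number of true instances.  The helper names and the example labels are hypothetical; the actual scoring in this notebook relies on scikit-learn's f1_score.

```python
from collections import Counter

def f1_for_class(y_true, y_pred, label):
    """F1-score for a single class, treating it as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def weighted_f1(y_true, y_pred):
    """Average the per-class F1-scores, weighted by each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(
        f1_for_class(y_true, y_pred, label) * count / total
        for label, count in support.items()
    )

# Hypothetical true and predicted 'hire' labels
y_true = ["No", "No", "No", "Yes", "Interview", "No"]
y_pred = ["No", "No", "Yes", "Yes", "No", "No"]

score = weighted_f1(y_true, y_pred)
print(f"Weighted F1-score: {score:.4f}")
```

Because the weights follow class frequency, the most common class contributes the most to the final number.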

## Baseline Model

We will use a Random Forest classifier for our baseline model, with some basic hyperparameter tuning to make it better than the dummy baseline we will create later.  Because we have two labels, we also need a multi-output wrapper (MultiOutputClassifier).

We will also combine training, validation, and testing in one code block below, with basic printouts of the F1-scores for validation and testing.

In [38]:
# Import the libraries needed for preprocessing, modeling, and scoring
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import f1_score

# Initialize CountVectorizer for skills column
vectorizer = CountVectorizer(tokenizer=lambda x: x.split(";"))

# Set up the preprocessor with CountVectorizer for the skills column
preprocessor = ColumnTransformer(
    transformers=[
        ("skills", vectorizer, "skills")
    ],
    remainder="passthrough"  # Keeps the other (numerical) columns as they are
)

# Create a pipeline to combine preprocessing and model training
pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("model", MultiOutputClassifier(RandomForestClassifier(n_estimators=300, max_depth=None, max_features=1.0, random_state=42)))
])

# Train the model on the training set
pipeline.fit(X_train, y_train)

# Make predictions on the validation set
y_val_pred = pipeline.predict(X_val)

# Calculate F1-score for each label on the validation set
f1_hire_val = f1_score(y_val['hire'], y_val_pred[:, 0], average="weighted")
f1_pay_val = f1_score(y_val['pay'], y_val_pred[:, 1], average="weighted")

print("Validation F1-score for 'hire' label:", f1_hire_val)
print("Validation F1-score for 'pay' label:", f1_pay_val)

# Make predictions on the test set to evaluate generalization
y_test_pred = pipeline.predict(X_test)

# Calculate F1-score for each label on the test set
f1_hire_test = f1_score(y_test['hire'], y_test_pred[:, 0], average="weighted")
f1_pay_test = f1_score(y_test['pay'], y_test_pred[:, 1], average="weighted")

print("\nTest F1-score for 'hire' label:", f1_hire_test)
print("Test F1-score for 'pay' label:", f1_pay_test)




Validation F1-score for 'hire' label: 0.99
Validation F1-score for 'pay' label: 0.9659289013858205

Test F1-score for 'hire' label: 0.9894642857142857
Test F1-score for 'pay' label: 0.9780191815856777


Our labels have a class imbalance, which shows up in these F1-scores.  They are very high, but that is expected in our scenario: the many 'NoPay' and 'No' labels are the main reason.  There is no way around this without changing the nature of the project, so I believe the class imbalance is essentially unavoidable and these scores are acceptable.
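One quick way to gauge the imbalance is to look at the share of each class and at how often a model that always predicts the most common class would be right.  The sketch below uses a hypothetical label list rather than our real 'hire' column, so the percentages are for illustration only.

```python
from collections import Counter

def class_shares(labels):
    """Return each label's share of the data, most common label first."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.most_common()}

# Hypothetical 'hire' labels, not taken from full_dataset.csv
hire_labels = ["No"] * 14 + ["Yes"] * 4 + ["Interview"] * 2

shares = class_shares(hire_labels)
for label, share in shares.items():
    print(f"{label}: {share:.0%}")

# The first entry is the most common class
majority_label, majority_share = next(iter(shares.items()))
print(f"Always predicting '{majority_label}' would be correct {majority_share:.0%} of the time")
```

The more one class dominates, the higher the score a trivial predictor can reach, which is useful context when reading very high F1-scores.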

There is a small discrepancy in F1-scores between the validation and test sets.  This is normal and acceptable.  Both sets contain 200 entries, but the entries are not identical.  For 'hire', the score goes from about 0.989 on the test set to 0.990 on the validation set; for 'pay', it goes from about 0.978 on the test set to 0.966 on the validation set.

### Hyperparameters

Here are the main hyperparameters for the Random Forest classifier we will use:

**n_estimators**: The number of trees directly affects the model's ensemble power. It's usually the first parameter to tune since increasing it improves performance up to a point but with a diminishing return.

**max_depth**: Controlling the tree depth is crucial for managing complexity and preventing overfitting, especially on small or noisy datasets. Shallower trees reduce overfitting, while deeper trees can capture more data nuances.

**max_features**: This parameter impacts diversity across trees. Lower values help create trees that specialize in different parts of the feature space, which generally leads to better generalization.

***The settings in the baseline model we just trained and evaluated should be on the higher end for F1-scores.  The dummy baseline below uses more random and default-like settings, which we expect to give lower scores.***

Our validation metric is the F1-score for both classification labels.  This is what the Codabench competition is based on.

# Section 6. Dummy Baseline (Random Baseline)

Here is our dummy baseline, which is the random-based one.  It also uses default or randomly chosen hyperparameter values.

## Training, Validation, and Testing of Dummy Baseline

In [40]:
# Initialize CountVectorizer for skills column
vectorizer = CountVectorizer(tokenizer=lambda x: x.split(";"))

# Set up the preprocessor with CountVectorizer for the skills column
preprocessor = ColumnTransformer(
    transformers=[
        ("skills", vectorizer, "skills")
    ],
    remainder="passthrough"  # Keeps the other (numerical) columns as they are
)

# Create a pipeline to combine preprocessing and model training
pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("model", MultiOutputClassifier(RandomForestClassifier(n_estimators=75, max_depth=15, max_features="sqrt", random_state=42)))
])

# Train the model on the training set
pipeline.fit(X_train, y_train)

# Make predictions on the validation set
y_val_pred = pipeline.predict(X_val)

# Calculate F1-score for each label on the validation set
f1_hire_val = f1_score(y_val['hire'], y_val_pred[:, 0], average="weighted")
f1_pay_val = f1_score(y_val['pay'], y_val_pred[:, 1], average="weighted")

print("Validation F1-score for 'hire' label:", f1_hire_val)
print("Validation F1-score for 'pay' label:", f1_pay_val)

# Make predictions on the test set to evaluate generalization
y_test_pred = pipeline.predict(X_test)

# Calculate F1-score for each label on the test set
f1_hire_test = f1_score(y_test['hire'], y_test_pred[:, 0], average="weighted")
f1_pay_test = f1_score(y_test['pay'], y_test_pred[:, 1], average="weighted")

print("\nTest F1-score for 'hire' label:", f1_hire_test)
print("Test F1-score for 'pay' label:", f1_pay_test)



Validation F1-score for 'hire' label: 0.9947899159663866
Validation F1-score for 'pay' label: 0.9404081632653061

Test F1-score for 'hire' label: 0.9775263157894737
Test F1-score for 'pay' label: 0.9626582278481013


These different hyperparameter settings, mainly random or default in nature, perform slightly worse than our baseline on three of the four scores.  The exception is the validation F1-score for 'hire', where the dummy baseline is marginally higher (about 0.995 versus 0.990).  The differences are small, but they still allow for comparisons in our Codabench competition, which is the main point.  As mentioned earlier, we have a class imbalance that is causing our high scores, but due to the nature of the project it is not really avoidable, making the high scores acceptable.

# Trained Model vs Dummy Model

Earlier, we trained and validated the baseline.  We just trained and validated our dummy baseline which was intended to be simpler.  Below is a more detailed comparison.

The comparison between the Dummy Baseline and Actual Baseline highlights the effects of model complexity on F1-scores for both labels, "hire" and "pay." The Dummy Baseline, configured with a moderate number of trees (n_estimators=75), a limited depth (max_depth=15), and a subset of features (max_features="sqrt"), performs exceptionally well for the "hire" label, achieving a slightly higher F1-score on the validation set than the Actual Baseline. This suggests that the "hire" label's patterns are well captured with a simpler model, which balances performance with generalization. However, the "pay" label demonstrates a clear benefit from added model complexity, as evidenced by higher validation and test F1-scores with the Actual Baseline.

The Actual Baseline, which leverages a higher number of trees (n_estimators=300), unlimited depth (max_depth=None), and full feature usage (max_features=1.0), delivers the highest F1-scores on the test set for both labels, showing it can generalize well across unseen data. This indicates that the "pay" label likely contains more nuanced patterns that a simpler model might miss, making the added complexity of the Actual Baseline worthwhile. In scenarios where computational resources allow, the Actual Baseline is the better choice for maximizing predictive power, particularly on the "pay" label. However, if computational efficiency is prioritized, the Dummy Baseline is a solid alternative, delivering high F1-scores with fewer resources.
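The comparison above can be laid out side by side with a short script.  The scores are copied from the cell outputs earlier in this notebook; the dictionary layout and variable names are only one possible way to organize them.

```python
# F1-scores copied from the cell outputs above, as (validation, test)
baseline = {
    "hire": (0.99, 0.9894642857142857),
    "pay": (0.9659289013858205, 0.9780191815856777),
}
dummy = {
    "hire": (0.9947899159663866, 0.9775263157894737),
    "pay": (0.9404081632653061, 0.9626582278481013),
}

# Positive difference means the Actual Baseline scored higher than the Dummy Baseline
differences = {}
for label in baseline:
    for split, base_score, dummy_score in zip(
        ("validation", "test"), baseline[label], dummy[label]
    ):
        diff = base_score - dummy_score
        differences[(label, split)] = diff
        print(
            f"{label:>4} {split:<10} baseline={base_score:.4f} "
            f"dummy={dummy_score:.4f} diff={diff:+.4f}"
        )
```

Only the validation score for "hire" comes out negative, meaning the Dummy Baseline is ahead there; the other three differences favor the Actual Baseline.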