# Submission Instructions

These instructions explain how to train your model, check it against the validation dataset, and submit its predictions.

This GitHub directory contains three datasets: the training set, 'training_set_labeled.csv'; the validation set, 'validation_set_unlabeled.csv', which you use to get submission results while developing your model; and the test set, 'test_set_unlabeled.csv', which is the final dataset that preds.txt is based on.  Load the training set into this Python notebook (the next section shows an example) or into your own Python code, then train your model.  Next, run your model on the validation set to tune hyperparameters and analyze metrics, then run it on the test set and write the resulting labels to a file named 'preds.txt'.  Finally, submit that file in a zip archive named 'preds.txt.zip' on the CodaLabs page.

The combine.py file is a basic example of the process.  It uses only the training set and the test set to produce preds.txt.  Ideally, you would also use the validation set for hyperparameter tuning and metrics analysis.

The GitHub directory also contains various other files related to the project.

# Example Code for Loading the Datasets

In [4]:
import pandas as pd

# URLs of the training and test datasets on GitHub
train_dataset_url = 'https://raw.githubusercontent.com/RyanS974/RyanS974/main/datasets/phase2/training_set_labeled.csv'
test_dataset_url = 'https://raw.githubusercontent.com/RyanS974/RyanS974/main/datasets/phase2/test_set_unlabeled.csv'

# column names
column_names = ["id", "skills", "exp", "grades", "projects", "extra", "offer", "hire", "pay"]
column_names2 = ["id", "skills", "exp", "grades", "projects", "extra", "offer"]

# Load the datasets
train_data = pd.read_csv(train_dataset_url, names=column_names).drop(columns=["id"])
test_data = pd.read_csv(test_dataset_url, names=column_names2).drop(columns=["id"])

# Separate features and labels for the training dataset
X_train = train_data.drop(columns=["hire", "pay"])  # Features for training
y_train = train_data[["hire", "pay"]]               # Labels for training

# Test features: the test set has no label columns, so nothing else is dropped
X_test = test_data  # Features for the test set

# Print shapes to verify the split
print("Training set:", X_train.shape, y_train.shape)
print("Test set:", X_test.shape)

Training set: (600, 6) (600, 2)
Validation set: (400, 6)


# Example Code for Training the Model

In [5]:
# Import necessary libraries for encoding
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import f1_score

# Initialize CountVectorizer for skills column
vectorizer = CountVectorizer(tokenizer=lambda x: x.split(";"))

# Set up the preprocessor with CountVectorizer for the skills column
preprocessor = ColumnTransformer(
    transformers=[
        ("skills", vectorizer, "skills")
    ],
    remainder="passthrough"  # Keeps the other (numerical) columns as they are
)

# Create a pipeline to combine preprocessing and model training
pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("model", MultiOutputClassifier(RandomForestClassifier(n_estimators=300, max_depth=None, max_features=1.0, random_state=42)))
])

# Train the model on the training set
pipeline.fit(X_train, y_train)

# As an example, predict directly on the test set, skipping the validation step
y_test_pred = pipeline.predict(X_test)

# Print prediction labels
print(y_test_pred)



[['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['Interview' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['Yes' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['Interview' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['Yes' '125']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ['No' 'NoPay']
 ...

# Submitting Your Results


Write the predictions from the previous section to a text file named preds.txt, one prediction per line, in the comma-separated format described in the next section, and submit that file.
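Rather than copying the printed output by hand, the file can be written from code. The following is a minimal sketch, not the official submission script: the helper name `write_submission` and the `sample_predictions` list are hypothetical, and it assumes each prediction is a (hire, pay) pair like the rows of `y_test_pred`.

```python
import zipfile
from pathlib import Path


def write_submission(predictions, out_dir="."):
    """Write one 'hire,pay' line per prediction to preds.txt, then zip it.

    Hypothetical helper: `predictions` is any iterable of (hire, pay) pairs,
    for example the 2D array returned by pipeline.predict(X_test).
    """
    out_dir = Path(out_dir)
    preds_path = out_dir / "preds.txt"
    zip_path = out_dir / "preds.txt.zip"

    # One comma-separated line per prediction, with no quotes or brackets.
    lines = [f"{hire},{pay}" for hire, pay in predictions]
    preds_path.write_text("\n".join(lines) + "\n", encoding="utf-8")

    # Store preds.txt at the top level of the archive.
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(preds_path, arcname="preds.txt")
    return preds_path, zip_path


# Hypothetical stand-in for a few rows of y_test_pred.
sample_predictions = [["No", "NoPay"], ["Interview", "NoPay"], ["Yes", "125"]]
```

In the notebook, calling `write_submission(y_test_pred)` would write both files to the current working directory.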

# Ideas for Your Model

You can use any classifier you want, and you can tune its hyperparameters or add new ones.  My example uses a random forest with just three hyperparameters.  At the most basic level, you could tune those hyperparameters to get the best possible results to submit.
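One possible way to tune those hyperparameters locally is sketched below. Because the validation file is unlabeled, this sketch holds out part of the labeled training data to score each setting; that hold-out is an assumption of this example, not the competition's procedure. The `toy` DataFrame, the grid values, and the helper names `build_pipeline` and `tune` are all hypothetical. In the notebook you would pass `X_train` and `y_train` instead.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline


def split_skills(text):
    """Split a semicolon-separated skills string into tokens."""
    return text.split(";")


def build_pipeline(**rf_params):
    """Same structure as the example pipeline, with configurable forest settings."""
    vectorizer = CountVectorizer(tokenizer=split_skills, token_pattern=None)
    preprocessor = ColumnTransformer(
        transformers=[("skills", vectorizer, "skills")],
        remainder="passthrough",
    )
    forest = RandomForestClassifier(random_state=42, **rf_params)
    return Pipeline([("preprocess", preprocessor),
                     ("model", MultiOutputClassifier(forest))])


def tune(X, y, grid, holdout_fraction=0.25):
    """Try every grid setting and keep the one with the best hold-out score."""
    X_fit, X_hold, y_fit, y_hold = train_test_split(
        X, y, test_size=holdout_fraction, random_state=42)
    best_score, best_params = -1.0, None
    for params in ParameterGrid(grid):
        model = build_pipeline(**params).fit(X_fit, y_fit)
        predicted = model.predict(X_hold)
        # Exact-match accuracy: a row counts only if hire and pay are both right.
        score = float(np.mean((predicted == y_hold.to_numpy()).all(axis=1)))
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params


# Hypothetical toy data with the same column names as the training set.
rng = np.random.default_rng(0)
n_rows = 80
toy = pd.DataFrame({
    "skills": rng.choice(["python;sql", "java", "python;ml", "sql;excel"], size=n_rows),
    "exp": rng.integers(0, 10, size=n_rows),
    "grades": rng.integers(50, 100, size=n_rows),
    "projects": rng.integers(0, 5, size=n_rows),
    "extra": rng.integers(0, 3, size=n_rows),
    "offer": rng.integers(0, 2, size=n_rows),
})
toy["hire"] = np.where(toy["exp"] > 4, "Yes", "No")
toy["pay"] = np.where(toy["exp"] > 4, "125", "NoPay")

grid = {"n_estimators": [10, 30], "max_depth": [None, 3]}  # illustrative values
best_score, best_params = tune(toy.drop(columns=["hire", "pay"]),
                               toy[["hire", "pay"]], grid)
print("Best hold-out score:", best_score, "with", best_params)
```

The exact-match score is stricter than per-label accuracy; whether it matches the CodaLabs scoring script is not something this sketch can confirm.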

My GitHub directory also has Python code files for these steps, which you can access, modify, or run on your own outside of this Google Colab notebook.

Scoring is done on CodaLabs by a scoring script after you submit your results.  Each line of the file must hold one prediction in the format "No,NoPay", for example.  The quotes here only mark the line entry; do not include them in the text file.  The validation set has 400 data points, so the file would have 400 lines.  Other possible combinations include "Yes,125".
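Before submitting, it can help to check the file's structure. The sketch below is a hypothetical checker, not the CodaLabs scoring script: it only verifies the line count and that each line holds two comma-separated labels without quotes. It does not check which label values are allowed, since the full set of labels is not listed here.

```python
def check_preds_lines(lines, expected_count):
    """Return a list of structural problems found in the lines (empty if none)."""
    problems = []
    if len(lines) != expected_count:
        problems.append(f"expected {expected_count} lines, found {len(lines)}")
    for number, line in enumerate(lines, start=1):
        fields = line.split(",")
        # A valid line has exactly two non-empty fields, e.g. Yes,125
        if len(fields) != 2 or not all(field.strip() for field in fields):
            problems.append(
                f"line {number}: expected two comma-separated labels, got {line!r}")
        elif '"' in line or "'" in line:
            problems.append(f"line {number}: remove the quotes in {line!r}")
    return problems
```

For a file on disk, the lines could be read with `open("preds.txt").read().splitlines()` and checked against the expected count of 400.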

Name this file preds.txt and submit it, zipped as preds.txt.zip, on the CodaLabs page for this competition; it will be scored and placed on the leaderboard.