# CareerMultiClassClassification


The main goal of this project is to assign each profession its corresponding Holland occupation codes. Holland codes are a tool that teachers use at school to help students identify the university programs and careers they could pursue based on their interests and strengths. The six codes are R (Realistic), I (Investigative), A (Artistic), S (Social), E (Enterprising), and C (Conventional). Because one occupation can carry several codes at once (for example `RCS`), the task is treated as multi-label classification.
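
As a small illustration of the coding scheme, the sketch below expands a compact interest code into the names of the Holland types. The names `HOLLAND_TYPES` and `describe_interest_code` are hypothetical helpers introduced here for illustration; they are not part of the notebook.

```python
# Illustrative lookup table for the six Holland types (helper names are assumptions).
HOLLAND_TYPES = {
    "R": "Realistic",
    "I": "Investigative",
    "A": "Artistic",
    "S": "Social",
    "E": "Enterprising",
    "C": "Conventional",
}


def describe_interest_code(code):
    """Expand a compact interest code such as 'RCS' into type names."""
    return [HOLLAND_TYPES[letter] for letter in code]


print(describe_interest_code("RCS"))  # ['Realistic', 'Conventional', 'Social']
```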

## Load PreDownloaded Data

In [None]:
import pandas as pd

data = pd.read_json("/content/Search Careers by Holland Code _ Truity(1).json")

In [None]:
data

Unnamed: 0,Field1,Field2,Field3
0,\n Accountant or Auditor,"\n Persuading, Organizing",Accountants and auditors prepare and examine f...
1,\n Actor,"\n Creating, Persuading",Actors express ideas and portray characters in...
2,\n Actuary,"\n Thinking, Persuading, Organizing...",Actuaries analyze the financial costs of risk ...
3,\n Administrative Services Manager ...,"\n Persuading, Organizing",Administrative services and facilities manager...
4,\n Advertising and Promotions Manag...,"\n Creating, Persuading","Advertising, promotions, and marketing manager..."
...,...,...,...
323,\n Wholesale and Manufacturing Sale...,"\n Persuading, Organizing",Wholesale and manufacturing sales representati...
324,\n Wind Turbine Technician,"\n Building, Organizing","Wind turbine service technicians, also known a..."
325,\n Woodworker,"\n Building, Organizing","Woodworkers manufacture a variety of products,..."
326,\n Writer or Author,\n Creating,Writers and authors develop content for variou...


In [None]:
realistic = pd.read_csv("/content/Realistic.csv")
social = pd.read_csv('/content/Social.csv')
artistic = pd.read_csv('/content/Artistic.csv')
conventional = pd.read_csv('/content/Conventional.csv')
enterprising = pd.read_csv('/content/Enterprising.csv')
investigative = pd.read_csv('/content/Investigative.csv')

## Data Combining and Preprocessing

In [None]:
# Combine the six per-code files, then drop occupations that appear in more than one file
combined_df = pd.concat([realistic, social, artistic, conventional, enterprising, investigative], ignore_index=True)
combined_df.drop_duplicates(inplace=True)

In [None]:
combined_df

Unnamed: 0,Interest Code,Job Zone,Code,Occupation,Featured
0,RC,1,45-2091.00,Agricultural Equipment Operators,Y
1,RCS,1,35-3023.01,Baristas,Y
2,RC,1,47-2051.00,Cement Masons and Concrete Finishers,Y
3,RC,1,53-7011.00,Conveyor Operators and Tenders,Y
4,RCE,1,35-2011.00,"Cooks, Fast Food",Y
...,...,...,...,...,...
1070,ISA,5,19-3041.00,Sociologists,
1072,ISR,5,29-1229.06,Sports Medicine Physicians,
1074,IC,5,19-3022.00,Survey Researchers,
1075,IEC,5,19-3051.00,Urban and Regional Planners,


In [None]:
# Transforming the Interest Code into a vector/list format 
def split_code(code):
  return [*code]

combined_df['codes'] = combined_df['Interest Code'].apply(split_code)

In [None]:
combined_df

Unnamed: 0,Interest Code,Job Zone,Code,Occupation,Featured,codes
0,RC,1,45-2091.00,Agricultural Equipment Operators,Y,"[R, C]"
1,RCS,1,35-3023.01,Baristas,Y,"[R, C, S]"
2,RC,1,47-2051.00,Cement Masons and Concrete Finishers,Y,"[R, C]"
3,RC,1,53-7011.00,Conveyor Operators and Tenders,Y,"[R, C]"
4,RCE,1,35-2011.00,"Cooks, Fast Food",Y,"[R, C, E]"
...,...,...,...,...,...,...
1070,ISA,5,19-3041.00,Sociologists,,"[I, S, A]"
1072,ISR,5,29-1229.06,Sports Medicine Physicians,,"[I, S, R]"
1074,IC,5,19-3022.00,Survey Researchers,,"[I, C]"
1075,IEC,5,19-3051.00,Urban and Regional Planners,,"[I, E, C]"


In [None]:
# data cleaning 
def clean(text):
  return text.strip().replace('\n', "")

data["Field1"] = data["Field1"].apply(clean)
data["Field2"] = data["Field2"].apply(clean)
code_map = {
    'Building': 'R',
    'Thinking': 'I',
    'Creating': 'A',
    'Persuading': 'E',
    'Organizing': 'C',
    'Helping': 'S'
}

def convert(string):
  codes = string.split(',')
  results = []
  for item in codes:
    results.append(code_map[item.strip()])
  return ",".join(sorted(results))
data['codes'] = data['Field2'].apply(convert)
data['combined'] = data['Field1'] + ": " + data["Field3"]
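
To make the conversion step concrete, here is a self-contained sketch of the same idea applied to one value from the table. `CODE_MAP` and `interests_to_codes` are illustrative names chosen for this example; the mapping itself is copied from the cell above.

```python
# Mapping from Truity interest names to Holland letters (copied from the notebook cell).
CODE_MAP = {
    "Building": "R",
    "Thinking": "I",
    "Creating": "A",
    "Persuading": "E",
    "Organizing": "C",
    "Helping": "S",
}


def interests_to_codes(interests):
    """Turn 'Thinking, Persuading, Organizing' into the sorted string 'C,E,I'."""
    letters = [CODE_MAP[name.strip()] for name in interests.split(",")]
    return ",".join(sorted(letters))


raw = "\n Thinking, Persuading, Organizing"
print(interests_to_codes(raw.strip()))  # C,E,I
```

An interest name that is missing from the mapping raises a `KeyError`, which makes unexpected labels in the scraped data visible instead of silently dropping them.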

In [None]:
data

Unnamed: 0,Field1,Field2,Field3,codes,combined
0,Accountant or Auditor,"Persuading, Organizing",Accountants and auditors prepare and examine f...,"C,E",Accountant or Auditor: Accountants and auditor...
1,Actor,"Creating, Persuading",Actors express ideas and portray characters in...,"A,E",Actor: Actors express ideas and portray charac...
2,Actuary,"Thinking, Persuading, Organizing",Actuaries analyze the financial costs of risk ...,"C,E,I",Actuary: Actuaries analyze the financial costs...
3,Administrative Services Manager,"Persuading, Organizing",Administrative services and facilities manager...,"C,E",Administrative Services Manager: Administrativ...
4,Advertising and Promotions Manager,"Creating, Persuading","Advertising, promotions, and marketing manager...","A,E",Advertising and Promotions Manager: Advertisin...
...,...,...,...,...,...
323,Wholesale and Manufacturing Sales Representatives,"Persuading, Organizing",Wholesale and manufacturing sales representati...,"C,E",Wholesale and Manufacturing Sales Representati...
324,Wind Turbine Technician,"Building, Organizing","Wind turbine service technicians, also known a...","C,R",Wind Turbine Technician: Wind turbine service ...
325,Woodworker,"Building, Organizing","Woodworkers manufacture a variety of products,...","C,R",Woodworker: Woodworkers manufacture a variety ...
326,Writer or Author,Creating,Writers and authors develop content for variou...,A,Writer or Author: Writers and authors develop ...


In [None]:
!pip install sentence_transformers


In [None]:
from torch import Tensor
from transformers import AutoTokenizer, AutoModel
model_path = 'Alibaba-NLP/gte-large-en-v1.5'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)


In [None]:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

encoder = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True)


In [None]:
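from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
# Multi-hot encode the codes. Each entry of 'codes' is a comma-joined string such as "C,E",
# so the binarizer iterates it character by character and also emits a ',' column.
# Passing data['codes'].str.split(',') instead would avoid that extra column.
one_hot_encoded = mlb.fit_transform(data['codes'])
one_hot_df = pd.DataFrame(one_hot_encoded, columns=mlb.classes_)
df = pd.concat([data, one_hot_df], axis=1)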
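
The resulting frame contains a column named `,` next to the six code columns. The sketch below suggests why, using a tiny pure-Python stand-in for a multi-label binarizer (the function `multi_hot` is hypothetical and is not scikit-learn's implementation): a string is iterated character by character, so the comma separator is treated as a label unless the string is split first.

```python
def multi_hot(label_sets):
    """Minimal stand-in for a multi-label binarizer (illustration only)."""
    # Collect every distinct label seen across all samples.
    classes = sorted({label for labels in label_sets for label in labels})
    # One row per sample, one 0/1 entry per class.
    rows = [[1 if c in labels else 0 for c in classes] for labels in label_sets]
    return classes, rows


codes = ["C,E", "A"]

# Strings are iterated character by character, so ',' becomes a class.
classes_raw, rows_raw = multi_hot(codes)
print(classes_raw)  # [',', 'A', 'C', 'E']

# Splitting first yields only the intended labels.
classes_split, rows_split = multi_hot([c.split(",") for c in codes])
print(classes_split)  # ['A', 'C', 'E']
print(rows_split)     # [[0, 1, 1], [1, 0, 0]]
```

Since the later cells read only the columns `A`, `C`, `E`, `I`, `R`, and `S`, the extra column does not affect the `target` vectors.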

In [None]:

def generate_target_list(row):
  result = []
  for code in ['A', 'C', 'E', 'I', 'R', 'S']:
    result.append(row[code])
  return result
df['target'] = df.apply(generate_target_list, axis=1)

In [None]:
df

Unnamed: 0,Field1,Field2,Field3,codes,combined,",",A,C,E,I,R,S,target
0,Accountant or Auditor,"Persuading, Organizing",Accountants and auditors prepare and examine f...,"C,E",Accountant or Auditor: Accountants and auditor...,1,0,1,1,0,0,0,"[0, 1, 1, 0, 0, 0]"
1,Actor,"Creating, Persuading",Actors express ideas and portray characters in...,"A,E",Actor: Actors express ideas and portray charac...,1,1,0,1,0,0,0,"[1, 0, 1, 0, 0, 0]"
2,Actuary,"Thinking, Persuading, Organizing",Actuaries analyze the financial costs of risk ...,"C,E,I",Actuary: Actuaries analyze the financial costs...,1,0,1,1,1,0,0,"[0, 1, 1, 1, 0, 0]"
3,Administrative Services Manager,"Persuading, Organizing",Administrative services and facilities manager...,"C,E",Administrative Services Manager: Administrativ...,1,0,1,1,0,0,0,"[0, 1, 1, 0, 0, 0]"
4,Advertising and Promotions Manager,"Creating, Persuading","Advertising, promotions, and marketing manager...","A,E",Advertising and Promotions Manager: Advertisin...,1,1,0,1,0,0,0,"[1, 0, 1, 0, 0, 0]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
323,Wholesale and Manufacturing Sales Representatives,"Persuading, Organizing",Wholesale and manufacturing sales representati...,"C,E",Wholesale and Manufacturing Sales Representati...,1,0,1,1,0,0,0,"[0, 1, 1, 0, 0, 0]"
324,Wind Turbine Technician,"Building, Organizing","Wind turbine service technicians, also known a...","C,R",Wind Turbine Technician: Wind turbine service ...,1,0,1,0,0,1,0,"[0, 1, 0, 0, 1, 0]"
325,Woodworker,"Building, Organizing","Woodworkers manufacture a variety of products,...","C,R",Woodworker: Woodworkers manufacture a variety ...,1,0,1,0,0,1,0,"[0, 1, 0, 0, 1, 0]"
326,Writer or Author,Creating,Writers and authors develop content for variou...,A,Writer or Author: Writers and authors develop ...,0,1,0,0,0,0,0,"[1, 0, 0, 0, 0, 0]"


In [None]:
def generate_target_combined(series):
  results = [0]* 6
  mappings = ['A', 'C', 'E', 'I', 'R', 'S']
  for i in range(len(mappings)):
    if mappings[i] in series:
      results[i] = 1
  return results

combined_df['target'] = combined_df['codes'].apply(generate_target_combined)

In [None]:
combined_df

Unnamed: 0,Interest Code,Job Zone,Code,Occupation,Featured,codes,target
0,RC,1,45-2091.00,Agricultural Equipment Operators,Y,"[R, C]","[0, 1, 0, 0, 1, 0]"
1,RCS,1,35-3023.01,Baristas,Y,"[R, C, S]","[0, 1, 0, 0, 1, 1]"
2,RC,1,47-2051.00,Cement Masons and Concrete Finishers,Y,"[R, C]","[0, 1, 0, 0, 1, 0]"
3,RC,1,53-7011.00,Conveyor Operators and Tenders,Y,"[R, C]","[0, 1, 0, 0, 1, 0]"
4,RCE,1,35-2011.00,"Cooks, Fast Food",Y,"[R, C, E]","[0, 1, 1, 0, 1, 0]"
...,...,...,...,...,...,...,...
1070,ISA,5,19-3041.00,Sociologists,,"[I, S, A]","[1, 0, 0, 1, 0, 1]"
1072,ISR,5,29-1229.06,Sports Medicine Physicians,,"[I, S, R]","[0, 0, 0, 1, 1, 1]"
1074,IC,5,19-3022.00,Survey Researchers,,"[I, C]","[0, 1, 0, 1, 0, 0]"
1075,IEC,5,19-3051.00,Urban and Regional Planners,,"[I, E, C]","[0, 1, 1, 1, 0, 0]"


In [None]:
cleaned_combined = combined_df[['Occupation', 'target']]
cleaned_data = df[['Field1', 'target']]
cleaned_data = cleaned_data.rename(columns={"Field1": "Occupation"})

all_df = pd.concat([cleaned_data, cleaned_combined], ignore_index=True)
# all_df.drop_duplicates(inplace=True)

In [None]:
all_df

Unnamed: 0,Occupation,target
0,Accountant or Auditor,"[0, 1, 1, 0, 0, 0]"
1,Actor,"[1, 0, 1, 0, 0, 0]"
2,Actuary,"[0, 1, 1, 1, 0, 0]"
3,Administrative Services Manager,"[0, 1, 1, 0, 0, 0]"
4,Advertising and Promotions Manager,"[1, 0, 1, 0, 0, 0]"
...,...,...
1332,Sociologists,"[1, 0, 0, 1, 0, 1]"
1333,Sports Medicine Physicians,"[0, 0, 0, 1, 1, 1]"
1334,Survey Researchers,"[0, 1, 0, 1, 0, 0]"
1335,Urban and Regional Planners,"[0, 1, 1, 1, 0, 0]"


In [None]:
text = list(all_df['Occupation'])
embeddings = encoder.encode(text)
list_of_embeddings = embeddings.tolist()
all_df['embeddings'] = list_of_embeddings

In [None]:
all_df

Unnamed: 0,Occupation,target,embeddings
0,Accountant or Auditor,"[0, 1, 1, 0, 0, 0]","[0.5902896523475647, -0.5632482171058655, -0.5..."
1,Actor,"[1, 0, 1, 0, 0, 0]","[-0.43129685521125793, -0.25423458218574524, -..."
2,Actuary,"[0, 1, 1, 1, 0, 0]","[-0.9055672883987427, 0.09641164541244507, -1...."
3,Administrative Services Manager,"[0, 1, 1, 0, 0, 0]","[-0.40037065744400024, -0.9047368764877319, -0..."
4,Advertising and Promotions Manager,"[1, 0, 1, 0, 0, 0]","[0.6407135725021362, 0.0511791929602623, -0.61..."
...,...,...,...
1332,Sociologists,"[1, 0, 0, 1, 0, 1]","[-0.6004414558410645, -0.7658765912055969, -1...."
1333,Sports Medicine Physicians,"[0, 0, 0, 1, 1, 1]","[0.3171848952770233, 0.5859827995300293, -1.14..."
1334,Survey Researchers,"[0, 1, 0, 1, 0, 0]","[0.07526396214962006, -0.4154224991798401, -1...."
1335,Urban and Regional Planners,"[0, 1, 1, 1, 0, 0]","[0.11457135528326035, -2.0100979804992676, -0...."


## Model Building and Model Training

In [None]:
import numpy as np
import tensorflow as tf
from keras import backend as K
def convert_list_to_tensor(list_of_int):
    return tf.convert_to_tensor(list_of_int)

X = np.stack(all_df['embeddings'], axis=0)

y = np.stack(all_df['target'], axis=0)


In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
import tensorflow as tf

# Feed-forward classifier over the 1024-dimensional sentence embeddings
model = Sequential()
model.add(Dense(128, input_shape=(1024,), activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
# One sigmoid unit per Holland code, so several codes can be active at once
model.add(Dense(6, activation='sigmoid'))
model.compile(optimizer=Adam(0.00015), loss='binary_crossentropy', metrics=[tf.keras.metrics.AUC()])
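
The output layer uses one sigmoid per code together with binary cross-entropy, which treats each of the six codes as its own yes/no decision. The following pure-Python sketch shows roughly what that loss computes for a single sample. The logits are made-up numbers, and the helper names and the clipping constant `eps` are assumptions for illustration, not values taken from Keras.

```python
import math


def sigmoid(z):
    """Squash a real-valued logit into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean per-label binary cross-entropy for one sample."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


# Hypothetical logits for the label order A, C, E, I, R, S.
logits = [2.0, -3.0, 0.0, -1.0, 1.5, -4.0]
probs = [sigmoid(z) for z in logits]
target = [1, 0, 0, 0, 1, 0]

# Unlike softmax outputs, these probabilities need not sum to 1.
print(sum(probs))
print(binary_cross_entropy(target, probs))
```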

In [None]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)

accuracy_scores = []

# Note: the same model instance is trained in every fold, so its weights carry over
# and later folds validate on rows the model has already seen in earlier folds.
for train_index, val_index in kf.split(X):
    # Fold-specific names, so the hold-out split (X_train, X_test) is not overwritten
    X_fold_train, X_val = X[train_index], X[val_index]
    y_fold_train, y_val = y[train_index], y[val_index]
    model.fit(
        X_fold_train,
        y_fold_train,
        batch_size=32,
        epochs=150,
        validation_data=(X_val, y_val)
    )
    y_pred = (model.predict(X_val) > 0.5).astype(int)
    # For multi-label targets, accuracy_score is exact-match (subset) accuracy
    accuracy = accuracy_score(y_val, y_pred)
    accuracy_scores.append(accuracy)
    print(accuracy_scores)
average_accuracy = sum(accuracy_scores) / n_splits
print(f"Average accuracy across {n_splits} folds: {average_accuracy}")
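
When `accuracy_score` receives multi-label indicator arrays, it counts a sample as correct only if all six labels match (subset accuracy), which is a strict measure. The sketch below contrasts it with a per-label accuracy on two made-up predictions; both helper functions are hypothetical pure-Python illustrations rather than scikit-learn code.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return matches / len(y_true)


def per_label_accuracy(y_true, y_pred):
    """Fraction of individual label decisions that are correct."""
    correct = sum(ti == pi for t, p in zip(y_true, y_pred) for ti, pi in zip(t, p))
    total = sum(len(t) for t in y_true)
    return correct / total


# Made-up targets and predictions in the label order A, C, E, I, R, S.
y_true = [[0, 1, 1, 0, 0, 0], [1, 0, 1, 0, 0, 0]]
y_pred = [[0, 1, 1, 0, 0, 0], [1, 0, 0, 0, 0, 0]]

print(subset_accuracy(y_true, y_pred))     # 0.5
print(per_label_accuracy(y_true, y_pred))  # 11 of 12 label decisions are correct
```

A single wrong label in the second sample halves the subset accuracy even though 11 of the 12 individual decisions are right, so reporting both can give a fuller picture.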


In [None]:
model.fit(
    X_train,
    y_train,
    batch_size=32,
    epochs=50,
    validation_data=(X_test, y_test),
)


In [None]:
model.save('/content/CareerClassificationCV')

## Test the saved model

In [None]:
from tensorflow import keras

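# Load the saved model
# Note: this path differs from the one passed to model.save above ('/content/CareerClassificationCV')
model = keras.models.load_model('/content/content/CareerClassification')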

In [None]:
embedding_test = encoder.encode(['Drawing Competition: I painted a paiting under the theme of `Treasures`'])

In [None]:
results = model.predict(embedding_test)



In [None]:
results

array([[9.9777329e-01, 3.3039559e-02, 1.0788335e-01, 8.8030042e-04,
        9.9399072e-01, 3.2596433e-04]], dtype=float32)

In [None]:
mappings  = ['A', 'C', 'E', 'I', 'R', 'S']
for i in range(len(results[0])):
  if results[0][i] >= 0.5:
    print(mappings[i])

A
R


In [None]:
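def predict_program_codes(program):
  # Embed the program name and run the embedding through the classifier
  probabilities = model.predict(encoder.encode([program]))[0]
  mappings = ['A', 'C', 'E', 'I', 'R', 'S']
  # Keep every code whose predicted probability reaches 0.5
  codes = [mappings[i] for i in range(len(probabilities)) if probabilities[i] >= 0.5]
  return str(codes)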
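
The thresholding step can be examined on its own, without loading the model. The sketch below applies it to the probabilities printed earlier for the drawing-competition sentence. `LABELS` and `decode_codes` are illustrative names, and the second threshold of 0.1 is an arbitrary value chosen only to show the effect of lowering the cut-off.

```python
# Label order used throughout the notebook.
LABELS = ['A', 'C', 'E', 'I', 'R', 'S']


def decode_codes(probabilities, threshold=0.5):
    """Return the codes whose probability reaches the threshold."""
    return [label for label, p in zip(LABELS, probabilities) if p >= threshold]


# Probabilities copied from the model output shown above.
probs = [9.9777329e-01, 3.3039559e-02, 1.0788335e-01,
         8.8030042e-04, 9.9399072e-01, 3.2596433e-04]

print(decode_codes(probs))                 # ['A', 'R']
print(decode_codes(probs, threshold=0.1))  # ['A', 'E', 'R']
```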

In [None]:
import pandas as pd

data = pd.read_json("/content/Our Programs _ HKUST Undergraduate Admissions.json")

In [None]:
data

Unnamed: 0,Field1_text_Text,Field1_links_Link,Text
0,Science (Group A) with an Extended Major in Ar...,https://join.hkust.edu.hk/our-programs/school-...,School of Science
1,International Research Enrichment (IRE),https://join.hkust.edu.hk/our-programs/school-...,School of Science
2,BSc in Mathematics,https://join.hkust.edu.hk/our-programs/school-...,School of Science
3,BSc in Ocean Science and Technology,https://join.hkust.edu.hk/our-programs/school-...,School of Science
4,BSc in Physics,https://join.hkust.edu.hk/our-programs/school-...,School of Science
5,BSc in Data Analytics in Science,https://join.hkust.edu.hk/our-programs/school-...,School of Science
6,BSc in Data Science and Technology (Joint Scho...,https://join.hkust.edu.hk/our-programs/joint-s...,School of Science
7,BSc in Mathematics and Economics (Joint School...,https://join.hkust.edu.hk/our-programs/joint-s...,School of Science
8,BSc in Risk Management and Business Intelligen...,https://join.hkust.edu.hk/our-programs/joint-s...,School of Science
9,Engineering with an Extended Major in Artifici...,https://join.hkust.edu.hk/our-programs/school-...,School of Engineering


In [None]:
data['predicted'] = data['Field1_text_Text'].apply(predict_program_codes)



In [None]:
data

Unnamed: 0,Field1_text_Text,Field1_links_Link,Text,predicted
0,Science (Group A) with an Extended Major in Ar...,https://join.hkust.edu.hk/our-programs/school-...,School of Science,"['I', 'R']"
1,International Research Enrichment (IRE),https://join.hkust.edu.hk/our-programs/school-...,School of Science,"['E', 'I']"
2,BSc in Mathematics,https://join.hkust.edu.hk/our-programs/school-...,School of Science,"['C', 'I', 'R']"
3,BSc in Ocean Science and Technology,https://join.hkust.edu.hk/our-programs/school-...,School of Science,"['I', 'R']"
4,BSc in Physics,https://join.hkust.edu.hk/our-programs/school-...,School of Science,"['I', 'R']"
5,BSc in Data Analytics in Science,https://join.hkust.edu.hk/our-programs/school-...,School of Science,"['C', 'I', 'R']"
6,BSc in Data Science and Technology (Joint Scho...,https://join.hkust.edu.hk/our-programs/joint-s...,School of Science,"['C', 'I', 'R']"
7,BSc in Mathematics and Economics (Joint School...,https://join.hkust.edu.hk/our-programs/joint-s...,School of Science,"['C', 'I', 'R']"
8,BSc in Risk Management and Business Intelligen...,https://join.hkust.edu.hk/our-programs/joint-s...,School of Science,"['C', 'I']"
9,Engineering with an Extended Major in Artifici...,https://join.hkust.edu.hk/our-programs/school-...,School of Engineering,"['I', 'R']"
