## Occupation Classification
  - Traditional: Logistic Regression (TF-IDF)
  - Neural: Fine-tuned BERT/RoBERTa
  - Metrics: Accuracy, Precision, Recall, F1
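
Because each talk can carry several occupations (the labels are split into lists during preprocessing), these metrics need a multi-label definition. The following is a minimal, dependency-free sketch that assumes micro-averaging over all (talk, label) pairs and exact-match ("subset") accuracy. The notebook does not state which averaging it uses, so that choice, the helper names `micro_prf` and `subset_accuracy`, and the toy labels are all illustrative assumptions.

```python
def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall and F1 for multi-label predictions.

    y_true and y_pred are equal-length sequences of label collections,
    one collection per example.
    """
    tp = fp = fn = 0
    for true_labels, pred_labels in zip(y_true, y_pred):
        true_labels, pred_labels = set(true_labels), set(pred_labels)
        tp += len(true_labels & pred_labels)   # predicted and correct
        fp += len(pred_labels - true_labels)   # predicted but wrong
        fn += len(true_labels - pred_labels)   # missed labels
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def subset_accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label set matches exactly."""
    if not y_true:
        return 0.0
    matches = sum(set(t) == set(p) for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Toy labels, invented for illustration only (not taken from the dataset)
y_true = [{"author", "educator"}, {"activist"}]
y_pred = [{"author"}, {"activist", "educator"}]
precision, recall, f1 = micro_prf(y_true, y_pred)
accuracy = subset_accuracy(y_true, y_pred)
```

Subset accuracy is strict: a prediction that gets one of two occupations right still counts as a miss, which is why F1 is usually the more informative number for this kind of task.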

### Data Loading

In [14]:
import pandas as pd
data = pd.read_csv('../../data/dataset3/ted_talks_en.csv')
data = data[['occupations', 'transcript']].dropna().reset_index(drop=True)
data.head()

|   | occupations | transcript |
|---|---|---|
| 0 | {0: ['climate advocate']} | Thank you so much, Chris. And it's truly a gre... |
| 1 | {0: ['global health expert; data visionary']} | About 10 years ago, I took on the task to teac... |
| 2 | {0: ['technology columnist']} | (Music: "The Sound of Silence," Simon & Garfun... |
| 3 | {0: ['activist for environmental justice']} | If you're here today — and I'm very happy that... |
| 4 | {0: ['author', 'educator']} | Good morning. How are you? (Audience) Good. It... |
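
The `occupations` values shown above are strings that look like Python dict literals. One alternative to regex-based cleanup is to parse them with `ast.literal_eval`, which keeps apostrophes inside a label intact. The sketch below is an assumption-laden illustration, not the notebook's method: the helper name `parse_occupations` is hypothetical, and treating `;` as a separator between occupations is a guess based on row 1 of the preview.

```python
import ast


def parse_occupations(raw):
    """Parse a string such as "{0: ['author', 'educator']}" into a list of labels."""
    parsed = ast.literal_eval(raw)  # evaluates Python literals only, no arbitrary code
    labels = []
    for group in parsed.values():
        for entry in group:
            # Assumption: ';' separates several occupations inside one entry
            labels.extend(part.strip() for part in entry.split(';') if part.strip())
    return labels


# Strings in the format shown in the preview
single = parse_occupations("{0: ['climate advocate']}")
multiple = parse_occupations("{0: ['author', 'educator']}")
semicolon = parse_occupations("{0: ['global health expert; data visionary']}")
```

Applied to the DataFrame, this would be `data['occupations'].apply(parse_occupations)`. Whether semicolon-joined entries should be split at all depends on how the labels are meant to be grouped.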


### Data Preprocessing

In [62]:
from sklearn.model_selection import train_test_split

clean_data = data.copy()
# strip the {0: [...]} wrapper and the quote characters from the occupations column
clean_data['occupations'] = clean_data['occupations'].str.replace(r'\{0: \[|\]\}', '', regex=True).str.replace("'", '', regex=False)
# clean transcript text
clean_data['transcript'] = clean_data['transcript'].str.replace(r'\[.*?\]', '', regex=True)  # remove text in square brackets; parenthesized cues such as (Audience) are kept
clean_data['transcript'] = clean_data['transcript'].str.replace(r'\s+', ' ', regex=True).str.strip()  # collapse whitespace runs into single spaces
# split each occupations string into a list of labels (entries joined by ';' stay as one label)
clean_data['occupations'] = clean_data['occupations'].str.split(', ')
# split into train and test sets
train_data, test_data = train_test_split(clean_data, test_size=0.2, random_state=63)
train_data = train_data.reset_index(drop=True)
test_data = test_data.reset_index(drop=True)
print(f"Train size: {len(train_data)}, Test size: {len(test_data)}")

Train size: 2786, Test size: 697
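
After the split, each row's `occupations` value is a list of strings, while scikit-learn's multi-output estimators expect a 2-D binary indicator matrix as the target. A common way to bridge the two is `MultiLabelBinarizer`. The sketch below uses toy label lists standing in for `train_data['occupations']` and `test_data['occupations']`. It is an assumption about how the targets could be encoded, not necessarily what the notebook does next.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy label lists, invented for illustration only
train_labels = [['author', 'educator'], ['activist'], ['author']]
test_labels = [['educator'], ['activist', 'author']]

mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(train_labels)  # learn the label vocabulary on the training set only
Y_test = mlb.transform(test_labels)        # reuse the same column order for the test set

print(mlb.classes_)  # column order of the indicator matrix (sorted labels)
print(Y_train)
print(Y_test)
```

Fitting the binarizer on the training labels only means that occupations appearing solely in the test set have no column. `transform` ignores them and emits a warning, which is worth keeping in mind when reading recall figures.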


### Traditional Model: Logistic Regression with TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
# TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1,2))
# Logistic Regression Model