#### Pipeline

1. Load and prepare the CSV data
2. Split the data into train and test sets
3. Train a bag-of-words + logistic regression classifier with hyperparameter search and k-fold cross-validation
4. Test on the held-out CSV set
5. Test on the annotated.json set
6. Retrain on the not-annotated.json set
7. Test on annotated.json again

Tomorrow:
- Add Professional
- Create labels for not-annotated
- Retrain (6.)
- Test (7.)
- Clean up

-> Then do the rest, i.e. tf_idf, department, write the PDF, and look into task 5

In [1]:
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

from preprocessing.preprocessing_csv import Preprocessing_CSV_Seniority
from preprocessing.preprocessing_json import Preprocessing_JSON_Seniority

In [2]:
# 1. Load and prepare data
data = Preprocessing_CSV_Seniority(
    "/Users/jonas/Documents/Master_Vorlesungen/Semester_02/Practical Data Science/Final/PDS_Final/data/seniority-v2.csv"
)

X = data.X
y = data.y

In [3]:
# 2. Train/test split (80/20); no random_state is set, so the split differs between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
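
The split above neither fixes a seed nor stratifies by label, so class proportions and scores can vary between runs. A minimal sketch of a seeded, stratified split follows. The toy titles and labels are made up for illustration, and `random_state=123` is an arbitrary choice.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 20 titles with an imbalanced 16:4 label ratio
X_toy = [f"title {i}" for i in range(20)]
y_toy = ["Senior"] * 16 + ["Junior"] * 4

# stratify keeps the label ratio (roughly) equal in both splits;
# random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.2,
    random_state=123,
    stratify=y_toy,
)

print("train:", Counter(y_tr))
print("test: ", Counter(y_te))
```

With stratification, the rare class is guaranteed to show up in the test split, which matters for small classes such as `Junior`.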

In [4]:
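# 3. Train
# Bag-of-words features (unigrams and bigrams); fit the vocabulary on the training split only
bow = CountVectorizer(ngram_range=(1, 2))
X_train_vec = bow.fit_transform(X_train)
X_test_vec = bow.transform(X_test)

# Oversample minority classes in the training set to balance the labels
ros = RandomOverSampler(random_state=123)
X_train_balanced, y_train_balanced = ros.fit_resample(X_train_vec, y_train)

print(f"Original Shape: {X_train_vec.shape}\nBalanced Shape: {X_train_balanced.shape}")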

Original Shape: (7542, 12279)
Balanced Shape: (14755, 12279)


In [5]:
# Hyperparameter grid: 5 values of C x 2 class_weight settings = 10 candidates
parameters = {
    "C": [0.01, 0.1, 1, 10, 100],
    "penalty": ["l2"],
    "solver": ["liblinear"],
    "class_weight": [None, "balanced"],
}

logistic_reg = LogisticRegression(max_iter=1000)

grid = GridSearchCV(
    estimator=logistic_reg,
    param_grid=parameters,
    cv=5,
    scoring="f1_weighted",
    n_jobs=-1,
    verbose=1
)

grid.fit(X_train_balanced, y_train_balanced)
best_model = grid.best_estimator_

print(f"Best params: {grid.best_params_}")
print(f"Best CV score: {grid.best_score_}")

Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best params: {'C': 100, 'class_weight': None, 'penalty': 'l2', 'solver': 'liblinear'}
Best CV score: 0.9957907361908355


In [6]:
# 4. Test on CSV
test_csv_prediction = best_model.predict(X_test_vec)
print("Accuracy:", accuracy_score(y_test, test_csv_prediction))
print(
    classification_report(
        y_test,
        test_csv_prediction,
        target_names=data.label_encoder.classes_)
)

Accuracy: 0.9872746553552492
              precision    recall  f1-score   support

    Director       1.00      0.99      1.00       183
      Junior       0.93      1.00      0.96        67
        Lead       0.99      0.98      0.99       688
  Management       0.97      0.96      0.97       166
      Senior       0.99      0.99      0.99       782

    accuracy                           0.99      1886
   macro avg       0.98      0.99      0.98      1886
weighted avg       0.99      0.99      0.99      1886



In [8]:
# 5. Test on annotated.json
annotated_json = Preprocessing_JSON_Seniority(
    "/Users/jonas/Documents/Master_Vorlesungen/Semester_02/Practical Data Science/Final/PDS_Final/data/linkedin-cvs-annotated.json",
    label_encoder=data.label_encoder  # reuse the CSV label encoder; the "Professional" class still has to be added
)

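# Vectorize with the vocabulary fitted on the CSV training data; a separate name avoids overwriting the CSV X_test
X_annotated_vec = bow.transform(annotated_json.X)
y_pred = best_model.predict(X_annotated_vec)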

print("Accuracy:", accuracy_score(annotated_json.y, y_pred))
print(classification_report(annotated_json.y, y_pred, target_names=data.label_encoder.classes_))

[JSON] Loaded 301 samples
Accuracy: 0.6810631229235881
              precision    recall  f1-score   support

    Director       0.56      0.92      0.70        24
      Junior       0.75      0.33      0.46         9
        Lead       0.65      0.71      0.68        97
  Management       0.97      0.59      0.74       132
      Senior       0.46      0.85      0.59        39

    accuracy                           0.68       301
   macro avg       0.68      0.68      0.63       301
weighted avg       0.76      0.68      0.69       301
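
The drop from 0.99 accuracy on the CSV split to 0.68 here suggests that the annotated LinkedIn titles differ from the CSV titles. A confusion matrix would show which classes get mixed up, for instance where the low-recall `Management` rows end up. A minimal sketch using the class names from the report; the `y_true_toy` and `y_pred_toy` values are hypothetical and not this notebook's predictions.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

classes = ["Director", "Junior", "Lead", "Management", "Senior"]

# Hypothetical labels for illustration only
y_true_toy = ["Management", "Management", "Management", "Lead", "Senior", "Junior"]
y_pred_toy = ["Management", "Director", "Senior", "Lead", "Senior", "Senior"]

# labels= fixes the row/column order, including classes that never occur
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=classes)

# Rows are true classes, columns are predicted classes
cm_df = pd.DataFrame(cm, index=classes, columns=classes)
print(cm_df)
```

With the integer-encoded labels used in this notebook, the same call would presumably take `annotated_json.y` and `y_pred`, with `data.label_encoder.classes_` as row and column names.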



In [None]:
# 6. Retrain on not-annotated.json
...
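
Step 6 is still open. One common way to retrain on unlabeled data is pseudo-labeling: predict labels for the unlabeled texts, keep only the confident predictions, and refit on the enlarged training set. The sketch below assumes that approach. The toy texts, the `0.6` confidence threshold, and all variable names are hypothetical, and this is not necessarily how the not-annotated labels will be created in this project.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy labeled and unlabeled titles (hypothetical)
labeled_texts = ["senior engineer", "senior analyst", "junior engineer", "junior analyst"]
labeled_y = np.array(["Senior", "Senior", "Junior", "Junior"])
unlabeled_texts = ["senior consultant", "junior consultant", "consultant"]

# Initial model trained on the labeled data only
vec = CountVectorizer()
clf = LogisticRegression(solver="liblinear", C=100, max_iter=1000)
clf.fit(vec.fit_transform(labeled_texts), labeled_y)

# Class probabilities for the unlabeled titles
proba = clf.predict_proba(vec.transform(unlabeled_texts))
confidence = proba.max(axis=1)
pseudo_y = clf.classes_[proba.argmax(axis=1)]

# Keep only confident pseudo-labels (the threshold is an assumption)
keep = confidence >= 0.6
new_texts = labeled_texts + [t for t, k in zip(unlabeled_texts, keep) if k]
new_y = np.concatenate([labeled_y, pseudo_y[keep]])

# Refit vectorizer and classifier on labeled + pseudo-labeled data
vec_retrained = CountVectorizer()
clf_retrained = LogisticRegression(solver="liblinear", C=100, max_iter=1000)
clf_retrained.fit(vec_retrained.fit_transform(new_texts), new_y)

for text, label, conf in zip(unlabeled_texts, pseudo_y, confidence):
    print(f"{text!r}: {label} ({conf:.2f}), kept={conf >= 0.6}")
```

Pseudo-labels can reinforce the model's own mistakes, so the final check should still be the annotated.json set (step 7), which has human labels.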