## I. Cleaning
Job titles are cleaned in the following steps:
* convert to lower case, remove digits and special characters
* remove stopwords
* remove *jobwords* (e.g. contract type, schedule, ...)
* remove locations such as country, department, region

In [17]:
from job_title_processing import JobOffersTitleCleaner

cleaner = JobOffersTitleCleaner(language='FR', jobword=True, location=True)
cleaner.clean_str("Ingénieur à mi-temps en CDD")

'ingenieur'
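
The steps listed above can be sketched in plain Python. This is only an illustration, not the package's implementation: `clean_title`, `STOPWORDS`, `JOBWORDS` and `LOCATIONS` are hypothetical names with tiny hand-written word lists, and the accent stripping is an assumption inferred from the `'ingenieur'` output shown above.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny word lists (the package ships its own resources)
STOPWORDS = {"a", "en", "de", "du", "le", "la", "les", "des"}
JOBWORDS = {"cdd", "cdi", "mi", "temps", "stage"}
LOCATIONS = {"france", "bretagne"}


def clean_title(title):
    """Lower-case a job title and drop digits, special characters and unwanted words."""
    text = title.lower()
    # Decompose accented characters, then drop the combining marks
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace anything that is not a letter or whitespace with a space
    text = re.sub(r"[^a-z\s]", " ", text)
    unwanted = STOPWORDS | JOBWORDS | LOCATIONS
    return " ".join(tok for tok in text.split() if tok not in unwanted)


print(clean_title("Ingénieur à mi-temps en CDD"))  # -> ingenieur
```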

## II. Lemmatization

Lemmatization for French relies on the Morphalou resources. You must download them [here](https://www.ortolang.fr/market/lexicons/morphalou/4) if you wish to use them. Additional semantic resources, such as the matching of male/female occupation nouns and the expansion of labour-related acronyms, are included in the package (for French).

Lemmatization keeps only one occurrence of each word in the title. **As a result, word order is not preserved, since it does not matter here.**

In [20]:
from job_title_processing import JobOffersTitleLemmatizer

lemmatizer = JobOffersTitleLemmatizer(language='FR', cleaner=cleaner)
lemmatizer.lemmatize_str("serveuse")
lemmatizer.lemmatize_str("Infirmière à mi-temps en CDD à Grenoble")

'infirmier grenoble'
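
The deduplication behaviour can be illustrated with a minimal sketch. It assumes the title has already been cleaned, and it uses a hypothetical hand-written `LEMMAS` dictionary in place of the Morphalou lexicon. The output is sorted alphabetically only to make it deterministic, so its word order may differ from what the package returns.

```python
# Hypothetical lemma table; the package relies on Morphalou instead
LEMMAS = {"serveuse": "serveur", "infirmiere": "infirmier", "etudes": "etude"}


def lemmatize_tokens(cleaned_title):
    """Map each token to its lemma and keep a single occurrence of each lemma."""
    lemmas = {LEMMAS.get(tok, tok) for tok in cleaned_title.split()}
    # A set has no meaningful order; sorting is an arbitrary choice of this sketch
    return " ".join(sorted(lemmas))


print(lemmatize_tokens("infirmiere grenoble infirmiere"))  # -> grenoble infirmier
```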

## III. Classification

Since neither the training data nor the final French model is public so far, the official occupation nomenclature can be used to train a model. The ROME nomenclature must be downloaded [here](https://www.pole-emploi.org/opendata/repertoire-operationnel-des-meti.html?type=article). If you have more training data, you can add it by uncommenting the commented lines in the cells below.

If you already have a trained model, skip to the last part.

### a. Load and clean data

In [25]:
# Load data
from job_title_processing.tools.occupation_nomenclature import get_labels_ROME_FR

rome = get_labels_ROME_FR()
rome.rename(columns={'titre':'title', 'ROME_code':'ROME'}, inplace=True)

data = rome.copy()
data.head(5)

# # Add external data
# import pandas as pd
# ENCODING, SEP = 'utf-8-sig', ";"

# import os
# from job_title_processing.tools.utils import load_root_path
# ROOT = load_root_path()
# data_folder = os.path.join(ROOT, "data", "FR_test") # please create the folder if it doesn't exist

# file = os.path.join(data_folder, "raw_data-poleemploi_2019-01-01_2019-12-31.csv")
# columns = {'intitule':'title', 'romeCode':'ROME'}
# pole_emploi = pd.read_csv(file, encoding=ENCODING, sep=SEP, usecols=columns.keys())
# pole_emploi.rename(columns=columns, inplace=True)
# pole_emploi['domain'] = pole_emploi.ROME.str[0]

# # Merge data
# data = pd.concat([pole_emploi, rome], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

# data.head(5)

| | ROME | title |
|---|---|---|
| 0 | A1101 | Chauffeur / Chauffeuse de machines agricoles |
| 1 | A1101 | Conducteur / Conductrice d'abatteuses |
| 2 | A1101 | Conducteur / Conductrice d'automoteur de récolte |
| 3 | A1101 | Conducteur / Conductrice d'engins d'exploitati... |
| 4 | A1101 | Conducteur / Conductrice d'engins d'exploitati... |


In [20]:
# Clean data
from job_title_processing.tools.svm_classification import split

X = lemmatizer.lemmatize_series(data.title)
Y = data.ROME
# If you have enough data, split it into train and test sets
# X_train, X_test, Y_train, Y_test = split(X, Y, folder=data_folder) 

# Load saved data
# import pickle
# filename = os.path.join(data_folder, "train_test.pickle")
# with open(filename, 'rb') as f:  # the with block closes the file automatically
#     X_train, X_test, Y_train, Y_test = pickle.load(f)
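
For readers without access to the package's `split` helper, the idea of splitting the data and saving the four parts to disk can be sketched with the standard library alone. This is an assumption-laden illustration, not the package's code: `split_and_save`, `test_ratio` and `seed` are hypothetical names, and writing to `train_test.pickle` is inferred from the loading snippet above. The example titles and codes are taken from outputs shown in this notebook.

```python
import os
import pickle
import random
import tempfile


def split_and_save(X, Y, folder, test_ratio=0.2, seed=0):
    """Shuffle indices, split into train/test parts and pickle them in `folder`."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(indices) * test_ratio))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Same order as in the loading snippet: X_train, X_test, Y_train, Y_test
    parts = (
        [X[i] for i in train_idx],
        [X[i] for i in test_idx],
        [Y[i] for i in train_idx],
        [Y[i] for i in test_idx],
    )
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, "train_test.pickle"), "wb") as f:
        pickle.dump(parts, f)
    return parts


titles = [
    "Chauffeur / Chauffeuse de machines agricoles",
    "Conducteur / Conductrice d'abatteuses",
    "Conducteur / Conductrice d'automoteur de récolte",
    "serveuse restauration à lyon",
    "Maçon/Maçonne",
]
codes = ["A1101", "A1101", "A1101", "G1803", "F1703"]

with tempfile.TemporaryDirectory() as folder:
    X_train, X_test, Y_train, Y_test = split_and_save(titles, codes, folder)
    # Reload the saved parts to check the round trip
    with open(os.path.join(folder, "train_test.pickle"), "rb") as f:
        reloaded = pickle.load(f)

print(len(X_train), len(X_test))  # -> 4 1
```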

### b. Train and evaluate SVM model 

In [15]:
from job_title_processing.tools.svm_classification import train_svm, predict_svm, global_metrics_svm, tokenize

# Train model and save it
data_folder = "My_folder"
svm = train_svm(X, Y, folder=data_folder, filename="Simple_SVM.pickle")

# # Load saved model
# import os
# import pickle
# file = os.path.join(data_folder, "svm_C-1_mindf-1.pickle")
# with open(file, 'rb') as f:
#     svm = pickle.load(f)

# Get global metrics at each level
# No held-out test set here: the model is evaluated on its own training data,
# so the metrics below are optimistic
Y_test, X_test = Y, X
Y_pred = predict_svm(svm, X_test)

print("****** Level 1: ROME occupation code ******")
global_metrics_svm(Y_test, Y_pred, level=1)
print("\n ****** Level 2: occupation group ******\n")
global_metrics_svm(Y_test, Y_pred, level=2)
print("\n ****** Level 3: occupation domain ******\n")
global_metrics_svm(Y_test, Y_pred, level=3)
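
To make the three levels concrete, here is a minimal sketch of an accuracy computed on truncated codes. It assumes that a group is the first three characters of a ROME code and a domain is its first letter; how `global_metrics_svm` actually aggregates levels is not shown in this notebook, so `truncate_code` and `accuracy_at_level` are hypothetical helpers and the predictions are made up for illustration.

```python
# Assumed prefix length per level: full code, occupation group, occupation domain
PREFIX_LENGTH = {1: 5, 2: 3, 3: 1}


def truncate_code(code, level):
    """Keep only the part of the ROME code relevant to the given level."""
    return code[:PREFIX_LENGTH[level]]


def accuracy_at_level(y_true, y_pred, level):
    """Share of predictions whose truncated code matches the truncated label."""
    pairs = list(zip(y_true, y_pred))
    hits = sum(truncate_code(t, level) == truncate_code(p, level) for t, p in pairs)
    return hits / len(pairs)


y_true = ["A1101", "G1803", "F1703"]
y_pred = ["A1101", "G1801", "F1106"]  # illustrative predictions only

for level in (1, 2, 3):
    print(level, round(accuracy_at_level(y_true, y_pred, level), 2))
# -> 1 0.33
# -> 2 0.67
# -> 3 1.0
```

A prediction that misses the exact code can still be right at a coarser level, which is why scores usually increase from level 1 to level 3.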

### c. Using pre-trained model

In [22]:
# Get the file containing the model
import os
from job_title_processing.tools.utils import load_root_path

ROOT = load_root_path()
data_folder = os.path.join(ROOT, "data", "FR")
svm_file = os.path.join(data_folder, "svm_C-1_mindf-1.pickle") # adjust to wherever the SVM model is stored

In [None]:
# Instantiate Matcher
from job_title_processing import JobOffersTitleOccupationMatcher
occupation_matcher = JobOffersTitleOccupationMatcher(lemmatizer=lemmatizer, svm_model_file=svm_file)

In [23]:
# Match a string
occupation_matcher.match_job_title('serveuse restauration à lyon')

| | Job offer title | Clean and lemmatized text | ROME occupation code | ROME label |
|---|---|---|---|---|
| 0 | serveuse restauration à lyon | restauration lyon serveur | G1803 | Service en restauration |


In [24]:
# Match a pandas series
import pandas as pd
titles = pd.Series(["Maçon/Maçonne", "Ingénieur d'études en BTP"])
occupation_matcher.match_job_title(titles)

| | Job offer title | Clean and lemmatized text | ROME occupation code | ROME label |
|---|---|---|---|---|
| 0 | Maçon/Maçonne | macon | F1703 | Maçonnerie |
| 1 | Ingénieur d'études en BTP | public batiment etude travail ingenieur | F1106 | Ingénierie et études du BTP |
