## Generating similarity scores for SOL listings

**Author:** Yee Sen, Edited by Benjamin 
<br>
**Date:** 20th Jun 2023
<br>
**Context:** We want to find an optimal similarity threshold for deciding whether a job is similar to the job descriptions in the SOL list. Can a single generalised threshold be used, or does each SOL listing need its own threshold?
<br>
**Objective:** Find an optimal threshold that can be used to classify job descriptions as similar or dissimilar to the SOL listings.
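
As a rough illustration of what a single generalised threshold would mean in practice, the sketch below labels similarity scores as similar or dissimilar. The cut-off `0.5`, the function name `classify`, and the sample scores are placeholders chosen for illustration; they are not results from this analysis.

```python
import numpy as np

# Hypothetical cut-off; finding a suitable value is the point of this notebook.
THRESHOLD = 0.5

def classify(scores, threshold=THRESHOLD):
    """Label each similarity score as 'similar' or 'dissimilar'."""
    scores = np.asarray(scores, dtype=float)
    return np.where(scores >= threshold, "similar", "dissimilar")

# Made-up scores, for illustration only.
labels = classify([0.72, 0.31, 0.50])
print(labels)
```

With individual thresholds, the scalar would instead be looked up per SOL occupation before the comparison.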


### A) Importing packages and reading in the dataset

In [3]:
import pandas as pd
import numpy as np
from numpy.linalg import norm
import requests
from ast import literal_eval

In [15]:
df = pd.read_csv('../data/sol.csv')
df = df.replace('\n', '', regex=True)
df['id'] = df['id'].astype(str).str.replace(' ', '')

df = df.reset_index(drop=True)

In [1]:
main_url = "http://localhost:8000/sentence_embeddings?id"

def query_api(url):
    response = requests.get(url)
    if response.status_code == 200:
        return response.json()
    else:
        return None
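
Because `query_api` returns `None` for any non-200 status, callers have to handle that case before reading the payload. The sketch below shows one possible way to unpack a response, assuming the JSON carries list-valued `embeddings_title` and `embeddings_text` keys, as the scoring loop later in this notebook expects. The helper `to_vectors` and the payload `fake_res` are hypothetical.

```python
import numpy as np

def to_vectors(res):
    """Turn an API response into (combined, title, text) row vectors, or None."""
    if res is None:
        return None
    title = res["embeddings_title"]
    text = res["embeddings_text"]
    combined = np.array(title + text).reshape(1, -1)
    return combined, np.array(title).reshape(1, -1), np.array(text).reshape(1, -1)

# Fabricated response with tiny embeddings, for illustration only.
fake_res = {"embeddings_title": [0.1, 0.2], "embeddings_text": [0.3, 0.4, 0.5]}
combined, title, text = to_vectors(fake_res)
print(combined.shape, title.shape, text.shape)  # (1, 5) (1, 2) (1, 3)
```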

In [12]:
sol_df = pd.read_csv('../data/SOL_embeddings_sentence_transformers.csv')
sol_df['emb_title'] = sol_df['emb_title'].apply(literal_eval)
sol_df['emb_text'] = sol_df['emb_text'].apply(literal_eval)
sol_df['comb'] = [x + y for x,y in zip(sol_df['emb_title'], sol_df['emb_text'])]

sol_detailed_df = pd.read_excel('../data/SOL Verification checks.xlsx', sheet_name = 1)
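
After `literal_eval`, `emb_title` and `emb_text` hold plain Python lists, so the `x + y` used for the `comb` column concatenates the two embeddings rather than adding them elementwise. A small sketch with made-up two-dimensional vectors shows the difference:

```python
import numpy as np

title_emb = [1.0, 2.0]  # made-up values
text_emb = [3.0, 4.0]

concatenated = title_emb + text_emb                # list `+` joins end to end
summed = np.array(title_emb) + np.array(text_emb)  # array `+` adds elementwise

print(concatenated)  # [1.0, 2.0, 3.0, 4.0]
print(summed)        # [4. 6.]
```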



In [13]:
def similarity(vec1, vec2):
    return vec1 @ vec2.T/(norm(vec1)*norm(vec2))
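
`similarity` is the cosine similarity: the dot product of the two vectors divided by the product of their norms. The sketch below repeats the function on made-up vectors to show the shapes involved. A `(1, n)` query against an `(n,)` candidate yields a one-element array, so a trailing `[0]` is needed to get a scalar.

```python
import numpy as np
from numpy.linalg import norm

def similarity(vec1, vec2):
    return vec1 @ vec2.T / (norm(vec1) * norm(vec2))

query = np.array([1.0, 0.0]).reshape(1, -1)  # shape (1, 2), like the API vectors
candidate = np.array([1.0, 1.0])             # shape (2,), like one stored embedding

score = similarity(query, candidate)
print(score.shape)        # (1,)
print(score.round(3)[0])  # 0.707
```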

In [17]:
from tqdm import tqdm

# Start every score as 'NIL'; rows with no job id or a failed lookup keep it.
score_cols = ['combined', 'title', 'text']
for col in score_cols:
    df[col] = 'NIL'

for index, row in tqdm(df.iterrows(), total=len(df)):
    if row['id'] == "-":
        continue
    try:
        name = row['SOL']
        job_id = row['id']
        res = query_api(f'{main_url}={job_id}')
        concat_r1 = np.array(res['embeddings_title'] + res['embeddings_text']).reshape(1, -1)
        title_r1 = np.array(res['embeddings_title']).reshape(1, -1)
        text_r1 = np.array(res['embeddings_text']).reshape(1, -1)
        # Score this job against every SOL occupation.
        result_df = pd.DataFrame({"SOL Occupation": sol_df['SOL Occupation'],
                                  "Combined similarity": list(map(lambda x: similarity(concat_r1, np.array(x)).round(3), sol_df['comb'])),
                                  "Title similarity": list(map(lambda x: similarity(title_r1, np.array(x)).round(3), sol_df['emb_title'])),
                                  "Text similarity": list(map(lambda x: similarity(text_r1, np.array(x)).round(3), sol_df['emb_text']))})

        # Keep only the scores for the SOL occupation this job is listed under.
        combined_value = result_df.loc[result_df['SOL Occupation'] == name, 'Combined similarity'].values[0]
        title_value = result_df.loc[result_df['SOL Occupation'] == name, 'Title similarity'].values[0]
        text_value = result_df.loc[result_df['SOL Occupation'] == name, 'Text similarity'].values[0]
        # Each value is a one-element array, so [0] extracts the scalar.
        # .loc writes to df itself and avoids the chained-assignment warning.
        df.loc[index, 'combined'] = combined_value[0]
        df.loc[index, 'title'] = title_value[0]
        df.loc[index, 'text'] = text_value[0]

    except Exception:
        # A failed request or an unmatched SOL occupation leaves the row as 'NIL'.
        df.loc[index, score_cols] = 'NIL'

100%|████████████████████████████████████████████████████████████████████████████████| 145/145 [03:45<00:00,  1.55s/it]
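
The score columns now mix numbers with the `'NIL'` sentinel, which gets in the way of numeric summaries when looking for a threshold. One possible next step, sketched on made-up data rather than the real `df`, is to coerce the sentinel to `NaN` with `pd.to_numeric`:

```python
import pandas as pd

# Made-up scores standing in for the columns filled by the loop.
scores = pd.DataFrame({"combined": [0.812, "NIL", 0.455],
                       "title": [0.9, "NIL", 0.3]})

# Coerce the 'NIL' sentinel to NaN so the columns become numeric.
numeric = scores.apply(pd.to_numeric, errors="coerce")

print(numeric["combined"].isna().sum())  # 1
print(numeric["combined"].mean())
```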
