# <center> Project: Job offer recommendation engine </center>



<img src="https://marne.fr/sites/default/files/styles/pleine_page_article/public/marne_rsa_solidarite_active.jpg?itok=WXYTjvkx" width=492>


### <center>Topics: employment, recommendation engine, job matching </center>

# Problem statement

The goal is to build an algorithm that matches job seekers with job offers from companies. The notebook is divided into three parts:

* [1/ Exploratory analysis](#partie1)


In this part we load, clean and analyse the datasets.


* [2/ Modelling](#partie2)


Next, we build the matching algorithm step by step.


* [3/ Results](#partie3)

Finally, we discuss the results and the limits of the approach.



# <center> Exploratory analysis </center>

# Loading the datasets

### df_jobs

In [1]:
#Data management/data viz
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go


#Modelling
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity


#Text
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

# Misc
import os
import random
random.seed(10)
try : 
    os.chdir(os.getcwd()+"/Data")
except OSError:
    # The Data folder is missing or we are already inside it
    pass

import time




start = time.time()


In [2]:
pip freeze > requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [3]:
def tokenizer_fct(sentence) :
    # print(sentence)
    sentence_clean = sentence.replace('-', ' ').replace('+', ' ').replace('/', ' ').replace('#', ' ')
    word_tokens = word_tokenize(sentence_clean)
    return word_tokens

stop_w = list(set(stopwords.words('english'))) + ['[', ']', ',', '.', ':', '?', '(', ')']

def stop_word_filter_fct(list_words) :
    filtered_w = [w for w in list_words if w not in stop_w]
    filtered_w2 = [w for w in filtered_w if len(w) > 2]
    return filtered_w2

# Lowercase
def lower_start_fct(list_words) :
    lw = [w.lower() for w in list_words]
    return lw

def lemma_fct(list_words) :
    lemmatizer = WordNetLemmatizer()
    lem_w = [lemmatizer.lemmatize(w) for w in list_words]
    return lem_w


CLEANR = re.compile('<.*?>') 

def cleanhtml(raw_html):
  cleantext = re.sub(CLEANR, '', raw_html)
  return cleantext

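def nettoyage_texte(sentence):
    # Pipeline: strip HTML -> tokenize -> drop stop words and short tokens -> lowercase -> lemmatize.
    # Stop words are filtered before lowercasing, so capitalised ones such as "The" are kept.
    return ' '.join(lemma_fct(lower_start_fct(stop_word_filter_fct(tokenizer_fct(cleanhtml(sentence))))))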

In [4]:
df_jobs=pd.read_csv("jobs.csv")
print(df_jobs.shape)
df_jobs.head()

(4291, 11)


Unnamed: 0,JobID,Title,Description,Requirements,City,State,Country,Zip5,StartDate,EndDate,JobCategoryID
0,75,Administrative Assistant,The Administrative Assistant must be reliable ...,Please refer to the Job Description to view th...,Houston,TX,US,77036.0,2012-03-30 01:23:57.737,2012-04-29 23:59:59,20
1,505,LAB TECH,S. Florida Medical Device Co. seeks QC Lab Tec...,General knowledge of scientific or clinical la...,Plantation,FL,US,33313.0,2012-03-23 04:48:17.063,2012-04-22 23:59:59,137
2,764,Travel-Seasonal Staff,<p>Travel-Seasonal Staff </p> <p> CB331736 Ros...,Please refer to the Job Description to view th...,Des Plaines,IL,US,60018.0,2012-03-18 05:19:04.863,2012-04-17 23:59:59,59
3,766,MARKETING/ Social Media intern,<p>Marketing Rep </p> <p> CB326227 Chicago </p...,Please refer to the Job Description to view th...,Chicago,IL,US,60606.0,2012-03-22 10:33:46.89,2012-04-21 23:59:59,141
4,781,Printing Sales,"<p>Printing Sales Experienced, outside sales w...",Please refer to the Job Description to view th...,Schiller Park,IL,US,60176.0,2012-04-01 01:16:40.797,2012-04-30 23:59:59,4


In [5]:
df_jobs["Title"]=df_jobs["Title"].apply(lambda x: nettoyage_texte(x))

df_jobs is a dataset of 4291 job postings described by 11 features. The Title and Description columns are not clean (HTML tags, mixed upper and lower case).
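
Only `Title` is cleaned in the cell above. As a minimal sketch of what a comparable first pass on `Description` could look like, the helper below removes HTML tags with the same non-greedy pattern as `CLEANR` and lowercases the text. The name `strip_html_and_lower` and the sample string (modelled on the `Description` preview) are assumptions for illustration; tokenization and lemmatization are left out so that the snippet runs without NLTK data.

```python
import re

# Same non-greedy tag pattern as CLEANR in the notebook
TAG_PATTERN = re.compile(r"<.*?>")

def strip_html_and_lower(raw_text):
    """Remove HTML tags, collapse whitespace and lowercase (hypothetical helper)."""
    without_tags = TAG_PATTERN.sub(" ", raw_text)
    return " ".join(without_tags.split()).lower()

# Sample string in the style of the Description column, for illustration only
sample = "<p>Marketing Rep </p> <p> CB326227 Chicago </p>"
cleaned = strip_html_and_lower(sample)
print(cleaned)
```

Replacing tags with a space rather than an empty string avoids gluing together words that were separated only by a tag.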

In [6]:
print(len(df_jobs.State.unique()))
print(len(df_jobs.JobCategoryID.unique()))

37
195


The job postings are spread across 37 states and 195 categories.

In [7]:
df_jobs.Requirements
df_jobs.Requirements= df_jobs.Requirements.fillna('')

In [8]:
test=df_jobs[df_jobs.Requirements.str.contains("junior")]
list(test.Requirements[:1])
len(test)

5

In [9]:
test=df_jobs[df_jobs.Requirements.str.contains("junior")]


### df_feedbacks

In [10]:
df_feedbacks=pd.read_csv("feedbacks.csv")
print(df_feedbacks.shape)
df_feedbacks.head()

(28928, 3)


Unnamed: 0,UserID,JobID,Event
0,698,1053272,viewed
1,698,535105,viewed
2,698,171400,viewed
3,698,804823,viewed
4,698,1113149,viewed


df_feedbacks contains user feedback on some of the job postings.

In [11]:
df_feedbacks.Event.value_counts()

viewed     18620
applied     9395
hired        913
Name: Event, dtype: int64

As expected, the more engaging an action is, the less frequent it is.

In [12]:
df_feedbacks.UserID.value_counts().describe()

count    1861.000000
mean       15.544331
std        21.158714
min         6.000000
25%         7.000000
50%        10.000000
75%        15.000000
max       484.000000
Name: UserID, dtype: float64

On average, users give about 15.5 feedbacks, with a minimum of 6 and a maximum of 484.

A user who applies to a job posting, or is hired for it, is clearly interested in it. If they view the posting without applying, we assume that the offer ultimately did not suit them. We can therefore create a binary variable equal to 1 if the user took an action after viewing the posting, and 0 otherwise.

In [13]:
df_feedbacks["interet"]=0

In [14]:
df_feedbacks["interet"] = df_feedbacks['Event'].apply(lambda x: 1 if x in ["hired","applied"] else 0)
df_feedbacks.interet.value_counts()

0    18620
1    10308
Name: interet, dtype: int64
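
The same flag can also be built without a row-wise `apply`. The snippet below is a small sketch on toy events (not taken from `feedbacks.csv`) using `Series.isin`, which produces the same 0/1 coding as the cell above.

```python
import pandas as pd

# Toy events for illustration only
events = pd.DataFrame({"Event": ["viewed", "applied", "hired", "viewed"]})

# 1 when the user applied or was hired, 0 when the posting was only viewed
events["interet"] = events["Event"].isin(["applied", "hired"]).astype(int)
print(events["interet"].tolist())
```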

In [15]:
df_feedbacks.JobID.value_counts().describe()

count    3149.000000
mean        9.186408
std        16.501161
min         2.000000
25%         3.000000
50%         4.000000
75%         8.000000
max       217.000000
Name: JobID, dtype: float64

75% of the postings received 8 feedbacks or fewer. The remaining 25%, with more than 8 feedbacks, are treated as popular.

In [16]:
popular_job=df_feedbacks.JobID.value_counts().to_frame()
liste_popular_job=list(popular_job[popular_job["JobID"]>8].index)
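
The threshold of 8 is the 75th percentile read from the `describe()` output. As a sketch, it could be computed rather than hard-coded, working directly on the `value_counts()` Series. This also avoids depending on the column name produced by `to_frame()`, which, as far as I know, is not the same across pandas versions. The data below is a toy example.

```python
import pandas as pd

# Toy JobID column for illustration only
job_ids = pd.Series([1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5])

feedbacks_per_job = job_ids.value_counts()
threshold = feedbacks_per_job.quantile(0.75)  # 75th percentile of the counts
popular = sorted(feedbacks_per_job[feedbacks_per_job > threshold].index)
print(threshold, popular)
```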

In [17]:
df_feedbacks[["UserID","JobID","interet"]]

Unnamed: 0,UserID,JobID,interet
0,698,1053272,0
1,698,535105,0
2,698,171400,0
3,698,804823,0
4,698,1113149,0
...,...,...,...
28923,1468470,149425,0
28924,1468879,164268,0
28925,1469784,777461,0
28926,1470394,339499,0


### df_users

In [18]:
df_users=pd.read_csv("users.csv")
print(df_users.shape)
df_users.head()

(2337, 14)


Unnamed: 0,UserID,City,State,Country,ZipCode,DegreeType,Major,GraduationDate,WorkHistoryCount,TotalYearsExperience,CurrentlyEmployed,ManagedOthers,ManagedHowMany,MajorCategoryID
0,23,Mount Prospect,IL,US,60056,High School,Not Applicable,2002-01-01 00:00:00,3,10.0,Yes,No,0,0.0
1,698,Normal,IL,US,61761,,,,4,7.0,Yes,No,0,
2,2305,Lake Forest,IL,US,60045,,insurance,2010-12-01 00:00:00,4,9.0,Yes,No,0,180.0
3,2785,Chicago,IL,US,60607,Bachelor's,International Affairs,2005-01-01 00:00:00,6,10.0,Yes,No,0,191.0
4,3406,Joliet,IL,US,60435,Bachelor's,English,,3,20.0,No,Yes,350,40.0


df_users contains the job seekers: 2337 users described by 14 features.

In [19]:
print(df_users.WorkHistoryCount.value_counts().sort_index())
print(77/2337)

0      77
1     112
2     219
3     438
4     483
5     351
6     272
7     133
8      92
9      66
10     91
11      1
12      1
13      1
Name: WorkHistoryCount, dtype: int64
0.0329482242190843


There are 77 users with no recorded work history, about 3.3% of all users.

In [20]:
df_users.ZipCode.value_counts().sort_index()

60001     1
60002     3
60004    11
60005     6
60006     1
         ..
62868     1
62901     1
62927     1
62948     1
62959     2
Name: ZipCode, Length: 449, dtype: int64

### df_users_history

In [21]:
df_users_history=pd.read_csv("users_history.csv")
print(df_users_history.shape)
df_users_history.head()

(11012, 4)


Unnamed: 0,UserID,Sequence,JobTitle,JobCategoryID
0,23,1,Manager,0.0
1,23,2,Assisting customers,30.0
2,698,1,Advocate Bromenn,88.0
3,698,2,"Customer Service, Patient contact, drawing blood",135.0
4,698,3,"Customer Service, Suggestive selling",1.0


In [22]:
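# Concatenate all job titles of a user into a single CV string, repeated on each of their rows
df_users_history["CV_JobTitle"]=df_users_history.groupby('UserID')['JobTitle'].transform(lambda titles : ' '.join(titles))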
df_users_history

Unnamed: 0,UserID,Sequence,JobTitle,JobCategoryID,CV_JobTitle
0,23,1,Manager,0.0,Manager Assisting customers
1,23,2,Assisting customers,30.0,Manager Assisting customers
2,698,1,Advocate Bromenn,88.0,"Advocate Bromenn Customer Service, Patient con..."
3,698,2,"Customer Service, Patient contact, drawing blood",135.0,"Advocate Bromenn Customer Service, Patient con..."
4,698,3,"Customer Service, Suggestive selling",1.0,"Advocate Bromenn Customer Service, Patient con..."
...,...,...,...,...,...
11007,1471500,3,Cocktail Server,91.0,Stocker Cocktail Server ICS Associate /Overnig...
11008,1471500,4,ICS Associate /Overnight Stock/Truck Unloader,136.0,Stocker Cocktail Server ICS Associate /Overnig...
11009,1471500,5,Cocktail Server,91.0,Stocker Cocktail Server ICS Associate /Overnig...
11010,1471500,6,Server,91.0,Stocker Cocktail Server ICS Associate /Overnig...


In [23]:

df_users_history['JobCategoryID'] = df_users_history['JobCategoryID'].astype(int)
df_users_history

Unnamed: 0,UserID,Sequence,JobTitle,JobCategoryID,CV_JobTitle
0,23,1,Manager,0,Manager Assisting customers
1,23,2,Assisting customers,30,Manager Assisting customers
2,698,1,Advocate Bromenn,88,"Advocate Bromenn Customer Service, Patient con..."
3,698,2,"Customer Service, Patient contact, drawing blood",135,"Advocate Bromenn Customer Service, Patient con..."
4,698,3,"Customer Service, Suggestive selling",1,"Advocate Bromenn Customer Service, Patient con..."
...,...,...,...,...,...
11007,1471500,3,Cocktail Server,91,Stocker Cocktail Server ICS Associate /Overnig...
11008,1471500,4,ICS Associate /Overnight Stock/Truck Unloader,136,Stocker Cocktail Server ICS Associate /Overnig...
11009,1471500,5,Cocktail Server,91,Stocker Cocktail Server ICS Associate /Overnig...
11010,1471500,6,Server,91,Stocker Cocktail Server ICS Associate /Overnig...


In [24]:
df_users_history.loc[1,"JobCategoryID"]

30

In [25]:
liste=[]
j=df_users_history.loc[0,"UserID"]
count=0
for i in df_users_history.UserID.values :
    #print(df_users_history.loc[count,"JobCategoryID"])
    if (i!=j):
        liste=[]
    liste.append(df_users_history.loc[count,"JobCategoryID"])
    df_users_history.loc[count,"liste_jobs"]=str(liste)
    count+=1
    j=i
    

    #print(liste)
df_users_history

Unnamed: 0,UserID,Sequence,JobTitle,JobCategoryID,CV_JobTitle,liste_jobs
0,23,1,Manager,0,Manager Assisting customers,[0]
1,23,2,Assisting customers,30,Manager Assisting customers,"[0, 30]"
2,698,1,Advocate Bromenn,88,"Advocate Bromenn Customer Service, Patient con...",[88]
3,698,2,"Customer Service, Patient contact, drawing blood",135,"Advocate Bromenn Customer Service, Patient con...","[88, 135]"
4,698,3,"Customer Service, Suggestive selling",1,"Advocate Bromenn Customer Service, Patient con...","[88, 135, 1]"
...,...,...,...,...,...,...
11007,1471500,3,Cocktail Server,91,Stocker Cocktail Server ICS Associate /Overnig...,"[0, 91]"
11008,1471500,4,ICS Associate /Overnight Stock/Truck Unloader,136,Stocker Cocktail Server ICS Associate /Overnig...,"[0, 91, 136]"
11009,1471500,5,Cocktail Server,91,Stocker Cocktail Server ICS Associate /Overnig...,"[0, 91, 136, 91]"
11010,1471500,6,Server,91,Stocker Cocktail Server ICS Associate /Overnig...,"[0, 91, 136, 91, 91]"


In [26]:
df_users_history["CV_JobTitle"]=df_users_history["CV_JobTitle"].apply(lambda x: nettoyage_texte(x))

This dataset contains the work histories of the job seekers.

In [27]:
len(df_users_history.JobCategoryID.unique())

197

In [28]:
df_users_history.Sequence.value_counts().sort_index()

1     2420
2     2290
3     2053
4     1584
5     1060
6      697
7      397
8      256
9      157
10      90
11       5
12       2
13       1
Name: Sequence, dtype: int64

In [29]:
print(len(df_users_history.UserID.unique()) )
print(len(df_users_history.UserID.unique()) - 77 +277)
print(2337)

2509
2709
2337


### df_test_users

In [30]:
df_test_users=pd.read_csv("test_users.csv")
print(df_test_users.shape)
df_test_users.TotalYearsExperience=pd.to_numeric(df_test_users.TotalYearsExperience)
df_test_users.head()

(277, 14)


Unnamed: 0,UserID,City,State,Country,ZipCode,DegreeType,Major,GraduationDate,WorkHistoryCount,TotalYearsExperience,CurrentlyEmployed,ManagedOthers,ManagedHowMany,MajorCategoryID
0,12924,Chicago,IL,US,60628,Bachelor's,,2005-05-01 00:00:00,4,9.0,Yes,No,0,
1,18947,Chicago,IL,US,60647,,,,6,22.0,Yes,No,0,
2,20976,Rolling Meadows,IL,US,60008,Master's,,,5,6.0,No,Yes,6,
3,21412,Chicago,IL,US,60649,,Basic Studies,2007-01-01 00:00:00,5,5.0,Yes,No,0,
4,40564,Zion,IL,US,60099,,,,6,5.0,,No,0,


This table contains the test job seekers; there are 277 of them.

Let us check whether the users in the test set have all given feedback.

In [31]:
UserID_test=list(df_test_users.UserID.unique())
UserID_train=list(df_users.UserID.unique())
UserID_feedback=list(df_feedbacks.UserID.unique())
UserID_history=list(df_users_history.UserID.unique())

In [32]:
len(UserID_train)

2337

In [33]:
len(list(set(UserID_train).intersection(UserID_feedback)))

1861

In [34]:
list(set(UserID_test).intersection(UserID_feedback))

[]

None of the job seekers in the test set has given any feedback.

In [35]:
list(set(UserID_test).intersection(UserID_train))

[]

In [36]:
len(list(set(UserID_test).intersection(UserID_history)))

263

In [37]:
len(UserID_test)

277
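
The checks above all follow the same pattern: intersect two lists of IDs and count. A small helper (hypothetical name `overlap_report`) can make the counts explicit. In the notebook, 263 of the 277 test users appear in the history table, which leaves 14 without any history. The IDs below are toy values.

```python
def overlap_report(reference_ids, other_ids):
    """Count how many reference IDs do or do not appear in other_ids."""
    reference, other = set(reference_ids), set(other_ids)
    return {"shared": len(reference & other), "missing": len(reference - other)}

# Toy IDs for illustration only
report = overlap_report([1, 2, 3, 4], [3, 4, 5])
print(report)
```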

### Cleaning

In the test table, we will split users according to whether or not they have work experience.

In [38]:
df_test_users.TotalYearsExperience.value_counts()

5.0     25
11.0    23
6.0     22
4.0     18
7.0     18
10.0    16
3.0     15
8.0     13
12.0    13
14.0    12
13.0    10
9.0     10
15.0     8
2.0      8
16.0     6
1.0      6
19.0     5
22.0     5
25.0     4
18.0     4
20.0     3
23.0     3
17.0     3
24.0     2
35.0     2
27.0     2
46.0     1
29.0     1
31.0     1
30.0     1
33.0     1
38.0     1
37.0     1
0.0      1
Name: TotalYearsExperience, dtype: int64

We now create the recommendation table for the test set.

In [39]:
len(list(df_test_users.UserID))


277

In [40]:
liste_0=[0]*df_test_users.shape[0]

d = {'UserID': list(df_test_users.UserID), 'recommendation_1': liste_0,'recommendation_2': liste_0,'recommendation_3': liste_0,'recommendation_4': liste_0}
df_test_recommendation = pd.DataFrame(data=d)
df_test_recommendation 

Unnamed: 0,UserID,recommendation_1,recommendation_2,recommendation_3,recommendation_4
0,12924,0,0,0,0
1,18947,0,0,0,0
2,20976,0,0,0,0
3,21412,0,0,0,0
4,40564,0,0,0,0
...,...,...,...,...,...
272,1454754,0,0,0,0
273,1458384,0,0,0,0
274,1458917,0,0,0,0
275,1466313,0,0,0,0


In [41]:
# We will build restricted user and job tables containing only the modelling features

### Analysis

# <center> Modelling approach </center>

### Assumptions

1/ According to the StepStone study, it takes on average 20 CVs to land a job, and 80% of candidates use at least 5 different sites to apply. 20/5 = 4, so we will recommend 4 job postings to each job seeker, on the assumption that this is the optimal number.

2/ To simplify the problem, we assume that each job posting has a hiring capacity greater than 1. In other words, even if one job seeker is hired for a posting, another can still apply to it.

3/ We assume that users with no work history in their CV are either people entering the job market or people who have been through a long period of unemployment.

In [42]:
20/5

4.0

In [43]:
4291/2337

1.836114676936243

We will start from the principle that our 

### Step 0: Non-personalised recommendations for users with no declared experience.

We separate the users who shared their past experience from those who did not. We assume that users who did not share their past experience are outliers on the job market: people entering the workforce (students, the long-term unemployed) or people whose CV is not credible.

In [44]:
UserID_test=list(df_test_users.UserID.unique())
UserID_train=list(df_users.UserID.unique())
UserID_feedback=list(df_feedbacks.UserID.unique())
UserID_history=list(df_users_history.UserID.unique())

In [45]:
df_test_users

Unnamed: 0,UserID,City,State,Country,ZipCode,DegreeType,Major,GraduationDate,WorkHistoryCount,TotalYearsExperience,CurrentlyEmployed,ManagedOthers,ManagedHowMany,MajorCategoryID
0,12924,Chicago,IL,US,60628,Bachelor's,,2005-05-01 00:00:00,4,9.0,Yes,No,0,
1,18947,Chicago,IL,US,60647,,,,6,22.0,Yes,No,0,
2,20976,Rolling Meadows,IL,US,60008,Master's,,,5,6.0,No,Yes,6,
3,21412,Chicago,IL,US,60649,,Basic Studies,2007-01-01 00:00:00,5,5.0,Yes,No,0,
4,40564,Zion,IL,US,60099,,,,6,5.0,,No,0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
272,1454754,Des Plaines,IL,US,60016,Bachelor's,Accounting & Auditing,,7,27.0,No,No,0,
273,1458384,Chicago,IL,US,60630,,,,10,23.0,,No,0,
274,1458917,Aurora,IL,US,60502,Bachelor's,Telecommunications Management,2001-10-01 00:00:00,6,6.0,,No,0,
275,1466313,Glenview,IL,US,60026,Master's,Marketing/Management,,8,22.0,Yes,No,0,


In [46]:
len(UserID_test)

277

In [47]:
liste_test_with_history=list(set(UserID_test).intersection(UserID_history))



In [48]:
df_test_recherche=df_test_users[df_test_users['UserID'].isin(liste_test_with_history)]
df_test_insertion= df_test_users[~df_test_users['UserID'].isin(liste_test_with_history)]
df_test_insertion

Unnamed: 0,UserID,City,State,Country,ZipCode,DegreeType,Major,GraduationDate,WorkHistoryCount,TotalYearsExperience,CurrentlyEmployed,ManagedOthers,ManagedHowMany,MajorCategoryID
5,42836,Chicago,IL,US,60628,,,,0,,,No,0,
22,138347,Hometown,IL,US,60456,,,,0,,,No,0,
89,480980,Chicago,IL,US,60617,,,,1,4.0,No,No,0,
99,539448,Belleville,IL,US,62221,,Biology,2012-04-01 00:00:00,0,,No,No,0,0.0
114,647688,Chicago,IL,US,60617,,,,0,,,No,0,
120,665250,Chicago,IL,US,60619,,,,0,,No,No,0,
161,909842,Chicago,IL,US,60628,,,,0,,,No,0,
168,932306,Chicago,IL,US,60636,High School,Not Applicable,2008-01-01 00:00:00,0,,No,No,0,0.0
183,1022450,Chicago,IL,US,60647,Bachelor's,Communicative Sciences and Disorders,2008-05-01 00:00:00,0,,No,Yes,3,
195,1069475,Chicago,IL,US,60620,,,,0,,No,No,0,


User 539448 is a student entering the job market: their graduation date is close to the availability dates of the jobs.
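
This observation can be turned into a simple check. The sketch below compares a `GraduationDate` with a few job `StartDate` values taken from the tables above. The function names and the 90-day window are assumptions chosen for illustration, not something the notebook defines.

```python
from datetime import datetime

def parse_timestamp(text):
    """Parse '2012-03-30 01:23:57.737' or '2012-04-01 00:00:00' style timestamps."""
    fmt = "%Y-%m-%d %H:%M:%S.%f" if "." in text else "%Y-%m-%d %H:%M:%S"
    return datetime.strptime(text, fmt)

def graduated_near_postings(graduation, job_start_dates, max_gap_days=90):
    """True when the graduation date is within max_gap_days of any job start date."""
    grad = parse_timestamp(graduation)
    gaps = [abs((parse_timestamp(start) - grad).days) for start in job_start_dates]
    return min(gaps) <= max_gap_days

starts = ["2012-03-30 01:23:57.737", "2012-03-23 04:48:17.063"]
recent = graduated_near_postings("2012-04-01 00:00:00", starts)  # user 539448
older = graduated_near_postings("2008-01-01 00:00:00", starts)   # user 932306
print(recent, older)
```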

In [49]:
outliers=list(df_test_insertion.UserID)
outliers

[42836,
 138347,
 480980,
 539448,
 647688,
 665250,
 909842,
 932306,
 1022450,
 1069475,
 1087050,
 1179202,
 1401559,
 1451904]

In [50]:
#list of popular jobs: those with more feedback than 75% of the other postings.
#popular_job=df_feedbacks.JobID.value_counts().to_frame()
#liste_popular_job=list(popular_job[popular_job["JobID"]>8].index)
liste_popular_job

[1050985,
 896947,
 837446,
 386591,
 67025,
 749238,
 871285,
 489323,
 209212,
 83739,
 314766,
 615722,
 23302,
 8315,
 597857,
 788041,
 1113991,
 683930,
 550180,
 110541,
 280705,
 458789,
 681893,
 26102,
 595723,
 896220,
 1050685,
 300786,
 1104317,
 158490,
 1032351,
 384121,
 852994,
 300778,
 805616,
 434309,
 643234,
 864447,
 95835,
 240127,
 900555,
 502883,
 636190,
 568000,
 697788,
 212537,
 643233,
 886039,
 156625,
 879350,
 863547,
 535582,
 812538,
 1053272,
 822884,
 969928,
 14391,
 381975,
 95535,
 171400,
 1009302,
 182525,
 299233,
 1081649,
 554639,
 46841,
 382487,
 268576,
 848283,
 39412,
 323711,
 1051415,
 758183,
 749790,
 512697,
 377336,
 537997,
 691506,
 845007,
 483050,
 572536,
 961414,
 744000,
 811952,
 970144,
 315453,
 511755,
 1009301,
 59597,
 780502,
 861085,
 883740,
 203025,
 580945,
 578522,
 44890,
 23341,
 223139,
 326864,
 260701,
 1069767,
 734022,
 867182,
 205628,
 647270,
 811951,
 1061320,
 535043,
 878077,
 1053273,
 600837,
 1

In [51]:
df_test_recommendation 

Unnamed: 0,UserID,recommendation_1,recommendation_2,recommendation_3,recommendation_4
0,12924,0,0,0,0
1,18947,0,0,0,0
2,20976,0,0,0,0
3,21412,0,0,0,0
4,40564,0,0,0,0
...,...,...,...,...,...
272,1454754,0,0,0,0
273,1458384,0,0,0,0
274,1458917,0,0,0,0
275,1466313,0,0,0,0


In [52]:
outliers

[42836,
 138347,
 480980,
 539448,
 647688,
 665250,
 909842,
 932306,
 1022450,
 1069475,
 1087050,
 1179202,
 1401559,
 1451904]

In [53]:
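random.seed(10)

# Give each user without history four jobs drawn at random from the popular ones.
# Writing through .loc avoids relying on .values returning a view of the frame.
cols=["recommendation_1","recommendation_2","recommendation_3","recommendation_4"]
for idx, user_id in df_test_recommendation["UserID"].items():
    if user_id in outliers :
        for col in cols:
            df_test_recommendation.loc[idx, col]=random.choice(liste_popular_job)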

print(df_test_recommendation.recommendation_3.value_counts())
df_test_recommendation.head()

0          263
212537       1
457789       1
218288       1
223426       1
701122       1
201925       1
608468       1
779489       1
782          1
1019160      1
137014       1
182661       1
403850       1
815547       1
Name: recommendation_3, dtype: int64


Unnamed: 0,UserID,recommendation_1,recommendation_2,recommendation_3,recommendation_4
0,12924,0,0,0,0
1,18947,0,0,0,0
2,20976,0,0,0,0
3,21412,0,0,0,0
4,40564,0,0,0,0
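
Because `random.choice` draws each recommendation independently, the same job can be picked twice for one user. If four distinct postings are wanted, `random.sample` is one option. The helper below is a sketch: its name is hypothetical, and the seed value simply mirrors `random.seed(10)` from the notebook.

```python
import random

def sample_distinct_jobs(popular_job_ids, k=4, seed=10):
    """Draw k distinct job IDs from the popular ones (hypothetical helper)."""
    rng = random.Random(seed)
    unique_ids = list(dict.fromkeys(popular_job_ids))  # drop duplicates, keep order
    return rng.sample(unique_ids, k)

# A few JobIDs from liste_popular_job, used here only as sample input
pool = [1050985, 896947, 837446, 386591, 67025, 749238]
picks = sample_distinct_jobs(pool)
print(picks)
```

A dedicated `random.Random` instance keeps the draw reproducible without touching the global random state used elsewhere.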


### Step 1: "Company" view, matching profiles to job postings

The goal here is to check how well the companies' postings match the candidates' profiles.

##### Recommendations

In [54]:
def recommendation_CV(test) :
    liste_job=list(df_jobs.Title)
    liste_job.append(test)
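    # Fit TF-IDF on all job titles plus the CV, which is the last row
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform(liste_job)
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs >= 1.0
    feature_names = vectorizer.get_feature_names_out()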
    dense = vectors.todense()
    denselist = dense.tolist()
    df = pd.DataFrame(denselist, columns=feature_names)
    cosine_sim = cosine_similarity(df, df)
    d = {'cosinus': list(cosine_sim[-1]), 'offres':list(range(cosine_sim.shape[1]))}
    df = pd.DataFrame(data=d)
    # Ascending sort: the CV itself (cosine 1) normally comes last, so the best job is at position -2
    df=df.sort_values(by=['cosinus'])
    recommendation_1=df.index[-2]
    #print(df_jobs.Title[recommendation_1])
    recommendation_2=df.index[-3]
    #print(df_jobs.Title[recommendation_2])
    recommendation_3=df.index[-4]
    #print(df_jobs.Title[recommendation_3])
    recommendation_4=df.index[-5]
    #print(df_jobs.Title[recommendation_4])
    recommendation_5=df.index[-6]
    #print(df_jobs.Title[recommendation_5])
    recommendation_6=df.index[-7]
    #print(df_jobs.Title[recommendation_6])

    
  
    recommendation_jobs=[recommendation_1,recommendation_2,recommendation_3,recommendation_4,recommendation_5,recommendation_6]
    recommendation_jobs_probs=[  round(df.cosinus.values[-2],3), round(df.cosinus.values[-3],3),round(df.cosinus.values[-4],3),round(df.cosinus.values[-5],3),round(df.cosinus.values[-6],3),round(df.cosinus.values[-7],3)]
    
    d_final = {'cosinus': recommendation_jobs_probs, 'offres':recommendation_jobs}
    df_final = pd.DataFrame(data=d_final)
    
    return(df_final)
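
`recommendation_CV` refits the vectorizer and builds a dense similarity matrix between all titles on every call. A lighter variant, sketched below under stated assumptions, fits TF-IDF once on the job titles, transforms each CV separately, and scores it with `linear_kernel` on the sparse matrices. This is not the notebook's method: here the CV is not part of the fitted corpus, so words that appear only in the CV are ignored and scores can differ slightly. The function names are hypothetical, and the titles are a small sample in the style of `df_jobs.Title`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def fit_job_index(job_titles):
    """Fit TF-IDF once on the job titles and return (vectorizer, job matrix)."""
    vectorizer = TfidfVectorizer()
    job_matrix = vectorizer.fit_transform(job_titles)  # stays sparse
    return vectorizer, job_matrix

def recommend_positions(vectorizer, job_matrix, cv_text, k=4):
    """Return the k (row position, cosine) pairs most similar to cv_text."""
    cv_vector = vectorizer.transform([cv_text])
    # TF-IDF rows are L2-normalised by default, so the dot product is the cosine
    scores = linear_kernel(cv_vector, job_matrix).ravel()
    top = np.argsort(-scores, kind="stable")[:k]
    return [(int(i), round(float(scores[i]), 3)) for i in top]

titles = ["administrative assistant", "lab tech", "teacher infant preschool program",
          "printing sale", "kindergarden lead teacher"]
vectorizer, job_matrix = fit_job_index(titles)
recs = recommend_positions(vectorizer, job_matrix, "substitute teacher preschool", k=2)
print(recs)
```

Fitting once and reusing the job matrix for every user is the main point of the design: only one row per CV has to be vectorized and compared.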



Let us test the model on an example.

In [55]:
test=df_users_history["CV_JobTitle"][4893]
print(test)

instructor substitute teacher teacher preschool 3rd grade 5th grade long term substitute teacher student teacher 2nd grade


In [56]:
resultat=recommendation_CV(test)
resultat

Unnamed: 0,cosinus,offres
0,0.381,2288
1,0.345,463
2,0.328,2575
3,0.285,1844
4,0.284,3478
5,0.247,4051


In [57]:
for i in list(resultat.offres.values[:]) :
    print(i)
    print(df_jobs.Title[i])

2288
teacher infant preschool program
463
kindergarden lead teacher
2575
teacher early childhood education preschool
1844
early childhood teacher aurora
3478
lead teacher kiddie academy
4051
exciting kindergarten teacher opportunity bartlett


In [58]:
df_jobs.head()

Unnamed: 0,JobID,Title,Description,Requirements,City,State,Country,Zip5,StartDate,EndDate,JobCategoryID
0,75,administrative assistant,The Administrative Assistant must be reliable ...,Please refer to the Job Description to view th...,Houston,TX,US,77036.0,2012-03-30 01:23:57.737,2012-04-29 23:59:59,20
1,505,lab tech,S. Florida Medical Device Co. seeks QC Lab Tec...,General knowledge of scientific or clinical la...,Plantation,FL,US,33313.0,2012-03-23 04:48:17.063,2012-04-22 23:59:59,137
2,764,travel seasonal staff,<p>Travel-Seasonal Staff </p> <p> CB331736 Ros...,Please refer to the Job Description to view th...,Des Plaines,IL,US,60018.0,2012-03-18 05:19:04.863,2012-04-17 23:59:59,59
3,766,marketing social medium intern,<p>Marketing Rep </p> <p> CB326227 Chicago </p...,Please refer to the Job Description to view th...,Chicago,IL,US,60606.0,2012-03-22 10:33:46.89,2012-04-21 23:59:59,141
4,781,printing sale,"<p>Printing Sales Experienced, outside sales w...",Please refer to the Job Description to view th...,Schiller Park,IL,US,60176.0,2012-04-01 01:16:40.797,2012-04-30 23:59:59,4


In [59]:
count=0
for i in resultat.offres.values:
    print(i)
    print(df_jobs.loc[i,"JobCategoryID"])
    resultat.loc[count,"JobCategoryID"]=df_jobs.loc[i,"JobCategoryID"]
    count+=1

2288
0
463
40
2575
131
1844
0
3478
19
4051
40


In [60]:
resultat

Unnamed: 0,cosinus,offres,JobCategoryID
0,0.381,2288,0.0
1,0.345,463,40.0
2,0.328,2575,131.0
3,0.285,1844,0.0
4,0.284,3478,19.0
5,0.247,4051,40.0


#### Adding features

Job category

In [61]:
resultat

Unnamed: 0,cosinus,offres,JobCategoryID
0,0.381,2288,0.0
1,0.345,463,40.0
2,0.328,2575,131.0
3,0.285,1844,0.0
4,0.284,3478,19.0
5,0.247,4051,40.0


In [62]:
#df_users_history_max=df_users_history_max.reset_index()
#df_users_history_max

In [63]:
#test=df_users_history_max.values
#test[-1]

In [64]:
df_jobs.JobCategoryID

0        20
1       137
2        59
3       141
4         4
       ... 
4286     14
4287    178
4288    178
4289    178
4290     77
Name: JobCategoryID, Length: 4291, dtype: int64

Management prerequisites

### Step 2: "Employee" view, matching job postings to what job seekers want.

Signing a contract is a two-sided agreement between the employee and the employer. This is why we will re-weight our features with a logistic regression on the feedbacks.
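
The notebook does not implement this step. As a rough sketch of the idea, a logistic regression could be trained on (user, job) pairs from the feedbacks, with the `interet` flag as target and matching features as inputs. Everything below is an assumption for illustration: the two features (title cosine and a same-category indicator) and the training rows are made up and do not come from the datasets.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up training rows: [title cosine, same JobCategoryID as a past job (0/1)]
X = np.array([[0.9, 1], [0.8, 1], [0.7, 0], [0.2, 0], [0.1, 1], [0.3, 0]])
y = np.array([1, 1, 1, 0, 0, 0])  # plays the role of the "interet" flag

model = LogisticRegression()
model.fit(X, y)

# Learned feature weights, then an interest score for two candidate pairs
print(model.coef_)
scores = model.predict_proba(np.array([[0.85, 1], [0.15, 0]]))[:, 1]
print(scores)
```

The fitted coefficients would play the role of the re-weighted features, and the predicted probability could replace the raw cosine when ranking postings for a user.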

### Final table

In [65]:
df_test_recommendation

Unnamed: 0,UserID,recommendation_1,recommendation_2,recommendation_3,recommendation_4
0,12924,0,0,0,0
1,18947,0,0,0,0
2,20976,0,0,0,0
3,21412,0,0,0,0
4,40564,0,0,0,0
...,...,...,...,...,...
272,1454754,0,0,0,0
273,1458384,0,0,0,0
274,1458917,0,0,0,0
275,1466313,0,0,0,0


In [66]:
test=df_users_history["CV_JobTitle"][4893]
print(test)
resultat=recommendation_CV(test)
resultat

instructor substitute teacher teacher preschool 3rd grade 5th grade long term substitute teacher student teacher 2nd grade


Unnamed: 0,cosinus,offres
0,0.381,2288
1,0.345,463
2,0.328,2575
3,0.285,1844
4,0.284,3478
5,0.247,4051


In [67]:
df_users_history["CV_JobTitle"][1294]

'accountant accountant accountant accountant general ledger accountant junior staff accountant'

In [68]:
#for i in df_test_recommendation.values:
#print(i[0])
resultat.offres.values[0]
df_test_recommendation.values

array([[  12924,       0,       0,       0,       0],
       [  18947,       0,       0,       0,       0],
       [  20976,       0,       0,       0,       0],
       ...,
       [1458917,       0,       0,       0,       0],
       [1466313,       0,       0,       0,       0],
       [1471500,       0,       0,       0,       0]])

In [69]:

df_users_history.index=df_users_history.loc[:,"UserID"]

In [70]:
history=df_users_history["CV_JobTitle"].drop_duplicates()
history[23]
#history.index=history.loc[:,"UserID"]

'manager assisting customer'

In [71]:
history.values[1]

'advocate bromenn customer service patient contact drawing blood customer service suggestive selling customer service retail'

In [72]:
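# Distinct cleaned CV strings, indexed by UserID (index set in In [69])
history=df_users_history["CV_JobTitle"].drop_duplicates()
count=0
cols=["recommendation_1","recommendation_2","recommendation_3","recommendation_4"]

for idx, user_id in df_test_recommendation["UserID"].items():
    if user_id not in outliers :
        try:
            resultat=recommendation_CV(history[user_id])
            # resultat.offres holds row positions in df_jobs, not JobID values
            for k, col in enumerate(cols):
                df_test_recommendation.loc[idx, col]=resultat.offres.values[k]
        except Exception:
            # The CV could not be retrieved or scored: keep the default 0
            count+=1

        
print(count,"errors")

df_test_recommendation.head()

2 errors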


Unnamed: 0,UserID,recommendation_1,recommendation_2,recommendation_3,recommendation_4
0,12924,1958,2802,4232,2780
1,18947,4110,1544,3861,286
2,20976,1054,2720,3407,2386
3,21412,2999,1405,2240,33
4,40564,4187,2172,2177,2390
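
The values filled in by Step 1 appear to be row positions in `df_jobs` (they are displayed with `df_jobs.Title[i]`), whereas Step 0 draws `JobID` values from the feedbacks, so the final table mixes two kinds of identifiers. A possible conversion is sketched below on toy frames. The helper name is hypothetical, and it should only be applied to users whose recommendations were actually filled by Step 1, since a default 0 would otherwise be read as position 0.

```python
import pandas as pd

def positions_to_job_ids(recommendations, jobs, user_ids, columns):
    """Replace df_jobs row positions by JobID values for the listed users."""
    out = recommendations.copy()
    job_ids = jobs["JobID"].to_numpy()
    mask = out["UserID"].isin(user_ids)
    for col in columns:
        out.loc[mask, col] = job_ids[out.loc[mask, col].to_numpy()]
    return out

# Toy frames for illustration only
jobs = pd.DataFrame({"JobID": [75, 505, 764]})
recs = pd.DataFrame({"UserID": [1, 2],
                     "recommendation_1": [2, 0],
                     "recommendation_2": [1, 2]})
fixed = positions_to_job_ids(recs, jobs, user_ids=[1],
                             columns=["recommendation_1", "recommendation_2"])
print(fixed)
```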


# Results

In [73]:
#test=df_users_history["CV_JobTitle"][4893]
#print(test)

df_test_recommendation.to_csv("Recommendation.csv",index=False)


In [74]:
end = time.time()
print("Elapsed time (s):",end - start)

Elapsed time (s): 596.3967068195343


### Discussion of the results

### Limits

Cluster-based reasoning for the scoring:
    - weights of the scoring features: on Fruitz, the declared intention (peach) changes the matching
    - maintenance contract: YouTube and infinite loops in recommendation

### Sources

The sources used to design this notebook are:
- https://github.com/jalajthanaki/Job_recommendation_engine/blob/master/Job_recommendation_engine.ipynb
- https://app.datacamp.com/learn/courses/building-recommendation-engines-in-python
- https://www.jobspikr.com/blog/job-matching-algorithms/