# Job search of September 2023

## Context

This is my first "real" job search, as it coincides with the end of my studies at engineering school.

I actually started looking for a first position in May 2023, in the middle of my end-of-studies internship, but without success.
After my summer holidays, I decided to start again, this time with a more organized way of tracking my applications and the recruitment processes that followed.

The data in this analysis covers only the search that began in September 2023. Earlier data is excluded because its tracking and formatting were inconsistent.

My goal was to find a junior position in the field of data science or machine learning.

### Constraints

#### Targeted positions

- Data Scientist
- Data Engineer
- Data Analyst
- Machine Learning Engineer

#### Location

- France (on-site, hybrid, remote)
- Europe (remote)

## JSA analysis

### Imports

In [1]:
%matplotlib inline

from geopy import geocoders
import numpy as np
import pandas as pd
from plotly import express as px
from plotly import io as pio

pio.templates.default = "simple_white"

### Load the data

In [23]:
df_jsa_10_2023 = pd.read_csv("../data/raw/10_2023.csv", dtype=str)

df_jsa_10_2023.head()

Unnamed: 0,Status,Company,Role,Location,Source,Attendance,Application,Type,Phone call,1st interview,2nd interview,3rd interview,Proposition,Final answer,URL
0,Rejected,Orange,Data Scientist,Rennes,Carrer Site,,30/08/2023,Offer,13/10/2023,,,,,26/10/2023,https://orange.jobs/jobs/v3/offers/127229?lang=
1,Applied,Orange,Data Scientist,Toulouse,Carrer Site,,30/08/2023,Offer,,,,,,,https://orange.jobs/jobs/v3/offers/128950?lang=
2,Applied,AZ Consulting,Data Scientist,,Recruitment Agency,,31/08/2023,Spontaneous,,,,,,,https://az-recrutement.wixsite.com/az-recrutement
3,No project for now,ALTEN,,Marignane,Reach-out,Hybrid,,,05/09/2023,06/09/2023 17:00,,,,06/09/2023,
4,No project for now,Astek,Data Scientist,,LinkedIn,,31/08/2023,Spontaneous,01/09/2023,07/09/2023 18:00,,,,07/09/2023,https://www.linkedin.com/posts/killian-vincent...


In [24]:
df_jsa_10_2023.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 447 entries, 0 to 446
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Status         447 non-null    object
 1   Company        444 non-null    object
 2   Role           441 non-null    object
 3   Location       419 non-null    object
 4   Source         446 non-null    object
 5   Attendance     210 non-null    object
 6   Application    437 non-null    object
 7   Type           436 non-null    object
 8   Phone call     15 non-null     object
 9   1st interview  17 non-null     object
 10  2nd interview  4 non-null      object
 11  3rd interview  1 non-null      object
 12  Proposition    4 non-null      object
 13  Final answer   146 non-null    object
 14  URL            411 non-null    object
dtypes: object(15)
memory usage: 52.5+ KB


### Format the data

#### Remove unnecessary features

In [25]:
df_jsa_10_2023_f = df_jsa_10_2023.copy()
df_jsa_10_2023_f.drop(["URL"], axis=1, inplace=True)

df_jsa_10_2023_f.head()

Unnamed: 0,Status,Company,Role,Location,Source,Attendance,Application,Type,Phone call,1st interview,2nd interview,3rd interview,Proposition,Final answer
0,Rejected,Orange,Data Scientist,Rennes,Carrer Site,,30/08/2023,Offer,13/10/2023,,,,,26/10/2023
1,Applied,Orange,Data Scientist,Toulouse,Carrer Site,,30/08/2023,Offer,,,,,,
2,Applied,AZ Consulting,Data Scientist,,Recruitment Agency,,31/08/2023,Spontaneous,,,,,,
3,No project for now,ALTEN,,Marignane,Reach-out,Hybrid,,,05/09/2023,06/09/2023 17:00,,,,06/09/2023
4,No project for now,Astek,Data Scientist,,LinkedIn,,31/08/2023,Spontaneous,01/09/2023,07/09/2023 18:00,,,,07/09/2023


#### Datetime

In [26]:
dates_col = ["Application",
             "Phone call",
             "1st interview",
             "2nd interview",
             "3rd interview",
             "Proposition",
             "Final answer"]

# Convert to datetime
df_jsa_10_2023_f[dates_col] = df_jsa_10_2023_f[dates_col].apply(pd.to_datetime, format="mixed", dayfirst=True) 

# Remove time
df_jsa_10_2023_f[dates_col] = df_jsa_10_2023_f[dates_col].apply(lambda x : x.dt.normalize())

df_jsa_10_2023_f[dates_col].head()

Unnamed: 0,Application,Phone call,1st interview,2nd interview,3rd interview,Proposition,Final answer
0,2023-08-30,2023-10-13,NaT,NaT,NaT,NaT,2023-10-26
1,2023-08-30,NaT,NaT,NaT,NaT,NaT,NaT
2,2023-08-31,NaT,NaT,NaT,NaT,NaT,NaT
3,NaT,2023-09-05,2023-09-06,NaT,NaT,NaT,2023-09-06
4,2023-08-31,2023-09-01,2023-09-07,NaT,NaT,NaT,2023-09-07
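
To make the two steps above concrete, here is a minimal sketch on a toy Series. The sample values mimic cells from my sheet (a plain date, a date with a time, an empty cell); the variable names are only for illustration. Note that `format="mixed"` needs pandas 2.0 or later.

```python
import pandas as pd

# Toy values mimicking the raw sheet: a date, a date with a time, a missing cell.
raw = pd.Series(["30/08/2023", "06/09/2023 17:00", None])

# format="mixed" infers the format of each element individually;
# dayfirst=True makes 06/09 mean 6 September rather than 9 June.
parsed = pd.to_datetime(raw, format="mixed", dayfirst=True)

# normalize() resets the time part to midnight and keeps the datetime dtype.
dates_only = parsed.dt.normalize()

print(dates_only)
```

The missing cell becomes `NaT`, which is why the stage columns above are mostly `NaT`.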


#### Categories

In [27]:
categories_col = ["Status", "Location", "Role", "Source", "Attendance", "Type"]

df_jsa_10_2023_f[categories_col] = df_jsa_10_2023_f[categories_col].fillna("Not specified")

df_jsa_10_2023_f[categories_col].head()

Unnamed: 0,Status,Location,Role,Source,Attendance,Type
0,Rejected,Rennes,Data Scientist,Carrer Site,Not specified,Offer
1,Applied,Toulouse,Data Scientist,Carrer Site,Not specified,Offer
2,Applied,Not specified,Data Scientist,Recruitment Agency,Not specified,Spontaneous
3,No project for now,Marignane,Not specified,Reach-out,Hybrid,Not specified
4,No project for now,Not specified,Data Scientist,LinkedIn,Not specified,Spontaneous
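
As a small illustration of the step above, this sketch fills missing categorical cells in a toy frame with the same placeholder. The toy data is made up for the example.

```python
import pandas as pd

# Toy frame with missing categorical cells (made-up rows).
toy = pd.DataFrame({
    "Role": ["Data Scientist", None],
    "Attendance": [None, "Hybrid"],
})

cols = ["Role", "Attendance"]

# Replace every missing value in the selected columns with an explicit label,
# so that value_counts() and the charts show it as its own category.
toy[cols] = toy[cols].fillna("Not specified")

print(toy["Attendance"].value_counts())
```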


#### Column names

In [28]:
def to_snake_case(s: str) -> str:
    """Format a given string into snake-case format.

    Args:
        s (str): a string.

    Returns:
        str: snake-case formatted string.
    """
    s = s.strip().lower()
    s = s.replace(" ", "_")

    return s

df_jsa_10_2023_f.rename(to_snake_case, axis='columns', inplace=True)

df_jsa_10_2023_f.head()

Unnamed: 0,status,company,role,location,source,attendance,application,type,phone_call,1st_interview,2nd_interview,3rd_interview,proposition,final_answer
0,Rejected,Orange,Data Scientist,Rennes,Carrer Site,Not specified,2023-08-30,Offer,2023-10-13,NaT,NaT,NaT,NaT,2023-10-26
1,Applied,Orange,Data Scientist,Toulouse,Carrer Site,Not specified,2023-08-30,Offer,NaT,NaT,NaT,NaT,NaT,NaT
2,Applied,AZ Consulting,Data Scientist,Not specified,Recruitment Agency,Not specified,2023-08-31,Spontaneous,NaT,NaT,NaT,NaT,NaT,NaT
3,No project for now,ALTEN,Not specified,Marignane,Reach-out,Hybrid,NaT,Not specified,2023-09-05,2023-09-06,NaT,NaT,NaT,2023-09-06
4,No project for now,Astek,Data Scientist,Not specified,LinkedIn,Not specified,2023-08-31,Spontaneous,2023-09-01,2023-09-07,NaT,NaT,NaT,2023-09-07
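
A quick check of what the helper produces, as a sketch that restates the same logic in a compact form. One consequence worth noting: names such as `1st_interview` start with a digit, so they are not valid Python identifiers and the columns have to be accessed with brackets.

```python
def to_snake_case(s: str) -> str:
    # Same logic as the helper above: trim, lowercase, spaces to underscores.
    return s.strip().lower().replace(" ", "_")

columns = ["Status", "Phone call", "1st interview", "Final answer"]
renamed = [to_snake_case(c) for c in columns]
print(renamed)

# Names starting with a digit are not valid identifiers, so such columns
# need df["1st_interview"] rather than attribute-style access.
print([c.isidentifier() for c in renamed])
```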


#### Find latitude and longitude location coordinates

In [29]:
# Read the API key, stripping the trailing newline most editors add to the file
with open("../mapbox_api_key.txt", "r") as f:
    api_key = f.read().strip()

geolocator = geocoders.MapBox(api_key=api_key)

unique_locations = list(df_jsa_10_2023_f["location"].unique())
unique_locations.remove("Not specified")

geocodes = [geolocator.geocode(ul) for ul in unique_locations]

# Map each location name to its geocoding result to avoid repeated list searches
geocode_by_location = dict(zip(unique_locations, geocodes))

df_jsa_10_2023_f["location_lat"] = [np.nan if l == "Not specified"
                                    else geocode_by_location[l].latitude
                                    for l in df_jsa_10_2023_f["location"]]

df_jsa_10_2023_f["location_long"] = [np.nan if l == "Not specified"
                                     else geocode_by_location[l].longitude
                                     for l in df_jsa_10_2023_f["location"]]

df_jsa_10_2023_f.head()

Unnamed: 0,status,company,role,location,source,attendance,application,type,phone_call,1st_interview,2nd_interview,3rd_interview,proposition,final_answer,location_lat,location_long
0,Rejected,Orange,Data Scientist,Rennes,Carrer Site,Not specified,2023-08-30,Offer,2023-10-13,NaT,NaT,NaT,NaT,2023-10-26,48.111339,-1.68002
1,Applied,Orange,Data Scientist,Toulouse,Carrer Site,Not specified,2023-08-30,Offer,NaT,NaT,NaT,NaT,NaT,NaT,43.604462,1.444247
2,Applied,AZ Consulting,Data Scientist,Not specified,Recruitment Agency,Not specified,2023-08-31,Spontaneous,NaT,NaT,NaT,NaT,NaT,NaT,,
3,No project for now,ALTEN,Not specified,Marignane,Reach-out,Hybrid,NaT,Not specified,2023-09-05,2023-09-06,NaT,NaT,NaT,2023-09-06,43.416273,5.214627
4,No project for now,Astek,Data Scientist,Not specified,LinkedIn,Not specified,2023-08-31,Spontaneous,2023-09-01,2023-09-07,NaT,NaT,NaT,2023-09-07,,
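
The same idea (geocode each distinct place once, then spread the result over all rows) can be sketched offline with `Series.map`. Everything here is an assumption for illustration: `fake_geocode` is a hypothetical stand-in for the real geocoder call, and the coordinates are simply the ones shown in the output above.

```python
import pandas as pd

# Hypothetical stand-in for geolocator.geocode, so the sketch runs offline.
KNOWN = {"Rennes": (48.111339, -1.68002), "Toulouse": (43.604462, 1.444247)}

def fake_geocode(name):
    return KNOWN.get(name)  # None when the place is unknown

df = pd.DataFrame({"location": ["Rennes", "Toulouse", "Not specified", "Rennes"]})

# Geocode each distinct place once and keep only successful lookups.
coords = {}
for name in df["location"].unique():
    if name != "Not specified":
        result = fake_geocode(name)
        if result is not None:
            coords[name] = result

lat_by_name = {name: c[0] for name, c in coords.items()}
lon_by_name = {name: c[1] for name, c in coords.items()}

# Series.map with a dict yields NaN for keys that are not in the dict,
# which covers both "Not specified" and places the geocoder could not find.
df["location_lat"] = df["location"].map(lat_by_name)
df["location_long"] = df["location"].map(lon_by_name)

print(df)
```

Compared with the list comprehension, this version also tolerates a failed lookup instead of raising on a missing result.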


### Save processed data

In [30]:
df_jsa_10_2023_f.to_csv("../data/processed/10_2023.csv", index=False)

### Analysis

#### Load processed data

In [2]:
df_jsa_10_2023_p = pd.read_csv("../data/processed/10_2023.csv",
                               parse_dates=["application",
                                            "phone_call",
                                            "1st_interview",
                                            "2nd_interview",
                                            "3rd_interview",
                                            "proposition",
                                            "final_answer"])

df_jsa_10_2023_p.head()

Unnamed: 0,status,company,role,location,source,attendance,application,type,phone_call,1st_interview,2nd_interview,3rd_interview,proposition,final_answer,location_lat,location_long
0,Rejected,Orange,Data Scientist,Rennes,Carrer Site,Not specified,2023-08-30,Offer,2023-10-13,NaT,NaT,NaT,NaT,2023-10-26,48.111339,-1.68002
1,Applied,Orange,Data Scientist,Toulouse,Carrer Site,Not specified,2023-08-30,Offer,NaT,NaT,NaT,NaT,NaT,NaT,43.604462,1.444247
2,Applied,AZ Consulting,Data Scientist,Not specified,Recruitment Agency,Not specified,2023-08-31,Spontaneous,NaT,NaT,NaT,NaT,NaT,NaT,,
3,No project for now,ALTEN,Not specified,Marignane,Reach-out,Hybrid,NaT,Not specified,2023-09-05,2023-09-06,NaT,NaT,NaT,2023-09-06,43.416273,5.214627
4,No project for now,Astek,Data Scientist,Not specified,LinkedIn,Not specified,2023-08-31,Spontaneous,2023-09-01,2023-09-07,NaT,NaT,NaT,2023-09-07,,


In [32]:
df_jsa_10_2023_p.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 447 entries, 0 to 446
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   status         447 non-null    object        
 1   company        444 non-null    object        
 2   role           447 non-null    object        
 3   location       447 non-null    object        
 4   source         447 non-null    object        
 5   attendance     447 non-null    object        
 6   application    437 non-null    datetime64[ns]
 7   type           447 non-null    object        
 8   phone_call     15 non-null     datetime64[ns]
 9   1st_interview  17 non-null     datetime64[ns]
 10  2nd_interview  4 non-null      datetime64[ns]
 11  3rd_interview  1 non-null      datetime64[ns]
 12  proposition    4 non-null      datetime64[ns]
 13  final_answer   146 non-null    datetime64[ns]
 14  location_lat   419 non-null    float64       
 15  location_long  419 non-null    float64       
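
The non-null counts listed above can be read as a rough recruitment funnel. The sketch below copies those counts by hand and expresses each stage as a share of dated applications. It is only an approximation: a few rows (for example reach-outs) have an interview date but no application date.

```python
import pandas as pd

# Non-null counts copied from the info() output above.
stage_counts = pd.Series({
    "application": 437,
    "phone_call": 15,
    "1st_interview": 17,
    "2nd_interview": 4,
    "3rd_interview": 1,
    "proposition": 4,
})

# Share of each stage relative to the number of dated applications, in percent.
share_of_applications = (stage_counts / stage_counts["application"] * 100).round(1)

print(share_of_applications)
```

On the real DataFrame, the same counts should come from calling `count()` on the stage columns.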

#### Insights

In [33]:
status_counts = df_jsa_10_2023_p["status"].value_counts()

px.pie(values=status_counts.values, names=status_counts.index,
       title="Application status distribution")\
  .update_traces(textinfo="value")

In [34]:
source_counts = df_jsa_10_2023_p["source"].value_counts()

px.pie(values=source_counts.values, names=source_counts.index,
       title="Application source distribution")

In [35]:
type_counts = df_jsa_10_2023_p["type"].value_counts()

px.pie(values=type_counts.values, names=type_counts.index,
       title="Application type distribution")

In [36]:
# value_counts() ignores missing dates by default, so no explicit filter is needed
application_per_day_counts = df_jsa_10_2023_p["application"].value_counts().sort_index()

px.line(application_per_day_counts,
        title="Evolution of daily applications",
        labels={"application": "date", "value": "number of applications"})\
  .update_layout(showlegend=False)
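
One caveat with this chart: `value_counts()` only returns days with at least one application, so the line jumps straight over days where I sent nothing. A possible fix, sketched here on toy dates (not my real data), is to reindex on a daily frequency and fill the gaps with zero.

```python
import pandas as pd

# Toy application dates: two on 30 Aug, none on 31 Aug, one on 1 Sep.
applications = pd.Series(pd.to_datetime(["2023-08-30", "2023-08-30", "2023-09-01"]))

per_day = applications.value_counts().sort_index()

# asfreq("D") inserts the missing calendar days; fill_value=0 marks them as zero.
per_day_full = per_day.asfreq("D", fill_value=0)

print(per_day_full)
```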

In [37]:
applications_not_null = df_jsa_10_2023_p.dropna(subset=["application"])
duration_to_final_answer = applications_not_null["final_answer"] - applications_not_null["application"]

duration_to_final_answer = duration_to_final_answer.dt.days

px.histogram(duration_to_final_answer, nbins=50,
             title="Application to final answer response time distribution",
             labels={"value": "number of days", "count": "number of applications"})\
  .update_layout(showlegend=False, xaxis={"dtick": 2})\
  .add_vline(x=duration_to_final_answer.median(),
             line_width=2, line_dash="dash", line_color="goldenrod", opacity=1,
             annotation_text="median")\
  .add_vline(x=duration_to_final_answer.mean(),
             line_width=2, line_dash="dash", line_color="limegreen", opacity=1,
             annotation_text="mean")
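
To show how the durations behave when an application never got a final answer, here is a minimal sketch on made-up rows. Subtracting from `NaT` gives a missing duration, which `median()` and `mean()` skip.

```python
import pandas as pd

# Made-up rows: the second application never received a final answer.
toy = pd.DataFrame({
    "application": pd.to_datetime(
        ["2023-08-30", "2023-08-30", "2023-08-31", "2023-09-04"]),
    "final_answer": pd.to_datetime(
        ["2023-10-26", None, "2023-09-07", "2023-09-06"]),
})

# Timedelta in whole days; unanswered applications become NaN.
days = (toy["final_answer"] - toy["application"]).dt.days

# Keep only answered applications before summarizing.
answered = days.dropna()

print(answered.median(), answered.mean())
```

One slow answer is enough to pull the mean well above the median, which is why I draw both lines on the histogram.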

In [49]:
application_per_locations = (
    df_jsa_10_2023_p.groupby("location")
    .agg(number_of_application=("location", "size"),
         lat=("location_lat", "first"),
         long=("location_long", "first"))
    .reset_index()
)

px.scatter_geo(application_per_locations,
               lat="lat", lon="long", scope="europe",
               size="number_of_application",
               color="number_of_application", color_continuous_scale=px.colors.sequential.solar,
               hover_name="location",
               title="Application location distribution")
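
The aggregation above uses named aggregation: each keyword becomes an output column built from a `(column, function)` pair. Here is a sketch on a toy frame (coordinates taken from the earlier output). The final `dropna` is my own suggestion for making explicit that the "Not specified" group has no coordinates and cannot be placed on the map.

```python
import pandas as pd

toy = pd.DataFrame({
    "location": ["Rennes", "Rennes", "Toulouse", "Not specified"],
    "location_lat": [48.111339, 48.111339, 43.604462, float("nan")],
    "location_long": [-1.68002, -1.68002, 1.444247, float("nan")],
})

per_location = (
    toy.groupby("location")
    .agg(number_of_application=("location", "size"),  # rows per location
         lat=("location_lat", "first"),               # first non-null latitude
         long=("location_long", "first"))             # first non-null longitude
    .reset_index()
)

# Groups without coordinates cannot be drawn, so drop them explicitly.
mappable = per_location.dropna(subset=["lat", "long"])

print(mappable)
```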