## Market Basket Analysis on the [LinkedIn job skills](https://www.kaggle.com/datasets/asaniczka/1-3m-linkedin-jobs-and-skills-2024) dataset

### Author: Adriano Meligrana

In [1]:
import os
import pyspark

In [2]:
# logging in to Kaggle and downloading the dataset

# Kaggle environment
os.environ['KAGGLE_USERNAME'] = "xxxxxx"
os.environ['KAGGLE_KEY'] = "xxxxxx"

#dataset download
!kaggle datasets download -d asaniczka/1-3m-linkedin-jobs-and-skills-2024

#dataset unzip
!unzip 1-3m-linkedin-jobs-and-skills-2024.zip -d data

/bin/bash: line 1: kaggle: command not found
unzip:  cannot find or open 1-3m-linkedin-jobs-and-skills-2024.zip, 1-3m-linkedin-jobs-and-skills-2024.zip.zip or 1-3m-linkedin-jobs-and-skills-2024.zip.ZIP.
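
The output above shows that the `kaggle` command was not on the `PATH`, so both the download and the unzip step failed. A minimal sketch of a guard that could run before the download, assuming the CLI comes from the `kaggle` Python package (the helper name `cli_available` is hypothetical):

```python
import shutil

def cli_available(name: str) -> bool:
    """Return True if an executable called `name` is found on the PATH."""
    return shutil.which(name) is not None

if not cli_available("kaggle"):
    # Assumption: the CLI is provided by the `kaggle` package on PyPI.
    print("kaggle CLI not found; install it (e.g. `pip install kaggle`) before downloading.")
```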


In [3]:
#creating the Spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .master('local[*]') \
    .config("spark.driver.memory", "15g") \
    .appName('apriori') \
    .getOrCreate()

#importing the dataset into spark
df = spark.read.csv("./job_skills.csv", header=True, sep=",")
df.show()

24/06/10 18:27:25 WARN Utils: Your hostname, bob-Victus-by-HP-Gaming-Laptop-15-fb0xxx resolves to a loopback address: 127.0.1.1; using 172.18.84.99 instead (on interface wlo1)
24/06/10 18:27:25 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/06/10 18:27:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


+--------------------+--------------------+
|            job_link|          job_skills|
+--------------------+--------------------+
|https://www.linke...|Building Custodia...|
|https://www.linke...|Customer service,...|
|https://www.linke...|Applied Behavior ...|
|https://www.linke...|Electrical Engine...|
|https://www.linke...|Electrical Assemb...|
|https://www.linke...|Access Control, V...|
|https://www.linke...|Consultation, Sup...|
|https://www.linke...|Veterinary Recept...|
|https://www.linke...|Optical Inspectio...|
|https://www.linke...|HVAC, troubleshoo...|
|https://www.linke...|Host/Server Assis...|
|https://www.linke...|Apartment mainten...|
|https://www.linke...|Fiber Optic Cable...|
|https://www.linke...|CT Technologist, ...|
|https://ca.linked...|SAP, DRMIS, Data ...|
|https://www.linke...|Debt and equity o...|
|https://ca.linked...|Biomedical Engine...|
|https://www.linke...|Laboratory Techni...|
|https://www.linke...|Program Managemen...|
|https://www.linke...|Hiring, Tr

In [4]:
import gc
from pyspark.sql import Row

def preprocessing(row):
    skills_str = row.job_skills.lower()
    skills_str = skills_str.replace(" capacities", "")
    skills_str = skills_str.replace(" abilities", "")
    skills_str = skills_str.replace(" skills", "")
    skills_str = skills_str.replace("decisionmaking", "decision making")
    skills_str = skills_str.replace("problemsolving", "problem solving")
    skills_str = skills_str.replace("teamwork", "team work")
    skills_str = skills_str.replace("problem-solving", "problem solving")
    skills = list(set(skills_str.split(", ")))
    row_dict = row.asDict()
    row_dict['job_skills'] = skills
    return Row(**row_dict)

n = df.count()
print(f"Dataset length: {n}")
df = df.na.drop()
print(f"number of dropped NA values: {n - df.count()}")
df = df.drop("job_link")
df = df.rdd.map(lambda row: preprocessing(row)).toDF()
df = df.repartition(6)

Dataset length: 1296381

number of dropped NA values: 2007
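
To make the effect of `preprocessing` concrete, here is a small Spark-free sketch that applies the same replacements, in the same order, to a single string. The helper name `normalize_skills` and the example string are illustrative assumptions, and the result is sorted only to make the printed output stable (the notebook keeps the set order).

```python
def normalize_skills(skills_str):
    """Pure-Python mirror of the replacements applied in `preprocessing`."""
    s = skills_str.lower()
    # Drop generic suffixes so "communication skills" becomes "communication".
    for suffix in (" capacities", " abilities", " skills"):
        s = s.replace(suffix, "")
    # Map spelling variants onto a single form.
    for old, new in (
        ("decisionmaking", "decision making"),
        ("problemsolving", "problem solving"),
        ("teamwork", "team work"),
        ("problem-solving", "problem solving"),
    ):
        s = s.replace(old, new)
    # A set removes the duplicates created by the normalization.
    return sorted(set(s.split(", ")))

example = "Communication Skills, Teamwork, Problem-Solving, problem solving, Leadership abilities"
print(normalize_skills(example))
# ['communication', 'leadership', 'problem solving', 'team work']
```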


### Implementation of the $\mathrm{A}$-$\mathrm{Priori}$ $\mathrm{Algorithm}$

In [12]:
from operator import add 
from itertools import combinations
import timeit
import pandas as pd
import numpy as np
import gc

def apriori(df, basket_col, support_threshold = 0.01, max_frequent = 6, sample_fraction = 0.1, seed = 42, cache = True, quiet = False):
  """
  apriori(df, basket_col, support_threshold = 0.01, max_frequent = 6, sample_fraction = 0.1, seed = 42, cache = True, quiet = False)

    df                -> a PySpark DataFrame object containing baskets.
    basket_col        -> the column of the dataframe containing baskets.
    support_threshold -> the support threshold as a fraction of the number of baskets, e.g. 1% -> support_threshold = 0.01.
    max_frequent      -> the largest itemset size to check, e.g. if max_frequent = 2, only singletons and pairs will be checked.
    sample_fraction   -> fraction of baskets sampled to find frequent itemsets (1 means no sampling).
    seed              -> seed used when sampling.
    cache             -> whether the pruned baskets are persisted between passes.
    quiet             -> if True, no information on the progression is printed on screen.

  Returns a list of pandas DataFrames, one per itemset size, sorted by decreasing frequency.
  """

  if sample_fraction != 1:
      df = df.sample(False, fraction = sample_fraction, seed = seed)

  rdd = df.rdd.map(lambda row: row[basket_col])
  size = rdd.count()
  support = int(size*support_threshold) + 1
  uniques = {e: i for i, e in enumerate(set(item for basket in rdd.collect() for item in basket))}
  uniques_vec = list(uniques.keys())
  rdd = rdd.map(lambda basket: [uniques[item] for item in basket])
      
  frequent_items_dfs = []
  for i in range(max_frequent):
    if not quiet: print(f"Checking frequent sets of {i+1} elements")
      
    t0 = timeit.default_timer()

    frequent_items_values = None
    gc.collect()
      
    if i != 0:
        rdd = rdd.map(lambda basket: compute_frequents(basket, frequent_items, i))
        rdd = rdd.filter(lambda basket: len(basket) > i)
        if cache: rdd = rdd.persist()

    n_pass = rdd.flatMap(lambda basket: ((items, 1) for items in combinations(basket, i+1))).persist()
    n_pass = n_pass.reduceByKey(add)
    n_pass = n_pass.filter(lambda key_value: key_value[1] >= support)

    frequent_items_values = n_pass.collect()
    frequent_items = set(key_values[0] for key_values in frequent_items_values)
    frequent_items_values = sorted(frequent_items_values, key = lambda key_values: key_values[1], reverse = True)
    frequent_items_values = [(*[uniques_vec[k] for k in kv[0]], kv[1]) for kv in frequent_items_values]
    fdf = pd.DataFrame.from_records(frequent_items_values, columns=[*[f'item_{s+1}' for s in range(i+1)], "frequency"])
    frequent_items_dfs.append(fdf)

    t1 = timeit.default_timer()

    if not quiet: 
        print(f"Step number {i+1} completed in {round(t1-t0, 2)} seconds.")
        print(f"Number of frequent sets of {i+1} elements: {len(frequent_items_values)}\n")
        display(fdf)
    
  return frequent_items_dfs

def compute_frequents(basket, frequent_items, i):
    new_basket = set()
    for items in combinations(basket, i):
        if items in frequent_items:
            for item in items:
                if item not in new_basket:
                    new_basket.add(item)
    return sorted(new_basket)
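
The level-wise structure of `apriori` and the pruning done by `compute_frequents` can be followed more easily on a handful of in-memory baskets. The sketch below is a simplified, Spark-free illustration and not the notebook's implementation: the function name `apriori_local`, the `min_count` parameter (an absolute count instead of a fraction) and the toy baskets are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def apriori_local(baskets, min_count, max_size=3):
    """Level-wise frequent itemset search over in-memory baskets."""
    baskets = [sorted(set(b)) for b in baskets]
    frequent_by_size = []
    frequent = None
    for k in range(1, max_size + 1):
        if k > 1:
            # Prune: keep only items that belong to some frequent (k-1)-itemset
            # contained in the basket, then drop baskets with fewer than k items.
            pruned = []
            for b in baskets:
                keep = sorted({x for c in combinations(b, k - 1) if c in frequent for x in c})
                if len(keep) >= k:
                    pruned.append(keep)
            baskets = pruned
        # Count every candidate k-itemset and keep those reaching the threshold.
        counts = Counter(c for b in baskets for c in combinations(b, k))
        level = {c: n for c, n in counts.items() if n >= min_count}
        if not level:
            break
        frequent = set(level)
        frequent_by_size.append(level)
    return frequent_by_size

toy_baskets = [
    ["communication", "team work", "problem solving"],
    ["communication", "problem solving"],
    ["communication", "team work"],
    ["leadership", "communication", "problem solving"],
    ["sales"],
]
for level in apriori_local(toy_baskets, min_count=2):
    print(level)
# {('communication',): 4, ('problem solving',): 3, ('team work',): 2}
# {('communication', 'problem solving'): 3, ('communication', 'team work'): 2}
```

Baskets are kept sorted so that each itemset has a single tuple representation, which is also why `compute_frequents` returns `sorted(new_basket)`.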

In [17]:
%%time
# the algorithm runs on the full dataset on a machine with 16 GB of RAM: the six passes take about 66 seconds in total (1 min 24 s wall time for the whole cell)
results = apriori(df, "job_skills", support_threshold=0.01, max_frequent=6, sample_fraction=1, seed=42, cache=True, quiet = False)

Checking frequent sets of 1 elements




Step number 1 completed in 18.32 seconds.
Number of frequent sets of 1 elements: 186


| | item_1 | frequency |
|---|---|---|
| 0 | communication | 556216 |
| 1 | problem solving | 316069 |
| 2 | customer service | 290004 |
| 3 | team work | 251963 |
| 4 | leadership | 205542 |
| ... | ... | ... |
| 181 | english | 13403 |
| 182 | rn license | 13361 |
| 183 | ethics | 13102 |
| 184 | travel nursing | 13004 |


Checking frequent sets of 2 elements




Step number 2 completed in 15.46 seconds.
Number of frequent sets of 2 elements: 298


| | item_1 | item_2 | frequency |
|---|---|---|---|
| 0 | communication | problem solving | 257218 |
| 1 | team work | communication | 200249 |
| 2 | communication | customer service | 195320 |
| 3 | communication | leadership | 151171 |
| 4 | team work | problem solving | 142269 |
| ... | ... | ... | ... |
| 293 | time management | microsoft office | 13029 |
| 294 | attention to detail | decision making | 13005 |
| 295 | budget management | communication | 13004 |
| 296 | planning | problem solving | 12984 |


Checking frequent sets of 3 elements




Step number 3 completed in 13.83 seconds.
Number of frequent sets of 3 elements: 235


| | item_1 | item_2 | item_3 | frequency |
|---|---|---|---|---|
| 0 | team work | communication | problem solving | 128920 |
| 1 | communication | customer service | problem solving | 105967 |
| 2 | communication | leadership | problem solving | 93025 |
| 3 | team work | communication | customer service | 87327 |
| 4 | communication | time management | problem solving | 83301 |
| ... | ... | ... | ... | ... |
| 230 | troubleshooting | communication | problem solving | 13031 |
| 231 | team work | leadership | adaptability | 13023 |
| 232 | communication | project management | analytical | 12999 |
| 233 | patient care | communication | leadership | 12956 |


Checking frequent sets of 4 elements




Step number 4 completed in 12.72 seconds.
Number of frequent sets of 4 elements: 84


| | item_1 | item_2 | item_3 | item_4 | frequency |
|---|---|---|---|---|---|
| 0 | team work | communication | customer service | problem solving | 59328 |
| 1 | team work | communication | leadership | problem solving | 50491 |
| 2 | team work | communication | time management | problem solving | 50170 |
| 3 | attention to detail | team work | communication | problem solving | 46984 |
| 4 | communication | leadership | customer service | problem solving | 40666 |
| ... | ... | ... | ... | ... | ... |
| 79 | team work | leadership | customer service | time management | 13208 |
| 80 | team work | communication | adaptability | time management | 13196 |
| 81 | team work | communication | problem solving | analytical | 13157 |
| 82 | critical thinking | communication | leadership | problem solving | 12966 |


Checking frequent sets of 5 elements




Step number 5 completed in 4.91 seconds.
Number of frequent sets of 5 elements: 12


| | item_1 | item_2 | item_3 | item_4 | item_5 | frequency |
|---|---|---|---|---|---|---|
| 0 | attention to detail | team work | communication | customer service | problem solving | 24399 |
| 1 | team work | communication | customer service | time management | problem solving | 24283 |
| 2 | attention to detail | team work | communication | time management | problem solving | 24177 |
| 3 | team work | communication | leadership | customer service | problem solving | 24096 |
| 4 | team work | communication | leadership | time management | problem solving | 21844 |
| 5 | attention to detail | communication | customer service | time management | problem solving | 18255 |
| 6 | communication | leadership | customer service | time management | problem solving | 17656 |
| 7 | team work | communication | customer service | sales | problem solving | 16568 |
| 8 | attention to detail | team work | communication | leadership | problem solving | 16006 |
| 9 | attention to detail | team work | communication | customer service | time management | 14807 |


Checking frequent sets of 6 elements
Step number 6 completed in 0.72 seconds.
Number of frequent sets of 6 elements: 1



| | item_1 | item_2 | item_3 | item_4 | item_5 | item_6 | frequency |
|---|---|---|---|---|---|---|---|
| 0 | attention to detail | team work | communication | customer service | time management | problem solving | 13192 |


CPU times: user 11.8 s, sys: 1.68 s, total: 13.5 s
Wall time: 1min 24s

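As a sanity check on the tables above, the absolute support count used by `apriori` can be recomputed by hand. The sketch below assumes that the number of baskets equals the dataset length minus the dropped NA rows printed earlier, which holds here because `sample_fraction=1` disables sampling.

```python
n_rows = 1296381         # "Dataset length" printed by the preprocessing cell
n_dropped = 2007         # "number of dropped NA values" printed by the same cell
support_threshold = 0.01

size = n_rows - n_dropped
support = int(size * support_threshold) + 1   # same formula as in apriori()
print(size, support)  # 1294374 12944

# Relative support of the most frequent singleton and pair reported above.
print(round(556216 / size, 3))  # communication -> 0.43
print(round(257218 / size, 3))  # communication + problem solving -> 0.199
```

Every frequency shown in the tables is at least 12944, which is consistent with this threshold.
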

In [15]:
%%time
results = apriori(df, "job_skills", support_threshold=0.01, max_frequent=6, sample_fraction=0.02, seed=42, cache=True, quiet = True)

CPU times: user 484 ms, sys: 17.1 ms, total: 501 ms
Wall time: 11 s
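
Running on a 2% sample is much faster, but the itemsets found on a sample may differ from those of the full run. A minimal sketch of how the two outputs could be compared, assuming each table is turned into plain tuples first (for a pandas DataFrame `fdf`, `fdf.itertuples(index=False, name=None)` yields such rows). The helper name `itemset_overlap` and the toy rows are hypothetical and do not come from the dataset.

```python
def itemset_overlap(full_rows, sample_rows):
    """Compare two collections of (item_1, ..., item_k, frequency) rows.

    Item order is ignored, since the same itemset may be listed in a
    different order in different runs.
    """
    full = {frozenset(row[:-1]) for row in full_rows}
    sample = {frozenset(row[:-1]) for row in sample_rows}
    return {
        "in_both": full & sample,
        "false_positives": sample - full,   # frequent only in the sample
        "false_negatives": full - sample,   # frequent in the full data, missed by the sample
    }

# Toy rows with made-up items and counts, for illustration only.
full_pairs = [("a", "b", 120), ("a", "c", 90)]
sample_pairs = [("b", "a", 3), ("b", "c", 2)]

report = itemset_overlap(full_pairs, sample_pairs)
print(len(report["in_both"]), len(report["false_positives"]), len(report["false_negatives"]))
# 1 1 1
```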
