# Preprocessing: Skill & Certification

**`Goal:`** Feature engineering and preprocessing of the skill and certification columns in preparation for the matching procedure


**EDA tasks**
- Examine gender differences in skills (e.g. Python for men vs. women)
- Analyze skills for people who have had at least 20 jobs
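
The first EDA task could be approached by comparing mean per-category skill counts by gender. This is a minimal sketch, assuming a frame with a 0/1 `gender` column and one count column per skill group, as in the categorized file. The toy values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data for illustration only; these values are made up.
toy = pd.DataFrame({
    "gender": [0, 0, 1, 1],
    "technical_programming_skills": [4, 2, 1, 1],
    "design_skills": [0, 2, 5, 3],
})

skill_cols = ["technical_programming_skills", "design_skills"]

# Mean number of skills per category, split by gender
mean_by_gender = toy.groupby("gender")[skill_cols].mean()

# Difference between the two groups (gender == 1 minus gender == 0)
gender_gap = mean_by_gender.loc[1] - mean_by_gender.loc[0]

print(mean_by_gender)
print(gender_gap)
```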

**Preprocessing tasks**
- ~~Collapse certifications into number of certifications~~
- ~~Dummy variable for each skill – then exact matching on skills~~
- ~~Make grouping of skills.~~
    - ~~Table of grouping in appendix~~

**Thoughts**
- The diversity of certifications does not matter -> does the number matter?
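
If only the number of certifications matters, the certification columns can be collapsed into a single count plus a certified/not-certified indicator. This is a rough sketch, assuming the four certification columns hold per-category counts. The names `num_certifications` and `is_certified` are hypothetical and chosen here for illustration.

```python
import pandas as pd

cert_cols = ["language_certifications", "freelancer_certifications",
             "general_skill_certifications", "programming_certifications"]

# Toy rows for illustration only
toy = pd.DataFrame([[0, 0, 0, 0], [1, 0, 0, 0], [2, 1, 0, 1]], columns=cert_cols)

# Total number of certifications per person, ignoring which kind they are
toy["num_certifications"] = toy[cert_cols].sum(axis=1)

# 1 if the person holds at least one certification, else 0
toy["is_certified"] = (toy["num_certifications"] > 0).astype(int)

print(toy[["num_certifications", "is_certified"]])
```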

**ANTICIPATED LEVEL OF MATCHING DIFFICULTY (LOW TO HIGH):**
1. Number of skills and certified/not-certified: `collapsed_num_skills_certified_indicator.csv`
2. Number of skills and number of certifications: `collapsed_num_skills_num_certifications.csv`
3. Categorized skills and certifications in broader groups: `skills_certifications_categorized.csv`
4. Skills and certifications held by fewer than 30 people removed, then dummified (whether or not the skill/certification was used in a job): `matchable_individual_skills_certifications_dummified.csv`
5. Skills and certifications held by fewer than 30 people removed (without dummification): `matchable_individual_skills_certifications.csv`
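
Levels 4 and 5 can be sketched as a frequency filter followed by an optional binarization. This is an illustration under assumptions, not the code that produced the files listed above. It assumes one column per individual skill, holding the number of jobs in which the skill was used. The helper names, the `min_holders` parameter, and the toy column names are hypothetical.

```python
import pandas as pd

def drop_rare_columns(frame, cols, min_holders=30):
    """Keep only the columns in `cols` that are non-zero for at least `min_holders` rows."""
    holders = (frame[cols] > 0).sum(axis=0)
    kept = holders[holders >= min_holders].index.tolist()
    return frame[kept]

def dummify(frame):
    """Turn counts into 0/1 indicators: 1 if the skill was used at least once."""
    return (frame > 0).astype(int)

# Toy data with a small threshold so the effect is visible
toy = pd.DataFrame({"python": [3, 0, 1, 2], "cobol": [0, 0, 0, 5]})
common = drop_rare_columns(toy, ["python", "cobol"], min_holders=2)
dummies = dummify(common)

print(common.columns.tolist())
print(dummies)
```

Keeping the filter and the binarization as separate steps means the same filtered frame can feed both the dummified and the non-dummified variant.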

### a. Load packages/libraries

In [None]:
import pandas as pd

### b. Load data

In [None]:
#df = pd.read_csv('../data/processed/skills_certifications_categorized_skill_count_female_treatment.csv', low_memory=False)
df = pd.read_csv('../data/processed/skills_certifications_categorized.csv', low_memory=False)

# Quick preview
df.head()

Unnamed: 0,search_query,name,gender,join_date_from_earliest,location_size,hourly_rate,pay_grade,avg_rating,num_reviews,num_recommendations,...,management_skills,marketing_business_skills,performance_arts_skills,design_skills,teaching_training_skills,miscellaneous_skills,language_certifications,freelancer_certifications,general_skill_certifications,programming_certifications
0,2,Milen,1,7063,1,45,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0
1,2,Jeremy,1,7526,1,90,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0
2,2,Nichole,0,6430,0,25,4.0,5.0,2,0,...,0.0,0.0,0.0,5.0,0.0,0.0,1,0,0,0
3,2,Robert,1,3238,1,75,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0,0
4,2,Jean-Paul,1,6661,5,19,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0


In [None]:
df.columns

Index(['search_query', 'name', 'gender', 'join_date_from_earliest',
       'location_size', 'hourly_rate', 'pay_grade', 'avg_rating',
       'num_reviews', 'num_recommendations', 'pct_jobs_completed',
       'pct_on_budget', 'pct_on_time', 'verification_preferred_freelancer',
       'verification_identity_verified', 'verification_payment_verified',
       'verification_phone_verified', 'verification_email_verified',
       'verification_facebook_connected', 'badge_plus_membership',
       'badge_preferred_freelancer', 'badge_verified', 'engineering_skills',
       'writing_skills', 'technical_programming_skills',
       'language_translation_skills', 'finance_accounting_skills',
       'management_skills', 'marketing_business_skills',
       'performance_arts_skills', 'design_skills', 'teaching_training_skills',
       'miscellaneous_skills', 'language_certifications',
       'freelancer_certifications', 'general_skill_certifications',
       'programming_certifications'],
      dtype=

In [None]:
# "Winner takes all": assign each person the skill category with the highest count.
winner_takes_all_skill_col = []      # name of the winning skill column
winner_takes_all_skill_col_num = []  # position of that column in skill_columns

skill_columns = ['engineering_skills','writing_skills', 'technical_programming_skills',
                 'language_translation_skills', 'finance_accounting_skills','management_skills',
                 'marketing_business_skills','performance_arts_skills', 'design_skills',
                 'teaching_training_skills','miscellaneous_skills']

for row in df.to_dict(orient="records"):
    # Collect (count, column name, column index) for every skill category
    skill_counts = []
    for idx, skill in enumerate(skill_columns):
        skill_counts.append((row[skill], skill, idx))

    # Sort by count, descending; the sort is stable, so ties go to the earliest column
    highest_count = sorted(skill_counts, key=lambda x: x[0], reverse=True)[0]

    winner_takes_all_skill_col.append(highest_count[1])
    winner_takes_all_skill_col_num.append(highest_count[2])
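
The loop above can also be written with vectorized pandas calls. This is a sketch of an equivalent, assuming the skill columns contain no missing values. `DataFrame.idxmax(axis=1)` returns the first column holding the row maximum, which matches the loop's tie-breaking. The toy frame and the shortened column list are for illustration only.

```python
import pandas as pd

# Shortened column list for the example
skill_columns = ["engineering_skills", "writing_skills", "design_skills"]

toy = pd.DataFrame({
    "engineering_skills": [0.0, 1.0, 2.0],
    "writing_skills": [0.0, 3.0, 2.0],
    "design_skills": [0.0, 3.0, 5.0],
})

# Name of the first column that holds the row-wise maximum
skill_category = toy[skill_columns].idxmax(axis=1)

# Position of that column within skill_columns
skill_category_num = skill_category.map({name: i for i, name in enumerate(skill_columns)})

print(skill_category.tolist())
print(skill_category_num.tolist())
```

In both versions, a row whose counts are all zero is assigned the first column in the list, which may be worth handling separately before matching.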

In [None]:
df["skill_category"] = winner_takes_all_skill_col
df["skill_category_num"] = winner_takes_all_skill_col_num
df.head()

Unnamed: 0,search_query,name,gender,join_date_from_earliest,location_size,hourly_rate,pay_grade,avg_rating,num_reviews,num_recommendations,...,performance_arts_skills,design_skills,teaching_training_skills,miscellaneous_skills,language_certifications,freelancer_certifications,general_skill_certifications,programming_certifications,skill_category,skill_category_num
0,2,Milen,1,7063,1,45,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0,0,0,0,engineering_skills,0
1,2,Jeremy,1,7526,1,90,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,1,0,0,0,engineering_skills,0
2,2,Nichole,0,6430,0,25,4.0,5.0,2,0,...,0.0,5.0,0.0,0.0,1,0,0,0,design_skills,8
3,2,Robert,1,3238,1,75,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0,1,0,0,engineering_skills,0
4,2,Jean-Paul,1,6661,5,19,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0,0,0,0,engineering_skills,0


In [None]:
df["skill_category"] = winner_takes_all_skill_col
df["skill_category_num"] = winner_takes_all_skill_col_num
df.head()

Unnamed: 0,search_query,name,gender,join_date_from_earliest,location_size,hourly_rate,pay_grade,avg_rating,num_reviews,num_recommendations,...,performance_arts_skills,design_skills,teaching_training_skills,miscellaneous_skills,language_certifications,freelancer_certifications,general_skill_certifications,programming_certifications,skill_category,skill_category_num
0,2,Milen,0,7063,1,45,0.0,0.0,0,0,...,0,8,0,1,0,0,0,0,design_skills,8
1,2,Jeremy,0,7526,1,90,0.0,0.0,0,0,...,0,18,0,0,1,0,0,0,design_skills,8
2,2,Nichole,1,6430,0,25,4.0,5.0,2,0,...,1,16,0,0,1,0,0,0,design_skills,8
3,2,Robert,0,3238,1,75,0.0,0.0,0,0,...,0,5,0,0,0,1,0,0,technical_programming_skills,2
4,2,Jean-Paul,0,6661,5,19,0.0,0.0,0,0,...,0,6,0,0,0,0,0,0,design_skills,8


In [None]:
df.to_csv('../data/processed/winner_takes_all_v1.csv',index=False)
