# Preprocessing: Skill & Certification

**`Goal:`** Engineer and preprocess the skill and certification columns in preparation for the matching procedure


**EDA tasks**
- Examine gender differentials in skills (e.g. Python for men vs. women)
- Analyze skills for people who have had at least 20 jobs
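
A minimal sketch of the first EDA task, assuming (as in the counting step in section d.1) that a non-null value in a skill column means the freelancer lists that skill. The helper `skill_share_by_gender` and the toy data are illustrative and not part of the original analysis; the 0/1 coding of `gender` is taken as given.

```python
import pandas as pd

def skill_share_by_gender(df, skill_cols, gender_col="gender"):
    """Share of each gender group that lists each skill (non-null = listed)."""
    listed = df[skill_cols].notna()
    # Rows: skills, columns: gender codes
    return listed.groupby(df[gender_col]).mean().T

# Toy data (made up): NaN means the skill is not listed
toy = pd.DataFrame({
    "gender": [0, 0, 1, 1],
    "skill_python": [3.0, None, 1.0, 2.0],
    "skill_logo_design": [None, None, 5.0, None],
})

shares = skill_share_by_gender(toy, ["skill_python", "skill_logo_design"])
# Difference in shares between the two gender codes
shares["gap"] = shares[1] - shares[0]
print(shares)
```

On the real data, the call would presumably be `skill_share_by_gender(df, skill_cols)`, sorted by `gap` to surface the skills with the largest differentials.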

**Preprocessing tasks**
- ~~Collapse certifications into number of certifications~~
- ~~Dummy variable for each skill – then exact matching on skills~~
- ~~Make grouping of skills.~~
    - ~~Table of grouping in appendix~~

**Thoughts**
- The diversity of certifications does not matter; does the number matter?

**ANTICIPATED LEVEL OF MATCHING DIFFICULTY (LOW TO HIGH):**
1. Number of skills and certified/not-certified: `collapsed_num_skills_certified_indicator.csv`
2. Number of skills and number of certifications: `collapsed_num_skills_num_certifications.csv`
3. Categorized skills and certifications in broader groups: `skills_certifications_categorized.csv`
4. Skills and certifications held by fewer than 30 people removed, then dummified (1 if the skill/certification has been used in a job, 0 otherwise): `matchable_individual_skills_certifications_dummified.csv`
5. Skills and certifications held by fewer than 30 people removed (without dummification): `matchable_individual_skills_certifications.csv`

### a. Load packages/libraries

#Download word vector
!python -m spacy download en_core_web_lg

In [None]:
import pandas as pd
import spacy

#Load the word-vector model (needed for the skill similarity pass in section d.4)
spacy_nlp = spacy.load('en_core_web_lg')

### b. Load data

In [None]:
df = pd.read_csv('../data/interim/encoded_female_treatment.csv', low_memory=False)

#Quick preview
df.head()

Unnamed: 0,search_query,name,gender,join_date_from_earliest,location_size,hourly_rate,pay_grade,avg_rating,num_reviews,num_recommendations,...,skill_workday_security,skill_oracle_ebs_tech_integration,pct_certifications_google_webmaster_central_1,skill_modx,skill_cubecart,skill_phaser,skill_drilling_engineering,skill_casperjs,badge_preferred_freelancer,badge_verified
0,2,Milen,0,7063,1,45,0.0,0.0,0,0,...,,,,,,,,,0,0
1,2,Jeremy,0,7526,1,90,0.0,0.0,0,0,...,,,,,,,,,0,0
2,2,Nichole,1,6430,0,25,4.0,5.0,2,0,...,,,,,,,,,0,0
3,2,Robert,0,3238,1,75,0.0,0.0,0,0,...,,,,,,,,,0,0
4,2,Jean-Paul,0,6661,5,19,0.0,0.0,0,0,...,,,,,,,,,0,0


### c. Certifications & Skills

In [None]:
#Get the certification columns
certification_cols = [col for col in df.columns if 'pct_certifications' in col]

#Get the skill columns
skill_cols = [col for col in df.columns if 'skill' in col]

#How many are there?
print('certifications:', len(certification_cols))
print('skills:', len(skill_cols))

certifications: 120
skills: 2123


### d. Preprocessing

#### 1. Collapsing into number of skills and number of certifications

In [None]:
collapsed_df = df.copy()

#Count the number of listed skills and certifications for each freelancer
collapsed_df['num_certifications'] = collapsed_df.loc[:,certification_cols].notnull().sum(axis=1)
collapsed_df['num_skills'] = collapsed_df.loc[:,skill_cols].notnull().sum(axis=1)

#Remove the columns for individual skills and certifications
#1. Skills
collapsed_df = collapsed_df.drop(columns = skill_cols)
#2. Certifications
collapsed_df = collapsed_df.drop(columns = certification_cols)

collapsed_df.shape


(9766, 24)

In [None]:
collapsed_df.columns

Index(['search_query', 'name', 'gender', 'join_date_from_earliest',
       'location_size', 'hourly_rate', 'pay_grade', 'avg_rating',
       'num_reviews', 'num_recommendations', 'pct_jobs_completed',
       'pct_on_budget', 'pct_on_time', 'verification_preferred_freelancer',
       'verification_identity_verified', 'verification_payment_verified',
       'verification_phone_verified', 'verification_email_verified',
       'verification_facebook_connected', 'badge_plus_membership',
       'badge_preferred_freelancer', 'badge_verified', 'num_certifications',
       'num_skills'],
      dtype='object')

In [None]:
collapsed_df.isna().sum()

search_query                         0
name                                 0
gender                               0
join_date_from_earliest              0
location_size                        0
hourly_rate                          0
pay_grade                            0
avg_rating                           0
num_reviews                          0
num_recommendations                  0
pct_jobs_completed                   0
pct_on_budget                        0
pct_on_time                          0
verification_preferred_freelancer    0
verification_identity_verified       0
verification_payment_verified        0
verification_phone_verified          0
verification_email_verified          0
verification_facebook_connected      0
badge_plus_membership                0
badge_preferred_freelancer           0
badge_verified                       0
num_certifications                   0
num_skills                           0
dtype: int64

In [None]:
collapsed_df.to_csv('../data/processed/collapsed_num_skills_num_certifications.csv',index=False)

#### 2. Collapsing into number of skills and certified/not-certified

In [None]:
#Make copy of above dataframe
certified_df = collapsed_df.copy()

#Note down if the individual is certified (i.e. has at least one certification) or not
certified_df['certified'] = (certified_df.num_certifications > 0).astype(int)

#Should we drop the number of certifications column or not?
certified_df = certified_df.drop(columns=['num_certifications'])

certified_df.head()


Unnamed: 0,search_query,name,gender,join_date_from_earliest,location_size,hourly_rate,pay_grade,avg_rating,num_reviews,num_recommendations,...,verification_identity_verified,verification_payment_verified,verification_phone_verified,verification_email_verified,verification_facebook_connected,badge_plus_membership,badge_preferred_freelancer,badge_verified,num_skills,certified
0,2,Milen,1,7063,1,45,0.0,0.0,0,0,...,0,1,1,0,0,0,0,0,15,0
1,2,Jeremy,1,7526,1,90,0.0,0.0,0,0,...,1,1,1,0,0,1,0,0,18,1
2,2,Nichole,0,6430,0,25,4.0,5.0,2,0,...,0,0,1,0,0,0,0,0,20,1
3,2,Robert,1,3238,1,75,0.0,0.0,0,0,...,0,0,1,0,0,0,0,0,14,1
4,2,Jean-Paul,1,6661,5,19,0.0,0.0,0,0,...,1,1,1,0,1,0,0,0,6,0


In [None]:
certified_df.isna().sum()

search_query                         0
name                                 0
gender                               0
join_date_from_earliest              0
location_size                        0
hourly_rate                          0
pay_grade                            0
avg_rating                           0
num_reviews                          0
num_recommendations                  0
pct_jobs_completed                   0
pct_on_budget                        0
pct_on_time                          0
verification_preferred_freelancer    0
verification_identity_verified       0
verification_payment_verified        0
verification_phone_verified          0
verification_email_verified          0
verification_facebook_connected      0
badge_plus_membership                0
badge_preferred_freelancer           0
badge_verified                       0
num_skills                           0
certified                            0
dtype: int64

In [None]:
certified_df.to_csv('../data/processed/collapsed_num_skills_certified_indicator.csv',index=False)

#### 3. Plain skill and certification matching
Here, the idea is to match on skills and certifications as they are, after removing those that would be difficult to match on because too few freelancers hold them.

In [None]:
#Get all the skill and certification columns
#(build a new list so that skill_cols itself is not modified)
cols_to_match = skill_cols + certification_cols

#Count how many individuals have the different skills and certifications
skill_certification_counts = df.loc[:,cols_to_match].count()

**a. Get skills and certifications that are held by at least 30 people**

In [None]:
count_threshold = 30

#Get skills and certifications which are not held by at least 30 individuals
inadq_skill_cert = skill_certification_counts[skill_certification_counts < count_threshold].index

#Drop skills and certifications that are not held by at least 30 people
matchable_skill_cert = df.drop(columns = inadq_skill_cert)

print('original_size:', df.shape)
print('new_size:', matchable_skill_cert.shape)

original_size: (9766, 2265)
new_size: (9766, 678)
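
The cutoff of 30 is a judgment call, so it can help to see how many columns survive at other thresholds. The following is a small sketch on made-up data; `n_columns_kept` is a hypothetical helper, and it relies on `DataFrame.count()` counting non-null entries per column, as the cell above does.

```python
import pandas as pd

def n_columns_kept(df, cols, thresholds):
    """For each threshold t, count the columns in `cols` with at least t non-null entries."""
    counts = df[cols].count()
    return {t: int((counts >= t).sum()) for t in thresholds}

# Toy data (made up): NaN means the skill is not held
toy = pd.DataFrame({
    "skill_a": [1.0, 2.0, 3.0, None],
    "skill_b": [None, None, 1.0, None],
    "skill_c": [1.0, 1.0, None, None],
})

kept = n_columns_kept(toy, ["skill_a", "skill_b", "skill_c"], thresholds=[1, 2, 3])
print(kept)
```

On the real data, the equivalent call would presumably be `n_columns_kept(df, cols_to_match, [10, 20, 30, 50])`.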


In [None]:
matchable_skill_cert.to_csv('../data/processed/matchable_individual_skills_certifications.csv',index=False)

**b. Convert skills and certifications to dummy variables**

Each variable is 1 if the skill or certification has been applied in jobs, and 0 otherwise.

In [None]:
#Get the skill and certification columns
skill_cert_cols = [col for col in matchable_skill_cert.columns if ('skill_' in col) or ('pct_certifications' in col)]

#Dummify the variables
dummified_matchable = matchable_skill_cert.copy()

#Values above 0 become 1; zeros and missing values become 0
dummified_matchable[skill_cert_cols] = (dummified_matchable[skill_cert_cols] > 0).astype(int)

dummified_matchable.head()

Unnamed: 0,search_query,name,gender,join_date_from_earliest,location_size,hourly_rate,pay_grade,avg_rating,num_reviews,num_recommendations,...,skill_network_engineering,skill_account_receivables_management,skill_account_payables_management,skill_uml_design,skill_devops,skill_risk_management,skill_tax_accounting,skill_bank_reconciliation,badge_preferred_freelancer,badge_verified
0,2,Milen,1,7063,1,45,0.0,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jeremy,1,7526,1,90,0.0,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,Nichole,0,6430,0,25,4.0,5.0,2,0,...,0,0,0,0,0,0,0,0,0,0
3,2,Robert,1,3238,1,75,0.0,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2,Jean-Paul,1,6661,5,19,0.0,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
dummified_matchable.to_csv('../data/processed/matchable_individual_skills_certifications_dummified.csv',index=False)

#### 4. Collapsing matchable skills and certifications into groups [Total Project Count]
Here matchable skills and certifications are those that are held by at least 30 people

##### a. Skills

In [None]:
#Get the matchable skill columns
matchable_skill_cols = [col for col in matchable_skill_cert.columns if 'skill' in col]
matchable_skill_cols

['skill_engineering',
 'skill_building_architecture',
 'skill_3d_rendering',
 'skill_home_design',
 'skill_autocad',
 'skill_interior_design',
 'skill_sketchup',
 'skill_3d_modelling',
 'skill_engineering_drawing',
 'skill_drafting',
 'skill_local_job',
 'skill_architectural_rendering',
 'skill_website_design',
 'skill_graphic_design',
 'skill_logo_design',
 'skill_video_services',
 'skill_powerpoint',
 'skill_adobe_indesign',
 'skill_after_effects',
 'skill_photo_editing',
 'skill_presentations',
 'skill_videography',
 'skill_creative_design',
 'skill_video_production',
 'skill_video_editing',
 'skill_adobe_premiere_pro',
 'skill_adobe_illustrator',
 'skill_explainer_videos',
 'skill_video_ads',
 'skill_adobe_photoshop',
 'skill_banner_design',
 'skill_photoshop',
 'skill_illustrator',
 'skill_photoshop_design',
 'skill_css',
 'skill_voice_talent',
 'skill_corporate_identity',
 'skill_business_cards',
 'skill_brochure_design',
 'skill_t-shirts',
 'skill_icon_design',
 'skill_advertise

**Use word similarity on initial pass to assign skills into groups**

In [None]:
#Skill categories and keywords for determining skill category
skill_cat_keywords = [{'engineering_skills':[spacy_nlp('engineering')]},
                      {'writing_skills': [spacy_nlp('writing')]},
                      {'technical_programming_skills':[spacy_nlp('coding'),
                                                       spacy_nlp('computing'), 
                                                       spacy_nlp('programming')]},
                      {'language_translation_skills':[spacy_nlp('language'),spacy_nlp('translation')]},
                      {'finance_accounting_skills':[spacy_nlp('finance'),spacy_nlp('accounting')]},
                      {'marketing_business_skills':[spacy_nlp('marketing'),spacy_nlp('business')]},
                      {'performance_arts_skills':[spacy_nlp('music'),spacy_nlp('acting')]},
                      {'design_skills':[spacy_nlp('design'),spacy_nlp('art')]}]

#Skill categories storage
skill_categories = {'engineering_skills': set(), 'writing_skills': set(),
                    'technical_programming_skills': set(),
                    'language_translation_skills': set(),
                    'finance_accounting_skills': set(),
                    'marketing_business_skills':set(),
                    'performance_arts_skills': set(),
                    'design_skills': set()}

#Set to store skills that could not be categorized
uncategorized = set()

#Set word similarity threshold for determining skill category
sim_threshold = 0.7

for skill in matchable_skill_cols:
    
    #Extract the skill name from the column name
    skill_name = skill.replace('skill_','').replace('_',' ')

    #Create spacy token from the base skill name
    skill_token = spacy_nlp(skill_name)

    #Tracker to determine if the skill has been assigned to a category
    assigned = False

    #Iterate through the different established categories
    for category in skill_cat_keywords:
        
        #Get skill category and the keyword tokens
        key,value = [*category.items()][0]
        
        #Iterate through the established keywords for each of the categories
        for keyword in value:
            
            #If the skill surpasses the similarity threshold,
            if keyword.similarity(skill_token) > sim_threshold:

                #print(keyword,skill_token,keyword.similarity(skill_token))
                #print(sim_threshold)

                #print(keyword,skill_token)
                
                #Store the skill as part of the underlying skill category
                skill_categories[key].add(skill)

                #Note the skill as assigned
                assigned = True

                #Move to next skill category
                break

    #If the skill was not assigned to a category
    if not assigned:

        #Note it as uncategorized
        uncategorized.add(skill)
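
To make the thresholding step concrete without loading the spaCy model, here is a sketch with made-up 2-d vectors. By default, spaCy's `similarity` is a cosine similarity over (averaged) word vectors, which the hypothetical `cosine_similarity` helper mimics; `assign_categories` and the toy vectors are illustrative only.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def assign_categories(skill_vec, keyword_vecs, threshold=0.7):
    """Return every category with at least one keyword above the threshold."""
    return {
        category
        for category, vecs in keyword_vecs.items()
        if any(cosine_similarity(skill_vec, k) > threshold for k in vecs)
    }

# Made-up keyword vectors standing in for spacy_nlp('engineering'), spacy_nlp('design')
toy_keywords = {
    "engineering_skills": [[1.0, 0.0]],
    "design_skills": [[0.0, 1.0]],
}

print(assign_categories([0.8, 0.6], toy_keywords))  # close to one keyword only
print(assign_categories([1.0, 1.0], toy_keywords))  # equally close to both keywords
```

As in the loop above, a skill that clears the threshold for keywords in several categories ends up in all of them, which is one reason the manual pass that follows is needed.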





In [None]:
skill_categories

{'engineering_skills': {'skill_aerospace_engineering',
  'skill_audio_engineering',
  'skill_chemical_engineering',
  'skill_civil_engineering',
  'skill_construction_engineering',
  'skill_electrical_engineering',
  'skill_engineering',
  'skill_engineering_drawing',
  'skill_engineering_mathematics',
  'skill_industrial_engineering',
  'skill_materials_engineering',
  'skill_mechanical_design',
  'skill_mechanical_engineering',
  'skill_mixing_engineering',
  'skill_network_engineering',
  'skill_sound_engineering',
  'skill_structural_engineering',
  'skill_telecommunications_engineering'},
 'writing_skills': {'skill_academic_writing',
  'skill_article_writing',
  'skill_blog_writing',
  'skill_book_writing',
  'skill_business_writing',
  'skill_content_writing',
  'skill_creative_writing',
  'skill_editorial_writing',
  'skill_essay_writing',
  'skill_grant_writing',
  'skill_legal_writing',
  'skill_medical_writing',
  'skill_online_writing',
  'skill_proposal_writing',
  'skill_r

In [None]:
uncategorized

set()

In [None]:
uncategorized.difference_update([])

**Finish with manual assignment**

In [None]:
assigned_skill_categories = {
 'engineering_skills': ['skill_aerospace_engineering','skill_chemical_engineering',
    'skill_civil_engineering','skill_construction_engineering','skill_electrical_engineering',
    'skill_engineering','skill_engineering_drawing','skill_engineering_mathematics',
    'skill_industrial_engineering','skill_materials_engineering','skill_mechanical_design',
    'skill_mechanical_engineering','skill_mixing_engineering','skill_network_engineering',
    'skill_structural_engineering','skill_telecommunications_engineering','skill_manufacturing',
    'skill_automotive','skill_construction_monitoring','skill_mechatronics','skill_pcb_layout',
    'skill_plc_&_scada'],

'writing_skills': ['skill_academic_writing','skill_article_writing','skill_blog_writing',
    'skill_book_writing','skill_business_writing','skill_content_writing','skill_creative_writing',
    'skill_editorial_writing','skill_essay_writing','skill_grant_writing','skill_legal_writing',
    'skill_medical_writing','skill_online_writing','skill_proposal_writing','skill_report_writing',
    'skill_research_writing','skill_romance_writing','skill_speech_writing','skill_technical_writing',
    'skill_travel_writing','skill_writing','skill_ghostwriting','skill_article_rewriting',
    'skill_copy_editing','skill_drafting','skill_editing','skill_proofreading','skill_resumes',
    'skill_academic_research','skill_article_submission','skill_ebooks','skill_fiction','skill_journalism',
    'skill_linkedin_profile','skill_web_page_writer'],

'technical_programming_skills': ['skill_c++_programming','skill_c_programming','skill_cloud_computing',
    'skill_coding','skill_database_programming','skill_programming','skill_r_programming_language',
    'skill_database_design','skill_.net','skill_adobe_dreamweaver','skill_ajax','skill_angularjs',
    'skill_asp','skill_bootstrap','skill_codeigniter','skill_computer_graphics','skill_css','skill_css3',
    'skill_data_entry','skill_database_development','skill_desktop_support','skill_electronics','skill_excel',
    'skill_game_development','skill_google_analytics','skill_html','skill_html5','skill_java','skill_javascript',
    'skill_jquery','skill_json','skill_microsoft','skill_microsoft_office','skill_mvc','skill_mysql',
    'skill_nosql_couch_&_mongo','skill_php','skill_psd_to_html','skill_python',
    'skill_react.js','skill_software_architecture','skill_software_testing','skill_sql','skill_system_admin',
    'skill_technical_support','skill_typing','skill_web_hosting','skill_web_security','skill_website_testing',
    'skill_word','skill_word_processing','skill_wordpress','skill_.net_core','skill_active_directory',
    'skill_algorithm','skill_amazon_web_services','skill_analytics','skill_android','skill_android_app_development',
    'skill_angular','skill_angular_material','skill_apache','skill_api','skill_api_integration','skill_app_developer',
    'skill_arduino','skill_artificial_intelligence','skill_asp.net','skill_asp.net_mvc','skill_assembly',
    'skill_aws_lambda','skill_azure','skill_backend_development','skill_bash_scripting','skill_bitcoin',
    'skill_blockchain','skill_c#_programming','skill_cakephp','skill_cisco','skill_computer_science',
    'skill_computer_security','skill_computer_support','skill_data_analysis','skill_data_analytics',
    'skill_data_cleansing','skill_data_delivery','skill_data_extraction','skill_data_mining',
    'skill_data_processing','skill_data_science','skill_data_scraping','skill_data_visualization',
    'skill_database_administration','skill_debugging','skill_deep_learning','skill_delphi','skill_devops',
    'skill_digital_electronics','skill_django','skill_dns','skill_docker','skill_documentation',
    'skill_embedded_software','skill_embedded_systems','skill_ethereum','skill_excel_macros',
    'skill_excel_vba','skill_express_js','skill_finite_element_analysis','skill_flask','skill_flutter',
    'skill_front-end_design','skill_frontend_development','skill_full_stack_development','skill_geolocation',
    'skill_git','skill_github','skill_golang','skill_google_chrome','skill_google_cloud_platform',
    'skill_google_docs','skill_google_firebase','skill_google_maps_api','skill_google_plus',
    'skill_google_sheets','skill_google_spreadsheets','skill_graphical_user_interface_(gui)',
    'skill_iis','skill_image_processing','skill_internet_of_things_(iot)','skill_internet_security',
    'skill_ios_development','skill_ipad','skill_iphone','skill_j2ee','skill_java_spring','skill_jquery_/_prototype',
    'skill_jsp','skill_kotlin','skill_kubernetes','skill_laravel','skill_less/sass/scss',
    'skill_link_building','skill_linux','skill_mac_os','skill_machine_learning_(ml)','skill_matlab',
    'skill_matlab_and_mathematica','skill_microcontroller','skill_microsoft_access','skill_microsoft_azure',
    'skill_microsoft_exchange','skill_microsoft_outlook','skill_microsoft_sql_server','skill_microsoft_word',
    'skill_mobile_app_development','skill_mobile_app_testing','skill_mobile_development','skill_mongodb',
    'skill_network_administration','skill_neural_networks','skill_nginx','skill_node.js','skill_non-fungible_tokens_(nft)',
    'skill_object_oriented_programming_(oop)','skill_objective_c','skill_office_365','skill_oracle','skill_payment_gateway_integration',
    'skill_paypal_api','skill_perl','skill_phpmyadmin','skill_plugin','skill_postgresql','skill_power_bi',
    'skill_powershell','skill_process_automation','skill_raspberry_pi','skill_react.js_framework','skill_react_native',
    'skill_software_documentation', 'skill_software_development','skill_website_analytics','skill_web_development',
    'skill_webflow','skill_web_services','skill_web_api','skill_web_search',
    'skill_website_optimization','skill_website_build','skill_web_scraping','skill_web_application',
    'skill_redux.js', 'skill_vue.js_framework', 'skill_vue.js','skill_windows_8', 'skill_windows_desktop',
    'skill_windows_server','skill_restful','skill_restful_api','skill_visual_basic','skill_visual_basic_for_apps',
    'skill_vmware','skill_relational_databases','skill_robotic_process_automation','skill_robotics',
    'skill_ruby_on_rails','skill_salesforce.com','skill_script_install','skill_scripting','skill_selenium',
    'skill_server','skill_shell_script','skill_simulation','skill_smart_contracts','skill_spreadsheets',
    'skill_sqlite','skill_statistical_analysis','skill_statistics','skill_stripe','skill_swift',
    'skill_tableau','skill_technical_documentation','skill_test_automation','skill_testing_/_qa','skill_troubleshooting',
    'skill_typescript','skill_ubuntu','skill_unix','skill_vb.net','skill_verilog_/_vhdl',
    'skill_voip','skill_xml'],

'language_translation_skills': ['skill_english_grammar','skill_english_translation','skill_translation',
    'skill_english_(us)_translator','skill_english_spelling','skill_transcription', 'skill_arabic_translator',
    'skill_french_translator', 'skill_english_(uk)_translator','skill_english_teaching', 'skill_english_tutoring',
    'skill_spanish_(spain)', 'skill_spanish','skill_simplified_chinese_(china)','skill_hindi','skill_portuguese_(brazil)',
    'skill_russian'],

'finance_accounting_skills': ['skill_account_payables_management','skill_account_receivables_management',
    'skill_accounting','skill_bookkeeping','skill_finance','skill_financial_accounting',
    'skill_financial_markets','skill_financial_research','skill_tax_accounting','skill_account_management',
    'skill_audit','skill_bank_reconciliation','skill_contracts','skill_erp','skill_financial_analysis',
    'skill_intuit_quickbooks','skill_risk_management','skill_tax','skill_trading','skill_payroll'],

'management_skills': ['skill_agile_development','skill_management','skill_agile_project_management',
    'skill_product_management','skill_project_management','skill_project_scheduling','skill_sharepoint',
    'skill_time_management'],

'marketing_business_skills': ['skill_advertising','skill_affiliate_marketing','skill_brand_marketing',
    'skill_bulk_marketing','skill_business_analysis','skill_business_analytics','skill_business_cards',
    'skill_business_coaching','skill_business_consulting','skill_business_plans','skill_business_strategy',
    'skill_content_marketing','skill_digital_marketing','skill_email_marketing',
    'skill_facebook_marketing','skill_instagram_marketing','skill_internet_marketing','skill_marketing',
    'skill_marketing_strategy','skill_ppc_marketing','skill_sales_promotion','skill_search_engine_marketing',
    'skill_social_media_marketing','skill_social_video_marketing','skill_seo_writing', 'skill_amazon',
    'skill_blog','skill_blog_install','skill_branding','skill_copywriting','skill_corporate_identity',
    'skill_customer_service','skill_customer_support','skill_drupal','skill_ecommerce','skill_forum_posting',
    'skill_google_adwords','skill_instagram','skill_keyword_research','skill_magento','skill_newsletters',
    'skill_open_cart','skill_pinterest','skill_publishing','skill_research','skill_seo','skill_social_media_management',
    'skill_social_networking','skill_twitter','skill_website_management','skill_woocommerce',
    'skill_youtube','skill_brand_management','skill_catch_phrases','skill_cms','skill_commercials',
    'skill_communications','skill_content_creation','skill_content_strategy','skill_copy_typing',
    'skill_crm','skill_ebay','skill_email_handling','skill_entrepreneurship','skill_event_planning',
    'skill_facebook_ads','skill_google_adsense','skill_inventory_management','skill_lead_generation',
    'skill_leads','skill_linkedin','skill_mailchimp','skill_market_research','skill_order_processing',
    'skill_podcasting','skill_press_releases','skill_product_descriptions','skill_product_sourcing',
    'skill_public_relations','skill_social_media_post_design', 'skill_social_media_copy','skill_reviews',
    'skill_sales','skill_seo_auditing','skill_shopify','skill_shopify_development','skill_shopping',
    'skill_shopping_carts','skill_slogans','skill_supplier_sourcing','skill_telemarketing','skill_tiktok',
    'skill_helpdesk'],

'performance_arts_skills': ['skill_music','skill_music_production','skill_voice_acting',
    'skill_audio_engineering','skill_sound_engineering','skill_audio_production','skill_audio_services',
    'skill_screenwriting','skill_short_stories','skill_voice_artist','skill_voice_talent',
    'skill_apple_logic_pro','skill_audio_editing','skill_audio_mastering','skill_audio_processing',
    'skill_garageband','skill_poetry','skill_voice_over'],

'design_skills': ['skill_3d_design','skill_advertisement_design','skill_album_design','skill_app_design',
    'skill_banner_design','skill_blog_design','skill_book_cover_design','skill_brochure_design',
    'skill_building_design','skill_circuit_design','skill_concept_art','skill_concept_design',
    'skill_creative_design','skill_design','skill_digital_art','skill_digital_design','skill_ebook_design',
    'skill_electronic_design','skill_fashion_design','skill_flyer_design','skill_furniture_design',
    'skill_game_design','skill_graphic_art','skill_graphic_design','skill_home_design','skill_icon_design',
    'skill_industrial_design','skill_interior_design','skill_invitation_design','skill_label_design',
    'skill_logo_design','skill_manufacturing_design','skill_package_design',
    'skill_packaging_design','skill_painting','skill_pcb_design_and_layout','skill_photoshop_design',
    'skill_poster_design','skill_product_design','skill_prototype_design','skill_sign_design',
    'skill_sound_design','skill_stationery_design','skill_sticker_design','skill_tattoo_design','skill_uml_design',
    'skill_visual_arts','skill_website_design','skill_wordpress_design','skill_3d_animation','skill_3d_modelling',
    'skill_3d_printing','skill_3d_rendering','skill_adobe_flash','skill_adobe_illustrator','skill_adobe_indesign',
    'skill_adobe_photoshop','skill_adobe_premiere_pro','skill_after_effects','skill_animation','skill_architectural_rendering',
    'skill_arts_&_crafts','skill_autocad','skill_autocad_architecture','skill_autodesk_revit',
    'skill_book_artist','skill_building_architecture','skill_cad/cam','skill_caricature_&_cartoons',
    'skill_covers_&_packaging','skill_drawing','skill_explainer_videos','skill_format_and_layout',
    'skill_illustration','skill_illustrator','skill_imovie','skill_infographics','skill_joomla','skill_landing_pages',
    'skill_motion_graphics','skill_photo_editing','skill_photo_restoration','skill_photo_retouching',
    'skill_photography','skill_photoshop','skill_powerpoint','skill_presentations','skill_print','skill_product_photography',
    'skill_revit_architecture','skill_shopify_templates','skill_sketch','skill_sketching','skill_sketchup',
    'skill_solidworks','skill_t-shirts','skill_templates','skill_typography','skill_usability_testing','skill_user_interface_/_ia',
    'skill_ux_/_user_experience','skill_vectorization','skill_video_ads','skill_video_editing',
    'skill_video_production','skill_video_services','skill_videography','skill_wix','skill_youtube_video_editing',
    'skill_2d_animation','skill_2d_drafting','skill_2d_drawing','skill_2d_game_art','skill_3d_architecture',
    'skill_3d_cad','skill_3d_drafting','skill_3d_logo','skill_3d_model_maker','skill_3ds_max','skill_adobe_lightroom',
    'skill_animated_video_development','skill_architecture','skill_autodesk_inventor','skill_blender',
    'skill_canva','skill_cgi','skill_character_illustration',"skill_children\\'s_book_illustration",'skill_cinema_4d',
    'skill_comics','skill_drone_photography','skill_drones','skill_fashion_modeling','skill_figma','skill_filmmaking',
    'skill_final_cut_pro','skill_infographic_and_powerpoint_slide_designing','skill_logo_animation',
    'skill_maya','skill_pattern_making','skill_prezi','skill_video_post-editing','skill_video_processing',
    'skill_video_broadcasting','skill_video_upload','skill_squarespace','skill_storyboard','skill_ui_/_user_interface',
    'skill_unity_3d','skill_wireframes'],

'teaching_training_skills': ['skill_education_&_tutoring','skill_educational_research','skill_elearning',
    'skill_elearning_designer','skill_elementor','skill_human_resources','skill_math_tutoring',
    'skill_teaching/lecturing','skill_training'],
 
#Skills that do not fit any of the categories above
'miscellaneous_skills': ['skill_delivery','skill_freelance','skill_general_labor','skill_local_job','skill_odd_jobs',
    'skill_pdf','skill_sports','skill_wireless','skill_xxx','skill_biology','skill_brain_storming','skill_call_center',
    'skill_cooking_&_recipes','skill_general_office','skill_health','skill_instrumentation','skill_legal',
    'skill_legal_research','skill_logistics','skill_medical','skill_mathematics','skill_physics','skill_pickup',
    'skill_post-production','skill_pre-production','skill_real_estate','skill_recruitment', 'skill_scientific_research',
    'skill_startups','skill_telephone_handling','skill_wikipedia','skill_administrative_support',
    'skill_internet_research','skill_phone_support','skill_virtual_assistant']

}
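
Because the mapping is maintained by hand, a quick consistency check may be worthwhile before collapsing. The sketch below uses a hypothetical helper, `check_category_assignment`, on toy data; it reports columns assigned more than once, columns never assigned, and names that are not among the expected columns.

```python
from collections import Counter

def check_category_assignment(categories, expected_cols):
    """Report columns assigned more than once, never assigned, or not expected."""
    counts = Counter(col for cols in categories.values() for col in cols)
    assigned, expected = set(counts), set(expected_cols)
    return {
        "duplicated": sorted(col for col, n in counts.items() if n > 1),
        "missing": sorted(expected - assigned),
        "unknown": sorted(assigned - expected),
    }

# Toy mapping (made up) with one deliberate duplicate and one unassigned skill
toy_categories = {
    "writing_skills": ["skill_writing", "skill_editing"],
    "design_skills": ["skill_design", "skill_editing"],
}
toy_expected = ["skill_writing", "skill_editing", "skill_design", "skill_python"]

report = check_category_assignment(toy_categories, toy_expected)
print(report)
```

In this notebook the call would presumably be `check_category_assignment(assigned_skill_categories, matchable_skill_cols)`. A column listed in two categories matters for the collapse step: it is dropped after the first category is processed, so selecting it again for the second category would raise a `KeyError`.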
 

**Process Summary**
- Skills were also assigned to groups based on who is likely to list them. For example, everyone does taxes, but we might expect a finance or accounting expert to cite tax as one of their skills, as they might reasonably be equipped to assist with tax filing

- The occupations with the highest proportion of freelancers holding a skill were also used to guide the grouping

Marketing & Business: Social media platforms, marketing & areas/types of marketing (e.g. SEO, brand management, online marketing)

Design: Design tools, modeling, photography

Performance arts: Music, drama, acting production
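
The occupation-based heuristic in the summary above could be sketched as follows. This assumes that `search_query` encodes the occupation a freelancer was found under, which the notebook does not state explicitly; `top_occupation_per_skill` and the toy data are hypothetical.

```python
import pandas as pd

def top_occupation_per_skill(df, skill_cols, occupation_col="search_query"):
    """For each skill, return the occupation with the highest share of holders."""
    # Share of freelancers in each occupation who list each skill (non-null = listed)
    shares = df[skill_cols].notna().groupby(df[occupation_col]).mean()
    # idxmax over rows gives, per skill column, the occupation label with the largest share
    return shares.idxmax()

# Toy data (made up): occupation codes 1 and 2
toy = pd.DataFrame({
    "search_query": [1, 1, 2, 2],
    "skill_tax": [None, None, 1.0, 2.0],
    "skill_logo_design": [1.0, None, None, None],
})

top = top_occupation_per_skill(toy, ["skill_tax", "skill_logo_design"])
print(top)
```

The result is only a starting point for the manual grouping: ties are resolved by taking the first occupation, and shares do not account for how large each occupation is.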

##### b. Collapse skills into categories

In [None]:
skill_cert_categorization_df = matchable_skill_cert.copy()

#Iterate through all the established categories
for category in assigned_skill_categories:
    
    #For each individual count the total number of projects in which they've used a skill in that category
    skill_cert_categorization_df[category] = skill_cert_categorization_df.loc[:,assigned_skill_categories[category]].sum(axis=1)

    #Drop the individual skill columns
    skill_cert_categorization_df = skill_cert_categorization_df.drop(columns= assigned_skill_categories[category])
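
One detail of the collapse worth checking is how missing values are handled. The toy example below (made-up data and category) applies the same sum-then-drop pattern and shows that a freelancer with no skills in a category gets 0 rather than a missing value, because `sum` skips NaN by default.

```python
import pandas as pd

# Toy data (made up): values are project counts, NaN means the skill is not listed
toy = pd.DataFrame({
    "skill_music": [2.0, None, None],
    "skill_voice_over": [1.0, 4.0, None],
})
toy_categories = {"performance_arts_skills": ["skill_music", "skill_voice_over"]}

for category, cols in toy_categories.items():
    # Row-wise total across the category's skill columns (NaN is skipped)
    toy[category] = toy[cols].sum(axis=1)
    # Remove the individual skill columns once they are folded into the total
    toy = toy.drop(columns=cols)

print(toy)
```

If a missing value is preferred for freelancers with no skills in a category, `sum(axis=1, min_count=1)` returns NaN for all-NaN rows instead.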


In [None]:
skill_cert_categorization_df.columns

Index(['search_query', 'name', 'gender', 'join_date_from_earliest',
       'location_size', 'hourly_rate', 'pay_grade', 'avg_rating',
       'num_reviews', 'num_recommendations', 'pct_jobs_completed',
       'pct_on_budget', 'pct_on_time', 'verification_preferred_freelancer',
       'verification_identity_verified', 'verification_payment_verified',
       'verification_phone_verified', 'verification_email_verified',
       'verification_facebook_connected', 'pct_certifications_us_english_1',
       'badge_plus_membership', 'pct_certifications_foundation_vworker_member',
       'pct_certifications_us_english_3',
       'pct_certifications_freelancer_orientation_1',
       'pct_certifications_employer_orientation_exam_1',
       'pct_certifications_basic_numeracy_1',
       'pct_certifications_preferred_freelancer_program_sla_1',
       'pct_certifications_us_english_2',
       'pct_certifications_academic_writing_1', 'pct_certifications_php_1',
       'pct_certifications_uk_english_1', 

##### c. Certifications

In [None]:
#Get the matchable certification columns
matchable_cert_cols = [col for col in matchable_skill_cert.columns if 'certification' in col]
matchable_cert_cols

['pct_certifications_us_english_1',
 'pct_certifications_foundation_vworker_member',
 'pct_certifications_us_english_3',
 'pct_certifications_freelancer_orientation_1',
 'pct_certifications_employer_orientation_exam_1',
 'pct_certifications_basic_numeracy_1',
 'pct_certifications_preferred_freelancer_program_sla_1',
 'pct_certifications_us_english_2',
 'pct_certifications_academic_writing_1',
 'pct_certifications_php_1',
 'pct_certifications_uk_english_1',
 'pct_certifications_data_entry_1',
 'pct_certifications_html_1',
 'pct_certifications_wordpress_1',
 'pct_certifications_javascript_1',
 'pct_certifications_c#_programming_1']

In [None]:
#Manually categorize the certifications
assigned_cert_categories = {
'language_certifications': ['pct_certifications_us_english_1','pct_certifications_us_english_3',
    'pct_certifications_us_english_2','pct_certifications_uk_english_1'],

'freelancer_certifications': ['pct_certifications_foundation_vworker_member','pct_certifications_freelancer_orientation_1',
    'pct_certifications_employer_orientation_exam_1','pct_certifications_preferred_freelancer_program_sla_1'],

'general_skill_certifications': ['pct_certifications_basic_numeracy_1','pct_certifications_academic_writing_1',
    'pct_certifications_data_entry_1'],

'programming_certifications': ['pct_certifications_php_1','pct_certifications_html_1','pct_certifications_wordpress_1',
    'pct_certifications_javascript_1', 'pct_certifications_c#_programming_1']
}

##### d. Collapse certifications into categories

In [None]:
#Iterate through all the established categories
for category in assigned_cert_categories:
    
    #For each individual count the number of certifications held in the category
    #(count() tallies non-null columns, so the percentage score itself is ignored)
    skill_cert_categorization_df[category] = skill_cert_categorization_df.loc[:,assigned_cert_categories[category]].count(axis=1)

    #Drop the individual certification columns
    skill_cert_categorization_df = skill_cert_categorization_df.drop(columns= assigned_cert_categories[category])
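
A small sketch of what `count(axis=1)` does here may help. It tallies non-null cells per row, so it only equals "number of certifications held" if a certification that is not held is stored as `NaN`. The scores below are invented, and the zero score in the third row is a hypothetical edge case rather than something observed in the data.

```python
import numpy as np
import pandas as pd

# Toy certification scores (hypothetical values): NaN is assumed to mean "not held"
toy_certs = pd.DataFrame({
    "pct_certifications_us_english_1": [95.0, np.nan, 0.0],
    "pct_certifications_uk_english_1": [80.0, np.nan, np.nan],
})

# Non-null cells per row: a recorded score of 0.0 still counts as one certification
n_certs = toy_certs.count(axis=1)

# Alternative if only strictly positive scores should count (NaN compares as False)
n_positive = toy_certs.gt(0).sum(axis=1)

print(n_certs.tolist(), n_positive.tolist())
```

If zeros can occur in the certification columns, the two definitions diverge, so it is worth confirming which one the matching step expects.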


In [None]:
skill_cert_categorization_df.columns

Index(['search_query', 'name', 'gender', 'join_date_from_earliest',
       'location_size', 'hourly_rate', 'pay_grade', 'avg_rating',
       'num_reviews', 'num_recommendations', 'pct_jobs_completed',
       'pct_on_budget', 'pct_on_time', 'verification_preferred_freelancer',
       'verification_identity_verified', 'verification_payment_verified',
       'verification_phone_verified', 'verification_email_verified',
       'verification_facebook_connected', 'badge_plus_membership',
       'badge_preferred_freelancer', 'badge_verified', 'engineering_skills',
       'writing_skills', 'technical_programming_skills',
       'language_translation_skills', 'finance_accounting_skills',
       'management_skills', 'marketing_business_skills',
       'performance_arts_skills', 'design_skills', 'teaching_training_skills',
       'miscellaneous_skills', 'language_certifications',
       'freelancer_certifications', 'general_skill_certifications',
       'programming_certifications'],
      dtype=

In [None]:
skill_cert_categorization_df.head()

Unnamed: 0,search_query,name,gender,join_date_from_earliest,location_size,hourly_rate,pay_grade,avg_rating,num_reviews,num_recommendations,...,management_skills,marketing_business_skills,performance_arts_skills,design_skills,teaching_training_skills,miscellaneous_skills,language_certifications,freelancer_certifications,general_skill_certifications,programming_certifications
0,2,Milen,0,7063,1,45,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0
1,2,Jeremy,0,7526,1,90,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0
2,2,Nichole,1,6430,0,25,4.0,5.0,2,0,...,0.0,0.0,0.0,5.0,0.0,0.0,1,0,0,0
3,2,Robert,0,3238,1,75,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0,0
4,2,Jean-Paul,0,6661,5,19,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0


##### e. Check for missing values

In [None]:
skill_cert_categorization_df.isna().sum()

search_query                         0
name                                 0
gender                               0
join_date_from_earliest              0
location_size                        0
hourly_rate                          0
pay_grade                            0
avg_rating                           0
num_reviews                          0
num_recommendations                  0
pct_jobs_completed                   0
pct_on_budget                        0
pct_on_time                          0
verification_preferred_freelancer    0
verification_identity_verified       0
verification_payment_verified        0
verification_phone_verified          0
verification_email_verified          0
verification_facebook_connected      0
badge_plus_membership                0
badge_preferred_freelancer           0
badge_verified                       0
engineering_skills                   0
writing_skills                       0
technical_programming_skills         0
language_translation_skil

##### f. Write to CSV

In [None]:
skill_cert_categorization_df.to_csv('../data/processed/skills_certifications_categorized_female_treatment.csv',index=False)

#### 5. Collapsing matchable skills and certifications into groups [Number of skills count]
Here matchable skills and certifications are those that are held by at least 30 people
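
For reference, the "held by at least 30 people" rule can be sketched as a column filter. This is not necessarily how `matchable_skill_cert` was built earlier in the notebook: the helper name `matchable_columns`, its parameters, and the assumption that `NaN` marks a skill or certification that is not held are all hypothetical.

```python
import numpy as np
import pandas as pd

def matchable_columns(df, prefixes=("skill_", "pct_certifications_"), min_holders=30):
    """Return skill/certification columns with at least `min_holders` non-null rows."""
    candidate_cols = [col for col in df.columns if col.startswith(prefixes)]
    # Number of people holding each skill/certification (non-null entries)
    holders = df[candidate_cols].notna().sum()
    return holders[holders >= min_holders].index.tolist()

# Toy usage with a lower threshold so the example stays small
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "skill_python": [1.0, 2.0, np.nan],
    "skill_phaser": [np.nan, np.nan, 1.0],
    "pct_certifications_php_1": [90.0, 85.0, 70.0],
})
kept = matchable_columns(toy, min_holders=2)
print(kept)
```

Columns below the threshold would be dropped, while non-skill columns such as `name` are left untouched by the filter.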

##### a. Collapse skills into categories (counting the number of listed skills in that category)

In [None]:
skill_cert_categorization_count_df = matchable_skill_cert.copy()

#Iterate through all the established categories
for category in assigned_skill_categories:
    
    #For each individual count the total number of skills they have listed in each skill category
    skill_cert_categorization_count_df[category] = skill_cert_categorization_count_df.loc[:,assigned_skill_categories[category]].count(axis=1)

    #Drop the individual skill columns
    skill_cert_categorization_count_df = skill_cert_categorization_count_df.drop(columns= assigned_skill_categories[category])


##### b. Collapse certifications into categories

In [None]:
#Iterate through all the established categories
for category in assigned_cert_categories:
    
    #For each individual count the number of certifications held in the category
    #(count() tallies non-null columns, so the percentage score itself is ignored)
    skill_cert_categorization_count_df[category] = skill_cert_categorization_count_df.loc[:,assigned_cert_categories[category]].count(axis=1)

    #Drop the individual certification columns
    skill_cert_categorization_count_df = skill_cert_categorization_count_df.drop(columns= assigned_cert_categories[category])


In [None]:
skill_cert_categorization_count_df.gender.value_counts()

0    6692
1    3074
Name: gender, dtype: int64

##### c. Write to CSV

In [None]:
skill_cert_categorization_count_df.to_csv('../data/processed/skills_certifications_categorized_skill_count_female_treatment.csv',index=False)

In [None]:
import pandas as pd
dfv = pd.read_csv('/work/DS4SG-Gender-Inequality/data/processed/skills_certifications_categorized_skill_count.csv', low_memory=False)

In [None]:
dfv.isna().sum()

search_query                         0
name                                 0
gender                               0
join_date_from_earliest              0
location_size                        0
hourly_rate                          0
pay_grade                            0
avg_rating                           0
num_reviews                          0
num_recommendations                  0
pct_jobs_completed                   0
pct_on_budget                        0
pct_on_time                          0
verification_preferred_freelancer    0
verification_identity_verified       0
verification_payment_verified        0
verification_phone_verified          0
verification_email_verified          0
verification_facebook_connected      0
badge_plus_membership                0
badge_preferred_freelancer           0
badge_verified                       0
engineering_skills                   0
writing_skills                       0
technical_programming_skills         0
language_translation_skil

In [None]:
dfv[dfv.skill_virtual_assistant.notnull()].search_query.value_counts()

software engineer    241
designer             190
accountant            65
copywriter            34
Name: search_query, dtype: int64

In [None]:
dfv[dfv.skill_windows_desktop.notnull()]

Unnamed: 0,search_query,name,gender,profile_link,location,location_size,hourly_rate,pay_grade,avg_rating,num_reviews,...,skill_oracle_ebs_tech_integration,pct_certifications_google_webmaster_central_1,skill_modx,skill_cubecart,skill_phaser,skill_drilling_engineering,skill_casperjs,join_date_from_earliest,badge_preferred_freelancer,badge_verified
61,designer,Terri,female,https://www.freelancer.com/u/terrihudsonloper,Houston,5,20,0.0,0.0,0,...,,,,,,,,4815,False,False
92,designer,Syedhdita,male,https://www.freelancer.com/u/MAtta5,Skokie,1,45,0.0,0.0,0,...,,,,,,,,7513,False,False
151,designer,Thomas,male,https://www.freelancer.com/u/thomasaur,Beaverton,1,35,0.0,0.0,0,...,,,,,,,,2167,False,False
153,designer,Agustín,male,https://www.freelancer.com/u/agustindana,Seattle,5,50,0.0,0.0,0,...,,,,,,,,4072,False,False
186,designer,Christopher,male,https://www.freelancer.com/u/Web2Guru,Pointblank,0,40,0.0,0.0,0,...,,,,,,,,2534,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9700,software engineer,Harold,male,https://www.freelancer.com/u/hrbriceno,Ridgewood,1,8,0.0,0.0,0,...,,,,,,,,6794,False,False
9704,software engineer,Anton,male,https://www.freelancer.com/u/wsrtoha,Palmdale,2,25,0.0,0.0,0,...,,,,,,,,5989,False,False
9710,software engineer,David,male,https://www.freelancer.com/u/DavePie71,Pinellas Park,1,70,0.0,0.0,0,...,,,,,,,,7166,False,False
9725,software engineer,William,male,https://www.freelancer.com/u/WillWaldon,Independence,2,30,0.0,0.0,0,...,,,,,,,,7427,False,False

