# Test Data Selection
This notebook creates a small set of labeled data for testing the Doc2Vec model. Specifically, it selects 10 sample job descriptions, 5 for each of 2 job titles (data scientist and data engineer), and matches each job title with 5 courses that I believe the model should recommend. This sample data will then be used to test the accuracy of the model.

In [130]:
import pandas as pd

## Job test data

In [131]:
# Read in the jobs data
jobs_df = pd.read_csv('../Data/Job_Data/Glassdoor_Joblist.csv')
print(jobs_df.shape)
jobs_df.head(2)

(3324, 12)


Unnamed: 0,Job_title,Company,State,City,Min_Salary,Max_Salary,Job_Desc,Industry,Rating,Date_Posted,Valid_until,Job_Type
0,Chief Marketing Officer (CMO),National Debt Relief,NY,New York,-1,-1,Who We're Looking For:\n\nThe Chief Marketing ...,Finance,4.0,2020-05-08,2020-06-07,FULL_TIME
1,Registered Nurse,Queens Boulevard Endoscopy Center,NY,Rego Park,-1,-1,"Queens Boulevard Endoscopy Center, an endoscop...",,3.0,2020-04-25,2020-06-07,FULL_TIME


In [132]:
# Check the major job titles in the dataset
jobs_df['Job_title'].value_counts()

Data Scientist                               186
Data Engineer                                129
Data Analyst                                  69
Senior Data Engineer                          44
Senior Data Scientist                         39
                                            ... 
Support Scientist-Ocean Data Assimilation      1
Insights and Analytics Manager                 1
DHS-NTC Senior Scientist                       1
Document Security Scientist                    1
Sr. Healthcare Data Analyst                    1
Name: Job_title, Length: 1619, dtype: int64

In [133]:
# Select 5 job descriptions for data scientist
ds_jobs = [901, 910, 916, 920, 938]

In [134]:
# Select 5 job descriptions for data engineer
de_jobs = [935, 1068, 1089, 1100, 1105]

In [135]:
# Combine both ID lists instead of retyping them; .copy() gives an independent frame to modify later
sample_jobs = jobs_df.loc[ds_jobs + de_jobs, ['Job_title', 'Job_Desc']].copy()
sample_jobs

Unnamed: 0,Job_title,Job_Desc
901,Data Scientist,We are looking for Data Scientists who are int...
910,Data Scientist,The world's largest and fastest-growing compan...
916,Data Scientist,\nRole: Data Scientist.\n\nLocation: Foster Ci...
920,Data Scientist,Upstart is the leading AI lending platform par...
938,Data Scientist,"Why Divvy?Over the past decade, millions of Am..."
935,Data Engineer,About Rocket LawyerWe believe everyone deserve...
1068,Data Engineer,Our mission is to create a world where mental ...
1089,Data Engineer,Data Engineer \nIf you are a Data Engineer wit...
1100,Data Engineer,Prabhav Services Inc. is one of the premier pr...
1105,Data Engineer,About Skupos\nSkupos is the data platform for ...
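
The table above confirms by eye that each hand-picked row carries the intended title. A minimal sketch of how that check could be automated follows. The helper `check_titles` and the `toy_jobs` frame are hypothetical stand-ins (made-up rows, not the Glassdoor data), so this illustrates the idea rather than reproducing a step from the original notebook.

```python
import pandas as pd

def check_titles(df, row_labels, expected_title, title_col="Job_title"):
    """Return the row labels whose title differs from expected_title."""
    titles = df.loc[row_labels, title_col]
    return titles[titles != expected_title].index.tolist()

# Toy stand-in for jobs_df: only the title column, with a few of the row labels
toy_jobs = pd.DataFrame(
    {"Job_title": ["Data Scientist", "Data Engineer", "Data Scientist"]},
    index=[901, 935, 910],
)

# An empty list means every selected row has the expected title
mismatches = check_titles(toy_jobs, [901, 910], "Data Scientist")
print(mismatches)
```

With the real data, the same call could be run once per ID list (`ds_jobs`, `de_jobs`) before building `sample_jobs`.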


In [145]:
# Keep the original row label as an explicit ID column, since the index is dropped on export
sample_jobs['Job_id'] = sample_jobs.index
sample_jobs

Unnamed: 0,Job_title,Job_Desc,Job_id
901,Data Scientist,We are looking for Data Scientists who are int...,901
910,Data Scientist,The world's largest and fastest-growing compan...,910
916,Data Scientist,\nRole: Data Scientist.\n\nLocation: Foster Ci...,916
920,Data Scientist,Upstart is the leading AI lending platform par...,920
938,Data Scientist,"Why Divvy?Over the past decade, millions of Am...",938
935,Data Engineer,About Rocket LawyerWe believe everyone deserve...,935
1068,Data Engineer,Our mission is to create a world where mental ...,1068
1089,Data Engineer,Data Engineer \nIf you are a Data Engineer wit...,1089
1100,Data Engineer,Prabhav Services Inc. is one of the premier pr...,1100
1105,Data Engineer,About Skupos\nSkupos is the data platform for ...,1105


In [147]:
sample_jobs.to_csv('jobs_test_sample.csv', index=False)
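
Because the file is written with `index=False`, the original row labels survive only through the `Job_id` column. The sketch below illustrates that round trip on a toy frame (made-up descriptions, written to an in-memory buffer rather than to `jobs_test_sample.csv`); it is an illustration under those assumptions, not part of the original notebook.

```python
import io
import pandas as pd

# Toy stand-in for sample_jobs, with placeholder descriptions
toy = pd.DataFrame(
    {"Job_title": ["Data Scientist", "Data Engineer"],
     "Job_Desc": ["placeholder text a", "placeholder text b"]},
    index=[901, 935],
)
toy["Job_id"] = toy.index

# Write without the index, then read the CSV text back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# The index is reset to 0..n-1, but Job_id still holds the original labels
print(restored.index.tolist(), restored["Job_id"].tolist())
```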

## Course test data

In [137]:
# Read in the course data
courses_df = pd.read_csv('../Data/Course_Data/Coursera_Catalog.csv')
print(courses_df.shape)
courses_df.head(2)

(4416, 10)


Unnamed: 0,courseType,description,domainTypes,id,slug,specializations,workload,primaryLanguages,certificates,name
0,v2.ondemand,Gamification is the application of game elemen...,"[{'subdomainId': 'design-and-product', 'domain...",69Bku0KoEeWZtA4u62x6lQ,gamification,[],4-8 hours/week,['en'],['VerifiedCert'],Gamification
1,v2.ondemand,This course will cover the steps used in weigh...,"[{'subdomainId': 'data-analysis', 'domainId': ...",0HiU7Oe4EeWTAQ4yevf_oQ,missing-data,[],"4 weeks of study, 1-2 hours/week",['en'],"['VerifiedCert', 'Specialization']",Dealing With Missing Data


In [138]:
# Select 5 courses that a data scientist should be recommended
ds_courses = [3823, 143, 3165, 3588, 2517]

In [139]:
# Select 5 courses that a data engineer should be recommended
de_courses = [545, 1015, 4233, 3763, 1311]

In [140]:
# Combine both ID lists instead of retyping them; .copy() gives an independent frame to modify later
sample_courses = courses_df.loc[ds_courses + de_courses, ['name', 'description']].copy()
sample_courses

Unnamed: 0,name,description
3823,The Data Scientist’s Toolbox,In this course you will get an introduction to...
143,Machine Learning,Machine learning is the science of getting com...
3165,Applied Machine Learning in Python,This course will introduce the learner to appl...
3588,Data Visualization with Python,"""A picture is worth a thousand words"". We are ..."
2517,Machine Learning with Python,This course dives into the basics of machine l...
545,Databases and SQL for Data Science,Much of the world's data resides in databases....
1015,Google Cloud Platform Big Data and Machine Lea...,This 2-week accelerated on-demand course intro...
4233,Big Data Modeling and Management Systems,Once you’ve identified a big data issue to ana...
3763,Database Management Essentials,Database Management Essentials provides the fo...
1311,"Data Warehouse Concepts, Design, and Data Inte...",This is the second course in the Data Warehous...


In [141]:
# Add an empty label column to be filled in per job title
sample_courses['job_title'] = None
sample_courses

Unnamed: 0,name,description,job_title
3823,The Data Scientist’s Toolbox,In this course you will get an introduction to...,
143,Machine Learning,Machine learning is the science of getting com...,
3165,Applied Machine Learning in Python,This course will introduce the learner to appl...,
3588,Data Visualization with Python,"""A picture is worth a thousand words"". We are ...",
2517,Machine Learning with Python,This course dives into the basics of machine l...,
545,Databases and SQL for Data Science,Much of the world's data resides in databases....,
1015,Google Cloud Platform Big Data and Machine Lea...,This 2-week accelerated on-demand course intro...,
4233,Big Data Modeling and Management Systems,Once you’ve identified a big data issue to ana...,
3763,Database Management Essentials,Database Management Essentials provides the fo...,
1311,"Data Warehouse Concepts, Design, and Data Inte...",This is the second course in the Data Warehous...,


In [142]:
# Label each course with the job title it is expected to be recommended for
sample_courses.loc[ds_courses, 'job_title'] = 'Data Scientist'
sample_courses.loc[de_courses, 'job_title'] = 'Data Engineer'
sample_courses

Unnamed: 0,name,description,job_title
3823,The Data Scientist’s Toolbox,In this course you will get an introduction to...,Data Scientist
143,Machine Learning,Machine learning is the science of getting com...,Data Scientist
3165,Applied Machine Learning in Python,This course will introduce the learner to appl...,Data Scientist
3588,Data Visualization with Python,"""A picture is worth a thousand words"". We are ...",Data Scientist
2517,Machine Learning with Python,This course dives into the basics of machine l...,Data Scientist
545,Databases and SQL for Data Science,Much of the world's data resides in databases....,Data Engineer
1015,Google Cloud Platform Big Data and Machine Lea...,This 2-week accelerated on-demand course intro...,Data Engineer
4233,Big Data Modeling and Management Systems,Once you’ve identified a big data issue to ana...,Data Engineer
3763,Database Management Essentials,Database Management Essentials provides the fo...,Data Engineer
1311,"Data Warehouse Concepts, Design, and Data Inte...",This is the second course in the Data Warehous...,Data Engineer
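
The labeling step above can also be driven by a single mapping from job title to course IDs, followed by a quick check that no course was left unlabeled. The sketch below uses a shortened, hypothetical mapping (`course_labels`) and a toy frame with two courses per title; the names are assumptions for illustration.

```python
import pandas as pd

# Shortened, hypothetical mapping: job title -> course row labels
course_labels = {
    "Data Scientist": [3823, 143],
    "Data Engineer": [545, 3763],
}

# Toy stand-in for sample_courses
toy_courses = pd.DataFrame(
    {"name": ["The Data Scientist's Toolbox", "Machine Learning",
              "Databases and SQL for Data Science", "Database Management Essentials"]},
    index=[3823, 143, 545, 3763],
)
toy_courses["job_title"] = None

# Assign each title to its rows
for title, ids in course_labels.items():
    toy_courses.loc[ids, "job_title"] = title

# Verify that every row got a label and count the courses per title
unlabeled = int(toy_courses["job_title"].isna().sum())
counts = toy_courses["job_title"].value_counts().to_dict()
print(unlabeled, counts)
```

Keeping the labels in one dictionary means a third job title could be added without another hand-written assignment line.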


In [148]:
# Keep the original row label as an explicit ID column, since the index is dropped on export
sample_courses['course_id'] = sample_courses.index
sample_courses

Unnamed: 0,name,description,job_title,course_id
3823,The Data Scientist’s Toolbox,In this course you will get an introduction to...,Data Scientist,3823
143,Machine Learning,Machine learning is the science of getting com...,Data Scientist,143
3165,Applied Machine Learning in Python,This course will introduce the learner to appl...,Data Scientist,3165
3588,Data Visualization with Python,"""A picture is worth a thousand words"". We are ...",Data Scientist,3588
2517,Machine Learning with Python,This course dives into the basics of machine l...,Data Scientist,2517
545,Databases and SQL for Data Science,Much of the world's data resides in databases....,Data Engineer,545
1015,Google Cloud Platform Big Data and Machine Lea...,This 2-week accelerated on-demand course intro...,Data Engineer,1015
4233,Big Data Modeling and Management Systems,Once you’ve identified a big data issue to ana...,Data Engineer,4233
3763,Database Management Essentials,Database Management Essentials provides the fo...,Data Engineer,3763
1311,"Data Warehouse Concepts, Design, and Data Inte...",This is the second course in the Data Warehous...,Data Engineer,1311


In [149]:
sample_courses.to_csv('courses_test_sample.csv', index=False)
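
One possible way to use these two files when scoring the model is precision at k: the fraction of the top k recommended courses that appear in the expected list for the job's title. This metric is my assumption about what "accuracy" could mean here, not something the notebook specifies, and the recommended IDs below are made up for illustration rather than produced by the Doc2Vec model.

```python
def precision_at_k(recommended_ids, relevant_ids, k=5):
    """Fraction of the top-k recommended IDs that are in relevant_ids."""
    top_k = list(recommended_ids)[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for course_id in top_k if course_id in relevant)
    return hits / len(top_k)

# Expected courses for a data scientist job (from ds_courses above)
expected = [3823, 143, 3165, 3588, 2517]

# Hypothetical ranked output of a recommender for one data scientist job
recommended = [143, 3823, 545, 3165, 1015]

score = precision_at_k(recommended, expected, k=5)
print(score)  # 3 of the 5 recommendations are expected courses -> 0.6
```

Averaging this score over the 10 sample jobs would give a single number to compare across model settings.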