## A brief notebook to answer the question:
# "Which job title am I actually doing in my current position"
### <i>A quick data exploration of: </i> <b>Data Scientist vs. Software Engineer</b> 

2018-01
Justin Gosses

-----------------

## THIS IS NOTEBOOK II
### Vectorization & Machine Learning Model Building & Prediction
Go <a href="http://nbviewer.jupyter.org/github/JustinGOSSES/WhichJobTitle/blob/master/Which_Job_Title_Are_You-PartI.ipynb">here</a> for notebook part I.


In [153]:
#### Some libraries we'll be using
#### Pandas and Numpy for working with data structures
import pandas as pd
import numpy as np
#### Requests for loading webpages
import requests
#### BeautifulSoup for some very small scraping (not really data mining as we're only getting human-level data back)
from bs4 import BeautifulSoup 
#### Regular expressions for text cleaning
import re
#### Visualization via the Vega library
from vega3 import Vega
#### Writing and reading to files via JSON
import json

new this notebook, we're also importing several libraries for machine-learning, which we'll do above each model

## Part IV
## Convert json jobs results into array for numerical only area machine-learning

In [154]:
with open('data/new_array_of_jobs_orig.json') as json_data:
    new_array_of_jobs_ml = json.load(json_data)
    print(len(new_array_of_jobs_ml))

42


In [298]:
#print(new_array_of_jobs_ml)

In [156]:
#### Takes in the job array called 'new_array_of_jobs_ml' from the json 'new_array_of_jobs_orig.json'
#### And creates two files: 1) list of list of counts for each skill, which will now be features, and 2) list of lables
#### 0 = data scientist
#### 1 = software engineer
def convertJSONtoDataArray(jsonData):
    holder_of_features = []
    holder_of_labels = []
    for job in jsonData:
        skill_holder = []
        for count in job["word_counts"]:
            skill_holder.append(int(count["count"]))
        
        holder_of_features.append(skill_holder)
        if job["job"] == "Software Engineer":
            holder_of_labels.append(0)
        elif job["job"] == "Data Scientist":
            holder_of_labels.append(1)
        else:
            print("THERE IS A PROBLEM YOU HAVE A WEIRD JOB TITLE")
    return holder_of_features,holder_of_labels       

In [157]:
holder_of_features,holder_of_labels = convertJSONtoDataArray(new_array_of_jobs_ml)

In [158]:
print(len(holder_of_features))

42


In [159]:
print(len(holder_of_labels))

42


In [202]:
def convertLabelsToText(lab):
    if lab == 0:
            return "Software Engineer"
    elif lab == 1:
            return "Data Scientist"
    else:
        return "Other"
            

In [205]:
#### functino that takes in a array of labels in terms of 0 and 1 and converts it to array of Software Engineer or Data Scientist text lables
#### This will be used below to convert numerical labels to text
def convertLabelsToTextArray(holder_of_labels):
    text_labels = []
    for lab in holder_of_labels:
        text_labels.append(convertLabelsToText(lab))
    return text_labels

In [206]:
text_labels = convertLabelsToTextArray(holder_of_labels)
print(text_labels)

['Data Scientist', 'Data Scientist', 'Data Scientist', 'Data Scientist', 'Data Scientist', 'Data Scientist', 'Data Scientist', 'Data Scientist', 'Data Scientist', 'Data Scientist', 'Software Engineer', 'Software Engineer', 'Software Engineer', 'Software Engineer', 'Software Engineer', 'Software Engineer', 'Software Engineer', 'Software Engineer', 'Software Engineer', 'Software Engineer', 'Software Engineer', 'Software Engineer', 'Software Engineer', 'Software Engineer', 'Software Engineer', 'Software Engineer', 'Data Scientist', 'Data Scientist', 'Data Scientist', 'Data Scientist', 'Data Scientist', 'Data Scientist', 'Data Scientist', 'Data Scientist', 'Data Scientist', 'Data Scientist', 'Software Engineer', 'Software Engineer', 'Software Engineer', 'Software Engineer', 'Software Engineer', 'Software Engineer']


In [207]:
#### save skill counts as features in a json
with open('data/holder_of_features.json', 'w') as outfile:
    json.dump(holder_of_features, outfile)

In [208]:
#### save labels(0 or 1) in a json
with open('data/holder_of_labels.json', 'w') as outfile:
    json.dump(holder_of_labels, outfile)

## Part V
### Machine-learning

In [209]:
from sklearn import svm
import numpy as np
import matplotlib.pyplot as plt

### DecisionTreeClassifier

In [210]:
#### importing from scikit-learn http://scikit-learn.org/
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier

In [211]:
#print(holder_of_features)

In [212]:
#### Set training (X) and label data (y)
X = holder_of_features
y = holder_of_labels

#### Make a decision tree classifier
clf = DecisionTreeClassifier(max_depth=None, min_samples_split=2,random_state=0)
scores = cross_val_score(clf, X, y)
scores.mean()                             

0.8582417582417583

In [213]:
#### Make a Random Forest Classifier
clf_rf = RandomForestClassifier(n_estimators=10, max_depth=None,min_samples_split=6, random_state=0)
scores_RF = cross_val_score(clf_rf, X, y)
scores_RF.mean()    

0.883882783882784

In [280]:
#### Make a Random Forest Classifier
clf_ET = ExtraTreesClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
scores_ET = cross_val_score(clf_ET, X, y)
scores_ET.mean()  

0.9283272283272282

### KNeighborsClassifier

In [277]:

from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y) 
KNeighborsClassifier(...)
print(neigh.predict_proba([holder_of_features[1]]))

[[0. 1.]]


In [278]:
scores_KNN1 = cross_val_score(neigh, X, y)
scores_KNN1.mean()  

0.7884004884004884

In [279]:
#### Examining results of KNN classification
def correctOrNotKNN(neigh,holder_of_features,holder_of_labels):
    counter = 0
    for eachJob in holder_of_features:
        result_list = neigh.predict_proba([eachJob])[0].tolist()
        if result_list[1] > result_list[0]:
            prediction = "Data Scientist"
        elif result_list[1] < result_list[0]:
            prediction = "Software Engineer"
        else:
            print("check data !!!!")
        text_label = convertLabelsToText(holder_of_labels[counter])
        if text_label == prediction:
            score = True
        else:
            score = False
        print(counter,prediction,score,result_list)
        counter += 1

In [244]:
#### Examining results of KNN classification
correctOrNotKNN(neigh,holder_of_features,holder_of_labels)

0 Data Scientist True [0.0, 1.0]
1 Data Scientist True [0.0, 1.0]
2 Data Scientist True [0.3333333333333333, 0.6666666666666666]
3 Data Scientist True [0.3333333333333333, 0.6666666666666666]
4 Data Scientist True [0.0, 1.0]
5 Data Scientist True [0.3333333333333333, 0.6666666666666666]
6 Data Scientist True [0.0, 1.0]
7 Data Scientist True [0.0, 1.0]
8 Data Scientist True [0.0, 1.0]
9 Data Scientist True [0.0, 1.0]
10 Software Engineer True [1.0, 0.0]
11 Software Engineer True [0.6666666666666666, 0.3333333333333333]
12 Software Engineer True [1.0, 0.0]
13 Software Engineer True [1.0, 0.0]
14 Software Engineer True [1.0, 0.0]
15 Software Engineer True [1.0, 0.0]
16 Software Engineer True [1.0, 0.0]
17 Software Engineer True [1.0, 0.0]
18 Software Engineer True [1.0, 0.0]
19 Software Engineer True [1.0, 0.0]
20 Software Engineer True [0.6666666666666666, 0.3333333333333333]
21 Software Engineer True [0.6666666666666666, 0.3333333333333333]
22 Software Engineer True [1.0, 0.0]
23 Softwa

### SVM SVC

#### linear kernal

In [261]:
import numpy as np
X = np.array(holder_of_features)
y = np.array(holder_of_labels)
from sklearn.svm import SVC
clf_SVC1 = SVC()
clf_SVC1.fit(X, y) 
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
    max_iter=-1, probability=True, random_state=None, shrinking=True,
    tol=0.001, verbose=False)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [267]:
scores_SVC1 = cross_val_score(clf_SVC1, X, y)
scores_SVC1.mean()  

0.7407814407814408

In [299]:
#### Examining results of KNN classification
def correctOrNotSVM(clf,holder_of_features,holder_of_labels):
    counter = 0
    for eachJob in holder_of_features:
        result_list = clf.predict([eachJob])[0].tolist()
        #print("result_list",result_list)
        if result_list == 1:
            prediction = "Data Scientist"
        elif result_list == 0:
            prediction = "Software Engineer"
        else:
            print("check data !!!!")
        text_label = convertLabelsToText(holder_of_labels[counter])
        if text_label == prediction:
            score = True
        else:
            score = False
        #print(counter,prediction,score,result_list)
        counter += 1

In [300]:
test = correctOrNotSVM(clf_SVC1,holder_of_features,holder_of_labels)

#### rbf kernal

In [268]:
### try with rbf kernal
import numpy as np
X = np.array(holder_of_features)
y = np.array(holder_of_labels)
from sklearn.svm import SVC
clf_rbf1 = SVC()
clf_rbf1.fit(X, y) 
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [272]:
scores_SVC1 = cross_val_score(clf_rbf1, X, y)
scores_SVC1.mean()  

0.7407814407814408

In [301]:
# SVCresults = correctOrNotSVM(clf_rbf1,holder_of_features,holder_of_labels)

#### poly kernal

In [274]:
### try with rbf kernal
import numpy as np
X = np.array(holder_of_features)
y = np.array(holder_of_labels)
from sklearn.svm import SVC
clf_poly1 = SVC()
clf_poly1.fit(X, y) 
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='poly',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='poly',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [275]:
scores_poly1 = cross_val_score(clf_poly1, X, y)
scores_poly1.mean()  

0.7407814407814408

In [302]:
# SVCresults = correctOrNotSVM(clf_pol,holder_of_features,holder_of_labels)

## Part VI
## Now lets prepare text from my resume and see what it predicts

In [177]:
#### open the resume text from a saved txt file
f = open('data/Justin_Resume.txt','r')
text_ugly = f.read()
print(text_ugly)

Data Science, Data Visualization, and Web Developer
Offers two years of experience on the NASA data analytics team and nine years of experience in the oil industry as a geoscientist. Has successfully delivered projects in the machine learning, artificial intelligence, data visualization, and software engineering spaces working with clients in domains as diverse as geology, finance, human resources, and astronaut training. Seeks a position that applies my data analytics and programming skills to build new tools and capabilities.

Computer Language, Database, Web-development & Machine-Learning Skills: 
•	Language, DB & Cloud - Python, R, Java, JavaScript, Sed/Awk, Prostgresql, Neo4J, AWS System admin.
•	A.I. - Alexa skill development, CMU Sphinx speech-to-text, & Raspberry Pi IoT development.
•	ML - Scikit-learn, TensorFlow, NumPy, Theano, and Keras, Weka, MATLAB, K-means, SVM, & Decision Trees.
•	Web – Flask.py, Node.js, JQuery.js, Angular.js, & React.js, HTML, CSS, & Wordpress.
•	Data 

In [178]:
#### Clean the ugly resume text
def cleanResume(resume):
    resume = resume.lower()
    resume = re.sub('[^a-zA-Z0-9 \n\.]', '', resume)
    resume = resume.replace('\n',' ')
    resume = resume.replace('\r',' ')
    print(resume)
    return resume

In [179]:
#### I'm leaving in the periods as they are needed to capture some skills
clean_resume_text = cleanResume(text_ugly)

data science data visualization and web developer offers two years of experience on the nasa data analytics team and nine years of experience in the oil industry as a geoscientist. has successfully delivered projects in the machine learning artificial intelligence data visualization and software engineering spaces working with clients in domains as diverse as geology finance human resources and astronaut training. seeks a position that applies my data analytics and programming skills to build new tools and capabilities.  computer language database webdevelopment  machinelearning skills  language db  cloud  python r java javascript sedawk prostgresql neo4j aws system admin. a.i.  alexa skill development cmu sphinx speechtotext  raspberry pi iot development. ml  scikitlearn tensorflow numpy theano and keras weka matlab kmeans svm  decision trees. web  flask.py node.js jquery.js angular.js  react.js html css  wordpress. data visualization  d3.js three.js seaborn ggplot2 bokeh vega tableau

In [180]:
def convertJSONtoDataArray(jsonData):
    holder_of_features = []
    holder_of_labels = []
    for job in jsonData:
        skill_holder = []
        for count in job["word_counts"]:
            skill_holder.append(int(count["count"]))
        
        holder_of_features.append(skill_holder)
        if job["job"] == "Software Engineer":
            holder_of_labels.append(0)
        elif job["job"] == "Data Scientist":
            holder_of_labels.append(1)
        else:
            print("THERE IS A PROBLEM YOU HAVE A WEIRD JOB TITLE")
    return holder_of_features,holder_of_labels

In [181]:
resume_obj = {"job":'Data Scientist',"full_title":'Data Scientist',"url":"none"}

In [182]:
print(type(clean_resume_text))

<class 'str'>


In [183]:
def createFeaturesForTest(resume_obj,clean_resume_text,array_of_skills):
    results_array = []
    for skill in array_of_skills:
            instance = clean_resume_text.count(skill)
            results_array.append({"skill":skill,"count":instance})
    resume_obj["word_counts"] = results_array
    return resume_obj
        
# results_array = []
#         for skill in array_of_skills:
#             instance = noSpecial.count(skill)
#             results_array.append({"skill":skill,"count":instance})
#         job["word_counts"] = results_array
#         new_array_of_jobs.append(job)

In [184]:
#### Get array of skills again
with open('data/array_of_skills.json') as json_data:
    array_of_skills = json.load(json_data)
    print(len(array_of_skills)) 

907


In [185]:
obj_job_skills_OfResume = createFeaturesForTest(resume_obj,clean_resume_text,array_of_skills)

In [186]:
#obj_job_skills_OfResume

In [187]:
def convertJSONtoDataArray(jsonData):
    holder_of_features = []
    holder_of_labels = []
    for job in jsonData:
        skill_holder = []
        for count in job["word_counts"]:
            skill_holder.append(int(count["count"]))
        
        holder_of_features.append(skill_holder)
        if job["job"] == "Software Engineer":
            holder_of_labels.append(0)
        elif job["job"] == "Data Scientist":
            holder_of_labels.append(1)
        else:
            print("THERE IS A PROBLEM YOU HAVE A WEIRD JOB TITLE")
    return holder_of_features,holder_of_labels

In [188]:
holder_of_features_resume,holder_of_labels_resume = convertJSONtoDataArray([obj_job_skills_OfResume])
print(len(holder_of_features_resume[0]))

907


In [189]:
holder_of_labels_resume

[1]

### The best results seemed to come from the Extra Forest Classifier, so we'll use that model to see what job it thinks I do based on my resume
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html

#### For refresher, this is the Extra Forest Classifier

In [297]:
#### Make a Extra Forest Classifier
clf_ET = ExtraTreesClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
scores_ET = cross_val_score(clf_ET, X, y)
clf_ET.fit(X, y)
scores_ET.mean()  

0.9283272283272282

If we run the features associated with my resume through it:

In [288]:
prediction_num = clf_ET.predict(holder_of_features_resume)

In [289]:
convertLabelsToText(prediction_num)

'Data Scientist'

### which means I'm likely a data scientist
#### ...at least according to the skils on my resume and a small sample of job descriptions 

*and I apologize for all the poor code, this was a silly experiment and not really worth cleaning at this time*