# **How To Get a Job in Data Science**

At this stage of my life, I am trying to solve one problem: how can I successfully move from mechanical engineering into a data science job, whether as an analyst, a scientist, or under any other title?

I will focus my analysis on the **2020 Kaggle Machine Learning & Data Science Survey** to extract recent, valuable information that helps me gain an in-depth understanding of the current job market and make better strategic decisions while job hunting.

Here is a link to the survey:
https://www.kaggle.com/c/kaggle-survey-2020

I broke up the analysis into three stages:
* Since I have been targeting a data scientist role, I will first concentrate on **finding the relationship between the data scientist role and the aspects below**, using exploratory data analysis and correspondence analysis.
  * Age
  * Gender
  * Country
  * Education level
  * Languages and IDE
* Then I will **build a classifier with several multi-class classification models to predict the job role** based on the answers each participant has given, and pick the model with the best performance.
* Last, I will **use the model to find the most suitable positions for me** based on my current skill set and experience. I will also look for the factor that contributes most to being employed in such a role (hopefully a data scientist), so that I can work on it effectively and get hired soon!

## Import Data and General Overview

In [None]:
#install necessary modules
!pip install prince
!pip install adjustText

In [None]:
#import all packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
from prince import CA, MCA
import seaborn as sns
import plotly.offline as py
py.init_notebook_mode(connected=True)

In [None]:
#load data
df = pd.read_csv("/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv", low_memory=False)
df.shape

### General scan of the whole dataset

In [None]:
df.describe()

In [None]:
df.head()

In [None]:
#remove first row and first column
df_cp = df.drop(index=0)
df_cp = df_cp.drop(columns = "Time from Start to Finish (seconds)")
df_cp.head()

In [None]:
#check missing values
df_cp.isna().sum()

### Restructure the dataset to make it easier to manage
Some of the questions are split into sub-parts. I will group the columns of each question together and save them as separate data frames in a dictionary.

In [None]:
#initialise the dictionary
questions = {}

#create a list of grouped questions
qnums = list(set([q.split("_")[0] for q in df_cp.columns]))
qnums = sorted(qnums, key=lambda q:int(q[1:]))

In [None]:
#group dataframe to the question in the dictionary questions
for i in qnums:
    questions[i] = df_cp[[q for q in df_cp.columns if q.split("_")[0] == i]]

In [None]:
#example of Q7 looks like below
questions["Q7"].head()
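
The grouping step is easier to follow on a toy frame. The sketch below is only an illustration: the `toy` frame and its values are made up to mimic the survey's `Q7_Part_1`-style column naming, and it builds the same kind of dictionary in a single pass over the columns.

```python
import pandas as pd

# Hypothetical miniature of the survey: one single-choice question (Q1)
# and one multiple-choice question (Q7) spread over several columns.
toy = pd.DataFrame({
    "Q1": ["25-29", "30-34"],
    "Q7_Part_1": ["Python", None],
    "Q7_Part_2": [None, "R"],
    "Q7_OTHER": [None, None],
})

# Collect column names under their question prefix ("Q7_Part_1" -> "Q7").
columns_by_question = {}
for col in toy.columns:
    columns_by_question.setdefault(col.split("_")[0], []).append(col)

# One sub-frame per question, in the spirit of the `questions` dictionary.
toy_questions = {q: toy[cols] for q, cols in columns_by_question.items()}

print(list(toy_questions))
print(toy_questions["Q7"].columns.tolist())
```

Because Python dictionaries keep insertion order, the questions come out in the order their columns appear in the frame, so no separate sorting step is needed in this toy version.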

## EDA - Overview of Data Science job market
Correspondence analysis (CA) is used extensively in this section to connect the dots between the data scientist role and the chosen categorical variables. It gives an appropriate visualisation of qualitative variables through a perceptual map, so that I can get a better picture of the whole job market and see where I am positioned.

Moreover, the inertia explained by the CA indirectly gives me some level of confidence, since it indicates how well the visualisations at hand represent reality.
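
To make "explained inertia" concrete, here is a minimal NumPy sketch of the standard CA computation on a contingency table. It is not how `prince` is implemented internally, the function name `explained_inertia` is my own, and the 3x3 table is made up rather than taken from the survey. It assumes every row and column of the table has at least one count.

```python
import numpy as np

def explained_inertia(table):
    """Share of total inertia carried by each CA axis of a contingency table."""
    counts = np.asarray(table, dtype=float)
    P = counts / counts.sum()                 # correspondence matrix
    r = P.sum(axis=1)                         # row masses
    c = P.sum(axis=0)                         # column masses
    expected = np.outer(r, c)                 # what independence would predict
    S = (P - expected) / np.sqrt(expected)    # standardised residuals
    singular_values = np.linalg.svd(S, compute_uv=False)
    eigenvalues = singular_values ** 2        # principal inertias
    return eigenvalues / eigenvalues.sum(), eigenvalues.sum()

# Made-up 3x3 table (rows: age bands, columns: roles), not survey data.
toy_table = [[30, 10, 5],
             [10, 30, 10],
             [5, 10, 30]]
ratios, total_inertia = explained_inertia(toy_table)
print(ratios[:2].sum())   # share of inertia explained by the first two axes
```

The total inertia equals the chi-square statistic of the table divided by the number of observations, so a percentage such as the ones quoted below says how much of the departure from independence the two plotted axes capture.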

### Data Positions in terms of Age

In [None]:
#correspondence analysis on age & positions
age_ds = pd.crosstab(df_cp.Q1,df_cp.Q5)
ca_age = CA(n_components=2)
ca_age.fit(age_ds)
ca_age.plot_coordinates(age_ds, figsize=(10,10));

The two components explain 96.26% of the variability, which gives a good indication of the age distribution.

The job positions are clearly segmented across age groups. Besides the obvious association between Student and age 18-22, **there is a clear transition from the younger age groups, who tend to work on products and applications, to a more senior group that does more theoretical research**, which normally requires a higher education level, i.e. more years are needed to reach such positions.

Age 29 (my age) is most closely associated with Data Analyst, Machine Learning Engineer and Data Scientist, and it is also true that I am not currently employed in a data science job! (I wonder how the CA figured that out :)

Let's take a closer look at the age distribution of data scientists:

In [None]:
#define subset for data scientist
df_ds = df_cp[df_cp.Q5=="Data Scientist"]

In [None]:
colors = ["lightblue"] * len(df_ds.Q1.value_counts().index)
colors[0] = "lightsalmon"

fig = go.Figure(layout=go.Layout(title= go.layout.Title(text="Age Distribution of Data Scientist")))
fig.add_trace(go.Bar(x= df_ds.Q1.value_counts().index,
                     y=df_ds.Q1.value_counts().values,
                     marker_color=colors))

**At age of 29, I think I have a good chance of getting hired! Age ticked!**

### Data Positions in terms of Gender

In [None]:
#use adjustText module to reduce the overlap between the texts in maps created from CA
from adjustText import adjust_text

In [None]:
#correspondence analysis on gender & positions
gender_ds = pd.crosstab(df_cp.Q2,df_cp.Q5)
ca_gender = CA(n_components=2)
ca_gender.fit(gender_ds)
ax = ca_gender.plot_coordinates(gender_ds,
                                figsize=(10,10),
                                show_row_labels=False,show_col_labels=False)
ax.legend(loc="center right")

#adjust text to avoid overlapping labels
cols=ca_gender.column_coordinates(gender_ds).to_dict()
xcols=cols[0]
ycols=cols[1]
rows=ca_gender.row_coordinates(gender_ds).to_dict()
xrows=rows[0]
yrows=rows[1]

xglobal={ k : xcols.get(k,0)+xrows.get(k,0) for k in set(xcols) | set(xrows) }
yglobal={ k : ycols.get(k,0)+yrows.get(k,0) for k in set(ycols) | set(yrows) }

fig = ax.get_figure()
texts=[plt.text(xglobal[x],yglobal[x],x,fontsize=10) for x in xglobal.keys()]
adjust_text(texts,arrowprops=dict(arrowstyle='-', color='red'));

The two components explain 94.11% of the variability, which gives a good indication of the gender distribution.

**There is a huge gender imbalance in the data science industry**: most positions are held by men, indicating that the overall job market is male-dominated.

Let's take a closer look at the gender distribution of data scientists:

In [None]:
colors = ["lightblue"] * len(df_ds.Q2.value_counts().index)
colors[0] = "lightsalmon"

fig = go.Figure(layout=go.Layout(title= go.layout.Title(text="Gender Distribution of Data Scientist")))
fig.add_trace(go.Bar(x= df_ds.Q2.value_counts().index,
                     y=df_ds.Q2.value_counts().values,
                     marker_color=colors))

The bar plot tells the same story: among the respondents, the number of men is approximately four times the sum of the other genders. Hopefully schemes or programmes can be set up in the data science industry to encourage people of other genders to pursue a career in the relevant fields.

**Being a man makes me the majority of the workforce in this industry.**

### Data Positions in terms of Country

In [None]:
#country distribution on a reduced dataset for clarity: based on the population
#who took the survey, the 20 countries with the most participants are selected here

n = 20
reduced_country = df_cp.Q3.value_counts().iloc[:n]

In [None]:
#keep only the respondents from the selected countries
#(DataFrame.append was removed in pandas 2.0, so filter with isin instead)
df_red = df_cp[df_cp.Q3.isin(reduced_country.index)]

In [None]:
#correspondence analysis on country & positions
country_ds = pd.crosstab(df_red.Q3,df_red.Q5)
ca_country = CA(n_components=2)
ca_country.fit(country_ds)
ax = ca_country.plot_coordinates(country_ds, figsize=(10,10),show_row_labels=False,show_col_labels=False)

#adjust text to avoid overlapping labels
cols=ca_country.column_coordinates(country_ds).to_dict()
xcols=cols[0]
ycols=cols[1]
rows=ca_country.row_coordinates(country_ds).to_dict()
xrows=rows[0]
yrows=rows[1]

xglobal={ k : xcols.get(k,0)+xrows.get(k,0) for k in set(xcols) | set(xrows) }
yglobal={ k : ycols.get(k,0)+yrows.get(k,0) for k in set(ycols) | set(yrows) }

fig = ax.get_figure()
texts=[plt.text(xglobal[x],yglobal[x],x,fontsize=10) for x in xglobal.keys()]
adjust_text(texts,arrowprops=dict(arrowstyle='-', color='red'));

The two components explain 74.06% of the variability, which gives an okay indication of the country distribution.

The analysis of countries is more complicated, since there are so many other country-specific factors that influence the distribution of positions within an industry, such as education level, salary level and the overall level of industry and technology. Chances are that the same position goes by different names in different countries, or vice versa. However, **the first principal component draws a line between developed countries (left side of the map) and developing countries (right side of the map)**, whereas the **second principal component differentiates the types of job role, from application-oriented (top of the map) to research-oriented (bottom of the map)**.

**Thus, most data science related jobs are concentrated in developed countries; also, Germany and Japan have a higher proportion of Research Scientists than the other countries**.

Let's also check the absolute number of survey respondents per country to make sure we do not draw conclusions without context.

In [None]:
colors = ["lightblue"] * len(df_red.Q3.value_counts().index)
colors[6] = "lightsalmon"

fig = go.Figure(layout=go.Layout(title= go.layout.Title(text="Country Distribution of Respondents")))
fig.add_trace(go.Bar(x= df_red.Q3.value_counts().index,
                     y=df_red.Q3.value_counts().values,
                     marker_color=colors))
fig.update_layout(xaxis_tickangle=45)
fig.show()

As we can see, apart from India (a developing country), the USA (a developed country) and Other, the remaining countries have respondent numbers of a similar order and are a fairly even mix of the two types of countries, so the above conclusion holds.

**Living in the UK, I am among the first six nations, and I hope that guarantees me a good chance of becoming a data scientist!**

### Data Positions in terms of Education Level

In [None]:
#correspondence analysis on education & positions
df_edu = df_cp.drop(df_cp[df_cp.Q4 == "I prefer not to answer"].index)
edu_ds = pd.crosstab(df_edu.Q4,df_edu.Q5)
ca_edu = CA(n_components=2)
ca_edu.fit(edu_ds)
ax = ca_edu.plot_coordinates(edu_ds, figsize=(10,10),show_row_labels=False,show_col_labels=False)

#adjust text to avoid overlapping labels
cols=ca_edu.column_coordinates(edu_ds).to_dict()
xcols=cols[0]
ycols=cols[1]
rows=ca_edu.row_coordinates(edu_ds).to_dict()
xrows=rows[0]
yrows=rows[1]

xglobal={ k : xcols.get(k,0)+xrows.get(k,0) for k in set(xcols) | set(xrows) }
yglobal={ k : ycols.get(k,0)+yrows.get(k,0) for k in set(ycols) | set(yrows) }

fig = ax.get_figure()
texts=[plt.text(xglobal[x],yglobal[x],x,fontsize=10) for x in xglobal.keys()]
adjust_text(texts,arrowprops=dict(arrowstyle='-', color='red'));

The two components explain 97.53% of the variability, which gives a good indication of the education level.

Three main findings:
* There is **a distinct relationship between a doctoral degree and the Research Scientist role, and also a good relationship with Statistician**; these two positions are normally more demanding in terms of education.
* **Most data science jobs are associated with people holding a Master's degree**.
* **Business analyst and project manager are two unique positions that require a professional degree**; this makes sense, as you will normally be required to have a proven track record of industrial and managerial experience to become a business manager.

Now let's plot the education distribution among data scientists:

In [None]:
colors = ["lightblue"] * len(df_ds.Q4.value_counts().index)
colors[0] = "lightsalmon"

fig = go.Figure(layout=go.Layout(title= go.layout.Title(text="Education Distribution of Data Scientist")))
fig.add_trace(go.Bar(x= df_ds.Q4.value_counts().index,
                     y=df_ds.Q4.value_counts().values,
                     marker_color=colors))
fig.update_layout(xaxis_tickangle=45)
fig.show()

**Most of the data science workforce has a Master's degree, and luckily I do too! Even better, my Master's is in computational methods, so hopefully it gives me an edge!**

### Data Positions in terms of Language

In [None]:
#Q7 has multiple choices so extra steps are needed for aggregating answers
questions["Q7"]

In [None]:
df_lan = questions["Q5"].join(questions["Q7"])
df_lan

In [None]:
#"unroll" the dataframe items of one column of positions plus one column of languages
cols = list(questions["Q7"].columns)
df_lan = (df_lan.melt(id_vars="Q5", value_vars=cols))
df_lan.columns = ["Position", "variable", "Language"]
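
The "unrolling" step can be checked on a small made-up example. The sketch below assumes the same layout as the survey (each `Q7` column holds either its own option label or a missing value); the three rows are invented for illustration.

```python
import pandas as pd

# Made-up answers: each Q7 column holds either its own option label or NaN.
toy = pd.DataFrame({
    "Q5": ["Data Scientist", "Data Scientist", "Statistician"],
    "Q7_Part_1": ["Python", "Python", None],
    "Q7_Part_2": [None, "R", "R"],
})

# Wide to long: one row per (respondent, language column) pair.
long_form = toy.melt(id_vars="Q5", value_vars=["Q7_Part_1", "Q7_Part_2"])
long_form.columns = ["Position", "variable", "Language"]

# Unselected options stay missing after melting; crosstab leaves them out.
counts = pd.crosstab(long_form.Position, long_form.Language)
print(counts)
```

The long frame has one row per respondent and option, so a respondent who ticked two languages contributes two counts to the contingency table.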

In [None]:
#correspondence analysis on language & positions
lan_ds = pd.crosstab(df_lan.Position, df_lan.Language)
ca_lan = CA(n_components=2)
ca_lan.fit(lan_ds)
ax = ca_lan.plot_coordinates(lan_ds, figsize=(10,10),show_row_labels=False,show_col_labels=False)
ax.legend(loc="upper left")

#adjust text to avoid overlapping labels
cols=ca_lan.column_coordinates(lan_ds).to_dict()
xcols=cols[0]
ycols=cols[1]
rows=ca_lan.row_coordinates(lan_ds).to_dict()
xrows=rows[0]
yrows=rows[1]

xglobal={ k : xcols.get(k,0)+xrows.get(k,0) for k in set(xcols) | set(xrows) }
yglobal={ k : ycols.get(k,0)+yrows.get(k,0) for k in set(ycols) | set(yrows) }

fig = ax.get_figure()
texts=[plt.text(xglobal[x],yglobal[x],x,fontsize=10) for x in xglobal.keys()]
adjust_text(texts,arrowprops=dict(arrowstyle='-', color='red'));

The two components explain 77.59% of the variability, which gives an okay indication of the coding languages.

Python stands out as a universal language: it sits so close to the centre point that **the ratio of people who use Python is pretty much the same across all the different positions**. Besides that, **a Software Engineer tends to use Javascript and Swift; a Data Engineer uses Bash and SQL more frequently; a Statistician concentrates more on R and a Research Scientist prefers MATLAB**.

Again, here are the language preferences within the data scientist community:

In [None]:
df_ds_lan = df_lan[df_lan.Position=="Data Scientist"]
colors = ["lightblue"] * len(df_ds_lan.Language.value_counts().index)
colors[0] = "lightsalmon"

fig = go.Figure(layout=go.Layout(title= go.layout.Title(text="Language Distribution of Data Scientist")))
fig.add_trace(go.Bar(x= df_ds_lan.Language.value_counts().index,
                     y= df_ds_lan.Language.value_counts().values,
                     marker_color=colors))
fig.update_layout(xaxis_tickangle=45)
fig.show()

My primary language for data analysis is luckily Python, so I at least have half a foot in the data science club (I hope)!

I now see a good foundation for becoming a Data Scientist based on the analysis so far, but we all know I have only scratched the surface by exploring a few features.

To strategise my job hunt with more precision, let me complete the survey myself and build a classification model to see exactly which positions give me the highest chance of employment!

## Build A Classification Model
* As the dataset is quite imbalanced in terms of positions, I will first trim it to remove the least represented positions and the ones I am not interested in.
* Prepare my own answers and transform them into an array with the correct shape, ready for prediction.
* Build the classifier with different models and evaluate their performance to pick the best model.
* Predict the position I should be aiming for and find out how I can increase the odds of getting that job.

### Trim the dataset
Overview of the population of all the positions

In [None]:
fig = go.Figure(layout=go.Layout(title= go.layout.Title(text="Positions")))
fig.add_trace(go.Bar(x= df_cp.Q5.value_counts().index,
                     y=df_cp.Q5.value_counts().values))

Since I am looking for a job that I am interested in, I will only keep the positions below:
* Data Scientist
* Data Analyst
* Machine Learning Engineer
* Research Scientist
* Software Engineer

In [None]:
df_cls = df_cp[df_cp.Q5.isin(["Data Scientist","Software Engineer","Data Analyst","Research Scientist","Machine Learning Engineer"])]
df_cls = df_cls[df_cls.Q5.notna()]
df_cls.Q5.value_counts()

That seems like a good number of samples, so let's go ahead and work on this dataset.

Looking at the survey, the columns after Q39 are for non-professionals, which is outside the interest of this analysis, so I only keep the columns up to and including Q39.

In [None]:
df_prof = df_cls.iloc[:,:df_cls.columns.get_loc("Q39_OTHER")+1]
df_prof.head()

Let's have an overview of the population distribution of the remaining positions

In [None]:
fig = go.Figure(layout=go.Layout(title= go.layout.Title(text="Positions")))
fig.add_trace(go.Bar(x= df_prof.Q5.value_counts().index,
                     y=df_prof.Q5.value_counts().values))

### Build up my own answers
I construct my answers as below. Where I had to guess an answer, or picture how it would look in my future job, I have flagged it with a comment.

In [None]:
#add my own choices to the survey
questions_multi = ["Q7","Q9","Q10","Q12","Q14","Q16","Q17","Q18","Q19","Q23","Q26","Q27","Q28","Q29","Q31","Q33","Q34","Q35","Q36","Q37","Q39"]
questions_AB = ["Q26","Q27","Q28","Q29","Q31","Q33","Q34","Q35"]
choices = {"Q1":["25-29"],
           "Q2":["Man"],
           "Q3":["United Kingdom of Great Britain and Northern Ireland"],
           "Q4":["Master’s degree"],
           "Q6":["< 1 years"],
           "Q7":["Python","SQL"],
           "Q8":["Python"],
           "Q9":["Jupyter (JupyterLab, Jupyter Notebooks, etc)"],
           "Q10":["Kaggle Notebooks","Colab Notebooks"],
           "Q11":["A personal computer or laptop"],
           "Q12":["None"],
           "Q13":["Never"],
           "Q14":["Matplotlib","Seaborn","Plotly / Plotly Express"],
           "Q15":["Under 1 year"],
           "Q16":["Scikit-learn","TensorFlow"],
           "Q17":["Linear or Logistic Regression","Decision Trees or Random Forests"],
           "Q18":["None"],
           "Q19":["None"],
           "Q20":["0-49 employees"],    #not sure
           "Q21":["3-4"],               #guess
           "Q22":["I do not know"],
           "Q23":["Analyze and understand data to influence product or business decisions"],   
           "Q24":["60,000-69,999"],     
           "Q25":["$1000-$9,999"],      #guess
           "Q26":["None"],              
           "Q27":["No / None"],
           "Q28":["No / None"],
           "Q29":["PostgresSQL "],
           "Q30":["MySQL "],
           "Q31":["None"],
           "Q32":["Tableau"],
           "Q33":["No / None"],
           "Q34":["No / None"],
           "Q35":["TensorBoard"],
           "Q36":["Kaggle","GitHub"],
           "Q37":["Coursera","Udemy"],
           "Q38":["Local development environments (RStudio, JupyterLab, etc.)"],
           "Q39":["Kaggle (notebooks, forums, etc)"]}

In [None]:
#function to generate instances for certain questions
def answer_generator(question,answer):
    """
    Generate a list of answers which matches the survey answer format
    question: number of the question as a string
    answer: a list of strings of the answers
    """
    #each column of a multiple-choice question holds a single option label,
    #so the first row of mode() lists the available options
    options = questions[question].mode().values
    for i in range(options.shape[1]):
        if options[0,i].strip() not in answer:
            options[0,i] = np.nan
    return list(options[0])

In [None]:
#example
answer_generator("Q7",["Python","Java"])
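
The trick behind `answer_generator` is that every column of a multiple-choice question contains a single option label (or a missing value), so the column-wise mode recovers the list of options. The sketch below shows this on a made-up three-column block; `toy_q7` and `my_answers` are hypothetical names used only here.

```python
import numpy as np
import pandas as pd

# Made-up multiple-choice block: each column contains one label or NaN.
toy_q7 = pd.DataFrame({
    "Q7_Part_1": ["Python", None, "Python"],
    "Q7_Part_2": [None, "R", "R"],
    "Q7_Part_3": ["SQL", "SQL", None],
})

# The first row of mode() holds the single label of each column.
option_labels = toy_q7.mode().iloc[0]

# Keep the label where it was chosen and NaN elsewhere, one entry per column.
my_answers = ["Python", "SQL"]
row = [label if label.strip() in my_answers else np.nan for label in option_labels]
print(row)
```

The resulting list has one entry per survey column, which is the shape needed to append it as a new row.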

In [None]:
#generate the profile for all the questions
def profile_generator(questions_multi,questions_AB,choices):
    """
    Generates the overall profile based on the answers given
    questions_multi: question numbers which are multiple choice
    questions_AB: question numbers which are asked of both professionals and non-professionals
    choices: a dict mapping each question number to a list of answers
    """
    profile = []
    for q in choices.keys():
        if q in questions_multi:
            answer = answer_generator(q,choices[q])
            if q in questions_AB:
                #keep only the first half of the columns, i.e. the professionals' part of the question
                answer = answer[:len(answer)//2]
        else:
            answer = choices[q]

        profile += answer
    return profile

In [None]:
#showcase my answer profile which is ready to be inserted into the main dataframe for
#predictions, NOT for building and training the model
my_profile = profile_generator(questions_multi=questions_multi,questions_AB=questions_AB,choices=choices)
my_profile

In [None]:
#append my answer profile to last row 20036
df_myself = df_prof.drop("Q5", axis=1)
df_myself.loc[df_myself.index[-1]+1] = my_profile
df_myself.tail()

### Encoding the dataset to get ready for model building

In [None]:
df_dummies = pd.get_dummies(df_myself)
df_dummies.head()

In [None]:
if (df_myself.index == df_dummies.index).sum() == len(df_myself):
    print(f"Both index matches {len(df_myself)}, proceed")
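
Appending my profile before encoding matters because `pd.get_dummies` derives its columns from the values it sees. Encoding everything in one go guarantees that my row ends up with exactly the same indicator columns as the training rows. A small sketch on made-up data (the index `99` stands in for the appended row):

```python
import pandas as pd

# Made-up survey rows plus one appended "profile" row (index 99).
toy = pd.DataFrame(
    {"Q1": ["25-29", "30-34", "25-29"], "Q7_Part_1": ["Python", None, "Python"]},
    index=[1, 2, 99],
)

encoded = pd.get_dummies(toy)

# Every row, including the appended one, shares the same indicator columns.
profile_vector = encoded.loc[99]
print(encoded.columns.tolist())
print(profile_vector.tolist())
```

Missing answers simply become all-zero indicators, since `get_dummies` does not create a column for NaN by default.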

In [None]:
#create dataset for training which excludes my answer
df_model = df_prof.Q5.to_frame().join(df_dummies)
array_myself = df_dummies.loc[df_myself.index[-1]]

In [None]:
df_model

### Build four baseline models
* Logistic Regression
* KNN
* SVC
* RandomForestClassifier

In [None]:
%matplotlib inline

# Models from Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Model Evaluations
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import roc_curve, auc  # plot_roc_curve is unused here and was removed in scikit-learn 1.2

In [None]:
X = df_model.drop(["Q5"], axis=1)
y = df_model.Q5

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=7)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
# Put models in a dictionary
models = {"Logistic Regression": LogisticRegression(max_iter=1000),
          "KNN": KNeighborsClassifier(),
          "SVC": SVC(kernel='linear', probability=True,random_state=7),
          "Random Forest": RandomForestClassifier()}

# Create a function to fit and score models
def fit_and_score(models, X_train, X_test, y_train, y_test):
    """
    Fits and evaluates given machine learning models.
    models : a dict of different Scikit-Learn machine learning models
    X_train : training data (no labels)
    X_test : testing data (no labels)
    y_train : training labels
    y_test : test labels
    """
    # Set random seed
    np.random.seed(7)
    # Make a dictionary to keep model scores
    model_scores = {}
    # Loop through models
    for name, model in models.items():
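        # Fit the model to the training data
        model.fit(X_train, y_train)
        # Score with 5-fold cross-validation; note that cross_val_score refits
        # clones of the model on folds of the test set, so this score does not
        # use the fit above
        model_scores[name] = np.mean(cross_val_score(model,X_test, y_test,scoring="accuracy",cv=5))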
    return model_scores

In [None]:
model_scores = fit_and_score(models=models,
                             X_train=X_train,
                             X_test=X_test,
                             y_train=y_train,
                             y_test=y_test)

model_scores

In [None]:
model_compare = pd.DataFrame(model_scores, index=["accuracy"])
fig = go.Figure([go.Bar(x=model_compare.columns, y=model_compare.iloc[0])])
fig.update_layout(title="Comparison of Model Accuracy")
fig.show()

So far, in terms of accuracy, RandomForestClassifier has outperformed the other models, but Logistic Regression is not far behind. Let's tune the two models so that each reaches its full performance.

Moreover, this dataset is known to be a little imbalanced: the number of Data Scientists is more than twice that of Research Scientists or Machine Learning Engineers. So we should be careful in choosing the evaluation metrics.

We started the evaluation with accuracy; the next steps are hyperparameter tuning, followed by metrics that are better suited to an imbalanced dataset:

* Hyperparameter tuning
* Confusion matrix
* Precision / Recall / F1 score
* ROC / AUC
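
To see why accuracy alone can flatter a model on imbalanced classes, consider a "classifier" that always predicts the most common role. The labels below are made up to mirror the rough imbalance described above; they are not the survey counts.

```python
from collections import Counter

# Made-up labels with a similar imbalance, not the real survey counts.
y_true = (["Data Scientist"] * 50
          + ["Research Scientist"] * 20
          + ["Machine Learning Engineer"] * 20)

# A "model" that always predicts the most common class.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall per class: share of each true class that was predicted correctly.
recall = {}
for cls in sorted(set(y_true)):
    hits = sum(t == p == cls for t, p in zip(y_true, y_pred))
    recall[cls] = hits / y_true.count(cls)
macro_recall = sum(recall.values()) / len(recall)

print(f"accuracy={accuracy:.2f}, macro recall={macro_recall:.2f}")
```

The baseline is right more than half the time without learning anything, while its class-averaged recall is only one third, which is why the per-class metrics listed above are worth the extra effort.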

### Hyperparameter tuning
Before moving on to the two best performers, I want to know how much better a KNN model can do, since it is the simplest algorithm.

In [None]:
# Let's tune KNN

train_scores = []
test_scores = []

# Create a list of different values for n_neighbors
neighbors = range(1, 40)

# Setup KNN instance
knn = KNeighborsClassifier()

# Loop through different n_neighbors
for i in neighbors:
    knn.set_params(n_neighbors=i)
    
    # Fit the algorithm
    knn.fit(X_train, y_train)
    
    # Update the training scores list
    train_scores.append(knn.score(X_train, y_train))
    
    # Update the test scores list
    test_scores.append(knn.score(X_test, y_test))

In [None]:
plt.figure(figsize=(10,6))
plt.plot(neighbors, train_scores, label="Train score")
plt.plot(neighbors, test_scores, label="Test score")
plt.xticks(np.arange(1, 40, 1))
plt.xlabel("Number of neighbors")
plt.ylabel("Model score")
plt.legend()

print(f"Maximum KNN score on the test data: {max(test_scores)*100:.2f}%")

With a peak accuracy of around 55%, it is safe to treat this as a lower bound on the prediction accuracy.

#### Hyperparameter tuning with RandomizedSearchCV

We're going to tune:
* LogisticRegression()
* RandomForestClassifier()
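
Before running a randomized search it helps to know how large each search space is. The sketch below assumes the value ranges of the two grids used in this section (copied here so the block runs on its own) and a budget of ten sampled candidates per model, and counts how many combinations an exhaustive grid search would have to try.

```python
import numpy as np

# Value ranges assumed from the grids in this section:
# LogisticRegression: C, multi_class (2 options), solver (2 options)
log_reg_sizes = [len(np.logspace(-2, 1, 20)), 2, 2]
# RandomForestClassifier: n_estimators, min_samples_split, min_samples_leaf
rf_sizes = [len(np.arange(10, 1000, 50)),
            len(np.arange(2, 20, 2)),
            len(np.arange(1, 20, 2))]

n_log_reg = int(np.prod(log_reg_sizes))
n_rf = int(np.prod(rf_sizes))
n_iter = 10  # number of candidates a randomized search would draw

print(f"LogisticRegression: {n_iter} of {n_log_reg} combinations sampled")
print(f"RandomForestClassifier: {n_iter} of {n_rf} combinations sampled")
```

With such a small sampled fraction, especially for the random forest, the "best" parameters found are best read as a reasonable starting point rather than a true optimum.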

In [None]:
# Create a hyperparameter grid for LogisticRegression
log_reg_grid = {"C": np.logspace(-2, 1, 20),
                "multi_class":["ovr","multinomial"],
                "solver":["lbfgs","saga"]}

# Create a hyperparameter grid for RandomForestClassifier
rf_grid = {"n_estimators": np.arange(10, 1000, 50),
           "min_samples_split": np.arange(2, 20, 2),
           "min_samples_leaf": np.arange(1, 20, 2)}

In [None]:
# Tune LogisticRegression

# Setup random hyperparameter search for LogisticRegression
rs_log_reg = RandomizedSearchCV(models["Logistic Regression"],
                                param_distributions=log_reg_grid,
                                cv=5,
                                n_iter=10,
                                verbose=True,
                                random_state=7)

# Fit random hyperparameter search model for LogisticRegression
rs_log_reg.fit(X_train, y_train)

In [None]:
rs_log_reg.best_params_

In [None]:
rs_log_reg.score(X_test, y_test)

Now we've tuned LogisticRegression(), let's do the same for RandomForestClassifier()...

In [None]:
# Setup random hyperparameter search for RandomForestClassifier
rs_rf = RandomizedSearchCV(RandomForestClassifier(), 
                           param_distributions=rf_grid,
                           cv=5,
                           n_iter=10,
                           verbose=True,
                           random_state=7)

# Fit random hyperparameter search model for RandomForestClassifier()
rs_rf.fit(X_train, y_train)

In [None]:
# Find the best hyperparameters
rs_rf.best_params_

In [None]:
# Evaluate the randomized search RandomForestClassifier model
rs_rf.score(X_test, y_test)

After the randomised search, the Logistic Regression model gives a slightly better accuracy than RandomForestClassifier, so let's continue.

### Evaluating our tuned machine learning classifiers using the metrics below, in addition to accuracy

* Confusion matrix
* Precision / Recall / F1 score
* ROC / AUC

In [None]:
#create a function to plot the confusion matrix in a heatmap
sns.set(font_scale=1.0)

def plot_conf_mat(y_test, y_preds, labels):

    fig, ax = plt.subplots(figsize=(8, 8))
    ax = sns.heatmap(confusion_matrix(y_test, y_preds, labels=labels),
                     annot=True,
                     cbar=False,
                     fmt="g")
    plt.xlabel("Predicted label")
    plt.ylabel("True label")
    ax.set_title("Confusion Matrix")
    ax.xaxis.set_ticklabels(labels)
    ax.yaxis.set_ticklabels(labels)
    plt.xticks(rotation=-45); plt.yticks(rotation=45)

In [None]:
#create a function to evaluate both Logistic Regression and RandomForestClassifier
def evaluate_model(model, X_test, y_test):
    '''
    A function to output the confusion matrix and precision/recall/F1 score for
    a given classifier
    '''
    y_preds = model.predict(X_test)
    labels = y_test.value_counts().index
    
    #print precision/recall/F1 score
    print(classification_report(y_test, y_preds))
    
    #plot confusion matrix
    plot_conf_mat(y_test, y_preds, labels)

In [None]:
#evaluate the tuned Logistic Regressor
evaluate_model(rs_log_reg, X_test, y_test)

In [None]:
#evaluate the tuned RandomForestClassifier
evaluate_model(rs_rf, X_test, y_test)
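
The numbers in the classification report can be read straight off the confusion matrix. Here is a minimal sketch on a made-up 3-class matrix (rows are true labels and columns are predictions, the same orientation as in the heatmap above); the counts are invented for illustration.

```python
import numpy as np

# Made-up 3-class confusion matrix: rows are true labels, columns are predictions.
cm = np.array([[40, 5, 5],
               [10, 20, 0],
               [5, 5, 10]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)   # per predicted class (columns)
recall = true_positives / cm.sum(axis=1)      # per true class (rows)
f1 = 2 * precision * recall / (precision + recall)

print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2))
```

Reading the matrix this way makes it easier to spot which roles the model confuses with each other, not just how often it is right overall.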

In [None]:
# Plot one-vs-rest ROC curves and calculate the AUC for each class
def plot_multiclass_roc(model, X_test, y_test, n_classes, figsize=(17, 6)):
    y_score = model.predict_proba(X_test)

    #decision_function
    # structures
    fpr = dict()
    tpr = dict()
    roc_auc = dict()

    # calculate dummies once
    y_test_dummies = pd.get_dummies(y_test, drop_first=False).values
    labels = pd.get_dummies(y_test, drop_first=False).columns
    for i in range(n_classes):
        fpr[i], tpr[i], _ = roc_curve(y_test_dummies[:, i], y_score[:, i])
        roc_auc[i] = auc(fpr[i], tpr[i])

    # roc for each class
    fig, ax = plt.subplots(figsize=figsize)
    ax.plot([0, 1], [0, 1], "k--")
    ax.set_xlim([0.0, 1.0])
    ax.set_ylim([0.0, 1.05])
    ax.set_xlabel("False Positive Rate")
    ax.set_ylabel("True Positive Rate")
    ax.set_title("Receiver operating characteristic")
    for i in range(n_classes):
        ax.plot(fpr[i], tpr[i], label=f'ROC curve (area = {roc_auc[i]:0.2f}) for label {labels[i]}')
    ax.legend(loc="best")
    ax.grid(alpha=.4)
    sns.despine()
    plt.show()

In [None]:
#ROC/AUC for Logistic Regressor
plot_multiclass_roc(rs_log_reg, X_test, y_test, n_classes=5, figsize=(16, 10))

In [None]:
#ROC/AUC for RandomForestClassifier
plot_multiclass_roc(rs_rf, X_test, y_test, n_classes=5, figsize=(16, 10))
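
For intuition about what each per-class AUC means: it is the probability that a randomly chosen respondent of that class receives a higher predicted score for it than a randomly chosen respondent of any other class. The sketch below computes this directly from pairwise comparisons; the helper name `auc_one_vs_rest` and the four data points are made up for illustration.

```python
import numpy as np

def auc_one_vs_rest(y_true, scores, positive_label):
    """AUC for one class against the rest, via pairwise score comparisons."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == positive_label]
    neg = scores[y_true != positive_label]
    # Fraction of (positive, negative) pairs ranked correctly; ties count half.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Made-up labels and predicted probabilities for class "A".
toy_labels = ["B", "B", "A", "A"]
toy_prob_a = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc_one_vs_rest(toy_labels, toy_prob_a, "A")
print(toy_auc)
```

An AUC of 0.5 corresponds to the dashed diagonal in the plots, and 1.0 to a model that always ranks the true class above the rest.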

We can see a **slightly higher F1 score and AUC for Logistic Regression compared to those from the RandomForestClassifier**. Although the difference is minimal, the classic Logistic Regression wins and I shall use it to predict my future!

### Best strategies to get hired via analysing feature importance
Before moving on to predicting my future job, let's have a look at the most important contributors for each of the positions. This information will guide my decision making and job-hunting strategy.

In [None]:
#Define and construct the logistic regressor with the best parameters
lr = LogisticRegression(C=0.01, multi_class="multinomial")
lr.fit(X_train, y_train)

In [None]:
#plot the first five contributors for each of the positions
labels = pd.get_dummies(y_test, drop_first=False).columns

plt.figure(figsize=(10,30))
for n in range(len(labels)):
    ax = plt.subplot(len(labels),1,n+1)
    feature_dict = dict(zip(X_train.columns, list(lr.coef_[n])))
    feature_dict = dict(sorted(feature_dict.items(), key=lambda item: item[1], reverse=True))

    # Visualize feature importance
    feature_df = pd.DataFrame(feature_dict, index=["importance"])
    feature_df=feature_df.T.head(5).iloc[::-1]

    height = feature_df["importance"]
    bars = feature_df.index
    y_pos = np.arange(len(bars))

    # Create horizontal bars
    plt.barh(y_pos, height)
 
    # Create names on the y-axis
    plt.yticks(y_pos, bars)
    
    plt.title(f"Model Coefficients in Logistic Regressor for {labels[n]}")

* **Analyze and understand the data to influence product or business decisions** is the biggest contributor for both Data Analyst and Data Scientist, so it can safely be considered one of the most important competencies for those two roles.
* However, a Data Analyst needs **more experience with BI tools**, while not being required to be a coding master in terms of years of experience.
* For a Data Scientist, **R is a good option to get your hands dirty**, and you will be required to **gain some traction with machine learning experience and knowledge**.
* **Building and improving the operations and performance of ML models** is the main job of Machine Learning Engineers.
* Not surprisingly, **getting a PhD is your best bet if you want to join as a Research Scientist**, which in turn means you will naturally be expected to publish and to master MATLAB as your numerical computation tool.
* As for Software Engineers, **JavaScript and Java are the critical skills in demand**, and it seems that most people land the job with a Bachelor's degree.

Let's see how much the odds of getting hired increase, assuming I am good at analyzing data and driving business decisions.

In [None]:
#convert the largest coefficient of each role into the relative increase in odds when that feature switches from 0 to 1
odd_analyst = np.exp(np.max(lr.coef_[0]))-1
odd_scientist = np.exp(np.max(lr.coef_[1]))-1

print(f"Being good at 'Analyze and understand the data to influence product or business decisions' increases the odds of becoming\nData Analyst by {odd_analyst*100:.2f}%\nData Scientist by {odd_scientist*100:.2f}%")
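To make the coefficient-to-odds conversion concrete, here is a minimal sketch with a hand-built softmax model. The weights `W_demo`, intercepts `b_demo` and the two inputs are made-up illustrative values, not taken from the fitted `lr`. It checks that, in a multinomial model, switching a binary feature from 0 to 1 multiplies the odds of one class relative to another by `exp` of the difference between their coefficients. The one-vs-rest odds change also depends on the other features, so `exp(coef) - 1` for a single class is better read as an approximation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of class scores."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical 3-class, 2-feature model (illustrative values only)
W_demo = np.array([[0.8, -0.2],
                   [0.5,  0.1],
                   [-1.3, 0.1]])
b_demo = np.array([0.1, 0.0, -0.1])

x_off = np.array([0.0, 1.0])  # feature 0 switched off
x_on = np.array([1.0, 1.0])   # feature 0 switched on, everything else unchanged

p_off = softmax(W_demo @ x_off + b_demo)
p_on = softmax(W_demo @ x_on + b_demo)

# Odds of class 0 relative to class 2, before and after the switch
pairwise_ratio = (p_on[0] / p_on[2]) / (p_off[0] / p_off[2])
pairwise_expected = np.exp(W_demo[0, 0] - W_demo[2, 0])

# Odds of class 0 against all other classes, compared with exp(coef)
ovr_ratio = (p_on[0] / (1 - p_on[0])) / (p_off[0] / (1 - p_off[0]))

print(f"pairwise odds ratio:      {pairwise_ratio:.4f}")
print(f"exp(coef difference):     {pairwise_expected:.4f}")
print(f"one-vs-rest odds ratio:   {ovr_ratio:.4f}")
print(f"exp(coef) for class 0:    {np.exp(W_demo[0, 0]):.4f}")
```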

Now let's take a look at feature importance from the RandomForestClassifier. For now I will only take a quick look at the mean decrease in tree impurity, to get a global picture of how much each feature contributes to the final decision.

In [None]:
rf = RandomForestClassifier(n_estimators=760, min_samples_split=18, min_samples_leaf=1)
rf.fit(X_train, y_train)

In [None]:
# Feature importance dataframe
imp_df = pd.DataFrame({'feature': X_train.columns.values,
                       'importance': rf.feature_importances_})
 
# Reorder by importance and keep the 20 most important features
ordered_df = imp_df.sort_values(by='importance').tail(20)
 
# Values and labels for the horizontal bar plot
height = ordered_df['importance']
bars = ordered_df['feature']
y_pos = np.arange(len(bars))

plt.figure(figsize=(10,10))
# Create horizontal bars
plt.barh(y_pos, height)
 
# Create names on the y-axis
plt.yticks(y_pos, bars)

plt.title("Mean reduction in tree impurity in random forest")

#plt.tight_layout()
# Show graphic
plt.show()

The first few features also appear in the feature importance analysis for the Logistic Regressor. What we can read from this chart, together with the previous section, is that a **Doctoral degree, the ability to analyse data and influence decisions, and JavaScript** strongly determine whether a sample is categorised as **Research Scientist, Data Analyst / Data Scientist, or Software Engineer**, respectively.
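Impurity-based importances are computed from the training data only, so a common cross-check is permutation importance on held-out data. The sketch below is a minimal illustration on synthetic data, not the survey features. The dataset, the small forest and all `_demo` names are assumptions made for the example. It uses scikit-learn's `permutation_importance` and prints both measures side by side.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 8 features, only 3 of them informative (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, n_informative=3,
                                     n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          random_state=0)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on the test set and record the drop in accuracy
perm_demo = permutation_importance(rf_demo, X_te, y_te, n_repeats=5, random_state=0)

# List features from most to least important by permutation importance
for idx in perm_demo.importances_mean.argsort()[::-1]:
    print(f"feature {idx}: "
          f"impurity = {rf_demo.feature_importances_[idx]:.3f}, "
          f"permutation = {perm_demo.importances_mean[idx]:.3f} "
          f"+/- {perm_demo.importances_std[idx]:.3f}")
```

If both measures rank the same features at the top, the reading of the chart is on firmer ground.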

#### Save the classifier
**This was originally done on my local drive; the whole section is commented out on Kaggle**.

In [None]:
# pickle the model
#import pickle
#Position_classfier = {'model': lr}
#pickle.dump(Position_classfier, open('classifier' + ".p", "wb"))

In [None]:
# test the pickled model for prediction

#file_name = "classifier.p"
#with open(file_name, 'rb') as pickled:
#    Position_classfier = pickle.load(pickled)
#    classifier = Position_classfier['model']

#classifier.predict(X_test.iloc[0,:].values.reshape(1,-1)), y_test.iloc[0]
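For anyone running this locally, the round trip can be sketched as follows. This is a minimal illustration that trains a tiny stand-in classifier on made-up binary data, writes it to a temporary directory and reloads it. The file name `classifier_demo.p` and all `_demo` names are hypothetical. Using `with` blocks makes sure the file handles are closed after writing and reading.

```python
import itertools
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset: every combination of 4 binary answers (assumption)
X_demo = np.array(list(itertools.product([0, 1], repeat=4)), dtype=float)
y_demo = (X_demo[:, 0] * X_demo[:, 1]).astype(int)  # label is 1 only if both are 1

clf_demo = LogisticRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "classifier_demo.p")

    # Save the model inside a dict, mirroring the commented-out cell above
    with open(path, "wb") as fh:
        pickle.dump({"model": clf_demo}, fh)

    # Load it back and pull the model out of the dict
    with open(path, "rb") as fh:
        restored_demo = pickle.load(fh)["model"]

# The restored model should predict exactly like the original one
same_predictions = np.array_equal(clf_demo.predict(X_demo),
                                  restored_demo.predict(X_demo))
print(same_predictions)
```

A pickled model should be loaded with the same scikit-learn version that saved it, and only from a trusted source.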

### Finally, let's predict the position I can get hired for

In [None]:
probabilities = lr.predict_proba(array_myself.values.reshape(1,-1))
positions_proba = pd.DataFrame(probabilities,columns=labels,index=["Probability"])
positions_proba

In [None]:
fig = go.Figure([go.Bar(x=positions_proba.columns, y=positions_proba.iloc[0])])
fig.update_layout(title="Probabilities of Getting hired at Different Positions")
fig.show()

So, not so surprisingly, my future is pointing towards **Data Analyst, with a good 42.6% chance**! However, I might still pursue a **Data Scientist** role, since that has **a good chance of nearly 40% as well**. But the best thing is that I know what I need to concentrate on improving and demonstrating in my projects, interviews and future work: the skill to **analyse and understand data to influence business decisions**, regardless of being an analyst or a scientist!