Bayesian Analysis of Data

In [5]:
import pandas as pd
import numpy as np

In [6]:
dat = pd.read_csv("jobs_in_data.csv")

Introduction

The data science job market is highly competitive, so it helps to have a basic intuition for conditional probability: it lets us estimate how likely an outcome is for a given job position under certain conditions in the market. That, in turn, supports more informed decisions about which data science jobs to look for. In this notebook we apply Bayesian analysis to the dataset across different situations.

Situation 1

Find the probability that you make over $150,000 USD given that you are a Data Engineer. To express this as a conditional probability, we use Bayes' Theorem:

P(A|B) = P(B|A) * P(A) / P(B)

A = Belief, B = Observation

To identify the belief and the observation in this problem, we first need to define the terms. The belief is what we are trying to find out, while the observation is what we are given (the evidence).

In this case, A (Belief) = Salary is greater than $150,000 USD and B (Observation) = You are a Data Engineer.

Going back to Bayes' Theorem, the formula has four parts:
1. P(B|A) = Likelihood
2. P(A) = Prior probability
3. P(B) = Marginal probability (the probability of the observation, also called the evidence)
4. P(A|B) = Posterior (conditional) probability

We will calculate each of them one by one to derive the conditional probability.
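As a minimal sketch of how the four parts fit together, the formula can be wrapped in a small helper. The function name `bayes_posterior` and the numbers passed to it are hypothetical and chosen only for illustration; they are not values from the dataset.

```python
def bayes_posterior(likelihood: float, prior: float, evidence: float) -> float:
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if evidence <= 0:
        raise ValueError("P(B) must be positive for the posterior to be defined")
    return likelihood * prior / evidence


# Hypothetical inputs: P(B|A) = 0.5, P(A) = 0.4, P(B) = 0.25
example_posterior = bayes_posterior(likelihood=0.5, prior=0.4, evidence=0.25)
print(example_posterior)
```

The guard on `evidence` reflects that a posterior is undefined when the observation never occurs.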

To compute the posterior probability with Bayes' Theorem, we first need the number of jobs that have a salary greater than $150,000.

In [19]:
salary_greater = dat.loc[dat['salary_in_usd'] > 150000]
salary_greater_length = len(salary_greater.index)
salary_greater_length

4150

In [24]:
# Get the total number of salaries (aka, the number of records in the dataset)
num_salaries = len(dat.index)
num_salaries

9355

Now we can compute the prior probability of making a salary greater than $150,000 USD.

In [27]:
p_a = salary_greater_length/num_salaries
p_a

0.4436130411544629

About 44% of the jobs in the dataset offer a yearly salary greater than $150,000 USD, which is a large share.
Now that we have the prior probability, we can calculate the probability of being a Data Engineer, P(B).

In [28]:
data_engineer = dat[dat['job_title'] == 'Data Engineer']
data_engineer_length = len(data_engineer.index)
p_b = data_engineer_length / num_salaries
p_b 

0.2346338856226617

Data Engineers make up about 23% of the records in the dataset, a large share given how many different job titles it contains.

Now, let's derive the likelihood. Instead of plugging it directly into Bayes' Theorem, we will work in odds form and use the likelihood ratio. For that, we need the true positive and false positive rates:
1. True Positive = P(B|A) = Probability of being a Data Engineer given that you are making over $150,000 a year.
2. False Positive = P(B|!A) = Probability of being a Data Engineer given that you are making at most $150,000 a year.

The true and false positives are computed in the following code cells.

In [32]:
# P(B|A): share of Data Engineers among jobs paying more than $150,000
df1 = dat.loc[(dat['job_title'] == 'Data Engineer') & (dat['salary_in_usd'] > 150000)]
true_positive = len(df1.index) / salary_greater_length
# P(B|!A): share of Data Engineers among jobs paying at most $150,000
df2 = dat.loc[(dat['job_title'] == 'Data Engineer') & (dat['salary_in_usd'] <= 150000)]
false_positive = len(df2.index) / (num_salaries - salary_greater_length)

In [33]:
true_positive

0.22216867469879517

In [34]:
false_positive

0.24457252641690683
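As a sanity check, the two rates should recombine into P(B) through the law of total probability, P(B) = P(B|A) * P(A) + P(B|!A) * (1 - P(A)). The sketch below copies the numbers from the cell outputs above rather than reloading the CSV; the variable names ending in `_reconstructed` and `_reported` are introduced here only for illustration.

```python
# Values copied from the notebook outputs above
p_a = 0.4436130411544629              # P(A): salary > $150,000
true_positive = 0.22216867469879517   # P(B|A)
false_positive = 0.24457252641690683  # P(B|!A)
p_b_reported = 0.2346338856226617     # P(B) as computed directly from the data

# Law of total probability: weight each conditional rate by how common its group is
p_b_reconstructed = true_positive * p_a + false_positive * (1 - p_a)

print(p_b_reconstructed, p_b_reported)
```

If the two printed numbers disagreed beyond floating-point noise, it would suggest that the salary filters do not split the dataset cleanly (for example, because of missing salary values).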

Now we can compute the likelihood ratio, which is the ratio of the true positive rate to the false positive rate.

likelihood_ratio = true_positive / false_positive
likelihood_ratio

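Next, we calculate the prior odds, which we need in order to find the posterior odds, that is, the odds corresponding to the posterior probability. We will convert the posterior odds back to a conditional probability afterwards. In odds form, no division by the marginal probability P(B) is needed.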

Odds of A = P(A) / (1 - P(A))

In [38]:
p_a_odds = p_a / (1 - p_a)
p_a_odds

0.7973102785782902
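The conversion between probability and odds goes both ways, and the reverse direction is needed again later in this analysis. A minimal sketch with two helper functions follows; the names `prob_to_odds` and `odds_to_prob` are hypothetical, and the input is the prior printed above.

```python
def prob_to_odds(p: float) -> float:
    """Convert a probability in [0, 1) to odds: p / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("probability must be in [0, 1) to have finite odds")
    return p / (1 - p)


def odds_to_prob(odds: float) -> float:
    """Convert non-negative odds back to a probability: odds / (1 + odds)."""
    if odds < 0:
        raise ValueError("odds must be non-negative")
    return odds / (1 + odds)


prior_odds = prob_to_odds(0.4436130411544629)  # P(A) from the notebook output
round_trip = odds_to_prob(prior_odds)          # should recover P(A)
print(prior_odds, round_trip)
```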

Once we have the prior odds, we can find the posterior odds, computed as likelihood ratio * prior odds.

In [40]:
post_odds = likelihood_ratio * p_a_odds
post_odds

0.7242733699921446
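The reason the product of likelihood ratio and prior odds gives the posterior odds can be seen by writing Bayes' Theorem once for A and once for its complement, then dividing the two. The derivation below uses the same A and B as above, with \neg A standing for "not A" (written !A elsewhere in this notebook); P(B) cancels in the division.

```latex
\begin{align*}
P(A \mid B) &= \frac{P(B \mid A)\,P(A)}{P(B)}, \qquad
P(\neg A \mid B) = \frac{P(B \mid \neg A)\,P(\neg A)}{P(B)} \\[4pt]
\underbrace{\frac{P(A \mid B)}{P(\neg A \mid B)}}_{\text{posterior odds}}
  &= \underbrace{\frac{P(B \mid A)}{P(B \mid \neg A)}}_{\text{likelihood ratio}}
     \cdot
     \underbrace{\frac{P(A)}{1 - P(A)}}_{\text{prior odds}}
\end{align*}
```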

Now we can convert the posterior odds back to a conditional probability, using probability = odds / (odds + 1).

In [45]:
cond_prob = post_odds / (post_odds + 1)
cond_prob

0.4200455580865604

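The probability of having a yearly salary over $150,000 USD given that you are a Data Engineer is about 42%. This is slightly below the 44% prior, because Data Engineers are a little less common among jobs paying over $150,000 (about 22.2%) than among the remaining jobs (about 24.5%).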

The following code cell verifies this result using the definition of conditional probability:
P(A|B) = P(A & B) / P(B)

In [50]:
# df1 = DataFrame of Data Engineer jobs that have a yearly salary of over $150,000
p_a_given_b = (len(df1.index) / num_salaries) / p_b
p_a_given_b

0.42004555808656036
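The agreement between the two routes can also be reproduced on a small example with boolean masks, where the mean of a mask is the share of rows for which it is true. This is only a sketch: the `toy` DataFrame below contains made-up rows, not records from jobs_in_data.csv, and it only reuses the column names from the notebook.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from jobs_in_data.csv)
toy = pd.DataFrame({
    "job_title": ["Data Engineer", "Data Engineer", "Data Engineer",
                  "Data Scientist", "Data Scientist", "Data Analyst"],
    "salary_in_usd": [160000, 120000, 155000, 170000, 90000, 80000],
})

is_engineer = toy["job_title"] == "Data Engineer"  # observation B
high_salary = toy["salary_in_usd"] > 150000        # belief A

# Direct route: share of high salaries among Data Engineers, i.e. P(A|B)
direct = high_salary[is_engineer].mean()

# Odds route: likelihood ratio * prior odds, then back to a probability
p_a = high_salary.mean()
true_positive = is_engineer[high_salary].mean()    # P(B|A)
false_positive = is_engineer[~high_salary].mean()  # P(B|!A), must be non-zero
post_odds = (true_positive / false_positive) * (p_a / (1 - p_a))
via_odds = post_odds / (1 + post_odds)

print(direct, via_odds)
```

The odds route breaks down when the false positive rate is zero or the prior equals 1, since both cases involve division by zero; the direct route only requires that at least one row matches the observation.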