<h1 style="font-size: 3.5rem"> Most data scientists possess postgraduate degrees. How true is this statement?</h1>

This study uses a sample that consists of the educational qualifications of randomly selected data professionals on the linkedin social network to determine if statistically, most data scientists possess at least one postgraduate degree. Please take note of the assumptions made during the study. 

<h1>Assumptions</h1>

1. A postgraduate degree is a master's degree or a postgraduate degree.
2. Professionals who say they have a postgraduate degree actually have a postgraduate degree.
3. Most data science professionals actually have linkedin profiles.
4. The data scientists in the sample are actually data scientists.
5. The sample is a simple random sample.
6. Data scientists call themselves "data scientists" on their linkedin profiles.

<br/>

<h1>The Data</h1>

In [206]:
import pandas as pd

In [207]:
education_df = pd.read_csv("education.csv")

# change the case of strings in the degree column
education_df['Degree'] = education_df['Degree'].str.lower()

In [208]:
education_df.head()

Unnamed: 0,Scientist,University,Degree
0,62093db0-7628-4198-9048-82da9efb6863,"Birla Institute of Technology and Science, Pilani",master of science - ms data science
1,62093db0-7628-4198-9048-82da9efb6863,JNTUH College of Engineering Hyderabad,b.tech electrical and electronics engineering
2,9442cb86-9853-4e7e-b9b1-ab145d3d71a4,National Institute of Technology Kurukshetra,bachelor of technology - btech computer science 8
3,17aadfcd-9fa9-4767-b24b-f23110a78cec,National Institutes of Health,postdoctoral fellow virtual colonoscopy comput...
4,17aadfcd-9fa9-4767-b24b-f23110a78cec,Polytechnic University of Bucharest,doctor of philosophy (phd) engineering sciences


<br/>

<h1>Utility functions</h1>

In [220]:
def isPostGrad(qualification):
    """Checks if a degree is a postgraduate degree or not
    
    Parameters
    ----------
    qualification: str
    The degree
    
    Returns
    -------
    True if the qualification is a postgraduate degree
    False otherwise
    """
    
    qualification = str(qualification)
    degrees = ["master", "msc", "ms", "m.sci", "msci", "m.sc", "philosophy", "meng", "m.eng", "mhs", "mtech", \
               "m.tech", "m.a", "ma " "mba", "phd", "ph.d", "mmath", "m.s", "msee", "mse", "mstat", "dphil", \
              "mphys", "mres", "mds", "m.mgt", "m.e."]
    
    # filtering...
    if "micromasters" in qualification:
        return False
    
    if "bachelor" in qualification:
        return False
    
    if "graduate certificate" in qualification:
        return False
    
    for degree in degrees:
        if degree in qualification:
            return True
        
    return False
    

In [210]:
def includes(df, word: str):
    """Find degrees that contain a certain word
    
    Parameters
    ----------
    df: DataFrame
    dataframe of educational qualifications
    """
    filtered = df.loc[df['Degree'].str.contains(word)]
    return filtered

<br/>

<h1>Finding Postgraduate degrees</h1>

In [211]:
education_df['isPostGrad'] = education_df['Degree'].apply(isPostGrad)

In [217]:
postgrad_df = education_df[education_df['isPostGrad'] == True]

In [224]:
postgrads = postgrad_df['Scientist'].unique().shape[0]
print("There are approximately {} professionals with at least one postgraduate degree.".format(postgrads))

There are approximately 895 professionals with at least one postgraduate degree.


In [None]:
# checking for the presence of some unwanted strings
includes(postgrad_df, 'b.s')

<br/>

<h1>Formula Sheet</h1>

<img src="Formula Sheet.png" />

<h1>Hypothesis Test</h1>

This is a fixed alpha test with <b style="font-size: 2.5rem; font-weight:700">$\alpha = 0.1$</b>

<p style="font-weight: 700; font-size:2rem;">$H_{o}: p = 0.70$</p>

70% of data science professionals possess at least one postgraduate degree. The proportion is the same and any sample difference is due to <b>sampling error</b>.

<br/>

<p style="font-weight: bold; font-size:2rem;">$H_{A}: p \neq 0.70$ </p>

70% of data science professionals <b>do not</b> possess at least one postgraduate degree. The proportion has changed and <b>any sample difference is real</b>.


<b><i>This hypothesis test is therefore a two-tailed test.</i></b>

<br/>

<h3>Check for assumptions</h3>

1. The sample is random
2. The sample size is less than 10% of the population size.
3. Each sample is independent.
4. $np_{o} \geq 10$ and $n(1-p_{o}) \geq 10$ where $p_{o}$ is the value of p in $H_{o}$.

In the sample, $895$ of $1200$ professionals were found to have postgraduate degrees.


<h4>Sample proportion</h4>

<p style="font-weight: bold; font-size: 2rem; text-align: center">$\hat{p} = \frac {895} {1200} = 0.745$</p>

How unusual is the observed $\hat{p}$?

<br/>

<h4>Test Statistic for a test of proportions</h4>

<p style="font-size: 2rem; text-align: center; font-weight:600">$z = \frac {statistic - parameter} {SD(statistic)} $</p>

<br/>

<p style="font-size:2rem; text-align:center;">$SD(\hat{p}) = \sqrt \frac {0.7 \times 0.3} {1200} = 0.013229$</p>

<br/>

<p style="font-size: 2rem; text-align: center;">$\therefore z = \frac {0.745 - 0.7} {0.013229} = 3.40$</p>

<br/>

Using the Z-score table,
<p style="font-size:2rem; text-align:center">$P- value = 0.9994$</p>

Where

- $\hat{p}$ is the statistic
- <b>$p_{o}$</b> is the parameter.
- $SD(statistic)$ is the standard deviation of the statistic which can be calculated using the formula sheet above.



<br/>

<h1>Predicting the actual proportion</h1>

In this section, statistical methods are used to estimate the true population proportion of data scientists that possess at least one postgraduate degree with a 95% confidence interval.

<h2>Confidence Intervals</h2>

<br/>

<b style="font-size: 1.5rem">CI</b> = <b>statistic</b> $\pm$ <b>critical value</b> $\times$ <b>S.E(statistic)</b>

<b>statistic</b> = $\hat{p}$

<b>critical value</b> at 95% Confidence = $1.960$

<b style="font-size: 2rem; font-weight: bold">$S.E(\hat{p}) = \sqrt \frac {\hat{p}\hat{q}} {n}$</b>

So,

<p style="font-size:2rem; text-align:center;">$S.E(\hat{p}) = \sqrt \frac {0.745 \times 0.255} {1200} = 0.01258222$</p>

$\therefore$ The proportion of data scientists that possess a postgraduate degree is <b style="font-size: 2rem; font-weight: bold">$0.745 \pm 0.025$</b>

<br/>

<h1>Conclusions</h1>

- The P-value is almost 1.


- If the population proportion is $0.70$, it is highly likely (0.9994) to observe 895 out of 1200 professionals that possess a postgraduate degree.


- There is insufficient evidence that the proportion of professionals that possess at least one postgraduate degrees is not 70% with a 10% level of significance.


- Looks like most data scientists indeed possess a postgraduate degree of some sort.


- Around 72% to 77% of data scientists possess at least a postgraduate degree.

<br/>

<h1>Improvements</h1>

A better degree-filtering algorithm can be developed for a more accurate determination of postgraduate degrees. 

<br/>