## EDA & Statistics Basics

Questions 1-10 cover the basics of exploratory data analysis (EDA). Before we start, we must load the company's data back into our Jupyter notebook. This again means using the company data tables saved in the `aws` Postgres database: `jobs`, `salaries`, and `skills`. To get a better understanding of these tables, consult the company's [planning documents](https://drive.google.com/drive/folders/1z4EwdbyfUzf-FuRTJfVMaRSm5-R25viA).

You will additionally use `pandas` to do some exploratory data analysis.

## Q1

Import all 3 of your tables using sqlalchemy, and convert them into Python objects using the `auto_mapper`. Then create 3 dataframes from the data pulled through a session object. Be sure to dispose of your engine after creating these dataframes.

**Relevant Notes/Labs**
* [2/1 Intro to SQL Notes](https://github.com/The-Knowledge-House/DS_22/blob/main/phase_2/week_9/Lab/intermediate_sql_lab_notes.ipynb)
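
As a rough illustration of the workflow Q1 describes, the sketch below reflects a table, pulls its rows through a session, and disposes of the engine. It assumes that `auto_mapper` refers to SQLAlchemy's automap extension (`automap_base`) and that SQLAlchemy 1.4 or newer is installed. The in-memory SQLite database and the `id` and `title` columns are hypothetical stand-ins for the course's Postgres tables, so adapt the connection string and names to the real schema.

```python
import pandas as pd
from sqlalchemy import create_engine, text
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import Session

# Assumption: an in-memory SQLite database stands in for the Postgres instance.
engine = create_engine("sqlite:///:memory:")

# Hypothetical stand-in table; automap only maps tables that have a primary key.
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE jobs (id INTEGER PRIMARY KEY, title TEXT)"))
    conn.execute(
        text("INSERT INTO jobs (id, title) VALUES (1, 'Data Analyst'), (2, 'Data Engineer')")
    )

# Reflect the existing schema and generate mapped classes from it.
Base = automap_base()
Base.prepare(autoload_with=engine)
Jobs = Base.classes.jobs

# Pull rows through a session and build a dataframe from the mapped columns.
with Session(engine) as session:
    rows = session.query(Jobs).all()
    column_names = [column.name for column in Jobs.__table__.columns]
    jobs_df = pd.DataFrame(
        [{name: getattr(row, name) for name in column_names} for row in rows],
        columns=column_names,
    )

# Release the connection pool once the dataframe exists.
engine.dispose()
print(jobs_df)
```

The same pattern would be repeated for the other two tables before calling `engine.dispose()`.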

## Q2

Join all 3 pandas dataframes that you created above on the primary key columns. Consult the planning docs to figure out how these tables were planned. Save this dataframe into a new variable.

**Relevant Notes/Labs**
* [Pandas Docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html)
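
One possible shape for the join in Q2 is sketched below on toy data. It uses `DataFrame.merge` rather than `DataFrame.join`, since `merge` lets you name the key column directly. The shared key `id` and all other column values are hypothetical; the real key columns are listed in the planning docs.

```python
import pandas as pd

# Hypothetical toy tables that share a primary key column named "id".
jobs_df = pd.DataFrame(
    {"id": [1, 2, 3], "title": ["Data Analyst", "Data Engineer", "ML Engineer"]}
)
salaries_df = pd.DataFrame(
    {"id": [1, 2, 3], "salary_standardized": [85000.0, None, 140000.0]}
)
skills_df = pd.DataFrame({"id": [1, 2, 3], "skill": ["sql", "python", "pytorch"]})

# Chain two inner merges on the shared key to combine all three tables.
joined_df = jobs_df.merge(salaries_df, on="id", how="inner").merge(
    skills_df, on="id", how="inner"
)
print(joined_df)
```

An inner merge keeps only keys present in every table; `how="left"` would instead keep every row of `jobs_df`.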

## Q3

Create a new dataframe that drops all null values in `salary_standardized` from this newly joined dataframe. Use this new dataframe to calculate the minimum offered `salary_standardized`, the maximum, the mean, and the standard deviation.

**Relevant Notes/Labs**
* [1/3 Pandas Cleaning Notes](https://github.com/The-Knowledge-House/DS_22/blob/main/phase_2/week_5/1_3/pandas_cleaning_notes.ipynb)
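
The cleaning and summary steps in Q3 might look like the following sketch. The salary values are made up for illustration, and `joined_df` here is a small stand-in for the dataframe produced in Q2.

```python
import pandas as pd

# Hypothetical stand-in for the joined dataframe from Q2.
joined_df = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D", "E"],
        "salary_standardized": [85000.0, None, 140000.0, 60000.0, 115000.0],
    }
)

# Drop only the rows where salary_standardized is null.
cleaned_df = joined_df.dropna(subset=["salary_standardized"])

# Compute the four summary statistics in one call.
summary = cleaned_df["salary_standardized"].agg(["min", "max", "mean", "std"])
print(summary)
```

Passing `subset` matters here: a bare `dropna()` would also discard rows that are null in any other column.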

## Q4

You'll notice that the `work_from_home` column in this cleaned dataframe has mostly `null` values. A `null` means the employer did not say whether the job was remote-friendly. We are going to assume that unless a job **explicitly** states that it allows remote work (`work_from_home == True`), it is not remote-friendly.

Mark all these `null` values as False.

**Relevant Notes/Labs**
* [1/3 Pandas Cleaning Notes](https://github.com/The-Knowledge-House/DS_22/blob/main/phase_2/week_5/1_3/pandas_cleaning_notes.ipynb)
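
A minimal sketch of the null handling in Q4, using a made-up `work_from_home` column. Depending on how the column was loaded, its dtype may differ from this toy example.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned dataframe from Q3.
cleaned_df = pd.DataFrame(
    {
        "work_from_home": [True, None, None, True, None],
        "salary_standardized": [120000.0, 90000.0, 75000.0, 135000.0, 98000.0],
    }
)

# Treat every missing value as "not remote-friendly", then store real booleans.
cleaned_df["work_from_home"] = cleaned_df["work_from_home"].fillna(False).astype(bool)
print(cleaned_df["work_from_home"].value_counts())
```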

## Q5

Create 2 new dataframes that filter for `work_from_home == True` and `work_from_home == False`.

**Relevant Notes/Labs**
* [12/13 Intro to Pandas Lab](https://github.com/The-Knowledge-House/DS_22/blob/main/phase_2/week_2/12_13/pandas_II_notes.ipynb)
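
The split in Q5 can be sketched with boolean masks. The data below is made up, and the sketch assumes `work_from_home` already holds plain booleans (as it would after Q4).

```python
import pandas as pd

# Hypothetical stand-in for the cleaned dataframe after the nulls are filled.
cleaned_df = pd.DataFrame(
    {
        "work_from_home": [True, False, False, True, False],
        "salary_standardized": [120000.0, 90000.0, 75000.0, 135000.0, 98000.0],
    }
)

# A boolean column can be used directly as a mask; ~ inverts it.
remote_df = cleaned_df[cleaned_df["work_from_home"]]
non_remote_df = cleaned_df[~cleaned_df["work_from_home"]]
print(len(remote_df), len(non_remote_df))
```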

## Q6

For this assessment's EDA, we are interested in revealing the difference in salary between remote workers & non-remote workers. To better support this portion of the data science pipeline, let's create an initial write-up. 

Fill in each of the sections below with 1-3 sentences that answer its prompt.

**Relevant Notes/Labs**
* [2/6 Intro to Tableau Notes](https://github.com/The-Knowledge-House/DS_22/blob/main/phase_2/week_10/2_6/intro_to_tableau_notes.ipynb)

**Background**
Search online for salary outcomes for remote workers. What do the media and relevant literature report about salary outcomes for remote workers? Are they higher or lower? Include relevant links.

Answer here

**Goal for Analysis**
What is the goal of this EDA?

Answer here

**Business Applications**
What would be the potential business utility of insight from this dataset?

Answer here

## Q7

Create histograms using seaborn to plot the distribution of `salary_standardized` for both remote and non-remote work. For each category, create one histogram with each of the following bin counts:
* 10 bins
* 20 bins
* 100 bins

In total, you should have 6 histograms.

**Relevant Notes/Labs**
* [2/21 Data Analytics Notes](https://github.com/The-Knowledge-House/DS_22/blob/main/phase_2/week_12/2_21/data_analytics_notes.ipynb)

## Q8

Run a Kolmogorov-Smirnov test for normality on both categories. Take note of the p-value for each.

**Relevant Notes/Labs**
* [2/21 Data Analytics Notes](https://github.com/The-Knowledge-House/DS_22/blob/main/phase_2/week_12/2_21/data_analytics_notes.ipynb)
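
The test in Q8 could be run with `scipy.stats.kstest`, as in the sketch below. The two samples are synthetic (one normal, one deliberately right-skewed) and stand in for the real salary columns. Note that passing `"norm"` without `args` compares the data to a standard normal with mean 0 and standard deviation 1, so the sketch supplies each sample's own mean and standard deviation. Because those parameters are estimated from the same data, the resulting p-values are only approximate.

```python
import numpy as np
from scipy import stats

# Synthetic samples: one roughly normal, one strongly right-skewed.
rng = np.random.default_rng(0)
samples = {
    "remote": rng.normal(120_000, 25_000, 500),
    "non_remote": rng.lognormal(mean=11.5, sigma=1.0, size=500),
}

results = {}
for label, values in samples.items():
    # Compare against a normal distribution fitted to the sample itself.
    result = stats.kstest(values, "norm", args=(values.mean(), values.std(ddof=1)))
    results[label] = result
    print(f"{label}: statistic={result.statistic:.4f}, p-value={result.pvalue:.4g}")
```

A small p-value is evidence against the sample having come from the fitted normal distribution.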

## Q9

Create a seaborn boxplot on the `cleaned` dataframe (the non-split dataframe) to view the distribution of `salary_standardized` for remote and non-remote jobs. Take note of the relative positions of the two medians.

**Relevant Notes/Labs**
* [12/13 Intro to Pandas Lab](https://github.com/The-Knowledge-House/DS_22/blob/main/phase_2/week_2/12_13/pandas_II_notes.ipynb)
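
A possible form for the boxplot in Q9 is sketched below on generated data. The dataframe contents are made up, and the `Agg` backend line is only there so the sketch runs outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook

import numpy as np
import pandas as pd
import seaborn as sns

# Made-up stand-in for the cleaned, non-split dataframe.
rng = np.random.default_rng(1)
cleaned_df = pd.DataFrame(
    {
        "work_from_home": [True] * 200 + [False] * 300,
        "salary_standardized": np.concatenate(
            [rng.normal(120_000, 25_000, 200), rng.normal(100_000, 30_000, 300)]
        ),
    }
)

# One box per category; the line inside each box marks the median.
ax = sns.boxplot(data=cleaned_df, x="work_from_home", y="salary_standardized")

# The medians can also be read off numerically.
medians = cleaned_df.groupby("work_from_home")["salary_standardized"].median()
print(medians)
```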

## Q10

What did your analysis reveal?

Fill in the corresponding sections with your observations.

**Relevant Notes/Labs**
* [2/6 Intro to Tableau Notes](https://github.com/The-Knowledge-House/DS_22/blob/main/phase_2/week_10/2_6/intro_to_tableau_notes.ipynb)

**Outcomes**
What distribution did each of your categories take? Did they share a similar distribution?


Which category had the most outliers (right or left)?


Is there an obvious difference in medians across categories?



**Next Steps**
What potential next steps would you like to do to improve your analysis?

## Basics of Statistics

Questions 11-15 cover essential statistics knowledge that we need in order to best apply analytical methods and competently answer interview questions.

Either uncomment the multiple choice option that you believe best answers a respective question, or fill in the blank `print()` statement with your open-ended answer.

## Q11

Which statement best describes the central limit theorem? Uncomment the print statement that contains the correct answer.

```python
# print("the sampling distribution of the sample mean is approximately normal under certain conditions.")
# print("as a sample size grows, its mean gets closer to the average of the whole population")
# print("the average of a data set")
# print("the most frequent number of a data set")
```

## Q12

Which statement best describes the law of large numbers? Uncomment the print statement that contains the correct answer.

```python
# print("the sampling distribution of the sample mean is approximately normal under certain conditions.")
# print("as a sample size grows, its mean gets closer to the average of the whole population")
# print("the average of a data set")
# print("the most frequent number of a data set")
```

## Q13

What is the commonly cited rule-of-thumb minimum sample size for the central limit theorem to apply? Uncomment the print statement that contains the correct answer.

```python
# print("100")
# print("20")
# print("30")
# print("5")
```

## Q14

If the p-value of a hypothesis test is less than 0.05, what do we state about the null hypothesis? Answer in your own words.

```python
print("")
```

## Q15

After viewing our distribution for standardized salary in the dataset, we notice a couple of salaries >$200k for CEO positions. This is forcing an otherwise normal distribution to be right-skewed. How would you handle these values? Answer in your own words.

```python
print("")
```