# Pandas Basics

Questions 1 through 10 rely on the following pseudo-dataframes. 

*warehouse*
| prod_id | type      | price_kg | speciality | origin      |
| ------- | --------- | -------- | ---------- | ----------- |
| 0       | Robusta   | 2.586    | True       | Vietnam     |
| 1       | Arabica   | 4.558    | False      | Brazil      |
| 2       | Liberica  | 4.840    | False      | Philippines |
| 3       | Excelsa   | 4.840    | True       | India       |


*orders*
| ord_id | destination | kgs    | type     | prod_id |  comp_id |
| ------ | ----------- | ------ | -------- | ------- | -------- |
|  0     | USA         | 1000   | Robusta  | 0       | 0        |
|  1     | USA         | 1000   | Arabica  | 1       | 0        |
|  2     | USA         | 200    | Liberica | 2       | 1        |
|  3     | Germany     | 1200   | Arabica  | 1       | 2        |
|  4     | Germany     | 200    | Robusta  | 0       | 2        |
|  5     | Japan       | 800    | Arabica  | 1       | 3        |
|  6     | France      | 700    | Excelsa  | 3       | 4        |

For each question, write out the pandas code that best accomplishes the listed requirements for these two tables. Assume that the tables have been saved in two pandas dataframes called `ware_df` and `order_df` respectively.

These tables do not exist anywhere in our repository. Just like before, you will write out code without the ability to test iteratively.
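If you want to check your answers after writing them out, the two tables can be rebuilt by hand. The following is only a minimal sketch: it assumes `kgs` is stored as strings (as Q5 describes) and that every other column uses the obvious Python type, neither of which is confirmed by an actual data file.

```python
import pandas as pd

# Rebuild the warehouse table from the rows listed above.
ware_df = pd.DataFrame({
    "prod_id": [0, 1, 2, 3],
    "type": ["Robusta", "Arabica", "Liberica", "Excelsa"],
    "price_kg": [2.586, 4.558, 4.840, 4.840],
    "speciality": [True, False, False, True],
    "origin": ["Vietnam", "Brazil", "Philippines", "India"],
})

# Rebuild the orders table. Assumption: kgs is kept as strings, per Q5.
order_df = pd.DataFrame({
    "ord_id": [0, 1, 2, 3, 4, 5, 6],
    "destination": ["USA", "USA", "USA", "Germany", "Germany", "Japan", "France"],
    "kgs": ["1000", "1000", "200", "1200", "200", "800", "700"],
    "type": ["Robusta", "Arabica", "Liberica", "Arabica", "Robusta", "Arabica", "Excelsa"],
    "prod_id": [0, 1, 2, 1, 0, 1, 3],
    "comp_id": [0, 0, 1, 2, 2, 3, 4],
})

print(ware_df.shape, order_df.shape)
```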


## Q1

Use the pandas package to write a line of code that calculates the mean price per kilogram (`price_kg`) of the coffees in the `warehouse` dataframe.

**Relevant Notes/Labs**
* [12/13 Intro to Pandas II Notes](https://github.com/The-Knowledge-House/DS_22/blob/main/phase_2/week_2/12_13/pandas_II_notes.ipynb)

## Q2

Create a new column in the `warehouse` dataframe called `new_price` that increases all prices in the `price_kg` column by 5%.

**Relevant Notes/Labs**
* [1/9 Intro to SQL Notes](https://github.com/The-Knowledge-House/DS_22/blob/main/phase_2/week_6/1_9/sql_intro_notes.ipynb)

## Q3

Create a new column in the `warehouse` dataframe called `adjusted_price` that increases all `speciality` coffee prices (from `new_price`) by 20%. 

**Relevant Notes/Labs**
* [1/9 Intro to SQL Notes](https://github.com/The-Knowledge-House/DS_22/blob/main/phase_2/week_6/1_9/sql_intro_notes.ipynb)
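As a refresher on the technique behind Q2 and Q3, here is a small sketch of a conditional column update using a boolean mask with `.loc`. The dataframe `toy` and its columns `fare`, `express`, and `fare_adj` are hypothetical names invented for illustration, and the 50% increase is arbitrary.

```python
import pandas as pd

# Hypothetical data: a numeric column and a boolean flag column.
toy = pd.DataFrame({
    "fare": [10.0, 20.0, 30.0],
    "express": [True, False, True],
})

# Start from a copy of the original values...
toy["fare_adj"] = toy["fare"]

# ...then overwrite only the rows where the flag is True.
toy.loc[toy["express"], "fare_adj"] = toy["fare"] * 1.5

print(toy)
```

`numpy.where` is another common way to express the same idea in one line; which one reads better is largely a matter of taste.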

## Q4

Create a new dataframe called `orders_us` that displays only the orders whose destinations are `USA`.

**Relevant Notes/Labs**
* [12/13 Intro to Pandas Lab](https://github.com/The-Knowledge-House/DS_22/blob/main/phase_2/week_2/12_13/pandas_II_notes.ipynb)

## Q5

In the current `orders` dataframe, the values in the `kgs` column are stored as strings. Write a line of code that converts this column to the `int` dtype.

**Relevant Notes/Labs**
* [Pandas Docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html)
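The linked `astype` method can be sketched on a throwaway column. The dataframe `toy` and the column `weight` below are hypothetical, and the sketch assumes every string is a clean whole number; values such as `"1,000"` or `"n/a"` would raise an error and need cleaning first.

```python
import pandas as pd

# Hypothetical column of numeric strings.
toy = pd.DataFrame({"weight": ["10", "250", "7"]})

# astype returns a new Series, so assign it back to keep the change.
toy["weight"] = toy["weight"].astype(int)

print(toy["weight"].dtype)
print(toy["weight"].sum())
```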

## Q6

Let's say our `orders` dataframe is 1000 rows long. Some of these rows contain `NaN` values in the `destination` column. How would you print how many null values we have in this column?

How would you drop these null rows from the `orders` dataframe?

**Relevant Notes/Labs**
* [1/3 Pandas Cleaning Notes](https://github.com/The-Knowledge-House/DS_22/blob/main/phase_2/week_5/1_3/pandas_cleaning_notes.ipynb)
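Counting and dropping missing values in a single column can be sketched as follows. The dataframe `toy` and its columns `city` and `units` are hypothetical and stand in for any frame with gaps in one column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing entries in "city".
toy = pd.DataFrame({
    "city": ["Lagos", np.nan, "Lima", None],
    "units": [1, 2, 3, 4],
})

# isna() marks missing cells as True; summing the booleans counts them.
n_missing = toy["city"].isna().sum()
print(n_missing)

# subset restricts the check to one column, so gaps in other columns
# would not cause a row to be dropped.
cleaned = toy.dropna(subset=["city"])
print(len(cleaned))
```

Note that `dropna` returns a new dataframe by default, so the result has to be assigned to a variable to be kept.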

## Q7

Group your `orders` dataframe by the `destination` column. Calculate the average `kgs` of the orders going to each `destination`.

**Relevant Notes/Labs**
* [12/13 Intro to Pandas Lab](https://github.com/The-Knowledge-House/DS_22/blob/main/phase_2/week_2/12_13/pandas_II_notes.ipynb)
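The group-then-aggregate pattern used in Q7 and Q8 can be sketched on unrelated data. The dataframe `toy` and its columns `team` and `score` are hypothetical; the sketch also assumes the value column is already numeric, since the mean of a string column is not meaningful.

```python
import pandas as pd

# Hypothetical data: one grouping column, one numeric column.
toy = pd.DataFrame({
    "team": ["a", "b", "a", "b", "c"],
    "score": [10, 20, 30, 40, 50],
})

# Select the numeric column after grouping, then aggregate it.
avg = toy.groupby("team")["score"].mean()

print(avg)
```

The result is a Series indexed by the group labels; calling `.reset_index()` on it would turn it back into a regular two-column dataframe.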

## Q8

Group your `orders` dataframe by the `type` column. Calculate the average `kgs` of the orders for each `type`.

**Relevant Notes/Labs**
* [12/13 Intro to Pandas Lab](https://github.com/The-Knowledge-House/DS_22/blob/main/phase_2/week_2/12_13/pandas_II_notes.ipynb)

## Q9

Write Python code that joins these two dataframes on coffee type, creates a new column labeled `total_price` that multiplies each order's `kgs` by its `price_kg` and adds $1000 for air freight, and then sorts the table from the most expensive order to the least expensive.

**Relevant Notes/Labs**
* [Pandas Docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html)
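The join, derive, and sort steps can be sketched on two small hypothetical tables. The names `prices`, `sales`, `item`, `unit_price`, `qty`, and `cost`, as well as the flat fee of 1.5, are invented for illustration. The sketch shows both `merge` and the linked `join` method, which matches on the other frame's index and therefore needs `set_index` when the key is an ordinary column.

```python
import pandas as pd

# Hypothetical lookup table and transaction table sharing an "item" key.
prices = pd.DataFrame({"item": ["tea", "cocoa"], "unit_price": [3.0, 5.0]})
sales = pd.DataFrame({
    "sale_id": [0, 1, 2],
    "item": ["tea", "cocoa", "tea"],
    "qty": [2, 4, 10],
})

# Option 1: merge on the shared column.
merged = sales.merge(prices, on="item", how="left")

# Derive a new column from two existing ones plus a flat fee.
merged["cost"] = merged["qty"] * merged["unit_price"] + 1.5

# Sort from the largest cost to the smallest.
ranked = merged.sort_values("cost", ascending=False)
print(ranked)

# Option 2: join matches the "item" column against the other frame's index.
joined = sales.join(prices.set_index("item"), on="item")
print(joined)
```

One thing to watch for when both tables share columns other than the key: `merge` appends suffixes such as `_x` and `_y` to the duplicates, while `join` raises an error unless `lsuffix` or `rsuffix` is supplied.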

## Q10

Write Python code that joins these two dataframes on coffee type and displays the origin and destination of each order.

**Relevant Notes/Labs**
* [Pandas Docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html)

# SQLAlchemy & Pandas

Questions 11 - 20 go over the basics of combining SQLAlchemy with pandas. This entails using the company data tables saved in the `aws` postgres database, which include `jobs`, `salaries`, and `skills`. To get a better understanding of these tables, consult the company's [planning documents](https://drive.google.com/drive/folders/1z4EwdbyfUzf-FuRTJfVMaRSm5-R25viA).

You will additionally use `pandas` to do some light exploration of these dataframes.

## Q11

Import all 3 of your tables using sqlalchemy, and convert them into Python objects using the `auto_mapper`. Then, create 3 dataframes from the data pulled through a session object. Be sure to dispose of your engine after creating these dataframes.

**Relevant Notes/Labs**
* [2/1 Intro to SQL Notes](https://github.com/The-Knowledge-House/DS_22/blob/main/phase_2/week_9/Lab/intermediate_sql_lab_notes.ipynb)
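The reflect, query, and dispose workflow can be sketched without access to the `aws` database. The example below assumes that `auto_mapper` refers to SQLAlchemy's automap extension (`automap_base`) and that SQLAlchemy 1.4 or newer is installed. It uses an in-memory SQLite database with a hypothetical table `jobs_demo` in place of the real postgres connection, whose URL and schema are not shown here.

```python
import pandas as pd
from sqlalchemy import create_engine, text
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import Session

# Stand-in engine: in-memory SQLite instead of the real postgres database.
engine = create_engine("sqlite:///:memory:")

# Create and fill a hypothetical table so there is something to reflect.
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE jobs_demo (id INTEGER PRIMARY KEY, title TEXT)"))
    conn.execute(text("INSERT INTO jobs_demo (id, title) VALUES (1, 'Analyst'), (2, 'Engineer')"))

# Reflect the existing tables into mapped Python classes.
# Automap only maps tables that have a primary key.
Base = automap_base()
Base.prepare(autoload_with=engine)
JobsDemo = Base.classes.jobs_demo

# Pull the rows through a session and build a dataframe from them.
with Session(engine) as session:
    rows = session.query(JobsDemo).all()
    jobs_df = pd.DataFrame([{"id": r.id, "title": r.title} for r in rows])

# Release the connection pool once the dataframes exist.
engine.dispose()

print(jobs_df)
```

Building the dataframe from a list of dictionaries keeps the sketch independent of how a given pandas version talks to SQLAlchemy; `pd.read_sql` with a query statement is a common alternative.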

## Q12

Join all 3 pandas dataframes that you created above on the primary key columns. Consult the planning docs to figure out how these tables were planned. Save this dataframe into a new variable.

**Relevant Notes/Labs**
* [Pandas Docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html)

## Q13

Create a new dataframe that drops all null values in `salary_standardized` from this newly joined dataframe. Use this new dataframe to calculate the minimum offered `salary_standardized`, the maximum, the mean, and the standard deviation.

**Relevant Notes/Labs**
* [1/3 Pandas Cleaning Notes](https://github.com/The-Knowledge-House/DS_22/blob/main/phase_2/week_5/1_3/pandas_cleaning_notes.ipynb)

## Q14

Create a histogram to display the frequency of `salary_standardized` from this new dataframe.

**Relevant Notes/Labs**
* [2/6 Intro to Tableau](https://github.com/The-Knowledge-House/DS_22/blob/main/phase_2/week_10/2_6/intro_to_tableau_notes.ipynb)

## Q15

Create a box-plot to display the distribution of `salary_standardized` from this new dataframe.

**Relevant Notes/Labs**
* [2/21 Data Analytics Notes](https://github.com/The-Knowledge-House/DS_22/blob/main/phase_2/week_12/2_21/data_analytics_notes.ipynb)

## Q16

Create 3 new dataframes that filter this new dataframe for `New York, NY`, `San Francisco, CA`, and `Atlanta, GA` in the `location` column.

**Relevant Notes/Labs**
* [12/13 Intro to Pandas Lab](https://github.com/The-Knowledge-House/DS_22/blob/main/phase_2/week_2/12_13/pandas_II_notes.ipynb)

## Q17

Create 3 histograms using seaborn to plot the distribution of `salary_standardized` for each named city above using the dataframes you just created.

**Relevant Notes/Labs**
* [2/21 Data Analytics Notes](https://github.com/The-Knowledge-House/DS_22/blob/main/phase_2/week_12/2_21/data_analytics_notes.ipynb)

## Q19

Going back to Q15, what kind of distribution does this most likely represent?

```python
print("Uniform")
print("Right Skewed")
print("Left Skewed")
print("Normal")
```

## Q20

Run the Kolmogorov-Smirnov (`ks`) test for normality on the `salary_standardized` column from the dataframe you used in `Q15`. What does the resulting p-value tell you about this distribution?

**Relevant Notes/Labs**
* [2/21 Data Analytics Notes](https://github.com/The-Knowledge-House/DS_22/blob/main/phase_2/week_12/2_21/data_analytics_notes.ipynb)
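The mechanics of a Kolmogorov-Smirnov normality check can be sketched with `scipy.stats.kstest` on synthetic data. The array `sample` below is randomly generated and deliberately right-skewed; it is a hypothetical stand-in, not the real `salary_standardized` values, and the course notes may use a different function or setup.

```python
import numpy as np
from scipy import stats

# Hypothetical, deliberately right-skewed sample (not real salary data).
rng = np.random.default_rng(0)
sample = rng.exponential(scale=50_000, size=500)

# Compare the sample against a normal distribution with the same
# mean and standard deviation as the sample itself.
result = stats.kstest(sample, "norm", args=(sample.mean(), sample.std(ddof=1)))

print(result.statistic, result.pvalue)
```

A small p-value is evidence against the hypothesis that the data come from the reference normal distribution. Because the mean and standard deviation here are estimated from the same sample, the p-value should be read as approximate.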