# Advanced Dataframes

## Part 1

Use your function to obtain a connection to the employees database.

In [72]:
import pandas as pd
import numpy as np
from pydataset import data
import env # contains database access
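
The `env` module itself is not shown in this notebook. As a rough sketch only, `get_db_access` might build a SQLAlchemy-style connection URL like the one below. The credential values, the host name, the keyword arguments, and the `pymysql` driver are all assumptions made for illustration.

```python
# env.py (hypothetical sketch; the real module is not part of this notebook)

# Placeholder credentials: substitute your own and keep this file out of version control.
user = "some_user"
password = "some_password"
host = "db.example.com"


def get_db_access(database, user=user, password=password, host=host):
    """Return a connection URL string for the given database name."""
    return f"mysql+pymysql://{user}:{password}@{host}/{database}"


url = get_db_access("employees")
print(url)
```

`pd.read_sql` accepts such a URL string as its `con` argument when SQLAlchemy is installed.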

Read the employees and titles tables into two separate DataFrames.

In [53]:
query = """
SELECT *
FROM employees;
"""

emp_data = pd.read_sql(query, env.get_db_access("employees"))
emp_data.head()

Unnamed: 0,emp_no,birth_date,first_name,last_name,gender,hire_date
0,10001,1953-09-02,Georgi,Facello,M,1986-06-26
1,10002,1964-06-02,Bezalel,Simmel,F,1985-11-21
2,10003,1959-12-03,Parto,Bamford,M,1986-08-28
3,10004,1954-05-01,Chirstian,Koblick,M,1986-12-01
4,10005,1955-01-21,Kyoichi,Maliniak,M,1989-09-12


Titles DataFrame

In [54]:
query = """
SELECT *
FROM titles;
"""

titles_data = pd.read_sql(query, env.get_db_access('employees'))
titles_data.head()

Unnamed: 0,emp_no,title,from_date,to_date
0,10001,Senior Engineer,1986-06-26,9999-01-01
1,10002,Staff,1996-08-03,9999-01-01
2,10003,Senior Engineer,1995-12-03,9999-01-01
3,10004,Engineer,1986-12-01,1995-12-01
4,10004,Senior Engineer,1995-12-01,9999-01-01


How many rows and columns do you have in each DataFrame? Is that what you expected?

In [55]:
emp_data.shape

(300024, 6)

In [56]:
titles_data.shape

(443308, 4)
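
The titles table has more rows than the employees table because one employee can hold several titles over time. A small sketch of how to check for repeated `emp_no` values, using only the five rows shown by `titles_data.head()` rather than the full table:

```python
import pandas as pd

# The five rows shown by titles_data.head() above (date columns left out).
titles_sample = pd.DataFrame({
    "emp_no": [10001, 10002, 10003, 10004, 10004],
    "title": ["Senior Engineer", "Staff", "Senior Engineer",
              "Engineer", "Senior Engineer"],
})

# Count title rows per employee; a count above 1 means the employee appears more than once.
titles_per_employee = titles_sample.groupby("emp_no").size()
multiple_titles = titles_per_employee[titles_per_employee > 1]
print(multiple_titles)
```

On the full DataFrame the same idea would be `titles_data.groupby("emp_no").size()`.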

Display the summary statistics for each DataFrame.

In [57]:
emp_data.describe()

Unnamed: 0,emp_no
count,300024.0
mean,253321.763392
std,161828.23554
min,10001.0
25%,85006.75
50%,249987.5
75%,424993.25
max,499999.0


In [58]:
titles_data.describe()

Unnamed: 0,emp_no
count,443308.0
mean,253075.03443
std,161853.292613
min,10001.0
25%,84855.75
50%,249847.5
75%,424891.25
max,499999.0


How many unique titles are in the titles DataFrame?

In [59]:
titles_data.nunique()

emp_no       300024
title             7
from_date      6393
to_date        5888
dtype: int64

In [60]:
titles_data.title.nunique()

7

What is the oldest date in the to_date column?

In [61]:
titles_data.to_date.min()

datetime.date(1985, 3, 1)

What is the most recent date in the to_date column?

In [62]:
titles_data.to_date.max()

datetime.date(9999, 1, 1)
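
The year 9999 is not a real end date. In this sample database, `9999-01-01` appears to act as a placeholder for titles that are still current; that reading is an assumption here, so check it against your data. A sketch of how to find the latest real end date, using the `to_date` values from `titles_data.head()`:

```python
import datetime

import pandas as pd

# to_date values from titles_data.head() above, as datetime.date objects.
to_dates = pd.Series([
    datetime.date(9999, 1, 1),
    datetime.date(9999, 1, 1),
    datetime.date(9999, 1, 1),
    datetime.date(1995, 12, 1),
    datetime.date(9999, 1, 1),
])

# Assumption: 9999-01-01 marks a title that has not ended yet.
CURRENT_TITLE = datetime.date(9999, 1, 1)

# Drop the placeholder rows, then take the maximum of what remains.
ended = to_dates[to_dates != CURRENT_TITLE]
latest_real_end = ended.max()
print(latest_real_end)
```

On the full DataFrame the filter would be `titles_data[titles_data.to_date != CURRENT_TITLE]`.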

## Part 2

1. Copy the users and roles DataFrames from the examples above.

In [63]:
users = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'name': ['bob', 'joe', 'sally', 'adam', 'jane', 'mike'],
    'role_id': [1, 2, 3, 3, np.nan, np.nan]
})
users

Unnamed: 0,id,name,role_id
0,1,bob,1.0
1,2,joe,2.0
2,3,sally,3.0
3,4,adam,3.0
4,5,jane,
5,6,mike,


In [64]:
roles = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['admin', 'author', 'reviewer', 'commenter']
})
roles

Unnamed: 0,id,name
0,1,admin
1,2,author
2,3,reviewer
3,4,commenter


2. What is the result of using a right join on the DataFrames?

In [65]:
def merge_right(users, roles):
    return users.merge(roles, how= "right", on= "id")

merge_right(users, roles)

Unnamed: 0,id,name_x,role_id,name_y
0,1,bob,1.0,admin
1,2,joe,2.0,author
2,3,sally,3.0,reviewer
3,4,adam,3.0,commenter
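
Joining on `id` pairs each user's own id with a role id, which is why adam lines up with commenter even though his `role_id` is 3 (reviewer). If the goal is to attach each user's role, one option is to join `users.role_id` to `roles.id`. A sketch, with suffix names chosen here only for readability:

```python
import numpy as np
import pandas as pd

users = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "name": ["bob", "joe", "sally", "adam", "jane", "mike"],
    "role_id": [1, 2, 3, 3, np.nan, np.nan],
})
roles = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["admin", "author", "reviewer", "commenter"],
})

# Right join: keep every role, matched to users through users.role_id == roles.id.
user_roles = users.merge(
    roles,
    how="right",
    left_on="role_id",
    right_on="id",
    suffixes=("_user", "_role"),
)
print(user_roles)
```

With these keys, a role that no user holds still appears, with missing values in the user columns.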


3. What is the result of using an outer join on the DataFrames?

In [66]:
def merge_outer(users, roles):
    return users.merge(roles, how= "outer", on= "id")

merge_outer(users, roles)

Unnamed: 0,id,name_x,role_id,name_y
0,1,bob,1.0,admin
1,2,joe,2.0,author
2,3,sally,3.0,reviewer
3,4,adam,3.0,commenter
4,5,jane,,
5,6,mike,,
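
To see where each row of an outer join came from, `merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below joins on `role_id` and `id` rather than on `id` alone; that choice of keys is an assumption about the intended relationship between the two tables.

```python
import numpy as np
import pandas as pd

users = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "name": ["bob", "joe", "sally", "adam", "jane", "mike"],
    "role_id": [1, 2, 3, 3, np.nan, np.nan],
})
roles = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["admin", "author", "reviewer", "commenter"],
})

# Outer join keeps unmatched rows from both sides; _merge records each row's origin.
merged = users.merge(
    roles,
    how="outer",
    left_on="role_id",
    right_on="id",
    indicator=True,
)
origin_counts = merged["_merge"].value_counts()
print(origin_counts)
```

Rows marked `left_only` are users without a role, and rows marked `right_only` are roles that no user holds.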


4. What happens if you drop the foreign keys from the DataFrames and try to merge them?

In [67]:
def drop_fKey(df):
    return df.drop(["id"], axis= 1)

new_users = drop_fKey(users)
new_roles = drop_fKey(roles)

In [69]:
new_users.merge(new_roles, how= "right", on= "name")

Unnamed: 0,name,role_id
0,admin,
1,author,
2,reviewer,
3,commenter,


In [70]:
new_users.merge(new_roles, how= "outer", on= "name")

Unnamed: 0,name,role_id
0,bob,1.0
1,joe,2.0
2,sally,3.0
3,adam,3.0
4,jane,
5,mike,
6,admin,
7,author,
8,reviewer,
9,commenter,
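
Once `id` is dropped, `name` is the only column the two DataFrames share, and no user name matches a role name. That is why the right join returns only the roles and the outer join simply stacks all ten names. A short sketch of the same effect with the default (inner) merge:

```python
import numpy as np
import pandas as pd

new_users = pd.DataFrame({
    "name": ["bob", "joe", "sally", "adam", "jane", "mike"],
    "role_id": [1, 2, 3, 3, np.nan, np.nan],
})
new_roles = pd.DataFrame({
    "name": ["admin", "author", "reviewer", "commenter"],
})

# With no `on` argument, merge joins on the columns the two frames have in common.
shared_columns = new_users.columns.intersection(new_roles.columns)

# Inner join on "name": no user name equals a role name, so nothing matches.
inner = new_users.merge(new_roles)
print(list(shared_columns), len(inner))
```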


5. Load the mpg dataset from PyDataset.

In [76]:
mpg = data("mpg")

6. Output and read the documentation for the mpg dataset.

In [78]:
data("mpg", show_doc=True)  # prints the dataset documentation (text not reproduced here)
mpg.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact


7. How many rows and columns are in the dataset?

In [88]:
mpg.shape

(234, 11)

8. Check out your column names and perform any cleanup you may want on them.

In [91]:
mpg.columns
mpg = mpg.rename(columns = {"displ":"displacement",
            "cyl":"cylinders",
            "trans":"transmission",
            "drv":"drive",
            "cty":"city",
            "hwy":"highway",
            "fl":"fuel",
           })
mpg.head()

Unnamed: 0,manufacturer,model,displacement,year,cylinders,transmission,drive,city,highway,fuel,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact


9. Display the summary statistics for the dataset.

In [92]:
mpg.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 234 entries, 1 to 234
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   manufacturer  234 non-null    object 
 1   model         234 non-null    object 
 2   displacement  234 non-null    float64
 3   year          234 non-null    int64  
 4   cylinders     234 non-null    int64  
 5   transmission  234 non-null    object 
 6   drive         234 non-null    object 
 7   city          234 non-null    int64  
 8   highway       234 non-null    int64  
 9   fuel          234 non-null    object 
 10  class         234 non-null    object 
dtypes: float64(1), int64(4), object(6)
memory usage: 30.0+ KB


In [93]:
mpg.describe()  # call the method; without parentheses pandas only echoes the bound method

10. How many different manufacturers are there?

In [100]:
mpg.manufacturer.nunique()

15

11. How many different models are there?

In [101]:
mpg.model.nunique()

38

12. Create a column named mileage_difference like you did in the DataFrames exercises; this column should contain the difference between highway and city mileage for each car.
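
One possible approach, sketched on the five rows shown by `mpg.head()` and assuming the renamed `city` and `highway` columns from step 8:

```python
import pandas as pd

# City and highway mileage for the five rows shown by mpg.head() above.
mpg_sample = pd.DataFrame({
    "city": [18, 21, 20, 21, 16],
    "highway": [29, 29, 31, 30, 26],
})

# Column arithmetic is element-wise, so this subtracts row by row.
mpg_sample["mileage_difference"] = mpg_sample["highway"] - mpg_sample["city"]
print(mpg_sample)
```

On the full dataset the same line works with `mpg` in place of `mpg_sample`.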

13. Create a column named average_mileage like you did in the DataFrames exercises; this is the mean of the city and highway mileage.
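
A sketch of one way to do this, again on the five rows from `mpg.head()` and assuming the renamed `city` and `highway` columns:

```python
import pandas as pd

# City and highway mileage for the five rows shown by mpg.head() above.
mpg_sample = pd.DataFrame({
    "city": [18, 21, 20, 21, 16],
    "highway": [29, 29, 31, 30, 26],
})

# Row-wise mean of the two mileage columns.
mpg_sample["average_mileage"] = mpg_sample[["city", "highway"]].mean(axis=1)
print(mpg_sample)
```

`(mpg_sample["city"] + mpg_sample["highway"]) / 2` gives the same values; `mean(axis=1)` extends more easily if further mileage columns are added.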

14. Create a new column on the mpg dataset named is_automatic that holds boolean values denoting whether the car has an automatic transmission.
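
The transmission values shown earlier look like `auto(l5)` or `manual(m5)`, so one option is to test the prefix of the string. A sketch on the five rows from `mpg.head()`, assuming the renamed `transmission` column and that every automatic value starts with `auto`:

```python
import pandas as pd

# Transmission values for the five rows shown by mpg.head() above.
mpg_sample = pd.DataFrame({
    "transmission": ["auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto(l5)"],
})

# True where the transmission string begins with "auto", False otherwise.
mpg_sample["is_automatic"] = mpg_sample["transmission"].str.startswith("auto")
print(mpg_sample)
```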

15. Using the mpg dataset, find out which manufacturer has the best miles per gallon on average.
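
One way to approach this is to group by manufacturer, average the mileage, and take the largest value. The frame below is a toy example with made-up manufacturers and numbers, used only to show the pattern; it is not the mpg data, and the choice of `average_mileage` as the measure of "best" is an assumption.

```python
import pandas as pd

# Hypothetical toy data: names and mileage figures are invented for illustration.
toy = pd.DataFrame({
    "manufacturer": ["maker_a", "maker_a", "maker_b", "maker_b"],
    "city": [20, 22, 15, 17],
    "highway": [30, 32, 22, 24],
})
toy["average_mileage"] = toy[["city", "highway"]].mean(axis=1)

# Mean of average_mileage per manufacturer, highest first.
by_manufacturer = (
    toy.groupby("manufacturer")["average_mileage"]
    .mean()
    .sort_values(ascending=False)
)
best_manufacturer = by_manufacturer.idxmax()
print(by_manufacturer)
```

On the real dataset, replace `toy` with `mpg` after creating `average_mileage`.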

16. Do automatic or manual cars have better miles per gallon?
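
A sketch of the comparison, combining `is_automatic` and `average_mileage` and grouping on the transmission kind. It uses only the five rows from `mpg.head()`, so the numbers it prints describe that sample, not the whole dataset.

```python
import pandas as pd

# The five rows shown by mpg.head() above, with the renamed columns from step 8.
mpg_sample = pd.DataFrame({
    "transmission": ["auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto(l5)"],
    "city": [18, 21, 20, 21, 16],
    "highway": [29, 29, 31, 30, 26],
})
mpg_sample["average_mileage"] = mpg_sample[["city", "highway"]].mean(axis=1)
mpg_sample["is_automatic"] = mpg_sample["transmission"].str.startswith("auto")

# Readable group labels instead of True/False.
kind = mpg_sample["is_automatic"].map({True: "automatic", False: "manual"})

# Mean average_mileage for each transmission kind.
by_kind = mpg_sample.groupby(kind)["average_mileage"].mean()
print(by_kind)
```

Running the same grouping on the full `mpg` DataFrame answers the question for all 234 cars.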