# Advanced Dataframes Exercises I

### 1. Run `python -m pip install mysqlclient pymysql` from your terminal to install the MySQL client (any folder is fine).

### 2. `cd` into your exercises folder for this module and run `echo env.py >> .gitignore`.

### 3. Create a function named `get_db_url`. It should accept a username, hostname, password, and database name and return a URL connection string formatted like in the example at the start of this lesson.

In [180]:
from env import user, password, host
import pandas as pd
import numpy as np

def get_db_url(username: str, hostname: str, password: str, database_name: str) -> str:
    '''
    Takes a username, hostname, password, and database_name and
    returns a connection string for the mysql+pymysql driver.
    '''
    connection = f'mysql+pymysql://{username}:{password}@{hostname}/{database_name}'

    return connection

emp_conn = get_db_url(user, host, password, 'employees')
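
A password containing characters such as `@`, `:`, or `/` can break a URL built by plain string formatting, because those characters are also URL delimiters. The sketch below is one hedged way to guard against that with the standard library's `quote_plus`; the function name `get_db_url_escaped` and the credentials are hypothetical and not part of the exercise.

```python
from urllib.parse import quote_plus


def get_db_url_escaped(username: str, hostname: str, password: str, database_name: str) -> str:
    """Build a mysql+pymysql URL, percent-encoding the username and password."""
    # quote_plus escapes '@', ':' and '/' so they are not read as URL delimiters.
    return (
        f"mysql+pymysql://{quote_plus(username)}:{quote_plus(password)}"
        f"@{hostname}/{database_name}"
    )


# Placeholder credentials for illustration only.
example_url = get_db_url_escaped("some_user", "localhost", "p@ss:word/1", "employees")
print(example_url)
```

For credentials without special characters, this returns the same string as `get_db_url`.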


### 4. Use your function to obtain a connection to the employees database.

In [181]:
sql = '''
select *
from employees
limit 5
offset 50
'''
pd.read_sql(sql, emp_conn)

Unnamed: 0,emp_no,birth_date,first_name,last_name,gender,hire_date
0,10051,1953-07-28,Hidefumi,Caine,M,1992-10-15
1,10052,1961-02-26,Heping,Nitsch,M,1988-05-21
2,10053,1954-09-13,Sanjiv,Zschoche,F,1986-02-04
3,10054,1957-04-04,Mayumi,Schueller,M,1995-03-13
4,10055,1956-06-06,Georgy,Dredge,M,1992-04-27


### 5a. Intentionally make a typo in the database url. What kind of error message do you see? -- Raises a NoSuchModuleError

In [9]:
bad_conn = emp_conn[:5] + 'test' + emp_conn[5:]
pd.read_sql(sql, bad_conn)

NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:mysqltest.pymysql

### 5b. Intentionally make an error in your SQL query. What does the error message look like? -- Raises a ProgrammingError that wraps the underlying pymysql.err error details

In [13]:
bad_sql = '''
select *
from employeeeeeeeeees
limit 5
offset 50
'''
pd.read_sql(bad_sql, emp_conn)

ProgrammingError: (pymysql.err.ProgrammingError) (1146, "Table 'employees.employeeeeeeeeees' doesn't exist")
[SQL: 
select *
from employeeeeeeeeees
limit 5
offset 50
]
(Background on this error at: http://sqlalche.me/e/14/f405)

### 6. Read the employees and titles tables into two separate DataFrames.

In [182]:
sql_employees = '''
select *
from employees
'''
employees = pd.read_sql(sql_employees, emp_conn)
sql_titles = '''
select *
from titles
'''
titles = pd.read_sql(sql_titles, emp_conn)


### 7. How many rows and columns do you have in each DataFrame? Is that what you expected? -- Yes to both

In [174]:
# A notebook cell only displays its last expression, so return both shapes as a tuple
employees.shape, titles.shape

### 8. Display the summary statistics for each DataFrame.

In [29]:
employees.describe()  # not displayed: a cell only shows its last expression
titles.describe()

Unnamed: 0,emp_no
count,443308.0
mean,253075.03443
std,161853.292613
min,10001.0
25%,84855.75
50%,249847.5
75%,424891.25
max,499999.0


### 9. How many unique titles are in the `titles` DataFrame?

In [34]:
len(titles.title.unique())

7

### 10. What is the oldest date in the to_date column?

In [37]:
titles.to_date.min()

datetime.date(1985, 3, 1)

### 11. What is the most recent date in the to_date column?

In [59]:
import datetime

# Exclude the 9999-01-01 placeholder before taking the maximum
titles[titles.to_date != datetime.date(9999, 1, 1)].to_date.max()

datetime.date(2002, 8, 1)
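
The filter above treats `9999-01-01` as a placeholder rather than a real end date. A minimal sketch of the same idea on a toy frame follows; the rows are made up for illustration, and reading the placeholder as "title still held" is an assumption about the data.

```python
import datetime

import pandas as pd

# Toy stand-in for the titles table; these rows are made up.
toy_titles = pd.DataFrame({
    "title": ["Staff", "Engineer", "Senior Engineer"],
    "to_date": [
        datetime.date(1999, 5, 1),
        datetime.date(2002, 8, 1),
        datetime.date(9999, 1, 1),  # assumed placeholder for a title still held
    ],
})

sentinel = datetime.date(9999, 1, 1)

# Drop placeholder rows, then take the latest remaining date.
finished = toy_titles[toy_titles.to_date != sentinel]
latest_real_to_date = finished.to_date.max()
print(latest_real_to_date)
```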

# Exercises II

### 1. Copy the `users` and `roles` DataFrames from the examples above.

In [61]:
users = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'name': ['bob', 'joe', 'sally', 'adam', 'jane', 'mike'],
    'role_id': [1, 2, 3, 3, np.nan, np.nan]
})

roles = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['admin', 'author', 'reviewer', 'commenter']
})


### 2. What is the result of using a `right` join on the DataFrames?



In [80]:
users.join(roles, how='right', lsuffix='_users', rsuffix='_roles')


Unnamed: 0,id_users,name_users,role_id,id_roles,name_roles
0,1,bob,1.0,1,admin
1,2,joe,2.0,2,author
2,3,sally,3.0,3,reviewer
3,4,adam,3.0,4,commenter
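
Note that `DataFrame.join` aligns rows on the index by default, which is why adam (whose `role_id` is 3) ends up next to `commenter` above. If the goal is to match `users.role_id` to `roles.id`, a key-based merge is one option. The sketch below is a hedged alternative, not necessarily what the exercise intends.

```python
import numpy as np
import pandas as pd

users = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "name": ["bob", "joe", "sally", "adam", "jane", "mike"],
    "role_id": [1, 2, 3, 3, np.nan, np.nan],
})
roles = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["admin", "author", "reviewer", "commenter"],
})

# Right join on the foreign key: every role is kept, matched or not.
right_merged = users.merge(
    roles,
    left_on="role_id",
    right_on="id",
    how="right",
    suffixes=("_users", "_roles"),
)
print(right_merged)
```

With this version, `commenter` has no matching user, and jane and mike (who have no role) are dropped.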


### 3. What is the result of using an outer join on the DataFrames?

In [82]:
users.join(roles, how='outer', lsuffix='_users', rsuffix='_roles')

Unnamed: 0,id_users,name_users,role_id,id_roles,name_roles
0,1,bob,1.0,1.0,admin
1,2,joe,2.0,2.0,author
2,3,sally,3.0,3.0,reviewer
3,4,adam,3.0,4.0,commenter
4,5,jane,,,
5,6,mike,,,
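
As with the right join, the result above comes from index alignment. A key-based outer merge can be sketched as follows; the `indicator=True` argument adds a `_merge` column showing which side each row came from. This is an illustrative alternative under the assumption that matching on `role_id` is what is wanted.

```python
import numpy as np
import pandas as pd

users = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "name": ["bob", "joe", "sally", "adam", "jane", "mike"],
    "role_id": [1, 2, 3, 3, np.nan, np.nan],
})
roles = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["admin", "author", "reviewer", "commenter"],
})

# Outer join on the foreign key keeps unmatched rows from both sides.
outer_merged = users.merge(
    roles,
    left_on="role_id",
    right_on="id",
    how="outer",
    suffixes=("_users", "_roles"),
    indicator=True,
)
print(outer_merged)
```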


### 4. What happens if you drop the foreign keys from the DataFrames and try to merge them?

In [90]:
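# drop() returns a new DataFrame; without assignment, users still has role_id
users.drop(columns='role_id')
# with no `on` argument, merge uses the shared columns (id and name) as keys
users.merge(roles, how='outer')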

Unnamed: 0,id,name,role_id
0,1,bob,1.0
1,2,joe,2.0
2,3,sally,3.0
3,4,adam,3.0
4,5,jane,
5,6,mike,
6,1,admin,
7,2,author,
8,3,reviewer,
9,4,commenter,
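
The `role_id` column still appears in the output because `drop` returns a new DataFrame instead of modifying `users`. The sketch below keeps the returned frame and then merges; the variable names `users_no_fk` and `merged` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

users = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "name": ["bob", "joe", "sally", "adam", "jane", "mike"],
    "role_id": [1, 2, 3, 3, np.nan, np.nan],
})
roles = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["admin", "author", "reviewer", "commenter"],
})

users.drop(columns="role_id")  # result discarded, users is unchanged
still_has_fk = "role_id" in users.columns

users_no_fk = users.drop(columns="role_id")  # keep the returned copy

# With no `on` argument, merge falls back to the shared columns: id and name.
merged = users_no_fk.merge(roles, how="outer")
print(merged)
```

No (`id`, `name`) pair occurs in both frames, so the outer merge simply stacks the six users and four roles.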


### 5. Load the `mpg` dataset from PyDataset.

In [94]:
from pydataset import data
mpg = data('mpg')

### 6. Output and read the documentation for the `mpg` dataset.



In [95]:
data('mpg', show_doc=True)

mpg

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

## Fuel economy data from 1999 and 2008 for 38 popular models of car

### Description

This dataset contains a subset of the fuel economy data that the EPA makes
available on http://fueleconomy.gov. It contains only models which had a new
release every year between 1999 and 2008 - this was used as a proxy for the
popularity of the car.

### Usage

    data(mpg)

### Format

A data frame with 234 rows and 11 variables

### Details

  * manufacturer. 

  * model. 

  * displ. engine displacement, in litres 

  * year. 

  * cyl. number of cylinders 

  * trans. type of transmission 

  * drv. f = front-wheel drive, r = rear wheel drive, 4 = 4wd 

  * cty. city miles per gallon 

  * hwy. highway miles per gallon 

  * fl. 

  * class. 




### 7. How many rows and columns are in the dataset?

In [96]:
mpg.shape

(234, 11)

### 8. Check out your column names and perform any cleanup you may want on them.

In [100]:
mpg.drop(columns=['displ', 'fl', 'cyl', 'drv'])

Unnamed: 0,manufacturer,model,year,trans,cty,hwy,class
1,audi,a4,1999,auto(l5),18,29,compact
2,audi,a4,1999,manual(m5),21,29,compact
3,audi,a4,2008,manual(m6),20,31,compact
4,audi,a4,2008,auto(av),21,30,compact
5,audi,a4,1999,auto(l5),16,26,compact
...,...,...,...,...,...,...,...
230,volkswagen,passat,2008,auto(s6),19,28,midsize
231,volkswagen,passat,2008,manual(m6),21,29,midsize
232,volkswagen,passat,1999,auto(l5),16,26,midsize
233,volkswagen,passat,1999,manual(m5),18,26,midsize
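
The cell above drops columns rather than cleaning names, and the result is not assigned back to `mpg`. One possible name cleanup is renaming: `class` is a Python keyword, so `mpg.class` is a syntax error. The sketch uses a three-row toy frame copied from the output above, and the new column names are illustrative choices.

```python
import pandas as pd

# Three rows copied from the output above, limited to a few columns.
toy_mpg = pd.DataFrame({
    "manufacturer": ["audi", "audi", "volkswagen"],
    "cty": [18, 21, 19],
    "hwy": [29, 29, 28],
    "class": ["compact", "compact", "midsize"],
})

# rename returns a new DataFrame, so assign the result to keep it.
renamed = toy_mpg.rename(columns={
    "cty": "city_mpg",
    "hwy": "highway_mpg",
    "class": "vehicle_class",
})
print(renamed.columns.tolist())
```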


### 9. Display the summary statistics for the dataset.

In [101]:
mpg.describe()

Unnamed: 0,displ,year,cyl,cty,hwy
count,234.0,234.0,234.0,234.0,234.0
mean,3.471795,2003.5,5.888889,16.858974,23.440171
std,1.291959,4.509646,1.611534,4.255946,5.954643
min,1.6,1999.0,4.0,9.0,12.0
25%,2.4,1999.0,4.0,14.0,18.0
50%,3.3,2003.5,6.0,17.0,24.0
75%,4.6,2008.0,8.0,19.0,27.0
max,7.0,2008.0,8.0,35.0,44.0


### 10. How many different manufacturers are there?

In [105]:
len(mpg.manufacturer.unique())

15

### 11. How many different models are there?

In [107]:
len(mpg.model.unique())

38

### 12. Create a column named `mileage_difference` like you did in the DataFrames exercises; this column should contain the difference between highway and city mileage for each car.

In [111]:
mpg['mileage_difference'] = mpg.hwy - mpg.cty

### 13. Create a column named `average_mileage` like you did in the DataFrames exercises; this is the mean of the city and highway mileage.

In [114]:
mpg['average_mileage'] = round(mpg[['cty', 'hwy']].mean(axis=1), 2)

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class,mileage_difference,average_mileage
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact,11,23.5
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact,8,25.0
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact,11,25.5
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact,9,25.5
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact,10,21.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
230,volkswagen,passat,2.0,2008,4,auto(s6),f,19,28,p,midsize,9,23.5
231,volkswagen,passat,2.0,2008,4,manual(m6),f,21,29,p,midsize,8,25.0
232,volkswagen,passat,2.8,1999,6,auto(l5),f,16,26,p,midsize,10,21.0
233,volkswagen,passat,2.8,1999,6,manual(m5),f,18,26,p,midsize,8,22.0


### 14. Create a new column on the `mpg` dataset named `is_automatic` that holds boolean values denoting whether the car has an automatic transmission.

In [127]:
mpg['is_automatic'] = mpg.trans.str.contains('auto')

### 15. Using the `mpg` dataset, find out which manufacturer has the best miles per gallon on average.

In [130]:
mpg[mpg.average_mileage == mpg.average_mileage.max()]

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class,mileage_difference,average_mileage,is_automatic
222,volkswagen,new beetle,1.9,1999,4,manual(m5),f,35,44,d,subcompact,9,39.5,False


### 16. Do automatic or manual cars have better miles per gallon? -- Manuals have better mpg in this dataset on average

In [136]:
auto_mpg_avg = mpg[mpg.trans.str.contains('auto')].average_mileage.mean()
manual_mpg_avg = mpg[mpg.trans.str.contains('manual')].average_mileage.mean()
auto_mpg_avg, manual_mpg_avg

(19.130573248407643, 22.227272727272727)

# Exercises III

### 1. Use your `get_db_url` function to help you explore the data from the chipotle database.

In [155]:
conn = get_db_url(username=user, password=password, hostname=host, database_name='chipotle')
sql = '''
select *
from orders'''
chipotle = pd.read_sql(sql, conn)

### 2. What is the total price for each order?

In [146]:
# Raw string avoids an invalid-escape warning for \$; this sums across all rows
chipotle.item_price.replace(r'[\$,]', '', regex=True).astype(float).sum()

34500.16
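
The figure above is the total across every row. A total for each order needs a grouping by `order_id`. The sketch below shows that on a toy frame with made-up prices; the column names match the `orders` table shown in this section.

```python
import pandas as pd

# Made-up rows for illustration.
toy_orders = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "item_price": ["$2.50", "$3.25", "$16.98", "$1,000.00"],
})

# Strip the dollar sign and thousands separator, then convert to float.
price = toy_orders.item_price.str.replace(r"[\$,]", "", regex=True).astype(float)

# Sum the cleaned prices within each order.
order_totals = toy_orders.assign(price=price).groupby("order_id").price.sum()
print(order_totals)
```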

### 3. What are the most popular 3 items?

In [160]:
bools = chipotle.quantity.isin(chipotle.quantity.sort_values(ascending=False).head(n=3))
chipotle[bools]

Unnamed: 0,id,order_id,quantity,item_name,choice_description,item_price
3598,3599,1443,15,Chips and Fresh Tomato Salsa,,$44.25
3887,3888,1559,8,Side of Chips,,$13.52
4152,4153,1660,10,Bottled Water,,$15.00
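
The rows above are the three largest single line items, not necessarily the most ordered items overall. If "popular" means total quantity ordered, a sketch would sum `quantity` per `item_name` first. The toy rows below are made up.

```python
import pandas as pd

# Made-up rows for illustration.
toy_orders = pd.DataFrame({
    "item_name": [
        "Chicken Bowl", "Bottled Water", "Chicken Bowl",
        "Side of Chips", "Bottled Water", "Canned Soda",
    ],
    "quantity": [2, 10, 3, 8, 1, 1],
})

# Total quantity per item, then the three largest totals.
top_items = toy_orders.groupby("item_name").quantity.sum().nlargest(3)
print(top_items)
```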


### 4. Which item has produced the most revenue?

In [168]:
chipotle['revenue'] = chipotle.quantity * chipotle.item_price.replace(r'[\$,]', '', regex=True).astype(float)
chipotle[chipotle.revenue == chipotle.revenue.max()]

Unnamed: 0,id,order_id,quantity,item_name,choice_description,item_price,revenue
3598,3599,1443,15,Chips and Fresh Tomato Salsa,,$44.25,663.75


### 5. Join the `employees` and `titles` DataFrames together.

In [186]:
employees.info()
emp_w_title = employees.join(titles, how='left', lsuffix='_emp')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300024 entries, 0 to 300023
Data columns (total 6 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   emp_no      300024 non-null  int64 
 1   birth_date  300024 non-null  object
 2   first_name  300024 non-null  object
 3   last_name   300024 non-null  object
 4   gender      300024 non-null  object
 5   hire_date   300024 non-null  object
dtypes: int64(1), object(5)
memory usage: 13.7+ MB
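
`employees.join(titles, ...)` aligns the two frames on their row index, so row 0 of `employees` is paired with row 0 of `titles` whether or not the `emp_no` values match. Matching on `emp_no` can be sketched with `merge`. In the toy frames below, the `emp_no` and `hire_date` values come from the query output earlier in this section, while the titles are made up.

```python
import pandas as pd

toy_employees = pd.DataFrame({
    "emp_no": [10051, 10052, 10053],
    "hire_date": pd.to_datetime(["1992-10-15", "1988-05-21", "1986-02-04"]),
})
# Hypothetical title assignments for illustration.
toy_titles = pd.DataFrame({
    "emp_no": [10053, 10051, 10051],
    "title": ["Staff", "Engineer", "Senior Engineer"],
})

# Left merge on the shared key keeps every employee, with one row per title held.
merged = toy_employees.merge(toy_titles, on="emp_no", how="left")

# Most recent hire date among employees who have held each title.
latest_hire = merged.groupby("title").hire_date.max()
print(merged)
print(latest_hire)
```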


### 6. For each title, find the hire date of the employee that was hired most recently with that title.

In [210]:
emp_w_title.groupby(by='title').hire_date.max()

title
Assistant Engineer    1999-12-10
Engineer              2000-01-23
Manager               1996-10-24
Senior Engineer       2000-01-28
Senior Staff          2000-01-13
Staff                 1999-12-24
Technique Leader      1999-10-26
Name: hire_date, dtype: object

### 7. Write the code necessary to create a cross tabulation of the number of titles by department. (Hint: this will involve a combination of SQL code to pull the necessary data and python/pandas code to perform the manipulations.)

In [219]:
dept_sql = '''
select title, dept_name
from employees
join dept_emp
using(emp_no)
join departments
using(dept_no)
join titles
on titles.emp_no = employees.emp_no
and titles.to_date >= now()
'''
title_dept = pd.read_sql(dept_sql, emp_conn)
title_dept.groupby('title').count()

title,dept_name
Assistant Engineer,3953
Engineer,34203
Manager,9
Senior Engineer,94950
Senior Staff,90672
Staff,28234
Technique Leader,13311
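
The `groupby` above counts rows per title but does not break them out by department. A cross tabulation can be sketched with `pd.crosstab`, which takes the row and column labels as two series. The toy rows below are made up; on the real data the same call would be applied to `title_dept.title` and `title_dept.dept_name`.

```python
import pandas as pd

# Made-up rows standing in for the query result.
title_dept_toy = pd.DataFrame({
    "title": ["Engineer", "Engineer", "Staff", "Manager"],
    "dept_name": ["Development", "Production", "Sales", "Sales"],
})

# Rows are titles, columns are departments, cells are counts.
counts = pd.crosstab(title_dept_toy.title, title_dept_toy.dept_name)
print(counts)
```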
