# Advanced Dataframes Exercises I

### 1. Run `python -m pip install mysqlclient pymysql` from your terminal to install the MySQL client libraries (any folder is fine)

### 2. `cd` into your exercises folder for this module and run `echo env.py >> .gitignore`

### 3. Create a function named `get_db_url`. It should accept a username, hostname, password, and database name and return a URL connection string formatted like the example at the start of this lesson.

In [1]:
from env import user, password, host
import pandas as pd
import numpy as np

def get_db_url(username: str, hostname: str, password: str, database_name: str) -> str:
    '''
    Takes username, hostname, password and database_name and
    returns a connection string
    '''
    connection = f'mysql+pymysql://{username}:{password}@{hostname}/{database_name}'

    return connection

emp_conn = get_db_url(user, host, password, 'employees')
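
The f-string above inserts the credentials verbatim, so a password containing characters such as `@` or `:` would produce a URL that parses incorrectly. A minimal sketch of one way to guard against that, assuming percent-encoding with the standard library's `quote_plus`; the function name `get_db_url_quoted` and the credentials shown are hypothetical placeholders, not part of the lesson:

```python
from urllib.parse import quote_plus


def get_db_url_quoted(username: str, hostname: str, password: str, database_name: str) -> str:
    """Build a mysql+pymysql URL, percent-encoding the username and password."""
    # quote_plus escapes characters such as '@', ':' and '/' that would
    # otherwise be misread as URL delimiters.
    return (
        f"mysql+pymysql://{quote_plus(username)}:{quote_plus(password)}"
        f"@{hostname}/{database_name}"
    )


# Placeholder credentials, for illustration only.
example_url = get_db_url_quoted("some_user", "db.example.com", "p@ss:word", "employees")
print(example_url)
```

For credentials made up only of letters and digits, this returns the same string as `get_db_url`.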


### 4. Use your function to obtain a connection to the employees database.

In [2]:
sql = '''
select *
from employees
limit 5
offset 50
'''
pd.read_sql(sql, emp_conn)

| |emp_no|birth_date|first_name|last_name|gender|hire_date|
|---|---|---|---|---|---|---|
|0|10051|1953-07-28|Hidefumi|Caine|M|1992-10-15|
|1|10052|1961-02-26|Heping|Nitsch|M|1988-05-21|
|2|10053|1954-09-13|Sanjiv|Zschoche|F|1986-02-04|
|3|10054|1957-04-04|Mayumi|Schueller|M|1995-03-13|
|4|10055|1956-06-06|Georgy|Dredge|M|1992-04-27|


### 5a. Intentionally make a typo in the database url. What kind of error message do you see? 

* Returns: `NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:mysqltest.pymysql`

In [3]:
bad_conn = emp_conn[:5] + 'test' + emp_conn[5:]
# pd.read_sql(sql, bad_conn)

### 5b. Intentionally make an error in your SQL query. What does the error message look like? 

* Returns: ```ProgrammingError: (pymysql.err.ProgrammingError) (1146, "Table 'employees.employeeeeeeeeees' doesn't exist")```

In [4]:
bad_sql = '''
select *
from employeeeeeeeeees
limit 5
offset 50
'''
# pd.read_sql(bad_sql, emp_conn)
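
The failing calls above are commented out so the notebook keeps running. Another option is to catch the error and inspect its message. The sketch below is only an illustration under stated assumptions: it uses an in-memory SQLite database as a stand-in for the MySQL server, so the exception type (`pandas.errors.DatabaseError`) and the message differ from the `ProgrammingError` quoted above.

```python
import sqlite3

import pandas as pd
from pandas.errors import DatabaseError

# In-memory SQLite database as a stand-in for the MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("create table employees (emp_no integer, first_name text)")
conn.execute("insert into employees values (10051, 'Hidefumi')")

bad_sql = "select * from employeeeeeeeeees"

try:
    pd.read_sql(bad_sql, conn)
    error_message = None
except DatabaseError as err:
    # pandas wraps the driver's error when given a raw DBAPI connection.
    error_message = str(err)

print(error_message)
```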

### 6. Read the employees and titles tables into two separate DataFrames.

In [5]:
sql_employees = '''
select *
from employees
'''
employees = pd.read_sql(sql_employees, emp_conn)
sql_titles = '''
select *
from titles
'''
titles = pd.read_sql(sql_titles, emp_conn)

### 7. How many rows and columns do you have in each DataFrame? Is that what you expected? 

* `employees`: 300024 rows, 6 columns. `titles`: 443308 rows, 4 columns. Yes, this is what I expected: every employee has at least one title and can hold several over time, so `titles` has more rows than `employees`.

In [6]:
[employees.shape, titles.shape]

[(300024, 6), (443308, 4)]

### 8. Display the summary statistics for each DataFrame.

In [7]:
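# A notebook cell displays only its last expression, so the output below is for titles.
# Wrap each call in display() to show both summaries.
employees.describe()
titles.describe()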

| |emp_no|
|---|---|
|count|443308.0|
|mean|253075.03443|
|std|161853.292613|
|min|10001.0|
|25%|84855.75|
|50%|249847.5|
|75%|424891.25|
|max|499999.0|


### 9. How many unique titles are in the `titles` DataFrame?

In [8]:
len(titles.title.unique())

7

### 10. What is the oldest date in the to_date column?

In [9]:
titles.to_date.min()

datetime.date(1985, 3, 1)

### 11. What is the most recent date in the to_date column?

In [10]:
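import datetime  # needed for datetime.date; without it this cell raises NameError: name 'datetime' is not defined

# Exclude the 9999-01-01 placeholder used for titles that are still current.
titles[titles.to_date != datetime.date(9999, 1, 1)].to_date.max()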
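
The filter in this exercise treats `9999-01-01` as a placeholder for "still current" rather than a real end date. A small self-contained sketch of the same idea; the `toy_titles` frame and its values are made up for illustration and are not rows from the `employees` database:

```python
import datetime

import pandas as pd

# Toy stand-in for the titles table; the values are invented for illustration.
toy_titles = pd.DataFrame({
    "title": ["Engineer", "Staff", "Senior Staff"],
    "to_date": [
        datetime.date(1999, 5, 1),
        datetime.date(2001, 7, 15),
        datetime.date(9999, 1, 1),  # placeholder meaning "still current"
    ],
})

sentinel = datetime.date(9999, 1, 1)

# Drop the placeholder rows first, then take the maximum of what remains.
latest_real_date = toy_titles.loc[toy_titles.to_date != sentinel, "to_date"].max()
print(latest_real_date)
```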

# Exercises II

### 1. Copy the `users` and `roles` DataFrames from the examples above.

In [None]:
users = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'name': ['bob', 'joe', 'sally', 'adam', 'jane', 'mike'],
    'role_id': [1, 2, 3, 3, np.nan, np.nan]
})

roles = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['admin', 'author', 'reviewer', 'commenter']
})


### 2. What is the result of using a `right` join on the DataFrames?



In [None]:
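# Match users.role_id to roles.id; DataFrame.join would align on the row index instead.
users.merge(roles, left_on='role_id', right_on='id', how='right', suffixes=('_users', '_roles'))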


### 3. What is the result of using an outer join on the DataFrames?

In [None]:
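# Match users.role_id to roles.id; DataFrame.join would align on the row index instead.
users.merge(roles, left_on='role_id', right_on='id', how='outer', suffixes=('_users', '_roles'))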
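
To see which side each row of an outer join came from, `merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below rebuilds the two frames from exercise 1 so it runs on its own, and assumes the intended key is `users.role_id` matched to `roles.id`:

```python
import numpy as np
import pandas as pd

users = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "name": ["bob", "joe", "sally", "adam", "jane", "mike"],
    "role_id": [1, 2, 3, 3, np.nan, np.nan],
})
roles = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["admin", "author", "reviewer", "commenter"],
})

# Outer join on the foreign key; _merge records where each row was found.
merged = users.merge(
    roles,
    left_on="role_id",
    right_on="id",
    how="outer",
    suffixes=("_users", "_roles"),
    indicator=True,
)

merge_counts = merged["_merge"].value_counts().to_dict()
print(merge_counts)
```

Rows tagged `left_only` are users without a role, and rows tagged `right_only` are roles that no user holds.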

### 4. What happens if you drop the foreign keys from the DataFrames and try to merge them?

In [None]:
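# drop returns a new DataFrame, so chain the merge onto its result.
# With no foreign key, merge falls back to the shared column names (id and name).
users.drop(columns='role_id').merge(roles, how='outer')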

### 5. Load the `mpg` dataset from PyDataset.

In [None]:
from pydataset import data
mpg = data('mpg')

### 6. Output and read the documentation for the `mpg` dataset.



In [None]:
data('mpg', show_doc=True)

### 7. How many rows and columns are in the dataset?

* 234 Rows and 11 Columns

In [None]:
mpg.shape

### 8. Check out your column names and perform any cleanup you may want on them.

In [None]:
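# drop returns a copy without these columns; mpg itself is left unchanged
# because the result is not assigned back.
mpg.drop(columns=['displ', 'fl', 'cyl', 'drv'])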

### 9. Display the summary statistics for the dataset.

In [None]:
mpg.describe()

### 10. How many different manufacturers are there?

* There are 15 different manufacturers.

In [None]:
len(mpg.manufacturer.unique())

### 11. How many different models are there?

* There are 38 different models

In [None]:
len(mpg.model.unique())

### 12. Create a column named `mileage_difference` like you did in the DataFrames exercises; this column should contain the difference between highway and city mileage for each car.

In [None]:
mpg['mileage_difference'] = mpg.hwy - mpg.cty

### 13. Create a column named `average_mileage` like you did in the DataFrames exercises; this is the mean of the city and highway mileage.

In [None]:
mpg['average_mileage'] = round(mpg[['cty', 'hwy']].mean(axis=1), 2)

### 14. Create a new column on the `mpg` dataset named `is_automatic` that holds boolean values denoting whether the car has an automatic transmission.

In [None]:
mpg['is_automatic'] = mpg.trans.str.contains('auto')

### 15. Using the `mpg` dataset, find out which manufacturer has the best miles per gallon on average.

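* The single most efficient car in the dataset is a Volkswagen New Beetle (row below). Answering the question per manufacturer requires averaging within each `manufacturer` first, as the cell that follows does.

| |manufacturer|model|displ|year|cyl|trans|drv|cty|hwy|fl|class|mileage_difference|average_mileage|is_automatic|
|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
|222|volkswagen|new beetle|1.9|1999|4|manual(m5)|f|35|44|d|subcompact|9|39.5|False|


In [None]:
# Average the mileage within each manufacturer, then sort so the best comes first.
mpg.groupby('manufacturer').average_mileage.mean().sort_values(ascending=False).head(1)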

### 16. Do automatic or manual cars have better miles per gallon?

* Manual cars have better mpg on average in this dataset:

|auto trans|manual trans|
|---|---|
|19.13|22.23|

In [None]:
auto_mpg_avg = mpg[mpg.trans.str.contains('auto')].average_mileage.mean()
manual_mpg_avg = mpg[mpg.trans.str.contains('manual')].average_mileage.mean()
auto_mpg_avg, manual_mpg_avg
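
The two filtered means above can also be produced in one step with `groupby`. A minimal sketch on made-up data; `toy_mpg` and its values are hypothetical and only mimic the shape of the `trans` and `average_mileage` columns:

```python
import pandas as pd

# Invented rows shaped like the mpg columns used above.
toy_mpg = pd.DataFrame({
    "trans": ["auto(l5)", "manual(m5)", "auto(av)", "manual(m6)"],
    "average_mileage": [20.0, 25.0, 22.0, 27.0],
})

# Label each row by transmission type, then average within each label.
toy_mpg["transmission"] = (
    toy_mpg.trans.str.contains("auto").map({True: "automatic", False: "manual"})
)
mileage_by_transmission = toy_mpg.groupby("transmission").average_mileage.mean()
print(mileage_by_transmission)
```

Grouping keeps both averages in one Series, which makes the comparison easier to extend to more categories.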

# Exercises III

### 1. Use your `get_db_url` function to help you explore the data from the chipotle database.

In [None]:
conn = get_db_url(username=user, password=password, hostname=host, database_name='chipotle')
sql = '''
select *
from orders'''
chipotle = pd.read_sql(sql, conn)
chipotle

### 2. What is the total price for each order?

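* $34,500.16 is the total price across all orders. The per-order totals come from grouping by `order_id`.

In [None]:
# Strip '$' and ',' so the prices can be summed, then total them within each order.
prices = chipotle.item_price.replace(r'[\$,]', '', regex=True).astype(float)
prices.groupby(chipotle.order_id).sum()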
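
The price column is stored as text such as `$2.39`, so it has to be cleaned before any arithmetic. A self-contained sketch of cleaning and then totalling per order; the `toy_orders` rows are invented for illustration and only mimic the layout of the `orders` table:

```python
import pandas as pd

# Invented rows shaped like the chipotle orders table.
toy_orders = pd.DataFrame({
    "order_id": [1, 1, 2],
    "item_name": ["Chips", "Izze", "Chicken Bowl"],
    "item_price": ["$2.39 ", "$3.39 ", "$16.98 "],
})

# Remove '$' and ',' and surrounding whitespace, then convert to float.
prices = (
    toy_orders.item_price
    .str.replace(r"[\$,]", "", regex=True)
    .str.strip()
    .astype(float)
)

# Sum the cleaned prices within each order.
order_totals = prices.groupby(toy_orders.order_id).sum().round(2)
print(order_totals)
```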

### 3. What are the most popular 3 items?

* The three most popular items are:
  * Chicken Bowl: 726
  * Chicken Burrito: 553
  * Chips and Guacamole: 479



In [None]:
chipotle.item_name.value_counts().head(n=3)

### 4. Which item has produced the most revenue?

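* The single largest order line is 15 Chips and Fresh Tomato Salsa (row below). Its `item_price` of 44.25 works out to 2.95 per unit, so it appears to be the line total already; multiplying by `quantity` again, as the `revenue` column does, would overstate it. Revenue per item comes from summing `item_price` within each `item_name`, as the cell that follows does.

| |id|order_id|quantity|item_name|choice_description|item_price|revenue|
|---|---|---|---|---|---|---|---|
|3598|3599|1443|15|Chips and Fresh Tomato Salsa|nan|44.25|663.75|

In [None]:
# Convert prices like '$44.25' to floats, then total them per item and keep the largest.
prices = chipotle.item_price.replace(r'[\$,]', '', regex=True).astype(float)
prices.groupby(chipotle.item_name).sum().sort_values(ascending=False).head(1)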

### 5. Join the `employees` and `titles` DataFrames together.

In [None]:
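# emp_no is the only column the two tables share; naming it makes the join key explicit.
emp_w_title = employees.merge(titles, how='left', on='emp_no')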
emp_w_title

### 6. For each title, find the hire date of the employee that was hired most recently with that title.

In [None]:
hire_dte = emp_w_title.groupby(by='title').hire_date.agg(['max'])
hire_dte.rename(columns={'max': 'most recent hire date'})

### 7. Write the code necessary to create a cross tabulation of the number of titles by department. (Hint: this will involve a combination of SQL code to pull the necessary data and python/pandas code to perform the manipulations.)

In [None]:
dept_sql = '''
select title, dept_name
from employees
join dept_emp
using(emp_no)
join departments
using(dept_no)
join titles
on titles.emp_no = employees.emp_no
and titles.to_date >= now() #You could remove this and get all titles
'''
title_dept = pd.read_sql(dept_sql, emp_conn)
# title_dept.groupby(['dept_name','title']).title.agg(['count'])
pd.crosstab(title_dept.dept_name, title_dept.title, margins=True).T
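
`pd.crosstab` counts how often each pair of values occurs, and `margins=True` adds row and column totals under the label `All`. A small sketch on invented data to show the shape of the result; the `toy_title_dept` rows are hypothetical and not taken from the `employees` database:

```python
import pandas as pd

# Invented (title, dept_name) pairs shaped like the query result above.
toy_title_dept = pd.DataFrame({
    "title": ["Staff", "Manager", "Staff"],
    "dept_name": ["Sales", "Sales", "Finance"],
})

# Rows are departments, columns are titles; 'All' holds the totals.
title_counts = pd.crosstab(
    toy_title_dept.dept_name, toy_title_dept.title, margins=True
)
print(title_counts)
```

Calling `.T` on the result, as in the cell above, swaps the axes so titles become the rows.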
