In [1]:
import numpy as np
import pandas as pd

In [3]:
# Create list of values for names column.
np.random.seed(123)

students = ['Sally', 'Jane', 'Suzie', 'Billy', 'Ada', 'John', 'Thomas',
            'Marie', 'Albert', 'Richard', 'Isaac', 'Alan']

# Randomly generate arrays of scores for each student for each subject.
# Note that all the values need to have the same length here.

math_grades = np.random.randint(low=60, high=100, size=len(students))
english_grades = np.random.randint(low=60, high=100, size=len(students))
reading_grades = np.random.randint(low=60, high=100, size=len(students))

In [4]:
df = pd.DataFrame({'name': students,
                   'math': math_grades,
                   'english': english_grades,
                   'reading': reading_grades,
                   'classroom': np.random.choice(['A', 'B'], len(students))})

In [5]:
df

Unnamed: 0,name,math,english,reading,classroom
0,Sally,62,85,80,A
1,Jane,88,79,67,B
2,Suzie,94,74,95,A
3,Billy,98,96,88,B
4,Ada,77,92,98,A
5,John,79,76,93,B
6,Thomas,82,64,81,A
7,Marie,93,63,90,A
8,Albert,92,62,87,A
9,Richard,69,80,94,A


In [6]:
# Save the DataFrame to a CSV file to re-read later:
# df.to_csv('students.csv', index=False)
# df = pd.read_csv('students.csv')

## Indexing and Subsetting

Like the pandas Series object, the pandas DataFrame object supports both position- and label-based indexing using the indexing operator `[]`.

I will demonstrate concrete examples of indexing using the indexing operator `[]` alone and with the `.loc` and `.iloc` attributes below.

### `[]`

I can pass a list of columns from a DataFrame to the indexing operator (aka bracket notation) to return a subset of my original DataFrame.

In [8]:
df[['name']]

Unnamed: 0,name
0,Sally
1,Jane
2,Suzie
3,Billy
4,Ada
5,John
6,Thomas
7,Marie
8,Albert
9,Richard


In [None]:
# I can choose a single column using bracket notation.



In [7]:
df.name

0       Sally
1        Jane
2       Suzie
3       Billy
4         Ada
5        John
6      Thomas
7       Marie
8      Albert
9     Richard
10      Isaac
11       Alan
Name: name, dtype: object
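
The cell above uses attribute access (`df.name`) rather than brackets. A minimal sketch of how the two forms relate, using a small stand-in frame called `demo` with made-up values instead of the full `df`:

```python
import pandas as pd

# Small stand-in frame (hypothetical values) mirroring two columns of `df`.
demo = pd.DataFrame({'name': ['Sally', 'Jane', 'Ada'], 'math': [62, 88, 77]})

# A single column label inside the brackets returns a Series.
name_series = demo['name']

# Attribute access returns the same Series, but it only works when the column
# name is a valid Python identifier that does not clash with a DataFrame attribute.
same_series = demo.name

# A list of labels returns a DataFrame, even when the list holds one column.
name_frame = demo[['name']]

print(type(name_series).__name__, type(name_frame).__name__)
```

Bracket notation is the safer default, since it also works for column names that contain spaces or shadow a method name.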

In [None]:
# I can pass a boolean Series to the indexing operator as a selector.
# names that start with a?


In [9]:
df.name.str.startswith('A')

0     False
1     False
2     False
3     False
4      True
5     False
6     False
7     False
8      True
9     False
10    False
11     True
Name: name, dtype: bool

In [10]:
df[df.name.str.startswith('A')]

Unnamed: 0,name,math,english,reading,classroom
4,Ada,77,92,98,A
8,Albert,92,62,87,A
11,Alan,92,62,72,A


In [11]:
df.name.str.lower().startswith('a')

AttributeError: 'Series' object has no attribute 'startswith'

In [13]:
df.name.str.lower().str.startswith('a')

0     False
1     False
2     False
3     False
4      True
5     False
6     False
7     False
8      True
9     False
10    False
11     True
Name: name, dtype: bool

In [15]:
df['name'].str.lower().str.startswith('a')

0     False
1     False
2     False
3     False
4      True
5     False
6     False
7     False
8      True
9     False
10    False
11     True
Name: name, dtype: bool

### `.loc`

We can use the `.loc` attribute to select specific rows AND columns by index label. The index label can be a number, but it can also be a string label. This method offers a lot of flexibility! **The .loc attribute's indexing is inclusive and uses an index label, not position.**

```python
df.loc[row_indexer, column_indexer]
```

In [16]:
my_list = [3, 6, 8, 10, 20]
my_string = 'hello leavitt'

In [17]:
my_list[:3]

[3, 6, 8]

In [18]:
my_string[3:6]

'lo '

In [19]:
# Select rows by index label and keep all of the columns;
# notice the inclusive behavior of the indexing: label 4 is included.
df.loc[:4]


Unnamed: 0,name,math,english,reading,classroom
0,Sally,62,85,80,A
1,Jane,88,79,67,B
2,Suzie,94,74,95,A
3,Billy,98,96,88,B
4,Ada,77,92,98,A


In [20]:
df.loc[:4, 'math': 'reading']

Unnamed: 0,math,english,reading
0,62,85,80
1,88,79,67
2,94,74,95
3,98,96,88
4,77,92,98


In [None]:
# I can use a boolean Series as a selector with .loc, too, but I can choose rows and columns.
# dimensionality: 0, 1: rows, columns
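
A minimal sketch of that idea, assuming a small stand-in frame (`demo`, with made-up values) in place of `df`: the boolean Series goes in the row position and a list of labels goes in the column position.

```python
import pandas as pd

# Stand-in frame with hypothetical values.
demo = pd.DataFrame({
    'name': ['Sally', 'Jane', 'Ada', 'Alan'],
    'math': [62, 88, 77, 92],
    'classroom': ['A', 'B', 'A', 'A'],
})

# Boolean Series: True where the name starts with 'A'.
starts_with_a = demo['name'].str.startswith('A')

# Rows come from the boolean mask, columns from the list of labels.
a_names = demo.loc[starts_with_a, ['name', 'math']]
print(a_names)
```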


### `.iloc`

We can use the `.iloc` attribute to select specific rows and columns by index position. `.iloc` does not accept a boolean Series as a selector like `.loc` does. **It takes in integers representing index position, and the end of a slice is NOT inclusive.**

```python
df.iloc[row_indexer, column_indexer]
```

We can select rows by integer position:

In [21]:
# Notice the exclusive behavior of the indexing.

df.iloc[:4]

Unnamed: 0,name,math,english,reading,classroom
0,Sally,62,85,80,A
1,Jane,88,79,67,B
2,Suzie,94,74,95,A
3,Billy,98,96,88,B


We can also specify which columns we want to select:

In [22]:
df.iloc[1:4, 2:5]

Unnamed: 0,english,reading,classroom
1,79,67,B
2,74,95,A
3,96,88,B


In [24]:
df.iloc[1:4, 2:4]

Unnamed: 0,english,reading
1,79,67
2,74,95
3,96,88


Here we select the rows at positions 1 through 3 (starting from position 1, up to but not including position 4) and the columns at positions 2 and 3 (starting from position 2, up to but not including position 4).

## Aggregating

### `.agg`

The `.agg` method lets us specify a way to aggregate a series of numerical values. We pass an aggregate function or list of functions to the method that we want applied to a Series.

In [25]:
df.reading.agg('min')

67

In [26]:
df['reading'].agg('min')

67

In [27]:
df.agg('min')

name         Ada
math          62
english       62
reading       67
classroom      A
dtype: object

In [31]:
df[['math', 'english']].agg(['max', 'min'])

Unnamed: 0,math,english
max,98,99
min,62,62


In [32]:
df[['math', 'english']].agg(['max', 'min']).T

Unnamed: 0,max,min
math,98,62
english,99,62


In [None]:
# I can use it on the entire df.


In [None]:
# I can use it on a single column in a df.



In [None]:
# I can pass a list of functions to the .agg method.



While on the surface this seems pretty simple, `.agg` is capable of providing more detailed aggregations:
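
One such form, sketched here on a small stand-in frame with hypothetical values, passes a dictionary that maps column labels to the functions to apply, so each column gets its own set of aggregations:

```python
import pandas as pd

# Stand-in frame with hypothetical grades.
demo = pd.DataFrame({
    'math': [62, 88, 94, 98],
    'english': [85, 79, 74, 96],
})

# Map each column label to the function(s) to apply to that column.
summary = demo.agg({'math': ['min', 'max'], 'english': ['mean']})
print(summary)
```

Cells for a function that was not requested for a given column come back as `NaN`.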

### `.groupby`

The `.groupby` method is used to create a grouped object, which we can then apply an aggregation on. For example, if we wanted to know the highest math grade from each classroom:

In [33]:
df.columns

Index(['name', 'math', 'english', 'reading', 'classroom'], dtype='object')

In [36]:
df.groupby('classroom')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x13eb6c490>

In [38]:
df.groupby('classroom').agg(['min', 'mean', 'max'])


Unnamed: 0_level_0,math,math,math,english,english,english,reading,reading,reading
Unnamed: 0_level_1,min,mean,max,min,mean,max,min,mean,max
classroom,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2
A,62,82.625,94,62,72.75,92,72,87.125,98
B,79,89.25,98,76,87.5,99,67,85.25,93


In [None]:
# get our maximum grade for our 'math' column for classroom A and B respectively
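
A minimal sketch of that step, assuming a stand-in frame (`demo`) with made-up values: group by `classroom`, select the `math` column, then take the maximum of each group.

```python
import pandas as pd

# Stand-in frame with hypothetical values.
demo = pd.DataFrame({
    'math': [62, 88, 94, 98],
    'classroom': ['A', 'B', 'A', 'B'],
})

# One value per classroom: the highest math grade in that group.
max_math = demo.groupby('classroom')['math'].max()
print(max_math)
```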

We can use `.agg` here, too, to see multiple aggregations:

In [None]:
# structure:
# take your dataframe df
# group by classroom column
# use the math column --> referenced with .math (could use ['math'] instead)
# use a math method --> aggregation via .agg
# use these functions in agg: min, mean, max
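
Following the structure in the comments, a sketch on a stand-in frame (`demo`, hypothetical values) could look like this:

```python
import pandas as pd

# Stand-in frame with hypothetical values.
demo = pd.DataFrame({
    'math': [62, 88, 94, 98],
    'classroom': ['A', 'B', 'A', 'B'],
})

# Group by classroom, select the math column, apply three aggregations.
math_stats = demo.groupby('classroom')['math'].agg(['min', 'mean', 'max'])
print(math_stats)
```

The result has one row per classroom and one column per aggregation function.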


We can group by multiple columns as well. To demonstrate, we'll create a column named `passing_math` that labels each student as passing or failing, then group by the combination of our new feature, `passing_math`, and the classroom and calculate the average reading grade and the number of individuals in each subgroup.

Let's break this problem down and code it step-by-step.

#### `np.where`

First, we can create the new `passing_math` column using a handy NumPy function called `np.where`. It will allow us to base the new column values on whether the values in an existing column, `math`, meet a condition.

```python
np.where(condition, this_where_True, this_where_False)
```

In [39]:
df.math

0     62
1     88
2     94
3     98
4     77
5     79
6     82
7     93
8     92
9     69
10    92
11    92
Name: math, dtype: int64

In [40]:
np.where(df.math > 69, 'pass', 'fail')

array(['fail', 'pass', 'pass', 'pass', 'pass', 'pass', 'pass', 'pass',
       'pass', 'fail', 'pass', 'pass'], dtype='<U4')

In [41]:
df['pf_math'] = np.where(df.math > 69, 'pass', 'fail')

In [42]:
df.head()

Unnamed: 0,name,math,english,reading,classroom,pf_math
0,Sally,62,85,80,A,fail
1,Jane,88,79,67,B,pass
2,Suzie,94,74,95,A,pass
3,Billy,98,96,88,B,pass
4,Ada,77,92,98,A,pass


In [43]:
df['passing_math'] = df.math.apply(
    lambda x: 'passing' if x > 69 else 'failing')
df

Unnamed: 0,name,math,english,reading,classroom,pf_math,passing_math
0,Sally,62,85,80,A,fail,failing
1,Jane,88,79,67,B,pass,passing
2,Suzie,94,74,95,A,pass,passing
3,Billy,98,96,88,B,pass,passing
4,Ada,77,92,98,A,pass,passing
5,John,79,76,93,B,pass,passing
6,Thomas,82,64,81,A,pass,passing
7,Marie,93,63,90,A,pass,passing
8,Albert,92,62,87,A,pass,passing
9,Richard,69,80,94,A,fail,failing


In [45]:
df.drop(columns=['pf_math'], inplace=True)
df

Unnamed: 0,name,math,english,reading,classroom,passing_math
0,Sally,62,85,80,A,failing
1,Jane,88,79,67,B,passing
2,Suzie,94,74,95,A,passing
3,Billy,98,96,88,B,passing
4,Ada,77,92,98,A,passing
5,John,79,76,93,B,passing
6,Thomas,82,64,81,A,passing
7,Marie,93,63,90,A,passing
8,Albert,92,62,87,A,passing
9,Richard,69,80,94,A,failing


Argument descriptors for `np.where(condition, value_if_true, value_if_false)`:

- condition: the math grade is less than 70, i.e. `df.math < 70`
- output if True: `'failing'`
- output if False: `'passing'`

In [None]:
# Create the new column based on an existing column.
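
A minimal sketch of that step with `np.where`, using a stand-in frame (`demo`) with made-up grades and the condition "math grade is less than 70":

```python
import numpy as np
import pandas as pd

# Stand-in frame with hypothetical math grades.
demo = pd.DataFrame({'math': [62, 88, 69, 70]})

# Where the condition is True use 'failing', otherwise use 'passing'.
demo['passing_math'] = np.where(demo['math'] < 70, 'failing', 'passing')
print(demo)
```

Because `np.where` works on the whole column at once, it avoids the per-row function call that `.apply` with a lambda makes.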



In [46]:
grade_groups = df.groupby(['passing_math', 'classroom']).reading.agg(['mean', 'count'])

In [47]:
grade_groups

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,count
passing_math,classroom,Unnamed: 2_level_1,Unnamed: 3_level_1
failing,A,87.0,2
passing,A,87.166667,6
passing,B,85.25,4


The cell above groups by the `passing_math` and `classroom` columns and uses the `.agg` method to calculate the average reading grade and the number of students in each subgroup.

In [None]:
# I can even clean up my columns to make my calculations clearer.



In [48]:
grade_groups.columns = ['avg_reading_grade', 'number_of_students']
grade_groups

Unnamed: 0_level_0,Unnamed: 1_level_0,avg_reading_grade,number_of_students
passing_math,classroom,Unnamed: 2_level_1,Unnamed: 3_level_1
failing,A,87.0,2
passing,A,87.166667,6
passing,B,85.25,4


**Takeaways:**

We can interpret this output as there being 2 students failing math in classroom A with an average reading grade of 87, 6 students passing math in classroom A with an average reading grade of 87.17, and 4 students passing math in classroom B with an average reading grade of 85.25.

#### `.transform`

The `.transform` method can be used to produce a Series with the same length as the original dataframe, where each value represents the aggregation from the subgroup resulting from the `.groupby`.

For example, if we wanted to know the average math grade for each classroom and add this data back to our original dataframe:

In [49]:
# process: get the average math grade for each classroom
# df.groupby('classroom').math.mean()
df.assign(avg_math_by_class=df.groupby('classroom').math.transform('mean'))

Unnamed: 0,name,math,english,reading,classroom,passing_math,avg_math_by_class
0,Sally,62,85,80,A,failing,82.625
1,Jane,88,79,67,B,passing,89.25
2,Suzie,94,74,95,A,passing,82.625
3,Billy,98,96,88,B,passing,89.25
4,Ada,77,92,98,A,passing,82.625
5,John,79,76,93,B,passing,89.25
6,Thomas,82,64,81,A,passing,82.625
7,Marie,93,63,90,A,passing,82.625
8,Albert,92,62,87,A,passing,82.625
9,Richard,69,80,94,A,failing,82.625


#### `.describe`

Check out what I can do when I combine a `.groupby` with a `.describe`!

In [52]:
df.groupby(['classroom', 'passing_math']).describe().T

Unnamed: 0_level_0,classroom,A,A,B
Unnamed: 0_level_1,passing_math,failing,passing,passing
math,count,2.0,6.0,4.0
math,mean,65.5,88.333333,89.25
math,std,4.949747,7.061633,7.973916
math,min,62.0,77.0,79.0
math,25%,63.75,84.5,85.75
math,50%,65.5,92.0,90.0
math,75%,67.25,92.75,93.5
math,max,69.0,94.0,98.0
english,count,2.0,6.0,4.0
english,mean,82.5,69.5,87.5


## Merging and Joining

Pandas provides several ways to combine dataframes together. We will look at two of them below:

### `pd.concat`

This function takes in a list or dictionary of Series or DataFrame objects and joins them along a particular axis: row-wise (`axis=0`) or column-wise (`axis=1`).

```python
# For example, concat with a list of two DataFrames
pd.concat([df1, df2], axis=0)
```

- When your list contains at least one DataFrame, a DataFrame is returned.


- When concatenating only Series objects row-wise, axis=0, a Series is returned.


- When concatenating Series or DataFrames column-wise, axis=1, a DataFrame is returned.

```python
# Default is set to row-wise concatenation using an outer join.
pd.concat(objs, axis=0, join='outer')
```

When concatenating dataframes vertically, we basically are just adding more rows to an existing dataframe. In this case, the dataframes we are putting together should have the same column names[^1].

In [54]:
concat_df = pd.DataFrame({'col1': [1, 6, 3], 'col2': [6, 9, 10]})
concat_df1 = pd.DataFrame({'col1': [2, 5, 9], 'col2': [500, 35, 62]})

In [56]:
pd.concat([concat_df, concat_df1], axis=0)

Unnamed: 0,col1,col2
0,1,6
1,6,9
2,3,10
0,2,500
1,5,35
2,9,62


In [58]:
pd.concat([concat_df, concat_df1], axis=1)

Unnamed: 0,col1,col2,col1,col2
0,1,6,2,500
1,6,9,5,35
2,3,10,9,62


In [62]:
pd.concat([concat_df, concat_df1], axis=0, ignore_index=False)

Unnamed: 0,col1,col2
0,1,6
1,6,9
2,3,10
0,2,500
1,5,35
2,9,62


**Note** that the indices are preserved on the resulting dataframe; we could set the `ignore_index` parameter to `True` if we wanted these to be sequential.
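
A short sketch of that option, with two stand-in frames (`top` and `bottom`) holding the same values as the example above:

```python
import pandas as pd

top = pd.DataFrame({'col1': [1, 6, 3], 'col2': [6, 9, 10]})
bottom = pd.DataFrame({'col1': [2, 5, 9], 'col2': [500, 35, 62]})

# ignore_index=True discards the original row labels and renumbers from 0.
stacked = pd.concat([top, bottom], axis=0, ignore_index=True)
print(stacked)
```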

[^1]:
    We can concatenate dataframes with different column names, but generally this is not the behavior we want, as pandas will fill in a lot of null values into the resulting dataframe. The exception to this is if the dataframes are aligned on their index (i.e. the labels for each row), then we can provide the `axis=1` keyword argument to `pd.concat` to merge the dataframes horizontally.

### `.merge`

This method is similar to a SQL join. Here's a [cool read](https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_sql.html#compare-with-sql-join) making a comparison between the two, if you're interested.

```python
# df.merge default settings for commonly used parameters.

left_df.merge(right_df, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, indicator=False)
```

How does changing the default argument of the `how` parameter change my resulting DataFrame?

`how`: the type of merge to be performed.

`how='left'`: use only keys from left frame, similar to a SQL left outer join; preserve key order.

`how='right'`: use only keys from right frame, similar to a SQL right outer join; preserve key order.

`how='outer'`: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.

`how='inner'`: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
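
A minimal sketch of how each `how` value changes which keys survive the merge. The frames `left` and `right` and their `key` column are hypothetical and exist only for this illustration:

```python
import pandas as pd

# Hypothetical frames that share keys 2 and 3 only.
left = pd.DataFrame({'key': [1, 2, 3], 'left_val': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3, 4], 'right_val': ['x', 'y', 'z']})

# Collect the keys present in the result for each type of merge.
keys_by_how = {
    how: sorted(left.merge(right, on='key', how=how)['key'])
    for how in ['inner', 'left', 'right', 'outer']
}

for how, keys in keys_by_how.items():
    print(how, keys)
```

Rows whose key is missing from one side are kept only by the join types that include that side, and the columns from the other frame are filled with `NaN`.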

In [63]:
# Create the users DataFrame.

users = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'name': ['bob', 'joe', 'sally', 'adam', 'jane', 'mike'],
    'role_id': [1, 2, 3, 3, np.nan, np.nan]
})
users

Unnamed: 0,id,name,role_id
0,1,bob,1.0
1,2,joe,2.0
2,3,sally,3.0
3,4,adam,3.0
4,5,jane,
5,6,mike,


In [64]:
# Create the roles DataFrame

roles = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['admin', 'author', 'reviewer', 'commenter']
})
roles

Unnamed: 0,id,name
0,1,admin
1,2,author
2,3,reviewer
3,4,commenter


The `.merge` method will allow us to specify `left_on` and `right_on` to indicate the columns that are the keys used to merge the dataframes together. 

- In addition, the `how` keyword argument is used to define what type of JOIN we want to do; as we saw above, `inner` is the default setting. 

- For demonstration purposes, I'm also going to set the `indicator` parameter to `True`, which will create a column indicating whether the merge key appears in the `left_only`, `right_only` or `both` DataFrames.

In [65]:
# Perform an outer join specifying the left and right DataFrame keys.

users.merge(roles, left_on='role_id', right_on='id', how='outer', indicator=True)

Unnamed: 0,id_x,name_x,role_id,id_y,name_y,_merge
0,1.0,bob,1.0,1.0,admin,both
1,2.0,joe,2.0,2.0,author,both
2,3.0,sally,3.0,3.0,reviewer,both
3,4.0,adam,3.0,3.0,reviewer,both
4,5.0,jane,,,,left_only
5,6.0,mike,,,,left_only
6,,,,4.0,commenter,right_only


Notice that we have duplicate column names in the resulting dataframe. By default, pandas will add a suffix of `_x` to any columns in the left dataframe that are duplicated, and `_y` to any columns in the right dataframe that are duplicated. I can clean up my columns if I want to; one way would be to use method chaining, which is demonstrated below:

In [None]:
# left_on = left table key
# right_on = right table key

# ON left.key = right.key


In [66]:
temp = (users.merge(roles, 
            left_on='role_id', 
            right_on='id')
    .drop(columns='role_id')
    .rename(columns={'id_x': 'id', 
                     'name_x': 'employee',
                     'id_y': 'role_id',
                     'name_y': 'role'}
            )
)
temp

Unnamed: 0,id,employee,role_id,role
0,1,bob,1,admin
1,2,joe,2,author
2,3,sally,3,reviewer
3,4,adam,3,reviewer


In [67]:
temp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        4 non-null      int64 
 1   employee  4 non-null      object
 2   role_id   4 non-null      int64 
 3   role      4 non-null      object
dtypes: int64(2), object(2)
memory usage: 160.0+ bytes


In [None]:
df.drop(columns='a_name_student', inplace=True)

In [87]:
# cache a dataframe: turn it into a csv
df.to_csv('students.csv', index=False)

In [None]:
df.to_csv('/Users/madeleinecapper/Documents/students.csv')

## Exercises II

1. Copy the `users` and `roles` DataFrames from the examples above. 

In [68]:
users

Unnamed: 0,id,name,role_id
0,1,bob,1.0
1,2,joe,2.0
2,3,sally,3.0
3,4,adam,3.0
4,5,jane,
5,6,mike,


In [69]:
roles

Unnamed: 0,id,name
0,1,admin
1,2,author
2,3,reviewer
3,4,commenter


2. What is the result of using a `right` join on the DataFrames? 

In [70]:
users.merge(roles, how='right')

Unnamed: 0,id,name,role_id
0,1,admin,
1,2,author,
2,3,reviewer,
3,4,commenter,


3. What is the result of using an `outer` join on the DataFrames?
     

In [71]:
users.merge(roles, how='outer')

Unnamed: 0,id,name,role_id
0,1,bob,1.0
1,2,joe,2.0
2,3,sally,3.0
3,4,adam,3.0
4,5,jane,
5,6,mike,
6,1,admin,
7,2,author,
8,3,reviewer,
9,4,commenter,


4. What happens if you drop the foreign keys from the DataFrames and try to merge them?

In [72]:
users.drop(columns='role_id').merge(roles)

Unnamed: 0,id,name


5. Load the `mpg` dataset from PyDataset. 

In [73]:
from pydataset import data

In [74]:
mpg = data('mpg')

6. Output and read the documentation for the `mpg` dataset.

In [78]:
data('mpg', show_doc=True)

mpg

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

## Fuel economy data from 1999 and 2008 for 38 popular models of car

### Description

This dataset contains a subset of the fuel economy data that the EPA makes
available on http://fueleconomy.gov. It contains only models which had a new
release every year between 1999 and 2008 - this was used as a proxy for the
popularity of the car.

### Usage

    data(mpg)

### Format

A data frame with 234 rows and 11 variables

### Details

  * manufacturer. 

  * model. 

  * displ. engine displacement, in litres 

  * year. 

  * cyl. number of cylinders 

  * trans. type of transmission 

  * drv. f = front-wheel drive, r = rear wheel drive, 4 = 4wd 

  * cty. city miles per gallon 

  * hwy. highway miles per gallon 

  * fl. 

  * class. 




7. How many rows and columns are in the dataset?

In [79]:
mpg.shape

(234, 11)

8. Check out your column names and perform any cleanup you may want on them.

In [80]:
mpg.columns

Index(['manufacturer', 'model', 'displ', 'year', 'cyl', 'trans', 'drv', 'cty',
       'hwy', 'fl', 'class'],
      dtype='object')

9. Display the summary statistics for the dataset.

In [81]:
mpg.describe()

Unnamed: 0,displ,year,cyl,cty,hwy
count,234.0,234.0,234.0,234.0,234.0
mean,3.471795,2003.5,5.888889,16.858974,23.440171
std,1.291959,4.509646,1.611534,4.255946,5.954643
min,1.6,1999.0,4.0,9.0,12.0
25%,2.4,1999.0,4.0,14.0,18.0
50%,3.3,2003.5,6.0,17.0,24.0
75%,4.6,2008.0,8.0,19.0,27.0
max,7.0,2008.0,8.0,35.0,44.0


10. How many different manufacturers are there?

In [86]:
mpg.groupby('manufacturer').agg('min','mean','max')


audi
chevrolet
dodge
ford
honda
hyundai
jeep
land rover
lincoln
mercury
nissan


In [88]:
mpg.manufacturer.nunique()

15

11. How many different models are there?

In [90]:
len(mpg.groupby('model'))

38

12. Create a column named `mileage_difference` like you did in the DataFrames exercises; this column should contain the difference between highway and city mileage for each car.

13. Create a column named `average_mileage` like you did in the DataFrames exercises; this is the mean of the city and highway mileage.

14. Create a new column on the `mpg` dataset named `is_automatic` that holds boolean values denoting whether the car has an automatic transmission.

15. Using the `mpg` dataset, find out which manufacturer has the best miles per gallon on average.

16. Do automatic or manual cars have better miles per gallon?
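
One possible approach to exercises 12 through 16 is sketched below. It is only a sketch: `cars` is a hypothetical four-row stand-in that borrows the `mpg` column names, so the numbers it produces say nothing about the real dataset. It also assumes that automatic transmissions are the `trans` values starting with `'auto'`.

```python
import pandas as pd

# Hypothetical stand-in rows using the same column names as `mpg`;
# swap in `data('mpg')` from pydataset to work on the real dataset.
cars = pd.DataFrame({
    'manufacturer': ['audi', 'audi', 'honda', 'honda'],
    'trans': ['auto(l5)', 'manual(m5)', 'auto(l4)', 'manual(m5)'],
    'cty': [18, 21, 24, 28],
    'hwy': [29, 29, 32, 33],
})

# 12. Difference between highway and city mileage.
cars['mileage_difference'] = cars['hwy'] - cars['cty']

# 13. Mean of city and highway mileage.
cars['average_mileage'] = (cars['hwy'] + cars['cty']) / 2

# 14. Boolean flag, assuming automatic transmissions start with 'auto'.
cars['is_automatic'] = cars['trans'].str.startswith('auto')

# 15. Manufacturer with the highest average mileage.
best_manufacturer = (
    cars.groupby('manufacturer')['average_mileage'].mean().idxmax()
)

# 16. Average mileage for manual (False) versus automatic (True) cars.
mileage_by_transmission = cars.groupby('is_automatic')['average_mileage'].mean()

print(best_manufacturer)
print(mileage_by_transmission)
```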