# Dataframes

In this lesson, we will introduce pandas *dataframes*. Dataframes represent tabular, 2-dimensional data, and provide a number of facilities for manipulating and transforming the data.

In [1]:
import pandas as pd

## An Example Dataframe 

The code below will create a data frame that represents grades for multiple students. We pass a dictionary where the keys will correspond to the names of the columns, and the values associated with those keys will make up the data. We will talk in more detail about different ways to create a dataframe in a coming lesson.

In [2]:
import pandas as pd
import numpy as np

np.random.seed(123) # seed the random number generator. Without this line,
# every time you run the cell you'll get different random numbers. With it,
# you get the same numbers on every run, and the same numbers as other
# notebooks that use the same seed.

students = ['Sally', 'Jane', 'Suzie', 'Billy', 'Ada', 'John', 'Thomas',
            'Marie', 'Albert', 'Richard', 'Isaac', 'Alan']

# randomly generate scores for each student for each subject
# note that all the values need to have the same length here
math_grades = np.random.randint(low=60, high=100, size=len(students))
english_grades = np.random.randint(low=60, high=100, size=len(students))
reading_grades = np.random.randint(low=60, high=100, size=len(students))

df = pd.DataFrame({'name': students,
                   'math': math_grades,
                   'english': english_grades,
                   'reading': reading_grades})

type(df)

pandas.core.frame.DataFrame

As we might expect, the dataframe stored in the `df` variable has a type of `DataFrame`.

Dataframes also have a nice, printed representation:

In [3]:
print(df)

       name  math  english  reading
0     Sally    62       85       80
1      Jane    88       79       67
2     Suzie    94       74       95
3     Billy    98       96       88
4       Ada    77       92       98
5      John    79       76       93
6    Thomas    82       64       81
7     Marie    93       63       90
8    Albert    92       62       87
9   Richard    69       80       94
10    Isaac    92       99       93
11     Alan    92       62       72


And, if we are within a Jupyter notebook (or the Codeup curriculum), we can get a nice HTML representation of a dataframe:

In [4]:
df

Unnamed: 0,name,math,english,reading
0,Sally,62,85,80
1,Jane,88,79,67
2,Suzie,94,74,95
3,Billy,98,96,88
4,Ada,77,92,98
5,John,79,76,93
6,Thomas,82,64,81
7,Marie,93,63,90
8,Albert,92,62,87
9,Richard,69,80,94


## Summarizing Dataframes

The `.info` method prints out some useful information about the dataframe:

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   name     12 non-null     object
 1   math     12 non-null     int64 
 2   english  12 non-null     int64 
 3   reading  12 non-null     int64 
dtypes: int64(3), object(1)
memory usage: 512.0+ bytes


The `.describe` method gives a quick summary of the numerical values in a dataframe.

In [6]:
df.describe()

Unnamed: 0,math,english,reading
count,12.0,12.0,12.0
mean,84.833333,77.666667,86.5
std,11.134168,13.371158,9.643651
min,62.0,62.0,67.0
25%,78.5,63.75,80.75
50%,90.0,77.5,89.0
75%,92.25,86.75,93.25
max,98.0,99.0,98.0


## Dataframe Attributes

Dataframes have several attributes that are important to be familiar with:

- `dtypes`: the data type of each column
- `shape`: the number of rows and columns in the dataframe
- `columns`: the list of column names
- `index`: the labels for each row (usually an autogenerated number)

In [7]:
df.dtypes

name       object
math        int64
english     int64
reading     int64
dtype: object

In [8]:
df.shape

(12, 4)

In [9]:
len(df)

12

In [10]:
df.columns

Index(['name', 'math', 'english', 'reading'], dtype='object')

In [11]:
df.index

RangeIndex(start=0, stop=12, step=1)

The `.columns` attribute can be assigned to in order to change the name of the columns in the data frame. For example, if we wanted to uppercase every column name, we could do so like this:

In [12]:
df.columns = [col.upper() for col in df.columns]

In [13]:
df

Unnamed: 0,NAME,MATH,ENGLISH,READING
0,Sally,62,85,80
1,Jane,88,79,67
2,Suzie,94,74,95
3,Billy,98,96,88
4,Ada,77,92,98
5,John,79,76,93
6,Thomas,82,64,81
7,Marie,93,63,90
8,Albert,92,62,87
9,Richard,69,80,94


For now, we'll reset the column names back to what they used to be.

In [14]:
df.columns = [col.lower() for col in df.columns]

## Subsetting Dataframes

There are a number of ways we can access certain subsets, i.e. either a restricted number of rows, columns, or both, of our dataframes.

### Accessing Individual Columns

Each column in a dataframe is a `Series` that we discussed in the previous lesson. These values can be accessed in one of two ways:

In [15]:
# using . notation
df.math # concise, but fails if the name isn't a valid identifier or clashes with a dataframe attribute

0     62
1     88
2     94
3     98
4     77
5     79
6     82
7     93
8     92
9     69
10    92
11    92
Name: math, dtype: int64

In [16]:
# using square brackets
df['math']

0     62
1     88
2     94
3     98
4     77
5     79
6     82
7     93
8     92
9     69
10    92
11    92
Name: math, dtype: int64

In [17]:
df.math.describe()

count    12.000000
mean     84.833333
std      11.134168
min      62.000000
25%      78.500000
50%      90.000000
75%      92.250000
max      98.000000
Name: math, dtype: float64

In [18]:
df['math'].describe()

count    12.000000
mean     84.833333
std      11.134168
min      62.000000
25%      78.500000
50%      90.000000
75%      92.250000
max      98.000000
Name: math, dtype: float64

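The first way is more concise, but the second way is required if the name of the column is not a valid Python identifier or clashes with an existing dataframe attribute or method. Square brackets are also required when creating a new column.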
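As a minimal sketch of where dot notation breaks down, the toy dataframe below (a hypothetical `grades` frame, not part of the lesson data) has one column name containing a space and one that collides with the `DataFrame.count` method:

```python
import pandas as pd

# Hypothetical toy frame: one plain name, one name with a space,
# and one name that collides with the DataFrame.count method.
grades = pd.DataFrame({
    'math': [62, 88],
    'final exam': [70, 91],
    'count': [1, 2],
})

plain = grades.math             # works: 'math' is a valid identifier
spaced = grades['final exam']   # brackets are the only option for this name
method = grades.count           # the DataFrame.count method, not the column
column = grades['count']        # the actual column

print(callable(method))         # True
print(column.tolist())          # [1, 2]
```

Square brackets behave the same way for every column name, which is why they are the safer default in scripts.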

### Accessing Multiple Columns

We can see multiple columns in the dataframe by subsetting the dataframe with a list of strings. The following two code samples are functionally equivalent.

In [19]:
df[['name', 'math']] # passing a list of column names, hence the double brackets




Unnamed: 0,name,math
0,Sally,62
1,Jane,88
2,Suzie,94
3,Billy,98
4,Ada,77
5,John,79
6,Thomas,82
7,Marie,93
8,Albert,92
9,Richard,69


In [20]:
columns = ['name', 'math']
df[columns]
# the columns variable is already a list, so only one pair of brackets is needed

Unnamed: 0,name,math
0,Sally,62
1,Jane,88
2,Suzie,94
3,Billy,98
4,Ada,77
5,John,79
6,Thomas,82
7,Marie,93
8,Albert,92
9,Richard,69


### Creating new Columns

In [21]:
df.head()

Unnamed: 0,name,math,english,reading
0,Sally,62,85,80
1,Jane,88,79,67
2,Suzie,94,74,95
3,Billy,98,96,88
4,Ada,77,92,98


In [22]:
df['location'] = 'San Antonio' # a single value is repeated for every row; a list with one value per student also works


In [23]:
df.head()

Unnamed: 0,name,math,english,reading,location
0,Sally,62,85,80,San Antonio
1,Jane,88,79,67,San Antonio
2,Suzie,94,74,95,San Antonio
3,Billy,98,96,88,San Antonio
4,Ada,77,92,98,San Antonio


In [24]:
df.math

0     62
1     88
2     94
3     98
4     77
5     79
6     82
7     93
8     92
9     69
10    92
11    92
Name: math, dtype: int64

In [25]:
df['math_honors'] = df.math > 90

In [26]:
df.head()

Unnamed: 0,name,math,english,reading,location,math_honors
0,Sally,62,85,80,San Antonio,False
1,Jane,88,79,67,San Antonio,False
2,Suzie,94,74,95,San Antonio,True
3,Billy,98,96,88,San Antonio,True
4,Ada,77,92,98,San Antonio,False


In [27]:
df['overall_average'] = (df.math+df.english+df.reading)/3

In [28]:
df.head()

Unnamed: 0,name,math,english,reading,location,math_honors,overall_average
0,Sally,62,85,80,San Antonio,False,75.666667
1,Jane,88,79,67,San Antonio,False,78.0
2,Suzie,94,74,95,San Antonio,True,87.666667
3,Billy,98,96,88,San Antonio,True,94.0
4,Ada,77,92,98,San Antonio,False,89.0


### Accessing Row Subsets

Pandas provides several convenience methods for quickly looking at several rows in a dataframe:

- `.head`: for the first n (default 5) rows
- `.tail`: for the last n (default 5) rows
- `.sample`: for a random sample of rows

In [29]:
df.head()

Unnamed: 0,name,math,english,reading,location,math_honors,overall_average
0,Sally,62,85,80,San Antonio,False,75.666667
1,Jane,88,79,67,San Antonio,False,78.0
2,Suzie,94,74,95,San Antonio,True,87.666667
3,Billy,98,96,88,San Antonio,True,94.0
4,Ada,77,92,98,San Antonio,False,89.0


In [30]:
df.tail()

Unnamed: 0,name,math,english,reading,location,math_honors,overall_average
7,Marie,93,63,90,San Antonio,True,82.0
8,Albert,92,62,87,San Antonio,True,80.333333
9,Richard,69,80,94,San Antonio,False,81.0
10,Isaac,92,99,93,San Antonio,True,94.666667
11,Alan,92,62,72,San Antonio,True,75.333333


In [31]:
df.sample(5)

Unnamed: 0,name,math,english,reading,location,math_honors,overall_average
7,Marie,93,63,90,San Antonio,True,82.0
11,Alan,92,62,72,San Antonio,True,75.333333
1,Jane,88,79,67,San Antonio,False,78.0
4,Ada,77,92,98,San Antonio,False,89.0
9,Richard,69,80,94,San Antonio,False,81.0
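Because `.sample` draws rows at random, the output changes from run to run. A small sketch, assuming a hypothetical `toy` dataframe in place of the lesson's `df`, of how the `random_state` keyword argument can make the draw repeatable:

```python
import pandas as pd

# Hypothetical toy frame standing in for the lesson's `df`.
toy = pd.DataFrame({'name': ['Sally', 'Jane', 'Suzie', 'Billy', 'Ada'],
                    'math': [62, 88, 94, 98, 77]})

first = toy.sample(3, random_state=42)
second = toy.sample(3, random_state=42)

# Same seed, so the same rows come back in the same order.
print(first.equals(second))  # True
```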


Like numpy arrays and pandas `Series`, pandas dataframes can also be indexed into with a boolean series.

For example, suppose we wanted to find the observations in our dataframe where the math grade is below an 80. We know that we can produce a boolean series of values using a vectorized comparison operation:

In [32]:
df.math < 80

0      True
1     False
2     False
3     False
4      True
5      True
6     False
7     False
8     False
9      True
10    False
11    False
Name: math, dtype: bool

We can then use that series to index into our dataframe to find the entire row where our condition is true:

In [33]:
df[df.math < 80]

Unnamed: 0,name,math,english,reading,location,math_honors,overall_average
0,Sally,62,85,80,San Antonio,False,75.666667
4,Ada,77,92,98,San Antonio,False,89.0
5,John,79,76,93,San Antonio,False,82.666667
9,Richard,69,80,94,San Antonio,False,81.0


In [34]:
(df.math_honors) == True & (df.english > 90) # careful: & binds tighter than ==, so this compares math_honors with (english > 90)

0      True
1      True
2     False
3      True
4     False
5      True
6      True
7     False
8     False
9      True
10     True
11    False
dtype: bool
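The result of the cell above is surprising because `&` binds more tightly than `==` in Python, so the expression is evaluated as `df.math_honors == (True & (df.english > 90))`. A minimal sketch of the difference, using a hypothetical four-row `toy` dataframe:

```python
import pandas as pd

# Hypothetical toy frame with the two columns involved.
toy = pd.DataFrame({'math_honors': [False, True, True, False],
                    'english': [85, 96, 74, 92]})

# Parsed as: toy.math_honors == (True & (toy.english > 90))
unintended = (toy.math_honors) == True & (toy.english > 90)

# Parenthesize each comparison before combining with &
intended = (toy.math_honors == True) & (toy.english > 90)

print(unintended.tolist())  # [True, True, False, False]
print(intended.tolist())    # [False, True, False, False]
```

Wrapping every comparison in its own parentheses before combining with `&` or `|` avoids the problem.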

In [35]:
df[(df.math_honors == True) & (df.english > 90)]

Unnamed: 0,name,math,english,reading,location,math_honors,overall_average
3,Billy,98,96,88,San Antonio,True,94.0
10,Isaac,92,99,93,San Antonio,True,94.666667


In [36]:
df[(df.math_honors) & (df.english > 90)] # math_honors is already boolean, so == True is not needed


Unnamed: 0,name,math,english,reading,location,math_honors,overall_average
3,Billy,98,96,88,San Antonio,True,94.0
10,Isaac,92,99,93,San Antonio,True,94.666667


In [37]:
df.loc[0] # select the row labeled 0 with .loc (covered in more detail later)

name                     Sally
math                        62
english                     85
reading                     80
location           San Antonio
math_honors              False
overall_average      75.666667
Name: 0, dtype: object

## Dropping and Renaming Columns

We can drop columns with the `.drop` method, and, similarly, rename them with `.rename`.

For both methods (and many other methods within pandas), the original dataframe **will not be changed**. Instead, the methods will produce a new dataframe. This is similar to the behavior we have seen with, for example, string methods. The exception is that many pandas methods accept an optional `inplace` keyword argument (defaults to `False`) that controls whether the original dataframe is mutated instead.

Let's take a look at a couple of examples of `.drop` and `.rename`:

In [38]:
df.head()

Unnamed: 0,name,math,english,reading,location,math_honors,overall_average
0,Sally,62,85,80,San Antonio,False,75.666667
1,Jane,88,79,67,San Antonio,False,78.0
2,Suzie,94,74,95,San Antonio,True,87.666667
3,Billy,98,96,88,San Antonio,True,94.0
4,Ada,77,92,98,San Antonio,False,89.0


In [39]:
# drop math and english
df.drop(columns = ['math','english'])

Unnamed: 0,name,reading,location,math_honors,overall_average
0,Sally,80,San Antonio,False,75.666667
1,Jane,67,San Antonio,False,78.0
2,Suzie,95,San Antonio,True,87.666667
3,Billy,88,San Antonio,True,94.0
4,Ada,98,San Antonio,False,89.0
5,John,93,San Antonio,False,82.666667
6,Thomas,81,San Antonio,False,75.666667
7,Marie,90,San Antonio,True,82.0
8,Albert,87,San Antonio,True,80.333333
9,Richard,94,San Antonio,False,81.0


In [40]:
df.head()

Unnamed: 0,name,math,english,reading,location,math_honors,overall_average
0,Sally,62,85,80,San Antonio,False,75.666667
1,Jane,88,79,67,San Antonio,False,78.0
2,Suzie,94,74,95,San Antonio,True,87.666667
3,Billy,98,96,88,San Antonio,True,94.0
4,Ada,77,92,98,San Antonio,False,89.0


In [41]:
# rename 'name' column to 'student'
df.rename(columns = {'name': 'student'}) # to rename several columns, add more 'old': 'new' pairs separated by commas


Unnamed: 0,student,math,english,reading,location,math_honors,overall_average
0,Sally,62,85,80,San Antonio,False,75.666667
1,Jane,88,79,67,San Antonio,False,78.0
2,Suzie,94,74,95,San Antonio,True,87.666667
3,Billy,98,96,88,San Antonio,True,94.0
4,Ada,77,92,98,San Antonio,False,89.0
5,John,79,76,93,San Antonio,False,82.666667
6,Thomas,82,64,81,San Antonio,False,75.666667
7,Marie,93,63,90,San Antonio,True,82.0
8,Albert,92,62,87,San Antonio,True,80.333333
9,Richard,69,80,94,San Antonio,False,81.0


We will use the `columns` keyword argument with both `.drop` and `.rename`. We'll pass a list of column names we want to remove to `.drop`, and a dictionary of columns to rename to `.rename`. Within the passed dictionary, the keys will be the old column names, and the values are the new column names.

Notice that, after both of these operations, the original variable is unchanged.

In [42]:
df

Unnamed: 0,name,math,english,reading,location,math_honors,overall_average
0,Sally,62,85,80,San Antonio,False,75.666667
1,Jane,88,79,67,San Antonio,False,78.0
2,Suzie,94,74,95,San Antonio,True,87.666667
3,Billy,98,96,88,San Antonio,True,94.0
4,Ada,77,92,98,San Antonio,False,89.0
5,John,79,76,93,San Antonio,False,82.666667
6,Thomas,82,64,81,San Antonio,False,75.666667
7,Marie,93,63,90,San Antonio,True,82.0
8,Albert,92,62,87,San Antonio,True,80.333333
9,Richard,69,80,94,San Antonio,False,81.0
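To keep the result of `.drop` or `.rename`, a common pattern is to assign the returned dataframe to a variable. The sketch below uses a hypothetical `toy` dataframe and made-up new column names to illustrate this, along with renaming several columns in one call:

```python
import pandas as pd

# Hypothetical toy frame; the new column names below are illustrative only.
toy = pd.DataFrame({'name': ['Sally', 'Jane'],
                    'math': [62, 88],
                    'english': [85, 79]})

# The return value is a new dataframe; `toy` itself is untouched.
trimmed = toy.drop(columns=['english'])
print(toy.columns.tolist())      # ['name', 'math', 'english']
print(trimmed.columns.tolist())  # ['name', 'math']

# Reassigning keeps the change; several columns can be renamed at once.
toy = toy.rename(columns={'name': 'student', 'math': 'math_grade'})
print(toy.columns.tolist())      # ['student', 'math_grade', 'english']
```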


Because these methods each return a dataframe, we can *chain* them together:

In [43]:
df.drop(columns=['english', 'reading']).rename(columns={'name': 'student'})

Unnamed: 0,student,math,location,math_honors,overall_average
0,Sally,62,San Antonio,False,75.666667
1,Jane,88,San Antonio,False,78.0
2,Suzie,94,San Antonio,True,87.666667
3,Billy,98,San Antonio,True,94.0
4,Ada,77,San Antonio,False,89.0
5,John,79,San Antonio,False,82.666667
6,Thomas,82,San Antonio,False,75.666667
7,Marie,93,San Antonio,True,82.0
8,Albert,92,San Antonio,True,80.333333
9,Richard,69,San Antonio,False,81.0


## Sorting Dataframes

We can use the `.sort_values` method to sort a dataframe by any given criteria. For example, we can sort by the english grade:

In [44]:
df.sort_values(by='english')

Unnamed: 0,name,math,english,reading,location,math_honors,overall_average
8,Albert,92,62,87,San Antonio,True,80.333333
11,Alan,92,62,72,San Antonio,True,75.333333
7,Marie,93,63,90,San Antonio,True,82.0
6,Thomas,82,64,81,San Antonio,False,75.666667
2,Suzie,94,74,95,San Antonio,True,87.666667
5,John,79,76,93,San Antonio,False,82.666667
1,Jane,88,79,67,San Antonio,False,78.0
9,Richard,69,80,94,San Antonio,False,81.0
0,Sally,62,85,80,San Antonio,False,75.666667
4,Ada,77,92,98,San Antonio,False,89.0


We can sort in descending order by providing the `ascending=False` keyword argument:

In [45]:
df.sort_values(by='english', ascending=False)

Unnamed: 0,name,math,english,reading,location,math_honors,overall_average
10,Isaac,92,99,93,San Antonio,True,94.666667
3,Billy,98,96,88,San Antonio,True,94.0
4,Ada,77,92,98,San Antonio,False,89.0
0,Sally,62,85,80,San Antonio,False,75.666667
9,Richard,69,80,94,San Antonio,False,81.0
1,Jane,88,79,67,San Antonio,False,78.0
5,John,79,76,93,San Antonio,False,82.666667
2,Suzie,94,74,95,San Antonio,True,87.666667
6,Thomas,82,64,81,San Antonio,False,75.666667
7,Marie,93,63,90,San Antonio,True,82.0
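`.sort_values` also accepts a list of column names, and `ascending` can then be a list with one flag per column. A short sketch on a hypothetical `toy` dataframe, where a grade above 70 is assumed to be passing:

```python
import pandas as pd

# Hypothetical toy frame; the passing threshold of 70 is an assumption.
toy = pd.DataFrame({'name': ['Marie', 'Alan', 'Ada', 'Billy'],
                    'english': [63, 62, 92, 96]})
toy['passing_english'] = toy.english > 70

# Failing students first (False sorts before True),
# then the highest english grade first within each group.
ordered = toy.sort_values(by=['passing_english', 'english'],
                          ascending=[True, False])
print(ordered['name'].tolist())  # ['Marie', 'Alan', 'Billy', 'Ada']
```

Sorting once with a list is more reliable than chaining two separate sorts, since the second sort is not guaranteed to preserve the order produced by the first.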


## Chaining Dataframe Methods

Because most dataframe methods return another dataframe, it is common to see them *chained* together.

For example, we could use method chaining to find the name of the student with the *lowest* english grade above a 90.

In [46]:
df[df.english > 90].sort_values(by='english').head(1)

Unnamed: 0,name,math,english,reading,location,math_honors,overall_average
4,Ada,77,92,98,San Antonio,False,89.0


In [47]:
df[df.english > 90].sort_values(by='english').head(1).name

4    Ada
Name: name, dtype: object

Let's break down the above expression piece by piece:

1. `df`: our initial variable that holds our dataframe
1. `[df.english > 90]`: here we subset the dataframe to find just the rows where the english grade is greater than 90
1. `.sort_values(by='english')`: now we take the remaining rows and sort them by the english grade
1. `.head(1)`: take just the first record. Because we sorted previously, this will give us the student with the lowest english grade among those above 90
1. `.name`: extract just the `name` part of the record
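The steps above can also be written with one intermediate variable per step, which can make a long chain easier to debug. This is a sketch on a hypothetical `toy` dataframe; the variable names are illustrative:

```python
import pandas as pd

# Hypothetical toy frame with a few english grades.
toy = pd.DataFrame({'name': ['Sally', 'Billy', 'Ada', 'Isaac'],
                    'english': [85, 96, 92, 99]})

above_90 = toy[toy.english > 90]               # filter the rows
by_grade = above_90.sort_values(by='english')  # sort ascending by grade
lowest = by_grade.head(1)                      # keep only the first row
name = lowest['name']                          # a one-element Series

# The chained form produces the same result.
chained = toy[toy.english > 90].sort_values(by='english').head(1)['name']
print(name.equals(chained))  # True
print(name.iloc[0])          # Ada
```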

In [48]:
df[df.english > 90].math.mean()

89.0

## Further Reading

- [pandas documentation: `DataFrame`s](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe)

## Exercises

Do your work for this exercise in a Python script or a Jupyter notebook with the name `dataframes.py` or `dataframes.ipynb`.

For several of the following exercises, you'll need to load several datasets using the `pydataset` library. (If you get an error when trying to run the import below, use `pip` to install the `pydataset` package.)

In [49]:
from pydataset import data

When the instructions say to load a dataset, you can pass the name of the dataset as a string to the `data` function to load the dataset. You can also view the documentation for the data set by passing the `show_doc` keyword argument.

In [50]:
# data('mpg', show_doc=True) # view the documentation for the dataset
mpg = data('mpg') # load the dataset and store it in a variable

All the datasets loaded from the `pydataset` library will be pandas dataframes.

1. Copy the code from the lesson to create a dataframe full of student grades.

    1. Create a column named `passing_english` that indicates whether each student has a passing grade in english.
    1. Sort the english grades by the `passing_english` column. How are duplicates handled?
    1. Sort the english grades first by `passing_english` and then by student name. All the students that are failing english should be first, and within the students that are failing english they should be ordered alphabetically. The same should be true for the students passing english. (Hint: you can pass a list to the `.sort_values` method)
    1. Sort the english grades first by `passing_english`, and then by the actual english grade, similar to how we did in the last step.
    1. Calculate each student's overall grade and add it as a column on the dataframe. The overall grade is the average of the math, english, and reading grades.

In [51]:
np.random.seed(123) # seed the random number generator. Without this line,
# every time you run the cell you'll get different random numbers. With it,
# you get the same numbers on every run, and the same numbers as other
# notebooks that use the same seed.

students = ['Sally', 'Jane', 'Suzie', 'Billy', 'Ada', 'John', 'Thomas',
            'Marie', 'Albert', 'Richard', 'Isaac', 'Alan']

# randomly generate scores for each student for each subject
# note that all the values need to have the same length here
math_grades = np.random.randint(low=60, high=100, size=len(students))
english_grades = np.random.randint(low=60, high=100, size=len(students))
reading_grades = np.random.randint(low=60, high=100, size=len(students))

df = pd.DataFrame({'name': students,
                   'math': math_grades,
                   'english': english_grades,
                   'reading': reading_grades})

type(df)

pandas.core.frame.DataFrame

In [52]:
df['passing_english'] = df.english > 70 #1A: assumes a passing grade is anything above 70

In [53]:
df.head()

Unnamed: 0,name,math,english,reading,passing_english
0,Sally,62,85,80,True
1,Jane,88,79,67,True
2,Suzie,94,74,95,True
3,Billy,98,96,88,True
4,Ada,77,92,98,True


In [54]:
df.sort_values('passing_english') #1B

Unnamed: 0,name,math,english,reading,passing_english
6,Thomas,82,64,81,False
7,Marie,93,63,90,False
8,Albert,92,62,87,False
11,Alan,92,62,72,False
0,Sally,62,85,80,True
1,Jane,88,79,67,True
2,Suzie,94,74,95,True
3,Billy,98,96,88,True
4,Ada,77,92,98,True
5,John,79,76,93,True


In [55]:
df.sort_values(['passing_english', 'name']) #1C: pass a list to sort by several columns at once

Unnamed: 0,name,math,english,reading,passing_english
11,Alan,92,62,72,False
8,Albert,92,62,87,False
7,Marie,93,63,90,False
6,Thomas,82,64,81,False
4,Ada,77,92,98,True
3,Billy,98,96,88,True
10,Isaac,92,99,93,True
1,Jane,88,79,67,True
5,John,79,76,93,True
9,Richard,69,80,94,True


In [56]:
df.sort_values(['passing_english', 'english'], ascending=[True, False]) #1D: english grade descending within each group

Unnamed: 0,name,math,english,reading,passing_english
6,Thomas,82,64,81,False
7,Marie,93,63,90,False
8,Albert,92,62,87,False
11,Alan,92,62,72,False
10,Isaac,92,99,93,True
3,Billy,98,96,88,True
4,Ada,77,92,98,True
0,Sally,62,85,80,True
9,Richard,69,80,94,True
1,Jane,88,79,67,True


In [57]:
df['overall'] = (df.math+df.english+df.reading)/3

In [58]:
df.head()

Unnamed: 0,name,math,english,reading,passing_english,overall
0,Sally,62,85,80,True,75.666667
1,Jane,88,79,67,True,78.0
2,Suzie,94,74,95,True,87.666667
3,Billy,98,96,88,True,94.0
4,Ada,77,92,98,True,89.0


In [59]:
# data('mpg', show_doc=True) # view the documentation for the dataset
mpg = data('mpg') # load the dataset and store it in a variable

2.  Load the `mpg` dataset. Read the documentation for the dataset and use it for the following questions:

    - a How many rows and columns are there?
    - b What are the data types of each column?
    - c Summarize the dataframe with `.info` and `.describe`
    - d Rename the `cty` column to `city`.
    - e Rename the `hwy` column to `highway`.
    - f Do any cars have better city mileage than highway mileage?
    - g Create a column named `mileage_difference` that contains the difference between highway and city mileage for each car.
    - h Which car (or cars) has the highest mileage difference?
    - i Which compact class car has the lowest highway mileage? The best?
    - j Create a column named `average_mileage` that is the mean of the city and highway mileage.
    - k Which dodge car has the best average mileage? The worst?

In [60]:
mpg.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact


In [61]:
mpg.shape #2a

(234, 11)

In [62]:
mpg.dtypes #2b

manufacturer     object
model            object
displ           float64
year              int64
cyl               int64
trans            object
drv              object
cty               int64
hwy               int64
fl               object
class            object
dtype: object

In [63]:
mpg.info() #2c: .info is a method, so it must be called with parentheses

In [64]:
mpg.describe() #2c: .describe is a method, so it must be called with parentheses

In [65]:
mpg = mpg.rename(columns = {'cty' : 'city'}) #2d

In [66]:
mpg = mpg.rename(columns = {'hwy' : 'highway'}) #2e

In [67]:
mpg.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact


In [68]:
mpg[mpg.city > mpg.highway] #2f: no rows match, so no car has better city mileage than highway mileage

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class


In [69]:
mpg.city > mpg.highway

1      False
2      False
3      False
4      False
5      False
       ...  
230    False
231    False
232    False
233    False
234    False
Length: 234, dtype: bool

In [70]:
mpg['mileage_difference'] = (mpg.highway - mpg.city) #2g

In [71]:
mpg.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_difference
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact,11
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact,8
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact,11
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact,9
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact,10


In [82]:
mpg.sort_values('mileage_difference', ascending = False) #2h


Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_difference
107,honda,civic,1.8,2008,4,auto(l5),f,24,36,c,subcompact,12
223,volkswagen,new beetle,1.9,1999,4,auto(l4),f,29,41,d,subcompact,12
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact,11
229,volkswagen,passat,1.8,1999,4,auto(l5),f,18,29,p,midsize,11
36,chevrolet,malibu,3.5,2008,6,auto(l4),f,18,29,r,midsize,11
...,...,...,...,...,...,...,...,...,...,...,...,...
80,ford,explorer 4wd,4.0,1999,6,auto(l5),4,14,17,r,suv,3
138,mercury,mountaineer 4wd,4.0,1999,6,auto(l5),4,14,17,r,suv,3
177,toyota,4runner 4wd,3.4,1999,6,manual(m5),4,15,17,r,suv,2
152,nissan,pathfinder 4wd,3.3,1999,6,manual(m5),4,15,17,r,suv,2


In [84]:
mpg.sort_values('mileage_difference', ascending = False).nlargest(5, 'mileage_difference', keep = 'all')
#2h, better: keep='all' includes every row tied with the fifth-largest value

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_difference
107,honda,civic,1.8,2008,4,auto(l5),f,24,36,c,subcompact,12
223,volkswagen,new beetle,1.9,1999,4,auto(l4),f,29,41,d,subcompact,12
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact,11
229,volkswagen,passat,1.8,1999,4,auto(l5),f,18,29,p,midsize,11
36,chevrolet,malibu,3.5,2008,6,auto(l4),f,18,29,r,midsize,11
213,volkswagen,jetta,1.9,1999,4,manual(m5),f,33,44,d,compact,11
106,honda,civic,1.8,2008,4,auto(l5),f,25,36,r,subcompact,11
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact,11


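In [106]:
mpg['class'] in ('compact') # fails: `in` is not elementwise, and ('compact') is just a string, not a tuple

TypeError: 'in <string>' requires string as left operand, not Series

In [102]:
compact = mpg[mpg['class'] == 'compact'] # use == to compare; a single = would overwrite the whole column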
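Filtering on the `class` column needs an elementwise comparison rather than Python's `in` operator. The sketch below uses a small hypothetical `cars` dataframe shaped like the `mpg` data (it does not load `pydataset`) to show `==` for a single value and `.isin` for several:

```python
import pandas as pd

# Hypothetical rows shaped like the mpg data, for illustration only.
cars = pd.DataFrame({'model': ['a4', 'civic', 'passat', 'jetta'],
                     'class': ['compact', 'subcompact', 'midsize', 'compact'],
                     'highway': [29, 36, 29, 44]})

# `class` is a Python keyword, so cars.class is a SyntaxError; use brackets.
compact = cars[cars['class'] == 'compact']

# .isin tests membership in a list of values, one row at a time.
small = cars[cars['class'].isin(['compact', 'subcompact'])]

print(compact['model'].tolist())  # ['a4', 'jetta']
print(small['model'].tolist())    # ['a4', 'civic', 'jetta']

# Lowest highway mileage among the compact rows.
print(compact.sort_values(by='highway').head(1)['model'].iloc[0])  # a4
```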

3. Load the `Mammals` dataset. Read the documentation for it, and use the data to answer these questions:

    - How many rows and columns are there?
    - What are the data types?
    - Summarize the dataframe with `.info` and `.describe`
    - What is the weight of the fastest animal?
    - What is the overall percentage of specials?
    - How many animals are hoppers that are above the median speed? What percentage is this?
    
    
### ** Awesome Bonus **
For much more practice with pandas, go to `https://github.com/guipsamora/pandas_exercises` and clone the repo down to your laptop. To clone a repository:
- Copy the SSH address of the repository
- `cd ~/codeup-data-science`
- Then type `git clone git@github.com:guipsamora/pandas_exercises.git`
- Now do `cd pandas_exercises` on your terminal.
- Type `git remote remove origin`, so you won't accidentally try to push your work to guipsamora's repo.

Congratulations! You have cloned guipsamora's pandas exercises to your computer. Now you need to make a new, blank, repository on GitHub.

- Go to `https://github.com/new` to make a new repo. Name it `pandas_exercises`.
- DO NOT check any check boxes. We need a blank, empty repo.
- Finally, follow the directions to "push an existing repository from the command line" so that you can push up your changes to your own account. 
- Now do your own work, add it, commit it, and push it!