In [1]:
import pandas as pd
import numpy as np
import itertools as it
import json
import re
from collections import Counter
from pydataset import data

In [2]:
pd.set_option('display.max_columns', None)

## Exercises

Do your work for this exercise in a Python script or a Jupyter notebook named `dataframes.py` or `dataframes.ipynb`.

Several of the following exercises require datasets from the `pydataset` library. (If the import below raises an error, install the `pydataset` package with `pip`.)

`from pydataset import data`

When the instructions say to load a dataset, pass the name of the dataset as a string to the `data` function. To view the documentation for a dataset instead, also pass the keyword argument `show_doc=True`.

In [None]:
# data('mpg', show_doc=True) # view the documentation for the dataset
mpg = data('mpg') # load the dataset and store it in a variable

All the datasets loaded from the `pydataset` library will be pandas dataframes.

1. Copy the code from the lesson to create a dataframe full of student grades.
    1. Create a column named `passing_english` that indicates whether each student has a passing grade in english.
    2. Sort the english grades by the `passing_english` column. How are duplicates handled?
    3. Sort the english grades first by `passing_english` and then by student name. All the students that are failing english should be first, and within the students that are failing english they should be ordered alphabetically. The same should be true for the students passing english. (Hint: you can pass a list to the `.sort_values` method)
    4. Sort the english grades first by `passing_english`, and then by the actual english grade, similar to how we did in the last step.
    5. Calculate each student's overall grade and add it as a column on the dataframe. The overall grade is the average of the math, english, and reading grades.
2. Load the `mpg` dataset. Read the documentation for the dataset and use it for the following questions:
    - How many rows and columns are there?
    - What are the data types of each column?
    - Summarize the dataframe with `.info` and `.describe`
    - Rename the `cty` column to `city`.
    - Rename the `hwy` column to `highway`.
    - Do any cars have better city mileage than highway mileage?
    - Create a column named `mileage_difference`. This column should contain the difference between highway and city mileage for each car.
    - Which car (or cars) has the highest mileage difference?
    - Which compact class car has the lowest highway mileage? The best?
    - Create a column named `average_mileage` that is the mean of the city and highway mileage.
    - Which dodge car has the best average mileage? The worst?
3. Load the `Mammals` dataset. Read the documentation for it, and use the data to answer these questions:
    - How many rows and columns are there?
    - What are the data types?
    - Summarize the dataframe with `.info` and `.describe`
    - What is the weight of the fastest animal?
    - What is the overall percentage of specials?
    - How many animals are hoppers that are above the median speed? What percentage is this?

### **Awesome Bonus**

For much more practice with pandas, go to `https://github.com/guipsamora/pandas_exercises` and clone the repo down to your laptop. To clone a repository:

- Copy the SSH address of the repository
- Run `cd ~/codeup-data-science` in the terminal
- Run `git clone git@github.com:guipsamora/pandas_exercises.git`
- Run `cd pandas_exercises`
- Run `git remote remove origin` (so you won't accidentally try to push your work to guipsamora's repo)

Congratulations! You have cloned guipsamora's pandas exercises to your computer. Now you need to make a new, blank, repository on GitHub.

- Go to `https://github.com/new` to make a new repo. Name it `pandas_exercises`.
- DO NOT check any check boxes. We need a blank, empty repo.
- Finally, follow the directions to "push an existing repository from the command line" so that you can push up your changes to your own account.
- Now do your own work, add it, commit it, and push it!

<div class="alert alert-info">
  <h1><strong>STUDENTS GRADES</strong></h1>
</div>

In [3]:
np.random.seed(123)

students = ['Sally', 'Jane', 'Suzie', 'Billy', 'Ada', 'John', 'Thomas',
            'Marie', 'Albert', 'Richard', 'Isaac', 'Alan']

# randomly generate scores for each student for each subject
# note that all the values need to have the same length here
math_grades = np.random.randint(low=60, high=100, size=len(students))
english_grades = np.random.randint(low=60, high=100, size=len(students))
reading_grades = np.random.randint(low=60, high=100, size=len(students))

df = pd.DataFrame({'name': students,
                   'math': math_grades,
                   'english': english_grades,
                   'reading': reading_grades})

In [4]:
df

Unnamed: 0,name,math,english,reading
0,Sally,62,85,80
1,Jane,88,79,67
2,Suzie,94,74,95
3,Billy,98,96,88
4,Ada,77,92,98
5,John,79,76,93
6,Thomas,82,64,81
7,Marie,93,63,90
8,Albert,92,62,87
9,Richard,69,80,94


## Create a column named pass_english

In [5]:
df['pass_english'] = df['english'] >= 70

In [6]:
df

Unnamed: 0,name,math,english,reading,pass_english
0,Sally,62,85,80,True
1,Jane,88,79,67,True
2,Suzie,94,74,95,True
3,Billy,98,96,88,True
4,Ada,77,92,98,True
5,John,79,76,93,True
6,Thomas,82,64,81,False
7,Marie,93,63,90,False
8,Albert,92,62,87,False
9,Richard,69,80,94,True


## Sort df by pass_english (tied rows appear in index order here, though the default sort does not guarantee it)

In [7]:
df.sort_values(by='pass_english')

Unnamed: 0,name,math,english,reading,pass_english
6,Thomas,82,64,81,False
7,Marie,93,63,90,False
8,Albert,92,62,87,False
11,Alan,92,62,72,False
0,Sally,62,85,80,True
1,Jane,88,79,67,True
2,Suzie,94,74,95,True
3,Billy,98,96,88,True
4,Ada,77,92,98,True
5,John,79,76,93,True
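
The index order among tied rows in the output above is not something `sort_values` promises: its default `kind` is `'quicksort'`, which is not a stable algorithm. A minimal sketch, on a small made-up frame (the names and values are illustrative, not the lesson data), of requesting a stable sort so that tied rows keep their original relative order:

```python
import pandas as pd

# Illustrative data only; not the grades generated in the lesson.
demo = pd.DataFrame({
    'name': ['Sally', 'Thomas', 'Jane', 'Marie'],
    'pass_english': [True, False, True, False],
})

# 'mergesort' is a stable algorithm: rows with equal keys keep
# the relative order they had before sorting.
stable_sorted = demo.sort_values(by='pass_english', kind='mergesort')
stable_index = stable_sorted.index.tolist()
print(stable_index)
```

When the tie order matters, sorting by a second column (as in the next step) is usually the clearer option.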


## Sort df by pass_english and then by student_name

In [8]:
df.sort_values(by=['pass_english','name'])

Unnamed: 0,name,math,english,reading,pass_english
11,Alan,92,62,72,False
8,Albert,92,62,87,False
7,Marie,93,63,90,False
6,Thomas,82,64,81,False
4,Ada,77,92,98,True
3,Billy,98,96,88,True
10,Isaac,92,99,93,True
1,Jane,88,79,67,True
5,John,79,76,93,True
9,Richard,69,80,94,True


## Sort df by pass_english and then by english_grade

In [9]:
df.sort_values(by=['pass_english','english'])

Unnamed: 0,name,math,english,reading,pass_english
8,Albert,92,62,87,False
11,Alan,92,62,72,False
7,Marie,93,63,90,False
6,Thomas,82,64,81,False
2,Suzie,94,74,95,True
5,John,79,76,93,True
1,Jane,88,79,67,True
9,Richard,69,80,94,True
0,Sally,62,85,80,True
4,Ada,77,92,98,True


## Calculate overall grade (avg of all grades) for each student and append it as a column.

In [10]:
columns = ['math','english','reading']

In [11]:
df[columns].loc[0].mean().round(1)

75.7

In [12]:
[df[columns].loc[i].mean().round(1) for i in range(0,len(df))]

[75.7, 78.0, 87.7, 94.0, 89.0, 82.7, 75.7, 82.0, 80.3, 81.0, 94.7, 75.3]

In [13]:
df['overall_grade'] = df[columns].mean(axis=1).round(1)  # row-wise mean, computed without a Python loop

In [14]:
df

Unnamed: 0,name,math,english,reading,pass_english,overall_grade
0,Sally,62,85,80,True,75.7
1,Jane,88,79,67,True,78.0
2,Suzie,94,74,95,True,87.7
3,Billy,98,96,88,True,94.0
4,Ada,77,92,98,True,89.0
5,John,79,76,93,True,82.7
6,Thomas,82,64,81,False,75.7
7,Marie,93,63,90,False,82.0
8,Albert,92,62,87,False,80.3
9,Richard,69,80,94,True,81.0


<div class="alert alert-info">
  <h1><strong>mpg Dataset</strong></h1>
</div>

In [15]:
df = data('mpg')

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 234 entries, 1 to 234
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   manufacturer  234 non-null    object 
 1   model         234 non-null    object 
 2   displ         234 non-null    float64
 3   year          234 non-null    int64  
 4   cyl           234 non-null    int64  
 5   trans         234 non-null    object 
 6   drv           234 non-null    object 
 7   cty           234 non-null    int64  
 8   hwy           234 non-null    int64  
 9   fl            234 non-null    object 
 10  class         234 non-null    object 
dtypes: float64(1), int64(4), object(6)
memory usage: 21.9+ KB


In [17]:
df

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact
...,...,...,...,...,...,...,...,...,...,...,...
230,volkswagen,passat,2.0,2008,4,auto(s6),f,19,28,p,midsize
231,volkswagen,passat,2.0,2008,4,manual(m6),f,21,29,p,midsize
232,volkswagen,passat,2.8,1999,6,auto(l5),f,16,26,p,midsize
233,volkswagen,passat,2.8,1999,6,manual(m5),f,18,26,p,midsize


## How many Rows and Columns?

In [18]:
df.shape
print(f'{df.shape[0]} ROWS \n {df.shape[1]} COLS')

234 ROWS 
 11 COLS


## What are the dtypes for each col?

In [19]:
pd.DataFrame(df.dtypes)

Unnamed: 0,0
manufacturer,object
model,object
displ,float64
year,int64
cyl,int64
trans,object
drv,object
cty,int64
hwy,int64
fl,object


## Summarize with .info and .describe

In [20]:
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 234 entries, 1 to 234
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   manufacturer  234 non-null    object 
 1   model         234 non-null    object 
 2   displ         234 non-null    float64
 3   year          234 non-null    int64  
 4   cyl           234 non-null    int64  
 5   trans         234 non-null    object 
 6   drv           234 non-null    object 
 7   cty           234 non-null    int64  
 8   hwy           234 non-null    int64  
 9   fl            234 non-null    object 
 10  class         234 non-null    object 
dtypes: float64(1), int64(4), object(6)
memory usage: 21.9+ KB


Unnamed: 0,displ,year,cyl,cty,hwy
count,234.0,234.0,234.0,234.0,234.0
mean,3.471795,2003.5,5.888889,16.858974,23.440171
std,1.291959,4.509646,1.611534,4.255946,5.954643
min,1.6,1999.0,4.0,9.0,12.0
25%,2.4,1999.0,4.0,14.0,18.0
50%,3.3,2003.5,6.0,17.0,24.0
75%,4.6,2008.0,8.0,19.0,27.0
max,7.0,2008.0,8.0,35.0,44.0


## Rename 'cty' to 'city' (NB: the index is reset to 0-based first; renaming itself does not require it)

In [21]:
df

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact
...,...,...,...,...,...,...,...,...,...,...,...
230,volkswagen,passat,2.0,2008,4,auto(s6),f,19,28,p,midsize
231,volkswagen,passat,2.0,2008,4,manual(m6),f,21,29,p,midsize
232,volkswagen,passat,2.8,1999,6,auto(l5),f,16,26,p,midsize
233,volkswagen,passat,2.8,1999,6,manual(m5),f,18,26,p,midsize


In [22]:
df.reset_index(inplace=True)

In [23]:
df

Unnamed: 0,index,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
0,1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
1,2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
2,3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
3,4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
4,5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact
...,...,...,...,...,...,...,...,...,...,...,...,...
229,230,volkswagen,passat,2.0,2008,4,auto(s6),f,19,28,p,midsize
230,231,volkswagen,passat,2.0,2008,4,manual(m6),f,21,29,p,midsize
231,232,volkswagen,passat,2.8,1999,6,auto(l5),f,16,26,p,midsize
232,233,volkswagen,passat,2.8,1999,6,manual(m5),f,18,26,p,midsize


In [24]:
df.drop(columns='index',inplace=True)

In [25]:
df

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
0,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
1,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
2,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
3,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
4,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact
...,...,...,...,...,...,...,...,...,...,...,...
229,volkswagen,passat,2.0,2008,4,auto(s6),f,19,28,p,midsize
230,volkswagen,passat,2.0,2008,4,manual(m6),f,21,29,p,midsize
231,volkswagen,passat,2.8,1999,6,auto(l5),f,16,26,p,midsize
232,volkswagen,passat,2.8,1999,6,manual(m5),f,18,26,p,midsize
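
The reset-then-drop pair of cells above can likely be collapsed into one call: `reset_index(drop=True)` discards the old index rather than inserting it as an `index` column. A small sketch on a made-up two-row frame (the values are illustrative, not loaded from `pydataset`):

```python
import pandas as pd

# Illustrative frame with a 1-based index, similar to what data('mpg') returns.
cars = pd.DataFrame({'cty': [18, 21], 'hwy': [29, 29]}, index=[1, 2])

# drop=True throws the old index away instead of keeping it as a column.
cars = cars.reset_index(drop=True)

new_index = cars.index.tolist()
new_columns = cars.columns.tolist()
print(new_index, new_columns)
```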


In [26]:
df.rename(columns={'cty': 'city'},inplace=True)

## Rename 'hwy' to 'highway'

In [27]:
df.rename(columns={'hwy': 'highway'},inplace=True)
df

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class
0,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
1,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
2,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
3,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
4,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact
...,...,...,...,...,...,...,...,...,...,...,...
229,volkswagen,passat,2.0,2008,4,auto(s6),f,19,28,p,midsize
230,volkswagen,passat,2.0,2008,4,manual(m6),f,21,29,p,midsize
231,volkswagen,passat,2.8,1999,6,auto(l5),f,16,26,p,midsize
232,volkswagen,passat,2.8,1999,6,manual(m5),f,18,26,p,midsize


## Do any cars have greater 'city' than 'highway' mpg?

In [28]:
df[df['city'] > df['highway']]

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class
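
The empty result above means no car has better city than highway mileage. For a yes/no question like this, reducing the boolean mask with `.any()` or `.sum()` gives the answer directly. A sketch on made-up values (not the `mpg` data):

```python
import pandas as pd

# Illustrative values only.
cars = pd.DataFrame({'city': [18, 21, 20], 'highway': [29, 29, 31]})

better_in_city = cars['city'] > cars['highway']   # boolean Series, one entry per row
any_better = bool(better_in_city.any())           # True if at least one row qualifies
count_better = int(better_in_city.sum())          # number of qualifying rows
print(any_better, count_better)
```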


## Create a column 'mileage_diff'

In [29]:
abs(df['highway'] - df['city'])

0      11
1       8
2      11
3       9
4      10
       ..
229     9
230     8
231    10
232     8
233     9
Length: 234, dtype: int64

In [30]:
abs(df['highway'].loc[0] - df['city'].loc[0])

11

In [31]:
df['mileage_diff'] = (df['highway'] - df['city']).abs()  # column arithmetic, computed without a Python loop

In [32]:
df

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_diff
0,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact,11
1,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact,8
2,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact,11
3,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact,9
4,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact,10
...,...,...,...,...,...,...,...,...,...,...,...,...
229,volkswagen,passat,2.0,2008,4,auto(s6),f,19,28,p,midsize,9
230,volkswagen,passat,2.0,2008,4,manual(m6),f,21,29,p,midsize,8
231,volkswagen,passat,2.8,1999,6,auto(l5),f,16,26,p,midsize,10
232,volkswagen,passat,2.8,1999,6,manual(m5),f,18,26,p,midsize,8


## Which Car(s) has the greatest mileage_diff?

In [33]:
df.nlargest(1,columns='mileage_diff',keep='all')

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_diff
106,honda,civic,1.8,2008,4,auto(l5),f,24,36,c,subcompact,12
222,volkswagen,new beetle,1.9,1999,4,auto(l4),f,29,41,d,subcompact,12


## Which COMPACT car has the LOWEST highway mpg?  and the BEST?

In [34]:
df[df['class'] == 'compact']

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_diff
0,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact,11
1,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact,8
2,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact,11
3,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact,9
4,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact,10
5,audi,a4,2.8,1999,6,manual(m5),f,18,26,p,compact,8
6,audi,a4,3.1,2008,6,auto(av),f,18,27,p,compact,9
7,audi,a4 quattro,1.8,1999,4,manual(m5),4,18,26,p,compact,8
8,audi,a4 quattro,1.8,1999,4,auto(l5),4,16,25,p,compact,9
9,audi,a4 quattro,2.0,2008,4,manual(m6),4,20,28,p,compact,8


In [35]:
df[df['class'] == 'compact'].nsmallest(1,columns='highway',keep='all')

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_diff
219,volkswagen,jetta,2.8,1999,6,auto(l4),f,16,23,r,compact,7


In [36]:
df[df['class'] == 'compact'].nlargest(1,columns='highway',keep='all')

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_diff
212,volkswagen,jetta,1.9,1999,4,manual(m5),f,33,44,d,compact,11


## Create a column 'avg_mpg'

In [37]:
columns = ['city','highway']
df['avg_mpg'] = df[columns].mean(axis=1).round(1)  # row-wise mean of city and highway

In [38]:
df

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_diff,avg_mpg
0,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact,11,23.5
1,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact,8,25.0
2,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact,11,25.5
3,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact,9,25.5
4,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact,10,21.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
229,volkswagen,passat,2.0,2008,4,auto(s6),f,19,28,p,midsize,9,23.5
230,volkswagen,passat,2.0,2008,4,manual(m6),f,21,29,p,midsize,8,25.0
231,volkswagen,passat,2.8,1999,6,auto(l5),f,16,26,p,midsize,10,21.0
232,volkswagen,passat,2.8,1999,6,manual(m5),f,18,26,p,midsize,8,22.0


## Which DODGE car has the BEST avg_mpg?  and the LOWEST?

In [39]:
df[df['manufacturer'] == 'dodge']

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_diff,avg_mpg
37,dodge,caravan 2wd,2.4,1999,4,auto(l3),f,18,24,r,minivan,6,21.0
38,dodge,caravan 2wd,3.0,1999,6,auto(l4),f,17,24,r,minivan,7,20.5
39,dodge,caravan 2wd,3.3,1999,6,auto(l4),f,16,22,r,minivan,6,19.0
40,dodge,caravan 2wd,3.3,1999,6,auto(l4),f,16,22,r,minivan,6,19.0
41,dodge,caravan 2wd,3.3,2008,6,auto(l4),f,17,24,r,minivan,7,20.5
42,dodge,caravan 2wd,3.3,2008,6,auto(l4),f,17,24,r,minivan,7,20.5
43,dodge,caravan 2wd,3.3,2008,6,auto(l4),f,11,17,e,minivan,6,14.0
44,dodge,caravan 2wd,3.8,1999,6,auto(l4),f,15,22,r,minivan,7,18.5
45,dodge,caravan 2wd,3.8,1999,6,auto(l4),f,15,21,r,minivan,6,18.0
46,dodge,caravan 2wd,3.8,2008,6,auto(l6),f,16,23,r,minivan,7,19.5


In [40]:
df[df['manufacturer'] == 'dodge'].nlargest(1,columns='avg_mpg',keep='all')

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_diff,avg_mpg
37,dodge,caravan 2wd,2.4,1999,4,auto(l3),f,18,24,r,minivan,6,21.0


In [41]:
df[df['manufacturer'] == 'dodge'].nsmallest(1,columns='avg_mpg',keep='all')

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_diff,avg_mpg
54,dodge,dakota pickup 4wd,4.7,2008,8,auto(l5),4,9,12,e,pickup,3,10.5
59,dodge,durango 4wd,4.7,2008,8,auto(l5),4,9,12,e,suv,3,10.5
65,dodge,ram 1500 pickup 4wd,4.7,2008,8,auto(l5),4,9,12,e,pickup,3,10.5
69,dodge,ram 1500 pickup 4wd,4.7,2008,8,manual(m6),4,9,12,e,pickup,3,10.5


<div class="alert alert-info">
  <h1><strong>MAMMALS!</strong></h1>
</div>

In [42]:
data('Mammals', show_doc=True)  # prints the documentation; nothing useful is returned to assign

Mammals

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

## Garland(1983) Data on Running Speed of Mammals

### Description

Observations on the maximal running speed of mammal species and their body
mass.

### Usage

    data(Mammals)

### Format

A data frame with 107 observations on the following 4 variables.

weight

Body mass in Kg for "typical adult sizes"

speed

Maximal running speed (fastest sprint velocity on record)

hoppers

logical variable indicating animals that ambulate by hopping, e.g. kangaroos

specials

logical variable indicating special animals with "lifestyles in which speed
does not figure as an important factor": Hippopotamus, raccoon (Procyon),
badger (Meles), coati (Nasua), skunk (Mephitis), man (Homo), porcupine
(Erithizon), oppossum (didelphis), and sloth (Bradypus)

### Details

Used by Chappell (1989) and Koenker, Ng and Portnoy (1994) to illustrate the
fitting of piecewise linear curves.

### Source

Garland, T. (

In [43]:
df = data('Mammals')

In [44]:
df

Unnamed: 0,weight,speed,hoppers,specials
1,6000.0,35.0,False,False
2,4000.0,26.0,False,False
3,3000.0,25.0,False,False
4,1400.0,45.0,False,False
5,400.0,70.0,False,False
6,350.0,70.0,False,False
7,300.0,64.0,False,False
8,260.0,70.0,False,False
9,250.0,40.0,False,False
10,3800.0,25.0,False,True


## How many Rows and Columns?

In [45]:
df.shape

(107, 4)

## What are the data types of each column?

In [46]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 107 entries, 1 to 107
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   weight    107 non-null    float64
 1   speed     107 non-null    float64
 2   hoppers   107 non-null    bool   
 3   specials  107 non-null    bool   
dtypes: bool(2), float64(2)
memory usage: 2.7 KB


## Summarize the df w/ INFO and DESCRIBE

In [47]:
df.describe()

Unnamed: 0,weight,speed
count,107.0,107.0
mean,278.688178,46.208411
std,839.608269,26.716778
min,0.016,1.6
25%,1.7,22.5
50%,34.0,48.0
75%,142.5,65.0
max,6000.0,110.0


## What is weight of fastest mammal?

In [48]:
df[df['speed']==df['speed'].max()][['weight']]

Unnamed: 0,weight
53,55.0
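
An alternative to the boolean mask above is `idxmax`, which returns the index label of the largest value so the weight can be looked up with `.loc`. A sketch on a made-up frame (illustrative values, not the `Mammals` data). Note that `idxmax` returns only the first label when several rows tie, whereas the mask keeps all of them:

```python
import pandas as pd

# Illustrative values only; index starts at 1 like the pydataset frames.
mammals = pd.DataFrame(
    {'weight': [6000.0, 55.0, 0.5], 'speed': [35.0, 110.0, 12.0]},
    index=[1, 2, 3],
)

fastest_label = mammals['speed'].idxmax()               # index label of the top speed
fastest_weight = mammals.loc[fastest_label, 'weight']   # weight on that row
print(fastest_label, fastest_weight)
```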


## What is the overall pct of mammals labeled 'special'?

In [49]:
df['specials'].mean().round(3)

0.093

## How many mammals are hoppers with a speed greater than the average?  What pct of the rows does this represent?

In [50]:
df['speed'].mean()

46.20841121495327

In [51]:
df[df['speed'] > df['speed'].mean()].sort_values('speed')

Unnamed: 0,weight,speed,hoppers,specials
38,50.0,47.0,False,False
47,300.0,48.0,False,False
50,135.0,48.0,False,False
23,150.0,48.0,False,False
102,1.5,50.0,True,False
54,45.0,50.0,False,False
29,85.0,55.0,False,False
61,10.0,56.0,False,False
21,250.0,56.0,False,False
48,230.0,56.0,False,False


In [52]:
len(df[df['speed'] > df['speed'].mean()])

57

In [53]:
df['faster'] = df['speed'] > df['speed'].mean()

In [54]:
df

Unnamed: 0,weight,speed,hoppers,specials,faster
1,6000.0,35.0,False,False,False
2,4000.0,26.0,False,False,False
3,3000.0,25.0,False,False,False
4,1400.0,45.0,False,False,False
5,400.0,70.0,False,False,True
6,350.0,70.0,False,False,True
7,300.0,64.0,False,False,True
8,260.0,70.0,False,False,True
9,250.0,40.0,False,False,False
10,3800.0,25.0,False,True,False


In [55]:
df['faster'] = df.faster.astype('int')

In [56]:
df

Unnamed: 0,weight,speed,hoppers,specials,faster
1,6000.0,35.0,False,False,0
2,4000.0,26.0,False,False,0
3,3000.0,25.0,False,False,0
4,1400.0,45.0,False,False,0
5,400.0,70.0,False,False,1
6,350.0,70.0,False,False,1
7,300.0,64.0,False,False,1
8,260.0,70.0,False,False,1
9,250.0,40.0,False,False,0
10,3800.0,25.0,False,True,0


In [57]:
round(df.faster.mean(),3)

0.533
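
The exercise asks about hoppers above the *median* speed, while the cells above compare every mammal against the *mean* and do not filter on `hoppers`. A hedged sketch of the question as worded, on a small made-up frame (the values are illustrative, not the `Mammals` data); the same expressions should carry over to the loaded dataset:

```python
import pandas as pd

# Illustrative data only; column names mirror the Mammals dataset.
mammals = pd.DataFrame({
    'speed':   [10.0, 20.0, 30.0, 40.0, 50.0],
    'hoppers': [False, True, False, True, True],
})

# Combine both conditions with &; the comparison needs its own parentheses.
fast_hoppers = mammals['hoppers'] & (mammals['speed'] > mammals['speed'].median())

n_fast_hoppers = int(fast_hoppers.sum())        # each True counts as 1
pct_fast_hoppers = fast_hoppers.mean() * 100    # share of all rows, as a percentage
print(n_fast_hoppers, round(pct_fast_hoppers, 1))
```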