# DataFrames Exercises

Several of the following exercises use datasets from the pydataset library. (If the import below raises an error, install the pydataset package with pip.)
```
from pydataset import data
```
When the instructions say to load a dataset, pass its name as a string to the data function. To view a dataset's documentation, pass the show_doc keyword argument as well.
```
# data('mpg', show_doc=True) # view the documentation for the dataset
mpg = data('mpg') # load the dataset and store it in a variable
```
All the datasets loaded from the pydataset library will be pandas dataframes.

1. Copy the code from the lesson to create a dataframe full of student grades.

In [290]:
import pandas as pd
import numpy as np
from pydataset import data

np.random.seed(123)

students = ['Sally', 'Jane', 'Suzie', 'Billy', 'Ada', 'John', 'Thomas',
            'Marie', 'Albert', 'Richard', 'Isaac', 'Alan']

# randomly generate scores for each student for each subject
# note that all the values need to have the same length here
math_grades = np.random.randint(low=60, high=100, size=len(students))
english_grades = np.random.randint(low=60, high=100, size=len(students))
reading_grades = np.random.randint(low=60, high=100, size=len(students))

df = pd.DataFrame({'name': students,
                   'math': math_grades,
                   'english': english_grades,
                   'reading': reading_grades})


df


Unnamed: 0,name,math,english,reading
0,Sally,62,85,80
1,Jane,88,79,67
2,Suzie,94,74,95
3,Billy,98,96,88
4,Ada,77,92,98
5,John,79,76,93
6,Thomas,82,64,81
7,Marie,93,63,90
8,Albert,92,62,87
9,Richard,69,80,94


a. Create a column named passing_english that indicates whether each student has a passing grade in english.

In [291]:
df['passing_english'] = df["english"] >= 70
df

Unnamed: 0,name,math,english,reading,passing_english
0,Sally,62,85,80,True
1,Jane,88,79,67,True
2,Suzie,94,74,95,True
3,Billy,98,96,88,True
4,Ada,77,92,98,True
5,John,79,76,93,True
6,Thomas,82,64,81,False
7,Marie,93,63,90,False
8,Albert,92,62,87,False
9,Richard,69,80,94,True


b. Sort the english grades by the passing_english column. How are duplicates handled?

In [292]:
df.sort_values("passing_english")

Unnamed: 0,name,math,english,reading,passing_english
6,Thomas,82,64,81,False
7,Marie,93,63,90,False
8,Albert,92,62,87,False
11,Alan,92,62,72,False
0,Sally,62,85,80,True
1,Jane,88,79,67,True
2,Suzie,94,74,95,True
3,Billy,98,96,88,True
4,Ada,77,92,98,True
5,John,79,76,93,True
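
The question of how duplicates are handled comes down to sort stability. As far as the pandas documentation describes it, `sort_values` defaults to `kind="quicksort"`, which is not guaranteed to keep tied rows in their original order, while `kind="mergesort"` is stable. The sketch below uses a made-up four-row frame (the names are hypothetical, not from the lesson) to show tied rows keeping their original relative order under a stable sort.

```python
import pandas as pd

# Hypothetical mini gradebook; names and flags are made up for illustration.
toy = pd.DataFrame({
    "name": ["Dana", "Eli", "Fay", "Gus"],
    "passing_english": [True, False, True, False],
})

# A stable sort keeps rows that tie on the sort key in the
# relative order they had before sorting.
stable = toy.sort_values("passing_english", kind="mergesort")
print(stable["name"].tolist())
```

With only two distinct values in the sort column, almost every row is a tie, so a second sort key (as in the next exercise) is the more explicit way to control the order.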


c. Sort the english grades first by passing_english and then by student name. All the students that are failing english should be first, and within the students that are failing english they should be ordered alphabetically. The same should be true for the students passing english. (Hint: you can pass a list to the .sort_values method)

In [293]:
df.sort_values(["passing_english", "name"])

Unnamed: 0,name,math,english,reading,passing_english
11,Alan,92,62,72,False
8,Albert,92,62,87,False
7,Marie,93,63,90,False
6,Thomas,82,64,81,False
4,Ada,77,92,98,True
3,Billy,98,96,88,True
10,Isaac,92,99,93,True
1,Jane,88,79,67,True
5,John,79,76,93,True
9,Richard,69,80,94,True


d. Sort the english grades first by passing_english, and then by the actual english grade, similar to how we did in the last step.

In [294]:
df.sort_values(["passing_english", "english"])

Unnamed: 0,name,math,english,reading,passing_english
8,Albert,92,62,87,False
11,Alan,92,62,72,False
7,Marie,93,63,90,False
6,Thomas,82,64,81,False
2,Suzie,94,74,95,True
5,John,79,76,93,True
1,Jane,88,79,67,True
9,Richard,69,80,94,True
0,Sally,62,85,80,True
4,Ada,77,92,98,True


e. Calculate each student's overall grade and add it as a column on the dataframe. The overall grade is the average of the math, english, and reading grades.

In [295]:
# Alternative by hand: (df["math"] + df["english"] + df["reading"]) / 3

df["average_grade"] = df[["math","english","reading"]].mean(axis=1).round(2)
df

Unnamed: 0,name,math,english,reading,passing_english,average_grade
0,Sally,62,85,80,True,75.67
1,Jane,88,79,67,True,78.0
2,Suzie,94,74,95,True,87.67
3,Billy,98,96,88,True,94.0
4,Ada,77,92,98,True,89.0
5,John,79,76,93,True,82.67
6,Thomas,82,64,81,False,75.67
7,Marie,93,63,90,False,82.0
8,Albert,92,62,87,False,80.33
9,Richard,69,80,94,True,81.0
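
To check that the row-wise mean matches the "add the columns and divide" definition of the overall grade, here is a minimal sketch on made-up scores. The frame `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Made-up scores for two hypothetical students.
toy = pd.DataFrame({"math": [90, 70], "english": [80, 75], "reading": [70, 80]})

subjects = ["math", "english", "reading"]

# Row-wise mean across the three subject columns (axis=1 means "across columns").
by_mean = toy[subjects].mean(axis=1)

# The same quantity by hand: add the columns, divide by how many there are.
by_hand = (toy["math"] + toy["english"] + toy["reading"]) / len(subjects)

print(by_mean.tolist())
print(by_hand.tolist())
```

The `mean(axis=1)` form scales better: adding a fourth subject only means extending the list of column names.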


2. Load the mpg dataset. Read the documentation for the dataset and use it for the following questions:

In [296]:
mpg = data("mpg")
mpg.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact


* How many rows and columns are there?

In [363]:
mpg.shape

(234, 11)

* What are the data types of each column?

In [298]:
mpg.dtypes

manufacturer     object
model            object
displ           float64
year              int64
cyl               int64
trans            object
drv              object
cty               int64
hwy               int64
fl               object
class            object
dtype: object

* Summarize the dataframe with .info and .describe

In [299]:
mpg.info()
mpg.describe()
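
Note that `.info` is a method, so it has to be called with parentheses. Without them, a notebook displays the bound-method object rather than the summary. The sketch below shows the difference on a small made-up frame; `toy` and its values are hypothetical, and the `buf` argument is used only to capture the printed summary as text.

```python
import io
import pandas as pd

# Tiny made-up frame standing in for a real dataset.
toy = pd.DataFrame({"displ": [1.8, 2.0], "cyl": [4, 4]})

# Without parentheses this is only a reference to the method, not a summary.
print(callable(toy.info))

# Calling the method produces the summary; buf= redirects it into a string buffer.
buffer = io.StringIO()
toy.info(buf=buffer)
summary = buffer.getvalue()
print(summary)

# .describe() returns a new DataFrame of summary statistics.
stats = toy.describe()
print(stats)
```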

* Rename the cty column to city.

In [300]:
mpg = mpg.rename(columns = {"cty":"city"})
mpg.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,hwy,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact


* Rename the hwy column to highway.

In [301]:
mpg = mpg.rename(columns={"hwy":"highway"})
mpg.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact
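
The two renames can also be done in a single call, since `rename` accepts a dictionary with several old-to-new pairs. A minimal sketch on a one-row made-up frame (the values are placeholders):

```python
import pandas as pd

# One-row stand-in with the original column names.
toy = pd.DataFrame({"cty": [18], "hwy": [29]})

# rename returns a new DataFrame; the original is left unchanged
# unless the result is assigned back.
renamed = toy.rename(columns={"cty": "city", "hwy": "highway"})

print(list(renamed.columns))
print(list(toy.columns))
```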


* Do any cars have better city mileage than highway mileage?

In [366]:
mpg["city"] < mpg["highway"]

1      True
2      True
3      True
4      True
5      True
       ... 
230    True
231    True
232    True
233    True
234    True
Length: 234, dtype: bool
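
A row-by-row comparison gives one boolean per car, and the truncated display cannot confirm every row. To answer the yes/no question directly, the comparison can be reduced with `.any()`. The sketch below uses made-up mileage figures, with one row deliberately set so that city beats highway; on the real data the same two lines would be applied to `mpg`.

```python
import pandas as pd

# Made-up mileage figures; the last row is deliberately city > highway.
toy = pd.DataFrame({"city": [18, 21, 30], "highway": [29, 29, 28]})

# One boolean per row: True where city mileage beats highway mileage.
better_in_city = toy["city"] > toy["highway"]

# .any() is True if at least one row satisfies the condition.
print(better_in_city.any())

# Filtering with the mask shows which rows they are.
print(toy[better_in_city])
```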

* Create a column named mileage_difference. This column should contain the difference between highway and city mileage for each car.

In [303]:
mpg['mileage_difference'] = abs(mpg.city - mpg.highway)
mpg.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_difference
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact,11
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact,8
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact,11
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact,9
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact,10


* Which car (or cars) has the highest mileage difference?

In [304]:
# max_mi_diff = mpg.mileage_difference.max()

# mpg.loc[mpg.mileage_difference == max_mi_diff]

mpg.loc[mpg["mileage_difference"].idxmax()]

manufacturer               honda
model                      civic
displ                        1.8
year                        2008
cyl                            4
trans                   auto(l5)
drv                            f
city                          24
highway                       36
fl                             c
class                 subcompact
mileage_difference            12
Name: 107, dtype: object
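
Because the question asks for the car "or cars", ties matter: `idxmax()` returns only the first label at which the maximum occurs, while comparing against the maximum (the commented-out approach in the cell above) keeps every tied row. A minimal sketch on made-up data with a deliberate tie:

```python
import pandas as pd

# Made-up rows; models "a" and "c" tie for the largest difference.
toy = pd.DataFrame({
    "model": ["a", "b", "c"],
    "mileage_difference": [12, 9, 12],
})

# idxmax() returns a single label: the first row holding the maximum.
first_only = toy.loc[toy["mileage_difference"].idxmax()]

# Comparing against the maximum keeps every tied row.
all_tied = toy[toy["mileage_difference"] == toy["mileage_difference"].max()]

print(first_only["model"])
print(all_tied["model"].tolist())
```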

* Which compact class car has the lowest highway mileage? The best?

In [305]:
mpg.loc[mpg[mpg["class"] == 'compact']["highway"].idxmin()]

manufacturer          volkswagen
model                      jetta
displ                        2.8
year                        1999
cyl                            6
trans                   auto(l4)
drv                            f
city                          16
highway                       23
fl                             r
class                    compact
mileage_difference             7
Name: 220, dtype: object

* Create a column named average_mileage that is the mean of the city and highway mileage.

In [306]:
mpg['average_mileage'] = mpg[["city", "highway"]].mean(axis= 1)
mpg.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_difference,average_mileage
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact,11,23.5
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact,8,25.0
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact,11,25.5
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact,9,25.5
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact,10,21.0


* Which dodge car has the best average mileage? The worst?

In [307]:
dodge = mpg[mpg.manufacturer == "dodge"]
# .loc looks rows up by index label, so use idxmax()/idxmin() rather than max()
mpg.loc[[dodge.average_mileage.idxmax(), dodge.average_mileage.idxmin()]]  # best, then worst
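
A point worth spelling out for this kind of lookup: `.loc` selects rows by index label, so it needs the label returned by `idxmax()` or `idxmin()`, not the value returned by `max()` or `min()`. Passing the value silently selects whichever row happens to carry that number as its label. The sketch below uses hypothetical rows whose index labels are chosen so that the value and the label differ.

```python
import pandas as pd

# Hypothetical rows; index labels are chosen so value and label differ.
toy = pd.DataFrame(
    {"manufacturer": ["dodge", "dodge", "audi"],
     "average_mileage": [2.0, 1.0, 9.0]},
    index=[1, 2, 3],
)
dodge = toy[toy["manufacturer"] == "dodge"]

best_value = dodge["average_mileage"].max()      # 2.0, a mileage value
best_label = dodge["average_mileage"].idxmax()   # 1, the label of that row
worst_label = dodge["average_mileage"].idxmin()  # 2, the label of the worst row

# Look the rows up by label.
best = toy.loc[best_label]
worst = toy.loc[worst_label]
print(best["manufacturer"], best["average_mileage"])
print(worst["manufacturer"], worst["average_mileage"])
```

Here the best value is 2.0, but the row labelled 2 is actually the worst dodge, which is how a value-as-label mix-up can return the wrong row without raising an error.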

3. Load the Mammals dataset. Read the documentation for it, and use the data to answer these questions:

* How many rows and columns are there?

In [311]:
mammals = data("Mammals")
mammals.head()

Unnamed: 0,weight,speed,hoppers,specials
1,6000.0,35.0,False,False
2,4000.0,26.0,False,False
3,3000.0,25.0,False,False
4,1400.0,45.0,False,False
5,400.0,70.0,False,False


* What are the data types?

In [313]:
mammals.dtypes

weight      float64
speed       float64
hoppers        bool
specials       bool
dtype: object

* Summarize the dataframe with .info and .describe

In [316]:
mammals.info()

In [318]:
mammals.describe(include="all")

Unnamed: 0,weight,speed,hoppers,specials
count,107.0,107.0,107,107
unique,,,2,2
top,,,False,False
freq,,,96,97
mean,278.688178,46.208411,,
std,839.608269,26.716778,,
min,0.016,1.6,,
25%,1.7,22.5,,
50%,34.0,48.0,,
75%,142.5,65.0,,


* What is the weight of the fastest animal?

In [362]:
mammals[mammals.speed == mammals.speed.max()].weight

53    55.0
Name: weight, dtype: float64

* What is the overall percentage of specials?

In [355]:
specials = mammals[mammals.specials == True]
specials.shape[0] / mammals.shape[0]

0.09345794392523364
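
The result above is a proportion; multiplying by 100 gives the percentage. An equivalent shortcut is to take the mean of the boolean column, because True counts as 1 and False as 0. A minimal sketch on a made-up column:

```python
import pandas as pd

# Made-up flags: one special out of four animals.
toy = pd.DataFrame({"specials": [True, False, False, False]})

# True counts as 1 and False as 0, so the mean is the share of True rows.
share = toy["specials"].mean()

# Format the proportion as a percentage for display.
print(f"{share:.1%}")
```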

* How many animals are hoppers that are above the median speed? What percentage is this?

In [344]:
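# True for each hopper whose speed exceeds the overall median speed
above_mid_hoppers = mammals[mammals["hoppers"]]["speed"] > mammals.speed.median()

# .sum() counts only the True values (.count() would count every hopper)
n_above = above_mid_hoppers.sum()
n_above, n_above / len(mammals)  # count, and its share of all animals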
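
When counting rows that satisfy a condition, the choice between `.count()` and `.sum()` matters: on a boolean Series, `.count()` returns the number of non-missing entries (True and False alike), while `.sum()` returns the number of True entries. The sketch below illustrates this on made-up speeds; the frame `toy` is an assumption for illustration.

```python
import pandas as pd

# Made-up animals: three hoppers, overall median speed of 30.
toy = pd.DataFrame({
    "speed": [10.0, 20.0, 30.0, 40.0, 50.0],
    "hoppers": [True, False, True, False, True],
})

# Boolean Series over hoppers only: is each one faster than the overall median?
fast_hopper = toy[toy["hoppers"]]["speed"] > toy["speed"].median()

n_hoppers = fast_hopper.count()  # non-missing entries: every hopper
n_fast = fast_hopper.sum()       # True entries only: hoppers above the median

print(n_hoppers, n_fast, n_fast / len(toy))
```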