### Working with pandas dataframes


[Official Documentation](https://pandas.pydata.org/docs/)

[Course page](https://ds.codeup.com/python/dataframes/)


<div class="alert alert-block alert-info">

#### Learning Goals
    
- Understand the structure of a dataframe
- View information about the contents of a dataframe
- Determine attributes of the data within the dataframe
- Manipulate columns
- Sort and filter contents of dataframe
- Create, modify, and drop columns
- View descriptive statistics for data

In [1]:
# first things first:
# import pandas!
import numpy as np
import pandas as pd

### Sources of dataframes

- Created from dictionaries, lists, or arrays
- Imported from Python libraries (e.g. `pydataset`)
- Read from `csv`, `tsv`, `xlsx` files
- SQL databases
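
The first and third sources above can be sketched in a few lines. This is a minimal illustration with made-up values: the names `from_dict`, `from_array`, and `from_csv` are hypothetical, and `io.StringIO` stands in for a real file path.

```python
import io

import numpy as np
import pandas as pd

# From a dictionary of equal-length lists (hypothetical data)
from_dict = pd.DataFrame({'name': ['Ada', 'Alan'], 'score': [95, 88]})

# From a 2-D NumPy array, supplying the column names
from_array = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['a', 'b'])

# From CSV text; io.StringIO stands in for a path to a .csv file
csv_text = "name,score\nAda,95\nAlan,88\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

print(from_dict.shape, from_array.shape, from_csv.shape)
```

With a real file, the same `pd.read_csv` call would take the file path instead of the `StringIO` object.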

In [2]:
# most common acquisition/wrangling steps:
# pd.read_sql()

In [3]:
# pd.read_csv()

---

### Import statements



In [4]:
# Imported libraries
import pandas as pd

# Numpy to build arrays
import numpy as np

# Source of datasets
from pydataset import data

# Seed NumPy's random number generator with an integer.
# The seed fixes the sequence of "random" values,
# so re-running this code reproduces the same results.
np.random.seed(1349)

---

## Creating a dataframe

- Combine multiple series of equal length.
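
One way to see this idea directly is to place two Series side by side with `pd.concat`. This is a small sketch with hypothetical values, not the approach the lesson uses below (which builds the dataframe from a dictionary).

```python
import pandas as pd

# Two Series of equal length; each name becomes a column label
names = pd.Series(['Ada', 'Alan', 'Marie'], name='name')
math = pd.Series([88, 87, 85], name='math')

# axis=1 places each Series side by side as a column,
# aligning rows on the shared index (0, 1, 2)
combined = pd.concat([names, math], axis=1)
print(combined)
```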


In [5]:
# Create a list of students
students = ['Sally', 
            'Jane', 
            'Suzie', 
            'Billy', 
            'Ada', 
            'John', 
            'Thomas', 
            'Marie', 
            'Albert', 
            'Richard', 
            'Isaac', 
            'Alan']


In [6]:
type(students)

list

In [7]:
pd.Series(students, name='my_students')

0       Sally
1        Jane
2       Suzie
3       Billy
4         Ada
5        John
6      Thomas
7       Marie
8      Albert
9     Richard
10      Isaac
11       Alan
Name: my_students, dtype: object

In [8]:
# np.random.randint call:
# low is the inclusive minimum, high is the exclusive maximum
# (so values here range from 60 to 99)
# size=len(students) returns one value per student

np.random.randint(
    low=60, 
    high=100, 
    size=len(students))

array([78, 77, 96, 62, 98, 95, 87, 99, 91, 84, 77, 83])

In [10]:

# Randomly generate 12 scores for each subject (1 per student)
# Store values as arrays
# Note that all the values need to have the same length here

math_grades = np.random.randint(
    low=60, high=100, size=len(students))

english_grades = np.random.randint(
    low=60, high=100, size=len(students))

reading_grades = np.random.randint(
    low=60, high=100, size=len(students))

In [11]:
# Create a dictionary with structure:
# 'column_name': <array or list>

# Define a dictionary for the structure of the dataframe.
# Tabular data needs a specific structure to define rows and columns:
# columns are defined by names, just as fields are in SQL,
# and here those names are the string keys of the dictionary.
# The content of each column is a list or array,
# and all of them must have the same length.
df_dict = {'name': students,
           'math': math_grades,
           'english': english_grades,
           'reading': reading_grades}

In [12]:
# df_dict is still a dictionary
type(df_dict)

dict

In [13]:
# defining our first DataFrame:

In [16]:
students_df = pd.DataFrame(df_dict)

In [17]:
type(df_dict)

dict

In [18]:
type(students_df)

pandas.core.frame.DataFrame

### Check data types

In [20]:
# first things we do when we look at a pandas dataframe for
# the first time
# (likely steps that you will be doing from this point forward
# every time you get your hands on a new data set)
# info is a method of the df object, hence the parens
students_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   name     12 non-null     object
 1   math     12 non-null     int64 
 2   english  12 non-null     int64 
 3   reading  12 non-null     int64 
dtypes: int64(3), object(1)
memory usage: 516.0+ bytes


In [23]:
type(students_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   name     12 non-null     object
 1   math     12 non-null     int64 
 2   english  12 non-null     int64 
 3   reading  12 non-null     int64 
dtypes: int64(3), object(1)
memory usage: 516.0+ bytes


NoneType

In [25]:
# dtypes is a property of the dataframe object, not a method call,
# so there are no parens: we are reading data about the df,
# not calling a function
# *INTERESTINGLY* the value of this property
# is itself a pandas Series
students_df.dtypes

name       object
math        int64
english     int64
reading     int64
dtype: object

In [22]:
# how do I look at the data type of a single Series?

In [None]:
# lets examine the structure a little more before we answer:

### Display a table for the dataframe

### Descriptive Statistics

In [27]:
students_df.describe()

Unnamed: 0,math,english,reading
count,12.0,12.0,12.0
mean,82.0,83.333333,80.666667
std,8.696917,12.928709,12.572216
min,61.0,60.0,62.0
25%,78.5,72.75,73.0
50%,84.5,89.0,78.5
75%,88.0,93.25,93.25
max,91.0,98.0,99.0


In [28]:
type(students_df.describe())

pandas.core.frame.DataFrame

In [30]:
my_desc = students_df.describe()
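
Because `describe()` returns a dataframe, individual statistics can be pulled out of it by label. The sketch below uses a small hypothetical frame named `scores` rather than `students_df`.

```python
import pandas as pd

# Hypothetical scores, for illustration only
scores = pd.DataFrame({'math': [84, 79, 91], 'english': [73, 98, 91]})

desc = scores.describe()

# .loc takes the row label first, then the column label
math_mean = desc.loc['mean', 'math']
english_max = desc.loc['max', 'english']
print(math_mean, english_max)
```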

### Dimension of dataframe

Format: A tuple with (rows, columns)

In [33]:
# reassign students_df to df because fingers don't cooperate
df = students_df

In [34]:
df.shape

(12, 4)

In [35]:
df.size

48
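
The relationship between `shape`, `size`, and `len` can be checked on any dataframe. A minimal sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns
frame = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# shape is a (rows, columns) tuple, so it can be unpacked
n_rows, n_cols = frame.shape

# size is the total number of cells: rows * columns
total_cells = frame.size

# len() of a dataframe is its number of rows
row_count = len(frame)
print(n_rows, n_cols, total_cells, row_count)
```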

### View portions of the dataframe

```python
.head(n) # first n rows (default = 5 rows)
.tail(n) # last n rows (default = 5 rows)
.sample(n) # randomly select n rows (default = 1 row)
```
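
A short sketch of the three methods on a hypothetical ten-row frame. The `random_state` argument to `.sample` is not used in this lesson; it is shown here as one way to make a random sample repeatable.

```python
import pandas as pd

# Hypothetical frame with a single column holding 0 through 9
frame = pd.DataFrame({'n': range(10)})

first_three = frame.head(3)   # rows 0, 1, 2
last_two = frame.tail(2)      # rows 8, 9

# random_state fixes the random draw, so both samples match
picked = frame.sample(4, random_state=42)
picked_again = frame.sample(4, random_state=42)
print(picked.equals(picked_again))
```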

In [37]:
# SELECT * FROM df LIMIT 5;
df.head()

Unnamed: 0,name,math,english,reading
0,Sally,84,73,95
1,Jane,79,98,74
2,Suzie,91,91,62
3,Billy,88,72,64
4,Ada,88,92,78


In [54]:
df.tail(2)

Unnamed: 0,name,math,english,reading
10,Isaac,91,93,93
11,Alan,87,60,94


In [41]:
students[-5:]

['Marie', 'Albert', 'Richard', 'Isaac', 'Alan']

In [42]:
df[-5:]

Unnamed: 0,name,math,english,reading
7,Marie,85,94,99
8,Albert,77,87,79
9,Richard,61,94,70
10,Isaac,91,93,93
11,Alan,87,60,94


In [55]:
# pd.DataFrame.sample?

In [56]:
df.sample(4)  # the number of rows to sample (optional, default = 1)

Unnamed: 0,name,math,english,reading
6,Thomas,80,82,74
5,John,73,64,86
10,Isaac,91,93,93
2,Suzie,91,91,62


### Information about the dataframe's contents

| Name | Description |
| ---:|:------ |
|`#`  | Position of the column (starting at 0) |
|`Column`| Name of the column (a string) |
|`Non-Null Count` | Number of non-null values |
|`Dtype` | The data type of the column |


---

## Working with Columns

- View column information
- Rename columns
- Drop columns


In [57]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   name     12 non-null     object
 1   math     12 non-null     int64 
 2   english  12 non-null     int64 
 3   reading  12 non-null     int64 
dtypes: int64(3), object(1)
memory usage: 516.0+ bytes


In [58]:
# how do I look at info for a specific column?
# how do I know its a series?

In [60]:
type(df['math'])

pandas.core.series.Series

In [61]:
df['math'].head()

0    84
1    79
2    91
3    88
4    88
Name: math, dtype: int64

In [63]:
# be careful with dot notation!!!
# dot notation and bracket notation are basically
# interchangeable for ~90% of your pandas interactions
# with individual Series.
# *HOWEVER* if a field name duplicates an existing attribute or
# method name, or is a reserved word, there is a high likelihood
# that you'll confuse or break python's interpretation of it
# let's assign a field called "class" later on...(hint)
df.math.head()

0    84
1    79
2    91
3    88
4    88
Name: math, dtype: int64
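
The pitfalls described in the comments above can be sketched as follows. The frame and its column names (`count`, `class`) are hypothetical, chosen because `count` is also a dataframe method and `class` is a Python keyword.

```python
import pandas as pd

# Hypothetical frame with two awkward column names
frame = pd.DataFrame({
    'count': [3, 5],
    'class': ['a', 'b'],
    'math': [84, 79],
})

# No clash here: dot and bracket notation return the same Series
same = frame.math.equals(frame['math'])

# 'count' is also a DataFrame method, so dot access returns
# the method rather than the column
dot_count_is_method = callable(frame.count)

# Bracket notation always returns the column
bracket_count = frame['count']

# frame.class would be a SyntaxError because class is a keyword,
# so bracket notation is the only option
class_values = frame['class'].tolist()
print(same, dot_count_is_method, class_values)
```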

### Data types

### View Column Information

In [64]:
# property two: columns!

In [72]:
df.field_name.head()

AttributeError: 'DataFrame' object has no attribute 'field_name'

In [73]:
# df.columns holds the column labels
# here is how we might begin to use the columns property
# as an iterable in a loop
for field_name in df.columns:
    print(field_name)
    # equivalent to df['name'].head(), df['math'].head(), etc.
    print(df[field_name].head())
    # this won't work: df.field_name.head()
    # dot notation looks for a column literally named
    # "field_name"; it does not substitute the loop variable

name
0    Sally
1     Jane
2    Suzie
3    Billy
4      Ada
Name: name, dtype: object
math
0    84
1    79
2    91
3    88
4    88
Name: math, dtype: int64
english
0    73
1    98
2    91
3    72
4    92
Name: english, dtype: int64
reading
0    95
1    74
2    62
3    64
4    78
Name: reading, dtype: int64


In [76]:
# we saw df.columns as an iterable, so lets treat it like one

In [77]:
# length mismatch: df.columns = ['a', 'b', 'c']
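
A minimal sketch of what the length mismatch looks like in practice, using a hypothetical one-row frame with four columns. Assigning three labels to four columns raises a `ValueError`.

```python
import pandas as pd

# Hypothetical frame with four columns
frame = pd.DataFrame({
    'name': ['Ada'],
    'math': [88],
    'english': [92],
    'reading': [78],
})

try:
    # Three new labels for four columns: the lengths do not match
    frame.columns = ['a', 'b', 'c']
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print(err)
```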

In [79]:
# reassign every column name to the uppercase version of itself
# using a list comprehension
df.columns = [col.upper() for col in df.columns]

In [80]:
df.head(1)

Unnamed: 0,NAME,MATH,ENGLISH,READING
0,Sally,84,73,95


### Check data types

<div class="alert alert-block alert-info">
    
### Important Notes
    
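- Keep in mind which data structure you are working with (a dataframe, a Series, or a plain Python object)
- Recall that most pandas methods return a new object and leave the original unchanged unless you reassign the result or pass `inplace=True`
- Check the documentation for the parameters and return value of each pandas function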




In [82]:
cols = []
datatypes = []
for col in df.columns:
    cols.append(col)
    datatypes.append(df[col].dtype)

In [83]:
cols

['NAME', 'MATH', 'ENGLISH', 'READING']

In [84]:
datatypes

[dtype('O'), dtype('int64'), dtype('int64'), dtype('int64')]

In [85]:
pd.Series(datatypes, index=cols)

NAME       object
MATH        int64
ENGLISH     int64
READING     int64
dtype: object

In [81]:
for col in df:
    print(col)
    print(df[col].dtype)
    print('------')

NAME
object
------
MATH
int64
------
ENGLISH
int64
------
READING
int64
------


### Dropping columns

- Drop the `english` and `math` columns, keeping the student's name and reading grade
- Assign the results to a new dataframe

In [86]:
# use pandas dataframe methods to not only observe, but actively
# do things!

In [88]:
# reassign column names back into the lowercase version
df.columns = [col.lower() for col in df.columns]

In [92]:
columns_to_drop = ['english', 'math']
df.drop(columns=columns_to_drop).head()
# updating these changes and saving them:
# df = df.drop(columns=columns_to_drop)

Unnamed: 0,name,reading
0,Sally,95
1,Jane,74
2,Suzie,62
3,Billy,64
4,Ada,78



### Tip

- Create a list with the columns you want to drop
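
As a sketch of the tip, here are two equivalent ways to end up with only the name and math columns: dropping a list of unwanted columns, or selecting a list of wanted ones. The frame and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical two-row frame
frame = pd.DataFrame({
    'name': ['Sally', 'Jane'],
    'math': [84, 79],
    'english': [73, 98],
    'reading': [95, 74],
})

# Option 1: list the columns to drop
columns_to_drop = ['english', 'reading']
dropped = frame.drop(columns=columns_to_drop)

# Option 2: list the columns to keep
kept = frame[['name', 'math']]

# Both results match, and the original frame is unchanged
print(dropped.equals(kept), list(frame.columns))
```

Selecting is often shorter when only a few columns are wanted; dropping is shorter when only a few are unwanted.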

### Renaming columns

- Rename the column `name` to `student`
- Use a dictionary structure
- Assign to a new dataframe

```python
{'old_name': 'new_name'}
```

In [96]:
# pd.DataFrame.rename?

In [99]:
df = df.rename(columns={
    'name':'student'
})

In [100]:
df.head(2)

Unnamed: 0,student,math,english,reading
0,Sally,84,73,95
1,Jane,79,98,74


### Selecting a column

### Creating new columns


- Use a comparison operator on a column to return Boolean values 
- Create a new column with these contents
- Use assign to set the column values equal to the Boolean values
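
The three steps above can be sketched as follows. The frame, the cutoff of 70, and the column name `passing_math` are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical frame
frame = pd.DataFrame({
    'student': ['Sally', 'Jane', 'Richard'],
    'math': [84, 79, 61],
})

# The comparison returns a boolean Series, one value per row
is_passing = frame['math'] >= 70

# assign returns a new dataframe with the extra column;
# the original frame is left unchanged
with_flag = frame.assign(passing_math=is_passing)
print(with_flag)
```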

In [103]:
# looking at a column on its own: 
# df['english'].value_counts()

In [105]:
# vectorized applications from numpy:
df['english'] + 2

0      75
1     100
2      93
3      74
4      94
5      66
6      84
7      96
8      89
9      96
10     95
11     62
Name: english, dtype: int64

In [106]:
df['curved_english'] = (df.english + 2)

In [107]:
df.head(2)

Unnamed: 0,student,math,english,reading,curved_english
0,Sally,84,73,95,75
1,Jane,79,98,74,100


In [108]:
# what if I just want to look at english and curved english?

In [111]:
df[['english', 'curved_english']].head()

Unnamed: 0,english,curved_english
0,73,75
1,98,100
2,91,93
3,72,74
4,92,94


In [113]:
# also:
my_fields_to_select =  ['english', 'curved_english']
df[my_fields_to_select]


Unnamed: 0,english,curved_english
0,73,75
1,98,100
2,91,93
3,72,74
4,92,94
5,64,66
6,82,84
7,94,96
8,87,89
9,94,96


In [114]:
# speaking of masks:
# df, where df math grades are below 70:
df[df['math'] < 70]

Unnamed: 0,student,math,english,reading,curved_english
9,Richard,61,94,70,96


In [116]:
boolean_mask = df['math'] < 70

In [117]:
df[boolean_mask]

Unnamed: 0,student,math,english,reading,curved_english
9,Richard,61,94,70,96


In [118]:
df.math < 70

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9      True
10    False
11    False
Name: math, dtype: bool

---

### Sorting dataframes

- Use `.sort_values`
- Set which column to sort by
- Set ascending or descending

In [120]:
# pd.DataFrame.sort_values?

In [122]:
df.sort_values('english', ascending=False)

Unnamed: 0,student,math,english,reading,curved_english
1,Jane,79,98,74,100
7,Marie,85,94,99,96
9,Richard,61,94,70,96
10,Isaac,91,93,93,95
4,Ada,88,92,78,94
2,Suzie,91,91,62,93
8,Albert,77,87,79,89
6,Thomas,80,82,74,84
0,Sally,84,73,95,75
3,Billy,88,72,64,74


In [126]:
df.sort_values(['english', 'math'], ascending=[True,False]).head()

Unnamed: 0,student,math,english,reading,curved_english
11,Alan,87,60,94,62
5,John,73,64,86,66
3,Billy,88,72,64,74
0,Sally,84,73,95,75
6,Thomas,80,82,74,84


### Chaining Operations

In [127]:
# filter first, then sort: indexing a sorted frame with a mask built
# from the unsorted frame triggers a reindexing UserWarning
df[df['math'] < 80].sort_values(
    'curved_english', ascending=False
).head()


Unnamed: 0,student,math,english,reading,curved_english
1,Jane,79,98,74,100
9,Richard,61,94,70,96
8,Albert,77,87,79,89
5,John,73,64,86,66


In [130]:
# pd.DataFrame.assign?

In [135]:
# previous way of making a dataframe column:
# df['newcol'] = [some values]
# rename:
# df.rename(columns={'old_name': 'new_name'})
# df.assign(class='robinson') is a SyntaxError because
# class is a reserved word, so use bracket notation instead:
df['class'] = 'robinson'

In [136]:
df['class']

0     robinson
1     robinson
2     robinson
3     robinson
4     robinson
5     robinson
6     robinson
7     robinson
8     robinson
9     robinson
10    robinson
11    robinson
Name: class, dtype: object

In [139]:
df.rename(columns={'class': 'cohort'}, inplace=True)

In [140]:
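# inplace=True modifies df directly and returns None,
# so my_updated_df is None rather than a dataframe
my_updated_df = df.drop(columns='curved_english', inplace=True)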

In [142]:
type(my_updated_df)

NoneType
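
The `NoneType` result above comes from `inplace=True`. A minimal side-by-side sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical frame with three columns
frame = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# inplace=True mutates frame and returns None
returned = frame.drop(columns='c', inplace=True)

# Without inplace, frame is untouched and a new dataframe is returned
without_b = frame.drop(columns='b')

print(returned, list(frame.columns), list(without_b.columns))
```

Reassigning the result (`df = df.drop(...)`) and passing `inplace=True` both update the data; combining the two overwrites the variable with `None`.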

In [143]:
df.head()

Unnamed: 0,student,math,english,reading,cohort
0,Sally,84,73,95,robinson
1,Jane,79,98,74,robinson
2,Suzie,91,91,62,robinson
3,Billy,88,72,64,robinson
4,Ada,88,92,78,robinson


---

## Importing datasets

In [144]:
from pydataset import data

In [145]:
type(data)

function

In [147]:
# data?

In [149]:
type(data('iris'))

pandas.core.frame.DataFrame

In [150]:
data('some_other_dataset')

Did you mean:
datasets


In [152]:
# looking at all the datasets:
data('datasets')

Unnamed: 0_level_0,Item,Title,csv,doc
Package,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
datasets,AirPassengers,Monthly Airline Passenger Numbers 1949-1960,https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/datasets/AirPassengers.csv,https://raw.github.com/vincentarelbundock/Rdatasets/master/doc/datasets/AirPassengers....
datasets,BJsales,Sales Data with Leading Indicator,https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/datasets/BJsales.csv,https://raw.github.com/vincentarelbundock/Rdatasets/master/doc/datasets/BJsales.html
datasets,BOD,Biochemical Oxygen Demand,https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/datasets/BOD.csv,https://raw.github.com/vincentarelbundock/Rdatasets/master/doc/datasets/BOD.html
datasets,Formaldehyde,Determination of Formaldehyde,https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/datasets/Formaldehyde.csv,https://raw.github.com/vincentarelbundock/Rdatasets/master/doc/datasets/Formaldehyde.html
datasets,HairEyeColor,Hair and Eye Color of Statistics Students,https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/datasets/HairEyeColor.csv,https://raw.github.com/vincentarelbundock/Rdatasets/master/doc/datasets/HairEyeColor.html
...,...,...,...,...
lme4,VerbAgg,Verbal Aggression item responses,https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/lme4/VerbAgg.csv,https://raw.github.com/vincentarelbundock/Rdatasets/master/doc/lme4/VerbAgg.html
lme4,cake,Breakage Angle of Chocolate Cakes,https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/lme4/cake.csv,https://raw.github.com/vincentarelbundock/Rdatasets/master/doc/lme4/cake.html
lme4,cbpp,Contagious bovine pleuropneumonia,https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/lme4/cbpp.csv,https://raw.github.com/vincentarelbundock/Rdatasets/master/doc/lme4/cbpp.html
lme4,grouseticks,Data on red grouse ticks from Elston et al. 2001,https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/lme4/grouseticks.csv,https://raw.github.com/vincentarelbundock/Rdatasets/master/doc/lme4/grouseticks.html


In [172]:
data('datasets')['Item'].to_list()

['AirPassengers',
 'BJsales',
 'BOD',
 'Formaldehyde',
 'HairEyeColor',
 'InsectSprays',
 'JohnsonJohnson',
 'LakeHuron',
 'LifeCycleSavings',
 'Nile',
 'OrchardSprays',
 'PlantGrowth',
 'Puromycin',
 'Titanic',
 'ToothGrowth',
 'UCBAdmissions',
 'UKDriverDeaths',
 'UKgas',
 'USAccDeaths',
 'USArrests',
 'USJudgeRatings',
 'USPersonalExpenditure',
 'VADeaths',
 'WWWusage',
 'WorldPhones',
 'airmiles',
 'airquality',
 'anscombe',
 'attenu',
 'attitude',
 'austres',
 'cars',
 'chickwts',
 'co2',
 'crimtab',
 'discoveries',
 'esoph',
 'euro',
 'faithful',
 'freeny',
 'infert',
 'iris',
 'islands',
 'lh',
 'longley',
 'lynx',
 'morley',
 'mtcars',
 'nhtemp',
 'nottem',
 'npk',
 'occupationalStatus',
 'precip',
 'presidents',
 'pressure',
 'quakes',
 'randu',
 'rivers',
 'rock',
 'sleep',
 'stackloss',
 'sunspot.month',
 'sunspot.year',
 'sunspots',
 'swiss',
 'treering',
 'trees',
 'uspop',
 'volcano',
 'warpbreaks',
 'women',
 'acme',
 'aids',
 'aircondit',
 'aircondit7',
 'amis',
 'aml',

In [153]:
data('grouseticks')

Unnamed: 0,INDEX,TICKS,BROOD,HEIGHT,YEAR,LOCATION,cHEIGHT
1,1,0,501,465,95,32,2.759305
2,2,0,501,465,95,32,2.759305
3,3,0,502,472,95,36,9.759305
4,4,0,503,475,95,37,12.759305
5,5,0,503,475,95,37,12.759305
...,...,...,...,...,...,...,...
399,399,0,741,433,97,15,-29.240695
400,400,0,742,430,97,14,-32.240695
401,401,0,742,430,97,14,-32.240695
402,402,2,743,450,97,25,-12.240695


In [155]:
type(data('grouseticks', show_doc=True))

grouseticks

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

##  Data on red grouse ticks from Elston et al. 2001

### Description

Number of ticks on the heads of red grouse chicks sampled in the field
(`grouseticks`) and an aggregated version (`grouseticks_agg`); see original
source for more details

### Usage

    data(grouseticks)

### Format

`INDEX`

(factor) chick number (observation level)

`TICKS`

number of ticks sampled

`BROOD`

(factor) brood number

`HEIGHT`

height above sea level (meters)

`YEAR`

year (-1900)

`LOCATION`

(factor) geographic location code

`cHEIGHT`

centered height, derived from `HEIGHT`

`meanTICKS`

mean number of ticks by brood

`varTICKS`

variance of number of ticks by brood

### Details

`grouseticks_agg` is just a brood-level aggregation of the data

### Source

Robert Moss, via David Elston

### References

Elston, D. A., R. Moss, T. Boulinier, C. Arrowsmith, and X. Lambin. 2001.
"Analysis of Aggregatio

NoneType

In [156]:
df = data('grouseticks')

In [157]:
df.head()

Unnamed: 0,INDEX,TICKS,BROOD,HEIGHT,YEAR,LOCATION,cHEIGHT
1,1,0,501,465,95,32,2.759305
2,2,0,501,465,95,32,2.759305
3,3,0,502,472,95,36,9.759305
4,4,0,503,475,95,37,12.759305
5,5,0,503,475,95,37,12.759305



---

### Logical Operators


 `&`    and 
 
 `|`    or
 
 `~`  not
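
A short sketch of each operator on a hypothetical frame. Each comparison is wrapped in parentheses because `&`, `|`, and `~` bind more tightly than `<` and `>=`.

```python
import pandas as pd

# Hypothetical frame
frame = pd.DataFrame({'math': [84, 61, 91], 'english': [73, 94, 91]})

# and: both conditions must be True for a row to be kept
both = frame[(frame['math'] >= 80) & (frame['english'] >= 80)]

# or: at least one condition must be True
either = frame[(frame['math'] >= 90) | (frame['english'] >= 90)]

# not: flips every True to False and vice versa
not_low = frame[~(frame['math'] < 70)]

print(both.index.tolist(), either.index.tolist(), not_low.index.tolist())
```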

In [168]:
df.describe()['HEIGHT']['50%']

457.0

In [170]:
# df,
# where BROOD is less than 505 ==> boolean Series
# where HEIGHT is greater than 457 ==> boolean Series
# & compares the two boolean Series row by row:
# a row is kept only when both values are True
# ==> TRUE and TRUE ==> TRUE
df[
    (df['BROOD'] < 505) 
    & 
    (df.HEIGHT > 457)
]

Unnamed: 0,INDEX,TICKS,BROOD,HEIGHT,YEAR,LOCATION,cHEIGHT
1,1,0,501,465,95,32,2.759305
2,2,0,501,465,95,32,2.759305
3,3,0,502,472,95,36,9.759305
4,4,0,503,475,95,37,12.759305
5,5,0,503,475,95,37,12.759305
6,6,3,503,475,95,37,12.759305
7,7,2,503,475,95,37,12.759305
8,8,0,504,488,95,44,25.759305
9,9,0,504,488,95,44,25.759305
10,10,2,504,488,95,44,25.759305


### Find students who are eligible for:

- High-performer award (at least 80 in math, English, and reading)
- Subject award (at least 90 in a subject)
- Exemplary award (at least 90 in all subjects)
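
One possible approach to these three filters is sketched below on a small hypothetical `grades` frame. The first filter chains `&` as shown earlier; the other two use `.any(axis=1)` and `.all(axis=1)`, which are not covered in this lesson and are offered as an alternative to writing out each comparison.

```python
import pandas as pd

# Hypothetical grades, for illustration only
grades = pd.DataFrame({
    'student': ['Sally', 'Jane', 'Isaac', 'Marie'],
    'math': [84, 79, 91, 85],
    'english': [73, 98, 93, 94],
    'reading': [95, 74, 93, 99],
})
subjects = ['math', 'english', 'reading']

# High-performer: at least 80 in every subject, chained with &
high_performer = grades[
    (grades['math'] >= 80)
    & (grades['english'] >= 80)
    & (grades['reading'] >= 80)
]

# Subject award: at least 90 in any one subject
subject_award = grades[(grades[subjects] >= 90).any(axis=1)]

# Exemplary: at least 90 in all subjects
exemplary = grades[(grades[subjects] >= 90).all(axis=1)]

print(high_performer['student'].tolist())
print(subject_award['student'].tolist())
print(exemplary['student'].tolist())
```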