# Agenda

1. Data frames (2D data)
2. Reading (and writing) files -- real-world data!

To download: https://files.lerner.co.il/data-science-exercise-files.zip

In [2]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

When we import a module, we're basically asking Python to do the following:

1. Find the module (typically a file ending with ".py") on disk
2. Load it into memory
3. Cache it in sys.modules, so that we don't need to load it a second time
4. Define the module as a variable in our global namespace

The second time we import the same module, Python finds it in the cache and jumps directly to step 4.
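
Here's a small sketch of the caching step, using the standard library's json module as a stand-in (any module would do). It only shows that a second import hands back the object already stored in sys.modules.

```python
import sys
import json            # first import: found, loaded, cached, and bound to the name "json"

cached = sys.modules['json']   # the cache entry created by the import above

import json            # second import: Python finds the cache entry and just rebinds the name

print(json is cached)  # True: the same module object, not a fresh copy
```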



In [4]:
import sys
sys.modules['pandas']  # sys.modules is the cache that Python uses for modules

<module 'pandas' from '/usr/local/lib/python3.11/site-packages/pandas/__init__.py'>

What about "from ... import"?

In that case, Python takes a slightly different route:

1. Find the module (typically a file ending with ".py") on disk
2. Load it into memory
3. Cache it, so that we don't need to load it a second time
4. Define only the names we've specified in our global namespace

The whole module is still loaded and cached; only the last step differs.
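
To see the difference in that last step, here's a minimal sketch. It runs the import inside a plain dict (the variable `namespace` is my own stand-in for a fresh global namespace), and uses the standard library's fractions module as an arbitrary example.

```python
import sys

namespace = {}   # a stand-in for a fresh global namespace
exec("from fractions import Fraction", namespace)

print('fractions' in sys.modules)   # True: the whole module was loaded and cached
print('Fraction' in namespace)      # True: the name we asked for is defined
print('fractions' in namespace)     # False: the module's own name is not
```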


In [5]:
from random import randint

In [6]:
sys.modules['random']

<module 'random' from '/usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/random.py'>

In [7]:
# If I want to create a data frame...

# list of lists
df = DataFrame([[10, 20, 30, 40],
               [50, 60, 70, 80],
               [90, 100, 110, 120]])
df

Unnamed: 0,0,1,2,3
0,10,20,30,40
1,50,60,70,80
2,90,100,110,120


As with a series, we have an index -- along the left side, labeling our rows.

We also have columns, whose names run along the top, labeling the columns.

By default, both are numbered starting at 0.

We can set one or both by passing "index=" or "columns=" when we create the data frame. And yes, we can modify those down the road.
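
As a quick sketch of modifying them "down the road", you can assign to the data frame's index and columns attributes after it has been created. The small data frame here is made up for illustration.

```python
from pandas import DataFrame

df = DataFrame([[10, 20, 30],
                [40, 50, 60]])    # default index and columns: 0, 1, 2, ...

df.index = ['a', 'b']             # replace the row labels
df.columns = ['x', 'y', 'z']      # replace the column labels

print(df.loc['b', 'z'])           # 60: label-based lookup now uses the new names
```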

In [8]:
df = DataFrame([[10, 20, 30, 40],
               [50, 60, 70, 80],
               [90, 100, 110, 120]],
              index=list('abc'),
              columns=list('wxyz'))
df

Unnamed: 0,w,x,y,z
a,10,20,30,40
b,50,60,70,80
c,90,100,110,120


In [9]:
# retrieving a row -- we use .loc and .iloc

df.loc['a']

w    10
x    20
y    30
z    40
Name: a, dtype: int64

In [10]:
df.loc[['a', 'c']]   # fancy indexing -- request more than one row

Unnamed: 0,w,x,y,z
a,10,20,30,40
c,90,100,110,120


In [11]:
# I can also use .iloc

df.iloc[1]

w    50
x    60
y    70
z    80
Name: b, dtype: int64

In [12]:
# how can I retrieve one or more columns? Use []
df['w']

a    10
b    50
c    90
Name: w, dtype: int64

In [13]:
# can I get more than one column? Yes!
df[['w', 'y']]

Unnamed: 0,w,y
a,10,30
b,50,70
c,90,110


In [14]:
# what about retrieving columns by number? we don't really do that -- we use their names.

# what about slices of rows?
df.loc['a':'c']

Unnamed: 0,w,x,y,z
a,10,20,30,40
b,50,60,70,80
c,90,100,110,120


In [15]:
df.loc['a':'c':2]

Unnamed: 0,w,x,y,z
a,10,20,30,40
c,90,100,110,120


In [17]:
# you don't need .loc when asking for a slice of rows!
# yes, this is hugely inconsistent: [] with a name gives a column, but [] with a slice gives rows

df['a':'c':2]

Unnamed: 0,w,x,y,z
a,10,20,30,40
c,90,100,110,120


In [18]:
df

Unnamed: 0,w,x,y,z
a,10,20,30,40
b,50,60,70,80
c,90,100,110,120


Let's say that I want a particular row and a particular column. I'll use .loc for this, as well -- I'll just use its two-argument version:

    df.loc[ROW_SELECTOR, COLUMN_SELECTOR]
    
The row selector (and the column selector, for that matter) can be:

1. An individual row name
2. A list of row names (indexes)
3. A boolean series/list
4. A slice of names, such as 'a':'c'

This is the standard way to work with data frames.
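
Here's a compact sketch that tries each kind of selector (plus a label slice, as we used earlier) in the two-argument form of .loc. It rebuilds the same 3x4 data frame so that it can run on its own; the variable names are just for illustration.

```python
from pandas import DataFrame

df = DataFrame([[10, 20, 30, 40],
                [50, 60, 70, 80],
                [90, 100, 110, 120]],
               index=list('abc'),
               columns=list('wxyz'))

one_value = df.loc['b', 'y']                    # single label, single label
by_lists = df.loc[['a', 'c'], ['w', 'z']]       # lists of labels
by_mask = df.loc[df['w'] > 10, 'x']             # boolean series for the rows
by_slice = df.loc['a':'b', 'x':'z']             # label slices include the end point
```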

In [19]:
# if I want row b, column y

df.loc['b', 'y']

70

In [20]:
# if I want row b, columns y and z

df.loc['b', ['y', 'z']]

y    70
z    80
Name: b, dtype: int64

In [21]:
# if I want rows a and c, columns y and z

df.loc[['a', 'c'], ['y', 'z']]

Unnamed: 0,y,z
a,30,40
c,110,120


In [22]:
# you can write it on two lines

df.loc[
    ['a', 'c'],   # row selector
    ['y', 'z']    # column selector
]

Unnamed: 0,y,z
a,30,40
c,110,120


In [26]:
# I can use a boolean index here

df.loc[
    df['x'] > df['x'].mean(),   # row selector
    ['y', 'z']                  # column selector
]

Unnamed: 0,y,z
c,110,120


# Exercises with data frames

1. Create a 5x5 data frame with rows abcde and columns vwxyz. The values should be random integers from 0-1,000. (You can use a 2D NumPy array for this, if you want.)

2. Retrieve row b
3. Retrieve rows b and d
4. Retrieve rows b, c, and d
5. Retrieve column w
6. Retrieve columns w and y
7. Retrieve columns w, x, and y
8. Retrieve the item at row e, column v


In [27]:
np.random.seed(0)  # seed the random number generator, so that we all get the same "random" values

In [29]:
np.random.randint(0, 100, 25).reshape(5, 5)

array([[ 9, 20, 80, 69, 79],
       [47, 64, 82, 99, 88],
       [49, 29, 19, 19, 14],
       [39, 32, 65,  9, 57],
       [32, 31, 74, 23, 35]])

In [30]:
np.random.randint(0, 100, [5,5])

array([[75, 55, 28, 34,  0],
       [ 0, 36, 53,  5, 38],
       [17, 79,  4, 42, 58],
       [31,  1, 65, 41, 57],
       [35, 11, 46, 82, 91]])

In [34]:
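# 1. Create a 5x5 data frame with rows abcde and columns vwxyz. The values should be random integers from 0-1,000. (You can use a 2D NumPy array for this, if you want.)
# (note: np.random.randint excludes its upper bound, so randint(0, 1000) returns values from 0 through 999)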

np.random.seed(0)
df = DataFrame(np.random.randint(0, 1000, [5,5]),
         index=list('abcde'),
         columns=list('vwxyz'))
df

Unnamed: 0,v,w,x,y,z
a,684,559,629,192,835
b,763,707,359,9,723
c,277,754,804,599,70
d,472,600,396,314,705
e,486,551,87,174,600


In [35]:
# 2. Retrieve row b

df.loc['b']

v    763
w    707
x    359
y      9
z    723
Name: b, dtype: int64

In [36]:
# 3. Retrieve rows b and d

df.loc[['b', 'd']]

Unnamed: 0,v,w,x,y,z
b,763,707,359,9,723
d,472,600,396,314,705


In [37]:
df.loc['b':'d':2]

Unnamed: 0,v,w,x,y,z
b,763,707,359,9,723
d,472,600,396,314,705


In [38]:
# 4. Retrieve rows b, c, and d

df.loc['b':'d']

Unnamed: 0,v,w,x,y,z
b,763,707,359,9,723
c,277,754,804,599,70
d,472,600,396,314,705


In [39]:
df['b':'d']  # slice on a data frame gives us the rows

Unnamed: 0,v,w,x,y,z
b,763,707,359,9,723
c,277,754,804,599,70
d,472,600,396,314,705


In [40]:
%timeit df.loc['b':'d']

625 µs ± 66.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [41]:
%timeit df['b':'d'] 

545 µs ± 78.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [42]:
# 5. Retrieve column w

df['w']

a    559
b    707
c    754
d    600
e    551
Name: w, dtype: int64

In [43]:
# 6. Retrieve columns w and y

df[['w', 'y']]

Unnamed: 0,w,y
a,559,192
b,707,9
c,754,599
d,600,314
e,551,174


In [44]:
# 7. Retrieve columns w, x, and y

df[['w', 'x', 'y']]

Unnamed: 0,w,x,y
a,559,629,192
b,707,359,9
c,754,804,599
d,600,396,314
e,551,87,174


In [45]:
# 8. Retrieve the item at row e, column v

df.loc[
      'e', # row selector
       'v' # column selector
] 

486

In [48]:
df.loc[:, 'w':'y']

Unnamed: 0,w,x,y
a,559,629,192
b,707,359,9
c,754,804,599
d,600,396,314
e,551,87,174


In [49]:
# dtypes -- columns are series

# every column in a data frame is actually a series behind the scenes

df['v'].dtype

dtype('int64')

In [50]:
df.loc['a'].dtype

dtype('int64')

In [51]:
# I can get the dtypes of all columns with the "dtypes" attribute

df.dtypes

v    int64
w    int64
x    int64
y    int64
z    int64
dtype: object

In [52]:
# we can assign to values in a data frame via .loc!

df

Unnamed: 0,v,w,x,y,z
a,684,559,629,192,835
b,763,707,359,9,723
c,277,754,804,599,70
d,472,600,396,314,705
e,486,551,87,174,600


In [53]:
df.loc['c', 'y']

599

In [54]:
df.loc['c', 'y'] = 222

df

Unnamed: 0,v,w,x,y,z
a,684,559,629,192,835
b,763,707,359,9,723
c,277,754,804,222,70
d,472,600,396,314,705
e,486,551,87,174,600


In [55]:
df.loc[['a', 'c'], ['x', 'y']]

Unnamed: 0,x,y
a,629,192
c,804,222


In [56]:
df.loc[['a', 'c'], ['x', 'y']] = 333
df

Unnamed: 0,v,w,x,y,z
a,684,559,333,333,835
b,763,707,359,9,723
c,277,754,333,333,70
d,472,600,396,314,705
e,486,551,87,174,600


In [60]:
# change the order of columns, if you want
df = df[['y', 'z', 'v', 'x', 'w']]

In [61]:
df

Unnamed: 0,y,z,v,x,w
a,333,835,684,333,559
b,9,723,763,359,707
c,333,70,277,333,754
d,314,705,472,396,600
e,174,600,486,87,551


# Some other ways to create data frames

1. List of lists
2. 2D NumPy array
3. List of dicts -- the dict keys specify the column names, and each dict is a row
4. Dict of lists -- the dict keys specify the column names, and each list is a column. Each list must be of the same length.
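
As a rough sketch, here are all four approaches building the same 2x2 data frame. The values and the variable names are made up for illustration; the point is just that the same data comes out regardless of how it went in.

```python
import numpy as np
from pandas import DataFrame

from_lists = DataFrame([[10, 100], [20, 200]], columns=['a', 'b'])
from_array = DataFrame(np.array([[10, 100], [20, 200]]), columns=['a', 'b'])
from_dicts = DataFrame([{'a': 10, 'b': 100}, {'a': 20, 'b': 200}])   # one dict per row
from_columns = DataFrame({'a': [10, 20], 'b': [100, 200]})           # one list per column

frames = [from_lists, from_array, from_dicts, from_columns]
same = all(f.to_numpy().tolist() == [[10, 100], [20, 200]] for f in frames)
print(same)   # True
```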

In [62]:
df = DataFrame([{'a':10, 'b':20, 'c':30},
                {'a':100, 'b':200, 'd':400},
               {'a':1000, 'c':3000, 'd':4000}])
df

Unnamed: 0,a,b,c,d
0,10,20.0,30.0,
1,100,200.0,,400.0
2,1000,,3000.0,4000.0


In [63]:
df.dtypes

a      int64
b    float64
c    float64
d    float64
dtype: object

In [69]:
# what if I retrieve row 0 from df?

df.loc[0]

a    10.0
b    20.0
c    30.0
d     NaN
Name: 0, dtype: float64

In [70]:
# create a data frame with a dict of lists

df = DataFrame({'a':[10, 20, 30], 
               'b':[100, 200, 300],
               'c':[1000, 2000, 3000]})

df

Unnamed: 0,a,b,c
0,10,100,1000
1,20,200,2000
2,30,300,3000


In [72]:
# basic methods

# the general rule: Most aggregation methods that you can run on a series can also be run
# on a data frame. You'll get a series back, whose index is the data frame's columns,
# and whose values are the result of running the method on each of those columns


np.random.seed(0)
df = DataFrame(np.random.randint(0, 1000, [5,5]),
         index=list('abcde'),
         columns=list('vwxyz'))
df

Unnamed: 0,v,w,x,y,z
a,684,559,629,192,835
b,763,707,359,9,723
c,277,754,804,599,70
d,472,600,396,314,705
e,486,551,87,174,600


In [73]:
df['v'].mean()

536.4

In [74]:
df.mean()  # calculate the mean for each column

v    536.4
w    634.2
x    455.0
y    257.6
z    586.6
dtype: float64

In [75]:
df.sum()

v    2682
w    3171
x    2275
y    1288
z    2933
dtype: int64
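
These methods work column by column by default. As a small sketch (on a made-up data frame), passing axis='columns' asks for the calculation across each row instead, so the result's index is the data frame's index.

```python
from pandas import DataFrame

df = DataFrame([[10, 20, 30],
                [40, 50, 60]],
               index=list('ab'),
               columns=list('xyz'))

per_column = df.sum()                  # one value per column; index is x, y, z
per_row = df.sum(axis='columns')       # one value per row; index is a, b
```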

In [76]:
df.describe()  # descriptive statistics

Unnamed: 0,v,w,x,y,z
count,5.0,5.0,5.0,5.0,5.0
mean,536.4,634.2,455.0,257.6,586.6
std,191.774086,91.376693,273.951638,219.561609,300.574949
min,277.0,551.0,87.0,9.0,70.0
25%,472.0,559.0,359.0,174.0,600.0
50%,486.0,600.0,396.0,192.0,705.0
75%,684.0,707.0,629.0,314.0,723.0
max,763.0,754.0,804.0,599.0,835.0


In [77]:
s = Series('this is a bunch of words and this is a bad example'.split())

In [78]:
s.describe()

count       12
unique       9
top       this
freq         2
dtype: object

In [79]:
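# a column can also be a series with its own dtype
# (np.float128 isn't available on every platform, e.g., Windows)
df = DataFrame({'a':Series([10, 20, 30], dtype=np.float128), 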
               'b':[100, 200, 300],
               'c':[1000, 2000, 3000]})

df

Unnamed: 0,a,b,c
0,10.0,100,1000
1,20.0,200,2000
2,30.0,300,3000


In [81]:
# adding rows/columns (including columns of arrays)
np.random.seed(0)
df = DataFrame(np.random.randint(0, 1000, [5,5]),
         index=list('abcde'),
         columns=list('vwxyz'))
df

Unnamed: 0,v,w,x,y,z
a,684,559,629,192,835
b,763,707,359,9,723
c,277,754,804,599,70
d,472,600,396,314,705
e,486,551,87,174,600


In [82]:
# how can I add a row?
# just assign to .loc with the new row index
# note: You cannot (easily) add a new row this way if the index repeats

df.loc['f'] = [2,4,6,8,10]  # this adds a new row (or replaces an existing one!)
df

Unnamed: 0,v,w,x,y,z
a,684,559,629,192,835
b,763,707,359,9,723
c,277,754,804,599,70
d,472,600,396,314,705
e,486,551,87,174,600
f,2,4,6,8,10


In [84]:
# how about adding a column?
# once again, just assign -- if the column exists, then it is replaced

df['u'] = [5,10,15,20,25,30]
df

Unnamed: 0,v,w,x,y,z,u
a,684,559,629,192,835,5
b,763,707,359,9,723,10
c,277,754,804,599,70,15
d,472,600,396,314,705,20
e,486,551,87,174,600,25
f,2,4,6,8,10,30


In [85]:
# what about removing a row?
# use df.drop(rowname) or df.drop([row1, row2])

df.drop('f')  # returns a new data frame, based on df, without row f

Unnamed: 0,v,w,x,y,z,u
a,684,559,629,192,835,5
b,763,707,359,9,723,10
c,277,754,804,599,70,15
d,472,600,396,314,705,20
e,486,551,87,174,600,25


In [86]:
# better to assign back to df the result of running df.drop
# here, I'll remove two rows

df = df.drop(['c', 'f'])
df

Unnamed: 0,v,w,x,y,z,u
a,684,559,629,192,835,5
b,763,707,359,9,723,10
d,472,600,396,314,705,20
e,486,551,87,174,600,25


In [87]:
# how about dropping a column?
# we'll use "drop" again -- but the assumption is always that
# we'll want to work with rows, at least by default

# if we want to work with columns, we'll need to specify that by passing
# axis='columns'  (you could instead say axis=1, but I never remember which is 0 and 1)

df.drop('u', axis='columns')

Unnamed: 0,v,w,x,y,z
a,684,559,629,192,835
b,763,707,359,9,723
d,472,600,396,314,705
e,486,551,87,174,600


In [88]:
df = df.drop('u', axis='columns')
df

Unnamed: 0,v,w,x,y,z
a,684,559,629,192,835
b,763,707,359,9,723
d,472,600,396,314,705
e,486,551,87,174,600


In [89]:
# what is the product of columns x and y?

df['x'] * df['y']

a    120768
b      3231
d    124344
e     15138
dtype: int64

In [90]:
# calculate and then assign

df['product'] = df['x'] * df['y']
df

Unnamed: 0,v,w,x,y,z,product
a,684,559,629,192,835,120768
b,763,707,359,9,723,3231
d,472,600,396,314,705,124344
e,486,551,87,174,600,15138


In [91]:
df

Unnamed: 0,v,w,x,y,z,product
a,684,559,629,192,835,120768
b,763,707,359,9,723,3231
d,472,600,396,314,705,124344
e,486,551,87,174,600,15138


In [93]:
df.loc['f'] = [np.nan, np.nan, 50, np.nan, np.nan, np.nan]
df

Unnamed: 0,v,w,x,y,z,product
a,684.0,559.0,629.0,192.0,835.0,120768.0
b,763.0,707.0,359.0,9.0,723.0,3231.0
d,472.0,600.0,396.0,314.0,705.0,124344.0
e,486.0,551.0,87.0,174.0,600.0,15138.0
f,,,50.0,,,


In [94]:
# assigning to .loc works, even when the row will be new
df.loc['g', 'x'] = 30

In [96]:
# assigning to .loc works, even when the column will be new
df.loc['g', 'q'] = 400

In [98]:
df.loc['h'] = {'x':3, 'y':5}

In [100]:
# to combine two (or more) data frames into one, use pd.concat([list_of_data_frames])

pd.concat([df, df])  # by default, adds them vertically -- you can set axis='columns'
                     # to add them horizontally

Unnamed: 0,v,w,x,y,z,product,q
a,684.0,559.0,629.0,192.0,835.0,120768.0,
b,763.0,707.0,359.0,9.0,723.0,3231.0,
d,472.0,600.0,396.0,314.0,705.0,124344.0,
e,486.0,551.0,87.0,174.0,600.0,15138.0,
f,,,50.0,,,,
g,,,30.0,,,,400.0
h,,,3.0,5.0,,,
a,684.0,559.0,629.0,192.0,835.0,120768.0,
b,763.0,707.0,359.0,9.0,723.0,3231.0,
d,472.0,600.0,396.0,314.0,705.0,124344.0,
e,486.0,551.0,87.0,174.0,600.0,15138.0,
f,,,50.0,,,,
g,,,30.0,,,,400.0
h,,,3.0,5.0,,,


In [101]:
pd.concat([df.loc[['a', 'b'], ['x', 'y']],   # rows a-b, columns x-y
           df.loc[['d', 'e'], ['v', 'w']]])  # rows d-e, columns v-w

Unnamed: 0,x,y,v,w
a,629.0,192.0,,
b,359.0,9.0,,
d,,,472.0,600.0
e,,,486.0,551.0
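
Here's a minimal sketch of the two directions, using two tiny made-up data frames (`left` and `right` are my own names). Either way, pandas lines things up by label and fills in NaN where one of the inputs had nothing.

```python
import pandas as pd
from pandas import DataFrame

left = DataFrame({'x': [1, 2]}, index=['a', 'b'])
right = DataFrame({'y': [10, 30]}, index=['a', 'c'])

stacked = pd.concat([left, right])                        # rows a, b, a, c; columns x, y
side_by_side = pd.concat([left, right], axis='columns')   # rows a, b, c; columns x, y
```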


In [105]:
s1 = Series('a b c d e'.split(), index=list('abcde'))
s2 = Series('1 2 1 2 1'.split(), index=list('abcde'))

df['total'] = s1 + s2   # adds a new column to df, based on s1+s2 (string concatenation), matched up by index

In [106]:
df

Unnamed: 0,v,w,x,y,z,product,q,total
a,684.0,559.0,629.0,192.0,835.0,120768.0,,a1
b,763.0,707.0,359.0,9.0,723.0,3231.0,,b2
d,472.0,600.0,396.0,314.0,705.0,124344.0,,d2
e,486.0,551.0,87.0,174.0,600.0,15138.0,,e1
f,,,50.0,,,,,
g,,,30.0,,,,400.0,
h,,,3.0,5.0,,,,
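
Why do some rows get NaN, and why is there no row c? When you assign a series as a column, pandas matches values by index label, not by position. Here's a small sketch with made-up data:

```python
from pandas import Series, DataFrame

df = DataFrame({'x': [1, 2, 3]}, index=list('abd'))
s = Series([10, 20, 30], index=list('abc'))   # has 'c', lacks 'd'

df['new'] = s    # values are matched by index label, not by position

# 'a' and 'b' get 10 and 20; 'd' has no match, so it gets NaN;
# the value for 'c' is ignored, because df has no row 'c'
```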


In [57]:
# fillna / dropna
# removing rows and columns
# setting/unsetting the index
# queries based on rows/columns

In [118]:
pd.concat([df, DataFrame([[10, 20, 30, 40, 50]],
                        columns=list('vwxyz'))])

Unnamed: 0,v,w,x,y,z,product,q,total
a,684.0,559.0,629.0,192.0,835.0,120768.0,,a1
b,763.0,707.0,359.0,9.0,723.0,3231.0,,b2
d,472.0,600.0,396.0,314.0,705.0,124344.0,,d2
e,486.0,551.0,87.0,174.0,600.0,15138.0,,e1
f,,,50.0,,,,,
g,,,30.0,,,,400.0,
h,,,3.0,5.0,,,,
0,10.0,20.0,30.0,40.0,50.0,,,


# Exercise:

1. Create a data frame in which we have three columns: High, Low, and Precip.  There should be 10 rows, one for each of the next 10 days. In each cell, show the high temp, low temp, or precipitation forecast for these days.
2. Create a new column, Diff, which shows the difference between the high and low temperatures on each day.
3. Find the three days with the greatest temp difference.

In [120]:
# dict of lists

highs = [30, 28, 34, 36, 25, 23, 23, 26, 29, 30]
lows = [12, 17, 21, 16, 14, 14, 13, 14, 17, 17]
precip = [0, 0, 0, 0, 0, 0, 0, 0.4, 0, 0]

forecast = DataFrame({'highs':highs,
                      'lows':lows,
                     'precip':precip})
forecast

Unnamed: 0,highs,lows,precip
0,30,12,0.0
1,28,17,0.0
2,34,21,0.0
3,36,16,0.0
4,25,14,0.0
5,23,14,0.0
6,23,13,0.0
7,26,14,0.4
8,29,17,0.0
9,30,17,0.0


In [123]:
forecast['diff'] = forecast['highs'] - forecast['lows']
forecast

Unnamed: 0,highs,lows,precip,diff
0,30,12,0.0,18
1,28,17,0.0,11
2,34,21,0.0,13
3,36,16,0.0,20
4,25,14,0.0,11
5,23,14,0.0,9
6,23,13,0.0,10
7,26,14,0.4,12
8,29,17,0.0,12
9,30,17,0.0,13


In [124]:
forecast.dtypes

highs       int64
lows        int64
precip    float64
diff        int64
dtype: object

In [127]:
# what are the three highest differences?
forecast['diff'].sort_values().tail(3)

9    13
0    18
3    20
Name: diff, dtype: int64

In [128]:
forecast.describe()

Unnamed: 0,highs,lows,precip,diff
count,10.0,10.0,10.0,10.0
mean,28.4,15.5,0.04,12.9
std,4.351245,2.635231,0.126491,3.478505
min,23.0,12.0,0.0,9.0
25%,25.25,14.0,0.0,11.0
50%,28.5,15.0,0.0,12.0
75%,30.0,17.0,0.0,13.0
max,36.0,21.0,0.4,20.0


In [130]:
forecast['diff'].nlargest(3)

3    20
0    18
2    13
Name: diff, dtype: int64

In [131]:
# n defaults to 5; keep='all' also keeps every row tied with the last one, so we get 6 rows
forecast['diff'].nlargest(keep='all')

3    20
0    18
2    13
9    13
7    12
8    12
Name: diff, dtype: int64
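
To compare the approaches side by side, here's a sketch that uses the diff values from above in a standalone series. Note that the 13 appears twice, so "the top three" depends on how we treat the tie.

```python
from pandas import Series

diff = Series([18, 11, 13, 20, 11, 9, 10, 12, 12, 13])   # the diff values shown above

by_sorting = diff.sort_values().tail(3)       # ascending, so the largest come last
top_three = diff.nlargest(3)                  # descending; a tie goes to the first occurrence
with_ties = diff.nlargest(3, keep='all')      # keeps every row tied for the last place
```

With keep='all', we get four rows back rather than three, because both days with a diff of 13 are kept.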

# Next up

1. Indexes (setting and unsetting)
2. fillna / dropna
3. removing rows and columns
4. setting/unsetting the index
5. queries based on rows/columns
6. Working with files

If you haven't yet downloaded the file with some data in it, please do that: https://files.lerner.co.il/data-science-exercise-files.zip

In [132]:
forecast

Unnamed: 0,highs,lows,precip,diff
0,30,12,0.0,18
1,28,17,0.0,11
2,34,21,0.0,13
3,36,16,0.0,20
4,25,14,0.0,11
5,23,14,0.0,9
6,23,13,0.0,10
7,26,14,0.4,12
8,29,17,0.0,12
9,30,17,0.0,13


In [133]:
# add a new column, dates, to our forecast data frame

forecast['dates'] = '0502 0503 0504 0505 0506 0507 0508 0509 0510 0511'.split()
forecast

Unnamed: 0,highs,lows,precip,diff,dates
0,30,12,0.0,18,0502
1,28,17,0.0,11,0503
2,34,21,0.0,13,0504
3,36,16,0.0,20,0505
4,25,14,0.0,11,0506
5,23,14,0.0,9,0507
6,23,13,0.0,10,0508
7,26,14,0.4,12,0509
8,29,17,0.0,12,0510
9,30,17,0.0,13,0511


In [134]:
# how can I make that column into the index for my data frame?
# answer: set_index

forecast.set_index('dates')  # this returns a new data frame with that index

dates,highs,lows,precip,diff
0502,30,12,0.0,18
0503,28,17,0.0,11
0504,34,21,0.0,13
0505,36,16,0.0,20
0506,25,14,0.0,11
0507,23,14,0.0,9
0508,23,13,0.0,10
0509,26,14,0.4,12
0510,29,17,0.0,12
0511,30,17,0.0,13


In [135]:
forecast = forecast.set_index('dates') 
forecast

dates,highs,lows,precip,diff
0502,30,12,0.0,18
0503,28,17,0.0,11
0504,34,21,0.0,13
0505,36,16,0.0,20
0506,25,14,0.0,11
0507,23,14,0.0,9
0508,23,13,0.0,10
0509,26,14,0.4,12
0510,29,17,0.0,12
0511,30,17,0.0,13


In [136]:
forecast.loc['0507']

highs     23.0
lows      14.0
precip     0.0
diff       9.0
Name: 0507, dtype: float64

In [137]:
forecast.loc['0504':'0508']

dates,highs,lows,precip,diff
0504,34,21,0.0,13
0505,36,16,0.0,20
0506,25,14,0.0,11
0507,23,14,0.0,9
0508,23,13,0.0,10


In [138]:
forecast.loc['0504':'0508', 
             ['precip', 'diff']]

dates,precip,diff
0504,0.0,13
0505,0.0,20
0506,0.0,11
0507,0.0,9
0508,0.0,10


In [139]:
# what if I don't want this index any more? I want it to be a normal column

forecast.reset_index()  # this returns a new data frame, not modifying the original

Unnamed: 0,dates,highs,lows,precip,diff
0,0502,30,12,0.0,18
1,0503,28,17,0.0,11
2,0504,34,21,0.0,13
3,0505,36,16,0.0,20
4,0506,25,14,0.0,11
5,0507,23,14,0.0,9
6,0508,23,13,0.0,10
7,0509,26,14,0.4,12
8,0510,29,17,0.0,12
9,0511,30,17,0.0,13


In [140]:
forecast = forecast.reset_index()

In [141]:
forecast

Unnamed: 0,dates,highs,lows,precip,diff
0,0502,30,12,0.0,18
1,0503,28,17,0.0,11
2,0504,34,21,0.0,13
3,0505,36,16,0.0,20
4,0506,25,14,0.0,11
5,0507,23,14,0.0,9
6,0508,23,13,0.0,10
7,0509,26,14,0.4,12
8,0510,29,17,0.0,12
9,0511,30,17,0.0,13


# Why set the index?

1. Sometimes we get data from a file / source that doesn't specify an index, and having one will be useful.
2. Very often, we want to do a complex query that's easier to write/understand if it's on the index, rather than on a regular column.
3. You might sometimes query based on one column, and sometimes based on another column... by changing the index, you can make your queries more readable and obvious.
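
Here's a small sketch of the readability argument, using a shortened, standalone copy of the forecast data. The same lookup is written once against a regular column and once against the index.

```python
from pandas import DataFrame

forecast = DataFrame({'dates': ['0502', '0503', '0504'],
                      'highs': [30, 28, 34],
                      'lows': [12, 17, 21]})

# querying on a regular column needs a boolean mask, and returns a one-element series
via_column = forecast.loc[forecast['dates'] == '0503', 'highs']

# with dates as the index, the same lookup is a plain label, and returns the value itself
by_date = forecast.set_index('dates')
via_index = by_date.loc['0503', 'highs']
```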

In [142]:
# remember that wherever we can specify one column (as a string),
# we can specify multiple columns as a list of strings
# this means, yes, we can specify multiple columns for the index -- known as 
# a multi-index.

forecast.set_index(['dates', 'precip'])

dates,precip,highs,lows,diff
0502,0.0,30,12,18
0503,0.0,28,17,11
0504,0.0,34,21,13
0505,0.0,36,16,20
0506,0.0,25,14,11
0507,0.0,23,14,9
0508,0.0,23,13,10
0509,0.4,26,14,12
0510,0.0,29,17,12
0511,0.0,30,17,13
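
How do we retrieve from a multi-index? Here's a brief sketch, on a shortened, standalone copy of the forecast data: a single label matches the first level, and a tuple matches both levels. Treat it as a first taste, not a full tour of multi-indexes.

```python
from pandas import DataFrame

forecast = DataFrame({'dates': ['0508', '0509', '0510'],
                      'precip': [0.0, 0.4, 0.0],
                      'highs': [23, 26, 29]})

multi = forecast.set_index(['dates', 'precip'])

outer_only = multi.loc['0509']                     # all rows whose first level is '0509'
both_levels = multi.loc[('0509', 0.4), 'highs']    # row chosen by a (dates, precip) tuple
```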


# fillna / dropna

How did we deal with nan in our series?

- Replace it with "fillna"
- Remove it with "dropna"

These methods also appear on data frames, but they're a bit more sophisticated.

In [144]:
np.random.seed(0)
df = DataFrame(np.random.randint(0, 1000, [5,5]),
         index=list('abcde'),
         columns=list('vwxyz'))
df

Unnamed: 0,v,w,x,y,z
a,684,559,629,192,835
b,763,707,359,9,723
c,277,754,804,599,70
d,472,600,396,314,705
e,486,551,87,174,600


In [145]:
# set some NaN values

df.loc['a', 'y'] = np.nan
df.loc['b', 'y'] = np.nan
df.loc['c', 'v'] = np.nan
df.loc['d', 'x'] = np.nan
df.loc['d', 'z'] = np.nan
df


Unnamed: 0,v,w,x,y,z
a,684.0,559,629.0,,835.0
b,763.0,707,359.0,,723.0
c,,754,804.0,599.0,70.0
d,472.0,600,,314.0,
e,486.0,551,87.0,174.0,600.0


In [146]:
# can I use fillna on my data frame? 
# if I pass a scalar value, that'll be used everywhere

df.fillna(9999)  # returns a new data frame; doesn't change df

Unnamed: 0,v,w,x,y,z
a,684.0,559,629.0,9999.0,835.0
b,763.0,707,359.0,9999.0,723.0
c,9999.0,754,804.0,599.0,70.0
d,472.0,600,9999.0,314.0,9999.0
e,486.0,551,87.0,174.0,600.0


In [149]:
df.mean()

v    601.250000
w    634.200000
x    469.750000
y    362.333333
z    557.000000
dtype: float64

In [148]:
# if I pass a series whose index matches our columns,
# then those values will be used in fillna

df.fillna(df.mean())   # replace nan with the mean OF THAT COLUMN

Unnamed: 0,v,w,x,y,z
a,684.0,559,629.0,362.333333,835.0
b,763.0,707,359.0,362.333333,723.0
c,601.25,754,804.0,599.0,70.0
d,472.0,600,469.75,314.0,557.0
e,486.0,551,87.0,174.0,600.0


In [150]:
df.fillna(df.min())

Unnamed: 0,v,w,x,y,z
a,684.0,559,629.0,174.0,835.0
b,763.0,707,359.0,174.0,723.0
c,472.0,754,804.0,599.0,70.0
d,472.0,600,87.0,314.0,70.0
e,486.0,551,87.0,174.0,600.0
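
The reason df.mean() and df.min() work here is that each returns a series whose index contains the column names. Here's a sketch on a tiny made-up data frame, with a hand-built series (`fill_values` is my own name) doing the same job:

```python
import numpy as np
from pandas import Series, DataFrame

df = DataFrame({'x': [1.0, np.nan, 3.0],
                'y': [np.nan, 20.0, 40.0]})

fill_values = Series({'x': -1.0, 'y': -2.0})   # index labels match df's column names
filled = df.fillna(fill_values)                # each column gets its own replacement

with_means = df.fillna(df.mean())              # df.mean() is such a series, too
```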


# What about dropna?

dropna, when run on a series, removes any NaN values from that series.

On a data frame, by default, it removes every ROW containing at least one NaN.

In [151]:
df.dropna()  # returns a new data frame, containing no rows that had NaN

Unnamed: 0,v,w,x,y,z
e,486.0,551,87.0,174.0,600.0


In [153]:
# option 1: set a threshold
df.dropna(thresh=4)   # meaning: keep the row if it has at least 4 non-NaN values

Unnamed: 0,v,w,x,y,z
a,684.0,559,629.0,,835.0
b,763.0,707,359.0,,723.0
c,,754,804.0,599.0,70.0
e,486.0,551,87.0,174.0,600.0


In [154]:
# option 2: indicate which columns shouldn't have nan

df.dropna(subset=['x', 'y', 'z'])  # only look for nan in x, y, and z; a row with nan in any of them is dropped

Unnamed: 0,v,w,x,y,z
c,,754,804.0,599.0,70.0
e,486.0,551,87.0,174.0,600.0


In [156]:
help(df.dropna)

Help on method dropna in module pandas.core.frame:

dropna(*, axis: 'Axis' = 0, how: 'AnyAll | NoDefault' = <no_default>, thresh: 'int | NoDefault' = <no_default>, subset: 'IndexLabel' = None, inplace: 'bool' = False, ignore_index: 'bool' = False) -> 'DataFrame | None' method of pandas.core.frame.DataFrame instance
    Remove missing values.
    
    See the :ref:`User Guide <missing_data>` for more on which values are
    considered missing, and how to work with missing data.
    
    Parameters
    ----------
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Determine if rows or columns which contain missing values are
        removed.
    
        * 0, or 'index' : Drop rows which contain missing values.
        * 1, or 'columns' : Drop columns which contain missing value.
    
        Pass tuple or list to drop on multiple axes.
        Only a single axis is allowed.
    
    how : {'any', 'all'}, default 'any'
        Determine if row or column is removed from DataFrame

In [155]:
# combine them

df.dropna(thresh=2,                  # we need at least 2 non-NaN values
          subset=['x', 'y', 'z'])    # only look at the columns x, y, z

Unnamed: 0,v,w,x,y,z
a,684.0,559,629.0,,835.0
b,763.0,707,359.0,,723.0
c,,754,804.0,599.0,70.0
e,486.0,551,87.0,174.0,600.0
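
The help output above also mentions the "how" and "axis" parameters, which we didn't try. Here's a rough sketch of both, on a small made-up data frame:

```python
import numpy as np
from pandas import DataFrame

df = DataFrame({'x': [1.0, np.nan, np.nan],
                'y': [2.0, 5.0, np.nan],
                'z': [3.0, 6.0, np.nan]})

any_nan_rows = df.dropna()                 # default how='any': only row 0 survives
all_nan_rows = df.dropna(how='all')        # drops row 2, the only row that is entirely NaN
good_columns = df.dropna(axis='columns',   # work on columns instead of rows
                         thresh=2)         # keep columns with at least 2 non-NaN values
```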


# Queries based on rows/columns -- especially using boolean series

In [158]:
np.random.seed(0)
df = DataFrame(np.random.randint(0, 1000, [5,5]),
         index=list('abcde'),
         columns=list('vwxyz'))
df

Unnamed: 0,v,w,x,y,z
a,684,559,629,192,835
b,763,707,359,9,723
c,277,754,804,599,70
d,472,600,396,314,705
e,486,551,87,174,600


In [160]:
# I want the values of z that are greater than z's mean

df['z'][df['z'] > df['z'].mean()]

a    835
b    723
d    705
e    600
Name: z, dtype: int64

In [161]:
# show me all rows of df 
# that correspond to rows where z is > z's mean

df.loc[
    df['z'] > df['z'].mean()  # row selector is a boolean series == mask index
      ]

Unnamed: 0,v,w,x,y,z
a,684,559,629,192,835
b,763,707,359,9,723
d,472,600,396,314,705
e,486,551,87,174,600


In [162]:
# show me columns v and y
# for the rows where z is greater than z's mean

df.loc[
    df['z'] > df['z'].mean()  # row selector is a boolean series == mask index
    ,
    ['v', 'y']
      ]

Unnamed: 0,v,y
a,684,192
b,763,9
d,472,314
e,486,174
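
A boolean series is just data, so two of them can be combined before being used as a row selector. Here's a sketch using the v and z columns from above in a standalone data frame; the threshold of 500 and the variable names are arbitrary choices of mine. Use & for "and", and | for "or".

```python
from pandas import DataFrame

df = DataFrame({'v': [684, 763, 277, 472, 486],
                'z': [835, 723, 70, 705, 600]},
               index=list('abcde'))

big_z = df['z'] > df['z'].mean()     # boolean series, one value per row
small_v = df['v'] < 500

both = df.loc[big_z & small_v]       # rows where both conditions hold
either = df.loc[big_z | small_v]     # rows where at least one holds
```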


# Exercise: Querying forecasts

1. Create/edit your data frame containing high/low/precip, such that its index will now contain the dates (in the form MMDD).
2. On which days do you find an above-average amount of precipitation?
3. What is the average low temp on days where there's an above-average diff in temperatures?

In [163]:
forecast

Unnamed: 0,dates,highs,lows,precip,diff
0,0502,30,12,0.0,18
1,0503,28,17,0.0,11
2,0504,34,21,0.0,13
3,0505,36,16,0.0,20
4,0506,25,14,0.0,11
5,0507,23,14,0.0,9
6,0508,23,13,0.0,10
7,0509,26,14,0.4,12
8,0510,29,17,0.0,12
9,0511,30,17,0.0,13


In [164]:
forecast = forecast.set_index('dates')
forecast

dates,highs,lows,precip,diff
0502,30,12,0.0,18
0503,28,17,0.0,11
0504,34,21,0.0,13
0505,36,16,0.0,20
0506,25,14,0.0,11
0507,23,14,0.0,9
0508,23,13,0.0,10
0509,26,14,0.4,12
0510,29,17,0.0,12
0511,30,17,0.0,13


In [166]:
# On which days do you find an above-average amount of precipitation?

forecast.loc[
    forecast['precip'] > forecast['precip'].mean()
]

dates,highs,lows,precip,diff
0509,26,14,0.4,12


In [167]:
# get the index
forecast.loc[
    forecast['precip'] > forecast['precip'].mean()
].index

Index(['0509'], dtype='object', name='dates')

In [173]:
# What is the average low temp on days 
# where there's an above-average diff in temperatures?

forecast.loc[
    forecast['diff'] > forecast['diff'].mean(),   # row selector
    'lows'   # column selector
].mean()  # running mean() on the series we got back

16.5

In [176]:
forecast.loc[
    forecast['diff'] > forecast['diff'].mean(),   # row selector
    ['lows']   # column selector
].mean()  # running mean() on each of the columns in the data frame we got back

lows    16.5
dtype: float64

# Working with files

If you haven't yet downloaded the file with some data in it, please do that: https://files.lerner.co.il/data-science-exercise-files.zip