![Pandas logo](img/pandas.svg)

In [1]:
%matplotlib inline
import pandas as pd
from src.training import *

# Vectorization

The notion of applying operations in a vectorized way is common between Pandas and NumPy.  As with NumPy, vectorization goes hand-in-hand with filtering.  We usually want to apply some transformation on *many* elements, but usually not *all* the elements.

For a quick example, let us look at a slightly enhanced version of our patient data.  This version contains an `age` field as well as the others we saw earlier.

In [2]:
patients = pd.read_csv('data/patients-with-age.csv', 
                       parse_dates=['date'])
patients

Unnamed: 0,name,date,weight(kg),height(cm),age
0,Alice,2011-01-01,85.1,170,30
1,Barb,2012-02-02,66.7,160,40
2,Carla,2013-03-03,29.5,120,12
3,Dagmar,2014-04-04,64.2,180,40


A measure called Body Mass Index (BMI) is sometimes used as a quick gauge of overall health.  The "healthy range" is between 18.5 and 24.9, with values below or above those limits considered underweight or overweight, respectively.  There are many caveats in using this measure medically, but the numeric formula is:

$$BMI = \frac{kg}{m^2}$$

We can add this measure to our patients, but we feel that it is only applicable to adults, meaning those aged 18 or over.

In [3]:
bmi = patients['weight(kg)'] / (patients['height(cm)'] * 0.01)**2
# Notice the access both filters rows and adds a new column
patients.loc[patients.age >= 18, "BMI"] = bmi
patients

Unnamed: 0,name,date,weight(kg),height(cm),age,BMI
0,Alice,2011-01-01,85.1,170,30,29.446367
1,Barb,2012-02-02,66.7,160,40,26.054687
2,Carla,2013-03-03,29.5,120,12,
3,Dagmar,2014-04-04,64.2,180,40,19.814815
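
The masked assignment above relies on index alignment: the right-hand Series covers every row, but only the rows selected by the mask receive a value, and the new column is left as NaN elsewhere.  A minimal sketch of the same pattern, using a hypothetical toy frame rather than the patients file:

```python
import pandas as pd

# Hypothetical toy data, not the patients file used above
toy = pd.DataFrame({"age": [30, 12, 45], "weight": [80.0, 30.0, 70.0]})

# A full-length Series computed for every row
doubled = toy["weight"] * 2

# Only rows matching the mask receive values; the rest stay NaN
toy.loc[toy["age"] >= 18, "doubled"] = doubled
print(toy)
```

The middle row (age 12) ends up with a missing value in `doubled`, just as Carla has no BMI.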


We can write this formula in one line, but sometimes you wish to perform a more complex calculation that you would like to encapsulate in a function.

In [4]:
def bmi_calc(row):
    kg = row['weight(kg)']
    m = row['height(cm)'] / 100
    return kg / m**2

In [5]:
# Axis 1 is row-by-row (axis 0 is column-by-column)
patients.apply(bmi_calc, axis=1)

0    29.446367
1    26.054687
2    20.486111
3    19.814815
dtype: float64
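
Note that `.apply()` computed a value for every row, including the 12-year-old.  One way to restore the adults-only rule is to mask the result with `.where()`.  The sketch below uses a hypothetical two-row frame with the same column names, and also shows the equivalent vectorized expression; calling a Python function once per row is typically slower than the vectorized form, so `.apply(axis=1)` is best kept for logic that is hard to vectorize.

```python
import pandas as pd

# Hypothetical toy data using the same column names as `patients`
toy = pd.DataFrame({
    "weight(kg)": [85.1, 29.5],
    "height(cm)": [170, 120],
    "age": [30, 12],
})

def bmi_calc(row):
    # Same formula as above: kilograms divided by meters squared
    return row["weight(kg)"] / (row["height(cm)"] / 100) ** 2

adult = toy["age"] >= 18

# Row-by-row application, then blank out the non-adults with .where()
by_row = toy.apply(bmi_calc, axis=1).where(adult)

# The equivalent fully vectorized expression
vectorized = (toy["weight(kg)"] / (toy["height(cm)"] / 100) ** 2).where(adult)

print(by_row)
```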

# Operations on Dates

Our simple `patients` DataFrame contains several numeric columns, but it also contains a datetime column and string column.  Those have some special "accessors" of their own.  A quirk of Pandas and Python is that even though Pandas knows a column is a special type, you are required to call type-specific methods via these accessors (there are obscure reasons for this relating to Python's runtime model; just take it as given).

Often we would like to manipulate or filter dates in some way that is aware of their decomposition.

In [6]:
patients.date.dt.day_name()

0    Saturday
1    Thursday
2      Sunday
3      Friday
Name: date, dtype: object

In [7]:
# Patients who visited on a weekend
patients[patients.date.dt.day_name().isin(['Saturday', 'Sunday'])]

Unnamed: 0,name,date,weight(kg),height(cm),age,BMI
0,Alice,2011-01-01,85.1,170,30,29.446367
2,Carla,2013-03-03,29.5,120,12,


There is a great deal you can do with datetime accessors, such as accessing the hour, minute, second, or microsecond of the timestamp (where those are used, unlike in this example).  A later module will look at working with time series in much more detail.
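
As a brief illustration of a few other components, the sketch below rebuilds the four visit dates as a standalone Series (so it does not depend on the CSV file) and pulls several fields out through `.dt`:

```python
import pandas as pd

# The four visit dates shown in the patients table above
dates = pd.Series(pd.to_datetime(
    ["2011-01-01", "2012-02-02", "2013-03-03", "2014-04-04"]))

# Each component comes back as a Series aligned with the original
parts = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month,
    "dayofweek": dates.dt.dayofweek,   # Monday=0 ... Sunday=6
})
print(parts)

# A filter that uses a component: visits in the first quarter
first_quarter = dates[dates.dt.quarter == 1]
```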

# Operations on Strings

Another special kind of data you often deal with is strings.  Usually these are stored generically as Python objects internally.  But reading from most kinds of data sources will make those objects consistently strings when that is the kind of field.

As with datetimes, there are many non-numeric things you often want to do with strings.  Basically all the capabilities of Python string methods—and some additional ones—are provided in vectorized versions for Pandas Series (i.e. string columns of DataFrames).

In [8]:
df = patients.copy()
# Set the index to the name column
df.index = df.name
print("How many 'a's are in each patient's name?")
df.name.str.count('a')

How many 'a's are in each patient's name?


name
Alice     0
Barb      1
Carla     2
Dagmar    2
Name: name, dtype: int64

We could solve an earlier problem, using uppercase names for the index, in pure-Pandas style.

In [9]:
df.index = df.name.str.upper()
df

Unnamed: 0_level_0,name,date,weight(kg),height(cm),age,BMI
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
ALICE,Alice,2011-01-01,85.1,170,30,29.446367
BARB,Barb,2012-02-02,66.7,160,40,26.054687
CARLA,Carla,2013-03-03,29.5,120,12,
DAGMAR,Dagmar,2014-04-04,64.2,180,40,19.814815


This is completely silly, but we can replace each lowercase 'a' with an a-ring (å).

In [10]:
df.name.str.replace('a', 'å')

name
ALICE      Alice
BARB        Bårb
CARLA      Cårlå
DAGMAR    Dågmår
Name: name, dtype: object

While single first names are a bit too short and simple for this to be useful, we can split strings in a vectorized way.  For things we may often encounter, like structured or delimited strings, this can be extremely powerful.  Moreover, this split operation can use a full regular-expression pattern, not simply a fixed delimiter; in concept we can divide strings into parts in extremely powerful ways.

Keep in mind, though, that the default operation mode produces a Series of Python lists with the split components.  This is often not the most useful result if the splits represent derived "features" of each row.

In [15]:
split_name = df.name.str.split('a')
print(split_name.dtype, type(split_name.iloc[0]))
split_name

object <class 'list'>


name
ALICE        [Alice]
BARB         [B, rb]
CARLA      [C, rl, ]
DAGMAR    [D, gm, r]
Name: name, dtype: object

Often what you would like instead is to create multiple columns of derived "data" based on the original string.

In [16]:
substrings = df.name.str.split(r'[Aa]', expand=True)
substrings.columns = ['part1', 'part2', 'part3']
substrings

Unnamed: 0_level_0,part1,part2,part3
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
ALICE,,lice,
BARB,B,rb,
CARLA,C,rl,
DAGMAR,D,gm,r
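
Splitting pays off more on delimited strings than on first names.  The sketch below uses hypothetical, inconsistently spaced readings (not part of the patients data) and assumes the default behavior in which a pattern longer than one character is treated as a regular expression:

```python
import pandas as pd

# Hypothetical delimited strings; not part of the patients data
readings = pd.Series(["2019-01-15 / 41.5", "2019-01-16/39.0", "2019-01-17  /  40.2"])

# The pattern is a regular expression: a slash with optional whitespace around it
fields = readings.str.split(r"\s*/\s*", expand=True)
fields.columns = ["day", "temp"]

# Split pieces are still strings, so convert them to useful types
fields["day"] = pd.to_datetime(fields["day"])
fields["temp"] = fields["temp"].astype(float)
print(fields.dtypes)
```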


We can attach these derived columns to the original DataFrame with `.join()`.  Although we've been discussing strings and `.str` methods, `.join()` should here be read strictly in the sense of SQL and relational algebra (not as `str.join()` or a cousin).  That is, it's basically `df LEFT JOIN substrings ON df.index = substrings.index` for folks who know SQL.

In [17]:
df.join(substrings)

Unnamed: 0_level_0,name,date,weight(kg),height(cm),age,BMI,part1,part2,part3
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
ALICE,Alice,2011-01-01,85.1,170,30,29.446367,,lice,
BARB,Barb,2012-02-02,66.7,160,40,26.054687,B,rb,
CARLA,Carla,2013-03-03,29.5,120,12,,C,rl,
DAGMAR,Dagmar,2014-04-04,64.2,180,40,19.814815,D,gm,r
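
In the example above both frames share exactly the same index, so every row finds a match.  A small sketch with hypothetical frames whose indexes only partly overlap shows what the default (left) join and an inner join do with unmatched labels:

```python
import pandas as pd

# Hypothetical frames sharing some, but not all, index labels
left = pd.DataFrame({"age": [30, 40, 12]}, index=["ALICE", "BARB", "CARLA"])
right = pd.DataFrame({"part1": ["B", "C", "D"]}, index=["BARB", "CARLA", "DAGMAR"])

# Default is a left join on the index: every row of `left` is kept,
# and labels missing from `right` get NaN in the new columns
joined = left.join(right)

# An inner join keeps only labels present in both frames
inner = left.join(right, how="inner")
print(joined)
```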


# Exercises



The exercises here are about **cleaning data**.  I happen to have written a very good book about exactly this topic, largely using Pandas to explore it.  Readers might want to check out [_Cleaning Data for Effective Data Science_](https://gnosis.cx/cleaning/), David Mertz, Ph.D. ISBN-13 978-1801071291, 30 March 2021.

I'd love it if you want to buy a copy, but you can read it freely online as well.

Let us read in the moderately large NOAA temperature dataset we worked with before.  Our naïve manner of reading it simply allows Pandas to guess types.  For the most part that works well, but there is a date field we want to go back and adjust.

In [18]:
url = ("https://bitbucket.org/davidmertz/sample-data/raw/"
       "61872271984f66e3094c367cf90dfc4875a22e8d/NOAA-2019-partial.csv.gz")
temperatures = pd.read_csv(url)

In [19]:
with show_all_rows():
    print(temperatures.dtypes)

STATION               int64
DATE                 object
LATITUDE            float64
LONGITUDE           float64
ELEVATION           float64
NAME                 object
TEMP                float64
TEMP_ATTRIBUTES       int64
DEWP                float64
DEWP_ATTRIBUTES       int64
SLP                 float64
SLP_ATTRIBUTES        int64
STP                 float64
STP_ATTRIBUTES        int64
VISIB               float64
VISIB_ATTRIBUTES      int64
WDSP                float64
WDSP_ATTRIBUTES       int64
MXSPD               float64
GUST                float64
MAX                 float64
MAX_ATTRIBUTES       object
MIN                 float64
MIN_ATTRIBUTES       object
PRCP                float64
PRCP_ATTRIBUTES      object
SNDP                float64
FRSHTT                int64
dtype: object


In [20]:
# Use the same column name, but do a vectorized conversion to datetime
# Note also that we can *access* the column in attribute style,
#  ... but it is safest to *set* it using the dictionary style
#  (attribute-style assignment cannot create a new column)
temperatures['DATE'] = pd.to_datetime(temperatures.DATE)
temperatures.dtypes.iloc[:3]

STATION              int64
DATE        datetime64[ns]
LATITUDE           float64
dtype: object

### Data fields

A description of this data set can be found at: [Global Surface Summary of the Day](https://data.nodc.noaa.gov/cgi-bin/iso?id=gov.noaa.ncdc:C00516).  This gives details on *some* of the fields in the data, but is a bit incomplete.  NOAA is a wonderful agency, but data is always dirty.

In particular, being a United States agency, most of the data are in Imperial Units, and you probably want SI units to make sense of the data.  We would like to create a new DataFrame called `temperatures_SI` that contains the data in `temperatures`, but using better units:

* Mean temperature: Fahrenheit → Celsius (or Kelvin)
* Mean dew point: Fahrenheit → Celsius (or Kelvin)
* Mean sea level pressure: millibars → hectopascals
* Mean station pressure: millibars → hectopascals
* Mean visibility: miles → kilometers
* Mean wind speed: knots → m/s
* Maximum sustained wind speed: knots → m/s
* Maximum wind gust: knots → m/s
* Maximum temperature: Fahrenheit → Celsius (or Kelvin)
* Minimum temperature: Fahrenheit → Celsius (or Kelvin)
* Precipitation amount: inches → centimeters
* Snow depth: inches → centimeters

In [None]:
# Conversions here
temperatures_SI = ...
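
As a hint for the general pattern, and not a full solution, the sketch below converts three columns of a hypothetical two-row sample that reuses a few of the dataset's column names.  The conversion factors are standard: 1 knot is 1852 meters per hour and 1 inch is 2.54 cm.  Millibars and hectopascals are numerically identical, so those columns need no arithmetic.

```python
import pandas as pd

# Hypothetical sample rows that reuse a few of the dataset's column names
sample = pd.DataFrame({
    "TEMP": [32.0, 212.0],   # degrees Fahrenheit
    "WDSP": [10.0, 20.0],    # knots
    "PRCP": [0.0, 1.0],      # inches
})

def fahrenheit_to_celsius(f):
    # Works on a scalar or on a whole Series at once
    return (f - 32) * 5 / 9

# Copy first so the original frame keeps its original units
sample_SI = sample.copy()
sample_SI["TEMP"] = fahrenheit_to_celsius(sample["TEMP"])
sample_SI["WDSP"] = sample["WDSP"] * 1852 / 3600   # knots to m/s
sample_SI["PRCP"] = sample["PRCP"] * 2.54          # inches to cm
print(sample_SI)
```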

### Exploring the data

While the column names provided are *suggestive* of their full names given in the metadata description, often very abbreviated names are used.  Try to verify using the data itself that you have made the right assumptions about the field meanings.

For example, you can be pretty confident that the *maximum* of something should almost always be more than the *minimum* of that same thing.  However, datasets are always dirty, so *almost always* may have to suffice for this verification if *always* does not apply.  

For instance, if `FOO` values are typically in the range of 10-20 and you see a value of 1,000,000, it is almost surely a data error rather than an unusual fluctuation in the underlying phenomenon.  Whether it is an error in collection, transcription, conversion, or of some other kind, is much harder to determine.

In [None]:
print("Size of dataset:", len(temperatures))
print("MAX more than MIN", (temperatures.MAX > temperatures.MIN).sum())
print("MIN more than MAX", (temperatures.MIN > temperatures.MAX).sum())
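
One thing to watch for when reading these counts: the two comparisons need not add up to the size of the dataset.  Any comparison involving a missing value is False in both directions, and ties are also False in both.  The sketch below demonstrates this on hypothetical readings, and shows how to pull out the suspicious rows for a closer look:

```python
import numpy as np
import pandas as pd

# Hypothetical readings, including a missing value and a tie
toy = pd.DataFrame({
    "MAX": [50.0, 40.0, np.nan, 30.0],
    "MIN": [45.0, 42.0, 20.0, 30.0],
})

more = (toy.MAX > toy.MIN).sum()      # rows where MAX exceeds MIN
less = (toy.MIN > toy.MAX).sum()      # rows where MIN exceeds MAX
unaccounted = len(toy) - more - less  # ties and rows with missing values

# Pull out the suspicious rows themselves for a closer look
suspicious = toy[toy.MIN > toy.MAX]
print(more, less, unaccounted)
```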

However, although your answer may not be definitive, try to characterize the nature of the problem data.  Do so both for the above provided example, and then for other abnormalities you detect in your own validation.

In [None]:
# Perform more data validation
...
# As you perform this, describe to your neighbor or in this notebook
# what expectations you have for analyses and what the results are.

### Undocumented data field

**Extra Credit**: As of this writing, the units used for `ELEVATION` do not seem to be documented at the metadata description page.  Can you determine the units based on other data you have available?  The most likely units seem to be feet or meters (where 1 meter is approximately 3.28 feet; and in the USA, elevations are most often given in feet).

In [None]:
# Deduce units for a field
...

### Prepare data for scientific hypothesis

Let us state a silly and simplified scientific goal.  The point here is to practice Pandas, not to do actual meteorology or climate modeling.  We have a theory that the underlying temperature inside the Arctic Circle would be 5% higher (in degrees Celsius) if not for wintertime albedo.  Pretend winter means January/February/March for this purpose.

Create a DataFrame `temperatures_theoretical` based on `temperatures_SI` in which all temperatures have been adjusted to match our model.  

You can partially check your work by making sure that 3.14% of the measurements are adjusted in this exercise.  The fact that the target ratio is very close to pi is either a curious coincidence or of deep numerological significance.

In [None]:
# Adjust arctic temperatures in winter by 5% ℃
...
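
As a hint, the selection combines a numeric filter with a date-component filter.  The sketch below works on a hypothetical four-row frame and makes two assumptions that the exercise leaves open: the Arctic Circle is taken as roughly 66.56 degrees north, and "5% higher" is read as multiplying by 1.05.  You may decide a different reading is better for negative Celsius values, since multiplying them by 1.05 makes them numerically lower.

```python
import pandas as pd

# Hypothetical rows reusing the dataset's column names; TEMP is already in Celsius
toy = pd.DataFrame({
    "DATE": pd.to_datetime(["2019-01-10", "2019-07-10", "2019-02-20", "2019-03-05"]),
    "LATITUDE": [70.0, 70.0, 45.0, 80.0],
    "TEMP": [-20.0, 5.0, 2.0, -10.0],
})

# Assumption: the Arctic Circle is taken as roughly 66.56 degrees north
ARCTIC_CIRCLE = 66.56

# Combine a numeric condition with a date-component condition
mask = (toy.LATITUDE > ARCTIC_CIRCLE) & toy.DATE.dt.month.isin([1, 2, 3])

# Copy, then scale only the selected rows (one reading of "5% higher")
adjusted = toy.copy()
adjusted.loc[mask, "TEMP"] = adjusted.loc[mask, "TEMP"] * 1.05

# Fraction of rows touched, useful as a sanity check
print(mask.mean())
```

On the real data, `mask.mean()` gives the fraction of adjusted measurements to compare against the target mentioned above.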