# Pandas intro class 02

Objective:
- Data exploration with methods

There are many ways to do things in `pandas`. Generally, we want our work to be:
- simple
- explicit
- easy to read
- efficient

In [3]:
import pandas as pd
pd.set_option('display.max_columns', 100)

college = pd.read_csv('data/college.csv')

In [4]:
college.head()

,instnm,city,stabbr,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,ugds,ugds_white,ugds_black,ugds_hisp,ugds_asian,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
3,University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
4,Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


### Missing values
The `isna` and `isnull` methods are aliases: both check whether each value in the DataFrame is missing. The result is always a DataFrame (or Series) of booleans, with `True` marking a missing value.

The `notna` and `notnull` methods apply the opposite logic and flag the values that are not missing.

Since other methods such as `dropna` and `fillna` use `na` in their names, you might prefer to stick with the `na` family throughout.
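As a minimal sketch of how these masks behave, the example below uses a small made-up DataFrame (the `toy` name and its values are assumptions for illustration, not part of the college data). It shows that `isna` and `notna` are exact opposites and that averaging the boolean mask gives a missing rate per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the college dataset
toy = pd.DataFrame({
    'score': [410.0, np.nan, 575.0, np.nan],
    'state': ['AL', 'AK', None, 'AZ'],
})

missing = toy.isna()    # True where a value is missing
present = toy.notna()   # True where a value is present

# The two masks are exact opposites of each other
assert missing.equals(~present)

# Booleans average to a proportion, so mean() gives the missing rate per column
missing_rate = missing.mean()
print(missing_rate)
```

Both `NaN` and `None` count as missing here, which is why the `state` column reports a non-zero rate.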

In [5]:
college.shape

(7535, 27)

In [6]:
# variable missing rate
1 - college['satvrmid'].notna().mean()

0.8427339084273391

In [8]:
college.isna().mean()

instnm                0.000000
city                  0.000000
stabbr                0.000000
hbcu                  0.049237
menonly               0.049237
womenonly             0.049237
relaffil              0.000000
satvrmid              0.842734
satmtmid              0.841274
distanceonly          0.049237
ugds                  0.087724
ugds_white            0.087724
ugds_black            0.087724
ugds_hisp             0.087724
ugds_asian            0.087724
ugds_aian             0.087724
ugds_nhpi             0.087724
ugds_2mor             0.087724
ugds_nra              0.087724
ugds_unkn             0.087724
pptug_ef              0.090511
curroper              0.000000
pctpell               0.091042
pctfloan              0.091042
ug25abv               0.108427
md_earn_wne_p10       0.148905
grad_debt_mdn_supp    0.004247
dtype: float64

In [9]:
college.shape

(7535, 27)

In [13]:
college.dropna().shape

(1171, 27)

In [12]:
college.dropna(subset=['satvrmid']).shape

(1185, 27)

In [15]:
college.dropna(axis=1, how='all').shape

(7535, 27)

In [19]:
# fill missing values using the backfill method
# (`fillna(method='backfill')` is deprecated in recent pandas; `bfill` is the equivalent)
# college.loc[college['satvrmid'].isna(), 'satvrmid'] = 600

college.bfill().head()

,instnm,city,stabbr,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,ugds,ugds_white,ugds_black,ugds_hisp,ugds_asian,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,1,595.0,590.0,1.0,291.0,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
3,University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
4,Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


In [20]:
# impute missing values using median
satvrmid_med = college['satvrmid'].median()

In [22]:
satvrmid_med

510.0

In [21]:
college['satvrmid'].fillna(value=satvrmid_med)

0       424.0
1       570.0
2       510.0
3       595.0
4       425.0
        ...  
7530    510.0
7531    510.0
7532    510.0
7533    510.0
7534    510.0
Name: satvrmid, Length: 7535, dtype: float64

In [27]:
college['satvrmid'].fillna(value=satvrmid_med).value_counts()

510.0    6382
495.0      52
475.0      43
500.0      39
470.0      37
         ... 
483.0       1
623.0       1
638.0       1
466.0       1
591.0       1
Name: satvrmid, Length: 163, dtype: int64

In real life, missing values matter because they can indicate systematic gaps or bias in the data. Most of the time, values are not missing at random. It's good practice to check for missing values right after reading in the data, before doing anything else with it.
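One simple way to probe whether values are missing at random is to compare the missing rate across groups of another variable. The sketch below uses made-up data (the `toy` DataFrame and its values are assumptions for illustration); a large gap between groups suggests the missingness is systematic.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'sat' is missing far more often for distance-only schools
toy = pd.DataFrame({
    'distanceonly': [0, 0, 0, 1, 1, 1],
    'sat': [500.0, 520.0, np.nan, np.nan, np.nan, np.nan],
})

# Flag missing rows, then compare the missing rate across groups
toy['sat_missing'] = toy['sat'].isna()
rate_by_group = toy.groupby('distanceonly')['sat_missing'].mean()
print(rate_by_group)
```

The same pattern could be tried on the real data by grouping on a column such as `distanceonly` or `stabbr`.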

### Describe your DataFrame
A lazy way to get a taste of each column's distribution. Note that missing values are excluded from these statistics (only `count` hints at them), so you might want to look at them separately or handle them before describing your DataFrame.
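To make the point about missing values concrete, here is a small sketch on a made-up Series (the values are assumptions for illustration). The `count` entry reports only non-missing values, so comparing it with the full length recovers the number of missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing entries
s = pd.Series([400.0, 500.0, np.nan, 600.0, np.nan], name='sat')

summary = s.describe()
print(summary)

# 'count' only counts non-missing values, so compare it with the full length
n_missing = len(s) - int(summary['count'])
print(n_missing)
```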

In [30]:
college.describe(include='all')

,instnm,city,stabbr,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,ugds,ugds_white,ugds_black,ugds_hisp,ugds_asian,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp
count,7535,7535,7535,7164.0,7164.0,7164.0,7535.0,1185.0,1196.0,7164.0,6874.0,6874.0,6874.0,6874.0,6874.0,6874.0,6874.0,6874.0,6874.0,6874.0,6853.0,7535.0,6849.0,6849.0,6718.0,6413,7503
unique,7535,2514,59,,,,,,,,,,,,,,,,,,,,,,,598,2038
top,Alabama A & M University,New York,CA,,,,,,,,,,,,,,,,,,,,,,,PrivacySuppressed,PrivacySuppressed
freq,1,87,773,,,,,,,,,,,,,,,,,,,,,,,822,1510
mean,,,,0.014238,0.009213,0.005304,0.190975,522.819409,530.76505,0.005583,2356.83794,0.510207,0.189997,0.161635,0.033544,0.013813,0.004569,0.02395,0.016086,0.045181,0.226639,0.923291,0.530643,0.522211,0.410021,,
std,,,,0.118478,0.095546,0.072642,0.393096,68.578862,73.469767,0.074519,5474.275871,0.286958,0.224587,0.221854,0.073777,0.070196,0.033125,0.031288,0.050172,0.09344,0.24647,0.266146,0.225544,0.283616,0.228939,,
min,,,,0.0,0.0,0.0,0.0,290.0,310.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,
25%,,,,0.0,0.0,0.0,0.0,475.0,482.0,0.0,117.0,0.2675,0.036125,0.0276,0.0025,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.3578,0.3329,0.2415,,
50%,,,,0.0,0.0,0.0,0.0,510.0,520.0,0.0,412.5,0.5557,0.10005,0.0714,0.0129,0.0026,0.0,0.0175,0.0,0.0143,0.1504,1.0,0.5215,0.5833,0.40075,,
75%,,,,0.0,0.0,0.0,0.0,555.0,565.0,0.0,1929.5,0.747875,0.2577,0.198875,0.0327,0.0073,0.0025,0.0339,0.0117,0.0454,0.3769,1.0,0.7129,0.745,0.572275,,


In [31]:
# explore quantiles, p99 and p1
college['satmtmid'].quantile(.99), college['satmtmid'].quantile(.01)

(745.2499999999998, 395.0)

In [35]:
# frequency table as %
college['stabbr'].value_counts()/college['stabbr'].shape[0]

CA    0.102588
TX    0.062641
NY    0.060916
FL    0.057863
PA    0.052289
OH    0.046715
IL    0.039814
MI    0.027472
NC    0.027074
MA    0.025747
MO    0.025614
VA    0.024552
GA    0.024419
TN    0.024154
NJ    0.021898
IN    0.021367
MN    0.020438
PR    0.019642
OK    0.019376
AZ    0.017651
CO    0.016589
WA    0.016324
LA    0.016058
WI    0.014864
SC    0.014731
KY    0.014068
CT    0.013537
KS    0.013139
MD    0.013139
AL    0.012741
OR    0.012342
IA    0.012210
AR    0.011413
UT    0.010219
WV    0.009688
MS    0.008361
NM    0.006768
NE    0.006503
NV    0.006105
ME    0.005707
NH    0.005441
ID    0.005309
MT    0.004247
SD    0.004114
ND    0.003849
VT    0.003583
DC    0.003451
HI    0.003451
RI    0.003185
DE    0.002522
WY    0.001460
AK    0.001327
GU    0.000398
VI    0.000265
AS    0.000133
MP    0.000133
FM    0.000133
PW    0.000133
MH    0.000133
Name: stabbr, dtype: float64
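The frequency table above divides the counts by the number of rows. An alternative, sketched here on a made-up Series (the values are assumptions for illustration), is the `normalize` parameter of `value_counts`. The two approaches agree only when the column has no missing values, because `value_counts` drops them by default.

```python
import pandas as pd

# Hypothetical toy column of state abbreviations with one missing entry
states = pd.Series(['CA', 'TX', 'CA', 'NY', 'CA', None])

# normalize=True returns proportions of the non-missing values
shares = states.value_counts(normalize=True)

# dropna=False keeps missing values as their own category
shares_with_na = states.value_counts(normalize=True, dropna=False)

print(shares)
print(shares_with_na)
```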

### Arithmetic and Comparison Operators

All arithmetic operators have corresponding methods that function similarly.

* `+` - `add`
* `-` - `sub` and `subtract`
* `*` - `mul` and `multiply`
* `/` - `div`, `divide` and `truediv`
* `**` - `pow`
* `//` - `floordiv`
* `%` - `mod`

All the comparison operators also have corresponding methods.

* `>` - `gt`
* `<` - `lt`
* `>=` - `ge`
* `<=` - `le`
* `==` - `eq`
* `!=` - `ne`

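Logical operators also work element-wise on boolean data. Unlike the operators above, they have no named method equivalents, and Python's `and`, `or`, and `not` keywords do not work element-wise.

* `&` - and
* `|` - or
* `~` - not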
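A short sketch of comparison and logical operators on a made-up Series (the values are assumptions for illustration). It checks that `>` and `gt` agree and shows how to combine conditions element-wise.

```python
import pandas as pd

s = pd.Series([50, 150, 250, 350])  # hypothetical values

# A comparison operator and its method give the same boolean Series
assert (s > 100).equals(s.gt(100))

# Combine conditions element-wise; the parentheses are required
# because & and | bind more tightly than comparison operators
in_range = (s > 100) & (s <= 300)
out_of_range = ~in_range

print(in_range.tolist())
print(out_of_range.tolist())
```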

Let's select the undergraduate population (`ugds`) column as a Series, add 100 to it, and verify that the plus operator and its corresponding method, `add`, give the same result.

In [36]:
ugds = college['ugds']
ugds_operator = ugds + 100
ugds_method = ugds.add(100)
ugds_operator.equals(ugds_method)

True

### Operators and methods may perform differently on your data
Usually operators and methods are interchangeable, but let's see an example where we absolutely need the method to complete the task. The college dataset contains 9 consecutive columns holding the proportion of the undergraduate population by race. The first column is `ugds_white` and the last is `ugds_unkn`. Let's select these columns into their own DataFrame.

In [37]:
college_idx = college.set_index('instnm')

In [38]:
college_race = college_idx.loc[:, 'ugds_white':'ugds_unkn'] # challenge: all columns that start with 'ugds_'
college_race.head()

instnm,ugds_white,ugds_black,ugds_hisp,ugds_asian,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn
Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
University of Alabama at Birmingham,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
Amridge University,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715
University of Alabama in Huntsville,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035
Alabama State University,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137


In [39]:
ugds = college_idx['ugds']
ugds.head()

instnm
Alabama A & M University                4206.0
University of Alabama at Birmingham    11383.0
Amridge University                       291.0
University of Alabama in Huntsville     5451.0
Alabama State University                4811.0
Name: ugds, dtype: float64

We want to answer the question: how many undergraduate students of each race attend each school? Since we have the proportions and the total number, we can get the answer by multiplying them.

We then multiply the `college_race` DataFrame by this Series. Intuitively, this seems like it should work, but it doesn't.

In [40]:
college_race.shape, ugds.shape

((7535, 9), (7535,))

In [41]:
df_attempt = college_race * ugds
df_attempt.shape

(7535, 7544)

In [44]:
df_attempt.head()

instnm,A & W Healthcare Educators,A T Still University of Health Sciences,ABC Beauty Academy,ABC Beauty College Inc,AI Miami International University of Art and Design,AIB College of Business,AOMA Graduate School of Integrative Medicine,ASA College,ASI Career Institute,ASM Beauty World Academy,ATA Career Education,ATA College,ATEP at IVC,ATI College-Norwalk,ATS Institute of Technology,AVTEC-Alaska's Institute of Technology,Aaniiih Nakoda College,Aaron's Academy of Beauty,Abcott Institute,Abdill Career College Inc,Abilene Christian University,Abington Memorial Hospital Dixon School of Nursing,Abraham Baldwin Agricultural College,Academia Serrant Inc,Academy College,Academy Di Capelli-School of Cosmetology,Academy di Firenze,Academy for Careers and Technology,Academy for Five Element Acupuncture,Academy for Jewish Religion-California,Academy for Nursing and Health Occupations,Academy for Salon Professionals,Academy of Art University,Academy of Career Training,Academy of Careers and Technology,Academy of Chinese Culture and Health Sciences,Academy of Cosmetology,Academy of Cosmetology and Esthetics NYC,Academy of Couture Art,Academy of Esthetics and Cosmetology,Academy of Hair Design-Beaumont,Academy of Hair Design-Grenada,Academy of Hair Design-Jackson,Academy of Hair Design-Jasper,Academy of Hair Design-Las Vegas,Academy of Hair Design-Lufkin,Academy of Hair Design-Oklahoma City,Academy of Hair Design-Pearl,Academy of Hair Design-Salem,Academy of Hair Design-Springfield,...,Yeshiva College of the Nations Capital,Yeshiva D'monsey Rabbinical College,Yeshiva Derech Chaim,Yeshiva Gedolah Imrei Yosef D'spinka,Yeshiva Gedolah Kesser Torah,Yeshiva Gedolah Zichron Leyma,Yeshiva Gedolah of Greater Detroit,Yeshiva Karlin Stolin,Yeshiva Ohr Elchonon Chabad West Coast Talmudical Seminary,Yeshiva Shaar Hatorah,Yeshiva Shaarei Torah of Rockland,Yeshiva Toras Chaim,Yeshiva University,Yeshiva Yesodei Hatorah,Yeshiva and Kollel Harbotzas Torah,Yeshiva of Far Rockaway Derech Ayson Rabbinical Seminary,Yeshiva of Machzikai Hadas,Yeshiva of Nitra Rabbinical College,Yeshiva of the Telshe Alumni,Yeshivah Gedolah Rabbinical College,Yeshivas Be'er Yitzchok,Yeshivas Novominsk,Yeshivat Mikdash Melech,Yeshivath Beth Moshe,Yeshivath Viznitz,Yeshivath Zichron Moshe,Yo San University of Traditional Chinese Medicine,York College,York College Pennsylvania,York County Community College,York County School of Technology-Adult & Continuing Education,York Technical College,Yorktowne Business Institute,Young Harris College,Youngstown State University,Yuba College,Yukon Beauty College Inc,Z Hair Academy,Zane State College,duCret School of Arts,eClips School of Cosmetology and Barbering,ugds_2mor,ugds_aian,ugds_asian,ugds_black,ugds_hisp,ugds_nhpi,ugds_nra,ugds_unkn,ugds_white
Alabama A & M University,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
University of Alabama at Birmingham,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Amridge University,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
University of Alabama in Huntsville,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Alabama State University,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


##### Why this is happening 
Whenever an operation takes place between two Pandas objects, the index and/or columns of the two objects are **automatically aligned**. In the operation above, we multiplied the `college_race` **DataFrame** by the `ugds` **Series**. Pandas automatically (implicitly) aligned the **columns** of `college_race` with the **index** values of `ugds`.

None of the `college_race` columns match the index values of `ugds`. Pandas performs the alignment as an **outer join**, keeping all labels whether or not they match. The result is a large DataFrame in which every value is missing. You can scroll all the way to the right to view the original column names of the `college_race` DataFrame.
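The alignment behaviour described above can be reproduced on a tiny made-up example (the `df` and `totals` names and their values are assumptions for illustration). Because the DataFrame's column labels and the Series' index labels share nothing, the result holds the union of both label sets and every cell is missing.

```python
import pandas as pd

# Hypothetical toy data: proportions per school and a total per school
df = pd.DataFrame(
    {'x': [0.25, 0.5], 'y': [0.75, 0.5]},
    index=['school_a', 'school_b'],
)
totals = pd.Series([100, 200], index=['school_a', 'school_b'])

# The operator aligns df's COLUMNS ('x', 'y') with the Series INDEX
# ('school_a', 'school_b'); no labels match, so everything becomes NaN
result = df * totals
print(result)
```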

##### Change the direction of the alignment with a method
Each operator works in one fixed way; we cannot change how the multiplication operator, `*`, behaves. Methods, on the other hand, have parameters that control how the operation takes place.

###### Use the `axis` parameter of the `mul` method
All the methods that correspond to the operators listed above have an `axis` parameter that allows us to change the direction of the alignment. So, instead of aligning the columns of a DataFrame to the index of a Series, we can align the index of a DataFrame to the index of a Series. Let's do that now so that we can find the answer to our problem from above.
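As a minimal sketch of the `axis` parameter on made-up data (the `df` and `totals` names and their values are assumptions for illustration), passing `axis='index'` matches the Series labels against the DataFrame's row labels, so each row is scaled by its own total.

```python
import pandas as pd

# Hypothetical toy data: proportions per school and a total per school
df = pd.DataFrame(
    {'x': [0.25, 0.5], 'y': [0.75, 0.5]},
    index=['school_a', 'school_b'],
)
totals = pd.Series([100, 200], index=['school_a', 'school_b'])

# axis='index' (or axis=0) aligns the Series with the DataFrame's row labels
counts = df.mul(totals, axis='index')
print(counts)
```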

In [45]:
df_correct = college_race.mul(ugds, axis='index').round(0).fillna(0).astype(int)
df_correct.head()

instnm,ugds_white,ugds_black,ugds_hisp,ugds_asian,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn
Alabama A & M University,140,3934,23,8,10,8,0,25,58
University of Alabama at Birmingham,6741,2960,322,590,25,8,419,204,114
Amridge University,87,122,2,1,0,0,0,0,79
University of Alabama in Huntsville,3809,684,208,205,78,1,94,181,191
Alabama State University,76,4430,58,9,5,3,47,117,66


In [46]:
df_correct.shape

(7535, 9)

In [47]:
# another example
s1 = pd.Series(index=['a', 'b', 'c', 'e'], data=[4, 8, 3, 5])
s2 = pd.Series(index=['a', 'b', 'd'], data=[2, 1, 9])

In [48]:
s1

a    4
b    8
c    3
e    5
dtype: int64

In [49]:
s2

a    2
b    1
d    9
dtype: int64

In [50]:
s1 + s2

a    6.0
b    9.0
c    NaN
d    NaN
e    NaN
dtype: float64

In [51]:
s1.add(s2, fill_value=0)

a    6.0
b    9.0
c    3.0
d    9.0
e    5.0
dtype: float64

## Built-in functions vs Pandas methods with the same name
There are a few DataFrame/Series methods that share a name with a built-in Python function and, on data without missing values, return the same result. They are:
* `sum`
* `min`
* `max`
* `abs`

Let's test them out on a single column of data and compare how long each one takes.

In [65]:
%timeit -n 5 sum(ugds)

273 µs ± 66.2 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)


In [66]:
%timeit -n 5 ugds.sum()

162 µs ± 95.8 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)


In [67]:
%timeit -n 5 max(ugds)

342 µs ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)


In [68]:
%timeit -n 5 ugds.max()

142 µs ± 69.7 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)


The timings above show clear performance differences for `sum` and `max`; the same holds for `min`. Completely different code is executed when the built-in Python functions are used as opposed to when the Pandas method is called. Calling `sum(ugds)` essentially runs a Python for loop that iterates through each value one at a time. Calling `ugds.sum()`, on the other hand, executes the internal Pandas `sum` method, which runs in compiled code and is much faster than iterating with a Python for loop.

**Takeaway**: prefer Pandas methods over Python built-in functions with the same name.
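Speed is not the only difference. As a small sketch on made-up values (an assumption for illustration), the built-in function and the method can also disagree when the data contains missing values, because the Pandas method skips `NaN` by default while the built-in lets it propagate.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])  # hypothetical values

builtin_total = sum(s)    # plain Python loop: NaN propagates into the total
method_total = s.sum()    # pandas skips NaN by default (skipna=True)

print(builtin_total)
print(method_total)

# skipna=False makes the method propagate NaN as well
print(s.sum(skipna=False))
```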

### Let's work through an example: Calculating the z-scores of each school
Let's do a slightly more complex example. Below, we set the index to be the institution name and then select both of the SAT columns.

In [52]:
college_idx = college.set_index('instnm')
sats = college_idx[['satmtmid', 'satvrmid']].dropna()
sats.head()

instnm,satmtmid,satvrmid
Alabama A & M University,420.0,424.0
University of Alabama at Birmingham,565.0,570.0
University of Alabama in Huntsville,590.0,595.0
Alabama State University,430.0,425.0
The University of Alabama,565.0,555.0


In [53]:
college.shape, sats.shape

((7535, 27), (1184, 2))

Let's say we are interested in finding the z-score for each college's SAT score. To calculate this, we would need to subtract the mean and divide by the standard deviation. Let's do that first with operators.

In [54]:
mean = sats.mean()
mean

satmtmid    530.958615
satvrmid    522.775338
dtype: float64

In [55]:
std = sats.std()
std

satmtmid    73.645153
satvrmid    68.591051
dtype: float64

In [58]:
type(std)

pandas.core.series.Series

In [80]:
sats.head()

instnm,satmtmid,satvrmid
Alabama A & M University,420.0,424.0
University of Alabama at Birmingham,565.0,570.0
University of Alabama in Huntsville,590.0,595.0
Alabama State University,430.0,425.0
The University of Alabama,565.0,555.0


In [59]:
zscore_operator = (sats - mean) / std
zscore_operator.head()

instnm,satmtmid,satvrmid
Alabama A & M University,-1.506666,-1.440062
University of Alabama at Birmingham,0.462235,0.688496
University of Alabama in Huntsville,0.801701,1.052975
Alabama State University,-1.370879,-1.425482
The University of Alabama,0.462235,0.469809


Let's repeat this with the methods and verify equality.

In [60]:
# equivalent
zscore_methods = sats.sub(mean).div(std)
zscore_operator.equals(zscore_methods)

True
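A quick sanity check for z-scores, sketched on made-up data (the `scores` DataFrame is an assumption for illustration): each standardized column should have mean 0 and standard deviation 1. Note that the Pandas `std` method defaults to the sample standard deviation (`ddof=1`); pass `ddof=0` if you want the population version.

```python
import numpy as np
import pandas as pd

# Hypothetical toy scores
scores = pd.DataFrame({
    'math': [400.0, 500.0, 600.0],
    'verbal': [450.0, 500.0, 700.0],
})

# Standardize each column with the method chain used above
z = scores.sub(scores.mean()).div(scores.std())

# Each standardized column should have mean 0 and sample std 1
print(z.mean().round(10))
print(z.std().round(10))

# Population version: use ddof=0 consistently
z_pop = scores.sub(scores.mean()).div(scores.std(ddof=0))
print(z_pop.std(ddof=0).round(10))
```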

### Exercise 3: Which columns in the `college` dataset contain missing values? What are their missing rates (%)?

### Exercise 4: Is there a statistically significant difference between the average `md_earn_wne_p10` (median earnings 10 years after enrollment) for schools in Texas and California?

Hint: we can see this problem as a [two-sample t-test](https://online.stat.psu.edu/stat555/node/36/)
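To show the mechanics of a two-sample t-test without giving away the exercise, the sketch below runs `scipy.stats.ttest_ind` on synthetic samples (the group names, means, and sizes are assumptions for illustration and are not values from the dataset). Setting `equal_var=False` requests Welch's version of the test, which does not assume the two groups share a variance.

```python
import numpy as np
import scipy.stats as st

# Synthetic samples for illustration only; these are NOT values from the dataset
rng = np.random.default_rng(0)
group_a = rng.normal(loc=40000, scale=8000, size=200)
group_b = rng.normal(loc=43000, scale=9000, size=180)

# equal_var=False requests Welch's t-test (no equal-variance assumption)
result = st.ttest_ind(group_a, group_b, equal_var=False)
print(result.statistic, result.pvalue)
```

A small p-value would suggest the two group means differ. With real data, missing values need to be dropped first or handled through the `nan_policy` parameter.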


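In [None]:
# pd.to_numeric converts values to numbers; errors='coerce' turns
# non-numeric entries (such as 'PrivacySuppressed') into NaN, e.g.
# pd.to_numeric(college['md_earn_wne_p10'], errors='coerce')

# example code for p-value calculation if you'd like to work from z-scores:
# norm.cdf gives the lower-tail probability under the standard normal
import scipy.stats as st

st.norm.cdf(-1.506666)

# you may also find scipy's two-sample t-test useful; it takes the two samples, e.g.
# st.ttest_ind(sample_a, sample_b)  # sample_a and sample_b are placeholders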