<a id='index'></a>

# More Data Processing with Pandas

## Index

[Merging DataFrames](#merging)

1. [merge](#c-merge)
    - [how=](#c-how)
    - On
        - [Single Column](#c-on-single-col)
        - [Multiple Columns](#c-on-mul-col)
1. [concat](#c-concat)

---

[Pandas Idioms](#idioms)

1. [Pandorable (Pythonic)](#c-pythonic)
1. [apply](#c-apply)

---

[Group By](#group)

- Splitting
    - [groupby](#c-groupby)

- Aggregation
    - [agg](#c-agg)

- Transformation
    - [transform](#c-transform)

- Filtering
    - [filter](#c-filter)

---

[Scales](#scaling)

1. [astype](#c-astype)
1. [cut](#c-cut)

---

[Pivot Table](#pivot-table)

1. [pivot_table](#c-pivot)
1. [stack](#c-stack)
1. [unstack](#c-unstack)

---

[Date/Time Functionality](#date-time)

1. [Timestamp](#c-timestamp)
1. [Period](#c-period)
1. [DatetimeIndex & PeriodIndex](#c-types)
1. [to_datetime](#c-to-datetime)
1. [Timedelta](#c-timedelta)
1. [weekday](#c-weekday)
1. [offsets](#c-offsets)

- Working with Dates in a DataFrame
    1. [date_range](#c-date-range)
    1. [day_name](#c-day-name)
    1. [diff](#c-diff)
    1. [resample](#c-resample)
    1. [loc](#c-loc)

<a id='merging'></a>

[Index](#index)

## Merging DataFrames

---

In [1]:
import pandas as pd

In [2]:
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR'},
                         {'Name': 'Sally', 'Role': 'Course liasion'},
                         {'Name': 'James', 'Role': 'Grader'}])
staff_df.set_index('Name', inplace=True)

student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business'},
                           {'Name': 'Mike', 'School': 'Law'},
                           {'Name': 'Sally', 'School': 'Engineering'}])
student_df.set_index('Name', inplace=True)
print(staff_df)
print(student_df)

                 Role
Name                 
Kelly  Director of HR
Sally  Course liasion
James          Grader
            School
Name              
James     Business
Mike           Law
Sally  Engineering


<a id='c-merge'></a>

[Index](#index)

### `merge`

<a id='c-how'></a>

[Index](#index)

#### `how=`

- outer
- inner
- left
- right
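As a quick way to see what each `how=` option keeps, the sketch below merges two tiny frames and records which names survive each join. The frames `left` and `right` are made up for illustration; `indicator=True` adds a `_merge` column that says whether a row came from the left frame, the right frame, or both.

```python
import pandas as pd

# Hypothetical toy frames, indexed by name.
left = pd.DataFrame({'Role': ['Director', 'Liaison', 'Grader']},
                    index=pd.Index(['Kelly', 'Sally', 'James'], name='Name'))
right = pd.DataFrame({'School': ['Business', 'Law', 'Engineering']},
                     index=pd.Index(['James', 'Mike', 'Sally'], name='Name'))

# Collect the set of names kept by each join type.
kept = {}
for how in ['outer', 'inner', 'left', 'right']:
    merged = pd.merge(left, right, how=how,
                      left_index=True, right_index=True, indicator=True)
    kept[how] = set(merged.index)
    # _merge is 'left_only', 'right_only', or 'both' for each row.
    print(how, merged['_merge'].to_dict())
```

An outer join keeps the union of the two indexes, an inner join keeps their intersection, and left/right joins keep every row of the named side.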

In [3]:
pd.merge(staff_df, student_df, how='outer', left_index=True, right_index=True)

| Name | Role | School |
|---|---|---|
| James | Grader | Business |
| Kelly | Director of HR | NaN |
| Mike | NaN | Law |
| Sally | Course liasion | Engineering |


In [4]:
pd.merge(staff_df, student_df, how='inner', left_index=True, right_index=True)

| Name | Role | School |
|---|---|---|
| Sally | Course liasion | Engineering |
| James | Grader | Business |


In [5]:
pd.merge(staff_df, student_df, how='left', left_index=True, right_index=True)

| Name | Role | School |
|---|---|---|
| Kelly | Director of HR | NaN |
| Sally | Course liasion | Engineering |
| James | Grader | Business |


In [6]:
pd.merge(staff_df, student_df, how='right', left_index=True, right_index=True)

| Name | Role | School |
|---|---|---|
| James | Grader | Business |
| Mike | NaN | Law |
| Sally | Course liasion | Engineering |


In [7]:
staff_df.reset_index(), student_df.reset_index()

(    Name            Role
 0  Kelly  Director of HR
 1  Sally  Course liasion
 2  James          Grader,
     Name       School
 0  James     Business
 1   Mike          Law
 2  Sally  Engineering)

<a id='c-on-single-col'></a>

[Index](#index)

#### `on=` Single Column

https://stackoverflow.com/questions/20375561/joining-pandas-dataframes-by-column-names

In [8]:
pd.merge(staff_df, student_df, how='right', on='Name')

| Name | Role | School |
|---|---|---|
| James | Grader | Business |
| Mike | NaN | Law |
| Sally | Course liasion | Engineering |


In [9]:
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR','Location': 'State Street'},
                         {'Name': 'Sally', 'Role': 'Course liasion', 'Location': 'Washington Avenue'},
                         {'Name': 'James', 'Role': 'Grader', 'Location': 'Washington Avenue'}])

student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business', 'Location': '1024 Billiard Avenue'},
                           {'Name': 'Mike', 'School': 'Law', 'Location': 'Fraternity House #22'},
                           {'Name': 'Sally', 'School': 'Engineering', 'Location': '512 Wilson Crescent'}])

In [10]:
pd.merge(staff_df, student_df, how='left', on='Name')

| | Name | Role | Location_x | School | Location_y |
|---|---|---|---|---|---|
| 0 | Kelly | Director of HR | State Street | NaN | NaN |
| 1 | Sally | Course liasion | Washington Avenue | Engineering | 512 Wilson Crescent |
| 2 | James | Grader | Washington Avenue | Business | 1024 Billiard Avenue |
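The `_x` and `_y` endings above are the default suffixes `merge` appends when both frames have a non-key column with the same name. A minimal sketch, using made-up frames `staff` and `students`, of how the `suffixes=` parameter can give those columns more readable names:

```python
import pandas as pd

# Hypothetical frames that both carry a 'Location' column.
staff = pd.DataFrame({'Name': ['Kelly', 'Sally'],
                      'Location': ['State Street', 'Washington Avenue']})
students = pd.DataFrame({'Name': ['Sally', 'Mike'],
                         'Location': ['512 Wilson Crescent', 'Fraternity House #22']})

# suffixes= replaces the default ('_x', '_y') on overlapping column names.
merged = pd.merge(staff, students, how='left', on='Name',
                  suffixes=('_staff', '_student'))
print(merged.columns.tolist())
```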


In [11]:
staff_df = pd.DataFrame([{'First Name': 'Kelly', 'Last Name' : 'Desjardins',
                          'Role': 'Director of HR'},
                         {'First Name': 'Sally', 'Last Name' : 'Brooks',
                          'Role': 'Course liasion'},
                         {'First Name': 'James', 'Last Name' : 'Wilde',
                          'Role': 'Grader'}])

student_df = pd.DataFrame([{'First Name': 'James', 'Last Name' : 'Hammond',
                            'School': 'Business'},
                           {'First Name': 'Mike', 'Last Name' : 'Smith',
                            'School': 'Law'},
                           {'First Name': 'Sally', 'Last Name' : 'Brooks',
                            'School': 'Engineering'}])

<a id='c-on-mul-col'></a>

[Index](#index)

#### `on=` Multiple Columns

In [12]:
pd.merge(staff_df, student_df, how='inner', on=['First Name', 'Last Name'])

| | First Name | Last Name | Role | School |
|---|---|---|---|---|
| 0 | Sally | Brooks | Course liasion | Engineering |


In [13]:
%%capture

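# Skip malformed CSV lines; %%capture suppresses the warning messages.
# on_bad_lines='skip' replaces error_bad_lines=False, which was deprecated in pandas 1.3.
df_2011 = pd.read_csv('datasets/MERGED2011_12_PP.csv', on_bad_lines='skip')
df_2012 = pd.read_csv('datasets/MERGED2012_13_PP.csv', on_bad_lines='skip')
# NOTE: this reads the 2012-13 file a second time, so df_2013 duplicates df_2012.
df_2013 = pd.read_csv('datasets/MERGED2012_13_PP.csv', on_bad_lines='skip')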

In [14]:
df_2011.head()

Unnamed: 0,UNITID,OPEID,OPEID6,INSTNM,CITY,STABBR,ZIP,ACCREDAGENCY,INSTURL,NPCURL,...,OMAWDP8_NOTFIRSTTIME_POOLED_SUPP,OMENRUP_NOTFIRSTTIME_POOLED_SUPP,OMENRYP_FULLTIME_POOLED_SUPP,OMENRAP_FULLTIME_POOLED_SUPP,OMAWDP8_FULLTIME_POOLED_SUPP,OMENRUP_FULLTIME_POOLED_SUPP,OMENRYP_PARTTIME_POOLED_SUPP,OMENRAP_PARTTIME_POOLED_SUPP,OMAWDP8_PARTTIME_POOLED_SUPP,OMENRUP_PARTTIME_POOLED_SUPP
0,100654.0,100200.0,1002,Alabama A & M University,Normal,AL,35762,,,,...,,,,,,,,,,
1,100663.0,105200.0,1052,University of Alabama at Birmingham,Birmingham,AL,35294-0110,,,,...,,,,,,,,,,
2,100690.0,2503400.0,25034,Amridge University,Montgomery,AL,36117-3553,,,,...,,,,,,,,,,
3,100706.0,105500.0,1055,University of Alabama in Huntsville,Huntsville,AL,35899,,,,...,,,,,,,,,,
4,100724.0,100500.0,1005,Alabama State University,Montgomery,AL,36104-0271,,,,...,,,,,,,,,,


In [15]:
df_2011.info(), df_2012.info(), df_2013.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15235 entries, 0 to 15234
Columns: 1977 entries, UNITID to OMENRUP_PARTTIME_POOLED_SUPP
dtypes: float64(197), object(1780)
memory usage: 229.8+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7793 entries, 0 to 7792
Columns: 1977 entries, UNITID to OMENRUP_PARTTIME_POOLED_SUPP
dtypes: float64(743), int64(11), object(1223)
memory usage: 117.5+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7793 entries, 0 to 7792
Columns: 1977 entries, UNITID to OMENRUP_PARTTIME_POOLED_SUPP
dtypes: float64(743), int64(11), object(1223)
memory usage: 117.5+ MB


(None, None, None)

In [16]:
len(df_2011), len(df_2012), len(df_2013)

(15235, 7793, 7793)

<a id='c-concat'></a>

[Index](#index)

### `concat`

In [17]:
frames = [df_2011, df_2012, df_2013]
pd.concat(frames).info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30821 entries, 0 to 7792
Columns: 1977 entries, UNITID to OMENRUP_PARTTIME_POOLED_SUPP
dtypes: float64(197), object(1780)
memory usage: 465.1+ MB


In [18]:
rows = 0
for i in frames:
    rows += len(i)
rows

30821
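The `info()` output above reports 30821 entries but an index running only from 0 to 7792, because `concat` keeps each frame's original index labels. A small sketch, with made-up frames `a` and `b` standing in for the yearly data, of the difference `ignore_index=True` makes:

```python
import pandas as pd

# Hypothetical stand-ins for the yearly frames.
a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3]})

stacked = pd.concat([a, b])                         # keeps each frame's own index
renumbered = pd.concat([a, b], ignore_index=True)   # fresh 0..n-1 index

# Same row count check as the loop above, written as a generator expression.
total_rows = sum(len(f) for f in [a, b])
print(stacked.index.tolist(), renumbered.index.tolist(), total_rows)
```

Passing `keys=` instead keeps the original labels but adds an outer index level that identifies the source frame.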

In [19]:
pd.concat(frames, keys=['2011', '2012', '2013'])

Unnamed: 0,Unnamed: 1,UNITID,OPEID,OPEID6,INSTNM,CITY,STABBR,ZIP,ACCREDAGENCY,INSTURL,NPCURL,...,OMAWDP8_NOTFIRSTTIME_POOLED_SUPP,OMENRUP_NOTFIRSTTIME_POOLED_SUPP,OMENRYP_FULLTIME_POOLED_SUPP,OMENRAP_FULLTIME_POOLED_SUPP,OMAWDP8_FULLTIME_POOLED_SUPP,OMENRUP_FULLTIME_POOLED_SUPP,OMENRYP_PARTTIME_POOLED_SUPP,OMENRAP_PARTTIME_POOLED_SUPP,OMAWDP8_PARTTIME_POOLED_SUPP,OMENRUP_PARTTIME_POOLED_SUPP
2011,0,100654.0,100200.0,1002,Alabama A & M University,Normal,AL,35762,,,,...,,,,,,,,,,
2011,1,100663.0,105200.0,1052,University of Alabama at Birmingham,Birmingham,AL,35294-0110,,,,...,,,,,,,,,,
2011,2,100690.0,2503400.0,25034,Amridge University,Montgomery,AL,36117-3553,,,,...,,,,,,,,,,
2011,3,100706.0,105500.0,1055,University of Alabama in Huntsville,Huntsville,AL,35899,,,,...,,,,,,,,,,
2011,4,100724.0,100500.0,1005,Alabama State University,Montgomery,AL,36104-0271,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2013,7788,47691101.0,4205801.0,42058,SAE Institute of Technology San Francisco,Emeryville,CA,94608,,,,...,,,,,,,,,,
2013,7789,47701101.0,10145905.0,1459,Strayer University-Bloomington Campus,Bloomington,MN,554311411,,,,...,,,,,,,,,,
2013,7790,47702001.0,10145903.0,1459,Strayer University-Schaumburg Campus,Schaumburg,IL,601735081,,,,...,,,,,,,,,,
2013,7791,47702002.0,10145902.0,1459,Strayer University-Downers Grove Campus,Downers Grove,IL,605151169,,,,...,,,,,,,,,,


<a id='idioms'></a>

[Index](#index)

## Pandas Idioms

_Pandorable_ code is idiomatic pandas, the pandas counterpart of _Pythonic_ Python.

---

In [20]:
import pandas as pd
import numpy as np

import timeit

In [21]:
df = pd.read_csv('datasets/census.csv')
df.head(1)

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
0,40,3,6,1,0,Alabama,Alabama,4779736,4780127,4785161,...,0.002295,-0.193196,0.381066,0.582002,-0.467369,1.030015,0.826644,1.383282,1.724718,0.712594


<a id='c-pythonic'></a>

[Index](#index)

### `pythonic or pandorable`

In [22]:
# (df.where(df['SUMLEV'] == 50))

(df[df['SUMLEV'] == 50]
 .set_index(['STNAME', 'CTYNAME'])
 .rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'})).head(1)

Unnamed: 0_level_0,Unnamed: 1_level_0,SUMLEV,REGION,DIVISION,STATE,COUNTY,CENSUS2010POP,Estimates Base 2010,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
STNAME,CTYNAME,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
Alabama,Autauga County,50,3,6,1,1,54571,54571,54660,55253,55175,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333


In [23]:
df = df[df['SUMLEV'] == 50]
df.set_index(['STNAME', 'CTYNAME'], inplace=True)
df.rename(columns={'ESTIMATESBASE2010': 'Estimate Base 2010'}).head(1)

Unnamed: 0_level_0,Unnamed: 1_level_0,SUMLEV,REGION,DIVISION,STATE,COUNTY,CENSUS2010POP,Estimate Base 2010,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
STNAME,CTYNAME,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
Alabama,Autauga County,50,3,6,1,1,54571,54571,54660,55253,55175,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333


In [24]:
def first_approach():
    global df
    return (df[df['SUMLEV'] == 50]
            .set_index(['STNAME', 'CTYNAME'])
            .rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'}))
    
df = pd.read_csv('datasets/census.csv')

timeit.timeit(first_approach, number=10)

0.09264500000000453

In [25]:
def second_approach():
    global df
    new_df = df[df['SUMLEV'] == 50]
    new_df.set_index(['STNAME', 'CTYNAME'], inplace=True)
    # Return the full frame (no .head(1)) so both approaches do the same work
    return new_df.rename(columns={'ESTIMATESBASE2010': 'Estimate Base 2010'})
    
df = pd.read_csv('datasets/census.csv')

timeit.timeit(second_approach, number=10)

0.0901878000000238

<a id='c-apply'></a>

[Index](#index)

### `apply`

In [26]:
df = pd.read_csv('datasets/census.csv')
df.head(1)

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
0,40,3,6,1,0,Alabama,Alabama,4779736,4780127,4785161,...,0.002295,-0.193196,0.381066,0.582002,-0.467369,1.030015,0.826644,1.383282,1.724718,0.712594


In [27]:
def min_max(row):
    data = row[['POPESTIMATE2010',
                'POPESTIMATE2011',
                'POPESTIMATE2012',
                'POPESTIMATE2013',
                'POPESTIMATE2014',
                'POPESTIMATE2015']]
    
    return pd.Series({'min': np.min(data), 'max': np.max(data)})

df.apply(min_max, axis='columns').head(1)

| | min | max |
|---|---|---|
| 0 | 4785161 | 4858979 |


In [28]:
def min_max(row):
    data = row[['POPESTIMATE2010',
                'POPESTIMATE2011',
                'POPESTIMATE2012',
                'POPESTIMATE2013',
                'POPESTIMATE2014',
                'POPESTIMATE2015']]
    row['min'] = np.min(data)
    row['max'] = np.max(data)
    
    return row

df.apply(min_max, axis='columns').head(1)

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015,min,max
0,40,3,6,1,0,Alabama,Alabama,4779736,4780127,4785161,...,0.381066,0.582002,-0.467369,1.030015,0.826644,1.383282,1.724718,0.712594,4785161,4858979


In [29]:
# These are column labels, selected from each row inside the lambda
cols = ['POPESTIMATE2010',
        'POPESTIMATE2011',
        'POPESTIMATE2012',
        'POPESTIMATE2013',
        'POPESTIMATE2014',
        'POPESTIMATE2015']

# axis='columns' (equivalently axis=1) passes each row to the function
df.apply(lambda row: np.max(row[cols]), axis='columns').head()

0    4858979
1      55347
2     203709
3      27341
4      22861
dtype: int64
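For a simple reduction such as a row-wise maximum, `apply` with a lambda is not the only option: selecting the columns and calling the reduction with an `axis` argument gives the same values without calling a Python function per row. A minimal sketch on a made-up miniature of the census columns (`toy` is hypothetical data):

```python
import pandas as pd

# Hypothetical miniature of the census estimate columns.
toy = pd.DataFrame({'POPESTIMATE2010': [100, 50],
                    'POPESTIMATE2011': [110, 45],
                    'POPESTIMATE2012': [105, 60]})
cols = ['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012']

# Row-by-row with apply, as in the cell above.
row_wise = toy.apply(lambda row: row[cols].max(), axis='columns')
# Column selection plus a built-in reduction across the columns.
vectorized = toy[cols].max(axis='columns')

print(row_wise.tolist(), vectorized.tolist())
```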

In [30]:
def get_state_region(x):
    # Abbreviated lists: only a few states per region are included
    northeast = ['Connecticut', 'Maine', 'Massachusetts', 'New York']
    midwest = ['Illinois', 'Indiana']
    south = ['Delaware', 'Florida', 'Georgia']
    west = ['Arizona', 'Colorado', 'Idaho']
    
    if x in northeast:
        return 'Northeast'
    elif x in midwest:
        return 'Midwest'
    elif x in south:
        return 'South'
    else:
        # Any state not listed above (e.g. Alabama) falls through to 'West'
        return 'West'

df['state_region'] = df['STNAME'].apply(get_state_region)

df[['STNAME', 'state_region']].head(1)

| | STNAME | state_region |
|---|---|---|
| 0 | Alabama | West |
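Alabama ends up labelled `West` only because the function's final `else` catches every state that is not listed. One possible alternative, sketched here with a made-up and deliberately incomplete lookup table `region_of`, is `Series.map` with a dictionary, which leaves unlisted states as missing values so they can be handled explicitly:

```python
import pandas as pd

# Hypothetical, deliberately incomplete lookup table.
region_of = {'Connecticut': 'Northeast', 'Illinois': 'Midwest',
             'Florida': 'South', 'Arizona': 'West'}

states = pd.Series(['Florida', 'Arizona', 'Alabama'])
# States missing from the dictionary become NaN, then are labelled explicitly.
regions = states.map(region_of).fillna('Unknown')
print(regions.tolist())
```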


<a id='group'></a>

[Index](#index)

## Group By

---

In [31]:
import pandas as pd
import numpy as np

### Splitting

In [32]:
df = pd.read_csv('datasets/census.csv')

In [33]:
df = df[df['SUMLEV'] == 50]
df.head(1)

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333


#### Different forms of grouping

In [34]:
%%timeit -n 3

for state in df['STNAME'].unique():
    avg = np.average(df[df['STNAME'] == state]['CENSUS2010POP'])
    
    #print(f'{state} with an average population of {avg} ')

35.8 ms ± 13.4 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)


In [35]:
%%timeit -n 3
for group, frame in df.groupby('STNAME'):
    avg = np.average(frame['CENSUS2010POP'])
    
    #print(group)
    #print(f'{group} with an average population of {avg} ')

8.78 ms ± 1.78 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)


<a id='c-groupby'></a>

[Index](#index)

#### `groupby`

In [36]:
%%timeit -n 3
df.groupby('STNAME')['CENSUS2010POP'].mean()

2.01 ms ± 508 µs per loop (mean ± std. dev. of 7 runs, 3 loops each)


In [37]:
df = df.set_index('STNAME')

def set_batch_number(item):
    if item[0] < 'M':
        return 0
    elif item[0] < 'Q':
        return 1
    else:
        return 2
    
for group, frame in df.groupby(set_batch_number):
    print(f'There are {len(frame)} in group {group}')

There are 1177 in group 0
There are 1134 in group 1
There are 831 in group 2


In [38]:
df = pd.read_csv('datasets/listings.csv')
df.set_index(['cancellation_policy', 'review_scores_value'], inplace=True)
df.head(1)

Unnamed: 0_level_0,Unnamed: 1_level_0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_communication,review_scores_location,requires_license,license,jurisdiction_names,instant_bookable,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
cancellation_policy,review_scores_value,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
moderate,,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",none,"Roslindale is quiet, convenient and friendly. ...",...,,,f,,,f,f,f,1,


#### Group by indexes

In [39]:
for group, frame in df.groupby(level=(0, 1)):
    print(group)

('flexible', 2.0)
('flexible', 4.0)
('flexible', 5.0)
('flexible', 6.0)
('flexible', 7.0)
('flexible', 8.0)
('flexible', 9.0)
('flexible', 10.0)
('moderate', 2.0)
('moderate', 4.0)
('moderate', 6.0)
('moderate', 7.0)
('moderate', 8.0)
('moderate', 9.0)
('moderate', 10.0)
('strict', 2.0)
('strict', 3.0)
('strict', 4.0)
('strict', 5.0)
('strict', 6.0)
('strict', 7.0)
('strict', 8.0)
('strict', 9.0)
('strict', 10.0)
('super_strict_30', 6.0)
('super_strict_30', 7.0)
('super_strict_30', 8.0)
('super_strict_30', 9.0)
('super_strict_30', 10.0)


#### Group by Function

In [40]:
def grouping_fun(item):
    if item[1] == 10.0:
        return (item[0], '10.0')
    else:
        return (item[0], 'not 10.0')
    
for group, frame in df.groupby(grouping_fun):
    print(group)

('flexible', '10.0')
('flexible', 'not 10.0')
('moderate', '10.0')
('moderate', 'not 10.0')
('strict', '10.0')
('strict', 'not 10.0')
('super_strict_30', '10.0')
('super_strict_30', 'not 10.0')


### Aggregation

In [41]:
df = pd.read_csv('datasets/listings.csv')
df.head(1)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",none,"Roslindale is quiet, convenient and friendly. ...",...,,f,,,f,moderate,f,f,1,


In [42]:
df.groupby('cancellation_policy')['review_scores_value'].mean()

cancellation_policy
flexible           9.237421
moderate           9.307398
strict             9.081441
super_strict_30    8.537313
Name: review_scores_value, dtype: float64

<a id='c-agg'></a>

[Index](#index)

#### `agg`

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html

In [43]:
df.groupby('cancellation_policy').agg({'review_scores_value': ['mean', 'std'],
                                       'reviews_per_month': ['mean']})

| cancellation_policy | review_scores_value (mean) | review_scores_value (std) | reviews_per_month (mean) |
|---|---|---|---|
| flexible | 9.237421 | 1.096271 | 1.82921 |
| moderate | 9.307398 | 0.859859 | 2.391922 |
| strict | 9.081441 | 1.040531 | 1.873467 |
| super_strict_30 | 8.537313 | 0.840785 | 0.340143 |


In [44]:
# Named aggregation: output_column=(input_column, function)
df.groupby('cancellation_policy').agg(review_mean=('review_scores_value', 'mean'),
                                      review_std=('review_scores_value', 'std'),
                                      reviews_per_month_mean=('reviews_per_month', 'mean'))

| cancellation_policy | review_mean | review_std | reviews_per_month_mean |
|---|---|---|---|
| flexible | 9.237421 | 1.096271 | 1.82921 |
| moderate | 9.307398 | 0.859859 | 2.391922 |
| strict | 9.081441 | 1.040531 | 1.873467 |
| super_strict_30 | 8.537313 | 0.840785 | 0.340143 |


### Transformation

In [45]:
df = pd.read_csv('datasets/listings.csv')
df.head(1)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",none,"Roslindale is quiet, convenient and friendly. ...",...,,f,,,f,moderate,f,f,1,


<a id='c-transform'></a>

[Index](#index)

#### `transform`

In [46]:
cols = ['cancellation_policy', 'review_scores_value']

transform_df = df[cols].groupby('cancellation_policy').transform('mean')
transform_df.rename({'review_scores_value': 'mean_review_scores'}, axis='columns', inplace=True)

transform_df.head()

| | mean_review_scores |
|---|---|
| 0 | 9.307398 |
| 1 | 9.307398 |
| 2 | 9.307398 |
| 3 | 9.307398 |
| 4 | 9.237421 |


In [47]:
#df = df.merge(transform_df, how='inner', left_index=True, right_index=True)
df = pd.merge(df, transform_df, how='inner', left_index=True, right_index=True)
df.head(1)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month,mean_review_scores
0,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",none,"Roslindale is quiet, convenient and friendly. ...",...,f,,,f,moderate,f,f,1,,9.307398


In [48]:
df['mean_diff'] = np.abs(df['review_scores_value'] - df['mean_review_scores'])
df[['mean_diff']].head()

| | mean_diff |
|---|---|
| 0 | NaN |
| 1 | 0.307398 |
| 2 | 0.692602 |
| 3 | 0.692602 |
| 4 | 0.762579 |
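Because `transform` returns one value per original row, aligned on the original index, its result can also be assigned straight to a new column, which skips the intermediate frame and the merge. A minimal sketch on made-up data (`toy`, `policy`, and `score` are hypothetical names):

```python
import pandas as pd

# Hypothetical listings: two policies, one missing score.
toy = pd.DataFrame({'policy': ['a', 'a', 'b', 'b'],
                    'score': [8.0, 10.0, 6.0, None]})

# transform broadcasts each group's mean back to that group's rows.
toy['group_mean'] = toy.groupby('policy')['score'].transform('mean')
# Absolute distance of each score from its group's mean (NaN stays NaN).
toy['mean_diff'] = (toy['score'] - toy['group_mean']).abs()
print(toy)
```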


### Filtering

In [49]:
df = pd.read_csv('datasets/listings.csv')
df.head(1)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",none,"Roslindale is quiet, convenient and friendly. ...",...,,f,,,f,moderate,f,f,1,


<a id='c-filter'></a>

[Index](#index)

#### `filter`

In [50]:
(df.groupby('cancellation_policy')
 .filter(lambda x: np.nanmean(x['review_scores_value']) > 9.2)).tail(2)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
3582,14585486,https://www.airbnb.com/rooms/14585486,20160906204935,2016-09-07,Gorgeous funky apartment,Funky little apartment close to public transpo...,Modern and relaxed space with many facilities ...,Funky little apartment close to public transpo...,none,"Cambridge is a short walk into Boston, and set...",...,,f,,,f,flexible,f,f,1,
3584,14504422,https://www.airbnb.com/rooms/14504422,20160906204935,2016-09-07,(K1) Private Room near Harvard/MIT,My place is close to My home is a warm and fri...,To ensure a smooth check in: 1. You MUST have ...,My place is close to My home is a warm and fri...,none,,...,,f,,,t,flexible,f,f,3,
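Note that `filter` on a groupby keeps or drops whole groups: the function receives each group's frame and returns a single `True` or `False`. The sketch below, on made-up data, shows a high-scoring row being dropped because its group as a whole fails the test:

```python
import pandas as pd

# Hypothetical data: group 'a' averages 9.3, group 'b' averages 8.95.
toy = pd.DataFrame({'policy': ['a', 'a', 'b', 'b'],
                    'score': [9.5, 9.1, 8.0, 9.9]})

# Keep only the rows belonging to groups whose mean score exceeds 9.2.
kept = toy.groupby('policy').filter(lambda g: g['score'].mean() > 9.2)
print(kept)
```

The row with score 9.9 is removed along with the rest of group `b`, which is the difference from an ordinary row-level boolean mask.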


### Applying

In [51]:
df = pd.read_csv('datasets/listings.csv')
df.head(1)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",none,"Roslindale is quiet, convenient and friendly. ...",...,,f,,,f,moderate,f,f,1,


In [52]:
df = df[['cancellation_policy', 'review_scores_value']]
df.head(1)

| | cancellation_policy | review_scores_value |
|---|---|---|
| 0 | moderate | NaN |


In [53]:
def calc_mean_review_scores(group):
    # Group mean, ignoring missing review scores
    avg = np.nanmean(group['review_scores_value'])
    # Absolute deviation of each listing's score from its group's mean
    group['mean_diff'] = np.abs(avg - group['review_scores_value'])
    return group
    
df.groupby('cancellation_policy').apply(calc_mean_review_scores).head(1)

| | cancellation_policy | review_scores_value | mean_diff |
|---|---|---|---|
| 0 | moderate | NaN | NaN |


<a id='scaling'></a>

[Index](#index)

## Scales

---

### Ratio Scale

- Units equally spaced
- Math operations are valid

e.g. height and weight

### Interval Scale

- Units equally spaced, but there is no _true_ zero
- Some math operations, such as multiplication and division, are not valid

e.g. temperature in Celsius or Fahrenheit

### Ordinal Scale

- Order of units is important, but they are not evenly spaced

e.g. letter grades: an _A+_ may cover only 3 percentage points, while an _A_ covers 4

### Nominal Scale

Categorical data

- Categories of data, but the categories have no order with respect to one another
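In pandas, the last two scales can be represented with the categorical dtype: a plain `category` for nominal data and an ordered `CategoricalDtype` for ordinal data. A minimal sketch, with made-up values, of how the two behave differently:

```python
import pandas as pd

# Nominal: categories with no order (hypothetical example values).
schools = pd.Series(['Law', 'Business', 'Law']).astype('category')

# Ordinal: categories listed from lowest to highest, with ordered=True.
size_type = pd.CategoricalDtype(categories=['small', 'medium', 'large'],
                                ordered=True)
sizes = pd.Series(['medium', 'small', 'large']).astype(size_type)

print(schools.cat.ordered)        # whether the categories carry an order
print(sizes.min(), sizes.max())   # min/max follow the declared order
print((sizes > 'small').tolist()) # comparisons follow the declared order too
```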

In [54]:
import pandas as pd

In [58]:
df = pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'],
                  index=['excellent', 'excellent', 'excellent', 'good', 'good', 'good', 'ok', 'ok', 'ok', 'poor', 'poor'],
                  columns=['Grades'])
df

| | Grades |
|---|---|
| excellent | A+ |
| excellent | A |
| excellent | A- |
| good | B+ |
| good | B |
| good | B- |
| ok | C+ |
| ok | C |
| ok | C- |
| poor | D+ |
| poor | D |


In [59]:
df.dtypes

Grades    object
dtype: object

<a id='c-astype'></a>

[Index](#index)

### `astype`

In [60]:
df['Grades'].astype('category').head()

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
Name: Grades, dtype: category
Categories (11, object): ['A', 'A+', 'A-', 'B', ..., 'C+', 'C-', 'D', 'D+']

In [61]:
my_categories = pd.CategoricalDtype(categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'],
                                    ordered=True)
grades = df['Grades'].astype(my_categories)
grades.head()

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
Name: Grades, dtype: category
Categories (11, object): ['D' < 'D+' < 'C-' < 'C' ... 'B+' < 'A-' < 'A' < 'A+']

In [62]:
# Plain strings compare lexicographically, so this does not follow grade order
df[df['Grades'] > 'C']

| | Grades |
|---|---|
| ok | C+ |
| ok | C- |
| poor | D+ |
| poor | D |


In [66]:
grades[grades > 'C']

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
good         B-
ok           C+
Name: Grades, dtype: category
Categories (11, object): ['D' < 'D+' < 'C-' < 'C' ... 'B+' < 'A-' < 'A' < 'A+']

<a id='c-cut'></a>

[Index](#index)

### `cut`

In [67]:
import numpy as np

In [70]:
df = pd.read_csv('datasets/census.csv')
df = df[df['SUMLEV'] == 50]
df = (df.set_index('STNAME')
      .groupby(level=0)
      ['CENSUS2010POP'].mean())
df.head()

STNAME
Alabama        71339.343284
Alaska         24490.724138
Arizona       426134.466667
Arkansas       38878.906667
California    642309.586207
Name: CENSUS2010POP, dtype: float64

In [71]:
pd.cut(df, 10)

STNAME
Alabama                   (11706.087, 75333.413]
Alaska                    (11706.087, 75333.413]
Arizona                 (390320.176, 453317.529]
Arkansas                  (11706.087, 75333.413]
California              (579312.234, 642309.586]
Colorado                 (75333.413, 138330.766]
Connecticut             (390320.176, 453317.529]
Delaware                (264325.471, 327322.823]
District of Columbia    (579312.234, 642309.586]
Florida                 (264325.471, 327322.823]
Georgia                   (11706.087, 75333.413]
Hawaii                  (264325.471, 327322.823]
Idaho                     (11706.087, 75333.413]
Illinois                 (75333.413, 138330.766]
Indiana                   (11706.087, 75333.413]
Iowa                      (11706.087, 75333.413]
Kansas                    (11706.087, 75333.413]
Kentucky                  (11706.087, 75333.413]
Louisiana                 (11706.087, 75333.413]
Maine                    (75333.413, 138330.766]
Maryland     
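`pd.cut(df, 10)` above splits the value range into 10 bins of equal width, so most states land in the lowest bin. The sketch below, on a made-up series `values`, contrasts `cut` with `qcut`, which instead chooses bin edges so that each bin holds roughly the same number of observations; the `labels=` argument is used here only to make the result easier to read.

```python
import pandas as pd

# Hypothetical skewed values.
values = pd.Series([1, 5, 10, 50, 100])

# cut: bins of equal width across the value range.
by_width = pd.cut(values, bins=2, labels=['low', 'high'])
# qcut: bins holding (roughly) equal numbers of observations.
by_count = pd.qcut(values, q=2, labels=['low', 'high'])

print(by_width.tolist())
print(by_count.tolist())
```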

<a id='pivot-table'></a>

[Index](#index)

## Pivot Table

---

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('datasets/cwurData.csv')
df.head()

Unnamed: 0,world_rank,institution,country,national_rank,quality_of_education,alumni_employment,quality_of_faculty,publications,influence,citations,broad_impact,patents,score,year
0,1,Harvard University,USA,1,7,9,1,1,1,1,,5,100.0,2012
1,2,Massachusetts Institute of Technology,USA,2,9,17,3,12,4,4,,1,91.67,2012
2,3,Stanford University,USA,3,17,11,5,4,2,2,,15,89.5,2012
3,4,University of Cambridge,United Kingdom,1,10,24,4,16,16,11,,50,86.17,2012
4,5,California Institute of Technology,USA,4,2,29,7,37,22,22,,18,85.21,2012


In [3]:
""" Another Approach
def create_category(row):
    rank = row['world_rank']
    
    if rank in range(1, 100+1):
        tier = 'First Tier Top University'
    elif rank in range(101, 200+1):
        tier = 'Second Tier Top University'
    elif rank in range(201, 300+1):
        tier = 'Third Tier Top University'
    else:
        tier = 'Other Top University'
    row['Rank_Level'] = tier
    return row
        
df.apply(create_category, axis='columns')
"""

def create_category(ranking):
    # Map a world rank to a tier label
    if 1 <= ranking <= 100:
        tier = 'First Tier Top University'
    elif 101 <= ranking <= 200:
        tier = 'Second Tier Top University'
    elif 201 <= ranking <= 300:
        tier = 'Third Tier Top University'
    else:
        tier = 'Other Top University'
    return tier
        
df['Rank_Level'] = df['world_rank'].apply(create_category)
df[['world_rank', 'Rank_Level']]

Unnamed: 0,world_rank,Rank_Level
0,1,First Tier Top University
1,2,First Tier Top University
2,3,First Tier Top University
3,4,First Tier Top University
4,5,First Tier Top University
...,...,...
2195,996,Other Top University
2196,997,Other Top University
2197,998,Other Top University
2198,999,Other Top University
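
The same tiers can also be assigned without calling a Python function per element. A minimal sketch using `pd.cut` (covered in the Scales section), assuming a small made-up `ranks` Series rather than the CWUR data:

```python
import pandas as pd

# Hypothetical ranks, chosen only to hit every tier boundary
ranks = pd.Series([1, 100, 101, 200, 201, 300, 301, 996], name='world_rank')

# Bins are right-inclusive by default: (0, 100], (100, 200], (200, 300], (300, inf]
tiers = pd.cut(ranks,
               bins=[0, 100, 200, 300, float('inf')],
               labels=['First Tier Top University',
                       'Second Tier Top University',
                       'Third Tier Top University',
                       'Other Top University'])
print(tiers.tolist())
```

One difference from `create_category`: a rank of 0 or below falls outside every bin and becomes `NaN` instead of `'Other Top University'`.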


<a id='c-pivot'></a>

[Índice](#index)

### `pivot_table`

In [122]:
df.pivot_table(values='score', index='country', columns='Rank_Level', aggfunc=['mean', 'max']).head()

Unnamed: 0_level_0,mean,mean,mean,mean,max,max,max,max
Rank_Level,First Tier Top University,Other Top University,Second Tier Top University,Third Tier Top University,First Tier Top University,Other Top University,Second Tier Top University,Third Tier Top University
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
Argentina,,44.672857,,,,45.66,,
Australia,47.9425,44.64575,49.2425,47.285,51.61,45.97,50.4,47.47
Austria,,44.864286,,47.066667,,46.29,,47.78
Belgium,51.875,45.081,49.084,46.746667,52.03,46.21,49.73,47.14
Brazil,,44.499706,49.565,,,46.08,49.82,
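
To see how the cells of such a table are filled, here is a minimal sketch on a made-up five-row frame (the `toy` data and its short tier labels are assumptions for illustration, not part of the CWUR dataset):

```python
import pandas as pd

# Made-up scores, small enough to check by hand
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B'],
    'Rank_Level': ['First', 'First', 'Other', 'Other', 'Other'],
    'score': [90.0, 80.0, 50.0, 40.0, 60.0],
})

# One block of columns per aggregation function, one sub-column per Rank_Level
table = toy.pivot_table(values='score', index='country',
                        columns='Rank_Level', aggfunc=['mean', 'max'])
print(table)

# Columns form a MultiIndex, so a single cell is addressed with a tuple
print(table.loc['A', ('mean', 'First')])
```

Country `B` has no first-tier rows, so its `('mean', 'First')` cell is `NaN`, which is the same reason the real table above is full of empty cells.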


In [124]:
df.pivot_table(values='score',
               index='country',
               columns='Rank_Level',
               aggfunc=['mean', 'max'],
               margins=True).tail()

Unnamed: 0_level_0,mean,mean,mean,mean,mean,max,max,max,max,max
Rank_Level,First Tier Top University,Other Top University,Second Tier Top University,Third Tier Top University,All,First Tier Top University,Other Top University,Second Tier Top University,Third Tier Top University,All
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
Uganda,,44.28,,,44.28,,44.4,,,44.4
United Arab Emirates,,44.22,,,44.22,,44.36,,,44.36
United Kingdom,63.937931,44.881299,48.9575,46.862273,49.474653,97.64,46.3,49.97,47.74,97.64
Uruguay,,44.255,,,44.255,,44.35,,,44.35
All,58.350675,44.738871,49.06545,46.84345,47.798395,100.0,46.34,51.29,47.93,100.0
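
The `All` margins are computed from the underlying rows, not by averaging the already-aggregated cells. A small sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Made-up scores, not taken from the CWUR data
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B'],
    'Rank_Level': ['First', 'First', 'Other', 'Other', 'Other'],
    'score': [90.0, 80.0, 50.0, 40.0, 60.0],
})

with_margins = toy.pivot_table(values='score', index='country',
                               columns='Rank_Level', aggfunc='mean',
                               margins=True)
print(with_margins)

# Grand mean over all five rows versus the mean of the two per-country means
grand_mean = with_margins.loc['All', 'All']
mean_of_means = with_margins.loc[['A', 'B'], 'All'].mean()
print(grand_mean, mean_of_means)
```

Here the grand mean is 64.0 (320 / 5), while the mean of the per-country means is lower because country `A` contributes three rows and country `B` only two.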


In [16]:
new_df = df.pivot_table(values='score', index='country', columns='Rank_Level',
                        aggfunc=['mean', 'max'], margins=True)

In [129]:
new_df['mean']['First Tier Top University'].idxmax()

'United Kingdom'

In [131]:
new_df.head(1)

Unnamed: 0_level_0,mean,mean,mean,mean,mean,max,max,max,max,max
Rank_Level,First Tier Top University,Other Top University,Second Tier Top University,Third Tier Top University,All,First Tier Top University,Other Top University,Second Tier Top University,Third Tier Top University,All
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
Argentina,,44.672857,,,44.672857,,45.66,,,45.66


<a id='c-stack'></a>

[Índice](#index)

### `stack`

In [17]:
new_df = new_df.stack()
new_df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,max
country,Rank_Level,Unnamed: 2_level_1,Unnamed: 3_level_1
Argentina,Other Top University,44.672857,45.66
Argentina,All,44.672857,45.66
Australia,First Tier Top University,47.9425,51.61
Australia,Other Top University,44.64575,45.97
Australia,Second Tier Top University,49.2425,50.4
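
`stack` moves the innermost column level into the index, and `unstack` moves the innermost index level back to the columns. A minimal round-trip sketch on a made-up two-row frame (the names `wide`, `long_form` and `back` are illustrative):

```python
import pandas as pd

wide = pd.DataFrame({'mean': [1.0, 2.0], 'max': [3.0, 4.0]},
                    index=pd.Index(['A', 'B'], name='country'))

# Single-level columns: stacking yields a Series indexed by (country, column label)
long_form = wide.stack()
print(long_form)

# Unstacking the innermost index level restores the wide layout
back = long_form.unstack()
print(back)
```

With the two-level columns of the pivot table above, `stack` only removes the inner `Rank_Level` level, which is why the result there is still a DataFrame with `mean` and `max` columns.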


<a id='c-unstack'></a>

[Índice](#index)

### `unstack`

In [14]:
new_df.unstack().head(1)

Unnamed: 0_level_0,mean,mean,mean,mean,mean,max,max,max,max,max
Rank_Level,First Tier Top University,Other Top University,Second Tier Top University,Third Tier Top University,All,First Tier Top University,Other Top University,Second Tier Top University,Third Tier Top University,All
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
All,58.350675,44.738871,49.06545,46.84345,47.798395,100.0,46.34,51.29,47.93,100.0


In [18]:
# Unstacking every index level leaves a Series; wrap it in a DataFrame for display
pd.DataFrame(new_df.unstack().unstack())

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,0
Unnamed: 0_level_1,Rank_Level,country,Unnamed: 3_level_1
mean,First Tier Top University,All,58.350675
mean,First Tier Top University,Argentina,
mean,First Tier Top University,Australia,47.942500
mean,First Tier Top University,Austria,
mean,First Tier Top University,Belgium,51.875000
...,...,...,...
max,All,USA,100.000000
max,All,Uganda,44.400000
max,All,United Arab Emirates,44.360000
max,All,United Kingdom,97.640000


<a id='date-time'></a>

[Índice](#index)

## Date/Time Functionality

---

In [21]:
import pandas as pd
import numpy as np

<a id='c-timestamp'></a>

[Índice](#index)

### `Timestamp`

In [7]:
pd.Timestamp('9/1/2019 10:05AM'), pd.Timestamp(2019, 12, 20, 0, 0)

(Timestamp('2019-09-01 10:05:00'), Timestamp('2019-12-20 00:00:00'))

In [10]:
# ISO weekday of the timestamp: Monday = 1, ..., Sunday = 7
print(pd.Timestamp(2022, 8, 4, 0, 0).isoweekday())
# Individual components (here, the seconds) are available as attributes
print(pd.Timestamp(2022, 8, 4, 5, 2, 23).second)

4
23


<a id='c-period'></a>

[Índice](#index)

### `Period`

In [16]:
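# The frequency is inferred from the string: monthly for '1/2016', daily for '3/5/2016'.
# Adding an integer shifts by that many units of the frequency (here, 5 days).
pd.Period('1/2016'), pd.Period('3/5/2016') + 5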

(Period('2016-01', 'M'), Period('2016-03-10', 'D'))
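
Because a `Period` is a span rather than a point in time, it has a start and an end. A short sketch of that, and of how integer arithmetic follows the inferred frequency (variable names are illustrative):

```python
import pandas as pd

month = pd.Period('2016-01')       # monthly frequency inferred
day = pd.Period('2016-03-05')      # daily frequency inferred

# Integer arithmetic moves in units of the period's own frequency
print(month + 5)                   # 2016-06
print(day + 5)                     # 2016-03-10

# The span covered by the monthly period
print(month.start_time, month.end_time)
```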

<a id='c-types'></a>

[Índice](#index)

### DatetimeIndex & PeriodIndex

In [15]:
t1 = pd.Series(list('abs'), [pd.Timestamp('2016-09-01'), 
                             pd.Timestamp('2016-09-02'),
                             pd.Timestamp('2016-09-03')])
t1

2016-09-01    a
2016-09-02    b
2016-09-03    s
dtype: object

In [17]:
type(t1.index)

pandas.core.indexes.datetimes.DatetimeIndex

In [18]:
t2 = pd.Series(list('abs'), [pd.Period('2016-09'), 
                             pd.Period('2016-10'),
                             pd.Period('2016-11')])
t2

2016-09    a
2016-10    b
2016-11    s
Freq: M, dtype: object

In [19]:
type(t2.index)

pandas.core.indexes.period.PeriodIndex

In [26]:
d1 = ['2 June 2013', 'Aug 29, 2014', '2015-06-26', '7/12/16']

ts3 = pd.DataFrame(np.random.randint(10, 100, (4, 2)), index=d1, columns=list('ab'))
ts3

Unnamed: 0,a,b
2 June 2013,90,54
"Aug 29, 2014",69,52
2015-06-26,94,50
7/12/16,94,86


<a id='c-to-datetime'></a>

[Índice](#index)

### `to_datetime`

In [28]:
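# Parse the index labels, which are written in four different styles.
# Note: pandas 2.0 and later may need format='mixed' when the styles differ.
ts3['c'] = pd.to_datetime(ts3.index)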
ts3

Unnamed: 0,a,b,c
2 June 2013,90,54,2013-06-02
"Aug 29, 2014",69,52,2014-08-29
2015-06-26,94,50,2015-06-26
7/12/16,94,86,2016-07-12


In [29]:
pd.to_datetime('4.7.12', dayfirst=True)

Timestamp('2012-07-04 00:00:00')
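
When the layout of the strings is known, passing an explicit `format` avoids guessing altogether. A minimal sketch, assuming a made-up `raw` Series with one unparseable entry:

```python
import pandas as pd

# Explicit layout: day.month.two-digit-year
parsed = pd.to_datetime('4.7.12', format='%d.%m.%y')
print(parsed)                      # 2012-07-04 00:00:00

# One format for the whole Series; entries that do not match become NaT
raw = pd.Series(['2015-06-26', '2016-07-12', 'not a date'])
dates = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')
print(dates)
```

`errors='coerce'` is a design choice: it keeps the Series length intact and marks bad entries as `NaT` instead of raising.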

<a id='c-timedelta'></a>

[Índice](#index)

### `Timedelta`

In [30]:
pd.Timestamp('9/3/2016') - pd.Timestamp('9/1/2016'), pd.Timestamp('9/2/2016 8:10AM') + pd.Timedelta('12D 3H')

(Timedelta('2 days 00:00:00'), Timestamp('2016-09-14 11:10:00'))
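
A `Timedelta` can also be built from keyword arguments, which avoids depending on string unit abbreviations. A small sketch reproducing the addition above (variable names are illustrative):

```python
import pandas as pd

start = pd.Timestamp('2016-09-02 08:10')
delta = pd.Timedelta(days=12, hours=3)

end = start + delta
print(end)                         # 2016-09-14 11:10:00

# Subtracting two timestamps gives the Timedelta back
print(end - start == delta)        # True
print(delta.days, delta.total_seconds())
```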

<a id='c-weekday'></a>

[Índice](#index)

### `weekday`

In [40]:
# weekday() counts from Monday = 0 to Sunday = 6, so a Thursday gives 3
pd.Timestamp('8/4/2022').weekday()

3
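
The two weekday conventions differ only by one. A quick sketch comparing them for the same date:

```python
import pandas as pd

ts = pd.Timestamp('2022-08-04')    # a Thursday

zero_based = ts.weekday()          # Monday = 0 ... Sunday = 6
iso_based = ts.isoweekday()        # Monday = 1 ... Sunday = 7
name = ts.day_name()
print(zero_based, iso_based, name) # 3 4 Thursday
```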

<a id='c-offsets'></a>

[Índice](#index)

### `offsets`

In [41]:
pd.Timestamp('9/4/2016') + pd.offsets.Week()

Timestamp('2016-09-11 00:00:00')

In [42]:
pd.Timestamp('9/4/2016') + pd.offsets.MonthEnd()

Timestamp('2016-09-30 00:00:00')
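
Adding an anchored offset such as `MonthEnd` moves to the *next* anchor point, so a date that already sits on a month end jumps a whole month. A minimal sketch of that edge case (variable names are illustrative):

```python
import pandas as pd

mid_month = pd.Timestamp('2016-09-04')
month_end = pd.Timestamp('2016-09-30')

print(mid_month + pd.offsets.MonthEnd())    # 2016-09-30
print(month_end + pd.offsets.MonthEnd())    # 2016-10-31

# rollforward leaves a date untouched when it is already on the anchor
print(pd.offsets.MonthEnd().rollforward(month_end))   # 2016-09-30

# Offsets accept a multiplier: two weeks later
print(mid_month + pd.offsets.Week(2))       # 2016-09-18
```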

<a id='c-date-range'></a>

[Índice](#index)

### `date_range`

In [50]:
# Every second Sunday (biweekly, anchored on Sunday)
dates = pd.date_range('08-01-2022', periods=9, freq='2W-SUN')
print(dates)

# Business days only (Monday to Friday)
print(pd.date_range('08-01-2022', periods=9, freq='B'))

# Quarter starts, anchored so that one quarter begins in June (Jun, Sep, Dec, Mar)
print(pd.date_range('08-01-2022', periods=12, freq='QS-JUN'))

DatetimeIndex(['2022-08-07', '2022-08-21', '2022-09-04', '2022-09-18',
               '2022-10-02', '2022-10-16', '2022-10-30', '2022-11-13',
               '2022-11-27'],
              dtype='datetime64[ns]', freq='2W-SUN')
DatetimeIndex(['2022-08-01', '2022-08-02', '2022-08-03', '2022-08-04',
               '2022-08-05', '2022-08-08', '2022-08-09', '2022-08-10',
               '2022-08-11'],
              dtype='datetime64[ns]', freq='B')
DatetimeIndex(['2022-09-01', '2022-12-01', '2023-03-01', '2023-06-01',
               '2023-09-01', '2023-12-01', '2024-03-01', '2024-06-01',
               '2024-09-01', '2024-12-01', '2025-03-01', '2025-06-01'],
              dtype='datetime64[ns]', freq='QS-JUN')
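
`date_range` can also be bounded by `start` and `end` instead of `periods`. A short sketch, using the same August 2022 dates as above:

```python
import pandas as pd

# Every Sunday that falls inside August 2022
sundays = pd.date_range(start='2022-08-01', end='2022-08-31', freq='W-SUN')
print(sundays.day.tolist())        # [7, 14, 21, 28]

# Business days skip the weekend: Friday the 5th is followed by Monday the 8th
business = pd.date_range('2022-08-05', periods=3, freq='B')
print(business.day.tolist())       # [5, 8, 9]
```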


In [54]:
dates = pd.date_range('08-01-2022', periods=9, freq='2W-SUN')
df = pd.DataFrame({'Count 1': 100 + np.random.randint(-5, 10, 9).cumsum(),
                   'Count 2': 120 + np.random.randint(-5, 10, 9)},
                  index=dates)
df

Unnamed: 0,Count 1,Count 2
2022-08-07,98,127
2022-08-21,98,120
2022-09-04,100,128
2022-09-18,97,117
2022-10-02,105,116
2022-10-16,101,117
2022-10-30,109,116
2022-11-13,117,118
2022-11-27,124,128


<a id='c-day-name'></a>

[Índice](#index)

### `day_name`

In [61]:
df.index.day_name()

Index(['Sunday', 'Sunday', 'Sunday', 'Sunday', 'Sunday', 'Sunday', 'Sunday',
       'Sunday', 'Sunday'],
      dtype='object')

<a id='c-diff'></a>

[Índice](#index)

### `diff`

In [63]:
# Row-to-row change in each column; the first row has no predecessor, so it is NaN.
# e.g. Count 1 is 117 on 2022-11-13 and 124 on 2022-11-27, so the diff is 124 - 117 = 7
df.diff()

Unnamed: 0,Count 1,Count 2
2022-08-07,,
2022-08-21,0.0,-7.0
2022-09-04,2.0,8.0
2022-09-18,-3.0,-11.0
2022-10-02,8.0,-1.0
2022-10-16,-4.0,1.0
2022-10-30,8.0,-1.0
2022-11-13,8.0,2.0
2022-11-27,7.0,10.0
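
Since the frame above is built from random numbers, here is a deterministic sketch of `diff`, assuming a hand-written `counts` Series on the same biweekly dates:

```python
import pandas as pd

idx = pd.date_range('2022-08-07', periods=4, freq='2W-SUN')
counts = pd.Series([98, 98, 100, 97], index=idx, name='Count 1')

# Change relative to the previous row
step = counts.diff()
print(step.tolist())               # [nan, 0.0, 2.0, -3.0]

# Change relative to the row two steps back
two_back = counts.diff(periods=2)
print(two_back.tolist())           # [nan, nan, 2.0, -1.0]
```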


<a id='c-resample'></a>

[Índice](#index)

### `resample`

In [64]:
# Downsample to calendar months and average the rows that fall in each month
df.resample('M').mean()

Unnamed: 0,Count 1,Count 2
2022-08-31,98.0,123.5
2022-09-30,98.5,122.5
2022-10-31,105.0,116.333333
2022-11-30,120.5,123.0
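
A deterministic sketch of monthly resampling, assuming a hand-written `df_small` frame. It uses the month-start alias `'MS'`, so the labels are the first day of each month; recent pandas releases spell the month-end alias `'ME'` instead of `'M'`, so the alias may need adjusting depending on the installed version.

```python
import pandas as pd

idx = pd.date_range('2022-08-07', periods=4, freq='2W-SUN')
df_small = pd.DataFrame({'Count 1': [98, 98, 100, 97]}, index=idx)

# Group rows by calendar month, label each group by its first day
monthly_mean = df_small.resample('MS').mean()
print(monthly_mean)

# Any aggregation can follow resample, e.g. a monthly total
monthly_sum = df_small.resample('MS').sum()
print(monthly_sum)
```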


<a id='c-loc'></a>

[Índice](#index)

### `loc`

In [68]:
# Partial string indexing: every row from October 2022 onwards
df.loc['2022-10':]

Unnamed: 0,Count 1,Count 2
2022-10-02,105,116
2022-10-16,101,117
2022-10-30,109,116
2022-11-13,117,118
2022-11-27,124,128
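
Partial date strings work for single months and for closed ranges as well. A minimal sketch, assuming a made-up Series `s` on the same biweekly dates:

```python
import pandas as pd

idx = pd.date_range('2022-08-07', periods=9, freq='2W-SUN')
s = pd.Series(range(9), index=idx)

# All rows in one month
october = s.loc['2022-10']
print(october.tolist())            # [4, 5, 6]

# Both ends of a string slice are inclusive
sep_to_oct = s.loc['2022-09':'2022-10']
print(sep_to_oct.tolist())         # [2, 3, 4, 5, 6]

# A full date can serve as the lower bound
tail = s.loc['2022-11-13':]
print(tail.tolist())               # [7, 8]
```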
