
# Pandas - Advanced Techniques

This lesson is adapted from the following web sources:

1.   https://towardsdatascience.com/learn-advanced-features-for-pythons-main-data-analysis-library-in-20-minutes-d0eedd90d086
2.   https://colab.research.google.com/github/thimotyb/real-world-machine-learning/blob/python3/Importing_data_with_pandas.ipynb






## Data Types

Let's quickly summarize the pandas data types you will meet most often. There are seven of them:
* object : Used for strings (i.e., sequences of characters) and for columns that hold mixed Python objects
* int64 : Used for integers (whole numbers, no decimals)
* float64 : Used for floating-point numbers (i.e., figures with decimals/fractions)
* bool : Used for values that can only be True/False
* datetime64 : Used for date and time values
* timedelta64 : Used to represent the difference between datetimes
* category : Used for values that take one out of a limited number of available options (categories don't have to, but can, have an explicit ordering)
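As a quick illustration, the sketch below builds a small made-up DataFrame (the name `toy`, its columns, and its values are purely illustrative and not taken from the invoices data) with one column per dtype listed above, and prints the resulting dtypes. Depending on the pandas version, the string column may be reported as a dedicated string dtype rather than object.

```python
import pandas as pd

# Hypothetical toy data: one column for each dtype in the list above
toy = pd.DataFrame({
    "name": ["Ann", "Bob"],                                 # strings
    "visits": [3, 5],                                       # integers
    "price": [9.5, 12.0],                                   # floats
    "paid": [True, False],                                  # booleans
    "when": pd.to_datetime(["2019-08-01", "2019-08-03"]),   # datetimes
    "meal": pd.Categorical(["Lunch", "Dinner"]),            # category
})

# Subtracting two datetimes yields a timedelta column
toy["gap"] = toy["when"] - toy["when"].min()

print(toy.dtypes)
```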

In [1]:
import pandas as pd
import numpy as np
import datetime
import pytz
from IPython.display import display, HTML
display(HTML("<style>.container {width:90% !important;}</style>"))
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 10)

In [2]:
data = pd.read_csv('https://raw.githubusercontent.com/thimotyb/you-datascientist/master/happiness_with_continent.csv')

In [17]:
invoices = pd.read_csv('https://raw.githubusercontent.com/thimotyb/you-datascientist/master/invoices.csv')

In [12]:
invoices.tail(5)

Unnamed: 0,Order Id,Date,Meal Id,Company Id,Date of Meal,Participants,Meal Price,Type of Meal,Heroes Adjustment
50012,4OMS8ZSA0UX8LHWI,2017-09-20,1TD5MROATV1NHZ4Y,E4K99D4JR9E40VE1,2017-09-21 08:00:00+02:00,['Regina Shirley'],9,Breakfast,False
50013,RR0VKJN8V0KHNKGG,2018-03-19,22EX9VZSJKHP4AIP,E4K99D4JR9E40VE1,2018-03-18 09:00:00+01:00,['Robin Ramos' 'Chester Mortimer'],25,Breakfast,False
50014,STJ6QJC30WPRM93H,2017-09-21,LMX18PNGWCIMG1QW,E4K99D4JR9E40VE1,2017-09-22 21:00:00+02:00,['Robin Ramos'],160,Dinner,False
50015,QHEUIYNC0XQX7GDR,2018-01-28,4U0VH2TGQL30X23X,E4K99D4JR9E40VE1,2018-02-01 21:00:00+01:00,['Chester Mortimer' 'Robin Ramos'],497,Dinner,False
50016,NKHFWT5I2J9LPAPG,2017-09-06,ORWFRT5TUSYGNYG7,E4K99D4JR9E40VE1,2017-09-09 14:00:00+02:00,['Chester Mortimer' 'Robin Ramos'],365,Lunch,False


In [7]:
invoices.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50017 entries, 0 to 50016
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Order Id           50017 non-null  object 
 1   Date               50017 non-null  object 
 2   Meal Id            50017 non-null  object 
 3   Company Id         50017 non-null  object 
 4   Date of Meal       50017 non-null  object 
 5   Participants       50017 non-null  object 
 6   Meal Price         50017 non-null  float64
 7   Type of Meal       50017 non-null  object 
 8   Heroes Adjustment  50017 non-null  bool   
dtypes: bool(1), float64(1), object(7)
memory usage: 3.1+ MB


In [18]:
invoices['Type of Meal'] = invoices['Type of Meal'].astype('category')
invoices['Date'] = invoices['Date'].astype('datetime64[ns]')
invoices['Meal Price'] = invoices['Meal Price'].astype('int')

In [9]:
invoices.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50017 entries, 0 to 50016
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   Order Id           50017 non-null  object        
 1   Date               50017 non-null  datetime64[ns]
 2   Meal Id            50017 non-null  object        
 3   Company Id         50017 non-null  object        
 4   Date of Meal       50017 non-null  object        
 5   Participants       50017 non-null  object        
 6   Meal Price         50017 non-null  int64         
 7   Type of Meal       50017 non-null  category      
 8   Heroes Adjustment  50017 non-null  bool          
dtypes: bool(1), category(1), datetime64[ns](1), int64(1), object(5)
memory usage: 2.8+ MB
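The lower memory usage reported by the second info() call is consistent with the category conversion, since a category column stores small integer codes instead of one Python string per row. A minimal sketch on made-up data (the `meals` Series is hypothetical) compares the footprint of the same values stored as plain strings and as a category:

```python
import pandas as pd

# Hypothetical data: three distinct labels repeated many times
meals = pd.Series(["Breakfast", "Lunch", "Dinner"] * 10_000)

as_object = meals.astype("object")
as_category = meals.astype("category")

# deep=True counts the actual string payload, not just the pointers
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

The saving grows with the number of rows and shrinks as the number of distinct values approaches the number of rows.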


The conversion helpers pd.to_numeric() and pd.to_datetime() let you specify what should happen when a value cannot be converted.
Both functions accept an additional parameter errors that defines how errors are treated. We can ignore errors by passing errors='ignore' (deprecated in recent pandas releases), or turn the offending values into missing values (np.nan, or NaT for datetimes) by passing errors='coerce'. The default behavior is to raise an error.
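To make the two most commonly used modes concrete, here is a minimal sketch on a made-up Series (`raw` is a hypothetical name) using pd.to_datetime(); pd.to_numeric() behaves analogously.

```python
import pandas as pd

# Hypothetical input with one value that is not a date
raw = pd.Series(["2019-08-01", "not a date", "2019-08-03"])

# Default (errors='raise'): the bad value aborts the conversion
try:
    pd.to_datetime(raw)
    raised = False
except ValueError:
    raised = True
print("default raised an error:", raised)

# errors='coerce': the bad value becomes NaT (the datetime counterpart of NaN)
coerced = pd.to_datetime(raw, errors="coerce")
print(coerced)
```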

Let's deliberately introduce two bad values and then see how to handle the resulting conversion errors.

In [21]:
invoices.loc[45612,'Meal Price'] = 'I am causing trouble'
invoices.loc[35612,'Meal Price'] = 'Me too'

In [11]:
invoices['Meal Price'].astype(int)

ValueError: ignored

In [13]:
invoices['Meal Price'].apply(lambda x: type(x)).value_counts()

<class 'int'>    50015
<class 'str'>        2
Name: Meal Price, dtype: int64

In [19]:
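# With the two strings inserted above, this comparison fails with a TypeError (int vs. str).
# The output shown has dtype int64, i.e. it was produced while the column was still purely numeric.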
invoices['Meal Price'][invoices['Meal Price']<10]

536      8
984      7
1041     7
1528     9
2294     8
        ..
46517    9
47225    9
48936    9
48975    8
50012    9
Name: Meal Price, Length: 100, dtype: int64

In [22]:
# Conditionally filter by lambda condition
invoices['Meal Price'][invoices['Meal Price'].apply(
  lambda x: isinstance(x,str)
)]

35612                  Me too
45612    I am causing trouble
Name: Meal Price, dtype: object
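An alternative to the isinstance check, sketched here on a made-up Series (`prices` is a hypothetical stand-in for the Meal Price column), is to coerce the values with pd.to_numeric() and select the rows that became NaN although the original value was present:

```python
import pandas as pd

# Hypothetical stand-in for the 'Meal Price' column with two bad entries
prices = pd.Series([469, "Me too", 22, "I am causing trouble"], dtype="object")

# Values that cannot be parsed as numbers turn into NaN
converted = pd.to_numeric(prices, errors="coerce")

# Offending rows: NaN after conversion, but not missing in the original
offenders = prices[converted.isna() & prices.notna()]
print(offenders)
```

Unlike the isinstance check, this variant would accept numeric strings such as '12', because pd.to_numeric() can parse them.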

In this case, it is reasonable to convert the offending values into np.nan by passing errors='coerce' to pd.to_numeric(), like this:

In [23]:
pd.to_numeric(invoices['Meal Price'],errors='coerce')

0        469.0
1         22.0
2        314.0
3        438.0
4        690.0
         ...  
50012      9.0
50013     25.0
50014    160.0
50015    497.0
50016    365.0
Name: Meal Price, Length: 50017, dtype: float64

In [25]:
invoices.iloc[45612]

Order Id                      SJA1F92KXWZDH398
Date                       2017-02-26 00:00:00
Meal Id                       OOW0UEXQY5RMPPZ8
Company Id                    ICNGUMLKEB27T1P3
Date of Meal         2017-03-02 20:00:00+01:00
Participants                  ['Betty Stroud']
Meal Price                I am causing trouble
Type of Meal                            Dinner
Heroes Adjustment                        False
Name: 45612, dtype: object

In [26]:
pd.to_numeric(invoices['Meal Price'],errors='coerce')[45612]

nan

In [27]:
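# Coerce the bad values to NaN, then fill the gaps with the column median
invoices['Meal Price'] = pd.to_numeric(invoices['Meal Price'],errors='coerce')
invoices['Meal Price'] = invoices['Meal Price'].fillna(invoices['Meal Price'].median())
# Displays an int64 copy; the column itself stays float64 unless the result is assigned back
invoices['Meal Price'].astype(int)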

0        469
1         22
2        314
3        438
4        690
        ... 
50012      9
50013     25
50014    160
50015    497
50016    365
Name: Meal Price, Length: 50017, dtype: int64

In [30]:
print(invoices['Meal Price'].median())
invoices.iloc[45610:45614]

398.0


Unnamed: 0,Order Id,Date,Meal Id,Company Id,Date of Meal,Participants,Meal Price,Type of Meal,Heroes Adjustment
45610,DMB8H3M3WT8GJSPN,2016-08-20,AOJPWQGKNVDF9UD5,ICNGUMLKEB27T1P3,2016-08-17 21:00:00+02:00,['Betty Stroud'],49.0,Dinner,False
45611,XEEXYOB84AHVCC1J,2018-07-17,63LLVLE72VG2J157,ICNGUMLKEB27T1P3,2018-07-19 20:00:00+02:00,['Alesha Wooten'],891.0,Dinner,False
45612,SJA1F92KXWZDH398,2017-02-26,OOW0UEXQY5RMPPZ8,ICNGUMLKEB27T1P3,2017-03-02 20:00:00+01:00,['Betty Stroud'],398.0,Dinner,False
45613,HC6MTWMXF99YEB92,2018-03-01,II205DMW5FBPTGIX,ICNGUMLKEB27T1P3,2018-03-02 13:00:00+01:00,['Betty Stroud'],245.0,Lunch,False
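If imputing the median is not appropriate for your data, one alternative is to keep the gaps and still get integers by using pandas' nullable integer dtype 'Int64' (note the capital I). This is a minimal sketch on made-up data (`prices` is a hypothetical stand-in for the Meal Price column), not what the lesson itself does:

```python
import pandas as pd

# Hypothetical stand-in for the 'Meal Price' column with one bad entry
prices = pd.Series([469, 22, "Me too", 314], dtype="object")

# Coerce to numbers (bad value -> NaN), then switch to the nullable integer dtype
nullable = pd.to_numeric(prices, errors="coerce").astype("Int64")
print(nullable)
```

The missing entry is shown as <NA>, while the remaining values are stored as integers rather than floats.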


## Manipulating Date and Time in Pandas

A tutorial on Python's standard-library datetime module is available here:
https://colab.research.google.com/github/thimotyb/materials/blob/master/datetime/datetime_tutorial.ipynb


pd.to_datetime()
Does what the name implies: the function converts strings into datetime values. To call to_datetime on a column you would write:
pd.to_datetime(invoices['Date of Meal'])

pandas will then infer the format and try to parse the dates from the input. It does so impressively well:




In [31]:
print(pd.to_datetime('2019-8-1'))
print(pd.to_datetime('2019/8/1'))
print(pd.to_datetime('8/1/2019'))
print(pd.to_datetime('Aug, 1 2019'))
print(pd.to_datetime('Aug - 1 2019'))
print(pd.to_datetime('August - 1 2019'))
print(pd.to_datetime('2019, August - 1'))
print(pd.to_datetime('20190108'))

2019-08-01 00:00:00
2019-08-01 00:00:00
2019-08-01 00:00:00
2019-08-01 00:00:00
2019-08-01 00:00:00
2019-08-01 00:00:00
2019-08-01 00:00:00
2019-01-08 00:00:00
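
Note that the ambiguous string '8/1/2019' was read month-first. If your data is written day-first, the dayfirst parameter of pd.to_datetime() changes that interpretation. A minimal sketch:

```python
import pandas as pd

# Default: the first number is taken as the month (August 1st)
month_first = pd.to_datetime("8/1/2019")

# dayfirst=True: the first number is taken as the day (January 8th)
day_first = pd.to_datetime("8/1/2019", dayfirst=True)

print(month_first.date(), day_first.date())
```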


In [32]:
# With an explicit format; exact=False lets the format match anywhere in the string
print(pd.to_datetime('yolo 20190108',format='%Y%d%m', exact=False))

2019-08-01 00:00:00


In [33]:
invoices['Date of Meal'] = pd.to_datetime(invoices['Date of Meal'],utc=True)
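
The utc=True argument matters here because the column mixes +01:00 and +02:00 offsets (presumably winter and summer time). A minimal sketch with two values copied from the invoices table shows how both are normalized to UTC. The 'Europe/Berlin' timezone used for the conversion back is an assumption for illustration only; the source does not say where the meals took place.

```python
import pandas as pd

# Two 'Date of Meal' strings with different UTC offsets
raw = pd.Series(["2017-09-21 08:00:00+02:00", "2018-03-18 09:00:00+01:00"])

# utc=True converts every value to UTC, giving one timezone-aware column
meals_utc = pd.to_datetime(raw, utc=True)
print(meals_utc)

# Optionally convert to a local timezone for display (timezone name is an assumption)
meals_local = meals_utc.dt.tz_convert("Europe/Berlin")
print(meals_local)
```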

## Accessors

An accessor is a property that acts as an interface to methods specific to the type of data you are working with. Those methods are highly specialized: each one does a single job, and does it concisely.
There are three different accessors:
* dt
* str
* cat
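
A minimal sketch on made-up data (the `toy` frame and its column names are hypothetical) shows one call through each accessor, and what happens when an accessor is used on a column of the wrong type:

```python
import pandas as pd

# Hypothetical toy frame with one column per accessor
toy = pd.DataFrame({
    "when": pd.to_datetime(["2019-08-01", "2019-12-31"]),
    "name": ["ann", "bob"],
    "meal": pd.Categorical(["Lunch", "Dinner"]),
})

print(toy["when"].dt.year.tolist())         # datetime-specific
print(toy["name"].str.upper().tolist())     # string-specific
print(toy["meal"].cat.categories.tolist())  # category-specific

# Using an accessor on the wrong dtype raises an AttributeError
try:
    toy["name"].dt.year
except AttributeError as err:
    print("AttributeError:", err)
```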

In [35]:
pd.set_option('display.max_rows', 4)

### dt

In [36]:
invoices['Date of Meal'].dt.date

0        2016-05-31
1        2018-10-01
            ...    
50015    2018-02-01
50016    2017-09-09
Name: Date of Meal, Length: 50017, dtype: object

In [38]:
invoices['Date of Meal'].dt.weekday

0        1
1        0
        ..
50015    3
50016    5
Name: Date of Meal, Length: 50017, dtype: int64

In [44]:
invoices['Date of Meal'].dt.month

0         5
1        10
         ..
50015     2
50016     9
Name: Date of Meal, Length: 50017, dtype: int64

In [42]:
invoices['Date of Meal'].dt.isocalendar().week

0        22
1        40
         ..
50015     5
50016    36
Name: week, Length: 50017, dtype: UInt32

In [45]:
invoices['Date of Meal'].dt.is_month_end
# also available: is_leap_year, is_month_start, is_quarter_start, is_quarter_end, is_year_start, is_year_end

0         True
1        False
         ...  
50015    False
50016    False
Name: Date of Meal, Length: 50017, dtype: bool

In [46]:
invoices[invoices['Date of Meal'].dt.is_month_end]

Unnamed: 0,Order Id,Date,Meal Id,Company Id,Date of Meal,Participants,Meal Price,Type of Meal,Heroes Adjustment
0,839FKFW2LLX4LMBB,2016-05-27,INBUX904GIHI8YBD,LJKS5NK6788CYMUU,2016-05-31 05:00:00+00:00,['David Bishop'],469.0,Breakfast,False
6,2DDN2LHS7G85GKPQ,2014-04-29,1MKLAKBOE3SP7YUL,LJKS5NK6788CYMUU,2014-04-30 19:00:00+00:00,['Susan Guerrero' 'David Bishop'],14.0,Dinner,False
...,...,...,...,...,...,...,...,...,...
49896,G3FX5EAE2VCUFELA,2016-03-05,SZ1UUTPDNW3FCIFF,CZTLKWWDEHQ0GW0I,2016-02-29 07:00:00+00:00,['Olga Fortenberry'],288.0,Breakfast,False
49929,HEGJLUD58BTP2CC3,2016-02-04,C18IITWW7K615G21,DNAC0XNVYCD3J62R,2016-01-31 21:00:00+00:00,['Linda Torros'],331.0,Dinner,False


In [47]:
# converts the pandas datetimes into regular Python datetime objects
invoices['Date of Meal'].dt.to_pydatetime()

array([datetime.datetime(2016, 5, 31, 5, 0, tzinfo=<UTC>),
       datetime.datetime(2018, 10, 1, 18, 0, tzinfo=<UTC>),
       datetime.datetime(2014, 8, 23, 12, 0, tzinfo=<UTC>), ...,
       datetime.datetime(2017, 9, 22, 19, 0, tzinfo=<UTC>),
       datetime.datetime(2018, 2, 1, 20, 0, tzinfo=<UTC>),
       datetime.datetime(2017, 9, 9, 12, 0, tzinfo=<UTC>)], dtype=object)

In [48]:
# to_period converts the dates into periods (e.g. 'W', 'M', 'Q', or 'Y')
invoices['Date of Meal'].dt.to_period('W')



0        2016-05-30/2016-06-05
1        2018-10-01/2018-10-07
                 ...          
50015    2018-01-29/2018-02-04
50016    2017-09-04/2017-09-10
Name: Date of Meal, Length: 50017, dtype: period[W-SUN]
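
The dt accessor also combines naturally with groupby. As a minimal sketch on made-up data (the `orders` frame is hypothetical), the following sums prices per calendar month by grouping on a monthly period:

```python
import pandas as pd

# Hypothetical orders with timezone-naive timestamps
orders = pd.DataFrame({
    "when": pd.to_datetime(["2019-01-05", "2019-01-20", "2019-02-03"]),
    "price": [10, 20, 5],
})

# Group by the month each order falls into and sum the prices
monthly = orders.groupby(orders["when"].dt.to_period("M"))["price"].sum()
print(monthly)
```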

### str

In [49]:
pd.set_option('display.max_rows', 10)
invoices['Type of Meal'].str.lower()

0        breakfast
1           dinner
2            lunch
3           dinner
4            lunch
           ...    
50012    breakfast
50013    breakfast
50014       dinner
50015       dinner
50016        lunch
Name: Type of Meal, Length: 50017, dtype: object

In [50]:
invoices['Type of Meal'].str.ljust(width=15)

0        Breakfast      
1        Dinner         
2        Lunch          
3        Dinner         
4        Lunch          
              ...       
50012    Breakfast      
50013    Breakfast      
50014    Dinner         
50015    Dinner         
50016    Lunch          
Name: Type of Meal, Length: 50017, dtype: object

In [51]:
invoices['Type of Meal'].str.zfill(width=15)

0        000000Breakfast
1        000000000Dinner
2        0000000000Lunch
3        000000000Dinner
4        0000000000Lunch
              ...       
50012    000000Breakfast
50013    000000Breakfast
50014    000000000Dinner
50015    000000000Dinner
50016    0000000000Lunch
Name: Type of Meal, Length: 50017, dtype: object

In [52]:

invoices['Type of Meal'].str.repeat(2)

0        BreakfastBreakfast
1              DinnerDinner
2                LunchLunch
3              DinnerDinner
4                LunchLunch
                ...        
50012    BreakfastBreakfast
50013    BreakfastBreakfast
50014          DinnerDinner
50015          DinnerDinner
50016            LunchLunch
Name: Type of Meal, Length: 50017, dtype: object

In [53]:
invoices['Type of Meal'].str.endswith('ast')

0         True
1        False
2        False
3        False
4        False
         ...  
50012     True
50013     True
50014    False
50015    False
50016    False
Name: Type of Meal, Length: 50017, dtype: bool

In [54]:
invoices[invoices['Participants'].str.contains('Bruce')]

Unnamed: 0,Order Id,Date,Meal Id,Company Id,Date of Meal,Participants,Meal Price,Type of Meal,Heroes Adjustment
214,PSQXKK7KIDIRKDSG,2018-12-07,GG9MXLE8MGY8VGPD,MR6NETSKD2PSN54L,2018-12-09 06:00:00+00:00,['Jane Bruce'],400.0,Breakfast,False
215,Z8PYUN4L85MEW5W4,2016-07-26,MG7LI3RM8K3UKR34,MR6NETSKD2PSN54L,2016-07-30 19:00:00+00:00,['Jane Bruce' 'Jennifer Lee' 'Rosa Parramore'],503.0,Dinner,False
216,8XWFE5AX3D9LX5AY,2017-09-18,99NEQYAVFP9W1IHK,MR6NETSKD2PSN54L,2017-09-22 06:00:00+00:00,['Jane Bruce'],137.0,Breakfast,False
217,X546I8JFNVJFE7FH,2017-10-01,G2UDNGRTBYGIS90Z,MR6NETSKD2PSN54L,2017-09-26 05:00:00+00:00,['Martin Riley' 'Jane Bruce' 'Rosa Parramore'],409.0,Breakfast,False
219,GCB5ULGM1W8A8HPR,2017-12-14,XI16TWCYLA0F7NLZ,MR6NETSKD2PSN54L,2017-12-15 08:00:00+00:00,['Earl Sorrentino' 'Jane Bruce'],339.0,Breakfast,False
...,...,...,...,...,...,...,...,...,...
48761,W7X0DMISLLY4X9I6,2017-08-29,EDVP4TD6YCHF8N00,TXTJD46IGQWLD75D,2017-09-02 05:00:00+00:00,['John Leo' 'Judy Dammann' 'Todd Bradshaw' 'Co...,465.0,Breakfast,True
48764,K69T43KA9WCW9GOI,2016-06-03,16I4B1HW4T60CO8H,TXTJD46IGQWLD75D,2016-05-29 11:00:00+00:00,['Judy Dammann' 'John Leo' 'Courtney Shaw' 'Br...,112.0,Lunch,True
48767,66DNKN5WXDS73VUA,2018-05-20,5HLCRT0G3S9D20SQ,TXTJD46IGQWLD75D,2018-05-23 06:00:00+00:00,['Lydia Muske' 'Todd Bradshaw' 'Judy Dammann' ...,898.0,Breakfast,True
48770,RYN3CZCDK3TQLX5O,2015-08-11,TE9PC2623UV8XL4L,TXTJD46IGQWLD75D,2015-08-11 07:00:00+00:00,['Courtney Shaw' 'Lydia Muske' 'Bruce Duenas' ...,280.0,Breakfast,True
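
By default str.contains() treats the pattern as a regular expression, and missing values may be passed through to the result, which then cannot be used directly as a boolean mask. A minimal sketch on made-up data (`people` is a hypothetical Series) uses the regex and na parameters to get a literal match and a clean mask:

```python
import pandas as pd

# Hypothetical participants column with one missing value
people = pd.Series([
    "['Jane Bruce']",
    None,
    "['Robin Ramos' 'Bruce Duenas']",
    "['Ann Lee']",
])

# regex=False matches the text literally; na=False treats missing values as no match
mask = people.str.contains("Bruce", regex=False, na=False)
print(people[mask])
```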


### cat