<a href="https://colab.research.google.com/github/thimotyb/real-world-machine-learning/blob/python3/pandas_advanced_techniques.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas - Advanced Techniques

This lesson is adapted from the following web source:

1.   https://towardsdatascience.com/learn-advanced-features-for-pythons-main-data-analysis-library-in-20-minutes-d0eedd90d086

## Data Types

Let’s quickly summarize the most commonly used Pandas data types. There are seven of them:
* object : This data type is used for strings (i.e., sequences of characters)
* int64 : Used for integers (whole numbers, no decimals)
* float64 : Used for floating-point numbers (i.e., figures with decimals/fractions)
* bool : Used for values that can only be True/False
* datetime64 : Used for date and time values
* timedelta : Used to represent the difference between datetimes
* category : Used for values that take one out of a limited number of available options (categories don’t have to, but can have explicit ordering)
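
As a quick illustration of the list above, the following sketch builds a tiny DataFrame with one column per data type and prints the resulting dtypes. The column names and values are hypothetical and are not part of the lesson's datasets. The exact dtype labels can vary between Pandas versions (for example, newer releases may report a dedicated string dtype instead of object).

```python
import pandas as pd

# Hypothetical toy data: one column for each dtype in the list above
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy"],                    # strings
    "visits": [3, 5, 8],                             # integers
    "score": [1.5, 2.0, 3.25],                       # floating-point numbers
    "active": [True, False, True],                   # booleans
    "joined": pd.to_datetime(["2020-01-01", "2020-06-15", "2021-03-01"]),
    # An ordered categorical: S < M < L
    "size": pd.Categorical(["S", "L", "M"], categories=["S", "M", "L"], ordered=True),
})

# Subtracting datetimes from a timestamp yields a timedelta column
df["tenure"] = pd.Timestamp("2022-01-01") - df["joined"]

print(df.dtypes)
```

Because the categories are declared as ordered, comparisons and aggregations such as `df["size"].max()` follow the declared order rather than alphabetical order.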

In [1]:
import pandas as pd
import numpy as np
import datetime
import pytz
from IPython.display import display, HTML
display(HTML("<style>.container {width:90% !important;}</style>"))
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 10)

In [2]:
data = pd.read_csv('https://raw.githubusercontent.com/thimotyb/you-datascientist/master/happiness_with_continent.csv')

In [17]:
invoices = pd.read_csv('https://raw.githubusercontent.com/thimotyb/you-datascientist/master/invoices.csv')

In [12]:
invoices.tail(5)

| | Order Id | Date | Meal Id | Company Id | Date of Meal | Participants | Meal Price | Type of Meal | Heroes Adjustment |
|---|---|---|---|---|---|---|---|---|---|
| 50012 | 4OMS8ZSA0UX8LHWI | 2017-09-20 | 1TD5MROATV1NHZ4Y | E4K99D4JR9E40VE1 | 2017-09-21 08:00:00+02:00 | ['Regina Shirley'] | 9 | Breakfast | False |
| 50013 | RR0VKJN8V0KHNKGG | 2018-03-19 | 22EX9VZSJKHP4AIP | E4K99D4JR9E40VE1 | 2018-03-18 09:00:00+01:00 | ['Robin Ramos' 'Chester Mortimer'] | 25 | Breakfast | False |
| 50014 | STJ6QJC30WPRM93H | 2017-09-21 | LMX18PNGWCIMG1QW | E4K99D4JR9E40VE1 | 2017-09-22 21:00:00+02:00 | ['Robin Ramos'] | 160 | Dinner | False |
| 50015 | QHEUIYNC0XQX7GDR | 2018-01-28 | 4U0VH2TGQL30X23X | E4K99D4JR9E40VE1 | 2018-02-01 21:00:00+01:00 | ['Chester Mortimer' 'Robin Ramos'] | 497 | Dinner | False |
| 50016 | NKHFWT5I2J9LPAPG | 2017-09-06 | ORWFRT5TUSYGNYG7 | E4K99D4JR9E40VE1 | 2017-09-09 14:00:00+02:00 | ['Chester Mortimer' 'Robin Ramos'] | 365 | Lunch | False |


In [7]:
invoices.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50017 entries, 0 to 50016
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Order Id           50017 non-null  object 
 1   Date               50017 non-null  object 
 2   Meal Id            50017 non-null  object 
 3   Company Id         50017 non-null  object 
 4   Date of Meal       50017 non-null  object 
 5   Participants       50017 non-null  object 
 6   Meal Price         50017 non-null  float64
 7   Type of Meal       50017 non-null  object 
 8   Heroes Adjustment  50017 non-null  bool   
dtypes: bool(1), float64(1), object(7)
memory usage: 3.1+ MB


In [18]:
invoices['Type of Meal'] = invoices['Type of Meal'].astype('category')
invoices['Date'] = invoices['Date'].astype('datetime64[ns]')
invoices['Meal Price'] = invoices['Meal Price'].astype('int')

In [9]:
invoices.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50017 entries, 0 to 50016
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   Order Id           50017 non-null  object        
 1   Date               50017 non-null  datetime64[ns]
 2   Meal Id            50017 non-null  object        
 3   Company Id         50017 non-null  object        
 4   Date of Meal       50017 non-null  object        
 5   Participants       50017 non-null  object        
 6   Meal Price         50017 non-null  int64         
 7   Type of Meal       50017 non-null  category      
 8   Heroes Adjustment  50017 non-null  bool          
dtypes: bool(1), category(1), datetime64[ns](1), int64(1), object(5)
memory usage: 2.8+ MB
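
The lower memory figure in the second info() output is consistent with the category conversion, since a categorical column stores one small integer code per row plus a single copy of each distinct label. The sketch below compares the two representations on a hypothetical column of meal types; the data is made up for illustration and the exact byte counts depend on the Pandas version.

```python
import pandas as pd

# Hypothetical column with only three distinct values, repeated many times
meals = pd.Series(["Breakfast", "Lunch", "Dinner"] * 10_000)

# Convert the repeated strings to a categorical column
as_category = meals.astype("category")

# deep=True also counts the memory used by the string objects themselves
obj_bytes = meals.memory_usage(deep=True)
cat_bytes = as_category.memory_usage(deep=True)

print(f"as strings:  {obj_bytes} bytes")
print(f"as category: {cat_bytes} bytes")
```

The saving grows with the number of rows and shrinks as the number of distinct values approaches the number of rows, so the conversion pays off mainly for columns with few unique values.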


Conversion helpers such as pd.to_numeric() and pd.to_datetime() let you specify what happens when a value cannot be converted.
Both functions accept an additional parameter, errors, that defines how errors are treated. Passing errors='ignore' returns the input unchanged when the conversion fails (this option is deprecated in recent Pandas releases), while errors='coerce' turns the offending values into np.nan (NaT for datetimes). The default behavior is to raise an error.
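
A minimal sketch of the default and the coerce behavior, using small made-up Series rather than the invoices data:

```python
import pandas as pd

# Hypothetical sample values; the last one cannot be parsed as a number
raw = pd.Series(["12", "7", "not a number"])

# Default (errors='raise'): the unparseable value aborts the whole conversion
try:
    pd.to_numeric(raw)
    raised = False
except ValueError:
    raised = True

# errors='coerce': unparseable values become NaN, the rest are converted
coerced = pd.to_numeric(raw, errors="coerce")

# pd.to_datetime takes the same parameter; failed values become NaT
dates = pd.to_datetime(pd.Series(["2017-09-20", "not a date"]), errors="coerce")

print(raised)
print(coerced)
print(dates)
```

Coercion is convenient when a few bad values are expected, but it hides them silently, so it is worth counting the resulting missing values afterwards.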

To demonstrate how conversion errors can be handled, we first overwrite two prices with strings that cannot be converted.

In [21]:
invoices.loc[45612,'Meal Price'] = 'I am causing trouble'
invoices.loc[35612,'Meal Price'] = 'Me too'

In [11]:
invoices['Meal Price'].astype(int)

ValueError: ignored

In [13]:
invoices['Meal Price'].apply(lambda x: type(x)).value_counts()

<class 'int'>    50015
<class 'str'>        2
Name: Meal Price, dtype: int64

In [19]:
# This comparison only works while the column is purely numeric; once it mixes
# int and str values, it raises a TypeError
invoices.loc[invoices['Meal Price'] < 10, 'Meal Price']

536      8
984      7
1041     7
1528     9
2294     8
        ..
46517    9
47225    9
48936    9
48975    8
50012    9
Name: Meal Price, Length: 100, dtype: int64

In [22]:
# Conditionally filter by lambda condition
invoices['Meal Price'][invoices['Meal Price'].apply(
  lambda x: isinstance(x,str)
)]

35612                  Me too
45612    I am causing trouble
Name: Meal Price, dtype: object

In this case, it is reasonable to convert the offending values into np.nan by passing errors='coerce' to pd.to_numeric(), like this:

In [23]:
pd.to_numeric(invoices['Meal Price'],errors='coerce')

0        469.0
1         22.0
2        314.0
3        438.0
4        690.0
         ...  
50012      9.0
50013     25.0
50014    160.0
50015    497.0
50016    365.0
Name: Meal Price, Length: 50017, dtype: float64

In [25]:
invoices.iloc[45612]

Order Id                      SJA1F92KXWZDH398
Date                       2017-02-26 00:00:00
Meal Id                       OOW0UEXQY5RMPPZ8
Company Id                    ICNGUMLKEB27T1P3
Date of Meal         2017-03-02 20:00:00+01:00
Participants                  ['Betty Stroud']
Meal Price                I am causing trouble
Type of Meal                            Dinner
Heroes Adjustment                        False
Name: 45612, dtype: object

In [26]:
pd.to_numeric(invoices['Meal Price'],errors='coerce')[45612]

nan
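
Note that the coerced result is float64, because NaN cannot be stored in a regular integer column. The following sketch shows three possible ways to continue from there, using a short hypothetical Series in place of the Meal Price column. Which option fits depends on the analysis, and none of them is prescribed by the lesson.

```python
import pandas as pd

# Hypothetical stand-in for the 'Meal Price' column with two corrupted entries
prices = pd.Series([469, 22, "Me too", 314, "I am causing trouble"], dtype=object)

# Coerce to numbers: the result is float64 with NaN in place of the two strings
clean = pd.to_numeric(prices, errors="coerce")

# Option 1: drop the values that could not be converted, then restore integers
as_int = clean.dropna().astype("int64")

# Option 2: keep every row and use the nullable integer dtype,
# which represents missing values as <NA>
nullable = clean.astype("Int64")

# Option 3: substitute a placeholder, here the median of the valid prices
filled = clean.fillna(clean.median())

print(as_int)
print(nullable)
print(filled)
```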