![*INTERTECHNICA - SOLON EDUCATIONAL PROGRAMS - TECHNOLOGY LINE*](https://solon.intertechnica.com/assets/IntertechnicaSolonEducationalPrograms-TechnologyLine.png)

# Data Manipulation with Python - The Pandas Library - Pandas Data Types

*Basic initialization of the workspace.*

In [1]:
!python -m pip install pandas
import pandas as pd
print("Pandas installed at version: {}".format(pd.__version__))

Pandas installed at version: 1.1.5


## 1. Pandas basic data types

The Pandas library supports all the basic data types provided by the NumPy library. In addition, it defines several extension data types, notably for time zone aware dates, time periods and intervals.

The data type extensions supported by Pandas are the following:

| Data type               | Description                                                              | Scalar representation | Data type code |
|-------------------------|--------------------------------------------------------------------------|-----------------------|----------------|
| ``DatetimeTZDtype``     | Time zone aware date time                                                | Timestamp             | ``datetime64[ns, <tz>]`` |
| ``CategoricalDtype``    | Data limited to a fixed set of discrete values (categories)              | N/A                   | ``category`` |
| ``PeriodDtype``         | Date time periods (time spans)                                           | Period                | ``period[<freq>]``, ``Period[<freq>]`` |
| ``SparseDtype``         | Compact storage of data in which one value (such as 0 or NaN) dominates  | N/A                   | ``Sparse``, ``Sparse[int]``, ``Sparse[float]`` |
| ``IntervalDtype``       | Interval data (numeric, datetime, etc.)                                  | Interval              | ``interval``, ``Interval``, ``Interval[<numpy_dtype>]``, ``Interval[datetime64[ns, <tz>]]``, ``Interval[timedelta64[<freq>]]`` |
| ``Int64Dtype``          | Nullable integer data type (integer with null values)                    | N/A                   | ``Int8``, ``Int16``, ``Int32``, ``Int64``, ``UInt8``, ``UInt16``, ``UInt32``, ``UInt64`` |
| ``StringDtype``         | Text data                                                                | str                   | ``string`` |
| ``BooleanDtype``        | Nullable boolean data type (boolean with null values)                    | bool                  | ``boolean`` |

We will focus on exploring the most important use cases when working with basic data types.  
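
As a quick illustration of the data type codes listed in the table, the following minimal sketch passes some of them as the `dtype` argument when building a `Series`. The variable names and sample values are made up for this example and are not part of the original notebook.

```python
import pandas as pd

# nullable integer: the missing value is stored as pd.NA instead of a float NaN
nullable_ints = pd.Series([1, 2, None], dtype="Int64")

# dedicated text and nullable boolean types
texts = pd.Series(["a", "b", None], dtype="string")
flags = pd.Series([True, False, None], dtype="boolean")

# categorical and time zone aware date time types
levels = pd.Series(["low", "high", "low"], dtype="category")
moments = pd.Series(pd.to_datetime(["2022-01-01"])).dt.tz_localize("UTC")

for series in (nullable_ints, texts, flags, levels, moments):
    print(series.dtype)
```

The printed names show which extension type backs each series. The exact time unit reported for the time zone aware series may differ between Pandas versions.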

### 1.1 Working with date time

Handling of date and time is an important feature of any development platform. Pandas offers powerful mechanisms for creating date time values and extracting their relevant properties.
 
The fundamental class for handling date time is the [Timestamp](https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.html) class:

In [2]:
# creating timestamp data and extracting its properties
new_year_2022_date = "2022-01-01"

timestamp = pd.Timestamp(new_year_2022_date)

print(
    "The date {} is represented as timestamp value {}".format(
        new_year_2022_date,
        timestamp
    )
)

print(
    "The day associated with it is {} ({})".format(
        timestamp.day,
        timestamp.day_name()
    )
)

print(
    "The month associated with it is {} ({})".format(
        timestamp.month,
        timestamp.month_name()
    )
)

print(
    "The year associated with it is {}".format(
        timestamp.year
    )
)

print(
    "The day of week and day of year values are {} and {}".format(
        # dayofweek counts from Monday = 0, so Saturday is 5
        timestamp.dayofweek,
        timestamp.dayofyear,
    )
)


The date 2022-01-01 is represented as timestamp value 2022-01-01 00:00:00
The day associated with it is 1 (Saturday)
The month associated with it is 1 (January)
The year associated with it is 2022
The day of week and day of year values are 5 and 1
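
The same properties are also available for a whole column of dates. The sketch below assumes a small, made-up series of date strings and uses the `.dt` accessor to extract the values shown above for each element.

```python
import pandas as pd

# parse a column of date strings into date time values
dates = pd.to_datetime(pd.Series(["2022-01-01", "2022-02-15"]))

# the .dt accessor exposes Timestamp-like properties for every element
summary = pd.DataFrame({
    "date": dates,
    "day_name": dates.dt.day_name(),
    "month_name": dates.dt.month_name(),
    "day_of_year": dates.dt.dayofyear,
})

print(summary)
```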


Pandas supports not only individual date values but also ranges of time periods, created via the [period_range](https://pandas.pydata.org/docs/reference/api/pandas.period_range.html) function:

In [3]:
# creating a period range for 2022 with
# a quarterly frequency
end_year_2022_date = "2022-12-31"
frequency = "Q"
time_period_range = pd.period_range(
    new_year_2022_date,
    end_year_2022_date,
    freq = frequency
)

print(
    "The period range from {} to {} with a frequency of {} contains the elements \n{}".format(
        new_year_2022_date,
        end_year_2022_date,
        frequency,
        # expose the periods as a NumPy array of Period objects
        time_period_range.values
    )
)

The period range from 2022-01-01 to 2022-12-31 with a frequency of Q contains the elements 
[Period('2022Q1', 'Q-DEC') Period('2022Q2', 'Q-DEC')
 Period('2022Q3', 'Q-DEC') Period('2022Q4', 'Q-DEC')]
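
Each element of a period range is a `Period` that covers a span of time rather than a single instant. As a minimal sketch (the variable names are chosen for this example only), the boundaries of a period can be inspected and a timestamp can be mapped to the period that contains it:

```python
import pandas as pd

quarters = pd.period_range("2022-01-01", "2022-12-31", freq="Q")
first_quarter = quarters[0]

# a period has a start and an end instant
print(first_quarter.start_time)
print(first_quarter.end_time)

# a timestamp can be converted to the period that contains it
valentines_day = pd.Timestamp("2022-02-14")
print(valentines_day.to_period("Q") == first_quarter)

# the same year split into monthly periods
months = pd.period_range("2022-01-01", "2022-12-31", freq="M")
print(len(months))
```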


Pandas also supports date arithmetic. The [DateOffset](https://pandas.pydata.org/docs/reference/api/pandas.tseries.offsets.DateOffset.html) class shifts a date time value by calendar units such as years, months, days and hours:

In [4]:
# create a sample date offset
offset = pd.DateOffset(
    years = 1,
    months = 1,
    days = 1,
    hours = 1,
    minutes = 1
)

# change the timestamp with the specified date offset
new_timestamp = timestamp + offset

print("The timestamp {} changed with {} is {}".format(
    timestamp,
    offset,
    new_timestamp
))

The timestamp 2022-01-01 00:00:00 changed with <DateOffset: days=1, hours=1, minutes=1, months=1, years=1> is 2023-02-02 01:01:00
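
A date offset follows the calendar, which makes it different from a fixed duration. The following sketch contrasts `DateOffset` with `pd.Timedelta` (a class not used elsewhere in this notebook) on an end-of-month date chosen for illustration:

```python
import pandas as pd

end_of_january = pd.Timestamp("2022-01-31")

# calendar arithmetic: one month later, clipped to the last valid day of February
plus_one_month = end_of_january + pd.DateOffset(months=1)

# fixed duration: exactly 30 days later, regardless of month lengths
plus_thirty_days = end_of_january + pd.Timedelta(days=30)

print(plus_one_month)    # 2022-02-28 00:00:00
print(plus_thirty_days)  # 2022-03-02 00:00:00
```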


### 1.2 Working with categorical data

Categorical data encodes values drawn from a relatively small set of discrete values into a representation that is efficient for both storage and data analysis. The fundamental class for handling categorical data is [Categorical](https://pandas.pydata.org/docs/reference/api/pandas.Categorical.html):

In [5]:
# creating categorical data and extracting the associated categories and codes;
# the categories are inferred from the data and sorted, and each code is the
# position of the corresponding value in the categories array
categorical_data_raw = ["One", "Two", "Three", "Two", "Four", "One"]
categorical_data = pd.Categorical(categorical_data_raw)

print(
    "The categorical data \n{}\n has been encoded in the categories "
    "\n{}\n with the following codes \n{}\n".format(
        categorical_data_raw,
        categorical_data.categories.values,
        categorical_data.codes
    )
)

The categorical data 
['One', 'Two', 'Three', 'Two', 'Four', 'One']
 has been encoded in the categories 
['Four' 'One' 'Three' 'Two']
 with the following codes 
[1 3 2 3 0 1]
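
In the output above the categories are sorted alphabetically, which does not match the numeric meaning of the labels. As a minimal sketch, the categories and their order can be stated explicitly through the `categories` and `ordered` parameters, which also enables comparisons and minimum/maximum lookups:

```python
import pandas as pd

raw = ["One", "Two", "Three", "Two", "Four", "One"]

# state the categories explicitly and declare that their order is meaningful
ordered_data = pd.Categorical(
    raw,
    categories=["One", "Two", "Three", "Four"],
    ordered=True
)

# the codes now follow the declared order
print(ordered_data.codes)

# ordered categories support min, max and comparisons against a category
print(ordered_data.min(), ordered_data.max())
print(ordered_data > "Two")
```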



### 1.3 Working with interval data

Pandas represents intervals through the [Interval](https://pandas.pydata.org/docs/reference/api/pandas.Interval.html) class. An interval is defined by its left and right limits and by the sides on which it is closed. Interval data can be used to check whether a scalar value belongs to an interval or whether two intervals overlap each other:

In [6]:
# define numeric intervals 
numeric_interval_1 = pd.Interval(
    left = 0,
    right = 10,
    closed = "right"
)

numeric_interval_2 = pd.Interval(
    left = 10,
    right = 20,
    closed = "right"
)

# check scalar value membership
scalar_value = 10

print(
    "The scalar {} {} to interval {}".format(
        scalar_value,
        "belongs" if (scalar_value in numeric_interval_1)
        else "does not belong",
        numeric_interval_1
    )
)

print(
    "The scalar {} {} to interval {}".format(
        scalar_value,
        "belongs" if (scalar_value in numeric_interval_2)
        else "does not belong",
        numeric_interval_2
    )
)

# check whether the intervals overlap
print(
    "The intervals {} and {} {}.".format(
        numeric_interval_1,
        numeric_interval_2,
        "overlap" if (numeric_interval_1.overlaps(numeric_interval_2))
        else "do not overlap"
    )
)


The scalar 10 belongs to interval (0, 10]
The scalar 10 does not belong to interval (10, 20]
The intervals (0, 10] and (10, 20] do not overlap.
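
The result of both checks depends on which sides of the intervals are closed. The sketch below repeats them for the four values accepted by the `closed` parameter; the `results` dictionary is only a helper for this illustration.

```python
import pandas as pd

results = {}
for closed in ("right", "left", "both", "neither"):
    first = pd.Interval(left=0, right=10, closed=closed)
    second = pd.Interval(left=10, right=20, closed=closed)
    # membership of the shared limit 10 in each interval, then the overlap check
    results[closed] = (10 in first, 10 in second, first.overlaps(second))

for closed, outcome in results.items():
    print(closed, outcome)
```

Two adjacent intervals overlap only when the shared limit belongs to both of them, which happens for `closed = "both"`.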


The limits of an interval can also be used to generate the range of values it contains:

In [7]:
import numpy as np

# np.arange includes the start value and excludes the stop value, so an open
# left limit and a closed right limit are both shifted by 1
left_limit = numeric_interval_1.left if numeric_interval_1.closed_left \
                                     else numeric_interval_1.left + 1

right_limit = numeric_interval_1.right + 1 if numeric_interval_1.closed_right \
                                     else numeric_interval_1.right

# generate the integers contained in the interval
print(
    "The integers in the interval \n{}\n are \n{}\n".format(
        numeric_interval_1,
        np.arange(left_limit, right_limit)
    )
)

The integers in the interval 
(0, 10]
 are 
[ 1  2  3  4  5  6  7  8  9 10]
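
The same idea can be wrapped in a reusable function. The sketch below defines a hypothetical helper, `integers_in` (not part of Pandas), that also copes with non-integer limits by rounding them inward before applying the open or closed rules.

```python
import math

import numpy as np
import pandas as pd


def integers_in(interval):
    """Hypothetical helper: return the integers contained in a numeric pd.Interval."""
    # smallest integer not below the left limit, skipped if that limit is open
    first = math.ceil(interval.left)
    if first == interval.left and not interval.closed_left:
        first += 1

    # largest integer not above the right limit, skipped if that limit is open
    last = math.floor(interval.right)
    if last == interval.right and not interval.closed_right:
        last -= 1

    # np.arange excludes the stop value, hence last + 1
    return np.arange(first, last + 1)


print(integers_in(pd.Interval(0, 10, closed="right")))
print(integers_in(pd.Interval(0, 10, closed="neither")))
print(integers_in(pd.Interval(0.5, 3.5, closed="both")))
```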

