# Pandas Data Types

In [None]:
#@title ### Run the following cell to download the necessary files for this lesson { display-mode: "form" } 
#@markdown Don't worry about what's in this collapsed cell

!pip install -q PyYaml
print('Downloading Salaries.csv...')
!wget https://s3-eu-west-1.amazonaws.com/aicore-portal-public-dev-524288083424/lesson_files/3d1ad53c-8b55-41f0-8772-08c39b437cfb/Salaries.csv -q -O Salaries.csv
print('Downloading employees.json...')
!wget https://s3-eu-west-1.amazonaws.com/aicore-portal-public-dev-524288083424/lesson_files/3d1ad53c-8b55-41f0-8772-08c39b437cfb/employees.json -q -O employees.json
print('Downloading animals.yaml...')
!wget https://s3-eu-west-1.amazonaws.com/aicore-portal-public-dev-524288083424/lesson_files/3d1ad53c-8b55-41f0-8772-08c39b437cfb/animals.yaml -q -O animals.yaml
print('Downloading employees_2.json...')
!wget https://s3-eu-west-1.amazonaws.com/aicore-portal-public-dev-524288083424/lesson_files/3d1ad53c-8b55-41f0-8772-08c39b437cfb/employees_2.json -q -O employees_2.json


> Once you have imported your data into a `DataFrame`, a common first step is to check the data type (`dtype`) of each column, and adjust any that aren't optimal for your analysis. 



Below is a table that shows various `Pandas` data types, their corresponding Python data types, and a brief description of each.

| Pandas Dtype | Python Type         | Description                                           |
|--------------|---------------------|-------------------------------------------------------|
| object       | str (or mixed)      | Used for text or mixed types of data                  |
| int64        | int                 | Integer numbers                                       |
| float64      | float               | Floating-point numbers                                |
| bool         | bool                | Boolean values (True/False)                           |
| datetime64   | datetime.datetime   | Date and time values                                  |
| timedelta64  | datetime.timedelta  | Differences between two datetimes                     |
| category     | (special type)      | Finite set of distinct values (often text)            |
| period       | pd.Period           | Periods of time, useful for time-series data          |
| sparse       | (special type)      | Sparse array for mostly missing or repeated values    |
| string       | str                 | Text only                                             |

Note that:

- The `int64` and `float64` data types indicate 64-bit storage for integer and floating-point numbers, respectively. `Pandas` also supports other sizes (like `int32` and `float32`) to save memory when the larger sizes are not necessary. A column of the default `int64` type cannot contain `NaN` values; if integers and missing values need to coexist, `Pandas` offers the nullable `Int64` extension type (note the capital `I`).
- The `category` dtype is not a native Python data type but is provided by `Pandas` to optimize memory usage and performance for data with a small number of distinct values
- The `sparse` dtype is used for data that is mostly missing to save memory by only storing the non-missing values
- The `period` dtype is specific to `Pandas` and is used for handling period data, which is not directly analogous to a native Python type
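
The notes above can be checked with a short sketch. The series names and values here are made up for illustration and are not part of the lesson files; the nullable `Int64` type is an addition that the table does not list.

```python
import pandas as pd

# 1,000 integers stored at two different widths
ids_64 = pd.Series(range(1000), dtype="int64")
ids_32 = ids_64.astype("int32")
print(ids_64.memory_usage(index=False))  # 8000 bytes (8 bytes per value)
print(ids_32.memory_usage(index=False))  # 4000 bytes (4 bytes per value)

# A text column with only three distinct values (hypothetical data)
colours = pd.Series(["red", "green", "blue"] * 1000)
colours_cat = colours.astype("category")
# deep=True asks pandas to measure the text values themselves
print(colours.memory_usage(index=False, deep=True))
print(colours_cat.memory_usage(index=False, deep=True))

# The nullable Int64 extension type (capital "I") can hold missing values
scores = pd.Series([1, None, 3], dtype="Int64")
print(scores)
```

The exact byte counts for the text column depend on your `Pandas` version, but the `category` version should be considerably smaller, because each distinct value is stored only once.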



## Checking and Assigning Data Types

When you import your raw data, `Pandas` will attempt to automatically assign a data type to each column, but it doesn't always make the best choice.

Let's import some example data to a `DataFrame` and take a look at how to do this. 

In [1]:
import pandas as pd
# Create a simple dataframe of names and ages
data = {'Name': ['Alice', 'Bashar', 'Carlos', 'Diana', "Ephraim", "Frank", "Gina"],
        'Age': [21, 22, 'n/a', 24, 25, 'missing', 27]}
age_df = pd.DataFrame(data)



### Checking the Data Type

The data types of your columns can be accessed via the `.dtypes` attribute, or by calling the `.info()` method. 

- The `.dtypes` attribute only returns the data type of each column
- The `.info()` method returns both the data type and some additional information: the number of rows and the memory usage of the dataframe, as well as the number of non-null values in each column. We will deal with handling `NULL` values in another lesson.


In [2]:
age_df.dtypes

Name    object
Age     object
dtype: object

In [3]:
age_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    7 non-null      object
 1   Age     7 non-null      object
dtypes: object(2)
memory usage: 244.0+ bytes


Looking at the output of `info()`, both columns have defaulted to the `object` data type, which is not quite what we want. The `Name` column should be of `string` type, as this is a dedicated text type that makes the column's contents explicit (and can use less memory than `object` when backed by PyArrow), and the `Age` column should be of a numeric type so that we can do numeric calculations on it.
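
To see why `Age` ended up as `object`, it can help to count the Python types stored in the column. The following is a minimal sketch that rebuilds the same dataframe so that it runs on its own; the name `type_counts` is just an illustrative choice.

```python
import pandas as pd

# Rebuild the example dataframe of names and ages
age_df = pd.DataFrame({
    'Name': ['Alice', 'Bashar', 'Carlos', 'Diana', 'Ephraim', 'Frank', 'Gina'],
    'Age': [21, 22, 'n/a', 24, 25, 'missing', 27],
})

# Map each value to the name of its Python type, then count the occurrences
type_counts = age_df['Age'].map(lambda value: type(value).__name__).value_counts()
print(type_counts)
```

The column mixes integers with strings, so `Pandas` falls back to the general-purpose `object` data type.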


### Assigning Datatypes



The `.astype()` method can be used to manually assign a datatype to a column. We can easily change the `Name` column to the `string` type:

In [2]:
age_df.Name = age_df.Name.astype('string')
age_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    7 non-null      string
 1   Age     7 non-null      object
dtypes: object(1), string(1)
memory usage: 244.0+ bytes


When we try to cast the `Age` column as `int64`, though, the method raises a `ValueError`, because it encounters values (`n/a` and `missing`) that it cannot cast as integers:

In [3]:
age_df.Age = age_df.Age.astype('int64')

ValueError: invalid literal for int() with base 10: 'n/a'

## The `pd.to_numeric()` Function

By default, the `astype()` method raises an error when it encounters a value it cannot convert, as this prevents accidental data loss. We can override this behaviour by setting the `errors` parameter to `'ignore'` rather than `'raise'`, but the column would still not be converted: the method would return it unchanged, with the `object` data type.
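
The two behaviours described above can be compared side by side. This is a small sketch on a standalone series (the names `ages`, `raised` and `unchanged` are illustrative), assuming a `Pandas` version in which `astype()` accepts `errors='ignore'`:

```python
import pandas as pd

ages = pd.Series([21, 22, 'n/a', 24, 25, 'missing', 27])

# Default behaviour (errors='raise'): the conversion fails loudly
try:
    ages.astype('int64')
    raised = False
except ValueError as error:
    raised = True
    print(f"Conversion failed: {error}")

# errors='ignore': no exception, but the original values come back unchanged
unchanged = ages.astype('int64', errors='ignore')
print(unchanged.dtype)
```

Neither option produces a numeric column, which is why a different tool is needed.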

In this scenario, we are happy to lose the information in the non-numeric values for the sake of being able to treat the column as numeric. To do this, we can use a separate function, `pd.to_numeric()`, with the `errors` parameter set to `'coerce'` to force the conversion:


In [7]:
age_df.Age = pd.to_numeric(age_df.Age, errors='coerce')
age_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    7 non-null      string 
 1   Age     5 non-null      float64
dtypes: float64(1), string(1)
memory usage: 244.0+ bytes


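The values that could not be converted have been replaced with `NaN` (Not a Number) values. Because `NaN` is itself a floating-point value, the column has been given the `float64` data type rather than `int64`: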

In [8]:
age_df.head(5)

      Name   Age
0    Alice  21.0
1   Bashar  22.0
2   Carlos   NaN
3    Diana  24.0
4  Ephraim  25.0
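
If the ages should be stored as whole numbers despite the missing values, one option is to cast the coerced column to the nullable `Int64` extension type. This goes a step beyond what the lesson covers, so treat it as an optional sketch; the series below reuses the ages from the example.

```python
import pandas as pd

ages = pd.Series([21, 22, 'n/a', 24, 25, 'missing', 27])

# Step 1: coerce non-numeric values to NaN (this gives a float64 series)
ages_float = pd.to_numeric(ages, errors='coerce')

# Step 2: cast to the nullable integer type, where missing values become <NA>
ages_int = ages_float.astype('Int64')
print(ages_int)
print(ages_int.mean())  # missing values are skipped by default
```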


## Time Series Data Types

### The `datetime64` Data Type



In `Pandas`, the `datetime64` data type provides a memory-efficient structure for working with date and time data, allowing for operations like time-based indexing, slicing, and resampling to be performed. This data type is necessary for effective time-series data analysis, as it allows complex temporal computations and aggregations to be performed with relative ease.

Date-time columns can be challenging to convert correctly, because dates and times can be formatted in a very large number of ways, and there is no guarantee that a column uses only one of them. It is therefore important to determine which formats are used in your data before attempting to convert a column to `datetime64`.

### Casting a Column to Datetime

The `pd.to_datetime` function can be used to cast a column to `datetime64`. In order to ensure that the conversion is accurate, it is necessary to consider the format that the date/time values are in.

Run the three code blocks below to perform a simple example conversion and confirm that the `datetime64`-encoded dates are correct. Here the data are initially stored as strings in an unambiguous format, so the conversion works without any extra arguments:

In [13]:
data = {
    'date_strings': ['2023-01-01', '2023-02-01', '2023-03-01', '2023-04-01']
}
date_df = pd.DataFrame(data)

date_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 1 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   date_strings  4 non-null      object
dtypes: object(1)
memory usage: 164.0+ bytes


In [14]:
date_df.date_strings = pd.to_datetime(date_df.date_strings)
date_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 1 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   date_strings  4 non-null      datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 164.0 bytes


In [15]:
date_df.head()

  date_strings
0   2023-01-01
1   2023-02-01
2   2023-03-01
3   2023-04-01
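
Once a column is stored as `datetime64`, the time-based operations mentioned earlier become available. The sketch below uses the same four dates; the `sales` values are hypothetical and exist only to give the index something to label.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2023-01-01', '2023-02-01', '2023-03-01', '2023-04-01']))

# The .dt accessor exposes the components of each date
months = dates.dt.month
weekdays = dates.dt.day_name()
print(months.tolist())    # [1, 2, 3, 4]
print(weekdays.tolist())  # ['Sunday', 'Wednesday', 'Wednesday', 'Saturday']

# Using the dates as an index allows slicing by partial date strings
sales = pd.Series([10, 20, 30, 40], index=dates)  # hypothetical values
feb_to_mar = sales.loc['2023-02':'2023-03']
print(feb_to_mar)
```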



#### Common Issues: *Epoch* Time

Date and time are typically represented in computers using a system known as *epoch time* or *Unix time*, which counts the number of seconds that have elapsed since a predefined point in time, known as the *epoch*. The **epoch** is set at `00:00:00 UTC on Thursday, January 1, 1970` and the count of seconds (or milliseconds in some applications) from this point is used to represent subsequent points in time, allowing for a standardized, system-independent representation of time that can be easily computed and converted into various human-readable formats.
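
As a quick illustration of epoch time, `pd.to_datetime` can convert a raw count into a timestamp when told which unit the count is in. The integer used below is an arbitrary example value.

```python
import pandas as pd

# Zero seconds after the epoch is the epoch itself
epoch = pd.to_datetime(0, unit='s')
print(epoch)             # 1970-01-01 00:00:00
print(epoch.day_name())  # Thursday

# The same integer means very different things depending on the unit
as_seconds = pd.to_datetime(1_700_000_000, unit='s')
as_milliseconds = pd.to_datetime(1_700_000_000, unit='ms')
print(as_seconds)       # 2023-11-14 22:13:20
print(as_milliseconds)  # 1970-01-20 16:13:20
```

Choosing the wrong unit does not raise an error; it silently produces a plausible-looking but incorrect timestamp.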

This convention is the source of a common issue encountered when converting columns to datetime. As an example, let's load a dataframe of flight times, `flights_df`, and look at the `FLIGHTDATE` column:

In [16]:
pd.set_option('display.max_columns', None) # You can use this setting to display all the columns in your dataframe
flights_df = pd.read_csv("flights_sample.csv") 

flights_df['FLIGHTDATE'].head(5)

0    20110511
1    20160520
2    20040922
3    20160527
4    19970511
Name: FLIGHTDATE, dtype: int64


`Pandas` has detected the `dtype` of the  `FLIGHTDATE` column as `int64`, whereas it would be more helpful to have it as `datetime64`, so that it can be used in time-series analysis.

Run the code block below to see what happens when we apply the `pd.to_datetime` function to the `FLIGHTDATE` column:

In [17]:
pd.to_datetime(flights_df['FLIGHTDATE']).head(5)

0   1970-01-01 00:00:00.020110511
1   1970-01-01 00:00:00.020160520
2   1970-01-01 00:00:00.020040922
3   1970-01-01 00:00:00.020160527
4   1970-01-01 00:00:00.019970511
Name: FLIGHTDATE, dtype: datetime64[ns]

It looks like the first few values of the column are in January 1970. This should arouse suspicion, given what we know about **Epoch time**. We can see from the head of the original column that the first 5 values should be between 1997 and 2016, not in 1970. The function has interpreted each integer in `FLIGHTDATE` as a count of nanoseconds (its default unit) after the **Epoch**, when it is actually a date in the format `YYYYmmdd`.

To work around this, we can specify a format as an argument to the `to_datetime` function, as follows:

In [18]:
# Assign date_format
date_format = "%Y%m%d"
pd.to_datetime(flights_df["FLIGHTDATE"], format=date_format)

0        2011-05-11
1        2016-05-20
2        2004-09-22
3        2016-05-27
4        1997-05-11
            ...    
119175   2016-05-18
119176   2014-09-25
119177   2016-01-13
119178   1997-01-04
119179   2012-09-07
Name: FLIGHTDATE, Length: 119180, dtype: datetime64[ns]
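
An explicit format also makes it easier to catch invalid entries. The sketch below uses a small made-up series in place of `flights_df`; the last value (30 February) is deliberately impossible, and `errors='coerce'` turns it into `NaT` rather than stopping the conversion.

```python
import pandas as pd

# Hypothetical sample in the same YYYYmmdd integer layout as FLIGHTDATE
flight_dates = pd.Series([20110511, 20160520, 20040922, 20230230])

# Convert to text first so that each value is matched against the format string
parsed = pd.to_datetime(flight_dates.astype(str), format='%Y%m%d', errors='coerce')
print(parsed)

# Count how many values could not be parsed
print(parsed.isna().sum())  # 1
```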

#### Common Issues: Mixed Date Formats

Handling multiple date formats in a single column can be tricky. `pd.to_datetime` can infer a date format automatically, but it may not cope when several different formats are mixed in one column. Below is a simple example of a column called `mixed_dates`, which has dates in multiple formats:

In [20]:
# Create a sample dataframe with multiple date formats
data = {
    'mixed_dates': ['01/02/2023', '2023-03-01', '04-Apr-2023', '20230505']
}
mixed_date_df = pd.DataFrame(data)

# Displaying the original dataframe
print("Original Dataframe:")
print(mixed_date_df)
print("\nData types:")
print(mixed_date_df.dtypes)

Original Dataframe:
   mixed_dates
0   01/02/2023
1   2023-03-01
2  04-Apr-2023
3     20230505

Data types:
mixed_dates    object
dtype: object


In [21]:
# Converting the 'mixed_dates' column to datetime
# Note: in pandas 2.0 and later the format is inferred from the first value, so other formats may fail to parse;
# errors='coerce' turns those failures into NaT instead of raising an error
mixed_date_df['dates'] = pd.to_datetime(mixed_date_df['mixed_dates'], errors='coerce')

# Displaying the modified dataframe
print("\nModified dataframe:")
print(mixed_date_df)
print("\nData types:")
print(mixed_date_df.dtypes)


Modified dataframe:
   mixed_dates      dates
0   01/02/2023 2023-01-02
1   2023-03-01        NaT
2  04-Apr-2023        NaT
3     20230505        NaT

Data types:
mixed_dates            object
dates          datetime64[ns]
dtype: object


Unfortunately, in this case automatic conversion has not been very effective: only the first value was parsed, and the rest have been returned as `NaT` (Not a Time).

A more effective approach is to use the `parse` function from the `dateutil` library, in conjunction with the `.apply` method:

In [22]:
from dateutil.parser import parse
# Parse each value individually, letting dateutil work out its format
mixed_date_df['dates'] = mixed_date_df['mixed_dates'].apply(parse)
# Make sure the resulting column has the datetime64 data type
mixed_date_df['dates'] = pd.to_datetime(mixed_date_df['dates'], errors='coerce')
print("\nModified dataframe:\n")
print(mixed_date_df)
print("\nData types:\n")
print(mixed_date_df.dtypes)


Modified dataframe:

   mixed_dates      dates
0   01/02/2023 2023-01-02
1   2023-03-01 2023-03-01
2  04-Apr-2023 2023-04-04
3     20230505 2023-05-05

Data types:

mixed_dates            object
dates          datetime64[ns]
dtype: object
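
One caveat with automatic parsing is that a value such as `01/02/2023` is ambiguous. By default, both `dateutil` and `Pandas` read it month-first (2 January), whereas in many countries it would mean 1 February. The sketch below shows how the `dayfirst` argument changes the interpretation; which reading is correct depends on where your data came from.

```python
from dateutil.parser import parse
import pandas as pd

ambiguous = '01/02/2023'

# Default: the first number is treated as the month
month_first = parse(ambiguous)
print(month_first)  # 2023-01-02 00:00:00

# dayfirst=True: the first number is treated as the day
day_first = parse(ambiguous, dayfirst=True)
print(day_first)  # 2023-02-01 00:00:00

# pd.to_datetime accepts the same argument
pandas_day_first = pd.to_datetime(ambiguous, dayfirst=True)
print(pandas_day_first)  # 2023-02-01 00:00:00
```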


### The `timedelta64` Data Type

The `timedelta64` data type is used to represent differences between `datetime64` objects. While a `datetime64` object represents a specific point in time, with a defined year, month, day, hour, minute, and so on, a `timedelta64` object represents a duration that is not anchored to a specific start or end point. It tells you how much time is between two points, without specifying what those points are.

The distinction between `timedelta64` and `datetime64` data types in `Pandas` (and similarly, `timedelta` and `datetime` in Python's `datetime` module) is crucial due to the inherent differences in representing and utilizing points in time versus durations of time, which are fundamentally different concepts.

**Arithmetic Operations:**
   - When you subtract one `datetime64` object from another, the result is a `timedelta64` object, because subtracting one point in time from another gives you a duration
   - Conversely, when you add or subtract a `timedelta64` from a `datetime64` object, you get another `datetime64` object because you're shifting a point in time by a certain duration
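
The two rules above can be sketched with single values, using the scalar counterparts `pd.Timestamp` and `pd.Timedelta`. The dates are arbitrary examples.

```python
import pandas as pd

start = pd.Timestamp('2023-01-01')
end = pd.Timestamp('2023-03-01')

# Point in time minus point in time gives a duration
duration = end - start
print(type(duration).__name__, duration)  # Timedelta 59 days 00:00:00

# Point in time plus duration gives another point in time
shifted = start + pd.Timedelta(days=10)
print(type(shifted).__name__, shifted)  # Timestamp 2023-01-11 00:00:00
```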

By having separate data types, `Pandas` (and Python more broadly) allows for clear, intuitive operations on time data, ensuring that the operations are semantically meaningful and that the results are what users expect when performing arithmetic or comparisons with time-related data. This distinction also helps prevent misinterpretation of the data and ensures that operations are performed with the appropriate level of precision and efficiency for each type of data.

For example, the code block below creates a new `timedelta64` column by subtracting a specific timestamp from the `dates` column of the `mixed_date_df` dataframe:

In [23]:
# Subtracting a single date from the 'dates' column
single_date = pd.Timestamp('2023-01-01')  # Creating a Timestamp object
mixed_date_df['date_difference'] = mixed_date_df['dates'] - single_date  # Subtracting the single date

# Displaying the modified dataframe
print("\nModified dataframe:")
print(mixed_date_df)


Modified dataframe:
   mixed_dates      dates date_difference
0   01/02/2023 2023-01-02          1 days
1   2023-03-01 2023-03-01         59 days
2  04-Apr-2023 2023-04-04         93 days
3     20230505 2023-05-05        124 days
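
A `timedelta64` column often needs to be turned into plain numbers before it can be used in further calculations. The sketch below rebuilds the same four dates on their own and shows three common ways of doing this; the variable names are illustrative.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2023-01-02', '2023-03-01', '2023-04-04', '2023-05-05']))
differences = dates - pd.Timestamp('2023-01-01')

# Whole days as integers
days = differences.dt.days
print(days.tolist())  # [1, 59, 93, 124]

# Total duration in seconds as floats
seconds = differences.dt.total_seconds()
print(seconds.iloc[0])  # 86400.0

# Dividing by a Timedelta gives the duration in any unit you choose
weeks = differences / pd.Timedelta(weeks=1)
print(weeks.round(2).tolist())
```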
