### __Pandas Data Wrangling: Data Types__

In [7]:
import pandas as pd 

df = pd.read_csv('DataSets/OnlineRetail.csv')

print("DataFrame [R,C]: \n\n", df.axes)
print()
print("DataFrame information: \n\n")
df.info()
print()
df.sample(10, random_state=333)

DataFrame [R,C]: 

 [RangeIndex(start=0, stop=50000, step=1), Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country'],
      dtype='object')]

DataFrame information: 


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   InvoiceNo    50000 non-null  object 
 1   StockCode    50000 non-null  object 
 2   Description  49857 non-null  object 
 3   Quantity     50000 non-null  float64
 4   InvoiceDate  50000 non-null  object 
 5   UnitPrice    50000 non-null  object 
 6   CustomerID   31599 non-null  float64
 7   Country      50000 non-null  object 
dtypes: float64(2), object(6)
memory usage: 3.1+ MB



Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
21526,538172,84558A,3D DOG PICTURE PLAYING CARDS,12.0,2010-12-10T09:33:00Z,2.95,15805.0,United Kingdom
45186,540352,22178,VICTORIAN GLASS HANGING T-LIGHT,2.0,2011-01-06T14:27:00Z,2.51,,United Kingdom
14032,537634,22891,TEA FOR ONE POLKADOT,1.0,2010-12-07T15:15:00Z,4.25,16775.0,United Kingdom
2506,536634,22900,SET 2 TEA TOWELS I LOVE LONDON,1.0,2010-12-02T11:21:00Z,2.95,18041.0,United Kingdom
6145,536991,22337,ANGEL DECORATION PAINTED ZINC,4.0,2010-12-03T15:16:00Z,0.65,,United Kingdom
39002,539718,22692,DOORMAT WELCOME TO OUR HOME,1.0,2010-12-21T13:06:00Z,14.43,,United Kingdom
47,536522,22151,PLACE SETTING WHITE HEART,1.0,2010-12-01T12:49:00Z,0.42,15012.0,United Kingdom
36772,539478,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,2.0,2010-12-19T15:07:00Z,3.75,17343.0,United Kingdom
2058,536594,21733,RED HANGING HEART T-LIGHT HOLDER,6.0,2010-12-01T17:22:00Z,2.95,15235.0,United Kingdom
46091,540373,22908,PACK OF 20 NAPKINS RED APPLES,3.0,2011-01-06T17:10:00Z,0.85,13280.0,United Kingdom


In [9]:
print(df["StockCode"].min(), df["StockCode"].max())

10002 m


##### _.astype()_

The astype() method returns a new DataFrame in which the data types have been changed to the specified type.

You can cast the entire DataFrame to one specific data type, or you can use a Python Dictionary to specify a data type for each column, like this:

{
 
  'Duration': 'int64',
 
  'Pulse'   : 'float',
 
  'Calories': 'int64'

}

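_dataframe.astype(dtype, copy, errors)_

_dtype_: a data type, or a dictionary mapping column names to data types, for example {'Duration': 'int64', 'Pulse': 'float', 'Calories': 'int64'}. Required. Specifies the target data type.

_copy_: True|False. Optional. Default True. With True, the result is always a new copy of the data. With False, the result may share data with the original where no conversion is needed. In both cases astype() returns a new object and does not modify the original DataFrame in place.

_errors_: 'raise'|'ignore'. Optional. Default 'raise'. With 'raise', a failed conversion raises an exception. With 'ignore', the exception is suppressed and the original object is returned unchanged.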
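As a minimal sketch of the dictionary form, the cell below casts each column to its own type. The `workouts` DataFrame and its values are made up for illustration; only the column names come from the dictionary above.

```python
import pandas as pd

# Hypothetical workout data (values invented for this example)
workouts = pd.DataFrame({
    'Duration': [60.0, 45.0],
    'Pulse': [110, 117],
    'Calories': [409.0, 282.0],
})

# One target dtype per column; astype() returns a new DataFrame
converted = workouts.astype({'Duration': 'int64', 'Pulse': 'float', 'Calories': 'int64'})

print(workouts.dtypes)   # the original DataFrame keeps its dtypes
print()
print(converted.dtypes)  # the new DataFrame has the requested dtypes
```

Columns that are not listed in the dictionary keep their current dtype.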

In [11]:
import pandas as pd

d = {'col1': [1.0, 2.0], 'col2': [3, 4]}
df = pd.DataFrame(data=d)

print(df)
print()
print('Output of df.info():')
df.info()

   col1  col2
0   1.0     3
1   2.0     4

Output of df.info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   col1    2 non-null      float64
 1   col2    2 non-null      int64  
dtypes: float64(1), int64(1)
memory usage: 164.0 bytes


In [12]:
df_str_dtype = df.astype('str')
print(df_str_dtype)
print()
df_str_dtype.info()


  col1 col2
0  1.0    3
1  2.0    4

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   col1    2 non-null      object
 1   col2    2 non-null      object
dtypes: object(2)
memory usage: 164.0+ bytes


In [16]:
df['col1'] = df['col1'].astype('int')
print(df)
print()
print(df.dtypes)

   col1  col2
0     1     3
1     2     4

col1    int64
col2    int64
dtype: object


##### __Unexpected problems while converting dtypes__

Let's say you load a dataset and it looks like all the float values in a column should actually be integers, but there are too many to check manually. How do you know whether it's safe to convert from float to int, or whether you'll lose information by doing so?

In [18]:
import pandas as pd

d = {'col1': [1.0, 2.0, 3.0, 4.0], 'col2': [5.0, 6.01, 7.0, 8.0]}
df = pd.DataFrame(d)

print(df)

   col1  col2
0   1.0  5.00
1   2.0  6.01
2   3.0  7.00
3   4.0  8.00



If you're not careful, you'll run df.astype('int') (which changes 6.01 to 6) without realizing that you've just changed the values in your dataset.

##### __NumPy library__

NumPy is short for "Numerical Python".

NumPy is a powerful Python library for scientific computing. It introduces a data structure called an array, which is similar to a list but has many advantages, including the ability to perform vectorized operations on entire arrays very quickly. NumPy also provides mathematical functions and tools for working with arrays, such as sorting, indexing, and slicing.

##### _.array_equal()_


This function accepts two arrays and returns True if both have the same shape and the same elements, and False otherwise. Let's test it with 'col1' and 'col2':

In [22]:
import numpy as np
import pandas as pd

d = {'col1': [1.0, 2.0, 3.0, 4.0], 'col2': [5.0, 6.01, 7.0, 8.0]}
df = pd.DataFrame(data=d)

# check whether it is safe to convert 'col1' and 'col2' to int
print(np.array_equal(df['col1'], df['col1'].astype('int')))
print(np.array_equal(df['col2'], df['col2'].astype('int')))

True
False


Now we know that we can't convert 'col2' from float to int without losing some of the data.
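The same check can be wrapped in a small helper that converts a column only when nothing is lost. This is a sketch, not part of pandas: `to_int_if_lossless` is a hypothetical name, and it assumes the column has no missing values (casting NaN to int raises an error).

```python
import numpy as np
import pandas as pd

def to_int_if_lossless(series):
    """Return an int64 version of `series` if no value changes, else the original."""
    as_int = series.astype('int64')
    return as_int if np.array_equal(series, as_int) else series

df = pd.DataFrame({'col1': [1.0, 2.0, 3.0, 4.0], 'col2': [5.0, 6.01, 7.0, 8.0]})

# Apply the check column by column on a copy, leaving df untouched
safe_df = df.copy()
for column in safe_df.columns:
    safe_df[column] = to_int_if_lossless(safe_df[column])

print(safe_df.dtypes)
```

Here 'col1' becomes int64, while 'col2' stays float64 because 6.01 would be truncated.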

In [23]:
import pandas as pd

d = {'col1': ['1.0', '2.0'], 'col2': ['3', '4']}
df = pd.DataFrame(data=d)

# convert col2 to int
df['col2'] = df['col2'].astype('int')
print(df.dtypes)

col1    object
col2     int64
dtype: object


In [24]:
df['col1'] = df['col1'].astype('int')

ValueError: invalid literal for int() with base 10: '1.0'

##### _to_numeric()_

In [26]:
import pandas as pd

d = {'col1': ['1.0', '2.0'], 'col2': ['3', '4']}
df = pd.DataFrame(data=d)

df['col2'] = df['col2'].astype('int')
df['col1'] = pd.to_numeric(df['col1'])
print(df.dtypes)
print()
print(df)

col1    float64
col2      int64
dtype: object

   col1  col2
0   1.0     3
1   2.0     4


This works well for number-like strings such as '72', '1394', or '1.0'. However, by default, to_numeric() cannot convert strings that contain non-numeric characters, such as 'B.0'. Instead, it raises an error.

There's good news, though: to_numeric() has an errors= parameter! The value of this parameter determines what to_numeric() will do if it encounters an invalid value:

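- errors='raise': The default. Any invalid value raises an error, which blocks the conversion of the entire column.

- errors='coerce': Invalid values are replaced with NaN.

- errors='ignore': No error is raised; if any value is invalid, the input is returned unchanged. This option is deprecated in recent pandas releases (2.2 and later), where catching the exception explicitly is preferred.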
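A short sketch of how the first two options behave, using a made-up Series (`raw` and its values are assumptions for illustration). It also shows one way to list the values that errors='coerce' turned into NaN, which is worth checking before you trust the converted column.

```python
import pandas as pd

raw = pd.Series(['1.0', 'B.0', '3'])  # made-up values for illustration

# errors='raise' (the default): a single bad value aborts the whole conversion
try:
    pd.to_numeric(raw)
    raised = False
except ValueError as exc:
    raised = True
    print('Conversion failed:', exc)

# errors='coerce': bad values become NaN, so we can inspect what was lost
coerced = pd.to_numeric(raw, errors='coerce')
rejected = raw[coerced.isna() & raw.notna()]

print(coerced)
print('Values that could not be converted:', rejected.tolist())
```

The `raw.notna()` condition keeps values that were already missing out of the rejected list.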

In [27]:
import pandas as pd

d = {'col1': ['1.0', 'B.0'], 'col2': ['3', '4']}
df = pd.DataFrame(data=d)

df['col2'] = df['col2'].astype('int')
df['col1'] = pd.to_numeric(df['col1'], errors='coerce')

print(df.dtypes)
print(df)

col1    float64
col2      int64
dtype: object
   col1  col2
0   1.0     3
1   NaN     4


##### _Exercise 1_

In [32]:
import pandas as pd
import numpy as np

df = pd.read_csv('DataSets/OnlineRetail.csv')

df.info()
can_convert = np.array_equal(df["Quantity"], df["Quantity"].astype("int"))
print()
print(can_convert)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   InvoiceNo    50000 non-null  object 
 1   StockCode    50000 non-null  object 
 2   Description  49857 non-null  object 
 3   Quantity     50000 non-null  float64
 4   InvoiceDate  50000 non-null  object 
 5   UnitPrice    50000 non-null  object 
 6   CustomerID   31599 non-null  float64
 7   Country      50000 non-null  object 
dtypes: float64(2), object(6)
memory usage: 3.1+ MB

True


In [30]:
df['Quantity'] = df["Quantity"].astype("int")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   InvoiceNo    50000 non-null  object 
 1   StockCode    50000 non-null  object 
 2   Description  49857 non-null  object 
 3   Quantity     50000 non-null  int64  
 4   InvoiceDate  50000 non-null  object 
 5   UnitPrice    50000 non-null  object 
 6   CustomerID   31599 non-null  float64
 7   Country      50000 non-null  object 
dtypes: float64(1), int64(1), object(6)
memory usage: 3.1+ MB


In [33]:
import pandas as pd

df = pd.read_csv('DataSets/OnlineRetail.csv')

df['UnitPrice'] = pd.to_numeric(df["UnitPrice"], errors="coerce")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   InvoiceNo    50000 non-null  object 
 1   StockCode    50000 non-null  object 
 2   Description  49857 non-null  object 
 3   Quantity     50000 non-null  float64
 4   InvoiceDate  50000 non-null  object 
 5   UnitPrice    49985 non-null  float64
 6   CustomerID   31599 non-null  float64
 7   Country      50000 non-null  object 
dtypes: float64(3), object(5)
memory usage: 3.1+ MB


##### __Dates & Times__

https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

_%d_ day of the month (01 - 31)

_%m_ month (01 - 12)

_%y_ two-digit year (94)

_%Y_ four-digit year (1994)

_T_ literal separator between the date and the time in ISO 8601 strings; a trailing _Z_ marks the time as UTC

_%H_ hour in 24-hour format (00 - 23)

_%I_ hour in 12-hour format (01 - 12)

_%M_ minutes (00 - 59)

_%S_ seconds (00 - 59)
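To see these codes in action without pandas, here is a small sketch with Python's built-in datetime module. The sample string follows the format of the 'InvoiceDate' column; the output formats are arbitrary choices for illustration.

```python
from datetime import datetime

raw = '2010-12-01T12:43:00Z'

# strptime: parse a string into a datetime; T and Z are matched literally
parsed = datetime.strptime(raw, '%Y-%m-%dT%H:%M:%SZ')
print(parsed)

# strftime: format the datetime back into strings with different codes
long_form = parsed.strftime('%d/%m/%Y %H:%M:%S')
short_form = parsed.strftime('%d/%m/%y %I:%M')
print(long_form)
print(short_form)
```

The same format codes are accepted by the format= parameter of the pandas to_datetime() function.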

In [39]:
import pandas as pd 

df = pd.read_csv('DataSets/OnlineRetail.csv')
print(df.head())
print()
print(df.dtypes)

  InvoiceNo StockCode                          Description  Quantity  \
0    536520     21123  SET/10 IVORY POLKADOT PARTY CANDLES       1.0   
1    536520     21124   SET/10 BLUE POLKADOT PARTY CANDLES       1.0   
2    536520     21122   SET/10 PINK POLKADOT PARTY CANDLES       1.0   
3    536520     84378        SET OF 3 HEART COOKIE CUTTERS       1.0   
4    536520     21985    PACK OF 12 HEARTS DESIGN TISSUES       12.0   

            InvoiceDate UnitPrice  CustomerID         Country  
0  2010-12-01T12:43:00Z      1.25     14729.0  United Kingdom  
1  2010-12-01T12:43:00Z      1.25     14729.0  United Kingdom  
2  2010-12-01T12:43:00Z      1.25     14729.0  United Kingdom  
3  2010-12-01T12:43:00Z      1.25     14729.0  United Kingdom  
4  2010-12-01T12:43:00Z      0.29     14729.0  United Kingdom  

InvoiceNo       object
StockCode       object
Description     object
Quantity       float64
InvoiceDate     object
UnitPrice       object
CustomerID     float64
Country         object
dtype: object

##### _.to_datetime()_

The to_datetime() function converts dates from a string data type to a datetime data type. When calling it, we can use the format= parameter, which takes a string specifying how the dates are formatted. Format codes introduced by the % symbol are used to describe the format. If format= is omitted, pandas tries to infer the format, but stating it explicitly is safer and usually faster.

For example, we can convert the ISO 8601 string 2010-12-17T12:38:00 to a datetime object by passing the matching format string to the format= parameter: %Y-%m-%dT%H:%M:%S. If the string ends in a literal Z, as the dates in our dataset do, append Z to the format string: %Y-%m-%dT%H:%M:%SZ.

In [44]:
import pandas as pd

string_date = '2010-12-17T12:38:00'
datetime_date = pd.to_datetime(string_date, format='%Y-%m-%dT%H:%M:%S')

print(type(string_date))
print(type(datetime_date))
print(datetime_date)

<class 'str'>
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
2010-12-17 12:38:00


After calling to_datetime() with our string-based date and format string, we now have an object of the Timestamp type. Timestamp is the pandas counterpart of Python's datetime.datetime (it is a subclass of it), so we'll use the term "datetime" to refer to both.

Notice that the format of the datetime object is different from the original string: 2010-12-17T12:38:00. In the datetime object, we have YYYY-MM-DD HH:MM:SS: 2010-12-17 12:38:00. Regardless of how the original string was formatted, the datetime type is displayed in this uniform format.
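The relationship between Timestamp and Python's datetime can be checked directly. This is a small sketch that reuses the string from the cell above; the variable names are chosen for illustration.

```python
import datetime
import pandas as pd

ts = pd.to_datetime('2010-12-17T12:38:00', format='%Y-%m-%dT%H:%M:%S')

# A pandas Timestamp is also a datetime.datetime
is_datetime = isinstance(ts, datetime.datetime)
print(is_datetime)

# Convert to a plain Python datetime, or back to an ISO 8601 string
plain = ts.to_pydatetime()
iso_text = ts.isoformat()
print(type(plain))
print(iso_text)
```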

In [45]:
import pandas as pd

# Create a DataFrame with mixed date formats
data = {
    'Event': ['Event A', 'Event B', 'Event C'],
    'Date': ['2025-04-16', '16/04/2025', 'April 16, 2025']
}

df = pd.DataFrame(data)

# Convert the 'Date' column to datetime. With format=None pandas infers a single
# format (in pandas 2.x, from the first value); values that do not match it
# become NaT because errors='coerce'.
df['Date'] = pd.to_datetime(df['Date'], format=None, errors='coerce')

print(df)
print()
print(df.dtypes)

     Event       Date
0  Event A 2025-04-16
1  Event B        NaT
2  Event C        NaT

Event            object
Date     datetime64[ns]
dtype: object
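In the output above, only the first date was parsed and the other two became NaT. One possible workaround, sketched below, is to try each expected format in turn and keep the first successful result for each row. The list of formats is an assumption: you need to know (or discover) which formats occur in your data, and ambiguous strings such as 04/05/2025 still need a decision about day-first versus month-first.

```python
import pandas as pd

dates = pd.Series(['2025-04-16', '16/04/2025', 'April 16, 2025'])

# Formats we expect to find in the column (assumed for this example)
formats = ['%Y-%m-%d', '%d/%m/%Y', '%B %d, %Y']

# Parse with the first format, then fill the remaining NaT values
# with the results of the other formats
parsed = pd.to_datetime(dates, format=formats[0], errors='coerce')
for fmt in formats[1:]:
    parsed = parsed.fillna(pd.to_datetime(dates, format=fmt, errors='coerce'))

print(parsed)
```

Any value that matches none of the formats stays NaT, so it is still easy to find afterwards.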


In [47]:
import pandas as pd

# Create a DataFrame with Unix timestamps
data = {'Event': ['Event A', 'Event B', 'Event C'], 'Timestamp': [1618317047, 1618318047, 1618319047]}
df = pd.DataFrame(data)

# Convert the 'Timestamp' column to datetime64[ns]; unit='s' says the values
# are seconds since the Unix epoch
df['Timestamp'] = pd.to_datetime(df['Timestamp'], unit='s')
print(df)
print()
print(df.dtypes)

     Event           Timestamp
0  Event A 2021-04-13 12:30:47
1  Event B 2021-04-13 12:47:27
2  Event C 2021-04-13 13:04:07

Event                object
Timestamp    datetime64[ns]
dtype: object


In [48]:
import pandas as pd

# Create a DataFrame with some invalid date strings
data = {'Event': ['Event A', 'Event B', 'Event C'], 'Date': ['2025-04-16', 'invalid_date', '16/04/2025']}
df = pd.DataFrame(data)

# Convert the 'Date' column to datetime, replacing invalid dates with NaT
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

print(df)
print()
print(df.dtypes)

     Event       Date
0  Event A 2025-04-16
1  Event B        NaT
2  Event C        NaT

Event            object
Date     datetime64[ns]
dtype: object


In [49]:
import pandas as pd

# Create a DataFrame with separate date and time columns
data = {
    'Date': ['2025-04-16', '2025-04-17', '2025-04-18'],
    'Time': ['12:30:00', '14:45:00', '16:00:00']
}
df = pd.DataFrame(data)

# Combine 'Date' and 'Time' columns into a single datetime column
df['Datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])

print(df)
print()
print(df.dtypes)

         Date      Time            Datetime
0  2025-04-16  12:30:00 2025-04-16 12:30:00
1  2025-04-17  14:45:00 2025-04-17 14:45:00
2  2025-04-18  16:00:00 2025-04-18 16:00:00

Date                object
Time                object
Datetime    datetime64[ns]
dtype: object


In [53]:
import pandas as pd

# Create a DataFrame with custom date formats
data = {'Event': ['Event A', 'Event B', 'Event C'], 'Date': ['16-Apr-2025', '17-Apr-2025', '18-Apr-2025']}
df = pd.DataFrame(data)

# Convert the 'Date' column to datetime using a custom format
df['Date'] = pd.to_datetime(df['Date'], format='%d-%b-%Y') # %b is the abbreviated month name

print(df)
print()
print(df.dtypes)

     Event       Date
0  Event A 2025-04-16
1  Event B 2025-04-17
2  Event C 2025-04-18

Event            object
Date     datetime64[ns]
dtype: object


In [56]:
# Import Required Libraries
import pandas as pd

# Create a DataFrame with custom date-time strings
data = {
    'Event': ['Event A', 'Event B', 'Event C'],
    'CustomDateTime': ['16-April-2025 12:30 PM', '17-April-2025 02:45 PM', '18-April-2025 04:00 PM']
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Convert the 'CustomDateTime' column to datetime using an explicit format
df['CustomDateTime'] = pd.to_datetime(df['CustomDateTime'], format='%d-%B-%Y %I:%M %p') # %p is used to represent AM or PM in 12-Hr format

print("\nDataFrame after converting 'CustomDateTime' to datetime:")
print(df)
print()
print("Data types of the DataFrame:")
print(df.dtypes)

Original DataFrame:
     Event          CustomDateTime
0  Event A  16-April-2025 12:30 PM
1  Event B  17-April-2025 02:45 PM
2  Event C  18-April-2025 04:00 PM

DataFrame after converting 'CustomDateTime' to datetime:
     Event      CustomDateTime
0  Event A 2025-04-16 12:30:00
1  Event B 2025-04-17 14:45:00
2  Event C 2025-04-18 16:00:00

Data types of the DataFrame:
Event                     object
CustomDateTime    datetime64[ns]
dtype: object


Since pandas relies on data types native to Python (as well as some from various libraries), data types can get a bit complicated at times.

Remember that in pandas, datetime objects are represented by the Timestamp data type. To get, for example, the year attribute of the first Timestamp value in the 'InvoiceDate' column, use the following code:

In [61]:
import pandas as pd

df = pd.read_csv('DataSets/OnlineRetail.csv')

# convert 'InvoiceDate' to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], format='%Y-%m-%dT%H:%M:%SZ')

print(df['InvoiceDate'][0].year) # returns the year of the first InvoiceDate

2010


While we can access all attributes of individual Timestamp values this way, we can't do the same directly on a Series of Timestamp values. See what happens when we try to get the day attribute for the entire 'InvoiceDate' column:

In [68]:
import pandas as pd

df = pd.read_csv('DataSets/OnlineRetail.csv')
print(df.head())
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], format='%Y-%m-%dT%H:%M:%SZ')
print()
print(df['InvoiceDate'][0].day)
df['day'] = df['InvoiceDate'].day

  InvoiceNo StockCode                          Description  Quantity  \
0    536520     21123  SET/10 IVORY POLKADOT PARTY CANDLES       1.0   
1    536520     21124   SET/10 BLUE POLKADOT PARTY CANDLES       1.0   
2    536520     21122   SET/10 PINK POLKADOT PARTY CANDLES       1.0   
3    536520     84378        SET OF 3 HEART COOKIE CUTTERS       1.0   
4    536520     21985    PACK OF 12 HEARTS DESIGN TISSUES       12.0   

            InvoiceDate UnitPrice  CustomerID         Country  
0  2010-12-01T12:43:00Z      1.25     14729.0  United Kingdom  
1  2010-12-01T12:43:00Z      1.25     14729.0  United Kingdom  
2  2010-12-01T12:43:00Z      1.25     14729.0  United Kingdom  
3  2010-12-01T12:43:00Z      1.25     14729.0  United Kingdom  
4  2010-12-01T12:43:00Z      0.29     14729.0  United Kingdom  

1


AttributeError: 'Series' object has no attribute 'day'

We get an error because df['InvoiceDate'] is a Series object, which doesn't have a day attribute, even though the individual Timestamp values within the Series do.

To get an attribute for every value in a datetime column, use the .dt accessor instead.

For example, we can create a 'df_days' Series that contains the day attribute for each value in the 'InvoiceDate' column:

In [70]:
import pandas as pd 

df = pd.read_csv('DataSets/OnlineRetail.csv')
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], format='%Y-%m-%dT%H:%M:%SZ')

df_days = df['InvoiceDate'].dt.day
print(df_days.sample(5, random_state=42))

33553    17
9427      6
199       1
12447     6
39489    21
Name: InvoiceDate, dtype: int32


No error this time! Remember, _if you want to access the attributes of an entire datetime column_, you must go through the column's __.dt__ accessor rather than reading the attribute from the column itself.
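The .dt accessor exposes more than the day. As a small sketch, the cell below pulls several components from a made-up Series of two invoice dates (the values are taken from the sample output above, without the trailing Z, so the example does not depend on the CSV file).

```python
import pandas as pd

invoice_dates = pd.to_datetime(
    pd.Series(['2010-12-17T12:38:00', '2010-12-06T09:58:00']),
    format='%Y-%m-%dT%H:%M:%S',
)

# Each .dt attribute returns a Series with one value per row
parts = pd.DataFrame({
    'year': invoice_dates.dt.year,
    'month': invoice_dates.dt.month,
    'day': invoice_dates.dt.day,
    'hour': invoice_dates.dt.hour,
    'weekday': invoice_dates.dt.day_name(),  # a method, so it needs parentheses
})

print(parts)
```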

##### __Time Zones__

This Wikipedia article (https://en.wikipedia.org/wiki/List_of_tz_database_time_zones#List) contains a list of the names of each standard time zone that you can use with .dt.tz_convert().

There are some common scenarios related to time zones that you'll encounter when working with datetime data.

Your data may come from different geographic areas, with each location recording this data using its local time. Or you may be working with datetime values recorded in one time zone, but need to present the results of your analysis to an audience in another.

In either case, you need to know how to convert between different time zones without getting confused. This is where .dt.tz_localize() and .dt.tz_convert() come in handy. The former allows you to assign a time zone to a datetime column so that your data is "time zone aware." The latter allows you to convert a "time zone aware" column to a different time zone.

Let's see how it works in practice. Let's assign the UTC time zone to the 'InvoiceDate' column.

In [None]:
import pandas as pd 

df = pd.read_csv('DataSets/OnlineRetail.csv')

df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], format='%Y-%m-%dT%H:%M:%SZ')

df['InvoiceDate'] = df['InvoiceDate'].dt.tz_localize('UTC')

print(df['InvoiceDate'].sample(5, random_state=42))

33553   2010-12-17 12:38:00+00:00
9427    2010-12-06 09:58:00+00:00
199     2010-12-01 13:21:00+00:00
12447   2010-12-06 16:57:00+00:00
39489   2010-12-21 15:19:00+00:00
Name: InvoiceDate, dtype: datetime64[ns, UTC]


Did you notice that the column's dtype now contains information about the UTC time zone?

What if we needed to display data to someone in New York who prefers to see datetime values in their local time?

In this case, we'll pass 'America/New_York' to the .dt.tz_convert() method:

In [74]:
import pandas as pd 

df = pd.read_csv('DataSets/OnlineRetail.csv')

df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], format='%Y-%m-%dT%H:%M:%SZ')

df['InvoiceDate'] = df['InvoiceDate'].dt.tz_localize('UTC')

df['InvoiceDate_NYC'] = df['InvoiceDate'].dt.tz_convert('America/New_York')

print(df['InvoiceDate_NYC'].sample(5, random_state=42))

33553   2010-12-17 07:38:00-05:00
9427    2010-12-06 04:58:00-05:00
199     2010-12-01 08:21:00-05:00
12447   2010-12-06 11:57:00-05:00
39489   2010-12-21 10:19:00-05:00
Name: InvoiceDate_NYC, dtype: datetime64[ns, America/New_York]
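Two details are worth checking for yourself, sketched here on a single made-up value: .dt.tz_convert() only works on a time-zone-aware column, and converting changes how the time is displayed, not the moment it refers to.

```python
import pandas as pd

naive = pd.Series(pd.to_datetime(['2010-12-17 12:38:00']))

# tz_convert on a column without a time zone raises a TypeError
try:
    naive.dt.tz_convert('America/New_York')
    raised_type_error = False
except TypeError as exc:
    raised_type_error = True
    print('Cannot convert yet:', exc)

# Assign a time zone first, then convert
utc = naive.dt.tz_localize('UTC')
nyc = utc.dt.tz_convert('America/New_York')

print(nyc)
# Both columns describe the same instants, so they compare as equal
same_instant = bool((utc == nyc).all())
print(same_instant)
```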
