## Advanced Data Wrangling With Pandas

> https://www.tomasbeuzen.com/python-programming-for-data-science/chapters/chapter9-wrangling-advanced.html

---

In [1]:
import pandas as pd
import numpy as np

### Working with strings

String data is represented in Pandas using the object dtype, a generic dtype for mixed data or data of unknown size. A dedicated dtype would be better, and Pandas has introduced one: the StringDtype. However, object remains the default dtype for strings while Pandas continues testing and improving the string dtype. You can read more about the StringDtype in the Pandas documentation [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#text-data-types).
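As a minimal sketch of opting in to the dedicated string dtype, assuming a Pandas version that accepts dtype="string" (1.0 or later), something like the following should work. The variable names are illustrative only:

```python
import pandas as pd

# Request the dedicated string dtype explicitly instead of the default object dtype
names = pd.Series(['Tom', 'Mike', None], dtype="string")

# The .str methods work the same way; the missing entry stays missing (pd.NA)
upper = names.str.upper()

print(names.dtype)
print(upper)
```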

We’ve seen how libraries like NumPy and Pandas can vectorise operations for increased speed and usability.
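As a quick reminder of what vectorisation looks like for numeric data, here is a small sketch (the array values are made up for illustration). The whole array is doubled in one expression, with no explicit Python loop:

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0])

# Vectorised: the multiplication is applied to every element at once
doubled = values * 2

# Equivalent element-by-element loop, shown only for comparison
looped = np.array([v * 2 for v in values])

print(doubled)
```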

This is not the case for arrays of strings, however:



In [2]:
x = np.array(['Tom', 'Mike', 'Tiffany', 'Joel', 'Varada'])
# x.upper() --- does not work

In [3]:
# Instead, you would have to operate on each string object one at a time, using a loop for example:

[name.upper() for name in x]
# But even this will fail if your array contains a missing value:

x = np.array(['Tom', 'Mike', None, 'Tiffany', 'Joel', 'Varada'])
# [name.upper() for name in x]


Pandas addresses both of these issues (vectorization and missing values) with its string methods. String methods are accessed through the .str attribute of Pandas Series and Index objects. Pretty much all built-in string operations (.upper(), .lower(), .split(), etc.) and more are available.

In [4]:
s = pd.Series(x)
s

0        Tom
1       Mike
2       None
3    Tiffany
4       Joel
5     Varada
dtype: object

In [5]:
s.str.upper()

0        TOM
1       MIKE
2       None
3    TIFFANY
4       JOEL
5     VARADA
dtype: object

In [6]:
s.str.split("ff")

0        [Tom]
1       [Mike]
2         None
3    [Ti, any]
4       [Joel]
5     [Varada]
dtype: object

In [7]:
s.str.split("ff", expand=True)

,0,1
0,Tom,
1,Mike,
2,,
3,Ti,any
4,Joel,
5,Varada,


In [8]:
df = pd.DataFrame(np.random.rand(5, 3),
                  columns = ['Measured Feature', 'recorded feature', 'PredictedFeature'],
                  index = [f"ROW{_}" for _ in range(5)])

df

,Measured Feature,recorded feature,PredictedFeature
ROW0,0.909902,0.382452,0.821149
ROW1,0.238928,0.143879,0.95088
ROW2,0.428824,0.488306,0.739736
ROW3,0.26577,0.998428,0.482828
ROW4,0.347798,0.204922,0.76313


Let’s clean up those labels by:

1. Removing the words “feature” and “Feature”
2. Lowercasing “ROW” and adding an underscore between the letters and the digit


In [9]:
# str.capitalize() uppercases the first character of each label and lowercases the rest,
# so both "Feature" and "feature" become "feature" before being removed.
df.columns = df.columns.str.capitalize().str.replace("feature", "").str.strip()

In [10]:
df.index = df.index.str.lower().str.replace("w", "w_")
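The same clean-up can also be sketched with regular expressions, which avoids relying on capitalize() to normalise the case first. This is one possible alternative under stated assumptions, not the approach used above: the labels are recreated here as plain Index objects so the snippet runs on its own, and regex=True is passed explicitly because recent Pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

cols = pd.Index(['Measured Feature', 'recorded feature', 'PredictedFeature'])
rows = pd.Index([f"ROW{i}" for i in range(3)])

# (?i) makes the match case-insensitive; \s* also removes the space before "feature"
cleaned = cols.str.replace(r"(?i)\s*feature", "", regex=True).str.capitalize()

# Capture the letters and the digits separately and join them with an underscore
rows_cleaned = rows.str.lower().str.replace(r"([a-z]+)(\d+)", r"\1_\2", regex=True)

print(cleaned)
print(rows_cleaned)
```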

In [11]:
df.columns.str.slice(start=0, stop=2)

Index(['Me', 'Re', 'Pr'], dtype='object')

A regular expression (regex) is a sequence of characters that defines a search pattern. For more complex string operations, you’ll definitely want to use regex. I am self-admittedly not a regex expert; I usually jump over to RegExr.com and play around until I find the expression I want. Many Pandas string methods accept regular expressions as input, for example str.findall() and str.contains(), both of which are used below.



In [13]:
s = pd.Series(['Tom', 'Mike', None, 'Tiffany', 'Joel', 'Varada'])

In [14]:
s.str.findall(r'^[^AEIOU].*[^aeiou]$')

0        [Tom]
1           []
2         None
3    [Tiffany]
4       [Joel]
5           []
dtype: object
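To unpack the pattern used above: ^ anchors the start of the string, [^AEIOU] requires a first character that is not an uppercase vowel, .* matches anything in between, and [^aeiou]$ requires a last character that is not a lowercase vowel. The sketch below applies the same pattern with str.contains() to build a boolean mask. Passing na=False is a choice made here so that the missing value counts as "no match" and the mask can be used for filtering.

```python
import pandas as pd

s = pd.Series(['Tom', 'Mike', None, 'Tiffany', 'Joel', 'Varada'])
pattern = r'^[^AEIOU].*[^aeiou]$'

# Boolean mask: True where the whole name matches the pattern
mask = s.str.contains(pattern, na=False)

# Keep only the matching names
selected = s[mask]

print(mask)
print(selected)
```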

Data location: [here](https://raw.githubusercontent.com/TomasBeuzen/python-programming-for-data-science/main/chapters/data/cycling_data.csv)

In [15]:
df = pd.read_csv('data/cycling_data.csv', index_col=0)

In [16]:
df

Date,Name,Type,Time,Distance,Comments
"10 Sep 2019, 00:13:04",Afternoon Ride,Ride,2084,12.62,Rain
"10 Sep 2019, 13:52:18",Morning Ride,Ride,2531,13.03,rain
"11 Sep 2019, 00:23:50",Afternoon Ride,Ride,1863,12.52,Wet road but nice weather
"11 Sep 2019, 14:06:19",Morning Ride,Ride,2192,12.84,Stopped for photo of sunrise
"12 Sep 2019, 00:28:05",Afternoon Ride,Ride,1891,12.48,Tired by the end of the week
"16 Sep 2019, 13:57:48",Morning Ride,Ride,2272,12.45,Rested after the weekend!
"17 Sep 2019, 00:15:47",Afternoon Ride,Ride,1973,12.45,Legs feeling strong!
"17 Sep 2019, 13:43:34",Morning Ride,Ride,2285,12.6,Raining
"18 Sep 2019, 13:49:53",Morning Ride,Ride,2903,14.57,Raining today
"18 Sep 2019, 00:15:52",Afternoon Ride,Ride,2101,12.48,Pumped up tires


We could find all the comments that contain the string “Rain” or “rain”:

In [20]:
df[df['Comments'].str.contains(r"[Rr]ain")]

Date,Name,Type,Time,Distance,Comments
"10 Sep 2019, 00:13:04",Afternoon Ride,Ride,2084,12.62,Rain
"10 Sep 2019, 13:52:18",Morning Ride,Ride,2531,13.03,rain
"17 Sep 2019, 13:43:34",Morning Ride,Ride,2285,12.6,Raining
"18 Sep 2019, 13:49:53",Morning Ride,Ride,2903,14.57,Raining today
"26 Sep 2019, 00:13:33",Afternoon Ride,Ride,1860,12.52,raining


In [22]:
df[df['Comments'].str.lower().str.contains("rain")]

Date,Name,Type,Time,Distance,Comments
"10 Sep 2019, 00:13:04",Afternoon Ride,Ride,2084,12.62,Rain
"10 Sep 2019, 13:52:18",Morning Ride,Ride,2531,13.03,rain
"17 Sep 2019, 13:43:34",Morning Ride,Ride,2285,12.6,Raining
"18 Sep 2019, 13:49:53",Morning Ride,Ride,2903,14.57,Raining today
"26 Sep 2019, 00:13:33",Afternoon Ride,Ride,1860,12.52,raining


If we didn’t want to include “Raining” or “raining”, we could do:

In [23]:
df[df['Comments'].str.contains(r"[Rr]ain$")]

Date,Name,Type,Time,Distance,Comments
"10 Sep 2019, 00:13:04",Afternoon Ride,Ride,2084,12.62,Rain
"10 Sep 2019, 13:52:18",Morning Ride,Ride,2531,13.03,rain
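Note that $ anchors the match to the end of the string, so a comment such as "Rain today" would not be matched by that pattern. A possible alternative, sketched below on a small made-up Series of comments (not the cycling data), is to combine case=False with word boundaries (\b) so that "rain" is matched as a whole word anywhere in the comment. The na=False argument is an assumption that missing comments should count as non-matches.

```python
import pandas as pd

# Hypothetical comments, including a missing value
comments = pd.Series(["Rain", "rain", "Raining today", "Wet road", None])

# Any occurrence of "rain", ignoring case
any_rain = comments.str.contains("rain", case=False, na=False)

# Only the whole word "rain", ignoring case
rain_word = comments.str.contains(r"\brain\b", case=False, na=False)

print(comments[any_rain])
print(comments[rain_word])
```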


### Working With Datetimes

In [24]:
from datetime import datetime, timedelta

In [25]:
date = datetime(year=2005, month=7, day=9, hour=13, minute=54)
date

datetime.datetime(2005, 7, 9, 13, 54)

In [26]:
# We can also parse directly from a string

date = datetime.strptime("July 9 2005, 13:54", "%B %d %Y, %H:%M")
date

datetime.datetime(2005, 7, 9, 13, 54)

In [27]:
print(f"Year: {date.strftime('%Y')}")
print(f"Month: {date.strftime('%B')}")
print(f"Day: {date.strftime('%d')}")
print(f"Day name: {date.strftime('%A')}")
print(f"Day of year: {date.strftime('%j')}")
print(f"Time of day: {date.strftime('%p')}")

Year: 2005
Month: July
Day: 09
Day name: Saturday
Day of year: 190
Time of day: PM


In [28]:
date + timedelta(days=7)

datetime.datetime(2005, 7, 16, 13, 54)

But as with strings, working with arrays of datetimes in plain Python can be difficult and inefficient. NumPy therefore introduced its own datetime type, datetime64, to work more effectively with arrays of dates.
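A short sketch of NumPy's datetime support, using made-up dates: arrays with a datetime64 dtype support vectorised arithmetic with timedelta64 values.

```python
import numpy as np

# An array of dates with day precision
dates = np.array(["2020-07-09", "2020-08-01"], dtype="datetime64[D]")

# Vectorised arithmetic: shift every date by one week
shifted = dates + np.timedelta64(7, "D")

# Subtracting two dates gives a timedelta64
gap = dates[1] - dates[0]

print(shifted)
print(gap)
```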

But while NumPy helps bring datetimes into the array world, it’s missing a lot of functionality that we commonly need for wrangling tasks. This is where Pandas comes in. Pandas consolidates and extends functionality from the datetime module, NumPy, and other libraries like scikits.timeseries into a single place. Pandas provides 4 key datetime objects, which we’ll explore in the following sections: Timestamp (a point in time), Timedelta (a duration), Period (a span of time), and DateOffset (a calendar-aware offset).



#### Creating Datetimes


Most commonly you’ll want to:

- Create a single point in time with pd.Timestamp(), e.g., 2005-07-09 00:00:00

- Create a span of time with pd.Period(), e.g., 2020 Jan

- Create an array of datetimes with pd.date_range() or pd.period_range()

In [30]:
print(pd.Timestamp('2005-07-09'))  # parsed from string
print(pd.Timestamp(year=2005, month=7, day=9))  # pass data directly
print(pd.Timestamp(datetime(year=2005, month=7, day=9)))  # from datetime object

2005-07-09 00:00:00
2005-07-09 00:00:00
2005-07-09 00:00:00


The above is a specific point in time. Below, we can use pd.Period() to specify a span of time (like a day):

In [32]:
span = pd.Period('2005-07-09')
print(span)
print(type(span))
print(span.start_time)
print(span.end_time)

2005-07-09
<class 'pandas._libs.tslibs.period.Period'>
2005-07-09 00:00:00
2005-07-09 23:59:59.999999999


In [34]:
point = pd.Timestamp('2005-07-09 12:00')
span = pd.Period('2005-07-09')
print(f"Point: {point}")
print(f"Span: {span}")
print(f"Point in span? {span.start_time < point < span.end_time}")

Point: 2005-07-09 12:00:00
Span: 2005-07-09
Point in span? True
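A Period does not have to be a day. As a hedged sketch, the example below builds a monthly period (like the "2020 Jan" span mentioned earlier), checks whether a made-up timestamp falls inside it, and steps to the next month. The comparison uses <= so that the boundary instants themselves count as inside the span, which is a choice made for this example.

```python
import pandas as pd

# A span covering all of January 2020
month = pd.Period("2020-01", freq="M")

# A hypothetical point in time to test
point = pd.Timestamp("2020-01-31 18:30")
inside = month.start_time <= point <= month.end_time

# Adding an integer moves the period forward by that many months
next_month = month + 1

print(month.start_time, month.end_time)
print(inside, next_month)
```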


Often, you’ll want to create arrays of datetimes, not just single values. Arrays of datetimes are instances of DatetimeIndex, PeriodIndex, or TimedeltaIndex:



In [35]:
pd.date_range(
    start='2020-09-01 12:00',
    end='2020-09-11 12:00',
    freq="D"
)

DatetimeIndex(['2020-09-01 12:00:00', '2020-09-02 12:00:00',
               '2020-09-03 12:00:00', '2020-09-04 12:00:00',
               '2020-09-05 12:00:00', '2020-09-06 12:00:00',
               '2020-09-07 12:00:00', '2020-09-08 12:00:00',
               '2020-09-09 12:00:00', '2020-09-10 12:00:00',
               '2020-09-11 12:00:00'],
              dtype='datetime64[ns]', freq='D')
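The other two index types can be created in a similar way. The sketch below uses pd.period_range() and pd.timedelta_range() with made-up start and end values:

```python
import pandas as pd

# Six monthly spans, January to June 2020 -> PeriodIndex
months = pd.period_range(start="2020-01", end="2020-06", freq="M")

# Durations of 0, 1, 2 and 3 days -> TimedeltaIndex
steps = pd.timedelta_range(start="0 days", end="3 days", freq="D")

print(months)
print(steps)
```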

We can use Timedelta objects to perform temporal operations like adding or subtracting time:


In [36]:
pd.date_range('2020-09-01 12:00', '2020-09-11 12:00', freq='D') + pd.Timedelta('1.5 hour')

DatetimeIndex(['2020-09-01 13:30:00', '2020-09-02 13:30:00',
               '2020-09-03 13:30:00', '2020-09-04 13:30:00',
               '2020-09-05 13:30:00', '2020-09-06 13:30:00',
               '2020-09-07 13:30:00', '2020-09-08 13:30:00',
               '2020-09-09 13:30:00', '2020-09-10 13:30:00',
               '2020-09-11 13:30:00'],
              dtype='datetime64[ns]', freq='D')
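Timedelta objects also arise when subtracting timestamps, and they can be built from plain numbers with pd.to_timedelta(). The sketch below assumes a hypothetical ride that starts at a given time and lasts 2084 seconds; the values are for illustration only.

```python
import pandas as pd

start = pd.Timestamp("2019-09-10 00:13:04")

# Interpret a plain number as a duration in seconds
duration = pd.to_timedelta(2084, unit="s")

# Timestamp + Timedelta gives a new Timestamp
end = start + duration

# Timestamp - Timestamp gives a Timedelta
gap = pd.Timestamp("2020-09-11") - pd.Timestamp("2020-09-01")

print(end)
print(gap)
```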

It’s fairly common to have an array of dates as strings. We can use pd.to_datetime() to convert these to datetime:

In [37]:
string_dates = ['July 9, 2020', 'August 1, 2020', 'August 28, 2020']

pd.to_datetime(string_dates)

DatetimeIndex(['2020-07-09', '2020-08-01', '2020-08-28'], dtype='datetime64[ns]', freq=None)

For more complex datetime formats, use the format argument (see Python Format Codes for help):

In [39]:
string_dates = ['2020 9 July', '2020 1 August', '2020 28 August']
pd.to_datetime(string_dates, format="%Y %d %B")

DatetimeIndex(['2020-07-09', '2020-08-01', '2020-08-28'], dtype='datetime64[ns]', freq=None)
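Real data often contains entries that do not follow the expected format. One way to handle this, sketched below with a deliberately malformed entry, is the errors="coerce" argument of pd.to_datetime(), which turns unparseable values into NaT (the datetime equivalent of a missing value) instead of raising an error.

```python
import pandas as pd

# The middle entry is intentionally not a valid date
raw = ['2020 9 July', 'not a date', '2020 28 August']

parsed = pd.to_datetime(raw, format="%Y %d %B", errors="coerce")

print(parsed)
```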

Or use a dictionary:

In [40]:
dict_dates = pd.to_datetime({"year": [2020, 2020, 2020],
                             "month": [7, 8, 8],
                             "day": [9, 1, 28]})  # note this is a series, not an index!
dict_dates

0   2020-07-09
1   2020-08-01
2   2020-08-28
dtype: datetime64[ns]

Let’s practice by reading in our favourite cycling dataset:


In [41]:
df = pd.read_csv('data/cycling_data.csv', index_col=0)
df

Date,Name,Type,Time,Distance,Comments
"10 Sep 2019, 00:13:04",Afternoon Ride,Ride,2084,12.62,Rain
"10 Sep 2019, 13:52:18",Morning Ride,Ride,2531,13.03,rain
"11 Sep 2019, 00:23:50",Afternoon Ride,Ride,1863,12.52,Wet road but nice weather
"11 Sep 2019, 14:06:19",Morning Ride,Ride,2192,12.84,Stopped for photo of sunrise
"12 Sep 2019, 00:28:05",Afternoon Ride,Ride,1891,12.48,Tired by the end of the week
"16 Sep 2019, 13:57:48",Morning Ride,Ride,2272,12.45,Rested after the weekend!
"17 Sep 2019, 00:15:47",Afternoon Ride,Ride,1973,12.45,Legs feeling strong!
"17 Sep 2019, 13:43:34",Morning Ride,Ride,2285,12.6,Raining
"18 Sep 2019, 13:49:53",Morning Ride,Ride,2903,14.57,Raining today
"18 Sep 2019, 00:15:52",Afternoon Ride,Ride,2101,12.48,Pumped up tires


Our index is just a plain old index at the moment, with dtype object, full of string dates:

In [43]:
print(df.index.dtype)
type(df.index)


object


pandas.core.indexes.base.Index

In [45]:
df.index = pd.to_datetime(df.index)

In [46]:
df

Date,Name,Type,Time,Distance,Comments
2019-09-10 00:13:04,Afternoon Ride,Ride,2084,12.62,Rain
2019-09-10 13:52:18,Morning Ride,Ride,2531,13.03,rain
2019-09-11 00:23:50,Afternoon Ride,Ride,1863,12.52,Wet road but nice weather
2019-09-11 14:06:19,Morning Ride,Ride,2192,12.84,Stopped for photo of sunrise
2019-09-12 00:28:05,Afternoon Ride,Ride,1891,12.48,Tired by the end of the week
2019-09-16 13:57:48,Morning Ride,Ride,2272,12.45,Rested after the weekend!
2019-09-17 00:15:47,Afternoon Ride,Ride,1973,12.45,Legs feeling strong!
2019-09-17 13:43:34,Morning Ride,Ride,2285,12.6,Raining
2019-09-18 13:49:53,Morning Ride,Ride,2903,14.57,Raining today
2019-09-18 00:15:52,Afternoon Ride,Ride,2101,12.48,Pumped up tires


Alternatively, pd.read_csv() has an argument parse_dates which can do this automatically when reading the file:

In [47]:
df = pd.read_csv('data/cycling_data.csv', index_col=0, parse_dates=True)
print(df.index.dtype)
type(df.index)

datetime64[ns]


pandas.core.indexes.datetimes.DatetimeIndex

In [54]:
df = df.sort_index()

We can do partial string indexing:

In [55]:
df.loc['2019-09']

Date,Name,Type,Time,Distance,Comments
2019-09-10 00:13:04,Afternoon Ride,Ride,2084,12.62,Rain
2019-09-10 13:52:18,Morning Ride,Ride,2531,13.03,rain
2019-09-11 00:23:50,Afternoon Ride,Ride,1863,12.52,Wet road but nice weather
2019-09-11 14:06:19,Morning Ride,Ride,2192,12.84,Stopped for photo of sunrise
2019-09-12 00:28:05,Afternoon Ride,Ride,1891,12.48,Tired by the end of the week
2019-09-16 13:57:48,Morning Ride,Ride,2272,12.45,Rested after the weekend!
2019-09-17 00:15:47,Afternoon Ride,Ride,1973,12.45,Legs feeling strong!
2019-09-17 13:43:34,Morning Ride,Ride,2285,12.6,Raining
2019-09-18 00:15:52,Afternoon Ride,Ride,2101,12.48,Pumped up tires
2019-09-18 13:49:53,Morning Ride,Ride,2903,14.57,Raining today


In [56]:
# Exact matching:

df.loc['2019-09-26 00:13:33']

Name        Afternoon Ride
Type                  Ride
Time                  1860
Distance             12.52
Comments           raining
Name: 2019-09-26 00:13:33, dtype: object

In [57]:
# And slicing:

df.loc['2019-10-01':'2019-10-13'] # will not work unless the index is sorted

Date,Name,Type,Time,Distance,Comments
2019-10-01 00:15:07,Afternoon Ride,Ride,1732,,Legs feeling strong!
2019-10-01 13:45:55,Morning Ride,Ride,2222,12.82,Beautiful morning! Feeling fit
2019-10-02 00:13:09,Afternoon Ride,Ride,1756,,A little tired today but good weather
2019-10-02 13:46:06,Morning Ride,Ride,2134,13.06,Bit tired today but good weather
2019-10-03 00:45:22,Afternoon Ride,Ride,1724,12.52,Feeling good
2019-10-03 13:47:36,Morning Ride,Ride,2182,12.68,Wet road
2019-10-04 01:08:08,Afternoon Ride,Ride,1870,12.63,"Very tired, riding into the wind"
2019-10-09 13:55:40,Morning Ride,Ride,2149,12.7,Really cold! But feeling good
2019-10-10 00:10:31,Afternoon Ride,Ride,1841,12.59,Feeling good after a holiday break!
2019-10-10 13:47:14,Morning Ride,Ride,2463,12.79,Stopped for photo of sunrise
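The sorting requirement can be checked before slicing. The sketch below uses a small made-up DataFrame (not the cycling data) whose index starts out unsorted; the is_monotonic_increasing attribute reports whether the index is sorted, and the slice is taken after sort_index(). Note that the end of a date-string slice includes the whole of that day.

```python
import pandas as pd

# Hypothetical rides with an index that is deliberately out of order
idx = pd.to_datetime(["2019-10-03 13:47", "2019-10-01 00:15",
                      "2019-10-10 00:10", "2019-10-02 13:46"])
rides = pd.DataFrame({"Distance": [12.68, 12.50, 12.59, 13.06]}, index=idx)

was_sorted = rides.index.is_monotonic_increasing  # False for this data

# Sort first, then slice by date strings
rides = rides.sort_index()
first_days = rides.loc["2019-10-01":"2019-10-03"]

print(was_sorted)
print(first_days)
```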


In [58]:
# query() also accepts a date string here; df.loc['2019-10-10'] is the more explicit equivalent
df.query("'2019-10-10'")

Date,Name,Type,Time,Distance,Comments
2019-10-10 00:10:31,Afternoon Ride,Ride,1841,12.59,Feeling good after a holiday break!
2019-10-10 13:47:14,Morning Ride,Ride,2463,12.79,Stopped for photo of sunrise


In [60]:
# And for getting all results between two times of a day, use df.between_time():

df.between_time("00:00", "01:00")

Date,Name,Type,Time,Distance,Comments
2019-09-10 00:13:04,Afternoon Ride,Ride,2084,12.62,Rain
2019-09-11 00:23:50,Afternoon Ride,Ride,1863,12.52,Wet road but nice weather
2019-09-12 00:28:05,Afternoon Ride,Ride,1891,12.48,Tired by the end of the week
2019-09-17 00:15:47,Afternoon Ride,Ride,1973,12.45,Legs feeling strong!
2019-09-18 00:15:52,Afternoon Ride,Ride,2101,12.48,Pumped up tires
2019-09-19 00:30:01,Afternoon Ride,Ride,48062,12.48,Feeling good
2019-09-24 00:35:42,Afternoon Ride,Ride,2076,12.47,"Oiled chain, bike feels smooth"
2019-09-25 00:07:21,Afternoon Ride,Ride,1775,12.1,Feeling really tired
2019-09-26 00:13:33,Afternoon Ride,Ride,1860,12.52,raining
2019-10-01 00:15:07,Afternoon Ride,Ride,1732,,Legs feeling strong!


We can easily decompose our timeseries into its constituent components. A DatetimeIndex exposes these components as attributes and methods, such as year, second, weekday, day_name() and month_name():

In [61]:
df.index.year

Index([2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019,
       2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019,
       2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019],
      dtype='int32', name='Date')

In [62]:
df.index.second

Index([ 4, 18, 50, 19,  5, 48, 47, 34, 52, 53,  1,  9,  5, 41, 42, 24, 21, 41,
       33, 43, 18, 52,  7, 55,  9,  6, 22, 36,  8, 40, 31, 14, 57],
      dtype='int32', name='Date')

In [63]:
df.index.weekday

Index([1, 1, 2, 2, 3, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1,
       2, 2, 3, 3, 4, 2, 3, 3, 4],
      dtype='int32', name='Date')

In [64]:
df.index.day_name()

Index(['Tuesday', 'Tuesday', 'Wednesday', 'Wednesday', 'Thursday', 'Monday',
       'Tuesday', 'Tuesday', 'Wednesday', 'Wednesday', 'Thursday', 'Thursday',
       'Friday', 'Monday', 'Tuesday', 'Tuesday', 'Wednesday', 'Wednesday',
       'Thursday', 'Thursday', 'Friday', 'Monday', 'Tuesday', 'Tuesday',
       'Wednesday', 'Wednesday', 'Thursday', 'Thursday', 'Friday', 'Wednesday',
       'Thursday', 'Thursday', 'Friday'],
      dtype='object', name='Date')

In [65]:
df.index.month_name()


Index(['September', 'September', 'September', 'September', 'September',
       'September', 'September', 'September', 'September', 'September',
       'September', 'September', 'September', 'September', 'September',
       'September', 'September', 'September', 'September', 'September',
       'September', 'September', 'October', 'October', 'October', 'October',
       'October', 'October', 'October', 'October', 'October', 'October',
       'October'],
      dtype='object', name='Date')

In [68]:
# Note that if you’re operating on a Series rather than a DatetimeIndex object, you can access this functionality through the .dt attribute:

pd.Series(df.index).dt.day_name()

0       Tuesday
1       Tuesday
2     Wednesday
3     Wednesday
4      Thursday
5        Monday
6       Tuesday
7       Tuesday
8     Wednesday
9     Wednesday
10     Thursday
11     Thursday
12       Friday
13       Monday
14      Tuesday
15      Tuesday
16    Wednesday
17    Wednesday
18     Thursday
19     Thursday
20       Friday
21       Monday
22      Tuesday
23      Tuesday
24    Wednesday
25    Wednesday
26     Thursday
27     Thursday
28       Friday
29    Wednesday
30     Thursday
31     Thursday
32       Friday
Name: Date, dtype: object
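These components become useful once they are stored as columns, because they can then be used for grouping. The sketch below uses a small made-up DataFrame with three rides (the column names weekday and hour are chosen for this example) and computes the mean distance per day of the week:

```python
import pandas as pd

# Hypothetical rides indexed by start time
idx = pd.to_datetime(["2019-09-10 00:13:04", "2019-09-10 13:52:18",
                      "2019-09-11 00:23:50"])
rides = pd.DataFrame({"Distance": [12.62, 13.03, 12.52]}, index=idx)

# Store datetime components as ordinary columns
rides["weekday"] = rides.index.day_name()
rides["hour"] = rides.index.hour

# Aggregate by one of the components
mean_by_day = rides.groupby("weekday")["Distance"].mean()

print(rides)
print(mean_by_day)
```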