## Intermediate Pandas





Today we look at some ways to use Pandas DataFrames like databases. This is much more convenient with Pandas than with NumPy arrays, but it means learning several new concepts, including grouping, aggregating, and combining DataFrames.





### Revisiting a previous example with batches of data





We start with the example we looked at before. It is a dataset from a set of experiments. The experiments are grouped by the Day they were run on. We will use Pandas to do some analysis by the day.





In [None]:
import pandas as pd
df = pd.read_csv('p-t.dat', delimiter=r'\s+', skiprows=2,  # raw string avoids an invalid-escape warning
                 names=['Run order', 'Day', 'Ambient Temperature', 'Temperature',
                        'Pressure', 'Fitted Value', 'Residual'])
df



Suppose we want to get information about different days. Let's do some work by hand.



In [None]:
df['Day'] == 1



We can get all kinds of data on this.



In [None]:
df[df['Day'] == 1].describe()
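To make the masking step explicit, here is a minimal sketch on a small made-up DataFrame. The name `toy` and its values are assumptions for illustration only; they are not from `p-t.dat`. The comparison produces a Boolean Series, and indexing with that Series keeps only the rows where it is True.

```python
import pandas as pd

# Made-up stand-in for the experiment data (values are illustrative only)
toy = pd.DataFrame({'Day': [1, 1, 2, 2, 3],
                    'Temperature': [20.0, 22.0, 30.0, 34.0, 40.0]})

mask = toy['Day'] == 1   # Boolean Series: True where Day is 1
day1 = toy[mask]         # keeps only the rows where the mask is True

print(mask.tolist())                       # [True, True, False, False, False]
print(float(day1['Temperature'].mean()))   # 21.0
```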



We can use a list comprehension to loop over the days and get an average for each one.



In [None]:
[df[df['Day'] == x].mean() for x in [1, 2, 3, 4]]



It is a little inconvenient to have to know all the days. We can get them from the dataframe itself.



In [None]:
df['Day'].unique()



Finally, we can put it all together to get the mean of a single column grouped by day.



In [None]:
[df[df['Day'] == x]['Temperature'].mean() for x in df['Day'].unique()]



That was an exploratory approach, loosely modeled on what we would do in NumPy. Next, we look at the Pandas way.



The first aggregation we will look at is how to make groups of data that are related by values in a column. We use the `groupby` method ([https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html#pandas.DataFrame.groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html#pandas.DataFrame.groupby)) and specify a column to group on. The result is a `DataFrameGroupBy` object, which we then have to work with.





In [None]:
groups = df.groupby('Day')
type(groups)
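As a sketch of what grouping buys us, the per-day mean computed by hand earlier can be written as a single grouped aggregation. The DataFrame `toy` below is made up for illustration and is not the experiment data. One difference to keep in mind is that `unique()` returns values in order of appearance, while `groupby` sorts the group keys by default.

```python
import pandas as pd

# Made-up stand-in for the experiment data (values are illustrative only)
toy = pd.DataFrame({'Day': [1, 1, 2, 2, 3],
                    'Temperature': [20.0, 22.0, 30.0, 34.0, 40.0]})

# The by-hand approach: filter once per day, then average
by_hand = [float(toy[toy['Day'] == d]['Temperature'].mean())
           for d in toy['Day'].unique()]

# The grouped approach: one call, indexed by Day
grouped = toy.groupby('Day')['Temperature'].mean()

print(by_hand)            # [21.0, 32.0, 40.0]
print(grouped.tolist())   # [21.0, 32.0, 40.0]
```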



The groups can describe themselves. Here we see we get 4 groups, one for each day, and you can see some statistics about each group. We do not need those for now.





In [None]:
groups.describe()



The `groups` attribute is a dictionary that maps each group name (here, the day) to the index labels of the rows in that group.





In [None]:
groups.groups



We can get the subset of rows from those group labels.





In [None]:
df.loc[groups.groups[2]]
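A shorter route to the same rows is the `get_group` method of the grouped object. The sketch below uses a made-up DataFrame (`toy` and `toy_groups` are illustrative names) and compares the two approaches.

```python
import pandas as pd

# Made-up stand-in for the experiment data (values are illustrative only)
toy = pd.DataFrame({'Day': [1, 1, 2, 2, 3],
                    'Temperature': [20.0, 22.0, 30.0, 34.0, 40.0]})
toy_groups = toy.groupby('Day')

via_labels = toy.loc[toy_groups.groups[2]]   # look up the row labels, then select
via_get_group = toy_groups.get_group(2)      # ask the grouped object directly

print(via_labels.index.tolist())                 # [2, 3]
print(via_get_group.index.tolist())              # [2, 3]
print(via_get_group['Temperature'].tolist())     # [30.0, 34.0]
```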



We don't usually work with groups that way, though. It is more common to do some analysis on each group.

Suppose we want to plot the Pressure vs Temperature for each group, so we can see visually if there are any trends that could be attributed to the group. To do this, we need to *iterate* over the groups and then make a plot on each one.

A `DataFrameGroupBy` is *iterable* and when you loop over it, you get the `key` it was grouped on, and a DataFrame that contains the items in the group. Here we loop over each group, and plot each group with a different color.





In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for (day, group) in groups:
    group.plot('Temperature', 'Pressure', ax=ax, label=f'{day}', style='o')
plt.ylabel('Pressure');



The point of this is that we cannot see any visual clustering of the groups by day. That is important, because clustering could suggest that something was different on that day.





### Combining data sets





Siddhant Lambor provided data from two experiments conducted to measure the properties of a worm-like micelle solution. He carried out the experiments on a rheometer, measuring the viscosity of the solution in a Couette cell geometry and in a Cone and Plate geometry. Ideally, there should not be a difference, because viscosity is intrinsic to the fluid. Analysis of this data will show whether that is true. First, we read the data in from the two data files.





In [None]:
couette = pd.read_excel('couette.xls',
                   sheet_name='Flow sweep - 1',
                   header=1) # sheet name is case sensitive, excel file name is not

couette



We can drop the row at index 0, it just has the units in it. With this syntax, we have to save the resulting DataFrame back into the variable, or it will not be changed.





In [None]:
couette = couette.drop(0)
couette



There is a second file called cp.xls that we want to combine with this. Here, we chain the `drop` call onto the read, so it all happens in one line.





In [None]:
conePlate = pd.read_excel('cp.xls', sheet_name='Flow sweep - 1', header=1).drop(0)
conePlate.head(5)



For this analysis, we are only interested in the shear rate, stress and viscosity values. Let us drop the other columns. We do that by name, and specify `inplace=True`, which modifies the DataFrame itself.





In [None]:
conePlate.drop(['Temperature', 'Step time', 'Normal stress'], axis=1, inplace=True)
# if we do not use inplace=True, the data frame will not be changed. It would by default create a new data frame
# and we would have to assign a different variable to capture this change.
conePlate.head(5)



We also do that for the couette data. Here we did not use `inplace=True`, so we have to save the result back into the variable to get the change.





In [None]:
couette = couette.drop(['Temperature', 'Step time', 'Normal stress'], axis=1)   # without using inplace = True
couette.head(5)
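The difference between the two styles can be checked on a small made-up DataFrame (the name `toy` and its values are assumptions for illustration). Without `inplace=True`, `drop` returns a new DataFrame and leaves the original alone. With `inplace=True`, it changes the original and returns `None`.

```python
import pandas as pd

# Made-up frame with one column we want to remove
toy = pd.DataFrame({'Shear rate': [1.0, 10.0],
                    'Viscosity': [5.0, 2.0],
                    'Temperature': [25.0, 25.0]})

returned = toy.drop(['Temperature'], axis=1)   # new DataFrame; toy is untouched
print(list(toy.columns))        # ['Shear rate', 'Viscosity', 'Temperature']
print(list(returned.columns))   # ['Shear rate', 'Viscosity']

result = toy.drop(['Temperature'], axis=1, inplace=True)  # modifies toy itself
print(result)                   # None
print(list(toy.columns))        # ['Shear rate', 'Viscosity']
```

Assigning the result of an `inplace=True` call back to the variable would therefore replace the DataFrame with `None`, which is a common mistake.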



We can see info about each DataFrame like this.





In [None]:
couette.info()



In [None]:
conePlate.info()
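One thing worth checking in the `info()` output is the dtype of each column. When a units row is read in as data, a column holds both strings and numbers, and it can stay as `object` dtype even after that row is dropped. Whether that happens with these particular files is an assumption here, not something shown above. The sketch below uses a made-up DataFrame to show the effect and one way to convert, using `pd.to_numeric`.

```python
import pandas as pd

# Made-up frame mimicking a sheet whose first data row holds the units
toy = pd.DataFrame({'Shear rate': ['1/s', 1.0, 10.0],
                    'Viscosity': ['Pa.s', 5.0, 2.0]})
toy = toy.drop(0)                    # remove the units row

print(toy['Viscosity'].dtype)        # object

numeric = toy.apply(pd.to_numeric)   # convert every column to a numeric dtype
print(numeric['Viscosity'].dtype)    # float64
```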



We could proceed to analyze the DataFrames separately, but instead, we will combine them into one DataFrame. Before doing that, we need to add a column to each one so we know which data set is which. Simply assigning a value to a new column name will do that.





In [None]:
couette['type'] = 'couette'
couette



In [None]:
conePlate['type'] = 'cone'



Now, we can combine these into a single DataFrame. This is not critical, and you can get by without it, but I want to explore the idea, and illustrate it is possible.





In [None]:
df = pd.concat([conePlate, couette])
df
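Note that `pd.concat` keeps the original index labels of each DataFrame, so the combined index contains repeated labels. That is harmless for the grouping done here, but it matters if you later select rows with `.loc`. A minimal sketch with made-up frames (`a` and `b` are illustrative names) shows the default behavior and the `ignore_index=True` option, which renumbers the rows.

```python
import pandas as pd

# Two made-up frames standing in for the two experiments
a = pd.DataFrame({'Viscosity': [5.0, 2.0], 'type': ['cone', 'cone']})
b = pd.DataFrame({'Viscosity': [5.1, 2.1], 'type': ['couette', 'couette']})

stacked = pd.concat([a, b])                        # keeps each frame's own labels
print(stacked.index.tolist())                      # [0, 1, 0, 1]

renumbered = pd.concat([a, b], ignore_index=True)  # fresh 0..n-1 index
print(renumbered.index.tolist())                   # [0, 1, 2, 3]
```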



Finally, we are ready for the visualization. We will group the DataFrame and then make plots for each group. Here we illustrate several new arguments, including loglog plots, secondary axes, colored tick labels, and multiple legends.





In [None]:
g = df.groupby('type')
ax1 = g.get_group('cone').plot('Shear rate', 'Viscosity',
                               logx=True, logy=True, style='b.-',
                               label="CP viscosity")

g.get_group('couette').plot('Shear rate', 'Viscosity', logx=True, logy=True,
                            style='g.-', ax=ax1, label="Couette viscosity")

ax2 = g.get_group('cone').plot('Shear rate', 'Stress', secondary_y=True,
                               logx=True, logy=True, style='r.-',
                               ax=ax1, label="CP stress")

g.get_group('couette').plot('Shear rate', 'Stress', secondary_y=True,
                            logx=True, logy = True, style='y.', ax=ax2,
                            label="Couette Stress")

# Setting y axis labels
ax1.set_ylabel("Viscosity (Pa.s)", color='b')
[ticklabel.set_color('b') for ticklabel in ax1.get_yticklabels()]

ax2.set_ylabel("Stress (Pa)", color='r')
[ticklabel.set_color('r') for ticklabel in ax2.get_yticklabels()]  # color the secondary (stress) axis ticks

# setting legend locations
ax1.legend(loc=6)
ax2.legend(loc=7)

ax1.set_xlabel("Shear rate (1/s)")
plt.title("Comparison of Cone and Plate with Couette Cell")



So, in fact, we can see that the two geometries give practically equivalent results.



