## Introduction to Pandas





Before we jump into Pandas, let us review what we have considered so far.

First, we learned how to read data from files into numpy arrays. We learned how to store that data in variables, and either to slice the array into a few variables or to use the slices directly. We also learned how to make a *record* array that let us access columns of the array by *name*.

When we loaded a json file, we got a *dictionary* data structure, which also allowed us to access data by a *name*.

Second, we imported a visualization library, and made plots that used the arrays as arguments.

For "small" data sets, i.e. not too many columns, this is a perfectly reasonable thing to do. For larger datasets, however, it can be tedious to create a lot of variable names, and it is also hard to remember what is in each column.

Many tasks are pretty standard, e.g. read a data set, summarize and visualize it. It would be nice if we had a simple way to do this, with few lines of code, since those lines will be the same every time.

The [Pandas](https://pandas.pydata.org/) library was developed to address all these issues. From the website: "**pandas** is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language."





### Review of the numpy array way





Let's review what we learned already.





In [None]:
import numpy as np
data = np.loadtxt('raman.txt')

wavenumber, intensity = data.T  # the transpose has data in rows for unpacking
ind = (wavenumber >= 1000) & (wavenumber < 1500)

import matplotlib.pyplot as plt
plt.figure()
plt.plot(wavenumber[ind], intensity[ind])
plt.xlabel('Wavenumber')
plt.ylabel('Intensity');



### Now, with Pandas





We will unpack this code shortly. For now, look at how little code it takes to create this plot: the example above is condensed into essentially three lines. That is pretty remarkable, but it should give you some pause. We now have to learn how to use such a dense syntax!





In [None]:
import pandas as pd

df = pd.read_csv('raman.txt', delimiter='\t', index_col=0,
                 names=['wavenumber', 'intensity'])
df[(df.index >= 1000) & (df.index < 1500)].plot();
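
The one-line selection above can be broken into steps. The sketch below does this on a small made-up DataFrame (the values are invented for illustration and are not the Raman data): the comparison on the index gives a boolean mask, and indexing the DataFrame with that mask keeps only the matching rows.

```python
import pandas as pd

# Toy data, invented for illustration; the index plays the role of the wavenumber.
toy = pd.DataFrame({'intensity': [5.0, 7.0, 9.0, 11.0]},
                   index=pd.Index([900, 1000, 1400, 1500], name='wavenumber'))

# Step 1: a boolean array, True where the index is in [1000, 1500).
mask = (toy.index >= 1000) & (toy.index < 1500)

# Step 2: indexing with the mask keeps only the rows where it is True.
selected = toy[mask]

print(mask)
print(selected)
```

Calling `selected.plot()` would then plot only those rows, which is what the condensed line does in one step.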



And to summarize.





In [None]:
df.describe()



What is the benefit of this dense syntax? Because it is so short, it is faster to type (at least, once you know what to type). Once you are familiar with it, it is also faster to read.

The downside is that it is like learning a whole new language within Python, and a new mental model for how the data is stored and accessed. You have to decide if it is worthwhile doing that. If you do this a lot, it is probably worthwhile.





## Pandas





The main object we will work with is called a `DataFrame`.





In [None]:
type(df)



Jupyter notebooks can show you a fancy rendering of your dataframe.





In [None]:
df



The dataframe combines a few ideas we used from arrays and dictionaries. First, we can access a column by name. When we do this, we get a `Series` object.





In [None]:
type(df['intensity'])



You can extract the values into a numpy array like this.





In [None]:
df['intensity'].values



A Series (like a DataFrame) is like a numpy array in some ways, and different in others. Suppose we want to see the first five entries of the intensity. To use *integer-based* indexing like we have so far, we have to use the `iloc` attribute on the series like this. `iloc` stands for integer location.





In [None]:
df['intensity'].iloc[0:5]
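
To make the difference between position and label concrete, here is a minimal sketch on a made-up Series (the numbers are invented, not taken from the Raman file). `iloc` selects by integer position, while `loc` selects by the value of the index.

```python
import pandas as pd

# Toy values, made up for illustration (not the Raman data from the text).
s = pd.Series([10.0, 20.0, 30.0, 40.0],
              index=[500.0, 600.0, 700.0, 800.0],
              name='intensity')

# Position-based: the first two entries, whatever their labels are.
first_two = s.iloc[0:2]

# Label-based: the entry whose index label is 700.0.
at_700 = s.loc[700.0]

print(first_two)
print(at_700)
```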



What about the wavenumbers? These are called the *index* of the dataframe.





In [None]:
df.index



You can index the index with integers as you can with an array.





In [None]:
df.index[0:5]



Finally, you can combine these so that you index a column with a slice of the index like this.





In [None]:
df['intensity'][df.index[0:5]]



In summary, we can think of a dataframe as a hybrid array/dictionary where we have an index which is like the independent variable, and a set of columns that are like dependent variables. You can access the columns like a dictionary.
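
As a rough illustration of this hybrid picture, the sketch below builds a small DataFrame from a dictionary. The column names and values are hypothetical, chosen only to show the dictionary-like and array-like sides next to each other.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are chosen only for illustration.
toy = pd.DataFrame({'intensity': [1.0, 4.0, 9.0],
                    'baseline': [0.1, 0.2, 0.3]},
                   index=pd.Index([100, 200, 300], name='wavenumber'))

column_names = list(toy.keys())      # dictionary-like: the keys are the column names
intensity = toy['intensity']         # dictionary-like: look up a column by name
as_array = toy['intensity'].values   # array-like: the underlying numpy array
labels = toy.index                   # the index plays the role of the independent variable

print(column_names)
print(as_array)
print(labels)
```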





### Dataframes and visualization





Dataframes also provide easy access to [visualization](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html). The simplest method is to just call the plot method on a dataframe. Note this automatically makes the plot with labels and a legend. If there are many columns, you will have a curve for each one of them. We will see that later.





In [None]:
df.plot();



### Reading data in Pandas





Let's get back to how we got the data into Pandas. Let's retrieve the data file we used before with several columns in it.





In [None]:
fname = 'p-t.dat'
url = 'https://www.itl.nist.gov/div898/handbook/datasets/MODEL-4_4_4.DAT'

import urllib.request
urllib.request.urlretrieve(url, fname)



Let's refresh our memory of what is in this file:





In [None]:
! head p-t.dat



We use [pandas.read\_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) to read this, similar to how we used `numpy.loadtxt`. It also takes a lot of arguments to fine-tune the output. We use whitespace as the delimiter here: `'\s+'` is a *regular expression* that matches one or more whitespace characters. We still skip two rows, and we have to manually define the column names. We *do not* specify an index column here, so we get a default one based on integers. Pandas is smart enough to recognize that the first two columns are integers, so we do not have to do anything special here.





In [None]:
df = pd.read_csv('p-t.dat', delimiter=r'\s+', skiprows=2,
                 names=['Run order', 'Day', 'Ambient Temperature', 'Temperature',
                        'Pressure', 'Fitted Value', 'Residual'])
df
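
The same arguments can be tried without downloading anything. The sketch below reads a made-up block of text through `io.StringIO`; the text and column names are invented for illustration and only mimic the layout of a file with two header lines and unevenly spaced columns.

```python
import io
import pandas as pd

# Made-up text that mimics a file with two header lines and
# whitespace-separated columns of uneven width.
text = """Toy data set
run   temperature   pressure
1     20.5          101.3
2        21.0     102.1
3     22.25         103.0
"""

toy = pd.read_csv(io.StringIO(text),
                  delimiter=r'\s+',   # one or more whitespace characters
                  skiprows=2,         # skip the two header lines
                  names=['run', 'temperature', 'pressure'])

print(toy)
print(toy.dtypes)  # 'run' is read as an integer column, the others as floats
```

The pattern is written as a raw string (`r'\s+'`) so that Python does not try to interpret the backslash itself.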



The default plot is not that nice.





In [None]:
df.plot();



The default is to plot each column vs the index, which is not that helpful for us. Say we just want to plot the pressure vs. the temperature.





In [None]:
df.plot(x='Temperature', y='Pressure', style='b.');



We can add multiple plots to a figure, but we have to tell the subsequent calls which axes to put them on. To do that, save the axes returned by the first call, and pass it as the `ax` argument in subsequent calls. That also allows you to fine-tune the plot appearance, e.g. add a y-label. See the [matplotlib documentation](https://matplotlib.org/contents.html) to learn how to set all of these.





In [None]:
p1 = df.plot(x='Temperature', y='Pressure', style='b.')
df.plot(x='Temperature', y='Fitted Value', ax=p1)

p1.set_ylabel('values');



It is a reasonable question to ask if this is simpler than what we did before using arrays, variables and plotting commands. Dataframes are increasingly common in data science, and are the data structure used in many data science/machine learning projects.





## Another real-life example





LAMMPS is a molecular simulation code used to run molecular dynamics. It outputs a text file that is somewhat challenging to read. The number of time steps varies, depending on how the simulation was set up.

Start by downloading and opening this file. It is a molecular dynamics trajectory at constant volume, where the pressure, temperature and energy fluctuate.

Open this file [log1.lammps](./log1.lammps) to get a sense for what is in it. The data starts around:

    timestep 0.005
    run ${runSteps}
    run 500000
    Per MPI rank memory allocation (min/avg/max) = 4.427 | 4.427 | 4.427 Mbytes
    Step v_mytime Temp Press Volume PotEng TotEng v_pxy v_pxz v_pyz v_v11 v_v22 v_v33 CPU
           0            0         1025    601.28429    8894.6478   -1566.6216   -1500.5083    2065.6285    1713.4095    203.00499 1.3408976e-05 9.2260011e-06 1.2951038e-07            0

And it ends around this line.

      500000         2500    978.62359   -2100.7614    8894.6478   -1570.5382   -1507.4162   -252.80665    614.87398    939.65393 0.00045263648 0.00043970796 0.00044228719    1288.0233
    Loop time of 1288.02 on 1 procs for 500000 steps with 500 atoms

Our job is to figure out where those lines are so we can read them into Pandas. There are many ways to do this, but we will stick with a pure Python way. The strategy is to search for the lines, and keep track of their positions.





In [None]:
start, stop = None, None
with open('log1.lammps') as f:
    for i, line in enumerate(f):
        if line.startswith('Step v_mytime'):
            start = i
        if line.startswith('Loop time of '):
            stop = i - 1  # stop on the previous line
            break
start, stop



This gets tricky. We want to skip the rows up to the starting line. At that point, the line numbers restart as far as Pandas is concerned, so the header is in line 0, and the number of rows to read is the stop line minus the start line. The values are separated by varying amounts of whitespace, so we use a regular expression *pattern* that matches one or more whitespace characters. Finally, we prevent the first column from being used as the index column by setting `index_col` to False. See [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) for all the details.





In [None]:
df = pd.read_csv('log1.lammps', skiprows=start, header=0, nrows=stop - start,
                 delimiter=r'\s+', index_col=False)
df
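
To check the bookkeeping of `start`, `stop`, `skiprows` and `nrows`, it can help to try the strategy on something small. The sketch below uses a made-up miniature log held in a string; its layout imitates the excerpt above, but the contents are invented.

```python
import io
import pandas as pd

# A made-up miniature log; the layout imitates a LAMMPS log, the numbers do not.
log = """units lj
run 3
Step Temp Press
0 1.00 0.50
1 1.10 0.40
2 0.90 0.60
Loop time of 0.01 on 1 procs
Total wall time: 0:00:00
"""

start, stop = None, None
for i, line in enumerate(io.StringIO(log)):
    if line.startswith('Step '):
        start = i      # the line holding the column names
    if line.startswith('Loop time of '):
        stop = i - 1   # the last line of data
        break

# Skip everything before the header, then read exactly the data rows.
toy = pd.read_csv(io.StringIO(log), skiprows=start, header=0,
                  nrows=stop - start, delimiter=r'\s+', index_col=False)

print(start, stop)
print(toy)
```

Here the header is on line 2 and the last data row on line 5, so `nrows` is 3, which matches the three data rows.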



### Visualizing the data





#### Plot a column





The effort was worth it though; look how easy it is to plot the data!





In [None]:
df.plot(x='Step', y='Press');



In [None]:
import matplotlib.pyplot as plt
fig, (ax0, ax1) = plt.subplots(1, 2)
df.plot(x='Temp', y='PotEng', style='b.', ax=ax0)
df.plot(x='Press', y='PotEng', style='b.', ax=ax1)
plt.tight_layout()



#### Plot distributions of a column





We can look at histograms of properties just as easily.





In [None]:
df.hist('PotEng', xrot=45, bins=20, density=True);



#### Plot column correlations





This is just the beginning of using Pandas. Suppose we want to see which columns are correlated ([https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html)). With variables this would be tedious.





In [None]:
plt.matshow(df.corr());
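
To see what the correlation matrix contains, here is a small sketch on synthetic columns whose relationships are known in advance. The column names are hypothetical and the data is generated, not read from the log file.

```python
import numpy as np
import pandas as pd

# Synthetic columns with known relationships, for illustration only.
x = np.linspace(0.0, 1.0, 50)
toy = pd.DataFrame({'x': x,
                    'up': 2.0 * x + 1.0,   # increases linearly with x
                    'down': -3.0 * x})     # decreases linearly with x

corr = toy.corr()  # Pearson correlation between every pair of columns
print(corr.round(2))
```

The result is itself a DataFrame, with the column names as both the row and column labels, so a single entry can be looked up with `corr.loc['x', 'up']`.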



We can see these correlations with a pairplot. This is moderately expensive to plot (it could take a few minutes).



In [None]:
import seaborn as sns
sns.pairplot(df);



You can also make the figure manually. Note, it is not possible to plot a column against itself with Pandas (I think this is a bug [https://github.com/pandas-dev/pandas/issues/22088](https://github.com/pandas-dev/pandas/issues/22088)), so here I use matplotlib functions for the plotting. This should be symmetric, so I only plot the upper triangle.





In [None]:
keys = df.keys()

fig, axs = plt.subplots(13, 13)
fig.set_size_inches((8, 8))
for i in range(13):
    for j in range(i, 13):
        axs[i, j].plot(df[keys[i]], df[keys[j]], 'b.', ms=2)
        # remove axes so it is easier to read
        axs[i, j].axes.get_xaxis().set_visible(False)
        axs[i, j].axes.get_yaxis().set_visible(False)
        axs[j, i].axes.get_xaxis().set_visible(False)
        axs[j, i].axes.get_yaxis().set_visible(False);



### Getting parts of a Pandas DataFrame





We have seen how to get a column from a DataFrame like this:





In [None]:
df['Press']



In this context, the DataFrame is acting like a dictionary. You can get a few columns by using a list of column names.





In [None]:
df[['Press', 'PotEng']]



What about a row? With a numpy array we would index with an integer, but that does not work here; it raises a `KeyError`.





In [None]:
df[0]



The problem is that as a dictionary, the keys are for the *columns*.





In [None]:
df.keys()



One way to get the rows by their integer index is to use the *integer location* attribute for a row.





In [None]:
df.iloc[0]



We can use slices on this.





In [None]:
df.iloc[0:5]



This example may be a little confusing because our index labels are themselves integers starting at 0, so the row label and the integer position coincide. With the *location* attribute, `loc`, you select rows by label. You can use any value in the index for this.





In [None]:
df.index



In [None]:
df.loc[0]



We can access the first five rows like this. Note that, unlike `iloc`, a `loc` slice includes its end label.





In [None]:
df.loc[0:4]



And a slice of a column like this.





In [None]:
df.loc[0:4, 'Press']



We can access a value in a row and column with the `at` accessor on a DataFrame.





In [None]:
df.at[2, 'Press']



Or if you know the row and column numbers you can use `iat`.





In [None]:
df.iat[2, 3]
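
Because the index above starts at 0, labels and positions are easy to mix up. The sketch below uses a made-up DataFrame whose row labels are deliberately not 0, 1, 2, ... so that the four accessors can be told apart.

```python
import pandas as pd

# Toy frame (invented values) whose row labels are deliberately not 0, 1, 2, ...
toy = pd.DataFrame({'Press': [1.0, 2.0, 3.0, 4.0],
                    'Temp': [10.0, 20.0, 30.0, 40.0]},
                   index=[10, 20, 30, 40])

by_position = toy.iloc[0:2]        # positions 0 and 1; the end is excluded
by_label = toy.loc[10:30]          # labels 10 through 30; the end is included
one_by_label = toy.at[20, 'Temp']  # row label 20, column 'Temp'
one_by_position = toy.iat[1, 1]    # second row, second column

print(by_position)
print(by_label)
print(one_by_label, one_by_position)
```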



### Operating on columns in the DataFrame





Some functions just work across the columns. For example, DataFrames have statistics functions like this.





In [None]:
df.mean()



We should tread carefully with other functions that work on arrays. For example, this computes the mean of an entire array.





In [None]:
a = np.array([[1, 1, 1],
              [2, 2, 2]])
np.mean(a)



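It may not do the same thing on a DataFrame. In the pandas version used here, `np.mean` takes the mean of each column rather than of all the values. This behavior can depend on the pandas version, so pass the axis explicitly (e.g. `df.mean(axis=0)`) when it matters. Elementwise numpy functions preserve the index and column labels.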





In [None]:
import numpy as np

np.mean(df) # takes mean along axis 0



In [None]:
np.max(df)



In [None]:
np.exp(df)



In [None]:
2 * df
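
A small sketch on a made-up DataFrame shows the label-preserving behavior, along with the explicit-axis form of the mean. The row and column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data with string row labels, invented for illustration.
toy = pd.DataFrame({'a': [0.0, 1.0], 'b': [2.0, 3.0]}, index=['r1', 'r2'])

doubled = 2 * toy              # arithmetic is elementwise
expd = np.exp(toy)             # an elementwise numpy function returns a labeled DataFrame
col_means = toy.mean(axis=0)   # explicit axis: one mean per column

print(expd)
print(col_means)
```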



We can apply a function to the DataFrame. By default the function is applied to each column (axis=0). The result is a new pandas object; whether it is a DataFrame or a Series depends on what the function returns.





In [None]:
def minmax(roworcolumn):
    return np.min(roworcolumn), np.max(roworcolumn)

df.apply(minmax)



Here we analyze across the rows.





In [None]:
df.apply(minmax, axis=1)
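
When the applied function returns a single number, the result is easier to read: one value per column or per row. The sketch below uses a hypothetical helper, `spread`, on made-up data.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({'a': [1, 5, 3], 'b': [10, 20, 60]})

def spread(values):
    """Hypothetical helper: return max - min of a row or column."""
    return values.max() - values.min()

per_column = toy.apply(spread)       # axis=0: one result per column
per_row = toy.apply(spread, axis=1)  # axis=1: one result per row

print(per_column)
print(per_row)
```

In both cases the result is a Series: it is indexed by column name for `axis=0` and by row label for `axis=1`.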



## Summary





Pandas is a multipurpose data science tool. In many ways it is like a numpy array, and in many ways it is different. In some ways it is like a dictionary.

The similarities include the ability to do some indexing and slicing. This is only a partial similarity though.

The differences include integrated plotting.

You should finish reading https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html.

