Pandas is a widely used Python package for data analysis, which makes it a valuable tool for work in physics and astrophysics. Let us look at some of its uses.

Let us begin by creating a simple dict and turning it into a dataframe.

In [25]:
import pandas as pd
from numpy import nan

In [2]:
data = {'cosmo1': [3, 6, 9, 10], 'cosmo2': [1,1,3,4]}

In [3]:
CosmoValues = pd.DataFrame(data)

In [4]:
CosmoValues

,cosmo1,cosmo2
0,3,1
1,6,1
2,9,3
3,10,4


Here is the 'dataframe', or table, that we have just created. We are not limited to ints; we could have used strings instead. You are probably thinking that the left-most column should contain proper labels rather than index numbers. Let's add an index!

In [5]:
CosmoValues = pd.DataFrame(data, index=['T1', 'T2', 'T3', 'T4'])

In [6]:
CosmoValues

,cosmo1,cosmo2
T1,3,1
T2,6,1
T3,9,3
T4,10,4


With labels in place, a single row can be selected by its label using loc:

In [7]:
CosmoValues.loc['T3']

cosmo1    9
cosmo2    3
Name: T3, dtype: int64
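
The label-based lookup above has a positional counterpart, iloc. The sketch below rebuilds the same small table (under the hypothetical name cosmo_values, chosen only for this example) and compares the two.

```python
import pandas as pd

# Rebuild the small labelled table so the snippet is self-contained.
cosmo_values = pd.DataFrame(
    {"cosmo1": [3, 6, 9, 10], "cosmo2": [1, 1, 3, 4]},
    index=["T1", "T2", "T3", "T4"],
)

row_by_label = cosmo_values.loc["T3"]             # select a row by its index label
row_by_position = cosmo_values.iloc[2]            # select the third row by position
single_value = cosmo_values.loc["T3", "cosmo1"]   # one cell: row label, column name

print(row_by_label.equals(row_by_position))
print(single_value)
```

Here loc and iloc return the same row, because 'T3' is the third label in the index.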

Of course, typing values into Python by hand to create dataframes quickly becomes cumbersome. You can instead read the data from a csv file (or, with pd.read_excel, from an xlsx file).

In [None]:
df = pd.read_csv('filename.csv', index_col=0)

A dataframe can also be written out to a csv file as follows:

In [None]:
df.to_csv('new_purchases.csv')
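
As a minimal sketch of the write and read steps together, the example below round-trips a small table through an in-memory text buffer instead of a file on disk. The buffer and the variable names are assumptions made for illustration; with a real file you would pass a path instead.

```python
import io

import pandas as pd

table = pd.DataFrame(
    {"cosmo1": [3, 6, 9, 10], "cosmo2": [1, 1, 3, 4]},
    index=["T1", "T2", "T3", "T4"],
)

buffer = io.StringIO()   # stands in for a file on disk
table.to_csv(buffer)     # the index is written as the first column
buffer.seek(0)           # rewind so the text can be read back

# index_col=0 tells pandas to use that first column as the index again.
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

Without index_col=0, the label column would come back as an ordinary column named 'Unnamed: 0'.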

Here are some generally useful commands for inspecting a dataframe:

In [8]:
CosmoValues.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, T1 to T4
Data columns (total 2 columns):
cosmo1    4 non-null int64
cosmo2    4 non-null int64
dtypes: int64(2)
memory usage: 256.0+ bytes


In [9]:
CosmoValues.shape # output = (rows, columns)

(4, 2)

In [10]:
CosmoValues.columns

Index(['cosmo1', 'cosmo2'], dtype='object')

Now, let's build a slightly bigger and more complicated dataframe that we can work through:

In [36]:
cosmo3 = [1, 1, 3, 4]
cosmo4 = [0, 4, None, 3]

In [37]:
CosmoValues['cosmo3'] = cosmo3
CosmoValues['cosmo4'] = cosmo4

In [38]:
#CosmoValues = CosmoValues.replace(CosmoValues.iloc[2][2:],'null')

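Duplicate rows can be removed with drop_duplicates. Our table contains none, so the result below is unchanged:

In [39]: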
CosmoValues.drop_duplicates()

,cosmo1,cosmo2,cosmo3,cosmo4
T1,3,1,1,0.0
T2,6,1,1,4.0
T3,9,3,3,NaN
T4,10,4,4,3.0


In [40]:
# keep=False drops every copy of a duplicated row; inplace=True modifies CosmoValues directly
CosmoValues.drop_duplicates(inplace=True, keep=False)
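
Because the table above has no repeated rows, the effect of the keep argument is easier to see on data that does. The sketch below uses a small hypothetical table, runs, in which rows 'a' and 'b' are identical.

```python
import pandas as pd

# Hypothetical measurements in which rows "a" and "b" are identical.
runs = pd.DataFrame(
    {"cosmo1": [3, 3, 9], "cosmo2": [1, 1, 3]},
    index=["a", "b", "c"],
)

first_kept = runs.drop_duplicates()             # default keep="first": keeps "a", drops "b"
last_kept = runs.drop_duplicates(keep="last")   # keeps "b", drops "a"
none_kept = runs.drop_duplicates(keep=False)    # drops both "a" and "b"

print(list(first_kept.index))   # ['a', 'c']
print(list(last_kept.index))    # ['b', 'c']
print(list(none_kept.index))    # ['c']
```

None of these calls changes runs itself; each returns a new dataframe unless inplace=True is passed.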

Some datasets contain null entries. These can be a hassle to work with and often break your code. The first step is to identify them:

In [41]:
CosmoValues.isnull()

,cosmo1,cosmo2,cosmo3,cosmo4
T1,False,False,False,False
T2,False,False,False,False
T3,False,False,False,True
T4,False,False,False,False


In [42]:
CosmoValues.isnull().sum()

cosmo1    0
cosmo2    0
cosmo3    0
cosmo4    1
dtype: int64

In [43]:
CosmoValues.dropna()

,cosmo1,cosmo2,cosmo3,cosmo4
T1,3,1,1,0.0
T2,6,1,1,4.0
T4,10,4,4,3.0


This will drop any row with a null data value. Use with caution; this may be highly inappropriate for the data you are using! If you want to drop the columns with null values instead, use the following:

In [44]:
CosmoValues.dropna(axis=1)

,cosmo1,cosmo2,cosmo3
T1,3,1,1
T2,6,1,1
T3,9,3,3
T4,10,4,4
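
One detail worth checking for yourself is that dropna returns a new dataframe and leaves the original alone. The sketch below illustrates this on a two-column table built only for the example.

```python
import pandas as pd

table = pd.DataFrame(
    {"cosmo3": [1, 1, 3, 4], "cosmo4": [0, 4, None, 3]},
    index=["T1", "T2", "T3", "T4"],
)

rows_dropped = table.dropna()         # removes row T3, which holds the null
cols_dropped = table.dropna(axis=1)   # removes column cosmo4 instead

print(rows_dropped.shape)   # (3, 2)
print(cols_dropped.shape)   # (4, 1)
print(table.shape)          # (4, 2): the original is untouched unless you reassign
```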


Another approach is to replace the null values, which is known as imputation. A common choice is mean imputation, in which the null values in a column are replaced with the mean of that column.

In [45]:
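distance = CosmoValues['cosmo2']  # cosmo2 has no nulls here; cosmo4 is the column that does

distance_mean = distance.mean()

# Assign back; fillna(inplace=True) on a selected column may not update the dataframe.
CosmoValues['cosmo2'] = distance.fillna(distance_mean)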
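
For comparison, here is a minimal sketch of mean imputation applied to a column that actually contains a null. It rebuilds two columns of the table under hypothetical names and works on a copy, so the original data stays available.

```python
import pandas as pd

table = pd.DataFrame(
    {"cosmo2": [1, 1, 3, 4], "cosmo4": [0, 4, None, 3]},
    index=["T1", "T2", "T3", "T4"],
)

# mean() skips null entries by default, so this is (0 + 4 + 3) / 3.
cosmo4_mean = table["cosmo4"].mean()

imputed = table.copy()
imputed["cosmo4"] = imputed["cosmo4"].fillna(cosmo4_mean)

print(imputed["cosmo4"].isnull().sum())   # 0
print(imputed.loc["T3", "cosmo4"])
```

Whether the mean is a sensible stand-in depends on the data; for a skewed column the median may be a better assumption.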

We can also derive a new column by applying a function to each value of an existing column:

In [46]:
def rating_function(x):
    if x >= 5.0:
        return "Use"
    else:
        return "Don't Use"

In [47]:
CosmoValues["usefulness"] = CosmoValues["cosmo1"].apply(rating_function)

In [48]:
CosmoValues

,cosmo1,cosmo2,cosmo3,cosmo4,usefulness
T1,3,1,1,0.0,Don't Use
T2,6,1,1,4.0,Use
T3,9,3,3,NaN,Use
T4,10,4,4,3.0,Use
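
For a simple threshold like the one in rating_function, the same labels can also be produced without a Python-level function call per row. The sketch below uses numpy.where as one possible alternative and checks that it agrees with apply; the table is rebuilt under a hypothetical name so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame({"cosmo1": [3, 6, 9, 10]}, index=["T1", "T2", "T3", "T4"])


def rating_function(x):
    return "Use" if x >= 5.0 else "Don't Use"


# Element-by-element: calls rating_function once per value.
applied = table["cosmo1"].apply(rating_function)

# Vectorised: evaluates the condition on the whole column at once.
vectorised = pd.Series(
    np.where(table["cosmo1"] >= 5.0, "Use", "Don't Use"),
    index=table.index,
)

print((applied == vectorised).all())   # True
```

On a four-row table the difference is irrelevant, but on large columns the vectorised form is usually faster.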
