# Pandas 

Pandas is one of the most powerful data science libraries in Python. It lets you load datasets, run computations on them quickly, organize data and apply functions to it. One of the biggest benefits of pandas is speed. Loops written in native Python can take much longer than the equivalent code in a compiled language such as C++. Pandas bridges much of that gap because its core operations are built on NumPy and implemented in compiled code (C and Cython), so an operation over a whole column runs without a Python-level loop. In this first lesson we will explore some of the most basic elements of pandas.

## Pandas Series

In regular Python, a list holds one-dimensional data. Pandas has the Series data type, which is similar except that it also carries an index (which can differ from the usual 0, 1, 2, ... positions) and comes with many more pandas functions that we can use.

Start with our basic list.

In [1]:
#In regular python we have a list
l = [1,2,3,4,5,6]

In [2]:
import pandas as pd

#Create a pandas series by passing the list
s = pd.Series(l)
print(s)

0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64


A few things to notice. On the left side is the index, which defaults to an integer index starting at 0. On the right are the values. The last line shows the dtype, which tells us what kind of data is held in the series. Pandas noticed that we only had integers, so it assigned the dtype int64.
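
The dtype inference described above can be checked with a few small series. This is only a minimal sketch; the variable names and values are illustrative and are not part of the lesson's data.

```python
import pandas as pd

# All integers: pandas picks an integer dtype
ints = pd.Series([1, 2, 3])

# A single float is enough to make the whole series floating point
floats = pd.Series([1, 2.5, 3])

# Mixed numbers and strings fall back to the generic object dtype
mixed = pd.Series([1, "two", 3.0])

print(ints.dtype, floats.dtype, mixed.dtype)
```

A series holds one dtype for all of its values, so pandas picks the narrowest type that can represent every element.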

### Specifying an Index

We don't have to use the default index. If we pass the index argument when creating our series, those labels are used instead.

In [3]:
#Create a series with a custom index
s = pd.Series(l,index=["a", "b", "c", "d", "e", "f"])
print(s)

a    1
b    2
c    3
d    4
e    5
f    6
dtype: int64


Above we see that the left-hand side is now our index of letters. We can also set an index after the fact by assigning to the index attribute. For example, if we created our series without an index, we can change it afterwards like below:

In [4]:
#Create the series
s = pd.Series(l)

#Set the index
s.index=["a", "b", "c", "d", "e", "f"]

print(s)

a    1
b    2
c    3
d    4
e    5
f    6
dtype: int64


One benefit of pandas is that we can apply math operations to a whole series at once instead of iterating over every element. For example, the code below will square each value.

In [5]:
#Square all the values
s = s ** 2
print(s)

a     1
b     4
c     9
d    16
e    25
f    36
dtype: int64
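
To make the contrast with plain Python concrete, the sketch below squares the same numbers once with a list comprehension and once with a series expression. The extra `shifted` calculation is an illustrative addition, not part of the lesson.

```python
import pandas as pd

values = [1, 2, 3, 4, 5, 6]
s = pd.Series(values, index=["a", "b", "c", "d", "e", "f"])

# Plain Python: visit every element explicitly
squared_list = [x ** 2 for x in values]

# pandas: one expression applied to the whole series at once
squared_series = s ** 2

# Other arithmetic works the same way and keeps the index labels
shifted = s * 10 + 1

print(squared_list)
print(squared_series.tolist())
print(shifted)
```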


If we want just the values in the series, we can use the values attribute. What we get back is a NumPy array, which we will cover more in future lessons.

In [6]:
#Get the values of the series
print(s.values)

[ 1  4  9 16 25 36]


## Indexing

Indexing can be done in two different ways. With iloc you index by position, similar to how you would index a list, so a slice stops before its end position. With loc you instead index by the index labels, and a slice includes everything up to and including the last label. First, let's use iloc to get the first three values.

In [7]:
#Get the first three values
print(s.iloc[:3])

a    1
b    4
c    9
dtype: int64


Using loc, we get the same rows by slicing from the label a to the label c.

In [8]:
print(s.loc['a':'c'])

a    1
b    4
c    9
dtype: int64
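
The difference in how the two indexers treat the end of a slice is easy to miss, so here is a small sketch of it. The series is rebuilt so the example runs on its own. Position 2 holds the label c, so `iloc[0:2]` stops before it while `loc["a":"c"]` includes it.

```python
import pandas as pd

s = pd.Series([1, 4, 9, 16, 25, 36], index=["a", "b", "c", "d", "e", "f"])

# iloc slices by position and excludes the stop position, like a list
by_position = s.iloc[0:2]

# loc slices by label and includes the stop label
by_label = s.loc["a":"c"]

# A single label or position returns the value itself rather than a series
single_label = s.loc["d"]
single_position = s.iloc[3]

print(by_position)
print(by_label)
print(single_label, single_position)
```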


## Pandas DataFrames

A pandas dataframe is a structure which holds two-dimensional data along with an index and columns. You can think of it as similar to a spreadsheet. To create one, you can pass in a list of tuples, a list of lists or (we won't cover it here) a dictionary. For our first dataframe, let's use the following data. Each entry of the list below holds four attributes: a name, a height, a weight and a string for offense or defense. Right now these are all strings.

In [9]:
#Now, let's create some data, each element corresponds to a row
#Each tuple corresponds to the data for the row
data = []
data.append(("Ray Lewis","6'1","250","Defense"))
data.append(("Tom Brady","6'4","225","Offense"))
data.append(("Julio Jones","6'3","220","Offense"))
data.append(("Richard Sherman","6'3","194","Defense"))
print(data)

[('Ray Lewis', "6'1", '250', 'Defense'), ('Tom Brady', "6'4", '225', 'Offense'), ('Julio Jones', "6'3", '220', 'Offense'), ('Richard Sherman', "6'3", '194', 'Defense')]


In [10]:
#We can initialize a dataframe with a list of data
df = pd.DataFrame(data)
print(df)

                 0    1    2        3
0        Ray Lewis  6'1  250  Defense
1        Tom Brady  6'4  225  Offense
2      Julio Jones  6'3  220  Offense
3  Richard Sherman  6'3  194  Defense


You can pass in a list of labels for either the columns or the index if you want. For our example from before we can add column names.

In [11]:
#The columns argument lets us set the names for the columns in our dataframe
df = pd.DataFrame(data,columns=["Name","Height","Weight","Type"])
print(df)

              Name Height Weight     Type
0        Ray Lewis    6'1    250  Defense
1        Tom Brady    6'4    225  Offense
2      Julio Jones    6'3    220  Offense
3  Richard Sherman    6'3    194  Defense
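
The same idea works for the row labels through the index argument. The sketch below uses two of the players and made-up labels ("p1", "p2") purely for illustration.

```python
import pandas as pd

data = [("Ray Lewis", "6'1", "250", "Defense"),
        ("Tom Brady", "6'4", "225", "Offense")]

# The row labels here are hypothetical, chosen only for this example
df_labeled = pd.DataFrame(
    data,
    columns=["Name", "Height", "Weight", "Type"],
    index=["p1", "p2"],
)

print(df_labeled)

# Rows can now be looked up by their label
print(df_labeled.loc["p2", "Name"])
```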


Finding column values can be done in the same way that you might grab values from a dictionary: put the name of the column you want in brackets. For example, the following will get you the names.

In [12]:
#To get a specific column, we index with the column name.
print(df["Name"])

0          Ray Lewis
1          Tom Brady
2        Julio Jones
3    Richard Sherman
Name: Name, dtype: object


In [13]:
#Likewise we can get height
print(df["Height"])

0    6'1
1    6'4
2    6'3
3    6'3
Name: Height, dtype: object


If we put a list of column names inside the brackets, we can select multiple columns. This gives us back a dataframe containing just the columns we asked for.

In [14]:
#Grab the name and weight columns
print(df[["Name","Weight"]])

              Name Weight
0        Ray Lewis    250
1        Tom Brady    225
2      Julio Jones    220
3  Richard Sherman    194
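
One consequence worth seeing directly is that the type of the result depends on whether a single name or a list of names goes inside the brackets. A minimal sketch with a cut-down version of the data:

```python
import pandas as pd

df = pd.DataFrame(
    [("Ray Lewis", "250"), ("Tom Brady", "225")],
    columns=["Name", "Weight"],
)

# A single column name returns a series
one_column = df["Name"]

# A list of names returns a dataframe, even when the list has one entry
as_frame = df[["Name"]]

print(type(one_column).__name__)
print(type(as_frame).__name__)
print(as_frame.shape)
```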


By indexing into a column and assigning values, we can either overwrite an existing column or create a new one. Below we create the Retired column.

In [15]:
#We can also assign a new column this way if we give a list with the same length as the number of rows in the dataframe
df["Retired"] = [True,False,False,False]
print(df)

              Name Height Weight     Type  Retired
0        Ray Lewis    6'1    250  Defense     True
1        Tom Brady    6'4    225  Offense    False
2      Julio Jones    6'3    220  Offense    False
3  Richard Sherman    6'3    194  Defense    False


## Apply

By using apply you can apply a function to each element of a column, or to every row or every column of a dataframe. To begin with, let's define a function which will deal with the height column. The issue with height is that it is currently a string, and we might want an actual numeric representation. Let's do it piece by piece, starting with the height of the first player.

In [16]:
#Get the first player's height
height = df.loc[0, "Height"]

#Print the value and the type
print(height)
print(type(height))

6'1
<class 'str'>


The first thing to do would be to grab the two pieces of information that this string provides, the feet and inches. If we use split and split on "'", we can grab the two pieces as a list.

In [17]:
#Split in two
height = height.split("'")
print(height)

['6', '1']


We still have strings in both of these cases, so we are going to want to convert them both to integers.

In [18]:
#Convert to integers
height = [int(x) for x in height]
print(height)

[6, 1]


Now the value on the left is the number of feet, the number on the right is the number of inches. Let's convert the left number to be the number of inches.

In [19]:
#Convert the feet to inches
height[0] = height[0] * 12
print(height)

[72, 1]


Finally, sum these two numbers and we have the height in terms of total inches.

In [20]:
#Get the total height in inches
height = sum(height)
print(height)

73


Now that we know all the steps, we can convert this into a function.

In [21]:
def height_convert(height):
    #Split in two
    height = height.split("'")
    
    #Convert to integers
    height = [int(x) for x in height]
    
    #Convert the feet to inches
    height[0] = height[0] * 12
    
    #Get the total height in inches
    height = sum(height)
    return height
print(height_convert("6'1"))

73


When using apply on a column, you index into the column, call apply, then pass the function which you want to apply. This assumes that the function needs only one argument. The result is a new series whose values are the function's return values for every value in the column.

In [22]:
#Apply our new function
print(df["Height"].apply(height_convert))

0    73
1    76
2    75
3    75
Name: Height, dtype: int64
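
When the function does need more than one argument, the extra values can be supplied through the args parameter of apply, or by wrapping the call in a lambda. The function `height_in_units` and the centimeter factor below are hypothetical additions for illustration and are not part of the lesson.

```python
import pandas as pd

def height_in_units(height, factor):
    # Same steps as height_convert, then scale by a caller-supplied factor
    feet, inches = [int(x) for x in height.split("'")]
    return (feet * 12 + inches) * factor

heights = pd.Series(["6'1", "6'4", "6'3"])

# args passes extra positional arguments after each value (2.54 cm per inch)
in_cm = heights.apply(height_in_units, args=(2.54,))

# A lambda is another way to supply the extra argument
in_cm_lambda = heights.apply(lambda h: height_in_units(h, 2.54))

print(in_cm)
```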


Let's overwrite our prior height column with this.

In [23]:
#Modify the values
df["Height"] = df["Height"].apply(height_convert)
print(df)

              Name  Height Weight     Type  Retired
0        Ray Lewis      73    250  Defense     True
1        Tom Brady      76    225  Offense    False
2      Julio Jones      75    220  Offense    False
3  Richard Sherman      75    194  Defense    False


You might want to see weight divided by height, but trying that will cause an error because weight is still stored as strings.

In [24]:
#If we try to divide here we will run into an issue, weight is still a string representation
print(df["Weight"]/df["Height"])

TypeError: unsupported operand type(s) for /: 'str' and 'int'

To see what types we have for each column, we can call dtypes.

In [25]:
#The dtypes attribute lets us see the type of each column
print(df.dtypes)

Name       object
Height      int64
Weight     object
Type       object
Retired      bool
dtype: object


For conversion to numeric values, there are two options. One is to call astype() and pass in a type. If we pass in int, the column is converted to integers.

In [26]:
#Convert to integer
df["Weight"] = df["Weight"].astype(int)
print(df.dtypes)

Name       object
Height      int64
Weight      int64
Type       object
Retired      bool
dtype: object


The other option is to call pd.to_numeric and pass in the series. If these were floating point numbers, it would automatically convert to that instead, but in this case they are all integers.

In [27]:
#Also converts to numeric values
df["Weight"] = pd.to_numeric(df["Weight"])
print(df.dtypes)

Name       object
Height      int64
Weight      int64
Type       object
Retired      bool
dtype: object
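
Two behaviors of pd.to_numeric are worth a quick sketch: the automatic switch to floats mentioned above, and what can be done when a value is not a number at all. The messy input below is hypothetical. The errors="coerce" option turns unparseable entries into NaN instead of raising an error.

```python
import pandas as pd

# A string with a decimal point makes the result floating point
mixed = pd.to_numeric(pd.Series(["250", "225.5"]))

# Hypothetical messy input: "unknown" cannot be parsed as a number
messy = pd.Series(["250", "unknown", "194"])

# errors="coerce" replaces unparseable entries with NaN instead of raising
cleaned = pd.to_numeric(messy, errors="coerce")

print(mixed.dtype)
print(cleaned)
```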


Now that we have numeric types we can do mathematical operations with the columns. The code below divides weight by height.

In [28]:
#Find the ratio of weight and height
print(df["Weight"]/df["Height"])

0    3.424658
1    2.960526
2    2.933333
3    2.586667
dtype: float64
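
The same ratio can also be computed with apply across rows, which ties back to the row-wise use of apply mentioned at the start of this section. This is only a sketch: the helper name `weight_per_inch` is made up, and for simple arithmetic the column expression above is the shorter and usually faster choice.

```python
import pandas as pd

df = pd.DataFrame(
    [("Ray Lewis", 73, 250), ("Tom Brady", 76, 225)],
    columns=["Name", "Height", "Weight"],
)

def weight_per_inch(row):
    # With axis=1, each call receives one row as a series
    return row["Weight"] / row["Height"]

row_wise = df.apply(weight_per_inch, axis=1)

# Column arithmetic gives the same numbers without a per-row function call
vectorized = df["Weight"] / df["Height"]

print(row_wise)
```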


## Boolean Indexing

Sometimes you want to grab only certain rows based on some sort of comparison. This is the purpose of boolean indexing. What it allows you to do is check a comparison then return only those rows that are true for the comparison. To begin with let's define a list of boolean values.

In [29]:
#Define a boolean index
bool_index = [True, False, True, False]

When you pass a boolean index to the dataframe the same way that you pass in columns, it keeps only the rows where the value is True.

In [30]:
print(df[bool_index])

          Name  Height  Weight     Type  Retired
0    Ray Lewis      73     250  Defense     True
2  Julio Jones      75     220  Offense    False


Notice above that we only got back the first and third rows! Of course, you would get the same result by writing the list inline like below:

In [31]:
print(df[[True,False,True,False]])

          Name  Height  Weight     Type  Retired
0    Ray Lewis      73     250  Defense     True
2  Julio Jones      75     220  Offense    False


What if we wanted to find out which players are over 220 pounds and return only those players? The first step is to get our boolean index.

In [32]:
#We can also check the truth of a statement such as which rows have weight values over 220
print(df["Weight"] > 220)

0     True
1     True
2    False
3    False
Name: Weight, dtype: bool


Now with that it is easy to filter to the correct rows.

In [33]:
#And this allows us to filter based on a condition. In this case we can print only rows with weights over 220
print(df[df["Weight"]>220])

        Name  Height  Weight     Type  Retired
0  Ray Lewis      73     250  Defense     True
1  Tom Brady      76     225  Offense    False
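
Boolean indexes can also be combined. The sketch below, which goes slightly beyond the lesson, uses & for "and", | for "or" and ~ for "not". Each comparison needs its own parentheses because these operators bind more tightly than > or ==.

```python
import pandas as pd

df = pd.DataFrame(
    [("Ray Lewis", 250, "Defense"),
     ("Tom Brady", 225, "Offense"),
     ("Julio Jones", 220, "Offense"),
     ("Richard Sherman", 194, "Defense")],
    columns=["Name", "Weight", "Type"],
)

# Both conditions must hold
heavy_offense = df[(df["Weight"] > 220) & (df["Type"] == "Offense")]

# Either condition may hold
light_or_defense = df[(df["Weight"] < 200) | (df["Type"] == "Defense")]

# ~ flips True and False in a boolean series
not_offense = df[~(df["Type"] == "Offense")]

print(heavy_offense)
```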


## Combining Dataframes

Often we will have data from multiple sources and will need to combine them to create a single dataframe. As an example, we can first make a second set of data. You will notice that a few values are set to None, which pandas displays as NaN. This happens very often when working with data. Sometimes a field can't be measured 100% of the time or does not apply, so it has no value.

In [34]:
#Let's create a second set of data

data = []
data.append(("Allen Robinson",75,250,"Offense",False))
data.append(("Alvin Kamara",None,215,"Offense",False))
data.append(("Christian McCaffrey",71,None,"Offense",False))

#And turn it into a second dataframe
#You'll notice some data is missing, this happens commonly working with real data
df2 = pd.DataFrame(data,columns=["Name","Height","Weight","Type","Retired"])
print(df2)

                  Name  Height  Weight     Type  Retired
0       Allen Robinson    75.0   250.0  Offense    False
1         Alvin Kamara     NaN   215.0  Offense    False
2  Christian McCaffrey    71.0     NaN  Offense    False
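
Notice that Height and Weight are printed as floats above. NaN is a floating point value, so a numeric column containing it is stored as float64. A short sketch for checking this and for counting the missing cells with isna(), using a cut-down version of the same data:

```python
import pandas as pd

data = [("Allen Robinson", 75, 250),
        ("Alvin Kamara", None, 215),
        ("Christian McCaffrey", 71, None)]
df2 = pd.DataFrame(data, columns=["Name", "Height", "Weight"])

# None in a numeric column is stored as NaN, which forces a float dtype
print(df2.dtypes)

# isna() marks the missing cells; summing counts them per column
missing_per_column = df2.isna().sum()
print(missing_per_column)
```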


The function pd.concat takes a list of dataframes to put together. By default they will be appended vertically, but you can change this behavior by using the axis keyword. The code below will take the two dataframes and put them together to make a new combined dataframe.

In [35]:
#The pd.concat() function lets us put together dataframes
df_final = pd.concat([df,df2])
print(df_final)

                  Name  Height  Weight     Type  Retired
0            Ray Lewis    73.0   250.0  Defense     True
1            Tom Brady    76.0   225.0  Offense    False
2          Julio Jones    75.0   220.0  Offense    False
3      Richard Sherman    75.0   194.0  Defense    False
0       Allen Robinson    75.0   250.0  Offense    False
1         Alvin Kamara     NaN   215.0  Offense    False
2  Christian McCaffrey    71.0     NaN  Offense    False
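
For the axis keyword mentioned above, here is a minimal sketch of combining side by side instead of vertically. The two small dataframes are illustrative only. With axis=1, rows are matched up by their index labels.

```python
import pandas as pd

names = pd.DataFrame([("Ray Lewis",), ("Tom Brady",)], columns=["Name"])
weights = pd.DataFrame([(250,), (225,)], columns=["Weight"])

# axis=1 places the frames next to each other, aligning rows by index label
side_by_side = pd.concat([names, weights], axis=1)
print(side_by_side)
```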


### Reset the Index

If you look at the index above, you will see that the labels overlap. Sometimes this is the behavior you want, but often you want a new, unique integer index. To get one we can call reset_index. This returns a new dataframe which keeps the old index as a column named index and adds a new index from 0 to N-1.

In [36]:
#The two dataframes have indexes that overlap! We could fix this by resetting the index, as so....
print(df_final.reset_index())

   index                 Name  Height  Weight     Type  Retired
0      0            Ray Lewis    73.0   250.0  Defense     True
1      1            Tom Brady    76.0   225.0  Offense    False
2      2          Julio Jones    75.0   220.0  Offense    False
3      3      Richard Sherman    75.0   194.0  Defense    False
4      0       Allen Robinson    75.0   250.0  Offense    False
5      1         Alvin Kamara     NaN   215.0  Offense    False
6      2  Christian McCaffrey    71.0     NaN  Offense    False
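
If the old index is not worth keeping, reset_index accepts drop=True, which discards it rather than adding it as a column. A small sketch with illustrative data:

```python
import pandas as pd

a = pd.DataFrame([("Ray Lewis",), ("Tom Brady",)], columns=["Name"])
b = pd.DataFrame([("Allen Robinson",)], columns=["Name"])

# After concatenation the index labels are 0, 1, 0
combined = pd.concat([a, b])

# drop=True discards the old index instead of keeping it as a column
renumbered = combined.reset_index(drop=True)
print(renumbered)
```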


If you pass the argument ignore_index=True to pd.concat, the old indexes are simply dropped and the combined dataframe gets a new index from 0 to N-1.

In [37]:
#Or we can give the argument ignore_index=True to reset it during the concat function
df_final = pd.concat([df,df2],ignore_index=True)
print(df_final)

                  Name  Height  Weight     Type  Retired
0            Ray Lewis    73.0   250.0  Defense     True
1            Tom Brady    76.0   225.0  Offense    False
2          Julio Jones    75.0   220.0  Offense    False
3      Richard Sherman    75.0   194.0  Defense    False
4       Allen Robinson    75.0   250.0  Offense    False
5         Alvin Kamara     NaN   215.0  Offense    False
6  Christian McCaffrey    71.0     NaN  Offense    False


## Working with Null Values

When you come across null values there are a few options for what you can do. One option is to simply drop any rows which have null values. The function dropna() achieves this by checking every row and keeping only the rows with no null values. It does not work in place but rather returns a new version of the dataframe.

In [38]:
#The dropna() function gets rid of any rows with missing values
print(df_final.dropna())

              Name  Height  Weight     Type  Retired
0        Ray Lewis    73.0   250.0  Defense     True
1        Tom Brady    76.0   225.0  Offense    False
2      Julio Jones    75.0   220.0  Offense    False
3  Richard Sherman    75.0   194.0  Defense    False
4   Allen Robinson    75.0   250.0  Offense    False


If there are only some columns that you want to consider when dropping null values, you can give the keyword subset, which drops only the rows with null values in those columns.

In [39]:
#If given an argument subset, we can drop only rows with missing values from the subset 
print(df_final.dropna(subset=["Height","Type"]))

                  Name  Height  Weight     Type  Retired
0            Ray Lewis    73.0   250.0  Defense     True
1            Tom Brady    76.0   225.0  Offense    False
2          Julio Jones    75.0   220.0  Offense    False
3      Richard Sherman    75.0   194.0  Defense    False
4       Allen Robinson    75.0   250.0  Offense    False
6  Christian McCaffrey    71.0     NaN  Offense    False


If you do not want a new dataframe returned but instead want to modify the one you are calling these functions on, you can use the argument inplace=True. The modification then happens to the object the method was called on, and the call returns None.

In [40]:
#If we use inplace=True then the dropna function happens in place
df_final.dropna(inplace=True,subset=["Height","Type"])
print(df_final)

                  Name  Height  Weight     Type  Retired
0            Ray Lewis    73.0   250.0  Defense     True
1            Tom Brady    76.0   225.0  Offense    False
2          Julio Jones    75.0   220.0  Offense    False
3      Richard Sherman    75.0   194.0  Defense    False
4       Allen Robinson    75.0   250.0  Offense    False
6  Christian McCaffrey    71.0     NaN  Offense    False
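
Dropping rows is only one of the options mentioned at the start of this section. Another is to fill the gaps with fillna(). The sketch below fills missing heights with the mean of the known heights. The mean is an illustrative choice rather than a recommendation, and the right fill value depends on the data.

```python
import pandas as pd

df = pd.DataFrame(
    [("Allen Robinson", 75.0, 250.0),
     ("Alvin Kamara", None, 215.0),
     ("Christian McCaffrey", 71.0, None)],
    columns=["Name", "Height", "Weight"],
)

# Work on a copy so the original dataframe keeps its missing values
filled = df.copy()

# Replace missing heights with the mean of the heights that are present
filled["Height"] = filled["Height"].fillna(filled["Height"].mean())
print(filled)
```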


We may also prefer to use a more descriptive index like the name of the player to refer to the different rows. Calling set_index will return a new dataframe with the index set to whatever column you passed it.

In [41]:
#Setting the index to name changes it from a column to index
df_final = df_final.set_index("Name")
print(df_final)

                     Height  Weight     Type  Retired
Name                                                 
Ray Lewis              73.0   250.0  Defense     True
Tom Brady              76.0   225.0  Offense    False
Julio Jones            75.0   220.0  Offense    False
Richard Sherman        75.0   194.0  Defense    False
Allen Robinson         75.0   250.0  Offense    False
Christian McCaffrey    71.0     NaN  Offense    False


If new data has come to light, you have the option of overwriting data using loc. For example, we can overwrite the weight value in the row for Christian McCaffrey by executing the code below.

In [42]:
#Using loc we can overwrite data
df_final.loc["Christian McCaffrey","Weight"] = 205
print(df_final)

                     Height  Weight     Type  Retired
Name                                                 
Ray Lewis              73.0   250.0  Defense     True
Tom Brady              76.0   225.0  Offense    False
Julio Jones            75.0   220.0  Offense    False
Richard Sherman        75.0   194.0  Defense    False
Allen Robinson         75.0   250.0  Offense    False
Christian McCaffrey    71.0   205.0  Offense    False
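
The same row label and column label pair that writes a value also reads it back. This small self-contained sketch shows both directions. The height written for Alvin Kamara is a hypothetical value used only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [("Alvin Kamara", None, 215.0), ("Christian McCaffrey", 71.0, None)],
    columns=["Name", "Height", "Weight"],
).set_index("Name")

# Write single cells by row label and column label
df.loc["Alvin Kamara", "Height"] = 70  # hypothetical value
df.loc["Christian McCaffrey", "Weight"] = 205

# Read a single cell, then a whole row, with the same indexer
print(df.loc["Christian McCaffrey", "Weight"])
print(df.loc["Alvin Kamara"])
```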
