# Pandas: Not the plural form of panda (even though it is)

Pandas is one of the "Big Three" Python packages for data analysis and scripting: NumPy, Pandas, and Matplotlib. These open-source libraries are imported at the top of nearly every data analysis script, as follows:

In [41]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Backstory, Backstory, Backstory
The reason why this is always the case, and the reason why python packages in general are so popular, is a wee bit of a long story:

As you are aware (or at least have been told), your computer only understands raw binary, using certain pre-defined binary instructions. In order to transform source code (which is in all actuality just a specific type of text document) into the appropriate binary machine code, you have to either use an interpreter to run your program (like Python does), or use a compiler to create a binary program file from your code (which is what C++ and friends do).

The standard Python interpreter (CPython) is written primarily in C. Because of this, people can write C and C++ code, compile it, and import it into Python files. This is helpful because for many applications, C and C++ can be as much as 40 times faster than pure Python.

Because of this, things that people do a lot and that take a lot of time are often outsourced to imported C and C++ code. Three such things are: complicated mathematical operations on data (NumPy), handling and manipulating tabular data and files (Pandas), and creating images and graphs (Matplotlib).

## Pandas (again, but for real this time)

Most times, when working with data, the easiest way to store the data is in a table. For example, if we were working with a database for animals in a zoo, we might have a table like the following:

|name|id|species|feeding schedule|weight (lbs)|height (in)|age (years)|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|Echo|001|Dog|2|40|24|2.5|
|Evie|002|?|1|14|10|8|
|Simon|003|Horse|3|930|60|5|
|Lucifer|004|Cat|1|12|12|13|

But representing this natively in Python code would be confusing. We could keep track of it as a list of lists:

In [42]:
animals = [
    ["Echo",1,"Dog",2,40,24,2.5],
    ["Evie",2,"?",1,14,10,8],
    ["Simon",3,"Horse",3,930,60,5],
    ["Lucifer",4,"Cat",1,12,12,13]
]

But then accessing something would be weird since we only have number indexes. If we wanted to get Evie's feeding schedule, we would have to write the following garbage:

In [43]:
row_number = 0

# loop through until we find the right row
while animals[row_number][0] != "Evie":
    row_number += 1

# and here we would have to know that column 3 corresponds to the feeding schedule
print(animals[row_number][3])
 

1


So then we could use a list of dictionaries. But then, we would have a lot of wasted memory because dictionaries are actually a little inefficient. It is also still pretty confusing and repetitive to initialize.

In [44]:
animals = [
    {"Name": "Echo","ID": 1,"Species":"Dog","Feeding Schedule": 2,"Weight": 40,"Height": 24,"Age": 2.5},
    # etc
]

The best option would be to have a list of class instances, but this only works if we know exactly what columns and datatypes to expect in advance. This doesn't work if we want to just crack open a random table or spreadsheet and go through the data.

In [45]:
class Animal:
    def __init__(self,name,id,species,sched,weight,height,age) -> None:
        self.name = name
        self.id = id
        # etc.

db = [
    Animal("Echo",1,"Dog",2,40,24,2.5),
    # etc.
]

## This is why we _*do have*_ nice things

Pandas handles all of these complicated nasty things behind the scenes. It makes it so we can (\**coughs*\*) easily create tables, access data within them, and manipulate them. This does mean, however, that we have to learn the ways that Pandas wants us to ask for things, and sometimes it is a little bit particular about things. But now that we have set the scene way too much, let's talk about using Pandas.

## Creating Pandas Dataframes

There are two *relatively* straightforward ways to create a pandas dataframe in code.

In [46]:
# Number One: lists and setting column names

# We first define a list of lists, where each list is one row in our table
animals = [
    ["Echo",1,"Dog",2,40,24,2.5],
    ["Evie",2,"?",1,14,10,8],
    ["Simon",3,"Horse",3,930,60,5],
    ["Lucifer",4,"Cat",1,12,12,13]
]

# We then initialize our dataframe by calling pandas's DataFrame initializer (DataFrame is a class)
# and passing in our list of lists. We then give the initializer a list of column names
animal_df = pd.DataFrame(animals, columns=["Name","ID", "Species","Feeding Schedule","Weight","Height","Age"])

In [47]:
print(animal_df)

      Name  ID Species  Feeding Schedule  Weight  Height   Age
0     Echo   1     Dog                 2      40      24   2.5
1     Evie   2       ?                 1      14      10   8.0
2    Simon   3   Horse                 3     930      60   5.0
3  Lucifer   4     Cat                 1      12      12  13.0


In [48]:
# Number Two: Dictionaries

# Or you can give the initializer a dictionary, where the keys are
# the column names, and the values are the row entries for each column
# this unfortunately means we have to be careful about keeping the 
# list entries in order
animals = {
    "Name": ["Echo","Evie","Simon","Lucifer"],
    "ID": [1,2,3,4], 
    "Species": ["Dog","?","Horse","Cat"],
    "Age" : [2.5,8,5,13]
}

df = pd.DataFrame(animals)

In [49]:
print(df)

      Name  ID Species   Age
0     Echo   1     Dog   2.5
1     Evie   2       ?   8.0
2    Simon   3   Horse   5.0
3  Lucifer   4     Cat  13.0
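
As a quick sanity check, here is a minimal sketch showing that the two routes above really do describe the same table. The variable names (`from_rows`, `from_dict`, `same_values`, `lengths_checked`) are made up for this illustration. It also shows what "being careful" means for the dictionary route: every column's list needs the same number of entries, or pandas raises a `ValueError`.

```python
import pandas as pd

# Route one: a list of rows plus an explicit list of column names
rows = [
    ["Echo", 1, "Dog", 2.5],
    ["Evie", 2, "?", 8],
    ["Simon", 3, "Horse", 5],
    ["Lucifer", 4, "Cat", 13],
]
from_rows = pd.DataFrame(rows, columns=["Name", "ID", "Species", "Age"])

# Route two: a dictionary mapping each column name to that column's values
from_dict = pd.DataFrame({
    "Name": ["Echo", "Evie", "Simon", "Lucifer"],
    "ID": [1, 2, 3, 4],
    "Species": ["Dog", "?", "Horse", "Cat"],
    "Age": [2.5, 8, 5, 13],
})

# Compare the two tables cell by cell, then check that every cell matched
same_values = bool((from_rows == from_dict).all().all())
print(same_values)

# The dictionary route insists that every column has the same number of entries
try:
    pd.DataFrame({"Name": ["Echo", "Evie"], "ID": [1]})
    lengths_checked = False
except ValueError:
    lengths_checked = True
print(lengths_checked)
```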


In reality though, these are pretty much never used. Almost always, we just want to create a pandas dataframe from a .csv file of the data. For this example, we'll use the file `Heart_Disease.csv`. Clicking on the file in VS Code, you can see the actual raw form of a naked CSV file. CSV stands for "Comma Separated Values". It's usually displayed as a table (in software such as Excel), but in actuality it's just a bunch of lines of text, where each entry is separated by a comma.

In [77]:
df = pd.read_csv("Heart_Disease.csv")
print(df)

     Age  Sex  Chest pain type   BP  Cholesterol  FBS over 120  EKG results  \
0     70    1                4  130          322             0            2   
1     80    0                3  115          564             0            2   
2     55    1                2  124          261             0            0   
3     65    1                4  128          263             0            0   
4     45    0                2  120          269             0            2   
..   ...  ...              ...  ...          ...           ...          ...   
265   52    1                3  172          199             1            0   
266   44    1                2  120          263             0            0   
267   56    0                2  140          294             0            2   
268   57    1                4  140          192             0            0   
269   67    1                4  160          286             0            2   

     Max HR  Exercise angina  ST depression  Slope 
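
To make the "it's just lines of text" point concrete, here is a small sketch that skips the file entirely. It assumes a three-column CSV typed out as a string (the values are borrowed from the first rows shown above) and uses `io.StringIO` so that `read_csv` can treat the string like a file. The names `csv_text` and `sample_df` are made up for this example.

```python
import io
import pandas as pd

# A tiny CSV held in a string instead of a file on disk:
# the first line holds the column names, every other line is one row
csv_text = """Age,Sex,BP
70,1,130
80,0,115
55,1,124
"""

# read_csv accepts a file path or a file-like object, so StringIO stands in for a file
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df)
print(sample_df.shape)
```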

## Accessing Data

Often with pandas, we want to look at certain pieces of our data. Usually this is a single cell, a row, or a column.

In [51]:
# A column can be accessed by just using the column name in brackets
print(df["Age"])

0      70
1      80
2      55
3      65
4      45
       ..
265    52
266    44
267    56
268    57
269    67
Name: Age, Length: 270, dtype: int64


In [52]:
# you can access a single entry by specifying the row number from within that column
print(df["Age"][12])

44


### loc and iloc
If you want to access something other than a column or single entry, however, your best friends are two built-in pandas properties called `loc` and `iloc`. `loc` is short for "location" and `iloc` stands for "integer location". They accept a pair of arguments in brackets that specify what row(s) and column(s) you want.

In [53]:
# loc
# for loc, the first argument is the row label(s) you want (here, the row numbers), or ":" if you want all rows
# the second argument is the name(s) of the column(s) you want, or ":" if you want all columns
print("This prints the first entry in the \"Age\" column")
print(df.loc[0,"Age"])

This prints the first entry in the "Age" column
70


In [54]:
print("This prints the whole \"Age\" column")
print(df.loc[:, "Age"])

This prints the whole "Age" column
0      70
1      80
2      55
3      65
4      45
       ..
265    52
266    44
267    56
268    57
269    67
Name: Age, Length: 270, dtype: int64


In [55]:
print("This prints the first entry in all columns")
print(df.loc[0, :])

This prints the first entry in all columns
Age                              70
Sex                               1
Chest pain type                   4
BP                              130
Cholesterol                     322
FBS over 120                      0
EKG results                       2
Max HR                          109
Exercise angina                   0
ST depression                   2.4
Slope of ST                       2
Number of vessels fluro           3
Thallium                          3
Heart Disease              Presence
Name: 0, dtype: object


You can even pass in a list of columns or rows

In [56]:
print("This prints the 3rd entry in the \"Age\" and \"BP\" columns")
print(df.loc[2, ["Age", "BP"]])

This prints the 3rd entry in the "Age" and "BP" columns
Age     55
BP     124
Name: 2, dtype: object


In [57]:
print("This prints the 1st and 3rd entry in the \"Age\" and \"BP\" columns")
print(df.loc[[0,2], ["Age", "BP"]])

This prints the 1st and 3rd entry in the "Age" and "BP" columns
   Age   BP
0   70  130
2   55  124
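
`loc` will also take a column of True/False values in the row slot, which gives a much nicer answer to the "Evie's feeding schedule" problem from earlier. The sketch below rebuilds a trimmed-down animals table so it can run on its own; `is_evie` and `evie_schedule` are just names picked for this example.

```python
import pandas as pd

animal_df = pd.DataFrame(
    [
        ["Echo", 1, "Dog", 2],
        ["Evie", 2, "?", 1],
        ["Simon", 3, "Horse", 3],
        ["Lucifer", 4, "Cat", 1],
    ],
    columns=["Name", "ID", "Species", "Feeding Schedule"],
)

# Comparing a column to a value gives one True/False per row
is_evie = animal_df["Name"] == "Evie"

# loc keeps only the rows where the mask is True, then picks the named column
evie_schedule = animal_df.loc[is_evie, "Feeding Schedule"]

# The result is still a (one-entry) column, so turn it into a plain list to see the value
print(evie_schedule.tolist())
```

No loop, and no need to remember that the feeding schedule lives in column 3.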


#### iloc
`iloc` works the same way, except it accepts integer positions for both the row and the column.

In [58]:
# this prints the entry in the first row and column
print(df.iloc[0,0])

70


In [59]:
# this prints the whole first row
print(df.iloc[0,:])

Age                              70
Sex                               1
Chest pain type                   4
BP                              130
Cholesterol                     322
FBS over 120                      0
EKG results                       2
Max HR                          109
Exercise angina                   0
ST depression                   2.4
Slope of ST                       2
Number of vessels fluro           3
Thallium                          3
Heart Disease              Presence
Name: 0, dtype: object


In [60]:
# this prints the whole first column
print(df.iloc[:,0])

0      70
1      80
2      55
3      65
4      45
       ..
265    52
266    44
267    56
268    57
269    67
Name: Age, Length: 270, dtype: int64


In [61]:
# And of course you can print multiple things
print(df.iloc[[0,10],[4,5]])

    Cholesterol  FBS over 120
0           322             0
10          234             0
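
In the heart disease data, the row labels happen to match the row positions, so `loc` and `iloc` look interchangeable for rows. Here is a minimal sketch of where they part ways, assuming a made-up three-row table (`small`) whose row labels are 10, 20, and 30.

```python
import pandas as pd

# A tiny made-up table whose row labels are 10, 20, 30 rather than 0, 1, 2
small = pd.DataFrame(
    {"Age": [70, 80, 55], "BP": [130, 115, 124]},
    index=[10, 20, 30],
)

by_label = small.loc[10, "Age"]   # the row labelled 10, the column named "Age"
by_position = small.iloc[0, 0]    # the first row, the first column
print(by_label, by_position)

# Slices differ too: loc includes the end label, iloc stops before the end position
label_slice = small.loc[10:20, "Age"]
position_slice = small.iloc[0:1, 0]
print(len(label_slice), len(position_slice))
```

Both lookups land on the same cell, but `small.loc[0, "Age"]` would raise a `KeyError` here, because no row is labelled 0.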


## Adding Data to a Dataframe
One fun thing about pandas is that we can add data to a dataframe. Recall our animals dataframe:

In [62]:
print(animal_df)

      Name  ID Species  Feeding Schedule  Weight  Height   Age
0     Echo   1     Dog                 2      40      24   2.5
1     Evie   2       ?                 1      14      10   8.0
2    Simon   3   Horse                 3     930      60   5.0
3  Lucifer   4     Cat                 1      12      12  13.0


We can add a new column to it just by assigning a value to the new column's name:

In [64]:
animal_df["Needs Kisses"] = True
print(animal_df)

      Name  ID Species  Feeding Schedule  Weight  Height   Age  Needs Kisses
0     Echo   1     Dog                 2      40      24   2.5          True
1     Evie   2       ?                 1      14      10   8.0          True
2    Simon   3   Horse                 3     930      60   5.0          True
3  Lucifer   4     Cat                 1      12      12  13.0          True


One super nifty thing about pandas is that you can even use some simple math on the existing columns in this assignment. So for example, if we wanted to create a new column detailing how much each meal should weigh, and we knew that each animal should eat 10% of its bodyweight per meal, we could do the following:

In [65]:
animal_df["Meal Weight"] = animal_df["Weight"] / 10
print(animal_df)

      Name  ID Species  Feeding Schedule  Weight  Height   Age  Needs Kisses  \
0     Echo   1     Dog                 2      40      24   2.5          True   
1     Evie   2       ?                 1      14      10   8.0          True   
2    Simon   3   Horse                 3     930      60   5.0          True   
3  Lucifer   4     Cat                 1      12      12  13.0          True   

   Meal Weight  
0          4.0  
1          1.4  
2         93.0  
3          1.2  


We can even use some simple boolean logic here

In [66]:
animal_df["Is Big"] = animal_df["Height"] > 48
print(animal_df)

      Name  ID Species  Feeding Schedule  Weight  Height   Age  Needs Kisses  \
0     Echo   1     Dog                 2      40      24   2.5          True   
1     Evie   2       ?                 1      14      10   8.0          True   
2    Simon   3   Horse                 3     930      60   5.0          True   
3  Lucifer   4     Cat                 1      12      12  13.0          True   

   Meal Weight  Is Big  
0          4.0   False  
1          1.4   False  
2         93.0    True  
3          1.2   False  
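
Comparisons like this can also be combined. In pandas, `&` means "and" and `|` means "or", and each comparison needs its own parentheses. The sketch below uses a trimmed-down copy of the animals table, and the column names "Is Small" and "Is Extreme" (along with their cutoffs) are invented purely for illustration.

```python
import pandas as pd

animals = pd.DataFrame({
    "Name": ["Echo", "Evie", "Simon", "Lucifer"],
    "Weight": [40, 14, 930, 12],
    "Height": [24, 10, 60, 12],
})

# Both conditions must hold for a row to be True
animals["Is Small"] = (animals["Weight"] < 20) & (animals["Height"] < 15)

# Either condition is enough for a row to be True
animals["Is Extreme"] = (animals["Weight"] > 500) | (animals["Weight"] < 13)

print(animals)
```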


These features are actually super powerful and helpful. But they are limited. If we want to do really complicated stuff, we can create a new column using the pandas `apply` function. We simply define a function that takes a row of our dataframe as an argument and returns whatever we want in our new column. We then set our new column equal to the result of `apply`, with our custom function passed in.

In [70]:
def get_meal_details(row):
    food = ""
    if row["Species"] == "Dog" or row["Species"] == "?":
        food = "kibble"
    elif row["Species"] == "Cat":
        food = "idk fish"
    else:
        food = "oats"

    return f"{row['Meal Weight']} lbs of {food}"

animal_df["Meal Details"] = animal_df.apply(get_meal_details, axis=1)
print(animal_df)

      Name  ID Species  Feeding Schedule  Weight  Height   Age  Needs Kisses  \
0     Echo   1     Dog                 2      40      24   2.5          True   
1     Evie   2       ?                 1      14      10   8.0          True   
2    Simon   3   Horse                 3     930      60   5.0          True   
3  Lucifer   4     Cat                 1      12      12  13.0          True   

   Meal Weight  Is Big         Meal Details  
0          4.0   False    4.0 lbs of kibble  
1          1.4   False    1.4 lbs of kibble  
2         93.0    True     93.0 lbs of oats  
3          1.2   False  1.2 lbs of idk fish  


### Brief Note
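In a lot of pandas functions, you will see `axis=0` or `axis=1`. This is because many pandas functions can work along rows or along columns. `axis=0` refers to the rows (the index) and `axis=1` refers to the columns, but what that means in practice depends on the function. In `drop`, `axis=1` means "drop a column". In `apply`, `axis=1` means "move across the columns", so our function is handed one row at a time. That's why we used `axis=1` in the call above: `get_meal_details` needs a whole row to do its work.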
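
Here is a small sketch of `axis` in action, using a made-up table of numbers (`scores`) so the results are easy to check by hand.

```python
import pandas as pd

scores = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 collapses the rows: one total per column
column_totals = scores.sum(axis=0)

# axis=1 collapses the columns: one total per row
row_totals = scores.sum(axis=1)

print(column_totals.tolist())
print(row_totals.tolist())

# With apply, axis=1 hands the function one row at a time
row_labels = scores.apply(lambda row: f"{row['a']}-{row['b']}", axis=1)
print(row_labels.tolist())
```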

## Removing Data from a Dataframe

Our heart disease dataframe has a lot of information in it. We may want to pare that down a bit.

In [71]:
print(df)

     Age  Sex  Chest pain type   BP  Cholesterol  FBS over 120  EKG results  \
0     70    1                4  130          322             0            2   
1     80    0                3  115          564             0            2   
2     55    1                2  124          261             0            0   
3     65    1                4  128          263             0            0   
4     45    0                2  120          269             0            2   
..   ...  ...              ...  ...          ...           ...          ...   
265   52    1                3  172          199             1            0   
266   44    1                2  120          263             0            0   
267   56    0                2  140          294             0            2   
268   57    1                4  140          192             0            0   
269   67    1                4  160          286             0            2   

     Max HR  Exercise angina  ST depression  Slope 

Let's start with ST depression. We don't need more depression in our life. To get rid of a column, we just call `df.drop()`. We also specify three things:
- the name or index of the row or column we'd like to drop
- which axis (0 for row, 1 for column)
- `inplace`: if this is set to `False` (the default), the function returns a copy of the dataframe without the dropped column or row, but the original dataframe remains unchanged. If we set `inplace=True`, then the original dataframe will actually be updated.

In [78]:
df.drop("ST depression", axis=1, inplace=True)
print(df.loc[0,:])

Age                              70
Sex                               1
Chest pain type                   4
BP                              130
Cholesterol                     322
FBS over 120                      0
EKG results                       2
Max HR                          109
Exercise angina                   0
Slope of ST                       2
Number of vessels fluro           3
Thallium                          3
Heart Disease              Presence
Name: 0, dtype: object


In [79]:
# We can even drop multiple
df.drop(["EKG results","Number of vessels fluro","Slope of ST"], axis=1, inplace=True)
print(df.loc[0,:])

Age                      70
Sex                       1
Chest pain type           4
BP                      130
Cholesterol             322
FBS over 120              0
Max HR                  109
Exercise angina           0
Thallium                  3
Heart Disease      Presence
Name: 0, dtype: object


In [80]:
# And with rows 
df.drop(2,axis=0,inplace=True)
print(df)

     Age  Sex  Chest pain type   BP  Cholesterol  FBS over 120  Max HR  \
0     70    1                4  130          322             0     109   
1     80    0                3  115          564             0     160   
3     65    1                4  128          263             0     105   
4     45    0                2  120          269             0     121   
5     30    1                4  120          177             0     140   
..   ...  ...              ...  ...          ...           ...     ...   
265   52    1                3  172          199             1     162   
266   44    1                2  120          263             0     173   
267   56    0                2  140          294             0     153   
268   57    1                4  140          192             0     148   
269   67    1                4  160          286             0     108   

     Exercise angina  Thallium Heart Disease  
0                  0         3      Presence  
1                
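
The cells above all use `inplace=True`. For comparison, here is a hedged sketch of the other style, where `drop` hands back a new dataframe and leaves the original alone. The table is a made-up three-row stand-in (`table`), not the real heart disease data. It also shows that dropping a row leaves a gap in the row IDs, just like the missing 2 in the output above, and that `reset_index(drop=True)` renumbers them if you want that.

```python
import pandas as pd

# A made-up stand-in table, small enough to check by eye
table = pd.DataFrame({
    "Age": [70, 80, 55],
    "BP": [130, 115, 124],
    "Max HR": [109, 160, 141],
})

# Without inplace=True, drop leaves the original alone and returns a new dataframe
trimmed = table.drop("Max HR", axis=1)
print(list(table.columns))
print(list(trimmed.columns))

# Dropping a row removes its label; the remaining labels are not renumbered
fewer_rows = table.drop(1, axis=0)
print(list(fewer_rows.index))

# reset_index(drop=True) renumbers the rows from 0 again
renumbered = fewer_rows.reset_index(drop=True)
print(list(renumbered.index))
```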

## Looping through a Dataframe
There are technically multiple different ways to loop through the rows in a pandas dataframe (something which you often want to be able to do). There is one main one that I've seen most often, and that is to use the `iterrows()` function. This is used in conjunction with a for loop like so:

In [83]:
for index, row in animal_df.iterrows():
    # here index will be the row ID (usually the same as the row number)
    print("Index: ",index)
    # and the row will be a single row from the dataframe. Here you can access any
    # column entry for that row the same way you would access any column in a dataframe
    print("Name:", row["Name"], "Age:", row["Age"])
    print()

Index:  0
Name: Echo Age: 2.5

Index:  1
Name: Evie Age: 8.0

Index:  2
Name: Simon Age: 5.0

Index:  3
Name: Lucifer Age: 13.0
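
One of the other ways to loop is `itertuples()`, sketched below on a trimmed-down animals table. Each row comes back as a named tuple, so columns are read with a dot instead of brackets. One caveat: this only reads nicely when the column names are valid Python names. A column like "Feeding Schedule" (with a space) gets renamed to a positional name, so `iterrows()` is the friendlier choice there.

```python
import pandas as pd

animals = pd.DataFrame({
    "Name": ["Echo", "Evie", "Simon", "Lucifer"],
    "Age": [2.5, 8, 5, 13],
})

lines = []
for row in animals.itertuples():
    # row.Index is the row ID; the other fields are named after the columns
    lines.append(f"{row.Index}: {row.Name} is {row.Age} years old")

print("\n".join(lines))
```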



## Saving a Dataframe as a File
Once we're done, we'll often want to save the data we've created as a .csv file so that we can use it later. Pandas has a nifty function to do that.

In [85]:
animal_df.to_csv("animals.csv")

One thing to note here: if you look at that .csv file, you'll see an extra first entry in each row. Pandas assigns each row a row ID (the index) automatically, and `to_csv` saves it as an extra first column. This is actually what we've been referencing whenever we've looked up a row number.

When saving a dataframe as a .csv, we can tell it not to save the automagical pandas row ID numbers by setting `index=False`. Trying it now, we can see that our file no longer includes that first column.

In [None]:
animal_df.to_csv("animals.csv", index=False)
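
If you'd rather not open the file to see the difference, here is a small sketch that prints both versions. It relies on the fact that `to_csv` returns the CSV text when it is called without a file name. The two-row `pets` table is made up for this example.

```python
import pandas as pd

pets = pd.DataFrame({"Name": ["Echo", "Evie"], "ID": [1, 2]})

# With no file name, to_csv returns the CSV text instead of writing a file
with_index = pets.to_csv()
without_index = pets.to_csv(index=False)

# Notice the leading comma and the extra 0, 1 entries in the first version
print(with_index)
print(without_index)
```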

When we want to load our data back, we also have the option to tell it which column to use for these secret Pandas IDs by setting `index_col`. For example, in our `animal_df` dataframe, we have a column named "ID". So we can tell pandas to use "ID" as our IDs.

In [86]:
print(animal_df)

      Name  ID Species  Feeding Schedule  Weight  Height   Age  Needs Kisses  \
0     Echo   1     Dog                 2      40      24   2.5          True   
1     Evie   2       ?                 1      14      10   8.0          True   
2    Simon   3   Horse                 3     930      60   5.0          True   
3  Lucifer   4     Cat                 1      12      12  13.0          True   

   Meal Weight  Is Big         Meal Details  
0          4.0   False    4.0 lbs of kibble  
1          1.4   False    1.4 lbs of kibble  
2         93.0    True     93.0 lbs of oats  
3          1.2   False  1.2 lbs of idk fish  


In [88]:
animal_df = pd.read_csv("animals.csv", index_col="ID")
print(animal_df)

       Name Species  Feeding Schedule  Weight  Height   Age  Needs Kisses  \
ID                                                                          
1      Echo     Dog                 2      40      24   2.5          True   
2      Evie       ?                 1      14      10   8.0          True   
3     Simon   Horse                 3     930      60   5.0          True   
4   Lucifer     Cat                 1      12      12  13.0          True   

    Meal Weight  Is Big         Meal Details  
ID                                            
1           4.0   False    4.0 lbs of kibble  
2           1.4   False    1.4 lbs of kibble  
3          93.0    True     93.0 lbs of oats  
4           1.2   False  1.2 lbs of idk fish  
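
One consequence worth seeing: once "ID" is the index, `loc` looks rows up by ID instead of by the old 0-based row number. The sketch below does the whole save-and-reload round trip in memory, assuming a made-up two-row `pets` table and using `io.StringIO` in place of a real file.

```python
import io
import pandas as pd

pets = pd.DataFrame({"Name": ["Echo", "Evie"], "ID": [1, 2], "Age": [2.5, 8.0]})

# Save without the automatic row IDs, keeping the CSV text in memory
csv_text = pets.to_csv(index=False)

# Reading the text back with index_col="ID" turns the ID column into the row labels
reloaded = pd.read_csv(io.StringIO(csv_text), index_col="ID")
print(reloaded)

# loc now finds rows by ID, so 2 means "the animal with ID 2"
print(reloaded.loc[2, "Name"])
```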


# Epilogue

This lesson really only scratches the surface of what pandas can do, but there's really too much to cover. Just remember, as always, that Google is your friend. If there's ever anything you want to do with a pandas dataframe, just Google it and there's probably a nifty built-in way to do so.