## Pandas and Data Management

In this session we are going to learn about pandas. Calling pandas just a package almost undersells it. It is a collection of data science tools for working with tabular data such as data frames and tables, a lot like a Python version of Excel. It is one of the most in-demand Python skills in industry, and you will most likely use it if you use Python for any data analysis or projects during your PhD.

The most important feature of pandas is the DataFrame. This is a two-dimensional table with labelled rows and columns that is easy to manipulate in useful ways. It also comes with various built-in analysis functions, although these are fairly general rather than specialised.

In [2]:
import pandas as pd

In [3]:
import numpy as np

The first thing to do is import the pandas package. People generally use pd as a short alias for pandas, but it isn't strictly necessary.

What we will do is make a quick pandas dataframe first to explore some of the basic features.

In [4]:
df = pd.DataFrame({'name':['Jack','Jen','John'],
                   'age':[20,30,40],
                   'Department':['Zoology', 'Geography', 'Physics']})

df

   name  age Department
0  Jack   20    Zoology
1   Jen   30  Geography
2  John   40    Physics


Look at how pretty that is. It basically looks like Excel, right? And we can do even more with it!

In [7]:
print(df.columns)


Index(['name', 'age', 'Department'], dtype='object')


In [12]:
print(df.index)

RangeIndex(start=0, stop=3, step=1)
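The shape of the dataframe, meaning its number of rows and columns, can be read in a similar way. A minimal sketch, rebuilding the same df so that the cell runs on its own:

```python
import pandas as pd

# Same small dataframe as above, rebuilt so this cell is self-contained.
df = pd.DataFrame({'name': ['Jack', 'Jen', 'John'],
                   'age': [20, 30, 40],
                   'Department': ['Zoology', 'Geography', 'Physics']})

n_rows, n_cols = df.shape    # shape is a (rows, columns) tuple
print(df.shape)
print(list(df.columns))      # the column labels as a plain Python list
print(len(df))               # len() gives the number of rows
```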


We can examine the column names and the row index. We can reference the columns we want in two ways.

In [11]:
print(df.name)

0    Jack
1     Jen
2    John
Name: name, dtype: object


In [22]:
print(df["name"])

0    Jack
1     Jen
2    John
Name: name, dtype: object
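The two forms are not fully interchangeable. As a small sketch (the frame and its column names start date and count are made up for illustration, not part of the data above), square brackets work for any column label, while the dot form only works when the label is a valid Python name that does not clash with something the dataframe already provides:

```python
import pandas as pd

# Hypothetical frame: 'start date' contains a space and
# 'count' has the same name as the DataFrame.count method.
demo = pd.DataFrame({'name': ['Jack', 'Jen'],
                     'start date': ['2020-01-01', '2021-06-15'],
                     'count': [1, 2]})

print(demo['start date'])      # brackets handle labels with spaces
print(demo['count'])           # brackets return the column
print(callable(demo.count))    # the dot form finds the method, not the column
```

Because of this, the bracket form is usually the safer habit.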


We can use iloc with a row position to get all the information in that row.

In [21]:
df.iloc[0]

name             Jack
age                20
Department    Zoology
Name: 0, dtype: object

In [25]:
df.iloc[1]

name                Jen
age                  30
Department    Geography
Name: 1, dtype: object

Or we can pick out a single value by its index label inside a specific column.

In [26]:
df.name[0]

'Jack'

We can grab rows using the slice operator (:). But note that plain integer indexing such as df[0] does not select a row, because square brackets on a dataframe look up column names.

In [39]:
df[:1]

   name  age Department
0  Jack   20    Zoology


Slicing up to 0 selects no rows at all, so all we get back is the column names.

In [44]:
df[:0]

Empty DataFrame
Columns: [name, age, Department]
Index: []


We have to use loc or iloc to index rows in the way we are used to.

In [50]:
df.loc[0]

name             Jack
age                20
Department    Zoology
Name: 0, dtype: object

In [54]:
df.loc[0, 'name']

'Jack'

Consider the dataframe like a dictionary when using loc. If you want a specific value, you need to reference both the row label and the key (column name) you want.
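The difference between loc (labels) and iloc (positions) is easier to see once the row labels are no longer 0, 1, 2. A minimal sketch, assuming for illustration that we use the name column as the index (by_name is a name introduced here, not part of the session):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Jack', 'Jen', 'John'],
                   'age': [20, 30, 40],
                   'Department': ['Zoology', 'Geography', 'Physics']})

# Assumption for illustration: use the names as row labels.
by_name = df.set_index('name')

print(by_name.loc['Jen', 'age'])    # label-based: row 'Jen', column 'age'
print(by_name.iloc[1, 0])           # position-based: second row, first remaining column
print(df.loc[0, 'Department'])      # with the default index the row label is the integer 0
```

Both of the first two lines pick out the same cell, once by label and once by position.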

In [85]:
df.iloc[:,0]

0    Jack
1     Jen
2    John
Name: name, dtype: object

Here we have referenced all values in the name column. Try out as many things as you want to understand indexing of dataframes.

Try out the head and tail methods, which let you look at a number of rows from the top or bottom of the dataframe.
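As a starting point, here is a small sketch of both methods on the dataframe from earlier, rebuilt so the cell runs on its own. The number in the brackets is how many rows to return; if you leave it out, pandas returns five.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Jack', 'Jen', 'John'],
                   'age': [20, 30, 40],
                   'Department': ['Zoology', 'Geography', 'Physics']})

first_two = df.head(2)    # the first two rows
last_one = df.tail(1)     # the last row

print(first_two)
print(last_one)
```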

Now try making your own dataframe from this data. The later cells expect columns named Species Name, English Name and Lifespan.

In [67]:
Species = ["Canis lupus", "Panthera leo", "Loxodonta africana", "Vulpes vulpes"]
English = ["Wolf", "Lion", "Elephant", "Fox"]
Weight = [35, 190, 6000, 12]
Lifespan = [18, 12, 60, 5]

In [72]:
Animals = 

In [None]:
Animals
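If you get stuck, one possible way to build the dataframe is sketched here. The column names Species Name, English Name and Lifespan are the ones the later cells use; Weight is an assumed name for the remaining column, and the data does not state units for weight or lifespan.

```python
import pandas as pd

Species = ["Canis lupus", "Panthera leo", "Loxodonta africana", "Vulpes vulpes"]
English = ["Wolf", "Lion", "Elephant", "Fox"]
Weight = [35, 190, 6000, 12]
Lifespan = [18, 12, 60, 5]

# Each dictionary key becomes a column name, each list becomes that column's values.
Animals = pd.DataFrame({"Species Name": Species,
                        "English Name": English,
                        "Weight": Weight,        # assumed column name
                        "Lifespan": Lifespan})

print(Animals)
```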

Now check out the columns in order, looking at all the data in them.

We can do some fancier things with pandas than just look at pretty data frames.

In [None]:
Animals.sort_values('Lifespan',ascending = True)

Here we sort the data by lifespan, in ascending order. Quite a useful thing if you want to order your data.
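Two details are worth seeing in code: setting ascending to False reverses the order, and sort_values returns a new dataframe rather than changing the original. A minimal sketch, assuming the Animals dataframe was built with the column names used in these cells (Weight is an assumed name):

```python
import pandas as pd

Animals = pd.DataFrame({
    "Species Name": ["Canis lupus", "Panthera leo", "Loxodonta africana", "Vulpes vulpes"],
    "English Name": ["Wolf", "Lion", "Elephant", "Fox"],
    "Weight": [35, 190, 6000, 12],      # assumed column name
    "Lifespan": [18, 12, 60, 5],
})

# Longest-lived first; the result is a new dataframe.
longest_first = Animals.sort_values("Lifespan", ascending=False)

print(longest_first["English Name"].tolist())
print(Animals["English Name"].tolist())    # the original order is untouched
```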

In [None]:
Animals["Species Name"].sample(3)

Here we have just randomly sampled 3 of the species names.
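Because the sample is random, you will get a different answer each time you run the cell. If you need the same draw every time, sample accepts a random_state argument. A small sketch (the seed value 0 is arbitrary, and the species list is repeated so the cell runs on its own):

```python
import pandas as pd

Species = ["Canis lupus", "Panthera leo", "Loxodonta africana", "Vulpes vulpes"]
Animals = pd.DataFrame({"Species Name": Species})

# Fixing random_state makes the random draw repeatable.
picked = Animals["Species Name"].sample(3, random_state=0)
picked_again = Animals["Species Name"].sample(3, random_state=0)

print(picked)
print(picked.equals(picked_again))
```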

In [None]:
Animals.info()

Here we can see the data type of each column in the dataframe, along with how many non-null values each column holds.

In [None]:
Animals[Animals["English Name"]=="Fox"]

Here we have filtered the data, keeping only the rows where the English name is Fox. We can filter the data by any condition we want.
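The expression inside the square brackets produces a column of True/False values, one per row, and only the True rows are kept. A minimal sketch of this, including how to combine two conditions (the thresholds are arbitrary, and Weight is an assumed column name):

```python
import pandas as pd

Animals = pd.DataFrame({
    "Species Name": ["Canis lupus", "Panthera leo", "Loxodonta africana", "Vulpes vulpes"],
    "English Name": ["Wolf", "Lion", "Elephant", "Fox"],
    "Weight": [35, 190, 6000, 12],      # assumed column name
    "Lifespan": [18, 12, 60, 5],
})

mask = Animals["Weight"] > 30    # a True/False value for every row
heavy = Animals[mask]            # keep only the True rows

# Combine conditions with & (and) or | (or); each condition needs its own brackets.
heavy_and_long_lived = Animals[(Animals["Weight"] > 30) & (Animals["Lifespan"] > 15)]

print(mask.tolist())
print(heavy["English Name"].tolist())
print(heavy_and_long_lived["English Name"].tolist())
```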

In [None]:
Animals.describe()

Here we can get some descriptive statistics automatically generated for our numerical data.
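The result of describe is itself a dataframe, so individual statistics can be picked out with loc. A small sketch, again assuming the Animals column names used in these cells (Weight is an assumed name):

```python
import pandas as pd

Animals = pd.DataFrame({
    "Species Name": ["Canis lupus", "Panthera leo", "Loxodonta africana", "Vulpes vulpes"],
    "English Name": ["Wolf", "Lion", "Elephant", "Fox"],
    "Weight": [35, 190, 6000, 12],      # assumed column name
    "Lifespan": [18, 12, 60, 5],
})

stats = Animals.describe()              # only the numerical columns are summarised

print(list(stats.columns))
print(stats.loc["mean", "Weight"])      # row label is the statistic, column is the variable
print(Animals["Lifespan"].max())        # single statistics are also available directly
```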

While these are all useful, the true power of pandas comes from its ability to do all this to huge datasets that were generated elsewhere. Many of you will have experience of working with CSV files. Pandas comes with functions for reading CSV files in as dataframes.

There are lots of datasets to play around with out there. We will use one of the most common on the internet, the iris dataset, which contains measurement data of flowers.

In [4]:
Data = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv")

In [None]:
Data

Explore the dataset, get the descriptive statistics for various columns etc. 

Another incredibly useful thing that we can do is to write CSVs from pandas dataframes so we can save them and use them for other things later.

In [None]:
Data.to_csv(path)  # replace path with the location where you want to save the file
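To see the full round trip without touching your own files, here is a minimal sketch that writes the small dataframe from earlier to a temporary folder and reads it back. The file name people.csv is made up for the example. Passing index=False leaves the 0, 1, 2 row labels out of the file; without it, reading the file back gives an extra column called Unnamed: 0.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'name': ['Jack', 'Jen', 'John'],
                   'age': [20, 30, 40],
                   'Department': ['Zoology', 'Geography', 'Physics']})

# The temporary folder is deleted automatically at the end of the with block.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'people.csv')    # hypothetical file name
    df.to_csv(path, index=False)                  # write without the row labels
    reloaded = pd.read_csv(path)                  # read the same file back in

print(reloaded)
```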

As a final part of this we are going to use the functions we have been writing to work out some finances for people and turn that into a dataframe and a CSV.

You will build a dataset with these people.

Jane: age 30, income £60,000, invests 20% monthly, target house price = £450,000

Amanda: age 23, income £19,000, invests 8% monthly, target house price = £180,000

Chris: age 48, income £39,000, invests 16% annually, target house price = £320,000

Ethan: age 37, income £120,000, invests 26% monthly, target house price = £720,000

Create a dataset that is composed of their information, how much they will be making each year in interest, and how many years they have to save to buy their house.
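As a possible starting point, the sketch below only sets up the dataframe from the information given and adds one derived column. The column names are made up for this example, and it assumes the investment percentage is a share of yearly income. The interest and years-to-target columns are left for you to fill in with your own functions, since the interest rate is not given here.

```python
import pandas as pd

# Values copied from the list of people; column names are hypothetical.
people = pd.DataFrame({
    'name': ['Jane', 'Amanda', 'Chris', 'Ethan'],
    'age': [30, 23, 48, 37],
    'income': [60000, 19000, 39000, 120000],
    'invest_percent': [20, 8, 16, 26],
    'invest_frequency': ['monthly', 'monthly', 'annually', 'monthly'],
    'target_house_price': [450000, 180000, 320000, 720000],
})

# Assumption: the percentage is taken from yearly income.
# Arithmetic on whole columns is applied row by row.
people['invested_per_year'] = people['income'] * people['invest_percent'] / 100

print(people)
```

New columns from your own functions can be added in the same way, and the finished dataframe can then be saved with to_csv.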