![title](http://www.pragso.com/content/images/2016/07/pandas_logo-3.png)

(Not exactly a panda, but whatever.) In the data science world, a lot of data comes neatly organized in tables, like those you might see in Excel files. Python has a convenient way of working with tabular data like this: the Pandas library. This library will be your bread and butter. You'll have to have it in your blood, you need to eat this Panda every day, and hopefully it doesn't go extinct. Now, how is data represented in Pandas?

Let's suppose you have a table about some of your friends, which includes their name, age, month of birth and gender, kinda like this: 

![title](https://sarahleejane.github.io/assets/simple_table_df.png)

Now let's go over the parts of a dataframe. 

- Birth Month, Origin, Age and Gender are the **Columns** of the dataframe. Each column name tells you the real-world meaning of the values in the cells below it.

- Carly, Rachel, Nicky, Wendy and Judith are the **Rows** of the dataframe. Now here's an interesting case. In this dataframe, the names are the **index** of the dataframe, that is, the labels of the rows. The index should be a unique value that tells each row apart. Since the names are unique, we can set them as the index of the dataframe. By default, dataframes have a row index similar to that of a list, going from 0 to N-1, where N is the number of rows.

- January, UK, 27, F, etc. are the **Values** of the dataframe. They represent the intersection of the Column and the Row. For example, the value of the Column Origin in the row Wendy is Italy. 

Let's create this dataframe right here right now, by using a dictionary as a base.

In [118]:
#Always make sure to import your libraries kids!
import pandas as pd

In [119]:
#Create the values of the dataframe
Names = ["Carly","Rachel","Nicky","Wendy","Judith"]
Birth_Month = ["January","September","September","November","February"]
Origin = ["UK","Spain","Jamaica","Italy","France"]
Age = [27,28,28,22,19]
Gender = ["f","f","f","f","f"]

#Create the dictionary that we will turn into the dataframe.
Friends = {"Names":Names,"Birth Month":Birth_Month,
           "Origin":Origin, "Age":Age,"Gender":Gender}

#Create the dataframe, by converting the dictionary to a dataframe using pd.DataFrame()
Friends_Dataframe = pd.DataFrame(Friends)

#In Jupyter notebooks, it's better to just name the dataframe at the end of the cell, since it creates a much better
#looking table. Let's compare using print, and just naming the dataframe at the end.
print(Friends_Dataframe)
Friends_Dataframe

   Age Birth Month Gender   Names   Origin
0   27     January      f   Carly       UK
1   28   September      f  Rachel    Spain
2   28   September      f   Nicky  Jamaica
3   22    November      f   Wendy    Italy
4   19    February      f  Judith   France


| | Age | Birth Month | Gender | Names | Origin |
|---|---|---|---|---|---|
| 0 | 27 | January | f | Carly | UK |
| 1 | 28 | September | f | Rachel | Spain |
| 2 | 28 | September | f | Nicky | Jamaica |
| 3 | 22 | November | f | Wendy | Italy |
| 4 | 19 | February | f | Judith | France |


Much better looking, right? But hang on, there's still something wrong. We have a numbered index instead of the names. We can fix that really easily with the dataframe's built-in method, **set_index()**.

In [120]:
#Replace the old dataframe with this new one that has names as the index.
Friends_Dataframe = Friends_Dataframe.set_index("Names")
Friends_Dataframe

| | Age | Birth Month | Gender | Origin |
|---|---|---|---|---|
| **Names** | | | | |
| Carly | 27 | January | f | UK |
| Rachel | 28 | September | f | Spain |
| Nicky | 28 | September | f | Jamaica |
| Wendy | 22 | November | f | Italy |
| Judith | 19 | February | f | France |


But this still doesn't look right. The names are just having an empty row all for themselves, those greedy bastards. But we can also fix that by doing this:

In [121]:
#Clear the name of the index. (Recent pandas versions raise an error on "del", so we set it to None instead.)

Friends_Dataframe.index.name = None
Friends_Dataframe

| | Age | Birth Month | Gender | Origin |
|---|---|---|---|---|
| Carly | 27 | January | f | UK |
| Rachel | 28 | September | f | Spain |
| Nicky | 28 | September | f | Jamaica |
| Wendy | 22 | November | f | Italy |
| Judith | 19 | February | f | France |


There we go, this looks way better. 
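As a quick recap, the steps above can be chained together, and the parts named earlier (columns, index, values) can be inspected directly. This is only a minimal sketch: the two-friend table and the lowercase name `friends` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Hypothetical two-row version of the friends table.
friends = pd.DataFrame({
    "Names": ["Carly", "Wendy"],
    "Origin": ["UK", "Italy"],
    "Age": [27, 22],
})

# set_index moves "Names" into the index; rename_axis(None) clears the index name.
friends = friends.set_index("Names").rename_axis(None)

print(list(friends.columns))           # the column names
print(list(friends.index))             # the row labels (the index)
print(friends.loc["Wendy", "Origin"])  # the value where row "Wendy" meets column "Origin"
```

Here `.loc[row, column]` looks up a single value by its row label and column name.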

#### Now your turn. 

Create your own dataframe with 5 or more of your friends, and put them in the Myfriends_Dataframe. You can use any columns you want. You'll work with this dataframe for the rest of the exercises.

In [122]:
#Your code here
Myfriends_Dataframe = " "

## Accessing data on a Dataframe

So far so good, you know how to create a dataframe. But what can you do after you create it? One thing we can do is look at a single column to see its values. For example, let's look at the origins of the friends.

In [123]:
Friends_Dataframe.Origin

Carly          UK
Rachel      Spain
Nicky     Jamaica
Wendy       Italy
Judith     France
Name: Origin, dtype: object

We can see the country of origin for each person, as well as the name of the person in the index. Why is this? That's because when you select a column of a dataframe like this, the resulting object is something called a **Series**. A Series is like a dataframe, but it only includes the index and the values of one column. We can perform some of the same operations on Series that we do on dataframes, but we'll see more of that later. If you need just the values of the column as a plain Python list, you can convert the Series to a list, like so:

In [124]:
countries = list(Friends_Dataframe.Origin)
countries

['UK', 'Spain', 'Jamaica', 'Italy', 'France']

And now we have a list of strings with the countries of our dataframe. Now let's do the same, but for the months. 

In [125]:
#Another way to access a column in a dataframe, in case the name of the column is separated by spaces.
list(Friends_Dataframe["Birth Month"])

['January', 'September', 'September', 'November', 'February']

That's nice and all, but some months are repeated. To get all the unique values in a dataframe column, we can use the built-in unique() method, which returns each distinct value once, in order of first appearance.

In [126]:
list(Friends_Dataframe["Birth Month"].unique())

['January', 'September', 'November', 'February']
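A few related Series methods are handy alongside `unique()`. The sketch below uses a standalone Series called `months` (a made-up name holding the same values as the Birth Month column) so that it runs on its own.

```python
import pandas as pd

# Hypothetical standalone copy of the Birth Month values.
months = pd.Series(["January", "September", "September", "November", "February"])

all_months = months.tolist()              # same result as list(months)
distinct_months = months.unique().tolist()  # each value once, in order of appearance
how_many = months.nunique()               # number of distinct values
counts = months.value_counts()            # how often each value appears

print(all_months)
print(distinct_months)
print(how_many)
print(counts)
```

`value_counts()` is often the quickest way to see which values are repeated and how many times.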

We can also select various columns in one go. 

In [127]:
Columns = ["Age","Gender"]
Friends_Dataframe[Columns]

| | Age | Gender |
|---|---|---|
| Carly | 27 | f |
| Rachel | 28 | f |
| Nicky | 28 | f |
| Wendy | 22 | f |
| Judith | 19 | f |


And, just like with numpy arrays, we can subset by rows.

In [128]:
#Gets the first three rows (positions 0, 1 and 2).
Friends_Dataframe[:3]

| | Age | Birth Month | Gender | Origin |
|---|---|---|---|---|
| Carly | 27 | January | f | UK |
| Rachel | 28 | September | f | Spain |
| Nicky | 28 | September | f | Jamaica |
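Row slices can also be written more explicitly with the `iloc` (by position) and `loc` (by label) indexers. The sketch below uses a small made-up table called `friends`. Note that a label slice includes its end point, while a position slice does not.

```python
import pandas as pd

# Hypothetical four-row table with names as the index.
friends = pd.DataFrame(
    {"Age": [27, 28, 28, 22]},
    index=["Carly", "Rachel", "Nicky", "Wendy"],
)

first_two = friends.iloc[:2]                   # by position: rows 0 and 1, end excluded
carly_to_nicky = friends.loc["Carly":"Nicky"]  # by label: end label included

print(first_two)
print(carly_to_nicky)
```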


#### Now your turn.
With the Myfriends dataframe you created earlier, do 2 subsets by columns, one with unique values, and one without unique values. Do a subset with rows as well. Try to do a subset of rows and columns. Is it possible?

In [129]:
#Your code here

## Operations with dataframes.

Operations in pandas dataframes are best done **column-based**. That is, you first take a column, apply an operation, and then reassign that column to its new value. This can be done all on the same line, or by first setting a variable with the result and then replacing it.

In [130]:

import numpy as np
#all on the same line.
Friends_Dataframe.Age = Friends_Dataframe.Age.apply(np.sqrt)
print(Friends_Dataframe.Age)

#Setting a variable with the new value, then replacing the old one.
Squared_Age = Friends_Dataframe.Age.apply(np.square)
Friends_Dataframe.Age = Squared_Age
print(Friends_Dataframe.Age)

Carly     5.196152
Rachel    5.291503
Nicky     5.291503
Wendy     4.690416
Judith    4.358899
Name: Age, dtype: float64
Carly     27.0
Rachel    28.0
Nicky     28.0
Wendy     22.0
Judith    19.0
Name: Age, dtype: float64


**Be very careful when replacing columns in a dataframe**. Once you change them in the original dataframe, unless you have the previous values stored somewhere, the only way to go back is to recreate the original dataframe and reapply all the steps you have done before.
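One way to protect yourself is to keep a copy before overwriting anything. The following is a minimal sketch with a made-up three-row table; the names `friends` and `original` are only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical small table.
friends = pd.DataFrame({"Age": [27, 28, 22]}, index=["Carly", "Rachel", "Wendy"])

# .copy() makes an independent dataframe that later changes won't touch.
original = friends.copy()

# Overwrite the column in the working dataframe.
friends["Age"] = np.sqrt(friends["Age"])

print(friends["Age"])    # the transformed values
print(original["Age"])   # still the ages we started with
```

If something goes wrong, you can start again from `original` instead of rebuilding everything.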

You can also use apply to apply custom functions, and **lambda** functions to a column.

In [131]:
#Custom Function.
def All_Caps(string):
    return string.upper()

Friends_Dataframe.Gender = Friends_Dataframe.Gender.apply(All_Caps)
print(Friends_Dataframe.Gender)

#Lambda function
Friends_Dataframe["Birth Month"] = Friends_Dataframe["Birth Month"].apply(lambda month: month.lower())
print(Friends_Dataframe["Birth Month"])

Carly     F
Rachel    F
Nicky     F
Wendy     F
Judith    F
Name: Gender, dtype: object
Carly       january
Rachel    september
Nicky     september
Wendy      november
Judith     february
Name: Birth Month, dtype: object


Lambda functions are really useful because they let us define a small function in a single expression and, if we want, store it in a variable. Like this:

In [132]:
Minus_5 = lambda number: number - 5 #number is the argument passed to the function, and number - 5 is the value it returns.

print(Friends_Dataframe.Age.apply(Minus_5))

Carly     22.0
Rachel    23.0
Nicky     23.0
Wendy     17.0
Judith    14.0
Name: Age, dtype: float64
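For simple arithmetic like this, you can also operate on the whole Series at once, without apply. The sketch below compares both approaches on a made-up Series called `ages`.

```python
import pandas as pd

# Hypothetical standalone Series of ages.
ages = pd.Series([27.0, 28.0, 22.0], index=["Carly", "Rachel", "Wendy"], name="Age")

with_apply = ages.apply(lambda number: number - 5)  # calls the lambda once per value
vectorized = ages - 5                               # subtracts 5 from every value at once

print(with_apply.equals(vectorized))
```

Both give the same Series. The second form is shorter and is usually faster on large columns.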


Lambda functions are another way of working with Series and data like this. Feel free to mess around with them. One more useful function is the **groupby** function, which splits the rows into groups that share the same value in a column, so that each group can be summarized on its own. Let's see this in action by grouping the friends by their birth month.
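Here is a minimal sketch of that grouping. It rebuilds only the two columns it needs in a made-up dataframe called `friends`, and the choice of count and mean as summary functions is just an assumption for the example.

```python
import pandas as pd

# Hypothetical copy of the two columns we need.
friends = pd.DataFrame(
    {
        "Birth Month": ["january", "september", "september", "november", "february"],
        "Age": [27, 28, 28, 22, 19],
    },
    index=["Carly", "Rachel", "Nicky", "Wendy", "Judith"],
)

# One row per birth month, with the number of friends and their mean age.
per_month = friends.groupby("Birth Month")["Age"].agg(["count", "mean"])
print(per_month)
```

The result is a new dataframe whose index is the birth month, with one column per summary function.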

### Adding  Rows and Columns
Adding rows and columns to a dataframe follows a simple rule. If adding a new column, the column must contain the same number of elements as there are rows in the dataframe. Example:

In [133]:
##To add a new column to a dataframe is like adding a new Key:Value in a dictionary.
Education = ["Ms","Ms","PhD","Bs","Hs"] #Level of education list.
Friends_Dataframe["Education"] = Education
Friends_Dataframe

| | Age | Birth Month | Gender | Origin | Education |
|---|---|---|---|---|---|
| Carly | 27.0 | january | F | UK | Ms |
| Rachel | 28.0 | september | F | Spain | Ms |
| Nicky | 28.0 | september | F | Jamaica | PhD |
| Wendy | 22.0 | november | F | Italy | Bs |
| Judith | 19.0 | february | F | France | Hs |
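To see the column rule in action, here is a small sketch of what happens when the list is too short. The three-row table and the `column_added` flag are made up for this example.

```python
import pandas as pd

# Hypothetical three-row table.
friends = pd.DataFrame({"Age": [27, 28, 22]}, index=["Carly", "Rachel", "Wendy"])

try:
    friends["Education"] = ["Ms", "PhD"]  # only 2 values for 3 rows
    column_added = True
except ValueError as error:
    column_added = False
    print("Could not add the column:", error)

# With one value per row, the assignment works.
friends["Education"] = ["Ms", "PhD", "Bs"]
print(friends)
```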


To add a new row, the row must have a value for every column in the dataframe, plus a label for the index. First, create a dictionary with one entry per column (as if creating a new dataframe), then wrap it in a one-row dataframe whose index is the new row's name, and finally join it to the original with pd.concat(). If you are trying to combine two dataframes together, you can pass them both to pd.concat() straight away, provided they have the same columns.

In [134]:
James = {"Age":25.0,
         "Birth Month":"july","Gender":"M",
         "Origin":"USA","Education":"Bs"}
#Add him to the dataframe. DataFrame.append() was removed in pandas 2.0, so we use pd.concat() instead.
James_Row = pd.DataFrame([James], index=["James"]) #A one-row dataframe labelled "James".
Friends_Dataframe = pd.concat([Friends_Dataframe, James_Row])
Friends_Dataframe

| | Age | Birth Month | Gender | Origin | Education |
|---|---|---|---|---|---|
| Carly | 27.0 | january | F | UK | Ms |
| Rachel | 28.0 | september | F | Spain | Ms |
| Nicky | 28.0 | september | F | Jamaica | PhD |
| Wendy | 22.0 | november | F | Italy | Bs |
| Judith | 19.0 | february | F | France | Hs |
| James | 25.0 | july | M | USA | Bs |


Now back to the groupby function. It creates a new dataframe with the data grouped by the column you selected and summarized with the function you chose. If this seems complicated, here's an example, grouping the friends by Gender and getting the mean age.

In [137]:
#First, we need to separate the columns, into numeric columns, and the label we are trying to group them by.
#In this case, gender and age.
Friends_Copy = Friends_Dataframe.copy() #Create a copy of our original friends dataframe.
Friends_Copy = Friends_Copy.reset_index() #Reset the index back to generic numbers.

Gender_And_Age = Friends_Copy[["Age","Gender"]]

#Group by gender, and use the mean as the function.
Grouped_byGender =  Gender_And_Age.groupby(["Gender"]).mean()
Grouped_byGender.head()

| | Age |
|---|---|
| **Gender** | |
| F | 24.8 |
| M | 25.0 |


So the mean age for females is 24.8 and for males, since there's only one, 25.0.  

#### Now your turn.
Merge the friends dataframe and the dataframe of your friends into one. Create new columns if necessary! 

In [138]:
#Your code here

## Other useful things.

### Subsetting by a condition. 
In pandas, we can also select rows based on whether a condition is true for their values. For example, we can get all friends that were born in September like this:

In [139]:
#Use the loc indexer for dataframes to subset by a condition.
SeptemberBorn = Friends_Dataframe.loc[Friends_Dataframe["Birth Month"] == "september"]
SeptemberBorn

| | Age | Birth Month | Gender | Origin | Education |
|---|---|---|---|---|---|
| Rachel | 28.0 | september | F | Spain | Ms |
| Nicky | 28.0 | september | F | Jamaica | PhD |


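The same comparison operators that you would use in if statements (==, !=, <, >, <=, >=) apply here. To combine several conditions, wrap each one in parentheses and join them with & (and) or | (or) instead of the keywords and/or.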
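Here is a minimal sketch of subsetting by more than one condition at once. The table and the variable names are made up for the example. Each condition sits in its own parentheses and they are joined with `&` or `|`.

```python
import pandas as pd

# Hypothetical copy of two columns from the friends table.
friends = pd.DataFrame(
    {
        "Birth Month": ["january", "september", "september", "november", "february"],
        "Age": [27, 28, 28, 22, 19],
    },
    index=["Carly", "Rachel", "Nicky", "Wendy", "Judith"],
)

# Both conditions must hold (and).
september_over_25 = friends.loc[
    (friends["Birth Month"] == "september") & (friends["Age"] > 25)
]

# At least one condition must hold (or).
under_25_or_september = friends.loc[
    (friends["Age"] < 25) | (friends["Birth Month"] == "september")
]

print(september_over_25)
print(under_25_or_september)
```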

### Saving/Reading  a file.
Pandas lets you save your dataframe to a file easily. It supports several formats, but a lot of the time you'll be reading and writing CSVs.

In [144]:
#Save our dataframe.
Friends_Dataframe.to_csv("Friends.csv")

#Reading what we saved. 
Read_Friends = pd.read_csv("Friends.csv",index_col = 0) #Set index col to 0, so it sets the index to the names again.
Read_Friends

| | Age | Birth Month | Gender | Origin | Education |
|---|---|---|---|---|---|
| Carly | 27.0 | january | F | UK | Ms |
| Rachel | 28.0 | september | F | Spain | Ms |
| Nicky | 28.0 | september | F | Jamaica | PhD |
| Wendy | 22.0 | november | F | Italy | Bs |
| Judith | 19.0 | february | F | France | Hs |
| James | 25.0 | july | M | USA | Bs |


Remember to save your results if needed!
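To see why `index_col = 0` matters when reading the file back, here is a small sketch. It writes to an in-memory text buffer instead of a real file so that it runs anywhere, and the table and variable names are made up for the example.

```python
import io

import pandas as pd

# Hypothetical two-row table with names as the index.
friends = pd.DataFrame({"Age": [27, 22]}, index=["Carly", "Wendy"])

# Write the CSV into an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
friends.to_csv(buffer)

# Without index_col, the names come back as an ordinary column.
buffer.seek(0)
without_index_col = pd.read_csv(buffer)

# With index_col=0, the first column becomes the index again.
buffer.seek(0)
with_index_col = pd.read_csv(buffer, index_col=0)

print(list(without_index_col.columns))
print(list(with_index_col.index))
```

Without `index_col`, pandas labels the nameless first column "Unnamed: 0" and adds a fresh numbered index.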

## Exercise

Read the olives.csv dataset and visualize the first 5 columns.

In [146]:
#Your code here.

Rename the first column of the dataset.

In [147]:
# Your code here.

Which unique regions and areas are contained in the dataset?

In [148]:
#Your code here.

Clean the dataset, if it needs cleaning.

In [149]:
#Your code here.

Divide all numeric columns by 100, and group them by area string. You can use whatever function you want.

In [150]:
#Your code here.

Save the dataset you so proudly cleaned and created.

In [151]:
#Your code here.

## Further Reading

Pandas documentation: http://pandas.pydata.org/pandas-docs/stable/index.html