# Pandas

Author: Manuel Dalcastagnè. This work is licensed under a CC attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/).

## Introduction

Pandas is a Python library which provides high-performance data structures and data analysis tools.

Among its main features:

* a fast and efficient `DataFrame` object, used to represent and manipulate data in a table-like fashion
* functions to read and write data in different formats, like CSV and text files
* aggregation and transformation of data with SQL-like syntax through group-by, merge and join operations
* highly optimized performance, with critical code written in C

To use `pandas` you need to import the module, using for example:

In [3]:
import pandas as pd

## Creating series and dataframes

A `Series` is a 1-dimensional labeled array which can contain any type of data; each value is associated with a label in an index, which can be user-defined. It can be created:
* from Python lists and dictionaries
* from NumPy 1-d arrays
* from scalars

A `DataFrame` is a 2-dimensional data structure whose columns can have different types. Possible ways to create a dataframe are:
* from Python lists and dictionaries
* from Series
* from NumPy 1-d and 2-d arrays
* from other DataFrames
* by reading data from files

### Creating a series

In [56]:
import numpy as np

# from a NumPy array
pd.Series(np.array([1,2,3,4,5]), index=['a', 'b', 'c', 'd', 'e'])

a    1
b    2
c    3
d    4
e    5
dtype: int64

In [58]:
# from a list
pd.Series([1,2,3,4,5], index=[1,2,3,4,5])

1    1
2    2
3    3
4    4
5    5
dtype: int64

In [59]:
# from a list (without providing an index)
pd.Series([1,2,3,4,5])

0    1
1    2
2    3
3    4
4    5
dtype: int64
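
The list above also mentions dictionaries and scalars, which the cells in this section do not cover. A minimal sketch of both cases follows; the variable names `s_dict` and `s_scalar` are only illustrative:

```python
import pandas as pd

# from a dictionary: the keys become the index labels
s_dict = pd.Series({'a': 1, 'b': 2, 'c': 3})

# from a scalar: the value is repeated once for every index label
s_scalar = pd.Series(7, index=['x', 'y', 'z'])

print(s_dict)
print(s_scalar)
```

When a scalar is used, the length of the series is determined by the index that is passed in.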

### Creating a dataframe from a list of tuples

To combine several lists into a single list of tuples, we can use the `zip` function:

In [194]:
heights = [160,188,175,174,155]
weights = [60,80,78,81,50]
newlist = list(zip(heights,weights))
print(newlist)

[(160, 60), (188, 80), (175, 78), (174, 81), (155, 50)]


In [28]:
df = pd.DataFrame(data = newlist, columns=['Heights', 'Weights'])
print(df)

   Heights  Weights
0      160       60
1      188       80
2      175       78
3      174       81
4      155       50
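
A list of tuples is only one of the possible inputs. As a rough sketch, the same kind of table can also be built from a dictionary of lists or from Series; the names `df_from_dict` and `df_from_series` are illustrative, and only three rows are used to keep the example short:

```python
import pandas as pd

heights = [160, 188, 175]
weights = [60, 80, 78]

# from a dictionary of lists: each key becomes a column name
df_from_dict = pd.DataFrame({'Heights': heights, 'Weights': weights})

# from Series: each Series becomes a column, aligned on its index
df_from_series = pd.DataFrame({'Heights': pd.Series(heights),
                               'Weights': pd.Series(weights)})

print(df_from_dict)
print(df_from_dict.equals(df_from_series))
```

With a dictionary the column names come from the keys, so the `columns` parameter is not needed.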


### Exporting a dataframe to a CSV file

To export a dataframe to a CSV file, use the `to_csv` method. The `header` and `index` parameters control whether the column names and the row index are written to the file:

In [35]:
df.to_csv("data.csv", header=True, index=False)

In [36]:
df.to_csv("data_noheader.csv", header=False, index=False)
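
To see what these parameters do without opening the files, note that `to_csv` returns the CSV text as a string when no file path is given. The sketch below relies on that behaviour; the small dataframe `df_demo` is made up for the example:

```python
import pandas as pd

df_demo = pd.DataFrame({'Heights': [160, 188], 'Weights': [60, 80]})

# with no file path, to_csv returns the CSV text instead of writing a file
with_header = df_demo.to_csv(header=True, index=False)
without_header = df_demo.to_csv(header=False, index=False)
with_index = df_demo.to_csv(header=True, index=True)

print(with_header)
print(without_header)
print(with_index)
```

With `index=True` the row labels are written as an extra first column, whose header cell is empty.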

### Importing a dataframe from a CSV file

To import a dataframe from a CSV file, use the `read_csv` function. If the file has no header row, use the `names` parameter to specify the column names manually:

In [26]:
df = pd.read_csv("data.csv")
print(df)

   Heights  Weights
0      160       60
1      188       80
2      175       78
3      174       81
4      155       50


In [60]:
df = pd.read_csv("data_noheader.csv", names = ["Heights","Weights"])
print(df)

   Heights  Weights
0      160       60
1      188       80
2      175       78
3      174       81
4      155       50
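
The effect of `names` is easier to see by comparing it with the alternatives. The following sketch reads the same header-less text in three ways; it uses `io.StringIO` in place of a file on disk, and the variable names are illustrative:

```python
import io
import pandas as pd

csv_text = "160,60\n188,80\n175,78\n"

# names= supplies the column labels for data that has no header row
df_named = pd.read_csv(io.StringIO(csv_text), names=['Heights', 'Weights'])

# header=None keeps the first line as data and numbers the columns 0, 1, ...
df_numbered = pd.read_csv(io.StringIO(csv_text), header=None)

# with neither option, the first data row is consumed as the header
df_wrong = pd.read_csv(io.StringIO(csv_text))

print(df_named)
print(df_numbered)
print(df_wrong)
```

The last call raises no error but loses one row of data, which is why a header-less file should always be read with `names` or `header=None`.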


## Selecting and manipulating dataframes

In [149]:
# to select a column
df['Heights']

0    160
1    188
2    175
3    174
4    155
Name: Heights, dtype: int64

In [166]:
# to select a row by numerical index
df.iloc[0,:]

Heights    160
Weights     60
Name: 0, dtype: int64

In [165]:
# to select a column by numerical index
df.iloc[:,0]

0    160
1    188
2    175
3    174
4    155
Name: Heights, dtype: int64

In [159]:
# to select multiple rows and multiple columns
df.iloc[0:3,0:2]

   Heights  Weights
0      160       60
1      188       80
2      175       78


In [146]:
# to select rows using data slicing
df[0:2]

   Heights  Weights
0      160       60
1      188       80


In [168]:
# to get a boolean mask
mask = df['Heights'] > 170
mask

0    False
1     True
2     True
3     True
4    False
Name: Heights, dtype: bool

In [190]:
# to select using a boolean mask
df[mask]

   Heights  Weights
1      188       80
2      175       78
3      174       81
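
Boolean masks can also be combined. The sketch below assumes a dataframe with the same columns as above (rebuilt here as `df_demo` so that the example is self-contained) and uses `&` (and) and `~` (not); each condition needs its own parentheses:

```python
import pandas as pd

df_demo = pd.DataFrame({'Heights': [160, 188, 175, 174, 155],
                        'Weights': [60, 80, 78, 81, 50]})

# rows where both conditions hold
tall_and_lighter = df_demo[(df_demo['Heights'] > 170) & (df_demo['Weights'] < 81)]

# rows where the condition does not hold
not_tall = df_demo[~(df_demo['Heights'] > 170)]

print(tall_and_lighter)
print(not_tall)
```

The Python keywords `and`, `or` and `not` do not work element-wise on a Series, so the operators `&`, `|` and `~` are used instead.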


In [124]:
# to insert a column into a dataframe (here at position 0)
df.insert(0,'Age',[20,34,18,38,46])
df

   Age  Heights  Weights
0   20      160       60
1   34      188       80
2   18      175       78
3   38      174       81
4   46      155       50


In [125]:
# to remove a column from a dataframe
df.pop('Age')
df

   Heights  Weights
0      160       60
1      188       80
2      175       78
3      174       81
4      155       50


In [92]:
# to apply an operation to a column
# (this returns a new Series; use -= instead to modify the column in place)
df['Heights'] - 100

0    60
1    88
2    75
3    74
4    55
Name: Heights, dtype: int64

## Getting information about the contents of a dataframe

In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
Heights    5 non-null int64
Weights    5 non-null int64
dtypes: int64(2)
memory usage: 160.0 bytes


In [42]:
df.head(2)

   Heights  Weights
0      160       60
1      188       80


In [191]:
df.tail(2)

   Heights  Weights
3      174       81
4      155       50
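
Besides `info`, `head` and `tail`, a few other attributes and methods give a quick overview of a dataframe. A short sketch, using a made-up dataframe `df_demo` with the same columns as above:

```python
import pandas as pd

df_demo = pd.DataFrame({'Heights': [160, 188, 175, 174, 155],
                        'Weights': [60, 80, 78, 81, 50]})

print(df_demo.shape)       # (number of rows, number of columns)
print(df_demo.dtypes)      # data type of each column
print(df_demo.describe())  # count, mean, std, min, quartiles and max per column

summary = df_demo.describe()
```

The result of `describe` is itself a dataframe, so single statistics can be selected from it, for example with `summary.loc['mean', 'Heights']`.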


## Further reading

For more advanced operations on dataframes with pandas (merge, join, group-by, ...), see:
* http://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html
* http://pandas.pydata.org/pandas-docs/stable/

# EXERCISE 6

Given the iris dataset CSV file and a new unseen vector representing a flower, define a function that classifies the new flower according to the class of its nearest neighbor.

TIPS:
* use the Euclidean distance as the distance metric
* you can download the iris dataset and find more information about it at https://archive.ics.uci.edu/ml/datasets/iris

In [195]:
new_flower = [5.2,2.3,3.2,1.1]
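
As a starting point, the sketch below shows one possible way to find the nearest row of a dataframe with pandas. It is not a reference solution: the function name `nearest_neighbor_class`, its parameters and the toy data are all invented for illustration, and loading the iris CSV file with the right column names is left to the reader.

```python
import numpy as np
import pandas as pd

def nearest_neighbor_class(data, feature_columns, label_column, new_point):
    """Return the label of the row closest to new_point (Euclidean distance)."""
    # difference between every row and the new point, one column per feature
    differences = data[feature_columns] - np.array(new_point)
    # Euclidean distance: square root of the sum of squared differences
    distances = np.sqrt((differences ** 2).sum(axis=1))
    # idxmin returns the index label of the row with the smallest distance
    return data.loc[distances.idxmin(), label_column]

# toy rows invented for illustration only (not taken from the iris dataset)
toy = pd.DataFrame({'x': [0.0, 1.0, 5.0],
                    'y': [0.0, 1.0, 5.0],
                    'label': ['near_origin', 'near_origin', 'far']})

print(nearest_neighbor_class(toy, ['x', 'y'], 'label', [4.0, 4.5]))
```

For the exercise, the same idea would be applied to the four numeric columns of the iris dataframe, with `new_flower` as the new point.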