# Introduction to pandas

"pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language." -- [pandas web page](https://pandas.pydata.org/)

"pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series." -- Wikipedia

We will only need the most basic features of pandas, essentially just reading data and separating out some features. However, since a lot of data science uses pandas, it is worth knowing a little more than the bare minimum. 

The [10 minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html) tutorial is part of the official pandas User Guide. It is a short introduction to the main features of pandas and a good starting point for those who want to learn more of the details. 
w3schools also has a [section on pandas](https://www.w3schools.com/python/pandas/default.asp).

---

## pandas data structures

The following summarises the essential information from [Intro to data structures](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html):

- A *Series* is a 1D labelled array capable of holding any data type (integers, strings, floating point numbers, etc.). **A Series behaves much like the 1D arrays you know from NumPy.**

- A *DataFrame* is a 2D labelled data structure with columns of potentially different types. **You can think of a DataFrame as a table or as a spreadsheet. DataFrames are the most commonly used objects in pandas.**

Series and DataFrames are in fact valid arguments to most NumPy functions, provided they contain numerical values. In particular, NumPy's elementwise functions (`exp`, `sqrt`, `sin`, etc.) and many other NumPy functions can be called with a Series or a DataFrame as the argument.
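
As a minimal sketch of the two structures and of passing them to a NumPy function, the following builds a small Series and DataFrame by hand. The values, the index labels `a`, `b`, `c` and the column names `x`, `y` are made up for illustration and are not part of the iris data.

```python
import numpy as np
import pandas as pd

# A Series: a 1D labelled array (labels here are made up for illustration)
s = pd.Series([1.0, 4.0, 9.0], index=["a", "b", "c"])

# A DataFrame: a 2D labelled table, built here from a dict of columns
df = pd.DataFrame({"x": [1.0, 4.0, 9.0], "y": [16.0, 25.0, 36.0]})

# NumPy elementwise functions accept both and keep the labels
root_s = np.sqrt(s)     # still a Series with index a, b, c
root_df = np.sqrt(df)   # still a DataFrame with columns x, y

print(root_s)
print(root_df)
```

Note that the results are again pandas objects, so the row and column labels are preserved.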

---

To use pandas we first need to import it. We follow standard practice and alias it to `pd`. 

In [2]:
# Import NumPy, Matplotlib and pandas with their standard aliases

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

---
## Iris dataset

One of the most famous datasets for machine learning is the Iris dataset. 

"Originally published at UCI Machine Learning Repository: Iris Data Set, this small dataset from 1936 is often used for testing out machine learning algorithms and visualisations. Each row of the table represents an iris flower, including its species and dimensions of its botanical parts, sepal and petal, in centimeters."
-- The Iris Dataset on GitHub.

The iris data is in a file `iris.csv`. (csv stands for comma-separated values, although many csv files have entries separated by other characters, e.g. semicolons.) You should download this file from the Week 7 notebooks folder if you have not already done so. If you open `iris.csv` by double clicking on it from within JupyterLab, it will open in a new tab. The data will look like a table or spreadsheet, with columns `sepal_length`, `sepal_width`, `petal_length`, `petal_width` and `species`. Later we will look at the relationship between these quantities. For now all we want to do is read and manipulate the data.
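
For files whose entries are separated by something other than a comma, the pandas function for reading csv files, `pd.read_csv`, accepts a `sep` argument. The sketch below is only an illustration: instead of a file on disk it reads a short made-up text in the layout of the iris data, wrapped in `io.StringIO` so that the example is self-contained.

```python
import io
import pandas as pd

# Stand-in for a file on disk: two rows in the layout of the iris data,
# but with semicolons instead of commas (made-up sample for illustration)
text = "sepal_length;sepal_width;species\n5.1;3.5;setosa\n4.9;3.0;setosa\n"

# sep tells read_csv which character separates the entries
df_semi = pd.read_csv(io.StringIO(text), sep=";")

print(df_semi)
print(df_semi.shape)   # (number of rows, number of columns)
```

For the comma-separated `iris.csv` used in this notebook no `sep` argument is needed, since a comma is the default.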

---

### Reading in the data with pandas

The first thing we do is read in the data using `pd.read_csv` with the name of the file we want to read. 

In [3]:
# read the iris.csv file to dataframe iris
iris = pd.read_csv("iris.csv") 

Pandas reads the data from the file and returns it as a DataFrame.  Hence `iris` is a DataFrame. One can verify this by printing its type.

In [4]:
# print type() to verify that it is a pandas dataframe
print(type(iris))

<class 'pandas.core.frame.DataFrame'>


---

### Inspecting the data with pandas

The next thing we want to do is display the top rows of the DataFrame to check that it has been read correctly and that it looks reasonable. We display the top rows using `head()`, as shown in the cell below. 

In [8]:
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


Note again the similarity to a spreadsheet. (Place your cursor over the data and you should see the rows highlighted.) There are column labels across the top and there are row numbers (starting at zero) down the side. The column `species` does not contain numerical values but rather characters (strings). You should appreciate that with minimal effort we have read the data into Python.

---

The `head()` method takes an argument for the number of rows to display. The default is 5. The `tail()` method is similar except that it displays the bottom rows (default 5) of a DataFrame.
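
A minimal sketch of passing the number of rows to `head()` and `tail()`, using a small made-up table (the column names `n` and `square` are hypothetical and unrelated to the iris data):

```python
import pandas as pd

# Small made-up table with 8 rows, purely for illustration
df = pd.DataFrame({"n": range(8), "square": [k * k for k in range(8)]})

top3 = df.head(3)      # first 3 rows (row labels 0, 1, 2)
bottom2 = df.tail(2)   # last 2 rows (row labels 6, 7)

print(top3)
print(bottom2)
```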

**Exercise:** Edit the cell above to display the top 10 rows of `iris`. Then edit this to display the bottom 10 rows of `iris`. Try to display the top 5 rows and bottom 5 rows with `head()` and `tail()` in the same cell. You will find that only the final call is displayed. You need to execute the `head()` and `tail()` in different cells if you want both to appear.  However, there is a simpler way to see both the head and tail. In the code cell below, enter just `iris`.


In [9]:
iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
149,5.9,3.0,5.1,1.8,virginica

150 rows × 5 columns


In addition to displaying the head and tail, the output shows explicitly that there are 150 rows and 5 columns in the DataFrame. 

---
Two further useful methods for obtaining or confirming basic information about a DataFrame are `info()` and `describe()`. Try these.

In [10]:
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [11]:
iris.describe()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.054,3.758667,1.198667
std,0.828066,0.433594,1.76442,0.763161
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


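You can see that `describe()` provides a compact summary of the numerical columns. You can understand the basics of what is shown: `count` is the number of non-missing entries in each column (here equal to the number of rows, 150), `mean` is the mean value of each column, `std` is the standard deviation, and so on. 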
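
The `count` row of `describe()` counts only non-missing entries, so it can be smaller than the number of rows. The iris data has no missing values, so the sketch below uses a made-up column (hypothetically named `width`) with one missing value to show the difference.

```python
import numpy as np
import pandas as pd

# Made-up column with one missing value (NaN), purely for illustration
df = pd.DataFrame({"width": [1.0, 2.0, np.nan, 3.0]})

summary = df.describe()

print(len(df))                         # number of rows: 4
print(summary.loc["count", "width"])   # non-missing entries: 3.0
print(summary.loc["mean", "width"])    # mean of 1, 2, 3: 2.0
```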

---

### Extracting columns

For our purposes, the main thing we will do with pandas after reading the data is extract the columns that we need. Columns of a DataFrame can be referred to simply by their column label. For example, we can extract the `sepal_length` column as a pandas Series. (Recall that a Series is a 1D structure like a 1D array.) 

In [12]:
test_series = iris['sepal_length']
test_series.head()

0    5.1
1    4.9
2    4.7
3    4.6
4    5.0
Name: sepal_length, dtype: float64

The output from `head()` of a Series is plain text, so it is not as nicely formatted as the table displayed for a DataFrame. 

---

We copied the column `sepal_length` from `iris` to `test_series`. We will now copy all the columns except `sepal_length` to a DataFrame using `drop()` as follows.

In [13]:
test_dataframe = iris.drop(['sepal_length'], axis=1)
test_dataframe.head()

Unnamed: 0,sepal_width,petal_length,petal_width,species
0,3.5,1.4,0.2,setosa
1,3.0,1.4,0.2,setosa
2,3.2,1.3,0.2,setosa
3,3.1,1.5,0.2,setosa
4,3.6,1.4,0.2,setosa


The `['sepal_length']` argument to `drop()` says to drop the entries with the label `sepal_length`, and `axis=1` says that this label refers to a column. `test_dataframe` is a DataFrame since it is still two-dimensional data. What we have done is separate our data into two parts: `test_series` containing the `sepal_length` data and `test_dataframe` containing all the other data. 

This is the only type of data manipulation that we will actually need. However, in practice we will want to separate the species column from the `iris` DataFrame, not the `sepal_length`. We leave this for an exercise.
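
As a side note, the `drop()` call used above can also be written with the `columns=` keyword, which some readers find easier to read than `axis=1`. The sketch below compares the two forms on a two-row table whose values are taken from the first rows of the iris data shown earlier; the variable names `df`, `a` and `b` are chosen only for this illustration.

```python
import pandas as pd

# Two rows in the layout of the iris data (only three of the columns)
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9],
    "sepal_width": [3.5, 3.0],
    "species": ["setosa", "setosa"],
})

a = df.drop(["sepal_length"], axis=1)    # form used in this notebook
b = df.drop(columns=["sepal_length"])    # equivalent keyword form

print(a.equals(b))         # True: both give the same DataFrame
print(list(df.columns))    # df itself still has all three columns
```

In both forms `drop()` returns a new DataFrame and leaves the original unchanged.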

---

### Other manipulations

For illustration purposes we demonstrate two further features of pandas. As noted at the beginning of the notebook, NumPy functions accept pandas data as inputs. For example, we can compute the mean of the `sepal_length` data:

In [14]:
np.mean(test_series)

5.843333333333335

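Edit the previous cell to compute the mean of `test_dataframe`. The result depends on your version of pandas: older versions silently skip the `species` column and return the mean of each numerical column, whereas recent versions (2.0 and later) report an error because the `species` column is not numerical. The data in the species column is known as [categorical data](https://en.wikipedia.org/wiki/Categorical_variable) and it makes no sense to take its mean. 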
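
Because the species column is not numerical, it can help to restrict the mean to the numerical columns explicitly. The following is a minimal sketch on a three-row made-up table in the layout of `test_dataframe`; `numeric_only=True` and `select_dtypes` are pandas features that are not otherwise used in this notebook.

```python
import numpy as np
import pandas as pd

# Three rows in the layout of test_dataframe (two numerical columns only)
df = pd.DataFrame({
    "sepal_width": [3.5, 3.0, 3.2],
    "petal_length": [1.4, 1.4, 1.3],
    "species": ["setosa", "setosa", "setosa"],
})

# Option 1: ask pandas to skip the non-numerical columns
means = df.mean(numeric_only=True)

# Option 2: select the numerical columns first, then use NumPy column-wise
numeric = df.select_dtypes(include="number")
means_np = np.mean(numeric, axis=0)

print(means)
print(means_np)
```

Both options give one mean per numerical column, labelled by the column name.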

---

One can display the rows of a DataFrame sorted by the values in one of its columns. Here is an example of displaying `test_dataframe` sorted by `sepal_width` from smallest to largest.

In [15]:
test_dataframe.sort_values(by='sepal_width', ascending=True)

Unnamed: 0,sepal_width,petal_length,petal_width,species
60,2.0,3.5,1.0,versicolor
62,2.2,4.0,1.0,versicolor
119,2.2,5.0,1.5,virginica
68,2.2,4.5,1.5,versicolor
41,2.3,1.3,0.3,setosa
...,...,...,...,...
16,3.9,1.3,0.4,setosa
14,4.0,1.2,0.2,setosa
32,4.1,1.5,0.1,setosa
33,4.2,1.4,0.2,setosa


Try the above with `ascending=False`. 
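
Two details of `sort_values()` are worth seeing on a small example: the row labels travel with their rows, and the original DataFrame is left unchanged. The sketch below uses three rows with values taken from the iris data; the variable names are chosen only for this illustration.

```python
import pandas as pd

# Three rows with values taken from the iris data, relabelled 0, 1, 2
df = pd.DataFrame({
    "sepal_width": [3.5, 2.0, 4.2],
    "species": ["setosa", "versicolor", "setosa"],
})

# Sort from largest to smallest; the result is a new DataFrame
largest_first = df.sort_values(by="sepal_width", ascending=False)

print(largest_first)
print(largest_first.index.tolist())   # row labels follow the rows: [2, 0, 1]
print(df.index.tolist())              # df keeps its original order: [0, 1, 2]
```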

---

### Exercise

From the `iris` DataFrame, create a new DataFrame `X` containing all the columns except `species` and a Series `y` containing only the `species` column.  (This is the natural division of the data into two parts.)  Print the type of `X` and `y`. Then in separate cells display descriptions of `X` and `y` using `describe()`.

In [None]:
# your answer


---
The answers are given in the cells below.

In [16]:
X = iris.drop(['species'], axis=1)
y = iris['species']
print(type(X))
print(type(y))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


In [17]:
X.describe()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.054,3.758667,1.198667
std,0.828066,0.433594,1.76442,0.763161
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


In [18]:
y.describe()

count        150
unique         3
top       setosa
freq          50
Name: species, dtype: object

---