# 4. Pandas Basics

Pandas is a Python library designed to work with tabular data (a two-dimensional structure where columns and rows have meaning) that contains a mixture of data types. Its aim is to make many common types of data analysis straightforward while remaining flexible enough to perform complex data transformations.

Pandas is designed to do a few things very well. It is incredibly useful for grouping data, aggregating and processing these groups, and reshaping data. Support for time-series analyses in Pandas is also strong, with built-in date and time representations.

## 4.1 Data structure

### 4.1.1 Series

The basic data structure in Pandas is a Series. The Pandas Series contains three structures: 
- a numpy one-dimensional array of elements (therefore, all elements are of the same type), 
- a mapping of Series indexes (which can be numeric, as in numpy, but can also be strings) to positions within the numpy array (this is actually saved as a very efficient dictionary object),
- some additional header information.
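
The three parts listed above can be inspected directly. The following is a minimal sketch; the labels and values are made up for illustration.

```python
import pandas

# A Series with string labels as its index (illustrative data)
marks = pandas.Series([92, 52, 76], index=['anna', 'david', 'ibrahim'])

print(marks.to_numpy())   # the underlying one-dimensional NumPy array
print(marks.dtype)        # a single dtype shared by all elements
print(marks.index)        # the labels that map to positions in the array
print(marks['david'])     # look up a value by its label
print(marks.iloc[1])      # the same value, looked up by its position
```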

### 4.1.2 DataFrame

The most used data structure in Pandas is a DataFrame. A dataframe contains two dimensions: every column is a Series object, and every column has the same number of elements (rows).

Conceptually, a DataFrame behaves like a Python dictionary object (with some additional header information). The key in this dictionary is the column name, and the value is the Pandas Series. The actual internal storage is more elaborate, but this mental model is sufficient for this chapter.
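
As a minimal sketch of this dictionary-like view (the column names and values are illustrative), a DataFrame can be built from a dictionary of Series, and looking up a column name returns a Series again:

```python
import pandas

row_labels = [1, 2, 3]
series_by_name = {
    'Marks': pandas.Series([90, 52, 76], index=row_labels),
    'Pass': pandas.Series([True, False, True], index=row_labels),
}
frame = pandas.DataFrame(series_by_name)

marks_column = frame['Marks']   # dictionary-style lookup by column name
print(type(marks_column))       # each column is a Series
print(frame.dtypes)             # each column keeps its own dtype
```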

There are some advantages and disadvantages to how Pandas structures memory. First, by having each column be a numpy array, computations within each column are really fast and get all the benefits of numpy over python. Second, by having a dedicated and reasonably fast data structure to connect row indexes, computations across rows are reasonably fast. However, by storing additional index information for every column (it isn't really *that* bad, but is definitely extra) additional memory is needed to store a pandas DataFrame relative to a numpy multi-dimensional array with the same number of elements.


## 4.2 Relating Pandas DataFrames to other representations

### 4.2.1 Relationship to a Numpy two-dimensional array

A numpy two-dimensional array can be the same shape as a Pandas DataFrame. However, a numpy array can contain only one type of element, whereas each column in a Pandas DataFrame can have a different type.

This gives an advantage in processing speed for numpy arrays when data is appropriate for a multi-dimensional numpy array: when all elements are the same type. However, when data is a mixture of types, the flexibility of a Pandas DataFrame is often beneficial and provides functionality that is not easy with a Numpy array.
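
The difference can be seen in a small sketch (illustrative data): a NumPy array built from mixed values falls back to one common type, while a DataFrame keeps a separate type per column.

```python
import numpy
import pandas

# NumPy needs one element type, so the numbers are converted to text here
mixed = numpy.array([['Anna', 92], ['David', 52]])
print(mixed.dtype)

# A DataFrame keeps one dtype per column, so 'Marks' stays numeric
frame = pandas.DataFrame({'Name': ['Anna', 'David'], 'Marks': [92, 52]})
print(frame.dtypes)
print(frame['Marks'].mean())
```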


### 4.2.2 Relationship to the `data.frame` object in R

The two are very similar. Much of the design of Pandas was inspired by what works really well with R data frames.


### 4.2.3 Relationship to SQL and other databases

The two are similar. It is very straightforward to export a Pandas DataFrame to SQL (and import from SQL). Though they are represented in memory very differently, the basic structure is similar.

The major difference is that Pandas is (mostly) designed to work with data that fits within the memory of a single computer. SQL and other databases often provide functionality to work with much bigger datasets where only a portion of the data is being manipulated or used at any time. This is not the goal of Pandas.
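
A minimal sketch of the export and import round trip, assuming an in-memory SQLite database from the Python standard library. The table name `grades` and the index label `student_number` are hypothetical names chosen for this example.

```python
import sqlite3
import pandas

grades = pandas.DataFrame({'Name': ['Anna', 'David'], 'Marks': [92, 52]},
                          index=[42343, 23423])

conn = sqlite3.connect(':memory:')   # temporary database that lives in memory

# Export: the row index is written as a column named 'student_number'
grades.to_sql('grades', conn, index_label='student_number')

# Import: let the database do the filtering, then rebuild the index
passed = pandas.read_sql_query('SELECT * FROM grades WHERE Marks >= 55',
                               conn, index_col='student_number')
conn.close()
print(passed)
```

Here the database selects the rows, so only the matching portion of the table is loaded into a DataFrame.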



In [2]:
import numpy
import pandas

## 4.3 Creating Data Frames

### 4.3.1 From a numpy array

Including specifying the names of rows and columns:

In [3]:
# creating the Numpy array
grades = numpy.array([[90, 9.0],[52, 5.0],[76, 7.5],[68, 7.0],[84, 8.5]])

# creating a list of row names
student_numbers = [42343, 23423, 57567,54644, 34534]
   
# creating a list of column names
column_labels = ['Marks', 'Grade']
  
# creating the dataframe
df = pandas.DataFrame(data = grades, 
                      index = student_numbers, 
                      columns = column_labels)
df

Unnamed: 0,Marks,Grade
42343,90.0,9.0
23423,52.0,5.0
57567,76.0,7.5
54644,68.0,7.0
34534,84.0,8.5


### 4.3.2 Creating Pandas Dataframes From Python Lists

In [4]:
# creating list
students_info = [['Anna', 'van Rossum', 'F', 92, 9.0, True],
          ['David', 'Alberts', 'M', 52, 5.0, False],
         ['Ibrahim', 'Assad', 'M', 76, 7.5, True],
         ['Maria', 'Esposito', 'F', 68, 7.0, True],
         ['Sophie', 'van Dee', 'F', 84, 8.5, True]]
  
# creating a list of row names
student_numbers = [42343, 23423, 57567,54644, 34534]
   
# creating a list of column names
column_labels = ['Name', 'Surname', 'Gender', 'Marks', 'Grade', 'Pass']
  
# creating the dataframe
ADP_grades = pandas.DataFrame(data = students_info, 
                      index = student_numbers, 
                      columns = column_labels)
  
ADP_grades

Unnamed: 0,Name,Surname,Gender,Marks,Grade,Pass
42343,Anna,van Rossum,F,92,9.0,True
23423,David,Alberts,M,52,5.0,False
57567,Ibrahim,Assad,M,76,7.5,True
54644,Maria,Esposito,F,68,7.0,True
34534,Sophie,van Dee,F,84,8.5,True


### 4.3.3 Creating Pandas Dataframes From Dictionaries

In [6]:
ADP_grades = {'Name' : ['Anna', 'David', 'Ibrahim', 'Maria', 'Sophie'],
              'Surname' : ['van Rossum', 'Alberts', 'Assad', 'Esposito', 'van Dee'],
              'Gender' : ['F','M','M','F','F'],
             'Marks': [92, 52, 76, 68, 84],
             'Grade': [9.0, 5.0, 7.5, 7, 8.5],
             'Pass': [True, False, True, True, True]}

# creating a list of row names
student_numbers = [42343, 23423, 57567,54644, 34534]

ADP_grades = pandas.DataFrame(ADP_grades, index=student_numbers)

ADP_grades

Unnamed: 0,Name,Surname,Gender,Marks,Grade,Pass
42343,Anna,van Rossum,F,92,9.0,True
23423,David,Alberts,M,52,5.0,False
57567,Ibrahim,Assad,M,76,7.5,True
54644,Maria,Esposito,F,68,7.0,True
34534,Sophie,van Dee,F,84,8.5,True


### 4.3.4 Creating Pandas Dataframes From Files

In [7]:
# write the DataFrame to a CSV file, then read that file back into pandas
ADP_grades.to_csv('ADP_grades.csv')

df = pandas.read_csv('ADP_grades.csv', index_col=0)

df

Unnamed: 0,Name,Surname,Gender,Marks,Grade,Pass
42343,Anna,van Rossum,F,92,9.0,True
23423,David,Alberts,M,52,5.0,False
57567,Ibrahim,Assad,M,76,7.5,True
54644,Maria,Esposito,F,68,7.0,True
34534,Sophie,van Dee,F,84,8.5,True
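
The role of `index_col=0` can be sketched with an in-memory text buffer instead of a file on disk (the data is illustrative). Without it, the saved row labels come back as an ordinary column.

```python
import io
import pandas

grades = pandas.DataFrame({'Marks': [92, 52]}, index=[42343, 23423])

buffer = io.StringIO()   # stands in for a CSV file
grades.to_csv(buffer)

buffer.seek(0)
with_index = pandas.read_csv(buffer, index_col=0)   # first column becomes the index

buffer.seek(0)
without_index = pandas.read_csv(buffer)             # first column stays a data column

print(with_index)
print(without_index)
```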


Other input and output: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io

## 4.4 Simple Functions to work with Pandas Dataframes

### 4.4.1 Retrieving Index and Columns Labels

In [8]:
# Retrieving index labels
ADP_grades.index

Index([42343, 23423, 57567, 54644, 34534], dtype='int64')

In [9]:
# Retrieving column labels
ADP_grades.columns

Index(['Name', 'Surname', 'Gender', 'Marks', 'Grade', 'Pass'], dtype='object')

In [11]:
# changing the index
temp = ADP_grades.copy() # Creating a copy of the original dataframe
temp.set_index('Name') # Why is this not a good idea?

Unnamed: 0_level_0,Surname,Gender,Marks,Grade,Pass
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Anna,van Rossum,F,92,9.0,True
David,Alberts,M,52,5.0,False
Ibrahim,Assad,M,76,7.5,True
Maria,Esposito,F,68,7.0,True
Sophie,van Dee,F,84,8.5,True
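
One possible answer to the question in the cell above is that names are not guaranteed to be unique. A minimal sketch with made-up data in which two students share a first name:

```python
import pandas

people = pandas.DataFrame({'Name': ['Anna', 'David', 'Anna'],
                           'Marks': [92, 52, 61]},
                          index=[1, 2, 3])

# set_index returns a new DataFrame; `people` itself is left unchanged
by_name = people.set_index('Name')

print(people.index.is_unique)    # the numeric labels are unique
print(by_name.index.is_unique)   # the names are not
print(by_name.loc['Anna'])       # one label now selects two rows
```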


### 4.4.2 Retrieving some information about the dataframe
There are many simple functions that can be called on any DataFrame to provide you with additional information or details about the object.

You can use the `head(n)` and `tail(n)` functions to view the first `n` and last `n` rows of a DataFrame.

In [None]:
ADP_grades.head(3)

In [13]:
ADP_grades.tail(2)

Unnamed: 0,Name,Surname,Gender,Marks,Grade,Pass
54644,Maria,Esposito,F,68,7.0,True
34534,Sophie,van Dee,F,84,8.5,True


The `info()` function can tell you about the structure of the DataFrame, including the number of rows and columns, as well as the type of each column.

In [12]:
ADP_grades.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 42343 to 34534
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Name     5 non-null      object 
 1   Surname  5 non-null      object 
 2   Gender   5 non-null      object 
 3   Marks    5 non-null      int64  
 4   Grade    5 non-null      float64
 5   Pass     5 non-null      bool   
dtypes: bool(1), float64(1), int64(1), object(3)
memory usage: 245.0+ bytes



The `describe()` function provides descriptive statistics for the numeric columns in a DataFrame. As shown below, non-numeric columns are skipped.

In [14]:
ADP_grades.describe()

Unnamed: 0,Marks,Grade
count,5.0,5.0
mean,74.4,7.4
std,15.388307,1.557241
min,52.0,5.0
25%,68.0,7.0
50%,76.0,7.5
75%,84.0,8.5
max,92.0,9.0


The output of the `describe` function is also a dataframe:

In [15]:
df = ADP_grades.describe()
df.info()


<class 'pandas.core.frame.DataFrame'>
Index: 8 entries, count to max
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Marks   8 non-null      float64
 1   Grade   8 non-null      float64
dtypes: float64(2)
memory usage: 192.0+ bytes


In [16]:
df.index

Index(['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max'], dtype='object')

In [17]:
df.columns

Index(['Marks', 'Grade'], dtype='object')
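
Because the summary is itself a DataFrame, a single statistic can be picked out by its row and column labels and reused in later steps. A minimal sketch using the same marks and grades as above:

```python
import pandas

grades = pandas.DataFrame({'Marks': [92, 52, 76, 68, 84],
                           'Grade': [9.0, 5.0, 7.5, 7.0, 8.5]})

summary = grades.describe()

# Pick one statistic by row label ('75%') and column label ('Marks')
upper_quartile = summary.loc['75%', 'Marks']
print(upper_quartile)

# Use it as a threshold to filter the original rows
top = grades[grades['Marks'] >= upper_quartile]
print(top)
```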

## 4.5 Indexing data frames

As with any data structure, it is important to be able to access subsets of the dataframe. Pandas has a number of methods for accessing both rows and columns in a number of ways.


### 4.5.1 Attribute indexing

First, you can access a single, complete column of a Pandas DataFrame as an attribute, using the column name in place of a dictionary key. This returns a Series object (one column from the DataFrame).

This only works when the column name is a valid Python identifier (so `data.1` is not allowed, even if `data` has a column named 1) and when the name does not clash with an existing attribute (for example, `data.min` does not return a column named min, because every Pandas DataFrame already has a `min()` method).
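
Both restrictions can be seen in a small sketch. The DataFrames `data` and `numbered` are hypothetical examples with deliberately awkward column names.

```python
import pandas

data = pandas.DataFrame({'min': [1, 2, 3], 'max': [4, 5, 6]})

# Attribute access finds the DataFrame method, not the column
print(callable(data.min))

# Bracket notation always refers to the column
print(data['min'])

# A column named with a number can only be reached with brackets
numbered = pandas.DataFrame({1: [10, 20]})
print(numbered[1])
```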

In [18]:
ADP_grades

Unnamed: 0,Name,Surname,Gender,Marks,Grade,Pass
42343,Anna,van Rossum,F,92,9.0,True
23423,David,Alberts,M,52,5.0,False
57567,Ibrahim,Assad,M,76,7.5,True
54644,Maria,Esposito,F,68,7.0,True
34534,Sophie,van Dee,F,84,8.5,True


In [19]:
# Selection by attribute
ADP_grades.Grade

42343    9.0
23423    5.0
57567    7.5
54644    7.0
34534    8.5
Name: Grade, dtype: float64

The attribute indexing can also be used for assignment, but only if the column already exists:

In [20]:
tmp = ADP_grades.copy()
tmp.head()

Unnamed: 0,Name,Surname,Gender,Marks,Grade,Pass
42343,Anna,van Rossum,F,92,9.0,True
23423,David,Alberts,M,52,5.0,False
57567,Ibrahim,Assad,M,76,7.5,True
54644,Maria,Esposito,F,68,7.0,True
34534,Sophie,van Dee,F,84,8.5,True


In [21]:
tmp.Marks = 42
tmp.head()

Unnamed: 0,Name,Surname,Gender,Marks,Grade,Pass
42343,Anna,van Rossum,F,42,9.0,True
23423,David,Alberts,M,42,5.0,False
57567,Ibrahim,Assad,M,42,7.5,True
54644,Maria,Esposito,F,42,7.0,True
34534,Sophie,van Dee,F,42,8.5,True
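
A minimal sketch of what happens when the column does not exist yet. The column name `Bonus` is made up for this example.

```python
import warnings
import pandas

tmp = pandas.DataFrame({'Marks': [92, 52]})

# Assigning through an attribute updates an existing column
tmp.Marks = 42

# Assigning to a name that is not a column only sets a plain Python attribute
with warnings.catch_warnings():
    warnings.simplefilter('ignore')   # Pandas may warn about this assignment
    tmp.Bonus = [1, 1]
columns_after_attribute = tmp.columns.tolist()
print(columns_after_attribute)        # 'Bonus' is not among the columns

# Bracket notation does create the new column
tmp['Bonus'] = [1, 1]
print(tmp.columns.tolist())
```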


### Selection by isin

We can also use attribute indexing in combination with the `isin` function to retrieve records with specific values:

In [22]:
# column values
tmp = ADP_grades[ADP_grades.Gender.isin(['M'])]
tmp

Unnamed: 0,Name,Surname,Gender,Marks,Grade,Pass
23423,David,Alberts,M,52,5.0,False
57567,Ibrahim,Assad,M,76,7.5,True


In [23]:
# row indexes
tmp = ADP_grades[ADP_grades.index.isin([23423,57567])]
tmp

Unnamed: 0,Name,Surname,Gender,Marks,Grade,Pass
23423,David,Alberts,M,52,5.0,False
57567,Ibrahim,Assad,M,76,7.5,True


### 4.5.2 Numpy-style matrix indexing

Much like numpy, the `[]` notation can be used for indexing columns and rows. 

#### Columns

For columns, you can specify a single column name and the object returned will be a Series:

In [24]:
# Numpy style Selection for columns
ADP_grades['Marks']

42343    92
23423    52
57567    76
54644    68
34534    84
Name: Marks, dtype: int64

Or, you can specify a list of column labels to retrieve several columns:

In [25]:
# Numpy style Selection for columns
ADP_grades[['Marks', 'Grade']]

Unnamed: 0,Marks,Grade
42343,92,9.0
23423,52,5.0
57567,76,7.5
54644,68,7.0
34534,84,8.5


#### Rows

When using the `[]` notation to index rows, you cannot specify a single index (this is reserved for columns); you must specify a range (a slice):

In [26]:
# Numpy style Selection for rows
ADP_grades[0:2]

Unnamed: 0,Name,Surname,Gender,Marks,Grade,Pass
42343,Anna,van Rossum,F,92,9.0,True
23423,David,Alberts,M,52,5.0,False


In [27]:
ADP_grades[0::2]

Unnamed: 0,Name,Surname,Gender,Marks,Grade,Pass
42343,Anna,van Rossum,F,92,9.0,True
57567,Ibrahim,Assad,M,76,7.5,True
34534,Sophie,van Dee,F,84,8.5,True


In [28]:
# boolean indexing of rows also works
ADP_grades[(ADP_grades['Gender'] == "F") & (ADP_grades['Marks'] > 70)]

Unnamed: 0,Name,Surname,Gender,Marks,Grade,Pass
42343,Anna,van Rossum,F,92,9.0,True
34534,Sophie,van Dee,F,84,8.5,True
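
Boolean conditions can also be combined with `|` (or) and negated with `~` (not). As with `&`, each comparison needs its own parentheses or its own variable. A minimal sketch with illustrative data:

```python
import pandas

grades = pandas.DataFrame({'Gender': ['F', 'M', 'M', 'F'],
                           'Marks': [92, 52, 76, 68]},
                          index=[1, 2, 3, 4])

is_female = grades['Gender'] == 'F'
high_marks = grades['Marks'] > 70

either = grades[is_female | high_marks]       # rows matching at least one condition
neither = grades[~(is_female | high_marks)]   # rows matching none of them

print(either)
print(neither)
```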


#### Rows and columns

Unlike with numpy, you cannot specify a subset along multiple dimensions within a single set of []s:

In [132]:
# doesn't work!
#ADP_grades[:3, 'Marks']

Instead, the two subsets must be chained together in separate []s. We will see that this is not the optimal way of doing these indexing calls, but it is _allowed_.

In [30]:
# does work!
ADP_grades[:3]['Marks']

42343    92
23423    52
57567    76
Name: Marks, dtype: int64

In [31]:
# also works!
ADP_grades['Marks'][:3]

42343    92
23423    52
57567    76
Name: Marks, dtype: int64
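
As a sketch of the alternative, the same selection can be written as a single call with the `.loc` indexer covered in the next section. For reading, both forms give the same result; for assignment, the single call is the more dependable choice because it operates directly on the DataFrame. The data below is illustrative.

```python
import pandas

grades = pandas.DataFrame({'Name': ['Anna', 'David', 'Ibrahim', 'Maria'],
                           'Marks': [92, 52, 76, 68]},
                          index=[42343, 23423, 57567, 54644])

# Chained: first a row slice, then a column lookup on the intermediate result
chained = grades[:3]['Marks']

# Single call: row labels and column label in one indexing operation
single = grades.loc[grades.index[:3], 'Marks']
print(chained.equals(single))

# Assignment through one .loc call, done on a copy to keep `grades` intact
updated = grades.copy()
updated.loc[updated.index[:3], 'Marks'] = 0
print(updated['Marks'].tolist())
```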

### 4.5.3 Selection using labels

The .loc attribute can be used to index:
- A single row with a specified index
- Multiple rows 
- Multiple rows and columns

In [32]:
ADP_grades

Unnamed: 0,Name,Surname,Gender,Marks,Grade,Pass
42343,Anna,van Rossum,F,92,9.0,True
23423,David,Alberts,M,52,5.0,False
57567,Ibrahim,Assad,M,76,7.5,True
54644,Maria,Esposito,F,68,7.0,True
34534,Sophie,van Dee,F,84,8.5,True


#### Indexing a single row

When a single label is specified, it is assumed to be an index label (of a row).

In [33]:
# Selection by Label for rows
ADP_grades.loc[57567]

Name       Ibrahim
Surname      Assad
Gender           M
Marks           76
Grade          7.5
Pass          True
Name: 57567, dtype: object

#### Indexing Multiple Rows


In [34]:
# Selection by Label for rows
ADP_grades.loc[57567:34534]

Unnamed: 0,Name,Surname,Gender,Marks,Grade,Pass
57567,Ibrahim,Assad,M,76,7.5,True
54644,Maria,Esposito,F,68,7.0,True
34534,Sophie,van Dee,F,84,8.5,True


#### Indexing Multiple Rows and Columns

Looking for a label in a column requires including the full range of rows (:) and the comma:

In [35]:
# Selection by Label for columns
ADP_grades.loc[:,['Name', 'Grade', 'Pass']]

Unnamed: 0,Name,Grade,Pass
42343,Anna,9.0,True
23423,David,5.0,False
57567,Ibrahim,7.5,True
54644,Maria,7.0,True
34534,Sophie,8.5,True


In [36]:
# Selection by labels for rows and columns
ADP_grades.loc[57567:54644,['Name', 'Grade', 'Pass']]

Unnamed: 0,Name,Grade,Pass
57567,Ibrahim,7.5,True
54644,Maria,7.0,True


#### Slice of labels

In [37]:
ADP_grades.loc[42343:57567, 'Name':'Grade']

Unnamed: 0,Name,Surname,Gender,Marks,Grade
42343,Anna,van Rossum,F,92,9.0
23423,David,Alberts,M,52,5.0
57567,Ibrahim,Assad,M,76,7.5


#### Boolean array

This is slightly different from the above because, instead of specifying labels, a boolean value is specified for each label:

In [38]:
ADP_grades.loc[:, [True, False, False, False, True, True]]

Unnamed: 0,Name,Grade,Pass
42343,Anna,9.0,True
23423,David,5.0,False
57567,Ibrahim,7.5,True
54644,Maria,7.0,True
34534,Sophie,8.5,True


### 4.5.4 Selection by location

As with the label selection, you can select via position (or location). You can think of this as the _implicit_ position location that does not change when you rename the _labels_.

To index based on location, use the .iloc attribute. This can be done via:
- a single position (e.g. 14, 2)
- a list of positions ([1, 3, 2])
- a slice object with integers (3:42) (Note: just like in Numpy, only lower is included, upper is excluded. This is different than loc[]!)
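
The difference in how the two indexers treat the end of a slice can be sketched as follows, using the same student numbers as row labels:

```python
import pandas

grades = pandas.DataFrame({'Marks': [92, 52, 76, 68, 84]},
                          index=[42343, 23423, 57567, 54644, 34534])

# Label slice: both end points are included
by_label = grades.loc[42343:57567]

# Position slice: the upper bound is excluded, as in NumPy
by_position = grades.iloc[0:2]

print(len(by_label))      # three rows
print(len(by_position))   # two rows
```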

In [39]:
ADP_grades

Unnamed: 0,Name,Surname,Gender,Marks,Grade,Pass
42343,Anna,van Rossum,F,92,9.0,True
23423,David,Alberts,M,52,5.0,False
57567,Ibrahim,Assad,M,76,7.5,True
54644,Maria,Esposito,F,68,7.0,True
34534,Sophie,van Dee,F,84,8.5,True


#### single position

In [40]:
ADP_grades.iloc[3]

Name          Maria
Surname    Esposito
Gender            F
Marks            68
Grade           7.0
Pass           True
Name: 54644, dtype: object

In [41]:
ADP_grades.iloc[3, 2]

'F'

#### a slice of positions

In [42]:
# Selection by position for rows and columns (two slices)
ADP_grades.iloc[3:5, 0:5]

Unnamed: 0,Name,Surname,Gender,Marks,Grade
54644,Maria,Esposito,F,68,7.0
34534,Sophie,van Dee,F,84,8.5


In [43]:
# Selection by position for rows (slice) and columns (list of positions)
ADP_grades.iloc[3:5, [1,3,5]]

Unnamed: 0,Surname,Marks,Pass
54644,Esposito,68,True
34534,van Dee,84,True


### 4.5.5 Selection by sample

Often in data science you want to sample a random set of rows (or columns) from your data. This is easy within Pandas using the `sample()` method. By default, it assumes you are sampling without replacement from rows:

In [60]:
#Selection by sampling
ADP_grades.sample(3)

Unnamed: 0,Name,Surname,Gender,Marks,Grade,Pass
54644,Maria,Esposito,F,68,7.0,True
23423,David,Alberts,M,52,5.0,False
57567,Ibrahim,Assad,M,76,7.5,True


In [69]:
# sample n = 3 random rows with replacement
ADP_grades.sample(3, replace=True)

Unnamed: 0,Name,Surname,Gender,Marks,Grade,Pass
57567,Ibrahim,Assad,M,76,7.5,True
54644,Maria,Esposito,F,68,7.0,True
54644,Maria,Esposito,F,68,7.0,True


In [95]:
# sample 30% of all rows
ADP_grades.sample(frac=0.30)

Unnamed: 0,Name,Surname,Gender,Marks,Grade,Pass
57567,Ibrahim,Assad,M,76,7.5,True
42343,Anna,van Rossum,F,92,9.0,True


In [105]:
# sample 2 columns
ADP_grades.sample(2, axis=1).head()

Unnamed: 0,Name,Gender
42343,Anna,F
23423,David,M
57567,Ibrahim,M
54644,Maria,F
34534,Sophie,F
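
Because the rows are drawn at random, the outputs above will differ between runs. A minimal sketch of how to make a sample repeatable with the `random_state` parameter of `sample()`; the seed value 0 is an arbitrary choice.

```python
import pandas

grades = pandas.DataFrame({'Marks': [92, 52, 76, 68, 84]},
                          index=[42343, 23423, 57567, 54644, 34534])

# The same seed gives the same sample every time
first = grades.sample(3, random_state=0)
second = grades.sample(3, random_state=0)
print(first.equals(second))

# A fraction is converted to a whole number of rows (30% of 5 rows gives 2)
fraction = grades.sample(frac=0.30, random_state=0)
print(len(fraction))
```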


# Exercises

We are going to use the Iris dataset for this exercise. We first need to load the iris data from the iris.csv file (available in Canvas).

In [112]:
# read a data file into pandas
data = pandas.read_csv("iris.csv", index_col=0)
data

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,type
0,4.9,3.0,1.4,0.2,setosa
1,4.7,3.2,1.3,0.2,setosa
2,4.6,3.1,1.5,0.2,setosa
3,5.0,3.6,1.4,0.2,setosa
4,5.4,3.9,1.7,0.4,setosa
...,...,...,...,...,...
144,6.7,3.0,5.2,2.3,virginica
145,6.3,2.5,5.0,1.9,virginica
146,6.5,3.0,5.2,2.0,virginica
147,6.2,3.4,5.4,2.3,virginica


#### Exercise 1

Find the index of the maximum sepal length for each iris type in the dataset (so the index of the setosa, versicolor or virginica plant with the maximum sepal length of its respective type).

**Hint:** Have a look at Pandas documentation for `idxmax()` function.

In [125]:
data.groupby("type").idxmax()

Unnamed: 0_level_0,sepal_length,sepal_width,petal_length,petal_width
type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
setosa,13,14,23,42
versicolor,49,84,82,69
virginica,130,116,117,99
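
To restrict the result to a single measurement, a column can be selected before calling `idxmax()`. A minimal sketch on a made-up table; the column names `kind` and `length` are hypothetical.

```python
import pandas

plants = pandas.DataFrame({'kind': ['a', 'a', 'b', 'b'],
                           'length': [1.0, 3.0, 2.5, 2.0]},
                          index=[10, 11, 12, 13])

# For each group, the row label at which 'length' is largest
longest = plants.groupby('kind')['length'].idxmax()
print(longest)

# The labels can be passed to .loc to retrieve the full rows
print(plants.loc[longest])
```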


#### Exercise 2

Find the most common iris type amongst irises that have a sepal width larger than or equal to the 75th percentile of the entire dataset's sepal widths.

Hint: use the .describe() function - it returns a DataFrame!

Hint: to get the counts of values in a column, you can use the function .value_counts()

In [147]:
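# inspect the summary statistics to find the 75% value of sepal_width
data.describe()

# keep the rows at or above the threshold used here (3.3)
count = data[data["sepal_width"] >= 3.3]
# count how often each type occurs in that subset
count.value_counts("type") 
# only the last expression of a cell is displayed in the notebook
count.describe()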

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
count,42.0,42.0,42.0,42.0
mean,5.542857,3.592857,2.492857,0.72619
std,0.78185,0.267229,1.838928,0.839006
min,4.6,3.3,1.0,0.1
25%,5.025,3.4,1.4,0.2
50%,5.25,3.5,1.5,0.3
75%,5.775,3.8,1.9,0.575
max,7.9,4.4,6.7,2.5


#### Exercise 3

Find the index of virginica irises whose sepal width is smaller than the minimum sepal width of setosa irises.

In [171]:
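# inspect the minimum sepal_width of setosa irises (row 3 of describe() is 'min')
setosa = data[data["type"] == "setosa"]
setosa.describe()[3:4]

# virginica irises with a sepal width below the threshold used here (2.3)
narrow_virginica = data[(data["type"] == "virginica") & (data["sepal_width"] < 2.3)]
display(narrow_virginica)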

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,type
118,6.0,2.2,5.0,1.5,virginica
