# Introduction to Pandas

---

**Pandas** is a Python library for manipulating, cleaning, and querying data. It was created by Wes McKinney in 2008 and is an open source project. Pandas has quickly become the de facto library for representing relational data among data scientists. It is a powerful yet intuitive library for data munging and analysis, sometimes called **Excel on Steroids**.


### Lecture outline

---

* Pandas Series


* Pandas DataFrame


* Shape, Size, type, and Dimension of DataFrame and Series


* Index, Selection, Filtering


* Re(Set)-index, Dropping Entries, Axis


* Sorting


* Column Rename, Reorder, Insertion, Deletion


* Arithmetic Operations


* Unique Values and Value Counts


* Statistics with DataFrame

### Homework:

[101 Pandas Exercises for Data Analysis](https://www.machinelearningplus.com/python/101-pandas-exercises-python/)

In [1]:
import pandas as pd

import numpy as np

## Pandas Series

---

The Series is one of the core data structures in Pandas. We can picture it as a two-column structure, where the first column is a special index column and the second holds the actual data. The data column has a label of its own, which can be retrieved using the `.name` attribute. More formally, a Series is a one-dimensional array-like object containing a sequence of values and an associated array of index labels.

Think of a Series as a single column; two or more Series sharing an index make up a DataFrame.

![alt text](images/series_new.png "Title")

In [2]:
# Simple Series object

series = pd.Series([4, 7, -5, 3], name="First_Series")

series

0    4
1    7
2   -5
3    3
Name: First_Series, dtype: int64

In [3]:
pd.DataFrame(series) # Convert Series into DataFrame

Unnamed: 0,First_Series
0,4
1,7
2,-5
3,3


In [4]:
# Extract actual values (data) by using .values attribute

series.values

array([ 4,  7, -5,  3])

In [5]:
series.tolist() # Returns Python list

[4, 7, -5, 3]

In [6]:
# Extract the index column by using .index attribute

series.index

RangeIndex(start=0, stop=4, step=1)

In [7]:
series.name # Returns Series name

'First_Series'

In [8]:
# Select particular value from a Series using its index

series[2]

-5

`Pandas Series` can be constructed from Python data structures such as List, Dict, and others.

In [9]:
# Construct Series from dictionary


students_scores = {'Alice': 'Physics',
                   'Jack': 'Chemistry',
                   'Molly': 'English'}


pd.Series(students_scores)

Alice      Physics
Jack     Chemistry
Molly      English
dtype: object

In [10]:
# Construct Series from list of tuples

students = [("Alice","Brown"), ("Jack", "White"), ("Molly", "Green")]


pd.Series(students)

0    (Alice, Brown)
1     (Jack, White)
2    (Molly, Green)
dtype: object

#### Reference

[Series methods and attributes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html)

## Pandas DataFrame

---

The DataFrame data structure is the heart of the Pandas library. It is the primary object that we will be working with in data analysis and cleaning tasks. A DataFrame represents a rectangular table of data and contains an ordered collection of columns. Conceptually, it is a two-dimensional Series object with one index and multiple columns. Hence, we can think of it as a stack of Series.


> While a DataFrame is physically two-dimensional, we can use it to represent higher dimensional data in a tabular format using hierarchical indexing.
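
As a minimal sketch of the hierarchical indexing mentioned above, the following builds a DataFrame whose rows are keyed by a `(state, year)` pair. The variable names and the population figures are illustrative assumptions, not part of the lecture data.

```python
import pandas as pd

# Hypothetical data: population (in millions) keyed by (state, year).
index = pd.MultiIndex.from_tuples(
    [("Nevada", 2001), ("Nevada", 2002), ("Ohio", 2000), ("Ohio", 2001)],
    names=["state", "year"],
)
population = pd.DataFrame({"population": [2.4, 2.9, 1.5, 1.7]}, index=index)

# Selecting on the outer level returns the rows for one state, indexed by year.
ohio = population.loc["Ohio"]

# A full (state, year) key together with a column label addresses a single cell.
nevada_2002 = population.loc[("Nevada", 2002), "population"]
```

The table itself is still two-dimensional; the third "dimension" lives in the two levels of the row index.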

There are many ways to construct a DataFrame. To do so, we can use conventional Python data structures such as List, Dict, or Tuple. One of the most common ways to create a DataFrame is from a dictionary.

Note that a DataFrame also has an index column, and we can operate on that special column too.

![alt text](images/dataframe.svg "Title")

In [11]:
# Create DataFrame from dictionary


data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'population': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}


df = pd.DataFrame(data)

df

Unnamed: 0,state,year,population
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [12]:
# Extract values from DataFrame

df.values

array([['Ohio', 2000, 1.5],
       ['Ohio', 2001, 1.7],
       ['Ohio', 2002, 3.6],
       ['Nevada', 2001, 2.4],
       ['Nevada', 2002, 2.9],
       ['Nevada', 2003, 3.2]], dtype=object)

In [13]:
# Extract index from DataFrame

df.index

RangeIndex(start=0, stop=6, step=1)

In [14]:
# Extract column names

df.columns

Index(['state', 'year', 'population'], dtype='object')

We can use different combinations of data structures to create a DataFrame. Let's try some of them!

In [15]:
# Create DataFrame using dict of Series

data = {"Nevada": pd.Series([2.4, 2.5, 2.6]),
       "Ohio": pd.Series([1.5, 1.7, 1.8])}


pd.DataFrame(data)

Unnamed: 0,Nevada,Ohio
0,2.4,1.5
1,2.5,1.7
2,2.6,1.8


In [16]:
# Create DataFrame using dict of dicts


data = {'Nevada': {2000: 2.3, 2001: 2.4, 2002: 2.9},
        
        'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}


pd.DataFrame(data)

Unnamed: 0,Nevada,Ohio
2000,2.3,1.5
2001,2.4,1.7
2002,2.9,3.6


In [17]:
# Create DataFrame using list of lists


data = [['Nevada', 'Carson City', 900000], 
        ['Ohio', 'Columbus', 300000], 
        ['Nebraska', 'Lincoln', 10100000], 
        ['Kansas', 'Topeka', 5500000]]


pd.DataFrame(data)

Unnamed: 0,0,1,2
0,Nevada,Carson City,900000
1,Ohio,Columbus,300000
2,Nebraska,Lincoln,10100000
3,Kansas,Topeka,5500000
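
When a DataFrame is built from a list of lists, the columns are labelled `0, 1, 2, ...` by default. A small sketch of supplying names through the `columns` argument follows; the column names chosen here are assumptions for illustration.

```python
import pandas as pd

rows = [["Nevada", "Carson City"], ["Ohio", "Columbus"]]

# Hypothetical column labels; without `columns=` they would be 0 and 1.
capitals = pd.DataFrame(rows, columns=["state", "capital"])
```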


#### Reference

[DataFrame methods and attributes](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html)

## Shape, Size, type, and Dimension of DataFrame and Series

In [18]:
series

0    4
1    7
2   -5
3    3
Name: First_Series, dtype: int64

In [19]:
# Shape of a Series

series.shape # Only indicate number of rows

(4,)

In [20]:
# Number of dimensions

series.ndim

1

In [21]:
# Number of elements

series.size

4

In [22]:
# Type of values (entries) in a Series

series.dtypes

dtype('int64')

In [23]:
df

Unnamed: 0,state,year,population
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [24]:
# Number of rows and columns

df.shape

(6, 3)

In [25]:
# Number of dimensions (number of axes)

df.ndim

2

In [26]:
# Number of elements

df.size

18

In [27]:
# Type of values (entries) in a DataFrame for each column

df.dtypes

state          object
year            int64
population    float64
dtype: object

## Index, Selection, Filtering

---

As we've seen, both Series and DataFrames can have indices applied to them. The index is essentially a row level
label, and in Pandas the rows correspond to axis zero. Indices can either be autogenerated, or they can be set explicitly. The index objects are responsible for holding the axis labels and other metadata, like the axis name or names.

In [28]:
# Let's read a CSV file and perform indexing, selection and filtering

df = pd.read_csv("data/admission.csv")

df.head()

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65


### Indexing

---

The Index object is immutable, so it cannot be modified in place. This immutability allows us to safely share an index among data structures and perform merging operations. Moreover, it behaves like a fixed-size set, and we can perform common set operations on it, such as `intersection`, `union`, `difference`, and so on. No worries, we will see all of them.

> Pandas Index object can contain duplicate elements.
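
The properties described above can be sketched on two small, made-up Index objects (the labels are arbitrary):

```python
import pandas as pd

left = pd.Index(["a", "b", "c"])
right = pd.Index(["b", "c", "d"])

# Set-like operations return new Index objects.
common = left.intersection(right)   # labels present in both
combined = left.union(right)        # labels present in either
only_left = left.difference(right)  # labels present only in `left`

# Item assignment is rejected because an Index is immutable.
try:
    left[0] = "z"
    mutated = True
except TypeError:
    mutated = False

# Unlike a Python set, an Index may hold duplicate labels.
dupes = pd.Index(["a", "a", "b"])
has_duplicates = not dupes.is_unique
```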

#### Reference

[Index methods and attributes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.html)

[Indexing and selecting data](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#different-choices-for-indexing)

Series indexing works similarly to NumPy array indexing, except that we can use the Series’s index values instead of only integers.

In [29]:
series = pd.Series(range(6), index=['a', 'b', 'c', 'd', 'e', 'f'])

series

a    0
b    1
c    2
d    3
e    4
f    5
dtype: int64

In [30]:
series[1] # Select one particular element by its position

series[1:4] # Select range of values

series[:3] # Select the first three elements (positions 0, 1, 2)

series[3:] # Select from position 3 to the end

series[-1] # Select from the end

5

In [31]:
# Instead of integer (positional) indexing, we can use the actual Series index labels to select data

series["b"]

series["b":"e"]

series[:"d"]

series["c":]

c    2
d    3
e    4
f    5
dtype: int64
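
Plain square brackets on a Series accept both positions and labels, which can be ambiguous. To our knowledge, recent Pandas releases warn when an integer key is treated as a position on a Series with a non-integer index. A sketch of the explicit alternatives, `.iloc` for positions and `.loc` for labels, on a made-up Series:

```python
import pandas as pd

letters = pd.Series(range(6), index=["a", "b", "c", "d", "e", "f"])

second = letters.iloc[1]     # by position
last = letters.iloc[-1]      # by position, counted from the end
by_label = letters.loc["b"]  # by label

# Label slices include the end label; position slices exclude the end position.
label_slice = letters.loc["b":"d"]
position_slice = letters.iloc[1:3]
```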

Indexing a DataFrame is different from indexing a Series. As a DataFrame has two axes, we can select rows as well as columns. Let's use our DataFrame for indexing tasks.

In [32]:
df.head()

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65


Selecting columns (indexing along the column axis)

In [33]:
df["GRE Score"] # Select only one column

df[["GRE Score", "TOEFL Score", "SOP"]] # Select two or more columns

Unnamed: 0,GRE Score,TOEFL Score,SOP
0,337,118,4.5
1,324,107,4.0
2,316,104,3.0
3,322,110,3.5
4,314,103,2.0
...,...,...,...
395,324,110,3.5
396,325,107,3.0
397,330,116,5.0
398,312,103,3.5


Selecting rows (indexing along the row axis)

In [34]:
df[:2] # Select first two rows

df[2:5] # Select range of rows

df[300:] # Select from 300 to the end

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
300,301,309,106,2,2.5,2.5,8.00,0,0.62
301,302,319,108,2,2.5,3.0,8.76,0,0.66
302,303,322,105,2,3.0,3.0,8.45,1,0.65
303,304,323,107,3,3.5,3.5,8.55,1,0.73
304,305,313,106,2,2.5,2.0,8.43,0,0.62
...,...,...,...,...,...,...,...,...,...
395,396,324,110,3,3.5,3.5,9.04,1,0.82
396,397,325,107,3,3.0,3.5,9.11,1,0.84
397,398,330,116,4,5.0,4.5,9.45,1,0.91
398,399,312,103,3,3.5,4.0,8.78,0,0.67


Indexing with a single integer does not select a row of a DataFrame: `df[2]` looks for a column labelled `2`, which does not exist here.

In [54]:
df[2]

KeyError: 2

Pandas also has dedicated indexers and methods for row and column selection.

### Selection

---

The `loc` indexer performs selection using axis labels, and the `iloc` indexer performs selection using integer positions. They enable us to select a subset of the rows and columns from a DataFrame with NumPy-like notation. Additionally, Pandas has `head()` and `tail()` methods to select rows from the top and bottom of a DataFrame, respectively.

In [35]:
df.head() # Select first 5 rows by default

df.head(3) # Select first 3 rows



df.tail() # Select last 5 rows by default

df.tail(3) # Select last 3 rows

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
397,398,330,116,4,5.0,4.5,9.45,1,0.91
398,399,312,103,3,3.5,4.0,8.78,0,0.67
399,400,333,117,4,5.0,4.0,9.66,1,0.95


In [36]:
df.columns[df.columns.str.contains("Rating")] # Look for a column name

Index(['University Rating'], dtype='object')

Selection using axis labels. Purely label-location based indexer.

In [37]:
df.loc[:] # Select all rows and columns

df.loc[:, "GRE Score"] # Select all rows and one column

df.loc[[0, 1, 2, 3], "GRE Score"] # Select first four entries of one column

df.loc[:10, ["GRE Score", "TOEFL Score"]] # Select rows labelled 0 through 10 (11 rows, label slices include the end) of two columns

df.loc[5:10, ["GRE Score", "TOEFL Score"]] # Select range of rows and range of columns by picking them one-by-one

df.loc[5:10, "TOEFL Score":] # Select range of rows and range of columns

df.loc[:5, :"SOP"] # Select range of rows and columns

df.loc[:5, "SOP":"CGPA"] # Select range of rows and columns

Unnamed: 0,SOP,LOR,CGPA
0,4.5,4.5,9.65
1,4.0,4.5,8.87
2,3.0,3.5,8.0
3,3.5,2.5,8.67
4,2.0,3.0,8.21
5,4.5,3.0,9.34


Selecting using integers. Purely integer-location based indexing

In [38]:
df.head()

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65


In [39]:
df.iloc[:] # Select all rows and columns

df.iloc[0] # Select only one row

df.iloc[[0, 1, 2]] # Select the first three rows

df.iloc[:5] # Select range of rows

df.iloc[:5, 1] # Select first 5 rows of the second column (position 1)

df.iloc[:5, :3] # Select row as well as column range

df.iloc[[0, 2, 4], [0, 2, 4]] # Select particular rows and columns

Unnamed: 0,Serial No.,TOEFL Score,SOP
0,1,118,4.5
2,3,104,3.0
4,5,103,2.0
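
The difference between `loc` and `iloc` is easiest to see when the row labels are not `0, 1, 2, ...`. The frame below is a made-up miniature with deliberately shifted labels:

```python
import pandas as pd

# Hypothetical frame whose integer labels deliberately differ from positions.
scores = pd.DataFrame(
    {"GRE": [337, 324, 316, 322], "TOEFL": [118, 107, 104, 110]},
    index=[10, 20, 30, 40],
)

by_label = scores.loc[10:30]    # label slice: both endpoints are included
by_position = scores.iloc[0:3]  # position slice: end excluded, like Python lists

first_row = scores.iloc[0]      # the row at position 0 (label 10)
# scores.loc[0] would raise KeyError because no row is labelled 0.
```

Both slices return the same three rows here, but for different reasons: `loc` matched the labels 10 to 30, while `iloc` counted positions 0 to 2.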


### Filtering or Boolean Indexing

---

Moreover, we can select rows and columns based on boolean indexing, that is, select the rows or columns which satisfy one or more pre-defined conditions.

Boolean indexing means using a boolean Series, containing only `True` and `False` values, to select rows or columns.

#### Reference

[Boolean indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing)

In [40]:
df.head()

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65


In [41]:
df["GRE Score"] > 325 # Returns boolean Series. True when condition is True, otherwise False

df[df["GRE Score"] > 325] # Returns corresponding DataFrame. Where True

df[~(df["GRE Score"] > 325)] # Returns DataFrame where False

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.00,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.80
4,5,314,103,2,2.0,3.0,8.21,0,0.65
6,7,321,109,3,3.0,4.0,8.20,1,0.75
...,...,...,...,...,...,...,...,...,...
391,392,318,106,3,2.0,3.0,8.65,0,0.71
393,394,317,104,2,3.0,3.0,8.76,0,0.77
395,396,324,110,3,3.5,3.5,9.04,1,0.82
396,397,325,107,3,3.0,3.5,9.11,1,0.84


In [42]:
df.loc[~(df["GRE Score"] > 325), ["SOP", "LOR"]]

Unnamed: 0,SOP,LOR
1,4.0,4.5
2,3.0,3.5
3,3.5,2.5
4,2.0,3.0
6,3.0,4.0
...,...,...
391,2.0,3.0
393,3.0,3.0
395,3.5,3.5
396,3.0,3.5


#### Chained Conditionals

With chained conditionals, we can filter a DataFrame based on several conditions combined with logical operators.

In [43]:
df.head()

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65


In [44]:
condition_one = (df["GRE Score"] > 325)

condition_two = (df["Research"] == 1)

condition_three = (df["CGPA"] > 8.00)


df[condition_one & condition_two] # Select only rows where BOTH conditions are satisfied


df[condition_one | (condition_two & condition_three)] # Select rows where condition_one holds, OR both condition_two and condition_three hold

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
3,4,322,110,3,3.5,2.5,8.67,1,0.80
5,6,330,115,5,4.5,3.0,9.34,1,0.90
6,7,321,109,3,3.0,4.0,8.20,1,0.75
...,...,...,...,...,...,...,...,...,...
394,395,329,111,4,4.5,4.0,9.23,1,0.89
395,396,324,110,3,3.5,3.5,9.04,1,0.82
396,397,325,107,3,3.0,3.5,9.11,1,0.84
397,398,330,116,4,5.0,4.5,9.45,1,0.91
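
When conditions are written inline rather than stored in variables, each comparison needs its own parentheses, because `&` and `|` bind more tightly than `>` or `==` in Python. The sketch below uses a made-up miniature of the admission data and also shows `between()` and `isin()` as alternative ways to build a mask:

```python
import pandas as pd

# Hypothetical miniature of the admission data.
applicants = pd.DataFrame(
    {
        "GRE": [337, 324, 316, 330],
        "Research": [1, 1, 0, 0],
        "CGPA": [9.65, 8.87, 8.00, 9.34],
    }
)

# Parentheses around each comparison are required when combining with & or |.
strong = applicants[(applicants["GRE"] > 325) & (applicants["Research"] == 1)]

# between() and isin() return boolean masks without explicit comparisons.
in_range = applicants[applicants["CGPA"].between(8.5, 9.5)]  # inclusive on both ends
picked = applicants[applicants["GRE"].isin([316, 330])]
```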


There is a possibility to use `loc` with conditions to perform DataFrame filtering.

In [45]:
df.loc[condition_one & condition_two]

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
5,6,330,115,5,4.5,3.0,9.34,1,0.90
11,12,327,111,4,4.0,4.5,9.00,1,0.84
12,13,328,112,4,4.0,4.5,9.10,1,0.78
22,23,328,116,5,5.0,5.0,9.50,1,0.94
...,...,...,...,...,...,...,...,...,...
385,386,335,117,5,5.0,5.0,9.82,1,0.96
392,393,326,112,4,4.0,3.5,9.12,1,0.84
394,395,329,111,4,4.5,4.0,9.23,1,0.89
397,398,330,116,4,5.0,4.5,9.45,1,0.91


Additionally, Pandas has a `query()` method for filtering the rows of a DataFrame with a boolean expression. Unlike the other indexing and selection techniques, `query()` takes the condition as a string.

In [46]:
df.query("SOP > LOR") # Select rows where "SOP" is more than "LOR"


df.query("(SOP > LOR) | (CGPA > 9)") # Combine conditionals with OR

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
3,4,322,110,3,3.5,2.5,8.67,1,0.80
5,6,330,115,5,4.5,3.0,9.34,1,0.90
8,9,302,102,1,2.0,1.5,8.00,0,0.50
9,10,323,108,3,3.5,3.0,8.60,0,0.45
...,...,...,...,...,...,...,...,...,...
394,395,329,111,4,4.5,4.0,9.23,1,0.89
395,396,324,110,3,3.5,3.5,9.04,1,0.82
396,397,325,107,3,3.0,3.5,9.11,1,0.84
397,398,330,116,4,5.0,4.5,9.45,1,0.91
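
Column names with spaces, such as `GRE Score`, cannot be written bare inside a `query()` string. A sketch of the two pieces of syntax that help: backticks quote such a column name, and `@` refers to a Python variable. The frame and the `threshold` value are illustrative assumptions.

```python
import pandas as pd

applicants = pd.DataFrame(
    {
        "GRE Score": [337, 324, 316, 330],
        "SOP": [4.5, 4.0, 3.0, 4.5],
        "LOR": [4.5, 4.5, 3.5, 3.0],
    }
)

threshold = 325  # hypothetical cut-off

# Backticks quote a column name containing spaces; @ refers to a Python variable.
high_gre = applicants.query("`GRE Score` > @threshold")

# The equivalent boolean-mask form, for comparison.
same_rows = applicants[applicants["GRE Score"] > threshold]
```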


## Re(Set)-index, Dropping Entries, Axis

In [47]:
test = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9], "d": ["x", "y", "z"]})

### Re-index

---

The `reindex()` method enables us to conform a DataFrame to a new index. It rearranges the existing rows (or columns) to match the new labels without changing the values associated with each label; labels that were not present before get missing values (`NaN`).

#### Reference

[Difference between df.reindex() and df.set_index() methods in pandas](https://stackoverflow.com/questions/50741330/difference-between-df-reindex-and-df-set-index-methods-in-pandas)

In [48]:
test

Unnamed: 0,a,b,c,d
0,1,4,7,x
1,2,5,8,y
2,3,6,9,z


In [49]:
test.reindex([1, 2, 3], axis=0) # Reindex rows

Unnamed: 0,a,b,c,d
1,2.0,5.0,8.0,y
2,3.0,6.0,9.0,z
3,,,,


In [50]:
test.reindex(["b", "c", "a", "d"], axis=1) # Reindex columns

Unnamed: 0,b,c,a,d
0,4,7,1,x
1,5,8,2,y
2,6,9,3,z
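
In the row example above, the integer columns turned into floats because the new label `3` had to be filled with `NaN`. A sketch of the `fill_value` argument, which supplies a replacement value instead (the small frame is made up for illustration):

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Label 3 does not exist, so its row is filled with NaN and the integer
# columns are upcast to float to hold it.
with_nan = frame.reindex([1, 2, 3])

# fill_value supplies a replacement for missing labels, so the dtype is kept.
with_zero = frame.reindex([1, 2, 3], fill_value=0)
```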


### Set index

---

The `set_index()` method takes a column name or a list of column names and promotes those columns to the index. It replaces the index with the values of those columns without touching the order of the other values in the DataFrame.

In [51]:
test

Unnamed: 0,a,b,c,d
0,1,4,7,x
1,2,5,8,y
2,3,6,9,z


In [52]:
test.set_index("a") # Changes the index but not the values

Unnamed: 0_level_0,b,c,d
a,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,4,7,x
2,5,8,y
3,6,9,z


In [53]:
test.set_index(["a", "b"]) # Set multiple columns as index

Unnamed: 0_level_0,Unnamed: 1_level_0,c,d
a,b,Unnamed: 2_level_1,Unnamed: 3_level_1
1,4,7,x
2,5,8,y
3,6,9,z


### Reset index

---

We can remove the existing index from a DataFrame and either keep it as a new column or discard it completely.

In [54]:
test

Unnamed: 0,a,b,c,d
0,1,4,7,x
1,2,5,8,y
2,3,6,9,z


In [55]:
test = test.set_index("d")

test

Unnamed: 0_level_0,a,b,c
d,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
x,1,4,7
y,2,5,8
z,3,6,9


In [56]:
test.reset_index(drop=False) # Reset index and keep it as a column

Unnamed: 0,d,a,b,c
0,x,1,4,7
1,y,2,5,8
2,z,3,6,9


In [57]:
test.reset_index(drop=True) # Reset index and discard it

Unnamed: 0,a,b,c
0,1,4,7
1,2,5,8
2,3,6,9


### Dropping Entries

---

We can drop entries from rows and/or columns by specifying their labels.

In [58]:
df.head()

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65


In [59]:
df.drop(0, axis=0) # Drop one row

df.drop([0, 1, 2], axis=0) # Drop three rows

df.drop(df.index, axis=0) # Drop all rows

# #----------------------------------------------

df.drop("GRE Score", axis=1) # Drop one column

df.drop(["GRE Score", "TOEFL Score"], axis=1) # Drop two or more columns

Unnamed: 0,Serial No.,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,4,4.5,4.5,9.65,1,0.92
1,2,4,4.0,4.5,8.87,1,0.76
2,3,3,3.0,3.5,8.00,1,0.72
3,4,3,3.5,2.5,8.67,1,0.80
4,5,2,2.0,3.0,8.21,0,0.65
...,...,...,...,...,...,...,...
395,396,3,3.5,3.5,9.04,1,0.82
396,397,3,3.0,3.5,9.11,1,0.84
397,398,4,5.0,4.5,9.45,1,0.91
398,399,3,3.5,4.0,8.78,0,0.67
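
Note that none of the `drop()` calls above changed `df`: each returns a new DataFrame. A minimal sketch on a made-up frame, which also shows the `index=` and `columns=` keywords as an alternative to passing labels with `axis`:

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# index= and columns= are keyword alternatives to labels + axis.
trimmed = frame.drop(index=[0], columns=["b"])

# drop() returns a new object; the original is untouched unless reassigned.
original_columns = list(frame.columns)
```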


### Axis

---

A DataFrame has two dimensions, and we can perform operations along either of them, so we need to tell Pandas which axis to operate on. `axis=0` (or `axis="index"`) refers to the row axis, which runs vertically down the rows. `axis=1` (or `axis="columns"`) refers to the column axis, which runs horizontally across the columns.

A DataFrame itself always has exactly two dimensions. Data with more than two dimensions can still be stored in it by using a multi-index (hierarchical index) on the rows or columns. With such DataFrames it is even more important to state the axis on which an operation should run.

Pandas has several methods and attributes which operate on the axis objects themselves. For now we review just two of them; we'll meet them again later.

In [60]:
df.axes # Returns a list with the row axis labels followed by the column axis labels

[RangeIndex(start=0, stop=400, step=1),
 Index(['Serial No.', 'GRE Score', 'TOEFL Score', 'University Rating', 'SOP',
        'LOR', 'CGPA', 'Research', 'Chance of Admit '],
       dtype='object')]

Let's use the simple `test` DataFrame to understand how the `set_axis()` method works.

In [61]:
test

Unnamed: 0_level_0,a,b,c
d,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
x,1,4,7
y,2,5,8
z,3,6,9


In [62]:
test.set_axis(['A', 'B', 'C'], axis=0) # Set new values for axis zero

Unnamed: 0,a,b,c
A,1,4,7
B,2,5,8
C,3,6,9


In [63]:
test.set_axis(["X", "Y", "Z"], axis=1) # Set new values for axis one

Unnamed: 0_level_0,X,Y,Z
d,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
x,1,4,7
y,2,5,8
z,3,6,9
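
To make the axis convention concrete, here is a sketch of the same reduction run along each axis of a small made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])

# axis=0 collapses the rows: one result per column.
column_totals = frame.sum(axis=0)

# axis=1 collapses the columns: one result per row.
row_totals = frame.sum(axis=1)

# The string aliases "index" and "columns" are accepted in place of 0 and 1.
same_as_axis1 = frame.sum(axis="columns")
```

A useful way to remember it: the axis you pass is the one that disappears from the result.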


## Sorting

---

Sorting is an important operation. We can either sort by index or by value. The latter means to sort by columns. Moreover, we can sort DataFrame by several columns and with different sorting order, ascending or descending.

In [64]:
series = pd.Series([3, 5, 7, 1, 9, 2], index=['d', 'a', 'b', 'c', 'g', 'f'])


series

d    3
a    5
b    7
c    1
g    9
f    2
dtype: int64

In [65]:
series.sort_index(ascending=True) # Sort by index in ascending order


series.sort_index(ascending=False) # Sort by index in descending order

g    9
f    2
d    3
c    1
b    7
a    5
dtype: int64

In [66]:
series.sort_values(ascending=True) # Sort by value in ascending order

series.sort_values(ascending=False) # Sort by value in descending order

g    9
b    7
a    5
d    3
f    2
c    1
dtype: int64

We can sort a DataFrame by applying the same logic.

In [67]:
df.head()

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65


In [68]:
df.sort_index(ascending=True) # Sort DataFrame by index in ascending order

df.sort_index(ascending=False) # Sort DataFrame by index in descending order

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
399,400,333,117,4,5.0,4.0,9.66,1,0.95
398,399,312,103,3,3.5,4.0,8.78,0,0.67
397,398,330,116,4,5.0,4.5,9.45,1,0.91
396,397,325,107,3,3.0,3.5,9.11,1,0.84
395,396,324,110,3,3.5,3.5,9.04,1,0.82
...,...,...,...,...,...,...,...,...,...
4,5,314,103,2,2.0,3.0,8.21,0,0.65
3,4,322,110,3,3.5,2.5,8.67,1,0.80
2,3,316,104,3,3.0,3.5,8.00,1,0.72
1,2,324,107,4,4.0,4.5,8.87,1,0.76


In [69]:
df.sort_values("CGPA", ascending=True) # Sort by column in ascending order

df.sort_values("CGPA", ascending=False) # Sort by column in descending order

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
143,144,340,120,4,4.5,4.0,9.92,1,0.97
202,203,340,120,5,4.5,4.5,9.91,1,0.97
203,204,334,120,5,4.0,5.0,9.87,1,0.97
385,386,335,117,5,5.0,5.0,9.82,1,0.96
34,35,331,112,5,4.0,5.0,9.80,1,0.94
...,...,...,...,...,...,...,...,...,...
29,30,310,99,2,1.5,2.0,7.30,0,0.54
118,119,296,99,2,3.0,3.5,7.28,0,0.47
348,349,302,99,1,2.0,2.0,7.25,0,0.57
28,29,295,93,1,2.0,2.0,7.20,0,0.46


In [70]:
df.sort_values(["CGPA", "Research"], ascending=[True, True]) # Sort by two columns, both ascending

df.sort_values(["CGPA", "Research"], ascending=[True, False]) # Sort by two columns: first ascending, second descending

df.sort_values(["CGPA", "Research"], ascending=True) # Same as the first call

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
58,59,300,99,1,3.0,2.0,6.80,1,0.36
28,29,295,93,1,2.0,2.0,7.20,0,0.46
348,349,302,99,1,2.0,2.0,7.25,0,0.57
118,119,296,99,2,3.0,3.5,7.28,0,0.47
29,30,310,99,2,1.5,2.0,7.30,0,0.54
...,...,...,...,...,...,...,...,...,...
148,149,339,116,4,4.0,3.5,9.80,1,0.96
385,386,335,117,5,5.0,5.0,9.82,1,0.96
203,204,334,120,5,4.0,5.0,9.87,1,0.97
202,203,340,120,5,4.5,4.5,9.91,1,0.97
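
Ties are hard to spot in the full dataset, so here is a sketch of mixed-order sorting on a tiny made-up frame where the tie-breaking is visible. It also shows `ignore_index=True`, which renumbers the rows of the sorted result.

```python
import pandas as pd

frame = pd.DataFrame(
    {"CGPA": [8.0, 9.1, 8.0, 9.1], "Research": [0, 1, 1, 0]},
    index=["p", "q", "r", "s"],
)

# Highest CGPA first; ties are broken by Research in ascending order.
ranked = frame.sort_values(["CGPA", "Research"], ascending=[False, True])

# sort_values() returns a new object; ignore_index=True renumbers the rows 0..n-1.
renumbered = frame.sort_values("CGPA", ascending=False, ignore_index=True)
```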


## Column Rename, Reorder, Insertion, Deletion

---

DataFrame is dynamic, meaning that we can add and remove columns, as well as reorder or rename them.

#### Reference

[DataFrame.rename](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html)

### Rename Columns

---

Rename existing columns, or set new names for columns.

In [71]:
df.head()

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65


In [72]:
# We have to pass a mapper: which columns we want to rename and their new names

df.rename({"GRE Score": "GRE", "TOEFL Score": "TOEFL"}, axis=1)

Unnamed: 0,Serial No.,GRE,TOEFL,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.00,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.80
4,5,314,103,2,2.0,3.0,8.21,0,0.65
...,...,...,...,...,...,...,...,...,...
395,396,324,110,3,3.5,3.5,9.04,1,0.82
396,397,325,107,3,3.0,3.5,9.11,1,0.84
397,398,330,116,4,5.0,4.5,9.45,1,0.91
398,399,312,103,3,3.5,4.0,8.78,0,0.67


We can also use the `set_axis` method to rename columns. However, `set_axis` requires as many new column names as the original DataFrame has columns.

In [75]:
df.set_axis(["a", "b", "c", "d", "e", "f", "g", "h", "i"], axis=1)

Unnamed: 0,a,b,c,d,e,f,g,h,i
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.00,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.80
4,5,314,103,2,2.0,3.0,8.21,0,0.65
...,...,...,...,...,...,...,...,...,...
395,396,324,110,3,3.5,3.5,9.04,1,0.82
396,397,325,107,3,3.0,3.5,9.11,1,0.84
397,398,330,116,4,5.0,4.5,9.45,1,0.91
398,399,312,103,3,3.5,4.0,8.78,0,0.67


### Reorder Columns

---

Sometimes we may need to just re-order the columns. There are a few ways to do it.

In [76]:
test

Unnamed: 0_level_0,a,b,c
d,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
x,1,4,7
y,2,5,8
z,3,6,9


In [77]:
new_columns = ["c", "a", "b"]

test[new_columns] # Reorder columns by selecting them in the desired order

Unnamed: 0_level_0,c,a,b
d,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
x,7,1,4
y,8,2,5
z,9,3,6


In [78]:
test.reindex(new_columns, axis=1) # Change column order by reindexing

Unnamed: 0_level_0,c,a,b
d,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
x,7,1,4
y,8,2,5
z,9,3,6


### Insert new columns

---

We can add new columns to a DataFrame, at the end or at any other position.

In [79]:
test

Unnamed: 0_level_0,a,b,c
d,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
x,1,4,7
y,2,5,8
z,3,6,9


In [84]:
test["E"] = ""  # Add an empty column

test["F"] = np.nan # Add new column with NaN (missing) values

test["G"] = [10, 11, 12] # Add new values

test

Unnamed: 0_level_0,a,b,c,E,F,G
d,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
x,1,4,7,,,10
y,2,5,8,,,11
z,3,6,9,,,12


The techniques above can only add a new column at the end of the DataFrame. To add a column at a specific position, we can use the `insert()` method. There is also an `assign()` method. Unlike `insert()`, which modifies the DataFrame in place, `assign()` returns a new DataFrame containing the original columns plus the new ones.

In [85]:
test.insert(loc=0, column="H", value=test["b"] ** 2) # Insert new calculated column at the first position

test

Unnamed: 0_level_0,H,a,b,c,E,F,G
d,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
x,16,1,4,7,,,10
y,25,2,5,8,,,11
z,36,3,6,9,,,12


In [87]:
test = test.assign(I=test["b"]+test["c"]) # Assign new column


test

Unnamed: 0_level_0,H,a,b,c,E,F,G,I,V
d,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
x,16,1,4,7,,,10,11,11
y,25,2,5,8,,,11,13,13
z,36,3,6,9,,,12,15,15
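
The in-place versus copy distinction between `insert()` and `assign()` can be sketched on a fresh, made-up frame (the column names `total` and `double_a` are hypothetical):

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# assign() leaves `frame` alone and returns a copy with the extra column.
extended = frame.assign(total=frame["a"] + frame["b"])
columns_after_assign = list(frame.columns)

# insert() modifies `frame` in place and returns None.
result = frame.insert(loc=0, column="double_a", value=frame["a"] * 2)
columns_after_insert = list(frame.columns)
```

This is why the `assign()` cell above reassigns its result to `test`, while the `insert()` cell does not.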


### Remove columns

---

We saw how to drop values along the row and column axes in the sections above. Removing a column works the same way.

In [88]:
test

Unnamed: 0_level_0,H,a,b,c,E,F,G,I,V
d,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
x,16,1,4,7,,,10,11,11
y,25,2,5,8,,,11,13,13
z,36,3,6,9,,,12,15,15


In [90]:
test.drop("H", axis=1) # Remove one column

test.drop(["H", "I"], axis=1) # Remove two or more columns

Unnamed: 0_level_0,a,b,c,E,F,G,V
d,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
x,1,4,7,,,10,11
y,2,5,8,,,11,13
z,3,6,9,,,12,15


## Arithmetic Operations

---

We can perform arithmetic operations along any axis. The operations themselves are simple, but they can be part of complex chained expressions. Each operator also has an equivalent method, such as `add()` or `mul()`. In practice, we usually perform these operations on Pandas Series and then insert the result into a DataFrame.

In [None]:
series

In [None]:
series + 10 # Add constant value to a series

series * 10 # Multiply series by a constant value

series + pd.Series([1, 2, 3, 4, 5, 6], index=series.index) # Add new Series

Arithmetic operations on DataFrames

In [None]:
test

In [None]:
test + 5 # Cannot add integer while having string type column in DataFrame
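
Two details are worth a sketch: arithmetic between two Series aligns on index labels rather than on position, and a DataFrame with string columns can still be used in arithmetic if we first restrict it to its numeric columns. The Series and the `mixed` frame below are made up for illustration.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Addition aligns on labels; labels missing from either side give NaN.
aligned = left + right

# The method form accepts fill_value to treat a missing side as 0.
filled = left.add(right, fill_value=0)

# For a frame with mixed types, restrict the arithmetic to numeric columns.
mixed = pd.DataFrame({"n": [1, 2], "s": ["x", "y"]})
shifted = mixed.select_dtypes("number") + 5
```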

## Unique Values and Value Counts

---

As the header suggests, we can find the unique values in a column, as well as count how many times each value occurs.

### Unique Values

---

We can find the unique values in a column by using the `unique()` method; the `nunique()` method returns how many of them there are.

In [None]:
series

In [None]:
series.unique() # Returns a NumPy array of only the unique values

series.index.unique() # Finds unique values in a Series index. Returns index object without duplicates.

In [None]:
df.head()

In [None]:
df["Research"].unique() # Research column has only two unique values

df["University Rating"].unique() # Unique values for University Rating
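
A short sketch of the difference between listing the unique values and counting them, on a made-up Series of ratings:

```python
import pandas as pd

ratings = pd.Series([4, 4, 3, 3, 2, 4])

distinct = ratings.unique()     # NumPy array of distinct values, in order of appearance
n_distinct = ratings.nunique()  # how many distinct values there are
```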

### Value Counts

---

`value_counts()` counts how many times each unique value occurs in a column or Pandas Series. The result is itself a Pandas Series: its index holds the unique values and its data holds their counts. In short, it calculates value frequencies.

In [None]:
df.head()

In [None]:
df["Research"].value_counts() # We have 219 ones and 181 zeros

In [None]:
# Requires Pandas version 1.1.0 or later


df.value_counts(subset=["Research", "University Rating"]) # Value counts for two or more columns.
                                                          # Returns a Series with a multi-index
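
Frequencies are often more useful as proportions. A sketch using the `normalize` argument of `value_counts()` on a made-up Series:

```python
import pandas as pd

research = pd.Series([1, 0, 1, 1, 0, 1])

counts = research.value_counts()                # absolute frequencies, most common first
shares = research.value_counts(normalize=True)  # relative frequencies that sum to 1
```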

## Statistics with DataFrame

---

Pandas is equipped with various methods for performing statistical operations on a Series or DataFrame. Note that for a DataFrame these operations can be performed along either axis, that is, over rows or over columns.

#### Reference

[Statistical functions](https://pandas.pydata.org/pandas-docs/stable/user_guide/computation.html)

In [91]:
df.head()

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65


In [94]:
df.iloc[:,1:].describe().round(2) # Summary statistics

Unnamed: 0,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
count,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0
mean,316.81,107.41,3.09,3.4,3.45,8.6,0.55,0.72
std,11.47,6.07,1.14,1.01,0.9,0.6,0.5,0.14
min,290.0,92.0,1.0,1.0,1.0,6.8,0.0,0.34
25%,308.0,103.0,2.0,2.5,3.0,8.17,0.0,0.64
50%,317.0,107.0,3.0,3.5,3.5,8.61,1.0,0.73
75%,325.0,112.0,4.0,4.0,4.0,9.06,1.0,0.83
max,340.0,120.0,5.0,5.0,5.0,9.92,1.0,0.97


We can also calculate each of these statistics separately by calling the corresponding method on the DataFrame or on a single column.

In [None]:
df.sum(axis=0) # Sum of each column (aggregates down the rows)

df.sum(axis=1) # Sum of each row (aggregates across the columns)

#---------------------------------

df.mean(axis=0) # Average of each column

df.mean(axis=1) # Average of each row

#----------------------------------

df.min() # Minimum of each column (axis=0 is the default)

df.max() # Maximum of each column (axis=0 is the default)
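The `axis` argument is easy to mix up, so here is a minimal sketch on a hypothetical 2x3 frame `small`, chosen so that the results can be checked by hand:

```python
import pandas as pd

# Hypothetical 2x3 frame with easy-to-verify values
small = pd.DataFrame({"a": [1, 2], "b": [10, 20], "c": [100, 200]})

per_column = small.sum(axis=0)  # one value per column: a=3, b=30, c=300
per_row = small.sum(axis=1)     # one value per row: 111 and 222

print(per_column)
print(per_row)
```

With `axis=0` the result is indexed by the column names, and with `axis=1` it is indexed by the row labels.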

What if we want to know where the minimum and maximum values occur? In other words, what is the index of the minimum and maximum value? Pandas has methods for that.

In [None]:
df.idxmin() # Gives index of minimum values for each column

df.idxmax() # Gives index of maximum values for each column
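Because `idxmin()` and `idxmax()` return index labels, the result can be passed to `.loc` to retrieve the full row. A minimal sketch, assuming a hypothetical `scores` frame with string labels as its index:

```python
import pandas as pd

# Hypothetical scores with string labels as the index
scores = pd.DataFrame(
    {"math": [70, 95, 60], "art": [88, 52, 91]},
    index=["ann", "bob", "cy"],
)

best = scores.idxmax()    # index label of the maximum in each column
worst = scores.idxmin()   # index label of the minimum in each column

# The label can be passed to .loc to pull the whole row
top_math_row = scores.loc[best["math"]]

print(best)
print(worst)
print(top_math_row)
```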

#### There are other methods that help to calculate various statistics on a DataFrame or Series.

![alt text](images/statistics.png "Title")

### Covariance and Correlation

---

Unlike the statistics above, these require pairs of values, or at least two Series, to give meaningful results. Let's use our DataFrame to calculate the covariance and correlation between some columns.

### Covariance

In [95]:
covariance = df.iloc[:, 1:].cov() # Compute pairwise covariances among the series in the DataFrame

covariance

Unnamed: 0,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
GRE Score,131.644555,58.216967,8.778791,7.079699,5.747726,5.699742,3.31869,1.313271
TOEFL Score,58.216967,36.838997,4.828697,4.021053,3.095965,2.998337,1.481729,0.685179
University Rating,8.778791,4.828697,1.308114,0.845865,0.678352,0.509117,0.255232,0.116009
SOP,7.079699,4.021053,0.845865,1.013784,0.660025,0.431183,0.222807,0.097028
LOR,5.747726,3.095965,0.678352,0.660025,0.807262,0.359084,0.177701,0.085834
CGPA,5.699742,2.998337,0.509117,0.431183,0.359084,0.355594,0.155026,0.074265
Research,3.31869,1.481729,0.255232,0.222807,0.177701,0.155026,0.248365,0.039317
Chance of Admit,1.313271,0.685179,0.116009,0.097028,0.085834,0.074265,0.039317,0.020337


We see that the values above and below the main diagonal are the same. This is expected, because covariance is symmetric: the covariance of X with Y equals the covariance of Y with X. Let's turn the table into an upper triangular matrix so that it is easier to read.

In [96]:
covariance.where(np.triu(np.ones(covariance.shape)).astype(bool)).fillna("") # Work through this code

Unnamed: 0,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
GRE Score,131.644555,58.216967,8.778791,7.079699,5.747726,5.699742,3.31869,1.313271
TOEFL Score,,36.838997,4.828697,4.021053,3.095965,2.998337,1.481729,0.685179
University Rating,,,1.308114,0.845865,0.678352,0.509117,0.255232,0.116009
SOP,,,,1.013784,0.660025,0.431183,0.222807,0.097028
LOR,,,,,0.807262,0.359084,0.177701,0.085834
CGPA,,,,,,0.355594,0.155026,0.074265
Research,,,,,,,0.248365,0.039317
Chance of Admit,,,,,,,,0.020337
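The one-liner that produces the upper triangular view can be unpacked step by step. The sketch below does so on a hypothetical symmetric 3x3 matrix `sym`, which stands in for the covariance table:

```python
import numpy as np
import pandas as pd

# Hypothetical symmetric 3x3 matrix standing in for the covariance table
sym = pd.DataFrame(
    [[1.0, 0.2, 0.3], [0.2, 1.0, 0.5], [0.3, 0.5, 1.0]],
    index=["a", "b", "c"],
    columns=["a", "b", "c"],
)

ones = np.ones(sym.shape)          # matrix of 1.0 with the same shape as sym
upper = np.triu(ones)              # zero out everything below the main diagonal
mask = upper.astype(bool)          # 1.0 becomes True, 0.0 becomes False
masked = sym.where(mask)           # keep values where mask is True, NaN elsewhere
display_ready = masked.fillna("")  # blank cells instead of NaN, for display only

print(display_ready)
```

Note that `fillna("")` mixes strings with numbers, so the result is suitable for display but not for further calculations; `masked` is the version to keep for numeric work.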


### Correlation

In [97]:
correlation = df.iloc[:, 1:].corr() # Compute pairwise correlation among the series in the DataFrame

correlation

Unnamed: 0,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
GRE Score,1.0,0.835977,0.668976,0.612831,0.557555,0.83306,0.580391,0.80261
TOEFL Score,0.835977,1.0,0.69559,0.657981,0.567721,0.828417,0.489858,0.791594
University Rating,0.668976,0.69559,1.0,0.734523,0.660123,0.746479,0.447783,0.71125
SOP,0.612831,0.657981,0.734523,1.0,0.729593,0.718144,0.444029,0.675732
LOR,0.557555,0.567721,0.660123,0.729593,1.0,0.670211,0.396859,0.669889
CGPA,0.83306,0.828417,0.746479,0.718144,0.670211,1.0,0.521654,0.873289
Research,0.580391,0.489858,0.447783,0.444029,0.396859,0.521654,1.0,0.553202
Chance of Admit,0.80261,0.791594,0.71125,0.675732,0.669889,0.873289,0.553202,1.0


We have the same situation as with covariance: the elements above and below the main diagonal are the same. **Correlation is a scaled form of covariance**, hence this result was expected too.

In [98]:
correlation.where(np.triu(np.ones(correlation.shape)).astype(bool)).fillna("") # Upper Triangular Matrix

Unnamed: 0,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
GRE Score,1.0,0.835977,0.668976,0.612831,0.557555,0.83306,0.580391,0.80261
TOEFL Score,,1.0,0.69559,0.657981,0.567721,0.828417,0.489858,0.791594
University Rating,,,1.0,0.734523,0.660123,0.746479,0.447783,0.71125
SOP,,,,1.0,0.729593,0.718144,0.444029,0.675732
LOR,,,,,1.0,0.670211,0.396859,0.669889
CGPA,,,,,,1.0,0.521654,0.873289
Research,,,,,,,1.0,0.553202
Chance of Admit,,,,,,,,1.0


> **Covariance measures how two variables vary together. Its sign indicates the direction of the linear relationship between the two variables.**


> **Correlation is covariance scaled by the standard deviations of both variables, so its values are standardized to the range from -1 to 1. Correlation measures both the strength and the direction of the linear relationship between two variables.**
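The relationship between the two statistics can be checked numerically: dividing the covariance by the product of the two standard deviations gives the Pearson correlation. The sketch below uses a small hypothetical frame `data`, not the admissions DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical data; y is roughly linear in x
data = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.1, 3.9, 6.2, 7.8, 10.1],
})

cov_xy = data["x"].cov(data["y"])                      # sample covariance
scaled = cov_xy / (data["x"].std() * data["y"].std())  # divide by both standard deviations
pearson = data["x"].corr(data["y"])                    # built-in Pearson correlation

# Rescaling a variable changes the covariance but not the correlation
cov_rescaled = (data["x"] * 100).cov(data["y"])
corr_rescaled = (data["x"] * 100).corr(data["y"])

print(np.isclose(scaled, pearson))
print(cov_xy, cov_rescaled)
print(pearson, corr_rescaled)
```

This is why correlation is usually easier to interpret: its value does not depend on the units of the variables, whereas the covariance does.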

# Summary

---

In this class we've covered quite a broad range of Pandas functionality, from the simplest operations of creating or reading data to performing statistical operations on DataFrames. In the next classes we will dig deeper into Pandas' capabilities.