## <center><b>Python for Data Science</b></center>
## <center><b>Lesson 27</b></center>
## <center><b>Pandas Basics -- Part Four</b></center>
## <center><b>Pandas DataFrames (Notes)</b></center>

![7.jpg](attachment:7.jpg)

<font size="6"><center>[Link: Pandas Documentation](https://pandas.pydata.org/docs/)</center></font>

##  <span style="color:red">TABLE OF CONTENTS</span>

1. [What Is A Pandas DataFrame?](#1)<br>
2. [How to Create DataFrames](#2)<br>
a. [Creating a DataFrame from a Dictionary of Lists](#2a)<br>
&emsp;● [The Anatomy of a DataFrame](#2a1)<br>
&emsp;● [Changing the Row Index](#2a2)<br>
&emsp;● [Using a Column as the Row Index](#2a3)<br>
&emsp;● [Loading selective columns in a DataFrame](#2a4)<br>
&emsp;● [Set the Order of the DataFrame Columns](#2a5)<br>
&emsp;● [The Data Type of a Column of a Pandas DataFrame](#2a6)<br>
&emsp;● [Pandas DataFrame Columns Helper Functions](#2a7)<br>
3. [Inspecting/Viewing the Data of a DataFrame](#3)<br>
&emsp;● [Inspecting/Viewing the Data of a DataFrame -- DataFrame.head() Method](#3i)<br>
&emsp;● [Inspecting/Viewing the Data of a DataFrame -- DataFrame.tail() Method](#3ii)<br>
&emsp;● [Inspecting/Viewing the Data of a DataFrame -- DataFrame.info() Method](#3iii)<br>
4. [Creating a DataFrame from a NumPy ndarray](#4)<br>
5. [Creating a Pandas DataFrame from a List](#5)


<div class="alert alert-block alert-warning">
    <b><font size="4">Files needed for this presentation:</font></b>
</div>

#### No files are needed for this presentation.

In [1]:
# Set up the notebook to display multiple outputs in one cell

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

print('The notebook is set up to display multiple outputs in one cell.')

The notebook is set up to display multiple outputs in one cell.


In [2]:
# Import libraries

import pandas as pd
import numpy as np

# The Data Structures / Objects Provided by Pandas

1. <code style="background:yellow;color:black">Pandas DataFrame (2-Dimensional)</code>
2. Pandas Series (1-Dimensional)
3. Pandas Index

<a class="anchor" id="1"></a>
# <span style="color:blue"><b>1. What Is A Pandas DataFrame?</b></span>

<b>Technical Definition</b> 

A pandas DataFrame is a two-dimensional labeled data structure whose columns can have different data types.

It can be thought of as a generalization of a NumPy array or as a specialization of a Python dictionary.

In other words, a DataFrame resembles a 2-D array with flexible row labels (the index) and flexible column names.




<b>How You Should Understand It</b>

A pandas DataFrame is an in-memory, spreadsheet-like table that we work with from Python.

It is similar to an Excel or Google Sheets spreadsheet or an SQL table. It can also be thought of as a dictionary of Series objects, one per column, that share the same row index.

Like a spreadsheet, a DataFrame provides functionality to analyze, change, and extract valuable information from a dataset.
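The "dictionary of Series" view can be sketched in a few lines. The names `names`, `ages`, and `people` and the three sample rows are assumptions made only for this illustration.

```python
import pandas as pd

# Two Series that share the same row labels (hypothetical sample data)
names = pd.Series(["Adam", "Bruce", "Carl"], index=["a", "b", "c"])
ages = pd.Series([20, 27, 35], index=["a", "b", "c"])

# A dictionary of Series becomes a DataFrame: the keys become column names
people = pd.DataFrame({"name": names, "age": ages})

# Selecting one column gives back a Series that carries the shared index
age_column = people["age"]
print(type(age_column).__name__)
print(people)
```

Each column keeps its own data type, while the row labels are shared by every column.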

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)

![image-3.png](attachment:image-3.png)

![image-4.png](attachment:image-4.png)

# Pandas DataFrame Documentation

[pandas.DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)

<a class="anchor" id="2"></a>
# <span style="color:blue"><b>2. How to Create DataFrames</b></span>

### First, we must import Pandas.

Note: This is just serving as a reminder because Pandas and NumPy were already imported above.

In [2]:
import pandas as pd
import numpy as np

<a class="anchor" id="2a"></a>
## <span style="color:red"><b><i>a. Creating a DataFrame from a Dictionary of Lists</i></b></span>

![image.png](attachment:image.png)    ![image-2.png](attachment:image-2.png)

In [3]:
# Create a dictionary of lists

# The dictionary will contain several key-value pairs
# Each pair consists of a column name and a list of elements


sample_dict = {                                                           # the keys will serve as the column names
     'name' : ["Adam", "Bruce", "Carl", "David", "Ed","Frank", "Greg"],                         
     'age' : [20, 27, 35, 55, 18, 21, 35],                                # the values ... i.e. the list will provide
     'designation': ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]         # the elements for the respective column
    }

# Use the pandas.DataFrame() function to construct the dataframe from the dictionary

sample_df = pd.DataFrame(sample_dict)

In [4]:
sample_df

Unnamed: 0,name,age,designation
0,Adam,20,VP
1,Bruce,27,CEO
2,Carl,35,CFO
3,David,55,VP
4,Ed,18,VP
5,Frank,21,CEO
6,Greg,35,MD


![12.jpg](attachment:12.jpg)

<a class="anchor" id="2a1"></a>
### <span style="color:green"><b><i>The Anatomy of a DataFrame</i></b></span>

A DataFrame consists of three parts:
    
1. Index (Row Index)
2. Column Names (Column Index)
3. Data

The row labels, the column labels, and the underlying data can be accessed through the index, columns, and values attributes, respectively:

In [5]:
sample_df.index

RangeIndex(start=0, stop=7, step=1)

In [6]:
sample_df.columns

Index(['name', 'age', 'designation'], dtype='object')

In [7]:
sample_df.values

# returns the data as a NumPy array, one inner array per row

array([['Adam', 20, 'VP'],
       ['Bruce', 27, 'CEO'],
       ['Carl', 35, 'CFO'],
       ['David', 55, 'VP'],
       ['Ed', 18, 'VP'],
       ['Frank', 21, 'CEO'],
       ['Greg', 35, 'MD']], dtype=object)
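As a small complement, the sketch below reads the dimensions of a DataFrame and converts its data with `to_numpy()`, which the pandas documentation recommends over `.values`. The three-row DataFrame is a shortened, hypothetical version of the sample data.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "name": ["Adam", "Bruce", "Carl"],
    "age": [20, 27, 35],
})

# shape is (number of rows, number of columns)
n_rows, n_cols = sample_df.shape

# to_numpy() returns the data as a NumPy array, like .values
data = sample_df.to_numpy()
print(n_rows, n_cols)
print(data.shape)
```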

<a class="anchor" id="2a2"></a>
### <span style="color:green"><b><i>Changing the Row Index</i></b></span>

Because we did not provide any row index values, pandas automatically generated a sequence (0…6) as the row index.<br>

To provide our own row index, we use the index parameter of the DataFrame constructor.

In [8]:
# Changing the row index

sample_df = pd.DataFrame(sample_dict, index = [1,2,3,4,5,6,7])

In [9]:
sample_df

Unnamed: 0,name,age,designation
1,Adam,20,VP
2,Bruce,27,CEO
3,Carl,35,CFO
4,David,55,VP
5,Ed,18,VP
6,Frank,21,CEO
7,Greg,35,MD


The index does not have to be numeric; we can also pass strings for the index.

In [10]:
# Using strings for the index

sample_df = pd.DataFrame(sample_dict, index = ["First", "Second", "Third", "Fourth", "Fifth", "Sixth", "Seventh"])

In [11]:
sample_df

Unnamed: 0,name,age,designation
First,Adam,20,VP
Second,Bruce,27,CEO
Third,Carl,35,CFO
Fourth,David,55,VP
Fifth,Ed,18,VP
Sixth,Frank,21,CEO
Seventh,Greg,35,MD


The index parameter accepts any array-like object, so we can also use a NumPy array as the index.

In [12]:
# Using a NumPy array as the Index

np_arr = np.array([10,20,30,40,50,60,70])
sample_df = pd.DataFrame(sample_dict, index = np_arr)

In [13]:
sample_df

Unnamed: 0,name,age,designation
10,Adam,20,VP
20,Bruce,27,CEO
30,Carl,35,CFO
40,David,55,VP
50,Ed,18,VP
60,Frank,21,CEO
70,Greg,35,MD
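The index can also be replaced on a DataFrame that already exists, and the new labels can then be used for lookups. The following is a minimal sketch on a shortened, hypothetical version of the sample data.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "name": ["Adam", "Bruce", "Carl"],
    "age": [20, 27, 35],
})

# Replace the default 0..2 index on an existing DataFrame
sample_df.index = ["First", "Second", "Third"]

# .loc selects a row by its label rather than by its position
third_row = sample_df.loc["Third"]
print(third_row["name"])
```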


<a class="anchor" id="2a3"></a>
### <span style="color:green"><b><i>Using a Column as the Row Index</i></b></span>

Most of the time, a dataset already contains a column that can serve as the row index.

In those cases, we don’t need pandas to generate a separate row index.

A generated index would be redundant and would take up extra memory.

The set_index() method lets us use any existing column, or set of columns, as the row index. It returns a new DataFrame and leaves the original unchanged.

In [14]:
sample_df.set_index("name")

name,age,designation
Adam,20,VP
Bruce,27,CEO
Carl,35,CFO
David,55,VP
Ed,18,VP
Frank,21,CEO
Greg,35,MD


In [15]:
sample_df.set_index("age")

age,name,designation
20,Adam,VP
27,Bruce,CEO
35,Carl,CFO
55,David,VP
18,Ed,VP
21,Frank,CEO
35,Greg,MD


We can also set multiple columns as the index by passing a list of column names ...

In [16]:
sample_df.set_index(["name","age"])

name,age,designation
Adam,20,VP
Bruce,27,CEO
Carl,35,CFO
David,55,VP
Ed,18,VP
Frank,21,CEO
Greg,35,MD
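Note that the cells above display the result of set_index() without storing it. The sketch below, on a shortened and hypothetical version of the sample data, shows that the original DataFrame keeps its columns and that reset_index() reverses the operation.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "name": ["Adam", "Bruce", "Carl"],
    "age": [20, 27, 35],
    "designation": ["VP", "CEO", "CFO"],
})

# set_index returns a new DataFrame; sample_df keeps all of its columns
by_name = sample_df.set_index("name")
print(list(sample_df.columns))
print(by_name.index.name)

# reset_index moves the index back into an ordinary column
restored = by_name.reset_index()
print(list(restored.columns))
```

To keep the new index, assign the result to a variable, as done with `by_name` here.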


<a class="anchor" id="2a4"></a>
### <span style="color:green"><b><i>Loading selective columns in a DataFrame</i></b></span>

Data analysis usually involves data cleanup, and we often decide to exclude some columns from the dataset being analyzed.

This not only saves memory but also helps us focus on the data of interest.

We’ll use the same dictionary to build the DataFrame, but this time we will specify which columns should be part of the DataFrame.

In [17]:
sample_dict = {                                                           # the keys will serve as the column names
     'name' : ["Adam", "Bruce", "Carl", "David", "Ed","Frank", "Greg"],                         
     'age' : [20,27, 35, 55, 18, 21, 35],                                 # the values ... i.e. the list will provide
     'designation': ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]         # the elements for the respective column
    }

# Use the pandas.DataFrame() function to construct the dataframe from the dictionary
sample_df = pd.DataFrame(sample_dict)

sample_df2 = pd.DataFrame(sample_dict, columns=["name", "age"])

In [23]:
sample_df

sample_df2

Unnamed: 0,name,age,designation
0,Adam,20,VP
1,Bruce,27,CEO
2,Carl,35,CFO
3,David,55,VP
4,Ed,18,VP
5,Frank,21,CEO
6,Greg,35,MD


Unnamed: 0,name,age
0,Adam,20
1,Bruce,27
2,Carl,35
3,David,55
4,Ed,18
5,Frank,21
6,Greg,35
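Two related behaviors are worth a quick sketch: selecting columns from a DataFrame that already exists, and what happens when columns= names something that is not a key of the dictionary. The column name "salary" is hypothetical and was chosen only because it is absent from the dictionary.

```python
import pandas as pd

sample_dict = {
    "name": ["Adam", "Bruce", "Carl"],
    "age": [20, 27, 35],
    "designation": ["VP", "CEO", "CFO"],
}
sample_df = pd.DataFrame(sample_dict)

# Select a subset of columns from an existing DataFrame
subset = sample_df[["name", "age"]]

# A name in columns= that is not a key of the dictionary
# ("salary" is hypothetical) yields a column of missing values
with_missing = pd.DataFrame(sample_dict, columns=["name", "salary"])
print(with_missing["salary"].isna().all())
```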


<a class="anchor" id="2a5"></a>
### <span style="color:green"><b><i>Set the Order of the DataFrame Columns</i></b></span>

In [18]:
# Use the pandas.DataFrame() function to construct the dataframe from the dictionary
# With a dictionary, the columns parameter selects which keys to use and sets the column order

sample_df = pd.DataFrame(sample_dict, columns = ['designation', 'name', 'age'])

In [25]:
sample_df

Unnamed: 0,designation,name,age
0,VP,Adam,20
1,CEO,Bruce,27
2,CFO,Carl,35
3,VP,David,55
4,VP,Ed,18
5,CEO,Frank,21
6,MD,Greg,35
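The column order of an existing DataFrame can be changed without rebuilding it from the dictionary. This is a minimal sketch on shortened, hypothetical sample data; `new_order` is a name introduced for the illustration.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "name": ["Adam", "Bruce", "Carl"],
    "age": [20, 27, 35],
    "designation": ["VP", "CEO", "CFO"],
})

new_order = ["designation", "name", "age"]

# Both forms return a new DataFrame with the columns rearranged
reordered = sample_df[new_order]
reordered_too = sample_df.reindex(columns=new_order)
print(list(reordered.columns))
```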


<a class="anchor" id="2a6"></a>
### <span style="color:green"><b><i>The Data Type of a Column of a Pandas DataFrame</i></b></span>

Unlike a Python list or dictionary, and much like a NumPy array, each column of a DataFrame has a single data type (dtype). In the output below, the text columns have the object dtype, shown as 'O'.

We can check the data type of a column using either dictionary-like syntax (bracket notation) or attribute-style access (dot notation), where the column name follows the name of the DataFrame.

In [19]:
# Checking the data types of the DataFrame columns using dictionary-like syntax

sample_df
sample_df['designation'].dtype
sample_df['name'].dtype
sample_df['age'].dtype

Unnamed: 0,designation,name,age
0,VP,Adam,20
1,CEO,Bruce,27
2,CFO,Carl,35
3,VP,David,55
4,VP,Ed,18
5,CEO,Frank,21
6,MD,Greg,35


dtype('O')

dtype('O')

dtype('int64')

In [20]:
# Checking the data types of the DataFrame columns by adding the column name to the name of the DataFrame
# i.e. by using dot notation

sample_df.designation.dtype
sample_df.name.dtype
sample_df.age.dtype

dtype('O')

dtype('O')

dtype('int64')

If we want to check the data types of all columns in the DataFrame, we can use the dtypes attribute of the DataFrame as follows.

In [18]:
# Using the dtypes attribute to check the data types of all columns in the DataFrame

sample_df.dtypes

designation    object
name           object
age             int64
dtype: object
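Dot notation is convenient, but it only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute or method. The sketch below uses a hypothetical `inventory` DataFrame whose column names were picked to show both problems.

```python
import pandas as pd

# "count" matches a DataFrame method and "unit price" contains a space
# (both column names are hypothetical, chosen for this illustration)
inventory = pd.DataFrame({
    "item": ["pen", "book"],
    "count": [12, 3],
    "unit price": [1.5, 9.0],
})

# Dot notation finds the count() method, not the column
print(callable(inventory.count))

# Bracket notation always reaches the column, spaces included
print(inventory["count"].dtype)
print(inventory["unit price"].dtype)
```

For this reason bracket notation is the safer default when column names are not under our control.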

<a class="anchor" id="2a7"></a>
### <span style="color:green"><b><i>Pandas DataFrame Columns Helper Functions</i></b></span>

Each column of a pandas DataFrame is a Series, and it provides helper methods that are useful for extracting information from the column.

Some of these are ...

In [21]:
sample_dict = {                                                           # the keys will serve as the column names
     'name' : ["Adam", "Bruce", "Carl", "David", "Ed","Frank", "Greg"],                         
     'age' : [20,27, 35, 55, 18, 21, 35],                                 # the values ... i.e. the list will provide
     'designation': ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]         # the elements for the respective column
    }

# Use the pandas.DataFrame() function to construct the dataframe from the dictionary
sample_df = pd.DataFrame(sample_dict)

In [22]:
sample_df

Unnamed: 0,name,age,designation
0,Adam,20,VP
1,Bruce,27,CEO
2,Carl,35,CFO
3,David,55,VP
4,Ed,18,VP
5,Frank,21,CEO
6,Greg,35,MD


<b>unique</b> → Returns the unique elements of a column, in order of first appearance, by removing duplicates.

[pandas.unique()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html)

In [23]:
sample_df.designation.unique()
sample_df.age.unique()

array(['VP', 'CEO', 'CFO', 'MD'], dtype=object)

array([20, 27, 35, 55, 18, 21], dtype=int64)
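Two closely related Series methods are `nunique()` and `value_counts()`. The sketch below applies them to the designation values from the sample data, held in a standalone Series for brevity.

```python
import pandas as pd

designation = pd.Series(["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"])

# Number of distinct values
print(designation.nunique())

# How many times each value occurs, most frequent first
counts = designation.value_counts()
print(counts)
```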

<b>mean</b> → Provides the mean value of all the items in the column.<br>
<b>median</b> → Provides the median value of all the items in the column.<br>
<b>std</b> → Provides the sample standard deviation of all the items in the column (by default, pandas divides by n - 1).

In [24]:
sample_df.age.mean()
sample_df.age.median()
sample_df.age.std()

30.142857142857142

27.0

12.966991059719952
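The value of std() above is the sample standard deviation. As a sketch, the ddof parameter switches between the sample and population versions; the ages are the same seven values used in the sample data.

```python
import math
import pandas as pd

age = pd.Series([20, 27, 35, 55, 18, 21, 35])

sample_std = age.std()            # default ddof=1, divides by n - 1
population_std = age.std(ddof=0)  # divides by n

# The two are related by a factor of sqrt((n - 1) / n)
n = len(age)
print(math.isclose(population_std, sample_std * math.sqrt((n - 1) / n)))
```

NumPy's `np.std` uses ddof=0 by default, so the two libraries give different results unless ddof is set explicitly.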

<a class="anchor" id="3"></a>
# <span style="color:blue"><b>3. Inspecting/Viewing the Data of a DataFrame</b></span>

In practice, a DataFrame often contains hundreds (if not thousands) of rows, so we usually view only part of it at a time.

To view rows selectively, we can use the head(…) and tail(…) methods. By default they return the first or last five rows; if a number n is passed, they return that many rows from the top or bottom. A negative n returns all rows except the last |n| rows (head) or except the first |n| rows (tail).

<a class="anchor" id="3i"></a>
### <span style="color:green"><b><i>Inspecting/Viewing the Data of a DataFrame -- DataFrame.head() Method</i></b></span>

In [25]:
sample_df

Unnamed: 0,name,age,designation
0,Adam,20,VP
1,Bruce,27,CEO
2,Carl,35,CFO
3,David,55,VP
4,Ed,18,VP
5,Frank,21,CEO
6,Greg,35,MD


In [26]:
sample_df.head()

Unnamed: 0,name,age,designation
0,Adam,20,VP
1,Bruce,27,CEO
2,Carl,35,CFO
3,David,55,VP
4,Ed,18,VP


![13.jpg](attachment:13.jpg)

In [27]:
sample_df.head(3)

Unnamed: 0,name,age,designation
0,Adam,20,VP
1,Bruce,27,CEO
2,Carl,35,CFO


In [28]:
sample_df.head(-2)

Unnamed: 0,name,age,designation
0,Adam,20,VP
1,Bruce,27,CEO
2,Carl,35,CFO
3,David,55,VP
4,Ed,18,VP


In [29]:
sample_df['age'].head()

0    20
1    27
2    35
3    55
4    18
Name: age, dtype: int64

In [30]:
sample_df[['age', 'name']].head()

Unnamed: 0,age,name
0,20,Adam
1,27,Bruce
2,35,Carl
3,55,David
4,18,Ed


<a class="anchor" id="3ii"></a>
### <span style="color:green"><b><i>Inspecting/Viewing the Data of a DataFrame -- DataFrame.tail() Method</i></b></span>

In [31]:
sample_df.tail()

Unnamed: 0,name,age,designation
2,Carl,35,CFO
3,David,55,VP
4,Ed,18,VP
5,Frank,21,CEO
6,Greg,35,MD


![14.jpg](attachment:14.jpg)

In [32]:
sample_df.tail(3)

Unnamed: 0,name,age,designation
4,Ed,18,VP
5,Frank,21,CEO
6,Greg,35,MD


In [33]:
sample_df.tail(-2)

Unnamed: 0,name,age,designation
2,Carl,35,CFO
3,David,55,VP
4,Ed,18,VP
5,Frank,21,CEO
6,Greg,35,MD
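The negative arguments used in head(-2) and tail(-2) can be read as slices. The following sketch, which rebuilds two columns of the sample data, checks that reading against `iloc`.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "name": ["Adam", "Bruce", "Carl", "David", "Ed", "Frank", "Greg"],
    "age": [20, 27, 35, 55, 18, 21, 35],
})

# head(-2): every row except the last two
print(sample_df.head(-2).equals(sample_df.iloc[:-2]))

# tail(-2): every row except the first two
print(sample_df.tail(-2).equals(sample_df.iloc[2:]))
```

With seven rows, both calls return five rows, which is why their output above looks like the default head() and tail() output.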


<a class="anchor" id="3iii"></a>
### <span style="color:green"><b><i>Inspecting/Viewing the Data of a DataFrame -- DataFrame.info() Method</i></b></span>

In [34]:
sample_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   name         7 non-null      object
 1   age          7 non-null      int64 
 2   designation  7 non-null      object
dtypes: int64(1), object(2)
memory usage: 296.0+ bytes


![15.jpg](attachment:15.jpg)
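info() prints its report rather than returning it. When the same facts are needed as values, for example inside a condition, they can be read from other attributes and methods. This is a minimal sketch on shortened, hypothetical sample data.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "name": ["Adam", "Bruce", "Carl"],
    "age": [20, 27, 35],
})

# info() prints its report and returns None
result = sample_df.info()
print(result is None)

# The same facts are available as values
print(sample_df.shape)          # (rows, columns)
non_null = sample_df.count()    # non-null entries per column
print(non_null["age"])
```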

<a class="anchor" id="4"></a>
# <span style="color:blue"><b>4. Creating a DataFrame from a NumPy ndarray</b></span>

The DataFrame constructor’s data parameter also accepts a NumPy ndarray. We can generate an ndarray of any size with the randint function in NumPy’s random module.

The next example creates a 3 x 5 ndarray of random integers from 1 (inclusive) to 101 (exclusive), that is, from 1 through 100. Because the values are random, they change each time the cell is run:

In [35]:
random_data = np.random.randint(1, 101, [3, 5])
random_data

array([[19, 20,  1, 34, 99],
       [30, 55, 47, 81, 13],
       [72, 71, 84, 57, 32]])
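If repeatable values are wanted, one option is NumPy's seeded Generator interface. This is a sketch of an alternative to the randint call above, not what the notebook itself uses; the seed 42 is an arbitrary assumption.

```python
import numpy as np

# Seed value 42 is arbitrary; any fixed integer gives repeatable output
rng = np.random.default_rng(42)

# integers() excludes the upper bound by default, like randint
random_data = rng.integers(1, 101, size=(3, 5))
print(random_data.shape)
```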

Next, let’s pass our ndarray into the DataFrame constructor. 

The ndarray has neither row labels nor column labels. Thus, pandas uses a numeric index for both the row axis and the column axis:

In [36]:
pd.DataFrame(data = random_data)

Unnamed: 0,0,1,2,3,4
0,19,20,1,34,99
1,30,55,47,81,13
2,72,71,84,57,32


We can manually set the row labels with the DataFrame constructor’s index parameter, which accepts any iterable object, including a list, tuple, or ndarray. 

Note that the iterable’s length must be equal to the data set’s number of rows. We’re passing a 3 x 5 ndarray, so we must provide three row labels:

In [37]:
row_labels = ["Morning", "Afternoon", "Evening"]
temperatures = pd.DataFrame(data = random_data, index = row_labels)
temperatures

Unnamed: 0,0,1,2,3,4
Morning,19,20,1,34,99
Afternoon,30,55,47,81,13
Evening,72,71,84,57,32
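The length requirement can be checked with a short sketch: passing two row labels for a 3 x 5 array raises a ValueError. The array of zeros and the variable `mismatch_raised` are stand-ins used only for this illustration.

```python
import numpy as np
import pandas as pd

data = np.zeros((3, 5))

# Two labels for three rows: pandas rejects the mismatch
try:
    pd.DataFrame(data=data, index=["Morning", "Afternoon"])
    mismatch_raised = False
except ValueError as error:
    mismatch_raised = True
    print(error)
```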


We can set the column names with the constructor’s columns parameter. 

The ndarray includes five columns, so we must pass an iterable with five items. 

The next example passes the column names in a tuple:

In [38]:
row_labels = ["Morning", "Afternoon", "Evening"]
column_labels = (
    "Monday",
    "Tuesday",
    "Wednesday",
    "Thursday",
    "Friday",
)

pd.DataFrame(data = random_data, index = row_labels, columns = column_labels)

Unnamed: 0,Monday,Tuesday,Wednesday,Thursday,Friday
Morning,19,20,1,34,99
Afternoon,30,55,47,81,13
Evening,72,71,84,57,32


Pandas permits duplicates in the row and column indices. 

In the next example, "Morning" appears twice in the rows’ index labels, and "Tuesday" appears twice in the columns’ index labels:

In [60]:
row_labels = ["Morning", "Afternoon", "Morning"]
column_labels = [
    "Monday",
    "Tuesday",
    "Wednesday",
    "Tuesday",
    "Friday",
]
pd.DataFrame(
    data = random_data,
    index = row_labels,
    columns = column_labels,
)

Unnamed: 0,Monday,Tuesday,Wednesday,Tuesday,Friday
Morning,62,1,90,74,21
Afternoon,12,97,63,82,4
Morning,11,99,53,78,29
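Duplicate labels are allowed, but they change what a selection returns. The sketch below uses np.arange in place of random data so the result is repeatable; the name `duplicated` is hypothetical.

```python
import numpy as np
import pandas as pd

duplicated = pd.DataFrame(
    data=np.arange(15).reshape(3, 5),
    index=["Morning", "Afternoon", "Morning"],
    columns=["Monday", "Tuesday", "Wednesday", "Tuesday", "Friday"],
)

# is_unique reports whether every label occurs only once
print(duplicated.index.is_unique)
print(duplicated.columns.is_unique)

# Selecting a repeated label returns every match
print(duplicated.loc["Morning"].shape)
print(duplicated["Tuesday"].shape)
```

Selecting "Tuesday" returns a two-column DataFrame rather than a single Series, which can surprise code that expects one column.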


<a class="anchor" id="5"></a>
# <span style="color:blue"><b>5. Creating a Pandas DataFrame from a List</b></span>

In [62]:
list_of_lists = [[1,2,3,4],
               [5,6,7,8],
               [9,10,11,12],
               [13,14,15,16],
               [17,18,19,20]]

new_df = pd.DataFrame(list_of_lists)

In [63]:
new_df

Unnamed: 0,0,1,2,3
0,1,2,3,4
1,5,6,7,8
2,9,10,11,12
3,13,14,15,16
4,17,18,19,20


When the data has no row or column labels, pandas builds the DataFrame by implicitly adding a numeric row index and numeric column names for us.

Once again, if we don’t want pandas to generate the row index and column names automatically, we can pass them to the DataFrame constructor as follows ...

In [64]:
new_df = pd.DataFrame(list_of_lists, index = ["1->", "2->", "3->", "4->", "5->"], columns = ["A", "B", "C", "D"])

In [65]:
new_df

Unnamed: 0,A,B,C,D
1->,1,2,3,4
2->,5,6,7,8
3->,9,10,11,12
4->,13,14,15,16
5->,17,18,19,20
