___This notebook is created and maintained by "Daksh"___ @ https://github.com/DakshHub/Pandas_DataFrame_And_Series/upload/master/DataFrame

## Introduction to Pandas : Pandas DataFrame

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. 

Its two main data structures are __DataFrame__ and __Series__

This chapter will talk about Pandas __DataFrame__

_Most introductions (even the official documentation) start with __Series__. However, after working with __Pandas__ for a long time, I'm convinced that we should start with __DataFrame__. Learning __Series__ becomes super easy once we know __DataFrame__._

### What is a DataFrame ?

A 2-D labeled data structure with columns of potentially different types.

_In other words... Pandas DataFrame is just like an Excel Sheet_

![alt text](excelsheet.png "Title")

Where we store the data in 2-D as

![alt text](exceldata.png "Title")

A __DataFrame__ is nothing but a programmatic representation of the above sheet in memory.

___Pandas DataFrame___ not only allows a large amount of data to be stored in memory, but also provides various functionalities to analyze, change, and extract valuable information from that data.

### Storing the Data Above in Pandas DataFrame

Though ___Pandas___ provides various APIs to read data from files, including CSV files, we'll stick with Python data structures like __Dictionary__ and __List__ for better understanding.

So if we consider __NAME, AGE, and DESIGNATION__ as keys, what's the best Python data structure to store the values? You guessed it right, it's a __Dictionary__. So here is how we'll go ahead and create a __DataFrame__ using a Python __Dictionary__

In [1]:
import pandas as pd
import numpy as np

In [2]:
my_dict = { 'name' : ["a", "b", "c", "d", "e","f", "g"],
                   'age' : [20,27, 35, 55, 18, 21, 35],
                   'designation': ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]}

Let's first look at the __Dictionary__ itself

In [5]:
my_dict

{'age': [20, 27, 35, 55, 18, 21, 35],
 'designation': ['VP', 'CEO', 'CFO', 'VP', 'VP', 'CEO', 'MD'],
 'name': ['a', 'b', 'c', 'd', 'e', 'f', 'g']}

To get a __DataFrame__ out of this __Dictionary__, we pass it to `pd.DataFrame(...)`

In [3]:
df = pd.DataFrame(my_dict)

In [6]:
df

| | age | designation | name |
|---|---|---|---|
| 0 | 20 | VP | a |
| 1 | 27 | CEO | b |
| 2 | 35 | CFO | c |
| 3 | 55 | VP | d |
| 4 | 18 | VP | e |
| 5 | 21 | CEO | f |
| 6 | 35 | MD | g |


If we don't provide a row index, the __DataFrame__ generates one on its own as a sequence starting at zero. To provide our own row labels, we'll change the code as

In [7]:
df = pd.DataFrame(my_dict, index=[1,2,3,4,5,6,7])

In [8]:
df

| | age | designation | name |
|---|---|---|---|
| 1 | 20 | VP | a |
| 2 | 27 | CEO | b |
| 3 | 35 | CFO | c |
| 4 | 55 | VP | d |
| 5 | 18 | VP | e |
| 6 | 21 | CEO | f |
| 7 | 35 | MD | g |


In [9]:
# Index as string
df = pd.DataFrame(my_dict, index=["First", "Second", "Third", "Fourth", "Fifth", "Sixth", "Seventh"])

In [10]:
df

| | age | designation | name |
|---|---|---|---|
| First | 20 | VP | a |
| Second | 27 | CEO | b |
| Third | 35 | CFO | c |
| Fourth | 55 | VP | d |
| Fifth | 18 | VP | e |
| Sixth | 21 | CEO | f |
| Seventh | 35 | MD | g |


_In fact, the `index` can very well be a NumPy array_

In [11]:
np_arr = np.array([10,20,30,40,50,60,70])
df = pd.DataFrame(my_dict, index=np_arr)

In [12]:
df

| | age | designation | name |
|---|---|---|---|
| 10 | 20 | VP | a |
| 20 | 27 | CEO | b |
| 30 | 35 | CFO | c |
| 40 | 55 | VP | d |
| 50 | 18 | VP | e |
| 60 | 21 | CEO | f |
| 70 | 35 | MD | g |
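Whatever form the `index` takes, it needs exactly one label per row. The following is a minimal sketch, assuming the same `my_dict` as above (the names `short_index` and `error_message` are only for illustration), of what happens when the number of labels doesn't match the number of rows:

```python
import pandas as pd

my_dict = {'name': ["a", "b", "c", "d", "e", "f", "g"],
           'age': [20, 27, 35, 55, 18, 21, 35],
           'designation': ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]}

# Seven rows of data but only three labels: pandas refuses to build the frame
short_index = [1, 2, 3]
try:
    pd.DataFrame(my_dict, index=short_index)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# With one label per row, the labels end up in df.index
df = pd.DataFrame(my_dict, index=range(1, 8))
print(list(df.index))
```

The exact wording of the error message differs between pandas versions, so the sketch only relies on it being a `ValueError`.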


### The Homogeneous DataTypes of a Column

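Unlike Python lists or dictionaries, and just like a __NumPy__ array, each column of the __DataFrame__ has a single data type (`dtype`). In the outputs below, the string columns are reported with the generic `object` dtype, shown as `'O'`.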

We can check the type of a column in two ways

In [13]:
df['age'].dtype

dtype('int64')

In [14]:
df.age.dtype

dtype('int64')

In [15]:
df.name.dtype

dtype('O')

To check the datatype of all columns in the __DataFrame__, we can use

In [16]:
df.dtypes

age             int64
designation    object
name           object
dtype: object
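To see how a single `dtype` per column is chosen, here is a small sketch with a hypothetical `mixed` DataFrame (its column names are made up for this illustration). It assumes nothing beyond `pd.DataFrame` and `dtypes` as used above:

```python
import pandas as pd

mixed = pd.DataFrame({
    'ints': [1, 2, 3],                 # only integers
    'ints_and_float': [1, 2, 3.5],     # integers are upcast to a float type
    'ints_and_str': [1, 2, 'three'],   # no common numeric type exists
})

# One dtype per column, even though the input lists were mixed
print(mixed.dtypes)
```

The column that mixes numbers and strings falls back to the generic `object` dtype, which is why a column is homogeneous in type even when its values look different.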

### Viewing the Data of a DataFrame

More often than not, a Pandas __DataFrame__ will contain hundreds (if not thousands) of rows. To view only some of the rows, we can use the ___head(...) and tail(...)___ functions, which return the first or last five rows by default, or the given number of rows from the top or bottom

In [17]:
df.head()

| | age | designation | name |
|---|---|---|---|
| 10 | 20 | VP | a |
| 20 | 27 | CEO | b |
| 30 | 35 | CFO | c |
| 40 | 55 | VP | d |
| 50 | 18 | VP | e |


In [18]:
df.tail()

| | age | designation | name |
|---|---|---|---|
| 30 | 35 | CFO | c |
| 40 | 55 | VP | d |
| 50 | 18 | VP | e |
| 60 | 21 | CEO | f |
| 70 | 35 | MD | g |


In [19]:
df.head(2)

| | age | designation | name |
|---|---|---|---|
| 10 | 20 | VP | a |
| 20 | 27 | CEO | b |


In [20]:
df.head(7)

| | age | designation | name |
|---|---|---|---|
| 10 | 20 | VP | a |
| 20 | 27 | CEO | b |
| 30 | 35 | CFO | c |
| 40 | 55 | VP | d |
| 50 | 18 | VP | e |
| 60 | 21 | CEO | f |
| 70 | 35 | MD | g |


In [21]:
df.tail(7)

| | age | designation | name |
|---|---|---|---|
| 10 | 20 | VP | a |
| 20 | 27 | CEO | b |
| 30 | 35 | CFO | c |
| 40 | 55 | VP | d |
| 50 | 18 | VP | e |
| 60 | 21 | CEO | f |
| 70 | 35 | MD | g |
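With only seven rows, `head()` and `tail()` overlap a lot. The sketch below uses a hypothetical 100-row DataFrame called `big` (not part of the original data) to show the same calls where they are more useful:

```python
import pandas as pd

# A made-up frame with more rows than comfortably fit on a screen
big = pd.DataFrame({'n': range(100),
                    'square': [i * i for i in range(100)]})

print(big.shape)     # (number of rows, number of columns)
print(big.head())    # first five rows by default
print(big.tail(3))   # last three rows
```

Both functions return a new, smaller DataFrame, so they can be used freely for inspection without changing `big`.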


You also have the option of getting all row indexes and column names as

In [22]:
df.index

Int64Index([10, 20, 30, 40, 50, 60, 70], dtype='int64')

In [23]:
df.columns

Index(['age', 'designation', 'name'], dtype='object')

A __DataFrame__ column also has helper functions like `unique()` to extract the unique elements of that column

In [24]:
df.designation.unique()

array(['VP', 'CEO', 'CFO', 'MD'], dtype=object)

Or, to get the mean of a column

In [25]:
df.age.mean()

30.142857142857142
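As a quick sanity check of these helpers, the following sketch (assuming the same `my_dict` as above; `manual_mean` is a name introduced here) compares `mean()` with a plain Python calculation and lists the unique designations:

```python
import pandas as pd

my_dict = {'name': ["a", "b", "c", "d", "e", "f", "g"],
           'age': [20, 27, 35, 55, 18, 21, 35],
           'designation': ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]}
df = pd.DataFrame(my_dict)

# The same mean computed by hand from the original list
ages = my_dict['age']
manual_mean = sum(ages) / len(ages)
print(df['age'].mean(), manual_mean)

# unique() keeps values in order of first appearance
print(list(df['designation'].unique()))
```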

### Using a Column as Row Index

If our data already contains a column that can serve as the row index, we can skip providing an index and replace the default one that the __DataFrame__ creates by assigning that column as the row index with `set_index(...)`

In [26]:
my_dict = { 'name' : ["a", "b", "c", "d", "e","f", "g"],
                   'age' : [20,27, 35, 55, 18, 21, 35],
                   'designation': ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]}

df = pd.DataFrame(my_dict)

In [27]:
df

| | age | designation | name |
|---|---|---|---|
| 0 | 20 | VP | a |
| 1 | 27 | CEO | b |
| 2 | 35 | CFO | c |
| 3 | 55 | VP | d |
| 4 | 18 | VP | e |
| 5 | 21 | CEO | f |
| 6 | 35 | MD | g |


In [28]:
df.set_index("name")

| name | age | designation |
|---|---|---|
| a | 20 | VP |
| b | 27 | CEO |
| c | 35 | CFO |
| d | 55 | VP |
| e | 18 | VP |
| f | 21 | CEO |
| g | 35 | MD |


In [29]:
df.set_index("age")

| age | designation | name |
|---|---|---|
| 20 | VP | a |
| 27 | CEO | b |
| 35 | CFO | c |
| 55 | VP | d |
| 18 | VP | e |
| 21 | CEO | f |
| 35 | MD | g |


We can even set multiple columns as Row Indexes

In [30]:
df.set_index(["name","age"])

| name | age | designation |
|---|---|---|
| a | 20 | VP |
| b | 27 | CEO |
| c | 35 | CFO |
| d | 55 | VP |
| e | 18 | VP |
| f | 21 | CEO |
| g | 35 | MD |
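Note that the `set_index(...)` calls above only display a result. By default the function returns a new DataFrame and leaves `df` as it was. A minimal sketch, assuming the same `my_dict` (the names `by_name` and `restored` are introduced here for illustration):

```python
import pandas as pd

my_dict = {'name': ["a", "b", "c", "d", "e", "f", "g"],
           'age': [20, 27, 35, 55, 18, 21, 35],
           'designation': ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]}
df = pd.DataFrame(my_dict)

# Keep the result in a variable, otherwise the new index is lost
by_name = df.set_index("name")

print(list(df.columns))        # "name" is still an ordinary column of df
print(list(by_name.columns))   # here "name" has moved into the index
print(list(by_name.index))

# reset_index() turns the index back into an ordinary column
restored = by_name.reset_index()
print(list(restored.columns))
```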


### Loading selective columns in DataFrame

Any data science activity requires data cleanup, and it's quite possible that we decide to exclude some columns from loading into the __DataFrame__.

We can do so by listing the names of the columns we want in the __DataFrame__

In [31]:
# Not including Designation
df = pd.DataFrame(my_dict, columns=["name", "age"])

In [32]:
df

| | name | age |
|---|---|---|
| 0 | a | 20 |
| 1 | b | 27 |
| 2 | c | 35 |
| 3 | d | 55 |
| 4 | e | 18 |
| 5 | f | 21 |
| 6 | g | 35 |
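The `columns` parameter also decides the order of the columns, and a name that is not a key of the dictionary does not raise an error. A small sketch, assuming the same `my_dict` (the `"salary"` column is hypothetical and deliberately missing from the data):

```python
import pandas as pd

my_dict = {'name': ["a", "b", "c", "d", "e", "f", "g"],
           'age': [20, 27, 35, 55, 18, 21, 35],
           'designation': ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]}

# Columns appear in the order they are listed
df = pd.DataFrame(my_dict, columns=["age", "name"])
print(list(df.columns))

# "salary" is not a key of my_dict, so that column is filled with NaN
with_missing = pd.DataFrame(my_dict, columns=["name", "salary"])
print(with_missing)
```

A column full of `NaN` is therefore a hint that a name in the `columns` list was misspelled.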


### Deleting Rows and Columns from DataFrame

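__DataFrame__ provides multiple ways of deleting rows and columns. They remove the same data, but note that `del` modifies the __DataFrame__ in place, while `.drop(...)` returns a new __DataFrame__ and leaves the original untouched. Let's regenerate our __DataFrame__ with the original __Dictionary__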


In [33]:
my_dict = { 'name' : ["a", "b", "c", "d", "e","f", "g"],
                   'age' : [20,27, 35, 55, 18, 21, 35],
                   'designation': ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]}

df = pd.DataFrame(my_dict)

In [34]:
df

| | age | designation | name |
|---|---|---|---|
| 0 | 20 | VP | a |
| 1 | 27 | CEO | b |
| 2 | 35 | CFO | c |
| 3 | 55 | VP | d |
| 4 | 18 | VP | e |
| 5 | 21 | CEO | f |
| 6 | 35 | MD | g |


In [35]:
# Deleting by specifying the Column Name
del df['name']
df


| | age | designation |
|---|---|---|
| 0 | 20 | VP |
| 1 | 27 | CEO |
| 2 | 35 | CFO |
| 3 | 55 | VP |
| 4 | 18 | VP |
| 5 | 21 | CEO |
| 6 | 35 | MD |


In [37]:
# Regenerate the DataFrame
my_dict = { 'name' : ["a", "b", "c", "d", "e","f", "g"],
                   'age' : [20,27, 35, 55, 18, 21, 35],
                   'designation': ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]}

df = pd.DataFrame(my_dict)

In [38]:
# Delete using .drop(...) function
df.drop('age', axis=1)

| | designation | name |
|---|---|---|
| 0 | VP | a |
| 1 | CEO | b |
| 2 | CFO | c |
| 3 | VP | d |
| 4 | VP | e |
| 5 | CEO | f |
| 6 | MD | g |


___The `axis` argument of .drop() decides what gets deleted: "1" denotes deletion of a "Column", whereas "0" denotes deletion of a "Row". Recent pandas versions accept it only by keyword, as in `axis=1`___
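The difference between `del` and `.drop(...)` is easy to miss in the outputs above, so here is a short sketch, assuming the same `my_dict` (the names `without_age` and `same_thing` are introduced here). It also uses the `columns=` keyword, which, to my knowledge, is an equivalent and more readable alternative to `axis=1`:

```python
import pandas as pd

my_dict = {'name': ["a", "b", "c", "d", "e", "f", "g"],
           'age': [20, 27, 35, 55, 18, 21, 35],
           'designation': ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]}
df = pd.DataFrame(my_dict)

without_age = df.drop('age', axis=1)    # returns a new DataFrame
same_thing = df.drop(columns=['age'])   # keyword form of the same request
print('age' in df.columns)              # df itself still has the column

del df['name']                          # removes the column from df itself
print(list(df.columns))
```

To keep the result of `.drop(...)`, assign it to a variable, as done with `without_age`.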

We can drop a Row by specifying the Row Index as

In [39]:
df.drop(3, axis=0)

| | age | designation | name |
|---|---|---|---|
| 0 | 20 | VP | a |
| 1 | 27 | CEO | b |
| 2 | 35 | CFO | c |
| 4 | 18 | VP | e |
| 5 | 21 | CEO | f |
| 6 | 35 | MD | g |


We can also delete multiple columns or rows in one go by passing a list to the .drop(...) function as

In [40]:
# Regenerate the DataFrame
my_dict = { 'name' : ["a", "b", "c", "d", "e","f", "g"],
                   'age' : [20,27, 35, 55, 18, 21, 35],
                   'designation': ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]}

df = pd.DataFrame(my_dict)

In [41]:
df.drop(['name','age'], axis=1)

| | designation |
|---|---|
| 0 | VP |
| 1 | CEO |
| 2 | CFO |
| 3 | VP |
| 4 | VP |
| 5 | CEO |
| 6 | MD |


In [42]:
df.drop([2,3,4], axis=0)

| | age | designation | name |
|---|---|---|---|
| 0 | 20 | VP | a |
| 1 | 27 | CEO | b |
| 5 | 21 | CEO | f |
| 6 | 35 | MD | g |
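Rows and columns can also be removed in a single call. The sketch below assumes the same `my_dict` and uses the `index=` and `columns=` keywords of `.drop(...)` instead of `axis` (the name `trimmed` is introduced here):

```python
import pandas as pd

my_dict = {'name': ["a", "b", "c", "d", "e", "f", "g"],
           'age': [20, 27, 35, 55, 18, 21, 35],
           'designation': ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]}
df = pd.DataFrame(my_dict)

# Drop rows 2, 3, 4 and the columns "name" and "age" in one go
trimmed = df.drop(index=[2, 3, 4], columns=['name', 'age'])
print(trimmed)
```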


### Deleting the Columns using the Index

While deleting columns, it may become tedious and error prone to write the name of each individual column in the drop list. Instead, we can select the columns by position using `df.columns` and pass the result to `df.drop()`

In [43]:
# Regenerate the DataFrame
my_dict = { 'name' : ["a", "b", "c", "d", "e","f", "g"],
                   'age' : [20,27, 35, 55, 18, 21, 35],
                   'designation': ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]}

df = pd.DataFrame(my_dict, index=["1st", "2nd", "3rd", "4th", "5th", "6th", "7th"])

In [44]:
df

| | age | designation | name |
|---|---|---|---|
| 1st | 20 | VP | a |
| 2nd | 27 | CEO | b |
| 3rd | 35 | CFO | c |
| 4th | 55 | VP | d |
| 5th | 18 | VP | e |
| 6th | 21 | CEO | f |
| 7th | 35 | MD | g |


In [45]:
df.drop(df.columns[[0,1]], axis=1)

| | name |
|---|---|
| 1st | a |
| 2nd | b |
| 3rd | c |
| 4th | d |
| 5th | e |
| 6th | f |
| 7th | g |
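Dropping by position depends on the order of the columns. The outputs in this notebook show the columns in alphabetical order, whereas recent pandas versions, as far as I know, keep the insertion order of the dictionary, so positions `0` and `1` may point at different columns there. The sketch below fixes the order explicitly with `columns=` so that the result is predictable (the names `first_two` and `remaining` are introduced here):

```python
import pandas as pd

my_dict = {'name': ["a", "b", "c", "d", "e", "f", "g"],
           'age': [20, 27, 35, 55, 18, 21, 35],
           'designation': ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]}

# Fix the column order so that positions 0 and 1 are well defined
df = pd.DataFrame(my_dict,
                  columns=["age", "designation", "name"],
                  index=["1st", "2nd", "3rd", "4th", "5th", "6th", "7th"])

first_two = df.columns[[0, 1]]   # column names found at positions 0 and 1
print(list(first_two))

remaining = df.drop(first_two, axis=1)
print(list(remaining.columns))
```

Printing `df.columns` before dropping by position is a cheap way to confirm which columns are about to be removed.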


### Generating a DataFrame from a list

We don't always need to provide named columns to generate a __DataFrame__. We can also provide a list of lists, where each inner list becomes a row, as

In [46]:
my_list = [[1,2,3,4],
           [5,6,7,8],
           [9,10,11,12],
           [13,14,15,16],
           [17,18,19,20]]

df = pd.DataFrame(my_list)

In [47]:
df

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 |
| 1 | 5 | 6 | 7 | 8 |
| 2 | 9 | 10 | 11 | 12 |
| 3 | 13 | 14 | 15 | 16 |
| 4 | 17 | 18 | 19 | 20 |


By default, a Pandas __DataFrame__ generates row indexes and column names as sequences of integers. We can change them by providing the `index` and `columns` parameters while creating the __DataFrame__ as

In [48]:
df = pd.DataFrame(my_list, index = ["1->", "2->", "3->", "4->", "5->"], columns = ["A", "B", "C", "D"])

In [49]:
df

| | A | B | C | D |
|---|---|---|---|---|
| 1-> | 1 | 2 | 3 | 4 |
| 2-> | 5 | 6 | 7 | 8 |
| 3-> | 9 | 10 | 11 | 12 |
| 4-> | 13 | 14 | 15 | 16 |
| 5-> | 17 | 18 | 19 | 20 |


___Don't forget that NumPy Arrays are equally welcome as input to Pandas DataFrame___

In [50]:
np_arr = np.array([[1,2,3,4],
                   [5,6,7,8],
                   [9,10,11,12],
                   [13,15,16,16],
                   [17,18,19,20]])

df = pd.DataFrame(np_arr)

In [51]:
df

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 |
| 1 | 5 | 6 | 7 | 8 |
| 2 | 9 | 10 | 11 | 12 |
| 3 | 13 | 15 | 16 | 16 |
| 4 | 17 | 18 | 19 | 20 |
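The `index` and `columns` parameters work for NumPy input in the same way as for lists. A minimal sketch, with made-up row labels and an array built by `np.arange` instead of being typed by hand:

```python
import numpy as np
import pandas as pd

# 5 rows x 4 columns holding the values 1..20
np_arr = np.arange(1, 21).reshape(5, 4)

df = pd.DataFrame(np_arr,
                  index=["r1", "r2", "r3", "r4", "r5"],   # illustrative labels
                  columns=["A", "B", "C", "D"])

print(df.shape)
print(df)
```

Each row of the 2-D array becomes a row of the DataFrame, so the number of `columns` labels has to match the second dimension of the array.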


### It's Easy to do Mathematical Operations in DataFrames

Just like Excel Sheets, Pandas __DataFrame__ is at ease with doing mathematical operations.

In [52]:
df * df

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1 | 4 | 9 | 16 |
| 1 | 25 | 36 | 49 | 64 |
| 2 | 81 | 100 | 121 | 144 |
| 3 | 169 | 225 | 256 | 256 |
| 4 | 289 | 324 | 361 | 400 |


In [53]:
newdf = df * 10
newdf

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 10 | 20 | 30 | 40 |
| 1 | 50 | 60 | 70 | 80 |
| 2 | 90 | 100 | 110 | 120 |
| 3 | 130 | 150 | 160 | 160 |
| 4 | 170 | 180 | 190 | 200 |


In [54]:
df + 100

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 101 | 102 | 103 | 104 |
| 1 | 105 | 106 | 107 | 108 |
| 2 | 109 | 110 | 111 | 112 |
| 3 | 113 | 115 | 116 | 116 |
| 4 | 117 | 118 | 119 | 120 |


In [55]:
df & 0

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 |
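All of the operations above work element by element, and the same holds between individual columns, much like a formula filled down an Excel column. A small sketch with a hypothetical two-column DataFrame (the `total` column is introduced here for illustration):

```python
import pandas as pd

df = pd.DataFrame([[1, 2],
                   [3, 4]], columns=["A", "B"])

# Element-wise multiplication of the frame with itself
squared = df * df
print(squared)

# Arithmetic between two columns is element-wise too;
# the result can be stored as a new column
df["total"] = df["A"] + df["B"]
print(df)
```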


## What is a Series?

___A Series is a single column of a DataFrame. More than one Series combine to form a DataFrame___
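Both directions of that statement can be sketched in a few lines. The example below uses a shortened, made-up version of the data from earlier (the names `names`, `ages`, and `combined` are introduced here):

```python
import pandas as pd

df = pd.DataFrame({'name': ["a", "b", "c"],
                   'age': [20, 27, 35]})

# Selecting one column of a DataFrame gives a Series
age = df['age']
print(type(age).__name__)

# Several Series of equal length combine into a DataFrame
names = pd.Series(["a", "b", "c"])
ages = pd.Series([20, 27, 35])
combined = pd.DataFrame({'name': names, 'age': ages})
print(combined)
```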