# Pandas DataFrames

A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.

Features:
- Columns can hold different data types
- Size is mutable: rows and columns can be added or removed
- Labeled axes (rows and columns)
- Arithmetic operations can be performed on rows and columns

You can think of it as an SQL table or a spreadsheet.
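
The features listed above can be seen in a small sketch. The data and the names `df_demo`, `height_m` and `age_in_5_years` are made up for illustration only.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df_demo = pd.DataFrame(
    {"name": ["John", "Anna"], "age": [18, 17], "height_m": [1.80, 1.65]},
    index=["s1", "s2"],  # labeled rows
)

# Each column keeps its own data type.
print(df_demo.dtypes)

# Size-mutable: a column can be added after creation,
# here with element-wise arithmetic on an existing column.
df_demo["age_in_5_years"] = df_demo["age"] + 5

# Both axes are labeled, so a cell can be read by row and column label.
print(df_demo.loc["s2", "age"])
```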

In [1]:
# !pip install numpy
# !pip install pandas

In [2]:
import numpy as np
import pandas as pd

In [3]:
# Create an empty DataFrame
df = pd.DataFrame()
print(df)

Empty DataFrame
Columns: []
Index: []


In [4]:
# Create a DataFrame from a list
data = [10,20,30,40,50]
df = pd.DataFrame(data)
print(df) 

    0
0  10
1  20
2  30
3  40
4  50


In [5]:
# Create a DataFrame from a list of lists, naming the columns
students = [['John',18], ['Anna',17], ['Peter',19]]
df = pd.DataFrame(students, columns=['Name','Age'])
display(df)

    Name  Age
0   John   18
1   Anna   17
2  Peter   19


In [6]:
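# Create a DataFrame from a list of lists, storing Age as float.
# Passing dtype=float to the constructor would apply to every column,
# and recent pandas versions may raise an error because Name holds strings.
students = [['John',18], ['Anna',17], ['Peter',19]]
df = pd.DataFrame(students, columns=['Name','Age']).astype({'Age': float})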
display(df)

    Name   Age
0   John  18.0
1   Anna  17.0
2  Peter  19.0


**Create a DataFrame from Dict of ndarrays/Lists**

- All the ndarrays/lists must have the same length. If an index is passed, its length must equal the length of the arrays.
- If no index is passed, the index defaults to range(n), where n is the array length.

In [7]:
students = {'Name':['John','Jane','Emma'], 'Age':[18,17,19]}
df = pd.DataFrame(students, index=['a','b','c'])
display(df)

   Name  Age
a  John   18
b  Jane   17
c  Emma   19
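
The two rules above can be checked with a short sketch. The names `df_default` and `errors` are illustrative, and the exact wording of the error messages may differ between pandas versions.

```python
import pandas as pd

students = {"Name": ["John", "Jane", "Emma"], "Age": [18, 17, 19]}

# No index passed: the index defaults to 0, 1, ..., n-1.
df_default = pd.DataFrame(students)
print(df_default.index)

errors = []

# Lists of different lengths are rejected.
try:
    pd.DataFrame({"Name": ["John", "Jane", "Emma"], "Age": [18, 17]})
except ValueError as err:
    errors.append(str(err))

# An index whose length differs from the data is rejected as well.
try:
    pd.DataFrame(students, index=["a", "b"])
except ValueError as err:
    errors.append(str(err))

for message in errors:
    print("ValueError:", message)
```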


In [8]:
# Create a DataFrame from a list of dictionaries
data = [{'a':11,'b':22},{'a':9,'b':15,'c':24}]
df = pd.DataFrame(data)
display(df)

    a   b     c
0  11  22   NaN
1   9  15  24.0


Observe that the first dictionary does not have the key 'c', so a NaN value is assigned to the corresponding entry.
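
A minimal sketch of how the missing entry can be inspected. The name `df_nan` is illustrative. Note that column `c` becomes float64, because NaN is a floating-point value.

```python
import pandas as pd

data = [{"a": 11, "b": 22}, {"a": 9, "b": 15, "c": 24}]
df_nan = pd.DataFrame(data)

# The integer 24 is stored as 24.0 because the column also holds NaN.
print(df_nan["c"].dtype)

# Boolean mask that is True wherever a value is missing.
print(df_nan.isna())

# Number of missing values per column.
missing_per_column = df_nan.isna().sum()
print(missing_per_column)
```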

## Basic operations on DataFrames

A new column can be added to a DataFrame:

In [9]:
df['d'] = df['a'] + df['b']
display(df)

    a   b     c   d
0  11  22   NaN  33
1   9  15  24.0  24


Observe what happens when we operate on NaN values:

In [10]:
df['e'] = df['a'] - df['c']
display(df)

    a   b     c   d     e
0  11  22   NaN  33   NaN
1   9  15  24.0  24 -15.0
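
NaN propagates through arithmetic: any operation with a missing operand yields NaN. As a sketch of one possible alternative, `Series.sub` accepts a `fill_value` that replaces the missing operand before subtracting. Treating the missing value as 0 is an assumption made here for illustration, not something the data implies.

```python
import numpy as np
import pandas as pd

df_ops = pd.DataFrame({"a": [11, 9], "c": [np.nan, 24.0]})

# Plain subtraction: the NaN in 'c' propagates to the result.
plain = df_ops["a"] - df_ops["c"]
print(plain)

# Same subtraction, but the missing value is treated as 0 first.
filled = df_ops["a"].sub(df_ops["c"], fill_value=0)
print(filled)
```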


In [11]:
df['f'] = 5
display(df)

    a   b     c   d     e  f
0  11  22   NaN  33   NaN  5
1   9  15  24.0  24 -15.0  5


In [12]:
# Column deletion
del df['f']
display(df)

    a   b     c   d     e
0  11  22   NaN  33   NaN
1   9  15  24.0  24 -15.0


In [13]:
# Using pop
df.pop('e')
display(df)

    a   b     c   d
0  11  22   NaN  33
1   9  15  24.0  24


In [14]:
# Using drop
df.drop(columns=['d'], inplace=True)
df

    a   b     c
0  11  22   NaN
1   9  15  24.0


`inplace=True` modifies the DataFrame in place instead of returning a modified copy.
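
The difference can be seen in a short sketch. The names `df_in`, `without_d` and `result` are illustrative.

```python
import pandas as pd

df_in = pd.DataFrame({"a": [11, 9], "b": [22, 15], "d": [33, 24]})

# Default (inplace=False): a new DataFrame is returned, the original is untouched.
without_d = df_in.drop(columns=["d"])
print(list(df_in.columns))      # ['a', 'b', 'd']
print(list(without_d.columns))  # ['a', 'b']

# inplace=True: the original is modified and the call returns None.
result = df_in.drop(columns=["d"], inplace=True)
print(result)                   # None
print(list(df_in.columns))      # ['a', 'b']
```

Because the in-place call returns None, writing `df = df.drop(..., inplace=True)` would replace the DataFrame with None.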

Add new rows to a DataFrame with `pd.concat` (`DataFrame.append` was removed in pandas 2.0):

In [15]:
df2 = pd.DataFrame([[1,2,3,4],[5,6,7,8]], columns=['a','b','c','d'])
df = pd.concat([df, df2])
display(df)

    a   b     c    d
0  11  22   NaN  NaN
1   9  15  24.0  NaN
0   1   2   3.0  4.0
1   5   6   7.0  8.0


Labels are duplicated! 

Use an index label to drop rows from a DataFrame. If the label is duplicated, every row with that label is dropped.

In [16]:

df1 = df.drop(0) 
display(df1)

   a   b     c    d
1  9  15  24.0  NaN
1  5   6   7.0  8.0
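
When labels are duplicated and only one of the rows should go, selecting by position is one option. The sketch below uses a made-up frame `dup` and compares label-based `drop` with positional slicing via `iloc`.

```python
import pandas as pd

# Hypothetical frame with duplicated index labels 0 and 1.
dup = pd.DataFrame({"a": [11, 9, 1, 5]}, index=[0, 1, 0, 1])

# Label-based: removes every row labeled 0.
by_label = dup.drop(0)
print(by_label)

# Position-based: keeps everything except the first row.
by_position = dup.iloc[1:]
print(by_position)
```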


In [17]:
# Adding more rows
df = pd.concat([df, df2, df2])
df

    a   b     c    d
0  11  22   NaN  NaN
1   9  15  24.0  NaN
0   1   2   3.0  4.0
1   5   6   7.0  8.0
0   1   2   3.0  4.0
1   5   6   7.0  8.0
0   1   2   3.0  4.0
1   5   6   7.0  8.0


We can reset the index to avoid duplicate labels:

In [18]:
df = df.reset_index(drop=True)
df

    a   b     c    d
0  11  22   NaN  NaN
1   9  15  24.0  NaN
2   1   2   3.0  4.0
3   5   6   7.0  8.0
4   1   2   3.0  4.0
5   5   6   7.0  8.0
6   1   2   3.0  4.0
7   5   6   7.0  8.0
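
Duplicate labels can also be avoided when the rows are combined. As a sketch, `pd.concat` accepts `ignore_index=True`, which renumbers the result from 0. The frames `top` and `bottom` are made up for illustration.

```python
import pandas as pd

top = pd.DataFrame({"a": [11, 9], "b": [22, 15]})
bottom = pd.DataFrame({"a": [1, 5], "b": [2, 6]})

# Default: original labels are kept, so 0 and 1 appear twice.
stacked = pd.concat([top, bottom])
print(stacked.index.is_unique)

# ignore_index=True: labels are replaced by 0, 1, 2, 3.
renumbered = pd.concat([top, bottom], ignore_index=True)
print(list(renumbered.index))
```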


In [19]:
# Renaming columns
df.rename(columns={'a':'A','b':'B'},inplace=True)
df

    A   B     c    d
0  11  22   NaN  NaN
1   9  15  24.0  NaN
2   1   2   3.0  4.0
3   5   6   7.0  8.0
4   1   2   3.0  4.0
5   5   6   7.0  8.0
6   1   2   3.0  4.0
7   5   6   7.0  8.0


### Inspecting data

**df.head(n)**: show the first n rows of the DataFrame

In [20]:
df.head(2)

    A   B     c   d
0  11  22   NaN NaN
1   9  15  24.0 NaN


If n is not specified, the first five rows are shown.

In [21]:
df.head()

    A   B     c    d
0  11  22   NaN  NaN
1   9  15  24.0  NaN
2   1   2   3.0  4.0
3   5   6   7.0  8.0
4   1   2   3.0  4.0


**df.tail(n)**: show the last n rows of the DataFrame

In [22]:
df.tail(2)

   A  B    c    d
6  1  2  3.0  4.0
7  5  6  7.0  8.0


In [23]:
df.tail()

   A  B    c    d
3  5  6  7.0  8.0
4  1  2  3.0  4.0
5  5  6  7.0  8.0
6  1  2  3.0  4.0
7  5  6  7.0  8.0


**df.shape**: show the number of rows and columns

In [24]:
df.shape

(8, 4)

**df.info()**: show the index range, column dtypes, non-null counts, and memory usage

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       8 non-null      int64  
 1   B       8 non-null      int64  
 2   c       7 non-null      float64
 3   d       6 non-null      float64
dtypes: float64(2), int64(2)
memory usage: 384.0 bytes


**df.describe()**: show some statistics for numerical columns

In [26]:
df.describe()

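               A          B          c        d
count   8.000000   8.000000   7.000000  6.00000
mean    4.750000   7.625000   7.714286  6.00000
std     3.770184   7.209864   7.454625  2.19089
min     1.000000   2.000000   3.000000  4.00000
25%     1.000000   2.000000   3.000000  4.00000
50%     5.000000   6.000000   7.000000  6.00000
75%     6.000000   8.250000   7.000000  8.00000
max    11.000000  22.000000  24.000000  8.00000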


In [27]:
# df.mean() | Returns the mean of all columns
df.mean()

A    4.750000
B    7.625000
c    7.714286
d    6.000000
dtype: float64

In [28]:
# df.count() | Returns the number of non-null values in each DataFrame column
df.count()

A    8
B    8
c    7
d    6
dtype: int64
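
The mean and count outputs above are linked: by default, missing values are skipped, so the mean of column `c` is its sum divided by its 7 non-null values. The sketch below rebuilds that column as a standalone Series (the variable name `c` is illustrative) and shows the `skipna` parameter.

```python
import numpy as np
import pandas as pd

# Values of column 'c' from the DataFrame above.
c = pd.Series([np.nan, 24.0, 3.0, 7.0, 3.0, 7.0, 3.0, 7.0], name="c")

print(c.count())              # number of non-null values
print(c.sum() / c.count())    # same value as c.mean()
print(c.mean())

# With skipna=False a single NaN makes the whole result NaN.
print(c.mean(skipna=False))
```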

In [29]:
# df.min() | Returns the lowest value in each column
df.min()

A    1.0
B    2.0
c    3.0
d    4.0
dtype: float64

In [30]:
# df.max(): Returns the highest value in each column
df.max()

A    11.0
B    22.0
c    24.0
d     8.0
dtype: float64

In [31]:
# df.median(): Returns the median of each column
df.median()

A    5.0
B    6.0
c    7.0
d    6.0
dtype: float64

In [32]:
df

    A   B     c    d
0  11  22   NaN  NaN
1   9  15  24.0  NaN
2   1   2   3.0  4.0
3   5   6   7.0  8.0
4   1   2   3.0  4.0
5   5   6   7.0  8.0
6   1   2   3.0  4.0
7   5   6   7.0  8.0


In [33]:
# df.mode(): Returns the mode(s) of each column, one row per tied value
df.mode()

   A  B    c    d
0  1  2  3.0  4.0
1  5  6  7.0  8.0
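
The result has two rows because each column has two values tied for most frequent. A minimal sketch with the values of column `A`, plus a made-up frame `uneven` showing that columns with fewer modes are padded with NaN:

```python
import pandas as pd

# Values of column 'A' from the DataFrame above.
A = pd.Series([11, 9, 1, 5, 1, 5, 1, 5], name="A")

# 1 and 5 both occur three times, so both are modes.
print(A.value_counts())
modes_A = A.mode().tolist()
print(modes_A)

# Hypothetical frame: 'x' has two modes, 'y' has one.
uneven = pd.DataFrame({"x": [1, 1, 2, 2], "y": [3, 3, 3, 4]})
uneven_modes = uneven.mode()
print(uneven_modes)
```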


In [34]:
# df.std() | Returns the standard deviation of each column
df.std()

A    3.770184
B    7.209864
c    7.454625
d    2.190890
dtype: float64
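
Note that `std()` computes the sample standard deviation (it divides by n - 1, the default `ddof=1`). A short sketch with the values of column `A`, where `ddof=0` gives the population standard deviation instead:

```python
import pandas as pd

# Values of column 'A' from the DataFrame above.
A = pd.Series([11, 9, 1, 5, 1, 5, 1, 5], name="A")

sample_std = A.std()            # divides by n - 1 (default ddof=1)
population_std = A.std(ddof=0)  # divides by n

print(round(sample_std, 6))
print(round(population_std, 6))
```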

Python's built-in round() function also works on a pandas Series: it rounds every value to the specified number of decimals and returns a new Series.

In [35]:
round(df.std(),2)

A    3.77
B    7.21
c    7.45
d    2.19
dtype: float64

The default number of decimals is 0, so each value is rounded to the nearest whole number (the dtype stays float64).

In [36]:
round(df.std())

A    4.0
B    7.0
c    7.0
d    2.0
dtype: float64
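
The same rounding is available as a method. The sketch below copies the standard deviations shown above into a standalone Series `s` (an illustrative name) and compares the built-in `round()` with `Series.round()`.

```python
import pandas as pd

# Standard deviations from the output above, copied by hand.
s = pd.Series([3.770184, 7.209864, 7.454625, 2.190890], index=list("ABcd"))

builtin = round(s, 2)  # built-in round() applied to a Series
method = s.round(2)    # equivalent method call

print(builtin.equals(method))

# Rounding to 0 decimals keeps the float64 dtype.
whole = s.round()
print(whole.dtype)
```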

In [37]:
# sum(): Returns the sum of each column
display(df2)
df2.sum()

   a  b  c  d
0  1  2  3  4
1  5  6  7  8


a     6
b     8
c    10
d    12
dtype: int64

In [38]:
# cumsum(): Returns the cumulative sum down each column
display(df2)
df2.cumsum()

   a  b  c  d
0  1  2  3  4
1  5  6  7  8


   a  b   c   d
0  1  2   3   4
1  6  8  10  12


In [39]:
# prod(): Returns the product of each column
display(df2)
df2.prod()

   a  b  c  d
0  1  2  3  4
1  5  6  7  8


a     5
b    12
c    21
d    32
dtype: int64
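
The three functions above work column by column, which is the default `axis=0`. As a sketch, passing `axis=1` applies them across each row instead:

```python
import pandas as pd

df2 = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=["a", "b", "c", "d"])

row_sums = df2.sum(axis=1)        # one sum per row
row_cumsum = df2.cumsum(axis=1)   # running total from left to right
row_prods = df2.prod(axis=1)      # one product per row

print(row_sums.tolist())            # [10, 26]
print(row_cumsum.values.tolist())   # [[1, 3, 6, 10], [5, 11, 18, 26]]
print(row_prods.tolist())           # [24, 1680]
```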