# Pandas DataFrames

A DataFrame is a two-dimensional, tabular data structure: data is arranged in rows and columns.

Features:
- Columns can be of different types
- Size is mutable
- Labeled axes (rows and columns)
- Arithmetic operations can be performed on rows and columns

You can think of it as an SQL table or a spreadsheet.
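
The features listed above can be checked directly on a small frame. This is a minimal sketch; the `people` frame, its column names, and its row labels are made up for illustration.

```python
import pandas as pd

# Hypothetical example data: one text column and two numeric columns.
people = pd.DataFrame(
    {"name": ["Ana", "Ben"], "age": [31, 45], "height_m": [1.62, 1.80]},
    index=["r1", "r2"],
)

print(people.dtypes)         # each column keeps its own dtype
print(list(people.index))    # row labels
print(list(people.columns))  # column labels

# Size is mutable: adding a column changes the shape of the same object.
people["age_in_months"] = people["age"] * 12
print(people.shape)
```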

In [1]:
# !pip install pandas

In [2]:
import pandas as pd

In [3]:
# Create an empty DataFrame
df = pd.DataFrame()
print(df)

Empty DataFrame
Columns: []
Index: []


In [4]:
# Create a DataFrame from a list
data = [10,20,30,40,50]
df = pd.DataFrame(data)
print(df) 

    0
0  10
1  20
2  30
3  40
4  50


In [5]:
# Create a DataFrame from a list and assign a name to the column
df = pd.DataFrame(data, columns = ['data'])
print(df) 

   data
0    10
1    20
2    30
3    40
4    50


In [6]:
# Create a DataFrame from a list of lists
students = [['John',18], ['Anna',17], ['Peter',19]]
df = pd.DataFrame(students, columns=['Name','Age'])
display(df)

    Name  Age
0   John   18
1   Anna   17
2  Peter   19


**Create a DataFrame from a dict of ndarrays/lists**

- All the ndarrays (or lists) must have the same length. If an index is passed, its length must equal the length of the arrays.
- If no index is passed, the index defaults to `range(n)`, where `n` is the array length.
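
The two rules above can be observed in a short experiment. This is only a sketch with illustrative variable names (`scores`, `default_df`, `mismatch_error`); the exact error message may differ between pandas versions.

```python
import pandas as pd

scores = {"Name": ["John", "Jane", "Emma"], "Age": [18, 17, 19]}

# Without an explicit index, pandas builds a RangeIndex 0..n-1.
default_df = pd.DataFrame(scores)
print(default_df.index)

# Lists of different lengths are rejected with a ValueError.
try:
    pd.DataFrame({"Name": ["John", "Jane"], "Age": [18, 17, 19]})
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc
    print("ValueError:", exc)
```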

In [7]:
# with index
students = {'Name':['John','Jane','Emma'], 'Age':[18,17,19]}
df = pd.DataFrame(students, index=['a','b','c'])
display(df)

   Name  Age
a  John   18
b  Jane   17
c  Emma   19


In [8]:
# without index
df = pd.DataFrame(students)
display(df)

   Name  Age
0  John   18
1  Jane   17
2  Emma   19


In [9]:
# Create a DataFrame from a list of dictionaries
data = [{'a':5,'b':20},{'a':5,'b':20,'c':10},{'a':4,'c':10},{'a':3,'b':19},{'a':2,'b':18}]
df = pd.DataFrame(data)
display(df)

   a     b     c
0  5  20.0   NaN
1  5  20.0  10.0
2  4   NaN  10.0
3  3  19.0   NaN
4  2  18.0   NaN


Observe that a dictionary that lacks a key (for example, the first one has no 'c') gets a NaN value in the corresponding entry. Because NaN is a floating-point value, columns 'b' and 'c' are stored as floats.
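
To see where the NaN values ended up, you can inspect the dtypes and count the missing entries per column. A minimal sketch on a shortened, illustrative version of the data (`records` and `frame` are names chosen here):

```python
import pandas as pd

records = [{"a": 5, "b": 20}, {"a": 5, "b": 20, "c": 10}, {"a": 4, "c": 10}]
frame = pd.DataFrame(records)

print(frame.dtypes)  # 'a' stays integer; 'b' and 'c' become floats

# isna() marks missing entries; summing the booleans counts them per column.
missing_per_column = frame.isna().sum()
print(missing_per_column)
```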

## Basic Operations on DataFrames

A new column can be added to a DataFrame:

In [10]:
df['d'] = df['a'] + df['b']
display(df)

   a     b     c     d
0  5  20.0   NaN  25.0
1  5  20.0  10.0  25.0
2  4   NaN  10.0   NaN
3  3  19.0   NaN  22.0
4  2  18.0   NaN  20.0


Observe what happens when we try to do operations on NaN values:

In [11]:
df['e'] = df['a'] - df['c']
display(df)

   a     b     c     d    e
0  5  20.0   NaN  25.0  NaN
1  5  20.0  10.0  25.0 -5.0
2  4   NaN  10.0   NaN -6.0
3  3  19.0   NaN  22.0  NaN
4  2  18.0   NaN  20.0  NaN
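
Any arithmetic involving NaN yields NaN. If a missing value should instead count as a default such as 0, one option is the method form of the operator with `fill_value`. The sketch below uses illustrative names (`frame`, `plain`, `filled`), and treating missing entries as 0 is an assumption that may not suit your data.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [5, 5, 4], "c": [np.nan, 10.0, 10.0]})

plain = frame["a"] - frame["c"]                    # NaN propagates
filled = frame["a"].sub(frame["c"], fill_value=0)  # missing 'c' treated as 0

print(plain.tolist())
print(filled.tolist())
```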


In [12]:
# adding a scalar as a new column
df['f'] = 5
display(df)

   a     b     c     d    e  f
0  5  20.0   NaN  25.0  NaN  5
1  5  20.0  10.0  25.0 -5.0  5
2  4   NaN  10.0   NaN -6.0  5
3  3  19.0   NaN  22.0  NaN  5
4  2  18.0   NaN  20.0  NaN  5


In [13]:
# Column deletion
del df['f']
display(df)

   a     b     c     d    e
0  5  20.0   NaN  25.0  NaN
1  5  20.0  10.0  25.0 -5.0
2  4   NaN  10.0   NaN -6.0
3  3  19.0   NaN  22.0  NaN
4  2  18.0   NaN  20.0  NaN


In [14]:
# Using pop (removes the column from df and returns it)
df.pop('e')
display(df)

   a     b     c     d
0  5  20.0   NaN  25.0
1  5  20.0  10.0  25.0
2  4   NaN  10.0   NaN
3  3  19.0   NaN  22.0
4  2  18.0   NaN  20.0


In [15]:
# Using drop
df.drop(columns=['d'])

   a     b     c
0  5  20.0   NaN
1  5  20.0  10.0
2  4   NaN  10.0
3  3  19.0   NaN
4  2  18.0   NaN


`drop` returns a new DataFrame without the column 'd'; it does not change `df`, as the next cell shows:

In [16]:
display(df)

   a     b     c     d
0  5  20.0   NaN  25.0
1  5  20.0  10.0  25.0
2  4   NaN  10.0   NaN
3  3  19.0   NaN  22.0
4  2  18.0   NaN  20.0


To modify `df` itself, pass `inplace=True`: the change is then applied to `df` directly and `drop` returns `None`.

In [17]:
df.drop(columns=['d'], inplace=True)
display(df)

   a     b     c
0  5  20.0   NaN
1  5  20.0  10.0
2  4   NaN  10.0
3  3  19.0   NaN
4  2  18.0   NaN
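
The three ways of removing a column differ in what they return and in whether they touch the original object. The sketch below compares them on a throwaway frame; all names are illustrative. It also shows reassignment, a common alternative to `inplace=True`.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2], "b": [3, 4], "d": [5, 6]})

removed = frame.pop("b")             # pop returns the removed column as a Series
after_pop = list(frame.columns)      # ['a', 'd']

trimmed = frame.drop(columns=["d"])  # new DataFrame; frame itself still has 'd'
still_has_d = "d" in frame.columns   # True

frame = frame.drop(columns=["d"])    # reassignment: same end result as inplace=True
print(list(frame.columns))
```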


In [18]:
# Renaming columns
df.rename(columns={'a':'A','b':'B', 'c':'C'},inplace=True)
df

   A     B     C
0  5  20.0   NaN
1  5  20.0  10.0
2  4   NaN  10.0
3  3  19.0   NaN
4  2  18.0   NaN


## Inspecting data

`df.head(n)`: show the first `n` rows of the DataFrame

In [19]:
df.head(2)

   A     B     C
0  5  20.0   NaN
1  5  20.0  10.0


If n is not specified, the first five rows are shown.

In [20]:
df.head()

   A     B     C
0  5  20.0   NaN
1  5  20.0  10.0
2  4   NaN  10.0
3  3  19.0   NaN
4  2  18.0   NaN


`df.tail(n)`: show the last `n` rows of the DataFrame

In [21]:
df.tail(2)

   A     B   C
3  3  19.0 NaN
4  2  18.0 NaN


In [22]:
df.tail()

   A     B     C
0  5  20.0   NaN
1  5  20.0  10.0
2  4   NaN  10.0
3  3  19.0   NaN
4  2  18.0   NaN


`df.shape`: show the number of rows and columns

In [23]:
df.shape

(5, 3)

`df.info()`: show the index range, the non-null count and dtype of each column, and the memory usage

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       5 non-null      int64  
 1   B       4 non-null      float64
 2   C       2 non-null      float64
dtypes: float64(2), int64(1)
memory usage: 248.0 bytes


`df.describe()`: show summary statistics (count, mean, standard deviation, minimum, quartiles, maximum) for numerical columns

In [25]:
df.describe()

             A          B     C
count  5.00000   4.000000   2.0
mean   3.80000  19.250000  10.0
std    1.30384   0.957427   0.0
min    2.00000  18.000000  10.0
25%    3.00000  18.750000  10.0
50%    4.00000  19.500000  10.0
75%    5.00000  20.000000  10.0
max    5.00000  20.000000  10.0
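
The `25%`, `50%`, and `75%` rows are quantiles computed over the non-null values only. As a rough check, the sketch below rebuilds column B under the illustrative names `frame`, `summary`, and `quartiles`, and compares `describe()` with `quantile()`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"B": [20.0, 20.0, np.nan, 19.0, 18.0]})

summary = frame.describe()
quartiles = frame["B"].quantile([0.25, 0.5, 0.75])

print(summary.loc[["count", "25%", "50%", "75%"], "B"])
print(quartiles)
```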


`df.mean()`: return the mean of each column

In [26]:
df.mean()

A     3.80
B    19.25
C    10.00
dtype: float64

`df.count()`: return the number of non-null values in each DataFrame column

In [27]:
df.count()

A    5
B    4
C    2
dtype: int64

`df.min()`: return the lowest value in each column

In [28]:
df.min()

A     2.0
B    18.0
C    10.0
dtype: float64

`df.max()`: return the highest value in each column

In [29]:
df.max()

A     5.0
B    20.0
C    10.0
dtype: float64

`df.median()`: return the median of each column

In [30]:
df.median()

A     4.0
B    19.5
C    10.0
dtype: float64

In [31]:
df

   A     B     C
0  5  20.0   NaN
1  5  20.0  10.0
2  4   NaN  10.0
3  3  19.0   NaN
4  2  18.0   NaN


`df.mode()`: return the mode of each column

In [32]:
df.mode()

   A     B     C
0  5  20.0  10.0
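
Here every column has a single mode, so the result has one row. When a column has ties, `mode()` returns one row per tied value and pads the other columns with NaN. A small sketch with made-up data (`ties`, `modes`):

```python
import pandas as pd

# Column 'x' has two modes (1 and 2); column 'y' has one (3).
ties = pd.DataFrame({"x": [1, 1, 2, 2], "y": [3, 3, 3, 4]})

modes = ties.mode()
print(modes)
```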


`df.std()`: return the standard deviation of each column (the sample standard deviation, `ddof=1`, by default)

In [33]:
df.std()

A    1.303840
B    0.957427
C    0.000000
dtype: float64
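
By default pandas divides by `n - 1` (sample standard deviation); the `ddof` parameter switches to the population version. A minimal sketch on the values of column A, with illustrative variable names:

```python
import math

import pandas as pd

values = pd.DataFrame({"A": [5, 5, 4, 3, 2]})

sample_std = values["A"].std()            # ddof=1: divides by n - 1
population_std = values["A"].std(ddof=0)  # divides by n

print(sample_std, population_std)
```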

Python's built-in `round()` function can be applied to the result: it rounds each value to the specified number of decimals.

In [34]:
round(df.std(),2)

A    1.30
B    0.96
C    0.00
dtype: float64

The default number of decimals is 0, meaning that each value is rounded to the nearest integer (the result is still stored as `float64`).

In [35]:
round(df.std())

A    1.0
B    1.0
C    0.0
dtype: float64
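
pandas also offers rounding as a method, which chains more naturally after other calls. The sketch below uses illustrative names (`stds`, `via_builtin`, `via_method`, `table`) to show that both forms agree and that `DataFrame.round` works on whole frames.

```python
import pandas as pd

stds = pd.Series({"A": 1.303840, "B": 0.957427, "C": 0.0})

via_builtin = round(stds, 2)  # built-in round() delegates to the Series
via_method = stds.round(2)    # equivalent pandas method

# DataFrame.round rounds every numeric column.
table = pd.DataFrame({"x": [1.234, 5.678]}).round(1)

print(via_method)
print(table)
```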

`df.sum()`: return the sum of the values in each column (NaN values are skipped)

In [36]:
display(df)
df.sum()

   A     B     C
0  5  20.0   NaN
1  5  20.0  10.0
2  4   NaN  10.0
3  3  19.0   NaN
4  2  18.0   NaN


A    19.0
B    77.0
C    20.0
dtype: float64
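
Like the other aggregations in this section, `sum()` works column by column and ignores NaN unless told otherwise. The sketch below uses a shortened, illustrative frame to show the `skipna` and `axis` parameters.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [5, 5, 4], "B": [20.0, 20.0, np.nan]})

col_sum = frame.sum()                 # per column, NaN skipped
strict_sum = frame.sum(skipna=False)  # NaN makes the column total NaN
row_sum = frame.sum(axis=1)           # per row instead of per column

print(col_sum, strict_sum, row_sum, sep="\n")
```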

`df.cumsum()`: return the cumulative sum of values

In [37]:
df.cumsum()

    A     B     C
0   5  20.0   NaN
1  10  40.0  10.0
2  14   NaN  20.0
3  17  59.0   NaN
4  19  77.0   NaN


`df.prod()`: return the product of the values in each column

In [38]:
df.prod()

A       600.0
B    136800.0
C       100.0
dtype: float64

Reference:
- VanderPlas, J. (2017) Python Data Science Handbook: Essential Tools for Working with Data. USA: O’Reilly Media, Inc. chapter 3