What is pandas in Python? (3 Examples)
 

This tutorial explains what the pandas library is and how to use it in the Python programming language.

Table of contents:

1) Definition of the pandas Library in Python
2) Example Data & Add-On Libraries
3) Example 1: Add New Column to pandas DataFrame
4) Example 2: Remove Row from pandas DataFrame
5) Example 3: Calculate Mean for pandas DataFrame Column
6) Video & Further Resources
Let’s move on to the definition…

 

Definition of the pandas Library in Python
pandas is a software library that was created by Wes McKinney for the Python programming language.

The pandas library is mainly used for data manipulation, i.e. to edit, modify, and adjust certain components of a DataFrame object.

However, pandas is very flexible and can also be used for other tasks, such as plotting data sets and handling time series values.

Like many other Python libraries, pandas is open source, i.e. freely available for use, modification, and redistribution.

In the remaining part of this tutorial, I’ll show some example applications of the pandas library in practice.

So without too much talk, let’s dive into the example code!

 

Example Data & Add-On Libraries
We first have to import the pandas library into Python:

import pandas as pd                              # Import pandas library
Next, we can use the pd.DataFrame function to create some example data:

data = pd.DataFrame({"x1":range(5, 10),          # Create pandas DataFrame
                     "x2":["a", "b", "c", "d", "e"],
                     "x3":range(10, 5, - 1)})
print(data)                                      # Print pandas DataFrame
 

Table 1: Example pandas DataFrame

   x1 x2  x3
0   5  a  10
1   6  b   9
2   7  c   8
3   8  d   7
4   9  e   6

 

Table 1 shows that our example DataFrame is composed of five rows and three columns.
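The dimensions mentioned above can also be checked programmatically. The following is a minimal sketch, assuming the same example data as in this section; the names n_rows and n_cols are introduced here only for illustration:

```python
import pandas as pd

# Recreate the example DataFrame of this tutorial
data = pd.DataFrame({"x1": range(5, 10),
                     "x2": ["a", "b", "c", "d", "e"],
                     "x3": range(10, 5, -1)})

n_rows, n_cols = data.shape                      # shape is a (rows, columns) tuple
print(n_rows, n_cols)                            # Print number of rows and columns
# 5 3

print(list(data.columns))                        # Print column names
# ['x1', 'x2', 'x3']
```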

 

Example 1: Add New Column to pandas DataFrame
This example illustrates how to append a new variable to a pandas DataFrame.

For this task, we first have to create a list object that contains the values of our new column:

new_col = ["foo", "bar", "bar", "foo", "bar"]    # Create list
print(new_col)                                   # Print list
# ['foo', 'bar', 'bar', 'foo', 'bar']
Next, we can apply the assign function to add our list as a new column to our pandas DataFrame:

data_add = data.assign(new_col = new_col)        # Add new column
print(data_add)                                  # Print DataFrame with new column
 

Table 2: pandas DataFrame with new column

   x1 x2  x3 new_col
0   5  a  10     foo
1   6  b   9     bar
2   7  c   8     bar
3   8  d   7     foo
4   9  e   6     bar

 

As shown in Table 2, the previous code has created a new pandas DataFrame that contains our input data plus our list object as a new column.
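Note that assign returns a new DataFrame and leaves the original one untouched. As a hedged sketch of an alternative, assuming the same example data, bracket assignment adds the column directly to the DataFrame it is applied to. The name data_inplace is hypothetical and only used for this illustration:

```python
import pandas as pd

# Recreate the example DataFrame and the list of new values
data = pd.DataFrame({"x1": range(5, 10),
                     "x2": ["a", "b", "c", "d", "e"],
                     "x3": range(10, 5, -1)})
new_col = ["foo", "bar", "bar", "foo", "bar"]

data_add = data.assign(new_col=new_col)          # Returns a new DataFrame
print("new_col" in data.columns)                 # Original DataFrame is unchanged
# False
print("new_col" in data_add.columns)
# True

data_inplace = data.copy()                       # Work on a copy to keep data intact
data_inplace["new_col"] = new_col                # Add column via bracket assignment
print(data_inplace.equals(data_add))             # Both approaches give the same result
# True
```

Which variant fits better depends on whether the original DataFrame should stay unchanged.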

 

Example 2: Remove Row from pandas DataFrame
Example 2 shows how to drop certain rows from a pandas DataFrame.

To achieve this, we can filter the DataFrame with a logical condition as illustrated below:

data_drop = data[data.x2 != "c"]                 # Drop row using logical condition
print(data_drop)                                 # Print DataFrame without row
 

Table 3: pandas DataFrame without third row

   x1 x2  x3
0   5  a  10
1   6  b   9
3   8  d   7
4   9  e   6

 

Table 3 shows the output of the previous Python syntax: We have excluded the third row from our data set.
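As a possible alternative to the logical condition, a row can also be removed by its index label using the drop method. The sketch below assumes the same example data; the names data_drop_label and data_reset are introduced only for illustration. It also shows how reset_index renumbers the remaining rows:

```python
import pandas as pd

# Recreate the example DataFrame of this tutorial
data = pd.DataFrame({"x1": range(5, 10),
                     "x2": ["a", "b", "c", "d", "e"],
                     "x3": range(10, 5, -1)})

data_drop = data[data["x2"] != "c"]              # Drop row using logical condition
data_drop_label = data.drop(index=2)             # Drop row using its index label
print(data_drop.equals(data_drop_label))         # Both approaches give the same result
# True

data_reset = data_drop.reset_index(drop=True)    # Renumber rows from 0
print(list(data_reset.index))
# [0, 1, 2, 3]
```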

 

Example 3: Calculate Mean for pandas DataFrame Column
The pandas library can also be used to calculate certain descriptive statistics of the columns of a DataFrame.

In this specific example, we calculate the mean value of the variable x3:

data_mean = data["x3"].mean()                    # Calculate average
print(data_mean)                                 # Print average
# 8.0
The previous console output shows the mean value of our third column, i.e. 8.0.
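Other descriptive statistics can be computed in the same way. The following is a minimal sketch, assuming the same example data as before; the variable names are chosen for illustration only:

```python
import pandas as pd

# Recreate the example DataFrame of this tutorial
data = pd.DataFrame({"x1": range(5, 10),
                     "x2": ["a", "b", "c", "d", "e"],
                     "x3": range(10, 5, -1)})

data_median = data["x3"].median()                # Calculate median
print(data_median)
# 8.0

data_min = data["x3"].min()                      # Calculate minimum
data_max = data["x3"].max()                      # Calculate maximum
print(data_min, data_max)
# 6 10

data_std = data["x3"].std()                      # Calculate sample standard deviation
print(round(data_std, 3))
# 1.581
```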

 

Video & Further Resources
A video that explains the contents of this tutorial will be added to my YouTube channel soon.

 

Furthermore, you may read the related articles on my website.

pandas Library Tutorial in Python
Change pandas DataFrames in Python
DataFrame Manipulation Using pandas in Python
Sort pandas DataFrame by Date in Python
Count Unique Values by Group in Column of pandas DataFrame
Insert Column at Specific Position of pandas DataFrame
Check If Any Value is NaN in pandas DataFrame in Python
Check if pandas DataFrame is Empty in Python
All Python Programming Tutorials
 

In summary: You have learned in this article how to apply the functions of the pandas library in the Python programming language. If you have any additional questions and/or comments, let me know in the comments section below.



In [1]:
import pandas as pd                              # Import pandas library

In [2]:
data = pd.DataFrame({"x1":range(5, 10),          # Create pandas DataFrame
                     "x2":["a", "b", "c", "d", "e"],
                     "x3":range(10, 5, - 1)})
print(data)                                      # Print pandas DataFrame

   x1 x2  x3
0   5  a  10
1   6  b   9
2   7  c   8
3   8  d   7
4   9  e   6


In [3]:
new_col = ["foo", "bar", "bar", "foo", "bar"]    # Create list
print(new_col)                                   # Print list
# ['foo', 'bar', 'bar', 'foo', 'bar']

['foo', 'bar', 'bar', 'foo', 'bar']


In [4]:
data_add = data.assign(new_col = new_col)        # Add new column
print(data_add)                                  # Print DataFrame with new column

   x1 x2  x3 new_col
0   5  a  10     foo
1   6  b   9     bar
2   7  c   8     bar
3   8  d   7     foo
4   9  e   6     bar


In [5]:
data_drop = data[data.x2 != "c"]                 # Drop row using logical condition
print(data_drop)                                 # Print DataFrame without row

   x1 x2  x3
0   5  a  10
1   6  b   9
3   8  d   7
4   9  e   6


In [6]:
data_mean = data["x3"].mean()                    # Calculate average
print(data_mean)                                 # Print average
# 8.0

8.0


At the top of the figure, two separate data sets are shown. At the bottom of the figure, you can see four different joined versions of these two data sets:

- Inner join: Keep only IDs that are contained in both data sets.
- Outer join: Keep all IDs of both data sets.
- Left join: Keep all IDs of the first data set and drop IDs that appear only in the second data set.
- Right join: Keep all IDs of the second data set and drop IDs that appear only in the first data set.

All of these join types will be applied in the following programming examples.
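The four join types listed above can be compared side by side in a small sketch. The data sets left and right below are hypothetical and much smaller than the ones used in the following examples. The indicator argument of pd.merge adds a column called _merge that reports where each row comes from:

```python
import pandas as pd

# Two small hypothetical data sets that share an "ID" column
left = pd.DataFrame({"ID": [1, 2, 3], "a": ["p", "q", "r"]})
right = pd.DataFrame({"ID": [2, 3, 4], "b": ["x", "y", "z"]})

ids = {}                                         # IDs kept by each join type
origins = {}                                     # Origin of each row by join type
for how in ["inner", "outer", "left", "right"]:
    merged = pd.merge(left, right, on="ID", how=how, indicator=True)
    ids[how] = list(merged["ID"])
    origins[how] = list(merged["_merge"])
    print(how, ids[how], origins[how])
# inner [2, 3] ['both', 'both']
# outer [1, 2, 3, 4] ['left_only', 'both', 'both', 'right_only']
# left [1, 2, 3] ['left_only', 'both', 'both']
# right [2, 3, 4] ['both', 'both', 'right_only']
```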



In [7]:
data1 = pd.DataFrame({"ID":range(1001, 1007),    # Create first pandas DataFrame
                      "x1":range(1, 7),
                      "x2":["a", "b", "c", "d", "e", "f"],
                      "x3":range(16, 10, - 1)})
print(data1)                                     # Print first pandas DataFrame


     ID  x1 x2  x3
0  1001   1  a  16
1  1002   2  b  15
2  1003   3  c  14
3  1004   4  d  13
4  1005   5  e  12
5  1006   6  f  11


In [8]:
data2 = pd.DataFrame({"ID":range(1004, 1009),    # Create second pandas DataFrame
                      "y1":["x", "y", "x", "y", "x"],
                      "y2":range(10, 1, - 2)})
print(data2)                                     # Print second pandas DataFrame

     ID y1  y2
0  1004  x  10
1  1005  y   8
2  1006  x   6
3  1007  y   4
4  1008  x   2


In [9]:
data_inner = pd.merge(data1,                     # Inner join
                      data2,
                      on = "ID",
                      how = "inner")
print(data_inner)                                # Print merged DataFrame

     ID  x1 x2  x3 y1  y2
0  1004   4  d  13  x  10
1  1005   5  e  12  y   8
2  1006   6  f  11  x   6


In [10]:
data_outer = pd.merge(data1,                     # Outer join
                      data2,
                      on = "ID",
                      how = "outer")
print(data_outer)                                # Print merged DataFrame

     ID   x1   x2    x3   y1    y2
0  1001  1.0    a  16.0  NaN   NaN
1  1002  2.0    b  15.0  NaN   NaN
2  1003  3.0    c  14.0  NaN   NaN
3  1004  4.0    d  13.0    x  10.0
4  1005  5.0    e  12.0    y   8.0
5  1006  6.0    f  11.0    x   6.0
6  1007  NaN  NaN   NaN    y   4.0
7  1008  NaN  NaN   NaN    x   2.0
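In the outer join output above, the integer columns x1, x3, and y2 are shown as floats, because the missing value NaN cannot be stored in a regular integer column. One possible way to keep integers, sketched below under the assumption that the same two data sets are used, is the nullable Int64 data type of pandas. The name data_int is hypothetical:

```python
import pandas as pd

# Recreate the two example DataFrames
data1 = pd.DataFrame({"ID": range(1001, 1007),
                      "x1": range(1, 7),
                      "x2": ["a", "b", "c", "d", "e", "f"],
                      "x3": range(16, 10, -1)})
data2 = pd.DataFrame({"ID": range(1004, 1009),
                      "y1": ["x", "y", "x", "y", "x"],
                      "y2": range(10, 1, -2)})

data_outer = pd.merge(data1, data2, on="ID", how="outer")
print(data_outer.dtypes["x1"])                   # Integers were converted to floats
# float64

data_int = data_outer.astype({"x1": "Int64",     # Convert to nullable integers
                              "x3": "Int64",
                              "y2": "Int64"})
print(data_int.dtypes["x1"])
# Int64
print(data_int["x1"].isna().sum())               # Missing values are preserved
# 2
```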


In [11]:
data_left = pd.merge(data1,                      # Left join
                     data2,
                     on = "ID",
                     how = "left")
print(data_left)                                 # Print merged DataFrame

     ID  x1 x2  x3   y1    y2
0  1001   1  a  16  NaN   NaN
1  1002   2  b  15  NaN   NaN
2  1003   3  c  14  NaN   NaN
3  1004   4  d  13    x  10.0
4  1005   5  e  12    y   8.0
5  1006   6  f  11    x   6.0


In [12]:
data_right = pd.merge(data1,                     # Right join
                      data2,
                      on = "ID",
                      how = "right")
print(data_right)                                # Print merged DataFrame

     ID   x1   x2    x3 y1  y2
0  1004  4.0    d  13.0  x  10
1  1005  5.0    e  12.0  y   8
2  1006  6.0    f  11.0  x   6
3  1007  NaN  NaN   NaN  y   4
4  1008  NaN  NaN   NaN  x   2


In [13]:
data = pd.DataFrame({'x1':[2, 7, 5, 7, 1, 5, 9],  # Create pandas DataFrame
                     'x2':range(1, 8),
                     'group':['A', 'B', 'A', 'A', 'C', 'B', 'A']})
print(data)                                       # Print pandas DataFrame

   x1  x2 group
0   2   1     A
1   7   2     B
2   5   3     A
3   7   4     A
4   1   5     C
5   5   6     B
6   9   7     A


In [14]:
print(data['x1'].mean())                          # Get mean of one column

5.142857142857143


In [15]:
print(data.mean(numeric_only = True))             # Get mean of all numeric columns

x1    5.142857
x2    4.000000
dtype: float64



In [16]:
print(data.describe())                            # Get descriptive stats of numeric columns

             x1        x2
count  7.000000  7.000000
mean   5.142857  4.000000
std    2.853569  2.160247
min    1.000000  1.000000
25%    3.500000  2.500000
50%    5.000000  4.000000
75%    7.000000  5.500000
max    9.000000  7.000000


In [17]:
print(data.groupby('group').mean())               # Get mean by group

         x1    x2
group            
A      5.75  3.75
B      6.00  4.00
C      1.00  5.00
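Several statistics can also be computed per group in a single step. The following is a minimal sketch using the agg method, assuming the same grouped example data; the name stats is introduced only for illustration:

```python
import pandas as pd

# Recreate the grouped example DataFrame
data = pd.DataFrame({"x1": [2, 7, 5, 7, 1, 5, 9],
                     "x2": range(1, 8),
                     "group": ["A", "B", "A", "A", "C", "B", "A"]})

stats = data.groupby("group")["x1"].agg(["mean", "min", "max", "count"])
print(stats)                                      # Print several stats by group

print(stats.loc["A", "mean"])                     # Mean of x1 in group A
# 5.75
print(stats.loc["B", "count"])                    # Number of rows in group B
# 2
```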


In [19]:
print(data.groupby('group').describe())             # Get descriptive stats by group

         x1                                              x2                  \
      count  mean       std  min   25%  50%  75%  max count  mean       std   
group                                                                         
A       4.0  5.75  2.986079  2.0  4.25  6.0  7.5  9.0   4.0  3.75  2.500000   
B       2.0  6.00  1.414214  5.0  5.50  6.0  6.5  7.0   2.0  4.00  2.828427   
C       1.0  1.00       NaN  1.0  1.00  1.0  1.0  1.0   1.0  5.00       NaN   

                                 
       min  25%  50%   75%  max  
group                            
A      1.0  2.5  3.5  4.75  7.0  
B      2.0  3.0  4.0  5.00  6.0  
C      5.0  5.0  5.0  5.00  5.0  


In [20]:
data = pd.DataFrame({'x1':[6, 2, 7, 1, 9, 3, 4, 9],  # Create example DataFrame
                     'x2':[2, 5, 7, 1, 3, 1, 2, 3],
                     'x3':range(8, 0, - 1)})
print(data)                                          # Print example DataFrame

   x1  x2  x3
0   6   2   8
1   2   5   7
2   7   7   6
3   1   1   5
4   9   3   4
5   3   1   3
6   4   2   2
7   9   3   1


In [21]:
print(data.sum(axis = 1))                            # Get row sums

0    16
1    14
2    20
3     7
4    16
5     7
6     8
7    13
dtype: int64
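For comparison, column sums can be obtained by changing the axis argument. This is a short sketch, assuming the same example DataFrame as in the previous cell; the name col_sums is hypothetical:

```python
import pandas as pd

# Recreate the example DataFrame
data = pd.DataFrame({"x1": [6, 2, 7, 1, 9, 3, 4, 9],
                     "x2": [2, 5, 7, 1, 3, 1, 2, 3],
                     "x3": range(8, 0, -1)})

col_sums = data.sum(axis=0)                       # Get column sums
print(col_sums)
# x1    41
# x2    24
# x3    36
# dtype: int64
```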
