In [1]:
# Importing Pandas
import pandas as pd

In [2]:
# Creating a sample DataFrame called df
df = pd.DataFrame({
    "name": ["John","Jane","Emily","Lisa","Matt"],
    "note": [92,94,87,82,90],
    "profession":["Electrical engineer","Mechanical engineer",
                  "Data scientist","Accountant","Athlete"],
    "date_of_birth":["1998-11-01","2002-08-14","1996-01-12",
                     "2002-10-24","2004-04-05"],
    "group":["A","B","B","A","C"]
})

In [4]:
df.head()

Unnamed: 0,name,note,profession,date_of_birth,group
0,John,92,Electrical engineer,1998-11-01,A
1,Jane,94,Mechanical engineer,2002-08-14,B
2,Emily,87,Data scientist,1996-01-12,B
3,Lisa,82,Accountant,2002-10-24,A
4,Matt,90,Athlete,2004-04-05,C


In [6]:
# Selecting a subset of columns
df[["name","note"]]

Unnamed: 0,name,note
0,John,92
1,Jane,94
2,Emily,87
3,Lisa,82
4,Matt,90


In [7]:
# Selecting a subset of rows and columns with loc
df.loc[:3, ["name","note"]]

Unnamed: 0,name,note
0,John,92
1,Jane,94
2,Emily,87
3,Lisa,82


The loc method filters by row and column labels. The line of code above returns the rows labeled 0 through 3 (loc slices include the end label) together with the name and note columns.

In [8]:
#Selecting a subset of rows and columns with iloc
df.iloc[:3, 2]

0    Electrical engineer
1    Mechanical engineer
2         Data scientist
Name: profession, dtype: object

The iloc method is similar to loc, but it uses integer positions for rows and columns instead of labels. The line of code above returns the first 3 rows and the third column (positions start from 0).

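You may have noticed that we use the same expression (“:3”) to select rows with both loc and iloc. This works because Pandas assigns integer labels to the rows by default, so unless you assign specific labels, the label and the position of a row are the same. The results still differ: loc includes the end label and returns four rows, while iloc excludes the end position and returns three.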
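
The difference between label-based and position-based slicing is easier to see when the row labels are not integers. The following is a minimal sketch, assuming a small hypothetical frame called people with string labels; it is an illustration and not part of the notebook above.

```python
import pandas as pd

# Hypothetical frame with string row labels instead of the default integers
people = pd.DataFrame(
    {"name": ["John", "Jane", "Emily", "Lisa"], "note": [92, 94, 87, 82]},
    index=["a", "b", "c", "d"],
)

# loc slices by label and includes the end label ("c")
by_label = people.loc[:"c", ["name", "note"]]

# iloc slices by position and excludes the end position (3)
by_position = people.iloc[:3, [0, 1]]

# With the default integer index, the same ":3" gives different row counts
default_index = people.reset_index(drop=True)
loc_rows = len(default_index.loc[:3])    # labels 0, 1, 2, 3
iloc_rows = len(default_index.iloc[:3])  # positions 0, 1, 2
```

Here by_label and by_position hold the same three rows, while loc_rows and iloc_rows differ by one.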

In [9]:
#Using a comparison operator on column values
df[df.note > 90]

Unnamed: 0,name,note,profession,date_of_birth,group
0,John,92,Electrical engineer,1998-11-01,A
1,Jane,94,Mechanical engineer,2002-08-14,B


In [10]:
#Using a comparison operator with strings
df[df.name=="John"]

Unnamed: 0,name,note,profession,date_of_birth,group
0,John,92,Electrical engineer,1998-11-01,A


In [11]:
#String condition with str accessor
df[df.profession.str.contains("engineer")]

Unnamed: 0,name,note,profession,date_of_birth,group
0,John,92,Electrical engineer,1998-11-01,A
1,Jane,94,Mechanical engineer,2002-08-14,B


In [12]:
#Another string condition with str accessor
df[df.name.str.startswith("L")]

Unnamed: 0,name,note,profession,date_of_birth,group
3,Lisa,82,Accountant,2002-10-24,A


In [13]:
# Multiple str methods
df[df.name.str.lower().str.startswith("l")]

Unnamed: 0,name,note,profession,date_of_birth,group
3,Lisa,82,Accountant,2002-10-24,A


We can combine multiple str methods by chaining them. For instance, if we are not sure if all the names start with a capital letter, we can first convert them to lowercase and then filter.
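
As a possible alternative to chaining, str.contains takes a case parameter, so a case-insensitive pattern anchored to the start of the string should select the same rows. A small sketch, assuming a hypothetical frame called names with mixed capitalization:

```python
import pandas as pd

# Hypothetical names with inconsistent capitalization
names = pd.DataFrame({"name": ["John", "lisa", "Emily", "Lisa", "LUKE"]})

# Chained str methods: convert to lowercase, then test the prefix
chained = names[names.name.str.lower().str.startswith("l")]

# Alternative: a case-insensitive regular expression anchored with "^"
pattern = names[names.name.str.contains("^l", case=False)]
```

Both selections keep the rows whose name starts with "l" or "L". The chained version is arguably easier to read, while the pattern version avoids creating an intermediate lowercase Series.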

In [14]:
# Tilde (~) operator
df[~df.profession.str.contains("engineer")]

Unnamed: 0,name,note,profession,date_of_birth,group
2,Emily,87,Data scientist,1996-01-12,B
3,Lisa,82,Accountant,2002-10-24,A
4,Matt,90,Athlete,2004-04-05,C


The tilde operator represents “not” logic. It lets us get the rows that do not satisfy the given condition.

Next, we extract the month from a date and use it for filtering.

In [15]:
#The dt accessor
df.date_of_birth = df.date_of_birth.astype("datetime64[ns]")
df[df.date_of_birth.dt.month==11]

Unnamed: 0,name,note,profession,date_of_birth,group
0,John,92,Electrical engineer,1998-11-01,A


The dt accessor offers many attributes and methods for working with dates and times. However, it cannot be applied to a column with a string data type, so we first need to change the data type.
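
The conversion step can also be written with pd.to_datetime, which is a common way to parse date strings. The sketch below assumes a hypothetical frame called births and is meant only to illustrate the convert-then-filter pattern.

```python
import pandas as pd

# Hypothetical frame whose dates are stored as strings
births = pd.DataFrame({
    "name": ["John", "Jane", "Emily"],
    "date_of_birth": ["1998-11-01", "2002-08-14", "1996-01-12"],
})

# Convert the string column to a datetime type before using the dt accessor
births["date_of_birth"] = pd.to_datetime(births["date_of_birth"])

# After conversion, components such as month and year are available
born_in_november = births[births.date_of_birth.dt.month == 11]
birth_years = births.date_of_birth.dt.year.tolist()
```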

In [16]:
df[df.date_of_birth.dt.year > 2000]

Unnamed: 0,name,note,profession,date_of_birth,group
1,Jane,94,Mechanical engineer,2002-08-14,B
3,Lisa,82,Accountant,2002-10-24,A
4,Matt,90,Athlete,2004-04-05,C


In [17]:
# Multiple conditions
df[(df.date_of_birth.dt.year > 2000) &  
   (df.profession.str.contains("engineer"))]

Unnamed: 0,name,note,profession,date_of_birth,group
1,Jane,94,Mechanical engineer,2002-08-14,B


We can combine multiple conditions with logical operators. The line of code above combines two conditions with “and” logic (the & operator).

In [18]:
df[(df.note > 90) | (df.profession=="Data scientist")]

Unnamed: 0,name,note,profession,date_of_birth,group
0,John,92,Electrical engineer,1998-11-01,A
1,Jane,94,Mechanical engineer,2002-08-14,B
2,Emily,87,Data scientist,1996-01-12,B


The line of code above returns people who are data scientists or have a note greater than 90.
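
Each condition needs its own parentheses because & and | bind more tightly than comparison operators in Python. One way to sidestep this, sketched here with a hypothetical frame called scores, is to store each condition in a named boolean Series first.

```python
import pandas as pd

# Hypothetical frame reusing the note and group values from the sample data
scores = pd.DataFrame({
    "name": ["John", "Jane", "Emily", "Lisa", "Matt"],
    "note": [92, 94, 87, 82, 90],
    "group": ["A", "B", "B", "A", "C"],
})

# Build each condition as a named boolean Series
high_note = scores.note > 90
in_group_b = scores.group == "B"

both = scores[high_note & in_group_b]    # "and": rows meeting both conditions
either = scores[high_note | in_group_b]  # "or": rows meeting at least one
```

Naming the masks tends to keep long filters readable and makes each condition easy to reuse.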

In [19]:
# isin method
df[df.group.isin(["A","C"])]

Unnamed: 0,name,note,profession,date_of_birth,group
0,John,92,Electrical engineer,1998-11-01,A
3,Lisa,82,Accountant,2002-10-24,A
4,Matt,90,Athlete,2004-04-05,C


The isin method is another way to filter on multiple conditions. It checks whether each value in the column appears in a given list of values.
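
To make the link to multiple conditions concrete, the sketch below compares isin with an equivalent chain of "or" conditions and also negates it with the tilde operator. The frame called members is a hypothetical example.

```python
import pandas as pd

# Hypothetical frame reusing the group values from the sample data
members = pd.DataFrame({
    "name": ["John", "Jane", "Emily", "Lisa", "Matt"],
    "group": ["A", "B", "B", "A", "C"],
})

# isin checks membership in a list of values
with_isin = members[members.group.isin(["A", "C"])]

# The same filter written as two conditions joined with "or"
with_or = members[(members.group == "A") | (members.group == "C")]

# Negating isin with ~ keeps the rows whose group is not in the list
not_in = members[~members.group.isin(["A", "C"])]
```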

In [21]:
# query function
df.query("note > 90")

Unnamed: 0,name,note,profession,date_of_birth,group
0,John,92,Electrical engineer,1998-11-01,A
1,Jane,94,Mechanical engineer,2002-08-14,B


The query function accepts the filter condition as a string.
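
Inside a query string, a Python variable can be referenced by prefixing its name with @. The sketch below assumes a hypothetical threshold variable and a hypothetical frame called scores, and compares the result with the equivalent boolean mask.

```python
import pandas as pd

# Hypothetical frame reusing the note and group values from the sample data
scores = pd.DataFrame({
    "name": ["John", "Jane", "Emily", "Lisa", "Matt"],
    "note": [92, 94, 87, 82, 90],
    "group": ["A", "B", "B", "A", "C"],
})

threshold = 89  # hypothetical cut-off, referenced in the query string with @

from_query = scores.query("group == 'A' and note > @threshold")

# The same filter expressed with a boolean mask
from_mask = scores[(scores.group == "A") & (scores.note > threshold)]
```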

In [22]:
df.query("group=='A' and note > 89")

Unnamed: 0,name,note,profession,date_of_birth,group
0,John,92,Electrical engineer,1998-11-01,A


In [23]:
#nsmallest function
df.nsmallest(2, "note")

Unnamed: 0,name,note,profession,date_of_birth,group
3,Lisa,82,Accountant,2002-10-24,A
2,Emily,87,Data scientist,1996-01-12,B


The nsmallest function selects the n rows with the smallest values in the given column.
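
For comparison, a similar result can be obtained by sorting on the column and taking the first n rows. This is a sketch with a hypothetical frame called scores that has no tied values; with ties, the two approaches may order the rows differently.

```python
import pandas as pd

# Hypothetical frame reusing the note values from the sample data
scores = pd.DataFrame({
    "name": ["John", "Jane", "Emily", "Lisa", "Matt"],
    "note": [92, 94, 87, 82, 90],
})

# The two rows with the smallest note values
lowest_two = scores.nsmallest(2, "note")

# Sorting in ascending order and taking the first two rows
sorted_head = scores.sort_values("note").head(2)
```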

In [24]:
# nlargest function
df.nlargest(2, "note")


Unnamed: 0,name,note,profession,date_of_birth,group
1,Jane,94,Mechanical engineer,2002-08-14,B
0,John,92,Electrical engineer,1998-11-01,A


In [26]:
# isna function
import numpy as np
df.loc[0, "profession"] = np.nan # add a null value in our DataFrame

The isna function can be used for selecting rows that have a missing (i.e. null) value in the specified column. 

In [28]:
# Returns the rows in which the profession value is missing
df[df.profession.isna()]

Unnamed: 0,name,note,profession,date_of_birth,group
0,John,92,,1998-11-01,A


In [29]:
# notna function

df[df.profession.notna()]

Unnamed: 0,name,note,profession,date_of_birth,group
1,Jane,94,Mechanical engineer,2002-08-14,B
2,Emily,87,Data scientist,1996-01-12,B
3,Lisa,82,Accountant,2002-10-24,A
4,Matt,90,Athlete,2004-04-05,C


The notna function is the opposite of the isna function. We can use it to filter out rows with missing values.
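
The relationship between these functions can be checked directly. The sketch below uses a hypothetical frame called staff and also shows dropna with the subset parameter, which should give the same rows as filtering with notna.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing profession
staff = pd.DataFrame({
    "name": ["John", "Jane", "Emily"],
    "profession": [np.nan, "Mechanical engineer", "Data scientist"],
})

missing = staff[staff.profession.isna()]    # rows where profession is null
present = staff[staff.profession.notna()]   # rows where profession is not null
negated = staff[~staff.profession.isna()]   # negating isna mirrors notna

# dropna restricted to one column removes the same rows
dropped = staff.dropna(subset=["profession"])
```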