# Subsets - Rows - Columns

In [2]:
import pandas as pd

This tutorial uses the Titanic data set, stored as a CSV file. It contains the following columns:

PassengerId: Id of every passenger.

Survived: Indicates whether the passenger survived. 0 for no and 1 for yes.

Pclass: One out of the 3 ticket classes: Class 1, Class 2 and Class 3.

Name: Name of passenger.

Sex: Gender of passenger.

Age: Age of passenger in years.

SibSp: Number of siblings or spouses aboard.

Parch: Number of parents or children aboard.

Ticket: Ticket number of passenger.

Fare: Fare paid by the passenger.

Cabin: Cabin number of passenger.

Embarked: Port of embarkation.

Source: https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html

In [3]:
titanic = pd.read_csv("https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv")

In [4]:
# info: column dtypes and non-null counts
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [5]:
# stats: summary statistics of the numeric columns
titanic.describe()

,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [6]:
titanic

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [7]:
# check whether any PassengerId value is duplicated
titanic.PassengerId.duplicated().any()

False

In [8]:
# use the first column (PassengerId) as the index, modifying titanic in place
titanic.set_index(titanic.columns[0], inplace=True)

In [9]:
# shape: 891 rows and 11 columns, now that PassengerId is the index
titanic.shape

(891, 11)
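The `inplace=True` call above mutates `titanic` directly. As a minimal sketch of the non-mutating alternative, the example below uses a small made-up frame (`toy` and its values are illustrative, not taken from the Titanic file) to show that `set_index` returns a new DataFrame by default and that the column count drops by one once a column becomes the index.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the Titanic data.
toy = pd.DataFrame(
    {
        "PassengerId": [1, 2, 3],
        "Survived": [0, 1, 1],
        "Age": [22.0, 38.0, None],
    }
)

# Without inplace=True, set_index returns a new DataFrame
# and leaves the original untouched.
indexed = toy.set_index("PassengerId")

print(toy.shape)           # PassengerId is still a regular column here
print(indexed.shape)       # one column fewer: PassengerId is now the index
print(indexed.index.name)
```

Keeping the original frame around can make it easier to rerun notebook cells out of order.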

## Selecting Columns

Ages of passengers

In [27]:
# titanic.Age returns the Age column as a pandas Series
titanic.Age

PassengerId
1      22.0
2      38.0
3      26.0
4      35.0
5      35.0
       ... 
887    27.0
888    19.0
889     NaN
890    26.0
891    32.0
Name: Age, Length: 891, dtype: float64

In [26]:
# titanic['Age'] returns the same Age column as a pandas Series
titanic['Age']

PassengerId
1      22.0
2      38.0
3      26.0
4      35.0
5      35.0
       ... 
887    27.0
888    19.0
889     NaN
890    26.0
891    32.0
Name: Age, Length: 891, dtype: float64

In [11]:
titanic.Age.unique()

array([22.  , 38.  , 26.  , 35.  ,   nan, 54.  ,  2.  , 27.  , 14.  ,
        4.  , 58.  , 20.  , 39.  , 55.  , 31.  , 34.  , 15.  , 28.  ,
        8.  , 19.  , 40.  , 66.  , 42.  , 21.  , 18.  ,  3.  ,  7.  ,
       49.  , 29.  , 65.  , 28.5 ,  5.  , 11.  , 45.  , 17.  , 32.  ,
       16.  , 25.  ,  0.83, 30.  , 33.  , 23.  , 24.  , 46.  , 59.  ,
       71.  , 37.  , 47.  , 14.5 , 70.5 , 32.5 , 12.  ,  9.  , 36.5 ,
       51.  , 55.5 , 40.5 , 44.  ,  1.  , 61.  , 56.  , 50.  , 36.  ,
       45.5 , 20.5 , 62.  , 41.  , 52.  , 63.  , 23.5 ,  0.92, 43.  ,
       60.  , 10.  , 64.  , 13.  , 48.  ,  0.75, 53.  , 57.  , 80.  ,
       70.  , 24.5 ,  6.  ,  0.67, 30.5 ,  0.42, 34.5 , 74.  ])

In [12]:
titanic.Age.shape

(891,)

##### Columns

In [29]:
titanic.columns

Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket',
       'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [30]:
list(titanic.columns)

['Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

Age and Sex

In [13]:
titanic[["Age", "Sex"]].head()

PassengerId,Age,Sex
1,22.0,male
2,38.0,female
3,26.0,female
4,35.0,female
5,35.0,male
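The difference between single and double brackets can be sketched on a small made-up frame (`toy` is illustrative, not part of the data set): a single label returns a Series, while a list of labels returns a DataFrame, even when the list holds only one name.

```python
import pandas as pd

# Hypothetical miniature frame with the same index name as the tutorial data.
toy = pd.DataFrame(
    {"Age": [22.0, 38.0, 26.0], "Sex": ["male", "female", "female"]},
    index=pd.Index([1, 2, 3], name="PassengerId"),
)

one_col_series = toy["Age"]      # single label -> Series
one_col_frame = toy[["Age"]]     # list with one label -> DataFrame
two_cols = toy[["Age", "Sex"]]   # list of labels -> DataFrame

print(type(one_col_series).__name__, one_col_series.shape)
print(type(one_col_frame).__name__, one_col_frame.shape)
print(type(two_cols).__name__, two_cols.shape)
```

Attribute access such as `toy.Age` only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method, so the bracket form is the safer default.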


In [28]:
# get the names of the columns that contain missing values, if any
cols_with_missing = [ col for col in titanic.columns if titanic[col].isnull().any()]
cols_with_missing

['Age', 'Cabin', 'Embarked']
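The list comprehension above checks one column at a time. A possible vectorized variant, sketched on a made-up frame (`toy` and its values are assumptions for illustration), applies `isna()` to the whole DataFrame and also reports how many values are missing per column.

```python
import pandas as pd

# Hypothetical frame with gaps in two of its three columns.
toy = pd.DataFrame(
    {
        "Age": [22.0, None, 26.0],
        "Fare": [7.25, 71.28, 7.92],
        "Cabin": [None, "C85", None],
    }
)

# isna().any() yields one boolean per column; use it to filter the column index.
has_missing = toy.isna().any()
cols_with_missing = toy.columns[has_missing.to_numpy()].tolist()

# isna().sum() counts the missing values in each column.
missing_counts = toy.isna().sum()

print(cols_with_missing)
print(missing_counts)
```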

## Selecting Rows

In [14]:
# testing a boolean condition on a Series
titanic["Age"] > 35

PassengerId
1      False
2       True
3      False
4      False
5      False
       ...  
887    False
888    False
889    False
890    False
891    False
Name: Age, Length: 891, dtype: bool

In [15]:
# selecting the rows for which the condition is True
titanic[titanic["Age"] > 35].head()  # works like a "where" filter

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S
14,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.275,,S
16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,0,0,248706,16.0,,S
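One detail worth keeping in mind with such filters is how missing values behave: a comparison against `NaN` evaluates to `False`, so passengers with an unknown age are dropped by `> 35` and by `<= 35` alike. A minimal sketch on a made-up frame (`toy` and its values are illustrative):

```python
import pandas as pd

# Hypothetical frame in which passenger 3 has no recorded age.
toy = pd.DataFrame(
    {"Name": ["A", "B", "C", "D"], "Age": [22.0, 54.0, float("nan"), 39.0]},
    index=pd.Index([1, 2, 3, 4], name="PassengerId"),
)

older = toy["Age"] > 35        # NaN > 35 evaluates to False
not_older = toy["Age"] <= 35   # NaN <= 35 is also False

print(older.tolist())
print(not_older.tolist())

# The row with the missing age appears in neither subset.
covered = len(toy[older]) + len(toy[not_older])
print(covered, "of", len(toy), "rows covered")
```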


In [16]:
# testing a condition: Pclass has values 2 or 3
titanic["Pclass"].isin([2, 3])

PassengerId
1       True
2      False
3       True
4      False
5       True
       ...  
887     True
888    False
889     True
890    False
891     True
Name: Pclass, Length: 891, dtype: bool

In [17]:
# Looking for rows where Pclass has values 2 or 3
titanic[titanic["Pclass"].isin([2, 3])].head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S


In [18]:
# the same selection written with two conditions joined by | (or)
titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
...,...,...,...,...,...,...,...,...,...,...,...
885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.0500,,S
886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
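The equivalence of the two spellings, and the way conditions are combined in general, can be sketched on a small made-up frame (`toy` is an assumption for illustration). Each comparison needs its own parentheses because `|` and `&` bind more tightly than `==` or `>`, and the Python keywords `or` and `and` do not work element-wise on a Series.

```python
import pandas as pd

# Hypothetical frame with a mix of ticket classes.
toy = pd.DataFrame(
    {"Pclass": [3, 1, 3, 2, 1], "Age": [22.0, 38.0, 26.0, 55.0, 54.0]},
    index=pd.Index([1, 2, 3, 4, 5], name="PassengerId"),
)

# Two ways to keep the passengers in class 2 or 3.
via_isin = toy[toy["Pclass"].isin([2, 3])]
via_or = toy[(toy["Pclass"] == 2) | (toy["Pclass"] == 3)]
print(via_isin.equals(via_or))

# & is the element-wise "and": not first class AND older than 35.
older_lower_class = toy[(toy["Pclass"] != 1) & (toy["Age"] > 35)]
print(older_lower_class.index.tolist())
```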


In [19]:
# DataFrame of the passengers whose age is known
titanic[titanic["Age"].notna()]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [20]:
# DataFrame of the passengers whose age is unknown
titanic[titanic["Age"].isna()]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0000,,S
20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.2250,,C
27,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.2250,,C
29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.8792,,Q
...,...,...,...,...,...,...,...,...,...,...,...
860,0,3,"Razi, Mr. Raihed",male,,0,0,2629,7.2292,,C
864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.5500,,S
869,0,3,"van Melkebeke, Mr. Philemon",male,,0,0,345777,9.5000,,S
879,0,3,"Laleff, Mr. Kristo",male,,0,0,349217,7.8958,,S


In [21]:
# the same as above: isnull() is an alias of isna()
titanic[titanic["Age"].isnull()]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0000,,S
20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.2250,,C
27,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.2250,,C
29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.8792,,Q
...,...,...,...,...,...,...,...,...,...,...,...
860,0,3,"Razi, Mr. Raihed",male,,0,0,2629,7.2292,,C
864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.5500,,S
869,0,3,"van Melkebeke, Mr. Philemon",male,,0,0,345777,9.5000,,S
879,0,3,"Laleff, Mr. Kristo",male,,0,0,349217,7.8958,,S
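The relationship between the three methods used above can be checked directly. The short sketch below uses a made-up Series (`ages` is illustrative): `isna()` and `isnull()` produce the same mask, and `notna()` is its complement.

```python
import pandas as pd

# Hypothetical age column with two missing entries.
ages = pd.Series([22.0, float("nan"), 26.0, float("nan")], name="Age")

same_mask = ages.isna().equals(ages.isnull())    # both spellings agree
complement = ages.notna().equals(~ages.isna())   # notna is the inverse mask

n_missing = int(ages.isna().sum())
n_known = int(ages.notna().sum())

print(same_mask, complement)
print(n_missing, "missing,", n_known, "known")
```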


## Selecting Rows and Columns - .loc

In [22]:
# Select the names of passengers older than 35
titanic.loc[titanic["Age"] > 35, "Name"]  # rows where Age > 35, Name column only

PassengerId
2      Cumings, Mrs. John Bradley (Florence Briggs Th...
7                                McCarthy, Mr. Timothy J
12                              Bonnell, Miss. Elizabeth
14                           Andersson, Mr. Anders Johan
16                      Hewlett, Mrs. (Mary D Kingcome) 
                             ...                        
866                             Bystrom, Mrs. (Karolina)
872     Beckwith, Mrs. Richard Leonard (Sallie Monypeny)
874                          Vander Cruyssen, Mr. Victor
880        Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)
886                 Rice, Mrs. William (Margaret Norton)
Name: Name, Length: 217, dtype: object

In [23]:
# Select the names and ages of passengers older than 35
titanic.loc[titanic["Age"] > 35, ["Name", "Age"]]

PassengerId,Name,Age
2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0
7,"McCarthy, Mr. Timothy J",54.0
12,"Bonnell, Miss. Elizabeth",58.0
14,"Andersson, Mr. Anders Johan",39.0
16,"Hewlett, Mrs. (Mary D Kingcome)",55.0
...,...,...
866,"Bystrom, Mrs. (Karolina)",42.0
872,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",47.0
874,"Vander Cruyssen, Mr. Victor",47.0
880,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",56.0
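Besides boolean masks, `.loc` also accepts label slices, and these include both endpoints, unlike ordinary Python slices. A minimal sketch on a made-up frame (`toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frame indexed by PassengerId, as in the tutorial.
toy = pd.DataFrame(
    {
        "Name": ["A", "B", "C", "D"],
        "Age": [22.0, 54.0, 26.0, 39.0],
        "Fare": [7.25, 51.86, 7.92, 31.27],
    },
    index=pd.Index([1, 2, 3, 4], name="PassengerId"),
)

# Boolean mask for the rows, list of labels for the columns.
names_and_ages = toy.loc[toy["Age"] > 35, ["Name", "Age"]]
print(names_and_ages)

# Label slices include BOTH endpoints: rows 2 and 3, columns Name through Age.
by_label = toy.loc[2:3, "Name":"Age"]
print(by_label)
```

Here `2:3` refers to the PassengerId labels 2 and 3, not to row positions.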


## Selecting Rows and Columns - .iloc

In [24]:
# rows 10 to 25 and columns 3 to 5 (positions are zero-based and the end of each slice is excluded)
titanic.iloc[9:25, 2:5]

PassengerId,Name,Sex,Age
10,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0
11,"Sandstrom, Miss. Marguerite Rut",female,4.0
12,"Bonnell, Miss. Elizabeth",female,58.0
13,"Saundercock, Mr. William Henry",male,20.0
14,"Andersson, Mr. Anders Johan",male,39.0
15,"Vestrom, Miss. Hulda Amanda Adolfina",female,14.0
16,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0
17,"Rice, Master. Eugene",male,2.0
18,"Williams, Mr. Charles Eugene",male,
19,"Vander Planke, Mrs. Julius (Emelia Maria Vande...",female,31.0


In [25]:
# assign 0 to every cell in the selection, then display the result
titanic.iloc[9:25, 2:5] = 0
titanic.iloc[9:25, 2:5]


PassengerId,Name,Sex,Age
10,0,0,0.0
11,0,0,0.0
12,0,0,0.0
13,0,0,0.0
14,0,0,0.0
15,0,0,0.0
16,0,0,0.0
17,0,0,0.0
18,0,0,0.0
19,0,0,0.0
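The assignment above overwrites values in `titanic` itself. A minimal sketch of the same idea on an explicit copy, so that the original values survive, is shown below; the frame `toy` and the replacement value `"anonymous"` are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical frame indexed by PassengerId.
toy = pd.DataFrame(
    {
        "Name": ["A", "B", "C", "D"],
        "Age": [22.0, 54.0, 26.0, 39.0],
    },
    index=pd.Index([1, 2, 3, 4], name="PassengerId"),
)

# Work on an explicit copy so the original frame keeps its values.
edited = toy.copy()

# Positions are zero-based and the end of a slice is excluded:
# rows at positions 1 and 2, column at position 0 ("Name").
edited.iloc[1:3, 0] = "anonymous"

print(edited["Name"].tolist())
print(toy["Name"].tolist())
```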


REMEMBER
When selecting subsets of data, square brackets [] are used.

Inside these brackets, you can use a single column/row label, a list of column/row labels, a slice of labels, a conditional expression or a colon.

Select specific rows and/or columns using loc when using the row and column names.

Select specific rows and/or columns using iloc when using the positions in the table.

You can assign new values to a selection based on loc/iloc.