# Indexing, Selecting & Assigning

In [19]:
import pandas as pd
titanic = pd.read_csv("../Datasets/titanic.csv")
health = pd.read_csv("../Datasets/Health_heart_experimental.csv")

---
## Native accessors

Native Python objects provide good ways of indexing data. Pandas carries all of these over, which helps make it easy to start with.

In [20]:
health

Unnamed: 0.1,Unnamed: 0,age,sex,SysBP,DiaBP,HR,weightKg,heightCm,BMI,indication
0,0,64,1,141,96,128,69,147,32.0,1
1,1,21,1,109,100,106,48,150,21.0,0
2,2,30,0,112,73,126,69,183,21.0,0
3,3,35,1,106,90,130,45,149,20.0,0
4,4,39,0,140,90,112,92,166,33.0,1
...,...,...,...,...,...,...,...,...,...,...
71755,71755,50,1,133,88,116,53,154,22.0,0
71756,71756,69,1,123,112,94,70,142,35.0,1
71757,71757,24,1,143,107,107,57,156,23.0,0
71758,71758,42,1,122,111,150,115,176,37.0,1


In Python, we can access a property of an object as an attribute. A DataFrame exposes its columns the same way:

In [21]:
health.age

0        64
1        21
2        30
3        35
4        39
         ..
71755    50
71756    69
71757    24
71758    42
71759    49
Name: age, Length: 71760, dtype: int64

We can also access a column's values using the indexing ([]) operator, as we would with a Python dictionary.

In [22]:
health['BMI']

0        32.0
1        21.0
2        21.0
3        20.0
4        33.0
         ... 
71755    22.0
71756    35.0
71757    23.0
71758    37.0
71759    36.0
Name: BMI, Length: 71760, dtype: float64

The key advantage of the indexing operator is that it can handle column names containing whitespace or other characters that are not valid in a Python attribute name.
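To make the difference concrete, here is a minimal sketch on a small hypothetical DataFrame (the column names `resting HR` and `count` are made up for illustration and are not part of the health dataset). It shows one column that attribute access cannot reach because of whitespace, and one that it cannot reach because the name collides with an existing DataFrame method.

```python
import pandas as pd

# Hypothetical toy data, not the health dataset used in this notebook.
df = pd.DataFrame({
    "age": [64, 21, 30],
    "resting HR": [128, 106, 126],   # name contains a space
    "count": [1, 2, 3],              # name collides with DataFrame.count
})

# Attribute access works for simple column names.
ages = df.age

# A name with whitespace is only reachable with the indexing operator;
# writing df.resting HR would be a syntax error.
heart_rates = df["resting HR"]

# df.count refers to the DataFrame.count method, not the column,
# so the indexing operator is needed here as well.
counts = df["count"]

print(ages.tolist())
print(heart_rates.tolist())
print(callable(df.count), counts.tolist())
```

For this reason the indexing operator is the safer default when column names are not under your control.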

To pick out a specific value:

In [23]:
health['BMI'][0]

np.float64(32.0)

---
## Indexing in pandas

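Pandas has two accessor operators of its own: loc and iloc. The first paradigm, index-based selection, is the one followed by iloc: it selects data based on its numerical position. To select the first row of data: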

In [24]:
health.iloc[0]

Unnamed: 0      0.0
age            64.0
sex             1.0
SysBP         141.0
DiaBP          96.0
HR            128.0
weightKg       69.0
heightCm      147.0
BMI            32.0
indication      1.0
Name: 0, dtype: float64

Both loc and iloc are row-first, column-second. This is the opposite of what we do in native Python, which is column-first, row-second.<br>
To get a column with iloc, we can do the following:

In [25]:
health.iloc[:,0]

0            0
1            1
2            2
3            3
4            4
         ...  
71755    71755
71756    71756
71757    71757
71758    71758
71759    71759
Name: Unnamed: 0, Length: 71760, dtype: int64

On its own, the : operator, which also comes from native Python, means "everything". When combined with other selectors, however, it can be used to indicate a range of values. For example, to select the first column (PassengerId) from just the first, second, and third rows, we would do:

In [26]:
titanic.iloc[:3,0]

0    1
1    2
2    3
Name: PassengerId, dtype: int64

Or, to select just the second and third entries, we would do:

In [27]:
titanic.iloc[1:3,4]

1    female
2    female
Name: Sex, dtype: object

It's also possible to pass a list:

In [28]:
titanic.iloc[[3,4,5],3]

3    Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                        Allen, Mr. William Henry
5                                Moran, Mr. James
Name: Name, dtype: object

Finally, it's worth knowing that negative numbers can be used in selection. They count from the end of the values, so the following selects the last five rows:

In [29]:
titanic.iloc[-5:]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q
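The position-based selections above can be collected into one short sketch. It uses a small hypothetical DataFrame (the names and ages are invented) so that each result is easy to check by eye.

```python
import pandas as pd

# Hypothetical toy data standing in for the titanic DataFrame.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee", "Eve"],
    "age": [31, 45, 27, 52, 38],
})

first_row = df.iloc[0]           # row at position 0, returned as a Series
first_col = df.iloc[:, 0]        # every row, column at position 0
top_three_ages = df.iloc[:3, 1]  # rows 0, 1, 2 of column 1
middle = df.iloc[1:3, 1]         # rows 1 and 2 (the end position is excluded)
picked = df.iloc[[0, 2, 4], 0]   # an explicit list of row positions
last_two = df.iloc[-2:]          # negative positions count from the end

print(top_three_ages.tolist())
print(middle.tolist())
print(picked.tolist())
print(last_two)
```

In every case the row selector comes first and the column selector second, and slices follow the usual Python rule of excluding the end position.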


#### Label-based selection

The second paradigm for attribute selection is the one followed by the loc operator: label-based selection. In this paradigm, it's the data index value, not its position, which matters.

In [30]:
titanic.loc[0,'Name']

'Braund, Mr. Owen Harris'

iloc is conceptually simpler than loc because it ignores the dataset's indices. When we use iloc we treat the dataset like a big matrix (a list of lists), one that we have to index into by position. loc, by contrast, uses the information in the indices to do its work. Since your dataset usually has meaningful indices, it's usually easier to do things using loc instead. For example, here's one operation that's much easier using loc:

In [31]:
titanic.loc[:,['Name','Cabin']]

Unnamed: 0,Name,Cabin
0,"Braund, Mr. Owen Harris",
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",C85
2,"Heikkinen, Miss. Laina",
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",C123
4,"Allen, Mr. William Henry",
...,...,...
886,"Montvila, Rev. Juozas",
887,"Graham, Miss. Margaret Edith",B42
888,"Johnston, Miss. Catherine Helen ""Carrie""",
889,"Behr, Mr. Karl Howell",C148
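The difference between the two paradigms is easier to see when the index is not simply 0, 1, 2, and so on. The sketch below assumes a hypothetical DataFrame whose index labels are 10, 20, 30, 40. It also shows a detail worth keeping in mind: a loc slice includes its end label, while an iloc slice excludes its end position.

```python
import pandas as pd

# Hypothetical toy data with index labels that differ from the positions.
df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cid", "Dee"],
     "cabin": ["C85", None, "E46", "B42"]},
    index=[10, 20, 30, 40],
)

by_label = df.loc[10, "name"]   # row labelled 10, column labelled "name"
by_position = df.iloc[0, 0]     # row at position 0, column at position 0

# loc slices by label and includes the end label (10, 20 and 30).
label_slice = df.loc[10:30, "name"]

# iloc slices by position and excludes the end position (0 and 1 only).
position_slice = df.iloc[0:2, 0]

# Selecting columns by name, as in the titanic example above.
subset = df.loc[:, ["name", "cabin"]]

print(by_label, by_position)
print(label_slice.tolist())
print(position_slice.tolist())
```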


---
## Manipulating the index

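The set_index method turns an existing column into the index. This is useful when the data has a column that makes a better index than the default one. Note that set_index returns a new DataFrame and leaves the original unchanged.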

In [32]:
titanic.set_index('Name')

Name,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
"Braund, Mr. Owen Harris",1,0,3,male,22.0,1,0,A/5 21171,7.2500,,S
"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",2,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C
"Heikkinen, Miss. Laina",3,1,3,female,26.0,0,0,STON/O2. 3101282,7.9250,,S
"Futrelle, Mrs. Jacques Heath (Lily May Peel)",4,1,1,female,35.0,1,0,113803,53.1000,C123,S
"Allen, Mr. William Henry",5,0,3,male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
"Montvila, Rev. Juozas",887,0,2,male,27.0,0,0,211536,13.0000,,S
"Graham, Miss. Margaret Edith",888,1,1,female,19.0,0,0,112053,30.0000,B42,S
"Johnston, Miss. Catherine Helen ""Carrie""",889,0,3,female,,1,2,W./C. 6607,23.4500,,S
"Behr, Mr. Karl Howell",890,1,1,male,26.0,0,0,111369,30.0000,C148,C

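As a minimal sketch of how the new index can be used, the example below builds a small hypothetical DataFrame, sets the `name` column as the index, and then looks a row up by its label with loc. The data and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the titanic DataFrame.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid"],
    "age": [31, 45, 27],
})

# set_index returns a new DataFrame; df itself is not modified.
by_name = df.set_index("name")

print(df.index.tolist())        # still the default integer index
print(by_name.index.tolist())   # the former "name" column

# With names as the index, loc can look a row up by name.
bob_age = by_name.loc["Bob", "age"]
print(bob_age)
```

To keep the result, assign it to a variable as done here with `by_name`.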

---
## Conditional selection

For example, to check whether each passenger in titanic is male, we can write:

In [33]:
titanic.Sex == 'male'

0       True
1      False
2      False
3      False
4       True
       ...  
886     True
887    False
888    False
889     True
890     True
Name: Sex, Length: 891, dtype: bool

This operation produces a Series of booleans indicating whether the Sex of each record is male.<br>
That result can then be used inside loc to select only the relevant data:

In [34]:
titanic.loc[titanic.Sex == 'male']

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
883,884,0,2,"Banfield, Mr. Frederick James",male,28.0,0,0,C.A./SOTON 34068,10.5000,,S
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.0500,,S
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


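Conditions can be combined with the logical operators & (AND) and | (OR). Each condition must be wrapped in parentheses.<br>
For example, to select only male passengers older than 25, and then passengers who are either male or younger than 10: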

In [35]:
titanic.loc[(titanic.Sex == 'male') & (titanic.Age > 25.0)]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
13,14,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.2750,,S
20,21,0,2,"Fynney, Mr. Joseph J",male,35.0,0,0,239865,26.0000,,S
21,22,1,2,"Beesley, Mr. Lawrence",male,34.0,0,0,248698,13.0000,D56,S
...,...,...,...,...,...,...,...,...,...,...,...,...
881,882,0,3,"Markun, Mr. Johann",male,33.0,0,0,349257,7.8958,,S
883,884,0,2,"Banfield, Mr. Frederick James",male,28.0,0,0,C.A./SOTON 34068,10.5000,,S
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [36]:
titanic.loc[(titanic.Sex == 'male') | (titanic.Age < 10)]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
883,884,0,2,"Banfield, Mr. Frederick James",male,28.0,0,0,C.A./SOTON 34068,10.5000,,S
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.0500,,S
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C
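The same pattern can be tried on a small hypothetical DataFrame where the result is easy to verify by hand. The sketch below also includes one missing age to show that a comparison against NaN evaluates to False, so that row is not matched by the age condition.

```python
import pandas as pd

# Hypothetical toy data; the last age is missing (NaN).
df = pd.DataFrame({
    "sex": ["male", "female", "male", "female", "male"],
    "age": [22.0, 38.0, 35.0, 4.0, None],
})

# Each comparison yields a boolean Series aligned with the rows.
is_male = df.sex == "male"
over_25 = df.age > 25          # NaN > 25 evaluates to False

# & keeps rows where both conditions hold.
men_over_25 = df.loc[is_male & over_25]

# | keeps rows where at least one condition holds.
# The parentheses are required because & and | bind tighter than == and <.
male_or_child = df.loc[(df.sex == "male") | (df.age < 10)]

print(men_over_25.index.tolist())
print(male_or_child.index.tolist())
```

Storing each mask in a named variable, as with `is_male` and `over_25`, can make longer conditions easier to read.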


Pandas comes with a few built-in conditional selectors. <br>
The first is isin, which lets you select data whose value "is in" a list of values.

In [37]:
titanic.loc[titanic.Age.isin([20.0, 19.0])]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
12,13,0,3,"Saundercock, Mr. William Henry",male,20.0,0,0,A/5. 2151,8.05,,S
27,28,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0,C23 C25 C27,S
44,45,1,3,"Devaney, Miss. Margaret Delia",female,19.0,0,0,330958,7.8792,,Q
67,68,0,3,"Crease, Mr. Ernest James",male,19.0,0,0,S.P. 3464,8.1583,,S
91,92,0,3,"Andreasson, Mr. Paul Edvin",male,20.0,0,0,347466,7.8542,,S
113,114,0,3,"Jussila, Miss. Katriina",female,20.0,1,0,4136,9.825,,S
131,132,0,3,"Coelho, Mr. Domingos Fernandeo",male,20.0,0,0,SOTON/O.Q. 3101307,7.05,,S
136,137,1,1,"Newsom, Miss. Helen Monypeny",female,19.0,0,2,11752,26.2833,D47,S
143,144,0,3,"Burke, Mr. Jeremiah",male,19.0,0,0,365222,6.75,,Q
145,146,0,2,"Nicholls, Mr. Joseph Charles",male,19.0,1,1,C.A. 33112,36.75,,S


The second is isnull (and its companion notnull). These methods let you select values which are (or are not) empty (NaN).

In [38]:
titanic.loc[titanic.Age.isnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
17,18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0000,,S
19,20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.2250,,C
26,27,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.2250,,C
28,29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.8792,,Q
...,...,...,...,...,...,...,...,...,...,...,...,...
859,860,0,3,"Razi, Mr. Raihed",male,,0,0,2629,7.2292,,C
863,864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.5500,,S
868,869,0,3,"van Melkebeke, Mr. Philemon",male,,0,0,345777,9.5000,,S
878,879,0,3,"Laleff, Mr. Kristo",male,,0,0,349217,7.8958,,S
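Here is a compact sketch of the three selectors side by side, again on hypothetical data with one missing age, so that the rows each one returns can be checked directly.

```python
import pandas as pd

# Hypothetical toy data; Cid's age is missing (NaN).
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "age": [19.0, 20.0, None, 35.0],
})

# isin: rows whose age is one of the listed values.
aged_19_or_20 = df.loc[df.age.isin([19.0, 20.0])]

# isnull: rows where the age is missing.
age_missing = df.loc[df.age.isnull()]

# notnull: the complement, rows where the age is present.
age_known = df.loc[df.age.notnull()]

print(aged_19_or_20.name.tolist())
print(age_missing.name.tolist())
print(age_known.name.tolist())
```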


---
## Assigning data

We can assign either a constant value, which is applied to every row, or an iterable of values with the same length as the DataFrame.

In [39]:
titanic['Sex'] = 'Male'
titanic['Sex']

0      Male
1      Male
2      Male
3      Male
4      Male
       ... 
886    Male
887    Male
888    Male
889    Male
890    Male
Name: Sex, Length: 891, dtype: object

In [40]:
titanic['PassengerId'] = range(0,len(titanic),1)
titanic['PassengerId']

0        0
1        1
2        2
3        3
4        4
      ... 
886    886
887    887
888    888
889    889
890    890
Name: PassengerId, Length: 891, dtype: int64
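The two forms of assignment can be sketched on a small hypothetical DataFrame. The last step, assigning through loc to only the rows that match a condition, is not shown in the cells above; it is included here as an assumption about a common next step that combines assignment with conditional selection.

```python
import pandas as pd

# Hypothetical toy data standing in for the titanic DataFrame.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid"],
    "age": [31, 45, 27],
})

# A constant is broadcast to every row (this creates a new column).
df["ship"] = "Titanic"

# An iterable must have exactly one value per row.
df["row_id"] = range(len(df))

# Not in the original cells: assign only where a condition holds,
# here capping ages above 40 at 40.
df.loc[df.age > 40, "age"] = 40

print(df)
```

Note that assigning to an existing column, as the cells above do with `Sex` and `PassengerId`, overwrites its previous values.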

---