# Recap of Lecture 07

> shown in class: `chdir, dir, cd, python` commands (for Anaconda Prompt on Windows)

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-c3ow{border-color:inherit;text-align:center;vertical-align:top}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
.tg .tg-7btt{border-color:inherit;font-weight:bold;text-align:center;vertical-align:top}
.tg .tg-fymr{border-color:inherit;font-weight:bold;text-align:left;vertical-align:top}
</style>
<table class="tg">
<thead>
  <tr>
    <th class="tg-0pky"></th>
    <th class="tg-7btt">windows</th>
    <th class="tg-7btt">macOS</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-fymr">print out the current working directory</td>
    <td class="tg-c3ow">chdir</td>
    <td class="tg-c3ow">pwd</td>
  </tr>
  <tr>
    <td class="tg-fymr">list contents of current working directory</td>
    <td class="tg-c3ow">dir</td>
    <td class="tg-c3ow">ls</td>
  </tr>
  <tr>
    <td class="tg-fymr">change directory</td>
    <td class="tg-c3ow" colspan="2">cd &lt;relative path&gt;</td>
  </tr>
  <tr>
    <td class="tg-fymr">run a python file</td>
    <td class="tg-c3ow" colspan="2">python &lt;filename.py&gt;</td>
  </tr>
</tbody>
</table>

Message from Michael: 

## Please install WSL (Ubuntu distribution)

* WSL: Windows Subsystem for Linux
* Instructions: [here](https://learn.microsoft.com/en-gb/windows/wsl/install)
* If all goes well, you "just" have to run `wsl --install` from your CLI (see the instructions for details)
* Why? Because then you can use a Linux shell on your Windows machine!
* We will send out more detailed instructions, and help you in the StudyLab/LiveCoding

In [1]:
# list comprehension
[i**2 for i in range(10,20)]

[100, 121, 144, 169, 196, 225, 256, 289, 324, 361]

In [2]:
# list comprehension
[i**2 for i in range(10,20) if i**2 > 250]

[256, 289, 324, 361]

# Lecture 08
* `seed()`ing in the random module
* (more or less) common file formats for tabular data 
* the `pandas` package ("Excel on steroids")

# A `random` fact

`random.seed()` makes "random" numbers "reproducible"

In [4]:
# if I don't set a "seed", running this cell several times
# will produce a new "random" number each time:
import random
random.choice(range(10))

5

In [17]:
# if I set a "seed", running this cell several times
# will produce THE SAME "random" number each time:
import random
random.seed(42) # what you use as seed is up to you; ~"same seed > same results"
random.choice(range(10))

1
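The seed fixes the whole sequence of draws, not only the first one. A minimal sketch to check this; the helper name `draw_three` is made up for this illustration:

```python
import random

def draw_three(seed):
    # hypothetical helper: reseed, then draw three numbers from 0..9
    random.seed(seed)
    return [random.choice(range(10)) for _ in range(3)]

first_run = draw_three(42)
second_run = draw_three(42)
print(first_run == second_run)  # True: same seed, same sequence
```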

# Common (?) file formats for tabular data
* `.txt` plain text file, not formatted
* `.csv` text in comma-separated values, not formatted
* `.xls`, `.xlsx` Microsoft Excel worksheets
* `.json` "JavaScript Object Notation"

# `csv`: "comma" separated values

`.csv` is often used even when the separator is not a comma, but a tab, a whitespace, a semicolon, ...

<p style="text-align:left;">
    <img src="images/csv.png" alt="csv file" width=1000px>
</p>
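A small sketch of what happens when the separator is not a comma, using Python's built-in `csv` module on a made-up two-row text (the sample data is an assumption, not part of the Titanic files):

```python
import csv
import io

# made-up sample: a "csv" whose separator is a semicolon
text = "Name;Age\nAnna;30\nBen;25\n"

# csv.reader assumes "," unless told otherwise
rows_default = list(csv.reader(io.StringIO(text)))
rows_semicolon = list(csv.reader(io.StringIO(text), delimiter=";"))

print(rows_default[0])    # ['Name;Age']
print(rows_semicolon[0])  # ['Name', 'Age']
print(rows_semicolon[1])  # ['Anna', '30']  (all values are strings)
```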

# `json` JavaScript Object Notation

* format understood by many programming languages (not only JavaScript!)
* can store different (tree-like) data structures, not only tables
* often used for server-web application data transfer
* data types allowed in json: numbers, strings, booleans, `null`, "arrays" (similar to lists in Python), "objects" (name-value pair collections, similar to dictionaries in Python)

<p style="text-align:left;">
    <img src="images/json.png" alt="json file" width=1000px>
</p>
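To see how these json data types arrive in Python, here is a minimal sketch with the built-in `json` module. The record is made up for illustration:

```python
import json

# made-up JSON text with a string, numbers, a boolean, an array and an object
text = (
    '{"name": "Anna", "age": 30, "height": 1.68, "student": true, '
    '"courses": ["math", "python"], "address": {"city": "Copenhagen"}}'
)
record = json.loads(text)

print(type(record))               # <class 'dict'>
print(type(record["courses"]))    # <class 'list'>
print(type(record["student"]))    # <class 'bool'>
print(record["address"]["city"])  # Copenhagen
```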


# A Jupyter notebook is actually a json file, too!

<p style="text-align:left;">
    <img src="images/ipynb.png" alt="ipynb file opened with text editor" width=1000px>
</p>
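Because a notebook is json, it can be read like any other json document. The sketch below uses a made-up, heavily shortened notebook text and assumes the version 4 notebook layout with a top-level `cells` list; real `.ipynb` files contain more fields:

```python
import json

# made-up, shortened notebook text (assumption: nbformat 4 layout)
notebook_text = """
{
  "cells": [
    {"cell_type": "markdown", "metadata": {}, "source": ["# Lecture 08"]},
    {"cell_type": "code", "metadata": {}, "execution_count": null,
     "outputs": [], "source": ["import pandas as pd"]}
  ],
  "metadata": {},
  "nbformat": 4,
  "nbformat_minor": 5
}
"""
notebook = json.loads(notebook_text)

# each cell is a dictionary; collect the type of every cell
cell_types = [cell["cell_type"] for cell in notebook["cells"]]
print(cell_types)  # ['markdown', 'code']
```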


# Our table of the day: Titanic passengers

[(Link to raw data)](https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv)

same data, stored in different file formats, in `data` folder

In [19]:
# reading in tabular data in csv format - the painful way
import csv
with open("data/titanic.csv", "r") as opened_file:
    my_reader = csv.reader(opened_file)
    rows = [row for row in my_reader]


In [21]:
# reading in tabular data in json format - the painful way
import json
with open("data/titanic.json", "r") as opened_file:
    my_json = json.load(opened_file)
#my_json

# Enter: `pandas`

for tabular data

* reading in data `.read_csv()`, writing data `.to_csv()`
* displaying parts of the data set `.head(), .tail()`
* displaying column and row names `.columns, .index` 
* displaying and changing column datatypes `.dtypes`, `.astype()`
* displaying summary statistics: `.describe()`
* accessing columns `[]`, rows `.loc[]`, and single values `.loc[]`
* boolean indexing by condition `[condition]`
* boolean indexing by several conditions `[(condition1) & (condition2)]`
* filtering out missing values `.isna()` or available values `.notna()`
* creating a copy `.copy()`
* adding new columns `[]`
* dropping rows and columns `.drop()`
* sorting rows by values: `.sort_values()`

In [1]:
# import pandas with the alias "pd"
# pandas is a separate Python PACKAGE
# it doesn't "usually" come with Python
# but it is part of the Anaconda distribution
# so you SHOULD already have it on your machine
import pandas as pd




In [23]:
# read_csv(filepath) reads in a csv from a file and returns a pandas DataFrame:
pd.read_csv("data/titanic.csv")

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [24]:
# .read_csv() assumes that sep=","; with a semicolon-separated file, the columns are NOT split correctly:
pd.read_csv("data/titanic_semicolon.csv")

Unnamed: 0,PassengerId;Survived;Pclass;Name;Sex;Age;SibSp;Parch;Ticket;Fare;Cabin;Embarked
1;0;3;Braund,Mr. Owen Harris;male;22.0;1;0;A/5 21171;7.25;;S
2;1;1;Cumings,Mrs. John Bradley (Florence Briggs Thayer);fe...
3;1;3;Heikkinen,Miss. Laina;female;26.0;0;0;STON/O2. 3101282;...
4;1;1;Futrelle,Mrs. Jacques Heath (Lily May Peel);female;35....
5;0;3;Allen,Mr. William Henry;male;35.0;0;0;373450;8.05;;S
...,...
887;0;2;Montvila,Rev. Juozas;male;27.0;0;0;211536;13.0;;S
888;1;1;Graham,Miss. Margaret Edith;female;19.0;0;0;112053;3...
"889;0;3;""Johnston","Miss. Catherine Helen """"Carrie"""""";female;;1;2..."
890;1;1;Behr,Mr. Karl Howell;male;26.0;0;0;111369;30.0;C148;C


In [25]:
# .read_csv() assumes that sep=","... 
# but i can indicate a different separator:
pd.read_csv("data/titanic_semicolon.csv", sep = ";")

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C
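The effect of `sep` can be reproduced without the data files. A minimal sketch on a made-up semicolon-separated text, read from memory with `io.StringIO` instead of a file path:

```python
import io
import pandas as pd

text = "Name;Age\nAnna;30\nBen;25\n"  # made-up sample data

wrong = pd.read_csv(io.StringIO(text))           # default sep=","
right = pd.read_csv(io.StringIO(text), sep=";")  # correct separator

print(wrong.shape)  # (2, 1): everything ended up in a single column
print(right.shape)  # (2, 2)
print(list(right.columns))  # ['Name', 'Age']
```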


In [26]:
# let's save the pandas DataFrame into a variable, df:
df = pd.read_csv("data/titanic.csv")
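The counterpart of `.read_csv()` is `.to_csv()`. A minimal sketch with a made-up table; it writes to an in-memory buffer so that no file is created, but a file path such as `"out.csv"` (a hypothetical name) can be passed in the same place:

```python
import io
import pandas as pd

small = pd.DataFrame({"Name": ["Anna", "Ben"], "Age": [30, 25]})  # made-up data

buffer = io.StringIO()             # stands in for a file path
small.to_csv(buffer, index=False)  # index=False: do not write the row labels
csv_text = buffer.getvalue()
print(csv_text)

# reading the text back in gives the same table
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip["Age"].tolist())  # [30, 25]
```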

In [29]:
# now we can display the first (by default 5) rows:
df.head()
# if you provide an integer argument n, will display the first n rows

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [30]:
# to display the last (by default 5) rows:
df.tail() # if you provide an integer argument n, will display the last n rows


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [31]:
# length of the dataframe = number of rows
len(df)

891

In [32]:
# the type of the variable is "pandas dataframe":
type(df)

pandas.core.frame.DataFrame

In [33]:
# this object (the pandas DataFrame) has ATTRIBUTES:
# characteristics accessible by .attributename
df.dtypes # .dtypes contains the data types of all columns

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
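Column datatypes can be changed with `.astype()`. A minimal sketch on a made-up three-row table (not the full Titanic data):

```python
import pandas as pd

# made-up subset with an integer and a float column
small = pd.DataFrame({"Survived": [0, 1, 1], "Fare": [7.25, 71.2833, 7.925]})

# .astype() returns a NEW column; assign it back to change the dataframe
small["Survived"] = small["Survived"].astype(bool)
small["Fare"] = small["Fare"].astype(int)  # cuts off the decimals

print(small.dtypes)
print(small["Survived"].tolist())  # [False, True, True]
print(small["Fare"].tolist())      # [7, 71, 7]
```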

In [34]:
# another attribute is .index (containing the ROW LABELS)
df.index

RangeIndex(start=0, stop=891, step=1)

In [35]:
# another attribute is .columns (containing the COLUMN LABELS)
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [36]:
# .shape contains the shape (nrows, ncols) of the dataframe
df.shape

(891, 12)

# Get some summary statistics

(count, mean, std, min, quartiles, max, for each numeric column separately)

In [37]:
df.describe()
# count: number of values (that are NOT NaN)

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


# Accessing specific columns in the data frame

...a little bit like indexing:

`df[]` with single column label, or with list of column labels:

#### `df[columnlabel]` 
#### `df[[col1, col2, col3]]` 


In [40]:
# access the columns separately with square brackets 
# and their column name ("label"):
df["Name"] # returns ONLY the column "Name"

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

In [41]:
# access the columns separately with square brackets 
# and their column name ("label"):
df["Age"] # returns ONLY the column "Age"

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

In [43]:
# access specific columns by giving a list of column names as index: 
df[["Name", "Survived","Age"]]

Unnamed: 0,Name,Survived,Age
0,"Braund, Mr. Owen Harris",0,22.0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0
2,"Heikkinen, Miss. Laina",1,26.0
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0
4,"Allen, Mr. William Henry",0,35.0
...,...,...,...
886,"Montvila, Rev. Juozas",0,27.0
887,"Graham, Miss. Margaret Edith",1,19.0
888,"Johnston, Miss. Catherine Helen ""Carrie""",0,
889,"Behr, Mr. Karl Howell",1,26.0


# Accessing specific rows in the data frame

`df.loc[]` with single row label (index) or with list of row labels:

#### `df.loc[rowlabel]` 
#### `df.loc[[row1,row2,row3]]` 

In [44]:
# remember, our row labels (in this case) are simply integer numbers:
df.index

RangeIndex(start=0, stop=891, step=1)

In [45]:
# accessing rows by index: df.loc[] with index (row label) as argument
df.loc[0] # returns only first row

PassengerId                          1
Survived                             0
Pclass                               3
Name           Braund, Mr. Owen Harris
Sex                               male
Age                               22.0
SibSp                                1
Parch                                0
Ticket                       A/5 21171
Fare                              7.25
Cabin                              NaN
Embarked                             S
Name: 0, dtype: object

In [46]:
# accessing rows by index: df.loc[] with list of indices (row labels) as argument
df.loc[[0,1,2]] # returns first 3 rows

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
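`.loc[]` works with row LABELS, which only happen to be integers in the Titanic table. A minimal sketch with a made-up table that uses strings as row labels:

```python
import pandas as pd

# made-up table with strings as row labels
small = pd.DataFrame(
    {"Age": [22.0, 38.0, 26.0], "Survived": [0, 1, 1]},
    index=["braund", "cumings", "heikkinen"],
)

one_row = small.loc["cumings"]                  # single label: one row
two_rows = small.loc[["braund", "heikkinen"]]   # list of labels: several rows

print(one_row)
print(two_rows)
```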


## Accessing specific rows AND columns in the data frame

#### `df.loc[rowlabels, columnlabels]` 


In [47]:
# row labels: first three rows; column label: "Name"
df.loc[[0,1,2], "Name"]

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
Name: Name, dtype: object

In [48]:
# row labels: first three rows; column labels: "Name" & "Age"
df.loc[[0,1,2], ["Name", "Age"]]

Unnamed: 0,Name,Age
0,"Braund, Mr. Owen Harris",22.0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0
2,"Heikkinen, Miss. Laina",26.0


In [49]:
# row labels: first row; column labels: ["Name", "Sex"]
df.loc[0, ["Name", "Sex"]]

Name    Braund, Mr. Owen Harris
Sex                        male
Name: 0, dtype: object

In [50]:
# accessing one single value
df.loc[0, "Name"]

'Braund, Mr. Owen Harris'

# Try it out yourself

Each of the tasks below is 1 line of code!

* Access only the column "Fare" `[]`
* Access only the columns "Fare" and "Age" `[[]]`
* Access only the rows with row labels 3, 4, and 5 `.loc[]`
* Access only the rows 3, 4, 5, and only the columns "Survived" and "Name"
* Access the name of the last passenger in the dataframe

In [52]:
# access the column Fare
df["Fare"]

0       7.2500
1      71.2833
2       7.9250
3      53.1000
4       8.0500
        ...   
886    13.0000
887    30.0000
888    23.4500
889    30.0000
890     7.7500
Name: Fare, Length: 891, dtype: float64

In [56]:
# access the columns Fare and Age
df[["Fare","Age"]]

Unnamed: 0,Fare,Age
0,7.2500,22.0
1,71.2833,38.0
2,7.9250,26.0
3,53.1000,35.0
4,8.0500,35.0
...,...,...
886,13.0000,27.0
887,30.0000,19.0
888,23.4500,
889,30.0000,26.0


In [64]:
# access the rows [3,4,5]
df.loc[[3,4,5]]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q


In [66]:
# access the rows [3,4,5] and the columns ["Survived", "Name"]
df.loc[[3,4,5],["Survived","Name"]]

Unnamed: 0,Survived,Name
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
4,0,"Allen, Mr. William Henry"
5,0,"Moran, Mr. James"


In [67]:
# access the name of the last passenger in the dataframe
df.tail(1) # shows the whole last row; df.loc[890, "Name"] returns only the name

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


# Deleting rows and columns with `.drop()`

In [68]:
# remove a row: axis = 0
df.drop(labels=3, axis=0)
# this removes the row with label 3

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [69]:
# remove a column: axis = 1
df.drop(labels="Age", axis = 1)
# this removes the column "Age"

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,0,0,111369,30.0000,C148,C


In [75]:
# df.drop() returns a NEW dataframe and leaves df unchanged; how to CHANGE df itself?
df = df.drop(labels="Age", axis = 1)
# either overwrite the variable

In [76]:
# df.drop() returns a NEW dataframe and leaves df unchanged; how to CHANGE df itself?
df.drop(labels="Name", axis = 1, inplace=True)
# OR set inplace=True

In [None]:
# now both "Name" and "Age" columns are removed:
df.head(3)
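A minimal sketch, on a made-up two-row table, showing that `.drop()` without `inplace=True` leaves the original dataframe untouched:

```python
import pandas as pd

small = pd.DataFrame({"Name": ["Anna", "Ben"], "Age": [30, 25]})  # made-up data

without_age = small.drop(labels="Age", axis=1)  # returns a new dataframe

print(list(small.columns))        # ['Name', 'Age']  (unchanged)
print(list(without_age.columns))  # ['Name']
```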


In [77]:
# let's read in the data one more time, after we messed with it:
df = pd.read_csv("data/titanic.csv")

# Boolean indexing (by condition)

##### `df[columnlabel]` can be combined with comparison `> < == !=` operators
##### `df[condition]` returns only those rows where condition is True
##### `df[(condition1) & (condition2)]` returns only those rows where both conditions are True

In [78]:
# let's see what happens if we use a comparison operator with a single column:
# returns for each row either True or False
df["Age"] > 18 # was this person over 18 on that ship?

0       True
1       True
2       True
3       True
4       True
       ...  
886     True
887     True
888    False
889     True
890     True
Name: Age, Length: 891, dtype: bool

In [79]:
# we can use that condition to index only rows where condition is True:
my_condition = df["Age"] > 18 # who was over 18 on that ship?
df[my_condition]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [80]:
# a shorter (but perhaps more confusing at first) way to write this: 
# df[condition] (where condition contains a df column)
df[df["Age"]>18]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [81]:
# filtering by several conditions:
# put each condition inside () round brackets
# combine them with & (meaning "and") or | (meaning "or")
# everyone that was over 18 AND survived 
df[ (df["Age"]>18) & (df["Survived"]==1) ]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
...,...,...,...,...,...,...,...,...,...,...,...,...
874,875,1,2,"Abelson, Mrs. Samuel (Hannah Wizosky)",female,28.0,1,0,P/PP 3381,24.0000,,C
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
880,881,1,2,"Shelley, Mrs. William (Imanita Parrish Hall)",female,25.0,0,1,230433,26.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S


In [82]:
# everyone that is male and under 25
df[ (df["Sex"]=="male") & (df["Age"]<25) ]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
12,13,0,3,"Saundercock, Mr. William Henry",male,20.0,0,0,A/5. 2151,8.0500,,S
16,17,0,3,"Rice, Master. Eugene",male,2.0,4,1,382652,29.1250,,Q
27,28,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0000,C23 C25 C27,S
...,...,...,...,...,...,...,...,...,...,...,...,...
861,862,0,2,"Giles, Mr. Frederick Edward",male,21.0,1,0,28134,11.5000,,S
864,865,0,2,"Gill, Mr. John William",male,24.0,0,0,233866,13.0000,,S
869,870,1,3,"Johnson, Master. Harold Theodor",male,4.0,1,1,347742,11.1333,,S
876,877,0,3,"Gustafsson, Mr. Alfred Ossian",male,20.0,0,0,7534,9.8458,,S
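The examples above only use `&`. A minimal sketch of `|` ("or") on a made-up four-row table:

```python
import pandas as pd

# made-up data
small = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Age": [22.0, 38.0, 2.0, 14.0],
})

# rows that are female OR under 18 (or both)
female_or_minor = small[(small["Sex"] == "female") | (small["Age"] < 18)]
print(list(female_or_minor.index))  # [1, 2, 3]
```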


# Filtering out missing values (NAs) with boolean conditions

* `NaN`: "not a number" (in maths and in pandas)
* `na` (in pandas): "not available" (includes both `NaN` and `None`)
* pandas functions: `isna()` and `notna()`
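Why special functions instead of `==`? Because `NaN` is not equal to anything, not even to itself. A minimal sketch with made-up values:

```python
import pandas as pd

nan = float("nan")
nan_equals_itself = (nan == nan)
print(nan_equals_itself)  # False
print(pd.isna(nan))       # True
print(pd.isna(None))      # True

cabins = pd.Series(["C85", None, nan, "E46"])  # made-up values
print(cabins.isna().tolist())   # [False, True, True, False]
print(cabins.notna().tolist())  # [True, False, False, True]
```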

In [83]:
df.head()
# in rows with index 0,2,4 the Cabin data is missing

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [84]:
df["Cabin"].isna()
# returns for each row whether data is NOT AVAILABLE (True) or available (False)

0       True
1      False
2       True
3      False
4       True
       ...  
886     True
887    False
888     True
889    False
890     True
Name: Cabin, Length: 891, dtype: bool

In [85]:
df["Cabin"].notna()
# returns for each row whether data is AVAILABLE (True) or not available (False)

0      False
1       True
2      False
3       True
4      False
       ...  
886    False
887     True
888    False
889     True
890    False
Name: Cabin, Length: 891, dtype: bool

In [86]:
# boolean indexing to only have rows where we know the Cabin value:
my_condition = df["Cabin"].notna()
df[my_condition]
# or, shorter: df[df["Cabin"].notna()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
...,...,...,...,...,...,...,...,...,...,...,...,...
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
872,873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0000,B51 B53 B55,S
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S


# Changing values in a cell

#### `df.loc[rowlabel, columnlabel] = new_value`

In [87]:
# this will change the value in the cell of the first row, column "Name"
df.loc[0, "Name"] = "S.O.S" # assigning CHANGES the object!
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,S.O.S,male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [88]:
# this will change the value in the cells of the first 3 rows, column "Name"
df.loc[[0,1,2], "Name"] = "Not Me Please!" # assigning CHANGES the object!
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,Not Me Please!,male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,Not Me Please!,female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,Not Me Please!,female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


# Creating a copy of the dataframe

If you want to manipulate a subset of the dataframe, ALWAYS use `.copy()`

In [89]:
# until now, we have only been DISPLAYING filtered subsets of the data set:
df[df["Age"]>18] # this returns a new, filtered object derived from df,
# NOT df itself; pandas cannot tell whether changes to it should affect df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,Not Me Please!,male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,Not Me Please!,female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,Not Me Please!,female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [90]:
# don't save the filtered subset into a variable without .copy()!!
df_adults = df[df["Age"]>18] # pandas cannot tell whether this is a view or a copy of df
# so when we assign to it, pandas warns us:
df_adults.loc[5,"Name"] = "Rosie"
# SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_adults.loc[5,"Name"] = "Rosie"


In [None]:
# Save a COPY of the filtered subset into a new variable:
df_adults = df[df["Age"]>18].copy() # this gives us a COPY of the dataframe
# now pandas is not confused anymore:
df_adults.loc[0,"Name"] = "Rosie"
df_adults.head()

### When working with subsets of a dataframe, use `.copy()`
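A minimal sketch, on a made-up three-row table, showing that changes to the copy do not touch the original dataframe:

```python
import pandas as pd

# made-up data
small = pd.DataFrame({"Name": ["Anna", "Ben", "Cleo"], "Age": [30.0, 12.0, 45.0]})

adults = small[small["Age"] > 18].copy()  # independent copy of the filtered rows
adults.loc[0, "Name"] = "Rosie"           # row label 0 exists in the subset

print(adults["Name"].tolist())  # ['Rosie', 'Cleo']
print(small["Name"].tolist())   # ['Anna', 'Ben', 'Cleo']  (original unchanged)
```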

# Try it out yourself!

* read in the data one more time (we messed around with the old data frame) with `pd.read_csv()`
* filter by 2 conditions: `"Sex"=="female"` and `"Age">60` (correction: 60, not 70!)
* save a COPY of the filtered data set to the variable `old_ladies`
* how many old ladies were on the Titanic? (`len()`)
* how many of the old ladies were badass ladies that survived?
* what is the mean fare that the old ladies paid? (`.describe()`, or calculate it yourself dividing the sum of the "Fare" column by the length of the dataframe)


In [92]:
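# read in the data
# (the result is not assigned here, so df still holds the modified names;
# use df = pd.read_csv("data/titanic.csv") to really start fresh)
pd.read_csv("data/titanic.csv")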


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [100]:
# filter by 2 conditions and save a COPY
df_old_women = df[ (df["Sex"]=="female") & (df["Age"]>60) ].copy()
df_old_women


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
275,276,1,1,"Andrews, Miss. Kornelia Theodosia",female,63.0,1,0,13502,77.9583,D7,S
483,484,1,3,"Turkula, Mrs. (Hedwig)",female,63.0,0,0,4134,9.5875,,S
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


In [101]:
# number of old ladies:
len(df_old_women)

3

In [111]:
# number of survived ladies:
# (len() of the condition itself would count ALL rows, so filter first)
len(df_old_women[df_old_women["Survived"]==1])

3

In [112]:
# compute the mean fare with .describe()...
df_old_women.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,3.0,3.0,3.0,3.0,3.0,3.0,3.0
mean,530.0,1.0,1.666667,62.666667,0.333333,0.0,55.8486
std,279.84996,0.0,1.154701,0.57735,0.57735,0.0,40.076292
min,276.0,1.0,1.0,62.0,0.0,0.0,9.5875
25%,380.0,1.0,1.0,62.5,0.0,0.0,43.7729
50%,484.0,1.0,1.0,63.0,0.0,0.0,77.9583
75%,657.0,1.0,2.0,63.0,0.5,0.0,78.97915
max,830.0,1.0,3.0,63.0,1.0,0.0,80.0


In [113]:
# ...or compute the mean fare yourself
sum(df_old_women["Fare"])/len(df_old_women)

55.8486
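
A minimal sketch of the same filter-and-count steps on a tiny hand-made data frame. The rows and the names `toy`, `old_women` and `survived_mask` are invented for illustration and are not part of the Titanic data. It shows that a comparison such as `old_women["Survived"] == 1` returns one True/False value per row, so its `len()` is always the number of rows, while `.sum()` counts only the True values.

```python
import pandas as pd

# Invented toy data, only for illustration (not the Titanic data set)
toy = pd.DataFrame({
    "Sex": ["female", "male", "female", "female"],
    "Age": [65.0, 70.0, 30.0, 61.0],
    "Survived": [1, 0, 1, 0],
    "Fare": [10.0, 20.0, 30.0, 50.0],
})

# combine two conditions with & and wrap each condition in parentheses
old_women = toy[(toy["Sex"] == "female") & (toy["Age"] > 60)].copy()

n_old_women = len(old_women)                 # number of rows left after filtering
survived_mask = old_women["Survived"] == 1   # boolean Series, one entry per row
n_survived = survived_mask.sum()             # True counts as 1, so this counts survivors
mean_fare = old_women["Fare"].mean()         # equals sum / len here (no missing fares)

print(n_old_women, len(survived_mask), n_survived, mean_fare)
# prints: 2 2 1 30.0
```

Note that `.mean()` skips missing values by default, so it can differ from sum divided by length when a column contains NaN.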

# Adding new columns to the dataframe

In [114]:
# adding a new column
df["had_a_bad_trip"] = True # adds a new column, with the same value in ALL rows
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,had_a_bad_trip
0,1,0,3,Not Me Please!,male,22.0,1,0,A/5 21171,7.25,,S,True
1,2,1,1,Not Me Please!,female,38.0,1,0,PC 17599,71.2833,C85,C,True
2,3,1,3,Not Me Please!,female,26.0,0,0,STON/O2. 3101282,7.925,,S,True
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,True
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,True


In [115]:
# adding a new column based on another column
df["Fare_DKK"] = df["Fare"] * 900 # multiply the value in each row of "Fare" by 900
df["Fare_EUR"] = df["Fare"] * 120 # multiply the value in each row of "Fare" by 120
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,had_a_bad_trip,Fare_DKK,Fare_EUR
0,1,0,3,Not Me Please!,male,22.0,1,0,A/5 21171,7.25,,S,True,6525.0,870.0
1,2,1,1,Not Me Please!,female,38.0,1,0,PC 17599,71.2833,C85,C,True,64154.97,8553.996
2,3,1,3,Not Me Please!,female,26.0,0,0,STON/O2. 3101282,7.925,,S,True,7132.5,951.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,True,47790.0,6372.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,True,7245.0,966.0
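
New columns can also be built from a comparison or from several existing columns at once. The sketch below uses a small invented data frame. The column names `is_expensive` and `relatives` and the threshold of 100 are assumptions made for illustration, not something used in the lecture.

```python
import pandas as pd

# Invented toy data with the same column names as in the Titanic data set
toy = pd.DataFrame({
    "Fare": [7.25, 8.05, 263.0],
    "SibSp": [1, 0, 3],
    "Parch": [0, 0, 2],
})

# a comparison gives one True/False value per row, stored as a boolean column
toy["is_expensive"] = toy["Fare"] > 100

# two columns are combined row by row
toy["relatives"] = toy["SibSp"] + toy["Parch"]

print(toy)
```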


# Sorting by value with `.sort_values()`

In [116]:
df.sort_values(by = "Age") # by default: ascending

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,had_a_bad_trip,Fare_DKK,Fare_EUR
803,804,1,3,"Thomas, Master. Assad Alexander",male,0.42,0,1,2625,8.5167,,C,True,7665.03,1022.004
755,756,1,2,"Hamalainen, Master. Viljo",male,0.67,1,1,250649,14.5000,,S,True,13050.00,1740.000
644,645,1,3,"Baclini, Miss. Eugenie",female,0.75,2,1,2666,19.2583,,C,True,17332.47,2310.996
469,470,1,3,"Baclini, Miss. Helene Barbara",female,0.75,2,1,2666,19.2583,,C,True,17332.47,2310.996
78,79,1,2,"Caldwell, Master. Alden Gates",male,0.83,0,2,248738,29.0000,,S,True,26100.00,3480.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
859,860,0,3,"Razi, Mr. Raihed",male,,0,0,2629,7.2292,,C,True,6506.28,867.504
863,864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.5500,,S,True,62595.00,8346.000
868,869,0,3,"van Melkebeke, Mr. Philemon",male,,0,0,345777,9.5000,,S,True,8550.00,1140.000
878,879,0,3,"Laleff, Mr. Kristo",male,,0,0,349217,7.8958,,S,True,7106.22,947.496


In [117]:
# to sort by descending values: set ascending=False 
df.sort_values(by = "Age", ascending=False) 

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,had_a_bad_trip,Fare_DKK,Fare_EUR
630,631,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0000,A23,S,True,27000.00,3600.000
851,852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.7750,,S,True,6997.50,933.000
493,494,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C,True,44553.78,5940.504
96,97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C,True,31188.78,4158.504
116,117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.7500,,Q,True,6975.00,930.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
859,860,0,3,"Razi, Mr. Raihed",male,,0,0,2629,7.2292,,C,True,6506.28,867.504
863,864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.5500,,S,True,62595.00,8346.000
868,869,0,3,"van Melkebeke, Mr. Philemon",male,,0,0,345777,9.5000,,S,True,8550.00,1140.000
878,879,0,3,"Laleff, Mr. Kristo",male,,0,0,349217,7.8958,,S,True,7106.22,947.496


In [None]:
# how to make your changes last (and not just affect the view):
# same as with .drop(), either overwrite the variable...
df = df.sort_values(by="Age", ascending = False)

In [118]:
# ... or set inplace=True
df.sort_values(by = "Fare", ascending=False, inplace=True)

In [119]:
# df is now sorted by descending Fare:
df.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,had_a_bad_trip,Fare_DKK,Fare_EUR
258,259,1,1,"Ward, Miss. Anna",female,35.0,0,0,PC 17755,512.3292,,C,True,461096.28,61479.504
737,738,1,1,"Lesurer, Mr. Gustave J",male,35.0,0,0,PC 17755,512.3292,B101,C,True,461096.28,61479.504
679,680,1,1,"Cardeza, Mr. Thomas Drake Martinez",male,36.0,0,1,PC 17755,512.3292,B51 B53 B55,C,True,461096.28,61479.504
88,89,1,1,"Fortune, Miss. Mabel Helen",female,23.0,3,2,19950,263.0,C23 C25 C27,S,True,236700.0,31560.0
27,28,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0,C23 C25 C27,S,True,236700.0,31560.0
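
In both "Age" outputs above, the rows with a missing age end up at the bottom, whether sorting is ascending or descending. This is the default behaviour of `.sort_values()`, controlled by its `na_position` parameter. The sketch below, on an invented toy data frame, shows this together with sorting by more than one column.

```python
import pandas as pd

# Invented toy data; passenger "B" has a missing age
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [30.0, float("nan"), 5.0, 30.0],
    "Fare": [10.0, 20.0, 30.0, 40.0],
})

# without reassigning or inplace=True, toy itself stays unchanged
by_age = toy.sort_values(by="Age")
by_age_desc = toy.sort_values(by="Age", ascending=False)

# missing values go last by default; na_position="first" moves them to the top
nan_first = toy.sort_values(by="Age", na_position="first")

# several columns: Age ascending, ties broken by Fare descending
multi = toy.sort_values(by=["Age", "Fare"], ascending=[True, False])

print(by_age_desc["Name"].iloc[-1])   # prints: B
print(nan_first["Name"].iloc[0])      # prints: B
print(multi["Name"].tolist())         # prints: ['C', 'D', 'A', 'B']
```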


# Save to csv with `.to_csv()`

In [None]:
# writing to a csv is as easy as reading in from a csv:
df.to_csv("data/mydf.csv")
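
By default, `.to_csv()` also writes the index as a first column without a name. When such a file is read back in, that column shows up as "Unnamed: 0", which is likely where the "Unnamed: 0" column in our outputs comes from. A minimal sketch with an invented toy data frame, writing to an in-memory text buffer (`io.StringIO`) instead of a real file so that nothing is saved to disk:

```python
import io
import pandas as pd

# Invented toy data, only for illustration
toy = pd.DataFrame({"Name": ["A", "B"], "Fare": [10.0, 20.0]})

# default: the index is written as an extra, unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)  # jump back to the start of the buffer before reading
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # prints: ['Unnamed: 0', 'Name', 'Fare']
print(cols_without_index)  # prints: ['Name', 'Fare']
```

With a real file, the equivalent call would be `df.to_csv("data/mydf.csv", index=False)`.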

# Recap of today: pandas!

Methods/functions for data frames (df):
* `pd.read_csv(), .to_csv()`
* `.head(), .tail(), .describe()`
* `.drop(), .sort_values(), .copy()`

Methods we've used for separate df columns (and also for entire df): `.astype(), .isna(), .notna()`

**Attributes** of data frames: `.index, .columns, .dtypes`

**Indexing:** `[]`, `.loc[]`

# Resources for more pandas

YouTube video course: [Pandas for Beginners](https://www.youtube.com/playlist?list=PLUaB-1hjhk8GZOuylZqLz-Qt9RIdZZMBE) by Alex the Analyst

Stefanie Molin & Ken Jee: [Hands-On Data Analysis with Pandas](https://ebookcentral.proquest.com/lib/itucopenhagen/reader.action?docID=6579305) 

Hannah Stepanek: [Thinking in pandas](https://link-springer-com.ep.ituproxy.kb.dk/book/10.1007/978-1-4842-5839-2) (see also our "self-study resources" course page)

> ITU has free online access to many resources, log in with your credentials at [kb.dk](kb.dk)