# Extracting a single value

## By square bracket notation

We can extract a single value by using the square bracket notation twice.  For example, I can get the value at index 11,000 from the rainfall amount column like this.  This is a simple consequence of the fact that square bracket notation works on both data frames _and_ series.  The left-most one is working on a data frame and returning a series; the second one is working on that series and returning a single value.


In [2]:
import pandas as pd

wentworth = pd.read_csv("data/rainfall/IDCJAC0009_047045_1800_Data.csv")  # This gives us a dataframe
display(wentworth)
wentworth["Rainfall amount (millimetres)"][11000]
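# Reading the line above from left to right:
#   wentworth                            -> the dataframe
#   ["Rainfall amount (millimetres)"]    -> extracting a series from the data frame
#   [11000]                              -> extracting the value at index 11,000 from the series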

Unnamed: 0,Product code,Bureau of Meteorology station number,Year,Month,Day,Rainfall amount (millimetres),Period over which rainfall was measured (days),Quality
0,IDCJAC0009,47045,1933,1,1,,,
1,IDCJAC0009,47045,1933,1,2,,,
2,IDCJAC0009,47045,1933,1,3,,,
3,IDCJAC0009,47045,1933,1,4,,,
4,IDCJAC0009,47045,1933,1,5,,,
...,...,...,...,...,...,...,...,...
32288,IDCJAC0009,47045,2021,5,27,0.0,,N
32289,IDCJAC0009,47045,2021,5,28,0.0,,N
32290,IDCJAC0009,47045,2021,5,29,0.0,,N
32291,IDCJAC0009,47045,2021,5,30,0.0,,N


0        NaN
1        NaN
2        NaN
3        NaN
4        NaN
        ... 
32288    0.0
32289    0.0
32290    0.0
32291    0.0
32292    0.0
Name: Rainfall amount (millimetres), Length: 32293, dtype: float64
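To make the two steps visible, here is a minimal sketch that splits the double bracket into two separate lines.  The tiny `frame` below is invented for illustration (it is not the Wentworth file), and the names `column` and `value` are just labels chosen for this sketch.

```python
import pandas as pd

# Hypothetical miniature of the rainfall table (values invented for illustration)
frame = pd.DataFrame({
    "Day": [1, 2, 3, 4],
    "Rainfall amount (millimetres)": [0.0, 12.4, 3.2, 0.0],
})

column = frame["Rainfall amount (millimetres)"]  # first bracket: DataFrame -> Series
value = column[2]                                # second bracket: Series -> single value

print(type(column).__name__)
print(value)
```

Writing `frame["Rainfall amount (millimetres)"][2]` on one line does exactly the same thing, it just never gives the intermediate series a name.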


## By Summarising

Pandas provides some "magic" when it comes to summarising columns.  Series have a set of "methods" attached to them that you can call any time you like to get summaries.  Note that these summaries work on Series, so you should extract the column you want as a Series first.  Examples are:
  * add up all elements (`sum`)
  * calculate the average (`mean`) or mode (`mode`)
  * find the largest (`max`) or smallest (`min`).

In [8]:
wentworth["Rainfall amount (millimetres)"].sum()
# Exercise, try out mean, mode, min, and max


# 2 things to notice
# What data types do these methods give back?
# What does each thing mean? 

# print(wentworth["Rainfall amount (millimetres)"].mean())
print(wentworth["Rainfall amount (millimetres)"].mode())  # What does mode mean? 
# print(wentworth["Rainfall amount (millimetres)"].min())
# print(wentworth["Rainfall amount (millimetres)"].max())


values = pd.Series([10, 10, 10, 20, 20, 20, 30, 30, 30])
print(values.mode())
print(values.mode().mean())
# values.mode()         -> Series of the most common values
# values.mode().mean()  -> the average of the most common values

0    0.0
Name: Rainfall amount (millimetres), dtype: float64
0    10
1    20
2    30
dtype: int64
20.0
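The question about data types can be checked directly.  The sketch below uses a small invented series (not the rainfall data) to show that `sum` and `mean` hand back a single number, while `mode` hands back a Series because several values can tie for "most common".

```python
import pandas as pd

values = pd.Series([10, 10, 20, 20, 30])  # invented data with two tied modes

total = values.sum()     # a single number
average = values.mean()  # a single number
modes = values.mode()    # a Series, because several values can tie

print(total, average)
print(list(modes))
```

Because `mode` returns a Series, you can keep chaining Series methods onto it, or pull one value out with square brackets such as `modes[0]`.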


# Example

What is the largest rainfall day for Richmond RAAF base (which is in the file `data/rainfall/IDCJAC0009_067105_1800_Data.csv`)?

Which of our rainfall files has the highest average rainfall?

In [21]:


data = pd.read_csv("data/rainfall/IDCJAC0009_067105_1800_Data.csv")
print(data['Rainfall amount (millimetres)'].max())

data = pd.read_csv("data/rainfall/IDCJAC0009_049092_1800_Data.csv")
print(data['Rainfall amount (millimetres)'].max())

data = pd.read_csv("data/rainfall/IDCJAC0009_066128_1800_Data.csv")
print(data['Rainfall amount (millimetres)'].max())


# 1. What is the highest amount of rainfall in the Richmond data set?  -> number
# 2. What is the highest average rainfall across ALL files?
# 3. On what day did that rainfall amount occur? -> day, month, year

# Called `files` rather than `list` so that Python's built-in `list` is not overwritten
files = [
    "data/rainfall/IDCJAC0009_067105_1800_Data.csv",
    "data/rainfall/IDCJAC0009_049092_1800_Data.csv",
    "data/rainfall/IDCJAC0009_066128_1800_Data.csv"
]
for file in files:
    data = pd.read_csv(file)
    print(data['Rainfall amount (millimetres)'].max())

# Another way
stations = pd.read_csv("data/rainfall/stations.csv")
station_numbers = stations['station number']
for number in station_numbers:
    data = pd.read_csv("data/rainfall/IDCJAC0009_0" + str(number) + "_1800_Data.csv")
    print(data['Rainfall amount (millimetres)'].max())


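richmond = pd.read_csv("data/rainfall/IDCJAC0009_067105_1800_Data.csv")
index_of_highest_rainfall = richmond['Rainfall amount (millimetres)'].idxmax()
print(index_of_highest_rainfall)
# Look the date up in `richmond`, not `data`: by this point `data` holds the
# last station file read in the loop above, so it would give that station's date.
print(richmond['Year'][index_of_highest_rainfall])
print(richmond['Month'][index_of_highest_rainfall])
print(richmond['Day'][index_of_highest_rainfall])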

126.4
91.0
191.2
126.4
91.0
191.2
113.0
91.0
222.0
101.4
166.8
84.0
122.4
243.2
191.2
126.4
86.0
10288
2003
3
3
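The cell above prints the maximum for each file, but the second question asks about the highest _average_.  One possible way to answer it is sketched below.  To keep the sketch runnable on its own, the dictionary `frames` stands in for the data frames that `pd.read_csv` would return; the station names and rainfall values in it are invented.

```python
import pandas as pd

COLUMN = "Rainfall amount (millimetres)"

# Stand-ins for the frames pd.read_csv would return (names and values are invented)
frames = {
    "station_a": pd.DataFrame({COLUMN: [0.0, 2.0, 4.0]}),
    "station_b": pd.DataFrame({COLUMN: [1.0, 5.0, 9.0]}),
    "station_c": pd.DataFrame({COLUMN: [0.0, 0.0, 3.0]}),
}

best_name = None
best_mean = None
for name, frame in frames.items():
    mean = frame[COLUMN].mean()  # average rainfall for this station
    if best_mean is None or mean > best_mean:
        best_name, best_mean = name, mean  # remember the best one seen so far

print(best_name, best_mean)
```

With the real files, the loop body would read each CSV first and then do the same comparison.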


# Exercise

What is the total rainfall recorded for Meriwagga (rainfall file 075167)?  What are the maximum and minimum rainfall amounts on any one day?  I am sure you can guess the minimum, but what code will give it to you?

In [24]:
meriwagga = pd.read_csv('data/rainfall/IDCJAC0009_075167_1800_Data.csv')
display(meriwagga)
total_rainfall = meriwagga['Rainfall amount (millimetres)'].sum()
max_rainfall = meriwagga['Rainfall amount (millimetres)'].max()
min_rainfall = meriwagga['Rainfall amount (millimetres)'].min()
print(total_rainfall, max_rainfall, min_rainfall)

Unnamed: 0,Product code,Bureau of Meteorology station number,Year,Month,Day,Rainfall amount (millimetres),Period over which rainfall was measured (days),Quality
0,IDCJAC0009,75167,1975,1,1,,,
1,IDCJAC0009,75167,1975,1,2,,,
2,IDCJAC0009,75167,1975,1,3,,,
3,IDCJAC0009,75167,1975,1,4,,,
4,IDCJAC0009,75167,1975,1,5,,,
...,...,...,...,...,...,...,...,...
17162,IDCJAC0009,75167,2021,12,27,0.0,,Y
17163,IDCJAC0009,75167,2021,12,28,0.0,,Y
17164,IDCJAC0009,75167,2021,12,29,0.0,,Y
17165,IDCJAC0009,75167,2021,12,30,0.0,,Y


15648.6 86.0 0.0
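Notice that the file contains missing values (`NaN`) and the summaries still came back as ordinary numbers.  That is because these methods skip missing values by default.  The sketch below shows this on a small invented series; `skipna` is a real parameter of these pandas methods, while the data is made up.

```python
import pandas as pd

# Invented series with two missing days
rain = pd.Series([float("nan"), 0.0, 5.5, float("nan"), 2.5])

print(rain.sum())              # missing values are skipped
print(rain.min())              # missing values are skipped
print(rain.count())            # number of non-missing values
print(rain.sum(skipna=False))  # ask pandas NOT to skip them
```

This matters when interpreting an average: it is the mean over the days that have a recorded value, not over every row in the file.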


## By `loc` and `iloc`

We've seen how to recover a Series from a DataFrame - i.e. how to extract a column.

Let's see how to extract a row.

It is important to realise that, since DataFrames are built from Series, it is somewhat awkward to pull out a single row.  In effect, we are asking for pandas to visit each Series and grab the value at a particular index.

Instead of doing this though, we will use the `loc` functionality of pandas.

`loc` and `iloc` are indexers that can get columns _or rows_; they are used with square brackets rather than called like functions.  `loc` goes by column name when getting columns and by index label when getting rows.  `iloc` goes by the position of the column when getting columns and the position of the row when getting rows, both counted from zero.

`loc` and `iloc` actually take two arguments so they can look up both axes at once.
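The difference between labels and positions is easiest to see when the row labels are not numbers.  The following sketch builds a small frame with letters as the index (all values invented for illustration) and fetches the same cell both ways.  It also shows one difference that is easy to trip over: a `loc` slice includes its end label, while an `iloc` slice stops before its end position, like ordinary Python slicing.

```python
import pandas as pd

frame = pd.DataFrame(
    {
        "second column": ["hello", "my", "name", "is", "michael"],
        "third column": ["number one", "two", "three", "4", "five"],
    },
    index=["a", "b", "c", "d", "e"],  # row labels are letters, not numbers
)

by_label = frame.loc["d", "second column"]  # row label "d", column name
by_position = frame.iloc[3, 0]              # 4th row, 1st column (counting from 0)
print(by_label, by_position)

rows_by_label = frame.loc["b":"d"]  # end label is included
rows_by_position = frame.iloc[1:3]  # end position is excluded
print(len(rows_by_label), len(rows_by_position))
```

In the rainfall files the row labels happen to be the numbers 0, 1, 2, ..., so for rows `loc` and `iloc` look the same there; they only differ in how the columns are named.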

In [55]:
# wentworth.loc[1110, "Rainfall amount (millimetres)"]


# loc and iloc are used to get values by ROW

# display(wentworth)
# loc works using LABELS
output = wentworth.loc[3, "Day"]  
# display(output)

# iloc works using integer POSITIONS (counted from 0)
output = wentworth.iloc[3, 4]
# display(output)

# output = wentworth.iloc[100:150, 2:5]
display(wentworth)
output = wentworth.loc[100:150, ["Year", "Day"]]
display(output)

# myFrame = pd.DataFrame()
# myFrame["first column"] = ["a", "b", "c", "d", "e"]
# myFrame["second column"] = ["hello", "my", "name", "is", "michael"]
# myFrame["third column"] = ["number one", "two", "three", "4", "five"]
# myFrame.set_index('first column', inplace=True)
# display(myFrame)

# output = myFrame.loc["d", "second column"]


# How to loop through column headers
# print(wentworth.columns)
"""
idx = 0
for column in wentworth.columns:
    print(idx, column)
    idx += 1

for idx, value in enumerate(wentworth.columns):
    print(idx, value)
"""

Unnamed: 0,Product code,Bureau of Meteorology station number,Year,Month,Day,Rainfall amount (millimetres),Period over which rainfall was measured (days),Quality
0,IDCJAC0009,47045,1933,1,1,,,
1,IDCJAC0009,47045,1933,1,2,,,
2,IDCJAC0009,47045,1933,1,3,,,
3,IDCJAC0009,47045,1933,1,4,,,
4,IDCJAC0009,47045,1933,1,5,,,
...,...,...,...,...,...,...,...,...
32288,IDCJAC0009,47045,2021,5,27,0.0,,N
32289,IDCJAC0009,47045,2021,5,28,0.0,,N
32290,IDCJAC0009,47045,2021,5,29,0.0,,N
32291,IDCJAC0009,47045,2021,5,30,0.0,,N


Unnamed: 0,Year,Day
100,1933,11
101,1933,12
102,1933,13
103,1933,14
104,1933,15
105,1933,16
106,1933,17
107,1933,18
108,1933,19
109,1933,20


'\nidx = 0\nfor column in wentworth.columns:\n    print(idx, column)\n    idx += 1\n\nfor idx, value in enumerate(wentworth.columns):\n    print(idx, value)\n'

As the cell above shows, `loc` and `iloc` do the lookup _row first_.  This means that if we give only one value, they will look up by row and give you back a series for that row.  It looks like the table was "flipped", but that is not really what happens: the column names simply become the index of the returned series.

In [21]:
wentworth.loc[1110]

Product code                                      IDCJAC0009
Bureau of Meteorology station number                   47045
Year                                                    1936
Month                                                      1
Day                                                       16
Rainfall amount (millimetres)                            0.0
Period over which rainfall was measured (days)           NaN
Quality                                                    Y
Name: 1110, dtype: object
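To see why the row looks "flipped", it can help to inspect the pieces of the returned series.  This is a small sketch on an invented two-row frame, not the Wentworth data.

```python
import pandas as pd

# Invented rows in a layout similar to the rainfall files
frame = pd.DataFrame({
    "Year": [1936, 1936],
    "Month": [1, 1],
    "Day": [15, 16],
    "Quality": ["Y", "Y"],
})

row = frame.loc[1]          # one label only -> look up by row
print(type(row).__name__)   # the row comes back as a Series
print(row.name)             # the row's label becomes the Series name
print(list(row.index))      # the column names become the Series index
print(row["Day"])           # so a second bracket picks one value from the row
```

This is also why `frame.loc[1]["Day"]` and `frame.loc[1, "Day"]` give the same answer.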

# Example

What was the rainfall on 1 May 2019 at Richmond RAAF base?
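One possible approach is sketched below.  It assumes we find the row by comparing the `Year`, `Month` and `Day` columns, which produces a series of `True`/`False` values that `loc` accepts in place of row labels.  The frame here is invented so the sketch runs on its own; the rainfall figure it prints is not the real answer for Richmond.

```python
import pandas as pd

COLUMN = "Rainfall amount (millimetres)"

# Invented rows in the same layout as the rainfall files
richmond = pd.DataFrame({
    "Year": [2019, 2019, 2019],
    "Month": [4, 5, 5],
    "Day": [30, 1, 2],
    COLUMN: [0.0, 1.2, 0.4],
})

# True for the rows that match the date, False everywhere else
is_day = (richmond["Year"] == 2019) & (richmond["Month"] == 5) & (richmond["Day"] == 1)

matches = richmond.loc[is_day, COLUMN]  # rows where is_day is True, one column
rainfall = matches.iloc[0]              # first (and here only) matching value
print(rainfall)
```

With the real file, replace the invented frame with the result of `pd.read_csv` on the Richmond CSV and keep the rest the same.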

# Exercise

What is the title of the 6th row in the `workouts.csv` file?

In [59]:
workouts = pd.read_csv('data/workouts.csv')
print(workouts.loc[5, "Title"])
print(workouts.loc[5]["Title"])

Cycling
Cycling


# Using `loc`/`iloc` for everything?

Many pandas programmers just use `loc` and `iloc` for everything, but I will not.  Using them "hides" the underlying workings of pandas, and since we are here to learn, that doesn't suit us.  We will use them when we need to, but stick to square bracket notation as much as possible.  If you post a question on Stack Overflow you will probably get a `loc`/`iloc` based answer though, so we want to make sure you really know how they work.