In [None]:
# We will load all required modules first.
try:
    from dstaster import *
except ImportError:
    # dstaster is not available locally, so download it from the course repository.
    !pip install wget

    import wget
    url = 'https://raw.githubusercontent.com/BBKdatasciencetaster/DS/main/dstaster.py'
    wget.download(url)
    from dstaster import *
    

<h2>Loading the dataset with pandas</h2>

In this notebook we will use the pandas library to load the data from a comma-separated file. The file itself stores our data in a very simple table-like format:

<pre style="font-size:10pt">
,artist,title,year,groundtruth,height,width
T13896,John Constable,Salisbury Cathedral from the Meadows,1831,L,1537,1920
T05010,Pablo Picasso,Weeping Woman,1937,O,608,500
N05915,Pablo Picasso,Bust of a Woman,1909,P,727,600
N00530,Joseph Mallord William Turner,Snow Storm - Steam-Boat off a Harbour’s Mouth,1842,L,914,1219
T00598,Richard Dadd,The Fairy Feller’s Master-Stroke,1855,O,540,394
...
</pre>

The first row contains the column names; every following line contains one record from the dataset. Now we need to load this data into the notebook to work with it. The following code cell shows how: it loads the dataset and stores it in the variable `collection`. The argument `index_col=0` tells pandas to use the first, unnamed column (the identifiers such as `T13896`) as the row index. The data is stored in a so-called `DataFrame`, which is provided by the `pandas` library (abbreviated in the code as `pd`).

The second line of the cell passes the DataFrame to Jupyter's `display(...)` function, which shows a pretty-printed excerpt of the dataset. If you run the cell, you should see a table with six columns (artist, title, year, groundtruth, height and width).

In [None]:
collection = pd.read_csv("https://raw.githubusercontent.com/BBKdatasciencetaster/DS/main/data/paintings.csv", index_col=0)
display(collection)

Unnamed: 0,artist,title,year,groundtruth,height,width
T13896,John Constable,Salisbury Cathedral from the Meadows,1831,L,1537,1920
T05010,Pablo Picasso,Weeping Woman,1937,O,608,500
N05915,Pablo Picasso,Bust of a Woman,1909,P,727,600
N00530,Joseph Mallord William Turner,Snow Storm - Steam-Boat off a Harbour’s Mouth,1842,L,914,1219
T00598,Richard Dadd,The Fairy Feller’s Master-Stroke,1855,O,540,394
...,...,...,...,...,...,...
N05609,Maurice Sterne,Mexican Church Interior,1934,O,1283,1022
T14823,Unknown artist,Leon Trotsky,1980,P,510,480
AL00397,Louise Bourgeois,Untitled,1946,O,660,1116
T14824,Unknown artist,Leon Trotsky,1980,P,638,511
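To see what `index_col=0` does without downloading anything, the following minimal sketch reads a few rows in the same format from an in-memory string. The rows are copied from the excerpt above; the names `csv_text` and `sample` are illustrative and not part of the course material.

```python
import io

import pandas as pd

# Three rows copied from the excerpt above; the header starts with an
# empty field because the first column (the identifiers) has no name.
csv_text = """,artist,title,year,groundtruth,height,width
T13896,John Constable,Salisbury Cathedral from the Meadows,1831,L,1537,1920
T05010,Pablo Picasso,Weeping Woman,1937,O,608,500
N05915,Pablo Picasso,Bust of a Woman,1909,P,727,600
"""

# io.StringIO lets read_csv treat the string like a file.
# index_col=0 turns the first column into the row index.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(sample.shape)           # (3, 6): three rows, six data columns
print(list(sample.index))     # the identifiers are now row labels
print(sample.loc["T05010", "title"])  # look up a value by row label and column
```

Because the identifiers become the index, they are not counted among the six data columns.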


Pandas DataFrames are powerful data structures that come with a lot of useful functionality. For example, we can ask the DataFrame to compute common statistics for all numerical columns. When we call `collection.describe()`, we receive a new DataFrame containing a summary of `collection`. Note that we can omit the call to `display`: Jupyter automatically displays the value of the last expression in a cell.

In [None]:
collection.describe()

Unnamed: 0,year,height,width
count,2158.0,2158.0,2158.0
mean,1873.828082,960.444856,1026.646895
std,76.739168,529.841346,642.269151
min,1594.0,137.0,102.0
25%,1824.0,610.0,616.0
50%,1889.5,813.0,893.5
75%,1934.0,1219.0,1232.0
max,2017.0,4285.0,8915.0
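As a small illustration of how `describe()` behaves, the sketch below builds a tiny DataFrame by hand from three rows of the excerpt and summarises it. The names `small` and `summary` are made up for this example. Note that the text column is left out of the summary, just as `artist` and `title` are missing from the table above.

```python
import pandas as pd

# A hand-built DataFrame with one text column and three numerical columns.
small = pd.DataFrame(
    {
        "artist": ["John Constable", "Pablo Picasso", "Pablo Picasso"],
        "year": [1831, 1937, 1909],
        "height": [1537, 608, 727],
        "width": [1920, 500, 600],
    },
    index=["T13896", "T05010", "N05915"],
)

# describe() returns a new DataFrame: one column per numerical column,
# one row per statistic (count, mean, std, min, 25%, 50%, 75%, max).
summary = small.describe()

print(list(summary.columns))       # only the numerical columns remain
print(summary.loc["min", "year"])  # individual statistics can be looked up
print(summary.loc["max", "width"])
```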


<h2>Working with columns</h2>

Each column of the DataFrame can be accessed individually using the index brackets `[]`. For example, `collection['artist']` will give us the artist column, `collection['year']` the year column and so on.

<div class="note">Note: A single column of a DataFrame is a data structure called a Series, so its representation in the notebook looks slightly different.</div>

<div class="task">
    <div class="no">1</div>
    <div class="text">
        Change the index string in the following cell to
        values other than <code>'artist'</code> and observe how 
        the output changes.
    </div>
</div>

In [None]:
collection['artist']

T13896                    John Constable
T05010                     Pablo Picasso
N05915                     Pablo Picasso
N00530     Joseph Mallord William Turner
T00598                      Richard Dadd
                       ...              
N05609                    Maurice Sterne
T14823                    Unknown artist
AL00397                 Louise Bourgeois
T14824                    Unknown artist
T14825                    Unknown artist
Name: artist, Length: 2158, dtype: object
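The following sketch shows, on a hand-built three-row DataFrame, that selecting a column with `[]` returns a Series that keeps the row labels of the DataFrame. The names `small` and `artists` are illustrative only.

```python
import pandas as pd

small = pd.DataFrame(
    {
        "artist": ["John Constable", "Pablo Picasso", "Pablo Picasso"],
        "year": [1831, 1937, 1909],
    },
    index=["T13896", "T05010", "N05915"],
)

# Selecting one column gives a Series, not a DataFrame.
artists = small["artist"]

print(isinstance(artists, pd.Series))  # True
print(artists.name)                    # the Series remembers its column name
print(artists.loc["T05010"])           # row labels are shared with the DataFrame
```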

Pandas Series also come equipped with a number of methods that allow us to quickly compute various statistics. Let's say we want to find out which artist appears most often in our dataset. We can use the method `.value_counts()` to obtain a count for every unique entry in `collection['artist']`. The output is already sorted from high to low, so we can read off the most common artist at the top:

In [None]:
collection['artist'].value_counts()

Joseph Mallord William Turner    240
John Constable                    34
John Singer Sargent               32
Sir Joshua Reynolds               30
Thomas Gainsborough               25
                                ... 
Richard Cook                       1
Henry Inlander                     1
William Hodges                     1
Clive Gardiner                     1
Jasper Johns                       1
Name: artist, Length: 869, dtype: int64
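To make the behaviour of `.value_counts()` concrete, here is a minimal sketch on the five artist names from the excerpt at the top of the notebook. The variable names `artists` and `counts` are chosen for this example only.

```python
import pandas as pd

# The artist entries of the first five rows shown in the excerpt.
artists = pd.Series(
    [
        "John Constable",
        "Pablo Picasso",
        "Pablo Picasso",
        "Joseph Mallord William Turner",
        "Richard Dadd",
    ],
    name="artist",
)

# value_counts() returns a new Series: the unique entries become the
# index and the values are how often each entry occurs, highest first.
counts = artists.value_counts()

print(counts.index[0])  # Pablo Picasso
print(counts.iloc[0])   # 2
print(len(counts))      # 4 unique artists
```

The length of the result is the number of distinct entries, which is why the output above reports `Length: 869` rather than 2158.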

The <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html">complete list of available methods</a> to aggregate and modify pandas Series is quite long! A few methods we are interested in are `.sum()`, `.count()`, `.mean()`, `.min()`, `.median()`, and `.max()`.

<div class="task">
    <div class="no">2</div>
    <div class="text">
        Change <code>.sum()</code> in the following cell to each of the other methods 
        mentioned above and observe the resulting output.
    </div>
</div>

In [None]:
collection['width'].sum()

2215504
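If you want to check your understanding of the aggregation methods on numbers small enough to verify by hand, the sketch below applies them to the heights of the five paintings in the excerpt at the top of the notebook. The Series `heights` is built by hand for this example and is not part of `collection`.

```python
import pandas as pd

# Heights of the first five paintings shown in the excerpt.
heights = pd.Series([1537, 608, 727, 914, 540], name="height")

print(heights.sum())     # 4326
print(heights.count())   # 5
print(heights.mean())    # 865.2
print(heights.min())     # 540
print(heights.median())  # 727.0
print(heights.max())     # 1537
```

Each method reduces the whole Series to a single value, which is why the cell above prints one number rather than a table.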

<div class="task">
    <div class="no">3</div>
    <div class="text">
        We would like to know which range of years the collection covers. Use the cell above to find out
        the <b>earliest</b> and <b>latest</b> year for which the collection contains a painting. Return to FutureLearn to discuss your experience and your findings!
    </div>
</div>