# Assignment 2

For this assignment, you will work on the *SuperRare crypto art transactions* [dataset](https://www.kaggle.com/franceschet/superrare). Please carefully read its Kaggle documentation before getting started.

**For this assignment, you are welcome to propose an alternative dataset that interests you. It must be of comparable complexity, and you must still perform the tasks below on it. If in doubt, ask.**

**Please carefully read the assignment guidelines in Canvas. You are expected to work in groups and submit as a group.**

## Tasks

Consider the dataset at hand and: 
1. Perform an **exploratory data analysis** on *at least 2 of this dataset's variables (columns)*. For each, show its descriptive statistics, plot its distribution using an appropriate plot, and comment on your results. Furthermore, use a scatterplot to visualize two of these variables together.
2. Comment on whether the **distributions** of these variables look normal, long-tailed, or neither. Check for possible **outliers** and comment on your results.
3. Measure the **covariance and correlation** among these variables and comment on your results. Hint: check the pandas [`cov`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.cov.html) function.
4. Show the **trend of sales over time**, using timestamp information.
5. Check whether this dataset is robust to sampling, and if so, how many data points you would need to sample for the analyses you have just performed (steps 1-4) to give results similar to those on the full dataset. Briefly comment on your results.
6. Bonus: find out which are the **highest-value artworks** in terms of sale price.
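
Tasks 2, 3 and 5 can be prototyped on synthetic data before you touch the real tables. The sketch below is one possible approach, not the expected solution. The frame `toy` and its columns `price` and `size` are hypothetical stand-ins drawn from log-normal distributions; the 1.5 × IQR outlier rule and the sample sizes are illustrative choices, not requirements.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical long-tailed columns (NOT the SuperRare data).
price = rng.lognormal(mean=0.0, sigma=1.5, size=n)
size = price * rng.lognormal(mean=0.0, sigma=0.5, size=n)
toy = pd.DataFrame({"price": price, "size": size})

# Task 3: covariance, plus linear (Pearson) and rank (Spearman) correlation.
print(toy.cov())
print(toy.corr(method="pearson"))
print(toy.corr(method="spearman"))

# Task 2: flag outliers with the 1.5 * IQR rule (one common convention).
q1, q3 = toy["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = toy[(toy["price"] < q1 - 1.5 * iqr) | (toy["price"] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} of {n} rows flagged as outliers")

# Task 5: recompute the same statistics on random samples of growing size
# and compare them with the values from the full frame.
rows = []
for k in (100, 1_000, 5_000):
    sample = toy.sample(n=k, random_state=0)
    rows.append({
        "n": k,
        "median_price": sample["price"].median(),
        "spearman": sample.corr(method="spearman").loc["price", "size"],
    })
summary = pd.DataFrame(rows)
print(summary)
print("full median:", toy["price"].median())
```

With long-tailed data, the mean and the covariance are driven by a few extreme values, so they tend to stabilise more slowly under sampling than the median or a rank correlation. Repeating each sample size with several random seeds gives a better sense of the spread.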

**Please make sure to carefully explain and motivate your choices via markdown cells and Python comments, as appropriate.**

## Dataset

The contents of each data frame are documented on Kaggle. Please check the documentation carefully: https://www.kaggle.com/franceschet/superrare

In [1]:
import pandas as pd

In [3]:
df_sales = pd.read_csv('https://raw.githubusercontent.com/Giovanni1085/UvA_CSDA_2021/main/assignments/data/sales.csv')
df_artworks = pd.read_csv('https://raw.githubusercontent.com/Giovanni1085/UvA_CSDA_2021/main/assignments/data/tokens.csv')

In [4]:
df_sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13485 entries, 0 to 13484
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   timestamp      13485 non-null  object 
 1   tokenId        13485 non-null  int64  
 2   buyer          13485 non-null  object 
 3   seller         13485 non-null  object 
 4   eth            13485 non-null  float64
 5   rate           13480 non-null  float64
 6   usd            13480 non-null  float64
 7   contract       13485 non-null  object 
 8   transactionId  13485 non-null  object 
dtypes: float64(3), int64(1), object(5)
memory usage: 948.3+ KB


In [29]:
df_sales.head(3)

| | timestamp | tokenId | buyer | seller | eth | rate | usd | contract | transactionId |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-04-05T23:50:12Z | 1 | 0xbc74c3adc2aa6a85bda3eca5b0e235ca08532772 | 0x860c4604fe1125ea43f81e613e7afb2aa49546aa | 0.46 | 381.36 | 175.4256 | 0x41a322b28d0ff354040e2cbc676f0320d8c8850d | 0xf1097e3617632e43b7c0a46ffeb4d741d0a67b25fb06... |
| 1 | 2020-01-18T16:16:42Z | 1 | 0x54d7f921785ebe46010d83c73712e80dfaff1e81 | 0xbc74c3adc2aa6a85bda3eca5b0e235ca08532772 | 75.0 | 174.0 | 13050.0 | 0x41a322b28d0ff354040e2cbc676f0320d8c8850d | 0xf8d3b8be83601d0351c72d2093738a4a25c70b49503b... |
| 2 | 2021-01-05T00:47:24Z | 1 | 0xd0c0650cd08acd4e9553c48c60c94be04fecce43 | 0x54d7f921785ebe46010d83c73712e80dfaff1e81 | 100.0 | 1103.19 | 110319.0 | 0x41a322b28d0ff354040e2cbc676f0320d8c8850d | 0xc917fe7d09a750c09fd8f467d60e5adac4bbd3a5e5ea... |


In [31]:
df_artworks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18518 entries, 0 to 18517
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   tokenId        18518 non-null  int64  
 1   name           18517 non-null  object 
 2   description    18518 non-null  object 
 3   tags           18517 non-null  object 
 4   image          18518 non-null  object 
 5   media          9940 non-null   object 
 6   type           9940 non-null   object 
 7   size           9940 non-null   float64
 8   dimensions     9708 non-null   object 
 9   creator        18518 non-null  object 
 10  owner          18518 non-null  object 
 11  timestamp      16995 non-null  object 
 12  contract       18518 non-null  object 
 13  transactionId  16995 non-null  object 
dtypes: float64(1), int64(1), object(12)
memory usage: 2.0+ MB


In [30]:
df_artworks.head(3)

| | tokenId | name | description | tags | image | media | type | size | dimensions | creator | owner | timestamp | contract | transactionId |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | AI Generated Nude Portrait #1 | Robbie Barrat, AI Generated, 2018 | Nude Portrait AI Painting Portrait Generat... | https://ipfs.pixura.io/ipfs/QmX4ECCmA4HZSPxzsg... | NaN | NaN | NaN | NaN | 0x860c4604fe1125ea43f81e613e7afb2aa49546aa | 0xd0c0650cd08acd4e9553c48c60c94be04fecce43 | 2018-04-05T23:20:48Z | 0x41a322b28d0ff354040e2cbc676f0320d8c8850d | 0xf3e68d3a53b1bb3a2cdb4aa3a6c871626e6dcf7b8df1... |
| 1 | 2 | AI Generated Nude Portrait #2 | Robbie Barrat, AI Generated, 2018 | Nude Portrait AI Painting Portrait Generat... | https://ipfs.pixura.io/ipfs/QmRe3WvttmMR7mELga... | NaN | NaN | NaN | NaN | 0x860c4604fe1125ea43f81e613e7afb2aa49546aa | 0x6853a596d6d7264d3622546da3b891b6fe17eb82 | 2018-04-05T23:49:27Z | 0x41a322b28d0ff354040e2cbc676f0320d8c8850d | 0x8fb08cb45e1a0032dccd0951812dba7a8ebe5b255bdd... |
| 2 | 3 | AI Generated Nude Portrait #3 | Robbie Barrat, AI Generated, 2018 | Nude Portrait AI Painting Portrait Generat... | https://ipfs.pixura.io/ipfs/QmYCyvs9JwKTAChpri... | NaN | NaN | NaN | NaN | 0x860c4604fe1125ea43f81e613e7afb2aa49546aa | 0x8a0a834077a8ecea4983e2288f81afb2c6764116 | 2018-04-06T00:07:31Z | 0x41a322b28d0ff354040e2cbc676f0320d8c8850d | 0xdf2952f467fddc9f81f6beada8dc2bed1ae4e497c0d2... |


In `df_sales`, each row represents a sale transaction. In `df_artworks`, each row represents an artwork. You can join the tables using the `tokenId` variable.
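
As a minimal sketch of such a join, the example below merges two small hypothetical frames, `sales_toy` and `artworks_toy`, rather than the real tables. It assumes each `tokenId` appears at most once on the artworks side; `validate="many_to_one"` raises an error if that assumption fails, which is worth checking on the real data because both tables also carry a `contract` column. Because `timestamp` exists in both tables, `suffixes` keeps the two versions apart.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
sales_toy = pd.DataFrame({
    "tokenId": [1, 1, 2, 4],
    "eth": [0.46, 75.0, 1.2, 0.3],
    "timestamp": ["2018-04-05", "2020-01-18", "2019-06-01", "2020-03-10"],
})
artworks_toy = pd.DataFrame({
    "tokenId": [1, 2, 3],
    "name": ["Artwork A", "Artwork B", "Artwork C"],
    "timestamp": ["2018-04-05", "2018-04-05", "2018-04-06"],
})

merged = sales_toy.merge(
    artworks_toy,
    on="tokenId",
    how="left",                       # keep every sale, even without a matching artwork
    suffixes=("_sale", "_artwork"),   # disambiguate columns present in both tables
    validate="many_to_one",           # many sales per artwork, one artwork row per tokenId
    indicator=True,                   # adds a _merge column showing where each row matched
)
print(merged)
print(merged["_merge"].value_counts())
```

A left join keeps sales whose artwork is missing (here `tokenId` 4), so you can decide explicitly how to treat them instead of losing them silently.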

### Advice

I recommend that you consider at least `eth` (to work in [Ether](https://en.wikipedia.org/wiki/Ethereum)) or `usd` (to work in dollars), and another variable from `df_artworks`, for example `media` or `size` (make sure to deal with missing observations appropriately).
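
One way to approach missing observations is to quantify them first and then decide, per analysis, whether to drop or label them. The sketch below uses a hypothetical frame `toy` with invented values; the `"unknown"` label and the choice of `dropna` are illustrative assumptions, not the required treatment.

```python
import numpy as np
import pandas as pd

# Hypothetical values; only the column names mirror the real tables.
toy = pd.DataFrame({
    "usd": [175.4, np.nan, 110319.0, 50.0],
    "media": ["image", None, None, "video"],
    "size": [1.0e6, np.nan, np.nan, 2.5e6],
})

# Share of missing values per column.
missing_share = toy.isna().mean()
print(missing_share)

# Numeric analyses: keep only rows where both variables are observed.
complete = toy.dropna(subset=["usd", "size"])
print(f"kept {len(complete)} of {len(toy)} rows")

# Categorical analyses: an explicit label keeps missing rows visible in counts.
toy["media_filled"] = toy["media"].fillna("unknown")
print(toy["media_filled"].value_counts())
```

Whichever option you choose, report how many rows it affects, since dropping rows can change the distributions you describe.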

For time trends, use the `timestamp` variables.
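
The `timestamp` columns are loaded as strings (dtype `object`), so they need parsing before any time-based grouping. A minimal sketch on a hypothetical frame `sales_toy` follows; the monthly frequency (`"MS"`, month start) is an assumption, and weekly or yearly bins may suit your analysis better.

```python
import pandas as pd

# Hypothetical sales with ISO 8601 timestamps in the same format as the data.
sales_toy = pd.DataFrame({
    "timestamp": [
        "2021-01-05T00:47:24Z",
        "2021-01-20T10:00:00Z",
        "2021-02-03T12:30:00Z",
        "2021-04-01T00:00:00Z",
    ],
    "eth": [100.0, 2.0, 1.5, 0.5],
})

# Parse the strings into timezone-aware datetimes (the trailing Z means UTC).
sales_toy["timestamp"] = pd.to_datetime(sales_toy["timestamp"], utc=True)

# Number of sales and total volume per calendar month.
monthly = (
    sales_toy.set_index("timestamp")
    .resample("MS")["eth"]
    .agg(["count", "sum"])
)
print(monthly)
```

Months without sales appear with a count of 0, which keeps gaps visible when the result is plotted, for example with `monthly["count"].plot()`.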

---

In [None]:
# your code here