<img src="img/MoMA.jpg" style="width: 600px;"/>

# SQL Exercise - Museum of Modern Art

**As a way of practising my SQL skills, I downloaded the dataset provided by the [MoMA](https://github.com/MuseumofModernArt/collection). The dataset has two tables: Artists and Artworks.**

According to the GitHub repository that hosts the data:
<br>

> _"This research dataset contains 138,210 records, representing all of the works that have been accessioned into MoMA’s collection and cataloged in our database. It includes basic metadata for each work, including title, artist, date made, medium, dimensions, and date acquired by the Museum. Some of these records have incomplete information and are noted as “not Curator Approved.”_

> *The Artists dataset contains 15,388 records, representing all the artists who have work in MoMA's collection and have been cataloged in our database. It includes basic metadata for each artist, including name, nationality, gender, birth year, death year, Wiki QID, and Getty ULAN ID."*

**I will be using PostgreSQL to practice**

Here are some basic commands for macOS users:

<ol>
    <li>brew install postgresql --> installs PostgreSQL</li>
    <li>initdb /usr/local/var/postgres --> initializes a new database cluster in that data directory</li>
    <li>pg_ctl -D /usr/local/var/postgres start --> starts the PostgreSQL server</li>
    <li>pg_ctl -D /usr/local/var/postgres stop --> stops the PostgreSQL server</li>
    <li>createdb database_name --> creates a new database with the name you specify</li>
    <li>dropdb database_name  --> deletes a database</li>
    <li>psql database_name --> connects to the database and opens the psql prompt</li>
    <li>\du (typed at the psql prompt, e.g. postgres=#) --> lists the database roles (users)</li>
</ol>

<br>

## CREATING TABLES AND COPYING DATA

I will create the database directly from the command line, although it is also possible to do it from a notebook. The following statements are the ones I used to create the tables and copy the data.

<br>

### Artists Table

In [None]:
DROP TABLE IF EXISTS artists; 
CREATE TABLE artists(
    ConstituentID serial PRIMARY KEY,
    DisplayName VARCHAR (100),
    ArtistBio VARCHAR (250),
    Nationality VARCHAR (50),
    Gender VARCHAR(10),
    BeginDate INTEGER,
    EndDate INTEGER,
    Wiki_QID VARCHAR(50),
    ULAN INTEGER);

In [None]:
COPY artists
FROM '/Users/diego/Documents/Artists.csv' -- Replace with the path where your data is located
DELIMITER ',' CSV HEADER;

<br>
<br>

### Artworks Table

In [None]:
DROP TABLE IF EXISTS artworks;
CREATE TABLE artworks (
    Title VARCHAR (5000),
    Artist VARCHAR (5000),
    ConstituentID VARCHAR(5000),
    ArtistBio VARCHAR (5000),
    Nationality VARCHAR (5000),
    BeginDate VARCHAR(5000),
    EndDate VARCHAR(5000),
    Gender VARCHAR(5000),
    Date VARCHAR(5000),
    Medium VARCHAR (5000),
    Dimensions VARCHAR (5000),
    CreditLine VARCHAR (5000),
    AccessionNumber VARCHAR (5000),
    Classification VARCHAR (5000),
    Department VARCHAR (5000),
    DateAcquired DATE,
    Cataloged VARCHAR(1),
    ObjectID serial PRIMARY KEY,
    URL VARCHAR(5000),
    ThumbnailURL VARCHAR(5000),
    Circumference_cm DECIMAL,
    Depth_cm DECIMAL,
    Diameter_cm DECIMAL,
    Height_cm DECIMAL,
    Length_cm DECIMAL,
    Weight_kg DECIMAL,
    Width_cm DECIMAL,
    Seat_Height_cm DECIMAL,
    Duration_sec DECIMAL);

In [None]:
COPY artworks
FROM '/Users/diego/Documents/Artworks.csv' -- Replace with the path where your data is located
DELIMITER ',' CSV HEADER;

In [None]:
-- The code above creates the artworks table without problems, but I had to keep increasing the VARCHAR lengths
-- because some entries are very long. This statement changes the data type of a column in an existing table:

ALTER TABLE Artworks
ALTER COLUMN enddate TYPE VARCHAR(200);
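
One way to avoid resizing columns by trial and error is to measure the longest value in each CSV column before creating the table. The following is a minimal Python sketch using only the standard library. The helper name `max_field_lengths` and the sample rows are hypothetical, made up for illustration, and are not part of the MoMA repository.

```python
import csv
import io


def max_field_lengths(csv_file):
    """Return {column name: length of the longest value} for a CSV file object."""
    reader = csv.DictReader(csv_file)
    fieldnames = reader.fieldnames or []  # empty file -> no columns
    longest = {name: 0 for name in fieldnames}
    for row in reader:
        for name in fieldnames:
            value = row[name] or ""  # short rows yield None for missing fields
            longest[name] = max(longest[name], len(value))
    return longest


# Tiny made-up sample standing in for Artworks.csv (hypothetical data).
sample = io.StringIO(
    'Title,Medium\n'
    '"Untitled","Oil on canvas"\n'
    '"Study, No. 2","Ink"\n'
)
lengths = max_field_lengths(sample)
print(lengths)
```

For a real file, the same function could be called on `open(path, newline="", encoding="utf-8")`, assuming the CSV is UTF-8 encoded. The resulting numbers give a lower bound for each `VARCHAR(n)` length.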

Unfortunately, this second table is not as straightforward as the first one; the dataset is quite messy. Here are some points:
<ol>
    <li>The Seat Height column is empty, so it could be dropped. In pandas: df.drop('Seat Height (cm)', axis=1, inplace=True)</li><br>
    <li>Most of the artwork measurements (length, weight, etc.) are empty, but I will keep those columns.</li><br>
    <li>ConstituentID sometimes references two or more IDs, because some artworks were made by more than one artist. This makes the table difficult to analyze, since the dates, names, IDs, artist bios, etc. of all the artists are grouped together in a single entry. Separating them is an option, but it duplicates rows: the artwork is repeated once per artist, so ObjectID is no longer unique. There is a great StackOverflow thread where user MaxU shares a function to separate these types of entries (linked below).</li><br>
    <li>I created an edited Artworks dataset, which can be found in the data folder, and I will use it for some of the exercises I set for myself.</li>
</ol>
        
        
[StackOverflow link](https://stackoverflow.com/questions/12680754/split-explode-pandas-dataframe-string-entry-to-separate-rows)
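
To illustrate the idea of one row per artist, here is a minimal plain-Python sketch. It is not MaxU's pandas function, nor necessarily how the edited dataset was produced. It assumes that multiple IDs in `ConstituentID` are separated by commas; the function name `explode_by_artist` and the sample rows are hypothetical.

```python
def explode_by_artist(rows, id_column="ConstituentID", separator=","):
    """Yield one copy of each row per artist ID found in id_column.

    Each output row keeps every original field and gains an "ArtistID"
    key holding a single integer ID (or None if the row has no ID).
    """
    for row in rows:
        ids = [part.strip() for part in row[id_column].split(separator)]
        ids = [part for part in ids if part]  # drop empty pieces
        if not ids:
            yield {**row, "ArtistID": None}  # keep artworks without an artist ID
            continue
        for artist_id in ids:
            yield {**row, "ArtistID": int(artist_id)}


# Made-up rows: the second artwork is credited to two artists.
artworks = [
    {"ObjectID": 1, "Title": "Solo work", "ConstituentID": "10"},
    {"ObjectID": 2, "Title": "Joint work", "ConstituentID": "20, 30"},
]
exploded = list(explode_by_artist(artworks))
for row in exploded:
    print(row["ObjectID"], row["ArtistID"], row["Title"])
```

In the output, ObjectID 2 appears twice, once per artist. This is why a table built this way cannot use ObjectID alone as its primary key.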

The following commands load the edited "Artworks" table I mentioned above:

In [None]:
DROP TABLE IF EXISTS artworks_edit;
CREATE TABLE artworks_edit (
    ObjectID INTEGER,
    ArtistID INTEGER,
    Title VARCHAR (5000),
    All_Artists VARCHAR (5000),
    All_IDs VARCHAR(5000),
    All_Bios VARCHAR (5000),
    All_Nationalities VARCHAR (5000),
    All_BeginDates VARCHAR(5000),
    All_EndDates VARCHAR(5000),
    All_Genders VARCHAR(5000),
    Date VARCHAR(5000),
    Medium VARCHAR (5000),
    Dimensions VARCHAR (5000),
    CreditLine VARCHAR (5000),
    AccessionNumber VARCHAR (5000),
    Classification VARCHAR (5000),
    Department VARCHAR (5000),
    DateAcquired DATE,
    Cataloged VARCHAR(1),
    URL VARCHAR(5000),
    ThumbnailURL VARCHAR(5000),
    Circumference_cm DECIMAL,
    Depth_cm DECIMAL,
    Diameter_cm DECIMAL,
    Height_cm DECIMAL,
    Length_cm DECIMAL,
    Weight_kg DECIMAL,
    Width_cm DECIMAL,
    Duration_sec DECIMAL);

In [None]:
COPY artworks_edit
FROM '/Users/diego/Documents/Artworks_Edits.csv'
DELIMITER ',' CSV HEADER;