# Understanding Lego sets popularity

## 📖 Background
You recently applied to work as a data analyst intern at the famous **Lego Group in Denmark**. As part of the job interview process, you received the following take-home assignment: *You are asked to use the provided dataset to understand the popularity of different Lego sets and themes. The idea is to become familiarized with the data to be ready for an interview with a business stakeholder.*

![lego bricks](lego%20bricks.jpg)


## 💪 Challenge
Using PostgreSQL and a Jupyter notebook, we will answer the following questions:

1. What is the average number of Lego sets released per year?
2. What is the average number of Lego parts per year?
3. Create a visualization for item 2.
4. What are the 5 most popular colors used in Lego parts?
5. What proportion of Lego parts are transparent?
6. What are the 5 rarest Lego bricks?
7. Summarize your findings.

## 💾 The data

#### You received access to a database with the following tables (a diagram of how they relate to each other follows the table descriptions):

#### inventory_parts
- "inventory_id" - id of the inventory the part is in (as in the inventories table)
- "part_num" - unique id for the part (as in the parts table)
- "color_id" - id of the color
- "quantity" - the number of copies of the part included in the set
- "is_spare" - whether or not it is a spare part

#### parts
- "part_num" - unique id for the part (as in the inventory_parts table)
- "name" - name of the part
- "part_cat_id" - part category id (as in part_categories table)

#### part_categories
- "id" - part category id (as in parts table)
- "name" - name of the category the part belongs to

#### colors
- "id" - id of the color (as in inventory_parts table)
- "name" - color name
- "rgb" - rgb code of the color
- "is_trans" - whether or not the part is transparent/translucent

#### inventories
- "id" - id of the inventory the part is in (as in the inventory_sets and inventory_parts tables)
- "version" - version number
- "set_num" - set number (as in sets table)

#### inventory_sets
- "inventory_id" - id of the inventory the part is in (as in the inventories table)
- "set_num" - set number (as in sets table)
- "quantity" - the quantity of sets included

#### sets
- "set_num" - unique set id (as in inventory_sets and inventories tables)
- "name" - the name of the set
- "year" - the year the set was published
- "theme_id" - the id of the theme the set belongs to (as in themes table)
- "num_parts" - the number of parts in the set

#### themes
- "id" - the id of the theme (as in the sets table)
- "name" - the name of the theme
- "parent_id" - the id of the larger theme, if there is one


**Acknowledgments**: [LEGO Database](https://rebrickable.com/downloads)

![erd](data/lego_erd.png)

## Before answering the posed questions, let's examine the contents of each of the tables that make up this dataset.

In [13]:
SELECT *
FROM inventory_sets
LIMIT 2;

Unnamed: 0,inventory_id,set_num,quantity
0,35,75911-1,1
1,35,75912-1,1


In [14]:
SELECT *
FROM inventories
LIMIT 2;

Unnamed: 0,id,version,set_num
0,1,1,7922-1
1,3,1,3931-1


In [15]:
SELECT *
FROM sets
LIMIT 2;

Unnamed: 0,set_num,name,year,theme_id,num_parts
0,00-1,Weetabix Castle,1970,414,471
1,0011-2,Town Mini-Figures,1978,84,12


In [16]:
SELECT *
FROM inventory_parts
LIMIT 2;

Unnamed: 0,inventory_id,part_num,color_id,quantity,is_spare
0,1,48379c01,72,1,False
1,1,48395,7,1,False


In [17]:
SELECT *
FROM parts
LIMIT 2;

Unnamed: 0,part_num,name,part_cat_id
0,0687b1,Set 0687 Activity Booklet 1,17
1,0901,Baseplate 16 x 30 with Set 080 Yellow House Print,1


In [18]:
SELECT *
FROM themes
LIMIT 2;

Unnamed: 0,id,name,parent_id
0,2,Arctic Technic,1
1,3,Competition,1


In [19]:
SELECT *
FROM part_categories
LIMIT 2;

Unnamed: 0,id,name
0,1,Baseplates
1,2,Bricks Printed


In [20]:
SELECT *
FROM colors
LIMIT 2;

Unnamed: 0,id,name,rgb,is_trans
0,-1,Unknown,0033B2,False
1,0,Black,05131D,False


## 1. What is the average number of Lego sets released per year?

First, how many **unique sets** does this dataset contain, and what **time range** does it cover?

In [11]:
SELECT COUNT(set_num) AS total_sets, MIN(year) AS first_year, MAX(year) AS last_year, COUNT(DISTINCT year) AS year
FROM sets;

Unnamed: 0,total_sets,first_year,last_year,year
0,11673,1950,2017,66


The dataset has **11,673** set releases spanning **1950** to **2017**. Only **66** distinct years appear, because no sets are recorded for 1951 and 1952.

Before computing an **average of set releases per year**, we look at the **number of sets released in each year** to check for trends.

In [52]:
SELECT year, COUNT(set_num) AS sets_released_per_year
FROM sets
GROUP BY year
ORDER BY year ASC;

Unnamed: 0,year,sets_released_per_year
0,1950,7
1,1953,4
2,1954,14
3,1955,28
4,1956,12
...,...,...
61,2013,593
62,2014,713
63,2015,665
64,2016,596


**The average number of sets released per year is:**

In [13]:
SELECT (COUNT(set_num) / COUNT(DISTINCT year)) AS avg_sets_per_year
FROM sets;

Unnamed: 0,avg_sets_per_year
0,176
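
The 176 above is a truncated value: both `COUNT` results are integers, so the division drops the fractional part. A minimal Python sketch of the difference, using only the two figures from the earlier output (the variable names are illustrative):

```python
# Figures taken from the query output above.
total_sets = 11673
distinct_years = 66

# Floor division mirrors the integer division of two integer operands.
truncated_average = total_sets // distinct_years

# True division keeps the fractional part; round to one decimal place.
exact_average = round(total_sets / distinct_years, 1)

print(truncated_average)  # 176
print(exact_average)      # 176.9
```

In the SQL itself, casting one operand (for example with `::float`, as done later in question 5) should avoid the truncation.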


However, the number of sets released per year has grown sharply (from 7 in 1950 to 713 in 2014), so **a single average over the whole period is not representative**. If you want to know the **typical number of releases around a given year** rather than for 1950 to 2017 as a whole, a 3-year moving average is one option:

In [66]:
WITH sets_released AS (
SELECT year, COUNT(set_num) as sets_released_per_year
FROM sets
GROUP BY year
ORDER BY YEAR ASC
    )

SELECT year,
    sets_released_per_year,
    ROUND(AVG(sets_released_per_year) OVER (ORDER BY year
                       ROWS BETWEEN 2 preceding  AND current row),1) AS moving_3_year_average

FROM sets_released;

Unnamed: 0,year,sets_released_per_year,moving_3_year_average
0,1950,7,7.0
1,1953,4,5.5
2,1954,14,8.3
3,1955,28,15.3
4,1956,12,18.0
...,...,...,...
61,2013,593,570.3
62,2014,713,640.3
63,2015,665,657.0
64,2016,596,658.0
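
To make the window logic concrete, here is a small Python sketch of the same trailing average. It assumes a frame of the current row plus up to two preceding rows, as in `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW`; `trailing_moving_average` is an illustrative helper, not part of the notebook.

```python
def trailing_moving_average(values, window=3):
    """Average each value with up to `window - 1` values before it."""
    averages = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        frame = values[start:i + 1]
        averages.append(round(sum(frame) / len(frame), 1))
    return averages


# First five rows of sets_released_per_year from the output above.
sets_per_year = [7, 4, 14, 28, 12]
print(trailing_moving_average(sets_per_year))
# [7.0, 5.5, 8.3, 15.3, 18.0]
```

Because the frame is defined in rows rather than years, 1950 and 1953 are treated as adjacent even though two calendar years lie between them.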


## 2. What is the average number of Lego parts per year?

As in the previous question, we start with the totals: how many **parts** does this dataset contain, and over what **time range**?

In [15]:
SELECT SUM(num_parts) AS total_parts, COUNT(DISTINCT year) AS total_years,  (SUM(num_parts) / COUNT(DISTINCT year)) AS average_number_of_parts
FROM sets;

Unnamed: 0,total_parts,total_years,average_number_of_parts
0,1894089,66,28698


The dataset has **1,894,089** total parts across sets released from **1950** to **2017** (66 distinct years with data). Consequently, the **average number of parts per year** over the entire period is **28,698**.

Beyond this overall figure, we also compute the **total parts and the average number of parts per set for each year** to analyze trends. In the query below, `count_parts_per_year` is the number of sets released that year, and `avg_parts_per_year` is the yearly total divided by that number of sets.

In [7]:
SELECT year,
    SUM(num_parts) AS parts_per_year,
    COUNT(num_parts) AS count_parts_per_year,
    SUM(num_parts) / COUNT(num_parts) AS avg_parts_per_year
FROM sets
GROUP BY year
ORDER BY year;

Unnamed: 0,year,parts_per_year,count_parts_per_year,avg_parts_per_year
0,1950,71,7,10
1,1953,66,4,16
2,1954,173,14,12
3,1955,1032,28,36
4,1956,222,12,18
...,...,...,...,...
61,2013,107537,593,181
62,2014,121007,713,169
63,2015,134110,665,201
64,2016,150834,596,253


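Why are these yearly averages so far from the overall average computed first (28,698)? They measure different things: `avg_parts_per_year` divides each year's parts by the number of sets released that year, whereas 28,698 divides all parts by the number of years. To follow how the yearly total of parts evolves, we can again use a **moving average.**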

In [9]:
WITH parts_released AS (
SELECT year, SUM(num_parts) as parts_per_year
FROM sets
GROUP BY year
ORDER BY YEAR ASC
    )

SELECT year,
    parts_per_year,
    ROUND(AVG(parts_per_year) OVER (ORDER BY year
                       ROWS BETWEEN 2 preceding  AND current row),1) AS moving_3_year_average

FROM parts_released;

Unnamed: 0,year,parts_per_year,moving_3_year_average
0,1950,71,71.0
1,1953,66,68.5
2,1954,173,103.3
3,1955,1032,423.7
4,1956,222,475.7
...,...,...,...
61,2013,107537,93389.7
62,2014,121007,106883.3
63,2015,134110,120884.7
64,2016,150834,135317.0


## 3. Create a visualization for item 2.


The visualization is built from the moving-average result of item 2 (columns `year`, `parts_per_year`, and `moving_3_year_average`).
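
The item 2 output can be charted in many ways. As a dependency-free sketch, the code below draws a text bar chart of `parts_per_year` using only the nine rows visible in that output (the elided years are left out, and `text_bar_chart` is a hypothetical helper, not part of the original notebook). In a notebook, a plotting library would be the more usual choice.

```python
# (year, parts_per_year) pairs copied from the rows shown in the item 2 output.
rows = [
    (1950, 71), (1953, 66), (1954, 173), (1955, 1032), (1956, 222),
    (2013, 107537), (2014, 121007), (2015, 134110), (2016, 150834),
]


def text_bar_chart(rows, width=50):
    """Return one line per row, with a bar scaled to the largest value."""
    largest = max(value for _, value in rows)
    lines = []
    for year, value in rows:
        bar = "#" * round(value / largest * width)
        lines.append(f"{year} | {bar} {value}")
    return lines


chart = text_bar_chart(rows)
print("\n".join(chart))
```

On a linear scale the 1950s rows are too small to register any bar, which itself illustrates how much the yearly part count has grown.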


## 4. What are the 5 most popular colors used in Lego parts?

To answer this question, we join the `inventory_parts` and `colors` tables and count the inventory rows for each color.

In [12]:
SELECT colors.name AS color, COUNT(inventory_parts.part_num) AS parts
FROM inventory_parts
INNER JOIN colors
ON inventory_parts.color_id = colors.id
GROUP BY colors.name
ORDER BY parts DESC
LIMIT 5;

Unnamed: 0,color,parts
0,Black,115085
1,White,66536
2,Light Bluish Gray,55302
3,Red,50213
4,Dark Bluish Gray,43907


Black leads by a wide margin (115,085 rows, about 1.7 times as many as White), and neutral colors (black, white, and two grays) take four of the top five places.
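
Note that the query counts inventory rows per color, not the number of physical pieces. The sketch below uses a few hypothetical rows (not taken from the dataset) to show how a row count, analogous to `COUNT(part_num)`, can rank colors differently from a quantity total, analogous to `SUM(quantity)`.

```python
from collections import Counter

# Hypothetical (color, quantity) rows standing in for inventory_parts joined to colors.
rows = [
    ("Black", 1), ("Black", 1), ("Black", 2),
    ("White", 10),
    ("Red", 3),
]

# One count per row, like COUNT(part_num) grouped by color.
row_counts = Counter(color for color, _ in rows)

# Sum of the quantity column, like SUM(quantity) grouped by color.
quantity_totals = Counter()
for color, quantity in rows:
    quantity_totals[color] += quantity

print(row_counts.most_common(1))       # [('Black', 3)]
print(quantity_totals.most_common(1))  # [('White', 10)]
```

Which definition of "popular" is appropriate depends on the question the stakeholder is asking.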

## 5. What proportion of Lego parts are transparent?

We count the distinct parts that appear in at least one transparent color and divide by the total number of distinct parts, all within a single SQL query.

In [20]:
WITH transparent_parts AS (
    SELECT COUNT(DISTINCT inventory_parts.part_num) AS transparent_parts
    FROM inventory_parts
    INNER JOIN colors 
    ON inventory_parts.color_id = colors.id
    WHERE colors.is_trans = True),
    
all_parts AS (
    SELECT COUNT(DISTINCT part_num)::float AS all_parts
    FROM inventory_parts)
    
SELECT (transparent_parts/all_parts) AS proportion_transparent_parts
FROM transparent_parts, all_parts;

Unnamed: 0,proportion_transparent_parts
0,0.062949


The **proportion of Lego parts that are transparent** is ~6%.
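
The same distinct-count logic can be sketched in Python with sets. The rows below are hypothetical and only illustrate that a part counts as transparent if it appears in at least one transparent color.

```python
# Hypothetical (part_num, is_trans) rows standing in for inventory_parts joined to colors.
rows = [
    ("3001", False), ("3001", True),   # same part in an opaque and a transparent color
    ("3002", False),
    ("3003", False), ("3003", False),  # duplicates collapse in the set
    ("3004", True),
]

# Distinct parts overall, like COUNT(DISTINCT part_num).
all_parts = {part for part, _ in rows}

# Distinct parts with at least one transparent row.
transparent_parts = {part for part, is_trans in rows if is_trans}

proportion = len(transparent_parts) / len(all_parts)
print(proportion)  # 0.5
```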

## 6. What are the 5 rarest Lego bricks?

To determine the rarity of a Lego brick, I look for parts that are low in inventory. Low inventory is not the only possible definition of rarity (unusual part names would be another), so this is an assumption. Note that the query counts distinct part numbers per brick name, which is 1 for every row returned, so the result is a set of ties rather than a strict ranking.

In [8]:
SELECT parts.name, COUNT(DISTINCT inventory_parts.part_num) AS parts
FROM parts
INNER JOIN inventory_parts
USING (part_num)
WHERE parts.name LIKE 'Brick%'
GROUP BY parts.name
ORDER BY parts
LIMIT 5;

Unnamed: 0,name,parts
0,Brick 10 x 20 with Bottom Tubes in single row ...,1
1,Brick 10 x 20 with Bottom Tubes in single row ...,1
2,"Brick 10 x 20 without Bottom Tubes, with '+' C...",1
3,"Brick 10 x 20 without Bottom Tubes, with '+' C...",1
4,"Brick 10 x 10 without Bottom Tubes, with '+' C...",1
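
One alternative way to break such ties, offered here as an assumption rather than the notebook's method, is to rank parts by their total `quantity` across inventories. A minimal Python sketch with hypothetical rows and part names:

```python
from collections import defaultdict

# Hypothetical (part_num, quantity) rows standing in for inventory_parts.
rows = [
    ("brick_a", 4), ("brick_a", 6),
    ("brick_b", 1),
    ("brick_c", 2), ("brick_c", 1),
]

# Total quantity per part, like SUM(quantity) grouped by part_num.
total_quantity = defaultdict(int)
for part_num, quantity in rows:
    total_quantity[part_num] += quantity

# Sort ascending by total quantity, breaking ties by part number for a stable result.
rarest = sorted(total_quantity.items(), key=lambda item: (item[1], item[0]))
print(rarest[:2])  # [('brick_b', 1), ('brick_c', 3)]
```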


## 7. Summarize your findings.

### 1. The average number of sets released per year is 176, considering the entire time period from 1950 to 2017. However, when analyzing the number of sets released per year, we see a strong increase in recent years. Therefore, we have proposed a moving average as an alternative to answer the question posed.

### 2. The average number of parts per year (considering the entire period analyzed) is 28,698. However, by the same reasoning, the calculation of the moving average has also been proposed.

### 4. The most popular colors in this dataset are:
- Black;
- White;
- Light Bluish Gray;
- Red; and 
- Dark Bluish Gray.

### 5. The proportion of Lego parts that are transparent is ~6%.

### 6. Rarity was defined as low presence in inventory, which is an assumption; other definitions, such as unusual part names, are possible. Under this definition, the five bricks returned are all 10 x 10 or 10 x 20 bricks.