Do Lego sets have more specialized parts than they used to?
===========================================================

![](https://raw.githubusercontent.com/colbeseder/resources/master/lego_mariachi_crate.jpg)
_Some pretty specialized pieces_

I've recently come back to Lego (i.e. I have kids) and it seems to me that modern Lego sets have more specialized pieces.

In my day (👴) we had to use our imaginations to make new things out of generic bricks. Or did we?

Here's my investigation into whether new Lego sets really are introducing more specialized bricks than they used to.




Aims
----

To evaluate whether newer Lego sets introduce more parts of limited use than sets did in the '80s, with _useful_ loosely defined as "can be used to make lots of other things".

### Some notes on the data

* I've excluded 2017 as the data was not complete.
* It's expected that in the first years of Lego (created in 1949), there was a rush of new bricks (likely generic ones).
* I've ignored the colour of the pieces.

### The data
Let's join some tables and take a look at the basic data we'll be working with.

Each row is the first appearance of a Lego piece (ignoring colour).
* part_num - The distinct part
* color_id - The piece's original colour
* set_num - The first set that the piece appeared in
* year - The year that the piece was first released
* theme_id - The theme that the piece's first set was part of (e.g. 250: _Prisoner of Azkaban_)
* parent_theme_id - The broader theme (e.g. 246: _Harry Potter_)
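
To make the theme / parent theme distinction concrete, here's a minimal sketch of resolving a theme to its top-level ancestor. The table is a toy stand-in for `themes.csv`: ids 246 and 250 are the examples above, while theme 9001 and the name `root_theme` are made up for illustration.

```python
import pandas as pd

# Toy themes table with the same columns as themes.csv.
# Theme 9001 is hypothetical, added only to show a two-level chain.
themes = pd.DataFrame({
    "id": [246, 250, 9001],
    "name": ["Harry Potter", "Prisoner of Azkaban", "Made-up sub-theme"],
    "parent_id": [float("nan"), 246, 250],
})

# Look-up from theme id to its parent id (NaN for top-level themes)
parent_of = themes.set_index("id")["parent_id"]

def root_theme(theme_id):
    """Follow parent_id links until reaching a theme with no parent."""
    current = theme_id
    while pd.notna(parent_of.loc[current]):
        current = int(parent_of.loc[current])
    return int(current)

roots = {int(t): root_theme(t) for t in themes["id"]}
print(roots)
```

Every theme in the toy table resolves to 246, since both sub-themes sit under _Harry Potter_.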

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import pprint, math
import matplotlib.pyplot as plt

# Read in the data files

# For the part number:
inv_parts_data = pd.read_csv("../input/lego-database/inventory_parts.csv")

# Sets and the year they were released
sets_data = pd.read_csv("../input/lego-database/sets.csv")

# To assist the join between part-number, and year introduced 
inv_data = pd.read_csv("../input/lego-database/inventories.csv")

# We'll need this to track parent themes
theme_data = pd.read_csv("../input/lego-database/themes.csv")

# We want to know the type of part
part_cat_data = pd.read_csv("../input/lego-database/parts.csv")

# Join data into a table mapping part-number (with duplicates) to year
years_set_data = pd.merge(inv_data, sets_data,on='set_num')
years = years_set_data[['id', 'set_num','year', 'theme_id']]
data = pd.merge(inv_parts_data, years, left_on='inventory_id', right_on='id')
data = pd.merge(part_cat_data, data, on='part_num')
data = data.drop(['id', 'quantity', 'is_spare', 'name'], axis='columns').sort_values(by=['year'])

# Clean up parts that appear in duplicate rows for a single set (each color is listed separately)
data = data.drop_duplicates(subset=['part_num', 'set_num'], keep='first')

# Add the parent theme

memo = {}
def get_parent_theme(theme):
    if theme in memo:
        return memo[theme]
    
    # parent_id is NaN for top-level themes; take the single matching value rather than a Series
    parent = theme_data.loc[theme_data['id'] == theme, 'parent_id'].iloc[0]
    if pd.isna(parent):
        return theme
    else:
        parent = int(parent)
    if parent == theme:
        r = theme
    else:
        r = get_parent_theme(parent)
    
    memo[theme] = r
    return r

data['parent_theme_id'] = data['theme_id'].apply(lambda x: get_parent_theme(x) )

print(data.head())


Now let's organize some of this data by year.

For each year, we can work out:
* how many new parts were introduced
* how many sets the parts introduced that year have appeared in since
* how many new parts there were per set, on average
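
As a rough sketch of what these per-year figures look like, here's the same calculation with pandas `groupby` on a tiny made-up table. The part numbers, set numbers and the names `toy`, `toy_sets`, `per_part` and `per_year` are all hypothetical, and this is an illustration rather than the code used in the cells that follow.

```python
import pandas as pd

# One row per (part, set), as in the joined table. All values are made up.
toy = pd.DataFrame({
    "part_num": ["p1", "p1", "p1", "p2", "p2", "p3"],
    "set_num":  ["A-1", "B-1", "C-1", "B-1", "C-1", "B-1"],
    "year":     [1980, 1985, 1990, 1985, 1990, 1985],
})
toy_sets = pd.DataFrame({
    "set_num": ["A-1", "B-1", "C-1"],
    "year": [1980, 1985, 1990],
})

# Per part: how many sets contain it, and the year it first appeared
per_part = toy.groupby("part_num").agg(
    sets_containing=("set_num", "nunique"),
    year_introduced=("year", "min"),
)

# Per year of introduction: number of new parts and their total set appearances
per_year = per_part.groupby("year_introduced").agg(
    new_parts=("sets_containing", "count"),
    set_appearances=("sets_containing", "sum"),
)

# Sets released each year, aligned on the year index
per_year["new_sets"] = toy_sets.groupby("year").size()
per_year["sets_per_new_part"] = per_year["set_appearances"] / per_year["new_parts"]
per_year["new_parts_per_set"] = per_year["new_parts"] / per_year["new_sets"]
print(per_year)
```

In the toy data, 1985 introduces two parts in a single set, so that year has two new parts per set.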

In [None]:
# Find the year each part was introduced

part_sum = {} # part_num : (sets_containing, year_introduced)
for _, inv in data.iterrows():
    part_num = inv['part_num']

    if part_num in part_sum:
        s = part_sum[part_num][0] + 1
        yr = min(part_sum[part_num][1], inv['year'])
    else:
        s = 1
        yr = inv['year']
    part_sum[part_num] = (s, yr)


# Find number of sets released each year

set_releases = {}
for _, row in sets_data.iterrows():
    yr = row['year']
    if yr in set_releases:
        set_releases[yr] +=1
    else:
        set_releases[yr] = 1

#print("New sets released by year:")
#pprint.pprint(set_releases)


        

In [None]:
# Parts introduced each year
year_part_count = {}

# For each part, number of sets that it appears in, summed per year (ie. sum of set-appearances for all parts originating in year)
part_appearances_by_year_first_seen = {}

for part_num in part_sum:
    yr = part_sum[part_num][1]
    s = part_sum[part_num][0]
    
    if yr in year_part_count:
        year_part_count[yr] += 1
        part_appearances_by_year_first_seen[yr] += s
    else:
        year_part_count[yr] = 1
        part_appearances_by_year_first_seen[yr] = s

years = []
part_appearances = []
for yr in sorted(part_appearances_by_year_first_seen):
    years.append(yr)
    part_appearances.append(part_appearances_by_year_first_seen[yr])


In [None]:

first_seen = []
new_parts = []
new_parts_per_set = []
new_sets = []
sets_per_new_part = []
year_reuses = {}
for yr in year_part_count:
    if yr == 2017:
        continue # Seems data was collected mid-year
    first_seen.append(yr)
    new_parts.append(year_part_count[yr])
    new_sets.append(set_releases[yr])
    sets_per_new_part.append(part_appearances_by_year_first_seen[yr] / year_part_count[yr])
    year_reuses[yr] = part_appearances_by_year_first_seen[yr] / year_part_count[yr]
    new_parts_per_set.append(year_part_count[yr] / set_releases[yr])

first_seen_parts = pd.DataFrame.from_dict({'year': first_seen, 'new_parts': new_parts, 'new_sets': new_sets, 'sets_per_new_part': sets_per_new_part, 'new_parts_per_set': new_parts_per_set}).sort_values(by=['year'])

print(first_seen_parts.head())

In [None]:
#fig, ax = plt.subplots()
#first_seen_parts.plot.scatter(x='year',y='new_parts_per_set', c='new_sets', colormap='Wistia', ax=ax);

print("Zoom in on last 35 years")
#fig, ax = plt.subplots()
first_seen_parts[-35:].plot.bar(x='year',y='new_parts_per_set');#, c='new_sets', colormap='Wistia', ax=ax);

**There definitely seems to be an increase in average number of new bricks per set, over the last few years.**

I wonder if we can spot a trend of newer bricks being in fewer sets?

In [None]:
#All time
#fig, ax = plt.subplots()
#first_seen_parts.plot.scatter(x='year',y='sets_per_new_part', c='new_sets', colormap='Wistia', ax=ax);

print("Zoom in on last 35 years")
fig, ax = plt.subplots()
#first_seen_parts[-35:].plot.scatter(x='year',y='sets_per_new_part', c='new_sets', colormap='Wistia', ax=ax);
sns.regplot(x=first_seen_parts[-35:]['year'], y=first_seen_parts[-35:]['sets_per_new_part'])

That's a very steep decline. But it's likely skewed by the fact that newer bricks cannot be included in older sets. 

So now, we'll normalize the plot to show what percentage of the available sets a brick appeared in, counting only sets released from the year it was introduced.
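
Here's a small worked example of that normalization, using made-up yearly figures. It assumes a part can appear in any set released in or after its first year, so the denominator is a reverse cumulative sum of sets per year. The names `yearly`, `sets_available` and `percent_of_sets` are just for this sketch.

```python
import pandas as pd

# Hypothetical yearly figures, sorted by year
yearly = pd.DataFrame({
    "year": [2000, 2001, 2002],
    "new_sets": [100, 150, 250],
    "sets_per_new_part": [50.0, 30.0, 10.0],
})

# Sets released in or after each year: cumulative sum taken from the last year backwards
yearly["sets_available"] = yearly["new_sets"][::-1].cumsum()[::-1]

# Share of the available sets that an average new part from that year appears in
yearly["percent_of_sets"] = yearly["sets_per_new_part"] / yearly["sets_available"] * 100
print(yearly)
```

The raw counts fall from 50 to 10, but once each year is divided by the sets still to come (500, 400 and 250), the drop is much gentler.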

In [None]:
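all_sets = sum(new_sets)
t = 0
total_sets = []

# Count the sets released in or after each year, including the year itself
# (otherwise the final year would have zero sets and a division by zero below)
for x in new_sets:
    total_sets.append(all_sets - t)
    t += x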

first_seen_parts['sets_to_be_released'] = total_sets
first_seen_parts['percent_of_sets_containing_part'] = first_seen_parts['sets_per_new_part'] / first_seen_parts['sets_to_be_released'] * 100

#print(first_seen_parts)
fig, ax = plt.subplots()

print("Decline in reuse of new bricks")
print("Zoom in on last 35 years")
first_seen_parts[-35:].plot.scatter(x='year',y='percent_of_sets_containing_part', c='sets_per_new_part', colormap='Wistia', ax=ax);


There's definitely a downward trend in the reuse of new bricks from the '80s to around 2007, followed by a small bounce back. This could be a blip, or the start of a climb back to the top.

I'm interested to see if this bounce-back is related to themes.

**Is the increase in reusability over the last 10 years cross-theme usability?** Or are these pieces only relevant to their themes?

In [None]:
# For each part, let's count how many themes it appeared in

themes_for_part = {} # part_num : list of themes
parent_themes_for_part = {} # part_num : list of parent themes
for _, inv in data.iterrows():
    part_num = inv['part_num']
    theme = inv['theme_id']
    parent_theme = get_parent_theme(theme)
    if part_num not in themes_for_part:
        themes_for_part[part_num] = [theme]
        parent_themes_for_part[part_num] = [parent_theme]
    else:
        if theme not in themes_for_part[part_num]:
            themes_for_part[part_num].append(theme)
        if parent_theme not in parent_themes_for_part[part_num]:
            parent_themes_for_part[part_num].append(parent_theme)


themes_for_year = {} # year: themes * parts
parent_themes_for_year = {} # year: themes * parts
for part_num in part_sum:
        yr = part_sum[part_num][1]
        themes = len(themes_for_part[part_num])
        parent_themes = len(parent_themes_for_part[part_num])
        if yr in themes_for_year:
            themes_for_year[yr] += themes
            parent_themes_for_year[yr] += parent_themes
        else:
            themes_for_year[yr] = themes
            parent_themes_for_year[yr] = parent_themes
years = []
theme_count = []
parent_theme_count = []
for yr in sorted(themes_for_year):
    if yr == 2017:
        continue
    years.append(yr)
    theme_count.append(themes_for_year[yr])
    parent_theme_count.append(parent_themes_for_year[yr])

first_seen_parts['themes_per_part'] = theme_count / first_seen_parts['new_parts']
first_seen_parts['parent_themes_per_part'] = parent_theme_count / first_seen_parts['new_parts']

In [None]:
#print(first_seen_parts.head())

# Plot parent themes - very similar results to themes
#sns.regplot(x=first_seen_parts[-35:]['year'], y=first_seen_parts[-35:]['parent_themes_per_part'])

sns.regplot(x=first_seen_parts[-35:]['year'], y=first_seen_parts[-35:]['themes_per_part'])

It's clear from this that the number of cross-theme parts is dropping. Pieces introduced more recently appear, on average, in fewer than two themes. Plotting against parent themes shows the same picture.
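
The themes-per-part figure behind this plot can be sketched in a few lines: count the distinct themes each part appears in, then average over the parts introduced in each year. The rows below are made up, and `rows`, `per_part` and `themes_per_part` are names chosen for this illustration only.

```python
import pandas as pd

# One row per (part, set), carrying the theme of the set. All values are made up.
rows = pd.DataFrame({
    "part_num": ["p1", "p1", "p1", "p2", "p2", "p3"],
    "theme_id": [1, 2, 2, 7, 7, 7],
    "year":     [1990, 1995, 2000, 2010, 2012, 2010],
})

# Per part: number of distinct themes, and the year it first appeared
per_part = rows.groupby("part_num").agg(
    themes=("theme_id", "nunique"),
    year_introduced=("year", "min"),
)

# Average number of themes per part, grouped by the year the part was introduced
themes_per_part = per_part.groupby("year_introduced")["themes"].mean()
print(themes_per_part)
```

In the toy data, the 1990 part shows up in two themes, while both 2010 parts stay inside a single theme.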

In [None]:
# An Aside: how many themes per parent (root) theme?
themes_per_parent = {}
for t in memo:
    p = memo[t]
    if p in themes_per_parent:
        themes_per_parent[p] +=1
    else:
        themes_per_parent[p] = 1

pparents = []
pcount = []
for p in sorted(themes_per_parent):
    pparents.append(p)
    pcount.append(themes_per_parent[p])

df = pd.DataFrame.from_dict({'parent_theme': pparents, 'themes_count': pcount})
df.plot.bar(x='parent_theme',y='themes_count');




Conclusion
----------

I got my first Lego kit in 1987. As I suspected, new sets contain more new brick shapes than they did back then. There's also been a decline in reuse of the bricks that have been "invented" during that time. The slight bounce back in the last few years seems to be related to pieces that are used in multiple sets, but only one theme.

As a further step, it would be interesting to see what effect including colour would have on this study. I'd also be interested to see if we could predict the likely future reuse of a new piece.



Appendix
--------

Here's my attempt at a model predicting future use of a new brick. How accurately can we predict the expected number of sets that will contain a piece, based on its:
* Part category
* First colour
* Year of release
* Theme
* Parent theme

I've excluded the typical amount of reuse for a brick from that year (_reuse\_for\_year_).
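
Most of these features are ids rather than quantities (theme 7 isn't "more" than theme 1), so they need one-hot encoding before going into a model. Here's a minimal sketch of the idea using `pd.get_dummies` on made-up rows. It is an alternative illustration, not the `OneHotEncoder` approach used in the cell below, and the `theme`/`color` prefixes are my own choice.

```python
import pandas as pd

# Made-up parts with two categorical id columns
parts = pd.DataFrame({
    "part_num": ["p1", "p2", "p3"],
    "year": [1990, 2005, 2010],
    "theme_id": [1, 7, 7],
    "color_id": [0, 4, 0],
})

# Replace each id column with one indicator column per distinct value
encoded = pd.get_dummies(
    parts,
    columns=["theme_id", "color_id"],
    prefix=["theme", "color"],
)
print(encoded.columns.tolist())
```

Each id column becomes a set of indicator columns named after the value they flag, such as `theme_7`, while `year` stays numeric.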


In [None]:
# Let's try to make some predictions

import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder



# We want to make a prediction based only on the information that's available at part origination
part_data = data.copy()
part_data = part_data[part_data.year < 2017 ]

part_data = part_data.drop(['inventory_id', 'set_num'], axis='columns').sort_values(by=['part_num'])
part_data = part_data.drop_duplicates(subset=['part_num'], keep='first')
    
part_data['reuse_for_year'] = part_data['year'].apply(lambda yr: year_reuses[yr] )

part_data['uses'] = part_data['part_num'].apply(lambda x: part_sum[x][0] )

# We'll class pieces with over 30 reuses as "high use" and bunch them together
part_data['uses'] = part_data['uses'].clip(1, 31)

part_data.reset_index(drop=True, inplace=True)
print(part_data.head())

# Convert part categories, colours, themes and parent themes to One Hot Encoding

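def convert_to_one_hot(df, key, enc):
    encoded = enc.fit_transform(df[[key]]).toarray()
    # Name each column after its source column and category, so names stay unique strings
    columns = ["%s_%s" % (key, cat) for cat in enc.categories_[0]]
    enc_df = pd.DataFrame(encoded, columns=columns, index=df.index)
    df = pd.concat([df, enc_df], axis=1)
    df = df.drop([key], axis='columns')
    return df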


enc = OneHotEncoder()
for k in ['part_cat_id', 'color_id', 'theme_id', 'parent_theme_id']:
    part_data = convert_to_one_hot(part_data, k, enc)

# Randomize row order before splitting
part_data = shuffle(part_data, random_state=3)

y = part_data['uses']

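# Drop the target, the part identifier, and reuse_for_year (excluded, as it is derived from the reuse counts we are predicting)
part_data = part_data.drop(['uses', 'part_num', 'reuse_for_year'], axis='columns')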
X = part_data
# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=2)

print("%s rows of training data"%(len(train_y)))
print("%s rows of validation data"%(len(val_y)))
# Define and fit the model.
rf_model = RandomForestRegressor(random_state=1)
rf_model.fit(train_X, train_y)

# Make predictions on the validation data
predictions = rf_model.predict(val_X)

# Calculate the error of the predictions
rf_val_mae = mean_absolute_error(val_y, predictions)
# mean_squared_error returns the MSE, so take the square root to get the RMSE
rf_val_rmse = np.sqrt(mean_squared_error(val_y, predictions))

print("\n")
print("Validation RMSE: {}".format(rf_val_rmse))

print("\n\n** The Mean Absolute Error on the validation set is {}! **".format(rf_val_mae))
