## Lab 3: pandas

#### Introduction

We will examine a dataset containing characteristics of LEGO sets manufactured between 1961 and 2019, taken from the Brickset website. The variables in the dataset are described below.

| Variable | Description |
|:----------|:-------------|
| id | set ID |
| name | name of set |
| themegroup | theme group of set |
| theme | theme of set |
| subtheme | subtheme of set |
| year | year released |
| pieces | number of pieces |
| minifigs | number of minifigs |
| package | type of packaging |
| retail_price | recommended retail price |



Load pandas, read in the data, and save it as a DataFrame named lego.

In [3]:
from google.colab import files
import pandas as pd
uploaded = files.upload()

Saving lego.csv to lego.csv


In [4]:
import io
# Read the uploaded bytes into a DataFrame named lego
lego = pd.read_csv(io.BytesIO(uploaded['lego.csv']))
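Outside Colab, `pd.read_csv` also accepts a local path or any file-like object. The sketch below uses a small made-up CSV string (an assumption, not the real lego.csv) to show one way of checking column types and counting missing values before any cleaning is done.

```python
import io
import pandas as pd

# Hypothetical three-row sample that mimics a few lego.csv columns (not real data)
sample_csv = """name,year,pieces,retail_price
Set A,2019,100,9.99
Set B,2018,,19.99
Set C,,50,
"""

# read_csv accepts a file-like object, so a StringIO buffer works like a file
sample = pd.read_csv(io.StringIO(sample_csv))

# Count missing values per column
missing_counts = sample.isna().sum()

print(sample.dtypes)
print(missing_counts)
```

Columns that contain missing values are read as floats, which is likely why year and pieces print with a trailing `.0` in the outputs of this lab.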

#### Question 1:
Some sets have missing information for retail_price, pieces, or both. This could be because the sets are free (giveaways), because they aren't traditional LEGO sets (comic books, etc.), or simply because the information was never recorded. Filter the lego dataset according to the specifications below and save the result as lego, overwriting the original lego object. In addition, describe the implications of removing these sets.

Your new lego DataFrame should have:

• no missing pieces

• only sets with a nonzero number of pieces

• no missing retail_price

• only sets with a nonzero retail_price

• no missing year

Print out the shape of the DataFrame after cleaning the dataset.

In [6]:
# Write your answer here
lego = pd.read_csv('lego.csv')
# Filter the DataFrame based on the specified conditions
lego = lego.dropna(subset=['pieces', 'retail_price', 'year'])  # Remove rows with missing values in specified columns
lego = lego[(lego['pieces'] > 0) & (lego['retail_price'] > 0)]  # Keep only rows with non-zero pieces and retail price

# Print the shape of the cleaned DataFrame
print(lego.shape)


(7213, 10)
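Comparing row counts before and after the filter is one way to start describing the implications of removing these sets. The sketch below applies the same two-step filter to a made-up five-row frame; the data and the names `toy`, `cleaned`, and `n_removed` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not taken from lego.csv)
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "year": [2019, 2018, np.nan, 2017, 2016],
    "pieces": [100, np.nan, 50, 0, 20],
    "retail_price": [9.99, 19.99, 4.99, 2.99, 0.0],
})

# Step 1: drop rows with a missing value in any of the three columns
cleaned = toy.dropna(subset=["pieces", "retail_price", "year"])

# Step 2: keep only rows with a positive piece count and a positive price
cleaned = cleaned[(cleaned["pieces"] > 0) & (cleaned["retail_price"] > 0)]

# Count how many rows the filter removed
n_removed = len(toy) - len(cleaned)
print(f"Removed {n_removed} of {len(toy)} rows")
```

Here only set A survives. On the real data, a large share of removed rows would mean that later summaries describe only priced, boxed-style sets rather than everything Brickset lists.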


#### Question 2:
Arrange the dataset in descending order of retail_price and print the first three rows. Report in words the names of the three most expensive LEGO sets, their prices, and how many pieces each has.

In [7]:
# Write your answer here
# Sort the DataFrame by 'retail_price' in descending order
sorted_lego = lego.sort_values(by='retail_price', ascending=False)

# Get the first three rows
top_three_sets = sorted_lego.head(3)

# Print the top three sets
print(top_three_sets[['name', 'retail_price', 'pieces']])


                   name  retail_price  pieces
1434  Millennium Falcon        799.99  7541.0
3426    Connections Kit        754.99  2455.0
2065         Death Star        499.99  4016.0
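Sorting and then taking `head(3)` can also be written with `DataFrame.nlargest`, which returns the top rows by a column directly. The sketch below compares the two on made-up data (an assumption, not real lego.csv rows).

```python
import pandas as pd

# Hypothetical prices (not real lego.csv rows)
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "retail_price": [19.99, 799.99, 49.99, 499.99],
    "pieces": [150, 7000, 400, 4000],
})

# Approach 1: sort the whole frame, then take the first rows
top_sorted = toy.sort_values(by="retail_price", ascending=False).head(2)

# Approach 2: ask for the n largest values of one column
top_nlargest = toy.nlargest(2, "retail_price")

print(top_sorted["name"].tolist())
print(top_nlargest["name"].tolist())
```

Both approaches select the same rows here; `nlargest` simply states the intent more directly.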


#### Question 3:

It appears that the most expensive sets generally have more pieces. Create a new variable (column) price_per_piece, representing the price in dollars per piece for each set. Save the result as lego2.

In [8]:
# Write your answer here

# Add the new column 'price_per_piece' and save the result as lego2.
# assign() returns a new DataFrame, so lego itself is left unchanged.
lego2 = lego.assign(price_per_piece=lego['retail_price'] / lego['pieces'])

# Display the first few rows to confirm the new column has been added
print(lego2.head())


        id                            name    themegroup           theme  \
1  10264-1                   Corner Garage  Model making  Creator Expert   
2  10265-1                    Ford Mustang  Model making  Creator Expert   
3  10766-1                      Woody & RC      Licensed       Toy Story   
4  10769-1                     RV Vacation      Licensed       Toy Story   
5  10770-1  Buzz & Woody's Carnival Mania!      Licensed       Toy Story   

            subtheme    year  pieces  minifigs package  retail_price  \
1  Modular Buildings  2019.0  2569.0         6     Box        199.99   
2           Vehicles  2019.0  1471.0         0     Box        149.99   
3        Toy Story 4  2019.0    69.0         0     Box          9.99   
4        Toy Story 4  2019.0   178.0         0     Box         34.99   
5        Toy Story 4  2019.0   230.0         0     Box         49.99   

   price_per_piece  
1         0.077847  
2         0.101965  
3         0.144783  
4         0.196573  
5         0.217348  
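In pandas, plain assignment such as `b = a` binds a second name to the same DataFrame, whereas `DataFrame.assign` returns a new DataFrame with the extra column. The sketch below illustrates the difference on a made-up two-row frame; the names `toy`, `alias`, and `derived` are hypothetical.

```python
import pandas as pd

# Hypothetical two-row frame (not real lego.csv rows)
toy = pd.DataFrame({"retail_price": [10.0, 30.0], "pieces": [100.0, 150.0]})

# Plain assignment: both names refer to the same object
alias = toy

# assign() builds a new DataFrame and leaves toy untouched
derived = toy.assign(price_per_piece=toy["retail_price"] / toy["pieces"])

print(alias is toy)
print("price_per_piece" in toy.columns)
print("price_per_piece" in derived.columns)
```

Adding a column through `alias` would therefore also change `toy`, while `derived` can be modified independently.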

#### Question 4:

Arrange the lego2 dataset in descending order of price_per_piece and return only the first five rows and the columns name, themegroup, theme, pieces, and price_per_piece. What do you notice about these sets?

In [10]:
# Write your answer here
# Arrange the lego2 dataset in descending order of price_per_piece
sorted_lego2 = lego2.sort_values(by='price_per_piece', ascending=False)

# Select the specified columns and get the first five rows
result = sorted_lego2[['name', 'themegroup', 'theme', 'pieces', 'price_per_piece']].head(5)

# Display the result
print(result)

                               name   themegroup       theme  pieces  \
3586          EV3 Intelligent Brick    Technical  Mindstorms     1.0   
5298  Intelligent NXT Brick (Black)    Technical  Mindstorms     1.0   
6452          NXT Intelligent Brick    Technical  Mindstorms     1.0   
9029    RCX Programmable LEGO Brick    Technical  Mindstorms     1.0   
5272    NXT DC Rechargeable Battery  Educational   Education     1.0   

      price_per_piece  
3586           204.99  
5298           169.99  
6452           169.99  
9029           110.00  
5272            79.99  
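Every row in the output above has pieces equal to 1, so price_per_piece is simply the retail price. One way to see how much single-piece items dominate a ranking is to rank again after excluding them. The sketch below uses made-up data and an arbitrary threshold `min_pieces`; both are assumptions.

```python
import pandas as pd

# Hypothetical items (not real lego.csv rows)
toy = pd.DataFrame({
    "name": ["Brick", "Battery", "Car", "Castle"],
    "pieces": [1, 1, 200, 1000],
    "retail_price": [169.99, 79.99, 19.99, 99.99],
})
toy["price_per_piece"] = toy["retail_price"] / toy["pieces"]

# Ranking over all rows: single-piece items come first
ranked_all = toy.sort_values("price_per_piece", ascending=False)

# Ranking restricted to items with at least min_pieces pieces (assumed threshold)
min_pieces = 2
ranked_multi = toy[toy["pieces"] >= min_pieces].sort_values(
    "price_per_piece", ascending=False
)

print(ranked_all["name"].tolist())
print(ranked_multi["name"].tolist())
```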


#### Question 5:

What is the mean price_per_piece of the 40 'Toy Story' sets with the highest price_per_piece?

In [11]:
# Write your answer here
# Filter the lego2 dataset for 'Toy Story' sets
toy_story_sets = lego2[lego2['theme'] == 'Toy Story']

# Sort the 'Toy Story' sets by price_per_piece in descending order
top_toy_story_sets = toy_story_sets.sort_values(by='price_per_piece', ascending=False)

# Select the top 40 sets
top_40_toy_story_sets = top_toy_story_sets.head(40)

# Calculate the mean price_per_piece
mean_price_per_piece = top_40_toy_story_sets['price_per_piece'].mean()

# Print the mean price_per_piece
print(f"The mean price per piece for the top 40 'Toy Story' sets is: ${mean_price_per_piece:.2f}")


The mean price per piece for the top 40 'Toy Story' sets is: $0.15
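The filter, sort, and `head` steps can be condensed with `Series.nlargest`. Note that asking for more rows than a group contains returns all of its rows, so it may be worth checking how many sets actually enter the mean. The sketch below uses made-up data, with `n = 2` standing in for 40.

```python
import pandas as pd

# Hypothetical data (not real lego.csv rows)
toy = pd.DataFrame({
    "theme": ["Toy Story", "Toy Story", "Toy Story", "City"],
    "price_per_piece": [0.10, 0.30, 0.20, 0.50],
})

# Select one theme's price_per_piece values as a Series
toy_story_ppp = toy.loc[toy["theme"] == "Toy Story", "price_per_piece"]

# Mean of the n largest values (n = 2 stands in for 40)
n = 2
top_n = toy_story_ppp.nlargest(n)
mean_top_n = top_n.mean()

# Requesting more rows than exist returns every available row
all_available = toy_story_ppp.nlargest(10)

print(f"{mean_top_n:.2f}")
print(len(all_available))
```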


#### Question 6:

What are the unique themes in the lego dataset?

In [12]:
# Write your answer here
# Get the unique themes in the lego2 dataset
unique_themes = lego2['theme'].unique()

# Print the unique themes
print(unique_themes)


['Creator Expert' 'Toy Story' 'Duplo' 'Classic' 'Architecture' 'Minecraft'
 'Ideas' 'City' 'Creator' 'BrickHeadz' 'Friends' 'Disney' 'Technic'
 'Ninjago' 'The Lego Movie 2: The Second Part' 'Star Wars'
 'Speed Champions' 'Jurassic World' 'Overwatch' 'Marvel Super Heroes'
 'DC Comics Super Heroes' 'Juniors' 'Wizarding World' 'Miscellaneous'
 'Seasonal' 'Promotional' 'Xtra' 'Unikitty' 'Elves' 'The Powerpuff Girls'
 'The LEGO Ninjago Movie' 'The LEGO Batman Movie'
 'Collectable Minifigures' 'Nexo Knights' 'Powered Up' 'Boost'
 'DC Super Hero Girls' 'Pirates of the Caribbean' 'Dimensions' 'Books'
 'Education' 'Mixels' 'Mindstorms' 'Bionicle' 'The Angry Birds Movie'
 'Ghostbusters' nan 'Legends of Chima' 'Pirates' 'Ultra Agents'
 'The LEGO Movie' 'The Simpsons' 'Scooby-Doo' 'Bricks and More' 'Fusion'
 'Teenage Mutant Ninja Turtles' 'HERO Factory' 'The Hobbit' 'Castle'
 'The Lord of the Rings' 'Serious Play' 'Master Builder Academy' 'Space'
 'The Lone Ranger' 'Games' 'Power Functions' 'Monst