This notebook trims the LEGO sets dataset down to a more manageable size, then compares and cleans the remaining data.

In [21]:
import pandas as pd

data_frame = pd.read_csv(r"lego_sets.csv")
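# keep sets from 2000 onward with a positive US retail price and piece count; .copy() gives an independent frame for the later column assignments
filt_data = data_frame[(data_frame["year"] >= 2000) & (data_frame["US_retailPrice"] > 0) & (data_frame["pieces"] > 0)].copy()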

print(filt_data.head())

      set_id                   name  year        theme subtheme themeGroup  \
4045  3458-1       2x4 Black Bricks  2000  Bulk Bricks      NaN      Basic   
4048  3461-1  2x4 Dark Green Bricks  2000  Bulk Bricks      NaN      Basic   
4049  3462-1         2x4 Red Bricks  2000  Bulk Bricks      NaN      Basic   
4057  3470-1       1x4 White Bricks  2000  Bulk Bricks      NaN      Basic   
4072  3485-1         2x4 Red Plates  2000  Bulk Bricks      NaN      Basic   

     category  pieces  minifigs  agerange_min  US_retailPrice  \
4045   Normal    50.0       NaN           NaN            6.99   
4048   Normal    50.0       NaN           NaN            6.99   
4049   Normal    50.0       NaN           NaN            6.99   
4057   Normal    50.0       NaN           NaN            5.99   
4072   Normal   100.0       NaN           NaN            8.99   

                           bricksetURL  \
4045  https://brickset.com/sets/3458-1   
4048  https://brickset.com/sets/3461-1   
4049  https://
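
A detail of the filter in In [21] worth spelling out: comparisons against NaN evaluate to False, so rows with a missing price or piece count are dropped along with the zero-valued ones. A minimal sketch on a toy frame (the values are made up for illustration, not rows from lego_sets.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, chosen so that each one fails a different condition
toy = pd.DataFrame({
    "year": [1999, 2005, 2008, 2012, 2003],
    "US_retailPrice": [9.99, np.nan, 19.99, 0.0, 4.99],
    "pieces": [120.0, 80.0, np.nan, 45.0, 30.0],
})

# Same three conditions as the notebook's filter
mask = (toy["year"] >= 2000) & (toy["US_retailPrice"] > 0) & (toy["pieces"] > 0)
kept = toy[mask]

# Row 0 fails on year, row 1 on NaN price, row 2 on NaN pieces, row 3 on zero price
print(mask.tolist())
print(kept)
```

Only the last toy row passes all three conditions, so no separate dropna() call is needed for these two columns.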

In [22]:
# drop unnecessary columns

filt_data = filt_data.drop(columns=["bricksetURL","thumbnailURL","imageURL","agerange_min"])

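# convert minifigs to integers, treating missing or unparseable values as 0

filt_data["minifigs"] = pd.to_numeric(filt_data["minifigs"], errors="coerce").fillna(0).astype(int)

# keep only sets released up to and including 2010
filt_data = filt_data[filt_data["year"] <= 2010]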
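
The minifigs cleanup chains three steps: coerce to numeric, replace NaN with 0, cast to int. A small sketch of how that chain behaves, using a made-up Series (the sample values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical raw values: a numeric string, a missing value, junk text, a float
raw = pd.Series(["2", None, "n/a", 3.0])

# errors="coerce" turns anything unparseable into NaN instead of raising
as_numbers = pd.to_numeric(raw, errors="coerce")

# NaN cannot be stored in a plain int column, so fill before casting
cleaned = as_numbers.fillna(0).astype(int)

print(cleaned.tolist())
```

Note that this treats "unknown" and "no minifigs" identically, which is a modelling choice rather than something the data confirms.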

In [23]:

# Save filtered data to a new CSV file
filt_data.to_csv(r"lego_filtered.csv", index=False)

print("Filtered dataset saved successfully!")
print(filt_data.head())


Filtered dataset saved successfully!
      set_id                   name  year        theme subtheme themeGroup  \
4045  3458-1       2x4 Black Bricks  2000  Bulk Bricks      NaN      Basic   
4048  3461-1  2x4 Dark Green Bricks  2000  Bulk Bricks      NaN      Basic   
4049  3462-1         2x4 Red Bricks  2000  Bulk Bricks      NaN      Basic   
4057  3470-1       1x4 White Bricks  2000  Bulk Bricks      NaN      Basic   
4072  3485-1         2x4 Red Plates  2000  Bulk Bricks      NaN      Basic   

     category  pieces  minifigs  US_retailPrice  
4045   Normal    50.0         0            6.99  
4048   Normal    50.0         0            6.99  
4049   Normal    50.0         0            6.99  
4057   Normal    50.0         0            5.99  
4072   Normal   100.0         0            8.99  


In [24]:
# Choose your column
col = "theme"

# Get value counts for all unique values
value_counts = filt_data[col].value_counts()

# --- Print all unique values and their counts ---
print("All unique values and counts:\n")
for value, count in value_counts.items():
    print(f"{value}: {count}")

# --- Now filter for rare values (fewer than 20 instances) ---
rare_counts = value_counts[value_counts < 20]
rare_values = rare_counts.index.tolist()

print("\nUnique values with fewer than 20 instances:\n")
for value, count in rare_counts.items():
    print(f"{value}: {count}")

# --- Show the list (for use in drop later) ---
print("\nList of rare values to remove:")
print(rare_values)


All unique values and counts:

Duplo: 128
City: 107
Bionicle: 92
Star Wars: 81
Racers: 76
Creator: 47
Technic: 40
Castle: 37
Collectable Minifigures: 32
Space: 31
Bulk Bricks: 29
Bricks and More: 26
Mindstorms: 22
Exo-Force: 21
Advanced models: 19
Seasonal: 17
Power Miners: 16
Games: 16
Indiana Jones: 16
HERO Factory: 15
Power Functions: 14
Atlantis: 14
Agents: 13
Belville: 13
Toy Story: 11
SpongeBob SquarePants: 10
Pirates: 9
Batman: 9
Education: 8
Architecture: 7
Aqua Raiders: 7
Harry Potter: 7
Miscellaneous: 6
World Racers: 6
Ben 10: Alien Force: 6
Make and Create: 6
Trains: 5
Prince of Persia: 5
Factory: 3
Quatro: 2
Serious Play: 2
Town: 1
Vikings: 1
Books: 1

Unique values with fewer than 20 instances:

Advanced models: 19
Seasonal: 17
Power Miners: 16
Games: 16
Indiana Jones: 16
HERO Factory: 15
Power Functions: 14
Atlantis: 14
Agents: 13
Belville: 13
Toy Story: 11
SpongeBob SquarePants: 10
Pirates: 9
Batman: 9
Education: 8
Architecture: 7
Aqua Raiders: 7
Harry Potter: 7
Miscella

In [25]:
# --- Drop rows with rare values ---
filt_data_cleaned = filt_data[~filt_data[col].isin(rare_values)]

# --- Save the cleaned dataset to a new CSV file ---
filt_data_cleaned.to_csv(r"lego_cleaned.csv", index=False)

# --- Print row counts for confirmation ---
print(f"\nRows before cleaning: {len(filt_data)}")
print(f"Rows after cleaning: {len(filt_data_cleaned)}")
print("Cleaned dataset saved as 'lego_cleaned.csv'")



Rows before cleaning: 1034
Rows after cleaning: 769
Cleaned dataset saved as 'lego_cleaned.csv'
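
The count-then-drop steps from In [24] and In [25] could be folded into one helper so the threshold and column are easy to change. This is only a sketch: the function name `drop_rare` and the toy data are hypothetical and not part of the notebook.

```python
import pandas as pd


def drop_rare(df, column, min_count):
    """Return the rows of df whose value in `column` occurs at least min_count times."""
    counts = df[column].value_counts()
    rare = counts[counts < min_count].index
    # Same logic as the notebook: exclude rows whose value is in the rare list
    return df[~df[column].isin(rare)]


# Hypothetical toy data: City x3, Duplo x2, Town x1
toy = pd.DataFrame({"theme": ["City"] * 3 + ["Duplo"] * 2 + ["Town"]})
result = drop_rare(toy, "theme", min_count=2)

print(f"Rows before cleaning: {len(toy)}")
print(f"Rows after cleaning: {len(result)}")
print(sorted(result["theme"].unique()))
```

With the notebook's data, the equivalent call would be `drop_rare(filt_data, "theme", 20)`. As a cross-check on the printed output above, the counts of the 14 themes with at least 20 sets sum to 769, matching the reported row count after cleaning.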
