# Analyze Bags Data in Dataset
This notebook checks the sample count, label distribution, and image-path validity for "Bags" items in the dataset.

After the Bags checks, the notebook repeats the analysis for **all item types**. For each type, it will:
- Show the number of samples
- Display condition score statistics
- Show output class distribution
- Check for missing images
- Summarize years used, condition, and description fields


In [1]:
import pandas as pd
import os
from collections import Counter

# Load the dataset
csv_path = '../data/full_items_extended_dataset.csv'
df = pd.read_csv(csv_path)

# Filter for Bags
bags_df = df[df['item_type'].str.lower() == 'bags']
print(f"Number of 'Bags' samples: {len(bags_df)}")
bags_df.head()

Number of 'Bags' samples: 5


| Unnamed: 0 | item_type | years_used | condition | description | image_damage | condition_score | green_points | output | image_path |
|---|---|---|---|---|---|---|---|---|---|
| 15 | Bags | 1 | Working | Zip works fine, fabric a bit discolored | Low | 0.95 | 99 | Refurbish and Resell | waste_classifier/data/images/used-school-bags-... |
| 16 | Bags | 3 | Working | Zip works fine, fabric a bit discolored | Moderate | 0.53 | 66 | Refurbish and Resell | waste_classifier/data/images/images.jpeg |
| 17 | Bags | 2 | Repairable | Zip works fine, fabric a bit discolored | Low | 0.67 | 70 | Refurbish and Resell | waste_classifier/data/images/bags_resell.jpg |
| 18 | Bags | 3 | Repairable | Zip works fine, fabric a bit discolored | High | 0.55 | 66 | Salvage Components | waste_classifier/data/images/bag.jpg |
| 19 | Bags | 5 | Dead | Zip doesn't works , fabric a bit discolored | Low | 0.25 | 35 | Recycle | waste_classifier/data/images/bag_dead.png |


In [2]:
# Check condition_score and output distribution for Bags
print("Condition score stats for Bags:")
print(bags_df['condition_score'].describe())
print("\nOutput class counts for Bags:")
print(bags_df['output'].value_counts())

Condition score stats for Bags:
count    5.000000
mean     0.590000
std      0.253377
min      0.250000
25%      0.530000
50%      0.550000
75%      0.670000
max      0.950000
Name: condition_score, dtype: float64

Output class counts for Bags:
output
Refurbish and Resell    3
Salvage Components      1
Recycle                 1
Name: count, dtype: int64
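
With only five samples, raw counts are easier to compare across item types as proportions. The following is a minimal sketch, not part of the original notebook: the `bags` frame is a hand-copied stand-in for `bags_df` (built from the five rows displayed earlier) so that it runs without the CSV, and the variable names are illustrative.

```python
import pandas as pd

# Hand-copied from the five Bags rows shown earlier; stands in for bags_df.
bags = pd.DataFrame({
    "condition_score": [0.95, 0.53, 0.67, 0.55, 0.25],
    "output": [
        "Refurbish and Resell",
        "Refurbish and Resell",
        "Refurbish and Resell",
        "Salvage Components",
        "Recycle",
    ],
})

# Share of each output class instead of raw counts.
class_share = bags["output"].value_counts(normalize=True)
print(class_share)

# Mean condition score within each output class.
mean_score_by_class = bags.groupby("output")["condition_score"].mean()
print(mean_score_by_class)
```

On the real data, the same two calls could be applied to `bags_df` directly.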


In [3]:
# Check if image files exist for Bags
image_dir = '../data/images/'

def check_image_exists(image_path):
    # Stored paths carry a directory prefix, so compare by file name only
    return os.path.exists(os.path.join(image_dir, os.path.basename(image_path)))

# assign() returns a new DataFrame, which avoids SettingWithCopyWarning on the filtered slice
bags_df = bags_df.assign(image_exists=bags_df['image_path'].apply(check_image_exists))
missing_images = bags_df[~bags_df['image_exists']]
print(f"Missing images for Bags: {len(missing_images)}")
if not missing_images.empty:
    print(missing_images[['image_path']])

Missing images for Bags: 0


In [4]:
# Show a summary of years_used, condition, and description for Bags
print(bags_df[['years_used', 'condition', 'description']].describe(include='all'))

        years_used condition                              description
count      5.00000         5                                        5
unique         NaN         3                                        2
top            NaN   Working  Zip works fine, fabric a bit discolored
freq           NaN         2                                        4
mean       2.80000       NaN                                      NaN
std        1.48324       NaN                                      NaN
min        1.00000       NaN                                      NaN
25%        2.00000       NaN                                      NaN
50%        3.00000       NaN                                      NaN
75%        3.00000       NaN                                      NaN
max        5.00000       NaN                                      NaN
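
Because `describe(include='all')` mixes numeric and text columns, every row of the table is padded with `NaN` for the columns it does not apply to. One possible alternative is to summarize the two kinds of columns separately. This is a sketch on a hand-copied stand-in for `bags_df` (values taken from the rows displayed earlier); the variable names are illustrative.

```python
import pandas as pd

# Hand-copied from the five Bags rows shown earlier; stands in for bags_df.
bags = pd.DataFrame({
    "years_used": [1, 3, 2, 3, 5],
    "condition": ["Working", "Working", "Repairable", "Repairable", "Dead"],
})

# Numeric columns: count, mean, std, min, quartiles, max.
numeric_summary = bags.select_dtypes(include="number").describe()
print(numeric_summary)

# Non-numeric columns: count, unique, top, freq.
categorical_summary = bags.select_dtypes(exclude="number").describe()
print(categorical_summary)
```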


# Generalized Analysis for All Item Types
The following cells analyze the dataset for every item type: sample counts, output class distributions, missing images, and summary statistics.
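
As a compact view of the per-type output distribution, a cross-tabulation puts all item types in one table instead of one printout per type. This is a sketch only: the `toy` frame is rebuilt by hand from the Bags and Blenders/Mixers class counts reported in this notebook, and on the real data the same call would take `df['item_type']` and `df['output']`.

```python
import pandas as pd

# Toy frame rebuilt from the class counts reported in this notebook
# for two item types; it stands in for the full dataset.
toy = pd.DataFrame({
    "item_type": ["Bags"] * 5 + ["Blenders/Mixers"] * 5,
    "output": (
        ["Refurbish and Resell"] * 3 + ["Salvage Components", "Recycle"]
        + ["Salvage Components"] * 3 + ["Refurbish and Resell", "Recycle"]
    ),
})

# Rows are item types, columns are output classes, cells are sample counts.
class_table = pd.crosstab(toy["item_type"], toy["output"])
print(class_table)
```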

In [5]:
import pandas as pd
import os

df = pd.read_csv('../data/full_items_extended_dataset.csv')
image_dir = '../data/images/'

item_types = df['item_type'].unique()

for item in item_types:
    print(f'\n=== {item} ===')
    subset = df[df['item_type'] == item]
    print(f'Total samples: {len(subset)}')
    print('Output class distribution:')
    print(subset['output'].value_counts())
    # Check for missing images (joins the full stored path onto image_dir, not just the file name)
    missing_images = subset[~subset['image_path'].apply(lambda x: os.path.isfile(os.path.join(image_dir, str(x))))]
    print(f'Missing images: {len(missing_images)}')
    if len(missing_images) > 0:
        print('Sample missing image paths:', missing_images['image_path'].head(3).tolist())


=== Blenders/Mixers ===
Total samples: 5
Output class distribution:
output
Salvage Components      3
Refurbish and Resell    1
Recycle                 1
Name: count, dtype: int64
Missing images: 5
Sample missing image paths: ['data/images/blender_low.jpeg', 'waste_classifier/data/images/blender_moderate.jpg', 'waste_classifier/data/images/blenders_high-1.jpeg']

=== Electric Kettles ===
Total samples: 5
Output class distribution:
output
Salvage Components      3
Refurbish and Resell    1
Recycle                 1
Name: count, dtype: int64
Missing images: 5
Sample missing image paths: ['waste_classifier/data/images/electric-kettle-best.jpg', 'waste_classifier/data/images/electric-kettle_worst.png', 'waste_classifier/data/images/electric kettle_moderate.png']

=== Chairs ===
Total samples: 5
Output class distribution:
output
Refurbish and Resell    3
Salvage Components      1
Recycle                 1
Name: count, dtype: int64
Missing images: 5
Sample missing image paths: ['data/images/
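
Every item type above reports all five images as missing, while the Bags-only check in cell 3 found none missing. A likely cause, though not confirmed here, is that this cell joins the full stored path (including prefixes such as `waste_classifier/data/images/`) onto `image_dir`, whereas cell 3 keeps only the file name. The sketch below shows the file-name approach in isolation. The helper `image_exists` and the file `example.jpg` are hypothetical, and a temporary directory stands in for `image_dir`.

```python
import os
import tempfile

def image_exists(image_dir, image_path):
    """Return True if the file named by image_path exists directly inside image_dir."""
    # Drop any stored directory prefix and keep only the file name.
    filename = os.path.basename(str(image_path))
    return os.path.isfile(os.path.join(image_dir, filename))

# Self-contained demo: a temporary directory plays the role of image_dir.
with tempfile.TemporaryDirectory() as tmp_dir:
    # Create an empty placeholder file (the name is made up for this demo).
    open(os.path.join(tmp_dir, "example.jpg"), "w").close()
    stored_path = "waste_classifier/data/images/example.jpg"

    found_by_basename = image_exists(tmp_dir, stored_path)
    found_by_full_join = os.path.isfile(os.path.join(tmp_dir, stored_path))

print(found_by_basename, found_by_full_join)
```

In the loop, the lambda could call such a helper with `image_dir`, assuming all images sit directly in that one folder.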

In [6]:
for item in item_types:
    print(f'\n=== {item} Summary Statistics ===')
    subset = df[df['item_type'] == item]
    print('years_used:')
    print(subset['years_used'].describe())
    print('condition:')
    print(subset['condition'].describe())
    # Description length in characters (missing descriptions count as 0)
    desc_len = subset['description'].fillna('').str.len()
    print('description length:')
    print(desc_len.describe())


=== Blenders/Mixers Summary Statistics ===
years_used:
count    5.000000
mean     3.000000
std      1.581139
min      1.000000
25%      2.000000
50%      3.000000
75%      4.000000
max      5.000000
Name: years_used, dtype: float64
condition:
count              5
unique             3
top       Repairable
freq               3
Name: condition, dtype: object
description length:
count     5.000000
mean     40.400000
std       0.894427
min      40.000000
25%      40.000000
50%      40.000000
75%      40.000000
max      42.000000
Name: description, dtype: float64

=== Electric Kettles Summary Statistics ===
years_used:
count    5.00000
mean     2.80000
std      2.04939
min      1.00000
25%      1.00000
50%      2.00000
75%      5.00000
max      5.00000
Name: years_used, dtype: float64
condition:
count        5
unique       3
top       Dead
freq         3
Name: condition, dtype: object
description length:
count     5.000000
mean     41.600000
std       1.341641
min      41.000000
25%      41