# Analyze Bags Data in Dataset
This notebook checks the sample count, label distribution, and image-path validity for "Bags" items in the dataset.

# Generalized Analysis for All Item Types
This section analyzes the dataset for **all item types**. For each type, it will:
- Show the number of samples
- Display condition score statistics
- Show output class distribution
- Check for missing images
- Summarize years used, condition, and description fields


In [3]:
import pandas as pd
import os
from collections import Counter

# Load the dataset
csv_path = '../data/full_items_extended_dataset.csv'
df = pd.read_csv(csv_path)

# Filter for Bags
bags_df = df[df['item_type'].str.lower() == 'bags']
print(f"Number of 'Bags' samples: {len(bags_df)}")
bags_df.head()

Number of 'Bags' samples: 45


| | item_type | years_used | condition | description | image_damage | condition_score | output | image_path |
|---|---|---|---|---|---|---|---|---|
| 405 | Bags | 1 | Working | Zip works fine, fabric a bit discolored | Low | 0.95 | Refurbish and Resell | waste_classifier/data/images/used-school-bags-... |
| 406 | Bags | 2 | Working | Zip works fine, fabric a bit discolored | Low | 0.92 | Refurbish and Resell | waste_classifier/data/images/images.jpeg |
| 407 | Bags | 3 | Working | Zip works fine, fabric a bit discolored | Low | 0.89 | Refurbish and Resell | images/bags_working_low_3yrs.jpg |
| 408 | Bags | 4 | Working | Zip works fine, fabric a bit discolored | Low | 0.86 | Refurbish and Resell | images/bags_working_low_4yrs.jpg |
| 409 | Bags | 5 | Working | Zip works fine, fabric a bit discolored | Low | 0.83 | Refurbish and Resell | images/bags_working_low_5yrs.jpg |
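
The filter above lowercases `item_type` before comparing. A slightly more defensive variant is sketched below, assuming the column may also contain stray whitespace or missing values (nothing in the output above confirms that). The helper name `filter_item_type` and the toy frame are hypothetical, not part of the dataset or notebook.

```python
import pandas as pd

def filter_item_type(df: pd.DataFrame, name: str) -> pd.DataFrame:
    """Return an independent copy of the rows whose item_type matches name."""
    # Normalise the column: missing values become "", then strip and lowercase.
    normalised = df["item_type"].fillna("").astype(str).str.strip().str.lower()
    mask = normalised == name.strip().lower()
    # .copy() makes the result safe to modify without touching df.
    return df[mask].copy()

# Hypothetical toy frame standing in for the real CSV.
toy = pd.DataFrame({
    "item_type": ["Bags", " bags ", "Blenders/Mixers", None],
    "years_used": [1, 2, 3, 4],
})
toy_bags = filter_item_type(toy, "Bags")
print(len(toy_bags))
```

Returning a copy means later column assignments on the filtered frame act on an independent object rather than on a slice of the original.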


In [4]:
# Check condition_score and output distribution for Bags
print("Condition score stats for Bags:")
print(bags_df['condition_score'].describe())
print("\nOutput class counts for Bags:")
print(bags_df['output'].value_counts())

Condition score stats for Bags:
count    45.000000
mean      0.364889
std       0.282607
min       0.050000
25%       0.060000
50%       0.330000
75%       0.590000
max       0.950000
Name: condition_score, dtype: float64

Output class counts for Bags:
output
Refurbish and Resell    20
Salvage Components      15
Recycle                 10
Name: count, dtype: int64
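
Raw counts and overall score statistics do not show how `condition_score` relates to `output`. One possible follow-up, sketched here on hypothetical toy rows rather than the real `bags_df`, is to look at class shares and the score range within each class.

```python
import pandas as pd

# Hypothetical toy rows; the real analysis would use bags_df instead.
toy = pd.DataFrame({
    "output": ["Refurbish and Resell", "Refurbish and Resell",
               "Salvage Components", "Recycle"],
    "condition_score": [0.95, 0.83, 0.40, 0.05],
})

# Share of each output class instead of raw counts.
class_share = toy["output"].value_counts(normalize=True)

# Score range per class. Overlapping ranges would suggest that the label
# is not a simple threshold on condition_score.
score_by_class = toy.groupby("output")["condition_score"].agg(["count", "min", "max"])

print(class_share)
print(score_by_class)
```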


In [5]:
# Check whether the image file for each Bags row exists
image_dir = '../data/images/'

def check_image_exists(image_path):
    # Only the file name is used; any directory prefix in image_path is ignored
    return os.path.exists(os.path.join(image_dir, os.path.basename(image_path)))

# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
bags_df = bags_df.assign(image_exists=bags_df['image_path'].apply(check_image_exists))
missing_images = bags_df[~bags_df['image_exists']]
print(f"Missing images for Bags: {len(missing_images)}")
if not missing_images.empty:
    print(missing_images[['image_path']])

Missing images for Bags: 43
                                   image_path
407          images/bags_working_low_3yrs.jpg
408          images/bags_working_low_4yrs.jpg
409          images/bags_working_low_5yrs.jpg
410     images/bags_working_moderate_1yrs.jpg
411     images/bags_working_moderate_2yrs.jpg
412     images/bags_working_moderate_3yrs.jpg
413     images/bags_working_moderate_4yrs.jpg
414     images/bags_working_moderate_5yrs.jpg
415         images/bags_working_high_1yrs.jpg
416         images/bags_working_high_2yrs.jpg
417         images/bags_working_high_3yrs.jpg
418         images/bags_working_high_4yrs.jpg
419         images/bags_working_high_5yrs.jpg
420       images/bags_repairable_low_1yrs.jpg
421       images/bags_repairable_low_2yrs.jpg
422       images/bags_repairable_low_3yrs.jpg
423       images/bags_repairable_low_4yrs.jpg
424       images/bags_repairable_low_5yrs.jpg
425  images/bags_repairable_moderate_1yrs.jpg
426  images/bags_repairable_moderate_2yrs.jpg
427  i
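
The `image_path` values in this dataset carry different directory prefixes (`waste_classifier/data/images/...` and `images/...`), and the check above matches on file name only. The same idea can be sketched with `pathlib`, returning the resolved path rather than a boolean. The function name `resolve_image` is hypothetical, and the temporary directory stands in for the real image folder.

```python
from pathlib import Path
from typing import Optional
import tempfile

def resolve_image(image_path: str, image_dir: Path) -> Optional[Path]:
    """Return the file in image_dir that shares image_path's file name, or None."""
    candidate = image_dir / Path(str(image_path)).name
    # is_file() rejects directories, which a plain existence check would accept.
    return candidate if candidate.is_file() else None

# Demo against a throwaway directory containing a single empty file.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "images.jpeg").touch()
    found = resolve_image("waste_classifier/data/images/images.jpeg", demo_dir)
    missing = resolve_image("images/bags_working_low_3yrs.jpg", demo_dir)
    found_name = found.name if found else None

print(found_name, missing)
```

Matching on file name alone treats two different source paths with the same file name as the same image, so this is only appropriate if file names are unique within the image folder.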


In [6]:
# Show a summary of years_used, condition, and description for Bags
print(bags_df[['years_used', 'condition', 'description']].describe(include='all'))

        years_used condition                              description
count    45.000000        45                                       45
unique         NaN         3                                        1
top            NaN   Working  Zip works fine, fabric a bit discolored
freq           NaN        15                                       45
mean      3.000000       NaN                                      NaN
std       1.430194       NaN                                      NaN
min       1.000000       NaN                                      NaN
25%       2.000000       NaN                                      NaN
50%       3.000000       NaN                                      NaN
75%       4.000000       NaN                                      NaN
max       5.000000       NaN                                      NaN
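
The summary above reports a single unique `description` across all 45 Bags rows. A column with one distinct value carries no information for that item type, and such columns can be flagged programmatically. This is a minimal sketch on a hypothetical toy frame; the helper name `constant_columns` and the "Repairable" value are illustrative assumptions.

```python
import pandas as pd

def constant_columns(frame: pd.DataFrame) -> list:
    """List the columns that hold exactly one distinct non-null value."""
    distinct = frame.nunique(dropna=True)
    return distinct[distinct == 1].index.tolist()

# Hypothetical toy rows; the real check would run on bags_df.
toy = pd.DataFrame({
    "years_used": [1, 2, 3],
    "condition": ["Working", "Repairable", "Working"],
    "description": ["Zip works fine, fabric a bit discolored"] * 3,
})
flagged = constant_columns(toy)
print(flagged)
```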


# Generalized Analysis for All Item Types
The following cells will analyze the dataset for all item types, including sample counts, output class distributions, missing images, and summary statistics.

In [9]:
import pandas as pd
import os

df = pd.read_csv('../data/full_items_extended_dataset.csv')
image_dir = '../data/images/'

item_types = df['item_type'].unique()

for item in item_types:
    print(f'\n=== {item} ===')
    subset = df[df['item_type'] == item]
    print(f'Total samples: {len(subset)}')
    print('Output class distribution:')
    print(subset['output'].value_counts())
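    # Check for missing images. Match on file name only, as in the Bags check:
    # image_path values already start with 'images/', so joining them to image_dir
    # unchanged would look under '../data/images/images/'.
    missing_images = subset[~subset['image_path'].apply(lambda x: os.path.isfile(os.path.join(image_dir, os.path.basename(str(x)))))]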
    print(f'Missing images: {len(missing_images)}')
    if len(missing_images) > 0:
        print('Sample missing image paths:', missing_images['image_path'].head(3).tolist())


=== Blenders/Mixers ===
Total samples: 45
Output class distribution:
output
Refurbish and Resell    20
Salvage Components      15
Recycle                 10
Name: count, dtype: int64
Missing images: 45
Sample missing image paths: ['images/blendersmixers_working_low_1yrs.jpg', 'images/blendersmixers_working_low_2yrs.jpg', 'images/blendersmixers_working_low_3yrs.jpg']

=== Electric Kettles ===
Total samples: 45
Output class distribution:
output
Refurbish and Resell    20
Salvage Components      15
Recycle                 10
Name: count, dtype: int64
Missing images: 45
Sample missing image paths: ['images/electrickettles_working_low_1yrs.jpg', 'images/electrickettles_working_low_2yrs.jpg', 'images/electrickettles_working_low_3yrs.jpg']

=== Water Purifier ===
Total samples: 45
Output class distribution:
output
Refurbish and Resell    20
Salvage Components      15
Recycle                 10
Name: count, dtype: int64
Missing images: 45
Sample missing image paths: ['images/waterpurifier_wor
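
The per-type loop prints one block per item type, which is hard to compare across types. An alternative, sketched here on hypothetical toy rows, is a single contingency table of `item_type` against `output` built with `pd.crosstab`.

```python
import pandas as pd

# Hypothetical toy rows; the real table would be built from df.
toy = pd.DataFrame({
    "item_type": ["Bags", "Bags", "Electric Kettles", "Electric Kettles"],
    "output": ["Recycle", "Refurbish and Resell", "Recycle", "Recycle"],
})

# One row per item type, one column per output class, cells are sample counts.
counts = pd.crosstab(toy["item_type"], toy["output"])
print(counts)
```

Combinations that never occur appear as 0, so item types with a missing output class are visible at a glance.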

In [10]:
for item in item_types:
    print(f'\n=== {item} Summary Statistics ===')
    subset = df[df['item_type'] == item]
    print('years_used:')
    print(subset['years_used'].describe())
    print('condition:')
    print(subset['condition'].describe())
    # Description length in characters (missing descriptions count as 0)
    desc_len = subset['description'].fillna('').str.len()
    print('description length:')
    print(desc_len.describe())


=== Blenders/Mixers Summary Statistics ===
years_used:
count    45.000000
mean      3.000000
std       1.430194
min       1.000000
25%       2.000000
50%       3.000000
75%       4.000000
max       5.000000
Name: years_used, dtype: float64
condition:
count          45
unique          3
top       Working
freq           15
Name: condition, dtype: object
description length:
count    45.0
mean     40.0
std       0.0
min      40.0
25%      40.0
50%      40.0
75%      40.0
max      40.0
Name: description, dtype: float64

=== Electric Kettles Summary Statistics ===
years_used:
count    45.000000
mean      3.000000
std       1.430194
min       1.000000
25%       2.000000
50%       3.000000
75%       4.000000
max       5.000000
Name: years_used, dtype: float64
condition:
count          45
unique          3
top       Working
freq           15
Name: condition, dtype: object
description length:
count    45.0
mean     41.0
std       0.0
min      41.0
25%      41.0
50%      41.0
75%      41.0
max  
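
The per-type summaries above repeat the same statistics in separate blocks. A compact alternative is to compute them in one grouped table. This is a sketch on hypothetical toy rows; the output column names (`samples`, `years_mean`, `desc_len_min`, `desc_len_max`) are illustrative choices, not names from the notebook.

```python
import pandas as pd

# Hypothetical toy rows; the real summary would be built from df.
toy = pd.DataFrame({
    "item_type": ["Bags", "Bags", "Electric Kettles"],
    "years_used": [1, 5, 3],
    "description": ["Zip works fine", None, "Heats slowly"],
})

summary = (
    # Missing descriptions count as length 0, as in the loop above.
    toy.assign(desc_len=toy["description"].fillna("").str.len())
       .groupby("item_type")
       .agg(
           samples=("years_used", "size"),
           years_mean=("years_used", "mean"),
           desc_len_min=("desc_len", "min"),
           desc_len_max=("desc_len", "max"),
       )
)
print(summary)
```

A minimum and maximum description length that are equal within a type would point to the same constant-description pattern seen for Bags.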