# Summary of Hierarchical Classification Project

## I. Initial Analysis Phase

### A. Data Distribution Analysis
- **Level 1 (Cat1)**: 6 categories
- **Level 2 (Cat2)**: 64 categories
- **Level 3 (Cat3)**: 377 categories

### B. Rare Category Analysis
- **Cat2**: 22 of 64 categories are rare (34.38%)
- **Cat3**: 209 of 377 categories are rare (55.44%)
- **Threshold**: <10 samples considered rare
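
The rare-category counts above can be reproduced with a simple frequency count. The following is a minimal, dependency-free sketch; `rare_category_share` and the toy labels are hypothetical and only illustrate the "fewer than 10 samples" rule.

```python
from collections import Counter


def rare_category_share(labels, threshold=10):
    """Return (n_rare, n_total, percent_rare) for one category level.

    A category is rare when it has fewer than `threshold` samples.
    """
    counts = Counter(labels)
    n_total = len(counts)
    n_rare = sum(1 for n in counts.values() if n < threshold)
    pct = 100.0 * n_rare / n_total if n_total else 0.0
    return n_rare, n_total, round(pct, 2)


# Toy labels (hypothetical): "a" has 12 samples, "b" has 3, "c" has 9.
labels = ["a"] * 12 + ["b"] * 3 + ["c"] * 9
print(rare_category_share(labels))  # (2, 3, 66.67)
```

Applied to the Cat2 column, 22 rare categories out of 64 gives the 34.38% reported above.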

### C. Text Features Analysis
- Average title length: 43.71 characters
- Average text length: 313.60 characters
- Title-Text ratio: ~1:7

## II. Data Cleaning Implementation

### A. Category Merging Strategy
1. **Very Rare Categories** (<5 samples)
   - Direct merge to "other" category
   - Preserves parent category structure

2. **Rare Categories** (5-9 samples)
   - Similarity-based merging
   - Text and hierarchy consideration

3. **Moderate Categories** (10-20 samples)
   - Parent-based merging
   - Maintain category relationships
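
The three tiers amount to a lookup on the sample count. A minimal sketch, assuming a count of exactly 10 falls into the moderate tier and that anything above 20 is left untouched; `merge_strategy` and its return labels are hypothetical names, not part of the notebook.

```python
def merge_strategy(n_samples, very_rare_below=5, rare_below=10, moderate_upto=20):
    """Pick a merge strategy from the number of samples in a category."""
    if n_samples < very_rare_below:
        return "merge_to_other"        # very rare: direct merge into "<parent> other"
    if n_samples < rare_below:
        return "similarity_merge"      # rare: merge into the most similar sibling
    if n_samples <= moderate_upto:
        return "parent_based_merge"    # moderate: merge guided by the parent category
    return "keep"                      # frequent enough to stay as-is


for n in (3, 7, 15, 40):
    print(n, merge_strategy(n))
```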

### B. Similarity Calculation
- Text similarity (TF-IDF)
- Category name matching
- Parent category bonus
- Category-specific rules
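
The four signals are combined into a single score. The sketch below mirrors the weights used by `calculate_similarity_score` later in this notebook (0.6 for text, 0.2 for name overlap, a 0.2 parent bonus, and a 0.1 keyword bonus) but takes the TF-IDF text similarity as a precomputed input; `name_jaccard` and `combined_score` are hypothetical helper names.

```python
def name_jaccard(name_a, name_b):
    """Jaccard overlap between the words of two category names."""
    a, b = set(name_a.split()), set(name_b.split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def combined_score(text_sim, name_a, name_b, same_parent, allowed_terms=()):
    """Weighted sum of text similarity, name overlap, and the two bonuses."""
    keyword_match = any(t in name_a or t in name_b for t in allowed_terms)
    return (0.6 * text_sim
            + 0.2 * name_jaccard(name_a, name_b)
            + (0.2 if same_parent else 0.0)
            + (0.1 if keyword_match else 0.0))


# Hypothetical inputs: text similarity 0.5, one shared word out of three.
score = combined_score(0.5, "dog toys", "cat toys", True, ("pet",))
print(round(score, 3))  # 0.567
```

Because the bonuses are added on top of the weighted similarities, the score can reach 1.1, so the per-category thresholds are not on a strict 0-1 scale.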

### C. Results

#### Category Reduction
| Cat1 category | Original Cat3 count | Final Cat3 count | Reduction |
|---------------|---------------------|------------------|-----------|
| Pet supplies | 35 | 30 | -14% |
| Health personal care | 81 | 45 | -44% |
| Grocery gourmet food | 106 | 32 | -70% |
| Toys games | 151 | 67 | -56% |
| Beauty | 57 | 28 | -51% |
| Baby products | 86 | 33 | -62% |
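
The reduction column follows from the two counts. A minimal sketch, assuming the percentage is the relative change rounded to the nearest whole number; `reduction_pct` is a hypothetical helper.

```python
def reduction_pct(original, final):
    """Relative change from `original` to `final`, as a whole-number percentage."""
    return round(100.0 * (final - original) / original)


# (original, final) counts per top-level category, copied from the table above.
cat3_counts = {
    "pet supplies": (35, 30),
    "health personal care": (81, 45),
    "grocery gourmet food": (106, 32),
    "toys games": (151, 67),
    "beauty": (57, 28),
    "baby products": (86, 33),
}

for name, (original, final) in cat3_counts.items():
    print(f"{name}: {reduction_pct(original, final)}%")
```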

#### Merge Statistics
- **Total Merges**: 538
  - Very rare merges: 471 (87.5%)
  - No good match: 21 (3.9%)
  - Similarity-based: 46 (8.6%)
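
The merge report stores similarity merges with the score embedded in the label (for example `similar (score: 0.62)`), so the labels need normalizing before they can be tallied into the three groups above. A minimal sketch; `merge_type_shares` is a hypothetical helper, and the score value in the sample labels is a placeholder.

```python
import re
from collections import Counter


def merge_type_shares(merge_types):
    """Group merge labels, ignoring any '(score: ...)' suffix, and return
    {label: (count, percent_of_total)}."""
    normalized = [re.sub(r"\s*\(score:.*\)$", "", m) for m in merge_types]
    counts = Counter(normalized)
    total = sum(counts.values())
    return {label: (n, round(100.0 * n / total, 1)) for label, n in counts.items()}


# Counts taken from the statistics above; 0.55 is a placeholder score.
merge_types = (["other (very rare)"] * 471
               + ["other (no good match)"] * 21
               + ["similar (score: 0.55)"] * 46)
shares = merge_type_shares(merge_types)
for label, (n, pct) in shares.items():
    print(f"{label}: {n} ({pct}%)")
```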

## III. Key Decisions

### A. Category-Specific Rules
- Different similarity thresholds
- Custom minimum sample requirements
- Category-specific merge strategies
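
Per-category rules can be expressed as a dictionary with a fallback for categories that have no entry. A minimal sketch: the `toys games` values and the defaults match those used later in this notebook, while `rules_for` and the override-on-top-of-defaults behaviour are assumptions made for illustration.

```python
DEFAULT_RULES = {"min_samples": 10, "min_similarity": 0.5, "allowed_merges": []}

# One entry shown; the notebook's CATEGORY_RULES has one per top-level category.
RULES = {
    "toys games": {
        "min_samples": 8,
        "min_similarity": 0.4,
        "allowed_merges": ["toys", "games", "play"],
    },
}


def rules_for(cat1, rules=RULES):
    """Return the rules for a top-level category, filling gaps from the defaults."""
    return {**DEFAULT_RULES, **rules.get(cat1, {})}


print(rules_for("toys games")["min_samples"])         # 8
print(rules_for("some new category")["min_samples"])  # 10
```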

### B. Enhanced Similarity Matching
1. **Primary Metrics**
   - Text-based similarity
   - Hierarchical relationships
   - Product type matching

2. **Secondary Considerations**
   - Parent category preservation
   - Category name similarity
   - Semantic relationships

### C. Hierarchy Preservation
- Maintained parent-child relationships
- Prevented cross-category merges
- Preserved category meaning
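
One way to check that merging kept the hierarchy intact is to confirm that no Cat2 value ends up under more than one Cat1. A minimal sketch on plain tuples; `find_hierarchy_violations` and the sample rows are hypothetical.

```python
from collections import defaultdict


def find_hierarchy_violations(rows):
    """rows: iterable of (cat1, cat2, cat3) tuples.

    Returns {cat2: [cat1, ...]} for every Cat2 that appears under more
    than one Cat1. An empty dict means the hierarchy is consistent.
    """
    parents = defaultdict(set)
    for cat1, cat2, _cat3 in rows:
        parents[cat2].add(cat1)
    return {c2: sorted(p) for c2, p in parents.items() if len(p) > 1}


rows = [
    ("pet supplies", "dogs", "toys"),
    ("pet supplies", "cats", "cats other"),
    ("toys games", "dogs", "toys"),  # deliberately inconsistent row
]
print(find_hierarchy_violations(rows))
```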



In [None]:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from collections import defaultdict
import warnings
warnings.filterwarnings('ignore')


In [None]:
class HierarchicalDataCleaner:
    def __init__(self, df):
        """
        Initialize data cleaner with configuration parameters
        """
        self.df = df.copy()
        self.rare_threshold_cat2 = 20 
        self.rare_threshold_cat3 = 10 
        self.unknown_categories = {'unknown'}
        
    def clean_data(self):
        print("Starting data cleaning process...")
        self.handle_missing_values()
        self.clean_category_names()
        self.handle_unknown_categories()
        self.consolidate_rare_categories()
        
        return self.df
    
    def handle_missing_values(self):
        """
        Handle missing values in the dataset
        """
        print("\nHandling missing values...")
        
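        missing_stats = self.df.isnull().sum()
        # missing_stats has one entry per column, so test the total rather than its length
        if missing_stats.sum() == 0:
            print("No missing values found")
        else:
            print("Missing values before cleaning:")
            print(missing_stats[missing_stats > 0])
        
        # Assign the result back: fillna(inplace=True) on a column selection
        # may act on a temporary copy in recent pandas versions
        self.df['Title'] = self.df['Title'].fillna('')
        self.df['Text'] = self.df['Text'].fillna('')
        
        for col in ['Cat1', 'Cat2', 'Cat3']:
            self.df[col] = self.df[col].fillna('unknown')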
    
    def clean_category_names(self):
        """
        Clean and standardize category names
        """
        print("\nCleaning category names...")
        
        def clean_name(name):
            name = str(name).lower().strip()
            name = re.sub(r'[^\w\s]', ' ', name)
            name = re.sub(r'\s+', ' ', name)
            return name
        
        # Clean categories
        for col in ['Cat1', 'Cat2', 'Cat3']:
            self.df[col] = self.df[col].apply(clean_name)
            
        print("Unique categories after cleaning:")
        print(f"Cat1: {self.df['Cat1'].nunique()}")
        print(f"Cat2: {self.df['Cat2'].nunique()}")
        print(f"Cat3: {self.df['Cat3'].nunique()}")
    
    def handle_unknown_categories(self):
        """
        Handle 'unknown' categories based on similarity
        """
        print("\nHandling unknown categories...")
        
        self.df['combined_text'] = self.df['Title'] + ' ' + self.df['Text']
        unknown_mask = self.df['Cat3'] == 'unknown'
        unknown_samples = self.df[unknown_mask]
        
        if len(unknown_samples) > 0:
            print(f"Found {len(unknown_samples)} samples with unknown Cat3")
            
            # Create TF-IDF vectors
            tfidf = TfidfVectorizer(max_features=1000)
            known_texts = self.df[~unknown_mask]['combined_text']
            known_vectors = tfidf.fit_transform(known_texts)
            unknown_vectors = tfidf.transform(unknown_samples['combined_text'])
            
            similarities = cosine_similarity(unknown_vectors, known_vectors)
            most_similar_indices = similarities.argmax(axis=1)
            similar_categories = self.df[~unknown_mask].iloc[most_similar_indices]['Cat3'].values
            self.df.loc[unknown_mask, 'Cat3'] = similar_categories
            
            print(f"Reassigned {len(unknown_samples)} unknown categories")
    
    def consolidate_rare_categories(self):
        """
        Consolidate rare categories based on frequency and similarity
        """
        print("\nConsolidating rare categories...")
        
        cat2_counts = self.df['Cat2'].value_counts()
        cat3_counts = self.df['Cat3'].value_counts()
        
        rare_cat2 = set(cat2_counts[cat2_counts < self.rare_threshold_cat2].index)
        rare_cat3 = set(cat3_counts[cat3_counts < self.rare_threshold_cat3].index)
        
        print(f"Found {len(rare_cat2)} rare Cat2 categories")
        print(f"Found {len(rare_cat3)} rare Cat3 categories")
        
        # Consolidate based on parent-child relationships
        for cat2 in rare_cat2:
            parent_cat1 = self.df[self.df['Cat2'] == cat2]['Cat1'].mode()[0]
            similar_cat2 = self._find_similar_category(cat2, 'Cat2', parent_cat1)
            self.df.loc[self.df['Cat2'] == cat2, 'Cat2'] = similar_cat2
        
        for cat3 in rare_cat3:
            parent_cat2 = self.df[self.df['Cat3'] == cat3]['Cat2'].mode()[0]
            similar_cat3 = self._find_similar_category(cat3, 'Cat3', parent_cat2)
            self.df.loc[self.df['Cat3'] == cat3, 'Cat3'] = similar_cat3
    
    def _find_similar_category(self, category, level, parent_category):
        """
        Find similar category for merging based on text similarity
        """
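        parent_level = f'Cat{int(level[-1]) - 1}'
        parent_samples = self.df[self.df[parent_level] == parent_category]
        rare_samples = self.df[self.df[level] == category]
        # Exclude the rare category itself; otherwise its own samples
        # would be the closest match and nothing would be merged
        candidates = parent_samples[parent_samples[level] != category]
        
        if len(rare_samples) == 0 or len(candidates) == 0:
            return category
        
        tfidf = TfidfVectorizer(max_features=1000)
        candidate_vectors = tfidf.fit_transform(candidates['combined_text'])
        rare_vectors = tfidf.transform(rare_samples['combined_text'])
        # Mean TF-IDF vector of the rare category as a 1 x n_features array
        rare_vectors_mean = np.asarray(rare_vectors.mean(axis=0)).reshape(1, -1)
        
        # cosine_similarity accepts sparse input, so no .toarray() is needed
        similarities = cosine_similarity(rare_vectors_mean, candidate_vectors)
        most_similar_idx = similarities.argmax()
        
        return candidates.iloc[most_similar_idx][level]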
    

In [None]:
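# df must exist before this cell runs; load it from the same file used elsewhere in the notebook
df = pd.read_csv('data.csv')
cleaner = HierarchicalDataCleaner(df)
cleaned_df = cleaner.clean_data()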


Starting data cleaning process...

Handling missing values...
Missing values before cleaning:
Title    5
dtype: int64

Cleaning category names...
Unique categories after cleaning:
Cat1: 6
Cat2: 64
Cat3: 377

Handling unknown categories...
Found 487 samples with unknown Cat3
Reassigned 487 unknown categories

Consolidating rare categories...
Found 12 rare Cat2 categories
Found 202 rare Cat3 categories


In [None]:
df = pd.read_csv('data.csv')

In [20]:
cleaned_df

Unnamed: 0,productId,Title,userId,Time,Text,Cat1,Cat2,Cat3,combined_text
0,B0002AQK70,PetSafe Staywell Pet Door with Clear Hard Flap,A2L6QTQQI13LZG,1344211200,We've only had it installed about 2 weeks. So ...,pet supplies,cats,cat flaps,PetSafe Staywell Pet Door with Clear Hard Flap...
1,B0002DK8OI,"Kaytee Timothy Cubes, 1-Pound",A2HJUOZ9R9K4F,1344211200,My bunny had a hard time eating this because t...,pet supplies,bunny rabbit central,food,"Kaytee Timothy Cubes, 1-Pound My bunny had a h..."
2,B0006VJ6TO,Body Back Buddy,A14PK96LL78NN3,1344211200,would never in a million years have guessed th...,health personal care,health care,massage relaxation,Body Back Buddy would never in a million years...
3,B000EZSFXA,SnackMasters California Style Turkey Jerky,A2UW73HU9UMOTY,1344211200,"Being the jerky fanatic I am, snackmasters han...",grocery gourmet food,snack food,jerky dried meats,SnackMasters California Style Turkey Jerky Bei...
4,B000KV61FC,Premier Busy Buddy Tug-a-Jug Treat Dispensing ...,A1Q99RNV0TKW8R,1344211200,Wondered how quick my dog would catch on to th...,pet supplies,dogs,toys,Premier Busy Buddy Tug-a-Jug Treat Dispensing ...
...,...,...,...,...,...,...,...,...,...
9995,B000FGDDI0,Sunbeam 732-500 King Size Heating Pad with Ult...,A3RUBUKF0YX4C7,1362182400,Stays on continuously without shutting off! It...,health personal care,health care,pain relievers,Sunbeam 732-500 King Size Heating Pad with Ult...
9996,B000FVC78C,Reef One Biorb Easy Plants,A1O9H18FJG81FS,1362182400,these look great in our 10 gallon tank- colors...,pet supplies,fish aquatic pets,aquarium d cor,Reef One Biorb Easy Plants these look great in...
9997,B000ICJ8DA,Snoozer Lookout II Pet Car Seat,A3D96MTZP9C1Y,1362182400,"This works great, but needs a better way to at...",pet supplies,dogs,carriers travel products,Snoozer Lookout II Pet Car Seat This works gre...
9998,B000Q7AH3W,Omega Paw Tricky Treat Ball,A37L6DBOH234BC,1362182400,she absolutely LOVES this thing. I dice up gre...,pet supplies,dogs,toys,Omega Paw Tricky Treat Ball she absolutely LOV...


In [None]:
def merge_rare_categories(df, min_samples=10):
    df = df.copy()
    # Build the combined text column if the frame has not been through HierarchicalDataCleaner
    if 'combined_text' not in df.columns:
        df['combined_text'] = df['Title'].fillna('') + ' ' + df['Text'].fillna('')
    
    # First, analyze category structure
    cat_structure = {}
    for cat1 in df.Cat1.unique():
        cat_structure[cat1] = {}
        for cat2 in df[df.Cat1 == cat1].Cat2.unique():
            cat_structure[cat1][cat2] = df[
                (df.Cat1 == cat1) & 
                (df.Cat2 == cat2)
            ].Cat3.value_counts()
    
    # Process each parent category separately
    for cat1 in cat_structure:
        print(f"\nProcessing {cat1}...")
        
        for cat2, cat3_counts in cat_structure[cat1].items():
            rare_cats = cat3_counts[cat3_counts < min_samples].index
            
            if len(rare_cats) == 0:
                continue
                
            print(f"\n  Processing {len(rare_cats)} rare categories under '{cat2}'")
            
            non_rare = cat3_counts[cat3_counts >= min_samples].index
            # Rows under the current parent; Cat3 names may repeat under other parents
            in_parent = (df.Cat1 == cat1) & (df.Cat2 == cat2)
            
            if len(non_rare) == 0:
                new_cat = f"{cat2} other"
                df.loc[in_parent & df.Cat3.isin(rare_cats), 'Cat3'] = new_cat
                print(f"    Merged all rare categories to '{new_cat}'")
                continue
            
            for rare_cat in rare_cats:
                rare_mask = in_parent & (df.Cat3 == rare_cat)
                rare_samples = df[rare_mask]
                
                if len(rare_samples) < 5:
                    new_cat = f"{cat2} other"
                else:
                    try:
                        # The column is named 'combined_text' (not 'text_combined')
                        rare_text = ' '.join(rare_samples['combined_text'])
                        non_rare_text = df[
                            in_parent & 
                            (df.Cat3.isin(non_rare))
                        ].groupby('Cat3')['combined_text'].agg(' '.join)
                        
                        tfidf = TfidfVectorizer(max_features=1000)
                        non_rare_vecs = tfidf.fit_transform(non_rare_text)
                        rare_vec = tfidf.transform([rare_text])
                        
                        sims = cosine_similarity(rare_vec, non_rare_vecs)[0]
                        
                        if sims.max() > 0.3:  # Minimum similarity required to merge
                            new_cat = non_rare_text.index[sims.argmax()]
                        else:
                            new_cat = f"{cat2} other"
                    except ValueError:  # e.g. empty TF-IDF vocabulary
                        new_cat = f"{cat2} other"
                
                # Relabel only the rows under this parent
                df.loc[rare_mask, 'Cat3'] = new_cat
                print(f"    Merged '{rare_cat}' ({len(rare_samples)} samples) -> '{new_cat}'")
    
    return df

In [None]:

CATEGORY_RULES = {
    'health personal care': {
        'min_samples': 15,
        'min_similarity': 0.6,
        'allowed_merges': ['health', 'medical', 'personal care']
    },
    'toys games': {
        'min_samples': 8,
        'min_similarity': 0.4,
        'allowed_merges': ['toys', 'games', 'play']
    },
    'grocery gourmet food': {
        'min_samples': 10,
        'min_similarity': 0.5,
        'allowed_merges': ['food', 'grocery', 'snack', 'beverage']
    },
    'beauty': {
        'min_samples': 12,
        'min_similarity': 0.55,
        'allowed_merges': ['beauty', 'skin', 'hair', 'makeup']
    },
    'pet supplies': {
        'min_samples': 8,
        'min_similarity': 0.45,
        'allowed_merges': ['pet', 'animal', 'supplies']
    },
    'baby products': {
        'min_samples': 10,
        'min_similarity': 0.5,
        'allowed_merges': ['baby', 'infant', 'child']
    }
}

In [None]:
def clean_text(text):
    """Enhanced text cleaning"""
    if pd.isna(text): 
        return ''
    text = str(text).lower().strip()
    text = re.sub(r'[^\w\s]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text

In [None]:
def calculate_similarity_score(df, cat_a, cat_b, cat_type):
    """
    Calculate a similarity score between two Cat3 categories.

    Combines TF-IDF text similarity, category-name overlap, a bonus for
    sharing a parent (Cat2), and a bonus for Cat1-specific keywords.
    cat_a and cat_b are Cat3 names; cat_type is the Cat1 name.
    """
    a_products = df[df['Cat3'] == cat_a]
    b_products = df[df['Cat3'] == cat_b]
    
    if len(a_products) == 0 or len(b_products) == 0:
        return 0.0
    
    def build_text(products):
        # Repeat the title, with a separating space, so title words count twice;
        # fill NaN so that ' '.join only ever sees strings
        titles = products['Title'].fillna('')
        texts = products['Text'].fillna('')
        return ' '.join(titles + ' ' + titles + ' ' + texts)
    
    a_text = build_text(a_products)
    b_text = build_text(b_products)
    
    try:
        tfidf = TfidfVectorizer(max_features=1000)
        vectors = tfidf.fit_transform([a_text, b_text])
        text_similarity = cosine_similarity(vectors)[0][1]
    except ValueError:  # raised when the vocabulary is empty
        text_similarity = 0.0
    
    # Jaccard overlap of the words in the two category names
    a_words, b_words = set(cat_a.split()), set(cat_b.split())
    union = a_words | b_words
    name_similarity = len(a_words & b_words) / len(union) if union else 0.0
    
    same_parent = a_products['Cat2'].iloc[0] == b_products['Cat2'].iloc[0]
    parent_bonus = 0.2 if same_parent else 0.0
    
    rules = CATEGORY_RULES.get(cat_type, {
        'min_similarity': 0.5,
        'allowed_merges': []
    })
    
    allowed_terms = rules['allowed_merges']
    keyword_match = any(term in cat_a or term in cat_b for term in allowed_terms)
    keyword_bonus = 0.1 if keyword_match else 0.0
    # Weighted sum; with both bonuses the score can reach 1.1
    final_score = (0.6 * text_similarity + 
                  0.2 * name_similarity + 
                  parent_bonus + 
                  keyword_bonus)
    
    return final_score


In [None]:

def find_best_merge_target(df, rare_cat, cat_type, existing_cats, min_similarity):
    """
    Find the best category to merge with
    """
    similarities = []
    
    for target_cat in existing_cats:
        if target_cat.endswith('other'):
            continue
            
        similarity = calculate_similarity_score(df, rare_cat, target_cat, cat_type)
        similarities.append((target_cat, similarity))
    

    similarities.sort(key=lambda x: x[1], reverse=True)
    if similarities and similarities[0][1] >= min_similarity:
        return similarities[0]
    
    return None, 0.0

In [None]:


def merge_rare_categories(df, verbose=True):
    """
    Improved category merging with better handling
    """
    if verbose:
        print("Starting improved category merging...")
    df = df.copy()
    merges = []
    for cat_type in df['Cat1'].unique():
        if verbose:
            print(f"\nProcessing {cat_type}...")
        
        rules = CATEGORY_RULES.get(cat_type, {
            'min_samples': 10,
            'min_similarity': 0.5,
            'allowed_merges': []
        })
        
        for cat2 in df[df['Cat1'] == cat_type]['Cat2'].unique():
            # Get category counts
            cat3_counts = df[
                (df['Cat1'] == cat_type) & 
                (df['Cat2'] == cat2)
            ]['Cat3'].value_counts()
        
            rare_cats = cat3_counts[cat3_counts < rules['min_samples']]
            
            if len(rare_cats) == 0:
                continue
                
            if verbose:
                print(f"\n  Processing {len(rare_cats)} rare categories under '{cat2}'")
            
            non_rare_cats = set(cat3_counts[cat3_counts >= rules['min_samples']].index)
            
            for rare_cat in rare_cats.index:
                n_samples = rare_cats[rare_cat]
                if n_samples < 5:
                    new_cat = f"{cat2} other"
                    merge_type = "other (very rare)"
                else:
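                    # Compare only within the current parent, so same-named Cat3
                    # values under other parents are not mixed into the texts
                    parent_df = df[(df['Cat1'] == cat_type) & (df['Cat2'] == cat2)]
                    similar_cat, similarity = find_best_merge_target(
                        parent_df, rare_cat, cat_type, non_rare_cats, 
                        rules['min_similarity']
                    )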
                    
                    if similar_cat:
                        new_cat = similar_cat
                        merge_type = f"similar (score: {similarity:.2f})"
                    else:
                        new_cat = f"{cat2} other"
                        merge_type = "other (no good match)"
                
                df.loc[
                    (df['Cat1'] == cat_type) & 
                    (df['Cat2'] == cat2) & 
                    (df['Cat3'] == rare_cat),
                    'Cat3'
                ] = new_cat
                
                merges.append({
                    'cat1': cat_type,
                    'cat2': cat2,
                    'old_cat3': rare_cat,
                    'new_cat3': new_cat,
                    'samples': n_samples,
                    'merge_type': merge_type
                })
                
                if verbose:
                    print(f"    Merged '{rare_cat}' ({n_samples} samples) -> '{new_cat}' [{merge_type}]")
    
    merge_df = pd.DataFrame(merges)
    
    if verbose:
        print("\nMerge Summary:")
        print(f"Total merges: {len(merges)}")
        print("\nMerge types:")
        print(merge_df['merge_type'].value_counts())
        
        print("\nFinal category counts:")
        for cat1 in df['Cat1'].unique():
            n_cat2 = len(df[df['Cat1'] == cat1]['Cat2'].unique())
            n_cat3 = len(df[df['Cat1'] == cat1]['Cat3'].unique())
            print(f"{cat1}: {n_cat2} Cat2, {n_cat3} Cat3")
    
    return df, merge_df

print("Initial category counts:")
for cat1 in df['Cat1'].unique():
    n_cat2 = len(df[df['Cat1'] == cat1]['Cat2'].unique())
    n_cat3 = len(df[df['Cat1'] == cat1]['Cat3'].unique())
    print(f"{cat1}: {n_cat2} Cat2, {n_cat3} Cat3")

cleaned_df, merge_report = merge_rare_categories(df)
cleaned_df.to_csv('cleaned_categories_improved.csv', index=False)
merge_report.to_csv('category_merges_report.csv', index=False)


Initial category counts:
pet supplies: 6 Cat2, 35 Cat3
health personal care: 7 Cat2, 81 Cat3
grocery gourmet food: 16 Cat2, 106 Cat3
toys games: 17 Cat2, 151 Cat3
beauty: 6 Cat2, 57 Cat3
baby products: 12 Cat2, 86 Cat3
Starting improved category merging...

Processing pet supplies...

  Processing 1 rare categories under 'cats'
    Merged 'collars' (3 samples) -> 'cats other' [other (very rare)]

  Processing 5 rare categories under 'bunny rabbit central'
    Merged 'food' (3 samples) -> 'bunny rabbit central other' [other (very rare)]
    Merged 'rabbit hutches' (3 samples) -> 'bunny rabbit central other' [other (very rare)]
    Merged 'carriers' (2 samples) -> 'bunny rabbit central other' [other (very rare)]
    Merged 'feeding watering supplies' (1 samples) -> 'bunny rabbit central other' [other (very rare)]
    Merged 'treats' (1 samples) -> 'bunny rabbit central other' [other (very rare)]

  Processing 2 rare categories under 'birds'
    Merged 'health supplies' (1 samples) -> 'bi