# Mission 6: Feasibility Study of Product Classification Engine

## 1. Introduction
**Objective**: Evaluate the feasibility of automatic product classification using text descriptions and images for an e-commerce marketplace.

## 2. Data Overview
**Dataset Components**:
- Product descriptions (English text)
- Product images
- Category labels

In [1]:
import pandas as pd
import glob

# Find the Flipkart CSV file(s) in dataset/Flipkart with glob (sorted for a stable order)
csv_files = sorted(glob.glob('dataset/Flipkart/flipkart*.csv'))

# Load the first matching CSV file into a dataframe
df = pd.read_csv(csv_files[0])

# Display first few rows
df.head()

Unnamed: 0,uniq_id,crawl_timestamp,product_url,product_name,product_category_tree,pid,retail_price,discounted_price,image,is_FK_Advantage_product,description,product_rating,overall_rating,brand,product_specifications
0,55b85ea15a1536d46b7190ad6fff8ce7,2016-04-30 03:22:56 +0000,http://www.flipkart.com/elegance-polyester-mul...,Elegance Polyester Multicolor Abstract Eyelet ...,"[""Home Furnishing >> Curtains & Accessories >>...",CRNEG7BKMFFYHQ8Z,1899.0,899.0,55b85ea15a1536d46b7190ad6fff8ce7.jpg,False,Key Features of Elegance Polyester Multicolor ...,No rating available,No rating available,Elegance,"{""product_specification""=>[{""key""=>""Brand"", ""v..."
1,7b72c92c2f6c40268628ec5f14c6d590,2016-04-30 03:22:56 +0000,http://www.flipkart.com/sathiyas-cotton-bath-t...,Sathiyas Cotton Bath Towel,"[""Baby Care >> Baby Bath & Skin >> Baby Bath T...",BTWEGFZHGBXPHZUH,600.0,449.0,7b72c92c2f6c40268628ec5f14c6d590.jpg,False,Specifications of Sathiyas Cotton Bath Towel (...,No rating available,No rating available,Sathiyas,"{""product_specification""=>[{""key""=>""Machine Wa..."
2,64d5d4a258243731dc7bbb1eef49ad74,2016-04-30 03:22:56 +0000,http://www.flipkart.com/eurospa-cotton-terry-f...,Eurospa Cotton Terry Face Towel Set,"[""Baby Care >> Baby Bath & Skin >> Baby Bath T...",BTWEG6SHXTDB2A2Y,,,64d5d4a258243731dc7bbb1eef49ad74.jpg,False,Key Features of Eurospa Cotton Terry Face Towe...,No rating available,No rating available,Eurospa,"{""product_specification""=>[{""key""=>""Material"",..."
3,d4684dcdc759dd9cdf41504698d737d8,2016-06-20 08:49:52 +0000,http://www.flipkart.com/santosh-royal-fashion-...,SANTOSH ROYAL FASHION Cotton Printed King size...,"[""Home Furnishing >> Bed Linen >> Bedsheets >>...",BDSEJT9UQWHDUBH4,2699.0,1299.0,d4684dcdc759dd9cdf41504698d737d8.jpg,False,Key Features of SANTOSH ROYAL FASHION Cotton P...,No rating available,No rating available,SANTOSH ROYAL FASHION,"{""product_specification""=>[{""key""=>""Brand"", ""v..."
4,6325b6870c54cd47be6ebfbffa620ec7,2016-06-20 08:49:52 +0000,http://www.flipkart.com/jaipur-print-cotton-fl...,Jaipur Print Cotton Floral King sized Double B...,"[""Home Furnishing >> Bed Linen >> Bedsheets >>...",BDSEJTHNGWVGWWQU,2599.0,698.0,6325b6870c54cd47be6ebfbffa620ec7.jpg,False,Key Features of Jaipur Print Cotton Floral Kin...,No rating available,No rating available,Jaipur Print,"{""product_specification""=>[{""key""=>""Machine Wa..."


In [2]:
from src.classes.analyze_value_specifications import SpecificationsValueAnalyzer

analyzer = SpecificationsValueAnalyzer(df)
value_analysis = analyzer.get_top_values(top_keys=5, top_values=5)
value_analysis

| | key | value | count | percentage | total_occurrences |
|---|---|---|---|---|---|
| 0 | Type | Analog | 123 | 16.9 | 728 |
| 1 | Type | Mug | 74 | 10.16 | 728 |
| 2 | Type | Ethnic | 56 | 7.69 | 728 |
| 3 | Type | Wireless Without modem | 27 | 3.71 | 728 |
| 4 | Type | Religious Idols | 26 | 3.57 | 728 |
| 5 | Brand | Lapguard | 11 | 1.94 | 568 |
| 6 | Brand | PRINT SHAPES | 11 | 1.94 | 568 |
| 7 | Brand | Lal Haveli | 10 | 1.76 | 568 |
| 8 | Brand | Raymond | 8 | 1.41 | 568 |
| 9 | Brand | Aroma Comfort | 8 | 1.41 | 568 |


In [3]:

fig = analyzer.create_radial_icicle_chart(top_keys=10, top_values=20)
fig.show()

In [4]:
from src.classes.analyze_category_tree import CategoryTreeAnalyzer

# Create analyzer instance with your dataframe
category_analyzer = CategoryTreeAnalyzer(df)

# Create and display the radial category chart
fig = category_analyzer.create_radial_category_chart(max_depth=1)
fig.show()

In [12]:
# Create and display the radial category chart with up to 5 levels of the category tree
fig = category_analyzer.create_radial_category_chart3(max_depth=5)
fig.show()


## 3. Basic NLP Classification Feasibility Study

### 3.1 Text Preprocessing
**Steps**:
- Clean text data
- Remove stopwords
- Perform stemming/lemmatization
- Handle special characters
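
The internals of `TextPreprocessor` are not shown in this notebook. As a rough, standard-library-only sketch of the steps listed above, the pipeline might look like the following. The stopword list and the `naive_stem` helper are illustrative assumptions, not the project's implementation; a real run would more likely rely on a full stopword list and a proper stemmer or lemmatizer.

```python
import re

# Tiny illustrative stopword list (assumption); a real run would use a full list.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "with"}


def naive_stem(token):
    """Crude suffix stripping, used here as a stand-in for a real stemmer/lemmatizer."""
    for suffix in ("ing", "es", "s"):
        # Only strip when at least 3 characters remain, to avoid mangling short words.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, replace special characters with spaces, drop stopwords, then stem."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(naive_stem(t) for t in tokens)


print(preprocess("Set of 2 Cotton Bath Towels (Blue) for the Bathroom!"))
# -> set 2 cotton bath towel blue bathroom
```

Product names contain few stopwords, so the lowercasing and special-character handling are likely to matter more than stopword removal for this column.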

In [6]:
from src.classes.preprocess_text import TextPreprocessor

processor = TextPreprocessor()

# Single text stats
sample_text = df['product_name'].iloc[0]
stats = processor.get_preprocessing_stats(sample_text)
print("\nPreprocessing Statistics:")
for key, value in stats.items():
    print(f"{key}: {value}")

# Batch statistics
batch_stats = processor.get_batch_stats(df['product_name'].head())
print("\nBatch Statistics Summary:")
print(batch_stats.describe())

# Create lemmatized product names column
df['product_name_lemmatized'] = df['product_name'].apply(processor.preprocess)

# Display sample comparisons
comparison_df = pd.DataFrame({
    'Original': df['product_name'].head(),
    'Lemmatized': df['product_name_lemmatized'].head()
})


# Get processing statistics
total_words_before = df['product_name'].str.split().str.len().sum()
total_words_after = df['product_name_lemmatized'].str.split().str.len().sum()
reduction = ((total_words_before - total_words_after) / total_words_before) * 100

print("\nProcessing Statistics:")
print(f"Total words before: {total_words_before}")
print(f"Total words after: {total_words_after}")
print(f"Word reduction: {reduction:.2f}%")


print("Sample Text Processing Results:")
comparison_df


Preprocessing Statistics:
original_length: 58
processed_length: 58
original_words: 7
processed_words: 7
removed_stopwords: 0
stopwords_percentage: 0.0
reduction_percentage: 0.0
unique_words_original: 7
unique_words_processed: 7
sample_removed_words: []

Batch Statistics Summary:
       original_length  processed_length  original_words  processed_words  \
count         5.000000          5.000000        5.000000         5.000000   
mean         47.000000         47.000000        6.800000         6.800000   
std          15.795569         15.795569        1.923538         1.923538   
min          26.000000         26.000000        4.000000         4.000000   
25%          35.000000         35.000000        6.000000         6.000000   
50%          53.000000         53.000000        7.000000         7.000000   
75%          58.000000         58.000000        8.000000         8.000000   
max          63.000000         63.000000        9.000000         9.000000   

       removed_stopwords 

| | Original | Lemmatized |
|---|---|---|
| 0 | Elegance Polyester Multicolor Abstract Eyelet ... | elegance polyester multicolor abstract eyelet ... |
| 1 | Sathiyas Cotton Bath Towel | sathiyas cotton bath towel |
| 2 | Eurospa Cotton Terry Face Towel Set | eurospa cotton terry face towel set |
| 3 | SANTOSH ROYAL FASHION Cotton Printed King size... | santosh royal fashion cotton printed king size... |
| 4 | Jaipur Print Cotton Floral King sized Double B... | jaipur print cotton floral king sized double b... |


### 3.2 Basic Text Encoding
**Methods**:
- Bag of Words (BoW)
- TF-IDF Vectorization

In [7]:
from src.classes.encode_text import TextEncoder

# Initialize encoder once
encoder = TextEncoder()

# Fit and transform product names
encoding_results = encoder.fit_transform(df['product_name_lemmatized'])

# Get top features
top_features = encoder.get_top_features(n_features=15)
print("\nTop features by TF-IDF score:")
print(top_features)

# Create and display BoW plot
bow_fig = encoder.plot_bow_features(threshold=0.98)
print("\nBag of Words Feature Distribution:")
bow_fig.show()

# Create and display TF-IDF plot
tfidf_fig = encoder.plot_tfidf_features(threshold=0.98)
print("\nTF-IDF Feature Distribution:")
tfidf_fig.show()

# Show comparison
comparison_fig = encoder.plot_feature_comparison(threshold=0.98)
print("\nFeature Comparison:")
comparison_fig.show()

# Plot scatter comparison
scatter_fig = encoder.plot_scatter_comparison()
print("\nTF-IDF vs BoW Scatter Comparison:")
scatter_fig.show()


Top features by TF-IDF score:
      feature     score
75        set  0.063150
87      watch  0.061769
18         cm  0.060559
77  showpiece  0.058942
2      analog  0.055595
20      combo  0.050876
4        baby  0.045023
50        men  0.039020
16    ceramic  0.036586
54        mug  0.034973
36       girl  0.034170
12        boy  0.033933
90      woman  0.028947
22     cotton  0.028131
85        usb  0.027226

Bag of Words Feature Distribution:



TF-IDF Feature Distribution:



Feature Comparison:




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy




TF-IDF vs BoW Scatter Comparison:


### 3.3 Dimensionality Reduction & Visualization
**Analysis**:
- Apply PCA/t-SNE
- Visualize category distribution
- Evaluate cluster separation
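
The three analysis steps can be sketched with scikit-learn as follows. This is a minimal example on synthetic data from `make_blobs`, not the `DimensionalityReducer` implementation. The feature matrix here is dense; a sparse TF-IDF matrix may first need `.toarray()` or a sparse-friendly reducer such as `TruncatedSVD`.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

# Synthetic stand-in for a feature matrix with 3 known categories (assumption)
X, labels = make_blobs(n_samples=90, n_features=20, centers=3, random_state=0)

# Linear projection to 2 dimensions for plotting
pca_2d = PCA(n_components=2, random_state=0).fit_transform(X)

# Non-linear projection; perplexity must stay below the number of samples
tsne_2d = TSNE(n_components=2, perplexity=10, init="pca", random_state=0).fit_transform(X)

# Silhouette score of the true category labels in the original feature space:
# values near 1 mean tight, well-separated categories; values near 0 mean overlap
score = silhouette_score(X, labels)
print(f"silhouette score: {score:.3f}")
```

Computing the silhouette score on the original features rather than on the 2-D projection avoids judging separation by distortions the projection itself introduces.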

In [8]:
from src.classes.reduce_dimensions import DimensionalityReducer

# Initialize reducer
reducer = DimensionalityReducer()

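# Extract the top-level category from the category tree string, e.g.
# '["Home Furnishing >> Curtains & Accessories >> ..."]' -> 'Home Furnishing'
def extract_main_category(category_string):
    if pd.isna(category_string):
        return "Unknown"
    try:
        # The first '>>' segment still carries the leading '["', so strip it as well
        return category_string.split('>>')[0].strip(' []"')
    except AttributeError:
        # Non-string values have no .split()
        return "Unknown"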

# Extract main product category
df['main_category'] = df['product_category_tree'].apply(extract_main_category)

# Apply dimensionality reduction to TF-IDF matrix of product names
print("\nApplying PCA to product name features...")
pca_results = reducer.fit_transform_pca(encoder.tfidf_matrix)
pca_fig = reducer.plot_pca(labels=df['main_category'])
pca_fig.show()

print("\nApplying t-SNE to product name features...")
tsne_results = reducer.fit_transform_tsne(encoder.tfidf_matrix)
tsne_fig = reducer.plot_tsne(labels=df['main_category'])
tsne_fig.show()

# Evaluate separation based on categories
metrics = reducer.evaluate_separation(encoder.tfidf_matrix, df['main_category'])
print("\nCluster separation metrics:", metrics)


Applying PCA to product name features...



divide by zero encountered in matmul


overflow encountered in matmul


invalid value encountered in matmul



Applying t-SNE to product name features...




Cluster separation metrics: {'silhouette_score': np.float64(0.11137662887614042), 'num_clusters': 7}




## 4. Advanced NLP Classification Feasibility Study

### 4.1 Word Embeddings
**Approaches**:
- Word2Vec Implementation
- BERT Embeddings
- Universal Sentence Encoder
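
Whichever model supplies the word vectors, a product name has to become a single fixed-length vector before it can be clustered. Sentence-level encoders produce such a vector directly. With word-level vectors, one common approach is mean pooling, sketched below with NumPy. The 3-dimensional vectors are made up for illustration and stand in for what a trained Word2Vec model would provide.

```python
import numpy as np

# Toy word vectors (invented values; a trained model would supply real ones)
word_vectors = {
    "cotton": np.array([0.9, 0.1, 0.0]),
    "towel": np.array([0.8, 0.2, 0.1]),
    "analog": np.array([0.0, 0.9, 0.3]),
    "watch": np.array([0.1, 0.8, 0.4]),
}


def embed(text, vectors, dim=3):
    """Average the vectors of known words; out-of-vocabulary words are skipped."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


towel = embed("Cotton Towel", word_vectors)
bath = embed("cotton bath towel", word_vectors)  # "bath" is unknown and ignored
watch = embed("Analog Watch", word_vectors)

print("towel vs bath :", round(cosine(towel, bath), 3))
print("towel vs watch:", round(cosine(towel, watch), 3))
```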

### 4.2 Comparative Analysis
**Evaluation**:
- Compare embedding methods
- Analyze clustering quality
- Assess category separation
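
One way to run this comparison is to score every embedding matrix with the same two metrics: the silhouette score of the true categories, and the adjusted Rand index (ARI) between KMeans clusters and the true categories. The sketch below uses synthetic matrices; `compare_embeddings` and the two embedding names are hypothetical and only illustrate the procedure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score


def compare_embeddings(embeddings, labels, random_state=0):
    """Return silhouette and ARI for each named embedding matrix."""
    n_clusters = len(set(labels))
    results = {}
    for name, X in embeddings.items():
        kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
        predicted = kmeans.fit_predict(X)
        results[name] = {
            # How well the true categories separate in this feature space
            "silhouette": float(silhouette_score(X, labels)),
            # How closely unsupervised clusters match the true categories
            "ari": float(adjusted_rand_score(labels, predicted)),
        }
    return results


# Synthetic stand-ins: one embedding with clear structure, one that is pure noise
X_good, y = make_blobs(n_samples=120, centers=4, n_features=16, random_state=0)
X_noise = np.random.default_rng(0).normal(size=X_good.shape)

results = compare_embeddings({"separable": X_good, "noise": X_noise}, y)
for name, metrics in results.items():
    print(name, metrics)
```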

## 5. Basic Image Processing Classification Study

### 5.1 Image Preprocessing
**Steps**:
- Grayscale conversion
- Noise reduction
- Contrast enhancement
- Size normalization
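
The four steps above could be chained with Pillow roughly as follows. This is a sketch under assumptions: the median filter, histogram equalization and the 224x224 target size are illustrative choices, and the solid-colour test image stands in for a real product photo.

```python
from PIL import Image, ImageFilter, ImageOps


def preprocess_image(img, size=(224, 224)):
    """Grayscale -> median denoise -> histogram equalization -> resize."""
    gray = ImageOps.grayscale(img)
    denoised = gray.filter(ImageFilter.MedianFilter(size=3))
    equalized = ImageOps.equalize(denoised)
    return equalized.resize(size)


# Placeholder image; in practice this would be Image.open(<path to a product image>)
sample = Image.new("RGB", (320, 200), color=(120, 60, 30))
result = preprocess_image(sample)
print(result.mode, result.size)
```

Resizing last keeps the denoising and equalization working on the original pixels rather than on interpolated ones.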

### 5.2 Feature Extraction
**Methods**:
- SIFT implementation
- Feature detection
- Descriptor computation
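
SIFT yields a variable number of 128-dimensional descriptors per image (in OpenCV, typically via `cv2.SIFT_create().detectAndCompute(gray, None)`), so a further aggregation step is needed to get one feature vector per product. A common choice is a bag of visual words, sketched below. The random arrays are placeholders for real SIFT output, and the vocabulary size of 8 is an arbitrary assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Placeholder descriptors: one (n_keypoints, 128) array per image, the shape SIFT produces
descriptors_per_image = [rng.random((n, 128)).astype(np.float32) for n in (40, 55, 30)]

# Build the visual vocabulary by clustering all descriptors together
n_words = 8
codebook = KMeans(n_clusters=n_words, n_init=10, random_state=0)
codebook.fit(np.vstack(descriptors_per_image))


def bovw_histogram(descriptors, codebook, n_words):
    """Assign each descriptor to its nearest visual word and return normalized counts."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=n_words).astype(float)
    return hist / hist.sum()


# One fixed-length feature vector per image, ready for PCA/t-SNE or clustering
features = np.array([bovw_histogram(d, codebook, n_words) for d in descriptors_per_image])
print(features.shape)
```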

### 5.3 Analysis
**Evaluation**:
- Dimension reduction
- Cluster visualization
- Category separation assessment


## 6. Advanced Image Processing Study

### 6.1 Transfer Learning
**Implementation**:
- Pre-trained CNN setup
- Feature extraction
- Model adaptation

### 6.2 Comparative Analysis
**Evaluation**:
- Compare with basic methods
- Assess classification accuracy
- Analyze feature quality
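
To compare feature sets on classification accuracy, one simple protocol is to train the same lightweight classifier on each and report cross-validated accuracy. The sketch below uses logistic regression and synthetic features as stand-ins for the basic and CNN-derived features. The helper name `mean_cv_accuracy` and all data are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def mean_cv_accuracy(X, y, folds=5):
    """Mean accuracy of a logistic regression over stratified k-fold cross-validation."""
    clf = LogisticRegression(max_iter=1000)
    return float(cross_val_score(clf, X, y, cv=folds, scoring="accuracy").mean())


# Synthetic stand-ins: informative features versus pure noise for the same labels
X_informative, y = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3,
    n_clusters_per_class=1, class_sep=2.0, random_state=0,
)
X_random = np.random.default_rng(0).normal(size=X_informative.shape)

scores = {
    "informative features": mean_cv_accuracy(X_informative, y),
    "random features": mean_cv_accuracy(X_random, y),
}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

Holding the classifier fixed means any difference in accuracy can be attributed to the features rather than to the model.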

## 7. Final Feasibility Assessment

### 7.1 Performance Metrics
**Evaluation**:
- Clustering quality metrics
- Classification accuracy
- Processing time analysis
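
For the processing time analysis, each pipeline stage can be wrapped in a small timing helper. The following is a standard-library sketch; `time_stage` is a hypothetical helper, and sorting a list stands in for a real stage such as encoding or dimensionality reduction.

```python
import time


def time_stage(func, *args, repeats=3, **kwargs):
    """Run func several times; return (last result, best wall-clock time in seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return result, best


# Stand-in workload: sort 100,000 integers given in descending order
result, seconds = time_stage(sorted, range(100_000, 0, -1))
print(f"best of 3 runs: {seconds:.4f} s")
```

Taking the best of several runs reduces the influence of one-off delays such as caching or background activity.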

### 7.2 Recommendations
**Outcomes**:
- Feasibility conclusion
- Best approach identification
- Implementation suggestions


## 8. Future Improvements
- Scalability considerations
- Performance optimization
- Integration recommendations