# Kaggle Skincare Dataset Manager
## Setup Instructions

Before running the main code, you need to authenticate with the Kaggle API. Choose one of the methods below:

## Method 1: Using kaggle.json (Recommended)

1. Go to https://www.kaggle.com/settings/account
2. Click "Create New API Token" - this downloads `kaggle.json`
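3. Save it to `C:\Users\<your-username>\.kaggle\kaggle.json` on Windows (for example `C:\Users\B H\.kaggle\kaggle.json`), or to `~/.kaggle/kaggle.json` on Linux/macOS
4. Make sure only your user account can read the file; on Linux/macOS, run `chmod 600 ~/.kaggle/kaggle.json`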
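
The checks in the steps above can be sketched as a small helper. This is only an illustrative sketch, not part of the Kaggle client: the function name `check_kaggle_json` is hypothetical, and it assumes the token file is a JSON object with `username` and `key` fields stored in the `.kaggle` folder of your home directory.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_json(path=None):
    """Return a list of problems found with a kaggle.json file (empty list = looks fine)."""
    # Assumed default location: the .kaggle folder in the user's home directory
    path = Path(path) if path is not None else Path.home() / ".kaggle" / "kaggle.json"

    if not path.is_file():
        return [f"{path} does not exist"]

    try:
        creds = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:  # unreadable file or invalid JSON
        return [f"could not read {path}: {exc}"]

    if not isinstance(creds, dict):
        return [f"{path} does not contain a JSON object"]

    problems = []
    for field in ("username", "key"):
        if not creds.get(field):
            problems.append(f"missing or empty '{field}' field")

    # POSIX only: the token should not be readable by group or others
    if os.name == "posix" and stat.S_IMODE(path.stat().st_mode) & 0o077:
        problems.append(f"permissions too open; consider: chmod 600 {path}")

    return problems
```

Calling `check_kaggle_json()` with no arguments inspects the default location; an empty list means none of these checks found a problem.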

## Method 2: Using Environment Variables

If you don't have kaggle.json, use the cell below to set environment variables instead.

In [2]:
import os
from pathlib import Path

# ============================================
# KAGGLE AUTHENTICATION SETUP
# ============================================
# Use ONE of the methods below:

# METHOD 1: Using kaggle.json file
# (No code needed: credentials are auto-detected from C:\Users\B H\.kaggle\kaggle.json.
#  If you use this method, comment out the two os.environ lines below.)

# METHOD 2: Using Environment Variables
# Replace with your actual Kaggle username and API key
# Get these from: https://www.kaggle.com/settings/account

os.environ['KAGGLE_USERNAME'] = 'Marwa-001'
os.environ['KAGGLE_KEY'] = 'YOUR_KAGGLE_API_KEY'

print("✅ Kaggle authentication configured!")
print("   Method: Environment Variables")
print(f"   Username: {os.environ.get('KAGGLE_USERNAME', 'Not set')}")


✅ Kaggle authentication configured!
   Method: Environment Variables
   Username: Marwa-001
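
The cell above writes the API key directly into the notebook source. As a hedged alternative, the sketch below validates the values before storing them and rejects the `YOUR_KAGGLE_API_KEY` placeholder. The helper name `configure_kaggle_env` and its `env` parameter are hypothetical and exist only for illustration; only the `KAGGLE_USERNAME` and `KAGGLE_KEY` variable names come from the notebook.

```python
import os

# Placeholder string used in the authentication cell of this notebook
PLACEHOLDER_KEY = "YOUR_KAGGLE_API_KEY"


def configure_kaggle_env(username, key, env=None):
    """Validate Kaggle credentials and store them in an environment mapping."""
    # Default to the real process environment; a plain dict can be passed for testing
    env = os.environ if env is None else env

    username = (username or "").strip()
    key = (key or "").strip()

    if not username:
        raise ValueError("Kaggle username is empty")
    if not key or key == PLACEHOLDER_KEY:
        raise ValueError("Kaggle API key is empty or still the placeholder")

    env["KAGGLE_USERNAME"] = username
    env["KAGGLE_KEY"] = key
    return env
```

In a notebook, one way to avoid saving the key in the file is to read it interactively, for example `configure_kaggle_env(input("Kaggle username: "), getpass.getpass("Kaggle API key: "))` after `import getpass`.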


In [3]:
import pandas as pd
import numpy as np
import os
import requests
import json
from pathlib import Path
import re
from datetime import datetime

class SkincareDatasetManager:
    def __init__(self, output_dir='skincare_datasets'):
        """Initialize the dataset manager"""
        self.output_dir = Path(output_dir)
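        # parents=True lets a nested output_dir (for example 'data/skincare') be created in one call
        self.output_dir.mkdir(parents=True, exist_ok=True)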

        # Initialize Kaggle API safely (do NOT import kaggle at module import time)
        self.api = None

        # Prefer environment variables if set, then fall back to kaggle.json
        kaggle_username = os.environ.get('KAGGLE_USERNAME')
        kaggle_key = os.environ.get('KAGGLE_KEY')
        kaggle_json = Path.home() / '.kaggle' / 'kaggle.json'

        if kaggle_username and kaggle_key:
            auth_source = "environment variables"
        elif kaggle_json.exists():
            auth_source = str(kaggle_json)
        else:
            auth_source = None

        if auth_source is None:
            print("⚠️  Could not find Kaggle credentials. Dataset download will be skipped unless you configure auth.")
            print("   See: https://github.com/Kaggle/kaggle-api/")
        else:
            try:
                # Imported lazily so that an import or authentication error is caught below
                from kaggle.api.kaggle_api_extended import KaggleApi
                self.api = KaggleApi()
                self.api.authenticate()
                print(f"✅ Authenticated to Kaggle using {auth_source}")
            except Exception as e:
                print(f"❌ Kaggle authentication ({auth_source}) failed: {e}")
                self.api = None

        # Dataset sources from Kaggle
        self.kaggle_datasets = {
            'sephora': 'raghadalharbi/all-products-available-on-sephora-website',
            'amazon_beauty': 'skillsmuggler/amazon-ratings',
            'cosmetics': 'kingabzpro/cosmetics-datasets',
            'skincare_reviews': 'mrmars1010/skincare-reviews',
            'makeup_products': 'shudhanshusingh/25000-makeup-products-with-ingredients'
        }
    
    def download_kaggle_datasets(self):
        """Download datasets from Kaggle"""
        if self.api is None:
            print("⚠️  Kaggle API not configured; skipping downloads. Configure kaggle.json or environment variables and re-run the auth cell.")
            return

        print("🔽 Downloading datasets from Kaggle...")
        
        for name, dataset_path in self.kaggle_datasets.items():
            try:
                dataset_dir = self.output_dir / name
                dataset_dir.mkdir(exist_ok=True)
                
                print(f"\n📦 Downloading {name}...")
                self.api.dataset_download_files(
                    dataset_path,
                    path=str(dataset_dir),
                    unzip=True,
                    quiet=False
                )
                print(f"✅ {name} downloaded successfully!")

            except Exception as e:
                print(f"❌ Error downloading {name}: {e}")
                print(f"   Please manually download from: https://www.kaggle.com/datasets/{dataset_path}")
    
    def load_and_explore_datasets(self):
        """Load all downloaded datasets and show their structure"""
        print("\n📊 Loading and exploring datasets...")
        
        datasets = {}
        
        for name in self.kaggle_datasets.keys():
            dataset_dir = self.output_dir / name
            if not dataset_dir.exists():
                print(f"⚠️  {name} directory not found. Skipping...")
                continue
            
            # Find CSV files in the directory
            csv_files = list(dataset_dir.glob('*.csv'))
            
            if csv_files:
                print(f"\n📁 {name.upper()}:")
                for csv_file in csv_files:
                    try:
                        df = pd.read_csv(csv_file, nrows=5)  # Load first 5 rows to preview
                        print(f"   File: {csv_file.name}")
                        print(f"   Columns: {list(df.columns)}")
                        print(f"   Shape: {df.shape}")
                        datasets[f"{name}_{csv_file.stem}"] = csv_file
                    except Exception as e:
                        print(f"   ❌ Error reading {csv_file.name}: {e}")
        
        return datasets
    
    def create_master_product_dataset(self):
        """Merge product datasets into a master product database"""
        print("\n🔨 Creating master product dataset...")
        
        all_products = []
        
        # Process Sephora dataset
        sephora_path = self.output_dir / 'sephora'
        if sephora_path.exists():
            for csv_file in sephora_path.glob('*.csv'):
                try:
                    df = pd.read_csv(csv_file)
                    
                    # Standardize column names
                    product_df = pd.DataFrame({
                        'product_name': df.get('product_name', df.get('name', df.get('Product', ''))),
                        'brand': df.get('brand_name', df.get('brand', df.get('Brand', ''))),
                        'category': df.get('category', df.get('primary_category', '')),
                        'ingredients': df.get('ingredients', df.get('ingredient_list', '')),
                        'price': df.get('price', df.get('price_usd', np.nan)),
                        'rating': df.get('rating', df.get('reviews', np.nan)),
                        'source': 'sephora'
                    })
                    
                    all_products.append(product_df)
                    print(f"✅ Processed Sephora: {len(product_df)} products")
                except Exception as e:
                    print(f"❌ Error processing Sephora file: {e}")
        
        # Process Cosmetics dataset
        cosmetics_path = self.output_dir / 'cosmetics'
        if cosmetics_path.exists():
            for csv_file in cosmetics_path.glob('*.csv'):
                try:
                    df = pd.read_csv(csv_file)
                    
                    product_df = pd.DataFrame({
                        'product_name': df.get('Label', df.get('name', df.get('product_name', ''))),
                        'brand': df.get('Brand', df.get('brand', '')),
                        'category': df.get('Category', df.get('category', '')),
                        'ingredients': df.get('Ingredients', df.get('ingredients', '')),
                        'price': df.get('Price', df.get('price', np.nan)),
                        'rating': np.nan,
                        'source': 'cosmetics'
                    })
                    
                    all_products.append(product_df)
                    print(f"✅ Processed Cosmetics: {len(product_df)} products")
                except Exception as e:
                    print(f"❌ Error processing Cosmetics file: {e}")
        
        # Process Makeup Products dataset
        makeup_path = self.output_dir / 'makeup_products'
        if makeup_path.exists():
            for csv_file in makeup_path.glob('*.csv'):
                try:
                    df = pd.read_csv(csv_file)
                    
                    product_df = pd.DataFrame({
                        'product_name': df.get('product_name', df.get('name', '')),
                        'brand': df.get('brand', df.get('brand_name', '')),
                        'category': df.get('product_type', df.get('category', '')),
                        'ingredients': df.get('ingredients', ''),
                        'price': df.get('price', np.nan),
                        'rating': df.get('rating', np.nan),
                        'source': 'makeup'
                    })
                    
                    all_products.append(product_df)
                    print(f"✅ Processed Makeup: {len(product_df)} products")
                except Exception as e:
                    print(f"❌ Error processing Makeup file: {e}")
        
        # Combine all products
        if all_products:
            master_products = pd.concat(all_products, ignore_index=True)
            
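            # Clean and deduplicate: drop rows with a missing or empty product name,
            # then keep the first occurrence of each (product_name, brand) pair
            master_products = master_products[master_products['product_name'].notna()]
            master_products = master_products[master_products['product_name'].astype(str).str.strip() != '']
            master_products = master_products.drop_duplicates(subset=['product_name', 'brand'])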
            
            # Save master product dataset
            output_path = self.output_dir / 'master_products.csv'
            master_products.to_csv(output_path, index=False)
            
            print("\n✨ Master product dataset created!")
            print(f"   Total products: {len(master_products)}")
            print(f"   Saved to: {output_path}")
            
            return master_products
        else:
            print("⚠️  No product data found to merge")
            return None
    
    def create_master_reviews_dataset(self):
        """Merge review datasets into a master reviews database"""
        print("\n🔨 Creating master reviews dataset...")
        
        all_reviews = []
        
        # Process Amazon Beauty reviews
        amazon_path = self.output_dir / 'amazon_beauty'
        if amazon_path.exists():
            for csv_file in amazon_path.glob('*.csv'):
                try:
                    df = pd.read_csv(csv_file, nrows=50000)  # Limit to 50k reviews per file
                    
                    reviews_df = pd.DataFrame({
                        # ProductId (when present) is used as the product name for this source
                        'product_name': df.get('ProductId', df.get('product_name', '')),
                        'user_id': df.get('UserId', df.get('user_id', '')),
                        'rating': df.get('Score', df.get('rating', df.get('Rating', np.nan))),
                        'review_text': df.get('Text', df.get('review', df.get('review_text', ''))),
                        'review_summary': df.get('Summary', df.get('summary', '')),
                        'helpful_votes': df.get('HelpfulnessNumerator', 0),
                        'timestamp': df.get('Time', df.get('timestamp', '')),
                        'source': 'amazon'
                    })
                    
                    all_reviews.append(reviews_df)
                    print(f"✅ Processed Amazon reviews: {len(reviews_df)} reviews")
                except Exception as e:
                    print(f"❌ Error processing Amazon reviews: {e}")
        
        # Process Skincare Reviews
        skincare_reviews_path = self.output_dir / 'skincare_reviews'
        if skincare_reviews_path.exists():
            for csv_file in skincare_reviews_path.glob('*.csv'):
                try:
                    df = pd.read_csv(csv_file)
                    
                    reviews_df = pd.DataFrame({
                        'product_name': df.get('product_name', df.get('Product', '')),
                        'user_id': df.get('author', df.get('user_id', '')),
                        'rating': df.get('rating', df.get('Rating', np.nan)),
                        'review_text': df.get('review_text', df.get('review', '')),
                        'review_summary': df.get('review_title', ''),
                        'helpful_votes': df.get('helpful_count', 0),
                        'timestamp': df.get('date', df.get('timestamp', '')),
                        'source': 'skincare_reviews'
                    })
                    
                    all_reviews.append(reviews_df)
                    print(f"✅ Processed skincare reviews: {len(reviews_df)} reviews")
                except Exception as e:
                    print(f"❌ Error processing skincare reviews: {e}")
        
        # Combine all reviews
        if all_reviews:
            master_reviews = pd.concat(all_reviews, ignore_index=True)
            
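            # Clean data: keep only rows whose review text is longer than 10 characters.
            # Rows without review text (for example, files that contain only ratings) are dropped here.
            # .copy() avoids pandas' SettingWithCopyWarning on the column assignment below.
            master_reviews = master_reviews[master_reviews['review_text'].notna()].copy()
            master_reviews['review_text'] = master_reviews['review_text'].astype(str)
            master_reviews = master_reviews[master_reviews['review_text'].str.len() > 10]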
            
            # Save master reviews dataset
            output_path = self.output_dir / 'master_reviews.csv'
            master_reviews.to_csv(output_path, index=False)
            
            print("\n✨ Master reviews dataset created!")
            print(f"   Total reviews: {len(master_reviews)}")
            print(f"   Saved to: {output_path}")
            
            return master_reviews
        else:
            print("⚠️  No review data found to merge")
            return None
    
    def create_ingredient_database(self, products_df):
        """Extract and create ingredient database from products"""
        print("\n🔨 Creating ingredient database...")

        if products_df is None or 'ingredients' not in products_df.columns:
            print("⚠️  No product data with ingredients available")
            return None
        
        all_ingredients = []
        
        for idx, row in products_df.iterrows():
            if pd.notna(row['ingredients']):
                ingredients_text = str(row['ingredients'])
                
                # Split ingredients (common separators)
                ingredients = re.split(r'[,;]', ingredients_text)
                
                for ingredient in ingredients:
                    ingredient = ingredient.strip()
                    if ingredient and len(ingredient) > 2:
                        all_ingredients.append({
                            'ingredient_name': ingredient,
                            'product_name': row['product_name'],
                            'brand': row['brand'],
                            'category': row['category']
                        })
        
        if all_ingredients:
            ingredients_df = pd.DataFrame(all_ingredients)
            
            # Create ingredient frequency table
            ingredient_stats = ingredients_df.groupby('ingredient_name').agg({
                'product_name': 'count',
                'category': lambda x: x.mode()[0] if len(x.mode()) > 0 else 'unknown'
            }).reset_index()
            
            ingredient_stats.columns = ['ingredient_name', 'frequency', 'common_category']
            ingredient_stats = ingredient_stats.sort_values('frequency', ascending=False)
            
            # Save ingredient database
            output_path = self.output_dir / 'ingredient_database.csv'
            ingredient_stats.to_csv(output_path, index=False)
            
            print("✨ Ingredient database created!")
            print(f"   Unique ingredients: {len(ingredient_stats)}")
            print(f"   Saved to: {output_path}")
            
            return ingredient_stats
        else:
            print("⚠️  No ingredients extracted")
            return None
    
    def generate_summary_report(self):
        """Generate a summary report of all datasets"""
        print("\n" + "="*60)
        print("📋 DATASET SUMMARY REPORT")
        print("="*60)
        
        files = {
            'Master Products': 'master_products.csv',
            'Master Reviews': 'master_reviews.csv',
            'Ingredient Database': 'ingredient_database.csv'
        }
        
        for name, filename in files.items():
            filepath = self.output_dir / filename
            if filepath.exists():
                df = pd.read_csv(filepath)
                print(f"\n📊 {name}:")
                print(f"   Rows: {len(df):,}")
                print(f"   Columns: {len(df.columns)}")
                print(f"   File size: {filepath.stat().st_size / 1024 / 1024:.2f} MB")
                print(f"   Location: {filepath}")
            else:
                print(f"\n⚠️  {name}: Not found")

        print("\n" + "="*60)
        print("✅ Dataset preparation complete!")
        print("="*60)

def main():
    """Main execution function"""
    print("🚀 Skincare Dataset Downloader & Merger")
    print("="*60)
    
    # Initialize manager
    manager = SkincareDatasetManager()
    
    # Step 1: Download datasets
    print("\nStep 1: Downloading datasets from Kaggle...")
    print("⚠️  Make sure you have kaggle.json configured in ~/.kaggle/ or set KAGGLE_USERNAME/KAGGLE_KEY env vars")
    manager.download_kaggle_datasets()
    
    # Step 2: Explore datasets
    print("\n" + "="*60)
    datasets = manager.load_and_explore_datasets()
    
    # Step 3: Create master product dataset
    print("\n" + "="*60)
    products_df = manager.create_master_product_dataset()
    
    # Step 4: Create master reviews dataset
    print("\n" + "="*60)
    reviews_df = manager.create_master_reviews_dataset()
    
    # Step 5: Create ingredient database
    print("\n" + "="*60)
    ingredients_df = manager.create_ingredient_database(products_df)
    
    # Step 6: Generate summary report
    print("\n" + "="*60)
    manager.generate_summary_report()
    
    print("\n🎉 All done! Your datasets are ready for ML training.")
    print("\nNext steps:")
    print("1. Review the master datasets in the 'skincare_datasets' folder")
    print("2. Run data cleaning and preprocessing")
    print("3. Create feature engineering pipeline")
    print("4. Train your ML models")

if __name__ == "__main__":
    main()


🚀 Skincare Dataset Downloader & Merger
✅ Authenticated to Kaggle using environment variables

Step 1: Downloading datasets from Kaggle...
⚠️  Make sure you have kaggle.json configured in ~/.kaggle/ or set KAGGLE_USERNAME/KAGGLE_KEY env vars
🔽 Downloading datasets from Kaggle...

📦 Downloading sephora...
Dataset URL: https://www.kaggle.com/datasets/raghadalharbi/all-products-available-on-sephora-website
Downloading all-products-available-on-sephora-website.zip to skincare_datasets\sephora
100%|██████████| 4.64M/4.64M [00:04<00:00, 982kB/s]
✅ sephora downloaded successfully!

📦 Downloading amazon_beauty...
Dataset URL: https://www.kaggle.com/datasets/skillsmuggler/amazon-ratings
Downloading amazon-ratings.zip to skincare_datasets\amazon_beauty
100%|██████████| 28.8M/28.8M [00:31<00:00, 970kB/s]
✅ amazon_beauty downloaded successfully!

📦 Downloading cosmetics...
Dataset URL: https://www.kaggle.com/datasets/kingabzpro/cosmetics-datasets
Downloading cosmetics-datasets.zip to skincare_datasets\cosmetics
100%|██████████| 263k/263k [00:00<00:00, 296kB/s]
✅ cosmetics downloaded successfully!

📦 Downloading skincare_reviews...
Dataset URL: https://www.kaggle.com/datasets/mrmars1010/skincare-reviews
❌ Error downloading skincare_reviews: 403 Client Error: Forbidden for url: https://api.kaggle.com/v1/datasets.DatasetApiService/DownloadDataset
   Please manually download from: https://www.kaggle.com/datasets/mrmars1010/skincare-reviews

📦 Downloading makeup_products...
Dataset URL: https://www.kaggle.com/datasets/shudhanshusingh/25000-makeup-products-with-ingredients
❌ Error downloading makeup_products: 403 Client Error: Forbidden for url: https://api.kaggle.com/v1/datasets.DatasetApiService/DownloadDataset
   Please manually download from: https://www.kaggle.com/datasets/shudhanshusingh/25000-makeup-products-with-ingredients


📊 Loading and exploring datasets...

📁 SEPHORA:
   File: sephora_website_dataset.csv
   Columns: ['id', 'brand', 'category', 'name', 'size', 'rating', 'number_of_reviews', 'love', 'price', 'value_pr