<a href="https://colab.research.google.com/github/EvgeniaKantor/Hackathon-1/blob/main/PrepCosmeticDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**SynchroCosmetics** is a recommendation system designed for beauty bloggers that uses audience data to improve dynamic content and optimize product recommendations. It uses user information such as age, gender, skin type and key product features to suggest the most appropriate cosmetics, providing a personalized and engaging experience for subscribers.

Scenario:

Imagine you're a beauty blogger who creates content about skincare cosmetics on platforms like Instagram or Telegram. You strive to better understand your audience's needs so you can adapt your content more effectively. You gather information from social media analytics (e.g. age, gender, location) and conduct detailed surveys asking about skin type and buying goals.

SynchroCosmetics processes this data to:

- Identify key audience segments
- Recommend cosmetics that match audience preferences
- Help you develop personalized content strategies to increase engagement and influence

This system allows beauty bloggers to make data-driven decisions, maximizing the impact of their cosmetic content and product recommendations.

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import openai
import os
from google.colab import userdata
import time
import re

# Loading the cosmetics dataset from Kaggle

In [None]:
from google.colab import files
files.upload()

!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

!kaggle datasets download -d yyue11/cosmetics-data-for-recommendation-systems

Saving kaggle.json to kaggle (1).json
Dataset URL: https://www.kaggle.com/datasets/yyue11/cosmetics-data-for-recommendation-systems
License(s): unknown
cosmetics-data-for-recommendation-systems.zip: Skipping, found more recently modified local copy (use --force to force download)


In [None]:
!unzip cosmetics-data-for-recommendation-systems.zip

Archive:  cosmetics-data-for-recommendation-systems.zip
replace cosmatics dataset.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: cosmatics dataset.csv   


In [None]:
df = pd.read_csv('cosmatics dataset.csv')
df.head()

Unnamed: 0,Label,Brand,Name,Price,Rank,Ingredients,Combination,Dry,Normal,Oily,Sensitive
0,Moisturizer,LA MER,Crème de la Mer,175,4.1,"Algae (Seaweed) Extract, Mineral Oil, Petrolat...",1,1,1,1,1
1,Moisturizer,SK-II,Facial Treatment Essence,179,4.1,"Galactomyces Ferment Filtrate (Pitera), Butyle...",1,1,1,1,1
2,Moisturizer,DRUNK ELEPHANT,Protini™ Polypeptide Cream,68,4.4,"Water, Dicaprylyl Carbonate, Glycerin, Ceteary...",1,1,1,1,0
3,Moisturizer,LA MER,The Moisturizing Soft Cream,175,3.8,"Algae (Seaweed) Extract, Cyclopentasiloxane, P...",1,1,1,1,1
4,Moisturizer,IT COSMETICS,Your Skin But Better™ CC+™ Cream with SPF 50+,38,4.1,"Water, Snail Secretion Filtrate, Phenyl Trimet...",1,1,1,1,1


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1472 entries, 0 to 1471
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Label        1472 non-null   object 
 1   Brand        1472 non-null   object 
 2   Name         1472 non-null   object 
 3   Price        1472 non-null   int64  
 4   Rank         1472 non-null   float64
 5   Ingredients  1472 non-null   object 
 6   Combination  1472 non-null   int64  
 7   Dry          1472 non-null   int64  
 8   Normal       1472 non-null   int64  
 9   Oily         1472 non-null   int64  
 10  Sensitive    1472 non-null   int64  
dtypes: float64(1), int64(6), object(4)
memory usage: 126.6+ KB


# Generating a new "Features" column using the Gemini API

In [None]:
import pandas as pd
import google.generativeai as genai
import os
import time
from google.api_core.exceptions import ServiceUnavailable
from google.colab import userdata

# Retrieve and set the API key
GOOGLE_API_KEY = userdata.get('gemini_key')
os.environ['GOOGLE_API_KEY'] = GOOGLE_API_KEY

# Configure the generative AI client
genai.configure(api_key=GOOGLE_API_KEY)

# Load the generative model correctly
model = genai.GenerativeModel("gemini-1.5-pro-001")

# Define the features to be checked
list_features = ['Anti-aging', 'Anti-acne', 'Hydrating', 'Anti-dark spots', 'Day Care', 'Night Care', 'Sun Protect']

# Ensure the 'Features' column exists
if 'Features' not in df.columns:
    df['Features'] = ''

# Function to generate features information from label, brand, name
def get_features(label, brand, name, list_features):
    query = (
        f"Consider {label}, {brand}, {name}. "
        f"Give me the key features only from the list: {', '.join(list_features)}. "
        f"1 - yes, the cosmetics product has the feature. "
        f"0 - no, the cosmetics product does not have the feature. "
        f"Don't give explanations or any additional information, just 0 or 1. "
        f"Example: 'Anti-aging: 1, Anti-acne: 0, Hydrating: 0, Anti-dark spots: 0, Day Care: 0, Night Care: 0, Sun Protect: 0'"
    )
    retries = 3  # Number of retries
    delay = 5  # Initial delay in seconds

    while retries > 0:
        try:
            response = model.generate_content(query)
            if response and response.candidates and response.candidates[0].content.parts:
                response_content = response.candidates[0].content.parts[0].text.strip()
                return response_content
            else:
                print(f"No valid content received for query: {query}")
                return None
        except IndexError as e:
            print(f"IndexError: {str(e)}")
            return None
        except ServiceUnavailable as e:
            print(f"ServiceUnavailable error: {e}. Retrying in {delay} seconds...")
            time.sleep(delay)
            retries -= 1
            delay *= 2  # Exponential backoff
        except Exception as e:
            if 'Resource has been exhausted' in str(e):
                # Quota errors will not clear by retrying, so give up on this request
                print("Quota exhausted. Giving up on this request.")
                return None
            else:
                print(f"Exception: {str(e)}. Retrying in {delay} seconds...")
                time.sleep(delay)
                retries -= 1
                delay *= 2  # Exponential backoff

    print("Maximum retries reached. Unable to generate features.")
    return None

# Get the indices of empty cells in 'Features' column
empty_features_indices = df[df['Features'] == ''].index.tolist()

# Process all empty cells
for count, index in enumerate(empty_features_indices):
    row = df.loc[index]
    label = row.get('Label', '')
    brand = row.get('Brand', '')
    name = row.get('Name', '')

    # Get features from the generative model
    features_result = get_features(label, brand, name, list_features)

    if features_result:
        # Store the entire response in the 'Features' column
        df.at[index, 'Features'] = features_result
    else:
        print(f"Failed to get features for index {index}")

    # Save progress every 10 iterations
    if (count + 1) % 10 == 0:
        df.to_csv('updated_features.csv', index=False)
        print(f"Progress saved at iteration {count + 1}")

    # Throttling to prevent rate-limiting
    time.sleep(2)  # Add a delay between each request

# Final save of the DataFrame
df.to_csv('updated_features.csv', index=False)
print("Final progress saved.")
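The loop above writes `updated_features.csv` every 10 iterations, so an interrupted run can in principle be resumed from that file. A minimal sketch of that reload step follows. The helper name `load_checkpoint` is hypothetical and not part of the notebook. One detail to keep in mind: `pd.read_csv` reads empty cells back as `NaN`, so the `df['Features'] == ''` filter would miss them unless they are converted back to empty strings.

```python
import os
import pandas as pd

def load_checkpoint(path, fallback_df):
    """Hypothetical helper: reload saved progress if the checkpoint file exists."""
    if os.path.exists(path):
        restored = pd.read_csv(path)
        # read_csv turns empty cells into NaN; restore '' so the == '' filter still works
        restored['Features'] = restored['Features'].fillna('')
        return restored
    # No checkpoint yet: keep working with the DataFrame already in memory
    return fallback_df
```

Calling `df = load_checkpoint('updated_features.csv', df)` before building `empty_features_indices` would then skip the rows that already have a response.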

In [None]:
df

Unnamed: 0,Label,Brand,Name,Price,Rank,Ingredients,Combination,Dry,Normal,Oily,Sensitive,Features
0,Moisturizer,LA MER,Crème de la Mer,175,4.1,"Algae (Seaweed) Extract, Mineral Oil, Petrolat...",1,1,1,1,1,Anti-aging: 1\nAnti-acne: 0\nHydrating: 1\nAnt...
1,Moisturizer,SK-II,Facial Treatment Essence,179,4.1,"Galactomyces Ferment Filtrate (Pitera), Butyle...",1,1,1,1,1,"Here are the key features of SK-II, Facial Tre..."
2,Moisturizer,DRUNK ELEPHANT,Protini™ Polypeptide Cream,68,4.4,"Water, Dicaprylyl Carbonate, Glycerin, Ceteary...",1,1,1,1,0,"Anti-aging: 1, Anti-acne: 0, Hydrating: 1, Ant..."
3,Moisturizer,LA MER,The Moisturizing Soft Cream,175,3.8,"Algae (Seaweed) Extract, Cyclopentasiloxane, P...",1,1,1,1,1,"Anti-aging: 1, Anti-acne: 0, Hydrating: 1, Ant..."
4,Moisturizer,IT COSMETICS,Your Skin But Better™ CC+™ Cream with SPF 50+,38,4.1,"Water, Snail Secretion Filtrate, Phenyl Trimet...",1,1,1,1,1,Anti-aging: 1\nAnti-acne: 0\nHydrating: 1\nAnt...
...,...,...,...,...,...,...,...,...,...,...,...,...
1467,Sun protect,KORRES,Yoghurt Nourishing Fluid Veil Face Sunscreen B...,35,3.9,"Water, Alcohol Denat., Potassium Cetyl Phospha...",1,1,1,1,1,
1468,Sun protect,KATE SOMERVILLE,Daily Deflector™ Waterlight Broad Spectrum SPF...,48,3.6,"Water, Isododecane, Dimethicone, Butyloctyl Sa...",0,0,0,0,0,
1469,Sun protect,VITA LIBERATA,Self Tan Dry Oil SPF 50,54,3.5,"Water, Dihydroxyacetone, Glycerin, Sclerocarya...",0,0,0,0,0,
1470,Sun protect,ST. TROPEZ TANNING ESSENTIALS,Pro Light Self Tan Bronzing Mist,20,1.0,"Water, Dihydroxyacetone, Propylene Glycol, PPG...",0,0,0,0,0,
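The output above shows that the model does not always follow the requested format: some responses are newline-separated, some comma-separated, and at least one starts with free text. A small check like the one below could flag responses that do not contain a 0 or 1 for every feature, so they can be retried. This is only a sketch; `EXPECTED` and `is_valid_response` are hypothetical names introduced for illustration.

```python
import re

# Feature names copied from list_features in the cell above
EXPECTED = ['Anti-aging', 'Anti-acne', 'Hydrating', 'Anti-dark spots',
            'Day Care', 'Night Care', 'Sun Protect']

def is_valid_response(text, expected=EXPECTED):
    """Return True if text assigns 0 or 1 to every expected feature."""
    if not isinstance(text, str):
        return False
    for name in expected:
        # Look for "<feature>: 0" or "<feature>: 1", ignoring case
        pattern = re.escape(name) + r'\s*:\s*[01]\b'
        if not re.search(pattern, text, flags=re.IGNORECASE):
            return False
    return True

good = ("Anti-aging: 1\nAnti-acne: 0\nHydrating: 1\nAnti-dark spots: 0\n"
        "Day Care: 1\nNight Care: 1\nSun Protect: 0")
bad = "Here are the key features of the product."
print(is_valid_response(good))  # True
print(is_valid_response(bad))   # False
```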


# Generating a new "Features" column using the OpenAI API

In [None]:
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY
openai.api_key = OPENAI_API_KEY


# Ensure 'Features' column exists
if 'Features' not in df.columns:
    df['Features'] = None

# Define the features to be checked
list_features = ['Anti-aging', 'Anti-acne', 'Hydrating', 'Anti-dark spots', 'Day Care', 'Night Care', 'Sun Protect']

# Function to generate features information
def get_features(label, brand, name, list_features):
    prompt_for_role_system = "You're a highly qualified cosmetic chemist."
    prompt_for_role_user = (
        f"Search online and determine the key features of the product: {label}, {brand}, {name}. "
        f"Only provide the features from this list: {', '.join(list_features)}. "
        f"1 - yes, the cosmetic product has the feature. "
        f"0 - no, the cosmetic product does not have the feature. "
        f"Don't give explanations or any additional information, just 0 or 1. "
        f"Example: 'Anti-aging: 1, Anti-acne: 0, Hydrating: 0, Anti-dark spots: 0, Day Care: 0, Night Care: 0, Sun Protect: 0'"
    )

    try:
        # Note: openai.ChatCompletion is the interface of openai versions before 1.0;
        # it was removed in openai 1.0 and later.
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": prompt_for_role_system},
                {"role": "user", "content": prompt_for_role_user}
            ],
            max_tokens=100
        )
        return response['choices'][0]['message']['content'].strip()
    except Exception as e:
        print(f"Error: {e}")
        return None

# Iterate through DataFrame and update only empty cells
for index, row in df.iterrows():
    if pd.isna(row['Features']) or row['Features'] == '':
        print(f"Processing {row['Label']} - {row['Brand']} - {row['Name']}")
        features = get_features(row['Label'], row['Brand'], row['Name'], list_features)

        # Only update if a valid response is received
        if features:
            df.at[index, 'Features'] = features

        # To avoid hitting rate limits, add a delay between requests
        time.sleep(1)

# Display the updated DataFrame
print("\nUpdated DataFrame:")
print(df)

Processing Face Mask - SK-II - Overnight Miracle Mask
Processing Face Mask - GLOW RECIPE - Watermelon Glow Sleeping Mask Mini
Processing Face Mask - DR. JART+ - Lover Rubber Masks
Processing Face Mask - FRESH - Lotus Youth Preserve Rescue Mask
Processing Face Mask - FRESH - Black Tea Firming Overnight Mask Mini
Processing Face Mask - ORIGINS - Original Skin™ Retexturizing Mask with Rose Clay
Processing Face Mask - ORIGINS - Hello, Calm Relaxing & Hydrating Face Mask with Cannabis Sativa Seed Oil
Processing Face Mask - FARMACY - Bright On Massage-Activated Vitamin C Mask with Echinacea GreenEnvy™
Processing Face Mask - REN CLEAN SKINCARE - Glycol Lactic Radiance Renewal Mask
Processing Face Mask - KIEHL'S SINCE 1851 - Calendula & Aloe Soothing Hydration Mask
Processing Face Mask - ORIGINS - Out of Trouble™ 10 Minute Mask to Rescue Problem Skin
Processing Face Mask - KIEHL'S SINCE 1851 - Turmeric & Cranberry Seed Energizing Radiance Mask
Processing Face Mask - GLAMGLOW - The Ultimate Glo
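The cell above uses `openai.ChatCompletion.create`, which exists only in `openai` versions before 1.0. With version 1.0 or later, the request goes through a client object and the response is read with attribute access. The sketch below shows roughly how the same call might look; `get_features_v1` is a hypothetical name, the prompt is shortened, and the function assumes the caller passes in an `openai.OpenAI` client.

```python
def get_features_v1(client, label, brand, name, list_features, model="gpt-3.5-turbo"):
    """Hypothetical variant of get_features for openai>=1.0.

    client is assumed to be an openai.OpenAI instance, created for example with:
        from openai import OpenAI
        client = OpenAI(api_key=OPENAI_API_KEY)
    """
    user_prompt = (
        f"Determine the key features of the product: {label}, {brand}, {name}. "
        f"Only provide the features from this list: {', '.join(list_features)}. "
        f"Answer with 1 (has the feature) or 0 (does not), no explanations."
    )
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You're a highly qualified cosmetic chemist."},
                {"role": "user", "content": user_prompt},
            ],
            max_tokens=100,
        )
        # In the 1.x client the response is an object, not a dict
        content = response.choices[0].message.content
        return content.strip() if content else None
    except Exception as e:
        print(f"Error: {e}")
        return None
```

Passing the client as an argument keeps the function free of global state and makes it easy to exercise with a stub object.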

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1472 entries, 0 to 1471
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Label        1472 non-null   object 
 1   Brand        1472 non-null   object 
 2   Name         1472 non-null   object 
 3   Price        1472 non-null   int64  
 4   Rank         1472 non-null   float64
 5   Ingredients  1472 non-null   object 
 6   Combination  1472 non-null   int64  
 7   Dry          1472 non-null   int64  
 8   Normal       1472 non-null   int64  
 9   Oily         1472 non-null   int64  
 10  Sensitive    1472 non-null   int64  
 11  Features     1472 non-null   object 
dtypes: float64(1), int64(6), object(5)
memory usage: 138.1+ KB


In [None]:
df.head()

Unnamed: 0,Label,Brand,Name,Price,Rank,Ingredients,Combination,Dry,Normal,Oily,Sensitive,Features
0,Moisturizer,LA MER,Crème de la Mer,175,4.1,"Algae (Seaweed) Extract, Mineral Oil, Petrolat...",1,1,1,1,1,"Anti-aging: 1, Anti-acne: 0, Hydrating: 1, Ant..."
1,Moisturizer,SK-II,Facial Treatment Essence,179,4.1,"Galactomyces Ferment Filtrate (Pitera), Butyle...",1,1,1,1,1,"Anti-aging: 1, Anti-acne: 0, Hydrating: 1, Ant..."
2,Moisturizer,DRUNK ELEPHANT,Protini™ Polypeptide Cream,68,4.4,"Water, Dicaprylyl Carbonate, Glycerin, Ceteary...",1,1,1,1,0,"Anti-aging: 1, Anti-acne: 0, Hydrating: 1, Ant..."
3,Moisturizer,LA MER,The Moisturizing Soft Cream,175,3.8,"Algae (Seaweed) Extract, Cyclopentasiloxane, P...",1,1,1,1,1,"Anti-aging: 1, Anti-acne: 0, Hydrating: 1, Ant..."
4,Moisturizer,IT COSMETICS,Your Skin But Better™ CC+™ Cream with SPF 50+,38,4.1,"Water, Snail Secretion Filtrate, Phenyl Trimet...",1,1,1,1,1,"Anti-aging: 1, Anti-acne: 0, Hydrating: 1, Ant..."


# Cleaning the 'Features' column and extracting the feature flags from it

In [None]:
# Function to clean and standardize the 'Features' column
def clean_features(text):
    # Collapse line breaks and runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    # Remove a trailing period
    text = re.sub(r'\.\s*$', '', text)
    # Ensure exactly one space after each comma
    text = re.sub(r',\s*', ', ', text)
    # Insert a missing comma between a 0/1 value and the next feature name
    text = re.sub(r'(?<=\d) (?=[A-Z])', ', ', text)
    # Collapse any remaining repeated whitespace
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Apply the cleaning function
df['Features'] = df['Features'].apply(clean_features)

# Check unique values again
print(df['Features'].unique())

['Anti-aging: 1, Anti-acne: 0, Hydrating: 1, Anti-dark spots: 0, Day Care: 1, Night Care: 1, Sun Protect: 0'
 'Anti-aging: 1, Anti-acne: 0, Hydrating: 1, Anti-dark spots: 1, Day Care: 0, Night Care: 1, Sun Protect: 0'
 'Anti-aging: 1, Anti-acne: 0, Hydrating: 1, Anti-dark spots: 1, Day Care: 1, Night Care: 1, Sun Protect: 1'
 'Anti-aging: 0, Anti-acne: 0, Hydrating: 1, Anti-dark spots: 0, Day Care: 1, Night Care: 1, Sun Protect: 0'
 'Anti-aging: 1, Anti-acne: 0, Hydrating: 1, Anti-dark spots: 1, Day Care: 1, Night Care: 1, Sun Protect: 0'
 'Anti-aging: 0, Anti-acne: 1, Hydrating: 1, Anti-dark spots: 0, Day Care: 1, Night Care: 1, Sun Protect: 0'
 'Anti-aging: 0, Anti-acne: 0, Hydrating: 1, Anti-dark spots: 0, Day Care: 1, Night Care: 0, Sun Protect: 1'
 'Anti-aging: 1, Anti-acne: 0, Hydrating: 1, Anti-dark spots: 0, Day Care: 0, Night Care: 1, Sun Protect: 0'
 'Anti-aging: 1, Anti-acne: 0, Hydrating: 1, Anti-dark spots: 1, Day Care: 1, Night Care: 0, Sun Protect: 1'
 'Anti-aging: 0, An
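To make the effect of `clean_features` concrete, the sketch below repeats the function in compact form and applies it to two short strings in the styles seen in the raw responses (newline-separated and comma-separated). The sample strings are shortened illustrations, not rows from the dataset.

```python
import re

def clean_features(text):
    text = re.sub(r'[\n\s]+', ' ', text)             # line breaks and spaces -> one space
    text = re.sub(r'\.\s*$', '', text)               # drop a trailing period
    text = re.sub(r',\s*', ', ', text)               # one space after each comma
    text = re.sub(r'(?<=\d) (?=[A-Z])', ', ', text)  # add a comma between "1" and the next name
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

newline_style = "Anti-aging: 1\nAnti-acne: 0\nHydrating: 1."
comma_style = "Anti-aging: 1,Anti-acne: 0,  Hydrating: 1"

print(clean_features(newline_style))  # Anti-aging: 1, Anti-acne: 0, Hydrating: 1
print(clean_features(comma_style))    # Anti-aging: 1, Anti-acne: 0, Hydrating: 1
```

Both inputs end up in the same comma-separated form, which is why the list of unique values above contains only one layout.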

In [None]:
# Function to split features into key-value pairs
def split_features(features):
    # Extract key-value pairs using regex
    pairs = re.findall(r'(\w[\w\s-]+):\s*(\d)', features)
    return dict(pairs)

# Apply the function to split features
df_features = df['Features'].apply(split_features).apply(pd.Series).fillna(0).astype(int)

# Display the transformed DataFrame
print(df_features.head())

   Anti-aging  Anti-acne  Hydrating  Anti-dark spots  Day Care  Night Care  \
0           1          0          1                0         1           1   
1           1          0          1                1         0           1   
2           1          0          1                0         1           1   
3           1          0          1                0         1           1   
4           1          0          1                1         1           1   

   Sun Protect  
0            0  
1            0  
2            0  
3            0  
4            1  
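The sketch below walks through the same steps on a two-row example, which may help when debugging the regex. It shows that `split_features` returns string values and that `fillna(0)` supplies a 0 for any feature a response left out. The second sample string omits 'Anti-acne' on purpose; both strings are illustrative, not taken from the dataset.

```python
import re
import pandas as pd

def split_features(features):
    # Same regex as above: a feature name, a colon, then a single digit
    pairs = re.findall(r'(\w[\w\s-]+):\s*(\d)', features)
    return dict(pairs)

sample = pd.Series([
    'Anti-aging: 1, Anti-acne: 0, Hydrating: 1',
    'Anti-aging: 0, Hydrating: 1',  # 'Anti-acne' missing on purpose
])

parsed = sample.apply(split_features)
print(parsed[0])  # {'Anti-aging': '1', 'Anti-acne': '0', 'Hydrating': '1'}

# Expand the dicts into columns; missing features become NaN, then 0
wide = parsed.apply(pd.Series).fillna(0).astype(int)
print(wide.loc[1, 'Anti-acne'])  # 0
```

Note that a missing feature and an explicit 0 become indistinguishable after `fillna(0)`.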


In [None]:
df_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1472 entries, 0 to 1471
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Anti-aging       1472 non-null   int64
 1   Anti-acne        1472 non-null   int64
 2   Hydrating        1472 non-null   int64
 3   Anti-dark spots  1472 non-null   int64
 4   Day Care         1472 non-null   int64
 5   Night Care       1472 non-null   int64
 6   Sun Protect      1472 non-null   int64
dtypes: int64(7)
memory usage: 80.6 KB


In [None]:
# Combine the two DataFrames using index
df_combined = pd.concat([df, df_features], axis=1)
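`pd.concat(..., axis=1)` matches rows by index label, not by position. Here both frames share the same `RangeIndex` (0 to 1471), so the rows line up one-to-one. The toy example below, with made-up frames `left` and `right`, sketches what would happen if one frame had been filtered first.

```python
import pandas as pd

left = pd.DataFrame({'Name': ['A', 'B', 'C']}, index=[0, 1, 2])
right = pd.DataFrame({'Hydrating': [1, 0, 1]}, index=[0, 1, 2])

# Same index labels: rows line up one-to-one
combined = pd.concat([left, right], axis=1)
print(combined.shape)  # (3, 2)

# After filtering, label 1 is missing on the right and the outer join fills it with NaN
filtered = right[right['Hydrating'] == 1]
misaligned = pd.concat([left, filtered], axis=1)
print(misaligned['Hydrating'].isna().sum())  # 1
```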

In [None]:
df_combined.head()

Unnamed: 0,Label,Brand,Name,Price,Rank,Ingredients,Combination,Dry,Normal,Oily,Sensitive,Features,Anti-aging,Anti-acne,Hydrating,Anti-dark spots,Day Care,Night Care,Sun Protect
0,Moisturizer,LA MER,Crème de la Mer,175,4.1,"Algae (Seaweed) Extract, Mineral Oil, Petrolat...",1,1,1,1,1,"Anti-aging: 1, Anti-acne: 0, Hydrating: 1, Ant...",1,0,1,0,1,1,0
1,Moisturizer,SK-II,Facial Treatment Essence,179,4.1,"Galactomyces Ferment Filtrate (Pitera), Butyle...",1,1,1,1,1,"Anti-aging: 1, Anti-acne: 0, Hydrating: 1, Ant...",1,0,1,1,0,1,0
2,Moisturizer,DRUNK ELEPHANT,Protini™ Polypeptide Cream,68,4.4,"Water, Dicaprylyl Carbonate, Glycerin, Ceteary...",1,1,1,1,0,"Anti-aging: 1, Anti-acne: 0, Hydrating: 1, Ant...",1,0,1,0,1,1,0
3,Moisturizer,LA MER,The Moisturizing Soft Cream,175,3.8,"Algae (Seaweed) Extract, Cyclopentasiloxane, P...",1,1,1,1,1,"Anti-aging: 1, Anti-acne: 0, Hydrating: 1, Ant...",1,0,1,0,1,1,0
4,Moisturizer,IT COSMETICS,Your Skin But Better™ CC+™ Cream with SPF 50+,38,4.1,"Water, Snail Secretion Filtrate, Phenyl Trimet...",1,1,1,1,1,"Anti-aging: 1, Anti-acne: 0, Hydrating: 1, Ant...",1,0,1,1,1,1,1
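With skin-type flags and feature flags in one table, a recommendation can be expressed as a filter followed by a sort. The sketch below is one possible way to query the table, not the project's actual recommender. The helper `recommend` is hypothetical, and the sample frame copies a few columns from four of the rows shown above.

```python
import pandas as pd

# Rows 0, 1, 2 and 4 of df_combined.head() above (subset of columns)
sample = pd.DataFrame({
    'Brand': ['LA MER', 'SK-II', 'DRUNK ELEPHANT', 'IT COSMETICS'],
    'Price': [175, 179, 68, 38],
    'Rank': [4.1, 4.1, 4.4, 4.1],
    'Sensitive': [1, 1, 0, 1],
    'Anti-dark spots': [0, 1, 0, 1],
})

def recommend(df, skin_type, feature, top_n=3):
    """Hypothetical helper: products flagged for skin_type and feature, best Rank first."""
    mask = (df[skin_type] == 1) & (df[feature] == 1)
    # Highest Rank first; cheaper product wins a tie
    return df[mask].sort_values(['Rank', 'Price'], ascending=[False, True]).head(top_n)

print(recommend(sample, 'Sensitive', 'Anti-dark spots')['Brand'].tolist())
# ['IT COSMETICS', 'SK-II']
```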


In [None]:
# Save df_combined to an Excel file
df_combined.to_excel('df_combined.xlsx')