# Topic Modeling and WordClouds on Beauty Reviews

This notebook focuses on topic modeling within a dataset of beauty reviews. It involves loading a dataset, applying a topic modeling technique (Latent Dirichlet Allocation), and saving the results. Additionally, word clouds are generated to visualize the most frequent terms in each topic.

## Libraries and Tools Used

- **Pandas:** Essential for efficient data manipulation and handling.
- **Streamlit:** Employed for interactive and user-friendly web app functionalities.
- **Matplotlib and Seaborn:** Utilized for data visualization and the creation of bar plots.
- **Scikit-Learn (CountVectorizer and LatentDirichletAllocation):** Employed for vectorization and the implementation of Latent Dirichlet Allocation (LDA).
- **NLTK (Natural Language Toolkit):** Used for stopwords removal and tokenization.
- **WordCloud:** Applied for generating word clouds.

## Workflow Overview

- **Data Loading and Filtering:** The beauty reviews dataset is loaded, and interactive filters are provided using Streamlit to refine the analysis based on specific criteria.
- **Text Preprocessing:** Text undergoes preprocessing, including lemmatization and removal of stopwords, ensuring a clean and standardized representation.
- **Topic Modeling (LDA):** Latent Dirichlet Allocation (LDA) from Scikit-Learn is employed to identify latent topics within the beauty reviews.
- **WordCloud Generation:** Word clouds are generated for each identified topic, providing an intuitive visualization of the most frequently occurring terms.
- **Results Display:** The top 10 keywords for each topic are displayed in an interactive tabular format. Additionally, bar plots showcase the weight of each word in its respective topic.
- **Visual and Functional Enhancements:** The code incorporates visual enhancements such as color-coded backgrounds and improved font sizes for better readability and user experience.
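
The filtering step in the workflow above can be sketched without Streamlit. This is a minimal illustration, not the notebook's exact code: `apply_filters` and the toy `reviews` frame are hypothetical names and data introduced here, and an empty selection is assumed to mean "no restriction", as in the sidebar logic.

```python
import pandas as pd


def apply_filters(df, filters):
    """Keep rows that match every non-empty filter; empty lists impose no restriction."""
    mask = pd.Series(True, index=df.index)
    for column, allowed in filters.items():
        if allowed:  # skip columns where nothing was selected
            mask &= df[column].isin(allowed)
    # Copy so that columns added later do not write into a view of df
    return df[mask].copy()


# Hypothetical toy data standing in for the beauty reviews dataset
reviews = pd.DataFrame({
    "city": ["Tampa", "Reno", "Tampa", "Tucson"],
    "stars": [5, 2, 1, 4],
    "text": ["great stylist", "rude staff", "long wait", "love the brow bar"],
})

subset = apply_filters(reviews, {"city": ["Tampa"], "stars": [4, 5]})
print(subset)
```

Combining the conditions with `&` means a review must satisfy all active filters at once, which matches the behavior of the sidebar filters in the code cell.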

Feel free to explore the code for a detailed understanding of the topic modeling process and the associated word clouds.

In [None]:
import pandas as pd
import streamlit as st
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # Tokenizer data that newer NLTK releases look up for word_tokenize

# Load the initial DataFrame (replace this with your actual data source)
df = pd.read_csv("C:/Users/johan/Bootcamp_SoyHenry/PI_Final/topics.csv")

# Define the default options (empty)
default_options = {
    "business_id": [],
    "source": [],
    "stars": [],
    "city": [],
    "state": [],
    "year": [],
    "month": [],
    "sentiment_label": [],
}


filters = {}

# One sidebar multiselect per filterable column; an empty selection means "no filter"
for column in default_options:
    filters[column] = st.sidebar.multiselect(column.capitalize(), default_options[column] + list(df[column].unique()))

# Button that runs the query
if st.sidebar.button("Run Query"):
    # Apply the filters to the DataFrame only when the button is pressed
    filter_conditions = pd.Series(True, index=df.index)
    for column in filters:
        if filters[column]:  # Apply the filter only if the selection is not empty
            filter_conditions = filter_conditions & df[column].isin(filters[column])

    # Filter the DataFrame; copy so that adding a column later does not write into a view of df
    filtered_df = df[filter_conditions].copy()

    # Get the English stop words
    stop_words = set(stopwords.words('english'))

    # Preprocess the review text before topic modeling
    def preprocess_text(text, stop_words):
        try:
            # Check if the text is a string
            if isinstance(text, str):
                # Tokenize and remove stop words
                tokens = word_tokenize(text.lower())
                tokens = [token for token in tokens if token.isalpha() and token not in stop_words]
                return ' '.join(tokens)
            else:
                # If not a string, return an empty string or handle it according to your needs
                return ''
        except Exception as e:
            # Handle any exceptions that might occur during text preprocessing
            st.error(f"An error occurred during text preprocessing: {e}")
            return ''

    filtered_df['text_processed'] = filtered_df['text'].apply(preprocess_text, stop_words=stop_words)

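    # Bag-of-words counts restricted to the 1,000 most frequent terms
    vectorizer = CountVectorizer(max_features=1000, stop_words='english')
    X = vectorizer.fit_transform(filtered_df['text_processed'])

    # Fit a 5-topic LDA model; `topics` holds the per-document topic distribution
    lda = LatentDirichletAllocation(n_components=5, random_state=42)
    topics = lda.fit_transform(X)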

    # Vocabulary terms, fetched once rather than once per keyword
    feature_names = vectorizer.get_feature_names_out()

    # Draw one figure per topic: a keyword table next to a bar plot
    st.write("<h2>Top 10 Words per Topic</h2>", unsafe_allow_html=True)
    for topic_idx, topic in enumerate(lda.components_):
        # argsort is ascending, so [:-11:-1] yields the 10 largest weights, largest first
        top_keywords_idx = topic.argsort()[:-11:-1]
        top_keywords = [feature_names[i] for i in top_keywords_idx]

        # Build a DataFrame for the table
        topic_df = pd.DataFrame({'Word': top_keywords, 'Weight': topic[top_keywords_idx]})
        topic_df['Weight'] = topic_df['Weight'].round(2)  # Round to 2 decimals
        topic_df = topic_df.sort_values(by='Weight', ascending=False)

        # Table on the left, bar plot on the right
        fig, (ax_table, ax_plot) = plt.subplots(1, 2, gridspec_kw={'width_ratios': [1, 3]}, figsize=(7, 4))

        # Draw the table
        table = ax_table.table(cellText=topic_df.values, colLabels=topic_df.columns, cellLoc='center', loc='center')
        table.auto_set_font_size(False)
        table.set_fontsize(11)  # Larger font

        # Additional table adjustments
        table.auto_set_column_width([0, 1])  # Auto-fit the width of the table's two columns
        table.scale(1, 1.5)  # Stretch the rows vertically to make the table taller

        # Style the header row: bold white text on a gray background
        for (i, j), cell in table.get_celld().items():
            if i == 0:
                cell.set_text_props(weight='bold', color='w')
                cell.set_facecolor('#D3D3D3')

        ax_table.axis('off')

        # Draw the bar plot
        sns.barplot(x='Weight', y='Word', data=topic_df, ax=ax_plot, palette='viridis')

        # Plot adjustments
        ax_plot.set_title(f"Topic {topic_idx + 1}", fontsize=12)
        ax_plot.set_facecolor('#ededed')  # Light gray plot background

        # Hide the axis labels
        ax_plot.set_xlabel('')
        ax_plot.set_ylabel('')

        # Layout adjustments
        fig.tight_layout()
        st.pyplot(fig)
        plt.close(fig)  # Release the figure so reruns do not accumulate open figures
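
The code cell above draws tables and bar plots but does not itself render the word clouds described in the overview. One way this could be done is sketched below, under stated assumptions: `topic_frequencies` and `wordcloud_figure` are hypothetical helper names, the image size and background color are arbitrary choices, and rendering relies on the `wordcloud` package's `WordCloud.generate_from_frequencies` method.

```python
def topic_frequencies(components, feature_names, top_n=10):
    """Return one {word: weight} dict per topic, keeping the top_n heaviest words."""
    frequencies = []
    for weights in components:
        # Rank vocabulary indices by weight, heaviest first
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        frequencies.append({feature_names[i]: float(weights[i]) for i in ranked[:top_n]})
    return frequencies


def wordcloud_figure(freqs, title):
    """Render one topic's frequencies as a matplotlib figure (needs the wordcloud package)."""
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    cloud = WordCloud(width=600, height=300, background_color="white")
    cloud.generate_from_frequencies(freqs)

    fig, ax = plt.subplots(figsize=(6, 3))
    ax.imshow(cloud.to_array(), interpolation="bilinear")
    ax.set_title(title)
    ax.axis("off")
    return fig
```

Inside the Streamlit app, the per-topic dictionaries could be built with `topic_frequencies(lda.components_, vectorizer.get_feature_names_out())`, and each resulting figure passed to `st.pyplot`. Sizing words by LDA weight rather than raw counts keeps the clouds consistent with the bar plots.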

___________________________________________________________________________________________________________________

### Interpreting the Topics

The keywords returned by the model still have to be interpreted as themes. For that step we recommend a more advanced language model, such as GPT-3 from OpenAI. It is a paid service and is accessed through the OpenAI API.

OpenAI Prompt

Suggest possible themes for these topics:

- Positive Topic 1: hair, salon, ulta, cut, color, great, stylist, make, time, want
- Positive Topic 2: help, product, ulta, store, need, look, make, makeup, time, ask
- Positive Topic 3: love, brow, makeup, ulta, place, great, bar, product, good, eyebrow
- Positive Topic 4: staff, helpful, great, store, friendly, love, product, clean, nice, ulta
- Positive Topic 5: product, store, ulta, brand, beauty, makeup, great, like, love, hair
- Negative Topic 1: help, store, like, look, ask, ulta, time, product, makeup, employee
- Negative Topic 2: say, product, just, want, like, ulta, color, make, return, ask
- Negative Topic 3: hair, cut, say, salon, appointment, tell, want, time, ask, color
- Negative Topic 4: hair, salon, time, stylist, ulta, appointment, tell, just, look, pay
- Negative Topic 5: store, ulta, customer, product, help, service, ask, employee, time, say

OpenAI Response

Sure, based on the keywords associated with each topic, here are possible thematic interpretations:
* Positive Topics:
    - Topic 1: Hair Salon Experience
        - Keywords: hair, salon, ulta, cut, color, great, stylist, make, time, want
        - Possible Theme: Positive experiences related to hair salon services at Ulta, including cuts, colors, and stylist interactions.
    - Topic 2: Store Assistance and Product Advice
        - Keywords: help, product, ulta, store, need, look, make, makeup, time, ask
        - Possible Theme: Positive interactions with store staff, seeking product advice, and overall helpful assistance at Ulta.
    - Topic 3: Love for Makeup and Brow Services
        - Keywords: love, brow, makeup, ulta, place, great, bar, product, good, eyebrow
        - Possible Theme: Positive sentiments towards makeup products, brow services, and overall satisfaction with Ulta's offerings.
    - Topic 4: Positive Store Environment and Staff
        - Keywords: staff, helpful, great, store, friendly, love, product, clean, nice, ulta
        - Possible Theme: Positive experiences related to the store environment, helpful staff, and overall positive impressions of Ulta.
    - Topic 5: Appreciation for Beauty Products
        - Keywords: product, store, ulta, brand, beauty, makeup, great, like, love, hair
        - Possible Theme: Positive sentiments towards various beauty products, brands, and the overall selection at Ulta.

* Negative Topics:
    - Topic 1: Challenges with Store Assistance
        - Keywords: help, store, like, look, ask, ulta, time, product, makeup, employee
        - Possible Theme: Challenges or negative experiences related to seeking assistance, store interactions, and employee support.
    - Topic 2: Dissatisfaction or Issues with Products
        - Keywords: say, product, just, want, like, ulta, color, make, return, ask
        - Possible Theme: Dissatisfaction or issues related to specific products, color choices, and product returns at Ulta.
    - Topic 3: Challenges with Hair Salon Services
        - Keywords: hair, cut, say, salon, appointment, tell, want, time, ask, color
        - Possible Theme: Challenges or negative experiences related to hair salon services, including appointments, cuts, and color treatments.
    - Topic 4: Negative Salon Experiences
        - Keywords: hair, salon, time, stylist, ulta, appointment, tell, just, look, pay
        - Possible Theme: Negative experiences related to the hair salon, including stylist interactions, appointment issues, and overall dissatisfaction.
    - Topic 5: Customer Service and Assistance Concerns
        - Keywords: store, ulta, customer, product, help, service, ask, employee, time, say
        - Possible Theme: Concerns or dissatisfaction related to customer service, assistance, and employee interactions at Ulta.

These thematic interpretations are based on the keywords associated with each topic. It's important to note that the actual interpretation may vary based on the context of the reviews and the specific sentiments expressed by customers.
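
The prompt shown earlier could be assembled programmatically from the topic keywords instead of being typed by hand. The sketch below only builds the prompt text and does not call any API; `build_theme_prompt` is a hypothetical helper name, and the input is assumed to be a mapping from sentiment label to a list of keyword lists.

```python
def build_theme_prompt(topics_by_sentiment):
    """Format topic keywords into a single prompt string, one line per topic."""
    lines = ["Suggest possible themes for these topics:"]
    for sentiment, topics in topics_by_sentiment.items():
        for number, keywords in enumerate(topics, start=1):
            lines.append(f"{sentiment} Topic {number}: {', '.join(keywords)}")
    return "\n".join(lines)


# First positive and first negative topic from the prompt above
prompt = build_theme_prompt({
    "Positive": [["hair", "salon", "ulta", "cut", "color", "great", "stylist", "make", "time", "want"]],
    "Negative": [["help", "store", "like", "look", "ask", "ulta", "time", "product", "makeup", "employee"]],
})
print(prompt)
```

The resulting string can then be sent to whichever language model is available; the choice of model and client library is left open here.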