# Customer Insights in Fashion Retail using NLP
## Objective
Use topic modeling and sentiment analysis to gain insights into customer feedback, helping a retail brand understand customer perceptions and preferences.

## 1. Import Necessary Libraries
In this section, we import libraries essential for data manipulation, NLP, sentiment analysis, and visualization.

In [2]:

# Import the libraries used throughout the notebook
import pandas as pd
import numpy as np
import spacy
from sklearn.decomposition import LatentDirichletAllocation, NMF
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import pairwise_distances
import matplotlib.pyplot as plt
import seaborn as sns
import torch
from transformers import pipeline
from nltk.corpus import stopwords
from nltk import download




## 2. Preprocessing
### Custom Stopwords
Define custom stopwords related to the fashion domain to avoid topic dilution.
### Text Preprocessing
Lemmatize and clean the text using SpaCy, preparing it for vectorization and topic modeling.

In [3]:

# Load NLTK stopwords
download('stopwords')
stop_words = set(stopwords.words('english'))

# Load SpaCy's English model for lemmatization
nlp = spacy.load("en_core_web_sm")

# Custom stopword list for the fashion domain
fashion_stopwords = {"dress", "clothing", "fabric", "wear", "size", "fit", "look"}
stop_words.update(fashion_stopwords)

# Load dataset
df = pd.read_csv('Womens Clothing E-Commerce Reviews.csv').dropna(subset=['Review Text'])
df['Review Text'] = df['Review Text'].astype(str)

# Define a preprocessing function using SpaCy lemmatization and custom stopwords
def preprocess_text(text):
    doc = nlp(text.lower())
    # Keep lemmas that are not stopwords, punctuation, or whitespace tokens
    tokens = [token.lemma_ for token in doc
              if token.lemma_ not in stop_words and not token.is_punct and not token.is_space]
    return ' '.join(tokens)

# Apply preprocessing to 'Review Text'
df['Cleaned_Review'] = df['Review Text'].apply(preprocess_text)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Satvik\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 3. Vectorization
### CountVectorizer (for LDA)
We use CountVectorizer to create the document-term matrix for LDA.
### TfidfVectorizer (for NMF)
For NMF, we use TfidfVectorizer to focus on unique terms that define each topic.

In [4]:

# Vectorization: CountVectorizer for LDA, TfidfVectorizer for NMF
vectorizer = CountVectorizer(max_df=0.95, min_df=10, stop_words='english', ngram_range=(1, 2))
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=10, stop_words='english', ngram_range=(1, 2))

dtm = vectorizer.fit_transform(df['Cleaned_Review'])
tfidf_dtm = tfidf_vectorizer.fit_transform(df['Cleaned_Review'])


## 4. Optimal Topic Count via Grid Search
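Using GridSearchCV, we search over the topic count and the two Dirichlet priors, doc_topic_prior (alpha) and topic_word_prior (eta). Candidates are ranked by LDA's approximate log-likelihood on the held-out folds, which measures statistical fit rather than topic coherence.

In [8]:

# Optimal Topic Count via Grid Search on LDA
param_grid = {
    'n_components': [5, 10],
    'doc_topic_prior': [0.1, 0.5],
    'topic_word_prior': [0.1, 0.5]
}
# No `scoring` argument: 'neg_log_loss' is a supervised metric that needs labels and
# predict_proba, which LDA does not provide. GridSearchCV then falls back to
# LatentDirichletAllocation.score(), the approximate log-likelihood.
lda_grid_search = GridSearchCV(LatentDirichletAllocation(learning_method='batch', random_state=42),
                               param_grid=param_grid, cv=3)
lda_grid_search.fit(dtm)
best_lda_model = lda_grid_search.best_estimator_
print("Best LDA Model Parameters:", lda_grid_search.best_params_)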




Best LDA Model Parameters: {'doc_topic_prior': 0.1, 'n_components': 5, 'topic_word_prior': 0.1}


## 5. Non-Negative Matrix Factorization (NMF) for Topic Modeling
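NMF is fitted on the TF-IDF matrix as an alternative to LDA; it often yields more sharply separated topics on short texts such as reviews. The topic count is fixed at 10 here rather than tuned.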

In [9]:

# NMF Topic Modeling
nmf_model = NMF(n_components=10, random_state=42)
nmf_model.fit(tfidf_dtm)
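
Neither fitted model is useful until its topics are inspected. The helper below is a minimal sketch for listing the highest-weighted terms per topic. The name top_terms and the toy matrix are hypothetical, introduced only for illustration; the function relies solely on the fact that both LDA and NMF in scikit-learn expose a topic-term matrix as components_.

```python
import numpy as np

def top_terms(components, feature_names, n_top=10):
    """Return the n_top highest-weighted terms for each topic (row of components)."""
    components = np.asarray(components)
    feature_names = np.asarray(feature_names)
    topics = []
    for weights in components:
        # argsort is ascending, so reverse it and keep the first n_top indices
        top_idx = np.argsort(weights)[::-1][:n_top]
        topics.append(feature_names[top_idx].tolist())
    return topics

# Toy topic-term matrix: 2 topics over 4 terms (illustrative values only)
toy_components = np.array([[0.1, 3.0, 0.2, 2.0],
                           [4.0, 0.1, 1.5, 0.3]])
toy_terms = ["price", "soft", "return", "comfortable"]

toy_topics = top_terms(toy_components, toy_terms, n_top=2)
for i, terms in enumerate(toy_topics):
    print(f"Topic {i}: {', '.join(terms)}")
```

In the notebook, the same helper could be called as top_terms(best_lda_model.components_, vectorizer.get_feature_names_out()) for LDA and top_terms(nmf_model.components_, tfidf_vectorizer.get_feature_names_out()) for NMF.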


## 6. Advanced Sentiment Analysis with BERT-based Model
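To achieve higher accuracy in sentiment analysis, we use a BERT-based model instead of simple polarity scores. The model used here, nlptown/bert-base-multilingual-uncased-sentiment, predicts a rating from "1 star" to "5 stars" rather than a positive/negative label.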

In [10]:

# Advanced Sentiment Analysis using BERT-based model (Transformers)
sentiment_pipeline = pipeline("sentiment-analysis", model="nlptown/bert-base-multilingual-uncased-sentiment")

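# Apply BERT-based sentiment analysis
def get_sentiment_bert(text):
    try:
        # Truncate at the tokenizer level: BERT's 512 limit counts tokens, not characters
        result = sentiment_pipeline(text, truncation=True, max_length=512)
        return result[0]['label']
    except Exception as e:
        print(f"Error: {e}")
        # The model's labels are "1 star" ... "5 stars", so return a missing value
        # rather than a label such as "neutral" that the model never produces
        return None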

df['Sentiment_Label'] = df['Review Text'].apply(get_sentiment_bert)
print(df[['Review Text', 'Sentiment_Label']].head())




                                         Review Text Sentiment_Label
0  Absolutely wonderful - silky and sexy and comf...         5 stars
1  Love this dress!  it's sooo pretty.  i happene...         5 stars
2  I had such high hopes for this dress and reall...         3 stars
3  I love, love, love this jumpsuit. it's fun, fl...         5 stars
4  This shirt is very flattering to all due to th...         5 stars
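
Because the labels are strings such as "5 stars", it can help to convert them to numbers before aggregating. The following is a minimal sketch; the helper names and the polarity cut-offs (2 and below negative, 4 and above positive) are assumptions chosen for illustration, not part of the model or the original notebook.

```python
def stars_to_score(label):
    """Convert a label such as '4 stars' to the integer 4; return None if it cannot be parsed."""
    try:
        return int(str(label).split()[0])
    except (ValueError, IndexError):
        return None

def score_to_polarity(score, negative_max=2, positive_min=4):
    """Bucket a 1-5 score into a coarse polarity (thresholds are an assumption)."""
    if score is None:
        return None
    if score <= negative_max:
        return "negative"
    if score >= positive_min:
        return "positive"
    return "neutral"

# Quick check on a few label strings
for label in ["1 star", "3 stars", "5 stars"]:
    score = stars_to_score(label)
    print(label, "->", score, score_to_polarity(score))
```

Applied to the notebook's data, this could look like df['Star_Score'] = df['Sentiment_Label'].map(stars_to_score), which makes means and correlations with the dataset's other numeric columns straightforward.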


## 7. Multi-Label Aspect Detection
Detect which aspects (e.g., fit, color, design) each review mentions using keyword matching. A review can mention several aspects at once, so the result is encoded as one binary indicator column per aspect, as in multi-label classification.

In [None]:

from sklearn.preprocessing import MultiLabelBinarizer

# Define aspects for multi-label classification
aspect_terms = {"fit": ["fit", "comfortable", "tight"],
                "color": ["color", "shade", "tone"],
                "material": ["material", "fabric", "texture"],
                "price": ["price", "cost", "value"],
                "design": ["design", "style", "look"]}

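# Aspect detection: tag a review with every aspect whose keywords it mentions
import re

def aspect_sentiment_analysis(text):
    aspects_detected = set()
    for aspect, keywords in aspect_terms.items():
        # Match keywords at word starts only, so "fit" still matches "fits" or
        # "fitted" but no longer fires on "outfit", nor "tone" on "stone"
        pattern = r"\b(?:" + "|".join(map(re.escape, keywords)) + r")"
        if re.search(pattern, text.lower()):
            aspects_detected.add(aspect)
    return sorted(aspects_detected)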

# Apply aspect detection for multi-label classification
df['Aspects'] = df['Review Text'].apply(aspect_sentiment_analysis)

# Encode multi-label aspects
mlb = MultiLabelBinarizer()
aspect_labels = mlb.fit_transform(df['Aspects'])
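# Reuse df's index: rows were dropped by dropna(), so a fresh 0..n-1 index
# would misalign on concat and introduce NaN rows
aspect_df = pd.DataFrame(aspect_labels, columns=mlb.classes_, index=df.index)
df = pd.concat([df, aspect_df], axis=1)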

print("Multi-Label Aspect Analysis:")
print(df[['Review Text', 'Aspects'] + mlb.classes_.tolist()].head())


## 8. Visualize Sentiment and Aspect Analysis
We analyze the distribution of sentiments and aspects to identify customer trends and insights.

In [None]:

# Sentiment Distribution Visualization
sns.countplot(data=df, x='Sentiment_Label', order=df['Sentiment_Label'].value_counts().index)
plt.title("Distribution of Sentiment Labels")
plt.xlabel("Sentiment")
plt.ylabel("Frequency")
plt.show()

# Sentiment Distribution per Aspect
for aspect in mlb.classes_:
    subset = df[df[aspect] == 1]
    plt.figure(figsize=(10, 6))
    sns.countplot(data=subset, x='Sentiment_Label', order=subset['Sentiment_Label'].value_counts().index)
    plt.title(f"Sentiment Distribution for Aspect: {aspect.capitalize()}")
    plt.xlabel("Sentiment")
    plt.ylabel("Frequency")
    plt.show()
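
A numeric summary can complement the plots above. The sketch below computes, for each aspect indicator column, the number of reviews that mention it and their mean star score. The function name aspect_rating_summary and the toy frame are hypothetical; the column names mirror those created earlier in the notebook.

```python
import pandas as pd

def aspect_rating_summary(frame, aspect_columns, label_column="Sentiment_Label"):
    """Return review count and mean star score for each aspect indicator column."""
    # Pull the leading digit out of labels such as "4 stars"; unparseable labels become NaN
    digits = frame[label_column].str.extract(r"^(\d)", expand=False)
    stars = pd.to_numeric(digits, errors="coerce")
    rows = []
    for aspect in aspect_columns:
        mask = frame[aspect] == 1
        rows.append({
            "aspect": aspect,
            "n_reviews": int(mask.sum()),
            "mean_stars": stars[mask].mean(),
        })
    return pd.DataFrame(rows).set_index("aspect")

# Toy frame (illustrative values, not from the dataset)
toy = pd.DataFrame({
    "Sentiment_Label": ["5 stars", "1 star", "3 stars", "4 stars"],
    "fit":   [1, 1, 0, 0],
    "price": [0, 1, 1, 1],
})
summary = aspect_rating_summary(toy, ["fit", "price"])
print(summary)
```

With the notebook's data, a call such as aspect_rating_summary(df, mlb.classes_) would give one row per aspect, which makes it easier to compare aspects than reading five separate count plots.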
