# Building a Customer Feedback Analyzer

### Dataset
For this project, I will focus on a single dataset containing feedback from a hospitality business.

### Key Concepts

- **Net Promoter Score (NPS):**  
  Measures customer loyalty by answering the question:  
  *"How likely is it that you would recommend this product to a friend or colleague?"*  
  Scores range from 0 to 10:
  - 9–10: Promoters  
  - 7–8: Passives  
  - 0–6: Detractors  

  **Formula:**  
  `NPS = % of Promoters - % of Detractors`

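- **Sentiment Analysis:**  
  Identifies the emotional tone of reviews using pre-trained AI models:
  - Positive  
  - Negative  
  - Neutral (only if the chosen model supports it; the default pipeline used in section 3 labels reviews as positive or negative only)  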

- **Topic Modeling:**  
  Uses keyword counts and Latent Dirichlet Allocation (LDA) on TF-IDF features to identify recurring themes in textual data.
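
As a quick illustration of the NPS formula above, here is a minimal sketch that computes the score for a handful of made-up survey responses. The `nps` helper and the `scores` list are hypothetical and are not part of the project dataset.

```python
def nps(scores):
    """Return the Net Promoter Score for an iterable of 0-10 ratings."""
    scores = list(scores)
    if not scores:
        raise ValueError("at least one score is required")
    promoters = sum(1 for s in scores if s >= 9)   # 9-10
    detractors = sum(1 for s in scores if s <= 6)  # 0-6
    return 100 * (promoters - detractors) / len(scores)

# Made-up responses: 4 promoters, 3 passives, 3 detractors
scores = [10, 9, 9, 10, 8, 7, 7, 3, 5, 6]
example_nps = nps(scores)
print(example_nps)  # 40% promoters - 30% detractors = 10.0
```

Passives count toward the total number of responses but do not appear in the numerator, which is why they pull the score toward zero.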


In [None]:
# Imports
import pandas as pd

## 1. Exploratory Data Analysis (EDA)

In [None]:
# Read the Excel file
file_path = 'guest_data_with_reviews.xlsx'
df = pd.read_excel(file_path)

# Display the first few rows of the DataFrame
df.head()

In [None]:
# Display the shape of the DataFrame
df.shape

In [None]:
# Identify the number of missing values in the dataset
missing_values = df.isnull().sum()
missing_values
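
Raw counts can be hard to judge without knowing the number of rows, so it may help to look at the share of missing values per column instead. The sketch below uses a small hypothetical DataFrame (`toy`) in place of the guest feedback file; its column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the guest feedback file
toy = pd.DataFrame({
    "Score": [10, 7, np.nan, 3],
    "Review": ["Great stay", None, None, "Noisy room"],
    "Guest ID": [1, 2, 3, 4],
})

# Share of missing values per column, as a percentage, largest first
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Applied to the notebook's data, the same expression would be `df.isnull().mean().mul(100)`.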

In [None]:
# Remove columns that contain any missing values
df_cleaned = df.dropna(axis=1)

# Remove rows with any remaining missing values
# (a safeguard: after the column drop above, none should remain)
df_cleaned = df_cleaned.dropna()

# Display the shape of the cleaned DataFrame
df_cleaned.shape
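
Dropping every column that has a missing value can discard columns that are mostly complete. One possible alternative, sketched below on hypothetical toy data, keeps all columns and drops only the rows that lack the fields the analysis needs. The column names `Score`, `Review`, and `Comments to staff` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the last column is mostly empty
toy = pd.DataFrame({
    "Score": [10, 7, np.nan, 3],
    "Review": ["Great stay", None, "Friendly staff", "Noisy room"],
    "Comments to staff": [None, None, None, "Thanks"],
})

# Only rows missing a required field are dropped; other gaps are tolerated
required = ["Score", "Review"]
toy_cleaned = toy.dropna(subset=required).copy()
print(toy_cleaned.shape)  # (2, 3)
```

Whether this is preferable depends on how the other columns are used later; the notebook's stricter approach guarantees a fully populated table.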

In [None]:
import plotly.express as px

# Plot the distribution of the recommendation scores
# (nbins=11 allows one bin per score from 0 to 10)
fig = px.histogram(
    df_cleaned,
    x='How likely are you to recommend us to a friend or colleague?',
    nbins=11,
    title='Distribution of Recommendation Scores',
)
fig.update_layout(xaxis_title='Recommendation Score', yaxis_title='Count')
fig.show()
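
A histogram hides the exact counts, so a plain frequency table can be a useful companion. The sketch below, using a hypothetical `toy_scores` series, counts each score and reindexes over 0 to 10 so that scores nobody gave still appear with a count of zero.

```python
import pandas as pd

# Hypothetical recommendation scores
toy_scores = pd.Series([10, 9, 9, 7, 3, 10, 6])

# Count each score, then reindex over 0..10 so absent scores show up as 0
score_counts = toy_scores.value_counts().reindex(range(11), fill_value=0)
print(score_counts)
```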

## 2. Calculate NPS

In [None]:
# Map a recommendation score (0-10) to its NPS category
def categorize_nps(score):
    if score >= 9:
        return 'Promoter'
    elif score >= 7:
        return 'Passive'
    else:
        return 'Detractor'

# Apply the categorization function to the recommendation scores
df_cleaned['NPS Category'] = df_cleaned['How likely are you to recommend us to a friend or colleague?'].apply(categorize_nps)

# Display to confirm new column "NPS Category" is added
df_cleaned.head()

In [None]:
# Calculate proportions of each NPS category
nps_proportions = df_cleaned['NPS Category'].value_counts(normalize=True) * 100

# Display the proportions of each NPS category
nps_proportions

The **NPS Score** ranges from -100 to 100. A score above zero means there are more promoters than detractors, which indicates good customer loyalty; a score below zero means detractors outnumber promoters.

In [None]:
# Calculate the NPS score (a category with no responses counts as 0%)
nps_score = nps_proportions.get('Promoter', 0) - nps_proportions.get('Detractor', 0)
nps_score
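
The categorization and the score can also be computed in a vectorized way. The following is a sketch of one alternative using `pd.cut`, not the notebook's method; `toy_scores` is hypothetical and deliberately contains no detractors to show that an empty category is handled.

```python
import pandas as pd

toy_scores = pd.Series([10, 9, 8, 7, 10, 9])  # no detractors on purpose

# Bins are right-inclusive:
# (-1, 6] -> Detractor, (6, 8] -> Passive, (8, 10] -> Promoter
categories = pd.cut(
    toy_scores,
    bins=[-1, 6, 8, 10],
    labels=["Detractor", "Passive", "Promoter"],
)

# For categorical data, value_counts keeps unused categories with a count of 0
shares = categories.value_counts(normalize=True) * 100
toy_nps = shares["Promoter"] - shares["Detractor"]
print(round(toy_nps, 1))  # 66.7
```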

## 3. Sentiment Analysis

Depending on the environment, it may be necessary to install `tf_keras` first:
```
!pip install tf_keras
```

In [None]:
# Imports
from transformers import pipeline

In [None]:
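# Initialize the sentiment analysis pipeline
# (no model is specified, so the library's default checkpoint is used)
sentiment_pipeline = pipeline("sentiment-analysis")

# Analyze the sentiment of the reviews
# (truncation=True avoids errors on reviews longer than the model's input limit)
df_cleaned['Sentiment'] = df_cleaned['Review'].apply(
    lambda review: sentiment_pipeline(review, truncation=True)[0]['label']
)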

# Display the first few rows of the DataFrame with sentiment analysis results
df_cleaned.head()
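
Each pipeline result is a dictionary with a `label` and a confidence `score`. If a Neutral bucket is wanted from a model that only predicts positive or negative, one possible approach is to treat low-confidence predictions as neutral. The sketch below assumes that result shape; the `to_three_way` helper and its `threshold` value are hypothetical and untuned, and the example results are hand-written rather than real model output.

```python
def to_three_way(result, threshold=0.75):
    """Map one pipeline result such as {'label': 'POSITIVE', 'score': 0.98}
    to 'Positive', 'Negative', or 'Neutral'.

    The threshold is an illustrative assumption, not a tuned value.
    """
    if result["score"] < threshold:
        return "Neutral"
    return result["label"].capitalize()

# Hand-written results in the shape the pipeline returns (not real model output)
examples = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "NEGATIVE", "score": 0.95},
    {"label": "POSITIVE", "score": 0.60},
]
labels = [to_three_way(r) for r in examples]
print(labels)  # ['Positive', 'Negative', 'Neutral']
```

A model trained with an explicit neutral class would likely be a better fit than a confidence cutoff; this is only a lightweight workaround.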

The sentiment analysis aligns with the NPS results, showing significantly more negative reviews than positive ones.

In [None]:
# Visualize the sentiment distribution
fig = px.histogram(df_cleaned, x='Sentiment', title='Sentiment Distribution', labels={'Sentiment': 'Sentiment Category'}, color='Sentiment')
fig.show()
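
To check how closely sentiment and NPS categories agree, a cross-tabulation of the two columns can help. The sketch below uses hypothetical toy rows with the same column names as the notebook (`NPS Category` and `Sentiment`).

```python
import pandas as pd

# Hypothetical rows using the notebook's column names
toy = pd.DataFrame({
    "NPS Category": ["Promoter", "Promoter", "Passive", "Detractor", "Detractor"],
    "Sentiment": ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"],
})

# Row-normalized table: share of each sentiment within each NPS category
alignment = pd.crosstab(toy["NPS Category"], toy["Sentiment"], normalize="index")
print(alignment.round(2))
```

If the two measures agree, promoters should be mostly positive and detractors mostly negative; large off-diagonal shares point to reviews worth reading manually.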

## 4. Topic Modeling

Using keyword counts and LDA on TF-IDF features to extract key topics from the reviews.

In [None]:
# Imports
# Step 1: keyword counts
from sklearn.feature_extraction.text import CountVectorizer
# Step 2: topic modeling with LDA on TF-IDF features
from sklearn.decomposition import LatentDirichletAllocation as LDA
from sklearn.feature_extraction.text import TfidfVectorizer
# Visualization
from wordcloud import WordCloud
import matplotlib.pyplot as plt

In [None]:
# Initialize the CountVectorizer
vectorizer = CountVectorizer(stop_words='english', max_features=20)

# Fit and transform the reviews text
X = vectorizer.fit_transform(df_cleaned['Review'])

# Get the feature names (keywords)
keywords = vectorizer.get_feature_names_out()

# Sum up the counts of each keyword
keyword_counts = X.toarray().sum(axis=0)

# Create a DataFrame for keywords and their counts
keywords_df = pd.DataFrame({'Keyword': keywords, 'Count': keyword_counts})

# Sort the DataFrame by count in descending order
keywords_df = keywords_df.sort_values(by='Count', ascending=False)

# Display the top keywords
keywords_df.head(20)

In [None]:
# Initialize the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)

# Fit and transform the reviews text
tfidf_matrix = tfidf_vectorizer.fit_transform(df_cleaned['Review'])

# Initialize the LDA model
lda_model = LDA(n_components=5, random_state=42)

# Fit the LDA model to the TF-IDF matrix
lda_model.fit(tfidf_matrix)

# Get the feature names (keywords)
keywords = tfidf_vectorizer.get_feature_names_out()

# Display the top words for each topic
n_top_words = 10
topics = {}
for topic_idx, topic in enumerate(lda_model.components_):
    top_keywords = [keywords[i] for i in topic.argsort()[:-n_top_words - 1:-1]]
    topics[f'Topic {topic_idx + 1}'] = top_keywords

# Convert the topics to a DataFrame
topics_df = pd.DataFrame(topics)

# Display the topics DataFrame
topics_df
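
Beyond listing the top words, a fitted LDA model can also tell which topic dominates each review. The sketch below illustrates this on four hypothetical reviews. It uses `CountVectorizer` rather than TF-IDF, since LDA is usually described as a model of word counts; that is a choice made for this sketch, not the notebook's setup.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical reviews for illustration
toy_reviews = [
    "clean room comfortable bed quiet room",
    "friendly staff helpful staff great service",
    "comfortable bed clean bathroom",
    "helpful reception friendly service",
]

X_toy = CountVectorizer().fit_transform(toy_reviews)
lda_toy = LatentDirichletAllocation(n_components=2, random_state=42).fit(X_toy)

# Each row is one review's distribution over topics and sums to 1
doc_topics = lda_toy.transform(X_toy)

# Index of the highest-weight topic per review (0-based)
dominant_topic = doc_topics.argmax(axis=1)
print(doc_topics.round(2))
print(dominant_topic)
```

With the notebook's own objects, the equivalent call would be `lda_model.transform(tfidf_matrix).argmax(axis=1)`, which could be stored as a new column to compare topics against NPS categories.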

In [None]:
# Create a word cloud for each topic (one subplot per topic)
fig, axes = plt.subplots(1, len(topics), figsize=(20, 10), sharex=True, sharey=True)

for i, (topic, top_keywords) in enumerate(topics.items()):
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(top_keywords))
    axes[i].imshow(wordcloud, interpolation='bilinear')
    axes[i].set_title(topic, fontsize=16)
    axes[i].axis('off')

plt.tight_layout()
plt.show()
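
Because each top keyword appears once in the joined string, the word clouds above give every word roughly the same size. One way to reflect how strongly a word belongs to a topic is to build a word-to-weight dictionary from the topic's row in `lda_model.components_`. The `topic_frequencies` helper and the toy arrays below are hypothetical.

```python
import numpy as np

def topic_frequencies(component, feature_names, n_top_words=10):
    """Return {word: weight} for the highest-weight words of one LDA topic."""
    top_idx = np.argsort(component)[::-1][:n_top_words]
    return {str(feature_names[i]): float(component[i]) for i in top_idx}

# Hypothetical topic-word weights in the shape of one row of lda_model.components_
toy_component = np.array([0.2, 3.5, 1.1, 0.7])
toy_features = np.array(["bed", "staff", "room", "pool"])

freqs = topic_frequencies(toy_component, toy_features, n_top_words=3)
print(freqs)  # {'staff': 3.5, 'room': 1.1, 'pool': 0.7}
```

A dictionary like this can be passed to `WordCloud.generate_from_frequencies` in place of the joined string used with `generate`.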