# Customer Reviews Keyword Extraction — Negative Reviews

This Colab notebook demonstrates how to generate synthetic customer reviews labeled by sentiment, select the negative ones, and extract keywords from them using TF-IDF and simple statistical scoring. All data in this repo is synthetic and created for demonstration only.

## Install and import dependencies
Run the cell below in Colab to install any missing packages and download NLTK data.

In [None]:
# Install required packages (safe to re-run; -q keeps pip output quiet)
!pip install -q nltk scikit-learn pandas

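import nltk
# Only the stopword list is used below. TfidfVectorizer does its own tokenization,
# so the 'punkt' tokenizer data is optional here.
nltk.download('punkt')
nltk.download('stopwords')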

print('Dependencies ready')

## 1) Create synthetic dataset
We'll generate a small synthetic dataset of service reviews with some negative and positive examples.

In [None]:
import random
import pandas as pd

random.seed(42)

negative_templates = [
    'Very disappointed with the service. {}',
    'Poor experience: {}',
    'Terrible support. {}',
    'I waited too long and {}',
    'The {} was unacceptable and frustrating.'
]
positive_templates = [
    'Great service, {}',
    'Very happy with the quick response: {}',
    'Excellent support. {}',
    'Loved the {} and the staff were helpful.'
]
negative_phrases = [
    'agent was rude',
    'took ages to resolve',
    'problem still not fixed',
    'charged me extra',
    'could not reach anyone',
    'kept transferring my call',
    'billing issue not resolved',
    'promised callback never came',
    'website kept crashing',
    'service outage for hours'
]
positive_phrases = [
    'friendly staff',
    'quick resolution',
    'helpful support',
    'refund processed smoothly',
    'very satisfied'
]
rows = []
for i in range(300):
    if random.random() < 0.45:
        t = random.choice(negative_templates)
        phrase = random.choice(negative_phrases)
        review = t.format(phrase)
        sentiment = 'negative'
    else:
        t = random.choice(positive_templates)
        phrase = random.choice(positive_phrases)
        review = t.format(phrase)
        sentiment = 'positive'
    rows.append({'review_id': i+1, 'review': review, 'sentiment': sentiment})

df = pd.DataFrame(rows)
df.head()

## 2) Simple EDA
Show counts and sample negative reviews.

In [None]:
print('Total reviews:', len(df))
print(df['sentiment'].value_counts())
df[df['sentiment']=='negative'].sample(5, random_state=1)

## 3) Keyword extraction for negative reviews using TF-IDF
We'll extract keywords by computing TF-IDF on the negative reviews corpus and selecting the top terms by average TF-IDF score.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import numpy as np

neg_df = df[df['sentiment']=='negative'].copy()
corpus = neg_df['review'].tolist()

# stop_words must be 'english', a list, or None; newer scikit-learn versions reject a set
stop_words = stopwords.words('english')
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words=stop_words, max_df=0.85, min_df=1)
X = vectorizer.fit_transform(corpus)
terms = vectorizer.get_feature_names_out()

# average tf-idf score per term across negative corpus
avg_tfidf = np.asarray(X.mean(axis=0)).ravel()
top_n = 30
top_idx = np.argsort(avg_tfidf)[::-1][:top_n]
top_terms = [(terms[i], round(avg_tfidf[i],4)) for i in top_idx]
pd.DataFrame(top_terms, columns=['term','avg_tfidf']).head(30)

## 4) Extract keywords per review (top terms per review)
We can also extract the top-k TF-IDF terms for each negative review.

In [None]:
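def top_keywords_for_doc(doc_idx, k=5):
    """Return up to k (term, score) pairs with the highest nonzero TF-IDF weight in one review."""
    row = X[doc_idx].toarray().ravel()
    idx = row.argsort()[::-1][:k]
    return [(terms[i], round(row[i], 4)) for i in idx if row[i] > 0]

# Reset the index so that position i in neg_df matches row i of X
neg_df = neg_df.reset_index(drop=True)
neg_df['top_keywords'] = neg_df.index.map(lambda i: top_keywords_for_doc(i, k=5))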
neg_df[['review','top_keywords']].head(10)

## 5) Simple phrase scoring (frequency + TF-IDF) for multi-word phrases
We'll score candidate phrases (from n-grams) by combining document frequency and average TF-IDF to produce a ranked list of *keywords/phrases* relevant to negative reviews.
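
Written out, the score used in this section is as follows. The symbols are notation introduced for this note, not names from the code: $t$ is a candidate term, $N$ is the number of negative reviews, and $\mathrm{df}(t)$ is the number of negative reviews with a nonzero TF-IDF weight for $t$.

$$
\mathrm{score}(t) = \mathrm{df}(t) \cdot \frac{1}{N} \sum_{d=1}^{N} \mathrm{tfidf}(t, d)
$$

Since the average already includes zeros for reviews that lack the term, multiplying by $\mathrm{df}(t)$ rewards widespread terms a second time. That is a deliberate simplification here, not a standard keyword-scoring formula.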

In [None]:
from collections import Counter

# Document frequency per term: number of negative reviews with a nonzero TF-IDF weight
doc_freq = np.asarray((X > 0).sum(axis=0)).ravel()
df_counts = Counter(dict(zip(terms, doc_freq)))
tfidf_scores = dict(zip(terms, avg_tfidf))

candidates = []
for term in terms:
    # score = df * avg_tfidf (simple)
    score = df_counts[term] * tfidf_scores[term]
    candidates.append((term, df_counts[term], round(tfidf_scores[term],5), round(score,5)))

cand_df = pd.DataFrame(candidates, columns=['phrase','doc_freq','avg_tfidf','score'])
cand_df = cand_df.sort_values('score', ascending=False).head(40)
cand_df

## 6) Save results and export
Save negative keywords to a CSV for inspection.

In [None]:
out_csv = '/content/negative_keywords.csv'
cand_df.to_csv(out_csv, index=False)
print('Saved candidate keywords to', out_csv)
display(cand_df.head(20))

## Notes and next steps
- This notebook uses synthetic data for demonstration (see repository README).
- For production, consider YAKE, RAKE, or other multi-word phrase extraction methods, combined with more advanced preprocessing, lemmatization, and domain-specific stopwords.
- You can extend this to cluster negative reviews and extract cluster-specific keywords.