# ============================================
# Semantic Analysis of Tourist Reviews
# Sentiment Classification (Weak Supervision) + Topic Modeling
# ============================================
### This notebook-style script is designed for Google Colab or Jupyter.
### Dataset: Tourist Review.csv (with columns: review, location). The data is loaded from an online source (GitHub raw link).

This project explores tourist reviews using Natural Language Processing (NLP).
Goals:
- Automatically assign sentiment labels using a lexicon- and rule-based analyzer (VADER)
- Train classical ML classifiers on generated labels
- Perform topic modeling to discover common themes
- Analyze language patterns in positive and negative reviews

# ============================================
# 1. INSTALL & IMPORT LIBRARIES
# ============================================

In [None]:
!pip install nltk scikit-learn wordcloud
!pip install transformers

import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.decomposition import LatentDirichletAllocation

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from collections import Counter

from transformers import pipeline

nltk.download('vader_lexicon')

# ============================================
# 2. LOAD DATASET FROM GITHUB
# ============================================
### Dataset is stored inside GitHub repository and loaded via RAW link

In [None]:
DATA_URL = "https://raw.githubusercontent.com/MEGMON19/tourist-review-nlp/refs/heads/main/Tourist%20Review.csv"

df = pd.read_csv(DATA_URL)
print(df.head())
print(df.columns)

### Basic Exploration

In [None]:
print("Number of rows:", len(df))
print(df['location'].value_counts().head())

# ============================================
# 3. TEXT CLEANING
# ============================================

In [None]:
def clean_text(text):
    text = str(text).lower()
    text = re.sub(r"[^a-z\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text


In [None]:
df['clean_text'] = df['review'].apply(clean_text)
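
To make the effect of `clean_text` concrete, the sketch below applies the same function to two made-up reviews (the sample strings are illustrative and not taken from the dataset). It shows that digits, punctuation, and accented letters are all dropped, so a word such as "café" loses its last character.

```python
import re

def clean_text(text):
    # Same steps as the cell above: lowercase, keep only a-z and whitespace,
    # then collapse repeated whitespace
    text = str(text).lower()
    text = re.sub(r"[^a-z\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

# Hypothetical sample reviews, for illustration only
samples = ["Amazing view!!! 10/10, would visit again.", "Café wasn't bad"]
cleaned = [clean_text(s) for s in samples]

for raw, out in zip(samples, cleaned):
    print(repr(raw), "->", repr(out))
```

Because apostrophes are removed as well, contractions such as "wasn't" become "wasnt", which is worth keeping in mind when reading the feature lists later on.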

# ============================================
# 4. AUTO-LABEL SENTIMENT WITH VADER
# ============================================
Sentiment labels are generated automatically with VADER, a lexicon- and rule-based sentiment analyzer. Training classifiers on such automatically generated labels is a form of weak supervision.


In [None]:
sia = SentimentIntensityAnalyzer()

In [None]:
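# Note: VADER also uses punctuation and capitalization as intensity cues,
# which clean_text removes; scoring df['review'] directly is an alternative.
scores = df['clean_text'].apply(lambda x: sia.polarity_scores(x)['compound'])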

In [None]:
def vader_to_label(score):
  if score >= 0.05:
    return "positive"
  elif score <= -0.05:
    return "negative"
  else:
    return "neutral"

In [None]:
df['sentiment'] = scores.apply(vader_to_label)
print(df['sentiment'].value_counts())
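
The thresholding rule can also be applied to the whole column at once. The following is a minimal sketch using `np.select` with the same cut-offs of 0.05 and -0.05; the compound scores are hypothetical values chosen to hit the boundaries, not output from VADER.

```python
import numpy as np
import pandas as pd

# Hypothetical compound scores in [-1, 1]; real values would come from VADER
demo_scores = pd.Series([0.80, 0.05, 0.0, -0.05, -0.60])

# Same thresholds as vader_to_label, evaluated in order:
# first matching condition wins, everything else is "neutral"
demo_labels = pd.Series(
    np.select(
        [demo_scores >= 0.05, demo_scores <= -0.05],
        ["positive", "negative"],
        default="neutral",
    ),
    index=demo_scores.index,
)

print(demo_labels.tolist())
```

Scores exactly on a threshold fall into the positive or negative class, because both comparisons are inclusive.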

The VADER-generated labels are highly imbalanced, with positive reviews dominating. This is consistent with a common real-world pattern in which users are more likely to share positive travel experiences than negative ones. Class imbalance, however, makes the minority classes harder to classify.

# ============================================
# 5. SPLIT DATA
# ============================================

In [None]:
X = df['clean_text']
y = df['sentiment']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.2, random_state=42
)
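
With imbalanced labels, a purely random split can leave very few minority-class examples in the test set. One option, not used in the cell above, is to pass `stratify=y` so that both splits keep the original class proportions. The sketch below illustrates this on hypothetical labels (80 positive, 10 neutral, 10 negative); the variable names are placeholders.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels, for illustration only
y_demo = ["positive"] * 80 + ["neutral"] * 10 + ["negative"] * 10
X_demo = [f"review {i}" for i in range(len(y_demo))]

# stratify=y_demo preserves the class proportions in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

test_counts = Counter(y_te)
print(test_counts)
```

Note that switching the notebook to a stratified split would change the test set and therefore the reported scores.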

# ============================================
# 6. TF-IDF Vectorization
# ============================================

In [None]:
vectorizer = TfidfVectorizer(
  max_features=15000,
  stop_words='english',
  ngram_range=(1,2)
)

In [None]:
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# ============================================
# 7. Baseline Model - LOGISTIC REGRESSION
# ============================================

In [None]:
log_model = LogisticRegression(max_iter=1000)
log_model.fit(X_train_vec, y_train)

In [None]:
pred_log = log_model.predict(X_test_vec)
print("Logistic Regression Results")
print(classification_report(y_test, pred_log))

The baseline Logistic Regression model performs well on positive reviews but struggles with the minority classes, especially negative sentiment. The strong class imbalance is the most likely cause.
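
A common mitigation for this kind of imbalance, not applied in this notebook, is to reweight classes inversely to their frequency. In scikit-learn this is what `class_weight="balanced"` does for estimators such as `LogisticRegression` and `LinearSVC`. The sketch below computes those weights on hypothetical labels to show their size.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 8 positive, 1 neutral, 1 negative
y_demo = np.array(["positive"] * 8 + ["neutral"] + ["negative"])
classes = np.unique(y_demo)

# "balanced" weight = n_samples / (n_classes * count of that class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)
weight_by_class = {str(c): float(w) for c, w in zip(classes, weights)}

print(weight_by_class)
```

Each rare class receives a weight of about 3.33, while the dominant class receives about 0.42, so errors on minority examples cost more during training.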

# ============================================
# 8. Support Vector Machine (SVM) Model
# ============================================

In [None]:
svm_model = LinearSVC()
svm_model.fit(X_train_vec, y_train)

In [None]:
pred_svm = svm_model.predict(X_test_vec)
print("SVM Results")
print(classification_report(y_test, pred_svm))

The linear SVM clearly outperforms the baseline Logistic Regression, especially on the neutral and negative classes (macro F1 of 0.69 versus 0.52). This suggests that a linear SVM is well suited to high-dimensional, sparse TF-IDF features.

# ============================================
# 9. Hyperparameter Tuning - GRIDSEARCH
# ============================================

In [None]:
param_grid = {
  'C': [0.1, 1, 5]
}

In [None]:
grid = GridSearchCV(
  LogisticRegression(max_iter=1000),
  param_grid,
  cv=3,
  scoring='f1_macro'
)

In [None]:
grid.fit(X_train_vec, y_train)
print("Best parameters:", grid.best_params_)

In [None]:
gs_model = grid.best_estimator_
pred_gs = gs_model.predict(X_test_vec)
print("Tuned Logistic Regression")
print(classification_report(y_test, pred_gs))

Hyperparameter tuning improves the Logistic Regression model substantially (macro F1 rises from 0.52 to 0.66). However, SVM still achieves the best overall balance between precision and recall.
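
In the grid search above, the TF-IDF vocabulary is fitted once on the full training set before cross-validation. An alternative is to wrap the vectorizer and the classifier in a `Pipeline`, so that the vectorizer is refitted inside each fold. The following is a minimal sketch on a tiny made-up corpus; the step names `tfidf` and `clf` are arbitrary choices.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Tiny hypothetical corpus, for illustration only
pos = ["beautiful peaceful place", "amazing view and clean beach",
       "wonderful temple visit", "lovely park and great food",
       "peaceful lake with amazing sunset", "great trip and beautiful garden"]
neg = ["dirty and crowded beach", "bad service and dirty rooms",
       "poorly maintained park", "crowded temple and rude staff",
       "terrible food and bad roads", "dirty lake and terrible smell"]
texts = pos + neg
labels = ["positive"] * len(pos) + ["negative"] * len(neg)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),                # refitted on each training fold
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
search = GridSearchCV(pipe, {"clf__C": [0.1, 1, 5]}, cv=3, scoring="f1_macro")
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print(search.predict(["dirty beach"]))
```

The fitted `search` object accepts raw text directly, since the vectorizer is part of the estimator.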

## Model Comparison

The performance of evaluated models is summarized below:

| Model | Accuracy | Macro F1-score |
|-----|----------|----------------|
| Logistic Regression (baseline) | 0.86 | 0.52 |
| Tuned Logistic Regression | 0.90 | 0.66 |
| Support Vector Machine (SVM) | 0.91 | 0.69 |

SVM achieved the highest accuracy and the best balance between precision and recall across all classes. Therefore, SVM was selected as the final model for further analysis.
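
The gap between accuracy and macro F1 in the table is typical for imbalanced data. As a worked illustration on hypothetical labels, a classifier that always predicts the majority class reaches high accuracy while its macro F1 stays low, because each class contributes equally to the macro average.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical ground truth: 90 positive, 5 neutral, 5 negative
y_true = ["positive"] * 90 + ["neutral"] * 5 + ["negative"] * 5
# A trivial "model" that always predicts the majority class
y_pred = ["positive"] * 100

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores classes that are never predicted as F1 = 0
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy = {acc:.2f}, macro F1 = {macro_f1:.2f}")
```

This is why macro F1, rather than accuracy alone, is the more informative column when comparing the three models.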



In [None]:
final_model = svm_model


# ============================================
# 10. Transformer-based Sentiment Analysis (BERT)
# ============================================

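To check the consistency of the classical machine learning results, a pretrained transformer-based sentiment model was tested on a few sample reviews. When no model name is given, `pipeline("sentiment-analysis")` falls back to a default checkpoint, which in current `transformers` releases is a DistilBERT model fine-tuned on SST-2. That model predicts only POSITIVE or NEGATIVE and has no neutral class. Transformer models use contextual word representations and often achieve strong performance on sentiment analysis tasks.

On these samples, the transformer predictions were largely consistent with the labels produced by the classical models, especially for clearly positive reviews.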


In [None]:
bert_pipeline = pipeline("sentiment-analysis")

In [None]:
samples = [
    "The place was absolutely wonderful and peaceful",
    "The area was dirty and poorly maintained",
    "It is a famous tourist destination"
]

In [None]:
for s in samples:
    print(s, "->", bert_pipeline(s))

The transformer model classified the clearly positive and clearly negative examples correctly and with very high confidence. The third sentence is a neutral statement, but the default model has no neutral class, so it is necessarily assigned to one of the two labels. This qualitative check on three hand-written sentences agrees with the labels produced by the classical models; it serves as a sanity check rather than a quantitative validation.
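
Agreement between two sets of predictions can also be quantified instead of inspected by eye. The sketch below assumes both models have been mapped to the same label set and uses made-up labels (not results from this notebook) to compute the raw agreement rate and Cohen's kappa, which corrects for agreement expected by chance.

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical labels from two models on the same four reviews
labels_a = ["positive", "positive", "negative", "negative"]
labels_b = ["positive", "positive", "negative", "positive"]

# Fraction of reviews on which both models give the same label
agreement = accuracy_score(labels_a, labels_b)
# Chance-corrected agreement: 1 is perfect, 0 is chance level
kappa = cohen_kappa_score(labels_a, labels_b)

print(f"agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

On imbalanced data the raw agreement rate can look high simply because both models favor the majority class, which is the case kappa is designed to expose.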


# ============================================
# 11. Confusion Matrix
# ============================================

In [None]:
ConfusionMatrixDisplay.from_estimator(
  final_model,
  X_test_vec,
  y_test
)
plt.title("Confusion Matrix")
plt.show()

The confusion matrix shows that misclassifications mainly occur between the neutral and positive classes, while negative reviews are often misclassified as positive. This reflects the linguistic similarity between neutral and positive opinions and the scarcity of negative examples.
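
With imbalanced classes, raw counts in a confusion matrix are dominated by the majority class. Normalizing each row by the number of true examples turns the cells into per-class rates; `ConfusionMatrixDisplay.from_estimator` accepts the same `normalize` argument. A small sketch on hypothetical labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 8 positive (all correct), 2 negative (1 correct)
y_true_demo = ["positive"] * 8 + ["negative"] * 2
y_pred_demo = ["positive"] * 8 + ["positive", "negative"]
label_order = ["negative", "positive"]

# normalize="true" divides each row by the number of true samples in that class
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=label_order, normalize="true")

print(label_order)
print(np.round(cm, 2))
```

The diagonal then reads directly as recall per class: 0.5 for negative and 1.0 for positive in this toy example.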

# ============================================
# 12. Most Important Words
# ============================================

In [None]:
feature_names = vectorizer.get_feature_names_out()

In [None]:
for i, label in enumerate(final_model.classes_):
  top = np.argsort(final_model.coef_[i])[-10:]
  print("Top words for", label)
  print([feature_names[j] for j in top])

The most important words extracted by the classifier are highly interpretable and align well with human intuition, confirming that the model learns meaningful semantic patterns.
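
The loop above prints the top words in ascending order of weight. A small helper can return them strongest first and keep them in a dictionary for later use. This is a sketch with a hypothetical coefficient matrix; `top_words` is an illustrative name, and it assumes one coefficient row per class, which holds for a linear model with three or more classes but not for the binary case, where `coef_` has a single row.

```python
import numpy as np

def top_words(coef, feature_names, classes, k=10):
    """Return the k highest-weighted features per class, strongest first."""
    feature_names = np.asarray(feature_names)
    result = {}
    for row, label in zip(coef, classes):
        order = np.argsort(row)[::-1][:k]  # indices by descending weight
        result[label] = feature_names[order].tolist()
    return result

# Hypothetical coefficients: 2 classes x 4 features, for illustration only
demo_features = ["beautiful", "peaceful", "dirty", "crowded"]
demo_coef = np.array([
    [-0.6, -0.3, 0.9, 0.4],   # weights for "negative"
    [0.8, 0.5, -0.7, -0.1],   # weights for "positive"
])
demo_top = top_words(demo_coef, demo_features, ["negative", "positive"], k=2)

print(demo_top)
```

With the notebook's objects, the call would be `top_words(final_model.coef_, feature_names, final_model.classes_)`.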

# ============================================
# 13. Topic Modeling (LDA)
# ============================================

In [None]:
lda_vectorizer = TfidfVectorizer(
  max_features=5000,
  stop_words='english'
)

In [None]:
X_lda = lda_vectorizer.fit_transform(df['clean_text'])

In [None]:
lda = LatentDirichletAllocation(
  n_components=5,
  random_state=42
)

In [None]:
lda.fit(X_lda)

In [None]:
lda_features = lda_vectorizer.get_feature_names_out()

In [None]:
for idx, topic in enumerate(lda.components_):
  print(f"Topic {idx+1}:")
  print([lda_features[i] for i in topic.argsort()[-10:]])

Topic modeling reveals several coherent themes related to nature tourism, religious sites, heritage cities, and national parks, confirming that the dataset captures diverse types of tourist experiences.

# ============================================
# 14. Simple Location-Based Analysis
# ============================================

In [None]:
location_sentiment = df.groupby(['location','sentiment']).size().unstack().fillna(0)
print(location_sentiment.head())

Location-based analysis shows that most popular destinations receive predominantly positive sentiment, but some locations exhibit higher proportions of neutral or negative opinions, which could indicate areas for improvement.
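
The table above contains raw counts, so locations with many reviews dominate it. To compare proportions across locations, the counts can be normalized per row. A minimal sketch with `pd.crosstab` on a hypothetical data frame (the location names are made up):

```python
import pandas as pd

# Hypothetical reviews, for illustration only
demo = pd.DataFrame({
    "location": ["Beach", "Beach", "Beach", "Beach", "Temple", "Temple"],
    "sentiment": ["positive", "positive", "positive", "negative",
                  "positive", "neutral"],
})

# normalize="index" converts each location's counts into shares that sum to 1
shares = pd.crosstab(demo["location"], demo["sentiment"], normalize="index")

print(shares.round(2))
```

Sorting `shares` by the negative column would then surface the locations with the highest proportion of negative reviews, ideally after filtering out locations with very few reviews.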

# ============================================
# 15. LIMITATIONS
# ============================================

- Dataset is highly imbalanced.
- Sentiment labels are automatically generated and may contain noise.
- Only English-language reviews were considered.

# ============================================
# 16. CONCLUSIONS
# ============================================

This project demonstrated how Natural Language Processing techniques can be applied to analyze tourist reviews.

First, sentiment labels were generated automatically using VADER, a lexicon- and rule-based sentiment analyzer. Such weak supervision introduces label noise, because automatically generated labels may contain errors or biases, and these imperfections can propagate into supervised models trained on the data. Results should therefore be interpreted with caution, and manual annotation would be required for high-stakes applications. Classical machine learning models were then trained on these labels using TF-IDF features.

Among the evaluated models, the Support Vector Machine achieved the best overall performance, with 91% accuracy and the highest macro F1-score (0.69), both measured against the VADER-generated labels. Hyperparameter tuning improved Logistic Regression (macro F1 from 0.52 to 0.66), but it remained slightly behind the SVM.

Feature importance analysis showed that the models learned meaningful semantic patterns, with positive words such as *beautiful*, *amazing*, and *peaceful*, and negative words such as *dirty*, *bad*, and *unfortunately*.

Topic modeling revealed coherent themes related to nature tourism, religious sites, heritage cities, and national parks, indicating that the dataset captures diverse tourist experiences.

Location-based analysis showed that most destinations receive predominantly positive reviews, although some locations exhibit higher proportions of neutral and negative opinions.

Limitations of this study include class imbalance and reliance on automatically generated sentiment labels. Future work could include manual annotation, class balancing techniques, and fine-tuning transformer-based models.

Additionally, a pretrained BERT-family model was used for a qualitative check on a few sample sentences and produced predictions consistent with the classical models.

Overall, the results indicate that NLP methods can effectively extract insights from large collections of tourist reviews.

