# Notebook Purpose

This Jupyter notebook has three objectives:

## Objective 1: Filtering Non-English Reviews

The first goal of this notebook is to remove non-English reviews from the dataset. The stop-word list and the sentiment lexicon used later are English-only, so reviews in other languages would only add noise to the analysis.

## Objective 2: Applying Topic Modeling

The second objective of this notebook is to apply topic modeling techniques to the filtered English-language reviews. Topic modeling helps in uncovering hidden themes or topics within textual data, which can be valuable for various applications such as sentiment analysis, content categorization, and understanding customer feedback.

## Objective 3: Sentiment Analysis

The third objective is to perform sentiment analysis on the English-language reviews. Sentiment analysis helps in understanding the emotional tone and sentiment expressed by customers in their reviews. This analysis will categorize reviews into positive, negative, or neutral sentiments, providing valuable insights into customer satisfaction and opinions about the restaurants.

By the end of this notebook, we aim to have a clean dataset of English reviews, a set of identified topics, and sentiment scores that can be used for further analysis and insights.

Let's proceed with the tasks to achieve these objectives.


- The main reason I am performing these steps here and not in the ETL process is the time they take: filtering by language, topic modeling, and sentiment analysis take more than two and a half hours to run.

# Objective 1: Filtering Non-English Reviews

In [12]:
from langdetect import detect

In [13]:
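import pandas as pd
from langdetect import DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # langdetect is non-deterministic unless the seed is fixed
df_review = pd.read_csv('C:\\Users\\User\\Desktop\\Final Project\\CSV_files\\mixed_review.csv')

def is_english(text):
    # Missing reviews load as NaN (a float), which detect() cannot process
    if not isinstance(text, str):
        return False
    try:
        return detect(text) == 'en'
    except LangDetectException:  # raised when the text has no detectable features
        return False

# Keep only the rows detected as English
df_review = df_review.loc[df_review['text'].apply(is_english)]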

In [14]:
len(df_review)

346077

In [15]:
df_review.to_csv('english_review.csv', index=False)
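
Before discarding rows, it can be useful to see how the reviews split across detected languages. The sketch below tallies the labels returned by any detector callable. `language_counts` and `toy_detect` are hypothetical names introduced for illustration, and `toy_detect` is only a stand-in so the block runs without langdetect installed; in this notebook, `langdetect.detect` would be passed instead.

```python
from collections import Counter

def language_counts(texts, detect_fn):
    """Count detected language codes; undetectable entries are tallied as 'unknown'."""
    counts = Counter()
    for text in texts:
        # Missing or empty reviews cannot be classified
        if not isinstance(text, str) or not text.strip():
            counts["unknown"] += 1
            continue
        try:
            counts[detect_fn(text)] += 1
        except Exception:
            # Any detector failure (e.g. no usable features) counts as unknown
            counts["unknown"] += 1
    return counts

def toy_detect(text):
    """Stand-in detector (an assumption, NOT langdetect): ASCII-only text is 'en'."""
    if not any(ch.isalpha() for ch in text):
        raise ValueError("no alphabetic characters")
    return "en" if text.isascii() else "other"

sample = ["Great pizza!", "Très bon café", "12345", None]
counts = language_counts(sample, toy_detect)
print(counts)
```

A large 'unknown' share would suggest inspecting those rows before treating them as non-English.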

# Objective 2: Topic Modeling

In [16]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [17]:
# LDA models word counts, so CountVectorizer is used here instead of TF-IDF weights
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv('C:\\Users\\User\\Desktop\\Final Project\\english_review.csv')

# Text Preprocessing
nltk.download('stopwords')
nltk.download('punkt')
stop_words = set(stopwords.words('english'))

def preprocess(review):
    # Lowercase, tokenize, and keep alphabetic tokens that are not stop words
    tokens = word_tokenize(review.lower())
    return " ".join(word for word in tokens if word not in stop_words and word.isalpha())

cleaned_reviews = df['text'].apply(preprocess)

# Build one document-term matrix for the whole corpus.
# LDA learns topics from word co-occurrence across documents, so it is fitted
# once on all reviews; fitted on a single review it has nothing to learn from.
vectorizer = CountVectorizer()
doc_term_matrix = vectorizer.fit_transform(cleaned_reviews)

lda = LatentDirichletAllocation(n_components=2, random_state=42)  # You can adjust the number of topics
doc_topic = lda.fit_transform(doc_term_matrix)  # one row of topic weights per review

# Get the top five words of each topic
feature_names = vectorizer.get_feature_names_out()
topic_labels = []
for topic in lda.components_:
    top_words_idx = topic.argsort()[-5:][::-1]
    topic_labels.append(", ".join(feature_names[i] for i in top_words_idx))

# Add a new column 'topic' holding the top words of each review's most probable topic
df['topic'] = [topic_labels[i] for i in doc_topic.argmax(axis=1)]

# Save the DataFrame to a new CSV file with the added 'topic' column
df.to_csv('english_topic_review.csv', index=False)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
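
The `argsort()[-5:][::-1]` idiom in the cell above picks the indices of the five largest weights, largest first. The same selection can be written in plain Python, as an illustrative sketch; `top_k_terms` and the sample weights and words are made up for the example.

```python
def top_k_terms(weights, names, k=5):
    """Return the k names with the largest weights, largest first."""
    # Sort the positions by their weight, descending, then keep the first k
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [names[i] for i in order[:k]]

weights = [0.1, 2.5, 0.7, 1.9, 0.3, 3.2]
names = ["table", "pizza", "wait", "service", "menu", "food"]
top_three = top_k_terms(weights, names, k=3)
print(top_three)  # ['food', 'pizza', 'service']
```

When several weights are tied, the order among them may differ from what NumPy's `argsort` produces.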


# Objective 3: Sentiment Analysis

In [7]:
import pandas as pd
df_review = pd.read_csv('C:\\Users\\User\\Desktop\\Final Project\\CSV_files\\review_with_topic.csv')

In [9]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...


True

In [12]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Initialize the sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# List to store sentiment results
sentiments = []

# Iterate through the 'text' column of the DataFrame and calculate sentiment scores
for text in df_review['text']:
    sentiment = analyzer.polarity_scores(text)
    compound_score = sentiment['compound']
    if compound_score >= 0.05:
        sentiment_label = 'Positive'
    elif compound_score <= -0.05:
        sentiment_label = 'Negative'
    else:
        sentiment_label = 'Neutral'
    sentiments.append({'Sentiment': sentiment_label, 'Score': compound_score})

# Convert the list of sentiment results into a DataFrame
sentiments_df = pd.DataFrame(sentiments)

# Add the 'Sentiment' and 'Score' columns to the original DataFrame
df_review['Sentiment'] = sentiments_df['Sentiment']
df_review['Score'] = sentiments_df['Score']

# Now, df_review contains the 'Sentiment' and 'Score' columns

# Save the DataFrame to a CSV file
df_review.to_csv('review.csv', index=False)
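
The labeling rule used above (a compound score of at least 0.05 is positive, at most -0.05 is negative, anything in between is neutral) can be isolated in a small helper so it is easy to check on its own. This is a minimal sketch; `label_from_compound` is a hypothetical name and not part of NLTK.

```python
def label_from_compound(score, threshold=0.05):
    """Map a VADER-style compound score in [-1, 1] to a sentiment label."""
    if score >= threshold:
        return 'Positive'
    if score <= -threshold:
        return 'Negative'
    return 'Neutral'

# Boundary values fall on the positive/negative side, matching the loop above
for value in (0.6, 0.05, 0.0, -0.05, -0.8):
    print(value, label_from_compound(value))
```

In the notebook, the helper could be applied to the `Score` column, for example with `df_review['Score'].apply(label_from_compound)`.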