# Notebook Purpose

This Jupyter notebook has two objectives:

## Objective 1: Filtering Non-English Reviews

The first goal is to filter non-English reviews out of the dataset. Non-English reviews add noise to text analysis, so the rest of the notebook works only with English-language reviews.

## Objective 2: Applying Topic Modeling

The second objective is to apply topic modeling to the filtered English-language reviews. Topic modeling uncovers hidden themes in text, which is useful for applications such as sentiment analysis, content categorization, and understanding customer feedback.

By the end of this notebook, we aim to have a clean dataset of English reviews and a set of identified topics that can be used for further analysis and insights.

Let's proceed with the tasks to achieve these objectives.


I run these two steps here rather than in the ETL process mainly because each of them takes a long time to run.

# Objective 1: Filtering Non-English Reviews

In [12]:
from langdetect import detect

In [13]:
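import pandas as pd

# Load the mixed-language reviews
df_review = pd.read_csv('C:\\Users\\User\\Desktop\\Final Project\\CSV_files\\mixed_review.csv')

def is_english(text):
    """Return True if langdetect classifies the text as English."""
    try:
        return detect(text) == 'en'
    except Exception:
        # detect() fails on empty, symbol-only, or non-string input (e.g. NaN);
        # treat those rows as non-English
        return False

# Keep only the rows whose review text is detected as English
df_review = df_review.loc[df_review['text'].apply(is_english)]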

In [14]:
len(df_review)

346077

In [15]:
df_review.to_csv('english_review.csv', index=False)

# Objective 2: Topic Modeling

In [16]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [17]:
df = pd.read_csv('C:\\Users\\User\\Desktop\\Final Project\\english_review.csv')

# Text preprocessing: download the NLTK resources used below
nltk.download('stopwords')
nltk.download('punkt')
stop_words = set(stopwords.words('english'))

# Top words of the dominant topic, one entry per review
all_topics = []

for review in df['text']:
    # Lowercase, tokenize, and keep alphabetic non-stopword tokens.
    # Single-letter tokens are dropped because TfidfVectorizer's default
    # token pattern ignores them anyway.
    tokenized_review = word_tokenize(review.lower())
    filtered_review = [
        word for word in tokenized_review
        if word not in stop_words and word.isalpha() and len(word) > 1
    ]

    # Nothing left to model: TfidfVectorizer would raise on an empty vocabulary
    if not filtered_review:
        all_topics.append("")
        continue

    # Fit a separate TF-IDF + two-topic LDA model on this single review.
    # fit_transform expects a list of strings, so the tokens are joined back together.
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_review = tfidf_vectorizer.fit_transform([" ".join(filtered_review)])

    lda = LatentDirichletAllocation(n_components=2, random_state=42)
    lda.fit(tfidf_review)

    # Take the topic with the larger total weight and its five highest-weighted words
    feature_names = tfidf_vectorizer.get_feature_names_out()
    topic = lda.components_[0] if lda.components_[0].sum() > lda.components_[1].sum() else lda.components_[1]
    top_words_idx = topic.argsort()[-5:][::-1]
    top_words = [feature_names[i] for i in top_words_idx]

    all_topics.append(", ".join(top_words))

# Store the topics in a new 'topic' column and save the result
df['topic'] = all_topics
df.to_csv('english_topic_review.csv', index=False)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [18]:
df = pd.read_csv('C:\\Users\\User\\Desktop\\Final Project\\english_topic_review.csv')

In [19]:
df.head()

Unnamed: 0,business_id,review_id,user_id,stars,useful,funny,cool,text,date,topic
0,XQfwVwDr-v0ZS3_CbbE5Xw,KU_O5udG6zpxOg-VcAEodg,mh_-eMZ6K5RLWhZyISBhwA,3,0,0,0,"If you decide to eat here, just be aware it is...",2018-07-07 22:09:11,"usually, long, pleasant, waiting, want"
1,XQfwVwDr-v0ZS3_CbbE5Xw,VJxlBnJmCDIy8DFG0kjSow,Iaee7y6zdSB3B-kRCo4z1w,2,0,0,0,This is the second time we tried turning point...,2017-05-13 17:06:55,"time, wait, chopped, long, food"
2,XQfwVwDr-v0ZS3_CbbE5Xw,S6pQZQocMB1WHMjTRbt77A,ejFxLGqQcWNLdNByJlIhnQ,4,2,0,1,The place is cute and the staff was very frien...,2017-08-08 00:58:18,"nice, brunch, place, away, avocado"
3,XQfwVwDr-v0ZS3_CbbE5Xw,WqgTKVqWVHDHjnjEsBvUgg,f7xa0p_1V9lx53iIGN5Sug,3,0,0,0,We came on a Saturday morning after waiting a ...,2017-11-19 02:20:23,"came, server, away, got, right"
4,XQfwVwDr-v0ZS3_CbbE5Xw,M0wzFFb7pefOPcxeRVbLag,dCooFVCk8M1nVaQqcfTL3Q,2,0,0,0,"Mediocre at best. The decor is very nice, and ...",2017-09-09 17:49:47,"food, one, us, time, star"
