## Using Lemmatization with Zemberek-Python
Instead of using a static suffix list, we will use `zemberek-python` for lemmatization. This allows for more accurate matching of keywords by finding their root forms, handling Turkish morphology correctly.

### Import Packages

In [42]:
#Temizlik kodu
#Bu kod wpdeki excel alıp temizler, sütunları düzeltir küfürleri ve ingilizce ve saçma kelimeler içeren satırları siler.

import pandas as pd
import re
from collections import Counter
from openpyxl import load_workbook
from openpyxl.styles import PatternFill
from openpyxl.utils import get_column_letter
#from langdetect import detect
import json
from zemberek import (
    TurkishMorphology,
)
from openpyxl.styles import PatternFill

In [43]:
# Dosya yolları
INPUT_PATH = "../dataset/reviews.csv"
OUTPUT_PATH = "../dataset/cleaned_reviews_zemberek.xlsx"

# KEYWORDS
# SWEAR_WORDS_PATH = "./constants/swear_words.json" # No longer needed directly if included in sentiment
ASPECTS_KEYWORDS_PATH = "./constants/aspects_keywords.json"
SENTIMENT_KEYWORDS_PATH = "./constants/sentiment_keywords.json"
COLORS_MAP_PATH = "./constants/colors_map.py"

# Initialize Zemberek Morphology
morphology = TurkishMorphology.create_with_defaults()

# Keyword lists will be loaded from JSON
ASPECTS_KEYWORDS = {}
SENTIMENT_KEYWORDS = {}



COLOR_MAP = {
    "Olumlu": PatternFill(start_color="90EE90", fill_type="solid"),
    "Olumsuz": PatternFill(start_color="FFA07A", fill_type="solid"),
    "Nötr": PatternFill(start_color="D3D3D3", fill_type="solid"),
}

2025-04-23 08:26:52,420 - zemberek.morphology.turkish_morphology - INFO
Msg: TurkishMorphology instance initialized in 6.211079835891724



In [44]:
# read json file
def read_json(file_path):
    with open(file_path, "r", encoding='utf-8') as f:
        # read the json file
        data = json.load(f)
        # close the file

        f.close()
    return data

In [45]:
# GET THE ASPECTS KEYWORDS FROM THE ASPECTS KEYWORDS FILE
aspects_data = read_json(ASPECTS_KEYWORDS_PATH)
ASPECTS_KEYWORDS = aspects_data["ASPECTS_KEYWORDS"]

# GET THE SENTIMENT KEYWORDS FROM THE SENTIMENT KEYWORDS FILE
sentiments_data = read_json(SENTIMENT_KEYWORDS_PATH)
SENTIMENT_KEYWORDS = sentiments_data["SENTIMENT_KEYWORDS"]

# print the aspects keywords
# Calculate the total number of keywords across all aspect categories
total_aspect_keywords = sum(len(keywords) for keywords in ASPECTS_KEYWORDS.values())
print("Aspects keywords count: ", total_aspect_keywords)
# print the sentiment keywords
print("Sentiment keywords count: ", len(SENTIMENT_KEYWORDS["Olumlu"]) + len(SENTIMENT_KEYWORDS["Olumsuz"]))

Aspects keywords count:  631
Sentiment keywords count:  942


### TEXT PROCESSING (with Zemberek):

In [46]:
# Convert any given text to string and lowercase
def to_str_lowercase(text):
    return str(text).lower()

# Replace Turkish characters with English equivalents for normalization after lemmatization
def normalize_text(text):
    replacements = str.maketrans("ÇĞİÖŞÜçğıöşü", "CGIOSUcgiosu")
    return text.translate(replacements)

# Normalize a single keyword (used for keyword lists before lemmatization)
def normalize_keyword_simple(keyword):
    keyword = to_str_lowercase(keyword)
    keyword = normalize_text(keyword)
    return keyword

# Remove extra spaces from the text
def remove_extra_spaces(text):
    return re.sub(r'\s+', ' ', text).strip()

# --- ZEMBEREK BASED LEMMATIZATION AND MATCHING ---
# Function to get lemma of a word using Zemberek
def get_lemma(word):
    try:
        results = morphology.analyze(word)
        # Check if results is a non-empty list
        if results and isinstance(results, list) and len(results) > 0:
            # Access the first analysis result
            first_analysis = results[0]
            # Check if the necessary attributes exist
            if hasattr(first_analysis, 'item') and hasattr(first_analysis.item, 'lemma'):
                return first_analysis.item.lemma
            else:
                # Log or handle unexpected structure if needed
                # print(f"Warning: Unexpected analysis structure for word '{word}': {first_analysis}")
                pass # Fall through to return original word
        # Handle cases where results is not a list or is empty
        # print(f"Warning: No valid analysis found for word '{word}'. Results: {results}")
        pass # Fall through to return original word
    except Exception as e:
        # Log unexpected errors during analysis if needed
        # print(f"Error analyzing word '{word}': {e}")
        pass # Fall through to return original word

    # Return original word if analysis fails or encounters issues
    return word

# Check if a normalized keyword's lemma exists in the lemmatized text
def has_lemmatized_keyword(text_lemmatized_normalized_set, normalized_keyword_lemma):
    # Check if the already normalized keyword lemma exists in the set of lemmatized words from the text
    return normalized_keyword_lemma in text_lemmatized_normalized_set

# Pre-lemmatize and normalize a text
def lemmatize_normalize_text(text):
    text = to_str_lowercase(text) # Lowercase the input text
    # Tokenize text into words
    words = re.findall(r'\b\w+\b', text)
    # Lemmatize each word in the text and normalize (English chars)
    lemmatized_normalized_words = {normalize_text(get_lemma(word)) for word in words}
    return lemmatized_normalized_words


### PROCESS CONSTANTS (with Zemberek Lemmatization)

In [47]:
# Lemmatize and normalize all keywords in the dictionaries
def lemmatize_normalize_keywords(keywords_dict):
    processed_dict = {}
    for category, keywords in keywords_dict.items():
        # Lemmatize and normalize keywords to lowercase and English chars
        # Ensure keywords are strings before processing
        lemmatized_normalized_keywords = {normalize_text(get_lemma(str(k).lower())) for k in keywords}
        processed_dict[category] = lemmatized_normalized_keywords # Use set for faster lookups
    return processed_dict

# Lemmatize and normalize keywords once at the start
LEMMATIZED_NORMALIZED_ASPECTS_KEYWORDS = lemmatize_normalize_keywords(ASPECTS_KEYWORDS)
LEMMATIZED_NORMALIZED_SENTIMENT_KEYWORDS = lemmatize_normalize_keywords(SENTIMENT_KEYWORDS)

print("Lemmatized & Normalized aspects keywords loaded: Yes")
print("Lemmatized & Normalized sentiment keywords loaded: Yes")

# Example: Print some lemmatized keywords to verify
if 'Grafik' in LEMMATIZED_NORMALIZED_ASPECTS_KEYWORDS:
    print("Sample lemmatized 'Grafik' keywords: ", list(LEMMATIZED_NORMALIZED_ASPECTS_KEYWORDS['Grafik'])[:5])
if 'Olumlu' in LEMMATIZED_NORMALIZED_SENTIMENT_KEYWORDS:
    print("Sample lemmatized 'Olumlu' keywords: ", list(LEMMATIZED_NORMALIZED_SENTIMENT_KEYWORDS['Olumlu'])[:5])

Lemmatized & Normalized aspects keywords loaded: Yes
Lemmatized & Normalized sentiment keywords loaded: Yes
Sample lemmatized 'Grafik' keywords:  ['2d', 'cizimler', 'gorunum', 'hareket animasyonu', 'gorsel stil']
Sample lemmatized 'Olumlu' keywords:  ['akil almaz', 'erdemli', 'kacirmayin', 'devrim niteliginde', 'optimize edilmis']


In [48]:
# Keyword match function - NOW USES LEMMATIZATION
def keyword_match(text_lemmatized_normalized_set, normalized_keyword_lemma):
    # Uses the new function which checks lemmatized forms
    return has_lemmatized_keyword(text_lemmatized_normalized_set, normalized_keyword_lemma)

# ---- MAIN ANALYSIS FUNCTION (Updated with Zemberek & Sentence-Level Sentiment) ----
# Define common Turkish negation word lemmas (adjust if Zemberek produces different lemmas)
NEGATION_LEMMAS = {"değil", "yok", "hiç"}

def analyze_review(text):
    # Initialize result with Nötr for all aspects
    result = {aspect: "Nötr" for aspect in LEMMATIZED_NORMALIZED_ASPECTS_KEYWORDS}
    
    # Ensure text is a string before processing
    if pd.isna(text):
        return result # Return default neutral if input is NaN
    text = str(text)

    # Lemmatize and normalize the entire review text once for efficiency
    review_lemmatized_normalized_set = lemmatize_normalize_text(text)
    
    found_aspects_with_keywords = {} # Store aspect and the specific keyword lemma found
    
    # First pass: Identify all aspects mentioned
    for aspect, lemmatized_keywords_set in LEMMATIZED_NORMALIZED_ASPECTS_KEYWORDS.items():
        matching_keyword_lemmas = review_lemmatized_normalized_set.intersection(lemmatized_keywords_set)
        if matching_keyword_lemmas:
            found_aspects_with_keywords[aspect] = list(matching_keyword_lemmas) # Store matched lemmas if needed later
    
    # Second pass: Determine sentiment for found aspects based on relevant sentences
    if found_aspects_with_keywords:
        sentences = re.split(r'[.!?]+', text.lower()) # Split original text into sentences, handle multiple delimiters
        pos_keywords_lemmas = LEMMATIZED_NORMALIZED_SENTIMENT_KEYWORDS.get("Olumlu", set())
        neg_keywords_lemmas = LEMMATIZED_NORMALIZED_SENTIMENT_KEYWORDS.get("Olumsuz", set())

        for aspect in found_aspects_with_keywords: # Iterate through aspects found in the review
            aspect_sentiment_score = 0
            aspect_keywords_set = LEMMATIZED_NORMALIZED_ASPECTS_KEYWORDS[aspect] # Get all keyword lemmas for the aspect

            for sentence in sentences:
                sentence = sentence.strip()
                if not sentence: continue # Skip empty sentences

                sentence_lemmatized_normalized_set = lemmatize_normalize_text(sentence)

                # Check if the sentence contains any keyword lemma for the current aspect
                if not aspect_keywords_set.isdisjoint(sentence_lemmatized_normalized_set):
                    # Sentence is relevant to the aspect, now analyze its sentiment
                    positive_matches = sentence_lemmatized_normalized_set.intersection(pos_keywords_lemmas)
                    negative_matches = sentence_lemmatized_normalized_set.intersection(neg_keywords_lemmas)
                    
                    sentence_score = len(positive_matches) - len(negative_matches)

                    # Basic Negation Check: If a negation lemma exists in the sentence, neutralize or flip the score?
                    # Simple approach: If negation present, reduce confidence by neutralizing score for this sentence.
                    # More complex logic could check proximity of negation to sentiment words.
                    if not NEGATION_LEMMAS.isdisjoint(sentence_lemmatized_normalized_set):
                         # If negation detected, perhaps ignore sentiment score for this sentence or reduce its weight
                         # Let's try neutralizing the score for this sentence if negation is found
                         sentence_score = 0 
                         # Alternatively, could flip: sentence_score = -sentence_score 
                         # Or reduce magnitude: sentence_score *= 0.5 

                    aspect_sentiment_score += sentence_score

            # Determine final sentiment for the aspect based on aggregated score
            if aspect_sentiment_score > 0:
                result[aspect] = "Olumlu"
            elif aspect_sentiment_score < 0:
                result[aspect] = "Olumsuz"
            # else: remains "Nötr" (if score is 0)
            
    return result

In [49]:
# --- VERİ YÜKLE ---
df = pd.read_csv(INPUT_PATH, usecols=["app_id", "review_text"], dtype={"app_id": str, "review_text": str}).dropna(subset=['review_text'])

In [50]:
# Display first few rows of the loaded data
print("Original DataFrame head:")
print(df.head())
print(f"Total reviews loaded: {len(df)}")

Original DataFrame head:
    app_id                                        review_text
0  1245620  İlk 40 saatimde nereye gitmem gerektiğini ne y...
1  1245620  Bu oyunda Malenia bossunu tasarlıyan arkadaşa ...
2  1245620  Güzel oyun  atmosfer kontroller vs. güzel ama ...
3  1245620  Oyunun devasa bir haritası var açık dünya olma...
4  1245620       oynu bitirdiğinizde huzurluca ölebilirisiniz
Total reviews loaded: 7245


In [51]:
# --- ANALİZ (Using Zemberek) ---
print("Starting analysis on the full DataFrame...") # Modified print statement
results = []
# Use apply for potentially faster processing, especially on larger dataframes
# analysis_results = df['review_text'].apply(analyze_review) # Use df here if using apply

# Or iterate for easier debugging:
total_rows = len(df) # Use df here
print(f"Processing {total_rows} reviews...")
for index, row in df.iterrows(): # Use df here instead of test_df
    if pd.isna(row["review_text"]):
        print(f"Skipping row {index} due to NaN review_text")
        analysis = {aspect: "Nötr" for aspect in LEMMATIZED_NORMALIZED_ASPECTS_KEYWORDS}
    else:
        analysis = analyze_review(row["review_text"])
    results.append({
        "app_id": row["app_id"],
        "review_text": row["review_text"],
        **analysis
    })
    # Print progress less frequently for the full dataset
    if (index + 1) % 500 == 0: # Adjusted print frequency
        print(f"  Processed {index + 1}/{total_rows} reviews...")

df_results = pd.DataFrame(results)
print(f"Analysis complete. Processed {len(df_results)} reviews. Results DataFrame head:") # Modified print statement
print(df_results.head())

Starting analysis on the full DataFrame...
Processing 7245 reviews...
  Processed 500/7245 reviews...
  Processed 500/7245 reviews...
  Processed 1000/7245 reviews...
  Processed 1000/7245 reviews...
  Processed 1500/7245 reviews...
  Processed 1500/7245 reviews...
  Processed 2000/7245 reviews...
  Processed 2000/7245 reviews...
  Processed 2500/7245 reviews...
  Processed 2500/7245 reviews...
  Processed 3000/7245 reviews...
  Processed 3000/7245 reviews...
  Processed 3500/7245 reviews...
  Processed 3500/7245 reviews...
  Processed 4000/7245 reviews...
  Processed 4000/7245 reviews...
  Processed 4500/7245 reviews...
  Processed 4500/7245 reviews...
  Processed 5000/7245 reviews...
  Processed 5000/7245 reviews...
  Processed 5500/7245 reviews...
  Processed 5500/7245 reviews...
  Processed 6000/7245 reviews...
  Processed 6000/7245 reviews...
  Processed 6500/7245 reviews...
  Processed 6500/7245 reviews...
  Processed 7000/7245 reviews...
  Processed 7000/7245 reviews...
Analysis

In [52]:
# --- RENKLİ KAYDET (Updated for Zemberek) ---
print(f"Saving results to {OUTPUT_PATH}...")
with pd.ExcelWriter(OUTPUT_PATH, engine="openpyxl") as writer:
    df_results.to_excel(writer, index=False, sheet_name="Analizler")
    worksheet = writer.sheets["Analizler"]
    
    # Find aspect columns dynamically based on the processed keywords
    aspect_columns = [col for col in df_results.columns if col in LEMMATIZED_NORMALIZED_ASPECTS_KEYWORDS]
    aspect_col_indices = {col: df_results.columns.get_loc(col) + 1 for col in aspect_columns} # +1 for 1-based index

    # Apply coloring to aspect columns
    print("Applying colors to Excel sheet...")
    for aspect, col_idx in aspect_col_indices.items():
        col_letter = get_column_letter(col_idx)
        print(f"  Coloring column: {aspect} ({col_letter})") # DEBUG PRINT
        for row in range(2, len(df_results) + 2): # +2 because Excel is 1-based and header is row 1
            cell = worksheet[f"{col_letter}{row}"]
            sentiment_value = cell.value
            # print(f"    Row {row}, Cell {col_letter}{row}, Value: '{sentiment_value}'") # Optional detailed DEBUG
            if sentiment_value in COLOR_MAP:
                # print(f"      Applying color for: {sentiment_value}") # Optional detailed DEBUG
                cell.fill = COLOR_MAP[sentiment_value]
            else:
                # print(f"      Applying default color (Nötr) for value: {sentiment_value}") # Optional detailed DEBUG
                cell.fill = COLOR_MAP["Nötr"] # Default fill
    
    # Adjust column widths (optional)
    print("Adjusting column widths...")
    worksheet.column_dimensions['A'].width = 15 # app_id
    worksheet.column_dimensions['B'].width = 80 # review_text
    for aspect, col_idx in aspect_col_indices.items():
        worksheet.column_dimensions[get_column_letter(col_idx)].width = 20 # Aspect columns

# --- ÖZET SHEET (Updated for Zemberek) ---
print("Creating summary sheet...")
summary_data = []
for aspect in LEMMATIZED_NORMALIZED_ASPECTS_KEYWORDS: # Use keys from lemmatized dict
    if aspect in df_results.columns:
        counts = df_results[aspect].value_counts().to_dict()
        summary_data.append({
            "Kategori": aspect,
            "Olumlu": counts.get("Olumlu", 0),
            "Olumsuz": counts.get("Olumsuz", 0),
            "Nötr": counts.get("Nötr", 0)
        })
df_summary = pd.DataFrame(summary_data)
with pd.ExcelWriter(OUTPUT_PATH, engine="openpyxl", mode="a") as writer:
    df_summary.to_excel(writer, index=False, sheet_name="Özet")

# --- NÖTR KELİMELER (Updated for Zemberek) ---
print("Finding most common words in neutral reviews...")
# Dynamically create the filter condition for all aspects being Nötr
neutral_condition = None
for aspect in LEMMATIZED_NORMALIZED_ASPECTS_KEYWORDS:
    if aspect in df_results.columns:
        condition = (df_results[aspect] == "Nötr")
        if neutral_condition is None:
            neutral_condition = condition
        else:
            neutral_condition &= condition

if neutral_condition is not None and not df_results[neutral_condition].empty:
    neutral_reviews = df_results[neutral_condition]["review_text"].tolist()
    
    neutral_corpus = " ".join(map(str, neutral_reviews)).lower()
    # Lemmatize and normalize words before counting
    neutral_words_lemmatized_normalized = lemmatize_normalize_text(neutral_corpus)
    # Filter out short words if needed (already handled by regex in lemmatize_normalize_text)
    # neutral_words_filtered = [word for word in neutral_words_lemmatized_normalized if len(word) >= 3]
    
    # Count the frequency of lemmatized words
    # Need to re-tokenize and lemmatize for counting, as the set lost frequency info
    words_for_counting = re.findall(r'\b\w+\b', neutral_corpus)
    lemmatized_words_for_counting = [normalize_text(get_lemma(word)) for word in words_for_counting if len(word) >= 3]
    
    most_common_words = Counter(lemmatized_words_for_counting).most_common(100)
    df_top_words = pd.DataFrame(most_common_words, columns=["Kelime (Lemma)", "Frekans"])
    with pd.ExcelWriter(OUTPUT_PATH, engine="openpyxl", mode="a") as writer:
        df_top_words.to_excel(writer, index=False, sheet_name="Nötr En Çok Kelimeler (Lemma)")
else:
    print("⚠️ No neutral reviews found or aspect columns missing for neutral word analysis.")

print(f"Process Done✅: {OUTPUT_PATH}")

Saving results to ../dataset/cleaned_reviews_zemberek.xlsx...
Applying colors to Excel sheet...
  Coloring column: Grafik (C)
  Coloring column: AI (D)
  Coloring column: Oynanis (E)
Applying colors to Excel sheet...
  Coloring column: Grafik (C)
  Coloring column: AI (D)
  Coloring column: Oynanis (E)
  Coloring column: Ses ve Muzik (F)
  Coloring column: Oyun Dunyasi (G)
  Coloring column: Topluluk ve Sosyal (H)
  Coloring column: Ses ve Muzik (F)
  Coloring column: Oyun Dunyasi (G)
  Coloring column: Topluluk ve Sosyal (H)
  Coloring column: Hikaye ve Senaryo (I)
  Coloring column: Performans ve Teknik (J)
Adjusting column widths...
  Coloring column: Hikaye ve Senaryo (I)
  Coloring column: Performans ve Teknik (J)
Adjusting column widths...
Creating summary sheet...
Creating summary sheet...
Finding most common words in neutral reviews...
Finding most common words in neutral reviews...
Process Done✅: ../dataset/cleaned_reviews_zemberek.xlsx
Process Done✅: ../dataset/cleaned_review