# Preprocessing Turkish Tweets with Zemberek

This notebook implements the preprocessing step of the project. The planned pipeline is tokenization, lowercasing, stemming with Zemberek, and optional stop-word removal; the cells below cover sentence splitting, tokenization, and stop-word removal. The processed tweets are intended to be saved for later steps, such as TF-IDF transformation and classification.

## Step 1: Import Required Libraries

In [1]:
import os
from py4j.java_gateway import JavaGateway, GatewayParameters, launch_gateway

## Step 2: Initialize Zemberek for Stemming

In [2]:
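def initialize_zemberek(jar_path="zemberek-full.jar"):
    """Starts a JVM with Zemberek on the classpath and connects to it through Py4J.

    Returns the default TurkishTokenizer and TurkishSentenceExtractor instances.
    """
    # die_on_exit=True shuts the JVM down when the Python process exits
    port = launch_gateway(classpath=jar_path, die_on_exit=True)
    gateway = JavaGateway(gateway_parameters=GatewayParameters(port=port))
    tokenizer = gateway.jvm.zemberek.tokenization.TurkishTokenizer.DEFAULT
    extractor = gateway.jvm.zemberek.tokenization.TurkishSentenceExtractor.DEFAULT
    return tokenizer, extractor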
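
The heading of this step mentions stemming, but `initialize_zemberek` returns only a tokenizer and a sentence extractor, and neither of them stems. In Zemberek, stems and lemmas come from morphological analysis. The sketch below shows one way a lemma lookup might look. It assumes Zemberek 0.17-style method names (`analyzeAndDisambiguate`, `bestAnalysis`, `isUnknown`, `surfaceForm`, `getLemmas`), which should be checked against the JAR in use. The helper `lemmatize_sentence` is hypothetical and not part of the notebook.

```python
def lemmatize_sentence(sentence, morphology):
    """Return one lemma per analyzed word of `sentence`.

    `morphology` is assumed to be a Zemberek TurkishMorphology object reached
    through Py4J. Assumption: it could be created from the `gateway` object
    built inside initialize_zemberek, for example with
        gateway.jvm.zemberek.morphology.TurkishMorphology.createWithDefaults()
    """
    lemmas = []
    # Use the disambiguated (most likely) analysis of each word in context
    for analysis in morphology.analyzeAndDisambiguate(sentence).bestAnalysis():
        if analysis.isUnknown():
            # No dictionary match: fall back to the word as written
            lemmas.append(analysis.surfaceForm())
        else:
            # getLemmas() returns candidate lemmas; this sketch takes the first
            lemmas.append(analysis.getLemmas()[0])
    return lemmas
```

Using this in the pipeline would mean returning the morphology object from `initialize_zemberek` as well, since the function currently keeps the gateway local.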

## Step 3: Define Tokenization, Lowercasing, and Stemming

In [3]:
def preprocess_text(text, tokenizer, extractor, stop_words=()):
    """
    Splits Turkish text into sentences, tokenizes them with Zemberek,
    and removes stop words. Tokens are returned as written: this function
    does not lowercase or stem them.

    Args:
        text (str): Input text to preprocess.
        tokenizer: Instance of TurkishTokenizer.
        extractor: Instance of TurkishSentenceExtractor.
        stop_words (iterable of str): Stop words to exclude.

    Returns:
        list: List of preprocessed tokens.
    """
    # A set gives constant-time membership checks; None or empty means no filtering
    stop_words = set(stop_words or ())
    tokens = []

    # Split the text into sentences, then tokenize each sentence
    for sentence in extractor.fromDocument(text):
        for token in tokenizer.tokenizeToStrings(sentence):
            # Keep only tokens that are not stop words
            if token not in stop_words:
                tokens.append(token)

    return tokens
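
Lowercasing Turkish text needs care because Python's `str.lower()` follows language-neutral Unicode rules: it maps `I` to `i` rather than `ı`, and `İ` to `i` followed by a combining dot. A minimal sketch of a Turkish-aware alternative that could be applied to each token before the stop-word check; the helper name `turkish_lower` is hypothetical.

```python
def turkish_lower(text):
    """Lowercase `text` using the Turkish dotted/dotless i rules."""
    # Handle the two letters where Turkish differs from the default mapping,
    # then let str.lower() deal with everything else.
    return text.replace("İ", "i").replace("I", "ı").lower()


print(turkish_lower("IŞIK"))      # ışık
print(turkish_lower("İSTANBUL"))  # istanbul
```

Lowercasing first also matters for stop-word removal: a lowercase entry such as `ve` only matches the token `Ve` once the token has been lowercased.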



## Step 4: Process All Tweets in the Dataset

In [4]:
def process_tweets(dataset_folder, tokenizer, extractor, stop_words=()):
    """Processes every tweet file under each label folder of the dataset.

    Returns a dict mapping each label to a list of token lists.
    """
    processed_data = {}
    for label in sorted(os.listdir(dataset_folder)):  # e.g. 'Positive', 'Negative', 'Neutral'
        label_folder = os.path.join(dataset_folder, label)
        if not os.path.isdir(label_folder):
            continue  # Skip stray files such as .DS_Store
        tweets = []
        for filename in sorted(os.listdir(label_folder)):  # Iterate over each tweet file
            with open(os.path.join(label_folder, filename), "r", encoding="ISO-8859-9") as file:
                text = file.read()
            tweets.append(preprocess_text(text, tokenizer, extractor, stop_words))
        processed_data[label] = tweets
    return processed_data
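
`process_tweets` assumes every file is encoded as ISO-8859-9 (Latin-5), the single-byte encoding that includes the Turkish letters. If some files might be UTF-8 instead, one option is to try UTF-8 first and fall back. This is an optional variation rather than what the notebook does, and `read_tweet` is a hypothetical helper.

```python
def read_tweet(path, encodings=("utf-8", "iso-8859-9")):
    """Read a text file, trying each encoding in order."""
    with open(path, "rb") as handle:
        raw = handle.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # Try the next candidate encoding
    raise ValueError(f"Could not decode {path} with any of {encodings}")
```

UTF-8 goes first because ISO-8859-9 assigns a character to every byte value, so decoding with it never fails and it only works as the last resort.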

## Step 5: Example Usage

In [None]:

# Initialize Zemberek
tokenizer, extractor = initialize_zemberek()

# Define stop-words (optional)
turkish_stop_words = ["ve", "bir", "ama", "çok", "gibi"]  

# Path to the dataset
dataset_path = "/Users/egebilge/Documents/Lectures/SE-4475 NLP/SE-4475-NLP-Assignment/data/raw_texts"

# Process the tweets
processed_tweets = process_tweets(dataset_path, tokenizer, extractor, turkish_stop_words)

# Print a sample
for label, tweets in processed_tweets.items():
    print(f"\nLabel: {label}")
    # Guard against label folders that contain no tweets
    print("Sample Processed Tweet:", tweets[0] if tweets else "(no tweets)")
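
Later steps such as TF-IDF need the tokens outside this notebook. A minimal sketch of writing the label-to-tweets dictionary to a JSON file and reading it back; the function names and the file name `processed_tweets.json` are hypothetical, since the notebook does not fix an output format.

```python
import json


def save_processed(processed, path):
    """Write {label: [[token, ...], ...]} to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps Turkish letters readable in the file
        json.dump(processed, handle, ensure_ascii=False, indent=2)


def load_processed(path):
    """Read the dictionary written by save_processed."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)
```

A call such as `save_processed(processed_tweets, "processed_tweets.json")` would then store the result of the cell above.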
    