# Data Processing 
Apply NLP techniques such as word embeddings, stemming, lemmatization, and stop-word removal.

### Objective
An alien civilization has discovered an ancient text file on an abandoned Earth, containing
roughly a million sentence pairs. Their challenge is to determine how closely related the two
sentences in each pair are, which amounts to building a semantic similarity detection model.

### Dataset Features
- Contains ~1 million lines (949,080 rows) of paired sentences in several languages.
- Some sentence pairs share the same meaning, while others differ.
- Requires feature engineering and text preprocessing.


In [2]:
# import libraries 
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [8]:
# change this file path 
file_path = "/Users/dionnespaltman/Desktop/Luiss /Machine Learning/Project/rs2.csv"

# load the csv as a pandas dataframe 
df = pd.read_csv(file_path)

# print dimensions 
print(df.shape)

# display the first 5 rows
display(df.head())

(949080, 5)


Unnamed: 0,sentence1,sentence2,score,lang1,lang2
0,Ein Flugzeug hebt gerade ab.,An air plane is taking off.,5.0,de,en
1,Ein Flugzeug hebt gerade ab.,Un avión está despegando.,5.0,de,es
2,Ein Flugzeug hebt gerade ab.,Un avion est en train de décoller.,5.0,de,fr
3,Ein Flugzeug hebt gerade ab.,Un aereo sta decollando.,5.0,de,it
4,Ein Flugzeug hebt gerade ab.,飛行機が離陸します。,5.0,de,ja


The plan: 
1. Lowercase all text
2. Remove punctuation and special characters
3. Remove stop words (for languages where this makes sense)
4. Apply stemming or lemmatization
5. Tokenization
6. Word embeddings (later)
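
As a rough illustration of steps 1, 2, 3 and 5, here is a minimal sketch that uses only the standard library. It is not the pipeline used in this notebook: `toy_preprocess` and `TOY_STOP_WORDS` are hypothetical names, the stop-word list is a tiny placeholder, and lemmatization (step 4) is left out because it needs a language model such as spaCy.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the notebook relies on spaCy's per-language lists instead.
TOY_STOP_WORDS = {"the", "is", "a", "an", "off"}

def toy_preprocess(text, stop_words=TOY_STOP_WORDS):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    lowered = text.lower()                        # step 1: lowercase
    no_punct = re.sub(r"[^\w\s]", " ", lowered)   # step 2: punctuation -> space
    tokens = no_punct.split()                     # step 5: whitespace tokenization
    return [t for t in tokens if t not in stop_words]  # step 3: stop words

print(toy_preprocess("The airplane is taking off."))
# ['airplane', 'taking']
```

Whitespace splitting only works for languages that separate words with spaces, which is one reason a real tokenizer is needed for Japanese and Chinese.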

In [9]:
# import libraries (pandas is already imported above)
import spacy
from tqdm import tqdm

In [10]:
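# Cache loaded spaCy models so each one is loaded from disk only once
loaded_models = {}

def load_spacy_model(lang_code):
    models = {
        "en": "en_core_web_sm",
        "de": "de_core_news_sm",
        "fr": "fr_core_news_sm",
        "es": "es_core_news_sm",
        "it": "it_core_news_sm",
        "pt": "pt_core_news_sm",
        "nl": "nl_core_news_sm",
        # Add more if needed
    }
    model_name = models.get(lang_code)
    if model_name is None:
        return None  # no spaCy model configured for this language
    if model_name not in loaded_models:
        try:
            loaded_models[model_name] = spacy.load(model_name)
        except OSError:
            # spacy.load raises OSError when the model package is not installed
            print(f"⚠️ spaCy model {model_name} not found.")
            loaded_models[model_name] = None  # remember the failure, warn only once
    return loaded_models[model_name]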


In [11]:
# Preprocessing function 
# Lowercase, lemmatize, remove punctuation, remove stopwords
def preprocess(text, lang_code):
    nlp = load_spacy_model(lang_code)
    if not nlp:
        return text.lower()  # fallback
    doc = nlp(text.lower())
    tokens = [
        token.lemma_ for token in doc
        if token.is_alpha and not token.is_stop
    ]
    return " ".join(tokens)

In [None]:
# you have to run the following commands in your terminal to make it work

# python -m spacy download en_core_web_sm
# python -m spacy download de_core_news_sm
# python -m spacy download es_core_news_sm
# python -m spacy download fr_core_news_sm
# python -m spacy download it_core_news_sm
# python -m spacy download ja_core_news_sm
# python -m spacy download pt_core_news_sm
# python -m spacy download nl_core_news_sm



In [12]:
# test
print(preprocess("Ein Flugzeug hebt gerade ab.", "de"))


Flugzeug heben


In [13]:
test_sentences = {
    "en": "The airplane is taking off.",
    "de": "Ein Flugzeug hebt gerade ab.",
    "es": "Un avión está despegando.",
    "fr": "Un avion est en train de décoller.",
    "it": "Un aereo sta decollando.",
    "pt": "Um avião está decolando.",
    "nl": "Een vliegtuig is aan het opstijgen.",
    "pl": "Samolot właśnie startuje.",
    "ru": "Самолет взлетает.",
    "ja": "飛行機が離陸します。",
    "zh": "飞机正在起飞。"
}

for lang, sentence in test_sentences.items():
    print(f"{lang.upper()} | Original: {sentence}")
    print(f"         Preprocessed: {preprocess(sentence, lang)}\n")


EN | Original: The airplane is taking off.
         Preprocessed: airplane take

DE | Original: Ein Flugzeug hebt gerade ab.
         Preprocessed: Flugzeug heben

ES | Original: Un avión está despegando.
         Preprocessed: avión despegar

FR | Original: Un avion est en train de décoller.
         Preprocessed: avion train décoller

IT | Original: Un aereo sta decollando.
         Preprocessed: aereo decollare

PT | Original: Um avião está decolando.
         Preprocessed: avião decolar

NL | Original: Een vliegtuig is aan het opstijgen.
         Preprocessed: vliegtuig opstijgen

PL | Original: Samolot właśnie startuje.
         Preprocessed: samolot właśnie startuje.

RU | Original: Самолет взлетает.
         Preprocessed: самолет взлетает.

JA | Original: 飛行機が離陸します。
         Preprocessed: 飛行機が離陸します。

ZH | Original: 飞机正在起飞。
         Preprocessed: 飞机正在起飞。



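The spaCy pipeline above does not cover Polish, Russian, Japanese, and Chinese: `load_spacy_model` has no entry for these languages, so `preprocess` falls back to lowercasing and returns the sentences otherwise unchanged. We need a different solution for these four languages.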
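
To see how many rows are affected, one could count the language codes that have no entry in the `models` dictionary of `load_spacy_model`. This is a minimal sketch: `count_uncovered` and `SPACY_LANGS` are hypothetical names, and the set is copied by hand from that dictionary.

```python
from collections import Counter

# Language codes that load_spacy_model has an entry for (copied by hand).
SPACY_LANGS = {"en", "de", "fr", "es", "it", "pt", "nl"}

def count_uncovered(lang_codes, covered=SPACY_LANGS):
    """Count occurrences of each language code that is not in `covered`."""
    return Counter(code for code in lang_codes if code not in covered)

# Example with the language codes from the test cell above
codes = ["en", "de", "es", "fr", "it", "pt", "nl", "pl", "ru", "ja", "zh"]
print(count_uncovered(codes))
```

On the real data, passing `df["lang1"]` or `df["lang2"]` instead of `codes` should give per-language row counts, since iterating over a pandas Series yields its values.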

In [None]:
# run the following in your terminal
# pip install jieba
# pip install "spacy[ja]"   (the quotes keep shells such as zsh from expanding the brackets)
# python -m spacy download ja_core_news_sm
# pip install stanza

In [14]:
import jieba

def preprocess_zh(text):
    tokens = jieba.lcut(text)
    # Optional: remove stopwords if you have a list
    return " ".join(tokens)

In [15]:
import spacy

nlp_ja = spacy.load("ja_core_news_sm")

def preprocess_ja(text):
    doc = nlp_ja(text)
    tokens = [token.lemma_ for token in doc if token.is_alpha]
    return " ".join(tokens)

In [None]:
# to run the code below you need to downgrade your pytorch version
# run in your terminal 
# pip install torch==2.1.2

# import torch
# print(torch.__version__)

In [17]:
import stanza

stanza.download("ru")  # Russian
stanza.download("pl")  # Polish

nlp_ru = stanza.Pipeline("ru", processors="tokenize,lemma", use_gpu=False)
nlp_pl = stanza.Pipeline("pl", processors="tokenize,lemma", use_gpu=False)

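def preprocess_ru(text):
    doc = nlp_ru(text)
    # keep alphabetic lemmas; skip words for which stanza returned no lemma
    tokens = [word.lemma for sent in doc.sentences for word in sent.words
              if word.lemma and word.lemma.isalpha()]
    return " ".join(tokens)

def preprocess_pl(text):
    doc = nlp_pl(text)
    # keep alphabetic lemmas; skip words for which stanza returned no lemma
    tokens = [word.lemma for sent in doc.sentences for word in sent.words
              if word.lemma and word.lemma.isalpha()]
    return " ".join(tokens)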


2025-04-10 10:11:13 INFO: Downloaded file to /Users/dionnespaltman/stanza_resources/resources.json
2025-04-10 10:11:13 INFO: Downloading default packages for language: ru (Russian) ...
2025-04-10 10:11:14 INFO: File exists: /Users/dionnespaltman/stanza_resources/ru/default.zip
2025-04-10 10:11:20 INFO: Finished downloading models and saved to /Users/dionnespaltman/stanza_resources


2025-04-10 10:11:21 INFO: Downloaded file to /Users/dionnespaltman/stanza_resources/resources.json
2025-04-10 10:11:21 INFO: Downloading default packages for language: pl (Polish) ...
2025-04-10 10:11:22 INFO: File exists: /Users/dionnespaltman/stanza_resources/pl/default.zip
2025-04-10 10:11:25 INFO: Finished downloading models and saved to /Users/dionnespaltman/stanza_resources
2025-04-10 10:11:25 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


2025-04-10 10:11:29 INFO: Downloaded file to /Users/dionnespaltman/stanza_resources/resources.json
2025-04-10 10:11:29 INFO: Loading these models for language: ru (Russian):
| Processor | Package            |
----------------------------------
| tokenize  | syntagrus          |
| lemma     | syntagrus_nocharlm |

2025-04-10 10:11:29 INFO: Using device: cpu
2025-04-10 10:11:29 INFO: Loading: tokenize
2025-04-10 10:11:30 INFO: Loading: lemma
2025-04-10 10:11:34 INFO: Done loading processors!
2025-04-10 10:11:34 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


2025-04-10 10:11:34 INFO: Downloaded file to /Users/dionnespaltman/stanza_resources/resources.json
2025-04-10 10:11:34 INFO: Loading these models for language: pl (Polish):
| Processor | Package      |
----------------------------
| tokenize  | pdb          |
| mwt       | pdb          |
| lemma     | pdb_nocharlm |

2025-04-10 10:11:34 INFO: Using device: cpu
2025-04-10 10:11:34 INFO: Loading: tokenize
2025-04-10 10:11:34 INFO: Loading: mwt
2025-04-10 10:11:34 INFO: Loading: lemma
2025-04-10 10:11:36 INFO: Done loading processors!


In [18]:
def preprocess(text, lang_code):
    if lang_code == "zh":
        return preprocess_zh(text)
    elif lang_code == "ja":
        return preprocess_ja(text)
    elif lang_code == "ru":
        return preprocess_ru(text)
    elif lang_code == "pl":
        return preprocess_pl(text)
    else:
        # fallback to spaCy
        nlp = load_spacy_model(lang_code)
        if not nlp:
            return text.lower()
        doc = nlp(text.lower())
        tokens = [token.lemma_ for token in doc if token.is_alpha and not token.is_stop]
        return " ".join(tokens)


In [19]:
test_sentences = {
    "en": "The airplane is taking off.",
    "de": "Ein Flugzeug hebt gerade ab.",
    "es": "Un avión está despegando.",
    "fr": "Un avion est en train de décoller.",
    "it": "Un aereo sta decollando.",
    "pt": "Um avião está decolando.",
    "nl": "Een vliegtuig is aan het opstijgen.",
    "pl": "Samolot właśnie startuje.",
    "ru": "Самолет взлетает.",
    "ja": "飛行機が離陸します。",
    "zh": "飞机正在起飞。"
}

for lang, sentence in test_sentences.items():
    print(f"{lang.upper()} | Original: {sentence}")
    print(f"         Preprocessed: {preprocess(sentence, lang)}\n")


EN | Original: The airplane is taking off.
         Preprocessed: airplane take

DE | Original: Ein Flugzeug hebt gerade ab.
         Preprocessed: Flugzeug heben

ES | Original: Un avión está despegando.
         Preprocessed: avión despegar

FR | Original: Un avion est en train de décoller.
         Preprocessed: avion train décoller

IT | Original: Un aereo sta decollando.
         Preprocessed: aereo decollare

PT | Original: Um avião está decolando.
         Preprocessed: avião decolar

NL | Original: Een vliegtuig is aan het opstijgen.
         Preprocessed: vliegtuig opstijgen

PL | Original: Samolot właśnie startuje.
         Preprocessed: samolot właśnie startuje

RU | Original: Самолет взлетает.
         Preprocessed: самолет взлетать

JA | Original: 飛行機が離陸します。
         Preprocessed: 飛行 機 が 離陸 する ます

ZH | Original: 飞机正在起飞。
         Preprocessed: 飞机 正在 起飞 。


Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/_4/nzq6mygj7j71_l3z_c9kc7wr0000gn/T/jieba.cache
Loading model cost 0.670 seconds.
Prefix dict has been built successfully.



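The test is not working for all the languages yet: the Polish verb appears not to be lemmatized (`startuje` is unchanged), the Chinese output still contains the punctuation mark `。`, and stop words are only removed in the spaCy branch, so function words such as `が` and `ます` remain in the Japanese output.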
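
One possible fix for the leftover Chinese punctuation is to filter out tokens made up entirely of Unicode punctuation characters. This is a minimal sketch with hypothetical helper names (`is_punctuation`, `drop_punctuation`); it is not part of the notebook's pipeline, and it takes a token list like the one produced by `jieba.lcut`.

```python
import unicodedata

def is_punctuation(token):
    """True if the token is non-empty and every character is Unicode punctuation."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )

def drop_punctuation(tokens):
    """Remove tokens that consist only of punctuation characters."""
    return [t for t in tokens if not is_punctuation(t)]

# Token list taken from the ZH output above
print(" ".join(drop_punctuation(["飞机", "正在", "起飞", "。"])))
# 飞机 正在 起飞
```

In `preprocess_zh`, this could be applied to the result of `jieba.lcut(text)` before joining the tokens. Stop-word removal would still need a separate word list.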

In [None]:
# # Apply preprocessing
# tqdm.pandas(desc="Preprocessing sentence1")
# df["sentence1_clean"] = df.progress_apply(
#     lambda row: preprocess(row["sentence1"], row["lang1"]), axis=1
# )

# tqdm.pandas(desc="Preprocessing sentence2")
# df["sentence2_clean"] = df.progress_apply(
#     lambda row: preprocess(row["sentence2"], row["lang2"]), axis=1
# )

Preprocessing sentence1:   0%|          | 39/949080 [00:36<375:25:14,  1.42s/it]

⚠️ spaCy model fr_core_news_sm not found.


Preprocessing sentence1:   0%|          | 66/949080 [00:53<179:04:53,  1.47it/s]

⚠️ spaCy model nl_core_news_sm not found.


Preprocessing sentence1:   0%|          | 88/949080 [01:05<117:52:34,  2.24it/s]
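
The progress output above shows roughly one row per second at the start, which is far too slow for 949,080 rows. The first rows of the dataset repeat the same `sentence1` five times, so one way to reduce the work is to preprocess each distinct (text, language) pair only once. This is a minimal sketch, assuming such repetition is common across the file, which has not been verified here; `preprocess_unique` and `fake_preprocess` are hypothetical names, and `fake_preprocess` is only a stand-in for the notebook's `preprocess`.

```python
def preprocess_unique(texts, langs, preprocess_fn):
    """Apply preprocess_fn once per distinct (text, lang) pair and reuse the result."""
    cache = {}
    results = []
    for text, lang in zip(texts, langs):
        key = (text, lang)
        if key not in cache:
            cache[key] = preprocess_fn(text, lang)
        results.append(cache[key])
    return results

# Stand-in for the notebook's preprocess(); it only lowercases and records calls.
calls = []
def fake_preprocess(text, lang):
    calls.append((text, lang))
    return text.lower()

texts = ["Ein Flugzeug hebt gerade ab."] * 5
langs = ["de"] * 5
cleaned = preprocess_unique(texts, langs, fake_preprocess)
print(len(cleaned), len(calls))  # five results, one actual preprocessing call
```

With the real function, the call would look like `df["sentence1_clean"] = preprocess_unique(df["sentence1"], df["lang1"], preprocess)`.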

In [None]:
# # Save or preview
# file_path_clean = "/Users/dionnespaltman/Desktop/Luiss /Machine Learning/Project/rs2_cleaned.csv"
# df.to_csv(file_path_clean, index=False)
# print("✅ Preprocessing complete! Cleaned data saved to 'rs2_cleaned.csv'")

# # change this file path 
# file_path_clean = "/Users/dionnespaltman/Desktop/Luiss /Machine Learning/Project/rs2_cleaned.csv"

# # load the csv as a pandas dataframe 
# df = pd.read_csv(file_path_clean)

✅ Preprocessing complete! Cleaned data saved to 'rs2_cleaned.csv'
