### Task 1: Entity Recognition and Replacement
- **Objective:** Recognize named entities in imdb dataset (such as person names, locations) and replace them with generic labels to anonymize text data, if necessary.
- **Instructions:**
  - Utilize Stanza's entity recognition capabilities to identify named entities within the text.
  - Replace recognized named entities with generic labels or placeholders to ensure anonymity in the text.

### Task 2: Removal of HTML Tags or URLs
- **Objective:** Enhance the preprocess function to eliminate HTML tags and URLs from the text data in imdb dataset.
- **Instructions:**
  - Update the preprocess function using regular expressions to remove HTML tags (e.g., <tag>) and URLs present within the text.
  - Ensure the removal of HTML tags and URLs to clean the text data effectively.

### Task 3: Comparison of Pre-trained Word Embeddings
- **Objective:** Download various pre-trained word embeddings (e.g., FastText, Word2Vec, GloVe) and evaluate their performance against custom-trained embeddings on imdb dataset.
- **Instructions:**
  - Access the NLPL Vector Repository http://vectors.nlpl.eu/repository/ to download pre-trained word embeddings such as FastText, Word2Vec, and GloVe.
  - Compare these pre-trained embeddings with our custom-trained embeddings on imdb dataset.

In [1]:
from datasets import load_dataset
import pandas as pd

# Download the IMDb dataset
imdb_dataset = load_dataset('imdb')

# Select 1,000 examples from each split (train and test)
data = pd.DataFrame(imdb_dataset['train'].shuffle(seed=42).select(range(20000)))

README.md:   0%|          | 0.00/7.81k [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


train-00000-of-00001.parquet:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

unsupervised-00000-of-00001.parquet:   0%|          | 0.00/42.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

# TASK 1

In [9]:
import stanza

# Initialize Stanza pipeline with NER
stanza.download('en')
nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')

def anonymize_entities(text):
    doc = nlp(text)
    for entity in doc.ents:
        text = text.replace(entity.text, '[ANONYMIZED]')
    return text

# Example usage
print(data.size)
i = 0
anonymized_text = []
for text in data['text']:
    if i % 100 == 0:
        print(i)
    i += 1
    anonymized_text.append(anonymize_entities(text)) 
print(anonymized_text)

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.9.0.json:   0%|   …

2024-12-29 00:47:35 INFO: Downloaded file to C:\Users\aljaz\stanza_resources\resources.json
2024-12-29 00:47:35 INFO: Downloading default packages for language: en (English) ...
2024-12-29 00:47:36 INFO: File exists: C:\Users\aljaz\stanza_resources\en\default.zip
2024-12-29 00:47:39 INFO: Finished downloading models and saved to C:\Users\aljaz\stanza_resources
2024-12-29 00:47:39 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.9.0.json:   0%|   …

2024-12-29 00:47:39 INFO: Downloaded file to C:\Users\aljaz\stanza_resources\resources.json
2024-12-29 00:47:40 INFO: Loading these models for language: en (English):
| Processor | Package                   |
-----------------------------------------
| tokenize  | combined                  |
| mwt       | combined                  |
| ner       | ontonotes-ww-multi_charlm |

2024-12-29 00:47:40 INFO: Using device: cpu
2024-12-29 00:47:40 INFO: Loading: tokenize
  checkpoint = torch.load(filename, lambda storage, loc: storage)
2024-12-29 00:47:40 INFO: Loading: mwt
  checkpoint = torch.load(filename, lambda storage, loc: storage)
2024-12-29 00:47:40 INFO: Loading: ner
  checkpoint = torch.load(filename, lambda storage, loc: storage)
  data = torch.load(self.filename, lambda storage, loc: storage)
  state = torch.load(filename, lambda storage, loc: storage)
2024-12-29 00:47:40 INFO: Done loading processors!


40000
0


KeyboardInterrupt: 

# Task 2

In [4]:
import re

def preprocess_text(text):
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Add other preprocessing steps here if needed
    return text

# Example usage
sample_text = "Check out this link: <a href='http://example.com'>Example</a>"
clean_text = preprocess_text(sample_text)
print(clean_text)

Check out this link: Example


# Task 3

In [10]:
import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

# Download necessary resources (if not already downloaded)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize Lemmatizer and Stemmer
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# Function to preprocess text
def preprocess_text(text):
    # Tokenize the text into words
    words = word_tokenize(text.lower())  # Convert text to lowercase

    # Remove punctuation
    table = str.maketrans('', '', string.punctuation)
    words = [word.translate(table) for word in words if word.isalpha()]

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]

    # Lemmatization
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

    # Stemming (uncomment if you want to use stemming)
    stemmed_words = [stemmer.stem(word) for word in words]

    # Join the words back into a string
    preprocessed_text = ' '.join(lemmatized_words)
    return preprocessed_text


# Apply preprocessing
data['clean_text'] = data['text'].apply(preprocess_text)
# Check preprocessed first instance
data['clean_text'][0]

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\aljaz\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\aljaz\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\aljaz\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


LookupError: 
**********************************************************************
  Resource [93mpunkt_tab[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('punkt_tab')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mtokenizers/punkt_tab/english/[0m

  Searched in:
    - 'C:\\Users\\aljaz/nltk_data'
    - 'c:\\Users\\aljaz\\AppData\\Local\\Programs\\Python\\Python312\\nltk_data'
    - 'c:\\Users\\aljaz\\AppData\\Local\\Programs\\Python\\Python312\\share\\nltk_data'
    - 'c:\\Users\\aljaz\\AppData\\Local\\Programs\\Python\\Python312\\lib\\nltk_data'
    - 'C:\\Users\\aljaz\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************


In [6]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data[['clean_text', 'text']], data['label'], test_size=0.2, random_state=42)

KeyError: "['clean_text'] not in index"

In [30]:
import gensim.downloader as api

# Download embeddings -> https://github.com/piskvorky/gensim-data
word2vec_model = api.load("word2vec-google-news-300")
# word2vec_model = api.load("glove-twitter-25")
    

In [31]:
import numpy as np

def document_vector(word2vec_model, doc):
    # remove out-of-vocabulary words
    doc = [word for word in doc if word in word2vec_model.key_to_index]
    if len(doc) == 0:
        return np.zeros(300)
    return np.mean(word2vec_model[doc], axis=0)

In [32]:
# Add Word2Vec representations to DataFrame
X_train_w2v = [document_vector(word2vec_model, text.lower().split()) for text in X_train['clean_text']]
X_test_w2v = [document_vector(word2vec_model, text.lower().split()) for text in X_test['clean_text']]

In [33]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

# Logistic Regression model
logistic_model = LogisticRegression()
logistic_model.fit(X_train_w2v, y_train)
logistic_predictions = logistic_model.predict(X_test_w2v)
logistic_accuracy = accuracy_score(y_test, logistic_predictions)
print("Logistic Regression Accuracy:", logistic_accuracy)

# Random Forest model
rf_model = RandomForestClassifier()
rf_model.fit(X_train_w2v, y_train)
rf_predictions = rf_model.predict(X_test_w2v)
rf_accuracy = accuracy_score(y_test, rf_predictions)
print("Random Forest Accuracy:", rf_accuracy)

Logistic Regression Accuracy: 0.856
Random Forest Accuracy: 0.81375
