In [2]:
%uv pip install scikit-learn numpy

import numpy as np

/Users/s.mallet/passau/dlnlp/.venv/bin/python: No module named uv
Note: you may need to restart the kernel to use updated packages.


# Deep Learning for Natural Language and Code: Exercise 1

# Task 1: LIBSVM - BOW

## Load data

1. Download the dataset from [here](https://ai.stanford.edu/%7Eamaas/data/sentiment)
1. Copy the dataset next to this Jupyter notebook (.ipynb file)
1. Install:
    * Sklearn (in this task it may be used only for reading the BOW in LIBSVM format)

A **LIBSVM file** is a plain text file format used to store **sparse datasets** for machine learning tasks, especially **classification** and **regression**. It's called "LIBSVM" because it was originally used by the **LIBSVM** library, a very popular library for Support Vector Machines.

The format looks like this:

```
<label> <index1>:<value1> <index2>:<value2> <index3>:<value3> ...
```

- `<label>` = the **target value** (for example, `1` for positive, `-1` for negative).
- `<index>:<value>` = the **non-zero features**.
  - `<index>` is the feature number (starting at 1 in the LIBSVM convention; the IMDB file used below numbers its features from 0),
  - `<value>` is the value of that feature (usually the count or some preprocessed weight).

If a feature is **zero**, it is simply **omitted** from the line (to save space — this is why it's called *sparse* format).

---

### A small worked example:

Suppose we have two movie reviews turned into a Bag-of-Words (BoW):
- Feature 1 = "awesome"
- Feature 2 = "terrible"
- Feature 3 = "boring"
- Feature 4 = "amazing"

And two reviews:
- Review 1 (positive): "awesome amazing"
- Review 2 (negative): "terrible boring boring"

The LIBSVM file would look like:

```
1 1:1 4:1
-1 2:1 3:2
```

**Explanation:**
- The first line:
  - `1` → label is positive
  - `1:1` → "awesome" appeared once
  - `4:1` → "amazing" appeared once
- The second line:
  - `-1` → label is negative
  - `2:1` → "terrible" appeared once
  - `3:2` → "boring" appeared twice
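
Going the other way, a dictionary of feature counts can be serialized into this format. The following is a minimal sketch; `to_libsvm_line` is a hypothetical helper written for illustration, not part of LIBSVM or Sklearn.

```python
def to_libsvm_line(label: int, counts: dict[int, int]) -> str:
    """Serialize one sample; `counts` maps feature index -> value."""
    # Zero-valued features are omitted; indices are written in ascending order.
    pairs = [f"{index}:{value}" for index, value in sorted(counts.items()) if value != 0]
    return " ".join([str(label), *pairs])


# The two reviews from the example above.
print(to_libsvm_line(1, {1: 1, 4: 1}))   # 1 1:1 4:1
print(to_libsvm_line(-1, {3: 2, 2: 1}))  # -1 2:1 3:2
```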

---

### Why use LIBSVM format?

- It's super lightweight for huge datasets where most feature values are 0.
- It's easy to parse and generate manually.
- Many machine learning libraries can read this format directly (e.g., LIBSVM, LIBLINEAR, and Sklearn via `load_svmlight_file`).

**What's inside `./aclImdb/train/labeledBow.feat` ?**

- There are 25 000 lines
- One line per sample
- The first element of each line is the label (from 1 to 10)
- The remaining elements give the number of appearances of each feature (e.g. `0:9 1:1` means feature 0 appeared 9 times and feature 1 appeared only once)


In [3]:
from typing import cast

type Feature = int
type Occurrence = int
type Label = int

def parse_libsvm_line(line: str) -> tuple[list[tuple[Feature, Occurrence]], Label]:
    # split() without arguments tolerates trailing whitespace and label-only lines
    label, *pairs = line.split()
    features = cast(list[tuple[int, int]], [tuple(map(int, pair.split(':'))) for pair in pairs])
    return features, int(label)

assert parse_libsvm_line('1 0:9 1:1 14:87') == ([(0, 9), (1, 1), (14, 87)], 1)

def parse_libsvm_content(content: str) -> tuple[np.ndarray, np.ndarray]:
    data = [parse_libsvm_line(line) for line in content.split('\n') if line.strip()]

    # Feature indices are 0-based here, so the matrix needs (largest index + 1) columns
    vocabulary_size = max((feature for features, _ in data for feature, _ in features), default=0) + 1
    X = np.zeros((len(data), vocabulary_size))
    y = np.zeros(len(data))
    for i, (features, label) in enumerate(data):
        for feature, occurrence in features:
            X[i, feature] = occurrence
        y[i] = label
    return X, y

X, y = parse_libsvm_content('1 0:9 1:1 3:87\n-1 2:1 3:2')
assert np.all(X == [[9, 1, 0, 87], [0, 0, 1, 2]])
assert np.all(y == [1, -1])

def load_libsvm_file(path: str) -> tuple[np.ndarray, np.ndarray]:
    with open(path, 'r') as f:
        return parse_libsvm_content(f.read())

In [4]:
X, y = load_libsvm_file('./aclImdb/train/labeledBow.feat')

In [5]:
X

array([[ 9.,  1.,  4., ...,  0.,  0.,  0.],
       [ 7.,  4.,  2., ...,  0.,  0.,  0.],
       [ 4.,  4.,  4., ...,  0.,  0.,  0.],
       ...,
       [17.,  6.,  7., ...,  0.,  0.,  0.],
       [15.,  8.,  3., ...,  0.,  0.,  0.],
       [10.,  2.,  2., ...,  0.,  0.,  0.]], shape=(25000, 89527))

In [6]:
y

array([9., 7., 9., ..., 4., 2., 2.], shape=(25000,))

In [7]:
from sklearn.datasets import load_svmlight_file


X_sklearn, y_sklearn = load_svmlight_file('./aclImdb/train/labeledBow.feat')

assert X_sklearn.shape == X.shape
assert np.allclose(X_sklearn.todense(), X)
assert np.allclose(y_sklearn, y)
# ->  Good, my implementation is consistent with the sklearn implementation

### Vocabulary

In [8]:
from pathlib import Path


VOCAB_FILE = Path('./aclImdb/imdb.vocab')
assert VOCAB_FILE.exists()

def read_vocab():
    return VOCAB_FILE.read_text().splitlines()


vocab = read_vocab()

vocab[:10]

['the', 'and', 'a', 'of', 'to', 'is', 'it', 'in', 'i', 'this']

In [9]:
# Read the bag of words and the labels (y) for the training data
X, y = load_libsvm_file('./aclImdb/train/labeledBow.feat')
# Read the bag of words and the labels (y) for the test data
X_test, y_test = load_libsvm_file('./aclImdb/test/labeledBow.feat')
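
Because `parse_libsvm_content` infers the number of columns from the largest feature index it sees, the training and test matrices can end up with different widths. One possible way around this is to pass the width explicitly, for example `len(vocab)`. The sketch below is an assumption-laden variant, not the notebook's loader: `parse_libsvm_fixed` and its `n_features` and `dtype` parameters are hypothetical names, and `float32` is chosen only to halve the memory of the dense matrix compared with the default `float64`.

```python
import numpy as np


def parse_libsvm_fixed(content: str, n_features: int, dtype=np.float32) -> tuple[np.ndarray, np.ndarray]:
    """Parse LIBSVM text into a dense matrix with exactly `n_features` columns."""
    lines = [line for line in content.splitlines() if line.strip()]
    X = np.zeros((len(lines), n_features), dtype=dtype)
    y = np.zeros(len(lines), dtype=dtype)
    for i, line in enumerate(lines):
        label, *pairs = line.split()
        y[i] = float(label)
        for pair in pairs:
            index, value = pair.split(":")
            # An index >= n_features raises IndexError instead of silently growing X.
            X[i, int(index)] = float(value)
    return X, y


train, _ = parse_libsvm_fixed("1 0:9 1:1 3:87\n-1 2:1 3:2", n_features=5)
test, _ = parse_libsvm_fixed("1 0:2", n_features=5)
print(train.shape, test.shape)  # (2, 5) (1, 5)
```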

# Task 2: Bag of Words (BOW)
## Load Raw text and scores 

1. Be sure to have downloaded the dataset from the link provided in the exercise and have read the README file
1. Be sure to have copied the dataset next to this Jupyter notebook (.ipynb file)
1. Be sure to have installed:
    * Numpy
    * NLTK (only for the stemming process)
    * Sklearn (only for building a Random Forest)
1. In this part of the exercise, Sklearn may not be used for anything other than the Random Forest
1. Build the Bag Of Words (BOW) with the raw data, for this you need to:
    * Tokenize on spaces and punctuation
    * Lower case
    * Remove punctuation
    * Remove terms appearing more often than X percent, where X is a parameter that can be changed
    * Use NLTK porter stemmer
1. Build a classifier with the BOW previously built. Take into account:
    * The Random Forest (RF) should be a binary classifier: positive (i.e., score >= 7) vs. negative (i.e., score <= 4)
    * Test the classifier with the test data

In [2]:
import re
import glob
import string
import numpy as np
from nltk.stem.porter import PorterStemmer
from sklearn.ensemble import RandomForestClassifier
stemmer = PorterStemmer()

## Load data

Read all the training data, including the reviews and the score associated with each one. Be sure to explore the data and learn its characteristics, such as the encoding and any special characters.
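
One possible way to read the reviews is sketched below. It assumes the layout described in the dataset README, with reviews stored as `<id>_<rating>.txt` under `pos` and `neg` subfolders, and it assumes UTF-8 encoding, which should be checked against the data. `load_reviews` is a hypothetical helper name.

```python
from pathlib import Path


def load_reviews(split_dir: str) -> tuple[list[str], list[int]]:
    """Read every review under `<split_dir>/pos` and `<split_dir>/neg`."""
    texts, scores = [], []
    for sentiment in ("pos", "neg"):
        for path in sorted(Path(split_dir, sentiment).glob("*.txt")):
            # File names are assumed to look like `<id>_<rating>.txt`, e.g. `0_9.txt`.
            _, rating = path.stem.split("_")
            texts.append(path.read_text(encoding="utf-8"))
            scores.append(int(rating))
    return texts, scores


# Usage, once the dataset sits next to the notebook:
# texts, scores = load_reviews("./aclImdb/train")
```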

## Clean HTML and tokenize text
Clean each review: handle the special characters, remove the HTML tags, and tokenize the text based on the instructions given in the exercise sheet.
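
A minimal sketch of the cleaning and tokenization step, using only the standard library. It is one reading of the instructions, not the reference solution: tags are replaced by a space, HTML entities are decoded, the text is lower-cased, and it is split on any run of whitespace or ASCII punctuation (so punctuation is removed as a side effect). The names `clean_html` and `tokenize` are hypothetical.

```python
import html
import re
import string

TAG_RE = re.compile(r"<[^>]+>")
# Split on any run of whitespace or ASCII punctuation.
SPLIT_RE = re.compile(rf"[\s{re.escape(string.punctuation)}]+")


def clean_html(text: str) -> str:
    """Replace HTML tags such as `<br />` with a space and decode entities."""
    return html.unescape(TAG_RE.sub(" ", text))


def tokenize(text: str) -> list[str]:
    """Lower-case the cleaned text and split it on whitespace and punctuation."""
    return [token for token in SPLIT_RE.split(clean_html(text).lower()) if token]


print(tokenize("Great movie!<br /><br />Didn't expect that &amp; loved it."))
# ['great', 'movie', 'didn', 't', 'expect', 'that', 'loved', 'it']
```

Note that splitting on the apostrophe turns "Didn't" into two tokens; whether that is desirable depends on the exercise sheet.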

## Convert to lower case and remove punctuation

## Remove X percentage and build vocabulary 
Remove all the tokens that do not meet the requirements based on the exercise sheet and build the vocabulary. 
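
The text here does not pin down what "X percent" is measured against. The sketch below assumes document frequency, i.e. a token is dropped when it occurs in more than `max_percent` percent of the reviews; other readings (such as share of all token occurrences) are possible. `build_vocabulary` and `max_percent` are hypothetical names.

```python
from collections import Counter


def build_vocabulary(tokenized_reviews: list[list[str]], max_percent: float) -> list[str]:
    """Keep tokens that occur in at most `max_percent` percent of the reviews."""
    # Document frequency: number of reviews containing the token at least once.
    doc_freq = Counter(token for tokens in tokenized_reviews for token in set(tokens))
    limit = len(tokenized_reviews) * max_percent / 100
    return sorted(token for token, count in doc_freq.items() if count <= limit)


reviews = [["the", "plot", "was", "great"], ["the", "cast", "was", "dull"], ["the", "end"]]
print(build_vocabulary(reviews, max_percent=50))
# ['cast', 'dull', 'end', 'great', 'plot']
```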

## Use Porter Stemmer for stemming

## Build the bag of words (BOW)
To build the bag-of-words matrix, use the previously built vocabulary and the tokens of each review.
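
A possible shape for this step, sketched with NumPy only. `build_bow` is a hypothetical helper, and the dense `int32` matrix is an assumption that suits small vocabularies; for the full dataset a sparse structure may be needed.

```python
import numpy as np


def build_bow(tokenized_reviews: list[list[str]], vocabulary: list[str]) -> np.ndarray:
    """Return a (n_reviews, len(vocabulary)) matrix of token counts."""
    index = {token: column for column, token in enumerate(vocabulary)}
    bow = np.zeros((len(tokenized_reviews), len(vocabulary)), dtype=np.int32)
    for row, tokens in enumerate(tokenized_reviews):
        for token in tokens:
            column = index.get(token)
            if column is not None:  # tokens outside the vocabulary are ignored
                bow[row, column] += 1
    return bow


vocabulary = ["awesome", "terrible", "boring", "amazing"]
reviews = [["awesome", "amazing"], ["terrible", "boring", "boring", "unseen"]]
print(build_bow(reviews, vocabulary))
# [[1 0 0 1]
#  [0 1 2 0]]
```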

# Task 3: Comparing BOWs

1. Use the previous steps to build a bag of words with the training data in which tokens appearing more often than 1% are discarded.
1. Compare your BOW with LIBSVM BOW. 
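
One simple starting point for the comparison is the overlap between the two vocabularies. The sketch below is only one possible metric; `compare_vocabularies` is a hypothetical helper, and the word lists are toy data. If the reference vocabulary holds unstemmed words, stemmed tokens will not match it directly, which the counts make visible.

```python
def compare_vocabularies(own: list[str], reference: list[str]) -> dict[str, float]:
    """Summarize how two vocabularies overlap."""
    own_set, ref_set = set(own), set(reference)
    shared = own_set & ref_set
    union = own_set | ref_set
    return {
        "shared": len(shared),
        "only_own": len(own_set - ref_set),
        "only_reference": len(ref_set - own_set),
        # Jaccard similarity: size of the intersection over size of the union.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


print(compare_vocabularies(["movi", "great", "plot"], ["movie", "great", "plot", "the"]))
# {'shared': 2, 'only_own': 1, 'only_reference': 2, 'jaccard': 0.4}
```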

# Task 4: Train a Random Forest and test it
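
A minimal sketch of this task on toy data. The score thresholds come from the task description above; `binarize_scores`, the toy matrices, and the hyperparameters (`n_estimators=50`, `random_state=0`) are assumptions chosen for illustration, not tuned values.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def binarize_scores(scores) -> np.ndarray:
    """Map ratings to 1 (positive, score >= 7) or 0 (negative, score <= 4)."""
    scores = np.asarray(scores)
    if np.any((scores > 4) & (scores < 7)):
        raise ValueError("neutral scores (5 or 6) have no binary label")
    return (scores >= 7).astype(int)


# Toy stand-ins for the real BOW matrices and ratings.
X_train = np.array([[2, 0], [3, 0], [0, 2], [0, 3]])
y_train = binarize_scores([9, 8, 2, 1])
X_test = np.array([[4, 0], [0, 4]])
y_test = binarize_scores([10, 3])

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```

With the real data, `X_train` and `X_test` would be the BOW matrices built with the same vocabulary, so that both have the same columns.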


# Task 5: Markov chain
Tip: For memory optimization, use sparse structures rather than a matrix mostly filled with zeros

## Pre-process data
Read the data and, using the functions previously built for the BOW representation, create a list of words for each review

## Chain words
Identify all the possible pairs of words (w0, w1) in all the reviews
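
The pair counting can be sketched with nested counters, which store only the pairs that actually occur and so stay sparse. This assumes that a pair means two consecutive words within one review; `count_transitions` is a hypothetical helper.

```python
from collections import Counter, defaultdict


def count_transitions(reviews: list[list[str]]) -> dict[str, Counter]:
    """Count how often each word w1 directly follows each word w0."""
    transitions: dict[str, Counter] = defaultdict(Counter)
    for tokens in reviews:
        # zip pairs every token with its successor inside the same review.
        for w0, w1 in zip(tokens, tokens[1:]):
            transitions[w0][w1] += 1
    return dict(transitions)


reviews = [["the", "movie", "was", "great"], ["the", "movie", "was", "bad"], ["the", "plot"]]
transitions = count_transitions(reviews)
print(transitions["the"])  # Counter({'movie': 2, 'plot': 1})
print(transitions["was"])  # Counter({'great': 1, 'bad': 1})
```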

## Initialize the Markov chain

## Generate data

Here you could also try to generate words for the unlabeled part of the dataset. Try to measure the quality of the model
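
Generation can be sketched as a random walk over the chain, where each next word is sampled in proportion to how often it followed the current word. The toy chain, the `generate` helper, and its parameters are hypothetical; a first-order chain over word counts is assumed.

```python
import random
from collections import Counter


def generate(transitions: dict[str, Counter], start: str, max_words: int, seed: int = 0) -> list[str]:
    """Random walk over the chain, sampling each next word by its observed count."""
    rng = random.Random(seed)
    words = [start]
    while len(words) < max_words:
        followers = transitions.get(words[-1])
        if not followers:  # dead end: this word was never followed by another
            break
        candidates, weights = zip(*followers.items())
        words.append(rng.choices(candidates, weights=weights, k=1)[0])
    return words


# Hand-written toy chain; in the exercise it would be built from the review pairs.
toy = {
    "the": Counter({"movie": 2, "plot": 1}),
    "movie": Counter({"was": 2}),
    "was": Counter({"great": 1, "bad": 1}),
}
print(" ".join(generate(toy, start="the", max_words=4, seed=0)))
```

Fixing the seed makes a run reproducible, which helps when comparing generated text across model variants.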