## Imports

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn as sk
import datasets as ds
import string
import re
from bs4 import BeautifulSoup



## The dataset

### Question 1
How many splits does the dataset have?

In [4]:
splits: list[str] = ds.get_dataset_split_names('imdb')
print(splits)
print(f'Number of splits: {len(splits)}')

['train', 'test', 'unsupervised']
Number of splits: 3


There are 3 splits in the IMDB dataset.

### Question 2
How big are these splits?

In [5]:
def load_datasets() -> list[ds.Dataset]:
    """
    Loads the IMDB dataset from the datasets library.
    Returns:
        datasets: list[ds.Dataset] - List of datasets
    """
    datasets: list[ds.Dataset] = []
    for split in splits:
        dataset: ds.Dataset = ds.load_dataset('imdb', split=split)
        datasets.append(dataset)
    
    return datasets

datasets: list[ds.Dataset] = load_datasets()
for i, dataset in enumerate(datasets):
    print(f'{splits[i]} split size : {dataset.num_rows}')

Found cached dataset imdb (/Users/francois.soulier/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)


train split size : 25000
test split size : 25000
unsupervised split size : 50000


### Question 3
What is the proportion of each class in the supervised splits?

In [6]:
supervised_datasets: list[ds.Dataset] = datasets[0:2]

for i, dataset in enumerate(supervised_datasets):
    # Note: this name is reused for every split; after the loop it holds the last one (test).
    train_data_frame = dataset.to_pandas()
    print(splits[i])
    print('Class 0')
    print(train_data_frame.where(train_data_frame['label'] == 0).count())
    print('Class 1')
    print(train_data_frame.where(train_data_frame['label'] == 1).count())
    print('\n')

train
Class 0
text     12500
label    12500
dtype: int64
Class 1
text     12500
label    12500
dtype: int64


test
Class 0
text     12500
label    12500
dtype: int64
Class 1
text     12500
label    12500
dtype: int64




Hence, the classes are perfectly balanced: each class accounts for 50% of every supervised split (12,500 of the 25,000 reviews in both train and test).
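
As a cross-check, the same proportions can be read off directly with pandas. The following is a minimal sketch on a small hypothetical frame (`toy_frame` and its rows are illustrative, not IMDB data); it only assumes a `label` column like the one returned by `dataset.to_pandas()`.

```python
import pandas as pd

# Hypothetical toy data standing in for dataset.to_pandas(); not real IMDB reviews.
toy_frame = pd.DataFrame({
    "text": ["great film", "terrible plot", "loved it", "boring"],
    "label": [1, 0, 1, 0],
})

# normalize=True returns each class's share of the rows instead of raw counts.
proportions = toy_frame["label"].value_counts(normalize=True).sort_index()
print(proportions)
```

Applied to the real splits, this would replace the two `where(...).count()` calls per split with a single expression.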

## Naive Bayes classifier 

### Question 1
Create a suitable preprocessing function that lowercases the text and strips punctuation:

In [28]:
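def clean_html(text: str) -> str:
  """
  Removes HTML tags from the given text.
  Args:
      text (str): Text with HTML tags.
  Returns:
      str: Text with all HTML tags removed.
  """
  # Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning.
  no_html = BeautifulSoup(text, "html.parser").get_text()
  return no_html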

def text_processing(text: str) -> str:
  """
  Pre-processes the given text.
  Args:
      text (str): Text to process
  Returns:
      str: Processed text
  """
  result_text = text
  result_text = clean_html(result_text)
  result_text = result_text.lower()
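  # Drop punctuation unless it sits between two letters (keeps "u.s.a" and "doesn't").
  pattern = r"(?<![a-zA-Z])[^\w\s]|[^\w\s](?![a-zA-Z])"
  result_text = re.sub(pattern, "", result_text)
  result_text = result_text.strip()
  # Collapse runs of whitespace into a single space (raw string avoids an invalid-escape warning).
  return re.sub(r"\s+", " ", result_text)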

#### Tiny test

In [33]:
def test_preprocessing(text: str, expected: str) -> None:
    result: str = text_processing(text)
    # Show the actual result in the assertion message when the check fails.
    assert result == expected, f"expected {expected!r}, got {result!r}"

test_preprocessing("Hello, ,,,World!::", "hello world")
test_preprocessing("Hello,        U.S.A!", "hello u.s.a")
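
The `u.s.a` case passes because the pattern only removes a punctuation character when it lacks a letter on at least one side. The sketch below isolates that step to make the consequences visible; `strip_punctuation` and the sample strings are hypothetical helpers for illustration, not part of the notebook's pipeline.

```python
import re

# Same pattern as in text_processing: a punctuation character is removed
# unless it has a letter on both sides.
PATTERN = r"(?<![a-zA-Z])[^\w\s]|[^\w\s](?![a-zA-Z])"

def strip_punctuation(text: str) -> str:
    """Hypothetical helper: lowercase the text, then apply only the punctuation step."""
    return re.sub(PATTERN, "", text.lower())

samples = ["flair, which", "doesn't match", "film.Not good", "p*ss drunk"]
cleaned = [strip_punctuation(sample) for sample in samples]
print(cleaned)
```

Contractions such as `doesn't` survive intact, which is presumably the intent, but a missing space after a full stop leaves two words glued together (`film.not`), and censored words keep their asterisk.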

Now let's apply the `text_processing` function to the `text` column of our dataframe.

In [35]:
train_data_frame.text = train_data_frame.text.apply(text_processing)
train_data_frame.text[5]



"i had high hopes for this one until they changed the name to the shepherd border patrol the lamest movie name ever what was wrong with just the shepherd this is a by the numbers action flick that tips its hat at many classic van damme films there is a nice bit of action in a bar which reminded me of hard target and universal soldier but directed with no intensity or flair which is a shame there is one great line about being p*ss drunk and carrying a rabbit and some ok action scenes let down by the cheapness of it all a lot of the times the dialogue doesn't match the characters mouth and the stunt men fall down dead a split second before even being shot the end fight is one of the better van damme fights except the director tries to go a bit too john woo and fails also introducing flashbacks which no one really cares about just gets in the way of the action which is the whole point of a van damme film.not good not bad just average generic action"