# <font color = 'dodgerblue'>**Tokenization approaches with spaCy - Real Dataset**

# <font color = 'dodgerblue'>**Install/Import Libraries**

In [2]:
# install spacy
if 'google.colab' in str(get_ipython()):
    !pip install -U spacy -qq
    !pip install swifter


In [3]:
# Import the Path module from the pathlib library
from pathlib import Path

# Import the tarfile module for working with tar files
import tarfile

# Import the pandas library for working with data frames
import pandas as pd

# Import the spacy library for natural language processing
import spacy

# Import the List type from the typing module to use in function annotations
from typing import List

# Import the swifter package to speed up data processing tasks on pandas DataFrame and Series objects
import swifter



In [4]:
# check spacy version
spacy.__version__


'3.6.0'

# <font color = 'dodgerblue'>**Specify Data Folders**

In [5]:
get_ipython()


<ipykernel.zmqshell.ZMQInteractiveShell at 0x7ffb0dd40070>

The function `get_ipython()` returns a reference to the current IPython instance running in the environment. This instance is an IPython shell or an IPython kernel, depending on the context in which the code is executed.
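
`get_ipython` is only defined when the code runs under IPython or Jupyter, so the Colab check used in this notebook would raise a `NameError` in a plain Python script. A minimal sketch of a guard, assuming a hypothetical helper name `in_colab` that is not part of the original notebook:

```python
def in_colab() -> bool:
    """Return True when running inside Google Colab (hypothetical helper)."""
    try:
        # get_ipython is injected by IPython; it is undefined in plain Python
        shell = get_ipython()  # noqa: F821
    except NameError:
        return False
    return 'google.colab' in str(shell)


print(in_colab())
```

The helper returns a plain boolean, so it can replace the repeated `'google.colab' in str(get_ipython())` expression wherever the notebook branches on the environment.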

In [6]:
# Check if the code is running in a Colab environment
if 'google.colab' in str(get_ipython()):  # If the code is running in Colab
    # mount google drive
    from google.colab import drive
    drive.mount('/content/drive')

    # set the base path to a Google Drive folder
    base_path = '/content/drive/MyDrive/data'
else:
    # If the code is not running in Colab, set the base path to a local folder
    base_path = '/home/harpreet/Insync/google_drive_shaannoor/data'


# Convert the base path to a Path object
base_folder = Path(base_path)

# Define the archive folder path
archive_folder = base_folder/'archive'

# Define the data folder path
data_folder = base_folder/'datasets'


Code Explanation:

- **Environment Check**: The code determines whether it's running in a Google Colab environment or locally on a machine. This distinction guides the subsequent steps.
- **Mounting Google Drive (if in Colab)**:
  - **Access to Files**: By mounting Google Drive, the code gains access to files and folders stored in the user's Google Drive account. This is essential for reading and writing data that's stored in the cloud.
  - **Collaboration and Portability**: Mounting Google Drive allows multiple users to work on shared files and ensures that the code can be run from any device with access to the user's Google Drive. It promotes collaboration and makes the code more portable.
  - **Persistent Storage**: Google Colab instances are temporary and reset after a period of inactivity. Mounting Google Drive provides a way to save and access data across different sessions, ensuring persistence.
- **Setting the Base Path**: Depending on the environment, the base path is set to a specific directory in Google Drive (if in Colab) or a local folder (if running locally).
- **Using Path Objects**: The code utilizes `Path` objects for handling file paths, enhancing cross-platform compatibility.
- **Defining Specific Folder Paths**: Paths to specific subdirectories (`archive` and `datasets`) are defined relative to the base folder, organizing the data structure.

Because `base_folder`, `archive_folder`, and `data_folder` are defined the same way in both branches, the rest of the notebook runs unchanged locally and in Colab, and files written under Google Drive persist across Colab sessions.
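
The cell above only defines the folder paths; it does not create them. A small sketch of building the same layout with `Path` objects and creating any missing folders, using a temporary directory as a stand-in for the real base path (an assumption made so the example runs anywhere):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    base_folder = Path(tmp)  # stand-in for the real base path
    archive_folder = base_folder / 'archive'
    data_folder = base_folder / 'datasets'

    # Create each folder if it is missing; do nothing if it already exists
    for folder in (archive_folder, data_folder):
        folder.mkdir(parents=True, exist_ok=True)

    created = sorted(p.name for p in base_folder.iterdir())
    print(created)
```

`parents=True` creates intermediate directories, and `exist_ok=True` makes the call safe to repeat when the cell is re-run.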

# <font color = 'dodgerblue'>**Download Data**

## <font color = 'dodgerblue'>**Step 1: use wget to download data files from URL**

In [7]:
# complete data link: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

url = 'https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
file = archive_folder/'aclImdb_v1.tar.gz'
if not file.exists():  # check if file already exists
    !wget {url} -P {archive_folder} -O {file}


This code snippet is downloading a tar.gz file from a given URL and saving it to a specified location using the `wget` command-line utility. Let's go through each part of the code:

- `if not file.exists():`: This `if` statement checks whether the file specified by the `file` variable already exists or not. If the file doesn't exist, the code inside the `if` block will be executed.

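- `!wget {url} -P {archive_folder} -O {file}`: This line runs the `wget` command to download the file from `url`. `-O {file}` sets the output file, and because `file` is already the full path inside `archive_folder`, that is where the download is written; `-P {archive_folder}` names the same directory as the download prefix.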
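
The same "download only if missing" logic can also be sketched in pure Python, which may help on systems without `wget`. This is an illustrative alternative, not the notebook's method; `download_if_missing` is a hypothetical helper, and the demonstration uses a local `file://` URL so it runs offline:

```python
import tempfile
import urllib.request
from pathlib import Path


def download_if_missing(url: str, dest: Path) -> bool:
    """Download url to dest unless dest already exists.

    Returns True if a download happened, False if it was skipped.
    """
    if dest.exists():
        return False
    # Make sure the target folder exists before writing into it
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(dest))
    return True


# Demonstration with a local file:// URL so the sketch runs offline
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / 'source.txt'
    source.write_text('sample payload')
    target = Path(tmp) / 'archive' / 'copy.txt'

    first = download_if_missing(source.as_uri(), target)   # performs the copy
    second = download_if_missing(source.as_uri(), target)  # skipped, file exists
    payload = target.read_text()
```

For the real dataset, the call would take the notebook's `url` and `file` variables in place of the temporary paths.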



## <font color = 'dodgerblue'>**Step 2: check content of folder where data was downloaded**

In [8]:
# list files of google drive where data was downloaded
for entries in archive_folder.iterdir():
    if 'tar' in entries.name:
        print(entries.name)


UNGDC_1970-2020.tar.gz
20news-bydate.tar.gz
scale_whole_review.tar.gz
aclImdb_v1.tar.gz
review_polarity.tar.gz
cifar-10-python.tar.gz


## <font color = 'dodgerblue'>**Step 3: Check content of zipped/tar folder**

In [9]:
# create a pathlib object for the file we want to untar
file = archive_folder / 'aclImdb_v1.tar.gz'


In [10]:
# List the archive's member names using the tarfile library (nothing is extracted here)
# you can skip running this cell

with tarfile.open(file, 'r') as tar:
    tar_file_names = tar.getnames()


In [11]:
tar_file_names[0:10]


['aclImdb',
 'aclImdb/test',
 'aclImdb/train',
 'aclImdb/test/neg',
 'aclImdb/test/pos',
 'aclImdb/train/neg',
 'aclImdb/train/pos',
 'aclImdb/train/unsup',
 'aclImdb/imdbEr.txt',
 'aclImdb/imdb.vocab']

## <font color = 'dodgerblue'>**Step 4: unzip/untar files**

In [12]:
# this cell can take time to run if you are running this for first time
file = archive_folder/'aclImdb_v1.tar.gz'
with tarfile.open(file, 'r') as tar:
    # Get the list of names of members in the tar file
    member_names = tar.getnames()
    # Loop over each member name
    for member_name in member_names:
        # Get the path of the current member
        member_path = data_folder / member_name
        # Extract the current member only if it does not already exist
        if not member_path.exists():
            tar.extract(member_name, path=data_folder)


Here is an explanation of the code:

- `with tarfile.open(file, 'r') as tar:`: This line opens the tar archive specified by `file` in read mode and creates a `TarFile` object, which is stored in the variable `tar`. The `with` statement ensures that the tar file is properly closed when the code inside the block finishes executing.

- `member_names = tar.getnames()`: This line retrieves a list of the names of the members in the tar archive and stores it in the variable `member_names`.

- `for member_name in member_names:`: This line starts a `for` loop that iterates over each member name in the list `member_names`.

- `member_path = data_folder / member_name`: This line creates a `Path` object that represents the path of the current member, built from `data_folder` and the current `member_name`.

- `if not member_path.exists():`: This line checks whether the path represented by `member_path` already exists.

- `tar.extract(member_name, path=data_folder)`: If the path does not exist, this line extracts the current member from the tar archive and saves it under `data_folder`.
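
The steps above can be tried on a tiny archive before running them on the full dataset. A minimal sketch, assuming a hypothetical helper `extract_missing` and a throwaway archive built in a temporary directory as a stand-in for `aclImdb_v1.tar.gz`:

```python
import tarfile
import tempfile
from pathlib import Path


def extract_missing(archive: Path, dest: Path) -> list:
    """Extract only members not already present under dest; return their names."""
    extracted = []
    with tarfile.open(archive, 'r') as tar:
        for member_name in tar.getnames():
            if not (dest / member_name).exists():
                tar.extract(member_name, path=dest)
                extracted.append(member_name)
    return extracted


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    # Build a tiny archive to stand in for the real dataset archive
    sample = tmp_path / 'sample.txt'
    sample.write_text('a review')
    archive = tmp_path / 'sample.tar.gz'
    with tarfile.open(archive, 'w:gz') as tar:
        tar.add(sample, arcname='demo/sample.txt')

    out_dir = tmp_path / 'out'
    first_run = extract_missing(archive, out_dir)   # extracts the one member
    second_run = extract_missing(archive, out_dir)  # nothing left to extract
```

The second call returns an empty list, which is the behavior that makes re-running the notebook cell cheap after the first extraction. Note that the check is by path only: a partially written file from an interrupted run would be treated as already extracted.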

## <font color = 'dodgerblue'>**Step 5: Understand the structure of unzipped folder**

In [13]:
# we will use rglob which will help us to specify the pattern to search
# ** - Recursively matches zero or more directories that fall under the current directory.

for entries in (data_folder/'aclImdb').rglob('**'):
    print(entries)


/home/harpreet/Insync/google_drive_shaannoor/data/datasets/aclImdb
/home/harpreet/Insync/google_drive_shaannoor/data/datasets/aclImdb/train
/home/harpreet/Insync/google_drive_shaannoor/data/datasets/aclImdb/train/neg
/home/harpreet/Insync/google_drive_shaannoor/data/datasets/aclImdb/train/pos
/home/harpreet/Insync/google_drive_shaannoor/data/datasets/aclImdb/train/unsup
/home/harpreet/Insync/google_drive_shaannoor/data/datasets/aclImdb/imdb-bert-base-uncased.hf
/home/harpreet/Insync/google_drive_shaannoor/data/datasets/aclImdb/imdb-bert-base-uncased.hf/train
/home/harpreet/Insync/google_drive_shaannoor/data/datasets/aclImdb/imdb-bert-base-uncased.hf/valid
/home/harpreet/Insync/google_drive_shaannoor/data/datasets/aclImdb/imdb-bert-base-uncased.hf/test
/home/harpreet/Insync/google_drive_shaannoor/data/datasets/aclImdb/.ipynb_checkpoints
/home/harpreet/Insync/google_drive_shaannoor/data/datasets/aclImdb/test
/home/harpreet/Insync/google_drive_shaannoor/data/datasets/aclImdb/test/neg
/hom

Explanation of the code:
- The `rglob` method performs a recursive search for files and directories under the given folder.
- The `'**'` pattern matches the folder and all of its subdirectories, so the loop prints the directory tree.
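
A small, self-contained sketch of recursive globbing on a throwaway directory tree. It uses `rglob('*')` filtered with `is_dir()` rather than `rglob('**')`, on the assumption that an explicit filter is the more portable way to list directories across Python versions; the folder and file names are made up for illustration:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / 'aclImdb'
    for sub in ('train/pos', 'train/neg', 'test/pos'):
        (root / sub).mkdir(parents=True)
    (root / 'train' / 'pos' / '0_9.txt').write_text('great film')

    # rglob('*') walks the whole tree; keep directories only
    dirs = sorted(p.relative_to(root).as_posix()
                  for p in root.rglob('*') if p.is_dir())
    # rglob('*.txt') finds text files at any depth
    txt_files = [p.name for p in root.rglob('*.txt')]

    print(dirs)
    print(txt_files)
```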

## <font color = 'dodgerblue'>**Step 6a: combine all text files and create dataframe**

In [14]:
# Function to combine reviews from multiple text files
# the concepts were covered in the first lecture

def get_reviews(path: Path) -> List[str]:
    """
    This function takes a directory path and returns a list of strings,
    where each string is the contents of a '.txt' file in the directory.

    Parameters:
    - path (Path): The directory path to search for '.txt' files

    Returns:
    - List[str]: A list of strings, where each string is the contents of a '.txt' file in the directory
    """
    reviews = []  # list to store the contents of each '.txt' file

    # loop through all the entries in the directory
    for file in path.iterdir():
        # check if the entry is a '.txt' file
        if file.suffix == '.txt':
            # open the file and read its contents
            # (entries from iterdir() already include the directory, so use `file` directly)
            with open(file, 'r') as f:
                text = f.read()
                # add the contents to the list of reviews
                reviews.append(text)

    # return the list of reviews
    return reviews


In [15]:
# Function to create dataframe from extracted list of files

def make_dataframe(folder: Path) -> pd.DataFrame:
    """
    This function takes a directory path and returns a Pandas DataFrame with two columns: 'Reviews' and 'Labels'.
    The 'Reviews' column contains the contents of all '.txt' files in the 'pos' and 'neg' subdirectories of the input
    folder, concatenated together. The 'Labels' column contains binary labels indicating whether the corresponding
    review is positive (1) or negative (0).

    Parameters:
    - folder (Path): The directory path containing the 'pos' and 'neg' subdirectories

    Returns:
    - pd.DataFrame: A Pandas DataFrame with two columns: 'Reviews' and 'Labels'
    """
    # Get the reviews from the 'pos' and 'neg' subdirectories
    positive_reviews = get_reviews(folder / 'pos')
    negative_reviews = get_reviews(folder / 'neg')

    # Create the DataFrame with the combined reviews and binary labels
    data = pd.DataFrame({'Reviews': positive_reviews + negative_reviews,
                         'Labels': [1] * len(positive_reviews) + [0] * len(negative_reviews)})
    # Convert the 'Labels' column to integer type
    data['Labels'] = data['Labels'].astype('int32')

    # Return the DataFrame
    return data
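
The labeling logic can be checked on a tiny folder before running it on all 25,000 reviews. The sketch below mirrors the two functions above but returns plain `(review, label)` pairs instead of a DataFrame, so it needs only the standard library; `load_labeled_reviews` and the sample texts are hypothetical:

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def load_labeled_reviews(folder: Path) -> List[Tuple[str, int]]:
    """Return (review, label) pairs: 1 for 'pos', 0 for 'neg'."""
    pairs = []
    for subfolder, label in (('pos', 1), ('neg', 0)):
        # sorted() makes the row order reproducible across runs
        for file in sorted((folder / subfolder).glob('*.txt')):
            pairs.append((file.read_text(encoding='utf-8'), label))
    return pairs


with tempfile.TemporaryDirectory() as tmp:
    train = Path(tmp) / 'train'
    for subfolder, text in (('pos', 'loved it'), ('neg', 'hated it')):
        (train / subfolder).mkdir(parents=True)
        (train / subfolder / '0.txt').write_text(text, encoding='utf-8')

    pairs = load_labeled_reviews(train)
    print(pairs)
```

Sorting the file names is a design choice not present in the notebook: `iterdir()` makes no ordering guarantee, so two runs may otherwise produce rows in a different order.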


In [17]:
# this cell can take 15 mins to run
# create a train data set
train_data = make_dataframe(data_folder/'aclImdb/train')


In [18]:
# create a test data set
test_data = make_dataframe(data_folder/'aclImdb/test')


### <font color = 'dodgerblue'>**Save dataframe to csv file**

In [19]:
train_data.to_csv(data_folder/'aclImdb'/'train.csv')


In [20]:
test_data.to_csv(data_folder/'aclImdb'/'test.csv')


In [21]:
train_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Reviews  25000 non-null  object
 1   Labels   25000 non-null  int32 
dtypes: int32(1), object(1)
memory usage: 293.1+ KB


# <font color = 'dodgerblue'>**Load csv file**

In [22]:
train_data = pd.read_csv(data_folder / 'aclImdb'/'train.csv', index_col=0)


In [23]:
# Printing shape of dataframe
train_data.shape


(25000, 2)

In [24]:
# display first five rows
train_data.head()


| | Reviews | Labels |
|---|---|---|
| 0 | Ever wanted to know just how much Hollywood co... | 1 |
| 1 | The movie itself was ok for the kids. But I go... | 1 |
| 2 | You could stage a version of Charles Dickens' ... | 1 |
| 3 | this was a fantastic episode. i saw a clip fro... | 1 |
| 4 | and laugh out loud funny in many scenes.<br />... | 1 |


# <font color = 'dodgerblue'>**Import Spacy Model**

In [25]:
# download the small English spaCy model (en_core_web_sm)
!python -m spacy download en_core_web_sm



# <font color = 'dodgerblue'>**Compare tokenization approaches**

In [None]:
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')


## <font color = 'dodgerblue'>**Method 1: Typical approach using spaCy**

In [None]:
def tokenize(text: str) -> List[str]:
    """Tokenize the input text using spaCy.

    Args:
    text: The input text to be tokenized.

    Returns:
    A list of tokens.
    """
    # Apply the spaCy NLP model to the input text
    doc = nlp(text)
    # Extract the tokens from the spaCy doc and return as a list
    tokens = [token.text for token in doc]
    return tokens


In [None]:
# DO NOT RUN THIS CELL in class
# it is only for demonstration purpose, it can take a long time
# as indicated by the output below
# it took around 8 minutes on a 128 gb RAM machine
# it took 21 minutes on colab
train_data['tokens_method1'] = train_data['Reviews'].swifter.apply(tokenize)


Pandas Apply:   0%|          | 0/25000 [00:00<?, ?it/s]
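
The run times quoted in the comments of this notebook were noted by hand. A minimal sketch of recording them programmatically, assuming a hypothetical `timer` context manager and a whitespace split as a stand-in workload (not spaCy tokenization):

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(label: str, results: dict):
    """Record the wall-clock seconds spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timer('whitespace_split', timings):
    # Stand-in workload; replace with one of the tokenization methods
    tokens = [text.split() for text in ['a short review'] * 1000]

print({name: f'{seconds:.4f}s' for name, seconds in timings.items()})
```

Wrapping each method in its own `with timer(...)` block collects all timings in one dictionary, which makes the three approaches easier to compare on the same machine.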

In [None]:
train_data.head()


| | Reviews | Labels | tokens_method1 |
|---|---|---|---|
| 0 | An excellent example of the spectacular Busby ... | 1 | [An, excellent, example, of, the, spectacular,... |
| 1 | In Manhattan, the American middle class Jim Bl... | 1 | [In, Manhattan, ,, the, American, middle, clas... |
| 2 | "Foxes" is a great film. The four young actres... | 1 | [", Foxes, ", is, a, great, film, ., The, four... |
| 3 | Another comment about this film made it sound ... | 1 | [Another, comment, about, this, film, made, it... |
| 4 | The energetic young producer of theatrical pro... | 1 | [The, energetic, young, producer, of, theatric... |


## <font color = 'dodgerblue'>**Method 2: Using nlp.pipe from Spacy**

In [None]:
import os
os.cpu_count()


64

In [None]:
# DO NOT Run this cell in the class

# spaCy includes built-in support for multiprocessing with nlp.pipe
# this can speed up the processing
# it took 1 min 42 secs on a 128 gb RAM machine with 16 cores
# it took 10 mins on colab pro (colab pro  has 4 cores)

# initialize an empty list to store tokens
tokens_method2 = []

# process multiple documents in parallel using the spaCy NLP library
for doc in nlp.pipe(train_data.Reviews.values, batch_size=1000, n_process=3):
    # extract text of each token in the document and create a list of tokens
    tokens = [token.text for token in doc]
    # add the list of tokens to the tokens_method2
    tokens_method2.append(tokens)

# add the tokens_method2 to the train_data dataframe as a new column 'tokens_method2'
train_data['tokens_method2'] = tokens_method2


This code performs tokenization on the `train_data.Reviews.values` by using the spaCy NLP library (`nlp`).

- The **`nlp.pipe` method is used to process multiple documents in parallel**, where `batch_size=1000` and `n_process=3` specify the batch size and the number of CPU processes to use, respectively.

- For each document in the batch, the code creates a list of tokens, represented by the text of the spaCy token objects, using a list comprehension `[token.text for token in doc]`.

- The resulting list of tokens is then appended to `tokens_method2`. Finally, the `tokens_method2` list is added as a new column `'tokens_method2'` to the `train_data` dataframe.
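
The same pattern can be tried on a couple of strings without downloading a model. This sketch assumes a blank English pipeline (`spacy.blank('en')`, tokenizer only) and leaves `n_process` at its default of 1, so it shows the batching API rather than the multiprocessing speed-up; the variable names are illustrative:

```python
import spacy

# Blank English pipeline: tokenizer only, no model download needed
nlp_blank = spacy.blank('en')

texts = ['Hello world!', 'Tokenization is the first step.']

# nlp.pipe yields Doc objects in input order; batch_size controls how many
# texts are buffered per batch. n_process is left at its default of 1 here.
token_lists = [[token.text for token in doc]
               for doc in nlp_blank.pipe(texts, batch_size=2)]
print(token_lists[0])
```

Because `nlp.pipe` is a generator, documents are produced as they are consumed, which is why the notebook collects the tokens inside the loop instead of materializing all `Doc` objects first.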






In [None]:
train_data.head()


| | Reviews | Labels | tokens_method1 | tokens_method2 |
|---|---|---|---|---|
| 0 | An excellent example of the spectacular Busby ... | 1 | [An, excellent, example, of, the, spectacular,... | [An, excellent, example, of, the, spectacular,... |
| 1 | In Manhattan, the American middle class Jim Bl... | 1 | [In, Manhattan, ,, the, American, middle, clas... | [In, Manhattan, ,, the, American, middle, clas... |
| 2 | "Foxes" is a great film. The four young actres... | 1 | [", Foxes, ", is, a, great, film, ., The, four... | [", Foxes, ", is, a, great, film, ., The, four... |
| 3 | Another comment about this film made it sound ... | 1 | [Another, comment, about, this, film, made, it... | [Another, comment, about, this, film, made, it... |
| 4 | The energetic young producer of theatrical pro... | 1 | [The, energetic, young, producer, of, theatric... | [The, energetic, young, producer, of, theatric... |


## <font color = 'dodgerblue'>**Method 3: Using nlp.pipe and disabling components that are not required**

In [None]:
# in addition to multiprocessing with nlp.pipe
# we can get significant speed improvements if we disable the components that we do not need
# it took around 3 minutes
# it took 26 secs on a 128 gb RAM machine with 16 cores
# 50 secs on colab

# initialize an empty list to store tokens
token_list_method3 = []

# temporarily disable the listed components of the spaCy processing pipeline
disabled = nlp.select_pipes(
    disable=['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner'])

# process multiple documents in parallel using the spaCy NLP library
for doc in nlp.pipe(train_data.Reviews.values, batch_size=1000, n_process=3):
    # extract text of each token in the document and create a list of tokens
    tokens = [token.text for token in doc]
    # add the list of tokens to the token_list_method3
    token_list_method3.append(tokens)

# add the token_list_method3 to the train_data dataframe as a new column 'tokens_method3'
train_data['tokens_method3'] = token_list_method3

# restore the pipeline components that were disabled
disabled.restore()
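
The object returned by `nlp.select_pipes` can also be used as a context manager, which restores the disabled components automatically even if the loop raises an error. A minimal sketch on a blank English pipeline with a single `sentencizer` component, chosen as an assumption so that no model download is needed:

```python
import spacy

nlp_demo = spacy.blank('en')
nlp_demo.add_pipe('sentencizer')

before = list(nlp_demo.pipe_names)
with nlp_demo.select_pipes(disable=['sentencizer']):
    # Inside the block only the tokenizer runs
    during = list(nlp_demo.pipe_names)
    tokens = [token.text for token in nlp_demo('Only the tokenizer runs here.')]
after = list(nlp_demo.pipe_names)

print(before, during, after)
```

With `en_core_web_sm`, the `disable` list would name the same six components used in the cell above.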


In [None]:
train_data.head()


| | Reviews | Labels | tokens_method1 | tokens_method2 | tokens_method3 |
|---|---|---|---|---|---|
| 0 | An excellent example of the spectacular Busby ... | 1 | [An, excellent, example, of, the, spectacular,... | [An, excellent, example, of, the, spectacular,... | [An, excellent, example, of, the, spectacular,... |
| 1 | In Manhattan, the American middle class Jim Bl... | 1 | [In, Manhattan, ,, the, American, middle, clas... | [In, Manhattan, ,, the, American, middle, clas... | [In, Manhattan, ,, the, American, middle, clas... |
| 2 | "Foxes" is a great film. The four young actres... | 1 | [", Foxes, ", is, a, great, film, ., The, four... | [", Foxes, ", is, a, great, film, ., The, four... | [", Foxes, ", is, a, great, film, ., The, four... |
| 3 | Another comment about this film made it sound ... | 1 | [Another, comment, about, this, film, made, it... | [Another, comment, about, this, film, made, it... | [Another, comment, about, this, film, made, it... |
| 4 | The energetic young producer of theatrical pro... | 1 | [The, energetic, young, producer, of, theatric... | [The, energetic, young, producer, of, theatric... | [The, energetic, young, producer, of, theatric... |
