
In [1]:
import os
import torch

The line os.makedirs(root, exist_ok=True) is a Python command that creates a directory (folder) on your computer's file system.

os.makedirs(): This function from the os module (which provides a way of using operating system dependent functionality) is used to create directories recursively. This means if the root path contains multiple subdirectories that don't exist yet (e.g., data/fra-eng), it will create all of them.
root: This is the path to the directory you want to create (e.g., ./data).
exist_ok=True: This is an important argument. If set to True, the function will not raise an error if the directory already exists. It will simply proceed without doing anything. If exist_ok were False (which is the default behavior), and the directory already existed, os.makedirs() would raise a FileExistsError.
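
The behaviour described above can be checked with a small sketch. It assumes a throwaway temporary directory rather than the notebook's ./data folder, and the names base, created, and raised are only illustrative.

```python
import os
import tempfile

# Work inside a throwaway directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as base:
    # Two nested levels, neither of which exists yet.
    root = os.path.join(base, 'data', 'fra-eng')

    os.makedirs(root, exist_ok=True)   # creates 'data' and 'data/fra-eng'
    created = os.path.isdir(root)

    os.makedirs(root, exist_ok=True)   # directory exists: no error, nothing to do

    try:
        os.makedirs(root)              # exist_ok defaults to False
        raised = False
    except FileExistsError:
        raised = True

print(created, raised)  # True True
```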

-------

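The line os.path.join(self.root, 'fra-eng.zip') builds the path to the zip file by joining its components with the operating system's path separator. If self.root is ./data, the result is ./data/fra-eng.zip on Linux/macOS and ./data\fra-eng.zip on Windows: os.path.join inserts the separator between the components but does not rewrite separators already present in self.root (Windows accepts both forms). This is safer than concatenating strings with / or \ by hand because os.path.join handles the platform-specific separator for you.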
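
To see what the join produces on each platform without switching machines, one option is to call the two implementations behind os.path directly: posixpath (Linux/macOS rules) and ntpath (Windows rules). This is only an illustrative sketch; the notebook itself uses os.path.join. Note that the separator is inserted between the components only, so the forward slash already inside root is left as it is.

```python
import ntpath      # Windows path rules, importable on any OS
import posixpath   # Linux/macOS path rules, importable on any OS

root = './data'

posix_path = posixpath.join(root, 'fra-eng.zip')
windows_path = ntpath.join(root, 'fra-eng.zip')

print(posix_path)    # ./data/fra-eng.zip
print(windows_path)  # ./data\fra-eng.zip
```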



------

The line urllib.request.urlretrieve(data_url, zip_file_path) is responsible for downloading the dataset from the internet.

urllib.request.urlretrieve(): This function from Python's urllib.request module is used to retrieve a remote URL object (like a file from a web server) and save it locally.
data_url: This is the first argument, representing the URL of the file to be downloaded (e.g., 'http://d2l-data.s3-accelerate.amazonaws.com/fra-eng.zip').
zip_file_path: This is the second argument, specifying the local path and filename where the downloaded content should be saved (e.g., './data/fra-eng.zip').
In essence, this line takes the URL of the zip file and downloads it, saving it to the specified location on your computer.
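
The call can be tried without network access by pointing it at a file:// URL. The sketch below assumes a small stand-in file with made-up content in place of the real dataset URL; the names source, local_path, and downloaded are hypothetical.

```python
import os
import pathlib
import tempfile
import urllib.request

with tempfile.TemporaryDirectory() as base:
    # Stand-in for the remote file (hypothetical content, not the real dataset).
    source = pathlib.Path(base) / 'remote.txt'
    source.write_text('Go.\tVa !\n', encoding='utf-8')

    data_url = source.as_uri()                       # a file:// URL to the stand-in
    local_path = os.path.join(base, 'download.txt')  # where the copy should go

    # urlretrieve returns the local filename and the response headers.
    saved_to, headers = urllib.request.urlretrieve(data_url, local_path)

    with open(saved_to, encoding='utf-8') as f:
        downloaded = f.read()

print(repr(downloaded))
```

With an http:// URL the call has the same shape; only the source of the bytes changes.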

------

The line zipfile.ZipFile(zip_file_path, 'r') opens a zip archive for reading.

zipfile: This is a built-in Python module that provides tools to work with ZIP archives.
zipfile.ZipFile(): This is the constructor for the ZipFile class within the zipfile module. It creates a ZipFile object, which represents a ZIP archive.
zip_file_path: This is the first argument, specifying the path to the ZIP file you want to open (e.g., ./data/fra-eng.zip).
'r': This is the second argument, representing the 'mode' in which the ZIP file should be opened. 'r' stands for 'read mode', meaning you intend to read contents from the archive. Other modes like 'w' (write) or 'a' (append) exist for creating or adding to zip files.
So, in summary, this code opens the specified zip file in read-only mode, allowing you to then list its contents or extract files from it.
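
A minimal sketch of the two modes, assuming a tiny demo archive built on the spot (demo.zip and its single entry are made up for illustration and are not the real fra-eng.zip):

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as base:
    zip_file_path = os.path.join(base, 'demo.zip')

    # Build a tiny archive in write mode ('w') so there is something to read.
    with zipfile.ZipFile(zip_file_path, 'w') as zf:
        zf.writestr('fra-eng/fra.txt', 'Go.\tVa !\n')

    # Reopen it in read mode ('r'); the with-block closes the archive afterwards.
    with zipfile.ZipFile(zip_file_path, 'r') as zf:
        names = zf.namelist()                                 # entries in the archive
        content = zf.read('fra-eng/fra.txt').decode('utf-8')  # bytes -> str

print(names)  # ['fra-eng/fra.txt']
```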

-----

The line zf.extractall(self.root) is a method call on the ZipFile object (zf) that you just created. It's used to extract all the contents of the opened zip archive.

zf: This is the ZipFile object, representing the fra-eng.zip archive that was opened in read mode.
.extractall(): This is a method of the ZipFile object that extracts all the files and directories from the archive.
self.root: This argument specifies the directory where the contents of the zip file should be extracted. In this case, self.root is ./data, so all the files and folders within fra-eng.zip will be extracted into the ./data directory. For example, if the zip file contains a folder named fra-eng, it will be extracted to ./data/fra-eng.
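
The resulting directory layout can be illustrated as follows. The sketch assumes a temporary directory standing in for ./data and a made-up one-file archive; the folder name inside the archive is recreated beneath the extraction directory.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as root:   # stands in for ./data
    zip_file_path = os.path.join(root, 'demo.zip')

    # Hypothetical archive containing a folder 'fra-eng' with one file.
    with zipfile.ZipFile(zip_file_path, 'w') as zf:
        zf.writestr('fra-eng/fra.txt', 'Go.\tVa !\n')

    with zipfile.ZipFile(zip_file_path, 'r') as zf:
        zf.extractall(root)                   # recreates fra-eng/ under root

    extracted = os.path.join(root, 'fra-eng', 'fra.txt')
    exists = os.path.isfile(extracted)
    with open(extracted, encoding='utf-8') as f:
        text = f.read()

print(exists)  # True
```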

-----------

The line no_space = lambda char, prev_char: char in ',.!?' and prev_char != ' ' defines a small function with a lambda expression and binds it to the name no_space.

lambda char, prev_char:: This declares the lambda function, specifying that it takes two arguments: char (the current character) and prev_char (the previous character).
char in ',.!?': This is the first part of the condition. It checks if the char is one of the specified punctuation marks: comma, period, exclamation mark, or question mark.
prev_char != ' ': This is the second part of the condition. It checks if the prev_char is not a space.
... and ...: The and operator means that both conditions must be True for the lambda function to return True.
In simple terms, this no_space function returns True if the current character is a punctuation mark (,.!?) and the character immediately preceding it is not a space. It's used in the preprocessing step to determine whether a space needs to be inserted before a punctuation mark to standardize spacing.
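
The three cases that matter can be checked directly. The lambda is copied from the class; the variable names on the left are only illustrative.

```python
no_space = lambda char, prev_char: char in ',.!?' and prev_char != ' '

after_letter = no_space('!', 'o')   # punctuation right after a letter
after_space = no_space('!', ' ')    # punctuation already preceded by a space
not_punct = no_space('o', 'g')      # current character is not punctuation

print(after_letter, after_space, not_punct)  # True False False
```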

-----------

Let's trace how the lines out = [' ' + char if i > 0 and no_space(char, text[i - 1]) else char for i, char in enumerate(text.lower())] and return ''.join(out) work, using an example.

Imagine text contains just the sentence: "Go!".

text.lower(): First, "Go!" becomes "go!".

enumerate(text.lower()): This breaks down "go!" into indexed characters:

(0, 'g')
(1, 'o')
(2, '!')
List Comprehension Loop: We go through each of these (i, char) pairs:

For (0, 'g'): i is 0. The condition i > 0 is False. So, it falls to the else part, and 'g' is added to out.
out is now ['g'].

For (1, 'o'): i is 1. The condition i > 0 is True.
Next, no_space('o', text[1 - 1]) is evaluated. Note that text[i - 1] indexes the original string, not the lowercased copy, so this is no_space('o', 'G'). Is 'o' in ',.!?': False. So no_space returns False. The entire if condition (i > 0 and no_space(...)) is False. It falls to the else part, and 'o' is added to out.
out is now ['g', 'o'].

For (2, '!'): i is 2. The condition i > 0 is True.
Next, no_space('!', text[2 - 1]) which is no_space('!', 'o') is evaluated. Is '!' in ',.!?': True. Is previous character 'o' != ' ': True. So no_space returns True. The entire if condition (i > 0 and no_space(...)) is True. It executes the ' ' + char part, so ' !' (space followed by exclamation mark) is added to out.
out is now ['g', 'o', ' !'].

return ''.join(out): Finally, all the elements in the out list are joined together into a single string, with no characters in between them.

So, ['g', 'o', ' !'] becomes "go !".

This shows how the code inserts a space before each of the punctuation marks , . ! ? whenever it is not already preceded by one, so that punctuation can later be split off as its own token.
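
The trace can be reproduced by lifting the method body into a standalone function. Here preprocess is a hypothetical free-function copy of _preprocess, and the extra inputs are made-up sentences chosen to exercise the narrow no-break space replacement and the comma case.

```python
def preprocess(text):
    # Replace non-breaking spaces with ordinary spaces.
    text = text.replace('\u202f', ' ').replace('\xa0', ' ')
    # True when punctuation is not already preceded by a space.
    no_space = lambda char, prev_char: char in ',.!?' and prev_char != ' '
    out = [' ' + char if i > 0 and no_space(char, text[i - 1]) else char
           for i, char in enumerate(text.lower())]
    return ''.join(out)

result_1 = preprocess('Go!')           # space inserted before '!'
result_2 = preprocess('Va\u202f!')     # already spaced, nothing inserted
result_3 = preprocess('Wait, what?')   # both ',' and '?' get a leading space

print(result_1)  # go !
print(result_2)  # va !
print(result_3)  # wait , what ?
```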

---------------

Let's break down the line src.append([t for t in f'{parts[0]} <eos>'.split(' ') if t]) with an example, focusing on how it processes a single sentence part.

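Imagine that parts[0] holds the English sentence "Go.". (In the actual pipeline, _preprocess has already turned this into "go .", so the tokens would be ['go', '.', '<eos>']; the raw form is used here to keep the trace short.)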

f'{parts[0]} <eos>': First, an f-string is used to combine the sentence part with an end-of-sentence token, <eos>.

So, f'Go. <eos>' becomes the string "Go. <eos>".
.split(' '): Next, this combined string "Go. <eos>" is split into a list of words (tokens) wherever a space (' ') occurs.

This would produce the list: ['Go.', '<eos>'].
[t for t in ... if t] (List Comprehension): This part iterates through the list generated by split(). Its purpose is to filter out any potential empty strings that might result from splitting (e.g., if there were multiple spaces together, or leading/trailing spaces).

For each element t in ['Go.', '<eos>']:
Is t true (i.e., not an empty string)? Yes, "Go." is not empty.
Is t true? Yes, "<eos>" is not empty.
Since both are true, the list comprehension results in: ['Go.', '<eos>'].
src.append(...): Finally, this resulting list of tokens (['Go.', '<eos>']) is appended as a single item to the src list, which accumulates all the tokenized source sentences.

In the tokenization step, this turns each English sentence from parts[0] into a list of non-empty tokens ending with <eos>, which marks the end of the sentence for the machine translation model. The next line applies the same processing to the French sentence in parts[1] and appends the result to tgt.
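
A short sketch of why the if t filter matters: split(' ') produces empty strings wherever spaces are doubled or sit at the start or end of the string. The helper tokenize_part is a hypothetical wrapper around the same expression, and the inputs with extra spaces are made up for illustration.

```python
def tokenize_part(part):
    # Same expression as in _tokenize, applied to a single sentence.
    return [t for t in f'{part} <eos>'.split(' ') if t]

raw = tokenize_part('Go.')           # unprocessed sentence from the trace above
clean = tokenize_part('go .')        # sentence after preprocessing
unfiltered = 'go  .'.split(' ')      # double space yields an empty string
messy = tokenize_part(' go  . ')     # the filter drops every empty string

print(raw)         # ['Go.', '<eos>']
print(clean)       # ['go', '.', '<eos>']
print(unfiltered)  # ['go', '', '.']
print(messy)       # ['go', '.', '<eos>']
```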

In [11]:
import os
import urllib.request
import zipfile

class MTFraEng:
    """The English-French dataset."""
    def __init__(self, root='./data'):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _download(self):
        data_url = 'http://d2l-data.s3-accelerate.amazonaws.com/fra-eng.zip'
        zip_file_path = os.path.join(self.root, 'fra-eng.zip')

        # Download the file
        if not os.path.exists(zip_file_path):
            print(f'Downloading {data_url} to {zip_file_path}')
            urllib.request.urlretrieve(data_url, zip_file_path)

        # Extract the file
        extract_path = os.path.join(self.root, 'fra-eng')
        if not os.path.exists(extract_path):
            print(f'Extracting {zip_file_path} to {extract_path}')
            with zipfile.ZipFile(zip_file_path, 'r') as zf:
                zf.extractall(self.root)

        with open(os.path.join(extract_path, 'fra.txt'), encoding='utf-8') as f:
            return f.read()

    def _preprocess(self, text):
        # Replace non-breaking space with space
        text = text.replace('\u202f', ' ').replace('\xa0', ' ')
        # Insert space between words and punctuation marks
        no_space = lambda char, prev_char: char in ',.!?' and prev_char != ' '
        out = [' ' + char if i > 0 and no_space(char, text[i - 1]) else char
               for i, char in enumerate(text.lower())]
        return ''.join(out)

    def _tokenize(self, text, max_examples=None):
        src, tgt = [], []
        for i, line in enumerate(text.split('\n')):
            # Stop once max_examples lines have been read
            if max_examples is not None and i >= max_examples:
                break
            parts = line.split('\t')
            if len(parts) == 2:
                # Skip empty tokens
                src.append([t for t in f'{parts[0]} <eos>'.split(' ') if t])
                tgt.append([t for t in f'{parts[1]} <eos>'.split(' ') if t])
        return src, tgt

In [16]:
data = MTFraEng()
raw_text = data._download()
print(raw_text[:75])

Go.	Va !
Hi.	Salut !
Run!	Cours !
Run!	Courez !
Who?	Qui ?
Wow!	Ça alors !



In [14]:
text = data._preprocess(raw_text)
print(text[:80])

go .	va !
hi .	salut !
run !	cours !
run !	courez !
who ?	qui ?
wow !	ça alors !


In [17]:
src, tgt = data._tokenize(text)
src[:6], tgt[:6]

([['go', '.', '<eos>'],
  ['hi', '.', '<eos>'],
  ['run', '!', '<eos>'],
  ['run', '!', '<eos>'],
  ['who', '?', '<eos>'],
  ['wow', '!', '<eos>']],
 [['va', '!', '<eos>'],
  ['salut', '!', '<eos>'],
  ['cours', '!', '<eos>'],
  ['courez', '!', '<eos>'],
  ['qui', '?', '<eos>'],
  ['ça', 'alors', '!', '<eos>']])