<a href="https://colab.research.google.com/github/DavoodSZ1993/Dive-into-Deep-Learning-Notes-/blob/main/09_RNNs_notes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install d2l==1.0.0-alpha1.post0 --quiet


## 9.1 Working with Sequences

## 9.2 Converting Raw Text into Sequence Data

In [None]:
import random
import re
import torch
from d2l import torch as d2l



* Python `random` module: can be used to generate pseudo-random numbers.
* Python `re` module: provides regular expressions. A regular expression (regex) is a pattern that specifies a set of matching strings.
* `re.sub(pattern, repl, string)`: returns a string in which every non-overlapping match of `pattern` is replaced by `repl`.

In [None]:
my_string = 'My name is Davood'

my_string1 = re.sub('v', 'b', my_string)

print(my_string)
print(my_string1)

My name is Davood
My name is Dabood


In [None]:
class TimeMachine(d2l.DataModule):
  def _download(self):
    fname = d2l.download(d2l.DATA_URL + 'timemachine.txt', self.root,
                         '090b5e7e70c295757f55df93cb0a180b9691891a')
    with open(fname) as f:
      return f.read()

In [None]:
data = TimeMachine()
raw_text = data._download()
raw_text[:60]

Downloading ../data/timemachine.txt from http://d2l-data.s3-accelerate.amazonaws.com/timemachine.txt...


'The Time Machine, by H. G. Wells [1898]\n\n\n\n\nI\n\n\nThe Time Tra'

In [None]:
@d2l.add_to_class(TimeMachine)
def _preprocess(self, text):
  return re.sub('[^A-Za-z]+', ' ', text).lower()  # replace each run of non-letter characters with one space, then lowercase

In [None]:
text = data._preprocess(raw_text)
text[0:60]

'the time machine by h g wells i the time traveller for so it'
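To see what the pattern does on a small input, the sketch below applies the same substitution to a made-up sentence. The string `sample` is an assumption chosen for illustration and is not part of the dataset.

```python
import re

sample = "Hello, World! It's 2024."

# Each run of characters that is not an ASCII letter becomes a single space.
cleaned = re.sub('[^A-Za-z]+', ' ', sample).lower()
print(repr(cleaned))
```

Note that the apostrophe splits `It's` into `it s`, and the trailing digits and period collapse into one trailing space.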

In [None]:
list(text[0:10])  # creates a list of the given string

['t', 'h', 'e', ' ', 't', 'i', 'm', 'e', ' ', 'm']

* Python `str.join()`: a built-in string method that concatenates the elements of an iterable of strings, placing the separator string it is called on between them.

In [None]:
my_list = ['Davood', 'Soleymanzadeh']

' '.join(my_list)  # joins the elements of my_list, separated by a space

'Davood Soleymanzadeh'

#### Nested List Comprehension in Python
* A list comprehension builds a list by iterating over an iterable object in a single expression.
* A nested list comprehension contains more than one `for` clause (or a comprehension inside another) and behaves like nested `for` loops.

In [None]:
# Flattening a 2-D list

my_list = [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]

flatten_list = [val for sublist in my_list for val in sublist]
flatten_list

In [None]:
@d2l.add_to_class(TimeMachine)
def _tokenize(self, text):
  return list(text)

In [None]:
tokens = data._tokenize(text)

# When tokens is a nested list, the following will flatten the list
if tokens and isinstance(tokens[0], list):
  tokens = [token for line in tokens for token in line]

* `Counter` class in the `collections` module: a counter is a container that stores elements as **dictionary keys** and their counts as **dictionary values**.


In [None]:
import collections

my_name = 'Davood Soleymanzadeh'

my_counter = collections.Counter(my_name)
my_counter

In [None]:
counter = collections.Counter(tokens)
counter

* Python built-in `sorted()` function: returns a new sorted list built from the items of an iterable.

In [None]:
my_list = [4, 1, 3, 2]

print(sorted(my_list))
print(sorted(my_list, reverse=True))

[1, 2, 3, 4]
[4, 3, 2, 1]


* The optional `key` argument of `sorted()` is a function applied to each element; its return value is what gets compared during sorting.

In [None]:
my_dict = {'Akbar': 35, 'Mohsen': 32, 'Ahmad': 29, 'Davood': 24}

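print(sorted(my_dict))   # iterating a dict yields its keys, so this sorts the key strings
print(sorted(my_dict, key=lambda x: x[0])) # sorted by the first character of each key
print(sorted(my_dict, key=lambda x: x[1])) # sorted by the second character of each key, not by the dict values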

['Ahmad', 'Akbar', 'Davood', 'Mohsen']
['Akbar', 'Ahmad', 'Davood', 'Mohsen']
['Davood', 'Ahmad', 'Akbar', 'Mohsen']
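Because `sorted()` iterates over a dictionary's keys, the lambdas above index into each key string. To order by the stored values instead, one option is to sort the `(key, value)` pairs from `items()`. A minimal sketch, using an illustrative dictionary named `ages`:

```python
ages = {'Akbar': 35, 'Mohsen': 32, 'Ahmad': 29, 'Davood': 24}

# Sort the (key, value) pairs by the value in position 1.
by_value = sorted(ages.items(), key=lambda item: item[1])
print(by_value)

# Alternatively, sort the keys and use the dict lookup as the sort key.
names_by_age = sorted(ages, key=ages.get)
print(names_by_age)
```

The first form is the one used for `counter.items()` in the next cell.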


In [None]:
token_freqs = sorted(counter.items(), key=lambda x: x[1],    # sort (token, count) pairs by count, most frequent first
                              reverse=True)
token_freqs
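As a side note, `Counter.most_common()` called without an argument should give the same ordering as the explicit `sorted()` call. The sketch below checks this on a toy string; `sample_counter` is an illustrative name, not a variable from the notebook.

```python
import collections

sample_counter = collections.Counter('mississippi')

# Explicit sort by count, descending (ties keep first-seen order).
by_sorted = sorted(sample_counter.items(), key=lambda x: x[1], reverse=True)

# Built-in shortcut for the same ordering.
by_most_common = sample_counter.most_common()
print(by_most_common)
```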

* `set()` in Python: A set is an unordered collection data type that is iterable, mutable, and has no duplicate elements. Since sets are unordered, we cannot access items using indexes as we do in lists.

In [None]:
set(['<unk>'] + [token for token, freq in token_freqs])

In [None]:
idx_to_token = list(sorted(set(['<unk>'] + [token for token, _ in token_freqs])))
idx_to_token, len(idx_to_token)

In [None]:
token_to_idx = {token: idx for idx, token in enumerate(idx_to_token)}
token_to_idx

* Python dictionary `get()` method: returns the value for the given key if it is present in the dictionary. Otherwise it returns `None`, or the default passed as the optional second argument.

In [None]:
my_dict = {'Akbar': 35, 'Mohsen': 32, 'Ahmad': 29, 'Davood': 24}

print(my_dict.get('Davood'))

24
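The default argument of `get()` is what makes an unknown-token fallback convenient. The following is a minimal sketch with a hand-written toy mapping (`toy_token_to_idx` is an assumption for illustration, not the mapping built above):

```python
toy_token_to_idx = {'<unk>': 0, 'a': 1, 'b': 2}

# A missing key returns None when no default is given.
missing = toy_token_to_idx.get('z')
print(missing)

# With a default, unknown characters map to the index of '<unk>'.
unk_idx = toy_token_to_idx['<unk>']
indices = [toy_token_to_idx.get(ch, unk_idx) for ch in 'abz']
print(indices)
```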


* Python `__len__()` magic method: implements the built-in `len()` function for instances of a class.

In [None]:
class Person:
  def __init__(self, name: str, age: int):
    self.name = name
    self.age = age

  def __len__(self):           # whenever len() is called on a Person object, it will return the age!!
    return self.age

person1 = Person('Davood', 29)
len(person1)

29

* Python `__getitem__()` magic method: when defined in a class, it lets instances use the `[]` (indexing) operator.

In [None]:
class Student:
  def __init__(self, name: str, scores=None):
    self.name = name
    self.scores = scores

  def __getitem__(self, key):
    return self.scores[key]

student1 = Student('Davood', [100, 90, 85])
print(student1[0])
print(student1[1])
print(student1[2])
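The two magic methods combine naturally in a vocabulary object that reports its size through `len()` and maps tokens to indices through `[]`. The class below is a hypothetical, simplified sketch (`ToyVocab` is an invented name) and is not the d2l implementation:

```python
class ToyVocab:
  """Hypothetical minimal vocabulary for illustration only."""
  def __init__(self, tokens):
    # Unique tokens plus '<unk>', in sorted order.
    self.idx_to_token = sorted(set(['<unk>'] + list(tokens)))
    self.token_to_idx = {tok: i for i, tok in enumerate(self.idx_to_token)}

  def __len__(self):
    # len(vocab) returns the number of distinct tokens.
    return len(self.idx_to_token)

  def __getitem__(self, token):
    # vocab[token] returns the index, falling back to '<unk>'.
    return self.token_to_idx.get(token, self.token_to_idx['<unk>'])

toy_vocab = ToyVocab('hello')
print(len(toy_vocab), toy_vocab['h'], toy_vocab['z'])
```

Here `'<unk>'` sorts before the lowercase letters, so it receives index 0 and unseen tokens such as `'z'` map to 0.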

* Python string `split()` method: splits a string into a list of substrings at the specified separator (any run of whitespace by default).

In [None]:
my_string = 'Davood Soleymanzadeh'

my_list = my_string.split()
my_list[1:], my_list[:-1]

(['Soleymanzadeh'], ['Davood'])

In [None]:
words = text.split()  # list of the words in text, split on whitespace

In [None]:
bigram_tokens = ['--'.join(pair) for pair in zip(words[:-1], words[1:])]  # This will join words that are next to each other.
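The pairing done by `zip(words[:-1], words[1:])` is easier to follow on a short list. A small sketch, with `toy_words` as an illustrative stand-in for the full word list:

```python
toy_words = ['the', 'time', 'machine', 'by']

# zip pairs each word with its successor; the last word has no successor,
# so n words produce n - 1 pairs.
pairs = list(zip(toy_words[:-1], toy_words[1:]))
toy_bigrams = ['--'.join(pair) for pair in pairs]
print(toy_bigrams)
```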


## 9.3 Language Models

### Understanding Slicing

* `a[start:stop]`: Items from `start` through `stop - 1`.
* `a[start:]`: Items from `start` through the end of the array.
* `a[:stop]`: Items from the beginning through `stop - 1`.
* `a[:]`: A copy of the whole array.
* `a[start:stop:step]`: Start through not past stop, by step. step can be a negative number.

* `a[-1]`: Last item in the array.
* `a[-2:]`: Last two items in the array.
* `a[:-2]`: Everything except the last two items.
* `a[::-1]`: All items in the array, reversed.
* `a[1::-1]`: The first two items reversed.
* `a[:-3:-1]`: The last two items reversed.
* `a[-3::-1]`: Everything except the last two items, reversed.
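The slicing forms listed above can be checked on a short list. The sketch below uses a six-element list named `nums` (an illustrative name) in place of `a`:

```python
nums = [0, 1, 2, 3, 4, 5]

print(nums[1:4])     # items 1 through 3
print(nums[-2:])     # last two items
print(nums[:-2])     # everything except the last two items
print(nums[::-1])    # all items, reversed
print(nums[1::-1])   # first two items, reversed
print(nums[:-3:-1])  # last two items, reversed
print(nums[-3::-1])  # everything except the last two items, reversed
```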

A slice object can represent a slicing operation.

* `a[slice(start, stop, step)]` is equivalent to `a[start:stop:step]`.

In [2]:
num_train = 10

idx = slice(0, num_train)
idx

slice(0, 10, None)

In [4]:
num_val = 5

idx1 = slice(num_train, num_train + num_val)
idx1

slice(10, 15, None)
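Slice objects like the two above can be stored and applied later to any sequence, which is one way to carve out training and validation portions. A minimal sketch, assuming a toy list `samples` of 20 items (the list and the variable names are illustrative):

```python
num_train, num_val = 10, 5
samples = list(range(20))

train_idx = slice(0, num_train)
val_idx = slice(num_train, num_train + num_val)

# Indexing with a slice object behaves like samples[start:stop].
train_part = samples[train_idx]
val_part = samples[val_idx]
print(len(train_part), len(val_part))

# slice.indices(length) resolves None and bounds into (start, stop, step).
print(train_idx.indices(len(samples)))
```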