<a href="https://colab.research.google.com/github/DavoodSZ1993/Dive-into-Deep-Learning-Notes-/blob/main/09_RNNs_notes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install d2l==1.0.0-alpha1.post0 --quiet


## 9.1 Working with Sequences

## 9.2 Converting Raw Text into Sequence Data

In [3]:
import random
import re
import torch
from d2l import torch as d2l



* Python `random` module: generates pseudo-random numbers.
* Python `re` module: a regular expression (regex) is a pattern that specifies the set of strings matching it.
* `re.sub(pattern, repl, string)`: returns a copy of `string` in which every non-overlapping match of `pattern` is replaced by `repl`.

In [4]:
my_string = 'My name is Davood'

my_string1 = re.sub('v', 'b', my_string)

print(my_string)
print(my_string1)

My name is Davood
My name is Dabood


In [5]:
class TimeMachine(d2l.DataModule):
  def _download(self):
    fname = d2l.download(d2l.DATA_URL + 'timemachine.txt', self.root,
                         '090b5e7e70c295757f55df93cb0a180b9691891a')
    with open(fname) as f:
      return f.read()

In [6]:
data = TimeMachine()
raw_text = data._download()
raw_text[:60]

Downloading ../data/timemachine.txt from http://d2l-data.s3-accelerate.amazonaws.com/timemachine.txt...


'The Time Machine, by H. G. Wells [1898]\n\n\n\n\nI\n\n\nThe Time Tra'

In [8]:
@d2l.add_to_class(TimeMachine)
def _preprocess(self, text):
  return re.sub('[^A-Za-z]+', ' ', text).lower()  # replace each run of non-letter characters with one space, then lowercase

In [9]:
text = data._preprocess(raw_text)
text[0:60]

'the time machine by h g wells i the time traveller for so it'
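
To make the two steps inside `_preprocess` visible, the sketch below applies them separately to the first line of the raw text shown above. The variable names (`sample`, `letters_only`, `cleaned`) are only illustrative.

```python
import re

sample = 'The Time Machine, by H. G. Wells [1898]'

# Step 1: every run of characters outside A-Z and a-z becomes a single space
letters_only = re.sub('[^A-Za-z]+', ' ', sample)

# Step 2: lowercase the result
cleaned = letters_only.lower()

print(repr(letters_only))  # 'The Time Machine by H G Wells '
print(repr(cleaned))       # 'the time machine by h g wells '
```

Note that punctuation, digits, and newlines all collapse into spaces, so the trailing ` [1898]` leaves one trailing space.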

In [10]:
list(text[0:10])  # splits the string into a list of single characters

['t', 'h', 'e', ' ', 't', 'i', 'm', 'e', ' ', 'm']

* Python `str.join()`: a built-in string method that concatenates the elements of an iterable of strings, placing the separator string it is called on between them.

In [11]:
my_list = ['Davood', 'Soleymanzadeh']

' '.join(my_list)  # joins the elements of my_list, separated by a space

'Davood Soleymanzadeh'
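
As a small sketch of how `join()` relates to the character lists used in this section: joining with an empty separator undoes `list()`, and non-string elements have to be converted first. The names `chars`, `rebuilt`, and `numbers` are made up for illustration.

```python
chars = list('the time')     # ['t', 'h', 'e', ' ', 't', 'i', 'm', 'e']
rebuilt = ''.join(chars)     # empty separator: characters are glued back together
print(rebuilt)               # the time

# join() only accepts strings, so convert other types explicitly
numbers = [1, 2, 3]
print(' '.join(str(n) for n in numbers))  # 1 2 3
```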

#### Nested List Comprehension in Python
* A list comprehension builds a list concisely by iterating over an iterable object.
* A nested list comprehension has more than one `for` clause (or a comprehension inside another comprehension), which makes it equivalent to nested `for` loops.

In [None]:
# Flattening a 2-D list

my_list = [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]

flatten_list = [val for sublist in my_list for val in sublist]
flatten_list
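
The equivalence with nested `for` loops can be checked directly. In this sketch the `for` clauses of the comprehension are read left to right, with the leftmost clause acting as the outermost loop; `flat_comprehension` and `flat_loop` are illustrative names.

```python
my_list = [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]

# Comprehension form: outer loop first, inner loop second
flat_comprehension = [val for sublist in my_list for val in sublist]

# The same logic written as explicit nested loops
flat_loop = []
for sublist in my_list:
  for val in sublist:
    flat_loop.append(val)

print(flat_loop == flat_comprehension)  # True
print(flat_loop)                        # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```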

In [14]:
@d2l.add_to_class(TimeMachine)
def _tokenize(self, text):
  return list(text)

In [16]:
tokens = data._tokenize(text)

# When tokens is a nested list, the following will flatten the list
if tokens and isinstance(tokens[0], list):
  tokens = [token for line in tokens for token in line]
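
The flattening step only matters when the tokenizer returns one list per line. The sketch below assumes a hypothetical word-level tokenizer applied line by line (not the character-level `_tokenize` defined above) to show when `tokens` is nested; `lines` and `flatten` are made-up names.

```python
lines = ['the time machine', 'by h g wells']  # hypothetical preprocessed lines

# Word-level tokenization per line gives a nested list
word_tokens = [line.split() for line in lines]

def flatten(tokens):
  # Flatten only if the first element is itself a list
  if tokens and isinstance(tokens[0], list):
    return [token for line in tokens for token in line]
  return tokens

print(flatten(word_tokens))   # ['the', 'time', 'machine', 'by', 'h', 'g', 'wells']
print(flatten(list('ab')))    # ['a', 'b'] (already flat, returned unchanged)
```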

* `Counter` class in the `collections` module: a `Counter` is a `dict` subclass that stores elements as **dictionary keys** and their counts as **dictionary values**.


In [None]:
import collections

my_name = 'Davood Soleymanzadeh'

my_counter = collections.Counter(my_name)
my_counter
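
Because a `Counter` behaves like a dictionary, counts can be looked up by key. A short sketch using the same string: unlike a plain `dict`, a missing key returns 0 instead of raising `KeyError`.

```python
import collections

my_name = 'Davood Soleymanzadeh'
my_counter = collections.Counter(my_name)

print(my_counter['o'])   # 3
print(my_counter['x'])   # 0 (absent elements count as zero)

# The counts add up to the length of the original string
print(sum(my_counter.values()) == len(my_name))  # True
```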

In [None]:
counter = collections.Counter(tokens)
counter

* Python built-in `sorted()` function: returns a new sorted list built from the items of an iterable.

In [21]:
my_list = [4, 1, 3, 2]

print(sorted(my_list))
print(sorted(my_list, reverse=True))

[1, 2, 3, 4]
[4, 3, 2, 1]


* The optional `key` argument of `sorted()` is a function applied to each element; its return value is used as the basis for comparison.

In [25]:
my_dict = {'Akbar': 35, 'Mohsen': 32, 'Ahmad': 29, 'Davood': 24}

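print(sorted(my_dict))   # iterating a dict yields its keys, so the keys are sorted alphabetically
print(sorted(my_dict, key=lambda x: x[0])) # keys sorted by their first character
print(sorted(my_dict, key=lambda x: x[1])) # keys sorted by their second character (not by the dict values)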

['Ahmad', 'Akbar', 'Davood', 'Mohsen']
['Akbar', 'Ahmad', 'Davood', 'Mohsen']
['Davood', 'Ahmad', 'Akbar', 'Mohsen']
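
Iterating over a dictionary yields only its keys, so inside `key=lambda x: x[1]` the argument `x` is a key string and `x[1]` is its second character. To sort by the stored values instead, one option is to sort the `(key, value)` pairs from `items()`, or to use the dictionary's own `get` method as the key function. A minimal sketch:

```python
my_dict = {'Akbar': 35, 'Mohsen': 32, 'Ahmad': 29, 'Davood': 24}

# Sort (key, value) pairs by the value in position 1
pairs_by_value = sorted(my_dict.items(), key=lambda kv: kv[1])
print(pairs_by_value)
# [('Davood', 24), ('Ahmad', 29), ('Mohsen', 32), ('Akbar', 35)]

# Sort only the keys, using each key's value for comparison
keys_by_value = sorted(my_dict, key=my_dict.get)
print(keys_by_value)  # ['Davood', 'Ahmad', 'Mohsen', 'Akbar']
```

The `items()` form is the one that carries over to sorting a `Counter` by frequency.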


In [None]:
# Sort the (token, count) pairs by count, most frequent first
token_freqs = sorted(counter.items(), key=lambda x: x[1], reverse=True)
token_freqs
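
`Counter` also provides `most_common()`, which returns the `(element, count)` pairs ordered from most to least frequent. The sketch below compares it with the explicit `sorted()` call on a small made-up token list (`toy_tokens` is a hypothetical stand-in for `tokens`, not the book's data).

```python
import collections

toy_tokens = list('the time machine')  # hypothetical stand-in for `tokens`
toy_counter = collections.Counter(toy_tokens)

by_sorted = sorted(toy_counter.items(), key=lambda x: x[1], reverse=True)
by_most_common = toy_counter.most_common()

print(by_sorted == by_most_common)  # True
print(by_most_common[0])            # ('e', 3)
```

Either form works; the explicit `sorted()` call makes the sort key visible, while `most_common(n)` is convenient when only the top `n` tokens are needed.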