# Day 22 - Real-world data representation using tensors

## Working with time series

### Shaping the data by time period

* We can reshape this data into $N$ (days) collections of $C$ (columns) of length $L$ (hours)
* Here, $C$ represents our different variables

In [1]:
import numpy as np
import torch

bikes_numpy = np.loadtxt(
    "./DLPT/data/bike-sharing-dataset/hour-fixed.csv",
    dtype=np.float32,
    delimiter=",",
    skiprows=1,
    converters={1: lambda x: float(x[8:10])}
)
bikes = torch.from_numpy(bikes_numpy)

bikes

tensor([[1.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 3.0000e+00, 1.3000e+01,
         1.6000e+01],
        [2.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 8.0000e+00, 3.2000e+01,
         4.0000e+01],
        [3.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 5.0000e+00, 2.7000e+01,
         3.2000e+01],
        ...,
        [1.7377e+04, 3.1000e+01, 1.0000e+00,  ..., 7.0000e+00, 8.3000e+01,
         9.0000e+01],
        [1.7378e+04, 3.1000e+01, 1.0000e+00,  ..., 1.3000e+01, 4.8000e+01,
         6.1000e+01],
        [1.7379e+04, 3.1000e+01, 1.0000e+00,  ..., 1.2000e+01, 3.7000e+01,
         4.9000e+01]])

* To reshape our data, we just have to get a new view over it

In [2]:
bikes.shape, bikes.stride()

(torch.Size([17520, 17]), (17, 1))

* If we want to reshape this into 24-hour chunks, then we need the stride along the $N$ dimension to be $24\times17=408$

In [3]:
daily_bikes = bikes.view(-1, 24, bikes.shape[1])

daily_bikes.shape, daily_bikes.stride()

(torch.Size([730, 24, 17]), (408, 17, 1))

* The layout we want is $N \times C \times L$, so we now have to transpose the last two dimensions (hours and columns)

In [4]:
daily_bikes = daily_bikes.transpose(1, 2)

daily_bikes.shape, daily_bikes.stride()

(torch.Size([730, 17, 24]), (408, 1, 17))
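
The same reshaping can be checked on a small synthetic tensor, where the strides are easy to verify by hand. This is only a sketch: `toy`, `toy_daily`, and `toy_ncl` are hypothetical names, and the sizes (3 days, 5 columns) are chosen for illustration rather than taken from the bike dataset.

```python
import torch

# Hypothetical stand-in for the bike data: 3 days x 24 hours, 5 columns.
days, hours, columns = 3, 24, 5
toy = torch.arange(days * hours * columns, dtype=torch.float32).view(days * hours, columns)

# Splitting the row dimension into (day, hour) only changes shape and stride.
toy_daily = toy.view(-1, hours, columns)
print(toy_daily.shape, toy_daily.stride())

# Transposing gives N x C x L without copying; the result is non-contiguous.
toy_ncl = toy_daily.transpose(1, 2)
print(toy_ncl.shape, toy_ncl.stride(), toy_ncl.is_contiguous())

# Element (day, column, hour) of the new view is row day * 24 + hour of the original.
day, column, hour = 1, 2, 3
same_value = toy_ncl[day, column, hour] == toy[day * hours + hour, column]
```

Note that `view` only works here because the number of rows is an exact multiple of 24.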

### Ready for training

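* The `weathersit` variable is ordinal, with four levels from 1 (good weather) to 4 (bad weather). If we would rather treat it as categorical, with the levels acting as labels that have no meaningful ordering, we should turn it into a one-hot representation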
* Let's initialize a tensor to represent the first day's weather situation at each hour

In [5]:
first_day = bikes[:24].long()
weather_onehot = torch.zeros(first_day.shape[0], 4)

first_day[:, 9] # 9 is the index of the `weathersit` variable

tensor([1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 2, 2, 2, 2])

* Now we `scatter` these indices into our matrix

In [6]:
weather_onehot.scatter_(
    dim=1,
    index=first_day[:, 9].unsqueeze(1).long() - 1,
    value=1.0
)

tensor([[1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 0., 1., 0.],
        [0., 0., 1., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.]])
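
As a cross-check, the `scatter_` approach can be compared with `torch.nn.functional.one_hot` on a few made-up values. This is a sketch under the assumption that the levels run from 1 to 4; the `levels` tensor is hypothetical.

```python
import torch
import torch.nn.functional as F

# Hypothetical weather levels in 1..4, one per hour.
levels = torch.tensor([1, 1, 2, 3, 2, 4])

# scatter_ writes a 1.0 at column (level - 1) of each row.
onehot_scatter = torch.zeros(levels.shape[0], 4)
onehot_scatter.scatter_(dim=1, index=(levels - 1).unsqueeze(1), value=1.0)

# F.one_hot builds the same matrix as integers; cast to float to match.
onehot_fn = F.one_hot(levels - 1, num_classes=4).float()

print(torch.equal(onehot_scatter, onehot_fn))
```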

* We can now con`cat`enate this data to the first 24 hours of the bike data matrix

In [7]:
torch.cat((bikes[:24], weather_onehot), 1)[:1]

tensor([[ 1.0000,  1.0000,  1.0000,  0.0000,  1.0000,  0.0000,  0.0000,  6.0000,
          0.0000,  1.0000,  0.2400,  0.2879,  0.8100,  0.0000,  3.0000, 13.0000,
         16.0000,  1.0000,  0.0000,  0.0000,  0.0000]])

* The final four columns are now the one-hot representation of `weathersit`
* Let's apply this to our `daily_bikes`

In [8]:
daily_weather_onehot = torch.zeros(daily_bikes.shape[0], 4,
                                   daily_bikes.shape[2])

daily_weather_onehot.shape

torch.Size([730, 4, 24])

In [9]:
daily_weather_onehot.scatter_(
    dim=1,
    index=daily_bikes[:, 9, :].unsqueeze(1).long() - 1,
    value=1.0,
)
daily_weather_onehot.shape

torch.Size([730, 4, 24])
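
The three-dimensional case can be sketched on a tiny batch to show why `unsqueeze(1)` is needed: the index tensor must have the same number of dimensions as the target, with size 1 along the dimension being scattered into. The values below are hypothetical.

```python
import torch

# Hypothetical batch: 2 days, 3 hours each, weather levels in 1..4.
levels = torch.tensor([[1, 2, 4],
                       [3, 3, 1]])               # shape (N=2, L=3)

# Add a channel dimension of size 1 so the index lines up with N x C x L.
index = (levels - 1).unsqueeze(1)                # shape (2, 1, 3)
onehot = torch.zeros(levels.shape[0], 4, levels.shape[1])
onehot.scatter_(dim=1, index=index, value=1.0)   # shape (2, 4, 3)

# Taking argmax over the channel dimension recovers the original levels.
recovered = onehot.argmax(dim=1) + 1
print(torch.equal(recovered, levels))
```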

In [10]:
daily_bikes = torch.cat((daily_bikes, daily_weather_onehot), 1)

* An alternative to this one-hot representation is to pretend it's a continuous variable, which goes up as the weather worsens
* We can transform it into a float ranging from 0 to 1

In [11]:
daily_bikes[:, 9, :] = (daily_bikes[:, 9, :] - 1.0) / 3.0

* Aside from this simple method of mapping a variable onto the range from 0 to 1, we could also subtract its mean and divide by its standard deviation

In [12]:
temp = daily_bikes[:, 10, :]
temp_mean = torch.mean(temp)
temp_std = torch.std(temp)

daily_bikes[:, 10, :] = (temp - temp_mean) / temp_std

* This variable will then have zero mean and unit standard deviation
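
Both rescaling options can be tried side by side on random data. This is a minimal sketch: `data` is a hypothetical $N \times C \times L$ tensor standing in for `daily_bikes`, and the channel index is arbitrary.

```python
import torch

torch.manual_seed(0)
# Hypothetical N x C x L tensor standing in for daily_bikes.
data = torch.rand(10, 3, 24) * 40.0

channel = data[:, 1, :]

# Min-max scaling maps the channel onto [0, 1].
scaled = (channel - channel.min()) / (channel.max() - channel.min())

# Standardization subtracts the mean and divides by the standard deviation.
standardized = (channel - channel.mean()) / channel.std()

print(scaled.min().item(), scaled.max().item())
print(standardized.mean().item(), standardized.std().item())
```

Min-max scaling depends only on the two extreme values, so a single outlier can squeeze everything else into a narrow band; standardization is less sensitive to that.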

## Representing text

### Converting text to numbers

* Two great sources of text data are [Project Gutenberg](https://gutenberg.org) and [English Corpora](https://english-corpora.org)
* There's even a Wikipedia corpus available
* For now, let's get started with Jane Austen's [Pride and Prejudice](http://www.gutenberg.org/files/1342/1342-0.txt)

In [13]:
data_path = "./DLPT/data/text/pride_and_prejudice.txt"
with open(data_path, "r", encoding="utf8") as f:
    text = f.read()

### One-hot-encoding characters

* First, we split the text into lines, and pick an arbitrary line to focus on

In [14]:
lines = text.split('\n')
line = lines[855]
line

'“Impossible, Mr. Bennet, impossible, when I am not acquainted with him'

* Let's create a tensor that can hold the one-hot encoding of each character of the line

In [15]:
letter_t = torch.zeros(len(line), 128)
letter_t.shape

torch.Size([70, 128])

In [16]:
for i, letter in enumerate(line.lower().strip()):
    # Non-ASCII characters, such as the curly quotes, are mapped to index 0
    letter_index = ord(letter) if ord(letter) < 128 else 0
    letter_t[i][letter_index] = 1
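
The same encoding can be written without an explicit loop over tensor rows, and decoded again with `argmax` as a sanity check. This is a sketch on a short, hypothetical ASCII-only sample, so nothing falls back to index 0.

```python
import torch

sample = "Impossible, Mr. Bennet"   # hypothetical ASCII-only sample

# Map each character to its ASCII code; anything outside ASCII goes to index 0.
codes = torch.tensor([ord(c) if ord(c) < 128 else 0 for c in sample.lower()])

# Set one entry per row using integer-array indexing.
sample_t = torch.zeros(len(codes), 128)
sample_t[torch.arange(len(codes)), codes] = 1.0

# argmax along the character dimension recovers the codes, and hence the text.
decoded = "".join(chr(i) for i in sample_t.argmax(dim=1).tolist())
print(decoded)
```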

### One-hot encoding whole words

* To one-hot encode whole words, we first need to collect our vocabulary

In [17]:
def clean_words(input_str):
    punctuation = '.,;:"?!”“_-'
    word_list = input_str.lower().replace("\n", " ").split()
    return [word.strip(punctuation) for word in word_list]

In [18]:
words_in_line = clean_words(line)
line, words_in_line

('“Impossible, Mr. Bennet, impossible, when I am not acquainted with him',
 ['impossible',
  'mr',
  'bennet',
  'impossible',
  'when',
  'i',
  'am',
  'not',
  'acquainted',
  'with',
  'him'])

* We can now create a vocabulary of our whole text, and assign each word to an index

In [19]:
word_list = sorted(set(clean_words(text)))
word2index_dict = {word: i for i, word in enumerate(word_list)}

len(word2index_dict), word2index_dict["impossible"]

(7465, 3455)

* We can now one-hot encode our line, using this `word2index_dict`

In [20]:
word_t = torch.zeros(len(words_in_line), len(word2index_dict))
for i, word in enumerate(words_in_line):
    word_index = word2index_dict[word]
    word_t[i][word_index] = 1
    print(f"{i:2} {word_index:4} {word}")

word_t.shape

 0 3455 impossible
 1 4394 mr
 2  807 bennet
 3 3455 impossible
 4 7221 when
 5 3370 i
 6  408 am
 7 4529 not
 8  222 acquainted
 9 7291 with
10 3273 him


torch.Size([11, 7465])

* One intermediate representation between encoding characters and whole words is called *byte pair encoding*
* This starts with a dictionary of individual letters, and then repeatedly adds the most frequent pair of adjacent items as a new entry until it reaches the prescribed dictionary size
* This may lead to a tokenization of our sentence that looks like this:

      ▁Im|pos|s|ible|,|▁Mr|.|▁B|en|net|,|▁impossible|,|▁when|▁I|▁am|▁not|▁acquainted|▁with|▁him
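
The merge loop described above can be sketched in a few lines. This is a toy illustration, not a production tokenizer: the helper names `most_common_pair` and `merge_pair`, the four-word corpus, and the budget of four merges are all assumptions made for the example.

```python
from collections import Counter

def most_common_pair(tokenized_words):
    """Count adjacent symbol pairs across all words and return the most frequent."""
    pairs = Counter()
    for symbols in tokenized_words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in `symbols` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

# Start from individual letters (hypothetical toy corpus).
words = [list(w) for w in ["impossible", "possible", "impossible", "possibly"]]
vocab = {s for w in words for s in w}
target_vocab_size = len(vocab) + 4   # hypothetical budget: four merges

while len(vocab) < target_vocab_size:
    pair = most_common_pair(words)
    if pair is None:                 # nothing left to merge
        break
    words = [merge_pair(w, pair) for w in words]
    vocab.add(pair[0] + pair[1])

print(["|".join(w) for w in words])
```

Real implementations also handle word boundaries (the `▁` marker above) and work on much larger corpora, which this sketch leaves out.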

### Text embeddings

* These one-hot encodings quickly become unwieldy for large vocabularies
* It would be great if we could compress them
* To do so, we could turn them from thousands of zeros and a single one into a couple hundred floating point numbers
* This is called an embedding, and useful ways of embedding similar words near each other can be learned by neural networks
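
In PyTorch, one way to hold such a lookup table is `torch.nn.Embedding`. The sketch below assumes the vocabulary size from the word index built earlier and an arbitrary embedding dimension of 100; the weights are randomly initialized, so the vectors carry no meaning until they are trained.

```python
import torch
import torch.nn as nn

vocab_size = 7465       # size of the word index built earlier
embedding_dim = 100     # arbitrary choice for illustration

embedding = nn.Embedding(vocab_size, embedding_dim)

# Indices of "impossible", "mr", and "bennet" from the listing above.
word_indices = torch.tensor([3455, 4394, 807])
embedded = embedding(word_indices)               # shape (3, 100)

# Looking up a row is equivalent to multiplying a one-hot vector by the weight matrix.
onehot = torch.zeros(len(word_indices), vocab_size)
onehot[torch.arange(len(word_indices)), word_indices] = 1.0
via_matmul = onehot @ embedding.weight

print(embedded.shape, torch.allclose(embedded, via_matmul))
```

The lookup avoids ever materializing the one-hot matrix, which is what makes embeddings practical for large vocabularies.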

### Text embeddings as a blueprint

* Embeddings are useful as soon as one-hot encoding becomes too cumbersome
* This can be the case even for non-textual, categorical data
* It is common to keep improving pre-learned embeddings while solving the problem at hand
* The techniques developed for natural language processing can often serve as blueprints for processing other sequential data, just as embeddings are also used for non-textual data

### Exercises

1. Take several pictures of red, blue, and green items with your phone or other digital camera (or download some from the internet, if a camera isn’t available).
    1. Load each image, and convert it to a tensor.
    1. For each image tensor, use the `.mean()` method to get a sense of how bright the image is.
    1. Take the mean of each channel of your images. Can you identify the red, green, and blue items from only the channel averages?

No.

2. Select a relatively large file containing Python source code.
    1. Build an index of all the words in the source file (feel free to make your tokenization as simple or as complex as you like; we suggest starting with replacing `r"[^a-zA-Z0-9_]+"` with spaces).
    1. Compare your index with the one we made for Pride and Prejudice. Which is larger?
    1. Create the one-hot encoding for the source code file.
    1. What information is lost with this encoding? How does that information compare to what’s lost in the Pride and Prejudice encoding?

No.