# Chapter 4 : Real-world data representation using tensors

### 1. Take several pictures of red, blue, and green items with your phone or other digital camera (or download some from the internet, if a camera isn’t available).
* Load each image, and convert it to a tensor.
* For each image tensor, use the `.mean()` method to get a sense of how bright the image is.
* Take the mean of each channel of your images. Can you identify the red, green, and blue items from only the channel averages?

In [1]:
from PIL import Image
main_path = "data/"
redthing = "red_thing.jpg"
greenthing = "green_thing.jpg"
bluething = "blue_thing.jpg"

img_red = Image.open(main_path+redthing)
img_green = Image.open(main_path+greenthing)
img_blue = Image.open(main_path+bluething)

In [2]:
# There are many ways to convert an image into an array or a tensor.
# Here we use torchvision to get familiar with it.
from torchvision import transforms

preprocessor = transforms.ToTensor()
red_t = preprocessor(img_red)
green_t = preprocessor(img_green)
blue_t = preprocessor(img_blue)

print(f"shape red image : {red_t.shape}")
print(f"shape green image : {green_t.shape}")
print(f"shape blue image : {blue_t.shape}")
print("\n")
print(f"mean red image : {red_t.mean():.3f}")
print(f"mean green image : {green_t.mean():.3f}")
print(f"mean blue image : {blue_t.mean():.3f}")

shape red image : torch.Size([3, 1024, 1024])
shape green image : torch.Size([3, 338, 500])
shape blue image : torch.Size([3, 682, 1023])


mean red image : 0.235
mean green image : 0.314
mean blue image : 0.440


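Here we can see that each image tensor has 3 dimensions: the first holds the 3 RGB channels, and the other two are the height and width in pixels.\
Also, the blue image is the brightest overall (mean 0.440) and the red image is the darkest (mean 0.235).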
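As a rough sketch of what `transforms.ToTensor()` does to an 8-bit RGB image, the snippet below builds a tiny synthetic image (an assumption, standing in for the photos), moves the channel axis first, and rescales the values to [0, 1]. The names `img_hwc` and `img_chw` are hypothetical.

```python
import torch

# Hypothetical 2x4 RGB image in the layout PIL/NumPy use: H x W x C, uint8
img_hwc = torch.zeros(2, 4, 3, dtype=torch.uint8)
img_hwc[..., 0] = 255  # fill the red channel only

# Channels first (C x H x W) and values scaled from [0, 255] to [0, 1]
img_chw = img_hwc.permute(2, 0, 1).float() / 255.0

print(img_chw.shape)   # channels, height, width
print(img_chw.mean())  # overall brightness of the image
```

Because the mean is taken over all three channels, a saturated single-color image can still have a low overall mean, which is why overall brightness says little about the color itself.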

In [3]:
### Let's now look into the RGB channel of the red image
print(f"red channel : {red_t[0].mean():.3f}")
print(f"green channel : {red_t[1].mean():.3f}")
print(f"blue channel : {red_t[2].mean():.3f}")

red channel : 0.616
green channel : 0.024
blue channel : 0.064


In [4]:
### Let's now look into the RGB channel of the green image
print(f"red channel : {green_t[0].mean():.3f}")
print(f"green channel : {green_t[1].mean():.3f}")
print(f"blue channel : {green_t[2].mean():.3f}")

red channel : 0.303
green channel : 0.508
blue channel : 0.130


In [5]:
### Let's now look into the RGB channel of the blue image
print(f"red channel : {blue_t[0].mean():.3f}")
print(f"green channel : {blue_t[1].mean():.3f}")
print(f"blue channel : {blue_t[2].mean():.3f}")

red channel : 0.231
green channel : 0.410
blue channel : 0.679


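As expected, each image's highest channel mean matches its color: red for the red item, green for the green item, and blue for the blue item. The channel averages alone are therefore enough to tell the three items apart.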
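The three cells above can be condensed into one call by averaging over the height and width dimensions at once. This is a minimal sketch: `dominant_channel` is a hypothetical helper, and the tensor used here is synthetic rather than one of the photos.

```python
import torch

def dominant_channel(img_t):
    """Return per-channel means and the name of the largest one for a C x H x W tensor."""
    channel_means = img_t.mean(dim=(1, 2))  # average over height and width
    names = ["red", "green", "blue"]
    return channel_means, names[int(channel_means.argmax())]

# Synthetic stand-in for a mostly blue photo (values are made up)
fake_blue = torch.stack([
    torch.full((4, 5), 0.2),  # red channel
    torch.full((4, 5), 0.4),  # green channel
    torch.full((4, 5), 0.7),  # blue channel
])

means, name = dominant_channel(fake_blue)
print(means, name)
```

With the tensors loaded earlier, the same helper could be called as `dominant_channel(red_t)`, `dominant_channel(green_t)`, and `dominant_channel(blue_t)`.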

### 2. Select a relatively large file containing Python source code.
* Build an index of all the words in the source file (feel free to make your tokenization as simple or as complex as you like; we suggest starting with replacing `r"[^a-zA-Z0-9_]+"` with spaces).
* Compare your index with the one we made for Pride and Prejudice. Which is larger?
* Create the one-hot encoding for the source code file.
* What information is lost with this encoding? How does that information compare to what’s lost in the Pride and Prejudice encoding?

In [6]:
import torch

# Loading a large file containing Python source code (2934 lines)
with open(main_path + 'init_python_file.txt', encoding='utf8') as f:
    text = f.read()
    
len(text)

111177

In [7]:
# Let's print a sample of our file
text[:200]

'"""Python Advanced Enumerations & NameTuples"""\n\nimport sys as _sys\npyver = float(\'%s.%s\' % _sys.version_info[:2])\n\ntry:\n    from collections import OrderedDict\nexcept ImportError:\n    OrderedDict = d'

In [8]:
import re

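def preprocessor(input_text):
    """
    Turn a string of characters into a list of lowercase words.
    input : string
    output : list
    """
    # Note: unlike the suggested pattern, this one also treats "_" as a separator
    regex = r"[^a-zA-Z0-9]+"
    # Every run of non-alphanumeric characters (newlines included) becomes a space
    text_clean = re.sub(regex, ' ', input_text)
    text_clean = text_clean.lower()
    word_list = text_clean.split()
    return word_list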

word_list = preprocessor(text)
print(word_list[:50])
print("\nlen : ", len(word_list))

['python', 'advanced', 'enumerations', 'nametuples', 'import', 'sys', 'as', 'sys', 'pyver', 'float', 's', 's', 'sys', 'version', 'info', '2', 'try', 'from', 'collections', 'import', 'ordereddict', 'except', 'importerror', 'ordereddict', 'dict', 'from', 'collections', 'import', 'defaultdict', 'try', 'import', 'sqlite3', 'except', 'importerror', 'sqlite3', 'none', 'if', 'pyver', '3', 'from', 'functools', 'import', 'reduce', 'from', 'operator', 'import', 'or', 'as', 'or', 'and']

len :  12637
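The exercise suggests the pattern `r"[^a-zA-Z0-9_]+"`, while the function above also strips underscores. The sketch below compares both choices on a short sample adapted from the file's first lines. `tokenize` and its `keep_underscore` flag are hypothetical names introduced for illustration.

```python
import re

sample = "from collections import OrderedDict\npyver = float('%s.%s' % _sys.version_info[:2])"

def tokenize(source, keep_underscore=True):
    """Replace runs of separator characters with spaces, lowercase, and split."""
    pattern = r"[^a-zA-Z0-9_]+" if keep_underscore else r"[^a-zA-Z0-9]+"
    return re.sub(pattern, " ", source).lower().split()

print(tokenize(sample, keep_underscore=False))
# [..., 'sys', 'version', 'info', '2']
print(tokenize(sample, keep_underscore=True))
# [..., '_sys', 'version_info', '2']
```

Keeping the underscore preserves Python identifiers such as `version_info` as single tokens, which usually gives a vocabulary closer to the names actually used in the code.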


Now that we've cleaned our text, let's build our own word-level one-hot encoding.\
This means that each word will be represented by a vector whose length equals the number of distinct words in the vocabulary.

In [9]:
word_list = sorted(set(word_list)) # Drop duplicates and sort (word_list now holds the vocabulary)
word2index_dict = {word: i for (i, word) in enumerate(word_list)} # Map each word to an integer index

In [10]:
# Tensor initialization: one row per vocabulary word, one column per word index
word_t = torch.zeros(len(word_list), len(word2index_dict))

# One-hot encoding
for i, word in enumerate(word_list):
    word_index = word2index_dict[word]
    word_t[i, word_index] = 1
    
print(word_t.shape)
print(word_t)

torch.Size([874, 874])
tensor([[1., 0., 0.,  ..., 0., 0., 0.],
        [0., 1., 0.,  ..., 0., 0., 0.],
        [0., 0., 1.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 1., 0., 0.],
        [0., 0., 0.,  ..., 0., 1., 0.],
        [0., 0., 0.,  ..., 0., 0., 1.]])
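The tensor above has one row per vocabulary entry, which is why it comes out as the identity matrix. To encode the file itself, each token of the original sequence (duplicates included) gets its own row. The sketch below shows this on a toy token list, which is an assumption standing in for the real file, using `torch.nn.functional.one_hot`.

```python
import torch
import torch.nn.functional as F

tokens = ["import", "sys", "as", "sys"]  # toy token sequence, duplicates kept
vocab = sorted(set(tokens))               # distinct words, sorted
word2index = {word: i for i, word in enumerate(vocab)}

# One integer index per token, in reading order
indices = torch.tensor([word2index[token] for token in tokens])

# One row per token, one column per vocabulary entry
sequence_t = F.one_hot(indices, num_classes=len(vocab)).float()

print(sequence_t.shape)
print(sequence_t)
```

In this notebook, that would mean keeping the output of `preprocessor(text)` under a separate name before deduplicating it. The resulting tensor would then have 12637 rows and 874 columns.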


Now we can discuss what we just did and the purpose of such an encoding.

First, one-hot word encoding (aka word-level encoding) needs a much larger vocabulary than character-level encoding, because there are far fewer distinct characters than distinct words. Words carry much more meaning than individual characters, but this comes at a price: each one-hot vector is as long as the vocabulary (874 entries here). In addition, a new word that is not in the vocabulary has no index at all, and one-hot vectors do not convey any relationship between two words. As for what is lost, the tokenization above discards case, punctuation, underscores, and whitespace. This matters more for source code than for prose, since indentation, operators, and brackets are part of the syntax.\
In conclusion, one-hot encoding may do the trick for simple text analysis. For more advanced text analysis, there are better ways to represent words and characters that give more compact representations and handle unseen words much better (see *byte pair encoding* or *text embeddings*).
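One common workaround for unseen words, not used in this notebook, is to reserve an index for an "unknown" token. The sketch below illustrates the idea in plain Python. `build_vocab`, `encode`, and the `"<unk>"` token are hypothetical names chosen for illustration.

```python
def build_vocab(tokens, unk_token="<unk>"):
    """Map each distinct token to an index, reserving index 0 for unknown words."""
    vocab = [unk_token] + sorted(set(tokens))
    return {word: i for i, word in enumerate(vocab)}

def encode(tokens, word2index, unk_token="<unk>"):
    """Look up each token, falling back to the unknown index when it is missing."""
    unk_index = word2index[unk_token]
    return [word2index.get(token, unk_index) for token in tokens]

word2index = build_vocab(["import", "sys", "as", "sys"])
print(encode(["import", "os", "as", "sys"], word2index))  # [2, 0, 1, 3]
```

Every unseen word collapses to the same index, so this keeps the encoder from failing but still loses the identity of new words. Subword methods such as byte pair encoding address that limitation.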