## Work with data

- Natural Language Processing (Text)
- Computer Vision (Images)
- Speech Processing (Voice)
- Music Processing (Audio)
- Time Series
- Mixed Data

## Natural Language Processing

- Preprocessing
- Tokenization
- Vectorization
- Embeddings

IMDB Dataset: https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews#IMDB%20Dataset.csv

The IMDB dataset contains 50K movie reviews for natural language processing or text analytics.
It is a binary sentiment classification dataset with substantially more data than previous benchmark datasets: 25,000 highly polar movie reviews for training and 25,000 for testing. The task is to predict whether each review is positive or negative, using either classical classification or deep learning algorithms.

### Load data

In [150]:
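import csv

samples = []  # review texts
labels = []   # sentiment labels: "positive" or "negative"
with open('data/imdb/imdb_dataset.csv', newline='', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    next(reader)  # skip the CSV header row
    for row in reader:
        samples.append(row[0])  # first column: review text
        labels.append(row[1])   # second column: sentiment label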

In [151]:
print(len(labels))

999


In [152]:
print(len(samples))

999


In [124]:
print(f"{samples[24]}, {labels[24]}")

This was the worst movie I saw at WorldFest and it also received the least amount of applause afterwards! I can only think it is receiving such recognition based on the amount of known actors in the film. It's great to see J.Beals but she's only in the movie for a few minutes. M.Parker is a much better actress than the part allowed for. The rest of the acting is hard to judge because the movie is so ridiculous and predictable. The main character is totally unsympathetic and therefore a bore to watch. There is no real emotional depth to the story. A movie revolving about an actor who can't get work doesn't feel very original to me. Nor does the development of the cop. It feels like one of many straight-to-video movies I saw back in the 90s ... And not even a good one in those standards.<br /><br />, negative
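
Indexing columns by position works, but reading them by name is less fragile if the column order ever changes. The following is a minimal sketch, assuming the header row names the columns `review` and `sentiment` (adjust the names if your file differs). The inline `csv_text` is made-up data standing in for the real file so that the example runs on its own, and the `demo_` names are illustrative.

```python
import csv
import io

# Made-up two-row stand-in for the real CSV file (header names are an assumption).
csv_text = (
    "review,sentiment\n"
    '"A wonderful, moving film.",positive\n'
    '"Dull plot, ""wooden"" acting.",negative\n'
)

demo_samples = []
demo_labels = []
# DictReader consumes the header row and maps each row to {column name: value}.
reader = csv.DictReader(io.StringIO(csv_text))
for row in reader:
    demo_samples.append(row["review"])
    demo_labels.append(row["sentiment"])

print(demo_samples)
print(demo_labels)
```

With a real file, `io.StringIO(csv_text)` would be replaced by the open file handle.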


### Data preprocessing

In [125]:
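import re

preprocessed_samples = []
for sample in samples:
    s = sample.lower()
    # replace everything except Cyrillic/Latin letters and digits with a space
    # (HTML tags such as <br /> are not removed, so their names survive as words)
    s = re.sub(r"[^а-яА-Яa-zA-Z0-9]", " ", s)
    # collapse runs of whitespace into a single space
    s = re.sub(r"\s+", " ", s)
    s = s.strip()
    preprocessed_samples.append(s)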

In [126]:
print(f"{preprocessed_samples[24]}, {labels[24]}")

this was the worst movie i saw at worldfest and it also received the least amount of applause afterwards i can only think it is receiving such recognition based on the amount of known actors in the film it s great to see j beals but she s only in the movie for a few minutes m parker is a much better actress than the part allowed for the rest of the acting is hard to judge because the movie is so ridiculous and predictable the main character is totally unsympathetic and therefore a bore to watch there is no real emotional depth to the story a movie revolving about an actor who can t get work doesn t feel very original to me nor does the development of the cop it feels like one of many straight to video movies i saw back in the 90s and not even a good one in those standards br br, negative
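
The preprocessed review ends in `br br`: the `<br />` tags lose their angle brackets but keep their name, which then becomes an ordinary token. One possible variant, shown here only as a sketch, strips tags before filtering characters. The tag pattern is a simple regex and not a real HTML parser, and the function name `preprocess` is illustrative.

```python
import re

def preprocess(text):
    """Lowercase, drop HTML tags, keep only letters and digits."""
    s = text.lower()
    s = re.sub(r"<[^>]+>", " ", s)        # remove tags such as <br />
    s = re.sub(r"[^а-яa-z0-9]", " ", s)   # text is already lowercase, so no A-Z ranges
    s = re.sub(r"\s+", " ", s)            # collapse whitespace
    return s.strip()

print(preprocess("Good one.<br /><br />Isn't it?"))
```

Applying this variant would change the vocabulary and every id shown in the later cells, so the outputs in this notebook still reflect the version without tag removal.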


### Tokenization

#### Word level

In [127]:
tokenized_samples = []
for sample in preprocessed_samples:
    # split on whitespace into word tokens
    tokenized_samples.append(sample.split())

In [128]:
print(f"{tokenized_samples[24]}, {labels[24]}")

['this', 'was', 'the', 'worst', 'movie', 'i', 'saw', 'at', 'worldfest', 'and', 'it', 'also', 'received', 'the', 'least', 'amount', 'of', 'applause', 'afterwards', 'i', 'can', 'only', 'think', 'it', 'is', 'receiving', 'such', 'recognition', 'based', 'on', 'the', 'amount', 'of', 'known', 'actors', 'in', 'the', 'film', 'it', 's', 'great', 'to', 'see', 'j', 'beals', 'but', 'she', 's', 'only', 'in', 'the', 'movie', 'for', 'a', 'few', 'minutes', 'm', 'parker', 'is', 'a', 'much', 'better', 'actress', 'than', 'the', 'part', 'allowed', 'for', 'the', 'rest', 'of', 'the', 'acting', 'is', 'hard', 'to', 'judge', 'because', 'the', 'movie', 'is', 'so', 'ridiculous', 'and', 'predictable', 'the', 'main', 'character', 'is', 'totally', 'unsympathetic', 'and', 'therefore', 'a', 'bore', 'to', 'watch', 'there', 'is', 'no', 'real', 'emotional', 'depth', 'to', 'the', 'story', 'a', 'movie', 'revolving', 'about', 'an', 'actor', 'who', 'can', 't', 'get', 'work', 'doesn', 't', 'feel', 'very', 'original', 'to', 'm

#### Char level

In [129]:
tokenized_samples_char = []
for sample in preprocessed_samples:
    # every character, including spaces, becomes a token
    tokenized_samples_char.append(list(sample))

In [130]:
print(f"{tokenized_samples_char[24]}, {labels[24]}")

['t', 'h', 'i', 's', ' ', 'w', 'a', 's', ' ', 't', 'h', 'e', ' ', 'w', 'o', 'r', 's', 't', ' ', 'm', 'o', 'v', 'i', 'e', ' ', 'i', ' ', 's', 'a', 'w', ' ', 'a', 't', ' ', 'w', 'o', 'r', 'l', 'd', 'f', 'e', 's', 't', ' ', 'a', 'n', 'd', ' ', 'i', 't', ' ', 'a', 'l', 's', 'o', ' ', 'r', 'e', 'c', 'e', 'i', 'v', 'e', 'd', ' ', 't', 'h', 'e', ' ', 'l', 'e', 'a', 's', 't', ' ', 'a', 'm', 'o', 'u', 'n', 't', ' ', 'o', 'f', ' ', 'a', 'p', 'p', 'l', 'a', 'u', 's', 'e', ' ', 'a', 'f', 't', 'e', 'r', 'w', 'a', 'r', 'd', 's', ' ', 'i', ' ', 'c', 'a', 'n', ' ', 'o', 'n', 'l', 'y', ' ', 't', 'h', 'i', 'n', 'k', ' ', 'i', 't', ' ', 'i', 's', ' ', 'r', 'e', 'c', 'e', 'i', 'v', 'i', 'n', 'g', ' ', 's', 'u', 'c', 'h', ' ', 'r', 'e', 'c', 'o', 'g', 'n', 'i', 't', 'i', 'o', 'n', ' ', 'b', 'a', 's', 'e', 'd', ' ', 'o', 'n', ' ', 't', 'h', 'e', ' ', 'a', 'm', 'o', 'u', 'n', 't', ' ', 'o', 'f', ' ', 'k', 'n', 'o', 'w', 'n', ' ', 'a', 'c', 't', 'o', 'r', 's', ' ', 'i', 'n', ' ', 't', 'h', 'e', ' ', 'f', 'i',
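
The two granularities trade sequence length against vocabulary size: word-level sequences are shorter but draw on a large vocabulary, while char-level sequences are much longer but use only a handful of symbols. A small sketch on a made-up sentence (not taken from the dataset) makes the difference concrete:

```python
text = "the movie was the worst"

word_tokens = text.split()   # word level
char_tokens = list(text)     # char level, spaces included

# sequence length and number of distinct tokens for each granularity
print(len(word_tokens), len(set(word_tokens)))
print(len(char_tokens), len(set(char_tokens)))
```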

### Vectorization

#### Create vocab

In [131]:
word2id = {}   # token -> id
id2word = []   # id -> token

for sample in tokenized_samples:
    for token in sample:
        # ids are assigned in order of first appearance, starting from 0
        if token not in word2id:
            word2id[token] = len(id2word)
            id2word.append(token)

In [132]:
len(id2word)

17933

In [133]:
print(f"{id2word[666]} - {word2id['credits']}")

credits - 666
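
In this vocabulary, id 0 belongs to an ordinary word (the first token of the first review), and there is no id for words that never appeared. A common convention, which this notebook does not use, is to reserve the first ids for special tokens. The sketch below assumes the names `<pad>` and `<unk>` and a helper called `build_vocab`; all three are illustrative choices.

```python
def build_vocab(tokenized, specials=("<pad>", "<unk>")):
    """Assign ids in order of first appearance, after the reserved special tokens."""
    id2word = list(specials)
    word2id = {token: index for index, token in enumerate(id2word)}
    for sample in tokenized:
        for token in sample:
            if token not in word2id:
                word2id[token] = len(id2word)
                id2word.append(token)
    return word2id, id2word

# Made-up tokenized samples for illustration.
toy_samples = [["one", "of", "the", "best"], ["the", "worst", "one"]]
toy_word2id, toy_id2word = build_vocab(toy_samples)
print(toy_word2id)
```

With this layout, padding can use id 0 without colliding with a real word.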


#### Vectorize

In [134]:
digitized_samples = []
for sample in tokenized_samples:
    # replace each token with its vocabulary id
    digitized_samples.append([word2id[token] for token in sample])

In [135]:
print(f"{digitized_samples[24]}, {labels[24]}")

[22, 34, 2, 649, 357, 117, 141, 317, 1358, 37, 66, 621, 1359, 2, 672, 1360, 1, 1361, 1362, 117, 185, 214, 1098, 66, 23, 1363, 627, 1122, 690, 77, 2, 1360, 1, 1364, 209, 43, 2, 368, 66, 238, 236, 60, 222, 1365, 1366, 146, 326, 238, 214, 43, 2, 357, 51, 49, 512, 894, 930, 1367, 23, 49, 744, 711, 1368, 248, 2, 1369, 1370, 51, 2, 476, 1, 2, 463, 23, 616, 60, 1371, 902, 2, 357, 23, 92, 962, 37, 748, 2, 120, 1002, 23, 379, 1372, 37, 1373, 49, 1374, 60, 400, 348, 23, 57, 377, 1031, 1375, 60, 2, 489, 49, 357, 1376, 33, 80, 1095, 157, 185, 127, 164, 485, 137, 127, 1001, 195, 620, 60, 28, 1377, 923, 2, 1027, 1, 2, 1378, 66, 1379, 376, 0, 1, 98, 1290, 60, 1380, 568, 117, 141, 543, 43, 2, 1381, 37, 48, 289, 49, 464, 0, 43, 687, 1382, 29, 29], negative


In [136]:
print(f"{[id2word[index] for index in digitized_samples[24]]}, {labels[24]}")

['this', 'was', 'the', 'worst', 'movie', 'i', 'saw', 'at', 'worldfest', 'and', 'it', 'also', 'received', 'the', 'least', 'amount', 'of', 'applause', 'afterwards', 'i', 'can', 'only', 'think', 'it', 'is', 'receiving', 'such', 'recognition', 'based', 'on', 'the', 'amount', 'of', 'known', 'actors', 'in', 'the', 'film', 'it', 's', 'great', 'to', 'see', 'j', 'beals', 'but', 'she', 's', 'only', 'in', 'the', 'movie', 'for', 'a', 'few', 'minutes', 'm', 'parker', 'is', 'a', 'much', 'better', 'actress', 'than', 'the', 'part', 'allowed', 'for', 'the', 'rest', 'of', 'the', 'acting', 'is', 'hard', 'to', 'judge', 'because', 'the', 'movie', 'is', 'so', 'ridiculous', 'and', 'predictable', 'the', 'main', 'character', 'is', 'totally', 'unsympathetic', 'and', 'therefore', 'a', 'bore', 'to', 'watch', 'there', 'is', 'no', 'real', 'emotional', 'depth', 'to', 'the', 'story', 'a', 'movie', 'revolving', 'about', 'an', 'actor', 'who', 'can', 't', 'get', 'work', 'doesn', 't', 'feel', 'very', 'original', 'to', 'm
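
The lookup `word2id[token]` raises `KeyError` for any token that was not seen when the vocabulary was built, which matters as soon as the vocabulary comes from training data only and is applied to new text. A hedged sketch of encoding with a fallback id follows; the `<unk>` token, the toy vocabulary, and the helper names `encode` and `decode` are assumptions for illustration.

```python
# Toy vocabulary with reserved ids; contents are made up for illustration.
demo_word2id = {"<pad>": 0, "<unk>": 1, "the": 2, "movie": 3, "was": 4, "great": 5}
demo_id2word = ["<pad>", "<unk>", "the", "movie", "was", "great"]

def encode(tokens, word2id, unk_token="<unk>"):
    """Map tokens to ids, falling back to the unknown-token id."""
    unk_id = word2id[unk_token]
    return [word2id.get(token, unk_id) for token in tokens]

def decode(ids, id2word):
    """Map ids back to tokens."""
    return [id2word[index] for index in ids]

ids = encode(["the", "movie", "was", "awful"], demo_word2id)
print(ids)
print(decode(ids, demo_id2word))
```

Decoding is lossy for unknown words: every out-of-vocabulary token comes back as `<unk>`.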

### Correct length (pad or truncate)

In [137]:
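correct_samples = []
max_len = 64

for sample in digitized_samples:
    # pad short samples with 0 on the right; build a new list so that
    # digitized_samples is not modified in place
    # (note: 0 is also the id of the first vocabulary word)
    padded = sample + [0] * (max_len - len(sample))
    # truncate long samples to max_len
    correct_samples.append(padded[:max_len])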

In [138]:
len(correct_samples[24])

64

In [139]:
print(f"{correct_samples[24]}, {labels[24]}")

[22, 34, 2, 649, 357, 117, 141, 317, 1358, 37, 66, 621, 1359, 2, 672, 1360, 1, 1361, 1362, 117, 185, 214, 1098, 66, 23, 1363, 627, 1122, 690, 77, 2, 1360, 1, 1364, 209, 43, 2, 368, 66, 238, 236, 60, 222, 1365, 1366, 146, 326, 238, 214, 43, 2, 357, 51, 49, 512, 894, 930, 1367, 23, 49, 744, 711, 1368, 248], negative


In [140]:
len(correct_samples[968])

64

In [141]:
print(f"{correct_samples[968]}, {labels[968]}")

[22, 23, 49, 236, 368, 4348, 37, 3576, 2, 466, 23, 2117, 1920, 17635, 464, 485, 60, 2, 4074, 117, 1001, 92, 3704, 51, 17636, 226, 2, 2562, 1, 2563, 46, 14, 54, 117, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], positive
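
The same pad-or-truncate step can be wrapped in a small helper that returns a new list, which makes it easy to reuse on other data. This is a sketch only; `pad_or_truncate` and `pad_id` are illustrative names. Note that with `pad_id=0`, padding shares its id with whichever word received id 0 in the vocabulary.

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Return a new list of exactly max_len ids; the input list is left unchanged."""
    clipped = ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))

short_ids = [5, 6, 7]
long_ids = [1, 2, 3, 4, 5, 6]
print(pad_or_truncate(short_ids, 5))
print(pad_or_truncate(long_ids, 5))
```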


### Prepare labels (digitize)

In [142]:
labels_dict = {"negative": 0, "positive": 1}

correct_labels = [labels_dict[label] for label in labels]

In [143]:
print(f"{correct_samples[24]}, {correct_labels[24]}")

[22, 34, 2, 649, 357, 117, 141, 317, 1358, 37, 66, 621, 1359, 2, 672, 1360, 1, 1361, 1362, 117, 185, 214, 1098, 66, 23, 1363, 627, 1122, 690, 77, 2, 1360, 1, 1364, 209, 43, 2, 368, 66, 238, 236, 60, 222, 1365, 1366, 146, 326, 238, 214, 43, 2, 357, 51, 49, 512, 894, 930, 1367, 23, 49, 744, 711, 1368, 248], 0


In [144]:
print(f"{correct_samples[968]}, {correct_labels[968]}")

[22, 23, 49, 236, 368, 4348, 37, 3576, 2, 466, 23, 2117, 1920, 17635, 464, 485, 60, 2, 4074, 117, 1001, 92, 3704, 51, 17636, 226, 2, 2562, 1, 2563, 46, 14, 54, 117, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 1


### Split data

In [145]:
train_data = correct_samples[:666]
train_labels = correct_labels[:666]
test_data = correct_samples[666:]
test_labels = correct_labels[666:]

In [146]:
len(train_data)

666

In [147]:
len(train_labels)

666

In [148]:
len(test_data)

333

In [149]:
len(test_labels)

333
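
Slicing at a fixed index keeps the file order, so if the CSV happens to be sorted or grouped in some way, the split inherits that ordering. One possible alternative, sketched below on made-up data, shuffles samples and labels together with a fixed seed before cutting. The helper `split_data` and its parameters `train_size` and `seed` are illustrative names.

```python
import random

def split_data(samples, labels, train_size, seed=0):
    """Shuffle sample/label pairs together, then cut after train_size pairs."""
    pairs = list(zip(samples, labels))
    random.Random(seed).shuffle(pairs)   # local RNG: reproducible, no global state
    train, test = pairs[:train_size], pairs[train_size:]
    train_x = [x for x, _ in train]
    train_y = [y for _, y in train]
    test_x = [x for x, _ in test]
    test_y = [y for _, y in test]
    return train_x, train_y, test_x, test_y

# Made-up data: the label is the parity of the single id in each sample.
toy_samples = [[i] for i in range(9)]
toy_labels = [i % 2 for i in range(9)]
train_x, train_y, test_x, test_y = split_data(toy_samples, toy_labels, train_size=6)
print(len(train_x), len(test_x))
```

Because pairs are shuffled as a unit, each sample keeps its own label after the split.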