### PyTorch

PyTorch is a popular open-source machine learning library based on the Torch library. It is accompanied by three domain libraries, torchvision, torchaudio, and torchtext, for computer vision, audio, and text respectively.

It provides two high-level features:

* Tensor computation (like NumPy) with strong GPU acceleration.
* Deep neural networks built on a tape-based autograd system.
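
As a rough illustration of these two features, the sketch below adds two small tensors and then lets autograd compute a derivative. The values and variable names are made up for the example and are not part of the bike-sharing data.

```python
import torch

# Tensor computation: element-wise arithmetic, much like NumPy arrays.
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])
total = a + b

# Autograd: operations on a tensor with requires_grad=True are recorded,
# so backward() can compute the gradient of the result.
x = torch.tensor(3.0, requires_grad=True)
y = x ** 2 + 2 * x  # dy/dx = 2x + 2
y.backward()

print(total)
print(x.grad)
```

Moving the tensors to a GPU (for example with `.to("cuda")`) is only possible when a CUDA device is available, so it is left out here.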

**Topics Covered**

 - <strong>Handling Time Series Data</strong>
 - <strong>Handling Text Data</strong>

### Importing Libraries

In [None]:
import os
import numpy as np

import torch

from PIL import Image
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

### Working on Time Series Data

We use the [Bike sharing](https://www.kaggle.com/c/bike-sharing-demand/data) data from Kaggle, which covers a two-year period. Although the objective of the dataset is to predict the number of bikes rented during each hour, here we only use it to learn how to handle a time series dataset with PyTorch.

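* Load the dataset using NumPy.
* Skip the header row containing the column names.
* Convert the date string in column 1 to a float (characters 8 to 10 of the string, i.e. the day of the month).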

In [None]:
# 2 Years Data
bikes_numpy = np.loadtxt("data/hour-fixed.csv",dtype=np.float32,
              delimiter=",",skiprows=1, converters={1: lambda x: float(x[8:10])})

* Convert the NumPy array to a PyTorch tensor.
* Check the stride, i.e. how many elements must be skipped in storage to move from one record to the next.
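
To make the idea of a stride concrete, here is a minimal sketch on a toy 4 x 3 array (the array and its values are assumptions for illustration only). It also shows that `torch.from_numpy` shares memory with the NumPy array rather than copying it.

```python
import numpy as np
import torch

# Toy data: 4 records with 3 features each.
toy_numpy = np.arange(12, dtype=np.float32).reshape(4, 3)
toy = torch.from_numpy(toy_numpy)

# Stride is counted in elements: 3 steps to reach the next row,
# 1 step to reach the next column.
row_stride, col_stride = toy.stride()
print(toy.shape, toy.stride())

# The tensor and the array share the same buffer, so a change
# made through NumPy is visible in the tensor.
toy_numpy[0, 0] = 100.0
print(toy[0, 0])
```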

In [None]:
print(f'Total Observations: {bikes_numpy.shape} \n')
bikes = torch.from_numpy(bikes_numpy)
print(f'A jump of 17 elements is required to go from one observation to the next (stride of the 1st dimension): {bikes.stride()}')

### Preparing the Dataset into features

* **Reshape the hourly data so that days form the 1st dim, the 24 hours form the 2nd dim, and the 17 features form the 3rd dim. It is quite common to reshape time series data this way to find seasonality or trends.**
* The original data is presented on an hourly basis.
* We have 2 years of data: 730 days, each day has 24 hours, and each hour has 17 columns/features.
* 17520 / (2 * 365 * 24) = 1.0, so the 17520 hourly observations divide exactly into 2 years of 365 days with 24 hours each.
* Each day becomes a record (row) along the first dim, and each record is further split by hour along the second dim.
* Each hourly row in the second dim contains 17 elements.
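
The reshaping described above can be sketched on a much smaller tensor. The sizes below (2 days, 3 features) are assumptions chosen to keep the output readable; the real dataset has 730 days and 17 features.

```python
import torch

n_days, n_hours, n_features = 2, 24, 3

# Toy hourly data: one row per hour, shape (48, 3).
hourly = torch.arange(n_days * n_hours * n_features,
                      dtype=torch.float32).reshape(-1, n_features)

# view() reinterprets the same storage as (days, hours, features);
# -1 lets PyTorch infer the number of days.
daily = hourly.view(-1, n_hours, hourly.shape[1])

print(daily.shape)
print(daily.stride())  # elements to skip per step along each dim
```

Here the first hour of the second day, `daily[1, 0]`, is the same row as `hourly[24]`, since `view` does not move any data.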

In [None]:
daily_bikes = bikes.view(-1, 24, bikes.shape[1])
print(f'Reshaped Time Series Dataset: {daily_bikes.shape} \n')
print(f'Stride for the current data structure: {daily_bikes.stride()}')

Stride of the first dim: 24 * 17 = 408

N (number of days): 730

* Reshaping the data to make it available for training.
* Converting the categorical weather feature into a one-hot encoding.
* The stride is (408, 17, 1): to jump from one day's records (24 x 17 elements) to the next day's, we need to skip 408 elements (feature values).

**Since the prediction target is the number of bikes rented per hour, we transpose the hour and feature dimensions.** Below we perform a series of steps.

- Selecting the first 24 rows (the first day).
- Creating a zero matrix with 24 rows for the weather feature (a categorical variable).
- Creating a one-hot encoded vector for the weather feature.
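
A small sketch of both ideas, transposing and one-hot encoding with `scatter_`, on made-up data. The weather codes below are assumed to run from 1 to 4, as in the cell that follows; the toy shapes are not those of the real dataset.

```python
import torch

# Transpose swaps two dims without copying: only shape and stride change.
daily = torch.zeros(2, 24, 3)        # (days, hours, features)
transposed = daily.transpose(1, 2)   # (days, features, hours)
print(transposed.shape, transposed.stride(), transposed.is_contiguous())

# One-hot encoding of a categorical column with levels 1..4.
weather = torch.tensor([1, 3, 2, 4, 1])
onehot = torch.zeros(weather.shape[0], 4)
# scatter_ writes 1.0 at column (level - 1) of each row;
# the index must be int64 and have the same number of dims as onehot.
onehot.scatter_(dim=1, index=weather.unsqueeze(1) - 1, value=1.0)
print(onehot)
```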

In [None]:
daily_bikes = daily_bikes.transpose(1, 2)
print(f'After Transpose: {daily_bikes.shape, daily_bikes.stride()} \n')
first_day = bikes[:24].long()
weather_onehot = torch.zeros(first_day.shape[0], 4) #4 is count of categories
print(f'Weather Feature (categorical): {first_day[:,9]}')

**Readers are encouraged to go through the previous PyTorch notebooks for a better understanding.**

In [None]:
print(f'One-Hot Encoded Weather Feature \n{weather_onehot.scatter_(dim=1,index=first_day[:,9].unsqueeze(1).long() - 1,value=1.0)}')

* **Combining the one-hot encoded matrix with the feature matrix.**
* **Creating a zero matrix to accommodate the weather feature as a one-hot encoding for every day.**
* **Creating a zero matrix for the weather class labels.**

In [None]:
"""Using the cat method, we concatenate two matrices: the first 24 observations
with their features, and the weather's one-hot encoded vectors.
"""

combinedFeature = torch.cat((bikes[:24], weather_onehot), 1)
print(f'After combining features (17 features become 21 features after one-hot encoding): {combinedFeature.shape}\n')

daily_weather_onehot = torch.zeros(daily_bikes.shape[0], 4, daily_bikes.shape[2])
print(f'Zero matrix to accommodate the weather\'s one-hot encoding: {daily_weather_onehot.shape}')

**Turning the zero matrix into a one-hot encoded feature.**

In [None]:
daily_weather_onehot.scatter_(1, daily_bikes[:,9,:].long().unsqueeze(1) - 1, 1.0)
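
The three-dimensional `scatter_` call above can be checked on a toy tensor. The sketch below assumes 2 days with 3 hours each and weather levels from 1 to 4; it uses an integer tensor directly, so no `.long()` conversion is needed.

```python
import torch

# Toy weather codes, shape (days, hours).
weather = torch.tensor([[1, 2, 4],
                        [3, 3, 1]])

# One-hot target, shape (days, categories, hours).
onehot = torch.zeros(weather.shape[0], 4, weather.shape[1])

# The index needs a singleton dim where the categories live: (days, 1, hours).
# For every day d and hour h this sets onehot[d, weather[d, h] - 1, h] = 1.
onehot.scatter_(1, weather.unsqueeze(1) - 1, 1.0)

# Sanity check: taking argmax over the category dim recovers the codes.
recovered = onehot.argmax(dim=1) + 1
print(recovered)
```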

**Combining the existing and one-hot encoded features.**

In [None]:
daily_bikes = torch.cat((daily_bikes, daily_weather_onehot), dim=1)
print(f'Time Series Dataset: {daily_bikes.shape}')

**Scaling the temperature values to [0, 1] with min-max scaling.**

In [None]:
temp = daily_bikes[:, 10, :]
temp_min = torch.min(temp)
temp_max = torch.max(temp)
daily_bikes[:, 10, :] = ((daily_bikes[:, 10, :] - temp_min) / (temp_max - temp_min))
print(f'Min-Max Scaled: {daily_bikes[:,10,:]}')

**Standardization of Temperature Feature**

In [None]:
# temp = daily_bikes[:, 10, :]
# daily_bikes[:, 10, :] = ((daily_bikes[:, 10, :] - torch.mean(temp))
# / torch.std(temp))
# print(f'Standardized Temperature Feature: {daily_bikes[:,10,:]}')
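
Both rescaling options can be compared on a few made-up temperature values. This is only a sketch: the numbers are assumptions, and `torch.std` uses the unbiased (n - 1) estimator by default.

```python
import torch

temp = torch.tensor([0.0, 5.0, 10.0])

# Min-max scaling maps the smallest value to 0 and the largest to 1.
temp_minmax = (temp - temp.min()) / (temp.max() - temp.min())

# Standardization subtracts the mean and divides by the standard deviation,
# giving zero mean and unit standard deviation.
temp_standardized = (temp - temp.mean()) / temp.std()

print(temp_minmax)
print(temp_standardized)
```

Only one of the two should be applied to a given column; applying min-max scaling first and standardization afterwards would simply overwrite the first result.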

### Working with Text

### Character Level Conversion

There are 128 ASCII characters, so we convert each letter of our text into its integer index. Characters outside ASCII are mapped to 0.
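
As a quick illustration of this mapping, the sketch below converts a made-up sample string (not taken from the text file) into indices with `ord`, sending the one non-ASCII character to 0.

```python
# Sample string with one non-ASCII character (e with acute accent).
sample = "Caf\u00e9"

# ord() returns the Unicode code point; ASCII covers code points 0..127.
indices = [ord(ch) if ord(ch) < 128 else 0 for ch in sample]

print(indices)
```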

In [None]:
with open("data/anna.txt", encoding='utf-8') as f:
    text = f.read()

* Converting the text document into a list of lines.
* Fetching one line by index.
* Creating a zero matrix with one row per character of the line and one column per possible character.

In [None]:
lines = text.split("\n")
print(f'Total Number of Lines: {len(lines)} \n')
line = lines[100]
letter_t = torch.zeros(len(line), 128)
print(f'Number of characters in a line x Total No. of Characters Possible: {letter_t.shape}')

**Creating the character matrix for a line. Every ASCII character is a feature, and each character of the line is a separate observation. Since each observation is a single character, the feature for that character is 1 and all other features are 0.**

In [None]:
line

In [None]:
for i, letter in enumerate(line.lower().strip()):
    letter_index = ord(letter) if ord(letter) < 128 else 0
    #print(letter_index)
    letter_t[i][letter_index] = 1
    
print(f'Character Matrix: {letter_t[0]}')
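
As an alternative to the explicit loop, the same kind of matrix could be built with `torch.nn.functional.one_hot`. This is a sketch on a made-up string, not the notebook's own method; note that `one_hot` returns an integer tensor, so it is converted to float to match `torch.zeros`.

```python
import torch
import torch.nn.functional as F

sample_line = "Hi!"

# Map each lowercased character to its ASCII index (0 for non-ASCII).
indices = [ord(ch) if ord(ch) < 128 else 0 for ch in sample_line.lower()]

# One row per character, one column per ASCII code.
letter_onehot = F.one_hot(torch.tensor(indices), num_classes=128).float()

print(letter_onehot.shape)
print(letter_onehot.argmax(dim=1))  # recovers the indices
```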

In [None]:
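print(f'Column Number of the Feature In Character Matrix: {(letter_t[0]==1).nonzero(as_tuple=True)} \n')

# Use the same lowercased, stripped line that was used to fill the matrix.
print(f'ASCII Code of the First Character of the Line: {ord(line.lower().strip()[0])}')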

### Word Level Conversion

* Clean the text
* Sort the unique words
* Map each word to an integer

In [None]:
def clean_words(input_str):
    punctuation = '.,;:"!?”“_-'
    word_list = input_str.lower().replace('\n',' ').split()
    word_list = [word.strip(punctuation) for word in word_list]
    return word_list

**Cleaning and Tokenization**

In [None]:
words_in_line = clean_words(line)
line, words_in_line

15K+ unique words are present in the text after cleaning. We map each word to a unique id.

In [None]:
word_list = sorted(set(clean_words(text)))
word2index_dict = {word: i for (i, word) in enumerate(word_list)}
print(f'Total Number Of Unique Words: {len(word2index_dict.keys())}')

* Creating a matrix with one row per word in the line and one column per unique word in the vocabulary.

In [None]:
word_t = torch.zeros(len(words_in_line), len(word2index_dict))
word_t.shape

**Index of each word on the column axis**

In [None]:
for i, word in enumerate(words_in_line):
    word_index = word2index_dict[word]
    word_t[i][word_index] = 1
    print('{:2} {:4} {}'.format(i, word_index, word))

In [None]:
print(f'Word Matrix: \n{word_t}')
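
The whole word-level pipeline can be followed end to end on a tiny made-up corpus. The sketch below skips punctuation cleaning and simply splits on whitespace, so it is a simplification of `clean_words`; the corpus and sentence are assumptions for illustration.

```python
import torch

corpus = "the cat sat on the mat"
sentence = "the cat sat"

# Vocabulary: sorted unique words, each mapped to an integer id.
vocab = sorted(set(corpus.split()))
word2index = {word: i for i, word in enumerate(vocab)}

# One row per word in the sentence, one column per vocabulary word.
words = sentence.split()
word_onehot = torch.zeros(len(words), len(word2index))
for i, word in enumerate(words):
    word_onehot[i][word2index[word]] = 1

print(word2index)
print(word_onehot)
```

With a real vocabulary of many thousands of words these rows are almost entirely zeros, which is why one-hot word vectors are usually only a first step.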

### Thanks For Reading. For Feedback, reach out on Github. Please don't spam.