
# Modules and Packages

In [1]:
import pickle,csv
from dataloader import get_data

ModuleNotFoundError: ignored

* [Python Standard Library](https://docs.python.org/3/library/) - Python runtime services, generic operating system services, debugging
* NumPy, Matplotlib
* PyTorch, TensorFlow

# Data Sources and Common Data Storage Formats

* Python objects - pkl (pickle)
* Numeric arrays - npz (NumPy)
* Tabular data - csv
* Plain text - txt
* Large datasets - HDF5

In [2]:
import pickle
obj = {'age': 23, 'hobbies': ['photography', 'running', 'travelling']}
# Context managers close the files after writing and reading
with open('store.pkl', 'wb') as f:
    pickle.dump(obj, f)

with open('store.pkl', 'rb') as f:
    obj2 = pickle.load(f)
obj2

{'age': 23, 'hobbies': ['photography', 'running', 'travelling']}

In [3]:
import csv
# newline='' is the mode the csv module documentation recommends for file objects
with open('data/iris.csv', 'r', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        print(row)

FileNotFoundError: ignored
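
The cell above fails because `data/iris.csv` is not present in this environment. A minimal, self-contained sketch of the same reading pattern follows. It assumes a small made-up table written to a temporary file; the rows and column names are illustrative, not the real Iris data.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for a real dataset file.
rows = [
    ["sepal_length", "sepal_width", "species"],
    ["5.1", "3.5", "setosa"],
    ["7.0", "3.2", "versicolor"],
]

# Write the rows to a temporary CSV file.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "sample.csv")
with open(csv_path, "w", newline="") as csvfile:
    csv.writer(csvfile).writerows(rows)

# Read the file back; csv.reader yields every field as a string.
with open(csv_path, "r", newline="") as csvfile:
    reader = csv.reader(csvfile, delimiter=",")
    header = next(reader)               # first row holds the column names
    records = [row for row in reader]   # remaining rows hold the data

print(header)
print(records)
```

Numeric fields arrive as strings, so they need an explicit conversion such as `float(row[0])` before any arithmetic.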

# Data Containers

Python offers several built-in containers, each designed for a different purpose and optimised for particular operations:
* lists - generic ordered container, numeric indexing
* tuples - immutable sequences
* dictionaries - key-value organisation
* sets - unordered collection of unique elements
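
A brief sketch of how the four containers behave; the variable names and values are made up for illustration.

```python
# A list is ordered and mutable.
scores = [0.9, 0.7]
scores.append(0.8)

# A tuple is ordered but immutable: assigning to an element raises TypeError.
point = (3, 4)
try:
    point[0] = 10
    tuple_is_mutable = True
except TypeError:
    tuple_is_mutable = False

# A dictionary maps keys to values.
label_to_index = {"cat": 0, "dog": 1}
label_to_index["bird"] = 2

# A set keeps only unique elements.
unique_labels = set(["cat", "dog", "cat"])

print(scores, tuple_is_mutable, label_to_index["bird"], len(unique_labels))
```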

## Lists
The list techniques below are the ones used for data pre-processing and manipulation in the batch loading phase.

In [4]:
homo_list = [12, 45, 900, 78, 34, 66, 17, 85]
hetero_list = [10, 'foo', 1.3]
print(hetero_list[0])
tuple_list = [
    (1, 'Erebor', 800.45),
    (2, 'Rivendell', 500.67),
    (3, 'Shire', 900.12),
    (4, 'Mordor', 1112.30),
]

10


**Note**: In batched data processing, a batch is typically a list of tuples, one tuple per instance, for example  
`batch_instance = (utter, utterance_length, transcript, transcript_lens)`

### Operations

In [6]:
l3 = homo_list * 2
l3

[12, 45, 900, 78, 34, 66, 17, 85, 12, 45, 900, 78, 34, 66, 17, 85]

This differs from NumPy, where multiplying an array by 2 scales every element instead of repeating the array.

In [7]:
homo_list + hetero_list

[12, 45, 900, 78, 34, 66, 17, 85, 10, 'foo', 1.3]
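
For contrast, here is a small sketch of the element-wise arithmetic that NumPy arrays perform with `*` and `+`. It is written with plain list comprehensions so it runs without NumPy, and the variable names are illustrative.

```python
values = [1, 2, 3]
others = [10, 20, 30]

# On lists, * repeats the sequence and + concatenates.
repeated = values * 2
joined = values + others

# Element-wise equivalents, written out explicitly.
scaled = [v * 2 for v in values]
summed = [a + b for a, b in zip(values, others)]

print(repeated)
print(joined)
print(scaled)
print(summed)
```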

Built-in functions such as `sorted`, `sum`, `max` and `min` work directly on lists.

In [8]:
print(sorted(homo_list))
print(sum(homo_list))

[12, 17, 34, 45, 66, 78, 85, 900]
1237


In [0]:
lst_tuples = [(1,11), (2,12), (4,9), (3,8)]

In [10]:
sorted(lst_tuples)

[(1, 11), (2, 12), (3, 8), (4, 9)]
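
Tuples are compared element by element, so the list above is ordered by the first field. A short sketch of ordering by another field with the `key` argument, reusing the same values:

```python
lst_tuples = [(1, 11), (2, 12), (4, 9), (3, 8)]

# Sort by the second element of each tuple.
by_second = sorted(lst_tuples, key=lambda pair: pair[1])

# Sort in descending order of the default (first-element) comparison.
by_first_desc = sorted(lst_tuples, reverse=True)

# max and min accept the same key argument.
largest_second = max(lst_tuples, key=lambda pair: pair[1])

print(by_second)
print(by_first_desc)
print(largest_second)
```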

### Conditional operations - filtering
There are two ways to filter lists:
* Index based - Slicing and Dicing
* Condition based - List comprehension

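#### Slicing and Dicing
`sliced_list = some_list[start_idx : end_idx + 1 : step]`

The stop index is exclusive, so `end_idx + 1` is needed to include the element at `end_idx`.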

In [11]:
print(homo_list)
print(homo_list[:5])
print(homo_list[-1:0:-2])

[12, 45, 900, 78, 34, 66, 17, 85]
[12, 45, 900, 78, 34]
[85, 66, 78, 45]
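
A few more slice forms on the same list, as an illustrative sketch; the variable names are made up.

```python
homo_list = [12, 45, 900, 78, 34, 66, 17, 85]

first_three = homo_list[:3]      # indices 0, 1, 2
middle = homo_list[2:5]          # indices 2, 3, 4 (stop index 5 is excluded)
every_other = homo_list[::2]     # indices 0, 2, 4, 6
last_two = homo_list[-2:]        # final two elements
reversed_copy = homo_list[::-1]  # whole list in reverse order

print(first_three, middle, every_other, last_two, reversed_copy)
```

Slicing always returns a new list, so `homo_list` itself is left unchanged.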


#### List comprehension

`result = [ transform   iteration   filter ]`
~~~~
res = [ manipulation(instance[2]) for instance in sorted_dataset ]
~~~~

In [12]:
res = [no for no in homo_list if no>50]
res

[900, 78, 66, 85]
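
The example above uses only the iteration and filter parts. A small sketch that also applies a transform, shown next to the equivalent explicit loop; the variable names are illustrative.

```python
homo_list = [12, 45, 900, 78, 34, 66, 17, 85]

# transform (no // 2), iteration (for no in homo_list), filter (if no % 2 == 0)
halved_evens = [no // 2 for no in homo_list if no % 2 == 0]

# The same computation written as an explicit loop.
halved_evens_loop = []
for no in homo_list:
    if no % 2 == 0:
        halved_evens_loop.append(no // 2)

# The same syntax builds dictionaries: map each value to its position.
position_of = {no: idx for idx, no in enumerate(homo_list)}

print(halved_evens)
print(position_of[900])
```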

In [13]:
%%timeit  
res = [i for i in range(10000)]

1000 loops, best of 3: 451 µs per loop


In [14]:
%%timeit
res = []
for i in range(10000):
    res.append(i)

1000 loops, best of 3: 1.11 ms per loop
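
`%%timeit` is an IPython cell magic and is not available in a plain Python script. A rough sketch of the same comparison with the standard-library `timeit` module follows; the function names are made up. Absolute timings depend on the machine and Python version, so no expected numbers are shown.

```python
import timeit

def build_with_comprehension(n=10000):
    return [i for i in range(n)]

def build_with_append(n=10000):
    res = []
    for i in range(n):
        res.append(i)
    return res

# Both approaches must produce the same list.
assert build_with_comprehension() == build_with_append()

# Time 100 calls of each function; timeit.timeit returns total seconds.
t_comprehension = timeit.timeit(build_with_comprehension, number=100)
t_append = timeit.timeit(build_with_append, number=100)

print(f"comprehension: {t_comprehension:.4f} s, append loop: {t_append:.4f} s")
```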


## Use case: Data Preprocessing and Loading


In [15]:
batch_dataset = get_data()
print(type(batch_dataset[0]))

## (utterance,utterance_size,transcripts,transcripts_size)
batch_dataset[0]

NameError: ignored

In [16]:
# sorting by utterance size (field 1)
sorted_dataset = sorted(batch_dataset, key=lambda x: x[1])

# max: size of the longest transcript (field 3)
max_transcript_len = max(batch_dataset, key=lambda x: x[3])[3]

# list comprehension for extraction
transcripts = [(instance[2], instance[3]) for instance in sorted_dataset]

# list comprehension for manipulation
def manipulation(data):
    """Returns transpose of matrix"""
    return data.T

pad_len = [manipulation(instance[2]) for instance in sorted_dataset]
pad_len[0]

NameError: ignored
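
Both cells above fail because `get_data` could not be imported. Below is a self-contained sketch of the same steps on a made-up batch. It assumes each instance has the layout `(utterance, utterance_size, transcript, transcript_size)` given in the comment above. The values and the `pad` helper are hypothetical, and plain lists stand in for arrays.

```python
# Hypothetical batch: (utterance, utterance_size, transcript, transcript_size)
batch_dataset = [
    ([0.1, 0.2, 0.3], 3, [7, 8], 2),
    ([0.5], 1, [4, 5, 6], 3),
    ([0.9, 0.8], 2, [1], 1),
]

# Sort instances by utterance size (field 1).
sorted_dataset = sorted(batch_dataset, key=lambda x: x[1])

# Size of the longest transcript (field 3).
max_transcript_len = max(batch_dataset, key=lambda x: x[3])[3]

def pad(sequence, length, fill=0):
    """Right-pad a list with `fill` up to `length` (illustrative helper)."""
    return sequence + [fill] * (length - len(sequence))

# Pad every transcript to the same length so they could be stacked.
padded_transcripts = [pad(inst[2], max_transcript_len) for inst in sorted_dataset]

print([inst[1] for inst in sorted_dataset])
print(max_transcript_len)
print(padded_transcripts)
```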

### Classes

Classes are especially useful for datasets that need to be iterable, for example so that a loader can index them or loop over them.

### Iterable and Iterators

In [0]:
class IterableADT:
    
    def __init__(self,train_data_src,train_data_src2, train_label_src):
        self.x = train_data_src
        self.x2 = train_data_src2
        self.y = train_label_src
        assert len(self.x) == len(self.x2)
        assert len(self.x2) == len(self.y)
    
    def __len__(self):
        return len(self.x)

    def __getitem__(self,key):
        return (self.x[key],self.x2[key],self.y[key])
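
A class that defines `__getitem__` with integer indices, as above, can already be used in a `for` loop: Python calls `__getitem__` with 0, 1, 2, ... until an `IndexError` is raised. Below is a minimal sketch of the alternative, the explicit iterator protocol (`__iter__` and `__next__`). The class name `PairIterator` and the sample data are made up for illustration.

```python
class PairIterator:
    """Yields (x, y) pairs and tracks the current position itself."""

    def __init__(self, xs, ys):
        assert len(xs) == len(ys)
        self.xs = xs
        self.ys = ys
        self.position = 0

    def __iter__(self):
        # An iterator returns itself from __iter__.
        return self

    def __next__(self):
        # Signal the end of iteration once every pair has been produced.
        if self.position >= len(self.xs):
            raise StopIteration
        item = (self.xs[self.position], self.ys[self.position])
        self.position += 1
        return item

pairs = list(PairIterator([1, 2, 3], ["a", "b", "c"]))
print(pairs)
```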
    

### Generators
Instead of writing a class for each iterator, you can use a generator.
Generators relieve the developer of recording the state of the iteration: Python saves it between calls.
Put simply, generators are functions that use the `yield` statement instead of `return`.


In [0]:
def pairwise_generator(input_data):
    # Assumes an even number of elements; an odd length raises IndexError on the last pair
    for i in range(0, len(input_data), 2):
        yield (input_data[i], input_data[i+1])

data = [1,'one',2,'two',3,'three',4,'four',5,'five']        
generator = pairwise_generator(data)
for elt in generator:
    print(elt)

(1, 'one')
(2, 'two')
(3, 'three')
(4, 'four')
(5, 'five')
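
The same pattern can feed data in mini-batches. A small sketch, assuming a hypothetical `batch_generator` helper that is not part of the original notebook:

```python
def batch_generator(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Consume the whole generator at once.
batches = list(batch_generator(list(range(7)), batch_size=3))
print(batches)

# Generators are lazy: each next() call runs the body up to the next yield.
gen = batch_generator([10, 20, 30], batch_size=2)
first = next(gen)
second = next(gen)
print(first, second)
```

Because slicing stops at the end of the list, the last batch is simply shorter when the length is not a multiple of `batch_size`.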


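# Debugging - pdb

`pdb.set_trace()` pauses execution at that line and opens an interactive prompt. In the session below, `n` runs the next line, `c` continues without stopping, and typing an expression such as `elt` prints its current value.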

In [0]:
import pdb
def pairwise_generator(input_data):
    pdb.set_trace()
    for i in range(0,len(input_data),2):        
        yield (input_data[i],input_data[i+1])
        
data = [1,'one',2,'two',3,'three',4,'four']        
generator = pairwise_generator(data)
for elt in generator:
    print(elt)

> <ipython-input-62-327e211f3ec6>(4)pairwise_generator()
-> for i in range(0,len(data),2):
(Pdb) n
> <ipython-input-62-327e211f3ec6>(5)pairwise_generator()
-> yield (input_data[i],input_data[i+1])
(Pdb) n
(1, 'one')
> <ipython-input-62-327e211f3ec6>(4)pairwise_generator()
-> for i in range(0,len(data),2):
(Pdb) elt
(1, 'one')
(Pdb) input_data[i]
1
(Pdb) n
> <ipython-input-62-327e211f3ec6>(5)pairwise_generator()
-> yield (input_data[i],input_data[i+1])
(Pdb) input_data[i]
2
(Pdb) n
(2, 'two')
> <ipython-input-62-327e211f3ec6>(4)pairwise_generator()
-> for i in range(0,len(data),2):
(Pdb) elt
(2, 'two')
(Pdb) n
> <ipython-input-62-327e211f3ec6>(5)pairwise_generator()
-> yield (input_data[i],input_data[i+1])
(Pdb) n
(3, 'three')
> <ipython-input-62-327e211f3ec6>(4)pairwise_generator()
-> for i in range(0,len(data),2):
(Pdb) n
> <ipython-input-62-327e211f3ec6>(5)pairwise_generator()
-> yield (input_data[i],input_data[i+1])
(Pdb) c
(4, 'four')
