**Question** 

(Take a seat, it's gonna be a while)

Most of the time, when you're working in ML, you run into data that is just ugly, of varying sizes, and/or not in a format that fits your needs. When this happens, we need to write code to preprocess the data and convert it into a useful form.

One such situation:

Consider a dataset made up of pairs of sentences, as follows:

>Who are you? , Helen.

>Where Are you?, I am at home.

Now consider that the machine only knows these 4 sentences (the machine is a baby👶 who has only ever heard these 4 sentences in its life and has a ridiculously good memory).
Its vocabulary should contain only



```
  0       1      2       3          4        5    6    7    8
['who' 'are' 'you' 'helen' 'where' 'i' 'am' 'at' 'home'] (vocabulary length - 9)
```
Now the objective is to convert each sentence in such a way that neither the identity of each word nor the order of the words is lost.
For this, we imagine a 1-dimensional numpy array to represent each word.

This numpy array has the same length as the vocabulary (here, 9), such that

   > value 0 if the word at that index of vocabulary list is not the word the numpy array represents

   > value 1 if the word at that index of vocabulary list is the word the numpy array represents

Example:

who  - 1,0,0,0,0,0,0,0,0 

are  - 0,1,0,0,0,0,0,0,0

you - 0,0,1,0,0,0,0,0,0

Therefore, the sentence as a whole ("who are you") becomes a 2-dimensional array:


```
[ [1,0,0,0,0,0,0,0,0],
  [0,1,0,0,0,0,0,0,0],
  [0,0,1,0,0,0,0,0,0],]
```
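The encoding described above can be sketched in a few lines of numpy. This is only an illustration, assuming the vocabulary order shown in the question; the helper name `one_hot_sentence` is made up for this example.

```python
import numpy as np

# Vocabulary exactly as listed above (index = position in the list).
vocabulary = ['who', 'are', 'you', 'helen', 'where', 'i', 'am', 'at', 'home']

def one_hot_sentence(words, vocabulary):
    """Return a (len(words), len(vocabulary)) array with a single 1 per row."""
    # Column of the 1 for each word = position of that word in the vocabulary.
    columns = [vocabulary.index(word) for word in words]
    encoded = np.zeros((len(words), len(vocabulary)), dtype=int)
    # Row r gets its 1 in column columns[r]; row order keeps the word order.
    encoded[np.arange(len(words)), columns] = 1
    return encoded

sentence = one_hot_sentence(['who', 'are', 'you'], vocabulary)
print(sentence)
```

Each row is one word, so reading the rows top to bottom recovers the word order.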



Your mission, should you choose to accept it, is to make a Google Colab notebook which

>  Provides an upload option to upload the dataset (CSV file), and

> Converts, step by step, each pair of sentences into a pair of 2-dimensional numpy arrays as explained above. (Include text cells describing what each part of your code does, to prevent accidental miracles 😜.)

>  Prints the vocabulary list and the pair of 2-dimensional arrays for each row of the input.

*NOTES*
>  The vocabulary list is fixed for a whole data set, not just for 1 sentence. Hence the shape of each encoded sentence would be

   (number of words in sentence, number of words in vocabulary)
   So you should consider the full data set when making the vocabulary list.

> You may order the vocabulary list your own way but the same list should be used to find the 2-dimensional arrays throughout a single data set.

> You can ignore and remove punctuation (for now); it is just annoying.

> Sample input data can be found here -> http://dravog.me/data/input.csv
   Do remember to ignore the header row.
   
> Sample output can be found here -> http://dravog.me/data/output.csv
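To make the note about a dataset-wide vocabulary concrete, here is a small sketch that builds one vocabulary from every sentence of every pair and then lists the shape each encoded sentence would have. The two pairs are the ones from the question; the `tokenize` helper and the first-appearance ordering are assumptions chosen for this illustration (the notes allow any ordering).

```python
import re

# The two sentence pairs from the question; real data would come from the CSV.
pairs = [
    ("Who are you?", "Helen."),
    ("Where Are you?", "I am at home."),
]

def tokenize(sentence):
    """Lowercase, drop punctuation, and split on whitespace."""
    return re.sub(r"[^a-z0-9 ]", "", sentence.lower()).split()

# Collect words from the whole dataset, in order of first appearance.
vocabulary = []
seen = set()
for pair in pairs:
    for sentence in pair:
        for word in tokenize(sentence):
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)

# Shape of each encoded sentence: (words in sentence, words in vocabulary).
shapes = [
    tuple((len(tokenize(sentence)), len(vocabulary)) for sentence in pair)
    for pair in pairs
]
print(vocabulary)
print(shapes)
```

The second dimension is 9 for every sentence because the vocabulary is shared; only the first dimension changes with sentence length.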

**Answer**

1. Provide an upload option to upload the dataset (CSV file)

In [0]:
import pandas as pd
import numpy as np
import re


import io
from google.colab import files



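In [2]:
uploaded = files.upload()
if not uploaded:
  #stop here instead of failing later on an undefined variable
  raise SystemExit("upload failed")
#files.upload() returns a dict of {filename: file bytes}; take the first file
content = next(iter(uploaded.values()))
#read_csv treats the first row as the header, so frame.values excludes it
frame = pd.read_csv(io.BytesIO(content))
data = frame.values
print(data)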


Saving vocabulary.csv to vocabulary (7).csv
[["'who'" "'are'" "'you'" "'helen'" "'where'" "'i'" "'am'" "'at'"
  "'home'"]]


Use a regex to remove every character that is neither alphanumeric nor a space.

Convert everything to lower case and split each sentence into a list of words, using whitespace as the delimiter.


In [0]:
#match anything except letters, digits and spaces (the space must be kept for splitting)
reg = re.compile(r"[^a-zA-Z0-9 ]")
new_data = list(data)

for row in new_data:
  for index,col in enumerate(row):
    #remove special characters
    col = reg.sub("",col) #replace all matching with ""
    col = col.lower()  #set all to lower case
    col = col.split() #split on whitespace, ignoring repeated spaces
    row[index] = col
data = new_data
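Whether the space sits inside the regex character class matters for the split that follows. The sketch below compares the two patterns on one sentence from the question; the names `KEEP_SPACES`, `DROP_SPACES` and `clean` are made up for this illustration.

```python
import re

# Keeps letters, digits and spaces; everything else is removed.
KEEP_SPACES = re.compile(r"[^a-zA-Z0-9 ]")
# Without the space in the character class, the spaces are removed too.
DROP_SPACES = re.compile(r"[^a-zA-Z0-9]")

def clean(sentence, pattern=KEEP_SPACES):
    """Strip unwanted characters, lowercase, and split into words."""
    return pattern.sub("", sentence).lower().split()

words = clean("Where Are you?")
merged = clean("Where Are you?", DROP_SPACES)
print(words)   # ['where', 'are', 'you']
print(merged)  # ['whereareyou']
```

With the second pattern the whole sentence collapses into a single token, so each sentence would end up as one vocabulary entry instead of several words.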



Get the vocabulary list

Use a set to collect all distinct words in the dataset.

Then build a dictionary that maps each word to its corresponding encoded equivalent.

In [4]:
vocablist = set()
for row in data:
    for col in row:
      vocablist.update(col)
vocablist = list(vocablist)
vocabdict = dict()
for index,word in enumerate(vocablist):
  vocabdict[word] =[0]*index +[1] +[0] *(len(vocablist) - index - 1)
  

vocab = vocabdict
print(vocab)

{'where': [1, 0, 0, 0, 0, 0, 0, 0, 0], 'i': [0, 1, 0, 0, 0, 0, 0, 0, 0], 'who': [0, 0, 1, 0, 0, 0, 0, 0, 0], 'helen': [0, 0, 0, 1, 0, 0, 0, 0, 0], 'am': [0, 0, 0, 0, 1, 0, 0, 0, 0], 'you': [0, 0, 0, 0, 0, 1, 0, 0, 0], 'home': [0, 0, 0, 0, 0, 0, 1, 0, 0], 'at': [0, 0, 0, 0, 0, 0, 0, 1, 0], 'are': [0, 0, 0, 0, 0, 0, 0, 0, 1]}
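The order of the words in the printed dictionary comes from iterating a set, which for strings can differ from one Python run to the next. As a possible alternative, and only as a sketch, the words can be sorted first and the one-hot vectors taken from the rows of an identity matrix. Sorting is an assumption made here for reproducibility; the question allows any fixed order.

```python
import numpy as np

# Distinct words of the sample dataset (a set has no guaranteed order).
words = {"who", "are", "you", "helen", "where", "i", "am", "at", "home"}

# Assumption: alphabetical order, so the indices are the same on every run.
vocabulary = sorted(words)

# Row i of the identity matrix is the one-hot vector for vocabulary[i].
identity = np.eye(len(vocabulary), dtype=int)
one_hot = {word: identity[i] for i, word in enumerate(vocabulary)}

print(vocabulary)
print(one_hot["who"])
```

Here "who" lands at index 7 of the sorted list, so its vector has the 1 in the eighth position.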


Encode the words

Replace each word with its encoded equivalent by looking it up in the dictionary. A word that is missing from the dictionary becomes an all-zero vector.

In [5]:
for row in data:
  for col in row:
    for index,word in enumerate(col):
      col[index] = vocab.get(word,[0] * len(vocab))
encoded = np.array(data)
print(encoded)

[[list([[0, 0, 1, 0, 0, 0, 0, 0, 0]]) list([[0, 0, 0, 0, 0, 0, 0, 0, 1]])
  list([[0, 0, 0, 0, 0, 1, 0, 0, 0]]) list([[0, 0, 0, 1, 0, 0, 0, 0, 0]])
  list([[1, 0, 0, 0, 0, 0, 0, 0, 0]]) list([[0, 1, 0, 0, 0, 0, 0, 0, 0]])
  list([[0, 0, 0, 0, 1, 0, 0, 0, 0]]) list([[0, 0, 0, 0, 0, 0, 0, 1, 0]])
  list([[0, 0, 0, 0, 0, 0, 1, 0, 0]])]]
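Because sentences differ in length, one way to avoid a nested object array is to keep each encoded sentence as its own 2-dimensional array and store the pairs in a plain list. The sketch below assumes already-tokenized pairs and the vocabulary order from the question; `encode` and `tokenized_pairs` are names invented for this example.

```python
import numpy as np

vocabulary = ["who", "are", "you", "helen", "where", "i", "am", "at", "home"]
index_of = {word: i for i, word in enumerate(vocabulary)}

def encode(words):
    """One row per word; a word missing from the vocabulary stays all zeros."""
    matrix = np.zeros((len(words), len(vocabulary)), dtype=int)
    for row, word in enumerate(words):
        if word in index_of:
            matrix[row, index_of[word]] = 1
    return matrix

# Tokenized sentence pairs (assumed to be cleaned and split already).
tokenized_pairs = [
    (["who", "are", "you"], ["helen"]),
    (["where", "are", "you"], ["i", "am", "at", "home"]),
]

# Each dataset row becomes a pair of 2-dimensional arrays.
encoded_pairs = [(encode(first), encode(second)) for first, second in tokenized_pairs]
for first, second in encoded_pairs:
    print(first.shape, second.shape)
```

Leaving unknown words as zero rows mirrors the default used in the dictionary lookup above.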


Output

In [7]:
print(vocab)
for row in encoded:
  print(np.array(row[0]),end = ",\n")
  print(np.array(row[1]))
  print("Who are you? , Helen. \nWhere Are you?, I am at home.")

{'where': [1, 0, 0, 0, 0, 0, 0, 0, 0], 'i': [0, 1, 0, 0, 0, 0, 0, 0, 0], 'who': [0, 0, 1, 0, 0, 0, 0, 0, 0], 'helen': [0, 0, 0, 1, 0, 0, 0, 0, 0], 'am': [0, 0, 0, 0, 1, 0, 0, 0, 0], 'you': [0, 0, 0, 0, 0, 1, 0, 0, 0], 'home': [0, 0, 0, 0, 0, 0, 1, 0, 0], 'at': [0, 0, 0, 0, 0, 0, 0, 1, 0], 'are': [0, 0, 0, 0, 0, 0, 0, 0, 1]}
[[0 0 1 0 0 0 0 0 0]],
[[0 0 0 0 0 0 0 0 1]]
Who are you? , Helen. 
Where Are you?, I am at home.
