# List Structure Processing

This notebook processes structured data lists (word embeddings, casing information, and POS tags) to prepare them for deep learning tasks. The primary objective is to ensure that all entries have the same length, which is done by padding every sentence to a common number of tokens.
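
The padding idea can be sketched on toy data as follows. This is a minimal illustration, not the notebook's own function: `pad_sentences` and the two-dimensional toy vectors are hypothetical names and values chosen for the example.

```python
import copy


def pad_sentences(sentences, target_len, pad_vector):
    """Return a deep copy of `sentences` with each sentence padded to `target_len` tokens."""
    padded = copy.deepcopy(sentences)  # leave the caller's data untouched
    for sentence in padded:
        missing = max(target_len - len(sentence), 0)
        # Append one fresh copy of the padding vector per missing token.
        sentence.extend(list(pad_vector) for _ in range(missing))
    return padded


# Two toy sentences with 2-dimensional token vectors (made-up values).
toy = [[[1.0, 0.0]], [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]]
target_len = max(len(sentence) for sentence in toy)
padded = pad_sentences(toy, target_len, [0.0, 0.0])
print([len(sentence) for sentence in padded])  # [3, 3]
```

Working on a deep copy keeps the original lists available for inspection after padding.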

## Overview
1. **Loading Libraries and Data**: Import necessary libraries and multiple datasets, including word embeddings, casing, and POS tags.
2. **Data Transformation**: Pad every sentence to a common length, one-hot encode the labels, and keep the word-embedding, casing, and POS-tag lists consistent with one another.
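
One way to verify consistency across the feature lists is to compare their per-sentence token counts. The sketch below is an assumption about what such a check could look like; `check_aligned` and the toy data are hypothetical and do not appear in the notebook.

```python
def check_aligned(**features):
    """Return the shared per-sentence token counts, or raise ValueError on a mismatch."""
    names = list(features)
    reference_name = names[0]
    reference = [len(sentence) for sentence in features[reference_name]]
    for name in names[1:]:
        lengths = [len(sentence) for sentence in features[name]]
        if lengths != reference:
            raise ValueError(f"{name} is not aligned with {reference_name}")
    return reference


# Toy data: 2 sentences with 1 and 2 tokens; the vector size differs per feature.
words = [[[0.1, 0.2, 0.3]], [[0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]]
casing = [[[1.0, 0.0]], [[0.0, 1.0], [1.0, 0.0]]]
pos = [[[1.0]], [[1.0], [1.0]]]

print(check_aligned(words=words, casing=casing, pos=pos))  # [1, 2]
```

Running a check like this before padding catches misaligned inputs that padding would otherwise hide.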

## Goal
The main goal of this notebook is to organize and transform structured data lists into formats suitable for deep learning. By the end of the notebook, you will have preprocessed data ready for training and testing machine learning models, with correct handling of word embeddings, casing, and POS tag information.
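
A quick way to confirm that a preprocessed list is ready for a model is to check that it is rectangular. The helper below is a hypothetical sketch (`nested_shape` is not part of the notebook). With the values used later in this notebook, the padded word embeddings would be expected to report 377 tokens per sentence and 100 features per token.

```python
def nested_shape(data):
    """Return (sentences, tokens, features) for a rectangular 3-level list, else raise ValueError."""
    token_counts = {len(sentence) for sentence in data}
    feature_counts = {len(token) for sentence in data for token in sentence}
    if len(token_counts) != 1 or len(feature_counts) != 1:
        raise ValueError("ragged data: sentences or token vectors differ in length")
    return (len(data), token_counts.pop(), feature_counts.pop())


# Toy data: 2 sentences, 2 tokens each, 2 features per token.
toy = [[[0.0, 1.0], [0.0, 0.0]], [[1.0, 0.0], [0.0, 0.0]]]
print(nested_shape(toy))  # (2, 2, 2)
```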


In [1]:
import json
import copy
import torch

In [2]:
def load_json(path):
    """Load a JSON file and close the file handle afterwards."""
    with open(path) as f:
        return json.load(f)

# Token-level features, one list of vectors per sentence.
train_word_embeddings = load_json("train_word_embeddings_reduced.json")
test_word_embeddings = load_json("test_word_embeddings_reduced.json")

train_casing_onehot = load_json("train_casing_onehot.json")
test_casing_onehot = load_json("test_casing_onehot.json")

train_pos_onehot = load_json("train_pos_onehot.json")
test_pos_onehot = load_json("test_pos_onehot.json")

# CRF-tagged sentence sets.
training_set = load_json("full_training_set_CRF_tagged.json")
test_set = load_json("full_test_set_CRF_tagged.json")

# Label sequences, one list of tag strings per sentence.
y_train = load_json("y_train.json")
y_test = load_json("y_test.json")

In [3]:
# Collect the number of tokens in every training and test sentence.
leng = [len(sentence) for sentence in train_casing_onehot + test_casing_onehot]

# The longest sentence determines the padding length.
print(max(leng))


In [3]:
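def list_struct(l, length, length_vector):
    """Pad each sentence in l with zero vectors of size length_vector up to length tokens.

    Sentences that already have length tokens or more are left unchanged.
    """
    l_copy = copy.deepcopy(l)  # do not modify the input list
    for l2 in l_copy:
        while len(l2) < length:
            l2.append([0.] * length_vector)
    return l_copy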

In [20]:
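def list_struct2(l, length):
    """Pad each tag sequence in l with the string "PAD" up to length tokens."""
    l_copy = copy.deepcopy(l)  # do not modify the input list
    for l2 in l_copy:
        while len(l2) < length:
            l2.append("PAD")
    return l_copy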

In [4]:
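# Pad every feature list to 377 tokens per sentence.
# The last argument is the vector size of the feature: 8 (casing), 17 (POS), 100 (word embeddings).
train_casing_onehot2=list_struct(train_casing_onehot, 377, 8)
test_casing_onehot2=list_struct(test_casing_onehot, 377, 8)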

train_pos_onehot2=list_struct(train_pos_onehot, 377, 17)
test_pos_onehot2=list_struct(test_pos_onehot, 377, 17)

train_word_embeddings2=list_struct(train_word_embeddings, 377, 100)
test_word_embeddings2=list_struct(test_word_embeddings, 377, 100)

In [21]:
y_train2=list_struct2(y_train, 377)
y_test2=list_struct2(y_test, 377)

In [22]:
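def numerical_tags(dict_index_tag, y):
    """One-hot encode every tag in y using the tag-to-index mapping dict_index_tag."""
    tags = []
    for sentence in y:
        sentence_tags = []
        for tag in sentence:
            # One slot per known tag instead of a hard-coded size.
            l = [0.] * len(dict_index_tag)
            l[dict_index_tag[tag]] = 1.
            sentence_tags.append(l)
        tags.append(sentence_tags)
    return tags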

In [23]:
dict_index_tag = {"O": 0, "NEG": 1, "NSCO": 2, "UNC": 3, "USCO": 4, "PAD": 5}
# Derive the inverse mapping so the two dictionaries cannot drift apart.
dict_index_tag_inverted = {index: tag for tag, index in dict_index_tag.items()}

y_train3=numerical_tags(dict_index_tag, y_train2)
y_test3=numerical_tags(dict_index_tag, y_test2)
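
The inverted mapping is defined above but not used in this notebook. One plausible use, sketched here as an assumption, is to turn one-hot (or model-predicted) vectors back into tag strings. `decode_tags` and its `drop_pad` parameter are hypothetical; the mapping is copied from the cell above so the block runs on its own.

```python
# Same index-to-tag mapping as dict_index_tag_inverted above.
index_to_tag = {0: "O", 1: "NEG", 2: "NSCO", 3: "UNC", 4: "USCO", 5: "PAD"}


def decode_tags(one_hot_sentences, index_to_tag, drop_pad=True):
    """Map each one-hot vector back to its tag, optionally removing padding."""
    decoded = []
    for sentence in one_hot_sentences:
        # The position of the largest value is the tag index.
        tags = [index_to_tag[vec.index(max(vec))] for vec in sentence]
        if drop_pad:
            tags = [tag for tag in tags if tag != "PAD"]
        decoded.append(tags)
    return decoded


# One toy sentence: "O", "NEG", then one padding position.
example = [[[1., 0., 0., 0., 0., 0.],
            [0., 1., 0., 0., 0., 0.],
            [0., 0., 0., 0., 0., 1.]]]
print(decode_tags(example, index_to_tag))  # [['O', 'NEG']]
```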

In [8]:
# Note: the padded features overwrite the input files loaded at the top of the notebook.
with open("train_casing_onehot.json", "w") as f:
    json.dump(train_casing_onehot2, f)

with open("test_casing_onehot.json", "w") as f:
    json.dump(test_casing_onehot2, f)

with open("train_pos_onehot.json", "w") as f:
    json.dump(train_pos_onehot2, f)

with open("test_pos_onehot.json", "w") as f:
    json.dump(test_pos_onehot2, f)

with open("train_word_embeddings_reduced.json", "w") as f:
    json.dump(train_word_embeddings2, f)

with open("test_word_embeddings_reduced.json", "w") as f:
    json.dump(test_word_embeddings2, f)

# The one-hot labels go to new files, so y_train.json and y_test.json are kept.
with open("y_train_numerical.json", "w") as f:
    json.dump(y_train3, f)

with open("y_test_numerical.json", "w") as f:
    json.dump(y_test3, f)
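
After writing, a round-trip check can confirm that the saved JSON reloads to the same nested lists. The sketch below uses a temporary directory and a made-up file name (`toy_padded.json`) so it does not touch the notebook's real files; `save_json` and `load_json` are hypothetical helpers.

```python
import json
import os
import tempfile


def save_json(obj, path):
    """Write obj to path as JSON."""
    with open(path, "w") as f:
        json.dump(obj, f)


def load_json(path):
    """Read a JSON file and return its contents."""
    with open(path) as f:
        return json.load(f)


# One toy sentence with a real token and one zero-padding token.
padded = [[[1.0, 0.0], [0.0, 0.0]]]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_padded.json")
    save_json(padded, path)
    restored = load_json(path)

print(restored == padded)  # True
```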