# Dimension Reduction Notebook

This notebook reduces the dimensionality of high-dimensional word embeddings. High-dimensional embeddings are expensive to store and process, and they are hard to visualize or analyze further. Projecting them onto fewer dimensions keeps as much of the variance in the data as possible while making downstream processing cheaper.

## Parameters
- **input_file**: Path to the JSON file containing the original high-dimensional embeddings.
- **output_file**: Path to the JSON file where the reduced-dimensionality embeddings will be saved.
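- **n_components**: Number of dimensions to reduce the data to (default: 100). PCA requires this value to be no larger than the smaller of the total number of word vectors and the original embedding dimension.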
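The code in this notebook indexes the loaded JSON as sentence, then word, then vector component, so the input file is expected to hold a list of sentences, each a list of word vectors. The sketch below writes and reads back a file with that layout. The toy values, the 4-dimensional vectors, and the file name `toy_embeddings.json` are hypothetical and chosen only for illustration.

```python
import json
import os
import tempfile

# Hypothetical toy input: 2 sentences with 2 and 1 words, each word a 4-dimensional vector
toy_embeddings = [
    [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]],
    [[0.9, 1.0, 1.1, 1.2]],
]

# Write the nested list to a temporary JSON file (illustrative file name)
path = os.path.join(tempfile.mkdtemp(), "toy_embeddings.json")
with open(path, "w") as f:
    json.dump(toy_embeddings, f)

# Read it back and inspect the three nesting levels
with open(path) as f:
    loaded = json.load(f)

n_sentences = len(loaded)                       # number of sentences
words_per_sentence = [len(s) for s in loaded]   # words in each sentence
vector_length = len(loaded[0][0])               # dimensions per word vector
print(n_sentences, words_per_sentence, vector_length)
```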

## Functionality
The notebook performs the following steps:
1. **Data Loading**: Loads the high-dimensional embeddings from the specified JSON file.
2. **Flattening Embeddings**: Concatenates the per-sentence lists of word vectors into a single 2D array of shape (total words, embedding dimension).
3. **Dimensionality Reduction**: Applies Principal Component Analysis (PCA) to reduce the number of dimensions to the specified `n_components`.
4. **Unflattening Embeddings**: Regroups the reduced vectors into the original per-sentence structure, so each sentence keeps its word count.
5. **Data Saving**: Saves the reduced-dimensionality embeddings to the specified output JSON file.
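The steps above can be sketched end to end on synthetic data, without touching any files. This is a minimal illustration rather than the notebook's exact code: the sentence lengths, the 8-dimensional input vectors, and the target of 3 components are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic input: 3 sentences with 4, 2, and 5 words, each word an 8-dimensional vector
sentence_lengths = [4, 2, 5]
data = [rng.normal(size=(n, 8)).tolist() for n in sentence_lengths]

# Flatten: one row per word, regardless of which sentence it came from
flat = np.array([word for sentence in data for word in sentence])

# Reduce: project every word vector onto the first 3 principal components
reduced = PCA(n_components=3).fit_transform(flat)

# Unflatten: slice the rows back into sentences using the original lengths
restored = []
start = 0
for n in sentence_lengths:
    restored.append(reduced[start:start + n].tolist())
    start += n

print(flat.shape, reduced.shape, [len(s) for s in restored])
```

Because PCA is fitted on all word vectors at once, the sentence boundaries play no role in the projection; they only matter when the rows are regrouped afterwards.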

## Techniques Used
- **Principal Component Analysis (PCA)**: A statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of linearly uncorrelated variables called principal components, ordered by the amount of variance each one explains.
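To make the definition concrete, the sketch below computes principal components with a plain singular value decomposition in NumPy and checks that the transformed variables are uncorrelated. It is an illustrative stand-in for scikit-learn's `PCA`, not the code this notebook runs, and the three-variable synthetic dataset is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two strongly correlated variables plus one independent variable
x = rng.normal(size=500)
data = np.column_stack([x, 2 * x + 0.1 * rng.normal(size=500), rng.normal(size=500)])

# PCA operates on mean-centered data
centered = data - data.mean(axis=0)

# The rows of vt are the orthonormal principal directions
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

# Orthogonal transformation: express each observation in the principal directions
components = centered @ vt.T

# The covariance of the components is diagonal, i.e. they are linearly uncorrelated
cov = np.cov(components, rowvar=False)
off_diagonal = cov - np.diag(np.diag(cov))

# Share of the total variance carried by each component, largest first
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)
print(np.abs(off_diagonal).max(), explained_variance_ratio)
```

Keeping only the leading columns of `components` is what reduces the dimensionality; the explained variance ratio indicates how much information the discarded columns carried.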

## Goal
The main goal of this notebook is to reduce the computational load of working with high-dimensional embeddings, making them cheaper to store and faster to process in later steps.


In [3]:
import json

import numpy as np
from sklearn.decomposition import PCA

def load_json(file_path):
    with open(file_path, 'r') as f:
        data = json.load(f)
    return data

def save_json(data, file_path):
    with open(file_path, 'w') as f:
        json.dump(data, f)

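def flatten_embeddings(data):
    """Stack every word vector from every sentence into one 2D array."""
    # data is a list of sentences; each sentence is a list of word vectors
    flat_embeddings = [word for sentence in data for word in sentence]
    return np.array(flat_embeddings)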

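def unflatten_embeddings(flat_embeddings, original_data):
    """Regroup the rows of flat_embeddings into the sentence structure of original_data."""
    index = 0  # position of the next unused row in flat_embeddings
    new_data = []
    for sentence in original_data:
        new_sentence = []
        for _ in sentence:
            # tolist() turns the NumPy row into plain floats so it is JSON-serializable
            new_sentence.append(flat_embeddings[index].tolist())
            index += 1
        new_data.append(new_sentence)
    return new_data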

def reduce_dimensionality(data, n_components=100):
    """Fit PCA on data and return its projection onto the first n_components components."""
    pca = PCA(n_components=n_components)
    return pca.fit_transform(data)

def main(input_file, output_file, n_components=100):
    print("Loading data...")
    data = load_json(input_file)
    print("Flattening embeddings...")
    flat_embeddings = flatten_embeddings(data)
    print(f"Original shape: {flat_embeddings.shape}")

    print("Reducing dimensionality...")
    reduced_embeddings = reduce_dimensionality(flat_embeddings, n_components)
    print(f"Reduced shape: {reduced_embeddings.shape}")

    print("Reconstructing data structure...")
    new_data = unflatten_embeddings(reduced_embeddings, data)

    print("Saving reduced data...")
    save_json(new_data, output_file)
    print("Done!")

if __name__ == "__main__":
    input_file = 'word_embeddings.json'
    output_file = 'word_embeddings_reduced.json'
    main(input_file, output_file)


Loading data...
Flattening embeddings...
Original shape: (205114, 300)
Reducing dimensionality...
Reduced shape: (205114, 100)
Reconstructing data structure...
Saving reduced data...
Done!


In [4]:
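# Reduced embeddings: number of sentences, words in the first sentence, vector length
data_reduced = load_json('word_embeddings_reduced.json')
print(len(data_reduced))
print(len(data_reduced[0]))
print(len(data_reduced[0][0]))

# Same checks on the original embeddings for comparison
data_original = load_json('word_embeddings.json')
print(len(data_original))
print(len(data_original[0]))
print(len(data_original[0][0]))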

11782
17
100
11782
17
300


In [4]:
import copy
import json

def load_json(file_path):
    with open(file_path, 'r') as f:
        data = json.load(f)
    return data

def save_json(data, file_path):
    with open(file_path, 'w') as f:
        json.dump(data, f)

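def list_struct(l, length, length_vector):
    """Return a copy of l with every sentence zero-padded to at least `length` vectors."""
    l_copy = copy.deepcopy(l)  # leave the caller's data untouched
    for l2 in l_copy:
        # Sentences already longer than `length` are left as they are (no truncation)
        while len(l2) < length:
            l2.append([0] * length_vector)
    return l_copy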
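
To illustrate what `list_struct` does, here is a small sketch on hypothetical toy data. The function body repeats the notebook's definition so the example runs on its own; the 2-dimensional vectors and the target length of 2 are assumptions made for brevity.

```python
import copy

def list_struct(l, length, length_vector):
    # Same logic as the notebook's helper: pad each sentence with zero vectors
    l_copy = copy.deepcopy(l)
    for l2 in l_copy:
        while len(l2) < length:
            l2.append([0] * length_vector)
    return l_copy

# Toy input: one sentence with 1 word, one sentence with 3 words
toy = [[[1.0, 2.0]], [[3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]]
padded = list_struct(toy, 2, 2)

print([len(sentence) for sentence in padded])  # the short sentence grows, the long one is unchanged
print(padded[0])
```

The short sentence gains one zero vector, while the three-word sentence stays at three words because the helper pads but never truncates. The input list itself is not modified, since the helper works on a deep copy.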

In [10]:
fast_vector_train = load_json("w_fast_vectors_reduced.json")
fast_vector_test = load_json("w_fast_vectors_test_reduced.json")

In [11]:
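print(len(fast_vector_train[0]))  # length of the first sentence before padding
# Pad every sentence to 377 word vectors, each of length 100
fast_vector_train = list_struct(fast_vector_train, 377, 100)
fast_vector_test = list_struct(fast_vector_test, 377, 100)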

17


In [12]:
# Note: this overwrites the reduced input files with their padded versions
save_json(fast_vector_train, 'w_fast_vectors_reduced.json')
save_json(fast_vector_test, 'w_fast_vectors_test_reduced.json')