## Create vector store index

This notebook creates and saves a vector index built with the text-embedding-ada-002 embedding model from OpenAI.

#### Set paths

In [1]:
# Path to root
path_to_root = '/work/PernilleHøjlundBrams#8577/NLP_2023_P'

# To API key file
path_to_key = f'{path_to_root}/config/keys.txt'

# To your data folder
path_to_data = f'{path_to_root}/data'

# To where you want to store your vector index
path_to_vector_store = f'{path_to_root}/index'

#### Load data

In [2]:
import pandas as pd
df = pd.read_csv(f'{path_to_data}/articles.csv', sep = ",")

### Create context chunks from documents
This section converts each news article in the .csv file into a `Document` object, with the *article body* as the main text and *author*, *URL*, *source*, *date published* and *title* in a metadata dictionary. The documents are split into smaller chunks further down.

In [3]:
from llama_index import VectorStoreIndex, SimpleDirectoryReader, Document

# Convert the DataFrame into a list of Document objects that the index can understand
documents = [Document(text=row['Article Body'],
                      metadata={'title': row['Article Header'],
                                'source': row['Source'],
                                'author': row['Author'],
                                'date': row['Published Date'],
                                'url': row['Url']}) for index, row in df.iterrows()] 
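
The row-to-document mapping above can be illustrated without llama_index. The following is a minimal sketch using only the standard library; the sample rows are made up, `row_to_record` is a hypothetical helper, and the fallback to `"unknown"` for an empty author is an assumption rather than something the notebook does.

```python
import csv
import io

# Tiny in-memory stand-in for articles.csv. The column names match the cell
# above; the row contents are invented for illustration.
sample_csv = io.StringIO(
    "Article Header,Article Body,Source,Author,Published Date,Url\n"
    "Example title,Example body text.,Example Source,Jane Doe,2023-01-01,https://example.com/a\n"
    "No author,Another body.,Example Source,,2023-01-02,https://example.com/b\n"
)


def row_to_record(row):
    """Split one CSV row into main text and a metadata dict (hypothetical helper)."""
    metadata = {
        "title": row["Article Header"],
        "source": row["Source"],
        "author": row["Author"] or "unknown",  # assumed fallback for empty fields
        "date": row["Published Date"],
        "url": row["Url"],
    }
    return {"text": row["Article Body"], "metadata": metadata}


records = [row_to_record(row) for row in csv.DictReader(sample_csv)]
print(records[0]["text"])
print(records[1]["metadata"]["author"])
```

Each record has the same shape as the `text` and `metadata` arguments passed to `Document`, so checking a few records this way can help catch missing or misnamed columns before embedding anything.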

### Create service context for the vector index

In [4]:
from llama_index import (
    ServiceContext,
    OpenAIEmbedding,
    PromptHelper,
)
from llama_index.text_splitter import SentenceSplitter

# --- SentenceSplitter to split documents into chunks (chunk_size=512, chunk_overlap=10)
text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=10)
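
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch over a list of tokens. It is not the algorithm `SentenceSplitter` uses (that splitter also tries to respect sentence boundaries); `chunk_tokens` is a hypothetical helper that only shows how consecutive chunks share an overlap.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window has reached the end of the token list
    return chunks


# Small numbers so the overlap is easy to see
words = [f"w{i}" for i in range(12)]
chunks = chunk_tokens(words, chunk_size=5, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

With these toy values every chunk starts with the last two tokens of the previous one. With the notebook's settings the window would be 512 tokens wide and share 10 tokens with its neighbour.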


### Split documents into nodes

In [5]:
nodes = text_splitter.get_nodes_from_documents(documents)

In [9]:
nodes_test = pd.DataFrame(nodes)

In [13]:
nodes_test[7][0]

('text',
 "['Discovering new materials and drugs typically involves a manual, trial-and-error process that can take decades and cost millions of dollars. To streamline this process, scientists often use machine learning to predict molecular properties and narrow down the molecules they need to synthesize and test in the lab.', 'Researchers from MIT and the MIT-Watson AI Lab have developed a new, unified framework that can simultaneously predict molecular properties and generate new molecules much more efficiently than these popular deep-learning approaches.', 'To teach a machine-learning model to predict a molecule’s biological or mechanical properties, researchers must show it millions of labeled molecular structures — a process known as training. Due to the expense of discovering molecules and the challenges of hand-labeling millions of structures, large training datasets are often hard to come by, which limits the effectiveness of machine-learning approaches.', 'By contrast, the sys

In [14]:
nodes_test.to_csv(f"{path_to_root}/data/prelim_dataframes/nodes.csv")

### Create VectorStore index

In [18]:
# import sys
# sys.path.append(f'{path_to_root}/src')

# from utils import read_api_key

In [19]:
# # --- Load API key
# api_key = read_api_key(path_to_key)

# import os

# # Set the OpenAI API key in the environment variables
# os.environ["OPENAI_API_KEY"] = api_key
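
The `read_api_key` helper imported from `utils` is not shown in this notebook. A minimal sketch of what such a helper might look like is given below, assuming the key file holds the key on its first non-empty line; the actual implementation in `src/utils` may differ.

```python
import tempfile
from pathlib import Path


def read_api_key(path):
    """Return the first non-empty line of a text file, stripped of whitespace.

    Hypothetical stand-in for utils.read_api_key; the file layout is assumed.
    """
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        key = line.strip()
        if key:
            return key
    raise ValueError(f"No API key found in {path}")


# Demonstration with a throwaway file and a placeholder value (not a real key)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "keys.txt"
    demo_path.write_text("\nsk-placeholder\n", encoding="utf-8")
    demo_key = read_api_key(demo_path)

print(demo_key)
```

In the notebook, the returned value would then be assigned to `os.environ["OPENAI_API_KEY"]` as in the commented-out cell above.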


In [None]:
# # --- Generate vector index
# index = VectorStoreIndex(nodes,show_progress = True)

# # --- Persist index to disk
# index.storage_context.persist("full_dataset_nodes_index")