Visualising corpus.json


In [1]:
import json 
with open('data/corpus.json','r') as file:
    corpus_data = json.load(file)
corpus_data[0]


{'title': "200+ of the best deals from Amazon's Cyber Monday sale",
 'author': None,
 'source': 'Mashable',
 'published_at': '2023-11-27T08:45:59+00:00',
 'category': 'entertainment',
 'url': 'https://mashable.com/article/cyber-monday-deals-amazon-2023',
 'body': 'Table of Contents Table of Contents Echo, Fire TV, and Kindle deals Apple deals TV deals Laptop deals Headphone and earbud deals Tablet deals Gaming deals Speaker deals Vacuum deals Kitchen deals Smart home deals Fitness deals Beauty tech deals Drone deals Camera deals Lego deals Gift card deals\n\nUPDATE: Nov. 27, 2023, 5:00 a.m. EST This post has been updated with all of the latest Cyber Monday deals available at Amazon.\n\nAmazon is dragging out the year\'s biggest shopping holiday(s) into 11 days of deals.\n\nThe retail giant began its Black Friday sale in the early morning of Friday, Nov. 17 (a week ahead of schedule) and was on top of making the switch to Cyber Monday language in the wee hours of Saturday, Nov. 25. Offi

Preprocessing the text in the body field

In [2]:
import re
def preprocess_text(text: str) -> str:
    # Collapse every run of whitespace (spaces, tabs, newlines) into a single space
    text = re.sub(r'\s+', ' ', text)
    # Keep only letters, digits, whitespace and basic punctuation (. , ! ? ' ");
    # this also drops colons and parentheses, so "5:00" becomes "500"
    text = re.sub(r'[^a-zA-Z0-9\s.,!?\'"]', '', text)
    return text.strip()  # Remove leading/trailing whitespace

# Clean the body of every article in place
for article in corpus_data:
    article['body'] = preprocess_text(article['body'])
corpus_data[0]

{'title': "200+ of the best deals from Amazon's Cyber Monday sale",
 'author': None,
 'source': 'Mashable',
 'published_at': '2023-11-27T08:45:59+00:00',
 'category': 'entertainment',
 'url': 'https://mashable.com/article/cyber-monday-deals-amazon-2023',
 'body': 'Table of Contents Table of Contents Echo, Fire TV, and Kindle deals Apple deals TV deals Laptop deals Headphone and earbud deals Tablet deals Gaming deals Speaker deals Vacuum deals Kitchen deals Smart home deals Fitness deals Beauty tech deals Drone deals Camera deals Lego deals Gift card deals UPDATE Nov. 27, 2023, 500 a.m. EST This post has been updated with all of the latest Cyber Monday deals available at Amazon. Amazon is dragging out the year\'s biggest shopping holidays into 11 days of deals. The retail giant began its Black Friday sale in the early morning of Friday, Nov. 17 a week ahead of schedule and was on top of making the switch to Cyber Monday language in the wee hours of Saturday, Nov. 25. Official Cyber Mond
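The effect of the character whitelist is easy to miss in the long output above, so here is a small check on a short string. It is a compact copy of preprocess_text that assumes the same regular expressions; the sample string is taken from the article body shown earlier.

```python
import re

def preprocess_text(text: str) -> str:
    # Collapse all whitespace runs (including newlines) into single spaces
    text = re.sub(r'\s+', ' ', text)
    # Drop every character outside letters, digits, whitespace and . , ! ? ' "
    text = re.sub(r'[^a-zA-Z0-9\s.,!?\'"]', '', text)
    return text.strip()

sample = "UPDATE: Nov. 27, 2023, 5:00 a.m. EST\n\nholiday(s) into 11 days"
cleaned = preprocess_text(sample)
print(cleaned)
# UPDATE Nov. 27, 2023, 500 a.m. EST holidays into 11 days
```

Colons and parentheses are removed, which is why "5:00" appears as "500" and "holiday(s)" as "holidays" in the cleaned body.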

Importing RecursiveCharacterTextSplitter to split data into chunks 

In [3]:
import textwrap

from langchain.text_splitter import RecursiveCharacterTextSplitter

text = corpus_data[0]['body']

def print_wrapped_text(text: str, width: int = 70) -> None:
    # Wrap long text to a fixed width so it is easier to read in the notebook
    wrapped_text = textwrap.fill(text, width)
    print(wrapped_text)


# Chunks of at most 1000 characters, with up to 200 characters shared between neighbours
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

chunks = text_splitter.split_text(text)
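To make chunk_size and chunk_overlap concrete, here is a simplified sliding-window sketch in plain Python. It is not the algorithm RecursiveCharacterTextSplitter uses (that splitter prefers natural boundaries such as paragraphs, lines and spaces), and sliding_window_chunks is a hypothetical helper introduced only for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    # Each new chunk starts (chunk_size - chunk_overlap) characters after the previous one
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last chunk already reaches the end of the text
    return chunks

demo = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(demo, chunk_size=10, chunk_overlap=4)
print(demo_chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk repeats the last 4 characters of the previous one, which is the role the 200-character overlap plays for the 1000-character chunks here.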


Writing function to split corpus into chunks 

In [4]:
from tqdm import tqdm
import pandas as pd

def chunk_corpus(corpus_data: list) -> list:
    chunked_data = []
    for index, article in tqdm(enumerate(corpus_data), total=len(corpus_data)):
        # Preprocess the body of the article
        body = article.get("body", "")
        preprocessed_body = preprocess_text(body)

        # Split the preprocessed body into chunks
        chunks = text_splitter.split_text(preprocessed_body)
        
        # Create a new dictionary for each chunk
        for chunk in chunks:
            chunked_data.append({
                "index": index,  # Original index in corpus_data
                "title": article.get("title", ""),
                "author": article.get("author", ""),
                "source": article.get("source", ""),
                "published_at": article.get("published_at", ""),
                "category": article.get("category", ""),
                "url": article.get("url", ""),
                "body_chunk": chunk,
                "chunk_character_size": len(chunk),
                "chunk_word_size": len(chunk.split(" ")),
                "chunk_token_size": len(chunk) / 4  # Rough estimate: one token is about 4 characters
            })
    return chunked_data

chunked_data = chunk_corpus(corpus_data)
chunked_data_df = pd.DataFrame(chunked_data)
chunked_data_df.describe().round(2)

100%|██████████| 609/609 [00:01<00:00, 555.21it/s]


         index  chunk_character_size  chunk_word_size  chunk_token_size
count   7864.0                7864.0           7864.0            7864.0
mean     293.4                966.45           165.46            241.61
std     178.63                120.45            23.01             30.11
min        0.0                 198.0             26.0              49.5
25%      136.0                 994.0            161.0             248.5
50%      277.0                 996.0            170.0             249.0
75%      451.0                 998.0            177.0             249.5
max      608.0                1000.0            224.0             250.0
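The chunk_token_size column above is not a tokenizer count. It is the character count divided by 4, a rough rule of thumb for English text, which is why its maximum is exactly 250.0 for a chunk_size of 1000. A minimal sketch, where estimate_tokens is a hypothetical helper and the ratio of 4 is an assumption rather than a measured value:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> float:
    # Heuristic only: a real tokenizer will give different counts
    return len(text) / chars_per_token

longest = estimate_tokens("a" * 1000)   # a chunk at the 1000-character limit
shortest = estimate_tokens("a" * 198)   # the shortest chunk in the table
print(longest, shortest)
# 250.0 49.5
```

For exact counts, the tokenizer of the model in use would have to be applied to each chunk.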


Checking our chunks using random samples

In [5]:
import random 
random.sample(chunked_data, 2)

[{'index': 102,
  'title': 'He’s Hockey’s Brightest Young Star. This Is What Makes His Shot So Special.',
  'author': 'The New York Times',
  'source': 'The New York Times',
  'published_at': '2023-11-17T21:47:19+00:00',
  'category': 'sports',
  'url': 'https://theathletic.com/5028179/2023/11/16/blackhawks-connor-bedard-shot/',
  'body_chunk': 'putting in on the other side of the puck. Even in practice, weve talked to him about maybe tracking harder and attacking pucks on the forecheck and showing him a couple clips, Richardson said. Ten, 12 games in, hes really figuring things out and realizes why sit back and let things come to him? Go get it. None of this is surprising to anyone involved. Bedard doesnt show it if he impresses himself. And as grateful as the Blackhawks are to have drafted Bedard, this is what they expected. This is why teams lined up to take losses last season. Its not the NHL that he was playing in the last few years, Davidson said, but the level of performance and
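random.sample returns different chunks on every run. If the same spot check needs to be repeated, a seeded generator is one option. This is a sketch on stand-in records; the seed value 42 and the records list are assumptions for illustration only.

```python
import random

# Stand-in for chunked_data: a list of small dictionaries
records = [{"index": i} for i in range(100)]

# A dedicated, seeded generator gives the same sample every time
first = random.Random(42).sample(records, 2)
second = random.Random(42).sample(records, 2)

print(first == second)
# True
```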

Embedding body chunks 

In [6]:
from sentence_transformers import SentenceTransformer



Using all-mpnet-base-v2 first; will try a smaller model afterwards.

Using all-MiniLM-L12-v2, as all-mpnet-base-v2 would have been too big.

In [24]:
embedding_model = SentenceTransformer('all-MiniLM-L12-v2')
def embed_chunks(model, chunks: list) -> list:
    for chunk in tqdm(chunks, total=len(chunks)):
        chunk["embedding"] = model.encode(chunk["body_chunk"], batch_size=16,convert_to_tensor=True)
    return chunks



In [25]:
embedding_model.to('cuda')
embedded_chunk_data = embed_chunks(embedding_model, chunked_data)


100%|██████████| 7864/7864 [01:25<00:00, 91.97it/s] 
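In embed_chunks, encode receives one string per call, so batch_size=16 has nothing to group. SentenceTransformer.encode also accepts a list of strings, so the chunks could be passed in batches instead. The sketch below shows the batching logic with a stand-in encoder so that it runs without a model. embed_in_batches and fake_encode are hypothetical names, and no speed-up was measured here.

```python
def embed_in_batches(encode_fn, chunks: list, batch_size: int = 16) -> list:
    # encode_fn takes a list of strings and returns one vector per string
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        vectors = encode_fn([chunk["body_chunk"] for chunk in batch])
        for chunk, vector in zip(batch, vectors):
            chunk["embedding"] = vector
    return chunks

def fake_encode(texts: list) -> list:
    # Stand-in for a real model: a one-dimensional "embedding" holding the text length
    return [[float(len(t))] for t in texts]

demo_chunks = [{"body_chunk": "x" * n} for n in range(1, 6)]
embed_in_batches(fake_encode, demo_chunks, batch_size=2)
print([chunk["embedding"] for chunk in demo_chunks])
# [[1.0], [2.0], [3.0], [4.0], [5.0]]
```

With the real model, encode_fn could be something like lambda texts: embedding_model.encode(texts, batch_size=16, convert_to_tensor=True), which returns one row per input text.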


Inspecting embeddings 

In [26]:
embedded_chunk_data[0]["embedding"]

tensor([-2.0043e-02, -6.0836e-02,  3.9368e-02,  1.4167e-02,  6.0959e-02,
        -5.5071e-03,  4.9471e-02, -3.7021e-02,  7.6204e-03, -2.4665e-02,
         3.2560e-02,  3.7383e-02, -3.0699e-03,  5.8343e-02,  4.5017e-02,
        -3.9857e-02,  3.7575e-02, -1.3631e-01, -2.4427e-02, -3.4500e-03,
        -3.0166e-02,  2.7162e-02, -2.4404e-02, -2.1133e-02,  1.4097e-02,
         1.4758e-02, -5.8497e-02, -4.9087e-02, -9.9336e-02, -4.1075e-02,
        -1.1829e-02,  8.5526e-03,  5.4564e-02,  5.7569e-02, -3.4209e-02,
        -7.0601e-02,  7.2215e-02, -4.4971e-02, -1.4822e-02, -3.7274e-02,
         6.4015e-03, -2.0498e-02, -7.8796e-03,  4.0990e-02,  7.2523e-02,
         8.1320e-02,  2.0978e-02, -7.2389e-03,  3.5576e-02, -5.2985e-03,
         5.1509e-02,  1.8238e-02,  1.4572e-02, -1.4508e-02,  3.3997e-03,
         2.0619e-03, -5.9862e-02, -6.0601e-03,  3.9673e-02, -1.1435e-01,
         8.6996e-02, -7.4820e-02,  4.7117e-02,  4.6229e-02, -9.7422e-02,
        -1.8035e-02, -6.2589e-02,  7.7754e-03, -5.8

In [27]:
embedded_chunk_data_df = pd.DataFrame(embedded_chunk_data)


In [28]:
embedded_chunk_data_df["embedding"].iloc[0].shape

torch.Size([384])

Saving embeddings for future use 

In [30]:
embedded_chunk_data_save_path = "data/embedded_chunk_data_all_Mini.csv"
#embedded_chunk_data_df.to_csv(embedded_chunk_data_save_path,index=False)
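One caveat if the to_csv line is enabled: a column of tensor objects would most likely be written as their printed text form, which is awkward to parse back. A possible workaround is to store each embedding as a JSON list. The sketch below uses only the standard library and a toy row; with real tensors, something like embedding.cpu().tolist() would be needed first, which is an assumption not run here.

```python
import csv
import io
import json

# Toy row standing in for one chunk with a 3-dimensional embedding
rows = [{"index": 0, "embedding": [0.25, -0.5, 1.0]}]

# Write: serialise each embedding to a JSON string inside the CSV cell
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["index", "embedding"])
writer.writeheader()
for row in rows:
    writer.writerow({"index": row["index"], "embedding": json.dumps(row["embedding"])})

# Read: parse the JSON string back into a list of floats
buffer.seek(0)
restored = [
    {"index": int(r["index"]), "embedding": json.loads(r["embedding"])}
    for r in csv.DictReader(buffer)
]
print(restored == rows)
# True
```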