# Vannevar Labs Dataset for NatSec Hackathon 2024

Thanks for participating in the hackathon! This notebook is available at https://vl-nat-sec-hackathon-may-2024.s3.us-east-2.amazonaws.com/vl-data-download.ipynb, and we will be updating it over the course of the week with additional data and resources.

The dataset consists of Russian social media posts from Telegram and VK that relate to current geopolitical events, many of them specifically about events in Ukraine. The data in `attachment_urls` refer to media files that we will provide in the same S3 bucket later this week.

If you have any issues with data access, please email charu@vannevarlabs.com.

In [27]:
!pip install boto3 botocore pandas

In [1]:
import boto3
import pandas as pd
from io import BytesIO
from botocore import UNSIGNED
from botocore.config import Config


# # Create a boto3 session with an anonymous user
# s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))

# bucket_name = 'vl-nat-sec-hackathon-may-2024'
# file_key = 'russia_social_media.csv'

# # Get the object from S3
# response = s3.get_object(Bucket=bucket_name, Key=file_key)

# file_content = response['Body'].read()

# Also available here: https://vl-nat-sec-hackathon-may-2024.s3.us-east-2.amazonaws.com/russia_social_media.csv
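
The commented-out download above reads the object into memory as bytes. A minimal sketch of the remaining step, turning those bytes into a DataFrame, could look like the following. The helper names `csv_bytes_to_frame` and `fetch_public_csv` and the sample columns are illustrative assumptions, not part of the original notebook; the boto3 calls are the same ones used in the commented-out cell.

```python
from io import BytesIO

import pandas as pd


def csv_bytes_to_frame(raw: bytes, nrows=None) -> pd.DataFrame:
    """Parse raw CSV bytes (e.g. response['Body'].read()) into a DataFrame."""
    return pd.read_csv(BytesIO(raw), nrows=nrows)


def fetch_public_csv(bucket: str, key: str, nrows=None) -> pd.DataFrame:
    """Download a CSV from a public bucket without credentials (needs network access)."""
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # UNSIGNED skips request signing, which is what allows anonymous access
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return csv_bytes_to_frame(body, nrows=nrows)


# Offline check with stand-in bytes; the "id" column is made up for illustration
sample = b"id,translation\n1,hello\n2,world\n"
demo = csv_bytes_to_frame(sample)
print(demo)
```

With network access, `fetch_public_csv('vl-nat-sec-hackathon-may-2024', 'russia_social_media.csv', nrows=1000)` would presumably return the first rows of the hackathon file; passing `nrows` keeps parsing cheap, although the whole object is still downloaded first.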



In [34]:
# Load the file content into a pandas DataFrame
readdata = pd.read_csv('../../deftech/russia_social_media.csv', nrows=100000)


In [32]:
len(readdata['translation'])

480501

In [None]:
# Write the first 20 rows of the DataFrame to a CSV file
readdata.head(20).to_csv('first_20_rows.csv', index=False)


In [None]:
import json

# Select the translations that mention the S-300; rows without a translation are skipped (na=False)
res = readdata[readdata['translation'].str.contains('S-300', na=False)]['translation']

# JSON encode the 'res' Series and print
# print(json.dumps(res.head(20).replace("\n", "", regex=True).to_list()))
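
As a small self-contained illustration of the keyword filter above, the sketch below runs the same `str.contains` pattern on a toy frame. The sample rows are invented; only the `translation` column name comes from the notebook. Passing `regex=False` is an optional choice here that treats the keyword as a literal substring rather than a regular expression.

```python
import pandas as pd

# Toy stand-in for the real data; the texts are made up for illustration
posts = pd.DataFrame(
    {
        "translation": [
            "An S-300 battery was moved.",
            None,  # missing translation
            "Weather report",
            "Another S-300\nlaunch",
        ]
    }
)

# na=False turns missing values into False, so the mask is safe to index with
mask = posts["translation"].str.contains("S-300", regex=False, na=False)
hits = posts.loc[mask, "translation"]

# Flatten embedded newlines so each hit prints on one line
flattened = hits.str.replace("\n", " ", regex=False).tolist()
print(flattened)
```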


In [None]:
for x in range(300, 330):  # str() keeps rows with a missing translation from raising
    print(x, str(readdata['translation'][x]).replace("\n", "") + "\n\n")


In [None]:
res = res.head(20)

In [None]:
# Drop missing translations (comparing against a bare `nan` name raises NameError)
readdata['translation'][:10].dropna().tolist()

In [None]:
# res already holds only the translation values, so count it directly
len(res)

In [18]:
import json
import pandas as pd

# Attempt to open a JSON file and handle potential errors
try:
    with open('o.json', 'r') as file:
        data = json.load(file)
    print("JSON file read successfully")
except FileNotFoundError:
    print("File not found. Please check the file name and path.")
except json.JSONDecodeError:
    print("File is not a valid JSON. Please check the file content.")
except Exception as e:
    print(f"An error occurred: {e}")


JSON file read successfully


In [7]:
data[0][0]

{'event': 'NULL', 'description': 'NULL', 'location': 'NULL'}

In [19]:
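# Take the description of the first extracted event for each post
text_data = [x[0]['description'] for x in data if isinstance(x, list) and x]

# text_data = list(readdata['translation'])
# Keep only string descriptions (drops NaN and any other non-string values)
text_data = [x for x in text_data if isinstance(x, str)]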
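
The filter above keeps every string, including placeholder values such as the `'NULL'` seen in `data[0][0]`. If those placeholders are not wanted downstream, one possible sketch is to drop them while remembering each text's original position. The `PLACEHOLDERS` set and the `is_usable_text` helper are assumptions made for this example, and the sample list is invented.

```python
# Assumed placeholder values; adjust to whatever the extraction step actually emits
PLACEHOLDERS = {"", "NULL"}


def is_usable_text(value) -> bool:
    """True only for real strings that are not empty or a known placeholder."""
    if not isinstance(value, str):
        return False  # covers NaN floats and None
    return value.strip() not in PLACEHOLDERS


raw = ["NULL", "Tanks were deployed.", float("nan"), None, "  ", "Shelling reported."]

# Keep (original_index, text) pairs so results can be traced back to the source rows
kept = [(i, text) for i, text in enumerate(raw) if is_usable_text(text)]
print(kept)
```

Keeping the original index is a design choice: once placeholders are removed, positions in the filtered list no longer match positions in `data`.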



In [20]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

from llm import LLM  # LLM comes from an `llm` module that is not shown in this notebook

model = LLM()
# text_data = res
chunk_size = 1000  # number of texts sent per embedding request
embeds = []
for start in range(0, len(text_data), chunk_size):
    chunk = text_data[start:start + chunk_size]
    # text_data was filtered in the previous cell, so no per-chunk filtering is
    # needed and embeds[i] stays aligned with text_data[i]
    embeds.extend([x.embedding for x in model.embed(chunk).data])
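
The chunking pattern in the loop above can be sketched independently of any particular embedding client. In the example below, `batched`, `embed_all`, and `fake_embed` are hypothetical names introduced for illustration; `fake_embed` is a stand-in for the real `model.embed` call, whose exact interface is not shown in this notebook.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
    size: int = 1000,
) -> List[List[float]]:
    """Embed texts chunk by chunk, preserving the input order."""
    vectors: List[List[float]] = []
    for chunk in batched(texts, size):
        result = embed_batch(chunk)
        # Guard against misalignment between texts and vectors
        if len(result) != len(chunk):
            raise ValueError("embedding backend returned a different number of vectors")
        vectors.extend(result)
    return vectors


def fake_embed(chunk: Sequence[str]) -> List[List[float]]:
    """Stand-in embedder: encodes each text by its length."""
    return [[float(len(text)), 1.0] for text in chunk]


vectors = embed_all(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, size=2)
print(len(vectors), vectors[2])
```

The length check matters because the similarity matrix is later indexed back into `text_data`; a silently dropped item would shift every following index.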
    

In [37]:
import pickle

# Save the embeddings to a file
with open('embeds.pkl', 'wb') as file:
    pickle.dump(embeds, file)


In [1]:
import pickle

# Load the embeddings from the file
with open('embeds.pkl', 'rb') as file:
    embeddings = pickle.load(file)


In [None]:
embeddings

In [None]:
# embeds is already a plain list of vectors (see the embedding loop), so there is no .data to unpack
embeddings_list = list(embeds)

In [21]:
from sklearn.metrics.pairwise import cosine_similarity

# info = cosine_similarity(embeddings[:30000])
info = cosine_similarity(embeds)
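
For reference, cosine similarity between two vectors is their dot product divided by the product of their norms, so the matrix computed above has 1.0 on the diagonal and is symmetric. The sketch below reproduces that computation with NumPy alone on a tiny invented input; `cosine_similarity_matrix` is a name chosen for this example and is not part of scikit-learn.

```python
import numpy as np


def cosine_similarity_matrix(vectors) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `vectors`."""
    x = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # avoid dividing by zero for all-zero rows
    unit = x / norms           # each row now has length 1
    return unit @ unit.T       # dot products of unit vectors are cosines


# Rows 0 and 1 point the same way; row 2 is orthogonal to both
sims = cosine_similarity_matrix([[1, 0], [2, 0], [0, 3]])
print(sims)
```

Note that the full matrix grows quadratically with the number of texts, which is probably why the commented-out line above limits the input to the first 30000 embeddings.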


In [22]:
info

array([[1.        , 0.9999989 , 0.99999883, ..., 0.10533007, 0.07223707,
        0.11615046],
       [0.9999989 , 1.        , 0.99999884, ..., 0.10532055, 0.07225019,
        0.11612752],
       [0.99999883, 0.99999884, 1.        , ..., 0.10526928, 0.07219942,
        0.11615007],
       ...,
       [0.10533007, 0.10532055, 0.10526928, ..., 1.        , 0.32176593,
        0.16537877],
       [0.07223707, 0.07225019, 0.07219942, ..., 0.32176593, 1.        ,
        0.19287856],
       [0.11615046, 0.11612752, 0.11615007, ..., 0.16537877, 0.19287856,
        1.        ]])

In [None]:
res = list(res)

In [23]:
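# Greedy grouping: each unused index i collects every j whose similarity to i exceeds 0.8
clustered = {}
used_js = set()
for i, row in enumerate(info):
    clustered[i] = []
    if i in used_js:
        continue  # i already belongs to an earlier group, so it does not seed its own
    for j, value in enumerate(row):
        if value > 0.8 and i != j:
            clustered[i].append(j)
            used_js.add(j)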
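
One property of the loop above is that an index already placed in one group can still be appended to a later group, because only group seeds are checked against `used_js`. If each text should land in at most one group, a stricter variant could look like the following sketch. The function name `greedy_clusters` and the toy matrix are assumptions for illustration; this is one possible reading of the intent, not necessarily the author's.

```python
def greedy_clusters(sim, threshold=0.8):
    """Greedy grouping in which every index joins at most one cluster."""
    assigned = set()
    clusters = {}
    for i, row in enumerate(sim):
        if i in assigned:
            continue  # already a member of an earlier cluster
        members = [
            j
            for j, value in enumerate(row)
            if j != i and j not in assigned and value > threshold
        ]
        if members:
            assigned.add(i)
            assigned.update(members)
            clusters[i] = members
    return clusters


# Index 3 is similar to both 0 and 2, but is only assigned to the first seed (0)
toy = [
    [1.0, 0.9, 0.1, 0.85],
    [0.9, 1.0, 0.1, 0.2],
    [0.1, 0.1, 1.0, 0.95],
    [0.85, 0.2, 0.95, 1.0],
]
clusters = greedy_clusters(toy)
print(clusters)
```

On this toy matrix the notebook's loop would list index 3 under both seed 0 and seed 2, while the stricter variant keeps it only under seed 0. The result still depends on row order, as with any greedy scheme.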


In [24]:
for index, similar_indices in clustered.items():
    if len(similar_indices) > 2:
        print(f"Text at index {index}:")
        print(text_data[index].replace("\n", ""))
        for similar_index in similar_indices:
            print(f"Similar text at index {similar_index}:")
            print(text_data[similar_index].replace("\n", ""))
        print("\n" + "-"*80 + "\n")

Text at index 0:
NULL
Similar text at index 1:
NULL
Similar text at index 2:
NULL
Similar text at index 9:
NULL
Similar text at index 11:
NULL
Similar text at index 31:
NULL
Similar text at index 41:
NULL
Similar text at index 49:
NULL
Similar text at index 52:
NULL
Similar text at index 73:
NULL
Similar text at index 83:
NULL
Similar text at index 86:
NULL
Similar text at index 87:
NULL
Similar text at index 93:
NULL
Similar text at index 100:
NULL
Similar text at index 106:
NULL
Similar text at index 109:
NULL
Similar text at index 124:
NULL
Similar text at index 125:
NULL
Similar text at index 128:
NULL
Similar text at index 138:
NULL
Similar text at index 143:
NULL
Similar text at index 144:
NULL
Similar text at index 145:
NULL
Similar text at index 148:
NULL
Similar text at index 159:
NULL
Similar text at index 162:
NULL
Similar text at index 169:
NULL
Similar text at index 170:
NULL
Similar text at index 180:
NULL
Similar text at index 188:
NULL
Similar text at index 191:
NULL
Si

In [17]:
text_data

['NULL',
 'NULL',
 'NULL',
 'Tanks were deployed against enemy infantry in a landing operation in the Avdeevsky direction.',
 "The most valuable assets of the Russian Black Sea Fleet have been withdrawn from Crimea, leaving behind 'one loser' among the missile carriers. Ukrainian Navy Speaker Pletenchuk made this statement.",
 'The Ukrainian Armed Forces shelled residential buildings in Belgorod, causing significant damage and endangering civilians. The consequences of this attack highlight the ongoing conflict in the region.',
 'The group received long-awaited and expensive plates from Moscow, which arrived quickly. The unpacking process was done carefully due to the fragile nature of the plates. This acquisition marks a new level of assistance provided by the group.',
 'NULL',
 'On March 31, 1814, Russian troops led by Emperor Alexander I triumphantly entered Paris, marking the capture of France as the final battle of the foreign campaign of the Russian army. Napoleon abdicated the t