In [9]:
# !hdfs dfs -get hdfs://harunava/user/leon.kepler/clm/data.json ./data.json

In [1]:
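import json

# The records contain non-ASCII characters (curly quotes), so read as UTF-8 explicitly.
with open("data.json", "r", encoding="utf-8") as f:
    data = json.load(f)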

In [2]:
def format_record(record):
    """Flatten one record into a single 'claim: ..., notes: ...' string."""
    return f"claim: {record['claim']}, notes: {record['notes']}"


str_data = [format_record(record) for record in data]

In [3]:
# Preview the first ten formatted records.
for line in str_data[:10]:
    print(line)

claim: , notes: This is a satire. It may be offensive through linking china to covid, but it is not misinformation as it is humour. 
claim: , notes: The video describes the Alcea rugosa water extract preparation for the treatment of stomach ulcer (collect petals, cover with cold water, leave it overnight) as well as the treatment instructions (take one tablespoon before food for 10-15 days). 
Traditional herbal medicine can not treat stomach ulcers since it is a severe health problem requiring drug treatment (antibiotics in case ulcers were caused by the infection with bacteria Helicobacter pylori).

claim: , notes: DESCRIPTION: the claim, made by candace owens, is that because black Americans commit more crime, that is why they are killed by the police. Her methodology is flawed.  “These quantities can differ enormously: When officers encounter many more white civilians (due to whites’ majority status, for example), the proportion of killings involving black civilians can be small, ev

In [2]:
import os

# Pin this process to GPU 1; set this before cudf/cuml are imported so it takes effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

In [3]:
import cudf
from cudf import DataFrame, Series

from cuml.feature_extraction.text import TfidfVectorizer

In [4]:
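import numpy as np
import pandas as pd
import warnings

# Show full text columns instead of truncating them in DataFrame output.
pd.set_option('display.max_colwidth', 1000000)
# Silence library warnings to keep the notebook output readable.
warnings.filterwarnings('ignore')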

In [5]:
def join_df(path, df_lib=cudf):
    """Load the CSV at `path` into a DataFrame using `df_lib` (cudf by default)."""
    data = df_lib.read_csv(path)
    # Optional: keep only English rows (requires a `lang` column).
    # data = data[data.lang == 'en']
    # An earlier variant looped over os.listdir(path) and concatenated one CSV per file.
    return data

In [6]:
df = join_df('/opt/tiger/workspace/LLM/data.csv')
tweets = Series(df['text'])
len(tweets)

176722

In [7]:
vec = TfidfVectorizer(stop_words='english')

tfidf_matrix = vec.fit_transform(tweets)
tfidf_matrix.shape

(176722, 288685)

In [8]:
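from cuml.common.sparsefuncs import csr_row_normalize_l2


def efficient_csr_cosine_similarity(query, tfidf_matrix, matrix_normalized=False):
    """Cosine similarity between one sparse query row and every row of tfidf_matrix."""
    # With unit-length rows, the dot product equals the cosine similarity.
    query = csr_row_normalize_l2(query, inplace=False)
    if not matrix_normalized:
        tfidf_matrix = csr_row_normalize_l2(tfidf_matrix, inplace=False)

    return tfidf_matrix.dot(query.T)


def document_search(text_df, query, vectorizer, tfidf_matrix, top_n=10):
    """Return the top_n rows of text_df most similar to the query string."""
    query_vec = vectorizer.transform(Series([query]))
    # matrix_normalized=True assumes the vectorizer already L2-normalized the rows
    # (norm='l2', which is TfidfVectorizer's default).
    similarities = efficient_csr_cosine_similarity(query_vec, tfidf_matrix, matrix_normalized=True)
    similarities = similarities.todense().reshape(-1)
    # argsort is ascending: take the last top_n indices and reverse them.
    best_idx = similarities.argsort()[-top_n:][::-1]

    return cudf.DataFrame({
        'text': text_df['text'].iloc[best_idx],
        'similarity': similarities[best_idx]
    })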
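
The search cell above relies on the fact that the dot product of two L2-normalized vectors is their cosine similarity, and then ranks rows with a descending sort. The following is a minimal CPU-only sketch of that idea in plain Python, not the notebook's GPU code. The helper names `l2_normalize` and `top_n_by_cosine` and the toy vectors are assumptions made up for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_n_by_cosine(query, rows, top_n=2):
    """Return (row_index, similarity) pairs, most similar first."""
    q = l2_normalize(query)
    # Dot product of unit vectors == cosine similarity.
    scores = [sum(a * b for a, b in zip(l2_normalize(row), q)) for row in rows]
    order = sorted(range(len(rows)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_n]]


# Toy term-weight vectors (hypothetical numbers, not taken from the dataset).
toy_rows = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
toy_query = [2.0, 2.0, 0.0]

ranked = top_n_by_cosine(toy_query, toy_rows)
print(ranked)
```

Row 1 points in the same direction as the query, so it ranks first even though its magnitude differs; row 2 shares no terms with the query and is dropped from the top two.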

In [17]:
text = "enter their peaceful village and slaughtered women children and the elderly and they did this in forty three different villages in palestine and forced the palestinians to leave the land at gunpoint the real culprit is britain"
document_search(df, text, vec, tfidf_matrix)

Unnamed: 0,text,similarity
174533,"claim: ""Israel was founded in 1948 during the Nakba, when Zionist terrorists invaded 500 Palestinian villages where they killed and raped Palestinians as well as took their lands. "", notes: ""Israel was founded after an event Palestinians refer to as the Nakba. During the Nakba, Zionist militias launched attacks in hundreds of Palestinian villages. There are several reports of Zionist militias killing and raping Palestinians. Thousands of Palestinians were forced to leave their homes. """,0.207645
169866,"claim: ""Israel captured civilian men who they claimed were terrorists and stripped them down to their underwear in the streets. One of the men captured was a journalist named Diaa Al-Khalout, who was arrested at gunpoint and forced to leave his disabled daughter, notes: ""Journalist Diaa-Al Khalout was arrested at gunpoint by Israeli forces and forced to leave his disabled daughter. He was one of the hundred of Palestinian men who were stripped to their underwear and made to lean on a street in Northern Gaza. """,0.18134
85144,"claim: , notes: This clip contains inflammatory language that implies Ukrainian armed forces shelled a peaceful village and did not allow inhabitants to leave. This is a report on Russian state TV channel Russia-24, which is government owned and a known producer of misinformation. However, because the specific village in question is small and now in Russian-held territory, there does not appear to be independent information that would allow us to verify the legitimacy of these claims at the current time. As additional information is available the classification of this video may change.",0.178268
51194,"claim: , notes: This video talks about the current situation in Israel and Palestine, where forced evictions of Palestinians have taken place. Reports of this are true. Some of the first hand accounts cannot be fact checked as events are ongoing.",0.167036
57994,"claim: , notes: This video suggests that Myanmar junta forces launched airstrikes on a village in Ye-U township, Sagaing on December 20, 2021. Based on local media reports, this is not misinformation. The Burmese-language post translates in English as: “No media has reported about Ye-U township. Terrorist military council fired airstrikes on Yemyat village in Yae-U Township and set the whole village on fire. Villagers got wounded.” RFA reported about the incident that there were junta’s airstrikes in Yemyat village around 1PM on December 20. Some houses in villages were burned due to air attack, the report said. Junta controlled state media also issued a statement saying the military forces attacked “KIA, NLD, PDF terrorist” in the village. AFP was not able to independently confirm the damage in the village.\n",0.160251
652,"claim: , notes: Claim: In Palestine, ""they"" are abducting and raping women. There have been many human rights abuses in Palestine, including violence against women and girls. Without knowing what the ""they"" refers to, it's hard to check this.",0.159727
82140,"claim: , notes: opinion, Israel Palestine",0.156724
81107,"claim: , notes: opinion, Israel/Palestine",0.156724
97279,"claim: Russians took 40 villages in east Ukraine., notes: According to a report by Reuters, Russian forces have captured 42 villages in the eastern Donetsk region but Ukrainian officials say that they might take them back. This claim is unconfirmed, more information over this topic is yet to be revealed.",0.156637
1974,"claim: , notes: In this video, a man makes claims that the Trump supporters on Jan. 6 were peaceful. PolitiFact rated a claim that protests were peaceful Pants on Fire! ​​ Misinformation, factually inaccurate. \n",0.154043
