## __ICPR 2024 Competition on Multilingual Claim-Span Identification__

### Installing Dependencies

In [None]:
!pip install indic-nlp-library



### Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
import string
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
from PIL import Image
from collections import Counter
from wordcloud import WordCloud, ImageColorGenerator, STOPWORDS
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

from indicnlp.tokenize import indic_tokenize

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Loading Datasets

In [None]:
# Training English and Hindi datasets
train_en = pd.read_json("/content/drive/MyDrive/Multilingual Datasets/Encoded/train-en_encoded.json")
train_hi = pd.read_json("/content/drive/MyDrive/Multilingual Datasets/Encoded/train-hi_encoded.json")


# Validation English and Hindi datasets
val_en = pd.read_json("/content/drive/MyDrive/Multilingual Datasets/Encoded/val-en_encoded.json")
val_hi = pd.read_json("/content/drive/MyDrive/Multilingual Datasets/Encoded/val-hi_encoded.json")

### Explore English Training Data

In [None]:
train_en.head()

Unnamed: 0,index,claims,text_tokens,claims_encoded
0,500,[],"[#VAERS, 17y, ♂, ️, #Pfizer, #Covidvaccine, #S...",0
1,501,"[{'index': 0, 'start': 0, 'end': 29, 'terms': ...","[We've, truly, come, a, long, way, from, Decem...",1
2,502,"[{'index': 0, 'start': 8, 'end': 16, 'terms': ...","[Fuck, that, ., Its, not, faux, outrage, ., In...",1
3,503,"[{'index': 0, 'start': 7, 'end': 21, 'terms': ...","[@U55750420, Which, ..., makes, no, sense, ., ...",1
4,504,"[{'index': 0, 'start': 8, 'end': 18, 'terms': ...","[Fact, or, Fiction, ,, you, decide, :, The, up...",1


In [None]:
# Review the data
train_en.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5999 entries, 0 to 5998
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   index           5999 non-null   int64 
 1   claims          5999 non-null   object
 2   text_tokens     5999 non-null   object
 3   claims_encoded  5999 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 187.6+ KB


In [None]:
#Checking Missing Values

train_en.isnull().sum()

index             0
claims            0
text_tokens       0
claims_encoded    0
dtype: int64

In [None]:
# Class counts (1 = tweet contains at least one claim, 0 = no claim)

train_en['claims_encoded'].value_counts()

claims_encoded
1    4940
0    1059
Name: count, dtype: int64

In [None]:
# Share of tweets that contain at least one claim
# (claims_encoded is a 0/1 flag, so its mean is a proportion, not a claim count)
claim_share = train_en['claims_encoded'].mean()
claim_share

0.8234705784297383
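
As a quick check on how to read this number, the sketch below rebuilds the 0/1 column from the class counts printed above and shows that its mean is simply the share of tweets that contain a claim. The plain list is a stand-in for `train_en['claims_encoded']`, and the variable names are illustrative only.

```python
# Class counts reported by value_counts() for the English training set
n_claim, n_no_claim = 4940, 1059

# Stand-in for train_en['claims_encoded']: a list of 0/1 flags
labels = [1] * n_claim + [0] * n_no_claim

# What .mean() computes on a 0/1 column
mean_label = sum(labels) / len(labels)

# Share of tweets flagged as containing a claim
claim_share = n_claim / (n_claim + n_no_claim)

print(f"mean of labels: {mean_label:.4f}")
print(f"claim share:    {claim_share:.4f}")
```

Roughly four out of five tweets are flagged as containing a claim, so the two classes are clearly imbalanced.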

#### Cleaning the Tweet Text

The cleaning function below applies four common text preprocessing steps to each tweet:

**Converting text to lowercase:** Standardizes the text so that the same word in different casing maps to a single form.

**Removing URLs:** Links rarely contribute to the semantic content of a tweet, so anything starting with `http` is dropped.

**Removing punctuation:** Punctuation marks and other symbols (including emoji) are removed to reduce noise. The `#` and `@` characters are kept so that hashtags and mentions stay recognizable.

**Removing extra whitespace:** Runs of whitespace are collapsed to a single space, and leading and trailing spaces are trimmed.

Whether these steps are sufficient depends on the task and on the characteristics of the data. Evaluate the effect of preprocessing on model performance and adjust the steps based on those experiments.

In [None]:
def clean_text_tokens(text_tokens):
    # Join the list of tokens into a single string
    text_string = ' '.join(text_tokens)

    # Convert the text to lowercase
    cleaned_text = text_string.lower()

    # Remove URLs
    cleaned_text = re.sub(r'http\S+', '', cleaned_text)

    # Remove punctuation except for hashtags and mentions
    cleaned_text = re.sub(r'[^\w\s#@]', '', cleaned_text)

    # Remove extra whitespaces
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()

    return cleaned_text

# Assuming train_en is your DataFrame and 'text_tokens' is the column to be cleaned
train_en['tokens_clean'] = train_en['text_tokens'].apply(clean_text_tokens)
train_en.head()

Unnamed: 0,index,claims,text_tokens,claims_encoded,tokens_clean
0,500,[],"[#VAERS, 17y, ♂, ️, #Pfizer, #Covidvaccine, #S...",0,#vaers 17y #pfizer #covidvaccine #suicide atte...
1,501,"[{'index': 0, 'start': 0, 'end': 29, 'terms': ...","[We've, truly, come, a, long, way, from, Decem...",1,weve truly come a long way from december and j...
2,502,"[{'index': 0, 'start': 8, 'end': 16, 'terms': ...","[Fuck, that, ., Its, not, faux, outrage, ., In...",1,fuck that its not faux outrage inject them wit...
3,503,"[{'index': 0, 'start': 7, 'end': 21, 'terms': ...","[@U55750420, Which, ..., makes, no, sense, ., ...",1,@u55750420 which makes no sense the vaccine ca...
4,504,"[{'index': 0, 'start': 8, 'end': 18, 'terms': ...","[Fact, or, Fiction, ,, you, decide, :, The, up...",1,fact or fiction you decide the upcoming corona...
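
To make the effect of each step visible, the following sketch applies the same regular expressions as `clean_text_tokens` one at a time and records the intermediate strings. The helper name `trace_cleaning` and the sample tokens are made up for illustration.

```python
import re

def trace_cleaning(text_tokens):
    """Return the intermediate string after each cleaning step."""
    stages = {}
    stages["joined"] = ' '.join(text_tokens)
    stages["lowercased"] = stages["joined"].lower()
    # Drop anything that starts with "http" up to the next whitespace
    stages["no_urls"] = re.sub(r'http\S+', '', stages["lowercased"])
    # Keep word characters, whitespace, '#' and '@'; drop everything else
    stages["no_punct"] = re.sub(r'[^\w\s#@]', '', stages["no_urls"])
    # Collapse whitespace runs and trim the ends
    stages["final"] = re.sub(r'\s+', ' ', stages["no_punct"]).strip()
    return stages

# Hypothetical tweet tokens
sample = ["Check", "https://t.co/abc", "#Pfizer", "@U123", ",", "it's", "SAFE", "!"]

for name, value in trace_cleaning(sample).items():
    print(f"{name:>10}: {value!r}")
```

Note that removing the apostrophe merges contractions (`it's` becomes `its`), which is the same effect that turns `We've` into `weve` in the table above.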


In [None]:
# Check for duplicate tweets (compared on the cleaned text)

train_en["tokens_clean"].duplicated().sum()

3
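
Because duplicates are detected on the cleaned text, tweets that differ only in casing, punctuation, or URLs count as the same tweet. The sketch below, built on made-up rows, shows this and also checks whether any duplicated text carries conflicting labels, which may be worth inspecting before dropping rows. The `clean` helper repeats the steps of `clean_text_tokens`; the other names are illustrative.

```python
import re
from collections import defaultdict

def clean(tokens):
    """Same four steps as clean_text_tokens."""
    text = ' '.join(tokens).lower()
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'[^\w\s#@]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical rows: (token list, claims_encoded)
rows = [
    (["Vaccines", "work", "!"], 1),
    (["vaccines", "work", "https://t.co/xyz"], 1),  # differs only by case, URL, punctuation
    (["Stay", "home", "."], 0),
    (["stay", "home"], 1),                          # same cleaned text, different label
]

# Collect the labels seen for each cleaned text
labels_by_text = defaultdict(list)
for tokens, label in rows:
    labels_by_text[clean(tokens)].append(label)

# Texts that occur more than once, and the subset with disagreeing labels
duplicates = {text: labels for text, labels in labels_by_text.items() if len(labels) > 1}
conflicts = {text: labels for text, labels in duplicates.items() if len(set(labels)) > 1}

print("duplicated texts:", list(duplicates))
print("conflicting labels:", conflicts)
```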

In [None]:
# Drop tweets whose cleaned text is duplicated (the first occurrence is kept),
# then re-check the class balance

train_en.drop_duplicates("tokens_clean", inplace=True)

train_en.claims_encoded.value_counts()

claims_encoded
1    4938
0    1058
Name: count, dtype: int64

In [None]:
train_en.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5996 entries, 0 to 5998
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   index           5996 non-null   int64 
 1   claims          5996 non-null   object
 2   text_tokens     5996 non-null   object
 3   claims_encoded  5996 non-null   int64 
 4   tokens_clean    5996 non-null   object
dtypes: int64(2), object(3)
memory usage: 281.1+ KB


In [None]:
# Dropping columns

columns_to_drop = ['index', 'claims', 'text_tokens']
train_en = train_en.drop(columns=columns_to_drop)

train_en.head()

Unnamed: 0,claims_encoded,tokens_clean
0,0,#vaers 17y #pfizer #covidvaccine #suicide atte...
1,1,weve truly come a long way from december and j...
2,1,fuck that its not faux outrage inject them wit...
3,1,@u55750420 which makes no sense the vaccine ca...
4,1,fact or fiction you decide the upcoming corona...


In [None]:
# Renaming the columns for better understanding

df_train_en = train_en.rename(columns={'claims_encoded': 'claims', 'tokens_clean': 'text_tokens'})
df_train_en.head()

Unnamed: 0,claims,text_tokens
0,0,#vaers 17y #pfizer #covidvaccine #suicide atte...
1,1,weve truly come a long way from december and j...
2,1,fuck that its not faux outrage inject them wit...
3,1,@u55750420 which makes no sense the vaccine ca...
4,1,fact or fiction you decide the upcoming corona...


In [None]:
# Saving DataFrame to a JSON file
df_train_en.to_json('train_en_encoded_clean.json', orient='records')

### Explore Hindi Training Data

In [None]:
train_hi.head()

Unnamed: 0,index,claims,text_tokens,claims_encoded
0,500,[],"[भाइयों, इसको, प्रशासन, कहे, ,, कि, कुशासन, कह...",0
1,501,"[{'index': 0, 'start': 4, 'end': 31, 'terms': ...","[मौसम, विभाग, के, मुताबिक, अगले, 24, घंटे, में...",1
2,502,"[{'index': 0, 'start': 0, 'end': 37, 'terms': ...","[योगी, सरकार, मे, 50, लाख, अधिक, बच्चे, स्कूल,...",1
3,503,[],"[@U45195860, @U84700880, तुमलोग, कितने, भी, पढ...",0
4,504,"[{'index': 0, 'start': 0, 'end': 39, 'terms': ...","[दिल्ली, का, एक, ', नमूना, ', उत्तर, प्रदेश, आ...",1


In [None]:
# Review the data
train_hi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6098 entries, 0 to 6097
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   index           6098 non-null   int64 
 1   claims          6098 non-null   object
 2   text_tokens     6098 non-null   object
 3   claims_encoded  6098 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 190.7+ KB


In [None]:
#Checking Missing Values

train_hi.isnull().sum()

index             0
claims            0
text_tokens       0
claims_encoded    0
dtype: int64

In [None]:
# Class counts

train_hi['claims_encoded'].value_counts()

claims_encoded
1    4879
0    1219
Name: count, dtype: int64

In [None]:
# Share of tweets that contain at least one claim
claim_share = train_hi['claims_encoded'].mean()
claim_share

0.80009839291571

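The function below cleans the Hindi token lists: it removes URLs, re-tokenizes the text with `indic_tokenize.trivial_tokenize`, drops punctuation tokens, and joins the remaining tokens with single spaces. Unlike the English version, it does not lowercase the text. Because the tokenizer separates `#` and `@` from the word that follows, and those two characters are kept, mentions and hashtags appear split in the output (for example `@ U45195860`). Adjust the list of punctuation marks to suit your requirements.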

In [None]:
# Hindi Language Text Cleaning

def clean_text_tokens(text_tokens):
    # Join the list of tokens into a single string
    text_string = ' '.join(text_tokens)

    # Remove URLs
    cleaned_text = re.sub(r'http\S+', '', text_string)

    # Tokenize Hindi text
    tokens = indic_tokenize.trivial_tokenize(cleaned_text, lang='hi')

    # Remove punctuation except for hashtags and mentions
    # Adjust punctuation marks as needed for Hindi
    hindi_punctuation = '।!"\'#$%&()*+,-./:;<=>?@[\\]^_`{|}~'

    # Exclude "#" and "@" from removal as they might be used for hashtags and mentions
    cleaned_tokens = [token for token in tokens if token not in hindi_punctuation or token in ['#', '@']]

    # Join cleaned tokens back into a string
    cleaned_text = ' '.join(cleaned_tokens)

    return cleaned_text

# Assuming train_hi is your DataFrame and 'text_tokens' is the column to be cleaned
train_hi['tokens_clean'] = train_hi['text_tokens'].apply(clean_text_tokens)
train_hi.head()

Unnamed: 0,index,claims,text_tokens,claims_encoded,tokens_clean
0,500,[],"[भाइयों, इसको, प्रशासन, कहे, ,, कि, कुशासन, कह...",0,भाइयों इसको प्रशासन कहे कि कुशासन कहें कि दुशा...
1,501,"[{'index': 0, 'start': 4, 'end': 31, 'terms': ...","[मौसम, विभाग, के, मुताबिक, अगले, 24, घंटे, में...",1,मौसम विभाग के मुताबिक अगले 24 घंटे में पश्चिम ...
2,502,"[{'index': 0, 'start': 0, 'end': 37, 'terms': ...","[योगी, सरकार, मे, 50, लाख, अधिक, बच्चे, स्कूल,...",1,योगी सरकार मे 50 लाख अधिक बच्चे स्कूल पहुंचे 9...
3,503,[],"[@U45195860, @U84700880, तुमलोग, कितने, भी, पढ...",0,@ U45195860 @ U84700880 तुमलोग कितने भी पढ़ लि...
4,504,"[{'index': 0, 'start': 0, 'end': 39, 'terms': ...","[दिल्ली, का, एक, ', नमूना, ', उत्तर, प्रदेश, आ...",1,दिल्ली का एक नमूना उत्तर प्रदेश आता है और कहता...
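
If the split mentions (`@ U45195860`) are unwanted, one possible alternative is to filter the existing token list directly instead of joining and re-tokenizing it. The sketch below is not the approach used in this notebook: `clean_pretokenized` is a hypothetical helper, and it assumes each URL arrives as a single token and that punctuation to be removed is already a separate token.

```python
# Same punctuation characters as hindi_punctuation in the cell above
HINDI_PUNCTUATION = set('।!"\'#$%&()*+,-./:;<=>?@[\\]^_`{|}~')

def clean_pretokenized(text_tokens):
    """Filter an already-tokenized tweet without re-tokenizing it (hypothetical helper)."""
    kept = []
    for token in text_tokens:
        if token.startswith('http'):
            continue  # drop URL tokens
        if all(ch in HINDI_PUNCTUATION for ch in token):
            continue  # drop tokens made only of punctuation (also drops empty tokens)
        kept.append(token)
    return ' '.join(kept)

# Hypothetical token list in the style of train_hi['text_tokens']
sample = ["@U45195860", "तुमलोग", ",", "कितने", "भी", "।", "https://t.co/xyz"]
print(clean_pretokenized(sample))
```

This keeps `@U45195860` as one token. It differs from the cell above in two ways: a bare `#` or `@` token is dropped rather than kept, and punctuation inside a longer token is left untouched.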


In [None]:
# Check for duplicate tweets (compared on the cleaned text)

train_hi["tokens_clean"].duplicated().sum()

26

In [None]:
# Drop tweets whose cleaned text is duplicated (the first occurrence is kept),
# then re-check the class balance

train_hi.drop_duplicates("tokens_clean", inplace=True)

train_hi.claims_encoded.value_counts()

claims_encoded
1    4862
0    1210
Name: count, dtype: int64

In [None]:
train_hi.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6072 entries, 0 to 6097
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   index           6072 non-null   int64 
 1   claims          6072 non-null   object
 2   text_tokens     6072 non-null   object
 3   claims_encoded  6072 non-null   int64 
 4   tokens_clean    6072 non-null   object
dtypes: int64(2), object(3)
memory usage: 284.6+ KB


In [None]:
# Dropping columns

columns_to_drop = ['index', 'claims', 'text_tokens']
train_hi = train_hi.drop(columns=columns_to_drop)

train_hi.head()

Unnamed: 0,claims_encoded,tokens_clean
0,0,भाइयों इसको प्रशासन कहे कि कुशासन कहें कि दुशा...
1,1,मौसम विभाग के मुताबिक अगले 24 घंटे में पश्चिम ...
2,1,योगी सरकार मे 50 लाख अधिक बच्चे स्कूल पहुंचे 9...
3,0,@ U45195860 @ U84700880 तुमलोग कितने भी पढ़ लि...
4,1,दिल्ली का एक नमूना उत्तर प्रदेश आता है और कहता...


In [None]:
# Renaming the columns for better understanding

df_train_hi = train_hi.rename(columns={'claims_encoded': 'claims', 'tokens_clean': 'text_tokens'})
df_train_hi.head()

Unnamed: 0,claims,text_tokens
0,0,भाइयों इसको प्रशासन कहे कि कुशासन कहें कि दुशा...
1,1,मौसम विभाग के मुताबिक अगले 24 घंटे में पश्चिम ...
2,1,योगी सरकार मे 50 लाख अधिक बच्चे स्कूल पहुंचे 9...
3,0,@ U45195860 @ U84700880 तुमलोग कितने भी पढ़ लि...
4,1,दिल्ली का एक नमूना उत्तर प्रदेश आता है और कहता...


In [None]:
# Saving DataFrame to a JSON file
df_train_hi.to_json('train_hi_encoded_clean.json', orient='records')
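
With `orient='records'` the file is a JSON array with one object per row. pandas escapes non-ASCII characters by default (`force_ascii=True`), so the Hindi text is stored as `\uXXXX` sequences and is restored when the file is parsed. The sketch below mimics that round trip with the standard `json` module only; the record shown is a made-up example, and `ensure_ascii=True` stands in for the pandas default.

```python
import json

# Hypothetical record in the shape produced by to_json(orient='records')
records = [{"claims": 1, "text_tokens": "मौसम विभाग के मुताबिक"}]

# ensure_ascii=True escapes Devanagari characters, as pandas does by default
payload = json.dumps(records, ensure_ascii=True)
print(payload[:60])

# Parsing the payload restores the original text
restored = json.loads(payload)
print(restored[0]["text_tokens"])
```

The escaped form is harder to read in a text editor but loses no information.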

### Explore English Validation Data

In [None]:
val_en.head()

Unnamed: 0,index,claims,text_tokens,claims_encoded
0,0,"[{'index': 0, 'start': 3, 'end': 9, 'terms': '...","[Listen, people, ., The, vaccines, work, ., Th...",1
1,1,[],"[I, ’, ll, take, the, covid, vaccine, in, 5, y...",0
2,2,[],"[@U72240666, Hearing, that, the, Pope, says, r...",0
3,3,"[{'index': 0, 'start': 0, 'end': 7, 'terms': '...","[Trump, has, got, the, new, russian, vaccine, ...",1
4,4,"[{'index': 0, 'start': 20, 'end': 30, 'terms':...","[@U41390182, @U31519664, Gotta, agree, ., We, ...",1


In [None]:
# Review the data
val_en.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   index           500 non-null    int64 
 1   claims          500 non-null    object
 2   text_tokens     500 non-null    object
 3   claims_encoded  500 non-null    int64 
dtypes: int64(2), object(2)
memory usage: 15.8+ KB


In [None]:
#Checking Missing Values

val_en.isnull().sum()

index             0
claims            0
text_tokens       0
claims_encoded    0
dtype: int64

In [None]:
# Class counts

val_en['claims_encoded'].value_counts()

claims_encoded
1    407
0     93
Name: count, dtype: int64

In [None]:
# Share of tweets that contain at least one claim
claim_share = val_en['claims_encoded'].mean()
claim_share

0.814

In [None]:
# Re-define the English cleaner: the Hindi cell above reused the name clean_text_tokens
def clean_text_tokens(text_tokens):
    # Join the list of tokens into a single string
    text_string = ' '.join(text_tokens)

    # Convert the text to lowercase
    cleaned_text = text_string.lower()

    # Remove URLs
    cleaned_text = re.sub(r'http\S+', '', cleaned_text)

    # Remove punctuation except for hashtags and mentions
    cleaned_text = re.sub(r'[^\w\s#@]', '', cleaned_text)

    # Remove extra whitespaces
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()

    return cleaned_text

# Clean the 'text_tokens' column of the English validation set
val_en['tokens_clean'] = val_en['text_tokens'].apply(clean_text_tokens)
val_en.head()

Unnamed: 0,index,claims,text_tokens,claims_encoded,tokens_clean
0,0,"[{'index': 0, 'start': 3, 'end': 9, 'terms': '...","[Listen, people, ., The, vaccines, work, ., Th...",1,listen people the vaccines work they work that...
1,1,[],"[I, ’, ll, take, the, covid, vaccine, in, 5, y...",0,i ll take the covid vaccine in 5 years after a...
2,2,[],"[@U72240666, Hearing, that, the, Pope, says, r...",0,@u72240666 hearing that the pope says refusing...
3,3,"[{'index': 0, 'start': 0, 'end': 7, 'terms': '...","[Trump, has, got, the, new, russian, vaccine, ...",1,trump has got the new russian vaccine and he c...
4,4,"[{'index': 0, 'start': 20, 'end': 30, 'terms':...","[@U41390182, @U31519664, Gotta, agree, ., We, ...",1,@u41390182 @u31519664 gotta agree we already i...


In [None]:
# Check for duplicate tweets (compared on the cleaned text)

val_en["tokens_clean"].duplicated().sum()

1

In [None]:
# Drop tweets whose cleaned text is duplicated (the first occurrence is kept),
# then re-check the class balance

val_en.drop_duplicates("tokens_clean", inplace=True)

val_en.claims_encoded.value_counts()

claims_encoded
1    407
0     92
Name: count, dtype: int64

In [None]:
# Dropping columns

columns_to_drop = ['index', 'claims', 'text_tokens']
val_en = val_en.drop(columns=columns_to_drop)

val_en.head()

Unnamed: 0,claims_encoded,tokens_clean
0,1,listen people the vaccines work they work that...
1,0,i ll take the covid vaccine in 5 years after a...
2,0,@u72240666 hearing that the pope says refusing...
3,1,trump has got the new russian vaccine and he c...
4,1,@u41390182 @u31519664 gotta agree we already i...


In [None]:
val_en.info()

<class 'pandas.core.frame.DataFrame'>
Index: 499 entries, 0 to 499
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   claims_encoded  499 non-null    int64 
 1   tokens_clean    499 non-null    object
dtypes: int64(1), object(1)
memory usage: 11.7+ KB


In [None]:
# Renaming the columns for better understanding

df_val_en = val_en.rename(columns={'claims_encoded': 'claims', 'tokens_clean': 'text_tokens'})
df_val_en.head()

Unnamed: 0,claims,text_tokens
0,1,listen people the vaccines work they work that...
1,0,i ll take the covid vaccine in 5 years after a...
2,0,@u72240666 hearing that the pope says refusing...
3,1,trump has got the new russian vaccine and he c...
4,1,@u41390182 @u31519664 gotta agree we already i...


In [None]:
# Saving DataFrame to a JSON file
df_val_en.to_json('val_en_encoded_clean.json', orient='records')

### Explore Hindi Validation Data

In [None]:
val_hi.head()

Unnamed: 0,index,claims,text_tokens,claims_encoded
0,0,"[{'index': 0, 'start': 0, 'end': 16, 'terms': ...","[पतंग, के, साथ, हवा, में, उड़ती, बच्ची, का, वी...",1
1,1,"[{'index': 0, 'start': 0, 'end': 16, 'terms': ...","[4, सितंबर, 2020, को, मध्यप्रदेश, के, अलग, -, ...",1
2,2,"[{'index': 0, 'start': 0, 'end': 20, 'terms': ...","[NCB, ने, रिया, से, दूसरे, राउंड, की, पूछताछ, ...",1
3,3,"[{'index': 0, 'start': 6, 'end': 12, 'terms': ...","[अम्मी, जान, कहती, थी, -, "", कोई, भी, धंधा, बु...",1
4,4,"[{'index': 0, 'start': 0, 'end': 18, 'terms': ...","[पश्चिम, बंगाल, में, हिंदूओं, पर, ज़ुल्म, हुगल...",1


In [None]:
# Review the data
val_hi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   index           500 non-null    int64 
 1   claims          500 non-null    object
 2   text_tokens     500 non-null    object
 3   claims_encoded  500 non-null    int64 
dtypes: int64(2), object(2)
memory usage: 15.8+ KB


In [None]:
#Checking Missing Values

val_hi.isnull().sum()

index             0
claims            0
text_tokens       0
claims_encoded    0
dtype: int64

In [None]:
# Class counts

val_hi['claims_encoded'].value_counts()

claims_encoded
1    409
0     91
Name: count, dtype: int64

In [None]:
# Share of tweets that contain at least one claim
claim_share = val_hi['claims_encoded'].mean()
claim_share

0.818

In [None]:
# Hindi Language Text Cleaning

def clean_text_tokens(text_tokens):
    # Join the list of tokens into a single string
    text_string = ' '.join(text_tokens)

    # Remove URLs
    cleaned_text = re.sub(r'http\S+', '', text_string)

    # Tokenize Hindi text
    tokens = indic_tokenize.trivial_tokenize(cleaned_text, lang='hi')

    # Remove punctuation except for hashtags and mentions
    # Adjust punctuation marks as needed for Hindi
    hindi_punctuation = '।!"\'#$%&()*+,-./:;<=>?@[\\]^_`{|}~'

    # Exclude "#" and "@" from removal as they might be used for hashtags and mentions
    cleaned_tokens = [token for token in tokens if token not in hindi_punctuation or token in ['#', '@']]

    # Join cleaned tokens back into a string
    cleaned_text = ' '.join(cleaned_tokens)

    return cleaned_text

# Clean the 'text_tokens' column of the Hindi validation set
val_hi['tokens_clean'] = val_hi['text_tokens'].apply(clean_text_tokens)
val_hi.head()

Unnamed: 0,index,claims,text_tokens,claims_encoded,tokens_clean
0,0,"[{'index': 0, 'start': 0, 'end': 16, 'terms': ...","[पतंग, के, साथ, हवा, में, उड़ती, बच्ची, का, वी...",1,पतंग के साथ हवा में उड़ती बच्ची का वीडियो सोशल...
1,1,"[{'index': 0, 'start': 0, 'end': 16, 'terms': ...","[4, सितंबर, 2020, को, मध्यप्रदेश, के, अलग, -, ...",1,4 सितंबर 2020 को मध्यप्रदेश के अलग अलग क्षेत्र...
2,2,"[{'index': 0, 'start': 0, 'end': 20, 'terms': ...","[NCB, ने, रिया, से, दूसरे, राउंड, की, पूछताछ, ...",1,NCB ने रिया से दूसरे राउंड की पूछताछ शुरू की आ...
3,3,"[{'index': 0, 'start': 6, 'end': 12, 'terms': ...","[अम्मी, जान, कहती, थी, -, "", कोई, भी, धंधा, बु...",1,अम्मी जान कहती थी कोई भी धंधा बुरा नहीं होता औ...
4,4,"[{'index': 0, 'start': 0, 'end': 18, 'terms': ...","[पश्चिम, बंगाल, में, हिंदूओं, पर, ज़ुल्म, हुगल...",1,पश्चिम बंगाल में हिंदूओं पर ज़ुल्म हुगली के ते...


In [None]:
# Check for duplicate tweets (compared on the cleaned text)

val_hi["tokens_clean"].duplicated().sum()

0

In [None]:
# Dropping columns

columns_to_drop = ['index', 'claims', 'text_tokens']
val_hi = val_hi.drop(columns=columns_to_drop)

val_hi.head()

Unnamed: 0,claims_encoded,tokens_clean
0,1,पतंग के साथ हवा में उड़ती बच्ची का वीडियो सोशल...
1,1,4 सितंबर 2020 को मध्यप्रदेश के अलग अलग क्षेत्र...
2,1,NCB ने रिया से दूसरे राउंड की पूछताछ शुरू की आ...
3,1,अम्मी जान कहती थी कोई भी धंधा बुरा नहीं होता औ...
4,1,पश्चिम बंगाल में हिंदूओं पर ज़ुल्म हुगली के ते...


In [None]:
# Renaming the columns for better understanding

df_val_hi = val_hi.rename(columns={'claims_encoded': 'claims', 'tokens_clean': 'text_tokens'})
df_val_hi.head()

Unnamed: 0,claims,text_tokens
0,1,पतंग के साथ हवा में उड़ती बच्ची का वीडियो सोशल...
1,1,4 सितंबर 2020 को मध्यप्रदेश के अलग अलग क्षेत्र...
2,1,NCB ने रिया से दूसरे राउंड की पूछताछ शुरू की आ...
3,1,अम्मी जान कहती थी कोई भी धंधा बुरा नहीं होता औ...
4,1,पश्चिम बंगाल में हिंदूओं पर ज़ुल्म हुगली के ते...


In [None]:
# Saving DataFrame to a JSON file
df_val_hi.to_json('val_hi_encoded_clean.json', orient='records')

## Merging Training English and Hindi Datasets

In [None]:
# Merge the English and Hindi training sets
train_en_hi = pd.concat([df_train_en, df_train_hi], ignore_index = True)
train_en_hi.shape

(12068, 2)
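
A simple sanity check on the merge is that the combined row count equals the sum of the de-duplicated parts, and that `ignore_index=True` renumbers the rows from 0. The sketch below redoes that arithmetic with the counts reported by the `info()` and duplicate-check outputs above; the variable names are illustrative.

```python
# Row counts after de-duplication, taken from the outputs above
n_train_en, n_train_hi = 5996, 6072
n_val_en, n_val_hi = 499, 500      # the Hindi validation set had no duplicates

expected_train_rows = n_train_en + n_train_hi
expected_val_rows = n_val_en + n_val_hi

# With ignore_index=True the merged index runs from 0 to n - 1,
# so the first Hindi training row sits at position n_train_en
first_hindi_position = n_train_en
last_train_position = expected_train_rows - 1

print(expected_train_rows, expected_val_rows)
print(first_hindi_position, last_train_position)
```

The merged frames keep the English rows first and the Hindi rows after them, so shuffle before training if row order matters for your model.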

In [None]:
train_en_hi.head()

Unnamed: 0,claims,text_tokens
0,0,#vaers 17y #pfizer #covidvaccine #suicide atte...
1,1,weve truly come a long way from december and j...
2,1,fuck that its not faux outrage inject them wit...
3,1,@u55750420 which makes no sense the vaccine ca...
4,1,fact or fiction you decide the upcoming corona...


In [None]:
train_en_hi.tail()

Unnamed: 0,claims,text_tokens
12063,1,Vodafone Idea ने पेश किया 46 रुपये का नया प्ला...
12064,1,यह इटली का फोटो है जहां कोरोनावायरस के चलते 20...
12065,0,जब अपनें ही शामिल होते है दुश्मनों की चाल में ...
12066,0,6 साल जिसे कोसने के बाद खूब बुरा भला बोलने के ...
12067,1,गरीबी पर वार होगा सपना ये साकार होगा कांग्रेस ...


In [None]:
# Saving DataFrame to a JSON file
train_en_hi.to_json('training_en_hi_encoded_clean.json', orient='records')

## Merging Validation English and Hindi Datasets

In [None]:
# Merge the English and Hindi validation sets
val_en_hi = pd.concat([df_val_en, df_val_hi], ignore_index = True)
val_en_hi.shape

(999, 2)

In [None]:
val_en_hi.head()

Unnamed: 0,claims,text_tokens
0,1,listen people the vaccines work they work that...
1,0,i ll take the covid vaccine in 5 years after a...
2,0,@u72240666 hearing that the pope says refusing...
3,1,trump has got the new russian vaccine and he c...
4,1,@u41390182 @u31519664 gotta agree we already i...


In [None]:
val_en_hi.tail()

Unnamed: 0,claims,text_tokens
994,1,आज मुरैना जिले में ऑक्सीजन जाँच अभियान चलाया ग...
995,1,# MatoShree दाऊद के नाम पर फोन आने के बाद महार...
996,1,IPL छोड़ आए सुरेश रैना ने तोड़ी चुप्पी बताई अप...
997,0,हम तो सच बोलते हैं किसी किसी को पैलेट गन जैसा ...
998,1,ग्रूप के सभी माननीय सदस्यों को सूचित किया जाता...


In [None]:
# Saving DataFrame to a JSON file
val_en_hi.to_json('validation_en_hi_encoded_clean.json', orient='records')