# Final Dataset Creation

**This notebook mainly involves processing the earlier three datasets to add the categories for the news text for the 2 datasets.
For a brief understanding of what is there in this notebook, do read here.**

**We first look at the first notebook to understand what the categories of 'Subject' are. We fix the typos here and change the category names if it doesn't make sense.**

**Next, we train a supervised Machine Learning model for the purpose of catgeorization. Since it is a classifying problem, we tried several models such as multinomial Naive Bayes, SVM and CatBoostClassifiers. CatBoostClassifier gives the greatest general accuracy and F1 score thus we have included the same in this notebook.**

**Then, this model is employed in the other dataset to add categories for the same. The 3rd dataset has been ignored in this notebook as it gives several errors due to different features in the cleaned text. We have chosen to ignore them as they only constitute around 10000 datapoints.**

**Finally, the datasets are concatenated into a single dataframe. This is next saved to use for visualization and EDA in the next notebook.**

#### The below block is executed to view the addresses for the files in the kaggle input folder. These are the files we will be primarily using.

In [2]:
#This block is included as this notebook has been created using Kaggle
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/sc1015dsai-final-fce2-team-1-23-24/df_net_with_subject.csv
/kaggle/input/sc1015dsai-final-fce2-team-1-23-24/categories.csv
/kaggle/input/sc1015dsai-final-fce2-team-1-23-24/df_net_no_subject.csv


#### Importing the first dataframe for inspection before building the model.

In [3]:
df = pd.read_csv("/kaggle/input/sc1015dsai-final-fce2-team-1-23-24/categories.csv")

In [4]:
df.head() #Viewing the structure of the dataframe through the first 5 rows.

Unnamed: 0,Text,Subject,Label,Country
0,"WHO praises India's Aarogya Setu app, says it ...",COVID-19,1,India
1,"In Delhi, Deputy US Secretary of State Stephen...",VIOLENCE,1,India
2,LAC tensions: China's strategy behind delibera...,TERROR,1,India
3,India has signed 250 documents on Space cooper...,COVID-19,1,India
4,Tamil Nadu chief minister's mother passes away...,ELECTION,1,India


In [5]:
df.describe() #Basic statistical information regarding the dataframe

Unnamed: 0,Text,Subject,Label,Country
count,54170,54170,54170,54170
unique,50974,9,2,1
top,"DMK lacks leadership, will split after 2021 TN...",GOVERNMENT,1,India
freq,3,10366,35958,54170


#### Viewing the unique values of 'Subject' for a better understanding on categorization

In [6]:
df['Subject'].unique()

array(['COVID-19', 'VIOLENCE', 'TERROR', 'ELECTION', 'GOVERNMENT',
       'POLITICS', 'TRAD', 'MISLEADING', 'MISLEADIND'], dtype=object)

#### Figuring out the difference between MISLEADING & MISLEADIND to correct the errors, if any

In [7]:
df[df['Subject'] == 'MISLEADING'].head(5)

Unnamed: 0,Text,Subject,Label,Country
35959,Fact Check: Conspiracy theory claims Sushant w...,MISLEADING,Fake,India
35963,Fact Check: This girl with a beautiful voice i...,MISLEADING,Fake,India
35965,Fact Check: This is not the juvenile involved ...,MISLEADING,Fake,India
35968,Viral Test: Is Madhya Pradesh govt's request t...,MISLEADING,Fake,India
35970,"Viral Test: Did Salman Khan's Father Write ""Ka...",MISLEADING,Fake,India


In [8]:
df[df['Subject'] == 'MISLEADIND'].head(5)

Unnamed: 0,Text,Subject,Label,Country
54046,Fact Check: Old list of blood donors circulate...,MISLEADIND,Fake,India
54052,Fact Check: Picture of man injured in Mewat vi...,MISLEADIND,Fake,India
54053,Fact Check: This CISF officer was not injured ...,MISLEADIND,Fake,India
54054,Fact Check: Old picture from Prayagraj shared ...,MISLEADIND,Fake,India
54058,Fact Check: This picture of Kejriwal without m...,MISLEADIND,Fake,India


#### Since there is no difference, we replace MISLEADIND to MISLEADING. Additionally, MISLEADING isn't an appropriate category, thus we change it to entertainment (as it involved everything from movies, theatre, people interactions, interviews, etc.)

In [9]:
df['Subject'] = df['Subject'].replace("MISLEADIND", "MISLEADING")
df['Subject'] = df['Subject'].replace("MISLEADING", "ENTERTAINMENT")

In [10]:
df.head() #Viewing the new changed dataframe

Unnamed: 0,Text,Subject,Label,Country
0,"WHO praises India's Aarogya Setu app, says it ...",COVID-19,1,India
1,"In Delhi, Deputy US Secretary of State Stephen...",VIOLENCE,1,India
2,LAC tensions: China's strategy behind delibera...,TERROR,1,India
3,India has signed 250 documents on Space cooper...,COVID-19,1,India
4,Tamil Nadu chief minister's mother passes away...,ELECTION,1,India


# Building the ML Model

### To build a ML model that involves text feature we need to process natural language. This is known as Natural Language Processing (NLP). In this notebook, we use the nltk and sklearn library.

NLP involved **Tokenization**, which is dividing a sentence into each words as separate **tokens**. Following that, we **remove stopwords** such as 'at','and','or', etc. Once this is done, we **vectorise** these words which is basically converting each of the words into a numerical vector. We have used **CountVectorizer** here. This is a **'Bag-Of-Words'** technique where each word is converted into a numeric vector, stored at an individual index/position and have a **frequency** attribute based on the training data for vocabulary building. BoW ignores the order of words and only includes their frequency

In [11]:
import nltk
from nltk.tokenize import word_tokenize #For tokenization
from nltk.corpus import stopwords #to remove stopwords
from sklearn.feature_extraction.text import CountVectorizer #for vectorization
from sklearn.model_selection import train_test_split #splitting the test-train data
from sklearn.naive_bayes import MultinomialNB #first model tried, rejected due to low accuracy
from catboost import CatBoostClassifier #model used in this notebook
from sklearn.metrics import accuracy_score, f1_score #for accuracy of the model

#### Creating a function for easier text processing

In [12]:
def preprocess_text(text):
    tokens = word_tokenize(text) #Tokenizing words
    tokens = [word.lower() for word in tokens if word.isalpha()] #Removing punctuations and changing all to lowercase letters
    stop_words = set(stopwords.words('english')) #Removing stop words such as at, and, as, etc.
    filtered_tokens = [word for word in tokens if word not in stop_words] #Filtering the tokens for cleaner title
    preprocessed_text = " ".join(filtered_tokens) #Collating all the processed words to form the sentence back
    return preprocessed_text

In [13]:
df["clean_text"] = df["Text"].apply(preprocess_text) #transforming the text and adding it as a separate feature

In [14]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['clean_text']) #getting the vectorized data.

#### Splitting the dataframe for test/train data.

CatBoostClassifier is a classifier provided by the CatBoost library, which is an implementation of gradient boosting specifically designed for categorical features.

iterations=500 specifies the number of boosting iterations (trees) to be used in the model.

learning_rate=0.7 sets the learning rate, which controls the step size during optimization.

depth=7 specifies the depth of the trees in the ensemble.

verbose=100 prints training progress every 100 iterations, providing information about the model's performance during training.

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, df['Subject'], test_size=0.2, random_state=42)
classifier = CatBoostClassifier(iterations=500,  # Number of boosting iterations
                                learning_rate=0.7,  # Learning rate
                                depth=7)  # Depth of trees
classifier.fit(X_train, y_train, verbose=100)  # Verbose=100 to see training progress every 100 iterations

#Testing the dataframe
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

0:	learn: 1.7963026	total: 758ms	remaining: 6m 18s
100:	learn: 1.1591535	total: 1m 8s	remaining: 4m 32s
200:	learn: 1.0749383	total: 2m 16s	remaining: 3m 23s
300:	learn: 1.0294961	total: 3m 24s	remaining: 2m 15s
400:	learn: 0.9974344	total: 4m 33s	remaining: 1m 7s
499:	learn: 0.9703242	total: 5m 40s	remaining: 0us
Accuracy: 0.6175004615100609


### The F1 score is a metric commonly used in binary classification tasks to evaluate the performance of a model. It combines precision and recall into a single value. The F1 score is the harmonic mean of precision and recall.

##### Precision measures the proportion of true positive predictions (correctly classified positive instances) out of all positive predictions made by the model. 

##### Recall measures the proportion of true positive predictions out of all actual positive instances in the data. 



In [16]:
f2 = f1_score(y_test, y_pred, average='weighted')
print(f2)

0.6113622126670606


# Employing ML model to categorize the rest of the datasets

In [17]:
df1 = pd.read_csv("/kaggle/input/sc1015dsai-final-fce2-team-1-23-24/df_net_no_subject.csv")

In [18]:
df1.head() #Basic structure

Unnamed: 0,Text,Label,Country
0,my walgreens offbrand mucinex was engraved wit...,0,USA
1,bride and groom exchange vows after fatal shoo...,0,USA
2,rabbi meat from cloned pig could be kosher for...,0,USA
3,jesus christ converting local teens to christi...,2,USA
4,victory the great european crusade vichy france,5,USA


#### Processing & vectorizing the text for classification

In [19]:
df1["clean_text"] = df1["Text"].apply(preprocess_text)
x = vectorizer.fit_transform(df1["clean_text"])

In [20]:
sub = classifier.predict(x)
df1["Subject"] = [label[0] for label in sub]

In [21]:
df1.head()

Unnamed: 0,Text,Label,Country,clean_text,Subject
0,my walgreens offbrand mucinex was engraved wit...,0,USA,walgreens offbrand mucinex engraved letters mu...,GOVERNMENT
1,bride and groom exchange vows after fatal shoo...,0,USA,bride groom exchange vows fatal shooting wedding,GOVERNMENT
2,rabbi meat from cloned pig could be kosher for...,0,USA,rabbi meat cloned pig could kosher jews eat milk,GOVERNMENT
3,jesus christ converting local teens to christi...,2,USA,jesus christ converting local teens christiani...,VIOLENCE
4,victory the great european crusade vichy france,5,USA,victory great european crusade vichy france,GOVERNMENT


In [22]:
df2 = pd.read_csv("/kaggle/input/sc1015dsai-final-fce2-team-1-23-24/df_net_with_subject.csv")

In [23]:
df2.describe()

Unnamed: 0,Label
count,11505.0
mean,2.723946
std,1.562203
min,0.0
25%,1.0
50%,3.0
75%,4.0
max,5.0


### Considering the last dataset only has 11,505 records with multiple feature errors while employing the model, we ignore this dataset

In [24]:
df.head() #First, useable dataset

Unnamed: 0,Text,Subject,Label,Country,clean_text
0,"WHO praises India's Aarogya Setu app, says it ...",COVID-19,1,India,praises india aarogya setu app says helped ide...
1,"In Delhi, Deputy US Secretary of State Stephen...",VIOLENCE,1,India,delhi deputy us secretary state stephen biegun...
2,LAC tensions: China's strategy behind delibera...,TERROR,1,India,lac tensions china strategy behind deliberatel...
3,India has signed 250 documents on Space cooper...,COVID-19,1,India,india signed documents space cooperation count...
4,Tamil Nadu chief minister's mother passes away...,ELECTION,1,India,tamil nadu chief minister mother passes away


In [25]:
df1.head() #2nd useable dataset

Unnamed: 0,Text,Label,Country,clean_text,Subject
0,my walgreens offbrand mucinex was engraved wit...,0,USA,walgreens offbrand mucinex engraved letters mu...,GOVERNMENT
1,bride and groom exchange vows after fatal shoo...,0,USA,bride groom exchange vows fatal shooting wedding,GOVERNMENT
2,rabbi meat from cloned pig could be kosher for...,0,USA,rabbi meat cloned pig could kosher jews eat milk,GOVERNMENT
3,jesus christ converting local teens to christi...,2,USA,jesus christ converting local teens christiani...,VIOLENCE
4,victory the great european crusade vichy france,5,USA,victory great european crusade vichy france,GOVERNMENT


### Creating the final dataframe

In [26]:
df_net = pd.concat([df,df1],axis=0)

In [27]:
df_net.head()

Unnamed: 0,Text,Subject,Label,Country,clean_text
0,"WHO praises India's Aarogya Setu app, says it ...",COVID-19,1,India,praises india aarogya setu app says helped ide...
1,"In Delhi, Deputy US Secretary of State Stephen...",VIOLENCE,1,India,delhi deputy us secretary state stephen biegun...
2,LAC tensions: China's strategy behind delibera...,TERROR,1,India,lac tensions china strategy behind deliberatel...
3,India has signed 250 documents on Space cooper...,COVID-19,1,India,india signed documents space cooperation count...
4,Tamil Nadu chief minister's mother passes away...,ELECTION,1,India,tamil nadu chief minister mother passes away


In [28]:
df_net.describe()

Unnamed: 0,Text,Subject,Label,Country,clean_text
count,370246,370246,370246,370246,370246
unique,355471,8,8,2,353582
top,my dad front in vietnam in didnt know this pho...,GOVERNMENT,0,USA,dad front vietnam didnt know photo existed cam...
freq,39,296734,205578,316076,40


In [29]:
df_net = df_net.drop_duplicates()

## 3,58,957 Total Records For Analysis And Detection Model Creation

In [30]:
df_net.describe()

Unnamed: 0,Text,Subject,Label,Country,clean_text
count,358957,358957,358957,358957,358957
unique,355471,8,8,2,353582
top,"DMK lacks leadership, will split after 2021 TN...",GOVERNMENT,0,USA,shot amazing sports photos
freq,3,286509,196276,304788,7


#### Saving the final concatenated dataframe for further usage.

In [31]:
df_net.to_csv("final_df.csv",index=False)