# Categorization & Concatenation

**This notebook covers two tasks:**
1) We add categories to the news text of __ datasets. This is done with a CatBoostClassifier, a supervised machine learning model that we train in this notebook.

2) We concatenate the now-categorised datasets with the other datasets (insert names) into a single dataframe. This final dataframe is used for visualization and exploratory data analysis.

**Below is a brief explanation for 1)**


**This notebook mainly processes the three earlier datasets and adds categories to the news text of two of them.
A brief overview of the steps follows.**

**We first look at the first dataset to understand what the categories of 'Subject' are. We fix the typos here and rename any categories whose names don't make sense.**

**Next, we train a supervised machine learning model for categorization. We use CatBoostClassifier for this purpose in this notebook.**

**Then, this model is applied to the second dataset to predict its categories. The third dataset is left out in this notebook because predicting on it raises feature-mismatch errors in the vectorized clean text. We have chosen to ignore it as it only constitutes around 11,500 datapoints.**

**Finally, the datasets are concatenated into a single dataframe, which is saved for visualization and EDA in the next notebook.**

#### The block below lists the paths of the files in the Kaggle input folder. These are the files we will primarily use.

In [128]:
#This block is included as this notebook has been created using Kaggle
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/sc1015dsai-final-fce2-team-1-23-24/categories_one.csv
/kaggle/input/sc1015dsai-final-fce2-team-1-23-24/with_subject.csv
/kaggle/input/sc1015dsai-final-fce2-team-1-23-24/no_subject.csv


#### Importing the first dataframe for inspection before building the model.

In [129]:
df = pd.read_csv("/kaggle/input/sc1015dsai-final-fce2-team-1-23-24/categories_one.csv")

In [130]:
df.head() #Viewing the structure of the dataframe through the first 5 rows.

Unnamed: 0,Text,Subject,TruthRating,Country
0,"WHO praises India's Aarogya Setu app, says it ...",COVID-19,5,India
1,"In Delhi, Deputy US Secretary of State Stephen...",VIOLENCE,5,India
2,LAC tensions: China's strategy behind delibera...,TERROR,5,India
3,India has signed 250 documents on Space cooper...,COVID-19,5,India
4,Tamil Nadu chief minister's mother passes away...,ELECTION,5,India


In [131]:
df.describe() #Basic statistical information regarding the dataframe

Unnamed: 0,Text,Subject,TruthRating,Country
count,54170,54170,54170,54170
unique,50974,9,2,1
top,"DMK lacks leadership, will split after 2021 TN...",GOVERNMENT,5,India
freq,3,10366,35958,54170


#### Viewing the unique values of 'Subject' for a better understanding of the categories

In [132]:
df['Subject'].unique()

array(['COVID-19', 'VIOLENCE', 'TERROR', 'ELECTION', 'GOVERNMENT',
       'POLITICS', 'TRAD', 'MISLEADING', 'MISLEADIND'], dtype=object)

#### Figuring out the difference between MISLEADING & MISLEADIND to correct the errors, if any

In [133]:
df[df['Subject'] == 'MISLEADING'].head(100)

Unnamed: 0,Text,Subject,TruthRating,Country
35959,Fact Check: Conspiracy theory claims Sushant w...,MISLEADING,Fake,India
35963,Fact Check: This girl with a beautiful voice i...,MISLEADING,Fake,India
35965,Fact Check: This is not the juvenile involved ...,MISLEADING,Fake,India
35968,Viral Test: Is Madhya Pradesh govt's request t...,MISLEADING,Fake,India
35970,"Viral Test: Did Salman Khan's Father Write ""Ka...",MISLEADING,Fake,India
...,...,...,...,...
36243,Fact Check: Viral claim of EVMs being stolen i...,MISLEADING,Fake,India
36246,Fact Check: Misleading picture of cow deaths g...,MISLEADING,Fake,India
36250,"Fact Check: No, this young Muslim girl is not ...",MISLEADING,Fake,India
36254,Fact Check: Manish Sisodia's tweet on EVM inci...,MISLEADING,Fake,India


In [134]:
df[df['Subject'] == 'MISLEADIND'].head(67)

Unnamed: 0,Text,Subject,TruthRating,Country
54046,Fact Check: Old list of blood donors circulate...,MISLEADIND,Fake,India
54052,Fact Check: Picture of man injured in Mewat vi...,MISLEADIND,Fake,India
54053,Fact Check: This CISF officer was not injured ...,MISLEADIND,Fake,India
54054,Fact Check: Old picture from Prayagraj shared ...,MISLEADIND,Fake,India
54058,Fact Check: This picture of Kejriwal without m...,MISLEADIND,Fake,India
54063,"Fact Check: Another lockdown in Delhi? No, thi...",MISLEADIND,Fake,India
54075,Fact Check: These pictures of farmers' protest...,MISLEADIND,Fake,India
54077,"Fact Check: This is not Amulya, the girl who r...",MISLEADIND,Fake,India
54083,Fact Check: Obama never warned Africans agains...,MISLEADIND,Fake,India
54089,Fact Check: Old images of pro-Khalistan demons...,MISLEADIND,Fake,India


#### Since there is no difference between the two, we replace MISLEADIND with MISLEADING. Additionally, MISLEADING isn't an appropriate category name, so we rename it to GOSSIP/OPINION (as it covers everything from fact checks to viral news). We also map the 'Fake' rating to 0 and convert 'TruthRating' to a numeric type.

In [142]:
df['Subject'] = df['Subject'].replace("MISLEADIND", "MISLEADING")
df['Subject'] = df['Subject'].replace("MISLEADING", "GOSSIP/OPINION")
df["TruthRating"] = df["TruthRating"].replace({'Fake' : 0})
df = df[df["TruthRating"]!="nan"]
df['TruthRating'] = pd.to_numeric(df['TruthRating'], errors='coerce', downcast='integer')
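
The same kind of label clean-up can be tried on a small toy frame first. This is only a sketch: the `toy` dataframe and its values are made up for illustration, and it assumes the ratings are stored as strings before conversion.

```python
import pandas as pd

# Hypothetical toy data (not taken from the real dataset)
toy = pd.DataFrame({
    "Subject": ["MISLEADIND", "MISLEADING", "POLITICS", "TERROR"],
    "TruthRating": ["Fake", "5", "5", "n/a"],
})

# One mapping handles both the typo and the rename in a single pass
toy["Subject"] = toy["Subject"].replace({"MISLEADIND": "GOSSIP/OPINION",
                                         "MISLEADING": "GOSSIP/OPINION"})

# Map the textual rating to "0", then coerce to numbers;
# anything that cannot be parsed becomes NaN
toy["TruthRating"] = pd.to_numeric(toy["TruthRating"].replace({"Fake": "0"}),
                                   errors="coerce")

# Drop the unparseable rows explicitly and store the ratings as integers
toy = toy.dropna(subset=["TruthRating"]).astype({"TruthRating": int})
print(toy)
```

Dropping the `NaN` rows after `pd.to_numeric(..., errors="coerce")` removes every unparseable rating, whatever its original spelling.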

In [144]:
df.head() #Viewing the new changed dataframe

Unnamed: 0,Text,Subject,TruthRating,Country,clean_text
0,"WHO praises India's Aarogya Setu app, says it ...",COVID-19,5,India,praises india aarogya setu app says helped ide...
1,"In Delhi, Deputy US Secretary of State Stephen...",VIOLENCE,5,India,delhi deputy us secretary state stephen biegun...
2,LAC tensions: China's strategy behind delibera...,TERROR,5,India,lac tensions china strategy behind deliberatel...
3,India has signed 250 documents on Space cooper...,COVID-19,5,India,india signed documents space cooperation count...
4,Tamil Nadu chief minister's mother passes away...,ELECTION,5,India,tamil nadu chief minister mother passes away


# Building the ML Model

### To build an ML model on a text feature, we need to process natural language. This is known as Natural Language Processing (NLP). In this notebook, we use the nltk and sklearn libraries.

NLP starts with **tokenization**, which splits a sentence into individual words, or **tokens**. Following that, we **remove stopwords** such as 'at', 'and', 'or', etc. Once this is done, we **vectorise** the text, i.e. convert it into numbers. We have used **CountVectorizer** here. This is a **'Bag-of-Words'** technique: every distinct word in the training data gets its own index (column) in a vocabulary, and each text is represented as a vector holding the **frequency** of each vocabulary word in that text. BoW ignores the order of the words and keeps only their frequencies.
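
To make the Bag-of-Words idea concrete, here is a minimal pure-Python sketch of the three steps. It is not how nltk or CountVectorizer are implemented: the regex tokenizer, the tiny `STOP_WORDS` set and the helper names are assumptions made for this illustration.

```python
import re
from collections import Counter

# Hypothetical mini stop-word list; NLTK's English list is much longer
STOP_WORDS = {"at", "and", "or", "the", "in", "is", "a", "of"}

def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(documents):
    """Map each distinct non-stop word to a column index (alphabetical order)."""
    words = {w for doc in documents for w in tokenize(doc) if w not in STOP_WORDS}
    return {word: idx for idx, word in enumerate(sorted(words))}

def to_count_vector(text, vocabulary):
    """Count how often each vocabulary word occurs; word order is discarded."""
    counts = Counter(w for w in tokenize(text) if w in vocabulary)
    return [counts[word] for word in sorted(vocabulary, key=vocabulary.get)]

docs = ["The election results and the election debate",
        "Debate at the government office"]
vocab = build_vocabulary(docs)
vectors = [to_count_vector(d, vocab) for d in docs]
print(vocab)    # word -> column index
print(vectors)  # one count vector per document
```

The first document maps to `[1, 2, 0, 0, 1]`: "election" occurs twice, "debate" and "results" once each, and the position of the words in the sentence plays no role.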

In [145]:
import nltk
from nltk.tokenize import word_tokenize #For tokenization
from nltk.corpus import stopwords #to remove stopwords
from sklearn.feature_extraction.text import CountVectorizer #for vectorization
from sklearn.model_selection import train_test_split #splitting the test-train data
from sklearn.naive_bayes import MultinomialNB #first model tried, rejected due to low accuracy
from catboost import CatBoostClassifier #model used in this notebook
from sklearn.metrics import accuracy_score, f1_score #for accuracy of the model

#### Creating a function for easier text processing

In [146]:
def preprocess_text(text):
    tokens = word_tokenize(text) #Tokenizing the text into words
    tokens = [word.lower() for word in tokens if word.isalpha()] #Keeping alphabetic tokens only (drops punctuation and numbers) and lowercasing them
    stop_words = set(stopwords.words('english')) #Loading the English stop words such as at, and, as, etc.
    filtered_tokens = [word for word in tokens if word not in stop_words] #Removing the stop words for a cleaner text
    preprocessed_text = " ".join(filtered_tokens) #Joining the remaining words back into a single string
    return preprocessed_text

**This df in fact refers to df7 from the previous notebook**

In [147]:
df["clean_text"] = df["Text"].apply(preprocess_text) #transforming the text and adding it as a separate feature

In [148]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['clean_text']) #getting the vectorized data.

#### Splitting the data into train and test sets and training the classifier.

CatBoostClassifier is the classifier provided by the CatBoost library, an implementation of gradient boosting on decision trees with built-in support for categorical features.

iterations=500 specifies the number of boosting iterations (trees) to be used in the model.

learning_rate=0.7 sets the learning rate, which controls the step size during optimization.

depth=7 specifies the depth of the trees in the ensemble.

verbose=100 prints training progress every 100 iterations, providing information about the model's performance during training.
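
As a rough intuition for how `iterations` and `learning_rate` interact, the toy sketch below boosts a single number towards a target. It is not CatBoost's algorithm: it assumes squared-error loss and an idealized weak learner that predicts the current residual exactly, and the `boost` helper is hypothetical.

```python
def boost(target, learning_rate, iterations, start=0.0):
    """Toy boosting on one value: each round adds a fraction of the residual."""
    prediction = start
    history = []
    for _ in range(iterations):
        residual = target - prediction          # what is still unexplained
        prediction += learning_rate * residual  # shrunken step towards the target
        history.append(prediction)
    return history

slow = boost(target=10.0, learning_rate=0.1, iterations=5)
fast = boost(target=10.0, learning_rate=0.7, iterations=5)
print([round(p, 3) for p in slow])
print([round(p, 3) for p in fast])
```

In this idealized setting the remaining error shrinks by a factor of `1 - learning_rate` per round, so a high rate such as 0.7 gets close to the target within a few rounds. With real trees, a rate this high is aggressive and can overfit; a lower rate with more iterations is a common alternative worth trying.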

In [149]:
X_train, X_test, y_train, y_test = train_test_split(X, df['Subject'], test_size=0.2, random_state=42)
classifier = CatBoostClassifier(iterations=500,  # Number of boosting iterations
                                learning_rate=0.7,  # Learning rate
                                depth=7)  # Depth of trees
classifier.fit(X_train, y_train, verbose=100)  # Verbose=100 to see training progress every 100 iterations

#Testing the dataframe
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

0:	learn: 1.7963026	total: 816ms	remaining: 6m 47s
100:	learn: 1.1591535	total: 1m 10s	remaining: 4m 37s
200:	learn: 1.0749383	total: 2m 19s	remaining: 3m 27s
300:	learn: 1.0294961	total: 3m 28s	remaining: 2m 17s
400:	learn: 0.9974344	total: 4m 37s	remaining: 1m 8s
499:	learn: 0.9703242	total: 5m 46s	remaining: 0us
Accuracy: 0.6175004615100609


### The F1 score is a metric that combines precision and recall into a single value: it is the harmonic mean of the two. It is defined per class, so for a multi-class target such as 'Subject' we use average='weighted', which averages the per-class F1 scores weighted by the number of true instances of each class.

##### Precision measures the proportion of true positive predictions (correctly classified positive instances) out of all positive predictions made by the model. 

##### Recall measures the proportion of true positive predictions out of all actual positive instances in the data. 
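
The definitions above can be checked by hand on a few labels. The sketch below computes per-class precision, recall and F1 and then the support-weighted average in plain Python; the label lists and helper names are made up for illustration, and the result should agree with `f1_score(..., average='weighted')` on the same inputs.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for each true label."""
    scores = {}
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)  # TP + FP
        actual = sum(t == label for t in y_true)     # TP + FN (the support)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1, actual)
    return scores

def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by each class's support."""
    scores = per_class_scores(y_true, y_pred)
    return sum(f1 * support for _, _, f1, support in scores.values()) / len(y_true)

# Hypothetical toy labels
y_true_toy = ["POLITICS", "POLITICS", "POLITICS", "TERROR", "TERROR", "ELECTION"]
y_pred_toy = ["POLITICS", "POLITICS", "TERROR", "TERROR", "ELECTION", "ELECTION"]

toy_scores = per_class_scores(y_true_toy, y_pred_toy)
for label, (p, r, f1, n) in toy_scores.items():
    print(label, round(p, 3), round(r, 3), round(f1, 3), n)
print("weighted F1:", round(weighted_f1(y_true_toy, y_pred_toy), 4))
```

For POLITICS, precision is 2/2 and recall is 2/3, giving an F1 of 0.8; the weighted average then gives this class three times the weight of ELECTION because it has three true instances.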



In [150]:
f1 = f1_score(y_test, y_pred, average='weighted') #weighted average of the per-class F1 scores
print(f1)

0.6113622126670606


# Employing ML model to categorize the rest of the datasets

In [151]:
df1 = pd.read_csv("/kaggle/input/sc1015dsai-final-fce2-team-1-23-24/no_subject.csv")

In [152]:
df1.head() #Basic structure

Unnamed: 0,Text,TruthRating,Country
0,My Walgreens offbrand Mucinex was engraved wit...,0,USA
1,Bride and groom exchange vows after fatal shoo...,0,USA
2,Rabbi: Meat from cloned pig could be kosher fo...,0,USA
3,Jesus Christ converting local teens to Christi...,2,USA
4,"« Victory, The great european crusade », Vichy...",5,USA


#### Processing & vectorizing the text for classification

In [153]:
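df1["clean_text"] = df1["Text"].apply(preprocess_text) #same preprocessing as for the training text
x = vectorizer.fit_transform(df1["clean_text"]) #note: fit_transform re-fits the vocabulary on df1 instead of reusing the training vocabulary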

In [154]:
sub = classifier.predict(x)
df1["Subject"] = [label[0] for label in sub]

In [155]:
df1.head()

Unnamed: 0,Text,TruthRating,Country,clean_text,Subject
0,My Walgreens offbrand Mucinex was engraved wit...,0,USA,walgreens offbrand mucinex engraved letters mu...,GOVERNMENT
1,Bride and groom exchange vows after fatal shoo...,0,USA,bride groom exchange vows fatal shooting wedding,POLITICS
2,Rabbi: Meat from cloned pig could be kosher fo...,0,USA,rabbi meat cloned pig could kosher jews eat milk,GOVERNMENT
3,Jesus Christ converting local teens to Christi...,2,USA,jesus christ converting local teens christiani...,GOVERNMENT
4,"« Victory, The great european crusade », Vichy...",5,USA,victory great european crusade vichy france,GOVERNMENT


In [156]:
df2 = pd.read_csv("/kaggle/input/sc1015dsai-final-fce2-team-1-23-24/with_subject.csv")

In [157]:
df2.describe()

Unnamed: 0,TruthRating
count,11505.0
mean,2.723946
std,1.562203
min,0.0
25%,1.0
50%,3.0
75%,4.0
max,5.0


In [158]:
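df2["clean_text"] = df2["Text"].apply(preprocess_text) #same preprocessing as for the training text
x = vectorizer.fit_transform(df2["clean_text"]) #note: fit_transform re-fits the vocabulary on df2 instead of reusing the training vocabulary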

In [159]:
sub = classifier.predict(x)
df2["Subject"] = [label[0] for label in sub]

CatBoostError: /src/catboost/catboost/libs/data/model_dataset_compatibility.cpp:72: Feature 11870 is present in model but not in pool.

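### Prediction on the last dataset fails with a feature-mismatch error. The most likely cause is that `fit_transform` re-fits the vectorizer on this dataset, which yields a vocabulary with fewer features than the one the model was trained on. As the dataset has only 11,505 records, we leave it out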
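
A feature-mismatch error of this kind is what one would expect when a vectorizer is re-fitted on new text instead of being reused. The sketch below, on made-up sentences, shows the difference between `transform` and `fit_transform` for new data; it is an illustration of the general pattern, not a re-run of this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy texts
train_texts = ["election results announced", "government announces lockdown",
               "terror attack reported"]
new_texts = ["election lockdown", "unseen words only"]

demo_vectorizer = CountVectorizer()
X_demo_train = demo_vectorizer.fit_transform(train_texts)  # learns the vocabulary once

# Reusing the learned vocabulary keeps the column layout a trained model expects;
# words that are not in the training vocabulary are simply ignored
X_new_aligned = demo_vectorizer.transform(new_texts)

# Re-fitting on the new texts builds a different vocabulary, so both the
# number of columns and the meaning of each column change
X_new_refit = CountVectorizer().fit_transform(new_texts)

print(X_demo_train.shape, X_new_aligned.shape, X_new_refit.shape)
```

Only `X_new_aligned` has the same number of columns as the training matrix, with each column referring to the same word. A re-fitted matrix may still be accepted by a model when it happens to have enough columns, but its columns then stand for different words, so the predictions are not meaningful.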

In [160]:
df.head() #First, useable dataset

Unnamed: 0,Text,Subject,TruthRating,Country,clean_text
0,"WHO praises India's Aarogya Setu app, says it ...",COVID-19,5,India,praises india aarogya setu app says helped ide...
1,"In Delhi, Deputy US Secretary of State Stephen...",VIOLENCE,5,India,delhi deputy us secretary state stephen biegun...
2,LAC tensions: China's strategy behind delibera...,TERROR,5,India,lac tensions china strategy behind deliberatel...
3,India has signed 250 documents on Space cooper...,COVID-19,5,India,india signed documents space cooperation count...
4,Tamil Nadu chief minister's mother passes away...,ELECTION,5,India,tamil nadu chief minister mother passes away


In [161]:
df1.head() #2nd useable dataset

Unnamed: 0,Text,TruthRating,Country,clean_text,Subject
0,My Walgreens offbrand Mucinex was engraved wit...,0,USA,walgreens offbrand mucinex engraved letters mu...,GOVERNMENT
1,Bride and groom exchange vows after fatal shoo...,0,USA,bride groom exchange vows fatal shooting wedding,POLITICS
2,Rabbi: Meat from cloned pig could be kosher fo...,0,USA,rabbi meat cloned pig could kosher jews eat milk,GOVERNMENT
3,Jesus Christ converting local teens to Christi...,2,USA,jesus christ converting local teens christiani...,GOVERNMENT
4,"« Victory, The great european crusade », Vichy...",5,USA,victory great european crusade vichy france,GOVERNMENT


### Creating the final dataframe

In [162]:
df_net = pd.concat([df,df1],axis=0)

In [163]:
df_net.head()

Unnamed: 0,Text,Subject,TruthRating,Country,clean_text
0,"WHO praises India's Aarogya Setu app, says it ...",COVID-19,5,India,praises india aarogya setu app says helped ide...
1,"In Delhi, Deputy US Secretary of State Stephen...",VIOLENCE,5,India,delhi deputy us secretary state stephen biegun...
2,LAC tensions: China's strategy behind delibera...,TERROR,5,India,lac tensions china strategy behind deliberatel...
3,India has signed 250 documents on Space cooper...,COVID-19,5,India,india signed documents space cooperation count...
4,Tamil Nadu chief minister's mother passes away...,ELECTION,5,India,tamil nadu chief minister mother passes away


In [164]:
df_net.describe()

Unnamed: 0,TruthRating
count,370403.0
mean,1.331255
std,1.888873
min,0.0
25%,0.0
50%,0.0
75%,2.0
max,5.0


In [165]:
df_net = df_net.drop_duplicates()
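
The concatenation and de-duplication steps can be illustrated on two small made-up frames. The frame names and values are hypothetical; `ignore_index=True` and `reset_index` are optional additions that are not used in the notebook's own cells.

```python
import pandas as pd

# Hypothetical toy frames with the same columns in a different order
labelled = pd.DataFrame({"Text": ["a", "b"], "Subject": ["POLITICS", "TERROR"],
                         "TruthRating": [5, 0]})
predicted = pd.DataFrame({"Text": ["b", "c"], "TruthRating": [0, 2],
                          "Subject": ["TERROR", "GOVERNMENT"]})

# concat aligns on column names, so the differing column order is harmless;
# ignore_index=True gives the combined frame a fresh 0..n-1 index
combined = pd.concat([labelled, predicted], axis=0, ignore_index=True)

# A row counts as a duplicate only when every column matches an earlier row
deduped = combined.drop_duplicates().reset_index(drop=True)
print(len(combined), len(deduped))
print(deduped)
```

Because `drop_duplicates()` compares all columns by default, two rows with the same text but a different subject or rating are both kept.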

## 361,363 Total Records For Exploratory Data Analysis And Detection Model Creation

In [166]:
df_net.describe()

Unnamed: 0,TruthRating
count,361363.0
mean,1.357978
std,1.900495
min,0.0
25%,0.0
50%,0.0
75%,2.0
max,5.0


#### Saving the final concatenated dataframe for further usage.

In [167]:
df_net.to_csv("final.csv",index=False)