<a href="https://colab.research.google.com/github/Eswinpaul/NewProject/blob/main/NLP_ResearchPaper_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction**

This notebook demonstrates how to classify research papers based on their abstracts. Natural Language Processing (NLP) and Machine Learning (ML) algorithms are used to achieve this.

# Required Functions and Modules

Before running the code cells, you will need to import the following functions and modules:

- `numpy` for numerical calculations
- `pandas` for data manipulation
- `matplotlib` for data visualization
- `sklearn` for machine learning algorithms
- `nltk` for NLP tasks
- `string` for string operations
- `contractions` for expanding contractions in text data

You can install these packages by running the following command:

'''
!pip install -r requirements.txt
'''

In [1]:
# Install the third-party packages used below
!pip install scikit-learn
!pip install contractions
!pip install scikit-multilearn

# Standard library
import string
from collections import defaultdict

# Numerical and data-handling libraries
import numpy
import pandas as pd

# NLP: corpora, tokenizer, POS tagger, lemmatizer, contraction expansion
import nltk
nltk.download('popular')
import contractions
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Feature extraction, models, and metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import hamming_loss, accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from skmultilearn.problem_transform import LabelPowerset


# **Data Loading**
Load the data from a CSV file.

The dataset is available on Kaggle:
https://www.kaggle.com/datasets/jainpooja/topic-modeling-for-research-articles-20

I have trimmed the data to the abstract text and four label columns: Computer Science, Mathematics, Physics, and Statistics.

In [3]:
Data = pd.read_csv("Train_data.csv")

# **Data Preprocessing**

- Clean the data by removing punctuation, stopwords, and numbers.
- Expand contractions and tokenize the text.
- Lemmatize the tokens, using a mapping from part-of-speech tags to WordNet tag names.
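
The tag mapping in the last step can be illustrated without NLTK. The sketch below assumes the Penn Treebank tags that `pos_tag` returns by default, and it uses the single-letter strings that the WordNet constants stand for (`wn.ADJ` is `'a'`, `wn.VERB` is `'v'`, `wn.ADV` is `'r'`, `wn.NOUN` is `'n'`). The helper name `wordnet_pos` is hypothetical and exists only for this illustration.

```python
from collections import defaultdict

# Single-letter codes that NLTK's WordNet constants stand for.
NOUN, VERB, ADJ, ADV = "n", "v", "a", "r"

# Any tag whose first letter is not listed falls back to NOUN.
tag_map = defaultdict(lambda: NOUN)
tag_map["J"] = ADJ   # JJ, JJR, JJS
tag_map["V"] = VERB  # VB, VBD, VBG, ...
tag_map["R"] = ADV   # RB, RBR, RBS (and any other tag starting with R)

def wordnet_pos(penn_tag):
    """Hypothetical helper: map a Penn Treebank tag to a WordNet POS code."""
    return tag_map[penn_tag[0]]

for tag in ["JJ", "VBD", "RB", "NNS", "IN"]:
    print(tag, "->", wordnet_pos(tag))
```

Tags such as `IN` (preposition) have no entry, so they are lemmatized as nouns, which usually leaves the word unchanged.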


In [4]:
word_Lemmatized = WordNetLemmatizer()

# Map the first letter of a Penn Treebank tag to a WordNet POS; default to noun
tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV

# Build the stopword set and the punctuation table once, not on every call
stop_words = set(stopwords.words("english"))
translation_table = str.maketrans("", "", string.punctuation)

def clean_text(text):
  text = contractions.fix(str(text))                      # expand contractions
  text = word_tokenize(text)
  text = [token.lower() for token in text]
  text = [s.translate(translation_table) for s in text]   # strip punctuation
  text = [i for i in text if not i.isnumeric()]           # drop numbers
  text = [w for w in text if len(w) > 1]                  # drop empty and one-character tokens
  text = [word for word in text if word not in stop_words]
  # Lemmatize each token according to its part-of-speech tag
  text = " ".join([word_Lemmatized.lemmatize(word, tag_map[tag[0]]) for word, tag in pos_tag(text)])
  return text

Data["ABSTRACT"] = Data["ABSTRACT"].apply(clean_text)

# **Split the dataset**
- Split the data into training and test sets


In [5]:
X = Data["ABSTRACT"].values

labels = Data.drop(["ABSTRACT","id"],axis = 1).values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,labels, test_size=0.33, random_state=42)

# **Feature Extraction**

In [6]:
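# Keep the 2000 most frequent unigrams, bigrams, and trigrams
Tfidf_vect = TfidfVectorizer(max_features = 2000, ngram_range=(1,3))

# Fit the vocabulary on the training set only, then reuse it for the test set
X_train_tfidf = Tfidf_vect.fit_transform(X_train).toarray()
X_test_tfidf = Tfidf_vect.transform(X_test).toarray()

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = Tfidf_vect.get_feature_names_out()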
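
To make the numbers in the TF-IDF matrix less opaque, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings (raw term counts, smoothed IDF, L2 row normalization). The toy corpus and the `tfidf_row` helper are illustrative assumptions, and the sketch handles unigrams only, whereas the cell above also extracts bigrams and trigrams.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for the cleaned abstracts.
docs = [
    "quantum field theory",
    "graph theory algorithm",
    "quantum algorithm",
]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(t for doc in tokenized for t in set(doc))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Hypothetical helper: TF-IDF weights for one document, L2-normalized."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

row = tfidf_row(tokenized[0])
for term, weight in zip(vocab, row):
    print(f"{term:10s} {weight:.3f}")
```

In the first document, "field" receives a higher weight than "quantum" or "theory" because it appears in only one document of the corpus.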



# **Model Building**
Build a multilabel classifier by wrapping a decision tree in the Label Powerset transformation, then train it on the TF-IDF features of the preprocessed abstracts.
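
The idea behind Label Powerset can be sketched in a few lines: every distinct combination of labels is treated as one class, so a multilabel problem becomes an ordinary multiclass problem for the base classifier. The toy label matrix below is hypothetical, and the code illustrates the transformation only; it is not scikit-multilearn's implementation.

```python
# Each row is one paper's label vector (hypothetical toy data).
y = [
    (1, 0, 0, 0),
    (0, 0, 1, 0),
    (1, 0, 0, 1),
    (0, 0, 1, 0),
    (1, 0, 0, 1),
]

# Assign one class id to every distinct label combination.
combo_to_class = {}
for row in y:
    combo_to_class.setdefault(row, len(combo_to_class))
y_single = [combo_to_class[row] for row in y]

# After the base classifier predicts a class id, map it back to labels.
class_to_combo = {c: combo for combo, c in combo_to_class.items()}
decoded = [class_to_combo[c] for c in y_single]

print(y_single)  # [0, 1, 2, 1, 2]
```

One consequence of this design is that the classifier can only predict label combinations that occurred in the training data.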

In [8]:
classifier = LabelPowerset(DecisionTreeClassifier())
classifier.fit(X_train_tfidf,y_train)
predictions = classifier.predict(X_test_tfidf) 
predictions.toarray()[1000]

array([1, 0, 0, 0])

# **Model Evaluation**

In [9]:
print(hamming_loss(y_test,predictions))
print(accuracy_score(y_test,predictions))

0.18449805279099957
0.5616616183470359
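
The two numbers above measure different things. Hamming loss is the fraction of individual label slots that are predicted incorrectly, while `accuracy_score` on multilabel input is subset accuracy: a sample counts as correct only if all four labels match. The following sketch computes both by hand on a hypothetical set of four predictions.

```python
# Hypothetical ground truth and predictions for four papers.
y_true = [
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
]
y_pred = [
    [1, 0, 0, 0],  # exact match
    [0, 0, 1, 1],  # one extra label
    [1, 0, 0, 0],  # one missed label
    [0, 1, 0, 1],  # exact match
]

n_samples = len(y_true)
n_labels = len(y_true[0])

# Hamming loss: wrong label slots divided by all label slots.
wrong = sum(
    t != p
    for row_t, row_p in zip(y_true, y_pred)
    for t, p in zip(row_t, row_p)
)
hamming = wrong / (n_samples * n_labels)

# Subset accuracy: fraction of rows predicted exactly right.
subset_accuracy = sum(rt == rp for rt, rp in zip(y_true, y_pred)) / n_samples

print(hamming, subset_accuracy)  # 0.125 0.5
```

This is why a fairly low Hamming loss can coexist with a modest accuracy: a single wrong label out of four makes the whole sample count as a miss.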


In [10]:
xtest = ["The atmospheric greenhouse effect, an idea that many authors trace back to the traditional works of Fourier (1824), Tyndall (1861), and Arrhenius (1896), and which is still supported in global climatology, essentially describes a fictitious mechanism, in which a planetary atmosphere acts as a heat pump driven by an environment that is radiatively interacting with but radiatively equilibrated to the atmospheric system. According to the second law of thermodynamics, such a planetary machine can never exist. Nevertheless, in almost all texts of global climatology and in a widespread secondary literature, it is taken for granted that such a mechanism is real and stands on a firm scientific foundation. In this paper, the popular conjecture is analyzed and the underlying physical principles are clarified. By showing that (a) there are no common physical laws between the warming phenomenon in glass houses and the fictitious atmospheric greenhouse effects, (b) there are no calculations to determine an average surface temperature of a planet, (c) the frequently mentioned difference of 33° is a meaningless number calculated wrongly, (d) the formulas of cavity radiation are used inappropriately, (e) the assumption of a radiative balance is unphysical, (f) thermal conductivity and friction must not be set to zero, the atmospheric greenhouse conjecture is falsified."]
xclean = clean_text(xtest[0])  # pass the abstract string itself, not the enclosing list
xtest_cleaned = [xclean]
xtest_tfidf = Tfidf_vect.transform(xtest_cleaned).toarray()
pred = classifier.predict(xtest_tfidf)

In [11]:
pred.toarray()

array([[0, 0, 1, 0]])

In [13]:
Data.columns

Index(['id', 'ABSTRACT', 'Computer Science', 'Mathematics', 'Physics',
       'Statistics'],
      dtype='object')
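
Because the label matrix was built by dropping `ABSTRACT` and `id`, the positions in a prediction vector follow the remaining column order shown above. A small helper can turn a vector back into readable labels; the name `decode_labels` is hypothetical, and the sketch assumes the column order has not changed.

```python
# Label columns in the order they appear in Data.columns.
label_names = ["Computer Science", "Mathematics", "Physics", "Statistics"]

def decode_labels(row):
    """Hypothetical helper: list the label names whose flag is set."""
    return [name for name, flag in zip(label_names, row) if flag]

print(decode_labels([0, 0, 1, 0]))  # ['Physics']
```

Applied to the prediction for the sample abstract, `[0, 0, 1, 0]`, this yields Physics.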