# Lab Evaluation Details

## Problem Statement

https://www.kaggle.com/datasets/spsayakpaul/arxiv-paper-abstracts

Download the dataset above and build a text classifier that predicts a paper's subject areas from its abstract and title, using the latest NLP techniques taught in the course. In particular, demonstrate the use of word2vec and BERT.

An extra hour is given to comment your code and upload it. Add proper comments after each function and wherever else they are helpful. Upload a PDF of your code here. The plagiarism score should not be more than 10.

# Check if GPU is Connected

In [1]:
!nvidia-smi

Mon May  1 23:54:56 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce GTX 1650 Ti      On | 00000000:01:00.0 Off |                  N/A |
| N/A   53C    P0               18W /  50W|    783MiB /  4096MiB |     37%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

# Import Required Libraries

In [2]:
import pandas as pd
import numpy as np
import re

from gensim.models import Word2Vec

from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.linear_model import LogisticRegression

import torch

from nltk import sent_tokenize, word_tokenize, PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords

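# English stop words and a WordNet lemmatizer used by the preprocessing step
# (requires the NLTK 'stopwords', 'wordnet' and 'punkt' resources, e.g. via nltk.download)
stopWords = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()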



# Import Data

In [3]:
data = pd.read_csv(r'./data/arxiv_data.csv')
data

| | titles | summaries | terms |
|---|---|---|---|
| 0 | Survey on Semantic Stereo Matching / Semantic ... | Stereo matching is one of the widely used tech... | ['cs.CV', 'cs.LG'] |
| 1 | FUTURE-AI: Guiding Principles and Consensus Re... | The recent advancements in artificial intellig... | ['cs.CV', 'cs.AI', 'cs.LG'] |
| 2 | Enforcing Mutual Consistency of Hard Regions f... | In this paper, we proposed a novel mutual cons... | ['cs.CV', 'cs.AI'] |
| 3 | Parameter Decoupling Strategy for Semi-supervi... | Consistency training has proven to be an advan... | ['cs.CV'] |
| 4 | Background-Foreground Segmentation for Interio... | To ensure safety in automated driving, the cor... | ['cs.CV', 'cs.LG'] |
| ... | ... | ... | ... |
| 51769 | Hierarchically-coupled hidden Markov models fo... | We address the problem of analyzing sets of no... | ['stat.ML', 'physics.bio-ph', 'q-bio.QM'] |
| 51770 | Blinking Molecule Tracking | We discuss a method for tracking individual mo... | ['cs.CV', 'cs.DM'] |
| 51771 | Towards a Mathematical Foundation of Immunolog... | We attempt to set a mathematical foundation of... | ['stat.ML', 'cs.LG', 'q-bio.GN'] |
| 51772 | A Semi-Automatic Graph-Based Approach for Dete... | Diffusion Tensor Imaging (DTI) allows estimati... | ['cs.CV'] |


# Explore the Data

In [4]:
data.columns

Index(['titles', 'summaries', 'terms'], dtype='object')

In [5]:
newColumns = {
    'titles': 'Titles',
    'summaries': 'Summaries',
    'terms': 'Labels'
}

data = data.rename(columns=newColumns)

In [6]:
data.isnull().sum()

Titles       0
Summaries    0
Labels       0
dtype: int64

In [7]:
data['Labels'].value_counts()

['cs.CV']                                                            17369
['cs.LG', 'stat.ML']                                                  5251
['cs.LG']                                                             2732
['cs.CV', 'cs.LG']                                                    2067
['cs.LG', 'cs.AI']                                                    1702
                                                                     ...  
['cs.LG', 'stat.ML', 'I.6.4; I.5.3; I.4.6; I.2.4']                       1
['cs.LG', 'math.ST', 'stat.ML', 'stat.TH', '62H22, 62R01, 62J99']        1
['cs.LG', 'cs.RO', 'math.ST', 'stat.TH']                                 1
['cs.LG', 'cs.AI', 'cs.DS', '68T01, 68T09', 'I.2.6; I.5.1']              1
['stat.ML', 'cs.CV', 'cs.LG', 'q-bio.QM']                                1
Name: Labels, Length: 3157, dtype: int64

# Preprocessing using our Pipeline

In [8]:
recordList = []

In [9]:
for title in data['Titles']:
    # Convert the title to lowercase
    title = title.lower()

    # Replace every non-alphanumeric character with a space
    title = re.sub(r"[^A-Za-z0-9]", " ", title)
    
    # Split the cleaned title into individual words
    wordsInTitle = word_tokenize(title)
    
    # Filter out all the stop words
    wordsInTitle = [word for word in wordsInTitle if word not in stopWords]
    
    # Lemmatize each word in the title
    wordsInTitle = [lemmatizer.lemmatize(word) for word in wordsInTitle]
    
    # Join the remaining words back into a single string
    title = ' '.join(wordsInTitle)
    
    recordToAppend = {
        'Processed Titles': title,
    }
    
    recordList.append(recordToAppend)
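
The steps in the loop above can also be packaged as a reusable function. The following is a minimal sketch, not the notebook's actual code: `preprocess_title` and `demo_stop_words` are hypothetical names, a plain whitespace split stands in for `word_tokenize`, and the lemmatizer is passed in as a callable (identity by default) so the example runs without any NLTK data.

```python
import re
from typing import Callable, Iterable


def preprocess_title(
    title: str,
    stop_words: Iterable[str],
    lemmatize: Callable[[str], str] = lambda word: word,
) -> str:
    """Lowercase, strip non-alphanumerics, drop stop words, lemmatize."""
    stop_words = set(stop_words)
    # Lowercase, then replace every non-alphanumeric character with a space
    cleaned = re.sub(r"[^a-z0-9]", " ", title.lower())
    # Assumption: once punctuation is gone, a whitespace split is close
    # enough to word_tokenize for illustration purposes
    tokens = [tok for tok in cleaned.split() if tok not in stop_words]
    return " ".join(lemmatize(tok) for tok in tokens)


# Tiny hypothetical stop-word set, for illustration only
demo_stop_words = {"on", "for", "of", "the", "a"}
result = preprocess_title(
    "Survey on Semantic Stereo Matching / Semantic Depth Estimation",
    demo_stop_words,
)
print(result)  # survey semantic stereo matching semantic depth estimation
```

With the objects defined earlier, a call such as `data['Titles'].apply(lambda t: preprocess_title(t, stopWords, lemmatizer.lemmatize))` could take the place of the loop, although the tokenization may differ slightly from `word_tokenize`.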

In [10]:
processedData = pd.DataFrame(data=recordList, columns=['Processed Titles'])

data = pd.concat([data, processedData['Processed Titles']], axis=1)

data

| | Titles | Summaries | Labels | Processed Titles |
|---|---|---|---|---|
| 0 | Survey on Semantic Stereo Matching / Semantic ... | Stereo matching is one of the widely used tech... | ['cs.CV', 'cs.LG'] | survey semantic stereo matching semantic depth... |
| 1 | FUTURE-AI: Guiding Principles and Consensus Re... | The recent advancements in artificial intellig... | ['cs.CV', 'cs.AI', 'cs.LG'] | future ai guiding principle consensus recommen... |
| 2 | Enforcing Mutual Consistency of Hard Regions f... | In this paper, we proposed a novel mutual cons... | ['cs.CV', 'cs.AI'] | enforcing mutual consistency hard region semi ... |
| 3 | Parameter Decoupling Strategy for Semi-supervi... | Consistency training has proven to be an advan... | ['cs.CV'] | parameter decoupling strategy semi supervised ... |
| 4 | Background-Foreground Segmentation for Interio... | To ensure safety in automated driving, the cor... | ['cs.CV', 'cs.LG'] | background foreground segmentation interior se... |
| ... | ... | ... | ... | ... |
| 51769 | Hierarchically-coupled hidden Markov models fo... | We address the problem of analyzing sets of no... | ['stat.ML', 'physics.bio-ph', 'q-bio.QM'] | hierarchically coupled hidden markov model lea... |
| 51770 | Blinking Molecule Tracking | We discuss a method for tracking individual mo... | ['cs.CV', 'cs.DM'] | blinking molecule tracking |
| 51771 | Towards a Mathematical Foundation of Immunolog... | We attempt to set a mathematical foundation of... | ['stat.ML', 'cs.LG', 'q-bio.GN'] | towards mathematical foundation immunology ami... |
| 51772 | A Semi-Automatic Graph-Based Approach for Dete... | Diffusion Tensor Imaging (DTI) allows estimati... | ['cs.CV'] | semi automatic graph based approach determinin... |


# Creating `Word2Vec` Model

In [11]:
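# Tokenize each preprocessed title into a list of words (the Word2Vec training corpus)
sentences = [text.split() for text in data["Processed Titles"]]
# Turn each stringified label list into a Python list.
# Note: this simple split keeps the quote characters around every label
# and also breaks up labels that themselves contain ', '.
labels = data['Labels'].apply(lambda x: x.strip('][').split(', '))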
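
The `Labels` column stores each list as a string, and some labels (for example `'62H22, 62R01, 62J99'`) contain `', '` themselves. The sketch below compares the split used above with the standard library's `ast.literal_eval`, which is one possible alternative and not what the notebook runs. The variable names `raw`, `naive` and `parsed` are illustrative.

```python
import ast

# One label string taken from the value_counts() output shown earlier
raw = "['cs.LG', 'math.ST', 'stat.ML', 'stat.TH', '62H22, 62R01, 62J99']"

# The split used in the notebook: quotes stay attached and the last
# label is broken into three pieces
naive = raw.strip('][').split(', ')

# literal_eval parses the string as a Python literal, so quoting and
# embedded commas are handled correctly
parsed = ast.literal_eval(raw)

print(len(naive), naive[0])     # 7 'cs.LG'
print(len(parsed), parsed[-1])  # 5 62H22, 62R01, 62J99
```

Applied to the DataFrame, this would look like `data['Labels'].apply(ast.literal_eval)`. Note that switching would change the label set seen by the encoder and therefore the reported accuracy.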

In [12]:
sentences

[['survey',
  'semantic',
  'stereo',
  'matching',
  'semantic',
  'depth',
  'estimation'],
 ['future',
  'ai',
  'guiding',
  'principle',
  'consensus',
  'recommendation',
  'trustworthy',
  'artificial',
  'intelligence',
  'future',
  'medical',
  'imaging'],
 ['enforcing',
  'mutual',
  'consistency',
  'hard',
  'region',
  'semi',
  'supervised',
  'medical',
  'image',
  'segmentation'],
 ['parameter',
  'decoupling',
  'strategy',
  'semi',
  'supervised',
  '3d',
  'left',
  'atrium',
  'segmentation'],
 ['background',
  'foreground',
  'segmentation',
  'interior',
  'sensing',
  'automotive',
  'industry'],
 ['edgeflow',
  'achieving',
  'practical',
  'interactive',
  'segmentation',
  'edge',
  'guided',
  'flow'],
 ['efficient',
  'hybrid',
  'transformer',
  'learning',
  'global',
  'local',
  'context',
  'urban',
  'sence',
  'segmentation'],
 ['towards',
  'robust',
  'generalized',
  'medical',
  'image',
  'segmentation',
  'framework'],
 ['semi',
  'supervised

In [13]:
labels

0                                       ['cs.CV', 'cs.LG']
1                              ['cs.CV', 'cs.AI', 'cs.LG']
2                                       ['cs.CV', 'cs.AI']
3                                                ['cs.CV']
4                                       ['cs.CV', 'cs.LG']
                               ...                        
51769            ['stat.ML', 'physics.bio-ph', 'q-bio.QM']
51770                                   ['cs.CV', 'cs.DM']
51771                     ['stat.ML', 'cs.LG', 'q-bio.GN']
51772                                            ['cs.CV']
51773    ['stat.ML', 'physics.med-ph', 'stat.AP', 'stat...
Name: Labels, Length: 51774, dtype: object

# Apply `W2V` to get `X`

In [15]:
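# Train Word2Vec on the tokenized titles (min_count=1 keeps every word in the vocabulary)
model = Word2Vec(sentences, min_count=1, vector_size=100)
# Represent each title by the mean of its word vectors (assumes no title is empty after preprocessing)
X = np.array([np.mean([model.wv.get_vector(word) for word in sentence], axis=0) for sentence in sentences])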
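
Averaging word vectors breaks down when a title has no tokens left after preprocessing, because the mean of an empty list is undefined. The following is a minimal sketch of a guarded mean-pooling helper. `mean_pool` and `toy_vectors` are hypothetical names, and a plain dictionary stands in for the trained word vectors.

```python
import numpy as np


def mean_pool(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        # Fallback for empty or fully out-of-vocabulary sentences
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0)


# Toy 3-dimensional vectors, for illustration only
toy_vectors = {
    "image": np.array([1.0, 0.0, 2.0], dtype=np.float32),
    "segmentation": np.array([3.0, 2.0, 0.0], dtype=np.float32),
}

pooled = mean_pool(["image", "segmentation", "unknown"], toy_vectors, dim=3)
empty = mean_pool([], toy_vectors, dim=3)
print(pooled)
print(empty)
```

With the model trained above, `vectors` could plausibly be `model.wv` and `dim` could be `model.vector_size`, since gensim's keyed vectors support both membership tests and indexing by word.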

# Encode the Labels

In [None]:
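# Multi-hot encode the label lists (one column per distinct label)
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)

# Collapse each row to a single class id: argmax returns the index of the first 1,
# i.e. the paper's first label in the sorted order of mlb.classes_
y = np.argmax(y, axis=1)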
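
Taking the argmax of a multi-hot row reduces the multi-label problem to a single-label one, and the surviving label is determined by column order rather than by the order in which the paper lists its labels. The sketch below reproduces that effect in plain Python on three made-up samples. It mimics the sorted class order of `MultiLabelBinarizer` and the first-maximum behaviour of `np.argmax`, and all names in it are illustrative.

```python
# Three made-up label lists, for illustration only
samples = [["cs.CV", "cs.LG"], ["stat.ML", "cs.LG"], ["cs.LG"]]

# Classes in sorted order, as MultiLabelBinarizer arranges them by default
classes = sorted({label for row in samples for label in row})

# One 0/1 column per class
multi_hot = [[int(c in row) for c in classes] for row in samples]

# Index of the first 1 in each row, which is what argmax returns on 0/1 rows
first_hit = [row.index(1) for row in multi_hot]
collapsed = [classes[i] for i in first_hit]

print(classes)    # ['cs.CV', 'cs.LG', 'stat.ML']
print(multi_hot)  # [[1, 1, 0], [0, 1, 1], [0, 1, 0]]
print(collapsed)  # ['cs.CV', 'cs.LG', 'cs.LG']
```

The second sample lists `stat.ML` first but ends up as `cs.LG`, which shows that every label except the alphabetically first one is discarded.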

# Train-test Split

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Predict and Evaluate

In [16]:
# Fit a logistic regression classifier on the training data
# (default settings; the solver hits its iteration limit, see the warning in the output)
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Evaluate accuracy on the held-out test data
accuracy = clf.score(X_test, y_test)
print("Accuracy:", accuracy)

Accuracy: 0.6512795750845003


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
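
The warning above suggests raising `max_iter` or scaling the features. The following is a minimal sketch of both remedies combined in a scikit-learn pipeline. It uses synthetic data as a stand-in for the averaged Word2Vec features, so `X_demo`, `y_demo` and the value `max_iter=1000` are assumptions for illustration, and the resulting score says nothing about the arXiv dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic, linearly separable stand-in for the real feature matrix
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Standardize the features, then fit with a higher iteration limit
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_demo, y_demo)

train_accuracy = pipeline.score(X_demo, y_demo)
print(f"Training accuracy on synthetic data: {train_accuracy:.3f}")
```

Putting the scaler inside the pipeline means that, on a real train/test split, it is fitted on the training portion only and then reused unchanged on the test portion.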


---