<a href="https://colab.research.google.com/github/Mobikhani/NLP-Assignment-/blob/main/NLP_ASSIGNMENT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Q1: Choose a corpus of at least 200 MB from any NLP domain and perform
the following tasks:

• Text Preprocessing (Text Cleaning, Stemming / Lemmatization)

• Word Embedding (using an algorithm like Word2Vec, Glove, FastText)

• Encoding Techniques (Bag of Words, One-Hot)

• Part-of-Speech (POS) tagging

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
from google.colab import drive
import os

# Mount your Google Drive
drive.mount('/content/drive')

# Define the file path (make sure this is the correct path in your drive)
file_path = '/content/drive/MyDrive/arxiv_papers.csv'

# Check if the file exists
if os.path.exists(file_path):
    print("File found and loaded!")
else:
    print("File not found. Please check the path.")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
File found and loaded!


In [10]:
import pandas as pd

# Load the CSV from Google Drive and list its columns
file_path = '/content/drive/MyDrive/arxiv_papers.csv'
df = pd.read_csv(file_path)
print(df.columns)



Index(['abstract', 'author', 'date', 'pdf_url', 'title', 'pdf_text'], dtype='object')


In [14]:
import pandas as pd

# Load the dataset (assuming it's a CSV)
file_path = '/content/drive/MyDrive/arxiv_papers.csv'  # Adjust file path

# Read the CSV file
df = pd.read_csv(file_path)

# Check the columns and data types
print(df.columns)

# Select the 'abstract' or 'pdf_text' column for NLP tasks
texts = df['abstract']  # or df['pdf_text'] if that's more relevant

print(texts.head())  # Preview the first few entries


Index(['abstract', 'author', 'date', 'pdf_url', 'title', 'pdf_text'], dtype='object')
0    We first present our view of detection and cor...
1    We first present our view of detection and cor...
2    The choice of modeling units is critical to au...
3    Why should computers interpret language increm...
4    Stance detection is a classification problem i...
Name: abstract, dtype: object


In [13]:
!ls /content/drive/MyDrive/arxiv_papers.csv

/content/drive/MyDrive/arxiv_papers.csv


Step 1: Text Preprocessing (Text Cleaning, Tokenization, Lemmatization)

In [15]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download necessary NLTK resources
# (newer NLTK releases may also require the 'punkt_tab' resource)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Build these once instead of on every call to preprocess_text
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Preprocessing function
def preprocess_text(text):
    # Missing abstracts are read by pandas as NaN (a float); treat them as empty
    if not isinstance(text, str):
        return []

    # Convert to lowercase and delete non-alphabetic characters
    # (hyphenated words are merged, e.g. "in-domain" becomes "indomain")
    cleaned_text = re.sub(r'[^a-zA-Z\s]', '', text.lower())

    # Tokenize the cleaned text
    tokens = word_tokenize(cleaned_text)

    # Remove stopwords
    filtered_tokens = [word for word in tokens if word not in stop_words]

    # Lemmatization (WordNetLemmatizer treats every word as a noun by default)
    lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_tokens]

    return lemmatized_words

# Apply preprocessing to all abstracts
df['processed_text'] = df['abstract'].apply(preprocess_text)  # or df['pdf_text']
print(df['processed_text'].head())


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


0    [first, present, view, detection, correction, ...
1    [first, present, view, detection, correction, ...
2    [choice, modeling, unit, critical, automatic, ...
3    [computer, interpret, language, incrementally,...
4    [stance, detection, classification, problem, n...
Name: processed_text, dtype: object
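
The cleaning regex above deletes punctuation and digits outright, which is why merged tokens such as 'indomain' show up later in the vocabulary. The sketch below compares that behaviour with a hypothetical variant that substitutes a space instead. It uses plain `str.split` rather than `word_tokenize` to stay dependency-free, and the helper names `clean_drop` and `clean_space` are made up for this illustration.

```python
import re

def clean_drop(text):
    """Delete every character that is not a letter or whitespace (as in the cell above)."""
    return re.sub(r'[^a-zA-Z\s]', '', text.lower()).split()

def clean_space(text):
    """Hypothetical variant: replace those characters with a space instead."""
    return re.sub(r'[^a-zA-Z\s]', ' ', text.lower()).split()

sample = "In-domain data: 3 samples!"
print(clean_drop(sample))   # ['indomain', 'data', 'samples']
print(clean_space(sample))  # ['in', 'domain', 'data', 'samples']
```

Which variant is preferable depends on the corpus: deleting keeps compounds together as one token, while substituting a space keeps the individual words recognisable.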


Step 2: Word Embedding (Using Word2Vec)

In [16]:
from gensim.models import Word2Vec

# Prepare the data for Word2Vec (list of token lists)
text_list = df['processed_text'].tolist()

# Create the Word2Vec model
word2vec_model = Word2Vec(text_list, vector_size=100, window=5, min_count=2, workers=4)

# Save the model
word2vec_model.save("word2vec.model")

# Example: Find similar words to 'data'
print(word2vec_model.wv.most_similar('data'))


[('corpus', 0.5758063197135925), ('labeled', 0.5582237839698792), ('sample', 0.5454695224761963), ('indomain', 0.5227629542350769), ('unlabeled', 0.5107916593551636), ('resource', 0.5081765055656433), ('amount', 0.5058395266532898), ('scarce', 0.5009562969207764), ('qamrs', 0.4915483891963959), ('testing', 0.49047186970710754)]
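
The scores returned by `most_similar` are cosine similarities between word vectors. As a rough, dependency-free sketch of that ranking, the example below uses made-up 3-dimensional vectors rather than anything taken from the trained model; `toy_vectors` and `rank_neighbours` are hypothetical names chosen for this illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration only
toy_vectors = {
    "data": [1.0, 2.0, 0.0],
    "corpus": [2.0, 4.0, 0.0],   # same direction as "data"
    "parser": [0.0, 0.0, 3.0],   # orthogonal to "data"
}

def rank_neighbours(word, vectors):
    """Return the other words sorted by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

print(rank_neighbours("data", toy_vectors))
```

Because cosine similarity ignores vector length, "corpus" scores close to 1 against "data" even though its vector is twice as long.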


Step 3: Encoding Techniques (Bag of Words and One-Hot Encoding)

(i) Bag of Words (BoW):


In [17]:
from sklearn.feature_extraction.text import CountVectorizer

# Convert processed text back to strings for BoW
lemmatized_texts = [' '.join(text) for text in df['processed_text']]

# Create a Bag of Words representation
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(lemmatized_texts)

# Display the BoW matrix and feature names
print(X_bow.toarray())
print(vectorizer.get_feature_names_out()[:20])  # Show first 20 feature names


[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
['aa' 'aaai' 'aac' 'aadit' 'aae' 'aaelike' 'aalstm' 'aalto' 'aam' 'aan'
 'aapr' 'aardvark' 'aarnethompsonuther' 'aat' 'ab' 'abacha' 'abandon'
 'abandoned' 'abandoning' 'abater']
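
To make the matrix above easier to read, here is a minimal sketch of the same idea on two tiny documents: each row is a document, each column a vocabulary term, and each cell a count. It is a simplified stand-in, not how `CountVectorizer` is implemented (for instance, `CountVectorizer` drops single-character tokens by default, which this sketch does not). The names `docs` and `bow_vector` are illustrative.

```python
from collections import Counter

# Two toy tokenised documents, made up for illustration
docs = [
    ["stance", "detection", "classification", "problem"],
    ["detection", "correction", "detection"],
]

# Column order: the sorted set of all terms
vocabulary = sorted({token for doc in docs for token in doc})

def bow_vector(tokens, vocabulary):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

bow_matrix = [bow_vector(doc, vocabulary) for doc in docs]

print(vocabulary)  # ['classification', 'correction', 'detection', 'problem', 'stance']
for row in bow_matrix:
    print(row)
# [1, 0, 1, 1, 1]
# [0, 1, 2, 0, 0]
```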


(ii) One-Hot Encoding:

In [None]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Collect the vocabulary: unique words across all processed texts (sorted for a stable order)
unique_words = sorted({word for text in df['processed_text'] for word in text})

# Keep the result sparse: a dense (vocabulary x vocabulary) matrix may not fit in memory.
# The `sparse` argument was renamed to `sparse_output` in scikit-learn 1.2.
one_hot_encoder = OneHotEncoder(sparse_output=True)
one_hot_encoded = one_hot_encoder.fit_transform(np.array(unique_words).reshape(-1, 1))

print(one_hot_encoded.shape)  # (vocabulary size, vocabulary size)
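
For intuition, a one-hot vector has one position per vocabulary word, with a single 1 at that word's index and 0 everywhere else. The sketch below builds such vectors by hand for a made-up three-word vocabulary; `one_hot` and `word_to_index` are hypothetical names, not part of scikit-learn.

```python
def one_hot(word, word_to_index):
    """Return a list with a 1 at the word's index and 0 everywhere else."""
    vector = [0] * len(word_to_index)
    vector[word_to_index[word]] = 1
    return vector

# Toy vocabulary, invented for illustration
vocabulary = sorted({"corpus", "data", "model"})
word_to_index = {word: index for index, word in enumerate(vocabulary)}

print(one_hot("data", word_to_index))  # [0, 1, 0]
```

Each vector is as long as the vocabulary, which is why one-hot encoding a large corpus is normally kept in a sparse format.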


Step 4: Parts of Speech (POS) Tagging

In [None]:
nltk.download('averaged_perceptron_tagger')

# Function for POS tagging
def pos_tagging(text):
    return nltk.pos_tag(text)

# Apply POS tagging to the processed text
df['pos_tags'] = df['processed_text'].apply(pos_tagging)

# Display POS tags for the first few rows
print(df['pos_tags'].head())


Q2: Basic NLP Tasks
For the second part, we’ll choose Named Entity Recognition (NER) and Topic Modeling (LDA).

Task 1: Named Entity Recognition (NER)
We’ll use spaCy to extract named entities from each abstract or pdf_text.

In [None]:
import spacy

# Download spaCy's small English model
!python -m spacy download en_core_web_sm

# Load the model
nlp = spacy.load('en_core_web_sm')

# Function to perform Named Entity Recognition
def ner(text):
    # Run spaCy on the original text: the NER model relies on casing and
    # context, which the lowercased, stopword-free tokens no longer have
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]

# Apply NER to the raw abstracts (missing values become empty strings)
df['entities'] = df['abstract'].fillna('').apply(ner)

# Display named entities for the first few rows
print(df['entities'].head())


Task 2: Topic Modeling (LDA)
We’ll use Gensim to perform topic modeling using Latent Dirichlet Allocation (LDA).

In [None]:
from gensim import corpora
from gensim.models import LdaModel

# Create a dictionary and corpus for LDA
dictionary = corpora.Dictionary(df['processed_text'])
corpus = [dictionary.doc2bow(text) for text in df['processed_text']]

# Train the LDA model
lda_model = LdaModel(corpus, num_topics=5, id2word=dictionary, passes=15)

# Print the topics
topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)
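
The `corpus` passed to `LdaModel` is a list of documents, each represented as (token id, count) pairs. The following dependency-free sketch mimics that representation on two toy documents. It is only an approximation of what `corpora.Dictionary` and `doc2bow` produce: the ids Gensim assigns may differ, and `build_dictionary` and `doc_to_bow` are hypothetical helper names.

```python
from collections import Counter

def build_dictionary(docs):
    """Assign each distinct token an integer id in order of first appearance."""
    token_to_id = {}
    for tokens in docs:
        for token in tokens:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id

def doc_to_bow(tokens, token_to_id):
    """Convert one document into a sorted list of (token id, count) pairs."""
    counts = Counter(token_to_id[token] for token in tokens if token in token_to_id)
    return sorted(counts.items())

# Toy tokenised documents, made up for illustration
docs = [["topic", "model", "topic"], ["model", "corpus"]]
token_to_id = build_dictionary(docs)
corpus_bow = [doc_to_bow(doc, token_to_id) for doc in docs]

print(token_to_id)  # {'topic': 0, 'model': 1, 'corpus': 2}
print(corpus_bow)   # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```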
