
# spaCy Pipeline for Research Abstract Processing (arXiv Dataset)

Course: Natural Language Processing  
Assignment: Text Preprocessing with NLTK and spaCy  

**Objective:**  
Analyze research abstracts with spaCy: extract noun phrases, named entities, and technical-term patterns, then visualize the results.


In [None]:

!pip install spacy pandas matplotlib seaborn
!python -m spacy download en_core_web_sm


In [None]:

import pandas as pd
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from spacy.matcher import Matcher


In [None]:

nlp = spacy.load("en_core_web_sm")


## Load arXiv Dataset

In [None]:

df = pd.read_csv("arxiv_data.csv")
df.head()



## Select Abstract Text
This notebook assumes the abstracts are stored in a column named `abstract`.


In [None]:

texts = df['abstract'].dropna().tolist()
texts[:3]
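
The cell above assumes the CSV has an `abstract` column. Exports of arXiv metadata do not all use the same column names, so a small lookup can fail early with a readable message. The sketch below is a hypothetical helper: `find_abstract_column` and its candidate names are assumptions, not something documented for this dataset. It accepts any iterable of column names, such as `df.columns`.

```python
def find_abstract_column(columns, candidates=("abstract", "abstracts", "summary", "summaries")):
    """Return the first column whose stripped, lowercased name is a known candidate.

    The candidate names are guesses; adjust them to the file at hand.
    """
    # Map normalized names back to the original column labels.
    normalized = {str(col).strip().lower(): col for col in columns}
    for name in candidates:
        if name in normalized:
            return normalized[name]
    raise KeyError(f"No abstract-like column found in: {list(columns)}")


# Example with a plain list; in the notebook this would be df.columns.
print(find_abstract_column(["id", "title", "Abstract "]))
```

In the notebook, the selection could then read `texts = df[find_abstract_column(df.columns)].dropna().tolist()`.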


## Process Abstracts using spaCy

In [None]:

# Process only the first 500 abstracts; nlp.pipe streams texts through the pipeline in batches
docs = list(nlp.pipe(texts[:500]))
len(docs)


## Extract Frequent Noun Phrases

In [None]:

noun_phrases = []

for doc in docs:
    for chunk in doc.noun_chunks:
        noun_phrases.append(chunk.text.lower())

noun_phrase_freq = Counter(noun_phrases)
noun_phrase_freq.most_common(20)
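
Raw noun chunks include pronouns and determiner variants, so counts tend to be split across strings like `the model` and `a model`, and pronouns such as `we` can crowd the top of the list. The sketch below shows one possible cleanup on plain strings. The word lists and the helper name `normalize_chunk` are illustrative assumptions and would need tuning on real data.

```python
from collections import Counter

# Illustrative word lists (assumptions); extend as needed.
LEADING_DETERMINERS = {"a", "an", "the", "this", "that", "these", "those", "our", "its", "their"}
PRONOUNS = {"we", "it", "they", "i", "us", "them", "which", "that", "this", "these", "those", "who"}


def normalize_chunk(text):
    """Lowercase, collapse whitespace, drop pronoun-only chunks and one leading determiner."""
    tokens = text.lower().split()
    if len(tokens) == 1 and tokens[0] in PRONOUNS:
        return None
    if len(tokens) > 1 and tokens[0] in LEADING_DETERMINERS:
        tokens = tokens[1:]
    return " ".join(tokens) if tokens else None


# Made-up chunk strings standing in for chunk.text values.
sample_chunks = ["The model", "a model", "we", "Neural\nnetworks", "this paper"]
cleaned = [c for c in map(normalize_chunk, sample_chunks) if c is not None]
print(Counter(cleaned).most_common(3))
```

In the loop above, `normalize_chunk(chunk.text)` could replace `chunk.text.lower()`, skipping any `None` results before appending.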


## Extract Named Entities

In [None]:

entities = []

for doc in docs:
    for ent in doc.ents:
        if ent.label_ in ['ORG', 'DATE', 'PRODUCT', 'GPE']:
            entities.append((ent.text, ent.label_))

entity_freq = Counter(entities)
entity_freq.most_common(20)
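
A single `most_common(20)` list mixes all four labels, so frequent dates can push organizations out of view. A per-label breakdown is sometimes easier to read. The sketch below groups `(text, label)` pairs by label; `top_entities_by_label` is a hypothetical helper, and the sample pairs are made up for illustration rather than taken from the dataset.

```python
from collections import Counter, defaultdict


def top_entities_by_label(entity_pairs, n=3):
    """Group (text, label) pairs by label and return the n most common texts for each."""
    per_label = defaultdict(Counter)
    for text, label in entity_pairs:
        # Collapse internal whitespace so line-wrapped mentions count together.
        per_label[label][" ".join(text.split())] += 1
    return {label: counts.most_common(n) for label, counts in per_label.items()}


# Made-up pairs standing in for the `entities` list built above.
sample = [("MIT", "ORG"), ("2019", "DATE"), ("MIT", "ORG"), ("Google", "ORG"), ("2019", "DATE")]
for label, top in top_entities_by_label(sample).items():
    print(label, top)
```

Calling `top_entities_by_label(entities)` on the notebook's own list would give the same kind of breakdown for the processed abstracts.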


## Rule-Based Matching for Technical Terms

In [None]:

matcher = Matcher(nlp.vocab)

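# Technical-term pattern: optional adjectives followed by one or more nouns
pattern = [
    {"POS": "ADJ", "OP": "*"},   # zero or more adjectives
    {"POS": "NOUN", "OP": "+"}   # one or more nouns
]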

matcher.add("TECHNICAL_TERM", [pattern])

matched_terms = []

for doc in docs:
    matches = matcher(doc)
    for match_id, start, end in matches:
        matched_terms.append(doc[start:end].text.lower())

Counter(matched_terms).most_common(20)
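
With `ADJ*` followed by `NOUN+`, the matcher reports every span that fits the pattern, including spans nested inside a longer match, so a phrase such as `deep neural network` can also be counted as `neural network` and `network`. spaCy provides `greedy="LONGEST"` on `matcher.add` and `spacy.util.filter_spans` for this purpose. The sketch below reproduces the same idea on plain `(match_id, start, end)` tuples so the logic is visible. The helper name `keep_longest` and the sample offsets are hypothetical.

```python
def keep_longest(matches):
    """Keep the longest matches first and drop any match overlapping one already kept."""
    # Sort by span length (descending), then by start offset for stable tie-breaking.
    ordered = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    kept, used = [], set()
    for match_id, start, end in ordered:
        if not any(i in used for i in range(start, end)):
            kept.append((match_id, start, end))
            used.update(range(start, end))
    # Return the surviving matches in document order.
    return sorted(kept, key=lambda m: m[1])


# Hypothetical matches over tokens: 0=deep 1=neural 2=network 3=for 4=images
raw = [(1, 0, 3), (1, 1, 3), (1, 2, 3), (1, 4, 5)]
print(keep_longest(raw))
```

In the loop above, iterating over `keep_longest(matcher(doc))` instead of `matches` would count each phrase once, at its longest extent.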


## Visualization: Top Noun Phrases

In [None]:

top_np = noun_phrase_freq.most_common(10)
np_df = pd.DataFrame(top_np, columns=['Noun Phrase', 'Frequency'])

plt.figure(figsize=(8,5))
sns.barplot(data=np_df, x='Frequency', y='Noun Phrase')
plt.title("Top Noun Phrases in arXiv Abstracts")
plt.show()


## Visualization: Entity Frequency

In [None]:

# Count entity mentions per label (one entry per extracted mention)
entity_labels = [label for _, label in entities]
entity_label_freq = Counter(entity_labels)

plt.figure(figsize=(6,4))
sns.barplot(x=list(entity_label_freq.keys()), y=list(entity_label_freq.values()))
plt.title("Named Entity Frequency")
plt.xlabel("Entity Type")
plt.ylabel("Count")
plt.show()



## Expected Output Summary

- Frequent noun phrases representing technical concepts  
- Named entities such as organizations and dates  
- Rule-based matched technical terms  
- Visual summaries using bar charts  

This notebook demonstrates a spaCy pipeline applied to real-world research text.
