# 🔖 Research Article Recommendation:

**Discovering Relevant Research Articles Using Text Similarity**
**Author: Tammireddy Sri Vallabh**

---

### Motivation – Why did you pick this topic?

As a student constantly exploring new research areas, I often found it difficult to discover relevant academic articles quickly. Most platforms offer keyword-based search, but that often misses context or nuanced connections between research topics. I wanted to create a smarter way to surface articles based on meaning and similarity, not just keywords. This led me to build a content-based recommendation system that understands and processes natural language input to find the most relevant research articles using NLP techniques.

---

### Learning Outcomes

From this project, I gained hands-on experience with:

- **Web scraping** using BeautifulSoup to collect academic article metadata from Springer.
- **Natural Language Processing (NLP)** techniques like tokenization, POS tagging, and stopword removal.
- Building **TF-IDF-based vectorizers** to convert textual data into numerical formats for similarity calculations.
- Using **cosine similarity** to rank results based on relevance.
- Implementing an **end-to-end Streamlit web app** with user input validation, real-time recommendations, and clean UI.
- Understanding how multiple textual signals (titles, keywords, authors) can be combined for better document representation.

---

### Code / Notebook Highlights

- **Data Collection**: Articles scraped from Springer journals using BeautifulSoup.
- **Preprocessing**: Custom functions to clean and normalize text, group articles by journals, and extract useful tags.
- **TF-IDF Matrix Generation**: Journal-level and article-level TF-IDF matrices created to support search.
- **Web App**: Built using Streamlit with an interactive UI for topic input and link-based output.

You can view the code and run the app through the `app.py` script. All models and matrices are serialized using `pickle` for easy deployment.

---

### Reflections

**(a) What surprised me?**
I was surprised at how effective a basic TF-IDF model could be when paired with thoughtful preprocessing. By cleaning and combining different text fields, the model could make surprisingly good recommendations — even without using deep learning.

**(b) Scope for improvement:**

- Integrate **contextual embeddings** (like Sentence-BERT) instead of TF-IDF for deeper semantic understanding.
- Expand to **multimodal inputs** by including images, abstracts, or full papers.
- Add **feedback mechanisms** for users to rate articles and fine-tune recommendations over time.
- Scale to a larger dataset and optimize processing with multiprocessing or distributed computing.


## 🧹 From Raw Data to Search-Ready Insights: How Journals Are Preprocessed

When building a smart research discovery system, raw data alone isn’t enough. Behind the scenes, there's a thoughtful transformation that prepares this data for meaningful search and recommendation. Here's how it works.

Imagine a huge collection of research articles, each tied to different journals, authors, and topics. To make sense of it all, we start by **organizing the articles journal by journal**. This means pulling together all article titles that belong to the same publication. Along with this, we collect all the authors who’ve contributed to these articles, and the keywords that hint at what each journal is about.

Next comes the cleaning phase. Raw titles and author names often include extra punctuation, inconsistent casing, or irrelevant words. So each piece of text — whether it's an article title, an author name, or a keyword — goes through a process that removes unnecessary clutter. This ensures we're left with clean, relevant terms that truly describe the journal’s focus.

Once the text is cleaned, everything is **merged into a single, rich description for each journal**. This description captures the essence of what that journal publishes, who contributes to it, and what topics it frequently covers. We call this the journal’s "tag profile."

This tag profile is the **backbone of the recommendation system**. When a user searches for a topic, the system compares the query against each journal's tag profile, so matches draw on the combined titles, authors, and keywords of a journal rather than on any single field.

In short, this step transforms messy, scattered metadata into an intelligent foundation for search — helping users discover journals that matter to them, faster and more accurately.


# Code


## 📦 Primary Imports

The following libraries are essential for data preprocessing, natural language processing (NLP), and text similarity computation:

- **os, re, unicodedata**: For basic system operations and text normalization.
- **pandas**: For structured data manipulation.
- **nltk**: For tokenization, part-of-speech tagging, and stopword handling.
- **sklearn**: For TF-IDF vectorization and cosine similarity to match user input with articles.
- **pickle**: To load pre-saved models and data efficiently.


In [1]:
import os
import pickle
import re
import unicodedata

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

These NLTK resources provide tokenization (`punkt`), part-of-speech tagging (`averaged_perceptron_tagger`), and stopword filtering (`stopwords`). `punkt_tab` and `averaged_perceptron_tagger_eng` are the tokenizer and tagger data that recent NLTK releases look for, so both the old and the new resource names are downloaded to stay compatible across versions.


In [2]:
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mssri\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\mssri\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\mssri\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mssri\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\mssri\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

The `get_paragraph` function takes a DataFrame row and a column name (`index`) and concatenates all strings in that column (expected to hold a list of strings) into one lowercase, space-separated string.


In [3]:
def get_paragraph(row, index):
    ans = ''
    for x in row[index]:
        ans = ans + ' ' + x.lower()
    return ans
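
As a quick check, the helper can be called on a plain dictionary standing in for a DataFrame row. The sample titles below are made up for illustration. Note that the result starts with a leading space, because every item is prefixed with one.

```python
def get_paragraph(row, index):
    ans = ''
    for x in row[index]:
        ans = ans + ' ' + x.lower()
    return ans

# A dict behaves like a row here: row[index] returns the list of strings.
sample_row = {'Articles': ['Muon Cooling', 'Time Reversal and Reciprocity']}
paragraph = get_paragraph(sample_row, 'Articles')
print(repr(paragraph))  # ' muon cooling time reversal and reciprocity'
```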

#### `remove_accents(text)`

Removes diacritical marks (accents) from characters using Unicode normalization. Any character that still has no ASCII form after normalization is dropped.


In [4]:

def remove_accents(text):
    text = unicodedata.normalize('NFKD', text).encode(
        'ASCII', 'ignore').decode('utf-8')
    return text
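
A short usage sketch shows both the intended effect and one caveat. The sample names are invented; the last call illustrates that letters without a Unicode decomposition (such as `ø`) are removed rather than replaced.

```python
import unicodedata

def remove_accents(text):
    # NFKD splits 'é' into 'e' + combining accent; the ASCII encode drops the accent
    return unicodedata.normalize('NFKD', text).encode('ASCII', 'ignore').decode('utf-8')

print(remove_accents('José Müller'))  # Jose Muller
print(remove_accents('naïve café'))   # naive cafe
print(remove_accents('Søren'))        # Sren  (ø has no ASCII decomposition)
```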

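#### `get_clean_text(row, index)`

- Cleans and tokenizes text from a specified column in a row.
- Returns an empty string if the value is not a string or is the literal `"NULL"`.
- Applies the following steps:
  - Lowercases the text
  - Tokenizes it with `word_tokenize`
  - Removes commas and accents from each token
  - Filters out:
    - Tokens that are not purely alphabetic (digits, punctuation, hyphenated words)
    - Stopwords (common words like “the”, “is”)
    - Single-character words
    - Words with a dot as their second character (likely abbreviations); the alphabetic test already rejects these, so this check is only a safeguard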


In [5]:
def get_clean_text(row, index):
    if not isinstance(row[index], str):
        return ''
    if row[index] == "NULL":
        return ''
    clean_text = ''
    words = word_tokenize(row[index].lower())
    for word in words:
        word = word.replace(',', ' ')
        word = remove_accents(word)
        if re.match(r'^[a-zA-Z]+$', word) and word not in stop_words and len(word) > 1 and word[1] != '.':
            clean_text += ' ' + word
    return clean_text
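
To see what the filter chain keeps, here is a simplified stand-in that avoids the NLTK downloads. It is a sketch under assumptions: `str.split` replaces `word_tokenize`, and `DEMO_STOP_WORDS` is a tiny hypothetical stopword set, so its output can differ from the real function on punctuation-heavy text.

```python
import re
import unicodedata

DEMO_STOP_WORDS = {'the', 'is', 'in', 'of', 'and', 'a', 'to'}  # hypothetical subset

def demo_clean_text(text):
    kept = []
    for word in text.lower().split():
        # Strip accents the same way remove_accents does
        word = unicodedata.normalize('NFKD', word).encode('ASCII', 'ignore').decode('utf-8')
        # Keep purely alphabetic tokens longer than one character that are not stopwords
        if re.match(r'^[a-zA-Z]+$', word) and word not in DEMO_STOP_WORDS and len(word) > 1:
            kept.append(word)
    return ' '.join(kept)

cleaned = demo_clean_text('The Physics of Café Lasers in 2021 , a Review')
print(cleaned)  # physics cafe lasers review
```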

#### `combine(row, indices)`

Concatenates the text from multiple columns (`indices`) of a row into one space-separated string.


In [6]:
def combine(row, indices):
    ans = ''
    for i in indices:
        ans = ans + ' ' + row[i]
    return ans

`stop_words` is a set of common English stopwords used to filter out uninformative words during cleaning. It must be defined before `get_clean_text` is called, because that function reads it as a global.


In [7]:
stop_words = set(stopwords.words('english'))

In [8]:
main_df = pd.read_csv('compressed_data.bz2', compression='bz2')
main_df = main_df.drop(['item_doi'], axis=1)

In [9]:
main_df.head()

|   | item_title | publication_title | authors | publication_year | url | keywords |
|---|---|---|---|---|---|---|
| 0 | correction to strong field physics pursued wit... | aapps bulletin | vishwa bandhu pathakseong ku leeki hong paecal... | 2021 | http://link.springer.com/article/10.1007/s4367... | AAPPS Bulletin Atomic Molecular Optical and Pl... |
| 1 | correction time reversal and reciprocity | aapps bulletin | olivier sigwarthchristian miniatura | 2022 | http://link.springer.com/article/10.1007/s4367... | AAPPS Bulletin Atomic Molecular Optical and Pl... |
| 2 | ultrasound detection using microcavity raman l... | aapps bulletin | xiao | 2022 | http://link.springer.com/article/10.1007/s4367... | AAPPS Bulletin Atomic Molecular Optical and Pl... |
| 3 | relativistic density functional theory in nucl... | aapps bulletin | jie mengpengwei zhao | 2021 | http://link.springer.com/article/10.1007/s4367... | AAPPS Bulletin Atomic Molecular Optical and Pl... |
| 4 | muon cooling and acceleration | aapps bulletin | masashi otani | 2022 | http://link.springer.com/article/10.1007/s4367... | AAPPS Bulletin Atomic Molecular Optical and Pl... |


### Preprocessing and Tag Generation for Journals

The `get_journal_df` function takes the raw DataFrame of article metadata and produces a cleaned DataFrame at the **journal level**, with one combined text field per journal for similarity-based search.

#### 🔧 Key Processing Steps:

1. **Group Articles by Journal**

2. **Group Authors by Journal**

3. **Extract Keywords for Each Journal**

4. **Join the Article Titles, Authors, and Keywords**

5. **Text Cleaning**

6. **Create Combined Tags**


In [10]:
def get_journal_df(df):
    journal_art = df.groupby('publication_title')['item_title'].apply(
        list).reset_index(name='Articles')
    journal_art.set_index(['publication_title'], inplace=True)

    journal_auth = df.groupby('publication_title')['authors'].apply(
        list).reset_index(name='authors')
    journal_auth.set_index('publication_title', inplace=True)

    journal_key = df.drop_duplicates(
        subset=["publication_title", "keywords"], keep='first')
    journal_key = journal_key.drop(
        ['item_title', 'authors', 'publication_year', 'url'], axis=1)
    journal_key.set_index(['publication_title'], inplace=True)

    journal_main = journal_art.join([journal_key, journal_auth])
    journal_main.reset_index(inplace=True)

    journal_main['Articles'] = journal_main.apply(
        get_paragraph, index='Articles', axis=1)
    journal_main['Articles'] = journal_main.apply(
        get_clean_text, index='Articles', axis=1)
    journal_main['authors'] = journal_main.apply(
        get_paragraph, index='authors', axis=1)
    journal_main['authors'] = journal_main.apply(
        get_clean_text, index='authors', axis=1)
    journal_main['keywords'] = journal_main.apply(
        get_clean_text, index='keywords', axis=1)

    journal_main['Tags'] = journal_main.apply(
        combine, indices=['keywords', 'Articles', 'authors'], axis=1)
    journal_main['Tags'] = journal_main.apply(
        get_clean_text, index='Tags', axis=1)

    return journal_main
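
The grouping step at the start of `get_journal_df` can be seen in isolation on a toy DataFrame. This is a minimal sketch: the rows are invented, and only the title column is grouped.

```python
import pandas as pd

# Toy metadata; the rows are invented for illustration only.
toy_df = pd.DataFrame({
    'publication_title': ['journal a', 'journal a', 'journal b'],
    'item_title': ['muon cooling', 'time reversal', 'graph kernels'],
})

# One row per journal, with that journal's titles collected into a list
grouped = (toy_df.groupby('publication_title')['item_title']
           .apply(list)
           .reset_index(name='Articles'))
print(grouped)

# Plain dict view of the same result, convenient for inspection
articles_by_journal = dict(zip(grouped['publication_title'], grouped['Articles']))
```

Each list can then be flattened with `get_paragraph` and cleaned with `get_clean_text`, which is what the function above does for titles and authors.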

### Building `journal_main` from `main_df`


In [11]:
# Journal Dataframe
journal_main = get_journal_df(main_df)
print('journal_main processed')

journal_main processed


Save the journal-level DataFrame to disk:


In [12]:
import pickle
# Save journal_main
with open("journal_main.pkl", "wb") as f:
    pickle.dump(journal_main, f)

print("Saved journal_main.pkl")

Saved journal_main.pkl


## TF-IDF (Term Frequency-Inverse Document Frequency)

#### Formula Explanation

1. **Term Frequency (TF)**:
   The formula for **TF** is:

   $$
   \text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
   $$

2. **Inverse Document Frequency (IDF)**:
   The formula for **IDF** is:

   $$
   \text{IDF}(t) = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents containing the term } t} \right)
   $$

3. **TF-IDF Calculation**:
   The TF-IDF score of a term $t$ in a document $d$ is the product of the term's TF and IDF scores:

   $$
   \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)
   $$
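
The three formulas can be checked by hand on a tiny corpus. The sketch below is a direct transcription in plain Python, assuming the natural logarithm and three invented documents. It is not what `TfidfVectorizer` computes internally: scikit-learn uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from this textbook version.

```python
import math

# Three invented, already-tokenized documents
docs = [
    ['quantum', 'optics', 'laser'],
    ['laser', 'cooling', 'laser'],
    ['graph', 'theory'],
]

def tf(term, doc):
    # Share of the document's tokens that are this term
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Assumes the term occurs in at least one document
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

score_laser = tf_idf('laser', docs[1], docs)
score_cooling = tf_idf('cooling', docs[1], docs)
print(round(score_laser, 4), round(score_cooling, 4))  # 0.2703 0.3662
```

Although "laser" appears twice in the second document, "cooling" scores higher because it occurs in only one of the three documents.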


In [13]:
def get_tfidfs(journal_main):
    vectorizer = TfidfVectorizer(decode_error='ignore', strip_accents='ascii')
    journal_tfidf_matrix = vectorizer.fit_transform(journal_main['Tags'])
    return vectorizer, journal_tfidf_matrix

In [14]:
vectorizer, journal_tfidf_matrix = get_tfidfs(journal_main)
print('TF-IDF matrix and vectorizer for journals completed')

TF-IDF matrix and vectorizer for journals completed


In [15]:
# Save the vectorizer
with open("vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)

# Save the journal TF-IDF matrix
with open("journal_tfidf_matrix.pkl", "wb") as f:
    pickle.dump(journal_tfidf_matrix, f)

print("Saved vectorizer.pkl and journal_tfidf_matrix.pkl")

Saved vectorizer.pkl and journal_tfidf_matrix.pkl
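
Loading works the same way in reverse. The round trip below is a small sketch with a stand-in dictionary rather than the real vectorizer. Two caveats apply to the real files: only unpickle data from a trusted source, and a pickled scikit-learn object is not guaranteed to load under a different scikit-learn version.

```python
import pickle

# Stand-in object; the real notebook pickles a vectorizer and a sparse matrix
payload = {'journal_threshold': 4, 'terms': ['laser', 'optics']}

blob = pickle.dumps(payload)   # same byte format that pickle.dump writes to a file
restored = pickle.loads(blob)  # equivalent to pickle.load on that file
print(restored == payload)     # True
```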


The function `get_article_df(row)` builds a cleaned, article-level DataFrame for the journal in the given `journal_main` row. It works in the following steps:

1. **Filter `main_df` to the articles of the current journal**
2. **Clean text data for item title and authors**
3. **Tokenize the item title**
4. **Apply Part-of-Speech (POS) tagging**
5. **Extract nouns and adjectives**
6. **Join the extracted tags into one string**
7. **Add authors and publication year to tags**
8. **Drop unnecessary columns**
9. **Reset and set index**
10. **Return the processed DataFrame**

The result keeps each article's cleaned title, its URL, and a `Tags` string that is later vectorized for article-level search.


In [28]:
def get_article_df(row):
    article = main_df.loc[main_df['publication_title'] ==
                          journal_main['publication_title'][row.name]].copy()
    article['item_title'] = article.apply(
        get_clean_text, index='item_title', axis=1)
    article['authors'] = article.apply(get_clean_text, index='authors', axis=1)
    article['Tokenized'] = article['item_title'].apply(word_tokenize)
    article['Tagged'] = article['Tokenized'].apply(pos_tag)
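    # Parentheses are needed: `and` binds tighter than `or`, so without them
    # the stopword check would apply to adjectives only.
    article['Tags'] = article['Tagged'].apply(lambda x: [word for word, tag in x if
                                                         (tag.startswith('NN') or tag.startswith('JJ')) and word.lower() not in stop_words])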
    article['Tags'] = article.apply(get_paragraph, index='Tags', axis=1)
    article['Tags'] = article.apply(
        lambda x: x['Tags'] + ' ' + x['authors'] + ' ' + str(x['publication_year']), axis=1)
    article = article.drop(['keywords', 'publication_title',
                           'Tokenized', 'Tagged', 'authors', 'publication_year'], axis=1)
    article.reset_index(inplace=True)
    article.set_index('index', inplace=True)
    return article
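
The noun/adjective filter can be tried without running the tagger by feeding it a hand-written list of `(word, tag)` pairs in Penn Treebank style. The tags and the two-word stopword set below are assumptions made for illustration, not output from `pos_tag`.

```python
demo_stop_words = {'new', 'the'}  # hypothetical stopwords for the demo

# Hand-written (word, tag) pairs standing in for pos_tag output
tagged = [
    ('relativistic', 'JJ'), ('density', 'NN'), ('functional', 'JJ'),
    ('theory', 'NN'), ('in', 'IN'), ('new', 'JJ'), ('nuclei', 'NNS'),
]

# Keep nouns (NN*) and adjectives (JJ*), dropping stopwords from both groups
tags = [word for word, tag in tagged
        if (tag.startswith('NN') or tag.startswith('JJ'))
        and word.lower() not in demo_stop_words]
print(tags)  # ['relativistic', 'density', 'functional', 'theory', 'nuclei']
```

The grouping parentheses matter here: Python evaluates `and` before `or`, so an unparenthesized condition would apply the stopword check to adjectives only.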

The `get_vectorizer` function initializes and returns a `TfidfVectorizer` with options to ignore decoding errors and strip accents to ASCII characters.


In [17]:
def get_vectorizer(row):
    vectorizer = TfidfVectorizer(decode_error='ignore', strip_accents='ascii')
    return vectorizer

The `get_tfidf_matrix` function fits the row's TF-IDF vectorizer to the `Tags` column of that row's `article_df` and returns the resulting TF-IDF matrix.


In [18]:
def get_tfidf_matrix(row):
    tfidf_matrix = row['article_vectorizer'].fit_transform(
        row['article_df']['Tags'])
    return tfidf_matrix

The `article_preprocessing` function processes a DataFrame `df` by:

1. Applying `get_article_df` to each row to extract and clean article data, storing the result in the `article_df` column.
2. Applying `get_vectorizer` to each row to create a TF-IDF vectorizer for the article text, storing it in the `article_vectorizer` column.
3. Applying `get_tfidf_matrix` to each row to generate a TF-IDF matrix using the vectorizer, storing it in the `article_matrix` column.

It returns the updated DataFrame with these new columns for further analysis.


In [19]:
def article_preprocessing(df):
    df['article_df'] = df.apply(get_article_df, axis=1)
    df['article_vectorizer'] = df.apply(get_vectorizer, axis=1)
    df['article_matrix'] = df.apply(get_tfidf_matrix, axis=1)
    return df

In [20]:

journal_main = article_preprocessing(journal_main)
print('done')

done


In [21]:
journal_main.to_pickle('journal_main.pkl')
print('saved journal_main.pkl')

saved journal_main.pkl


In [22]:
!pip install streamlit pyngrok --quiet






---

## 🧠 Building Streamlit App

This app recommends **research articles** based on a user’s input (topic, phrase, or keyword). It uses Natural Language Processing (NLP), specifically **TF-IDF** and **cosine similarity**, to find and rank relevant articles from a dataset of journal publications. Below, we explain the **purpose of each function** used in the app, **before showing the code**.

---

### 🔍 `get_journal_index(user_input)`

This function identifies the most relevant journals based on a user's query:

- It transforms the input into a TF-IDF vector.
- It then compares the vector with precomputed TF-IDF representations of journal content.
- Using **cosine similarity**, it finds and returns the indices of the top-matching journals.
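
The ranking logic described above can be sketched with small dense vectors. This is an illustration only: the vectors are invented, `cosine_scores` is a hypothetical helper written with NumPy, and the app itself calls scikit-learn's `cosine_similarity` on sparse TF-IDF matrices.

```python
import numpy as np

def cosine_scores(query, matrix):
    """Cosine similarity between one query vector and each row of matrix."""
    denom = np.linalg.norm(query) * np.linalg.norm(matrix, axis=1)
    # Zero-norm rows (or a zero query) get similarity 0 instead of a division error
    return np.divide(matrix @ query, denom, out=np.zeros(len(matrix)), where=denom > 0)

# Invented 3-term vectors for three journals and one query
journal_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],
])
query_vector = np.array([1.0, 1.0, 0.0])

scores = cosine_scores(query_vector, journal_vectors)
order = scores.argsort()[::-1]                       # best match first
top = [int(i) for i in order if scores[i] > 0][:2]   # at most 2 non-zero matches
print(top)  # [1, 0]
```

Journal 1 ranks first because its vector points in the same direction as the query, and journal 2 is excluded because it shares no terms with it.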

---

### 📄 `get_article_recommendations(user_input)`

Once we have the top journals, this function digs deeper:

- For each recommended journal, it uses that journal’s own vectorizer to encode the input.
- It then compares this vector with TF-IDF vectors of the articles in that journal.
- The most similar articles are collected and returned as recommendations.
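
Merging the per-journal results into one ranking relies on tuple ordering. The sketch below uses invented similarity scores and a hypothetical `article_limit` in place of real TF-IDF output.

```python
# Invented per-journal results: journal index -> similarity score per article
article_scores = {
    0: [0.10, 0.80, 0.00],
    3: [0.55, 0.00, 0.90],
}
article_limit = 2  # hypothetical cap per journal

recommendations = []
for journal_id, scores in article_scores.items():
    # Article positions ordered from most to least similar
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    best = [(scores[i], i, journal_id) for i in ranked if scores[i] > 0][:article_limit]
    recommendations += best

# Tuples compare by their first element, so this sorts by similarity across journals
recommendations.sort(reverse=True)
print(recommendations)
# [(0.9, 2, 3), (0.8, 1, 0), (0.55, 0, 3), (0.1, 0, 0)]
```

One consequence of this design is that scores from different journals are compared directly, even though each journal has its own vectorizer and vocabulary.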

---

### 🔗 `get_links(user_input)`

This function brings everything together:

- First, it **validates** the input using `validation()`.
- If the input is valid, it calls `get_article_recommendations()` to get article matches.
- Then it extracts useful metadata like title, link, article ID, and journal ID for each result.

---

### ✅ `validation(text)`

Before doing any analysis, we want to make sure the input is meaningful:

- This function uses **Part-of-Speech tagging** to extract adjectives and nouns from the input.
- If the input lacks relevant words, it marks it as invalid.
- Otherwise, it returns a cleaned version of the input to be used in search.

---

### 🌐 The Streamlit User Interface

The rest of the code builds the user interface:

- It loads preprocessed data and vectorizers from `.pkl` files.
- It provides a search box where users can enter topics.
- On submit, it shows a list of matching articles with clickable links.
- It also includes warning messages if input is too vague.

---

Once these functions are in place, the user can simply **enter a topic** like “AI in medicine” and instantly discover relevant academic articles sorted by relevance.



In [23]:
%%writefile app.py
import streamlit as st
import pickle
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from sklearn.metrics.pairwise import cosine_similarity

# Load models and data
with open("journal_main.pkl", "rb") as f:
    journal_main = pickle.load(f)

with open("vectorizer.pkl", "rb") as f:
    vectorizer = pickle.load(f)

with open("journal_tfidf_matrix.pkl", "rb") as f:
    journal_tfidf_matrix = pickle.load(f)

# Parameters
journal_threshold = 4
article_threshold = 10

# Logic
def get_journal_index(user_input):
    user_tfidf = vectorizer.transform([user_input])
    cosine_similarities = cosine_similarity(user_tfidf, journal_tfidf_matrix).flatten()
    indices = cosine_similarities.argsort()[::-1]
    top_recommendations = [i for i in indices if cosine_similarities[i] > 0][:min(journal_threshold, len(indices))]
    return top_recommendations

def get_article_recommendations(user_input):
    recommended_journals = get_journal_index(user_input)
    recommendations = []
    for journal_id in recommended_journals:
        user_tfidf = journal_main['article_vectorizer'][journal_id].transform([user_input])
        cosine_similarities = cosine_similarity(user_tfidf, journal_main['article_matrix'][journal_id]).flatten()
        indices = cosine_similarities.argsort()[::-1]
        top_recommendation_articles = [
            (cosine_similarities[i], i, journal_id)
            for i in indices if cosine_similarities[i] > 0
        ][:min(article_threshold, len(indices))]
        recommendations += top_recommendation_articles
    recommendations.sort(reverse=True)
    return recommendations

def get_links(user_input):
    check = validation(user_input)
    if check['validation'] == 'valid':
        recommendations = get_article_recommendations(check['sentence'])
        links = []
        for article in recommendations:
            similarity, article_id, journal_id = article
            link = {
                "title": journal_main['article_df'][journal_id].iloc[article_id, 0],
                "url": journal_main['article_df'][journal_id].iloc[article_id, 1],
                "article_id": int(article_id),
                "journal_id": int(journal_id)
            }
            links.append(link)
        return links
    return []

def validation(text):
    words = word_tokenize(text)
    tagged_words = pos_tag(words)
    adjectives = [word for word, pos in tagged_words if pos.startswith('JJ')]
    nouns = [word for word, pos in tagged_words if pos.startswith('NN')]

    result = {}
    if not adjectives and not nouns:
        result['validation'] = 'invalid'
    else:
        combined_sentence = f"{' '.join(adjectives)} {' '.join(nouns)}".strip()
        result['validation'] = 'valid'
        result['sentence'] = combined_sentence

    return result

# Streamlit UI
st.set_page_config(page_title="Discover Research Articles", layout="centered")
st.markdown("""
    <style>
        .title { text-align: center; font-size: 36px; color: #2c3e50; font-weight: bold; margin-top: 30px; }
        .subtitle { text-align: center; font-size: 18px; color: #333; margin-bottom: 20px; font-style: italic; }
        .section { padding: 30px; margin-bottom: 40px; }
        .input-box { width: 100%; padding: 15px; font-size: 18px; border-radius: 8px; border: 1px solid #ddd; background-color: #ffffff; box-shadow: 0 2px 4px rgba(0,0,0,0.1); }
        .button { background-color: #3498db; color: white; padding: 15px 25px; font-size: 20px; border: none; border-radius: 8px; cursor: pointer; width: 100%; }
        .button:hover { background-color: #2980b9; }
        .article { padding: 15px; margin-bottom: 12px; border-radius: 8px; background-color: #ffffff; box-shadow: 0 3px 6px rgba(0,0,0,0.1); }
        .article-title { font-size: 20px; color: #3498db; font-weight: bold; text-decoration: none; }
        .article-meta { font-size: 12px; color: #888; margin-top: 5px; }
    </style>
    <div class="title">🔎 Discover Relevant Research Articles</div>
    <div class="subtitle">Enter a topic, keyword, or phrase to explore the latest articles in your field of interest!</div>
""", unsafe_allow_html=True)

with st.container():
    st.markdown('<div class="section">', unsafe_allow_html=True)
    st.subheader("Find Research Articles")
    st.markdown("Type in a topic or keyword that you'd like to explore. Our system will fetch articles based on your input.")
    article_input = st.text_input("", placeholder="e.g., Quantum Computing, AI, Rocket Science ...", key="article_input", 
                                  help="Enter your research topic or keyword to find related articles.")
    
    if st.button("🔗 Find Articles", key="generate_links"):
        if article_input:
            validation_result = validation(article_input)
            
            if validation_result['validation'] == 'invalid':
                st.warning("Please try entering more descriptive terms, including nouns or adjectives.")
            else:
                result = get_links(validation_result['sentence'])
                
                if result:
                    st.markdown("### 🔗 Top Matching Articles", unsafe_allow_html=True)
                    for i, article in enumerate(result):
                        st.markdown(f"""
                            <div class="article">
                                <a class="article-title" href="{article['url']}" target="_blank">
                                    {i+1}. {article['title']}
                                </a>
                                <p class="article-meta">Article ID: {article['article_id']} | Journal ID: {article['journal_id']}</p>
                            </div>
                        """, unsafe_allow_html=True)
                else:
                    st.warning("No articles matched your query. Try being more specific with your input.")
        else:
            st.warning("Please enter a topic or keyword to get article recommendations.")
    
    st.markdown('</div>', unsafe_allow_html=True)

Overwriting app.py


In [24]:
import os
from pyngrok import ngrok

# Read the authtoken from an environment variable instead of hard-coding it;
# NGROK_AUTHTOKEN must be set before this cell runs.
ngrok.set_auth_token(os.environ["NGROK_AUTHTOKEN"])

In [27]:
import threading
import time
ngrok.kill()
# Start Streamlit in a separate thread, pinned to the port that ngrok forwards below
def run():
    !streamlit run app.py --server.port 8502

thread = threading.Thread(target=run)
thread.start()

# Wait for it to boot up
time.sleep(5)

# Connect ngrok properly
public_url = ngrok.connect(8502, "http", bind_tls=True)
print(f"🌐 Your app is live at: {public_url}")


🌐 Your app is live at: NgrokTunnel: "https://44d1-37-120-216-226.ngrok-free.app" -> "http://localhost:8502"
