# Research-Paper Ranking System
- This is a research-paper ranking system that ranks papers on the following signals:
    1. Metadata quality
    2. Recency
    3. Popularity
    4. Personalisation
    5. Score
    6. Similarity
- Topics covered:
    1. NLP
    2. RAG
    3. NN
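
The six signals listed above eventually have to be merged into a single ranking value. A minimal sketch of one way to do that is a weighted average. The weights, the signal names used as dictionary keys, and the two example papers are all hypothetical and are not taken from this notebook.

```python
# Hypothetical weights for the six ranking signals (assumptions, not tuned values).
WEIGHTS = {
    "metadata_quality": 0.10,
    "recency": 0.15,
    "popularity": 0.20,
    "personalisation": 0.15,
    "score": 0.10,
    "similarity": 0.30,
}

def combined_rank_score(signals, weights=WEIGHTS):
    """Weighted average of per-signal scores, each expected to lie in [0, 1]."""
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[name] * signals.get(name, 0.0) for name in weights)
    return weighted_sum / total_weight

# Two made-up papers with made-up signal values.
papers = {
    "paper_a": {"metadata_quality": 0.5, "recency": 0.2, "popularity": 0.5,
                "personalisation": 0.5, "score": 0.5, "similarity": 0.9},
    "paper_b": {"metadata_quality": 0.5, "recency": 0.9, "popularity": 0.9,
                "personalisation": 0.5, "score": 0.5, "similarity": 0.2},
}

# Sort paper ids from best to worst combined score.
ranked = sorted(papers, key=lambda pid: combined_rank_score(papers[pid]), reverse=True)
print(ranked)
```

With these particular weights the similarity signal dominates, so `paper_a` ranks first even though `paper_b` is newer and more cited.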

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

# Analysing the data

In [2]:
# Load the dataset into a DataFrame

rpdf = pd.read_csv(r"../Data/dblp-v10.csv")

In [3]:
rpdf.head()

Unnamed: 0,abstract,authors,n_citation,references,title,venue,year,id
0,"In this paper, a robust 3D triangular mesh wat...","['S. Ben Jabra', 'Ezzeddine Zagrouba']",50,"['09cb2d7d-47d1-4a85-bfe5-faa8221e644b', '10aa...",A new approach of 3D watermarking based on ima...,international symposium on computers and commu...,2008,4ab3735c-80f1-472d-b953-fa0557fed28b
1,We studied an autoassociative neural network w...,"['Joaquín J. Torres', 'Jesús M. Cortés', 'Joaq...",50,"['4017c9d2-9845-4ad2-ad5b-ba65523727c5', 'b118...",Attractor neural networks with activity-depend...,Neurocomputing,2007,4ab39729-af77-46f7-a662-16984fb9c1db
2,It is well-known that Sturmian sequences are t...,"['Genevi eve Paquin', 'Laurent Vuillon']",50,"['1c655ee2-067d-4bc4-b8cc-bc779e9a7f10', '2e4e...",A characterization of balanced episturmian seq...,Electronic Journal of Combinatorics,2007,4ab3a4cf-1d96-4ce5-ab6f-b3e19fc260de
3,One of the fundamental challenges of recognizi...,"['Yaser Sheikh', 'Mumtaz Sheikh', 'Mubarak Shah']",221,"['056116c1-9e7a-4f9b-a918-44eb199e67d6', '05ac...",Exploring the space of a human action,international conference on computer vision,2005,4ab3a98c-3620-47ec-b578-884ecf4a6206
4,This paper generalizes previous optimal upper ...,"['Efraim Laksman', 'Håkan Lennerstad', 'Magnus...",0,"['01a765b8-0cb3-495c-996f-29c36756b435', '5dbc...",Generalized upper bounds on the minimum distan...,Ima Journal of Mathematical Control and Inform...,2015,4ab3b585-82b4-4207-91dd-b6bce7e27c4e


In [4]:
metadata_df = rpdf.drop("abstract", axis=1)

In [5]:
metadata_df.head()

Unnamed: 0,authors,n_citation,references,title,venue,year,id
0,"['S. Ben Jabra', 'Ezzeddine Zagrouba']",50,"['09cb2d7d-47d1-4a85-bfe5-faa8221e644b', '10aa...",A new approach of 3D watermarking based on ima...,international symposium on computers and commu...,2008,4ab3735c-80f1-472d-b953-fa0557fed28b
1,"['Joaquín J. Torres', 'Jesús M. Cortés', 'Joaq...",50,"['4017c9d2-9845-4ad2-ad5b-ba65523727c5', 'b118...",Attractor neural networks with activity-depend...,Neurocomputing,2007,4ab39729-af77-46f7-a662-16984fb9c1db
2,"['Genevi eve Paquin', 'Laurent Vuillon']",50,"['1c655ee2-067d-4bc4-b8cc-bc779e9a7f10', '2e4e...",A characterization of balanced episturmian seq...,Electronic Journal of Combinatorics,2007,4ab3a4cf-1d96-4ce5-ab6f-b3e19fc260de
3,"['Yaser Sheikh', 'Mumtaz Sheikh', 'Mubarak Shah']",221,"['056116c1-9e7a-4f9b-a918-44eb199e67d6', '05ac...",Exploring the space of a human action,international conference on computer vision,2005,4ab3a98c-3620-47ec-b578-884ecf4a6206
4,"['Efraim Laksman', 'Håkan Lennerstad', 'Magnus...",0,"['01a765b8-0cb3-495c-996f-29c36756b435', '5dbc...",Generalized upper bounds on the minimum distan...,Ima Journal of Mathematical Control and Inform...,2015,4ab3b585-82b4-4207-91dd-b6bce7e27c4e


In [6]:
metadata_df['year']

0         2008
1         2007
2         2007
3         2005
4         2015
          ... 
999995    2016
999996    2016
999997    2017
999998    2016
999999    2017
Name: year, Length: 1000000, dtype: int64

In [7]:
metadata_df.columns

Index(['authors', 'n_citation', 'references', 'title', 'venue', 'year', 'id'], dtype='object')

# Pre-processing

In [8]:
import ast
import pandas as pd

def to_list_safe(x):
    if pd.isna(x):
        return []
    if isinstance(x, list):
        return x
    if isinstance(x, str):
        x = x.strip()
        if x.startswith("[") and x.endswith("]"):
            try:
                return ast.literal_eval(x)   # convert string repr of list → list
            except (ValueError, SyntaxError):
                # malformed list literal: treat it as empty
                return []
        return [x]  # fallback single value
    return []

# list columns
metadata_df['authors'] = metadata_df['authors'].apply(to_list_safe)
metadata_df['references'] = metadata_df['references'].apply(to_list_safe)

# string columns
metadata_df['title'] = metadata_df['title'].fillna("").astype(str)
metadata_df['venue'] = metadata_df['venue'].fillna("").astype(str)
metadata_df['id'] = metadata_df['id'].fillna("").astype(str)

# numeric columns
metadata_df['year'] = pd.to_numeric(metadata_df['year'], errors='coerce')
metadata_df['n_citation'] = pd.to_numeric(metadata_df['n_citation'], errors='coerce').fillna(0).astype(int)
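
To make the behaviour of the list parser easier to check, here is a small standard-library-only variant run on a few sample cells. The function name `parse_list_cell` and the sample values are illustrative assumptions. It mirrors the logic of `to_list_safe` above but replaces `pd.isna` with a plain `None`/NaN check.

```python
import ast
import math

def parse_list_cell(value):
    """Stdlib-only variant of the list parser used in the pre-processing cell."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return []
    if isinstance(value, list):
        return value
    if isinstance(value, str):
        text = value.strip()
        if text.startswith("[") and text.endswith("]"):
            try:
                parsed = ast.literal_eval(text)
            except (ValueError, SyntaxError):
                return []  # malformed literal
            return parsed if isinstance(parsed, list) else []
        return [text]  # fallback: treat the string as a single value
    return []

samples = [
    "['S. Ben Jabra', 'Ezzeddine Zagrouba']",  # well-formed list literal
    "[broken",                                 # no closing bracket: kept as one value
    "['unterminated]",                         # looks like a list but does not parse
    float("nan"),                              # missing value
    "single value",
]
parsed = [parse_list_cell(s) for s in samples]
for raw, result in zip(samples, parsed):
    print(repr(raw), "->", result)
```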

In [9]:
metadata_df.head()

Unnamed: 0,authors,n_citation,references,title,venue,year,id
0,"[S. Ben Jabra, Ezzeddine Zagrouba]",50,"[09cb2d7d-47d1-4a85-bfe5-faa8221e644b, 10aa16d...",A new approach of 3D watermarking based on ima...,international symposium on computers and commu...,2008,4ab3735c-80f1-472d-b953-fa0557fed28b
1,"[Joaquín J. Torres, Jesús M. Cortés, Joaquín M...",50,"[4017c9d2-9845-4ad2-ad5b-ba65523727c5, b118738...",Attractor neural networks with activity-depend...,Neurocomputing,2007,4ab39729-af77-46f7-a662-16984fb9c1db
2,"[Genevi eve Paquin, Laurent Vuillon]",50,"[1c655ee2-067d-4bc4-b8cc-bc779e9a7f10, 2e4e57c...",A characterization of balanced episturmian seq...,Electronic Journal of Combinatorics,2007,4ab3a4cf-1d96-4ce5-ab6f-b3e19fc260de
3,"[Yaser Sheikh, Mumtaz Sheikh, Mubarak Shah]",221,"[056116c1-9e7a-4f9b-a918-44eb199e67d6, 05ac52a...",Exploring the space of a human action,international conference on computer vision,2005,4ab3a98c-3620-47ec-b578-884ecf4a6206
4,"[Efraim Laksman, Håkan Lennerstad, Magnus Nils...",0,"[01a765b8-0cb3-495c-996f-29c36756b435, 5dbc8cc...",Generalized upper bounds on the minimum distan...,Ima Journal of Mathematical Control and Inform...,2015,4ab3b585-82b4-4207-91dd-b6bce7e27c4e


In [10]:
len(metadata_df['authors'])

1000000

In [11]:
metadata_df.shape

(1000000, 7)

In [12]:
metadata_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 7 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   authors     1000000 non-null  object
 1   n_citation  1000000 non-null  int64 
 2   references  1000000 non-null  object
 3   title       1000000 non-null  object
 4   venue       1000000 non-null  object
 5   year        1000000 non-null  int64 
 6   id          1000000 non-null  object
dtypes: int64(2), object(5)
memory usage: 588.4 MB


In [13]:
metadata_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 7 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   authors     1000000 non-null  object
 1   n_citation  1000000 non-null  int64 
 2   references  1000000 non-null  object
 3   title       1000000 non-null  object
 4   venue       1000000 non-null  object
 5   year        1000000 non-null  int64 
 6   id          1000000 non-null  object
dtypes: int64(2), object(5)
memory usage: 53.4+ MB


# Null values are handled
- The next step is to turn the data into features:
    1. Paper age (years since publication)
    2. Number of authors
    3. Number of references
    4. Title similarity + TF-IDF
    5. venue ranking
    6. Reference influence (Page Score)
    7. N-grams
    8. Abstract similarity
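
The "Reference influence" feature is not implemented in this notebook. Assuming "Page Score" means a PageRank-style score over the citation graph built from the `references` column, a minimal power-iteration sketch could look like the following. The function name, the damping factor, and the toy graph are assumptions for illustration only.

```python
def pagerank(references, damping=0.85, iterations=50):
    """Power-iteration PageRank over a {paper_id: [cited_paper_ids]} mapping."""
    nodes = set(references)
    for cited in references.values():
        nodes.update(cited)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Papers that cite nothing spread their rank evenly over all papers.
        dangling = sum(rank[node] for node in nodes if not references.get(node))
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for citing, cited_list in references.items():
            if not cited_list:
                continue
            share = damping * rank[citing] / len(cited_list)
            for cited in cited_list:
                new_rank[cited] += share
        rank = new_rank
    return rank

# Toy citation graph: p1 and p2 cite p3, p3 cites p4, p4 cites nothing.
toy_references = {
    "p1": ["p3"],
    "p2": ["p3"],
    "p3": ["p4"],
    "p4": [],
}
scores = pagerank(toy_references)
for paper_id, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(paper_id, round(value, 4))
```

On the full dataset a sparse-matrix or graph-library implementation would likely be needed, since a pure-Python loop over one million papers is slow.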

# Feature Extraction

In [14]:
features_df = pd.DataFrame()
features_df['id'] = metadata_df['id']
features_df['age'] = (2026 - metadata_df['year']).astype('Int64')
features_df['number_of_citations'] = metadata_df['n_citation']
features_df['number_of_references'] = metadata_df['references'].str.len()
features_df['number_of_authors'] = metadata_df['authors'].str.len()

In [15]:
features_df.head()

Unnamed: 0,id,age,number_of_citations,number_of_references,number_of_authors
0,4ab3735c-80f1-472d-b953-fa0557fed28b,18,50,7,2
1,4ab39729-af77-46f7-a662-16984fb9c1db,19,50,3,4
2,4ab3a4cf-1d96-4ce5-ab6f-b3e19fc260de,19,50,7,2
3,4ab3a98c-3620-47ec-b578-884ecf4a6206,21,221,10,3
4,4ab3b585-82b4-4207-91dd-b6bce7e27c4e,11,0,9,3
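
The raw `age` and `number_of_citations` columns live on very different scales, so they usually need to be mapped to comparable scores before ranking. The sketch below shows one possible mapping: exponential decay for recency and log scaling for popularity. The function names, the 5-year half-life, and the normalisation by the maximum citation count are assumptions, not choices made in this notebook.

```python
import math

def recency_score(age_years, half_life=5.0):
    """Exponential decay: the score halves every `half_life` years."""
    return 0.5 ** (max(age_years, 0) / half_life)

def popularity_score(n_citations, max_citations):
    """Log-scaled citation count mapped to [0, 1]."""
    if max_citations <= 0:
        return 0.0
    return math.log1p(max(n_citations, 0)) / math.log1p(max_citations)

# (age, number_of_citations) for rows 0, 3 and 4 of the table above.
rows = [(18, 50), (21, 221), (11, 0)]
max_citations = max(citations for _, citations in rows)

for age, citations in rows:
    print(age, citations,
          round(recency_score(age), 3),
          round(popularity_score(citations, max_citations), 3))
```

Log scaling keeps a handful of very highly cited papers from flattening every other popularity score toward zero.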


In [16]:
features_df.shape

(1000000, 5)

In [17]:
rpdf['abstract'] = rpdf['abstract'].fillna("").astype(str)

In [18]:
rpdf.head()

Unnamed: 0,abstract,authors,n_citation,references,title,venue,year,id
0,"In this paper, a robust 3D triangular mesh wat...","['S. Ben Jabra', 'Ezzeddine Zagrouba']",50,"['09cb2d7d-47d1-4a85-bfe5-faa8221e644b', '10aa...",A new approach of 3D watermarking based on ima...,international symposium on computers and commu...,2008,4ab3735c-80f1-472d-b953-fa0557fed28b
1,We studied an autoassociative neural network w...,"['Joaquín J. Torres', 'Jesús M. Cortés', 'Joaq...",50,"['4017c9d2-9845-4ad2-ad5b-ba65523727c5', 'b118...",Attractor neural networks with activity-depend...,Neurocomputing,2007,4ab39729-af77-46f7-a662-16984fb9c1db
2,It is well-known that Sturmian sequences are t...,"['Genevi eve Paquin', 'Laurent Vuillon']",50,"['1c655ee2-067d-4bc4-b8cc-bc779e9a7f10', '2e4e...",A characterization of balanced episturmian seq...,Electronic Journal of Combinatorics,2007,4ab3a4cf-1d96-4ce5-ab6f-b3e19fc260de
3,One of the fundamental challenges of recognizi...,"['Yaser Sheikh', 'Mumtaz Sheikh', 'Mubarak Shah']",221,"['056116c1-9e7a-4f9b-a918-44eb199e67d6', '05ac...",Exploring the space of a human action,international conference on computer vision,2005,4ab3a98c-3620-47ec-b578-884ecf4a6206
4,This paper generalizes previous optimal upper ...,"['Efraim Laksman', 'Håkan Lennerstad', 'Magnus...",0,"['01a765b8-0cb3-495c-996f-29c36756b435', '5dbc...",Generalized upper bounds on the minimum distan...,Ima Journal of Mathematical Control and Inform...,2015,4ab3b585-82b4-4207-91dd-b6bce7e27c4e


**Embeddings**

In [19]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device='cuda')
import numpy as np

def batch_embedding(data, model, batch):
    all_embeddings = []
    for i in range(0, len(data), batch):
        chunk = data[i : i+batch]
        
        emb = model.encode(
            chunk,
            batch_size = 64,
            show_progress_bar=True,
            convert_to_numpy=True
        )
        all_embeddings.append(emb)
    final_embeddings = np.vstack(all_embeddings)
    return final_embeddings

abstract_list = rpdf['abstract'].to_list()
embeddings = batch_embedding(abstract_list, model=model, batch=5000)
len(embeddings)


1000000

In [21]:
embeddings[0].shape

(384,)
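
`batch_embedding` collects one array per chunk and stacks them at the end, which briefly holds two copies of the embeddings in memory. A possible alternative, sketched here under stated assumptions, writes each chunk into a preallocated array. `fake_encode` is a hypothetical stand-in for `model.encode` so the example runs without a GPU or the model, and the 8-dimensional vectors are arbitrary.

```python
import numpy as np

def fake_encode(texts, dim=8):
    """Hypothetical stand-in for model.encode: one deterministic vector per text."""
    return np.array([[float(len(t) + j) for j in range(dim)] for t in texts],
                    dtype=np.float32)

def encode_in_chunks(texts, encode_fn, chunk_size, dim):
    """Encode texts chunk by chunk, writing into one preallocated array."""
    out = np.empty((len(texts), dim), dtype=np.float32)
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        out[start:start + len(chunk)] = encode_fn(chunk)
    return out

texts = ["a", "bb", "", "dddd", "eeeee"]
vectors = encode_in_chunks(texts, lambda chunk: fake_encode(chunk, dim=8),
                           chunk_size=2, dim=8)
print(vectors.shape)
```

The result is identical to encoding everything in one call; only the peak memory use differs.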

In [22]:
title_list = metadata_df['title'].to_list()
title_embeddings = batch_embedding(title_list, model=model, batch=5000)
print(f"Size of the title embedding : {title_embeddings.shape}")
print(f"The first embedding : {title_embeddings[0]}")

Size of the title embedding : (1000000, 384)
The first embedding : [-4.50745411e-02  5.95104434e-02  4.91631404e-03 -2.06336658e-02
  6.82850704e-02 -1.70882810e-02  1.13347910e-01 -4.95932326e-02
 -1.89431682e-02 -8.25456530e-02 -7.65116438e-02  1.58824585e-02
  3.28384899e-02  5.68047799e-02 -5.67055941e-02 -4.52596992e-02
 -7.89626539e-02  7.75448158e-02 -1.34889735e-02 -3.63755710e-02
  3.28187197e-02 -1.22902252e-01 -2.27700435e-02  2.33107042e-02
  3.83999273e-02  5.41148297e-02  1.28380314e-01  2.16654483e-02
  

In [23]:
len(title_embeddings)

1000000

# Saving embeddings

In [38]:
print(f"Title Embedding Shape : {title_embeddings.shape}\nAbstract Embedding Shape : {embeddings.shape}")

Title Embedding Shape : (1000000, 384)
Abstract Embedding Shape : (1000000, 384)


In [39]:
import numpy as np
np.save("abstract_embeddings", embeddings)
np.save("title_embeddings", title_embeddings)
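
Once the embeddings are on disk, they can be reloaded and searched without recomputing them. The sketch below is an assumption about how that might be done, not code from this notebook: it saves a small random matrix in place of the real `(1000000, 384)` array, reloads it with `np.load(..., mmap_mode="r")`, and returns the rows most similar to a query vector by cosine similarity. The helper name `top_k_cosine` is hypothetical.

```python
import os
import tempfile
import numpy as np

def top_k_cosine(query_vec, matrix, k=3):
    """Return indices and cosine similarities of the k rows most similar to query_vec."""
    matrix_norms = np.linalg.norm(matrix, axis=1)
    query_norm = np.linalg.norm(query_vec)
    # Guard against zero vectors to avoid division by zero.
    denom = np.maximum(matrix_norms * query_norm, 1e-12)
    sims = (matrix @ query_vec) / denom
    order = np.argsort(-sims)[:k]
    return order, sims[order]

# Stand-in data: a small random matrix instead of the real embeddings.
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(100, 384)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_embeddings")
    np.save(path, toy_embeddings)                   # np.save appends ".npy"
    loaded = np.load(path + ".npy", mmap_mode="r")  # memory-map instead of reading it all
    indices, similarities = top_k_cosine(toy_embeddings[7], loaded, k=3)
    del loaded  # release the memory map before the temporary file is deleted

print(indices, similarities)
```

Using row 7 as the query, the top hit is row 7 itself with a similarity of about 1, which is a quick sanity check that the saved and reloaded arrays agree.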

# TF-IDF Vectorizer Score

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
temp = ['hello my name is jai saraswat', 'jai is a good boy', 'jai is working.']
result = tfidf.fit_transform(temp)
print(tfidf.vocabulary_)
print(result)

{'hello': 2, 'my': 5, 'name': 6, 'is': 3, 'jai': 4, 'saraswat': 7, 'good': 1, 'boy': 0, 'working': 8}
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 13 stored elements and shape (3, 9)>
  Coords	Values
  (0, 2)	0.4613807291012212
  (0, 5)	0.4613807291012212
  (0, 6)	0.4613807291012212
  (0, 3)	0.27249889105838787
  (0, 4)	0.27249889105838787
  (0, 7)	0.4613807291012212
  (1, 3)	0.35959372325985667
  (1, 4)	0.35959372325985667
  (1, 1)	0.6088450986844796
  (1, 0)	0.6088450986844796
  (2, 3)	0.4532946552278861
  (2, 4)	0.4532946552278861
  (2, 8)	0.7674945674619879


In [28]:
print('\nidf values:')
for ele1, ele2 in zip(tfidf.get_feature_names_out(), tfidf.idf_):
    print(ele1, ':', ele2)


idf values:
boy : 1.6931471805599454
good : 1.6931471805599454
hello : 1.6931471805599454
is : 1.0
jai : 1.0
my : 1.6931471805599454
name : 1.6931471805599454
saraswat : 1.6931471805599454
working : 1.6931471805599454
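
The idf values printed above follow from scikit-learn's default smoothed IDF formula (`smooth_idf=True`), where $n$ is the number of documents and $\mathrm{df}(t)$ is the number of documents containing term $t$:

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1
$$

Here $n = 3$. The terms `is` and `jai` occur in all three sentences, so $\mathrm{idf} = \ln\frac{4}{4} + 1 = 1$. Every other term occurs in exactly one sentence, so $\mathrm{idf} = \ln\frac{4}{2} + 1 = \ln 2 + 1 \approx 1.6931$.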


In [31]:
from sklearn.metrics.pairwise import cosine_similarity
query = "How is Jai?"
q_vec = tfidf.transform([query])
score = cosine_similarity(q_vec, result).flatten()

In [32]:
score

array([0.38537163, 0.50854232, 0.64105545])
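
To show where these three scores come from, the sketch below recomputes them with the standard library only. It assumes `TfidfVectorizer`'s default settings: lowercasing, tokens of at least two word characters, smoothed IDF, and L2 normalisation. The helper names are illustrative.

```python
import math
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}

docs = ['hello my name is jai saraswat', 'jai is a good boy', 'jai is working.']
tokenized = [tokenize(d) for d in docs]
n_docs = len(docs)

# Document frequency and smoothed IDF per term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def tfidf_vector(tokens):
    counts = Counter(t for t in tokens if t in idf)  # ignore out-of-vocabulary terms
    return l2_normalize({t: c * idf[t] for t, c in counts.items()})

doc_vectors = [tfidf_vector(tokens) for tokens in tokenized]
query_vector = tfidf_vector(tokenize("How is Jai?"))

# Both sides are unit length, so the dot product equals the cosine similarity.
scores = [sum(w * dv.get(t, 0.0) for t, w in query_vector.items()) for dv in doc_vectors]
print([round(s, 8) for s in scores])
```

The query only shares `is` and `jai` with the corpus, and `how` is out of vocabulary, so the shortest sentence scores highest because those two terms make up a larger share of its vector.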

In [34]:
from sklearn.feature_extraction.text import TfidfVectorizer

title_vectorizer = TfidfVectorizer(
    stop_words='english',
    ngram_range=(1,2),
    max_features=3000
)

abstract_vectorizer = TfidfVectorizer(
    stop_words='english',
    ngram_range=(1,2),
    max_features=7000
)

In [35]:
X_title = title_vectorizer.fit_transform(title_list)
X_abstract = abstract_vectorizer.fit_transform(abstract_list)
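
With `ngram_range=(1,2)` each document contributes both single words and adjacent word pairs as candidate features, and `max_features` then keeps only the most frequent ones across the corpus. The sketch below illustrates the n-gram step on one title from the dataset. The three-word stop list is a hypothetical stand-in for scikit-learn's much longer built-in `'english'` list.

```python
import re

# Tiny hypothetical stop-word list; scikit-learn's 'english' list is much longer.
STOP_WORDS = {"a", "of", "the"}

def word_ngrams(text, ngram_range=(1, 2), stop_words=STOP_WORDS):
    """Lowercase, tokenize, drop stop words, then emit n-grams for each n in the range."""
    tokens = [t for t in re.findall(r"\b\w\w+\b", text.lower()) if t not in stop_words]
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

features = word_ngrams("Exploring the space of a human action")
print(features)
```

Stop words are removed before the pairs are formed, so a bigram such as `exploring space` can join two words that were not adjacent in the original title.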

In [36]:
from scipy import sparse

sparse.save_npz("X_title.npz", X_title)
sparse.save_npz("X_abstract.npz", X_abstract)

In [37]:
import joblib
joblib.dump(title_vectorizer, "title_vec.pkl")
joblib.dump(abstract_vectorizer, "abstract_vec.pkl")

['abstract_vec.pkl']

In [40]:
metadata_df.head(10)

Unnamed: 0,authors,n_citation,references,title,venue,year,id
0,"[S. Ben Jabra, Ezzeddine Zagrouba]",50,"[09cb2d7d-47d1-4a85-bfe5-faa8221e644b, 10aa16d...",A new approach of 3D watermarking based on ima...,international symposium on computers and commu...,2008,4ab3735c-80f1-472d-b953-fa0557fed28b
1,"[Joaquín J. Torres, Jesús M. Cortés, Joaquín M...",50,"[4017c9d2-9845-4ad2-ad5b-ba65523727c5, b118738...",Attractor neural networks with activity-depend...,Neurocomputing,2007,4ab39729-af77-46f7-a662-16984fb9c1db
2,"[Genevi eve Paquin, Laurent Vuillon]",50,"[1c655ee2-067d-4bc4-b8cc-bc779e9a7f10, 2e4e57c...",A characterization of balanced episturmian seq...,Electronic Journal of Combinatorics,2007,4ab3a4cf-1d96-4ce5-ab6f-b3e19fc260de
3,"[Yaser Sheikh, Mumtaz Sheikh, Mubarak Shah]",221,"[056116c1-9e7a-4f9b-a918-44eb199e67d6, 05ac52a...",Exploring the space of a human action,international conference on computer vision,2005,4ab3a98c-3620-47ec-b578-884ecf4a6206
4,"[Efraim Laksman, Håkan Lennerstad, Magnus Nils...",0,"[01a765b8-0cb3-495c-996f-29c36756b435, 5dbc8cc...",Generalized upper bounds on the minimum distan...,Ima Journal of Mathematical Control and Inform...,2015,4ab3b585-82b4-4207-91dd-b6bce7e27c4e
5,"[Simonetta Balsamo, Gian–Luca Dei Rossi, Andre...",6,"[1c26e228-57d2-4b2c-b0c9-8d5851c17fac, 7539920...",Applying BCMP multi-class queueing networks fo...,International Journal of Computer Aided Engine...,2015,4ab3e768-78c9-4497-8b8e-9e934cb5f2e4
6,"[Andrea Mazzanti, Pietro Andreani]",50,"[0a09db01-264a-4bdf-942c-d33cceb35d3c, 36c942d...",A Push–Pull Class-C CMOS VCO,IEEE Journal of Solid-state Circuits,2013,4ab3f7cd-140b-4e29-99d4-f4e8006c4f65
7,[Daniil Ryabko],2,[505f493b-e09d-444d-9ee2-5e5db6a5b8ac],On computability of pattern recognition problems,algorithmic learning theory,2005,4ab404e2-6f4b-4fb4-b093-50775e765b13
8,"[Maria Chiara Carrozza, Paolo Dario, Arianna M...",50,"[5ecd70e1-7ccc-4b2f-ac09-b91953cca5cd, 7fa711e...",Manipulating biological and mechanical micro-o...,international conference on robotics and autom...,1998,4ab4244d-fb3e-49a3-b125-367df3d8e6ba
9,"[Zhanjun Bai, Xing Zhou, Ralph Mason]",3,"[54f270aa-ce44-4ece-a2ca-c63a9f266cb3, 638c488...",A novel Injection Locked Rotary Traveling Wave...,international symposium on circuits and systems,2014,4ab439a4-9379-44f5-b98b-87125ae7366e


In [41]:
rpdf.head(10)

Unnamed: 0,abstract,authors,n_citation,references,title,venue,year,id
0,"In this paper, a robust 3D triangular mesh wat...","['S. Ben Jabra', 'Ezzeddine Zagrouba']",50,"['09cb2d7d-47d1-4a85-bfe5-faa8221e644b', '10aa...",A new approach of 3D watermarking based on ima...,international symposium on computers and commu...,2008,4ab3735c-80f1-472d-b953-fa0557fed28b
1,We studied an autoassociative neural network w...,"['Joaquín J. Torres', 'Jesús M. Cortés', 'Joaq...",50,"['4017c9d2-9845-4ad2-ad5b-ba65523727c5', 'b118...",Attractor neural networks with activity-depend...,Neurocomputing,2007,4ab39729-af77-46f7-a662-16984fb9c1db
2,It is well-known that Sturmian sequences are t...,"['Genevi eve Paquin', 'Laurent Vuillon']",50,"['1c655ee2-067d-4bc4-b8cc-bc779e9a7f10', '2e4e...",A characterization of balanced episturmian seq...,Electronic Journal of Combinatorics,2007,4ab3a4cf-1d96-4ce5-ab6f-b3e19fc260de
3,One of the fundamental challenges of recognizi...,"['Yaser Sheikh', 'Mumtaz Sheikh', 'Mubarak Shah']",221,"['056116c1-9e7a-4f9b-a918-44eb199e67d6', '05ac...",Exploring the space of a human action,international conference on computer vision,2005,4ab3a98c-3620-47ec-b578-884ecf4a6206
4,This paper generalizes previous optimal upper ...,"['Efraim Laksman', 'Håkan Lennerstad', 'Magnus...",0,"['01a765b8-0cb3-495c-996f-29c36756b435', '5dbc...",Generalized upper bounds on the minimum distan...,Ima Journal of Mathematical Control and Inform...,2015,4ab3b585-82b4-4207-91dd-b6bce7e27c4e
5,Queueing networks with multiple classes of cus...,"['Simonetta Balsamo', 'Gian–Luca Dei Rossi', '...",6,"['1c26e228-57d2-4b2c-b0c9-8d5851c17fac', '7539...",Applying BCMP multi-class queueing networks fo...,International Journal of Computer Aided Engine...,2015,4ab3e768-78c9-4497-8b8e-9e934cb5f2e4
6,A CMOS oscillator employing differential trans...,"['Andrea Mazzanti', 'Pietro Andreani']",50,"['0a09db01-264a-4bdf-942c-d33cceb35d3c', '36c9...",A Push–Pull Class-C CMOS VCO,IEEE Journal of Solid-state Circuits,2013,4ab3f7cd-140b-4e29-99d4-f4e8006c4f65
7,In statistical setting of the pattern recognit...,['Daniil Ryabko'],2,['505f493b-e09d-444d-9ee2-5e5db6a5b8ac'],On computability of pattern recognition problems,algorithmic learning theory,2005,4ab404e2-6f4b-4fb4-b093-50775e765b13
8,We first discuss some general aspects of micro...,"['Maria Chiara Carrozza', 'Paolo Dario', 'Aria...",50,"['5ecd70e1-7ccc-4b2f-ac09-b91953cca5cd', '7fa7...",Manipulating biological and mechanical micro-o...,international conference on robotics and autom...,1998,4ab4244d-fb3e-49a3-b125-367df3d8e6ba
9,,"['Zhanjun Bai', 'Xing Zhou', 'Ralph Mason']",3,"['54f270aa-ce44-4ece-a2ca-c63a9f266cb3', '638c...",A novel Injection Locked Rotary Traveling Wave...,international symposium on circuits and systems,2014,4ab439a4-9379-44f5-b98b-87125ae7366e


In [42]:
features_df.head(10)

Unnamed: 0,id,age,number_of_citations,number_of_references,number_of_authors
0,4ab3735c-80f1-472d-b953-fa0557fed28b,18,50,7,2
1,4ab39729-af77-46f7-a662-16984fb9c1db,19,50,3,4
2,4ab3a4cf-1d96-4ce5-ab6f-b3e19fc260de,19,50,7,2
3,4ab3a98c-3620-47ec-b578-884ecf4a6206,21,221,10,3
4,4ab3b585-82b4-4207-91dd-b6bce7e27c4e,11,0,9,3
5,4ab3e768-78c9-4497-8b8e-9e934cb5f2e4,11,6,7,3
6,4ab3f7cd-140b-4e29-99d4-f4e8006c4f65,13,50,8,2
7,4ab404e2-6f4b-4fb4-b093-50775e765b13,21,2,1,1
8,4ab4244d-fb3e-49a3-b125-367df3d8e6ba,28,50,4,4
9,4ab439a4-9379-44f5-b98b-87125ae7366e,12,3,3,3


In [46]:
features_df.to_parquet("training_features_parquet", index=False)
rpdf.to_parquet("original_data_parquet", index=False)