In [60]:
import numpy as np
import pandas as pd

In [59]:
df = pd.read_csv('/content/papers.csv', on_bad_lines='skip')

In [61]:
df.head()

Unnamed: 0,id,year,title,event_type,pdf_name,abstract,paper_text
0,1,1987,Self-Organization of Associative Database and ...,,1-self-organization-of-associative-database-an...,Abstract Missing,767\n\nSELF-ORGANIZATION OF ASSOCIATIVE DATABA...
1,10,1987,A Mean Field Theory of Layer IV of Visual Cort...,,10-a-mean-field-theory-of-layer-iv-of-visual-c...,Abstract Missing,683\n\nA MEAN FIELD THEORY OF LAYER IV OF VISU...
2,100,1988,Storing Covariance by the Associative Long-Ter...,,100-storing-covariance-by-the-associative-long...,Abstract Missing,394\n\nSTORING COVARIANCE BY THE ASSOCIATIVE\n...
3,1000,1994,Bayesian Query Construction for Neural Network...,,1000-bayesian-query-construction-for-neural-ne...,Abstract Missing,Bayesian Query Construction for Neural\nNetwor...
4,1001,1994,"Neural Network Ensembles, Cross Validation, an...",,1001-neural-network-ensembles-cross-validation...,Abstract Missing,"Neural Network Ensembles, Cross\nValidation, a..."


In [62]:
df.shape

(7241, 7)

In [63]:
df = df.iloc[:5000, :]

In [64]:
df.shape

(5000, 7)

In [65]:
df['paper_text'][0]

'767\n\nSELF-ORGANIZATION OF ASSOCIATIVE DATABASE\nAND ITS APPLICATIONS\nHisashi Suzuki and Suguru Arimoto\nOsaka University, Toyonaka, Osaka 560, Japan\nABSTRACT\nAn efficient method of self-organizing associative databases is proposed together with\napplications to robot eyesight systems. The proposed databases can associate any input\nwith some output. In the first half part of discussion, an algorithm of self-organization is\nproposed. From an aspect of hardware, it produces a new style of neural network. In the\nlatter half part, an applicability to handwritten letter recognition and that to an autonomous\nmobile robot system are demonstrated.\n\nINTRODUCTION\nLet a mapping f : X -+ Y be given. Here, X is a finite or infinite set, and Y is another\nfinite or infinite set. A learning machine observes any set of pairs (x, y) sampled randomly\nfrom X x Y. (X x Y means the Cartesian product of X and Y.) And, it computes some\nestimate j : X -+ Y of f to make small, the estimation erro

In [66]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

In [67]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [68]:
stop_words = set(stopwords.words('english')) #set for unique stopwords, although the stopwords are unique

In [69]:
#creating custom stopwords
new_words = ["fig", "figure", "sample", "using", "shoe", "result", "Large",
             "also", "one", "two", "three", "four", "five", "seven", "eight", "nine"]

In [70]:
stop_words = list(stop_words.union(new_words))

In [71]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [72]:
def preprocessing_text(txt):
  txt = txt.lower()
  txt = re.sub(r'<.*?>', ' ', txt)
  txt = re.sub(r'[^a-zA-Z]', ' ', txt)
  txt = nltk.word_tokenize(txt)
  txt = [word for word in txt if word not in stop_words]
  txt = [word for word in txt if len(word) > 3] #removing keywords conaining less than 3 alphabets
  txt = [PorterStemmer().stem(word) for word in txt] #converting word into its base form (loving -> love)
  txt = ' '.join(txt)

  return txt

In [73]:
docs = df['paper_text'].apply(preprocessing_text)

In [74]:
docs

Unnamed: 0,paper_text
0,self organ associ databas applic hisashi suzuk...
1,mean field theori layer visual cortex applic a...
2,store covari associ long term potenti depress ...
3,bayesian queri construct neural network model ...
4,neural network ensembl cross valid activ learn...
...,...
4995,rank time frequenc synthesi matthieu kowalski ...
4996,state space model decod auditori attent modul ...
4997,effici structur matrix rank minim adam wanli y...
4998,cient minimax signal detect graph jing qian di...


In [75]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_df=95, max_features=5000, ngram_range=(1,3))
word_count_vectors = cv.fit_transform(docs)

## Document-Term Matrix Example

**Documents:**

1. "The quick brown fox jumps over the lazy dog."
2. "The dog barks at the cat."
3. "The fox is quick and brown."

**Vocabulary:**

| Word | Index |
|---|---|
| quick | 0 |
| brown | 1 |
| fox | 2 |
| jump | 3 |
| lazi | 4 |
| dog | 5 |
| bark | 6 |
| cat | 7 |


**Document-Term Matrix:**

| Document | quick | brown | fox | jump | lazi | dog | bark | cat |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |

In [76]:
word_count_vectors

<5000x5000 sparse matrix of type '<class 'numpy.int64'>'
	with 332940 stored elements in Compressed Sparse Row format>

## Scenario: TF-IDF Calculation Example

**Documents:**

1. "The cat sat on the mat."
2. "The dog chased the cat."
3. "The dog sat on the mat."

**Steps:**

**1. Count Vectorization:**

We create a vocabulary and represent each document as a vector of word counts:

| Document | cat | sat | on | mat | dog | chased |
|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 1 | 1 |
| 3 | 0 | 1 | 1 | 1 | 1 | 0 |

**2. TF-IDF Transformation:**

We apply TF-IDF to the word counts:

**TF (Term Frequency):**  Frequency of a word in a document.
**IDF (Inverse Document Frequency):** Importance of a word across all documents.
**TF-IDF:** TF * IDF

**Example: Calculating TF-IDF for "cat" in Document 1:**

* **TF:** 1 (count of "cat") / 5 (total words in Document 1) = 0.2
* **IDF:** log(3 (total documents) / 2 (documents containing "cat")) ≈ 0.405
* **TF-IDF:** 0.2 * 0.405 ≈ 0.081

**Resulting TF-IDF Matrix:**

| Document | cat | sat | on | mat | dog | chased |
|---|---|---|---|---|---|---|
| 1 | 0.081 | 0.06 | 0.06 | 0.06 | 0 | 0 |
| 2 | 0.096 | 0 | 0 | 0 | 0.119 | 0.119 |
| 3 | 0 | 0.086 | 0.086 | 0.086 | 0.139 | 0 |

**Observation:**

The word "cat" has higher TF-IDF values in documents where it appears (Documents 1 and 2) and lower values where it doesn't (Document 3). This highlights the importance of the word within specific documents and across the entire corpus.

In [81]:
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer = transformer.fit(word_count_vectors)

In [89]:
def extract_topn_from_vector(feature_names, sorted_items, topn=10):

    sorted_items = sorted_items[:topn]

    score_vals = []
    feature_vals = []

    for idx, score in sorted_items:
        feature_vals.append(feature_names[idx])
        score_vals.append(round(score, 3))

    results= {}
    for idx in range(len(feature_vals)):
        results[feature_vals[idx]]=score_vals[idx]

    return results

**Scenario:** Let's use a sample document:

"The quick brown fox jumps over the lazy dog. The dog barks."

**1. `tfidf_vector.tocoo()`**

After calculating TF-IDF scores, imagine `tfidf_vector` stores the following (simplified):

| Word | Index | TF-IDF Score |
|---|---|---|
| the | 0 | 0.0 |
| quick | 1 | 0.3 |
| brown | 2 | 0.2 |
| fox | 3 | 0.4 |
| jumps | 4 | 0.1 |
| over | 5 | 0.2 |
| lazy | 6 | 0.6 |
| dog | 7 | 0.5 |
| barks | 8 | 0.0 |

Applying `tocoo()` transforms this into COO format:

| Row | Column (Word Index) | Value (TF-IDF Score) |
|---|---|---|
| 0 | 1 | 0.3 |
| 0 | 2 | 0.2 |
| 0 | 3 | 0.4 |
| 0 | 4 | 0.1 |
| 0 | 7 | 0.5 |
| 0 | 5 | 0.2 |
| 0 | 6 | 0.6 |
| 0 | 8 | 0.0 |

**2. `zip(tfidf_vector.tocoo().col, tfidf_vector.tocoo().data)`**

This combines word indices and TF-IDF scores:

> Output:
[(1, 0.3), (2, 0.2), (3, 0.4), (4, 0.1), (7, 0.5), (5, 0.2), (6, 0.6), (8, 0.0)]


Each tuple represents (word index, TF-IDF score).

**3. `sorted(..., key=lambda x: x[1], reverse=True)`**

This sorts the pairs based on TF-IDF scores (descending order):

> Output:
[(6, 0.6), (7, 0.5), (3, 0.4), (1, 0.3), (2, 0.2), (5, 0.2), (4, 0.1), (8, 0.0)]



In [97]:
feature_names = cv.get_feature_names_out()

In [98]:
def get_keywords(idx, docs):
  tfidf_vector = tfidf_transformer.transform(cv.transform([docs[idx]]))

  sorted_items = sorted(zip(tfidf_vector.tocoo().col, tfidf_vector.tocoo().data), key=lambda x: x[1], reverse=True)
  keywords = extract_topn_from_vector(feature_names, sorted_items, 10)

  return keywords

In [99]:
def print_keywords(idx, df, keywords):
  print('\n======title======')
  print(df['title'][idx])
  print('\n======abstract======')
  print(df['abstract'][idx])
  print('\n======keywords======')
  for k in keywords:
    print(k, keywords[k])

In [100]:
keywords = get_keywords(1, docs)
print_keywords(1, df, keywords)


A Mean Field Theory of Layer IV of Visual Cortex and Its Application to Artificial Neural Networks

Abstract Missing

cortic cell 0.633
monocular 0.334
ocular domin 0.225
ocular 0.201
cell activ 0.196
binocular 0.185
singl cell 0.183
hopfield model 0.146
munro 0.143
synapt modif 0.138


In [101]:
import pickle
pickle.dump(cv, open('cv.pkl', 'wb'))
pickle.dump(tfidf_transformer, open('tfidf.pkl', 'wb'))
pickle.dump(feature_names, open('feature_names.pkl', 'wb'))