## Part 1 -- Baseline Analysis
### Mission Description
- Mission 1: Classification (predict the label of each paper)
- Mission 2: Keyword Extraction

### Feature Extraction
1. TF-IDF algorithm: `TfidfVectorizer`  
    TF (Term Frequency): how often a word occurs in a document, relative to the document's length.  
    $$ 
    TF = \frac{\#SpecificWordOfDoc}{\#TotalWordOfDoc}
    $$
    IDF (Inverse Document Frequency): a weight that measures how rare a word is across the corpus. A high value indicates an uncommon word.
    $$
    IDF = \log\left(\frac{\#DocInCorpus}{\#DocWithSpecificWord + 1}\right)
    $$
    The importance of a word increases with its frequency in the document and decreases with the number of documents in the corpus that contain it.
    $$
    \text{TF-IDF} = TF \times IDF
    $$
    
    
2. BOW (Bag Of Words) model: `CountVectorizer`

Operation: Extract the features with the `sklearn` package (`sklearn.feature_extraction.text`).
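
To make the TF and IDF formulas concrete, here is a minimal pure-Python sketch that applies them literally to a toy corpus. The corpus and the helper names `tf`, `idf`, and `tf_idf` are assumptions made for illustration; they are not part of the baseline.

```python
import math

# Toy corpus (hypothetical); each document is already split into words.
docs = [
    "deep learning for image analysis".split(),
    "image segmentation with deep networks".split(),
    "hiv therapy cohort study".split(),
]

def tf(word, doc):
    # Occurrences of the word divided by the total number of words in the document
    return doc.count(word) / len(doc)

def idf(word, corpus):
    # log(#documents / (#documents containing the word + 1)), as in the IDF formula
    containing = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / (containing + 1))

def tf_idf(word, doc, corpus):
    return tf(word, doc) * idf(word, corpus)

for word in ("image", "analysis"):
    print(word, round(tf_idf(word, docs[0], docs), 4))
```

In this toy corpus "image" appears in two of the three documents, so its IDF is log(3/3) = 0, while "analysis" appears in only one and gets a positive weight. Note that `TfidfVectorizer` does not use exactly this formula: by default it computes a smoothed IDF, ln((1 + n) / (1 + df)) + 1, and L2-normalizes each document vector, so its values will differ from this sketch.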

### Keyword Extraction
**Tokenization**: Break unstructured data and natural language text into chunks of information that can be considered as discrete elements.  
[Blog - Introduction of Tokenization in NLP](https://neptune.ai/blog/tokenization-in-nlp)


**N-grams**: a contiguous sequence of N words.  
[Blog - Introduction of N-grams](https://blog.xrds.acm.org/2017/10/introduction-n-grams-need/)

1. `word_tokenize`
2. `ngrams`

Operation:
1. Stop word removal
2. Extract keywords by bigram frequency.
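
The two steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the baseline itself: the text and the `STOPS` subset are made up, tokenization is a naive whitespace split standing in for `word_tokenize`, and the bigrams are built with `zip`, which yields the same adjacent pairs as `ngrams(tokens, 2)`.

```python
from collections import Counter

# Hypothetical stop-word subset and text, for illustration only
STOPS = {"the", "of", "for", "and", "in", "is", "a", "on"}
text = ("Seizure detection is hard. Seizure detection on mobile devices "
        "needs efficient models, and efficient models need seizure detection data.")

# Naive tokenization: lowercase, split on whitespace, strip edge punctuation
tokens = [w.strip(".,;:!?") for w in text.lower().split()]

# Bigrams: every pair of adjacent tokens
bigrams = list(zip(tokens, tokens[1:]))

# Keep bigrams whose words are both non-stop-words longer than 3 characters
kept = [" ".join(bg) for bg in bigrams
        if all(w not in STOPS and len(w) > 3 for w in bg)]

# Keep phrases that occur more than once, most frequent first, at most 5
counts = Counter(kept)
keywords = [phrase for phrase, n in counts.most_common() if n > 1][:5]
print(keywords)
```

Because punctuation is stripped here, some bigrams span sentence boundaries (for example "hard seizure"). With `word_tokenize`, punctuation marks become separate tokens, and bigrams containing them are removed by the length filter.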

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from sklearn.linear_model import LogisticRegression
from nltk import word_tokenize, ngrams

from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)

In [2]:
# Load Raw Data and Preprocessing
train = pd.read_csv('./data/train.csv')
train['title'] = train['title'].fillna('')
train['abstract'] = train['abstract'].fillna('')

test = pd.read_csv('./data/test.csv')
test['title'] = test['title'].fillna('')
test['abstract'] = test['abstract'].fillna('')

In [3]:
train.head()

Unnamed: 0,uuid,title,author,abstract,Keywords,label
0,0,Accessible Visual Artworks for Blind and Visua...,"Quero, Luis Cavazos; Bartolome, Jorge Iranzo; ...",Despite the use of tactile graphics and audio ...,accessibility technology; multimodal interacti...,0
1,1,Seizure Detection and Prediction by Parallel M...,"Li, Chenqi; Lammie, Corey; Dong, Xuening; Amir...","During the past two decades, epileptic seizure...",CNN; Seizure Detection; Seizure Prediction; EE...,1
2,2,Fast ScanNet: Fast and Dense Analysis of Multi...,"Lin, Huangjing; Chen, Hao; Graham, Simon; Dou,...",Lymph node metastasis is one of the most impor...,Histopathology image analysis; computational p...,1
3,3,Long-Term Effectiveness of Antiretroviral Ther...,"Huang, Peng; Tan, Jingguang; Ma, Wenzhe; Zheng...",In order to assess the effectiveness of the Ch...,HIV; ART; mortality; observational cohort stud...,0
4,4,Real-Time Facial Affective Computing on Mobile...,"Guo, Yuanyuan; Xia, Yifan; Wang, Jing; Yu, Hui...",Convolutional Neural Networks (CNNs) have beco...,facial affective computing; convolutional neur...,0


In [4]:
test.head()

Unnamed: 0,uuid,title,author,abstract,Keywords
0,0,Monitoring Changes in Intracellular Reactive O...,"Al-Hassan M Mustafa,Ramy Ashry,Oliver H Krämer...",Reactive oxygen species (ROS) are induced by s...,Flow cytometry; HDACi; Leukemia; ROS.
1,1,Source Printer Classification Using Printer Sp...,"Joshi, Sharad; Khanna, Nitin",The knowledge of the source printer can help i...,Printer classification; local texture patterns...
2,2,Plasma-processed CoSn/RGO nanocomposite: A low...,"Omelianovych, Oleksii; Larina, Liudmila L.; Oh...",The high cost of state-of-the-art Pt counter e...,Plasma reduction; Bimatalic alloy CoxSn1-x; Re...
3,3,Immediate Antiretroviral Therapy: The Need for...,"Mgbako, Ofole; E. Sobieszczyk, Magdalena; Olen...","Immediate antiretroviral therapy (iART), defin...",HIV; antiretroviral therapy; rapid; health equity
4,4,Design and analysis of an ultra-low-power LC q...,"Lee, Kin Keung; Bryant, Carl; Tormanen, Markus...",This paper presents the design of an ultra-low...,Varactor; Spiral inductor; Quadrature generati...


In [5]:
# Integration (Prepare for Feature Extraction)
# Separate every field with a space so the last word of the abstract is not fused with the first keyword
train['text'] = train['title'] + ' ' + train['author'].fillna('') + ' ' + train['abstract'] + ' ' + train['Keywords'].fillna('')
test['text'] = test['title'] + ' ' + test['author'].fillna('') + ' ' + test['abstract'] + ' ' + test['Keywords'].fillna('')

In [6]:
# TF-IDF Model
vm = TfidfVectorizer()
# print(vm.vocabulary_)  # 'term: index' dict (available after fit)
# print(vm.idf_)  # IDF vector (available after fit)

# BOW Model
# vm = CountVectorizer()
# print(vm.vocabulary_)  # 'term: index' dict (available after fit)

# Transform datasets
vector = vm.fit(train['text'])
train_vector = vector.transform(train['text'])
test_vector = vector.transform(test['text'])

In [7]:
print(train_vector)

  (0, 65446)	0.030549364150652764
  (0, 65318)	0.02444242755778677
  (0, 65077)	0.028246014053659677
  (0, 65064)	0.029621156367709178
  (0, 64735)	0.045579794197544786
  (0, 64139)	0.18451628473970147
  (0, 64129)	0.14963072908017255
  (0, 64116)	0.040629329389665736
  (0, 63711)	0.06799786078737419
  (0, 63115)	0.04117265806080924
  (0, 63114)	0.049814452337536036
  (0, 63104)	0.08489887630119931
  (0, 63095)	0.06580743475418266
  (0, 62619)	0.04564370573228639
  (0, 62164)	0.02391576337827588
  (0, 61137)	0.27723453687872074
  (0, 60831)	0.07228570446246294
  (0, 60352)	0.012362244765386258
  (0, 60207)	0.03270267777936728
  (0, 60171)	0.08841011496442515
  (0, 60162)	0.024068278327935694
  (0, 60149)	0.07872629173608289
  (0, 60136)	0.03951871431621518
  (0, 59683)	0.036425241093392825
  (0, 59174)	0.2945830456249297
  :	:
  (5999, 17533)	0.31504007312955895
  (5999, 15906)	0.11619793197754757
  (5999, 15797)	0.09401162905085142
  (5999, 15385)	0.023249449153589008
  (5999, 14079)	

In [8]:
# Training
model = LogisticRegression()
model.fit(train_vector, train['label'])
test['label'] = model.predict(test_vector)
test[['uuid', 'Keywords', 'label']].to_csv("./submit/task3_baseline.csv", index=False)

In [9]:
# Stop Words Definition
stops = [
    'will', 'can', "couldn't", 'same', 'own', "needn't", 'between', "shan't", 'very',
     'so', 'over', 'in', 'have', 'the', 's', 'didn', 'few', 'should', 'of', 'that', 
     'don', 'weren', 'into', "mustn't", 'other', 'from', "she's", 'hasn', "you're",
     'ain', 'ours', 'them', 'he', 'hers', 'up', 'below', 'won', 'out', 'through',
     'than', 'this', 'who', "you've", 'on', 'how', 'more', 'being', 'any', 'no',
     'mightn', 'for', 'again', 'nor', 'there', 'him', 'was', 'y', 'too', 'now',
     'whom', 'an', 've', 'or', 'itself', 'is', 'all', "hasn't", 'been', 'themselves',
     'wouldn', 'its', 'had', "should've", 'it', "you'll", 'are', 'be', 'when', "hadn't",
     "that'll", 'what', 'while', 'above', 'such', 'we', 't', 'my', 'd', 'i', 'me',
     'at', 'after', 'am', 'against', 'further', 'just', 'isn', 'haven', 'down',
     "isn't", "wouldn't", 'some', "didn't", 'ourselves', 'their', 'theirs', 'both',
     're', 'her', 'ma', 'before', "don't", 'having', 'where', 'shouldn', 'under',
     'if', 'as', 'myself', 'needn', 'these', 'you', 'with', 'yourself', 'those',
     'each', 'herself', 'off', 'to', 'not', 'm', "it's", 'does', "weren't", "aren't",
     'were', 'aren', 'by', 'doesn', 'himself', 'wasn', "you'd", 'once', 'because', 'yours',
     'has', "mightn't", 'they', 'll', "haven't", 'but', 'couldn', 'a', 'do', 'hadn',
     "doesn't", 'your', 'she', 'yourselves', 'o', 'our', 'here', 'and', 'his', 'most',
     'about', 'shan', "wasn't", 'then', 'only', 'mustn', 'doing', 'during', 'why',
     "won't", 'until', 'did', "shouldn't", 'which'
]

# Keywords Extraction By Frequency
def extract_keywords_by_freq(title: str, abstract: str) -> list:
    # Tokenize title and abstract separately so that no bigram spans both
    bigrams = list(ngrams(word_tokenize(title.lower()), 2))
    bigrams += list(ngrams(word_tokenize(abstract.lower()), 2))
    # Name the columns explicitly so an empty input still yields columns 0 and 1
    ngrams_count = pd.DataFrame(bigrams, columns=[0, 1])
    # Drop bigrams that contain a stop word or a word of 3 characters or fewer
    for i in range(ngrams_count.shape[1]):
        ngrams_count = ngrams_count[~ngrams_count[i].isin(stops)]
        ngrams_count = ngrams_count[ngrams_count[i].apply(len) > 3]
    # Concatenate the two words into a phrase and count each phrase
    ngrams_count = (ngrams_count[0] + ' ' + ngrams_count[1]).value_counts()
    # Keep phrases that occur more than once
    ngrams_count = ngrams_count[ngrams_count > 1]
    # value_counts sorts by descending frequency, so this returns the top 5
    return list(ngrams_count.index)[:5]

In [10]:

test_keywords = []
for _, row in test.iterrows():
    pred_keywords = extract_keywords_by_freq(row['title'], row['abstract'])
    # Capitalize the first letter of each word
    pred_keywords = [x.title() for x in pred_keywords]
    # Fall back to placeholder keywords when no bigram occurs more than once
    if len(pred_keywords) == 0:
        pred_keywords = ['A', 'B']
    test_keywords.append("; ".join(pred_keywords))

test['Keywords'] = test_keywords
test[['uuid', 'Keywords', 'label']].to_csv('./submit/task2_baseline.csv', index=False)
