## Part 1 -- Baseline Analysis
### Mission Description
- Mission 1: Classification (predict the label of each paper)
- Mission 2: Keyword Extraction

### Feature Extraction
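1. TF-IDF Algorithm: `TfidfVectorizer`  
    TF (Term Frequency): how often a word occurs in a document, relative to the document's length.  
    $$
    \mathrm{TF} = \frac{\#\,\text{occurrences of the word in the document}}{\#\,\text{words in the document}}
    $$
    IDF (Inverse Document Frequency): a weight that measures how rare a word is across the corpus. A high value means the word appears in few documents. The $+1$ in the denominator avoids division by zero.
    $$
    \mathrm{IDF} = \log\left(\frac{\#\,\text{documents in the corpus}}{\#\,\text{documents containing the word} + 1}\right)
    $$
    The importance of a word increases with its frequency in the document and decreases with the number of documents in the corpus that contain it.
    $$
    \text{TF-IDF} = \mathrm{TF} \times \mathrm{IDF}
    $$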
    
    
2. BOW (Bag of Words) model: `CountVectorizer`. Each document is represented by its raw word counts, ignoring word order.

Operation: extract the features with the `sklearn` package.
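
To make the formulas above concrete, here is a minimal pure-Python sketch that computes TF, IDF, and TF-IDF exactly as written in this section. The toy corpus and the helper names `tf`, `idf`, and `tf_idf` are made up for illustration and are not part of the notebook.

```python
import math
from collections import Counter

# Toy corpus (hypothetical data, for illustration only)
docs = [
    "deep learning for image classification",
    "image segmentation with deep networks",
    "keyword extraction from scientific abstracts",
]
tokenized = [doc.split() for doc in docs]


def tf(word, tokens):
    """Occurrences of `word` divided by the total number of words in the document."""
    return Counter(tokens)[word] / len(tokens)


def idf(word, corpus):
    """log(N / (df + 1)), where df is the number of documents containing `word`."""
    df = sum(1 for tokens in corpus if word in tokens)
    return math.log(len(corpus) / (df + 1))


def tf_idf(word, tokens, corpus):
    """Product of the two quantities above."""
    return tf(word, tokens) * idf(word, corpus)


# "keyword" occurs in 1 of 3 documents, "deep" in 2 of 3
rare = tf_idf("keyword", tokenized[2], tokenized)
common = tf_idf("deep", tokenized[0], tokenized)
```

With this definition, a word found in all but one document gets an IDF of zero, so `common` carries no weight while `rare` does. Note that `TfidfVectorizer` with its default settings uses a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and L2-normalizes each row, so its values will differ from this hand computation.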

### Keyword Extraction
**Tokenization**: splitting unstructured natural-language text into chunks (tokens) that can be treated as discrete elements.  
[Blog - Introduction of Tokenization in NLP](https://neptune.ai/blog/tokenization-in-nlp)


**N-grams**: contiguous sequences of N words.  
[Blog - Introduction of N-grams](https://blog.xrds.acm.org/2017/10/introduction-n-grams-need/)

1. `word_tokenize` (NLTK tokenizer)
2. `ngrams` (NLTK n-gram helper)

Operation:
1. Remove stop words.
2. Extract keywords by word frequency.
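
The two steps above can be sketched without external dependencies as follows. This is only an approximation of the workflow: the regex tokenizer stands in for `word_tokenize`, the `make_ngrams` helper stands in for `ngrams`, and the tiny `STOP_WORDS` set is a hypothetical placeholder for a full stop-word list.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "for", "and", "in", "on", "to", "with", "is", "we"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


def top_keywords(text, k=3, n=1):
    """Drop stop words, count n-grams, and return the k most frequent ones."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    counts = Counter(" ".join(gram) for gram in make_ngrams(tokens, n))
    return [phrase for phrase, _ in counts.most_common(k)]


abstract = (
    "We propose a graph neural network for citation analysis. "
    "The graph neural network is trained on a citation graph."
)
unigrams = top_keywords(abstract, k=2, n=1)
bigrams = top_keywords(abstract, k=2, n=2)
```

Because stop words are removed before the n-grams are formed, some bigrams join words that were not adjacent in the original sentence. Building n-grams first and filtering afterwards is an equally reasonable choice.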

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression

from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)

In [None]:
# Load Raw Data and NA Processing
train = pd.read_csv('')
train['title'] = train['title'].fillna('')
train['abstract'] = train['abstract'].fillna('')

test = pd.read_csv('')
test['title'] = test['title'].fillna('')
test['abstract'] = test['abstract'].fillna('')

In [None]:
# Build one text field per row from title, author, abstract, and keywords
train['text'] = train['title'] + ' ' + train['author'].fillna('') + ' ' + train['abstract'] + ' ' + train['keywords'].fillna('')

test['text'] = test['title'] + ' ' + test['author'].fillna('') + ' ' + test['abstract'] + ' ' + test['keywords'].fillna('')

In [None]:
# TF-IDF Model

# BOW Model
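
The TF-IDF and BOW cells are left empty in the notebook. One possible way to fill them, shown as a sketch and not as the notebook's actual baseline, is to fit each vectorizer on the training text and pass the result to `LogisticRegression`. The toy frames, the `label` column name, the `fit_predict` helper, and the default hyperparameters are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-ins for the real `train` / `test` frames;
# the `label` column name is an assumption.
toy_train = pd.DataFrame({
    "text": [
        "deep learning for image classification",
        "convolutional networks for image recognition",
        "image segmentation with deep networks",
        "protein folding and gene expression",
        "gene regulation in protein networks",
        "dna sequencing and gene expression analysis",
    ],
    "label": [0, 0, 0, 1, 1, 1],
})
toy_test = pd.DataFrame({
    "text": ["deep image recognition", "gene expression of a protein"],
})


def fit_predict(vectorizer):
    """Fit the vectorizer on the training text only, then classify the test text."""
    X_train = vectorizer.fit_transform(toy_train["text"])
    X_test = vectorizer.transform(toy_test["text"])
    model = LogisticRegression().fit(X_train, toy_train["label"])
    return model.predict(X_test)


tfidf_pred = fit_predict(TfidfVectorizer())  # TF-IDF features
bow_pred = fit_predict(CountVectorizer())    # bag-of-words counts
```

Calling `transform` on the test text, not `fit_transform`, keeps the vocabulary and IDF weights learned from the training set, so both splits share the same feature space.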
