# Test KNN model 

Let's look at the test data and predict the research field for randomly chosen research papers. This shows how the model behaves on new data.


We need to load:

* Trained model
* TF-IDF vectorizer

In [44]:
import pickle
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import punkt
from nltk.corpus.reader import wordnet
from nltk.stem import WordNetLemmatizer
import random
from sklearn.feature_extraction.text import TfidfVectorizer
# None disables column-width truncation (-1 is deprecated since pandas 1.0)
pd.set_option('display.max_colwidth', None)

#### Trained models

In [2]:
path_models = "C:/Users/Keletso/Documents/Research Paper Classification/3. Model Training/Models/"

# KNN
path_knnc = path_models + 'best_knnc.pickle'
with open(path_knnc, 'rb') as data:
    knnc_model = pickle.load(data)
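
For context, a pickle file like the one loaded above is normally written with `pickle.dump` in the training step. Here is a minimal round-trip sketch. The dictionary `stand_in_model` and its `n_neighbors` value are hypothetical placeholders for the real fitted classifier, and a temporary directory is used instead of the project path:

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted estimator; the real object is the
# KNN classifier saved by the training notebook.
stand_in_model = {"name": "knn", "n_neighbors": 5}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "best_knnc.pickle")

    # Write side (what the training notebook is assumed to do)
    with open(path, "wb") as output:
        pickle.dump(stand_in_model, output)

    # Read side (same pattern as the cell above)
    with open(path, "rb") as data:
        restored_model = pickle.load(data)

print(restored_model == stand_in_model)  # True
```

Unpickling can run arbitrary code, so only load pickle files that come from a source you trust.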

#### TF-IDF object

In [3]:
path_tfidf = "C:/Users/Keletso/Documents/Research Paper Classification/2. Feature Engineering/Pickles/tfidf.pickle"
with open(path_tfidf, 'rb') as data:
    tfidf = pickle.load(data)

#### Research paper mapping dictionary

In [4]:
Research_Field_codes = {
    'Computer Science': 0,
    'Physics': 1,
    'Mathematics': 2,
    'Statistics': 3,
    'Quantitative Biology': 4,
    'Quantitative Finance': 5
}

#### Cleaning pipeline

In [32]:
punctuation_signs = list("?:!.,;")
stop_words = list(stopwords.words('english'))
def clean_text(df_test): 
    df=df_test.copy()
    # Replace line breaks and stray LaTeX characters with spaces.
    # regex=False makes every pattern a literal string on all pandas versions.
    df['Title_Parsed_1'] = df['TITLE'].str.replace("\n", " ", regex=False)
    df['Title_Parsed_1'] = df['Title_Parsed_1'].str.replace("$", " ", regex=False)
    df['Title_Parsed_1'] = df['Title_Parsed_1'].str.replace("\\", " ", regex=False)
    df['Title_Parsed_1'] = df['Title_Parsed_1'].str.replace("`", " ", regex=False)
    df['ABSTRACT_Parsed_1'] = df['ABSTRACT'].str.replace("\n", " ", regex=False)
    df['ABSTRACT_Parsed_1'] = df['ABSTRACT_Parsed_1'].str.replace("$", " ", regex=False)
    df['ABSTRACT_Parsed_1'] = df['ABSTRACT_Parsed_1'].str.replace("\\", " ", regex=False)
    df['ABSTRACT_Parsed_1'] = df['ABSTRACT_Parsed_1'].str.replace("`", " ", regex=False)
    # Lowercase
    df['Title_Parsed_2'] = df['Title_Parsed_1'].str.lower()
    df['ABSTRACT_Parsed_2'] = df['ABSTRACT_Parsed_1'].str.lower()
    # Remove punctuation signs
    df['Title_Parsed_3'] = df['Title_Parsed_2']
    df['ABSTRACT_Parsed_3'] = df['ABSTRACT_Parsed_2']
    for punct_sign in punctuation_signs:
        df['Title_Parsed_3'] = df['Title_Parsed_3'].str.replace(punct_sign, '', regex=False)
        df['ABSTRACT_Parsed_3'] = df['ABSTRACT_Parsed_3'].str.replace(punct_sign, '', regex=False)
    # Remove possessive endings
    df['Title_Parsed_4'] = df['Title_Parsed_3'].str.replace("'s", "", regex=False)
    df['ABSTRACT_Parsed_4'] = df['ABSTRACT_Parsed_3'].str.replace("'s", "", regex=False)
    wordnet_lemmatizer = WordNetLemmatizer()

    def lemmatize_text(text):
        # Lemmatize every space-separated word as a verb and join them again
        words = text.split(" ")
        return " ".join(wordnet_lemmatizer.lemmatize(word, pos="v") for word in words)

    # Lemmatize row by row so the function also works for more than one paper
    df['Title_Parsed_5'] = df['Title_Parsed_4'].apply(lemmatize_text)
    df['ABSTRACT_Parsed_5'] = df['ABSTRACT_Parsed_4'].apply(lemmatize_text)
    df['Title_Parsed_6'] = df['Title_Parsed_5']
    df['ABSTRACT_Parsed_6'] = df['ABSTRACT_Parsed_5']
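    for stop_word in stop_words:
        # Match the stop word as a whole word only.
        # regex=True must be explicit: pandas 2.0 changed the default to False.
        regex_stopword = r"\b" + stop_word + r"\b"
        df['Title_Parsed_6'] = df['Title_Parsed_6'].str.replace(regex_stopword, '', regex=True)
        df['ABSTRACT_Parsed_6'] = df['ABSTRACT_Parsed_6'].str.replace(regex_stopword, '', regex=True)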
    list_columns = ["Title_Parsed_6", "ABSTRACT_Parsed_6"]
    df = df[list_columns]

    df = df.rename(columns={'Title_Parsed_6': 'Title_Parsed','ABSTRACT_Parsed_6': 'ABSTRACT_Parsed'})
    df['Article_Description']=df['Title_Parsed'] + df['ABSTRACT_Parsed']

    df_final=df['Article_Description']
    
    return df_final
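
To see what the individual cleaning steps do, here is a minimal sketch that applies them to a single string using only the standard library. It is not the notebook's pipeline: `clean_string` and `DEMO_STOP_WORDS` are hypothetical names, the stop-word list is a tiny illustrative subset of the NLTK list, and lemmatization is left out because it needs the NLTK data files.

```python
import re

# Hypothetical, tiny stop-word list for illustration only; the notebook
# uses the full NLTK English list.
DEMO_STOP_WORDS = ["the", "of", "a", "in", "is"]
PUNCTUATION_SIGNS = list("?:!.,;")

def clean_string(text):
    # 1. Replace line breaks and stray LaTeX-style characters with spaces
    for char in ["\n", "$", "\\", "`"]:
        text = text.replace(char, " ")
    # 2. Lowercase
    text = text.lower()
    # 3. Drop punctuation signs
    for sign in PUNCTUATION_SIGNS:
        text = text.replace(sign, "")
    # 4. Drop possessive endings
    text = text.replace("'s", "")
    # 5. Remove stop words, matching whole words only
    for stop_word in DEMO_STOP_WORDS:
        text = re.sub(r"\b" + re.escape(stop_word) + r"\b", "", text)
    return text

cleaned = clean_string("The Decay of $K$ in a Hard-Sphere System.")
print(cleaned.split())  # ['decay', 'k', 'hard-sphere', 'system']
```

The cleaned string itself still contains runs of spaces where stop words were removed, which is also why the outputs further down show irregular spacing.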

Let's write a function that tells us the research field given the category code:

In [75]:
def get_fields(field_id): 
    for field, id_  in Research_Field_codes.items():
        if id_ == field_id:      
            return field
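
As an alternative to looping over the dictionary on every call, the mapping can be inverted once. This is a sketch of that idea; `field_by_code` and `get_field` are hypothetical names that do not appear elsewhere in the notebook:

```python
# Same mapping as in the notebook
Research_Field_codes = {
    'Computer Science': 0,
    'Physics': 1,
    'Mathematics': 2,
    'Statistics': 3,
    'Quantitative Biology': 4,
    'Quantitative Finance': 5
}

# Build the reverse mapping once: category code -> field name
field_by_code = {code: field for field, code in Research_Field_codes.items()}

def get_field(field_id):
    # Returns None for an unknown code, like the loop-based version
    return field_by_code.get(field_id)

print(get_field(1))  # Physics
```

Both versions return the same result; the inverted dictionary simply avoids scanning all entries for each lookup.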

Finally, here's the function that predicts the research field from the cleaned title and abstract text:

In [79]:
def predict_from_text(text):
    
    # Predict using the input model
    features = tfidf.transform([text]).toarray()
    prediction_knnc = knnc_model.predict(features)[0]
    
    # Return result
    field_knnc = get_fields(prediction_knnc)
    
    print("The predicted category using the knn model is %s." %(field_knnc) )
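
To make the prediction step less of a black box, here is a minimal pure-Python sketch of what a k-nearest-neighbours classifier does with a feature vector: it measures the distance to every training vector and takes a majority vote among the k closest ones. The toy vectors, the labels, the Euclidean distance and `k=3` are all assumptions for illustration; the actual settings of the pickled `knnc_model` are not shown in this notebook.

```python
import math
from collections import Counter

# Hypothetical 2-dimensional "TF-IDF" vectors with their field codes
train_vectors = [(0.9, 0.1), (0.8, 0.2), (0.7, 0.3), (0.1, 0.9), (0.2, 0.8)]
train_labels = [1, 1, 1, 2, 2]  # 1 = Physics, 2 = Mathematics

def knn_predict(query, vectors, labels, k=3):
    # Rank the training points by Euclidean distance to the query
    ranked = sorted(zip(vectors, labels), key=lambda pair: math.dist(query, pair[0]))
    # Majority vote among the labels of the k nearest points
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

print(knn_predict((0.85, 0.15), train_vectors, train_labels))  # 1
```

In the notebook, the query vector comes from `tfidf.transform` and the training vectors are the ones stored inside the fitted model.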

We take unseen data from the test dataset:

In [92]:
df_test = pd.read_csv('C:/Users/Keletso/Documents/Research Paper Classification/0. Raw Data Set/test.csv')

In [93]:
sample_test=df_test.sample(1)
sample_cleaned_text=clean_text(sample_test).to_string(index=False)
sample_cleaned_text

'  macroscopic irreversibility  decay  kinetic equilibrium  classical hard-sphere systems     paper  condition  investigate   occurrence   -called macroscopic irreversibility property   relate phenomenon  decay  kinetic equilibrium  may characterize   1- body probability density function (pdf) associate  hard-sphere systems  problem  set   framework   axiomatic "ab initio" approach  classical statistical mechanics recently develop [tessarotto  textit{et al} 2013-2017]   relate establishment   exact kinetic equation realize  master equation    kinetic pdf  show   paper  task involve  introduction   suitable functional    1- body pdf  identify    textit{master kinetic information}  goal   show  provide   pdf  realize  term   arbitrary suitably-smooth particular solution   master kinetic equation  two properties indicate   indeed realize     functional  unrelated either   boltzmann-shannon entropy   fisher information '

In [94]:
predict_from_text(sample_cleaned_text)

The predicted category using the knn model is Physics.


In [95]:
sample_test=df_test.sample(1)
sample_cleaned_text=clean_text(sample_test).to_string(index=False)
sample_cleaned_text

'  concordances  connect sum  torus knot  l-space knot     knot   nontrivial connect sum  positive torus knot     concordant   l-space knot '

In [96]:
predict_from_text(sample_cleaned_text)

The predicted category using the knn model is Mathematics.


In [97]:
sample_test=df_test.sample(1)
sample_cleaned_text=clean_text(sample_test).to_string(index=False)
sample_cleaned_text

'  rational neural network  approximate jump discontinuities  graph convolution operator    node level graph encode  recent important state--art method   graph convolutional network (gcn)  nicely integrate local vertex feature  graph topology   spectral domain however current study suffer  several drawbacks (1) graph cnns rely  chebyshev polynomial approximation  result  oscillatory approximation  jump discontinuities (2) increase  order  chebyshev polynomial  reduce  oscillations issue  also incur unaffordable computational cost (3) chebyshev polynomials require degree   omega (poly(1/  epsilon ))  approximate  jump signal    |x|   rational function  need   mathcal{} (poly log(1/  epsilon )) cite{liang2016deeptelgarsky2017neural} however  non-trivial  apply rational approximation without increase computational complexity due   denominator   paper  superiority  rational approximation  exploit  graph signal recover ratioanlnet  propose  integrate rational function  neural network  show 

In [98]:
predict_from_text(sample_cleaned_text)

The predicted category using the knn model is Computer Science.
