Before using AuDoLab, download the NLTK stopwords by running:

In [1]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\AMOR
[nltk_data]     1\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

inside a Python console.

To use AuDoLab in a project:

In [2]:
from AuDoLab import AuDoLab

Then create an instance of the AuDoLab class:

In [4]:
audo = AuDoLab.AuDoLab()

In this example we use publicly available data from the nltk package, the Reuters corpus (if the corpus is not installed yet, nltk.download('reuters') fetches it):

In [5]:
from nltk.corpus import reuters
import numpy as np
import pandas as pd

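# Collect (filename, categories, text) for every Reuters document.
data = []

for fileid in reuters.fileids():
    # File ids have the form "tag/filename"; only the filename is kept.
    _, filename = fileid.split("/")
    categories = ", ".join(reuters.categories(fileid))
    data.append((filename, categories, reuters.raw(fileid)))

data = pd.DataFrame(data, columns=["filename", "categories", "text"])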
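
The loop above relies on each Reuters file id having the form `tag/filename`. As a minimal sketch of that split, the snippet below uses a few hand-written sample ids in place of `reuters.fileids()`, so it runs without the corpus; `split_fileid` is a hypothetical helper introduced only for illustration.

```python
# Hand-written sample ids in the "tag/filename" form the loop above expects.
# They stand in for reuters.fileids(), so no corpus download is needed.
sample_fileids = ["test/14826", "test/14828", "training/9865"]


def split_fileid(fileid):
    """Split a "tag/filename" id into its two parts (hypothetical helper)."""
    tag, filename = fileid.split("/")
    return tag, filename


rows = [split_fileid(fileid) for fileid in sample_fileids]
for tag, filename in rows:
    print(f"{tag:>8}  {filename}")
```

Unpacking into exactly two names fails with a `ValueError` if an id contains no slash or more than one, which surfaces unexpected ids early.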

Then scrape abstracts, e.g. from IEEE, with the abstract scraper (the first time you use the IEEE scraper, pyppeteer will be downloaded automatically):

In [7]:
scraped_documents = audo.get_ieee("https://ieeexplore.ieee.org/search/searchresult.jsp?newsearch=true&queryText=cotton&highlight=true&returnFacets=ALL&returnType=SEARCH&matchPubs=true&rowsPerPage=100&pageNumber=1", pages=1)

The algorithm is iterating through 2 pages
Total number of abstracts that will be scraped: 203


100%|████████████████████████████████████████████████████████████████████████████████| 203/203 [02:08<00:00,  1.58it/s]
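
The address passed to `get_ieee` is an ordinary search URL with a query string, so it can be assembled from its parts instead of being copied from the browser. The following is a sketch using only the standard library. `build_search_url` is a hypothetical helper, not part of AuDoLab, and the parameter names are simply those that appear in the URL above.

```python
from urllib.parse import urlencode

BASE_URL = "https://ieeexplore.ieee.org/search/searchresult.jsp"


def build_search_url(query, rows_per_page=100, page_number=1):
    """Assemble a search URL with the same parameters as the example above.

    Hypothetical helper: the parameter names are taken from the example URL,
    not from any documented API.
    """
    params = {
        "newsearch": "true",
        "queryText": query,
        "highlight": "true",
        "returnFacets": "ALL",
        "returnType": "SEARCH",
        "matchPubs": "true",
        "rowsPerPage": rows_per_page,
        "pageNumber": page_number,
    }
    # urlencode escapes special characters in the values.
    return f"{BASE_URL}?{urlencode(params)}"


search_url = build_search_url("cotton")
print(search_url)
```

The resulting string could then be passed as the first argument of `audo.get_ieee(...)`, as in the cell above.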


Both the target data and the scraped papers need to be preprocessed and
converted to TF-IDF vectors before they are passed to the classifier:

In [8]:
preprocessed_target = audo.text_cleaning(data=data, column="text")

preprocessed_paper = audo.text_cleaning(
    data=scraped_documents, column="abstract")

target_tfidf, training_tfidf = audo.tf_idf(
    data=preprocessed_target,
    papers=preprocessed_paper,
    data_column="lemma",
    papers_column="lemma",
    features=100000,
)

100%|██████████████████████████████████████████████████████████████████████████| 10788/10788 [00:04<00:00, 2316.42it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 181/181 [00:00<00:00, 3213.22it/s]
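
To make the two steps above more concrete, here is a small pure-Python sketch of stop word removal followed by a textbook TF-IDF weighting (term frequency times `log(N / df)`). It illustrates the idea only and is not AuDoLab's implementation: the stop word set, the tokenizer, and the function names are all assumptions, and library implementations typically use smoothed variants of the formula.

```python
import math
from collections import Counter

# Tiny hypothetical stop word set; it stands in for the NLTK stop words
# downloaded at the start of this page.
STOPWORDS = {"the", "a", "of", "is", "in", "and"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (word.strip(".,;:!?").lower() for word in text.split())
    return [t for t in tokens if t and t not in STOPWORDS]


def tf_idf(documents):
    """Return one {term: weight} dict per document (tf * log(N / df))."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "Cotton prices rise in the export market.",
    "The cotton harvest is strong.",
    "Oil prices fall.",
]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

Terms that occur in many documents (here "cotton" and "prices") receive lower weights than terms unique to one document, which is what makes the vectors useful for separating topics.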


Afterwards we can train one-class SVM classifiers over a range of nu values
and choose the desired one:

In [9]:
o_svm_result = audo.one_class_svm(
    training=training_tfidf,
    predicting=target_tfidf,
    nus=np.round(np.arange(0.001, 0.5, 0.01), 7),
    quality_train=0.9,
    min_pred=0.001,
    max_pred=0.05,
)
result = audo.choose_classifier(preprocessed_target, o_svm_result, 0)

nu: 0.381 data predicted: 32 training_data predicted: 177
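
The sketch below shows one plausible reading of the parameters in the call above, checked against the printed result. It assumes that `quality_train` is the minimum share of training papers the classifier must recognize, and that `min_pred` and `max_pred` bound the share of target documents it may flag. This interpretation, and the helper names `nu_grid` and `within_thresholds`, are assumptions rather than AuDoLab's documented behavior. The document counts (181 papers, 10788 target documents) are taken from the progress bars above.

```python
import math


def nu_grid(start=0.001, stop=0.5, step=0.01):
    """Pure-Python stand-in for np.round(np.arange(start, stop, step), 7)."""
    n_values = math.ceil((stop - start) / step)
    return [round(start + i * step, 7) for i in range(n_values)]


def within_thresholds(train_hits, n_train, target_hits, n_target,
                      quality_train=0.9, min_pred=0.001, max_pred=0.05):
    """Check a classifier's hit counts against the thresholds.

    Assumed meaning (not confirmed by the library documentation):
    - at least `quality_train` of the training papers are recognized,
    - between `min_pred` and `max_pred` of the target documents are flagged.
    """
    train_share = train_hits / n_train
    target_share = target_hits / n_target
    return train_share >= quality_train and min_pred <= target_share <= max_pred


grid = nu_grid()
print(len(grid), grid[0], grid[-1])

# Counts from the output above: nu = 0.381 flags 32 of 10788 target
# documents and recognizes 177 of 181 training papers.
accepted = within_thresholds(177, 181, 32, 10788)
print(accepted)
```

Under this reading, the classifier printed above qualifies because it recognizes about 98% of the training papers while flagging about 0.3% of the target documents.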


Finally, estimate the topics of the resulting data with LDA and visualize them:

In [10]:
lda_target = audo.lda_modeling(data=result, num_topics=5)

audo.lda_visualize_topics(type="pyldavis")