In [1]:
import sys
sys.path.append('../TTI/')
%load_ext autoreload
%autoreload 2

# Taxonomy, text identification

Given is a fragment of Wikipedia's topic classification hierarchy (https://en.wikipedia.org/wiki/Category:Main_topic_classifications) in the form of a CSV file.
The classification is a connected graph in which nodes are topics and edges represent the refinement of a topic.

The goal of the project is to propose and test a mechanism for automatic text classification. The input is a text file in English. The output is a set of nodes from the above topic classification.


## Input data

The input to the task is a connected graph with 225765 nodes; each node represents one category. The graph is not an ordered tree and may also contain cycles.
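The properties described above (connected, not a tree, possibly cyclic) can be checked on a small edge list. The sketch below is only an illustration: the `parent,child` column layout and the sample rows are assumptions, not the real file, and the helper names are hypothetical rather than part of `TTI.CategoriesGraph`.

```python
import csv
import io
from collections import deque

# Hypothetical sample; the "parent,child" layout is an assumption about the CSV.
SAMPLE_CSV = """parent,child
Main_topic_classifications,Science
Main_topic_classifications,Technology
Science,Algorithms
Technology,Algorithms
Algorithms,Science
"""

def load_edges(text):
    """Parse (parent, child) pairs from CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["parent"], row["child"]) for row in reader]

def build_adjacency(edges):
    """Directed adjacency lists: parent -> list of children."""
    adjacency = {}
    for parent, child in edges:
        adjacency.setdefault(parent, []).append(child)
        adjacency.setdefault(child, [])  # register leaf nodes as well
    return adjacency

def is_weakly_connected(adjacency):
    """True if a BFS that ignores edge direction reaches every node."""
    if not adjacency:
        return True
    undirected = {node: set() for node in adjacency}
    for parent, children in adjacency.items():
        for child in children:
            undirected[parent].add(child)
            undirected[child].add(parent)
    start = next(iter(undirected))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in undirected[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(undirected)

def has_cycle(adjacency):
    """Iterative DFS with three colours; an edge to a grey node closes a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in adjacency}
    for root in adjacency:
        if colour[root] != WHITE:
            continue
        colour[root] = GREY
        stack = [(root, iter(adjacency[root]))]
        while stack:
            node, children = stack[-1]
            for child in children:
                if colour[child] == GREY:
                    return True
                if colour[child] == WHITE:
                    colour[child] = GREY
                    stack.append((child, iter(adjacency[child])))
                    break
            else:
                # All children explored: the node is finished
                colour[node] = BLACK
                stack.pop()
    return False

graph = build_adjacency(load_edges(SAMPLE_CSV))
print("nodes:", len(graph))
print("connected:", is_weakly_connected(graph))
print("has cycle:", has_cycle(graph))
```

In the sample, `Algorithms` has two parents (so the graph is not a tree) and the edge `Algorithms -> Science` closes a cycle, which is why any traversal of such a graph has to track visited nodes.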

In [2]:
from TTI.CategoriesGraph import CategoriesGraph

categories = CategoriesGraph()

Reading topics graph


In [3]:
print("Number of edges", categories._edge_list.shape)
print("Number of nodes", categories._graph.number_of_nodes())

Number of edges (339250, 2)
Number of nodes 225765


## Training set

The training set was prepared with the notebook `01-tti-training-set-generate.ipynb`, which also contains more information about the generation process.

In [4]:
from TTI.config import DATABASE_PATH
import sqlite3
import pandas as pd
import json
import numpy as np
from tensorflow.keras.utils import to_categorical

table_name = "training_set_25"
connection = sqlite3.connect(DATABASE_PATH)


In [5]:
dataset = pd.read_sql('select * from {}'.format(table_name), connection)

In [6]:
dataset["Representation"] = dataset["Representation"].apply(json.loads)  # JSON string -> list of floats
dataset["Category"] = dataset["Category"].apply(lambda i : i[9:])  # drop the first 9 characters (presumably the "Category:" prefix)
dataset["Words"] = dataset["Words"].apply(json.loads)  # JSON string -> list of words

In [7]:
print("Dataset size:", dataset.shape)
print("Numeric representation vector size:", len(dataset.iloc[12]["Representation"]))
print("Number of words per category:", len(dataset.iloc[12]["Words"]))

Dataset size: (225765, 3)
Numeric representation vector size: 300
Number of words per category: 25


In [8]:
dataset

Unnamed: 0,Category,Words,Representation
0,Main_topic_classifications,"[academic, culture, human, entertainment, heal...","[-0.3755445182323456, 0.010519789531826973, -0..."
1,Main topic articles,"[academic, culture, human, entertainment, heal...","[-0.40671899914741516, 0.013835961930453777, -..."
2,Academic disciplines,"[academic, art, academics, euthenics, studies,...","[-0.09239675104618073, -0.46590009331703186, -..."
3,Subfields by academic discipline,"[subfield, academic, areas, evolutionary, fiel...","[0.085173599421978, 0.010392077267169952, -0.3..."
4,Scholars by subfield,"[subfield, academic, architects, studies, clas...","[-0.15292514860630035, -0.5975006222724915, -0..."
...,...,...,...
225760,World Wide Web stubs,"[internet, wide, system, technology, bioinform...","[0.216136172413826, -0.024581177160143852, -0...."
225761,Internet publication stubs,"[service, wide, entertainment, online, news, s...","[0.2748589515686035, 0.2310565859079361, -0.34..."
225762,Website stubs,"[websites, service, wide, entertainment, onlin...","[0.1632257103919983, 0.16291794180870056, -0.2..."
225763,Wikimedia Foundation stubs,"[websites, service, wide, entertainment, onlin...","[0.19932252168655396, 0.19686073064804077, -0...."


## Finding the most similar vectors

For classification I will compute the distance between `doc2vec` representation vectors (the code below uses the cosine distance). The vectors with the smallest distance are considered the most similar.
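As a reminder of what the distance measures, here is a minimal pure-Python sketch of the cosine distance, using the same definition as `scipy.spatial.distance.cosine` (one minus the cosine of the angle between the vectors). The function name `cosine_distance` is introduced here only for illustration.

```python
import math

def cosine_distance(u, v):
    """1 - cos(angle between u and v): 0 for the same direction, 2 for opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

print(cosine_distance([1.0, 0.0], [2.0, 0.0]))   # same direction -> 0.0
print(cosine_distance([1.0, 0.0], [0.0, 3.0]))   # orthogonal     -> 1.0
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite       -> 2.0
```

The distance depends only on direction, not on vector length, so two representations that differ just by scale count as identical.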

In [9]:
dataset.loc[dataset['Category'] == "Machine learning algorithms"]


Unnamed: 0,Category,Words,Representation
2692,Machine learning algorithms,"[checksum, algorithmic, trading, compression, ...","[0.302137166261673, 0.3030090630054474, -1.029..."


In [10]:
from scipy import spatial

name = dataset["Category"][2692]
vector = dataset["Representation"][2692]

print("Name of category", name)

Name of category Machine learning algorithms


Now we need to find the most similar categories.

In [11]:
import tqdm

def find_simmilar(vector, count, df):
    """Find the 'count' categories whose vectors are closest to 'vector' (cosine distance)."""
    distances = []
    # Linear scan: compute the cosine distance to every category vector
    for index, row in tqdm.tqdm(df.iterrows(), total=df.shape[0]):
        vec = row["Representation"]
        name = row["Category"]
        distances.append((name, spatial.distance.cosine(vector, vec)))
    # Smallest distance first
    sorted_categories = sorted(distances, key=lambda i: i[1])
    return sorted_categories[0:count]


In [12]:
best_matching = find_simmilar(vector, 20, dataset)

100%|██████████| 225765/225765 [00:40<00:00, 5637.08it/s]


In [13]:
best_matching

[('Machine learning algorithms', 0.0),
 ('Heuristic algorithms', 0.03048611901635534),
 ('Cryptographic algorithms', 0.033841232971325574),
 ('Computer arithmetic algorithms', 0.034831473629752474),
 ('Data mining algorithms', 0.03487279219354722),
 ('Compression algorithms', 0.035516547716642144),
 ('Digit-by-digit algorithms', 0.03591962687260464),
 ('Algorithms', 0.03608325172873006),
 ('Bioinformatics algorithms', 0.03669657755224942),
 ('Approximation algorithms', 0.03737963968331992),
 ('Statistical algorithms', 0.037658969638839856),
 ('Quantum algorithms', 0.037759472173684916),
 ('Graph algorithms', 0.038569196633341796),
 ('Pseudo-polynomial time algorithms', 0.039990344459672755),
 ('Routing algorithms', 0.04005739003276043),
 ('Selection algorithms', 0.04025399420522191),
 ('Algorithm description languages', 0.043534683809688834),
 ('Calendar algorithms', 0.045350172497429675),
 ('Distributed algorithms', 0.04660966816597345),
 ('Streaming algorithms', 0.04691549780262483)]

As the example above shows, for the category `Machine learning algorithms` the algorithm found the 20 most similar classes. The first result is the category itself (distance 0.0); the most similar other category turned out to be `Heuristic algorithms`.

In [14]:
best_matching[1]

('Heuristic algorithms', 0.03048611901635534)
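The row-by-row scan in `find_simmilar` takes roughly 40 seconds per query on the full dataset. One possible alternative, sketched below under the assumption that all representations fit into a single NumPy matrix, computes every cosine distance in one matrix product. The function `find_similar_vectorized` and the toy data are hypothetical and not part of the `TTI` package.

```python
import numpy as np

def find_similar_vectorized(query, matrix, names, count):
    """Return the `count` (name, cosine distance) pairs closest to `query`.

    `matrix` holds one representation per row; `names` holds the matching labels.
    """
    matrix = np.asarray(matrix, dtype=float)
    query = np.asarray(query, dtype=float)
    # Cosine distance for all rows at once: 1 - (M q) / (|M_i| * |q|)
    dots = matrix @ query
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    distances = 1.0 - dots / norms
    count = min(count, len(distances))
    # argpartition selects the `count` smallest values without a full sort;
    # only that small subset is then sorted by distance
    nearest = np.argpartition(distances, count - 1)[:count]
    nearest = nearest[np.argsort(distances[nearest])]
    return [(names[i], float(distances[i])) for i in nearest]

# Toy data standing in for the 300-dimensional category vectors
toy_names = ["a", "b", "c", "d"]
toy_matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
print(find_similar_vectorized([1.0, 0.1], toy_matrix, toy_names, 2))
```

With the notebook's data, the matrix could be built once with `np.array(dataset["Representation"].tolist())` and the labels with `dataset["Category"].tolist()`, then reused for every query.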

## Classifying a text document

With an algorithm that can compare two `doc2vec` numeric representation vectors, we can move on to the actual task: classifying a real article.

The first step is to extract a set of words characteristic of the text document. These words are then used to generate the article's numeric representation vector with the `doc2vec` model.
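The internals of `get_document_representation` are not shown in this notebook, so the following is only a rough stand-in for the word-extraction step: it picks the most frequent tokens that are not stop words. The function `top_words`, the tiny stop-word list, and the sample text are all assumptions made for illustration; the words printed by the notebook look lemmatized, which this sketch does not attempt.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real pipeline would use a fuller one)
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "for", "by", "its", "it", "are", "on"}

def top_words(text, words_count):
    """Return the `words_count` most frequent non-stop-word tokens, most frequent first."""
    tokens = re.findall(r"[a-z][a-z0-9-]*", text.lower())
    counts = Counter(token for token in tokens if token not in STOP_WORDS)
    return [word for word, _ in counts.most_common(words_count)]

sample = (
    "The k-nearest neighbors algorithm classifies a point by the classes of its "
    "neighbors. The distance metric decides which neighbors count, and the "
    "distance is computed in feature space."
)
print(top_words(sample, 2))  # ['neighbors', 'distance']
```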

In [15]:
from TTI.TextDocument import get_document_representation, get_article_content

# Finds Wikipedia article by name and downloads it using Wikipedia API
article = get_article_content("K-nearest_neighbors_algorithm")

representation = get_document_representation(article, words_count=75)  # number of words from the article used to build the vector

document_words = representation['words']
document_vector = representation['vector']

print("Found words: ", document_words)

Found words:  ['class', 'datum', 'example', 'point', 'classification', 'training', 'neighbor', 'algorithm', 'feature', 'distance', 'error', 'classifier', 'rate', 'set', 'prototype', 'label', 'number', 'analysis', 'reduction', 'outlier', 'regression', 'value', 'weight', 'neighbour', 'search', 'dimension', 'data', 'input', 'object', 'distribution', 'vector', 'space', 'query', 'border', 'ratio', 'test', 'problem', 'approach', 'result', 'extraction', 'method', 'accuracy', 'technique', 'step', 'map', 'density', 'boundary', 'decision', 'case', 'output', 'property', 'function', 'scale', 'x', 'sample', 'constant', 'metric', 'variable', 'way', 'representation', 'effect', 'information', 'size', 'risk', 'expansion', 'term', 'order', 'figure', 'fig', 'k', 'vote', 'average', 'type', 'computation', 'evaluation']


In [16]:
print("Numeric vector: ", document_vector)

Numeric vector:  [-0.13973025977611542, 0.8803226351737976, -0.954968273639679, -1.4638932943344116, 0.7967787384986877, 1.2005836963653564, -0.2745325565338135, -0.22412803769111633, -1.8432432413101196, -0.6706209778785706, -0.536053478717804, 0.4469451606273651, 0.6560133695602417, -0.5908300876617432, -0.45293983817100525, -0.33585020899772644, -0.16023124754428864, -0.5544490814208984, 0.06164596974849701, 0.6012845039367676, -0.35053005814552307, 0.12409500032663345, -0.12851309776306152, 0.05405285954475403, 0.44916659593582153, -0.7805129885673523, -0.06957641243934631, 0.08045163005590439, -0.7114368677139282, -0.8525400757789612, -0.8904473781585693, 0.00835314579308033, 0.28191035985946655, 0.5773045420646667, 0.21824778616428375, 0.6881959438323975, 0.3126390874385834, -0.11073076725006104, 1.3967258930206299, -0.018726421520113945, -0.6426990032196045, 0.8602242469787598, -0.06603698432445526, 0.8466103076934814, -0.46289312839508057, 0.8824690580368042, 0.0795854628086090

In [17]:
best_matching = find_simmilar(document_vector, 5, dataset)


100%|██████████| 225765/225765 [00:46<00:00, 4819.82it/s]


In [18]:
best_matching

[('Object recognition and categorization', 0.6043233070349052),
 ('Learning in computer vision', 0.6117157606695703),
 ('Air force units and formations by type', 0.6123609738092128),
 ('Internet advertising methods', 0.615647974472145),
 ('Contextual advertising', 0.6175729562705745)]

The algorithm found the 5 categories from the input graph that best "fit" the article on the "K nearest neighbors algorithm".

These are the classes:
* Object recognition and categorization
* Learning in computer vision
* Air force units and formations by type
* Internet advertising methods
* Contextual advertising

## Other examples

In [19]:
def get_classified_categories(document_name, words_count=50, categories_count=5):
    article = get_article_content(document_name)
    representation = get_document_representation(article, words_count)  # number of words from the article used to build the vector

    document_words = representation['words']
    document_vector = representation['vector']
    print("Found words: ", document_words)

    best_matching = find_simmilar(document_vector, categories_count, dataset)
    return best_matching


In [26]:
get_classified_categories("Maxwell's equations")

Found words:  ['equation', 'field', 'charge', 'law', 'material', 'formulation', 'surface', 'current', 'form', 'space', 'unit', 'wave', 'time', 'vector', 'light', 'phenomenon', 'quantum', 'term', 'tensor', 'density', 'condition', 'version', 'theory', 'volume', 'spacetime', 'line', 'relation', 'scale', 'polarization', 'loop', 'definition', 'solution', 'speed', 'example', 'operator', 'curl', 'system', 'magnetization', 'force', 'change', 'consequence', 'vacuum', 'matter', 'potential', 'value', 'relativity', 'physics', 'component', 'quantity', 'divergence']
100%|██████████| 225765/225765 [00:36<00:00, 6130.19it/s]


[('Dirac equation', 0.47262195755523584),
 ('Units of electrical conductance', 0.5105260627377728),
 ('Units of electrical resistance', 0.5201984424947701),
 ('Quantum electrodynamics', 0.5211182432355765),
 ('Magnetic monopoles', 0.524841188862891)]

In [27]:
get_classified_categories("COVID-19")

Found words:  ['covid-19', 'virus', 'infection', 'disease', 'people', '%', 'case', 'symptom', 'risk', 'treatment', 'protein', 'rate', 'death', 'antibody', 'age', 'study', 'surface', 'cell', 'time', 'day', 'lung', 'woman', 'syndrome', 'transmission', 'hand', 'use', 'result', 'factor', 'population', 'blood', 'response', 'mortality', 'number', 'patient', 'level', 'storm', 'effect', 'condition', 'measure', 'research', 'human', 'man', 'year', 'testing', 'mask', 'group', 'host', 'health', 'hospital', 'heart']
100%|██████████| 225765/225765 [00:37<00:00, 5944.79it/s]


[('Elder law', 0.565619938441521),
 ('Risk factors', 0.5676282560972064),
 ('Cross-sectional analysis', 0.5794792989270499),
 ('Medical law journals', 0.5802471747474864),
 ('Observational study', 0.5815167426234842)]

In [28]:
get_classified_categories("Machine learning")

Found words:  ['learning', 'machine', 'datum', 'algorithm', 'model', 'training', 'example', 'method', 'system', 'input', 'set', 'feature', 'computer', 'task', 'network', 'field', 'rule', 'decision', 'problem', 'bias', 'classification', 'prediction', 'approach', 'analysis', 'knowledge', 'output', 'function', 'regression', 'detection', 'statistic', 'mining', 'program', 'representation', 'time', 'goal', 'environment', 'technique', 'tree', 'application', 'theory', 'performance', 'variable', 'neuron', 'people', 'intelligence', 'research', 'pattern', 'class', 'programming', 'information']
100%|██████████| 225765/225765 [00:36<00:00, 6146.28it/s]


[('Machine learning researchers', 0.4397311011007218),
 ('Logic programming researchers', 0.4412704569362673),
 ('Loss functions', 0.4591873063314801),
 ('Expert systems', 0.46839583850828126),
 ('Markov models', 0.4742454348302344)]

In [29]:
get_classified_categories("React Native")

Found words:  ['application', 'framework', 'developer', 'platform', 'version', 'datum', 'ios', 'background', 'thread', 'web', 'app', 'development', 'code', 'team', 'css', 'source', 'tv', 'macos', 'capability', 'history', 'mistake', 'company', 'experience', 'way', 'element', 'basis', 'prototype', 'order', 'technology', 'month', 'talk', 'production', 'implementation', 'principle', 'dom', 'process', 'end', 'device', 'communicate', 'bridge', 'react', 'api', 'paradigm', 'styling', 'syntax', 'message', 'view', 'language', 'example', 'program']
100%|██████████| 225765/225765 [00:37<00:00, 6083.22it/s]


[('Web frameworks', 0.5893907456203118),
 ('Ajax (programming)', 0.5933646821684826),
 ('Component-based software engineering', 0.5971404285860622),
 ('Web developers', 0.5978359344648014),
 ('Audio to video synchronization', 0.6031584457844819)]