<a href="https://colab.research.google.com/github/TheUnifyingForce/Taxonomy-of-Multi-Modal-Datasets/blob/main/%20WordNet%26%26GloVe.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## WordNet

#### WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.

#### WordNet superficially resembles a thesaurus, in that it groups words together based on their meanings. However, there are some important distinctions.


*   First, WordNet interlinks not just word forms—strings of letters—but specific senses of words. As a result, words that are found in close proximity to one another in the network are semantically disambiguated.
*   Second, WordNet labels the semantic relations among words, whereas the groupings of words in a thesaurus do not follow any explicit pattern other than meaning similarity.



In [None]:
!pip install nltk



In [None]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
from nltk.corpus import wordnet

In [None]:
# synonym sets (synsets) that contain the word 'image'
synsets_image = wordnet.synsets('image')
print(synsets_image)

[Synset('image.n.01'), Synset('persona.n.02'), Synset('picture.n.01'), Synset('prototype.n.01'), Synset('trope.n.01'), Synset('double.n.03'), Synset('image.n.07'), Synset('image.n.08'), Synset('effigy.n.01'), Synset('image.v.01'), Synset('visualize.v.01')]


In [None]:
# definition of the third noun sense of 'image', listed above as Synset('picture.n.01')
synset_image = wordnet.synset('image.n.03')
print(synset_image.definition())

a visual representation (of an object or scene or person or abstraction) produced on a surface


In [None]:
# synonym sets (synsets) that contain the word 'painting'
synsets_painting = wordnet.synsets('painting')
print(synsets_painting)

[Synset('painting.n.01'), Synset('painting.n.02'), Synset('painting.n.03'), Synset('painting.n.04'), Synset('paint.v.01'), Synset('paint.v.02'), Synset('paint.v.03'), Synset('paint.v.04')]


In [None]:
# definition
synset_painting = wordnet.synset('painting.n.01')
print(synset_painting.definition())

graphic art consisting of an artistic composition made by applying paints to a surface


In [None]:
# all lemmas (word forms) that belong to the synset
lemmas = synset_image.lemmas()
for lemma in lemmas:
    print(lemma.name())


picture
image
icon
ikon


In [None]:
# hypernyms
hypernyms = synset_image.hypernyms()
print(hypernyms)

[Synset('representation.n.02')]


In [None]:
# hypernyms
hypernyms = synset_painting.hypernyms()
print(hypernyms)

[Synset('graphic_art.n.01')]


In [None]:
# 'entity.n.01' has no hypernyms, so the result is an empty list
hypernyms = wordnet.synset('entity.n.01').hypernyms()
print(hypernyms)

[]


In [None]:
# hyponyms
hyponyms = synset_image.hyponyms()
print(hyponyms)

[Synset('bitmap.n.01'), Synset('chiaroscuro.n.01'), Synset('collage.n.01'), Synset('foil.n.04'), Synset('graphic.n.01'), Synset('iconography.n.01'), Synset('inset.n.01'), Synset('likeness.n.02'), Synset('panorama.n.02'), Synset('reflection.n.05'), Synset('scan.n.02'), Synset('sonogram.n.01')]


In [None]:
# hyponyms
hyponyms = synset_painting.hyponyms()
print(hyponyms)

[Synset('abstraction.n.04'), Synset('cityscape.n.02'), Synset('daub.n.03'), Synset('distemper.n.04'), Synset('finger-painting.n.01'), Synset('icon.n.03'), Synset('landscape.n.02'), Synset('miniature.n.01'), Synset('monochrome.n.01'), Synset('mural.n.01'), Synset('nude.n.01'), Synset('oil_painting.n.01'), Synset('pentimento.n.01'), Synset('sand_painting.n.01'), Synset('seascape.n.02'), Synset('semi-abstraction.n.01'), Synset('still_life.n.01'), Synset('tanka.n.02'), Synset('trompe_l'oeil.n.01'), Synset('watercolor.n.01')]
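
The hypernym links printed above form a hierarchy that can be walked upward from any synset. The sketch below illustrates that idea in plain Python on a hand-written parent map. Only the links `picture.n.01` to `representation.n.02` and `painting.n.01` to `graphic_art.n.01` come from the outputs above; every other link in `TOY_HYPERNYMS` is an illustrative assumption, not verified WordNet data. NLTK offers comparable methods on real synsets (`hypernym_paths()`, `lowest_common_hypernyms()`, `path_similarity()`).

```python
# Toy hypernym map: child -> parent.
# Links marked "assumed" are illustrative only and not taken from WordNet.
TOY_HYPERNYMS = {
    "picture.n.01": "representation.n.02",   # from the notebook output
    "painting.n.01": "graphic_art.n.01",     # from the notebook output
    "representation.n.02": "creation.n.02",  # assumed
    "graphic_art.n.01": "art.n.01",          # assumed
    "art.n.01": "creation.n.02",             # assumed
    "creation.n.02": "entity.n.01",          # assumed, abbreviated chain
}


def path_to_root(name, parents=TOY_HYPERNYMS):
    """Return the chain [name, parent, grandparent, ...] up to the root."""
    chain = [name]
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return chain


def lowest_common_hypernym(a, b, parents=TOY_HYPERNYMS):
    """Return the first node on a's chain that also lies on b's chain."""
    chain_b = set(path_to_root(b, parents))
    for node in path_to_root(a, parents):
        if node in chain_b:
            return node
    return None


def path_similarity(a, b, parents=TOY_HYPERNYMS):
    """Return 1 / (1 + number of edges between a and b via their common hypernym)."""
    common = lowest_common_hypernym(a, b, parents)
    if common is None:
        return None
    distance = (path_to_root(a, parents).index(common)
                + path_to_root(b, parents).index(common))
    return 1.0 / (1.0 + distance)


print(path_to_root("painting.n.01"))
print(lowest_common_hypernym("picture.n.01", "painting.n.01"))
print(path_similarity("picture.n.01", "painting.n.01"))
```

A score of 1.0 means the two synsets are identical, and the score shrinks as the path through the shared hypernym gets longer.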


## GloVe

#### https://nlp.stanford.edu/pubs/glove.pdf
#### http://nlp.stanford.edu/projects/glove/
#### GloVe is a global log-bilinear regression model for the unsupervised learning of word representations. Its authors report that it outperforms other models on word analogy, word similarity, and named entity recognition tasks.

In [None]:
from gensim.models import KeyedVectors

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip

--2024-04-25 14:45:26--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2024-04-25 14:45:26--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2024-04-25 14:45:26--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’



In [None]:
!unzip glove.6B.zip

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


In [None]:
# load the 300-dimensional GloVe vectors (plain text file without a word2vec header line)
glove_model = KeyedVectors.load_word2vec_format('glove.6B.300d.txt', binary=False, no_header=True)

In [None]:
# Get the vector representation of a word
vector = glove_model['image']
print(vector)

[ 0.0065397  0.12888   -0.12518   -0.48984   -0.15711   -0.21177
 -0.24202    0.58976    0.36224   -1.8508     0.63151    0.085513
  0.54285   -0.085016  -0.37762   -0.069195   0.29284   -0.4923
 -0.32498   -0.78226   -0.41842    0.46761    0.7178     0.077404
 -0.18717   -0.27108   -0.41315   -0.18285    0.7223     0.88308
  0.17942    0.087     -0.20642    0.37093   -0.78857    0.65427
  0.16744   -0.56659    0.055369  -0.071142   0.31651    0.1864
  0.24052   -0.12697   -0.44088   -0.37085   -0.078073  -0.49468
  0.11863   -0.20714    0.15563   -0.3569     0.71721   -0.13097
  0.28312    0.0065318 -0.2072     0.24517    0.064092  -0.013316
 -0.024133   0.25595    0.071349  -0.13245    0.33029    0.38863
 -0.31963   -0.60819    0.76155    0.30382   -0.13193   -0.18959
  0.37761   -0.023783   0.26459   -0.7078     0.08579    0.017591
 -0.038224   0.12083   -1.0479    -0.015986  -0.45525   -0.40214
 -0.070824   0.3338    -0.50845    0.043117   0.081949  -0.091315
 -0.64257    0.10965  

In [None]:
# Calculate the cosine similarity between two word vectors
similarity = glove_model.similarity('image', 'picture')
print(similarity)

0.5823847


In [None]:
# Find the words that are most similar to the given word
similar_words = glove_model.most_similar('image')
print(similar_words)

[('images', 0.6631151437759399), ('picture', 0.5823846459388733), ('reputation', 0.5361216068267822), ('tarnished', 0.5312350988388062), ('photograph', 0.5112592577934265), ('photo', 0.49377116560935974), ('perception', 0.48240020871162415), ('look', 0.47880640625953674), ('color', 0.46893924474716187), ('pictures', 0.4675300419330597)]


In [None]:
# Solve a word analogy (e.g. king - man + woman ≈ queen)
analogy = glove_model.most_similar(positive=['king', 'woman'], negative=['man'])
print(analogy)

[('queen', 0.6713277101516724), ('princess', 0.5432624816894531), ('throne', 0.5386103987693787), ('monarch', 0.5347574949264526), ('daughter', 0.49802514910697937), ('mother', 0.49564430117607117), ('elizabeth', 0.4832652509212494), ('kingdom', 0.47747090458869934), ('prince', 0.4668239951133728), ('wife', 0.46473270654678345)]
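
`similarity` returns the cosine similarity of two word vectors. An analogy query such as king - man + woman can be scored by adding and subtracting unit-normalized vectors and ranking the remaining words by their cosine similarity to the result. The block below is a minimal pure-Python sketch of that idea, not gensim's implementation. `TOY_VECTORS` and its four hand-picked dimensions are hypothetical and only stand in for the 300-dimensional GloVe vectors.

```python
import math

# Hypothetical 4-dimensional vectors: [royalty, male, female, fruit].
TOY_VECTORS = {
    "king":   [1.0, 1.0, 0.0, 0.0],
    "queen":  [1.0, 0.0, 1.0, 0.0],
    "man":    [0.0, 1.0, 0.0, 0.0],
    "woman":  [0.0, 0.0, 1.0, 0.0],
    "prince": [0.8, 1.0, 0.0, 0.0],
    "apple":  [0.0, 0.0, 0.0, 1.0],
}


def unit(vector):
    """Scale a non-zero vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine(u, v):
    """Cosine similarity: dot product of the two unit vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))


def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    # The query words themselves are excluded from the candidates.
    exclude = set(positive) | set(negative)
    scored = [(word, cosine(query, vec))
              for word, vec in vectors.items() if word not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


print(cosine(TOY_VECTORS["king"], TOY_VECTORS["queen"]))
print(analogy(TOY_VECTORS, positive=["king", "woman"], negative=["man"]))
```

With these toy vectors the top-ranked word is `queen`, mirroring the result obtained from the real model.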


## Apply


1.   Get data types from a paragraph
2.   Classify?

     (1) Check whether the data type already exists. If it does, record the link to the dataset; if not, go to (2)

     (2) Semantic classification: first put the new data type into a broad genre (image, audio, text, etc.)

     (3) Calculate the word relations of this new data type
3.   
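
The classification steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a finished design: the registry layout, the `GENRES` list, the function name `classify_data_type`, the dataset name `ExampleDataset`, and the toy vectors are all hypothetical. In the notebook, the `similarity` argument could plausibly be `glove_model.similarity` instead of the toy function used here.

```python
import math

# Hypothetical broad genres used in step (2).
GENRES = ["image", "audio", "text"]

# Hypothetical toy vectors standing in for real word embeddings.
TOY_VECTORS = {
    "image":    [0.9, 0.1, 0.0],
    "audio":    [0.0, 0.9, 0.1],
    "text":     [0.1, 0.0, 0.9],
    "painting": [0.8, 0.2, 0.1],
    "speech":   [0.1, 0.8, 0.3],
}


def toy_similarity(word_a, word_b):
    """Cosine similarity between two words of TOY_VECTORS."""
    u, v = TOY_VECTORS[word_a], TOY_VECTORS[word_b]
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify_data_type(data_type, dataset, registry, similarity, genres=GENRES):
    """Return a dict describing how the data type was handled."""
    # Step (1): the data type is already known, so only record the dataset link.
    if data_type in registry:
        registry[data_type].append(dataset)
        return {"status": "existing", "genre": None, "relations": {}}
    # Steps (2) and (3): score the new data type against every genre
    # and assign it to the most similar one.
    relations = {genre: similarity(data_type, genre) for genre in genres}
    best_genre = max(relations, key=relations.get)
    registry[data_type] = [dataset]
    return {"status": "new", "genre": best_genre, "relations": relations}


# Hypothetical registry: data type -> datasets linked to it.
registry = {"image": ["WikiArt"]}
print(classify_data_type("image", "ExampleDataset", registry, toy_similarity))
print(classify_data_type("painting", "WikiArt", registry, toy_similarity))
print(registry)
```

Passing the similarity function as an argument keeps the sketch independent of any particular embedding model.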



In [None]:
import spacy

# Load the SpaCy English model
nlp = spacy.load("en_core_web_sm")

def extract_nouns(text):
    # Process the text with SpaCy
    doc = nlp(text)
    # Extract nouns from the processed document
    nouns = [token.text for token in doc if token.pos_ == "NOUN"]
    return nouns

In [None]:
# Example paragraph
paragraph = "WikiArt contains painting from 195 different artists. The dataset has 42129 images for training and 10628 images for testing."

# Extract nouns from the paragraph
nouns = extract_nouns(paragraph)
print("Nouns in the paragraph:", nouns)

Nouns in the paragraph: ['artists', 'dataset', 'images', 'training', 'images', 'testing']


In [None]:
def extract_data_types(description):
    # Process the description with SpaCy
    doc = nlp(description)

    # Initialize a set to store unique data types
    data_types = set()

    # Iterate over tokens in the processed document
    for token in doc:
        # Check if the token is a noun and add it to the set of data types
        if token.pos_ == "NOUN":
            data_types.add(token.text.lower())  # Convert to lowercase for consistency

    return data_types

# Example dataset description
description = "WikiArt contains painting from 195 different artists. The dataset has 42129 images for training and 10628 images for testing."

# Extract data types from the description
data_types = extract_data_types(description)
print("Data types in the description:", data_types)

Data types in the description: {'images', 'training', 'dataset', 'artists', 'testing'}
