Loading the dataset

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd

file_path = '/content/drive/My Drive/data_sety/ready_data_set.csv'
data_set = pd.read_csv(file_path)
print(data_set.head())

   label                                               text  \
0      1  ounce feather bowl hummingbird opec moment ala...   
1      1  wulvob get your medircations online qnb ikud v...   
2      0   computer connection from cnn com wednesday es...   
3      1  university degree obtain a prosperous future m...   
4      0  thanks for all your answers guys i know i shou...   

                                       filtered_text  \
0  ounce feather bowl hummingbird opec moment ala...   
1  wulvob get your medircations online qnb ikud v...   
2   computer connection from cnn com wednesday es...   
3  university degree obtain a prosperous future m...   
4  thanks for all your answers guys i know i shou...   

                                     normalized_text  \
0  ounce feather bowl hummingbird opec moment ala...   
1  wulvob get your medircations online qnb ikud v...   
2   computer connection from cnn com wednesday es...   
3  university degree obtain a prosperous future m...   
4  t

Converting the strings into lists of words

In [3]:
import ast
data_set['lemmatized_text'] = data_set['lemmatized_text'].apply(ast.literal_eval)
print(data_set.head())

   label                                               text  \
0      1  ounce feather bowl hummingbird opec moment ala...   
1      1  wulvob get your medircations online qnb ikud v...   
2      0   computer connection from cnn com wednesday es...   
3      1  university degree obtain a prosperous future m...   
4      0  thanks for all your answers guys i know i shou...   

                                       filtered_text  \
0  ounce feather bowl hummingbird opec moment ala...   
1  wulvob get your medircations online qnb ikud v...   
2   computer connection from cnn com wednesday es...   
3  university degree obtain a prosperous future m...   
4  thanks for all your answers guys i know i shou...   

                                     normalized_text  \
0  ounce feather bowl hummingbird opec moment ala...   
1  wulvob get your medircations online qnb ikud v...   
2   computer connection from cnn com wednesday es...   
3  university degree obtain a prosperous future m...   
4  t
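
The `lemmatized_text` column is stored in the CSV as the string form of a Python list, which is why `ast.literal_eval` is applied to it. A minimal sketch of that conversion on a single made-up value (the sample string is hypothetical, not taken from the dataset):

```python
import ast

# Hypothetical cell value as it would come back from pd.read_csv: a plain string.
raw_cell = "['get', 'your', 'medication', 'online']"

# literal_eval parses Python literals only (lists, strings, numbers, ...),
# so unlike eval() it does not execute arbitrary code.
tokens = ast.literal_eval(raw_cell)

print(type(raw_cell).__name__, type(tokens).__name__)
print(tokens[:2])
```

After the conversion each cell is a real list, so iterating over it yields whole words rather than single characters.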

Downloading the pretrained GloVe vectors (unfortunately there is no dedicated package for this, unlike in the case of word2vec)

In [4]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

--2024-05-13 17:05:17--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2024-05-13 17:05:17--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2024-05-13 17:05:18--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


202

Converting the GloVe format to the Word2Vec format, which is supported by gensim

In [5]:
from gensim.scripts.glove2word2vec import glove2word2vec

glove_input_file = '/content/glove.6B.100d.txt'
word2vec_output_file = '/content/glove.6B.100d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)



(400000, 100)
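
The two text formats differ only in the first line: a word2vec text file starts with a header of the form `<vocabulary size> <vector size>`, which GloVe files lack. The `(400000, 100)` returned by the conversion is that pair. The sketch below reproduces the idea in plain Python on a tiny made-up file. The helper name `add_word2vec_header` and the sample vectors are assumptions for illustration, not gensim's implementation. Gensim 4.x also accepts `no_header=True` in `KeyedVectors.load_word2vec_format`, which reads GloVe text files directly and is what the deprecation note for `glove2word2vec` recommends.

```python
import os
import tempfile

def add_word2vec_header(glove_path, output_path):
    """Copy a GloVe text file, prepending the '<count> <dim>' header line."""
    # Reads the whole file into memory: fine for a toy file, not for 400k vectors.
    with open(glove_path, encoding="utf-8") as src:
        lines = src.readlines()
    num_vectors = len(lines)
    # Each line is: word v1 v2 ... vN, so the dimension is the field count minus 1.
    vector_size = len(lines[0].split()) - 1
    with open(output_path, "w", encoding="utf-8") as dst:
        dst.write(f"{num_vectors} {vector_size}\n")
        dst.writelines(lines)
    return num_vectors, vector_size

# Tiny made-up GloVe-style file: 2 words, 3 dimensions.
tmp_dir = tempfile.mkdtemp()
glove_file = os.path.join(tmp_dir, "toy_glove.txt")
w2v_file = os.path.join(tmp_dir, "toy_glove.word2vec.txt")
with open(glove_file, "w", encoding="utf-8") as f:
    f.write("spam 0.1 0.2 0.3\nham 0.4 0.5 0.6\n")

shape = add_word2vec_header(glove_file, w2v_file)
with open(w2v_file, encoding="utf-8") as f:
    header = f.readline().strip()
print(shape, header)
```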

Loading the converted vectors into Gensim, which gives us a ready-to-use model

In [6]:
from gensim.models.keyedvectors import KeyedVectors

model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)


Saving the model to a folder

In [7]:
model_path = '/content/drive/My Drive/data_sety/word_models/glove_model.model'
model.save(model_path)


Embedding the texts of the entire dataset

In [8]:
import numpy as np

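def document_glove_embedding(doc, model):
    """Return the mean GloVe vector of the words in `doc` (a list of tokens)."""
    # Skip words that are not in the GloVe vocabulary
    vectors = [model[word] for word in doc if word in model]

    # No known words at all: fall back to a zero vector
    if len(vectors) == 0:
        return np.zeros(model.vector_size)

    return np.mean(vectors, axis=0)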

glove_embeddings = np.array([document_glove_embedding(doc, model) for doc in data_set['lemmatized_text']])
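
To make the averaging concrete, here is a small sketch that uses a plain dictionary in place of the GloVe `KeyedVectors` model. The toy vocabulary, the 3-dimensional vectors, and the helper name `mean_embedding` are made up for illustration. The logic mirrors `document_glove_embedding`.

```python
import numpy as np

# Hypothetical stand-in for the GloVe model: word -> vector.
toy_vectors = {
    "cheap": np.array([1.0, 0.0, 2.0]),
    "pills": np.array([3.0, 2.0, 0.0]),
}
VECTOR_SIZE = 3

def mean_embedding(doc, vectors, size):
    """Average the vectors of known words; return zeros if none are known."""
    found = [vectors[word] for word in doc if word in vectors]
    if not found:
        return np.zeros(size)
    return np.mean(found, axis=0)

known = mean_embedding(["cheap", "pills", "wulvob"], toy_vectors, VECTOR_SIZE)
unknown = mean_embedding(["wulvob", "qnb"], toy_vectors, VECTOR_SIZE)
print(known)    # the out-of-vocabulary token "wulvob" is skipped
print(unknown)  # no known words, so the zero-vector fallback is returned
```

Averaging discards word order and gives every known word the same weight, which keeps the representation simple at the cost of some information.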


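Normalizing the embedding vectors to unit Euclidean (L2) norm, so that differences in vector length do not affect the results of the clustering algorithms.

After normalization, each point is scaled so that its distance from the origin equals 1. The exception is a document with no word in the GloVe vocabulary: its embedding is the zero vector, and it remains a zero vector after normalization.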

In [9]:
from sklearn.preprocessing import normalize

normalized_embeddings = normalize(glove_embeddings, norm='l2', axis=1)

glove_embeddings_df = pd.DataFrame(normalized_embeddings)
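
A quick way to see what the normalization does is to compare row norms before and after. The sketch below performs the L2 scaling by hand with NumPy on made-up data instead of calling `sklearn.preprocessing.normalize`. The helper `l2_normalize_rows` is hypothetical. It leaves all-zero rows unchanged, which, as far as we know, is also how scikit-learn treats them.

```python
import numpy as np

def l2_normalize_rows(matrix):
    """Scale each row to unit Euclidean length; all-zero rows stay zero."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    safe_norms = np.where(norms == 0, 1.0, norms)  # avoid division by zero
    return matrix / safe_norms

# Made-up embeddings: two regular documents and one with no known words.
toy_embeddings = np.array([
    [3.0, 4.0],
    [0.5, 0.0],
    [0.0, 0.0],
])

toy_normalized = l2_normalize_rows(toy_embeddings)
row_norms = np.linalg.norm(toy_normalized, axis=1)
print(toy_normalized)
print(row_norms)
```

Rows whose norm is still 0 after this step correspond to documents without any known word, so it may be worth counting them before clustering.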

Saving the dataset ready for clustering

In [10]:
glove_embeddings_df.to_csv('/content/drive/My Drive/data_sety/normalized_glove_embeddings.csv', index=False)
