# II Training Doc2Vec Model & Inferring Comment Vectors

## Table of Contents

1. [Loading the Data and Necessary Libraries](#loading-dependencies)
2. [Preprocess comments](#Preprocessing)
3. [Load or Train Model](#Gensim-Model)
4. [Calculate Vector representations and Save results](#Vectors)


## Loading Data and Libraries 
<a class="anchor" id="loading-dependencies"></a>

In [1]:
import pandas as pd
import gensim
from gensim.models.doc2vec import Doc2Vec
import logging
import os
from tqdm import tqdm
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

df_c = pd.read_parquet('Comments.parquet')
df_c = df_c[['commentID','commentBody']]

## Preprocess comments 
<a class="anchor" id="Preprocessing"></a>

In [18]:
def preprocess(row):
    '''
    -Converts a comment into tokens using Gensim's simple_preprocess 
    -Returns a TaggedDocument containing tokens and the row index
    '''
    index = row.name  
    comment = row['commentBody']
    tokens = gensim.utils.simple_preprocess(comment)
    return gensim.models.doc2vec.TaggedDocument(tokens, [index])

df_c['preprocessed_sentences'] = df_c.apply(preprocess, axis=1)
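
To make the tokenization step more concrete, the following is a rough standard-library approximation of what `simple_preprocess` does with its default settings: lowercase the text, keep letter-only tokens, and drop tokens shorter than 2 or longer than 15 characters. The names `approx_simple_preprocess` and `ToyTaggedDocument` are hypothetical stand-ins and not part of Gensim; the real tokenizer handles accents and edge cases differently.

```python
import re
from collections import namedtuple

# Hypothetical stand-in mirroring the (words, tags) layout of Gensim's TaggedDocument.
ToyTaggedDocument = namedtuple("ToyTaggedDocument", ["words", "tags"])


def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Rough approximation: lowercase, split into letter-only runs, filter by length."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]


example = ToyTaggedDocument(
    approx_simple_preprocess("Hello, World! I love NLP 2023."), [0]
)
print(example.words)  # ['hello', 'world', 'love', 'nlp']
print(example.tags)   # [0]
```

Because the document is a named tuple, `example[0]` and `example.words` refer to the same token list.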

## Load or Train Model 
<a class="anchor" id="Gensim-Model"></a>

The Doc2Vec model is trained on the entire comment corpus of the dataset and stored as an independent model file.
Before training, the code checks whether a model file is already present in the directory; if so, it is loaded instead of training a new model.

In [None]:
tagged_data = df_c['preprocessed_sentences'].values

model_file = "Doc2vec.model"

if os.path.exists(model_file):
    print("Loading existing model...")
    model = Doc2Vec.load(model_file)
else:
    print("Training new model...")
    model = Doc2Vec(tagged_data, vector_size=100, window=5, min_count=15, workers=8, epochs=20)
    model.save(model_file)
    print("Model saved as", model_file)
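
The `min_count=15` argument tells Gensim to ignore words whose total frequency in the corpus is below 15. A minimal standard-library sketch of that kind of frequency filter is shown here; `build_vocab_sketch` and the toy corpus are hypothetical, and the threshold is lowered to 2 so the effect is visible on three short documents.

```python
from collections import Counter


def build_vocab_sketch(token_lists, min_count):
    """Return the set of words occurring at least min_count times across all documents."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return {word for word, n in counts.items() if n >= min_count}


# Toy corpus (made up for illustration only).
toy_corpus = [
    ["great", "article", "thanks"],
    ["great", "point", "thanks"],
    ["disagree", "with", "article"],
]

vocab = build_vocab_sketch(toy_corpus, min_count=2)
print(sorted(vocab))  # ['article', 'great', 'thanks']
```

Words that fall below the threshold are not part of the vocabulary, so they contribute nothing when a comment vector is trained or inferred.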

## Calculate Vector representations and Save results
<a class="anchor" id="Vectors"></a>

In [20]:
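def get_vector(comment_clean):
    '''
    -Infers a vector for the preprocessed comment tokens using the Doc2Vec model.
    '''
    # A TaggedDocument is a (words, tags) pair; index 0 is the token list
    vector = model.infer_vector(comment_clean[0])
    return vector


# Register progress_apply on pandas objects; it is not available until this is called
tqdm.pandas()
df_c['comment_vector'] = df_c['preprocessed_sentences'].progress_apply(get_vector)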
df_c = df_c.drop('preprocessed_sentences', axis=1)
df_c.to_parquet('Comment_embeddings.parquet')