# Part 1.3: Text Vectorization

## Data Loading:


In [None]:
!pip install pandas scikit-learn nltk 

In [None]:
!pip install langdetect

In [None]:
!pip install seaborn matplotlib

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus import stopwords
import re
import nltk


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

## Data Exploration:
a. How many different attribute values do you observe in each feature? (e.g. how many
subreddits are there?) Is there any missing or duplicated data? (Referring to textual
features)
b. How does the empirical distribution of the number of characters in each comment look
like? How is the distribution of the number of comments per author? Is the supervised
dataset balanced between male and female? Are there only comments in English? Hint: use
the library langdetect.

In [None]:

# load the supervised data
df = pd.read_csv('../data/data_supervised.csv')
#print(df.head(3))

print(f"Dataframe size: {df.shape}")

# count distinct authors
n_distinct_authors = df['author'].nunique()
print(f"There are {n_distinct_authors} distinct authors")

# count distinct subreddits (topics)
n_distinct_subreddit = df['subreddit'].nunique()
print(f"There are {n_distinct_subreddit} distinct subreddits")

# count distinct bodies (comments)
n_distinct_body = df['body'].nunique()
print(f"There are {n_distinct_body} distinct comment bodies")

# check missing values
print(df.isna().sum())

# check empty strings
print("Empty bodies: ", (df['body'].str.strip() == "").sum())

# check fully duplicated rows
n_duplicate = df.duplicated().sum()
print(f"Duplicate rows: {n_duplicate}")

# count rows that repeat an earlier (author, body) pair
print(df.duplicated(subset=['author', 'body']).sum())


In [None]:
# collect the number of character for each comment
df['char_len'] = df['body'].str.len()

# describing the empirical distribution
print(df['char_len'].describe())

# plot the histogram
sns.histplot(df['char_len'], bins=50)
plt.title("Distribution of Comment Length (characters)")
plt.xlim(0, 14271)  # hard-coded upper bound for the x axis (characters)
plt.show()



In [None]:
# comments per author
comments_per_author = df['author'].value_counts()
print(comments_per_author.describe())

# plotting (x axis on a log scale); labels are set after the plot is drawn
sns.histplot(comments_per_author, bins=50, log_scale=True)
plt.xlabel("Comments")
plt.ylabel("Users")
plt.title("Distribution of Comments per Author (log scale)")
plt.show()

In [None]:
# read the supervised targets
target = pd.read_csv('../data/target_supervised.csv')

# absolute counts and proportions per gender
# (printed explicitly: only the last expression of a cell is displayed)
print(target['gender'].value_counts())
print(target['gender'].value_counts(normalize=True))

sns.countplot(data=target, x='gender')
plt.title("Gender Distribution")
plt.show()

In [None]:
!pip install fasttext

# Data Cleaning and Text Standardization.

a. Uniform text formats (e.g., case normalization, Hint: standardize the letters in lower case).
If necessary, clean the comment text (e.g. URLs, subreddit refs, …).

b. Stop words are not contributing much to our ML tasks, such as "the", "a", since they carry
very little information. Take care of these kinds of words.

c. Reduce words to their base or root form using Stemming/Lemmatization. This helps in
reducing inflected words to a common base form. (Hint: Consider using libraries like NLTK
or spaCy for tokenization).


In [None]:
!pip install spacy

!python -m spacy download en_core_web_sm

In [None]:
# import needed python libraries

%matplotlib inline
from tqdm import tqdm
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

import html
import spacy
nlp = spacy.load("en_core_web_sm", disable=["parser","ner","textcat"])
from langdetect import detect

In [None]:
df_supervised   = pd.read_csv("../data/data_supervised.csv")
df_unsupervised = pd.read_csv("../data/data_unsupervised.csv")
df_target       = pd.read_csv("../data/target_supervised.csv")

print(df_supervised.shape, df_unsupervised.shape, df_target.shape)

Uniform text formats (e.g., case normalization, Hint: standardize the letters in lower case). If necessary, clean the comment text (e.g. URLs, subreddit refs, …).
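
As a rough illustration of what these cleaning steps do to a single comment, here is a minimal standard-library sketch. The sample string and the function name `normalize_comment` are hypothetical. The `(?<!\w)` guard in the pattern is an assumption of this sketch, added so that text such as "and/or/else" is not treated as a subreddit or user reference.

```python
import html
import re

# Assumed pattern for this sketch: URLs first, then r/... and u/... references.
# The (?<!\w) lookbehind prevents matches inside ordinary words ("and/or/else").
REMOVE_PATTERN = re.compile(r"https?://\S+|www\.\S+|(?<!\w)/?[ru]/\w+")


def normalize_comment(text):
    """Lowercase, decode HTML entities, drop URLs and r/ or u/ references."""
    text = html.unescape(text.lower())          # case normalization, then &amp; -> &
    text = REMOVE_PATTERN.sub(" ", text)        # replace each match with a space
    return re.sub(r"\s+", " ", text).strip()    # collapse and trim whitespace


sample = "Check r/Python &amp; https://example.com/docs NOW,   thanks u/some_user!"
print(normalize_comment(sample))
```

Punctuation is deliberately left in place here; in this notebook it is filtered later, during tokenization.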



In [None]:
# URLs, plus subreddit (r/...) and user (u/...) references; the (?<!\w) guard
# keeps text such as "and/or/else" from being matched as a reference
remove_pattern = r'https?://\S+|www\.\S+|(?<!\w)/?[ru]/\w+'

df_supervised['body_normalized'] = (
    df_supervised['body']
    .fillna('')                                     # Handle NaN values
    .astype(str)                                    # Ensure string format
    .str.lower()                                    # Case normalization (point a.)
    .apply(html.unescape)                           # Decode HTML entities (e.g. &amp; -> &)
    .str.replace(remove_pattern, ' ', regex=True)   # Remove URLs, r/, u/
    .str.replace(r'\s+', ' ', regex=True)           # Collapse repeated whitespace
    .str.strip()                                    # Trim leading/trailing whitespace
)

df_unsupervised['body_normalized'] = (
    df_unsupervised['body']
    .fillna('')
    .astype(str)
    .str.lower()
    .apply(html.unescape)
    .str.replace(remove_pattern, ' ', regex=True)
    .str.replace(r'\s+', ' ', regex=True)
    .str.strip()
)


In [None]:
# Sanity check: compare the raw and normalized text side by side
df_supervised[["body", 'body_normalized']].head()

b. Stop words are not contributing much to our ML tasks, such as "the", "a", since they carry very little information. Take care of these kinds of words.

c. Reduce words to their base or root form using Stemming/Lemmatization. This helps in reducing inflected words to a common base form. (Hint: Consider using libraries like NLTK or spaCy for tokenization).
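
To make point (b) concrete, here is a minimal sketch of stop-word filtering using only the standard library. The stop list below is a tiny hypothetical sample, not the list spaCy or NLTK ships, and the function name is illustrative. Lemmatization (point c) is not reproduced here because it needs a language model such as the spaCy pipeline used in the next cell.

```python
import string

# Tiny illustrative stop list (an assumption of this sketch);
# real pipelines use the much larger lists from spaCy or NLTK.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "it"}


def remove_stop_words(text):
    """Split on whitespace, strip surrounding punctuation, drop stop words."""
    tokens = [t.strip(string.punctuation) for t in text.lower().split()]
    return [t for t in tokens if t and t not in STOP_WORDS]


print(remove_stop_words("The cat is sitting in the garden."))
```

With a lemmatizer, "sitting" would additionally be reduced to "sit"; that step is what `token.lemma_` provides in the cell below.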

In [None]:
def process_text_full(text_series, batch_size=2000):
    clean_texts = []

    total_docs = len(text_series)

    # tqdm shows a progress bar
    for doc in tqdm(nlp.pipe(text_series, batch_size=batch_size), total=total_docs, desc="Processing"):

        tokens = []
        for token in doc:
            # 1. Filter out stop words and punctuation (b)
            if not token.is_stop and not token.is_punct and not token.is_space:
                # 2. Take the lemma using spaCy (c)
                tokens.append(token.lemma_)

        clean_texts.append(" ".join(tokens))

    return clean_texts

print("Processing the SUPERVISED dataset (smaller)...")
df_supervised['body_clean'] = process_text_full(df_supervised['body_normalized'].astype(str))

df_supervised.to_csv("./clean_supervised.csv", index=False)

print("Processing the UNSUPERVISED dataset (bigger)...")
df_unsupervised['body_clean'] = process_text_full(df_unsupervised['body_normalized'].astype(str))

df_unsupervised.to_csv("./clean_unsupervised.csv", index=False)


# 1.3 Text Vectorization.

A. Only for the supervised task (data_supervised.csv): Group and join all comments of the
same author, creating a “new” dataset to be used for the supervised task (Section 2).

B: As ML algorithms struggle to handle directly the raw textual data. You are required to
convert the text into numerical representations (vectors) through Bag of Words (BoW).

C: Another way to assign a vector representation to a word is to associate the TF-IDF
representation (Term Frequency-Inverse Document Frequency) to each user/comment.
Can you observe and explain the differences between the numerical representations
generated by BoW and TF-IDF?


A wrap-up section at the very end of the notebook describes which files this notebook creates and where. To change the input file, see the very first cell of the notebook. This part was originally meant only for the supervised dataset, as described by the requirements.

For problems contact Matteo Sottocornola on Telegram.

## Part 1 of 1.3

Only for the supervised task (data_supervised.csv): Group and join all comments of the
same author, creating a “new” dataset to be used for the supervised task (Section 2).
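
The grouping step can be pictured with a small standard-library sketch. The rows and the function name `join_by_author` are hypothetical; the cells below do the same thing on the real data with pandas `groupby`.

```python
from collections import defaultdict

# Hypothetical (author, cleaned comment) rows; None stands for a missing body.
rows = [
    ("alice", "first comment"),
    ("bob", "hello there"),
    ("alice", "second comment"),
    ("bob", None),
]


def join_by_author(rows):
    """Collect every non-missing comment per author and join them with spaces."""
    grouped = defaultdict(list)
    for author, body in rows:
        if body:  # skip missing or empty bodies, like dropna() does below
            grouped[author].append(body)
    return {author: " ".join(bodies) for author, bodies in grouped.items()}


documents = join_by_author(rows)
print(documents)
```

Each author then corresponds to exactly one "document", which is the unit the BoW and TF-IDF vectors are computed on.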

In [None]:
# TODO: decide whether to also add aggregated subreddit/created_utc

# import as pandas dataframe.
import pandas as pd
df = pd.read_csv("./clean_supervised.csv") #In principle to be used only on clean_supervised.
print(df.shape)

In [None]:
#Drop the two unneeded columns inside clean_supervised.
#Rename the cleaned body to just body because I prefer it that way.
df = df.drop(columns=['body','body_normalized'])
df = df.rename(columns={'body_clean':'body'})

In [None]:
#Not sure whether these should also be kept/concatenated like body, so drop them for now.
df_text_only = df.drop(columns=['created_utc','subreddit'])
print(df_text_only)

In [None]:
print(df_text_only.shape)
df_text_only = df_text_only.dropna(subset=['body']) #we lose about 6000 of 296,000 posts.
print(df_text_only.shape)


In [None]:
df_grouped = df_text_only.groupby('author')['body'].apply(" ".join).reset_index()
print(df_grouped)

In [None]:
#Quick sanity check.
i = 0  # select a row index, and therefore a user.
user = df.iloc[i]['author']
print("user: ", user, " posted this: ", df.iloc[i]['body'])

# number of comments by this user in the ungrouped data
print(df.groupby("author").size().loc[user])

# the joined text for the same user in the grouped data
row = df_grouped[df_grouped["author"] == user]
print(row.iloc[0]['body'])

## Part 2 of 1.3
B: As ML algorithms struggle to handle directly the raw textual data. You are required to
convert the text into numerical representations (vectors) through Bag of Words (BoW).

Bag of Words (BoW) is a widely used technique for transforming text into a machine-readable numerical format, without considering grammar or word order.

We count the occurrences of every word in the vocabulary. Where the word appeared and its surrounding structure are lost. In effect, we add one column for each word in the dataset and store, for each row, the number of times that word was used.

Note that running 1.2 first to remove stop words is strongly recommended, since it reduces the number of words and hence the number of features BoW produces.
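
A tiny worked example may help before running this on the full corpus. The two documents and the helper name `bow_vector` are made up for illustration; the counting logic is the same idea as in the cells below, written with `collections.Counter`.

```python
from collections import Counter

# Two hypothetical per-author documents (already cleaned and lowercased).
docs = {
    "user_a": "read the faq read it",
    "user_b": "the faq helps",
}

tokenized = {user: text.split() for user, text in docs.items()}

# The vocabulary is the sorted set of all distinct tokens: one column per word.
vocabulary = sorted({word for tokens in tokenized.values() for word in tokens})


def bow_vector(tokens, vocabulary):
    """Return the count of each vocabulary word in `tokens`, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


bow = {user: bow_vector(tokens, vocabulary) for user, tokens in tokenized.items()}
print(vocabulary)
for user, vector in bow.items():
    print(user, vector)
```

Note how "read" gets a count of 2 for `user_a`, while the order in which the words appeared is no longer recoverable from the vector.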

In [None]:
#useful example of BoW
#https://www.datacamp.com/tutorial/python-bag-of-words-model?dc_referrer=https%3A%2F%2Fwww.google.com%2F

from collections import defaultdict
import string

#df_grouped = df_grouped[:400] #remove after 1.2 available, done to reduce complexity for now.

# Function to preprocess and tokenize
def preprocess(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Tokenize: split the text into words
    tokens = text.split()
    return tokens

# Apply preprocessing to each text individually first.
processed_corpus = []
for sentence in df_grouped.iloc[:, 1]:
    processed_corpus.append(preprocess(sentence))

print(processed_corpus)

In [None]:
#Take the individual tokenized texts and create a single vocabulary.

vocabulary = set()

# Build the vocabulary
for sentence in processed_corpus:
    vocabulary.update(sentence)

# Convert to a sorted list
vocabulary = sorted(list(vocabulary))
print("Vocabulary:", vocabulary)
#If it doesn't look right, that is because numbers are tokenized too and sorted as strings;
#scroll far enough and you will find the actual words.

print(len(vocabulary))  #How many unique words, and therefore how many features we end up adding.
#19186 the first time, with only the first 100: too many. The preprocessing of 1.1 and 1.2 is needed to reduce this.
#I'd also consider removing numbers, which are being treated as words.
#Could also preprocess by rounding numbers to the nearest multiple of ten to reduce the unique values.

In [None]:
word2idx = {w:i for i,w in enumerate(vocabulary)} #Create a dictionary so future lookups are O(1)

def create_bow_vector(sentence, vocab):
    vector = [0] * len(vocab)  # Initialize a vector of zeros
    for word in sentence:
        idx = word2idx.get(word)  # Find the index of the word in the vocabulary using dictionary.
        if idx is not None:
            vector[idx] += 1  # Increment the count at that index
    return vector


# Create BoW vector for each sentence in the processed corpus
bow_vectors = [create_bow_vector(sentence, vocabulary) for sentence in processed_corpus]
print("Bag of Words Vectors:")
for vector in bow_vectors[:100]:
    print(vector)

#Slow because there are too many words in the vocabulary.

In [None]:
#DEBUG
#Check whether it is plausible that this user used the same word that many times.
row = df_grouped[df_grouped["author"] == df_grouped.iloc[17,0]]
print(row.iloc[0,1])

In [None]:
#drop no longer needed body.
df_grouped_bow = df_grouped.drop(columns = ['body'])

#Add the BoW instead.
df_grouped_bow["bow"] = bow_vectors

#Print a few to see
print(df_grouped_bow[:10])

## Part 3 of 1.3 - TF-IDF

Another way to assign a vector representation to a word is to associate the TF-IDF
representation (Term Frequency-Inverse Document Frequency) to each user/comment.
Can you observe and explain the differences between the numerical representations
generated by BoW and TF-IDF?

For TF-IDF we first build the vocabulary of all distinct words; then, for each word, we perform the following calculation, which requires computing two quantities first:

![Fig1.png](Fig1.png)


In practice, for every word A and each user we compute two metrics. One is the log of the inverse of the fraction of users who used word A (the IDF). The other is the fraction of that specific user's words that are A (the TF). The TF-IDF value is the product of the two.
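
Written out, and matching what the code cells in this part compute (natural logarithm, no smoothing), the two quantities and their product are shown below. The symbols are notation introduced here and may differ from the figure: $n_{A,u}$ is the number of times user $u$ used word A, $N$ is the number of users, and $\mathrm{df}(A)$ is the number of users who used A at least once.

$$
\mathrm{tf}(A, u) = \frac{n_{A,u}}{\sum_{w} n_{w,u}}, \qquad
\mathrm{idf}(A) = \log\frac{N}{\mathrm{df}(A)}, \qquad
\text{tf-idf}(A, u) = \mathrm{tf}(A, u) \cdot \mathrm{idf}(A)
$$

A word used by every user has $\mathrm{idf} = \log 1 = 0$, so its TF-IDF is 0 no matter how often a single user repeats it.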

It is meant to measure how important each word is to a given text, while discounting words that are simply common everywhere rather than characteristic of this specific text.
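
A small numeric example shows the discounting effect. The three-document corpus and the function name `tf_idf` are hypothetical; the formula is the unsmoothed one used in this notebook (term count divided by document length, times the natural log of N over document frequency).

```python
import math
from collections import Counter

# Hypothetical tokenized corpus: one token list per user.
corpus = [
    ["read", "the", "faq"],
    ["the", "faq", "helps"],
    ["the", "weather"],
]
N = len(corpus)

# Document frequency: in how many documents each word appears at least once.
doc_freq = Counter(word for doc in corpus for word in set(doc))


def tf_idf(doc):
    """Return {word: tf * idf} for the words that occur in `doc`."""
    counts = Counter(doc)
    return {
        word: (count / len(doc)) * math.log(N / doc_freq[word])
        for word, count in counts.items()
    }


weights = tf_idf(corpus[0])
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {weight:.4f}")
```

"the" appears in every document, so its weight is exactly 0; "read" appears in only one document and gets the largest weight, even though all three words occur once in the first document.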

In [None]:
#We reuse the vocabulary computed for the previous section so execute that first.
print(vocabulary)

In [None]:
from collections import Counter
import math

# Count in how many documents each word appears
df_counter = Counter()

for sentence in processed_corpus:
    unique_words = set(sentence)     # ensure each word counted only once per doc
    df_counter.update(unique_words)

N = len(processed_corpus)            # number of documents

idf = [math.log(N / df_counter.get(word, 1)) for word in vocabulary]


In [None]:
print(type(idf))
print(len(idf))
print(len(vocabulary))

In [None]:
#Compute the idf factor for each word in the vocabulary.
#idf = []
#import math

#for word in vocabulary:
#    count = 0
#    for sentence in processed_corpus:
#       if word in sentence:
#           count += 1
#    count = (df_grouped.shape[0]) / (count) #Adding 1 is necessary to avoid possible divisions by zero.
##    count = math.log(count) 
 #   idf.append(count)

#print(idf)

In [None]:
#for i in range(3000):
#    print(i, vocabulary[i], " ", idf[i])

#THE LESS COMMON WORDS HAVE HIGHER VALUES.
#print("common word: ", vocabulary[2996], " idf: ", idf[2996])
#print("rare word: ", vocabulary[2781], " idf: ", idf[2781])
#print("Words used by many different users have lower values; how often a single user used them has no effect.")

In [None]:
#Now compute the second metric, TF, and directly the TF-IDF.

def create_TF_IDF_vector(sentence, vocab, idf):
    vector = [0] * len(vocab)  # If a word is not present in the user's comments, its TF-IDF is 0.
    for word in sentence:
        idx = word2idx.get(word)  # Find the index of the word in the vocabulary
        if idx is not None:
            vector[idx] += 1  # Increment the count at that index

    n_tokens = len(sentence) or 1  # guard against empty documents (avoids division by zero)
    for i in range(len(vocab)):
        # TF (count / document length) times IDF
        vector[i] = (vector[i] / n_tokens) * idf[i]

    return vector

In [None]:
# Create TF-IDF vector for each sentence in the processed corpus
TFIDF_vectors = [create_TF_IDF_vector(sentence, vocabulary, idf) for sentence in processed_corpus]
print("TF-IDF Vectors:")
for vector in TFIDF_vectors[:100]:
    print(vector)

In [None]:
#Debug - checks
for sentence in processed_corpus:
    print(len(sentence))

In [None]:
#Debug - checks
ind = 0
for el in TFIDF_vectors[1]:
    if el != 0.0:
        print(el, ind)
    ind = ind + 1

In [None]:
#Debug - checks
print(idf[8513]*0.5, idf[18680]*0.5)

In [None]:
for i in range(20, 30):  # start at 20 just to avoid always looking at the same rows.
    print("index: ", i, "TF-IDF: ",TFIDF_vectors[i])
    print("index: ", i, "bow_words: ",bow_vectors[i])
    print("\n")


In [None]:
# Non-zero TF-IDF entries for the user at index 1
for ind, el in enumerate(TFIDF_vectors[1]):
    if el != 0.0:
        print("index: ", ind, " TF-IDF: ", el, " for word: ", vocabulary[ind])

In [None]:
# Non-zero BoW counts for the same user (index 1)
for ind, el in enumerate(bow_vectors[1]):
    if el != 0:
        print("index: ", ind, " BoW count: ", el, " for word: ", vocabulary[ind])

The main difference shows up clearly in the output of the previous two cells for the user at index 1. That user only ever typed five words, presumably in a single comment, which from the words appears to have been 'really just read the faq'.

With BoW each word is simply replaced by its count, so the vector is 1 for these five words and 0 for every other word in the vocabulary.

TF-IDF, on the other hand, does a more involved computation: for 'the' it considers the count in the sentence (1) but also how common the word is across all users. Since 'the' is very common, it gets a much smaller value than the others, which signals that it probably carries little significance.

Which representation is better for our task is debatable. For instance, if there were a hypothetical word used predominantly by male redditors, TF-IDF would erroneously assign it a small value, since a large share of users would have used it. On the other hand, such a magical classifying word is unlikely to exist, and intuitively it is preferable to give less weight to overly common words that are unlikely to carry much significance.

The natural solution is to try our models with both and check which performs better.

# Wrap up - Writing the datasets

At this point the two representations need to be saved under names that distinguish them at a glance while recording that they are the output of 1.3.

The file names are
supervised-1.3-BoW
supervised-1.3-TF-IDF

I am not sure whether the vocabulary is also needed in later sections, so I also create
1.3-vocab.csv

which I will save to the data folder but also add to the .gitignore.
If you need the files, re-execute 1.3 locally.
Remember that you can change which file is used as input at the top.

For questions or problems, contact Matteo Sottocornola, who did this part.
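
The cells below write one row per author with no header row, so a reader of the CSV has to load `1.3-vocab.csv` separately to know which column is which word. As a possible alternative, here is a minimal sketch that writes a header made of `author` plus the vocabulary. The helper name `write_matrix` and the sample data are hypothetical, and an in-memory buffer stands in for the real file.

```python
import csv
import io


def write_matrix(handle, authors, vectors, vocabulary):
    """Write a header row, then one row per author: name followed by the vector."""
    writer = csv.writer(handle)
    writer.writerow(["author"] + list(vocabulary))
    for author, vector in zip(authors, vectors):
        writer.writerow([author] + list(vector))


# Hypothetical data; in the notebook these would be usernames, bow_vectors, vocabulary.
buffer = io.StringIO()
write_matrix(buffer, ["user_a", "user_b"], [[1, 0, 2], [0, 1, 0]], ["faq", "helps", "read"])
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a file opened with `open(path, "w", newline="")` instead of the buffer would write the same content to disk.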

In [None]:
#print(vocabulary)
vocab_df = pd.DataFrame(vocabulary, columns=["word"])
#print(vocab_df)   #emoji in the vocabulary are expected; they are present in the original text.

In [None]:
print(df_grouped)

In [None]:
df_grouped = df_grouped.drop(columns=['body'])
usernames = df_grouped["author"].reset_index(drop=True)

In [None]:
print(len(TFIDF_vectors))
print(type(TFIDF_vectors))
print(type(usernames))
print(usernames.shape)

In [None]:
tfidf_usernames = [
    [usernames.iloc[i]] + TFIDF_vectors[i]
    for i in range(len(TFIDF_vectors))
]

In [None]:
#tfidf_df = pd.DataFrame(TFIDF_vectors, columns=vocabulary)
#bow_df = pd.DataFrame(bow_vectors, columns=vocabulary)
#tfidf_df = pd.merge(usernames, tfidf_df, left_index=True, right_index=True)
#bow_df = pd.merge(usernames, bow_df, left_index=True, right_index=True)

In [None]:
print(len(tfidf_usernames))

for row in tfidf_usernames[:5]:
    print(row)

In [None]:
vocab_df.to_csv("./1.3-vocab.csv", index=False)



In [None]:
import csv

with open("supervised-1.3-TF-IDF.csv","w",newline="") as f:
    writer = csv.writer(f)
    writer.writerows(tfidf_usernames)

In [None]:
bow_usernames = [
    [usernames.iloc[i]] + bow_vectors[i]
    for i in range(len(bow_vectors))
]

In [None]:
with open("supervised-1.3-BoW.csv","w",newline="") as f:
    writer = csv.writer(f)
    writer.writerows(bow_usernames)  # write the BoW rows (not the TF-IDF ones) to the BoW file