## Analysis of Reddit Dataset

https://nbviewer.org/github/Data-Science-for-Linguists-2023/Text-Based-Age-Recognition/blob/main/notebooks/data_analysis/final/reddit_final.ipynb


In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Data Loading

In [3]:
# Load the dataset
data = pd.read_pickle('../../../data_samples/reddit_samples/all.pkl')
data

Unnamed: 0,text,age
0,What happened to my comment....it was soo good...,te
1,"A shit ton of censorship. And I don't mean ""de...",te
2,Wasn't aware of the drama between /r/askmen an...,te
3,Nice username I too am from Finland,te
4,Your comment was on the [other post]( lol,te
...,...,...
64861,"And, after 10 years of marriage, you can get 5...",th
64862,Yes. Thank you for this response. I don’t view...,th
64863,Better hope that you're contacted before someo...,th
64864,Thank you for this question. I also find mysel...,th


## Model and Vectorier

The data is now more or less in it's final form.

Now let's define the main vectorizer and classifier. We might make changes to these, but for now, we're keeping it all in one place for covenience.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score


def train_classifier(texts, labels, ngrams):
    # create a CountVectorizer with character-based bigrams
    vectorizer = CountVectorizer(analyzer='char', ngram_range=ngrams, lowercase=False, stop_words=None)
    X = vectorizer.fit_transform(texts)
    # train the classifier and return it
    clf = MultinomialNB()
    clf.fit(X, labels)
    return clf, vectorizer

def train_word_classifier(texts, labels):
    # create a CountVectorizer with words
    vectorizer = CountVectorizer(lowercase=False, stop_words=None)
    X = vectorizer.fit_transform(texts)
    # train the classifier and return it
    clf = MultinomialNB()
    clf.fit(X, labels)
    return clf, vectorizer

def predict_age_group(classifier, vectorizer, new_text):
    # take in a classifier as input and return the prediction
    new_X = vectorizer.transform([new_text])
    predicted_age_group = classifier.predict(new_X)
    return predicted_age_group

def evaluate_classifier(classifier, vectorizer, test_texts, test_labels):
    # transform the test data
    X_test = vectorizer.transform(test_texts)
    # predict the age group and return score
    predicted_age_groups = classifier.predict(X_test)
    return accuracy_score(test_labels, predicted_age_groups)


def analyze_features(N, clf, vec):
    feature_names = list(vec.vocabulary_.keys())
    log_prob = clf.feature_log_prob_
    top_N_features = []
    for i in range(clf.classes_.shape[0]):
        top_N_indices = log_prob[i].argsort()[::-1][:N]
        top_N_features.extend([feature_names[idx] for idx in top_N_indices])
    print("Top {} most significant textual features:".format(N))
    print(top_N_features)


## Preprocessing

From the analysis in the `../draft/` folder, we have determined that some preprocessing helps, but not all the way (with lowercase, stemming, lemmatizing, etc.). This also helps to an extent to get the non-english text out, or at least, non-english characters.

In [5]:
import re

def preprocess_texts(texts):
    """
    Cleans the given text by removing non-English characters and punctuation.
    """
    # Keep only English characters and punctuation
    text = [re.sub(r'[^a-zA-Z\s\.,?!]', '', text) for text in texts]
    
    return text

In [6]:
preprocessed = preprocess_texts(list(data['text']))

In [7]:
preprocessed[0:3]

['What happened to my comment....it was soo good....  so ima just repeat myself  U callin yer selfs D O G S eh?',
 'A shit ton of censorship. And I dont mean dem libtards takin muh free spheech kind of censorship. Basically if you disagree with a commenter, and reply stating a counter, your comments will be removed. Or even factual fucking errors are counted as invalidating.  Extreme example   is  and you just got to accept that  Its actually  but i get what youre saying  Your comment was removed for Invalidation.',
 'Wasnt aware of the drama between raskmen and rAskWomen, and yet hearing about it now is completely unsurprising.']

In [8]:
labels = list(data['age'])

In [9]:
labels[0:3]

['te', 'te', 'te']

Let's make a function to drop text that's too short, and if there aren't any alphabetical bits in it.

In [10]:
import string

def drop_short_text(text, labels, min_length=5):
    """
    Drop elements from `text` and `labels` lists where the length of `text` is less than `min_length`
    or where `text` only contains punctuation or is empty
    """
    filtered_text = []
    filtered_labels = []
    for i in range(len(text)):
        if len(text[i]) < min_length or all(char in string.punctuation or char.isspace() for char in text[i]):
            # check if text is too short or only contains punctuation or whitespace
            continue  # skip to next iteration if true
        if not any(c.isalpha() for c in text[i]):
            # check if text has at least one alphabet character
            continue  # skip to next iteration if false
        filtered_text.append(text[i])
        filtered_labels.append(labels[i])
    return filtered_text, filtered_labels

In [11]:
print(len(preprocessed))
print(len(labels))

64866
64866


In [12]:
text, labels = drop_short_text(preprocessed, labels)

In [13]:
print(len(text))
print(len(labels))

64507
64507


Okay, looks like we filtered out a good bit. Now, let's split it into train and test sets.

In [14]:
from sklearn.model_selection import train_test_split

train_texts, test_texts, train_labels, test_labels = train_test_split(text, labels, test_size=0.2)

## Training and Evaluation

### BoW

We try the BoW approach to get the baseline

In [15]:
# Train a classifier on the training data
clf, vectorizer = train_word_classifier(train_texts, train_labels)

# Evaluate the classifier on the test data
accuracy = evaluate_classifier(clf, vectorizer, test_texts, test_labels)

# Print the accuracy score
print("Accuracy:", accuracy)

analyze_features(10, clf, vectorizer)

Accuracy: 0.7052394977522864
Top 10 most significant textual features:
['Pratchetts', 'knockout', 'skipping', 'combines', 'Urban', 'Hongkonger', 'simulator', 'adjustments', 'OOP', 'savoured', 'Pratchetts', 'knockout', 'skipping', 'combines', 'simulator', 'adjustments', 'Hongkonger', 'Urban', 'savoured', 'dialect', 'Pratchetts', 'knockout', 'skipping', 'combines', 'simulator', 'savoured', 'Urban', 'adjustments', 'dialect', 'Hongkonger']


### N-grams

In [16]:
# Train a classifier on the training data
clf, vectorizer = train_classifier(train_texts, train_labels, (2,2))

# Evaluate the classifier on the test data
accuracy = evaluate_classifier(clf, vectorizer, test_texts, test_labels)

# Print the accuracy score
print("Accuracy:", accuracy)
analyze_features(10, clf, vectorizer)

Accuracy: 0.6013021237017516
Top 10 most significant textual features:
['dA', 'up', '\xa0r', 'ca', 'Ax', 'Px', 'Ee', 'OB', '.u', 'su', 'dA', 'up', '\xa0r', 'ca', 'Ax', 'Px', '.u', 'OB', 'Ee', 'jv', 'dA', 'up', '\xa0r', 'ca', 'Px', 'Ax', '.u', 'OB', 'Ee', 'su']


In [17]:
# Train a classifier on the training data
clf, vectorizer = train_classifier(train_texts, train_labels, (3,3))

# Evaluate the classifier on the test data
accuracy = evaluate_classifier(clf, vectorizer, test_texts, test_labels)

# Print the accuracy score
print("Accuracy:", accuracy)
analyze_features(10, clf, vectorizer)


Accuracy: 0.6686560223221206
Top 10 most significant textual features:
['! Y', 'GAN', 'vip', 'Thi', 'Tzu', 'ean', 'TOL', 'e,p', 'lfS', 'oo ', '! Y', 'GAN', 'vip', 'e,p', 'Thi', 'TOL', 'Tzu', 'ean', 'lfS', 'lyr', '! Y', 'GAN', 'vip', 'e,p', 'Thi', 'Tzu', 'ean', 'lfS', 'TOL', 'LYR']


In [18]:
# Train a classifier on the training data
clf, vectorizer = train_classifier(train_texts, train_labels, (4,4))

# Evaluate the classifier on the test data
accuracy = evaluate_classifier(clf, vectorizer, test_texts, test_labels)

# Print the accuracy score
print("Accuracy:", accuracy)
analyze_features(10, clf, vectorizer)

Accuracy: 0.696248643621144
Top 10 most significant textual features:
['ns t', ' Uno', 'up.c', ' npc', 'YESS', 'heel', 'g, y', 'ue f', 'fo. ', 'qrst', 'ns t', ' Uno', 'up.c', 'heel', ' npc', 'YESS', 'g, y', ' mod', 'ue f', 'fo. ', 'ns t', ' Uno', 'heel', 'up.c', ' npc', 'YESS', 'g, y', 'd I?', 'fo. ', 'ue f']


In [19]:
# Train a classifier on the training data
clf, vectorizer = train_classifier(train_texts, train_labels, (5,5))

# Evaluate the classifier on the test data
accuracy = evaluate_classifier(clf, vectorizer, test_texts, test_labels)

# Print the accuracy score
print("Accuracy:", accuracy)
analyze_features(10, clf, vectorizer)

Accuracy: 0.7007440706867153
Top 10 most significant textual features:
['ay. y', 'A foc', 'rong ', 'sy, a', 'Elize', 'amn n', 't fit', 'CY as', 'te Wa', 'ght h', 'ay. y', 'A foc', 'sy, a', 'rong ', 'Elize', 'ght h', 'try a', 'ned e', 'tucki', 'void.', 'ay. y', 'A foc', 'sy, a', 'rong ', 'ght h', 'Elize', 'try a', 'tucki', 'void.', 'ia an']


In [20]:
# Train a classifier on the training data
clf, vectorizer = train_classifier(train_texts, train_labels, (6,6))

# Evaluate the classifier on the test data
accuracy = evaluate_classifier(clf, vectorizer, test_texts, test_labels)

# Print the accuracy score
print("Accuracy:", accuracy)
analyze_features(10, clf, vectorizer)

Accuracy: 0.6982638350643311
Top 10 most significant textual features:
['he ase', 'f my o', 'ut sim', 'e I sh', 'me st ', 'ly. Ke', 'iroman', 'o pres', ' grann', 'ad pro', 'he ase', 'e I sh', 'me st ', 'ad pro', 'o pres', 'f my o', 'ut sim', 'ns? Li', 'ly. Ke', '.You c', 'he ase', 'e I sh', 'ad pro', 'o pres', 'me st ', 'ut sim', 'f my o', 'ns? Li', 'euroty', 'ge giv']


In [21]:
# Train a classifier on the training data
clf, vectorizer = train_classifier(train_texts, train_labels, (7,7))

# Evaluate the classifier on the test data
accuracy = evaluate_classifier(clf, vectorizer, test_texts, test_labels)

# Print the accuracy score
print("Accuracy:", accuracy)
analyze_features(10, clf, vectorizer)

Accuracy: 0.6902030692915827
Top 10 most significant textual features:
[' two fr', 'generat', 'hat , w', 'red wor', ' say pp', ' lmao n', 'op. Im ', 'ilure. ', ' good P', 'l. some', 'generat', ' two fr', 'red wor', 'y Manag', 're a gr', ' say pp', ' i assu', 'l. some', 'h STDs ', ' lmao n', 'generat', 'irl, th', 'lex I k', ' two fr', 'y Manag', 'a. she ', 'red wor', 're a gr', 'accddsu', 'per fri']


## Model Improvements

In [25]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Vectorize the text data with Bag-of-Words
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Train and evaluate Logistic Regression
lr = LogisticRegression()
lr.fit(X_train, train_labels)
lr_accuracy = lr.score(X_test, test_labels)

# Train and evaluate Decision Tree
dt = DecisionTreeClassifier()
dt.fit(X_train, train_labels)
dt_accuracy = dt.score(X_test, test_labels)

# Train and evaluate Random Forest
rf = RandomForestClassifier()
rf.fit(X_train, train_labels)
rf_accuracy = rf.score(X_test, test_labels)

# Train and evaluate Support Vector Machine
svm = SVC()
svm.fit(X_train, train_labels)
svm_accuracy = svm.score(X_test, test_labels)

# Print the accuracy scores
print(f"Logistic Regression accuracy: {lr_accuracy}")
print(f"Decision Tree accuracy: {dt_accuracy}")
print(f"Random Forest accuracy: {rf_accuracy}")
print(f"SVM accuracy: {svm_accuracy}")

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Logistic Regression accuracy: 0.6889629514803907
Decision Tree accuracy: 0.5234072236862501
Random Forest accuracy: 0.6319175321655557
SVM accuracy: 0.6685785149589211


Looks like I'm not really getting too much out of the other models.

I tried it on the optimal # of N-grams for a many hours and it failed to run, but I doubt that it's going to get a lot better