# Reddit dataset analysis

Here's the plan for the upcoming work in this notebook:

- Build a CountVectorizer (bag-of-n-grams vector representation of the text), varying the n-gram size N
- Train a Naive Bayes classifier
- Look at the most significant features
- Iteratively improve feature selection

In [3]:
import pandas as pd

loaded_df = pd.read_pickle('../../data_samples/reddit_samples/all.pkl')
loaded_df


Unnamed: 0,text,age
0,What happened to my comment....it was soo good...,te
1,"A shit ton of censorship. And I don't mean ""de...",te
2,Wasn't aware of the drama between /r/askmen an...,te
3,Nice username I too am from Finland,te
4,Your comment was on the [other post]( lol,te
...,...,...
64861,"And, after 10 years of marriage, you can get 5...",th
64862,Yes. Thank you for this response. I don’t view...,th
64863,Better hope that you're contacted before someo...,th
64864,Thank you for this question. I also find mysel...,th


In [13]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score


def train_classifier(texts, labels):
    # create a CountVectorizer with character-based bigrams
    vectorizer = CountVectorizer(analyzer='char', ngram_range=(2,2))
    X = vectorizer.fit_transform(texts)
    # train the classifier and return it
    clf = MultinomialNB()
    clf.fit(X, labels)
    return clf, vectorizer

def train_word_classifier(texts, labels):
    # create a CountVectorizer with words
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(texts)
    # train the classifier and return it
    clf = MultinomialNB()
    clf.fit(X, labels)
    return clf, vectorizer

def predict_age_group(classifier, vectorizer, new_text):
    # vectorize a single text with the fitted vectorizer and predict its age group
    # (classifier.predict returns a one-element array here)
    new_X = vectorizer.transform([new_text])
    predicted_age_group = classifier.predict(new_X)
    return predicted_age_group

def evaluate_classifier(classifier, vectorizer, test_texts, test_labels):
    # transform the test data
    X_test = vectorizer.transform(test_texts)
    # predict the age group and return score
    predicted_age_groups = classifier.predict(X_test)
    return accuracy_score(test_labels, predicted_age_groups)
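
`train_classifier` and `train_word_classifier` differ only in how the CountVectorizer is configured, and the plan calls for varying N. A minimal sketch of one way to do that is below. The helper name `train_ngram_classifier` and the toy texts are assumptions made for illustration; they are not part of the notebook or the Reddit sample.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def train_ngram_classifier(texts, labels, analyzer="char", ngram_range=(2, 2)):
    # Hypothetical helper: same steps as train_classifier, but the analyzer
    # and n-gram range are arguments instead of being hard-coded.
    vectorizer = CountVectorizer(analyzer=analyzer, ngram_range=ngram_range)
    X = vectorizer.fit_transform(texts)
    clf = MultinomialNB()
    clf.fit(X, labels)
    return clf, vectorizer


# Toy data for illustration only (not taken from the Reddit sample).
toy_texts = [
    "lol that was so good lol",
    "omg lol nice username",
    "after ten years of marriage",
    "thank you for this response",
]
toy_labels = ["te", "te", "th", "th"]

# Vary N for character n-grams and report the vocabulary size each time.
for n in (1, 2, 3):
    clf, vectorizer = train_ngram_classifier(
        toy_texts, toy_labels, analyzer="char", ngram_range=(n, n)
    )
    print(f"char {n}-grams: {len(vectorizer.vocabulary_)} features")
```

Passing `analyzer="word"` with `ngram_range=(1, 1)` should match what `train_word_classifier` does with the default `CountVectorizer()`.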


In [14]:
from sklearn.model_selection import train_test_split

train_texts, test_texts, train_labels, test_labels = train_test_split(
    list(loaded_df['text']),
    list(loaded_df['age']),
    test_size=0.2,
)

# Train a classifier on the training data
clf, vectorizer = train_classifier(train_texts, train_labels)

# Evaluate the classifier on the test data
accuracy = evaluate_classifier(clf, vectorizer, test_texts, test_labels)

# Print the accuracy score
print("Accuracy:", accuracy)

Accuracy: 0.5955757669184523
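
For the "look at the most significant features" step in the plan, one possible approach is to compare the per-class log probabilities that a fitted `MultinomialNB` stores in `feature_log_prob_`. The sketch below is an assumption about how that could be done, not the notebook's method. The helper `top_features_per_class` and the toy texts are hypothetical, and the score (log probability in one class minus the mean over the other classes) is just one reasonable ranking.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data for illustration only (not taken from the Reddit sample).
toy_texts = [
    "lol that was so good lol",
    "omg lol nice username",
    "after ten years of marriage",
    "thank you for this response",
]
toy_labels = ["te", "te", "th", "th"]

vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 2))
X = vectorizer.fit_transform(toy_texts)
clf = MultinomialNB().fit(X, toy_labels)


def top_features_per_class(clf, vectorizer, k=5):
    # Hypothetical helper: for each class, rank features by how much more
    # likely they are under that class than under the other classes on average.
    vocab = vectorizer.vocabulary_
    # Order feature names by their column index in the count matrix.
    feature_names = np.array(sorted(vocab, key=vocab.get))
    log_prob = clf.feature_log_prob_  # shape: (n_classes, n_features)
    top = {}
    for i, label in enumerate(clf.classes_):
        others = np.delete(log_prob, i, axis=0).mean(axis=0)
        score = log_prob[i] - others
        best = np.argsort(score)[::-1][:k]
        top[label] = feature_names[best].tolist()
    return top


for label, features in top_features_per_class(clf, vectorizer, k=5).items():
    print(label, features)
```

On the real data, the same helper could be called with the `clf` and `vectorizer` returned by `train_classifier`.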
