## Age Classification/Analysis of Text and Age

Let's start with a classification model, gather the results, and then investigate.

Here's the plan for the rest of the notebook:

- Vectorize the text with CountVectorizer (a count-based vector representation) on N-grams
- Train a Naive Bayes classifier
- Look at the most significant features
- Form hypotheses and iterate
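
To make the first step concrete, here is a small pure-Python sketch of the counts a character N-gram or word vectorizer works from. It only approximates CountVectorizer (which also handles whitespace normalization and vocabulary indexing); the helper names `char_ngrams` and `word_tokens` are made up for this illustration, and the sample string is one of the posts in the data preview.

```python
import re
from collections import Counter

def char_ngrams(text, n):
    """Return every overlapping character n-gram in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_tokens(text):
    """Tokens of two or more word characters (scikit-learn's default token pattern)."""
    return re.findall(r"(?u)\b\w\w+\b", text)

sample = "testing!!! testing!!!"
bigram_counts = Counter(char_ngrams(sample, 2))
word_counts = Counter(word_tokens(sample))

print(bigram_counts.most_common(3))
print(word_counts)
```

Each document becomes one row of such counts, and the classifier only ever sees those rows, never the original string.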

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Load the dataset
data = pd.read_csv('../../data/blogtext 2.csv')
data.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...


In [3]:
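# Strip leading/trailing whitespace from each post
data['text'] = data['text'].str.strip()
# Keep only the columns needed for age classification
data = data.drop(['id', 'gender', 'topic', 'sign', 'date'], axis=1)
# Bucket ages into right-closed bins (10, 19], (19, 29], (29, 39]; ages outside these bins become NaN
data['age_category'] = pd.cut(data['age'], bins=[10, 19, 29, 39], labels=['10s', '20s', '30s'])
data.head()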


Unnamed: 0,age,text,age_category
0,15,"Info has been found (+/- 100 pages, and 4.5 MB...",10s
1,15,These are the team members: Drewes van der L...,10s
2,15,In het kader van kernfusie op aarde: MAAK JE ...,10s
3,15,testing!!! testing!!!,10s
4,33,Thanks to Yahoo!'s Toolbar I can now 'capture'...,30s
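
The binning rule is easy to get wrong at the edges, so here is a pure-Python sketch of what the `pd.cut` call does with its default right-closed intervals. The function `age_category` is a hypothetical stand-in, not part of pandas, and the age 45 is only there to show an out-of-range value.

```python
from bisect import bisect_left

BIN_EDGES = [10, 19, 29, 39]        # same edges as the pd.cut call
BIN_LABELS = ["10s", "20s", "30s"]

def age_category(age):
    """Mimic pd.cut with right-closed bins: (10, 19], (19, 29], (29, 39]."""
    if age <= BIN_EDGES[0] or age > BIN_EDGES[-1]:
        return None  # pd.cut would produce NaN here
    return BIN_LABELS[bisect_left(BIN_EDGES, age) - 1]

for age in (15, 33, 45):
    print(age, age_category(age))
```

If the dataset contains ages above 39, those rows end up without a label, so it is worth checking `data['age_category'].isna().sum()` before training.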


In [7]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score


def train_classifier(texts, labels, ngrams):
    # create a CountVectorizer over character n-grams; `ngrams` is the (min_n, max_n) range
    vectorizer = CountVectorizer(analyzer='char', ngram_range=ngrams, lowercase=False, stop_words=None)
    X = vectorizer.fit_transform(texts)
    # train the classifier and return it
    clf = MultinomialNB()
    clf.fit(X, labels)
    return clf, vectorizer

def train_word_classifier(texts, labels):
    # create a CountVectorizer with words
    vectorizer = CountVectorizer(lowercase=False, stop_words=None)
    X = vectorizer.fit_transform(texts)
    # train the classifier and return it
    clf = MultinomialNB()
    clf.fit(X, labels)
    return clf, vectorizer

def predict_age_group(classifier, vectorizer, new_text):
    # take in a classifier as input and return the prediction
    new_X = vectorizer.transform([new_text])
    predicted_age_group = classifier.predict(new_X)
    return predicted_age_group

def evaluate_classifier(classifier, vectorizer, test_texts, test_labels):
    # transform the test data
    X_test = vectorizer.transform(test_texts)
    # predict the age group and return score
    predicted_age_groups = classifier.predict(X_test)
    return accuracy_score(test_labels, predicted_age_groups)


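def analyze_features(N, clf, vec):
    # get_feature_names_out() is ordered by column index, so it lines up with
    # feature_log_prob_; the keys of vocabulary_ are not guaranteed to be in that order
    feature_names = vec.get_feature_names_out()
    log_prob = clf.feature_log_prob_
    top_N_features = []
    for i in range(clf.classes_.shape[0]):
        # indices of the N highest log P(feature | class) values for class i
        top_N_indices = log_prob[i].argsort()[::-1][:N]
        top_N_features.extend([feature_names[idx] for idx in top_N_indices])
    print("Top {} most significant textual features:".format(N))
    print(top_N_features)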
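
One caveat about ranking by raw log probability: a token that is frequent in every class (like "the") scores high for every class, so the list mostly shows common tokens. A common alternative is to rank by the difference in log probability between classes. The sketch below shows the idea with a tiny hand-rolled multinomial Naive Bayes; the corpus, `fit_log_probs`, and `top_distinctive` are all hypothetical and only meant to illustrate the ranking, not to replace the scikit-learn model.

```python
import math
from collections import Counter

def fit_log_probs(docs, labels, alpha=1.0):
    """Per-class Laplace-smoothed log P(token | class)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    counts = {label: Counter() for label in set(labels)}
    for doc, label in zip(docs, labels):
        counts[label].update(doc)
    log_probs = {}
    for label, counter in counts.items():
        total = sum(counter.values()) + alpha * len(vocab)
        log_probs[label] = {tok: math.log((counter[tok] + alpha) / total) for tok in vocab}
    return vocab, log_probs

def top_distinctive(vocab, log_probs, label, n):
    """Rank tokens by log P(tok | label) minus the best log P(tok | other class)."""
    others = [other for other in log_probs if other != label]
    def margin(tok):
        return log_probs[label][tok] - max(log_probs[o][tok] for o in others)
    return sorted(vocab, key=margin, reverse=True)[:n]

# Hypothetical toy corpus, already tokenized.
docs = [["the", "exam", "homework"], ["the", "exam", "school"],
        ["the", "mortgage", "office"], ["the", "office", "meeting"]]
labels = ["10s", "10s", "30s", "30s"]

vocab, log_probs = fit_log_probs(docs, labels)
print(top_distinctive(vocab, log_probs, "10s", 1))
print(top_distinctive(vocab, log_probs, "30s", 1))
```

Here "the" is as probable as "exam" within the 10s class, but its margin is zero because it is equally probable in the 30s class, so it drops out of the ranking.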


In [8]:
from sklearn.model_selection import train_test_split

train_texts, test_texts, train_labels, test_labels = train_test_split(list(data['text']), list(data['age_category']), test_size=0.2)

Let's see if preprocessing matters. But first, let's get a benchmark.

In [9]:
# Train a classifier on the training data
clf, vectorizer = train_word_classifier(train_texts, train_labels)

# Evaluate the classifier on the test data
accuracy = evaluate_classifier(clf, vectorizer, test_texts, test_labels)

# Print the accuracy score
print("Accuracy:", accuracy)

analyze_features(10, clf, vectorizer)
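
An accuracy number is easier to read next to a trivial baseline. Below is a sketch of a majority-class baseline, which always predicts the most frequent training label. `majority_baseline_accuracy` is a hypothetical helper, and the toy labels are invented; they do not reflect the real class balance in the dataset.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / len(test_labels)

# Hypothetical labels for illustration only.
toy_train = ["20s", "20s", "20s", "10s", "30s"]
toy_test = ["20s", "10s", "20s", "30s"]
label, baseline = majority_baseline_accuracy(toy_train, toy_test)
print(label, baseline)
```

Calling the same function on the notebook's `train_labels` and `test_labels` would give the floor that any of the classifiers here needs to beat.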

In [None]:
# Train a classifier on the training data
clf, vectorizer = train_classifier(train_texts, train_labels, (2,2))

# Evaluate the classifier on the test data
accuracy = evaluate_classifier(clf, vectorizer, test_texts, test_labels)

# Print the accuracy score
print("Accuracy:", accuracy)
analyze_features(10, clf, vectorizer)

Accuracy: 0.4683135545329781
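
A single accuracy score does not say which age groups get confused with each other. One way to look closer is to count (true, predicted) pairs and compute recall per class. This is a minimal sketch with made-up labels; `confusion_counts` and `per_class_recall` are hypothetical helpers, not scikit-learn functions.

```python
from collections import Counter

def confusion_counts(true_labels, predicted_labels):
    """Count (true, predicted) label pairs."""
    return Counter(zip(true_labels, predicted_labels))

def per_class_recall(true_labels, predicted_labels):
    """Fraction of each true class that was predicted correctly."""
    pairs = confusion_counts(true_labels, predicted_labels)
    totals = Counter(true_labels)
    return {label: pairs[(label, label)] / totals[label] for label in totals}

# Hypothetical predictions for illustration only.
toy_true = ["10s", "10s", "20s", "20s", "30s", "30s"]
toy_pred = ["10s", "20s", "20s", "20s", "20s", "30s"]
print(confusion_counts(toy_true, toy_pred))
print(per_class_recall(toy_true, toy_pred))
```

In the notebook, the same helpers could be fed `test_labels` and the output of `classifier.predict(X_test)`.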


In [None]:
import re

def preprocess_texts(texts):
    preprocessed_texts = []

    for text in texts:
        # lowercase the text
        text = text.lower()

        # remove URLs, plus anything from an '@' up to the next whitespace
        # (this drops the domain part of email addresses and @-handles)
        text = re.sub(r'http\S+|www\S+|https\S+|ftp\S+|@\S+', '', text, flags=re.MULTILINE)

        # keep only lowercase letters and whitespace (digits and punctuation are dropped)
        text = re.sub(r'[^a-z\s]', '', text)

        # collapse runs of whitespace into single spaces
        text = re.sub(r'\s+', ' ', text).strip()

        preprocessed_texts.append(text)

    return preprocessed_texts
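
To see what these steps actually strip, here is the same sequence of substitutions applied to one made-up string. `preprocess_one` is a hypothetical single-string version of the function above, and the URL and email address are invented.

```python
import re

def preprocess_one(text):
    """Apply the same cleaning steps to a single string."""
    text = text.lower()
    text = re.sub(r'http\S+|www\S+|https\S+|ftp\S+|@\S+', '', text)
    text = re.sub(r'[^a-z\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

sample = "Check http://example.com NOW!!! Mail me@site.org, it's 2004."
print(preprocess_one(sample))  # check now mail me its
```

Note that the local part of the email address ("me") survives, digits disappear entirely, and "it's" collapses to "its". Signals like repeated exclamation marks, which might differ by age group, are lost as well.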


In [None]:
preprocessed_train, preprocess_test = preprocess_texts(train_texts), preprocess_texts(test_texts)

In [None]:
# Train a classifier on the training data
clf, vectorizer = train_word_classifier(preprocessed_train, train_labels)

# Evaluate the classifier on the test data
accuracy = evaluate_classifier(clf, vectorizer, preprocess_test, test_labels)

# Print the accuracy score
print("Accuracy:", accuracy)
analyze_features(10, clf, vectorizer)

Accuracy: 0.6405689248992712


In [None]:
preprocessed_train[0:5]

['urllink center for american progress the progress report page tenet targets white house in a major speech this morning addressing the failure to find wmd in iraq cia director george tenet said the intelligence community never told the white house that iraq was an imminent threat to america a stunning blow to the white house',
 'probably to spend the rest of the day cod maybe ill blog tomorrow',
 'p stands for the holy grill went to play football with isaiah khairul daryl and the crazy nut who just runs on and on by his own before khairul cuts him down with more grace than a swan and oks is just mugging and mugging and mugging away if isaiah can run until the cows return home for the sixtieth time at a go oks can mug twice as much and exams argh i hate exams kenn just makes me so effing jealous cos he just studies so much you could smell his studiousness lingering in the air like a very long very loud and above all very disgusting fart bonus points for a squisssshhhhhbrapppppp accompa

In [None]:
# Train a classifier on the training data
clf, vectorizer = train_classifier(preprocessed_train, train_labels, (2,2))

# Evaluate the classifier on the test data
accuracy = evaluate_classifier(clf, vectorizer, preprocess_test, test_labels)

# Print the accuracy score
print("Accuracy:", accuracy)
analyze_features(10, clf, vectorizer)


Accuracy: 0.47052995442435985



That wraps up preprocessed versus raw text: for character bigrams, accuracy barely moved (about 0.468 raw, 0.471 preprocessed). Let's see what features were learned.

In [None]:

# Train a classifier on the training data
clf, vectorizer = train_classifier(train_texts, train_labels, (3,3))

# Evaluate the classifier on the test data
accuracy = evaluate_classifier(clf, vectorizer, test_texts, test_labels)

# Print the accuracy score
print("Accuracy:", accuracy)

Looks like we have a lot to improve upon. Let's start by using words instead of character N-grams.

In [None]:
import sys
sys.path.insert(0, '../..')
from scripts import classify

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

def train_word_classifier(texts, labels):
    # create a CountVectorizer with words
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(texts)
    # train the classifier and return it
    clf = MultinomialNB()
    clf.fit(X, labels)
    return clf, vectorizer

# Train a classifier on the training data
clf, vectorizer = train_word_classifier(train_texts, train_labels)

# Evaluate the classifier on the test data
accuracy = evaluate_classifier(clf, vectorizer, test_texts, test_labels)

# Print the accuracy score
print("Accuracy:", accuracy)

Accuracy: 0.6310868432447507


Okay, much better: word counts reach about 0.63 accuracy, versus about 0.47 for character bigrams. I'll have to read up on where and when to use N-grams.
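
One possible contributor to the gap (an assumption, not something measured in this notebook) is how small the character-bigram feature space is. After cleaning, the text only contains a-z and single spaces, so there can be at most 27 × 27 = 729 distinct bigrams, while the word vocabulary keeps growing with the corpus. The sketch below counts both on two short strings taken from the preprocessed sample output; the helper names are hypothetical.

```python
import re

def distinct_char_bigrams(texts):
    """Set of all character bigrams that occur in any text."""
    return {text[i:i + 2] for text in texts for i in range(len(text) - 1)}

def distinct_words(texts):
    """Set of all tokens of two or more word characters."""
    return {tok for text in texts for tok in re.findall(r"(?u)\b\w\w+\b", text)}

ALPHABET_SIZE = 27                 # a-z plus a single space, as left by the cleaning step
MAX_BIGRAMS = ALPHABET_SIZE ** 2   # upper bound on distinct character bigrams

toy_texts = ["probably to spend the rest of the day", "maybe ill blog tomorrow"]
print(len(distinct_char_bigrams(toy_texts)), "bigrams seen, at most", MAX_BIGRAMS)
print(len(distinct_words(toy_texts)), "distinct words seen")
```

Running the same two functions on `preprocessed_train` would show how quickly the bigram count saturates compared with the word count. Longer character N-grams or word N-grams are the usual next things to try.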