## Age Classificaion/Analysis of Text and Age

Let's actually start with a classification model, gather the results, and then investigate.

Here's the plan for the upcoming work in the notebook --

- CountVectorizer (vector representation of text) on N-grams
- Train a Naive Bayes Classifier
- Look at most significant features
- Form hypotheses and iterate

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Load the dataset
data = pd.read_csv('../../../data/blogtext 2.csv')
data.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...


In [3]:
data['text'] = data['text'].str.strip()
data = data.drop(['id', 'gender', 'topic', 'sign', 'date'], axis = 1)
data['age_category'] = pd.cut(data['age'], bins=[10, 19, 29, 39], labels=['10s', '20s', '30s'])
data.head()


Unnamed: 0,age,text,age_category
0,15,"Info has been found (+/- 100 pages, and 4.5 MB...",10s
1,15,These are the team members: Drewes van der L...,10s
2,15,In het kader van kernfusie op aarde: MAAK JE ...,10s
3,15,testing!!! testing!!!,10s
4,33,Thanks to Yahoo!'s Toolbar I can now 'capture'...,30s


The data is now more or less in it's final form.

Now let's define the main vectorizer and classifier. We might make changes to these, but for now, we're keeping it all in one place for covenience.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score


def train_classifier(texts, labels, ngrams):
    # create a CountVectorizer with character-based bigrams
    vectorizer = CountVectorizer(analyzer='char', ngram_range=ngrams, lowercase=False, stop_words=None)
    X = vectorizer.fit_transform(texts)
    # train the classifier and return it
    clf = MultinomialNB()
    clf.fit(X, labels)
    return clf, vectorizer

def train_word_classifier(texts, labels):
    # create a CountVectorizer with words
    vectorizer = CountVectorizer(lowercase=False, stop_words=None)
    X = vectorizer.fit_transform(texts)
    # train the classifier and return it
    clf = MultinomialNB()
    clf.fit(X, labels)
    return clf, vectorizer

def predict_age_group(classifier, vectorizer, new_text):
    # take in a classifier as input and return the prediction
    new_X = vectorizer.transform([new_text])
    predicted_age_group = classifier.predict(new_X)
    return predicted_age_group

def evaluate_classifier(classifier, vectorizer, test_texts, test_labels):
    # transform the test data
    X_test = vectorizer.transform(test_texts)
    # predict the age group and return score
    predicted_age_groups = classifier.predict(X_test)
    return accuracy_score(test_labels, predicted_age_groups)


def analyze_features(N, clf, vec):
    feature_names = list(vec.vocabulary_.keys())
    log_prob = clf.feature_log_prob_
    top_N_features = []
    for i in range(clf.classes_.shape[0]):
        top_N_indices = log_prob[i].argsort()[::-1][:N]
        top_N_features.extend([feature_names[idx] for idx in top_N_indices])
    print("Top {} most significant textual features:".format(N))
    print(top_N_features)


From the analysis in the `../draft/` folder, we have determined that some preprocessing helps, but not all the way (with lowercase, stemming, lemmatizing, etc.). This also helps to an extent to get the non-english text out, or at least, non-english characters.

In [6]:
import re

def preprocess_texts(texts):
    """
    Cleans the given text by removing non-English characters and punctuation.
    """
    # Keep only English characters and punctuation
    text = [re.sub(r'[^a-zA-Z\s\.,?!]', '', text) for text in texts]
    
    return text

In [7]:
preprocessed = preprocess_texts(list(data['text']))

In [11]:
preprocessed[0:3]

['Info has been found   pages, and . MB of .pdf files Now i have to wait untill our team leader has processed it and learns html.',
 'These are the team members   Drewes van der Laag           urlLink mail  Ruiyu Xie                     urlLink mail  Bryan Aaldering me          urlLink mail',
 'In het kader van kernfusie op aarde  MAAK JE EIGEN WATERSTOFBOM   How to build an HBomb From ascotttartarus.uwa.edu.au Andrew Scott Newsgroups rec.humor Subject How To Build An HBomb humorous! Date  Feb   GMT Organization The University of Western Australia  Original file dated th November . Seemed to be a transcript of a Seven Days article. Poorly formatted and corrupted. I have added the text between examine under a microscope and malleable, like gold, as it was missing. If anyone has the full text, please distribute. I am not responsible for the accuracy of this information. Converted to HTML by DionisioInfiNet.com . Did a little spellchecking and some minor edits too. Stolen from  urlLink ht

In [12]:
labels = list(data['age_category'])

In [13]:
labels[0:3]

['10s', '10s', '10s']

Let's make a function to drop text that's too short, and if there aren't any alphabetical bits in it.

In [14]:
import string

def drop_short_text(text, labels, min_length=5):
    """
    Drop elements from `text` and `labels` lists where the length of `text` is less than `min_length`
    or where `text` only contains punctuation or is empty
    """
    filtered_text = []
    filtered_labels = []
    for i in range(len(text)):
        if len(text[i]) < min_length or all(char in string.punctuation or char.isspace() for char in text[i]):
            # check if text is too short or only contains punctuation or whitespace
            continue  # skip to next iteration if true
        if not any(c.isalpha() for c in text[i]):
            # check if text has at least one alphabet character
            continue  # skip to next iteration if false
        filtered_text.append(text[i])
        filtered_labels.append(labels[i])
    return filtered_text, filtered_labels

In [16]:
print(len(preprocessed))
print(len(labels))

681284
681284


In [17]:
text, labels = drop_short_text(preprocessed, labels)

In [18]:
print(len(text))
print(len(labels))

676839
676839


Okay, looks like we filtered out a good bit. Now, let's split it into train and test sets.

In [20]:
from sklearn.model_selection import train_test_split

train_texts, test_texts, train_labels, test_labels = train_test_split(text, labels, test_size=0.2)

In [21]:
# Train a classifier on the training data
clf, vectorizer = train_word_classifier(train_texts, train_labels)

# Evaluate the classifier on the test data
accuracy = evaluate_classifier(clf, vectorizer, test_texts, test_labels)

# Print the accuracy score
print("Accuracy:", accuracy)

analyze_features(10, clf, vectorizer)

Accuracy: 0.6376470066780923
Top 10 most significant textual features:
['oratote', 'comreligion', 'notquiteright', 'holderjoint', 'Muhahahhaha', 'Casacommunitas', 'slurtalking', 'smilingandflirting', 'LisaaaaAAA', 'Reng', 'oratote', 'comreligion', 'notquiteright', 'holderjoint', 'Muhahahhaha', 'smilingandflirting', 'Reng', 'Casacommunitas', 'slurtalking', 'aloofly', 'oratote', 'comreligion', 'notquiteright', 'holderjoint', 'smilingandflirting', 'Muhahahhaha', 'Reng', 'Casacommunitas', 'aloofly', 'slurtalking', 'oratote', 'comreligion', 'notquiteright', 'holderjoint', 'smilingandflirting', 'Muhahahhaha', 'Reng', 'Casacommunitas', 'aloofly', 'CastorKing']


In [22]:
# Train a classifier on the training data
clf, vectorizer = train_classifier(train_texts, train_labels, (2,2))

# Evaluate the classifier on the test data
accuracy = evaluate_classifier(clf, vectorizer, test_texts, test_labels)

# Print the accuracy score
print("Accuracy:", accuracy)
analyze_features(10, clf, vectorizer)

Accuracy: 0.45596448200460965
Top 10 most significant textual features:
['oN', 'rs', 'Kg', 'Uu', 'ti', '\xa0,', 'iR', ',A', 'wA', 'il', 'oN', 'rs', 'Kg', 'Uu', 'ti', '\xa0,', 'iR', 'wA', ',A', 'il', 'oN', 'rs', 'Uu', 'Kg', 'ti', '\xa0,', 'iR', 'wA', ',A', 'sT', 'oN', 'rs', 'Uu', 'ti', '\xa0,', 'Kg', 'iR', 'wA', ',A', 'sT']


In [23]:
# Train a classifier on the training data
clf, vectorizer = train_classifier(train_texts, train_labels, (3,3))

# Evaluate the classifier on the test data
accuracy = evaluate_classifier(clf, vectorizer, test_texts, test_labels)

# Print the accuracy score
print("Accuracy:", accuracy)
analyze_features(10, clf, vectorizer)


Accuracy: 0.4816721824951244
Top 10 most significant textual features:
['sht', 'MKM', 'YSk', 'i l', 'xTM', 'XNx', 'nge', 'Wxq', 'jvh', '.bd', 'sht', 'MKM', 'YSk', 'xTM', 'i l', 'XNx', 'Wxq', 'nge', 'jvh', '.bd', 'sht', 'MKM', 'YSk', 'xTM', 'i l', 'XNx', 'Wxq', 'nge', 'jvh', '.bd', 'sht', 'MKM', 'YSk', 'xTM', 'i l', 'XNx', 'nge', 'Wxq', 'jvh', '.bd']


In [24]:
# Train a classifier on the training data
clf, vectorizer = train_classifier(train_texts, train_labels, (4,4))

# Evaluate the classifier on the test data
accuracy = evaluate_classifier(clf, vectorizer, test_texts, test_labels)

# Print the accuracy score
print("Accuracy:", accuracy)
analyze_features(10, clf, vectorizer)

Accuracy: 0.5296968264286981
Top 10 most significant textual features:
['..eh', 'czx ', 'rUBI', 'earg', 'zzto', 'adhe', 'iaen', 'pami', ' Eee', 'stwi', '..eh', 'czx ', 'rUBI', 'earg', 'zzto', 'adhe', 'pami', 'iaen', ' Eee', 'OfIn', '..eh', 'czx ', 'rUBI', 'earg', 'zzto', 'adhe', 'pami', 'iaen', ' Eee', 'tor!', '..eh', 'czx ', 'rUBI', 'earg', 'zzto', 'adhe', 'pami', 'iaen', ' Eee', 'tor!']


In [25]:
# Train a classifier on the training data
clf, vectorizer = train_classifier(train_texts, train_labels, (5,5))

# Evaluate the classifier on the test data
accuracy = evaluate_classifier(clf, vectorizer, test_texts, test_labels)

# Print the accuracy score
print("Accuracy:", accuracy)
analyze_features(10, clf, vectorizer)

Accuracy: 0.5787409136575853
Top 10 most significant textual features:
['ayabu', '...Ng', 'ilor ', 'hyWeb', 't. .M', 'Anhs ', 'Duuuu', 'mG C ', 'ada?U', '?, mu', 'ayabu', '...Ng', 'ilor ', 'hyWeb', 'Duuuu', 't. .M', 'm dOn', 'mG C ', 'zs la', 'Anhs ', 'ayabu', '...Ng', 'ilor ', 'hyWeb', 'Duuuu', 't. .M', 'm dOn', 'zs la', 'mG C ', 'FOS G', 'ayabu', '...Ng', 'ilor ', 'hyWeb', 'Duuuu', 'm dOn', 't. .M', 's Zen', 'zs la', 'mG C ']


In [26]:
# Train a classifier on the training data
clf, vectorizer = train_classifier(train_texts, train_labels, (6,6))

# Evaluate the classifier on the test data
accuracy = evaluate_classifier(clf, vectorizer, test_texts, test_labels)

# Print the accuracy score
print("Accuracy:", accuracy)
analyze_features(10, clf, vectorizer)

Accuracy: 0.6175314697712901
Top 10 most significant textual features:
['A. Thi', '! daik', 'g de a', 'kur. D', 'oesnt?', 'ohn. I', 'ue, Pu', 'ame, h', 'ND Bel', 'ythmSe', 'A. Thi', 'g de a', '! daik', 'kur. D', 'ohn. I', 'ierenc', 'ue, Pu', 'cccbcc', 'di bid', 'lxsimp', 'A. Thi', 'kur. D', 'g de a', '! daik', 'ierenc', 'ohn. I', 'di bid', 'cccbcc', ' . Aid', 'anma B', 'A. Thi', 'kur. D', 'g de a', '! daik', 'ierenc', ' i tag', 'di bid', 'cccbcc', 'lxsimp', 'ohn. I']


In [27]:
# Train a classifier on the training data
clf, vectorizer = train_classifier(train_texts, train_labels, (7,7))

# Evaluate the classifier on the test data
accuracy = evaluate_classifier(clf, vectorizer, test_texts, test_labels)

# Print the accuracy score
print("Accuracy:", accuracy)
analyze_features(10, clf, vectorizer)

Accuracy: 0.6405797529696826
Top 10 most significant textual features:
['ason, S', 'ofit.As', 'Last xm', 'n. Nana', 'ch warm', '. NO BR', 'cate al', 'nse aff', 'RA! din', 'th aspe', 'ch warm', 'RA! din', 'ason, S', 'cate al', 'nse aff', 'ofit.As', 'nn Proj', ' eh!! d', '. NO BR', 'wns his', 'ch warm', 'nn Proj', ' eh!! d', 'RA! din', 'wns his', 'cate al', 'nse aff', 'ason, S', 'ofit.As', '. NO BR', 'ch warm', 'RA! din', 'th aspe', 'cate al', 'nse aff', 'nn Proj', ' eh!! d', 'ason, S', 'c n na ', '. NO BR']
