## Naive Bayes Text Classification ##

The following program applies the Naive Bayes classifier provided by NLTK to input data files to determine the relevance of a set of Twitter data. By default, relevance is judged against keywords relating to the 2013 Boston Marathon bombing.

In [None]:
# Author: Elizabeth Brooks
# Date Modified: 07/09/2015

# Imports
import os
import sys
sys.path.append(os.path.realpath('../'))
import csv
import yaml
import re
import random
# nltk is referenced directly when the classifier is trained
import nltk
from nltk.classify import apply_features
# Directives for twc yaml
import twittercriteria as twc
twc.loadCriteria()
keyword = twc.getKeywordRegex()
twc.clearCriteria()

# Global field declarations
current_dir = os.getcwd()
# Set the output file path
resultsPath = current_dir + '/relevantTweetResults.txt'
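# Initialize the training and dev data sets, the labeled terms, and the feature set
# (each name needs its own list; unpacking a single empty list raises ValueError)
trainSet, devSet, labeledTweetDict, featureSet = [], [], [], []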

# Function to clean up tweet strings
# by removing markup that carries no meaning (not words)
def cleanUpTweet(tweet_text):
    # Markup fragments to strip from the text
    twitterMarkup = ['&amp;', 'http://t.co/']
    temp = tweet_text.lower()
    # Build one alternation pattern from the fragments, escaping
    # regex metacharacters such as '.' so they match literally
    pattern = '|'.join(re.escape(markup) for markup in twitterMarkup)
    temp = re.sub(pattern, '', temp)
    return temp
# End cleanUpTweet
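
To see what the cleanup step does to a sample string, here is a minimal, self-contained sketch. The function name `clean_up_tweet` and the sample tweet are made up for illustration; the two markup fragments are the ones listed in the cell above. Only the `http://t.co/` prefix is removed, so the short-link identifier stays in the text.

```python
import re

# Markup fragments taken from the cell above
TWITTER_MARKUP = ['&amp;', 'http://t.co/']

def clean_up_tweet(tweet_text):
    """Lowercase the text and delete each markup fragment (illustrative helper)."""
    pattern = '|'.join(re.escape(fragment) for fragment in TWITTER_MARKUP)
    return re.sub(pattern, '', tweet_text.lower())

# Hypothetical sample tweet
sample = 'Boston &amp; Marathon http://t.co/abc'
cleaned = clean_up_tweet(sample)
# cleaned == 'boston  marathon abc' (the link prefix is gone, 'abc' remains)
```

If the leftover link identifiers turn out to add noise, a broader pattern for whole URLs would be one option, though that goes beyond what the notebook does.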


The following function builds a list of labeled terms from the input class text files. Each line in a class file is a tweet. The function splits each tweet into words, labels every word with the class of its file, and stores the (word, label) pairs in labeledTweetDict.

In [None]:
# Function to initialize the labeled term list
def initDictSet(class1_path='relevantTraining.txt', class2_path='irrelevantTraining.txt'):
    # Loop through the txt files line by line
    # Assign labels to the words of each tweet
    # Two classes, relevant and irrelevant to the marathon
    with open(os.path.join(current_dir, class1_path), "r") as relevantFile:
        for line in relevantFile:
            for word in line.split():
                labeledTweetDict.append((word, 'relevant'))
    with open(os.path.join(current_dir, class2_path), "r") as irrelevantFile:
        for line in irrelevantFile:
            for word in line.split():
                labeledTweetDict.append((word, 'irrelevant'))
    # Randomize the data
    random.shuffle(labeledTweetDict)
    # The with statements close both files on exit
# End initDictSet
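
The labeling step can be tried without any files. The sketch below assumes a hypothetical helper `label_words` and a few made-up in-memory lines per class; it produces the same kind of (word, label) pairs described above.

```python
import random

def label_words(lines, label):
    """Split each line into words and pair every word with the class label."""
    return [(word, label) for line in lines for word in line.split()]

# Made-up stand-ins for the two class files
relevant_lines = ['marathon finish line', 'boston explosion']
irrelevant_lines = ['coffee time']

labeled = (label_words(relevant_lines, 'relevant')
           + label_words(irrelevant_lines, 'irrelevant'))
# Shuffle with a fixed seed so the order is reproducible between runs
random.Random(0).shuffle(labeled)
```

Shuffling matters mainly when the list is later sliced into training and development parts, since otherwise one class would sit entirely at one end.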


The extractFeatures(train_file) function runs before trainClassifier() and assigns each input term to a feature set indicating marathon relevance. The feature set may then be split into a training set and a development set. The Naive Bayes classifier provided by NLTK is trained on the training set.
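
The training/development split itself is left open in the code. One common way to do it, sketched here under the assumption of an 80/20 split (the helper name `split_train_dev` and the fraction are illustrative, not taken from the source), is to slice an already shuffled feature set:

```python
def split_train_dev(feature_set, dev_fraction=0.2):
    """Return (train, dev); the last dev_fraction of the items become the dev set."""
    dev_size = int(len(feature_set) * dev_fraction)
    cut = len(feature_set) - dev_size
    return feature_set[:cut], feature_set[cut:]

# Ten made-up labeled featuresets
toy_features = [({'w%d' % i: True}, 'relevant' if i % 2 else 'irrelevant')
                for i in range(10)]
train_part, dev_part = split_train_dev(toy_features)
```

The development part is held back from training so that it can be used to check the classifier on data it has not seen.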

In [None]:
# Function to extract features from tweets
def extractFeatures(train_file):
    # featureSet is module-level, so rebind it explicitly
    global featureSet
    featureSet = []
    # Iterate through the Twitter data csv files by tweet text
    with open(current_dir + '/../' + train_file + '.csv') as csvfile:
        tweetIt = csv.DictReader(csvfile)
        # Retrieve terms in tweets
        for twitterData in tweetIt:
            # Send the tweet text to the function for removing unnecessary characters
            tweetText = cleanUpTweet(twitterData['tweet_text'])
            # Keep the labeled terms that occur in this tweet, in the
            # (feature dict, label) form that NLTK classifiers expect
            tweetWords = set(tweetText.split())
            featureSet.extend(({term.lower(): True}, relevance)
                              for (term, relevance) in labeledTweetDict
                              if term.lower() in tweetWords)
        # End for
    # End with
# End extractFeatures

# Function for training the classifier
def trainClassifier():
    # Rebind the module-level names so other functions can use the results
    global trainSet, classifierNB
    # Establish the training set
    # Add dev set assignment here
    trainSet = featureSet

    # Train the Naive Bayes (NB) classifier
    classifierNB = nltk.NaiveBayesClassifier.train(trainSet)
# End trainClassifier
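
For reference, `nltk.NaiveBayesClassifier.train` expects a list of (feature dictionary, label) pairs, and `classify` expects a feature dictionary. The toy data below is invented for illustration and only shows the shape of the input; it is not the author's training data.

```python
import nltk

# Invented toy training data: one presence feature per labeled term
toy_train = [
    ({'marathon': True}, 'relevant'),
    ({'explosion': True}, 'relevant'),
    ({'coffee': True}, 'irrelevant'),
    ({'lunch': True}, 'irrelevant'),
]

toy_classifier = nltk.NaiveBayesClassifier.train(toy_train)

# Classify a tweet represented by the words it contains
label = toy_classifier.classify({'marathon': True, 'explosion': True})
```

With real data, `nltk.classify.accuracy(classifier, dev_set)` is one way to score the trained classifier against a held-out development set.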

The classifyCSV(test_file) function below cleans and classifies a data set in CSV format, while the isRelevant(tweet_text) function classifies a single tweet string. The classifier must be trained with the previously defined functions before either function is called.

In [None]:
# Function to classify input test data, csv file format
def classifyCSV(test_file):
    # Classify input test data
    # Create object for writing to a text file
    tweetResultsFile = open(resultsPath, "w")
    # Iterate through the Twitter data csv files by tweet text
    with open(current_dir + '/../' + test_file + '.csv') as csvfile:
        tweetIt = csv.DictReader(csvfile)
        # Retrieve terms in tweets
        for twitterData in tweetIt:
            # Send the tweet text to the function for removing unnecessary characters
            tweetText = cleanUpTweet(twitterData['tweet_text'])
            # NLTK classifiers take a feature dict, not a raw string
            tweetFeatures = {word: True for word in tweetText.split()}
            # Send the results of the classifier to a txt file, one label per line
            tweetResultsFile.write(classifierNB.classify(tweetFeatures) + '\n')
        # End for
    # End with
    # Close file
    tweetResultsFile.close()
# End classifyCSV

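# Function to classify a single cleaned tweet string
def isRelevant(tweet_text):
    # NLTK classifiers take a feature dict, not a raw string
    tweetFeatures = {word: True for word in tweet_text.split()}
    # Return the label assigned by the classifier
    return classifierNB.classify(tweetFeatures)
# End isRelevant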
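
NLTK classifiers take a dictionary of features rather than a raw string, so a tweet is typically converted before it is classified. A minimal sketch, assuming a bag-of-words presence representation (the helper name `tweet_features` is hypothetical):

```python
def tweet_features(tweet_text):
    """Map every distinct word in the tweet to True (presence features)."""
    return {word: True for word in tweet_text.split()}

features = tweet_features('boston marathon boston')
# features == {'boston': True, 'marathon': True}
```

Repeated words collapse into a single entry, so this representation records whether a word occurs, not how often.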


The main() method runs the script and classifies a set of Twitter data. It asks the user for the training and test data files and for the two (at this time) class files, which are text files of "labeled" (organized by file) tweets.

In [None]:
# The main method
def main():
    # Request user input of text class files
    inputClassFile1 = raw_input("Enter the first class feature set txt file name...\nEx: relevantTraining.txt")
    inputClassFile2 = raw_input("Enter the second class feature set txt file name...\nEx: irrelevantTraining.txt")

    # Initialize the classifier dictionary based on relevant features
    initDictSet(inputClassFile1, inputClassFile2)

    # Request user input of the file name of train/dev data to be processed
    inputTrainFile = raw_input("Enter training data set csv file name...\nEx: cleaned_geo_tweets_Apr_12_to_22")
    # Request file name of data to be classified
    inputTestFile = raw_input("Enter test data set csv file name...\nEx: cleaned_geo_tweets_Apr_12_to_22")
    
    # Extract features and train the NB classifier using input training data
    extractFeatures(inputTrainFile)
    trainClassifier()
    
    # Classify the input test data, csv file format
    classifyCSV(inputTestFile)
# End main

# Run the script via the main method
if __name__ == "__main__":
    main()
    
# End script