# Unfinished, too slow. See the scikit-learn version
### Naive Bayes Classifier sentiment analysis with NLTK

In this notebook, a Naive Bayes classifier is used to classify French and German tweets. In the provided data, a sentiment analysis has already been done on these tweets, but only using the smileys. Here, the tweets classified that way are used as training data to classify the remaining tweets, which contain no smileys and were therefore labeled neutral. One problem is that we have no tweets that were labeled neutral BECAUSE they are neutral: most neutral tweets are in fact unclassified tweets. Because of class imbalance, we only consider a subset of the positive tweets (as many as there are negative tweets). This code uses NLTK, which is extremely SLOW compared to other libraries such as scikit-learn. We learned this the hard way when running it.

Import the libraries

In [1]:
import pandas as pd
import numpy as np
import nltk
import re
from urllib.parse import urlparse
from nltk.tokenize.casual import TweetTokenizer
from nltk.stem.snowball import FrenchStemmer, GermanStemmer
from nltk.corpus import stopwords

Use the Snowball algorithm (http://snowballstem.org/) to stem French and German words. A casual tokenizer is used because it is better suited to Twitter data. Stopwords are also loaded for both languages. A simple regular expression is compiled to detect links.

In [2]:
fr_stemmer = FrenchStemmer()
de_stemmer = GermanStemmer()
tokenizer = TweetTokenizer(strip_handles=True)
stop = stopwords.words('french') + stopwords.words('german')
#re_url = re.compile('(http[s]?://)?(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
# Raw string, so that \S and \. reach the regex engine without invalid-escape warnings
re_url = re.compile(r'^(http[s]?://)?[\S]+\.\S[\S]+[/?#][\S]+$')
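
As a quick sanity check of what the active pattern accepts, the sketch below runs it against a few tokens. The sample tokens are made up for illustration and are not taken from the tweet data. Note that the pattern requires a "/", "?" or "#" followed by at least one more character after the domain, so a bare http://example.com is not treated as a link.

```python
import re

# Same pattern as in the cell above, written as a raw string.
re_url = re.compile(r'^(http[s]?://)?[\S]+\.\S[\S]+[/?#][\S]+$')

# Illustrative tokens only; they are not taken from the tweet data.
samples = ['https://t.co/abc123', 'www.example.com/page', 'http://example.com', 'bonjour']

# Map each token to whether the pattern recognises it as a link.
matches = {token: bool(re_url.match(token)) for token in samples}

for token, is_url in matches.items():
    print(token, is_url)
```

If bare domains should also be filtered out, the pattern would need to make the path part optional; that change is not made here.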

Let's now define some useful functions to make the code easier to read.

Keep a word only if it is not a stopword, not a URL, and longer than one character

In [3]:
def filter_word(word):
    return word not in stop and not re_url.match(word) and len(word) > 1
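
Because stop is a plain list, every "word not in stop" test scans it linearly. A possible variant, sketched here with a tiny stand-in stopword list (an assumption made to keep the example self-contained), stores the stopwords in a set so that each lookup takes constant time on average.

```python
import re

# Stand-in for stopwords.words('french') + stopwords.words('german');
# the real lists are much longer. A set gives O(1) average lookups.
stop = {'le', 'la', 'et', 'der', 'die', 'und'}

# Same URL pattern as in the notebook.
re_url = re.compile(r'^(http[s]?://)?[\S]+\.\S[\S]+[/?#][\S]+$')

def filter_word(word):
    """Return True if the word should be kept."""
    return word not in stop and not re_url.match(word) and len(word) > 1

# Made-up tokens for illustration.
tokens = ['le', 'film', 'et', 'https://t.co/abc123', '!', 'super']
kept = [t for t in tokens if filter_word(t)]
print(kept)  # ['film', 'super']
```

In the notebook itself, the equivalent change would be to wrap the concatenated stopword lists in set(...).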

Stem the word after removing a possible hash sign

In [4]:
def clean_word(stemmer, word):
    return stemmer.stem(word.replace('#', ''))

Convert a list of lists of items to a list of items (2D -> 1D)

In [5]:
def flatten(lst):
    return [item for sublist in lst for item in sublist]

Transform a tweet into a list of cleaned, stemmed words

In [6]:
def process(x, stemmer):
    return [clean_word(stemmer, word) for word in tokenizer.tokenize(x) if filter_word(word)]

This function generates another function, which is used to find which features are present in a tweet.

In [7]:
def generate_callback(features):
    def extract_features(text):
        extract = {}#[None]*len(features)
        for feature in features:
            extract[feature] = feature in text
        return extract
    
    return extract_features
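
To make the shape of the result concrete, here is a small sketch that applies the same idea to a three-word vocabulary (the words are made up for illustration). Each tweet becomes a dictionary with one boolean entry per vocabulary word, so the cost per tweet grows with the size of the whole vocabulary, which is probably a large part of why this approach is slow. The variant below converts the tokens to a set first; that change is an assumption of this sketch and not part of the original cell.

```python
def generate_callback(features):
    def extract_features(tokens):
        # Set membership is O(1) on average; the original cell tests
        # `feature in text` directly on the token list.
        present = set(tokens)
        return {feature: feature in present for feature in features}

    return extract_features

# Hypothetical vocabulary and tokenized tweet, for illustration only.
extract = generate_callback(['bon', 'mauvais', 'film'])
result = extract(['bon', 'film', 'hier'])
print(result)  # {'bon': True, 'mauvais': False, 'film': True}
```

Words that are not in the vocabulary ('hier' here) are simply ignored.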

Extract the sentiment given the features. If none of the known features is present in the sample, return 'NEUTRAL' instead of asking the classifier.

In [62]:
def extract_sentiment(classifier, features):
    if True in features.values():
        return classifier.classify(features)
    return 'NEUTRAL'
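
The fallback can be checked without training anything. The sketch below uses a hypothetical stub class in place of a trained NLTK classifier; only its classify method matters here.

```python
class AlwaysPositive:
    """Hypothetical stand-in for a trained nltk.NaiveBayesClassifier."""

    def classify(self, features):
        return 'POSITIVE'

def extract_sentiment(classifier, features):
    # Only consult the classifier if at least one known feature is present.
    if True in features.values():
        return classifier.classify(features)
    return 'NEUTRAL'

clf = AlwaysPositive()
with_known_word = extract_sentiment(clf, {'bon': True, 'mauvais': False})
without_known_word = extract_sentiment(clf, {'bon': False, 'mauvais': False})
print(with_known_word, without_known_word)  # POSITIVE NEUTRAL
```

With this rule, a tweet that shares no word with the training vocabulary keeps its NEUTRAL label.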

Parse one month of data from a CSV file for a given language. Tweets are split by sentiment into three pandas DataFrames, which are appended to the DataFrames passed in (if any).

In [8]:
def parse_month(curr_pos, curr_neg, curr_neu, lang, month):
    raw = pd.read_csv('processed/tweets_non-en_msg_{}.csv'.format(month), escapechar='\\')
    raw.columns=['source_location', 'lang', 'main', 'sentiment']

    filtered = raw[raw.lang == lang]

    pos = filtered[filtered.sentiment == 'POSITIVE']
    neg = filtered[filtered.sentiment == 'NEGATIVE']
    neu = filtered[filtered.sentiment == 'NEUTRAL']
    
    stemmer = fr_stemmer if lang == 'fr' else de_stemmer if lang == 'de' else None

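    # Keep source_location as well: it is selected later when building the <lang>_all lists
    pos = pos[['sentiment', 'source_location']].assign(tokenized=pos['main'].apply(lambda x: process(x, stemmer)))
    neg = neg[['sentiment', 'source_location']].assign(tokenized=neg['main'].apply(lambda x: process(x, stemmer)))
    neu = neu[['sentiment', 'source_location']].assign(tokenized=neu['main'].apply(lambda x: process(x, stemmer)))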
    
    if curr_pos is None:
        return (pos, neg, neu)
    else:
        return (pd.concat([curr_pos, pos], copy=False),
                pd.concat([curr_neg, neg], copy=False),
                pd.concat([curr_neu, neu], copy=False))

Load the data from the CSVs. Store the tweets of different languages in different DataFrames. The positive tweets are truncated to the number of negative tweets to balance the classes. The <lang>_all variables hold the positive and negative tweets for a specific language as a list of tuples; tweets that are empty after processing are removed.

In [9]:
fr_pos = None
fr_neg = None
fr_neu = None
de_pos = None
de_neg = None
de_neu = None

months = [
    'january',
    'february',
    'march',
    'april',
    'may',
    'june',
    'july',
    'august',
    'september',
    'october'
]

# fr
for month in months:
    fr_pos, fr_neg, fr_neu = parse_month(fr_pos, fr_neg, fr_neu, 'fr', month)

fr_pos = fr_pos[0:len(fr_neg)]
fr_all = pd.concat([fr_pos, fr_neg], copy=False)
fr_all = [tuple(x) for x in fr_all[fr_all.astype(str).tokenized != '[]'][['tokenized', 'sentiment', 'source_location']].values]
    
# de
for month in months:
    de_pos, de_neg, de_neu = parse_month(de_pos, de_neg, de_neu, 'de', month)

de_pos = de_pos[0:len(de_neg)]
de_all = pd.concat([de_pos, de_neg], copy=False)
de_all = [tuple(x) for x in de_all[de_all.astype(str).tokenized != '[]'][['tokenized', 'sentiment', 'source_location']].values]

In [10]:
fr_pos_freq = nltk.FreqDist(flatten(fr_pos.tokenized.tolist()))
fr_neg_freq = nltk.FreqDist(flatten(fr_neg.tokenized.tolist()))
de_pos_freq = nltk.FreqDist(flatten(de_pos.tokenized.tolist()))
de_neg_freq = nltk.FreqDist(flatten(de_neg.tokenized.tolist()))

fr_all_words = list(set(fr_pos_freq.keys()) | set(fr_neg_freq.keys()))
de_all_words = list(set(de_pos_freq.keys()) | set(de_neg_freq.keys()))

fr_extract = generate_callback(fr_all_words)
de_extract = generate_callback(de_all_words)

fr_training_set = nltk.classify.apply_features(fr_extract, fr_all)
de_training_set = nltk.classify.apply_features(de_extract, de_all)
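
Since fr_all_words and de_all_words contain every distinct stem, each tweet gets one feature per vocabulary word. One common way to reduce this, sketched below, is to keep only the N most frequent words as features. The sketch uses collections.Counter (which nltk.FreqDist subclasses) so that it runs without NLTK. The token lists and the cut-off of 3 are made up for illustration, and whether such a cut-off preserves accuracy on this data has not been checked.

```python
from collections import Counter

# Made-up tokenized tweets standing in for fr_pos.tokenized / fr_neg.tokenized.
pos_tokens = [['bon', 'film'], ['bon', 'jour'], ['super', 'film']]
neg_tokens = [['mauvais', 'film'], ['nul']]

def flatten(lst):
    return [item for sublist in lst for item in sublist]

# Combined word counts over both classes.
freq = Counter(flatten(pos_tokens)) + Counter(flatten(neg_tokens))

# Keep only the most frequent words as features (3 is an arbitrary cut-off).
top_words = [word for word, _ in freq.most_common(3)]
print(top_words)
```

The resulting list could then be passed to generate_callback in place of the full vocabulary.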

In [11]:
fr_classifier = nltk.NaiveBayesClassifier.train(fr_training_set)
#de_classifier = nltk.NaiveBayesClassifier.train(de_training_set)

In [67]:
#fr_neu_back = fr_neu

print(len(fr_neu.sentiment[fr_neu.sentiment == 'NEUTRAL']))
fr_neu['sentiment'] = fr_neu['tokenized'].apply(lambda x: extract_sentiment(fr_classifier, fr_extract(x)))
print(len(fr_neu.sentiment[fr_neu.sentiment == 'NEUTRAL']))

2007354


KeyboardInterrupt: 

In [None]:
fr_neu