# Text Analytics over a Twitter dataset

## Classify best and worst tweets

### Step one: from all the tweets, create the Training Set

First, load the data of a single group into the DataFrame `tweets`

In [None]:
import pandas

pandas.options.display.max_colwidth = 500
tweets = pandas.read_csv('twitter/g2.txt', sep='\t', header=0)

Alternatively, load the data from all the groups and concatenate them into a single DataFrame

```Python
import pandas

pandas.options.display.max_colwidth = 500
g1 = pandas.read_csv('twitter/g1.txt', sep='\t', header=0)
g2 = pandas.read_csv('twitter/g2.txt', sep='\t', header=0)
g3 = pandas.read_csv('twitter/g3.txt', sep='\t', header=0)
g4 = pandas.read_csv('twitter/g4.txt', sep='\t', header=0)
g5 = pandas.read_csv('twitter/g5.txt', sep='\t', header=0)

groups = [g1,g2,g3,g4,g5]

tweets = pandas.concat(groups, ignore_index=True)
```

Remove any content that is not in English (this could be done for any language)

In [None]:
drop_index = []
for i,tweet in tweets.iterrows():
    if('lang="en"' not in tweet['FULL_TEXT_HTML']):
        drop_index.append(i)
        
tweets = tweets.drop(drop_index)
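
As an illustration of the language filter, here is a minimal sketch that applies the same `lang="en"` check to a toy DataFrame with a vectorized string test instead of a row loop. The column names match the ones above, but the rows are invented, and treating missing HTML as non-English (`na=False`) is an assumption rather than something the dataset prescribes.

```python
import pandas

# Toy rows, invented for illustration only
toy = pandas.DataFrame({
    'FULL_TEXT_HTML': ['<p lang="en">Hello</p>', '<p lang="it">Ciao</p>', None],
    'FULL_TEXT': ['Hello', 'Ciao', 'missing'],
})

# regex=False matches the pattern literally;
# na=False counts rows without HTML as non-English (an assumption)
is_english = toy['FULL_TEXT_HTML'].str.contains('lang="en"', regex=False, na=False)
english_only = toy[is_english]

print(english_only['FULL_TEXT'].tolist())
```

Like `drop`, the boolean mask keeps the original index labels of the surviving rows.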

Sort the tweets by likes, then by retweets; take the top 20% as `good_set` and the bottom 20% as `bad_set`. <br>
Finally, tag each tweet in the two sets as `good` or `bad`, then join the sets and shuffle them.

In [None]:
import random

tweets = tweets.sort_values(by=['NLIKE','NRETWEET'], ascending=False)
set_size = int(len(tweets)*0.2)

good_set = tweets['FULL_TEXT'].head(set_size)
bad_set = tweets['FULL_TEXT'].tail(set_size)

good_set = [ (i, 'good') for i in good_set]
bad_set = [ (i, 'bad') for i in bad_set]

training_data = good_set + bad_set
random.shuffle(training_data)
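
To make the split easier to check, the following sketch wraps the same steps in a helper and runs it on five made-up tweets. The function name `label_extremes` and its `seed` parameter are hypothetical and only serve this example.

```python
import random
import pandas

def label_extremes(frame, fraction=0.2, seed=None):
    """Tag the top and bottom `fraction` of rows, ranked by likes then retweets."""
    ranked = frame.sort_values(by=['NLIKE', 'NRETWEET'], ascending=False)
    size = int(len(ranked) * fraction)
    good = [(text, 'good') for text in ranked['FULL_TEXT'].head(size)]
    bad = [(text, 'bad') for text in ranked['FULL_TEXT'].tail(size)]
    labelled = good + bad
    # A seeded generator makes the shuffle reproducible
    random.Random(seed).shuffle(labelled)
    return labelled

# Toy data, invented for illustration only
toy = pandas.DataFrame({
    'FULL_TEXT': ['a', 'b', 'c', 'd', 'e'],
    'NLIKE': [10, 50, 0, 7, 3],
    'NRETWEET': [1, 9, 0, 2, 0],
})

labelled = label_extremes(toy, fraction=0.2, seed=0)
print(sorted(labelled))
```

With five rows and a fraction of 0.2, each set holds exactly one tweet: the most liked one is tagged `good` and the least liked one `bad`.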

### Step two: clean the Training Set

Extract all the tokens along with their respective tweet's tag

In [None]:
import nltk
tokens = [(nltk.word_tokenize(tweet),tag) for tweet,tag in training_data]

Apply stemming and lemmatization to the tokens

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

wordnet_lemmatizer = WordNetLemmatizer()
porter_stemmer = PorterStemmer()

stemmed = [porter_stemmer.stem(word) for t in tokens for word in t[0]]
lemmatized = [wordnet_lemmatizer.lemmatize(word) for word in stemmed]

Remove stopwords, digits (not numbers) and punctuation, then create a `vocabulary` of the most frequent words

In [None]:
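from nltk.corpus import stopwords
import string

# Use a separate name so the imported `stopwords` corpus is not overwritten
# (otherwise re-running this cell fails)
stop_words = set(stopwords.words('english')
                 + list(string.punctuation)
                 + list(string.digits)
                 + ['“', '”', '’', '‘', '–', '…'])

words = [word.lower() for word in lemmatized if word not in stop_words]

# Keep the 3000 most frequent words as the vocabulary
vocabulary = [w[0] for w in nltk.FreqDist(words).most_common(3000)]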
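
The vocabulary step can be tried in isolation. `nltk.FreqDist` subclasses `collections.Counter`, so this sketch uses `Counter` directly on a handful of made-up tokens; the names `toy_tokens`, `toy_stop_words` and `toy_vocabulary` are hypothetical.

```python
from collections import Counter

# Made-up tokens and a tiny stand-in for the full stop-word set
toy_tokens = ['museum', 'the', 'art', 'museum', '!', 'art', 'museum', 'the', 'visit']
toy_stop_words = {'the', '!'}

# Filter first, then keep the most frequent remaining words
toy_words = [w.lower() for w in toy_tokens if w.lower() not in toy_stop_words]
toy_vocabulary = [word for word, count in Counter(toy_words).most_common(2)]

print(toy_vocabulary)
```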

### Step three: generate sets for training and testing the algorithm

Create a Feature Set from `tokens` and `vocabulary`

In [None]:
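def docFeatures(document):
    # Normalize the tweet's tokens the same way as the vocabulary (stemmed,
    # lemmatized, lower-cased); raw tokens would miss many vocabulary entries
    doc_words = set(wordnet_lemmatizer.lemmatize(porter_stemmer.stem(word)).lower()
                    for word in document)
    features = {}
    for word in vocabulary:
        features['contains({})'.format(word)] = (word in doc_words)
    return features

feature_set = [(docFeatures(tweet), tag) for tweet,tag in tokens]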
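
To see what a single entry of the feature set looks like, here is a small self-contained sketch. The helper `doc_features` takes the vocabulary as an argument instead of reading a global, which is a choice made for this example only; the tokens and vocabulary are invented.

```python
def doc_features(document, vocabulary):
    """Map each vocabulary word to whether it occurs in the document."""
    doc_words = set(document)
    return {'contains({})'.format(word): (word in doc_words) for word in vocabulary}

toy_vocabulary = ['museum', 'art', 'ticket']
toy_features = doc_features(['free', 'art', 'at', 'the', 'museum'], toy_vocabulary)

print(toy_features)
```

Every document gets one boolean per vocabulary word, so all feature dictionaries share the same keys regardless of tweet length.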

Now we can finally train the Naive Bayes classifier on 80% of the feature set and evaluate it on the remaining 20%, our Test Set

In [None]:
x = int(len(feature_set)*0.2)

train_set, test_set = feature_set[x:], feature_set[:x]

classifier = nltk.NaiveBayesClassifier.train(train_set)

print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(20)
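
The same split, train and evaluate sequence can be tried on a tiny hand-made feature set. The sketch below assumes NLTK is installed; the feature names and labels are invented, and the data is deliberately separable so the outcome is easy to predict.

```python
import nltk

# Ten toy examples: "good" tweets contain "free", "bad" tweets contain "closed"
toy_feature_set = []
for _ in range(5):
    toy_feature_set.append(({'contains(free)': True, 'contains(closed)': False}, 'good'))
    toy_feature_set.append(({'contains(free)': False, 'contains(closed)': True}, 'bad'))

# First 20% for testing, the rest for training
x = int(len(toy_feature_set) * 0.2)
toy_train, toy_test = toy_feature_set[x:], toy_feature_set[:x]

toy_classifier = nltk.NaiveBayesClassifier.train(toy_train)
toy_accuracy = nltk.classify.accuracy(toy_classifier, toy_test)
toy_label = toy_classifier.classify({'contains(free)': True, 'contains(closed)': False})

print(toy_accuracy, toy_label)
```

On real tweets the accuracy will be far less clear-cut; the toy data only shows how the pieces fit together.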

# Sentiment Analysis classifier

This time we use NLTK's `twitter_samples` corpus to train the classifier, then try to classify our tweets from the museums.

If the corpus has not been downloaded yet:

```Python
import nltk
nltk.download('twitter_samples')
```

### Step One: get the tweets and define tokens and training data

Gather the tweets and tokens from both the positive and negative set

In [None]:
import nltk
from nltk.corpus import twitter_samples 

positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

pos_tokens = twitter_samples.tokenized('positive_tweets.json')
neg_tokens = twitter_samples.tokenized('negative_tweets.json')

Tag each tweet as `pos` or `neg`

In [None]:
positive_tweets = [ (i, 'pos') for i in positive_tweets]
negative_tweets = [ (i, 'neg') for i in negative_tweets]

training_data = positive_tweets + negative_tweets

pos_tokens = [ (i, 'pos') for i in pos_tokens]
neg_tokens = [ (i, 'neg') for i in neg_tokens]

tokens = pos_tokens + neg_tokens

Apply Stemming on each token

In [None]:
from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()

pos_stemmed = [porter_stemmer.stem(word) 
               for t in pos_tokens for word in t[0]]
neg_stemmed = [porter_stemmer.stem(word) 
               for t in neg_tokens for word in t[0]]

And then Lemmatization

In [None]:
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

pos_lemmatized = [wordnet_lemmatizer.lemmatize(word)
                  for word in pos_stemmed]
neg_lemmatized = [wordnet_lemmatizer.lemmatize(word) 
                  for word in neg_stemmed]

Finish the cleaning process by removing stopwords, punctuation, digits, links and mentions

In [None]:
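from nltk.corpus import stopwords
import string, re

lemmatized = pos_lemmatized + neg_lemmatized
# Use a separate name so the imported `stopwords` corpus is not overwritten
stop_words = set(stopwords.words('english')
                 + list(string.punctuation)
                 + list(string.digits))

# Raw strings avoid invalid-escape warnings for `\(` and `\)`
url_re = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'
                    r'(?:%[0-9a-fA-F][0-9a-fA-F]))+')
mention_re = re.compile(r'(@[A-Za-z0-9_]+)')

words = []
for word in lemmatized:
    # re.sub returns a new string, so the result has to be kept
    word = mention_re.sub('', url_re.sub('', word.lower()))
    # Skip tokens that were only a link or a mention, and stop words
    if word and word not in stop_words:
        words.append(word)

vocabulary = [w[0] for w in nltk.FreqDist(words).most_common(5000)]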

### Step Two: define the feature set from our Twitter Groups

Import the tweets, tokenize...

In [None]:
import pandas
import random
import nltk

pandas.options.display.max_colwidth = 500
g1 = pandas.read_csv('twitter/g1.txt', sep='\t', header=0)
g2 = pandas.read_csv('twitter/g2.txt', sep='\t', header=0)
g3 = pandas.read_csv('twitter/g3.txt', sep='\t', header=0)
g4 = pandas.read_csv('twitter/g4.txt', sep='\t', header=0)
g5 = pandas.read_csv('twitter/g5.txt', sep='\t', header=0)

groups = [g1]
#groups = [g1,g2,g3,g4,g5]

tweets = pandas.concat(groups, ignore_index=True)

# Keep English tweets
drop_index = []
for i,tweet in tweets.iterrows():
    if('lang="en"' not in tweet['FULL_TEXT_HTML']):
        drop_index.append(i)
        
tweets = tweets.drop(drop_index)

testing_data = tweets['FULL_TEXT']

testing_tokens = [(nltk.word_tokenize(tweet)) for tweet in testing_data]

...And create a feature set using the vocabulary defined in Step One

In [None]:
def docFeatures(document):
    doc_words = set(document)
    features = {}
    for word in vocabulary:
        features['contains({})'.format(word)] = (word in doc_words)
    return features

testing_feature_set = [(docFeatures(tweet)) for tweet in testing_tokens]

### Step Three: train and test the classifier

Handle negation and create unigram features (applying a minimum frequency of 10)

In [None]:
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import mark_negation, extract_unigram_feats

sent_analyzer = SentimentAnalyzer()

negation_words = sent_analyzer.all_words([mark_negation(doc) for doc in tokens])
unigram_features = sent_analyzer.unigram_word_feats(negation_words, min_freq=10)
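
To see what negation handling does to a tokenized sentence, here is a short sketch using `mark_negation` on its own, assuming NLTK is installed. The sentence is only an illustration.

```python
from nltk.sentiment.util import mark_negation

sentence = "I didn't like this movie . It was bad .".split()

# Tokens between a negation word and the next clause-ending punctuation
# mark receive a "_NEG" suffix; the input list is left unchanged
marked = mark_negation(sentence)

print(marked)
```

Because `like` and `like_NEG` become distinct unigrams, the classifier can learn different weights for a word depending on whether it appears inside a negated clause.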

Create the Feature Set and split it for training and testing the classifier

In [None]:
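import random

x = int(len(training_data)*0.2)

sent_analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_features)

# `tokens` lists every positive tweet before the negative ones; shuffle a copy
# so that the test split contains both classes
shuffled_tokens = list(tokens)
random.shuffle(shuffled_tokens)

feature_sets = sent_analyzer.apply_features(shuffled_tokens)
train_set, test_set = feature_sets[x:], feature_sets[:x]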

Finally, train the Naive Bayes classifier and print all the evaluation metrics

In [None]:
classifier = sent_analyzer.train(nltk.classify.NaiveBayesClassifier.train, train_set)
for key,value in sorted(sent_analyzer.evaluate(test_set).items()):
    print('{0}: {1}'.format(key, value))
    
classifier.show_most_informative_features(20)

### Step Four: verify the classifier

Run the classifier on our `testing_feature_set`

In [None]:
confidence = [classifier.prob_classify(f) for f in testing_feature_set]
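
Note that the classifier was trained on features produced by the analyzer's unigram extractor, while `testing_feature_set` is built from the stemmed `vocabulary`, so the two sets of feature names only partly overlap. One possible alternative, sketched here on toy data, is to extract features for new tweets with the analyzer itself through `extract_features`. The analyzer `toy_analyzer`, its unigram list and the tweet are all hypothetical.

```python
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import extract_unigram_feats

# Hypothetical analyzer with a tiny hand-picked unigram list
toy_analyzer = SentimentAnalyzer()
toy_analyzer.add_feat_extractor(extract_unigram_feats, unigrams=['great', 'boring'])

# Features for an unseen, already tokenized tweet, using the same extractor
toy_tweet = ['a', 'great', 'museum']
toy_tweet_features = toy_analyzer.extract_features(toy_tweet)

print(toy_tweet_features)
```

In the notebook this would correspond to calling `sent_analyzer.extract_features` on each entry of `testing_tokens`, so that the feature names match the ones seen during training.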

Print the first 10 results

In [None]:
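for i in range(10):
    # Use positional access: after dropping non-English rows the index has gaps,
    # so testing_data[i] could raise a KeyError
    print(confidence[i].max(),
          confidence[i].prob(confidence[i].max()),
          '\n' + testing_data.iloc[i] + '\n')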

### Final Note
In the first part of this notebook we saw how to build a classifier: how to normalize the data, extract the features, and split the feature set for training and testing.
<br>
The second part focuses on sentiment analysis, trying to classify the attitude (positive or negative) of the tweets we had. Doing so required training the model on labelled data, which NLTK provides through its `twitter_samples` corpus.
After training the classifier and extracting our `testing_feature_set`, we could finally run the classification. Although those tweets were not collected for sentiment analysis, this second task was an interesting way to show what these features can do.