# Data/Text mining course project
### Twitter sentiment classification
By Lukas Lönnroth & Wille Strengell

This is our submission for the course project.

In this project we build an NLTK classifier using the Naive Bayes algorithm, inspired by this [kernel](https://www.kaggle.com/ngyptr/python-nltk-sentiment-analysis), and train it on the [First GOP Debate Twitter Sentiment](https://www.kaggle.com/crowdflower/first-gop-debate-twitter-sentiment) dataset.

After that we apply it to data that we gathered from Twitter ourselves and output a file with a column showing the sentiment of each tweet.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud
import matplotlib.pyplot as plt

### Preparing our training data
We use the data from the Sentiment.csv file to train our classifier so that we can apply it to our own data. We read the file with pandas, keep the columns we need (sentiment and text), drop the tweets labelled Neutral, and keep only the first 100 rows as an example.

In [None]:
sentiment = pd.read_csv('../input/first-gop-debate-twitter-sentiment/Sentiment.csv')
training_data = sentiment[['sentiment', 'text']]
training_data = training_data[training_data.sentiment != "Neutral"]
training_data = training_data[:100]

Then we need to remove stop words from our data. We can use the NLTK stopwords corpus for this. We also drop words shorter than three characters, links, mentions and hashtags.

In [None]:
tweets = []
stopwords_set = set(stopwords.words("english"))

for index, row in training_data.iterrows():
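    # Lowercase every word and drop words shorter than three characters
    words_filtered = [e.lower() for e in row.text.split() if len(e) >= 3]
    # Drop links, mentions, hashtags and the retweet marker
    # (compared in lowercase, because the words were lowercased above)
    words_cleaned = [word for word in words_filtered
        if 'http' not in word
        and not word.startswith('@')
        and not word.startswith('#')
        and word != 'rt']
    words_without_stopwords = [word for word in words_cleaned if word not in stopwords_set]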
    tweets.append((words_without_stopwords, row.sentiment))
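
To make the filtering steps concrete, here is a minimal sketch that applies the same rules to a single made-up tweet. The `clean_tokens` helper, the sample text and the tiny stop-word set are our own illustrations (the notebook itself uses the full NLTK stop-word list), so treat the output as an example rather than real data.

```python
# Hypothetical stand-in for stopwords.words("english"), kept tiny so the
# sketch runs without downloading the NLTK corpus.
SAMPLE_STOPWORDS = {"the", "was", "and", "this"}

def clean_tokens(text, stopwords_set=SAMPLE_STOPWORDS):
    """Apply the same filters as the loop above to one tweet."""
    # Keep tokens of at least three characters, lowercased
    words = [w.lower() for w in text.split() if len(w) >= 3]
    # Drop links, mentions and hashtags
    words = [w for w in words
             if 'http' not in w
             and not w.startswith('@')
             and not w.startswith('#')]
    # Drop stop words
    return [w for w in words if w not in stopwords_set]

sample = "RT @someone: The debate was great and fun #GOPDebate https://t.co/xyz"
print(clean_tokens(sample))  # ['debate', 'great', 'fun']
```

Note that "RT" disappears because of the length filter, before any of the other checks run.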

#### Training our classifier

Extracting our word features:

In [None]:
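def get_words_in_tweets(tweets):
    # Collect every word from every (words, sentiment) pair into one list
    all_words = []
    for (words, sentiment) in tweets:
        all_words.extend(words)
    return all_words

def get_word_features(wordlist):
    # The keys of the frequency distribution are the distinct words
    wordlist = nltk.FreqDist(wordlist)
    features = wordlist.keys()
    return features

w_features = get_word_features(get_words_in_tweets(tweets))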

def extract_features(document):
    document_words = set(document)
    features = {}
    for word in w_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
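
As an illustration of what the feature extractor produces, the sketch below runs the same logic on a toy vocabulary. `extract_features_toy` and `toy_features` are hypothetical names; the real `w_features` comes from the training tweets.

```python
# Toy vocabulary standing in for w_features
toy_features = ['great', 'debate', 'terrible']

def extract_features_toy(document, word_features=toy_features):
    """Same logic as extract_features, with the vocabulary passed in explicitly."""
    document_words = set(document)
    return {'contains(%s)' % word: (word in document_words)
            for word in word_features}

print(extract_features_toy(['great', 'debate', 'tonight']))
# {'contains(great)': True, 'contains(debate)': True, 'contains(terrible)': False}
```

Words that are not in the vocabulary ("tonight" here) are simply ignored, so every tweet maps to a dictionary with the same keys.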


Training our classifier. As an example we use only 100 tweets, which we split in half into training and test data.

In [None]:
featuresets = nltk.classify.apply_features(extract_features,tweets)
len(featuresets)

In [None]:
training_set = featuresets[50:]
test_set = featuresets[:50]
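
The split above takes the first 50 tweets for testing and the rest for training, in file order. If the file happens to be ordered by topic or time, that could bias the result, and a shuffled split is one common alternative. A minimal sketch, where `shuffled_split`, the seed and the placeholder items are our own illustrative choices:

```python
import random

def shuffled_split(items, test_fraction=0.5, seed=42):
    """Return (training, test) lists after shuffling a copy of items."""
    items = list(items)                  # copy, so the caller's data keeps its order
    random.Random(seed).shuffle(items)   # fixed seed makes the split repeatable
    n_test = int(len(items) * test_fraction)
    return items[n_test:], items[:n_test]

# Placeholder data; in the notebook this would be the (words, sentiment) pairs
examples = [('tweet %d' % i, 'Positive' if i % 2 else 'Negative') for i in range(100)]
training_examples, test_examples = shuffled_split(examples)
print(len(training_examples), len(test_examples))  # 50 50
```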

In [None]:
naive_classifier = nltk.NaiveBayesClassifier.train(training_set)

In [None]:
print(nltk.classify.accuracy(naive_classifier, test_set))
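
An accuracy figure is easier to interpret next to a simple baseline. One such baseline, sketched here under our own assumptions, always predicts the most common label in the training data. `majority_baseline_accuracy` and the toy labels are hypothetical and not part of the original notebook.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return hits / len(test_labels)

train_labels = ['Negative'] * 7 + ['Positive'] * 3
test_labels = ['Negative'] * 6 + ['Positive'] * 4
print(majority_baseline_accuracy(train_labels, test_labels))  # 0.6
```

In the notebook the labels could be taken from the second element of each tuple in `tweets`.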

#### Gathering our own data

We gathered our own data from Twitter using a Python script and tweepy. The file `brexit-26-march.csv` contains about 60 000 tweets with the hashtag #brexit, most of them tweeted on the 26th of March 2019.

##### The script:
Our script for getting the tweets. You run the script from the CLI with your filename and hashtag as arguments, and it outputs a .csv file containing 2 weeks' worth of tweets for your hashtag.
```python

import csv
import sys

import tweepy

# Twitter API credentials: consumer_key, consumer_secret, access_token and
# access_token_secret must be defined here before running the script.

# Both arguments are required, so check for fewer than three entries in sys.argv
if len(sys.argv) < 3:
    print('###ERROR: Please enter required arguments: filename, hashtag')
    sys.exit(1)

# sys args
fileName = sys.argv[1]
hashtag = '#' + sys.argv[2]

print('Filename: ' + fileName)
print('Hashtag: ' + hashtag)

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)

# Open/create the file in append mode; the with block closes it when done
with open(fileName + '.csv', 'a', newline='') as csvFile:
    csvWriter = csv.writer(csvFile)

    for tweet in tweepy.Cursor(api.search, q=hashtag, tweet_mode='extended',
                               count=100,
                               lang="en",
                               since="2017-04-03").items():
        # A retweet's own full_text can be truncated, so use the original tweet's text
        if hasattr(tweet, 'retweeted_status'):
            text = tweet.retweeted_status.full_text
        else:
            text = tweet.full_text

        print(tweet.created_at, text)
        csvWriter.writerow([tweet.created_at, text])

```
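
The script checks its arguments by hand. As an alternative sketch (not the version we ran), Python's standard `argparse` module can validate the two arguments and print a usage message when one is missing. The `parse_args` wrapper below is our own naming.

```python
import argparse

def parse_args(argv=None):
    """Parse the output file name and the hashtag from the command line."""
    parser = argparse.ArgumentParser(description='Collect tweets for a hashtag.')
    parser.add_argument('filename', help="output file name, without the '.csv' extension")
    parser.add_argument('hashtag', help="hashtag to search for, without the leading '#'")
    args = parser.parse_args(argv)
    # Prepend '#' only if the user did not already type it
    if not args.hashtag.startswith('#'):
        args.hashtag = '#' + args.hashtag
    return args

args = parse_args(['brexit-26-march', 'brexit'])
print(args.filename, args.hashtag)  # brexit-26-march #brexit
```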

In [None]:
brexit_data = pd.read_csv('../input/brexit-tweets/brexit-26-march.csv', header=None)
brexit_data.head()

Let's have a look at what our classifier can tell us about our data before we clean it, using only the first 100 tweets.

In [None]:
pos_tweets = []
neg_tweets = []
pos_cnt = 0
neg_cnt = 0

In [None]:
smaller_data_with_stopwords = brexit_data[:100]

for obj in smaller_data_with_stopwords[1]:
    res = naive_classifier.classify(extract_features(obj.split()))
    if res == 'Negative':
        neg_tweets.append(obj)
        neg_cnt += 1
    elif res == 'Positive':
        pos_tweets.append(obj)
        pos_cnt += 1

This looks reasonable: it seems plausible that #brexit tweets lean towards the negative side.

In [None]:
print('positive tweets: %s' %pos_cnt)
print('negative tweets: %s' %neg_cnt)

Now let's remove the stop words from our Brexit data:

In [None]:
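def clean_tweet(text):
    # Lowercase, then drop short words, links, mentions, hashtags and the retweet marker
    words_filtered = [e.lower() for e in text.split() if len(e) >= 3]
    words_cleaned = [word for word in words_filtered
        if 'http' not in word
        and not word.startswith('@')
        and not word.startswith('#')
        and word != 'rt']
    words_without_stopwords = [word for word in words_cleaned if word not in stopwords_set]
    return ' '.join(words_without_stopwords)

# Assign the result back to the column: writing to the rows yielded by
# iterrows() is not guaranteed to change the DataFrame itself
brexit_data[1] = brexit_data[1].apply(clean_tweet)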

In [None]:
brexit_data.head()

Now we can add a sentiment column so that the data is easier to use in the future. Even though we have 60 000 tweets, we only use the first 2000 here, since classifying the whole set would take a long time.

In [None]:
def classify_data(data, classifier):
    pos_tweets = []
    neg_tweets = []
    labels = []
    # Work on a copy so that the slice passed in is left untouched
    data = data.copy()

    for obj in data[1]:
        res = classifier.classify(extract_features(obj.split()))
        labels.append(res)
        if res == 'Negative':
            neg_tweets.append(obj)
        elif res == 'Positive':
            pos_tweets.append(obj)

    # Store the labels in a new column named 3; values written to the rows
    # yielded by iterrows() would not reliably end up in the DataFrame
    data[3] = labels
    return pos_tweets, neg_tweets, len(pos_tweets), len(neg_tweets), data

In [None]:
data = brexit_data[:2000]
pos_tweets, neg_tweets, pos_cnt, neg_cnt, classified_data = classify_data(data, naive_classifier)

#### There it is!
We have now classified our tweets and can output them to a separate file for future use:

In [None]:
classified_data.to_csv('classified_brexit_tweets.csv', index = False)

### Analyzing our data
Now we can take a look at some of the most frequently used words in the tweets.

In [None]:
print('Positive tweets: %s' %pos_cnt)
print('Negative tweets: %s' %neg_cnt)

In [None]:
wordcloud = WordCloud(background_color="white", width=2500, height=2000).generate(' '.join(pos_tweets))
plt.figure(1,figsize=(13, 13))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title('Positive words in tweets')
plt.show()

In [None]:
wordcloud = WordCloud(width=2500, height=2000).generate(' '.join(neg_tweets))
plt.figure(1,figsize=(13, 13))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title('Negative words in tweets')
plt.show()
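
A word cloud gives a visual impression. For exact numbers, the same tweets can be counted directly. A minimal sketch using `collections.Counter`, with a made-up list standing in for `pos_tweets` or `neg_tweets`:

```python
from collections import Counter

def most_common_words(tweets, n=3):
    """Return the n most frequent whitespace-separated words across all tweets."""
    counts = Counter(word for tweet in tweets for word in tweet.split())
    return counts.most_common(n)

sample_tweets = ['brexit vote today', 'vote leave', 'brexit vote delayed']
print(most_common_words(sample_tweets, n=2))  # [('vote', 3), ('brexit', 2)]
```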

## Trying it with the decision tree classifier

In [None]:
dtree_classifier = nltk.classify.DecisionTreeClassifier.train(training_set)

In [None]:
print(nltk.classify.accuracy(dtree_classifier, test_set))

In [None]:
data = brexit_data[:1000]
pos_tweets, neg_tweets, pos_cnt, neg_cnt, dtree_classified_data = classify_data(data, dtree_classifier)

In [None]:
print('Positive tweets: %s' %pos_cnt)
print('Negative tweets: %s' %neg_cnt)

In [None]:
wordcloud = WordCloud(background_color="white", width=2500, height=2000).generate(' '.join(pos_tweets))
plt.figure(1,figsize=(13, 13))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title('Positive words in tweets')
plt.show()

In [None]:
wordcloud = WordCloud(width=2500, height=2000).generate(' '.join(neg_tweets))
plt.figure(1,figsize=(13, 13))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title('Negative words in tweets')
plt.show()

## Conclusion

In this project we:
- Prepared our training data
- Trained our classifiers
- Gathered our own Twitter data
- Analyzed the sentiment of our gathered data
- Produced a new output file with the classified tweets
- Learned a bit more about using some of the classifiers provided by NLTK

We could have spent more time investigating ways to make our classifiers more accurate and learning more about them. As it stands, when we give the decision tree classifier more training data, training seems to hang and can run for a very long time.
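
One thing we could try, and have not verified, is limiting the number of word features, since every distinct word in the training tweets currently becomes a feature that the decision tree has to consider. A minimal sketch that keeps only the most frequent words follows; `top_word_features` and the cutoff value are hypothetical choices.

```python
from collections import Counter

def top_word_features(tweets, max_features=2000):
    """Return the max_features most frequent words from (words, sentiment) pairs."""
    counts = Counter(word for words, _sentiment in tweets for word in words)
    return [word for word, _count in counts.most_common(max_features)]

toy_tweets = [(['great', 'debate'], 'Positive'),
              (['terrible', 'debate'], 'Negative'),
              (['great', 'night'], 'Positive')]
print(top_word_features(toy_tweets, max_features=2))
```

The result could replace `w_features` before `extract_features` is called, so that each tweet is described by a fixed, smaller set of words.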


