# Classifying Tweets

This project uses a Naive Bayes classifier to find patterns in real tweets. Three files were provided: `new_york.json`, `london.json`, and `paris.json`. Each file contains tweets gathered from the corresponding city.

The goal is to create a classification algorithm that can classify any tweet (or sentence) and predict whether that sentence came from New York, London, or Paris.

Project from https://www.codecademy.com/paths/data-science/tracks/supervised-machine-learning-cumulative-project-skill-path/modules/supervised-learning-cumulative-project-skill-path/informationals/twitter-classification-cumulative-project-skill-path

In [1]:
# Investigating the data
import pandas as pd

new_york_tweets = pd.read_json("new_york.json", lines=True)
# number of tweets 
print(len(new_york_tweets))
# features of tweets 
print(new_york_tweets.columns)
# text of the tweet at index 12
print(new_york_tweets.loc[12]["text"])

4723
Index(['created_at', 'id', 'id_str', 'text', 'display_text_range', 'source',
       'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'quote_count', 'reply_count',
       'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted',
       'filter_level', 'lang', 'timestamp_ms', 'extended_tweet',
       'possibly_sensitive', 'quoted_status_id', 'quoted_status_id_str',
       'quoted_status', 'quoted_status_permalink', 'extended_entities',
       'withheld_in_countries'],
      dtype='object')
Be best #ThursdayThoughts


In [2]:
london_tweets = pd.read_json("london.json", lines=True)
paris_tweets = pd.read_json("paris.json", lines=True)

# number of London tweets 
print(len(london_tweets))
# number of Paris tweets
print(len(paris_tweets))

5341
2510


In [3]:
# Goal: to look at how language is used differently in the 3 locations.

# Extracting the tweet text for each of the 3 cities
new_york_text = new_york_tweets["text"].tolist()
london_text = london_tweets["text"].tolist()
paris_text = paris_tweets["text"].tolist()
# Combining all tweets from the 3 cities into 1 list
all_tweets = new_york_text + london_text + paris_text

# Creating the labels associated with each city
labels = [0] * len(new_york_text) + [1] * len(london_text) + [2] * len(paris_text)


The label 0 indicates a tweet written from New York, 1 from London, and 2 from Paris.

In [4]:
# Breaking down data into a training and a test set

from sklearn.model_selection import train_test_split
train_data, test_data, train_labels, test_labels = train_test_split(all_tweets, labels, test_size=0.2, random_state=1)
print(len(train_data), len(test_data))

10059 2515


In [5]:
# Making the count vectors
from sklearn.feature_extraction.text import CountVectorizer
counter = CountVectorizer()
counter.fit(train_data)

# Transforming train and test data into CountVectors. 
train_counts = counter.transform(train_data)
test_counts = counter.transform(test_data)

# Example of what a tweet looks like as a Count Vector
print(train_data[3])
print(train_counts[3])

saying bye is hard. Especially when youre saying bye to comfort.
  (0, 5022)	2
  (0, 6371)	1
  (0, 9552)	1
  (0, 12314)	1
  (0, 13903)	1
  (0, 23994)	2
  (0, 27146)	1
  (0, 29397)	1
  (0, 30274)	1
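
To make the sparse output above easier to read, here is a minimal sketch of the same counting step in plain Python. It assumes the `CountVectorizer` defaults (lowercasing, and tokens made of two or more word characters); `bag_of_words` is a hypothetical helper for illustration, not part of scikit-learn.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count tokens roughly the way CountVectorizer does with default settings."""
    # Lowercase the text, then keep runs of two or more word characters.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)


tweet = "saying bye is hard. Especially when youre saying bye to comfort."
counts = bag_of_words(tweet)

# Print words alphabetically, the order in which the vectorizer numbers its columns.
for word in sorted(counts):
    print(word, counts[word])
```

The sketch yields nine distinct words, with "bye" and "saying" counted twice. That is consistent with the nine `(row, column)` entries printed above, where the first and sixth columns carry a count of 2.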


In [6]:
# Training and testing the Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()
# Fitting the classifier: estimating the probabilities used in Bayes' theorem
classifier.fit(train_counts, train_labels)
# Model should now be ready to quickly predict the location of a new tweet. 

# Now testing our model 
predictions = classifier.predict(test_counts)
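
As a rough illustration of what fitting and predicting involve, the sketch below implements a tiny multinomial Naive Bayes by hand. It is not scikit-learn's implementation: tokenization is a plain whitespace split, the toy tweets are invented for the example, and `train_nb` and `predict_nb` are hypothetical names. The smoothing value `alpha=1.0` mirrors the `MultinomialNB` default.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods for each class."""
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {word for counter in word_counts.values() for word in counter}

    model = {}
    for label, n_docs in class_doc_counts.items():
        # Additive smoothing keeps unseen (word, class) pairs from getting probability 0.
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "log_likelihood": {
                word: math.log((word_counts[label][word] + alpha) / denom)
                for word in vocab
            },
        }
    return model


def predict_nb(model, doc):
    """Return the class with the highest log posterior score for one document."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for word in doc.lower().split():
            # Words outside the training vocabulary are skipped.
            if word in params["log_likelihood"]:
                score += params["log_likelihood"][word]
        scores[label] = score
    return max(scores, key=scores.get)


# Invented toy tweets: 0 = New York, 1 = London, 2 = Paris.
toy_docs = [
    "subway to brooklyn", "yankees game tonight",
    "tube to camden", "pint at the pub",
    "bonjour paris", "merci beaucoup",
]
toy_labels = [0, 0, 1, 1, 2, 2]

toy_model = train_nb(toy_docs, toy_labels)
print(predict_nb(toy_model, "the tube is late"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.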



In [7]:
# Evaluating model 

from sklearn.metrics import accuracy_score

# Printing the fraction of tweets in the test set that the classifier classified correctly.
print(accuracy_score(test_labels, predictions))

0.6779324055666004


Our model has an accuracy of about 68%, which leaves room for improvement.
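
One way to put that number in context is to compare it with a majority-class baseline, that is, always guessing the city with the most tweets. The sketch below uses the tweet counts printed in cells 1 and 2. It is computed on the full dataset, so the share in the test split may differ slightly.

```python
# Tweet counts per city, copied from the outputs of cells 1 and 2.
city_sizes = {"New York": 4723, "London": 5341, "Paris": 2510}

total_tweets = sum(city_sizes.values())
majority_city = max(city_sizes, key=city_sizes.get)

# Accuracy of a classifier that always predicts the largest class.
baseline_accuracy = city_sizes[majority_city] / total_tweets

print(majority_city, round(baseline_accuracy, 3))
```

Always guessing London would be right for roughly 42% of tweets, so the Naive Bayes classifier does learn something beyond the class proportions.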

In [8]:
# Another way of evaluating the model is a confusion matrix: a table that describes how the classifier made its
# predictions.

from sklearn.metrics import confusion_matrix
print(confusion_matrix(test_labels, predictions))

[[541 404  28]
 [203 824  34]
 [ 38 103 340]]


The table shows how the classifier made its predictions. Each row corresponds to a true label and each column to a predicted label, in the order 0 (New York), 1 (London), 2 (Paris).
The diagonal holds the correctly classified tweets: 541 for New York, 824 for London, and 340 for Paris.
The first two rows show that it is more difficult for the classifier to identify a tweet's location when the language is the same, English: 404 of the 973 New York tweets were labeled London, and 203 of the 1061 London tweets were labeled New York. Tweets from Paris are more straightforward: 340 of 481 were classified correctly, and only 28 New York tweets and 34 London tweets were mislabeled as Paris.
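
The matrix can also be summarized per city. The sketch below recomputes accuracy, recall, and precision from the printed values in plain Python, assuming scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
# Confusion matrix copied from the output above (rows: true, columns: predicted).
matrix = [
    [541, 404, 28],
    [203, 824, 34],
    [38, 103, 340],
]
cities = ["New York", "London", "Paris"]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))
accuracy = correct / total

recall = {}
precision = {}
for i, city in enumerate(cities):
    # Recall: share of a city's true tweets that were labeled correctly.
    recall[city] = matrix[i][i] / sum(matrix[i])
    # Precision: share of tweets labeled as this city that really came from it.
    precision[city] = matrix[i][i] / sum(row[i] for row in matrix)
    print(f"{city}: recall={recall[city]:.3f}, precision={precision[city]:.3f}")

print(f"accuracy={accuracy:.4f}")
```

The accuracy recovered this way matches the value reported by `accuracy_score`. New York has the lowest recall, while Paris has the highest precision.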

In [9]:
# Testing our own tweet 
tweet = "Oh my God"
tweet_counts = counter.transform([tweet])
print(classifier.predict(tweet_counts))

[0]


The model classifies our tweet as 0, meaning it predicts the tweet was written in New York.
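
Since the prediction comes back as a numeric label, a small lookup can translate it into a city name. This is a minimal sketch: `label_names` and `to_city` are hypothetical names, and the mapping follows the label order defined in cell 3.

```python
# Mapping from numeric label to city, following the order used when building `labels`.
label_names = {0: "New York", 1: "London", 2: "Paris"}


def to_city(predicted_labels):
    """Translate an iterable of numeric labels into city names."""
    # int() also accepts NumPy integers, such as those returned by classifier.predict.
    return [label_names[int(label)] for label in predicted_labels]


print(to_city([0]))
```

In the notebook, the same helper could be called as `to_city(classifier.predict(tweet_counts))`.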