Classifying Tweets

In this project, I will use a Naive Bayes Classifier to find patterns in real tweets. I've been given three files: `new_york.json`, `london.json`, and `paris.json`. Each file contains tweets gathered from the corresponding city.

The goal is to create a classification algorithm that can take any tweet (or sentence) and predict whether it came from New York, London, or Paris.

# Investigate the Data

To begin, let's take a look at the data. I've loaded all three files and printed the following information:
* The number of tweets in each file.
* The columns, or features, of a tweet.
* The text of the tweet at index 12 (the 13th tweet) in the New York dataset.

In [27]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

new_york_tweets = pd.read_json("new_york.json", lines=True)
london_tweets = pd.read_json("london.json", lines=True)
paris_tweets = pd.read_json("paris.json", lines=True)
print(len(new_york_tweets))
print(len(london_tweets))
print(len(paris_tweets))
print(new_york_tweets.columns)
print(new_york_tweets.loc[12]["text"])

4723
5341
2510
Index(['created_at', 'id', 'id_str', 'text', 'display_text_range', 'source',
       'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'quote_count', 'reply_count',
       'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted',
       'filter_level', 'lang', 'timestamp_ms', 'extended_tweet',
       'possibly_sensitive', 'quoted_status_id', 'quoted_status_id_str',
       'quoted_status', 'quoted_status_permalink', 'extended_entities',
       'withheld_in_countries'],
      dtype='object')
Be best #ThursdayThoughts


# Classifying using language: Naive Bayes Classifier

I am going to create a Naive Bayes Classifier! Let's begin by looking at the way language is used differently in these three locations. In the code block below, I grab the text of all of the tweets and combine it into one big list.

I also make the labels associated with those tweets. `0` represents a New York tweet, `1` represents a London tweet, and `2` represents a Paris tweet.

In [10]:
new_york_text = new_york_tweets["text"].tolist()
london_text = london_tweets["text"].tolist()
paris_text = paris_tweets["text"].tolist()

all_tweets = new_york_text + london_text + paris_text
labels = [0] * len(new_york_text) + [1] * len(london_text) + [2] * len(paris_text)
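
Because the three files differ in size, the labels are not evenly distributed. The sketch below uses only the tweet counts printed earlier (4723, 5341, and 2510) to show the share of each class and the accuracy a classifier would reach by always guessing the largest class. The variable names `counts` and `majority_baseline` are illustrative and not part of the original notebook.

```python
from collections import Counter

# Tweet counts per city, taken from the output printed earlier.
counts = {0: 4723, 1: 5341, 2: 2510}  # 0 = New York, 1 = London, 2 = Paris

# Rebuild the label list the same way as above, without loading the JSON files.
labels = [0] * counts[0] + [1] * counts[1] + [2] * counts[2]

label_counts = Counter(labels)
total = len(labels)
for label, n in sorted(label_counts.items()):
    print(label, n, round(n / total, 3))

# Accuracy of a baseline that always predicts the most common label.
majority_baseline = max(label_counts.values()) / total
print(round(majority_baseline, 3))
```

The baseline is a useful reference point when reading the accuracy score later on.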

# Making a Training and Test Set

I can now break the data into a training set and a test set. I'll use scikit-learn's `train_test_split` function to do this split. This function takes two required parameters: the data, followed by the labels. I set the optional parameter `test_size` to `0.2`, so 20% of the tweets are held out for testing. Finally, I set the optional parameter `random_state` to `1`, which makes the split reproducible from run to run.

In [12]:
X_train, X_test, y_train, y_test = train_test_split(all_tweets, labels, test_size=0.2, random_state=1)
print(len(X_train))
print(len(X_test))

10059
2515
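
To make the effect of `test_size` and `random_state` concrete, here is a small sketch on made-up data (`toy_tweets` and `toy_labels` are stand-ins, not the real tweets). It also shows the optional `stratify` parameter, which this notebook does not use but which may be worth considering since the three classes are uneven in size.

```python
from sklearn.model_selection import train_test_split

# Ten made-up "tweets" with two balanced classes, standing in for the real data.
toy_tweets = [f"tweet {i}" for i in range(10)]
toy_labels = [0] * 5 + [1] * 5

# Same test_size and random_state as above: two calls give the identical split.
split_a = train_test_split(toy_tweets, toy_labels, test_size=0.2, random_state=1)
split_b = train_test_split(toy_tweets, toy_labels, test_size=0.2, random_state=1)
X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te))
print(split_a == split_b)

# Optional: stratify keeps the class proportions the same in both parts.
_, _, ys_tr, ys_te = train_test_split(
    toy_tweets, toy_labels, test_size=0.2, random_state=1, stratify=toy_labels
)
print(sorted(ys_te))
```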


# Making the Count Vectors

To use a Naive Bayes Classifier, I need to transform each tweet into a count vector. Conceptually, this changes the sentence `"I love New York, New York"` into a list that contains:

* Two `1`s because the words `"I"` and `"love"` each appear once.
* Two `2`s because the words `"New"` and `"York"` each appear twice.
* Many `0`s because every other word in the training set didn't appear at all.
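
The list above describes the idea. In practice, `CountVectorizer` with its default settings lowercases the text and ignores one-character tokens, so `"I"` is not counted. The sketch below fits a vectorizer on just that one sentence (so there are no `0` entries) to show what actually comes out; `toy_counter` is an illustrative name.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_counter = CountVectorizer()
toy_counts = toy_counter.fit_transform(["I love New York, New York"])

# Words in column order (vocabulary_ maps each word to its column index).
vocab = sorted(toy_counter.vocabulary_, key=toy_counter.vocabulary_.get)
row = toy_counts.toarray()[0].tolist()
print(vocab)  # ['love', 'new', 'york']
print(row)    # [1, 2, 2]
```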

In [13]:
counter = CountVectorizer()
counter.fit(X_train)

X_train_counts = counter.transform(X_train)
X_test_counts = counter.transform(X_test)
print(X_train[3])
print(X_train_counts[3])

saying bye is hard. Especially when youre saying bye to comfort.
  (0, 5022)	2
  (0, 6371)	1
  (0, 9552)	1
  (0, 12314)	1
  (0, 13903)	1
  (0, 23994)	2
  (0, 27146)	1
  (0, 29397)	1
  (0, 30274)	1


# Train and Test the Naive Bayes Classifier

I now have the inputs to the classifier. Let's use the count vectors to train and test the Naive Bayes Classifier!

In [14]:
classifier = MultinomialNB()
classifier.fit(X_train_counts, y_train)
predictions = classifier.predict(X_test_counts)
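
For intuition about what a multinomial Naive Bayes model computes, here is a simplified from-scratch sketch on three made-up sentences. It is not scikit-learn's implementation: it splits on whitespace instead of using `CountVectorizer`, and the helper names `train_toy_nb` and `predict_toy_nb` are hypothetical. It does use add-one smoothing, which matches the default `alpha=1.0` of `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

def train_toy_nb(docs, labels, alpha=1.0):
    """Fit a toy multinomial Naive Bayes model on whitespace-split words."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    docs_per_class = Counter(labels)
    words_per_class = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        words_per_class[label].update(doc.lower().split())

    # Log prior: share of training documents in each class.
    log_prior = {c: math.log(n / len(docs)) for c, n in docs_per_class.items()}

    # Log likelihood of each word per class, with add-alpha smoothing.
    log_lik = {}
    for c in docs_per_class:
        denom = sum(words_per_class[c].values()) + alpha * len(vocab)
        log_lik[c] = {
            w: math.log((words_per_class[c][w] + alpha) / denom) for w in vocab
        }
    return log_prior, log_lik

def predict_toy_nb(text, log_prior, log_lik):
    """Return the class with the highest log prior plus summed word log likelihoods."""
    scores = {}
    for c in log_prior:
        # Words never seen in training are skipped.
        known = [w for w in text.lower().split() if w in log_lik[c]]
        scores[c] = log_prior[c] + sum(log_lik[c][w] for w in known)
    return max(scores, key=scores.get)

toy_docs = ["subway to brooklyn", "tube to camden", "metro vers montmartre"]
toy_labels = [0, 1, 2]  # 0 = New York, 1 = London, 2 = Paris
log_prior, log_lik = train_toy_nb(toy_docs, toy_labels)
toy_prediction = predict_toy_nb("late for the tube", log_prior, log_lik)
print(toy_prediction)  # 1
```

The only known word in the test sentence is "tube", which appears only in the London example, so the London class gets the highest score.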

# Evaluating the Model

Now that the classifier has made its predictions, let's see how well it did. Let's look at two different ways to do this. The first is scikit-learn's `accuracy_score`, which returns the fraction of tweets in the test set that the classifier labeled correctly.



In [15]:
print(accuracy_score(y_test, predictions))

0.6779324055666004


The other way I can evaluate my model is by looking at the **confusion matrix**. A confusion matrix is a table that describes how the classifier made its predictions: each row corresponds to a true label and each column to a predicted label. For example, if there were two labels, A and B, a confusion matrix might look like this:

```
9 1
3 5
```

In this example, the first row shows how the classifier classified the true A's. It guessed that 9 of them were A's and 1 of them was a B. The second row shows how the classifier did on the true B's. It guessed that 3 of them were A's and 5 of them were B's.
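
As a sketch of how such a table is built, the following code reconstructs the two-label example by counting pairs of true and predicted labels. The lists `true` and `pred` are made up so that they reproduce the numbers in the example.

```python
# 10 true A's (9 predicted A, 1 predicted B) and 8 true B's (3 predicted A, 5 predicted B).
true = ["A"] * 10 + ["B"] * 8
pred = ["A"] * 9 + ["B"] * 1 + ["A"] * 3 + ["B"] * 5
classes = ["A", "B"]

# Row = true label, column = predicted label.
toy_matrix = [
    [sum(1 for t, p in zip(true, pred) if t == r and p == c) for c in classes]
    for r in classes
]
for row in toy_matrix:
    print(row)
```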

In [16]:
print(confusion_matrix(y_test, predictions))

[[541 404  28]
 [203 824  34]
 [ 38 103 340]]
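
The matrix above contains enough information to recover the accuracy score and to break performance down per city. The sketch below copies the printed numbers by hand and computes recall (share of a city's tweets that were labeled correctly) and precision (share of predictions for a city that were right). The dictionary names are illustrative.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions.
matrix = [
    [541, 404, 28],   # true New York
    [203, 824, 34],   # true London
    [38, 103, 340],   # true Paris
]
cities = ["New York", "London", "Paris"]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(cities)))
accuracy = correct / total
print(accuracy)

recall = {}
precision = {}
for i, city in enumerate(cities):
    recall[city] = matrix[i][i] / sum(matrix[i])
    precision[city] = matrix[i][i] / sum(row[i] for row in matrix)
    print(city, round(recall[city], 3), round(precision[city], 3))
```

The accuracy computed this way agrees with the value returned by `accuracy_score`, and the per-city numbers show that New York has the lowest recall while Paris has the highest precision.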


The classifier predicts tweets that were actually from New York as either New York tweets or London tweets, but almost never as Paris tweets (28 of 973). London tweets show the same pattern (34 of 1061 predicted as Paris), and most of the tweets that were actually from Paris are classified correctly (340 of 481). Tweets coming from two English-speaking cities are harder to tell apart than tweets written in different languages.

# Test My Own Tweet

Now I'll write a tweet and see how the classifier works!

This should give a prediction for the tweet, where a `0` represents New York, a `1` represents London, and a `2` represents Paris.

In [26]:
tweet = "If I say pizza what you'll say?"
tweet_count = counter.transform([tweet])
print(classifier.predict(tweet_count))

[1]
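
A bare `[1]` is hard to read at a glance. One option, sketched here on a tiny made-up training set rather than the real tweets, is to map the numeric label back to a city name and to look at `predict_proba` to see how confident the model is. The names `toy_texts`, `toy_classifier`, and `city_names` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training sentences standing in for the real tweets.
toy_texts = ["subway to brooklyn", "tube to camden", "metro vers montmartre"]
toy_labels = [0, 1, 2]
city_names = {0: "New York", 1: "London", 2: "Paris"}

toy_counter = CountVectorizer()
toy_classifier = MultinomialNB()
toy_classifier.fit(toy_counter.fit_transform(toy_texts), toy_labels)

tweet_count = toy_counter.transform(["taking the tube today"])
predicted_label = int(toy_classifier.predict(tweet_count)[0])
probabilities = toy_classifier.predict_proba(tweet_count)[0]

print(city_names[predicted_label])
for label, p in zip(toy_classifier.classes_, probabilities):
    print(city_names[int(label)], round(float(p), 3))
```

With the real `counter` and `classifier` from this notebook, the same two calls (`predict` and `predict_proba`) would apply to `tweet_count` unchanged.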
