## Project: Classifying Tweets

This project is part of a self-directed Codecademy certification. Its purpose is to use a Naive Bayes Classifier to find patterns in real tweets.

The goal is to build a classifier that takes a tweet or sentence and predicts whether it came from New York, London, or Paris.

## Previewing the Data ##

We will import and preview the dataset for each city.

In [29]:
import pandas as pd

#Importing the datasets

new_york_tweets = pd.read_json("new_york.json", lines=True)
london_tweets = pd.read_json("london.json", lines=True)
paris_tweets = pd.read_json("paris.json", lines=True)
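
The `lines=True` argument tells pandas to treat each line of the file as a separate JSON object. As a rough illustration of that layout using only the standard library, with two made-up records standing in for real tweets (the data and variable names here are hypothetical):

```python
import io
import json

# Two made-up records standing in for real tweets (hypothetical data).
fake_file = io.StringIO(
    '{"id": 1, "text": "Stuck on the subway again"}\n'
    '{"id": 2, "text": "Mind the gap"}\n'
)

# One JSON object per line: parse each non-empty line independently.
records = [json.loads(line) for line in fake_file if line.strip()]
texts = [record["text"] for record in records]
print(texts)
```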


In [30]:
#How many tweets do we have from each city?
print(len(new_york_tweets))
print(len(london_tweets))
print(len(paris_tweets))

#Which columns are present in each dataframe?
print(new_york_tweets.columns)
print(london_tweets.columns)
print(paris_tweets.columns)

4723
5341
2510
Index(['created_at', 'id', 'id_str', 'text', 'display_text_range', 'source',
       'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'quote_count', 'reply_count',
       'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted',
       'filter_level', 'lang', 'timestamp_ms', 'extended_tweet',
       'possibly_sensitive', 'quoted_status_id', 'quoted_status_id_str',
       'quoted_status', 'quoted_status_permalink', 'extended_entities',
       'withheld_in_countries'],
      dtype='object')
Index(['created_at', 'id', 'id_str', 'text', 'display_text_range', 'source',
       'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
      

## Naive Bayes Classifier ##

We will use a Naive Bayes Classifier to find patterns in the tweets that we have imported.

We will use the integers 0, 1 and 2 to represent the cities; more specifically, **0** represents **New York**, **1** represents **London** and **2** represents **Paris**.

In [31]:
new_york_text = new_york_tweets["text"].tolist()
london_text = london_tweets["text"].tolist()
paris_text = paris_tweets["text"].tolist()

all_tweets = new_york_text + london_text + paris_text
labels = [0] * len(new_york_text) + [1] * len(london_text) + [2] * len(paris_text)
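
Because the labels are built by repeating each integer once per tweet, they line up with the concatenated list only if the cities are added in the same order in both lines. A small sanity check on toy lists (hypothetical tweets, not the real data) might look like this:

```python
from collections import Counter

# Toy stand-ins for the three text lists (hypothetical tweets).
ny = ["bagels", "subway delays"]
london = ["tube strike", "rainy again", "pub quiz"]
paris = ["bonjour"]

tweets = ny + london + paris
toy_labels = [0] * len(ny) + [1] * len(london) + [2] * len(paris)

# Every tweet has exactly one label, in the same order as the texts.
assert len(tweets) == len(toy_labels)
label_counts = Counter(toy_labels)
print(label_counts)
```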

## Training and Testing Sets ##


We will use scikit-learn's **train_test_split** function to split the dataset into a training set and testing set. We will train the model on 80% of the data, and use the other 20% to test the model.

In [32]:
from sklearn.model_selection import train_test_split

In [33]:
#The random_state parameter is set to 1, ensuring that the sets remain unchanged each time the code is run.
X_train, X_test, y_train, y_test = train_test_split(all_tweets, labels, test_size = 0.2, random_state=1) 
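
As a quick arithmetic check on the split sizes, the sketch below assumes that a fractional `test_size` is rounded up to a whole number of samples. That assumption agrees with the 2515 entries in the confusion matrix later in this notebook.

```python
import math

# Tweet counts per city, taken from the preview output above.
n_total = 4723 + 5341 + 2510

# Assumption: a float test_size is rounded up to a whole number of samples.
n_test = math.ceil(0.2 * n_total)
n_train = n_total - n_test
print(n_total, n_train, n_test)
```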

## Count Vectors ##

We need to transform our tweets into count vectors for the Naive Bayes Classifier. 

For example, counting every word in the tweet "I love chocolate and I love chocolate ice cream." gives

{"i": 2, "love": 2, "chocolate": 2, "and": 1, "ice": 1, "cream": 1}

We will do this using scikit-learn's **CountVectorizer** class. Note that with its default settings, CountVectorizer lowercases the text and ignores one-character tokens such as "i".
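
To make the counting step concrete, here is a minimal standard-library sketch. The helper `count_words` and its `min_len` parameter are hypothetical and only approximate what a vectorizer does. Setting `min_len=2` mimics dropping one-character tokens, which is how CountVectorizer behaves by default.

```python
import re
from collections import Counter

def count_words(text, min_len=1):
    """Lowercase the text and count tokens of at least min_len word characters."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(t for t in tokens if len(t) >= min_len)

tweet = "I love chocolate and I love chocolate ice cream."

all_words = count_words(tweet)             # counts every word, including "i"
no_single = count_words(tweet, min_len=2)  # drops one-character tokens such as "i"
print(all_words)
print(no_single)
```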


In [34]:
from sklearn.feature_extraction.text import CountVectorizer


In [35]:
vectorizer = CountVectorizer()
 
#We want to teach our counter the vocabulary from our training set
vectorizer.fit(X_train)

#Now we transform the training and testing data to count vectors
train_counts = vectorizer.transform(X_train)

test_counts = vectorizer.transform(X_test)
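
The vocabulary is learned from the training set only, so words that appear only in the testing set have no column and are ignored. The sketch below illustrates that idea in plain Python. The helpers `fit_vocabulary` and `to_count_vector` are hypothetical and use simple whitespace splitting, not scikit-learn's tokenizer.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each word seen in docs to a column index (sorted for a stable order)."""
    words = sorted({w for d in docs for w in d.lower().split()})
    return {w: i for i, w in enumerate(words)}

def to_count_vector(doc, vocab):
    """Count only words that are in vocab; unseen words are dropped."""
    vec = [0] * len(vocab)
    for word, n in Counter(doc.lower().split()).items():
        if word in vocab:
            vec[vocab[word]] = n
    return vec

train_docs = ["mind the gap", "the subway is late"]
vocab = fit_vocabulary(train_docs)  # columns: gap, is, late, mind, subway, the

# "metro" was never seen during fitting, so it does not appear in the vector.
test_vec = to_count_vector("the metro is late late", vocab)
print(test_vec)
```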


## Training and testing the Naive Bayes Classifier ##

We will create a Naive Bayes Classifier using the scikit-learn package. 

We will fit the model to our training set and use our testing set to evaluate it.

In [36]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()
#Fitting the model
classifier.fit(train_counts, y_train)
#Testing the model
predictions = classifier.predict(test_counts)
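
To show roughly what a multinomial Naive Bayes model computes, here is a small from-scratch sketch on toy data. It is an illustration of the idea, not scikit-learn's implementation. The functions `train_nb` and `predict_nb` and the toy tweets are hypothetical, and `alpha=1.0` is additive (Laplace) smoothing.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocab = sorted({w for d in docs for w in d.split()})
    class_sizes = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    log_prior = {c: math.log(n / len(docs)) for c, n in class_sizes.items()}
    log_lik = {}
    for c in class_sizes:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_lik

def predict_nb(model, doc):
    """Return the class with the highest log posterior; unseen words are skipped."""
    log_prior, log_lik = model
    scores = {
        c: log_prior[c] + sum(log_lik[c][w] for w in doc.split() if w in log_lik[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy training data: 0 = New York, 1 = London, 2 = Paris.
toy_docs = [
    "subway bagel broadway",
    "tube pub thames",
    "metro croissant seine",
    "bagel subway yankees",
    "pub tube queue",
]
toy_labels = [0, 1, 2, 0, 1]

model = train_nb(toy_docs, toy_labels)
print(predict_nb(model, "bagel on the subway"))
print(predict_nb(model, "croissant by the seine"))
```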

## Evaluating the Model ##

We will evaluate the model using scikit-learn's **accuracy_score** function, which returns the proportion of tweets in the testing set that were correctly classified (a value between 0 and 1).




In [37]:
from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, predictions))

0.6779324055666004


The model correctly predicted the location of around 68% (roughly 2/3) of the tweets in our testing set.
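
For context, it helps to compare this accuracy with a trivial baseline that always guesses the most common city. The sketch below computes that baseline from the tweet counts printed earlier. It uses the full dataset, so the figure for the testing set alone may differ slightly.

```python
# Tweet counts per city, taken from the preview output earlier in the notebook.
counts = {"New York": 4723, "London": 5341, "Paris": 2510}

# A classifier that always guesses the most common city is right this often.
majority_city = max(counts, key=counts.get)
baseline_accuracy = counts[majority_city] / sum(counts.values())
print(majority_city, round(baseline_accuracy, 3))
```

Always guessing London would be right for roughly 42% of tweets, so the model's 68% is a clear improvement over that baseline.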

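We will also compute the confusion matrix for this model using scikit-learn's **confusion_matrix** function. Each row corresponds to a true city and each column to a predicted city, in the order New York, London, Paris.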

In [38]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, predictions))

[[541 404  28]
 [203 824  34]
 [ 38 103 340]]


The diagonal of the matrix shows that the model correctly classified 541 tweets from New York, 824 from London, and 340 from Paris. Most of the errors are confusions between the two English-speaking cities: 404 New York tweets were predicted as London, and 203 London tweets were predicted as New York. By contrast, only 62 tweets from New York or London (28 and 34 respectively) were predicted as Paris. Paris tweets are still misclassified fairly often (141 of 481, mostly as London), but the model rarely mistakes a tweet from an English-speaking city for one from Paris. One likely reason is that English is the main language of New York and London, whereas French is the main language of Paris.
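
The per-city behaviour can be quantified with precision and recall, computed directly from the confusion matrix printed above. This is a plain-Python sketch; the variable names are illustrative.

```python
# Confusion matrix from the output above: rows are true cities, columns are predictions.
matrix = [
    [541, 404, 28],   # true New York
    [203, 824, 34],   # true London
    [38, 103, 340],   # true Paris
]
cities = ["New York", "London", "Paris"]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))
accuracy = correct / total

recall = {}     # share of each city's tweets that the model found
precision = {}  # share of each city's predictions that were right
for i, city in enumerate(cities):
    recall[city] = matrix[i][i] / sum(matrix[i])
    precision[city] = matrix[i][i] / sum(row[i] for row in matrix)
    print(f"{city}: precision={precision[city]:.3f}, recall={recall[city]:.3f}")
print(f"accuracy={accuracy:.3f}")
```

Paris has the highest precision of the three cities, while New York has the lowest recall, which matches the pattern of New York tweets being mistaken for London.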

## Improving the Model ##



