Classifying Tweets using a Naive Bayes Classifier to find patterns in real tweets. 
Input : `new_york.json`, `london.json`, and `paris.json`. These three files contain tweets from those locations.

The goal is to create a classification algorithm that can classify any tweet (or sentence) and predict whether that sentence came from New York, London, or Paris.

# Investigating the Data

In [4]:
import pandas as pd

new_york_tweets = pd.read_json("new_york.json", lines=True)
print(len(new_york_tweets))
print(new_york_tweets.columns)
print(new_york_tweets.loc[12]["text"])

4723
Index(['created_at', 'id', 'id_str', 'text', 'display_text_range', 'source',
       'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'quote_count', 'reply_count',
       'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted',
       'filter_level', 'lang', 'timestamp_ms', 'extended_tweet',
       'possibly_sensitive', 'quoted_status_id', 'quoted_status_id_str',
       'quoted_status', 'quoted_status_permalink', 'extended_entities',
       'withheld_in_countries'],
      dtype='object')
Be best #ThursdayThoughts


# Loading DataFrames

In [5]:
london_tweets = pd.read_json('london.json', lines = True)
paris_tweets = pd.read_json('paris.json', lines = True)

# Classifying using language - Naive Bayes Classifier

1) Transforming the text of the tweets and making a big list
2) Making a label associated with those tweets. `0` represents a New York tweet, `1`  represents a London tweet, and `2` represents a Paris tweet. 

In [6]:
new_york_text = new_york_tweets["text"].tolist()
london_text = london_tweets["text"].tolist()
paris_text = paris_tweets["text"].tolist()


all_tweets = new_york_text + london_text + paris_text
labels = [0] * len(new_york_text) + [2] * len(paris_text) + [1] * len(london_text)

# Making a Training and Test Set

Using scikit-learn's `train_test_split` function to break the data into a training set and a test set. 

In [7]:
from sklearn.model_selection import train_test_split
train_data, validation_data, train_labels, validation_labels = train_test_split(all_tweets, labels)

# Making the Count Vectors


In [8]:
from sklearn.feature_extraction.text import CountVectorizer
counter = CountVectorizer()
counter.fit(train_data)
train_counts = counter.transform(train_data)
validation_counts = counter.transform(validation_data)

print(train_data[3])
print(train_counts[3])

My username is DiabloDame. I would love Soooome Extra Lives. https://t.co/WrZIj1qzhp
  (0, 5876)	1
  (0, 7560)	1
  (0, 9353)	1
  (0, 12353)	1
  (0, 13173)	1
  (0, 15411)	1
  (0, 15677)	1
  (0, 17446)	1
  (0, 23848)	1
  (0, 26798)	1
  (0, 28205)	1
  (0, 28257)	1


# Train and Test the Naive Bayes Classifier

1) Making a `MultinomialNB` named `classifier`.
2) Training the  `classifier`'s with `.fit()` method. 
3) Test the model with `classifier`'s `.predict()` method

In [10]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(train_counts, train_labels)

prediction = classifier.predict(validation_counts)

# Evaluating the Model



In [11]:
from sklearn.metrics import accuracy_score

print(accuracy_score(validation_labels, prediction))


0.5896946564885496


[[1028  108   16]
 [ 513  745   56]
 [ 454  143   81]]


from sklearn.metrics import confusion_matrix
print(confusion_matrix(validation_labels, prediction))

In [17]:
# Test a Tweet


[0]
[1]
[[4.39453099e-08 9.99999470e-01 4.86386921e-07]]


In [None]:
tweet = 'Where can I eat the best burger in the big apple'
tweet_counts = counter.transform([tweet])
prediction_jis = classifier.predict(tweet_counts)
print(prediction_jis)


tweet2 = 'Ou trouver le meilleur burger de paris'
tweet2_counts = counter.transform([tweet2])
prediction_jis2 = classifier.predict(tweet2_counts)
print(prediction_jis2)

prediction_proba = classifier.predict_proba(tweet2_counts)
print(prediction_proba)