## Twitter Classification Project

The aim of this project is to find patterns in real tweets. The files new_york.json, london.json, and paris.json contain tweets gathered from those cities. The goal is to build a classifier that takes any tweet (or sentence) and predicts whether it came from New York, London, or Paris.

In [12]:
import pandas as pd

new_york_tweets = pd.read_json('https://raw.githubusercontent.com/IrenaPlotka/codecademy_projects/master/twitter_classification/new_york.json', lines=True)
# print(new_york_tweets.head())
print(len(new_york_tweets))
print(new_york_tweets.columns)
print(new_york_tweets.loc[5, 'text'])

4723
Index(['created_at', 'id', 'id_str', 'text', 'display_text_range', 'source',
       'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'quote_count', 'reply_count',
       'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted',
       'filter_level', 'lang', 'timestamp_ms', 'extended_tweet',
       'possibly_sensitive', 'quoted_status_id', 'quoted_status_id_str',
       'quoted_status', 'quoted_status_permalink', 'extended_entities',
       'withheld_in_countries'],
      dtype='object')
texting me bullshit i just swipe and delete it


In [10]:
london_tweets = pd.read_json('https://raw.githubusercontent.com/IrenaPlotka/codecademy_projects/master/twitter_classification/london.json', lines=True)
print(len(london_tweets))
print(london_tweets.columns)
print(london_tweets.loc[5, 'text'])

5341
Index(['created_at', 'id', 'id_str', 'text', 'display_text_range', 'source',
       'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'extended_tweet', 'quote_count',
       'reply_count', 'retweet_count', 'favorite_count', 'entities',
       'favorited', 'retweeted', 'filter_level', 'lang', 'timestamp_ms',
       'possibly_sensitive', 'quoted_status_id', 'quoted_status_id_str',
       'quoted_status', 'quoted_status_permalink', 'extended_entities'],
      dtype='object')
What’s cooler than being cool? 
 - Ice Cold matcha latte - Simple n delish 🌴🌴
.
.
.
.
.
.
#matcha #matchatea… https://t.co/jnDjiyimov


In [11]:
paris_tweets = pd.read_json('https://raw.githubusercontent.com/IrenaPlotka/codecademy_projects/master/twitter_classification/paris.json', lines=True)
print(len(paris_tweets))
print(paris_tweets.columns)
print(paris_tweets.loc[5, 'text'])

2510
Index(['created_at', 'id', 'id_str', 'text', 'source', 'truncated',
       'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'quote_count', 'reply_count',
       'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted',
       'filter_level', 'lang', 'timestamp_ms', 'display_text_range',
       'extended_entities', 'possibly_sensitive', 'quoted_status_id',
       'quoted_status_id_str', 'quoted_status', 'quoted_status_permalink',
       'extended_tweet'],
      dtype='object')
Finally, rain in Paris. #aurevoirlahaut @ Paris, France https://t.co/9x3oUakKj1


### Naive Bayes Classifier

In [15]:
# first, combining the text of all tweets into one list
new_york_text = new_york_tweets['text'].tolist()
london_text = london_tweets['text'].tolist()
paris_text = paris_tweets['text'].tolist()
all_tweets = new_york_text + london_text + paris_text

# making the labels associated with those tweets: 0 - a New York tweet, 1 - a London tweet, 2 - a Paris tweet
labels = [0]*len(new_york_text) + [1]*len(london_text) + [2]*len(paris_text)

In [53]:
# breaking the data into a training and testing set
from sklearn.model_selection import train_test_split
train_data, test_data, train_labels, test_labels = train_test_split(all_tweets, labels, test_size = 0.2, random_state = 1)
print(len(train_data))
print(len(test_data))

10059
2515
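The three files differ in size (4723, 5341, and 2510 tweets), so a plain random split can shift the class proportions slightly between the training and test sets. The sketch below is not what the notebook does; it only illustrates the optional `stratify` argument of `train_test_split` on hypothetical toy data (`toy_tweets` and `toy_labels` are made-up names).

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data with uneven class sizes (40 / 40 / 20)
toy_labels = [0] * 40 + [1] * 40 + [2] * 20
toy_tweets = [f"tweet {i}" for i in range(len(toy_labels))]

# stratify=toy_labels keeps the class proportions the same in both splits
toy_train, toy_test, toy_train_labels, toy_test_labels = train_test_split(
    toy_tweets, toy_labels, test_size=0.2, random_state=1, stratify=toy_labels
)

print(Counter(toy_train_labels))
print(Counter(toy_test_labels))
```

With stratification, each split keeps the 40/40/20 ratio of the toy labels.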


In [46]:
# transforming tweets (raw strings) into count vectors
from sklearn.feature_extraction.text import CountVectorizer
counter = CountVectorizer()
counter.fit(train_data)
train_counts = counter.transform(train_data)
test_counts = counter.transform(test_data)
# checking how a tweet looks as a Count Vector
print(train_data[102])
print(train_counts[102])

another day in #London #UK #🇬🇧 場所: Covent Garden London https://t.co/M2T85Hc1Z2
  (0, 2625)	1
  (0, 6215)	1
  (0, 6855)	1
  (0, 7507)	1
  (0, 11213)	1
  (0, 13029)	1
  (0, 13530)	1
  (0, 16414)	2
  (0, 16766)	1
  (0, 27960)	1
  (0, 32472)	1
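Each `(0, j)  n` pair in the sparse output means that the word stored at vocabulary index `j` occurs `n` times in the tweet. The count of 2 presumably belongs to "london", which appears twice once the text is lowercased. A minimal sketch of how to decode such a row, using two made-up tweets (`toy_tweets` is a hypothetical name) and the fitted vectorizer's `vocabulary_` mapping:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy tweets, only for illustration
toy_tweets = ["rain in London London", "sunny day in Paris"]

toy_counter = CountVectorizer()
toy_counts = toy_counter.fit_transform(toy_tweets)

# vocabulary_ maps word -> column index; invert it to go from index -> word
index_to_word = {idx: word for word, idx in toy_counter.vocabulary_.items()}

# Decode the first tweet: keep only the words with a non-zero count
dense_row = toy_counts.toarray()[0]
decoded = {index_to_word[i]: int(c) for i, c in enumerate(dense_row) if c > 0}
print(decoded)
```

The same inversion should work on the notebook's `counter` object to see which words the indices above refer to.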


In [47]:
# using Count Vectors to train and test Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(train_counts, train_labels)
predictions = classifier.predict(test_counts)
# print(classifier.score(test_counts, test_labels))
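
To make the classifier less of a black box: with its default `alpha=1.0`, `MultinomialNB` scores each class as the log prior plus the sum of Laplace-smoothed log word probabilities, weighted by the word counts of the tweet. The sketch below reproduces that arithmetic in NumPy on invented numbers; the vocabulary, count matrix, and class sizes are all hypothetical and are not taken from the tweet data.

```python
import numpy as np

# Hypothetical vocabulary and per-class word counts (rows: classes, columns: words)
vocab = ["subway", "tube", "metro", "rain"]
class_word_counts = np.array([
    [8, 1, 1, 2],   # class 0 (New York)
    [1, 7, 0, 6],   # class 1 (London)
    [0, 0, 9, 3],   # class 2 (Paris)
])
class_doc_counts = np.array([10, 10, 5])  # hypothetical number of tweets per class

alpha = 1.0  # Laplace smoothing, the MultinomialNB default
log_prior = np.log(class_doc_counts / class_doc_counts.sum())
smoothed = class_word_counts + alpha
log_likelihood = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))

def score(word_counts):
    """Return one unnormalized log score per class for a count vector."""
    return log_prior + log_likelihood @ np.asarray(word_counts)

tweet_vector = [0, 1, 0, 1]  # a tweet containing "tube" and "rain" once each
scores = score(tweet_vector)
predicted = int(np.argmax(scores))
print(predicted)
```

The class with the highest score wins; smoothing keeps a word that was never seen in a class from driving that class's probability to zero.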

In [48]:
# evaluating the model with accuracy score
from sklearn.metrics import accuracy_score
print(accuracy_score(test_labels, predictions))

0.6779324055666004
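
For context, it helps to compare this accuracy with a trivial baseline that always predicts the most common city. The sketch below computes that baseline from the tweet counts printed by the loading cells; it uses the full dataset, so the figure for the test split alone may differ slightly.

```python
# Tweet counts printed by the loading cells above
class_sizes = {"New York": 4723, "London": 5341, "Paris": 2510}

total = sum(class_sizes.values())
majority_city = max(class_sizes, key=class_sizes.get)

# Accuracy of a classifier that always answers with the majority city
baseline_accuracy = class_sizes[majority_city] / total
print(majority_city, round(baseline_accuracy, 3))
```

Always answering "London" would be right for roughly 42% of the tweets, so the Naive Bayes model is clearly learning something beyond the class frequencies.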


In [45]:
# evaluating the model with confusion matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(test_labels, predictions))

[[587 329  57]
 [249 742  70]
 [ 53  70 358]]


The confusion matrix (rows are true labels, columns are predicted labels) shows that 587 New York tweets were classified correctly, while 329 were mislabeled as London and 57 as Paris. 742 London tweets and 358 Paris tweets were labeled correctly. As it turns out, tweets from the two English-speaking cities are harder to tell apart than tweets written in different languages.
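
The same matrix can be summarized per city. A minimal sketch, with the matrix copied by hand from the output above: recall is the diagonal divided by the row sums, and precision is the diagonal divided by the column sums.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true city, columns: predicted city)
cm = np.array([[587, 329, 57],
               [249, 742, 70],
               [53, 70, 358]])
cities = ["New York", "London", "Paris"]

recall = cm.diagonal() / cm.sum(axis=1)     # share of each city's tweets that were found
precision = cm.diagonal() / cm.sum(axis=0)  # share of each predicted city that was right

for city, r, p in zip(cities, recall, precision):
    print(f"{city}: recall={r:.3f}, precision={p:.3f}")
```

About a third of the New York tweets (329 of 973) end up labeled as London, which is where most of the lost accuracy comes from.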

In [52]:
# testing my own tweets
tweet = 'seems like me and my neighbors have the same music taste #someone #likes #loudmusic'
tweet_counts = counter.transform([tweet])
print(classifier.predict(tweet_counts))
tweet2 = 'the weather drives me crazy, how much longer will it last #rainingagain'
tweet_counts2 = counter.transform([tweet2])
print(classifier.predict(tweet_counts2))
tweet3 = 'rien ne peut être mieux que des vacances en montagne'
tweet_counts3 = counter.transform([tweet3])
print(classifier.predict(tweet_counts3))

[0]
[1]
[2]


All three tweets were labeled correctly: 0 (New York), 1 (London), and 2 (Paris).
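
Since the raw predictions are numeric, a small helper that returns the city name can make this kind of manual testing easier to read. The sketch below is one possible way to do it, not part of the original notebook: `CITY_NAMES`, `predict_city`, and the toy training tweets are hypothetical, and the vectorizer and classifier are bundled with `make_pipeline` so a raw string can be passed in directly.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Same label convention as the notebook: 0 - New York, 1 - London, 2 - Paris
CITY_NAMES = {0: "New York", 1: "London", 2: "Paris"}

# Hypothetical toy training data, only to keep the example self-contained
toy_tweets = [
    "subway to brooklyn",
    "bagel in brooklyn",
    "tube to camden",
    "tea in camden",
    "metro vers montmartre",
    "croissant sur montmartre",
]
toy_labels = [0, 0, 1, 1, 2, 2]

# The pipeline vectorizes the text and then applies Naive Bayes
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(toy_tweets, toy_labels)

def predict_city(text):
    """Return the predicted city name for a single tweet."""
    label = int(model.predict([text])[0])
    return CITY_NAMES[label]

print(predict_city("tea on the tube"))
print(predict_city("croissant montmartre"))
```

Fitting the same pipeline on `train_data` and `train_labels` instead of the toy lists would give a helper that works on the real model.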