# Classifying Tweets

## Introduction

This project analyzes real tweets gathered from three locations and uses a Naive Bayes Classifier to find patterns in them. The goal is to build a classifier that takes any tweet (or sentence) and predicts whether it came from New York, London, or Paris.

#### Data sources:

The JSON files analyzed were provided by Codecademy.

## Scoping

- Investigate Data.
    
- Naive Bayes Classifier
    - Combine all texts
    - Split the Data into Training and Test Sets
    - Make the Count Vectors
    - Train and Test the Naive Bayes Classifier
    - Model Evaluation
    - Test on New Data

## Import Python Modules

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

## Load and Inspect Data

Three files, `new_york.json`, `london.json`, and `paris.json`, are loaded into `new_york_tweets`, `london_tweets`, and `paris_tweets`. Each file holds features of real tweets, such as user information, location, tweet text, creation time, and favorite count. Some of these features are dictionaries that contain nested information.

**new_york_tweets**: 
- There are 36 columns and 4,723 rows in `new_york_tweets`
- 17 of 36 columns have missing values

**london_tweets**:
- There are 35 columns and 5,341 rows in `london_tweets`
- 16 of 35 columns have missing values

**paris_tweets**:
- There are 35 columns and 2,510 rows in `paris_tweets`
- 16 of 35 columns have missing values
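The row, column, and missing-value counts above can be derived directly from the JSON Lines format that `pd.read_json(..., lines=True)` reads, where each line is one tweet object. The following is a minimal standard-library sketch of that count; the three records are made up for illustration and are not taken from the real files. On a loaded DataFrame, `df.isna().any().sum()` would give the equivalent figure.

```python
import io
import json

# Hypothetical stand-in for a JSON Lines file such as new_york.json:
# one JSON object (one tweet) per line. These records are invented.
raw = io.StringIO(
    '{"id": 1, "text": "hello", "geo": null, "lang": "en"}\n'
    '{"id": 2, "text": "bonjour", "geo": null, "lang": "fr", "possibly_sensitive": false}\n'
    '{"id": 3, "text": "hi there", "geo": null, "lang": "en"}\n'
)

records = [json.loads(line) for line in raw if line.strip()]

# The column set is the union of keys seen across all records.
columns = sorted({key for record in records for key in record})

# A column has missing values if any record lacks the key or stores null.
columns_with_missing = [
    col for col in columns
    if any(record.get(col) is None for record in records)
]

print(f"{len(records)} rows, {len(columns)} columns")
print(f"{len(columns_with_missing)} of {len(columns)} columns have missing values")
print(columns_with_missing)
```

A key that appears in only some tweets (here `possibly_sensitive`) becomes a column with missing values, which is one plausible reason so many columns in the real data are incomplete.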

In [2]:
new_york_tweets = pd.read_json("new_york.json", lines=True)

In [3]:
pd.options.display.max_columns = 60
pd.options.display.max_colwidth = 500

In [4]:
new_york_tweets.head(1)

Unnamed: 0,created_at,id,id_str,text,display_text_range,source,truncated,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,in_reply_to_screen_name,user,geo,coordinates,place,contributors,is_quote_status,quote_count,reply_count,retweet_count,favorite_count,entities,favorited,retweeted,filter_level,lang,timestamp_ms,extended_tweet,possibly_sensitive,quoted_status_id,quoted_status_id_str,quoted_status,quoted_status_permalink,extended_entities,withheld_in_countries
0,2018-07-26 13:32:33+00:00,1022474755625164800,1022474755625164800,@DelgadoforNY19 Calendar marked.,"[16, 32]","<a href=""http://twitter.com/download/android"" rel=""nofollow"">Twitter for Android</a>",False,1.022208e+18,1.022208e+18,8.290618e+17,8.290618e+17,DelgadoforNY19,"{'id': 316616881, 'id_str': '316616881', 'name': 'Adam Ford', 'screen_name': 'NYCVermouth', 'location': 'New York City', 'url': 'http://www.fordobrien.com', 'description': 'White Collar Criminal Defense Lawyer. Atsby Vermouth founder. Authored the Vermouth book. I keep threatening to start a podcast.', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 612, 'friends_count': 615, 'listed_count': 17, 'favourites_count': 3813, 'statuses_count': 2109, 'created_a...",,,"{'id': '01a9a39529b27f36', 'url': 'https://api.twitter.com/1.1/geo/id/01a9a39529b27f36.json', 'place_type': 'city', 'name': 'Manhattan', 'full_name': 'Manhattan, NY', 'country_code': 'US', 'country': 'United States', 'bounding_box': {'type': 'Polygon', 'coordinates': [[[-74.026675, 40.683935], [-74.026675, 40.877483], [-73.910408, 40.877483], [-73.910408, 40.683935]]]}, 'attributes': {}}",,False,0,0,0,0,"{'hashtags': [], 'urls': [], 'user_mentions': [{'screen_name': 'DelgadoforNY19', 'name': 'Antonio Delgado', 'id': 829061809135030272, 'id_str': '829061809135030272', 'indices': [0, 15]}], 'symbols': []}",False,False,low,en,2018-07-26 13:32:33.060,,,,,,,,


In [5]:
new_york_tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4723 entries, 0 to 4722
Data columns (total 36 columns):
 #   Column                     Non-Null Count  Dtype              
---  ------                     --------------  -----              
 0   created_at                 4723 non-null   datetime64[ns, UTC]
 1   id                         4723 non-null   int64              
 2   id_str                     4723 non-null   int64              
 3   text                       4723 non-null   object             
 4   display_text_range         2811 non-null   object             
 5   source                     4723 non-null   object             
 6   truncated                  4723 non-null   bool               
 7   in_reply_to_status_id      1668 non-null   float64            
 8   in_reply_to_status_id_str  1668 non-null   float64            
 9   in_reply_to_user_id        1829 non-null   float64            
 10  in_reply_to_user_id_str    1829 non-null   float64            
 11  in_r

In [6]:
london_tweets = pd.read_json("london.json", lines=True)
london_tweets.head(1)

Unnamed: 0,created_at,id,id_str,text,display_text_range,source,truncated,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,in_reply_to_screen_name,user,geo,coordinates,place,contributors,is_quote_status,extended_tweet,quote_count,reply_count,retweet_count,favorite_count,entities,favorited,retweeted,filter_level,lang,timestamp_ms,possibly_sensitive,quoted_status_id,quoted_status_id_str,quoted_status,quoted_status_permalink,extended_entities
0,2018-07-26 13:39:30+00:00,1022476504855400449,1022476504855400448,@bbclaurak i agree Laura but the Party you seem to support so strongly is slowly doing the same thing . . . and usi… https://t.co/tsRsVBozIR,"[11, 140]","<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",True,1.022447e+18,1.022447e+18,61183568.0,61183568.0,bbclaurak,"{'id': 340170806, 'id_str': '340170806', 'name': 'Big Bobs bastard beans', 'screen_name': 'annoyed_aldo', 'location': None, 'url': None, 'description': 'Memento Mori . . . check my timeline before following me 😒😒 come at me bitches . . . I’ll just block your racist, hateful asses . . .', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 775, 'friends_count': 1523, 'listed_count': 60, 'favourites_count': 16805, 'statuses_count': 38211, 'created_at': 'Fri Jul...",,,"{'id': '58f909abfd95e133', 'url': 'https://api.twitter.com/1.1/geo/id/58f909abfd95e133.json', 'place_type': 'city', 'name': 'Lewisham', 'full_name': 'Lewisham, London', 'country_code': 'GB', 'country': 'United Kingdom', 'bounding_box': {'type': 'Polygon', 'coordinates': [[[-0.074547, 51.414087], [-0.074547, 51.494127], [0.038567, 51.494127], [0.038567, 51.414087]]]}, 'attributes': {}}",,False,"{'full_text': '@bbclaurak i agree Laura but the Party you seem to support so strongly is slowly doing the same thing . . . 
and using you as their puppet 😐😐😐', 'display_text_range': [11, 141], 'entities': {'hashtags': [], 'urls': [], 'user_mentions': [{'screen_name': 'bbclaurak', 'name': 'Laura Kuenssberg', 'id': 61183568, 'id_str': '61183568', 'indices': [0, 10]}], 'symbols': []}}",0,0,0,0,"{'hashtags': [], 'urls': [{'url': 'https://t.co/tsRsVBozIR', 'expanded_url': 'https://twitter.com/i/web/status/1022476504855400449', 'display_url': 'twitter.com/i/web/status/1…', 'indices': [117, 140]}], 'user_mentions': [{'screen_name': 'bbclaurak', 'name': 'Laura Kuenssberg', 'id': 61183568, 'id_str': '61183568', 'indices': [0, 10]}], 'symbols': []}",False,False,low,en,2018-07-26 13:39:30.109,,,,,,


In [7]:
london_tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5341 entries, 0 to 5340
Data columns (total 35 columns):
 #   Column                     Non-Null Count  Dtype              
---  ------                     --------------  -----              
 0   created_at                 5341 non-null   datetime64[ns, UTC]
 1   id                         5341 non-null   int64              
 2   id_str                     5341 non-null   int64              
 3   text                       5341 non-null   object             
 4   display_text_range         3535 non-null   object             
 5   source                     5341 non-null   object             
 6   truncated                  5341 non-null   bool               
 7   in_reply_to_status_id      2230 non-null   float64            
 8   in_reply_to_status_id_str  2230 non-null   float64            
 9   in_reply_to_user_id        2444 non-null   float64            
 10  in_reply_to_user_id_str    2444 non-null   float64            
 11  in_r

In [8]:
paris_tweets = pd.read_json("paris.json", lines=True)
paris_tweets.head(1)

Unnamed: 0,created_at,id,id_str,text,source,truncated,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,in_reply_to_screen_name,user,geo,coordinates,place,contributors,is_quote_status,quote_count,reply_count,retweet_count,favorite_count,entities,favorited,retweeted,filter_level,lang,timestamp_ms,display_text_range,extended_entities,possibly_sensitive,quoted_status_id,quoted_status_id_str,quoted_status,quoted_status_permalink,extended_tweet
0,2018-07-27 17:40:45+00:00,1022899608396156928,1022899608396156928,Bulletin météo parisien : des grêlons énormes s'abattent sur nous. La température à dégringoler de 36 A20 😍😍😍😍😍,"<a href=""http://twitter.com/download/android"" rel=""nofollow"">Twitter for Android</a>",False,,,,,,"{'id': 898983688960167936, 'id_str': '898983688960167936', 'name': 'l'idiopathe', 'screen_name': 'olivier7399', 'location': 'Paris, France', 'url': None, 'description': '#bibliothécaire et #francophone', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 18, 'friends_count': 21, 'listed_count': 1, 'favourites_count': 1906, 'statuses_count': 236, 'created_at': 'Sat Aug 19 19:03:08 +0000 2017', 'utc_offset': None, 'time_zone': None, 'geo_enabled': True, 'lang'...",,,"{'id': '09f6a7707f18e0b1', 'url': 'https://api.twitter.com/1.1/geo/id/09f6a7707f18e0b1.json', 'place_type': 'city', 'name': 'Paris', 'full_name': 'Paris, France', 'country_code': 'FR', 'country': 'France', 'bounding_box': {'type': 'Polygon', 'coordinates': [[[2.224101, 48.815521], [2.224101, 48.902146], [2.469905, 48.902146], [2.469905, 48.815521]]]}, 'attributes': {}}",,False,0,0,0,0,"{'hashtags': [], 'urls': [], 'user_mentions': [], 'symbols': []}",False,False,low,fr,2018-07-27 17:40:45.854,,,,,,,,


In [9]:
paris_tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2510 entries, 0 to 2509
Data columns (total 35 columns):
 #   Column                     Non-Null Count  Dtype              
---  ------                     --------------  -----              
 0   created_at                 2510 non-null   datetime64[ns, UTC]
 1   id                         2510 non-null   int64              
 2   id_str                     2510 non-null   int64              
 3   text                       2510 non-null   object             
 4   source                     2510 non-null   object             
 5   truncated                  2510 non-null   bool               
 6   in_reply_to_status_id      1040 non-null   float64            
 7   in_reply_to_status_id_str  1040 non-null   float64            
 8   in_reply_to_user_id        1101 non-null   float64            
 9   in_reply_to_user_id_str    1101 non-null   float64            
 10  in_reply_to_screen_name    1101 non-null   object             
 11  user


## Naive Bayes Classifier

### Combine all texts:

- Concatenate the tweet texts from all three locations into `all_tweets` using the `+` operator.
- Build the matching `labels` list: `0` represents a New York tweet, `1` represents a London tweet, and `2` represents a Paris tweet.

In [10]:
new_york_text = new_york_tweets["text"].tolist()
london_text = london_tweets["text"].tolist()
paris_text = paris_tweets["text"].tolist()

In [11]:
all_tweets = new_york_text + london_text + paris_text
labels = [0] * len(new_york_text) + [1] * len(london_text) + [2] * len(paris_text)

### Split the Data into Training and Test Sets

In [12]:
# Hold out 20% of the tweets for testing; random_state makes the split reproducible
train_data, test_data, train_labels, test_labels = train_test_split(all_tweets, labels, test_size=0.2, random_state=1)

In [13]:
print(len(train_data))
print(len(test_data))

10059
2515
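The two sizes printed above follow from the row counts reported by `.info()` and the 20% test fraction. The sketch below reproduces the arithmetic; it assumes the test count is rounded up and the remainder goes to training, which is consistent with the printed numbers.

```python
import math

# Row counts reported by .info() for each city
n_new_york, n_london, n_paris = 4723, 5341, 2510
n_total = n_new_york + n_london + n_paris

test_size = 0.2
# Assumption: the test count is rounded up and the rest is used for training.
n_test = math.ceil(test_size * n_total)
n_train = n_total - n_test

print(n_total, n_train, n_test)

# Share of each city in the combined data; the classes are not balanced.
for city, n in [("New York", n_new_york), ("London", n_london), ("Paris", n_paris)]:
    print(f"{city}: {n / n_total:.1%}")
```

London contributes the most tweets and Paris the fewest, which is worth keeping in mind when reading the accuracy figure later on.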


### Make the Count Vectors

In [14]:
counter = CountVectorizer()
counter.fit(train_data)

CountVectorizer()

In [15]:
train_counts = counter.transform(train_data)
test_counts = counter.transform(test_data)

In [16]:
print(train_data[30])
print(train_counts[30])

@slack2thefuture Thank for the anecdote and the pictures.
  (0, 2489)	1
  (0, 2537)	1
  (0, 10664)	1
  (0, 20782)	1
  (0, 24853)	1
  (0, 26683)	1
  (0, 26698)	2


### Train and Test the Naive Bayes Classifier

In [17]:
classifier = MultinomialNB()
classifier.fit(train_counts, train_labels)

MultinomialNB()

In [18]:
predictions = classifier.predict(test_counts)
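To make the fit and predict steps less of a black box, here is a minimal pure-Python sketch of a multinomial Naive Bayes classifier with Laplace smoothing. It is an illustration of the technique under simplifying assumptions, not scikit-learn's implementation: the toy corpus is made up, tokenization is a plain whitespace split, and the names `log_posterior` and `predict` are hypothetical.

```python
import math
from collections import Counter

# Made-up toy corpus; labels follow the notebook's coding
# (0 = New York, 1 = London, 2 = Paris).
train = [
    ("subway delays in brooklyn again", 0),
    ("pizza near the subway", 0),
    ("the tube is delayed again", 1),
    ("lovely weather in the pub garden", 1),
    ("le metro est en retard", 2),
    ("il fait beau en ville", 2),
]

ALPHA = 1.0  # Laplace smoothing; 1.0 is also MultinomialNB's default alpha

word_counts = {}        # label -> Counter of word frequencies
doc_counts = Counter()  # label -> number of training tweets
for text, label in train:
    word_counts.setdefault(label, Counter()).update(text.split())
    doc_counts[label] += 1

vocabulary = {word for counter in word_counts.values() for word in counter}

def log_posterior(text, label):
    """Unnormalised log P(label | text) under the multinomial model."""
    total_words = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / len(train))  # log prior
    for word in text.split():
        if word not in vocabulary:
            continue  # words never seen in training carry no information
        smoothed = (word_counts[label][word] + ALPHA) / (
            total_words + ALPHA * len(vocabulary)
        )
        score += math.log(smoothed)
    return score

def predict(text):
    """Return the label with the highest log posterior."""
    return max(word_counts, key=lambda label: log_posterior(text, label))

print(predict("the tube was lovely"))
print(predict("metro en retard"))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied, and the smoothing term keeps a word that never appeared for a city from zeroing out that city's score.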

### Model Evaluation

In [19]:
np.round(accuracy_score(test_labels, predictions), 2)

0.68

In [20]:
confusion_matrix(test_labels, predictions)

array([[541, 404,  28],
       [203, 824,  34],
       [ 38, 103, 340]])
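The confusion matrix holds more detail than the single accuracy figure. The sketch below recomputes accuracy from it and adds per-city recall and precision, plus a majority-class baseline for comparison. The matrix values are copied from the output above; rows are read as true labels and columns as predicted labels, following scikit-learn's convention.

```python
# Confusion matrix copied from the output above.
# Rows are true labels, columns are predicted labels.
matrix = [
    [541, 404, 28],   # true New York
    [203, 824, 34],   # true London
    [38, 103, 340],   # true Paris
]
cities = ["New York", "London", "Paris"]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(3))
accuracy = correct / total

# Baseline: always predict the most frequent class in the test set.
baseline = max(sum(row) for row in matrix) / total

# Recall: share of a city's tweets that were labelled correctly.
recall = {cities[i]: matrix[i][i] / sum(matrix[i]) for i in range(3)}
# Precision: share of predictions for a city that were correct.
precision = {
    cities[j]: matrix[j][j] / sum(row[j] for row in matrix) for j in range(3)
}

print(f"accuracy: {accuracy:.2f}, baseline: {baseline:.2f}")
for city in cities:
    print(f"{city}: recall {recall[city]:.2f}, precision {precision[city]:.2f}")
```

The largest off-diagonal entry is the 404 New York tweets predicted as London, so most errors come from confusing the two English-speaking cities, while Paris predictions are comparatively precise.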

### Test on New Data

In [21]:
tweet = 'The weather is not as cool as my accent'
tweet_counts = counter.transform([tweet])

In [22]:
classifier.predict(tweet_counts)

array([1])
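The prediction is a numeric label, so it has to be read against the coding chosen when `labels` was built. A small sketch of that lookup follows; the `CITY_NAMES` dictionary is a hypothetical helper, and `predicted` is hard-coded to the value shown in the output above.

```python
# Label coding used when `labels` was built: 0 = New York, 1 = London, 2 = Paris
CITY_NAMES = {0: "New York", 1: "London", 2: "Paris"}

# Hard-coded copy of the output of classifier.predict(tweet_counts) above
predicted = [1]

predicted_cities = [CITY_NAMES[int(label)] for label in predicted]
print(predicted_cities)
```

For the sample sentence, label `1` means the classifier attributes it to London.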