# <b>Tweet Location</b>
# Twitter Classification Cumulative Project Part-2

My Part-2 project for the Codecademy Data Scientist Path, from the Foundations of Machine Learning: Supervised Learning course, Advanced Classification Models section.

## <b>Overview</b>

In this project, Twitter Classification Cumulative Project, I use real tweets to find patterns in the way people use social media. There are two parts to this project:

- Part-1: Viral Tweets, predicting viral tweets using a K-Nearest Neighbors classifier model.
- Part-2: Tweet Location, classifying tweets by location (this section).

### + Tweet Location Project Goal

Using a Naive Bayes Classifier Model, classify any tweet (or sentence) and predict whether that sentence came from New York, London, or Paris.

### + Project Requirements

Be familiar with:

- Python3
- Machine Learning: Supervised Learning

- The Python Libraries:
    - Pandas
    - NumPy
    - Sklearn

### + Links

[Project Blog](https://www.alex-ricciardi.com/post/tweet-location)

[Project GitHub](https://github.com/ARiccGitHub/tweet_location)

## <b>Libraries</b>

In [1]:
# Data manipulation tool
import pandas as pd
# Scientific computing, array
import numpy as np
# Data splitter
from sklearn.model_selection import train_test_split
# Convert a collection of text documents to a matrix of token counts
from sklearn.feature_extraction.text import CountVectorizer
# Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
# Model evaluation
from sklearn.metrics import accuracy_score, recall_score, precision_score, confusion_matrix

## <b>Investigate The Data</b>

The provided data:
 * `new_york.json`
 * `london.json`
 * `paris.json`

### + Importing the data 

In [2]:
new_york_tweets = pd.read_json("data_json/new_york.json", lines=True)
london_tweets = pd.read_json("data_json/london.json", lines=True)
paris_tweets = pd.read_json("data_json/paris.json", lines=True)

### + Exploring the provided data

* The columns, or features, of a tweet.
* The text of the 12th tweet in the New York dataset.
* The number of tweets.

#### - The columns, or features, of a tweet:

In [3]:
features_type = new_york_tweets.dtypes.to_frame(name='dtype')
features_type

Unnamed: 0,dtype
created_at,"datetime64[ns, UTC]"
id,int64
id_str,int64
text,object
display_text_range,object
source,object
truncated,bool
in_reply_to_status_id,float64
in_reply_to_status_id_str,float64
in_reply_to_user_id,float64


Some of the features are objects, for example `user`. Let's find out what kind of object each of those features holds.

In [4]:
features_type['type'] = [type(new_york_tweets.loc[0][col]).__name__ for col in new_york_tweets.columns]
features_type

Unnamed: 0,dtype,type
created_at,"datetime64[ns, UTC]",Timestamp
id,int64,int64
id_str,int64,int64
text,object,str
display_text_range,object,list
source,object,str
truncated,bool,bool_
in_reply_to_status_id,float64,float64
in_reply_to_status_id_str,float64,float64
in_reply_to_user_id,float64,float64


I used row 0 of `new_york_tweets` to output each object's data type. Some of the object features have no value in that row, which yields a `NoneType`. Let's find the actual data type of those features by using the first row where the value is not `None`.

In [5]:
for feature in features_type.index:
    if features_type.loc[feature, 'type'] == 'NoneType':
        # Looks for the first value of the feature that is not None
        for value in new_york_tweets[feature]:
            if value is not None:
                # .loc[row, column] writes to the DataFrame itself rather than to a temporary copy
                features_type.loc[feature, 'type'] = type(value).__name__
                break
features_type.to_csv('data/features_type.csv')
features_type

Unnamed: 0,dtype,type
created_at,"datetime64[ns, UTC]",Timestamp
id,int64,int64
id_str,int64,int64
text,object,str
display_text_range,object,list
source,object,str
truncated,bool,bool_
in_reply_to_status_id,float64,float64
in_reply_to_status_id_str,float64,float64
in_reply_to_user_id,float64,float64


The `text` feature holds the data that is useful for predicting a tweet's location.

In [6]:
print(f'\nText of 12th tweet: {new_york_tweets.loc[12]["text"]}')


Text of 12th tweet: Be best #ThursdayThoughts


Number of tweets in `new_york_tweets`, `london_tweets` and `paris_tweets`:

In [7]:
print(f'Number of tweets from New York: {len(new_york_tweets)}')
print(f'Number of tweets from London: {len(london_tweets)}')
print(f'Number of tweets from Paris: {len(paris_tweets)}')

Number of tweets from New York: 4723
Number of tweets from London: 5341
Number of tweets from Paris: 2510


The `paris_tweets` DataFrame has roughly half as many tweets as the `new_york_tweets` and `london_tweets` DataFrames.

## <b>Naive Bayes Classifier</b>

A [Naive Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) is a supervised machine learning algorithm that leverages [Bayes' Theorem](https://mathworld.wolfram.com/BayesTheorem.html) to make predictions and classifications.
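
To make the idea concrete, here is a minimal pure-Python sketch of a multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. The toy sentences, the helper names `train_naive_bayes` and `predict`, and the whitespace tokenizer are assumptions made for illustration. The project itself relies on scikit-learn's `MultinomialNB`, whose implementation differs in detail.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(sentences, labels):
    """Counts words per class and documents per class (hypothetical helper)."""
    word_counts = defaultdict(Counter)   # class -> word -> count
    class_counts = Counter(labels)       # class -> number of documents
    for sentence, label in zip(sentences, labels):
        word_counts[label].update(sentence.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocabulary

def predict(sentence, word_counts, class_counts, vocabulary):
    """Returns the class with the highest log posterior (hypothetical helper)."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # log P(class)
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in sentence.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            # log P(word | class) with add-one smoothing
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

# Toy data, made up for illustration: 0 = New York, 1 = London, 2 = Paris
sentences = ["the subway to brooklyn is late",
             "stuck on the tube at paddington",
             "un café au bord de la seine"]
labels = [0, 1, 2]
model = train_naive_bayes(sentences, labels)

print(predict("late subway again", *model))  # 0
print(predict("café de la gare", *model))    # 2
```

Each class score is the log prior plus the sum of the log word likelihoods. The "naive" part is the assumption that words are independent given the class.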

### + Defining data and labels

To classify any tweet (or sentence) and predict whether it came from New York, London, or Paris with a Naive Bayes classifier, I isolated the `text` feature from each location's DataFrame and combined the three lists into one variable named `all_tweets_text`.

I defined the labels associated with the `new_york_tweets`, `london_tweets` and `paris_tweets` locations as follows:
- `0` represents a New York tweet
- `1` represents a London tweet
- `2` represents a Paris tweet

I stored the labels in a variable named `labels`.

In [8]:
# Isolating the `text` features data from each DataFrame
new_york_text = new_york_tweets["text"].tolist()
london_text = london_tweets["text"].tolist()
paris_text = paris_tweets["text"].tolist()
# Combined text data
all_tweets_text = new_york_text + london_text + paris_text
# Labels
labels = [0] * len(new_york_text) + [1] * len(london_text) + [2] * len(paris_text)

### + Creating training and test sets

To split the data into training and test sets, I used the `train_test_split` function with `test_size = 0.2`, which holds out 20% of the tweets for testing, and `random_state = 1`, which sets the random seed to 1 so that the results are reproducible.

In [9]:
train_data, test_data, train_labels, test_labels = train_test_split(all_tweets_text, labels, test_size = 0.2, random_state = 1)
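
The effect of `random_state` can be checked on toy data. The sketch below is an illustration under assumptions: the ten placeholder tweets and their labels are made up. It shows that two calls with the same seed return the same 80/20 split.

```python
from sklearn.model_selection import train_test_split

# Toy data, made up for illustration: ten placeholder tweets and their labels
texts = [f"tweet {i}" for i in range(10)]
labels = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]

# Two calls with the same seed
split_a = train_test_split(texts, labels, test_size=0.2, random_state=1)
split_b = train_test_split(texts, labels, test_size=0.2, random_state=1)
train_a, test_a, train_labels_a, test_labels_a = split_a

print(len(train_a), len(test_a))  # 8 2
print(split_a == split_b)         # True
```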

In [10]:
# Labels Test Sample
pd.DataFrame({'test_labels':test_labels}).head(10).style.hide_index().set_properties(**{'text-align': 'center'})

test_labels
0
0
1
0
0
2
1
1
1
0


In [11]:
# Data Test Sample
pd.DataFrame({'test_data':test_data}).head(10).style.hide_index().set_properties(**{'text-align': 'center'})

test_data
@saritam They’ll be aight!
¿Es porque me gusta la changua?😔
@zonalmista @Jamie_FD
"Trade & investment agreements often have a negative effect on the weakest parts of society, in particular women. Ho… https://t.co/Ve3JlBzNlZ"
@ELVIAJEAMADO 23:35. Si me llevan voy en el baño igual 😅
@Tanziloic Loic le poste d'arriere droit et la defense vont pas poser des probleme durant le mois d'aout quid des… https://t.co/mOmFFJPzLq
https://t.co/UTj5tiGkKP
"Did You learn something ..? @ London, United Kingdom https://t.co/mkMzM1vuJC"
@billybragg We know about the far right so I think it's redundant to explain that they pose a threat. When Labour p… https://t.co/UR9NPkBTFy
Excited to reveal @CampaignLiveUS' 2018 Inclusive & Creative Top 20 honorees. Check out which agencies and brands a… https://t.co/q2Jd2YYxtE


### + Making the Count Vectors

To use a Naive Bayes Classifier, each tweet's text needs to be transformed into a [count vector](https://towardsdatascience.com/natural-language-processing-count-vectorization-with-scikit-learn-e7804269bb5e).

For example, the sentence `"I love New York, New York"` will transform into a list that contains:
* Two `1`s because the words `"I"` and `"love"` each appear once.
* Two `2`s because the words `"New"` and `"York"` each appear twice.
* Many `0`s because every other word in the training set didn't appear at all.
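
The sketch below runs the example sentence through `CountVectorizer` with its default settings. The defaults lowercase the text and keep only tokens of two or more characters, so `"I"` is not counted and the resulting vector covers `love`, `new` and `york` only. No zeros appear here because, as a simplifying assumption, the vocabulary is learned from this single sentence rather than from a full training set.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = "I love New York, New York"

# Learns the vocabulary and counts the words in one step
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform([sentence])

# Vocabulary terms ordered by their column index in the count vector
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(vocabulary)           # ['love', 'new', 'york']
print(counts.toarray()[0])  # [1 2 2]
```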

In [12]:
# Initializes the count vectorizer
counter = CountVectorizer()
# Learns a vocabulary dictionary: raw text corpus → processed text → tokenized text → corpus vocabulary → text representation
counter.fit(train_data)
# Transforms each tweet into a row of a document-term matrix, using the learned vocabulary
train_counts = counter.transform(train_data)
test_counts = counter.transform(test_data)

# Count Vector Sample 
print(f'Train data 3rd tweet:\n{train_data[3]}\n')
print('Count Vector:')
print(train_counts[3])

Train data 3rd tweet:
saying bye is hard. Especially when youre saying bye to comfort.

Count Vector:
  (0, 5022)	2
  (0, 6371)	1
  (0, 9552)	1
  (0, 12314)	1
  (0, 13903)	1
  (0, 23994)	2
  (0, 27146)	1
  (0, 29397)	1
  (0, 30274)	1


### + Train and Test the Naive Bayes Classifier, Predictions

I used the [MultinomialNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) class from the scikit-learn Python library to create my Naive Bayes Classifier model.
Then I trained the model and predicted the locations of the test tweets.

In [13]:
# Initializes Naive Bayes model
classifier = MultinomialNB()
# Trains model
classifier.fit(train_counts, train_labels)
# Predicts the tweets test data locations
predictions = classifier.predict(test_counts)
# Prediction Sample
pd.DataFrame({'Predictions':predictions}).head(10).style.hide_index().set_properties(**{'text-align': 'center'})

Predictions
0
2
1
1
2
2
1
1
1
0


## <b>Evaluating The Model</b>

To evaluate the model, I compared the locations predicted for the test tweets against the test labels, using the following evaluation metrics:

* Accuracy
* Precision
* Recall
* Confusion Matrix

More info: [5 Classification Evaluation metrics](https://towardsdatascience.com/the-5-classification-evaluation-metrics-you-must-know-aa97784ff226)

### + Accuracy:

[Accuracy](https://developers.google.com/machine-learning/crash-course/classification/accuracy#:~:text=Accuracy%20is%20one%20metric%20for,predictions%20Total%20number%20of%20predictions) is the percentage of classifications that the algorithm got correct out of every classification it made.

In [14]:
accuracy = accuracy_score(test_labels, predictions)
# Saves and displays the accuracy score
accuracy = pd.DataFrame({'Accuracy':[accuracy]})
accuracy.to_csv('data/accuracy.csv')
accuracy.style.hide_index().set_properties(**{'text-align': 'right'})

Accuracy
0.677932


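With an accuracy of about 0.68, the model clearly beats the roughly 0.42 that always predicting London, the most common class in the test set, would achieve.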

### + Precision:

[Precision](https://en.wikipedia.org/wiki/Precision_and_recall) measures the percentage of items the classifier found that were actually relevant.

In [15]:
# List of locations
locations = ['New York', 'London', 'Paris', 'Combined Locations']
# The argument "average=None" returns each class precision score
precisions = precision_score(test_labels, predictions, average=None)
# The argument "average='weighted'" returns the weighted averaged precision score of the three classes
precision_avg = precision_score(test_labels, predictions, average='weighted')
# Combined precision scores results
precisions = np.append(precisions, [precision_avg], axis=0)
precision_scores = pd.DataFrame({'Locations':locations, 'Precision Scores':precisions})
# Saves precision scores results
precision_scores.to_csv('data/precision_scores.csv')
# Displays
precision_scores.style.set_properties(subset=['Precision Scores'], **{'text-align': 'right'})

Unnamed: 0,Locations,Precision Scores
0,New York,0.691816
1,London,0.619083
2,Paris,0.845771
3,Combined Locations,0.690577


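Precision is highest for Paris (about 0.85): when the model predicts Paris, it is usually right. It is lowest for London (about 0.62), so a sizeable share of the tweets predicted as London actually came from elsewhere.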

### + Recall:

[Recall](https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall) measures the percentage of the relevant items the classifier was able to successfully find. 

In [16]:
# The argument "average=None" returns each class recall score
recalls = recall_score(test_labels, predictions, average=None)
# The argument "average='weighted'" returns the weighted averaged recall score of the three classes
recall_avg = recall_score(test_labels, predictions, average='weighted')
# Combined recall scores results
recalls = np.append(recalls, [recall_avg], axis=0)
recall_scores = pd.DataFrame({'Locations':locations, 'Recall Scores':recalls})
# Saves recall scores results
recall_scores.to_csv('data/recall_scores.csv')
# Displays
recall_scores.style.set_properties(subset=['Recall Scores'], **{'text-align': 'right'})

Unnamed: 0,Locations,Recall Scores
0,New York,0.556012
1,London,0.776626
2,Paris,0.706861
3,Combined Locations,0.677932


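The New York recall score (about 0.56) is the lowest of the three: the model misses almost half of the tweets that actually came from New York.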

### + Confusion Matrix:

Another way to evaluate a model is by looking at the [confusion matrix](https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62). A confusion matrix is a table that describes how a classifier made its predictions.

For example, if there were two labels, A and B, a confusion matrix might look like this:

```
9 1
3 5
```

In this example, the first row shows how the classifier classified the true A's. It guessed that 9 of them were A's and 1 of them was a B. The second row shows how the classifier did on the true B's. It guessed that 3 of them were A's and 5 of them were B's.
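
The two-label example above is enough to work out accuracy, precision and recall by hand. The short sketch below does the arithmetic for label A. The variable names are illustrative.

```python
# Rows are true labels, columns are predicted labels (the example matrix above)
matrix = [[9, 1],   # true A: 9 predicted as A, 1 predicted as B
          [3, 5]]   # true B: 3 predicted as A, 5 predicted as B

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))
accuracy = correct / total

# Precision for A: of everything predicted as A (first column), how much was truly A
precision_a = matrix[0][0] / (matrix[0][0] + matrix[1][0])
# Recall for A: of everything that was truly A (first row), how much was found
recall_a = matrix[0][0] / sum(matrix[0])

print(round(accuracy, 3), precision_a, recall_a)  # 0.778 0.75 0.9
```

Reading down a column gives precision, while reading across a row gives recall.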

This project uses three classes: `0` for New York, `1` for London, and `2` for Paris.

In [17]:
cf_matrix = confusion_matrix(test_labels, predictions)

# One row per true class, each holding that class's prediction counts
# (DataFrame.append was removed in pandas 2.0, so the rows go straight to the constructor)
raw_cf_matrix = pd.DataFrame({'': cf_matrix.tolist()})

# Saves raw matrix 
raw_cf_matrix.to_csv('data/raw_cf_matrix.csv')
raw_cf_matrix

Unnamed: 0,Unnamed: 1
0,"[541, 404, 28]"
1,"[203, 824, 34]"
2,"[38, 103, 340]"


Descriptive Confusion Matrix DataFrame:

In [18]:
# Column j of cf_matrix holds every tweet predicted as class j:
# the diagonal entry is a true positive, the other entries are false positives
new_york_matrix_labels = ['True Positives:', 'False Positive - Was London:', 'False Positive - Was Paris:']
london_matrix_labels = ['True Positives:', 'False Positive - Was New York:', 'False Positive - Was Paris:']
paris_matrix_labels = ['True Positives:', 'False Positive - Was New York:', 'False Positive - Was London:']

# Rows are reordered so that each value lines up with its label
desp_cf_matix = pd.DataFrame({'New York':new_york_matrix_labels, ' ':cf_matrix[[0, 1, 2], 0],
                            'London':london_matrix_labels, '  ':cf_matrix[[1, 0, 2], 1],
                            'Paris':paris_matrix_labels, '   ':cf_matrix[[2, 0, 1], 2]})

# Saves description matrix
desp_cf_matix.to_csv('data/desp_cf_matix.csv')

desp_cf_matix.style.hide_index().set_properties(subset=[' ', '  ', '   '], **{'text-align': 'Left'})

New York,Unnamed: 1,London,Unnamed: 3,Paris,Unnamed: 5
True Positives:,541,True Positives:,824,True Positives:,340
False Positive - Was London:,203,False Positive - Was New York:,404,False Positive - Was New York:,28
False Positive - Was Paris:,38,False Positive - Was Paris:,103,False Positive - Was London:,34


The classifier predicts tweets that were actually from New York as either New York tweets or London tweets, but almost never as Paris tweets (28 of 973). Similarly, London tweets are rarely predicted as Paris (34 of 1,061), and most of the misclassified Paris tweets are predicted as London (103) rather than New York (38). Tweets coming from two English-speaking cities are harder to distinguish than tweets in different languages.
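
The scores reported in the earlier sections can be re-derived from the raw confusion matrix, which also shows where the "Combined Locations" rows come from. The sketch below is plain arithmetic on the matrix values printed above. It assumes that scikit-learn's `average='weighted'` weights each class score by the number of true tweets in that class, which is why the weighted recall equals the accuracy.

```python
# Raw confusion matrix reported above: rows are true cities, columns are predicted cities
cities = ["New York", "London", "Paris"]
matrix = [[541, 404, 28],
          [203, 824, 34],
          [38, 103, 340]]

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(3)) / total

precisions, recalls, supports = [], [], []
for i, city in enumerate(cities):
    predicted_as_city = sum(row[i] for row in matrix)  # column sum
    actually_city = sum(matrix[i])                     # row sum
    precisions.append(matrix[i][i] / predicted_as_city)
    recalls.append(matrix[i][i] / actually_city)
    supports.append(actually_city)
    print(f"{city}: precision={precisions[i]:.6f}, recall={recalls[i]:.6f}")

# Weighted averages: each class score weighted by its number of true tweets
weighted_precision = sum(p * s for p, s in zip(precisions, supports)) / total
weighted_recall = sum(r * s for r, s in zip(recalls, supports)) / total

print(f"accuracy={accuracy:.6f}")
print(f"weighted precision={weighted_precision:.6f}, weighted recall={weighted_recall:.6f}")
```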

## <b>Test Your Own Tweet</b>

In [19]:
tweet = input(f'\nEnter your tweet:\n')
# Vectorizes the tweet
tweet_counts = counter.transform([tweet])
# Predicts
location = classifier.predict(tweet_counts)

# `predict` returns one label per input, so the first element is the predicted class
print(f'\nBased on your tweet, you are probably from: {locations[location[0]]}')


Enter your tweet:
I love New York, New York

Based on your tweet, you are probably from: New York
