# Twitter – Bag of Words

**Global setup**

In [None]:
try:
    with open("../global_setup.py") as setupfile:
        exec(setupfile.read())
except FileNotFoundError:
    print('Setup already completed')

**Local setup**

In [None]:
from pathlib import Path
import json

from src.text.twitter.twitter_client import TwitterClient
from notebooks.exercises.src.text.bag_of_words import BagOfWords
from src.utility.files import ensure_directory

%matplotlib notebook

authentication_path = Path("data", "twitter", "authentication.json")
ensure_directory(authentication_path)

## Introduction

A bag-of-words representation describes each document by counting the distinct words it contains. First, the distinct words across a set of documents (also called a corpus) are found. This set of words is called the vocabulary. Then, for each document, the number of occurrences of each word in the vocabulary is counted.
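
The two steps can be sketched in plain Python. This is only an illustration, not how `BagOfWords` is implemented: the `tokenize` and `bag_of_words` helpers, the regular expression used to split words, and the two-sentence corpus are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenisation)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_words(corpus):
    """Return the sorted vocabulary and one count vector per document."""
    tokenized = [tokenize(document) for document in corpus]

    # Step 1: the vocabulary is the set of distinct words in the whole corpus.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})

    # Step 2: count how often each vocabulary word occurs in each document.
    counts = []
    for tokens in tokenized:
        word_counts = Counter(tokens)
        counts.append([word_counts[word] for word in vocabulary])
    return vocabulary, counts


# Hypothetical two-document corpus, used only for this illustration.
corpus = ["the cat sat", "the cat saw the dog"]
vocabulary, counts = bag_of_words(corpus)
print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(counts)      # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each row of `counts` is one document and each column is one vocabulary word, which is the matrix layout that a bag-of-words plot typically shows.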

### Example

To show how this is done, we first need a corpus. We start with a fictional corpus of tweets.

A new Twitter user with limited vocabulary might tweet something like the following:

In [None]:
fictional_tweets = [
    "This is my first tweet!",
    "My second tweet is not like my first tweet.",
    "I really like to tweet in my own words. This is fun.",
    "I'm running out of things to tweet.",
    "I don't really like to tweet anymore."
]

We start by focusing on the first tweet.

In [None]:
one_tweet_bag_of_words = BagOfWords(fictional_tweets[0])
one_tweet_bag_of_words.plot()

Next, we include the second tweet:

In [None]:
two_tweet_bag_of_words = BagOfWords(fictional_tweets[0:2])
two_tweet_bag_of_words.plot()

Finally, we include all tweets:

In [None]:
fictional_bag_of_words = BagOfWords(fictional_tweets)
fictional_bag_of_words.plot()

### tf-idf
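
Term frequency-inverse document frequency (tf-idf) reweights the raw counts by how rare each word is across the corpus, so words that appear in every document are pushed toward zero. The sketch below assumes the plain variant idf = log(N / df), where N is the number of documents and df is the number of documents containing the word. The `tfidf` function and the small count matrix are illustrative only, and `BagOfWords` may use a different (for example smoothed or normalised) variant.

```python
import math


def tfidf(counts):
    """Weight raw counts by idf = log(N / df) for each vocabulary column."""
    n_documents = len(counts)
    n_words = len(counts[0])

    # Document frequency: the number of documents that contain each word.
    document_frequency = [
        sum(1 for row in counts if row[j] > 0) for j in range(n_words)
    ]

    # Words that never occur get an idf of 0 to avoid dividing by zero.
    idf = [
        math.log(n_documents / df) if df else 0.0
        for df in document_frequency
    ]
    return [[row[j] * idf[j] for j in range(n_words)] for row in counts]


# Hypothetical count matrix for the vocabulary ['cat', 'dog', 'sat', 'saw', 'the'].
counts = [
    [1, 0, 1, 0, 1],  # "the cat sat"
    [1, 1, 0, 1, 2],  # "the cat saw the dog"
]
weights = tfidf(counts)
for row in weights:
    print([round(value, 3) for value in row])
```

In this toy corpus "cat" and "the" occur in both documents, so their idf is log(2 / 2) = 0 and they drop out, while the words unique to one document keep a weight of log(2).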



In [None]:
fictional_bag_of_words.plot(tfidf=True)

### Questions

1. Why not use a set of real tweets instead of the fictional tweets?
    * In a small set of real tweets, the vocabularies of the documents overlap very little.
2. Try replacing the fictional tweets with your own.
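
One way to put a number on the vocabulary overlap mentioned in question 1 is the Jaccard similarity between the word sets of two texts. The sketch below is an assumption-laden illustration: the `vocabulary_overlap` helper, its whitespace tokenisation, and the example strings are all made up for this purpose.

```python
def vocabulary_overlap(first, second):
    """Jaccard similarity between the word sets of two texts."""
    first_words = set(first.lower().split())
    second_words = set(second.lower().split())
    union = first_words | second_words
    if not union:
        # Two empty texts share no words.
        return 0.0
    return len(first_words & second_words) / len(union)


# Hypothetical texts: two that share words and one on an unrelated topic.
similar = vocabulary_overlap("my first tweet", "my second tweet")
unrelated = vocabulary_overlap("my first tweet", "stock markets rally")
print(similar)    # 0.5
print(unrelated)  # 0.0
```

A corpus in which most pairs score close to zero produces a bag-of-words matrix where almost every column is used by a single document.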

## Real tweets

In [None]:
# For easier development, the consumer key and secret can be read from a file
use_file = True
consumer_key = ""
consumer_secret = ""

if use_file:
    twitter = TwitterClient.authenticate_from_path()
else:
    twitter = TwitterClient(consumer_key, consumer_secret)
    twitter.save_authentication_to_path()

Next, we fetch some tweets:

In [None]:
tweets = twitter.search("#AI", count=10)

Then, we extract the text from each tweet:

In [None]:
tweet_texts = [
    tweet.text_excluding(
        hashtags=False,
        mentions=False,
        urls=False
    )
    for tweet in tweets
]

Finally, we plot both the bag-of-words matrix and its tf-idf-transformed version:

In [None]:
bag_of_words = BagOfWords(tweet_texts)
bag_of_words.plot()

In [None]:
bag_of_words.plot(tfidf=True)