## Analyzing Characters in Tweets

After noticing that tweets occasionally contain special characters such as a downward-trend chart emoji, I decided to analyze the distribution of characters to find out how often such characters appear in the tweet texts. The data presented here are tweets from selected Twitter accounts, namely
* @business,
* @BloombergNRG,
* @ftenergy,
* @IEA,
* @LBBW_Research and
* @FT
from November 1st through November 14th. Unfortunately, no further data was available at the time of this analysis. Repeating the same analysis in a few weeks, once more tweets have been collected, should make the results more reliable.

In [1]:
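import database
from collections import Counter

counter = Counter()
db = database.Database()
with db.get_session() as session:
    # Assuming an SQLAlchemy-style session: each result row is a
    # one-element tuple that holds the tweet text.
    tweet_rows = session.query(database.Tweet.text).all()
for (text,) in tweet_rows:
    # Unpack the row first; calling str() on the row itself would also
    # count the parentheses, quotes and comma of its tuple representation.
    # Lowercase the text so that "A" and "a" are tallied together.
    counter.update(text.lower())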

Now, let us print the characters in ascending order of frequency, i.e. the rarest ones first. I would expect letters of the Latin alphabet to be most prevalent, common punctuation (such as commas and question marks) to be moderately common, and emojis to be the least common. Our goal here is to find out which symbols to look for when working with the text. An initial hypothesis might be that a chart symbol depicting a downward trend is an indicator of a potentially important economic event.

In [2]:
for char, count in reversed(counter.most_common()):
    print("Character {} (No. {}) occurred {} time{}.".format(char, ord(char), count, "s" if count > 1 else ""))

Character ç (No. 231) occurred 1 time.
Character 💍 (No. 128141) occurred 1 time.
Character 🎄 (No. 127876) occurred 1 time.
Character » (No. 187) occurred 1 time.
Character î (No. 238) occurred 1 time.
Character 🏾 (No. 127998) occurred 1 time.
Character 🙋 (No. 128587) occurred 1 time.
Character 🐧 (No. 128039) occurred 1 time.
Character 🐢 (No. 128034) occurred 1 time.
Character 🐟 (No. 128031) occurred 1 time.
Character 🦀 (No. 129408) occurred 1 time.
Character 节 (No. 33410) occurred 1 time.
Character 棍 (No. 26829) occurred 1 time.
Character 光 (No. 20809) occurred 1 time.
Character 😱 (No. 128561) occurred 1 time.
Character œ (No. 339) occurred 1 time.
Character ⌛ (No. 8987) occurred 1 time.
Character 🔊 (No. 128266) occurred 1 time.
Character 📨 (No. 128232) occurred 1 time.
Character ﬀ (No. 64256) occurred 1 time.
Character 🇵 (No. 127477) occurred 1 time.
Character 🇯 (No. 127471) occurred 1 time.
Character ≠ (No. 8800) occurred 1 time.
Character ⛽ (No. 9981) occurred 1 time.
Character 🚚 (N
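
The expectation that letters dominate, punctuation comes second and emojis are rare can also be checked in aggregate instead of character by character. The following is a minimal sketch, not part of the original notebook: the helper name `summarize_by_category` is hypothetical, and it groups the counts by the first letter of each character's Unicode general category (`L` for letters, `P` for punctuation, `S` for symbols, `Z` for separators, and so on) using the standard library's `unicodedata` module.

```python
import unicodedata
from collections import Counter


def summarize_by_category(char_counter):
    """Sum character counts per major Unicode category (hypothetical helper)."""
    totals = Counter()
    for char, count in char_counter.items():
        # unicodedata.category returns e.g. "Ll", "Po" or "So";
        # the first letter is the major class.
        major_class = unicodedata.category(char)[0]
        totals[major_class] += count
    return totals


# Made-up sample text standing in for the real tweet data.
sample_counter = Counter("oil falls 📉, markets 😱!")
for major_class, total in summarize_by_category(sample_counter).most_common():
    print("Category {}: {} characters".format(major_class, total))
```

Applied to the `counter` built above, the same helper would show at a glance how small the share of symbols is compared to letters and punctuation.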

Looking at the output, we can see that some of these characters carry meaning. For example, "😱" represents shock and "📈" symbolizes an upward trend. An easy way to preserve such meaning in the tweet text is to replace the symbol with a textual representation, e.g. to replace "😱" with the word "shocked". I have already prepared a function for this purpose, which will come in handy when the data is prepared.
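
The function mentioned above is not shown here, so the following is only a sketch of how such a replacement might look. The name `replace_symbols` and the hand-picked mapping `EMOJI_WORDS` are assumptions made for illustration. Symbols without a mapping fall back to their lowercased official Unicode name, which is one possible design choice among several.

```python
import unicodedata

# Hypothetical, hand-picked mapping; extend it with whichever symbols
# turn out to be relevant in the data.
EMOJI_WORDS = {
    "😱": "shocked",
    "📈": "upward trend",
    "📉": "downward trend",
}


def replace_symbols(text, mapping=EMOJI_WORDS):
    """Replace symbols with words (illustrative sketch, not the author's function)."""
    parts = []
    for char in text:
        if char in mapping:
            # Pad with spaces so the word does not merge with its neighbours.
            parts.append(" " + mapping[char] + " ")
        elif unicodedata.category(char) == "So":
            # Fallback for unmapped symbols: use the Unicode name if one exists.
            name = unicodedata.name(char, "")
            parts.append(" " + name.lower() + " " if name else char)
        else:
            parts.append(char)
    # Collapse the extra whitespace introduced by the padding.
    # Note that this also normalizes line breaks to single spaces.
    return " ".join("".join(parts).split())


print(replace_symbols("Oil prices 📉 today 😱"))
```

An explicit mapping keeps the replacement words short and close to what a reader would infer from the symbol, while the fallback avoids silently dropping symbols that have not been mapped yet.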