Some observations
=================

**vocabulary size**    
tweet tokenizer / no preprocessing = 313803    
tweet tokenizer / with cleaning method = 260580    
tweet tokenizer / with cleaning method, reduce length = 240963

## Create the Vocab Set

In [10]:
import pickle
import pandas as pd
import math
from collections import Counter
import sys
import csv
import string
import re
import emoji
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from itertools import chain
from indexer import TwitterIQ

We'll define some module-level variables for our `clean` function to use. They live outside the function so that they aren't rebuilt on every call.

In [14]:
tokenizer = TweetTokenizer(reduce_len=True)
unicodes2remove = [
    # all kinds of quotes
    u'\u2018', u'\u2019', u'\u201a', u'\u201b', u'\u201c', \
    u'\u201d', u'\u201e', u'\u201f', u'\u2014',
    # all kinds of hyphens
    u'\u002d', u'\u058a', u'\u05be', u'\u1400', u'\u1806', \
    u'\u2010', u'\u2011', u'\u2012', u'\u2013',
    u'\u2014', u'\u2015', u'\u2e17', u'\u2e1a', u'\u2e3a', \
    u'\u2e3b', u'\u2e40', u'\u301c', u'\u3030',
    u'\u30a0', u'\ufe31', u'\ufe32', u'\ufe58', u'\ufe63', \
    u'\uff0d', u'\u00b4'
]

# regex to match urls (taken from the web); raw strings keep the
# backslash escapes intact without invalid-escape warnings
urlregex = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]'
                      r'|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
# keep @ to be able to recognize usernames
punctuation = string.punctuation.replace('@', '') + ''.join(unicodes2remove)
punctuation = punctuation.replace('#', '')
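# a bunch of emoji unicodes
# (emoji.UNICODE_EMOJI is only available in older emoji releases;
# emoji 2.x provides emoji.EMOJI_DATA instead)
emojis = ''.join(emoji.UNICODE_EMOJI)
emojis = emojis.replace('#', '')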
# combined english and german stop words
stop_words = set(stopwords.words('english') + stopwords.words('german'))

In [15]:
def clean(s):
    """
    Normalizes a string (tweet) by removing URLs, punctuation, digits and
    emojis, lowercasing everything, and tokenizing the result.
    Stop-word removal is currently disabled (see the commented-out line).

    :param s: the string (tweet) to clean
    :return: a list of cleaned tokens
    """
    s = s.replace('[NEWLINE]', '')
    s = s.replace('…', '...')
    s = urlregex.sub('', s).strip()
    s = s.translate(str.maketrans('', '', punctuation + string.digits
                                  + emojis)).strip()
    s = s.lower()
    s = tokenizer.tokenize(s)
    #s = [w for w in s if w not in stop_words]
    return s
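
The character-stripping step in `clean` relies on the three-argument form of `str.maketrans`: every character in the third argument is mapped to `None`, so `str.translate` deletes it. Below is a minimal standard-library sketch. It uses a reduced, hypothetical `to_delete` set in place of the full `punctuation + string.digits + emojis` string, and the sample tweet is made up.

```python
import string

# Hypothetical, reduced deletion set: ASCII punctuation except '@' and '#',
# plus digits. The notebook's real set also contains Unicode quotes,
# hyphens and emojis.
to_delete = string.punctuation.replace('@', '').replace('#', '') + string.digits

# With two empty strings and a third argument, maketrans builds a table
# that maps each character of the third argument to None (i.e. deletes it).
table = str.maketrans('', '', to_delete)

sample = "Can't wait for #WorldCup 2018, thanks @fifa!!!"
stripped = sample.translate(table).strip().lower()
print(stripped)
print(stripped.split())
```

Deleting characters can leave double spaces behind (here, where `2018,` used to be); the tokenizer that runs afterwards splits on whitespace, so these do not produce empty tokens.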

In the next few cells, we'll finish setting everything up. In order:

* `inv_index` is an inverted index (from past assignments) so that we can quickly look up a term's document frequency
* `df` is a Pandas DataFrame holding the ID and text of every tweet (columns 1 and 4 of `tweets.csv`)
* `tweets` is a Pandas Series containing the tweet texts
* `tokenized` is a Pandas Series holding the result of applying the `clean` function above to each tweet, i.e. one list of tokens per tweet
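
`TwitterIQ` comes from an earlier assignment and is not shown here. As a rough illustration of the idea only, the sketch below builds a tiny inverted index with plain dictionaries and reads a term's document frequency off the size of its postings set. The function name `build_inverted_index` and the toy documents are made up for this example and say nothing about how `TwitterIQ` is actually implemented.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for token in tokens:
            index[token].add(doc_id)
    return index

toy_docs = [
    ['cant', 'wait', 'for', 'the', 'world', 'cup'],
    ['the', 'tickets', 'will', 'be', 'available'],
    ['cant', 'wait', 'to', 'test', 'this'],
]
index = build_inverted_index(toy_docs)

# Document frequency = number of documents in the term's postings set.
doc_freq = {term: len(postings) for term, postings in index.items()}
print(doc_freq['cant'], doc_freq['the'], doc_freq['tickets'])
```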


In [11]:
inv_index = TwitterIQ('tweets.csv')

In [12]:
df = pd.read_csv('tweets.csv', sep='\t', usecols=[1,4], names=['id', 'tweet'])

In [13]:
tweets = df['tweet']

In [16]:
tokenized = tweets.apply(clean)

## TF-IDF

In [47]:
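def compute_tfidf(term, doc, tweets):
    """Returns the tf-idf weight of `term` in `doc` (a list of tokens)."""
    counts = Counter(doc)
    tf = counts[term] # term frequency; must be > 0, since log10(0) is undefined
    df = inv_index[term].freq # document frequency
    idf = len(tweets) / (df + 1) # N / (df + 1); adding 1 to `df` avoids division by zero
    # As written, this evaluates to 1 + log10(tf) * log10(idf), which is 1.0 whenever tf == 1.
    # The common log-weighted form is (1 + log10(tf)) * log10(idf).
    return (1 + math.log10(tf) * (math.log10(idf)))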
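
For reference, a widely used log-weighted variant multiplies a damped term frequency, `1 + log10(tf)`, by `log10(N / df)`. The sketch below is self-contained and does not use `inv_index` or the `df + 1` smoothing; the function name `log_tfidf` and the numbers are illustrative assumptions, not values from the tweet corpus.

```python
import math

def log_tfidf(tf, df, n_docs):
    """Log-weighted tf-idf: (1 + log10(tf)) * log10(n_docs / df).

    Returns 0.0 when the term does not occur in the document (tf == 0)
    or in the collection (df == 0).
    """
    if tf <= 0 or df <= 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)

# Toy numbers: a collection of 1000 documents.
rare_once = log_tfidf(tf=1, df=10, n_docs=1000)      # rare term, seen once
rare_often = log_tfidf(tf=10, df=10, n_docs=1000)    # rare term, seen ten times
everywhere = log_tfidf(tf=1, df=1000, n_docs=1000)   # term in every document
print(rare_once, rare_often, everywhere)
```

In this form a term that appears in every document gets weight 0, and repeating a term raises its weight only logarithmically.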

In [93]:
def tfidf(doc1, doc2, tweets):
    """
    Calculates the tf-idf scores of the terms that occur in both documents
    and returns them as a dictionary in which each entry looks like:
    
    {term: (.61, .97)}
    
    Here .61 and .97 are the scores of `term` in `doc1` and `doc2`,
    as returned by `compute_tfidf`.
    
    doc1, doc2: lists of tokens
    tweets: the collection of tweets (only its length is used)
    """
    intersect = set(doc1) & (set(doc2))
        
    return {term : (compute_tfidf(term, doc1, tweets), compute_tfidf(term, doc2, tweets))
            for term in intersect}
        
    """
    d1 = {}
    d2 = {}
    
    for t in set(doc1):
        if t in intersect:
            d1[t] = compute_tfidf(t, doc1, tweets)
    for t in set(doc2):
        if t in intersect:
            #tf idf
            d2[t] = compute_tfidf(t, doc2, tweets)
            
    df_tfidf = pd.DataFrame().from_dict(d1, orient='index')
    df_tfidf[1] = pd.DataFrame().from_dict(d2, orient='index')
    return df_tfidf
    """

In [91]:
def cosine_dict(vector):
    """Gets the cosine similarity of two vectors stored as one dictionary of {term: (weight1, weight2)}."""
    if not vector:
        return 0
    
    numerator = 0
    denominator = 0
    vec1_length = 0
    vec2_length = 0
    # Walks through all tfidf pairs in the dictionary
    for pair in vector.values():
        numerator += pair[0] * pair[1] # Multiplies each value pair
        vec1_length += pair[0]**2 # Squares the first value
        vec2_length += pair[1]**2 # Squares the second value
    vec1_length = math.sqrt(vec1_length)
    vec2_length = math.sqrt(vec2_length)
    denominator = vec1_length * vec2_length
    if denominator == 0: # avoid dividing by zero for all-zero vectors
        return 0
    return numerator / denominator
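
Note that `cosine_dict` only sees the terms that both documents share, so the vector lengths ignore every other term; two documents that share a single term with positive weights therefore always score 1.0. A possible alternative, sketched here under the assumption that each document is represented by its own `{term: weight}` dictionary, takes the dot product over shared terms but computes each length over all of that document's terms. The name `cosine_sparse` and the toy weights are hypothetical.

```python
import math

def cosine_sparse(vec1, vec2):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    # Only shared terms contribute to the dot product.
    dot = sum(weight * vec2[term] for term, weight in vec1.items() if term in vec2)
    # Each length is taken over all terms of its own vector.
    len1 = math.sqrt(sum(w * w for w in vec1.values()))
    len2 = math.sqrt(sum(w * w for w in vec2.values()))
    if len1 == 0 or len2 == 0:
        return 0.0
    return dot / (len1 * len2)

a = {'campaign': 3.0, 'finance': 4.0}
b = {'campaign': 3.0, 'giveaway': 4.0}
print(cosine_sparse(a, b))   # 9 / (5 * 5) = 0.36
print(cosine_sparse(a, a))   # identical vectors
```

With this normalization the single shared term `campaign` no longer yields a perfect score, because `finance` and `giveaway` still contribute to the lengths.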

In [19]:
def cosine(vec1, vec2):
    """Gets the cosine similarity of two equal-length sequences of numbers."""
    if len(vec1) == 0 or len(vec2) == 0:
        return 0
    numerator = 0
    denominator = 0
    vec1_length = 0
    vec2_length = 0
    for v1,v2 in zip(vec1,vec2):
        numerator += v1*v2
        vec1_length += v1*v1
        vec2_length += v2*v2
    vec1_length = math.sqrt(vec1_length)
    vec2_length = math.sqrt(vec2_length)
    denominator = vec1_length * vec2_length
    if denominator == 0: # avoid dividing by zero for all-zero vectors
        return 0
    return numerator / denominator

In [89]:
def top_x(x, q, tweets, cleaned=True):
    """
    x: number of top-scoring tweets to return
    q: query to compare to
    tweets: all the tweets -> assumed to be cleaned/tokenized
    cleaned: whether `q` is already cleaned/tokenized; pass False for a raw
             string so that it is run through `clean` first
    """
    if not cleaned:
        q = clean(q)
    
    return sorted([(cosine_dict(tfidf(q, tweet, tweets)), tweet) for tweet in tweets], reverse=True)[:x]
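
If only the best `x` matches are needed, `heapq.nlargest` from the standard library avoids sorting the whole collection. The sketch below uses a stand-in scoring function (token overlap) instead of `cosine_dict(tfidf(...))`, so the names `overlap_score` and `top_matches` and the toy tweets are assumptions for illustration only.

```python
import heapq

def overlap_score(query, doc):
    """Stand-in similarity: fraction of query tokens that occur in doc."""
    if not query:
        return 0.0
    doc_terms = set(doc)
    return sum(1 for token in query if token in doc_terms) / len(query)

def top_matches(x, query, docs, score=overlap_score):
    """Return the x highest-scoring (score, doc) pairs, best first."""
    scored = ((score(query, doc), doc) for doc in docs)
    # key=... compares scores only, so tied documents are never compared.
    return heapq.nlargest(x, scored, key=lambda pair: pair[0])

toy_tweets = [
    ['cant', 'wait', 'for', 'the', 'world', 'cup'],
    ['campaign', 'finance', 'laws'],
    ['he', 'is', 'supposed', 'to', 'know', 'the', 'law'],
]
query = ['campaign', 'finance', 'law']
best = top_matches(2, query, toy_tweets)
print(best)
```

Passing an explicit `key` means ties are not broken by comparing the token lists themselves, which is what happens when whole `(score, tweet)` tuples are sorted.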

### Examples
Run the cell for the pair of example documents you want to compare.

#### 1

In [43]:
# @Brandon: it's fun to play around with those documents
doc1 = "this is a random tweet Hausarzt Affe Affe Affe".split()
doc2 = "this is random a a a a a tweet Hausarzt Hausarzt Hausarzt Hausarzt Hausarzt I think bla foo Affe".split()

#### 2

In [67]:
doc1 = "i don't think society understands how hurtful it is when this kind of behavior by the POTUS becomes an accepted form of political discourse".split()
doc2 = 'And it is grievously hurtful to our society when vilification becomes an accepted form of political debate and negative campaigning becomes a full-time occupation.'.split()

#### 3

In [64]:
doc1 = "He was a sk8er boi, she said see you later boy".split()
doc2 = "I'm with the sk8er boi, I said see you later boy".split()

#### 4

In [59]:
doc1 = 'and she told me Ich sitze noch in der Küche'.split()
doc2 = 'Was meinst du mit sitting here with nachos'.split()

### Compute scores

In [68]:
df_tfidf = tfidf(doc1, doc2, tweets)
df_tfidf

{'hurtful': (1.0, 1.0),
 'it': (1.0, 1.0),
 'form': (1.0, 1.0),
 'an': (1.0, 1.0),
 'society': (1.0, 1.0),
 'when': (1.0, 1.0),
 'is': (1.0, 1.0),
 'becomes': (1.0, 2.5287296334437492),
 'political': (1.0, 1.0),
 'of': (2.5287296334437492, 1.0),
 'accepted': (1.0, 1.0)}

In [69]:
cosine_dict(df_tfidf)

0.857451092665427

In [84]:
article = """
President Donald Trump on Thursday ended his silence on Michael Cohen's prison sentencing, claiming he never directed his longtime attorney to break the law and that he bears no responsibility for Cohen's campaign finance violations.

In his first tweets since Cohen was sentenced to three years for a series of tax fraud and lying charges on Wednesday, Trump argued that Cohen pleaded guilty to breaking campaign finance laws to get a lighter sentence. The president also questioned whether any legal violations even occurred.

“I never directed Michael Cohen to break the law. He was a lawyer and he is supposed to know the law. It is called ‘advice of counsel,’ and a lawyer has great liability if a mistake is made. That is why they get paid," Trump wrote on Twitter across a flurry of posts Thursday morning. "Despite that many campaign finance lawyers have strongly stated that I did nothing wrong with respect to campaign finance laws, if they even apply, because this was not campaign finance."
"""

In [90]:
top_x(100, article, tokenized)

[(1.0,
  ['🥳',
   'was',
   'superhappy',
   'and',
   'very',
   'surprised',
   'already',
   'received',
   'my',
   'giveaway',
   'win',
   'thanks',
   'a',
   'lot',
   '@smoktechlogy',
   'again',
   'and',
   'again',
   'cant',
   'wait',
   'to',
   'test',
   'this',
   'thing',
   'of',
   'beauty',
   'and',
   'watch',
   'the',
   'world',
   'cup',
   '#worldcup']),
 (1.0,
  ['待ってられないんだ',
   '！',
   'ミクはヨーロッパに来るからすごく嬉しいよ',
   '！',
   'チケットが注文できるとすぐに買うよ',
   '！',
   '☆',
   'i',
   'cant',
   'wait',
   'miku',
   'will',
   'be',
   'coming',
   'to',
   'europe',
   'so',
   'im',
   'super',
   'happy',
   'as',
   'soon',
   'as',
   'the',
   'tickets',
   'will',
   'be',
   'available',
   'i',
   'will',
   'buy',
   '#mikuexpo',
   '#europe',
   '#hatsunemiku']),
 (1.0,
  ['⠀',
   '⠀',
   '⠀',
   '❝',
   'you',
   'know',
   'im',
   'actually',
   'pretty',
   'content',
   'with',
   'this',
   '❞',
   '╔',
   '║',
   '⠀',
   '⠀',
   '•',
   'finally',
   'a'