some observations    
===============

**vocabulary size**    
tweet tokenizer / no preprocessing = 313803    
tweet tokenizer / with cleaning method = 260580    
tweet tokenizer / with cleaning method, reduce length = 240963

In [1]:
import pickle
import pandas as pd
import math
from collections import Counter
import sys
import csv
import string
import re
import emoji
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from itertools import chain
from indexer import TwitterIQ

We'll assign some variables for our `clean` function to use. We're doing so outside of the function itself so that they needn't be defined every time we want to use the clean method.

In [2]:
tokenizer = TweetTokenizer(reduce_len=True)
unicodes2remove = [
    # all kinds of quotes
    u'\u2018', u'\u2019', u'\u201a', u'\u201b', u'\u201c', \
    u'\u201d', u'\u201e', u'\u201f', u'\u2014',
    # all kinds of hyphens
    u'\u002d', u'\u058a', u'\u05be', u'\u1400', u'\u1806', \
    u'\u2010', u'\u2011', u'\u2012', u'\u2013',
    u'\u2014', u'\u2015', u'\u2e17', u'\u2e1a', u'\u2e3a', \
    u'\u2e3b', u'\u2e40', u'\u301c', u'\u3030',
    u'\u30a0', u'\ufe31', u'\ufe32', u'\ufe58', u'\ufe63', \
    u'\uff0d', u'\u00b4'
]

punctuation = string.punctuation.replace('@', '') + ''.join(unicodes2remove)
# regex to match urls (taken from the web)
urlregex = re.compile('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]'
                           '|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
# keep @ to be able to recognize usernames
punctuation = string.punctuation.replace('@', '') + ''.join(unicodes2remove)
punctuation = punctuation.replace('#', '')
# a bunch of emoji unicodes
emojis = ''.join(emoji.UNICODE_EMOJI)
emojis = emojis.replace('#', '')
# combined english and german stop words
stop_words = set(stopwords.words('english') + stopwords.words('german'))

In [3]:
def clean(s):
    """
    Normalizes a string (tweet) by removing the urls, punctuation, digits,
    emojis, by putting everything to lowercase and removing the
    stop words. Tokenization is performed aswell.

    :param s the string (tweet) to clean
    :return: returns a list of cleaned tokens
    """
    s = s.replace('[NEWLINE]', '')
    s = s.replace('…', '...')
    s = urlregex.sub('', s).strip()
    s = s.translate(str.maketrans('', '', punctuation + string.digits \
                                  + emojis)).strip()
    s = s.lower()
    s = tokenizer.tokenize(s)
    return s

In the next few cells, we'll finish setting everything up. In order:

* `inv_index` is an inverted index (from past assignments) so that we can quickly get terms' document frequencies
* `df` is a Pandas DataFrame containing all the tweets, their authors, IDs, and other info
* `tweets` is a Pandas Series containing the tweets
* `tokenized` is a Pandas Series of lists containing the results of the above `clean` method, so lists of tokenized terms


In [4]:
inv_index = TwitterIQ('tweets.csv')

In [5]:
df = pd.read_csv('tweets.csv', sep='\t', usecols=[1,4], names=['id', 'tweet'])

In [6]:
tweets = df['tweet']

In [7]:
tokenized = tweets.apply(clean)

## TF-IDF

In [8]:
def compute_tfidf(term, doc, tweets):
    """
    """
    counts = Counter(doc)
    tf = counts[term] # term frequency
    if not tf:
        return 0
    df = inv_index[term].freq # document frequency
    idf = len(tweets) / (df + 1) # idf, adding 1 to `df` to avoid zero division
    return (1 + math.log10(tf) * (math.log10(idf)))

In [9]:
def tfidf(doc1, doc2, tweets):
    """
    This function calculates the tf-idf scores for two documents and returns them as a dictionary
    wherein each value might be best visualized as:
    
    {term: (.61, .97)}
    
    Here .61 and .97 are tfidf scores from `compute_tfidf`
    
    doc1,doc2 -> list of tokens
    tweets: a collection of lists of tokenized tweets
    """
    intersect = set(doc1) | (set(doc2))
        
    return {term : (compute_tfidf(term, doc1, tweets), compute_tfidf(term, doc2, tweets))
            for term in intersect}

In [10]:
def cosine_dict(vector):
    """Gets the cosine similarity of two vectors represented as dictionaries."""
    if not vector:
        return 0
    
    numerator = 0
    denominator = 0
    vec1_length = 0
    vec2_length = 0
    # Walks through all tfidf pairs in the dictionary
    for pair in vector.values(): 
        numerator += pair[0] * pair[1] # Multipies each value pair
        vec1_length += pair[0]**2 # Squares the first value
        vec2_length += pair[1]**2 # Squares the second value
    vec1_length = math.sqrt(vec1_length)
    vec2_length = math.sqrt(vec2_length)
    denominator = vec1_length * vec2_length
    if not denominator:
        return 0
    return numerator / denominator

In [21]:
def top_x(x, q, tweets, cleaned=False):
    """
    x: top x number
    q: query to compare to
    tweets: all the tweets -> assumed to be cleaned/tokenized
    cleaned: whether `q` is cleaned
    """
    if not cleaned:
        q = clean(q)
    
    return sorted([(cosine_dict(tfidf(q, tweet, tweets)), ' '.join(tweet)) for tweet in tweets], reverse=True)[:x]

First Text
==========

In [16]:
article = """
President Donald Trump on Thursday ended his silence on Michael Cohen's prison sentencing, claiming he never directed his longtime attorney to break the law and that he bears no responsibility for Cohen's campaign finance violations.

In his first tweets since Cohen was sentenced to three years for a series of tax fraud and lying charges on Wednesday, Trump argued that Cohen pleaded guilty to breaking campaign finance laws to get a lighter sentence. The president also questioned whether any legal violations even occurred.

“I never directed Michael Cohen to break the law. He was a lawyer and he is supposed to know the law. It is called ‘advice of counsel,’ and a lawyer has great liability if a mistake is made. That is why they get paid," Trump wrote on Twitter across a flurry of posts Thursday morning. "Despite that many campaign finance lawyers have strongly stated that I did nothing wrong with respect to campaign finance laws, if they even apply, because this was not campaign finance."
"""
# Source: [https://www.politico.com/story/2018/12/13/trump-breaks-silence-michael-cohen-sentencing-1061817]

In [26]:
#top = top_x(100, article, tokenized)
#top

#we pickled the result -> see below

In [27]:
with open('first100.pickle', 'rb') as f:
    top = pickle.load(f)
top

[(0.4871853685799716,
  '@bikerbot @yourmomsbackup @roylellis @jpelusio @sensanders yeah and according to trump he planted a spie in his campaign in reality that was a informant of the fbi following protocol you know the fbi that was investigating it so how comes you support trump of you believe he is a russian operative and obama shpuld have stopped him'),
 (0.4694777261003532,
  'great piece on the carbohydrateinsulin model of obesity debate with a disclaimer to remember for basically anyone making any claim ever i dont want to be on the wrong side of history and one way to do that is to make overly confident and categorical predictions'),
 (0.46807387467777445,
  '@nabbingkeita @jclfc the problem was a blockade of his tmj and cervical musculature due to a hit in the jaw he received a couple months earlier that led to problems with his metabolism'),
 (0.46667501610004897,
  '@yudakaneo it was odd that the rather cheerful guy wasnt opening up the door and takahiro had to inhale a deep

Second Text
===========

In [32]:
article2 = tweets[119737]
article2

"Die Wartezeit beträgt ca 15 Minuten[NEWLINE]nach 25 fucking Minuten immer noch in der Warteschleife vom @hmdeutschland Kundenservice.[NEWLINE]And I'm kinda pissed... https://t.co/EFCBDWwn7F"

In [31]:
top2 = top_x(100, article2, tokenized)
top2

[(1.0000000000000002,
  'die wartezeit beträgt ca minutennach fucking minuten immer noch in der warteschleife vom @hmdeutschland kundenserviceand im kinda pissed'),
 (0.4082482904638631, 'im kinda pissed'),
 (0.4082482904638631, 'im fucking pissed'),
 (0.4082482904638631, 'im fucking pissed'),
 (0.4082482904638631, 'im fucking pissed'),
 (0.4082482904638631, 'im fucking pissed'),
 (0.4082482904638631, 'im fucking pissed'),
 (0.383649508822822,
  '@hilliknixibix die hab ich als kind immer vom dorfbäcker holen müssenab der die noch im sortiment hat'),
 (0.3535533905932738, 'im so fucking pissed'),
 (0.3535533905932738, 'im so fucking pissed'),
 (0.3535533905932738, 'im so fucking pissed'),
 (0.3535533905932738, 'im so fucking pissed'),
 (0.3535533905932738, 'im so fucking pissed'),
 (0.3535533905932738, 'im so fucking pissed'),
 (0.3535533905932738, 'im so fucking pissed'),
 (0.3535533905932738, 'im so fucking pissed'),
 (0.3535533905932738, 'im so fucking pissed'),
 (0.3535533905932738,

Third Text
==========

In [37]:
article3 = tweets[119757]
article3

'@tthicup IM GLAD YOU THINK SO I WAS JUST SHOCKED FOR A MOMENT LOLLLLL'

In [None]:
top3 = top_x(100, article3, tokenized)
top3