In [24]:
import gzip
import json
import pandas as pd
import re
import ufal.udpipe as udpipe #pip install ufal.udpipe
from collections import Counter

## 1. Read tweets in Python

First I read all the tweets from the file. Each record contains the full metadata of a tweet, so I will need to narrow down which fields to look at.

In [2]:
tweet_list = []
with gzip.open("intro-to-nlp/english-tweets-sample.jsonl.gz") as f:
    for line in f:
        tweet_list.append(json.loads(line))

In [3]:
for tweet in tweet_list[:5]:
    print(tweet)

{'created_at': 'Tue Dec 26 14:16:22 +0000 2017', 'id': 945659557480611840, 'id_str': '945659557480611840', 'text': 'Check out my class in #GranblueFantasy! https://t.co/pAvXn8diJr', 'display_text_range': [0, 39], 'source': '<a href="http://granbluefantasy.jp/" rel="nofollow">グランブルー ファンタジー</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 883980236655779840, 'id_str': '883980236655779840', 'name': 'Pc Kwok', 'screen_name': 'jensenpck', 'location': None, 'url': None, 'description': None, 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 0, 'friends_count': 1, 'listed_count': 0, 'favourites_count': 0, 'statuses_count': 42, 'created_at': 'Sun Jul 09 09:24:46 +0000 2017', 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'lang': 'zh-tw', 'contributors_enabled': False, 'is_translator': False, 'pro
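Since each record carries a lot of metadata, it can help to keep only a few fields while reading. The following is a minimal sketch, not part of the original notebook: the two records are made up, the file is built in memory instead of being read from disk, and the choice of `fields` is an assumption about what is worth inspecting.

```python
import gzip
import io
import json

# Two made-up records standing in for lines of the real .jsonl.gz file.
sample_records = [
    {"id": 1, "text": "hello world", "truncated": False, "user": {"screen_name": "a"}},
    {"id": 2, "text": "second tweet", "truncated": False, "user": {"screen_name": "b"}},
]

# Build a gzip-compressed JSON Lines payload in memory.
buffer = io.BytesIO()
with gzip.open(buffer, "wt", encoding="utf-8") as f:
    for record in sample_records:
        f.write(json.dumps(record) + "\n")
buffer.seek(0)

# Read it back line by line and keep only the fields of interest.
fields = ("id", "text", "truncated")  # assumed selection, adjust as needed
narrowed = []
with gzip.open(buffer, "rt", encoding="utf-8") as f:
    for line in f:
        tweet = json.loads(line)
        narrowed.append({key: tweet.get(key) for key in fields})

print(narrowed)
```

Opening the file in text mode (`"rt"`) with an explicit encoding makes the decoding step visible instead of leaving it to `json.loads`.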

## 2. Read the text from the tweets

There are four different places to fetch the full text from, depending on whether the tweet is a retweet and whether it is truncated.

```
['retweeted_status']['extended_tweet']['full_text']
['retweeted_status']['text']
['extended_tweet']['full_text']
['text']
```

A retweet can be truncated too, but this does not show up in the retweet's own top-level `truncated` value. The value to check is the `truncated` field inside `retweeted_status`.

In [4]:
text_list = []
for tweet in tweet_list:
    if('retweeted_status' in tweet):
        if (tweet['retweeted_status']['truncated'] is True):
            text_list.append(tweet['retweeted_status']['extended_tweet']['full_text'])
        else:
            text_list.append(tweet['retweeted_status']['text'])
    elif(tweet['truncated'] is True):
        text_list.append(tweet['extended_tweet']['full_text'])
    else:
        text_list.append(tweet['text'])
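The same lookup order can be wrapped in a small function. This is a sketch under assumptions: `get_full_text` is a hypothetical helper name, the sample tweets are made up, and instead of reading the `truncated` flag it checks whether `extended_tweet` is present, which is a slightly different (more defensive) test than the cell above uses.

```python
def get_full_text(tweet):
    """Return the longest available text for a tweet dict (hypothetical helper)."""
    # For a retweet, the original tweet lives under 'retweeted_status'.
    status = tweet.get("retweeted_status", tweet)
    # Assumption: 'extended_tweet' is only present when the status was truncated.
    extended = status.get("extended_tweet")
    if extended and "full_text" in extended:
        return extended["full_text"]
    return status["text"]


# Made-up tweets covering three of the four cases.
plain = {"text": "short", "truncated": False}
long_tweet = {
    "text": "cut off",
    "truncated": True,
    "extended_tweet": {"full_text": "the whole text"},
}
retweet = {
    "text": "RT @x: cut off",
    "truncated": False,
    "retweeted_status": {
        "text": "cut off",
        "truncated": True,
        "extended_tweet": {"full_text": "the whole original text"},
    },
}

for example in (plain, long_tweet, retweet):
    print(get_full_text(example))
```

With a helper like this, building the list reduces to `[get_full_text(tweet) for tweet in tweet_list]`.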

In [5]:
len(text_list) # We have all the tweets with us

10000

## 3. Segment the text

### What I am removing from each tweet

From the output above, there are plenty of things we do not care about. Judging from these few tweets, the unnecessary parts are:

- Links
- @username mentions
- The indicator for a RT
- Anything that isn't a letter (this might break words such as you're)

I'm leaving in ’ and ' because words like you're and can't exist; they shouldn't matter all too much as long as I keep it consistent.

In [6]:
def apply_heuristics(text_list):  
    "Applies a set of heuristics (regex) to a list of texts"
    text_list_heuristic = []
    for tweet in text_list:
        updated = re.sub(r"http\S+", '', tweet) # Remove links
        updated = re.sub(r"RT.*?:", '', updated) # Remove retweet indicators (will remove most @s)
        updated = re.sub(r"@.*?\s", '', updated) # Remove @s (raw string, so \s is not an invalid escape)
        updated = re.sub(r"[^a-zA-Z’\s']", '', updated) # Remove non-letters
        updated = updated.lower()
        text_list_heuristic.append(updated)
    return text_list_heuristic
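To see what each rule does, the steps can be traced on a single string. This is a sketch on made-up tweets: `clean` is a hypothetical single-tweet version of the loop body above, using the same four patterns in the same order.

```python
import re


def clean(tweet):
    """Apply the same four regex rules as above to one tweet (hypothetical helper)."""
    updated = re.sub(r"http\S+", "", tweet)              # remove links
    updated = re.sub(r"RT.*?:", "", updated)             # remove the RT indicator
    updated = re.sub(r"@.*?\s", "", updated)             # remove @mentions followed by whitespace
    updated = re.sub(r"[^a-zA-Z’\s']", "", updated)      # remove non-letters
    return updated.lower()


sample = "RT @some_user: Check this out https://t.co/abc123 @friend it's 100% great!"
result = clean(sample)

# Removed pieces leave their surrounding spaces behind, so collapse them for display.
print(" ".join(result.split()))  # -> check this out it's great

# A mention at the very end has no trailing whitespace, so the @ rule does not match.
# The non-letter rule then strips only the "@", and the username stays in the text.
print(clean("thanks @friend"))  # -> thanks friend
```

The second call shows one limit of the mention rule: a username at the end of a tweet survives as an ordinary word.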

In [7]:
def apply_pipeline(text_list, path):
    """Takes a udpipe model from a path, and applies it to a list of texts"""
    model = udpipe.Model.load(path)
    if not model: # Model.load returns None when the model cannot be loaded
        raise RuntimeError(f"Cannot load udpipe model from '{path}'")
    # Tokenize only: no tagging, no parsing, one sentence per output line
    pipeline = udpipe.Pipeline(model, "tokenize", "none", "none", "horizontal")

    return [pipeline.process(text) for text in text_list]

In [8]:
model_path = 'intro-to-nlp/en.segmenter.udpipe' 
text_list_heuristic = apply_heuristics(text_list)
segmented_text_list = apply_pipeline(text_list, model_path)

In [9]:
for heuristic, segmented in zip(text_list_heuristic[:20],segmented_text_list[:20]):
    print("Heuristic: ",heuristic)
    print("Pipelined: ",segmented)

Heuristic:  check out my class in granbluefantasy 
Pipelined:  Check out my class in # GranblueFantasy !
https://t.co/pAvXn8diJr

Heuristic:  extending a big thank you to our community partner all over the world 
Pipelined:  Extending a big Thank
You to our Community Partner all over the world !
https://t.co/cu7on7g1si

Heuristic:  blueberry  
Pipelined:  Blueberry 🍨 https://t.co/2gzHAFWYJY

Heuristic:  bad day 
Pipelined:  Bad day ☹️®️

Heuristic:  i'm chim tho
Pipelined:  @prologve_ @BTS_ARMY @BTS_twt I 'm Chim tho

Heuristic:  i need a dog to cuddle with right now
Pipelined:  i need a dog to cuddle with right now

Heuristic:   country inn countryinns campsprings   for taxi  
Pipelined:  RT : Country Inn countryinns # CampSprings 🏨 👉🚖 For Taxi 📞703-445-4450
https://t.co/lXdFUm4qUb

Heuristic:  day  penelope 
Pipelined:  DAY 10 - PENELOPE https://t.co/1z1cgzvZxh

Heuristic:  winnipeggers wake up to the city's coldest christmas in  decades  
Pipelined:  Winnipeggers wake up to the city

### The difference between heuristic and a pipeline

Above I can compare the two approaches tweet by tweet. The heuristic filtering could be improved upon forever, but for a short list of things to remove I think it succeeded pretty well.

In many cases the heuristic way seems to give a better result, but it also leads to more situations where words become concatenated and as such lose a lot of meaning for further analysis of the text.

An example of this is the tweet about use cases and application areas of AI, where the heuristic method just bunches everything together. The pipeline model spaces these out so they are readable words.

For the rest of the assignment I'll use a combination of both, since in my case that gave a decent result.

In [10]:
combined_text_list = apply_heuristics(text_list)
combined_text_list = apply_pipeline(combined_text_list, model_path)

In [11]:
for tweet in combined_text_list:
    print(tweet)

check out my class in granbluefantasy

extending a big thank you to our community partner all over the world

blueberry

bad day

i 'm chim tho

i need a dog to cuddle with right now

country inn countryinns campsprings for taxi

day penelope

winnipeggers wake up to the city 's coldest christmas in decades

id vote for episode count hoga

use casesapplication areas of ai in offlinedigitalmarketing

you ’re allowed to be human

our dad passed away earlier this summer so my mom and i decided to surprise my sisters with bears with his favorite cologne and a recording of his voice it ’s not christmas without you dad but we have you in spirit

you ca n’t win big games if people are coming late to a meeting or just not wearing the right socks if you ca n’t do the small things right how can you expect to do the big things minkah fitzpatrick alabama db

i 'm so sleepy

president trump cuts funding to un after israel vote newsweek

follow utrust on bitcointalk and have your say payment protect

they bought their disabled dog some new wheels for christmas

whiskey tango foxtrot

rt inanotherw videomt v lady gaga zara larsson

oh what i 'm verified heck yea thanks for the xmas present

no ed who has corroborated it nobody

live cam mewtwo viewers male y xxx chat

pussy ass boob hot amateur

boxing day day
another hattrick takes him to pl goals in a new record for a calendar year

now playing on universitypulse i call your name the mamas amp the papas tune in at

it hurts to see other people living out your dream

you know what
i enjoyed our small talks

idk how people can bring someone around their family like nothing amp break up and bring someone new around so quickit ’s such a privilege to meet family i ’m sorry but you not coming no where near my family unless i know you ’re the one i wan na make my life with

genie giveaway rt this tweet if you want genie streaming pass exol only we ’ll check we will pick random person make sure your dm is open giveaway end at am kst exo
u


it ’s impossible to pack a christmas tree back into the original box and close said box and i have to do two of them

do n’t be petty lol

great goal from romain saiss wwfc

truth is that while many brexit leaders pretend to be defending the working class they are only manipulating them to achieve their own right wing agenda scratch beneath the facade and they regard the working class like dirt on their shoes

at sbs gayo yesterday someone suprised when he saw kyungsoo dancing saying private won is dancing heol today someone saw the mv and asked if it s the same person who act in with god it is our do kyungsoo

i refuse to apologize for being a bitch no one has ever apologized to me for treating me like shit amp bringing out the bitch in me

best tak rate kit

kaoru mori is great she draws detailed art and destroys toxic masculinity i am fulfilled

it was very enjoyable ended with a lot of john prine and chuck berry videos niamh got us a bullseye board game too which was great fun

ch

## 4. Word count

This section generates a count for every word across all tweets. I then list the 20 most common words.

In [12]:
token_counter = Counter()
for tweet in combined_text_list:
    tokens = tweet.strip().split()
    token_counter.update(tokens)

In [13]:
counter_df = pd.DataFrame.from_dict(token_counter, orient="index", columns=["Count"])

In [14]:
counter_df.sort_values("Count",ascending=False).head(20)

| Token | Count |
|-------|-------|
| the | 4382 |
| to | 3458 |
| a | 2915 |
| i | 2869 |
| and | 2707 |
| you | 2671 |
| of | 2091 |
| in | 1959 |
| for | 1873 |
| is | 1861 |


All of the words that occur the most are essentially stopwords. This is not surprising, since a large part of the English language is composed of them. There could be a reason to remove these, but for this assignment it doesn't change the end result: once we compute the IDF values, these words will simply end up with a low IDF value instead.
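The ranking can also be read directly from the `Counter`, and filtering stopwords is a one-line change. This is a sketch on made-up tweets: the `stopwords` set is a tiny hypothetical stand-in, not a real stopword list.

```python
from collections import Counter

# Made-up tweets standing in for combined_text_list.
toy_tweets = ["the cat sat on the mat", "the dog", "a cat and a dog"]

counts = Counter()
for tweet in toy_tweets:
    counts.update(tweet.split())

# most_common(n) returns (token, count) pairs sorted by descending count.
top = counts.most_common(3)
print(top)

# Hypothetical, deliberately tiny stopword set for illustration only.
stopwords = {"the", "a", "and", "on"}
filtered = Counter({token: n for token, n in counts.items() if token not in stopwords})
print(filtered.most_common(3))
```

For a quick look this avoids the DataFrame entirely; the DataFrame is still convenient once the counts need to be joined with other columns such as IDF scores.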

## 5. IDF weighting

For this section I will generate an IDF value for each unique word.

DF(t) = the number of documents in which the term t occurs

IDF(t) = m / DF(t)

t = term

m = total number of documents
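The definition can be checked with a little arithmetic. The sketch below uses m = 10000, the number of tweets loaded earlier, together with made-up DF values. `idf_ratio` follows the formula given here; `idf_log` is the logarithmic variant that is more commonly used elsewhere and is included only for comparison, it is not what this notebook computes. Both function names are hypothetical.

```python
import math


def idf_ratio(m, df):
    """IDF as defined in this notebook: total documents divided by document frequency."""
    return m / df


def idf_log(m, df):
    """Common logarithmic variant, shown for comparison only."""
    return math.log(m / df)


m = 10000  # number of tweets in the sample
for df in (1, 10, 5000):  # made-up document frequencies
    print(df, idf_ratio(m, df), round(idf_log(m, df), 3))

# A term found in a single tweet gets the maximum ratio, m itself.
print(idf_ratio(m, 1))  # -> 10000.0
```

Because the logarithm is monotonic, both variants rank terms in the same order; the log only compresses the scale.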

In [15]:
# Start every token seen in the word count at a document frequency of 0
dict_df = dict.fromkeys(token_counter, 0)

In [16]:
def cvt_list_to_dict(input_list):
    """Convert a list to a dictionary"""
    res_dct = {input_list[i]: 0 for i in range(0, len(input_list))}
    return res_dct

In [17]:
def df_value(text_list, dictionary):
    """Convert a list containing tweets or text blocks, to a dictionary of DF(t) values"""
    for tweet in text_list:
        split = tweet.strip().split(' ')
        dict_split = cvt_list_to_dict(split) # Deduplicate so each token counts once per tweet
        for key in dict_split:
            try:
                dictionary[key] += 1
            except KeyError: # Token contains a newline (or is empty), so it is not a known key
                for val in re.split(r"\n", key):
                    if val != "":
                        dictionary[val] += 1
    return dictionary

In [18]:
dict_df = df_value(combined_text_list, dict_df)
m = len(combined_text_list)

In [19]:
def df_to_idf(dictionary, total):
    """Convert a dictionary of DF values to the IDF values"""
    idf_dict = dictionary.copy()
    for key in idf_dict:
        idf_dict[key] = total / idf_dict[key]
    return idf_dict
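The newline handling above is needed because the word count splits on any whitespace while `df_value` splits on single spaces. A more compact alternative is to split the same way in both places and deduplicate with a set. This is a sketch on made-up tweets, not the code that produced the results in this notebook; `document_frequencies` is a hypothetical name.

```python
from collections import Counter


def document_frequencies(texts):
    """Count, for each token, the number of texts it appears in (hypothetical helper)."""
    df = Counter()
    for text in texts:
        # split() handles spaces and newlines alike; set() counts each token once per text.
        df.update(set(text.split()))
    return df


# Made-up tweets, one of them containing a newline like the pipeline output does.
toy_tweets = ["the cat sat", "the the dog", "a cat\nand a dog"]

df_counts = document_frequencies(toy_tweets)
m = len(toy_tweets)
idf = {token: m / n for token, n in df_counts.items()}

print(df_counts["the"], idf["the"])  # -> 2 1.5
print(df_counts["sat"], idf["sat"])  # -> 1 3.0
```

Since every key is created from a token that was actually seen, no document frequency can be zero and the division is always safe.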

In [20]:
idf_dict = df_to_idf(dict_df, m)

In [21]:
df = pd.DataFrame.from_dict(idf_dict, orient="index", columns=["IDF Score"])

#### 20 lowest IDF values

The words with the lowest IDF values are the same words we saw among the most common words. This could be changed by, for example, using a stopword remover such as the nltk.corpus stopwords; however, since IDF already gives these words a low weight, that isn't really worthwhile here.

In [22]:
df.sort_values('IDF Score').head(20)

| Token | IDF Score |
|-------|-----------|
| the | 3.407155 |
| to | 3.877472 |
| a | 4.319654 |
| and | 4.705882 |
| i | 4.933399 |
| you | 5.313496 |
| of | 5.988024 |
| for | 6.045949 |
| in | 6.131208 |
| is | 6.464124 |


#### 20 highest IDF values

These are the 20 words with the highest IDF value. Most of them are actual words, which means our text processing worked out pretty well. However, we still have words like lesscrappiness, which could have been split better with a more rigorous heuristic segmentation.

An IDF value of 10000 means the word occurs in exactly one of the 10,000 tweets. Words this rare might carry a lot of meaning in the text they appear in.

In [23]:
df.sort_values('IDF Score', ascending=False).head(20)

| Token | IDF Score |
|-------|-----------|
| canlab | 10000.0 |
| seemingly | 10000.0 |
| norm | 10000.0 |
| momentos | 10000.0 |
| stans | 10000.0 |
| horrors | 10000.0 |
| exploration | 10000.0 |
| dailymotion | 10000.0 |
| images | 10000.0 |
| crawled | 10000.0 |


## 6. Find near duplicate tweets
Notes from a friend:

1. Preprocessing (stemming & lemmatization)
2. sklearn's `CountVectorizer` (BoW) or `TfidfVectorizer`
3. Feed the result through sklearn's `cosine_similarity`
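
The notes above point to sklearn. As a rough, dependency-free illustration of the same idea, the sketch below builds bag-of-words vectors with `Counter` and compares them with a hand-written cosine similarity. It swaps the sklearn vectorizer and `cosine_similarity` for plain Python, skips the stemming and lemmatization step, and uses made-up tweets and an arbitrary threshold of 0.75, so treat it as a sketch rather than the intended solution.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


def near_duplicates(texts, threshold=0.75):
    """Return (i, j, score) for every pair of texts at or above the threshold."""
    vectors = [Counter(text.split()) for text in texts]
    pairs = []
    for i, j in combinations(range(len(texts)), 2):
        score = cosine(vectors[i], vectors[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


# Made-up tweets: the first two differ in a single word.
toy_tweets = [
    "boxing day sale starts now",
    "boxing day sale starts today",
    "i need a dog to cuddle with",
]

for i, j, score in near_duplicates(toy_tweets):
    print(i, j, round(score, 2))
```

Comparing every pair is quadratic in the number of tweets, which is one reason a vectorized library routine is attractive for the full sample of 10,000.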