
# Basic NLP exercises

* During these exercises, you will learn basic Python skills required in NLP, for example
  * Reading and processing language data
  * Segmenting text
  * Calculating word frequencies and idf weights

* The exercises are based on tweets downloaded using the Twitter API. Both Finnish and English tweets are available; you are free to choose which language to work with.


> Finnish: http://dl.turkunlp.org/intro-to-nlp/finnish-tweets-sample.jsonl.gz

> English: http://dl.turkunlp.org/intro-to-nlp/english-tweets-sample.jsonl.gz


* Both files include 10,000 tweets. If processing the whole file takes too much time, you can also read just a subset of the data, for example only 1,000 tweets.


## 1) Read tweets in Python

* Download the file, and read the data in Python
* **The outcome of this exercise** should be a list of tweets, where each tweet is a dictionary including different (key, value) pairs

In [None]:
!wget -nc http://dl.turkunlp.org/intro-to-nlp/finnish-tweets-sample.jsonl.gz
import gzip
import json

tweets = []
# gzip.open in text mode ("rt") decompresses on the fly, so no separate unzip step is needed
with gzip.open("finnish-tweets-sample.jsonl.gz", "rt", encoding="utf-8") as f:
    # JSON Lines: each line is one JSON object, parsed with json.loads
    for line in f:
        tweets.append(json.loads(line))
print("Here are the json keys: ", tweets[0].keys())


--2021-01-16 03:58:43--  http://dl.turkunlp.org/intro-to-nlp/finnish-tweets-sample.jsonl.gz
Resolving dl.turkunlp.org (dl.turkunlp.org)... 195.148.30.23
Connecting to dl.turkunlp.org (dl.turkunlp.org)|195.148.30.23|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6120485 (5.8M) [application/octet-stream]
Saving to: ‘finnish-tweets-sample.jsonl.gz’


2021-01-16 03:58:44 (4.88 MB/s) - ‘finnish-tweets-sample.jsonl.gz’ saved [6120485/6120485]

dict_keys(['retweeted_status', 'retweet_count', 'favorited', 'geo', 'is_quote_status', 'in_reply_to_user_id', 'place', 'id', 'timestamp_ms', 'coordinates', 'truncated', 'id_str', 'in_reply_to_status_id', 'source', 'in_reply_to_user_id_str', 'text', 'in_reply_to_screen_name', 'contributors', 'retweeted', 'lang', 'created_at', 'filter_level', 'in_reply_to_status_id_str', 'favorite_count', 'entities', 'user'])
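
If processing the whole file takes too long, the reading step can stop after a fixed number of tweets. The following is a minimal sketch, assuming the same one-JSON-object-per-line layout; `read_tweets` and its `limit` parameter are hypothetical names, not part of the exercise material.

```python
import gzip
import json
from itertools import islice


def read_tweets(path, limit=None):
    """Read up to `limit` tweets (all of them if None) from a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        # skip blank lines first, then stop after `limit` non-blank lines
        non_blank = (line for line in f if line.strip())
        return [json.loads(line) for line in islice(non_blank, limit)]
```

With the downloaded file in place, `read_tweets("finnish-tweets-sample.jsonl.gz", limit=1000)` would return the first 1,000 tweets as dictionaries.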


## 2) Extract texts from the tweet jsons

* During these exercises we need only the actual tweet text. Inspect the dictionary and extract the actual text field for each tweet.
* When carefully inspecting the dictionary keys and values, you may see the old Twitter character limit causing unexpected behavior in the text field. In these cases, are you able to extract the full text?
* **The outcome of this exercise** should be a list of tweets, where each tweet is a string.

In [None]:
from random import randint

tweetlist = []
# TODO: check whether a tweet is truncated and, if so, use the full text instead
for tweet in tweets:
    tweetlist.append(tweet["text"])
print("Total number of tweets: ", len(tweetlist))
# randint includes both endpoints, so the upper bound must be len - 1
print("Random example: ", tweetlist[randint(0, len(tweetlist) - 1)])

Total number of tweets:  10000
Random example:  @JanneRautakoski @ElinaNiiranen_ Turvallisia ajokilometrejä! 🙂 🌿
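
One possible way to recover the untruncated text is sketched below. It assumes the tweets follow the Twitter API v1.1 streaming layout, where a truncated tweet carries its complete text under `extended_tweet` → `full_text` and a retweet stores the original tweet under `retweeted_status`. Only `text`, `truncated` and `retweeted_status` appear in the key listing printed earlier; `extended_tweet` and `full_text` are assumptions that should be checked against the data. `full_text_of` is a hypothetical helper name.

```python
def full_text_of(tweet):
    """Return the longest available text for one tweet dictionary."""
    # For a retweet, "text" starts with "RT @user:" and may be cut short;
    # the original tweet sits in retweeted_status.
    original = tweet.get("retweeted_status")
    if original:
        return full_text_of(original)
    # Assumed layout: truncated tweets keep the complete text in extended_tweet["full_text"].
    extended = tweet.get("extended_tweet")
    if extended and "full_text" in extended:
        return extended["full_text"]
    return tweet["text"]
```

Applied to the whole dataset this would be `[full_text_of(t) for t in tweets]`. Note that for retweets the sketch returns the original tweet's text, so the `RT @user:` prefix is dropped.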


## 3) Segment tweets

* Segment tweets using the UDPipe machine learned model, remember to select the correct language.

> English model: https://github.com/TurkuNLP/intro-to-nlp/raw/master/Data/en.segmenter.udpipe

> Finnish model: https://github.com/TurkuNLP/intro-to-nlp/raw/master/Data/fi.segmenter.udpipe

* Note that the segmentation model was not trained on tweets, so it may have difficulties in some cases. Inspect the output to get an idea how well it performs on tweets.
* Note: If the notebook cell dies while trying to load or run the model, the most typical reason is a wrong file path or name, or an incorrectly downloaded model.
* **The output of this exercise** should be a list of segmented tweets, where each tweet is a string.

In [None]:
!wget -nc https://github.com/TurkuNLP/intro-to-nlp/raw/master/Data/fi.segmenter.udpipe
!pip3 install ufal.udpipe
import ufal.udpipe as udpipe

model = udpipe.Model.load("fi.segmenter.udpipe")
# Model.load returns None instead of raising when the file is missing or corrupt
assert model is not None, "Could not load fi.segmenter.udpipe"
# "horizontal" output: one sentence per line, with words separated by a single space
pipeline = udpipe.Pipeline(model, "tokenize", "none", "none", "horizontal")
segmented_tweets = []
for t in tweetlist:
    segmented_tweets.append(pipeline.process(t))
print("Random example: ", segmented_tweets[randint(0, len(segmented_tweets) - 1)])

File ‘fi.segmenter.udpipe’ already there; not retrieving.

Random example:  @rapvoid kukira mimpiin miton😔😠



## 4) Calculate word frequencies

* Calculate a word frequency list (how many times each word appears) based on the tweets. Which are the most common words appearing in the data?
* Calculate the size of your vocabulary (how many unique words there are).
* **The output of this exercise** should be a sorted list of the X most common words and their frequencies, and the number of unique words in the data.

In [None]:

from collections import Counter

token_counter = Counter()
for s in segmented_tweets:
    # the tweets are already segmented, so whitespace splitting is enough
    tokens = s.split()
    token_counter.update(tokens)

print("Most common tokens:", token_counter.most_common(20))
print("Vocabulary size:", len(token_counter))

Most common tokens: [('.', 5162), ('…', 4281), (',', 4088), (':', 3932), ('#', 3543), ('RT', 2766), ('ja', 2482), ('on', 2243), ('!', 1718), ('?', 1092), ('"', 925), ('ei', 820), ('että', 690), ('-', 499), ('(', 494), ('”', 439), (')', 437), ('se', 417), ('–', 387), ('kun', 343)]
Vocabulary size: 53204
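
The list above is dominated by punctuation. If only words are wanted, one option is to keep tokens that contain at least one letter or digit and to lowercase them. The following is a minimal sketch on made-up segmented text; `count_words` is a hypothetical helper and the filtering rule is an assumption, not part of the exercise.

```python
from collections import Counter


def count_words(segmented_texts):
    """Count lowercased tokens that contain at least one letter or digit."""
    counter = Counter()
    for text in segmented_texts:
        tokens = text.split()
        counter.update(tok.lower() for tok in tokens if any(ch.isalnum() for ch in tok))
    return counter


# made-up examples standing in for segmented tweets
sample = ["RT @user : Hyvää huomenta !", "hyvää yötä . . ."]
word_counter = count_words(sample)
print("Most common words:", word_counter.most_common(3))
print("Vocabulary size:", len(word_counter))
```

Mentions such as `@user` still pass this filter; whether they count as words is a further design choice.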


## 5) Calculate idf weights

* Calculate idf weight for each word appearing in the data (one tweet = one document), and print top 20 words with lowest and highest idf values.
* Can you think of a reason why someone could claim that tf does not have a high impact when processing tweets?
* **The output of this exercise** should be a list of words sorted by their idf weights.


In [None]:
# DF = document frequency df(t): the number of documents (tweets) in which the term t appears
# IDF = inverse document frequency, m/df(t), where m is the total number of documents in the collection
import random

DF = Counter()
for s in segmented_tweets:
    # use a set of whole tokens so each token counts once per tweet;
    # a substring test on the tweet string would also match inside longer words
    DF.update(set(s.split()))
IDF = {t: len(segmented_tweets) / DF[t] for t in DF}

example_token = random.choice(list(DF))
print("Total count for '", example_token, "' (selected at random): ", token_counter[example_token])
print("Document frequency for '", example_token, "': ", DF[example_token])
print("Inverse document frequency for '", example_token, "': ", IDF[example_token])

Total count for ' https://t.co/sKlRZGxzmB ' (selected at random):  1
Document frequency for ' https://t.co/sKlRZGxzmB ':  1
Inverse document frequency for ' https://t.co/sKlRZGxzmB ':  10000.0
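
The cell above uses the plain ratio m/df(t) and prints a single example. A common variant takes the logarithm, idf(t) = log(m/df(t)), which keeps the same ordering of words. The sketch below computes document frequencies from token sets and lists the words with the lowest and highest weights, as the exercise asks. It runs on made-up documents, and `idf_weights` is a hypothetical helper name.

```python
import math
from collections import Counter


def idf_weights(segmented_docs):
    """Map each token to log(m / df(t)), where m is the number of documents."""
    df = Counter()
    for doc in segmented_docs:
        df.update(set(doc.split()))  # a set counts each token once per document
    m = len(segmented_docs)
    return {tok: math.log(m / n) for tok, n in df.items()}


# made-up documents standing in for segmented tweets
docs = ["ja se on", "ja se ei", "ja kun"]
idf = idf_weights(docs)
ranked = sorted(idf.items(), key=lambda item: item[1])
print("Lowest idf:", ranked[:20])
print("Highest idf:", ranked[-20:])
```

On the real data, `idf_weights(segmented_tweets)` would take the place of the made-up `docs`.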


## 6) Duplicates or near duplicates

* Check whether we have duplicate tweets (in terms of the text field only) in our dataset. A duplicate tweet here means that exactly the same tweet text appears more than once in our dataset.
* Note: It makes sense to check for duplicates using the original tweet texts, as they were before segmentation. I would also recommend using the full set of 10,000 tweets here in order to get a higher chance of seeing duplicates (this does not require heavy computing).
* Try to check whether tweets have additional near duplicates. A near duplicate here means that the tweet text is almost the same in two or more tweets. Ponder what kinds of near duplicates there could be and how to find them. Start by considering, for example, different normalization techniques. Implement some of the techniques you considered.
* **The outcome of this exercise** should be the number of unique tweets in our dataset (possibly also counting which duplicates are the most common), as well as the number of unique tweets after also removing near duplicates.

In [None]:
from collections import Counter

dupe_counter = Counter(tweetlist)
print("Most common tweets:", dupe_counter.most_common(20))
# every distinct text counts as one unique tweet, however many times it was repeated
uniquetweets = list(dupe_counter)
print("Number of unique tweets: ", len(uniquetweets))
print("Random example: ", uniquetweets[randint(0, len(uniquetweets) - 1)])
# The duplicates seem to be mostly retweets, so near duplicates might be retweets with something added
# - though I've never tweeted myself, so I'm not sure...
# A tweet counts as a near duplicate if its text is contained in another unique tweet.
# This compares every pair of tweets (quadratic), so it takes a while on 10,000 tweets.
superuniquetweets = []
for t in uniquetweets:
    if not any(t in s for s in uniquetweets if s != t):
        superuniquetweets.append(t)
print("Number of super unique tweets: ", len(superuniquetweets))
print("Random example: ", superuniquetweets[randint(0, len(superuniquetweets) - 1)])

Most common tweets: [('RT @SitaSalminen: Testasin huvikseen millaisen reaktion lempeys saa aikaan. Suu loksahti auki https://t.co/L7RR70QsZo', 9), ('RT @KeyisQueen: Sosa babyy https://t.co/raoAJv8auH', 8), ('RT @babaBC: T’challa jata hoon kisi ki dhun mein https://t.co/oRkEoGg15B', 8), ('RT @AestheticsJapan: Guiding Light | by Julius Kähkönen\n(https://t.co/vZMCQYy8Rt) https://t.co/XbnOyg8l7w', 8), ('RT @alvaleryae: Que hermoooosuuraa https://t.co/Tvha0K9GQC', 5), ('RT @BTS_army_Fin: [COMEBACK GOALS FINLAND!]⚠️\n\nTässä @BTS_twt Comeback tavoitteet Suomessa.\n\nNäitä ei ole helppo saavuttaa, meidän pitää ol…', 4), ('RT @HelsinkiKymp: Tiedoksi:\nHelsingin puistoissa ja yleisillä alueilla saa liikkua myös "ilman järkevän tuntuista syytä". \n\nTuollainen pääm…', 4), ('RT @VartiainenPasi: Näyttäkää minulle ennustaja, joka tiesi tulevaksi nämä otsikot vuosi sitten. https://t.co/apGqjDFRWY', 4), ('RT @nastynapalm: Mood ku CV:s ei lue mitään muuta ku nimi, yhteystiedot sekä "Puhe- ja kirjoitu
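
Normalization is another way to look for near duplicates: map each tweet to a simplified form and count the distinct forms. The sketch below lowercases the text, drops a leading retweet prefix and any URLs, and collapses whitespace. These rules are assumptions about what makes two tweets "almost the same", `normalize` is a hypothetical helper, and the sample tweets are made up.

```python
import re
from collections import Counter


def normalize(text):
    """Reduce a tweet to a simplified form for near-duplicate comparison."""
    text = text.lower()
    text = re.sub(r"^rt @\w+:\s*", "", text)  # drop a leading retweet prefix
    text = re.sub(r"https?://\S+", "", text)  # drop URLs
    text = re.sub(r"\s+", " ", text)          # collapse whitespace, including newlines
    return text.strip()


# made-up examples: a retweet, a copy with a different link, and an unrelated tweet
sample = [
    "RT @someone: Hyvää huomenta! https://t.co/abc",
    "Hyvää  huomenta! https://t.co/xyz",
    "Jotain muuta",
]
normalized = Counter(normalize(t) for t in sample)
print("Unique tweets after normalization:", len(normalized))
```

On the real data, `Counter(normalize(t) for t in tweetlist)` would give the number of unique tweets after normalization, and `most_common` would show the largest near-duplicate groups.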

Comments:
I'm very new to python, but I managed to get by with liberal copy-pasting and some googling.
I never managed to find the untruncated tweets though.