Intelligent User Interfaces: Assignment 1
=========================================

Course: Intelligent Human-Computer Interface (COMP0455)  
Student: Mikolaj Kuranowski (2020427681)

In [1]:
import nltk
import heapq
import spacy
import re
import urllib.request
import IPython.display
from bs4 import BeautifulSoup
from collections import Counter
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from spacy import displacy
from table2md import MarkdownTable

nlp = spacy.load("en_core_web_sm")

Problem 1
---------

Named entity recognition: show tokenization and part-of-speech tagging, followed by named entity recognition, for the following text:

> Steve Jobs was an American entrepreneur and business magnate.
> He was the chairman, chief executive officer (CEO), and a co-founder of Apple Inc.,
> chairman and majority shareholder of Pixar, a member of The Walt Disney Company's
> board of directors following its acquisition of Pixar, and the founder, chairman, and CEO of NeXT.
> Jobs is widely recognized as a pioneer of the microcomputer
> revolution of the 1970s and 1980s, along with Apple co-founder Steve Wozniak.

In [2]:
text = """Steve Jobs was an American entrepreneur and business magnate.
He was the chairman, chief executive officer (CEO), and a co-founder of Apple Inc.,
chairman and majority shareholder of Pixar, a member of The Walt Disney Company's
board of directors following its acquisition of Pixar, and the founder, chairman, and CEO of NeXT.
Jobs is widely recognized as a pioneer of the microcomputer
revolution of the 1970s and 1980s, along with Apple co-founder Steve Wozniak.
"""

In [3]:
nltk.word_tokenize(text)[:15]

['Steve',
 'Jobs',
 'was',
 'an',
 'American',
 'entrepreneur',
 'and',
 'business',
 'magnate',
 '.',
 'He',
 'was',
 'the',
 'chairman',
 ',']

In [4]:
displacy.render(nlp(text), jupyter=True, style="ent")

Problem 2
---------

Extract all bigrams and trigrams from the following text using ngrams from the nltk library:

> Machine learning is a necessary field in today's world.
> Data science can do wonders.
> Natural Language Processing is how machines understand text.

In [5]:
text = """Machine learning is a necessary field in today's world.
Data science can do wonders.
Natural Language Processing is how machines understand text.
"""
tokens = nltk.word_tokenize(text)

In [6]:
list(nltk.ngrams(tokens, 2))[:10]

[('Machine', 'learning'),
 ('learning', 'is'),
 ('is', 'a'),
 ('a', 'necessary'),
 ('necessary', 'field'),
 ('field', 'in'),
 ('in', 'today'),
 ('today', "'s"),
 ("'s", 'world'),
 ('world', '.')]

In [7]:
list(nltk.ngrams(tokens, 3))[:10]

[('Machine', 'learning', 'is'),
 ('learning', 'is', 'a'),
 ('is', 'a', 'necessary'),
 ('a', 'necessary', 'field'),
 ('necessary', 'field', 'in'),
 ('field', 'in', 'today'),
 ('in', 'today', "'s"),
 ('today', "'s", 'world'),
 ("'s", 'world', '.'),
 ('world', '.', 'Data')]
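
For reference, the sliding-window idea behind n-gram extraction can be sketched in plain Python. This is only a minimal illustration and not NLTK's implementation; the helper name `simple_ngrams` is hypothetical.

```python
def simple_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples (hypothetical helper)."""
    # zip over n staggered views of the token list; zip stops at the shortest
    # view, so only complete n-grams are produced
    return list(zip(*(tokens[i:] for i in range(n))))


sample = ["Data", "science", "can", "do", "wonders", "."]
bigrams = simple_ngrams(sample, 2)
trigrams = simple_ngrams(sample, 3)
print(bigrams)
print(trigrams)
```

A list of k tokens yields k - n + 1 n-grams, which is why the n-grams in the outputs above run across sentence boundaries: the tokens were never split by sentence.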

Problem 3
---------

Sentiment analysis using VADER. Print polarity scores for each token along
with compound scores for each sentence. Based on the compound scores,
decide sentiment as positive (if >= 0.05), negative (if <= -0.05) or neutral otherwise.

Sentences:

- We are happy!
- Today I am Happy
- The best life ever
- I am sad
- We are sad
- We are super sad
- We are all so sad today

In [8]:
sentences = [
    "We are happy!",
    "Today I am Happy",
    "The best life ever",
    "I am sad",
    "We are sad",
    "We are super sad",
    "We are all so sad today",
]

In [9]:
# Build the analyzer once; it loads the VADER lexicon on construction
analyzer = SentimentIntensityAnalyzer()
data = []

for sentence in sentences:
    compound_score = analyzer.polarity_scores(sentence)["compound"]

    # Classify using the thresholds from the problem statement
    if compound_score >= 0.05:
        sentiment = "positive"
    elif compound_score <= -0.05:
        sentiment = "negative"
    else:
        sentiment = "neutral"

    data.append({"Sentence": sentence, "Sentiment": sentiment, "Compound Score": compound_score})

MarkdownTable.from_dicts(data).display()


|         Sentence        | Sentiment | Compound Score |
|-------------------------|-----------|----------------|
| We are happy!           | positive  | 0.6114         |
| Today I am Happy        | positive  | 0.5719         |
| The best life ever      | positive  | 0.6369         |
| I am sad                | negative  | -0.4767        |
| We are sad              | negative  | -0.4767        |
| We are super sad        | positive  | 0.2023         |
| We are all so sad today | negative  | -0.6113        |
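
The thresholding rule can also be isolated into a small function, which makes the boundary cases easy to check. The sketch below is a hypothetical helper (`classify_compound` is not part of NLTK) and reuses a few compound scores from the table above rather than calling VADER.

```python
def classify_compound(score, threshold=0.05):
    """Map a VADER compound score to a sentiment label (hypothetical helper)."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


# Compound scores copied from the table above
examples = {"We are happy!": 0.6114, "I am sad": -0.4767, "We are super sad": 0.2023}
labels = {sentence: classify_compound(score) for sentence, score in examples.items()}
print(labels)
```

The rule looks only at the compound score, so "We are super sad" is labelled positive simply because its score of 0.2023 lies above the 0.05 threshold.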


Problem 4
---------

Text Summarization of a Wikipedia article

https://en.wikipedia.org/wiki/Artificial_intelligence

1. Data collection from Wikipedia using web scraping (using the urllib library).
2. Parsing the content retrieved from the URL (using the BeautifulSoup library).
3. Data clean-up, such as removing special characters, numeric values, stop words and punctuation.
4. Tokenization: creation of tokens (word tokens and sentence tokens).
5. Calculate the word frequency for each word.
6. Calculate the weighted frequency for each sentence.
7. Create the summary by choosing the top-weighted 30% of sentences.

In [10]:
# 1. Download the article
with urllib.request.urlopen("https://en.wikipedia.org/wiki/Artificial_intelligence") as f:
    text = str(f.read(), "utf-8")

In [11]:
# 2. Parse the HTML content and extract the main article
paragraphs = BeautifulSoup(text, "html.parser").find(id="mw-content-text").find_all("p")
article = " ".join(p.text for p in paragraphs)

In [12]:
# 3. Clean up: strip citation markers such as [1] or [a], then collapse whitespace
clean_article = re.sub(r"\[[0-9a-zA-Z]*\]", "", article)
clean_article = re.sub(r"\s+", " ", clean_article)

# Letters-only copy (no digits or punctuation), used for the word counts
article_words = re.sub(r"[^a-zA-Z]", " ", clean_article)
article_words = re.sub(r"\s+", " ", article_words)
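
To make the effect of these substitutions concrete, the sketch below applies the same four regular expressions to a short made-up string. The helper name `clean` and the sample text are illustrative assumptions, not part of the assignment.

```python
import re


def clean(raw):
    """Apply the four clean-up substitutions to raw text (hypothetical helper)."""
    # Drop bracketed citation markers such as [1] or [a], then collapse whitespace
    text = re.sub(r"\[[0-9a-zA-Z]*\]", "", raw)
    text = re.sub(r"\s+", " ", text)
    # Letters-only copy: every other character becomes a space, then collapse again
    words_only = re.sub(r"[^a-zA-Z]", " ", text)
    words_only = re.sub(r"\s+", " ", words_only)
    return text, words_only


sample = "AI was founded in 1956.[1][a]  It grew\nfast."
cleaned, letters = clean(sample)
print(repr(cleaned))
print(repr(letters))
```

The first result keeps digits and punctuation, so it can still be split into sentences; the second contains only letters and single spaces, which suits word counting.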

In [13]:
# 4. Tokenize into sentences (from the cleaned article) and words (from the letters-only copy)
sentences = nltk.sent_tokenize(clean_article)
words = nltk.word_tokenize(article_words)

In [14]:
# 5. Calculate word frequency (case-folded, stop words excluded)
stop_words = set(nltk.corpus.stopwords.words('english'))

tf = Counter(filter(lambda w: w not in stop_words, map(str.casefold, words)))

In [15]:
# 6. Score every sentence
sentence_scores = {}
for sentence in sentences:
    sentence_scores[sentence] = 0
    sentence_words = nltk.word_tokenize(sentence)

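    # Sentences with 30 or more tokens are skipped and keep a score of 0
    if len(sentence_words) >= 30:
        continue

    # A Counter returns 0 for missing keys, so stop words and punctuation add nothing
    for word in sentence_words:
        sentence_scores[sentence] += tf[word.casefold()]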

In [16]:
# 7. Build the summary from the 30 highest-scoring sentences (a fixed count, not 30% of all sentences)
summary_sentences = heapq.nlargest(30, sentence_scores.keys(), key=sentence_scores.get)
summary = "\n".join(summary_sentences)

IPython.display.display_pretty(summary, raw=True)

 Artificial intelligence (AI) is intelligence demonstrated by machines, as opposed to intelligence of humans and other animals.
AI founder John McCarthy agreed, writing that "Artificial intelligence is not, by definition, simulation of human intelligence".
Much of current research involves statistical AI, which is overwhelmingly used to solve specific problems, even highly successful techniques such as deep learning.
By 2000, solutions developed by AI researchers were being widely used, although in the 1990s they were rarely described as "artificial intelligence".
The next few years would later be called an "AI winter", a period when obtaining funding for AI projects was difficult.
General intelligence is difficult to define and difficult to measure, and modern AI has had more verifiable successes by focusing on specific problems with specific solutions.
Many problems in AI (including in reasoning, planning, learning, perception, and robotics) require the agent to operate with incomple
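
The cell for step 7 selects a fixed 30 sentences, whereas the task list asks for 30% of the top-weighted sentences. A percentage-based selection could be sketched as follows; `top_percent` and the toy scores are hypothetical stand-ins for `sentence_scores`, and rounding up is an assumption about how a fractional count should be handled.

```python
import heapq


def top_percent(scores, percent=30):
    """Return the highest-scoring keys, keeping `percent` % of them (hypothetical helper)."""
    # Integer ceiling division avoids floating-point surprises;
    # max(1, ...) ensures a very short text still yields one sentence
    k = max(1, (percent * len(scores) + 99) // 100)
    return heapq.nlargest(k, scores, key=scores.get)


# Toy scores standing in for sentence_scores
toy_scores = {"s1": 12, "s2": 40, "s3": 7, "s4": 25, "s5": 3,
              "s6": 18, "s7": 31, "s8": 9, "s9": 1, "s10": 22}
selected = top_percent(toy_scores)
print(selected)
```

With ten scored sentences this keeps three. The result is ordered by score, so restoring the original article order would need an extra sort by sentence position.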

Problem 5
---------

Language detection using NLTK in Python: print the probabilities
and language name for the following phrases:

1. Solen skinner i dag, fuglene synger, og det er sommer.
2. Ní dhéanfaidh ach Dia breithiúnas orm.
3. I domum et cuna matrem tuam in cochleare.
4. Huffa, huffa meg, det finns poteter på badet. Stakkars, stakkars meg, det finns poteter på badet.

In [17]:
# Solution adapted from: https://www.nltk.org/_modules/nltk/classify/textcat.html
# Note that the model does not produce a "probability"; it reports a "distance"
# between the text and each language profile (smaller means a closer match).
#
# Distances have no fixed upper bound, so treat the maximum as Inf. Normalizing into
# a probability therefore makes no sense, as (in floating point numbers, anyway)
# anything divided by Inf is zero.

sentences = [
    "Solen skinner i dag, fuglene synger, og det er sommer.",
    "Ní dhéanfaidh ach Dia breithiúnas orm.",
    "I domum et cuna matrem tuam in cochleare.",
    "Huffa, huffa meg, det finns poteter på badet. Stakkars, stakkars meg, det finns poteter på badet.",
]

tc = nltk.TextCat()
data = []
for sentence in sentences:
    distances = {k: float(v) for k, v in tc.lang_dists(sentence).items()}
    most_likely = heapq.nsmallest(3, distances, key=distances.get)
    data.append({
        "Sentence": sentence,
        "1st most likely language (distance)": f"{most_likely[0]} ({distances[most_likely[0]]})",
        "2nd most likely language (distance)": f"{most_likely[1]} ({distances[most_likely[1]]})",
        "3rd most likely language (distance)": f"{most_likely[2]} ({distances[most_likely[2]]})",
    })
MarkdownTable.from_dicts(data).display()

|                                              Sentence                                             | 1st most likely language (distance) | 2nd most likely language (distance) | 3rd most likely language (distance) |
|---------------------------------------------------------------------------------------------------|-------------------------------------|-------------------------------------|-------------------------------------|
| Solen skinner i dag, fuglene synger, og det er sommer.                                            | nob (17231.0)                       | nno (18845.0)                       | dan (24349.0)                       |
| Ní dhéanfaidh ach Dia breithiúnas orm.                                                            | gle (15746.0)                       | sun (4.611686018427397e+19)         | eng (7.378697629483826e+19)         |
| I domum et cuna matrem tuam in cochleare.                                                         | eng  (54494.0)                      | eng (57954.0)                       | fra (62543.0)                       |
| Huffa, huffa meg, det finns poteter på badet. Stakkars, stakkars meg, det finns poteter på badet. | nno (9.223372036854817e+18)         | nob (9.223372036854823e+18)         | dan (1.8446744073709597e+19)        |
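
If a probability-like number is still wanted for presentation, one rough option is to normalize inverse distances over the candidates being compared. The sketch below is a heuristic under stated assumptions, not something TextCat provides: `inverse_distance_weights` is a hypothetical helper, all distances are assumed to be positive, and the weights depend entirely on which candidates are included.

```python
def inverse_distance_weights(distances):
    """Turn positive distances into weights that sum to 1 (heuristic, hypothetical helper)."""
    inverse = {lang: 1.0 / dist for lang, dist in distances.items()}
    total = sum(inverse.values())
    return {lang: value / total for lang, value in inverse.items()}


# Three smallest distances for the first sentence, copied from the table above
distances = {"nob": 17231.0, "nno": 18845.0, "dan": 24349.0}
weights = inverse_distance_weights(distances)
for lang, weight in sorted(weights.items(), key=lambda item: item[1], reverse=True):
    print(f"{lang}: {weight:.3f}")
```

These weights only rank the listed candidates relative to one another and should not be read as calibrated probabilities.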


Problem 6
---------

Which problems do adaptive and predictive keyboards address?
Explain how touch information and language information can be combined for keyboard adaptation.
Explain decoding of touch sequences with token passing and beam pruning. Using Algorithm