This document: https://goo.gl/NIFa0D

# What are the national stereotypes in Finnish Internet fora?

## Data and background

* Suomi24 is the largest Finnish online forum
* All 15 years of Suomi24 discussions were recently released for research use
  * XXX sentences / XXX words
* *Citizen Mindscapes* - A consortium project funded by the Academy of Finland to dig into the data
* Consortium members cover sociology, psychology, statistics, language technology, ...

## National stereotypes

* A question posed to us by Jussi Pakkasvirta (Jussi: you've hereby been acknowledged for the idea!)
* **What are the stereotypes about nations in the S24 data?**
* In other words: **How do people talk about different nations and countries?**
* This is just a use case of a more general approach

## Finding distinctive features

* The general problem: given two collections of texts, one the *focus* and the other the *background*, which features distinguish the focus from the background?
    * In our example: *focus* = texts about one nation, *background* = texts about all other nations
* General problem - General solution
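As a minimal sketch of the focus/background setup, the snippet below labels each sentence +1 (focus) or -1 (background) for one chosen nation. The input format (a list of lemmas paired with the set of nation codes mentioned), the function name, the example sentences, and the rule that a sentence mentioning the focus nation counts as focus are all assumptions made for illustration.

```python
def label_focus_background(sentences, focus):
    """Label each sentence +1 (focus) or -1 (background).

    sentences: list of (lemmas, nations) pairs, where nations is the set of
    nation codes mentioned in the sentence (hypothetical input format).
    """
    examples = []
    for lemmas, nations in sentences:
        if not nations:
            continue  # mentions no nation, so it belongs to neither collection
        label = 1 if focus in nations else -1
        examples.append((lemmas, label))
    return examples


# Made-up example sentences, already lemmatized.
sentences = [
    (["ruotsalainen", "olla", "iloinen"], {"SE"}),
    (["saksalainen", "olla", "täsmällinen"], {"DE"}),
    (["tänään", "sataa"], set()),
]
examples = label_focus_background(sentences, focus="SE")
```

Repeating this with each nation in turn as the focus gives one labelled data set per nation.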

## Supervised Machine Learning

* Machine learning from examples
* In goes: examples and their desired output
* Out goes: a learned model which can repeat the task on previously unseen examples
    * Customer feedback -> positive/negative sentiment
    * Photo of an apple -> rotten/edible
    * Cell customer data -> will switch provider yes/no
    * Weather data -> will rain next hour yes/no
    * Movie review -> the number of recommendation stars
    * Threatening message -> serious yes/no
    * Email -> spam yes/no
    * Photograph -> nude people yes/no
    * Stock price today -> stock price tomorrow
    * ...
* Very general problem which has very general solutions

## Classifiers and Features

* General machines which can learn a given *example -> prediction* mapping from examples
* Universal, *"just drop the data in"* solutions
* Each example must be presented to the classifier in the form of **features**
* Features are single, measurable properties of the examples
    * Customer feedback -> one feature for every unique word
    * Photo of an apple -> one feature for every pixel, carrying color information
    * Weather data -> set of current atmospheric measurements
    * Email -> one feature for every unique word, the domain of the sender
    * Stock price -> time series of past prices, one feature for every word in news pieces mentioning the company in today's news
    * ...
* In its simplest form, a classifier learns a weight for every feature
* ...which happens to be the same thing as drawing a line across the feature space and saying that everything on one side of the line is one class, and everything on the other side is the other class
    * one feature -> we set a cut-off point
    * two features -> we draw a line
    * three features -> we draw a plane
    * four and more features -> hyperplane

<img src="http://mlpy.sourceforge.net/docs/3.2/_images/lda_binary1.png"/>
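To make "a weight for every feature" concrete, here is a small sketch using a perceptron, one of the simplest linear classifiers. It is a stand-in chosen for brevity, not necessarily the classifier used elsewhere in this material. The training sentences, function names, and the one-feature-per-unique-word representation are illustrative assumptions.

```python
from collections import defaultdict


def featurize(text):
    """One binary feature per unique lower-cased word."""
    return set(text.lower().split())


def train_perceptron(examples, epochs=10):
    """Learn one weight per feature (plus a bias) from (text, label) pairs."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for text, label in examples:  # label is +1 or -1
            feats = featurize(text)
            score = bias + sum(weights[f] for f in feats)
            if label * score <= 0:  # misclassified: nudge weights towards label
                for f in feats:
                    weights[f] += label
                bias += label
    return weights, bias


def predict(weights, bias, text):
    """Which side of the learned hyperplane does the text fall on?"""
    score = bias + sum(weights.get(f, 0.0) for f in featurize(text))
    return 1 if score > 0 else -1


# Made-up customer feedback: +1 = positive, -1 = negative.
train = [
    ("great service thanks", 1),
    ("terrible service", -1),
    ("great product", 1),
    ("terrible product never again", -1),
]
weights, bias = train_perceptron(train)
print(predict(weights, bias, "great food"), predict(weights, bias, "terrible food"))
```

Looking at the learned weights afterwards (large positive or large negative) is what tells us which features the classifier relies on.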
    
## Classification - let's try!

https://github.com/TurkuNLP/DigiHum16/blob/master/datapaja_smileys.ipynb

* This example showed us
  * How to train a classifier
  * How to find out which features it uses to make its decisions
  

# In search of national stereotypes

* ...now back to our original problem
* Find what people say about different nations and countries

## Nations and countries

A list of nations and their names is needed. Semi-manual work, starting from an online list. Coverage not perfect, but quite good (164 entries).

```
AF      Afganistan,afganistanlainen
NL      Alankomaat,hollantilainen,alankomaalainen,Hollanti
AL      Albania,albanialainen
DZ      Algeria,algerialainen
AD      Andorra,andorralainen
AO      Angola,angolalainen
AR      Argentiina,argentiinalainen
AM      Armenia,armenialainen
AW      Aruba,arubalainen
AU      Australia,australialainen
AZ      Azerbaidžan,azerbaidžanlainen,Azerbaidzan,azerbaidzanlainen
BS      Bahama,bahamalainen
BH      Bahrain,bahrainlainen
BD      Bangladesh,bangladeshlainen
BB      Barbados,barbadoslainen
BE      Belgia,belgialainen
BZ      Belize,belizeläinen
BM      Bermuda,bermudalainen
BT      Bhutan,bhutanlainen
BO      Bolivia,bolivialainen
BA      Bosnia,Hertzegovina,bosnialainen,hertsegovinalainen
...
```
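A list in this format can be loaded into a lookup table from name to country code. The sketch below assumes the layout shown above (code, whitespace, comma-separated names); the function name and the choice to lower-case the names are assumptions for illustration.

```python
NAME_LIST_SAMPLE = """\
AF      Afganistan,afganistanlainen
NL      Alankomaat,hollantilainen,alankomaalainen,Hollanti
AL      Albania,albanialainen
"""


def load_nation_names(text):
    """Map every lower-cased country or nationality name to its country code."""
    name_to_code = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # code, then the comma-separated names
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        code, names = parts
        for name in names.split(","):
            name = name.strip().lower()
            if name:
                name_to_code[name] = code
    return name_to_code


name_to_code = load_nation_names(NAME_LIST_SAMPLE)
print(name_to_code["hollanti"])
```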

## Hits in the data

* S24: 8.6M hits in 7M sentences
* What is the distribution of these hits for different nations?

http://bionlp-www.utu.fi/.ginter/maastereotypiat/map.html#s24_counts
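Counting hits could look roughly like the sketch below. It assumes the sentences are already lemmatized, so that inflected Finnish forms match the base forms in the name list. The tiny lookup table, the example sentences, and the function name are made up for illustration. A sentence can contain several hits, which is why the number of hits can exceed the number of sentences.

```python
from collections import Counter


def count_hits(lemmatized_sentences, name_to_code):
    """Return (hits per country code, number of sentences with at least one hit)."""
    hits = Counter()
    sentences_with_hit = 0
    for lemmas in lemmatized_sentences:
        codes = [name_to_code[l.lower()] for l in lemmas if l.lower() in name_to_code]
        hits.update(codes)
        if codes:
            sentences_with_hit += 1
    return hits, sentences_with_hit


# Hypothetical lookup table and lemmatized sentences.
name_to_code = {"hollantilainen": "NL", "belgialainen": "BE", "albania": "AL"}
sentences = [
    ["hollantilainen", "ja", "belgialainen", "juoda", "olut"],
    ["Albania", "olla", "kaunis"],
    ["sauna", "olla", "kuuma"],
]
hits, n_sentences = count_hits(sentences, name_to_code)
print(sum(hits.values()), "hits in", n_sentences, "sentences")
```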

## Most distinctive features

* Train a separate SVM classifier for each nation/country
* Set its parameters so that only about 100 features are used (strong L1 regularization)

1. Use all lemmas of each sentence: http://bionlp-www.utu.fi/.ginter/maastereotypiat/map.html#keywords-s24
1. Use only adjectives as possible features: http://bionlp-www.utu.fi/.ginter/maastereotypiat/map.html#alladj-s24
1. Use only adjectives and nation (not country) names: http://bionlp-www.utu.fi/.ginter/maastereotypiat/map.html#citizenadj-s24

The maps are here: http://bionlp-www.utu.fi/.ginter/maastereotypiat/map.html
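The effect of strong L1 regularization can be shown with a small pure-Python sketch. It minimises the linear SVM objective (hinge loss) plus an L1 penalty using proximal subgradient descent, with no bias term. This is a simplified stand-in for illustration only. In practice a library implementation would likely be used, for example scikit-learn's `LinearSVC` with `penalty="l1"` and `dual=False`. The toy data, function name, and parameter values are assumptions.

```python
from collections import defaultdict


def train_sparse_linear(examples, l1=0.2, lr=0.5, epochs=50):
    """Minimise average hinge loss + l1 * |w|_1 by proximal subgradient descent.

    examples: list of (set_of_lemmas, label), label +1 (focus) or -1 (background).
    Returns only the features whose weight ended up non-zero.
    """
    weights = defaultdict(float)
    n = len(examples)
    for _ in range(epochs):
        # Subgradient of the average hinge loss: only examples inside the
        # margin (label * score < 1) contribute.
        grad = defaultdict(float)
        for features, label in examples:
            margin = label * sum(weights[f] for f in features)
            if margin < 1.0:
                for f in features:
                    grad[f] -= label / n
        for f, g in grad.items():
            weights[f] -= lr * g
        # Proximal step for the L1 penalty (soft-thresholding): shrink every
        # weight towards zero and set small ones to exactly zero.
        shrink = lr * l1
        for f in list(weights):
            w = weights[f]
            weights[f] = max(abs(w) - shrink, 0.0) * (1 if w > 0 else -1)
    return {f: w for f, w in weights.items() if w != 0.0}


# Toy data: three focus sentences and three background sentences (as lemma sets).
EXAMPLES = [
    ({"sauna", "kylmä"}, 1),
    ({"sauna", "hiljainen"}, 1),
    ({"sauna", "järvi"}, 1),
    ({"pizza", "kylmä"}, -1),
    ({"viini", "järvi"}, -1),
    ({"olut", "hiljainen"}, -1),
]

strong = train_sparse_linear(EXAMPLES, l1=0.2)  # strong L1: few features survive
weak = train_sparse_linear(EXAMPLES, l1=0.0)    # no L1: many features get a weight
print(sorted(strong), len(weak))
```

With the strong penalty only the single most distinctive feature keeps a non-zero weight on this toy data. On real data the penalty strength would be tuned until roughly the desired number of features remains.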

## Sentiment

* There are sentiment dictionaries available, like this one: http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm 
* An original list of English sentiment words, machine-translated into several dozen languages
* Bummer: the list is not that great and the machine translations are poor, so this was a somewhat unlucky choice
* For each nation, we can reduce the data to only the hits of these sentiment words, and pick the most distinctive ones once again
* This gets pretty sparse, so many countries don't get a sentiment at all. We need a better sentiment dictionary!
* Having selected the most distinctive sentiment terms for each nation, we can aggregate a total sentiment for each, weighted by the (log of the) number of hits of the sentiment words

The maps are here: http://bionlp-www.utu.fi/.ginter/maastereotypiat/map.html

A sample of the dictionary (Finnish translation, polarity, English original):

```
luopua  neg     abandon
hylätty neg     abandoned
hylkääminen     neg     abandonment
abba    pos     abba
sieppaus        neg     abduction
poikkeuksellinen        neg     aberrant
poikkeama       neg     aberration
inhota  neg     abhor
vastenmielinen  neg     abhorrent
kyky    pos     ability
viheliäinen     neg     abject
poikkeava       neg     abnormal
lakkauttaa      neg     abolish
poistaminen     neg     abolition
iljettävä       neg     abominable
inhottavuus     neg     abomination
keskeyttää      neg     abort
abortti neg     abortion
epäonnistunut   neg     abortive
Edellä mainittujen      pos     abovementioned
hiertymä        neg     abrasion
kumota  neg     abrogate
paise   neg     abscess
poissaolo       neg     absence
poissa  neg     absent
poissaolija     neg     absentee
poissaolot      neg     absenteeism
absoluuttinen   pos     absolute
synninpäästö    pos     absolution
imeytyy pos     absorbed
järjetön        neg     absurd
järjettömyys    neg     absurdity
```
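The aggregation step could be sketched as follows. The exact formula is not given above, so the weighting used here, polarity times `log(1 + hits)` summed over a nation's distinctive sentiment terms, is an assumption, as are the function names and the hit counts. The parser splits each line from the right, which assumes a single-word English column as in the sample and tolerates Finnish entries containing spaces.

```python
import math

LEXICON_SAMPLE = """\
kyky    pos     ability
järjetön        neg     absurd
Edellä mainittujen      pos     abovementioned
"""


def load_lexicon(text):
    """Map each lower-cased Finnish term to +1.0 (pos) or -1.0 (neg)."""
    polarity = {}
    for line in text.splitlines():
        parts = line.rsplit(None, 2)  # term (may contain spaces), label, English
        if len(parts) != 3 or parts[1] not in ("pos", "neg"):
            continue  # skip blank or malformed lines
        term, label, _english = parts
        polarity[term.strip().lower()] = 1.0 if label == "pos" else -1.0
    return polarity


def total_sentiment(term_hits, polarity):
    """Sum term polarities, each weighted by log(1 + number of hits)."""
    score = 0.0
    for term, hits in term_hits.items():
        if term in polarity and hits > 0:
            score += polarity[term] * math.log(1 + hits)
    return score


polarity = load_lexicon(LEXICON_SAMPLE)
# Hypothetical hit counts for one nation; "sauna" is not in the lexicon and is ignored.
score = total_sentiment({"kyky": 9, "järjetön": 99, "sauna": 500}, polarity)
print(score)
```

The logarithm keeps a handful of very frequent terms from dominating the total.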

## Where next?

* The distinctive features we get are quite nice (we think) but:
  - Need to be more "stereotypic" - any ideas?
* We do not take into account the syntax, and simply default to the sentence as the context
  - Try with adjective modifiers and specific syntactic structures
  - Data sparsity for rarely mentioned nations
* The sentiment detection is not that great atm
  - Need better sentiment list / classifier - any ideas?
  
## Data and code

https://github.com/jmnybl/maastereotypiat