# Query the index for threads and topics

## Prerequisites

* Enable **port forwarding** from port 23456 to localhost:23456
* Point your browser to **http://localhost:23456**
* You will see the default Solr interface

## Basic querying

* Select "s24_top" from *Core selector*
* Click *Query*
* Click *Execute Query* at the bottom of the screen

* This returns all documents, since we searched for everything (\*:\* in the *q* field). The queries are of the form "field:query" (so \*:\* means everything in all fields). What fields do we have?

* **id** thread ID in Suomi24
* **thread_txt_fi** the text of the thread
* **date_tdt** the date of the first post
* **best1_s** the highest-scoring topic for this thread
* **best1_f** the probability of the best scoring topic
* **bestN_ss** the few highest-scoring topics for this thread
* **s24_area_s** and **s24_subarea_s** S24 sections

*Topic names*: <http://localhost:23456/solr/s24_top/select?stats=on&stats.field=best1_s&rows=0&stats.calcdistinct=true&q=*:*> (defined originally in file `top_names_50_400k.txt`)

## Querying for topics and sorting

* **best1_s:työ** gives all documents whose best topic is työ, in no particular order
* **bestN_ss:remontti** gives all documents for which *remontti* is among the top topics
* **+best1_s:raskaus +thread_txt_fi:koira** gives all documents whose best topic is raskaus and which contain the word koira

Sorting is easy:

* **date_tdt desc** added to the *sort* field sorts by date, from new to old
* **best1_f desc** in the *sort* field, combined with **best1_s:opiskelu** finds the most representative threads for topic opiskelu.
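
The same query and sort settings can also be sent without the admin page. The sketch below assumes the core is reachable at the address given in the prerequisites and that Solr's standard `select` handler is used. `build_select_url` is a hypothetical helper name for illustration, not part of Solr or any library.

```python
from urllib.parse import urlencode

# Assumption: the s24_top core is reachable through the forwarded port.
SOLR_CORE = "http://localhost:23456/solr/s24_top"

def build_select_url(q, sort=None, rows=10):
    """Return a URL for Solr's select handler (hypothetical helper)."""
    params = {"q": q, "rows": rows, "wt": "json"}
    if sort:
        # Same value you would type into the sort field of the admin page
        params["sort"] = sort
    # urlencode takes care of escaping ':' and spaces
    return SOLR_CORE + "/select?" + urlencode(params)

# Most representative threads for the topic opiskelu
url = build_select_url("best1_s:opiskelu", sort="best1_f desc")
print(url)
```

Opening the printed URL in the browser should show the same documents as the admin page does with the same *q* and *sort* values.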

Full query language documentation: http://www.solrtutorial.com/solr-query-syntax.html

## Exercise (via the Solr page)

* Find comments whose most probable topic is lemmikki (pet). Read through them. Are they correct? What other topics are associated with them?
* Find comments that mention the word lemmikki. What do these look like?
* Find comments that represent the topic lemmikki but are not in the Lemmikit section in **s24_area_s**. What do these look like? (Absence is expressed with a minus sign (-) instead of a plus.) In which other sections of the S24 forum do pets seem to be discussed? Or is the topic model wrong, so that these comments do not actually relate to pets?
* Then find comments that represent the topic lemmikki but do not contain the word lemmikki. What do these look like? What are they about?
* Next, pick a word that represents some current topic. It can be from sports, politics, music... Find comments that contain this word.
* Sort the comments from newest to oldest. Read a few of them. Then do the same the other way round, from oldest to newest. Do the comments look different?
* You can also do the same for some subject area other than pets...
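
The plus and minus prefixes can be combined mechanically. The sketch below composes such a query string from required and excluded field/value pairs, using the raskaus/koira example from the querying section rather than the exercise itself. `build_query` is a hypothetical helper, and it does no escaping of special characters.

```python
def build_query(required, excluded=()):
    """Combine (field, value) pairs into a Solr query string.

    Pairs in `required` get a + prefix, pairs in `excluded` a - prefix.
    Hypothetical helper for illustration; values are not escaped.
    """
    parts = [f"+{field}:{value}" for field, value in required]
    parts += [f"-{field}:{value}" for field, value in excluded]
    return " ".join(parts)

# Threads whose best topic is raskaus but which do not contain the word koira
q = build_query(required=[("best1_s", "raskaus")],
                excluded=[("thread_txt_fi", "koira")])
print(q)  # +best1_s:raskaus -thread_txt_fi:koira
```

The resulting string can be pasted into the *q* field on the Solr page.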


# Topic distribution in S24

* We can query (programmatically) each topic in turn, and ask how many hits we get
* Then we can sort by which topics are largest

This needs a bit of Python:


```python
import pysolr

# Connect to the s24_top core through the forwarded port
solr = pysolr.Solr("http://localhost:23456/solr/s24_top")
# Match every document, but return only the best1_s field
result = solr.search(q="*:*", fl="best1_s")
# result2 = solr.search(q="thread_txt_fi:Turku", fl="best1_s")
print("Found this many:", result.hits)
# result.docs holds only the first page of results, not all hits
print("Here they are:", result.docs)
```

```
Found this many: 620721
Here they are: [{'best1_s': 'työ'}, {'best1_s': 'internet_keskustelu'}, {'best1_s': 'perhe_suhteet'}, {'best1_s': 'talouspolitiikka'}, {'best1_s': 'lemmikki'}, {'best1_s': 'opiskelu'}, {'best1_s': 'koirat'}, {'best1_s': 'perhe_suhteet'}, {'best1_s': 'internet_keskustelu'}, {'best1_s': 'uskonto_filosofia'}]
```
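
The per-topic counting described at the start of this section could look roughly like the sketch below. It is one possible approach, not necessarily what the scripts in the repository do. `topic_sizes` is a hypothetical helper, and the topic names in the usage comment are just three of those seen in the output above.

```python
def topic_sizes(solr, topics):
    """Return (topic, hit count) pairs, largest topic first.

    `solr` is any object with a pysolr-style search() method whose
    result has a .hits attribute.
    """
    counts = []
    for topic in topics:
        # rows=0: we only need the hit count, not the documents themselves
        result = solr.search(f"best1_s:{topic}", rows=0)
        counts.append((topic, result.hits))
    # Sort by hit count, largest first
    return sorted(counts, key=lambda pair: pair[1], reverse=True)

# Usage against the live index (requires pysolr and the forwarded port):
# import pysolr
# solr = pysolr.Solr("http://localhost:23456/solr/s24_top")
# for topic, hits in topic_sizes(solr, ["työ", "lemmikki", "opiskelu"]):
#     print(topic, hits)
```

Passing the connection in as an argument keeps the counting logic separate from the connection setup, so the function can be tried out with a stand-in object before pointing it at the real index.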


## Try

### 1)
* You can try this by first downloading the GitHub repo: **git clone https://github.com/TurkuNLP/Digi_menetelmat.git**
* Type **ls** and you'll see the Digi_menetelmat directory. Go there with **cd Digi_menetelmat**
* Run **ls** again to see what the directory contains
* Type **python3 topics.py** and the program will do what we have above
* Modify the script so that it prints result2
* You can open the file with **nano topics.py**
* In which S24 sections is Turku discussed?
* What should you do if you want to read the comments as well?
* If there is too much text to read, use **python3 topics.py | less**
* Press **q** to exit **less**

### 2) 
* Open topics2.py. Read it. What does it do? Try to run it. Does it work?
* Select a word and modify topics2.py so that it searches for it. How many hits did you get?
* Which topics do the comments reflect? How can you also print the names of the S24 sections the comments appear in? Some topics and S24 sections are bound to conflict. Print some of these comments and read them. Which one was wrong: the topic model or the S24 section?

### 3)
* Modify topics2.py so that it searches for all comments under a specific topic.
* In which S24 sections is it discussed?
* Sort the comments from the most probable to the least probable and print the texts. What do the comments look like? Is the topic well defined?
* Sort from the least probable. What do these look like? Are the topics still correct?
* If you have time, try to focus the query to a specific date. This can be done, e.g., with **date_tdt:[2015-01-01T00:00:00Z TO 2016-01-31T00:00:00Z]**
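
The date restriction above can also be built from Python `datetime` values instead of being typed by hand. This is a minimal sketch. `solr_date_range` is a hypothetical helper, and combining it with the topic opiskelu is only an example.

```python
from datetime import datetime, timezone

def solr_date_range(field, start, end):
    """Format a Solr range query over a date field (hypothetical helper)."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # UTC timestamp form used in the example above
    return f"{field}:[{start.strftime(fmt)} TO {end.strftime(fmt)}]"

start = datetime(2015, 1, 1, tzinfo=timezone.utc)
end = datetime(2016, 1, 31, tzinfo=timezone.utc)
date_query = solr_date_range("date_tdt", start, end)
print(date_query)  # date_tdt:[2015-01-01T00:00:00Z TO 2016-01-31T00:00:00Z]

# Combined with a topic restriction, e.g. as the q argument of solr.search()
q = f"+best1_s:opiskelu +{date_query}"
print(q)
```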
