# News-Stream Example Queries


Solr queries can be made with the Solr search page under 

http://hdp-node06.neofonie.de:8983/solr/#/hackathon_shard3_replica2/query .

There exists a Banana dashboard with plenty of prepared graphics and loaded data from the News-Stream system:

https://nstr.neofonie.de/dev/#/dashboard/solr/Hackathon .


In this Notebook we will show some example queries, to give an idea and easy access to all the data in the News-Stream project.


First we import some stuff we will need from python.




## Querying Data from News-Stream



Please fill in the user id and the password for retrieving data from the News-Stream system.

First of all some helper functions to make the requested prameters in the rest of the notebook more readable.


In [1]:
import json
from newsstream_client import NewsStreamClient

newsstreamClient = NewsStreamClient()

## Substitute function 
exec_query = newsstreamClient.exec_query

def dump(jsonData):
    print(json.dumps(jsonData, indent=4))



Using as base url for News-Stream: https://nstr.neofonie.de/solr-dev/hackathon/select




There exists a Banana dashboard with plenty of prepared graphics and loaded data from the News-Stream system:


In [2]:
print('\nhttps://'+newsstreamClient.auth['login']+':'+newsstreamClient.auth['password']+'@nstr.neofonie.de/dev/#/dashboard/solr/Hackathon\n')


https://tickertools:hallonewsstream@nstr.neofonie.de/dev/#/dashboard/solr/Hackathon



#### Importing NVD3 for graphical output

* pip install python-nvd3



In [3]:
import datetime
import time
import random
from IPython import display as d
import nvd3
nvd3.ipynb.initialize_javascript(use_remote=True)
#help(nvd3.ipynb.initialize_javascript)


loaded nvd3 IPython extension
run nvd3.ipynb.initialize_javascript() to set up the notebook
help(nvd3.ipynb.initialize_javascript) for options


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>



## Examples Fetching Data with Search Words



All queries are accessible from the commandline via curl. 

All available fields are documented in the document in the githup repository: 

[EnglishHowTohackathon](https://github.com/dpa-newslab/tickertools2016/blob/master/neofonie/EnglischHowToHackathon.md)


#### Searchword: "Hillary Clinton" - All Data


In [4]:
dump(exec_query({'q': 'Hillary Clinton'}))


{
    "response": {
        "docs": [
            {
                "unknownPersons": [
                    "Hillary for America"
                ],
                "text": "As election day nears, Snapchat lands some big political ad buys, including a Trump geofilter and Hillary Clinton selfie lens. Priorities USA Action Hillary for America",
                "neoDocId": "6854613762995479308",
                "neoPublicationName": "BUZZFEEDNEWS",
                "_version_": 1550380498998525952,
                "neoApplication": "buzzfeed_com",
                "unknownTypes": [
                    "PERSON"
                ],
                "neoPublicationId": 1794,
                "title": "Election Week Snapchat Ads Will Include Clinton Selfie Lens, Trump Geofilter",
                "mlRessort": "pl",
                "neoBaseUrl": "https://www.buzzfeed.com/",
                "neoTeaser": "As election day nears, Snapchat lands some big political ad buys, including a Trump geofilter and


#### Searchword: "Hillary Clinton AND Donald Trump" - All Data


In [5]:
dump(exec_query({'q': 'Hillary Clinton OR Donals Trump'}))


{
    "response": {
        "docs": [
            {
                "text": "Die US-Pr\u00e4sidentschaftskandidaten Hillary Clinton und Donald Trump haben ihren Wahlkampf am Samstag im hart umk\u00e4mpften Staat Florida fortgesetzt. Miami. Die US-Pr\u00e4sidentschaftskandidaten Hillary Clinton und Donald Trump haben ihren Wahlkampf am Samstag im hart umk\u00e4mpften Staat Florida fortgesetzt.",
                "neoDocId": "3847959807451155506",
                "neoPublicationName": "RP Online",
                "_version_": 1550184838406864896,
                "neoApplication": "rponline",
                "neoTeaserGenerated": false,
                "neoPublicationId": 99,
                "title": "Trump vergleicht sich mit Rapper Jay-Z",
                "mlRessort": "pl",
                "neoTeaser": "Die US-Pr\u00e4sidentschaftskandidaten Hillary Clinton und Donald Trump haben ihren Wahlkampf am Samstag im hart umk\u00e4mpften Staat Florida fortgesetzt.",
                "language": "


#### Searchword: "Hillary Clinton" AND "Donald Trump" - Just title and text


In [6]:
dump(exec_query(
        {
            'q': '"Hillary Clinton" AND "Donald Trump"', 
            'fl': 'title AND text',
        }))


{
    "response": {
        "docs": [
            {
                "title": "LA Times Tracking Poll: Trump Up 5 Points",
                "text": " Donald Trump holds a 5-point lead over Hillary Clinton, the Los Angeles Times Daybreak tracking poll showed Thursday. Donald Trump, 47.5 percent. Hillary Clinton, 42.5 percent. \u00a0 \u00a9 2016 Newsmax. All rights reserved. Click Here to comment on this article"
            },
            {
                "title": "Trump vergleicht sich mit Rapper Jay-Z",
                "text": "Die US-Pr\u00e4sidentschaftskandidaten Hillary Clinton und Donald Trump haben ihren Wahlkampf am Samstag im hart umk\u00e4mpften Staat Florida fortgesetzt. Miami. Die US-Pr\u00e4sidentschaftskandidaten Hillary Clinton und Donald Trump haben ihren Wahlkampf am Samstag im hart umk\u00e4mpften Staat Florida fortgesetzt."
            },
            {
                "title": "Donald Trump gewinnt die US-Wahl - Seine Siegesrede im O-Ton",
                "text": " De


#### Searchword: "Hillary Clinton" AND "Donald Trump" -  Titles only for articles in english language.


In [7]:
dump(exec_query(
        {
            'q': '"Hillary Clinton" AND "Donald Trump"',
            'fq': 'language: en AND sourceId:neofonie',
            'fl': 'title',
            'sort': 'publicationDate DESC',
            'rows': '10'
        }))


{
    "response": {
        "start": 0,
        "docs": [
            {
                "title": "Turkish President Erdo\u011fan congratulates Trump over US election win"
            },
            {
                "title": "Left-wing Democrats Sanders, Warren extend olive branch to Trump"
            },
            {
                "title": "The world celebrates outgoing US president with social drive to #ThankObamaIn4Words"
            },
            {
                "title": "America rocked by nationwide protests as Donald Trump celebrates election victory"
            },
            {
                "title": "Analysis: Trump's victory a reversal of fortune for Obama"
            },
            {
                "title": "For many supporters, Trump is a thing called hope"
            },
            {
                "title": "Transition: Obama, Trump to meet at White House"
            },
            {
                "title": "'Good for us \u2013 BAD for Europe' President Trump


#### Using Meta Information and some semantics of Solr search queries

In the next queries we are setting the number of results to zero, because we are just interested in the meta information

For each of the following three examples we find a different number of results depending on the semantic of the seach query.

* In the first example the query string is OR'ed and we get all results containing any occurrence of the query tokens.
* In the second example the semantics of the query is interpreted by Solr ("text:hillary +text:clinton +text:donald text:trump").
* In the third query we are searching for exact matches of "Hillary Clinton" AND "Donald Trump".

Most of the time you want the third query for results which match both politicians.


In [8]:
dump(exec_query(
        {
            'q': 'Hillary Clinton Donald Trump', 
            'rows': '0'
        }))


{
    "response": {
        "docs": [],
        "start": 0,
        "maxScore": 1.4696037,
        "numFound": 42584
    },
    "responseHeader": {
        "params": {
            "q": "Hillary Clinton Donald Trump",
            "indent": "on",
            "rows": "0",
            "wt": "json"
        },
        "QTime": 9,
        "status": 0
    }
}


In [9]:
dump(exec_query(
        {
            'q': 'Hillary Clinton AND Donald Trump', 
            'rows': '0'
        }))


{
    "response": {
        "docs": [],
        "start": 0,
        "maxScore": 1.4696037,
        "numFound": 22908
    },
    "responseHeader": {
        "params": {
            "q": "Hillary Clinton AND Donald Trump",
            "indent": "on",
            "rows": "0",
            "wt": "json"
        },
        "QTime": 249,
        "status": 0
    }
}


In [10]:
dump(exec_query(
        {
            'q': '"Hillary Clinton" AND "Donald Trump"', 
            'rows': '0'
        }))


{
    "response": {
        "docs": [],
        "start": 0,
        "maxScore": 2.0780168,
        "numFound": 19817
    },
    "responseHeader": {
        "params": {
            "q": "\"Hillary Clinton\" AND \"Donald Trump\"",
            "indent": "on",
            "rows": "0",
            "wt": "json"
        },
        "QTime": 306,
        "status": 0
    }
}



#### Documents about "Washington" from Neofonie's news crawl not older than 24 hours


The following query returns results for all news articles containing the search term 'Washington'.

Results contain terms like 'Kamasi Washington', as 'Washington Redskins' etc.

In [11]:
dump(exec_query(
        {
            'q': 'Washington', 
            'fq': '+sourceId:neofonie +publicationDateNOW/HOUR-24HOUR TO NOW/HOUR+1HOUR'
        }))


{
    "response": {
        "docs": [
            {
                "unknownPersons": [
                    "Tuesday \u2019",
                    "Results Trump",
                    "Bureau . Raise"
                ],
                "text": "An initiative to raise the statewide minimum wage and require paid sick leave was leading in Tuesday results. Under the measure, workers would receive the first pay jump, from the current $9.47 to $11 an hour, starting Jan. 1. A\u00a0ballot measure to raise the minimum wage statewide to $13.50 an hour by 2020 was leading\u00a0in Tuesday\u2019s election returns. Initiative 1433 ,\u00a0which would also require paid sick leave for employees, was leading with about 59 percent of the vote in Tuesday\u2019s early statewide returns, which did not\u00a0include results from Snohomish County. King County and Snohomish County released results later than others, and they were not initially tallied in the early statewide results. In King County, the measure w


Whereas the following search narrows the search down to all articles containing the entity with label 'Washington', which might match your initial intention of searching for the american capital in news.

Please see the next chapter for more examples using named entities.


In [12]:
dump(exec_query(
        {
            'q': 'entityLabels: Washington', 
            'fq': '+sourceId:neofonie +publicationDateNOW/HOUR-24HOUR TO NOW/HOUR+1HOUR'
        }))


{'response': {'docs': [{'_version_': 1550262850093580288,
    'entityLabels': ['Amide',
     'Bremerton',
     'Roger Goodell',
     'National Football League',
     'Washington',
     'Centers for Disease Control and Prevention',
     'Bellingham',
     'Facebook',
     'Doctors',
     'Seattle',
     'Desoxyribonukleinsäure',
     'Informationstechnik',
     'Highschool',
     'Donald Trump',
     'Florida',
     'Richard Sherman'],
    'entityRfc4180': ['k,0,4,CONCEPT,Amid,Q188777,Amide,34.911667\r\n',
     'k,27,36,PLACE,Bremerton,Q695417,Bremerton,47.035908\r\n',
     'k,1179,1192,PERSON,Roger Goodell,Q2271796,Roger Goodell,57.893955\r\n',
     'k,1212,1215,ORGANISATION,NFL,Q1215884,National Football League,37.00284\r\n',
     'k,1420,1430,PLACE,Washington,Q1223,Washington,31.400198\r\n',
     'k,1516,1558,ORGANISATION,Centers for Disease Control and Prevention,Q583725,Centers for Disease Control and Prevention,46.19369\r\n',
     'k,1560,1563,ORGANISATION,CDC,Q583725,Centers for 



#### Hourly Documents Count about "Hillary Clinton" from Neofonie's news crawl not older than 24 hours: 


In [13]:
timeCounts = exec_query(
        {
            'q': 'entityLabels: Hillary Clinton', 
            'fq': '+publicationDate:[NOW/HOUR-24HOUR TO NOW/HOUR+1HOUR] +sourceId:neofonie',
            'rows': '0',
            'facet': 'true',
            'facet.range': 'publicationDate',
            'facet.range.start': 'NOW/HOUR-24HOUR',
            'facet.range.end': 'NOW/HOUR+1HOUR',
            'facet.range.gap': '+1HOUR'
        })

dump(timeCounts)


{
    "facet_counts": {
        "facet_dates": {},
        "facet_fields": {},
        "facet_ranges": {
            "publicationDate": {
                "counts": [
                    "2016-11-09T09:00:00Z",
                    522,
                    "2016-11-09T10:00:00Z",
                    354,
                    "2016-11-09T11:00:00Z",
                    345,
                    "2016-11-09T12:00:00Z",
                    389,
                    "2016-11-09T13:00:00Z",
                    283,
                    "2016-11-09T14:00:00Z",
                    315,
                    "2016-11-09T15:00:00Z",
                    266,
                    "2016-11-09T16:00:00Z",
                    274,
                    "2016-11-09T17:00:00Z",
                    419,
                    "2016-11-09T18:00:00Z",
                    342,
                    "2016-11-09T19:00:00Z",
                    411,
                    "2016-11-09T20:00:00Z",
                    321,
      


##### AreaChart with the hourly distribution for selected news.


In [14]:
from nvd3 import stackedAreaChart

timeCountList = timeCounts['facet_counts']['facet_ranges']['publicationDate']['counts']
#timeCountList = [int(x) for x in timeCountList[1::2]]
#dump("TimeCounts" + str(timeCountList))
#dump("Countings " + str(timeCountList[1::2]))

timeTuple = [datetime.datetime.strptime(str(d), "%Y-%m-%dT%H:%M:%SZ") for d in timeCountList[::2]]
timeTuple = [time.mktime(d.timetuple()) for d in timeTuple]
timeTuple = [int(t1) * 1000 for t1 in timeTuple]
#print("TimeTuple" + str(timeTuple))

name = "News for \"Hillary Clinton\" per hour for the last day"
hourlyDocsChart = nvd3.stackedAreaChart(name=name,height=450,width=500, use_interactive_guideline=False, x_is_date=True)
hourlyDocsChart.set_containerheader("\n\n<h3>" + name + "</h3>\n\n")
hourlyDocsChart.add_serie(name="Hourly documents", y=timeCountList[1::2], x=timeTuple)
hourlyDocsChart




## Examples fetching data based on named entities


#### Fetch Top 5 news with NER annotations for "Hillary Clinton" AND "Donald Trump"

In [15]:
dump(exec_query(
        {
            'q': 'entityLabels: "Hillary Clinton" AND entityLabels: "Donald Trump"', 
            'fq': '+publicationDate:[NOW/HOUR-24HOUR TO NOW/HOUR+1HOUR] +sourceId:neofonie',
            'fl': 'neoUrl AND title AND entityLabels',
            'sort': 'publicationDate DESC',
            'rows': '5',
        }))


{
    "response": {
        "start": 0,
        "docs": [
            {
                "title": "Tausende protestieren gegen Trump-Sieg in den USA",
                "entityLabels": [
                    "Amerikaner",
                    "Berkeley",
                    "fl\u00fcssige Luft",
                    "CNN",
                    "Manhattan",
                    "Dokumentarfilmer",
                    "Michael Moore",
                    "Facebook",
                    "Trump Tower",
                    "Sattelzug",
                    "Hochhaus",
                    "Pr\u00e4sident der Vereinigten Staaten",
                    "Deutsche Presse-Agentur",
                    "Muslim",
                    "Nachtschicht \u2013 Ich habe Angst",
                    "Kalifornien",
                    "Universit\u00e4tsstadt",
                    "Mexiko",
                    "Spanisch",
                    "Wahlnacht",
                    "USA Today",
                    "Donald Trump

#### Fetch TOP 5 news for "Volkswagen"

In [16]:
dump(exec_query(
        {
            'q': 'entityLabels: Volkswagen', 
            'fl': 'title',
            'fq': '+publicationDate:[NOW/HOUR-24HOUR TO NOW/HOUR+1HOUR] +sourceId:neofonie',
            'sort': 'publicationDate DESC',
            'rows': '5',
        }))


{
    "response": {
        "start": 0,
        "docs": [
            {
                "title": "Bob Lutz: Trump positiv f\u00fcr US-Wirtschaft \u2013 \u201caber nicht gut f\u00fcr Tesla und Solarcity!\u201d"
            },
            {
                "title": "Neues zu Renaults Dieselgate"
            },
            {
                "title": "POL-AA: Ostalbkreis: Kripo sucht Hinweise nach sexuellem \u00dcbergriff, Einbruch in"
            },
            {
                "title": "POL-OG: Meldungen aus den Bereichen Baden-Baden/B\u00fchl"
            },
            {
                "title": "Dow und Dax feiern Donald Trump"
            }
        ],
        "numFound": 258
    },
    "responseHeader": {
        "params": {
            "fl": "title",
            "fq": "+publicationDate:[NOW/HOUR-24HOUR TO NOW/HOUR+1HOUR] +sourceId:neofonie",
            "sort": "publicationDate DESC",
            "rows": "5",
            "q": "entityLabels: Volkswagen",
            "indent": "on",


#### Fetch TOP 5 news for the last two hours with recognized Organisations

In [17]:
dump(exec_query(
        {
            'q': 'entityTypes: ORGANISATION', 
            'fl': 'neoUrl title entityRfc4180',
            'fq': '+publicationDate:[NOW/HOUR-2HOUR TO NOW/HOUR+1HOUR] +sourceId:neofonie',
            'sort': 'publicationDate DESC',
            'rows': '5',
        }))


{
    "response": {
        "start": 0,
        "docs": [
            {
                "title": "Bad Aibling : Fahrdienstleiter legt in Prozess um Zugungl\u00fcck Gest\u00e4ndnis ab",
                "entityRfc4180": [
                    "k,0,11,PLACE,Bad Aibling,Q5758,Bad Aibling,51.775562\r\n",
                    "k,14,30,JOBTITLE,Fahrdienstleiter,Q1392297,Fahrdienstleiter,44.99667\r\n",
                    "k,500,522,ORGANISATION,Landgericht Traunstein,Q1802902,Landgericht Traunstein,44.29343\r\n",
                    "k,524,534,PLACE,Oberbayern,Q10562,Oberbayern,49.394806\r\n",
                    "k,630,640,CONCEPT,9. Februar,Q2324,9. Februar,39.678333\r\n",
                    "k,812,823,CONCEPT,fahrl\u00e4ssige,Q160070,Fahrl\u00e4ssigkeit,35.078335\r\n",
                    "k,908,918,CONCEPT,9. Februar,Q2324,9. Februar,39.678333\r\n",
                    "k,954,964,CONCEPT,Smartphone,Q22645,Smartphone,50.0\r\n",
                    "k,1004,1014,PLACE,Oberbayern,Q10562,Oberba

#### Fetch TOP 5 news for which CRF recognized persons that are not already known as named entities.

In [18]:
dump(exec_query(
        {
            'q': 'unknownTypes: PERSON', 
            'fl': 'neoUrl title entityRfc4180',
            'fq': '+publicationDate:[NOW/HOUR-2HOUR TO NOW/HOUR+1HOUR] +sourceId:neofonie',
            'sort': 'publicationDate DESC',
            'rows': '5',
        }))


{
    "response": {
        "start": 0,
        "docs": [
            {
                "title": "S\u00fcdkorea: Korruptionsaff\u00e4re \u00fcberschattet Olympia",
                "entityRfc4180": [
                    "k,0,8,PLACE,S\u00fcdkorea,Q884,S\u00fcdkorea,247.48982\r\n",
                    "k,80,88,PLACE,S\u00fcdkorea,Q884,S\u00fcdkorea,247.48982\r\n",
                    "k,1521,1524,CONCEPT,Sil,Q14556,Sil,32.803333\r\n",
                    "k,1930,1937,CONCEPT,Familie,Q35409,Familie,31.193335\r\n",
                    "k,2221,2245,ORGANISATION,Deutschen Presse-Agentur,Q312653,Deutsche Presse-Agentur,75.17783\r\n",
                    "k,2312,2323,PLACE,Pyeongchang,Q188624,Pyeongchang,48.52241\r\n",
                    "k,2416,2443,CONCEPT,\u00f6ffentliche Ausschreibungen,Q294300,\u00d6ffentlicher Auftrag,35.185\r\n",
                    "k,2667,2675,CONCEPT,Logistik,Q177777,Logistik,38.55222\r\n",
                    "k,2805,2820,CONCEPT,Provinz Gangwon,Q41067,Gangwon-do,3



## Examples fetching data with facets



#### Number of documents from the different News-Stream sources


In [19]:
sourceDistributionCounts = exec_query(
        {
            'q': '*', 
            'fq': '+publicationDate:[NOW/HOUR-30DAY TO NOW/HOUR+1HOUR] +sourceId:neofonie',
            'rows': '0',
            'facet': 'true',
            'facet.field': 'neoPublicationName',
            'facet.missing': 'true',
            'facet.sort': 'count',
            'facet.method': 'enum'
        })
dump(sourceDistributionCounts)


{
    "facet_counts": {
        "facet_dates": {},
        "facet_fields": {
            "neoPublicationName": [
                "AD HOC NEWS",
                25769,
                "FOCUS Online",
                7307,
                "Westdeutsche Allgemeine",
                7126,
                "Wallstreet Online",
                5982,
                "finanzen.net",
                3965,
                "freiepresse",
                3803,
                "news aktuell",
                2869,
                "Schleswig-Holsteinischer Zeitungsverlag",
                2501,
                "Schw\u00e4bische Zeitung",
                2461,
                "Klamm",
                2416,
                "T-Online",
                2364,
                "Mirror",
                2329,
                "firmenpresse",
                2324,
                "Aktien Check",
                2220,
                "Nordwest-Zeitung",
                2219,
                "MarketWatch",
     


##### Pie chart for the news distribution in News-Stream.


In [44]:
from nvd3 import discreteBarChart

sourceDistributionCountList = sourceDistributionCounts['facet_counts']['facet_fields']['neoPublicationName']
sourceDistributionCountList = sourceDistributionCountList[0:-2]
print("\n" + str(sourceDistributionCountList) + "\n")

name = 'News distribution in sources of News-Stream in the last 30 days'
sourceDistributionCountsChart = discreteBarChart(name=name, color_category='category20c', height=400, width=900)
sourceDistributionCountsChart.set_containerheader("\n\n<h3>" + name + "</h3>\n\n")
sourceDistributionCountsChart.add_serie(y=list(reversed(sourceDistributionCountList[1::2])), x=list(reversed(sourceDistributionCountList[::2])))
sourceDistributionCountsChart



['AD HOC NEWS', 25769, 'FOCUS Online', 7307, 'Westdeutsche Allgemeine', 7126, 'Wallstreet Online', 5982, 'finanzen.net', 3965, 'freiepresse', 3803, 'news aktuell', 2869, 'Schleswig-Holsteinischer Zeitungsverlag', 2501, 'Schwäbische Zeitung', 2461, 'Klamm', 2416, 'T-Online', 2364, 'Mirror', 2329, 'firmenpresse', 2324, 'Aktien Check', 2220, 'Nordwest-Zeitung', 2219, 'MarketWatch', 2213, 'Berliner Zeitung', 2207, 'San Francisco Chronicle', 2094, 'Frankfurter Rundschau', 2059, 'Mitteldeutsche Zeitung', 2035, 'Washingtontimes', 2034, 'chron', 1963, 'Mittelbayerische', 1924, 'ZEIT ONLINE', 1912, 'Neue Zürcher Zeitung', 1831, 'Stern', 1805, 'Augsburger Allgemeine', 1802, 'Westdeutsche Zeitung newsline', 1802, 'Münchner Merkur', 1800, 'Berliner Morgenpost', 1742, 'EXPRESS.co.uk', 1708, 'Österreichischer Rundfunk', 1708, 'Donaukurier', 1699, 'Neue Westfälische', 1647, 'SZ-Online', 1621, '20min-ch', 1584, 'Magdeburger Volksstimme', 1581, 'openPR', 1541, 'Daily Star', 1506, 'Frankenpost', 1486, 

#### Counts of news per hour containing the search term "Hillary Clinton" in the last 24 hours.

In [21]:
dump(exec_query(
        {
            'q': 'Hillary Clinton', 
            'fq': '+publicationDate:[NOW/HOUR-24HOUR TO NOW/HOUR+1HOUR] +sourceId:neofonie',
            'fl': 'titles',
            'rows': '0',
            'facet': 'true',
            'facet.range': 'publicationDate',
            'facet.range.start': 'NOW/HOUR-24HOUR',
            'facet.range.end': 'NOW/HOUR+1HOUR',
            'facet.range.gap': '+1HOUR'
        }))


{
    "facet_counts": {
        "facet_dates": {},
        "facet_fields": {},
        "facet_ranges": {
            "publicationDate": {
                "counts": [
                    "2016-11-09T09:00:00Z",
                    534,
                    "2016-11-09T10:00:00Z",
                    360,
                    "2016-11-09T11:00:00Z",
                    361,
                    "2016-11-09T12:00:00Z",
                    395,
                    "2016-11-09T13:00:00Z",
                    286,
                    "2016-11-09T14:00:00Z",
                    319,
                    "2016-11-09T15:00:00Z",
                    272,
                    "2016-11-09T16:00:00Z",
                    287,
                    "2016-11-09T17:00:00Z",
                    428,
                    "2016-11-09T18:00:00Z",
                    352,
                    "2016-11-09T19:00:00Z",
                    431,
                    "2016-11-09T20:00:00Z",
                    336,
      

#### Count news grouped by language for the search term "Hillary Clinton" OR "Donald Trump".

In [22]:
languageCounts = exec_query(
        {
            'q': 'entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump"',
            'rows': '0',
            'facet': 'true',
            'facet.field': 'language',
            'facet.missing': 'true',
            'facet.sort': 'count',
            'facet.method': 'fcs'
        })
dump(languageCounts)


{
    "facet_counts": {
        "facet_dates": {},
        "facet_fields": {
            "language": [
                "de",
                25336,
                "en",
                10605,
                "fr",
                345,
                "",
                167,
                "es",
                0,
                null,
                0
            ]
        },
        "facet_ranges": {},
        "facet_intervals": {},
        "facet_queries": {}
    },
    "response": {
        "docs": [],
        "start": 0,
        "maxScore": 4.8672085,
        "numFound": 36453
    },
    "responseHeader": {
        "params": {
            "facet.missing": "true",
            "rows": "0",
            "facet.method": "fcs",
            "facet.field": "language",
            "q": "entityLabels:\"Hillary Clinton\" OR entityLabels:\"Donald Trump\"",
            "indent": "on",
            "facet.sort": "count",
            "wt": "json",
            "facet": "true"
        },
       


##### Pie chart for the language distribution in selected news.


In [23]:
from nvd3 import pieChart

languageCountList = languageCounts['facet_counts']['facet_fields']['language']
languageCountList = languageCountList[0:-2]
print("\n" + str(languageCountList) + "\n")

name = 'Language distribution (entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump")'
languageDistChart = nvd3.pieChart(name=name, color_category='category20c', height=450, width=450)
languageDistChart.set_containerheader("\n\n<h3>" + name + "</h3>\n\n")

#Add the serie
extra_serie = {"tooltip": {"y_start": "", "y_end": " cal"}}
languageDistChart.add_serie(y=languageCountList[1::2], x=languageCountList[::2], extra=extra_serie)
languageDistChart


['de', 25336, 'en', 10605, 'fr', 345, '', 167, 'es', 0]



#### Counting all occurrences of named entities in news which contain NEs "Hillary Clinton" OR "Donald Trump"

In [24]:
surfaceCounts = exec_query(
        {
            'q': 'entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump"',
            'fq':'publicationDate:[NOW/DAY-7DAY TO NOW/DAY+1DAY]',
            'rows': '0',
            'facet': 'true',
            'facet.field': 'knownSurfaceforms',
            'facet.limit': '10',
            'facet.missing': 'true',
            'facet.sort': 'count',
            'facet.method': 'enum'
         })
dump(surfaceCounts)


{
    "facet_counts": {
        "facet_dates": {},
        "facet_fields": {
            "knownSurfaceforms": [
                "Donald Trump",
                32340,
                "Hillary Clinton",
                23808,
                "USA",
                13190,
                "Republikaner",
                7874,
                "Wahlkampf",
                7052,
                "Barack Obama",
                6800,
                "Florida",
                5871,
                "Pr\u00e4sident",
                5473,
                "Deutschland",
                5401,
                "Demokratin",
                5224,
                null,
                0
            ]
        },
        "facet_ranges": {},
        "facet_intervals": {},
        "facet_queries": {}
    },
    "response": {
        "docs": [],
        "start": 0,
        "maxScore": 4.8672085,
        "numFound": 36130
    },
    "responseHeader": {
        "params": {
            "facet.missing": "true"


##### Bar chart for the top ranking surface forms in the selected news.


In [25]:
from nvd3 import discreteBarChart

surfaceCountList = surfaceCounts['facet_counts']['facet_fields']['knownSurfaceforms']
print("\n" + str(surfaceCountList) + "\n")

name = 'Top ranking surfaces in selected news'
surfaceCountsChart = discreteBarChart(name=name, color_category='category20c', height=400, width=400)
surfaceCountsChart.set_containerheader("\n\n<h3>" + name + "</h3>\n\n")
surfaceCountsChart.add_serie(y=list(reversed(surfaceCountList[1::2])), x=list(reversed(surfaceCountList[::2])))
surfaceCountsChart



['Donald Trump', 32340, 'Hillary Clinton', 23808, 'USA', 13190, 'Republikaner', 7874, 'Wahlkampf', 7052, 'Barack Obama', 6800, 'Florida', 5871, 'Präsident', 5473, 'Deutschland', 5401, 'Demokratin', 5224, None, 0]



#### Counting all CRFs in news which contain NEs "Hillary Clinton" OR "Donald Trump"

In [26]:
crfSurfaceCounts = exec_query(
        {
            'q': 'entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump"',
            'fq':'publicationDate:[NOW/DAY-3DAY TO NOW/DAY+1DAY]',
            'rows': '0',
            'facet': 'true',
            'facet.field': 'unknownPersons',
            'facet.limit': '10',
            'facet.missing': 'true',
            'facet.sort': 'count',
            'facet.method': 'enum'
         })
dump(crfSurfaceCounts)


{
    "facet_counts": {
        "facet_dates": {},
        "facet_fields": {
            "unknownPersons": [
                "White House",
                2348,
                "Donald Trump's",
                935,
                "Hillary Clinton's",
                436,
                "Republican Party",
                410,
                "Viktor Orban",
                400,
                "The Republican",
                350,
                "Kellyanne Conway",
                333,
                "Donald J. Trump",
                302,
                "Click Here",
                299,
                "She said",
                285,
                null,
                10999
            ]
        },
        "facet_ranges": {},
        "facet_intervals": {},
        "facet_queries": {}
    },
    "response": {
        "docs": [],
        "start": 0,
        "maxScore": 4.8672085,
        "numFound": 27569
    },
    "responseHeader": {
        "params": {
            "facet


##### Bar chart for the top ranking unknown surface forms in the selected news, which were generated with CRF.


In [27]:
from nvd3 import discreteBarChart

crfSurfaceCountList = crfSurfaceCounts['facet_counts']['facet_fields']['unknownPersons']
crfSurfaceCountList = crfSurfaceCountList[0:-2]
print("\n" + str(crfSurfaceCountList) + "\n")

name = 'Top ranking unknown surfaces (CRF) in selected news'
crfSurfaceCountsChart = discreteBarChart(name=name, color_category='category20c', height=400, width=400)
crfSurfaceCountsChart.set_containerheader("\n\n<h3>" + name + "</h3>\n\n")

crfSurfaceCountsChart.add_serie(y=list(reversed(crfSurfaceCountList[1::2])), x=list(reversed(crfSurfaceCountList[::2])))
crfSurfaceCountsChart



['White House', 2348, "Donald Trump's", 935, "Hillary Clinton's", 436, 'Republican Party', 410, 'Viktor Orban', 400, 'The Republican', 350, 'Kellyanne Conway', 333, 'Donald J. Trump', 302, 'Click Here', 299, 'She said', 285]




## Examples for selecting dpa data


#### Loading dpa-News from News-Stream

In [28]:
dump(exec_query(
        {
            'q': 'entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump"',
            'fq': 'sourceId:dpa',
         }))


{
    "response": {
        "docs": [
            {
                "entitySurfaceforms": [
                    "Europa",
                    "dpa-AFX",
                    "britischen",
                    "FTSE",
                    "100-",
                    "US-Pr\u00e4sidentschaftswahl",
                    "8. November",
                    "Hillary Clinton",
                    "Finanzm\u00e4rkte",
                    "US-Notenbank Fed",
                    "Zinspolitik",
                    "CMC Markets",
                    "US-Pr\u00e4sidenten",
                    "Notenbank",
                    "Kursgewinn",
                    "Finanzvorstand",
                    "L'Oreal",
                    "Nordamerika",
                    "Axa",
                    "Leitindex",
                    "Euroraum",
                    "franz\u00f6sische",
                    "Versicherer",
                    "Donald Trump",
                    "Umsatz",
                    "LafargeHolc

#### Loading dpa-News from News-Stream with dpa specific fields

In [29]:
dump(exec_query(
        {
            'q': 'entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump"',
            'fq': 'sourceId:dpa',
            'fl': 'id dpaId publicationDate title mlRessort dpaIndustries',
            'sort': 'publicationDate DESC',
            'rows': '5'
        }))


{
    "response": {
        "start": 0,
        "docs": [
            {
                "dpaId": "urn:newsml:dpa.com:20090101:161110-99-129528/3",
                "title": "Dax klettert in Richtung 10 800 Punkte",
                "id": "35c50bb3ffddc1bb5eca43c8b8bde658",
                "mlRessort": "wi",
                "publicationDate": "2016-11-10T09:03:43Z"
            },
            {
                "title": "dpa/audio-Programm f\u00fcr Donnerstag, 10. November 2016, 10 Uhr",
                "dpaId": "urn:newsml:dpa.com:20090101:161110-99-129786/2",
                "id": "831f88925cbbe6fb315e82076505fd2c",
                "mlRessort": "vm",
                "publicationDate": "2016-11-10T09:02:57Z"
            },
            {
                "title": "POLITIK: Seoul: Wahlsieger Trump sichert Schutz der USA f\u00fcr S\u00fcdkorea zu",
                "dpaId": "urn:newsml:dpa-afx.de:ADE:20161110T100137+0100:1478768495733/1",
                "id": "2351f4fa44d11d1efc8d37b4188dd3a9"

#### Aggregation of dpa news on category 'mlIndustries'

FIN -> Asset Management, Finanzdienstleister | AUT -> Automobil-/Zuliefererindustrie (Autos &amp; LKW, Ersatzteile, Reifen) | BAN -> Banken | CON -> Bau | PER -> Bekleidung, Kosmetik | MIN -> Bergbau, Rohstoffförderung (Kohle, Diamanten, Gold, Platin, Edelmetalle) | EQI -> Beteiligungsgesellschaften | EQN -> Börsennotierte Fonds (ETF, etc.) | CHM -> Chemie, Kunststoffe | CMP -> Computer, Hardware, Software, Halbleiter, Bauteile | ELU -> Elektrizitätsversorger | ELE -> Elektronik, Elektrik, Komponenten | AEG -> Erneuerbare Energien | HTH -> Gesundheitswesen, Medizintechnik, Krankenhausbedarf | BEV -> Getränke (Bier, Wein, Destillerien, Soft Drinks) | TRN -> Gütertransport, Logistik | HOU -> Haushaltswaren, Möbel, Eigenheime | PRO -> Immobilien | REF -> Lebensmittel- und Pharmahandel | ASS -> Lebensversicherer | ENG -> Maschinenbau, Starkstrom, Umwelttechnik | MET -> Metallverarbeitung- und förderung, NE-Metalle | INL -> Mischkonzerne, Verpackungsindustrie | FOO -> Nahrungsmittel (Hersteller, inkl. Agrarindustrie) | RET -> Non-Food-Einzelhandel, Endkunden-Dienstleister | PAP -> Papier, Zellulose, Holz | PHA -> Pharma, Biotechnologie | DEF -> Rüstungsindustrie, Flugzeughersteller | INS -> Sach- und Rückversicherungen | SOF -> Software, IT-Beratung, Internet, Portalbetreiber | TOB -> Tabakindustrie | TEL -> Telefongesellschaften (Festnetz) | MOB -> Telefongesellschaften (Mobilfunk) | LEI -> Tourismus, Fluggesellschaften, Bahn (Personenverkehr) | SVS -> Unternehmensdienstleister | CSM -> Verbrauchsgüter, Kosmetik, Seife, Handwerksbedarf, Möbel, Haushaltsgeräte, Unterhaltungselektronik | MED -> Verlage, Rundfunk, Info-Dienste, Zeitungen, Bücher, Werbung | UTI -> Versorger (Gas, Wasser etc.) | OIL -> Öl, Ölexploration, Gas | OES -> Öl-Anlagenbau, Pipelines |

In [30]:
industryCategoryCounts = exec_query(
        {
            'q': 'Siemens',
            'fq': 'sourceId:dpa',
            'rows': '0',
            'facet': 'true',
            'facet.field': 'mlIndustries',
            'facet.limit': '10',
            'facet.missing': 'true',
            'facet.sort': 'count',
            'facet.method': 'enum'
         })
dump(industryCategoryCounts)


{
    "facet_counts": {
        "facet_dates": {},
        "facet_fields": {
            "mlIndustries": [
                "ENG",
                30,
                "INL",
                20,
                "ELE",
                9,
                "CON",
                3,
                "ELU",
                3,
                "OIL",
                3,
                "UTI",
                3,
                "MET",
                2,
                "TRN",
                2,
                "AUT",
                1,
                null,
                88
            ]
        },
        "facet_ranges": {},
        "facet_intervals": {},
        "facet_queries": {}
    },
    "response": {
        "docs": [],
        "start": 0,
        "maxScore": 1.1924596,
        "numFound": 120
    },
    "responseHeader": {
        "params": {
            "facet.missing": "true",
            "fq": "sourceId:dpa",
            "wt": "json",
            "rows": "0",
            "facet.method


Bar chart for the top ranking industry categories for "Siemens".


In [31]:
from nvd3 import discreteBarChart

industryCategoryCountList = industryCategoryCounts['facet_counts']['facet_fields']['mlIndustries']
industryCategoryCountList = industryCategoryCountList[0:-2]
print("\n" + str(industryCategoryCountList) + "\n")

name = 'Top ranking industry categories for "Siemens"'
industryCategoryCountsChart = discreteBarChart(name=name, color_category='category20c', height=400, width=400)
industryCategoryCountsChart.set_containerheader("\n\n<h3>" + name + "</h3>\n\n")

industryCategoryCountsChart.add_serie(y=list(reversed(industryCategoryCountList[1::2])), x=list(reversed(industryCategoryCountList[::2])))
industryCategoryCountsChart



['ENG', 30, 'INL', 20, 'ELE', 9, 'CON', 3, 'ELU', 3, 'OIL', 3, 'UTI', 3, 'MET', 2, 'TRN', 2, 'AUT', 1]



#### Aggregation of dpa news on category 'dpaRessort'

pl="politik", wi="wirtschaft", rs="redaktioneller service", vm="vermischtes", ku="kultur", sp="sport"

In [32]:
dpaRessortCounts = exec_query(
        {
            'q': 'entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump"',
            'fq': 'sourceId:dpa',
            'rows': '0',
            'facet': 'true',
            'facet.field': 'dpaRessort',
            'facet.missing': 'true',
            'facet.sort': 'count',
            'facet.method': 'enum'
         })
dump(dpaRessortCounts)


{
    "facet_counts": {
        "facet_dates": {},
        "facet_fields": {
            "dpaRessort": [
                "pl",
                1023,
                "wi",
                346,
                "rs",
                158,
                "vm",
                35,
                "ku",
                11,
                "sp",
                4,
                null,
                0
            ]
        },
        "facet_ranges": {},
        "facet_intervals": {},
        "facet_queries": {}
    },
    "response": {
        "docs": [],
        "start": 0,
        "maxScore": 4.8672085,
        "numFound": 1577
    },
    "responseHeader": {
        "params": {
            "facet.missing": "true",
            "fq": "sourceId:dpa",
            "rows": "0",
            "facet.method": "enum",
            "facet.field": "dpaRessort",
            "q": "entityLabels:\"Hillary Clinton\" OR entityLabels:\"Donald Trump\"",
            "indent": "on",
            "facet.sort": "co


Pie chart of the ressorts of the news for "entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump"".

* pl="politik", wi="wirtschaft", rs="redaktioneller service", vm="vermischtes", ku="kultur", sp="sport"


In [33]:
from nvd3 import pieChart

dpaRessortCountList = dpaRessortCounts['facet_counts']['facet_fields']['dpaRessort']
dpaRessortCountList = dpaRessortCountList[0:-2]
print("\n" + str(dpaRessortCountList) + "\n")

name = 'Distribution of the ressorts for selected news (entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump")'
dpaRessortsChart = nvd3.pieChart(name=name, color_category='category20c', height=450, width=450)
dpaRessortsChart.set_containerheader("\n\n<h3>" + name + "</h3>\n\n")

#Add the serie
extra_serie = {"tooltip": {"y_start": "", "y_end": " cal"}}
dpaRessortsChart.add_serie(y=dpaRessortCountList[1::2], x=dpaRessortCountList[::2], extra=extra_serie)
dpaRessortsChart


['pl', 1023, 'wi', 346, 'rs', 158, 'vm', 35, 'ku', 11, 'sp', 4]



#### Aggregation of dpa news on category 'dpaServices'

In [34]:
dpaServicesCounts = exec_query(
        {
            'q': 'entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump"',
            'fq': 'sourceId:dpa',
            'rows': '0',
            'facet': 'true',
            'facet.field': 'dpaServices',
            'facet.missing': 'true',
            'facet.sort': 'count',
            'facet.method': 'enum'
         })
dump(dpaServicesCounts)


{
    "facet_counts": {
        "facet_dates": {},
        "facet_fields": {
            "dpaServices": [
                "dpasrv:bdt",
                845,
                "afxsrv:ADE",
                599,
                "dpasrv:edi",
                582,
                "dpasrv:edt",
                582,
                "dpasrv:erd",
                582,
                "dpasrv:bid",
                445,
                "dpasrv:hfk",
                79,
                "dpasrv:kom",
                79,
                "dpasrv:bwg",
                39,
                "dpasrv:wap-bwg",
                20,
                "dpasrv:brb",
                18,
                "dpasrv:rhs",
                16,
                "dpasrv:bay",
                14,
                "dpasrv:hsh",
                12,
                "dpasrv:nwf",
                11,
                "dpasrv:wap-brb",
                11,
                "dpasrv:aht",
                10,
                "dpasrv:san",



Bar chart of the dpa services with news for "entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump"".


In [40]:
from nvd3 import discreteBarChart

dpaServicesCountList = dpaServicesCounts['facet_counts']['facet_fields']['dpaServices']
dpaServicesCountList = dpaServicesCountList[0:-2]
print("\n" + str(dpaServicesCountList) + "\n")

name = 'Top ranking dpa services in selected news'
dpaServicesCountsChart = discreteBarChart(name=name, color_category='category20c', height=400, width=400)
dpaServicesCountsChart.set_containerheader("\n\n<h3>" + name + "</h3>\n\n")

dpaServicesCountsChart.add_serie(y=list(reversed(dpaServicesCountList[1::2])), x=list(reversed(dpaServicesCountList[::2])))
dpaServicesCountsChart



['dpasrv:bdt', 845, 'afxsrv:ADE', 599, 'dpasrv:edi', 582, 'dpasrv:edt', 582, 'dpasrv:erd', 582, 'dpasrv:bid', 445, 'dpasrv:hfk', 79, 'dpasrv:kom', 79, 'dpasrv:bwg', 39, 'dpasrv:wap-bwg', 20, 'dpasrv:brb', 18, 'dpasrv:rhs', 16, 'dpasrv:bay', 14, 'dpasrv:hsh', 12, 'dpasrv:nwf', 11, 'dpasrv:wap-brb', 11, 'dpasrv:aht', 10, 'dpasrv:san', 10, 'dpasrv:wap-bay', 10, 'dpasrv:hes', 8, 'dpasrv:wap-rhs', 8, 'dpasrv:wap-san', 8, 'dpasrv:wap-aht', 7, 'dpasrv:wap-nwf', 7, 'dpasrv:mbv', 6, 'dpasrv:thg', 6, 'dpasrv:wap-hsh', 5, 'dpasrv:wap-mbv', 5, 'dpasrv:nsb', 4, 'dpasrv:wap-hes', 4, 'dpasrv:wap-thg', 4, 'dpasrv:wap-nsb', 1]



#### Aggregation of dpa news on category 'dpaKeywords'

In [36]:
dpaKeywordsCounts = exec_query(
        {
            'q': 'entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump"',
            'fq': 'sourceId:dpa',
            'rows': '0',
            'facet': 'true',
            'facet.field': 'dpaKeywords',
            'facet.limit': '10',
            'facet.missing': 'true',
            'facet.sort': 'count',
            'facet.method': 'enum'
         })
dump(dpaKeywordsCounts)


{
    "facet_counts": {
        "facet_dates": {},
        "facet_fields": {
            "dpaKeywords": [
                "Pr\u00e4sident",
                509,
                "dpa",
                199,
                "Reaktionen",
                157,
                "Tagesvorschau",
                99,
                "Wahlen",
                82,
                "Audio",
                79,
                "Medien",
                34,
                "USA",
                21,
                "dpa-Morgenlage",
                11,
                "Abendvorschau",
                8,
                null,
                749
            ]
        },
        "facet_ranges": {},
        "facet_intervals": {},
        "facet_queries": {}
    },
    "response": {
        "docs": [],
        "start": 0,
        "maxScore": 4.8679953,
        "numFound": 1577
    },
    "responseHeader": {
        "params": {
            "facet.missing": "true",
            "fq": "sourceId:dpa",
        


Bar chart of the dpa keywords for selected news "entityLabels:"Hillary Clinton" OR entityLabels:"Donald Trump"".


In [38]:
from nvd3 import discreteBarChart

dpaKeywordsCountList = dpaKeywordsCounts['facet_counts']['facet_fields']['dpaKeywords']
dpaKeywordsCountList = dpaKeywordsCountList[0:-2]
print("\n" + str(dpaKeywordsCountList) + "\n")

name = 'Top ranking dpa keywords in selected news'
dpaKeywordsCountsChart = discreteBarChart(name=name, color_category='category20c', height=400, width=400)
dpaKeywordsCountsChart.set_containerheader("\n\n<h3>" + name + "</h3>\n\n")

dpaKeywordsCountsChart.add_serie(y=list(reversed(dpaKeywordsCountList[1::2])), x=list(reversed(dpaKeywordsCountList[::2])))
dpaKeywordsCountsChart



['Präsident', 509, 'dpa', 199, 'Reaktionen', 157, 'Tagesvorschau', 99, 'Wahlen', 82, 'Audio', 79, 'Medien', 34, 'USA', 21, 'dpa-Morgenlage', 11, 'Abendvorschau', 8]

