Searching for articles

Gregor Leban edited this page Jun 23, 2023 · 41 revisions

In order to search for articles in Event Registry, we provide two classes - QueryArticles and QueryArticlesIter. Both classes can be used to find articles using a set of various types of search conditions.

The QueryArticlesIter class provides an iterator that makes it easy to iterate over all articles that match the search conditions. Alternatively, the QueryArticles class can be used to obtain a broader range of information about the matching articles in various forms. With QueryArticles, the results can include not only the list of articles but also the time distribution of when the articles were published, the distribution of top news sources that wrote the matching articles, the top concepts mentioned in the articles, etc.

The returned information about articles follows the Article data model.

QueryArticlesIter

Example of usage

Before describing the class, here is an example of use that prints the list of articles from New York Times that mention Elon Musk in the title, but don't mention SpaceX anywhere in the article body:

from eventregistry import *
er = EventRegistry(apiKey = YOUR_API_KEY)
q = QueryArticlesIter(
    keywords = "Elon Musk",
    keywordsLoc = "title",
    ignoreKeywords = "SpaceX",
    sourceUri = "nytimes.com")
for article in q.execQuery(er, sortBy = "rel",
        returnInfo = ReturnInfo(articleInfo = ArticleInfoFlags(concepts = True, categories = True)),
        maxItems = 500):
    print(article)

Constructor

QueryArticlesIter is a derived class from QueryArticles. Its constructor can accept the following arguments:

QueryArticlesIter(keywords = None,
    conceptUri = None,
    categoryUri = None,
    sourceUri = None,
    sourceLocationUri = None,
    sourceGroupUri = None,
    authorUri = None,
    locationUri = None,
    lang = None,
    dateStart = None,
    dateEnd = None,
    dateMentionStart = None,
    dateMentionEnd = None,
    keywordsLoc = "body",
    ignoreKeywords = None,
    ignoreConceptUri = None,
    ignoreCategoryUri = None,
    ignoreSourceUri = None,
    ignoreSourceLocationUri = None,
    ignoreSourceGroupUri = None,
    ignoreAuthorUri = None,
    ignoreLocationUri = None,
    ignoreLang = None,
    ignoreKeywordsLoc = "body",
    isDuplicateFilter = "keepAll",
    hasDuplicateFilter = "keepAll",
    eventFilter = "keepAll",
    startSourceRankPercentile = 0,
    endSourceRankPercentile = 100,
    minSentiment = -1,
    maxSentiment = 1,
    dataType = "news",
    requestedResult = None)

Parameters for which you don't specify a value will be ignored. For the query to be valid (i.e., executable by Event Registry), it has to have at least one positive condition (conditions that start with ignore* do not count as positive conditions). The meaning of the arguments is as follows:

  • keywords: find articles that mention the specified keywords. A single keyword/phrase can be provided as a string, multiple keywords/phrases can be provided as a list of strings. Use QueryItems.AND() if all provided keywords/phrases should be mentioned, or QueryItems.OR() if any of the keywords/phrases should be mentioned.
  • conceptUri: find articles where the concept with concept URI is mentioned. A single concept URI can be provided as a string, multiple concept URIs can be provided as a list of strings. Use QueryItems.AND() if all provided concepts should be mentioned, or QueryItems.OR() if any of the concepts should be mentioned. To obtain a concept URI based on a (partial) concept label use EventRegistry.getConceptUri().
  • categoryUri: find articles that are assigned to a particular category. A single category URI can be provided as a string, multiple category URIs can be provided as a list of strings. Use QueryItems.AND() if all provided categories should be mentioned, or QueryItems.OR() if any of the categories should be mentioned. A category URI can be obtained based on a (partial) category name using EventRegistry.getCategoryUri().
  • sourceUri: find articles that were written by a news source sourceUri. If multiple sources should be considered, use QueryItems.OR() to provide a list of sources. Source URI for a given (partial) news source name can be obtained using EventRegistry.getNewsSourceUri().
  • sourceLocationUri: find articles that were written by news sources located in the given geographic location. If multiple source locations are provided, then put them into a list inside QueryItems.OR(). Location URI can either be a city or a country. Location URI for a given (partial) name can be obtained using EventRegistry.getLocationUri().
  • sourceGroupUri: find articles that were written by news sources that are assigned to the specified source group(s). If multiple source groups are provided, then put them into a list inside QueryItems.OR(). Source group URI for a given name can be obtained using EventRegistry.getSourceGroupUri().
  • authorUri: find articles that were written by a specific author. If multiple authors should be considered, use QueryItems.AND() if provided authors should be joint authors of the same articles, or QueryItems.OR() if you want to find articles that were written by any of the provided authors. To obtain the author URI based on a (partial) author name and potential source domain name use EventRegistry.getAuthorUri().
  • locationUri: find articles that describe an event that occurred at a particular location. An article will be associated with that location if it's mentioned in the dateline. Location URI can either be a city or a country. If multiple locations are provided, resulting articles have to match any of the locations. Location URI for a given name can be obtained using EventRegistry.getLocationUri().
  • lang: find articles that are written in the specified language. If more than one language is specified, resulting articles have to be written in any of the languages. Specify the value as string or list in QueryItems.OR(). See supported languages for the list of language codes to use.
  • dateStart: find articles that were written on or after dateStart. The date should be provided as a string in YYYY-MM-DD format, or as a datetime.date or datetime.datetime instance.
  • dateEnd: find articles that were written before or on dateEnd. The date should be provided as a string in YYYY-MM-DD format, or as a datetime.date or datetime.datetime instance.
  • dateMentionStart: find articles that explicitly mention a date that is equal or greater than dateMentionStart.
  • dateMentionEnd: find articles that explicitly mention a date that is lower or equal to dateMentionEnd.
  • keywordsLoc: where should we look when searching using the keywords provided by keywords parameter. "body" (default), "title", or "body,title"
  • ignoreKeywords: ignore articles that mention the provided keywords. Specify the value as string or list in QueryItems.OR() or QueryItems.AND().
  • ignoreConceptUri: ignore articles that mention the provided concepts. Specify the value as string or list in QueryItems.OR() or QueryItems.AND().
  • ignoreCategoryUri: ignore articles that are about the provided set of categories. Specify the value as string or list in QueryItems.OR() or QueryItems.AND().
  • ignoreSourceUri: ignore articles which have been written by the specified list of news sources. Specify the value as string or list in QueryItems.OR().
  • ignoreSourceLocationUri: ignore articles which have been written by news sources located at the specified geographic location(s). Specify the value as string or list in QueryItems.OR().
  • ignoreSourceGroupUri: ignore articles which have been written by the news sources assigned to the specified source groups. Specify the value as string or list in QueryItems.OR().
  • ignoreAuthorUri: ignore articles that were written by one or more provided authors. Specify the value as string or list in QueryItems.OR() or QueryItems.AND().
  • ignoreLocationUri: ignore articles that occurred in any of the provided locations. A location can be a city or a place. Specify the value as string or list in QueryItems.OR().
  • ignoreLang: ignore articles that are written in any of the provided languages. See supported languages for the list of language codes to use.
  • ignoreKeywordsLoc: where should we look when searching using the keywords provided by the ignoreKeywords parameter. "body" (default), "title", or "body,title"
  • isDuplicateFilter: some articles can be duplicates of other articles. What should be done with them? Possible values are: "skipDuplicates" (skip the resulting articles that are duplicates of other articles); "keepOnlyDuplicates" (return only the duplicate articles); "keepAll" (no filtering, default).
  • hasDuplicateFilter: some articles are later copied by others. What should be done with such articles? Possible values are: "skipHasDuplicates" (skip the resulting articles that have been later copied by others); "keepOnlyHasDuplicates" (return only the articles that have been later copied by others); "keepAll" (no filtering, default).
  • eventFilter: some articles describe a known event and some don't. This filter allows you to filter the resulting articles based on this criterion. Possible values are: "skipArticlesWithoutEvent" (skip articles that are not describing any known event in ER); "keepOnlyArticlesWithoutEvent" (return only the articles that are not describing any known event in ER); "keepAll" (no filtering, default).
  • startSourceRankPercentile and endSourceRankPercentile: these parameters can be used to restrict the returned articles to those from news sources within a certain ranking range. Sources are ranked according to the global Alexa site ranking. Setting startSourceRankPercentile to 0 and endSourceRankPercentile to 20 would, for example, return only articles from top-ranked news sources that amount to approximately 20% of all matching content. Note: 20 percentiles do not represent 20% of all top sources. The value is used to identify the subset of news sources that generate approximately 20% of our collected news content.
  • minSentiment: what should be the minimum sentiment value of an article? The range is from -1 (very negative) to 1 (very positive). NOTE: By setting a non-default value, the results will automatically be filtered to the English language only, since sentiment can only be computed for the English language.
  • maxSentiment: what should be the maximum sentiment value of an article? The range is from -1 (very negative) to 1 (very positive). NOTE: By setting a non-default value, the results will automatically be filtered to the English language only, since sentiment can only be computed for the English language.
  • dataType: what data types should we search? "news" (news content, default), "pr" (press releases), or "blog". If you want to use multiple data types, put them in an array (e.g. ["news", "pr"]).
  • requestedResult: the information that should be returned as the result of the query. If None, then by default we set RequestArticlesInfo().
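To make the note on startSourceRankPercentile and endSourceRankPercentile more concrete, here is a small self-contained sketch. The source names and article counts are invented, and the function is only an illustration of the idea, not Event Registry's actual ranking computation. It shows why the top 20 percentiles of content can correspond to a much smaller share of the sources.

```python
# Hypothetical sources, listed from best-ranked to worst-ranked,
# with invented counts of collected articles (1000 in total).
ranked_sources = [
    ("big-daily.example", 400),
    ("national-news.example", 250),
    ("regional.example", 150),
    ("weekly.example", 80),
    ("local-a.example", 50),
    ("local-b.example", 40),
    ("niche-a.example", 20),
    ("niche-b.example", 10),
]

def sources_covering(ranked, percent):
    """Return the top-ranked sources whose combined articles reach `percent` of all content."""
    total = sum(count for _, count in ranked)
    target = total * percent / 100.0
    selected, covered = [], 0
    for name, count in ranked:
        if covered >= target:
            break
        selected.append(name)
        covered += count
    return selected

top20 = sources_covering(ranked_sources, 20)
print(top20)  # ['big-daily.example']
```

In this toy data set a single source out of eight already produces more than 20% of the content, so the 0 to 20 percentile range contains far fewer than 20% of the sources.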

When two or more parameters are specified in the constructor, the results are computed so that all conditions are met. For example, if you specify QueryArticlesIter(keywords = "Barack Obama", conceptUri = "http://en.wikipedia.org/wiki/White_House") then the resulting articles will mention the phrase Barack Obama and will be annotated with the concept White House.
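As a rough sketch of how several conditions might be combined, the snippet below mixes QueryItems.AND(), QueryItems.OR() and a date window. The helper last_n_days() is hypothetical (it is not part of the SDK), and the language codes "eng" and "deu" are assumed values that should be checked against the list of supported languages.

```python
import datetime

def last_n_days(n, today=None):
    """Return (dateStart, dateEnd) as YYYY-MM-DD strings covering the last n days."""
    today = today or datetime.date.today()
    start = today - datetime.timedelta(days=n)
    return start.isoformat(), today.isoformat()

def build_query(n_days=7):
    # Imported lazily so the date helper above works without the SDK installed.
    from eventregistry import QueryArticlesIter, QueryItems

    date_start, date_end = last_n_days(n_days)
    return QueryArticlesIter(
        # articles must mention BOTH phrases ...
        keywords=QueryItems.AND(["Barack Obama", "White House"]),
        # ... AND be written in English OR German (assumed language codes) ...
        lang=QueryItems.OR(["eng", "deu"]),
        # ... AND be written inside the date window
        dateStart=date_start,
        dateEnd=date_end,
    )

print(last_n_days(7, datetime.date(2023, 6, 23)))  # ('2023-06-16', '2023-06-23')
```

The object returned by build_query() can then be iterated with execQuery(er, ...) as in the example at the top of this section.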

Creating QueryArticlesIter using static methods

The QueryArticlesIter class can also be initialized in two other ways:

QueryArticlesIter.initWithArticleUriList(uriList) is a static method that can be used to specify the set of article URIs that you want to use as a result. In this case, no query conditions are used and this set is used as the resulting set. All the return information about the articles will be based on this set of articles.

QueryArticlesIter.initWithComplexQuery(query, dataType) is another static method that can be used to create a complex query based on the advanced query language. You can call the method by providing an instance of ComplexArticleQuery class. Alternatively, you can also call the method with a python dictionary or a string containing the JSON object matching the language (see the examples).

Methods

The class has two main methods: count() and execQuery().

count(er) simply returns the number of articles that match the specified conditions. er is the instance of EventRegistry class.

execQuery method has the following format:

execQuery(er,
    sortBy = "rel",
    sortByAsc = False,
    returnInfo = ReturnInfo(),
    maxItems = -1)
  • er: an instance of the EventRegistry class.
  • sortBy: sets the order in which the resulting articles are sorted, before returning. Options: date (publishing date), cosSim (closeness to the centroid of the associated event), rel (relevance to the query), sourceImportance (manually curated score of source importance - high value, high importance), sourceImportanceRank (reverse of sourceImportance), sourceAlexaGlobalRank (global rank of the news source), sourceAlexaCountryRank (country rank of the news source), socialScore (total shares on social media), facebookShares (shares on Facebook only).
  • sortByAsc: should the results be sorted in ascending order.
  • returnInfo: sets the properties of various types of data that is returned (concepts, categories, sources, ...). See details.
  • maxItems: max number of articles to return by the iterator. Use the default (-1) to simply return all matching articles. By setting the value to a positive number, the iterator will stop downloading articles once the number is reached.
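The effect of maxItems can be pictured with a small stand-alone generator. This is only a sketch of how such an iterator could behave, not the SDK's implementation; fetch_page and the fake pages are hypothetical stand-ins for the API calls that execQuery makes internally.

```python
def iterate_articles(fetch_page, max_items=-1):
    """Yield articles page by page, stopping after max_items (-1 means no limit)."""
    returned = 0
    page = 1
    while True:
        articles = fetch_page(page)  # hypothetical callable: page number -> list
        if not articles:
            return
        for article in articles:
            if 0 <= max_items <= returned:
                return
            yield article
            returned += 1
        page += 1

# Fake data source standing in for the API: 3 pages of 2 "articles" each.
fake_pages = {1: ["a1", "a2"], 2: ["a3", "a4"], 3: ["a5", "a6"]}
fetched = []  # records which pages were actually requested

def fake_fetch(page):
    fetched.append(page)
    return fake_pages.get(page, [])

print(list(iterate_articles(fake_fetch, max_items=3)))  # ['a1', 'a2', 'a3']
print(fetched)  # [1, 2]
```

Because the generator stops as soon as the limit is reached, the third page is never requested. This is the practical benefit of setting maxItems when only a sample of the matching articles is needed.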

QueryArticles

Example of usage

Before describing the QueryArticles() class and the details that can be requested, let's look at an example of its usage:

import datetime
from eventregistry import *
er = EventRegistry(apiKey = YOUR_API_KEY)
q = QueryArticles(
    # set the date limit of interest
    dateStart = datetime.date(2014, 4, 16), dateEnd = datetime.date(2014, 4, 28),
    # find articles mentioning the company Apple
    conceptUri = er.getConceptUri("Apple"))
# return the first page of up to 100 articles, including the concepts, categories and article image
q.setRequestedResult(RequestArticlesInfo(page = 1, count = 100,
    returnInfo = ReturnInfo(articleInfo = ArticleInfoFlags(concepts = True, categories = True, image = True))))
res = er.execQuery(q)

The returned information about articles in res follows the Article data model.

Constructor

The QueryArticles constructor accepts the same arguments as QueryArticlesIter.

Creating QueryArticles using static methods

The QueryArticles class can also be initialized in two other ways:

QueryArticles.initWithArticleUriList() is a static method that can be used to specify the set of article URIs that you want to use as the result. In this case, no query conditions are used and this set is used as the resulting set. All the return information about the articles will be based on this set of articles.

QueryArticles.initWithComplexQuery() is another static method that can be used to create a complex query based on the advanced query language. You can call the method by providing an instance of ComplexArticleQuery class. Alternatively, you can also call the method with a python dictionary or a string containing the JSON object matching the language (see the examples).

Returned information

Executing the query produces a set of articles that match the specified criteria. You still need to decide what information about these articles should be returned. Do you want to get the details about these articles? Are you interested in the top concepts mentioned in them? Maybe news sources?

The information to be returned about the matching articles is set by calling the setRequestedResult() method. The setRequestedResult() accepts as an argument an instance that has a base class RequestArticles. Below are the classes that can be specified in the setRequestedResult() calls:

RequestArticlesInfo

RequestArticlesInfo(page = 1,
    count = 100,
    sortBy = "date", sortByAsc = False,
    returnInfo = ReturnInfo())

RequestArticlesInfo class provides detailed information about the resulting articles.

  • page: determines the page of the results to return (starting from 1).
  • count: determines the number of articles to return. Max articles that can be returned per call is 100.
  • sortBy: sets the order in which the resulting articles are sorted, before returning. Options: date (publishing date), cosSim (closeness to the centroid of the associated event), rel (relevance to the query), sourceImportance (manually curated score of source importance - high value, high importance), sourceImportanceRank (reverse of sourceImportance), sourceAlexaGlobalRank (global rank of the news source), sourceAlexaCountryRank (country rank of the news source), socialScore (total shares on social media), facebookShares (shares on Facebook only).
  • sortByAsc: should the results be sorted in ascending order.
  • returnInfo: sets the properties of various types of data that is returned (concepts, categories, news sources, ...). See details.
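Since a single call returns at most 100 articles, retrieving more requires several calls with increasing page values. The helper below is a hypothetical sketch (page_plan is not part of the SDK) that computes which (page, count) pairs would be needed; the commented lines show how it could be combined with RequestArticlesInfo.

```python
import math

MAX_PER_CALL = 100  # per-call limit stated for RequestArticlesInfo

def page_plan(wanted, per_call=MAX_PER_CALL):
    """Return the (page, count) pairs needed to retrieve `wanted` articles."""
    per_call = min(per_call, MAX_PER_CALL)
    pages = math.ceil(wanted / per_call)
    return [(page, per_call) for page in range(1, pages + 1)]

print(page_plan(250))  # [(1, 100), (2, 100), (3, 100)]

# Possible use with the SDK (needs an API key, shown for orientation only):
# for page, count in page_plan(250):
#     q.setRequestedResult(RequestArticlesInfo(page=page, count=count))
#     res = er.execQuery(q)
```

The last page may contain more articles than are still needed, so the caller would trim the combined list to the wanted size. When all matching articles are needed, QueryArticlesIter is usually the simpler choice.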

RequestArticlesUriWgtList

RequestArticlesUriWgtList(page = 1,
    count = 10000,
    sortBy = "fq", sortByAsc = False)

RequestArticlesUriWgtList returns a simple list of the URIs of the articles that match the criteria, together with their weights (used for sorting the results), in the format {uri}:{wgt}. Useful if you wish to obtain the full list of article URIs in a single query.

  • page: determines the page of the results to return (starting from 1).
  • count: determines the number of results to return in a single call. Max results per call is 50,000.
  • sortBy: sets the order in which the resulting articles are sorted, before returning. Options: date (publishing date), cosSim (closeness to the centroid of the associated event), rel (relevance to the query), sourceImportance (manually curated score of source importance - high value, high importance), sourceImportanceRank (reverse of sourceImportance), sourceAlexaGlobalRank (global rank of the news source), sourceAlexaCountryRank (country rank of the news source), socialScore (total shares on social media), facebookShares (shares on Facebook only).
  • sortByAsc: should the results be sorted in ascending order
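Items in the {uri}:{wgt} format can be split back into their two parts with a few lines of plain Python. The sample values below are invented, and treating the weight as an integer is an assumption; adjust the conversion if the weights you receive are not whole numbers.

```python
def parse_uri_wgt(items):
    """Split "{uri}:{wgt}" strings into (uri, weight) tuples."""
    parsed = []
    for item in items:
        # Split on the LAST colon in case the URI itself contains one.
        uri, _, wgt = item.rpartition(":")
        parsed.append((uri, int(wgt)))
    return parsed

# Invented values for illustration; real URIs and weights will differ.
sample = ["7001234:392", "7005678:118", "pr:7009999:57"]
pairs = parse_uri_wgt(sample)
print(pairs)  # [('7001234', 392), ('7005678', 118), ('pr:7009999', 57)]
```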

RequestArticlesTimeAggr

RequestArticlesTimeAggr()

RequestArticlesTimeAggr returns information about how the articles are distributed over time. The constructor does not accept any additional parameters.

RequestArticlesConceptAggr

RequestArticlesConceptAggr(conceptCount=25,
    conceptCountPerType = None,
    conceptScoring = "importance",
    articlesSampleSize = 10000,
    returnInfo = ReturnInfo())

RequestArticlesConceptAggr returns a list of top concepts that are mentioned the most in the resulting articles.

  • conceptCount: number of top concepts to return (at most 500).
  • conceptCountPerType: if you wish to limit the number of top concepts per type (person, org, loc, wiki) then set this to some number. If you want to get an equal number of concepts for each type then set conceptCountPerType to conceptCount/4 (since there are 4 concept types).
  • conceptScoring: how should the top concepts be computed. Possible values are "importance" (takes into account how frequently a concept is mentioned and how relevant it is in an article); "frequency" (ranks the concepts simply by how frequently the concept is mentioned in the results); "uniqueness" (computes what are the top concepts that are frequently mentioned in the results of your search query but less frequently mentioned in the news in general).
  • articlesSampleSize: on what sample of results should the aggregate be computed (at most 20,000).
  • returnInfo: what details about the concepts should be included in the returned information.

RequestArticlesSourceAggr

RequestArticlesSourceAggr(articlesSampleSize = 20000,
    sourceCount = 50,
    normalizeBySourceArts = False,
    returnInfo = ReturnInfo())

RequestArticlesSourceAggr provides a list of top news sources that have written the most articles in the results.

  • articlesSampleSize: on what sample of results should the aggregate be computed (at most 1,000,000)
  • sourceCount: the number of top sources to return
  • normalizeBySourceArts: some sources generate significantly more content than others, which is why they can appear as a top source for a given query. If you want to normalize and sort the sources by the total number of articles that they have published, set this to True. The top sources will then be those that may publish less content overall, but whose content is more focused on the searched query.
  • returnInfo: what details about the sources should be included in the returned information.
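The effect of normalizeBySourceArts can be illustrated with a toy calculation. All names and numbers below are invented, and dividing the number of matching articles by the source's total output is an assumed, simplified form of the normalization; the actual scoring used by Event Registry may differ.

```python
# Invented numbers: matching articles per source, and each source's total output.
matching = {
    "wire-agency.example": 400,
    "tech-blog.example": 90,
    "daily.example": 120,
}
total_published = {
    "wire-agency.example": 100000,
    "tech-blog.example": 300,
    "daily.example": 12000,
}

def rank_sources(matching, totals, normalize=False):
    """Rank sources by raw match count, or by the share of their output that matches."""
    def score(source):
        if normalize:
            return matching[source] / totals[source]
        return matching[source]
    return sorted(matching, key=score, reverse=True)

print(rank_sources(matching, total_published))
# ['wire-agency.example', 'daily.example', 'tech-blog.example']
print(rank_sources(matching, total_published, normalize=True))
# ['tech-blog.example', 'daily.example', 'wire-agency.example']
```

Without normalization the high-volume wire agency comes first; with normalization the small blog that writes mostly about the topic moves to the top.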

RequestArticlesCategoryAggr

RequestArticlesCategoryAggr(articlesSampleSize = 20000,
    returnInfo = ReturnInfo())

RequestArticlesCategoryAggr returns information about which categories the resulting articles belong to.

  • articlesSampleSize: on what sample of results should the aggregate be computed (at most 50,000)
  • returnInfo: what details about the categories should be included in the returned information

RequestArticlesKeywordAggr

RequestArticlesKeywordAggr(articlesSampleSize = 500)

RequestArticlesKeywordAggr returns the keywords that best summarize the resulting articles.

  • articlesSampleSize: the sample size of articles on which to compute the keywords. Maximum 20,000.

RequestArticlesConceptGraph

RequestArticlesConceptGraph(conceptCount = 25,
    linkCount = 50,
    articlesSampleSize = 10000,
    skipQueryConcepts = True,
    returnInfo = ReturnInfo())

RequestArticlesConceptGraph returns a graph of concepts. Concepts are connected if they frequently occur in the same articles.

  • conceptCount: number of top concepts (nodes) to return
  • linkCount: number of edges in the graph
  • articlesSampleSize: on what sample of articles should the graph be computed
  • skipQueryConcepts: should the concepts that were used in the search be ignored from the results
  • returnInfo: the details about the types of returned data to include. See details.

RequestArticlesConceptMatrix

RequestArticlesConceptMatrix(conceptCount = 25,
    measure = "pmi",
    articlesSampleSize = 500,
    returnInfo = ReturnInfo())

RequestArticlesConceptMatrix computes a matrix of concepts and their dependencies. For individual concept pairs, it returns how frequently they co-occur in the resulting articles and how "surprising" this is, based on the frequency of individual concepts.

  • conceptCount: the number of concepts on which to compute the matrix
  • measure: the measure to be used for computing the "surprise factor". Options: pmi (point-wise mutual information), pairTfIdf (pair frequency * IDF of individual concepts), chiSquare.
  • articlesSampleSize: on what sample of articles should the matrix be computed
  • returnInfo: the details about the types of returned data to include. See details.
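For readers unfamiliar with the "pmi" measure, the sketch below computes the textbook definition of point-wise mutual information from article counts: the logarithm of p(x,y) / (p(x) * p(y)). The counts are invented, the base-2 logarithm is a choice made here, and the exact formula used by Event Registry is not documented on this page and may differ.

```python
import math

def pmi(n_xy, n_x, n_y, n_total):
    """Point-wise mutual information of two concepts from article counts.

    n_xy: articles mentioning both concepts; n_x, n_y: articles mentioning each;
    n_total: size of the article sample.
    """
    # p(x,y) / (p(x) * p(y)) simplifies to n_xy * n_total / (n_x * n_y)
    ratio = (n_xy * n_total) / (n_x * n_y)
    return math.log2(ratio)

# Invented counts over a sample of 500 articles.
print(pmi(n_xy=40, n_x=50, n_y=100, n_total=500))  # 2.0, co-occur 4x more than chance
print(pmi(n_xy=10, n_x=50, n_y=100, n_total=500))  # 0.0, exactly what chance predicts
```

A positive value means the pair appears together more often than the individual frequencies would suggest, which is the "surprise" the matrix is meant to capture.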

RequestArticlesConceptTrends

RequestArticlesConceptTrends(conceptUris = None,
    conceptCount = 10,
    articlesSampleSize=10000,
    returnInfo = ReturnInfo())

RequestArticlesConceptTrends provides a list of the most popular concepts in the results and how they trend daily over time.

  • conceptUris: custom list of concepts for which we want to identify how much they are trending over time in the list of results that match the query. If None, then top concepts will be automatically computed.
  • conceptCount: if the conceptUris are not provided, what should be the number of automatically determined concepts to return (at most 50).
  • articlesSampleSize: on what sample of results should the aggregate be computed (at most 50,000)
  • returnInfo: the details about the types of returned data to include. See details.

RequestArticlesDateMentionAggr

RequestArticlesDateMentionAggr()

RequestArticlesDateMentionAggr provides information about the dates that have been found mentioned in the resulting articles. The class does not accept any additional arguments.

RequestArticlesRecentActivity

RequestArticlesRecentActivity(maxArticleCount = 100,
    updatesAfterTm = None,
    updatesAfterMinsAgo = None,
    updatesUntilTm = None,
    updatesUntilMinsAgo = None,
    lang = None,
    mandatorySourceLocation = False,
    returnInfo = ReturnInfo())

RequestArticlesRecentActivity is to be used to get the articles that match the particular set of search conditions and were added to Event Registry after the specified time.

  • maxArticleCount: maximum number of articles to return (max 2000). If more than 100 articles are requested, a correspondingly higher number of tokens will be used with a single call.
  • updatesAfterTm: starting time after which the resulting articles should be imported into Event Registry. Specify a datetime instance or a string in format 'YYYY-MM-DDTHH:MM:SS.SSSS' that represents time in ISO format. When making consecutive calls, you can use value currTime returned from a previous call.
  • updatesAfterMinsAgo: instead of specifying updatesAfterTm you can also simply ask for content that was published within the last given number of minutes. You can use this if you are calling the API at regular time intervals.
  • updatesUntilTm: ending time before which the resulting articles should be imported into Event Registry. Specify a datetime instance or a string in format 'YYYY-MM-DDTHH:MM:SS.SSSS' that represents time in ISO format.
  • updatesUntilMinsAgo: instead of specifying the updatesUntilTm you can also simply ask to get content that was published before this number of minutes ago.
  • lang: limit resulting articles to the single language or a list of valid languages.
  • mandatorySourceLocation: should we return just the articles that have a known source location?
  • returnInfo: the details about the types of returned data to include. See details.

Advanced query language

For many users, simply providing a list of concepts, keywords, sources etc. is not sufficient and a more complex way of specifying a query is required. For such purposes, we provide a query language where conditions are specified in a JSON object that resembles the query language used by MongoDB. The grammar for the language is as follows:

ComplexArticleQuery ::=
{
    "$query": CombinedQuery | BaseQuery,

    "$filter": null | {
        "dataType": null | "news" | "blog" | "pr" | ["news", ...],
        "isDuplicate": null | "keepAll" | "skipDuplicates" | "keepOnlyDuplicates",
        "hasDuplicate": null | "keepAll" | "skipHasDuplicates" | "keepOnlyHasDuplicates",
        "hasEvent": null | "keepAll" | "skipArticlesWithoutEvent" | "keepOnlyArticlesWithoutEvent",
        "startSourceRankPercentile": null | 0 | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90,
        "endSourceRankPercentile": null | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100,
        "minSentiment": null | float([-1,1]),
        "maxSentiment": null | float([-1,1])
    }
}

CombinedQuery ::=
{
    "$or": [ CombinedQuery | BaseQuery, ... ],
    "$not": null | CombinedQuery | BaseQuery
}

CombinedQuery ::=
{
    "$and": [ CombinedQuery | BaseQuery, ... ],
    "$not": null | CombinedQuery | BaseQuery
}

BaseQuery ::=
{
    "conceptUri": null | string | { "$or": [ string, ... ]} | { "$and": [ string, ... ]},
    "keyword": null | string | { "$or": [ string, ... ]} | { "$and": [ string, ... ]},
    "categoryUri": null | string | { "$or": [ string, ... ]} | { "$and": [ string, ... ]},
    "lang": null | string | { "$or": [ string, ... ]} | { "$and": [ string, ... ]},

    "sourceUri": null | string | { "$or": [ string, ... ]},
    "sourceLocationUri": null | string | { "$or": [ string, ... ]},
    "sourceGroupUri": null | string | { "$or": [ string, ... ]},

    "authorUri": null | string | { "$or": [ string, ... ]} | { "$and": [ string, ... ]},

    "locationUri": null | string | { "$or": [ string, ... ]},

    "dateStart": null | string,
    "dateEnd": null | string,
    "dateMention": null | [string, ... ],

    "keywordLoc": null | "body" | "title" | "title,body",
    "keywordSearchMode": null | "simple" | "exact" | "phrase",

    "minArticlesInEvent": null | int,
    "maxArticlesInEvent": null | int,

    "$not": null | CombinedQuery | BaseQuery
}

Explanation

Each complex article query needs to be a JSON object that has a $query key. The $query key must contain another JSON object that should be parsable as a CombinedQuery or a BaseQuery. A CombinedQuery can be used to specify a list of conditions, where all ($and) or any ($or) conditions should hold. The CombinedQuery can also contain a $not key containing another CombinedQuery or BaseQuery defining the results that should be excluded from the results computed by the $and or $or conditions.

The BaseQuery represents a JSON object with actual conditions to search for. These (positive) conditions can include concepts, keywords, categories, sources, authors, etc. to search for. If multiple conditions are specified, for example, a conceptUri as well as a sourceUri, then results will have to match all the conditions. The BaseQuery can also contain the $not key specifying results to exclude from the results matching the positive conditions of the BaseQuery. A BaseQuery containing only the $not key is not a valid query (since it has no positive conditions to exclude results from).
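The rule about positive conditions can be checked before a query is sent. The function below is a hypothetical client-side check written for this page, not part of the SDK. It assumes that "$not", "keywordLoc" and "keywordSearchMode" do not count as positive conditions and that every sub-query of a CombinedQuery must be valid on its own; the server-side validation may be stricter or more lenient.

```python
# Keys that only modify or exclude; treating them as non-positive is an assumption.
NON_POSITIVE_KEYS = {"$not", "keywordLoc", "keywordSearchMode"}

def has_positive_condition(query):
    """Return True if the query dict (BaseQuery or CombinedQuery) can stand on its own."""
    for operator in ("$and", "$or"):
        if operator in query:
            parts = query[operator]
            # Every sub-query of a CombinedQuery must itself be valid.
            return bool(parts) and all(has_positive_condition(p) for p in parts)
    return any(
        key not in NON_POSITIVE_KEYS and value is not None
        for key, value in query.items()
    )

only_excludes = {"$not": {"keyword": "SpaceX"}}
with_concept = {
    "conceptUri": "http://en.wikipedia.org/wiki/Elon_Musk",
    "$not": {"keyword": "SpaceX"},
}

print(has_positive_condition(only_excludes))  # False
print(has_positive_condition(with_concept))   # True
print(has_positive_condition({"$or": [with_concept, only_excludes]}))  # False
```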

The complex article query can also have a $filter object. It can contain information about which data types to search for - dataType property - which can be a simple string value or an array of values. You can also specify what should happen with articles that are (a) duplicates of other articles, (b) were later copied by other sources, and (c) are describing an event. Finally, you can also set the startSourceRankPercentile and endSourceRankPercentile filters to limit the results to certain subsets of news sources based on global Alexa ranking. By setting startSourceRankPercentile to 0 and endSourceRankPercentile to 20 you can limit the results to top-ranking news sources so that the resulting number of articles will be approximately 20% of all content that would be received without the filter.
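As an illustration of the $filter object, the sketch below builds one and checks the percentile values against those listed in the grammar. make_filter is a hypothetical helper, and the requirement that the start percentile be lower than the end percentile is an assumption rather than something stated in the grammar.

```python
import json

def make_filter(start_percentile=0, end_percentile=100, data_type="news"):
    """Build a "$filter" object, checking the percentile values allowed by the grammar."""
    if start_percentile not in range(0, 100, 10):
        raise ValueError("startSourceRankPercentile must be one of 0, 10, ..., 90")
    if end_percentile not in range(10, 101, 10):
        raise ValueError("endSourceRankPercentile must be one of 10, 20, ..., 100")
    if start_percentile >= end_percentile:
        # Assumed constraint: an empty or inverted range makes no sense.
        raise ValueError("start percentile must be below end percentile")
    return {
        "dataType": data_type,
        "startSourceRankPercentile": start_percentile,
        "endSourceRankPercentile": end_percentile,
    }

# Search news and blogs, limited to top-ranked sources (about 20% of content).
query = {
    "$query": {"keyword": "machine learning"},
    "$filter": make_filter(0, 20, data_type=["news", "blog"]),
}
print(json.dumps(query, indent=4))
```

The resulting dictionary (or its JSON string) has the shape expected by initWithComplexQuery().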

Using this language you can specify queries that are not possible to express using the constructor parameters in QueryArticles or QueryArticlesIter.

Examples

Here are some examples of queries and what they would return:

A query that would return the list of articles that mention concept AI or phrases deep learning or machine learning, and none of the results mention phrase data mining in the article title:

{
    "$query": {
        "$or": [
            { "conceptUri": "http://en.wikipedia.org/wiki/Artificial_Intelligence" },
            {
                "keyword": {
                    "$or": [ "deep learning", "machine learning" ]
                }
            }
        ],
        "$not": {
            "keyword": "data mining",
            "keywordLoc": "title"
        }
    }
}

A query that would return the list of politics related articles about Donald Trump and Hillary Clinton, or business-related news that mention Elon Musk:

{
    "$query": {
        "$or": [
            {
                "conceptUri": {
                    "$and": [
                        "http://en.wikipedia.org/wiki/Donald_Trump",
                        "http://en.wikipedia.org/wiki/Hillary_Rodham_Clinton"
                    ]
                },
                "categoryUri": "dmoz/Society/Politics"
            },
            {
                "conceptUri": "http://en.wikipedia.org/wiki/Elon_Musk",
                "categoryUri": "dmoz/Business"
            }
        ]
    }
}

Depending on your preference, you can build such JSONs for these complex queries yourself or you can use the associated classes such as ComplexArticleQuery(), CombinedQuery() and BaseQuery(). Below is an example where we search for articles that are either about Donald Trump or are in the Politics category, but that were not published in February 2017 and do not mention Barack Obama:

from eventregistry import *
er = EventRegistry()
trumpUri = er.getConceptUri("Trump")
obamaUri = er.getConceptUri("Obama")
politicsUri = er.getCategoryUri("politics")
cq = ComplexArticleQuery(
    query = CombinedQuery.OR(
        [
            BaseQuery(conceptUri = trumpUri),
            BaseQuery(categoryUri = politicsUri)
        ],
        # exclude articles published in February 2017 or mentioning Barack Obama
        exclude = CombinedQuery.OR([
            BaseQuery(dateStart = "2017-02-01", dateEnd = "2017-02-28"),
            BaseQuery(conceptUri = obamaUri)])
    )
)
q = QueryArticles.initWithComplexQuery(cq)
q.setRequestedResult(RequestArticlesInfo())
res = er.execQuery(q)
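Following the grammar above, the same query could presumably be written by hand as a plain dictionary. The sketch below is derived from the grammar, not captured from the output of the classes, and the concept and category URIs are assumed values that should be looked up with er.getConceptUri() and er.getCategoryUri() in practice.

```python
import json

# Assumed URIs for illustration only.
trump_uri = "http://en.wikipedia.org/wiki/Donald_Trump"
obama_uri = "http://en.wikipedia.org/wiki/Barack_Obama"
politics_uri = "dmoz/Society/Politics"

complex_query = {
    "$query": {
        # positive part: about Donald Trump OR in the Politics category
        "$or": [
            {"conceptUri": trump_uri},
            {"categoryUri": politics_uri},
        ],
        # excluded part: published in February 2017 OR mentioning Barack Obama
        "$not": {
            "$or": [
                {"dateStart": "2017-02-01", "dateEnd": "2017-02-28"},
                {"conceptUri": obama_uri},
            ]
        },
    }
}

# The same query as a JSON string, the other form initWithComplexQuery() accepts.
complex_query_json = json.dumps(complex_query)
print(complex_query_json)
```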

If you've built the Python dictionary with the query yourself, you can also use it like this:

er = EventRegistry()
q = QueryArticles.initWithComplexQuery({"$query": { ... } })
q.setRequestedResult(RequestArticlesInfo())
res = er.execQuery(q)

In this case, you need to make sure you're providing valid query parameters in the provided dictionary.

Similarly, if you would like to use the QueryArticlesIter to quickly iterate over the results, you can also use the initWithComplexQuery method like this:

er = EventRegistry()
q = QueryArticlesIter.initWithComplexQuery({ "$query": { ... } })
for article in q.execQuery(er):
    print(article)