CQP query syntax #193

Closed
emilyf413 opened this issue Jul 12, 2021 · 3 comments

Comments

@emilyf413

Hi everyone,

I'm just beginning to work with polmineR and trying to search for articles containing at least one word from 3 groups of words. I'm struggling to write this syntax properly and Google searches haven't helped me so far.

Here's what I want, with the query pulling articles containing at least one word from each of the 3 word groups:

kwic(faz, query = [one group of words] & [another group of words] & [third group of words])

And here's a simplified version of what I have ("faz" is the corpus):

kwic(faz, query='[word = ".*[aA]usländer.*" | word = ".*[mM]igration.*"] & [word = "Deutschland" | word = "Bundesrepublik" | word = "BRD"] & [word = ".*[iI]ntegration.*" | word = ".*[aA]bschieb.*"]', cqp=TRUE)

But I don't think this is correct, as the number of articles that come up isn't changing when I add more words to the search string. So I think it's just pulling all the articles that have at least one of the words in the whole search string.

Can anyone provide some suggestions? Thank you so much!

Best,
Emily

@ablaette
Collaborator

Using '&' as a logical operator between token expressions does not work in CQP syntax. You need to express the alternatives with regular expression syntax instead. Something like this:

k <- kwic(faz, query = '"(.*[aA]usländer.*|.*[mM]igration.*|Deutschland|.*[iI]ntegration.*|.*[aA]bschieb.*)"')

Yet I do not think this is exactly what you had in mind. Maybe explain your objective in plain words first.

A valuable reference is the CQP tutorial: http://cwb.sourceforge.net/files/CQP_Tutorial.pdf

@mxi-hug

mxi-hug commented Jul 22, 2021

Hey Emily,

if you're still looking for a solution, try this one:

kwic(faz, query = '"(.*[aA]usländer.*|.*[mM]igration.*)" []{0,4} "(Deutschland|Bundesrepublik|BRD)" []{0,4} "(.*[iI]ntegration.*|.*[aA]bschieb.*)"', cqp = TRUE)

A query for multiple words in CQP works like query = '"word1" "word2" "word3"'.
Consequently, you can replace a word with a regex containing an OR operator to build groups of words: query = '"(word1|word2|word3)" "(word4|word5|word6)"'

For flexibility, I added the expression []{0,4}, which allows up to 4 tokens between the words of interest. This can make sense depending on your research interest, but it could also add some noise to your results. You will have to try out whether it's useful in your case.
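The pattern described above (one quoted regex per word group, joined by a gap expression) can also be assembled programmatically. The following is a minimal sketch in base R; `build_cqp_query` and its `max_gap` parameter are hypothetical names introduced here for illustration and are not part of polmineR.

```r
# Hypothetical helper (not part of polmineR): build a CQP query string
# from groups of regular expressions.
build_cqp_query <- function(groups, max_gap = 4L) {
  # Join each group's alternatives with "|" and wrap them as one quoted token pattern.
  tokens <- vapply(
    groups,
    function(alternatives) sprintf('"(%s)"', paste(alternatives, collapse = "|")),
    character(1)
  )
  # Allow between 0 and max_gap arbitrary tokens between consecutive groups.
  gap <- sprintf(" []{0,%d} ", max_gap)
  paste(tokens, collapse = gap)
}

# The three word groups from the original question.
groups <- list(
  c(".*[aA]usländer.*", ".*[mM]igration.*"),
  c("Deutschland", "Bundesrepublik", "BRD"),
  c(".*[iI]ntegration.*", ".*[aA]bschieb.*")
)

query <- build_cqp_query(groups, max_gap = 4L)
cat(query, "\n")
```

The resulting string is the same query as in the kwic() call above, so it could be passed as kwic(faz, query = query, cqp = TRUE). Note that a CQP token sequence is ordered: this query only matches when the groups occur in the given order, which is one reason it is not equivalent to "at least one word from each group anywhere in the article".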

In any case, it is very helpful to check which expressions from the corpus are found by your CQP query. For a quick overview, try

count(faz, query = '"(.*[aA]usländer.*|.*[mM]igration.*)" []{0,4} "(Deutschland|Bundesrepublik|BRD)" []{0,4} "(.*[iI]ntegration.*|.*[aA]bschieb.*)"', cqp = TRUE, breakdown = TRUE)

which returns a count of the matches in the corpus, broken down by the matched expressions.

On a general note: depending on your research, it often makes sense not to make your cqp-expression too complex but to split it up into multiple but simpler expressions, especially when you are working with greedy expressions like .* and have to deal with many hits.

Best,
Max

@emilyf413
Author

Hi @mxi-hug,

Thank you for your feedback on this! I ended up splitting the query into smaller and simpler expressions, as I'm searching for entire articles that contain at least one word from each set of words. So I could have used the first option you provided and put in the expression []{0,100} or something like that, to provide a big window, but I agree that splitting up into smaller expressions is just easier.

Thanks again!

Best,
Emily
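
The splitting approach described in this last comment can be sketched in base R as a set intersection: run one simple query per word group, collect the IDs of the articles with at least one hit, and keep the articles that appear in all three lists. The article IDs and group names below are made up for illustration; how the IDs are obtained from polmineR is not shown here and depends on the structural attributes of the corpus.

```r
# Hypothetical input: for each word group, the IDs of the articles in which
# a simple query for that group produced at least one hit.
hits_by_group <- list(
  migration   = c("a01", "a02", "a05", "a07"),
  germany     = c("a02", "a03", "a05", "a07", "a09"),
  integration = c("a02", "a04", "a07")
)

# Keep only the articles that appear in every group's hit list.
articles_all_groups <- Reduce(intersect, hits_by_group)
print(articles_all_groups)
```

Unlike a single token-sequence query, this does not depend on word order or on the distance between the words within an article.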
