
Duplicate data #19

Closed
borenstein opened this issue May 31, 2016 · 6 comments
Comments

@borenstein commented May 31, 2016

Hi Mark,

Another issue using the v4 API. The following query returns 126,399 rows. It is expected to return 26,399 rows. As far as I can tell, it's just repeating some of the result rows. The rows that I checked looked to have the correct data.

data.googleAnalyticsR <- google_analytics_4(ga_id, 
                                 dimensions=c('ga:month', "ga:year", "ga:landingPagePath"), 
                                 date_range=c("2015-04-01", "2015-04-30"),
                                 metrics = c('ga:sessions', "ga:bounceRate", "ga:avgSessionDuration", "ga:pageviewsPerSession"),
                                 max=1000000)

When I ran the equivalent queries on RGA and RGoogleAnalytics, I got the expected behavior. These are, I believe, using the v3 API.

tmp.query.list <- Init(start.date = "2015-04-01",
                       end.date = "2015-04-30",
                       dimensions=c('ga:month', "ga:year", "ga:landingPagePath"), 
                       metrics = c('ga:sessions', "ga:bounceRate", "ga:avgSessionDuration", "ga:pageviewsPerSession"),
                       max.results = 1000000,
                       table.id = "ga:XXXXXX")
tmp.query <- QueryBuilder(tmp.query.list)
data.RGoogleAnalytics <- GetReportData(tmp.query, token, split_daywise = F, delay = 0)

data.RGA <- get_ga(profileId = "ga:XXXXXX",
                  dimensions=c('ga:month', "ga:year", "ga:landingPagePath"), 
                  metrics = c('ga:sessions', "ga:bounceRate", "ga:avgSessionDuration", "ga:pageviewsPerSession"),
                  start.date='2015-04-01',
                  end.date='2015-04-30',
                  max=1000000
)

Thanks and let me know if you need any more sleuthing. Happy to help--it's the least I can do.

Best,
David

@MarkEdmondson1234 (Collaborator) commented:
Thanks for this. It must be connected with the v4 batching: the function works out how many fetches to make from the max parameter. Perhaps it will work if max is set to 27,000? Of course, you can't always know the row count in advance, so I will look at the batch-calculation logic to see what's happening.
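The batch calculation Mark mentions lives inside the package, but the idea can be sketched in plain R. This is a hypothetical illustration (not the package's actual internals, and `plan_batches` is a made-up name): a requested `max` is split into v4 pages of up to 10,000 rows each.

```r
# Hypothetical sketch of v4-style batching: split a requested max row
# count into pages of up to 10,000 rows, each with its own offset.
plan_batches <- function(max_rows, page_size = 10000) {
  starts <- seq(0, max_rows - 1, by = page_size)
  data.frame(
    pageToken = starts,                             # row offset of each batch
    pageSize  = pmin(page_size, max_rows - starts)  # rows requested per batch
  )
}

plan_batches(26399)
# 3 batches: offsets 0, 10000, 20000 requesting 10000, 10000, 6399 rows
```

If the offsets or sizes are computed wrongly, the same page can be fetched twice, which would produce exactly the kind of duplicated rows reported above.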

@borenstein (Author) commented:

Hi Mark,

As you suggested, it does the right thing if I set the maximum to 27k. It also works if I set it to 30k or 50k. When I set it to 100k, the query returns 36,399 rows.

Here's the logic behind what I'm doing: I want to retrieve monthly reporting data from a large site that's been using GA since mid-2006. I find that, if I run a single large query, there's very likely to be a server hiccup or some other issue that causes the entire pull to get botched. (It's not unusual to get a 500 error once every 100,000 rows or so.)

So I query one month at a time. The problem is that I don't know how many rows each query will return. Probably fewer than 100k, but I don't want to discover--or worse, fail to discover--that I'm missing some data. So I just set the maximum really high, to effectively turn it off.
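The month-at-a-time strategy described above can be sketched as a loop over month starts, reusing the `google_analytics_4()` call from the first comment. This is an illustrative sketch under those assumptions (it needs an authenticated session and a real `ga_id` to actually run):

```r
# Sketch of querying one month at a time: build the first day of each
# month, derive the month's last day, and fetch each month separately so
# a single server hiccup only costs one small query, not the whole pull.
month_starts <- seq(as.Date("2006-07-01"), as.Date("2016-05-01"), by = "month")

monthly <- lapply(month_starts, function(start) {
  end <- seq(start, by = "month", length.out = 2)[2] - 1  # last day of month
  google_analytics_4(ga_id,
                     date_range = c(start, end),
                     dimensions = c("ga:month", "ga:year", "ga:landingPagePath"),
                     metrics    = c("ga:sessions", "ga:bounceRate",
                                    "ga:avgSessionDuration",
                                    "ga:pageviewsPerSession"),
                     max        = 1000000)
})
result <- do.call(rbind, monthly)  # stack the monthly data frames
```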

Best,
David

@MarkEdmondson1234 (Collaborator) commented:

Hi David,

Have you tried the same query using the v3 google_analytics()?

I think that has more robust batching at the moment(?), and it should deal with the occasional 500 error by backing off and trying again. With that you can set a high max value; anything over 10,000 is fetched in batches anyway.
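The back-off-and-retry behaviour Mark describes is internal to the package; the pattern itself can be sketched as a generic wrapper around any fetch function (this is an illustrative sketch, not googleAnalyticsR's actual code, and `with_retry` is a made-up name):

```r
# Generic back-off-and-retry wrapper: retry a fetch up to `tries` times,
# doubling the wait after each failure, e.g. for an intermittent 500
# response from the API.
with_retry <- function(fetch, tries = 5, wait = 1) {
  for (i in seq_len(tries)) {
    result <- tryCatch(fetch(), error = function(e) e)
    if (!inherits(result, "error")) return(result)  # success: return data
    if (i == tries) stop(result)                    # out of retries: rethrow
    Sys.sleep(wait)
    wait <- wait * 2                                # back-off: 1s, 2s, 4s, ...
  }
}
```

Combined with the month-at-a-time approach, a transient 500 then costs only one retried month rather than a rerun of the entire history.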

@MarkEdmondson1234 (Collaborator) commented:

Dear @biologicaldynamics , thanks for this report.

It turned out to be something a bit ridiculous: R turned 100000 into "1e+05", which the API in turn read as 1. For every batch over 10,000 it added rows 1-10,000 again(!). Ouch. But it should now be fixed, thanks to you 👍
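The coercion behind this bug is easy to reproduce in plain R: by default, large doubles converted to character come out in scientific notation, which an API expecting a plain integer will misread.

```r
# Reproducing the coercion that caused the bug: R's default string
# conversion of a large double uses scientific notation.
as.character(100000)                # "1e+05" -- an API parsing the leading
                                    # integer here would read it as 1

# Forcing fixed notation when building the request avoids the problem:
format(100000, scientific = FALSE)  # "100000"
```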

@MarkEdmondson1234 (Collaborator) commented:

This isn't fixed...

@MarkEdmondson1234 (Collaborator) commented:

Ok, think I got it now.
