org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 50; White spaces are required between publicId and systemId. #800

Closed
ilg-ul opened this issue Jan 16, 2020 · 30 comments

Comments

@ilg-ul
Contributor

ilg-ul commented Jan 16, 2020

Resolution: the problem was caused by a recent change at Keil, which added a redirect from http to https. Java's HttpURLConnection does not follow redirects that switch protocol, so they must be followed manually.

The error message is caused by the SAX parser trying to parse the html returned together with the 302 response.
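To illustrate why an HTML page produces this particular message, here is a minimal sketch. It assumes the 302 body starts with an HTML 4.01 DOCTYPE that has a public identifier but no system identifier; the actual body is not shown in this thread, so `HTML_BODY` and the class name are purely hypothetical. That DOCTYPE line happens to be 50 characters long, which would be consistent with `columnNumber: 50`, but that is suggestive rather than proof.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class HtmlAsXmlDemo {
    // ASSUMPTION: a typical HTML 4.01 redirect page; the real 302 body is not shown in the thread.
    static final String HTML_BODY =
        "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01//EN\">\n"
        + "<html><head><title>Object moved</title></head><body></body></html>";

    /** Parses the text with a default SAX parser; returns the parse error, or null on success. */
    static SAXParseException tryParse(String text) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(text)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            // DefaultHandler rethrows fatal errors, so malformed input ends up here.
            return e;
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParseException e = tryParse(HTML_BODY);
        System.out.println(e == null
            ? "parsed without error"
            : "line " + e.getLineNumber() + ", column " + e.getColumnNumber() + ": " + e.getMessage());
    }
}
```

In XML, a DOCTYPE with `PUBLIC` must also carry a system identifier, which HTML does not require; that mismatch is the likely source of the complaint about publicId and systemId.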


It looks like something changed recently in the index.pidx file, crashing the SAX parser:

Parsing "http://www.keil.com/pack/index.pidx"...
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 50; White spaces are required between publicId and systemId.

The current file reads like:

<?xml version="1.0" encoding="UTF-8" ?> 
<index schemaVersion="1.1.0" xs:noNamespaceSchemaLocation="PackIndex.xsd" xmlns:xs="http://www.w3.org/2001/XMLSchema-instance">
<vendor>Keil</vendor>
<url>http://www.keil.com/pack/</url>
<timestamp>2020-01-14T04:02:51.9611227+00:00</timestamp>
<pindex>
  <pdsc url="http://www.keil.com/pack/" vendor="ARM" name="minar" version="1.0.0" />
  ...
  <pdsc url="http://mcu.holtek.com.tw/pack" vendor="Holtek" name="HT32_DFP" version="1.0.24" />
</pindex>
</index>

I would suspect that the PackIndex.xsd requires a full absolute URL.

@JonatanAntoni
Member

JonatanAntoni commented Jan 17, 2020

Hi @ilg-ul,

thanks for letting us know.

The index file seems to be in sync with the specification in the documentation. There are no such elements like publicId or systemId around. Validating the file against the schema doesn't show any issues.

Cheers,
Jonatan

@edriouk
Collaborator

edriouk commented Jan 17, 2020

Hi Liviu,

We do not face any problem.
Our plug-ins first download index.pidx, then parse it without validating against the schema (the file is generated on the server side and is therefore guaranteed to match the xsd file).
I have googled for the message and the problem seems to be similar to this one:
https://stackoverflow.com/questions/46943878/org-xml-sax-saxparseexception-white-spaces-are-required-between-publicid-and-sy

Best regards,
Evgueni

@JonatanAntoni
Member

One issue could be the redirect that happens when accessing http://www.keil.com/pack/index.pidx. A file download using wget or curl -L does resolve the redirect correctly. But using a different implementation to access web resources might introduce weird effects.

@ilg-ul
Contributor Author

ilg-ul commented Jan 17, 2020

Can you check when the xs:noNamespaceSchemaLocation="PackIndex.xsd" attribute was added to the file? Before this change everything was fine.

The reason for the shown error is that the parser cannot reach the schema file; when using relative paths, like in your case, the schema file is expected to be in the same folder as the parsed file, and it is not, neither at http://www.keil.com/pack/PackIndex.xsd, nor in the local folder if the index is first downloaded locally.

However, the safest way is to use absolute URLs.

I'm not sure this attribute should be present here. It should be present in your development environment to validate the index, but once you make it public you force all parsers to validate the content at each access. Not nice.

~~My first suggestion is to remove this attribute.~~

If you decide to keep this attribute, please publish the schema in a public location and change the attribute to the absolute URL of the schema.

https://www.oreilly.com/library/view/xml-in-a/0596007647/re167.html
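One way to check whether the attribute alone forces validation is a minimal sketch like the one below. It assumes the default JAXP `SAXParserFactory`, which is non-validating unless configured otherwise, and uses a trimmed copy of the index quoted earlier in this thread. The class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class PidxCountDemo {
    // Trimmed copy of the index quoted above; no PackIndex.xsd is reachable from a string source.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n"
        + "<index schemaVersion=\"1.1.0\" xs:noNamespaceSchemaLocation=\"PackIndex.xsd\""
        + " xmlns:xs=\"http://www.w3.org/2001/XMLSchema-instance\">\n"
        + "<vendor>Keil</vendor>\n"
        + "<url>http://www.keil.com/pack/</url>\n"
        + "<pindex>\n"
        + "  <pdsc url=\"http://www.keil.com/pack/\" vendor=\"ARM\" name=\"minar\" version=\"1.0.0\" />\n"
        + "  <pdsc url=\"http://mcu.holtek.com.tw/pack\" vendor=\"Holtek\" name=\"HT32_DFP\" version=\"1.0.24\" />\n"
        + "</pindex>\n"
        + "</index>";

    /** Counts pdsc elements using a factory left at its defaults (no validation configured). */
    static int countPdsc(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        final int[] count = {0};
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("pdsc".equals(qName)) {
                    count[0]++;
                }
            }
        });
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println("pdsc entries: " + countPdsc(SAMPLE)); // prints "pdsc entries: 2"
    }
}
```

If the parser is left at its defaults, the schema location is only a hint and the parse completes without the schema being present, which would point away from validation as the cause of the exception.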

@ilg-ul
Contributor Author

ilg-ul commented Jan 17, 2020

Our plug-ins first download index.pidx, then parse it without validating against the schema

Lucky you! Downloading and parsing locally seems to disable the schema validation.

Here are the latest tests. Parsing directly from the URL:

2020-01-17 12:02:27
Update packs job started.
Parsing "http://www.keil.com/pack/index.pidx"...
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 50; White spaces are required between publicId and systemId.
File "/Users/ilg/Library/CMSIS-Packs/.cache/.content_www_keil_com_pack_index_pidx.xml" written.

Copying the file locally and parsing:

2020-01-17 12:04:57
Update packs job started.
Parsing "file:///Users/ilg/Downloads/index.pidx"...
Contributed 606 pack(s).

I first thought that the problem was introduced by updating the JDK to OpenJDK 13, but with the old 1.8 the behaviour was the same.

I have no idea how it worked before...

@edriouk
Collaborator

edriouk commented Jan 17, 2020

The xs:noNamespaceSchemaLocation="PackIndex.xsd" was added to the index.pidx in October 2018.
I believe the problem is as described in the Stack Overflow article and is caused by the URL redirection that was introduced recently.

@ilg-ul
Contributor Author

ilg-ul commented Jan 17, 2020

I believe the problem is as described in the Stack Overflow article and is caused by the URL redirection that was introduced recently.

I understand that the redirection was added recently, but I do not think it causes the issue described at stackoverflow.

Here is the verbose curl output:

ilg@wks Downloads % curl -L http://www.keil.com/pack/index.pidx -o index.pidx -v
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 217.140.99.213...
* TCP_NODELAY set
* Connected to www.keil.com (217.140.99.213) port 80 (#0)
> GET /pack/index.pidx HTTP/1.1
> Host: www.keil.com
> User-Agent: curl/7.64.1
> Accept: */*
> 
< HTTP/1.1 302 Found
< Server: Microsoft-IIS/8.5
< Content-Type: text/html
< Date: Fri, 17 Jan 2020 10:20:30 GMT
< Location: https://sadevicepacksprodus.blob.core.windows.net/idxfile/index.pidx
< Connection: Keep-Alive
< X-UA-Compatible: IE=EDGE
< X-Powered-By: ASP.NET
< Content-Length: 7764
< 
* Ignoring the response-body
{ [6559 bytes data]
100  7764  100  7764    0     0  69945      0 --:--:-- --:--:-- --:--:-- 70581
* Connection #0 to host www.keil.com left intact
* Issue another request to this URL: 'https://sadevicepacksprodus.blob.core.windows.net/idxfile/index.pidx'
*   Trying 52.190.240.132...
* TCP_NODELAY set
* Connected to sadevicepacksprodus.blob.core.windows.net (52.190.240.132) port 443 (#1)
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
} [255 bytes data]
* TLSv1.2 (IN), TLS handshake, Server hello (2):
{ [81 bytes data]
* TLSv1.2 (IN), TLS handshake, Certificate (11):
{ [5238 bytes data]
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
{ [333 bytes data]
* TLSv1.2 (IN), TLS handshake, Server finished (14):
{ [4 bytes data]
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
} [70 bytes data]
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.2 (OUT), TLS handshake, Finished (20):
} [16 bytes data]
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
{ [1 bytes data]
* TLSv1.2 (IN), TLS handshake, Finished (20):
{ [16 bytes data]
* SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: CN=*.blob.core.windows.net
*  start date: May  2 00:41:38 2019 GMT
*  expire date: May  2 00:41:38 2021 GMT
*  subjectAltName: host "sadevicepacksprodus.blob.core.windows.net" matched cert's "*.blob.core.windows.net"
*  issuer: C=US; ST=Washington; L=Redmond; O=Microsoft Corporation; OU=Microsoft IT; CN=Microsoft IT TLS CA 4
*  SSL certificate verify ok.
> GET /idxfile/index.pidx HTTP/1.1
> Host: sadevicepacksprodus.blob.core.windows.net
> User-Agent: curl/7.64.1
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Length: 76035
< Content-Type: text/plain
< Last-Modified: Tue, 14 Jan 2020 04:02:51 GMT
< ETag: 0x8D798A6A0712CCF
< Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
< x-ms-request-id: aaf11b0b-401e-000f-361f-cd00f4000000
< x-ms-version: 2009-09-19
< x-ms-lease-status: unlocked
< x-ms-blob-type: AppendBlob
< x-ms-blob-committed-block-count: 1
< Date: Fri, 17 Jan 2020 10:20:31 GMT
< 
{ [15980 bytes data]
100 76035  100 76035    0     0  54466      0  0:00:01  0:00:01 --:--:-- 79534
* Connection #1 to host sadevicepacksprodus.blob.core.windows.net left intact
* Closing connection 1
* Closing connection 0
ilg@wks Downloads % 

The file is served as text/plain, with no charset in the Content-Type header, and the downloaded file has no BOM; it starts directly with ASCII characters:

ilg@wks Downloads % hexdump  /Users/ilg/Downloads/index.pidx  
0000000 3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31
0000010 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 55 54

@JonatanAntoni
Member

JonatanAntoni commented Jan 17, 2020

Yes, you're right, the file has no BOM, but it should be properly UTF-8 encoded nevertheless.

Might it happen that the stream reader you are using fails to detect proper encoding if there is no BOM right at the start? Any chance to force the stream reader to use UTF-8?
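For reference, forcing the encoding is possible on the SAX side. This is a minimal sketch, assuming the content is already available as bytes; `ForceUtf8Demo` and `textOf` are hypothetical names, while `InputSource.setEncoding` is the standard SAX call that tells the parser which encoding to use for a byte stream.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class ForceUtf8Demo {
    /** Parses raw bytes as UTF-8, whatever the HTTP Content-Type said, and returns the text content. */
    static String textOf(byte[] xmlBytes) throws Exception {
        InputSource source = new InputSource(new ByteArrayInputStream(xmlBytes));
        source.setEncoding("UTF-8"); // explicit encoding for the byte stream
        StringBuilder text = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(source, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        return text.toString();
    }
}
```

This only helps if the bytes reaching the parser are the XML file in the first place; it cannot repair a response that contains something else.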

@ilg-ul
Contributor Author

ilg-ul commented Jan 17, 2020

Might it happen that the stream reader you are using fails to detect proper encoding if there is no BOM right at the start? Any chance to force the stream reader to use UTF-8?

Please note that exactly the same file is parsed by exactly the same code properly when copied locally. It should have nothing to do with encoding.

And parsing was ok until recently, when something changed on your side.

Most probably the issue is caused by the validation, which is not possible from your URL.

@ilg-ul
Contributor Author

ilg-ul commented Jan 17, 2020

caused by URL redirection that was made recently.

Evgueni seems right, I uploaded the index.pidx to GitHub and from there the Java parser can process it:

Parsing "https://github.com/ilg-ul/test-sax-validation/raw/master/index.pidx"...
Contributed 606 pack(s).

So my guess that it has something to do with validation was not confirmed.

A curl session looks like:

ilg@wks ~ % curl -L -o index2.pidx https://github.com/ilg-ul/test-sax-validation/raw/master/index.pidx -v
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 140.82.118.3...
* TCP_NODELAY set
* Connected to github.com (140.82.118.3) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
} [224 bytes data]
* TLSv1.2 (IN), TLS handshake, Server hello (2):
{ [108 bytes data]
* TLSv1.2 (IN), TLS handshake, Certificate (11):
{ [3085 bytes data]
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
{ [300 bytes data]
* TLSv1.2 (IN), TLS handshake, Server finished (14):
{ [4 bytes data]
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
} [37 bytes data]
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.2 (OUT), TLS handshake, Finished (20):
} [16 bytes data]
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
{ [1 bytes data]
* TLSv1.2 (IN), TLS handshake, Finished (20):
{ [16 bytes data]
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: businessCategory=Private Organization; jurisdictionCountryName=US; jurisdictionStateOrProvinceName=Delaware; serialNumber=5157550; C=US; ST=California; L=San Francisco; O=GitHub, Inc.; CN=github.com
*  start date: May  8 00:00:00 2018 GMT
*  expire date: Jun  3 12:00:00 2020 GMT
*  subjectAltName: host "github.com" matched cert's "github.com"
*  issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=DigiCert SHA2 Extended Validation Server CA
*  SSL certificate verify ok.
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0> GET /ilg-ul/test-sax-validation/raw/master/index.pidx HTTP/1.1
> Host: github.com
> User-Agent: curl/7.64.1
> Accept: */*
> 
< HTTP/1.1 302 Found
< Date: Fri, 17 Jan 2020 13:02:30 GMT
< Content-Type: text/html; charset=utf-8
< Transfer-Encoding: chunked
< Server: GitHub.com
< Status: 302 Found
< Vary: X-PJAX
< Access-Control-Allow-Origin: https://render.githubusercontent.com
< Location: https://raw.githubusercontent.com/ilg-ul/test-sax-validation/master/index.pidx
< Cache-Control: no-cache
< Strict-Transport-Security: max-age=31536000; includeSubdomains; preload
< X-Frame-Options: deny
< X-Content-Type-Options: nosniff
< X-XSS-Protection: 1; mode=block
< Expect-CT: max-age=2592000, report-uri="https://api.github.com/_private/browser/errors"
< Content-Security-Policy: default-src 'none'; base-uri 'self'; block-all-mixed-content; connect-src 'self' uploads.github.com www.githubstatus.com collector.githubapp.com api.github.com www.google-analytics.com github-cloud.s3.amazonaws.com github-production-repository-file-5c1aeb.s3.amazonaws.com github-production-upload-manifest-file-7fdce7.s3.amazonaws.com github-production-user-asset-6210df.s3.amazonaws.com wss://live.github.com; font-src github.githubassets.com; form-action 'self' github.com gist.github.com; frame-ancestors 'none'; frame-src render.githubusercontent.com; img-src 'self' data: github.githubassets.com identicons.github.com collector.githubapp.com github-cloud.s3.amazonaws.com *.githubusercontent.com; manifest-src 'self'; media-src 'none'; script-src github.githubassets.com; style-src 'unsafe-inline' github.githubassets.com
< Age: 0
< Vary: Accept-Encoding
< X-GitHub-Request-Id: DDEB:F596:27E1B4C:3B5193D:5E21B065
< 
* Ignoring the response-body
{ [155 bytes data]
100   144    0   144    0     0    331      0 --:--:-- --:--:-- --:--:--   330
* Connection #0 to host github.com left intact
* Issue another request to this URL: 'https://raw.githubusercontent.com/ilg-ul/test-sax-validation/master/index.pidx'
*   Trying 151.101.16.133...
* TCP_NODELAY set
* Connected to raw.githubusercontent.com (151.101.16.133) port 443 (#1)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
} [239 bytes data]
* TLSv1.2 (IN), TLS handshake, Server hello (2):
{ [108 bytes data]
* TLSv1.2 (IN), TLS handshake, Certificate (11):
{ [3182 bytes data]
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
{ [300 bytes data]
* TLSv1.2 (IN), TLS handshake, Server finished (14):
{ [4 bytes data]
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
} [37 bytes data]
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.2 (OUT), TLS handshake, Finished (20):
} [16 bytes data]
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
{ [1 bytes data]
* TLSv1.2 (IN), TLS handshake, Finished (20):
{ [16 bytes data]
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: C=US; ST=California; L=San Francisco; O=GitHub, Inc.; CN=www.github.com
*  start date: Mar 23 00:00:00 2017 GMT
*  expire date: May 13 12:00:00 2020 GMT
*  subjectAltName: host "raw.githubusercontent.com" matched cert's "*.githubusercontent.com"
*  issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=DigiCert SHA2 High Assurance Server CA
*  SSL certificate verify ok.
> GET /ilg-ul/test-sax-validation/master/index.pidx HTTP/1.1
> Host: raw.githubusercontent.com
> User-Agent: curl/7.64.1
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Security-Policy: default-src 'none'; style-src 'unsafe-inline'; sandbox
< Strict-Transport-Security: max-age=31536000
< X-Content-Type-Options: nosniff
< X-Frame-Options: deny
< X-XSS-Protection: 1; mode=block
< ETag: W/"8c5f775585a16c5e8f27556fa1bd47117a66f17ae056af2b72affdaec243caa0"
< Content-Type: text/plain; charset=utf-8
< Cache-Control: max-age=300
< X-Geo-Block-List:
< Via: 1.1 varnish-v4
< X-GitHub-Request-Id: 3CA4:22F3:0333:03E3:5E21AF60
< Content-Length: 75423
< Accept-Ranges: bytes
< Date: Fri, 17 Jan 2020 13:02:30 GMT
< Via: 1.1 varnish
< Connection: keep-alive
< X-Served-By: cache-lcy19264-LCY
< X-Cache: HIT
< X-Cache-Hits: 1
< X-Timer: S1579266150.389912,VS0,VE1
< Vary: Authorization,Accept-Encoding
< Access-Control-Allow-Origin: *
< X-Fastly-Request-ID: 3196e178173b2a09b9bcb0fe77ef1a58b0687a1b
< Expires: Fri, 17 Jan 2020 13:07:30 GMT
< Source-Age: 261
< 
{ [1875 bytes data]
100 75423  100 75423    0     0   104k      0 --:--:-- --:--:-- --:--:--  104k
* Connection #1 to host raw.githubusercontent.com left intact
* Closing connection 0
* Closing connection 1
ilg@wks ~ % 

The one difference that I can spot is that GitHub responds with Content-Type: text/plain; charset=utf-8, while your server responds only with Content-Type: text/plain.

Could you find a fix for this?

@JonatanAntoni
Member

Liviu,

I asked the web hosting team if we can change the reported Content-Type to text/xml; charset=utf-8. Not sure what type of influence we have here since the files are shipped by Microsoft Azure.

Cheers,
Jonatan

@ilg-ul
Contributor Author

ilg-ul commented Jan 17, 2020

the files are shipped by Microsoft Azure

:-(

Long live Microsoft!

@ilg-ul
Contributor Author

ilg-ul commented Jan 20, 2020

Based on further tests, configuring the plug-ins to use the real address (https://sadevicepacksprodus.blob.core.windows.net/idxfile/index.pidx) avoids the problem, so the culprit is the redirection, not the content type.

Could you compare the current redirection setup with the previous one, which worked? Perhaps you can identify the problem.

For completeness, the curl session looks like this:

ilg@wks tmp % curl -L -o index3.pidx https://sadevicepacksprodus.blob.core.windows.net/idxfile/index.pidx -v
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 52.190.240.132...
* TCP_NODELAY set
* Connected to sadevicepacksprodus.blob.core.windows.net (52.190.240.132) port 443 (#0)
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
} [255 bytes data]
* TLSv1.2 (IN), TLS handshake, Server hello (2):
{ [81 bytes data]
* TLSv1.2 (IN), TLS handshake, Certificate (11):
{ [5238 bytes data]
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
{ [333 bytes data]
* TLSv1.2 (IN), TLS handshake, Server finished (14):
{ [4 bytes data]
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
} [70 bytes data]
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.2 (OUT), TLS handshake, Finished (20):
} [16 bytes data]
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
{ [1 bytes data]
* TLSv1.2 (IN), TLS handshake, Finished (20):
{ [16 bytes data]
* SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: CN=*.blob.core.windows.net
*  start date: May  2 00:41:38 2019 GMT
*  expire date: May  2 00:41:38 2021 GMT
*  subjectAltName: host "sadevicepacksprodus.blob.core.windows.net" matched cert's "*.blob.core.windows.net"
*  issuer: C=US; ST=Washington; L=Redmond; O=Microsoft Corporation; OU=Microsoft IT; CN=Microsoft IT TLS CA 4
*  SSL certificate verify ok.
> GET /idxfile/index.pidx HTTP/1.1
> Host: sadevicepacksprodus.blob.core.windows.net
> User-Agent: curl/7.64.1
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Length: 76375
< Content-Type: text/plain
< Last-Modified: Sat, 18 Jan 2020 04:01:56 GMT
< ETag: 0x8D79BCB28BA6D49
< Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
< x-ms-request-id: 42b49ce6-801e-007f-667d-cf7330000000
< x-ms-version: 2009-09-19
< x-ms-lease-status: unlocked
< x-ms-blob-type: AppendBlob
< x-ms-blob-committed-block-count: 1
< Date: Mon, 20 Jan 2020 10:33:14 GMT
< 
{ [15980 bytes data]
100 76375  100 76375    0     0  62911      0  0:00:01  0:00:01 --:--:-- 62911
* Connection #0 to host sadevicepacksprodus.blob.core.windows.net left intact
* Closing connection 0
ilg@wks tmp % 

@ilg-ul
Contributor Author

ilg-ul commented Jan 22, 2020

Any estimate when this issue will be addressed?

As a workaround, I have asked users to reconfigure their Eclipse installations to use the windows.net URL, but this is not a long-term solution.

@JonatanAntoni
Member

In your analysis above the file still gets delivered as text/plain. I don't understand what the difference is, from your client's point of view, between being redirected and accessing the final URL directly. I doubt simply changing the content type to text/xml; charset=utf-8 fixes your issue.

@ilg-ul
Contributor Author

ilg-ul commented Jan 22, 2020

I doubt simply changing the content type to text/xml; charset=utf-8 fixes your issue

First, this is not my issue, I use the XML SAX parser available in the Oracle JDK in the simplest and most obvious configuration.

If I pass it the 'keil.com' URL, it fails; if I pass the windows.net URL, it passes; if I copy the file locally and pass the local URL, the parser passes again.

The content type seems to have no importance.

The problem is the new Microsoftish redirection, which confuses the Java parser.

If you think that the problem is not real simply because users of your CMSIS Eclipse plug-ins do not feel the pain, you are wrong, because Evgueni took a different path and copied the file locally (thus processing the redirect in a more fortunate context), but the problem is there for anyone trying to parse the file directly from the URL.

Please compare the current redirection setup with the previous one, which worked, and fix the problem.

@JonatanAntoni
Member

Our web team is investigating the issue. But the solution we pointed out in the first place won't be enough. I cannot give you an estimate, but probably not before end of January.

@ilg-ul
Contributor Author

ilg-ul commented Jan 22, 2020

the solution we pointed out in the first place won't be enough

If you mean fixing the content type, yes, I guess that won't make any difference. Check the redirects.

@cdwilson

This comment has been minimized.

@cdwilson

Whoops, ignore above, I forgot the -L flag to curl:

$ curl -L -s https://www.keil.com/pack/index.pidx | xmllint --noout --schema PackIndex.xsd -
- validates

@JonatanAntoni
Member

Hi @cdwilson,

using curl -s clearly cannot work with redirects, you need to use -L in such a case.

curl -L https://www.keil.com/pack/index.pidx | xmllint --noout --schema PackIndex.xsd -
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0  7765    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 76375  100 76375    0     0  58346      0  0:00:01  0:00:01 --:--:--  228k
- validates

There is nothing fundamentally wrong with the redirect itself. It's just a matter of handling these redirects properly. I am not an expert on that "XML SAX parser available in the Oracle JDK". Can you come up with a small command-line reproducer that reveals the issue? E.g. a Java program I can run from the command line in a similar way to the curl command above? This might help our web team analyse the issue.

Cheers,
Jonatan

@cdwilson

cdwilson commented Jan 22, 2020

Yup, I realized that right after I posted it... [facepalm]

The original error message that @ilg-ul posted looks similar to the errors that curl is throwing when I forgot the -L flag, i.e.

org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 50; White spaces are required between publicId and systemId.

vs.

parser error : Space required after the Public Identifier

I wonder if there is some similar option to curl's -L that needs to be passed in the SAX parser.

@ilg-ul
Contributor Author

ilg-ul commented Jan 22, 2020

I did some further tests and the problem is definitely related to the redirection.

The problem is not in the SAX parser itself, but in the HttpURLConnection, used to read the content.

For reasons that I have not identified yet, in some cases this class does not follow redirections and returns the error page issued by the server (HTML content). This content is obviously not well-formed XML, and the SAX parser fails with that SAXParseException.

[Edit: The class does not follow redirections from http to https.]

Can you confirm that before the move to windows.net, the index.pidx file had no redirects at all? That would explain why it worked for so long and failed recently.

The strange thing is that in some other cases, exactly the same code used in the plug-ins performs as expected, following the redirect and returning the xml, not the error html. [Edit: the separate tests worked because they used https.]

I'll try to identify the reason for this inconsistent behaviour, and a possible solution to avoid it.

Evgueni @edriouk, any thoughts on this?

@edriouk
Collaborator

edriouk commented Jan 23, 2020

Liviu,
have you tried to use HttpURLConnection methods setFollowRedirects() and/or setInstanceFollowRedirects()?
If it does not help, I see currently only the possibility to download the file first and then parse it.
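A quick way to inspect both settings, as a minimal sketch (the class and method names are hypothetical; the `HttpURLConnection` getters are the standard JDK ones). Opening the connection object does not send a request, so this can be run without touching the server.

```java
import java.net.HttpURLConnection;
import java.net.URI;

final class FollowRedirectFlags {
    /** Creates (but does not connect) a connection object and reports its per-instance flag. */
    static boolean instanceFollowsRedirects(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        return conn.getInstanceFollowRedirects();
    }

    public static void main(String[] args) throws Exception {
        // Class-wide default, changed with HttpURLConnection.setFollowRedirects(boolean).
        System.out.println("global default: " + HttpURLConnection.getFollowRedirects());
        // Per-connection value, changed with conn.setInstanceFollowRedirects(boolean).
        System.out.println("per connection: "
            + instanceFollowsRedirects("http://www.keil.com/pack/index.pidx"));
    }
}
```

Both values are true by default. Even so, HttpURLConnection only follows a redirect when the target uses the same protocol as the original request, so enabling these flags does not cover an http to https hop.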

@ilg-ul
Contributor Author

ilg-ul commented Jan 23, 2020

setFollowRedirects()

I checked and this property is already set to true. :-(

currently only the possibility to download the file first and then parse it

I already do this (actually I use an internal buffer), and the problem occurs when reading the file via HttpURLConnection: instead of the xml I get the html error page.

The only way out I can see now is to explicitly process redirects in my code, which is silly.
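Processing the redirects explicitly could look roughly like this. It is a minimal sketch, not the code used in any of the plug-ins mentioned here: `RedirectFetch`, `fetch` and the `maxRedirects` limit are hypothetical names, and the set of status codes treated as redirects is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;

final class RedirectFetch {
    /** Downloads a URI, following up to maxRedirects redirects by hand (including http to https). */
    static byte[] fetch(URI start, int maxRedirects) throws IOException {
        URI current = start;
        for (int hop = 0; hop <= maxRedirects; hop++) {
            HttpURLConnection conn = (HttpURLConnection) current.toURL().openConnection();
            conn.setInstanceFollowRedirects(false); // handle every redirect ourselves
            int status = conn.getResponseCode();
            if (status == 301 || status == 302 || status == 303 || status == 307 || status == 308) {
                String location = conn.getHeaderField("Location");
                conn.disconnect();
                if (location == null) {
                    throw new IOException("Redirect without Location header from " + current);
                }
                current = current.resolve(location); // also handles relative targets
                continue;
            }
            if (status != HttpURLConnection.HTTP_OK) {
                conn.disconnect();
                // Fail here instead of handing an HTML error page to the XML parser.
                throw new IOException("HTTP " + status + " for " + current);
            }
            try (InputStream in = conn.getInputStream()) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                for (int n; (n = in.read(buf)) != -1; ) {
                    out.write(buf, 0, n);
                }
                return out.toByteArray();
            }
        }
        throw new IOException("Too many redirects starting at " + start);
    }
}
```

The returned bytes can then be wrapped in a `ByteArrayInputStream` and passed to the SAX parser. Checking the status code before parsing also turns the confusing SAXParseException into a plain HTTP error.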

@edriouk
Collaborator

edriouk commented Jan 23, 2020

@JonatanAntoni
Member

Well, as far as I can recall, we have been using redirection for quite a while. We need to dig deeper to understand if anything changed recently.

To be honest, I don't know what we should do if the implementation you are using is causing the weird behavior.

I cannot see that the redirect is somehow special and it works without issues using curl.

@JonatanAntoni
Member

Hi @ilg-ul,

I got some feedback from the web team. They moved from redirecting to http to redirecting to https on Jan 8th. This might indeed cause issues.

May I ask you to update the URL from http://www.keil.com/pack/index.pidx to https://www.keil.com/pack/index.pidx, please? Does this change anything on your end?

Cheers,
Jonatan

@ilg-ul
Contributor Author

ilg-ul commented Jan 23, 2020

They moved from redirecting to http to redirecting to https on Jan 8th. This might indeed cause issues.

Indeed.

update the URL from http://www.keil.com/pack/index.pidx to https://www.keil.com/pack/index.pidx, please? Does this change anything on your end?

Yes, now it no longer throws the exception.

It looks like the Java classes cannot redirect from http to https.
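For code that can require Java 11 or later, the newer `java.net.http.HttpClient` is an alternative that was not tried in this thread. The following is a minimal sketch under that assumption; `HttpClientFetch` and `fetch` are hypothetical names. With `Redirect.NORMAL` the client follows redirects except those going from https to http, so an http to https hop is followed.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class HttpClientFetch {
    /** Fetches a URI as bytes, letting the client follow redirects (except https to http). */
    static byte[] fetch(URI uri) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<byte[]> response = client.send(request, HttpResponse.BodyHandlers.ofByteArray());
        if (response.statusCode() != 200) {
            // Do not pass an error page on to the XML parser.
            throw new IOException("HTTP " + response.statusCode() + " for " + response.uri());
        }
        return response.body();
    }
}
```

Whether this is an option depends on the minimum Java version the plug-ins need to support; on Java 8 the redirects have to be handled by hand or the https URL used directly.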

Please note that your url change is not reflected by the documentation, which still points to http.

https://arm-software.github.io/CMSIS_5/Pack/html/packIndexFile.html

I think that you should explicitly announce this configuration change.

@ilg-ul
Contributor Author

ilg-ul commented Jan 23, 2020

have a look how our code in CpRepoServiceProvider.readIndexFile()

Thank you Evgueni. Yes, you are explicitly processing redirects, and do not rely on moody implementations. Good to know.

@JonatanAntoni JonatanAntoni added DONE and removed review labels Jan 24, 2020
@ilg-ul ilg-ul closed this as completed Jan 24, 2020