Internal sharepoint website is giving a 403 Forbidden #917

Open
michaelt16 opened this issue Mar 4, 2024 · 3 comments
Labels
stale From automation, when inactive for too long.

Comments

@michaelt16

Hi Pascal,

I have a question about crawling an internal SharePoint site. Every time the crawler follows the internal links I get a 403 Forbidden, even though I have set up the login authentication. Is there anything else I should consider when trying to solve this issue?

For context, I am testing with a depth of 1. Let's say the first page also requires a login (which works, and the crawler is able to crawl it). When the crawler follows the sublinks (i.e., the SharePoint sites), it gets a 403 error, although a single login normally gives access to both.

What should I look at when troubleshooting this? Let me know if you need my configuration or more context.

Thank you
-Michael

@ohtwadi
Contributor

ohtwadi commented Mar 8, 2024

Hi Michael,

The crawler offers generic NTLM support thanks to the Apache HttpClient library. It supports a few different NTLM protocol versions but may not support the one you are using. Details on supported versions: https://hc.apache.org/httpcomponents-client-4.5.x/ntlm.html

You may also want to check with your system administrator to see if there are extra security layers or special configuration requirements you need to be aware of. Maybe you need to pass custom HTTP headers, or go through a proxy (look at <headers> and <proxySettings>).

Finally, if all else fails, you can try to find out whether they offer a way to access your site via other authentication methods, or whether they can whitelist the crawler IP or provide some other workaround. There might be other network conditions you are not meeting with NTLM alone.
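
As a starting point for the checks above, it can help to see what the server itself asks for before changing crawler settings. The following is a minimal sketch, not part of Norconex: the class `AuthProbe` and its methods are hypothetical names chosen for illustration, and it only uses the JDK's `java.net.http` client (Java 11+). It requests a URL without credentials and reports the status code together with the schemes advertised in any `WWW-Authenticate` headers. A 401 listing `NTLM` or `Negotiate` tells you which scheme is expected, while a 403 usually means the request was understood but refused, so the cause may be permissions or a gateway rather than the login itself.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical troubleshooting helper; not part of Norconex or Apache HttpClient.
class AuthProbe {

    // Returns the lower-cased scheme name (first token) of each WWW-Authenticate value.
    // Simplification: assumes one challenge per header value, as IIS/SharePoint
    // typically sends ("Negotiate" and "NTLM" as separate headers).
    static List<String> schemes(List<String> wwwAuthenticate) {
        List<String> result = new ArrayList<>();
        for (String value : wwwAuthenticate) {
            String trimmed = value.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String scheme = trimmed.split("\\s+", 2)[0].toLowerCase(Locale.ROOT);
            if (!result.contains(scheme)) {
                result.add(scheme);
            }
        }
        return result;
    }

    // Turns a status code and its challenges into a short human-readable hint.
    static String describe(int status, List<String> wwwAuthenticate) {
        if (status == 401) {
            return "401: authentication requested, schemes offered: " + schemes(wwwAuthenticate);
        }
        if (status == 403) {
            return "403: request refused; check account permissions, required headers, proxy or IP restrictions";
        }
        return status + ": no authentication problem indicated by the status code";
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java AuthProbe.java <url>");
            return;
        }
        // Do not follow redirects, so a redirect to a login page stays visible as a 3xx.
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NEVER)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(args[0])).GET().build();
        HttpResponse<Void> response = client.send(request, HttpResponse.BodyHandlers.discarding());
        System.out.println(describe(response.statusCode(),
                response.headers().allValues("WWW-Authenticate")));
    }
}
```

Running it against both the first page and one of the failing SharePoint sublinks, and comparing the two results, may show whether the sublinks expect a different scheme or sit behind an extra security layer.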

If you get a specific error from the crawler that suggests a bug, feel free to share your config here and the exact error/logs so we can look for a fix.

@michaelt16
Author

michaelt16 commented Mar 13, 2024

Hi,

Thank you for your response. I switched the login to NTLM and unfortunately it still gave me a 403 Forbidden error.

I was thinking of a possible solution, though I am not sure how it would work: using some type of Java browser bot alongside Norconex, since I was able to use a browser bot to log in to the SharePoint sites and retrieve the HTML contents.

stale bot commented May 14, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on May 14, 2024