Form Authentication #434

Closed
jacksonp2008 opened this issue Dec 5, 2017 · 13 comments

@jacksonp2008

I am trying to authenticate using form authentication. The form appears to generate a unique (hidden) token that needs to be sent back.

From the site:

<form action="https://dev-asfdsdf.asdfsf.com/login" method="POST" id="login-form">
                    <input type="hidden" name="_token" value="BuHHYIrC2KaAfmezJ4dOiWlJdx6kO58JrOrj54Bx">

Assuming I am interpreting this correctly, any thoughts on how to capture the token and send it back in httpClientFactory?

thanks!

@essiembre
Contributor

It looks like this authentication method needs a custom solution. I suspect the _token to be dynamic, but just in case, version 2.8.0 allows you to specify extra form arguments with your login URL (in httpClientFactory):

      <authFormParams>
          <param name="(param name)">(param value)</param>
          <!-- You can repeat this param tag as needed. -->
      </authFormParams>

I doubt this will work. You may have to do some coding. One approach could be to extend GenericHttpClientFactory and override the authenticateUsingForm method to do a second pass with the token.
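
The token-capture half of such a second pass could look roughly like the sketch below. It is written in Python with only the standard library, purely to illustrate the idea; it is not a GenericHttpClientFactory extension and does not use any collector API. The names HiddenInputCollector and build_login_fields, the "email"/"password" field names, and the sample credentials are all hypothetical. It also assumes the token is present in the served HTML rather than filled in by JavaScript.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenInputCollector(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            # A missing value attribute is treated as an empty string.
            self.hidden[attributes["name"]] = attributes.get("value") or ""


def build_login_fields(form_html, username_field, username,
                       password_field, password):
    """Merge the hidden fields found in the form with the credentials."""
    collector = HiddenInputCollector()
    collector.feed(form_html)
    fields = dict(collector.hidden)
    fields[username_field] = username
    fields[password_field] = password
    return fields


# Form excerpt from the original post, closed here so it stands alone.
LOGIN_HTML = """
<form action="https://dev-asfdsdf.asdfsf.com/login" method="POST" id="login-form">
    <input type="hidden" name="_token" value="BuHHYIrC2KaAfmezJ4dOiWlJdx6kO58JrOrj54Bx">
</form>
"""

# "email" and "password" are assumed field names, not taken from the site.
fields = build_login_fields(LOGIN_HTML, "email", "user@example.com",
                            "password", "secret")
body = urlencode(fields)  # body to POST back to the form's action URL
```

The same session (cookies) would normally have to be used for fetching the form and for posting the body, since such tokens are usually tied to the session.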

If your authentication method uses a known standard, you can provide more information on it and we can turn this into a feature request.

@AntonioAmore

Hello!
I am trying to apply your advice:

<httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">  
...
     <authFormParams>
         <param name="(param name)">(param value)</param>         
     </authFormParams>
...

but I am getting an error:

ERROR [XMLConfigurationUtil$LogErrorHandler] (XML Validation) GenericHttpClientFactory: cvc-complex-type.2.4.a: Invalid content was found starting with element 'authFormParams'. One of '{connectionRequestTimeout, connectionCharset, expectContinueEnabled, maxRedirects, localAddress, maxConnections, maxConnectionsPerRoute, maxConnectionIdleTime, maxConnectionInactiveTime, sslProtocols, proxyHost, proxyPort, proxyRealm, proxyScheme, proxyUsername, proxyPassword, proxyPasswordKey, proxyPasswordKeySource, headers, authPasswordKey, authPasswordKeySource, authFormCharset, authHostname, authPort, authRealm, authWorkstation, authDomain, authPreemptive}' is expected.

Versions tested:
Norconex HTTP Collector 2.8.0
Norconex HTTP Collector 2.8.1-snapshot

essiembre added a commit that referenced this issue Dec 8, 2017
@essiembre
Contributor

This is a validation error that occurs because the validation schema was not updated to support the new feature. It can be ignored unless you start the collector with the -k option (which prevents starting upon validation errors).

An updated snapshot has just been published to fix the validation error.

@AntonioAmore

AntonioAmore commented Dec 8, 2017

Thank you for the quick response!

I tested the issue with the 2.8.1 snapshot (2017-12-08).

With

<authFormParams>
    <param name="param"></param>
</authFormParams>

I got the following:

ERROR [XMLConfigurationUtil$LogErrorHandler] (XML Validation) GenericHttpClientFactory: cvc-minLength-valid: Value '' with length = '0' is not facet-valid with respect to minLength '1' for type 'nonEmtyString'.
ERROR [XMLConfigurationUtil$LogErrorHandler] (XML Validation) GenericHttpClientFactory: cvc-complex-type.2.2: Element 'param' must have no element [children], and the value must be valid.

I believe a minLength of 0 should be allowed for several reasons, including security (some forms require an empty field in the request).

Also, when inspecting the server-side request variables, I could not find param there, although according to authFormParams it should have been sent.

The problem seems to be more general, I guess, because configuring

<authFormParams>
    <param name="param">1</param>        
 </authFormParams>

also does not deliver param to the server-side request variables.

Please feel free to ask for any additional information that would help you reproduce the error.

I should note, though, that the ERROR from my previous message has disappeared.

@essiembre
Contributor

I wrongly assumed the value should always be present. I will modify the validation, but I doubt it will make a difference.

Again, if your authentication is "standard", point me to documentation for it, or share credentials with me so I can try.

Otherwise, your best bet is to implement the auth logic yourself (e.g., extending GenericHttpClientFactory).

@AntonioAmore

I don't think it is a "standard" practice, but in my current case the login form has additional fields, which can be empty.

Providing form fields that are invisible to users but visible to crawlers, and applying heuristics to their values, can be considered a generic approach. Please see https://www.drupal.org/project/honeypot as an example.

Thank you for your commit. I believe it does not break the class logic, and it is still generic enough.

I also fully agree with you that more complex login form logic should be implemented by extending GenericHttpClientFactory.

@AntonioAmore

AntonioAmore commented Dec 25, 2017

May I point you to this problem again?

 <authFormParams>
    <param name="param">1</param>        
 </authFormParams>

When checking the server-side request variables, I cannot see a parameter named 'param'. Is the problem on my side?

The server runs a LAMP stack, and I am analysing the $_REQUEST and $_REQUEST vars.

@essiembre
Contributor

I know it's been a while, but do you still have the issue? If so, can you share your config along with authentication info so I can reproduce it (privately if you want)?

@jacksonp2008
Author

Hi Pascal,

Thanks for following up. We got sidetracked with some other issues but will revisit soon (likely Feb)

@bpamiri

bpamiri commented Mar 12, 2018

I've got a scenario very similar to the original post. I have a website I am trying to extract information from. The pages I want to scrape are behind form-based authentication. The problem is that the login form generates a dynamic code, stored in a hidden form field, that needs to be sent back for authentication to work.

Is there any way to add something like the following to the httpClientFactory?

<authPrefetch>
   <prefetchURL> URL_of_the_original_form_with_the_dynamic_code </prefetchURL>        
   <prefetchParam>form_field_name_of_the_dynamic_hidden_field</prefetchParam>
</authPrefetch>

This would add a prefetch pass: the collector would first request the defined URL to read the dynamically generated form field, then submit that value along with the actual httpClientFactory authentication request.
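
To make the proposed flow concrete, here is a rough sketch of the two passes in Python (standard library only). It is an illustration of the idea, not an existing collector feature: the authPrefetch tags above are a suggestion, and FieldFinder, make_session_fetch, and login_with_prefetch are hypothetical names. The parameters prefetch_url and prefetch_param mirror the proposed prefetchURL and prefetchParam tags. A shared cookie jar is used because the hidden value is usually bound to the session that served the form.

```python
from html.parser import HTMLParser
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, build_opener


class FieldFinder(HTMLParser):
    """Remembers the value of the first <input> with the given name."""

    def __init__(self, field_name):
        super().__init__()
        self.field_name = field_name
        self.value = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (tag == "input" and self.value is None
                and attributes.get("name") == self.field_name):
            self.value = attributes.get("value") or ""


def make_session_fetch(timeout=30):
    """Return fetch(url, data=None) that keeps cookies between calls."""
    opener = build_opener(HTTPCookieProcessor(CookieJar()))

    def fetch(url, data=None):
        # Passing a body makes urllib issue a POST; otherwise it is a GET.
        body = urlencode(data).encode("utf-8") if data is not None else None
        with opener.open(url, body, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")

    return fetch


def login_with_prefetch(fetch, prefetch_url, auth_url, prefetch_param,
                        credentials):
    """Pass 1: read the hidden field. Pass 2: post it with the credentials."""
    finder = FieldFinder(prefetch_param)
    finder.feed(fetch(prefetch_url))
    if finder.value is None:
        raise ValueError(
            f"hidden field {prefetch_param!r} not found at {prefetch_url}")
    form = dict(credentials)
    form[prefetch_param] = finder.value
    return fetch(auth_url, form)


# Example call (not executed here; URLs and field names are placeholders):
# page = login_with_prefetch(make_session_fetch(),
#                            "https://example.com/login",
#                            "https://example.com/login",
#                            "_token",
#                            {"username": "username", "pass": "password"})
```

Taking fetch as an argument keeps the two-pass logic separate from the HTTP client, so it can be exercised without a network. This only works when the value is present in the served HTML; a value injected by JavaScript would not be seen.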

@essiembre
Contributor

Can you share the URL (or ideally your config)? Quite often the problem with your suggested approach is that the dynamic field value is populated via JavaScript, precisely to prevent automated scraping. Is that your case? JavaScript is not interpreted unless you use the PhantomJSDocumentFetcher. To get this working with PhantomJS, you would probably have to modify the phantom.js script and do that magic yourself.

I could confirm better with a URL to your login form.

@mattbucci

I'm also facing this issue. @AntonioAmore, did you ever find a resolution?

I made a simple script that calls var_dump($_POST):

website: 2021-01-14 18:50:28 INFO - Performing FORM authentication at "https://example.com/test.php" (username=username; password=*****)
website: 2021-01-14 18:50:28 INFO - Authentication status: HTTP/1.1 200 OK
website: 2021-01-14 18:50:28 DEBUG - Authentication response:
array(2) {
["username"]=>
string(8) "username"
["pass"]=>
string(8) "password"
}

I've tried both 2.8.1 and 2.9.1, but the authFormParams field doesn't seem to do anything.

My config looks like this:

    <httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
        <cookiesDisabled>false</cookiesDisabled>
        <maxRedirects>5</maxRedirects>
        <connectionTimeout>60000</connectionTimeout>
        <authFormCharset>UTF-8</authFormCharset>
        <authMethod>form</authMethod>
        <authURL>https://example.com/test.php</authURL>
        <authUsername>username</authUsername>
        <authPassword>password</authPassword>
        <authUsernameField>username</authUsernameField>
        <authPasswordField>pass</authPasswordField>
        <authFormParams>
            <param name="test">test</param>
        </authFormParams>
    </httpClientFactory>
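
For anyone who wants to repeat this check without a PHP host, a local echo endpoint along the following lines might do the same job as var_dump($_POST). This is only a sketch using the Python standard library; parse_form_body, EchoFormHandler, serve, and port 8000 are hypothetical choices. It keeps blank values on purpose, so that an empty param would still show up if it were sent.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs


def parse_form_body(raw):
    """Decode an application/x-www-form-urlencoded body, keeping empty values."""
    return parse_qs(raw.decode("utf-8"), keep_blank_values=True)


class EchoFormHandler(BaseHTTPRequestHandler):
    """Replies to every POST with the form fields it received."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        fields = parse_form_body(self.rfile.read(length))
        lines = [f"{name}={values}" for name, values in sorted(fields.items())]
        body = "\n".join(lines).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Run the echo endpoint on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), EchoFormHandler).serve_forever()


# To use it, call serve() and point authURL at http://127.0.0.1:8000/;
# the posted fields then appear in the collector's DEBUG
# "Authentication response" output, as in the log above.
```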

@AntonioAmore

@mattbucci Unfortunately, no. I solved my task in another way.
