Okerefe/CrawlEngine


CrawlEngine

Crawl Engine is a PHP library that helps automate the process of logging in to password-protected sites and getting the needed information from them. It does this with the help of other great libraries like Guzzle and DomCrawler. License: MIT


Installation

The preferred way of installing CrawlEngine is with Composer, as follows:

composer require deravenedwriter/crawlengine

Then ensure your bootstrap file loads the Composer autoloader:

require_once 'vendor/autoload.php';

Bootstrapping The Engine Class

The Engine class performs most of CrawlEngine's functions, including resolving requests, getting form details from pages, and more. The Engine can be initialized as follows:

<?php
// First we need to add the namespace
use CrawlEngine\Engine;

// Now we initialize the main Engine class
$engine = new Engine();

/**
 * The Engine class accepts one optional integer parameter,
 * which is set as the default timeout (in seconds) for all web requests
 * made with that Engine instance.
 * For example, if I want the default timeout to be 20 seconds:
 */
$engine = new Engine(20);

// By default, the timeout is 10 (10 seconds).

Bootstrapping The InputDetail Class

An InputDetail instance describes an input tag of a form. Such an input tag can look as follows:

<input name='name'  type='text' value='' placeholder='Input Your Name'/>

The InputDetail class is used to pass field values for a form to the Engine class, and it is also what is returned when the Engine is asked to get the form inputs of a given page.

It contains several properties, including name, which refers to the name of the input in question; type, which refers to the type of the input; and placeholder, for the input's placeholder text.

We can initialize the InputDetail class as follows:

<?php
// First we need to add the namespace
use CrawlEngine\InputDetail;

/**
 * Now we initialize the InputDetail class.
 * In the example below, the InputDetail is initialized with a name,
 * which refers to the name of the input in question:
 */
$input = new InputDetail('full_name');

/**
 * The name parameter is always compulsory for the instantiation of any InputDetail object.
 * Other optional parameters include value and inputString:
 * value refers to the value of the input, while inputString is the entire HTML string of the input tag.
 * Here is an example:
 */
$input = new InputDetail(
    "full_name",
    "John Doe",
    "<input name='full_name' type='text' value='' placeholder='Input Your Name'/>"
);

/**
 * The purpose of the last parameter (inputString) is so all other values can be prefilled automatically.
 * So in this case I could just construct an InputDetail as follows:
 */
$input = new InputDetail("name", "", "<input name='name' type='text' value='Joe' placeholder='Input Your Name'/>");

/**
 * The other values would then be generated by the constructor, so:
 * $input->name is equal to 'name'
 * $input->value is equal to 'Joe'
 * $input->type is equal to 'text'
 * $input->placeholder is equal to 'Input Your Name'
 */

// You could also echo an InputDetail instance to display its properties:

echo $input;
// The above would display as follows:
/**
* Input Detail:
* Name: name
* Value: Joe
* Placeholder: Input Your Name
* Type: text
*/
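The attribute prefilling that inputString enables can be sketched in plain PHP with the built-in DOMDocument extension. This is only an illustration of the idea, not CrawlEngine's actual implementation, and parseInputTag is a hypothetical helper:

```php
<?php
// Hypothetical sketch: parse an input tag's attributes out of an
// inputString, similar in spirit to what InputDetail's constructor does.
function parseInputTag(string $inputString): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($inputString); // suppress warnings for the bare fragment

    $input = $doc->getElementsByTagName('input')->item(0);

    return [
        'name'        => $input->getAttribute('name'),
        'type'        => $input->getAttribute('type'),
        'value'       => $input->getAttribute('value'),
        'placeholder' => $input->getAttribute('placeholder'),
    ];
}

$parsed = parseInputTag("<input name='name' type='text' value='Joe' placeholder='Input Your Name'/>");
// $parsed['value'] is 'Joe', $parsed['type'] is 'text'
```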

Getting Input Tag Details from a Page Containing a Form

CrawlEngine has a way of accessing websites to analyze the input tags present. Say, for example, a website located at https://example.com/login has a page as shown:

<html>
    <head>
        <title>Example Site Login Page</title>
    </head>
    <body>
        <form method="POST" action="/login">
            <input name='email'  type='email' placeholder='Input Your Email'/>
            <input name='password'  type='password' placeholder='Input Your Password'/>
             <input type="hidden" name="_token" value="wi8AGQVAsR8sasNHcRFhgnVemspnNoRwmJfBQ0TH">
            <button type="submit" class="btn btn-primary">Login</button>
        </form>
    </body>
</html>

We could get an array of all the input tags contained in this page as follows:

<?php
// First we need to add the namespace
use CrawlEngine\Engine;

$inputs = (new Engine())->getLoginFields('https://example.com/login');

// $inputs would contain an array of all the input tags
// found in the first form element on the page at the given URI.
// Each one is an InputDetail instance,
// so we could display them as shown:

foreach($inputs as $input){
    echo $input;
}
// the above code would output as shown:

/**
* Input Detail:
* Name: email
* Value: 
* Placeholder: Input Your Email
* Type: email
*
* Input Detail:
* Name: password
* Value: 
* Placeholder: Input Your Password
* Type: password
*
* Input Detail:
* Name: _token
* Value: wi8AGQVAsR8sasNHcRFhgnVemspnNoRwmJfBQ0TH
* Placeholder:
* Type: hidden
*/

As mentioned earlier, this function returns the input details of the first form found on a page. If there is more than one form, for example:

<html>
    <head>
        <title>Example Site Login Page</title>
    </head>
    <body>
        <form method="POST" action="/userlogin">
            <input name='email'  type='email' placeholder='Input Your Email'/>
            <input name='password'  type='password' placeholder='Input Your Password'/>
             <input type="hidden" name="_token" value="wi8AGQVAsR8sasNHcRFhgnVemspnNoRwmJfBQ0TH">
            <button type="submit" class="btn btn-primary">Login</button>
        </form>

        <form method="POST" action="/adminlogin">
            <input name='email'  type='email' placeholder='Input Your Email'/>
            <input name='password'  type='password' placeholder='Input Your Password'/>
             <input type="hidden" name="_token" value="wi8AGQVAsR8sasNHcRFhgnVemspnNoRwmJfBQ0TH">
            <button type="submit" class="btn btn-primary">Login</button>
        </form>

    </body>
</html>

The function would only return the inputs from the first form element. If you want values from the second form, specify an additional second argument to the getLoginFields function as follows:

$inputs = (new Engine())->getLoginFields('https://example.com/login', 2);

The above code would fetch form details for the second form on the page.
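The idea behind that second parameter can be sketched with PHP's built-in DOMDocument: pick the n-th form on the page and read its inputs. This is a self-contained illustration with a hypothetical helper, not CrawlEngine's internals:

```php
<?php
// Sketch: pick the n-th <form> (1-based) on a page and collect its input names.
function formInputNames(string $html, int $formIndex = 1): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings for the bare fragment

    $form = $doc->getElementsByTagName('form')->item($formIndex - 1);

    $names = [];
    foreach ($form->getElementsByTagName('input') as $input) {
        $names[] = $input->getAttribute('name');
    }
    return $names;
}

$html = '<form action="/userlogin"><input name="email"/></form>'
      . '<form action="/adminlogin"><input name="admin_email"/></form>';

$firstForm  = formInputNames($html, 1); // inputs of the first form
$secondForm = formInputNames($html, 2); // inputs of the second form
```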

Resolving Requests with CrawlEngine

To make a request with CrawlEngine, one needs to know some things about the website being accessed: the URI of the form used to log in, the URI the form submits to, and the required fields in the form. Say, for example, the login form for a website is located at https://example.com/login and is structured as shown:

<html>
    <head>
        <title>Example Site Login Page</title>
    </head>
    <body>
        <form method="POST" action="/login">
            <input name='email'  type='email' placeholder='Input Your Email'/>
            <input name='password'  type='password' placeholder='Input Your Password'/>
             <input type="hidden" name="_token" value="wi8AGQVAsR8sasNHcRFhgnVemspnNoRwmJfBQ0TH">
            <button type="submit" class="btn btn-primary">Login</button>
        </form>
    </body>
</html>

Above is what a typical login form looks like. From it we can see that the URI the form submits to is https://example.com/login, and that we need a valid username and password to log in. We also see that the site generates a dynamic CSRF token to validate requests. You don't have to bother about this field, as CrawlEngine automatically takes care of it. You also don't have to bother about any field that has been pre-filled by the server, unless you wish to change it. When CrawlEngine makes its request, it fetches the form page, records all pre-filled input values, combines them with the ones you give it, and makes the request. So for the page above, we just have to give CrawlEngine a valid username and password.

The main function responsible for resolving requests is the resolveRequest method of the Engine class, and it is used as shown:

<?php
// First we need to add the namespaces
use CrawlEngine\InputDetail;
use CrawlEngine\Engine;

// We then create instances of the InputDetail class to carry the values the form needs:
$emailInput = new InputDetail('email', 'johndoe@mymail.com');
$passwordInput = new InputDetail('password', 'topSecretPassword');

// We could then arrange them in an array like so:
$formFields = [$emailInput, $passwordInput];

// We would then define the URI where the form page is found:
$formPageUri = 'https://example.com/login';
// And the submit URI:
$submitUri = 'https://example.com/login';
/**
* After logging in, we would need to retrieve information from some password-protected areas of the site.
* Let's say these areas are located at https://example.com/dashboard and https://example.com/transactions.
* We would define them as follows:
*/
$contentPagesUri = ['https://example.com/dashboard', 'https://example.com/transactions'];

// after which we can then make our request as shown:
$engine = new Engine();
$crawlers =  $engine->resolveRequest(
                $formPageUri,
                $submitUri,
                $formFields,
                $contentPagesUri
            );
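The pre-fill behaviour described earlier, where values scraped from the form page are combined with the fields you supply, can be sketched in plain PHP. This is only an illustration of the merge idea, not CrawlEngine's source:

```php
<?php
// Sketch: combine pre-filled form values (scraped from the page) with
// user-supplied ones. User values win on conflict; untouched pre-filled
// fields (like the CSRF _token) pass through unchanged.
$prefilled = [
    'email'    => '',
    'password' => '',
    '_token'   => 'wi8AGQVAsR8sasNHcRFhgnVemspnNoRwmJfBQ0TH', // scraped from the page
];

$userFields = [
    'email'    => 'johndoe@mymail.com',
    'password' => 'topSecretPassword',
];

// array_merge: later arrays overwrite earlier keys for matching string keys,
// so user values take priority while _token survives untouched.
$payload = array_merge($prefilled, $userFields);
```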

That's all you have to do; CrawlEngine does the rest of the magic. It visits the site, takes your given details along with any pre-filled ones found on the site that you didn't overwrite, and submits the form. Then, while logged in like a normal user, it accesses all the contentPagesUri pages and brings each entire page back as a crawler object. Let's say, for example, the https://example.com/dashboard page is as follows:

<html>
<head>
    <title>Example Site Dashboard Page</title>
</head>
<body>
    <section class="main-content">
        <div>
            <span id="user-number">200432234233</span>
            <span id="user-address">No.3 Washington Avenue, California</span>
            <span id="user-email">johndoe@mymail.com</span>
        </div>
    </section>

</body>
</html>

The resolveRequest function returns an array of crawlers, one for each of the content pages given. So for our request above:

// $crawlers[0] will contain crawler object for https://example.com/dashboard
// $crawlers[1] will contain crawler object for https://example.com/transactions

// so I can then access values from the page as shown:

echo $crawlers[0]->filterXPath('//body/section/div/span')->text(); // this would output: '200432234233'
//or this:
echo $crawlers[0]->filter('body > section > div > span')->text(); // this would also output: '200432234233'
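To see what that XPath query is doing under the hood, here is the same query run with PHP's built-in DOMXPath, with the dashboard markup inlined so the sketch is self-contained (in practice, the crawler fetches this page for you):

```php
<?php
// Self-contained illustration of the XPath query from the crawler example.
$html = '<html><body><section class="main-content"><div>'
      . '<span id="user-number">200432234233</span>'
      . '</div></section></body></html>';

$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress parser warnings

$xpath = new DOMXPath($doc);

// Same query shape as the crawler example above.
$value = $xpath->query('//body/section/div/span')->item(0)->textContent;
// $value is '200432234233'
```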

For more information on crawlers and how to access different values in a page, check out the DomCrawler documentation.

By default, CrawlEngine searches for input fields in the first form it finds on the page containing the form. If there is more than one form on the login page it accesses, like the following:

<html>
    <head>
        <title>Example Site Login Page</title>
    </head>
    <body>
        <form method="POST" action="/userlogin">
            <input name='email'  type='email' placeholder='Input Your Email'/>
            <input name='password'  type='password' placeholder='Input Your Password'/>
             <input type="hidden" name="_token" value="wi8AGQVAsR8sasNHcRFhgnVemspnNoRwmJfBQ0TH">
            <button type="submit" class="btn btn-primary">Login</button>
        </form>

        <form method="POST" action="/adminlogin">
            <input name='email'  type='email' placeholder='Input Your Email'/>
            <input name='password'  type='password' placeholder='Input Your Password'/>
             <input type="hidden" name="_token" value="wi8AGQVAsR8sasNHcRFhgnVemspnNoRwmJfBQ0TH">
            <button type="submit" class="btn btn-primary">Login</button>
        </form>

    </body>
</html>

then by default CrawlEngine references the first form, so the CSRF token and other pre-filled inputs are taken from it. To specify that the request is for the second form, add an extra parameter to the resolveRequest method as follows:

$crawlers =  $engine->resolveRequest(
                $formPageUri,
                $submitUri,
                $formFields,
                $contentPagesUri,
                2
            );

The above tells CrawlEngine that you are not referring to the first form on the page but the second one.
