Skip to content

2naive/AngryCurl

master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
 
 
 
 
 
 

AngryCurl

  • used for parsing information from remote resourse using user-predefined amount of simultaneous connections over proxies-list.

Basic information

Depencies:

  • PHP 5 >= 5.1.0
  • RollingCurl
  • cURL

Use cases:

  • multi-threaded parsing over proxy
  • overcoming simple parsing protection by using User-Agent header and proxy-lists
  • proxy list checking
  • validating proxies' response

Main features

  • loading proxy-list from file or array
  • removing duplicates
  • filtering alive proxies
  • checking if proxy given response content is correct
  • loading useragent-list from file or array
  • changing proxy/useragent "on the fly"
  • preventing direct connections without any proxy/useragent if such options are set
  • multi-thread connections
  • callback functions
  • working with chains of requests
  • web-console mode
  • logging

Documentation

Preferred environment configuration

  • PHP as Apache module
  • safe_mode Off
  • open_basedir is NOT set
  • PHP cURL installed
  • gzip Off

Basic usage

require("RollingCurl.class.php");
require("AngryCurl.class.php");

function my_callback($response, $info, $request)
{
    // callback function here
}

// sending callback function name as param
$AC = new AngryCurl('my_callback');
// initializing console-style output
$AC->init_console();


// Importing proxy and useragent lists, setting regexp, proxy type and target url for proxy check
// You may also import proxy from an array as simple as $AC->load_proxy_list($proxy array);
$AC->load_proxy_list(
    // path to proxy-list file
    'proxy_list.txt',
    // optional: number of threads
    200,
    // optional: proxy type
    'http',
    // optional: target url to check
    'http://google.com',
    // optional: target regexp to check
    'title>G[o]{2}gle'
);
// You may also import useragents from an array as simple as $AC->load_useragent_list($proxy array);
$AC->load_useragent_list('useragent_list.txt');

while(/* */)
{
    $url = /**/;
    // adding URL to queue
    $AC->get($url);
    
    // you may also use 
    // $AC->post($url, $post_data = null, $headers = null, $options = null);
    // $AC->get($url, $headers = null, $options = null);
    // $AC->request($url, $method = "GET", $post_data = null, $headers = null, $options = null);
    // as well
}

// setting amount of threads and starting connections
$AC->execute(200);

// if console_mode is off
//AngryCurl::print_debug(); 

unset($AC);

cURL options

You may also pass cURL options for each url before adding to queue like here:

// Define HTTP headers (CURLOPT_HTTPHEADER) if needed, or just set to NULL
$headers = array('Content-type: text/plain', 'Content-length: 100');
// Define cURL options (will be passed through curl_setopt_array) if needed, or just set to NULL
$options = array(CURLOPT_HEADER => true, CURLOPT_NOBODY => true);
// Define post-data array to send in case of POST method, or just set to NULL
$post_data = array('param' => 'value');

// Add request
$AC->get($url, $headers, $options);
// or
$AC->post($url, $post_data, $headers, $options) ;
// or
$AC->request($url, $method = "GET", $post_data, $headers, $options);

// ATTENTION: temporary "on-the-fly" proxy/useragents lists are not
// working with AngryCurlRequest. Keep it in mind if you will use code below
// as alternative to written above.

$request = new AngryCurlRequest($url);
// $url, $method, $post_data, $headers, $options - public properties of AngryCurlRequest
$request->options = array(CURLOPT_HEADER => true, CURLOPT_NOBODY => true);
$AC->add($request);

Because this class is kind of extension of RollingCurl class you may use any constructions RollingCurl has. For other information read here: http://code.google.com/p/rolling-curl/source/browse/trunk/

TODO

  • chains of requests
  • stop on error_limit exceed
  • better documentation and examples

Credits

You may join this class discussion here: http://stupid.su/php-curl_multi/ Any questions, change requests and other things you may send to my email written in class comments.

Thank you for reading.

  • naive

About

AngryCurl - Anonymized Rolling Curl class, used for parsing information from remote resourse using user-predefined amount of simultaneous connections over proxies-list.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages