
0xcaffebabe/Spider


Spider

2019-1-10: project started

Note: the project targets JDK 11 and uses some features introduced in JDK 11, so running it on an older JDK may cause errors.

2019-1-15

A simple application:

    private static Spider spider = new Spider();
    private static Map<String,String> map = new HashMap<>();

    public static void main(String[] args) {
        // Log every request that times out.
        spider.setConnectionTimeOutEvent((spider1, request) -> {
            System.out.println(request.getUrl() + " timed out");
        });
        Request request = new Request()
                .url("http://dytt8.net/")
                .method(RequestMethods.GET);
        request(request);
    }

    public static void request(Request request) {
        spider.request(request, response -> {
            // Print the links found inside the matching table cells.
            response.toTextResponse("gb2312")
                    .css("td[style=WORD-WRAP: break-word] a")
                    .forEach(e -> {
                        System.out.println(e.attr("href"));
                    });
            // Follow every link on the page; links that do not start with
            // "http://" are prefixed with the site root.
            response.toTextResponse("gb2312")
                    .css("a")
                    .forEach(e -> {
                        String url = null;
                        if (!e.attr("href").startsWith("http://")) {
                            url = "http://dytt8.net" + e.attr("href");
                        } else {
                            url = e.attr("href");
                        }
                        Request subRequest = new Request()
                                .url(url);
                        request(subRequest);
                    });
        });
    }
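
The example above builds absolute URLs by string concatenation and follows every link it finds, including ones it has already requested. As a rough sketch of one way to handle both points, the helper below resolves links with `java.net.URI` and remembers which URLs have been seen. `LinkFrontier`, `resolve`, and `markVisited` are hypothetical names for illustration and are not part of Spider.

```java
import java.net.URI;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of Spider: resolves links and tracks visited URLs.
class LinkFrontier {
    // Thread-safe set, in case response callbacks run on more than one thread.
    private final Set<String> visited = ConcurrentHashMap.newKeySet();

    // Resolves an href against the URL of the page it was found on.
    // Absolute hrefs are returned unchanged; a malformed href makes
    // URI.create/resolve throw IllegalArgumentException.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href.trim()).toString();
    }

    // Returns true only the first time a given URL is seen.
    boolean markVisited(String url) {
        return visited.add(url);
    }
}
```

Inside the `forEach` callback, the crawler could call `LinkFrontier.resolve(...)` on each `href` and issue the sub-request only when `markVisited(url)` returns true.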


Extension:

You can implement this interface:

public interface ResponseProcessChain {

    void process(Request request, Response response, Spider spider);
}

Register implementations in the Spider constructor:

public Spider() {
    responseProcessor.registerProcessChain(new WebNotFoundProcessChain());
    responseProcessor.registerProcessChain(new MovedTemporarilyProcessChain());
}

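After a request made through request completes, the registered processors are called one after another, in the order they were registered.

Users can modify the request, response, and spider objects in a processor as their needs require.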
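
To make the registration-order behavior concrete, here is a minimal, self-contained sketch. The `Request`, `Response`, and `Spider` classes are stripped-down stand-ins for the project's types, `NotFoundLogger` is a hypothetical processor, and the `processAll` dispatch loop is an assumption about how the registered chains are invoked, not the project's actual code.

```java
import java.util.ArrayList;
import java.util.List;

class ProcessChainSketch {
    // Minimal stand-ins for the project's types; the real classes have more members.
    static class Request {
        final String url;
        Request(String url) { this.url = url; }
    }

    static class Response {
        final int status;
        Response(int status) { this.status = status; }
    }

    static class Spider { }

    // Same shape as the interface shown above.
    interface ResponseProcessChain {
        void process(Request request, Response response, Spider spider);
    }

    // Hypothetical processor: records the URLs that came back with HTTP 404.
    static class NotFoundLogger implements ResponseProcessChain {
        final List<String> missing = new ArrayList<>();

        @Override
        public void process(Request request, Response response, Spider spider) {
            if (response.status == 404) {
                missing.add(request.url);
            }
        }
    }

    // Assumed dispatch logic: call each chain in the order it was registered.
    static class ResponseProcessor {
        private final List<ResponseProcessChain> chains = new ArrayList<>();

        void registerProcessChain(ResponseProcessChain chain) {
            chains.add(chain);
        }

        void processAll(Request request, Response response, Spider spider) {
            for (ResponseProcessChain chain : chains) {
                chain.process(request, response, spider);
            }
        }
    }
}
```

Because the interface has a single method, a processor can also be registered as a lambda, for example `processor.registerProcessChain((req, resp, sp) -> { ... })`.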

2->

spider.setConnectionTimeOutEvent((spider1, request) -> {
    System.out.println(request.getUrl() + " timed out");
});

You can pass this function an event that implements the following interface:

public interface ConnectionTimeOutEvent {

    void onTimeOut(Spider spider, Request request);
}

When a request times out, this event is invoked.
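
The sketch below shows one way such a callback could be wired up. It is an illustration under assumptions, not the project's implementation: `Request` and `Spider` are minimal stand-ins, `Transport` is a hypothetical abstraction added so the example runs without a network, and the timeout is assumed to surface as `java.net.SocketTimeoutException`, which is what `URLConnection` throws once connect or read timeouts are set.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.SocketTimeoutException;

class TimeoutSketch {
    // Minimal stand-in for the project's Request type.
    static class Request {
        private final String url;
        Request(String url) { this.url = url; }
        String getUrl() { return url; }
    }

    // Same shape as the interface shown above.
    interface ConnectionTimeOutEvent {
        void onTimeOut(Spider spider, Request request);
    }

    // Hypothetical transport abstraction; a real one would open a connection.
    interface Transport {
        String fetch(String url) throws IOException;
    }

    // Minimal stand-in for the project's Spider type.
    static class Spider {
        private final Transport transport;
        private ConnectionTimeOutEvent timeOutEvent = (s, r) -> { }; // no-op default

        Spider(Transport transport) { this.transport = transport; }

        void setConnectionTimeOutEvent(ConnectionTimeOutEvent event) {
            this.timeOutEvent = event;
        }

        // Returns the response body, or null if the request timed out.
        String request(Request request) {
            try {
                return transport.fetch(request.getUrl());
            } catch (SocketTimeoutException e) {
                // A timeout is reported to the registered event instead of failing.
                timeOutEvent.onTimeOut(this, request);
                return null;
            } catch (IOException e) {
                // Any other I/O failure is rethrown unchecked.
                throw new UncheckedIOException(e);
            }
        }
    }
}
```

Passing the `Spider` instance to the callback lets a handler react to the timeout, for example by re-queuing the same request.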

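For large-scale crawling, the following code may become a performance bottleneck, because it opens a new URLConnection for every request: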

public URLConnection send(String url, Map<String,String> headers) throws IOException {
    URLConnection connection = new URL(url).openConnection();
    for (Map.Entry<String,String> header : headers.entrySet()) {
        connection.setRequestProperty(header.getKey(), header.getValue());
    }
    return connection;
}

Connection reuse will be considered later; at this early stage, performance is not a concern.
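
Since the project already requires JDK 11, one possible direction for connection reuse is `java.net.http.HttpClient`, which keeps connections alive and reuses them when a single client instance is shared. The following is only a sketch of that option: `ReusingSender`, `buildRequest`, and the 10-second connect timeout are assumptions for illustration, not the project's code.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Map;

// Hypothetical alternative to send(); not the project's current code.
class ReusingSender {
    // One shared client for all requests, so its connection pool can be reused.
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10)) // assumed value
            .build();

    // Builds a GET request carrying the given headers.
    static HttpRequest buildRequest(String url, Map<String, String> headers) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url)).GET();
        headers.forEach(builder::header);
        return builder.build();
    }

    // Sends the request and returns the raw bytes, leaving charset
    // decoding (for example gb2312) to the caller.
    static HttpResponse<byte[]> send(String url, Map<String, String> headers)
            throws IOException, InterruptedException {
        return CLIENT.send(buildRequest(url, headers),
                HttpResponse.BodyHandlers.ofByteArray());
    }
}
```

Note that `HttpRequest.Builder.header` rejects a few restricted header names such as `Host` and `Connection`, so a header map written for `URLConnection` may need filtering first.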

About

A Java web crawler project
