A Simple Web Crawler using Python threading, multiprocessing and concurrent.futures

I'm interested in how threading in Python works, so I did some analysis and decided to write a simple program using threads. A web crawler is a good example of a program that benefits from threads. crawler.py implements a crawler using threading and a crawler using multiprocessing.dummy. crawler_concurrent.py implements a crawler using the concurrent.futures module.

Description

I decided to crawl the data at http://my.fit.edu/~vkepuska/Android%20Programming/. In crawler.py I first create a class that extends threading.Thread and, much like in Java, override its run method:

import threading
import urllib.request as urlreq


class Downloader(threading.Thread):
	def __init__(self, queue):
		threading.Thread.__init__(self)
		self.queue = queue

	def run(self):
		while True:
			msg = self.queue.get()
			# a plain 'quit' string is the sentinel telling the worker to stop
			if isinstance(msg, str) and msg == 'quit':
				break
			fdir, url = msg
			print('downloading %s' % url)
			with open(fdir, 'wb') as f:
				f.write(urlreq.urlopen(url).read())
			print('%s downloaded' % url)
			self.queue.task_done()

The self.queue member variable is a multi-producer, multi-consumer queue used for safely exchanging information between threads. It already implements all the locking semantics internally, so we don't have to add synchronization manually the way we add the synchronized keyword in the classic producer-consumer problem written in Java.
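
As a minimal, self-contained sketch of that hand-off pattern (toy payloads here, not the crawler's real URLs), a few consumer threads can drain a single Queue safely without any explicit locks:

import threading
from queue import Queue

q = Queue()

def consumer():
	while True:
		item = q.get()
		if item == 'quit':  # same sentinel idea as Downloader
			break
		print('processing', item)
		q.task_done()

workers = [threading.Thread(target=consumer) for _ in range(4)]
for w in workers:
	w.start()

for i in range(20):  # the producer side
	q.put(i)
q.join()  # blocks until every item has been marked task_done()

for w in workers:  # shut the consumers down
	q.put('quit')
for w in workers:
	w.join()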

Now let's move on to the crawling method, where the crawling operation happens. I manually create a thread pool of 8 threads; later you will see that there is a more convenient way of creating a thread pool. As you can see, the queue variable is passed to every thread in the pool to allow information exchange.

	downloaders = []
	queue = Queue()
	for i in range(8):
		dl = Downloader(queue)
		dl.start()
		downloaders.append(dl)

This is a simple depth-first search. Whenever I encounter a resource, I put its URL into the queue. The crawling method acts as the producer, continuously producing resource links, and the threads in the pool act as consumers that repeatedly fetch a URL from the queue and download its resource.

	while len(stack) != 0:  # DFS
		topUrl = stack.pop()
		urltoken = topUrl.split('/')
		fdir = homDir + '/' + topUrl[topUrl.find('Android%20Programming/'):]  # dir in file system
		if urltoken[-1] == '':  # a directory path
			if not os.path.exists(fdir):
				os.mkdir(fdir)
			response = urlreq.urlopen(topUrl)
			soup = BeautifulSoup(response.read(), 'lxml')
			for link in soup.find_all('a'):
				href = link.get('href')
				# skip anchors without an href and links that lead out of the current listing
				if href and not href.startswith('/~vkepuska') and not href.startswith('?'):
					stack.append(topUrl + href)
		else:  # a file: hand it off to a downloader thread
			queue.put((fdir, topUrl))
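
The loop above is only the producer side; once the DFS finishes, the downloader threads still need to be told to exit. A minimal sketch of the shutdown, assuming the queue and downloaders variables from before (the exact code in crawler.py may differ):

	queue.join()  # wait until every queued download has been marked task_done()
	for dl in downloaders:  # one 'quit' sentinel per worker thread
		queue.put('quit')
	for dl in downloaders:
		dl.join()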

If the task of each thread is highly self-contained, there is an even more convenient way to create a thread pool: the concurrent.futures module. In crawler_concurrent.py I implement the crawling method as follows:

	with concurrent.futures.ThreadPoolExecutor(max_workers = 8) as executor:
		while len(stack) != 0:  # DFS
			topUrl = stack.pop()
			urltoken = topUrl.split('/')
			fdir = homDir + '/' + topUrl[topUrl.find('Android%20Programming/'):]  # dir in file system
			if urltoken[-1] == '':  # a directory path
				print(topUrl)
				if not os.path.exists(fdir):
					os.mkdir(fdir)
				response = urlreq.urlopen(topUrl)
				soup = BeautifulSoup(response.read(), 'lxml')
				for link in soup.find_all('a'):
					href = link.get('href')
					# skip anchors without an href and links that lead out of the current listing
					if href and not href.startswith('/~vkepuska') and not href.startswith('?'):
						stack.append(topUrl + href)
			else:  # a file: submit a download task to the pool
				executor.submit(download, fdir, topUrl)

As you can see, I can create a thread pool with one line, with concurrent.futures.ThreadPoolExecutor(max_workers = 8) as executor:. Then, during the crawl, whenever I encounter a resource I submit it to the executor with executor.submit(download, fdir, topUrl), where download is a callable in which the downloading happens.
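
The download callable itself is not shown above; a minimal sketch of what it might look like, assuming the same urllib.request usage as Downloader.run (the version in crawler_concurrent.py may differ):

def download(fdir, url):
	print('downloading %s' % url)
	with open(fdir, 'wb') as f:
		f.write(urlreq.urlopen(url).read())
	print('%s downloaded' % url)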

What's even more amazing is that by changing with concurrent.futures.ThreadPoolExecutor(max_workers = 8) as executor: to with concurrent.futures.ProcessPoolExecutor(max_workers = 8) as executor: I can switch my crawler to multiprocessing mode.
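
One general caveat with ProcessPoolExecutor (a property of the module, not specific to this project): the callable and arguments passed to executor.submit must be picklable, which is why download should be a plain module-level function. The swap itself is a one-word change:

	# thread-based pool
	with concurrent.futures.ThreadPoolExecutor(max_workers = 8) as executor:
		...

	# process-based pool: same API, but tasks run in separate worker processes
	with concurrent.futures.ProcessPoolExecutor(max_workers = 8) as executor:
		...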

Due to the Global Interpreter Lock (GIL) in Python, Python threads do not allow real parallelism: multiple threads in the same process cannot execute Python bytecode at the same time. What threads in Python support is concurrency; in other words, CPU time is switched between multiple threads. This is still a good fit for an I/O-bound task like downloading, because the GIL is released while a thread is blocked on network or disk I/O. To achieve real parallelism for CPU-bound work in Python, the multiprocessing module (or ProcessPoolExecutor) should be used.
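
A small standalone experiment (not part of this repository) makes the difference visible on a CPU-bound function: the thread pool finishes in roughly the time of a serial run, while the process pool can spread the work across cores:

import concurrent.futures
import time

def burn(n):
	# pure-Python CPU work: the GIL is held the whole time
	total = 0
	for i in range(n):
		total += i * i
	return total

if __name__ == '__main__':
	work = [5_000_000] * 8
	for Pool in (concurrent.futures.ThreadPoolExecutor, concurrent.futures.ProcessPoolExecutor):
		start = time.perf_counter()
		with Pool(max_workers=4) as executor:
			list(executor.map(burn, work))
		print(Pool.__name__, time.perf_counter() - start)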

Results

Built With

Contributing

Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.

License

Acknowledgments