## URL Grabber

This assignment develops a simple URL grabber. The program takes two file paths as input: one file contains URLs and the other contains regular expressions. It checks each URL, fetches the page content, and matches that content against the given regular expressions.

- docopt parses the two command-line arguments, the file names url_file and regex_file
- pandas reads the URL file, on the assumption that the list of URLs may be large
- multi-processing assigns workers that search the fetched content of each URL for matches against the regex list.
    - To log the output, each worker puts a dictionary on a multi-threading queue
    - To read a URL, urllib opens the web page and reads its content
    - Workers are assigned as follows:
        - 1) a URL is selected from the database, and its content is fetched if the URL exists
        - 2) whenever a worker is free, the next regex is selected from the list and checked against the content of that URL
        - 3) once every regex in the list has been assigned to a worker, go back to step 1 if there are more URLs to check; otherwise stop
- Ctrl+C is handled with try-except
    - first: I needed to import large libraries such as pandas and urllib inside the function. That way I could catch the error and log a clean message when Ctrl+C is pressed
    - second: Ctrl+C might happen in the middle of reading from the database, which reported an error. To avoid that, I used os.kill to kill the process by sending Ctrl+Break
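
The worker assignment steps above can be sketched roughly as follows. This is only an illustration, not the code in solution.py: the names `match_one` and `process_pages` are hypothetical, the `pages` dictionary stands in for the fetch step so that no network access is needed, and a thread-backed pool (`multiprocessing.pool.ThreadPool`) is swapped in for real processes so that the sketch runs anywhere. With separate processes, a process-safe queue would be needed instead of `queue.Queue`. The log keys mirror some of those in the output shown further down; the `error` field is left out because its formula is not described here.

```python
import queue
import re
import time
from multiprocessing.pool import ThreadPool


def match_one(text, pattern, url, log_queue):
    """Search `text` for `pattern` and put a log dictionary on the queue."""
    start = time.perf_counter()
    found = re.search(pattern, text)
    log_queue.put({
        "url": url,
        "RE": pattern,
        "matched-string": found.group(0) if found else "",
        "deltatime": time.perf_counter() - start,
    })


def process_pages(pages, patterns, max_num_workers=5):
    """Check every pattern against every page and return the collected logs.

    `pages` maps url -> already fetched text (an assumption of this sketch).
    """
    log_queue = queue.Queue()
    with ThreadPool(processes=max_num_workers) as pool:
        for url, text in pages.items():        # step 1: take the next url
            jobs = [
                pool.apply_async(match_one, (text, pattern, url, log_queue))
                for pattern in patterns        # step 2: one task per regex
            ]
            for job in jobs:                   # step 3: wait, then move on
                job.get()
    logs = []
    while not log_queue.empty():               # drain the log queue
        logs.append(log_queue.get())
    return logs
```

Calling `process_pages({"u1": "Hello world"}, ["Hello(.*)", "welcome(.*)"])` returns one log dictionary per (url, regex) pair. With more than one worker, the order of the logs within a URL is not guaranteed.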

In [5]:
from function_utils import is_match, fetch_text_from_url

# Test is_match: the returned log holds the matched string if one exists, otherwise a nonzero error value
print("Matching log between ABCD and AB:", is_match("ABCD", "AB"))
print("Matching log between EBCD and AB:", is_match("EBCD", "AB"))

text, url_exist = fetch_text_from_url("https://sites.google.com/view/kourosh-naderi/home")
print("https://sites.google.com/view/kourosh-naderi/home", "--- Len: ", len(text),  "--- exists: ", url_exist)
text, url_exist = fetch_text_from_url("www.not_existed_site.com")
print("www.not_existed_site.com", "--- Len: ", len(text),  "--- exists: ", url_exist)

Matching log between ABCD and AB: {'worker_id': 0, 'time': 'Thu Nov 28 01:11:56 2019', 'RE': 'AB', 'error': 0, 'matched-string': 'AB', 'deltatime': 0.0}
Matching log between EBCD and AB: {'worker_id': 0, 'time': 'Thu Nov 28 01:11:56 2019', 'RE': 'AB', 'error': 0.6666666666666667, 'matched-string': '', 'deltatime': 0.0}
https://sites.google.com/view/kourosh-naderi/home --- Len:  60514 --- exists:  True
www.not_existed_site.com --- Len:  0 --- exists:  False
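
For readers without access to function_utils, a fetch helper with the same return shape (text plus an existence flag) might look like the sketch below. It is an assumption about how such a helper could be written with urllib, not the actual `fetch_text_from_url`; the name `fetch_text_sketch`, the `timeout` parameter, and the UTF-8 decoding are all choices made for this example.

```python
from http.client import HTTPException
from urllib.request import urlopen


def fetch_text_sketch(url, timeout=10):
    """Return (text, exists); text is empty when the page cannot be read."""
    try:
        with urlopen(url, timeout=timeout) as response:
            raw = response.read()
    except (ValueError, OSError, HTTPException):
        # ValueError: the string is not a usable url (e.g. missing scheme).
        # OSError: covers urllib.error.URLError, HTTPError and timeouts.
        return "", False
    # Decode leniently so that unexpected bytes do not raise (assumption).
    return raw.decode("utf-8", errors="replace"), True
```

A URL without a scheme, such as "www.not_existed_site.com", makes `urlopen` raise `ValueError`, so the sketch reports it as not existing with an empty text, which matches the behavior seen in the output above.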


In [6]:
from solution import multi_process_urls


print("------------testing with 1 worker, expecting to see the url processing in order--------") 
multi_process_urls("url_file.txt", "regex_file.txt", _max_num_workers=1)

print("------------testing with 5 workers, expecting to see the url processing in random order--------") 
multi_process_urls("url_file.txt", "regex_file.txt", _max_num_workers=5)

------------testing with 1 worker, expecting to see the url processing in order--------
.... started processing the urls .....
processing url:  https://en.wikipedia.org/wiki/Python_(programming_language)
{'worker_id': 0, 'time': 'Thu Nov 28 01:17:12 2019', 'RE': 'welcome(.*)', 'error': 0.999955581438964, 'matched-string': '', 'deltatime': 0.272302, 'url': 'https://en.wikipedia.org/wiki/Python_(programming_language)', 'url-code': 0}
{'worker_id': 0, 'time': 'Thu Nov 28 01:17:13 2019', 'RE': 'Hello(.*)', 'error': 0, 'matched-string': 'Hello,_World!%22_program" title="&quot;Hello, World!&quot; program">Hello world</a> program:', 'deltatime': 0.0, 'url': 'https://en.wikipedia.org/wiki/Python_(programming_language)', 'url-code': 0}
{'worker_id': 0, 'time': 'Thu Nov 28 01:17:13 2019', 'RE': 'This is Me(.*)', 'error': 0.9999407756903951, 'matched-string': '', 'deltatime': 0.365017, 'url': 'https://en.wikipedia.org/wiki/Python_(programming_language)', 'url-code': 0}
{'worker_id': 0, 'time': 'T

## Conclusions

After testing the functions, we can see that multi-processing can improve the processing of URLs. Especially with large sets of URLs, multi-processing helps avoid the program appearing frozen, because results are reported as soon as any worker finishes its task.

The effect of Ctrl+C was tested on the console. Using try-except together with os.kill avoids the deadlock and the printed errors, and gives the programmer control over how the signal is handled.
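
The combination of try-except and os.kill can be illustrated with a minimal sketch. The function names `run_with_interrupt_handling` and `stop_current_process` are hypothetical and the structure is an assumption, not the code in solution.py. `signal.CTRL_BREAK_EVENT` exists only on Windows, so the sketch falls back to `SIGTERM` elsewhere.

```python
import os
import signal


def run_with_interrupt_handling(work, on_interrupt=None):
    """Run `work()`; on Ctrl+C print a short message instead of a traceback."""
    try:
        work()
        return "finished"
    except KeyboardInterrupt:
        print(".... interrupted by user, shutting down ....")
        if on_interrupt is not None:
            on_interrupt()              # e.g. flush logs or stop the process
        return "interrupted"


def stop_current_process():
    """Hard stop for the current process (hypothetical helper)."""
    if hasattr(signal, "CTRL_BREAK_EVENT"):      # only defined on Windows
        os.kill(os.getpid(), signal.CTRL_BREAK_EVENT)
    else:
        os.kill(os.getpid(), signal.SIGTERM)
```

A caller could pass `stop_current_process` as `on_interrupt` when a blocked read would otherwise keep the program alive after Ctrl+C; passing nothing lets the function return normally so that cleanup code can run.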