Simplify logging via custom context manager text writer
- python standard library's logging package doesn't work for this use case
  - use case:
    - log output from multiple ListCreator instances running on different threads to DIFFERENT files
      - tried running 2 instances of ListCreator on 2 different threads
      - program scraping logic executes properly
      - HOWEVER, content of log files is scrambled
        - instead of thread 1 writing only to file 1 and thread 2 writing only to file 2:
          - thread 1 writes to both file 1 AND file 2
          - thread 2 writes to both file 2 AND file 1
          - the pattern of a specific thread writing to different files was not readily obvious
            - it DID seem like the first thread wrote only to its specified file UNTIL the second thread was created, after which the threads mixed their output
      - the logging package is documented as thread safe and usable from multiple threads without additional configuration
        - however, no variation of attribute changes, object creation, or logging.basicConfig() modification seemed to avoid the problem of writing to multiple files instead of the ONE designated file for that specific thread
        - the problem might be that the logging package makes logging state GLOBAL so that all modules in a python package can log to the same handlers (?)
          - logging state may be global to a single python process (?)
          - this would explain why both threads wrote to both open files instead of only the file specified for their own thread
          - ESSENTIALLY
            - the logging from thread 1 initially sets the output to be file 1
            - logging from thread 1 writes to only file 1
            - thread 2 is created and the logging package sets thread 2 to write to file 2
              - HOWEVER, since logging is global, thread 2 is essentially ADDING file 2 to the "list of logging output locations," therefore
                - thread 2 writes to file 2 AND file 1
                - thread 1 writes to file 1 AND file 2
                  - BECAUSE: logging package NOW takes anything that should be logged and logs it to EVERY LOCATION that is added to the "list of logging output locations" using the .addHandler() logging method
        - the logging package's thread-safety claim makes sense for a multi-threaded application that needs to log ALL info into the SAME file(s) EVEN with the application running on multiple threads
        - however, this use case is different: the goal is NOT to write the info to the same file(s) across different threads, but to write only to the SPECIFIC file designated by each specific thread
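        - the handler accumulation described above can be reproduced even without threads, since .addHandler() attaches handlers to the shared root logger (minimal sketch, not code from this repository):

```python
import logging
import os
import tempfile

# two FileHandlers added to the SAME global root logger: every record
# is emitted to EVERY attached handler, not just the most recent one
log = logging.getLogger()
log.setLevel(logging.INFO)

fd_1, file_1 = tempfile.mkstemp(suffix='.log')
os.close(fd_1)
fd_2, file_2 = tempfile.mkstemp(suffix='.log')
os.close(fd_2)

log.addHandler(logging.FileHandler(file_1))  # "thread 1" designates file 1
log.info('message 1')                        # written to file 1 only (so far)

log.addHandler(logging.FileHandler(file_2))  # "thread 2" ADDS file 2 to the handler list
log.info('message 2')                        # written to file 1 AND file 2

for handler in list(log.handlers):           # close and detach handlers so later runs start fresh
    handler.close()
    log.removeHandler(handler)

with open(file_1) as f_1, open(file_2) as f_2:
    contents_1, contents_2 = f_1.read(), f_2.read()

print('message 2' in contents_1)  # True: file 1 also received the second message
print('message 1' in contents_2)  # False: file 2 only existed for the second message
```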
- workaround: provide the output location to any function that logged any information
  - used @contextlib.contextmanager (to avoid manually defining the __enter__() and __exit__() methods) so a FUNCTION can be used like a file-like resource with the "with" keyword
    - trying to run

```
with open(filename, 'a', encoding='utf-8') if log_file else sys.stdout as logging_output_location:
    logging_output_location.writelines(content)
```

    - did not seem like best practice, since the with statement closes sys.stdout on exit (when log_file is falsy), so trying to run these lines more than once resulted in:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: I/O operation on closed file.
```

    - this might be okay in typical use cases if the user exits the python interpreter after running the ListCreator().create_list_for() method ONCE
    - but might cause problems if the user tries to run the method again OR tries to print() anything in the same python process (without restarting the interpreter), since sys.stdout is now closed
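    - this failure mode is easy to reproduce with any text stream standing in for sys.stdout (sketch, not code from this repository):

```python
import io

stream = io.StringIO()  # stand-in for sys.stdout

with stream as logging_output_location:           # the with statement CLOSES the stream on exit
    logging_output_location.write('first run\n')  # works the first time

try:
    stream.write('second run\n')  # any later write (or print(), if the stream were sys.stdout)
except ValueError as error:       # fails because the stream is now closed
    print(type(error).__name__)   # ValueError
```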
  - also tried passing the already-open sys.stdout stream to open(), which expects a path rather than a file object:

```
with open(sys.stdout) as f:
     f.write('some text')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: expected str, bytes or os.PathLike object, not _io.TextIOWrapper
```

  - tried to "yield" a file object from a function without the @contextlib.contextmanager decorator:

```
def stdout_writer():
    yield sys.stdout

def file_writer(filename):
    with open(filename, 'a', encoding='utf-8') as file:
        yield file

with file_writer(filename) if some_truthy_value else stdout_writer() as out:
    out.write('haha')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: __enter__
```

  - decorating both functions with @contextlib.contextmanager addressed AttributeError: __enter__

```
@contextlib.contextmanager
def stdout_writer():
    yield sys.stdout

@contextlib.contextmanager
def file_writer(filename):
    with open(filename, 'a', encoding='utf-8') as file:
        yield file

with file_writer(filename) if True else stdout_writer() as out:
    out.write('haha')

with file_writer(filename) if False else stdout_writer() as out:
    out.write('haha')
haha4
```

  - as seen above, the haha4 output is the string 'haha' written to sys.stdout (with no trailing newline) immediately followed by the interactive interpreter echoing the return value of .write(content), which is always the NUMBER of characters written
    - the .writelines(content) method returns None, so no count is displayed to the user
    - HOWEVER, the .writelines(content) method also does NOT add a newline after each line it writes, so '\n' needed to be manually added to all logging statements
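    - the difference between the two methods can be checked on any text stream (sketch, not code from this repository):

```python
import io

stream = io.StringIO()  # any text stream behaves the same way

characters_written = stream.write('haha')  # .write() returns the number of characters written
print(characters_written)                  # 4

result = stream.writelines(['line a', 'line b'])  # .writelines() returns None and adds
print(result)                                     # None    ...NO newlines between items

print(stream.getvalue())  # hahaline aline b
```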

- resources:
  - logging package
    - https://realpython.com/python-logging/#other-configuration-methods
    - https://stackoverflow.com/questions/11232230/logging-to-two-files-with-different-settings/11233293
    - https://stackoverflow.com/questions/15199816/python-logging-multiple-files-using-the-same-logger
    - https://stackoverflow.com/questions/24816456/python-logging-wont-shutdown
    - https://docs.python.org/3/library/logging.html
    - https://docs.python.org/3/howto/logging-cookbook.html#:~:text=Although%20logging%20is%20thread-safe,across%20multiple%20processes%20in%20Python.
    - https://docs.python.org/3/howto/logging-cookbook.html#multiple-handlers-and-formatters
    - https://docs.python.org/3/library/logging.html#logging.getLogger
  - sys.stdout, print(), and formatting strings
    - https://stackoverflow.com/questions/3263672/the-difference-between-sys-stdout-write-and-print
    - https://stackoverflow.com/questions/493386/how-to-print-without-newline-or-space?rq=1
    - https://stackoverflow.com/questions/12377473/python-write-versus-writelines-and-concatenated-strings
  - time package
    - https://docs.python.org/3/library/time.html#time.strftime
    - https://en.wikipedia.org/wiki/ISO_8601
  - contextlib package
    - https://dbader.org/blog/python-context-managers-and-with-statement
    - https://docs.python.org/3/library/contextlib.html
    - https://pymotw.com/2/contextlib/
    - https://www.geeksforgeeks.org/context-manager-using-contextmanager-decorator/
shailshouryya committed Dec 28, 2020
1 parent 7134c3e commit 82a0129
Showing 10 changed files with 146 additions and 169 deletions.
35 changes: 12 additions & 23 deletions python/dev/execute.py
@@ -1,6 +1,6 @@
import sys
import time
import logging
import contextlib

import selenium
from selenium import webdriver
@@ -139,6 +139,15 @@ def show_user_how_to_set_up_selenium():
common_message.display_dependency_setup_instructions(user_driver, user_os)


@contextlib.contextmanager
def yield_file_writer(log_file):
with open (log_file, 'a', encoding='utf-8') as output_location:
yield output_location

@contextlib.contextmanager
def yield_stdout_writer():
yield sys.stdout


user_os = determine_user_os()
url, seleniumdriver = check_user_input()
@@ -163,28 +172,8 @@ def show_user_how_to_set_up_selenium():
driver.set_window_size(780, 800)
driver.set_window_position(0, 0)
file_name = determine_file_name()
if log_file:
log = logging.getLogger()
# logging level (in order of increasing severity) can be:
# - DEBUG (prints everything python does)
# - INFO
# - WARNING
# - ERROR
# - CRITICAL
log.setLevel(logging.INFO)
handler = logging.FileHandler(log_file)
handler.setLevel(logging.DEBUG)
formatter = logging.Formatter(fmt='%(asctime)s %(levelname)s: %(message)s', datefmt='%Y-%m-%d %H:%M:%S')
handler.setFormatter(formatter)
log.addHandler(handler)
else:
logging.basicConfig(format='%(asctime)s %(levelname)s: %(message)s', datefmt='%Y-%m-%d %H:%M:%S', level=logging.INFO)
logging.info('*' * 100)
logging.info(f'Starting new job at {time.asctime()}')
program.determine_action(url, driver, scroll_pause_time, reverse_chronological, file_name, txt, csv, markdown)
if log_file:
# log.removeHandler(handler) # if you don't provide the log_file argument, program ALWAYS prints DEBUG info to console
log.handlers.clear() # if you don't provide the log_file argument, program prints DEBUG info to console the first time after providing log_file argument (but not second or subsequent times)
with yield_file_writer(log_file) if log_file else yield_stdout_writer() as logging_output_location:
program.determine_action(url, driver, scroll_pause_time, reverse_chronological, file_name, txt, csv, markdown, logging_output_location)
program_end = time.perf_counter()
total_time = program_end - program_start
print(f'This program took {total_time} seconds to complete.\n')
55 changes: 27 additions & 28 deletions python/dev/file/create_file.py
@@ -1,5 +1,4 @@
import functools
import logging
import time
import csv
import os
@@ -11,40 +10,40 @@
NEWLINE = '\n'


def scroll_down(current_elements_count, driver, scroll_pause_time):
def scroll_down(current_elements_count, driver, scroll_pause_time, logging_output_location):
driver.execute_script('window.scrollBy(0, 50000);')
time.sleep(scroll_pause_time)
new_elements_count = driver.execute_script('return document.querySelectorAll("ytd-grid-video-renderer").length')
logging.info(f'Found {new_elements_count} videos...')
logging_output_location.writelines(f'Found {new_elements_count} videos...\n')
if new_elements_count == current_elements_count:
# wait scroll_pause_time seconds and check again to verify you really did reach the end of the page, and there wasn't a buffer loading period
logging.info(common_message.no_new_videos_found(scroll_pause_time * 2))
logging_output_location.writelines(common_message.no_new_videos_found(scroll_pause_time * 2))
time.sleep(scroll_pause_time * 2)
new_elements_count = driver.execute_script('return document.querySelectorAll("ytd-grid-video-renderer").length')
if new_elements_count == current_elements_count:
logging.info('Reached end of page!')
logging_output_location.writelines('Reached end of page!\n')
return new_elements_count


def save_elements_to_list(driver, start_time, scroll_pause_time, url):
def save_elements_to_list(driver, start_time, scroll_pause_time, url, logging_output_location):
elements = driver.find_elements_by_xpath('//*[@id="video-title"]')
end_time = time.perf_counter()
total_time = end_time - start_time - scroll_pause_time # subtract scroll_pause_time to account for the extra waiting time to verify end of page
logging.info(f'It took {total_time} seconds to find all {len(elements)} videos from {url}{NEWLINE}')
logging_output_location.writelines(f'It took {total_time} seconds to find all {len(elements)} videos from {url}{NEWLINE}\n')
return elements


def scroll_to_bottom(url, driver, scroll_pause_time):
def scroll_to_bottom(url, driver, scroll_pause_time, logging_output_location):
start_time = time.perf_counter() # timer stops in save_elements_to_list() function
driver.get(url)
current_elements_count = driver.execute_script('return document.querySelectorAll("ytd-grid-video-renderer").length')
while True:
new_elements_count = scroll_down(current_elements_count, driver, scroll_pause_time)
new_elements_count = scroll_down(current_elements_count, driver, scroll_pause_time, logging_output_location)
if new_elements_count == current_elements_count:
break
else:
current_elements_count = new_elements_count
return save_elements_to_list(driver, start_time, scroll_pause_time, url)
return save_elements_to_list(driver, start_time, scroll_pause_time, url, logging_output_location)


def time_writer_function(writer_function):
@@ -53,18 +52,18 @@ def wrapper_timer(*args, **kwargs):
start_time = time.perf_counter()
extension = writer_function.__name__.split('_')[-1]
timestamp = kwargs.get('timestamp', 'undeteremined_start_time')
logging.info(f'Opening a temp {extension} file and writing video information to the file....')
file_name, videos_written = writer_function(*args, **kwargs) # writer_function() writes to temp_{file_name}
file_name, videos_written, logging_output_location = writer_function(*args, **kwargs) # writer_function() writes to temp_{file_name}
logging_output_location.writelines(f'Opening a temp {extension} file and writing video information to the file....\n')
end_time = time.perf_counter()
total_time = end_time - start_time
temp_file = f'temp_{file_name}_{timestamp}.{extension}' # determine temp_{file_name} for wrapper_timer() scope
final_file = f'{file_name}.{extension}'
logging.info(f'Finished writing to'.ljust(38) + f'{temp_file}')
logging.info(f'{videos_written} videos written to'.ljust(38) + f'{temp_file}')
logging.info(f'Closing'.ljust(38) + f'{temp_file}')
logging_output_location.writelines(f'Finished writing to'.ljust(38) + f'{temp_file}\n')
logging_output_location.writelines(f'{videos_written} videos written to'.ljust(38) + f'{temp_file}\n')
logging_output_location.writelines(f'Closing'.ljust(38) + f'{temp_file}\n')
os.replace(temp_file, final_file) # rename temp_{file_name} to {file_name}.{extension} here AFTER everything else finishes to ensure atomicity
logging.info(f'Successfully completed write, renamed {temp_file} to {final_file}')
logging.info(f'It took {total_time} seconds to write all {videos_written} videos to {final_file}{NEWLINE}')
logging_output_location.writelines(f'Successfully completed write, renamed {temp_file} to {final_file}\n')
logging_output_location.writelines(f'It took {total_time} seconds to write all {videos_written} videos to {final_file}{NEWLINE}\n')
return wrapper_timer


@@ -80,34 +79,34 @@ def prepare_output(list_of_videos, reverse_chronological):
return total_videos, total_writes, video_number, incrementer


def txt_writer(file, markdown_formatting, reverse_chronological, list_of_videos, spacing, video_number, incrementer, total_writes):
def txt_writer(file, markdown_formatting, reverse_chronological, list_of_videos, spacing, video_number, incrementer, total_writes, logging_output_location):
for selenium_element in list_of_videos if reverse_chronological else list_of_videos[::-1]:
video_number, total_writes = write.txt_entry(file, markdown_formatting, selenium_element, NEWLINE, spacing, video_number, incrementer, total_writes)
if total_writes % 250 == 0:
logging.info(f'{total_writes} videos written to {file.name}...')
logging_output_location.writelines(f'{total_writes} videos written to {file.name}...\n')

@time_writer_function
def write_to_txt(list_of_videos, file_name, reverse_chronological, timestamp):
def write_to_txt(list_of_videos, file_name, reverse_chronological, logging_output_location, timestamp):
total_videos, total_writes, video_number, incrementer = prepare_output(list_of_videos, reverse_chronological)
markdown_formatting = False
spacing = f'{NEWLINE}' + ' '*4
with open(f'temp_{file_name}_{timestamp}.txt', 'w', encoding='utf-8') as txt_file:
txt_writer(txt_file, markdown_formatting, reverse_chronological, list_of_videos, spacing, video_number, incrementer, total_writes)
return file_name, total_videos
txt_writer(txt_file, markdown_formatting, reverse_chronological, list_of_videos, spacing, video_number, incrementer, total_writes, logging_output_location)
return file_name, total_videos, logging_output_location


@time_writer_function
def write_to_md(list_of_videos, file_name, reverse_chronological, timestamp):
def write_to_md(list_of_videos, file_name, reverse_chronological, logging_output_location, timestamp):
total_videos, total_writes, video_number, incrementer = prepare_output(list_of_videos, reverse_chronological)
markdown_formatting = True
spacing = f'{NEWLINE}' + '- ' + f'{NEWLINE}'
with open(f'temp_{file_name}_{timestamp}.md', 'w', encoding='utf-8') as md_file:
txt_writer(md_file, markdown_formatting, reverse_chronological, list_of_videos, spacing, video_number, incrementer, total_writes)
return file_name, total_videos
txt_writer(md_file, markdown_formatting, reverse_chronological, list_of_videos, spacing, video_number, incrementer, total_writes, logging_output_location)
return file_name, total_videos, logging_output_location


@time_writer_function
def write_to_csv(list_of_videos, file_name, reverse_chronological, timestamp):
def write_to_csv(list_of_videos, file_name, reverse_chronological, logging_output_location, timestamp):
total_videos, total_writes, video_number, incrementer = prepare_output(list_of_videos, reverse_chronological)
with open(f'temp_{file_name}_{timestamp}.csv', 'w', newline='', encoding='utf-8') as csv_file:
fieldnames = ['Video Number', 'Video Title', 'Video URL', 'Watched?', 'Watch again later?', 'Notes']
@@ -116,5 +115,5 @@ def write_to_csv(list_of_videos, file_name, reverse_chronological, timestamp):
for selenium_element in list_of_videos if reverse_chronological else list_of_videos[::-1]:
video_number, total_writes = write.csv_entry(writer, selenium_element, video_number, incrementer, total_writes)
if total_writes % 250 == 0:
logging.info(f'{total_writes} videos written to {csv_file.name}...')
return file_name, total_videos
logging_output_location.writelines(f'{total_writes} videos written to {csv_file.name}...\n')
return file_name, total_videos, logging_output_location
