SPIKE - refactoring analysis of initial solution #4

MobiTikula · 2024-05-16T08:28:53Z

No description provided.

lsulak · 2024-05-16T13:48:20Z

I reviewed https://github.com/absa-group/living-doc-website-example/tree/main/scripts and I found some existing ideas in https://github.com/absa-group/living-doc-website-example/issues/27 - I like the proposals. Except those, some of mine are below:

--

Consider removing some unnecessary code documentation - things like these does not bring any new information

# Process issues  
issue_list = process_issues(issues, org_name, repo_name)

--

Feel free to experiment with OOP. The current code is almost script-like, very procedural, but it's okay for now.

--

Be careful about variable scopes. For example, in github_query_project_state.py, session = requests.Session() - the session is not passed to any function below that code, but it's used within function send_graphql_query directly.

--

Be careful on these things:

output_file_path = f"{output_directory}/{output_file_name}"

it makes your code less multiplatform - this can't run on Windows because of different path delimiters. Better is to use os.path.join function

--

consider using black for auto formatting and maintaining consistent formatting across the whole codebase

--

consider extracting / changing some data structures to specific custom data classes, for example this one: repositories: List[dict] - if we know that it's a list of dictionaries of certain format, why not to have it as List[Repository] where Repository can be for example something like (maybe orgName -> org_name):

from dataclasses import dataclass

@dataclass
class Repository:
    orgName: str
    repoName: str
    queryLabels: List[str]

also this might be worth considering: Dict[str, List[str]] to be represented as a data class (it's a return type of convert_field_options_to_dict) - but up to you, it's not that complex/nested

also, this can be improved using data classes because you have well-defined structure and you would keep it separated from the processing:

unique_projects[project_id] = {  
    "ID": project_id,  
    "Number": project_number,  
    "Title": project_title,  
    "Owner": org_name,  
    "RepositoriesFromConfig": [repo_name],  
    "ProjectRepositories": [],  
    "Issues": [],  
    "FieldOptions": sanitized_field_options_dict  
}

--

this can be done better:

def send_graphql_query(...):
	...
	return {}

if len(response) == 0:  
    return []

like:

def send_graphql_query(...):
	...
	return None # or just `return`, because None is the default

if not response:
    return

and then, if you are working with None, you would need to add one more check for things like this:

if not projects:  # this one would need to be added
	return
	
for project in projects:

--

perhaps some constants can be extracted into constants.py or global constants of a given file, like issues_per_page = 100

lsulak · 2024-05-16T14:10:46Z

Also, more about functions vs classes - as I said, your code pretty much doesn't have any OOP. Perhaps it's fine for now, and it's not necessary to do no matter what. This is pretty cool video: https://www.youtube.com/watch?v=o9pEzgHorH0, maybe you will like it. Also some people have quite good points here: https://stackoverflow.com/questions/33072570/when-should-i-be-using-classes-in-python?rq=1

Generally speaking, if you have some processing/functionality that is logically close together and operates on some data, then encapsuling it into classes can lead to much better readability and maintain-ability of your code.

But for now, I would be okay with the current approach (since it's a short collection of scripts, each has about 200-300 lines of code anyway) - perhaps just consider to wrapping around some more complex data types into data classes and once this grows, perhaps then it might make sense to revisit this OOP idea :) I can help

lsulak · 2024-05-16T14:26:25Z

Btw, in terms of additional tooling for checking your code and/or managing environment etc, I recommend using:

black, flake8, pylint, pytest, mypy, and potentially poetry

This project has some of these things already: https://github.com/AbsaOSS/spline-python-agent/blob/master/pyproject.toml

disclaimer: I've never used poetry, because I'm not using Python for around 2 years now, but I used to like pipenv project; but Pipenv seemed to be a bit dead for a while, and I heard really cool things about Poetry, seems to be quite popular these days.

miroslavpojer · 2024-05-17T06:09:52Z

def send_graphql_query(...):
	...
	return None # or just `return`, because None is the default

@lsulak How about using typing.Optional for return value, which can be None?

miroslavpojer · 2024-05-17T06:13:31Z

Also, more about functions vs classes - as I said, your code pretty much doesn't have any OOP. Perhaps it's fine for now, and it's not necessary to do no matter what. This is pretty cool video: https://www.youtube.com/watch?v=o9pEzgHorH0, maybe you will like it. Also some people have quite good points here: https://stackoverflow.com/questions/33072570/when-should-i-be-using-classes-in-python?rq=1

Generally speaking, if you have some processing/functionality that is logically close together and operates on some data, then encapsuling it into classes can lead to much better readability and main-ability of your code.

But for now, I would be okay with the current approach (since it's a short collection of scripts, each has about 200-300 lines of code anyway) - perhaps just consider to wrapping around some more complex data types into data classes and once this grows, perhaps then it might make sense to revisit this OOP idea :) I can help

Refactoring to use classes to known objects (now defined list) is a handy tip for this project. I would like to see it as one of the main goals of following Refactoring.
Now, we are familiar with the object we are interacting with.

lsulak · 2024-05-17T08:31:46Z

def send_graphql_query(...):
	...
	return None # or just `return`, because None is the default
@lsulak How about using typing.Optional for return value, which can be None?

Yes in cases where None can be returned, along with some non-none value, like list or so, Optional can be used (or Union, but Optional is more tidy).

https://stackoverflow.com/questions/39429526/how-to-specify-nullable-return-type-with-type-hints

def get_some_date(some_argument: int=None) -> Optional[datetime]:
    if some_argument is not None and some_argument == 1:
        return datetime.utcnow()
    else:
        return None

or, instead of Optional[datetime], you can use Union[datetime, None], which should be the same thing. Also, from Python 3.10, it seems like there is a third option (I've never used it, as Python is being developed quite rapidly), could be: datetime | None: - so a new 'pipe' operator

MobiTikula · 2024-05-20T10:10:28Z

This comment is to summarize all SPIKE points for the POC refactoring. Each point is suggested to be a separated refactoring task. This is in my opinion best order to refactor the POC:

Project structure

we have to find out best practice for having src / dist files in our project
there is no "standard" or best practice to have a project structure
having src and test folder on the directory level is a good practice
here is the example of sample project: https://github.com/pypa/sampleproject
here is interesting overflow debate, that @lsulak suggested to read, where we can find maybe some ideas for improvement: https://stackoverflow.com/questions/193161/what-is-the-best-project-structure-for-a-python-application

OOP -- 1: Data containers

UP TO DEBATE: bring OOP to the project's scripts
extract / change known data structures to specific custom data containers
possible for Dict[str, List[str]] in return type for convert_field_options_to_dict
example:

from dataclasses import dataclass

@dataclass
class Repository:
    orgName: str
    repoName: str
    queryLabels: List[str]

Logging

add logging to the project for better error handling.
in the future ideal for aggregation and centralization, ideal to have logs on one permanent storage
via (https://realpython.com/python-logging/,
https://docs.python.org/3/howto/logging.html)

Scripts bug fixes (comments from @lsulak)

refactor send_graphql_query method as it is described in SPIKE - refactoring analysis of initial solution #4 (comment)

"""
Be careful on these things:
output_file_path = f"{output_directory}/{output_file_name}"

it makes your code less multiplatform - this can't run on Windows because of different path delimiters. Better is to use os.path.join function
"""

"""
Be careful about variable scopes. For example, in github_query_project_state.py, session = requests.Session() - the session is not passed to any function below that code, but it's used within function send_graphql_query directly.
"""

Constants handeling

check for more constants in the scripts
behave like is suggested in https://realpython.com/python-constants/
UP TO DEBATE: making a constants.py file

Methods

break down complex long methods into several smaller ones, that are atomic
refactor consolidate_features_with_project() as it is suggested in: https://github.com/AbsaOSS/living-doc-generator/pull/2/files#r1603412467
make some methods more general and add them to utils folder
main method pattern refactoring with best practice (via https://realpython.com/python-main-function/)

def main():
    data = "My data read from the Web"
    print(data)

if __name__ == "__main__":
    main()

Documentation and formatting

remove some unnecessary code doc, that does not bring new info
update type hints, if method can also return None deal with it: Union[datetime, None], Optional[datetime], datetime | None
UP TO DEBATE: considering using black, poetry for auto formatting
after the spike, I am more into poetry formatter
via (https://github.com/psf/black,
https://www.reddit.com/r/Python/comments/16qz8mx/pipenv_piptools_pdm_or_poetry/)
type hints and proper method / script documentation for controller.py

miroslavpojer · 2024-05-21T09:54:15Z

This comment is to summarize all SPIKE points for the POC refactoring. Each point is suggested to be a separated refactoring task. This is in my opinion best order to refactor the POC:

Project structure

we have to find out best practice for having src / dist files in our project

there is no "standard" or best practice to have a project structure

having src and test folder on the directory level is a good practice

here is the example of sample project: https://github.com/pypa/sampleproject

here is interesting overflow debate, that @lsulak suggested to read, where we can find maybe some ideas for improvement: https://stackoverflow.com/questions/193161/what-is-the-best-project-structure-for-a-python-application

OOP -- 1: Data containers

UP TO DEBATE: bring OOP to the project's scripts

extract / change known data structures to specific custom data containers

possible for Dict[str, List[str]] in return type for convert_field_options_to_dict

example:
from dataclasses import dataclass

@dataclass
class Repository:
    orgName: str
    repoName: str
    queryLabels: List[str]
Logging

add logging to the project for better error handling.

in the future ideal for aggregation and centralization, ideal to have logs on one permanent storage

via (https://realpython.com/python-logging/,
https://docs.python.org/3/howto/logging.html)

Scripts bug fixes (comments from @lsulak)

refactor send_graphql_query method as it is described in SPIKE - refactoring analysis of initial solution #4 (comment)

""" Be careful on these things: output_file_path = f"{output_directory}/{output_file_name}"

it makes your code less multiplatform - this can't run on Windows because of different path delimiters. Better is to use os.path.join function """

""" Be careful about variable scopes. For example, in github_query_project_state.py, session = requests.Session() - the session is not passed to any function below that code, but it's used within function send_graphql_query directly. """

Constants handeling

check for more constants in the scripts

behave like is suggested in https://realpython.com/python-constants/

UP TO DEBATE: making a constants.py file

Methods

break down complex long methods into several smaller ones, that are atomic

refactor consolidate_features_with_project() as it is suggested in: https://github.com/AbsaOSS/living-doc-generator/pull/2/files#r1603412467

make some methods more general and add them to utils folder

main method pattern refactoring with best practice (via https://realpython.com/python-main-function/)
def main():
    data = "My data read from the Web"
    print(data)

if __name__ == "__main__":
    main()
Documentation and formatting

remove some unnecessary code doc, that does not bring new info

update type hints, if method can also return None deal with it: Union[datetime, None], Optional[datetime], datetime | None

UP TO DEBATE: considering using black, poetry for auto formatting

after the spike, I am more into poetry formatter

via (https://github.com/psf/black,
https://www.reddit.com/r/Python/comments/16qz8mx/pipenv_piptools_pdm_or_poetry/)

type hints and proper method / script documentation for controller.py

I have addressed the mentioned refactoring topic in epic #3 .

lsulak · 2024-05-21T13:43:07Z

Regarding 3, logging, depends how this will be exactly implemented and deployed - what's the target desired logging storage if any; Watchtower might be helpful: https://pypi.org/project/watchtower/, it directly sends logs to AWS CloudWatch, it's very easy to be plugged into standard Python logger :)

miroslavpojer · 2024-05-22T05:14:57Z

Regarding 3, logging, depends how this will be exactly implemented and deployed - what's the target desired logging storage if any; Watchtower might be helpful: https://pypi.org/project/watchtower/, it directly sends logs to AWS CloudWatch, it's very easy to be plugged into standard Python logger :)

As we are simple GH Action then the logging is about:

info: providing information about the generation process - informal statuses about logic steps
debug: providing more detailed information about received and processed data

miroslavpojer · 2024-05-22T05:15:45Z

I believe we have reached here the point to close this Issue as Solved.

MobiTikula · 2024-05-22T07:11:05Z

I am of the same opinion that we reached our goal for this SPIKE. I am closing this Issue as Solved.

MobiTikula mentioned this issue May 16, 2024

Action PoC RLS #3

Open

MobiTikula self-assigned this May 16, 2024

lsulak mentioned this issue May 16, 2024

Feature/1 initial repository setup #2

Merged

miroslavpojer mentioned this issue May 21, 2024

Refactoring - Scripts bug fixes #10

Open

MobiTikula closed this as completed May 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SPIKE - refactoring analysis of initial solution #4

SPIKE - refactoring analysis of initial solution #4

MobiTikula commented May 16, 2024

lsulak commented May 16, 2024

lsulak commented May 16, 2024 •

edited

Loading

lsulak commented May 16, 2024

miroslavpojer commented May 17, 2024 •

edited

Loading

miroslavpojer commented May 17, 2024

lsulak commented May 17, 2024

MobiTikula commented May 20, 2024 •

edited

Loading

miroslavpojer commented May 21, 2024 •

edited

Loading

lsulak commented May 21, 2024

miroslavpojer commented May 22, 2024

miroslavpojer commented May 22, 2024

MobiTikula commented May 22, 2024

SPIKE - refactoring analysis of initial solution #4

SPIKE - refactoring analysis of initial solution #4

Comments

MobiTikula commented May 16, 2024

lsulak commented May 16, 2024

lsulak commented May 16, 2024 • edited Loading

lsulak commented May 16, 2024

miroslavpojer commented May 17, 2024 • edited Loading

miroslavpojer commented May 17, 2024

lsulak commented May 17, 2024

MobiTikula commented May 20, 2024 • edited Loading

miroslavpojer commented May 21, 2024 • edited Loading

lsulak commented May 21, 2024

miroslavpojer commented May 22, 2024

miroslavpojer commented May 22, 2024

MobiTikula commented May 22, 2024

lsulak commented May 16, 2024 •

edited

Loading

miroslavpojer commented May 17, 2024 •

edited

Loading

MobiTikula commented May 20, 2024 •

edited

Loading

miroslavpojer commented May 21, 2024 •

edited

Loading