Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SPIKE - refactoring analysis of initial solution #4

Closed
Tracked by #3
MobiTikula opened this issue May 16, 2024 · 12 comments
Closed
Tracked by #3

SPIKE - refactoring analysis of initial solution #4

MobiTikula opened this issue May 16, 2024 · 12 comments
Assignees

Comments

@MobiTikula
Copy link
Collaborator

No description provided.

@MobiTikula MobiTikula self-assigned this May 16, 2024
@lsulak
Copy link
Collaborator

lsulak commented May 16, 2024

I reviewed https://github.com/absa-group/living-doc-website-example/tree/main/scripts and I found some existing ideas in https://github.com/absa-group/living-doc-website-example/issues/27 - I like the proposals. Except those, some of mine are below:

--

Consider removing some unnecessary code documentation - things like these does not bring any new information

# Process issues  
issue_list = process_issues(issues, org_name, repo_name)

--

Feel free to experiment with OOP. The current code is almost script-like, very procedural, but it's okay for now.

--

Be careful about variable scopes. For example, in github_query_project_state.py, session = requests.Session() - the session is not passed to any function below that code, but it's used within function send_graphql_query directly.

--

Be careful on these things:

output_file_path = f"{output_directory}/{output_file_name}"

it makes your code less multiplatform - this can't run on Windows because of different path delimiters. Better is to use os.path.join function

--

consider using black for auto formatting and maintaining consistent formatting across the whole codebase

--

consider extracting / changing some data structures to specific custom data classes, for example this one: repositories: List[dict] - if we know that it's a list of dictionaries of certain format, why not to have it as List[Repository] where Repository can be for example something like (maybe orgName -> org_name):

from dataclasses import dataclass

@dataclass
class Repository:
    orgName: str
    repoName: str
    queryLabels: List[str]

also this might be worth considering: Dict[str, List[str]] to be represented as a data class (it's a return type of convert_field_options_to_dict) - but up to you, it's not that complex/nested

also, this can be improved using data classes because you have well-defined structure and you would keep it separated from the processing:

unique_projects[project_id] = {  
    "ID": project_id,  
    "Number": project_number,  
    "Title": project_title,  
    "Owner": org_name,  
    "RepositoriesFromConfig": [repo_name],  
    "ProjectRepositories": [],  
    "Issues": [],  
    "FieldOptions": sanitized_field_options_dict  
}

--

this can be done better:

def send_graphql_query(...):
	...
	return {}
if len(response) == 0:  
    return []

like:

def send_graphql_query(...):
	...
	return None # or just `return`, because None is the default
if not response:
    return

and then, if you are working with None, you would need to add one more check for things like this:

if not projects:  # this one would need to be added
	return
	
for project in projects:

--

perhaps some constants can be extracted into constants.py or global constants of a given file, like issues_per_page = 100

@lsulak
Copy link
Collaborator

lsulak commented May 16, 2024

Also, more about functions vs classes - as I said, your code pretty much doesn't have any OOP. Perhaps it's fine for now, and it's not necessary to do no matter what. This is pretty cool video: https://www.youtube.com/watch?v=o9pEzgHorH0, maybe you will like it. Also some people have quite good points here: https://stackoverflow.com/questions/33072570/when-should-i-be-using-classes-in-python?rq=1

Generally speaking, if you have some processing/functionality that is logically close together and operates on some data, then encapsuling it into classes can lead to much better readability and maintain-ability of your code.

But for now, I would be okay with the current approach (since it's a short collection of scripts, each has about 200-300 lines of code anyway) - perhaps just consider to wrapping around some more complex data types into data classes and once this grows, perhaps then it might make sense to revisit this OOP idea :) I can help

@lsulak
Copy link
Collaborator

lsulak commented May 16, 2024

Btw, in terms of additional tooling for checking your code and/or managing environment etc, I recommend using:

black, flake8, pylint, pytest, mypy, and potentially poetry

This project has some of these things already: https://github.com/AbsaOSS/spline-python-agent/blob/master/pyproject.toml

disclaimer: I've never used poetry, because I'm not using Python for around 2 years now, but I used to like pipenv project; but Pipenv seemed to be a bit dead for a while, and I heard really cool things about Poetry, seems to be quite popular these days.

@miroslavpojer
Copy link
Collaborator

miroslavpojer commented May 17, 2024

def send_graphql_query(...):
	...
	return None # or just `return`, because None is the default

@lsulak How about using typing.Optional for return value, which can be None?

@miroslavpojer
Copy link
Collaborator

Also, more about functions vs classes - as I said, your code pretty much doesn't have any OOP. Perhaps it's fine for now, and it's not necessary to do no matter what. This is pretty cool video: https://www.youtube.com/watch?v=o9pEzgHorH0, maybe you will like it. Also some people have quite good points here: https://stackoverflow.com/questions/33072570/when-should-i-be-using-classes-in-python?rq=1

Generally speaking, if you have some processing/functionality that is logically close together and operates on some data, then encapsuling it into classes can lead to much better readability and main-ability of your code.

But for now, I would be okay with the current approach (since it's a short collection of scripts, each has about 200-300 lines of code anyway) - perhaps just consider to wrapping around some more complex data types into data classes and once this grows, perhaps then it might make sense to revisit this OOP idea :) I can help

Refactoring to use classes to known objects (now defined list) is a handy tip for this project. I would like to see it as one of the main goals of following Refactoring.
Now, we are familiar with the object we are interacting with.

@lsulak
Copy link
Collaborator

lsulak commented May 17, 2024

def send_graphql_query(...):
	...
	return None # or just `return`, because None is the default

@lsulak How about using typing.Optional for return value, which can be None?

Yes in cases where None can be returned, along with some non-none value, like list or so, Optional can be used (or Union, but Optional is more tidy).

https://stackoverflow.com/questions/39429526/how-to-specify-nullable-return-type-with-type-hints

def get_some_date(some_argument: int=None) -> Optional[datetime]:
    if some_argument is not None and some_argument == 1:
        return datetime.utcnow()
    else:
        return None

or, instead of Optional[datetime], you can use Union[datetime, None], which should be the same thing. Also, from Python 3.10, it seems like there is a third option (I've never used it, as Python is being developed quite rapidly), could be: datetime | None: - so a new 'pipe' operator

@MobiTikula
Copy link
Collaborator Author

MobiTikula commented May 20, 2024

This comment is to summarize all SPIKE points for the POC refactoring. Each point is suggested to be a separated refactoring task. This is in my opinion best order to refactor the POC:

  1. Project structure
  1. OOP -- 1: Data containers
  • UP TO DEBATE: bring OOP to the project's scripts
  • extract / change known data structures to specific custom data containers
  • possible for Dict[str, List[str]] in return type for convert_field_options_to_dict
  • example:
from dataclasses import dataclass

@dataclass
class Repository:
    orgName: str
    repoName: str
    queryLabels: List[str]
  1. Logging
  1. Scripts bug fixes (comments from @lsulak)

"""
Be careful on these things:
output_file_path = f"{output_directory}/{output_file_name}"

it makes your code less multiplatform - this can't run on Windows because of different path delimiters. Better is to use os.path.join function
"""

"""
Be careful about variable scopes. For example, in github_query_project_state.py, session = requests.Session() - the session is not passed to any function below that code, but it's used within function send_graphql_query directly.
"""

  1. Constants handeling
  1. Methods
def main():
    data = "My data read from the Web"
    print(data)

if __name__ == "__main__":
    main()
  1. Documentation and formatting

@miroslavpojer
Copy link
Collaborator

miroslavpojer commented May 21, 2024

This comment is to summarize all SPIKE points for the POC refactoring. Each point is suggested to be a separated refactoring task. This is in my opinion best order to refactor the POC:

  1. Project structure
  1. OOP -- 1: Data containers
  • UP TO DEBATE: bring OOP to the project's scripts
  • extract / change known data structures to specific custom data containers
  • possible for Dict[str, List[str]] in return type for convert_field_options_to_dict
  • example:
from dataclasses import dataclass

@dataclass
class Repository:
    orgName: str
    repoName: str
    queryLabels: List[str]
  1. Logging
  1. Scripts bug fixes (comments from @lsulak)

""" Be careful on these things: output_file_path = f"{output_directory}/{output_file_name}"

it makes your code less multiplatform - this can't run on Windows because of different path delimiters. Better is to use os.path.join function """

""" Be careful about variable scopes. For example, in github_query_project_state.py, session = requests.Session() - the session is not passed to any function below that code, but it's used within function send_graphql_query directly. """

  1. Constants handeling
  1. Methods
def main():
    data = "My data read from the Web"
    print(data)

if __name__ == "__main__":
    main()
  1. Documentation and formatting

I have addressed the mentioned refactoring topic in epic #3 .

@lsulak
Copy link
Collaborator

lsulak commented May 21, 2024

Regarding 3, logging, depends how this will be exactly implemented and deployed - what's the target desired logging storage if any; Watchtower might be helpful: https://pypi.org/project/watchtower/, it directly sends logs to AWS CloudWatch, it's very easy to be plugged into standard Python logger :)

@miroslavpojer
Copy link
Collaborator

Regarding 3, logging, depends how this will be exactly implemented and deployed - what's the target desired logging storage if any; Watchtower might be helpful: https://pypi.org/project/watchtower/, it directly sends logs to AWS CloudWatch, it's very easy to be plugged into standard Python logger :)

As we are simple GH Action then the logging is about:

  • info: providing information about the generation process - informal statuses about logic steps
  • debug: providing more detailed information about received and processed data

@miroslavpojer
Copy link
Collaborator

I believe we have reached here the point to close this Issue as Solved.

@MobiTikula
Copy link
Collaborator Author

I am of the same opinion that we reached our goal for this SPIKE. I am closing this Issue as Solved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants