feat(struc): database modularization and code improvement #137

Merged: 25 commits, Sep 2, 2023

@luabida luabida commented Jun 9, 2023

TODOs:

Base modules

  • Database base class
  • FTP File base class

Databases

  • SINAN
  • SINASC
  • CIHA
  • CNES
  • ESUS
  • IBGE
  • PNI
  • SIA
  • SIH
  • SIM

Functions:

  • List all files
  • Describe file
  • Download method
  • Endpoints

Docs:

  • docstrings

Tests:

Listing files & downloading

  • SINAN
  • CIHA
  • CNES
  • ESUS
  • IBGE
  • PNI
  • SIA
  • SINASC
  • SIH
  • SIM

Parsing

  • dbc -> dbf
  • dbf -> parquet
  • parquet -> dataframe

@luabida
Collaborator Author

luabida commented Jun 9, 2023

This PR aims to improve the code for the pysus databases, making it easier to maintain, test, and extend with new functionality. The preview below shows one of the benefits of the new code:

image

@fccoelho
Collaborator

Could you please add some docstrings to the code?

@@ -26,8 +26,16 @@ class File:
    def __init__(
        self, path: str, name: str, size: int, date: datetime.datetime
    ) -> None:
        try:
            name, extension = name.split(".")
Collaborator

Maybe it is more robust to use os.path.splitext(name) here.

Collaborator Author

Nice, that's exactly what I needed. Thanks!

name, extension = name.split(".")
self.name = name
self.extension = extension
self.basename = (".").join([name, extension])
Collaborator

you can use os.path.join here
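
A minimal sketch of how the two suggestions above could be applied; it is not the code that was merged, and `split_name` and `basename` are hypothetical helpers. `os.path.splitext` splits on the last dot only, so a name with no dot or with several dots does not raise the `ValueError` that unpacking `name.split(".")` would. Note that `os.path.join` joins path components with the directory separator, so this sketch rebuilds the basename by plain concatenation instead.

```python
import os.path


def split_name(filename: str) -> tuple[str, str]:
    """Split a file name into (name, extension) on the last dot only."""
    name, extension = os.path.splitext(filename)
    # splitext keeps the leading dot ("ACBIBR06", ".dbc"); drop it so the
    # extension matches what name.split(".") produced for simple names.
    return name, extension.lstrip(".")


def basename(name: str, extension: str) -> str:
    """Rebuild 'name.ext', or just 'name' when there is no extension."""
    return f"{name}.{extension}" if extension else name
```

With this shape, `split_name("ACBIBR06.dbc")` gives `("ACBIBR06", "dbc")`, and a name without an extension yields an empty extension string rather than an exception.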

if file.name.startswith("SRC"):
    dis_code = file.name[:3]
elif file.name == "LEIBR22":
    dis_code = "LEIV"  # MISSPELLED FILE NAME
Collaborator

This is not really a misspelling: the V corresponds to Visceral Leishmaniasis. The name still does not follow the pattern of the other file names, so it is OK to add this special treatment here.
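
The special cases in this hunk can be sketched as a standalone function. This is an illustration, not the merged code: `parse_sinan_name` is a hypothetical name, and the default branch (a four-letter disease code, then `BR`, then a two-digit year) is an assumption inferred from file names such as `ZIKABR17` that appear later in this conversation.

```python
def parse_sinan_name(name: str) -> tuple[str, str]:
    """Return (disease_code, two_digit_year) for a SINAN file name without extension."""
    year = name[-2:]  # assumed: names end with a two-digit year
    if name.startswith("SRC"):
        dis_code = name[:3]  # three-letter code, as in the hunk above
    elif name == "LEIBR22":
        dis_code = "LEIV"  # this file name lacks the "V" of the LEIV code
    else:
        dis_code = name[:4]  # assumed default: four-letter code
    return dis_code, year
```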

ER="AIH Rejeitada com erro",
SP="Serviços Profissionais",
CH="Cadastro Hospitalar",
CM="", # TODO
Collaborator Author

@fccoelho, do you know, by any chance, what CM means here? I couldn't find any reference to it. My guess is Cadastro Municipal, but I'm not sure.

Collaborator

No idea. I have checked the documentation on the FTP server, but there is no mention of it. Leave it blank for now.

@luabida
Collaborator Author

luabida commented Aug 21, 2023

@fccoelho I've added a 5-second limit to every test because they were taking too long and the CI was stopping before the tests finished. In the future they should be mocked or split into smaller tests. I've also started splitting the dependencies using extras, so using the preprocessing methods now requires installing pysus[preprocessing].
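
One way the slow tests could be mocked, as suggested above, is to replace the FTP connection with a `unittest.mock.MagicMock` so that nothing touches the network. This is only a sketch: `list_dbc_files` and the directory path are hypothetical stand-ins rather than pysus functions, while `cwd` and `nlst` are real `ftplib.FTP` methods.

```python
from unittest.mock import MagicMock


def list_dbc_files(ftp, directory: str) -> list[str]:
    """Hypothetical helper: return the .dbc file names in an FTP directory."""
    ftp.cwd(directory)
    # Listings mix upper- and lower-case extensions, so compare lowercased.
    return [name for name in ftp.nlst() if name.lower().endswith(".dbc")]


def test_list_dbc_files_without_network():
    fake_ftp = MagicMock()  # stands in for an ftplib.FTP instance
    fake_ftp.nlst.return_value = ["ZIKABR16.dbc", "DNAC1996.DBC", "README.txt"]

    files = list_dbc_files(fake_ftp, "/some/ftp/dir")  # made-up path

    assert files == ["ZIKABR16.dbc", "DNAC1996.DBC"]
    fake_ftp.cwd.assert_called_once_with("/some/ftp/dir")
```

A test written this way runs in milliseconds, which leaves the time limit as a safety net for the tests that still need the real server.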

@luabida luabida force-pushed the modularization branch 2 times, most recently from c2ceea0 to 73a6239 Compare September 2, 2023 16:38
@luabida
Collaborator Author

luabida commented Sep 2, 2023

In [1]: db = SINAN()

In [2]: len(db.files)
Out[2]: 757

In [3]: !ls -la ~/pysus
total 8
drwxrwxr-x  2 bida bida 4096 set  2 13:39 .
drwxr-x--- 58 bida bida 4096 set  2 13:39 ..

In [4]: file = db.files[0]

In [5]: db.describe(file)
Out[5]:
{'name': 'ACBIBR06.dbc',
 'disease': 'Acidente de trabalho com material biológico',
 'year': 2006,
 'size': '28.3 kB',
 'last_update': '01-16-2023 02:15PM'}

In [6]: file.download()
Out[6]: '/home/bida/pysus/ACBIBR06.dbc'

In [7]: file = db.files[1]

In [8]: await file.async_download()

In [9]: !ls -la ~/pysus
total 696
drwxrwxr-x  2 bida bida   4096 set  2 13:41 .
drwxr-x--- 58 bida bida   4096 set  2 13:39 ..
-rw-rw-r--  1 bida bida  28326 set  2 13:40 ACBIBR06.dbc
-rw-rw-r--  1 bida bida 673314 set  2 13:41 ACBIBR07.dbc

In [10]: db.get_files(dis_codes=["ZIKA", "DENG"])
Out[10]:
[DENGBR00.dbc,
 DENGBR01.dbc,
 DENGBR02.dbc,
 DENGBR03.dbc,
 DENGBR04.dbc,
 DENGBR05.dbc,
 DENGBR06.dbc,
 DENGBR07.dbc,
 DENGBR08.dbc,
 DENGBR09.dbc,
 DENGBR10.dbc,
 DENGBR11.dbc,
 DENGBR12.dbc,
 DENGBR13.dbc,
 DENGBR14.dbc,
 DENGBR15.dbc,
 DENGBR16.dbc,
 DENGBR17.dbc,
 DENGBR18.dbc,
 DENGBR19.dbc,
 DENGBR20.dbc,
 DENGBR21.dbc,
 ZIKABR16.dbc,
 ZIKABR17.dbc,
 ZIKABR18.dbc,
 ZIKABR19.dbc,
 ZIKABR20.dbc,
 ZIKABR21.dbc,
 ZIKABR22.dbc,
 DENGBR22.dbc,
 ZIKABR23.dbc]

In [11]: db.get_files(dis_codes=["ZIKA", "DENG"], years=2017)
Out[11]: [DENGBR17.dbc, ZIKABR17.dbc]

In [12]: file = db.get_files(dis_codes=["ZIKA", "DENG"], years=2017)[1]

In [13]: file
Out[13]: ZIKABR17.dbc

In [14]: file.download()
Out[14]: '/home/bida/pysus/ZIKABR17.dbc'

In [15]: file.info
Out[15]:
{'size': '646938',
 'type': 'file',
 'modify': datetime.datetime(2021, 11, 23, 18, 3)}

In [16]: db.format(file)
Out[16]: ('ZIKA', '17')
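
The filtering shown in `In [10]` and `In [11]` can be pictured as a plain loop over parsed file names. The sketch below is an assumption about the behaviour, not the pysus implementation: `filter_files` and `human_size` are hypothetical helpers, file names are assumed to follow the `<code>BR<yy>.dbc` pattern, and `years` must be an iterable here even though the session also passes a bare `2017`. `human_size` uses decimal kilobytes because that reproduces the `'28.3 kB'` reported for the 28326-byte file.

```python
def filter_files(names, dis_codes, years):
    """Keep names whose 4-letter code and 2-digit year match the filters."""
    wanted_years = {str(year)[-2:].zfill(2) for year in years}
    selected = []
    for name in names:
        stem = name.rsplit(".", 1)[0]  # "ZIKABR17.dbc" -> "ZIKABR17"
        code, year = stem[:4], stem[-2:]
        if code in dis_codes and year in wanted_years:
            selected.append(name)
    return selected


def human_size(num_bytes: int) -> str:
    """Format a byte count with decimal units (1 kB = 1000 bytes)."""
    size = float(num_bytes)
    if size < 1000:
        return f"{num_bytes} B"
    for unit in ("kB", "MB", "GB"):
        size /= 1000
        # Stop at the first unit that fits, or at GB as the largest unit.
        if size < 1000 or unit == "GB":
            return f"{size:.1f} {unit}"
```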

@luabida luabida force-pushed the modularization branch 2 times, most recently from 15b1d8b to 436dc2b Compare September 2, 2023 16:57
@luabida
Collaborator Author

luabida commented Sep 2, 2023

ipython -i pysus/online_data/SINAN.py

In [1]: download("ZIKA", "2023")
Out[1]: ['/home/bida/pysus/ZIKABR23.dbc']

In [2]: download("ZIKA", ["2023", "22"])
Out[2]: ['/home/bida/pysus/ZIKABR22.dbc', '/home/bida/pysus/ZIKABR23.dbc']

In [3]: download("ZIKA", ["2023", "22", 21])
Out[3]:
['/home/bida/pysus/ZIKABR21.dbc',
 '/home/bida/pysus/ZIKABR22.dbc',
 '/home/bida/pysus/ZIKABR23.dbc']

In [4]: download(["ZIKA", "CHIK"], ["2023", "22", 21])
Out[4]:
['/home/bida/pysus/CHIKBR21.dbc',
 '/home/bida/pysus/CHIKBR22.dbc',
 '/home/bida/pysus/ZIKABR21.dbc',
 '/home/bida/pysus/ZIKABR22.dbc',
 '/home/bida/pysus/CHIKBR23.dbc',
 '/home/bida/pysus/ZIKABR23.dbc']

In [5]: get_available_years("DENG")
Out[5]:
[DENGBR00.dbc,
 DENGBR01.dbc,
 DENGBR02.dbc,
 DENGBR03.dbc,
 DENGBR04.dbc,
 DENGBR05.dbc,
 DENGBR06.dbc,
 DENGBR07.dbc,
 DENGBR08.dbc,
 DENGBR09.dbc,
 DENGBR10.dbc,
 DENGBR11.dbc,
 DENGBR12.dbc,
 DENGBR13.dbc,
 DENGBR14.dbc,
 DENGBR15.dbc,
 DENGBR16.dbc,
 DENGBR17.dbc,
 DENGBR18.dbc,
 DENGBR19.dbc,
 DENGBR20.dbc,
 DENGBR21.dbc,
 DENGBR22.dbc]
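
The session above accepts a single value or a list, and years written as `"2023"`, `"22"` or `21`. A hedged sketch of how such arguments could be normalized before matching file names follows; `as_list`, `two_digit_year` and `expected_names` are hypothetical names, not functions from `pysus/online_data/SINAN.py`, and the result is sorted here for determinism, whereas the session output is not in strictly sorted order.

```python
def as_list(value):
    """Wrap a scalar in a list; return lists, tuples and sets as lists."""
    if isinstance(value, (list, tuple, set)):
        return list(value)
    return [value]


def two_digit_year(year) -> str:
    """Normalize 2023, "2023", "22" or 21 to a two-digit string."""
    text = str(year).strip()
    if not text.isdigit() or len(text) not in (1, 2, 4):
        raise ValueError(f"unrecognized year: {year!r}")
    return text[-2:].zfill(2)


def expected_names(dis_codes, years):
    """Build sorted '<code>BR<yy>.dbc' names for every code/year pair."""
    return sorted(
        f"{code}BR{two_digit_year(year)}.dbc"
        for code in as_list(dis_codes)
        for year in as_list(years)
    )
```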

@luabida
Collaborator Author

luabida commented Sep 2, 2023

ipython -i pysus/online_data/SIM.py

In [1]: download("SP", 2020)
Out[1]: ['/home/bida/pysus/DOSP2020.dbc']

In [2]: download("MS", 2015)
Out[2]: ['/home/bida/pysus/DOMS2015.dbc']

In [3]: download(["SC", "PE"], [2015, "04"])
Out[3]:
['/home/bida/pysus/DOPE2004.dbc',
 '/home/bida/pysus/DOPE2015.dbc',
 '/home/bida/pysus/DOSC2004.dbc',
 '/home/bida/pysus/DOSC2015.dbc']
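
The SIM file names above follow a `DO<UF><yyyy>` pattern, so the two-digit `"04"` has to be expanded to `2004` before it can match. A rough sketch of that expansion, under a loudly stated assumption: the pivot rule (two-digit values below 50 map to the 2000s, the rest to the 1900s) is illustrative and not taken from pysus, and `four_digit_year` and `sim_names` are hypothetical names.

```python
from itertools import product


def four_digit_year(year, pivot: int = 50) -> int:
    """Expand a year to four digits; two-digit values below `pivot` map to 20xx."""
    value = int(year)
    if value >= 1000:
        return value  # already a four-digit year
    if not 0 <= value <= 99:
        raise ValueError(f"unrecognized year: {year!r}")
    return (2000 if value < pivot else 1900) + value


def sim_names(ufs, years):
    """Build sorted 'DO<UF><yyyy>.dbc' names for every UF/year pair."""
    return sorted(
        f"DO{uf.upper()}{four_digit_year(year)}.dbc"
        for uf, year in product(ufs, years)
    )
```

Calling `sim_names(["SC", "PE"], [2015, "04"])` produces the same four base names as `Out[3]` above.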

@luabida
Collaborator Author

luabida commented Sep 2, 2023

In [1]: get_available_years("AC")
Out[1]:
[DNRAC1994.dbc,
 DNRAC94.DBC,
 DNRAC95.DBC,
 DNRAC1995.dbc,
 DNAC1996.DBC,
 DNAC1997.DBC,
 DNAC1998.DBC,
 DNAC1999.DBC,
 DNAC2000.DBC,
 DNAC2001.DBC,
 DNAC2002.DBC,
 DNAC2003.DBC,
 DNAC2004.DBC,
 DNAC2005.DBC,
 DNAC2006.DBC,
 DNAC2007.DBC,
 DNAC2008.DBC,
 DNAC2009.DBC,
 DNAC2010.DBC,
 DNAC2011.DBC,
 DNAC2012.DBC,
 DNAC2013.dbc,
 DNAC2014.dbc,
 DNAC2015.dbc,
 DNAC2016.dbc,
 DNAC2017.dbc,
 DNAC2018.dbc,
 DNAC2019.dbc,
 DNAC2020.dbc,
 DNAC2021.dbc]

@luabida
Collaborator Author

luabida commented Sep 2, 2023

ipython -i pysus/online_data/SIH.py

In [1]: sih.groups
Out[1]:
{'RD': 'AIH Reduzida',
 'RJ': 'AIH Rejeitada',
 'ER': 'AIH Rejeitada com erro',
 'SP': 'Serviços Profissionais',
 'CH': 'Cadastro Hospitalar',
 'CM': ''}

In [2]: download(["SC", "PE"], [2015, "04"], [1,2,3], ["RD", "RJ"])
Out[2]:
['/home/bida/pysus/RDPE0104.dbc',
 '/home/bida/pysus/RDPE0204.dbc',
 '/home/bida/pysus/RDPE0304.dbc',
 '/home/bida/pysus/RDSC0104.dbc',
 '/home/bida/pysus/RDSC0204.dbc',
 '/home/bida/pysus/RDSC0304.dbc']

@luabida
Collaborator Author

luabida commented Sep 2, 2023

In [1]: sia.groups
Out[1]:
{'AB': 'Laudo de Acompanhamento a Cirurgia Bariátrica',
 'ABO': 'Acompanhamento Pós Cirurgia Bariátrica',
 'ACF': 'Confeção de Fístula Arteriovenosa',
 'AD': 'Laudos Diversos',
 'AM': 'Laudo de Medicamentos',
 'AMP': '',
 'AN': 'Laudo de Nefrologia',
 'AQ': 'Laudo de Quimioterapia',
 'AR': 'Laudo de Radioterapia',
 'ATD': 'Tratamento Dialítico',
 'BI': 'Boletim Individual',
 'IMPBO': '',
 'PA': 'Produção Ambulatorial',
 'PAM': '',
 'PAR': '',
 'PAS': '',
 'PS': 'Psicossocial',
 'SAD': 'Atenção Domiciliar'}

In [2]: download(["AM", "RJ"], [2001, "04"], [1,2,3], ["aq", "PA", "BI"])
Out[2]: ['/home/bida/pysus/PAAM0304.dbc', '/home/bida/pysus/PARJ0304.dbc']

In [3]: download(["AM", "RJ"], [2001, "04", 2010], [1,2,3], ["aq", "PA", "BI"])
Out[3]:
['/home/bida/pysus/PAAM0110.dbc',
 '/home/bida/pysus/PAAM0210.dbc',
 '/home/bida/pysus/PAAM0304.dbc',
 '/home/bida/pysus/PARJ0110.dbc',
 '/home/bida/pysus/PARJ0210.dbc',
 '/home/bida/pysus/PARJ0304.dbc',
 '/home/bida/pysus/PARJ0310.dbc']

@luabida luabida marked this pull request as ready for review September 2, 2023 19:52
@luabida luabida merged commit d7e6d27 into AlertaDengue:master Sep 2, 2023
1 of 2 checks passed
@github-actions

🎉 This PR is included in version 0.10.0 🎉
