
Data Araucania

Data wrangling, routing, transformation, and system mediation logic.

General concerns

  1. After the data has been collected, it has to be loaded into a remote database.
  2. Format the tables in the xlsx files as JSON, one output per file and per workbook sheet inside it.
  3. Once the JSON data structure is done, turn the data into SQL format so that it can be loaded, schema and all, automatically (a minimal sketch of steps 2 and 3 follows this list).
  4. Extract, transform, and load the data.
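
As a minimal sketch of steps 2 and 3 above (not the repository's implementation), the snippet below reads one sheet with xlrd, turns its rows into JSON records, and renders them as SQL INSERT statements. The table name ine_data and the to_insert helper are hypothetical.

    import json
    import xlrd  # xlrd 1.2.0 can still open .xlsx workbooks

    # Step 2: one JSON record per row, keyed by the header row.
    book = xlrd.open_workbook('/content/data_income/existencias-colmenas.xlsx')
    sheet = book.sheet_by_index(0)
    header = [str(cell.value) for cell in sheet.row(0)]
    records = [dict(zip(header, sheet.row_values(i))) for i in range(1, sheet.nrows)]
    json_data = json.dumps(records, ensure_ascii=False)

    # Step 3: one INSERT statement per JSON record.
    def to_insert(table, record):
        columns = ', '.join(record)
        values = ', '.join(repr(v) for v in record.values())
        return f'INSERT INTO {table} ({columns}) VALUES ({values});'

    statements = [to_insert('ine_data', r) for r in json.loads(json_data)]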

1. Wrangling data from Excel files:

In a perfect world, data is clean, structured, and all in one place. That’s every analyst’s dream! But much like most things in life, data is messy and unpredictable.

"... it has been stated that up to 80% of data analysis is spent on the process of cleaning and preparing data. However, being a prerequisite to the rest of the data analysis workflow (visualization, modeling, reporting), it’s essential that you become fluent and efficient in data wrangling techniques." — Page v, Data Wrangling with R, 2016

Several of the files on INE's site are messy: their content is not aligned for easy reading, and they lack structured headers suitable for storage. In this section we explore, in search of the right method, different ways to read, transform, and load the data inside these Excel files. The most efficient approach, in terms of execution time and memory usage, is the one to choose. Below we present an example of an INE Excel file, the Python library requirements, and the data pipeline's alternative paths.

Figure: a frame of messy data, with unreadable fields as headers for the database, while selecting the target work zone.

By xlrd (retrieves data from Excel files)

Within INE's data bank, several documents are stored in different formats. This path covers the process of cleaning and unifying these raw, messy, and complex data sets for easy access and analysis in Biosoft approaches. The usage is shown below.

This path was built as an xlrd Jupyter notebook and also as OOP-style Python files.

  • Requirements

    polars==0.15.8
    numpy==1.21.6
    xlrd==1.2.0
  • Data pipeline: the main class, or how to instantiate the "wrangling processes". Excel files are read by the FileReader class, which takes the folder of files to be processed (a sketch of these classes follows this list).

    • files = FileReader('/content/data_income')

    • files.collect_files()

    • files.show_files()

      1    /content/data_income/numero-upa-practicas-mejoramiento-suelo.xlsx
      2    /content/data_income/numero-productores-tramos-edad.xlsx
      3    /content/data_income/existencias-colmenas.xlsx
      .

    For each file, the sheet's content, structure, and design are taken into account to define the working zone as a dataframe. Then, for url_file in files.list_files:

    • array_xl = Array_xl(url_file, '/content/data_outcome_ine_cl')
    • array_xl.data_wrangling()
    • array_xl.data_normalization()
    • array_xl.default_dataframe_csv() or array_xl.custom_dataframe_csv(1)
  • Outcome CSV dataset

    Figure: the dataframe while defining the range of action, and the resulting dataframe.
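
The FileReader and Array_xl classes above live in the repository's notebooks and are not reproduced here. As a minimal sketch, assuming a flat folder of .xlsx files and standing in for the real data_wrangling() and data_normalization() logic with a single empty-row filter, the flow could look roughly like this:

    from pathlib import Path

    import polars as pl
    import xlrd

    class FileReader:
        """Collect the Excel files found in a folder (sketch)."""
        def __init__(self, folder):
            self.folder = Path(folder)
            self.list_files = []

        def collect_files(self):
            self.list_files = sorted(str(p) for p in self.folder.glob('*.xlsx'))

        def show_files(self):
            for number, path in enumerate(self.list_files, start=1):
                print(number, path)

    files = FileReader('/content/data_income')
    files.collect_files()
    files.show_files()

    for url_file in files.list_files:
        sheet = xlrd.open_workbook(url_file).sheet_by_index(0)
        rows = [[str(v) for v in sheet.row_values(i)] for i in range(sheet.nrows)]
        # Placeholder wrangling: keep only rows with at least one non-empty cell.
        rows = [row for row in rows if any(v.strip() for v in row)]
        df = pl.DataFrame(rows, orient='row')
        df.write_csv('/content/data_outcome_ine_cl/' + Path(url_file).stem + '.csv')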

By PySpark (Parallel computing)

This path was built as a PySpark Jupyter notebook.

  • Requirements

    pandas==1.3.5
    numpy==1.21.6
    pyspark==3.3.1
    findspark==2.0.1
  • Data pipeline: the main class, or how to instantiate the "wrangling processes". Excel files are read by the FileReader class, which takes the folder of files to be processed.

  • folder = FileReader('/content/data_income')

  • folder.collect_files()

  • folder.show_files()

    1    /content/data_income/numero-upa-practicas-mejoramiento-suelo.xlsx
    2    /content/data_income/numero-productores-tramos-edad.xlsx
    3    /content/data_income/existencias-colmenas.xlsx
    .

For each file, the sheet's content, structure, and design are taken into account to define the working zone as a dataframe (a sketch of these steps follows this list). Then, for xl_file in folder.list_files:

  • first = spark_dataframe(path_origin=xl_file, path_export="./pyspark/")

  • first.remove_nan()

  • first.remove_comments()

  • first.remove_empty_column()

  • first.Establishing_header()

  • first.export_to_csv()

  • Outcome CSV dataset

    Figure: the preliminary dataframe while extracting the data.
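
The spark_dataframe class is likewise not reproduced above, so the following is a rough sketch of the same steps. It assumes pandas (with an Excel engine such as openpyxl) loads the raw sheet before Spark takes over, and the '*' footnote rule standing in for remove_comments() is a guess:

    import findspark

    findspark.init()  # locate the local Spark installation

    import pandas as pd
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName('data_araucania').getOrCreate()

    # Spark has no native .xlsx reader, so pandas loads the raw sheet first.
    pdf = pd.read_excel('/content/data_income/existencias-colmenas.xlsx', header=None)
    pdf = pdf.dropna(axis=0, how='all')  # ~ remove_nan(): drop all-empty rows
    pdf = pdf.dropna(axis=1, how='all')  # ~ remove_empty_column()

    # ~ Establishing_header(): promote the first surviving row to column names.
    pdf.columns = [str(c) for c in pdf.iloc[0]]
    pdf = pdf.iloc[1:].fillna('').astype(str)

    df = spark.createDataFrame(pdf)

    # ~ remove_comments(): drop footnote-style rows.
    df = df.filter(~F.col(df.columns[0]).startswith('*'))

    # ~ export_to_csv(): one folder of CSV part-files per input workbook.
    df.write.mode('overwrite').option('header', True).csv('./pyspark/existencias-colmenas')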

References

Polars

NumPy

Pandas

Xlrd

Data wrangling: Fundamentals & best practices

PySpark