
Data Araucania

Data wrangling, routing, transformation, and system mediation logic.

General concerns

  1. After the data has been collected, it has to be loaded into a remote database.
  2. Format the tables in the xlsx files as JSON, one output per file and per workbook sheet inside it.
  3. Once the JSON data structure is done, turn the data into SQL format so that it can be loaded, schema and all, automatically (a minimal sketch of steps 2 and 3 follows this list).
  4. Extract, transform, and load the data.
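
As a minimal sketch of steps 2 and 3 above (not the repository's implementation), the snippet below reads one sheet with xlrd, turns its rows into JSON records, and renders them as SQL INSERT statements. The table name ine_data and the to_insert helper are hypothetical.

    import json
    import xlrd  # xlrd 1.2.0 can still open .xlsx workbooks

    # Step 2: one JSON record per row, keyed by the header row.
    book = xlrd.open_workbook('/content/data_income/existencias-colmenas.xlsx')
    sheet = book.sheet_by_index(0)
    header = [str(cell.value) for cell in sheet.row(0)]
    records = [dict(zip(header, sheet.row_values(i))) for i in range(1, sheet.nrows)]
    json_data = json.dumps(records, ensure_ascii=False)

    # Step 3: one INSERT statement per JSON record.
    def to_insert(table, record):
        columns = ', '.join(record)
        values = ', '.join(repr(v) for v in record.values())
        return f'INSERT INTO {table} ({columns}) VALUES ({values});'

    statements = [to_insert('ine_data', r) for r in json.loads(json_data)]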

1. Wrangling data from Excel files:

In a perfect world, data is clean, structured, and all in one place. That’s every analyst’s dream! But much like most things in life, data is messy and unpredictable.

"... it has been stated that up to 80% of data analysis is spent on the process of cleaning and preparing data. However, being a prerequisite to the rest of the data analysis workflow (visualization, modeling, reporting), it’s essential that you become fluent and efficient in data wrangling techniques." — Page v, Data Wrangling with R, 2016

Several of the files on INE's site are messy: their content is not aligned for easy reading, and they lack structured headers suitable for storage. In this section we explore, in search of the right method, different ways to read, transform, and load the data inside these Excel files. The most efficient approach, in terms of execution time and memory usage, is the one to choose. Below we present an example of an INE Excel file, the Python library requirements, and the data pipeline's alternative paths.

Figure: a frame of messy data, with unreadable fields as headers for the database, while selecting the target work zone.

By xlrd (retrieves data from Excel files)

Within INE's data bank, several documents are stored in different formats. This path covers the process of cleaning and unifying these raw, messy, and complex data sets for easy access and analysis in Biosoft approaches. The usage is shown below.

This path was built as an xlrd Jupyter notebook and also as OOP-style Python files.

  • Requirements

    polars==0.15.8
    numpy==1.21.6
    xlrd==1.2.0
  • Data pipeline: the main class, or how to instantiate the "wrangling processes". Excel files are read by the FileReader class, which takes the folder of files to be processed (a sketch of these classes follows this list).

    • files = FileReader('/content/data_income')

    • files.collect_files()

    • files.show_files()

      1    /content/data_income/numero-upa-practicas-mejoramiento-suelo.xlsx
      2    /content/data_income/numero-productores-tramos-edad.xlsx
      3    /content/data_income/existencias-colmenas.xlsx
      .

    For each file, the sheet's content, structure, and design are taken into account to define the working zone as a dataframe. Then, for url_file in files.list_files:

    • array_xl = Array_xl(url_file, '/content/data_outcome_ine_cl')
    • array_xl.data_wrangling()
    • array_xl.data_normalization()
    • array_xl.default_dataframe_csv() or array_xl.custom_dataframe_csv(1)
  • Outcome CSV dataset

    Figure: the dataframe while defining the range of action, and the resulting dataframe.
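
The FileReader and Array_xl classes above live in the repository's notebooks and are not reproduced here. As a minimal sketch, assuming a flat folder of .xlsx files and standing in for the real data_wrangling() and data_normalization() logic with a single empty-row filter, the flow could look roughly like this:

    from pathlib import Path

    import polars as pl
    import xlrd

    class FileReader:
        """Collect the Excel files found in a folder (sketch)."""
        def __init__(self, folder):
            self.folder = Path(folder)
            self.list_files = []

        def collect_files(self):
            self.list_files = sorted(str(p) for p in self.folder.glob('*.xlsx'))

        def show_files(self):
            for number, path in enumerate(self.list_files, start=1):
                print(number, path)

    files = FileReader('/content/data_income')
    files.collect_files()
    files.show_files()

    for url_file in files.list_files:
        sheet = xlrd.open_workbook(url_file).sheet_by_index(0)
        rows = [[str(v) for v in sheet.row_values(i)] for i in range(sheet.nrows)]
        # Placeholder wrangling: keep only rows with at least one non-empty cell.
        rows = [row for row in rows if any(v.strip() for v in row)]
        df = pl.DataFrame(rows, orient='row')
        df.write_csv('/content/data_outcome_ine_cl/' + Path(url_file).stem + '.csv')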

By PySpark (Parallel computing)

This path was built as a PySpark Jupyter notebook.

  • Requirements

    pandas==1.3.5
    numpy==1.21.6
    pyspark==3.3.1
    findspark==2.0.1
  • Data pipeline: the main class, or how to instantiate the "wrangling processes". Excel files are read by the FileReader class, which takes the folder of files to be processed.

  • folder = FileReader('/content/data_income')

  • folder.collect_files()

  • folder.show_files()

    1    /content/data_income/numero-upa-practicas-mejoramiento-suelo.xlsx
    2    /content/data_income/numero-productores-tramos-edad.xlsx
    3    /content/data_income/existencias-colmenas.xlsx
    .

For each file, the sheet's content, structure, and design are taken into account to define the working zone as a dataframe (a sketch of these steps follows this list). Then, for xl_file in folder.list_files:

  • first = spark_dataframe(path_origin=xl_file, path_export="./pyspark/")

  • first.remove_nan()

  • first.remove_comments()

  • first.remove_empty_column()

  • first.Establishing_header()

  • first.export_to_csv()

  • Outcome CSV dataset

    Figure: the preliminary dataframe while extracting the data.
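
The spark_dataframe class is likewise not reproduced above, so the following is a rough sketch of the same steps. It assumes pandas (with an Excel engine such as openpyxl) loads the raw sheet before Spark takes over, and the '*' footnote rule standing in for remove_comments() is a guess:

    import findspark

    findspark.init()  # locate the local Spark installation

    import pandas as pd
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName('data_araucania').getOrCreate()

    # Spark has no native .xlsx reader, so pandas loads the raw sheet first.
    pdf = pd.read_excel('/content/data_income/existencias-colmenas.xlsx', header=None)
    pdf = pdf.dropna(axis=0, how='all')  # ~ remove_nan(): drop all-empty rows
    pdf = pdf.dropna(axis=1, how='all')  # ~ remove_empty_column()

    # ~ Establishing_header(): promote the first surviving row to column names.
    pdf.columns = [str(c) for c in pdf.iloc[0]]
    pdf = pdf.iloc[1:].fillna('').astype(str)

    df = spark.createDataFrame(pdf)

    # ~ remove_comments(): drop footnote-style rows.
    df = df.filter(~F.col(df.columns[0]).startswith('*'))

    # ~ export_to_csv(): one folder of CSV part-files per input workbook.
    df.write.mode('overwrite').option('header', True).csv('./pyspark/existencias-colmenas')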

References

Polars

NumPy

Pandas

Xlrd

Data wrangling: Fundamentals & best practices

PySpark