# **Data Loading Workflow Guide**
--------------------------------------------------------

### Welcome to the Workflow Guide for Data Loading!

In this workflow guide, we will learn how to load our data using datalabx.

Currently, in v0.1, datalabx allows you to read these tabular data files:

- CSV
- JSON
- Parquet
- Excel (XLSX, XLS)

We will see how we can load each of these data files using datalabx.



### **Importing Libraries**

To begin with, we will be importing the library:

- datalabx

In [2]:
import datalabx

## **Loading the Data** 

datalabx allows us to load tabular data using the ``DataLoader`` class.

We can use the ``load_tabular()`` method of the ``DataLoader`` class to load tabular data.


This method auto-detects your file type and returns a pandas DataFrame.
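
To make the idea of file-type auto-detection concrete, here is a minimal sketch that maps a file extension to a format label. It is only an illustration: the names ``SUPPORTED_TYPES`` and ``detect_file_type`` are hypothetical, and datalabx may detect file types in a different way internally.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a format label.
# This is an assumption for illustration, not datalabx's actual code.
SUPPORTED_TYPES = {
    ".csv": "csv",
    ".json": "json",
    ".parquet": "parquet",
    ".xlsx": "xlsx",
    ".xls": "xls",
}


def detect_file_type(path):
    """Return a format label based on the file extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_TYPES:
        raise ValueError(f"Unsupported file type: {suffix or 'no extension'}")
    return SUPPORTED_TYPES[suffix]


print(detect_file_type("ultra_messy_dataset.csv"))
print(detect_file_type("hotel_bookings.XLSX"))
```

Lower-casing the suffix means ``.XLSX`` and ``.xlsx`` are treated the same, and raising ``ValueError`` for unknown extensions makes unsupported files fail early.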

#### **CSV Files**

We can load our CSV files using the ``load_tabular()`` method like this:

In [3]:
from datalabx import DataLoader

df = DataLoader('ultra_messy_dataset.csv').load_tabular()

DataLoader - INFO - Data Loader initialized with csv file.


In [4]:
print(df.shape[0])

df.head()

100000


Unnamed: 0,Age,Salary,Expenses,Height_cm,Weight_kg,Temperature_C,Purchase_Amount,Score,Rating,Debt
0,five,1.04e+05,4735.244618878169,"1,55e+02",,unknown,,30.70,4.888018131992931,64972
1,20.673408544550288,134460.66741276794,9533.158420128186,"1,53e+02",12193,2.05e+01,2400.0,one,"3,33e+00","$58,276.81"
2,33.56,missing,4104.95,204,96.50051410846052,"-1,14e+01",2877.0871437672777,four,unknown,94033.3425007563
3,missing,,894429.0,196.34 cm,?,-1.57e+01,,?,4.31,"27,400cm"
4,28$,112877.67251785153,1040.45,171.40 cm,4912,-6,4344.39,2.24,4.14,67682


In [5]:
df.dtypes

Age                large_string[pyarrow]
Salary             large_string[pyarrow]
Expenses           large_string[pyarrow]
Height_cm          large_string[pyarrow]
Weight_kg          large_string[pyarrow]
Temperature_C      large_string[pyarrow]
Purchase_Amount    large_string[pyarrow]
Score              large_string[pyarrow]
Rating             large_string[pyarrow]
Debt               large_string[pyarrow]
dtype: object

We can see:

- We have ``100000`` rows in this dataset.
- datalabx automatically loaded every column with a ``pyarrow`` datatype (``large_string[pyarrow]``).

We can change this to the usual ``numpy`` datatypes by passing **array_type='numpy'** to the ``load_tabular()`` method of the ``DataLoader`` class like this:

In [6]:
numpy_df = DataLoader('ultra_messy_dataset.csv').load_tabular(array_type='numpy')

DataLoader - INFO - Data Loader initialized with csv file.


In [7]:
numpy_df.dtypes

Age                object
Salary             object
Expenses           object
Height_cm          object
Weight_kg          object
Temperature_C      object
Purchase_Amount    object
Score              object
Rating             object
Debt               object
dtype: object

Great! 

We can see that we now have numpy ``object`` datatypes.

#### **Excel Files**

We can also load Excel files (.xlsx) using the ``load_tabular()`` method.

In [8]:
df = DataLoader('hotel_bookings.xlsx').load_tabular()

DataLoader - INFO - Data Loader initialized with xlsx file.


In [9]:
df.head()

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,...,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,...,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03


#### **Parquet Files**

We can load Parquet files (.parquet) using the ``load_tabular()`` method like this:

In [11]:
df = DataLoader('hotel_bookings.parquet').load_tabular()

DataLoader - INFO - Data Loader initialized with parquet file.


In [12]:
df.head()

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,...,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,...,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03


#### **JSON Object**

We can load JSON records using the ``load_tabular()`` method of the ``DataLoader`` class.

In [22]:
df = DataLoader('hotel_bookings.json').load_tabular(array_type='pyarrow')

DataLoader - INFO - Data Loader initialized with json file.


In [23]:
df

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,...,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,...,No Deposit,304,,0,Transient,75.0,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,No Deposit,240,,0,Transient,98.0,0,1,Check-Out,2015-07-03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,Resort Hotel,1,122,2015,August,33,9,2,4,2,...,No Deposit,240,,0,Transient,166.0,0,2,Canceled,2015-05-27
996,Resort Hotel,1,41,2015,August,33,9,2,4,2,...,No Deposit,240,,0,Transient,202.0,0,2,Canceled,2015-07-17
997,Resort Hotel,1,41,2015,August,33,9,2,4,2,...,No Deposit,240,,0,Transient,172.0,0,2,Canceled,2015-07-17
998,Resort Hotel,0,81,2015,August,33,9,2,4,2,...,No Deposit,250,,0,Transient,277.0,1,1,Check-Out,2015-08-15
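
To illustrate what "JSON records" usually means, the sketch below parses a small records-style JSON string (a list of objects, one per row) with the standard library and turns it into column names and rows. The two sample records reuse values from the output above. It is an assumption that ``hotel_bookings.json`` is laid out this way; the file itself is not shown in this guide.

```python
import json

# A records-oriented JSON document: a list with one object per row.
# Assumed layout for illustration; the real file may be structured differently.
records_json = """
[
    {"hotel": "Resort Hotel", "is_canceled": 0, "lead_time": 342},
    {"hotel": "Resort Hotel", "is_canceled": 0, "lead_time": 737}
]
"""

records = json.loads(records_json)

# Each key becomes a column name; each object becomes one row.
columns = list(records[0].keys())
rows = [[record[column] for column in columns] for record in records]

print(columns)
for row in rows:
    print(row)
```

Because every object carries the same keys, this layout maps naturally onto a table with one column per key.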


**Great!**

This is how you can successfully load your tabular data files using datalabx.