# Project: File Ingestion and Schema Validation

## Task 

* Take any CSV or text file of 2 GB or more. (The assignment can be done on Google Colab.)

* Read the file (present the approach used to read it).

* Perform basic validation on the column names, e.g. remove special characters and whitespace.

* Since the schema is already known, create a YAML file that lists the column names and defines the separators
  used for reading and writing the file.

* Validate the number of columns and the column names of the ingested file against the YAML file.

* Write the file as a pipe-separated (`|`) text file in gz format.

* Create a summary of the file:

    - total number of rows

    - total number of columns

    - file size
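
The column-name cleanup mentioned in the task can be sketched as follows. This is a minimal illustration, not the code in `utility`: the function name `clean_column_name` is hypothetical, and lowercasing and replacing special characters with underscores are assumptions about what "basic validation" should do.

```python
import re


def clean_column_name(name: str) -> str:
    """Normalize one raw header cell (hypothetical helper, not part of `utility`)."""
    # Assumption: surrounding whitespace is dropped and names are lowercased.
    name = name.strip().lower()
    # Replace each run of characters other than a-z, 0-9, or "_" with one underscore.
    name = re.sub(r"[^0-9a-z_]+", "_", name)
    # Remove underscores left at either end by the replacement.
    return name.strip("_")


raw_header = [" Event Time ", "event-type", "Product ID#", "price($)"]
cleaned = [clean_column_name(col) for col in raw_header]
print(cleaned)  # ['event_time', 'event_type', 'product_id', 'price']
```

Cleaning the header before comparing it with the YAML column list keeps the later validation step from failing on purely cosmetic differences.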

## Import modules

In [1]:
import utility
import warnings
import dask.dataframe as dd
import os

warnings.filterwarnings("ignore")

## Read YAML config file

In [2]:
config = utility.read_config_file('file.yaml')

In [3]:
config

{'file_type': 'csv',
 'dataset_name': 'commerce',
 'file_name': 'commerce',
 'table_name': 'commercesurv',
 'inbound_delimiter': ',',
 'outbound_delimiter': '|',
 'skip_leading_rows': 1,
 'columns': ['event_time',
  'event_type',
  'product_id',
  'category_id',
  'category_code',
  'brand',
  'price',
  'user_id',
  'user_session']}
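
Once the config is loaded, a quick sanity check on the resulting dictionary can catch mistakes in the YAML file early. The sketch below is an assumption-laden illustration: `check_config` and `REQUIRED_KEYS` are hypothetical names, and which keys the `utility` functions actually rely on is not shown in this notebook.

```python
# Assumption: these are the keys the later steps need; adjust to the real utility code.
REQUIRED_KEYS = {"file_type", "file_name", "inbound_delimiter", "outbound_delimiter", "columns"}


def check_config(config: dict) -> list:
    """Return a list of problems found; an empty list means nothing was flagged."""
    problems = []
    missing = sorted(REQUIRED_KEYS - set(config))
    if missing:
        problems.append(f"missing keys: {missing}")
    # Delimiters are expected to be single characters such as "," or "|".
    for key in ("inbound_delimiter", "outbound_delimiter"):
        value = config.get(key)
        if isinstance(value, str) and len(value) != 1:
            problems.append(f"{key} should be a single character, got {value!r}")
    # Duplicate names would make the header comparison ambiguous.
    columns = config.get("columns", [])
    if len(columns) != len(set(columns)):
        problems.append("duplicate column names in 'columns'")
    return problems


sample = {
    "file_type": "csv",
    "file_name": "commerce",
    "inbound_delimiter": ",",
    "outbound_delimiter": "|",
    "columns": ["event_time", "event_type", "product_id"],
}
print(check_config(sample))  # []
```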

## Data reading - validation - writing

### Read data

In [4]:
df = utility.read_data(config)

In [5]:
df.head()

|  | event_time | event_type | product_id | category_id | category_code | brand | price | user_id | user_session |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-10-01 00:00:00 UTC | view | 44600062 | 2103807459595387724 |  | shiseido | 35.79 | 541312140 | 72d76fde-8bb3-4e00-8c23-a032dfed738c |
| 1 | 2019-10-01 00:00:00 UTC | view | 3900821 | 2053013552326770905 | appliances.environment.water_heater | aqua | 33.2 | 554748717 | 9333dfbd-b87a-4708-9857-6336556b0fcc |
| 2 | 2019-10-01 00:00:01 UTC | view | 17200506 | 2053013559792632471 | furniture.living_room.sofa |  | 543.1 | 519107250 | 566511c2-e2e3-422b-b695-cf8e6e792ca8 |
| 3 | 2019-10-01 00:00:01 UTC | view | 1307067 | 2053013558920217191 | computers.notebook | lenovo | 251.74 | 550050854 | 7c90fc70-0e80-4590-96f3-13c02c18c713 |
| 4 | 2019-10-01 00:00:04 UTC | view | 1004237 | 2053013555631882655 | electronics.smartphone | apple | 1081.98 | 535871217 | c6bd7419-2748-4c56-95b4-8cec9ff8b80d |


### Validate header

In [6]:
utility.col_header_val(df, config)

column name and column length validation passed


True
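
The body of `utility.col_header_val` is not shown in this notebook. A header check of this kind could look roughly like the sketch below; the function name `check_header`, the case-insensitive comparison, and the requirement that columns appear in the same order as in the YAML file are all assumptions.

```python
def check_header(actual_columns, expected_columns) -> bool:
    """Compare the ingested header with the YAML column list (hypothetical helper)."""
    # Assumption: comparison ignores case and surrounding whitespace.
    actual = [c.strip().lower() for c in actual_columns]
    expected = [c.strip().lower() for c in expected_columns]

    if len(actual) != len(expected):
        print(f"column count mismatch: file has {len(actual)}, YAML has {len(expected)}")
        return False

    # Assumption: order matters, so the two lists must match position by position.
    if actual != expected:
        missing = [c for c in expected if c not in actual]
        extra = [c for c in actual if c not in expected]
        print(f"column name mismatch: missing from file {missing}, not in YAML {extra}")
        return False

    print("column name and column length validation passed")
    return True


ok = check_header(["event_time", "price"], ["event_time", "price"])
```

With a DataFrame, a call along the lines of `check_header(list(df.columns), config['columns'])` would feed in the same inputs used above.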

### Write compressed file

In [7]:
utility.write_compressed(df, config, compression='gzip')

Summary:
File size: 1.6 Gb
Number of columns: 9
Number of rows: 42448764
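
The output format and summary can be illustrated without dask. The sketch below is a stand-in that uses only the standard library (`csv`, `gzip`), not the implementation of `utility.write_compressed`; the function name `write_pipe_gz` and the output file name are hypothetical, and the two sample rows are taken from the preview above, trimmed to three columns.

```python
import csv
import gzip
import os
import tempfile


def write_pipe_gz(rows, header, path, delimiter="|"):
    """Write header and rows to a gzip-compressed delimited file; return a summary dict."""
    n_rows = 0
    # "wt" opens the gzip stream in text mode; newline="" lets csv control line endings.
    with gzip.open(path, "wt", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, delimiter=delimiter, lineterminator="\n")
        writer.writerow(header)
        for row in rows:
            writer.writerow(row)
            n_rows += 1
    # File size is measured on disk, i.e. after compression.
    return {"rows": n_rows, "columns": len(header), "size_bytes": os.path.getsize(path)}


header = ["event_time", "event_type", "price"]
rows = [
    ["2019-10-01 00:00:00 UTC", "view", "35.79"],
    ["2019-10-01 00:00:00 UTC", "view", "33.2"],
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "commerce.txt.gz")  # hypothetical output name
    summary = write_pipe_gz(rows, header, out_path)
    # Read the first line back to confirm the delimiter.
    with gzip.open(out_path, "rt", encoding="utf-8") as fh:
        first_line = fh.readline().rstrip("\n")

print(summary["rows"], summary["columns"], first_line)
```

Because rows are streamed one at a time, this approach keeps memory use flat, though for a multi-gigabyte file a parallel writer such as the dask-based one used in this notebook is likely to be faster.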
