# File ingestion and schema validation

## Task:
- Take any CSV/text file of 2+ GB of your choice. (You can do this assignment on Google Colab.)
- Read the file (present your approach to reading the file).
- Try different methods of file reading, e.g. Dask, Modin, Ray, pandas, and present your findings in terms of computational efficiency.
- Perform basic validation on the data columns, e.g. remove special characters and white space from the column names.
- As you already know the schema, create a YAML file and write the column names in it. Also define the separators of the read and write files in the YAML file.
- Validate the number of columns and the column names of the ingested file against the YAML file.
- Write the file as a pipe-separated (|) text file in gz format.
- Create a summary of the file:
    - total number of rows
    - total number of columns
    - file size
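
The last two steps of the task (pipe-separated gz output and the file summary) can be sketched with the standard library alone. This is a minimal illustration on made-up toy rows, not the notebook's own implementation; the helper names `write_pipe_gz` and `summarize` are hypothetical.

```python
import csv
import gzip
import os
import tempfile


def write_pipe_gz(rows, header, out_path, delimiter="|"):
    """Write a header and rows to a gzip-compressed, delimiter-separated text file."""
    with gzip.open(out_path, "wt", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, delimiter=delimiter)
        writer.writerow(header)
        writer.writerows(rows)


def summarize(out_path, delimiter="|"):
    """Return the row count (excluding the header), column count and size in bytes."""
    with gzip.open(out_path, "rt", newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh, delimiter=delimiter)
        header = next(reader)
        n_rows = sum(1 for _ in reader)
    return {
        "rows": n_rows,
        "columns": len(header),
        "size_bytes": os.path.getsize(out_path),
    }


# Toy data standing in for the real file (values are made up).
header = ["id", "title", "price"]
rows = [["1", "Book A", "9.99"], ["2", "Book B | Vol. 2", "12.50"]]

out_path = os.path.join(tempfile.mkdtemp(), "sample.txt.gz")
write_pipe_gz(rows, header, out_path)
summary = summarize(out_path)
print(summary)
```

The `csv` writer quotes any field that itself contains the `|` delimiter, so the second toy row survives the round trip. With a pandas DataFrame, `DataFrame.to_csv(path, sep="|", compression="gzip", index=False)` produces the same kind of file.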

# Read the data file

In [1]:
import os
import time

In [2]:
# get the size of data file
size = os.path.getsize("C:/Users/chitr/Documents/DataGlacier/Data/Books_rating.csv")
print(f"size of the data file : {size}")

size of the data file : 2859504349
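
`os.path.getsize` reports the size in bytes. A small helper (the name `describe_size` is hypothetical) makes it easier to confirm that the file meets the 2+ GB requirement:

```python
def describe_size(num_bytes):
    """Format a byte count in decimal gigabytes and binary gibibytes."""
    gb = num_bytes / 1000**3   # decimal gigabytes
    gib = num_bytes / 1024**3  # binary gibibytes
    return f"{gb:.2f} GB ({gib:.2f} GiB)"


size = 2859504349  # value printed by the cell above
print(describe_size(size))  # 2.86 GB (2.66 GiB)
```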


## Read the data with Pandas

In [4]:
import pandas as pd
start = time.time()
df = pd.read_csv('C:/Users/chitr/Documents/DataGlacier/Data/Books_rating.csv')
end = time.time()
print("Read csv with pandas: ",(end-start),"sec")

Read csv with pandas:  35.50532031059265 sec
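
The same start/stop timing pattern is repeated for every library in this notebook. One way to avoid the repetition is a small wrapper; this is a sketch, and the name `timed` is a hypothetical helper rather than something the notebook defines. It uses `time.perf_counter`, which is intended for measuring elapsed intervals.

```python
import time


def timed(label, func, *args, **kwargs):
    """Call func(*args, **kwargs), print the elapsed time, return (result, seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f} sec")
    return result, elapsed


# Stand-in workload; any reader function can be passed the same way.
total, elapsed = timed("sum of 1..1000", sum, range(1, 1001))
```

For the reads in this notebook the call would look like `df, secs = timed("Read csv with pandas", pd.read_csv, path)`, where `path` is the location of the CSV file.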


## Read the data with Dask

In [6]:
import dask.dataframe as dd
start = time.time()
df = dd.read_csv('C:/Users/chitr/Documents/DataGlacier/Data/Books_rating.csv')
end = time.time()
print("Read csv with dask: ",(end-start),"sec")

Read csv with dask:  1.442429780960083 sec


## Read the data with Modin and Ray

In [8]:
import modin.pandas as mpd  # separate alias so that plain pandas (pd) is not shadowed
import ray
ray.shutdown()  # stop any Ray instance left over from an earlier run
ray.init()
start = time.time()
df = mpd.read_csv('C:/Users/chitr/Documents/DataGlacier/Data/Books_rating.csv')
end = time.time()
print("Read csv with modin and ray: ",(end-start),"sec")

2023-02-10 09:47:17,699	INFO worker.py:1529 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8266


Read csv with modin and ray:  33.37739181518555 sec


## Read the data with Vaex

In [14]:
import vaex
start = time.time()
df = vaex.open('C:/Users/chitr/Documents/DataGlacier/Data/Books_rating.csv')
end = time.time()
print("Read csv with vaex: ",(end-start),"sec")

ModuleNotFoundError: No module named 'vaex'

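## Dask returned fastest, with a read time of approximately 1.44 sec

Pandas took about 35.5 sec and Modin on Ray about 33.4 sec. Vaex could not be timed because the module was not installed. Note that `dd.read_csv` is lazy: it only samples the file and builds a task graph, so the Dask figure does not include parsing the whole file.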

In [33]:
import dask.dataframe as dd
df = dd.read_csv('C:/Users/chitr/Documents/DataGlacier/Data/Books_rating.csv', delimiter = ',')

In [34]:
df.info()

<class 'dask.dataframe.core.DataFrame'>
Columns: 10 entries, Id to review/text
dtypes: object(6), float64(2), int64(2)

In [35]:
print("Number of columns : ", len(df.columns))

Number of columns :  10


In [4]:
df.columns

Index(['Id', 'Title', 'Price', 'User_id', 'profileName', 'review/helpfulness',
       'review/score', 'review/time', 'review/summary', 'review/text'],
      dtype='object')

In [5]:
# Remove spaces and special characters from the column names.
# regex=True makes the pattern an explicit regular expression (character class).
df.columns = df.columns.str.replace(r'[ ,/&#@_-]', '', regex=True)
df.columns


Index(['Id', 'Title', 'Price', 'Userid', 'profileName', 'reviewhelpfulness',
       'reviewscore', 'reviewtime', 'reviewsummary', 'reviewtext'],
      dtype='object')

In [6]:
df.columns = df.columns.str.lower()
df.columns

Index(['id', 'title', 'price', 'userid', 'profilename', 'reviewhelpfulness',
       'reviewscore', 'reviewtime', 'reviewsummary', 'reviewtext'],
      dtype='object')
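
The two cleaning steps above can also be combined into one function that is easy to test on its own. This sketch uses a whitelist (keep only letters and digits) instead of the blacklist of characters used in the cell above, which is a deliberate substitution; the name `clean_column_name` is hypothetical.

```python
import re


def clean_column_name(name):
    """Strip surrounding white space, drop non-alphanumeric characters, lower-case."""
    name = name.strip()
    name = re.sub(r"[^0-9A-Za-z]", "", name)
    return name.lower()


raw_columns = ['Id', 'Title', 'Price', 'User_id', 'profileName',
               'review/helpfulness', 'review/score', 'review/time',
               'review/summary', 'review/text']
cleaned = [clean_column_name(c) for c in raw_columns]
print(cleaned)
```

On a DataFrame the result would be assigned back with `df.columns = [clean_column_name(c) for c in df.columns]`.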

# Validation

In [21]:
import logging
import os
import subprocess
import yaml
import datetime 
import gc
import re

In [28]:
# Read config file
import utility

In [29]:
config_data = utility.read_config_file("file.yaml")

In [30]:
config_data['inbound_delimiter']

','

In [31]:
#inspecting data of config file
config_data

{'file_type': 'csv',
 'datset_name': 'testfile',
 'file_name': 'Books_rating',
 'table_name': 'edusurv',
 'inbound_delimiter': ',',
 'outbound_delimiter': '|',
 'skip_leading_rows': 1,
 'columns': ['id',
  'title',
  'price',
  'userid',
  'profilename',
  'reviewhelpfulness',
  'reviewscore',
  'reviewtime',
  'reviewsummary',
  'reviewtext']}
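
Neither `file.yaml` nor the `utility` module is shown in this notebook. The sketch below is one plausible version of both, reconstructed from the dictionary printed above (keys are copied as they appear in that output). It assumes PyYAML and that `read_config_file` simply wraps `yaml.safe_load`; the real helper may differ.

```python
import os
import tempfile

import yaml  # PyYAML

# Assumed content of file.yaml, mirroring the config_data output above.
CONFIG_TEXT = """\
file_type: csv
datset_name: testfile
file_name: Books_rating
table_name: edusurv
inbound_delimiter: ","
outbound_delimiter: "|"
skip_leading_rows: 1
columns:
  - id
  - title
  - price
  - userid
  - profilename
  - reviewhelpfulness
  - reviewscore
  - reviewtime
  - reviewsummary
  - reviewtext
"""


def read_config_file(path):
    """Load a YAML config file into a dict (hypothetical stand-in for utility.read_config_file)."""
    with open(path, "r", encoding="utf-8") as fh:
        return yaml.safe_load(fh)


# Write the sample config to a temporary location and read it back.
config_path = os.path.join(tempfile.mkdtemp(), "file.yaml")
with open(config_path, "w", encoding="utf-8") as fh:
    fh.write(CONFIG_TEXT)

config = read_config_file(config_path)
print(config["inbound_delimiter"], len(config["columns"]))
```

Quoting the delimiters in the YAML keeps `,` and `|` as plain strings instead of letting the parser interpret them.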

In [37]:
# Read the file using the settings from the config file
file_type = config_data['file_type']
source_file = "C:/Users/chitr/Documents/DataGlacier/Data/" + config_data['file_name'] + f'.{file_type}'

print(source_file)
# Pass the delimiter by keyword. The second positional argument of dd.read_csv is
# blocksize, so passing ',' positionally raises
# "ValueError: ... Could not interpret '1,' as a number".
df = dd.read_csv(source_file, sep=config_data['inbound_delimiter'])


C:/Users/chitr/Documents/DataGlacier/Data/Books_rating.csv
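
Once the file is loaded, the check of column count and column names against the YAML config could look like the sketch below. The function name `validate_columns` and the message wording are hypothetical; the expected list is the `columns` entry of the config shown earlier.

```python
def validate_columns(actual, expected):
    """Compare ingested column names with the configured ones; return a list of problems."""
    actual = [c.strip().lower() for c in actual]
    expected = [c.strip().lower() for c in expected]
    problems = []
    if len(actual) != len(expected):
        problems.append(
            f"column count mismatch: file has {len(actual)}, config expects {len(expected)}"
        )
    missing = [c for c in expected if c not in actual]
    extra = [c for c in actual if c not in expected]
    if missing:
        problems.append(f"missing from file: {missing}")
    if extra:
        problems.append(f"not in config: {extra}")
    if not problems and actual != expected:
        problems.append("column order differs from config")
    return problems


expected_columns = ['id', 'title', 'price', 'userid', 'profilename',
                    'reviewhelpfulness', 'reviewscore', 'reviewtime',
                    'reviewsummary', 'reviewtext']

ok = validate_columns(expected_columns, expected_columns)
bad = validate_columns(expected_columns[:-1], expected_columns)
print("validation passed" if not ok else ok)
print(bad)
```

In the notebook the call would be `validate_columns(df.columns, config_data['columns'])`, and an empty list means the ingested file matches the schema.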