# File ingestion and schema validation

## Task:
- Take any CSV/text file of 2+ GB of your choice. (You can do this assignment on Google Colab.)
- Read the file and present your approach to reading it.
- Try different methods of reading the file (e.g. Dask, Modin, Ray, pandas) and present your findings in terms of computational efficiency.
- Perform basic validation on the data columns, e.g. remove special characters and white space from the column names.
- Since you already know the schema, create a YAML file that defines the column names and the separators used for reading and writing the file.
- Validate the number of columns and the column names of the ingested file against the YAML file.
- Write the file as a pipe-separated (|) text file in gz format.
- Create a summary of the file:
    - total number of rows
    - total number of columns
    - file size

# Read the data file

In [1]:
import os
import time

In [2]:
# get the size of data file
size = os.path.getsize("C:/Users/chitr/Documents/DataGlacier/Data/Books_rating.csv")
print(f"size of the data file : {size}")

size of the data file : 2859504349
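
The value printed above is in bytes. To check it against the 2+ GB requirement at a glance, a small formatting helper can be used. The following is a minimal sketch that assumes binary units (1 GiB = 1024³ bytes); `human_size` is a hypothetical name and is not part of the original notebook.

```python
def human_size(num_bytes: int) -> str:
    """Format a byte count using binary units (hypothetical helper)."""
    value = float(num_bytes)
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    for unit in units:
        # Stop at the first unit where the value drops below 1024,
        # or at the largest unit in the list.
        if value < 1024 or unit == units[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024


print(human_size(2859504349))  # the size reported above -> 2.66 GiB
```

In decimal units (1 GB = 10⁹ bytes) the same file is about 2.86 GB, so it meets the requirement either way.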


## Read the data with Pandas

In [3]:
import pandas as pd
start = time.time()
df = pd.read_csv('C:/Users/chitr/Documents/DataGlacier/Data/Books_rating.csv')
end = time.time()
print("Read csv with pandas: ",(end-start),"sec")

Read csv with pandas:  35.40927815437317 sec


## Read the data with Dask

In [4]:
import dask.dataframe as dd
start = time.time()
df = dd.read_csv('C:/Users/chitr/Documents/DataGlacier/Data/Books_rating.csv')
end = time.time()
print("Read csv with dask: ",(end-start),"sec")

Read csv with dask:  0.8652992248535156 sec
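
The Dask figure covers only the `dd.read_csv` call itself. That call samples the start of the file to infer dtypes and builds a lazy task graph; the file is parsed only when a result is computed. One way to compare the libraries more fairly is to time a callable that forces the data to be read. The helper below is a minimal sketch using only the standard library; `timed` is a hypothetical name, and the commented usage lines assume that `pd`, `dd` and a `path` variable pointing at the CSV file are already defined.

```python
import time


def timed(label, fn):
    """Run fn(), print the elapsed wall-clock time, and return (result, seconds)."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f} sec")
    return result, elapsed


# Illustrative usage (not run here; assumes pandas and dask are installed and
# `path` points at the CSV file used above). len() forces Dask to read the data:
#   timed("pandas read_csv + row count", lambda: len(pd.read_csv(path)))
#   timed("dask read_csv + row count", lambda: len(dd.read_csv(path)))

# Stand-in workload so the sketch runs on its own.
total, seconds = timed("sum of 0..999", lambda: sum(range(1000)))
```

`time.perf_counter()` is used instead of `time.time()` because it is a monotonic, high-resolution clock intended for measuring intervals.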


## Read the data with Modin and Ray

In [5]:
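import modin.pandas as mpd  # separate alias so the earlier pandas import (pd) is not shadowed
import ray
ray.shutdown()  # stop any Ray instance left over from a previous run
ray.init()
start = time.time()
df = mpd.read_csv('C:/Users/chitr/Documents/DataGlacier/Data/Books_rating.csv')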
end = time.time()
print("Read csv with modin and ray: ",(end-start),"sec")

2023-02-06 10:27:23,369	INFO worker.py:1529 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8266


Read csv with modin and ray:  61.704468965530396 sec


## Read the data with Vaex

In [6]:
import vaex
start = time.time()
df = vaex.open('C:/Users/chitr/Documents/DataGlacier/Data/Books_rating.csv')
end = time.time()
print("Read csv with vaex: ",(end-start),"sec")

Read csv with vaex:  6.6056108474731445 sec


## Dask has the shortest read-call time (approximately 0.87 sec) compared with pandas, Modin on Ray, and Vaex; because `dd.read_csv` is lazy, this measures setup rather than a full read

In [7]:
import dask.dataframe as dd
df = dd.read_csv('C:/Users/chitr/Documents/DataGlacier/Data/Books_rating.csv', delimiter = ',')

In [8]:
df.info()

<class 'dask.dataframe.core.DataFrame'>
Columns: 10 entries, Id to review/text
dtypes: object(6), float64(2), int64(2)

In [11]:
print("Number of columns : ", len(df.columns))

Number of columns :  10


In [15]:
df.columns

Index(['Id', 'Title', 'Price', 'User_id', 'profileName', 'review/helpfulness',
       'review/score', 'review/time', 'review/summary', 'review/text'],
      dtype='object')

In [20]:
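# Remove spaces and special characters from the column names.
# regex=True is needed in pandas >= 2.0, where str.replace defaults to literal matching.
df.columns = df.columns.str.replace(r'[ ,/&#@_-]', '', regex=True)
df.columns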



Index(['Id', 'Title', 'Price', 'Userid', 'profileName', 'reviewhelpfulness',
       'reviewscore', 'reviewtime', 'reviewsummary', 'reviewtext'],
      dtype='object')
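
The same cleanup can also be written as a plain function, which is easier to reuse and to test than an inline regular expression. The sketch below is one interpretation of "basic validation", not the notebook's own method: it removes every character that is not a letter or digit and raises an error if two names collapse into one or a name becomes empty. `clean_column_names` is a hypothetical helper.

```python
import re


def clean_column_names(names):
    """Remove whitespace and special characters from column names."""
    cleaned = [re.sub(r"[^0-9A-Za-z]", "", name) for name in names]
    # Guard against two different names collapsing into the same cleaned name.
    if len(set(cleaned)) != len(cleaned):
        raise ValueError(f"cleaning produced duplicate column names: {cleaned}")
    # Guard against a name that consisted only of special characters.
    if any(not name for name in cleaned):
        raise ValueError(f"cleaning produced an empty column name: {cleaned}")
    return cleaned


# Column names as listed in the notebook output before cleaning.
raw = ['Id', 'Title', 'Price', 'User_id', 'profileName', 'review/helpfulness',
       'review/score', 'review/time', 'review/summary', 'review/text']
print(clean_column_names(raw))
```

With a dataframe, the result could be assigned back with `df.columns = clean_column_names(df.columns)`.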

In [21]:
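# Lowercase all column names. The method is .str.lower(); .str.tolower() does not exist
# and raises: AttributeError: 'StringMethods' object has no attribute 'tolower'
df.columns = df.columns.str.lower()
df.columns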
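
The task calls for checking the ingested column count and column names against a YAML schema. A minimal sketch of that check follows. It assumes the YAML file has already been parsed into a dictionary (for example with PyYAML's `yaml.safe_load`). The keys `inbound_delimiter`, `outbound_delimiter` and `columns`, and the function `validate_columns`, are hypothetical and are not taken from the notebook.

```python
# Stand-in for the parsed YAML file; the key names are assumptions for illustration.
config = {
    "inbound_delimiter": ",",
    "outbound_delimiter": "|",
    "columns": ["id", "title", "price", "userid", "profilename",
                "reviewhelpfulness", "reviewscore", "reviewtime",
                "reviewsummary", "reviewtext"],
}


def validate_columns(actual, expected):
    """Return a list of problems; an empty list means the header matches the schema."""
    problems = []
    if len(actual) != len(expected):
        problems.append(f"expected {len(expected)} columns, found {len(actual)}")
    missing = [c for c in expected if c not in actual]
    extra = [c for c in actual if c not in expected]
    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"unexpected columns: {extra}")
    # Same names and count, but in a different order.
    if not problems and list(actual) != list(expected):
        problems.append("column order differs from the schema")
    return problems


# The cleaned column names from the notebook, lowercased.
ingested = ["id", "title", "price", "userid", "profilename",
            "reviewhelpfulness", "reviewscore", "reviewtime",
            "reviewsummary", "reviewtext"]
print(validate_columns(ingested, config["columns"]))  # prints [] when the header matches
```

Returning a list of problems rather than a single boolean makes it possible to report every mismatch at once before deciding whether to stop the ingestion.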