# File ingestion and schema validation

## Task:
- Take any CSV/text file of 2+ GB of your choice. (You can do this assignment on Google Colab.)
- Read the file and present your approach to reading it.
- Try different methods of reading the file (e.g. Dask, Modin, Ray, pandas) and present your findings in terms of computational efficiency.
- Perform basic validation on the data columns, e.g. remove special characters and white space from the column names.
- As the schema is already known, create a YAML file that lists the column names and defines the separators for the read and write files.
- Validate the number of columns and the column names of the ingested file against the YAML file.
- Write the file as a pipe-separated (|) text file in gz format.
- Create a summary of the file:
    - total number of rows
    - total number of columns
    - file size

# Get the 2+GB Data

In [1]:
import os
import time

In [2]:
# get the size of data file
size = os.path.getsize("C:/Users/chitr/Documents/DataGlacier/Data/Books_rating.csv")
print(f"size of the data file : {size}")

size of the data file : 2859504349


# Read the data file

## Read the data with Pandas

In [3]:
import pandas as pd
start = time.time()
df = pd.read_csv('C:/Users/chitr/Documents/DataGlacier/Data/Books_rating.csv')
end = time.time()
print("Read csv with pandas: ",(end-start),"sec")

Read csv with pandas:  19.417900562286377 sec


## Read the data with Dask

In [4]:
import dask.dataframe as dd
start = time.time()
df = dd.read_csv('C:/Users/chitr/Documents/DataGlacier/Data/Books_rating.csv')
end = time.time()
print("Read csv with dask: ",(end-start),"sec")

Read csv with dask:  0.5321564674377441 sec


## Read the data with Modin and Ray

In [5]:
import modin.pandas as pd
import ray
ray.shutdown()
ray.init()
start = time.time()
df = pd.read_csv('C:/Users/chitr/Documents/DataGlacier/Data/Books_rating.csv')
end = time.time()
print("Read csv with modin and ray: ",(end-start),"sec")

2023-02-12 20:08:39,857	INFO worker.py:1529 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265


Read csv with modin and ray:  13.920469045639038 sec


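## Dask returns fastest, but only because `dd.read_csv` is lazy

`dd.read_csv` samples the start of the file to infer dtypes and builds a task graph; the data itself is read only when a result is computed, so the 0.53 sec above is not a full read. Of the two eager readers, Modin on Ray (13.9 sec) was faster than pandas (19.4 sec) on this machine.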
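One way to put the readers on an equal footing is to time a callable that has to touch every row. The sketch below is only an illustration: `timed` and `count_rows_stdlib` are hypothetical helpers (not part of this notebook), and a tiny temporary file stands in for `Books_rating.csv` so that the block runs anywhere with the standard library alone.

```python
import csv
import os
import tempfile
import time


def timed(label, fn):
    """Run fn(), print the elapsed wall-clock time, and return (result, seconds)."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f} sec")
    return result, elapsed


def count_rows_stdlib(path):
    """Baseline reader: stream the file with the csv module and count data rows."""
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)


# Tiny stand-in file (assumption); swap in the real path for an actual benchmark.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.csv")
with open(demo_path, "w", newline="", encoding="utf-8") as fh:
    writer = csv.writer(fh)
    writer.writerow(["Id", "Title"])
    writer.writerows([[i, f"book {i}"] for i in range(1000)])

n_rows, seconds = timed("Read csv with csv module",
                        lambda: count_rows_stdlib(demo_path))

# With the libraries used in this notebook, the same helper could be called as:
#   timed("pandas", lambda: len(pd.read_csv(path)))
#   timed("dask",   lambda: len(dd.read_csv(path)))  # len() forces Dask to read the data
```

Timing `len(...)` rather than the bare `read_csv` call means every library has to parse the whole file before the clock stops.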

# Let's take a look at the data

In [6]:
import dask.dataframe as dd
df = dd.read_csv('C:/Users/chitr/Documents/DataGlacier/Data/Books_rating.csv', delimiter = ',')

In [7]:
df.info()

<class 'dask.dataframe.core.DataFrame'>
Columns: 10 entries, Id to review/text
dtypes: object(6), float64(2), int64(2)

In [8]:
print("Number of columns : ", len(df.columns))

Number of columns :  10


In [9]:
df.columns

Index(['Id', 'Title', 'Price', 'User_id', 'profileName', 'review/helpfulness',
       'review/score', 'review/time', 'review/summary', 'review/text'],
      dtype='object')

In [10]:
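# Drop spaces and special characters from the column names.
# regex=True is needed on pandas >= 2.0, where str.replace defaults to literal matching.
df.columns = df.columns.str.replace(r'[ ,/&#@_-]', '', regex=True)
df.columns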



Index(['Id', 'Title', 'Price', 'Userid', 'profileName', 'reviewhelpfulness',
       'reviewscore', 'reviewtime', 'reviewsummary', 'reviewtext'],
      dtype='object')

In [11]:
df.columns = df.columns.str.lower()
df.columns

Index(['id', 'title', 'price', 'userid', 'profilename', 'reviewhelpfulness',
       'reviewscore', 'reviewtime', 'reviewsummary', 'reviewtext'],
      dtype='object')
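The two cleaning steps above can also be folded into one small function. This is a minimal sketch using an allow-list (keep only letters and digits) instead of the deny-list of characters used in the cell above; `clean_column_name` is a hypothetical name, and the input list is copied from the `df.columns` output shown earlier.

```python
import re


def clean_column_name(name):
    """Strip surrounding whitespace, drop non-alphanumeric characters, lowercase."""
    return re.sub(r"[^0-9a-zA-Z]", "", name.strip()).lower()


# Column names as printed by df.columns above.
raw_columns = ['Id', 'Title', 'Price', 'User_id', 'profileName',
               'review/helpfulness', 'review/score', 'review/time',
               'review/summary', 'review/text']

cleaned = [clean_column_name(c) for c in raw_columns]
print(cleaned)
```

On a pandas DataFrame this would be applied as `df.columns = [clean_column_name(c) for c in df.columns]`. An allow-list also catches characters that were not anticipated, at the cost of removing any non-ASCII letters.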

# Validation

In [12]:
import logging
import os
import subprocess
import yaml
import datetime 
import gc
import re

In [13]:
# Read config file
import utility

In [14]:
config_data = utility.read_config_file("file.yaml")

In [15]:
config_data['inbound_delimiter']

','

In [16]:
#inspecting data of config file
config_data

{'file_type': 'csv',
 'datset_name': 'testfile',
 'file_name': 'Books_rating',
 'table_name': 'edusurv',
 'inbound_delimiter': ',',
 'outbound_delimiter': '|',
 'skip_leading_rows': 1,
 'columns': ['id',
  'title',
  'price',
  'userid',
  'profilename',
  'reviewhelpfulness',
  'reviewscore',
  'reviewtime',
  'reviewsummary',
  'reviewtext']}

In [17]:
# Build the source file path from the config file
file_type = config_data['file_type']
source_file = "C:/Users/chitr/Documents/DataGlacier/Data/" + config_data['file_name'] + f'.{file_type}'

In [18]:
import pandas as pd
# Pass the delimiter by keyword; pandas >= 2.0 no longer accepts it positionally.
df_sample = pd.read_csv(source_file, sep=config_data['inbound_delimiter'])
df_sample.head()



|  | Id | Title | Price | User_id | profileName | review/helpfulness | review/score | review/time | review/summary | review/text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1882931173 | Its Only Art If Its Well Hung! |  | AVCGYZL8FQQTD | Jim of Oz "jim-of-oz" | 7/7 | 4.0 | 940636800 | Nice collection of Julie Strain images | This is only for Julie Strain fans. It's a col... |
| 1 | 826414346 | Dr. Seuss: American Icon |  | A30TK6U7DNS82R | Kevin Killian | 10/10 | 5.0 | 1095724800 | Really Enjoyed It | I don't care much for Dr. Seuss but after read... |
| 2 | 826414346 | Dr. Seuss: American Icon |  | A3UH4UZ4RSVO82 | John Granger | 10/11 | 5.0 | 1078790400 | Essential for every personal and Public Library | If people become the books they read and if "t... |
| 3 | 826414346 | Dr. Seuss: American Icon |  | A2MVUWT453QH61 | Roy E. Perry "amateur philosopher" | 7/7 | 4.0 | 1090713600 | Phlip Nel gives silly Seuss a serious treatment | Theodore Seuss Geisel (1904-1991), aka &quot;D... |
| 4 | 826414346 | Dr. Seuss: American Icon |  | A22X4XUPKF66MR | D. H. Richards "ninthwavestore" | 3/3 | 4.0 | 1107993600 | Good academic overview | Philip Nel - Dr. Seuss: American IconThis is b... |


In [19]:
#validating the header of the file
utility.col_header_val(df_sample,config_data)

column name and column length validation failed
Following File columns are not in the YAML file ['review_helpfulness', 'review_score', 'review_time', 'user_id', 'review_summary', 'review_text']
Following YAML columns are not in the file uploaded ['userid', 'reviewsummary', 'reviewhelpfulness', 'reviewtext', 'reviewtime', 'reviewscore']


0

In [20]:
print("columns of files are:" ,df_sample.columns)
print("columns of YAML are:" ,config_data['columns'])

columns of files are: Index(['id', 'title', 'price', 'user_id', 'profilename', 'review_helpfulness',
       'review_score', 'review_time', 'review_summary', 'review_text'],
      dtype='object')
columns of YAML are: ['id', 'title', 'price', 'userid', 'profilename', 'reviewhelpfulness', 'reviewscore', 'reviewtime', 'reviewsummary', 'reviewtext']


In [21]:
if utility.col_header_val(df_sample,config_data)==0:
    print("validation failed")
else:
    print("col validation passed")

column name and column length validation failed
Following File columns are not in the YAML file ['review_helpfulness', 'review_score', 'review_time', 'user_id', 'review_summary', 'review_text']
Following YAML columns are not in the file uploaded ['userid', 'reviewsummary', 'reviewhelpfulness', 'reviewtext', 'reviewtime', 'reviewscore']
validation failed
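Judging only from the printed lists, the failure looks like a naming mismatch and not a schema problem: the file columns reach the comparison as `review_helpfulness`, `user_id`, and so on (separators turned into `_`), while the YAML lists `reviewhelpfulness`, `userid` (separators removed). The source of `utility.col_header_val` is not shown here, so the following is a hypothetical sketch, not the author's helper: `normalise` and `validate_header` are assumed names, and the point is simply that both sides go through the same normalisation before being compared.

```python
import re


def normalise(name):
    """Lowercase a column name and drop everything except letters and digits."""
    return re.sub(r"[^0-9a-z]", "", name.strip().lower())


def validate_header(file_columns, config):
    """Compare file columns with config['columns'] after identical normalisation.

    Returns (ok, not_in_yaml, not_in_file). Column order is ignored.
    """
    actual = [normalise(c) for c in file_columns]
    expected = [normalise(c) for c in config["columns"]]
    not_in_yaml = sorted(set(actual) - set(expected))
    not_in_file = sorted(set(expected) - set(actual))
    ok = len(actual) == len(expected) and not not_in_yaml and not not_in_file
    return ok, not_in_yaml, not_in_file


# Header as read from the CSV and columns as listed in the YAML (both from above).
file_columns = ['Id', 'Title', 'Price', 'User_id', 'profileName',
                'review/helpfulness', 'review/score', 'review/time',
                'review/summary', 'review/text']
config = {'columns': ['id', 'title', 'price', 'userid', 'profilename',
                      'reviewhelpfulness', 'reviewscore', 'reviewtime',
                      'reviewsummary', 'reviewtext']}

ok, not_in_yaml, not_in_file = validate_header(file_columns, config)
print("col validation passed" if ok else "validation failed")

# A file that lacks the last column is still reported as a failure.
broken = validate_header(file_columns[:-1], config)
print(broken)
```

The alternative fix is to leave the helper alone and write the YAML column names with underscores so that they match what the helper produces.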


# Write the file in pipe separated text file (|) in gz format.

In [22]:
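import csv

# Write pipe-separated, gzip-compressed output. QUOTE_ALL keeps any '|' or
# newline inside the review text from breaking the row structure.
# `lineterminator` needs pandas >= 1.5 (older versions call it `line_terminator`).
df_sample.to_csv('Books_ratings.csv.gz',
          sep=config_data['outbound_delimiter'],  # '|' from file.yaml
          header=True,
          index=False,
          mode='w',
          quoting=csv.QUOTE_ALL,
          compression='gzip',
          quotechar='"',
          doublequote=True,
          lineterminator='\n')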

In [23]:
# size of the gzipped output file, in bytes
os.path.getsize("C:/Users/chitr/Documents/DataGlacier/Data_Ingestion_Pipeline/Books_ratings.csv.gz")

1060588843
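The task list also asks for a summary with the total number of rows, the total number of columns, and the file size. A minimal sketch of that step, assuming the output was written with `|` as separator and `"` as quote character as in the cell above: `summarise_gz` is a hypothetical helper, and a small temporary file stands in for `Books_ratings.csv.gz` so that the block is self-contained.

```python
import csv
import gzip
import os
import tempfile


def summarise_gz(path, delimiter="|"):
    """Stream a gzipped delimited file and return row count, column count, size."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh, delimiter=delimiter, quotechar='"')
        header = next(reader)
        n_rows = sum(1 for _ in reader)  # data rows only, header excluded
    return {
        "rows": n_rows,
        "columns": len(header),
        "size_bytes": os.path.getsize(path),
    }


# Small stand-in file (assumption) written in the same style as the real output.
demo_gz = os.path.join(tempfile.mkdtemp(), "demo.csv.gz")
with gzip.open(demo_gz, "wt", newline="", encoding="utf-8") as fh:
    writer = csv.writer(fh, delimiter="|", quoting=csv.QUOTE_ALL,
                        lineterminator="\n")
    writer.writerow(["id", "reviewtext"])
    writer.writerow(["1", "plain text"])
    writer.writerow(["2", "contains a | pipe"])
    writer.writerow(["3", "spans\ntwo lines"])

summary = summarise_gz(demo_gz)
print(summary)
```

Counting rows through `csv.reader` instead of counting newline characters matters here, because quoted review text may contain line breaks. Streaming also avoids loading the whole file into memory.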