# **Introduction**

The first step is to export the BigQuery data from GCP: we run a query that divides the large dataset into chunks and export the chunks to Google Drive as JSON files.

For this project we use Google Colab, which makes processing such a large dataset easier. We mount our Google Drive and use the PySpark API to start cleaning the dataset.

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.2.1.tar.gz (281.4 MB)
     |████████████████████████████████| 281.4 MB 47 kB/s 
Collecting py4j==0.10.9.3
  Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)
     |████████████████████████████████| 198 kB 66.6 MB/s 
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.1-py2.py3-none-any.whl size=281853642 sha256=5eb6b4d9e38673f4a1051abc0b4612843fceaddd6f1b8026ca4ad35d04e38011
  Stored in directory: /root/.cache/pip/wheels/9f/f5/07/7cd8017084dce4e93e84e92efd1e1d5334db05f2e83bcef74f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.3 pyspark-3.2.1


# **2.1: Imports**

**We import the libraries we will need and mount the personal Google Drive where we saved the JSON files, so they can be read in Google Colab**

In [2]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import pandas as pd 
import matplotlib.pyplot as plt
import random
from pyspark.sql.functions import col
from pyspark.sql.types import StringType,BooleanType,DateType, StructType
from pyspark.sql.functions import split
from pyspark.sql.functions import *


In [3]:
spark = SparkSession.builder.appName('Project').getOrCreate()

In [4]:
spark

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [8]:
df = spark.read.option("multiline","false").json("/content/drive/MyDrive/BigQuery_Edata")
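
With `multiline` set to `false`, Spark treats every line of each file as one complete JSON record (newline-delimited JSON). A minimal plain-Python sketch of that record-per-line parsing follows; the two-line sample and the helper name `read_json_lines` are made up for illustration and are not part of the real export.

```python
import io
import json

# Hypothetical two-record sample in newline-delimited JSON; the real export
# files are far larger and carry many more fields.
sample = io.StringIO(
    '{"visitId": "1496250487", "totals": {"hits": "5"}}\n'
    '{"visitId": "1496242570", "totals": {"hits": "7"}}\n'
)

def read_json_lines(handle):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]

records = read_json_lines(sample)
print(len(records), records[0]["totals"]["hits"])
```

Nested objects such as `totals` come back as nested dictionaries here, which is the plain-Python counterpart of the struct columns Spark infers.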

In [9]:
df.show(n=10)

+---------------+--------------------+--------+--------------------+-------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+-----------+--------------+
|channelGrouping|    customDimensions|    date|              device|      fullVisitorId|          geoNetwork|                hits|socialEngagementType|              totals|       trafficSource|   visitId|visitNumber|visitStartTime|
+---------------+--------------------+--------+--------------------+-------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+-----------+--------------+
|       Referral|[{4, North America}]|20170531|{Chrome, not avai...|5988717949752143819|{New York, not av...|[{{shop.googlemer...|Not Socially Engaged|{null, 5, null, 5...|{null, {null, not...|1496250487|          3|    1496250487|
|       Referral|                  []|20170531|{Chrome, not avai...|1914

#### Identify dtype of each substructure

In [10]:
df.printSchema() 

root
 |-- channelGrouping: string (nullable = true)
 |-- customDimensions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- index: string (nullable = true)
 |    |    |-- value: string (nullable = true)
 |-- date: string (nullable = true)
 |-- device: struct (nullable = true)
 |    |-- browser: string (nullable = true)
 |    |-- browserSize: string (nullable = true)
 |    |-- browserVersion: string (nullable = true)
 |    |-- deviceCategory: string (nullable = true)
 |    |-- flashVersion: string (nullable = true)
 |    |-- isMobile: boolean (nullable = true)
 |    |-- language: string (nullable = true)
 |    |-- mobileDeviceBranding: string (nullable = true)
 |    |-- mobileDeviceInfo: string (nullable = true)
 |    |-- mobileDeviceMarketingName: string (nullable = true)
 |    |-- mobileDeviceModel: string (nullable = true)
 |    |-- mobileInputSelector: string (nullable = true)
 |    |-- operatingSystem: string (nullable = true)
 |    |-- operati

#### You can explode the entire struct using the * operator
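
As a plain-Python analogy (not Spark code), selecting `device.*` promotes every field of the `device` struct to its own top-level column. The record below is a made-up fragment and `expand_struct` is a hypothetical helper used only to illustrate the idea.

```python
def expand_struct(record, struct_name):
    """Return the fields of record[struct_name] as a flat dict,
    mirroring what select("<struct_name>.*") yields for one row."""
    return dict(record[struct_name])

# Made-up fragment of one session record.
row = {
    "channelGrouping": "Referral",
    "device": {"browser": "Chrome", "deviceCategory": "desktop", "isMobile": False},
}

flat = expand_struct(row, "device")
print(flat)
```

Only the fields of the chosen struct appear in the result; other columns have to be listed separately in the same `select`.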

#### Automate the explosion by inspecting the dtypes of each column to identify the struct elements that need exploding

In [11]:
dtypes_df = df.withColumn('hits_col_copy', F.col('hits'))\
.withColumn('hits_col_array', F.col('hits_col_copy')[0]).select("hits_col_array.*").dtypes

print(dtypes_df)

[('appInfo', 'struct<exitScreenName:string,landingScreenName:string,screenDepth:string,screenName:string>'), ('contentGroup', 'struct<contentGroup1:string,contentGroup2:string,contentGroup3:string,contentGroup4:string,contentGroup5:string,contentGroupUniqueViews1:string,contentGroupUniqueViews2:string,contentGroupUniqueViews3:string,previousContentGroup1:string,previousContentGroup2:string,previousContentGroup3:string,previousContentGroup4:string,previousContentGroup5:string>'), ('customDimensions', 'array<string>'), ('customMetrics', 'array<string>'), ('customVariables', 'array<string>'), ('dataSource', 'string'), ('eCommerceAction', 'struct<action_type:string,option:string,step:string>'), ('eventInfo', 'struct<eventAction:string,eventCategory:string,eventLabel:string>'), ('exceptionInfo', 'struct<isFatal:boolean>'), ('experiment', 'array<string>'), ('hitNumber', 'string'), ('hour', 'string'), ('isEntrance', 'boolean'), ('isExit', 'boolean'), ('isInteraction', 'boolean'), ('item', 'st

In [12]:
dtypes_df23 = df.withColumn('hits_col_copy', F.col('hits'))\
.withColumn('hits_col_array', F.col('hits_col_copy')[0])\
.withColumn('product',F.col('hits_col_array.product'))\
.withColumn('finalproduct',F.col('product')[0]).select('finalproduct.*').dtypes

In [13]:
d05 = dtypes_df + dtypes_df23
d05

[('appInfo',
  'struct<exitScreenName:string,landingScreenName:string,screenDepth:string,screenName:string>'),
 ('contentGroup',
  'struct<contentGroup1:string,contentGroup2:string,contentGroup3:string,contentGroup4:string,contentGroup5:string,contentGroupUniqueViews1:string,contentGroupUniqueViews2:string,contentGroupUniqueViews3:string,previousContentGroup1:string,previousContentGroup2:string,previousContentGroup3:string,previousContentGroup4:string,previousContentGroup5:string>'),
 ('customDimensions', 'array<string>'),
 ('customMetrics', 'array<string>'),
 ('customVariables', 'array<string>'),
 ('dataSource', 'string'),
 ('eCommerceAction', 'struct<action_type:string,option:string,step:string>'),
 ('eventInfo',
  'struct<eventAction:string,eventCategory:string,eventLabel:string>'),
 ('exceptionInfo', 'struct<isFatal:boolean>'),
 ('experiment', 'array<string>'),
 ('hitNumber', 'string'),
 ('hour', 'string'),
 ('isEntrance', 'boolean'),
 ('isExit', 'boolean'),
 ('isInteraction', 'boo

In [14]:
dtypes_df1 = df.withColumn('device_col_array',F.col('device')).select("device_col_array.*").dtypes
print(dtypes_df1)

[('browser', 'string'), ('browserSize', 'string'), ('browserVersion', 'string'), ('deviceCategory', 'string'), ('flashVersion', 'string'), ('isMobile', 'boolean'), ('language', 'string'), ('mobileDeviceBranding', 'string'), ('mobileDeviceInfo', 'string'), ('mobileDeviceMarketingName', 'string'), ('mobileDeviceModel', 'string'), ('mobileInputSelector', 'string'), ('operatingSystem', 'string'), ('operatingSystemVersion', 'string'), ('screenColors', 'string'), ('screenResolution', 'string')]


In [15]:
dtypes_df2 = df.withColumn('customDimensions_col_array', F.col('customDimensions')[0]).select("customDimensions_col_array.*").dtypes
  
print(dtypes_df2)

[('index', 'string'), ('value', 'string')]


In [16]:
dtypes_df3 = df.withColumn('geoNetwork_col_array', F.col('geoNetwork')).select("geoNetwork_col_array.*").dtypes

print(dtypes_df3)

[('city', 'string'), ('cityId', 'string'), ('continent', 'string'), ('country', 'string'), ('latitude', 'string'), ('longitude', 'string'), ('metro', 'string'), ('networkDomain', 'string'), ('networkLocation', 'string'), ('region', 'string'), ('subContinent', 'string')]


In [17]:
dtypes_df4 = df.withColumn('totals_col_array', F.col('totals')).select("totals_col_array.*").dtypes

print(dtypes_df4)

[('bounces', 'string'), ('hits', 'string'), ('newVisits', 'string'), ('pageviews', 'string'), ('sessionQualityDim', 'string'), ('timeOnSite', 'string'), ('totalTransactionRevenue', 'string'), ('transactionRevenue', 'string'), ('transactions', 'string'), ('visits', 'string')]


In [18]:
dtypes_df5 = df.withColumn('trafficSource_col_array', F.col('trafficSource')).select("trafficSource_col_array.*").dtypes
print(dtypes_df5)

[('adContent', 'string'), ('adwordsClickInfo', 'struct<adNetworkType:string,criteriaParameters:string,gclId:string,isVideoAd:boolean,page:string,slot:string>'), ('campaign', 'string'), ('campaignCode', 'string'), ('isTrueDirect', 'boolean'), ('keyword', 'string'), ('medium', 'string'), ('referralPath', 'string'), ('source', 'string')]


In [19]:
struct_items05 = [item[0] for item in d05 if item[1].startswith('struct')]
print(struct_items05)

['appInfo', 'contentGroup', 'eCommerceAction', 'eventInfo', 'exceptionInfo', 'item', 'latencyTracking', 'page', 'promotionActionInfo', 'social', 'transaction']


**Build a selection key for each struct to explode; we will use these keys to select all of its columns**

In [20]:
# Use `name` as the loop variable so it does not shadow the imported `col` function
explode_struct_cols = ["hits_col_array." + name + ".*" for name in struct_items05]
explode_struct_cols

['hits_col_array.appInfo.*',
 'hits_col_array.contentGroup.*',
 'hits_col_array.eCommerceAction.*',
 'hits_col_array.eventInfo.*',
 'hits_col_array.exceptionInfo.*',
 'hits_col_array.item.*',
 'hits_col_array.latencyTracking.*',
 'hits_col_array.page.*',
 'hits_col_array.promotionActionInfo.*',
 'hits_col_array.social.*',
 'hits_col_array.transaction.*']

In [21]:
explode_list = [i+'.*' for i in df.columns]
print(explode_list)

['channelGrouping.*', 'customDimensions.*', 'date.*', 'device.*', 'fullVisitorId.*', 'geoNetwork.*', 'hits.*', 'socialEngagementType.*', 'totals.*', 'trafficSource.*', 'visitId.*', 'visitNumber.*', 'visitStartTime.*']
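
Star expansion is only valid on struct columns, so a list like `explode_list`, which appends `.*` to scalar columns such as `date` as well, cannot be passed to `select` as it stands. A minimal sketch, assuming the `(name, type)` pairs returned by `DataFrame.dtypes`, that adds `.*` only where the type string starts with `struct`; the helper `star_expand_structs` and the shortened type strings are hypothetical.

```python
def star_expand_structs(dtypes, prefix=""):
    """Build select expressions from (name, type) pairs:
    structs get '.*', every other column is kept as-is."""
    exprs = []
    for name, dtype in dtypes:
        path = prefix + name
        exprs.append(path + ".*" if dtype.startswith("struct") else path)
    return exprs

# Shortened, illustrative dtypes list (type strings abbreviated).
dtypes = [
    ("channelGrouping", "string"),
    ("device", "struct<browser:string,isMobile:boolean>"),
    ("hits", "array<struct<hitNumber:string>>"),
]

select_exprs = star_expand_structs(dtypes)
print(select_exprs)
```

Array columns such as `hits` are left untouched by this sketch; they still need indexing or exploding before their inner struct can be expanded.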


In [22]:
exp_list = ['channelGrouping','customDimensions.value','date', 'device.*', 'fullVisitorId', 'geoNetwork.*', 'socialEngagementType', 'totals.*', 'trafficSource.*', 'visitId', 'visitNumber', 'visitStartTime','product.*']

In [23]:
col_names = exp_list + explode_struct_cols
print(col_names)

['channelGrouping', 'customDimensions.value', 'date', 'device.*', 'fullVisitorId', 'geoNetwork.*', 'socialEngagementType', 'totals.*', 'trafficSource.*', 'visitId', 'visitNumber', 'visitStartTime', 'product.*', 'hits_col_array.appInfo.*', 'hits_col_array.contentGroup.*', 'hits_col_array.eCommerceAction.*', 'hits_col_array.eventInfo.*', 'hits_col_array.exceptionInfo.*', 'hits_col_array.item.*', 'hits_col_array.latencyTracking.*', 'hits_col_array.page.*', 'hits_col_array.promotionActionInfo.*', 'hits_col_array.social.*', 'hits_col_array.transaction.*']


**Create a new dataframe in which the nested columns of the original set are exploded into individual columns**

In [24]:
flatdf = df.withColumn('hits_col_copy', F.col('hits'))\
        .withColumn('hits_col_array', F.col('hits_col_copy')[0])\
        .withColumn('product',F.col('hits_col_array.product'))\
        .withColumn('product',F.col('product')[0])\
        .select(col_names)

In [25]:
flatdf.show(n=10,truncate = False)


### **2.2: Data Cleaning**: Now that all our data is in individual columns, we tidy it up and reduce its memory footprint to make it easier to read and store.

In [26]:
flatdf.dtypes

[('channelGrouping', 'string'),
 ('value', 'array<string>'),
 ('date', 'string'),
 ('browser', 'string'),
 ('browserSize', 'string'),
 ('browserVersion', 'string'),
 ('deviceCategory', 'string'),
 ('flashVersion', 'string'),
 ('isMobile', 'boolean'),
 ('language', 'string'),
 ('mobileDeviceBranding', 'string'),
 ('mobileDeviceInfo', 'string'),
 ('mobileDeviceMarketingName', 'string'),
 ('mobileDeviceModel', 'string'),
 ('mobileInputSelector', 'string'),
 ('operatingSystem', 'string'),
 ('operatingSystemVersion', 'string'),
 ('screenColors', 'string'),
 ('screenResolution', 'string'),
 ('fullVisitorId', 'string'),
 ('city', 'string'),
 ('cityId', 'string'),
 ('continent', 'string'),
 ('country', 'string'),
 ('latitude', 'string'),
 ('longitude', 'string'),
 ('metro', 'string'),
 ('networkDomain', 'string'),
 ('networkLocation', 'string'),
 ('region', 'string'),
 ('subContinent', 'string'),
 ('socialEngagementType', 'string'),
 ('bounces', 'string'),
 ('hits', 'string'),
 ('newVisits', '

# **2.2.1: Drop columns we are not going to use.**

In [27]:
flatdf.show(n=10,truncate=False)


In [28]:
droplist = (
 'browser','browserSize','browserVersion','flashVersion','language','mobileDeviceBranding','mobileDeviceInfo','mobileDeviceMarketingName','mobileDeviceModel',
 'mobileInputSelector','screenColors','screenResolution','latitude','longitude','sessionQualityDim','adwordsClickInfo','campaign','keyword','screenDepth',
 'contentGroup1','contentGroup2','contentGroup3','contentGroup4','contentGroup5','previousContentGroup1','previousContentGroup2','previousContentGroup3','previousContentGroup4',
 'previousContentGroup5','option','step','domLatencyMetricsSample','pageDownloadTime','pageLoadSample','pageLoadTime','serverConnectionTime','speedMetricsSample',
 'hostname','pagePathLevel1','pagePathLevel2','pagePathLevel3','pagePathLevel4','searchKeyword','affiliation','currencyCode')

In [29]:
flatdf = flatdf.drop(*droplist)

**Scan the remaining columns and identify any that should be dropped because they carry little information**

In [30]:
droplist2 = ('operatingSystemVersion','cityId','networkLocation','customDimensions','customMetrics','socialInteractionNetworkAction')

In [31]:
flatdf = flatdf.drop(*droplist2)
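
One way to find such columns programmatically, rather than by eye, is to flag every column whose values are all missing or all equal to a placeholder string. The sketch below works on a small made-up list of row dictionaries; `low_information_columns` is a hypothetical helper, and the set of placeholders is an assumption to adapt to the data.

```python
def low_information_columns(rows, placeholders):
    """Return the column names whose values are all None or a placeholder."""
    columns = list(rows[0].keys()) if rows else []
    return [
        c for c in columns
        if all(r.get(c) is None or r.get(c) in placeholders for r in rows)
    ]

# Made-up sample rows.
rows = [
    {"country": "United States", "cityId": "not available in demo dataset", "adContent": None},
    {"country": "Hong Kong", "cityId": "not available in demo dataset", "adContent": None},
]

candidates = low_information_columns(rows, {"not available in demo dataset"})
print(candidates)
```

On a large Spark dataframe the same check would normally run on a sample or through aggregated counts, since collecting every row to the driver is not practical.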

In [32]:
flatdf.show(n=5)


# **2.2.3: Drop unnecessary parentheses and create keywords for long repeating strings.**

In [33]:
flatdf = flatdf.withColumn('networkDomain', regexp_replace('networkDomain','[({ })]',""))\
.withColumn('value', regexp_replace('value','[ ]',""))\
.withColumn('city', regexp_replace('city','[({ })]',""))\
.withColumn('city', regexp_replace('city','notavailableindemodataset','NAID'))\
.withColumn('metro', regexp_replace('metro','not available in demo dataset','NAID'))\
.withColumn('metro', regexp_replace('metro','[({ })]',""))\
.withColumn('region', regexp_replace('region','[({ })]',""))\
.withColumn('region', regexp_replace('region','notavailableindemodataset','NAID'))\
.withColumn('socialEngagementType', regexp_replace('socialEngagementType','Not Socially Engaged','NSE'))\
.withColumn('medium',regexp_replace('medium','[({ })]',""))\
.withColumn('source',regexp_replace('source','[({ })]',""))\
.withColumn('productBrand',regexp_replace('productBrand','[({ })]',""))\
.withColumn('productVariant',regexp_replace('productVariant','[({ })]',""))\
.withColumn('socialNetwork',regexp_replace('socialNetwork','[({ })]',""))\
.withColumn('exitScreenName',regexp_replace('exitScreenName',r'www\.',''))\
.withColumn('exitScreenName',regexp_replace('exitScreenName',r'\.com',''))\
.withColumn('landingScreenName',regexp_replace('landingScreenName',r'www\.',''))\
.withColumn('landingScreenName',regexp_replace('landingScreenName',r'\.com',''))\
.withColumn('screenName',regexp_replace('screenName',r'www\.',''))\
.withColumn('screenName',regexp_replace('screenName',r'\.com',''))
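
In a regular expression an unescaped `.` matches any character, so a pattern such as `.com` also removes text like `lcom` inside a word, while `\.com` matches only a literal dot followed by `com`. The sketch below uses Python's `re` module on a made-up host name; Spark's `regexp_replace` uses Java regular expressions, where the dot behaves the same way.

```python
import re

# Made-up host name that contains "com" inside a word.
name = "shop.welcome.com"

# Unescaped dot: any character followed by "com" is removed.
unescaped = re.sub(".com", "", name)

# Escaped dot: only the literal ".com" is removed.
escaped = re.sub(r"\.com", "", name)

print(unescaped)
print(escaped)
```

The unescaped pattern turns the name into `shop.wee`, while the escaped one gives `shop.welcome`.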

In [34]:
droplist3 = ('subContinent','value','pagePath','productVariant','v2ProductCategory','continent','networkDomain','metro','region','socialEngagementType','screenName','searchCategory','isMobile','campaignCode','city','pageTitle','localTransactionRevenue','localTransactionShipping','localTransactionTax','transactionCoupon','transactionShipping','transactionTax','transactionId','transactionRevenue','v2ProductName')

In [35]:
flatdf = flatdf.drop(*droplist3)

In [36]:
flatdf.show()

+---------------+--------+--------------+---------------+-------------------+--------------+-------+----+---------+---------+----------+-----------------------+------------+------+--------------------+------------+---------+------------+--------+----------+-----------+--------------+-------+------------+-----------------+-------------------+------------+---------------+-------------------+------------+---------------+--------------+--------------+--------------------+--------------------+------------------------+------------------------+------------------------+-----------+-----------+-------------+----------+-------+--------------------+------------------+----------------+---------------+------------------+------------+-----------+-----------------------+-------------+
|channelGrouping|    date|deviceCategory|operatingSystem|      fullVisitorId|       country|bounces|hits|newVisits|pageviews|timeOnSite|totalTransactionRevenue|transactions|visits|           adContent|isTrueDirect|   me

In [None]:
flatdf.columns

In [38]:
flatdf.dtypes

[('channelGrouping', 'string'),
 ('date', 'string'),
 ('deviceCategory', 'string'),
 ('operatingSystem', 'string'),
 ('fullVisitorId', 'string'),
 ('country', 'string'),
 ('bounces', 'string'),
 ('hits', 'string'),
 ('newVisits', 'string'),
 ('pageviews', 'string'),
 ('timeOnSite', 'string'),
 ('totalTransactionRevenue', 'string'),
 ('transactions', 'string'),
 ('visits', 'string'),
 ('adContent', 'string'),
 ('isTrueDirect', 'boolean'),
 ('medium', 'string'),
 ('referralPath', 'string'),
 ('source', 'string'),
 ('visitId', 'string'),
 ('visitNumber', 'string'),
 ('visitStartTime', 'string'),
 ('isClick', 'boolean'),
 ('isImpression', 'boolean'),
 ('localProductPrice', 'string'),
 ('localProductRevenue', 'string'),
 ('productBrand', 'string'),
 ('productListName', 'string'),
 ('productListPosition', 'string'),
 ('productPrice', 'string'),
 ('productQuantity', 'string'),
 ('productRevenue', 'string'),
 ('productSKU', 'string'),
 ('exitScreenName', 'string'),
 ('landingScreenName', 'stri

# **2.2.4: Convert the PySpark dataframe to a pandas-on-Spark dataframe.**

In [39]:
pandas_df = flatdf.to_pandas_on_spark()



In [40]:
pandas_df.head()

Unnamed: 0,channelGrouping,date,deviceCategory,operatingSystem,fullVisitorId,country,bounces,hits,newVisits,pageviews,timeOnSite,totalTransactionRevenue,transactions,visits,adContent,isTrueDirect,medium,referralPath,source,visitId,visitNumber,visitStartTime,isClick,isImpression,localProductPrice,localProductRevenue,productBrand,productListName,productListPosition,productPrice,productQuantity,productRevenue,productSKU,exitScreenName,landingScreenName,contentGroupUniqueViews1,contentGroupUniqueViews2,contentGroupUniqueViews3,action_type,eventAction,eventCategory,eventLabel,isFatal,domContentLoadedTime,domInteractiveTime,domainLookupTime,redirectionTime,serverResponseTime,promoIsClick,promoIsView,hasSocialSourceReferral,socialNetwork
0,Referral,20170531,desktop,Linux,5988717949752143819,United States,,5,,5,22,,,1,,True,none,/,direct,1496250487,3,1496250487,,,,,,,,,,,,shop.googlemerchandisestore/store.html,shop.googlemerchandisestore/home,,,,0,,,,True,,,,,,,True,No,notset
1,Referral,20170531,desktop,Macintosh,1914206347497291855,Hong Kong,,7,,7,348,,,1,,True,none,/,direct,1496242570,7,1496242570,,,,,,,,,,,,shop.googlemerchandisestore/google+redesign/of...,shop.googlemerchandisestore/home,,,,0,,,,True,,,,,,,True,No,notset
2,Referral,20170531,desktop,Macintosh,5688612779726117211,United States,,8,,8,207,,,1,,,none,/,direct,1496268814,2,1496268814,,,,,,,,,,,,shop.googlemerchandisestore/google+redesign/ap...,shop.googlemerchandisestore/home,,,,0,,,,True,,,,,,,True,No,notset
3,Direct,20170531,desktop,Windows,5995173173416756249,Argentina,,10,,10,248,,,1,,True,none,,direct,1496260952,2,1496260952,,,,,,,,,,,,shop.googlemerchandisestore/google+redesign/of...,shop.googlemerchandisestore/home,,,,0,,,,True,,,,,,,True,No,notset
4,Referral,20170531,desktop,Macintosh,5703962423814396269,United States,,39,,25,2040,,,1,,True,none,/,direct,1496262254,22,1496262254,,,,,,,,,,,,shop.googlemerchandisestore/basket.html,shop.googlemerchandisestore/home,,,,0,,,,True,,,,,,,True,No,notset


**Change data types to match the values each column holds**

In [41]:
df = pandas_df

In [42]:
df.dtypes

channelGrouping             object
date                        object
deviceCategory              object
operatingSystem             object
fullVisitorId               object
country                     object
bounces                     object
hits                        object
newVisits                   object
pageviews                   object
timeOnSite                  object
totalTransactionRevenue     object
transactions                object
visits                      object
adContent                   object
isTrueDirect                  bool
medium                      object
referralPath                object
source                      object
visitId                     object
visitNumber                 object
visitStartTime              object
isClick                       bool
isImpression                  bool
localProductPrice           object
localProductRevenue         object
productBrand                object
productListName             object
productListPosition 

In [43]:
int_list = ['fullVisitorId',
 'bounces',
 'hits',
 'newVisits',
 'pageviews',
 'totalTransactionRevenue',
 'transactions',
 'visits',
 'visitId',
 'visitNumber',
 'localProductPrice',
 'localProductRevenue',
 'productListPosition',
 'productPrice',
 'productQuantity',
 'productRevenue',
 'contentGroupUniqueViews1',
 'contentGroupUniqueViews2',
 'contentGroupUniqueViews3',
 'action_type',
 'promoIsClick',
 'promoIsView']

In [44]:
df[int_list] = df[int_list].astype(str).astype(int)
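
In plain Python, `int('None')` and `int('nan')` raise `ValueError`, so a cast that goes through `str` depends on how the library handles missing values. A defensive alternative is sketched below; `parse_int64` is a hypothetical helper, and it also rejects values outside the signed 64-bit range, which long identifiers such as `fullVisitorId` could in principle exceed.

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1

def parse_int64(value):
    """Return value as an int, or None when it is missing, non-numeric,
    or outside the signed 64-bit range."""
    if value is None:
        return None
    try:
        number = int(str(value).strip())
    except ValueError:
        return None
    return number if INT64_MIN <= number <= INT64_MAX else None

# Made-up sample values, including a missing one and an oversized one.
samples = ["5", None, "nan", " 22 ", "18446744073709551615"]
parsed = [parse_int64(v) for v in samples]
print(parsed)
```

Keeping identifiers as strings is another reasonable choice, since they are never used in arithmetic.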

In [45]:
df.dtypes

channelGrouping             object
date                        object
deviceCategory              object
operatingSystem             object
fullVisitorId                int64
country                     object
bounces                      int64
hits                         int64
newVisits                    int64
pageviews                    int64
timeOnSite                  object
totalTransactionRevenue      int64
transactions                 int64
visits                       int64
adContent                   object
isTrueDirect                  bool
medium                      object
referralPath                object
source                      object
visitId                      int64
visitNumber                  int64
visitStartTime              object
isClick                       bool
isImpression                  bool
localProductPrice            int64
localProductRevenue          int64
productBrand                object
productListName             object
productListPosition 

In [None]:
df.head(n=5)

Save the file as Parquet on Google Drive

In [48]:
df.to_parquet('/content/drive/MyDrive/df.parquet')