# BigQuery
- [YouTube Tutorial Video Link](https://www.youtube.com/watch?v=woU1YYlSR7o)


## 1. BigQuery concepts
- BigQuery is a data warehouse service that runs SQL-like queries against multiple terabytes of data in seconds
- BigQuery is serverless: there is no infrastructure to provision or manage (great for small teams)
- Other data warehousing solutions include Amazon Redshift and Snowflake

Suppose you are a data engineer with 5 TB of data to analyze. Questions to ask:
- what am I going to build with this data?
- where is the data coming from?
    - data collection method (scraping / API / buying from a vendor / generated by the company)
    - check the data governance and usage policy in the docs or with the vendor
- what type of storage to use and what system holds this data?
    - local, on-premises cluster, cloud storage service (GCP)
    - security, scalability
    - redundancy and backups
    - how is the data accessed
- what does the data look like (data exploration)?
    - structured or unstructured
    - format
    - schema
    - change over time
- what kind of analytics are going to be performed?
    - processing engine
    - service availability
    - analysis output
        - who is going to access the data (data scientists / analysts)
    

- `NOTE:` "big data" usually refers to datasets larger than about 1 TB, the point at which common RDBMSs start to degrade because they are built for transactions, not analytics (online transactional databases vs online analytical databases)
    - OLAP (online analytical processing): software for performing multidimensional analysis at high speed on large volumes of data from a data warehouse, data mart, or some other unified, centralized data store
    - OLTP (online transaction processing): a type of data processing that executes many transactions concurrently, for example online banking, shopping, order entry, or sending text messages


- when the RDBMS reaches its limit, it's recommended to switch to a data warehouse system
- when working with a mix of structured and unstructured data, or only unstructured data, a data lake system is recommended

Big tech companies using BigQuery:
- Spotify
     - user data (structured)
     - audio data (unstructured)
     - analytics for each user (big number of analytics processes)

## 2. Interacting with BigQuery:

### Login and User Interface

 - log in to a Google account
 - use the sandbox to avoid entering billing info
 - from the product menu icon at the top left, choose BigQuery
 - in the BigQuery menu you can choose from multiple sub-products for analysis, migration, and monitoring

### Services
- SQL workspace: provides a code editor for SQL queries
- data transfer: a service that imports data into BigQuery from sources such as:
    - Google SaaS data and Google Cloud Storage (buckets)
    - external storage providers: Amazon S3 and Azure Blob Storage
    - external data warehouses: Amazon Redshift, Teradata
- scheduled queries: automate queries
- capacity management: reserve a fixed number of slots (the unit of compute that pricing is based on) for your jobs
- BI Engine: speeds up queries by keeping the most frequently used data in memory, including queries written by visualization tools

#### SQL workspace
- for a publicly available dataset on Google Cloud, click the view button to open the dataset
- in the workspace resources section, choose the data to work with (open the schema to explore the metadata)
- write a query to explore the data, e.g.:
```sql
select country_region
from `bigquery-public-data.covid19_jhu_csse.summary`
group by 1
limit 20;
```
- note: the FROM table format is `project_name.dataset_name.table_name`
- the table reference is enclosed in backticks (`` ` ``), not in single quotes ('') and not left unquoted
- check the execution details of the query to understand its usage and slot consumption (price)
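
The naming rule above can be sketched in Python. The helper name `quote_table_ref` is hypothetical (it is not part of any BigQuery library); it only builds the backtick-quoted `project.dataset.table` string that goes into the FROM clause.

```python
def quote_table_ref(project: str, dataset: str, table: str) -> str:
    """Return a backtick-quoted, fully qualified table reference."""
    parts = [project, dataset, table]
    if any(not part or "`" in part for part in parts):
        raise ValueError("each part must be non-empty and contain no backticks")
    return "`" + ".".join(parts) + "`"


# Rebuild the query from the example above.
table_ref = quote_table_ref("bigquery-public-data", "covid19_jhu_csse", "summary")
query = f"select country_region from {table_ref} group by 1 limit 20"
print(query)
```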

### Grants
- grant permissions for users to perform specific actions, e.g. admin, data editor, data viewer
- row-based access is done using views (create a query and save it as a view that shows only the data you want to show)
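
A minimal sketch of the view idea, using Python's built-in `sqlite3` as a local stand-in for BigQuery so it can run anywhere. The table, column, and view names are made up for illustration, and granting users access to the view rather than the table is a separate step that this sketch does not cover.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table orders (id integer, region text, amount real)")
conn.executemany(
    "insert into orders values (?, ?, ?)",
    [(1, "EU", 10.0), (2, "US", 25.0), (3, "EU", 7.5)],
)

# The view is a saved query that exposes only the EU rows and hides the region column.
conn.execute(
    "create view orders_eu as select id, amount from orders where region = 'EU'"
)

# Readers query the view instead of the underlying table.
eu_rows = conn.execute("select id, amount from orders_eu order by id").fetchall()
print(eu_rows)
```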

### Slots
A BigQuery slot is a virtual CPU used by BigQuery to execute SQL queries. During the query execution, BigQuery automatically calculates how many slots a query requires, depending on the query size and complexity.

## 3. Loading data into BigQuery

1. identify the data source and transform it into one of the data formats BigQuery accepts:
    - csv
    - new_line_delimited_json
    - ORC (Optimized Row Columnar, from Apache Hive) and Parquet (Apache Parquet, common with Apache Spark): columnar storage formats designed to improve performance for read-heavy workloads. Great for fast query times and efficient storage (used with big data solutions like the Apache products)
    - Avro (Hadoop): a compact and efficient data serialization format that supports schema evolution and self-description of data
    - datastore_backup: a binary format used by Google Cloud Datastore to store backups of its NoSQL database. The format is optimized for efficient storage and retrieval of large amounts of data, and is designed to be scalable and reliable

2. define the data schema:
- when defining a table schema, each column should have the following defined:
    - name: column name
    - type: SQL data type (e.g. INT64, FLOAT64, NUMERIC)
    - mode: nullable, required, or repeated
    - description: free-text note about the column
3. combine the data source and the schema into a table by creating a load job
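
Step 2 can be sketched in Python by building the schema in the JSON layout that BigQuery schema files use (a list of objects with `name`, `type`, `mode`, and `description`). The helper `field` and the column names are hypothetical, chosen only for illustration; the code produces the schema text and does not run a load job.

```python
import json


def field(name, field_type, mode="NULLABLE", description="", fields=None):
    """Build one column definition in the JSON schema layout."""
    column = {
        "name": name,
        "type": field_type,
        "mode": mode,
        "description": description,
    }
    if fields:  # nested columns, only meaningful for RECORD types
        column["fields"] = fields
    return column


# Hypothetical columns for a migration table.
schema = [
    field("country", "STRING", "REQUIRED", "destination country"),
    field("year", "INTEGER", "REQUIRED", "census year"),
    field("migrants", "INTEGER", description="number of foreign-born residents"),
]

schema_json = json.dumps(schema, indent=2)
print(schema_json)
```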

### loading data into BigQuery best practices
- you can either load and store the data in BigQuery or query it in place from external data sources; external sources have worse performance
- when loading data into BigQuery, load it incrementally so a failure does not cost you all your progress
- long-term data storage in BigQuery can be cheaper than Google Cloud Storage buckets (tables left unmodified for 90 days are billed at a reduced long-term rate)
- set a table expiration date: helps with cost management, performance optimization, and data governance
- document the access grants to external data sources:
    - security and auditability: track users and apps, and allow only authorized users to access the data
    - compliance: privacy regulations
- take full advantage of nested and repeated fields (cells that contain multiple values)
- when defining the schema manually, skip the header row (under advanced options) so it is not loaded as data
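
To make the nested and repeated fields point concrete, here is a small sketch that writes rows as newline-delimited JSON, one of the load formats listed earlier (CSV cannot carry nested or repeated values). The `orders` data and its field names are invented for illustration.

```python
import json

# Each order carries a repeated, nested "items" field (zero or more records).
orders = [
    {"order_id": 1, "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]},
    {"order_id": 2, "items": []},
]

# Newline-delimited JSON: exactly one JSON object per line.
ndjson = "\n".join(json.dumps(order) for order in orders)
print(ndjson)
```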

### Example

1. example 1:
- transform a dataset that is not compatible with BigQuery using Python
- export the dataset to csv and upload it using the BigQuery UI

2. example 2 (not yet implemented)
- using the GCP BigQuery library, upload the data in batches
- create a dashboard that tracks:
    - total uploaded data in gb
    - total number of rows
    - total number of nulls
    - a line graph of the above stats against batch number or timestamp, with a filter button to choose the stat
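
Since example 2 is not implemented yet, the following is only a rough sketch of the batching and per-batch stats idea. The `upload` argument is a hypothetical callable standing in for whatever client call eventually sends a batch to BigQuery; here it just appends to a list so the sketch runs without any cloud access.

```python
from typing import Callable


def upload_in_batches(rows: list, batch_size: int, upload: Callable[[list], None]) -> list:
    """Send rows in fixed-size batches and record stats for each batch."""
    stats = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        upload(batch)  # hypothetical uploader; expected to raise on failure
        nulls = sum(value is None for row in batch for value in row.values())
        stats.append({"batch": len(stats) + 1, "rows": len(batch), "nulls": nulls})
    return stats


# Made-up rows with a few missing values.
rows = [
    {"country": "DE", "migrants": 10},
    {"country": "FR", "migrants": None},
    {"country": None, "migrants": 3},
]
uploaded = []  # stand-in destination instead of a real table
batch_stats = upload_in_batches(rows, batch_size=2, upload=uploaded.extend)
print(batch_stats)
```

The per-batch records are the kind of data the dashboard's line graph could plot against batch number.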

__Example 1 datasets:__
- [DATA SOURCE](https://iab.de/en/daten/iab-brain-drain/)

- `Brain drain data`
Contains data on the total number of foreign-born individuals aged 25 years and older living in each of the 20 considered OECD destination countries by year, gender, country of origin and educational level. Educational levels are distinguished into low, medium and high skilled.

- `Migration by gender`
Total number of foreign-born individuals (all age groups) living in each of the 20 considered OECD destination countries by gender and country of origin.

- `Emigration rates`
Proportion of migrants of the pre-migration population (defined as the sum of residents and migrants in each source country) by gender, skill level and year. Age group: 25 years and older.

__Example 1 code:__

In [None]:
import pandas as pd

url_brainDrain = 'https://doku.iab.de/daten/brain-drain/iabbd_8010_v1.dta'
url_migrationByGender = 'https://doku.iab.de/daten/brain-drain/iabbd_8010_v1_gender.dta'
url_emigration = 'https://doku.iab.de/daten/brain-drain/iabbd_8010_v1_emigration.dta'

# load the data in chunks; chunksize is a number of rows per chunk, not bytes
chunk_size = int(0.1 * 1024 * 1024) # about 105k rows per chunk

df_brainDrain_chunk = pd.read_stata(url_brainDrain,chunksize= chunk_size)
df_brainDrain = pd.concat(df_brainDrain_chunk)

df_migrationByGender_chunk= pd.read_stata(url_migrationByGender,chunksize= chunk_size)
df_migrationByGender = pd.concat(df_migrationByGender_chunk)

df_emigration_chunk = pd.read_stata(url_emigration,chunksize= chunk_size)
df_emigration = pd.concat(df_emigration_chunk)

df_migrationByGender.to_csv('C:\\Users\\mohimen\\Desktop\\mod_migrationByGender.csv', index = False, header=False)

In [48]:
print('number of rows:',len(df_brainDrain), len(df_migrationByGender), len(df_emigration))
# data size estimation
print( (df_brainDrain.memory_usage().sum() / (1024*1024)).round(2), 'MB')
print((df_migrationByGender.memory_usage().sum() / (1024*1024)).round(2), 'MB')
print((df_emigration.memory_usage().sum() / (1024*1024)).round(2), 'MB')

number of rows: 53235 27293 4116
3.35 MB
1.41 MB
0.2 MB


## 4. Data exploration in BigQuery:
- query features and best practices

### Query batching:
 - normal (interactive) queries run immediately, without consideration for available computing resources
 - batch queries run whenever free resources are available
 
### Query results:
- each query has the following results:
    - job info tab: records data about the query such as query id, user id, creation timestamp, start time, duration, size, etc.
        - note: creation time and start time are not the same in batch queries, since a query can wait for up to 24 hours to start running
    - result and json tabs: the query results in tabular format and JSON format
    - execution details: breaks down the execution steps (wait, read, compute, write) for each query stage (input, join, aggregate, output, etc.)

### Query results are saved temporarily for 24h, or can be saved:
- locally: csv, json, clipboard (10 MB)
- on cloud: Google Sheets, csv (Google Drive), json (Google Drive) (1 GB)
- as a new permanent BigQuery table

### Scheduled queries (automated queries):
- schedule a query to execute at a specified frequency (minutes, hourly, daily, weekly)
- choose whether to overwrite or append the new data each time the query runs
- billing must be enabled

### best practices for querying data on BigQuery:
- calculate and document your query cost by plugging the query size (bytes processed) into the Google Cloud pricing calculator (BigQuery)
- avoid `select *`; use the table preview instead
- avoid user-defined functions (extra computation)
- materialize (save) large / complex query outputs into a table
- in BigQuery, put the largest table first (on the left) in a join and the smaller tables after it; this requires less computing
- in general, a `limit` clause does not reduce the computing cost, because the same data is still scanned
- avoid self joins
- BigQuery is not built for transactional or transformation operations:
    - avoid insert, delete, and update table operations
    - to manipulate your table you can pull and transform the data and reupload the table, OR use a more advanced engine like Dataflow
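
The cost calculation in the first bullet can be sketched as simple arithmetic on the bytes a query will process. The function name and the `price_per_tib` value of 5.0 are placeholders, not a quoted rate; take the real on-demand price from the current pricing page or calculator.

```python
def estimate_query_cost(bytes_processed: int, price_per_tib: float) -> float:
    """Estimate on-demand query cost from the bytes the query will scan."""
    tib = bytes_processed / 2**40  # convert bytes to tebibytes
    return round(tib * price_per_tib, 4)


# Hypothetical: a query that scans the full 5 TiB at a placeholder price of 5.0 per TiB.
cost = estimate_query_cost(bytes_processed=5 * 2**40, price_per_tib=5.0)
print(cost)
```

Documenting the estimate next to the query makes it easier to notice when a change (for example, dropping unused columns) reduces the bytes scanned.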

## 5. Optimization
- how BigQuery works and how to optimize queries and reduce cost

### Schema design:

- optimize for storage
     - use a normalized database schema
         - group data into multiple tables and connect them through joins (less space, more computation)
         - reduce redundancy (avoids repeating columns and cells)
- optimize for speed
     - use a denormalized database schema:
         - group data into one table (no join logic to execute, therefore faster execution)
         - optimizes data appending
         - use nested and repeated fields (e.g. a customer has multiple phone numbers)
- general rule of thumb
    - if a normalized table is smaller than 10 GB, it's OK to leave it normalized; the join logic has no significant impact. Larger tables are worth denormalizing, unless there are a lot of data manipulation (UPDATE, DELETE) operations: those favor keeping the data normalized, because denormalized tables repeat values that must then be rewritten in many rows
    
### table partitioning:
group data based on the values or ranges of values in a specified column, and store each group separately
- partitioned tables:
    - by a time or date column, or by an integer range
- ingestion-time partitioned tables:
    - by ingestion or arrival date
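
The effect of partitioning can be illustrated with a toy model in plain Python: rows are grouped by day, and a query for one day touches only that group. This is a conceptual sketch with made-up data, not how BigQuery stores partitions internally.

```python
from collections import defaultdict
from datetime import date

rows = [
    {"day": date(2023, 1, 1), "value": 10},
    {"day": date(2023, 1, 1), "value": 5},
    {"day": date(2023, 1, 2), "value": 7},
    {"day": date(2023, 1, 3), "value": 1},
]

# Store each day's rows separately, keyed by the partitioning column.
partitions = defaultdict(list)
for row in rows:
    partitions[row["day"]].append(row)


def scan(day: date) -> list:
    """Read only the partition for the requested day."""
    return partitions.get(day, [])


scanned = scan(date(2023, 1, 1))
print(len(scanned), "of", len(rows), "rows scanned")
```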

### clustering tables
clustering sorts the rows by the values in the specified columns, so that rows with related values are stored close together and a query that filters on those columns reads less data
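
A toy illustration of the clustering idea in plain Python: rows are sorted by the clustering column, split into blocks, and a filter skips any block whose key range cannot match. The data, block size, and function name are assumptions for the sketch; BigQuery's actual storage layout is more involved.

```python
# (country, value) rows in arrival order.
rows = [("DE", 3), ("US", 9), ("DE", 1), ("FR", 4), ("US", 2), ("FR", 8)]

# Sort by the clustering column so equal values sit next to each other.
clustered = sorted(rows, key=lambda row: row[0])

# Split into fixed-size blocks and remember each block's min and max key.
block_size = 2
blocks = [clustered[i:i + block_size] for i in range(0, len(clustered), block_size)]
block_ranges = [(block[0][0], block[-1][0]) for block in blocks]


def blocks_to_read(country: str) -> list:
    """Return the indexes of blocks whose key range can contain the country."""
    return [i for i, (low, high) in enumerate(block_ranges) if low <= country <= high]


print(blocks_to_read("FR"))
```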