<a href="https://colab.research.google.com/github/Saneesh122/Data-Processing-and-Visualization-/blob/main/03_SQL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Learning Some SQL with BigQuery

The first bit below comes directly from Google. You'll need to complete each of those steps to get this document to work.

## Before you begin


1.   Use the [Cloud Resource Manager](https://console.cloud.google.com/cloud-resource-manager) to create a Cloud Platform project if you do not already have one.
2.   [Enable billing](https://support.google.com/cloud/answer/6293499#enable-billing) for the project.
3.   [Enable BigQuery](https://console.cloud.google.com/flows/enableapi?apiid=bigquery) APIs for the project.

In [1]:
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


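Now that I am authenticated, I can start to play around in the dataset.  I am going to look at the Austin bikeshare trips data (`bigquery-public-data.austin_bikeshare.bikeshare_trips`) and try to find things like the longest trip, the average trip duration, and the most and least popular starting stations.  The queries below run under my BigQuery project, `projectfordatavisualization`.  You'll need to create your own project and substitute its name, so keep the name simple but identifiable!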

## Why do we use SQL

Below you'll see a basic SQL call.  This illustrates why Excel is not a good fit here: the table has about 1.4 million rows, which is more than the 1,048,576 rows an Excel worksheet can hold!  Essentially, SQL does the data manipulation on the database server side instead of on your machine (or in the cloud with Colab).

In [None]:
%%bigquery --project projectfordatavisualization
SELECT 
  COUNT(*) as total_entries
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`

Unnamed: 0,total_entries
0,1424786


Computing the longest trip (in minutes):


In [None]:
%%bigquery --project projectfordatavisualization
SELECT 
  MAX(duration_minutes)
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`

Unnamed: 0,f0_
0,34238


Computing the average trip duration:

In [None]:
%%bigquery --project projectfordatavisualization
SELECT AVG(duration_minutes) 
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips` as table

Unnamed: 0,f0_
0,30.870428
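
The three aggregates above (`COUNT`, `MAX`, `AVG`) can also be requested in a single query, and giving each one an `AS` alias avoids auto-generated column names such as `f0_`. The sketch below is only an illustration: it uses Python's built-in `sqlite3` module with a handful of made-up rows instead of BigQuery, so the table contents and the alias names `longest_trip` and `average_trip` are assumptions, not part of the original notebook.

```python
import sqlite3

# Hypothetical toy data: (start_station_name, end_station_name, duration_minutes).
# These rows are made up for illustration; they do not come from the BigQuery table.
toy_trips = [
    ("Station A", "Station B", 12),
    ("Station A", "Station A", 75),
    ("Station B", "Station C", 30),
    ("Station C", "Station C", 45),
    ("Station B", "Station A", 8),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bikeshare_trips ("
    "start_station_name TEXT, end_station_name TEXT, duration_minutes INTEGER)"
)
conn.executemany("INSERT INTO bikeshare_trips VALUES (?, ?, ?)", toy_trips)

# One query, three aggregates: the database scans the rows and
# returns a single summary row with named columns.
total_entries, longest_trip, average_trip = conn.execute(
    """
    SELECT COUNT(*)              AS total_entries,
           MAX(duration_minutes) AS longest_trip,
           AVG(duration_minutes) AS average_trip
    FROM bikeshare_trips
    """
).fetchone()
conn.close()

print(total_entries, longest_trip, average_trip)  # 5 75 34.0
```

The same `SELECT` should work in a `%%bigquery` cell once the toy table name is replaced with the full public table name.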


Computing the average trip duration for each starting station:

In [None]:
%%bigquery --project projectfordatavisualization
SELECT start_station_name,
AVG(duration_minutes) as Average_time_for_trip
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE start_station_name is not null
GROUP BY start_station_name

Unnamed: 0,start_station_name,Average_time_for_trip
0,Zilker Park West,27.168259
1,Toomey Rd @ South Lamar,28.115798
2,State Capitol @ 14th & Colorado,30.421842
3,Waller & 6th St.,23.265843
4,Pease Park,28.648889
...,...,...
188,Republic Square @ Guadalupe & 4th St.,23.067313
189,3rd & West,19.937466
190,3rd/West,34.152547
191,Nueces & 26th,16.805353


Now, computing how many trips start at each starting station, sorted from most to least popular:


In [None]:
%%bigquery --project projectfordatavisualization
SELECT start_station_name,
COUNT(*) as total_trips
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
GROUP BY start_station_name
ORDER BY total_trips DESC

Unnamed: 0,start_station_name,total_trips
0,21st & Speedway @PCL,72799
1,Riverside @ S. Lamar,40635
2,City Hall / Lavaca & 2nd,36520
3,2nd & Congress,35307
4,Rainey St @ Cummings,34758
...,...,...
188,Marketing Event,4
189,Eeyore's 2018,2
190,Stolen,1
191,Eeyore's 2017,1
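
To make the `GROUP BY` step concrete, here is a minimal sketch on a toy table, again using Python's built-in `sqlite3` module rather than BigQuery. The station names and durations are invented for illustration. It combines the two queries above: one row per starting station, with both the trip count and the average duration, skipping rows whose station name is missing.

```python
import sqlite3

# Hypothetical toy data: (start_station_name, duration_minutes).
# The last row has no station name and should be filtered out.
toy_trips = [
    ("Station A", 10),
    ("Station A", 20),
    ("Station B", 30),
    ("Station A", 30),
    ("Station C", 5),
    ("Station B", 50),
    (None, 99),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bikeshare_trips (start_station_name TEXT, duration_minutes INTEGER)"
)
conn.executemany("INSERT INTO bikeshare_trips VALUES (?, ?)", toy_trips)

# GROUP BY collapses all rows that share a station name into one output row;
# COUNT(*) and AVG(...) are then computed separately inside each group.
rows = conn.execute(
    """
    SELECT start_station_name,
           COUNT(*)              AS total_trips,
           AVG(duration_minutes) AS average_time_for_trip
    FROM bikeshare_trips
    WHERE start_station_name IS NOT NULL
    GROUP BY start_station_name
    ORDER BY total_trips DESC, start_station_name
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
```

The secondary sort key `start_station_name` is only there so that stations with equal counts come back in a predictable order.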


Next, finding the second most popular starting station:


In [4]:
%%bigquery --project projectfordatavisualization

SELECT COUNT(*) as total_trips, start_station_name
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
GROUP BY start_station_name
ORDER BY total_trips DESC

Unnamed: 0,total_trips,start_station_name
0,72799,21st & Speedway @PCL
1,40635,Riverside @ S. Lamar
2,36520,City Hall / Lavaca & 2nd
3,35307,2nd & Congress
4,34758,Rainey St @ Cummings
...,...,...
188,4,Marketing Event
189,2,Eeyore's 2018
190,1,Stolen
191,1,Eeyore's 2017


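The query above lists the starting stations from most to least popular. The second row, Riverside @ S. Lamar with 40,635 trips, is the second most popular starting station.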
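
If only that one row is wanted, a common approach is to add `LIMIT 1 OFFSET 1` to the sorted query, which skips the top row and returns the next one. The sketch below shows the idea on made-up data with Python's built-in `sqlite3` module; it is an illustration under those assumptions, not the query that produced the output above. Note that if two stations were tied, which one counts as "second" would depend on the tie-breaking sort key.

```python
import sqlite3

# Hypothetical start stations, one entry per trip (made-up data).
toy_starts = [
    "Station A", "Station B", "Station A",
    "Station C", "Station A", "Station B",
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bikeshare_trips (start_station_name TEXT)")
conn.executemany(
    "INSERT INTO bikeshare_trips VALUES (?)", [(name,) for name in toy_starts]
)

# Sort stations by trip count, skip the first row (OFFSET 1),
# and keep exactly one row (LIMIT 1): the second most popular station.
second_station, second_total = conn.execute(
    """
    SELECT start_station_name, COUNT(*) AS total_trips
    FROM bikeshare_trips
    GROUP BY start_station_name
    ORDER BY total_trips DESC, start_station_name
    LIMIT 1 OFFSET 1
    """
).fetchone()
conn.close()

print(second_station, second_total)  # Station B 2
```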


Next, I compute how many trips lasted over an hour and were round trips (started and stopped at the same station). The first query selects the round trips:

In [6]:
%%bigquery --project projectfordatavisualization

SELECT start_station_name, duration_minutes
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE start_station_name = end_station_name

Unnamed: 0,start_station_name,duration_minutes
0,Toomey Rd @ South Lamar,82
1,Waller & 6th St.,76
2,Waller & 6th St.,75
3,Waller & 6th St.,74
4,Waller & 6th St.,73
...,...,...
228334,3rd/West,63
228335,3rd & West,63
228336,3rd/West,63
228337,3rd/West,63


In [10]:
%%bigquery --project projectfordatavisualization

SELECT *, COUNTIF(duration_minutes > 60) AS Numbers_of_trip_took_over_an_hour
FROM (
  SELECT start_station_name, duration_minutes
  FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
  WHERE start_station_name = end_station_name
)
GROUP BY duration_minutes, start_station_name
ORDER BY Numbers_of_trip_took_over_an_hour DESC

Unnamed: 0,start_station_name,duration_minutes,Numbers_of_trip_took_over_an_hour
0,Riverside @ S. Lamar,62,109
1,Riverside @ S. Lamar,61,95
2,Riverside @ S. Lamar,63,94
3,Riverside @ S. Lamar,69,90
4,Zilker Park,61,88
...,...,...,...
24822,26th/Nueces,40,0
24823,26th/Nueces,17,0
24824,26th/Nueces,53,0
24825,26th/Nueces,48,0


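The first query above selects the round trips (same start and end station). The second groups them by station and duration and counts, within each group, the trips longer than 60 minutes, so groups with a duration of 60 minutes or less show a count of 0. Summing those counts gives the total number of round trips that lasted over an hour.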
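
A more direct way to get that total is to put both conditions in the `WHERE` clause and count the remaining rows. The sketch below demonstrates this on invented rows with Python's built-in `sqlite3` module (SQLite has no `COUNTIF`, so a plain `WHERE` filter is used instead). The data, and the per-station breakdown added as a second query, are assumptions for illustration only.

```python
import sqlite3

# Hypothetical toy data: (start_station_name, end_station_name, duration_minutes).
toy_trips = [
    ("Station A", "Station A", 75),   # round trip, over an hour
    ("Station A", "Station B", 90),   # over an hour, but not a round trip
    ("Station B", "Station B", 61),   # round trip, over an hour
    ("Station B", "Station B", 30),   # round trip, too short
    ("Station C", "Station C", 60),   # round trip, exactly 60 (not over)
    ("Station C", "Station A", 120),  # not a round trip
    ("Station A", "Station A", 200),  # round trip, over an hour
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bikeshare_trips ("
    "start_station_name TEXT, end_station_name TEXT, duration_minutes INTEGER)"
)
conn.executemany("INSERT INTO bikeshare_trips VALUES (?, ?, ?)", toy_trips)

# Total: keep only long round trips, then count them.
(long_round_trips,) = conn.execute(
    """
    SELECT COUNT(*)
    FROM bikeshare_trips
    WHERE start_station_name = end_station_name
      AND duration_minutes > 60
    """
).fetchone()

# Breakdown: the same filter, grouped by station only (not by duration).
per_station = conn.execute(
    """
    SELECT start_station_name, COUNT(*) AS long_round_trips
    FROM bikeshare_trips
    WHERE start_station_name = end_station_name
      AND duration_minutes > 60
    GROUP BY start_station_name
    ORDER BY long_round_trips DESC, start_station_name
    """
).fetchall()
conn.close()

print(long_round_trips)  # 3
print(per_station)       # [('Station A', 2), ('Station B', 1)]
```

Grouping by station alone, rather than by station and duration, gives one count per station instead of one per station-and-duration pair.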