## Learning objectives
- Use Python to read the data you need from a database and save it as a CSV file.
- Compare the efficiency of data processing with pandas and data processing with SQL.

In [None]:
!pip install pandas fiona shapely pyproj rtree geopandas

In [None]:
import pandas as pd
import time
import geopandas as gpd
import folium
import matplotlib.pyplot as plt

## 1. Reading from BigQuery with Python
---

- How to turn the result of a SELECT query into a DataFrame

In [None]:
# Authenticate with a Google account to obtain access to GCP.
from google.colab import auth
auth.authenticate_user()

In [None]:
from google.cloud import bigquery

# Enter your project ID.
# You can find the project ID in the GCP console (https://console.cloud.google.com/).
# The project must have billing and BigQuery enabled.
# - billing : https://cloud.google.com/billing/docs/how-to/modify-project?visit_id=637720949614431006-3922838519&rd=1#enable-billing
# - bigquery : https://console.cloud.google.com/flows/enableapi?apiid=bigquer
project_id = 'aiffel-dudi'
client = bigquery.Client(project=project_id)



In [None]:
# The `to_dataframe` method converts a query result into a DataFrame.
res = client.query("""
SELECT * 
FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2017`
LIMIT 100""").to_dataframe()

res

Unnamed: 0,vendor_id,pickup_datetime,dropoff_datetime,store_and_fwd_flag,rate_code,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,total_amount,payment_type,distance_between_service,time_between_service,trip_type,imp_surcharge,dropoff_location_id,pickup_location_id
0,2,2017-03-03 01:20:55,2017-03-03 01:32:05,N,,,,,,1,,-4.1E-8,,,,,,,1,,,1,,173,82
1,2,2017-12-17 22:08:09,2017-12-17 22:09:32,N,,,,,,1,,,,,,,,,2,,,1,,256,256
2,1,2017-01-06 19:46:39,2017-01-06 19:49:36,N,,,,,,1,,,,,,,,,1,,,1,,225,61
3,2,2017-09-30 08:05:13,2017-09-30 08:32:16,N,,,,,,1,,-3.3E-8,,,,,,,1,,,1,,143,243
4,2,2017-03-27 23:03:20,2017-03-27 23:26:17,N,,,,,,1,,,,,,,,,2,,,1,,76,97
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,2,2017-08-30 17:43:39,2017-08-30 17:53:48,N,,,,,,1,,,,,,,,,1,,,1,,239,43
96,2,2017-01-01 04:00:09,2017-01-01 04:18:28,N,,,,,,1,,,,,,,,,1,,,1,,61,33
97,2,2017-04-08 16:47:43,2017-04-08 17:03:10,N,,,,,,1,,,,,,,,,2,,,1,,179,260
98,2,2017-07-21 08:32:19,2017-07-21 08:48:07,N,,,,,,1,,-4.1E-8,,,,,,-4.1E-8,2,,,2,,250,18


## 2. Exercises
---

Complete the empty queries to solve each problem. Save each finished DataFrame as `name_problemnumber.csv` and upload it to Notion.

See [the pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html) for how to save a DataFrame as a CSV file.
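A minimal sketch of the saving step. The helper `save_result`, the name `hong`, the problem number, and the toy DataFrame are all hypothetical and only illustrate the `name_problemnumber.csv` convention; the file is written to a temporary directory so nothing in the working directory is touched.

```python
import os
import tempfile

import pandas as pd


def save_result(df: pd.DataFrame, name: str, problem_no: int, out_dir: str) -> str:
    """Save df as `<name>_<problem_no>.csv` in out_dir and return the path (hypothetical helper)."""
    path = os.path.join(out_dir, f"{name}_{problem_no}.csv")
    # index=False leaves the DataFrame's row index out of the file
    df.to_csv(path, index=False)
    return path


# Toy stand-in for a query result
toy = pd.DataFrame({"zone_id": ["237", "161"], "cnt": [4442294, 4274032]})
saved_path = save_result(toy, "hong", 2, tempfile.mkdtemp())
print(saved_path)
```

Passing `index=False` is a design choice: without it, pandas writes the row index as an extra unnamed first column.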

### 1) Exercise data

The exercises use the GCP public dataset `bigquery-public-data.new_york_taxi_trips`. It is a representative large dataset to practice on; the 2017 data used here is about 18 GB.

The three tables used from this dataset define many columns, but the columns that matter for these exercises are listed below.

- tlc_green_trips_2017 : green cab trip records for 2017

| Field name | Type | Description |
|:----------|:----:|:-----| 
| passenger_count | INTEGER | The number of passengers in the vehicle. This is a driver-entered value | 
| dropoff_location_id | STRING | |
| pickup_location_id | STRING | |

- tlc_yellow_trips_2017 : yellow cab trip records for 2017

| Field name | Type | Description |
|:----------|:----:|:-----| 
| passenger_count | INTEGER | The number of passengers in the vehicle. This is a driver-entered value | 
| dropoff_location_id | STRING | |
| pickup_location_id | STRING | |


- taxi_zone_geom : information about the zones (pickup_location_id, dropoff_location_id) that appear in tlc_green_trips_2017 and tlc_yellow_trips_2017

| Field name | Type | Description |
|:----------|:----:|:-----|
| zone_id | STRING | Unique ID number of each taxi zone. <br>Corresponds with the pickup_location_id and dropoff_location_id in each of the trips tables |
| zone_name | STRING | Full text name of the taxi zone |
| borough | STRING | Borough containing the taxi zone |
| zone_geom | GEOGRAPHY | Geometric outline that defines the taxi zone suitable for GIS analysis. |


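The baseline requirement is to report results for both green cabs and yellow cabs, but because that involves some tricky parts, it is fine to work with `yellow cab` only.

### 2) What are the top 100 zones where people most often get into a taxi?
- Output columns : zone_id, zone_name, borough, cnt, zone_geom
    - zone_id : ID of the zone
    - zone_name : name of the zone
    - borough : borough the zone belongs to
    - cnt : number of trips that started in the zone
    - zone_geom : geometry of the zone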
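The count, rank, and join pattern this problem asks for can be tried offline before spending BigQuery quota. The sketch below is an illustration only: an in-memory SQLite database stands in for BigQuery, the tables `trips` and `zones` and their rows are made up, and `TOP_N` is 2 instead of 100.

```python
import sqlite3

# In-memory SQLite stands in for BigQuery; column names mirror the exercise,
# but the tables and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trips (pickup_location_id TEXT);
CREATE TABLE zones (zone_id TEXT, zone_name TEXT);
INSERT INTO trips VALUES ('237'), ('237'), ('237'), ('161'), ('161'), ('7');
INSERT INTO zones VALUES ('237', 'Upper East Side South'),
                         ('161', 'Midtown Center'),
                         ('7', 'Astoria');
""")

TOP_N = 2  # the exercise uses 100

rows = conn.execute(
    """
    SELECT z.zone_id, z.zone_name, t.cnt
    FROM (SELECT pickup_location_id, COUNT(*) AS cnt
          FROM trips
          GROUP BY pickup_location_id
          ORDER BY cnt DESC
          LIMIT ?) AS t
    JOIN zones AS z ON z.zone_id = t.pickup_location_id
    ORDER BY t.cnt DESC
    """,
    (TOP_N,),
).fetchall()

print(rows)
```

Ranking inside the subquery means only `TOP_N` rows reach the join, and the outer `ORDER BY` makes the final row order explicit instead of relying on the join to preserve it.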

In [None]:
def logging_time(original_fn):
    def wrapper_fn(*args, **kwargs):
        start_time = time.time()
        result = original_fn(*args, **kwargs)
        end_time = time.time()
        print("WorkingTime[{}]: {} sec".format(original_fn.__name__, end_time-start_time))
        return result
    return wrapper_fn

In [None]:
def execute_query(q):
    """
    Run the query and return the result as a pandas.DataFrame.
    """
    return client.query(q).to_dataframe()

In [None]:
@logging_time
def get_top100_zone(target):
    """
    Load the full tables from the database, then return the 100 rows
    whose `target` value occurs most often.
    The returned columns are zone_id, zone_name, borough, cnt, zone_geom.
    """
    # 1. Read both tables into DataFrames.
    print('load taxi trip data...')
    taxi_trips_df = execute_query("""
    select pickup_datetime, pickup_location_id, dropoff_location_id
    from `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2017`
    """)
    # display(taxi_trips_df.info())

    print('load geometry data...')
    taxi_zone_geom_df = execute_query("select * FROM `bigquery-public-data.new_york_taxi_trips.taxi_zone_geom`")
    # display(taxi_zone_geom_df.info())
    
    # 2. Group taxi_trips_df by target and count the rows for each target value.
    #    pickup_datetime is used because it was picked as an arbitrary column with no missing values.
    top100 = taxi_trips_df[['pickup_datetime', target]].groupby(by=[target]) # group by target
    top100 = top100.count()  # use count as the aggregation function
    top100 = top100.sort_values(by='pickup_datetime', ascending=False) # sort by count, descending
    top100 = top100.head(100) # keep only the top 100
    top100 = top100.reset_index() # turn the target index back into a column
    top100 = top100.rename(columns={'pickup_datetime':'cnt'}) # rename pickup_datetime → cnt
    display(top100)
    
    # 3. Merge taxi_zone_geom_df with top100.
    #    The join keys are taxi_zone_geom_df.zone_id and top100[target].
    #    https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html
    res = top100.merge(taxi_zone_geom_df,
                      how='inner',
                      left_on=target,
                      right_on='zone_id')
    # dropna() returns a new DataFrame, which avoids modifying a slice in place
    res = res[['zone_id', 'zone_name', 'borough', 'cnt', 'zone_geom']].dropna()
    # display(res)
    
    return res

In [None]:
res = get_top100_zone('pickup_location_id')
res

load taxi trip data...


ERROR:root:An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 4))



KeyboardInterrupt: ignored

In [None]:
def show_map(df, trg, tooltip_trg):
    """
    Draw the given pandas.DataFrame on a folium map.
    """
    gs = gpd.GeoSeries.from_wkt(df['zone_geom'])
    gdf = gpd.GeoDataFrame(df, geometry=gs, crs="EPSG:4326")

    f = folium.Figure(width=600, height=500)
    m = folium.Map(location=[40.75, -73.90], zoom_start=10, tiles='CartoDB positron').add_to(f)
    choropleth = folium.Choropleth(geo_data=gdf, 
                    name='zone_name',
                    data=gdf,
                    columns=['zone_name', trg],
                    key_on='feature.properties.zone_name',
                    fill_color='YlGn',
                    fill_opacity=0.7,
                    line_opacity=0.2,
                    legend_name=trg).add_to(m)

    style_function = lambda x: {'fillColor': '#ffffff', 
                                'color':'#000000', 
                                'fillOpacity': 0.1, 
                                'weight': 0.1}

    tooltip = folium.GeoJson(
        gdf,
        style_function=style_function, 
        control=False,
        tooltip=folium.GeoJsonTooltip(
            fields=tooltip_trg,
            localize=True
        )
    )

    choropleth.add_child(tooltip)
    display(f)

In [None]:
show_map(res, 'cnt', ['borough', 'zone_name'])

✨ Exercise - solve the problem with a query!

In [None]:
@logging_time
def solution_with_query(q):    
    return execute_query(q)

In [None]:
query = \
"""
select top100.pickup_location_id as zone_id,
       zone.zone_name, zone.borough, top100.cnt, zone.zone_geom
from (select pickup_location_id, count(pickup_location_id) as cnt
      from `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2017`
      group by pickup_location_id
      order by count(pickup_location_id) desc 
      LIMIT 100) as top100
left join `bigquery-public-data.new_york_taxi_trips.taxi_zone_geom` as zone
on top100.pickup_location_id=zone.zone_id
"""
res = solution_with_query(query).dropna()
res

WorkingTime[solution_with_query]: 1.0871655941009521 sec


Unnamed: 0,zone_id,zone_name,borough,cnt,zone_geom
0,237,Upper East Side South,Manhattan,4442294,"POLYGON((-73.9656696 40.7628045559999, -73.965..."
1,161,Midtown Center,Manhattan,4274032,"POLYGON((-73.9748872719999 40.7559237909999, -..."
2,236,Upper East Side North,Manhattan,4023665,"POLYGON((-73.9572940999999 40.7742835549999, -..."
3,186,Penn Station/Madison Sq West,Manhattan,3973288,"POLYGON((-73.9905176129999 40.7460386379999, -..."
4,162,Midtown East,Manhattan,3863529,"POLYGON((-73.970759194 40.7558248509999, -73.9..."
...,...,...,...,...,...
95,70,East Elmhurst,Queens,16896,"POLYGON((-73.8591354529999 40.7614184409999, -..."
96,36,Bushwick North,Brooklyn,16763,"POLYGON((-73.913291624 40.7043126389999, -73.9..."
97,95,Forest Hills,Queens,16410,"POLYGON((-73.8474631399999 40.7389520309999, -..."
98,93,Flushing Meadows-Corona Park,Queens,15576,"POLYGON((-73.8569707835437 40.7640660675688, -..."


In [None]:
show_map(res, 'cnt', ['borough', 'zone_name'])

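### 2) What are the top 100 zones where people most often get out of a taxi?
- Output columns : zone_id, zone_name, borough, cnt, zone_geom
    - zone_id : ID of the zone
    - zone_name : name of the zone
    - borough : borough the zone belongs to
    - cnt : number of trips that ended in the zone
    - zone_geom : geometry of the zone

✨ Example output - loading all the data and processing it in pandas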

In [None]:
res = get_top100_zone('dropoff_location_id')
res

In [None]:
show_map(res, 'cnt', ['borough', 'zone_name'])

✨ Exercise - solve the problem with a query!

In [None]:
query = \
"""
select top100.dropoff_location_id as zone_id,
       zone.zone_name, zone.borough, top100.cnt, zone.zone_geom
from (select dropoff_location_id, count(dropoff_location_id) as cnt
      from `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2017`
      group by dropoff_location_id
      order by count(dropoff_location_id) desc 
      LIMIT 100) as top100
left join `bigquery-public-data.new_york_taxi_trips.taxi_zone_geom` as zone
on top100.dropoff_location_id=zone.zone_id
"""
res = solution_with_query(query).dropna()
res

WorkingTime[solution_with_query]: 4.004879713058472 sec


Unnamed: 0,zone_id,zone_name,borough,cnt,zone_geom
2,1,Newark Airport,EWR,220034,"POLYGON((-74.1856319999999 40.6916479999999, -..."
3,168,Mott Haven/Port Morris,Bronx,87144,MULTIPOLYGON(((-73.8993874857103 40.8019356650...
4,138,LaGuardia Airport,Queens,1315015,MULTIPOLYGON(((-73.8728849699182 40.7859626331...
5,132,JFK Airport,Queens,998023,MULTIPOLYGON(((-73.7470431634215 40.6375664264...
6,7,Astoria,Queens,529479,"POLYGON((-73.9041889529999 40.767530285, -73.9..."
...,...,...,...,...,...
95,209,Seaport,Manhattan,301485,"POLYGON((-74.0051098989999 40.7120555939999, -..."
96,243,Washington Heights North,Manhattan,210775,"POLYGON((-73.9321985510756 40.8698448708717, -..."
97,152,Manhattanville,Manhattan,196864,"POLYGON((-73.9536710349998 40.8220175339999, -..."
98,127,Inwood,Manhattan,86373,MULTIPOLYGON(((-73.921698419047 40.85667230185...


In [None]:
show_map(res, 'cnt', ['borough', 'zone_name'])

### 3) List the 100 most frequent trip routes in descending order of frequency.
- Output columns : pickup_zone, dropoff_zone, cnt
    - pickup_zone : zone where the trip starts
    - dropoff_zone : zone where the trip ends
    - cnt : number of times the route occurs
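One way to count routes in pandas is to group on the two ID columns directly, which avoids building and then splitting a combined string key. The sketch below uses made-up trips and is an alternative illustration, not the approach the notebook itself takes.

```python
import pandas as pd

# Made-up trips; the column names mirror the exercise tables.
trips = pd.DataFrame({
    "pickup_location_id": ["237", "237", "236", "264", "264", "264"],
    "dropoff_location_id": ["236", "236", "237", "264", "264", "264"],
})

top_routes = (
    trips.groupby(["pickup_location_id", "dropoff_location_id"])
    .size()                                 # rows per (pickup, dropoff) pair
    .reset_index(name="cnt")                # back to columns, count named cnt
    .sort_values("cnt", ascending=False)    # most frequent routes first
    .head(100)
    .rename(columns={"pickup_location_id": "pickup_zone",
                     "dropoff_location_id": "dropoff_zone"})
    .reset_index(drop=True)
)

print(top_routes)
```

The same idea carries over to SQL as a `GROUP BY` on both columns.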

✨ Example output - loading all the data and processing it in pandas

In [None]:
@logging_time
def get_frequent_routine():
    """
    Load the trip records from the database, then return the 100 most
    frequent (pickup, dropoff) routes.
    The returned columns are pickup_zone, dropoff_zone, cnt.
    """
    # 1. Read the trip records into a DataFrame.
    taxi_trips_df = execute_query("""
    select pickup_datetime, pickup_location_id, dropoff_location_id
    from `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2017`
    where pickup_location_id is not null and dropoff_location_id is not null
    """)
    # display(taxi_trips_df.info())

    # 2. Join pickup_location_id and dropoff_location_id into
    #    a new column `routine`.
    taxi_trips_df['routine'] = (taxi_trips_df['pickup_location_id'].astype(str)
                                + " " + taxi_trips_df['dropoff_location_id'].astype(str))

    # 3. Group taxi_trips_df by routine and count the rows for each routine.
    #    pickup_datetime is used because it was picked as an arbitrary column with no missing values.
    top100 = taxi_trips_df[['pickup_datetime', 'routine']].groupby(by=['routine']) # group by routine
    top100 = top100.count()  # use count as the aggregation function
    top100 = top100.sort_values(by='pickup_datetime', ascending=False) # sort by count, descending
    top100 = top100.head(100) # keep only the top 100
    top100 = top100.reset_index() # turn the routine index back into a column
    top100 = top100.rename(columns={'pickup_datetime':'cnt'})

    # 4. To match the output format, split the `routine` column into
    #    `pickup_zone` and `dropoff_zone`.
    top100['pickup_zone'] = top100['routine'].str.split().str[0]
    top100['dropoff_zone'] = top100['routine'].str.split().str[1]
    res = top100[['pickup_zone', 'dropoff_zone', 'cnt']]

    # display(res)
    return res

In [None]:
res = get_frequent_routine()
res

✨ Exercise - solve the problem with a query!
- Tip: http://www.devkuma.com/books/pages/1347

In [None]:
query = \
"""
select split(routine, ' ')[safe_offset(0)] as pickup_zone, 
       split(routine, ' ')[safe_offset(1)] as dropoff_zone,
       count(routine) as cnt
from (select concat(pickup_location_id, ' ', dropoff_location_id) as routine 
      from `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2017`)
group by routine
order by count(routine) desc
limit 100
"""

res = solution_with_query(query).dropna()
res

WorkingTime[solution_with_query]: 4.995290040969849 sec


Unnamed: 0,pickup_zone,dropoff_zone,cnt
0,264,264,1411265
1,237,236,624945
2,236,236,528078
3,236,237,518375
4,237,237,488631
...,...,...,...
95,230,48,149223
96,164,234,149153
97,186,162,149099
98,142,230,148897


### 4) Share of each passenger count by pickup zone
- Output columns : zone_id, zone_name, borough, passenger_1, passenger_2, passenger_3, passenger_4, passenger_other, zone_geom
    - zone_id : ID of the zone
    - zone_name : name of the zone
    - borough : borough the zone belongs to
    - zone_geom : geometry of the zone
    - passenger_1, passenger_2, passenger_3, passenger_4 are derived from passenger_count; each value is the share (a float no greater than 1) of trips starting in the zone that carried that many passengers. passenger_other is the share of trips with any other passenger count (0, 5\~9).
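The shares described above can be computed with conditional aggregation: the average of a 0/1 indicator over a group is exactly the fraction of rows where the indicator is 1. The sketch below is a toy illustration under stated assumptions: in-memory SQLite stands in for BigQuery, and the zones `A` and `B` and their passenger counts are made up.

```python
import sqlite3

# In-memory SQLite stands in for BigQuery; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (pickup_location_id TEXT, passenger_count INTEGER)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("A", 1), ("A", 1), ("A", 2), ("A", 5),
     ("B", 3), ("B", 4), ("B", 0), ("B", 6)],
)

# AVG over a 0/1 indicator gives the share of trips matching the condition.
share_rows = conn.execute(
    """
    SELECT pickup_location_id,
           AVG(CASE WHEN passenger_count = 1 THEN 1.0 ELSE 0.0 END) AS passenger_1,
           AVG(CASE WHEN passenger_count = 2 THEN 1.0 ELSE 0.0 END) AS passenger_2,
           AVG(CASE WHEN passenger_count = 3 THEN 1.0 ELSE 0.0 END) AS passenger_3,
           AVG(CASE WHEN passenger_count = 4 THEN 1.0 ELSE 0.0 END) AS passenger_4,
           AVG(CASE WHEN passenger_count NOT IN (1, 2, 3, 4) THEN 1.0 ELSE 0.0 END)
               AS passenger_other
    FROM trips
    GROUP BY pickup_location_id
    ORDER BY pickup_location_id
    """
).fetchall()

for row in share_rows:
    print(row)
```

Because the five indicators are mutually exclusive and cover every non-null passenger count, the five shares in each row add up to 1, which makes a quick sanity check on the real query result.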

✨ Example output - loading all the data and processing it in pandas

In [None]:
@logging_time
def get_pessenger_with_zone(trg):
    # 1. Read both tables into DataFrames.
    taxi_trips_df = execute_query("""
    select passenger_count, pickup_location_id, dropoff_location_id
    from `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2017`
    where pickup_location_id is not null and dropoff_location_id is not null
    """)
    taxi_zone_geom_df = execute_query("select * from `bigquery-public-data.new_york_taxi_trips.taxi_zone_geom`")

    # 2. Keep only trg and passenger_count, then one-hot encode passenger_count.
    passenger_one_hot = pd.get_dummies(taxi_trips_df[['passenger_count', trg]], columns=['passenger_count'], dtype=int)

    # 3. Add up every one-hot column other than passenger_count_1..4 (that is, 0 and 5~9)
    #    into a new column passenger_other, then drop the columns that are no longer needed.
    keep_cols = [f'passenger_count_{i}' for i in range(1, 5)]
    other_cols = [c for c in passenger_one_hot.columns
                  if c.startswith('passenger_count_') and c not in keep_cols]
    passenger_one_hot['passenger_other'] = passenger_one_hot[other_cols].sum(axis=1)
    passenger_one_hot = passenger_one_hot.drop(columns=other_cols)

    # 4. Group passenger_one_hot by trg and sum each column.
    passenger_one_hot = passenger_one_hot.groupby(by=trg).sum().reset_index()
    passenger_one_hot = passenger_one_hot.rename(columns={'passenger_count_1' : 'passenger_1', 'passenger_count_2' : 'passenger_2', 'passenger_count_3' : 'passenger_3', 'passenger_count_4' : 'passenger_4'})

    # 5. The share of each passenger count is relative to all trips for the zone,
    #    so compute the total number of trips as a new column `total_trip`.
    ratio_cols = ['passenger_1', 'passenger_2', 'passenger_3', 'passenger_4', 'passenger_other']
    passenger_one_hot['total_trip'] = passenger_one_hot[ratio_cols].sum(axis=1)

    # 6. Convert each passenger column into its share of the total.
    passenger_one_hot[ratio_cols] = passenger_one_hot[ratio_cols].div(passenger_one_hot['total_trip'], axis=0)

    # 7. Join passenger_one_hot with taxi_zone_geom_df to build the required output.
    res = passenger_one_hot.merge(taxi_zone_geom_df,
                                how='inner',
                                left_on=trg,
                                right_on='zone_id')
    res = res[['zone_id', 'zone_name', 'borough', 'passenger_1', 'passenger_2', 'passenger_3', 'passenger_4', 'passenger_other', 'zone_geom']]
    # display(res)
    return res

In [None]:
res = get_pessenger_with_zone('pickup_location_id')
res

In [None]:
def show_map_range_0_1(df, trg, tooltip_trg):
    """
    Draw the given pandas.DataFrame on a folium map.
    """
    gs = gpd.GeoSeries.from_wkt(df['zone_geom'])
    gdf = gpd.GeoDataFrame(df, geometry=gs, crs="EPSG:4326")

    f = folium.Figure(width=600, height=500)
    m = folium.Map(location=[40.75, -73.90], zoom_start=10, tiles='CartoDB positron').add_to(f)
    choropleth = folium.Choropleth(geo_data=gdf, 
                    name='zone_name',
                    data=gdf,
                    columns=['zone_name', trg],
                    key_on='feature.properties.zone_name',
                    fill_color='YlGn',
                    fill_opacity=0.7,
                    line_opacity=0.2,
                    legend_name=trg,
                    threshold_scale=[0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1]).add_to(m)

    style_function = lambda x: {'fillColor': '#ffffff', 
                                'color':'#000000', 
                                'fillOpacity': 0.1, 
                                'weight': 0.1}

    tooltip = folium.GeoJson(
        gdf,
        style_function=style_function, 
        control=False,
        tooltip=folium.GeoJsonTooltip(
            fields=tooltip_trg,
            localize=True
        )
    )

    choropleth.add_child(tooltip)
    display(f)

In [None]:
def show_problem_4_5_map(df):
    for i in range(1, 5):
        print(f'passenger_{i} >>>>>')
        show_map_range_0_1(df[['zone_name', 'borough', f'passenger_{i}','zone_geom']], 
                        f'passenger_{i}', 
                        ['zone_name', 'borough', f'passenger_{i}'])
        print()

In [None]:
show_problem_4_5_map(res)

✨ Exercise - solve the problem with a query!

In [None]:
query = \
"""
select passenger_one_hot.pickup_location_id,
	   zone.zone_name,
	   zone.borough,
	   passenger_one_hot.passenger_1 / passenger_one_hot.total_trip as passenger_1,
	   passenger_one_hot.passenger_2 / passenger_one_hot.total_trip as passenger_2, 
	   passenger_one_hot.passenger_3 / passenger_one_hot.total_trip as passenger_3,
	   passenger_one_hot.passenger_4 / passenger_one_hot.total_trip as passenger_4,
	   passenger_one_hot.passenger_other / passenger_one_hot.total_trip as passenger_other,
	   zone.zone_geom
from(select pickup_location_id, 
			sum(passenger_1) as passenger_1, 
			sum(passenger_2) as passenger_2, 
			sum(passenger_3) as passenger_3, 
			sum(passenger_4) as passenger_4, 
			sum(passenger_other) as passenger_other,
			count(pickup_datetime) as total_trip
		from (select pickup_location_id, 
					(case when passenger_count=1 then 1 else 0 end) as passenger_1,
					(case when passenger_count=2 then 1 else 0 end) as passenger_2,
					(case when passenger_count=3 then 1 else 0 end) as passenger_3,
					(case when passenger_count=4 then 1 else 0 end) as passenger_4,
					(case when passenger_count NOT IN (1,2,3,4) then 1 else 0 end) as passenger_other,
					pickup_datetime
			  from `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2017`
			  where pickup_location_id is not null and dropoff_location_id is not null) 
		group by pickup_location_id) as passenger_one_hot
left join `bigquery-public-data.new_york_taxi_trips.taxi_zone_geom` as zone
on passenger_one_hot.pickup_location_id=zone.zone_id
order by passenger_one_hot.pickup_location_id
"""

res = solution_with_query(query).dropna()
res

WorkingTime[solution_with_query]: 2.6121504306793213 sec


Unnamed: 0,pickup_location_id,zone_name,borough,passenger_1,passenger_2,passenger_3,passenger_4,passenger_other,zone_geom
0,1,Newark Airport,EWR,0.688854,0.200048,0.049563,0.031605,0.029929,"POLYGON((-74.1856319999999 40.6916479999999, -..."
1,10,Baisley Park,Queens,0.688135,0.168873,0.043660,0.023588,0.075744,"POLYGON((-73.7840990399998 40.6896477549999, -..."
2,100,Garment District,Manhattan,0.721961,0.140109,0.043127,0.021325,0.073478,"POLYGON((-73.986849911 40.7510816789999, -73.9..."
3,101,Glen Oaks,Queens,0.835526,0.116228,0.019737,0.008772,0.019737,"POLYGON((-73.7016221228549 40.7524255179464, -..."
4,102,Glendale,Queens,0.736674,0.146055,0.029851,0.015991,0.071429,"POLYGON((-73.8566771439998 40.7076918739999, -..."
...,...,...,...,...,...,...,...,...,...
263,95,Forest Hills,Queens,0.755698,0.124741,0.032236,0.014747,0.072578,"POLYGON((-73.8474631399999 40.7389520309999, -..."
264,96,Forest Park/Highland Park,Queens,0.795527,0.089457,0.038339,0.003195,0.073482,"POLYGON((-73.8382923549999 40.7083966169999, -..."
265,97,Fort Greene,Brooklyn,0.730552,0.142612,0.039657,0.017979,0.069200,"POLYGON((-73.9693556790001 40.6958547289999, -..."
266,98,Fresh Meadows,Queens,0.771930,0.113060,0.035088,0.019493,0.060429,"POLYGON((-73.7643149549999 40.7406726269999, -..."


In [None]:
show_problem_4_5_map(res)

### 5) Share of each passenger count by dropoff zone
- Output columns : zone_id, zone_name, borough, passenger_1, passenger_2, passenger_3, passenger_4, passenger_other, zone_geom
    - zone_id : ID of the zone
    - zone_name : name of the zone
    - borough : borough the zone belongs to
    - zone_geom : geometry of the zone
    - passenger_1, passenger_2, passenger_3, passenger_4 are derived from passenger_count; each value is the share (a float no greater than 1) of trips ending in the zone that carried that many passengers. passenger_other is the share of trips with any other passenger count (0, 5\~9).

✨ Example output - loading all the data and processing it in pandas

In [None]:
res = get_pessenger_with_zone('dropoff_location_id')
res

In [None]:
show_problem_4_5_map(res)

✨ Exercise - solve the problem with a query!

In [None]:
query = \
"""
select passenger_one_hot.dropoff_location_id,
	   zone.zone_name,
	   zone.borough,
	   passenger_one_hot.passenger_1 / passenger_one_hot.total_trip as passenger_1,
	   passenger_one_hot.passenger_2 / passenger_one_hot.total_trip as passenger_2, 
	   passenger_one_hot.passenger_3 / passenger_one_hot.total_trip as passenger_3,
	   passenger_one_hot.passenger_4 / passenger_one_hot.total_trip as passenger_4,
	   passenger_one_hot.passenger_other / passenger_one_hot.total_trip as passenger_other,
	   zone.zone_geom
from(select dropoff_location_id, 
			sum(passenger_1) as passenger_1, 
			sum(passenger_2) as passenger_2, 
			sum(passenger_3) as passenger_3, 
			sum(passenger_4) as passenger_4, 
			sum(passenger_other) as passenger_other,
			count(pickup_datetime) as total_trip
		from (select dropoff_location_id, 
					(case when passenger_count=1 then 1 else 0 end) as passenger_1,
					(case when passenger_count=2 then 1 else 0 end) as passenger_2,
					(case when passenger_count=3 then 1 else 0 end) as passenger_3,
					(case when passenger_count=4 then 1 else 0 end) as passenger_4,
					(case when passenger_count NOT IN (1,2,3,4) then 1 else 0 end) as passenger_other,
					pickup_datetime
			  from `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2017`
			  where pickup_location_id is not null and dropoff_location_id is not null) 
		group by dropoff_location_id) as passenger_one_hot
left join `bigquery-public-data.new_york_taxi_trips.taxi_zone_geom` as zone
on passenger_one_hot.dropoff_location_id=zone.zone_id
order by passenger_one_hot.dropoff_location_id
"""

res = solution_with_query(query).dropna()
res

WorkingTime[solution_with_query]: 13.052052736282349 sec


Unnamed: 0,dropoff_location_id,zone_name,borough,passenger_1,passenger_2,passenger_3,passenger_4,passenger_other,zone_geom
0,1,Newark Airport,EWR,0.658885,0.185412,0.049842,0.025655,0.080206,"POLYGON((-74.1856319999999 40.6916479999999, -..."
1,10,Baisley Park,Queens,0.695126,0.157149,0.045192,0.022490,0.080043,"POLYGON((-73.7840990399998 40.6896477549999, -..."
2,100,Garment District,Manhattan,0.709450,0.146197,0.044887,0.023356,0.076110,"POLYGON((-73.986849911 40.7510816789999, -73.9..."
3,101,Glen Oaks,Queens,0.705487,0.155015,0.040474,0.020477,0.078547,"POLYGON((-73.7016221228549 40.7524255179464, -..."
4,102,Glendale,Queens,0.703414,0.153962,0.040651,0.021672,0.080301,"POLYGON((-73.8566771439998 40.7076918739999, -..."
...,...,...,...,...,...,...,...,...,...
260,95,Forest Hills,Queens,0.712800,0.147851,0.039416,0.017570,0.082363,"POLYGON((-73.8474631399999 40.7389520309999, -..."
261,96,Forest Park/Highland Park,Queens,0.731919,0.138380,0.034233,0.017358,0.078110,"POLYGON((-73.8382923549999 40.7083966169999, -..."
262,97,Fort Greene,Brooklyn,0.723809,0.145152,0.037856,0.016460,0.076723,"POLYGON((-73.9693556790001 40.6958547289999, -..."
263,98,Fresh Meadows,Queens,0.720061,0.143515,0.036043,0.019374,0.081006,"POLYGON((-73.7643149549999 40.7406726269999, -..."


In [None]:
show_problem_4_5_map(res)

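## Retrospective
---

### 1) What did you learn from this week's exercises?
> (Write your answer here)

### 2) From the exercises, what do you see as the pros and cons of processing data with pandas versus with SQL?
> (Write your answer here)

### 3) Reflections
> (Write your answer here)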