 # ETL Project  

In this project, I want to find out whether boba shop ratings correlate with the number of Pokemon Go Pikachu spawns in the SF Bay Area.


I used two datasets from Kaggle.com:
- https://www.kaggle.com/vnxiclaire/bobabayarea
- https://www.kaggle.com/kveykva/sf-bay-area-pokemon-go-spawns

The two files include the following information:

Boba shops in the Bay Area:

- id: Unique id of the boba shop
- name: Name of the boba shop
- rating: Yelp rating of the boba shop, on a scale of 1-5
- address: Address of the boba shop
- city: City the boba shop is in
- lat: Latitude of the boba shop
- long: Longitude of the boba shop

SF Bay Area Pokemon Go Spawns:

- id: Unique id of the Pokemon
- number: Unique number of the Pokemon character
- lat: Latitude of the despawn location
- long: Longitude of the despawn location
- encounter_ms: Time the Pokemon was encountered
- disppear_ms: Time the Pokemon disappeared (the column name is spelled this way in the file)



## 1. Extract
To extract the data for this ETL project, Pandas was used to load the two CSV files.

In [1]:
# Dependencies and Setup
import pandas as pd
import geopy
from geopy.geocoders import Nominatim
import json
from pprint import pprint
from pymongo import MongoClient
import gmaps
# Import API key
from api_keys import g_key

### Extract 1. Pokemon file

In [2]:
# File to Load 
file_to_load_pokemon = "resources/pokemon-spawns.csv"

# Read Pokemon File and store into Pandas data frame
pokemon_df = pd.read_csv(file_to_load_pokemon)
pokemon_df


Unnamed: 0,s2_id,s2_token,num,name,lat,lng,encounter_ms,disppear_ms
0,-9185794522947256000,8085808cc6d,13,Weedle,37.793592,-122.408721,1469520187732,1469519919988
1,-9185794529389707000,8085808b51d,16,Pidgey,37.794746,-122.406420,1469520297172,1469519919992
2,-9185794529389707000,8085808b271,41,Zubat,37.794999,-122.404384,1469520709924,1469519919991
3,-9185794082713108000,808580f3587,16,Pidgey,37.795644,-122.407128,-1,1469519920134
4,-9185794076270658000,808580f4b1d,60,Poliwag,37.795592,-122.406331,1469520741876,1469519920153
5,-9182922218470900000,808fb4e54b3,50,Diglett,37.301129,-122.048453,1469520163692,1469520120130
6,-9182922220618383000,808fb4e4ea1,23,Ekans,37.300757,-122.045701,1469520356612,1469520120198
7,-9182922126129103000,808fb4faf4d,41,Zubat,37.303463,-122.048187,1469520502676,1469520120449
8,-9182982474714579000,808f7e1799d,46,Paras,37.759467,-122.426242,1469520422772,1469520265691
9,-9182982313653305000,808f7e3d705,41,Zubat,37.759499,-122.423233,1469520349332,1469520265692
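
The `encounter_ms` and `disppear_ms` values above look like Unix timestamps in milliseconds, and some rows contain `-1`. Assuming that reading is right (nothing in this project confirms it) and that `-1` marks an unrecorded time, a conversion could look like the sketch below. `ms_to_utc` is a hypothetical helper name, not part of the notebook.

```python
from datetime import datetime, timedelta, timezone


def ms_to_utc(ms):
    """Convert epoch milliseconds to an aware UTC datetime; None for negatives."""
    if ms is None or ms < 0:  # assumption: -1 means the time was not recorded
        return None
    # Split with integer arithmetic to avoid float rounding of the milliseconds.
    seconds, millis = divmod(int(ms), 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


print(ms_to_utc(1469520187732))  # first encounter_ms value in the table above
print(ms_to_utc(-1))
```

The times are not used later in this project, so this step is optional.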


### Extract 2. Boba file

In [3]:
# File to Load 
file_to_load = "resources/boba.csv"

# Read Boba File and store into Pandas data frame
boba_df = pd.read_csv(file_to_load)
len(boba_df.index)
boba_df

Unnamed: 0.1,Unnamed: 0,id,name,rating,address,city,lat,long
0,0,99-tea-house-fremont-2,99% Tea House,4.5,3623 Thornton Ave,Fremont,37.562950,-122.010040
1,1,one-tea-fremont-2,One Tea,4.5,46809 Warm Springs Blvd,Fremont,37.489067,-121.929414
2,2,royaltea-usa-fremont,Royaltea USA,4.0,38509 Fremont Blvd,Fremont,37.551315,-121.993850
3,3,teco-tea-and-coffee-bar-fremont,TECO Tea & Coffee Bar,4.5,39030 Paseo Padre Pkwy,Fremont,37.553694,-121.981043
4,4,t-lab-fremont-3,T-LAB,4.0,34133 Fremont Blvd,Fremont,37.576149,-122.043705
5,5,q-tea-monster-newark,Q-Tea Monster,4.0,39181 Cedar Blvd,Newark,37.522960,-122.005786
6,6,gong-cha-fremont,Gong Cha,4.0,46827 Warm Springs Blvd,Fremont,37.488568,-121.929191
7,7,happy-lemon-fremont-2,Happy Lemon,4.5,46873 Warm Spring Blvd,Fremont,37.488443,-121.930384
8,8,factory-tea-bar-fremont-2,Factory Tea Bar,3.5,46461 Mission Blvd,Fremont,37.492298,-121.927919
9,9,super-cue-cafe-fremont,Super Cue Cafe,3.5,43743 Boscell Rd,Fremont,37.500778,-121.973168


## 2. Transform

The following tasks were performed to transform the data:

- Information irrelevant to the project was deleted.
- Geopy was used to find zip codes from latitude and longitude.
- The two datasets were merged on zip code.
- Rows containing missing information were deleted.
- The cleaned datasets were saved as a CSV file and a JSON file.


### Transform 1. Pokemon file

In [4]:
# check how many rows in pokemon data frame
len(pokemon_df.index)



314105

Since the file has too much data to process, I decided to use only the data for the signature character, Pikachu.


In [5]:
# Check how many rows there are for each character
name_count = pokemon_df["name"].value_counts()
name_count

Pidgey        44313
Zubat         37033
Rattata       30407
Spearow       14787
Weedle        13618
Doduo         11821
Ekans         11725
Paras         11604
Eevee          9759
Magikarp       7498
Caterpie       7335
Venonat        5981
Growlithe      5087
Mankey         5006
Nidoran♂       4952
Nidoran♀       4697
Geodude        4543
Sandshrew      4414
Meowth         4067
Poliwag        3680
Krabby         3673
Clefairy       3600
Goldeen        3486
Psyduck        3356
Staryu         3352
Pidgeotto      2796
Bellsprout     2660
Cubone         2614
Oddish         2568
Rhyhorn        2466
              ...  
Exeggutor        35
Starmie          31
Weezing          30
Charmeleon       29
Seadra           29
Farfetch'd       27
Golem            25
Porygon          24
Dragonair        22
Flareon          20
Nidoking         20
Raichu           19
Snorlax          19
Nidoqueen        15
Wigglytuff       15
Ninetales        14
Venusaur         14
Poliwrath        12
Charizard        10


In [6]:
# Filter only Pikachu data.
pikachu_df = pokemon_df.loc[pokemon_df["name"]=="Pikachu",:]
len(pikachu_df.index)
pikachu_df

Unnamed: 0,s2_id,s2_token,num,name,lat,lng,encounter_ms,disppear_ms
10,-9182982313653305000,808f7e3d12d,25,Pikachu,37.759913,-122.422614,1469521009940,1469520265691
100,-9182898095787082000,808fcad5a05,25,Pikachu,37.304823,-121.948866,1469520664820,1469520277694
103,-9182898104377016000,808fcad3c37,25,Pikachu,37.304638,-121.954732,1469520463260,1469520277690
107,-9182898104377016000,808fcad3c37,25,Pikachu,37.304638,-121.954732,1469520463260,1469520277695
108,-9182898095787082000,808fcad5a05,25,Pikachu,37.304823,-121.948866,1469520664820,1469520277694
226,-9183346168250237000,808e3350a6d,25,Pikachu,37.309022,-121.895082,1469520294196,1469520278277
227,-9183346125300564000,808e335a865,25,Pikachu,37.308959,-121.893037,1469520601332,1469520278276
254,-9182897694207640000,808fcb33779,25,Pikachu,37.309967,-121.931000,1469520842436,1469520278412
263,-9183344647831814000,808e34b2e75,25,Pikachu,37.311053,-121.915621,-1,1469520278514
514,-9182921589258191000,808fb577bf5,25,Pikachu,37.322372,-122.005911,-1,1469520279741


In [7]:
#reset index
pikachu_df = pikachu_df.reset_index(drop=True)
pikachu_df
# Make a new column "zip code"
pikachu_df["zip code"]=""
pikachu_df.head()

Unnamed: 0,s2_id,s2_token,num,name,lat,lng,encounter_ms,disppear_ms,zip code
0,-9182982313653305000,808f7e3d12d,25,Pikachu,37.759913,-122.422614,1469521009940,1469520265691,
1,-9182898095787082000,808fcad5a05,25,Pikachu,37.304823,-121.948866,1469520664820,1469520277694,
2,-9182898104377016000,808fcad3c37,25,Pikachu,37.304638,-121.954732,1469520463260,1469520277690,
3,-9182898104377016000,808fcad3c37,25,Pikachu,37.304638,-121.954732,1469520463260,1469520277695,
4,-9182898095787082000,808fcad5a05,25,Pikachu,37.304823,-121.948866,1469520664820,1469520277694,


In [8]:
# find zip codes through latitude and longitude using geopy.
ind = 0
def get_zipcode(df, geolocator, lat_field, lon_field):
    global ind
    try :
        location = geolocator.reverse((df[lat_field], df[lon_field]))
        print(ind,location.raw['address']['postcode'])
        pikachu_df.loc[ind,'zip code'] = location.raw['address']['postcode']# add zipcode to "zip code" column
        ind += 1
    except(KeyError, IndexError):
        print("NA")
        ind += 1
    
geolocator = geopy.Nominatim(user_agent='heesung80@myserver.com')
zipcodes = pikachu_df.apply(get_zipcode, axis=1, geolocator=geolocator, lat_field='lat', lon_field='lng')


0 94110
1 95128
2 95117
3 95117
4 95128
5 95125
6 95125
7 95128
8 951251
9 95014-3522
10 95126
11 95112
12 95128
13 95128
14 95051
15 95051
16 95050
17 95050
18 95112
19 95050-3653
20 95051
21 95051
22 95110
23 95086
24 95134
25 95035
26 95035
27 95002
28 95134
29 95002
30 95134-1358
31 94040
32 94022
33 94040
34 94538
35 94538
36 94560
37 94538
38 94538
39 94560
40 94538
41 94538
42 94560
43 94538
44 94560
45 94536
46 94538
47 94555
48 94538
49 94555
50 94555
51 94536
52 94587
53 94587
54 94555
55 94555
56 94587
57 94587
58 94555
NA
NA
61 94587
62 94587
63 94545
64 94036
65 94036
66 94301
67 94301
68 94301-2019
69 94301-2019
70 94301-2019
71 515
72 515
73 515
74 515
75 515
76 515
77 515
78 515
79 94070
80 94070
81 94070
82 94070
83 94061
84 94025
85 94025
86 94027
87 94027
88 94027
89 94061
90 94025
91 94025
92 94061
93 94061
94 94025
95 94061
96 94027
97 94061
98 94027
99 94027
100 94061
101 94061
102 94027
103 94027
104 94061
105 94025
106 94025
107 94025
108 94063
109 94063
110 940

756 95133
757 95031
758 95031
759 95031
760 95031
761 95031
762 95131
763 95131
764 95131
765 95131
766 95131
767 95131
768 95131
769 95131
770 95131
771 95131
772 95131
773 95131
774 95131
775 95131
776 95131
777 95134
778 95134
779 95134
780 95134
781 95132
782 95132
783 95134
784 95132
785 95132
786 95134
787 95134
788 95134
789 95131
790 95131
791 95132
792 95132
793 95132
794 95131
795 95131
796 95035
797 95035
798 95035
799 95134
800 95134
801 95035
802 95035
803 95035
804 95035
805 95134-1358
806 95134-1358
807 95134-1358
808 94539
809 94539
810 94539
811 94539
812 94539
813 94539
814 94538
815 94538
816 94538
817 94538
818 94538
819 94538
820 94538
821 94538
822 94538
823 94538
824 94538
825 94538
826 94538
827 94538
828 94538
829 94538
830 94538
831 94538
832 94538
833 90810
834 90810
835 90806
836 90806
837 90018
838 90018
839 90018
840 90018
841 90018
842 90018
843 90018
844 90807
845 90807
846 90807
847 90807
848 90810
849 90810
850 90810
851 90807
852 90807
853 90807
854 9

1491 160-0015
1492 160-0015
1493 160-0015
1494 160-0015
1495 160-0015
1496 160-0015
1497 160-0015
1498 160-0015
1499 160-0015
1500 160-0015
1501 160-0015
1502 102-0083
1503 102-0083
1504 160-0015
1505 160-0015
1506 160-0015
1507 160-0015
1508 160-0015
1509 102-0083
1510 102-0083
1511 160-0015
1512 160-0015
1513 160-0015
1514 160-0015
1515 169-0072
1516 169-0072
1517 169-0072
1518 169-0072
1519 162-0052
1520 162-0052
1521 95826
1522 94110
1523 94110
1524 94110
1525 94110
1526 94110
1527 94110
1528 94117
1529 94117
1530 94117
1531 94117
1532 94158
1533 94607
1534 94607
1535 94607
1536 94607
1537 94501
1538 94501
1539 94501
1540 94501
1541 94017
1542 94017
1543 94130
1544 94130
1545 94130
1546 94130
1547 14123
1548 94133-1312
1549 94133-1312
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
1568 94130
1569 94130
1570 94130
1571 94130
1572 94130
1573 94130
1574 94130
1575 94130
1576 94130
1577 94130
1578 94130
1579 94130
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
1590 14623
1591 14623
1592 14623
15
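
The lookup above writes into the data frame through a global counter and only catches `KeyError` and `IndexError`, so a point for which the geocoder returns nothing would raise an uncaught `AttributeError`. A minimal alternative sketch returns the postcode instead, and assumes geopy's `RateLimiter` helper is used to stay within Nominatim's limit of roughly one request per second. `lookup_zip` and `make_reverse` are hypothetical names; `reverse` is any callable that behaves like `geolocator.reverse`.

```python
def lookup_zip(reverse, lat, lon):
    """Return the postcode for (lat, lon), or None when none is available."""
    location = reverse((lat, lon))
    if location is None:  # the geocoder found nothing for this point
        return None
    return location.raw.get("address", {}).get("postcode")


def make_reverse(user_agent, min_delay_seconds=1.0):
    """Build a rate-limited reverse geocoder (geopy is imported only when called)."""
    from geopy.extra.rate_limiter import RateLimiter
    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent=user_agent)
    return RateLimiter(geolocator.reverse, min_delay_seconds=min_delay_seconds)
```

With `reverse = make_reverse("<your user agent>")`, the column could then be filled with `pikachu_df["zip code"] = pikachu_df.apply(lambda row: lookup_zip(reverse, row["lat"], row["lng"]), axis=1)`, which leaves missing postcodes as `None` rather than empty strings.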

In [9]:
pikachu_df

Unnamed: 0,s2_id,s2_token,num,name,lat,lng,encounter_ms,disppear_ms,zip code
0,-9182982313653305000,808f7e3d12d,25,Pikachu,37.759913,-122.422614,1469521009940,1469520265691,94110
1,-9182898095787082000,808fcad5a05,25,Pikachu,37.304823,-121.948866,1469520664820,1469520277694,95128
2,-9182898104377016000,808fcad3c37,25,Pikachu,37.304638,-121.954732,1469520463260,1469520277690,95117
3,-9182898104377016000,808fcad3c37,25,Pikachu,37.304638,-121.954732,1469520463260,1469520277695,95117
4,-9182898095787082000,808fcad5a05,25,Pikachu,37.304823,-121.948866,1469520664820,1469520277694,95128
5,-9183346168250237000,808e3350a6d,25,Pikachu,37.309022,-121.895082,1469520294196,1469520278277,95125
6,-9183346125300564000,808e335a865,25,Pikachu,37.308959,-121.893037,1469520601332,1469520278276,95125
7,-9182897694207640000,808fcb33779,25,Pikachu,37.309967,-121.931000,1469520842436,1469520278412,95128
8,-9183344647831814000,808e34b2e75,25,Pikachu,37.311053,-121.915621,-1,1469520278514,951251
9,-9182921589258191000,808fb577bf5,25,Pikachu,37.322372,-122.005911,-1,1469520279741,95014-3522


In [10]:
# Save the pikachu data as csv file
pikachu_df.to_csv("output/pika_zip.csv") 

In [11]:
pikachu_df = pd.read_csv("output/pika_zip.csv")

#Check how many Pikachu appear for each zip code.
pika_zip_count_df = pikachu_df["zip code"].value_counts()
pika_zip_count_df


94025         74
94536         61
94027         48
94538         46
95070         38
94544         36
94587         35
94022         35
95129         33
94545         26
94061         26
94555         25
94303         25
94560         24
95117         24
95112         22
94025-1246    22
94063         21
95130         21
95125         21
95051         20
160-0015      20
95134         19
95131         19
94087         18
94131-3228    18
95050         18
94065         17
93950-2424    16
94304         16
              ..
95122          2
90026          2
95014-0111     2
93955          2
95014-5103     2
90806          2
162-0052       2
95051-4409     2
95014-3884     2
95192          2
94017          2
95014-3151     2
95014-3837     2
95014-2057     2
10280          2
95014-0614     2
9404           2
95014-2706     2
14123          1
95014-3522     1
94158          1
94305          1
90805          1
95826          1
94043-3421     1
94807          1
95050-3653     1
95014-2407    
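
The counts above treat `94025` and `94025-1246` as different keys, and values such as `160-0015` or `9404` are not five-digit US ZIP codes. One way to tidy this before counting, shown as a sketch with a hypothetical helper `normalize_zip`, is to keep only the five-digit prefix of ZIP or ZIP+4 strings and discard everything else. Whether to drop the non-matching rows is a judgment call that the notebook itself does not make.

```python
import re

# Matches a 5-digit US ZIP code with an optional +4 extension.
_ZIP_RE = re.compile(r"^(\d{5})(?:-\d{4})?$")


def normalize_zip(value):
    """Return the 5-digit ZIP for 'NNNNN' or 'NNNNN-NNNN', else None."""
    if value is None:
        return None
    match = _ZIP_RE.match(str(value).strip())
    return match.group(1) if match else None


samples = ["94025", "94025-1246", "160-0015", "951251", "9404", ""]
print([normalize_zip(s) for s in samples])
```

Applied to the column with `pikachu_df["zip code"].map(normalize_zip)`, this would merge the ZIP+4 variants into their base code. It checks the format only, so well-formed codes from outside the Bay Area (for example `90810` or `14623`) would still pass.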

In [12]:
# Make a data frame of Pikachu counts per zip code.
# rename_axis/reset_index name both columns explicitly, so this does not
# depend on how value_counts() names its result.
pika_number_df = (
    pika_zip_count_df
    .rename_axis("zip code")
    .reset_index(name="Number of Pikachu")
)
pika_number_df

Unnamed: 0,zip code,Number of Pikachu
0,94025,74
1,94536,61
2,94027,48
3,94538,46
4,95070,38
5,94544,36
6,94587,35
7,94022,35
8,95129,33
9,94545,26


### Transform 2. Boba file

In [13]:
boba_df

Unnamed: 0.1,Unnamed: 0,id,name,rating,address,city,lat,long
0,0,99-tea-house-fremont-2,99% Tea House,4.5,3623 Thornton Ave,Fremont,37.562950,-122.010040
1,1,one-tea-fremont-2,One Tea,4.5,46809 Warm Springs Blvd,Fremont,37.489067,-121.929414
2,2,royaltea-usa-fremont,Royaltea USA,4.0,38509 Fremont Blvd,Fremont,37.551315,-121.993850
3,3,teco-tea-and-coffee-bar-fremont,TECO Tea & Coffee Bar,4.5,39030 Paseo Padre Pkwy,Fremont,37.553694,-121.981043
4,4,t-lab-fremont-3,T-LAB,4.0,34133 Fremont Blvd,Fremont,37.576149,-122.043705
5,5,q-tea-monster-newark,Q-Tea Monster,4.0,39181 Cedar Blvd,Newark,37.522960,-122.005786
6,6,gong-cha-fremont,Gong Cha,4.0,46827 Warm Springs Blvd,Fremont,37.488568,-121.929191
7,7,happy-lemon-fremont-2,Happy Lemon,4.5,46873 Warm Spring Blvd,Fremont,37.488443,-121.930384
8,8,factory-tea-bar-fremont-2,Factory Tea Bar,3.5,46461 Mission Blvd,Fremont,37.492298,-121.927919
9,9,super-cue-cafe-fremont,Super Cue Cafe,3.5,43743 Boscell Rd,Fremont,37.500778,-121.973168


In [14]:
# Make an empty zip code column
boba_df["zip code"]=''
boba_df

Unnamed: 0.1,Unnamed: 0,id,name,rating,address,city,lat,long,zip code
0,0,99-tea-house-fremont-2,99% Tea House,4.5,3623 Thornton Ave,Fremont,37.562950,-122.010040,
1,1,one-tea-fremont-2,One Tea,4.5,46809 Warm Springs Blvd,Fremont,37.489067,-121.929414,
2,2,royaltea-usa-fremont,Royaltea USA,4.0,38509 Fremont Blvd,Fremont,37.551315,-121.993850,
3,3,teco-tea-and-coffee-bar-fremont,TECO Tea & Coffee Bar,4.5,39030 Paseo Padre Pkwy,Fremont,37.553694,-121.981043,
4,4,t-lab-fremont-3,T-LAB,4.0,34133 Fremont Blvd,Fremont,37.576149,-122.043705,
5,5,q-tea-monster-newark,Q-Tea Monster,4.0,39181 Cedar Blvd,Newark,37.522960,-122.005786,
6,6,gong-cha-fremont,Gong Cha,4.0,46827 Warm Springs Blvd,Fremont,37.488568,-121.929191,
7,7,happy-lemon-fremont-2,Happy Lemon,4.5,46873 Warm Spring Blvd,Fremont,37.488443,-121.930384,
8,8,factory-tea-bar-fremont-2,Factory Tea Bar,3.5,46461 Mission Blvd,Fremont,37.492298,-121.927919,
9,9,super-cue-cafe-fremont,Super Cue Cafe,3.5,43743 Boscell Rd,Fremont,37.500778,-121.973168,


In [15]:
# find zip codes through latitude and longitude using geopy.
ind = 0

def get_zipcode(df, geolocator, lat_field, lon_field):
    global ind
    try :
        location = geolocator.reverse((df[lat_field], df[lon_field]))
        print(ind, location.raw['address']['postcode'])
        boba_df.loc[ind,'zip code'] = location.raw['address']['postcode']
        ind += 1
    except(KeyError, IndexError):
        print("NA")
        ind += 1
    


geolocator = geopy.Nominatim(user_agent='heesung80@myserver.com')

zipcodes = boba_df.apply(get_zipcode, axis=1, geolocator=geolocator, lat_field='lat', lon_field='long')




0 94536
1 94539
2 94536
3 94538
4 94587
5 94560
6 94539
7 94539
8 94539
9 94538
10 94587
11 94539
12 94538
13 94538
14 94555
15 94538
16 94537
17 94555
18 94538
19 94587
20 94536
21 94560
22 94538
23 94587
24 94537
25 94587
26 94538
27 94041
28 94023
29 94539
30 94537
31 94536
32 94587
33 94587
34 93133
35 94546
36 94555
37 95035
38 94587
39 94036
40 95035
41 94587
42 94539
43 94537
44 95035
45 95035
46 95035
47 94587
48 94301
49 94579
50 94108
51 94110
52 94122
53 94118
54 94109
55 94115
56 94108
57 94102
58 94133
59 94133
60 94121
61 94134
62 94104
63 94116
64 94158
65 94121-3131
66 94107
67 94112
68 94122-1515
69 94122
70 94122
71 94107
72 94017
73 94121
74 94103
75 94115
76 94122
77 94121-3131
78 94103
79 94122
80 94122
81 94116
82 94133
83 94118-1316
84 94121
85 94118
86 94107
87 94131-3228
88 94112
89 94133
90 94133
91 94110
92 94112
93 94133
94 94118-1316
95 94014
96 94121-3131
97 94118
98 94122
99 94121
100 94607
101 94609
102 94607
103 94704
104 94607
105 94704
106 94501
107 9

In [19]:
# Save the boba data as csv file
boba_df.to_csv("output/boba_zip.csv") 

In [20]:

boba_zip_df = pd.read_csv("output/boba_zip.csv")
boba_zip_df

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,id,name,rating,address,city,lat,long,zip code
0,0,0,99-tea-house-fremont-2,99% Tea House,4.5,3623 Thornton Ave,Fremont,37.562950,-122.010040,94536
1,1,1,one-tea-fremont-2,One Tea,4.5,46809 Warm Springs Blvd,Fremont,37.489067,-121.929414,94539
2,2,2,royaltea-usa-fremont,Royaltea USA,4.0,38509 Fremont Blvd,Fremont,37.551315,-121.993850,94536
3,3,3,teco-tea-and-coffee-bar-fremont,TECO Tea & Coffee Bar,4.5,39030 Paseo Padre Pkwy,Fremont,37.553694,-121.981043,94538
4,4,4,t-lab-fremont-3,T-LAB,4.0,34133 Fremont Blvd,Fremont,37.576149,-122.043705,94587
5,5,5,q-tea-monster-newark,Q-Tea Monster,4.0,39181 Cedar Blvd,Newark,37.522960,-122.005786,94560
6,6,6,gong-cha-fremont,Gong Cha,4.0,46827 Warm Springs Blvd,Fremont,37.488568,-121.929191,94539
7,7,7,happy-lemon-fremont-2,Happy Lemon,4.5,46873 Warm Spring Blvd,Fremont,37.488443,-121.930384,94539
8,8,8,factory-tea-bar-fremont-2,Factory Tea Bar,3.5,46461 Mission Blvd,Fremont,37.492298,-121.927919,94539
9,9,9,super-cue-cafe-fremont,Super Cue Cafe,3.5,43743 Boscell Rd,Fremont,37.500778,-121.973168,94538


In [21]:
#Delete unnecessary columns.
del boba_zip_df['Unnamed: 0']
del boba_zip_df['Unnamed: 0.1']
del boba_zip_df['id']

In [22]:
boba_zip_df

Unnamed: 0,name,rating,address,city,lat,long,zip code
0,99% Tea House,4.5,3623 Thornton Ave,Fremont,37.562950,-122.010040,94536
1,One Tea,4.5,46809 Warm Springs Blvd,Fremont,37.489067,-121.929414,94539
2,Royaltea USA,4.0,38509 Fremont Blvd,Fremont,37.551315,-121.993850,94536
3,TECO Tea & Coffee Bar,4.5,39030 Paseo Padre Pkwy,Fremont,37.553694,-121.981043,94538
4,T-LAB,4.0,34133 Fremont Blvd,Fremont,37.576149,-122.043705,94587
5,Q-Tea Monster,4.0,39181 Cedar Blvd,Newark,37.522960,-122.005786,94560
6,Gong Cha,4.0,46827 Warm Springs Blvd,Fremont,37.488568,-121.929191,94539
7,Happy Lemon,4.5,46873 Warm Spring Blvd,Fremont,37.488443,-121.930384,94539
8,Factory Tea Bar,3.5,46461 Mission Blvd,Fremont,37.492298,-121.927919,94539
9,Super Cue Cafe,3.5,43743 Boscell Rd,Fremont,37.500778,-121.973168,94538


### Transform 3. Merge two datasets

In [23]:
#Merge two datasets
Merge_df=pd.merge(boba_zip_df,pika_number_df,on="zip code",how="right")
Merge_df
#Remove rows with missing data
clean_boba_pika_df = Merge_df.dropna(how="any")
clean_boba_pika_df

Unnamed: 0,name,rating,address,city,lat,long,zip code,Number of Pikachu
0,99% Tea House,4.5,3623 Thornton Ave,Fremont,37.562950,-122.010040,94536,61
1,Royaltea USA,4.0,38509 Fremont Blvd,Fremont,37.551315,-121.993850,94536,61
2,Tata Teahouse,3.0,39230 Argonaut Way,Fremont,37.543782,-121.986825,94536,61
3,T4,3.0,36400 Fremont Blvd,Fremont,37.563112,-122.015291,94536,61
4,Q Cup,3.0,39129 Fremont Blvd,Fremont,37.546170,-121.987402,94536,61
5,One Tea,4.5,46809 Warm Springs Blvd,Fremont,37.489067,-121.929414,94539,14
6,Gong Cha,4.0,46827 Warm Springs Blvd,Fremont,37.488568,-121.929191,94539,14
7,Happy Lemon,4.5,46873 Warm Spring Blvd,Fremont,37.488443,-121.930384,94539,14
8,Factory Tea Bar,3.5,46461 Mission Blvd,Fremont,37.492298,-121.927919,94539,14
9,Tea Island,4.0,46196 Warm Springs Blvd,Fremont,37.493541,-121.929889,94539,14


In [24]:
#Save merged data as csv file
clean_boba_pika_df.to_csv("output/pika_boba.csv")

In [25]:
#Save merged data as json file
clean_boba_pika_df.to_json(r'boba_pika.json', orient = "records")

In [26]:
boba_pika_json=clean_boba_pika_df.to_json(orient = "records")
parsed = json.loads(boba_pika_json)
print(json.dumps(parsed, indent=4, sort_keys=True))


[
    {
        "Number of Pikachu": 61,
        "address": "3623 Thornton Ave",
        "city": "Fremont",
        "lat": 37.56295,
        "long": -122.01004,
        "name": "99% Tea House",
        "rating": 4.5,
        "zip code": "94536"
    },
    {
        "Number of Pikachu": 61,
        "address": "38509 Fremont Blvd",
        "city": "Fremont",
        "lat": 37.5513151288,
        "long": -121.993849799,
        "name": "Royaltea USA",
        "rating": 4.0,
        "zip code": "94536"
    },
    {
        "Number of Pikachu": 61,
        "address": "39230 Argonaut Way",
        "city": "Fremont",
        "lat": 37.543782,
        "long": -121.986825,
        "name": "Tata Teahouse",
        "rating": 3.0,
        "zip code": "94536"
    },
    {
        "Number of Pikachu": 61,
        "address": "36400 Fremont Blvd",
        "city": "Fremont",
        "lat": 37.5631118502,
        "long": -122.0152914759,
        "name": "T4",
        "rating": 3.0,
        "zip code": "

## Bonus 

I made a map to see how much the areas with many Pikachu overlap with highly rated boba shops.

- Map the 35 areas where Pikachu appears most often
- Map the 35 best-rated boba shops

In [2]:
# File to Load 
file_to_load_pika_boba = "output/pika_boba.csv"
pika_boba_df = pd.read_csv(file_to_load_pika_boba)
pika_boba_df

Unnamed: 0.1,Unnamed: 0,name,rating,address,city,lat,long,zip code,Number of Pikachu
0,0,99% Tea House,4.5,3623 Thornton Ave,Fremont,37.562950,-122.010040,94536,61
1,1,Royaltea USA,4.0,38509 Fremont Blvd,Fremont,37.551315,-121.993850,94536,61
2,2,Tata Teahouse,3.0,39230 Argonaut Way,Fremont,37.543782,-121.986825,94536,61
3,3,T4,3.0,36400 Fremont Blvd,Fremont,37.563112,-122.015291,94536,61
4,4,Q Cup,3.0,39129 Fremont Blvd,Fremont,37.546170,-121.987402,94536,61
5,5,One Tea,4.5,46809 Warm Springs Blvd,Fremont,37.489067,-121.929414,94539,14
6,6,Gong Cha,4.0,46827 Warm Springs Blvd,Fremont,37.488568,-121.929191,94539,14
7,7,Happy Lemon,4.5,46873 Warm Spring Blvd,Fremont,37.488443,-121.930384,94539,14
8,8,Factory Tea Bar,3.5,46461 Mission Blvd,Fremont,37.492298,-121.927919,94539,14
9,9,Tea Island,4.0,46196 Warm Springs Blvd,Fremont,37.493541,-121.929889,94539,14


In [3]:
# Configure gmaps with the API key (gmaps is used to draw the map)
gmaps.configure(api_key=g_key)

In [4]:
# Select the 35 highest-rated boba shops into a data frame.
pika_boba_df_sorting = pika_boba_df.sort_values("rating",ascending=False)
pika_boba_df_sorting.head(35)
top_35_boba_df =pd.DataFrame(pika_boba_df_sorting.head(35))
top_35_boba_df

Unnamed: 0.1,Unnamed: 0,name,rating,address,city,lat,long,zip code,Number of Pikachu
193,194,Taza Deli & Cafe,5.0,1796 Broadway,Redwood City,37.486866,-122.223413,94063,21
243,244,Mr. Green Bubble,5.0,1255 S Mary Ave,Sunnyvale,37.35338,-122.05071,94087-2248,9
258,259,Honey Bear Smoothie Tea & Dessert,5.0,1 Southland Mall Dr,Hayward,37.654233,-122.104842,94545,26
0,0,99% Tea House,4.5,3623 Thornton Ave,Fremont,37.56295,-122.01004,94536,61
169,170,Happiness Cafe,4.5,1688 Hostetter Rd,San Jose,37.386388,-121.884803,95131,19
234,235,Honeyberry,4.5,3655 N 1st St,San Jose,37.409314,-121.945219,95134,19
232,233,Teaspoon,4.5,2675 Middlefield Rd,Palo Alto,37.434085,-122.129446,94303,25
69,70,Pokeatery,4.5,18911 Lake Chabot Rd,Castro Valley,37.70864,-122.09134,94546,3
186,187,Mints & Honey,4.5,1524 El Camino Real,San Carlos,37.49613,-122.2477,94070,16
219,220,Chilly & Munch,4.5,2101 Showers Dr,Mountain View,37.406811,-122.106985,94040,11


In [5]:
# Select the 35 rows with the highest Pikachu counts into a data frame.
pika_boba_df_sorting = pika_boba_df.sort_values("Number of Pikachu",ascending=False)
pika_boba_df_sorting.head(35)
top_35_pika_df =pd.DataFrame(pika_boba_df_sorting.head(35))
top_35_pika_df

Unnamed: 0.1,Unnamed: 0,name,rating,address,city,lat,long,zip code,Number of Pikachu
244,245,Koma Sushi Restaurant,3.5,211 El Camino Real,Menlo Park,37.448354,-122.174377,94025,74
0,0,99% Tea House,4.5,3623 Thornton Ave,Fremont,37.56295,-122.01004,94536,61
2,2,Tata Teahouse,3.0,39230 Argonaut Way,Fremont,37.543782,-121.986825,94536,61
3,3,T4,3.0,36400 Fremont Blvd,Fremont,37.563112,-122.015291,94536,61
4,4,Q Cup,3.0,39129 Fremont Blvd,Fremont,37.54617,-121.987402,94536,61
1,1,Royaltea USA,4.0,38509 Fremont Blvd,Fremont,37.551315,-121.99385,94536,61
15,15,i-Tea,3.5,43421 Christy St,Fremont,37.504992,-121.971232,94538,46
23,23,Bean Scene Cafe,4.0,4000 Bay St,Fremont,37.532729,-121.95934,94538,46
21,21,Storm Crepes,4.5,39658 Cedar Blvd,Newark,37.521817,-121.997366,94538,46
20,20,Fusion Mix Frozen Yogurt,4.5,4144 Walnut Ave,Fremont,37.543602,-121.984061,94538,46


In [6]:
#create info_box_template
info_box_template = """
<dl>
<dt>Boba Name</dt><dd>{name}</dd>
<dt>City</dt><dd>{city}</dd>
<dt>Rate</dt><dd>{rating}</dd>
<dt>Pikachu</dt><dd>{Number of Pikachu}</dd>
</dl>
"""

boba_info = [info_box_template.format(**row) for index, row in top_35_boba_df.iterrows()]
boba_locations = top_35_boba_df[["lat", "long"]]
pika_info = [info_box_template.format(**row) for index, row in top_35_pika_df.iterrows()]
pika_locations = top_35_pika_df[["lat", "long"]]

In [7]:
# Create one symbol layer per group, with info boxes and colors set together.
# Boba shops are marked green and Pikachu areas are marked yellow.
boba_symbol_layer = gmaps.symbol_layer(
    boba_locations,
    fill_color='green',
    stroke_color='green',
    info_box_content=boba_info,
    display_info_box=True
)
pika_symbol_layer = gmaps.symbol_layer(
    pika_locations,
    fill_color='yellow',
    stroke_color='yellow',
    info_box_content=pika_info,
    display_info_box=True
)

# Add each layer once; the unused and duplicate layers were removed.
fig = gmaps.figure()
fig.add_layer(boba_symbol_layer)
fig.add_layer(pika_symbol_layer)
fig



Figure(layout=FigureLayout(height='420px'))

According to the symbol map, there seems to be little overlap between highly rated boba shops and the areas where many Pikachu appeared. The number of Pikachu spawns and the boba shop ratings therefore appear to be unrelated.
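
The map gives a visual impression only. Since the stated goal is a correlation, a numeric check could complement it. The sketch below computes a Pearson coefficient in plain Python for the ten merged rows displayed in the Transform 3 output. The function name `pearson` is hypothetical, and ten rows from two zip codes are far too few to support any conclusion; on the full merged frame, pandas' `Series.corr` computes the same statistic.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# The ten (rating, Number of Pikachu) pairs shown in the merged table.
ratings = [4.5, 4.0, 3.0, 3.0, 3.0, 4.5, 4.0, 4.5, 3.5, 4.0]
pikachu_counts = [61, 61, 61, 61, 61, 14, 14, 14, 14, 14]
print(round(pearson(ratings, pikachu_counts), 3))
```

Because every shop in a zip code carries the same Pikachu count after the merge, a per-zip comparison (for example, mean rating per zip code against the count) may be a fairer unit of analysis than per-shop rows.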

## 3. Load
MongoDB was used in this project because its schema is more flexible than that of a SQL database.

In [None]:
client = MongoClient('localhost', 27017)
db = client['']  # NOTE: the database name is empty here; pymongo rejects an empty name, so set it before running
collection_pikaboba = db['SF_pika_boba']


In [None]:
with open('boba_pika.json') as f:
    file_data = json.load(f)
collection_pikaboba.insert_many(file_data)
client.close()
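
Running the load cell twice would insert every record twice, because `insert_many` always appends. A small sketch of a re-runnable load, assuming it is acceptable to clear the collection first, is shown below. `reload_collection` is a hypothetical helper; it relies only on the `delete_many` and `insert_many` collection methods from pymongo.

```python
import json


def reload_collection(collection, json_path):
    """Replace the collection's contents with the records in json_path."""
    with open(json_path) as f:
        records = json.load(f)
    if not records:  # insert_many rejects an empty list, so leave the collection as-is
        return 0
    collection.delete_many({})       # clear documents left by earlier runs
    collection.insert_many(records)  # load the fresh records
    return len(records)
```

It would be called as `reload_collection(collection_pikaboba, 'boba_pika.json')` before `client.close()`.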