# Setting up Foursquare data for analysis 

Today's lab will get your hands dirty with the Foursquare API. We're also going to build a simple crawler/scraper that walks the JSON hierarchy, extracts the data we want, and deposits it into a Pandas table so we can do some simple analysis.

If you're unfamiliar with web scraping, please refer to the Wikipedia page (it's actually pretty good): https://en.wikipedia.org/wiki/Web_scraping. Then spend a few moments discussing the concepts and how this "hackish" skill could help you in the future as a data scientist.

Set up your Foursquare credentials (an access token, or a client ID and client secret).

In [1]:
# Solutions

import foursquare
import json
import pandas as pd
import unicodedata


#ACCESS_TOKEN = ""
#client = foursquare.Foursquare(access_token=ACCESS_TOKEN)

CLIENT_ID = 'YOUR_CLIENT_ID'
CLIENT_SECRET = 'YOUR_CLIENT_SECRET'
client = foursquare.Foursquare(client_id=CLIENT_ID, client_secret=CLIENT_SECRET)

Use a method from the foursquare Python library to search for suitable venues around a city near you. Print the associated JSON output in a nice way, with appropriate spacing and indentation.

In [2]:
# Solution

starting_list = client.venues.search(params={'near': 'Sydney, NSW', 'radius':'1500'})
print(starting_list)


{u'confident': True, u'geocode': {u'parents': [], u'what': u'', u'where': u'sydney nsw', u'feature': {u'highlightedName': u'<b>Sydney</b>, <b>NSW</b>, Australia', u'displayName': u'Sydney, NSW, Australia', u'name': u'Sydney', u'longId': u'72057594040075650', u'cc': u'AU', u'id': u'geonameid:2147714', u'geometry': {u'center': {u'lat': -33.86785, u'lng': 151.20732}, u'bounds': {u'sw': {u'lat': -34.18942992635369, u'lng': 150.58820004399706}, u'ne': {u'lat': -33.37850601479644, u'lng': 151.34320099544556}}}, u'matchedName': u'Sydney, NSW, Australia', u'woeType': 7, u'slug': u'sydney'}}, u'venues': [{u'verified': True, u'name': u'The Westin Sydney', u'referralId': u'v-1477449429', u'venueChains': [], u'url': u'http://westinsydney.com', u'storeId': u'1183', u'hereNow': {u'count': 0, u'groups': [], u'summary': u'Nobody here'}, u'specials': {u'count': 0, u'items': []}, u'allowMenuUrlEdit': True, u'contact': {u'facebookName': u'The Westin Sydney', u'twitter': u'westin', u'phone': u'+6128223111

Wow... that should look like a total mess to you. Read the following docs: https://docs.python.org/2/library/json.html, paying attention to the part about pretty printing. Once you think you've understood the method, deploy it here and see the world of difference a bit of spacing and indenting makes!
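As a warm-up before applying it to the real response, here is a minimal sketch of pretty printing on a tiny dictionary. The `sample` data is hand-made for illustration and is not real API output. `indent` sets how far each nesting level is indented, and `sort_keys` makes the key order predictable.

```python
import json

# Hypothetical, hand-made data that mimics the shape of a search response.
sample = {
    "venues": [{"name": "Example Cafe", "stats": {"tipCount": 3}}],
    "confident": True,
}

# indent=4 puts each key on its own line, indented four spaces per level;
# sort_keys=True orders the keys alphabetically at every level.
pretty = json.dumps(sample, indent=4, sort_keys=True)
print(pretty)
```

The pretty-printed text is still valid JSON, so `json.loads(pretty)` gives back the same dictionary.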

In [3]:
print(json.dumps(starting_list, indent=4))

{
    "confident": true, 
    "geocode": {
        "parents": [], 
        "what": "", 
        "where": "sydney nsw", 
        "feature": {
            "highlightedName": "<b>Sydney</b>, <b>NSW</b>, Australia", 
            "displayName": "Sydney, NSW, Australia", 
            "name": "Sydney", 
            "longId": "72057594040075650", 
            "cc": "AU", 
            "id": "geonameid:2147714", 
            "geometry": {
                "center": {
                    "lat": -33.86785, 
                    "lng": 151.20732
                }, 
                "bounds": {
                    "sw": {
                        "lat": -34.18942992635369, 
                        "lng": 150.58820004399706
                    }, 
                    "ne": {
                        "lat": -33.37850601479644, 
                        "lng": 151.34320099544556
                    }
                }
            }, 
            "matchedName": "Sydney, NSW, Australia", 
            "woeType"

Now that we can make some sense of the structure, let's practice traversing the JSON hierarchy. Select one of the venues in the list and output the name of its first category.

In [17]:
# Solution
starting_list['venues'][17]['categories'][0]['name']

u'Concert Hall'

Note that the output isn't exactly what we want. It says u'Concert Hall', and if you check the type, Python will report unicode. This isn't good; we need to recover the original intended type. Read the following docs:

https://docs.python.org/2/library/unicodedata.html, and look up the method 'normalize'. Once you think you've understood this method, apply it to the above call and see if you can recover the appropriate type for that data.
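To see what 'normalize' does before using it on the API data, here is a minimal sketch on a made-up venue name (not real API output). The 'NFKD' form splits an accented character into a base letter plus a combining mark, so encoding to ASCII with 'ignore' drops only the mark. Under Python 2 the result is a plain `str`; under Python 3 it is a `bytes` object.

```python
import unicodedata

# Hypothetical venue name containing an accented character (e with acute).
name = u'Caf\u00e9 Sydney'

# NFKD decomposes the accented letter into 'e' plus a combining accent (U+0301).
decomposed = unicodedata.normalize('NFKD', name)

# Encoding to ASCII with 'ignore' drops the combining accent and keeps the 'e'.
ascii_name = decomposed.encode('ascii', 'ignore')
print(ascii_name)
```

Keep in mind that characters with no ASCII decomposition are silently dropped by 'ignore', so this conversion can lose information.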


Now for some exploratory analysis, let's print the total number of venues in your list.

In [9]:
# Solution

len(starting_list['venues'])

30

Extract the id of one venue from your starting list. Make sure it's normalized to its correct type, and not Unicode. Put this id in a variable called temp. We will use this id to look up the full details for that venue.

In [19]:
starting_list['venues'][17]['id']

u'4b05876af964a5205d9122e3'

In [20]:
starting_list['venues'][17]['name']

u'City Recital Hall Angel Place'

In [10]:
# Solution

temp = unicodedata.normalize('NFKD', starting_list['venues'][17]['id']).encode('ascii','ignore')
print(temp)

4b05876af964a5205d9122e3


Print the venue details (as nicely formatted JSON).

In [11]:
# Solution

temp1 = client.venues(temp)
print(json.dumps(temp1, indent=4))

{
    "venue": {
        "rating": 8.6, 
        "reasons": {
            "count": 1, 
            "items": [
                {
                    "reasonName": "upcomingEventsReason", 
                    "type": "general", 
                    "target": {
                        "object": {
                            "ignorable": false, 
                            "type": "events", 
                            "id": "5810180938fa05a8b97596fc", 
                            "target": {
                                "url": "/venues/4b05876af964a5205d9122e3/events", 
                                "type": "path"
                            }
                        }, 
                        "type": "navigation"
                    }, 
                    "summary": "There is an upcoming event here"
                }
            ]
        }, 
        "likes": {
            "count": 61, 
            "groups": [
                {
                    "count": 61, 
                   

Create a procedure that extracts only the comments (the tip text) into a list. There are a few ways you can do this, but I highly recommend you look up the built-in map function: https://docs.python.org/2/tutorial/datastructures.html

This is the same "map" that forms one half of the map-reduce duo used in "Big Data" applications, so it may be helpful to get familiar with it now if that's where you think you may want to take your career.
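As a quick illustration of map, here is a sketch on a small hand-made list (the `items` below are invented for the example). In Python 2 map returns a list directly, while in Python 3 it returns a lazy iterator, so the sketch wraps it in `list(...)` to behave the same in both. A list comprehension gives the same result.

```python
# Hypothetical tip records, shaped like the items in a venue's tips group.
items = [
    {"text": "Great acoustics.", "likes": 2},
    {"text": "comfortable :)", "likes": 0},
]

# Apply the lambda to every item and keep only the 'text' field.
texts_map = list(map(lambda h: h['text'], items))

# Equivalent list comprehension.
texts_comp = [h['text'] for h in items]

print(texts_map)
```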

In [12]:
# Solution
map(lambda h: h['text'], temp1['venue']['tips']['groups'][0]['items'])

[u'One of the most beautiful venues in Sydney!',
 u'Work work work!!',
 u'comfortable :)',
 u'Acoustics and beautiful space',
 u'Fantastic venue, if the bar on the first floor is crowded.... Go to the one on the second floor!',
 u'Great acoustics.',
 u'This January, City Recital Hall sees Joshua Redman and Brad Mehldau reunited after 15 years for a night of jazz. You may also want to catch Assembly - an awe-inspiring mixture of movement and voice.',
 u'Seats are comfy, and great views from the balconies.',
 u'Festival Keynote: Hope 2011 - stories that must be told: Monday 24 January 2011. January offers a fresh start, the turning of a page and optimism for the year ahead.',
 u"Inside Film Awards, DateSun 14 NovWith its fast-paced awards ceremony and notorious after party, the Kodak IF Awards is widely known as one of the Aussie Film industry's best-loved events. Hosted th",
 u'In January Catch a great peformance at the City Recital hall for Sydney Festival, including Philip Glass, Beac

Now we're going to bring the above mini-tasks together into a nice little method that will allow us to convert any Foursquare JSON data into a nice rectangular table for further analysis. First, instantiate a pandas data frame.

In [14]:
venue_table = pd.DataFrame()

Write a procedure that takes your list of venues around a given location and outputs a table with one row per comment (a venue with multiple comments gets multiple rows). Each row should hold the comment, the venue name, the tip count, the user count, and the store category. Make sure that each column is populated with appropriately typed values, i.e. names/categories should be strings, and counts should be numeric.

**Hint**: Before you begin, think about the process. You're going to start with a loop of some kind, then think about the following:
- How many of those do you need? 
- Think about the JSON structure: how "deep" do you need to go into the hierarchy to reach the data you want? (This will help you think about how many loops you need for your crawler.)
- How should you iteratively add on to your Pandas data frame? 
- Think of any tests you may need to put in to ensure your procedure does not cause an error. (This may help you figure out how many if statements you may need, and where to place them.)
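One way to think through the hints above is to try the nested loops on a tiny, hand-made structure first. The sketch below is an illustration under assumptions, not the lab solution: `venues` and its fields are invented to mimic the shape of the responses seen earlier, and `flatten_venues` is a hypothetical helper name. It produces one dictionary per comment and guards against venues with no tips or no category.

```python
def flatten_venues(venues):
    """Return one row (dict) per tip, repeating the venue-level fields."""
    rows = []
    for venue in venues:  # outer loop: one pass per venue
        # Test 1: a venue may have an empty category list.
        categories = venue.get('categories', [])
        category = categories[0]['name'] if categories else "No categories"
        # Test 2: a venue may have no tip groups at all.
        groups = venue.get('tips', {}).get('groups', [])
        tips = groups[0]['items'] if groups else []
        for tip in tips:  # inner loop: one row per comment
            rows.append({
                "name": venue['name'],
                "store category": category,
                "tip count": venue['stats']['tipCount'],
                "users count": venue['stats']['usersCount'],
                "comments": tip['text'],
            })
    return rows


# Hypothetical sample data: one venue with two tips, one venue with none.
venues = [
    {"name": "Example Hall",
     "categories": [{"name": "Concert Hall"}],
     "stats": {"tipCount": 2, "usersCount": 10},
     "tips": {"groups": [{"items": [{"text": "Great acoustics."},
                                    {"text": "Comfy seats."}]}]}},
    {"name": "Empty Place",
     "categories": [],
     "stats": {"tipCount": 0, "usersCount": 1},
     "tips": {"groups": []}},
]

rows = flatten_venues(venues)
print(len(rows))
```

A list of dictionaries like `rows` can be passed to `pd.DataFrame(rows)` to build the table in one step, which is one alternative to growing the data frame inside the loop.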


In [15]:
def to_ascii(text):
    # Normalize a unicode value and return it as a plain ASCII string.
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')

rows = []
for venue in starting_list['venues']:
    # The search results do not include tips, so fetch the full venue record.
    details = client.venues(to_ascii(venue['id']))
    tip_groups = details['venue']['tips']['groups']
    if not tip_groups:
        continue  # skip venues that have no tips
    # Some venues have no category, so fall back to a placeholder.
    if venue['categories']:
        category = to_ascii(venue['categories'][0]['name'])
    else:
        category = "No categories"
    # One row per comment; the venue-level fields repeat on every row.
    for tip in tip_groups[0]['items']:
        rows.append({"name": to_ascii(venue['name']),
                     "tip count": venue['stats']['tipCount'],
                     "users count": venue['stats']['usersCount'],
                     "store category": category,
                     "comments": to_ascii(tip['text'])})

# Build the table in one step, with a fresh 0..n-1 index.
venue_table = pd.DataFrame(rows, columns=["comments", "name", "store category",
                                          "tip count", "users count"])

Finally, output the venue table.

In [16]:
venue_table.drop_duplicates()

|   | comments | name | store category | tip count | users count |
|---|---|---|---|---|---|
| 0 | Huge rooms for a downtown hotel. Great locatio... | The Westin Sydney | Hotel | 77 | 3688 |
| 1 | View real guest photos & video of rooms at The... | The Westin Sydney | Hotel | 77 | 3688 |
| 2 | Don't expect free bottled water for SPG platin... | The Westin Sydney | Hotel | 77 | 3688 |
| 3 | Historical and ornate hotel in the hear of the... | The Westin Sydney | Hotel | 77 | 3688 |
| 4 | Faster to Cut across lobby to get to gym . | The Westin Sydney | Hotel | 77 | 3688 |
| 5 | have a martini in the lobby bar | The Westin Sydney | Hotel | 77 | 3688 |
| 6 | Recommend the Tower Premium Rooms and Exec Sui... | The Westin Sydney | Hotel | 77 | 3688 |
| 7 | great gym set up perfect to keep up with my tr... | The Westin Sydney | Hotel | 77 | 3688 |
| 8 | This is probably ground zero for Sydney CBD. ... | The Westin Sydney | Hotel | 77 | 3688 |
| 9 | Check out the High Tea Party happening on 8 an... | The Westin Sydney | Hotel | 77 | 3688 |


You've done it! You've built a simple crawler that traverses a JSON hierarchy, and you've deposited the results in a nice Pandas data frame. Congratulations! You're now ready for more data mining in the future, and have just beefed up the **data** part of the data science combination :)