## Python Dealership Scraping Tutorial Part 2: Honda Motor Company

Hello! In this tutorial we will use Python to retrieve dealership data from the [Honda North America](https://www.honda.com/) website!

What you'll need for this part of the tutorial (and all other parts): 
* Python 3.6 or later (the code below uses f-strings)
* Jupyter Notebooks installed
* A basic understanding of how websites retrieve and load content to users (or a willingness to learn!)
* Patience to experiment!

Because individual websites store, retrieve and load data (and by extension, content) to users in vastly different ways, this tutorial is not all-encompassing. What I mean is: While the rough methodology and experimental/scientific method outlined here will work for any site you come across, the code you will have to write to retrieve the data will be different. 

Depending on how the sites are administered, you might sometimes get lucky. For instance, Chrysler, Dodge, Jeep and Ram all use the same back-end [REST API](https://restfulapi.net/) for their United States dealer services. The only difference is the endpoint of the URL, which changes depending on whether you want Chrysler, Dodge, Jeep or Ram dealer results.

In most cases, though, you'll find you're repeating the same experimental steps on every site you want to scrape. There will be some similarities that can be generalized, bits of code that can be made into functions and so on, but you'll always want to start with a fresh Jupyter Notebook to get yourself started (in my opinion).

### A Skippable Tangent: Why Jupyter? 

You can skip this section to get right to the "good" stuff if you want. But I'd like to just say I think Jupyter Notebooks, for all their shortcomings as a proper IDE, are the best tool to use for jobs like this. 

The rapid experimentation and manipulation of objects (lists, dictionaries, etc.) that you can achieve in a Jupyter Notebook is unrivaled. The interactive nature of them provides a great environment to serve as your scratchpad from which you can copy and paste your "solution" into the IDE of your choice. 

There are some things an IDE like PyCharm will do for you: parameter previewing, underlining errors and proper debugging, to name just a few. But for just getting started, especially if this is your first time writing Python code, Jupyter is tough to beat.

### Loading Libraries

Okay! Let's load in our libraries. For this I'll be using:

* json : the built-in Python JSON parsing/outputting library
* requests : the best Python library for interacting with websites/APIs with [HTTP methods](https://www.w3schools.com/tags/ref_httpmethods.asp) (we'll use GET and POST exclusively)
* mbtools : a custom-made module that has a bunch of useful functions for web scraping

Unfortunately, the name of the author of mbtools has been lost to time. So I can't give them credit in this Notebook for their work. 

Notice that `mbtools` lives in a different directory, `lib`, from the one this notebook is in, `scrape_unsecured_js_file`. I still want to be able to write the import like:

```python
import mbtools
```

This is possible because of the first code block below. It adds the `lib` directory that contains the `scraping_utils.py` and `mbtools.py` files to `sys.path`, the list of locations Python searches when importing modules.

In [1]:
import sys

sys.path.insert(0, '../lib')

Now this works as you'd expect since this file can "see" the `lib` directory, even though it's in the `scrape_unsecured_js_file` directory. 

In [2]:
import json
import requests
import mbtools
import scraping_utils as su # Personal set of utility functions I've defined for this tutorial
import pprint # For pretty printing JSON-like objects

Keep this in mind as you're moving custom module and utility files around: The project structure is very important and you'll spend a ton of time fixing path issues, relative import issues, etc. if you're not careful.
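If path issues do bite, a small helper can make them fail loudly instead of surfacing later as a confusing `ModuleNotFoundError`. This is only a sketch: `add_to_sys_path` is a name I've made up for illustration and is not part of `mbtools` or `scraping_utils`.

```python
import sys
from pathlib import Path


def add_to_sys_path(directory):
    """Put `directory` at the front of sys.path, failing loudly if it is missing."""
    resolved = Path(directory).resolve()
    if not resolved.is_dir():
        raise FileNotFoundError(f'No such directory: {resolved}')

    path_str = str(resolved)
    # Avoid piling up duplicate entries when a notebook cell is re-run
    if path_str not in sys.path:
        sys.path.insert(0, path_str)
    return path_str


# In this notebook the call would look like:
# add_to_sys_path('../lib')
```

Resolving to an absolute path also means the entry keeps working if the notebook's working directory changes later on.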

If you'd like to see the `show_obj_head` and `equivalence_checker` code separately, take a look at the `scraping_utils.py` file located in the `lib` directory. This is where these helper functions are stored. 

### First Steps

I'm not going to repeat a bunch of text and movie references here. If you're interested in seeing the proper introduction to this section, I'd advise you to revisit Part One of this tutorial.

Instead, let's just cut to the chase and get rolling. You know the drill, navigate to the dealership search functionality on the Honda website and start playing with the website. If you're having trouble finding it or are not feeling like searching, [here you go!](https://automobiles.honda.com/tools/dealership-locator)

Same deal as Part One: We want to manipulate the site into showing us how it makes requests to a database, API, .js file, etc. to retrieve information. So, let's do that. 

Here's what the initial site looks like:
<img src=../images/honda_initial_site.png align="center">
<br>
<br>

And some things to notice right away that we might be able to take advantage of, before even opening up the Inspect Console:
* You can search by ZIP
* You can search by City, State combination
* You can search by name
<br>

The easiest seems to be ZIP since we don't actually know what the names of the dealerships are. Passing City and State gets rid of that issue, but ZIP is also a geographic search and requires one input (ZIP) instead of two (City and State). So let's assume that's the route we want to attack this from. 

Anyway, let's open up the Inspect Console (`CTRL` + `SHIFT` + `I` if you've forgotten) and start playing!

### Learn By Doing

Right away after opening up the Network tab and looking at the XHR requests, I see something that's very promising!
<img src=../images/honda_request.png align="center">
<br>
<br>

And upon a little more investigation of this request, here's the data it returns:
<img src=../images/honda_request_response.png align="center">
<br>
<br>

If you're not seeing these same results, try searching for a new ZIP or reloading the page. 

So, to collect every dealer, we need to do one of two things:
1. Send a request for a single ZIP and get a sufficiently large number of dealers back from it
2. Send requests for a sufficiently large number of ZIPs and get only a few dealers back from each

The first is easier, but less likely to be supported by the site. Most of these OEM sites limit you either on the search radius around a ZIP or on the number of results that can be returned from a single request. It's likely we'll need to try the second option, but we'll explore both.

### Sending Requests

Okay - how do we actually send the ZIP as a part of the request though? 

Basically, we can send a set of parameters along with our `GET` request to the API which will tell it what data we either want to send or receive. 

Sometimes you can pass a dictionary to the `requests.get()` function in the `params` argument like below:

In [3]:
base_url = 'https://automobiles.honda.com/platform/api/v1/dealer?'

params = \
{
'productDivisionCode': 'A',
'excludeServiceCenters': 'true',
'zip': '48371',
'maxResults': '100'
}

response = requests.get(url=base_url, params=params)
results = response.json()['Dealers']
pprint.pprint(su.show_obj_head(obj=results, n_items=1))

The object passed in is a list with 100 items!

[{'ASARanking': None,
  'Address': '1819 S Telegraph Rd ',
  'Attributes': [{'Code': 'HF',
                  'Description': 'AHFC Certified Honda Auto Dealer',
                  'FullDescription': None,
                  'Name': 'AHFC Certified Dealer'},
                 {'Code': 'CU',
                  'Description': 'Dealer Locator only, not on mainframe.',
                  'FullDescription': None,
                  'IconUrl': 'https://location.services.honda.com/images/icons/A_CU_icon.gif',
                  'Name': 'Certified Used Dealer'},
                 {'Code': 'HC',
                  'Description': 'This dealership has earned the Council of '
                                 'Excellence award from Honda Financial '
                                 'Services.',
                  'FullDescription': None,
                  'IconUrl': 'https://location.services.honda.com/images/icons/A_HC_icon.gif',
                  'Name': 'Counc

What's this doing? It's essentially telling the API we want the results pertaining to productDivisionCode = "A", with excludeServiceCenters set to "true", centered around the ZIP of 48371 and capped at a maximum of 100 results. Pretty simple. And remember what I said last time about someone having to program this API: the parameters and settings you can adjust should be named something that makes sense, just like here!

Basically what the `requests.get()` function has done here is "bolted-on" the parameters to the end of the `base_url` string. Have a look for yourself, this is the URL that actually gets sent to the API:

In [4]:
print(f'The URL for this request is: {response.url}')

The URL for this request is: https://automobiles.honda.com/platform/api/v1/dealer?productDivisionCode=A&excludeServiceCenters=true&zip=48371&maxResults=100


See what's going on here? It's nothing more than just altering the URL! Cool!
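To convince yourself that the URL really is just the dictionary in another form, you can go the other way and turn the query string back into a `dict`. A minimal sketch using only the standard library, with the URL from the output above pasted in by hand:

```python
from urllib.parse import urlsplit, parse_qsl

url = ('https://automobiles.honda.com/platform/api/v1/dealer'
       '?productDivisionCode=A&excludeServiceCenters=true&zip=48371&maxResults=100')

# Split the URL into its components, then parse the part after the "?"
parts = urlsplit(url)
recovered = dict(parse_qsl(parts.query))

print(parts.path)
print(recovered)
```

The recovered dictionary has the same keys and values as the `params` we passed in.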

Hey, let's see if we can manipulate this into sending us <i>all</i> of the data back in one go!

In [5]:
params = \
{
'productDivisionCode': 'A',
'excludeServiceCenters': 'true',
'zip': '48371',
'maxResults': '500'
}

response = requests.get(url=base_url, params=params)
results = response.json()['Dealers']
pprint.pprint(su.show_obj_head(obj=results, n_items=1))

The object passed in is a list with 124 items!

[{'ASARanking': None,
  'Address': '1819 S Telegraph Rd ',
  'Attributes': [{'Code': 'HF',
                  'Description': 'AHFC Certified Honda Auto Dealer',
                  'FullDescription': None,
                  'Name': 'AHFC Certified Dealer'},
                 {'Code': 'CU',
                  'Description': 'Dealer Locator only, not on mainframe.',
                  'FullDescription': None,
                  'IconUrl': 'https://location.services.honda.com/images/icons/A_CU_icon.gif',
                  'Name': 'Certified Used Dealer'},
                 {'Code': 'HC',
                  'Description': 'This dealership has earned the Council of '
                                 'Excellence award from Honda Financial '
                                 'Services.',
                  'FullDescription': None,
                  'IconUrl': 'https://location.services.honda.com/images/icons/A_HC_icon.gif',
                  'Name': 'Counc

Nope, seems like there's some limit: we asked for up to 500 results and got 124 back. We can't just pass a large number into the `params` `dict` and expect to get that many results. Interesting...

Anyway, before we get back to the task at hand, let's have a quick detour to talk about parameters in your HTTP requests. 

### A Short Detour

Keep in mind some APIs won't let you pass a Python dictionary into `requests.get()`. I've sometimes seen cases where you need to use the `json.dumps()` function to "dump" the `dict` to a string representation of a JSON object like below:

```Python
base_url = 'https://automobiles.honda.com/platform/api/v1/dealer?'

params = \
{
'productDivisionCode': 'A',
'excludeServiceCenters': 'true',
'zip': '48371',
'maxResults': '100'
}

params = json.dumps(params)
response = requests.get(url=base_url, params=params)
```


In [6]:
params = json.dumps(params)
print(su.show_obj_head(obj=params))
print('')

response = requests.get(url=base_url, params=params)
print(response.text)
print('')
print(f'The URL for this request is: {response.url}')

The object passed in is a str with 98 characters!

{"productDivisionCode": "A", "excludeServiceCenters": "true", "zip": "48371", "maxResults": "500"}

{"Message":"No HTTP resource was found that matches the request URI 'https://automobiles.honda.com/platform/api/v1/dealer?{\"productDivisionCode\": \"A\", \"excludeServiceCenters\": \"true\", \"zip\": \"48371\", \"maxResults\": \"500\"}'."}

The URL for this request is: https://automobiles.honda.com/platform/api/v1/dealer?%7B%22productDivisionCode%22:%20%22A%22,%20%22excludeServiceCenters%22:%20%22true%22,%20%22zip%22:%20%2248371%22,%20%22maxResults%22:%20%22500%22%7D


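Seems like this API doesn't like this. Instead of `key=value` pairs, the whole JSON string was percent-encoded and tacked onto the URL as one blob (the backslashes in the error message are just JSON escaping of the double quotes), so the API can't match it to a resource and returns a 404. But if you encounter an API where sending in a Python `dict` of parameters doesn't work, see if this little trick makes the difference!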
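If you want to see what was actually sent in the failed request above, the standard library can decode the URL for you. A minimal sketch, assuming the URL printed in the output above is copied in by hand:

```python
import json
from urllib.parse import unquote

sent_url = ('https://automobiles.honda.com/platform/api/v1/dealer?'
            '%7B%22productDivisionCode%22:%20%22A%22,%20%22excludeServiceCenters%22:%20%22true%22,'
            '%20%22zip%22:%20%2248371%22,%20%22maxResults%22:%20%22500%22%7D')

# Everything after the "?" is the query string; undo the percent-encoding
query = unquote(sent_url.split('?', 1)[1])
print(query)

# The decoded query is a JSON document, not a series of key=value pairs
decoded = json.loads(query)
print(decoded['zip'])
```

The decoded query contains no `=` or `&` at all, which is why this particular API has nothing to match its parameters against.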

One other way you can do this is by altering the URL that you pass into `requests.get()` itself. I would usually advise against this, but if you're struggling to pass in either a `str` or a `dict` to the `params` argument of `requests.get()` to make a request to an API, give this a shot and see if it works!

You can read about query strings at the [Wikipedia page](https://en.wikipedia.org/wiki/Query_string) and [here's a StackOverflow question](https://stackoverflow.com/questions/39266970/what-is-the-difference-between-url-parameters-and-query-strings) about the difference between query strings and URL parameters. 

In [7]:
base_url = 'https://automobiles.honda.com/platform/api/v1/dealer?'

params = \
{
'productDivisionCode': 'A',
'excludeServiceCenters': 'true',
'zip': '48371',
'maxResults': '100'
}


# Write out the dictionary to a str but with "&" and "=" to represent the key-value pairs
print(json.dumps(params, separators=('&', '=')))
print('')

# Get rid of the curly brackets from the string object; we won't need them anymore
print(json.dumps(params, separators=('&', '='))[1:-1])
print('')

# This gets rid of the double-quotes in the string
# Note: the backslash in '\"' is optional here, since the string is wrapped in single quotes
print(json.dumps(params, separators=('&', '='))[1:-1].replace('\"', ''))
print('')

qs_params = json.dumps(params, separators=('&', '='))[1:-1].replace('\"', '')

response = requests.get(url=base_url+qs_params)
print(f'The URL for this request is: {response.url}')
print('')

results = response.json()['Dealers']
pprint.pprint(su.show_obj_head(obj=results, n_items=1))

{"productDivisionCode"="A"&"excludeServiceCenters"="true"&"zip"="48371"&"maxResults"="100"}

"productDivisionCode"="A"&"excludeServiceCenters"="true"&"zip"="48371"&"maxResults"="100"

productDivisionCode=A&excludeServiceCenters=true&zip=48371&maxResults=100

The URL for this request is: https://automobiles.honda.com/platform/api/v1/dealer?productDivisionCode=A&excludeServiceCenters=true&zip=48371&maxResults=100

The object passed in is a list with 100 items!

[{'ASARanking': None,
  'Address': '1819 S Telegraph Rd ',
  'Attributes': [{'Code': 'HF',
                  'Description': 'AHFC Certified Honda Auto Dealer',
                  'FullDescription': None,
                  'Name': 'AHFC Certified Dealer'},
                 {'Code': 'CU',
                  'Description': 'Dealer Locator only, not on mainframe.',
                  'FullDescription': None,
                  'IconUrl': 'https://location.services.honda.com/images/icons/A_CU_icon.gif',
                  'Name': 'Certified

Anyway, you get the idea. There are thousands of great resources, webpages, tutorials, videos, etc. you can check out if you want to learn more about interacting with REST APIs via Python.

Here's the [Quickstart](https://requests.readthedocs.io/en/master/user/quickstart/) for the `requests` library in Python, which should be good enough to get you rolling and asking the right questions!
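One more note on building query strings by hand: the `json.dumps()` trick with custom separators works for simple values, but it will not escape spaces, commas or ampersands inside a value. The standard library's `urllib.parse.urlencode` handles that. A minimal sketch (the `q` key in the second example is made up purely to show the escaping and is not a Honda API parameter):

```python
from urllib.parse import urlencode

base_url = 'https://automobiles.honda.com/platform/api/v1/dealer'

params = {
    'productDivisionCode': 'A',
    'excludeServiceCenters': 'true',
    'zip': '48371',
    'maxResults': '100',
}

# Build "key=value&key=value" with proper escaping
query_string = urlencode(params)
full_url = f'{base_url}?{query_string}'
print(full_url)

# Values with spaces or punctuation are escaped for you
tricky = urlencode({'q': 'Lake Orion, MI'})
print(tricky)  # q=Lake+Orion%2C+MI
```

This is essentially the encoding step that `requests` performs for you when you pass a `dict` to `params`.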

### Back to Regularly-Scheduled Programming

Okay, enough toying around. Let's get down to business. 

Here's the objective: We want to send incremental requests to the API one ZIP at a time, getting as many results back as possible. We can't make a nationwide request since there's a limit on how many dealers it'll show us at a time. 

Really we need four things: a list of ZIP codes to loop through, a way to make a request for each one, somewhere to store the results, and a way to write the results out to a file. Simple.

Because the `zdb.txt` file that was included with the web scraping materials had some issues (inconsistent data types and its sheer length), I took the liberty of cleaning it up, slimming it down and storing the results in a new file: `geo_data_slim.json`.

Let's read that data in now:

In [8]:
geo_data_slim = su.read_json('../data/input/geo_data_slim.json')
pprint.pprint(su.show_obj_head(obj=geo_data_slim))

The object passed in is a list with 194 items!

[{'city': 'Amherst',
  'lat': 42.377651,
  'lon': -72.50323,
  'state': 'MA',
  'zip': '01002'},
 {'city': 'Plymouth',
  'lat': 41.894676,
  'lon': -70.61999,
  'state': 'MA',
  'zip': '02360'},
 {'city': 'Littleton',
  'lat': 44.313606,
  'lon': -71.79751,
  'state': 'NH',
  'zip': '03561'},
 {'city': 'Augusta',
  'lat': 44.344406,
  'lon': -69.76345,
  'state': 'ME',
  'zip': '04330'},
 {'city': 'Millinocket',
  'lat': 45.646916,
  'lon': -68.74801,
  'state': 'ME',
  'zip': '04462'}]


In [9]:
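import ast

# The file holds a Python literal (a list of dicts), so literal_eval can parse it safely
# The with-block closes the file for us, so no explicit close() is needed
with open('../data/input/zdb.txt', 'r') as f:
    raw_data = f.read()

zdb = ast.literal_eval(raw_data)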

pprint.pprint(su.show_obj_head(obj=zdb))

The object passed in is a list with 3061 items!

[{'city': 'Amherst',
  'lat': '42.377651',
  'lon': '-72.50323',
  'state': 'MA',
  'zip': '01002'},
 {'city': 'Palmer',
  'lat': '42.176401',
  'lon': '-72.32646',
  'state': 'MA',
  'zip': '01069'},
 {'city': 'West Springfield',
  'lat': '42.112793',
  'lon': '-72.64066',
  'state': 'MA',
  'zip': '01089'},
 {'city': 'Springfield',
  'lat': '42.096644',
  'lon': '-72.5798',
  'state': 'MA',
  'zip': '01105'},
 {'city': 'Cheshire',
  'lat': '42.551731',
  'lon': '-73.1653',
  'state': 'MA',
  'zip': '01225'}]


The advantage of not looping through all 3061 items is, of course, speed. For the purposes of this notebook, I'll just be using the slim one. Most of the time, it should capture the same data as using the entire `zdb` but with only 1/15th the coordinates to check! There might be <i>rare</i> instances where the slimmed-down version of the coordinate data will miss a dealership in the US, but I would expect these to be few and far between. 

For a cleaned version of `zdb` where I've made sure the data types of all the entries are consistent and the file adheres to JSON standards, you can use `geo_data.json`. This one includes all 3061 entries that are in `zdb`, see below.

In [10]:
geo_data = su.read_json('../data/input/geo_data.json')
pprint.pprint(su.show_obj_head(obj=geo_data))

The object passed in is a list with 3061 items!

[{'city': 'Amherst',
  'lat': 42.377651,
  'lon': -72.50323,
  'state': 'MA',
  'zip': '01002'},
 {'city': 'Palmer',
  'lat': 42.176401,
  'lon': -72.32646,
  'state': 'MA',
  'zip': '01069'},
 {'city': 'West Springfield',
  'lat': 42.112793,
  'lon': -72.64066,
  'state': 'MA',
  'zip': '01089'},
 {'city': 'Springfield',
  'lat': 42.096644,
  'lon': -72.5798,
  'state': 'MA',
  'zip': '01105'},
 {'city': 'Cheshire',
  'lat': 42.551731,
  'lon': -73.1653,
  'state': 'MA',
  'zip': '01225'}]
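The cleanup that turns `zdb` entries into `geo_data.json` entries isn't shown in this notebook. As a rough idea of what it could look like, here's a minimal sketch. The `clean_entry` function is hypothetical, and I'm assuming the only problems are coordinates stored as strings and ZIPs that may have lost their leading zeros:

```python
def clean_entry(entry):
    """Return a copy of one zdb-style record with consistent data types."""
    return {
        'city': entry['city'].strip(),
        'state': entry['state'].strip().upper(),
        # ZIPs stay strings so leading zeros survive (e.g. '01002')
        'zip': str(entry['zip']).zfill(5),
        # Coordinates become floats so they can be used in calculations
        'lat': float(entry['lat']),
        'lon': float(entry['lon']),
    }


raw = {'city': 'Amherst', 'lat': '42.377651', 'lon': '-72.50323',
       'state': 'MA', 'zip': '01002'}
cleaned = clean_entry(raw)
print(cleaned)
```

Applying this to every entry with a list comprehension, `[clean_entry(e) for e in zdb]`, would give a list in the same shape as `geo_data`.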


Below is a common design pattern in programming you'll encounter. We want to loop through the coordinates, make a request to the API, get the results back into a `list`, check whether each entry in that `list` is new (we haven't seen that dealer ID before) and if so, add it to a `list` that holds only the unique entries. Pretty straightforward. 

I won't bother breaking this down step-by-step since you should be able to follow it at this point. I'll add some code comments below so you can see what we're doing if it's not immediately clear. 

In [None]:
# Used for storing the unique results from the API calls
unique_results = []

# Used for storing the ID's of the dealers so we don't get duplicates
id_list = []

base_url = 'https://automobiles.honda.com/platform/api/v1/dealer?'

params = \
{
'productDivisionCode': 'A',
'excludeServiceCenters': 'true',
'zip': 'FOO!',
'maxResults': '200'
}

# For location in our geo_data_slim list...
for coord in geo_data_slim:
    
    # The number of unique dealers for a given request will go here
    # We want this to reset every time the loop executes, that's why it's set to an empty list ("[]") here
    new = []
    
    # Print out some info about which ZIP we're trying to grab
    print('Finding Honda dealers near ' + coord['zip'] + '...')
    
    # Set the 'zip' entry in the params dict to the current coordinate's ZIP
    params['zip'] = coord['zip']
    
    # Make the request
    response = requests.get(url=base_url, params=params)
    results = response.json()['Dealers']
    
    # For each result in the latest query's result set...
    for res in results:
        
        # Find the unique ID for the dealer
        dlr_nbr = res['DealerNumber']
        
        # If the ID hasn't been seen before...
        if dlr_nbr not in id_list:
            
            # Add it to the ID list so we can make note we've seen it
            id_list.append(dlr_nbr)
            
            # Add the dealer itself to our unique list
            unique_results.append(res)
            
            # Add it to our list that keeps track of how many "new" dealers we found on this loop execution
            new.append(res)
    
    # Before going into the next execution, print out some stats
    mbtools.print_count(results, new, unique_results)

This'll take a few minutes, but once it finishes you'll have something that looks like what we've been scraping this whole time, except it'll have over 1,000 items in it!
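A side note on the loop above: checking `dlr_nbr not in id_list` scans the whole list every time, and a single failed request will stop the entire run. Here's a hedged sketch of the same pattern using a `set` for the IDs and skipping ZIPs that fail. The `collect_unique_dealers` and `fake_fetch` names are mine, and the fetch step is passed in as a function so the sketch runs without touching the network:

```python
import time


def collect_unique_dealers(zips, fetch, id_key='DealerNumber', pause=0.0):
    """Call fetch(zip_code) for each ZIP and keep each dealer only once."""
    seen_ids = set()   # Set membership checks don't slow down as the set grows
    unique = []

    for zip_code in zips:
        try:
            results = fetch(zip_code)
        except Exception as err:
            # One bad ZIP shouldn't throw away everything collected so far
            print(f'Skipping {zip_code}: {err}')
            continue

        for dealer in results:
            dealer_id = dealer[id_key]
            if dealer_id not in seen_ids:
                seen_ids.add(dealer_id)
                unique.append(dealer)

        # Optional pause between requests to go easy on the server
        time.sleep(pause)

    return unique


# Stand-in for the real request, with made-up dealers
def fake_fetch(zip_code):
    canned = {
        '01002': [{'DealerNumber': '1', 'Name': 'A'}, {'DealerNumber': '2', 'Name': 'B'}],
        '02360': [{'DealerNumber': '2', 'Name': 'B'}, {'DealerNumber': '3', 'Name': 'C'}],
    }
    return canned[zip_code]


dealers = collect_unique_dealers(['01002', '02360', '99999'], fake_fetch)
print([d['DealerNumber'] for d in dealers])  # ['1', '2', '3']
```

With the real API, `fetch` could be something like `lambda z: requests.get(base_url, params={**params, 'zip': z}, timeout=30).json()['Dealers']`.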

In [12]:
pprint.pprint(su.show_obj_head(obj=unique_results, n_items=1))

The object passed in is a list with 1062 items!

[{'ASARanking': None,
  'Address': '293 King St ',
  'Attributes': [{'Code': 'HF',
                  'Description': 'AHFC Certified Honda Auto Dealer',
                  'FullDescription': None,
                  'Name': 'AHFC Certified Dealer'},
                 {'Code': 'CU',
                  'Description': 'Dealer Locator only, not on mainframe.',
                  'FullDescription': None,
                  'IconUrl': 'https://location.services.honda.com/images/icons/A_CU_icon.gif',
                  'Name': 'Certified Used Dealer'},
                 {'Code': '46',
                  'Description': 'Dealer is participating in CVA Enhanced '
                                 'Purchase Program',
                  'FullDescription': None,
                  'Name': 'CVA Enhanced Purchase Program'},
                 {'Code': 'DDP',
                  'Description': 'Dealer Websites managed by Digital Sales '
                                 


That's all well and good, but we can't just write this file out how we have it. We want to extract some information from it first. 

Well, that's what we're about to do below. 

We want to take every item in the `unique_results` `list` and extract some data from it like the name, address, etc. Same as before. 

I'm going to show you how, in one line of code, you can extract the names of all the unique attributes and get them into the data structure.

The original code looked something like:

```Python
for res in unique_results:
    d = {}
    
    services = []
    for svc in res['Attributes']:
        if svc['Name'] not in services:
            services.append(svc['Name'])
    d['other'] = mbtools.list_to_csv(services)
    
```

Which is perfectly fine, except we can do better. Here's my solution:

```Python
for res in unique_results:
    d = {}
    d['other'] = mbtools.list_to_csv(list(set([attr['Name'] for attr in res['Attributes']])))
```

Let's break this down...

In [13]:
# Assume we're only working with the first item for now...
res = unique_results[0]

pprint.pprint(res['Attributes'])

[{'Code': 'HF',
  'Description': 'AHFC Certified Honda Auto Dealer',
  'FullDescription': None,
  'Name': 'AHFC Certified Dealer'},
 {'Code': 'CU',
  'Description': 'Dealer Locator only, not on mainframe.',
  'FullDescription': None,
  'IconUrl': 'https://location.services.honda.com/images/icons/A_CU_icon.gif',
  'Name': 'Certified Used Dealer'},
 {'Code': '46',
  'Description': 'Dealer is participating in CVA Enhanced Purchase Program',
  'FullDescription': None,
  'Name': 'CVA Enhanced Purchase Program'},
 {'Code': 'DDP',
  'Description': 'Dealer Websites managed by Digital Sales department',
  'FullDescription': None,
  'Name': 'Dealer Digital Program'},
 {'Code': '09',
  'Description': 'Dealer Participates in the Express Service Program',
  'FullDescription': None,
  'IconUrl': 'https://location.services.honda.com/images/icons/A_09_icon.gif',
  'Name': 'Express Service Program'},
 {'Code': '08',
  'Description': '',
  'FullDescription': None,
  'Name': 'Honda EItems Certified'},
 {

So we've got a list of dictionaries. And here's what we want to do:

Loop through all the dictionaries, grabbing the name. As we loop, we want to store these in a `list` (sound familiar?). Then, we want to make sure the `list` only contains unique values. This is exactly what a [`set`](https://www.w3schools.com/python/python_sets.asp) is in Python!

Once we've got that list, the `mbtools` module has a nice little `list_to_csv` function that takes a list and turns it into a comma-separated string. Neat!

One step at a time now, let's start with a list comprehension to get the list of attribute names together. 

In [14]:
attrs = [attr['Name'] for attr in res['Attributes']]
pprint.pprint(su.show_obj_head(obj=attrs, n_items=len(attrs)))

The object passed in is a list with 12 items!

['AHFC Certified Dealer',
 'Certified Used Dealer',
 'CVA Enhanced Purchase Program',
 'Dealer Digital Program',
 'Express Service Program',
 'Honda EItems Certified',
 'Honda Test Drive Enabled',
 'Internet Certified',
 'Oil Certified',
 'PAX Certified Service',
 'Retail and Service Dealer',
 'Tire Certified']


Let's wrap this in a `set()` to turn it into a `set` object. This will get rid of any duplicates, although it doesn't look like there are any. 

In [15]:
pprint.pprint(set(attrs))

{'AHFC Certified Dealer',
 'CVA Enhanced Purchase Program',
 'Certified Used Dealer',
 'Dealer Digital Program',
 'Express Service Program',
 'Honda EItems Certified',
 'Honda Test Drive Enabled',
 'Internet Certified',
 'Oil Certified',
 'PAX Certified Service',
 'Retail and Service Dealer',
 'Tire Certified'}


From here, we can wrap this in a `list()` to turn the set back into a `list` object. That might look redundant, but a quick inspection of the `mbtools.list_to_csv()` function shows it won't work for a `set`. Why? Because a `set` is <i>unordered</i>, so it doesn't support indexing. You can't do something like:

```Python
myset = {1, 2, 3, 4, 5} # Sets use curly brackets
myset[0] # This would break!

mylist = [1, 2, 3, 4, 5] # Lists use square brackets
mylist[0] # This works!
```

So, we can either convert this back into a `list` by wrapping the whole thing in a `list()`, just like we did with the `set()`, or we can write our own function that accepts any <i>iterable</i> (which a `set`, of course, is). If you want to learn more about data structures in the Python Standard Library, you can check that out at any of these links:

* [hackr.io link with some nice examples](https://hackr.io/blog/python-data-structures)
* [Python Standard Library docs: check under Built-in Types](https://docs.python.org/3/library/)
* [Differences between iterators and iterables in Python](https://stackoverflow.com/questions/9884132/what-exactly-are-iterator-iterable-and-iteration)

Let's try it both ways.

In [16]:
# Method one
foo = mbtools.list_to_csv(list(set([attr['Name'] for attr in res['Attributes']])))
print(foo)

AHFC Certified Dealer, Dealer Digital Program, CVA Enhanced Purchase Program, Internet Certified, PAX Certified Service, Certified Used Dealer, Honda Test Drive Enabled, Oil Certified, Retail and Service Dealer, Tire Certified, Honda EItems Certified, Express Service Program


In [17]:
# Method two - make our own function and use it
def iterable_to_csv(iterable):
    
    csv_str = ''
    
    # Makes the items all strings, strips them of leading/trailing spaces, replaces newlines, 
    # returns and tabs with spaces
    cleaned_iterable = [str(item)\
                        .strip()\
                        .replace('\n', ' ')\
                        .replace('\r', ' ')\
                        .replace('\t', ' ') \
                        for item in iterable]
    
    
    for item in cleaned_iterable:
        csv_str += item + ', ' # Append a comma and space so it's comma-separated
    
    csv_str = csv_str[:-2] # Get rid of the last comma and space
    
    return csv_str

bar = iterable_to_csv(set([attr['Name'] for attr in res['Attributes']]))
print(bar)

AHFC Certified Dealer, Dealer Digital Program, CVA Enhanced Purchase Program, Internet Certified, PAX Certified Service, Certified Used Dealer, Honda Test Drive Enabled, Oil Certified, Retail and Service Dealer, Tire Certified, Honda EItems Certified, Express Service Program


I'll make this function available in the `scraping_utils` module I'm adding to as I go through these tutorials!

In [18]:
su.equivalence_checker(obj_one=foo, obj_two=bar, expected_value=True)

Josh knows a lot about Python - the actual matches expected!
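One caveat with both methods: because a `set` has no guaranteed order, the attribute names can come out in a different order from one run to the next. If you'd like the output to be reproducible, sorting before joining is one option. A minimal sketch (`iterable_to_sorted_csv` is a hypothetical variant, not something in `mbtools` or `scraping_utils`):

```python
def iterable_to_sorted_csv(iterable):
    """Join the unique items of any iterable into a comma-separated string, in sorted order."""
    # A set comprehension removes duplicates; str() and strip() tidy each item
    unique_items = {str(item).strip() for item in iterable}

    # sorted() returns a list, and str.join puts the separator only between items
    return ', '.join(sorted(unique_items))


names = ['Tire Certified', 'Oil Certified', 'Tire Certified', 'AHFC Certified Dealer']
print(iterable_to_sorted_csv(names))
# AHFC Certified Dealer, Oil Certified, Tire Certified
```

Using `', '.join(...)` also removes the need to trim the trailing comma and space afterwards.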


Our finishing move, like usual, is to iterate through the unique results, pull the information we want into a `dict`, append to a `locations` `list` and then write this `locations` `list` out to a file!

In [19]:
# Final list which will have the data we want to write out
locations = []

for res in unique_results:
    
    d = {'brand': 'Honda', 
         'name': res['Name'], 
         'address': res['Address'], 
         'city': res['City'], 
         'state': res['State'],
         'zip': res['ZipCode'], 
         'phone': res['Phone'], 
         'url': res['WebAddress'], 
         'lat': res['Latitude'],
         'lon': res['Longitude'], 
         'other': iterable_to_csv(set([attr['Name'] for attr in res['Attributes']])),
         'site_id': res['DealerNumber']
        }
    locations.append(d)

my_keys = 'brand,name,address,city,state,zip,phone,url,lat,lon,other,site_id'
mbtools.make_tsv(locations,'../data/output/honda_us',keys=my_keys)
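
If you don't have `mbtools` handy, the standard library's `csv` module can write tab-separated files too. This is a sketch under my own assumptions, not a description of what `mbtools.make_tsv` does internally. The `write_tsv` name and the sample dealer are made up for illustration:

```python
import csv
import io


def write_tsv(rows, fieldnames, handle):
    """Write a list of dicts to an open file handle as tab-separated values."""
    writer = csv.DictWriter(
        handle,
        fieldnames=fieldnames,
        delimiter='\t',
        extrasaction='ignore',   # Silently drop keys that aren't in fieldnames
        lineterminator='\n',
    )
    writer.writeheader()
    writer.writerows(rows)


# Made-up record, written to an in-memory buffer instead of a real file
sample = [{'brand': 'Honda', 'name': 'Example Honda', 'city': 'Lake Orion', 'zip': '48371'}]
buffer = io.StringIO()
write_tsv(sample, ['brand', 'name', 'zip'], buffer)
print(buffer.getvalue())
```

To write to disk, you'd pass in a handle from `open(path, 'w', newline='', encoding='utf-8')` and use `my_keys.split(',')` as the field names.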