## Python Dealership Scraping Tutorial Part 1: GAC Toyota Motor Co., Ltd

Hello! In this tutorial we will use Python to retrieve dealership data from the [GAC Toyota Motor Co., Ltd](http://www.gac-toyota.com/) website!

What you'll need for this part of the tutorial (and all other parts): 
* Python 3.6 or later (the code in this tutorial uses f-strings)
* Jupyter Notebooks installed
* A basic understanding of how websites retrieve and load content to users (or a willingness to learn!)
* Patience to experiment!

Because individual websites store, retrieve and load data (and by extension, content) to users in vastly different ways, this tutorial is not all-encompassing. What I mean is: While the rough methodology and experimental/scientific method outlined here will work for any site you come across, the code you will have to write to retrieve the data will be different. 

Depending on the site administration, you might sometimes get lucky. For instance, Chrysler, Dodge, Jeep and Ram all use the same back-end [REST API](https://restfulapi.net/) for their United States dealer services. The only difference is that the endpoint URL you make your requests to changes depending on whether you want Chrysler, Dodge, Jeep or Ram dealer results.

In most cases, though, you'll find you're repeating the same experimental steps on every site you want to scrape. There will be some similarities that can be generalized, bits of code that can be made into functions and so on, but you'll always want to start with a fresh Jupyter Notebook to get yourself started (in my opinion).

### A Skippable Tangent: Why Jupyter? 

You can skip this section to get right to the "good" stuff if you want. But I'd like to just say I think Jupyter Notebooks, for all their shortcomings as a proper IDE, are the best tool to use for jobs like this. 

The rapid experimentation and manipulation of objects (lists, dictionaries, etc.) that you can achieve in a Jupyter Notebook is unrivaled. The interactive nature of them provides a great environment to serve as your scratchpad from which you can copy and paste your "solution" into the IDE of your choice. 

There are some things an IDE like PyCharm will do for you: parameter previewing, underlining errors and proper debugging, just to name a few. But for just getting started, especially if this is your first time writing Python code, Jupyter is tough to beat.

### Loading Libraries

Okay! Let's load in our libraries. For this I'll be using:

* json : the built-in Python JSON parsing/outputting library
* requests : the best Python library for interacting with websites/APIs with [HTTP methods](https://www.w3schools.com/tags/ref_httpmethods.asp) (we'll use GET and POST exclusively)
* mbtools : a custom-made module that has a bunch of useful functions for web scraping

Unfortunately, the name of the author of mbtools has been lost to time. So I can't give them credit in this Notebook for their work. 

Notice how I'm importing mbtools from a child directory, utils, so the import is structured like:

```python
import utils.mbtools as mbtools
```

Keep that in mind as you're moving custom module and utility files around: The project structure is very important and you'll spend a ton of time fixing path issues, relative import issues, etc. if you're not careful.

In [1]:
import json
import requests
import utils.mbtools as mbtools
import pprint # For pretty printing JSON-like objects

### Let the Hacking Begin

<img src=images/let_the_hacking_begin.gif>

Okay - had to show off some cool Jupyter Notebook functionality there (image embeds just like a webpage!) with one of my [all-time favorite movies](https://www.rottentomatoes.com/m/the_social_network). But anyway, this is where the fun (hacking, but nobody says "hack" anymore) starts. 

Let's navigate to the GAC Toyota site and load up the dealer searching portal. At the time of this writing, [this](https://www.gac-toyota.com.cn/buy/shopping/dealer-search) is the URL.

Our goal, similar to Mark Zuckerberg in this scene, is to make requests to a server which returns us some data we want to use, store, manipulate, etc. If you've seen the movie, you'll know his goal was a little less professional than ours. In his case, he wants to return pictures of classmates from the Harvard dorm facebooks. In our case, we want to retrieve information about car dealerships in China. But still, the basic premise is the same: We want to take advantage of exposed API endpoints, JavaScript functions, hosted .data/.html/.json/.txt files, HTML structures, etc. to get what we want from a website. 

<i>Sidenote: I should probably pause here and tell you that if you want to be a good, upstanding Citizen of the Internet, I'd advise you not to maliciously gather data you're not authorized to see, use, etc. Always look up a site's [robots.txt](https://en.wikipedia.org/wiki/Robots_exclusion_standard) to see if they allow scraping on the site, and always adhere to any local, state or federal regulations while gathering data. APIs are generally more forgiving (and should specify usage and rate limits) whereas traditional scraping is less so. Use good judgement and always remember... <img src=images/with_great_power_comes_great_responsibility.gif></i>
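
If you'd rather check a robots.txt with code than by eye, here's a rough sketch using the standard library's `urllib.robotparser`. The robots.txt content and the example.com URLs below are made up so the example runs offline; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, for illustration only)
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(SAMPLE_ROBOTS_TXT)

# Ask whether a generic crawler ("*") may fetch each hypothetical URL
allowed_data = parser.can_fetch("*", "https://example.com/data/dealers.js")
allowed_private = parser.can_fetch("*", "https://example.com/private/report.html")
print(allowed_data, allowed_private)
```

Against a live site you would call `parser.set_url(...)` with the site's robots.txt address followed by `parser.read()` instead of parsing a string. Keep in mind robots.txt is only one signal; it doesn't replace reading a site's terms of use.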

### Learn By Doing

The best (and really only) way to figure out what avenues you're able to take to get the data you want is to play with the site and see how it responds to user inputs. You'll want to navigate to the page, open up the Inspect Console (`CTRL` + `SHIFT` + `I` on Windows) and navigate from the `Elements` tab (displays the page HTML) to the `Network` tab (displays the requests the page makes to servers for displaying content and data). 

If you refresh the page, you'll see all of the files the page has to load, all of the URLs it makes requests to and so on to actually load the content and data. Some of these are more useful than others; you'll typically find the requests you're interested in on the `XHR` or `JS` tabs of the Network menu if you're scraping data. 

Get comfortable playing with a webpage while having the Inspect Console open. You'll want to check out a few things as you play with the page. 

<b>Does the URL change as you interact with the page and make searches or give it inputs?</b>
If so, you're probably looking at a situation where you'll be parsing HTML using BeautifulSoup. 

<b>Are there any `XHR` or `JS` requests that have promising names?</b>
If so, you might be looking at a situation where you can make a request to one of those endpoint URLs instead of the page itself. 

Remember, someone had to create this website. So if they wanted to make their life easier, they'd probably follow some generally-accepted-programming principles like: 

* Name your files descriptively
* Name your functions descriptively
* Write code that doesn't suck

What I'm getting at here is: If the developer created a REST endpoint for retrieving dealership data, they probably wouldn't structure the endpoint name like: https://api.toyota.com/v1/vehicle_catalog. It's probably something more like: https://api.toyota.com/v1/dealers or something to that effect. Because the code that drives the site has to be maintained, you can usually count on it making some sense to follow. Unless the developer was disgruntled and named all the files different kinds of beer. Then you might be in some trouble. 
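
Once you've got a pile of request URLs from the `Network` tab, you can even let Python do the squinting for you. This is just a sketch: the URLs are hypothetical example.com addresses and the keyword list is my own guess at what "promising" looks like for a dealer search.

```python
from urllib.parse import urlsplit

# Keywords that hint at dealer/location data (an assumption; tune per site)
KEYWORDS = ("dealer", "city", "province", "store", "location")

def promising(urls, keywords=KEYWORDS):
    """Return the URLs whose file name contains one of the keywords."""
    hits = []
    for url in urls:
        # Take the last path segment, e.g. 'dealerData.js'
        file_name = urlsplit(url).path.rsplit("/", 1)[-1].lower()
        if any(keyword in file_name for keyword in keywords):
            hits.append(url)
    return hits

# Hypothetical request URLs copied out of a Network tab
request_urls = [
    "https://example.com/js/vendor/jquery.min.js",
    "https://example.com/js/data/dealerData.js",
    "https://example.com/api/v1/vehicle_catalog",
    "https://example.com/js/data/cityList.json",
]

hits = promising(request_urls)
print(hits)
```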

Anyway...

These are the JavaScript Network requests the Toyota page makes to load a given province and city's dealership data: 
<img src=images/javascript_requests.png align="center">
<br>
<br>

See any that look promising? This group right here might prove useful, especially one labeled `dealerData.js`. 
<img src=images/useful_javascript_requests.png align="center">
<br>
<br>

You can click on a request to see more details about it, like so:
<img src=images/dealer_data_js.png align="center">
<br>
<br>

This opens up another submenu. Here we can see the URL it's making a request to, https://www.gac-toyota.com.cn/js/newprovincecitydealer/data/dealerData.js, any headers (parameters or settings) it's sending with the request, and the HTTP method for the request itself (GET). 

You can also see the server's response to this request by clicking on the `Response` tab of this new submenu which has appeared. In this case, it shows us the data the GET request is returning.
<img src=images/dealer_data_response.png align="center">
<br>
<br>

Oooooh! A quick visual inspection of this response reveals it is over 10,000 lines in length and contains all of the data this page could ever want to load for a given province and city combination. I think we've found our golden ticket!

### What Is This Page?

Just in case you're confused as to what's going on here behind the scenes. 

Toyota is literally hosting a .js file on their site's back-end which serves as an ad hoc database, containing all of their current China dealership data in a JSON structure. Secure? Not even a little bit. Convenient for us? Absolutely. 

This design pattern of a site not using an API, but rather running JavaScript against an exposed file that contains all of the data the page could want to load is very primitive. The security around the data and the ease with which you can access all of it is not something I've seen on any major sites from US-based companies. Keep that in mind as you remember elements from this tutorial for later; the only easier way to grab the data from a site like this is for the site owner to just send us a .json or .csv of the data themselves. 

So, let's make a request to this page in Python and see what it returns.

In [2]:
url = 'https://www.gac-toyota.com.cn/js/newprovincecitydealer/data/dealerData.js'
response = requests.get(url=url)
response

<Response [200]>

Sweet, it's a [200](https://www.restapitutorial.com/httpstatuscodes.html) response, so we're good to keep going! Let's look at the text of this thing. 
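
If you're not sure what a status code means, the standard library's `http.HTTPStatus` can tell you. Here's a small sketch; `describe_status` is a helper name I made up for this example.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short, readable description of an HTTP status code."""
    status = HTTPStatus(code)
    # Codes from 200 through 299 mean the request succeeded
    kind = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {status.phrase} ({kind})"

print(describe_status(200))
print(describe_status(404))
```

With a `requests` response object you can also check `response.ok`, or call `response.raise_for_status()` to raise an exception on a 4xx or 5xx code instead of reading the status by eye.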

<i> Sidenote: I'm going to define a function that accepts a response object (or a `list` or `dict`) and prints out a preview of it, the first 500 characters of the text in the case of a response, so we can call it again and again. </i>

In [3]:
def show_obj_head(obj, n_chars=500, n_items=5):
    
    try:
        
        if isinstance(obj, requests.models.Response):
        
            print(f'The text of the response object returns a str with {len(obj.text)} characters!')
            print('')
            print(obj.text[:n_chars]) # Only prints the first n_chars characters (500 by default)
        
        elif isinstance(obj, list):
            print(f'The object passed in is a list with {len(obj)} items!')
            print('')
            pprint.pprint(obj[:n_items]) # Only prints the first n_items items (5 by default)
        
        elif isinstance(obj, dict):
            print(f'The object passed in is a dict with {len(obj.keys())} items!')
            print('')
            pprint.pprint(obj)  
    
    except Exception: # A bare except would also swallow things like KeyboardInterrupt
        print('Something went wrong... try another URL or try again.')

In [4]:
show_obj_head(obj=response, n_chars=500)

The text of the response object returns a str with 357952 characters!

var dealerJson =[
  {
    "DealerIndex": 1.0,
    "City": 57001001.0,
    "dealerid": "E02DF521-B9E2-451D-8857-BC46A52D0B44",
    "DealerCode": "2.30E+11",
    "DealerName": "广汽丰田大庆世腾高新区店",
    "DealerURL": "https://www.gac-toyota.com.cn/province/heilongjiang/daqing/dealer/dqstgxq",
    "EvaluateTotal": "服务评价(2003)",
    "ScoreTotal": 4.94,
    "ScoreN": "(4.9分)",
    "Tel": "0459-60391110459-6039222",
    "As_Tel": "",
    "Lable1": "技术专业(528)",
    "Lable2": "值得信任(313)",
    "La


Very cool! Let's see if we can use the `json.loads()` function to turn this `str` object into a Python `dict`!

In [5]:
json.loads(response.text)

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Anddddd it failed. And wow what a cryptic error that is. `JSONDecodeError`? `Expecting value: line 1 column 1 (char 0)`? What the heck? Let's turn to the Internet to figure this out.

### StackOverflow to the Rescue!

The first [StackOverflow question](https://stackoverflow.com/questions/16573332/jsondecodeerror-expecting-value-line-1-column-1-char-0) I found after [Googling](https://lmgtfy.com/?q=JSONDecodeError%3A+Expecting+value%3A+line+1+column+1+(char+0)) this error seems to offer some insight. 

Here's the second-most upvoted answer as of this writing: 
<img src=images/json_decode_error.png align="center">

Huh? The character in the first position (char 0) does not conform to JSON's structure. 

Let's review what a JSON object should look like then:

<img src="https://i2.wp.com/techeplanet.com/wp-content/uploads/2019/01/json-example.jpg?w=608&ssl=1" align="center">

And what do we have?

In [6]:
show_obj_head(obj=response, n_chars=500)

The text of the response object returns a str with 357952 characters!

var dealerJson =[
  {
    "DealerIndex": 1.0,
    "City": 57001001.0,
    "dealerid": "E02DF521-B9E2-451D-8857-BC46A52D0B44",
    "DealerCode": "2.30E+11",
    "DealerName": "广汽丰田大庆世腾高新区店",
    "DealerURL": "https://www.gac-toyota.com.cn/province/heilongjiang/daqing/dealer/dqstgxq",
    "EvaluateTotal": "服务评价(2003)",
    "ScoreTotal": 4.94,
    "ScoreN": "(4.9分)",
    "Tel": "0459-60391110459-6039222",
    "As_Tel": "",
    "Lable1": "技术专业(528)",
    "Lable2": "值得信任(313)",
    "La


See it yet? That `var dealerJson =` part of the `str` object is causing us problems. No matter, because we know how to deal with that right? 
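
To convince yourself the prefix really is the culprit, here's a tiny offline reproduction. The strings are made up for the example; they just mimic the shape of what the server sent back.

```python
import json

# A made-up string shaped like the response: a JavaScript assignment, not JSON
bad_text = 'var sample = [{"a": 1}]'

try:
    json.loads(bad_text)
except json.JSONDecodeError as err:
    # The parser chokes on the very first character, the 'v' in 'var'
    error_message = str(err)
    error_pos = err.pos
    print(error_message)

# The same payload without the JavaScript prefix parses fine
good = json.loads('[{"a": 1}]')
print(good)
```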

Let's use the `.replace()` method of a `str` to replace the `var dealerJson =` with just an empty string (''). 

That looks something like this:

In [7]:
dlr_results_str = response.text.replace('var dealerJson =', '')
dlr_results = json.loads(dlr_results_str)

No errors? Seems promising. Let's inspect `dlr_results` now.

In [8]:
show_obj_head(obj=dlr_results, n_items=2)

The object passed in is a list with 620 items!

[{'Address': '黑龙江省大庆市高新技术产业开发区安萨路32号',
  'As_Tel': '',
  'City': 57001001.0,
  'DealerCode': '2.30E+11',
  'DealerIndex': 1.0,
  'DealerName': '广汽丰田大庆世腾高新区店',
  'DealerURL': 'https://www.gac-toyota.com.cn/province/heilongjiang/daqing/dealer/dqstgxq',
  'EvaluateTotal': '服务评价(2003)',
  'Lable1': '技术专业(528)',
  'Lable2': '值得信任(313)',
  'Lable3': '细心周到(208)',
  'Latitude': 46.570219,
  'Longitude': 125.163992,
  'ScoreN': '(4.9分)',
  'ScoreTotal': 4.94,
  'Tel': '0459-60391110459-6039222',
  'dealerid': 'E02DF521-B9E2-451D-8857-BC46A52D0B44'},
 {'Address': '黑龙江省哈尔滨市香坊区学府路403号',
  'As_Tel': '',
  'City': 57001002.0,
  'DealerCode': '23A30',
  'DealerIndex': 1.0,
  'DealerName': '广汽丰田哈尔滨文华学府路店',
  'DealerURL': 'https://www.gac-toyota.com.cn/province/heilongjiang/haerbin/dealer/hebwhxfl',
  'EvaluateTotal': '服务评价(321)',
  'Lable1': '技术专业(79)',
  'Lable2': '值得信任(35)',
  'Lable3': '细心周到(21)',
  'Latitude': 45.661131,
  'Longitude': 126.63079,
  '

Very nice! Now we've got a `list` of `dicts`, a nice data structure for us to work with! 
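
One caveat on the prefix-stripping trick: `.replace()` only works if you type the prefix exactly as the file has it, spaces and all. Here's a slightly more forgiving sketch that cuts everything up to the first `=` and drops a trailing semicolon if there is one. `parse_js_assignment` is a name I made up, and it assumes the file is a single `var name = <JSON>` statement.

```python
import json

def parse_js_assignment(js_text):
    """Parse the JSON payload out of a 'var name = <JSON>' statement."""
    # Everything after the first '=' is assumed to be the JSON payload
    start = js_text.index("=") + 1
    payload = js_text[start:].strip().rstrip(";")
    return json.loads(payload)

# Made-up samples: one without a space after '=', one with a trailing ';'
sample_one = 'var dealerJson =[{"DealerIndex": 1.0}]'
sample_two = 'var otherJson = [{"value": "001"}];\n'

parsed_one = parse_js_assignment(sample_one)
parsed_two = parse_js_assignment(sample_two)
print(parsed_one, parsed_two)
```

If the file ever contained more than one statement, this would break and you'd need something smarter.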

One thing to note here which shows me that we're not done yet: I don't see any actual name for the city or province. And I think that might be important to keep. I see a numeric `City` code, but it's not anything we can use right now. 

Somehow, the page must be able to translate the Mandarin string containing the province and city and decode that into a numeric ID. So, it'd stand to reason there's some way to translate the combination of a city and province to an ID. We should be able to go the other way around, right? 

<i> Sidenote: If this `City` was a hash-value, probably not without knowing the hashing algorithm. But since it looks like a generic code and that we found this .js file on the site for anyone to use publicly in the first place, I'm thinking we're smarter than this! </i>

Let's go back to the page and see if any of those JS requests make any sense to investigate. Remember, there was the dealerData.js request that we just used. But there was a cityData.js request and a provinceData.js request too. Those might be the way to go. 
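
Since these requests showed up grouped together, a reasonable first guess is that the files live in the same directory. Here's a sketch that builds the sibling URLs with `urllib.parse.urljoin`. The same-directory layout is an assumption until you confirm it in the `Network` tab.

```python
from urllib.parse import urljoin

dealer_url = 'https://www.gac-toyota.com.cn/js/newprovincecitydealer/data/dealerData.js'

def sibling_url(base_url, file_name):
    """Swap the file name at the end of base_url for file_name."""
    return urljoin(base_url, file_name)

city_guess = sibling_url(dealer_url, 'cityData.js')
prov_guess = sibling_url(dealer_url, 'provinceData.js')
print(city_guess)
print(prov_guess)
```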

### Moving Right Along

Let's look at the `Response` on the city one first. And you know what? Let's just do it in Python. Grab the URL from the site and let's get cracking.

In [9]:
city_url = 'https://www.gac-toyota.com.cn/js/newprovincecitydealer/data/cityData.js'
response = requests.get(url=city_url)
response

<Response [200]>

[Looking good, Billy Ray!](https://www.youtube.com/watch?v=sCkiHxcBgXg)

In [10]:
show_obj_head(obj=response, n_chars=512)

The text of the response object returns a str with 23436 characters!

var cityJson = [
  {
    "value": "057001001",
    "name": "大庆市",
    "parent": "057001"
  },
  {
    "value": "057001002",
    "name": "哈尔滨市",
    "parent": "057001"
  },
  {
    "value": "057001003",
    "name": "佳木斯市",
    "parent": "057001"
  },
  {
    "value": "057001004",
    "name": "牡丹江市",
    "parent": "057001"
  },
  {
    "value": "057001005",
    "name": "齐齐哈尔市",
    "parent": "057001"
  },
  {
    "value": "057001006",
    "name": "绥化市",
    "parent": "057001"
  }


[Feeling good, Louis!](https://youtu.be/sCkiHxcBgXg?t=3) Those look like the same style of codes that were used in the earlier JSON we looked at. 

We'll need to do the same thing here on the cities as we did with the dealer JSON: that `var cityJson =` is going to give us fits.

In [11]:
city_results = json.loads(response.text.replace('var cityJson = ', ''))
show_obj_head(obj=city_results)
print('')
show_obj_head(obj=city_results[0])

The object passed in is a list with 285 items!

[{'name': '大庆市', 'parent': '057001', 'value': '057001001'},
 {'name': '哈尔滨市', 'parent': '057001', 'value': '057001002'},
 {'name': '佳木斯市', 'parent': '057001', 'value': '057001003'},
 {'name': '牡丹江市', 'parent': '057001', 'value': '057001004'},
 {'name': '齐齐哈尔市', 'parent': '057001', 'value': '057001005'}]

The object passed in is a dict with 3 items!

{'name': '大庆市', 'parent': '057001', 'value': '057001001'}


Cool, but now we need the provinces too. I have a hunch that's what the `parent` means on each one of these entries. 

So basically, the code is structured so that the first six digits belong to the province (parent) and the last three belong to the city (child). And I'm going to guess the last three digits (the city digits) are just a sequence number.

So for a given province, 555555, you will have cities 001, 002, 003 and so on. And for another province, 666666, you will have cities 001, 002 and 003. We have the city names now, but let's take it one further and get the provinces.

In [12]:
prov_url = 'https://www.gac-toyota.com.cn/js/newprovincecitydealer/data/provinceData.js'
response = requests.get(url=prov_url).text
prov_results = json.loads(response.replace('var provinceJson = ', ''))
show_obj_head(obj=prov_results)
print(' ')
show_obj_head(obj=prov_results[0])

The object passed in is a list with 31 items!

[{'name': '黑龙江省', 'value': '057001'},
 {'name': '吉林省', 'value': '057002'},
 {'name': '辽宁省', 'value': '057003'},
 {'name': '北京市', 'value': '058001'},
 {'name': '河北省', 'value': '058002'}]
 
The object passed in is a dict with 2 items!

{'name': '黑龙江省', 'value': '057001'}
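
We can sanity-check the hunch about the code structure with the two records we've just seen. This sketch assumes the province part is always the first six characters; `split_city_code` is a helper name I made up.

```python
# Sample records copied from the outputs above
city = {'name': '大庆市', 'parent': '057001', 'value': '057001001'}
prov = {'name': '黑龙江省', 'value': '057001'}

def split_city_code(code, prov_len=6):
    """Split a city code into (province part, city sequence part)."""
    return code[:prov_len], code[prov_len:]

prov_part, city_part = split_city_code(city['value'])
print(prov_part, city_part)

# The province part should match both the city's parent and the province's value
matches_parent = prov_part == city['parent']
matches_prov = prov_part == prov['value']
print(matches_parent, matches_prov)
```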


Okay, so it seems like we're really starting to make some moves here. Let's recap what we've got:

1. A `list` of `dicts` with all of the dealership data and a code which contains the province-city combination
2. A `list` of `dicts` with all of the city names, stored at a province and city level
3. A `list` of `dicts` with all of the province names, stored at the province level

What'd be great is if those last two were instead stored as dictionaries where the key is the ID and the value is the city or province name.

That way, to lookup a province name in the province "container" object, instead of having to loop through all of the dictionaries in that `prov_results` `list` to find the one where the 'value' is equal to the value we're interested in, we can just call that entry directly from the dictionary. 

It's the difference between these two pieces of code:

In [13]:
value_of_interest = '058004'
i = 0
while True:
    print(f'{i}\'th time entering the loop!')
    if prov_results[i]['value'] == value_of_interest:
        print(f'Hey! I found it on the {i}\'th time through the loop!')
        print(prov_results[i]['name'])
        break
    i += 1

0'th time entering the loop!
1'th time entering the loop!
2'th time entering the loop!
3'th time entering the loop!
4'th time entering the loop!
5'th time entering the loop!
6'th time entering the loop!
Hey! I found it on the 6'th time through the loop!
山西省


In [14]:
sample_dict = \
{'0': 'foo',
 '1': 'bar',
 '2': 'foobar',
 '058004': '山西省'}

sample_dict[value_of_interest]

'山西省'

Of course, I'm sure there's a better way to do this `list` search than the brute-force method. But the point still stands: This is what dictionaries are made for, so let's use them as intended. And the solution is a very elegant piece of code:

In [15]:
provs = {item['value']: item['name'] for item in prov_results}
show_obj_head(provs)

The object passed in is a dict with 31 items!

{'057001': '黑龙江省',
 '057002': '吉林省',
 '057003': '辽宁省',
 '058001': '北京市',
 '058002': '河北省',
 '058003': '内蒙古自治区',
 '058004': '山西省',
 '058005': '天津市',
 '059001': '广东省',
 '059002': '广西壮族自治区',
 '059003': '海南省',
 '060001': '福建省',
 '060003': '江西省',
 '061001': '甘肃省',
 '061002': '贵州省',
 '061003': '宁夏回族自治区',
 '061004': '青海省',
 '061005': '陕西省',
 '061006': '四川省',
 '061007': '西藏自治区',
 '061008': '新疆维吾尔自治区',
 '061009': '云南省',
 '061010': '重庆市',
 '062001': '安徽省',
 '062002': '河南省',
 '062003': '湖北省',
 '062004': '湖南省',
 '063001': '山东省',
 '064001': '江苏省',
 '064002': '上海市',
 '065001': '浙江省'}


This is a [dictionary comprehension](https://python-3-patterns-idioms-test.readthedocs.io/en/latest/Comprehensions.html) which is a very compact and elegant way to essentially do a for loop in a single line of code. 

The code I showed you above is logically equivalent to what you'll see below:

In [16]:
foo = {}
for item in prov_results:
    foo[item['value']] = item['name']

In [17]:
def equivalence_checker(obj_one, obj_two, expected_value):
    
    actual_value = obj_one == obj_two
    
    if expected_value == actual_value:
        print('Josh knows a lot about Python - the actual matches expected!')
    else:
        print('Josh is wrong - the actual is different than the expected!')
        
equivalence_checker(obj_one=provs, obj_two=foo, expected_value=True)

provs[value_of_interest] # Now we can just do this to find the string of a prov ID

Josh knows a lot about Python - the actual matches expected!


'山西省'

Very cool, now we can just look up the value of a province in this dictionary by its ID and return the string. No need to loop through a `list` and then check dictionaries. Neat!
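
For what it's worth, here's one tidier way to do the `list` search, next to the dictionary version, including what happens when the ID doesn't exist. The three sample records are copied from the province output; `'999999'` is a made-up ID that isn't in the data.

```python
# A small sample of province records, shaped like prov_results
sample_provs = [
    {'name': '黑龙江省', 'value': '057001'},
    {'name': '吉林省', 'value': '057002'},
    {'name': '山西省', 'value': '058004'},
]

def find_in_list(records, value):
    """Scan the list and return the first matching name, or None."""
    return next((rec['name'] for rec in records if rec['value'] == value), None)

# The dictionary version: build once, then look up by key
sample_lookup = {rec['value']: rec['name'] for rec in sample_provs}

found_by_scan = find_in_list(sample_provs, '058004')
found_by_key = sample_lookup.get('058004')
missing = sample_lookup.get('999999')  # .get() returns None instead of raising KeyError
print(found_by_scan, found_by_key, missing)
```

The scan still walks the list on every call, so the dictionary remains the better fit when you're doing lots of lookups.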


### Python Code on Fleek

Let's get even crazier here. Because the city ID <i>technically</i> contains both the province and city information, what we really want is a JSON structure that looks like:

In [18]:
sample_dict = \
{123456001: {'city_nm': 'City #1 name here', 'prov_nm': 'Province #1 name here'},
 123456002: {'city_nm': 'City #2 name here', 'prov_nm': 'Province #1 name here'},
 123456003: {'city_nm': 'City #3 name here', 'prov_nm': 'Province #1 name here'},
 987654001: {'city_nm': 'City #4 name here', 'prov_nm': 'Province #2 name here'},
 987654002: {'city_nm': 'City #5 name here', 'prov_nm': 'Province #2 name here'}
}


pprint.pprint(sample_dict)

{123456001: {'city_nm': 'City #1 name here',
             'prov_nm': 'Province #1 name here'},
 123456002: {'city_nm': 'City #2 name here',
             'prov_nm': 'Province #1 name here'},
 123456003: {'city_nm': 'City #3 name here',
             'prov_nm': 'Province #1 name here'},
 987654001: {'city_nm': 'City #4 name here',
             'prov_nm': 'Province #2 name here'},
 987654002: {'city_nm': 'City #5 name here',
             'prov_nm': 'Province #2 name here'}}


If this were a tabular structure, we'd want to do something like:

```SQL
-- Let's call this the city_provs view :)
SELECT c.value as key, c.name as city_nm, p.name as prov_nm
FROM city_results c
JOIN prov_results p
ON c.parent = p.value
```

By knowing the city ID on the original dealer record, we can immediately get back to the province that city is in too in this kind of a reference table. 

That'd look like:
```SQL
SELECT dlr.*, cp.*
FROM dlr_results dlr
JOIN city_provs cp
ON dlr.city = cp.key
```

Make sense? If you don't know SQL, probably not. But it makes sense in my head. 

So how do we get something like city_provs? How about another dictionary comprehension?

In [19]:
city_provs = {int(item['value']): 
              {'city_nm': item['name'], 
               'prov_nm': provs[item['parent']]
              } 
              for item in city_results
             }

show_obj_head(city_provs)

The object passed in is a dict with 285 items!

{6100105: {'city_nm': '天水市', 'prov_nm': '甘肃省'},
 6100106: {'city_nm': '平凉市', 'prov_nm': '甘肃省'},
 6100107: {'city_nm': '武威市', 'prov_nm': '甘肃省'},
 6100108: {'city_nm': '白银市', 'prov_nm': '甘肃省'},
 6100508: {'city_nm': '安康市', 'prov_nm': '陕西省'},
 6100809: {'city_nm': '哈密市', 'prov_nm': '新疆维吾尔自治区'},
 6200220: {'city_nm': '鹤壁市', 'prov_nm': '河南省'},
 57001001: {'city_nm': '大庆市', 'prov_nm': '黑龙江省'},
 57001002: {'city_nm': '哈尔滨市', 'prov_nm': '黑龙江省'},
 57001003: {'city_nm': '佳木斯市', 'prov_nm': '黑龙江省'},
 57001004: {'city_nm': '牡丹江市', 'prov_nm': '黑龙江省'},
 57001005: {'city_nm': '齐齐哈尔市', 'prov_nm': '黑龙江省'},
 57001006: {'city_nm': '绥化市', 'prov_nm': '黑龙江省'},
 57001008: {'city_nm': '双鸭山市', 'prov_nm': '黑龙江省'},
 57001009: {'city_nm': '七台河市', 'prov_nm': '黑龙江省'},
 57002001: {'city_nm': '吉林市', 'prov_nm': '吉林省'},
 57002002: {'city_nm': '松原市', 'prov_nm': '吉林省'},
 57002003: {'city_nm': '延吉市', 'prov_nm': '吉林省'},
 57002004: {'city_nm': '长春市', 'prov_nm': '吉林省'},
 5700200

Oooooh! Now, what would the equivalent non-comprehension code be to this?

In [20]:
city_provs = {int(item['value']): {'city_nm': item['name'], 'prov_nm': provs[item['parent']]} for item in city_results}

In [21]:
bar = {}
for item in city_results:
    city_nm = item['name']
    prov_nm = provs[item['parent']]
    
    bar[int(item['value'])] = {'city_nm': city_nm, 'prov_nm': prov_nm}

equivalence_checker(obj_one=city_provs, obj_two=bar, expected_value=True)

Josh knows a lot about Python - the actual matches expected!


So the idea is:
For each item in `city_results`, I want to take the integer of the item's "value". This becomes a key in an entry in my new dictionary. 

The value is actually a dictionary too! I want the item's name from the `city_results` dictionary to become the value to the `city_nm` key in this sub-dictionary. And I want the associated province's name to be the value to the `prov_nm` key in this sub-dictionary. 

Notice, too, how I'm grabbing the province's name. It's the city's parent which then gets piped into the `provs` dictionary. If you remember from earlier, this is why I wanted the `provs` dictionary to be at that level: So I could just pass it an ID value and get back the province name as a string!

And now, this new `city_provs` dictionary makes it incredibly simple to return the data we want. You take the `City` value from a record in `dlr_results` and you can immediately get back the city's string and the province's string from this new dictionary we've created. Beautiful!
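
Here's a small sketch of that lookup for a single dealer record. Notice the dealer's `City` is a float (`57001001.0`) while our keys are integers, so the sketch casts it with `int()`. The one-entry `sample_city_provs` and the `lookup_city` helper are stand-ins I made up for the example.

```python
# A one-entry stand-in for city_provs, copied from the output above
sample_city_provs = {57001001: {'city_nm': '大庆市', 'prov_nm': '黑龙江省'}}

def lookup_city(dealer, lookup):
    """Return (city name, province name) for a dealer, or (None, None)."""
    city_code = dealer.get('City')
    if city_code is None:
        # The record has no City key at all
        return None, None
    entry = lookup.get(int(city_code))
    if entry is None:
        # The code isn't in our lookup table
        return None, None
    return entry['city_nm'], entry['prov_nm']

known = lookup_city({'DealerName': '广汽丰田大庆世腾高新区店', 'City': 57001001.0}, sample_city_provs)
no_city = lookup_city({'DealerName': 'Hypothetical dealer'}, sample_city_provs)
unknown = lookup_city({'DealerName': 'Hypothetical dealer', 'City': 99999999.0}, sample_city_provs)
print(known, no_city, unknown)
```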

### Finishing Touches

How do we put this all together now? Simple! We've done a lot of the hard work, now all we need to do is loop through the `dlr_results` `list` of `dicts`, grab the elements we want, put them in another `list` for safe-keeping and then write out a file from that list!

In [22]:
locations = []
id_list = []
for res in dlr_results:
    
    # If we've seen this dealer before, let's skip him
    # If not, let's add his ID to a list so we can skip him next time we see him
    if res['dealerid'] in id_list:
        continue
    else:
        id_list.append(res['dealerid'])
    
    # If this dealer's City code isn't in our city_provs lookup, we can't resolve a city name
    # And we know it won't have a province either, so let's set them both to None
    # If the code is in the lookup, assign the city_nm and prov_nm
    if res['City'] not in city_provs.keys():
        city = None
        prov = None
    else:
        city = city_provs[int(res['City'])]['city_nm']
        prov = city_provs[int(res['City'])]['prov_nm']
    
    # Make a new dictionary from the pieces
    d = {'brand': 'GAC Toyota',
         'name': res['DealerName'],
         'address': res['Address'],
         'city': city,
         'province': prov,
         'phone': res['Tel'],
         'lat': res['Latitude'],
         'lon': res['Longitude'],
         'url': res['DealerURL']
         }

    locations.append(d)
    
show_obj_head(locations, n_items=2)

The object passed in is a list with 620 items!

[{'address': '黑龙江省大庆市高新技术产业开发区安萨路32号',
  'brand': 'GAC Toyota',
  'city': '大庆市',
  'lat': 46.570219,
  'lon': 125.163992,
  'name': '广汽丰田大庆世腾高新区店',
  'phone': '0459-60391110459-6039222',
  'province': '黑龙江省',
  'url': 'https://www.gac-toyota.com.cn/province/heilongjiang/daqing/dealer/dqstgxq'},
 {'address': '黑龙江省哈尔滨市香坊区学府路403号',
  'brand': 'GAC Toyota',
  'city': '哈尔滨市',
  'lat': 45.661131,
  'lon': 126.63079,
  'name': '广汽丰田哈尔滨文华学府路店',
  'phone': '0451-88889958',
  'province': '黑龙江省',
  'url': 'https://www.gac-toyota.com.cn/province/heilongjiang/haerbin/dealer/hebwhxfl'}]
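
A small aside on the duplicate check in that loop: testing membership in a `list` means scanning it every time, while a `set` does the same check in roughly constant time. With 620 dealers you won't notice, but here's a sketch of the set-based version. The records and the `dedupe_by_key` helper are made up for the example.

```python
def dedupe_by_key(records, key):
    """Keep the first record seen for each distinct value of key."""
    seen = set()
    unique = []
    for rec in records:
        if rec[key] in seen:
            continue  # Already saw this ID, skip the repeat
        seen.add(rec[key])
        unique.append(rec)
    return unique

# Hypothetical dealer records with one repeated ID
sample_dealers = [
    {'dealerid': 'A', 'DealerName': 'First'},
    {'dealerid': 'B', 'DealerName': 'Second'},
    {'dealerid': 'A', 'DealerName': 'First again'},
]

unique_dealers = dedupe_by_key(sample_dealers, 'dealerid')
print([rec['DealerName'] for rec in unique_dealers])
```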


In [23]:
mykeys='address,brand,city,lat,lon,name,phone,province,url'
mbtools.make_tsv(L1=locations, file_name='gac_toyota', keys=mykeys)
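
If you don't have `mbtools` handy, the standard library's `csv` module can write a tab-separated file too. This is only a sketch of what I assume `make_tsv` does (one header row, then one tab-separated row per dictionary); it is not the actual `mbtools` code, and `write_tsv` is a name I made up.

```python
import csv
import io

def write_tsv(rows, keys, out_file):
    """Write a list of dicts to out_file as tab-separated values."""
    fieldnames = keys.split(',')
    writer = csv.DictWriter(
        out_file,
        fieldnames=fieldnames,
        delimiter='\t',
        extrasaction='ignore',  # Skip any dict keys not listed in fieldnames
        lineterminator='\n',
    )
    writer.writeheader()
    writer.writerows(rows)

# A made-up record, written to an in-memory buffer so nothing touches disk
sample_rows = [{'brand': 'GAC Toyota', 'name': 'Sample Dealer', 'city': '大庆市'}]
buffer = io.StringIO()
write_tsv(sample_rows, 'brand,city,name', buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

To write a real file, pass in something like `open('gac_toyota.tsv', 'w', newline='', encoding='utf-8')` in place of the buffer; the explicit UTF-8 encoding matters here because of the Chinese characters.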