# 2. Scraping CNET Reviews

The CNET scraper crawls the CNET website for expert reviews.

## 2.1 Prerequisites

### 2.1.1 Setup Virtual Environment

- TODO: Explain how to set up a new virtual environment if necessary.


### 2.1.2 Download Project

- TODO: Explain how project is downloaded from Git


### 2.1.3 Install Project Dependencies

- TODO: Explain how to install project dependencies using pip.


### 2.1.4 Activate Virtual Environment

- TODO: Explain how to activate the virtual environment for the project



## 2.2 Running the CNET Scraper

First, specify the project directory for the scraper you want to launch. In this case, it is for the [CNET website](http://www.cnet.com).

The project directory normally exists within the ReviewAnalyzer source files. It will contain the file `scrapy.cfg`, and the directory name will be descriptive enough to relate to the project.

For the CNET crawler, the project directory will be similar to `/review-analyser-ea/scrapers/cnet`. We assign this path to the `PROJECT_PATH` variable.

In [1]:
PROJECT_PATH = '/Users/mik/projects/review-analyser-ea/scrapers/cnet'

To start the scraping process, the crawler needs to be launched from the project directory. Therefore, we change the current working directory to the value in `PROJECT_PATH`.

In [2]:
import os
os.chdir(PROJECT_PATH)
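
Because the project directory is expected to contain `scrapy.cfg`, it can be worth checking for that file before changing into the directory. The following is a minimal sketch of such a check; the helper names `is_scrapy_project` and `enter_project` are hypothetical and not part of the ReviewAnalyzer sources.

```python
import os


def is_scrapy_project(path):
    """Return True if `path` contains a scrapy.cfg file (hypothetical helper)."""
    return os.path.isfile(os.path.join(path, 'scrapy.cfg'))


def enter_project(path):
    """Change into `path`, refusing if it does not look like a Scrapy project."""
    if not is_scrapy_project(path):
        raise ValueError('No scrapy.cfg found in %r' % path)
    os.chdir(path)
    return os.getcwd()
```

Calling `enter_project(PROJECT_PATH)` in place of the plain `os.chdir` call would raise an error early if the path is mistyped.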

## 2.3 Specifying Product URLs to Crawl

The input to the CNET crawler is a file containing a list of URLs that specifies which product reviews to download.

The crawler is started through `Scrapy` with the necessary parameters. The parameter `-a urls=cnet/spiders/urls.txt` tells the crawler which URLs to scrape from the [CNET website](http://www.cnet.com).

The file `urls.txt` should contain a list of valid URLs on the [CNET website](http://www.cnet.com). Each URL links to a product page (e.g. `http://www.cnet.com/products/samsung-galaxy-s5/`).

The crawler uses this file to download all the reviews for every URL listed in `urls.txt`.
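
As an illustration, the URL file could be generated and sanity-checked from Python. This sketch rests on two assumptions that the text above does not confirm: that the file holds one URL per line, and that product pages always live under the `/products/` path (inferred from the single example URL). The helper names are hypothetical.

```python
from urllib.parse import urlparse


def is_cnet_product_url(url):
    """Heuristic check (assumption): CNET host and a path under /products/."""
    parts = urlparse(url.strip())
    return (parts.scheme in ('http', 'https')
            and parts.netloc.lower() in ('cnet.com', 'www.cnet.com')
            and parts.path.startswith('/products/')
            and len(parts.path) > len('/products/'))


def write_urls_file(urls, path):
    """Write the accepted URLs to `path`, one per line; return the rejected ones."""
    valid = [u.strip() for u in urls if is_cnet_product_url(u)]
    rejected = [u for u in urls if not is_cnet_product_url(u)]
    with open(path, 'w') as f:
        f.writelines(u + '\n' for u in valid)
    return rejected
```

For example, `write_urls_file(['http://www.cnet.com/products/samsung-galaxy-s5/'], 'urls.txt')` would write a single line and return an empty list.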

To launch the CNET crawler, run:

In [3]:
# !cd /Users/mik/projects/review-analyser-ea/scrapers/cnet
!scrapy crawl productreviewspider -o output/results.json

[2015-08-27 16:41:11,819] {productreviewspider:ProductReviewSpider:28} INFO - CWD: /Users/mik/projects/review-analyser-ea/scrapers/cnet
[2015-08-27 16:41:11,819] {productreviewspider:ProductReviewSpider:29} INFO - URLS_FILE: /Users/mik/projects/review-analyser-ea/scrapers/cnet/urls.txt
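
The same launch can be scripted outside a notebook. The sketch below only assembles the command line from the pieces described in this section (the spider name, the `-a urls=...` spider argument, and the `-o` output option); `build_crawl_command` and `run_crawl` are hypothetical helpers, and `run_crawl` assumes Scrapy is installed and on the `PATH`.

```python
import subprocess


def build_crawl_command(spider, urls_file=None, output=None):
    """Assemble a `scrapy crawl` command as an argument list."""
    cmd = ['scrapy', 'crawl', spider]
    if urls_file:
        cmd += ['-a', 'urls=%s' % urls_file]  # spider argument described above
    if output:
        cmd += ['-o', output]  # write scraped items to this file
    return cmd


def run_crawl(cmd, project_path):
    """Run the crawl from the project directory (requires Scrapy)."""
    return subprocess.check_call(cmd, cwd=project_path)


cmd = build_crawl_command('productreviewspider',
                          urls_file='cnet/spiders/urls.txt',
                          output='output/results.json')
print(' '.join(cmd))
# scrapy crawl productreviewspider -a urls=cnet/spiders/urls.txt -o output/results.json
```

Passing `cwd=project_path` to `subprocess.check_call` has the same effect as changing into the project directory first.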


## 2.4 Accessing the Downloaded Reviews

By default, the CNET crawler sends all downloaded product reviews to a central MongoDB installation. However, if you specified the `-o output/results.json` option when launching the crawler, you can also access the downloaded reviews in [JSON](https://en.wikipedia.org/wiki/JSON) format.

The downloaded reviews in JSON format reside in the `output/` directory of the project.


To view the downloaded reviews via terminal, use:

```bash
less /path/to/project/output/results.json
```

or

```bash
cat /path/to/project/output/results.json
```

To load the reviews programmatically, use:

```python
from pprint import pprint as pp
import simplejson as json

with open(PROJECT_PATH + '/output/results.json') as f:
    pp(json.load(f)[0]) # load and print the first review from the file.
```
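
Once loaded, each review is a plain dictionary whose values are strings. The following sketch converts a few of them to native Python types. It assumes the field names `product_name`, `editors_rating`, and `date_reviewed` and the timestamp layout seen in the sample record in section 2.4.1; it uses the standard-library `json` module rather than `simplejson`, and the helper names are hypothetical.

```python
import json
from datetime import datetime


def summarize_review(review):
    """Pick out a few fields and convert them to native Python types."""
    return {
        'product_name': review['product_name'],
        'rating': float(review['editors_rating']),
        'date_reviewed': datetime.strptime(review['date_reviewed'],
                                           '%Y-%m-%dT%H:%M:%S%z'),
    }


def load_summaries(path):
    """Load a results file and summarize every review in it."""
    with open(path) as f:
        return [summarize_review(r) for r in json.load(f)]


# Values copied from the sample record in section 2.4.1.
sample = {
    'product_name': 'Samsung Galaxy S5 - Charcoal Black',
    'editors_rating': '4.5',
    'date_reviewed': '2014-04-07T21:01:00-0700',
}
summary = summarize_review(sample)
print(summary['rating'], summary['date_reviewed'].year)  # 4.5 2014
```

Records that lack one of these fields would raise a `KeyError`, so a real pipeline may want to handle missing values explicitly.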

### 2.4.1 Examples

The following examples demonstrate how the downloaded reviews can be accessed from a Jupyter notebook.

In [4]:
from pprint import pprint as pp
import simplejson as json

with open(PROJECT_PATH + '/output/results.json') as f:
    pp(json.load(f)[0]) # load and print the first review from the file.

{'batch_id': '20150827164111',
 'cons': 'The Galaxy S5 is a only small upgrade over the Galaxy S4. The fingerprint scanner can be confusing to use, and the heart-rate monitor is a niche feature at best. In some regions, the Galaxy S5 costs significantly more than rival top-rated handsets.',
 'date_reviewed': '2014-04-07T21:01:00-0700',
 'date_updated': '2014-10-13T18:47:00-0700',
 'editors_rating': '4.5',
 'product_id': 'Samsung Galaxy S5 - Charcoal Black',
 'product_manufacturer': 'Samsung',
 'product_name': 'Samsung Galaxy S5 - Charcoal Black',
 'product_sku': 'SM-G900PZKASPR',
 'product_url': 'http://www.cnet.com/products/samsung-galaxy-s5/',
 'pros': "Samsung's Galaxy S5 excels at everything that matters -- Android 4.4 KitKat OS; a bright, beautiful display; blistering quad-core processor; and an excellent camera experience. In addition, Samsung's efforts to streamline its own custom interface and reduce pre-installed bloatware pay off.",
 'review_text': u'Here\'s why the Samsung G

In [7]:
!cat {PROJECT_PATH + '/output/results.json'}

[{"editors_rating": "4.5", "cons": "The Galaxy S5 is a only small upgrade over the Galaxy S4. The fingerprint scanner can be confusing to use, and the heart-rate monitor is a niche feature at best. In some regions, the Galaxy S5 costs significantly more than rival top-rated handsets.", "user_id": "Jessica Dolcourt", "product_id": "Samsung Galaxy S5 - Charcoal Black", "product_sku": "SM-G900PZKASPR", "date_updated": "2014-10-13T18:47:00-0700", "product_name": "Samsung Galaxy S5 - Charcoal Black", "pros": "Samsung's Galaxy S5 excels at everything that matters -- Android 4.4 KitKat OS; a bright, beautiful display; blistering quad-core processor; and an excellent camera experience. In addition, Samsung's efforts to streamline its own custom interface and reduce pre-installed bloatware pay off.", "review_text": "Here's why the Samsung Galaxy S5 should grab your attention: it looks good, it performs very well, and it has everything you need to become a fixture in nearly every aspect of your 

In [None]:
!less {PROJECT_PATH + '/output/results.json'}

[{"editors_rating": "4.5", "cons": "The Galaxy S5 is a only small upgrade over the Galaxy S4. The fingerprint scanner can be confusing to use, and the heart-rate monitor is a niche feature at best. In some regions, the Galaxy S5 costs significantly more than rival top-rated handsets.", "user_id": "Jessica Dolcourt", "product_id": "Samsung Galaxy S5 - Charcoal Black", "product_sku": "SM-G900PZKASPR", "date_updated": "2014-10-13T18:47:00-0700", "product_name": "Samsung Galaxy S5 - Charcoal Black", "pros": "Samsung's Galaxy S5 excels at everything that matters -- Android 4.4 KitKat OS; a bright, beautiful display; blistering quad-core processor; and an excellent camera experience. In addition, Samsung's efforts to streamline its own custom interface and reduce pre-installed bloatware pay off.", "review_text": "Here's why the Samsung Galaxy S5 should grab your attention: it looks good, it performs very well, and it has everything you need to become a fixture in nearly every 