### This workshop is divided into 4 main parts:
* Python for Data Science basics
* Natural Language Processing and Sentiment Analysis 
* Web Scraping with Sentiment Analysis 
* Website and chatbot integration
    
### Goals for part 3:
* Build a function that scrapes Amazon products from keywords
* Build a function that scrapes Tweets from Twitter
* Build a function that identifies the rate of positive/negative tweets for a certain topic
* Display the tweet coordinates on a world map

### Introduction

* What is web scraping?

![](images/web_scrap.jpg)

#### RUN THE AMAZON SCRAPER TO GET PRODUCTS

* In a terminal, run this command:

  `python scrape_amazon.py -k xbox`


In [12]:
# IMPORT REQUIRED LIBRARIES
import json
import pandas as pd
import requests
from lxml import html
import numpy as np
import utils as ut 

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
           '(KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}

In [6]:
# FETCH AN AMAZON PRODUCT PAGE AND PARSE ITS HTML
amazon_url = 'https://www.amazon.ca/gp/product/1593275994'
page = requests.get(amazon_url, headers=headers).text
parser = html.fromstring(page)

In [7]:
# GET THE PRODUCT TITLE FROM THE HTML
xpath_title = '//h1//span[@id="productTitle"]//text()'
title = parser.xpath(xpath_title)
print(title)

['Automate the Boring Stuff with Python: Practical Programming for Total Beginners']


In [8]:
# GET THE NUMBER OF REVIEWS

['13 customer reviews']
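
The cell above shows only its output. As a minimal sketch of how such a review count could be pulled out with the same `lxml` approach, the example below parses a small hand-written HTML snippet instead of a live page. The snippet and the element id `acrCustomerReviewText` are assumptions for illustration; the real page markup may differ and changes over time.

```python
from lxml import html

# Hypothetical, simplified stand-in for a product page (not real Amazon markup).
SAMPLE_PAGE = """
<html><body>
  <h1><span id="productTitle">Automate the Boring Stuff with Python</span></h1>
  <span id="acrCustomerReviewText">13 customer reviews</span>
</body></html>
"""


def get_review_count_text(page_source):
    """Return the non-empty text nodes of the review-count element."""
    parser = html.fromstring(page_source)
    # Assumed element id; adjust it to whatever the live page actually uses.
    xpath_reviews = '//span[@id="acrCustomerReviewText"]//text()'
    return [t.strip() for t in parser.xpath(xpath_reviews) if t.strip()]


review_text = get_review_count_text(SAMPLE_PAGE)
print(review_text)

# Turn the leading number of the text into an integer.
n_reviews = int(review_text[0].split()[0])
print(n_reviews)
```

On a live page, `SAMPLE_PAGE` would be replaced by the text returned from `requests.get(...)`.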


In [9]:
# LOAD THE FILE GENERATED BY THE SCRAPER
with open('datasets/amazon_xbox.json') as json_data:
    adata = json.load(json_data)

adata = pd.DataFrame(adata)
adata.head()

# SORT BY PRICE AND RATING

|   | avg rating | n_reviews | name | price | url |
|---|---|---|---|---|---|
| 0 | 4.38 | 161 | XIBERIA 3.5mm Surround Sound Gaming Headset No... | 0.0 | http://www.amazon.com/dp/B01MG03BNT |
| 1 | 3.84 | 61 | Microsoft Xbox One 1TB Console | 284.89 | http://www.amazon.com/dp/B00KL3WBBC |
| 2 | 4.19 | 543 | Xbox One S 500GB Console - Minecraft Bundle | 209.0 | http://www.amazon.com/dp/B01L1Y0RZQ |
| 3 | 4.45 | 7196 | Xbox $10 Gift Card - Digital Code | 10.0 | http://www.amazon.com/dp/B00F4CEHNK |
| 4 | 4.31 | 398 | Xbox 360 4GB System Console with Peggle 2 Bundle | 164.0 | http://www.amazon.com/dp/B00OEA4ADU |
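
One way the sorting step mentioned in the cell above could look, as a sketch: the five rows shown in the output are re-entered by hand so the example runs without the JSON file. Treating a price of `0.0` as "price missing" is an assumption made here, not something the dataset documents.

```python
import pandas as pd

# The five rows from the output above, typed in by hand for a self-contained example.
products = pd.DataFrame({
    "name": [
        "XIBERIA 3.5mm Surround Sound Gaming Headset No...",
        "Microsoft Xbox One 1TB Console",
        "Xbox One S 500GB Console - Minecraft Bundle",
        "Xbox $10 Gift Card - Digital Code",
        "Xbox 360 4GB System Console with Peggle 2 Bundle",
    ],
    "avg rating": [4.38, 3.84, 4.19, 4.45, 4.31],
    "n_reviews": [161, 61, 543, 7196, 398],
    "price": [0.0, 284.89, 209.0, 10.0, 164.0],
})

# Assumption: a price of 0.0 means the scraper found no price, so drop those rows.
priced = products[products["price"] > 0]

# Best-rated first; ties on rating are broken by the lower price.
by_rating = priced.sort_values(["avg rating", "price"], ascending=[False, True])
print(by_rating[["name", "avg rating", "price"]])

# Cheapest first.
by_price = priced.sort_values("price")
print(by_price[["name", "price"]])
```

With the real data, `products` would simply be the `adata` DataFrame loaded above.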


In [None]:
# DO THE SAME FOR PS4


#### SCRAPE TWITTER
![](images/twitters.jpg)

![](images/stuff.jpg)

* Get public opinion and sentiments on a topic
* Get trending food/fashion products
* Win prizes by automatically entering Twitter contests ([link](http://gizmodo.com/i-wrote-a-bot-that-won-twitter-contests-1722126436))

##### FOLLOW THE STEPS [HERE](https://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/) TO REGISTER TWITTER API KEYS

#### ADD YOUR KEYS AND RUN THE COMMAND:

`python scrape_twitter.py`


In [2]:
# PARSE THE SCRAPED TWEETS
with open('datasets/stream_starbucks_demo.json') as json_data:
    d = json.load(json_data)

dd = pd.DataFrame(d)
text_list = np.array(dd["text"])

sent_dict = {"positive": 0, "neutral": 0, "negative": 0}
sent_lists = {"positive": [], "neutral": [], "negative": []}
for text in text_list:
    sentiment = ut.get_sentiment(text)
    sent_dict[sentiment] += 1
    sent_lists[sentiment].append(text)
    print("text: %s - sentiment: %s" % (text, sentiment))
    
sent_lists["positive"]
sent_dict

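
The counts collected in `sent_dict` can be turned into the positive/negative rate named in the goals. The sketch below assumes a dictionary shaped like `sent_dict`; the function name `sentiment_rates` is hypothetical and the counts are made up for illustration, not results from the Starbucks dataset.

```python
def sentiment_rates(counts):
    """Convert a {label: count} dictionary into {label: share of total}."""
    total = sum(counts.values())
    if total == 0:
        # No tweets were classified, so every rate is zero.
        return {label: 0.0 for label in counts}
    return {label: n / total for label, n in counts.items()}


# Made-up counts for illustration only.
example_counts = {"positive": 6, "neutral": 3, "negative": 1}
rates = sentiment_rates(example_counts)

for label, rate in rates.items():
    print("%s: %.1f%%" % (label, 100 * rate))
```

Passing the real `sent_dict` instead of `example_counts` gives the rates for the scraped topic.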

In [20]:
# GET TWEET COORDINATES
with open('datasets/geo_data.json') as json_data:
    d = json.load(json_data)

coors = []
for tweet in d["features"]:
    # GeoJSON stores each point as [longitude, latitude]
    lon, lat = tweet["geometry"]["coordinates"]
    coors.append({"lat": lat, "lon": lon})

pd.DataFrame(coors).head()

|   | lat | lon |
|---|---|---|
| 0 | 52.038061 | -2.184229 |
| 1 | 52.160523 | -10.427177 |
| 2 | 53.367134 | -6.234815 |
| 3 | 52.314245 | -1.297611 |
| 4 | 52.29447 | -8.204491 |
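
GeoJSON stores a point as `[longitude, latitude]`, which is easy to mix up when preparing data for a map. The sketch below reads points in that order, checks that they fall in valid ranges, and computes a rough center for the map view. The helper names `feature_to_lat_lon` and `map_center` are hypothetical, the two sample features reuse coordinates from the output above, and the plain average is a simplification that ignores points on either side of the 180° meridian.

```python
def feature_to_lat_lon(feature):
    """Extract a {"lat", "lon"} dictionary from one GeoJSON point feature."""
    # GeoJSON order is [longitude, latitude].
    lon, lat = feature["geometry"]["coordinates"][:2]
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("coordinates out of range: lat=%s, lon=%s" % (lat, lon))
    return {"lat": lat, "lon": lon}


def map_center(points):
    """Return the (lat, lon) arithmetic mean of the points as a rough map center."""
    n = len(points)
    return (sum(p["lat"] for p in points) / n,
            sum(p["lon"] for p in points) / n)


# Hand-written sample in the same shape as the file loaded above.
sample = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [-2.184229, 52.038061]}},
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [-6.234815, 53.367134]}},
    ],
}

points = [feature_to_lat_lon(f) for f in sample["features"]]
center = map_center(points)
print(points)
print(center)
```

The range check would raise an error for any latitude whose magnitude exceeds 90, which catches some swapped pairs, but not pairs like the samples here where both values fit within the latitude range.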


#### See [this tutorial](https://marcobonzanini.com/2015/06/16/mining-twitter-data-with-python-and-js-part-7-geolocation-and-interactive-maps/) for how to plot tweet locations on a map

![](images/map.png)