# Web Scraping
Web scraping is the automated extraction of data from websites by a computer program, which transforms that data into a structured form for further analysis.
## Example use case: Customer Review Analysis
Scraping data for a specific product available on Amazon and analyzing its customers' reviews.


## Required Libraries:
`requests`, `BeautifulSoup` (from the `bs4` package), and `pandas`

In [125]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Procedure:
### Step 1: Crawl
The first step is to request the target web page and download its HTML source.


In [126]:
# target website url
url = "https://www.amazon.in/Apple-MacBook-Chip-13-inch-256GB/dp/B08N5XSG8Z/ref=sr_1_1_sspa?crid=39DGATOSVC4MV&keywords=apple%2Blaptop&qid=1643427205&sprefix=apple%2Blapt%2Caps%2C1230&sr=8-1-spons&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUExS1lZUEQwVkczUVFJJmVuY3J5cHRlZElkPUEwNjU3MTIwMlFQNjlPNDZGNE1YMiZlbmNyeXB0ZWRBZElkPUEwNzA2NjQ2MzVaOFdZN0pKQ0hQMSZ3aWRnZXROYW1lPXNwX2F0ZiZhY3Rpb249Y2xpY2tSZWRpcmVjdCZkb05vdExvZ0NsaWNrPXRydWU&th=1"

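# Headers sent with every GET request to the target website
HEADERS = {'User-Agent': '(Windows NT 10.0; Win64; x64)', 'Accept-Language': 'en-US, en;q=0.5'}

def getdata(url):
    """Download the page at `url` and return its HTML source as text."""
    r = requests.get(url, headers=HEADERS, timeout=30)
    r.raise_for_status()  # fail on HTTP errors instead of parsing an error page
    return r.text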
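
Most of the query string in the target URL above is search and ad tracking data. As a minimal sketch, assuming the `/dp/<ASIN>` path segment (with a 10-character product ID) is enough to identify the product page, the URL could be shortened before crawling. The helper name `strip_tracking` is hypothetical, and the example URL is a shortened copy of the one above.

```python
import re
from urllib.parse import urlsplit


def strip_tracking(product_url):
    """Return scheme://host/dp/<ASIN>, or the URL unchanged if no ASIN is found."""
    parts = urlsplit(product_url)
    # Assumption: the product ID is 10 uppercase letters or digits after /dp/
    match = re.search(r"/dp/([A-Z0-9]{10})", parts.path)
    if match is None:
        return product_url
    return f"{parts.scheme}://{parts.netloc}/dp/{match.group(1)}"


example = ("https://www.amazon.in/Apple-MacBook-Chip-13-inch-256GB/dp/B08N5XSG8Z/"
           "ref=sr_1_1_sspa?crid=39DGATOSVC4MV&th=1")
short_url = strip_tracking(example)
print(short_url)  # https://www.amazon.in/dp/B08N5XSG8Z
```

A URL without a `/dp/` segment is returned unchanged, so the helper is safe to apply to any input.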

### Step 2: Parse and Transform
Once the source code of the target web page has been downloaded, it is parsed and then filtered for the content that is needed.
In this step, the downloaded source code is passed to an HTML parser, for which the `BeautifulSoup` library is required.
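
To illustrate parsing and filtering without a network request, here is a small sketch that runs `BeautifulSoup` on a hand-written HTML snippet. The snippet is hypothetical and far simpler than a real product page; only the `a-profile-name` class and the `poExpander` id are taken from the code in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only
sample_html = """
<div>
  <span class="a-profile-name">Prady</span>
  <p class="note">Best in class.</p>
  <span class="a-profile-name">tech geek</span>
  <p class="note">A must have.</p>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns every tag that matches the tag name and CSS class
sample_names = [tag.get_text(strip=True)
                for tag in sample_soup.find_all("span", class_="a-profile-name")]

# find returns only the first match, or None when nothing matches
first_note = sample_soup.find("p", class_="note")
missing = sample_soup.find("div", {"id": "poExpander"})

print(sample_names)                      # ['Prady', 'tech geek']
print(first_note.get_text(strip=True))   # Best in class.
print(missing)                           # None
```

Because `find` can return `None`, code that uses its result is safer with an explicit check before calling methods on it.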


In [127]:
# PARSING the source code
def html_code(url):
    htmldata = getdata(url)
    soup = BeautifulSoup(htmldata, 'html.parser')
    return soup

In [128]:
soup = html_code(url)

In [129]:
# FILTERING the soup (parsed data) to get the needed content: customer names
def cus_data(soup):
    # Each reviewer's display name sits in a <span class="a-profile-name">
    return [item.get_text() for item in soup.find_all("span", class_="a-profile-name")]

In [130]:
cus_res = cus_data(soup)
cus_res

['Prady',
 'tech geek',
 'Shiran Lone',
 'Nomadic',
 'RK',
 'ashish',
 'Dr Vignesh',
 'suryakala']

In [131]:
# FILTERING the soup (parsed data) to get the needed content: customer reviews
def cus_rev(soup):
    review_class = "a-expander-content reviewText review-text-content a-expander-partial-collapse-content"
    # Concatenate the text of every review block, then split it into lines
    data_str = "".join(item.get_text() for item in soup.find_all("div", class_=review_class))
    return data_str.split("\n")
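
The lines returned by `cus_rev` can carry stray whitespace, because `get_text()` preserves the spacing of the original markup. As a minimal sketch of a stricter clean-up than only dropping empty strings, the hypothetical helper `clean_lines` below collapses whitespace runs, trims each line, and discards lines that end up empty. The sample input is made up for illustration.

```python
import re


def clean_lines(lines):
    """Collapse runs of whitespace, trim each line, and drop empty results."""
    cleaned = []
    for line in lines:
        text = re.sub(r"\s+", " ", line).strip()
        if text:
            cleaned.append(text)
    return cleaned


sample_lines = ["  Best in class. ", "", "   ", "It's   slim.\t"]
cleaned_lines = clean_lines(sample_lines)
print(cleaned_lines)  # ['Best in class.', "It's slim."]
```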

In [132]:
rev_data = cus_rev(soup)
# Drop the empty strings left over from splitting on newlines
rev_result = [i for i in rev_data if i != ""]
rev_result

["  Best in class. Performance, Display, Battery backup are above excellent.A must have for every tech geekBought this after selling my Yamaha R15, but no regrets.... It's speed is better than R15. 😅",
 "  Pros:-1. It's Superfast. It will feel fast on everything - from bootup, to app opening, to builds etc.2. It's slim. Air has no fan hence it's form factor is even slimmer than Pro.3. It remains cold even during heavy code builds. It's hard to find things which makes it warm actually.4. I have tried several graphics heavy games and they run great without any heat as well5. If you are just browsing with Wifi on, typically it loses 10% battery in 7-8 hours. But it's for Safari browser. It has achieved I guess what people will call power-efficiency nirvana.6. Screen, Sound and Mic quality are awesomeCons:-1. Since it's winters in India now, some people might not like that it doesn't heat up the surroundings2. For longer workloads - like if you are doing daily large video compressions/conv

In [133]:
# FILTERING the soup (parsed data) to get the needed content: product information
def product_info(soup):
    pro_info = []
    expander = soup.find("div", {"id": "poExpander"})
    if expander is None:  # element missing, e.g. a different page layout
        return pro_info

    # Iterating over a tag yields its direct children
    for item in expander:
        pro_info.append(item.get_text().split("\n"))
    return pro_info

In [134]:
pro_data = product_info(soup)
# Flatten the nested lists and drop empty or single-space strings
pro_result = [i for item in pro_data for i in item if i not in ("", " ")]
pro_result

['     Model Name   MacBook Air     Brand   Apple     Specific Uses For Product   Multimedia     Screen Size   13 Inches     Operating System   MacOS 10.14 Mojave     Human Interface Input   Keyboard     CPU Manufacturer   Apple     Graphics Card Description   Integrated     Special Feature   Portable     Colour   Gold    ',
 'See more']
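
The product information above arrives as one long string. As a rough sketch, assuming the spacing seen in that output (a short gap between a label and its value, a wider gap between pairs), it could be split into a dictionary. The helper name `parse_spec_line` is hypothetical, and the sample string is a shortened copy of the output above; a different page layout may need different thresholds.

```python
import re


def parse_spec_line(text):
    """Turn 'Key   Value     Key   Value' into a dict.

    Assumes pairs are separated by 4 or more whitespace characters and a
    key is separated from its value by 2 or 3.
    """
    specs = {}
    for pair in re.split(r"\s{4,}", text.strip()):
        parts = re.split(r"\s{2,}", pair, maxsplit=1)
        if len(parts) == 2:  # skip fragments that are not key/value pairs
            specs[parts[0]] = parts[1]
    return specs


sample_specs = ("     Model Name   MacBook Air     Brand   Apple"
                "     Screen Size   13 Inches     Colour   Gold    ")
specs = parse_spec_line(sample_specs)
print(specs)
```

Fragments such as `'See more'` contain no label/value gap and are skipped by the length check.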

### Step 3: Store the Data
The final step is to store the extracted data.

In [135]:
data = {'Name': cus_res,
        'review': rev_result}

df = pd.DataFrame(data)

# Save the output to a CSV file
df.to_csv('amazon_review.csv')
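
The names and reviews are scraped by separate functions, so the two lists can end up with different lengths, and `pd.DataFrame` raises a `ValueError` when the columns of a dictionary are unequal in length. One possible guard, sketched below with a hypothetical helper `pad_columns`, pads the shorter lists with empty strings. Padding only prevents the error; it does not guarantee that each name lines up with its own review.

```python
from itertools import zip_longest


def pad_columns(columns, fill=""):
    """Pad every list in `columns` to the length of the longest one."""
    names = list(columns)
    # zip_longest builds rows, filling gaps in the shorter columns
    rows = list(zip_longest(*columns.values(), fillvalue=fill))
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


# Made-up sample with one more name than reviews
padded = pad_columns({
    "Name": ["Prady", "tech geek", "RK"],
    "review": ["Best in class.", "Superfast."],
})
print(padded)
```

The padded dictionary can then be passed to `pd.DataFrame` in place of `data`.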

### Working with images

In [136]:
def rev_img(soup):
    # Collect the thumbnail URLs of the images attached to customer reviews
    # (find_all replaces the deprecated findAll alias)
    return [img.get('src')
            for img in soup.find_all('img', class_="a-dynamic-image cr-customer-image-thumbnail")]

img_result = rev_img(soup)
img_result

['https://m.media-amazon.com/images/I/B1WqbnEWmAS._CR0,840,2160,2160_UX175.jpg',
 'https://m.media-amazon.com/images/I/91NhY2JNaBL._CR1084,0,1864,1864_UX175.jpg',
 'https://m.media-amazon.com/images/I/B1iQJUOZQoS._CR504,0,3024,3024_UX175.jpg',
 'https://m.media-amazon.com/images/I/B1RDBsn7mxS._CR504,0,3024,3024_UX175.jpg']

In [137]:
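# Download each thumbnail and save it as image_1.jpg, image_2.jpg, ...
for image_count, image in enumerate(img_result, start=1):
    res = requests.get(image, timeout=30)
    res.raise_for_status()  # avoid writing an error page to disk
    with open('image_' + str(image_count) + '.jpg', 'wb') as f:
        f.write(res.content)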