# Scraping Amazon product data
---

## Data Extraction

In this project, I will focus on **extracting information from an Amazon product category**. Specifically, I will scrape the products that Amazon lists for the search term "iphone". I will then store the data in a dataframe and perform a brief data cleaning. To make the code easier to read and understand, each code cell is preceded by an explanation of what it does and how.

To begin with, we need to import a set of Python libraries. I will use the requests library to make the HTTP requests, the pandas library to create the dataframe where the data will be stored, the time library to set a waiting period between requests and, finally, BeautifulSoup to parse the html of each page.

Additionally, I import the warnings library to suppress the warning messages that could otherwise appear when executing the code cells.

In [1]:
#Importing libraries
import requests
import pandas as pd
import time
from bs4 import BeautifulSoup

#Suppressing Warnings
import warnings
warnings.filterwarnings('ignore')

Once this is done, we move on to the extraction part. The general idea is to **obtain the name, price, rating, number of comments and image url of each product within the category**. To do this, we need the url of the page that contains the data, which we can find by running any search on Amazon.

In this case, my url is ```'https://www.amazon.com/s?field-keywords=iphone'```. The problem is that when we open this link, the products are spread over several pages. Therefore, if we used only this url, we would obtain data for the first page of products only.

To solve this, we must create a list of urls (in text format) containing the links to each of the pages we want to scrape. This is why **I have created a function (pages_to_scrap) that takes as its parameter the number of pages to scan**. The function iterates over the numbers from 1 to this number and builds a new url for each one by appending it to the base url through the "page" parameter. All the urls are collected in the list 'urls', which the function returns.


In [2]:
def pages_to_scrap(num_page):
    #Base search url; the page number is appended through the "page" parameter
    base_url = "https://www.amazon.com/s?field-keywords=iphone&page="

    #Create a list with the complete urls of pages 1 to num_page
    urls = [base_url + str(page) for page in range(1, num_page + 1)]

    return urls
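
As a variation on the function above, the query string can be built with the standard library instead of string concatenation, which also lets the search term vary. This is only a sketch: `build_search_urls` and `BASE_URL` are hypothetical names that are not part of the original notebook.

```python
from urllib.parse import urlencode

# Same endpoint as the url used in this notebook
BASE_URL = "https://www.amazon.com/s"


def build_search_urls(keyword, num_pages):
    """Return one search url per results page, from page 1 to num_pages."""
    urls = []
    for page in range(1, num_pages + 1):
        # urlencode escapes spaces and special characters in the keyword
        query = urlencode({"field-keywords": keyword, "page": page})
        urls.append(f"{BASE_URL}?{query}")
    return urls


example_urls = build_search_urls("iphone", 3)
print(example_urls[0])
# https://www.amazon.com/s?field-keywords=iphone&page=1
```

For the keyword "iphone" this produces the same urls as pages_to_scrap.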

The **next step is to obtain the data of the products within each page** (each link within urls). To do this we can call the **get method on each url**, but for it to work we need to set the headers parameter with information from our device (to simulate a search made from a browser). Otherwise, Amazon is likely to detect the automated traffic and could block us, since it has a very strict policy regarding data extraction. You can review Alex the analyst's video in the readme.md references to see how to obtain this data.

Then, we **create a loop** in which we call the get method for each url within urls, passing the headers parameter. **If the response is positive (status code = 200), we continue with the extraction. Otherwise, an error message is printed** and the loop moves on to the next url.
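
The status check described above can be isolated in a small helper. This is only a sketch: `fetch_html`, its `get` parameter and the `FakeResponse` class are hypothetical names introduced here so the logic can be tried without contacting Amazon, and the 30-second timeout is an assumption as well.

```python
import requests


def fetch_html(url, headers, get=requests.get):
    """Return the page body as text, or None if the status code is not 200."""
    response = get(url, headers=headers, timeout=30)
    if response.status_code == 200:
        return response.text
    print("Error while fetching data for page:", url, "- status", response.status_code)
    return None


class FakeResponse:
    """Stand-in for requests.Response, used only to exercise the helper offline."""

    def __init__(self, status_code, text=""):
        self.status_code = status_code
        self.text = text


# Simulate one successful and one rejected request
ok = fetch_html("https://example.com", {}, get=lambda url, **kwargs: FakeResponse(200, "<html></html>"))
blocked = fetch_html("https://example.com", {}, get=lambda url, **kwargs: FakeResponse(503))
print(ok, blocked)
```

In real use the `get` argument would simply be left at its default, `requests.get`.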

Now comes the interesting part. To get the data (assuming status code = 200) we must **parse the html** we received in the response. Once it is parsed, we can use the functions **find** and **find_all** to extract the information we need. **But what should we look for with these functions?** The parsed response contains the full html of the Amazon page, that is, the code that displays everything we see when we scroll through the page (product names, prices, etc.). We can view that code by pressing F12 while on the page and then look for the snippets where the desired information is located. Once we find them, we can pass their tag names and classes to find and find_all to get that information.

Don't worry if you don't know anything about html: you can use the element selector at the top left of the developer tools window to select elements on the page and inspect them. When you hover the mouse over a page element, the code snippet that contains it is highlighted. Here is an example where I hover over the title of a product. I suggest using this tool to familiarize yourself with the structure of the html code.
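
To make the difference between find and find_all concrete, here is a minimal sketch on a hand-written snippet. The markup and the class names (`product`, `name`, `price`, `rating`) are hypothetical and far simpler than what Amazon actually serves.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup used only for illustration
html = """
<div class="product"><span class="name">Phone A</span><span class="price">$100.00</span></div>
<div class="product"><span class="name">Phone B</span></div>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns the first matching element
first_name = soup.find("span", class_="name")
print(first_name.text)  # Phone A

# find_all returns a list with every matching element
products = soup.find_all("div", class_="product")
print(len(products))  # 2

# find returns None when nothing matches
missing = soup.find("span", class_="rating")
print(missing)  # None
```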

![IMG](Images/html_elements.png)

Having understood this, we can continue with the code. The strategy now is to find the part of the html that identifies each product separately, that is, the html class of the element that contains all the information of one product. We can use the same tool to guide us. Doing so, we find that all products share a common class, as can be seen in the following image.

![IMG](Images/html_products.png)

Each of these elements contains the information of one product (if we expand it we can see more than what appears in the image), so we want a function that iterates through the products and extracts the information we need.

To achieve this, we first **save all the products in a variable (divs)** using the function find_all, with the html class we just found as the search parameter. After that, it only remains to **loop over each element inside divs and store the information we need in variables**. For each field we follow the same process: look up its html class in the page code and use the find function to obtain it. Here is an example:

![IMG](Images/html_price.png)

As there can be products without a name, photo, etc., I added a condition so that, if an element is not found, an empty string ('') is stored instead.

Finally, **in each loop iteration the data is saved in a dataframe**. Once all the products of a url have been scanned, the code waits 10 seconds before requesting the next page, to avoid generating a burst of traffic that could cause Amazon to block us.

All the code is wrapped in a function that takes, as parameters, a dataframe and the list of urls returned by the function created previously.
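
The "empty string if not found" rule can be factored into a helper so it is not repeated for every field. The sketch below assumes a hypothetical helper name (`text_or_empty`) and hypothetical markup and class names; it is not the code used in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second product has no price element
html = """
<div class="product"><span class="name">Phone A</span><span class="price">$100.00</span></div>
<div class="product"><span class="name">Phone B</span></div>
"""


def text_or_empty(parent, tag, css_class):
    """Return the text of the first matching child, or '' if there is none."""
    element = parent.find(tag, class_=css_class)
    return element.text.strip() if element else ""


soup = BeautifulSoup(html, "html.parser")

rows = []
for div in soup.find_all("div", class_="product"):
    rows.append({
        "title": text_or_empty(div, "span", "name"),
        "price": text_or_empty(div, "span", "price"),
    })

print(rows)
# [{'title': 'Phone A', 'price': '$100.00'}, {'title': 'Phone B', 'price': ''}]
```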

In [3]:
def get_data(df, urls):
    
    #Save computer information to simulate a web search
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}

    #Loop through each url to obtain information
    for url in urls:
        response = requests.get(url, headers=headers)

        if response.status_code == 200:
            #Parse the response in html
            soup = BeautifulSoup(response.content, "html.parser")

            #Find all products, which are in the same html class
            divs = soup.find_all("div", class_="sg-col-20-of-24 s-result-item s-asin sg-col-0-of-12 sg-col-16-of-20 sg-col s-widget-spacing-small sg-col-12-of-16")

            #Iterate over div elements to obtain product information if search keywords are found, else put ''
            for div in divs:
                #Obtain text title 
                span = div.find("span", class_="a-size-medium a-color-base a-text-normal")
                title = span.text if span else ''

                #Obtain price 
                span = div.find("span", class_="a-offscreen")
                price = span.text if span else ''

                #Obtain rating 
                span = div.find("span", class_="a-icon-alt")
                rating = span.text if span else ''

                #Obtain number of comments
                span = div.find("span", class_="a-size-base s-underline-text")
                comments = span.text if span else ''

                #Obtain image url
                img = div.find("img", class_="s-image")
                image = img["src"] if img else ''

                #Append information to dataframe (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
                row = pd.DataFrame([{
                                "title": title,
                                "price": price,
                                "rating": rating,
                                "comments": comments,
                                "image": image
                                }])
                df = pd.concat([df, row], ignore_index = True)
            
            #Set waiting time between each call 
            time.sleep(10)
            
        else:
            print("Error while fetching data for page:", url)
    return df
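
Growing a dataframe row by row copies it on every iteration. A common alternative is to collect the rows in a plain list and build the dataframe once at the end. The sketch below shows only that step; `rows_to_dataframe` and `COLUMNS` are hypothetical names and the two rows are made-up examples, not scraped data.

```python
import pandas as pd

COLUMNS = ["title", "price", "rating", "comments", "image"]


def rows_to_dataframe(rows):
    """Build the dataframe in one step from a list of per-product dictionaries."""
    return pd.DataFrame(rows, columns=COLUMNS)


# Made-up rows standing in for what the scraping loop would collect
rows = [
    {"title": "Example phone", "price": "$10.00", "rating": "4.0 out of 5 stars",
     "comments": "12", "image": ""},
    {"title": "Another phone", "price": "", "rating": "",
     "comments": "", "image": ""},
]

example_df = rows_to_dataframe(rows)
print(example_df.shape)  # (2, 5)
```

With this approach, the scraping loop would call `rows.append({...})` for each product instead of touching the dataframe.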

Once the two functions have been defined, we can use them to obtain the data. As a reminder, the first function creates a list of urls and the second one uses its output together with a dataframe (that we will have to create now) to extract and save the data.

In [4]:
#Using function to set number of pages to scrap and create a list of urls
urls = pages_to_scrap(20)

In [5]:
#Create dataframe and get data
df = pd.DataFrame(columns = ["title", "price", "rating", "comments", "image"])
df = get_data(df, urls)

In [6]:
#Show results
df

| | title | price | rating | comments | image |
|---|---|---|---|---|---|
| 0 | Apple iPhone 11, 64GB, Black - Unlocked (Renewed) | $319.99 | 4.3 out of 5 stars | 33606 | https://m.media-amazon.com/images/I/31PpUfTCiF... |
| 1 | Apple iPhone XR, 64GB, Black - Unlocked (Renewed) | $229.99 | 4.5 out of 5 stars | 56602 | https://m.media-amazon.com/images/I/717KHGCJ6e... |
| 2 | Apple iPhone 11 Pro, 64GB, Space Gray - Unlock... | $364.81 | 4.3 out of 5 stars | 16021 | https://m.media-amazon.com/images/I/81ldhum0M4... |
| 3 | Apple iPhone X, US Version, 64GB, Silver - Unl... | $214.00 | 4.2 out of 5 stars | 22550 | https://m.media-amazon.com/images/I/81SSw14XZH... |
| 4 | Apple iPhone 12 Pro, 256GB, Pacific Blue - Ful... | $560.95 | 4.2 out of 5 stars | 2569 | https://m.media-amazon.com/images/I/71z4b3G3GA... |
| ... | ... | ... | ... | ... | ... |
| 301 | Ulefone Unlocked Smartphones Note 9P Android 1... | $123.99 | 3.9 out of 5 stars | 3089 | https://m.media-amazon.com/images/I/71gxbnt6fw... |
| 302 | TracFone TCL 30 Z, 32GB, Black - Prepaid Smart... | $39.88 | 3.2 out of 5 stars | 26 | https://m.media-amazon.com/images/I/718+xprSIB... |
| 303 | Thuraya satellite Satsleeve + (Plus) for Smart... | | 2.9 out of 5 stars | 8 | https://m.media-amazon.com/images/I/71PYF4sosE... |
| 304 | Sudroid Mini Small Mobil Cell Phone L8star BM7... | $21.99 | 3.6 out of 5 stars | 90 | https://m.media-amazon.com/images/I/51lMXm5u+y... |


## Data Cleaning

We already have our data, but we can apply some transformations to clean it up. Specifically, I would like to remove the dollar sign from the price column (it would prevent calculations if we wanted to do some kind of analysis), remove the final part of the rating (' out of 5 stars'), strip any parentheses from the comments column and check for duplicates.

In [7]:
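#Remove the dollar sign (regex=False so that '$' is treated as a literal character, not as a regex anchor)
df['price'] = df['price'].str.replace('$', '', regex=False)

#Remove ' out of 5 stars' from the rating column
df['rating'] = df['rating'].str.replace(' out of 5 stars', '', regex=False)

#Remove parentheses from the comments column (a bare '(' or ')' is not a valid regex)
df['comments'] = df['comments'].str.replace('(', '', regex=False)
df['comments'] = df['comments'].str.replace(')', '', regex=False)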
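
After these replacements the columns still hold text. If calculations are the goal, one possible next step is to convert them to numbers. The sketch below is an assumption about how that could be done, not part of the original notebook: `to_number` is a hypothetical helper, and the sample values (including the thousands separators) are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows; "$1,299.00" and "1,204" are made up to show thousands separators
sample = pd.DataFrame({
    "price": ["$319.99", "", "$1,299.00"],
    "rating": ["4.3 out of 5 stars", "2.9 out of 5 stars", ""],
    "comments": ["33606", "(8)", "1,204"],
})


def to_number(series, pattern):
    """Remove every match of the regex `pattern`, then convert to numbers (NaN if not parseable)."""
    stripped = series.str.replace(pattern, "", regex=True)
    return pd.to_numeric(stripped, errors="coerce")


cleaned = pd.DataFrame({
    "price": to_number(sample["price"], r"[$,]"),
    "rating": to_number(sample["rating"], r" out of 5 stars"),
    "comments": to_number(sample["comments"], r"[(),]"),
})
print(cleaned)
```

Here the patterns are deliberately regular expressions: inside a character class such as `[$,]` the dollar sign is a literal character, so no escaping is needed. Empty strings become NaN because of `errors="coerce"`.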

In [8]:
#Counting duplicate data in dataframe
df.duplicated().sum()

40

In [9]:
# Dropping duplicate rows
df.drop_duplicates(inplace = True)

In [10]:
#Reset Index
df.reset_index(drop = True, inplace = True)
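
The three cells above can be tried on a tiny dataframe to see what each step does. The data below is made up for illustration; `duplicated` flags a row only when all of its columns repeat an earlier row.

```python
import pandas as pd

# Made-up data: the third row repeats the first one exactly
sample = pd.DataFrame({
    "title": ["Phone A", "Phone B", "Phone A"],
    "price": ["100.00", "200.00", "100.00"],
})

# Count rows that are exact copies of an earlier row
n_duplicates = int(sample.duplicated().sum())
print(n_duplicates)  # 1

# Drop the copies and renumber the index from 0
deduped = sample.drop_duplicates().reset_index(drop=True)
print(deduped.index.tolist())  # [0, 1]
```

This variant returns a new dataframe instead of using inplace = True; which style to use is a matter of preference.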

In [11]:
#Show results
df

| | title | price | rating | comments | image |
|---|---|---|---|---|---|
| 0 | Apple iPhone 11, 64GB, Black - Unlocked (Renewed) | 319.99 | 4.3 | 33606 | https://m.media-amazon.com/images/I/31PpUfTCiF... |
| 1 | Apple iPhone XR, 64GB, Black - Unlocked (Renewed) | 229.99 | 4.5 | 56602 | https://m.media-amazon.com/images/I/717KHGCJ6e... |
| 2 | Apple iPhone 11 Pro, 64GB, Space Gray - Unlock... | 364.81 | 4.3 | 16021 | https://m.media-amazon.com/images/I/81ldhum0M4... |
| 3 | Apple iPhone X, US Version, 64GB, Silver - Unl... | 214.00 | 4.2 | 22550 | https://m.media-amazon.com/images/I/81SSw14XZH... |
| 4 | Apple iPhone 12 Pro, 256GB, Pacific Blue - Ful... | 560.95 | 4.2 | 2569 | https://m.media-amazon.com/images/I/71z4b3G3GA... |
| ... | ... | ... | ... | ... | ... |
| 261 | Ulefone Unlocked Smartphones Note 9P Android 1... | 123.99 | 3.9 | 3089 | https://m.media-amazon.com/images/I/71gxbnt6fw... |
| 262 | TracFone TCL 30 Z, 32GB, Black - Prepaid Smart... | 39.88 | 3.2 | 26 | https://m.media-amazon.com/images/I/718+xprSIB... |
| 263 | Thuraya satellite Satsleeve + (Plus) for Smart... | | 2.9 | 8 | https://m.media-amazon.com/images/I/71PYF4sosE... |
| 264 | Sudroid Mini Small Mobil Cell Phone L8star BM7... | 21.99 | 3.6 | 90 | https://m.media-amazon.com/images/I/51lMXm5u+y... |


In [12]:
#Export df to csv 
df.to_csv('amazon_iphone_data.csv', index=False)
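
One detail worth knowing when the exported file is read back: empty strings are written as empty fields, and pandas reads empty fields as NaN. The sketch below shows this with an in-memory buffer instead of a real file; the sample data is made up.

```python
import io

import pandas as pd

# Made-up data: the second product has no price
sample = pd.DataFrame({"title": ["Phone A", "Phone B"], "price": ["1.50", ""]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Read the csv text back into a new dataframe
buffer.seek(0)
restored = pd.read_csv(buffer)

print(list(restored.columns))  # ['title', 'price']
print(restored["price"].isna().tolist())  # [False, True]
```

Because of index=False, no extra index column is written. Without it, reading the file back would add a column named "Unnamed: 0".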