## Web Scraping from Amazon
### In this dataset-creation notebook we scrape Amazon reviews of one of the highest-selling products, Crocs. Step 1: To begin the data extraction, we import the libraries used for this task.

In [1]:
# Import the libraries needed for the data extraction task
import numpy as np  # provides np.nan, used in Step 2 for missing fields
import requests
from bs4 import BeautifulSoup
import pandas as pd

### Step 2: In this step, we write a function that extracts the review fields by HTML tag and class. It takes the URL of the page to scrape as its parameter.

In [2]:
def fetch_url_data(url):
    # Browser-like request headers
    headers = {
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36",
        "Accept-Encoding": "gzip, deflate",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "DNT": "1",
        "Connection": "close",
        "Upgrade-Insecure-Requests": "1",
    }

    page = requests.get(url, headers=headers)

    # Name the parser explicitly so BeautifulSoup does not have to guess one
    soup = BeautifulSoup(page.content, "html.parser")


    # Each review sits in its own <div>; collect one row per review
    table = []
    for a in soup.find_all('div', attrs={'class':'a-section review aok-relative'}):
        author_name = a.find('span',attrs={'class':'a-profile-name'})
        title = a.find('a', attrs={'class':'a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold'})
        rating = a.find('span', attrs={'class':'a-icon-alt'})
        review = a.find('span', attrs={'class':'a-size-base review-text review-text-content'})
        size = a.find('a', attrs={'class':'a-size-mini a-link-normal a-color-secondary'})
        helpfullness = a.find('div', attrs={'class':'a-row a-spacing-small'})
        review_date = a.find('span', attrs={'class':'a-size-base a-color-secondary review-date'})
        
        # Build the row in column order, substituting NaN for any missing field
        fields = [title, author_name, rating, review, size, helpfullness, review_date]
        columns = [field.text if field is not None else np.nan for field in fields]

        table.append(columns)
              
    return table
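
The function above assumes that every request succeeds. As a minimal sketch, assuming a non-200 status is worth retrying a few times, a small wrapper could look like the following. The names `get_with_retry`, `retries`, and `delay` are hypothetical and not part of the original notebook; the `get` argument stands in for a call such as `requests.get` so the sketch runs without network access.

```python
import time


def get_with_retry(get, url, retries=3, delay=1.0):
    """Call get(url) until it returns status 200 or the retries run out.

    `get` is any callable that takes a URL and returns an object with a
    `status_code` attribute (hypothetical stand-in for requests.get).
    Returns the successful response, or None if every attempt failed.
    """
    for attempt in range(retries):
        response = get(url)
        if response.status_code == 200:
            return response
        # Wait a little longer after each failure, except after the last one
        if attempt < retries - 1:
            time.sleep(delay * (attempt + 1))
    return None
```

Inside `fetch_url_data`, one possible use is to pass `lambda u: requests.get(u, headers=headers)` as `get` and return an empty list when the result is `None`.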

### Step 3: We loop over the review page numbers, build the URL for each page, and pass it to the function defined in Step 2, once for critical reviews and once for positive reviews.

In [3]:
complete_data = []
for i in range(0, 1000):
    url = 'https://www.amazon.com/Crocs-Unisex-Adult-Classic-Water-Comfortable/product-reviews/B08L7T3L8C/ref=cm_cr_getr_d_paging_btm_next_'+str(i)+'?ie=UTF8&reviewerType=all_reviews&pageNumber='+str(i)+'&filterByStar=critical'
    complete_data.append(fetch_url_data(url))

for i in range(0, 1000):
    url = 'https://www.amazon.com/Crocs-Unisex-Adult-Classic-Water-Comfortable/product-reviews/B08L7T3L8C/ref=cm_cr_getr_d_paging_btm_next_'+str(i)+'?ie=UTF8&reviewerType=all_reviews&pageNumber='+str(i)+'&filterByStar=positive'
    complete_data.append(fetch_url_data(url))
    
# Each call returns a list of rows for one page; flatten merges the pages into one list of rows
flatten = lambda l: [item for sublist in l for item in sublist]
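
The two loops above differ only in the `filterByStar` value. A minimal sketch of factoring out the URL construction is shown below, assuming that an empty page means there are no more reviews to collect (an empty page could also mean the request was blocked, so treat this stopping rule as an assumption). The names `build_review_url`, `scrape_reviews`, and `max_pages` are hypothetical; `fetch` stands in for `fetch_url_data`.

```python
from urllib.parse import urlencode

# Same product-review path as in the loops above
BASE = ("https://www.amazon.com/Crocs-Unisex-Adult-Classic-Water-Comfortable"
        "/product-reviews/B08L7T3L8C/ref=cm_cr_getr_d_paging_btm_next_")


def build_review_url(page_number, star_filter):
    """Build the review-page URL for one page number and star filter."""
    query = urlencode({
        "ie": "UTF8",
        "reviewerType": "all_reviews",
        "pageNumber": page_number,
        "filterByStar": star_filter,  # 'critical' or 'positive'
    })
    return f"{BASE}{page_number}?{query}"


def scrape_reviews(fetch, star_filter, max_pages=1000):
    """Collect rows page by page, stopping at the first empty page."""
    rows = []
    for page_number in range(max_pages):
        page_rows = fetch(build_review_url(page_number, star_filter))
        if not page_rows:
            break  # assumption: an empty page means nothing is left
        rows.extend(page_rows)  # extend keeps the result flat
    return rows
```

With this shape, `scrape_reviews(fetch_url_data, 'critical') + scrape_reviews(fetch_url_data, 'positive')` would give a flat list of rows directly, so no separate flattening step is needed.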

### Step 4: We flatten the collected data into a single list of rows, convert it into a DataFrame, and assign the relevant column names.

In [4]:
df = pd.DataFrame(flatten(complete_data),columns=['Title','Author Name','Rating','Review', 'Size', 'Helpfullness', 'Review_Date'])
df.head(70)

Unnamed: 0,Title,Author Name,Rating,Review,Size,Helpfullness,Review_Date
0,\nNot the good made in USA ones of years ago\n,Outdoor Enthusiast!,1.0 out of 5 stars,"\n\n I have worn Crocs for years, but the wel...",Size: 9 Women/7 MenColor: Grass Green,367 people found this helpful,"Reviewed in the United States on July 18, 2018"
1,\nTwo completely different sized shoes marked ...,MoMo Wondertoes,1.0 out of 5 stars,\n\n We have been buying crocs for a long tim...,Size: 7 Women/5 MenColor: Black,327 people found this helpful,"Reviewed in the United States on August 6, 2017"
2,\nLove Crocs but....\n,PJM,1.0 out of 5 stars,\n\n I was hoping these would fit like the Cl...,Size: 9 Women/7 MenColor: Navy,157 people found this helpful,"Reviewed in the United States on June 25, 2018"
3,\nNot buying Crocs.\n,Shemit,1.0 out of 5 stars,\n\n I would never recommend the now-a-day Cr...,Size: 7 Women/5 MenColor: Black,154 people found this helpful,"Reviewed in the United States on May 27, 2017"
4,\nDidn't last a month\n,Amazon Customer,1.0 out of 5 stars,\n\n I bought these to replace an earlier pai...,Size: 11 Women/9 MenColor: Navy,71 people found this helpful,"Reviewed in the United States on November 2, 2018"
...,...,...,...,...,...,...,...
65,\nFake “crocs”\n,UNION INSURANCE AGENCY,1.0 out of 5 stars,\n\n Very upset. I purchased some Crocs (or s...,Size: 7 Women/5 MenColor: Black,,"Reviewed in the United States on December 28, ..."
66,\nThese new shoes are made in China and are de...,Mikecho,1.0 out of 5 stars,\n\n The interior surface of this shoe is def...,Size: 13 Women/11 MenColor: Navy,3 people found this helpful,"Reviewed in the United States on August 1, 2019"
67,"\nThey Work for What I Need, But They Seem Hig...",katie d.,3.0 out of 5 stars,\n\n I bought these specifically for my rowin...,Size: 7 Women/5 MenColor: Black,5 people found this helpful,"Reviewed in the United States on June 17, 2015"
68,\nNarrower and no arch support.\n,MplsMary,1.0 out of 5 stars,\n\n These are different from the classic Cro...,Size: 10 Women/8 MenColor: Black,7 people found this helpful,"Reviewed in the United States on September 28,..."
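
The values in the output above are raw strings: titles and reviews carry surrounding newlines, ratings read like `1.0 out of 5 stars`, and dates are embedded in a sentence. As a rough sketch, assuming every value follows the formats visible in these rows, they could be parsed as shown below. The helper names are hypothetical and not part of the original notebook.

```python
import re
from datetime import datetime


def parse_rating(text):
    """Turn '1.0 out of 5 stars' into 1.0; return None if the format differs."""
    match = re.match(r"\s*(\d+(?:\.\d+)?) out of 5 stars", text)
    return float(match.group(1)) if match else None


def parse_helpful_count(text):
    """Turn '367 people found this helpful' into 367; None for other wording."""
    match = re.match(r"\s*([\d,]+) people found this helpful", text)
    return int(match.group(1).replace(",", "")) if match else None


def parse_review_date(text):
    """Turn 'Reviewed in the United States on July 18, 2018' into a date."""
    date_part = text.rsplit(" on ", 1)[-1].strip()
    return datetime.strptime(date_part, "%B %d, %Y").date()


def clean_text(text):
    """Strip the leading and trailing whitespace left over from the HTML."""
    return text.strip()
```

These helpers could be applied column by column, for example `df['Rating'].map(parse_rating)`; missing values (NaN) would need to be skipped first, since the helpers expect strings.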


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8696 entries, 0 to 8695
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Title         7156 non-null   object
 1   Author Name   8696 non-null   object
 2   Rating        8696 non-null   object
 3   Review        8696 non-null   object
 4   Size          8480 non-null   object
 5   Helpfullness  1222 non-null   object
 6   Review_Date   8696 non-null   object
dtypes: object(7)
memory usage: 475.7+ KB


### As the final step of dataset creation, we save the DataFrame to a .csv file in local storage.

In [6]:
df.to_csv('crocs_reviews.csv', index=False, encoding='utf-8')