# Web scraping

### Objective:
> #### Scrape customer reviews from a given website (social media: Facebook), select the desired fields, and store them in the desired format

### Steps:
1. Import libraries
2. Web scraping
3. Sample data of single observation
4. Extract data of all observations
5. Data formatting
6. Desired data from scraping
7. Export data to CSV

### 1. Import libraries

In [1]:
import os
import pandas as pd
import numpy as np
from time import sleep

from selenium import webdriver
from bs4 import BeautifulSoup

### 2. Web scraping

1. Download the Selenium ChromeDriver and specify its path
2. Enter the URL of the page to be scraped
3. Run a scroll script, then sleep to give the page time to load
4. Select the desired class from the HTML, found with the browser's Inspect tool

In [2]:
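# Selenium 3 API: Selenium 4 removed `executable_path` and `find_elements_by_*`
# in favour of `Service(...)` and `driver.find_elements(By.CSS_SELECTOR, ...)`.
driver = webdriver.Chrome(executable_path=os.path.abspath(r'************\chromedriver_win32\chromedriver.exe'))
driver.get("https://www.facebook.com/**************")
# Scroll to the bottom once, then wait for lazily loaded posts to render
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
sleep(50)
# Review containers as live WebElements (the parsing below uses BeautifulSoup instead)
post = driver.find_elements_by_css_selector("div[class='_5pcr userContentWrapper']")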
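A single scroll followed by a long sleep may load only part of a long review feed. A common alternative is to keep scrolling until the page height stops growing. The sketch below assumes only that the driver exposes `execute_script`, as Selenium drivers do; the function name `scroll_to_bottom` and its `pause` and `max_rounds` parameters are hypothetical and not part of the original notebook.

```python
from time import sleep


def scroll_to_bottom(driver, pause=2.0, max_rounds=30):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        sleep(pause)  # give the page time to load the next batch of posts
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so assume the end of the feed
        last_height = new_height
    return rounds
```

Calling `scroll_to_bottom(driver)` before reading `driver.page_source` would replace the fixed `sleep(50)`. The `max_rounds` cap keeps the loop from running indefinitely on an endless feed.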

In [3]:
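# Parse the rendered page and collect one element per review
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')
week = soup.find(id='facebook')
# find_all() takes one tag name followed by attribute filters
items = week.find_all('div', class_='_5pcr userContentWrapper')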
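When `class_` is given a string containing a space, BeautifulSoup compares it with the exact value of the `class` attribute, so the order of the class names matters. The sketch below illustrates this on made-up markup (not the real Facebook page) and shows a CSS selector that matches both classes in any order.

```python
from bs4 import BeautifulSoup

# Made-up markup: two wrappers with the same classes in a different order.
html = """
<div id="facebook">
  <div class="_5pcr userContentWrapper"><span class="timestampContent">7 May at 07:15</span></div>
  <div class="userContentWrapper _5pcr"><span class="timestampContent">7 May at 04:29</span></div>
  <div class="other"></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
root = soup.find(id="facebook")

# Exact match on the full class string: only the first wrapper qualifies.
exact = root.find_all("div", class_="_5pcr userContentWrapper")

# CSS selector: any div carrying both classes, regardless of order.
by_css = root.select("div._5pcr.userContentWrapper")

print(len(exact), len(by_css))
```

If the page ever emits the classes in a different order or adds a third class, the selector form keeps matching while the exact-string form does not.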

In [4]:
# Number of observations
len(items)

133

### 3. Sample data of single observation

In [7]:
print(items[0].find(class_='timestampContent').get_text())
print(items[0].find(class_='_5pbx userContent _3576').get_text())
print(items[0].find(class_='_51mq img sp_daR8qCHGLan sx_4d7322'))

print(items[0].find(class_='_51mq img sp_daR8qCHGLan sx_7209d2'))

print(items[0].find(class_='profileLink'))
print(items[0].find(class_='_3hg- _42ft'))
print(items[0].find(class_='_6qw3'))

7 May at 07:15
I have placed orders for pineapple cake @ Ck's bakery, Keelkatalai. It was near to my house and my friends use to recommend, but not tried yet, But I have tried today and I thought of missed this yummy taste I have tasted before...I have bought Pineapple cake I love to try other flavors too......🍍🍰🎂
None
<i class="_51mq img sp_daR8qCHGLan sx_7209d2"></i>
<a class="profileLink" data-ft='{"tn":"l"}' href="https://www.facebook.com/lalitharishi?hc_ref=ARRLv3Or5KtqHL0r66-l6Dwo3GQ1fouJSx-5xolpyR-i0dkGxXqSF8ZzJwtaAyS6eAQ&amp;ref=nf_target" title="Lalitha Krishna">Lalitha Krishna</a>
None
None


### 4. Extract data of all observations

In [8]:
# Timestamp as text; everything else is kept as a raw tag and cleaned up in step 5
time=[item.find(class_='timestampContent').get_text() for item in items]
rev=[item.find(class_='_5pbx userContent _3576') for item in items]

# Star ratings: each star count has its own sprite class
rat1=[item.find(class_='_51mq img sp_daR8qCHGLan sx_4d7322') for item in items]
rat3=[item.find(class_='_51mq img sp_daR8qCHGLan sx_34407d') for item in items]
rat4=[item.find(class_='_51mq img sp_daR8qCHGLan sx_9193ef') for item in items]
rat5=[item.find(class_='_51mq img sp_daR8qCHGLan sx_2b0ba9') for item in items]

# Recommendation badges
recommend=[item.find(class_='_51mq img sp_daR8qCHGLan sx_7209d2') for item in items]
notrecommend=[item.find(class_='_51mq img sp_daR8qCHGLan sx_e527fb') for item in items]

# Reviewer name, comment-count link, and comment block
name=[item.find(class_='profileLink') for item in items]
no_comments=[item.find(class_='_3hg- _42ft') for item in items]
comments=[item.find(class_='_6qw3') for item in items]
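`find()` returns `None` when an item has no matching element, as the sample output in step 3 shows. Calling `.get_text()` directly on the result would therefore raise `AttributeError` for any post that lacks that element. A small helper can guard against this; `text_or_none` is a hypothetical name, and the markup below is made up for illustration.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, class_name):
    """Return the stripped text of the first match, or None if nothing matches."""
    tag = parent.find(class_=class_name)
    return tag.get_text(strip=True) if tag is not None else None


# Made-up markup: the second review has no timestamp element.
html = """
<div class="review"><span class="timestampContent">7 May at 07:15</span></div>
<div class="review"><span class="profileLink">Someone</span></div>
"""
reviews = BeautifulSoup(html, "html.parser").find_all("div", class_="review")
times = [text_or_none(r, "timestampContent") for r in reviews]
print(times)
```

Because `get_text()` also decodes HTML entities and includes text from nested tags, extracting text at this stage can avoid some of the string clean-up needed later.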

In [9]:
# Contains - matching partial class names
#import re
#r = [item.find_all(class_=lambda c: c and c.startswith('_5pbx')) for item in items]
#r = [item.find_all(class_=re.compile('^_51mq')) for item in items]
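Matching on part of a class name avoids listing one full class string per rating. The sketch below shows three ways to do this in BeautifulSoup: a regular expression, a function, and a CSS attribute selector. The markup is made up; in particular, the `<u>` children holding the star text are an assumption inferred from the extraction patterns used in step 5.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup modelled on the rating icons scraped above.
html = """
<div>
  <i class="_51mq img sp_daR8qCHGLan sx_4d7322"><u>1 star</u></i>
  <i class="_51mq img sp_daR8qCHGLan sx_2b0ba9"><u>5 star</u></i>
  <i class="other"></i>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Regex: tested against each individual class name of a tag.
by_regex = soup.find_all("i", class_=re.compile(r"^sx_"))

# Function: also called per class name; guard against tags without a class.
by_func = soup.find_all(class_=lambda c: c is not None and c.startswith("_51mq"))

# CSS substring selector on the whole class attribute.
by_css = soup.select('i[class*="sx_"]')

print([tag.get_text() for tag in by_regex])
```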

In [10]:
df=pd.DataFrame(
{
    'Date':time,
    'Review':rev,
    'Rating1': rat1,
    'Rating3': rat3,
    'Rating4': rat4,
    'Rating5': rat5,
    'Recommend': recommend,
    'NotRecommend': notrecommend,
    'Name': name,
    'no_comments':no_comments,
    'comments':comments
})

In [11]:
df.head()

Unnamed: 0,Date,Review,Rating1,Rating3,Rating4,Rating5,Recommend,NotRecommend,Name,no_comments,comments
0,7 May at 07:15,[[I have placed orders for pineapple cake @ Ck...,,,,,[],,[Lalitha Krishna],,
1,7 May at 04:29,"[[Excellent cakes and yummy sandwiches, [<span...",,,,,[],,[Krishna],,
2,24 January 2017,[[[Service has become pathetic and the staff h...,[[1 star]],,,,,,[Vikram Sarveshwar],[5 comments],"[[[CK's Bakery], , [<span class=""_3l3x""><span..."
3,7 May at 04:20,"[[Good & Affordable Variety in cakes, Choco tr...",,,,,[],,[Karuppan Chetty],,
4,20 November 2017,[[[Poor service n worst design. Waste of money...,[[1 star]],,,,,,[Maahi Jain],[11 comments],


### 5. Data formatting

In [12]:
df["Date"]=df['Date'].astype(str)
# Drop only the standalone word "at" ("7 May at 07:15" -> "7 May 07:15")
df["date"]=df['Date'].str.replace(' at ', ' ', regex=False)

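df['Review'] = df['Review'].astype(str)
# Keep the text of the first <p> up to the next tag; HTML entities such as &amp; stay encoded
df['review'] = df['Review'].str.extract('<p>(.*?)<', expand=False)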

df['Rating1'] = df['Rating1'].astype(str)
df['rating1'] = df['Rating1'].str.extract('<u>(.*?)</')
df['Rating3'] = df['Rating3'].astype(str)
df['rating3'] = df['Rating3'].str.extract('<u>(.*?)</')
df['Rating4'] = df['Rating4'].astype(str)
df['rating4'] = df['Rating4'].str.extract('<u>(.*?)</')
df['Rating5'] = df['Rating5'].astype(str)
df['rating5'] = df['Rating5'].str.extract('<u>(.*?)</')

df['Recommend'] = df['Recommend'].astype(str)
df['recommend'] = df['Recommend'].str.replace('<i class="_51mq img sp_daR8qCHGLan sx_7209d2"></i>' , 'recommend')
df['recommend'] = df['recommend'].str.replace('None' ,'')

df['NotRecommend'] = df['NotRecommend'].astype(str)
df['notrecommend'] = df['NotRecommend'].str.replace('<i class="_51mq img sp_daR8qCHGLan sx_e527fb"></i>' , 'notrecommend')
df['notrecommend'] = df['notrecommend'].str.replace('None' , '')

df['Name'] = df['Name'].astype(str)
df['name'] = df['Name'].str.extract('title="(.*?)">')

df['no_comments'] = df['no_comments'].astype(str)
df['no_of_comments'] = df['no_comments'].str.extract('">(.*?)</')

df['comments'] = df['comments'].astype(str)
df['Comments'] = df['comments'].str.extract('<span>(.*?)</span')
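Extracting text from stringified HTML with a regular expression has two side effects worth knowing: the lazy pattern `<p>(.*?)<` stops at the first nested tag, and HTML entities such as `&amp;` are left encoded. The sketch below reproduces both on made-up strings and decodes the entities with the standard library's `html.unescape`; the variable names are hypothetical.

```python
import html
import pandas as pd

# Made-up stand-ins for the stringified review tags.
raw = pd.Series([
    '<div><p>Good &amp; Affordable Variety in cakes</p></div>',
    '<div><p>Excellent cakes <span>and more</span></p></div>',
    'None',
])

# Same pattern as above: capture from <p> up to the next "<".
review = raw.str.extract(r'<p>(.*?)<', expand=False)

# Decode entities, leaving missing values untouched.
decoded = review.map(html.unescape, na_action='ignore')

print(decoded.tolist())
```

The second row is cut off before the nested `<span>`, which is one possible reason a scraped review ends earlier than the text shown on the page.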

In [13]:
# Combine the per-star rating columns into one column
cols = ['rating1', 'rating3', 'rating4', 'rating5']
df["rating"] = df[cols].apply(lambda x: ','.join(x.dropna()), axis=1)

# Combine the two recommendation columns into one column
cols = ['recommend', 'notrecommend']
df["recommendation"] = df[cols].apply(lambda x: ''.join(x.dropna()), axis=1)
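Joining the non-missing values of a row yields an empty string, not a missing value, when every source column is empty. That is a likely reason why `rating` and `recommendation` report 133 non-null entries in `df1.info()` even though many rows have no rating. A toy sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Made-up data: one row per case (1 star, 5 star, no rating at all).
toy = pd.DataFrame({
    "rating1": ["1 star", np.nan, np.nan],
    "rating5": [np.nan, "5 star", np.nan],
})

# Same technique as above: join whatever is left after dropping NaN.
toy["rating"] = toy[["rating1", "rating5"]].apply(
    lambda row: ",".join(row.dropna()), axis=1
)

# Optional: turn the empty strings back into missing values.
toy["rating_or_nan"] = toy["rating"].replace("", np.nan)

print(toy["rating"].tolist())
```

Converting the empty strings back to `NaN` makes the non-null counts reflect how many reviews actually carry a rating.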

In [14]:
df.head(5)

Unnamed: 0,Date,Review,Rating1,Rating3,Rating4,Rating5,Recommend,NotRecommend,Name,no_comments,...,rating3,rating4,rating5,recommend,notrecommend,name,no_of_comments,Comments,rating,recommendation
0,7 May at 07:15,"<div class=""_5pbx userContent _3576"" data-ft='...",,,,,"<i class=""_51mq img sp_daR8qCHGLan sx_7209d2"">...",,"<a class=""profileLink"" data-ft='{""tn"":""l""}' hr...",,...,,,,recommend,,Lalitha Krishna,,,,recommend
1,7 May at 04:29,"<div class=""_5pbx userContent _3576"" data-ft='...",,,,,"<i class=""_51mq img sp_daR8qCHGLan sx_7209d2"">...",,"<a class=""profileLink"" data-ft='{""tn"":""l""}' hr...",,...,,,,recommend,,Krishna,,,,recommend
2,24 January 2017,"<div class=""_5pbx userContent _3576"" data-ft='...","<i class=""_51mq img sp_daR8qCHGLan sx_4d7322"">...",,,,,,"<span class=""_39_n profileLink"" data-ft='{""tn""...","<a class=""_3hg- _42ft"" data-ft='{""tn"":""O""}' hr...",...,,,,,,Vikram Sarveshwar,5 comments,"Hello Sir, we are extremely sorry for your exp...",1 star,
3,7 May at 04:20,"<div class=""_5pbx userContent _3576"" data-ft='...",,,,,"<i class=""_51mq img sp_daR8qCHGLan sx_7209d2"">...",,"<span class=""_39_n profileLink"" data-ft='{""tn""...",,...,,,,recommend,,Karuppan Chetty,,,,recommend
4,20 November 2017,"<div class=""_5pbx userContent _3576"" data-ft='...","<i class=""_51mq img sp_daR8qCHGLan sx_4d7322"">...",,,,,,"<span class=""_39_n profileLink"" data-ft='{""tn""...","<a class=""_3hg- _42ft"" data-ft='{""tn"":""O""}' hr...",...,,,,,,Maahi Jain,11 comments,,1 star,


### 6. Desired data from scraping

In [15]:
# View columns
df.columns

Index(['Date', 'Review', 'Rating1', 'Rating3', 'Rating4', 'Rating5',
       'Recommend', 'NotRecommend', 'Name', 'no_comments', 'comments', 'date',
       'review', 'rating1', 'rating3', 'rating4', 'rating5', 'recommend',
       'notrecommend', 'name', 'no_of_comments', 'Comments', 'rating',
       'recommendation'],
      dtype='object')

In [16]:
df1=df[["name","date","review","rating","recommendation","no_of_comments"]]

In [17]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133 entries, 0 to 132
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   name            133 non-null    object
 1   date            133 non-null    object
 2   review          71 non-null     object
 3   rating          133 non-null    object
 4   recommendation  133 non-null    object
 5   no_of_comments  41 non-null     object
dtypes: object(6)
memory usage: 6.4+ KB


In [18]:
df1.head(5)

| | name | date | review | rating | recommendation | no_of_comments |
|---|---|---|---|---|---|---|
| 0 | Lalitha Krishna | 7 May 07:15 | I have placed orders for pineapple cake @ Ck's... | | recommend | |
| 1 | Krishna | 7 May 04:29 | Excellent cakes and yummy sandwiches | | recommend | |
| 2 | Vikram Sarveshwar | 24 January 2017 | Service has become pathetic and the staff have... | 1 star | | 5 comments |
| 3 | Karuppan Chetty | 7 May 04:20 | `Good &amp; Affordable Variety in cakes, Choco ...` | | recommend | |
| 4 | Maahi Jain | 20 November 2017 | Poor service n worst design. Waste of money. C... | 1 star | | 11 comments |


In [19]:
df1.tail(5)

| | name | date | review | rating | recommendation | no_of_comments |
|---|---|---|---|---|---|---|
| 128 | Beryl Sarah | 14 June 2016 | | 5 star | | |
| 129 | Koti Reddy | 13 June 2016 | | 5 star | | |
| 130 | Roshan Bennet | 10 June 2016 | | 5 star | | |
| 131 | Abinaya Chandru | 9 June 2016 | | 3 star | | |
| 132 | Srinath | 9 June 2016 | | 5 star | | |


### 7. Export data to CSV

In [20]:
#df1.to_csv('scrapped_reviews_facebook1.csv',index=False)
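A quick way to check an export is to write the file and read it back. The sketch below does this on a made-up two-row frame in a temporary directory; the file name `reviews_sample.csv` and the `utf-8-sig` encoding are assumptions, the latter chosen so that spreadsheet tools are more likely to display emoji and other non-ASCII text correctly.

```python
import os
import tempfile

import pandas as pd

# Made-up sample standing in for df1.
sample = pd.DataFrame({"name": ["A", "B"], "review": ["Yummy 🍰", ""]})

# Write without the index column, then read the file back.
path = os.path.join(tempfile.mkdtemp(), "reviews_sample.csv")
sample.to_csv(path, index=False, encoding="utf-8-sig")
loaded = pd.read_csv(path, encoding="utf-8-sig")

print(loaded)
```

Note that empty strings are written as empty fields, and `read_csv` turns them into `NaN` by default when the file is loaded again.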