### Data Collection - Web scraping

I am looking for an efficient, recently launched laptop to buy, and there are many models and variants on the market. I find 'notebookcheck.net' the best website for laptop reviews, and it covers recently launched models across brands. So I scraped it to collect the specifications from laptop reviews published in the last 3 months.

In [1]:
import requests
import bs4
from datetime import datetime, date
from dateutil.relativedelta import relativedelta
import re
from random import randint
from time import sleep
import pandas as pd

In [2]:
# GET REVIEW URLS STEP - Walk the paginated review listing and collect the URLs of
# laptop reviews published in the last 3 months.

# Last run on Jul 14, 2021

base_url = 'https://www.notebookcheck.net/Laptop.315784.0.html?&page={}&tagArray[]=16&typeArray[]=1'
three_month_date = date.today() - relativedelta(months=3)
date_pattern = re.compile(r'\d{2}\s[a-zA-Z]{3}\s\d{4}')
n = 0
review_urls = []
loop_flag = True
while loop_flag:
    scrape_url = base_url.format(n)
    res = requests.get(scrape_url)
    soup = bs4.BeautifulSoup(res.text, 'lxml')
    teasers = soup.select('.introa_large.introa_review')
    if not teasers:
        # An empty page means the listing is exhausted; stop instead of looping forever.
        break
    for teaser in teasers:
        date_match = date_pattern.search(teaser.text)
        if date_match is None:
            # Skip teasers without a recognizable 'DD Mon YYYY' date.
            continue
        url_date = datetime.strptime(date_match[0], '%d %b %Y').date()
        if url_date >= three_month_date:
            review_urls.append(teaser['href'])
        else:
            # The listing is assumed to be newest first, so the first older entry ends the crawl.
            loop_flag = False
            break
    n += 1

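The date check in the step above can be tried without touching the network. This is a minimal sketch, assuming teaser text contains a date such as `02 Jul 2021`; the helper names `parse_teaser_date` and `is_recent` and the sample strings are made up for illustration and are not part of the notebook.

```python
import re
from datetime import date, datetime

# Same 'DD Mon YYYY' pattern the scraping step searches for.
DATE_PATTERN = re.compile(r'\d{2}\s[a-zA-Z]{3}\s\d{4}')


def parse_teaser_date(teaser_text):
    """Return the first 'DD Mon YYYY' date in teaser_text, or None if there is none."""
    match = DATE_PATTERN.search(teaser_text)
    if match is None:
        return None
    return datetime.strptime(match.group(0), '%d %b %Y').date()


def is_recent(teaser_text, cutoff):
    """True when the teaser carries a date on or after the cutoff date."""
    parsed = parse_teaser_date(teaser_text)
    return parsed is not None and parsed >= cutoff


# Hypothetical teaser strings; a fixed cutoff keeps the example reproducible.
cutoff = date(2021, 4, 14)
print(is_recent('Some laptop review 02 Jul 2021', cutoff))  # True
print(is_recent('Some older review 01 Mar 2021', cutoff))   # False
print(parse_teaser_date('no date here'))                    # None
```

Returning `None` for a teaser without a date lets the caller decide whether to skip the entry or stop, rather than failing on `date_match[0]`.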
In [3]:
# GET SPEC DETAILS OF LAPTOPS STEP - Visit each review URL and collect its spec elements.
# This covers 3 months of reviews with a 5-15 second pause per request, so it takes a while.
# !!!!! Don't run this twice by mistake !!!!!

spec_details = []
for review_url in review_urls:
    result = requests.get(review_url)
    sleep(randint(5, 15))  # random pause between requests to go easy on the site
    spec_soup = bs4.BeautifulSoup(result.text, 'lxml')
    # Keep the matching elements in document order; the formatting step relies on that order.
    spec_details.append(spec_soup.find_all(class_=['specs_header', 'specs', 'specs_details']))

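One way to make an accidental second run harmless is to keep already downloaded pages in a cache and fetch only the missing ones. The sketch below is an assumption-laden illustration, not what the notebook does: `fetch_all` is a hypothetical helper, and the `fetch` argument stands in for a real download such as `lambda url: requests.get(url, timeout=30).text`. A fake fetcher is used here so the example runs offline.

```python
import time


def fetch_all(urls, fetch, cache=None, delay_seconds=0.0):
    """Fetch each URL at most once, reusing anything already in cache."""
    cache = {} if cache is None else cache
    for url in urls:
        if url in cache:
            continue  # downloaded in an earlier run, so skip it
        cache[url] = fetch(url)
        time.sleep(delay_seconds)  # pause only after a real download
    return cache


# Stand-in for a network call; it records which URLs were actually requested.
calls = []


def counting_fetch(url):
    calls.append(url)
    return '<html>' + url + '</html>'


first = ['https://example.com/a', 'https://example.com/b']
pages = fetch_all(first, counting_fetch)
# A second run with one extra URL downloads only the new page.
pages = fetch_all(first + ['https://example.com/c'], counting_fetch, cache=pages)
print(len(calls))  # 3
```

With the real scraper, `delay_seconds` could be set from `randint(5, 15)` per request, and the cache could be saved to disk between sessions.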
In [4]:
# DATA FORMATTING STEP - Turn the raw elements from the previous step into one dict per laptop,
# keeping only the required specs.

specs_list = []
specs_needed = ['Processor', 'Graphics adapter', 'Memory', 'Display', 'Storage',
                'Size', 'Battery', 'Camera', 'Weight', 'Price']
for i in range(len(spec_details)):
    # Skip performance-analysis articles.
    if 'Performance-Analysis' in review_urls[i]:
        continue
    if len(spec_details[i]) != 0:
        specs_dict = {}
        # Element 0 is the product name; after that, labels sit at odd indices
        # and each label's value follows at the next index.
        specs_dict['Product_name'] = spec_details[i][0].text.replace('\xa0', ' ')
        # Stop one short of the end so that j + 1 is always a valid index.
        for j in range(1, len(spec_details[i]) - 1, 2):
            label = spec_details[i][j].text
            if label in specs_needed:
                specs_dict[label] = spec_details[i][j + 1].text.replace('\xa0', ' ')
        specs_dict['URL_link'] = review_urls[i]
        specs_list.append(specs_dict)

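The label/value pairing used in the formatting step can be checked on plain strings. This is a small sketch under the same layout assumption as the step above (product name first, then alternating label and value); `pair_specs` and the sample texts are hypothetical.

```python
def pair_specs(texts, wanted):
    """Build one record from texts laid out as [name, label, value, label, value, ...]."""
    record = {'Product_name': texts[0].replace('\xa0', ' ')}
    labels = texts[1::2]   # odd positions hold the labels
    values = texts[2::2]   # each value follows its label
    # zip stops at the shorter list, so a trailing label without a value is ignored.
    for label, value in zip(labels, values):
        if label in wanted:
            record[label] = value.replace('\xa0', ' ')
    return record


# Made-up element texts, including a non-breaking space and a dangling last label.
sample = ['Acme\xa0Book 13', 'Processor', 'Example CPU',
          'Memory', '8192\xa0MB', 'Networking', 'Wi-Fi', 'Weight']
record = pair_specs(sample, wanted=['Processor', 'Memory', 'Weight'])
print(record)
# {'Product_name': 'Acme Book 13', 'Processor': 'Example CPU', 'Memory': '8192 MB'}
```

Slicing and `zip` avoid index arithmetic, so a page that ends on a label cannot raise an `IndexError`.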
In [61]:
# Load the spec details into a DataFrame and preview the first rows #

df=pd.DataFrame(specs_list)
df.head(5)

Unnamed: 0,Product_name,Processor,Graphics adapter,Memory,Display,Storage,Size,Battery,Camera,Weight,Price,URL_link
0,Lenovo Yoga 6 13 82ND0009US (Yoga 6 13 Series),"AMD Ryzen 5 5500U 6 x 2.1 - 4 GHz, 21 W PL2 / ...","AMD Radeon RX Vega 7 - 512 MB, Memory: 1333 MH...","8192 MB , 1600 MHz, 22-22-22-52, Dual-Channel...","13.30 inch 16:9, 1920 x 1080 pixel 166 PPI, 10...","WDC PC SN530 SDBPMPZ-265G, 256 GB",height x width x depth (in mm): 18.2 x 308 x 2...,60 Wh Lithium-Polymer,Webcam: 720pPrimary Camera: 0.9 MPix,"1.331 kg ( = 46.95 oz / 2.93 pounds), Power Su...",800 USD,https://www.notebookcheck.net/Lenovo-Yoga-6-13...
1,Asus ZenBook Flip 13 UX363EA-HP069T (ZenBook F...,"Intel Core i7-1165G7 4 x 2.8 - 4.7 GHz, 51 W P...","Intel Iris Xe Graphics G7 96EUs, Memory: 1300 ...","16384 MB , LPDDR4-4266, soldered","13.30 inch 16:9, 1920 x 1080 pixel 166 PPI, mu...","WDC PC SN730 SDBPNTY-1T00, 1024 GB , 950 GB free",height x width x depth (in mm): 13 x 305 x 211...,"67 Wh, 4220 mAh Lithium-Ion",Webcam: HDPrimary Camera: 0.9 MPixSecondary Ca...,"1.188 kg ( = 41.91 oz / 2.62 pounds), Power Su...","1,799 EUR",https://www.notebookcheck.net/Asus-ZenBook-Fli...
2,Medion Erazer Beast X20 (Erazer Series),"Intel Core i7-10870H 8 x 2.2 - 5 GHz, 60 W PL2...","NVIDIA GeForce RTX 3070 Laptop GPU - 8192 MB, ...","32768 MB , DDR4-3200, Dual-Channel-Mode, two ...","17.30 inch 16:9, 2560 x 1440 pixel 170 PPI, BO...","Phison E12S-2TB-Phison-SSD-BICS4, 2048 GB , 2...",height x width x depth (in mm): 23 x 395 x 262...,"91 Wh, 7900 mAh Lithium-Polymer",Webcam: HDPrimary Camera: 0.9 MPix,"2.246 kg ( = 79.23 oz / 4.95 pounds), Power Su...",2299 Euro,https://www.notebookcheck.net/Medion-Erazer-Be...
3,Razer Blade 15 Advanced Model Core i7-11800H (...,"Intel Core i7-11800H 8 x 2.3 - 4.6 GHz, 160 W ...","NVIDIA GeForce RTX 3080 Laptop GPU, Core: 1245...","32768 MB , Samsung DDR4-3200, 22-22-22-52, Du...","15.60 inch 16:9, 1920 x 1080 pixel 141 PPI, TL...","SSSTC CA6-8D1024, 1024 GB",height x width x depth (in mm): 16.99 x 355 x ...,80 Wh,Webcam: FHDPrimary Camera: 2 MPix,"2.026 kg ( = 71.47 oz / 4.47 pounds), Power Su...",3100 USD,https://www.notebookcheck.net/Razer-Blade-15-A...
4,Samsung Galaxy Book NP750XDA (Galaxy Book Series),"Intel Core i5-1135G7 4 x 2.4 - 4.2 GHz, 32 W P...","Intel Iris Xe Graphics G7 80EUs, Core: 1300 MH...","8192 MB , Dual-Channel","15.60 inch 16:9, 1920 x 1080 pixel 141 PPI, CE...","Lite-On CL1-8D512, 512 GB , 450 GB free",height x width x depth (in mm): 16 x 357 x 230...,54 Wh Lithium-Ion,"Webcam: 0.9 MP, 16:9 (1280x720)","1.6 kg ( = 56.44 oz / 3.53 pounds), Power Supp...",900 Euro,https://www.notebookcheck.net/Galaxy-Book-2021...


In [62]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Product_name      75 non-null     object
 1   Processor         75 non-null     object
 2   Graphics adapter  75 non-null     object
 3   Memory            75 non-null     object
 4   Display           75 non-null     object
 5   Storage           75 non-null     object
 6   Size              75 non-null     object
 7   Battery           75 non-null     object
 8   Camera            72 non-null     object
 9   Weight            75 non-null     object
 10  Price             66 non-null     object
 11  URL_link          75 non-null     object
dtypes: object(12)
memory usage: 7.2+ KB

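The preview shows prices in mixed forms such as `800 USD`, `1,799 EUR` and `2299 Euro`, and `df.info()` reports 9 missing prices. If the column is needed for comparison later, it could be split into an amount and a currency code. The sketch below is only one possible approach: `parse_price` and the alias table are hypothetical, and it assumes a comma is a thousands separator.

```python
import re

# A number (optionally with thousands commas and a decimal part), then a currency word.
PRICE_PATTERN = re.compile(r'(\d[\d,]*(?:\.\d+)?)\s*([A-Za-z]+)')
# Assumed spellings, taken from the values visible in the preview.
CURRENCY_ALIASES = {'euro': 'EUR', 'eur': 'EUR', 'usd': 'USD'}


def parse_price(text):
    """Return (amount, currency) for strings like '1,799 EUR', or None if unparseable."""
    if not isinstance(text, str):
        return None  # missing prices arrive as NaN or None
    match = PRICE_PATTERN.search(text)
    if match is None:
        return None
    amount = float(match.group(1).replace(',', ''))
    word = match.group(2)
    currency = CURRENCY_ALIASES.get(word.lower(), word.upper())
    return amount, currency


print(parse_price('800 USD'))    # (800.0, 'USD')
print(parse_price('1,799 EUR'))  # (1799.0, 'EUR')
print(parse_price('2299 Euro'))  # (2299.0, 'EUR')
print(parse_price(None))         # None
```

Applied to the frame, this would look like `df['Price'].map(parse_price)`; the amounts stay in their original currencies, so no conversion is implied.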

In [63]:
df.to_csv('Notebookcheck_last_3_months_data.csv',index=False)