# <span style="color:darkblue"> QTM 530 Final Project 1 </span>

Linchuan Zhang

10/24/2024

Note: Sorry for the 90-minute delay in submission. Five items had no reviews or ratings at all: the values were not merely missing, the entire review section was absent from the page, so adjusting the loop to handle this took longer than expected.

**Project Description**

I want to buy a new laptop. To survey what is on the market, I search Dell's official website and compile a list of Dell laptops for further analysis. The data source is Dell’s official laptop page, with data collected on October 24, 2024. As of that date, 67 laptops are listed on the website across six pages. For each laptop I extract the product name, model, current price, number of reviews, average star rating, and display screen size. Because there are six pages in total, the data collection code stops after page six and stores the collected information in a structured dataset.

Below is a screenshot of the Dell website.

<div style="text-align: center;">
<img src="./screenshot.png" alt="Dell Website Screenshot" width="1000">
</div>


**Scraping Algorithm**

In [201]:
# Import packages for data processing, web scraping, and outputting data
exec(open("./scripts/import_packages.py").read())
import os

In [202]:
# Initialize web driver with a visible (non-headless) browser window, which is the default
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=options)

In [203]:
# Define URL
starting_url = 'https://www.dell.com/en-us/shop/dell-laptops/scr/laptops?gacd=9684992-1111-5761040-266906002-0&dgc=ST&SA360CID=71700000115908354&gad_source=1&gclid=CjwKCAjw9p24BhB_EiwA8ID5BsGCMi_neCHdr-GymQFfdMeN5_DdQNlN0C_yx6PxPJ6C_v7zFZgd9hoCH7AQAvD_BwE&gclsrc=aw.ds'
# Open website
driver.get(starting_url)

In [204]:
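# Pre-extraction quality check: count the product title containers on page 1
search_qualitycheck = driver.find_element('xpath', '//div[@class="variant-stack-layout"]')

# Note: an XPath that starts with '//' is evaluated against the whole document,
# even when called on an element, so this counts every match on the page
search_qualitycheck_elements = search_qualitycheck\
    .find_elements('xpath', '//main/div/div/section/div/div/div/div/div/div/article/section/div[@class = "ps-variant-title-container"]')

num_elements = len(search_qualitycheck_elements)
print(num_elements)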

12


As shown above, the number of elements I can extract from the XPath I found in the browser's developer tools is the same as the number of items displayed on the page (12, as shown in the screenshot). Each element contains the detailed information for one laptop. I use this as a pre-extraction quality check to make sure I am targeting the right part of the page.
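The same count check can be rehearsed offline before pointing the browser at the live site. The sketch below is only an illustration: it uses Python's standard-library `html.parser` instead of Selenium, and `sample_html`, `ClassCounter`, and `count_containers` are hypothetical names introduced here, not part of the project code.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags of a given name that carry a given CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the class value may be None
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.count += 1


def count_containers(html, tag="div", css_class="ps-variant-title-container"):
    """Return how many matching containers appear in the HTML string."""
    parser = ClassCounter(tag, css_class)
    parser.feed(html)
    return parser.count


# Hypothetical snippet standing in for the real page source
sample_html = """
<div class="variant-stack-layout">
  <article><section><div class="ps-variant-title-container">Laptop A</div></section></article>
  <article><section><div class="ps-variant-title-container">Laptop B</div></section></article>
  <article><section><div class="other">Not a product title</div></section></article>
</div>
"""

found = count_containers(sample_html)
expected = 2  # number of items visible in the hypothetical snippet
print(found == expected)
```

If the two numbers disagree on the real page, the selector is probably matching the wrong elements or the page has not finished loading.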

In [205]:
# Create an empty list
data = []

for j in range(1, 7): # Scrape data for multiple pages with loop
    search_1 = driver.find_element('xpath', '//div[@class="variant-stack-layout"]')
    html_code = search_1.get_attribute("outerHTML")
    parse_code = BeautifulSoup(html_code, "html.parser")
    
    # Find all product blocks
    product_blocks = parse_code.find_all('section', class_='ps-show-hide')
    
    # Collect information
    for product_block in product_blocks:
        # Initialize variables with None (for possible missing values)
        product_name = None
        model = None
        price = None
        rating = None
        review_count = None
        display_size = None

        # Extract product name
        product_element = product_block.find('h3', class_='ps-title')
        product_name = product_element.get_text(strip=True)
        
        # Extract model
        model_element = product_block.find('span', class_='ps-model-title')
        model = model_element.find_next_sibling('span').get_text(strip=True)
        
        # Extract price
        price_button = product_block.find('button', string=lambda text: "Starting at" in text if text else False)
        price_element = price_button.find_next_sibling('span')
        price = price_element.get_text(strip=True) 

        # Extract number of reviews
        review_count_element = product_block.find('meta', itemprop='reviewCount')
        if review_count_element: # some items don't have reviews
            review_count = review_count_element['content']  # Extracts '15498'

        # Extract average star rating
        rating_element = product_block.find('span', class_='ratings-text')
        if rating_element:  # some items don't have ratings
            rating = rating_element.get_text(strip=True).split()[0]

        # Extract display size
        display_label = product_block.find('div', string="Display")
        display_size_element = display_label.find_next_sibling('div', class_='spec-value')
        display_size = display_size_element.get_text(strip=True)

        # Add variables to the data list
        data.append({
            "Product Name": product_name,
            "Model": model,
            "Price": price,
            "Reviews Count": review_count,
            "1-5 Star Rating": rating,
            "Display Size": display_size,
            "Page": j,
        })
    
    # Click the "next" button to go to the next page (navigation)
    if j < 6:
        next_button = driver.find_element('xpath', '//button[@class="dds__button dds__button--tertiary dds__button--sm dds__pagination__next-page"]')
        next_button.click()
        time.sleep(3)

# Convert raw data to a DataFrame
result_df = pd.DataFrame(data)

In [206]:
# Remove $ from price and convert to float
result_df['Price'] = result_df['Price'].replace({r'\$': '', ',': ''}, regex=True).astype(float)

# Remove " from display size and convert to float
result_df['Display Size'] = result_df['Display Size'].replace({r'"': ''}, regex=True).astype(float)

# Convert reviews count and star rating to float
result_df['Reviews Count'] = result_df['Reviews Count'].astype(float)
result_df['1-5 Star Rating'] = result_df['1-5 Star Rating'].astype(float)

In [207]:
# Create a csv file
output_file_path = os.path.expanduser("~/Desktop/Final_Project_1_Data_Linchuan_Zhang.csv")
result_df.to_csv(output_file_path, index=False)

**Results**

In [185]:
# Overview of the extracted data
result_df

| | Product Name | Model | Price | Reviews Count | 1-5 Star Rating | Display Size | Page |
|---|---|---|---|---|---|---|---|
| 0 | Inspiron 15 Laptop | 3520 | 279.99 | 15498.0 | 4.3 | 15.6 | 1 |
| 1 | XPS 15 Laptop | 9530 | 1099.00 | 5332.0 | 4.4 | 16.3 | 1 |
| 2 | Latitude 5550 Laptop | 5550 | 1129.00 | 347.0 | 4.5 | 15.6 | 1 |
| 3 | Alienware m18 R2 Gaming Laptop | R2 | 1899.99 | 720.0 | 4.6 | 18.0 | 1 |
| 4 | Inspiron 16 Laptop | 5640 | 549.99 | 765.0 | 4.5 | 16.0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 62 | Inspiron 16 Laptop | 5630 | 599.99 | 3531.0 | 4.5 | 16.0 | 6 |
| 63 | Inspiron 16 Plus Laptop | 7630 | 899.99 | 1179.0 | 4.4 | 16.0 | 6 |
| 64 | Latitude 9440 2-in-1 | 9440 2-in-1 | 1589.00 | 237.0 | 4.6 | 14.0 | 6 |
| 65 | Latitude 5540 Laptop | 5540 | 879.00 | 1746.0 | 4.5 | 15.6 | 6 |


In [186]:
# Quality check

# Number of observations
total_observations = len(result_df)

# Check for missing values in the six extracted columns of the dataset
missing_values = {
    "product_name": result_df["Product Name"].isnull().sum(),
    "model": result_df["Model"].isnull().sum(),
    "price": result_df["Price"].isnull().sum(),
    "review": result_df["Reviews Count"].isnull().sum(),
    "rating": result_df["1-5 Star Rating"].isnull().sum(),
    "display_size": result_df["Display Size"].isnull().sum()
}

# Present a table counting the number of observations and missing values
summary_table_1 = pd.DataFrame({
    "Total Observations": [total_observations] * 6,
    "Missing Values": [missing_values["product_name"], missing_values["model"],missing_values["price"],
                       missing_values["review"], missing_values["rating"],missing_values["display_size"]]
}, index=["Product Name", "Model", "Price", "Reviews Count","1-5 Star Rating","Display Size"])

summary_table_1

| | Total Observations | Missing Values |
|---|---|---|
| Product Name | 67 | 0 |
| Model | 67 | 0 |
| Price | 67 | 0 |
| Reviews Count | 67 | 5 |
| 1-5 Star Rating | 67 | 5 |
| Display Size | 67 | 0 |


As shown above, the dataset contains 67 observations. The only columns with missing values are "Reviews Count" and "1-5 Star Rating", with five missing values each. After checking the website, I find that five items do not have this information available, so the missingness is not caused by errors in my data extraction (see the screenshot of the website below).

<div style="text-align: center;">
<img src="./screenshot2.png" alt="Dell Website Screenshot" width="1000">
</div>
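The missing-value tally above can also be produced in one step with `DataFrame.isna().sum()`, and it is worth confirming that the rows without a review count are the same rows without a rating. The sketch below is a minimal illustration on a hypothetical three-row sample (`sample_df` is made up for the example, not the scraped data).

```python
import pandas as pd

# Hypothetical sample; None marks a product whose review section is absent
sample_df = pd.DataFrame({
    "Product Name": ["Laptop A", "Laptop B", "Laptop C"],
    "Reviews Count": [120.0, None, 45.0],
    "1-5 Star Rating": [4.5, None, 4.2],
})

# One row per column: total observations and number of missing values
summary = pd.DataFrame({
    "Total Observations": len(sample_df),
    "Missing Values": sample_df.isna().sum(),
})
print(summary)

# True when review counts and ratings are missing on exactly the same rows
same_rows = bool(
    (sample_df["Reviews Count"].isna() == sample_df["1-5 Star Rating"].isna()).all()
)
print(same_rows)
```

If `same_rows` were `False` on the real data, some product would have a rating without a review count (or the reverse), which would point to an extraction problem rather than a product with no reviews.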

In [187]:
# Additional quality check

# Remove potential repeated rows
unique_df = result_df.drop_duplicates(subset=['Product Name', 'Model'])

# Group by "page", count number of observations per page
summary_table_2 = unique_df.groupby('Page').agg(
    observations=('Product Name', 'size')
)

summary_table_2

| Page | observations |
|---|---|
| 1 | 12 |
| 2 | 12 |
| 3 | 12 |
| 4 | 12 |
| 5 | 12 |
| 6 | 7 |


As shown above, I collected 12 non-repeated rows from each of pages 1-5 and all 7 non-repeated rows from page 6, for a total of 67. This quality check guards against two failure modes: items silently missing from a page (for example, because the page loaded slowly on my computer) and the same page being scraped twice because the click on the "next" button had not taken effect yet.
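The same per-page check can be run on the raw list of dictionaries before it is converted to a DataFrame. The sketch below is an assumption-laden illustration: `pages_with_unexpected_counts` is a hypothetical helper, and the sample rows and expected counts are invented to show one failing page.

```python
from collections import Counter


def pages_with_unexpected_counts(rows, expected):
    """Return {page: (found, expected)} for pages whose unique item count is off.

    Items are de-duplicated on (Product Name, Model); a repeated item is
    credited only to the first page on which it appears.
    """
    seen = set()
    found = Counter()
    for row in rows:
        key = (row["Product Name"], row["Model"])
        if key not in seen:
            seen.add(key)
            found[row["Page"]] += 1
    return {
        page: (found.get(page, 0), n)
        for page, n in expected.items()
        if found.get(page, 0) != n
    }


# Hypothetical rows: page 2 repeats an item from page 1 (e.g. a stale page)
sample_rows = [
    {"Product Name": "Laptop A", "Model": "1000", "Page": 1},
    {"Product Name": "Laptop B", "Model": "2000", "Page": 1},
    {"Product Name": "Laptop A", "Model": "1000", "Page": 2},
    {"Product Name": "Laptop C", "Model": "3000", "Page": 2},
]

problems = pages_with_unexpected_counts(sample_rows, expected={1: 2, 2: 2})
print(problems)
```

For this project the expected counts would be 12 for pages 1-5 and 7 for page 6; an empty result means every page produced the expected number of distinct items.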

In [188]:
# Check the data types of each column, and find numeric variables
result_df.dtypes

Product Name        object
Model               object
Price              float64
Reviews Count      float64
1-5 Star Rating    float64
Display Size       float64
Page                 int64
dtype: object

In [198]:
# Mean and variance for numeric variables

price_mean = result_df['Price'].mean()
price_variance = result_df['Price'].var()
review_mean = result_df['Reviews Count'].mean()
review_variance = result_df['Reviews Count'].var()
rating_mean = result_df['1-5 Star Rating'].mean()
rating_variance = result_df['1-5 Star Rating'].var()
display_size_mean = result_df['Display Size'].mean()
display_size_variance = result_df['Display Size'].var()

summary_table_3 = pd.DataFrame({
    "Mean": [price_mean, review_mean, rating_mean, display_size_mean],
    "Variance": [price_variance, review_variance, rating_variance, display_size_variance]
}, index=["Price", "Reviews Count", "1-5 Star Rating", "Display Size"])

# with two decimal places
summary_table_3 = summary_table_3.astype(float).map(lambda x: f"{x:.2f}")
summary_table_3

| | Mean | Variance |
|---|---|---|
| Price | 1234.81 | 521909.12 |
| Reviews Count | 713.32 | 4346267.21 |
| 1-5 Star Rating | 4.4 | 0.03 |
| Display Size | 14.5 | 2.28 |
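A note on how to read the variance column: by default, pandas' `.var()` returns the sample variance (it divides by n - 1) and skips missing values, so the Reviews Count and 1-5 Star Rating statistics are computed over the 62 laptops that have those values. The sketch below reproduces that convention with the standard library on a hypothetical list of ratings; the numbers are made up for illustration.

```python
import statistics

# Hypothetical ratings; None stands for a laptop with no rating
ratings = [4.3, 4.4, None, 4.5, 4.6]

# Drop missing values first, mirroring the skip-missing default
observed = [r for r in ratings if r is not None]

mean = statistics.mean(observed)
sample_var = statistics.variance(observed)  # divides by n - 1

# The same quantity written out explicitly
manual_var = sum((x - mean) ** 2 for x in observed) / (len(observed) - 1)

print(len(observed), round(mean, 2), round(sample_var, 4))
```

With only a handful of observations the n - 1 denominator matters; at 62 or 67 observations the difference from the population variance is small.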


**Discussion (conclusion)**

I use two nested loops to extract information from the website. The outer loop navigates automatically from page 1 to page 6. Within it, an inner loop extracts the information for each laptop displayed on the current page (12 items on pages 1-5 and 7 on page 6). As a result, I successfully extract data on 67 laptops from the website.
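In outline, the nested structure looks like the sketch below. The page contents are placeholder strings (hypothetical) standing in for the parsed product blocks, and `collect` is an illustrative name; the real code reads each page from the browser and clicks the "next" button between pages.

```python
def collect(pages):
    """Flatten a list of pages into one list of row dictionaries."""
    data = []
    # Outer loop: one iteration per results page, numbered from 1
    for page_number, items in enumerate(pages, start=1):
        # Inner loop: one iteration per product on the current page
        for item in items:
            data.append({"Product Name": item, "Page": page_number})
    return data


# Placeholder pages: five full pages of 12 items and a final page of 7
pages = [[f"laptop-{p}-{i}" for i in range(12)] for p in range(1, 6)]
pages.append([f"laptop-6-{i}" for i in range(7)])

rows = collect(pages)
print(len(rows))  # 5 * 12 + 7 = 67
```

Recording the page number on every row is what makes the per-page quality check in the Results section possible.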

After the quality checks, I confirm that the number of rows in the dataset matches the total number of laptops shown on the website, and that no missing values were caused by my extraction. The data extraction process is successful.

As shown in the Results section, the dataset contains 67 laptop items. Five items on the website don't have information on "Reviews Count" and "1-5 Star Rating". The mean and variance for numeric variables, including price, number of reviews, average star rating, and display size, are shown in the Results section.

A more detailed analysis will be presented in Project 2.