### Web Scraping with Beautiful Soup
As the saying goes, practice makes perfect. In this project, I worked on another web scraping task, this time focusing on Etsy, a global e-commerce platform known for its unique and creative goods, from handcrafted pieces to vintage treasures. My goal was to explore the Kenyan products available on the platform. I used Requests to fetch the pages, Beautiful Soup to parse them, and pandas to clean the data for analysis.

* I'll start by importing the necessary libraries.

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

In [2]:
#defining the url
url ="https://www.etsy.com/market/kenya?ref=pagination&page="
headers ={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36", 
    "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8"
}

#initializing an empty list that will store our data
all_products =[]

#iterating over the 50 pages
for page in range(1,51):
    page_url =f"{url}{page}"
    #a timeout stops the loop from hanging on a stalled request
    response =requests.get(page_url,headers=headers,timeout=30)

    if response.status_code == 200:
        soup =bs(response.text,"html.parser")
        products =soup.find_all("div",class_="v2-listing-card__info")
        for product in products:
            name =product.find("h2",class_="wt-text-caption")
            currency =product.find("span",class_="currency-symbol")
            price =product.find("span",class_="currency-value")
        
            name =name.get_text(strip=True) if name else None
            currency =currency.get_text(strip=True) if currency else None
            price =price.get_text(strip=True) if price else None
        
            goods ={
                "Name":name,
                "Currency":currency,
                "Price":price
            }
        
            all_products.append(goods)
    else:
        print(f"Failed to retrieve page {page}: {response.status_code}")

df =pd.DataFrame(all_products)
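
The per-card extraction in the loop can be tried offline on a small HTML string before running it against the live site. Below is a minimal sketch, assuming the same class names as the scraper above. The `parse_cards` and `text_or_none` helpers and the sample markup are hypothetical, written for illustration and not copied from Etsy.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the class names used in the scraper above.
SAMPLE_HTML = """
<div class="v2-listing-card__info">
  <h2 class="wt-text-caption"> Sample Tea Tin </h2>
  <span class="currency-symbol">USD</span>
  <span class="currency-value">13.99</span>
</div>
<div class="v2-listing-card__info">
  <h2 class="wt-text-caption">Card With No Price</h2>
</div>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag else None


def parse_cards(html):
    """Extract name, currency and price from every listing card in html."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="v2-listing-card__info"):
        rows.append({
            "Name": text_or_none(card.find("h2", class_="wt-text-caption")),
            "Currency": text_or_none(card.find("span", class_="currency-symbol")),
            "Price": text_or_none(card.find("span", class_="currency-value")),
        })
    return rows


cards = parse_cards(SAMPLE_HTML)
print(cards)
```

The second sample card has no price on purpose: it shows that a missing element ends up as `None` instead of raising an error, which is the same guard the loop uses.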

* Let's check our data to see if it's tidy.

In [3]:
display(df.head(10))
print(f"Dataset shape: {df.shape}")
print(f"\nDuplicates present: {df.duplicated().any()}")
print(f"\nMissing values present:\n{df.isna().any()}")
print(f"\nData types:\n{df.dtypes}")

|  | Name | Currency | Price |
|---|---|---|---|
| 0 | Mount Kenya Poster -Chogoria Route Art -Mount ... | USD | 16.48 |
| 1 | Kenyan Map-Shaped Wooden Jewelry Tray | USD | 90.0 |
| 2 | Kericho Gold Kenyan Tea (Black Tea, 100 Envelo... | USD | 13.99 |
| 3 | Pallasite Meteorite slice, 831g Kenyan meteori... | USD | 2493.0 |
| 4 | Kenya Cookbook - Uncover the Rich and Diverse ... | USD | 3.48 |
| 5 | Swahili Wall Art Set of 4 Definition Prints Ka... | USD | 9.17 |
| 6 | Kenya OS PNG, exclusive design Kenya OS in png... | USD | 2.87 |
| 7 | Kenya Heart Sweatshirt, Kenya Hoodie, Kenya Sh... | USD | 32.0 |
| 8 | Mzungu Definition Men's Short Sleeve T-Shirt \|... | USD | 26.29 |
| 9 | Kericho Gold Kenyan Tea (Black Tea, 100 Envelo... | USD | 13.99 |

Dataset shape: (2746, 3)

Duplicates present: True

Missing values present:
Name        False
Currency    False
Price       False
dtype: bool

Data types:
Name        object
Currency    object
Price       object
dtype: object


* After reviewing the dataset, I found duplicate rows, confirmed there were no missing values, and saw that `Price` was stored as text (`object`) and needed converting to a number.
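
The same three checks can also be reduced to single values, which is handy when you want a count instead of a yes/no answer. This is a small sketch on an invented three-row frame, not the scraped data; the variable names are my own.

```python
import pandas as pd

# Invented sample with one repeated row and prices stored as strings.
sample = pd.DataFrame({
    "Name": ["Item A", "Item B", "Item A"],
    "Currency": ["USD", "USD", "USD"],
    "Price": ["13.99", "90.00", "13.99"],
})

# Rows that are identical to an earlier row.
n_duplicates = int(sample.duplicated().sum())
# Missing cells across all columns.
n_missing = int(sample.isna().sum().sum())
# False while the prices are still text.
price_is_numeric = pd.api.types.is_numeric_dtype(sample["Price"])

print(f"Duplicate rows: {n_duplicates}")
print(f"Missing cells: {n_missing}")
print(f"Price is numeric: {price_is_numeric}")
```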

In [4]:
#let's drop the duplicates
df.drop_duplicates(inplace=True)

#let's convert Price from text to a number:
#trim whitespace, drop everything except digits and the decimal point, then convert
df["Price"] =df["Price"].str.strip()
df["Price"] =df["Price"].str.replace(r"[^\d.]","",regex=True)
df["Price"] =pd.to_numeric(df["Price"])
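
To see what the regular expression in the cell above does to a single value, here is a minimal sketch using only the standard library. `clean_price` is a hypothetical helper and the sample strings are invented, not taken from the scraped data.

```python
import re


def clean_price(raw):
    """Strip everything except digits and dots, then convert to float."""
    digits = re.sub(r"[^\d.]", "", raw.strip())
    return float(digits)


# Invented sample strings, not values taken from the scraped data.
for raw in ["16.48", " 2,493.00 ", "1,250"]:
    print(repr(raw), "->", clean_price(raw))
```

The pattern treats a comma as a thousands separator and simply removes it, so it assumes prices use a dot for decimals. The helper would also raise a `ValueError` on an empty string, so it is only a sketch for well-formed values.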

#### Data preprocessing
Since we are looking at Kenyan products, we convert the prices from US dollars to Kenyan shillings, using a fixed rate of 130 Ksh to the dollar.

In [5]:
df["Currency"] =df["Currency"].str.replace("USD","Ksh")
df["Price"] =df["Price"] * 130 
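
A slightly more defensive version of the conversion names the rate and only touches rows that are actually priced in USD. This is a sketch on an invented two-row frame; `USD_TO_KSH` is my own name for the fixed multiplier of 130 used above, and a real exchange rate changes daily.

```python
import pandas as pd

# Assumed fixed rate, taken from the multiplier used in the cell above.
USD_TO_KSH = 130

# Invented two-row sample frame.
sample = pd.DataFrame({
    "Name": ["Item A", "Item B"],
    "Currency": ["USD", "USD"],
    "Price": [10.0, 2.5],
})

# Convert only the rows that are priced in USD.
is_usd = sample["Currency"] == "USD"
sample.loc[is_usd, "Price"] = (sample.loc[is_usd, "Price"] * USD_TO_KSH).round(2)
sample.loc[is_usd, "Currency"] = "Ksh"

print(sample)
```

Using a mask means that a listing in another currency, if one ever appeared, would be left unchanged rather than being multiplied by the wrong rate.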

In [6]:
df.head()

|  | Name | Currency | Price |
|---|---|---|---|
| 0 | Mount Kenya Poster -Chogoria Route Art -Mount ... | Ksh | 2142.4 |
| 1 | Kenyan Map-Shaped Wooden Jewelry Tray | Ksh | 11700.0 |
| 2 | Kericho Gold Kenyan Tea (Black Tea, 100 Envelo... | Ksh | 1818.7 |
| 3 | Pallasite Meteorite slice, 831g Kenyan meteori... | Ksh | 324090.0 |
| 4 | Kenya Cookbook - Uncover the Rich and Diverse ... | Ksh | 452.4 |


* Saving our new dataset as a CSV file.

In [7]:
#index=False keeps the row index out of the file
df.to_csv(r"C:\Users\USER\Web-Scraping\Etsy_Data.csv",index=False)
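
One detail worth knowing when saving: by default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again, while `index=False` leaves it out. The sketch below shows the difference on an invented one-row frame, writing to in-memory buffers instead of the path above.

```python
import io

import pandas as pd

# Invented one-row sample frame.
sample = pd.DataFrame({"Name": ["Item A"], "Currency": ["Ksh"], "Price": [1300.0]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```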