# Named Entity Recognition and Web Scraping for Products
 In this Jupyter Notebook, I extract product names from raw text. The raw text is produced by web scraping and cleaned as much as possible.
 This project has 2 main parts:
 1. **Web scraping multiple links** (704 in total). A page is kept only if it can be accessed and its text can be extracted; the clean text of each successfully scraped page is appended to a list called `df_contents`, so elements [0], [1], ..., [n-1] each hold the text of one page.
 2. **Performing Named Entity Recognition** on each of the extracted page texts, and collecting the products found on each page in a new list called `extracted_products`. This is done using the `spacy` library and one of its pre-trained models.

In [None]:
# Importing basic libs and web scraping libs
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import requests
from bs4.element import Comment
import urllib.request


In [11]:
#reading the link dataset provided
df = pd.read_csv("furniture_stores_pages.csv")

In [12]:
# checking that the dataset was loaded successfully, by printing it
print(df)

                                             max(page)
0    https://www.factorybuys.com.au/products/euro-t...
1      https://dunlin.com.au/products/beadlight-cirrus
2    https://themodern.net.au/products/hamar-plant-...
3    https://furniturefetish.com.au/products/oslo-o...
4            https://hemisphereliving.com.au/products/
..                                                 ...
699  https://signaturefinefurniture.ca/products/jac...
700  http://aonefurniture.in/userdata/products/no%2...
701  https://furnituremama.com/products/three-door-...
702                    https://gfurniture.ca/products/
703  https://www.mwfurnitureoutlet.com/products/dre...

[704 rows x 1 columns]


In [13]:
# renaming the column max(page) to something a bit more... intuitive?
df.rename(columns={"max(page)":"Link"}, inplace=True)

## Defining functions for extracting relevant data from web-pages without gibberish
Here I define functions that remove any gibberish from the extracted text, such as HTML comments, style sheets, scripts, and the contents of head, title and meta elements.

The first function, `tag_visible()`, returns `False` for any piece of text that sits inside an element that should be excluded from the extraction, and for HTML comments.

The second function, `text_from_html()`, extracts the web-page content and keeps only the text accepted by `tag_visible()`.

Finally, I test if the extraction is done correctly, and only the text I require (clean text) remains.

In [14]:
# decides whether a piece of text is kept: text inside non-visible tags and HTML comments is dropped
def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

# extracts every piece of text from the page and keeps only those accepted by tag_visible()
def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.find_all(string=True)  # find_all is the current name of the older findAll alias
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)

#testing if the function extracts relevant text only
html = urllib.request.urlopen('https://www.artificialintelligence-news.com/2023/11/20/microsoft-recruits-former-openai-ceo-sam-altman-co-founder-greg-brockman/').read()
print(text_from_html(html))

             Menu                                                ARTICLE     LOG IN                               News  Categories   Applications   Chatbots  Face Recognition  Virtual Assistants  Voice Recognition    Companies   Amazon  Apple  Google  Meta (Facebook)  Microsoft  NVIDIA    Deep & Reinforcement Learning  Enterprise  Ethics & Society  Industries   Energy  Entertainment & Retail  Gaming  Healthcare  Manufacturing    Legislation & Government  Machine Learning  Privacy  Research  Robotics  Security  Surveillance    Events  Webinars & Resources  Work With Us   Work With AI News  About AI News  Contact Us    Subscribe / Login  TechForge   The Block  Cloud Tech News  Developer News  Edge Computing News  IoT News  Marketing Tech News  Telecoms Tech News    Upcoming Events                      News  Categories   Applications   Chatbots  Face Recognition  Virtual Assistants  Voice Recognition    Companies   Amazon  Apple  Google  Meta (Facebook)  Microsoft  NVIDIA    Deep & Reinfo
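The printed text above still contains long runs of spaces where tags used to be. A minimal sketch of how they could be collapsed before further processing; `normalize_whitespace` is a hypothetical helper of mine, not something the notebook runs:

```python
def normalize_whitespace(text):
    """Collapse every run of spaces, tabs and newlines into a single space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces
    return " ".join(text.split())
```

It could be applied to the return value of `text_from_html()`, for example `normalize_whitespace(text_from_html(html))`.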

## Extracting text from all web-pages provided

Now that I know the extraction functions work like a charm, it's time to run the extraction on every link in the dataset provided.

To do so, I create a variable x that starts from 0 and is incremented for each link in the list. The `element` variable could be used instead, I know - but I am a C guy, and I like to C (see) important variables, heh. Don't judge!

Then, I define the `df_contents` list that will contain the extracted text from each web page.

Then, I simply iterate through the links and extract the text from each one, if the page can be accessed. This is the same test done above on a random link, but applied to all of the links in the dataset you provided.

I also use a try-except pair to skip the links that cannot be scraped and carry on with those that can.

In [15]:
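x=0
df_contents=[]
for element in df.iterrows():
    url = df["Link"][x]
    try:
        scraped_text = urllib.request.urlopen(url).read()
        df_contents.append(text_from_html(scraped_text))
    except Exception:
        # Exception instead of a bare except, so that interrupting the kernel still stops the loop
        print("(",x,")","This website could not be scraped-> ",url)
    # advance the counter whether or not the page was scraped
    x=x+1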


( 3 ) This website could not be scraped->  https://furniturefetish.com.au/products/oslo-office-chair-white
( 4 ) This website could not be scraped->  https://hemisphereliving.com.au/products/
( 5 ) This website could not be scraped->  https://home-buy.com.au/products/bridger-pendant-larger-lamp-metal-brass
( 7 ) This website could not be scraped->  https://beckurbanfurniture.com.au/products/page/2/
( 9 ) This website could not be scraped->  https://edenliving.online/collections/summerloving/products/nice-lounge-1
( 10 ) This website could not be scraped->  https://www.ourfurniturewarehouse.com.au/products/athens-3pce-lounge-includes-2x-armless-3-seater-and-corner-ottoman-in-grey-storm-fabric?_pos=1&_sid=9f9ca4320&_ss=r
( 11 ) This website could not be scraped->  https://cane-line.co.uk/collections/news-2020/products/breeze-sunbed-5569
( 12 ) This website could not be scraped->  https://haute-living.com/products/beam-desk
( 13 ) This website could not be scraped->  https://www.knoll.com

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


( 153 ) This website could not be scraped->  https://www.dimshome.com/products/eave-desk
( 157 ) This website could not be scraped->  https://sjotime.com/products/notch-desk
( 158 ) This website could not be scraped->  https://www.puji.com/www.puji.com/products/urban-industrial-bistro-dining-table
( 159 ) This website could not be scraped->  https://littlepartners.com/products/bentwood
( 161 ) This website could not be scraped->  https://www.shoptorious.com/products/nova-domus-jagger-modern-dark-grey-walnut-bedroom-set
( 165 ) This website could not be scraped->  https://www.arighibianchi.co.uk/collections/new-in-homepage/products/birds-coat-rack
( 166 ) This website could not be scraped->  https://idlehands.design/products/page/2/
( 167 ) This website could not be scraped->  https://www.diyfurniturestore.com/products/inside-out
( 168 ) This website could not be scraped->  https://pastperfect.sg/products/
( 171 ) This website could not be scraped->  https://apato.com.au/products/work/k

Now, I simply check how many websites are successfully extracted, using the len() function.

In [16]:

len(df_contents)

268

There are 268 in total, out of 704 links - so 436 were skipped because the request or the text extraction raised an error (the loop above does not record the reason).
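The loop above only prints which links failed. Here is a minimal sketch of how the reason for each failure could be recorded as well, while keeping every page tied to its original row number. The names `fetch_html` and `scrape_all`, the 10-second timeout and the browser-like `User-Agent` header are assumptions of mine for illustration, not part of the run above. Some sites reject the default `urllib` user agent, so sending a header may (or may not) recover a few more pages.

```python
import urllib.request

def fetch_html(url, timeout=10):
    """Download one page; the User-Agent value and timeout are assumptions."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()

def scrape_all(urls, fetch=fetch_html):
    """Return (pages, failures), both keyed by the position of the URL in `urls`."""
    pages = {}
    failures = {}
    for index, url in enumerate(urls):
        try:
            pages[index] = fetch(url)
        except Exception as error:
            # keep the error type and message so failures can be inspected later
            failures[index] = f"{type(error).__name__}: {error}"
    return pages, failures
```

With the dataset loaded above, this could be called as `pages, failures = scrape_all(df["Link"])`, and each value in `pages` passed to `text_from_html()`.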

Now, let's print one of them to see what's in it. We print the first element (index 0).

In [23]:
df_contents[0]

"                                 Skip to content      FREE SHIPPING  ALL MATTRESSES*   Buy Now Pay Later Available!   Fast Shipping Australia Wide!                                           Home Furniture             Home Furniture   Bedroom            Bedroom   Mattresses   Bed Frames   Bedroom Packages   Bedsides   Wardrobes   Tallboys   Dressing Tables   Headboards   Ottomans   Jewellery Cabinets     Dining            Dining   Bar Stools   Dining Chairs   Dining Tables   Dining Sets   Sideboards & Buffets   Kitchen Benches     Living Room            Living Room   Sofas   TV Units   Recliner Chairs   Coffee & Side Tables   Chairs   Massage Chairs   Hall Tables   Bean Bags   Arm Chairs   Futons & Daybeds   Room Dividers   Living Room Packages     Office            Office   Office Chairs   Office Massage Chairs   Desks   Sit Stand Desks   Monitor Stands   Laptop Desks   Filing Cabinets   Office Accessories   Office Packages     LED Furniture            LED Furniture   LED Bed Frames  

# End of Web Scraping Part

# Named Entity Recognition Using spaCy

For this task, we are going to use the `spacy` library for Python, and a pre-trained transformer-based pipeline it provides, called `en_core_web_trf`. It is larger and slower than the small, CPU-optimized pipelines such as `en_core_web_sm`, but is generally more accurate.

The model [can be found at this link (click)](https://huggingface.co/spacy/en_core_web_trf), if you want to look at it.

Other pre-trained models can also be used, but let's hope this one will do the job.

I will first try the extraction on a single element (text) from the `df_contents` list, and if that works, I will iterate over the rest.

In [38]:
#to install spacy run below
#!pip3 install spacy
import spacy

In [None]:
# this is used to download the needed transformer model
#!python -m spacy download en_core_web_trf

In [39]:
#loading the model
nlp = spacy.load('en_core_web_trf')

In [28]:
# defining the categories to keep - one in this case, PRODUCT
category = ["PRODUCT"]

In [29]:
# we take one element from the list
test1 = df_contents[1]

In [31]:
# Now we run the spaCy pipeline (tokenization and entity recognition) on the text
doc = nlp(test1)

 Now that we have processed the text with `en_core_web_trf`, we can extract the relevant information we want (the category defined above).

We do that by going through the entities found in the text and keeping those labeled `PRODUCT`.

In [32]:
# Identifying named entities from the text
entities = []
for ent in doc.ents:
    if ent.label_ in category:
        entities.append((ent.text,ent.label_))
       

Now, if everything went all right, we can print all of the products extracted.

In [40]:
# `label` is used as the loop variable so the `category` list defined above is not overwritten
for entity,label in entities:
    print(f"{entity}:{label}")

Cirrus:PRODUCT
Beadlight:PRODUCT
Beadlight:PRODUCT
Cirrus:PRODUCT
Beadlight:PRODUCT
Beadlight:PRODUCT
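The same product shows up several times in the output above. A minimal sketch of how repeats could be dropped while keeping the original order; `unique_products` is a hypothetical helper, and treating names that differ only in case or spacing as the same product is my assumption:

```python
def unique_products(entities):
    """Return product names from (text, label) pairs, without repeats, in first-seen order."""
    seen = set()
    unique = []
    for text, _label in entities:
        name = " ".join(text.split())   # trim and collapse stray whitespace
        key = name.casefold()           # compare names case-insensitively
        if name and key not in seen:
            seen.add(key)
            unique.append(name)
    return unique
```

For the page above, `unique_products(entities)` would keep one `Cirrus` and one `Beadlight`.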


We can also visualize the results using the `spacy` library.

In [34]:
spacy.displacy.render(doc,style="ent")

# Iteratively extracting all of the products from all of the extracted data

Everything is now set, and I can start extracting iteratively from all links.

To do that, we write a for loop that reuses the code above.


In [53]:
extracted_products=[]
for text in df_contents:
    doc = nlp(text)
    # Identifying named entities from the text
    entities = []
    for ent in doc.ents:
        if ent.label_ in category:
            entities.append((ent.text,ent.label_))
    extracted_products.append(entities)
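Calling `nlp()` once per text works, but spaCy also provides `nlp.pipe`, which processes texts in batches. A hedged sketch of the same extraction written that way; the function name `extract_products` and the batch size of 8 are assumptions of mine, and I have not measured whether it is faster on this dataset:

```python
def extract_products(nlp, texts, labels=("PRODUCT",), batch_size=8):
    """Run a loaded spaCy pipeline over `texts` and keep entities whose label is in `labels`."""
    results = []
    # nlp.pipe yields one Doc per input text, in the same order as `texts`
    for doc in nlp.pipe(texts, batch_size=batch_size):
        results.append([(ent.text, ent.label_) for ent in doc.ents if ent.label_ in labels])
    return results
```

With the model loaded above, this could be called as `extracted_products = extract_products(nlp, df_contents)`.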

# Visualizing the result
Now that's done, let's print and see the final result!

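Below, you will see one list of products per scraped page, from the first to the last. The number shown is the position of the page in `df_contents`, which skips the links that failed, so it does not always match the row number in the original dataset.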

In [58]:
for z, products in enumerate(extracted_products):
    print("The extracted products for link >>>",z,"<<< are:\n",products)
    print("\n \n")

The extracted products for link >>> 0 <<< are:
 [('Sofas             ', 'PRODUCT'), ('Chaise Sofas', 'PRODUCT'), ('Sofas', 'PRODUCT'), ('Chaise Sofas', 'PRODUCT'), ('Euro Top Mattress', 'PRODUCT'), ('Euro top', 'PRODUCT'), ('Euro Top Mattress', 'PRODUCT'), ('Euro top', 'PRODUCT')]

 

The extracted products for link >>> 1 <<< are:
 [('Cirrus', 'PRODUCT'), ('Beadlight', 'PRODUCT'), ('Beadlight', 'PRODUCT'), ('Cirrus', 'PRODUCT'), ('Beadlight', 'PRODUCT'), ('Beadlight', 'PRODUCT')]

 

The extracted products for link >>> 2 <<< are:
 [('The Hamar Plant Stand', 'PRODUCT')]

 

The extracted products for link >>> 3 <<< are:
 [('Sofas', 'PRODUCT'), ('Grado  ', 'PRODUCT'), ('Grado', 'PRODUCT'), ('Citta  Mist Vase', 'PRODUCT')]

 

The extracted products for link >>> 4 <<< are:
 [('e15    ', 'PRODUCT'), ('Elan Plus', 'PRODUCT'), ('NaughtOne', 'PRODUCT'), ('Noritake', 'PRODUCT'), ('Skupa', 'PRODUCT'), ('Stellar', 'PRODUCT'), ('Living Edge', 'PRODUCT'), ('Living Edge', 'PRODUCT')]

 

The extrac

We can also simply print all of the extracted products as one flat list, as seen below. It only takes 2 nested for loops :)!


In [67]:
for page_products in extracted_products:
    for product in page_products:
        print(product)

('Sofas             ', 'PRODUCT')
('Chaise Sofas', 'PRODUCT')
('Sofas', 'PRODUCT')
('Chaise Sofas', 'PRODUCT')
('Euro Top Mattress', 'PRODUCT')
('Euro top', 'PRODUCT')
('Euro Top Mattress', 'PRODUCT')
('Euro top', 'PRODUCT')
('Cirrus', 'PRODUCT')
('Beadlight', 'PRODUCT')
('Beadlight', 'PRODUCT')
('Cirrus', 'PRODUCT')
('Beadlight', 'PRODUCT')
('Beadlight', 'PRODUCT')
('The Hamar Plant Stand', 'PRODUCT')
('Sofas', 'PRODUCT')
('Grado  ', 'PRODUCT')
('Grado', 'PRODUCT')
('Citta  Mist Vase', 'PRODUCT')
('e15    ', 'PRODUCT')
('Elan Plus', 'PRODUCT')
('NaughtOne', 'PRODUCT')
('Noritake', 'PRODUCT')
('Skupa', 'PRODUCT')
('Stellar', 'PRODUCT')
('Living Edge', 'PRODUCT')
('Living Edge', 'PRODUCT')
('A Vast Icon', 'PRODUCT')
('the Samson Daybed', 'PRODUCT')
('Prima Linen', 'PRODUCT')
('Sophie Sky Grey', 'PRODUCT')
('Ease Silver Dollar', 'PRODUCT')
('Taylor Felt Grey', 'PRODUCT')
('Romo Moon', 'PRODUCT')
('Earl Grey', 'PRODUCT')
('Romo Paprika', 'PRODUCT')
('Mont Blanc Smoke', 'PRODUCT')
('Nottin
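The flat list above is long and full of repeats. A small sketch of how it could be summarized by counting how often each name was extracted across all pages; `count_products` is a hypothetical helper, not part of the run above:

```python
from collections import Counter

def count_products(extracted_products):
    """Count how many times each product name appears across all pages."""
    counts = Counter()
    for page_products in extracted_products:
        for text, _label in page_products:
            name = " ".join(text.split())   # trim and collapse stray whitespace
            if name:
                counts[name] += 1
    return counts
```

For example, `count_products(extracted_products).most_common(10)` would list the ten most frequently extracted names.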

# Final Words

This was quite an interesting project, which actually taught me how to extract cleaner text than I had before. I used to rely on a worse technique for extracting text from HTML pages, which wasn't that good at returning clean text. Now I learned how to do it better!

About your suggestion of using PySpark - that is what I tried doing the first time. However, for some reason that eludes me, my dual-boot Linux distro broke, and I cannot access it, so I would need to reinstall.

I tried doing it with PySpark on Windows, set up all the environment variables and installed all of the things needed, and it still didn't work - so I gave up on it.

I will try it again after I fix my Linux Dual-Boot, and will do the same NER as I did here - maybe with a custom trained network, this time.

--------------------------------------
Anyhow, it was a nice project. I look forward to working with you, if I am selected as a candidate and if everything I did here was OK.

