# **HOW TO USE:** 
This notebook shows how to get started quickly with the scraper: it validates a list of URLs, then extracts text and metadata from the ones that are active.

## Generic Steps (Reading URLs)

In [1]:
# Importing libraries
import os
import pandas as pd

pd.set_option("display.max_rows", 600)
pd.set_option("display.max_columns", 500)
pd.set_option("max_colwidth", 400)

# Read the example data
data = pd.read_csv(os.path.join('..', 'data', 'url_data.csv')).sample(500, random_state=42)

url_list = data.URLS.to_list()
len(url_list)

500

In [2]:
# Move up one directory to the project root, where url_entrypoint lives
os.chdir(os.path.dirname(os.getcwd()))
os.getcwd()

'c:\\Users\\manash.jyoti.konwar\\Documents\\AI_Random_Projects\\Utility-Text-Scrapper'

## Module Level Execution

In [3]:
# Importing Text Scrapper Module
from url_entrypoint import URLValidator, URLScrapper

# Importing cpu_count to size the worker pool
from multiprocessing import cpu_count

In [4]:
%%time
# %%time must be the first line of the cell; a bare %time line times an empty statement and reports 0 ns

# Running Validator
validator_instance = URLValidator(url_list=url_list, no_of_workers=2*cpu_count())
validator_instance.run_validation()


100%|██████████| 500/500 [00:46<00:00, 10.64it/s]
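The `no_of_workers` argument suggests that the validator spreads URL checks across a pool of workers. The sketch below is not the `URLValidator` implementation. It only illustrates the general pattern with the standard library's `ThreadPoolExecutor`. The `check_url` and `validate` functions and the classification rule are hypothetical stand-ins, and `check_url` performs no network I/O.

```python
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import cpu_count


def check_url(url: str) -> str:
    """Hypothetical stand-in for a real HTTP check; does no network I/O."""
    if not url.lower().startswith(("http://", "https://")):
        return "unreachable"
    # Assumption for illustration only: treat example.com as the one live host
    return "active" if "example.com" in url.lower() else "inactive"


def validate(urls, no_of_workers=2 * cpu_count()):
    """Classify URLs concurrently and group them by status."""
    groups = {"active": [], "inactive": [], "unreachable": []}
    with ThreadPoolExecutor(max_workers=no_of_workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its status
        for url, status in zip(urls, pool.map(check_url, urls)):
            groups[status].append(url)
    return groups


sample_groups = validate(
    ["https://example.com/a", "https://other.org/b", "not-a-url"]
)
```

HTTP checks spend most of their time waiting on the network, so a thread pool larger than the core count, such as `2*cpu_count()`, is a common choice for this kind of I/O-bound work.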


In [5]:
print(f'{len(validator_instance.get_active_urls)} links are active')
print(f'{len(validator_instance.get_inactive_urls)} links are inactive')
print(f'{len(validator_instance.get_unreachable_links)} links are not reachable / working')

39 links are active
450 links are inactive
11 links are not reachable / working
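The three groups should account for all 500 sampled URLs. A quick sanity check, using the counts printed above as literals (in the notebook they would come from the `validator_instance` properties), might look like this:

```python
# Counts copied from the printed output above
counts = {"active": 39, "inactive": 450, "unreachable": 11}

total = sum(counts.values())

# Share of each group as a percentage of all validated URLs
shares = {status: round(100 * n / total, 1) for status, n in counts.items()}

for status, pct in shares.items():
    print(f"{status}: {pct}%")
```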


In [6]:
%%time
# %%time must be the first line of the cell; a bare %time line times an empty statement and reports 0 ns

# Running Scrapper
scrapping_instance = URLScrapper(url_list=validator_instance.get_active_urls, no_of_workers=2*cpu_count())
scrapping_instance.run_scrapping()


100%|██████████| 39/39 [00:06<00:00,  5.72it/s]


In [7]:
scrapping_instance.get_scrapped_df

Unnamed: 0,url_id,url,text,title,author,hostname,date,categories
0,0,HTTPS://ADWAREREMOVAL.INFO/WIN32-SPY-NUMANDO-L/,"The Win32/Spy.Numando.L is considered dangerous by lots of security experts. When this infection is active, you may notice unwanted processes in Task Manager list. In this case, it is adviced to scan your computer with GridinSoft Anti-Malware.\nGridinsoft Anti-Malware\nRemoving PC viruses manually may take hours and may damage your PC in the process. We recommend using GridinSoft Anti-Malware ...",Win32/Spy.Numando.L removal guide – Adware Reports,Paul Valéry,adwareremoval.info,2020-10-18,Spy
1,1,HTTPS://ADWAREREMOVAL.INFO/ULISE-120961/,"The Ulise.120961 is considered dangerous by lots of security experts. When this infection is active, you may notice unwanted processes in Task Manager list. In this case, it is adviced to scan your computer with GridinSoft Anti-Malware.\nGridinsoft Anti-Malware\nRemoving PC viruses manually may take hours and may damage your PC in the process. We recommend using GridinSoft Anti-Malware for vir...",Ulise.120961 (file analysis) – Adware Reports,Paul Valéry,adwareremoval.info,2020-10-13,Malware
2,2,HTTPS://FINANCE.YAHOO.COM/NEWS/ALTRIA-ABANDONS-FULL-OUTLOOK-HALTS-112713091.HTML,,,,,,
3,3,HTTPS://WWW.BIZJOURNALS.COM/PRNEWSWIRE/PRESS_RELEASES/2018/12/17/DC05175,Request unsuccessful. Incapsula incident ID: 707000890374935439-230364560991917632,,,,,
4,4,HTTPS://WWW.KRISTENLOURIE.COM/VAPE-SHOP-IN-NOTTINGHAM-UK/,"Author Bio\nEkaterina Mironova\nAuthor Biograhy: Ekaterina Mironova is a co-founder of CBD Life Mag and an avid blogger on the Hemp, CBD and fashion subjects. Ekaterina is also on the panel of the CBD reviewers and she most enjoys CBD gummies. Ekaterina has developed a real interest in CBD products after she started taking CBD tincture oil to help her ease her anxiety that was part-and-parcel ...","Vape Shop in Nottingham, UK - Kristen Lourie",Admin,kristenlourie.com,2021-03-28,Uncategorized
5,5,HTTPS://WWW.FACEBOOK.COM/10157317287039775_10157318669634775,"ಫೇಸ್ಬುಕ್ ನಲ್ಲಿ WBIR Channel 10 ಕುರಿತು ಇನ್ನಷ್ಟು ನೋಡಿ\nಫೇಸ್ಬುಕ್ ನಲ್ಲಿ WBIR Channel 10 ಕುರಿತು ಇನ್ನಷ್ಟು ನೋಡಿ\nಹೊಸ ಖಾತೆಯನ್ನು ರಚಿಸಿ\nಅಥವಾ\nಪುಟದ ಇತ್ತೀಚಿನ ಪೋಸ್ಟ್\nA data breach doesn’t automatically make your credit score drop, but it does make it easier for a scammer to meddle in your finances.\nThe state of New York was hit hard with a brutal winter storm over C...hristmas weekend. The blizzard drop...",WBIR Channel 10,,facebook.com,2022-12-29,
6,6,HTTPS://SPORTSGRINDENTERTAINMENT.COM/3-BIG-DIVIDEND-STOCKS-YIELDING-7-ANALYSTS-SAY-BUY/,"Inflations fears are rising, along with the price of gasoline and lumber and milk – and, oddly, the unemployment rate. The initial unemployment claims ticked up last week, even as the number of job openings reached a record high level.\nBetween the COVID relief bill, the infrastructure proposal, and a jobs act, the Biden Administration’s spending plans are totaling $6 trillion. And with econom...",3 Big Dividend Stocks Yielding 7%; Analysts Say ‘Buy’ - S.G.E,Christine Watkins,sportsgrindentertainment.com,2021-06-17,
7,7,HTTPS://GULFBUSINESS.COM/PHILIP-MORRIS-INTERNATIONAL-UNVEILS-NEW-IQOS-BOUTIQUE-AT-DUBAI-MALL/,"Home Industry Retail Philip Morris International unveils new IQOS Boutique at Dubai Mall The UAE officially legalised the sale and use of electronic cigarettes in April 2019 by David Ndichu October 7, 2020 Philip Morris International (PMI) has opened UAE’s first IQOS Boutique at Dubai Mall. PMI says the new store underlines its ambition of “achieving a smoke-free future for the GCC”. The IQOS ...",Philip Morris International unveils new IQOS Boutique at Dubai Mall,David Ndichu,gulfbusiness.com,2020-10-07,Retail;Dubai;UAE
8,8,HTTPS://WWW.DAILYVANGUARD.COM/TOBACCO-COMPANIES-IN-A-RACE-TO-DEVELOP-SAFER-E-CIGARETTES/,"Top scientists with years of experience in developing curative medicines have been hired by tobacco companies to make safer electronic cigarettes.\nPhilip Morris International (PMI) has recruited upwards of 400 scientists and technical staff to work at its research centre in Neuchâtel, Switzerland. The list includes biologists, chemists, toxicologists, biostatisticians as well as experts in re...",Tobacco Companies in a Race to Develop Safer E-Cigarettes,Admin,dailyvanguard.com,2015-06-21,"Health, e-cigarettes, news, Research, tobacco"
9,9,HTTPS://WWW.BIGCOUNTRYHOMEPAGE.COM/NEWS/HEALTH-NEWS/FDA-CALLS-FOR-REMOVAL-OF-FRUITY-DISPOSABLE-PUFF-BAR-VAPES/,"WASHINGTON (AP) — U.S. health officials are cracking down on fruity disposable electronic cigarettes popular with teenagers, saying the companies never received permission to sell them in the U.S.\nThe Food and Drug Administration sent a letter Monday telling the company behind Puff Bar e-cigarettes to remove them from the market within 15 business days, including flavors like mango, pink lemo...","FDA calls for removal of fruity, disposable e-cigarettes",MATTHEW PERRONE; Associated Press,bigcountryhomepage.com,2020-07-20,Health News
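Not every active URL yields usable text: in the output above, row 2 has no text at all and row 3 contains only a block-page message ("Request unsuccessful. Incapsula incident ID: ..."). A minimal sketch of filtering such rows with pandas follows. The toy frame, the placeholder URLs, and the choice of block-page prefix are assumptions for illustration; in the notebook the frame would be `scrapping_instance.get_scrapped_df`.

```python
import pandas as pd

# Toy frame mirroring a few columns of the scraped output (placeholder URLs)
df = pd.DataFrame(
    {
        "url_id": [0, 2, 3],
        "url": [
            "https://example.com/article",
            "https://example.com/empty",
            "https://example.com/blocked",
        ],
        "text": [
            "The Win32/Spy.Numando.L is considered dangerous ...",
            None,
            "Request unsuccessful. Incapsula incident ID: ...",
        ],
    }
)

text = df["text"].fillna("")

# Rows with no extracted text at all
has_text = text.str.strip().ne("")

# Rows whose text is a block page; this prefix is an assumed marker
is_blocked = text.str.startswith("Request unsuccessful")

usable_df = df[has_text & ~is_blocked].reset_index(drop=True)
```

Other block pages may use different wording, so the prefix check is likely to need extending for a real dataset.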
