# **HOW TO USE**
This notebook walks through a quick example of how to use the scraper and its associated functionality.

## Generic Steps (Reading URLs)

In [1]:
# Importing libraries
import os
import pandas as pd

pd.set_option("display.max_rows", 600)
pd.set_option("display.max_columns", 500)
pd.set_option("max_colwidth", 400)

# Read the example data
data = pd.read_csv(os.path.join('..', 'data', 'url_data.csv')).sample(500, random_state=42)

url_list = data.URLS.to_list()
len(url_list)

500
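
Before handing the list to the validator, it can help to clean it first. The sketch below is an optional step and not part of the scrapper module: `clean_url_list` is a hypothetical helper that skips non-string entries (such as missing values read from a CSV), strips surrounding whitespace, and removes duplicates while keeping the first-seen order.

```python
def clean_url_list(urls):
    """Return unique, non-empty URL strings in first-seen order.

    Hypothetical helper for illustration; not part of the scrapper module.
    """
    seen = {}
    for url in urls:
        # Missing values from a CSV typically arrive as float NaN or None
        if not isinstance(url, str):
            continue
        url = url.strip()
        if url:
            # dicts preserve insertion order, so this de-duplicates stably
            seen.setdefault(url, None)
    return list(seen)


# Small in-memory stand-in for data.URLS.to_list()
raw_urls = [
    "HTTPS://EXAMPLE.COM/A",
    " HTTPS://EXAMPLE.COM/A ",
    None,
    float("nan"),
    "",
    "HTTPS://EXAMPLE.ORG/B",
]
cleaned_urls = clean_url_list(raw_urls)
print(cleaned_urls)
```

In the notebook, the same call could be applied to `url_list` before it is passed on.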

In [2]:
# Switch the working directory to the parent folder (the project root) before importing the module
os.chdir(os.path.dirname(os.getcwd()))
os.getcwd()

'c:\\Users\\manash.jyoti.konwar\\Documents\\AI_Random_Projects\\Utility-Text-Scrapper'

## Module Level Execution

In [3]:
# Importing Text Scrapper Module
from url_entrypoint import URLValidator, URLScrapper

# cpu_count is used below to size the worker pool
from multiprocessing import cpu_count

In [4]:
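# Note: a bare `%time` line times an empty statement, hence the 0 ns reported below.
# To time the whole cell, use `%%time` as the first line of the cell instead.
%time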

# Running Validator
validator_instance = URLValidator(url_list=url_list, no_of_workers=2*cpu_count())
validator_instance.run_validation()

CPU times: total: 0 ns
Wall time: 0 ns


100%|██████████| 500/500 [00:54<00:00,  9.10it/s]


In [5]:
print(f'{len(validator_instance.get_active_urls)} links are active')
print(f'{len(validator_instance.get_inactive_urls)} links are inactive')
print(f'{len(validator_instance.get_unreachable_links)} links are not reachable / working')

39 links are active
447 links are inactive
14 links are not reachable / working
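
The three buckets reported above can be pictured with a small thread-pool sketch. This is not the implementation of `URLValidator`; it is a minimal illustration under stated assumptions: a 2xx status counts as active, any other status as inactive, and an exception or `None` as unreachable. The names `classify_urls`, `fetch_status`, and `fake_fetch` are hypothetical, and the fetch function is injected so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def classify_urls(urls, fetch_status, no_of_workers=4):
    """Split URLs into active / inactive / unreachable buckets.

    Assumptions (not taken from URLValidator): 2xx means active, any other
    status means inactive, an exception or None means unreachable.
    """
    def safe_fetch(url):
        # Convert any failure into None so one bad URL cannot stop the run
        try:
            return fetch_status(url)
        except Exception:
            return None

    urls = list(urls)
    buckets = {"active": [], "inactive": [], "unreachable": []}
    with ThreadPoolExecutor(max_workers=no_of_workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its status
        for url, status in zip(urls, pool.map(safe_fetch, urls)):
            if status is None:
                buckets["unreachable"].append(url)
            elif 200 <= status < 300:
                buckets["active"].append(url)
            else:
                buckets["inactive"].append(url)
    return buckets


# Stand-in for a real HTTP request (hypothetical data)
FAKE_STATUSES = {"https://a.example": 200, "https://b.example": 404}


def fake_fetch(url):
    if url not in FAKE_STATUSES:
        raise ConnectionError(url)
    return FAKE_STATUSES[url]


result = classify_urls(
    ["https://a.example", "https://b.example", "https://c.example"],
    fake_fetch,
)
print(result)
```

Threads suit this kind of work because each check spends most of its time waiting on the network, which is also why a worker count above the CPU count (such as `2*cpu_count()`) can be reasonable.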


In [6]:
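# Note: a bare `%time` line times an empty statement, hence the 0 ns reported below.
# To time the whole cell, use `%%time` as the first line of the cell instead.
%time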

# Running Scrapper
scrapping_instance = URLScrapper(url_list=validator_instance.get_active_urls, no_of_workers=2*cpu_count())
scrapping_instance.run_scrapping()

CPU times: total: 0 ns
Wall time: 0 ns


100%|██████████| 39/39 [00:08<00:00,  4.54it/s]


In [7]:
scrapping_instance.get_scrapped_df.head(5)

| | url_id | url | text | title | author | hostname | date | categories |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | HTTPS://ADWAREREMOVAL.INFO/WIN32-SPY-NUMANDO-L/ | The Win32/Spy.Numando.L is considered dangerous by lots of security experts. When this infection is active, you may notice unwanted processes in Task Manager list. In this case, it is adviced to scan your computer with GridinSoft Anti-Malware.\nGridinsoft Anti-Malware\nRemoving PC viruses manually may take hours and may damage your PC in the process. We recommend using GridinSoft Anti-Malware ... | Win32/Spy.Numando.L removal guide – Adware Reports | Paul Valéry | adwareremoval.info | 2020-10-18 | Spy |
| 1 | 1 | HTTPS://ADWAREREMOVAL.INFO/ULISE-120961/ | The Ulise.120961 is considered dangerous by lots of security experts. When this infection is active, you may notice unwanted processes in Task Manager list. In this case, it is adviced to scan your computer with GridinSoft Anti-Malware.\nGridinsoft Anti-Malware\nRemoving PC viruses manually may take hours and may damage your PC in the process. We recommend using GridinSoft Anti-Malware for vir... | Ulise.120961 (file analysis) – Adware Reports | Paul Valéry | adwareremoval.info | 2020-10-13 | Malware |
| 2 | 2 | HTTPS://FINANCE.YAHOO.COM/NEWS/ALTRIA-ABANDONS-FULL-OUTLOOK-HALTS-112713091.HTML | | | | | | |
| 3 | 3 | HTTPS://WWW.BIZJOURNALS.COM/PRNEWSWIRE/PRESS_RELEASES/2018/12/17/DC05175 | | | | | | |
| 4 | 4 | HTTPS://WWW.KRISTENLOURIE.COM/VAPE-SHOP-IN-NOTTINGHAM-UK/ | Author Bio\nEkaterina Mironova\nAuthor Biograhy: Ekaterina Mironova is a co-founder of CBD Life Mag and an avid blogger on the Hemp, CBD and fashion subjects. Ekaterina is also on the panel of the CBD reviewers and she most enjoys CBD gummies. Ekaterina has developed a real interest in CBD products after she started taking CBD tincture oil to help her ease her anxiety that was part-and-parcel ... | Vape Shop in Nottingham, UK - Kristen Lourie | Admin | kristenlourie.com | 2021-03-28 | Uncategorized |
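
Some rows in the output above (url_id 2 and 3) came back with no text or metadata even though the links were active. A minimal sketch of separating such rows is shown below. It runs on a tiny stand-in DataFrame rather than the real `get_scrapped_df`, and it assumes that "nothing extracted" shows up as a missing value or a whitespace-only string in the `text` column; how the scrapper actually represents empty results is not confirmed here.

```python
import pandas as pd

# Tiny stand-in for scrapping_instance.get_scrapped_df (hypothetical data)
scraped = pd.DataFrame(
    {
        "url_id": [0, 1, 2],
        "url": ["https://a.example", "https://b.example", "https://c.example"],
        "text": ["Some article text", None, "   "],
    }
)

# Treat missing or whitespace-only text as "nothing extracted"
has_text = scraped["text"].fillna("").str.strip().ne("")
with_text = scraped[has_text]
without_text = scraped[~has_text]

print(f"{len(with_text)} rows with text, {len(without_text)} without")
```

The URLs in `without_text` could then be inspected by hand or retried.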


## Exploring Results

### Metadata Extracted

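For each URL, the scraper returns the following fields:  
1. Unique URL ID (`url_id`)  
2. URL (`url`)  
3. Text (`text`)  
4. Title (`title`)  
5. Author (`author`)  
6. Hostname (`hostname`)  
7. Date (`date`)  
8. Categories (`categories`)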

In [8]:
scrapping_instance.get_scrapped_df.columns

Index(['url_id', 'url', 'text', 'title', 'author', 'hostname', 'date',
       'categories'],
      dtype='object')

### Text Extracted

The scraper is not limited to English; it can extract text in other languages as well, as the mixed Kannada and English sample below shows.

In [9]:
scrapping_instance.get_scrapped_df.iloc[5].text

'ಫೇಸ್ಬುಕ್ ನಲ್ಲಿ WBIR Channel 10 ಕುರಿತು ಇನ್ನಷ್ಟು ನೋಡಿ\nಫೇಸ್ಬುಕ್ ನಲ್ಲಿ WBIR Channel 10 ಕುರಿತು ಇನ್ನಷ್ಟು ನೋಡಿ\nಹೊಸ ಖಾತೆಯನ್ನು ರಚಿಸಿ\nಅಥವಾ\nಪುಟದ ಇತ್ತೀಚಿನ ಪೋಸ್ಟ್\nThe change matters because rural and urban areas often qualify for d...ifferent types of federal funding for transportation, housing, health care and education. ಇನ್ನಷ್ಟು ನೋಡಿ\nFormer Tennessee quarterback Josh Dobbs threw the first NFL touchdow...n of his career on Thursday night against the Dallas Cowboys in the third quarter. ಇನ್ನಷ್ಟು ನೋಡಿ\nPresident Joe Biden on Thursday signed a $1.7 trillion spending bill... that will keep the federal government operating through the end of the federal budget year in September 2023, and provide tens of billions of dollars in new aid to Ukraine for its fight against the Russian military. ಇನ್ನಷ್ಟು ನೋಡಿ'
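
To get a rough idea of which writing systems a scraped text contains, one option is to look at Unicode character names. The sketch below is a heuristic, not a language detector and not part of the scrapper module: the hypothetical helper `count_scripts` takes the first word of each letter's Unicode name (for example `KANNADA` or `LATIN`) as an approximate script label.

```python
import unicodedata
from collections import Counter


def count_scripts(text):
    """Count letters per script, using the first word of each character's
    Unicode name as a rough script label (a heuristic, not a standard)."""
    counts = Counter()
    for ch in text:
        # Only consider letters; skip digits, spaces, punctuation, and marks
        if unicodedata.category(ch).startswith("L"):
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts


# Short excerpt in the style of the scraped text above
sample = "ಇನ್ನಷ್ಟು ನೋಡಿ WBIR Channel 10"
scripts = count_scripts(sample)
print(scripts.most_common())
```

Applied to the `text` column, counts like these could be used to flag rows that mix scripts or to route texts to language-specific processing.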