# Description

In this project, rather than driving a browser with `Selenium` and scrolling through each entire page, which took far longer than expected,
I looked for the AJAX POST request that the website makes to load new elements. I then imitated that request, which made scraping the NFT data much faster.

In [1]:
import requests
import json
from bs4 import BeautifulSoup
import pickle
import pandas as pd
import time
import aiohttp
import asyncio
import nest_asyncio
from datetime import timedelta

pd.set_option("display.max_rows", None)

In the next section I checked the `robots.txt` of the Rarity Sniper website and found the sitemap, which lists the links for all collections on the site in XML format inside `loc` tags.\
After getting the links and removing the unnecessary ones, I saved the list to a file (`parent_links.txt`) with `pickle`.
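
The `loc` extraction can also be sketched offline with the standard library. This is a minimal sketch, not the code used for the scrape: `extract_locs` is a hypothetical helper, the sample XML is hand-written rather than the real sitemap, and it assumes the sitemap uses the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol (assumed to apply here)
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(xml_text):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iterfind(".//sm:loc", SITEMAP_NS)]


# Hand-written sample, not the real sitemap contents
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://raritysniper.com/one-shogun</loc></url>
  <url><loc>https://raritysniper.com/imaginary-ones</loc></url>
</urlset>"""

sample_links = extract_locs(sample)
# Drop an unwanted entry by value instead of relying on its position
unwanted = "https://raritysniper.com/imaginary-ones"
sample_links = [link for link in sample_links if link != unwanted]
print(sample_links)
```

Filtering by value keeps working if the order of entries in the sitemap changes, which fixed slices such as `links[5:]` do not.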

In [2]:
"""
resp_ = requests.get('https://raritysniper.com/sitemap')
soup = BeautifulSoup(resp_.text, 'lxml')
links = soup.find_all("loc")
links = [link.text for link in links]
links = links[5:]
links = links[::2]
links.remove('https://raritysniper.com/imaginary-ones')  #This link is a useless link and has to be removed manually

with open('parent_links.txt','wb') as f:
    pickle.dump(links, f)
"""


In [6]:
#loading the links
with open("parent_links.txt", "rb") as f:
    links = pickle.load(f)
len(links)

2022

In [7]:
#getting the asset name from the links    e.g The asset name for the collection with link https://raritysniper.com/one-shogun is `one-shogun`
assets = [asset[25:] for asset in links]
len(assets)

2022
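
The slice `asset[25:]` works because `https://raritysniper.com/` is exactly 25 characters long. A sketch of an alternative that does not depend on the prefix length is shown below; `asset_slug` is a hypothetical helper, not part of the original notebook.

```python
from urllib.parse import urlparse


def asset_slug(link):
    """Return the path of a collection link without surrounding slashes."""
    return urlparse(link).path.strip("/")


link = "https://raritysniper.com/one-shogun"
print(asset_slug(link))  # one-shogun
```

For the links used here, both approaches give the same asset name.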

The next section has the main code, which collects the data as lists:
- The function `get_data` makes the POST requests and collects the data from the responses.
- `async`/`await` is used to make the requests asynchronously and keep them fast.
- The POST request needs the query-string `params` and a JSON body (`payload`, passed as `json`); their values were obtained by inspecting the webpage.
- NFTs in a collection are displayed 24 per page, so the requests are made page by page, tracking the page number and the number of elements still to fetch.
- The results are extracted from each response and stored in lists, which are later stored in a dataframe.
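
The pagination described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the code used for the scrape: `page_count`, `build_payload`, and `fetch_all_pages` are hypothetical helpers, and `fake_fetch_page` stands in for the real POST request so the example runs offline. It also shows one way to issue the page requests concurrently with `asyncio.gather`, capped by a semaphore; the limit of 5 is an arbitrary assumption.

```python
import asyncio
import math

PER_PAGE = 24  # NFTs returned per page, as observed on the site


def page_count(total, per_page=PER_PAGE):
    """Number of pages needed to cover `total` items."""
    return math.ceil(total / per_page)


def build_payload(asset_name, page, per_page=PER_PAGE):
    """Request body for one page of a collection, sorted by rank."""
    return {
        "searches": [{
            "sort_by": "rank:asc,nftId:asc",
            "collection": "assets_" + asset_name,
            "q": "*",
            "page": page,
            "per_page": per_page,
        }]
    }


async def fetch_all_pages(fetch_page, asset_name, total, limit=5):
    """Fetch every page concurrently, with at most `limit` requests in flight.

    `fetch_page` is any coroutine function that takes a payload and returns
    the list of hits for that page (hypothetical; inject the real request).
    """
    semaphore = asyncio.Semaphore(limit)

    async def fetch_one(page):
        async with semaphore:
            return await fetch_page(build_payload(asset_name, page))

    pages = range(1, page_count(total) + 1)
    # gather returns results in argument order, so hits stay in page order
    results = await asyncio.gather(*(fetch_one(p) for p in pages))
    return [hit for page_hits in results for hit in page_hits]


async def fake_fetch_page(payload):
    """Offline stand-in for the POST request: 50 fake items in total."""
    search = payload["searches"][0]
    start = (search["page"] - 1) * search["per_page"]
    stop = min(start + search["per_page"], 50)
    await asyncio.sleep(0)  # yield control, as a real request would
    return list(range(start, stop))


hits = asyncio.run(fetch_all_pages(fake_fetch_page, "one-shogun", total=50))
print(page_count(50), len(hits))  # 3 50
```

Computing the number of pages up front removes the need to count down the remaining elements, and the semaphore keeps the number of simultaneous requests bounded.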

I scraped about 500 collections per run rather than everything at once, to avoid unnecessary problems.\
I stored each run in a separate file and finally combined the NFTs from all 2022 collections.
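
Splitting the list of collections into fixed-size runs can be sketched as below. `batches` is a hypothetical helper and `assets_demo` is stand-in data; the chunk size of 500 comes from the text, but the exact boundaries used for the saved files are not shown in the notebook, so the split printed here is illustrative only.

```python
def batches(items, size):
    """Yield (start_index, chunk) pairs, each chunk at most `size` long."""
    for start in range(0, len(items), size):
        yield start, items[start:start + size]


# Stand-in for the 2022 collection names
assets_demo = ["collection-{}".format(i) for i in range(2022)]

for part, (start, chunk) in enumerate(batches(assets_demo, 500), start=1):
    # Each chunk would be scraped and written to its own file at this point
    print("run {}: assets[{}:{}]".format(part, start, start + len(chunk)))
```

Saving after each run means a failure partway through costs at most one run of work.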

In [8]:
count = 1501  # global counter tracking which collection is being scraped

In [9]:
%%time
nest_asyncio.apply()     # To allow event loop to be nested

async def get_data(asset_name):
    global count
    ids = []
    ranks = []
    scores = []
    name = []
    url = 'https://search2.raritysniper.com/multi_search'
    #query string parameters to be used in post request
    params = {
        'use_cache' : 'true',
        'x-typesense-api-key' : 'L1NoMW9ITm1SYWNodFk4cWpmaHphQWZTS2tuaTVFWDNGdmxjT1llcEpLdz1uNWhMeyJmaWx0ZXJfYnkiOiJwdWJsaXNoZWQ6dHJ1ZSJ9'
    }
    #describe what and how the data is to be returned in response
    payload = {
        "searches": [{
            "sort_by": "rank:asc,nftId:asc",
            "collection": "assets_" + asset_name,
            "q": "*",
            "page": 1,
            "per_page": 24
        }]
    }
    #generates a session to make a request 
    async with aiohttp.ClientSession() as session:
        async with session.post(url, params=params, json=payload) as resp:
            ans = await resp.json()
            total = ans['results'][0]['found']
            
    name.append(ans['results'][0]['hits'][0]['document']['collectionName'])
    page = 1
    loop = 24  #number of nfts displayed per page
    start = time.time()
    print("Now scraping '{asset}' ....... Total elements {total}".format(asset=asset_name, total=total))
    #will run until the total elements to be retrieved are 0
    while total != 0:
        # when number of nfts left to display are less than 24
        if total < loop:
            loop = total
            
        payload["searches"][0]["page"] = page #change the page number accordingly
        
        async with aiohttp.ClientSession() as session:
            async with session.post(url, params=params, json=payload) as resp:
                ans = await resp.json()
                
        for i in range(loop):
            ids.append(ans['results'][0]['hits'][i]['document']['nftId'])
            ranks.append(ans['results'][0]['hits'][i]['document']['rank'])
            scores.append(ans['results'][0]['hits'][i]['document']['rarityScore'])
            total = total - 1
            
        page = page + 1
    end = time.time()
    
    print("Time for scraping - {time}   <-->  {count} \n".format(time=str(timedelta(seconds=end-start)), count=count))
    count = count + 1
    return ids, ranks, scores, name


ids = []
ranks = []
scores = []
col_names = []

for asset in assets[1501:]:
    id_, rank_, score_, col_name_ = asyncio.run(get_data(asset))
    l = len(id_)
    ids.extend(id_)
    ranks.extend(rank_)
    scores.extend(score_)
    col_names.extend(col_name_ * l)

Now scraping 'troll-town-wtf' ....... Total elements 9999
Time for scraping - 0:00:20.654905   <-->  1501 

Now scraping 'streethers' ....... Total elements 799
Time for scraping - 0:00:01.457290   <-->  1502 

Now scraping 'dank-ducks' ....... Total elements 6972
Time for scraping - 0:00:13.269844   <-->  1503 

Now scraping '32px-pandas' ....... Total elements 3300
Time for scraping - 0:00:06.040556   <-->  1504 

Now scraping 'cozy-bears' ....... Total elements 444
Time for scraping - 0:00:00.805640   <-->  1505 

Now scraping 'balloon-town' ....... Total elements 1097
Time for scraping - 0:00:01.950368   <-->  1506 

Now scraping 'goblin-ghosts' ....... Total elements 2928
Time for scraping - 0:00:05.334686   <-->  1507 

Now scraping 'celestial-keys' ....... Total elements 4444
Time for scraping - 0:00:08.724615   <-->  1508 

Now scraping 'rabbit-college-club' ....... Total elements 10000
Time for scraping - 0:00:20.081620   <-->  1509 

Now scraping 'kitbash-boogers' ....... Tot

In [10]:
len(ids), len(ranks), len(scores), len(col_names)

(2696278, 2696278, 2696278, 2696278)

In [11]:
#store the returned lists in a dataframe
nfts = pd.DataFrame({'Id': ids,
                     'Collection Name': col_names,
                    'Rank': ranks,
                    'Rarity Score': scores})

In [12]:
nfts.head()

     Id Collection Name  Rank  Rarity Score
0  5425  troll-town.wtf     1          2427
1   137  troll-town.wtf     2          2313
2  5192  troll-town.wtf     3          2169
3  5082  troll-town.wtf     4          2112
4  1527  troll-town.wtf     5          2060


In [13]:
nfts.shape

(2696278, 4)

In [14]:
nfts.isnull().sum()

Id                 0
Collection Name    0
Rank               0
Rarity Score       0
dtype: int64

In [None]:
nfts.to_csv("nft-collections-p4.csv")

# Final Collection
Combining all the dataframes to form the final collection, which contains around 11 million NFTs.

In [3]:
df1 = pd.read_csv("nft-collections-p1.csv")
df2 = pd.read_csv("nft-collections-p2.csv")
df3 = pd.read_csv("nft-collections-p3.csv")
df4 = pd.read_csv("nft-collections-p4.csv")

In [7]:
nft_df = pd.concat([df1, df2, df3, df4])

In [9]:
nft_df.drop(nft_df.columns[0], axis=1, inplace=True)
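
The first column dropped here is the index that `to_csv` writes by default, which `read_csv` reads back as `Unnamed: 0`. A minimal sketch of avoiding that extra column is shown below, using two toy frames and in-memory buffers in place of the real CSV files; it is an illustration, not a change to the notebook's results.

```python
import io

import pandas as pd

# Toy stand-ins for two of the saved parts
part1 = pd.DataFrame({"Id": [5425, 137], "Rank": [1, 2]})
part2 = pd.DataFrame({"Id": [2712, 6952], "Rank": [1, 1]})

# Default to_csv writes the index as an unnamed first column
with_index = pd.read_csv(io.StringIO(part1.to_csv()))
print(list(with_index.columns))  # ['Unnamed: 0', 'Id', 'Rank']

# index=False leaves the index out, so nothing needs dropping later
buffers = [io.StringIO(df.to_csv(index=False)) for df in (part1, part2)]
combined = pd.concat([pd.read_csv(buf) for buf in buffers], ignore_index=True)
print(list(combined.columns), list(combined.index))
```

Passing `ignore_index=True` to `pd.concat` also gives the combined frame one continuous index instead of repeating each part's own row numbers.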

In [10]:
nft_df.shape

(11152231, 4)

In [11]:
nft_df.head()

      Id             Collection Name  Rank  Rarity Score
0   2712  NFT Worlds Genesis Avatars     1         43380
1   6952  NFT Worlds Genesis Avatars     1         43380
2   9677  NFT Worlds Genesis Avatars     1         43380
3   7325  NFT Worlds Genesis Avatars     4          3370
4  11212  NFT Worlds Genesis Avatars     5          3265


In [15]:
nft_df["Collection Name"].value_counts()

Otherdeed for Otherside                            97589
The Matrix Avatars                                 56891
VeeFriends Series 2                                50798
The Seekers                                        47895
McPepes                                            42069
ASM AIFA All-Stars                                 36497
Chicken Derby                                      33333
Space Game - Marines & Aliens                      27965
Colonists                                          25000
Adam Bomb Squad                                    25000
Sewer Pass                                         22754
World of Women Galaxy                              20609
Meebits                                            20000
tubby cats                                         20000
7LuX                                               20000
Songbird Punks                                     20000
Rug Radio Faces of Web3 by Cory Van Lew            20000
ALPACADABRAZ 3D                

In [13]:
nft_df.isnull().sum()

Id                 0
Collection Name    0
Rank               0
Rarity Score       0
dtype: int64

In [18]:
nft_df.to_csv("nft-collections.csv")

*****************************************************************************************************************************
