## Description

This project scrapes NFTs from https://raritysniper.com. The site loads its content with JavaScript, i.e. you have to scroll to make all the NFT collections and NFTs appear. \
A total of 2019 collections were present and each collection had about 6000 NFTs, which makes roughly 12 million NFTs in total.\
The code in this project is scalable and can be used to extract all of them.\
\
Because of my device's limited specifications, it took around 7 hours to scrape 5000 NFTs. That is why I scraped only around 10000 NFTs.\
\
Again, the code is scalable and can be adapted to the specifications of the user's device.\
The rarity score of an NFT is visible only when the NFT is hovered over or clicked.\
I chose the latter, which added to the scraping time. Pausing the driver so that the scrolled NFTs could load added to the time as well.\
\
I extracted the top 50 NFTs from roughly 200 NFT collections.
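
As a rough back-of-the-envelope check of the figures above, the sketch below scales the observed rate (about 7 hours per 5000 NFTs) to the whole site. The helper name `estimate_hours` is hypothetical, and the assumption that scraping time grows linearly with the number of NFTs is a simplification, not a measurement.

```python
def estimate_hours(n_nfts, observed_hours=7, observed_nfts=5000):
    """Linearly extrapolate scraping time from the observed rate."""
    return n_nfts * observed_hours / observed_nfts


total_nfts = 2019 * 6000                        # collections * NFTs per collection
print(total_nfts)                               # 12114000
print(round(estimate_hours(total_nfts)))        # 16960 (hours)
print(round(estimate_hours(total_nfts) / 24))   # 707 (days)
print(estimate_hours(10_000))                   # 14.0 (hours)
```

Under these assumptions a full scrape on a single machine would take on the order of two years, which is why only a sample was collected.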

In [2]:
from bs4 import BeautifulSoup  
from selenium import webdriver
import pandas as pd
import time
import pickle                                   #for dumping and loading to files
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

pd.set_option("display.max_rows", None)

-  `render_page` - This function scrolls the driver to the end of the page and returns it. It is used to collect all the URLs embedded in the `collections` page, which yields approximately 2000 URLs. Without it, only about 24 URLs are returned, because the page loads its content with JavaScript and the remaining URLs appear only after scrolling through the page.
 

In [2]:
def render_page(url):
    driver = webdriver.Chrome()   #generate a driver
    driver.get(url)               #going to the url
    driver.maximize_window()     #maximizes the window
    time.sleep(2)                #waits 2 seconds after loading
    screen_height = driver.execute_script("return window.screen.height;") #get current screen height
    i = 1
    #This loop will run until we reach the end of the infinite scroll
    while True:
        #scroll the page to the next screen height one by one
        driver.execute_script("window.scrollTo(0, {screen_height}*{i});".format(screen_height=screen_height, i=i))
        i += 1
        #wait 5 seconds to load the elements after scrolling
        time.sleep(5)
        #get current scrollHeight
        scroll_height = driver.execute_script("return document.body.scrollHeight;")  
        #break the loop once the scrolled distance exceeds the scroll height
        if (screen_height) * i > scroll_height:
            break
    return driver
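
An alternative stop condition for infinite scroll is to keep scrolling until the page height stops growing. The sketch below is not the method used above: it is a browser-independent illustration in which `scroll_until_stable`, `get_height`, `scroll_to` and `wait` are hypothetical names, and the callables are passed in so that the loop can be tried without a browser.

```python
def scroll_until_stable(get_height, scroll_to, wait, max_rounds=1000):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scroll rounds that were performed.
    """
    last_height = get_height()
    for rounds in range(1, max_rounds + 1):
        scroll_to(last_height)      # jump to the current bottom of the page
        wait()                      # give new elements time to load
        new_height = get_height()
        if new_height == last_height:
            return rounds           # nothing new was loaded, so we are done
        last_height = new_height
    return max_rounds


# With Selenium, the callables could be wired up roughly like this (assumption):
#   get_height = lambda: driver.execute_script("return document.body.scrollHeight;")
#   scroll_to  = lambda y: driver.execute_script("window.scrollTo(0, arguments[0]);", y)
#   wait       = lambda: time.sleep(5)
```

The `max_rounds` cap is a safety net for pages that never stop loading.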

- `render_child` - Same as the function above, differing only in the scrolling method. Also, the number of scrolls is limited. Used for rendering the NFTs embedded in a collection obtained from the `collections` page.


In [3]:
def render_child(url):
    driver = webdriver.Chrome()
    driver.get(url)
    driver.maximize_window()
    time.sleep(2)
    # find the body tag and use keys to scroll driver down the page only limited number of times
    
    for timer in range(8):
        driver.find_element(By.TAG_NAME, "body").send_keys(Keys.PAGE_DOWN)
        time.sleep(3.5)
        
    return driver

- `get_links` - This function extracts the links from the URL 
(https://raritysniper.com/nft-collections) and returns them in a list.

On the page that lists all the collections, I extracted the URLs from the div elements whose 'slot' attribute has the value 'hit'.

In [4]:

def get_links(url):
    links = []
    driver = render_page(url)
    # using BeautifulSoup to parse the page source obtained from driver as html
    
    soup = BeautifulSoup(driver.page_source, "html.parser")
    driver.quit()   # close the browser once the page source has been captured
    for x in soup.find_all("div", attrs={"slot" : "hit"}):
        links.append(x.find("a")["href"])
    
    return links

- `find_all_occurances` - A helper function that yields the indexes at which a substring occurs in the original string

In [5]:
# returns a generator object 
def find_all_occurances(sub, og):
    start = 0
    while True:
        start = og.find(sub, start)
        if start == -1: 
            return
        yield start
        start += len(sub)
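
A short usage example of the helper, repeated here so the snippet runs on its own. The sample string is made up for illustration.

```python
def find_all_occurances(sub, og):
    start = 0
    while True:
        start = og.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)   # continue after the match, so matches never overlap


text = "A #12 B #7"
positions = list(find_all_occurances("#", text))
print(positions)   # [2, 8]
```

Because the search resumes after each match, overlapping occurrences are skipped: `"aa"` is found in `"aaaa"` at indexes 0 and 2 only.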

- `get_ids` - This function returns the `id` of every NFT embedded in a particular collection.

In [6]:
#returns a list of ids
def get_ids(url):
    ids = []
    driver = render_child(url)
    elems = driver.find_elements(By.ID, "wrap") #find the elements with 'id' -> 'wrap' 
   
    #next few lines will filter the 'ids' from the string
    
    s1 = str(elems[0].text)
    driver.quit()   #the text has been read, so the browser can be closed
    s1 = s1.replace("\n", " ")
    loc = list(find_all_occurances("#", s1)) 
    
    # 'id' is placed next to the character '#'
    
    for i in loc:
        s = ""
        c = i+1
        #get the characters until a space occurs or the string ends
        while c < len(s1) and s1[c]!=" ":
            s += s1[c]
            c += 1
        ids.append(s)
        
    return ids
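
The same "take whatever follows `#` up to the next whitespace" idea can also be expressed with a regular expression. This is an alternative sketch, not the code used above. The name `extract_ids` and the sample text are hypothetical, and the real page text may be laid out differently.

```python
import re


def extract_ids(text):
    """Return every run of non-whitespace characters that follows a '#'."""
    return re.findall(r"#(\S+)", text)


sample_text = "Item #172 Rank 1\nItem #197 Rank 2"
print(extract_ids(sample_text))   # ['172', '197']
```

Unlike the character-by-character loop, this version skips a `#` that is followed directly by whitespace instead of recording an empty id.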

- `get_ranks_and_scores` - Used to extract the rarity score and the associated rank of an NFT

In [7]:
#returns two lists - ranks and rarity scores
def get_ranks_and_scores(ids, url):
    ranks = []
    scores = []
    driver = webdriver.Chrome()
    for id_ in ids:
        driver.get(url + "/" + id_)
        time.sleep(3)   #wait for the nft page to load
        rank = driver.find_element(By.XPATH,
                    '//*[@id="wrap"]/div[1]/div[3]/div/div/div/div/div/div[2]/div[1]/div[1]/div[1]/div').text
        #the rank is the last space-separated token of the element text
        rank = rank.rsplit(" ", 1)[-1]
        score = driver.find_element(By.XPATH, 
                    '//*[@id="wrap"]/div[1]/div[3]/div/div/div/div/div/div[2]/div[1]/div[2]/div[3]/div/span[3]').text
        ranks.append(rank)
        scores.append(score)
    driver.quit()   #close the browser after the last nft
    return ranks, scores

In [8]:
url_col = "https://raritysniper.com/nft-collections"
url_main = "https://raritysniper.com"

I saved all the links in a file (`links.txt`, written with `pickle`) so that I don't need to fetch them again.\
The next cell can be uncommented if the links are to be collected again.

In [9]:
# links = get_links(url_col)      
# len(links)                      

In [10]:
with open("links.txt","rb") as f:
    links = pickle.load(f)
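
The cell that wrote `links.txt` is not shown. A minimal sketch of how such a file could be written and read back with `pickle` follows. The helper names `save_links` and `load_links` are hypothetical, and a temporary directory is used so the example does not overwrite the real file.

```python
import os
import pickle
import tempfile


def save_links(links, path):
    """Serialize the list of links to a binary pickle file."""
    with open(path, "wb") as f:
        pickle.dump(links, f)


def load_links(path):
    """Load a list of links previously written by save_links."""
    with open(path, "rb") as f:
        return pickle.load(f)


sample = ["/tutti-frutiz", "/turtle-united-genesis"]
path = os.path.join(tempfile.mkdtemp(), "links.txt")
save_links(sample, path)
restored = load_links(path)
print(restored == sample)   # True
```

Despite the `.txt` extension, the file is binary and has to be opened in `"rb"`/`"wb"` mode.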

In [12]:
# check if the links are working -- response [200] will mean it works
import requests
for link in links[100:105]:
    print(url_main+link)
    print(requests.get(url_main+link))

https://raritysniper.com/tutti-frutiz
<Response [200]>
https://raritysniper.com/turtle-united-genesis
<Response [200]>
https://raritysniper.com/cronoscruiser
<Response [200]>
https://raritysniper.com/kart-racing-league-racers
<Response [200]>
https://raritysniper.com/mad-meerkat-burrow-poly
<Response [200]>


In [13]:
len(links)

2019

The next cell is the main loop that collects the id, parent collection, rank and rarity score of every NFT. \
It uses all the functions mentioned above. \
I have limited the number of collections and the number of NFTs explored in each collection.

I ran the cell below three times with different parent and child values.\
The goal was to scrape 10050 NFTs. I decided to split the work into two parts: the first to get 5000 NFTs and the second to get the next 5050.\
But the second run hit an error and returned only 1850 NFTs. To get the remaining 3200 (or rather 3198), I had to start a third run.\
For each run, I stored the scraped NFTs in a csv file,\
and at the end combined all three files into the final csv file.

In [39]:
%%time
limit_child = 50
limit_parent = 201
parent = []
length = []
ids = []
ranks = []
scores = []
for link in links[137 : limit_parent]:
    url_child = url_main + link
    name = link[1:].replace("-"," ").upper()
    ids_ = get_ids(url_child)
    ranks_, scores_ = get_ranks_and_scores(ids_[ : limit_child], url_child)
    ids.extend(ids_[ : limit_child])
    parent.append(name)
    length.append(len(ids_[ : limit_child]))
    ranks.extend(ranks_)
    scores.extend(scores_)

Wall time: 4h 22min 50s
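
Since one of the runs stopped on an error, one option is to catch failures per collection so that a single bad page does not end the whole run. The sketch below only illustrates that idea and is not the loop used above. `scrape_collections` and `fake_scrape_one` are hypothetical names, and `fake_scrape_one` stands in for the real per-collection work (`get_ids` followed by `get_ranks_and_scores`).

```python
def scrape_collections(links, scrape_one):
    """Scrape each link, keeping partial results if one collection fails."""
    rows = []
    failed = []
    for link in links:
        try:
            rows.extend(scrape_one(link))
        except Exception as exc:   # e.g. a missing element on the page
            failed.append((link, repr(exc)))
    return rows, failed


# Stand-in for the real scraping of a single collection (assumption).
def fake_scrape_one(link):
    if link == "/broken":
        raise ValueError("element not found")
    return [(link, 1), (link, 2)]


rows, failed = scrape_collections(["/a", "/broken", "/b"], fake_scrape_one)
print(len(rows))   # 4
print(failed)      # [('/broken', "ValueError('element not found')")]
```

The links collected in `failed` can then be retried in a later run instead of restarting from a hand-picked slice.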


In [89]:
len(scores), len(ids), len(ranks), len(parent), len(length)

(3198, 3198, 3198, 64, 64)

In [90]:
# Altering types of every list

length = [int(l) for l in length]
ids = [int(d) for d in ids]
ranks = [int(r) for r in ranks]
scores = [float(s) for s in scores]

In [91]:
#main dictionary for storing information
data_dict = {'Id': [], 'Parent' : [], 'Rank' : [], 'Rarity Score' : []}

In [95]:
#stores the parent of each nft
idx = 0
for p in parent:
    data_dict['Parent'].extend([p]*length[idx])
    idx += 1

In [96]:
data_dict['Id'].extend(ids)
data_dict['Rank'].extend(ranks)
data_dict['Rarity Score'].extend(scores)
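
To make the `Parent` step concrete: each collection name is repeated once per NFT scraped from it, so that the column lines up with the flat `ids`, `ranks` and `scores` lists. A small worked example with made-up lengths, written with `zip` instead of a manual index:

```python
parent = ["THE ARISTOCRATS SOCIETY", "ELEPHANTS"]
length = [3, 2]   # made-up counts; the real runs used up to 50 per collection

# Repeat each parent name as many times as it has NFTs.
parent_column = [p for p, n in zip(parent, length) for _ in range(n)]
print(parent_column)
```

The resulting list has `sum(length)` entries, which must equal the number of ids for the DataFrame to be built.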

In [134]:
df3 = pd.DataFrame(data_dict)
df3.head()

Unnamed: 0,Id,Parent,Rank,Rarity Score
0,172,THE ARISTOCRATS SOCIETY,1,29515.91
1,197,THE ARISTOCRATS SOCIETY,2,28520.86
2,69,THE ARISTOCRATS SOCIETY,3,27514.13
3,247,THE ARISTOCRATS SOCIETY,4,26541.78
4,248,THE ARISTOCRATS SOCIETY,5,25631.87


In [98]:
df3.shape

(3198, 4)

In [135]:
df3.tail()

Unnamed: 0,Id,Parent,Rank,Rarity Score
3193,85,ELEPHANTS,46,17574.26
3194,4752,ELEPHANTS,47,17531.74
3195,7507,ELEPHANTS,48,17391.0
3196,4937,ELEPHANTS,49,17291.97
3197,3789,ELEPHANTS,50,17267.12


In [100]:
#df.to_csv("collection1.csv")  #to store the first 5000 nfts 
#df.to_csv("collection2.csv")  #to store the next 1850 nfts
df3.to_csv("collection3.csv")    #for the next 3198 nfts

In [133]:
# merge all the loaded csv files into one

In [3]:
df1 = pd.read_csv("collection1.csv")
df2 = pd.read_csv("collection2.csv")
df3 = pd.read_csv("collection3.csv")

In [4]:
df1.shape , df2.shape, df3.shape

((5000, 5), (1850, 5), (3198, 5))

#### The final csv file which contains around 10000 nfts

In [5]:
df = pd.concat([df1, df2, df3])

In [6]:
df.drop(df.columns[0], axis=1, inplace=True)

In [7]:
df.shape

(10048, 4)

In [8]:
df.head()

Unnamed: 0,Id,Parent,Rank,Rarity Score
0,1029,UTOPIA AVATARS,1,48471.67
1,286,UTOPIA AVATARS,2,48218.41
2,3394,UTOPIA AVATARS,3,48143.18
3,556,UTOPIA AVATARS,4,41887.46
4,2182,UTOPIA AVATARS,5,41475.51


In [9]:
df.describe()

Unnamed: 0,Id,Rank,Rarity Score
count,10048.0,10048.0,10048.0
mean,2364.161525,24.148587,112223.3
std,3265.58156,20.182119,626460.0
min,0.0,1.0,43.04
25%,349.0,10.75,787.8
50%,1232.0,24.0,3615.965
75%,3199.0,37.0,14001.0
max,47229.0,1137.0,9007593.0


In [10]:
df.isnull().sum()

Id              0
Parent          0
Rank            0
Rarity Score    0
dtype: int64

In [11]:
df.to_excel("nft-collections.xls")


In [12]:
df.to_csv("nft-collections.csv")