# Scraping the most popular CAD models in different categories from GrabCAD

> GrabCAD (the largest online community of professional engineers, designers and students, who use it to work on and share their CAD models)

<img src='https://drive.google.com/uc?id=1ph-gIOrM681yAFWOfr6wP82h6ottGjdy'>

**INTRODUCTION:**

GrabCAD is a platform where we can upload or download CAD models to showcase our work and get a chance to win prizes. It evolved as a community of engineers and, at the time of writing, has 5.2 million (52 lakh) registered users and 3.1 million (31 lakh) open-source models available on the website. This vast [free CAD model library](https://grabcad.com/library) is very helpful for students and professionals who want to move into CAD-related jobs or research and are learning design software such as [Solidworks](https://grabcad.com/library?page=1&time=all_time&sort=popular&softwares=solidworks),  [Catia](https://grabcad.com/library?page=1&time=all_time&sort=popular&softwares=catia),  [Autocad](https://grabcad.com/library?page=1&time=all_time&sort=popular&softwares=autocad),  [pro-E](https://grabcad.com/library?page=1&time=all_time&sort=popular&softwares=pro-slash-engineer-wildfire) etc.
It brings together all the tools engineers need to manage and share CAD files in one platform.

**OBJECTIVE:**

As data science engineers, we aim to collect the all-time most downloaded design models by parsing the information on this website into tabular data, separately for knowledge domains such as [Machine Design](https://grabcad.com/library?page=1&time=all_time&sort=popular&categories=machine-design),  [3D printing](https://grabcad.com/library?page=1&time=all_time&sort=popular&categories=3d-printing),  [Aerospace](https://grabcad.com/library?page=1&time=all_time&sort=popular&categories=aerospace),  [Electrical](https://grabcad.com/library?page=1&time=all_time&sort=popular&categories=electrical), so that we can learn about the interests of the community and the difficulty of designing the models and, of course, award prizes to the most popular ones.

(In this notebook we limit the objective to scraping each category separately, which keeps the dataset small. The data for different categories can also be combined, and further analysis and testing on the complete data can follow a similar path.)

**The overall steps I'll follow are:**

1. Understand the structure of the [grabcad](https://grabcad.com/library) website
2. Install and import the required libraries
3. Download the page and extract the category names from [grabcad's all time most downloaded library page](https://grabcad.com/library?page=1&time=all_time&sort=most_downloaded) using <code>selenium.webdriver</code> and <code>kora.selenium</code> (there are 33 categories on the page in total)
4. Build the URL of each required category and extract the model links (100 per page) from it
5. Download each model link and parse four fields out of it, i.e. Names, Downloads, Likes, Comments
6. Combine the extracted data for each category into a list of dictionaries
7. Compile all details into a <code>Pandas</code> dataframe and create a CSV file

**By the end of the project, we expect to have a CSV file with the following information for the machine design category:**
```
name,downloads,likes,comments
Stepper Motor Nema 17,41925,575,78
MQ-1 Predator UAV,31373,802,144
CNC 3-axis,30116,994,175
Planetary Gearbox,29050,900,189
```
**NOTE:**

1. GrabCAD is a dynamic website that renders its content with JavaScript, so the HTML returned by a plain HTTP request does not contain the model data and <code>beautiful soup</code> alone cannot extract it. <code>selenium</code> is therefore preferred for this kind of website. We can still use <code>beautiful soup</code> to parse the page once the webdriver has rendered the HTML.

2. If you want to run the code on your local computer, install <code>Selenium</code> and the webdriver that matches your browser. If you are coding on a cloud-based service such as Google Colab, install <code>kora</code> and use <code>kora.selenium</code> instead, but be aware that <code>kora.selenium</code> will not work on Binder and other platforms.
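
To illustrate the first note: once the webdriver has rendered a page, its HTML is an ordinary string (available as `wd.page_source`) that any static parser can read. The sketch below is only an illustration, not the method used in this notebook. It uses the standard library's `html.parser` in place of <code>beautiful soup</code> so that it runs without extra installs, and the `rendered_html` string and the `ModelLinkParser` class are hypothetical stand-ins.

```python
from html.parser import HTMLParser


class ModelLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside <div class="modelName ...">."""

    def __init__(self):
        super().__init__()
        self._depth = 0   # greater than 0 while we are inside a modelName div
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            if self._depth or "modelName" in classes:
                self._depth += 1
        elif tag == "a" and self._depth and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1


# Hypothetical stand-in for `wd.page_source` after the page has rendered
rendered_html = """
<div class="modelName text-bold"><a href="/library/stepper-motor-nema-17">Stepper Motor Nema 17</a></div>
<div class="other"><a href="/about">About</a></div>
<div class="modelName text-bold"><a href="/library/cnc-3-axis">CNC 3 axis</a></div>
"""

parser = ModelLinkParser()
parser.feed(rendered_html)
print(parser.links)
```

Only the two links inside the `modelName` blocks are collected; the unrelated `/about` link is skipped.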

# Install and Import required libraries:

In [2]:
# kora.selenium exposes a ready-to-use webdriver instance named `wd` on Google Colab
!pip install kora -q
!pip install requests
from kora.selenium import wd
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

from bs4 import BeautifulSoup
import requests





# Downloading the initial page

In [3]:
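## Downloading the page

base_url = 'https://grabcad.com/library?page=1&time=all_time&sort=most_downloaded'

def most_downloaded_library(url):

  # wd.get() loads the page in the browser and returns None;
  # the rendered HTML is then available as wd.page_source
  wd.get(url)
  return wd.page_source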

<img src='https://drive.google.com/uc?id=131eMpuB2-2z_09sWPwN8pSKGIdsyggZw'>

# Extracting the categories

In [4]:
## Printing the categories in the format needed for URL building

def categories_in_url_format(path):

  # `path` is not used directly: the elements are read from the page currently loaded in wd.
  # find_elements(By.XPATH, ...) replaces find_elements_by_xpath, which newer Selenium versions no longer provide.
  x_path_categories = wd.find_elements(By.XPATH, '//*[@id="community_frontend-library_models"]/div[3]/div[1]/div[2]/div[3]/div/div[2]/div[2]/ul//li/a')

  # Category names as displayed on the page
  category_names = [check.get_attribute('text').strip() for check in x_path_categories]

  # Lower-case each name and replace blank spaces with '-'; in all there are 33 categories
  category_url_format = [name.lower().replace(" ", "-") for name in category_names]

  return category_url_format

categories_in_url_format(most_downloaded_library(base_url))



['3d-printing',
 'aerospace',
 'agriculture',
 'architecture',
 'automotive',
 'aviation',
 'components',
 'computer',
 'construction',
 'educational',
 'electrical',
 'energy-and-power',
 'fixtures',
 'furniture',
 'hobby',
 'household',
 'industrial-design',
 'interior-design',
 'jewellery',
 'just-for-fun',
 'machine-design',
 'marine',
 'medical',
 'military',
 'miscellaneous',
 'nature',
 'piping',
 'robotics',
 'speedrun',
 'sport',
 'tech',
 'tools',
 'toys']
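
The name-to-slug conversion above can be tried without a browser. This is a minimal sketch: `to_url_slug` is a hypothetical helper, and the display names in `display_names` are assumed spellings of what the page shows, chosen to match slugs from the output above.

```python
def to_url_slug(display_name: str) -> str:
    """Lower-case a category name and join its words with hyphens."""
    return "-".join(display_name.lower().split())


# Assumed display names; the real ones come from the category menu on the page
display_names = ["3D Printing", "Energy and Power", "Machine Design", "Just for Fun"]

slugs = [to_url_slug(name) for name in display_names]
print(slugs)
```

Splitting on whitespace before joining also collapses accidental double spaces, which a plain `replace(" ", "-")` would turn into `--`.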

# Building the URLs (category-wise)

In [5]:
## URL building for each category (here we are working on the machine-design category)

def url_building(category, per_page):

  url_first_part='https://grabcad.com/library?page=1&'

  # per_page can be 24, 48 or 100, depending on the number of items we want on a page
  url_second_part='per_page='+str(per_page)

  url_third_part='&time=all_time&sort=most_downloaded&'

  url_last_part='categories='+ category

  complete_url=url_first_part+url_second_part+url_third_part+url_last_part

  return complete_url

url_building('machine-design',100)

'https://grabcad.com/library?page=1&per_page=100&time=all_time&sort=most_downloaded&categories=machine-design'
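
The same URL can also be assembled from a dictionary of query parameters, which makes it easy to loop over several categories. This is an alternative sketch rather than the function used in this notebook; `build_category_url` and its `page` argument are hypothetical additions.

```python
from urllib.parse import urlencode

BASE = "https://grabcad.com/library"


def build_category_url(category, per_page=100, page=1):
    """Build a 'most downloaded' library URL for one category."""
    query = {
        "page": page,
        "per_page": per_page,
        "time": "all_time",
        "sort": "most_downloaded",
        "categories": category,
    }
    # urlencode joins the pairs with '&' and escapes unsafe characters
    return f"{BASE}?{urlencode(query)}"


# One URL per category slug
category_urls = {
    slug: build_category_url(slug)
    for slug in ["machine-design", "aerospace", "3d-printing"]
}
print(category_urls["machine-design"])
```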

<img src='https://drive.google.com/uc?id=1Kq8bPNQdELzJjyTL4nRxeQb1jSp6bgRj'>




# Extracting each model link from individual category

In [6]:

links=[]  ### There are 100 links on a page for every category. This is a global list, so every call to model_links appends to it


def model_links(category_page):

  wd.get(category_page)     ### load the category page in the browser

  ### find_elements (plural) returns a list of all matching elements
  a=wd.find_elements(By.XPATH, '//div[@class="modelName text-bold"]//a')

  for individual_link in a:
    links.append(individual_link.get_attribute('href'))         ### the href attribute of each element is the required link

  return links

model_links(url_building('machine-design',100))[:5]

['https://grabcad.com/library/stepper-motor-nema-17',
 'https://grabcad.com/library/mq-1-predator-uav',
 'https://grabcad.com/library/cnc-3-axis',
 'https://grabcad.com/library/planetary-gearbox',
 'https://grabcad.com/library/mechanical-horse']

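In [16]:
len(links)

200

The list holds 200 links rather than 100 because `model_links` was called twice (once above and once more in the parsing step below), and each call appends the same 100 links to the global `links` list. The resulting duplicate records are removed before the CSV file is created.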
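
Duplicate links can also be dropped before visiting any model page, which avoids loading pages twice. This is a minimal sketch, assuming the order of the links should be preserved; `unique_in_order` is a hypothetical helper, and the short `page_links` list reuses three links from the output above.

```python
def unique_in_order(items):
    """Remove duplicates while keeping the first occurrence of each item."""
    # dict keys are unique and remember insertion order
    return list(dict.fromkeys(items))


page_links = [
    "https://grabcad.com/library/stepper-motor-nema-17",
    "https://grabcad.com/library/mq-1-predator-uav",
    "https://grabcad.com/library/cnc-3-axis",
]

# Simulate appending the same page of links twice to one list
links_collected_twice = page_links + page_links

deduplicated = unique_in_order(links_collected_twice)
print(len(links_collected_twice), len(deduplicated))
```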

# Parsing data from each model link 

<img src='https://drive.google.com/uc?id=1EQsVzF7voBJqOS0imx-x2ZWS0zVnpfJF'>



In [8]:

cad_model_names=[]    ### Model names read from each link of the machine-design category
counts=[]             ### Number of downloads, likes and comments for each model

def get_most_downloaded(variable_link):

    for link in variable_link:
      wd.get(link)
      ### Wait up to 20 seconds for the title and the counters to become visible, then read their text
      cad_model_names.append([tag.text for tag in WebDriverWait(wd, 20).until(EC.visibility_of_all_elements_located((By.XPATH,"//h1[@class='content-title title--fluid is-3 ng-binding']")))])
      counts.append([tag.text for tag in WebDriverWait(wd, 20).until(EC.visibility_of_all_elements_located((By.XPATH,"//span[@class='count ng-binding']")))])

    ### counts is filled as a side effect; only the names are returned
    return cad_model_names

get_most_downloaded(model_links(url_building('machine-design',100)))[:10]

[['Stepper Motor Nema 17'],
 ['MQ-1 Predator UAV'],
 ['CNC 3 axis'],
 ['Planetary Gearbox'],
 ['mechanical horse'],
 ['Brake Disc Brembo'],
 ['Forklift truck'],
 ['FANUC-430 Robot'],
 ['TOWER CRANE -ASSEMBLY-'],
 ['Formula car full chassis']]

In [9]:
counts[:10]   ### Counts the number of downloads, likes, comments 

[['41926', '575', '78'],
 ['31373', '802', '144'],
 ['30119', '995', '175'],
 ['29051', '900', '189'],
 ['26882', '721', '347'],
 ['26005', '537', '122'],
 ['23363', '794', '189'],
 ['22948', '495', '93'],
 ['22255', '693', '107'],
 ['20407', '689', '64']]
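
The counts are scraped as text, so they need converting to integers before any sorting or arithmetic. This is a small sketch using the first two rows of the output above; `to_int` is a hypothetical helper, and stripping thousands separators is only a precaution, since the values shown here contain none.

```python
def to_int(text: str) -> int:
    """Convert a scraped count such as '41926' (or '41,926') to an integer."""
    return int(text.replace(",", "").strip())


raw_counts = [
    ["41926", "575", "78"],
    ["31373", "802", "144"],
]

# Each row is [downloads, likes, comments]
numeric_counts = [[to_int(value) for value in row] for row in raw_counts]
print(numeric_counts)
```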

# Combine extracted data into a list of dictionaries

In [10]:
newlist=[]

### Join each name (a one-item list) with its three counts: [name, downloads, likes, comments]
for name, count in zip(cad_model_names, counts):
    newlist.append(name + count)

def parse(row):

  ### Map the four positions of a row to named fields
  return {'name':row[0],'downloads':row[1],'likes':row[2],'comments':row[3]}


combined_list=[parse(tag) for tag in newlist ]
combined_list[:10]

[{'comments': '78',
  'downloads': '41926',
  'likes': '575',
  'name': 'Stepper Motor Nema 17'},
 {'comments': '144',
  'downloads': '31373',
  'likes': '802',
  'name': 'MQ-1 Predator UAV'},
 {'comments': '175',
  'downloads': '30119',
  'likes': '995',
  'name': 'CNC 3 axis'},
 {'comments': '189',
  'downloads': '29051',
  'likes': '900',
  'name': 'Planetary Gearbox'},
 {'comments': '347',
  'downloads': '26882',
  'likes': '721',
  'name': 'mechanical horse'},
 {'comments': '122',
  'downloads': '26005',
  'likes': '537',
  'name': 'Brake Disc Brembo'},
 {'comments': '189',
  'downloads': '23363',
  'likes': '794',
  'name': 'Forklift truck'},
 {'comments': '93',
  'downloads': '22948',
  'likes': '495',
  'name': 'FANUC-430 Robot'},
 {'comments': '107',
  'downloads': '22255',
  'likes': '693',
  'name': 'TOWER CRANE -ASSEMBLY-'},
 {'comments': '64',
  'downloads': '20407',
  'likes': '689',
  'name': 'Formula car full chassis'}]

In [19]:
## Remove duplicate records (every model was scraped twice) by tracking the records already seen in a set

seen = set()
new_l = []
for d in combined_list:
    t = tuple(d.items())
    if t not in seen:
        seen.add(t)
        new_l.append(d)

print(len(new_l))

100


# Create a CSV file

In [21]:
import pandas as pd

df_machine_design = pd.DataFrame(new_l)

df_machine_design

Unnamed: 0,name,downloads,likes,comments
0,Stepper Motor Nema 17,41926,575,78
1,MQ-1 Predator UAV,31373,802,144
2,CNC 3 axis,30119,995,175
3,Planetary Gearbox,29051,900,189
4,mechanical horse,26882,721,347
...,...,...,...,...
95,Gear Collection,6785,621,49
96,Servo Robot ARM,6745,330,70
97,Tamiya King Hauler Truck Complete,6731,928,197
98,3-axis desktop cnc machining(concept model),6689,299,27


In [22]:
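# Finally we convert the DataFrame into a CSV file.
# index=False leaves out the row-number column, so the file has only the four columns listed in the objective.

df_machine_design.to_csv('machine_design_most_downloaded.csv', encoding='utf-8', index=False)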
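
For readers without pandas, the same kind of file can be produced with the standard library. This is a sketch using the `csv` module in place of `DataFrame.to_csv`, writing to an in-memory buffer instead of a file; the two records are taken from the output above.

```python
import csv
import io

records = [
    {"name": "Stepper Motor Nema 17", "downloads": "41926", "likes": "575", "comments": "78"},
    {"name": "MQ-1 Predator UAV", "downloads": "31373", "likes": "802", "comments": "144"},
]

# Write the records to an in-memory text buffer (a real file opened with newline="" works the same way)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "downloads", "likes", "comments"])
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

# Read the text back to check that nothing was lost
rows_read_back = [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]
print(csv_text)
```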


<img src='https://drive.google.com/uc?id=1xBgQ89wuaYs-p5iGLDxwfwUpBsBurcLd'>



# Summary

*   kora.selenium is used to parse data out of a dynamic site on a cloud-based platform (Google Colab).

*   Information about the CAD models on the GrabCAD website is extracted.

*   A CSV file is created from the data extracted for an individual category.

*   While scraping the data we made intensive use of Python functions, lists and dictionaries, along with XPath, CSS selectors and the Selenium documentation, which together give sufficient knowledge to parse most websites.








# References


*   Selenium download and installation: https://selenium-python.readthedocs.io/installation.html

*   https://www.analyticsvidhya.com/blog/2020/10/web-scraping-selenium-in-python/

*   https://medium.com/ml-book/web-scraping-using-selenium-python-3be7b8762747







# Future work

*   This data can be analysed further, for example by hypothesis testing of the relation between the number of downloads and community interest, or between the number of downloads and the difficulty of the models.

*   Different categories can be combined and a larger analysis done on the full 33 × 100 set of models.
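
As a first step towards such an analysis, the relation between downloads and likes can be summarised with a Pearson correlation coefficient. This is only a sketch on the ten machine-design rows shown earlier in this notebook, using likes as an assumed proxy for community interest; `pearson` is a hypothetical helper, and ten rows are far too few to draw conclusions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Downloads and likes of the first ten machine-design models scraped above
downloads = [41926, 31373, 30119, 29051, 26882, 26005, 23363, 22948, 22255, 20407]
likes = [575, 802, 995, 900, 721, 537, 794, 495, 693, 689]

r = pearson(downloads, likes)
print(f"correlation between downloads and likes: {r:.3f}")
```

A value close to +1 or -1 would indicate a strong linear relation and a value near 0 a weak one; a proper hypothesis test would need the full dataset.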


