# The goal of our web scraping project
We want to build a collection of interesting open-source machine learning projects, so we will scrape the top machine learning projects from this ```GitHub Collection```.

(If this collection is removed in the future, you can find other collections on the ```GitHub > Explore``` page.)

# Tech we need for this project
* __Python__ (I am using Python 3; the code below uses the Selenium 4 API, which does not support Python 2) – The easiest way to install Python on your machine is ```Anaconda```.
* __Selenium Webdriver for Google Chrome__: ```Chromedriver``` – Download it and place it anywhere on your machine.
* __Python Libraries__ – Install them with ```pip install xxx``` on the command line, or ```!pip install xxx``` in Jupyter Notebook
    * __Selenium Webdriver__ – ```pip install selenium```
    * __Pandas__ (for exporting data) – ```pip install pandas```
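
Before moving on, it can help to confirm that both libraries are actually installed. The following is a minimal sketch using only the standard library; the helper name ```installed_version``` is made up for this example.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report the status of the two libraries this tutorial relies on
for name in ("selenium", "pandas"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```

If either line reports "not installed", run the matching ```pip install``` command from the list above.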

# Step 1 Import Selenium Webdriver & Test

In [79]:
'''
Import everything we need up front
'''
from selenium import webdriver # allow launching browser
from selenium.webdriver.common.by import By # allow search with parameters
from selenium.webdriver.support.ui import WebDriverWait # allow waiting for page to load
from selenium.webdriver.support import expected_conditions as EC # determine whether the web page has loaded
from selenium.common.exceptions import TimeoutException # handling timeout situation
from selenium.webdriver.chrome.service import Service # To specify the ChromeDriver location because the executable_path argument has been deprecated.
import pandas as pd

Prepare a helper function that opens a new browser window. (This will be useful when we parallelize the scraping.)

In [80]:
'''
This is the original version. It no longer works because the way
webdriver.Chrome() is initialized has changed in recent Selenium versions.

driver_option = webdriver.ChromeOptions()
driver_option.add_argument("--incognito")
chromedriver_path = 'e:/chromedriver-win64/chromedriver.exe' # Change this to your own chromedriver path!
def create_webdriver():
    return webdriver.Chrome(executable_path=chromedriver_path, chrome_options=driver_option)
'''
def create_webdriver():
    chromedriver_path = 'e:/chromedriver-win64/chromedriver.exe' # Change this to your own chromedriver path!
    service = Service(chromedriver_path) # Starts and stops the chromedriver process
    driver_option = webdriver.ChromeOptions() # Options object that we can tweak later
    driver_option.add_argument("--incognito") # Open Chrome in incognito mode
    return webdriver.Chrome(service=service, options=driver_option) # Return the configured browser
 

Note that you have to change ```chromedriver_path = '…'``` to the path where you stored Chromedriver. If you don’t know the path, drag and drop the file into a terminal window, which should display its path.
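
A wrong path is a common reason the browser fails to start, so it may be worth checking the path before creating the webdriver. The sketch below is one possible check, not part of the original notebook. The helper name ```resolve_chromedriver``` is hypothetical, and falling back to a ```chromedriver``` found on the system ```PATH``` is an assumption about how you might have installed it.

```python
import shutil
from pathlib import Path


def resolve_chromedriver(configured_path):
    """Return a usable chromedriver path, or raise FileNotFoundError."""
    candidate = Path(configured_path)
    if candidate.is_file():
        return str(candidate)
    # Assumption: fall back to a chromedriver executable on the system PATH
    found_on_path = shutil.which("chromedriver")
    if found_on_path:
        return found_on_path
    raise FileNotFoundError(
        f"chromedriver not found at {configured_path!r} or on PATH"
    )


try:
    print(resolve_chromedriver('e:/chromedriver-win64/chromedriver.exe'))
except FileNotFoundError as error:
    print(error)
```

The returned string can then be passed to ```Service(...)``` inside ```create_webdriver()```.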

# Step 2 Open the GitHub page & Extract the HTML elements we need

We can start scraping by passing the URL to the Webdriver:

In [81]:
# Open the website
browser = create_webdriver()
browser.get("https://github.com/collections/machine-learning")

A Google Chrome window will open with the URL we specified, which means the browser launched successfully.

Next, we would like to extract the name and URL of every project listed on this page. We can right-click on the web page and click __Inspect__ to view the underlying HTML.

We find that each project has an ```h1``` tag with the class ```h3 lh-condensed```. Therefore, we can extract all projects using the following code:

In [82]:
# Extract all projects
projects = browser.find_elements(By.XPATH, "//h1[@class='h3 lh-condensed']")

The code above stores each ```h1``` element, along with its child tags, in a list. We can iterate through the list to extract each project’s name and URL using the following code:

In [83]:
# Extract information for each project
project_list = {} # Maps project name -> project URL
for proj in projects:
    proj_name = proj.text # element.text extracts the raw text of the element (the project name)
    # find_element (singular) is enough because each project contains only one <a> tag
    link_element = proj.find_element(By.XPATH, ".//a")
    # element.get_attribute("y") extracts the value of attribute y from the element
    proj_url = link_element.get_attribute('href') # The href attribute holds the URL
    project_list[proj_name] = proj_url

    print(project_list) # Printed on every iteration, so the output grows by one project each time

{'apache / spark': 'https://github.com/apache/spark'}
{'apache / spark': 'https://github.com/apache/spark', 'apache / hadoop': 'https://github.com/apache/hadoop'}
{'apache / spark': 'https://github.com/apache/spark', 'apache / hadoop': 'https://github.com/apache/hadoop', 'jbhuang0604 / awesome-computer-vision': 'https://github.com/jbhuang0604/awesome-computer-vision'}
{'apache / spark': 'https://github.com/apache/spark', 'apache / hadoop': 'https://github.com/apache/hadoop', 'jbhuang0604 / awesome-computer-vision': 'https://github.com/jbhuang0604/awesome-computer-vision', 'GSA / data': 'https://github.com/GSA/data'}
{'apache / spark': 'https://github.com/apache/spark', 'apache / hadoop': 'https://github.com/apache/hadoop', 'jbhuang0604 / awesome-computer-vision': 'https://github.com/jbhuang0604/awesome-computer-vision', 'GSA / data': 'https://github.com/GSA/data', 'GoogleTrends / data': 'https://github.com/GoogleTrends/data'}
{'apache / spark': 'https://github.com/apache/spark', 'apach

In [84]:
# Close connection
browser.quit()
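
The extraction logic from this step can also be tried without opening a browser. The sketch below runs the same two XPath expressions against a small hand-written snippet using Python's built-in ```xml.etree.ElementTree```. The markup is a simplified assumption for illustration, not GitHub's real HTML, and ElementTree supports only a subset of XPath, so treat this as a way to understand the selectors rather than a replacement for Selenium.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup (not GitHub's actual HTML)
SAMPLE_HTML = """
<div>
  <article><h1 class="h3 lh-condensed"><a href="https://github.com/apache/spark">apache / spark</a></h1></article>
  <article><h1 class="h3 lh-condensed"><a href="https://github.com/apache/hadoop">apache / hadoop</a></h1></article>
  <article><h1 class="other"><a href="https://example.com/ignored">ignored</a></h1></article>
</div>
"""


def extract_projects(markup):
    """Map project name -> URL for every h1 whose class is exactly 'h3 lh-condensed'."""
    root = ET.fromstring(markup.strip())
    projects = {}
    for heading in root.findall(".//h1[@class='h3 lh-condensed']"):
        link = heading.find(".//a")  # the single <a> tag inside the heading
        name = "".join(heading.itertext()).strip()  # all text inside the heading
        projects[name] = link.get("href")
    return projects


project_list = extract_projects(SAMPLE_HTML)
print(project_list)
```

The third ```h1``` is skipped because ```@class='h3 lh-condensed'``` compares the whole attribute value, so a heading with a different class string does not match.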

# Step 3 Save the data to CSV using Pandas
Now that the data is stored in a Python dictionary, we will build a Pandas table from it and export a CSV file.

We can convert the dictionary to Pandas DataFrame using this code:

In [85]:
# Extracting Data
project_df = pd.DataFrame.from_dict(project_list, orient= 'index') 
print(project_df)


                                                                                                 0
apache / spark                                                     https://github.com/apache/spark
apache / hadoop                                                   https://github.com/apache/hadoop
jbhuang0604 / awesome-computer-vision            https://github.com/jbhuang0604/awesome-compute...
GSA / data                                                             https://github.com/GSA/data
GoogleTrends / data                                           https://github.com/GoogleTrends/data
nationalparkservice / data                             https://github.com/nationalparkservice/data
fivethirtyeight / data                                     https://github.com/fivethirtyeight/data
beamandrew / medical-data                               https://github.com/beamandrew/medical-data
src-d / awesome-machine-learning-on-source-code  https://github.com/src-d/awesome-machine-learn...
igrigorik 

You can see that we need to tidy this up a little to get appropriate column names. We can do it using the following code:

__Reminder__: Don't re-run this cell. After the first run the index holds integers, so running it again would overwrite ```project_name``` with integers.

In [86]:
# Checking the index
#print(project_df.index)

# Manipulate the table
project_df['project_name'] = project_df.index
project_df.columns = ['project_url', 'project_name']
project_df = project_df.reset_index(drop=True)
# Display
print(project_df)

                                          project_url  \
0                     https://github.com/apache/spark   
1                    https://github.com/apache/hadoop   
2   https://github.com/jbhuang0604/awesome-compute...   
3                         https://github.com/GSA/data   
4                https://github.com/GoogleTrends/data   
5         https://github.com/nationalparkservice/data   
6             https://github.com/fivethirtyeight/data   
7          https://github.com/beamandrew/medical-data   
8   https://github.com/src-d/awesome-machine-learn...   
9           https://github.com/igrigorik/decisiontree   
10                https://github.com/keon/awesome-nlp   
11                      https://github.com/openai/gym   
12              https://github.com/aikorea/awesome-rl   
13            https://github.com/umutisik/Eigentechno   
14    https://github.com/jpmckinney/tf-idf-similarity   
15  https://github.com/scikit-learn-contrib/lightning   
16             https://github.c
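
If re-running the cell is a concern, one alternative is to build the DataFrame with named columns in a single step, so the result is the same no matter how often the cell runs. This is a sketch of that idea rather than the notebook's own method. The two-entry ```project_list``` is a stand-in for the dictionary built in Step 2.

```python
import pandas as pd

# Stand-in for the dictionary built in Step 2 (two entries from the output above)
project_list = {
    'apache / spark': 'https://github.com/apache/spark',
    'apache / hadoop': 'https://github.com/apache/hadoop',
}

# Each (name, url) pair becomes one row; column names are set directly
project_df = pd.DataFrame(
    list(project_list.items()),
    columns=['project_name', 'project_url'],
)

# Reorder the columns to match the layout used above
project_df = project_df[['project_url', 'project_name']]
print(project_df)
```

Because nothing here reads from the index, running the cell twice produces the same table.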

For the last step, we can save this DataFrame as a CSV file using the following code:

In [88]:
# Export project dataframe to CSV
project_df.to_csv('project_list.csv')
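
By default, ```to_csv``` also writes the row index as an extra unnamed first column. If you only want the two data columns, passing ```index=False``` is one option. The sketch below writes to an in-memory buffer instead of a file so the result can be inspected directly. The small DataFrame is a stand-in for the real one.

```python
import io

import pandas as pd

# Stand-in for the DataFrame built above
project_df = pd.DataFrame({
    'project_url': ['https://github.com/apache/spark', 'https://github.com/apache/hadoop'],
    'project_name': ['apache / spark', 'apache / hadoop'],
})

# Write the CSV without the row index
buffer = io.StringIO()
project_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read it back to confirm the columns survive the round trip
round_trip = pd.read_csv(io.StringIO(csv_text))
print(list(round_trip.columns))
```

To write a real file, replace ```buffer``` with a filename such as ```'project_list.csv'```.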

# [Tip] Speed up web scraping with parallelization
As mentioned earlier, you can scrape faster by scraping many pages at the same time. You can use the ```concurrent.futures``` module in Python to accomplish this.

Here is example code for parallelizing web scraping:

In [None]:
from concurrent.futures import ProcessPoolExecutor
import concurrent.futures

def scrape_url(url):
    new_browser = create_webdriver()
    new_browser.get(url)

    # Extract required data here and assign it to `data`
    # ...

    new_browser.quit()

    return data

with ProcessPoolExecutor(max_workers=4) as executor:
    # Submit one scraping task per URL
    future_results = {executor.submit(scrape_url, url) for url in urlarray}

# Collect the results as each task finishes
results = []
for future in concurrent.futures.as_completed(future_results):
    results.append(future.result())

__urlarray__ = the list of all URLs you would like to scrape

You can change __max_workers=4__ to another number. Note that this code will open 4 browser windows at the same time.
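
To see the submit-and-collect pattern in isolation, here is a small sketch that runs without a browser. It swaps in ```ThreadPoolExecutor``` for ```ProcessPoolExecutor``` (both share the same ```submit``` / ```as_completed``` interface) on the assumption that waiting for pages is mostly I/O-bound. The ```scrape_url``` body is a stand-in that only parses the URL, and the URL list is made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_url(url):
    # Stand-in for real scraping: derive the repository name from the URL
    return {'url': url, 'repo': url.rsplit('/', 1)[-1]}


# Example URL list (assumption for illustration)
urlarray = [
    'https://github.com/apache/spark',
    'https://github.com/apache/hadoop',
    'https://github.com/openai/gym',
]

results = []
with ThreadPoolExecutor(max_workers=2) as executor:
    # Remember which URL each future belongs to, for error reporting
    future_to_url = {executor.submit(scrape_url, url): url for url in urlarray}
    for future in as_completed(future_to_url):
        url = future_to_url[future]
        try:
            results.append(future.result())
        except Exception as error:
            # One failed page should not discard the other results
            print(f"{url} failed: {error}")

# Results arrive in completion order, so sort before displaying
print(sorted(item['repo'] for item in results))
```

Keeping a future-to-URL mapping makes it possible to report which page failed, and a small ```max_workers``` value keeps the number of simultaneous connections low.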

As I mentioned earlier, please be mindful when scraping a website. The owners can block or sue you at any time, especially if you open multiple connections that flood their site.