# Automate Python Package Downloads

This is a simple demonstration of reading in a list of items that you would like to selectively scrape and download from the internet. Our use case is a data scientist setting up a virtual environment who must manually download several packages, which can be a long and boring task. Let's automate it with Selenium.

Motivation: Developers who work on multiple projects at a time typically use virtual environments to manage Python packages. While it is fairly easy to install many packages at once with pip or a requirements.txt file, some use cases require manually downloading packages from sites like https://repo.anaconda.com/pkgs/main/linux-64/

Table of contents:
* [Reading in List of Packages](#read)
* [Accessing Chrome via Selenium](#selenium)
* [Extracting Search Criteria](#search)
* [Automatically Download Elements From Webpage](#download)
* [Validate Downloaded Data](#validate)

### Reading in List of Packages<a id='read' ></a>

In [1]:
#We use pandas to read in and manipulate our data as a dataframe
import pandas as pd
import os 

#expand df to see all text
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None) #None removes the width limit; -1 is deprecated since pandas 1.0

In [4]:
df = pd.read_csv('/Users/PATH/package_list.txt')
df.head()

Unnamed: 0,package_name
0,https://repo.anaconda.com/pkgs/main/linux-64/anaconda-navigator-1.9.7-py37_0.tar.bz2
1,https://repo.anaconda.com/pkgs/main/linux-64/anaconda-2020.02-py38_0.tar.bz2
2,https://repo.anaconda.com/pkgs/main/linux-64/atom-0.4.3-py38hfd86e86_0.tar.bz2
3,https://repo.anaconda.com/pkgs/main/linux-64/gensim-3.8.0-py37h962f231_0.tar.bz2
4,https://repo.anaconda.com/pkgs/main/linux-64/hdijupyterutils-0.12.6-py37_0.tar.bz2


In [6]:
#How many files do we need to download?
df.shape

(5, 1)

In [41]:
#Extract only the file name of each package to use as the search criterion on the Anaconda repo
import re

df_list = df['package_name'].tolist()
package_list = []
for package in df_list:
    term = re.split('linux-64/',package)[1]
    package_list.append(term)
    
package_list

['anaconda-navigator-1.9.7-py37_0.tar.bz2',
 'anaconda-2020.02-py38_0.tar.bz2',
 'atom-0.4.3-py38hfd86e86_0.tar.bz2',
 'gensim-3.8.0-py37h962f231_0.tar.bz2',
 'hdijupyterutils-0.12.6-py37_0.tar.bz2']
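The same extraction can be sketched without depending on the `linux-64/` folder name, by taking the last component of the URL path. This is only an illustrative alternative using the standard library; the helper name `file_name_from_url` is hypothetical and not part of the original notebook.

```python
import posixpath
from urllib.parse import urlparse


def file_name_from_url(url):
    """Return the last path component of a URL (hypothetical helper)."""
    # urlparse isolates the path; posixpath.basename keeps what follows the final '/'
    return posixpath.basename(urlparse(url).path)


urls = [
    'https://repo.anaconda.com/pkgs/main/linux-64/anaconda-navigator-1.9.7-py37_0.tar.bz2',
    'https://repo.anaconda.com/pkgs/main/linux-64/atom-0.4.3-py38hfd86e86_0.tar.bz2',
]
package_names = [file_name_from_url(u) for u in urls]
```

With this approach the list would still work if the packages came from a different platform folder of the repo.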

### Accessing Chrome via Selenium<a id='selenium' > </a>

In [42]:
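from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
import os
import time

# Using Chrome to access the web via Selenium
path = '/Users/PATH' #path to the chromedriver executable
web_page = 'https://repo.anaconda.com/pkgs/main/linux-64/'

#Selenium 4 takes the driver path through a Service object (executable_path was deprecated and later removed);
#on Selenium 3, use webdriver.Chrome(executable_path=path) instead
driver = webdriver.Chrome(service=Service(path))
driver.get(web_page)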
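The cell above imports `Options` without using it. One possible use, sketched here under assumptions, is to tell Chrome where to save the downloaded files so that the validation step knows which folder to inspect. The function names `build_download_prefs` and `make_driver` are hypothetical, the sketch assumes Selenium 4, and the two preference keys are Chrome download preferences rather than anything defined in this notebook.

```python
import os


def build_download_prefs(download_dir):
    """Chrome preferences that save downloads to download_dir without prompting."""
    return {
        "download.default_directory": os.path.abspath(download_dir),
        "download.prompt_for_download": False,
    }


def make_driver(chromedriver_path, download_dir):
    """Start Chrome with a fixed download folder (assumes Selenium 4 is installed)."""
    # Imported here so the preference helper above can be used without Selenium
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    options.add_experimental_option("prefs", build_download_prefs(download_dir))
    return webdriver.Chrome(service=Service(chromedriver_path), options=options)
```

A call such as `make_driver('/Users/PATH', '/Users/PATH')` would then replace the plain `webdriver.Chrome(...)` call, with both placeholder paths filled in for your machine.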

### Extracting Search Criteria <a id='search' > </a>

In [51]:
#To find matches by text, we need to build XPath expressions that Selenium can work with
reg_begin = '//*[contains(text(), '
reg_end = ')]'

In [52]:
#build search elements for xpath
search_elements = []
for package in package_list:
    term = reg_begin + "'" + str(package) + "'" + reg_end
    search_elements.append(term)

#This is the XPath format Selenium uses to run text-based searches on a webpage
search_elements

["//*[contains(text(), 'anaconda-navigator-1.9.7-py37_0.tar.bz2')]",
 "//*[contains(text(), 'anaconda-2020.02-py38_0.tar.bz2')]",
 "//*[contains(text(), 'atom-0.4.3-py38hfd86e86_0.tar.bz2')]",
 "//*[contains(text(), 'gensim-3.8.0-py37h962f231_0.tar.bz2')]",
 "//*[contains(text(), 'hdijupyterutils-0.12.6-py37_0.tar.bz2')]"]
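Because `contains(text(), ...)` matches any element whose text includes the search string, it can return more than one element when one file name is part of another. A stricter variant is sketched below, assuming each file on the page is an `<a>` link whose text is exactly the file name (an assumption about the page, not something verified here). The helper name `build_link_xpath` is hypothetical.

```python
def build_link_xpath(file_name):
    """Build an XPath that matches a link whose text equals file_name exactly."""
    # The expression wraps the name in single quotes, so a quote inside it would break the XPath
    if "'" in file_name:
        raise ValueError("file name must not contain a single quote")
    # normalize-space() trims surrounding whitespace before the comparison
    return f"//a[normalize-space(text())='{file_name}']"


exact_search_elements = [
    build_link_xpath(name)
    for name in ['anaconda-2020.02-py38_0.tar.bz2', 'atom-0.4.3-py38hfd86e86_0.tar.bz2']
]
```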

In [49]:
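#Get the clickable elements Selenium finds for each search expression
from selenium.webdriver.common.by import By

search_inputs = []
for term in search_elements:
    #find_elements_by_xpath was removed in Selenium 4.3; find_elements(By.XPATH, ...) works in Selenium 3 and 4
    search_inputs.append(driver.find_elements(By.XPATH, term))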

#Selenium click inputs
search_inputs

[[<selenium.webdriver.remote.webelement.WebElement (session="384c4aa2e521fe9f7c47e832ec47bbf1", element="b80c8921-4372-4161-973d-2c9c04aa9220")>],
 [<selenium.webdriver.remote.webelement.WebElement (session="384c4aa2e521fe9f7c47e832ec47bbf1", element="e351b4b6-d2d9-4d76-9dff-e2add50d3eb4")>],
 [<selenium.webdriver.remote.webelement.WebElement (session="384c4aa2e521fe9f7c47e832ec47bbf1", element="375787e6-ad8c-4da4-aaca-7e3c41d60dd8")>],
 [<selenium.webdriver.remote.webelement.WebElement (session="384c4aa2e521fe9f7c47e832ec47bbf1", element="e47cd3aa-d02b-4c3e-af43-3311487c247a")>],
 [<selenium.webdriver.remote.webelement.WebElement (session="384c4aa2e521fe9f7c47e832ec47bbf1", element="8bf0730c-ec2a-4a18-a40a-853b284fc213")>]]

### Automatically Download Elements From Webpage <a id='download' > </a>

In [53]:
#Run this block of code to iterate and click through your list of search_inputs to download files

for matches in search_inputs:
    matches[0].click() #click the first element matched for this package
    #pause the loop for two seconds to prevent Selenium from crashing
    time.sleep(2)
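A fixed two-second pause does not guarantee that large files have finished downloading. As a rough sketch, one could poll the download folder until Chrome's temporary `.crdownload` files disappear. The function name `wait_for_downloads` and its parameters are hypothetical, and the download folder is assumed to be known.

```python
import glob
import os
import time


def wait_for_downloads(download_dir, timeout=60, poll_interval=0.5):
    """Return True once no partial Chrome downloads remain, False on timeout."""
    pattern = os.path.join(download_dir, "*.crdownload")
    deadline = time.monotonic() + timeout
    while True:
        # Chrome keeps a .crdownload file next to each download still in progress
        if not glob.glob(pattern):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

Note that this returns `True` immediately if no download has started yet, so it would be called after the clicks, for example `wait_for_downloads('/Users/PATH')` with the placeholder path filled in.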

### Validate Downloaded Data <a id='validate' > </a>

In [57]:
#Make sure we have downloaded all of the missing packages
import glob
import os

new_path = '/Users/PATH/*.tar.bz2' #including *.tar.bz2 collects all files in the folder with the .tar.bz2 file extension
tar_packages = glob.glob(new_path)

final_tar_list = []
for tar_path in tar_packages:
    file_name = os.path.basename(tar_path) #drop the folder part of the path
    final_tar_list.append(file_name[:-len('.tar.bz2')]) #remove file extension

#Confirm we have downloaded the correct packages
final_tar_list

['gensim-3.8.0-py37h962f231_0',
 'anaconda-2020.02-py38_0',
 'anaconda-navigator-1.9.7-py37_0',
 'hdijupyterutils-0.12.6-py37_0',
 'atom-0.4.3-py38hfd86e86_0']