# Co-op Rich Performance Task
#### At U+Education I had to make a Web Crawling Application. Using Python, I researched libraries and came up with innovative ways to build this application. This document records my thought process while making the application and explains how to make it. It also demonstrates what I learned through this project: how to use Selenium, how to use Google's APIs, and how to create functions in Python.

## Goal
#### The goal of this application is to scrape information off the web automatically, minimizing human interaction. 

## Step 1
#### First, we need to find and import libraries that can (1) open a browser, (2) parse the HTML code, and (3) read the HTML code.
> I did some research and found that the best library for me to work with was Selenium. Selenium lets me control a browser from Python, locate elements in the page's HTML, and read those elements as text. Therefore, in the code, I imported Selenium's Chrome browser and its "By" class.

In [11]:
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By

## Step 2
#### Initializing AI to help us easily find the text in HTML code
> We will be using Google's Gemini Generative AI API to help us. First we will need to import the API and create a function so we can talk to the Generative AI. We will use this function later on to ask the AI to find specific text in HTML code.

In [12]:
# Import the API
import google.generativeai as genai

# Initialize the Generative AI
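# Read the key from an environment variable instead of hardcoding it
# (the variable name GEMINI_API_KEY is an assumption; use whatever name you set)
import os
API_KEY = os.environ['GEMINI_API_KEY']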
genai.configure(api_key=API_KEY)
geminiModel = genai.GenerativeModel('gemini-pro')

# Create a function to talk to the AI
def gemeni_response(message):
  return geminiModel.generate_content(message).text

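The prompts used later in this document follow one pattern: a question, an instruction to reply with only the value, and the HTML snippet. A minimal sketch of that pattern as a reusable helper is shown here. The names `build_prompt`, `ask_for_field`, and `fake_ask` are hypothetical and not part of the original application; the `ask` parameter stands in for `gemeni_response` so the sketch can run without an API key.

```python
def build_prompt(field, html_snippet):
    """Build a prompt asking the model for one field in an HTML snippet."""
    return (
        f'What is this {field} found in this code '
        f'(respond with only the {field}): {html_snippet}'
    )


def ask_for_field(ask, field, html_snippet):
    """Send the prompt through `ask` and trim whitespace around the reply."""
    reply = ask(build_prompt(field, html_snippet))
    return reply.strip()


# Hypothetical stand-in for gemeni_response so the sketch runs offline.
def fake_ask(prompt):
    return '  Chess Club\n'


result = ask_for_field(fake_ask, 'name', '<h1>Chess Club</h1>')
print(result)
```

With the real model, the same helper could be called as `ask_for_field(gemeni_response, 'email', some_html)`, where `some_html` is whatever snippet Selenium returned.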
## Step 3
#### We then need to initialize variables for the browser and our URL, and open the Chrome browser.
> For the browser, we can initialize a variable called "driver" with Selenium's Chrome() class. For this application, we will open the following link, https://westernu.campuslabs.ca/engage/organizations, and scrape the names and emails of all the organizations.

In [13]:
# Initialize variables
driver = Chrome()
URL = 'https://westernu.campuslabs.ca/engage/organizations'

# Open the browser
driver.get(URL)

## Step 4
#### We must load all the data of the site
> When opening the site, you will notice that there is a "load more" button. We must use code to press this button as many times as needed until all organizations are loaded onto the page. We can do this by locating the "load more" element and clicking it using Selenium's functions.

In [14]:
# Import necessary libraries (in this case we need a function to pause the program briefly)
from time import sleep

# Load All Contents Within the Site
while True:
  try:
    # Click the "load more" button, then pause for 0.05 seconds
    driver.find_element(By.CSS_SELECTOR, 'button[tabindex="0"][type="button"]').click()
    sleep(0.05)
  except Exception:
    # The button can no longer be found or clicked, so stop
    break

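The "click until the button disappears" idea can also be written as a small function with an upper limit on clicks, so the loop cannot run forever if the button never goes away. This is only a sketch: `click_until_gone`, `FakeButton`, and `FakePage` are hypothetical names, and the fake page replaces the real browser so the example runs without Selenium.

```python
from time import sleep


def click_until_gone(find_button, pause=0.05, max_clicks=1000):
    """Click the element returned by find_button() until it cannot be
    found or clicked, or until max_clicks is reached. Returns the number
    of successful clicks."""
    clicks = 0
    while clicks < max_clicks:
        try:
            find_button().click()
        except Exception:
            # Button is missing or not clickable any more
            break
        clicks += 1
        sleep(pause)
    return clicks


class FakeButton:
    """Hypothetical stand-in for a Selenium element."""
    def click(self):
        pass


class FakePage:
    """Hypothetical page whose button disappears after a few clicks."""
    def __init__(self, remaining):
        self.remaining = remaining

    def find_button(self):
        if self.remaining == 0:
            raise LookupError('button not found')
        self.remaining -= 1
        return FakeButton()


page = FakePage(remaining=3)
total_clicks = click_until_gone(page.find_button, pause=0)
print(total_clicks)
```

In the real application, `find_button` could be a lambda that wraps the `driver.find_element(...)` call from the cell above.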
## Step 5 and 6
#### The next step is to scrape the HTML code and certain elements off the website by using Selenium's "By" locators. After finding these elements, we also need to save the information into a JSON file for easy access.
> For step 5, we need to know what elements we are looking for and the HTML tags for these elements. We can find this by using the developer tools in the browser and selecting the elements we are looking for. For this website, we are specifically looking for the Organization Name and the Organization Email. In this step, we will also use the Generative AI we initialized earlier and ask it to find the Organization Name and Email in the HTML code.

> For step 6, we will need to save our information into a JSON file named "westernorgsinfo.json". To do this, we first open the JSON file in append mode, which lets the code write to it. We will then save every name and email of the organizations into the file.

In [15]:
# ==================STEP 5==================
# Finds all the links to each organization
clubs = driver.find_elements(By.CSS_SELECTOR, 'a[href][style="display: block; text-decoration: none; margin-bottom: 20px;"]')

# Create a loop to click through all the links found above
for i in range(len(clubs)):
  clubs = driver.find_elements(By.CSS_SELECTOR, 'a[href][style="display: block; text-decoration: none; margin-bottom: 20px;"]')
  # Click link
  clubs[i].click()
  sleep(1.25)
  
  # Get Name of Club
  try: 
    name = driver.find_element(By.CSS_SELECTOR, 'h1[style="padding: 13px 0px 0px 85px;"]').get_attribute('innerHTML')
    # Use Generative AI to get Name from HTML Code
    name = gemeni_response(f'What is this name found in this code (respond with only the name): {name}')
  except Exception: 
    # If no name is found, put n/a
    name = 'n/a'

  # Get Email of Club
  try: 
    email_card = driver.find_element(By.CSS_SELECTOR, 'div[style="margin-left: 5px; padding: 5px 15px; border-left: 1px solid rgb(210, 210, 210);"]').get_attribute('innerHTML')
    # Use Generative AI to get Email from HTML Code
    email = gemeni_response(f'What is this email found in this code (respond with only the email): {email_card}')
  except Exception: 
    # If no email was found, put n/a
    email = 'n/a'

# ==================STEP 6==================
  # Import JSON library
  import json

  # Save info
  data = {
    'club' : name,
    'email' : email
  }

  # Open the JSON file in append mode
  with open('westernorgsinfo.json', 'a') as JSONFile:
    # Write info into JSON file
    json.dump(data, JSONFile)
    # Separate records with a comma and a new line
    JSONFile.write(',\n')

  # Tells browser to go back to last page
  driver.back()
  sleep(0.75)

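As a cross-check on the AI's answer, or as a fallback when the API call fails, an email address can often be pulled out of the HTML with a regular expression. This is a swapped-in technique, not what the application above does. It is a rough sketch: `find_email` and `EMAIL_PATTERN` are hypothetical names, the pattern is a simplification that will not match every valid address, and the sample HTML is made up.

```python
import html
import re

# Simplified email pattern (an assumption; real addresses can be more varied)
EMAIL_PATTERN = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')


def find_email(html_snippet, default='n/a'):
    """Return the first email-like string in the snippet, or the default."""
    # Decode entities such as &#64; before searching
    text = html.unescape(html_snippet)
    match = EMAIL_PATTERN.search(text)
    return match.group(0) if match else default


# Made-up sample markup for illustration
sample = '<span>Contact: <a href="mailto:club@example.com">club@example.com</a></span>'
print(find_email(sample))
print(find_email('<div>No contact listed</div>'))
```

If the regular expression and the AI disagree on the email, that record is a good candidate for a manual check.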

## Final Step
#### Check results and see if they are accurate
> We can check our results by opening the JSON file, "westernorgsinfo.json", and seeing if there are 229 organizations recorded. After opening it, you will see that there are 229 organizations, and most if not all of the information is recorded properly.

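The count can also be checked with code instead of by eye. Because each record is written followed by a comma and a new line, the file as a whole is not a valid JSON document, so the sketch below removes the trailing comma and wraps the text in brackets before parsing. The function name `load_records` and the demo records are hypothetical; the demo writes a small temporary file in the same format rather than reading the real "westernorgsinfo.json".

```python
import json
import os
import tempfile


def load_records(path):
    """Parse a file of JSON objects separated by a comma and newline."""
    with open(path, encoding='utf-8') as f:
        text = f.read().strip()
    # Drop the trailing comma, then wrap everything in brackets
    return json.loads('[' + text.rstrip(',') + ']')


# Write a small demo file in the same format the application produces
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.json')
demo_records = [
    {'club': 'Club A', 'email': 'a@example.com'},
    {'club': 'Club B', 'email': 'n/a'},
]
with open(demo_path, 'a', encoding='utf-8') as f:
    for record in demo_records:
        json.dump(record, f)
        f.write(',\n')

records = load_records(demo_path)
missing = [r['club'] for r in records if r['email'] == 'n/a']
print(len(records), 'records;', len(missing), 'without an email')
```

Pointing `load_records` at "westernorgsinfo.json" should give a list whose length can be compared with the expected 229 organizations.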
## Conclusion
#### This is how I created my first Web Crawling Application. After some research and reading Python documentation, I have learned, firstly, how to use Selenium to browse the web from Python, secondly, how to use Google's Generative AI API in my Python applications, and lastly, how to save data from this Python file into another file. Overall, I have successfully fulfilled my goal of automating the search for information on the web and minimizing human interaction as much as possible.
> Click on the "demonstration.mp4" file to look at how this application works.