# Web Scraping Non-profit Organizations' Profile Information

This script extracts the names and profile URLs of non-profit organizations listed on [idealist.org](https://www.idealist.org/es), then scrapes each profile to create a directory and map of organizations providing health and human services in the Washington, DC, metropolitan area.

### Import the required libraries for scraping the data with Selenium

In [1]:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

import pandas as pd

Create a list to hold the URL of each organization's `idealist.org` profile. These URLs are used later to extract data about each organization.

In [2]:
org_url = []

In [3]:
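# Start one browser session and reuse it for every results page
ser = Service('chromedriver/chromedriver')
op = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=ser, options=op)

for i in range(1, 51):
    URL = f'https://www.idealist.org/en/organizations?page={i}&q='
    driver.get(URL)

    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly

    # These class names are generated by the site and may change over time
    data = soup.find_all('a', {'class': 'sc-uln2n7-0 eIARxM'})

    for val in data:
        # Read the href attribute directly instead of splitting the tag's string form
        href = val.get('href')
        if href:
            # Skip anchors without a link so only visitable URLs are stored
            org_url.append('https://www.idealist.org' + href)

# End the browser session once all pages have been read
driver.quit()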
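
As a standalone illustration of the link-extraction step, the sketch below parses a small, made-up HTML snippet and builds absolute profile URLs. The function name `extract_profile_urls`, the class name `org-link`, and the sample paths are assumptions for illustration only; they are not taken from the live site.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'https://www.idealist.org'


def extract_profile_urls(html, css_class):
    """Return absolute URLs for every <a> with the given class and an href."""
    soup = BeautifulSoup(html, 'html.parser')
    urls = []
    for anchor in soup.find_all('a', class_=css_class):
        href = anchor.get('href')  # None when the attribute is missing
        if href:
            urls.append(urljoin(BASE_URL, href))
    return urls


# Made-up markup: one usable link, one anchor without href, one unrelated link
sample_html = '''
<a class="org-link" href="/en/nonprofit/example-one">Example One</a>
<a class="org-link">No link here</a>
<a class="other" href="/en/about">About</a>
'''

profile_urls = extract_profile_urls(sample_html, 'org-link')
print(profile_urls)
```

Reading the attribute with `.get('href')` avoids depending on how the tag happens to be serialized, and anchors without a link are skipped rather than stored as placeholders.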

Extract information about every organization by visiting its profile URL. Each list stores `(url, value)` tuples so the results can be joined on the URL later.

In [4]:
org_name = []
org_location = []
org_website = []
org_about = []
org_services = []
org_type = []

In [5]:
# Start one browser session and reuse it for every profile page
ser = Service('chromedriver/chromedriver')
op = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=ser, options=op)

for url in org_url:

    try:
        # Loading and parsing the page are the steps that can fail,
        # so they sit inside the try block
        driver.get(url)

        html = driver.page_source
        soup = BeautifulSoup(html, 'html.parser')

        # find_all returns an empty list (not an error) when nothing matches
        org_name.append((url, soup.find_all('h1', {'class': 'sc-1q4cy5p-0 dyiqpT'})))

        org_location.append((url, soup.find_all('div', {'class': 'sc-59ntl3-0 gpJdOX'})))

        org_website.append((url,
                            soup.find_all('a', {'class': 'sc-dlo1ho-0 QxmwN'})))

        org_about.append((url, soup.find_all('div', {'class': 'sc-n1vyd2-1'})))

        org_services.append((url, soup.find_all('li', {'class': 'sc-59ntl3-0 gNRvwb'})))

        org_type.append((url, soup.find_all('h5', {'class': 'sc-1q4cy5p-0 POIXj'})))

    except Exception:
        # Record a placeholder for every field when the page cannot be loaded
        org_name.append((url, 'No value'))
        org_location.append((url, 'No value'))
        org_website.append((url, 'No value'))
        org_about.append((url, 'No value'))
        org_services.append((url, 'No value'))
        org_type.append((url, 'No value'))

# End the browser session once all profiles have been read
driver.quit()
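
The lists above hold raw BeautifulSoup results, which still contain HTML. As a minimal sketch of pulling clean text out of a parsed page, the helper below returns the stripped text of the first matching element, or `None` when nothing matches. The helper name `first_text`, the class names, and the sample markup are hypothetical and only serve the illustration.

```python
from bs4 import BeautifulSoup


def first_text(soup, tag, css_class):
    """Return the stripped text of the first matching element, or None."""
    element = soup.find(tag, class_=css_class)
    if element is None:
        return None
    return element.get_text(strip=True)


# Made-up profile fragment standing in for a downloaded page
sample_html = '''
<h1 class="org-name">  Example Community Clinic  </h1>
<div class="org-location">Washington, DC 20001</div>
'''

sample_soup = BeautifulSoup(sample_html, 'html.parser')
name_text = first_text(sample_soup, 'h1', 'org-name')
location_text = first_text(sample_soup, 'div', 'org-location')
type_text = first_text(sample_soup, 'h5', 'org-type')  # element is absent
print(name_text, location_text, type_text)
```

Returning `None` for a missing element keeps "not found" distinct from an empty string, which makes later filtering simpler.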

Convert each list into a DataFrame keyed by `org_url`, dropping duplicate URLs, then merge the DataFrames into one.

In [6]:
name = pd.DataFrame({'org_url':[i[0] for i in org_name],'name':[i[1] for i in org_name]}).drop_duplicates(subset=['org_url'])
location = pd.DataFrame({'org_url':[i[0] for i in org_location],'location':[i[1] for i in org_location]}).drop_duplicates(subset=['org_url'])
website = pd.DataFrame({'org_url':[i[0] for i in org_website],'website':[i[1] for i in org_website]}).drop_duplicates(subset=['org_url'])
about = pd.DataFrame({'org_url':[i[0] for i in org_about],'about':[i[1] for i in org_about]}).drop_duplicates(subset=['org_url'])
services = pd.DataFrame({'org_url':[i[0] for i in org_services],'services':[i[1] for i in org_services]}).drop_duplicates(subset=['org_url'])
types = pd.DataFrame({'org_url':[i[0] for i in org_type],'org_type':[i[1] for i in org_type]}).drop_duplicates(subset=['org_url'])

In [7]:
for feature in [location,website,about,services,types]:
    # An inner join keeps only the organizations present in both tables,
    # which is what filtering a left join on _merge == 'both' achieved
    name = name.merge(feature, on = 'org_url', how ='inner')
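
The merge loop above keeps only organizations whose `org_url` appears in every table. As a sketch of the same idea on made-up data, the example below chains inner joins with `functools.reduce`; the toy DataFrames and their values are invented for illustration.

```python
from functools import reduce

import pandas as pd

# Toy tables: only org 'a' appears in all three
names = pd.DataFrame({'org_url': ['a', 'b', 'c'], 'name': ['A', 'B', 'C']})
locations = pd.DataFrame({'org_url': ['a', 'b'], 'location': ['DC', 'MD']})
websites = pd.DataFrame({'org_url': ['a', 'c'], 'website': ['a.org', 'c.org']})


def merge_on_url(left, right):
    """Inner-join two tables on org_url, keeping only shared organizations."""
    return left.merge(right, on='org_url', how='inner')


merged = reduce(merge_on_url, [names, locations, websites])
print(merged)
```

Organizations missing from any one table are dropped, so the result contains complete rows only.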

In [8]:
df = name

Clean the `location` column by extracting the plain-text address from the scraped HTML. The address is used later to look up each organization's latitude and longitude with the Google Maps API.

In [9]:
df['location'] = [str(loc).split('>Share')[1].split('<div class="sc-59ntl3-0 gpJdOX">')[1].split('<')[0] for loc \
                  in df['location']]

Convert each organization's address into a URL-ready query string, so that the Google Maps API can return the latitude and longitude needed to plot each organization on a map.

In [10]:
df['location_url'] = ['+'.join(address.split(' ')) for address in df['location']]
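
Joining on spaces escapes only the spaces; characters such as `#`, `&`, or `,` also need escaping inside a query string. A minimal sketch, assuming the standard library's `urllib.parse.urlencode` is acceptable here, is shown below. The function name `build_geocode_url` and the sample address are made up for illustration.

```python
from urllib.parse import urlencode

GEOCODE_ENDPOINT = 'https://maps.googleapis.com/maps/api/geocode/json'


def build_geocode_url(address, api_key):
    """Build a geocoding request URL with a fully escaped query string."""
    query = urlencode({'address': address, 'key': api_key})
    return f'{GEOCODE_ENDPOINT}?{query}'


# Invented address containing characters that a plain '+' join would not escape
sample_url = build_geocode_url('123 Main St #4, Washington, DC 20001', 'YOUR_API_KEY_HERE')
print(sample_url)
```

Without escaping, a `#` in an address would be read as the start of a URL fragment, and everything after it, including the key, would be cut from the request.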

In [11]:
address_lat_long = []

To look up the latitude and longitude of each organization's address, obtain a [Google Maps API key here](https://developers.google.com/maps) and replace `YOUR_API_KEY_HERE` in the `key` parameter below with your API key.

In [12]:
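for address in df['location_url']:
    URL = f'https://maps.googleapis.com/maps/api/geocode/json?address={address}&key=YOUR_API_KEY_HERE'
    
    try:
        # A timeout keeps one slow request from stalling the whole loop
        req = requests.get(URL, timeout=10)

        # Named payload to avoid shadowing the standard json module name
        payload = req.json()

        # Take the coordinates of the first (best) match
        address_lat_long.append((address,payload["results"][0]["geometry"]["location"]))
    except Exception:
        # Covers network errors, invalid JSON, and responses with no results
        address_lat_long.append((address,'No coordinates found'))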
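
The geocoding response carries a `status` field alongside `results`, so a lookup with no match can be detected explicitly instead of through an exception. The sketch below shows one way to do that on already-parsed JSON; the helper name `extract_location` and both sample payloads are made up, trimmed to the fields the loop above reads.

```python
def extract_location(payload):
    """Return (lat, lng) from a parsed geocoding response, or None if no match."""
    if payload.get('status') != 'OK' or not payload.get('results'):
        return None
    location = payload['results'][0]['geometry']['location']
    return location['lat'], location['lng']


# Invented payloads for illustration only
found = {
    'status': 'OK',
    'results': [{'geometry': {'location': {'lat': 38.9, 'lng': -77.03}}}],
}
not_found = {'status': 'ZERO_RESULTS', 'results': []}

print(extract_location(found))
print(extract_location(not_found))
```

Checking the status first separates "the address has no match" from genuine failures such as a network error, which can then be handled on their own.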

Create two columns for the latitude and longitude of each address.

In [13]:
df['lat_long'] =  [address[1] for address in address_lat_long]

In [14]:
df['lat'] = [lat_long['lat'] if lat_long != 'No coordinates found' else lat_long for lat_long in df.lat_long]
df['lng'] = [lat_long['lng'] if lat_long != 'No coordinates found' else lat_long for lat_long in df.lat_long]
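
Storing the string `'No coordinates found'` next to numeric values leaves `lat` and `lng` with mixed types, which complicates plotting. As a sketch of an alternative, assuming missing coordinates may be represented as `NaN`, the helper below builds numeric columns. The function name `split_coordinates` and the sample values are hypothetical.

```python
import pandas as pd


def split_coordinates(lat_long_values):
    """Build numeric lat/lng columns, using NaN where no coordinates exist."""
    lat = [v['lat'] if isinstance(v, dict) else None for v in lat_long_values]
    lng = [v['lng'] if isinstance(v, dict) else None for v in lat_long_values]
    return pd.DataFrame({
        'lat': pd.to_numeric(pd.Series(lat), errors='coerce'),
        'lng': pd.to_numeric(pd.Series(lng), errors='coerce'),
    })


# Invented values: one successful lookup and one failed lookup
sample_values = [{'lat': 38.9, 'lng': -77.03}, 'No coordinates found']
coords = split_coordinates(sample_values)
print(coords)
```

With numeric columns, rows lacking coordinates can be filtered with `coords.dropna()` before mapping.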

In [15]:
df.drop(columns=['lat_long','location_url'],inplace=True)

Obtain the ZIP code associated with each address by taking the last token of the location string once the country name and commas are removed.

In [16]:
df['zip'] = [loc.replace(' United States','').replace(' USA','').replace(',','').split(' ')[-1] for loc in df['location']]
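
Taking the last token returns a state or city name whenever an address has no ZIP code. A hedged alternative is to search for a five-digit group with a regular expression, as sketched below; the pattern, the function name `extract_zip`, and the sample addresses are assumptions for illustration.

```python
import re

# Five digits, optionally followed by a ZIP+4 suffix that is ignored
ZIP_PATTERN = re.compile(r'\b(\d{5})(?:-\d{4})?\b')


def extract_zip(location):
    """Return the last five-digit ZIP code in the string, or None if absent."""
    matches = ZIP_PATTERN.findall(location)
    return matches[-1] if matches else None


# Invented addresses covering the common cases
print(extract_zip('1234 Example St NW, Washington, DC 20001, United States'))
print(extract_zip('12345 Main St, Rockville, MD 20850-1234, USA'))
print(extract_zip('Washington, DC, USA'))
```

Using the last match guards against five-digit street numbers, which appear before the ZIP code in a typical US address.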

Export the data for further cleaning and exploratory analysis.

In [22]:
df.to_csv('../StartingWithToday/data/non-profit-orgs.csv',
         index=False)