## Module 10 Assignment - Scraping a Website
* Author: Ashish Rogannagari
* Version: 1.0

We will create a web scraper to parse a table from the Charities Bureau website. From the website: “All 
charitable organizations operating in New York State are required by law to register and file annual financial reports 
with the Attorney General's Office. This includes any organization that conducts charitable activities, holds property 
that is used for charitable purposes, or solicits financial or other contributions.”

### Step 1.1: Load Modules and Set Up WebDriver

In [47]:
# Load necessary modules
#!pip install webdriver-manager
#!pip install awscli
import boto3
import pandas as pd
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Set up WebDriver
s=Service(ChromeDriverManager().install())
browser = webdriver.Chrome(service=s)


### Step 1.2: Scrape the Website and Create DataFrame

In [49]:
# Access the URL with WebDriver
browser.get('https://www.charitiesnys.com/RegistrySearch/search_charities.jsp')

# Find the search field and enter the search criterion
inputElement = browser.find_element(By.XPATH,'//*[@id="header"]/div[2]/div/table/tbody/tr/td[2]/div/div/font/font/font/font/font/font/table/tbody/tr[4]/td/form/table/tbody/tr[2]/td[2]/input[1]')
inputElement.send_keys('0')

# Click the search button; click() returns None, so the result is not assigned
browser.find_element(By.XPATH,'//*[@id="header"]/div[2]/div/table/tbody/tr/td[2]/div/div/font/font/font/font/font/font/table/tbody/tr[4]/td/form/table/tbody/tr[10]/td/input[1]').click()

# Pause to give the results page time to load
sleep(4)

# Scrape the table
table = browser.find_element(By.CSS_SELECTOR,'table.Bordered')
sleep(1)

# Create an empty list to hold the cell text of each row
rows = []

# Loop through the table rows and extract the text of every <td> cell
for row in table.find_elements(By.CSS_SELECTOR,'tr'):
    rows.append([cell.text for cell in row.find_elements(By.CSS_SELECTOR,'td')])

# Build the DataFrame from the collected rows and name the columns
df = pd.DataFrame(rows, columns = ["Organization Name", "NY Reg #", "EIN" ,"Registrant Type","City","State"])

# Display DataFrame
display(df)


Unnamed: 0,Organization Name,NY Reg #,EIN,Registrant Type,City,State
0,,,,,,
1,"""Forever Captain Poodaman"" The Ahmad Butler Fo...",48-07-16,843800926.0,NFP,PHILADELPHIA,PA
2,"""Incredibly Blessed"" Inc",49-54-61,842071758.0,NFP,STATEN ISLAND,NY
3,"""R"" S.U.C.C.E.S.S. Foundation Inc.",49-06-59,874012670.0,NFP,ROCHESTER,NY
4,"""Studio 5404"" Inc.",44-39-58,463180470.0,NFP,MASSAPAQUA,NY
5,"""THEY ARE HAITIAN"" FUND, INC.",20-63-46,300170128.0,NFP,HUDSON,NY
6,"""Y"" Dive, Inc.",48-45-01,854252095.0,NFP,SAINT ALBANS,NY
7,(ASMA) American Syrian Multicultural Associati...,42-84-63,273130182.0,NFP,BROOKLYN,NY
8,#FeedHamburg,48-37-35,854150318.0,NFP,HAMBURG,NY
9,#HicksStrong Inc.,48-10-48,842612081.0,NFP,CLIFTON PARK,NY

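Row 0 of the output above is empty. A likely cause, assuming the table's header row is built from `th` cells, is that the loop collects only `td` cells and therefore appends an empty list for that row. The sketch below shows one way to filter such rows before the DataFrame is built; `keep_data_rows` and the sample list are hypothetical and not part of the notebook.

```python
def keep_data_rows(rows, n_cols):
    """Return only the rows that have exactly n_cols cells.

    A header row scraped through 'td' selectors yields an empty list,
    so it is dropped here instead of becoming an all-null DataFrame row.
    """
    return [row for row in rows if len(row) == n_cols]


# Illustrative input shaped like the scraped rows (values taken from the output above)
scraped = [
    [],  # header row: no <td> cells (assumed cause of the empty row)
    ["#FeedHamburg", "48-37-35", "854150318", "NFP", "HAMBURG", "NY"],
    ["#HicksStrong Inc.", "48-10-48", "842612081", "NFP", "CLIFTON PARK", "NY"],
]

data_rows = keep_data_rows(scraped, n_cols=6)
print(len(data_rows))  # 2
```

The filtered list could be passed to `pd.DataFrame` in place of the unfiltered one. Step 2 of this notebook removes the empty row after the fact instead.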

### Step 1.3: Create an S3 Bucket

In [50]:
# Initialize the S3 client; no region is given, so boto3 uses the default region from the AWS configuration
s3 = boto3.client('s3')

# Bucket names must be globally unique across all of S3
bucket_name = 'm-10-assignment-ashish'

# Create S3 bucket
s3.create_bucket(Bucket=bucket_name)

print(f"Bucket '{bucket_name}' created successfully.")

Bucket 'm-10-assignment-ashish' created successfully.

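Calling `create_bucket` with only a bucket name works when the client targets `us-east-1`. For any other region, S3 expects a `CreateBucketConfiguration` with a matching `LocationConstraint`. The following is a minimal sketch under that assumption; `create_bucket_in_region` is a hypothetical helper that takes an existing boto3 S3 client.

```python
def create_bucket_in_region(s3_client, bucket_name, region):
    """Create an S3 bucket, adding a LocationConstraint outside us-east-1.

    s3_client is expected to be a boto3 S3 client created for the same region,
    e.g. boto3.client('s3', region_name=region).
    """
    kwargs = {"Bucket": bucket_name}
    if region != "us-east-1":
        # us-east-1 is the default and must not be passed as a LocationConstraint
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return s3_client.create_bucket(**kwargs)
```

Passing the client in as an argument keeps the helper independent of how credentials and the region are configured.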

### Step 1.4: Convert the DataFrame to CSV and Load It into the S3 Bucket
###### Add a timestamp to the file name.

In [55]:
import boto3
import pandas as pd
from io import StringIO  # For handling CSV content as a string buffer
from datetime import datetime  # For generating timestamps

# Write the scraped DataFrame 'df' to an in-memory CSV buffer
csv_buffer = StringIO()
df.to_csv(csv_buffer)

# Generate timestamp
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

# Initialize an S3 resource using boto3
s3_resource = boto3.resource('s3')
bucket_name = 'm-10-assignment-ashish'  # Replace with your bucket name
file_name = f'charities_bureau_scrape_{timestamp}.csv'  # Append timestamp to file name

# Upload the CSV file
s3_resource.Object(bucket_name, file_name).put(Body=csv_buffer.getvalue())

print(f"Successfully uploaded file {file_name} to S3 bucket {bucket_name}")


Successfully uploaded file charities_bureau_scrape_2024-04-09_15-11-55.csv to S3 bucket m-10-assignment-ashish

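The file name above embeds the local wall-clock time. The sketch below wraps the same naming scheme in a reusable function; `timestamped_key` is a hypothetical helper, and defaulting to UTC is an assumption made here so that keys sort the same way regardless of the machine's time zone.

```python
from datetime import datetime, timezone


def timestamped_key(prefix, now=None):
    """Build an object key such as 'prefix_2024-04-09_15-11-55.csv'."""
    if now is None:
        now = datetime.now(timezone.utc)  # assumption: UTC rather than local time
    return f"{prefix}_{now.strftime('%Y-%m-%d_%H-%M-%S')}.csv"


# Reproduces the key format used above for a fixed point in time
example = timestamped_key("charities_bureau_scrape", datetime(2024, 4, 9, 15, 11, 55))
print(example)  # charities_bureau_scrape_2024-04-09_15-11-55.csv
```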

### Step 1.5: Check the Objects in the S3 Bucket
###### For cross-verification.

In [56]:
s3_client = boto3.client('s3')

# Specify the bucket name
bucket_name = 'm-10-assignment-ashish'  # Replace with your bucket name

# List objects in the bucket
response = s3_client.list_objects_v2(Bucket=bucket_name)

# Print the list of objects
if 'Contents' in response:
    print("Objects in the bucket:")
    for obj in response['Contents']:
        print(obj['Key'])
else:
    # A missing bucket raises a ClientError instead of reaching this branch
    print("The bucket is empty.")

Objects in the bucket:
charities_bureau_scrape_2024-04-09_15-11-55.csv

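`list_objects_v2` returns at most 1,000 keys per call, which is enough for this bucket but not for a large one. The sketch below walks every page with boto3's `list_objects_v2` paginator; `list_all_keys` is a hypothetical helper that takes an existing S3 client.

```python
def list_all_keys(s3_client, bucket_name):
    """Return every object key in the bucket, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket_name):
        # 'Contents' is absent from a page when the bucket has no objects
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys
```

With the client from the cell above, `list_all_keys(s3_client, bucket_name)` would return the keys as a plain list.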

The data was successfully scraped from the website, converted into a DataFrame, and loaded into the AWS S3 bucket.


### Step 2: Perform Data Cleaning


##### The CSV data we uploaded contains an empty first row, so we need to remove it and adjust the index.

In [57]:
# Remove the first row with null values.

df = df[1:]  

# Reset index starting from 0

df.reset_index(drop=True, inplace=True)

# Increment index by 1 to start from 1.

df.index += 1  


##### Make a copy of DataFrame df and assign it to df1

In [58]:
# Use .copy() so that df1 is an independent DataFrame rather than a second name for df
df1 = df.copy()

#### Let's display df1 to check that the empty row has been removed.

In [61]:
df1

Unnamed: 0,Organization Name,NY Reg #,EIN,Registrant Type,City,State
1,"""Forever Captain Poodaman"" The Ahmad Butler Fo...",48-07-16,843800926,NFP,PHILADELPHIA,PA
2,"""Incredibly Blessed"" Inc",49-54-61,842071758,NFP,STATEN ISLAND,NY
3,"""R"" S.U.C.C.E.S.S. Foundation Inc.",49-06-59,874012670,NFP,ROCHESTER,NY
4,"""Studio 5404"" Inc.",44-39-58,463180470,NFP,MASSAPAQUA,NY
5,"""THEY ARE HAITIAN"" FUND, INC.",20-63-46,300170128,NFP,HUDSON,NY
6,"""Y"" Dive, Inc.",48-45-01,854252095,NFP,SAINT ALBANS,NY
7,(ASMA) American Syrian Multicultural Associati...,42-84-63,273130182,NFP,BROOKLYN,NY
8,#FeedHamburg,48-37-35,854150318,NFP,HAMBURG,NY
9,#HicksStrong Inc.,48-10-48,842612081,NFP,CLIFTON PARK,NY
10,#WalkAway Foundation,47-15-80,832820906,NFP,CARLSBAD,CA


### Step 2.1: Upload the Cleaned df1 Data to the S3 Bucket

In [59]:
import boto3
import pandas as pd
from io import StringIO  # For handling CSV content as a string buffer
from datetime import datetime  # For generating timestamps

# Write the cleaned DataFrame 'df1' to an in-memory CSV buffer
csv_buffer = StringIO()
df1.to_csv(csv_buffer)

# Generate timestamp
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

# Initialize an S3 resource using boto3
s3_resource = boto3.resource('s3')
bucket_name = 'm-10-assignment-ashish'  # Replace with your bucket name
file_name = f'Updated_csv_charities_bureau_scrape_{timestamp}.csv'  # Append timestamp to file name

# Upload the CSV file
s3_resource.Object(bucket_name, file_name).put(Body=csv_buffer.getvalue())

print(f"File {file_name} uploaded to {bucket_name}")


File Updated_csv_charities_bureau_scrape_2024-04-09_15-22-45.csv uploaded to m-10-assignment-ashish


### Step 2.2: Check the Objects in the S3 Bucket

In [60]:

# Initialize an S3 client using boto3
s3_client = boto3.client('s3')

# Specify the bucket name
bucket_name = 'm-10-assignment-ashish'  # Replace with your bucket name

# List objects in the bucket
response = s3_client.list_objects_v2(Bucket=bucket_name)

# Print the list of objects
if 'Contents' in response:
    print("Objects in the bucket:")
    for obj in response['Contents']:
        print(obj['Key'])
else:
    # A missing bucket raises a ClientError instead of reaching this branch
    print("The bucket is empty.")


Objects in the bucket:
Updated_csv_charities_bureau_scrape_2024-04-09_15-22-45.csv
charities_bureau_scrape_2024-04-09_15-11-55.csv


## References
* https://www.programiz.com/python-programming/working-csv-files
* https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.create_bucket
* https://realpython.com/python-boto3-aws-s3/
* https://robertorocha.info/setting-up-a-selenium-web-scraper-on-aws-lambda-with-python/ 

<center>

# Thank you

</center>
