# HelpYourNGO Before Glue Transformations

## helpyourngo.json

__Data provided by:__ www.helpyourngo.com <br/>
__Source:__ s3://daanmatchdatafiles/webscrape-fall2021/helpyourngo.json <br/>
__Type:__ json <br/>
__Last Modified:__ October 31, 2021, 14:58:41 (UTC-07:00) <br/>
__Size:__ 1.6 MB <br/>

helpyourngo.json (referred to as helpyourngo_df) contains: <br/>
A list of NGOs indexed on helpyourngo.com

* COLUMN NAME: Content
    * Issues
    * Transformations

* name: NGO Name
    * Issues: Duplicate Names (e.g. Search NGO)
* last_updated: Most recent year that this data was collected
    * Transformations: Convert the year to a date string (YYYY-01-01)
    
* address: Address
    * Issues: Escape chars

* mobile: Phone Number
    * Issues: Some NGOs have multiple phone numbers in the same column; numbers may have an extra leading 0; the country code may be duplicated; formatting varies widely
    * Transformations: Remove separator characters and keep the last 10 characters
    
* email: Email
    * Issues: Some NGOs have multiple emails in the same column
    * Transformations: Keep only the first email when several are listed
    
* website: Website
    * Issues: String representation of NA (e.g. 'NA', 'N.A.', 'N. A.', 'N.A', 'Under Construction')
    * Transformations: Convert string representations of NA to None
    
* annual_expenditure: Annual Expenditure for the last_updated Year
    * Issues: Contains negative values
    * Transformations: Remove commas and convert from str to float
    
* description: Description of the NGO
    * Issues: Has abnormal spacings and irregular characters (‘ vs ')

## Imports

In [1]:
import boto3
import io
import string
import requests

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno


sns.set(rc={'figure.figsize':(16,5)})

## Load Data

In [2]:
# df = pd.read_json('helpyourngo.json', orient='values')

client = boto3.client('s3')
resource = boto3.resource('s3')

In [3]:
obj = client.get_object(Bucket='daanmatchdatafiles', Key='webscrape-fall2021/helpyourngo.json')
df = pd.read_json(io.BytesIO(obj['Body'].read()))
df_transformed = df.copy()

<br/>

## Column Transformations

### name

In [4]:
# No transformations for now
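
The duplicate names listed as an issue for this column (for example `Search NGO`) could be surfaced before deciding on a transformation. The following is a minimal sketch on made-up names, not on the real data; the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical sample; the real values come from df["name"]
names = pd.Series(["Aakriti", "Search NGO", "Search NGO", "Aai Caretaker"])

# keep=False marks every occurrence of a repeated name, not just the later ones
dupes = names[names.duplicated(keep=False)]

# Count how often each repeated name appears
counts = dupes.value_counts()
print(counts)
```

Whether repeated names should be dropped or kept would depend on whether the other columns differ, which this sketch does not check.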

<br/>

### last_updated

In [5]:
# Parse the year and store it as a 'YYYY-01-01' date string; unparseable values become null
df_transformed["last_updated"] = pd.to_datetime(df["last_updated"], format='%Y', errors='coerce').dt.strftime('%Y-%m-%d')
df_transformed.head()

Unnamed: 0,name,last_updated,address,mobile,email,website,annual_expenditure,description
0,Aai Caretaker,2020-01-01,"Room No. B/4, Ashok Nagar, Near Krishna Medica...",+91 22 25530537,info@aaicaretaker.org.in,www.aaicaretaker.in,138990084.0,Aai Caretaker is involved in diverse activitie...
1,Aakriti,2015-01-01,"J-159, Sector-10 DLF, Faridabad 121006. Haryana",+91 9312263021,aakritischool@yahoo.in,www.aakritingo.org,1023204.0,"A parent-initiative, Association for Ability K..."
2,Aakash Maindwal Foundation,2016-01-01,"107, First Floor, Block - Milano, Mahagun Mosa...",+91 120 4377527,aakashmaindwalfoundation@gmail.com,www.amfindia.org,767980.0,Aakash Maindwal Foundation has been working to...
3,Aaradhana Sanstha,2013-01-01,"14, Sulabhpuram, Sikandara Bodla Road, Agra 28...",+91 9639161612,drhchaudhary@yahoo.com,,,Aaradhana Sanstha was formed for educational u...
4,Action Against Hunger (Fight Hunger Foundation),2019-01-01,"201, Sai Prasad Building, Sion Kamgar CHS Ltd,...",+91 022 2611 1275,contact@fighthungerfoundation.org,www.actionagainsthunger.in,86348954.0,Action Against Hunger (AAH) registered as Figh...
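
To make the effect of `format='%Y'` and `errors='coerce'` concrete, here is a minimal sketch on a made-up Series (the sample values are assumptions, not taken from the dataset). A bare year parses to January 1 of that year, and anything that cannot be parsed becomes missing.

```python
import pandas as pd

# Hypothetical sample values; the real column comes from helpyourngo.json
years = pd.Series(["2020", "2015", "not a year", None])

# errors='coerce' turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(years, format="%Y", errors="coerce")

# Formatting NaT yields a missing value, so bad rows stay null
as_text = parsed.dt.strftime("%Y-%m-%d")
print(as_text.tolist())
```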


<br/>

### address

In [6]:
# No transformations for now

<br/>

### mobile

In [7]:
# Strip common separator characters, then keep the last 10 characters.
# Taking the last 10 characters drops any leading 0 and the country code.

temp = df["mobile"]
to_remove = [" ", "-", "/", "(", ")", ";", ",", "+"]

def format_mobile(m):
    # Missing values and numbers shorter than 10 characters become None
    if not isinstance(m, str) or len(m) < 10:
        return None
    return m[-10:]

for c in to_remove:
    # regex=False treats characters such as "(" and "+" literally
    temp = temp.str.replace(c, "", regex=False)

df_transformed["mobile"] = temp.apply(format_mobile)
df_transformed.head()

  


Unnamed: 0,name,last_updated,address,mobile,email,website,annual_expenditure,description
0,Aai Caretaker,2020-01-01,"Room No. B/4, Ashok Nagar, Near Krishna Medica...",2225530537,info@aaicaretaker.org.in,www.aaicaretaker.in,138990084.0,Aai Caretaker is involved in diverse activitie...
1,Aakriti,2015-01-01,"J-159, Sector-10 DLF, Faridabad 121006. Haryana",9312263021,aakritischool@yahoo.in,www.aakritingo.org,1023204.0,"A parent-initiative, Association for Ability K..."
2,Aakash Maindwal Foundation,2016-01-01,"107, First Floor, Block - Milano, Mahagun Mosa...",1204377527,aakashmaindwalfoundation@gmail.com,www.amfindia.org,767980.0,Aakash Maindwal Foundation has been working to...
3,Aaradhana Sanstha,2013-01-01,"14, Sulabhpuram, Sikandara Bodla Road, Agra 28...",9639161612,drhchaudhary@yahoo.com,,,Aaradhana Sanstha was formed for educational u...
4,Action Against Hunger (Fight Hunger Foundation),2019-01-01,"201, Sai Prasad Building, Sion Kamgar CHS Ltd,...",2226111275,contact@fighthungerfoundation.org,www.actionagainsthunger.in,86348954.0,Action Against Hunger (AAH) registered as Figh...
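
As an alternative to listing separator characters one by one, a regular expression can drop every non-digit character in a single step. This is a sketch of that variant, not the method used in the notebook; `last_ten_digits` is a hypothetical helper name.

```python
import re

def last_ten_digits(raw):
    """Return the last 10 digits of a phone string, or None if there are fewer than 10."""
    if not isinstance(raw, str):
        return None  # missing or non-string input
    digits = re.sub(r"\D", "", raw)  # remove every non-digit character
    return digits[-10:] if len(digits) >= 10 else None

print(last_ten_digits("+91 22 25530537"))    # 2225530537
print(last_ten_digits("+91 022 2611 1275"))  # 2226111275
print(last_ten_digits("12345"))              # None
```

It could be applied with `df["mobile"].apply(last_ten_digits)`. Like the original approach, a cell holding several numbers would keep only the last 10 digits of the final number.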


<br/>

### email

In [8]:
# Replace existing delimiters with a common delimiter and keep everything before it
temp = df.email
to_remove = [";", ","]
delimiter = " #CAPTURE THE STUFF BEFORE ME# "

def extract_first_email(email):
    # Missing values and single-email strings are returned unchanged
    if not isinstance(email, str) or delimiter not in email:
        return email
    return email[:email.find(delimiter)]  # text before the first delimiter

for c in to_remove:
    # regex=False replaces the character literally
    temp = temp.str.replace(c, delimiter, regex=False)

df_transformed["email"] = temp.apply(extract_first_email)
df_transformed.head()

Unnamed: 0,name,last_updated,address,mobile,email,website,annual_expenditure,description
0,Aai Caretaker,2020-01-01,"Room No. B/4, Ashok Nagar, Near Krishna Medica...",2225530537,info@aaicaretaker.org.in,www.aaicaretaker.in,138990084.0,Aai Caretaker is involved in diverse activitie...
1,Aakriti,2015-01-01,"J-159, Sector-10 DLF, Faridabad 121006. Haryana",9312263021,aakritischool@yahoo.in,www.aakritingo.org,1023204.0,"A parent-initiative, Association for Ability K..."
2,Aakash Maindwal Foundation,2016-01-01,"107, First Floor, Block - Milano, Mahagun Mosa...",1204377527,aakashmaindwalfoundation@gmail.com,www.amfindia.org,767980.0,Aakash Maindwal Foundation has been working to...
3,Aaradhana Sanstha,2013-01-01,"14, Sulabhpuram, Sikandara Bodla Road, Agra 28...",9639161612,drhchaudhary@yahoo.com,,,Aaradhana Sanstha was formed for educational u...
4,Action Against Hunger (Fight Hunger Foundation),2019-01-01,"201, Sai Prasad Building, Sion Kamgar CHS Ltd,...",2226111275,contact@fighthungerfoundation.org,www.actionagainsthunger.in,86348954.0,Action Against Hunger (AAH) registered as Figh...
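
The same "keep the first email" idea can be written without a placeholder delimiter by splitting on `;` or `,` directly. The sketch below is one possible variant, with a hypothetical helper name; unlike the cell above, it also strips surrounding whitespace from the result.

```python
import re

def first_email(raw):
    """Return the first address from a ';' or ',' separated string."""
    if not isinstance(raw, str):
        return raw  # leave missing values untouched
    # Split once at the first ';' or ',' and keep the left-hand part
    first = re.split(r"[;,]", raw, maxsplit=1)[0]
    return first.strip()

print(first_email("a@example.org; b@example.org"))
print(first_email("a@example.org"))
```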


<br/>

### website

In [9]:
# If the string is a representation of "NA", convert its value to None
str_rep_NA = ["N.A.", "N. A.", "NA", "N.A", "Under Construction"]

# Work on a copy so that df is not modified and no SettingWithCopyWarning is raised
temp = df["website"].copy()
temp[temp.isin(str_rep_NA)] = None

df_transformed["website"] = temp
df_transformed.head()
  


Unnamed: 0,name,last_updated,address,mobile,email,website,annual_expenditure,description
0,Aai Caretaker,2020-01-01,"Room No. B/4, Ashok Nagar, Near Krishna Medica...",2225530537,info@aaicaretaker.org.in,www.aaicaretaker.in,138990084.0,Aai Caretaker is involved in diverse activitie...
1,Aakriti,2015-01-01,"J-159, Sector-10 DLF, Faridabad 121006. Haryana",9312263021,aakritischool@yahoo.in,www.aakritingo.org,1023204.0,"A parent-initiative, Association for Ability K..."
2,Aakash Maindwal Foundation,2016-01-01,"107, First Floor, Block - Milano, Mahagun Mosa...",1204377527,aakashmaindwalfoundation@gmail.com,www.amfindia.org,767980.0,Aakash Maindwal Foundation has been working to...
3,Aaradhana Sanstha,2013-01-01,"14, Sulabhpuram, Sikandara Bodla Road, Agra 28...",9639161612,drhchaudhary@yahoo.com,,,Aaradhana Sanstha was formed for educational u...
4,Action Against Hunger (Fight Hunger Foundation),2019-01-01,"201, Sai Prasad Building, Sion Kamgar CHS Ltd,...",2226111275,contact@fighthungerfoundation.org,www.actionagainsthunger.in,86348954.0,Action Against Hunger (AAH) registered as Figh...
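
An exact-match list only catches the spellings it names. A possible extension, sketched below under the assumption that other spellings such as `n.a.` or `N/A` may also occur, is to compare a normalized form of the string. The helper name and the set of placeholder keys are hypothetical.

```python
def clean_website(raw):
    """Return None for placeholder strings, otherwise the trimmed website."""
    if not isinstance(raw, str):
        return None
    # Reduce the value to lowercase letters and digits for comparison only
    key = "".join(ch for ch in raw.lower() if ch.isalnum())
    if key in {"", "na", "underconstruction"}:
        return None
    return raw.strip()

print(clean_website("N. A."))
print(clean_website("Under Construction"))
print(clean_website("www.aaicaretaker.in"))
```

This is broader than the list used above, so it would be worth checking which values it nulls out before relying on it.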


<br/>

### annual_expenditure

In [10]:
# Convert to float
temp = df.annual_expenditure
remove_commas = temp.str.replace(',', '')

df_transformed["annual_expenditure"] = pd.to_numeric(remove_commas)
df_transformed.head()

Unnamed: 0,name,last_updated,address,mobile,email,website,annual_expenditure,description
0,Aai Caretaker,2020-01-01,"Room No. B/4, Ashok Nagar, Near Krishna Medica...",2225530537,info@aaicaretaker.org.in,www.aaicaretaker.in,138990084.0,Aai Caretaker is involved in diverse activitie...
1,Aakriti,2015-01-01,"J-159, Sector-10 DLF, Faridabad 121006. Haryana",9312263021,aakritischool@yahoo.in,www.aakritingo.org,1023204.0,"A parent-initiative, Association for Ability K..."
2,Aakash Maindwal Foundation,2016-01-01,"107, First Floor, Block - Milano, Mahagun Mosa...",1204377527,aakashmaindwalfoundation@gmail.com,www.amfindia.org,767980.0,Aakash Maindwal Foundation has been working to...
3,Aaradhana Sanstha,2013-01-01,"14, Sulabhpuram, Sikandara Bodla Road, Agra 28...",9639161612,drhchaudhary@yahoo.com,,,Aaradhana Sanstha was formed for educational u...
4,Action Against Hunger (Fight Hunger Foundation),2019-01-01,"201, Sai Prasad Building, Sion Kamgar CHS Ltd,...",2226111275,contact@fighthungerfoundation.org,www.actionagainsthunger.in,86348954.0,Action Against Hunger (AAH) registered as Figh...
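
The column notes mention negative values, which the conversion above keeps as they are. The following sketch, on made-up values, shows one way to convert the strings and list the negative entries for review. Using `errors="coerce"` is an assumption here; the cell above does not pass it and would raise on an unparseable string.

```python
import pandas as pd

# Hypothetical sample values; the real column comes from df["annual_expenditure"]
raw = pd.Series(["1,023,204", "-500", None, "unknown"])

# Remove thousands separators, then convert; unparseable strings become NaN
values = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

# Select the negative entries so they can be inspected
negative = values[values < 0]
print(values.tolist())
print(negative.tolist())
```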


<br/>

### description

In [11]:
# No transformations for now
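
If the spacing and quote issues noted for this column are addressed later, one possible approach is to map typographic quotes to ASCII and collapse runs of whitespace. This is only a sketch; the helper name and the choice of characters to map are assumptions.

```python
import re

# Map typographic quotes to their ASCII equivalents
QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
})

def tidy_text(raw):
    """Normalize quotes and collapse whitespace in a description string."""
    if not isinstance(raw, str):
        return raw  # leave missing values untouched
    text = raw.translate(QUOTE_MAP)
    # Replace any run of whitespace (spaces, tabs, newlines) with one space
    return re.sub(r"\s+", " ", text).strip()

print(tidy_text("It\u2019s   a\n test "))
```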

<br/>

## Save Transformed Data

In [12]:
df_transformed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 688 entries, 0 to 687
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   name                688 non-null    object 
 1   last_updated        679 non-null    object 
 2   address             688 non-null    object 
 3   mobile              681 non-null    object 
 4   email               683 non-null    object 
 5   website             611 non-null    object 
 6   annual_expenditure  660 non-null    float64
 7   description         688 non-null    object 
dtypes: float64(1), object(7)
memory usage: 43.1+ KB


In [13]:
df_transformed.to_json("helpyourngo_before_glue_transformation.json", lines=True, orient='records')
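
With `lines=True` and `orient='records'`, each row is written as one JSON object per line. The sketch below round-trips a small made-up frame through an in-memory buffer to show that the output can be read back with `pd.read_json(..., lines=True)`; the sample rows are assumptions, and a buffer stands in for the file.

```python
import io
import pandas as pd

# Hypothetical two-row sample with one missing website
sample = pd.DataFrame({
    "name": ["Aakriti", "Aai Caretaker"],
    "website": ["www.aakritingo.org", None],
})

# Write newline-delimited JSON into an in-memory buffer instead of a file
buffer = io.StringIO()
sample.to_json(buffer, lines=True, orient="records")

# Each row is serialized on its own line
json_lines = buffer.getvalue().splitlines()

# Read the same text back into a DataFrame
buffer.seek(0)
restored = pd.read_json(buffer, lines=True)
print(restored)
```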