# Practice Working with the GroupBy Function

In this project, I practice the Pandas groupby function using data on who received funds from the Paycheck Protection Program (PPP).

In addition to running the necessary imports, I loaded the PPP dataset into a Pandas dataframe.

The source of the dataset: https://www.kaggle.com/susuwatari/ppp-loan-data-paycheck-protection-program

In [4]:
import pandas as pd
import numpy as np

In [5]:
data = pd.read_csv('PPP_data_150k_plus.csv', engine='c')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 661218 entries, 0 to 661217
Data columns (total 16 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   LoanRange      661218 non-null  object 
 1   BusinessName   661210 non-null  object 
 2   Address        661201 non-null  object 
 3   City           661203 non-null  object 
 4   State          661218 non-null  object 
 5   Zip            661202 non-null  float64
 6   NAICSCode      654435 non-null  float64
 7   BusinessType   659789 non-null  object 
 8   RaceEthnicity  661218 non-null  object 
 9   Gender         661218 non-null  object 
 10  Veteran        661218 non-null  object 
 11  NonProfit      42462 non-null   object 
 12  JobsRetained   620712 non-null  float64
 13  DateApproved   661218 non-null  object 
 14  Lender         661218 non-null  object 
 15  CD             661218 non-null  object 
dtypes: float64(3), object(13)
memory usage: 80.7+ MB


Next, I wanted to see the number of unique values for each of the five features listed below, as well as what those unique values actually were.

In [6]:
for col in data[['LoanRange', 'State', 'BusinessType', 'NonProfit', 'DateApproved']]:
    select_data = data[col]
    print(col,'\t\t', len(select_data.unique()), '\n', select_data.unique(), '\n')

LoanRange 		 5 
 ['a $5-10 million' 'b $2-5 million' 'c $1-2 million'
 'd $350,000-1 million' 'e $150,000-350,000'] 

State 		 57 
 ['AK' 'AL' 'AR' 'AS' 'AZ' 'CA' 'CO' 'CT' 'DC' 'DE' 'FL' 'GA' 'GU' 'HI'
 'IA' 'ID' 'IL' 'IN' 'KS' 'KY' 'LA' 'MA' 'MD' 'ME' 'MI' 'MN' 'MO' 'MP'
 'MS' 'MT' 'NC' 'ND' 'NE' 'NH' 'NJ' 'NM' 'NV' 'NY' 'OH' 'OK' 'OR' 'PA'
 'PR' 'RI' 'SC' 'SD' 'TN' 'TX' 'UT' 'VA' 'VI' 'VT' 'WA' 'WI' 'WV' 'WY'
 'XX'] 

BusinessType 		 18 
 ['Non-Profit Organization' 'Subchapter S Corporation' 'Corporation'
 'Limited  Liability Company(LLC)' 'Cooperative' nan 'Partnership'
 'Professional Association' 'Sole Proprietorship'
 'Employee Stock Ownership Plan(ESOP)' 'Trust'
 'Limited Liability Partnership' 'Joint Venture'
 'Non-Profit Childcare Center' 'Independent Contractors'
 'Self-Employed Individuals' 'Tenant in Common'
 'Rollover as Business Start-Ups (ROB'] 

NonProfit 		 2 
 ['Y' nan] 

DateApproved 		 79 
 ['04/14/2020' '04/15/2020' '04/11/2020' '04/29/2020' '06/10/2020'
 '05/19/20
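One detail behind the counts above: `Series.unique()` treats `NaN` as a value, which is why `NonProfit` reports 2 and why `nan` appears in the `BusinessType` list. The following is a minimal sketch of that difference on a small made-up frame; the name `toy` and its values are illustrative and not taken from the PPP data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for two of the PPP columns.
toy = pd.DataFrame({
    "NonProfit": ["Y", np.nan, np.nan, "Y"],
    "State": ["CA", "TX", "CA", "NY"],
})

# len(unique()) counts NaN as one of the distinct values ...
with_nan = {col: len(toy[col].unique()) for col in toy.columns}

# ... while nunique() skips NaN unless dropna=False is passed.
without_nan = toy.nunique().to_dict()
with_nan_again = toy.nunique(dropna=False).to_dict()

print(with_nan)
print(without_nan)
```

Which count is preferable depends on whether a missing value should be reported as its own category.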

Seeing 57 states listed drew my attention, so I sorted and listed them to make them easier to read through.

In [7]:
sorted(data['State'].unique())

['AK',
 'AL',
 'AR',
 'AS',
 'AZ',
 'CA',
 'CO',
 'CT',
 'DC',
 'DE',
 'FL',
 'GA',
 'GU',
 'HI',
 'IA',
 'ID',
 'IL',
 'IN',
 'KS',
 'KY',
 'LA',
 'MA',
 'MD',
 'ME',
 'MI',
 'MN',
 'MO',
 'MP',
 'MS',
 'MT',
 'NC',
 'ND',
 'NE',
 'NH',
 'NJ',
 'NM',
 'NV',
 'NY',
 'OH',
 'OK',
 'OR',
 'PA',
 'PR',
 'RI',
 'SC',
 'SD',
 'TN',
 'TX',
 'UT',
 'VA',
 'VI',
 'VT',
 'WA',
 'WI',
 'WV',
 'WY',
 'XX']
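To see which codes push the count past 50, one option is to subtract the 50 state abbreviations from the observed codes. This is a sketch only; `US_STATES` and `non_state_codes` are helper names introduced here for illustration.

```python
# The 50 U.S. state postal abbreviations.
US_STATES = {
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA",
    "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
    "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
    "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
    "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY",
}


def non_state_codes(codes):
    """Return the sorted codes that are not one of the 50 state abbreviations."""
    return sorted(set(codes) - US_STATES)


# Illustrative sample; with the real data, pass data['State'].unique().
sample = ["CA", "DC", "PR", "TX", "XX"]
print(non_state_codes(sample))  # ['DC', 'PR', 'XX']
```

Going by the list above, the seven extra entries are DC, the territory codes AS, GU, MP, PR and VI, and 'XX', which looks like a placeholder for an unknown state.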

Next, I continued cleaning the data. Where the NonProfit designation was NaN (essentially, no value was provided), I changed it to 'N'. In my experience, an organization that is a non-profit willingly provides that information.

In [8]:
data['NonProfit'] = data['NonProfit'].replace(np.nan, 'N')
data['NonProfit']

0         Y
1         N
2         N
3         N
4         N
         ..
661213    N
661214    N
661215    N
661216    N
661217    N
Name: NonProfit, Length: 661218, dtype: object
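An equivalent and arguably more common way to fill missing values is `Series.fillna`. A minimal sketch on a made-up series (the name `flags` is illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the NonProfit column.
flags = pd.Series(["Y", np.nan, np.nan, "Y"], name="NonProfit")

# fillna replaces only the missing entries and leaves 'Y' untouched.
filled = flags.fillna("N")

print(filled.tolist())  # ['Y', 'N', 'N', 'Y']
```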

I decided to eliminate the columns that will not be used to make the process more efficient.

In [9]:
selected_data = data[['LoanRange', 'State', 'BusinessType', 'NonProfit', 'DateApproved']]
selected_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 661218 entries, 0 to 661217
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   LoanRange     661218 non-null  object
 1   State         661218 non-null  object
 2   BusinessType  659789 non-null  object
 3   NonProfit     661218 non-null  object
 4   DateApproved  661218 non-null  object
dtypes: object(5)
memory usage: 25.2+ MB
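If the unused columns are never needed, another option is to skip them at load time with the `usecols` parameter of `pd.read_csv`. The sketch below reads a tiny in-memory CSV rather than the real file; the rows and business names are made up.

```python
import io

import pandas as pd

# Made-up CSV with one column we do not want to load.
csv_text = io.StringIO(
    "LoanRange,BusinessName,State\n"
    "c $1-2 million,Example Co,CA\n"
    "b $2-5 million,Sample Inc,TX\n"
)

# Only the listed columns are parsed and kept in memory.
wanted = ["LoanRange", "State"]
subset = pd.read_csv(csv_text, usecols=wanted)

print(subset.columns.tolist())  # ['LoanRange', 'State']
```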


Here, I counted the number of PPP loans approved for each state. For this analysis, I retrieved only the 'State' and 'DateApproved' features and renamed 'DateApproved' so the output is more reader-friendly.

In [18]:
stateGrouping = data[['State', 'DateApproved']]
stateGrouping = stateGrouping.rename(columns={"DateApproved": "# of Loans Approved"})
stateGrouping.groupby(['State']).count().sort_values(by='# of Loans Approved', ascending=False).astype(int)

State,# of Loans Approved
CA,87689
TX,52150
NY,46888
FL,42207
IL,27412
PA,26095
OH,22888
NJ,21858
MI,19971
GA,18291


Here I provided the number of PPP Loans and grouped them by their Business Type. For this analysis, I only retrieved the 'State' and 'BusinessType' features and then renamed them, so they are more reader-friendly.

In [11]:
# Business Type
BusType_data = data[['State', 'BusinessType']]
BusType_data = BusType_data.rename(columns={'State': 'Count'})
BusType_data.groupby(by='BusinessType').count().sort_values(by='Count', ascending=False)

BusinessType,Count
Corporation,275482
Limited Liability Company(LLC),172643
Subchapter S Corporation,132436
Non-Profit Organization,41819
Partnership,12902
Sole Proprietorship,8774
Limited Liability Partnership,7649
Professional Association,3858
Cooperative,1851
Self-Employed Individuals,683


Here I counted the number of PPP loans approved based on whether the organization is a non-profit. For this analysis, I only retrieved the 'State' and 'NonProfit' features and then renamed them, so they are more reader-friendly.

In [12]:
# NonProfit
non_Profit_data = data[['State', 'NonProfit']]
non_Profit_data = non_Profit_data.rename(columns={'State' : 'Totals', 'NonProfit' : 'Non Profit'})
non_Profit_data.groupby(by='Non Profit').count().sort_values(by='Non Profit', ascending=False)

Non Profit,Totals
Y,42462
N,618756


Here I counted the number of PPP loans approved each month. For this analysis, I only retrieved the 'DateApproved' feature and then renamed it, so it is more reader-friendly.

Left column (month of the year):
4 = April,
5 = May,
6 = June

In [16]:
monthly_approved = data[['DateApproved']]
monthly_approved = monthly_approved.rename(columns={'DateApproved' : 'Month Approved'})
monthly_approved.index = pd.to_datetime(monthly_approved['Month Approved'], format='%m/%d/%Y')
monthly_approved.groupby(by=[monthly_approved.index.month], dropna=False).count()

Month Approved,Month Approved
4,552332
5,95197
6,13689


Here I counted the number of PPP loans approved on each day of the month. For this analysis, I only retrieved the 'DateApproved' feature and then renamed it, so it is more reader-friendly.

In [None]:
# DateApproved (by day of the month)
day_of_month = data[['DateApproved']]
day_of_month = day_of_month.rename(columns={'DateApproved' : 'Day of Month'})
# Parse the renamed column (there is no 'Count' column in this frame)
day_of_month.index = pd.to_datetime(day_of_month['Day of Month'], format='%m/%d/%Y')
day_of_month.groupby(by=[day_of_month.index.day]).count()

Here, I counted the number of PPP loans approved per day of the week. For this analysis, I only retrieved the 'DateApproved' feature and then renamed it, so it is more reader-friendly.

To decode the left column (pandas numbers the days starting from Monday):
0 = Monday,
1 = Tuesday,
2 = Wednesday,
3 = Thursday,
4 = Friday,
5 = Saturday,
6 = Sunday

In [None]:
day_of_week = data[['DateApproved']]
day_of_week = day_of_week.rename(columns={'DateApproved' : 'Day of Week'})
# Parse the renamed column (there is no 'Count' column in this frame)
day_of_week.index = pd.to_datetime(day_of_week['Day of Week'], format='%m/%d/%Y')
day_of_week.groupby(by=[day_of_week.index.dayofweek]).count()
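The weekday numbering is easy to get wrong, so a quick check against known dates can help. In pandas, `dayofweek` runs from Monday = 0 to Sunday = 6, and `day_name()` returns the label directly. A minimal sketch using three dates from April 2020:

```python
import pandas as pd

# 04/11/2020 was a Saturday, 04/13/2020 a Monday, 04/14/2020 a Tuesday.
sample = pd.to_datetime(
    pd.Series(["04/11/2020", "04/13/2020", "04/14/2020"]),
    format="%m/%d/%Y",
)

numbers = sample.dt.dayofweek.tolist()
names = sample.dt.day_name().tolist()

print(numbers)  # [5, 0, 1]
print(names)    # ['Saturday', 'Monday', 'Tuesday']
```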

Here I provided the number of jobs retained within each state. For this analysis, I only retrieved the 'JobsRetained' and 'State' features and then renamed them, so they are more reader-friendly. They are sorted in descending order based on the total number of jobs saved in each state.

In [None]:
jobs_retained_by_state = data[['JobsRetained', 'State']]
jobs_retained_by_state = jobs_retained_by_state.rename(columns={'JobsRetained': 'Total Jobs Retained'})
jobs_retained_by_state.groupby(by='State').sum().sort_values(by='Total Jobs Retained', ascending=False)
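One caveat with summing `JobsRetained`: the column has missing values (620712 non-null out of 661218 rows), and `sum()` skips them, so a group with no reported values totals 0 rather than NaN. The sketch below, on made-up rows, shows the default behaviour next to `min_count=1`, which keeps such groups as NaN.

```python
import numpy as np
import pandas as pd

# Made-up rows; the 'XX' group has no reported job counts at all.
toy = pd.DataFrame({
    "State": ["CA", "CA", "TX", "XX"],
    "JobsRetained": [10.0, np.nan, 5.0, np.nan],
})

grouped = toy.groupby("State")["JobsRetained"]

default_sum = grouped.sum()            # NaN is skipped; all-NaN group sums to 0
strict_sum = grouped.sum(min_count=1)  # all-NaN group stays NaN

print(default_sum.to_dict())
print(strict_sum.to_dict())
```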

Here I provided the total number of loans within each 'Loan Range'. For this analysis, I only retrieved the 'LoanRange' and 'State' features and then renamed them, so they are more reader-friendly.

In [17]:
# By LoanRange
loan_range = data[["LoanRange", "State"]]
loan_range = loan_range.rename(columns={"State" : "Total", "LoanRange" : "Loan Range"})
loan_range.groupby(by="Loan Range").count().sort_values(by="Loan Range", ascending=False)

Loan Range,Total
"e $150,000-350,000",379054
"d $350,000-1 million",199456
c $1-2 million,53030
b $2-5 million,24838
a $5-10 million,4840
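The ordering above relies on the letter prefix in each label (a to e), since the sort is alphabetical on the index. If the labels did not carry such a prefix, an explicit order could be imposed with `reindex`. A sketch on made-up rows, where `order` is an illustrative list of the five labels from smallest to largest range:

```python
import pandas as pd

# Desired display order, smallest loan range first.
order = [
    "e $150,000-350,000",
    "d $350,000-1 million",
    "c $1-2 million",
    "b $2-5 million",
    "a $5-10 million",
]

# Made-up sample of loan ranges.
toy = pd.Series(
    ["c $1-2 million", "e $150,000-350,000",
     "a $5-10 million", "e $150,000-350,000"],
    name="LoanRange",
)

# Count each label, then lay the counts out in the explicit order;
# labels that never occur are shown with a count of 0.
counts = toy.value_counts().reindex(order, fill_value=0)

print(counts.tolist())  # [2, 0, 1, 0, 1]
```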


Here I provided the number of jobs retained and grouped them by the 'Loan Range'. For this analysis, I only retrieved the 'LoanRange' and 'JobsRetained' features and then renamed them, so they are more reader-friendly.

In [15]:
# Compare the jobs retained by LoanRange
JobsRetained_vs_LoanRange = data[['LoanRange', 'JobsRetained']]
JobsRetained_vs_LoanRange = JobsRetained_vs_LoanRange.rename(columns={'LoanRange' : 'Loan Range', 'JobsRetained' : 'Total Jobs Retained'})
JobsRetained_vs_LoanRange.groupby(by='Loan Range').sum().sort_values(by='Total Jobs Retained', ascending=False)


Loan Range,Total Jobs Retained
"d $350,000-1 million",10015580.0
"e $150,000-350,000",8726969.0
c $1-2 million,5906794.0
b $2-5 million,5167844.0
a $5-10 million,1639326.0
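Totals alone favour the loan ranges with the most loans, so it may help to look at the total, the number of loans reporting a value, and the mean side by side. The following sketch uses pandas named aggregation on made-up rows; the output column names `total_jobs`, `loans_reporting` and `mean_jobs` are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up rows; one loan did not report jobs retained.
toy = pd.DataFrame({
    "LoanRange": ["a $5-10 million", "a $5-10 million",
                  "c $1-2 million", "c $1-2 million", "c $1-2 million"],
    "JobsRetained": [100.0, 300.0, 10.0, np.nan, 30.0],
})

# Named aggregation: each keyword becomes an output column,
# defined as (source column, aggregation function).
summary = toy.groupby("LoanRange").agg(
    total_jobs=("JobsRetained", "sum"),
    loans_reporting=("JobsRetained", "count"),
    mean_jobs=("JobsRetained", "mean"),
)

print(summary)
```

Because `count` and `mean` both ignore missing values, the mean here is per loan that reported a figure, not per loan approved.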
