This notebook is used only to explore and clean the data for future analysis.

In [2]:
import pandas as pd
import json 

In [3]:
df = pd.read_csv("USASpendings.csv")
df

Unnamed: 0,agency_id,toptier_code,abbreviation,agency_name,congressional_justification_url,active_fy,active_fq,outlay_amount,obligated_amount,budget_authority_amount,current_total_budget_authority_amount,percentage_of_total_budget_authority,agency_slug
0,1525,247,AAHC,400 Years of African-American History Commission,,2025,3,1.953400e+06,2.197040e+04,2.934499e+06,1.419547e+13,2.067209e-07,400-years-of-african-american-history-commission
1,1146,310,USAB,Access Board,https://www.access-board.gov/cj,2025,3,5.918139e+06,5.765031e+06,1.358844e+07,1.419547e+13,9.572384e-07,access-board
2,1136,302,ACUS,Administrative Conference of the U.S.,https://www.acus.gov/cj,2025,3,2.394067e+06,2.269892e+06,3.600360e+06,1.419547e+13,2.536275e-07,administrative-conference-of-the-us
3,1144,306,ACHP,Advisory Council on Historic Preservation,https://www.achp.gov/sites/default/files/2021-...,2025,3,8.149314e+06,7.729430e+06,1.863945e+07,1.419547e+13,1.313056e-06,advisory-council-on-historic-preservation
4,1527,166,USADF,African Development Foundation,https://www.usadf.gov/cj,2025,3,1.858159e+07,1.707742e+07,6.479858e+07,1.419547e+13,4.564737e-06,african-development-foundation
...,...,...,...,...,...,...,...,...,...,...,...,...,...
105,1522,77,DFC,U.S. International Development Finance Corpora...,,2025,3,3.853499e+08,2.083858e+08,8.190721e+09,1.419547e+13,5.769956e-04,us-international-development-finance-corporation
106,1162,510,CSB,United States Chemical Safety Board,https://www.csb.gov/cj,2025,3,8.945224e+06,8.489627e+06,2.166740e+07,1.419547e+13,1.526361e-06,united-states-chemical-safety-board
107,1169,345,CAVC,United States Court of Appeals for Veterans Cl...,https://www.uscourts.cavc.gov/cj,2025,3,2.976940e+07,3.243715e+07,1.264026e+08,1.419547e+13,8.904436e-06,united-states-court-of-appeals-for-veterans-cl...
108,90,1133,USTDA,United States Trade and Development Agency,https://www.ustda.gov/cj,2025,3,5.250495e+07,3.134760e+07,1.822763e+08,1.419547e+13,1.284046e-05,united-states-trade-and-development-agency


agency_id                              : Unique identifier for the agency
toptier_code                           : Code for the top-level federal agency classification
abbreviation                           : Acronym of the agency name
agency_name                            : Full name of the federal agency
congressional_justification_url        : Link to the agency’s budget request document submitted to Congress
active_fy                              : Fiscal year (financial year) for which the data is reported
active_fq                              : Fiscal quarter for which the data is reported
outlay_amount                          : Actual amount spent by the agency
obligated_amount                       : Funds the agency has legally committed to spend, whether or not they have been paid out yet
budget_authority_amount                : Authorized budget for the agency
current_total_budget_authority_amount  : Total budget authority across all agencies (identical in every row shown above)
percentage_of_total_budget_authority   : Agency’s share of the total government budget, stored as a fraction rather than a 0-100 percentage
agency_slug                            : URL-friendly version of the agency name used in web links
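The relationship between the three budget columns can be checked with a little arithmetic. The sketch below assumes that percentage_of_total_budget_authority is budget_authority_amount divided by current_total_budget_authority_amount. It uses three rows copied from the preview above rather than the full file, and the name `sample` is only illustrative.

```python
import pandas as pd

# Three rows copied from the preview above (values are rounded as displayed).
sample = pd.DataFrame({
    "abbreviation": ["AAHC", "USAB", "ACUS"],
    "budget_authority_amount": [2.934499e06, 1.358844e07, 3.600360e06],
    "current_total_budget_authority_amount": [1.419547e13] * 3,
    "percentage_of_total_budget_authority": [2.067209e-07, 9.572384e-07, 2.536275e-07],
})

# Assumption being tested: share = agency budget authority / total budget authority.
recomputed_share = (
    sample["budget_authority_amount"]
    / sample["current_total_budget_authority_amount"]
)

# Relative difference against the reported share; values near zero mean the columns agree.
relative_diff = (
    (recomputed_share - sample["percentage_of_total_budget_authority"]).abs()
    / sample["percentage_of_total_budget_authority"]
)
print(relative_diff.max())
```

For these rows the difference is at the level of display rounding, which suggests the column holds a fraction of the total and not a value on a 0-100 scale.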

In [4]:
df.shape

(110, 13)

In [6]:
df["agency_id"].nunique() == df.shape[0]

True

This indicates that every agency_id is unique, so no agency is repeated.
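If the check had returned False, the next step would be to find which rows repeat. The following is a minimal sketch of that, run on a small hypothetical frame (`demo`, with one agency_id deliberately repeated) and not on the real data.

```python
import pandas as pd

# Hypothetical frame with one repeated agency_id, used only to illustrate the check.
demo = pd.DataFrame({
    "agency_id": [1525, 1146, 1146],
    "abbreviation": ["AAHC", "USAB", "USAB"],
})

# For a column with no missing values, is_unique gives the same answer as
# comparing nunique() with the number of rows.
ids_are_unique = demo["agency_id"].is_unique
print(ids_are_unique)  # False

# keep=False flags every member of a duplicated group, not only the later copies.
repeated = demo[demo["agency_id"].duplicated(keep=False)]
print(repeated)
```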

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110 entries, 0 to 109
Data columns (total 13 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   agency_id                              110 non-null    int64  
 1   toptier_code                           110 non-null    int64  
 2   abbreviation                           110 non-null    object 
 3   agency_name                            110 non-null    object 
 4   congressional_justification_url        97 non-null     object 
 5   active_fy                              110 non-null    int64  
 6   active_fq                              110 non-null    int64  
 7   outlay_amount                          110 non-null    float64
 8   obligated_amount                       110 non-null    float64
 9   budget_authority_amount                110 non-null    float64
 10  current_total_budget_authority_amount  110 non-null    float64
 11  percen

In [11]:
df.isnull().sum()

agency_id                                 0
toptier_code                              0
abbreviation                              0
agency_name                               0
congressional_justification_url          13
active_fy                                 0
active_fq                                 0
outlay_amount                             0
obligated_amount                          0
budget_authority_amount                   0
current_total_budget_authority_amount     0
percentage_of_total_budget_authority      0
agency_slug                               0
dtype: int64

Here, "congressional_justification_url" has 13 null values out of 110 rows (about 12%), which may affect our analysis.
We also have "agency_slug", which serves almost the same purpose.

So we will drop this column entirely.
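To judge how much of a column is missing, the share of nulls is often easier to read than the raw count. The sketch below shows one way to compute it and then drop the column. It uses four rows taken from the preview above (two of them have no URL), and the names `sample` and `cleaned` are illustrative.

```python
import pandas as pd

# Four rows from the preview above; AAHC and DFC have no URL.
sample = pd.DataFrame({
    "abbreviation": ["AAHC", "USAB", "ACUS", "DFC"],
    "congressional_justification_url": [
        None,
        "https://www.access-board.gov/cj",
        "https://www.acus.gov/cj",
        None,
    ],
})

# isna() gives booleans, so the column mean is the fraction of missing values.
null_share = sample.isna().mean()
print(null_share)

# drop() returns a new frame and leaves the original untouched.
cleaned = sample.drop(columns=["congressional_justification_url"])
print(cleaned.columns.tolist())
```

On the full data the same calculation gives 13 / 110, or roughly 0.12, for this column.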

In [None]:
del df['congressional_justification_url']

In [14]:
df.shape

(110, 12)

The data is now cleaned, so we can save it as a new file.

In [15]:
# index=False keeps the row index from being written as an extra unnamed column
df.to_csv("Cleaned_USASpendings.csv", index=False)
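By default, `to_csv` writes the row index as an extra first column, which comes back as `Unnamed: 0` when the file is read again. The sketch below shows the difference that `index=False` makes. It uses a two-row illustrative frame and in-memory buffers, not real files.

```python
import io

import pandas as pd

# Two rows from the preview above, enough to show the effect.
sample = pd.DataFrame({"agency_id": [1525, 1146], "abbreviation": ["AAHC", "USAB"]})

# Default behaviour: the index is written as a column with an empty header.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)  # ['Unnamed: 0', 'agency_id', 'abbreviation']

# With index=False only the data columns are written.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)  # ['agency_id', 'abbreviation']
```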