# Rebuild Income Tax Revenue as Reported by Congress
Evan Sellers + Michael Yager

We attempt to reproduce the income tax revenue reported by Congress for 2020 and 2019 from our dataset, to verify that we are reading the dataset correctly.

In 2020 the US government made **$1.6 trillion** in revenue from income tax.

In 2019 the US government made **$1.7 trillion** in revenue from income tax. \
[Revenues in Fiscal Year 2020](https://www.cbo.gov/system/files/2020-11/56746-MBR.pdf)

## Data Setup
As noted by the documentation, all money amounts are in thousands of dollars.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("tax_data.csv")
df["agi_stub"] = df["agi_stub"].astype("category") 

In [3]:
def toMillon(amount):
    return round(amount / 1000000, 2)

def toBillon(amount):
    return round(amount / 1000000000, 2)

def toTrillon(amount):
    return round(amount / 1000000000000, 10)
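
As a quick sanity check on the units, the sketch below converts a raw column total (stored in thousands of dollars) into trillions of dollars. The helper name `thousands_to_trillions` is hypothetical and not part of this notebook; the input is the zipcode `0` total for `A06500` that is printed later in this notebook.

```python
def thousands_to_trillions(amount_in_thousands):
    """Convert an amount stored in thousands of dollars to trillions of dollars."""
    dollars = amount_in_thousands * 1_000
    return dollars / 1_000_000_000_000

# Raw A06500 total for the zipcode 0 rows, in thousands of dollars
raw_total = 1_667_769_483.0
in_trillions = thousands_to_trillions(raw_total)
print(round(in_trillions, 3))
```

The multiplication by 1,000 mirrors the `* 1000` applied before each `toTrillon` call in the cells of this notebook.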

## Rebuild 2020s
Based on the data documentation, we should use column `A06500`, the income tax after credits amount.

In [4]:
# Income tax after credits amount
toTrillon(df["A06500"].sum() * 1000)

3.335538966

We got about `$3.34 trillion`, which is not correct: the number should be close to `$1.6 trillion`.

## Rebuild 2019s

In [5]:
df2019 = pd.read_csv("tax_data_2019.csv")

In [6]:
# Income tax after credits amount
toTrillon(df2019["A06500"].sum() * 1000)

3.069631654

We got `$3.07 trillion`, which is not correct: the number should be close to `$1.7 trillion`.
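
To see how far off the totals are, a minimal sketch can divide each computed total by the reported figure. The numbers are copied from the cell outputs and the CBO figures quoted above; the dictionary names are only for illustration.

```python
# Computed totals and reported figures, both in trillions of dollars
computed = {2020: 3.335538966, 2019: 3.069631654}
reported = {2020: 1.6, 2019: 1.7}

# How many times larger is each computed total than the reported one?
ratios = {year: computed[year] / reported[year] for year in computed}
for year, ratio in ratios.items():
    print(year, round(ratio, 2))
```

Both ratios come out near 2 (roughly 2.08 and 1.81), which is the kind of gap that double counting would produce.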

## Why are the numbers so far off?

In [7]:
df.head()

Unnamed: 0,STATEFIPS,STATE,zipcode,agi_stub,N1,mars1,MARS2,MARS4,ELF,CPREP,...,N85300,A85300,N11901,A11901,N11900,A11900,N11902,A11902,N12000,A12000
0,1,AL,0,1,785000.0,519980.0,85690.0,165290.0,724170.0,22560.0,...,0.0,0.0,57720.0,46577.0,674840.0,1827202.0,672200.0,1818867.0,2900.0,6089.0
1,1,AL,0,2,554310.0,270870.0,121420.0,146470.0,515150.0,13260.0,...,0.0,0.0,81770.0,112540.0,470410.0,1445383.0,466960.0,1432458.0,4660.0,11648.0
2,1,AL,0,3,290630.0,113280.0,124770.0,44570.0,269700.0,6420.0,...,0.0,0.0,70360.0,144380.0,220710.0,626662.0,216530.0,610170.0,5760.0,16235.0
3,1,AL,0,4,181010.0,42010.0,120820.0,14410.0,168830.0,2570.0,...,0.0,0.0,49500.0,135429.0,130670.0,437179.0,126790.0,419324.0,3730.0,14903.0
4,1,AL,0,5,269080.0,31310.0,224330.0,8270.0,252360.0,3250.0,...,100.0,20.0,103250.0,470206.0,165650.0,724529.0,156910.0,642895.0,11280.0,80064.0


We have zipcodes labeled `0`, which seems very odd. The documentation says private zipcodes are labeled `99999`, but says nothing about zipcode `0`.

In [8]:
len(df[df.zipcode == 0])

306

In [9]:
toTrillon(df[df.zipcode == 0]["A06500"].sum()*1000)

1.667769483

That's odd: the zipcode `0` rows alone account for about `50%` of the total we computed, and they land close to the value we want.

In [10]:
df[df.zipcode == 0]["A06500"].sum()

1667769483.0

In [11]:
df[df.zipcode != 0]["A06500"].sum()

1667769483.0

In [12]:
df2019[df2019.zipcode == 0]["A06500"].sum()

1534815827.0

In [13]:
df2019[df2019.zipcode != 0]["A06500"].sum()

1534815827.0

They are the exact same amount! Why is all the data duplicated with a zero zipcode?
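
One way to probe this further is to compare the two groups state by state rather than only in total. The sketch below runs that comparison on a small made-up table (the zipcodes and amounts are invented for illustration); it assumes each state has zipcode `0` rows whose `A06500` values sum to the same amount as that state's other rows, which is a hypothesis and not something the documentation states.

```python
import pandas as pd

# Toy data (made up): each state has one zipcode 0 row whose amount
# equals the sum of that state's other rows.
toy = pd.DataFrame({
    "STATE":   ["AL", "AL", "AL", "AK", "AK"],
    "zipcode": [0, 35004, 35005, 0, 99501],
    "A06500":  [300.0, 100.0, 200.0, 50.0, 50.0],
})

# Per-state totals for the zipcode 0 rows and for all other rows
zero_totals = toy[toy.zipcode == 0].groupby("STATE")["A06500"].sum()
other_totals = toy[toy.zipcode != 0].groupby("STATE")["A06500"].sum()

# Align on STATE and compare state by state
comparison = pd.DataFrame({"zipcode_0": zero_totals, "other_zipcodes": other_totals})
comparison["match"] = comparison["zipcode_0"] == comparison["other_zipcodes"]
print(comparison)
```

Running the same grouping on `df` would show whether the match holds within every state or only in the grand total.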

## Verify Duplicated Zero Zipcode

In [14]:
# Columns that identify a row rather than hold a summable amount.
# "Unnamed: 0" is the saved row index; errors="ignore" skips it if absent.
id_columns = ["Unnamed: 0", "STATEFIPS", "STATE", "agi_stub", "zipcode"]

# Column totals over every row with a non-zero zipcode
nonzerozips = df[df.zipcode != 0].drop(columns=id_columns, errors="ignore").sum()

# Column totals over the rows labeled zipcode 0
zerozips = df[df.zipcode == 0].drop(columns=id_columns, errors="ignore").sum()

In [15]:
# True only if every column total matches between the two subsets
isEqual = bool((nonzerozips == zerozips).all())
print("True, Array are Equal") if isEqual else print("False, Array are NOT Equal")

True, Array are Equal


In [16]:
# Columns that identify a row rather than hold a summable amount.
# "Unnamed: 0" is the saved row index; errors="ignore" skips it if absent.
id_columns = ["Unnamed: 0", "STATEFIPS", "STATE", "agi_stub", "zipcode"]

# Column totals over every 2019 row with a non-zero zipcode
nonzerozips = df2019[df2019.zipcode != 0].drop(columns=id_columns, errors="ignore").sum()

# Column totals over the 2019 rows labeled zipcode 0
zerozips = df2019[df2019.zipcode == 0].drop(columns=id_columns, errors="ignore").sum()

In [17]:
# True only if every column total matches between the two subsets
isEqual = bool((nonzerozips == zerozips).all())
print("True, Array are Equal") if isEqual else print("False, Array are NOT Equal")

True, Array are Equal


This means the data is duplicated, though we are unsure why. Most likely the zipcode `0` rows add up the zip code regions in some way.
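
An all-or-nothing equality check does not say which columns disagree, and exact float comparison can fail on rounding alone. As a hedged alternative, the sketch below lists the columns whose totals differ beyond a relative tolerance. The function name `mismatched_columns`, the tolerance value, and the sample totals are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def mismatched_columns(totals_a, totals_b, rtol=1e-9):
    """Return the labels whose totals differ by more than the relative tolerance."""
    # Keep only the labels present in both Series, in matching order
    totals_a, totals_b = totals_a.align(totals_b, join="inner")
    close = np.isclose(totals_a.to_numpy(), totals_b.to_numpy(), rtol=rtol)
    return list(totals_a.index[~close])

# Made-up column totals, for illustration only
a = pd.Series({"N1": 10.0, "A06500": 300.0, "A11900": 7.0})
b = pd.Series({"N1": 10.0, "A06500": 300.0, "A11900": 8.0})
print(mismatched_columns(a, b))
```

Passing the `nonzerozips` and `zerozips` totals to such a helper would return an empty list when every column agrees.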

## Rebuild 2020

In [18]:
# Income tax after credits amount
toTrillon(df[df.zipcode != 0]["A06500"].sum() * 1000)

1.667769483

We got `$1.67 trillion`, which is close to the `$1.6 trillion` reported by Congress for 2020.

## Rebuild 2019

In [19]:
# Income tax after credits amount
toTrillon(df2019[df2019.zipcode != 0]["A06500"].sum() * 1000)

1.534815827

We got `$1.53 trillion`, which isn't the exact number reported by Congress for 2019 (`$1.7 trillion`), but it is decently close. We don't plan on using the 2019 dataset, so we will call this close enough.

## Conclusion

We are unsure why the data is duplicated under zipcode `0`, but we have shown that it is. When you import the data, use the following method to clean it:
```python
# Import + Preprocess Data
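import pandas as pd

df = pd.read_csv("tax_data.csv")
# Drop the duplicated zipcode 0 rows; .copy() avoids SettingWithCopyWarning below
df = df[df.zipcode != 0].copy()
df["agi_stub"] = df["agi_stub"].astype("category")
df["STATE"] = df["STATE"].astype("category")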
```

Use column `A06500` to represent the income tax paid after credits. \
Additionally, remember that all money amounts are represented in thousands of dollars.
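
The cleaning steps can also be wrapped in a small reusable function. This is a minimal sketch, not an established helper: the name `load_tax_data` is hypothetical, and the in-memory CSV uses made-up numbers in place of `tax_data.csv`.

```python
import io
import pandas as pd

def load_tax_data(source):
    """Read the CSV, drop the zipcode 0 rows, and set categorical dtypes."""
    df = pd.read_csv(source)
    df = df[df.zipcode != 0].copy()
    for column in ("agi_stub", "STATE"):
        df[column] = df[column].astype("category")
    return df

# Tiny in-memory CSV with made-up numbers, standing in for tax_data.csv
sample_csv = io.StringIO(
    "STATE,zipcode,agi_stub,A06500\n"
    "AL,0,1,300\n"
    "AL,35004,1,100\n"
    "AL,35005,1,200\n"
)
clean = load_tax_data(sample_csv)
total_dollars = clean["A06500"].sum() * 1000  # amounts are stored in thousands
print(len(clean), total_dollars)
```

With the real file, the call would be `load_tax_data("tax_data.csv")`.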