# USA County Council 
## Data Exploration & Cleaning

## 1. What was the best month for Tax Revenue? How much was earned that month?

## 2. Which County had the highest number of sales?

## 3. Which City had the highest number of sales?

The first part of any data analysis or predictive modeling task is an initial exploration of the data. Even if you collected the data yourself and you already have a list of questions in mind that you want to answer, it is important to explore the data before doing any serious analysis, since oddities in the data can cause bugs and muddle your results. Before exploring deeper questions, you have to answer many simpler ones about the form and quality of data. That said, it is important to go into your initial data exploration with a big picture question in mind since the goal of your analysis should inform how you prepare the data.

In [1]:
# Load in libraries
import calendar
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

# load dataset
county_df = pd.read_csv(r"C:\Users\jki\Downloads\catalog data websites\US County Revenue.csv")
county_df.head(5)

Unnamed: 0,Fiscal Year,Quarter Ending,County_Number,County,City,Number of Returns,Taxable Sales,Computed Tax,Percent of Tax,FIPS County Code,Primary Lat Dec,Primary Long Dec
0,2012,09/30/2011,1,Adair,Greenfield,112,5806502.0,347969.0,0.07,19001,41.330746,-94.470941
1,2012,09/30/2011,1,Adair,Adair,58,3113359.0,184309.0,0.04,19001,41.330746,-94.470941
2,2012,09/30/2011,1,Adair,Fontanelle,38,1175786.0,70547.0,0.01,19001,41.330746,-94.470941
3,2012,09/30/2011,1,Adair,Stuart,28,2760190.0,158996.0,0.03,19001,41.330746,-94.470941
4,2012,09/30/2011,1,Adair,Orient,16,274254.0,16449.0,0.0,19001,41.330746,-94.470941


After getting a sense of the data's structure, it is a good idea to look at a statistical summary of the variables with `df.describe()`.

In [2]:
county_df.describe()

Unnamed: 0,Fiscal Year,County_Number,Number of Returns,Taxable Sales,Computed Tax,Percent of Tax,FIPS County Code,Primary Lat Dec,Primary Long Dec
count,39151.0,39151.0,39151.0,39151.0,39151.0,39151.0,39151.0,39151.0,39151.0
mean,2017.34372,50.622666,107.969477,11881360.0,710150.7,0.119676,19100.259227,42.102228,-93.370935
std,3.418063,28.373334,319.001271,57982770.0,3464780.0,0.583621,56.739873,0.77517,1.617781
min,2012.0,0.0,1.0,0.0,0.0,0.0,19001.0,40.399828,-96.604626
25%,2014.0,25.0,18.0,401710.9,24082.67,0.0,19049.0,41.573216,-94.684188
50%,2017.0,50.0,34.0,1129626.0,67653.0,0.01,19099.0,42.078948,-93.327364
75%,2020.0,77.0,75.0,3448440.0,206048.0,0.03,19153.0,42.735494,-91.949987
max,2023.0,99.0,10470.0,1289790000.0,76923130.0,10.44,19197.0,43.50006,-90.193078


In [3]:
# Let's find out whether we have missing values
missing_values = county_df.isna().sum()
print(missing_values)

Fiscal Year          0
Quarter Ending       0
County_Number        0
County               0
City                 0
Number of Returns    0
Taxable Sales        0
Computed Tax         0
Percent of Tax       0
FIPS County Code     0
Primary Lat Dec      0
Primary Long Dec     0
dtype: int64
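
A zero count from `isna()` only rules out NaN values. The summary above shows minimums of 0 for columns such as `County_Number` and `Taxable Sales`, so it may also be worth counting zeros and duplicate rows. The following is a minimal sketch on a hypothetical toy frame (`toy_df` and its values, including the "Other" city, are assumptions for illustration, not rows from the dataset):

```python
import pandas as pd

# Hypothetical toy data; column names mirror the dataset, values are illustrative
toy_df = pd.DataFrame({
    "County_Number": [1, 0, 2],
    "Taxable Sales": [5806502.0, 0.0, 274254.0],
    "City": ["Greenfield", "Other", "Orient"],
})

# Count zeros in each numeric column; zeros can act as placeholders for missing data
zero_counts = (toy_df.select_dtypes("number") == 0).sum()

# Count fully duplicated rows
duplicate_rows = toy_df.duplicated().sum()

print(zero_counts)
print("Duplicate rows:", duplicate_rows)
```

Whether a zero is a real value or a placeholder depends on how the data was collected, so the counts are a prompt for a closer look rather than a verdict.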


In [4]:
# Let's check the data type of each column
county_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39151 entries, 0 to 39150
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Fiscal Year        39151 non-null  int64  
 1   Quarter Ending     39151 non-null  object 
 2   County_Number      39151 non-null  int64  
 3   County             39151 non-null  object 
 4   City               39151 non-null  object 
 5   Number of Returns  39151 non-null  int64  
 6   Taxable Sales      39151 non-null  float64
 7   Computed Tax       39151 non-null  float64
 8   Percent of Tax     39151 non-null  float64
 9   FIPS County Code   39151 non-null  int64  
 10  Primary Lat Dec    39151 non-null  float64
 11  Primary Long Dec   39151 non-null  float64
dtypes: float64(5), int64(4), object(3)
memory usage: 3.6+ MB


In [5]:
# Let's change the date column to a datetime data type
county_df['Quarter Ending'] = pd.to_datetime(county_df['Quarter Ending'])
county_df['Quarter Ending'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 39151 entries, 0 to 39150
Series name: Quarter Ending
Non-Null Count  Dtype         
--------------  -----         
39151 non-null  datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 306.0 KB
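
The preview above shows dates written as `MM/DD/YYYY` strings (for example `09/30/2011`). As a hedged alternative to letting pandas infer the format, an explicit `format` can be passed so that day and month are never swapped, and `errors="coerce"` can be used to surface unparseable values as `NaT`. The `sample` series below is hypothetical and only illustrates the idea:

```python
import pandas as pd

# Hypothetical sample in the same MM/DD/YYYY style as the preview; the last value is deliberately bad
sample = pd.Series(["09/30/2011", "12/31/2011", "not a date"])

# Explicit format removes day/month ambiguity; errors="coerce" turns bad values into NaT
parsed = pd.to_datetime(sample, format="%m/%d/%Y", errors="coerce")

print(parsed)
print("Unparseable rows:", parsed.isna().sum())
```

Counting the `NaT` values after conversion is a quick way to confirm that every date was parsed.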


In [6]:
# Let's create a new Month column
county_df['Month'] = county_df['Quarter Ending'].dt.month
county_df['Month'].describe()


count    39151.000000
mean         7.657965
std          3.410841
min          3.000000
25%          6.000000
50%          9.000000
75%         12.000000
max         12.000000
Name: Month, dtype: float64

In [7]:
# Let's confirm the Month column has been added
county_df.head(5)

Unnamed: 0,Fiscal Year,Quarter Ending,County_Number,County,City,Number of Returns,Taxable Sales,Computed Tax,Percent of Tax,FIPS County Code,Primary Lat Dec,Primary Long Dec,Month
0,2012,2011-09-30,1,Adair,Greenfield,112,5806502.0,347969.0,0.07,19001,41.330746,-94.470941,9
1,2012,2011-09-30,1,Adair,Adair,58,3113359.0,184309.0,0.04,19001,41.330746,-94.470941,9
2,2012,2011-09-30,1,Adair,Fontanelle,38,1175786.0,70547.0,0.01,19001,41.330746,-94.470941,9
3,2012,2011-09-30,1,Adair,Stuart,28,2760190.0,158996.0,0.03,19001,41.330746,-94.470941,9
4,2012,2011-09-30,1,Adair,Orient,16,274254.0,16449.0,0.0,19001,41.330746,-94.470941,9


In [8]:
# Let's create a Year column
county_df['Year'] = county_df['Quarter Ending'].dt.year
county_df['Year'].describe()

count    39151.000000
mean      2016.861153
std          3.422961
min       2011.000000
25%       2014.000000
50%       2017.000000
75%       2020.000000
max       2023.000000
Name: Year, dtype: float64

In [9]:
# Let's confirm the Year column has been added
county_df.head(5)

Unnamed: 0,Fiscal Year,Quarter Ending,County_Number,County,City,Number of Returns,Taxable Sales,Computed Tax,Percent of Tax,FIPS County Code,Primary Lat Dec,Primary Long Dec,Month,Year
0,2012,2011-09-30,1,Adair,Greenfield,112,5806502.0,347969.0,0.07,19001,41.330746,-94.470941,9,2011
1,2012,2011-09-30,1,Adair,Adair,58,3113359.0,184309.0,0.04,19001,41.330746,-94.470941,9,2011
2,2012,2011-09-30,1,Adair,Fontanelle,38,1175786.0,70547.0,0.01,19001,41.330746,-94.470941,9,2011
3,2012,2011-09-30,1,Adair,Stuart,28,2760190.0,158996.0,0.03,19001,41.330746,-94.470941,9,2011
4,2012,2011-09-30,1,Adair,Orient,16,274254.0,16449.0,0.0,19001,41.330746,-94.470941,9,2011


## 1. What was the best month for Tax Revenue? How much was earned that month?

In [11]:
# Load in the packages this cell needs
import calendar
import warnings
import pandas as pd
import matplotlib.pyplot as plt

warnings.filterwarnings("ignore")

# 1. What was the best month for Tax Revenue? How much was earned that month?

# Replace any missing values in the 'Month' column with 0, then cast to int
# (calendar.month_abbr[0] is an empty string, so such rows would get a blank name)
county_df['Month'] = county_df['Month'].fillna(0).astype(int)

# Convert month numbers to abbreviated month names
county_df['Month Name'] = county_df['Month'].apply(lambda x: calendar.month_abbr[x])

# Group by month name and total the computed tax for each month.
# Selecting the column before summing avoids summing non-numeric columns.
sales_by_month = county_df.groupby('Month Name')['Computed Tax'].sum()

# Find the best month for tax revenue and the corresponding total
best_month = sales_by_month.idxmax()
earnings_for_best_month = sales_by_month.max()

print(f"The best month for Tax Revenue was {best_month} with earnings of ${earnings_for_best_month:,.2f}")

The best month for Tax Revenue was Dec with earnings of $7,448,274,141.82
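
Because `Month` is derived from the `Quarter Ending` date, each month label here appears to stand for the quarter that ends in that month, and the total is summed across all years in the data. A minimal sketch of the same aggregation on hypothetical toy data follows (`toy_df` and its values are assumptions for illustration). It groups on the month number first so that the result stays in calendar order, then relabels the index with month abbreviations:

```python
import calendar
import pandas as pd

# Hypothetical toy data; values are illustrative and not taken from the dataset
toy_df = pd.DataFrame({
    "Quarter Ending": pd.to_datetime(
        ["2011-09-30", "2011-12-31", "2012-03-31", "2012-12-31"]
    ),
    "Computed Tax": [100.0, 250.0, 80.0, 300.0],
})

# Derive the quarter-ending month number and total the tax per month across years
toy_df["Month"] = toy_df["Quarter Ending"].dt.month
tax_by_month = toy_df.groupby("Month")["Computed Tax"].sum()

# Relabel with month abbreviations; grouping on the number keeps calendar order
tax_by_month.index = [calendar.month_abbr[m] for m in tax_by_month.index]

best = tax_by_month.idxmax()
print(tax_by_month)
print(f"Best quarter-ending month: {best} (${tax_by_month.max():,.2f})")
```

Grouping on the abbreviated name directly would sort the result alphabetically, which matters mainly if the totals are later plotted.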
