# Requirements:
1. Produce a list of ZIP values in xyzcust10 along with their frequencies. **HINT: First, we want to create a Series whose index is the 37 distinct zip codes in the ZIP column and whose values are the frequencies with which those zip codes appear in that column. Second, we want to do the same for either the ZIP9_Supercode or the ZIP9_SUPERCODE column. These two columns are duplicates, so we only need to consider one of them. How do you verify that they are duplicate columns?**
2. How many records with missing ZIP values are there in xyzcust10? **Note: Typically, NaN values are used to denote missing data. How would you check that there are no NaN values in the ZIP/ZIP4/ZIP9 columns? Then we should check for 0 values. Why?**
3. How many active and inactive buyers are in xyzcust10? **HINT: Check the values in the BUYER_STATUS column.**


# Deliverables:

- Submit a single zip-compressed file named YourLastName_Exercise_1 that contains the following files:

 1. Your **HTML document** containing your source code and output
 2. Your **ipynb script** containing your source code and output


# Submission Formats:

Create a folder or directory containing all supplementary files, with your last name at the beginning of the folder name. Compress that folder with zip compression and post the zip-archived folder under the assignment link in Canvas. The following files should be included in the archive folder/directory that is uploaded as a single zip-compressed file. (Use zip, not StuffIt, 7z, or any other compression method.)


1. Complete IPYNB script that has the source code in Python used to access and analyze the data. The code should be submitted as an IPYNB script that can be loaded and run in Jupyter Notebook for Python.
2. Output from the program, such as console listing/logs, text files, and graphics output for visualizations. If you use the Data Science Computing Cluster or School of Professional Studies database servers or systems, include Linux logs of your sessions as plain text files. Linux logs may be generated by using the script process at the beginning of your session, as demonstrated in tutorial handouts for the DSCC servers.
3. List file names and descriptions of files in the zip-compressed folder/directory.


# Formatting Python Code
When programming in Python, refer to Kenneth Reitz’ PEP 8: The Style Guide for Python Code:
http://pep8.org/
There is the Google style guide for Python at
https://google.github.io/styleguide/pyguide.html
Comment often and in detail.


In [1]:
# load pandas library
import pandas as pd

# read in the file to a dataframe
xyzcust10 = pd.read_csv('xyzcust10.csv')

In [19]:
# find the shape of the data frame
print('There are {} rows and {} columns in this data frame'.format(xyzcust10.shape[0], xyzcust10.shape[1]))

There are 30471 rows and 11 columns in this data frame


In [5]:
# review the data types of each column using the dtypes attribute
xyzcust10.dtypes

ACCTNO                    object
ZIP                        int64
ZIP4                       int64
LTD_SALES                float64
LTD_TRANSACTIONS           int64
YTD_SALES_2009           float64
YTD_TRANSACTIONS_2009      int64
CHANNEL_ACQUISITION       object
BUYER_STATUS              object
ZIP9_Supercode             int64
ZIP9_SUPERCODE             int64
dtype: object

### Number 1: Provide count by zip code

In [40]:
# counts by ZIP column
xyzcust10['ZIP'].value_counts()  # Series method; the top-level pd.value_counts is deprecated in recent pandas

60091    3458
60093    3178
60062    3099
60067    3050
60068    2781
60089    2007
60056    1529
60074    1313
60060    1296
60061    1207
60076    1090
60069     784
60077     740
60084     723
60073     686
60090     648
60098     564
60070     463
60085     379
60083     344
60081     322
60087     268
60097     151
60096     125
60071      98
60064      42
60072      34
60088      28
60078      25
60065      21
60075       5
60094       4
60082       3
60079       2
60192       2
60095       1
0           1
Name: ZIP, dtype: int64
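
The hint expects 37 distinct zip codes, which can be confirmed by comparing the length of the frequency table with the number of unique values. The following is a minimal sketch on a small made-up Series (`toy_zip` is a stand-in for `xyzcust10['ZIP']`, and its values are illustrative only):

```python
import pandas as pd

# Made-up values standing in for xyzcust10['ZIP'] (illustrative only)
toy_zip = pd.Series([60091, 60091, 60093, 0, 60062, 60062, 60062], name='ZIP')

# frequency table: index = distinct zip codes, values = how often each appears
zip_counts = toy_zip.value_counts()

# number of distinct zip codes; should equal the length of the frequency table
n_distinct = toy_zip.nunique()

print('There are {} distinct zip codes'.format(n_distinct))
print(zip_counts)
```

Run against the real column, the same two calls should report 37 if the hint is accurate; note that the placeholder value 0 is counted as one of the distinct values.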

In [65]:
# check to see if ZIP9_Supercode and ZIP9_SUPERCODE are the same
zip9_non_matching = xyzcust10.loc[xyzcust10['ZIP9_Supercode'] != xyzcust10['ZIP9_SUPERCODE']]
non_match_count = len(zip9_non_matching)

# if statement to print out number of non-matching rows
if non_match_count == 0:
    print('All values match between the two columns')
else:
    print('There are {} row(s) that don\'t match'.format(non_match_count))

All values match between the two columns
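
The row-by-row comparison above can also be expressed as a single call. This is a minimal sketch, assuming a small made-up frame (`toy`) in place of xyzcust10; it uses `Series.equals`, which returns True only when the two Series hold the same values in the same positions.

```python
import pandas as pd

# Made-up rows standing in for xyzcust10 (values are illustrative only)
toy = pd.DataFrame({
    'ZIP9_Supercode': [600911608, 600933737, 60062],
    'ZIP9_SUPERCODE': [600911608, 600933737, 60062],
})

# True only if both columns contain identical values row for row
columns_match = toy['ZIP9_Supercode'].equals(toy['ZIP9_SUPERCODE'])

# as a cross-check, count the rows where the two columns disagree
mismatch_count = int((toy['ZIP9_Supercode'] != toy['ZIP9_SUPERCODE']).sum())

print(columns_match, mismatch_count)
```

One design note: unlike `!=`, `equals` treats NaN values in the same position as equal, which matters for columns that can hold missing data.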


In [66]:
# get frequency counts by zip9_supercode
xyzcust10['ZIP9_Supercode'].value_counts()  # Series method; the top-level pd.value_counts is deprecated in recent pandas

60062        31
600933737    19
600692905    16
600674772    14
600611243    14
600674727    13
600931003    13
600677825    13
600611235    13
600934213    13
600845006    13
600911608    12
600845024    12
600912405    12
600693319    12
600932440    11
600932441    11
60093        11
600693812    11
600911547    11
600674984    11
600911625    11
600694012    11
600934051    10
600911006    10
600933939    10
600611234    10
600933708    10
600931552    10
600934301    10
             ..
600734679     1
600933017     1
600771224     1
600978357     1
600625811     1
600904443     1
600683734     1
600562434     1
600623904     1
600674619     1
600701243     1
600896311     1
600747112     1
600912687     1
600609579     1
600933161     1
600931110     1
600986403     1
600685342     1
600894212     1
600933145     1
600771352     1
600894228     1
600625939     1
600857362     1
600841260     1
600842404     1
600845068     1
600933129     1
600692803     1
Name: ZIP9_Supercode, Le

### Number 2: How many NULL Zip codes are there?

In [76]:
# check for NULL values in Zip Columns
column_names = ['ZIP', 'ZIP9_Supercode', 'ZIP9_SUPERCODE']

# count the NULL rows in each column (one count per column)
null_counts = xyzcust10[column_names].isnull().sum()

# total the per-column counts into a single integer
total_nulls = int(null_counts.sum())

# print the number of NULL values
print('There are {} NULL values in the columns specified'.format(total_nulls))

There are 0 NULL values in the columns specified


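It looks like there are no NULL or NaN values in any of the three zip columns.  
All three columns are stored as int64, which cannot hold NaN, so a missing zip code is most likely recorded as a placeholder such as 0. Let's check for any rows with the value 0.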

In [75]:
zip_zero = len(xyzcust10[xyzcust10['ZIP'] == 0])
zip9_lower_zero = len(xyzcust10[xyzcust10['ZIP9_Supercode'] == 0])
zip9_upper_zero = len(xyzcust10[xyzcust10['ZIP9_SUPERCODE'] == 0])

print('Number of rows with no zip code in the ZIP column: {}'.format(zip_zero))
print('Number of rows with no zip code in the ZIP9_Supercode column: {}'.format(zip9_lower_zero))
print('Number of rows with no zip code in the ZIP9_SUPERCODE column: {}'.format(zip9_upper_zero))

Number of rows with no zip code in the ZIP column: 1
Number of rows with no zip code in the ZIP9_Supercode column: 0
Number of rows with no zip code in the ZIP9_SUPERCODE column: 0
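
If 0 is treated as a placeholder for a missing zip code, it can be converted to NaN so that the usual missing-data tools apply. This is a minimal sketch on a made-up frame (`toy`); treating 0 as "missing" is an assumption about how the data was encoded, not something the file itself states.

```python
import pandas as pd

# Made-up rows standing in for xyzcust10 (values are illustrative only)
toy = pd.DataFrame({
    'ZIP': [60091, 0, 60093],
    'ZIP4': [1608, 0, 3737],
})

# turn the assumed placeholder 0 into NaN; the column is upcast to float
zip_as_missing = toy['ZIP'].mask(toy['ZIP'] == 0)

# now the standard missing-value check picks up the placeholder row
missing_zip_count = int(zip_as_missing.isnull().sum())

# count zeros in several zip-related columns at once
zero_counts = (toy[['ZIP', 'ZIP4']] == 0).sum()

print('Missing ZIP values: {}'.format(missing_zip_count))
print(zero_counts)
```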


### Number 3: How many active and inactive buyers are there?

In [18]:
# create a series with just buyer status counts
buyer_status = xyzcust10.groupby('BUYER_STATUS').size()
print('There are {} active buyers'.format(buyer_status.loc['ACTIVE']))
print('There are {} inactive buyers'.format(buyer_status.loc['INACTIVE']))

There are 13465 active buyers
There are 9078 inactive buyers
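
The two counts printed above sum to 22,543, which is fewer than the 30,471 rows reported earlier, so BUYER_STATUS presumably contains other values or blanks as well. A minimal sketch for listing every status, including missing ones, on a made-up Series (`toy_status`; the 'LAPSED' label is a hypothetical placeholder, not a value confirmed by the data):

```python
import pandas as pd

# Made-up values standing in for xyzcust10['BUYER_STATUS'];
# 'LAPSED' is a hypothetical extra category used only for illustration
toy_status = pd.Series(
    ['ACTIVE', 'INACTIVE', 'ACTIVE', 'LAPSED', None],
    name='BUYER_STATUS',
)

# dropna=False keeps missing values as their own row in the frequency table
status_counts = toy_status.value_counts(dropna=False)

# rows that are neither ACTIVE nor INACTIVE
accounted = status_counts.get('ACTIVE', 0) + status_counts.get('INACTIVE', 0)
unaccounted = len(toy_status) - accounted

print(status_counts)
print('{} row(s) are neither ACTIVE nor INACTIVE'.format(unaccounted))
```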
