<h1>Validating new Tigers Data</h1>
A new dataset was sent over on March 27th, 2025. We must validate that its shape is consistent with the prior dataset before rerunning the old code.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from Helpers import *

In [2]:
# load the data
dfOld = pd.read_csv('data/DTIFanData_2-21-25.csv')
dfNew = pd.read_csv('data/DTIFanData_3-25-25.csv')


In [3]:
# validate same column names, in the same order
# (Index.equals also fails cleanly when the column counts differ)
assert dfOld.columns.equals(dfNew.columns)
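
If the column check ever fails, the assertion alone does not say which columns changed. The following is a minimal sketch of a report that lists them. The helper name `compare_columns` and the toy frames are hypothetical stand-ins for `dfOld` and `dfNew`, not part of the original notebook.

```python
import pandas as pd


def compare_columns(old: pd.DataFrame, new: pd.DataFrame) -> dict:
    """Summarize column differences between two frames (hypothetical helper)."""
    return {
        # Columns present in one frame but not the other
        "only_in_old": old.columns.difference(new.columns).tolist(),
        "only_in_new": new.columns.difference(old.columns).tolist(),
        # True only when names and order match exactly
        "same_order": old.columns.equals(new.columns),
    }


# Toy frames standing in for dfOld / dfNew
toy_old = pd.DataFrame({"GlobalKey": [1], "Age": [30]})
toy_new = pd.DataFrame({"GlobalKey": [2], "City": ["Detroit"]})

column_report = compare_columns(toy_old, toy_new)
print(column_report)
```

On the real data, `compare_columns(dfOld, dfNew)` would be expected to return two empty lists and `same_order` set to `True`, since the assertion above passed.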

In [6]:
# compare the size of the data
print('Old data size:', dfOld.shape)
print('New data size:', dfNew.shape)
print(f'Difference in size: New data +{dfNew.shape[0] - dfOld.shape[0]} entries')

Old data size: (397187, 196)
New data size: (413222, 196)
Difference in size: New data +16035 entries


In [7]:
# drop every row that contains a null, to check whether (as in the old data) every row has at least one null
dfNewNoNulls = dfNew.dropna()
print('New data size without nulls:', dfNewNoNulls.shape)
if dfNewNoNulls.shape[0] != 0:
    print('YES! YES! YES!')
else:
    print('Yeah, that checks out')

New data size without nulls: (0, 196)
Yeah, that checks out
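
Since every row contains at least one null, `dropna()` cannot show where the nulls are concentrated. As a rough sketch, the per-column null rate of the two datasets can be compared instead. The helper name `null_rate_shift` and the toy frames below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def null_rate_shift(old: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """Fraction of nulls per column in each frame, plus the change (hypothetical helper)."""
    rates = pd.DataFrame({"old": old.isna().mean(), "new": new.isna().mean()})
    rates["change"] = rates["new"] - rates["old"]
    # Most improved columns (largest drop in null rate) come first
    return rates.sort_values("change")


# Toy frames standing in for dfOld / dfNew
toy_old = pd.DataFrame({
    "Age": [30, np.nan, np.nan, np.nan],
    "City": ["Detroit", None, "Novi", "Troy"],
})
toy_new = pd.DataFrame({
    "Age": [30, 41, np.nan, 25],
    "City": ["Detroit", None, "Novi", "Troy"],
})

null_rates = null_rate_shift(toy_old, toy_new)
print(null_rates)
```

Columns with a null rate of 1.0 in both datasets are entirely empty and could be flagged separately.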


In [12]:
# print out all column names in a list
print('Columns:', dfNew.columns.tolist())

Columns: ['KeepFlag', 'GlobalKey', 'SeasonKey', 'FanSinceDate', 'FirstGameAttended', 'TotalGamesAttended', 'FirstGameBought', 'LastGameBought', 'TotalTicketsPurchased', 'TotalLifetimeValue', 'CurrentSeasonEmailActivities', 'PreviousSeasonsEmailActivities', 'STMFlagCurr', 'TicketingFanType', 'EmailFanType', 'FullSeasonBuyer', 'HalfSeasonBuyer', 'QuarterSeasonBuyer', 'MiniPlanBuyer', 'IndividualGameBuyer', 'City', 'State', 'PostalCd', 'Country', 'Gender', 'Education', 'Occupation', 'Age', 'MaritalStatus', 'PresenceOfChildren', 'DwellingType', 'HouseholdIncome', 'NetWorth', 'PrimaryVehicleType', 'MSADescription', 'MailSuppresionFlg', 'WorkingWomanFlg', 'BankCardHolderFlg', 'GasDepartmentRetailCardHolderFlg', 'TravelEntertainmentCardHolderFlg', 'CreditCardHolderUnknownTypeFlg', 'PremiumCardHolderFlg', 'UpscaleDepartmentStoreCardHolderFlg', 'MailOrderResponderFlg', 'TruckOwnerFlg', 'MotorcycleOwnerFlg', 'RVOwnerFlg', 'IntTheatrePerformingArtsFlg', 'IntArtsFlg', 'IntDomesticTravelFlg', 'IntH

In [13]:
# check out the ticket buyer data to see if it's still uninterpretable
buyerColumns = ['FullSeasonBuyer', 'HalfSeasonBuyer', 'QuarterSeasonBuyer', 'MiniPlanBuyer', 'IndividualGameBuyer']
# print the first 30 rows of the buyer data
print(dfNew[buyerColumns].head(30))

    FullSeasonBuyer  HalfSeasonBuyer  QuarterSeasonBuyer  MiniPlanBuyer  \
0               NaN              NaN                 NaN            NaN   
1               NaN              NaN                 NaN            NaN   
2               NaN              NaN                 NaN            NaN   
3               NaN              NaN                 NaN            NaN   
4               NaN              NaN                 NaN            NaN   
5               NaN              NaN                 NaN            NaN   
6               NaN              NaN                 NaN            NaN   
7               NaN              NaN                 NaN            NaN   
8               NaN              NaN                 NaN            NaN   
9               NaN              NaN                 NaN            NaN   
10              NaN              NaN                 NaN            NaN   
11              NaN              NaN                 NaN            NaN   
12              NaN      

In [18]:
# old totals
oldBuyerTotals = {}
for col in buyerColumns:
    oldBuyerTotals[col] = dfOld[col].sum()
print('Old Buyer Totals:', oldBuyerTotals)

Old Buyer Totals: {'FullSeasonBuyer': 6.0, 'HalfSeasonBuyer': 0.0, 'QuarterSeasonBuyer': 0.0, 'MiniPlanBuyer': 1062.0, 'IndividualGameBuyer': 0.0}


In [19]:
# the preview of the buyer columns is all NaN, so let's calculate the total number of buyers in each category
buyerTotals = {}
for col in buyerColumns:
    buyerTotals[col] = dfNew[col].sum()
print('New Buyer Totals:', buyerTotals)

New Buyer Totals: {'FullSeasonBuyer': 1191.0, 'HalfSeasonBuyer': 558.0, 'QuarterSeasonBuyer': 2597.0, 'MiniPlanBuyer': 1591.0, 'IndividualGameBuyer': 14172.0}
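
The two loops above can also be collapsed into one side-by-side table, which makes the old and new totals easier to compare. This is a sketch on toy data: `buyer_cols` is a shortened, hypothetical version of `buyerColumns`, and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Shortened stand-in for buyerColumns
buyer_cols = ["FullSeasonBuyer", "IndividualGameBuyer"]

# Toy frames standing in for dfOld / dfNew
toy_old = pd.DataFrame({
    "FullSeasonBuyer": [1.0, np.nan, 0.0],
    "IndividualGameBuyer": [np.nan, np.nan, 0.0],
})
toy_new = pd.DataFrame({
    "FullSeasonBuyer": [1.0, 1.0, np.nan, 0.0],
    "IndividualGameBuyer": [1.0, np.nan, 1.0, 1.0],
})

# DataFrame.sum() skips NaN by default, matching the per-column loop
totals = pd.DataFrame({
    "old": toy_old[buyer_cols].sum(),
    "new": toy_new[buyer_cols].sum(),
})
totals["change"] = totals["new"] - totals["old"]
print(totals)
```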


<p>Awesome!</p>

In [24]:
# are the columns exclusive? i.e. can a FullSeasonBuyer also be a HalfSeasonBuyer?

# iterate through rows, for each row sum each column (if not NaN) and check if the sum is greater than 1
# if it is, then the columns are not exclusive
exclusive = True
exampleRow = None
for i in range(dfNew.shape[0]):
    row = dfNew.iloc[i]
    if row[buyerColumns].sum() > 1:
        exclusive = False
        exampleRow = row
        break

if exclusive:
    print('Buyer columns are exclusive!')
else:
    print(f'Buyer columns are not exclusive, example row:\n {exampleRow[buyerColumns]}')


Buyer columns are not exclusive, example row:
 FullSeasonBuyer        0.0
HalfSeasonBuyer        0.0
QuarterSeasonBuyer     1.0
MiniPlanBuyer          0.0
IndividualGameBuyer    1.0
Name: 204, dtype: object
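
The row-by-row loop stops at the first overlap, so it does not show how common overlaps are. A vectorized sketch of the same check is given below. It also counts how often each pair of flags appears together. The toy frame and the shortened `buyer_cols` list are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Shortened stand-in for buyerColumns
buyer_cols = ["FullSeasonBuyer", "QuarterSeasonBuyer", "IndividualGameBuyer"]

# Toy frame standing in for dfNew
toy = pd.DataFrame({
    "FullSeasonBuyer": [1.0, 0.0, np.nan, 0.0],
    "QuarterSeasonBuyer": [0.0, 1.0, np.nan, 0.0],
    "IndividualGameBuyer": [0.0, 1.0, np.nan, 1.0],
})

# Number of buyer flags set per row (NaN is skipped, so all-NaN rows sum to 0)
flags_per_row = toy[buyer_cols].sum(axis=1)

# Rows with more than one flag violate exclusivity
overlapping = toy.loc[flags_per_row > 1, buyer_cols]
is_exclusive = overlapping.empty
print("Exclusive:", is_exclusive, "| overlapping rows:", len(overlapping))

# Co-occurrence matrix: entry (a, b) counts rows where both a and b are set
flags = toy[buyer_cols].fillna(0)
co_occurrence = flags.T.dot(flags)
print(co_occurrence)
```

The diagonal of `co_occurrence` holds the per-column totals, and the off-diagonal entries show which categories overlap.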


In [26]:
# how many rows have at least one buyer column filled out?
buyers = dfNew[dfNew[buyerColumns].sum(axis=1) > 0]
print(f'Number of rows with at least one buyer column filled out: {buyers.shape[0]}')
print(f'Percentage of rows with at least one buyer column filled out: {buyers.shape[0] / dfNew.shape[0] * 100:.2f}%')

Number of rows with at least one buyer column filled out: 19093
Percentage of rows with at least one buyer column filled out: 4.62%


<b>Ticket buyer columns are better, but only 4.62% of the entries (19,093 of 413,222) are marked as ticket buyers, which seems far too low.</b>
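
One way to probe the low percentage is to separate rows where every buyer column is missing from rows that are explicitly marked 0. The sketch below does this on toy data. The three-way split and the names used are assumptions for illustration, and whether NaN actually means "unknown" rather than "not a buyer" in this dataset would need to be confirmed with the data provider.

```python
import numpy as np
import pandas as pd

# Shortened stand-in for buyerColumns
buyer_cols = ["FullSeasonBuyer", "IndividualGameBuyer"]

# Toy frame standing in for dfNew
toy = pd.DataFrame({
    "FullSeasonBuyer": [np.nan, np.nan, 0.0, 1.0, np.nan],
    "IndividualGameBuyer": [np.nan, np.nan, 0.0, 0.0, 1.0],
})

flags = toy[buyer_cols]
all_missing = flags.isna().all(axis=1)        # no buyer information at all
any_buyer = flags.sum(axis=1) > 0             # at least one flag set
explicit_non_buyer = ~all_missing & ~any_buyer  # some data present, no flag set

breakdown = pd.Series({
    "all buyer columns missing": int(all_missing.sum()),
    "explicitly not a buyer": int(explicit_non_buyer.sum()),
    "at least one buyer flag": int(any_buyer.sum()),
})
print(breakdown)
print((breakdown / len(toy) * 100).round(2))
```

If most rows fall into the first group, the 4.62% figure may reflect missing ticketing data rather than a small buyer base.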