# Enter Name here: Ryan McCarthy

Please complete your final project in the space below. Do not forget to explain and interpret the process. You can find the rubric here: https://nathanmichalewicz.org/courses/python/assignments/project-rubric.html

# PROJECT DESCRIPTION

I have chosen to use the "US Collegiate Sports Dataset" for my project. I think this dataset is interesting because I'm a huge fan of the South Carolina Gamecocks football team and the Southeastern Conference (SEC). I also play Division I lacrosse. I'd like to learn how much football programs contributed to overall sports revenue at SEC schools in 2019, whether men's lacrosse programs generated more revenue at Division I schools than at Division II schools from 2015 to 2019, and how many women played lacrosse at Division I schools in 2019.

# VARIABLE ANALYSIS

In the first cell import your libraries and load your data.

Complete one t-test and one ANOVA test.

The t-test and the ANOVA test should not include the same variables and should be related to your research question. Explain why you chose the variables you chose and interpret your results.

In [4]:
# Import libraries and load the dataset

import kagglehub
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats

from statistics import mean, median, mode
from kagglehub import KaggleDatasetAdapter

# Set the debug flag for this segment of code
debug = False

# Download the latest version
path = kagglehub.dataset_download("umerhaddii/us-collegiate-sports-dataset")

# Validate that the file was downloaded
if debug: print(f'Path to dataset files: {path}')

file_path = "sports.csv"

# Import the data into a dataframe
df_sports = kagglehub.load_dataset(
  KaggleDatasetAdapter.PANDAS,
  "umerhaddii/us-collegiate-sports-dataset",
  file_path,
  # https://github.com/Kaggle/kagglehub/blob/main/README.md#kaggledatasetadapterpandas
)

# Print a few rows of the dataset to get a feel for the data
if debug: print(df_sports)



# T-TEST

To help answer the question "how many women played lacrosse at Division I schools in 2019", I compared the number of women who played lacrosse at each Division I school in 2019 with the number at each Division II school in 2019, to see whether there was a statistically significant difference between the two groups. These two groups are therefore my variables.

So the t-test question is: "Is there a statistically significant difference between women's participation in lacrosse at Division I and Division II schools in 2019?"

The t-test determines whether there is a statistically significant difference between the means of the two groups. The sign of the t-statistic shows which group has the larger mean, but to see the size of the gap in participants I also calculated the mean, median, and mode of each group.

Procedure:
- Get the data for the number of women who participated in Division I lacrosse in 2019.

- Get the data for the number of women who participated in Division II lacrosse in 2019.

- Use a t-test to compare the two sets of data.

- Analyze the t-statistic and p-value.

- Calculate the mean, median, and mode for the number of women who participated in Division I lacrosse in 2019.

- Calculate the mean, median, and mode for the number of women who participated in Division II lacrosse in 2019.

- Analyze the results.
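The procedure above can be sketched on a small made-up table before running it on the real data. This is only an illustration: the `toy` DataFrame and the `participants` helper are hypothetical, and `equal_var=False` (Welch's t-test) is shown as an option for groups whose variances may differ, not as the method used in this project.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data that reuses the column names of the real dataset.
toy = pd.DataFrame({
    "year": [2019] * 8,
    "sports": ["Lacrosse"] * 8,
    "classification_name": ["NCAA Division I-FBS"] * 4
                           + ["NCAA Division II with football"] * 4,
    "partic_women": [34, 31, 36, None, 24, 26, 22, 25],
})

def participants(df, categories):
    """Return the non-missing women's participation counts for the given categories."""
    mask = (
        (df["year"] == 2019)
        & (df["sports"] == "Lacrosse")
        & df["classification_name"].isin(categories)
    )
    return df.loc[mask, "partic_women"].dropna()

div1 = participants(toy, ["NCAA Division I-FBS"])
div2 = participants(toy, ["NCAA Division II with football"])

# equal_var=False requests Welch's t-test, which does not assume equal variances.
welch = stats.ttest_ind(div1, div2, equal_var=False)
print(f"n1={len(div1)}, n2={len(div2)}, t={welch.statistic:.2f}, p={welch.pvalue:.4f}")
```

A positive t-statistic here means the first group passed to `ttest_ind` has the larger mean.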

In [38]:
# Set the debug flag for this segment of code

debug = False

# Initialize parameters

div1_categories = ['NCAA Division I-FBS', 'NCAA Division I-FCS']
div2_categories = ['NCAA Division II without football', 'NCAA Division II with football']
sport_lacrosse = 'Lacrosse'
school_year = 2019

# Function to query the dataset and get the number of women who participated in
# Lacrosse for the given NCAA Categories in 2019

def get_lacrosse_data(param_ncaa_categories):

    # Args:
    # param_ncaa_categories: A list of target NCAA categories.

    # Returns:
    # A list containing the number of women participating in lacrosse for each
    # school in the given NCAA Category in 2019.

    if debug: print(f'\nget_lacrosse_data: ncaa_categories = {param_ncaa_categories}')

    # Query the dataset
    condition = (df_sports['year'] == school_year) & (df_sports['sports'] == sport_lacrosse) & (df_sports['classification_name'].isin(param_ncaa_categories))
    df_results = df_sports.loc[condition, ['year', 'sports', 'institution_name', 'partic_women']].copy()

    if debug: print(f'\ndf_results\n{df_results}')

    # Remove rows where 'partic_women' has blanks and NaN values
    df_results = df_results.dropna(subset=['partic_women'])

    # Get the number of women participants
    partic_women = df_results['partic_women']

    return partic_women

# Function that calculates and outputs the mean, median, and mode of a given list
# of numbers. The mean is the average of all values, the median is the middle value
# when the data is ordered, and the mode is the value that appears most frequently.

def print_stats(param_title, param_numbers):

    # Args:
    # param_title: The name of the list.
    # param_numbers: A list of numbers.

    if debug: print(f"\nprint_stats for {param_title}\nmean: {mean(param_numbers)}\nmedian: {median(param_numbers)}\nmode: {mode(param_numbers)}")

    format_string = "{:>18} {:>12.2f} {:>12.2f} {:>12.2f}"
    print(format_string.format(param_title, mean(param_numbers), median(param_numbers), mode(param_numbers)))

# Get the data for the number of women who participated in Division I Lacrosse in 2019
div1_women = get_lacrosse_data(div1_categories)

if debug: print(f'\ndiv1_women\n{div1_women}')

# Get the data for the number of women who participated in Division II Lacrosse in 2019
div2_women = get_lacrosse_data(div2_categories)

if debug: print(f'\ndiv2_women\n{div2_women}')

# Perform the t-test
t_stat, p_value = stats.ttest_ind(div1_women, div2_women)

# Print the results of the t-test
print(f"\nT-statistic: {t_stat}")
print(f"P-value: {p_value}")

# Check for statistical significance (e.g., at a 5% level)
if p_value < 0.05:
   print("Result: There is a statistically significant difference in the number of women participating in NCAA Division I and Division II lacrosse.")
else:
   print("Result: There is NOT a statistically significant difference in the number of women participating in NCAA Division I and Division II lacrosse.")

# Output the mean, median and mode in table format
print(f"\n{'Number of Participants':>38}")
print("=" * 58)
print(f"{'category':>14}{'mean':>16} {'median':>14} {'mode':>10}")
print("=" * 58)

# Output the mean, median and mode for div1_women
print_stats('Division I Women', div1_women)

# Output the mean, median and mode for div2_women
print_stats('Division II Women', div2_women)

print("-" * 58)


T-statistic: 11.816277319702285
P-value: 1.5389265868448235e-24
Result: There is a statistically significant difference in the number of women participating in NCAA Division I and Division II lacrosse.

                Number of Participants
      category            mean         median       mode
  Division I Women        33.71        33.00        30.00
 Division II Women        24.44        25.00        26.00
----------------------------------------------------------


T-TEST CONCLUSIONS

According to the course material, a high t-statistic value indicates a strong difference between the means of the two groups being compared. Essentially, it suggests that the observed difference is unlikely to have occurred by chance alone.

Given the high t-statistic (~ 11.8) and the very small p-value (~ 1.5e-24), I conclude that in 2019 there was a strong difference between the number of women who played lacrosse at NCAA Division I schools and at NCAA Division II schools. The differences in the mean values (33.7 vs. 24.4), median values (33 vs. 25), and mode values (30 vs. 26) for these two groups support this conclusion and indicate that, on average, Division I schools had more women playing lacrosse than Division II schools in 2019.
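Because a t-statistic grows with sample size as well as with the size of the gap, an effect size can help describe how large the difference is in practical terms. The sketch below computes Cohen's d with the pooled standard deviation. The `cohens_d` helper and the sample numbers are hypothetical and are not taken from the dataset.

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(a, b):
    """Cohen's d: difference in means divided by the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * stdev(a) ** 2 + (n_b - 1) * stdev(b) ** 2) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

# Hypothetical participation counts, not values from the dataset.
div1_sample = [34, 31, 36, 33]
div2_sample = [24, 26, 22, 25]

d = cohens_d(div1_sample, div2_sample)
print(f"mean difference: {mean(div1_sample) - mean(div2_sample):.2f}, Cohen's d: {d:.2f}")
```

A positive d means the first group has the larger mean. Values around 0.8 or above are conventionally described as a large effect.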

# ANOVA TEST

The following analysis helps me answer the question of "How much football programs contributed to the overall sports revenue at SEC schools in 2019" by looking at the amount of money a selected set of SEC Division I schools spent (invested) on football versus other big sports (basketball, soccer, lacrosse and volleyball).  Since I looked at financial revenue numbers in a previous assignment, I decided to look at financial expense numbers this time.  So my variables were financial expense for football, basketball, soccer, lacrosse and volleyball.

ANOVA Question:
Is there a statistically significant difference in the total financial expense for football, basketball, soccer, lacrosse and volleyball programs at SEC Division I schools in 2019?

Procedure:

- Gather expense data for football, basketball, soccer, lacrosse and volleyball at selected SEC Division I schools in 2019.

- Run a one-way ANOVA to compare the expenses across the 5 sports.

- Analyze the ANOVA results to see if the f-statistic and p-value are significant.

- Because the ANOVA does not tell exactly which sports have the statistically significant difference, I calculated the mean, median, and mode for each sport to help answer that question.
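To make the F-statistic in this procedure less of a black box, here is a minimal sketch that computes it by hand as the ratio of between-group to within-group variation. The `f_statistic` helper and the expense figures are hypothetical. `scipy.stats.f_oneway` should report the same ratio for the same groups.

```python
from statistics import mean

def f_statistic(groups):
    """One-way ANOVA F = between-group mean square / within-group mean square."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each value around its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical expenses in millions of dollars, not values from the dataset.
groups = {
    "Football": [40.0, 44.0, 48.0],
    "Basketball": [12.0, 14.0, 16.0],
    "Soccer": [1.0, 2.0, 3.0],
}

f_value = f_statistic(list(groups.values()))
print(f"F-statistic: {f_value:.2f}")
```

A large F means the group means are far apart relative to the spread inside each group. A group with only one value adds nothing to the within-group term, so it is worth checking group sizes before interpreting the result.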

In [48]:
# Set the debug flag for this segment of code

debug = False

# Create the selected list of SEC Division I schools
# Note: The University of Texas at Austin joined the SEC in 2024; in 2019 it was a Big 12 member
school_names = ['The University of Alabama','University of Florida', 'University of Georgia',
                'University of South Carolina-Columbia','The University of Texas at Austin']

# Create the list of sports
sports_names = ['Football', 'Basketball', 'Soccer', 'Lacrosse', 'Volleyball']

# Specify the year
school_year = 2019

# Query the data set
condition = (df_sports['institution_name'].isin(school_names)) & (df_sports['year'] == school_year) & (df_sports['sports'].isin(sports_names))
df_results = df_sports.loc[condition, ['year', 'institution_name', 'sports', 'total_exp_menwomen']].copy()

# Remove rows where 'total_exp_menwomen' has blanks and NaN values
df_results = df_results.dropna(subset=['total_exp_menwomen'])

if debug: print(df_results)

# Prepare input to the ANOVA calculation
exp_football = df_results.loc[df_results['sports'] == sports_names[0], 'total_exp_menwomen']
exp_basketball = df_results.loc[df_results['sports'] == sports_names[1], 'total_exp_menwomen']
exp_soccer = df_results.loc[df_results['sports'] == sports_names[2], 'total_exp_menwomen']
exp_lacrosse = df_results.loc[df_results['sports'] == sports_names[3], 'total_exp_menwomen']
exp_volleyball = df_results.loc[df_results['sports'] == sports_names[4], 'total_exp_menwomen']

if debug:
  print(f'\nTEST-1 exp_football: Length = {len(exp_football)}, Values = {exp_football.values}')
  print(f'\nTEST-2 exp_basketball: Length = {len(exp_basketball)}, Values = {exp_basketball.values}')
  print(f'\nTEST-3 exp_soccer: Length = {len(exp_soccer)}, Values = {exp_soccer.values}')
  print(f'\nTEST-4 exp_lacrosse: Length = {len(exp_lacrosse)}, Values = {exp_lacrosse.values}')
  print(f'\nTEST-5 exp_volleyball: Length = {len(exp_volleyball)}, Values = {exp_volleyball.values}')

# Calculate the ANOVA values for expenditures across the set of sports
df_anova = stats.f_oneway(exp_football, exp_basketball, exp_soccer, exp_lacrosse, exp_volleyball)

# Output the results
text_str = f'\nExpenditures for Football, Basketball, Soccer, Lacrosse and Volleyball\nF-statistic: {df_anova.statistic}\np-value: {df_anova.pvalue}'
print(text_str)

# Output the mean, median and mode in table format
print(f"\n{'Financial Expenditure':>38}")
print("=" * 58)
print(f"{'category':>15}{'mean':>12} {'median':>13} {'mode':>12}")
print("=" * 58)

# Output the mean, median and mode for each sport
print_stats('Football', exp_football)
print_stats('Basketball', exp_basketball)
print_stats('Soccer', exp_soccer)
print_stats('Lacrosse', exp_lacrosse)
print_stats('Volleyball', exp_volleyball)

print("-" * 58)



Expenditures for Football, Basketball, Soccer, Lacrosse and Volleyball
F-statistic: 58.15951156367277
p-value: 2.4950602357714594e-09

                 Financial Expenditure
       category        mean        median         mode
          Football  43659005.00  39503076.00  58508853.00
        Basketball  14438291.60  13543284.00  13235476.00
            Soccer   2130293.80   1883398.00   1883398.00
          Lacrosse   1648659.00   1648659.00   1648659.00
        Volleyball   2300973.80   1909222.00   1909222.00
----------------------------------------------------------


ANOVA TEST CONCLUSIONS

According to the course material, the high F-statistic (~ 58.16) indicates that the mean expense differs for at least one of the sports. The extremely small p-value (~ 2.5e-09) means that the evidence against the null hypothesis (that all group means are equal) is extremely strong.

The ANOVA results, combined with a comparison of the mean, median, and mode values for the selected sports at the selected schools, indicate that football had the highest expenditure of the five sports in 2019 (a mean of about $43.7 million versus about $14.4 million for basketball). High spending may also suggest that football made the largest contribution to sports revenue at these schools in 2019, but expenses alone cannot confirm this. More analysis is required to determine a more accurate answer.

- A key part of this analysis was performed in the previous assignment, where I used Pearson Correlation Coefficient analysis to correlate the amount of revenue generated by football with the total amount of revenue generated by all sports at each selected SEC school in 2019.

- The results (0.97 coefficient and 0.00 p-value) indicated a strong correlation between the amount of revenue generated by football and the amount of revenue generated by the overall sports programs at those schools.

- Based on all the tests I performed on the football data, I have a strong opinion that football had higher revenue, higher expenses, and a larger contribution to overall sports revenue than basketball, soccer, lacrosse and volleyball.
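For reference, the Pearson correlation coefficient mentioned above can be computed by hand as the covariance of the two variables divided by the product of their spreads. The sketch below is not the code from the previous assignment. The `pearson_r` helper and the revenue figures are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical revenues in millions of dollars, one pair per school.
football_rev = [90.0, 85.0, 120.0, 70.0, 110.0]
total_rev = [150.0, 140.0, 190.0, 125.0, 175.0]

r = pearson_r(football_rev, total_rev)
print(f"Pearson r: {r:.3f}")
```

A coefficient close to 1 means the two revenue figures rise together. Since football revenue is itself part of total revenue, some positive correlation is expected.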