# Instructor Feedback for Mid-Semester Dataset Update and Group Assignments to Date for Group 5

Please find below three sections of feedback regarding your mid-semester dataset update and group assignments. Overall, I'm glad to see progress on the project and it seems like you are working together, which is great. I also think your choice of topic remains a rich one and I'm excited to see how you're working together to realize your proposed project. However, I have some questions about the division of labor, as well as the overall scope and documentation of your project. As a reminder, your goal is to try and create something like the datasets we have been exploring on the *Responsible Datasets in Context Project*. While I do not expect what you produce to be as polished as what is available on the website, considering there are four of you in the group, I am hoping you can clarify some of your current choices and provide more detail on how you are envisioning your final project and how you will achieve the stated project goals and requirements.

In terms of returning feedback to the instructor, you have two options: create a new GitHub issue responding to my questions and tag me in the issue, or complete the Google form, available via Canvas, where you can update your project plan and respond to my questions or offer suggested corrections to the stated assessments. You are also welcome to use the form to provide feedback that will not be shared with the rest of your group if you think that would be helpful or want to share something with me privately.

## Load Libraries and Datasets

To run this notebook, you will need to have `pandas`, `altair`, and `rich` installed. You can find instructions on how to do so on our course website.

In [1]:
import pandas as pd
import altair as alt
from rich.console import Console
from rich.table import Table

console = Console()

# Load the data
contributors_df = pd.read_csv("./datasets/contributors_group_5.csv")
commits_df = pd.read_csv("./datasets/commits_group_5.csv")
issues_df = pd.read_csv("./datasets/issues_group_5.csv")

## Overall Group Feedback

Overall, your group seems to be working well together, though I notice some patterns below that I have questions about. I was glad to see that your GitHub repository has a `.gitignore` file. However, I would recommend that you update your `README.md` to describe the overall project and the structure of your repository. I would also recommend that you update the repository description to be accurate for your project and add a license for the repository.

I will say that while I respect and understand your current approach to dividing labor, it does make it somewhat difficult for me to assess contributions, since it seems like primarily some group members have been doing the coding and issue management (at least based on the data you will see below), whereas others are doing more of the research and data entry intensive labor. If you could please just briefly confirm that the division of labor is relatively equitable on the Google form on Canvas, then I'm happy to share the grades between all members of the group. But I would like to get that confirmation since it is not clear to me from the data I have available.

In the following graphs, you will see some of how your group has been working together from what I can tell via GitHub. I do not think this data represents all of your group's work or activities, so I would encourage you to both think about how to document work in more detail and also to contact the instructors if you feel this data is not representative of your group's work and would like other aspects to be included in your assessment.

### Contributions Per Group Member

In [2]:
# Create a table for the contributions
contributors_table = Table(title="Contributors")
# Add columns with the contributors and the number of contributions
contributors_table.add_column("Contributor", style="cyan")
contributors_table.add_column("Number of Contributions", style="magenta")
# Sort the contributors by the number of contributions, with the highest first
contributors_df = contributors_df.sort_values(by="contributions", ascending=False)
# Loop through the contributors and add them to the table
for index, row in contributors_df.iterrows():
	# Add the contributor to the table and set the contributions to be a string
	contributors_table.add_row(row["login"], str(row["contributions"]))
# Print the table
console.print(contributors_table)

Above is the total number of contributions (that is, commits) per group member. I see that there are not many commits currently, which makes sense since most of your group activities so far have been more research intensive, and also that almost every group member has committed, which is great. Ideally, I would like to see all members of the group committing to the repository moving forward if it is possible, but I am happy to respect your division of labor if that is something you have all decided on.

### Commits Over Time

In [3]:
# Melt the commits dataframe from wide to long format so that we can have all commit type activities in one column called commit_metric
melted_commits_df = commits_df.melt(id_vars=['oid', 'message', 'committedDate', 'login', ], value_vars=['additions', 'deletions', 'changedFiles'], var_name='commit_metric', value_name='commit_metric_value')
# Convert the commit date to a datetime object
melted_commits_df['commit_date'] = pd.to_datetime(melted_commits_df['committedDate'], errors='coerce')
# Get the unique commit types to create charts for each type
commit_types = melted_commits_df['commit_metric'].unique().tolist()

# Create a list to store the charts
charts = []
# Loop through the commit types and create a chart for each type
for commit_type in commit_types:
	# Create an interactive selection for each chart where you can select the login to highlight each group member's contributions
	selection = alt.selection_point(fields=['login'], bind='legend')
	
	# Filter the DataFrame for the current commit type and subset to only the columns we need to keep the chart smaller in size
	filtered_df = melted_commits_df[melted_commits_df['commit_metric'] == commit_type][['commit_date', 'commit_metric_value', 'login', 'message']]
	
	# Create a bar chart for the current commit type
	chart = alt.Chart(filtered_df).mark_bar().encode(
		x='commit_date:T', # Use the commit date as the x-axis
		y='commit_metric_value:Q', # Use the commit metric value as the y-axis
		color='login:N', # Color the bars by the login
		opacity=alt.condition(selection, alt.value(1), alt.value(0.1)), # Set the opacity to 1 if the login is selected and 0.1 if not
		tooltip=['commit_date', 'commit_metric_value', 'login', 'message'] # Show the commit date, commit metric value, login, and message in the tooltip
	).add_params(selection).properties(
		title=f"Commit {commit_type} Over Time"
	)
	# Add the chart to the list of charts
	charts.append(chart)
# Combine all the charts into one chart and set the y-axis to be independent so that we can see all the changes even if the y axis scale is different for each commit type activity
alt.hconcat(*charts).resolve_scale(y='independent')

When we look at the distribution of commit activities (additions, deletions, and number of files changed), it looks like, even though there have been few commits, Yosef has been making some very large changes, though Ethan, Nick, and Jason have contributed a bit as well. I am concerned that I do not see any contributions from Sam and Kohta. I would like to see more of a balance in the future, but I understand that this may be due to the division of labor you have all decided on (though again I will need confirmation that is the case).
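If it helps you check my reading of the charts, here is a minimal sketch of how the same commit metrics could be totaled per group member. It assumes the column names used in the cell above (`login`, `additions`, `deletions`, `changedFiles`); the sample rows, the placeholder logins, and the `summarize_commits` helper are hypothetical and only stand in for your real commit data.

```python
import pandas as pd

# Hypothetical sample rows standing in for commits_group_5.csv; logins and numbers are made up
sample_commits_df = pd.DataFrame({
    "login": ["member_a", "member_a", "member_b", "member_c"],
    "additions": [1200, 300, 40, 10],
    "deletions": [100, 50, 5, 0],
    "changedFiles": [12, 3, 1, 1],
})


def summarize_commits(commits: pd.DataFrame) -> pd.DataFrame:
    """Total each commit metric per login and add each member's share of all additions."""
    metrics = ["additions", "deletions", "changedFiles"]
    # Sum every metric for each contributor
    summary = commits.groupby("login")[metrics].sum()
    # Count how many commits each contributor made
    summary["commit_count"] = commits.groupby("login").size()
    # Express each contributor's additions as a fraction of the group's total additions
    summary["share_of_additions"] = (summary["additions"] / summary["additions"].sum()).round(3)
    # Put the largest contributor first
    return summary.sort_values("additions", ascending=False)


commit_summary = summarize_commits(sample_commits_df)
print(commit_summary)
```

Line counts are only a rough proxy for effort, so a table like this is best read alongside your own account of who did what.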

### Issues and Project Management Over Time

In [4]:
# Once again melt the issues dataframe from wide to long format so that we can have all the issue dates in one column called issue_date
melted_issues_df = issues_df.melt(id_vars=['user.login', 'title', 'state', 'body', 'html_url', 'assignee', 'assignees_logins', 'labels', 'milestone', 'draft', 'comments', 'state_reason', 'closed_by.login', 'reactions.total_count', 'issue_duration', 'issue_associated_with_pull_request', 'pull_request.url', 'issue_status_on_project_board'], value_vars=['created_at', 'updated_at', 'closed_at'], var_name='issue_date_type', value_name='issue_date')

# Rename columns because Altair does not let us use '.' in the column names
melted_issues_df = melted_issues_df.rename(columns={
    'user.login': 'user_login', 
    'closed_by.login': 'closed_by_login'
})

# Sort the DataFrame by issue_date
melted_issues_df = melted_issues_df.sort_values(by='issue_date')

# Get the unique issue titles
issue_titles = melted_issues_df['title'].unique().tolist()

# Initialize an empty list to store the charts
charts = []
# Loop through each issue title to create a chart for the issue
for title in issue_titles:
    # Create a selection so that we can highlight the contributions of each group member
    selection = alt.selection_point(fields=['user_login'], bind='legend')
    
    # Filter the DataFrame for the current issue title
    subset_data = melted_issues_df[melted_issues_df['title'] == title]
    
    # Initialize a list to store subtitles for the chart
    subtitle = []
    
    # Check if the issue is associated with a pull request
    has_pr = subset_data[subset_data['issue_associated_with_pull_request'] == True]
    if not has_pr.empty:
        # If the issue is associated with a pull request, get the URL of the pull request and add it to the subtitle
        pr_url = has_pr['pull_request.url'].unique()[0]
        subtitle.append(f"Issue associated with a pull request ({pr_url}).")
    
    # Check if the issue is associated with a project board
    has_project_board = subset_data[subset_data['issue_status_on_project_board'].notna()]
    if not has_project_board.empty:
        # If the issue is associated with a project board, get the status of the issue on the project board and add it to the subtitle
        board_status = has_project_board['issue_status_on_project_board'].unique()[0]
        subtitle.append(f"Issue is associated with a project board and has status {board_status}.")
    
    # If no subtitles were added, add a default message
    if not subtitle:
        subtitle = "Issue is not associated with a pull request or a project board."
    
    # Create a line chart for the current issue title
    chart = alt.Chart(subset_data).mark_line(point=True).encode(
        x='issue_date:T', # Use the issue date as the x-axis
        y=alt.Y('title', axis=alt.Axis(title=None, labels=False)), # Use the title as the y-axis and don't show the axis labels or title since we have the title in the chart title
        color='user_login:N', # Color the lines by the user login
        tooltip=[
            'user_login', 'closed_by_login', 'yearmonthdatehoursminutes(issue_date)', 'issue_date_type', 
            'title', 'body', 'state', 'assignee', 'assignees_logins', 'html_url'
        ], # Show the user login, closed by login, issue date, issue date type, title, body, state, assignee, assignees logins, and HTML URL in the tooltip
        opacity=alt.condition(selection, alt.value(1), alt.value(0.1)) # Set the opacity to 1 if the user login is selected and 0.1 if not
    ).add_params(selection).properties(
        # Set the title of the chart to be the issue title and the subtitle to be the subtitles we created
        title=alt.Title(f"Issue: {title}", subtitle=subtitle)
    )
    
    # Append the chart to the list of charts
    charts.append(chart)

# Concatenate all the charts vertically and resolve the x-scale to be shared
alt.vconcat(*charts).resolve_scale(x='shared')

Looking at your issues and project board, it looks like you initially used issues for project management, but then stopped for some reason? I also noticed that many of the issues are being closed immediately after being opened so there might be room to leverage issues more as you complete the project. 

I realize using a new platform and interface for project management can be difficult, but I would encourage you to think about how you can use these tools to not only help you manage your project and communicate with each other, but also document your work for the final project.
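To make the point about issues being closed right after they are opened easier to check, here is a minimal sketch that measures how long each issue stayed open. It assumes the `created_at` and `closed_at` columns used in the cell above; the sample rows, the `flag_quickly_closed` helper, and the ten-minute threshold are hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical sample rows standing in for issues_group_5.csv; titles and timestamps are made up
sample_issues_df = pd.DataFrame({
    "title": ["Collect data", "Write README", "Clean columns"],
    "created_at": ["2024-10-01T14:00:00Z", "2024-10-02T09:00:00Z", "2024-10-03T10:00:00Z"],
    "closed_at": ["2024-10-01T14:02:00Z", "2024-10-09T09:00:00Z", None],
})


def flag_quickly_closed(issues: pd.DataFrame, threshold_minutes: float = 10) -> pd.DataFrame:
    """Add the open duration in minutes and flag issues closed within the threshold."""
    result = issues.copy()
    # Parse the timestamps; unparseable or missing values become NaT
    created = pd.to_datetime(result["created_at"], errors="coerce", utc=True)
    closed = pd.to_datetime(result["closed_at"], errors="coerce", utc=True)
    result["minutes_open"] = (closed - created).dt.total_seconds() / 60
    # Still-open issues have a missing duration, so the comparison is False for them
    result["closed_quickly"] = result["minutes_open"] < threshold_minutes
    return result


flagged_issues = flag_quickly_closed(sample_issues_df)
print(flagged_issues[["title", "minutes_open", "closed_quickly"]])
```

An issue that is opened and closed within minutes usually records work after the fact rather than planning it, which is the pattern I am asking about.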

## Group Assignments to Date Feedback

The following feedback is for the three group assignments you have completed so far. Since there isn't clear documentation on who completed what activity, I am currently using the git history to assess contributions, but I am happy to adjust this assessment if you provide me with more information.

### Mass Digitization & Digital Libraries Assignment

- It seems like the `Group Document.md` is the correct file for this assignment, but please correct me if I am wrong.
- Overall, I think you did a great job on this assignment, and I particularly appreciated that you detailed how you planned to work (you're one of the few groups that included that requested information!). I also thought your assessment of how sports data is digitized and preserved, emphasizing the importance of metadata and the shift from limited historical access to modern, comprehensive digital archives was spot on!
- I appreciated that you included some names of which individuals completed which sections (maybe just Jason in this case?) and included links to the sources you found (though I would recommend correctly formatting links for Markdown files in the future). I particularly liked your inclusion of the International Olympic Committee's digital archive as an example of one of the oldest digital sports archives, as well as your detailing of some of the defunct projects. Finally, I thought your deep dive into Statcast and the Curry viral video was great as well!
- In the git history, I'm seeing Yosef and Nick listed, so will be giving full marks to them and Jason. But feel free to correct me if I am wrong.

Status: Complete & Full Marks for Yosef, Nick, and Jason (Though happy to update this if you can share more details about how you worked together on this assignment)

### Critical Cultural Data Explorations Assignment

- I believe that this assignment is the `critical_cultural_data_explorations.md` file, but again please correct me if I am wrong or missing something.
- Overall, I thought you did great on this assignment and really appreciated that you detailed who worked on which sections.
- For Part 1, I thought you did a great job with your detailed comparison of the 1955 and 2024 baseball datasets, and specifically how new technologies have transformed the way data is collected and represented in baseball. I appreciated how you noted that the 1955 dataset focused on traditional, team-centric statistics, offering a snapshot of performance that aligns with the era’s priorities. In contrast, your description of the 2024 dataset, enriched with Statcast-enabled advanced metrics like exit velocity and pitch spin rate, effectively showcased the shift to player-centric analysis and the growing focus on granular data. This distinction underscored the changing power dynamics and the evolution of data emphasis over time, which you communicated clearly. Well done on drawing attention to how these datasets reflect broader cultural trends within the sport!
- For Part 2, I thought you did a great job as well and was especially impressed with the diversity of tools used, such as ChatGPT, Meta, and Gemini, and how you evaluated their capabilities in analyzing the datasets. I thought your focus on the limitations was very strong, especially how though these AI tools effectively listed statistics, they lacked deeper contextual analysis or acknowledgment of data gaps in the 1955 dataset. 
- Well done on this assignment overall, and I would encourage you to consider incorporating some of this work into your final project data essay. While the git history only shows Nick, I will be giving full marks to all group members, but please correct me if I am wrong.

Status: Complete & Full Marks for Everyone (Though happy to update this if you can share more details about how you worked together on this assignment)

### Proprietary & Perspectival Dataset Creation Assignment

- It seems like the assignment is in your `proprietary-perspectival-dataset-creation` folder, but please correct me if I am wrong.
- Overall, I thought your choice of Baseballsavant.com made sense given your final project. The `proprietary_platform_exploration.md` was a bit thin, but did appreciate that you did cover the main relevant points in terms of accessing data. It doesn't seem like you covered the historical shift in the terms of use but maybe that is because the Wayback Machine was down? If that's the case, it would be helpful to note that in the assignment.
- The `data-reflection-documentation.md` is much more detailed, which is great! I also appreciate that you included not only the data but also a notebook exploring the data. I thought your documentation was very detailed and appreciated that you included information about the data types in the `Dataset Structure` section. I thought your decision to do a `Like Player` column for the subjective element of the dataset is an interesting approach, and I appreciated that you detailed your rationale, as well as some of the limitations of your approach to encoding subjectivity into the dataset.
- I particularly appreciated your thoughtful discussion of your data collection process, the platform dynamics and how that shaped this data, as well as the legal and ethical considerations of both reusing this data and licensing it. Overall fantastic work on this one!
- In the git history, I'm only seeing activity from Yosef and a little bit from Jason, so they will be receiving full marks. But please correct me if I am wrong. 

Status: Complete & Full Marks for Yosef and Jason (Though happy to update this if you can share more details about how you worked together on this assignment)

## Mid-Semester Dataset Update Feedback

### Dataset Feedback

I appreciate that you include two versions of your dataset, as well as a notebook with some preliminary analysis.

Overall, it seems like the data is well formatted, which makes sense since it seems like you just downloaded it from Baseball Savant? I'm curious why you didn't just use a package like [`pybaseball`](https://github.com/jldbc/pybaseball) to see if there were other datasets you might want to capture. But regardless, I did appreciate that you detailed your narrowing of available columns, as well as what each column represents.

I'm not convinced that the current data comes anywhere close to your stated interest in exploring: 

> Has it changed how people consume the sport? Hyper-analyzing?  Sports gambling has become a huge focus.  What meaning is the data collected now? What impact does it have?  

So, I'm wondering if there's additional data you have yet to collect that would get at those phenomena? In particular, I'm wondering how these player statistics data get at your interest in how baseball has been consumed and hyper-analyzed, and how it has impacted sports gambling. I'm also concerned that you aren't really verifying this Statcast data at all. I realize it might be difficult, but otherwise all you've done is download a dataset and do some basic analyses on it, so I'm not sure how that meets the requirements of the project. I do think there are several ways you could approach your stated interests, so I'm happy to set up a meeting to discuss if you would like. You might also consider digging into some of the available scholarship. For example, there's a Digital Humanities scholar, Katherine Walden, at Notre Dame who specializes in the history of baseball and digital humanities, so you might look at some of her work to see if it helps you clarify your project.

### Documentation Feedback

I thought your `README.md` was well organized, and I appreciated that it detailed a lot of the rationale for your choices. I do think some of your previous assignment work could be included in the `README.md` to help provide more context for your project. 

I do think you could also include a bit more detail on how you are using the Baseball Savant search download interface, and you might include a screenshot or two to help a reader understand how you are collecting data. I did appreciate that you included some of your search filters so that a reader could replicate your search in the interface.

I also appreciated that you included a list of who contributed to which aspects though it is concerning that not all group members are listed. So, it seems like at the very least that there's an imbalance in the division of labor, but I would like to hear from you all if that is the case.

As I mentioned earlier, while your dataset is documented and included in your repository, it does not fully align with the project’s core objectives. Specifically, it’s important to articulate what cultural phenomena you are trying to capture through your data and how this data helps illuminate that phenomenon. I encourage you to think more deeply about what you aim to reveal through your data and whether the data you’ve gathered is suitable for that purpose. Simply downloading data and using it without modification is not sufficient; you should have a strategy for verifying its accuracy and ensuring that it aligns with your project goals.
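To give a sense of what a verification strategy could look like in practice, here is a minimal sketch that checks downloaded values against a plausible range and against a second source. Everything in it is an assumption for illustration: the `player_id` and `batting_avg` columns, the values in both tables, the tolerance, and the `verify_against_reference` helper are hypothetical, and you would need to choose a real second source and the metrics that matter for your project.

```python
import pandas as pd

# Hypothetical values; neither table is real Statcast or reference data
downloaded_df = pd.DataFrame({
    "player_id": [1, 2, 3],
    "batting_avg": [0.310, 0.275, 1.250],  # 1.250 is an impossible value planted for the demo
})
reference_df = pd.DataFrame({
    "player_id": [1, 2, 3],
    "batting_avg": [0.310, 0.281, 0.250],
})


def verify_against_reference(downloaded, reference, column, tolerance=0.001, valid_range=(0.0, 1.0)):
    """Flag out-of-range values and values that disagree with a second source."""
    # Line up the two sources by player so the same statistic sits side by side
    merged = downloaded.merge(reference, on="player_id", suffixes=("_downloaded", "_reference"))
    low, high = valid_range
    # A value outside the plausible range is suspect regardless of the second source
    merged["out_of_range"] = ~merged[f"{column}_downloaded"].between(low, high)
    # A value that differs from the reference by more than the tolerance needs a manual look
    difference = (merged[f"{column}_downloaded"] - merged[f"{column}_reference"]).abs()
    merged["mismatch"] = difference > tolerance
    return merged


verification_report = verify_against_reference(downloaded_df, reference_df, "batting_avg")
print(verification_report)
```

Documenting which rows were flagged, and how you resolved each one, is the kind of curation work that would strengthen your dataset documentation.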

One of my main concerns is that your approach feels more like a data science exercise focused on statistical analysis. While analysis is important, the primary aim of this course is to create and curate data that helps us explore and understand cultural phenomena. Your focus should be on the data itself, its relevance, and its context, rather than just the analysis.

In your case, this might involve considering how to validate your Statcast data and how you plan to augment it, both of which are necessary when using existing data. Alternatively, it might be helpful to revisit your initial motivations. It’s clear that your group is interested in the significance of statistics in baseball, which is an excellent starting point. The question then becomes: how do these statistics wield such influence within baseball culture? Part of this exploration could include how these statistics shape the economics of the sport or affect our understanding, consumption, and participation in the game.

To capture these broader cultural dynamics, consider gathering additional data that reflects this impact. This could include social media data, economic data related to baseball (e.g., betting trends or ticket sales), or even fan engagement statistics. This shift in focus will help ensure that your project aligns with the goals of the course and fosters a deeper understanding of the intersection between data and cultural phenomena.
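As one possible way to combine your existing statistics with this kind of contextual data, here is a minimal sketch that joins a season-level table to an external table and reports which seasons still lack context. The column names (`season`, `avg_exit_velocity`, `avg_attendance`) and all of the values are hypothetical placeholders, not real figures.

```python
import pandas as pd

# Hypothetical placeholder values, not real statistics or attendance figures
season_stats_df = pd.DataFrame({
    "season": [2021, 2022, 2023],
    "avg_exit_velocity": [88.1, 88.4, 88.9],
})
ticket_sales_df = pd.DataFrame({
    "season": [2021, 2022],
    "avg_attendance": [18000, 26000],
})

# Keep every season from the statistics table and record whether a match was found
augmented_df = season_stats_df.merge(ticket_sales_df, on="season", how="left", indicator=True)

# Seasons marked "left_only" have statistics but no contextual data yet
missing_context = augmented_df.loc[augmented_df["_merge"] == "left_only", "season"].tolist()

print(augmented_df)
print("Seasons still missing contextual data:", missing_context)
```

A left join with `indicator=True` keeps gaps visible instead of silently dropping them, which makes it easier to document what your augmented dataset does and does not cover.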

### Progress from Initial Proposal Feedback

In the initial proposal, I had asked you to address the following questions in your project (paraphrased):

- [ ] What specific data sources will you use to collect the data on player performance, team statistics, and game outcomes? How do you plan to analyze this data to uncover trends and patterns? What patterns are meaningful to your topic?
- [ ] Given that you're working with existing compiled sports statistics, which can sometimes be prone to errors or inconsistencies, how will you ensure the quality of the data you're using? What steps will you take to verify the accuracy of the data?
- [ ] How will you get around any legal restrictions on data collection? Are there resources like APIs you could use?
- [ ] How will you equitably distribute tasks among the group? What happens if collecting and curating the data takes longer than anticipated? How will that impact the other roles? 

My current assessment is that you have completed some of these tasks (for example, accessing the baseball data) but there are some that remain to confront, as well as many new questions for you to address. In particular, I think you need to do some more work on refining what and how you collect this data. While I think you have a good foundation so far, I'm worried that you are not capturing the full scope of the project and that you might be missing some of the larger cultural dynamics that would shape this data.

I think rather than only trying to find correlations between player statistics, you might consider additional data sources that could help you to understand how these statistics are shaping the sport and its consumption. I realize we are already well along in the semester so starting new data collection might feel daunting, but I think if all members contribute that it is doable. Furthermore, I am happy to consult with you or help you brainstorm what might be doable in the remaining weeks.

Finally, I think you could incorporate more of your previous work into your dataset documentation, but you also need to consider the division of labor. Increasingly, it seems like only a few members are contributing to the project, so I would like you to collectively consider how you can right this imbalance and make sure that all members are contributing to the project. That might look like some members doing a bit more in the next few weeks, but I leave it up to you all to decide how to proceed.

I do think you are on the right track with your project, but I think addressing some of the concerns I have outlined above will help ensure you are on target to meet the project goals and requirements. I also think bringing in some of our course readings, as well as additional scholarship, will help flesh out and situate your project. I would encourage you to reach out to me if you have any questions or concerns about my feedback or how to proceed with the project.

### Final Grade

Your grade for this assignment is currently a B+. If you can show additional data collection that is relevant to your project **or** verify the data you have collected, then I would be happy to bump you up to an A-. Since only Yosef, Nick, and Ethan are listed as contributing to the update, and I don't see anything in the git history, they will currently be receiving the grade. For Jason, Sam, and Kohta, you are welcome to articulate how you contributed if you would like to receive a grade for this assignment. Otherwise, the only other option is for the three of you to plan to contribute more to the project moving forward and get the consent of your fellow group members, in which case I would be happy to share the grade among all group members. Again, though, if the current documentation is inaccurate, please let me know and I will adjust the grades accordingly.