# Instructor Feedback for Mid-Semester Dataset Update and Group Assignments to Date for Group 4

Please find below three sections of feedback regarding your mid-semester dataset update and group assignments. Overall, I'm glad to see progress on the project, and it seems like you are working together, which is great. I also think your choice of topic remains a rich one, and I'm excited to see how you realize your proposed project. However, I have some questions about the division of labor, as well as the overall scope and documentation of your project. As a reminder, your goal is to create something like the datasets we have been exploring on the *Responsible Datasets in Context Project*. While I do not expect what you produce to be as polished as what is available on the website, considering there are four of you in the group, I am hoping you can clarify some of your current choices and provide more detail on how you envision your final project achieving the stated project goals and requirements.

In terms of returning feedback to the instructor, you have two options: create a new GitHub issue responding to my questions and tag me in the issue, or complete the Google form, available via Canvas, where you can update your project plan and respond to my questions or offer suggested corrections to the stated assessments. You are also welcome to use the form to provide feedback that will not be shared with the rest of your group if you think that would be helpful or want to share something with me privately.

## Load Libraries and Datasets

To run this notebook, you will need to have `pandas`, `altair`, and `rich` installed. You can find instructions on how to do so on our course website.

In [1]:
import pandas as pd
import altair as alt
from rich.console import Console
from rich.table import Table

console = Console()

# Load the data
contributors_df = pd.read_csv("./datasets/contributors_group_4.csv")
commits_df = pd.read_csv("./datasets/commits_group_4.csv")
issues_df = pd.read_csv("./datasets/issues_group_4.csv")

## Overall Group Feedback

Overall, your group seems to be working well together, though I notice some patterns below that I have questions about. Your GitHub repository is well organized overall, but it still needs a `.gitignore` file. I would also recommend that you update your `README.md` to describe the overall project and the structure of your repository, and that you update the repository description so it is accurate for your project. I did appreciate that you have a license in your `proprietary-perspectival-dataset-creation` folder, but it will not show up for the entire repository unless you move it to the root of the repository.

I will say that while I respect and understand your current approach to dividing labor, it does make assessment somewhat difficult, since it seems that some group members have primarily been doing the coding and issue management (at least based on the data you will see below), whereas others are doing more of the research and data entry intensive labor. If you could please briefly confirm on the Google form on Canvas that the division of labor is relatively equitable, then I'm happy to share the grades between all members of the group. But I would like to get that confirmation since it is not clear to me from the data I have available.

In the following graphs, you will see some of how your group has been working together from what I can tell via GitHub. I do not think this data represents all of your group's work or activities, so I would encourage you to both think about how to document work in more detail and also to contact the instructors if you feel this data is not representative of your group's work and would like other aspects to be included in your assessment.

### Contributions Per Group Member

In [2]:
# Create a table for the contributions
contributors_table = Table(title="Contributors")
# Add columns with the contributors and the number of contributions
contributors_table.add_column("Contributor", style="cyan")
contributors_table.add_column("Number of Contributions", style="magenta")
# Sort the contributors by the number of contributions, with the highest first
contributors_df = contributors_df.sort_values(by="contributions", ascending=False)
# Loop through the contributors and add them to the table
for index, row in contributors_df.iterrows():
	# Add the contributor to the table and set the contributions to be a string
	contributors_table.add_row(row["login"], str(row["contributions"]))
# Print the table
console.print(contributors_table)

Above is the total number of contributions (that is, commits) per group member. I see that there are not many commits currently, which makes sense since most of your group activities so far have been more research intensive, and that almost every group member has committed, which is great. Ideally, I would like to see all members of the group committing to the repository moving forward if possible, but I am happy to respect your division of labor if that is something you have all decided on.

### Commits Over Time

In [3]:
# Melt the commits dataframe from wide to long format so that we can have all commit type activities in one column called commit_metric
melted_commits_df = commits_df.melt(id_vars=['oid', 'message', 'committedDate', 'login', ], value_vars=['additions', 'deletions', 'changedFiles'], var_name='commit_metric', value_name='commit_metric_value')
# Convert the commit date to a datetime object
melted_commits_df['commit_date'] = pd.to_datetime(melted_commits_df['committedDate'], errors='coerce')
# Get the unique commit types to create charts for each type
commit_types = melted_commits_df['commit_metric'].unique().tolist()

# Create a list to store the charts
charts = []
# Loop through the commit types and create a chart for each type
for commit_type in commit_types:
	# Create an interactive selection for each chart where you can select the login to highlight each group member's contributions
	selection = alt.selection_point(fields=['login'], bind='legend')
	
	# Filter the DataFrame for the current commit type and subset to only the columns we need to keep the chart smaller in size
	filtered_df = melted_commits_df[melted_commits_df['commit_metric'] == commit_type][['commit_date', 'commit_metric_value', 'login', 'message']]
	
	# Create a bar chart for the current commit type
	chart = alt.Chart(filtered_df).mark_bar().encode(
		x='commit_date:T', # Use the commit date as the x-axis
		y='commit_metric_value:Q', # Use the commit metric value as the y-axis
		color='login:N', # Color the bars by the login
		opacity=alt.condition(selection, alt.value(1), alt.value(0.1)), # Set the opacity to 1 if the login is selected and 0.1 if not
		tooltip=['commit_date', 'commit_metric_value', 'login', 'message'] # Show the commit date, commit metric value, login, and message in the tooltip
	).add_params(selection).properties(
		title=f"Commit {commit_type} Over Time"
	)
	# Add the chart to the list of charts
	charts.append(chart)
# Combine all the charts into one chart and set the y-axis to be independent so that we can see all the changes even if the y axis scale is different for each commit type activity
alt.hconcat(*charts).resolve_scale(y='independent')

When we look at the distribution of commit activities (additions, deletions, and number of files changed), it looks like, even though there have been few commits, Alan and Haydn have primarily been making changes to the files in the repository, though Alex and Ahmad have contributed a bit as well. I would like to see more of a balance in the future, but I understand that this may be due to the division of labor you have all decided on.
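For a numeric view alongside the charts, the same commit metrics can be totalled per contributor. The following is a minimal sketch that assumes the `login`, `additions`, `deletions`, and `changedFiles` columns used in the cell above; the logins and values are placeholders, not the group's actual commits.

```python
import pandas as pd

# Placeholder rows that mimic the columns of commits_df; logins and values are hypothetical.
example_commits_df = pd.DataFrame({
    "login": ["member_a", "member_a", "member_b", "member_c"],
    "additions": [120, 30, 45, 5],
    "deletions": [10, 0, 15, 0],
    "changedFiles": [3, 1, 2, 1],
})

# Count commits and total each metric per contributor.
commit_summary = (
    example_commits_df.groupby("login")
    .agg(
        commits=("additions", "size"),
        additions=("additions", "sum"),
        deletions=("deletions", "sum"),
        changed_files=("changedFiles", "sum"),
    )
    .sort_values("additions", ascending=False)
)

# Share of all added lines attributable to each contributor.
commit_summary["additions_share"] = commit_summary["additions"] / commit_summary["additions"].sum()
print(commit_summary)
```

Swapping `example_commits_df` for the loaded `commits_df` should give the same kind of table for the real repository, with the caveat that line counts only capture work that reached GitHub.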

### Issues and Project Management Over Time

In [4]:
# Once again melt the issues dataframe from wide to long format so that we can have all the issue dates in one column called issue_date
melted_issues_df = issues_df.melt(id_vars=['user.login', 'title', 'state', 'body', 'html_url', 'assignee', 'assignees_logins', 'labels', 'milestone', 'draft', 'comments', 'state_reason', 'closed_by.login', 'reactions.total_count', 'issue_duration', 'issue_associated_with_pull_request', 'pull_request.url', 'issue_status_on_project_board'], value_vars=['created_at', 'updated_at', 'closed_at'], var_name='issue_date_type', value_name='issue_date')

# Rename columns because Altair does not let us use '.' in the column names
melted_issues_df = melted_issues_df.rename(columns={
    'user.login': 'user_login', 
    'closed_by.login': 'closed_by_login'
})

# Sort the DataFrame by issue_date
melted_issues_df = melted_issues_df.sort_values(by='issue_date')

# Get the unique issue titles
issue_titles = melted_issues_df['title'].unique().tolist()

# Initialize an empty list to store the charts
charts = []
# Loop through each issue title to create a chart for the issue
for title in issue_titles:
    # Create a selection so that we can highlight the contributions of each group member
    selection = alt.selection_point(fields=['user_login'], bind='legend')
    
    # Filter the DataFrame for the current issue title
    subset_data = melted_issues_df[melted_issues_df['title'] == title]
    
    # Initialize a list to store subtitles for the chart
    subtitle = []
    
    # Check if the issue is associated with a pull request
    has_pr = subset_data[subset_data['issue_associated_with_pull_request'] == True]
    if not has_pr.empty:
        # If the issue is associated with a pull request, get the URL of the pull request and add it to the subtitle
        pr_url = subset_data['pull_request.url'].unique()[0]
        subtitle.append(f"Issue associated with a pull request ({pr_url}).")
    
    # Check if the issue is associated with a project board
    has_project_board = subset_data[subset_data['issue_status_on_project_board'].notna()]
    if not has_project_board.empty:
        # If the issue is associated with a project board, get the status of the issue on the project board and add it to the subtitle
        board_status = subset_data['issue_status_on_project_board'].unique()[0]
        subtitle.append(f"Issue is associated with a project board and has status {board_status}.")
    
    # If no subtitles were added, add a default message
    if not subtitle:
        subtitle = "Issue is not associated with a pull request or a project board."
    
    # Create a line chart for the current issue title
    chart = alt.Chart(subset_data).mark_line(point=True).encode(
        x='issue_date:T', # Use the issue date as the x-axis
        y=alt.Y('title', axis=alt.Axis(title=None, labels=False)), # Use the title as the y-axis and don't show the axis labels or title since we have the title in the chart title
        color='user_login:N', # Color the lines by the user login
        tooltip=[
            'user_login', 'closed_by_login', 'yearmonthdatehoursminutes(issue_date)', 'issue_date_type', 
            'title', 'body', 'state', 'assignee', 'assignees_logins', 'html_url'
        ], # Show the user login, closed by login, issue date, issue date type, title, body, state, assignee, assignees logins, and HTML URL in the tooltip
        opacity=alt.condition(selection, alt.value(1), alt.value(0.1)) # Set the opacity to 1 if the user login is selected and 0.1 if not
    ).add_params(selection).properties(
        # Set the title of the chart to be the issue title and the subtitle to be the subtitles we created
        title=alt.Title(f"Issue: {title}", subtitle=subtitle)
    )
    
    # Append the chart to the list of charts
    charts.append(chart)

# Concatenate all the charts vertically and resolve the x-scale to be shared
alt.vconcat(*charts).resolve_scale(x='shared')

Looking at your issues and project board, it looks like you have been consistently using issues for project management, which is great! I did notice, though, that Alan doesn't have any issues, so I am wondering whether that's something you all decided on or a sign that you could divide your labor more thoughtfully.

I also noticed that many of the issues are being closed immediately after being opened so there might be room to leverage issues more as you complete the project. I realize using a new platform and interface for project management can be difficult, but I would encourage you to think about how you can use these tools to not only help you manage your project and communicate with each other, but also document your work for the final project.
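To make "closed immediately" concrete, one option is to compute how long each issue stayed open. The sketch below assumes the `title`, `created_at`, and `closed_at` columns from the issues data; the rows are placeholders and the one-hour threshold is an arbitrary choice for illustration.

```python
import pandas as pd

# Placeholder issues with the created_at / closed_at columns from issues_df; titles and times are hypothetical.
example_issues_df = pd.DataFrame({
    "title": ["Collect tweets", "Write README", "Clean dataset"],
    "created_at": ["2024-10-01T14:00:00Z", "2024-10-02T09:30:00Z", "2024-10-03T11:00:00Z"],
    "closed_at": ["2024-10-01T14:02:00Z", "2024-10-09T16:00:00Z", None],
})

# Parse the timestamps; issues that are still open have no closed_at and become NaT.
for column in ["created_at", "closed_at"]:
    example_issues_df[column] = pd.to_datetime(example_issues_df[column], utc=True, errors="coerce")

# Time each issue stayed open.
example_issues_df["open_duration"] = example_issues_df["closed_at"] - example_issues_df["created_at"]

# Flag issues closed within an hour of being opened (the threshold is an assumption).
example_issues_df["closed_quickly"] = example_issues_df["open_duration"] < pd.Timedelta(hours=1)
print(example_issues_df[["title", "open_duration", "closed_quickly"]])
```

Issues flagged this way are not necessarily a problem, but a long run of them suggests the issue was opened to record finished work rather than to plan it.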

## Group Assignments to Date Feedback

The following feedback is for the three group assignments you have completed so far. Since there isn't clear documentation on who completed what activity, I am currently using the git history to assess contributions, but I am happy to adjust this assessment if you provide me with more information.

### Mass Digitization & Digital Libraries Assignment

- I could not identify any file that represented this assignment in your repository. If you add it to the repository, I will be happy to update your grade.

Status: Incomplete

### Critical Cultural Data Explorations Assignment

- I believe that this assignment is the `README.md` file in the `Critical_Cultural_Data_Explorations` folder, but again please correct me if I am wrong.
- Overall, I thought you did great on this assignment! I appreciate the depth of analysis your group provided when comparing the historical versus modern datasets. You did an excellent job showcasing how the availability and scope of statistics have evolved over time, particularly noting the absence of certain metrics like 3PT% in the older data due to technological and rule limitations. Your insights into data transparency and how modern data collection practices have improved were well-articulated. You might consider looking at the work of Katherine Walden who is a Digital Humanist at Notre Dame and specializes in the history of baseball.
- I also thought your reflection on the AI analysis was good! I think your assessment about AI’s bias toward modern datasets makes sense, and I appreciated your point about how it might reinforce dominant narratives about player capabilities and performance. Your observation of the AI’s limitations in understanding historical context and nuances, such as rule changes, was also spot on!
- Well done on this assignment overall, and while your focus has shifted for the final project, you might consider this historical angle and some of these insights for your final project submission. Currently the git history just shows Ahmad adding the file, and you did not include names of who contributed to the assignment in the file itself. So currently Ahmad will get full marks, but I am happy to share them with all group members if you confirm the division of labor is equitable.

Status: Complete & Full Marks for Ahmad (Though happy to update this if you can share more details about how you worked together on this assignment)

### Proprietary & Perspectival Dataset Creation Assignment

- It seems like the assignment is in your `proprietary-perspectival-dataset-creation` folder, but please correct me if I am wrong.
- Overall, I thought your choice of Instagram made sense for the assignment. You did a good job investigating the terms of service and access limitations, especially considering that the Wayback Machine was down. I particularly appreciated your inclusion of the information about the API, as well as the comparison to TikTok’s requirements of logging in to access content. 
- I'm not sure I quite understood your ethics and legality section, though I did appreciate your choice and implementation of license. You write:

>There aren't really any legal or ethical considerations with the creation of this dataset. All these metrics that we chose to collect can be hidden by the creator which would make it so the users are unable to view it. The creator also has an option to make an account private which means that no one but people they approve can view the content they are creating. 

But I'm not quite sure what you mean by "can be hidden by the creator" or that the "creator also has an option to make an account private", as well as how that relates to the ethics and legality of the dataset. I think you might consider expanding on this section to make it a bit clearer what you mean and how it relates to the dataset you created.
- In terms of the dataset, I thought it was interesting you decided to store it in a Python script and specifically a rich table. On one hand I'm glad you find rich such a useful library, but on the other, this isn't really a helpful format for sharing data with others. I would recommend you consider storing the data in a CSV or JSON file in the future, as this will make it easier for others to access and use the data you have collected. As for the data itself, as we discussed in class, there are some issues with how you determined both `genre` and `rating`, in the sense that both are subjective (which is great!) but there doesn't seem to be any effort to make them coherent or consistent. Just inputting random values for these fields isn't very compelling.
- Overall, I would say this is currently partial marks for this assignment. I would say that if you updated your Markdown file to include a bit more about how you determined genre and rating, then I'm happy to give full marks. Based on the git history, I'm only seeing Alan contributing to this assignment, so he will get partial marks for this assignment but if you can provide more information on how you worked together on this assignment, I'm happy to update the grade.
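As an illustration of the CSV recommendation above, here is a minimal sketch of saving records with `pandas` instead of hard-coding them in a rich table. Only the `genre` and `rating` field names come from your assignment; the `post_url` column, the example URLs, and the values are placeholders.

```python
import io
import pandas as pd

# Hypothetical records; only the `genre` and `rating` field names come from the assignment.
records = [
    {"post_url": "https://example.com/post/1", "genre": "comedy", "rating": 4},
    {"post_url": "https://example.com/post/2", "genre": "sports", "rating": 2},
]
dataset_df = pd.DataFrame(records)

# Write to CSV (an in-memory buffer here; pass a file path such as "dataset.csv" in practice).
buffer = io.StringIO()
dataset_df.to_csv(buffer, index=False)

# Anyone can read the CSV back without running the script that created it.
buffer.seek(0)
reloaded_df = pd.read_csv(buffer)
print(reloaded_df)
```

The same DataFrame can still be rendered as a rich table for display; the point is that the CSV, not the script, becomes the shareable dataset.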

Status: Incomplete & Partial Marks for Alan (Though happy to update this if you can share more details as requested above)

## Mid-Semester Dataset Update Feedback

### Dataset Feedback

I appreciate that you included a Google Sheets link and an Excel file, though I would highly recommend using a `.csv` or some other more interoperable format in the future.

The formatting of the dataset is impressive, though I'm a bit concerned that it's more akin to what you would do for a PowerPoint presentation than for working with data. For example, if we try to read in your dataset with Pandas, what will happen? Well, let's try it!

In [5]:
project_dataset_excel_df = pd.read_excel("./Semester Long Project Dataset.xlsx")
project_dataset_excel_df.head(10)

| | Season | Team | Player | # of tweets | Unnamed: 4 | Team.1 | Total # of tweets | Season Ranking | Unnamed: 8 | Unnamed: 9 | Unnamed: 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-2022 | Boston | Jayson Tatum | 41 | | Suns | 337.0 | 1.0 | | Start Date | 2021-10-19 |
| 1 | | | Jaylen Brown | 40 | | Grizzlies | 319.0 | 2.0 | | End Date | 2022-04-10 |
| 2 | | | Al Horford | 20 | | Warriors | 77.0 | 3.0 | | | NaT |
| 3 | | | Marcus Smart | 20 | | Heat | 53.0 | 4.0 | | | NaT |
| 4 | | | Derrick White | 1 | | Boston | 122.0 | 5.0 | | | NaT |
| 5 | | | TOTAL: | 122 | | Pacers | 94.0 | 26.0 | | | NaT |
| 6 | | Warriors | Steph Curry | 40 | | Thunder | 75.0 | 27.0 | | | NaT |
| 7 | | | Draymond Green | 20 | | Pistons | 20.0 | 28.0 | | | NaT |
| 8 | | | Klay Thompson | 10 | | Rockets | 52.0 | 29.0 | | | NaT |
| 9 | | | Andrew Wiggins | 5 | | Magic | 54.0 | 30.0 | | | NaT |


As you can start to see, what you've created in Excel, though visually impressive, is very difficult to work with as data. Indeed, we've lost all the color formatting, trying to use the `TOTAL` value is difficult since it is nested in the `Player` column, and you have a number of `Unnamed` columns that are not helpful for analysis. I understand that creating data from scratch is difficult and likely a new activity, but part of that work is thinking about how you'll use this data. At the moment I'm very concerned that what you have here is pretty unusable for any kind of analysis, whether by you or by others. Not to be repetitive but I would highly recommend returning to the *Responsible Datasets in Context Project* to see how they structure their data and consider how you might structure your data in a similar way. I would at the very least **strongly recommend** that you separate out aggregate analysis from individual player analysis (you could have multiple `csv` files for example) and that you consider dataset curation and creation as a separate activity from analysis (it seems you might be conflating the two here).
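To show what a more usable structure might look like, here is a minimal sketch that tidies the player-level columns from the first ten rows of the output above. The lowercase column names are suggestions rather than requirements, and the Warriors total only reflects the four players visible in those ten rows.

```python
import pandas as pd

# The player-level columns from the first ten rows of the output above.
raw_players_df = pd.DataFrame({
    "Season": ["2021-2022"] + [None] * 9,
    "Team": ["Boston", None, None, None, None, None, "Warriors", None, None, None],
    "Player": ["Jayson Tatum", "Jaylen Brown", "Al Horford", "Marcus Smart", "Derrick White",
               "TOTAL:", "Steph Curry", "Draymond Green", "Klay Thompson", "Andrew Wiggins"],
    "# of tweets": [41, 40, 20, 20, 1, 122, 40, 20, 10, 5],
})

# Fill the blanks left by merged Excel cells so every row names its season and team.
tidy_players_df = raw_players_df.copy()
tidy_players_df[["Season", "Team"]] = tidy_players_df[["Season", "Team"]].ffill()

# Drop the embedded TOTAL rows; totals can be recomputed from the player rows.
tidy_players_df = tidy_players_df[tidy_players_df["Player"] != "TOTAL:"]

# Use code-friendly column names (these names are suggestions).
tidy_players_df = tidy_players_df.rename(columns={
    "Season": "season", "Team": "team", "Player": "player", "# of tweets": "tweet_count",
}).reset_index(drop=True)

# Recompute team totals as a separate aggregate table.
team_totals_df = tidy_players_df.groupby(["season", "team"], as_index=False)["tweet_count"].sum()
print(tidy_players_df)
print(team_totals_df)
```

Each table could then be saved with `to_csv(..., index=False)`, keeping one file for player-level rows and another for team-level summaries such as season ranking.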

In terms of the data itself, I like the idea of exploring NBA players' social media habits during their regular season, and thought using X was a great choice, though I am curious whether you considered other platforms. For example, I would imagine many players would be more likely to post on Instagram or even TikTok, but I do think X is a great option. I thought your choice to focus on a narrow date range and only the top 5 most successful teams made sense, as well as your choice of using basketballreference.com to verify players' identities and starting lineups. However, I have a number of questions in the next section over whether this is a sufficient amount of data or why this data is even interesting to collect.

What I am most concerned about, besides your choice of formatting, is that you didn't include links to the actual posts on X, so there's no way to verify your data. Such a choice seems strange considering our class discussions about responsible dataset creation and preserving the context of the data. Without the links, there is no way to get additional metadata, whether that's about the timing of posts or the content of the posts themselves. Essentially, you've created a dataset that is not only difficult to work with but also difficult to verify or really use to even meaningfully answer the questions you've posed. For example, you write:

>found the top 5 teams tweeted 1,169 compared to the bottom 5 tweeting 295 times. This is a massive 4x difference in social media frequency.

While I agree that this difference might be attributed to the higher profile of the teams, I think it would be very helpful to have a sense of frequency of tweets. For example, are high profile players actually mostly retweeting or are they creating original content? Are they tweeting during games or after games? Are they tweeting about basketball or other topics? Without the actual data, it's hard to know what you're actually measuring here. Also just because you see a correlation in the aggregate doesn't mean it's accurate, since it seems like some of the top teams have players that tweet very infrequently. 
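As a quick check on both points, the sketch below recomputes the ratio from the totals you quote and then looks at how Boston's tweets are spread across players, using the counts from the spreadsheet output shown earlier.

```python
# Totals quoted from the documentation.
top_five_total = 1169
bottom_five_total = 295
ratio = top_five_total / bottom_five_total
print(f"Top-to-bottom ratio: {ratio:.2f}")

# Boston's player counts from the spreadsheet output shown earlier.
boston_counts = {
    "Jayson Tatum": 41,
    "Jaylen Brown": 40,
    "Al Horford": 20,
    "Marcus Smart": 20,
    "Derrick White": 1,
}
boston_total = sum(boston_counts.values())

# Share of the team's tweets contributed by each player.
boston_shares = {player: count / boston_total for player, count in boston_counts.items()}
for player, share in boston_shares.items():
    print(f"{player}: {share:.1%}")
```

The ratio comes out just under 4, and two players account for roughly two thirds of Boston's 122 tweets while one starter contributes under 1%, which is why a team-level total can hide very different individual habits.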

Ultimately, I appreciate that you are creating data from scratch and I think you have an interesting direction here but need to really think about how you're collecting and structuring your data so that it is in line with the goals for the final project.
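One possible way to structure the data so that it stays verifiable is to store one row per post and derive counts from those rows. The sketch below is only an illustration: the column names are suggestions, and the players, teams, URLs, and timestamps are placeholders rather than collected data.

```python
import pandas as pd

# Hypothetical post-level rows; names, URLs, and values are placeholders, not collected data.
posts_df = pd.DataFrame({
    "player": ["Player A", "Player A", "Player B"],
    "team": ["Team X", "Team X", "Team Y"],
    "post_url": [
        "https://example.com/status/1",
        "https://example.com/status/2",
        "https://example.com/status/3",
    ],
    "posted_at": ["2021-10-20T01:15:00Z", "2021-10-21T18:40:00Z", "2021-10-20T02:05:00Z"],
    "is_repost": [False, True, False],
})
posts_df["posted_at"] = pd.to_datetime(posts_df["posted_at"], utc=True)

# Counts like "# of tweets" become a derived summary rather than the stored data.
post_counts_df = (
    posts_df.groupby(["team", "player"], as_index=False)
    .agg(tweet_count=("post_url", "count"), repost_count=("is_repost", "sum"))
)
print(post_counts_df)
```

With post-level rows, questions about timing or reposts versus original posts become simple filters, and every count can be traced back to a link.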

### Documentation Feedback

I appreciated that you created both a `README.md` with your documentation, and a `pdf` version of the document. You do have some `Zone.Identifier` files that I would recommend adding to your `.gitignore` file (you can do this by adding `*.Identifier` to the file). These files tend to be created by Windows Subsystem for Linux and can be safely ignored. I would also recommend adding some actual headers, instead of just spaces, to your `README.md` file to make it easier to read.
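If you want to confirm which files the suggested `.gitignore` pattern would catch, a rough check is possible in Python. This sketch uses `fnmatchcase` from the standard library, which only approximates Git's own matching rules, and the file names are hypothetical.

```python
from fnmatch import fnmatchcase

# The ignore pattern suggested above.
pattern = "*.Identifier"

# Hypothetical file names; the first two mimic Zone.Identifier sidecar files.
file_names = ["report.pdf:Zone.Identifier", "data.xlsx:Zone.Identifier", "README.md", "report.pdf"]

# fnmatchcase approximates Git's glob matching for a simple suffix pattern like this one.
ignored = [name for name in file_names if fnmatchcase(name, pattern)]
print(ignored)
```

Running `git status` after updating `.gitignore` remains the authoritative check.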

I thought you did a good job detailing your data collection process, though I would encourage you to include actual screenshots of the process in your final data essay to help a reader walk through both how X's advanced search worked and how you were extracting data from the tweets. You might consider a figure similar to this example from the *Shakespeare and Company Project* [https://shakespeareandco.princeton.edu/sources/cards/](https://shakespeareandco.princeton.edu/sources/cards/) to make it clear how you were collecting data. 

I also appreciated that you included a lot of your rationale for what data you collected in your `Content Description` section, as well as how you divided the labor in your `Responsibility and Contributions` section. I do think you could again be more detailed in that section, since it is hard to tell based on the git history how you all worked together (though you are welcome to share the full Google Sheets with me to see that version history if you think that would be helpful).

I thought your `Our Findings` section was interesting, but I'm very concerned that you're still in the mindset of thinking of this project as a data science or analytics one. Specifically, you seem to think I want you to find some correlation between variables, but I'm more interested in how you are using data to think critically about cultural phenomena. You seem to be very interested in the relationship between an NBA team's success and the social media activity of its players, but I'm curious what confounds you think might shape this dynamic and whether it's really prestige or simply an individual's social media predilections that would influence this. If you think it is prestige, then why is it that some top performing players don't post very often? Also, why would NBA prestige matter on social media? I think what you are maybe trying to get at is how various NBA players parlay their success on the court into social media success, but I think you need to think more deeply about what can drive social media activity. For example, you might think about trying to capture overall followers and following, as well as how much engagement a player gets on their posts. You might also consider how the NBA itself uses social media to promote its players and how that might influence players' social media activity.

Alternatively, if you are just interested in social media activity without a relationship to prestige, then I wonder if it would be more interesting to look at the age of players and the evolution of their social media habits. For example, you might consider how players who grew up with social media use it differently than players who didn't. You might also consider how players from different countries use social media differently. Finally, you might also think about how X has changed over time and how that might influence players' social media habits. For example, many users have left X since Elon Musk bought the platform, so how would that influence players' social media habits?

I think you have a lot of interesting directions you could take this project, but I would encourage you to think more deeply about what you are trying to capture with your data and how you can use it to answer those questions. While I appreciate you doing manual data collection, you might consider using some of the existing libraries for capturing Twitter data, like `twarc` [https://twarc-project.readthedocs.io/en/latest/](https://twarc-project.readthedocs.io/en/latest/) or `tweepy` [https://github.com/tweepy/tweepy](https://github.com/tweepy/tweepy), to make your data collection more efficient and reproducible. Melanie Walsh has a helpful chapter on accessing Twitter Data in her *Introduction to Cultural Analytics* [https://melaniewalsh.github.io/Intro-Cultural-Analytics/04-Data-Collection/11-Twitter-API-Setup.html](https://melaniewalsh.github.io/Intro-Cultural-Analytics/04-Data-Collection/11-Twitter-API-Setup.html) and a helpful blog post on using social media data [https://melaniewalsh.org/blog/2023/social-media-research/](https://melaniewalsh.org/blog/2023/social-media-research/). There's also a rich literature of scholars both in Digital Humanities and especially in Computational Social Science examining X data, so I would be happy to provide links to some relevant materials if that would be helpful.

Overall, I think your documentation is a good start, but you need to do more work thinking about what cultural phenomena you are trying to capture and how data can help us understand those dynamics, as well as how to ensure that the data you are collecting is in line with the goals of the project.

### Progress from Initial Proposal Feedback

In the initial proposal, I had asked you to address the following questions in your project (paraphrased):

- [ ] How are you planning to look at the original historical archival sources for basketball data, as well as how are you planning to access Synergy Sports Technology data?
- [ ] How will you get around the terms of service for sports-reference.com and how does your mention of virality relate to your project?
- [ ] How does your mention of wanting to focus on "average height of an NBA team to win rate of that team in one year" engage with the project requirements of either creating new data or augmenting existing data? 
- [ ] How will you equitably distribute tasks among the group? What happens if collecting and curating the data takes longer than anticipated? How will that impact the other roles? 

My current assessment is that some of these concerns are no longer relevant (you seem less focused on accessing statistics now) but there are many new questions for you to address. In particular, I think you need to do some more work on refining what and how you collect this data. While I think narrowing your focus was a good call, I think you actually need to expand to think more critically about what the data actually represents and why it might be interesting to scholars.

I think rather than only trying to find correlations between generic social media activity and NBA players, you might consider several possible directions, including reframing around either how the NBA and players use social media to promote themselves or how social media has changed over time and how that might influence players' social media habits.

You obviously can't capture everything in data (or in the remainder of the semester), but I encourage you to try and critically consider the larger cultural influences that would shape these dynamics, whether that's how fame and prestige have changed over time for NBA players or how social media platforms have presented both opportunities and challenges for NBA players. Remember, the goal is to create a dataset that is both responsible and contextual, so you want to make sure you are thinking about how to capture the context of the data you are collecting and how it might be used to answer larger questions about the NBA and social media.

Finally, I think you need to do more work on refining your documentation and how you might more accurately document your group's collaboration, though again happy to respect your listed responsibilities if those are confirmed to represent equitable division of labor by all group members.

I do think you are on the right track with your project, but I think addressing some of the concerns I have outlined above will help you make sure you are on target to meet the project goals and requirements. I also think bringing in some of our course readings and your work on previous group assignments will help flesh out and situate your project. I would encourage you to reach out to me if you have any questions or concerns about my feedback or how to proceed with the project.

### Final Grade

Your grade for this assignment is currently a B. If you can restructure your data for it to be usable programmatically and provide more documentation on which group members contributed to what aspects of the project (currently I only see Haydn in the git history) then I would be happy to bump you up to a B+/A-. Overall, if you can also address some of the concerns I have outlined above, you will be in a good position to complete the project successfully. Also, I'm currently planning to share this grade between all group members, but please let me know if that is not accurate or appropriate.