# Instructor Feedback for Mid-Semester Dataset Update and Group Assignments to Date for Group 3

Please find below three sections of feedback regarding your mid-semester dataset update and group assignments. Overall, I'm glad to see progress on the project and it seems like you are working together, which is great. I also think your choice of topic remains a rich one and I'm excited to see how you're working together to realize your proposed project. However, I have some questions about the division of labor, as well as the overall scope and documentation of your project. As a reminder, your goal is to try and create something like the datasets we have been exploring on the *Responsible Datasets in Context Project*. While I do not expect what you produce to be as polished as what is available on the website, considering there are four of you in the group, I am hoping you can clarify some of your current choices and provide more detail on how you are envisioning your final project and how you will achieve the stated project goals and requirements.

In terms of returning feedback to the instructor, you have two options: create a new GitHub issue responding to my questions and tag me in the issue, or complete the Google form, available via Canvas, where you can update your project plan and respond to my questions or offer suggested corrections to the stated assessments. You are also welcome to use the form to provide feedback that will not be shared with the rest of your group if you think that would be helpful or want to share something with me privately.

## Load Libraries and Datasets

To run this notebook, you will need to have `pandas`, `altair`, and `rich` installed. You can find instructions on how to do so on our course website.

In [1]:
import pandas as pd
import altair as alt
from rich.console import Console
from rich.table import Table

console = Console()

# Load the data
contributors_df = pd.read_csv("./datasets/contributors_group_3.csv")
commits_df = pd.read_csv("./datasets/commits_group_3.csv")
issues_df = pd.read_csv("./datasets/issues_group_3.csv")

## Overall Group Feedback

Overall, your group seems to be working well together, though I notice some patterns below that I have questions about. Your GitHub repository still needs a `.gitignore` file and I noticed there is a `.DS_Store` file as well, so it would be good to remove it. From previous feedback, I had suggested moving your project proposal from your `README.md` into a separate document and making the main `README.md` more focused on explaining the overall project and directory structure. You also have a bit of a confusing organization currently. For example, I couldn't tell what assignment the `data analysis` folder and notebook `part1.ipynb` were relevant for, so adding clearer documentation will help others navigate your repository. I did appreciate that you have a license and would recommend you also take a moment to update the description on the repository to be more reflective of your project. Finally, I noticed you hadn't merged either my previous feedback pull request or one of Charles' pull requests, so I would recommend you do that as well. I'm happy to help, since leaving pull requests open for too long can cause some git issues.

I will say that while I respect and understand your current approach to dividing labor, it does make it somewhat difficult for me to assess since it seems like primarily some group members have been doing the coding and issue management (at least based on the data you will see below), whereas others are doing more of the research and data entry intensive labor. If you could please just briefly confirm that the division of labor is relatively equitable on the Google form on Canvas, then I'm happy to share the grades between all members of the group. But I would like to get that confirmation since it is not clear to me from the data that I have available.

In the following graphs, you will see some of how your group has been working together from what I can tell via GitHub. I do not think this data represents all of your group's work or activities, so I would encourage you to both think about how to document work in more detail and also to contact the instructors if you feel this data is not representative of your group's work and would like other aspects to be included in your assessment.

### Contributions Per Group Member

In [2]:
# Create a table for the contributions
contributors_table = Table(title="Contributors")
# Add columns with the contributors and the number of contributions
contributors_table.add_column("Contributor", style="cyan")
contributors_table.add_column("Number of Contributions", style="magenta")
# Sort the contributors by the number of contributions, with the highest first
contributors_df = contributors_df.sort_values(by="contributions", ascending=False)
# Loop through the contributors and add them to the table
for index, row in contributors_df.iterrows():
	# Add the contributor to the table and set the contributions to be a string
	contributors_table.add_row(row["login"], str(row["contributions"]))
# Print the table
console.print(contributors_table)

Above is the total number of contributions (that is, commits) per group member. I see that there are not many commits currently, which makes sense since most of your group activities so far have been more research intensive, and also that almost every group member has committed, which is great. Ideally, I would like to see all members of the group committing to the repository moving forward if it is possible, but happy to respect your division of labor if that is something you have all decided on.

### Commits Over Time

In [3]:
# Melt the commits dataframe from wide to long format so that we can have all commit type activities in one column called commit_metric
melted_commits_df = commits_df.melt(id_vars=['oid', 'message', 'committedDate', 'login', ], value_vars=['additions', 'deletions', 'changedFiles'], var_name='commit_metric', value_name='commit_metric_value')
# Convert the commit date to a datetime object
melted_commits_df['commit_date'] = pd.to_datetime(melted_commits_df['committedDate'], errors='coerce')
# Get the unique commit types to create charts for each type
commit_types = melted_commits_df['commit_metric'].unique().tolist()

# Create a list to store the charts
charts = []
# Loop through the commit types and create a chart for each type
for commit_type in commit_types:
	# Create an interactive selection for each chart where you can select the login to highlight each group member's contributions
	selection = alt.selection_point(fields=['login'], bind='legend')
	
	# Filter the DataFrame for the current commit type and subset to only the columns we need to keep the chart smaller in size
	filtered_df = melted_commits_df[melted_commits_df['commit_metric'] == commit_type][['commit_date', 'commit_metric_value', 'login', 'message']]
	
	# Create a bar chart for the current commit type
	chart = alt.Chart(filtered_df).mark_bar().encode(
		x='commit_date:T', # Use the commit date as the x-axis
		y='commit_metric_value:Q', # Use the commit metric value as the y-axis
		color='login:N', # Color the bars by the login
		opacity=alt.condition(selection, alt.value(1), alt.value(0.1)), # Set the opacity to 1 if the login is selected and 0.1 if not
		tooltip=['commit_date', 'commit_metric_value', 'login', 'message'] # Show the commit date, commit metric value, login, and message in the tooltip
	).add_params(selection).properties(
		title=f"Commit {commit_type} Over Time"
	)
	# Add the chart to the list of charts
	charts.append(chart)
# Combine all the charts into one chart and set the y-axis to be independent so that we can see all the changes even if the y axis scale is different for each commit type activity
alt.hconcat(*charts).resolve_scale(y='independent')

When we look at the distribution of commit activities (additions, deletions, and number of files changed), it looks like, even though there have been few commits, it has primarily been Weiting making changes to the files in the repository. I would like to see more of a balance in the future, but I understand that this may be due to the division of labor you have all decided on.

### Issues and Project Management Over Time

In [4]:
# Once again melt the issues dataframe from wide to long format so that we can have all the issue dates in one column called issue_date
melted_issues_df = issues_df.melt(id_vars=['user.login', 'title', 'state', 'body', 'html_url', 'assignee', 'assignees_logins', 'labels', 'milestone', 'draft', 'comments', 'state_reason', 'closed_by.login', 'reactions.total_count', 'issue_duration', 'issue_associated_with_pull_request', 'pull_request.url', 'issue_status_on_project_board'], value_vars=['created_at', 'updated_at', 'closed_at'], var_name='issue_date_type', value_name='issue_date')

# Rename columns because Altair does not let us use '.' directly in field names
melted_issues_df = melted_issues_df.rename(columns={
    'user.login': 'user_login', 
    'closed_by.login': 'closed_by_login'
})

# Sort the DataFrame by issue_date
melted_issues_df = melted_issues_df.sort_values(by='issue_date')

# Get the unique issue titles
issue_titles = melted_issues_df['title'].unique().tolist()

# Initialize an empty list to store the charts
charts = []
# Loop through each issue title to create a chart for the issue
for title in issue_titles:
    # Create a selection so that we can highlight the contributions of each group member
    selection = alt.selection_point(fields=['user_login'], bind='legend')
    
    # Filter the DataFrame for the current issue title
    subset_data = melted_issues_df[melted_issues_df['title'] == title]
    
    # Initialize a list to store subtitles for the chart
    subtitle = []
    
    # Check if the issue is associated with a pull request
    has_pr = subset_data[subset_data['issue_associated_with_pull_request'] == True]
    if not has_pr.empty:
        # If the issue is associated with a pull request, get the URL of the pull request and add it to the subtitle
        pr_url = subset_data['pull_request.url'].unique()[0]
        subtitle.append(f"Issue associated with a pull request ({pr_url}).")
    
    # Check if the issue is associated with a project board
    has_project_board = subset_data[subset_data['issue_status_on_project_board'].notna()]
    if not has_project_board.empty:
        # If the issue is associated with a project board, get the status of the issue on the project board and add it to the subtitle
        board_status = subset_data['issue_status_on_project_board'].unique()[0]
        subtitle.append(f"Issue is associated with a project board and has status {board_status}.")
    
    # If no subtitles were added, add a default message
    if not subtitle:
        subtitle = "Issue is not associated with a pull request or a project board."
    
    # Create a line chart for the current issue title
    chart = alt.Chart(subset_data).mark_line(point=True).encode(
        x='issue_date:T', # Use the issue date as the x-axis
        y=alt.Y('title', axis=alt.Axis(title=None, labels=False)), # Use the title as the y-axis and don't show the axis labels or title since we have the title in the chart title
        color='user_login:N', # Color the lines by the user login
        tooltip=[
            'user_login', 'closed_by_login', 'yearmonthdatehoursminutes(issue_date)', 'issue_date_type', 
            'title', 'body', 'state', 'assignee', 'assignees_logins', 'html_url'
        ], # Show the user login, closed by login, issue date, issue date type, title, body, state, assignee, assignees logins, and HTML URL in the tooltip
        opacity=alt.condition(selection, alt.value(1), alt.value(0.1)) # Set the opacity to 1 if the user login is selected and 0.1 if not
    ).add_params(selection).properties(
        # Set the title of the chart to be the issue title and the subtitle to be the subtitles we created
        title=alt.Title(f"Issue: {title}", subtitle=subtitle)
    )
    
    # Append the chart to the list of charts
    charts.append(chart)

# Concatenate all the charts vertically and resolve the x-scale to be shared
alt.vconcat(*charts).resolve_scale(x='shared')

Looking at your issues and project board, it looks like Jeff, Weiting, and Shoi have been the ones primarily responsible for project management. Is that accurate or am I missing something? Again, the roles you describe do not easily map to these activity patterns, so feel free to briefly share some details about how you've ended up with this distribution.

Overall, it seems like you're using the issues and project board to help you with your project, but I do notice that many of the issues are being closed immediately after being opened so there might be room to leverage issues more as you complete the project. I realize using a new platform and interface for project management can be difficult, but I would encourage you to think about how you can use these tools to not only help you manage your project and communicate with each other, but also document your work for the final project.
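
If it helps to see which issues were closed almost immediately, here is a minimal sketch of one way to check. It assumes the `created_at` and `closed_at` columns used in the chart above hold ISO 8601 timestamps; the sample rows, the `flag_quickly_closed` helper, and the one-hour threshold are all hypothetical and only there for illustration.

```python
import pandas as pd

# Hypothetical sample rows standing in for issues_df; only the column names
# (title, created_at, closed_at) come from the notebook above.
sample_issues = pd.DataFrame({
    "title": ["Clean dataset", "Write README", "Add license"],
    "created_at": ["2024-10-01T10:00:00Z", "2024-10-02T09:00:00Z", "2024-10-03T12:00:00Z"],
    "closed_at": ["2024-10-01T10:02:00Z", "2024-10-09T09:00:00Z", None],
})

def flag_quickly_closed(issues, threshold=pd.Timedelta(hours=1)):
    """Return a copy with how long each issue stayed open and a flag for quick closes."""
    result = issues.copy()
    created = pd.to_datetime(result["created_at"], errors="coerce", utc=True)
    closed = pd.to_datetime(result["closed_at"], errors="coerce", utc=True)
    result["time_open"] = closed - created
    # Issues that are still open have no closed_at, so time_open is NaT
    # and the comparison below evaluates to False for them.
    result["closed_quickly"] = result["time_open"] < threshold
    return result

flagged = flag_quickly_closed(sample_issues)
print(flagged[["title", "time_open", "closed_quickly"]])
```

To run this on your own data, you could pass `issues_df` to the function instead of the sample rows and adjust the threshold to whatever counts as "immediately" for your group.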

## Group Assignments to Date Feedback

The following feedback is for the three group assignments you have completed so far. Since there isn't clear documentation on who completed what activity, I am currently using the git history to assess contributions, but I am happy to adjust this assessment if you provide me with more information.

### Mass Digitization & Digital Libraries Assignment

- I believe this assignment is your `mass-digitization-digital-libraries.md` file but please correct me if I am missing something. Overall, on this assignment, I thought you did a great job, though I was a bit confused by some of the formatting. I particularly appreciated that you listed which group member contributed which section. That's very helpful and in line with ethical and responsible labor practices so well done on that front! 
- In terms of the assignment itself, you did an excellent job of providing a comprehensive analysis of different types of music data, including digital formats like MIDI files and streaming platforms such as Spotify and Apple Music. The detailed explanation of the digitization processes, especially the role of DACs in converting analog recordings to digital formats, demonstrated a strong technical understanding of music preservation and digitization. If you are interested, there are a number of Python libraries for working with audio and music (see [https://github.com/andreimatveyeu/awesome-python-audio](https://github.com/andreimatveyeu/awesome-python-audio)) that you might find useful for your project. In particular, I have seen Digital Humanities scholars use the `music21` library for working with music data, so that might be a good place to start.
- I also appreciated your thoughtful comparison of historical music storage methods, such as sheet music and vinyl records, with their digital counterparts. Additionally, your discussion on born-digital materials and the impact of viral content, exemplified by the ‘Baby Shark’ case study, showed a nuanced grasp of the cultural and commercial dynamics of digital music distribution.
- Mainly I'm wondering why you didn’t include some of this research in your documentation for the Mid-Semester Dataset Update? Do you think this context for how you developed the project is not helpful or were you just not sure how to include it? I personally think it would be helpful to include some of this information in your documentation, so I would encourage you to include it in the final project submission and happy to answer questions if you are not sure how to do so.
- Looking at the git history, it looks like Jeff added the file, and then Weiting, Shoi, and Yuktha all are listed as contributors. So, these members will be getting full marks, but happy to share with all group members if you confirm the division of labor is equitable.

Status: Complete & Full Marks for Jeff, Weiting, Shoi, and Yuktha (Though happy to update this if you can share more details about how you worked together on this assignment)

### Critical Cultural Data Explorations Assignment

- I believe that this assignment is the `reflections.md` file in the `reflections` folder, but again please correct me if I am wrong.
- Overall, I thought you did great on this assignment as well! I appreciate the depth of analysis your group provided when discussing the power structures embedded within the datasets. You did an excellent job highlighting how certain artists and major record labels dominate the dataset, reflecting industry power dynamics. I thought your awareness of the limitations in metadata, especially genre, provided by the datasets and how that affects transparency was great as well! I would highly recommend you look at the work of Nick Seaver, who is an anthropologist who has written extensively on the politics of classification and metadata in music streaming services, particularly Spotify. His work might provide you with some additional context for your project.
- Your comparison between how AI describes cultural objects versus datasets was insightful. It was great to see that you noted the AI’s tendency to shift from narrative and contextual descriptions to technical summaries. This observation is crucial for understanding how AI can reinforce or challenge existing power structures in cultural data.
- Well done on this assignment overall and again would encourage you to include some of this information in your documentation for the final project submission. Currently the git history just shows Jeff adding the file, and you did not include names of who contributed to the assignment in the file itself. So currently Jeff will get full marks, but happy to share with all group members if you confirm the division of labor is equitable.

Status: Complete & Full Marks for Jeff (Though happy to update this if you can share more details about how you worked together on this assignment)

### Proprietary & Perspectival Dataset Creation Assignment

- It seems like the assignment is in your `proprietary-perspectival-dataset-creation` folder in the `proprietary_platform_exploration.md` file, but please correct me if I am wrong.
- Overall, I thought your choice of Spotify made sense given your focus for the final project. You did an excellent job thoroughly investigating Spotify’s terms of service and access limitations. The discussion of how these terms have evolved over time, from fewer restrictions to more stringent policies limited to just accessing individual users' data, was spot on! I also thought your assessment of how API changes could impact research was great and I would recommend you look at the work of Amelia Acker, an LIS scholar, who has written about the challenges of working with and relying on APIs for research data (also happy to provide more resources if you are interested).
- I thought your inclusion and explanation of your rationale for your license was great as well. I liked that you decided on `My Playlist` as your subjective column, and that you included the dataset in this folder. I think your choice of which columns to include makes sense, but as we discussed in class there were some issues with your subjective column. Namely that it was primarily Charles who was determining what to include or not, rather than each group member. Furthermore, you might have included which group member had which song in their playlist and even what playlists they were from to make the dataset more transparent and reproducible. Because the assignment explicitly said that all group members should contribute to the dataset, I will be marking this assignment as incomplete, and partial marks for Charles, as well as Jeff and Weiting from the git history. However, I am happy to update this if you can share more details about how you worked together on this assignment (for example, if this was a division of labor issue then happy to give Charles, Jeff, and Weiting full marks). Alternatively, if everyone in the group wants full marks, I would request that you quickly add a few more rows from each group member, as well as a column detailing which member added which song to the dataset.
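
To make the request in the last point concrete, here is a minimal sketch of what recording who added each song could look like. Only `My Playlist` comes from your dataset; the `added_by` and `source_playlist` columns, the placeholder member names, and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows; "added_by" and "source_playlist" are suggested new columns.
songs = pd.DataFrame({
    "track_name": ["Song A", "Song B", "Song C", "Song D"],
    "My Playlist": [True, True, False, True],
    "added_by": ["member_a", "member_b", "member_a", "member_c"],
    "source_playlist": ["Road Trip", "Study Mix", "Road Trip", "Workout"],
})

# Placeholder names; replace with the actual group members.
group_members = ["member_a", "member_b", "member_c", "member_d"]

# Count rows per member, including members who have not added anything yet.
rows_per_member = songs["added_by"].value_counts().reindex(group_members, fill_value=0)
missing_members = rows_per_member[rows_per_member == 0].index.tolist()

print(rows_per_member)
print("Members with no rows yet:", missing_members)
```

A quick check like this would let you confirm that every group member is represented before you resubmit.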

Status: Incomplete & Partial Marks for Charles, Jeff, & Weiting (Though happy to update this if you can share more details as requested above)

## Mid-Semester Dataset Update Feedback

### Dataset Feedback

I am currently a bit confused by your dataset files. It seems you have two files, `timeline_data_with_recessions.csv.csv` and `DataCleaning`, but neither of these is appropriately named. You should not have a duplicated `.csv` extension, and `DataCleaning` needs a specified file format. Overall, it seems like your project has shifted a bit to look at the correlation between historic events and music popularity. That's a very interesting topic, but I'm also a bit concerned that this is such a broad phenomenon that you will end up with a correlation-versus-causation problem. I detail more thoughts below in the Documentation Feedback section.
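
If you would rather catch naming problems like these programmatically, here is a minimal sketch using only the standard library. The helper name `drop_repeated_suffix` is hypothetical, and it only computes a suggested name rather than renaming anything on disk.

```python
from pathlib import Path

def drop_repeated_suffix(filename: str) -> str:
    """Collapse a repeated final extension, e.g. 'data.csv.csv' -> 'data.csv'."""
    path = Path(filename)
    suffixes = path.suffixes
    # Only act when the last two extensions are identical (ignoring case).
    if len(suffixes) >= 2 and suffixes[-1].lower() == suffixes[-2].lower():
        return path.with_suffix("").name
    return path.name

for name in ["timeline_data_with_recessions.csv.csv", "DataCleaning"]:
    suggested = drop_repeated_suffix(name)
    has_extension = bool(Path(suggested).suffix)
    print(f"{name} -> {suggested} (has extension: {has_extension})")
```

Note that a file without any extension, like `DataCleaning`, is only flagged here; you would still need to decide which format it actually is.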

In terms of the two datasets, it seems like there's a bit more detail about the Kaggle Spotify & Billboard ones than about the timeline dataset. For the first one, I appreciate that you detailed in-depth your choices on what to include in the data, as well as how your project developed and how that influenced your choice of datasets. You write:

>We dropped rows with missing release dates as they would not be useful. We also dropped some columns such as Spotify URI, Song Image, Mode, and Time Signature. The Spotify URLs are not directly useful for our topic analysis; the song images do not contain any attributes that impact the music's characteristics or popularity trends. The Mode was deleted as we already have specific attributes like energy, valence, or danceability to capture the trends changes in music. From our preliminary analysis, we found out that most popular songs follow common time signatures like 4/4, so this column will not affect our later music trends prediction.

I'm a bit confused why you would drop these columns at this stage though? It seems like you're concerned about the columns not being useful for your topic analysis, but I'm not sure what that means. I would recommend you keep all the columns for now, or at the very least add back in the URI column since it is both a unique identifier and helpful to verify if the data is in fact accurate. I would also recommend potentially considering having the original Kaggle datasets in your group project to ensure they are archived fully, and also because it might be of interest which songs and therefore genres are not represented in the Billboard dataset.
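
As a rough illustration of why the URI is worth keeping, here is a minimal sketch that uses it both as a uniqueness check and as a key for finding songs that do not appear in the Billboard data. The column names (`uri`, `track_name`, `peak_position`) and all the rows are hypothetical, and I am assuming both of your datasets could carry the same identifier.

```python
import pandas as pd

# Hypothetical stand-ins for the original Kaggle Spotify data and the Billboard data.
kaggle_spotify = pd.DataFrame({
    "uri": ["spotify:track:aaa", "spotify:track:bbb", "spotify:track:ccc"],
    "track_name": ["Song A", "Song B", "Song C"],
})
billboard = pd.DataFrame({
    "uri": ["spotify:track:aaa", "spotify:track:ccc"],
    "peak_position": [1, 40],
})

# A unique identifier should never repeat within the dataset.
uri_is_unique = kaggle_spotify["uri"].is_unique

# A left merge with indicator=True marks rows that only exist in the Spotify data.
merged = kaggle_spotify.merge(billboard, on="uri", how="left", indicator=True)
not_on_billboard = merged.loc[merged["_merge"] == "left_only", "track_name"].tolist()

print("URIs unique:", uri_is_unique)
print("Songs missing from Billboard:", not_on_billboard)
```

Without a shared identifier, you would have to match on song and artist names, which is much more error prone.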

For the timeline dataset, I'm particularly confused by how you generated this dataset. You just say web scraping and manual data entry in your documentation, but as we've discussed in class, both are interpretative acts. So, I would highly encourage you to include the code you used to generate that data, as well as a more detailed example of what data you entered and how you determined those choices. I also think it would be helpful particularly for this dataset (but honestly could do it for both) to have a table listing the columns, their data types, and a brief description of what they contain. This will help you and others understand the data better and also help you document your work more effectively.
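
Here is a minimal sketch of how you might generate a starting point for such a table with `pandas`. The column names (`year`, `event`, `category`) and the descriptions are hypothetical; the two example events are the ones mentioned in your own documentation.

```python
import pandas as pd

# Hypothetical stand-in for the timeline dataset.
timeline = pd.DataFrame({
    "year": [1960, 1969],
    "event": ["Launch of Tiros I", "Invention of the ARPANET"],
    "category": ["technology", "technology"],
})

# Descriptions have to be written by hand; these are placeholders.
descriptions = {
    "year": "Year the event took place",
    "event": "Short name of the event",
    "category": "Hand-assigned event type",
}

# One row per column: its name, data type, missing-value count, and description.
data_dictionary = pd.DataFrame({
    "column": list(timeline.columns),
    "dtype": timeline.dtypes.astype(str).tolist(),
    "missing_values": timeline.isna().sum().tolist(),
    "description": [descriptions.get(col, "TODO: describe") for col in timeline.columns],
})

print(data_dictionary.to_string(index=False))
```

The data types and missing-value counts come for free, but the descriptions are where the interpretive choices get documented, so those are worth writing carefully.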

You also write:

>For example, technological advancements are included, such as the launch of the first U.S. weather satellite (Tiros I) in 1960 and the invention of the ARPANET in 1969, the precursor to the internet. The dataset allows for exploration of how such technological milestones impacted music production, distribution, and consumption. For instance, the rise of television and radio broadcasting during the postwar years played a major role in the popularization of music, with more homes having access to these technologies by the 1960s, changing the way music reached audiences. By linking these cultural, economic, and political events to music trends, the dataset helps us understand not just what was popular, but why certain genres or themes gained prominence during specific periods.

So, I'm generally a bit confused about what this timeline recession data represents, since what you describe here is more the history of technology than recessions. It seems like maybe you manually entered recessions, but that requires determining what even constitutes a recession. Such a label is often hotly debated, so it would be helpful to know how you determined what a recession is exactly. Relatedly, though, I'm a bit confused why you think that the history of American recessions is the most relevant historic data for assessing popularity? Specifically, I'm wondering how much of the music you have is actually made by American artists, and if you have considered how the global nature of music might impact your analysis. While Billboard is technically a US chart, it does include music created beyond the US, so I would recommend you consider how this might impact any analysis of your data.

### Documentation Feedback

I appreciated that you created both a `README.md` with your documentation, and a `pdf` version of the document. 

I thought you did a good job detailing what you included in the Spotify and Billboard datasets, as well as your rationale for including them. As mentioned above, I'm still a bit confused by how the timeline dataset was created or what it represents, but still appreciate that you did document it somewhat. I thought your sections were helpful for seeing how your project evolved, especially the `Challenges` section, though as I mentioned above you might move some of the content in the `Dataset Content Description` into a table for readability.

I particularly appreciated that you included references, though I noticed you didn't really include any scholarship, whether course readings or external sources. I would recommend you include some of these to help contextualize your project for the final dataset essay. I also found your `Responsibility and Contributions` section helpful, though I will note it does not correspond with the git history so I will need some confirmation from you all that this accurately reflects how you worked together on this project.

Overall though I'm a bit concerned that you are still in the mindset of treating this like a data science or analytics exercise rather than a chance to do critical dataset curation and creation. For example, you seem to be removing rows and columns with a goal of prioritizing certain analyses, but that seems to focus on the end output of seeing correlations rather than thinking deeply about how this data was created or what it represents (and doesn't). That doesn't mean the work you've done here isn't useful, but I think you could consider what phenomenon you're actually studying here. In my mind, it is contextualizing music popularity both in terms of the music itself and the broader cultural and historical context. But doing this contextualizing requires more than just seeing if there are correlations; it requires thinking about how taste is mediated by the music industry and broader cultural forces. Such broad phenomena can be difficult to represent in a dataset, but you might consider one of the trends you've identified, which is the consolidation of the music industry and how that has impacted the types of music that are popular.

For example, you write:

> The inclusion of Billboard rankings in the dataset is very important to our project, as it provides a reliable measure of a song's commercial success and cultural impact over time. Billboard rankings are widely recognized as a benchmark for music popularity, reflecting the tastes and preferences of the broader public. By tracking a song’s position on the Billboard HOT 100, the dataset can capture shifts in music consumption, listener behavior, and trends, helping us discover what was popular in different eras. By incorporating Billboard rankings, the dataset can quantify shifts in music’s popularity, making it possible to analyze the correlation between musical characteristics and historical events.

But you seem to be treating Billboard's rankings as a neutral measure of popularity, rather than a reflection of the music industry's power and the broader cultural forces that shape what is popular. For example, you could critically think about how something like Billboard doesn't just reflect popularity but actually functions as a gatekeeper for what is popular. In particular, as we discussed in class, there's a big question over how Billboard has measured popularity over time and how that has changed. You don't really seem to engage with that though. For instance, how has streaming changed the way Billboard measures popularity? Or the invention of tape cassettes or CDs? I realize you've struggled to narrow your focus, so hopefully this feedback helps you think about how to do that.

Finally, you still haven't included much detail on how accurate the Kaggle datasets are. Remember if you are using found data you need to verify that data is accurate and reliable. In this case, that might be trying to identify a particular Spotify song and seeing if the data matches what you find on Spotify. You might even see if you could use a Python library to replicate some of Spotify's analysis to see if the data matches. Similarly, for Billboard, you might try to select a particular year and see if you can determine if the Kaggle dataset matches what you can find on Billboard's website. You might also see if you can even verify Billboard's data by looking at how exactly they measure popularity and if that matches what you see in the dataset. This is a crucial part of dataset curation and creation, so I would recommend you include a section on this in your final project documentation.
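
One lightweight way to start this verification is a spot check: draw a reproducible sample of rows, look each one up by hand, and record whether the values agree. Here is a minimal sketch, where every column name and value is hypothetical and the "checked" values stand in for what a group member would look up manually on Spotify or Billboard.

```python
import pandas as pd

# Hypothetical rows standing in for the Kaggle dataset.
kaggle_rows = pd.DataFrame({
    "uri": ["spotify:track:aaa", "spotify:track:bbb", "spotify:track:ccc", "spotify:track:ddd"],
    "track_name": ["Song A", "Song B", "Song C", "Song D"],
    "release_year": [1985, 1999, 2004, 2016],
})

# Fixing random_state makes the sample reproducible, so others can re-check the same rows.
to_check = kaggle_rows.sample(n=2, random_state=42)

# Hypothetical values recorded by hand after looking each song up on the platform.
manual_lookup = pd.DataFrame({
    "uri": ["spotify:track:aaa", "spotify:track:bbb", "spotify:track:ccc", "spotify:track:ddd"],
    "release_year_checked": [1985, 1999, 2005, 2016],
})

# Compare the Kaggle value with the manually checked value for each song.
checked = kaggle_rows.merge(manual_lookup, on="uri", how="inner")
checked["matches"] = checked["release_year"] == checked["release_year_checked"]
match_rate = checked["matches"].mean()

print(checked[["track_name", "release_year", "release_year_checked", "matches"]])
print(f"Share of checked rows that match: {match_rate:.0%}")
```

Reporting the sample size, how the rows were chosen, and the share that matched would already make for a solid verification section in your documentation.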

### Progress from Initial Proposal Feedback

In the initial proposal, I had asked you to address the following questions in your project (paraphrased):

- [ ] How will you define and measure what people are listening to now? How does the economics of the music industry shape both Spotify and Billboard, and also influence what you might mean by listening? Finally, now seems obvious but is that a particular day that you are planning to select or a set of years?
- [ ] How will you verify the Spotify datasets you are using against the actual cultural objects they represent? Have you tried to find any of the original songs on Spotify to see if the metadata matches up or even the physical albums?
- [ ] How will you determine which computational methods you will experiment with on the data? Have you done the ones you listed, like machine learning or statistical analyses, before? You write "we use the best model to get the music trends accurately interpreted and contextualized within broader cultural or historic shifts in the music industry." How will you assess what is "best" or "accurate"?
- [ ] How will you equitably distribute tasks among the group? What happens if collecting and curating the data takes longer than anticipated? How will that impact the other roles? 

My current assessment is that some of these concerns are no longer relevant (you seem less focused on listening now, for example), but many still are, and while you have addressed some of them, many remain. In particular, I think you need to do some more work on explicating and refining how you are approaching the found datasets you are using, as well as how you are thinking about the broader cultural and historical context of your project.

I think rather than trying to find correlations between historic events and taste, you might consider reframing your focus on how the music industry has shaped what is popular and how that has changed over time. This will require you to think more critically about what represents the music industry, why it has changed over time, and how some of the shifts in music technology you listed also impact these phenomena. But I do think your current data is a good start, and I think you have a good foundation to build on. You can continue with your current focus, but I do think it will be difficult to make a compelling rationale for why this is relevant without some more critical engagement with the data and the broader cultural and historical context.

I also think you need to do a bit more work on refining your documentation and how you might more accurately document your group's collaboration, though again happy to respect your listed responsibilities if those are confirmed to represent equitable division of labor by all group members. 

I do think you are on the right track with your project, but I think addressing some of the concerns I have outlined above will help you to make sure you are on target to meet the project goals and requirements. I also think bringing in some of our course readings and your work on previous group assignments will help flesh out and situate your project. I would encourage you to reach out to me if you have any questions or concerns about my feedback or how to proceed with the project.

### Final Grade

Your grade for this assignment is currently a B+. I think you are on the right track with your project, but I think you need to do some more work. If you can provide a bit more detail on your process for creating the timeline dataset (specifically, what you web scraped and manually entered), I would be happy to bump you up to an A- for this assignment. And I think if you can address some of the concerns I have outlined above, you will be in a good position to complete the project successfully. Also, I'm currently planning to share this grade between all group members, but please let me know if that is not accurate or appropriate.