# Instructor Feedback for Mid-Semester Dataset Update and Group Assignments to Date for Group 6

Please find below three sections of feedback regarding your mid-semester dataset update and group assignments. Overall, I'm glad to see progress on the project and it seems like you are working together, which is great. I also think your choice of topic remains a rich one and I'm excited to see how you're working together to realize your proposed project. However, I have some questions about the division of labor, as well as the overall scope and documentation of your project. As a reminder, your goal is to try and create something like the datasets we have been exploring on the *Responsible Datasets in Context Project*. While I do not expect what you produce to be as polished as what is available on the website, considering there are four of you in the group, I am hoping you can clarify some of your current choices and provide more detail on how you are envisioning your final project and how you will achieve the stated project goals and requirements.

In terms of returning feedback to the instructor, you have two options: create a new GitHub issue responding to my questions and tag me in the issue, or complete the Google form, available via Canvas, where you can update your project plan and respond to my questions or offer suggested corrections to the stated assessments. You are also welcome to use the form to provide feedback that will not be shared with the rest of your group if you think that would be helpful or want to share something with me privately.

## Load Libraries and Datasets

To run this notebook, you will need to have `pandas`, `altair`, and `rich` installed. You can find instructions on how to do so on our course website.

In [2]:
import pandas as pd
import altair as alt
from rich.console import Console
from rich.table import Table

console = Console()

# Load the data
contributors_df = pd.read_csv("./datasets/contributors_group_6.csv")
commits_df = pd.read_csv("./datasets/commits_group_6.csv")
issues_df = pd.read_csv("./datasets/issues_group_6.csv")

## Overall Group Feedback

Overall, your group seems to be working well together, though I notice some patterns below that I have questions about. Your GitHub repository still needs a `.gitignore` file, and I would also recommend that you update your `README.md` to describe the overall project and the structure of your repository. I would further recommend that you update the repository description so that it is accurate for your project. I was glad to see you had a license for the repository though, and I was especially impressed with the organization of your repository!

I will say that while I respect and understand your current approach to dividing labor, it does make it somewhat difficult for me to assess since it seems like primarily some group members have been doing the coding and issue management (at least based on the data you will see below), whereas others are doing more of the research and data entry intensive labor. If you could please just briefly confirm that the division of labor is relatively equitable on the Google form on Canvas, then I'm happy to share the grades between all members of the group. But I would like to get that confirmation since it is not clear to me from the data that I have available.

In the following graphs, you will see some of how your group has been working together from what I can tell via GitHub. I do not think this data represents all of your group's work or activities, so I would encourage you to both think about how to document work in more detail and also to contact the instructors if you feel this data is not representative of your group's work and would like other aspects to be included in your assessment.

### Contributions Per Group Member

In [6]:
# Create a table for the contributions
contributors_table = Table(title="Contributors")
# Add columns with the contributors and the number of contributions
contributors_table.add_column("Contributor", style="cyan")
contributors_table.add_column("Number of Contributions", style="magenta")
# Sort the contributors by the number of contributions, with the highest first
contributors_df = contributors_df.sort_values(by="contributions", ascending=False)
# Loop through the contributors and add them to the table
for index, row in contributors_df.iterrows():
	# Add the contributor to the table and set the contributions to be a string
	contributors_table.add_row(row["login"], str(row["contributions"]))
# Print the table
console.print(contributors_table)

Above is the total number of contributions (that is, commits) per group member. I see that there are quite a few commits and also that almost every group member has committed, which is great. I do notice that Gloria is committing a lot more than everyone else. Ideally, I would like to see all members of the group committing to the repository somewhat equally moving forward if possible, but I am happy to respect your division of labor if that is something you have all decided on.
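To make the balance of commits easier to read at a glance, here is a minimal sketch that turns raw contribution counts into percentage shares. It assumes a table with the same `login` and `contributions` columns as the contributors file loaded above; the sample logins and counts, the `contribution_share` helper, and the `share_pct` column are made up for illustration and are not your group's actual data.

```python
import pandas as pd

# Hypothetical sample shaped like contributors_group_6.csv
# (columns `login` and `contributions`); logins and counts are invented.
sample_contributors = pd.DataFrame(
    {
        "login": ["member_a", "member_b", "member_c", "member_d"],
        "contributions": [40, 30, 20, 10],
    }
)


def contribution_share(df: pd.DataFrame) -> pd.DataFrame:
    """Return each login's percentage of all contributions, highest first."""
    out = df.sort_values(by="contributions", ascending=False).reset_index(drop=True)
    # Divide each count by the group total and express it as a percentage
    out["share_pct"] = (out["contributions"] / out["contributions"].sum() * 100).round(1)
    return out


shares = contribution_share(sample_contributors)
print(shares)
```

A share only counts commits, so it says nothing about research or data entry work that happens outside of GitHub.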

### Commits Over Time

In [4]:
# Melt the commits dataframe from wide to long format so that we can have all commit type activities in one column called commit_metric
melted_commits_df = commits_df.melt(id_vars=['oid', 'message', 'committedDate', 'login', ], value_vars=['additions', 'deletions', 'changedFiles'], var_name='commit_metric', value_name='commit_metric_value')
# Convert the commit date to a datetime object
melted_commits_df['commit_date'] = pd.to_datetime(melted_commits_df['committedDate'], errors='coerce')
# Get the unique commit types to create charts for each type
commit_types = melted_commits_df['commit_metric'].unique().tolist()

# Create a list to store the charts
charts = []
# Loop through the commit types and create a chart for each type
for commit_type in commit_types:
	# Create an interactive selection for each chart where you can select the login to highlight each group member's contributions
	selection = alt.selection_point(fields=['login'], bind='legend')
	
	# Filter the DataFrame for the current commit type and subset to only the columns we need to keep the chart smaller in size
	filtered_df = melted_commits_df[melted_commits_df['commit_metric'] == commit_type][['commit_date', 'commit_metric_value', 'login', 'message']]
	
	# Create a bar chart for the current commit type
	chart = alt.Chart(filtered_df).mark_bar().encode(
		x='commit_date:T', # Use the commit date as the x-axis
		y='commit_metric_value:Q', # Use the commit metric value as the y-axis
		color='login:N', # Color the bars by the login
		opacity=alt.condition(selection, alt.value(1), alt.value(0.1)), # Set the opacity to 1 if the login is selected and 0.1 if not
		tooltip=['commit_date', 'commit_metric_value', 'login', 'message'] # Show the commit date, commit metric value, login, and message in the tooltip
	).add_params(selection).properties(
		title=f"Commit {commit_type} Over Time"
	)
	# Add the chart to the list of charts
	charts.append(chart)
# Combine all the charts into one chart and set the y-axis to be independent so that we can see all the changes even if the y axis scale is different for each commit type activity
alt.hconcat(*charts).resolve_scale(y='independent')

When we look at the distribution of commit activities (that is, additions, deletions, and number of files changed), it does look like other members, such as Lucas and Cindy, have been contributing consistently as well, even though there have been more commits by Gloria. Finally, it seems like Francis has recently started to add more commits, which is great to see. Overall, this looks like a good distribution of commits over time.

### Issues and Project Management Over Time

In [5]:
# Once again melt the issues dataframe from wide to long format so that we can have all the issue dates in one column called issue_date
melted_issues_df = issues_df.melt(id_vars=['user.login', 'title', 'state', 'body', 'html_url', 'assignee', 'assignees_logins', 'labels', 'milestone', 'draft', 'comments', 'state_reason', 'closed_by.login', 'reactions.total_count', 'issue_duration', 'issue_associated_with_pull_request', 'pull_request.url', 'issue_status_on_project_board'], value_vars=['created_at', 'updated_at', 'closed_at'], var_name='issue_date_type', value_name='issue_date')

# Rename the columns used in the chart encodings because Altair does not let us use '.' in the column names
melted_issues_df = melted_issues_df.rename(columns={
    'user.login': 'user_login', 
    'closed_by.login': 'closed_by_login'
})

# Sort the DataFrame by issue_date
melted_issues_df = melted_issues_df.sort_values(by='issue_date')

# Get the unique issue titles
issue_titles = melted_issues_df['title'].unique().tolist()

# Initialize an empty list to store the charts
charts = []
# Loop through each issue title to create a chart for the issue
for title in issue_titles:
    # Create a selection so that we can highlight the contributions of each group member
    selection = alt.selection_point(fields=['user_login'], bind='legend')
    
    # Filter the DataFrame for the current issue title
    subset_data = melted_issues_df[melted_issues_df['title'] == title]
    
    # Initialize a list to store subtitles for the chart
    subtitle = []
    
    # Check if the issue is associated with a pull request
    has_pr = subset_data[subset_data['issue_associated_with_pull_request'] == True]
    if not has_pr.empty:
        # If the issue is associated with a pull request, get the URL of the pull request and add it to the subtitle
        pr_url = subset_data['pull_request.url'].unique()[0]
        subtitle.append(f"Issue associated with a pull request ({pr_url}).")
    
    # Check if the issue is associated with a project board
    has_project_board = subset_data[subset_data['issue_status_on_project_board'].notna()]
    if not has_project_board.empty:
        # If the issue is associated with a project board, get the status of the issue on the project board and add it to the subtitle
        board_status = subset_data['issue_status_on_project_board'].unique()[0]
        subtitle.append(f"Issue is associated with a project board and has status {board_status}.")
    
    # If no subtitles were added, add a default message (kept as a list item so subtitle is always a list)
    if not subtitle:
        subtitle.append("Issue is not associated with a pull request or a project board.")
    
    # Create a line chart for the current issue title
    chart = alt.Chart(subset_data).mark_line(point=True).encode(
        x='issue_date:T', # Use the issue date as the x-axis
        y=alt.Y('title', axis=alt.Axis(title=None, labels=False)), # Use the title as the y-axis and don't show the axis labels or title since we have the title in the chart title
        color='user_login:N', # Color the lines by the user login
        tooltip=[
            'user_login', 'closed_by_login', 'yearmonthdatehoursminutes(issue_date)', 'issue_date_type', 
            'title', 'body', 'state', 'assignee', 'assignees_logins', 'html_url'
        ], # Show the user login, closed by login, issue date, issue date type, title, body, state, assignee, assignees logins, and HTML URL in the tooltip
        opacity=alt.condition(selection, alt.value(1), alt.value(0.1)) # Set the opacity to 1 if the user login is selected and 0.1 if not
    ).add_params(selection).properties(
        # Set the title of the chart to be the issue title and the subtitle to be the subtitles we created
        title=alt.Title(f"Issue: {title}", subtitle=subtitle)
    )
    
    # Append the chart to the list of charts
    charts.append(chart)

# Concatenate all the charts vertically and resolve the x-scale to be shared
alt.vconcat(*charts).resolve_scale(x='shared')

Looking at your issues and project board, it looks like you initially used issues for project management, but then stopped for some reason? I also noticed that many of the issues are being closed immediately after being opened so there might be room to leverage issues more as you complete the project. I also notice that Gloria is doing the lion's share of the issue management, which is impressive, but I would encourage you to think about how you can balance this workload more effectively.
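If you want to check the pattern of quickly closed issues yourselves, here is a minimal sketch that measures how long each issue stayed open. It assumes a table with `title`, `created_at`, and `closed_at` columns like the issues file loaded above; the sample rows, the `flag_quick_closes` helper, the `minutes_open` and `closed_quickly` columns, and the ten-minute threshold are all hypothetical choices for illustration. It recomputes the duration from the two timestamps rather than relying on the existing `issue_duration` column.

```python
import pandas as pd

# Hypothetical rows shaped like issues_group_6.csv; titles and dates are invented.
sample_issues = pd.DataFrame(
    {
        "title": ["Set up repo", "Collect metadata", "Draft README"],
        "created_at": ["2024-10-01T15:00:00Z", "2024-10-02T09:00:00Z", "2024-10-03T12:00:00Z"],
        "closed_at": ["2024-10-01T15:02:00Z", "2024-10-09T09:00:00Z", None],
    }
)


def flag_quick_closes(df: pd.DataFrame, threshold_minutes: float = 10) -> pd.DataFrame:
    """Add how many minutes each issue was open and whether it closed quickly."""
    out = df.copy()
    created = pd.to_datetime(out["created_at"], utc=True, errors="coerce")
    closed = pd.to_datetime(out["closed_at"], utc=True, errors="coerce")
    # Open issues have no closing date, so their duration is missing (NaN)
    out["minutes_open"] = (closed - created).dt.total_seconds() / 60
    # Comparisons against NaN are False, so still-open issues are not flagged
    out["closed_quickly"] = out["minutes_open"] < threshold_minutes
    return out


flagged = flag_quick_closes(sample_issues)
print(flagged[["title", "minutes_open", "closed_quickly"]])
```

The threshold is arbitrary; an issue that is opened and closed within minutes may simply be recording work that was already finished.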

I realize using a new platform and interface for project management can be difficult, but I would encourage you to think about how you can use these tools to not only help you manage your project and communicate with each other, but also document your work for the final project.

## Group Assignments to Date Feedback

The following feedback is for the three group assignments you have completed so far. Since there isn't clear documentation on who completed what activity, I am currently using the git history to assess contributions, but I am happy to adjust this assessment if you provide me with more information.

### Mass Digitization & Digital Libraries Assignment

- It seems like the `READ.md` in your `mid-semester-dataset-&-documentation` folder is the correct file for this assignment, but please correct me if I am wrong.
- Overall, I think you did a great job on this assignment! Your detailed exploration of digital objects in children’s literature, including e-books, audiobooks, and interactive apps, was comprehensive and well thought out.
- I especially liked the comparison between older archives like Project Gutenberg and newer resources such as the International Children’s Digital Library, which effectively highlighted the evolution in accessibility and user engagement.
- I also thought your example of “Charlie and the Chocolate Factory” as a viral digital object was a strong one, showing how multimedia can enhance a literary work’s reach.
- I appreciate that you included clear attribution of contributions for each section and would just recommend formatting links properly for Markdown in future submissions to improve readability.
- Given the git history and listed contributors, I am giving full marks to everyone for this assignment.

Status: Complete & Full Marks for Everyone

### Critical Cultural Data Explorations Assignment

- I believe that this assignment is the files in the `critical-cultural-data-explorations` folder, but again please correct me if I am wrong or missing something.
- Overall, I thought you did a fantastic job on this assignment and really appreciated that you once again not only detailed who worked on which sections, but also included some of your GitHub project management in the assignment. I also appreciated that you included relevant images, some sample data, as well as your notes for the presentation, which was truly impressive!
- You did a great job comparing the Historical Children’s Literature Collection and the Children Stories Text Corpus from Kaggle. Your analysis of how the historical dataset maintained rich cultural elements like illustrations and detailed metadata versus the contemporary dataset focusing solely on text, thus losing visual and contextual depth, was very compelling. I also thought you did a really good job giving an overview of what each dataset contains and considering the power dynamics and potential biases in the datasets.
- I found your AI summaries and reflections very interesting as well! I thought you were very thorough in your assessment, and I thought your point about AI providing more detailed and culturally sensitive descriptions for physical books while taking a more mechanical, structural approach for datasets was spot on!
- Since both the git history and the listed contributors show that everyone contributed to this assignment, I am giving full marks to everyone for this assignment.

Status: Complete & Full Marks for Everyone (Though happy to update this if you can share more details about how you worked together on this assignment)

### Proprietary & Perspectival Dataset Creation Assignment

- It seems like the assignment is in your `proprietary-perspectival-dataset-creation` folder, but please correct me if I am wrong.
- Overall, again truly excellent work! You did a great job delving into Spotify as a platform and outlining the process of data collection, the limitations of data access, and how the platform’s terms have evolved over time. Your descriptions of the available tools and methods for collecting data, such as the Spotify API, effectively highlighted both the benefits and restrictions associated with using this platform for research purposes. I thought your point about how there is more transparency now even if there are more restrictions was great. I also appreciated that you considered Spotify Wrapped as an example of how users can get a sense of their data in the aggregate.
- I also thought you did a great job both creating and documenting your perspectival dataset! I thought the choice of `Mood While Listening`, `Personal Connection`, and `Subjective Choice` as your subjective categories was very thoughtful and well documented. I do think it would have been helpful to record which member entered which data, but even without that additional data this is impressive. I also thought your data collection process was well documented and you made some excellent points about having to balance legal and ethical considerations when using Spotify's data. In particular, I thought your inclusion of copyright issues and protecting users was really great. The only additional suggestion would be having this data in a `csv` file rather than a Markdown table, but given the small size of the dataset, this is a minor issue.
- In this assignment, I am not seeing listed contributors, so going off the git history it looks like Lucas, Gloria, and Francis all contributed. But please correct me if I am wrong.

Status: Complete & Full Marks for Lucas, Gloria, and Francis (Though happy to update this if you can share more details about how you worked together on this assignment)

## Mid-Semester Dataset Update Feedback

### Dataset Feedback

Overall, your current dataset looks promising, but there are some immediate issues you should address. First, having the dataset in a Markdown table is not helpful for analysis or visualization. I would recommend you convert this to a `csv` file and then load it into a DataFrame for analysis. As part of this process, I would recommend changing the column titles to be lowercased and underscored for ease of use. I would also recommend that you strip the extra spaces from the Markdown values so that you don't end up with issues because your `Language` column has `English` and ` English` as two separate values. I also would recommend that you add a link to the book in the actual dataset, rather than just in the documentation.
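As one possible way to do this conversion, here is a minimal sketch that parses a simple pipe-delimited Markdown table, lowercases and underscores the column names, strips stray spaces from every value, and produces CSV text. The sample table, the `markdown_table_to_df` helper, and the book and author names are hypothetical, and the parser assumes no cell contains a `|` character.

```python
import pandas as pd

# Hypothetical Markdown table; your real columns and rows will differ.
# Note the extra space before the first "English" value.
markdown_table = """
| Title | Author | Language |
| --- | --- | --- |
| Book One | Author A |  English |
| Book Two | Author B | English |
"""


def markdown_table_to_df(text: str) -> pd.DataFrame:
    """Parse a simple pipe-delimited Markdown table into a tidy DataFrame."""
    rows = []
    for line in text.strip().splitlines():
        # Drop the outer pipes, split on the inner ones, and trim each cell
        inner = line.strip().removeprefix("|").removesuffix("|")
        cells = [cell.strip() for cell in inner.split("|")]
        # Skip the separator row made up only of dashes and colons
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    header, *body = rows
    # Lowercase the column names and replace spaces with underscores
    columns = [name.lower().replace(" ", "_") for name in header]
    return pd.DataFrame(body, columns=columns)


books_df = markdown_table_to_df(markdown_table)
# Without a path, to_csv returns the CSV as a string; pass a filename to write a file
csv_text = books_df.to_csv(index=False)
print(csv_text)
```

Because every cell is trimmed while parsing, `English` and ` English` end up as a single value in the `language` column.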

I thought you did a good job explaining the data origins from your two archives, and how these archives sometimes had uneven metadata. I also thought your `Content Description` section did a good job of summarizing the data, though I would recommend you somewhat ironically turn that summary into a Markdown table for readability.

I appreciated that you decided to collect this data manually and think that was a great initial choice, but I would **strongly recommend** you consider programmatically collecting this data moving forward. For example, you could web scrape the International Children's Digital Library relatively easily and then use the Internet Archive's Python library, available here [https://archive.org/developers/internetarchive/](https://archive.org/developers/internetarchive/), to find additional children's literature. But you might even look closer to campus. Specifically, the University of Illinois has one of the largest collections of children's literature that you could consider using as well. You can find links to some of the options here [https://ccb.ischool.illinois.edu/collections-resources/](https://ccb.ischool.illinois.edu/collections-resources/) and I would particularly recommend the School (S)-Collection in the Social Sciences, Health, and Education Library (SSHEL) and HathiTrust Digital Library as potential sources.
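To give a sense of what working with the Internet Archive's Python library might look like, here is a minimal sketch built around its `search_items` function. The `build_query` and `search_archive` helpers are hypothetical names, and the subject value and requested metadata fields are assumptions about how the items you want are catalogued, so you would need to experiment with the query. Running the search itself requires installing the `internetarchive` package and having network access.

```python
from itertools import islice


def build_query(subject: str, mediatype: str = "texts") -> str:
    """Build an Internet Archive search query for a subject and media type."""
    return f'subject:"{subject}" AND mediatype:{mediatype}'


def search_archive(query: str, limit: int = 25) -> list[dict]:
    """Return up to `limit` search results as dictionaries (needs network access)."""
    # Imported here so build_query still works without the package installed
    from internetarchive import search_items

    # The field names below are assumptions about which metadata you want back
    results = search_items(query, fields=["identifier", "title", "creator", "date"])
    return [dict(item) for item in islice(results, limit)]


# The subject string is a guess at how relevant items are labeled
query = build_query("Children's literature")
print(query)
# After `pip install internetarchive`, you could run: search_archive(query, limit=10)
```

Limiting the number of results while you test keeps requests small until you are confident the query returns the kinds of books you want.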

Indeed, HathiTrust would give you full access to older children's literature, so if you did want to explore illustrations and other metadata, that might be a good place to start. There's even a great *Programming Historian* lesson on doing image analysis with HathiTrust materials that might be relevant: Stephen Krewson, "Extracting Illustrated Pages from Digital Libraries with Python," *Programming Historian* 8 (2019), [https://doi.org/10.46430/phen0084](https://doi.org/10.46430/phen0084).

Overall, I think this dataset is a good start, but I would consider how you might scale up your data collection and also what metadata you might want to collect moving forward.

### Documentation Feedback

I thought your documentation was well structured, though you might consider using Markdown headers rather than bolding for your sections. I thought you did a great job detailing your data collection process, as well as some of the challenges you encountered. 

I also appreciated that you included information on how you selected which books to include or exclude. I was curious how you were determining "well-known" books? Did you mean books that appear in both archives, or did you have some other criteria? I would recommend potentially creating a figure where you annotate a screenshot of one of the archives to show how you were making these decisions and turning the materials into data. A great potential example of this is from the *Shakespeare and Company Project* [https://shakespeareandco.princeton.edu/sources/cards/](https://shakespeareandco.princeton.edu/sources/cards/). 

As I mentioned above, I do think you could turn your `Content Description` into a Markdown table for readability. I thought your `Responsibility and Contributions` section was particularly helpful and thoughtful, and I appreciated that you not only listed in-depth what each group member contributed but also noted when they worked together on a task. Your group is currently one of the few that has done this, and I think it is a great model for other groups to follow.

I do think going forward you could supplement your current documentation with some of the historical research you've done for previous assignments, and also more secondary literature to help situate your project. In particular, you might take a look at the *Journal of Cultural Analytics*, which has some potentially relevant articles both specifically on children's literature and more broadly on computational analysis of literature [https://culturalanalytics.org/](https://culturalanalytics.org/).

Overall, I think you have a good start on your dataset and documentation, but I would again recommend you consider how you might scale up your data collection and also how that might require you to adjust or augment your documentation moving forward.

### Progress from Initial Proposal Feedback

In the initial proposal, Jess had asked you to address the following questions in your project (paraphrased):

- [ ] How will you access the Historical Children’s Literature Collection? Are there other digitized collections you might use?
- [ ] How will you supplement the lack of metadata in the Children Stories Text Corpus? Are there other datasets you could use to provide more context?
- [ ] How will you use ChatGPT since you mention it in your Materials section?
- [ ] How and what will you be web scraping?
- [ ] What are your citations and references for the datasets you are using, as well as the scholarship you are referencing?
- [ ] How will you get access to full-text children's books and how will that delimit your dataset? 

I also want to flag that you have yet to merge in Jess's pull request from the initial proposal feedback, so I would encourage you to do that as soon as possible. I am happy to help if you are having trouble with that.

My current assessment is that you have completed some of these tasks (for example, accessing the actual collections) but there are some that remain to confront, as well as some new questions for you to address. In particular, I think you need to expand the scale of your data and consider some more programmatic solutions. I also think you need to move your data into a more usable format both for the next stage of your project and future users.

As you expand the dataset, I think it will be helpful to pull in both readings we've discussed in class (particularly the ones related to literature) and also some additional scholarship on children's literature and digital humanities. In particular, you will need to start thinking about how you might have multiple datasets if you are working with multiple archives and how you might both keep separate copies and merge these datasets together. I also think you might want to consider including bibliographic metadata that would help make your dataset more useful for future researchers.
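One way to keep separate copies while also producing a merged dataset is to leave each archive's table untouched and combine copies that are labeled with their source. Here is a minimal sketch of that idea; the two sample tables, their columns, the `combine_archives` helper, and the `source_archive` column are all hypothetical.

```python
import pandas as pd

# Hypothetical per-archive tables; your real columns and rows will differ.
icdl_df = pd.DataFrame(
    {"title": ["Book One", "Book Two"], "language": ["English", "Spanish"]}
)
ia_df = pd.DataFrame(
    {
        "title": ["Book Two", "Book Three"],
        "language": ["Spanish", "English"],
        "year": [1901, 1923],
    }
)


def combine_archives(frames: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Stack per-archive tables into one table, recording where each row came from."""
    # assign returns a new DataFrame, so the original tables are left unchanged
    labeled = [df.assign(source_archive=name) for name, df in frames.items()]
    # Columns missing from one archive are filled with NaN in the combined table
    return pd.concat(labeled, ignore_index=True)


combined_df = combine_archives({"icdl": icdl_df, "internet_archive": ia_df})
# Titles that appear more than once, for example because both archives hold them
in_both = combined_df[combined_df.duplicated(subset="title", keep=False)]
print(combined_df)
print(in_both)
```

Matching on exact titles is fragile, since archives often record the same book slightly differently, so a shared identifier or normalized title would likely be needed for real data.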

Finally, if you decide you do want to work with either full text or page illustrations, I would be happy to consult and help you generate some code to help you with that process. I do not think it is necessary for your project, but it might be a nice addition if you are interested in exploring that further. I would also be happy to help troubleshoot the web scraping or working with the Internet Archive or HathiTrust if you are interested in exploring those options.

Overall, you are on the right track with your project, but I think addressing some of the concerns I have outlined above will help ensure you are on target to meet the project goals and requirements. Please feel free to reach out to me if you have any questions or concerns about my feedback or how to proceed with the project.

### Final Grade

Your grade for this assignment is currently a B+. If you can put your data into a csv, then I would be happy to bump you up to an A-. If you address my concerns above, I think you are on target for a very successful project and I'm very excited to see what you produce in the next few weeks. Also, this grade will be shared with all members of the group, so please let me know if you have any concerns about that.