# Instructor Feedback for Mid-Semester Dataset Update and Group Assignments for Group 2

Below are three sections of feedback regarding your mid-semester dataset update and group assignments. Overall, I'm pleased to see promising progress on the project and effective teamwork. Your current work provides a solid foundation for the project's goal of creating something like the datasets we've explored in the *Responsible Datasets in Context Project*. I was particularly impressed with your group's creativity and your ability to navigate challenging obstacles.

I do have some questions about the division of labor, which you'll find below. You have two options for returning feedback: create a new GitHub issue responding to my questions and tag me, or complete the Google form available via Canvas. The form allows you to confirm or suggest corrections to my assessments. You can also use the form to provide private feedback that will not be shared with the rest of your group if you think it would be helpful.

## Load Libraries and Datasets

To run this notebook, you will need to have `pandas`, `altair`, and `rich` installed. You can find instructions on how to do so on our course website.

In [1]:
import pandas as pd
import altair as alt
from rich.console import Console
from rich.table import Table

console = Console()

# Load the data
contributors_df = pd.read_csv("./datasets/contributors_group_2.csv")
commits_df = pd.read_csv("./datasets/commits_group_2.csv")
issues_df = pd.read_csv("./datasets/issues_group_2.csv")

## Overall Group Feedback

Overall, your group seems to be working well together, and I think you've made excellent progress on your final project! In terms of your group's GitHub repository, I would highly encourage you to update the main `README.md` file to remove the default text I had added and replace it with more informative content about your project and how to navigate your repository. You should also add an updated description to the repository and a license file. Finally, I'm glad to see you have a `gitignore` file, but there were a few empty files like `project_manager_3.md` and oddly formatted ones like `Project manager assignment.ipynb` that you might want to clean up or delete.

I will say that while I respect and understand your current approach to dividing labor, it does make it somewhat difficult for me to assess since it seems like primarily Ryan has been doing the coding and issue management (at least based on the data you will see below). If you could please just briefly confirm that the division of labor is relatively equitable on the Google form on Canvas, then I'm happy to share the grades between all members of the group. But I would like to get that confirmation since it is not clear to me from the data that I have available.

In the following graphs, you will see some of how your group has been working together from what I can tell via GitHub. I do not think this data represents all your group's work or activities, so I would encourage you to consider if there's more you could do to document and be transparent with how you work together in more detail moving forward. Though again I am happy to trust your current division if I get confirmation from all group members.

### Contributions Per Group Member

In [2]:
# Create a table for the contributions
contributors_table = Table(title="Contributors")
# Add columns with the contributors and the number of contributions
contributors_table.add_column("Contributor", style="cyan")
contributors_table.add_column("Number of Contributions", style="magenta")
# Sort the contributors by the number of contributions, with the highest first
contributors_df = contributors_df.sort_values(by="contributions", ascending=False)
# Loop through the contributors and add them to the table
for index, row in contributors_df.iterrows():
	# Add the contributor to the table and set the contributions to be a string
	contributors_table.add_row(row["login"], str(row["contributions"]))
# Print the table
console.print(contributors_table)

Above is the total number of contributions (that is, commits) per group member. I see that there are not many commits currently, which makes sense since most of your group activities so far have been more research intensive. However, I would like to see all members of the group committing to the repository moving forward if possible.

### Commits Over Time

In [3]:
# Melt the commits dataframe from wide to long format so that we can have all commit type activities in one column called commit_metric
melted_commits_df = commits_df.melt(id_vars=['oid', 'message', 'committedDate', 'login', ], value_vars=['additions', 'deletions', 'changedFiles'], var_name='commit_metric', value_name='commit_metric_value')
# Convert the commit date to a datetime object
melted_commits_df['commit_date'] = pd.to_datetime(melted_commits_df['committedDate'], errors='coerce')
# Get the unique commit types to create charts for each type
commit_types = melted_commits_df['commit_metric'].unique().tolist()

# Create a list to store the charts
charts = []
# Loop through the commit types and create a chart for each type
for commit_type in commit_types:
	# Create an interactive selection for each chart where you can select the login to highlight each group member's contributions
	selection = alt.selection_point(fields=['login'], bind='legend')
	
	# Filter the DataFrame for the current commit type and subset to only the columns we need to keep the chart smaller in size
	filtered_df = melted_commits_df[melted_commits_df['commit_metric'] == commit_type][['commit_date', 'commit_metric_value', 'login', 'message']]
	
	# Create a bar chart for the current commit type
	chart = alt.Chart(filtered_df).mark_bar().encode(
		x='commit_date:T', # Use the commit date as the x-axis
		y='commit_metric_value:Q', # Use the commit metric value as the y-axis
		color='login:N', # Color the bars by the login
		opacity=alt.condition(selection, alt.value(1), alt.value(0.1)), # Set the opacity to 1 if the login is selected and 0.1 if not
		tooltip=['commit_date', 'commit_metric_value', 'login', 'message'] # Show the commit date, commit metric value, login, and message in the tooltip
	).add_params(selection).properties(
		title=f"Commit {commit_type} Over Time",
		width=200
	)
	# Add the chart to the list of charts
	charts.append(chart)
# Combine all the charts into one chart and set the y-axis to be independent so that we can see all the changes even if the y axis scale is different for each commit type activity
alt.hconcat(*charts).resolve_scale(y='independent')

When we look at the distribution of commit activities (additions, deletions, and number of files changed), it does look like you have been working on the group GitHub repository somewhat consistently, which is good. I do think you might consider committing more frequently, especially as the final project wraps up (you can think of commits as almost a form of saving your work, so it's good to do it often), but again I leave that up to you all to decide.

### Issues and Project Management Over Time

In [4]:
# Once again melt the issues dataframe from wide to long format so that we can have all the issue dates in one column called issue_date
melted_issues_df = issues_df.melt(id_vars=['user.login', 'title', 'state', 'body', 'html_url', 'assignee', 'assignees_logins', 'labels', 'milestone', 'draft', 'comments', 'state_reason', 'closed_by.login', 'reactions.total_count', 'issue_duration', 'issue_associated_with_pull_request', 'pull_request.url', 'issue_status_on_project_board'], value_vars=['created_at', 'updated_at', 'closed_at'], var_name='issue_date_type', value_name='issue_date')

# Rename columns because Altair does not let us use '.' in the column names
melted_issues_df = melted_issues_df.rename(columns={
    'user.login': 'user_login', 
    'closed_by.login': 'closed_by_login'
})

# Sort the DataFrame by issue_date
melted_issues_df = melted_issues_df.sort_values(by='issue_date')

# Get the unique issue titles
issue_titles = melted_issues_df['title'].unique().tolist()

# Initialize an empty list to store the charts
charts = []
# Loop through each issue title to create a chart for the issue
for title in issue_titles:
    # Create a selection so that we can highlight the contributions of each group member
    selection = alt.selection_point(fields=['user_login'], bind='legend')
    
    # Filter the DataFrame for the current issue title
    subset_data = melted_issues_df[melted_issues_df['title'] == title]
    
    # Initialize a list to store subtitles for the chart
    subtitle = []
    
    # Check if the issue is associated with a pull request
    has_pr = subset_data[subset_data['issue_associated_with_pull_request'] == True]
    if not has_pr.empty:
        # If the issue is associated with a pull request, get the URL of the pull request and add it to the subtitle
        pr_url = subset_data['pull_request.url'].unique()[0]
        subtitle.append(f"Issue associated with a pull request ({pr_url}).")
    
    # Check if the issue is associated with a project board
    has_project_board = subset_data[subset_data['issue_status_on_project_board'].notna()]
    if not has_project_board.empty:
        # If the issue is associated with a project board, get the status of the issue on the project board and add it to the subtitle
        board_status = subset_data['issue_status_on_project_board'].unique()[0]
        subtitle.append(f"Issue is associated with a project board and has status {board_status}.")
    
    # If no subtitles were added, add a default message
    if not subtitle:
        subtitle = "Issue is not associated with a pull request or a project board."
    
    # Create a line chart for the current issue title
    chart = alt.Chart(subset_data).mark_line(point=True).encode(
        x='issue_date:T', # Use the issue date as the x-axis
        y=alt.Y('title', axis=alt.Axis(title=None, labels=False)), # Use the title as the y-axis and don't show the axis labels or title since we have the title in the chart title
        color='user_login:N', # Color the lines by the user login
        tooltip=[
            'user_login', 'closed_by_login', 'yearmonthdatehoursminutes(issue_date)', 'issue_date_type', 
            'title', 'body', 'state', 'assignee', 'assignees_logins', 'html_url'
        ], # Show the user login, closed by login, issue date, issue date type, title, body, state, assignee, assignees logins, and HTML URL in the tooltip
        opacity=alt.condition(selection, alt.value(1), alt.value(0.1)) # Set the opacity to 1 if the user login is selected and 0.1 if not
    ).add_params(selection).properties(
        # Set the title of the chart to be the issue title and the subtitle to be the subtitles we created
        title=alt.Title(f"Issue: {title}", subtitle=subtitle)
    )
    
    # Append the chart to the list of charts
    charts.append(chart)

# Concatenate all the charts vertically and resolve the x-scale to be shared
alt.vconcat(*charts).resolve_scale(x='shared')

Looking at your issues and project board, it is a bit concerning that only Ryan seems to be creating or working on any issues. Furthermore, I'm curious why you hadn't been using them prior to the end of October. I realize it might be difficult to learn a new interface, but having the issues and project board can be very helpful for keeping track of what needs to be done and what has been done. I would encourage you to try to use them more moving forward.

## Group Assignments to Date Feedback

The following feedback is for the three group assignments you have completed so far. Since there isn't clear documentation on who completed what activity, I am currently using the git history to assess contributions, but I am happy to adjust this assessment if you provide me with more information.

### Mass Digitization & Digital Libraries Assignment

- ~~I could not find a file that corresponds to this assignment in your repository. If you could please provide me with the file name or location, I would be happy to reassess this assignment.~~

Status: Incomplete

### Critical Cultural Data Explorations Assignment

- I believe your `Critical Cultural Data Exploration.txt` is the correct file for this assignment though please feel free to correct me if I missed something. Overall, I thought you collectively did a great job with this assignment, so well done! 
- I appreciated that you included links to all your sources, but you might consider reformatting the file as a Markdown file though it’s not required. It's just a bit easier to format and have the links render as links on GitHub. 
- I did find it a bit confusing which dataset was which based on your referents, so in future please use exact names (in this case it would be the Censorship Attacks vs UPENN datasets). With regards to the origins of the Censorship Attacks dataset, I found this information [https://www.everylibraryinstitute.org/book_censorship_database_magnusson](https://www.everylibraryinstitute.org/book_censorship_database_magnusson) which might be helpful for your final project. It appears the dataset is not anonymous and instead the author appears to have some affiliation with PEN America [https://pen.org/profile/tasslyn-magnusson/](https://pen.org/profile/tasslyn-magnusson/).
- I think you did a good job of identifying the differences between the two datasets, especially considering that one of them was a library catalogue database, which makes it a bit more difficult to compare. I thought your assessment of the potential bias was spot on, though I would add that you could also consider which books even end up catalogued in libraries in the first place, which is often a winnowing process since there are far more books published than preserved in libraries. Again, though, given your focus on banned books, I doubt there are many banned books so obscure they wouldn't be in a library, so my point is more of a general one.
- I thought your assessment of the AI tools was very good, and I appreciated your critical lens on them. I would be curious to know why the AI tool couldn't read the dataset, since it seems like it should be able to process a csv file. I think you might also consider what it knows about the books themselves though? That might give you a sense of a) how famous a book is and b) also one of the unintended consequences of the downstream effects of book bans (it could make some books more notable or elide over others depending on how these bans are reported).
- Based on the git history, it looks like Rebecca is the only one who worked on this file, so if you could confirm that you all divided the labor equitably then I'm happy to share the grade of full marks between all members of the group.

Status: Complete & Currently Full Marks for Rebecca (though all group members will receive the same grade if confirmed)

### Proprietary & Perspectival Dataset Creation Assignment

- It seems like you have two documents in your `proprietary-perspectival-dataset-creation` folder and that both were pushed up by Lhaye. Please let me know if I'm missing anything, and you will see I made a small alteration to get the screenshots to render correctly.
- Overall, I thought you did a fantastic job with this assignment! I thought you had a clear rationale for choosing Instagram, and I was impressed at the level of detail you went into regarding Meta's TOS and access limitations, especially the various APIs and how they can be used. I thought your assessment of the platform and how lawsuits and business interests are shaping it was spot on. I also appreciated the link to the Google Sheets, though I cannot see the version history. I thought your use of the new policy change around teenage accounts on Instagram was a great way of getting at the shifts in the TOS considering that the Wayback Machine was down.
- In terms of the dataset and reflection, I appreciated the link to the Google Sheets dataset, though again I cannot see the version history to see who did what, and you might consider simply adding that data to this GitHub repository. Overall, I thought you did a fantastic job with this document as well. I thought your `Documenting the Data` section was great and I really appreciated your consideration of user privacy in creating this data. I also appreciated the inclusion of data types with each column, and that you even researched the maximum caption length on Instagram. I would highly encourage you to do a similar write-up of relevant columns for your final project.
- I thought your choice of mood or feeling as the subjective element of the dataset made sense and I liked that you considered not only the completeness of this data, but also limitations of your current approach around empty captions or informational content, as well as the potential problems of not having a controlled vocabulary around mood. I thought your reflection section was excellent as well and I think choosing your own or your friends' accounts made sense. I particularly liked your consideration of how dynamics like sponsorship could influence this data. If you continued this dataset or restarted it for another class, you might consider tracing how sponsored content looks different (if it does at all) for you and your friends, though you are part of the same social circle. Regardless, your assessment of the grey area of ethics around collecting user-generated data and informed consent was great. The license looks great too, though it does need to be moved to the main part of your repo to show up as an overall license for the repository.

Status: Complete & Currently Full Marks for Lhaye (though all group members will receive the same grade if confirmed)

## Mid-Semester Dataset Update Feedback

### Dataset Feedback

Overall, the dataset is starting to look very promising! I appreciate that you cannot publicly release the data on GitHub, though that does limit how you version the dataset. I would encourage you to consider sharing the actual Google Sheets you're using since then I would be able to see the data entry process through the version history. We should also discuss how you might release aggregate data publicly in a way that makes some of your work publicly available while still protecting the authors of the books you are detailing.

As I said at the outset, I'm very impressed with how you are working around the challenges of not finding datasets about banned books, and the degree to which you all collectively worked to follow up on any potential leads on existing data. I realize this gap is somewhat surprising given the political and cultural importance of banned books, but unfortunately few organizations are incentivized or have the bandwidth to collect this data. Previously I did have students hand enter the top lists of banned books released each year by PEN America, but even determining how those lists were created can be tricky, so I mention it as a potential additional resource, though I do think you have quite a bit of data already. I was especially impressed that you conducted actual archival and digitization work (you might be the only group this semester to do so!) and I think that will be a very valuable contribution to the dataset.

I appreciate that you have worked very hard to combine two fairly different datasets and that you have considered thoughtfully what to include or exclude in terms of columns. I think your choice not to drop any columns is an excellent one, especially since most bibliographic data is quite sparse. However, I do think combining the two datasets into one file is not necessarily required. Instead, I would encourage you to think about keeping multiple datasets and merging them when needed. That would allow you to have more relevant documentation for each dataset that reflects the original data source. It would also make it easier for others to work with your data since they could choose to use one or the other dataset if they wanted. Finally, it would likely make it easier to add additional bibliographic metadata (potentially as a third dataset) to your existing data. If you want a relevant example of how to organize the data, I would point you to the "Against Cleaning" chapter that we read in class. As a small point, you have an `Unnamed: 0` column that you might want to remove (remember the Pandas method `DataFrame.to_csv` has an `index=False` parameter that can help with this).
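As a rough sketch of the two suggestions above (avoiding the stray index column, and keeping the sources as separate datasets that are merged only when needed), something like the following might work. The DataFrame names, column names (`title`, `ban_year`, `call_number`), and rows are all hypothetical placeholders rather than anything taken from your data.

```python
import io
import pandas as pd

# Hypothetical stand-ins for two source datasets; names, columns, and rows are invented
source_a_df = pd.DataFrame({"title": ["Book A", "Book B"], "ban_year": [1995, 2003]})
source_b_df = pd.DataFrame({"title": ["Book A", "Book C"], "call_number": ["X1", "X3"]})

# Writing without index=False also stores the row index, which reloads as "Unnamed: 0"
with_index = io.StringIO()
source_a_df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# Writing with index=False keeps only the real columns
without_index = io.StringIO()
source_a_df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

# Merge on demand: keep every row from both sources and record where each row came from
merged_df = source_a_df.merge(source_b_df, on="title", how="outer", indicator=True)
```

The `indicator=True` option adds a `_merge` column, which is one way to show which source each record came from without permanently fusing the files.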

I would also encourage you to have a bit more detail about what is in your dataset in the documentation. To be clear, I think your current documentation is excellent (more details below in the next section), but it would be helpful to have an overview of what you collectively consider the most useful columns and also any columns you have manually added. Finally, you might also consider adding a bit more detail to the data you are entering from digitized sources. Specifically, I would encourage you to include what source you are using, who is entering the data, and potentially a date of entry or update date to help document your process. Remember the process of creating data is worth documenting as well!
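One possible way to record that process, sketched under the assumption that you enter rows into a DataFrame (or a sheet exported to one), is to attach provenance columns at entry time. The helper `add_provenance`, the column names, and all values below are hypothetical and only meant to illustrate the idea.

```python
from datetime import date
import pandas as pd

# Hypothetical rows entered by hand from a digitized source; all values are invented
entries_df = pd.DataFrame({"title": ["Book A", "Book B"], "ban_year": [1995, 2003]})

def add_provenance(df, source, entered_by, entry_date=None):
    """Return a copy of df with columns recording the source, the person entering, and the date."""
    entry_date = entry_date or date.today()
    return df.assign(
        data_source=source,
        entered_by=entered_by,
        entry_date=entry_date.isoformat(),
    )

# Placeholder source label, initials, and date used purely for illustration
documented_df = add_provenance(
    entries_df,
    source="scanned record (placeholder)",
    entered_by="AB",
    entry_date=date(2000, 1, 1),
)
```

Because `assign` returns a new DataFrame, the original entries stay untouched, which keeps the raw transcription separate from the documentation layer.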

### Documentation Feedback

As I said above, your documentation is excellent, and I particularly appreciate the thoroughness with which you've detailed your choices and process.

I thought all the sections were great but particularly the `Challenges` section. I want to assure you that these obstacles, though frustrating, are relatively common, so learning how to adapt and pivot in the face of them is an excellent skill to have throughout your career! While book banning has a long history and was/is certainly popular internationally, you’re absolutely right that the practice has become much more popular since the culture wars of the 1990s. In terms of getting genre, ISBN, and other metadata, I may be able to help you get those items from WorldCat, though that will need to be through a collaboration with HathiTrust. However, you could also look directly to see if any of the banned books are in HathiTrust or the Internet Archive to gather that metadata (you will not be able to get full text, but the metadata could still be helpful!).

I also think the `Criteria` should go right into your final data essay. It’s excellent, and I appreciate your thoughtfulness towards what might bias this data or the confounds that shaped its production, a context that is absolutely crucial to understanding this data! I also found the `Content Description` section very helpful. For the final data essay, I would suggest that you include a bit more detail on your manual data creation process. You could, for example, include a page of the scanned record to help narrate your process, similar to this example from the *Shakespeare and Company Project* [https://shakespeareandco.princeton.edu/sources/cards/](https://shakespeareandco.princeton.edu/sources/cards/). This project might also be of interest since it looks at books as well, though primarily the lending behaviors of members of the Shakespeare and Company library in Paris during the 1920s and 1930s. Indeed, there are a number of potentially relevant projects for looking at books and libraries that might help you situate your project. You've already seen the *Responsible Datasets in Context Project*, which uses *HathiTrust* and *Internet Archive* data. You might also take a look at the data available from Goodreads, via the *Goodreads Social Graph* project, which has a lot of data on books and reading habits (available at [https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home](https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home)). Similarly, you might take a look at the *What Middletown Read* project (available here [https://lib.bsu.edu/wmr/](https://lib.bsu.edu/wmr/)) that looks at the reading habits of Muncie, Indiana in the late 19th and early 20th centuries. Finally, you might also take a look at the *Seattle Public Library* project (available here [https://spl.org/](https://spl.org/)) that has a lot of data on the books in the Seattle Public Library system. I do not expect you to use this data, though there may be some useful metadata or ideas for your project. More importantly, I think these projects will help you think about who might use this data and also see some examples of existing datasets to inform how you might structure your data to be useful to others.

Finally, I appreciate your thoughtful division of labor and think your choices make sense. I will need confirmation that this is accurate from all group members to share the grade between all of you since again I can only see Ryan's adding of the `README` file currently in the git history. 

If you decide you want to try something like working with HathiTrust APIs to get metadata, I would be happy to help you set up that pipeline, so please don't hesitate to reach out. Similarly, for the final project you might consider mapping as a helpful visualization in the documentation. If you do decide to go that route, I would encourage you to consider two Python libraries: for geocoding, you might consider `geopy` [https://geopy.readthedocs.io/en/stable/](https://geopy.readthedocs.io/en/stable/) which lets you turn place names (such as state names) into geographic coordinates, and for mapping you might consider `leafmap` [https://leafmap.org/](https://leafmap.org/) which builds on libraries like `folium` and makes it easier to create maps. This is just a suggestion though, and I think you have a lot of great ideas already!

### Progress from Initial Proposal Feedback

In the initial proposal, Jess had asked you to address the following questions as feedback to consider (paraphrased):

- [ ] How will you clarify and refine your research question? Are you planning to argue that the number of banned books tracks with “significant cultural and political turnover” or are you comparing the data of banned books in the past with the data from the 21st century? 
- [ ] How will you be collecting data? You mentioned web scraping a few times, but it is unclear why it is needed for this project. What data will you be scraping?
- [ ] You note that you will be taking data from the ALA and UIUC archives. Will your data focus only on American book banning or international?

My current assessment is that you have largely completed these tasks, though I would encourage you to consider the following as you move towards completing the final project:

First, I think you could experiment and consider a bit more how best to structure the data, as well as how to document the various data sources you are using. Second, I think you could include a bit more of your process in the final dataset and documentation, especially since you have done some very interesting archival work. Finally, I think you could consider how to make the data more accessible to others, especially since you are not able to share the raw data. One solution might be aggregate counts or simply visualizations, but it would be helpful to have some public aspect to the data, so I'm happy to help you brainstorm.
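To make the aggregate counts idea a bit more concrete, here is a minimal sketch, assuming your private data can be loaded into a DataFrame. The column names (`state`, `ban_year`, `title`, `author`) and every row are hypothetical; the point is only that the public table carries counts and leaves out titles and authors.

```python
import pandas as pd

# Hypothetical private records; titles, authors, states, and years are invented
private_df = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "author": ["Author 1", "Author 2", "Author 3", "Author 1"],
    "state": ["Illinois", "Illinois", "Texas", "Texas"],
    "ban_year": [1995, 1995, 2003, 2003],
})

# Count records per state and year; titles and authors are not carried into the result
aggregate_df = (
    private_df.groupby(["state", "ban_year"])
    .size()
    .reset_index(name="record_count")
)
```

You would still want to check whether very small counts could identify an individual book or author before releasing a table like this, which is something we can talk through together.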

I think you are on the right track with your project, but I think addressing some of the considerations I have outlined above will help you to make sure you are on target to meet the project goals and requirements. Please feel free to reach out to me if you have any questions or concerns about my feedback or how to proceed with the project.

### Final Grade

Your grade for this assignment is currently an A, and I am excited to see what you do in the remaining weeks! This grade will apply to everyone named in the `Responsibility and Contributions` section (so Ryan, Rebecca, Avery, Lhaye, and Henry) once all group members have confirmed that the division of labor listed is accurate and equitable.