**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update/change the relevant sections where you lost those points, make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names

- Kaitlyn Chou
- Aria Gross Villa
- Alera Ojomoh-dermody
- Isaac Roberts
- Angelina Vo

# Research Question

-  Include a specific, clear data science question.
-  Make sure what you're measuring (variables) to answer the question is clear

What is your research question? Include the specific question you're setting out to answer. This question should be specific, answerable with data, and clear. A general question with specific subquestions is permitted. (1-2 sentences)



## Background and Prior Work


Standardized tests can act as a lens through which educators and policymakers can understand various aspects of the educational system, identifying both strengths and areas in need of improvement. Analyzing standardized test results alongside teacher credentials, resource allocation, and instructional strategies provides a comprehensive view of the educational environment, fostering efforts to improve student achievement and success. The No Child Left Behind (NCLB) Act mandated that schools employ qualified educators who are certified in their state, hold a bachelor's degree, and demonstrate subject knowledge. <a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) With the replacement of NCLB by the Every Student Succeeds Act (ESSA) in 2015, schools, or Local Education Agencies, are now obligated to report on the qualifications of educators, including the percentage of 'inexperienced' teachers. <a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) 

Numerous research papers have explored the influence of libraries on individual students' educational attainment. Across the three initial research sources we examined, student success was measured and defined using metrics such as GPAs, retention rates, and qualitative interviews, shedding light on students' perspectives regarding the significance of public libraries in their academic journey. Just as GPA and retention rates provide insights, CAASPP test scores in ELA and mathematics offer pathways to explore the connection between public library utilization and measurable academic achievement. While standardized testing is not the sole dependable method for gauging student progress, it remains commonly utilized to assess college readiness and academic advancement. 

The literature asserts that public libraries serve as invaluable resources for students from diverse socioeconomic backgrounds, offering access to a wide range of written, verbal, and digital materials. For students lacking reliable technology, libraries provide a conducive environment for studying, asking questions, and engaging in group activities, all within a quiet and secure setting. Meanwhile, school librarians act as threads, linking students and teachers to reliable, consistently accurate instruction. Libraries not only supply tangible resources to all community members but can also serve as academic equalizers for low-income students and students of color. A study of Australian public libraries' impact on students living in regional or remote locations found that these students benefited significantly from using library resources.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) This was attributed to their access to qualified and informed assistance from librarians, as well as the diversity of materials available in libraries. A study conducted at the University of Northern Colorado from fall 2017 to spring 2018 found that a higher proportion of students of color and students eligible for Pell grants used library resources compared to their white or non-eligible counterparts.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) The University of Northern Colorado study posed several empirical questions related to student success, which could be investigated using the metrics mentioned earlier. Questions include: “How does the use of specific library services correlate with persistence for undergraduate students?” and “Is there a positive correlation between the number of uses of library services and academic achievement for undergraduate students?” <a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) 

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Mayer, Jennifer; Dineen, Rachel; Rockwell, Angela; and Blodgett, Jayne. 'Undergraduate Student Success and Library Use: A Multimethod Approach.' University Libraries Faculty Publications, 2020. Available at: https://digscholarship.unco.edu/libfacpub/96 
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Power, Emma et al. ''Working Together': Public Libraries Supporting Rural, Regional, and Remote Low-Socioeconomic Student Success in Partnership with Universities.' Journal Name, vol. [volume number], no. [issue number], 2019, pp. 105-125. DOI: https://doi.org/10.1080/24750158.2019.1608497
3. <a name="cite_note-3"></a> [^](#cite_ref-3) Tomasetti, Elena T. 'Factors That Contribute to a Score Increase from Eighth to Eleventh Grade on the CAASPP Mathematics Exam at Red Bluff High School.' Master's thesis, California State University, Chico, 2022.

# Hypothesis


We hypothesize that, across California counties, higher rates of activity and participation in public libraries are positively correlated with higher CAASPP (California Assessment of Student Performance and Progress) test scores in both ELA and mathematics. Children who regularly visit libraries have more exposure to reading, routine studying, tutoring, and the community events that libraries often hold. We expect this exposure to strengthen reading comprehension and related skills, leading those children to score higher on California standardized tests than children who do not visit libraries regularly.
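
One way this hypothesis could be checked is sketched below. This is a minimal illustration rather than our final analysis: the two small tables hold made-up numbers, the score column names (`Mean ELA Score`, `Mean Math Score`) and the helper `visits_score_correlation` are hypothetical, and normalizing visits by service-area population is an assumption about how "rate of activity" might be measured.

```python
import pandas as pd

# Made-up county-year library figures, for illustration only
library_toy = pd.DataFrame({
    'County': ['A', 'B', 'C', 'D'],
    'Year': [2019, 2019, 2019, 2019],
    'Library Visits': [400_000, 90_000, 1_500_000, 30_000],
    'Population of The Legal Service Area': [200_000, 100_000, 500_000, 60_000],
})

# Made-up county-year mean scale scores (hypothetical column names)
scores_toy = pd.DataFrame({
    'County': ['A', 'B', 'C', 'D'],
    'Year': [2019, 2019, 2019, 2019],
    'Mean ELA Score': [2510.0, 2480.0, 2535.0, 2470.0],
    'Mean Math Score': [2500.0, 2475.0, 2520.0, 2465.0],
})

def visits_score_correlation(library_df, scores_df, score_col):
    """Pearson correlation between library visits per capita and a score column."""
    # keep only county-year pairs present in both tables
    merged = library_df.merge(scores_df, on=['County', 'Year'], how='inner')
    merged['Visits per Capita'] = (
        merged['Library Visits'] / merged['Population of The Legal Service Area']
    )
    return merged['Visits per Capita'].corr(merged[score_col])

ela_r = visits_score_correlation(library_toy, scores_toy, 'Mean ELA Score')
math_r = visits_score_correlation(library_toy, scores_toy, 'Mean Math Score')
print(f"ELA r = {ela_r:.2f}, math r = {math_r:.2f}")
```

A positive coefficient on the real data would be consistent with the hypothesis, although a correlation alone would not show that library use causes higher scores.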

# Data

## Data overview

For each dataset include the following information
- Dataset #1
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
- Dataset #2 (if you have more than one!)
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
- etc

Now write 2 - 5 sentences describing each dataset here. Include a short description of the important variables in the dataset; what the metrics and datatypes are, what concepts they may be proxies for. Include information about how you would need to wrangle/clean/preprocess the dataset

If you plan to use multiple datasets, add a few sentences about how you plan to combine these datasets.

## California Public Libraries Statistics

In [1]:
import pandas as pd
import numpy as np

# canonical column names shared by every fiscal year
variables = ['County', 'Population of The Legal Service Area', 
        'Registered Users as of June 30', 'Children Borrowers', 
        'Library Visits', 'Total Operating Income', 'Circulation of Childrens Materials',
        "# of Children's Programs", "Children's Program Attendance"]

def load_library_year(path, year, source_columns=variables):
    """Read one fiscal year, keep the columns we need, and convert the counts to floats."""
    df = pd.read_csv(path, encoding='latin-1')
    # select only the columns we need (copy to avoid SettingWithCopyWarning) and rename them
    df = df[source_columns].copy()
    df.columns = variables
    # strip '$' and ',' as literal characters (regex=False, because '$' is a regex anchor);
    # astype(str) lets this work even if pandas already parsed a column as numeric
    for col in variables[1:]:
        cleaned = df[col].astype(str).str.replace('$', '', regex=False).str.replace(',', '', regex=False)
        # entries that still cannot be parsed as numbers become NaN
        df[col] = pd.to_numeric(cleaned, errors='coerce').astype(np.float64)
    df['Year'] = year
    return df

# FY16-17 through FY18-19 already use the canonical column names
library_17 = load_library_year('./data/All-DataFY16-17.csv', 2017)
library_18 = load_library_year('./data/All-DataFY17-18.csv', 2018)
library_19 = load_library_year('./data/All-DataFY18-19.csv', 2019)

# FY20-21 and FY21-22 prefix most column names with question numbers
columns_21 = ['1.35 County', '2.1 Population of The Legal Service Area', 
        '2.2 Registered Users as of June 30', '2.3 Children Borrowers', '7.2 Library Visits', 
        '3.5 Total Operating Income', '7.12 Circulation of Childrens Materials', 
        "# of Children's Programs (calculated)", "Children's Program Attendance"]
library_21 = load_library_year('./data/All-DataFY20-21.csv', 2021, columns_21)

columns_22 = ['1.36 County', '2.1 Population of The Legal Service Area', 
        '2.2 Registered Users as of June 30', '2.3 Children Borrowers', '7.2 Library Visits', 
        '3.5 Total Operating Income', "7.11 Circulation of Children's Materials", 
        "# of Children's Programs (calculated)", "Children's Program Attendance"]
library_22 = load_library_year('./data/All-DataFY21-22.csv', 2022, columns_22)

libraries = pd.concat([library_17, library_18, library_19, library_21, library_22], ignore_index=True)
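
The conversion step can be tried on its own with a few sample strings. The sketch below uses made-up values and a hypothetical helper name, `to_float`. In a regular expression `$` is an end-of-string anchor, so depending on the pandas version a bare `str.replace('$', '')` may be read as a pattern rather than a literal dollar sign; passing `regex=False` avoids that ambiguity.

```python
import pandas as pd

# Made-up sample values in the style of the raw library columns
raw = pd.Series(['$1,234', '567', '$12,000.50', None])

def to_float(series):
    """Strip literal '$' and ',' characters, then convert to float64."""
    cleaned = (series.astype(str)
                     .str.replace('$', '', regex=False)
                     .str.replace(',', '', regex=False))
    # anything that is still not a number (such as a missing entry) becomes NaN
    return pd.to_numeric(cleaned, errors='coerce').astype('float64')

converted = to_float(raw)
print(converted)
```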

In [2]:
# sort rows by county and combine all rows in the same county
libraries = libraries.sort_values(by='County')
agg_functions = {'Population of The Legal Service Area': 'sum', 'Registered Users as of June 30': 'sum', 'Children Borrowers': 'sum', 
                 'Library Visits': 'sum', 'Total Operating Income': 'sum', 'Circulation of Childrens Materials': 'sum', 
                 "# of Children's Programs": 'sum', "Children's Program Attendance": 'sum'}
new_libraries = libraries.groupby(['County','Year']).aggregate(agg_functions)

In [3]:
new_libraries

| County | Year | Population of The Legal Service Area | Registered Users as of June 30 | Children Borrowers | Library Visits | Total Operating Income | Circulation of Childrens Materials | # of Children's Programs | Children's Program Attendance |
|---|---|---|---|---|---|---|---|---|---|
| Alameda | 2017 | 1645359.0 | 1144802.0 | 211353.0 | 7943160.0 | 100332971.0 | 6808663.0 | 14923.0 | 494780.0 |
| Alameda | 2018 | 1660202.0 | 1166499.0 | 212589.0 | 7701703.0 | 107204293.0 | 6769157.0 | 14790.0 | 504045.0 |
| Alameda | 2019 | 1669301.0 | 1207722.0 | 233130.0 | 7663199.0 | 120455980.0 | 7223460.0 | 13432.0 | 477303.0 |
| Alameda | 2021 | 1656591.0 | 1274668.0 | 193807.0 | 321287.0 | 129778074.0 | 3394211.0 | 1217.0 | 33399.0 |
| Alameda | 2022 | 1651979.0 | 1201528.0 | 183693.0 | 3379739.0 | 136050987.0 | 6686492.0 | 5678.0 | 106002.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Yuba | 2017 | 74577.0 | 37350.0 | 5573.0 | 54647.0 | 410172.0 | 38717.0 | 477.0 | 4032.0 |
| Yuba | 2018 | 74727.0 | 36651.0 | 2690.0 | 59033.0 | 506932.0 | 21688.0 | 489.0 | 4891.0 |
| Yuba | 2019 | 77916.0 | 37978.0 | 2578.0 | 58208.0 | 714484.0 | 19155.0 | 354.0 | 4589.0 |
| Yuba | 2021 | 79407.0 | 37795.0 | 1835.0 | -1.0 | 843125.0 | 4354.0 | 145.0 | 7735.0 |
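
The value of -1.0 under Library Visits for Yuba in 2021 looks like a placeholder for a figure that was not reported rather than a real count. That reading is an assumption and should be checked against the dataset's documentation. If it holds, summing such placeholders with real counts would distort county totals, so they could be masked before aggregating. The sketch below uses made-up rows and a hypothetical helper, `sum_ignoring_sentinels`.

```python
import pandas as pd

# Made-up rows: county X has one real count and one placeholder, county Y only a placeholder
toy = pd.DataFrame({
    'County': ['X', 'X', 'Y'],
    'Year': [2021, 2021, 2021],
    'Library Visits': [1200.0, -1.0, -1.0],
})

def sum_ignoring_sentinels(df, value_cols, sentinel=-1):
    """Treat the sentinel as missing, then sum per county and year."""
    masked = df.copy()
    # replace sentinel entries with NaN so they do not enter the sums
    masked[value_cols] = masked[value_cols].mask(masked[value_cols] == sentinel)
    # min_count=1 keeps a group's total as NaN when every entry in it is missing
    return masked.groupby(['County', 'Year'])[value_cols].sum(min_count=1)

result = sum_ignoring_sentinels(toy, ['Library Visits'])
print(result)
```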


## California Assessment of Student Performance and Progress (CAASPP) Smarter Balanced Assessments

In [6]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION 

# read in score csv
caaspp_17 = pd.read_csv('sb_ca2017_1_csv_v2.csv')
# select variables we need
# drop rows with 11th grade and 13th grade 
# make separate columns for mean scale score math and reading
# combine rows with same county code; have 58 rows

# read in school code csv
codes_17 = pd.read_csv('sb_ca2017entities_csv.csv')
# select variables we need
# keep only first unique county code; have 58 rows

# add county name column to caaspp 
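
The steps outlined in the comments above could look roughly like the sketch below. It runs on small made-up tables instead of the real files, and several details are assumptions that need to be checked against the CAASPP file layout: the column names (`County Code`, `Grade`, `Test Id`, `Mean Scale Score`, `County Name`), the coding of `Test Id` (1 for ELA, 2 for mathematics), and the use of `*` for suppressed scores. The helper name `county_mean_scores` is hypothetical. It also takes an unweighted mean of row-level mean scores, which is a simplification; weighting by the number of students tested may be more appropriate.

```python
import pandas as pd

# Made-up stand-in for the score file (assumed column names)
scores = pd.DataFrame({
    'County Code': [1, 1, 1, 1, 2, 2, 2, 2],
    'Grade': [3, 3, 11, 13, 4, 4, 5, 5],
    'Test Id': [1, 2, 1, 1, 1, 2, 1, 2],
    'Mean Scale Score': ['2400.0', '2420.0', '2600.0', '2500.0',
                         '2450.0', '2460.0', '*', '2480.0'],
})

# Made-up stand-in for the entities file (assumed column names)
entities = pd.DataFrame({
    'County Code': [1, 1, 2],
    'County Name': ['County A', 'County A', 'County B'],
})

def county_mean_scores(scores, entities):
    """One row per county with separate mean ELA and math scale scores."""
    # drop rows with 11th grade and 13th grade
    df = scores[~scores['Grade'].isin([11, 13])].copy()
    # non-numeric entries such as '*' become NaN and are skipped by the mean
    df['Mean Scale Score'] = pd.to_numeric(df['Mean Scale Score'], errors='coerce')
    # make separate columns for ELA and math, combining rows with the same county code
    wide = df.pivot_table(index='County Code', columns='Test Id',
                          values='Mean Scale Score', aggfunc='mean')
    wide = wide.rename(columns={1: 'Mean ELA Score', 2: 'Mean Math Score'})
    wide.columns.name = None
    # keep only the first row per county code, then add the county name
    names = entities.drop_duplicates('County Code')[['County Code', 'County Name']]
    return wide.reset_index().merge(names, on='County Code', how='left')

county_scores = county_mean_scores(scores, entities)
print(county_scores)
```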

# Ethics & Privacy

The ideal data we have proposed is likely to raise some ethics and privacy concerns. Given our research question, one of our primary concerns is the privacy of the children whose data we will be using. The data we wish to utilize tracks participation at public libraries as well as student test scores. We understand that while parents or guardians may have consented to the collection of their children's data, it is unlikely that the children themselves can offer truly informed consent to have their data collected and used. For this reason, we propose to use only data that excludes all personally identifying information other than the age of the children. Additionally, in our analysis, we aim to use this data only in aggregate, never singling out a single data point for commentary on that individual. Another privacy concern hinges on the fact that some of our data includes attendance figures for children's programs at public libraries. To protect the privacy and safety of the children who attend these programs, we will withhold information about the times and locations of the programs and will consider using the program data only in aggregate by county. Lastly, considering that these libraries are public institutions, the data must be safeguarded through strong security measures to prevent hacking, data leaks, and other security breaches.

The ethics concerns we have for our data, however, extend beyond just privacy. One issue that might arise from the data is the stigmatization of certain communities or school systems. Since we include performance data on standardized tests as well as library attendance habits, viewers may interpret these as metrics to profile certain groups or communities. In addition, inherent biases can impact the validity and fairness of research findings, which may result in overrepresentation or underrepresentation of particular demographics and socioeconomic groups. Lastly, when collecting data within communities, it is imperative to implement responsible measures and practices that leave the community feeling safe and at ease. It is important in our analysis to remind readers that standardized test scores are not a holistic measure of an individual's capabilities nor is library attendance an accurate means to assess a community's habits.

# Team Expectations 


* Each team member will attend our weekly group meetings unless they indicate they cannot attend beforehand.
* Each team member will communicate any concerns, questions, or ideas with the other team members through our designated group chat or during our group meetings.
* Each team member will complete their designated tasks by the deadlines determined by the team unless they communicate that they need more time to complete them.
* Each team member will contribute equally across the entire project submission.
* Each team member will regularly check in with each other to ensure good progress is being made toward the final project.

# Project Timeline Proposal


| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/08  |  7:00 PM | Read through COGS 108 Team Policies; look for data sets about school rankings, student academic performance  | Discuss **Project Proposal**. Complete Team Expectations and Project Timeline Proposal. Assign the rest of the sections of the Project Proposal to the team members. | 
| 2/11  |  **By 10:00 PM** |  **Project Proposal Submission** |  | 
| 2/13  | 6:00 PM  | Complete assigned Project Proposal section and review group work for submission. | Discuss **Data** and Project Proposal feedback.   |
| 2/20  | 6:00 PM  | Import and wrangle data. | Discuss Project Proposal feedback and finalize **Data**. Discuss **EDA** and project tasks.   |
| **2/25 DEADLINE**  | **Before 11:59 PM**  | **Checkpoint #1: DATA** |  |
| 2/27  | 6:00 PM  | Submit Data and brainstorm **EDA** approach.  | Finalize **EDA** plan and assign project **tasks**. |
| 3/05  | 6:00 PM  | Complete **EDA**, plan assigned tasks, outline of Analysis.  | Review **EDA**, discuss **analysis**, and finalize task plans. |
| **3/10 DEADLINE**  | **Before 11:59 PM**  | **Checkpoint #2: EDA** |  |
| 3/12  | 6:00 PM  | Complete Analysis and draft of results, discussion, and conclusion. | Review drafts and edit the full project draft. Plan the **video**. |
| 3/19  | 6:00 PM  | Prepare the video script and presentation.| Review presentation and film **final video**. Submit **Final Report**. |
| **3/20 DEADLINE**  | **Before 11:59 PM**  | **DUE Final Report, Final Video, Team Evaluation, & Post Course Survey (EC)** |  |