## Group Project for DATA 11900 - Winter Quarter 2024

### Deadlines: 

- project proposal: Friday, February 2 at 5:00pm
- group presentation: TBD Week 8-9
- final report and notebook: Friday, March 1 at 5:00pm

The goal of the project is to go through the complete data science process to answer questions you have about some topic of your own choosing. You will acquire the data, design your visualizations, run statistical analysis, and communicate the results.

This is a group project where group membership assignments are made by the instructor.  I recognize that individual schedules and other constraints might limit your ability to work in a team. If this the case, please let me know immediately. In general, I anticipate that all members of a group will receive the same score. However, I reserve the right to assign different scores to each group member based on peer assessments of effort and contribution.

### Deliverables:

- **Project proposal**: one paragraph discussing the project goals. This will not be used for the project score - it is a way to check what the instructors think about your ideas. Please indicate in the proposal the source(s) of the data used for the project. The proposal file should also have the group number and the names of the students in the group.
- **Presentation**: Each group will present to the instructor, TAs and the rest of the class during the last week(s) of class. The presentation will be a 5 min **lightning talk** using slides. Each team member must present at least one slide (not included the title slide)
- **Report**: The report should reflect comments and feedback received during presentation. **The reports should be at most five pages long.**
- **Notebook**: a high-quality and readable Python Jupyter notebook. You should strive for doing things the right way and think about aspects such as reusability etc. We also expect you to document your code.


### Report Objectives

The final report should cover these aspects:

-    Overview and Motivation: Provide an overview of the project goals and the motivation for it. 
-    Related Work: Anything that inspired you, such as a paper, a newspaper/magazine article etc.
-    Initial Questions: What questions are you trying to answer? How did these questions evolve over the course of the project? What new questions did you consider in the course of your analysis?
-    Data: Source, scraping method, cleanup, etc.
-    Exploratory Analysis: What visualizations did you use to look at your data in different ways? What are the different statistical methods you considered? Justify the decisions you made. How did you reach these conclusions?
-    Final Analysis: What did you learn about the data? How did you answer the questions? How can you justify your answers?
-     Group member contributions: Please state at the end the role each member of the team had in the project.

### Report Quality

Similarly to the project in DATA 11800, the quality of the writing, tables and figures is very important. Make sure that **the report does not exceed 5 pages** (including references). 

Figures should meet the following standards:
- Must be clearly labeled and referenced by those labels
- Axes should be labeled and informative
- Try to only include the most relevant plots (10 of the same kind of plot of a slight variation of the same information isn't very interesting)
- Do not include too much information in a single plot (for example, if you compute some covid or crime metric for each community area in Chicago, you should not display that metric in one giant barplot with all community areas listed. We will take points off for things like this). 
- Be creative! We love good vizualizations. 


###  Data Examples

The following are some of the sources we use for data. You can use data there or you can use them for inspiration for project ideas. Using **multiple datasets** could enhance the analysis.
 
Google Dataset search:
https://datasetsearch.research.google.com/

https://blog.google/products/search/discovering-millions-datasets-web/
 
CDC:
https://data.cdc.gov/browse

500 cities:
https://www.cdc.gov/500cities/index.htm

UN:
http://data.un.org/

Kaggle:
https://www.kaggle.com/datasets

AWS:
https://registry.opendata.aws/

FEC:
https://www.fec.gov/

FiveThirtyEight:
https://github.com/fivethirtyeight/data

In [2]:
# Don't change this cell; just run it. 
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
    
from sklearn.linear_model import LinearRegression, Lasso, LassoCV, Ridge, RidgeCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

In [3]:
df = pd.read_csv('acgr23.txt', sep='\t')
df

Unnamed: 0,AcademicYear,AggregateLevel,CountyCode,DistrictCode,SchoolCode,CountyName,DistrictName,SchoolName,CharterSchool,DASS,...,SPED Certificate (Count),SPED Certificate (Rate),GED Completer (Count),GED Completer (Rate),Other Transfer (Count),Other Transfer (Rate),Dropout (Count),Dropout (Rate),Still Enrolled (Count),Still Enrolled (Rate)
0,2022-23,C,1,,,Alameda,,,All,All,...,60,0.7,1,0,31,0.4,511,6,245,2.9
1,2022-23,C,1,,,Alameda,,,All,All,...,112,1.2,1,0,40,0.4,788,8.6,326,3.6
2,2022-23,C,1,,,Alameda,,,All,All,...,0,0,0,0,0,0,3,7.9,2,5.3
3,2022-23,C,1,,,Alameda,,,All,All,...,40,0.9,1,0,2,0,100,2.2,45,1
4,2022-23,C,1,,,Alameda,,,All,All,...,28,1.8,0,0,12,0.8,147,9.4,113,7.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
113966,2022-23,T,0,,,State,,,Yes,Yes,...,2,0.3,1,0.1,19,2.8,270,39.1,231,33.5
113967,2022-23,T,0,,,State,,,Yes,Yes,...,4,0.1,15,0.5,57,2,1202,41.8,931,32.4
113968,2022-23,T,0,,,State,,,Yes,Yes,...,0,0,0,0,1,0.7,37,25.9,58,40.6
113969,2022-23,T,0,,,State,,,Yes,Yes,...,20,0.1,129,0.6,341,1.6,8225,39.7,6405,30.9


In [4]:
df.fillna(-1, inplace=True)
ndf = df.loc[df['SchoolName'] != -1]
new_df = ndf.drop(columns=['Golden State Seal Merit Diploma (Count)',
       'Golden State Seal Merit Diploma (Rate', 'CHSPE Completer (Count)', 'Adult Ed. HS Diploma (Count)',
       'Adult Ed. HS Diploma (Rate)',])
new_df[['SchoolName','ReportingCategory','CohortStudents','Regular HS Diploma Graduates (Rate)']]

Unnamed: 0,SchoolName,ReportingCategory,CohortStudents,Regular HS Diploma Graduates (Rate)
8661,District Office,GF,192,65.6
8662,District Office,GM,181,53.6
8663,District Office,RA,16,93.8
8664,District Office,RB,116,58.6
8665,District Office,RD,*,*
...,...,...,...,...
113791,Wheatland Union High,SF,*,*
113792,Wheatland Union High,SH,*,*
113793,Wheatland Union High,SM,*,*
113794,Wheatland Union High,SS,184,96.7


In [None]:
Visualizaion ideas: 