In [4]:
import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

## Midterm Project for DATA 11800 Section 1 - Autumn 2023 (due November 10)

This is not a group project. You may discuss the project with other students but you should code and write the report independently. You should acknowledge any help in writing. The score will be based on:
-  Clarity and soundness of the arguments and conclusions; 
-  Use of data to back up arguments and analysis quality;  
-  Insightfulness of the results;
-  Quality of the data visualizations, summaries used, and overall presentation. Make sure you use headings, captions for figures and tables, etc. When you interpret a graph or data from a table, you should clearly specify which figure or table you refer to.

For this project, you are tasked to use the [National Crime Victimization Survey (NCVS)](https://www.icpsr.umich.edu/web/NACJD/studies/38090/summary). The NCVS has gathered data about personal and household crimes since 1973. The primary goals of the survey are to collect information about the victims, to explore the consequences of crime, and to estimate the number and types of crimes that go unreported.

Information about the study and data can be found [here](https://drive.google.com/file/d/1cftlGhPkRdITPaKTTKgHH75f1GQYy87V/view?usp=sharing).

The goal of this project is to gain insight into crime victimization using the tools you have learned so far in this class.

### The Data 

You can (but are not required to) use any additional data you can find to get insight into this issue, but you need to specify the provenance of that data in your report. Some potential options include:

- NCVS Series and Supplements, https://www.icpsr.umich.edu/web/NACJD/series/95
- Annual Survey of Jails 2020, https://www.icpsr.umich.edu/web/NACJD/studies/38408
- National Prisoner Statistics, https://www.icpsr.umich.edu/web/NACJD/studies/38249
- American Community Survey, https://www.census.gov/programs-surveys/acs/

There is a plethora of data collected in the NCVS; we have cleaned and selected a portion of the data for your use in this project. That data is provided on Canvas (the `NCVS_2020.csv` file). Also provided on Canvas are a codebook (an Excel file containing information about each variable) created by your instructors and a codebook from the study itself, which gives additional information about the data and data collection methods that will be useful for answering some of the questions below. The preprocessed file has 8044 lines (the first line holds the column names, followed by 8043 rows of data) and 81 columns.

In [5]:
# read the data - make sure you specify the proper path to the file
proj_df=pd.read_csv('NCVS_2020.csv')
proj_df.shape

(8043, 81)

In [6]:
# the first 10 rows (use proj_df.sample(10) instead for a random sample of 10 rows)
proj_df.head(10)

Unnamed: 0,YEARQ,IDHH,ICPSR,PANEL_ROT_GROUP,URBANICITY,LIV_TYPE,UNITS,OUTSIDE,GATED,RESTRICTED,...,ACTIVE_DUTY,JOB_WEEK,JOB_6MO,JOB_2WEEK,JOB,EMP_TYPE,JOB_LOC,JOB_COLLEGE,ATT_COLLEGE,NUM_INCIDENTS
0,2020.1,1.60201e+24,9,37,2,1,6,1,2,2,...,1,1,9,9,11,3,1,2,4,1
1,2020.1,1.60207e+24,9,31,1,12,6,2,2,2,...,1,2,1,1,23,1,3,2,1,2
2,2020.1,1.60207e+24,9,31,1,12,6,2,2,2,...,1,2,1,1,23,1,3,2,1,2
3,2020.1,1.60207e+24,5,21,2,1,1,9,2,2,...,1,1,9,9,12,3,1,1,4,1
4,2020.1,1.60207e+24,5,37,2,1,1,9,2,2,...,4,1,9,9,26,1,4,2,4,1
5,2020.1,1.60207e+24,8,21,2,1,1,9,2,2,...,4,1,9,9,27,1,1,2,4,7
6,2020.1,1.60207e+24,8,21,2,1,1,9,2,2,...,4,1,9,9,27,1,1,2,4,7
7,2020.1,1.60207e+24,8,21,2,1,1,9,2,2,...,4,1,9,9,27,1,1,2,4,7
8,2020.1,1.60207e+24,8,21,2,1,1,9,2,2,...,4,1,9,9,27,1,1,2,4,7
9,2020.1,1.60207e+24,8,21,2,1,1,9,2,2,...,4,1,9,9,27,1,1,2,4,7


### The Assignment

#### Report on your findings about victimization. 
Imagine you are serving as a consultant who wants to recommend directions for future research, propose modifications in public policy, or suggest how one can reduce victimization and its consequences.

You must submit two files: 

1. The Jupyter Notebook that contains all the code you use for the analysis. You do not need to submit data you used, but just indicate how you obtained it in the Notebook.

2. A report of your findings **(in a .pdf file). This report should be at most 4 pages long including references.** Use data visualization and data summaries to justify your conclusions. Note that the page limitation means you will not show all analyses and plots you will make - select carefully what you think is most relevant.

The report should address the following points:

A.  **Introduce the dataset**.  Describe the data. Where does it come from? Why was it collected (what are the researchers interested in studying)? Was it an experiment? A retrospective observational study? A prospective observational study? Describe the sampling process. How many variables are there? List a few. How many observations (i.e., rows)? How many distinct households? Using what you have learned about data collection, is this a biased or unbiased sample? Why?

The National Crime Victimization Survey (NCVS), sponsored by the Bureau of Justice Statistics (BJS), is used to estimate the frequency and characteristics of criminal victimization in the United States. When calculating NCVS estimates, researchers must take into account the complex stratified, multistage sample design and resulting analysis weights. Stratification, clustering, and variation in analysis weights all affect the variances of survey parameters, and inappropriately accounting for these factors during estimation can lead to invalid results (Cochran, 1977).
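
To make the role of analysis weights concrete, here is a minimal sketch comparing an unweighted mean with a weighted one. It assumes the data has a weight column; the column name `WEIGHT`, the helper `weighted_mean`, and the toy values are all hypothetical, and whether the course file `NCVS_2020.csv` includes a weight column at all would need to be checked in the codebook. The sketch only covers point estimates: it does not account for stratification or clustering, so it says nothing about variances.

```python
import numpy as np
import pandas as pd

def weighted_mean(values, weights):
    """Weighted mean, skipping rows where the value or the weight is missing."""
    values = pd.Series(values, dtype="float64")
    weights = pd.Series(weights, dtype="float64")
    keep = values.notna() & weights.notna()
    return float(np.average(values[keep], weights=weights[keep]))

# Toy data (made-up values): the first row represents many more households.
toy = pd.DataFrame({
    "NUM_INCIDENTS": [1, 2, 7],
    "WEIGHT": [3000.0, 1000.0, 1000.0],  # hypothetical weight column
})

unweighted = toy["NUM_INCIDENTS"].mean()
weighted = weighted_mean(toy["NUM_INCIDENTS"], toy["WEIGHT"])
print(f"unweighted: {unweighted:.2f}, weighted: {weighted:.2f}")
```

In the toy data the heavily weighted row has the fewest incidents, so the weighted mean comes out lower than the unweighted one. If the project data carries no weights, it may be worth stating in the report that the summaries describe the sample rather than the population.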

NCVS data, which are available in SAS, SPSS, Stata, R, and ASCII plain text formats, are organized in four files: an address-level file, a household-level file, a person-level file, and an incident-level file. Furthermore, since the NCVS employs a 6-month retroactive reference period for reporting crimes, data are available in two file structures: collection-year files and data-year files. Collection-year files contain incidents based on when the interviews were conducted (as opposed to when the incidents actually occurred). Data-year files contain incidents based on when they occurred, regardless of when the interviews were conducted. Different sets of survey weights are provided for each type of file. Under the collection-year approach, only 12 months of interviews are needed for annual estimation. With the data-year approach, annual estimates cannot be made until 18 months of interviews have been conducted, making collection-year-based estimation more timely. For most outcomes, the difference between the collection-year and data-year estimates is neither statistically nor substantively significant. Therefore, collection-year data are more commonly used in BJS analyses and reports. Although this user’s guide will introduce both collection- and data-year weights, the examples will focus on collection-year estimation.

The household-level file contains one record per reporting period for each sampled household in the NCVS. It contains data from the household screening interview, which assesses whether a household experienced any property crimes during the previous 6 months. Additionally, the household-level file contains characteristics of the household’s surrounding area, such as the census region and Metropolitan Statistical Area (MSA) status, and the characteristics of the principal and reference persons within the household.

Because households stay in the NCVS sample for 3 years, reporting seven times at 6-month intervals, most households appear on the annual household file more than once. For this reason, both the household identification number (IDHH) and the year and quarter indicator (YEARQ) must be used to uniquely identify households by reporting period when merging the household file with other NCVS data files. Household-level estimates for the collection year use the collection-year household weight (WGTHHCY), whereas estimates for the data year are based on the data-year household weight (WGTHHDY). The household file is most commonly used in the calculation of property victimization rates.
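
The point about `IDHH` and `YEARQ` bears directly on the "how many distinct households?" question. The sketch below counts rows, distinct households, and distinct household-period pairs. The helper name `household_counts` and the toy IDs are hypothetical. One caution, offered as an assumption rather than a confirmed fact about the file: the preview above prints `IDHH` as `1.60207e+24`, which suggests pandas parsed it as a float, and long identifiers can lose digits that way. If the CSV stores the full identifier, reading the column as text should keep distinct IDs distinct.

```python
import pandas as pd

def household_counts(df, id_col="IDHH", period_col="YEARQ"):
    """Count rows, distinct households, and distinct household-period pairs."""
    return {
        "rows": len(df),
        "households": df[id_col].nunique(),
        "household_periods": len(df[[id_col, period_col]].drop_duplicates()),
    }

# Toy data standing in for NCVS_2020.csv (IDs and quarters are made up).
toy = pd.DataFrame({
    "IDHH": ["A1", "A1", "A1", "B2", "C3"],
    "YEARQ": [2020.1, 2020.1, 2020.3, 2020.1, 2020.2],
})
counts = household_counts(toy)
print(counts)

# With the real file, reading IDHH as text avoids float rounding of long IDs:
# proj_df = pd.read_csv("NCVS_2020.csv", dtype={"IDHH": str})
# household_counts(proj_df)
```

When the row count exceeds the household-period count, as in the toy data, some households contribute several rows for the same reporting period, so row-level summaries are not household-level summaries.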


principal vs ref?
what does everything refer to specifically

B.  **Characteristics of sample**. Describe the sample of people and households in the dataset.  Summarize the distributions of 3 or more of the characteristics (variables) of the people and households.  Some interesting variables you may consider include: marital status, employment, age, income etc. Choose at least 1 categorical and 1 numerical variable. You should include a graph or table for each distribution. You should create at least one graph and at least one table.

charts: n reports over time (line), n reports by income (income bracket histo otherwise scatter), n reports by job (whichever measure works best) (bar)
table: total reports by employment
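
A minimal sketch of the kind of table and numerical summary section B asks for, using two column names that appear in the preview above (`EMP_TYPE`, `NUM_INCIDENTS`). The helper `frequency_table` and the toy values are hypothetical, and the integer codes would need labels from the codebook before going into a report.

```python
import pandas as pd

def frequency_table(df, col):
    """Counts and percentages for one categorical column, ordered by code."""
    counts = df[col].value_counts().sort_index()
    table = pd.DataFrame({
        "count": counts,
        "percent": (100 * counts / counts.sum()).round(1),
    })
    table.index.name = col
    return table

# Toy data (made-up values) standing in for the project DataFrame.
toy = pd.DataFrame({
    "EMP_TYPE": [1, 1, 3, 3, 3, 2, 1, 3],
    "NUM_INCIDENTS": [1, 2, 1, 7, 1, 1, 2, 1],
})

emp_table = frequency_table(toy, "EMP_TYPE")        # categorical variable
incident_summary = toy["NUM_INCIDENTS"].describe()  # numerical variable
print(emp_table)
print(incident_summary)
```

For the graph, `emp_table["count"].plot.bar()` would draw a bar chart from the same table, and `toy["NUM_INCIDENTS"].plot.hist()` a histogram. Codes such as 9 may mark missing or out-of-universe responses; that is an assumption to verify against the codebook before including them in a distribution.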

C. **Relationships between variables.**  Now, shift focus from distributions of single variables to relationships between variables. You should investigate at least 2 of the individual or household characteristics (race, sex, education...) and at least two of the crime-related variables.  For example, do you find evidence that those with higher levels of education are victims of less crime? Include two or more graphs or tables here. Describe any associations you find. 

severity of crime vs perp education level (scatter)
victim education level vs perp education level (scatter)
perp number of incidents vs perp education level (bar)
mean job level/length had vs education level (both perp and bar)
average perp income vs perp education level (bar)
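
One way to tabulate a relationship between a characteristic and a crime-related variable is a grouped summary plus a row-normalized cross-tabulation. The sketch below uses `URBANICITY` and `NUM_INCIDENTS`, both of which appear in the preview above. The helper `group_summary` and the toy values are hypothetical, and the same pattern should apply to any pair of columns chosen from the codebook.

```python
import pandas as pd

def group_summary(df, group_col, value_col):
    """Group size, mean, and median of value_col within each group_col level."""
    return df.groupby(group_col)[value_col].agg(["count", "mean", "median"])

# Toy data (made-up values) standing in for the project DataFrame.
toy = pd.DataFrame({
    "URBANICITY": [1, 1, 2, 2, 2, 3],
    "NUM_INCIDENTS": [2, 4, 1, 1, 7, 1],
})

by_area = group_summary(toy, "URBANICITY", "NUM_INCIDENTS")

# Share of each incident count within an area type (each row sums to 1).
shares = pd.crosstab(toy["URBANICITY"], toy["NUM_INCIDENTS"], normalize="index")

print(by_area)
print(shares.round(2))
```

Reporting the group sizes next to the means helps a reader judge how much weight a difference deserves. In the toy data, one large value pulls the mean for group 2 well above its median.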
 

D. **Provide context**  To the best of your knowledge, what do the relationships you discovered imply? Do you think the associations are causal? What are some potential confounders that may explain the relationships?  What are some questions that you would like to answer but are unable to with the current data set alone?  What data would you need to be able to answer them?

what would I want answered: 
perp hometown avg income vs perp education level
n reports by perp hometown avg income
would support (not prove) the idea that higher income goes with a nicer neighborhood (one with less crime); since higher education level correlates with higher income level, the recommendation would be to increase education
how to answer:
general census data
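
A rough way to probe a possible confounder is to compare the overall association with the association inside each level of the third variable. The sketch below is illustrative only: the column names `EDUCATION` and `INCOME_GROUP`, the helper `stratified_means`, and the toy values are hypothetical. The toy values are constructed so that income fully accounts for the apparent education effect.

```python
import pandas as pd

def stratified_means(df, exposure, outcome, stratum):
    """Mean outcome per exposure level, overall and within each stratum."""
    overall = df.groupby(exposure)[outcome].mean().rename("overall")
    within = df.pivot_table(index=exposure, columns=stratum,
                            values=outcome, aggfunc="mean")
    return pd.concat([overall, within], axis=1)

# Toy data (made-up): incidents depend on income group only, but the
# lower-education rows are mostly in the low-income group.
toy = pd.DataFrame({
    "EDUCATION":     [0, 0, 0, 0, 1, 1, 1, 1],
    "INCOME_GROUP":  ["low", "low", "low", "high", "low", "high", "high", "high"],
    "NUM_INCIDENTS": [4, 4, 4, 2, 4, 2, 2, 2],
})

result = stratified_means(toy, "EDUCATION", "NUM_INCIDENTS", "INCOME_GROUP")
print(result)
```

In this toy example the overall means differ by education level, yet within each income group they are identical. That is the pattern one would expect if income, rather than education itself, drove the difference. Observational data alone cannot settle which reading is right.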

E. **Conclusion/Self-Assessment** What did you learn from exploring this dataset?
Overall consultant recommendation --> increasing access to better education keeps perps off the streets and gives them a chance at a better life

Your submission is to be uploaded to Canvas.  
There are some minimum requirements for your submission:

1. Upload your report in PDF format to Canvas. 
2. Upload the Jupyter notebook containing your analysis code to Canvas.
3. The report should be 4 pages maximum, including bibliography, tables and figures.
4. If your report uses outside results and/or data, proper citations must be provided.
    
