# Homework 3: Working with data using Pandas and Basic Statistics

### <p style="text-align: right;"> &#9989; Put your name here.


## Learning Goals

### Content Goals
- Working with Pandas to read/write data
- Working with Pandas to clean and analyze datasets
- Visualize data using Matplotlib.

### Practice Goals
- Using data visualizations to draw conclusions
- Analyze real-world datasets to answer research questions
- Use the internet as a resource to learn how to best use imported packages
- Troubleshooting errors and debugging code 
___

## Assignment instructions

Work through the following assignment, making sure to follow all the directions and answer all the questions.

**This assignment is due at 11:59 pm on Friday, March 8.** It should be uploaded into the "Homework Assignments" submission folder for Homework #3.  Submission instructions can be found at the end of the notebook.

___

## Grading

* Academic integrity statement (**2 points**)
* Exploring Michigan policing data (**44 points**)
    - 1.1 Loading and inspecting the data (11 points)
    - 1.2 Cleaning and organizing the data (16 points)
    - 1.3 Basic statistics (9 points)
    - 1.4 Visualizing the trend in officer employment rates (8 points)

* The perils of summary statistics (**36 points**)
    - 2.1 Loading and inspecting the data (11 points)
    - 2.2 Getting summary statistics (11 points)
    - 2.3 Making scatter plots (14 points)

**Total**: 82 points

---
<a id="toc"></a>

## Table of Contents

[Introduction](#intro)
    
[Part 0. Integrity statement](#part_0) (2 points)

[Part 1. Exploring Michigan policing data](#part_1)  (44 points)

[Part 2. Datasaurus, or the perils of summary statistics](#part_2) (34 points)

---
<a id="intro"></a>

## Introduction and motivation

[Back to Top](#toc)


As you know, you are expected to complete a fully-ﬂedged data analysis or computational modeling project by the end of the semester. This is your opportunity to choose a topic that sounds interesting to you and dig into it. More details on the project will be provided on the class website after spring break. This homework is aimed at giving you some experience with part of your semester project, so read the entire notebook carefully because coding is only a small part of your project. The big part is designing it. Let's start!

Most, if not all, data science project have a workflow similar to the one depicted in the image below

![image.png](attachment:7298a6f9-fc43-4736-869c-e61bebe77abd.png)

The image was taken from this [article](https://towardsdatascience.com/5-steps-of-a-data-science-project-lifecycle-26c50372b492). As you can see there are several steps in a data science project. Let's break them down.

1. **Obtain (Data Collection)**: This is the foundational step where you're sourcing data relevant to your project. It involves identifying and gathering data from various sources which could be databases, online repositories, direct data entry, or data streams. It's critical to ensure that the data you collect is as accurate and comprehensive as possible because it sets the stage for the entire project.

2. **Scrub (Data Cleaning)**: Once you have your dataset, the next step is to prepare it for analysis. This involves cleaning and preprocessing the data. You'll be dealing with inconsistencies, missing values, and possibly irrelevant information that needs to be filtered out. This step is about transforming raw data into a structured format that's ready for exploration and analysis.

3. **Explore (Data Exploration and Visualization)**: At this stage, you delve into the dataset to uncover underlying structures, patterns, and insights. This is typically done through exploratory data analysis (EDA), where visualizations and statistical techniques are employed to understand the data's characteristics. It's an investigative process where you're asking questions of the data and seeking answers through visual and quantitative exploration. 

4. **Model (Data Modeling)**: Building on the insights from the exploration phase, you now develop predictive or descriptive models. This involves selecting algorithms that are appropriate for the data and the problem at hand, training these models on the data, and then validating their performance. The goal is to create a model that can generalize from your current dataset to unseen data to make predictions or to uncover more complex patterns.

5. **Interpret (Result Interpretation)**: The final stage is about making sense of the model outcomes. It's about translating the statistical findings and model predictions into actionable insights. This often requires critical thinking to understand the implications of the results within the context of the problem domain. The interpretation must be clear and comprehensible to stakeholders who may not have a technical background.

Throughout these steps, communication and iteration are key. You may find yourself revisiting earlier steps based on new findings or feedback, which is a natural part of the data science process. Each step builds upon the previous one, leading to informed decision-making based on data-driven evidence.

**This homework is designed to give you practical experience in data cleaning and EDA, which are often the most time-consuming steps in a data science project**. It's not uncommon for data scientists to spend a significant portion of their time on these phases, ensuring data quality and uncovering initial insights that will guide further analysis. Cleaning data effectively requires a meticulous eye for detail and a robust understanding of the tools at your disposal, while EDA demands a blend of curiosity and statistical knowledge to ask the right questions and interpret the data correctly. Mastery of these stages is crucial, as they form the bedrock upon which reliable models are built and insightful conclusions are drawn.

Before we move on, I haven't mentioned the interesting part of a data science project, that thing that drives you to embark on the data science journey; the question! After all every project is just a way to find the answer to a question that has been bugging us. While in this homework the questions are provided, for the final project you will have to formulate the question yourself. This is not an easy task! Sometimes you don't have a specific question and in these cases it might be useful to start exploring the web for available data. In the Pre-Class Assignment of Day 13 we provide some websites where you can find data, here are few more websites:

[UCI Repository](https://archive.ics.uci.edu/datasets)

[Fivethirtyeight](https://github.com/fivethirtyeight/data)

[Data.gov - Federal](http://www.data.gov/)

[Data.gov - State of Michigan](http://data.michigan.gov/)

Note that some of the websites have fancy words like _machine learning_. Don't be worry! You are not required to do a machine learning project. Those types of projects are for CMSE 202. The only difference between a machine learning project and CMSE 201 project is the tools used in the project, but they both follow the same 5 steps above and require datasets.

&#9989; **Task (0 points)** Your first task is to start thinking about your project and trying to formulate a question.

<!-- This section was written with the help of chatGPT 4.0 -->

---
<a id="part_0"></a>

## Part 0: Integrity statement  (2 Points)

[Back to Top](#toc)


### 0.1 Academic integrity statement

In the markdown cell below, paste your personal academic integrity statement. By including this statement, you are confirming that you are submitting this as your own work and not that of someone else.


**I, Mark Endicott, commit to bring a serious, honest, and effortful intention to the tools of data science and the responsibilities associated with my education. I will maintain academic integrity because it allows me to thrive later. I value the impact of the skills I acquire and I will to use them to help people to the best of my ability. I am aware of MSU's ethical standards for integrity**

---
<a id="part_1"></a>

## Part 1: Exploring Michigan policing data (44 points total)

[Back to Top](#toc)

In your web search you land on this [article](https://www.themarshallproject.org/2014/11/24/10-not-entirely-crazy-theories-explaining-the-great-crime-decline) about the "Great Crime Decline". You are intrigued and start searching for more data and you land on the [FBI datasets website](https://cde.ucr.cjis.gov/LATEST/webapp/#/pages/explorer/crime/crime-trend). You notice that they have a very nice interface that makes plots of data the FBI collects. There are two main sets of data; **Crime Data** and **Law Enforcement Collections**. You click on **Crime Data** and notice that the plot **Trend of Violent Crime** after adjusting the year range, shows a decline in violent crime in the 90's. Then you notice the other sets of data **Law Enforcement Collections** which contains data on law enforcement employees. One of the theories mentioned in the article was that the increase in law enforcement could be an explanation for the decrease in crime and this makes you wonder:

**Does the level of policing, measured as the number of law enforcement officer, correlate with the decrease in crime in INGHAM county?**


### 1.1: Loading and inspecting the data (11 points)

The instructors have downloaded the data and modified it a little for the purpose of this homework. 

**Using the provided dataset, `pe_MI_1960_2022.csv`, you're going to explore how policing changes across years and regions in Michigan.**

&#9989; **Task 1.1.1 (3 points)** Import `pandas` module, and then read in the policing data file as a pandas dataframe using the first column as the index and display the first 10 rows. 

In [1]:
import pandas as pd

data = pd.read_csv("pe_MI_1960_2022.txt")

data.iloc[0:11]

Unnamed: 0.1,Unnamed: 0,data_year,ori,pub_agency_name,pub_agency_unit,state_abbr,division_name,region_name,county_name,agency_type_name,...,male_officer_ct,male_civilian_ct,male_total_ct,female_officer_ct,female_civilian_ct,female_total_ct,officer_ct,civilian_ct,total_pe_ct,pe_ct_per_1000
0,4514,2021,MI5079200,Utica,,MI,East North Central,Midwest,MACOMB,City,...,15,1,16,0,5,5,15,6,21,4.12
1,4515,2021,MI5080600,Warren,,MI,East North Central,Midwest,MACOMB,City,...,197,10,207,12,29,41,209,39,248,1.87
2,4516,2021,MI5084900,Clinton Township,,MI,East North Central,Midwest,MACOMB,City,...,87,2,89,5,6,11,92,8,100,1.0
3,4517,2021,MI5090200,Chesterfield Township,,MI,East North Central,Midwest,MACOMB,City,...,43,1,44,5,13,18,48,14,62,1.32
4,4518,2021,MI5091200,Macomb Community College,,MI,East North Central,Midwest,MACOMB,University or College,...,23,2,25,3,2,5,26,4,30,
5,4519,2021,MI5091800,Huron-Clinton Metropolitan Authority:,Stony Creek Metropark,MI,East North Central,Midwest,MACOMB,Other,...,8,0,8,1,0,1,9,0,9,
6,4520,2021,MI5115100,Manistee,,MI,East North Central,Midwest,MANISTEE,County,...,14,11,25,1,3,4,15,14,29,1.54
7,4521,2021,MI5155000,Manistee,,MI,East North Central,Midwest,MANISTEE,City,...,11,0,11,1,0,1,12,0,12,1.95
8,4522,2021,MI5215200,Marquette,,MI,East North Central,Midwest,MARQUETTE,County,...,21,29,50,2,13,15,23,42,65,3.32
9,4523,2021,MI5231200,Chocolay Township,,MI,East North Central,Midwest,MARQUETTE,City,...,5,0,5,0,1,1,5,1,6,1.02


&#9989; **Task 1.1.2 (3 point)** As you can see there is too much data to be displayed and you need to do some exploring. Although we haven't used it in class a very useful function of `pandas.DataFrame` is `.info()`. [Here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) is the documentation for it. To take a closer look at the dataframe, in the cell below, write some code to:
- display the output of `.info()`
- print the `unique` county names

In [2]:
data.info()
unique = data['county_name'].unique() #or .value_counts()
print("\n", unique, "\n", len(unique))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31557 entries, 0 to 31556
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             31557 non-null  int64  
 1   data_year              31557 non-null  int64  
 2   ori                    31557 non-null  object 
 3   pub_agency_name        31557 non-null  object 
 4   pub_agency_unit        1808 non-null   object 
 5   state_abbr             31557 non-null  object 
 6   division_name          31557 non-null  object 
 7   region_name            31557 non-null  object 
 8   county_name            31557 non-null  object 
 9   agency_type_name       31557 non-null  object 
 10  population_group_desc  31557 non-null  object 
 11  population             31557 non-null  int64  
 12  male_officer_ct        31557 non-null  int64  
 13  male_civilian_ct       31557 non-null  int64  
 14  male_total_ct          31557 non-null  int64  
 15  fe

&#9989; **Task 1.1.3 (5 points)** 

Inspect the data displayed above and answer the following questions:

- looking at the output of `.info()`, what does the line `pub_agency_unit        1808 non-null   object` mean? In particular, the number `1808` and `object`
- what do `male_officer_ct` and `female_officer_ct` mean?
- what do `total_pe_ct` mean?
- how much memory does the dataset use?
- notice anything strange in the unique values of `county_names`? How many unique values are there? Note that there are 83 counties in the state of Michigan. 


**pub_agency_unit 1808 non-null object:** pub_agency_unit is the name of the column and likely represents a "public agency unit."1808 non-null" indicates the number of non-missing values in the public agency unit column. "obect" represents the type of values in the column. In pandas, "object" typically means string.

**male_officer_ct and female_officer_ct:** This likely stands for the count(ct) of the male and female officers within the data.

**total_pe_ct:** After observing that this column is the result of the sum of officer and civilian counts(ct), it can be inferred that "pe" means person or people. So overall, total person count.

**Memory:** Using .info(), the memory usage is shown to be 5.3+ mb

**Uniques:** After finding the length of the list of unique counties, a count of 108 was observed despite there being only 83 counties in Michigan.




### 1.2: Cleaning and organizing the data (16 points)

Next, you want clean the dataset a bit and simplify it so it only contains the columns that you are interested in.

&#9989; **Task 1.2.1: (7 points)** 
There are some missing values in the data frame that you need to take care of. In the cell below, you will write some code using `drop` and/or `dropna` to:
- first, remove column `pub_agency_unit` where the value is mostly missing
- second, delete any rows that contains missing values, or NaN values

Name your new cleaned data frame as `data_clean` and do the following sanity checks:
- print the length of the original data (* length = number of rows)
- print the length of the cleaned data
- explain the difference between the two
- what would happen if you were to remove the rows first and then the `pub_agency_unit` column? How would the lengths change?

In [3]:
#data.drop("pub_agency_unit", axis=1)
data_clean = data.dropna()
data_clean.info()

print("\n", len(data),"\n", len(data_clean))

data_clean.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 0 entries
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             0 non-null      int64  
 1   data_year              0 non-null      int64  
 2   ori                    0 non-null      object 
 3   pub_agency_name        0 non-null      object 
 4   pub_agency_unit        0 non-null      object 
 5   state_abbr             0 non-null      object 
 6   division_name          0 non-null      object 
 7   region_name            0 non-null      object 
 8   county_name            0 non-null      object 
 9   agency_type_name       0 non-null      object 
 10  population_group_desc  0 non-null      object 
 11  population             0 non-null      int64  
 12  male_officer_ct        0 non-null      int64  
 13  male_civilian_ct       0 non-null      int64  
 14  male_total_ct          0 non-null      int64  
 15  female_officer_ct 

Unnamed: 0.1,Unnamed: 0,data_year,ori,pub_agency_name,pub_agency_unit,state_abbr,division_name,region_name,county_name,agency_type_name,...,male_officer_ct,male_civilian_ct,male_total_ct,female_officer_ct,female_civilian_ct,female_total_ct,officer_ct,civilian_ct,total_pe_ct,pe_ct_per_1000


**data vs. data_clean:** The cleaned data has a lower number of rows and less overall memory usage or unecessary "clutter."

**Order of removal:** If we were to remove the missing rows before removing the pub_agency_unit column, we would be dropping a lot more potentially relevant rows (since the column is mostly empty). Dropping the column first preserves the robustness of the data.

&#9989; **Task 1.2.2 (3 points)** 
There are too many columns with irrelevant infomation. You want to write some code to create a reduced dataframe called `data_allcounty` to include only the following columns:
- data_year
- county_name
- population
- male_officer_ct
- female_officer_ct


In [4]:
selected_columns = ['data_year', 'county_name', 'population', 'male_officer_ct', 'female_officer_ct']
data_allcounty = data_clean[selected_columns]
data_allcounty.describe()

Unnamed: 0,data_year,population,male_officer_ct,female_officer_ct
count,0.0,0.0,0.0,0.0
mean,,,,
std,,,,
min,,,,
25%,,,,
50%,,,,
75%,,,,
max,,,,


In [5]:
# Find the index of the row with the highest officer count
idx_max_officer_count = data_allcounty['male_officer_ct'].idxmax()

# Use the index to get the county_name with the highest officer count
county_name_highest_officer_count = data_allcounty.loc[idx_max_officer_count, 'county_name']

# Print the county_name
print("County with the highest officer count:", county_name_highest_officer_count)

ValueError: attempt to get argmax of an empty sequence

&#9989; **Task 1.2.3 (4 points)** You want to create an even smaller dataset that contains only total population and officer counts in the Ingham county for each year. Write some code to do the following:

- create the Ingham dataset: use masking to extract all the rows with `"INGHAM"` as `county_name` from `data_allcounty` dataframe (ignore entries with two or more county affiliations) and store them in a new variable `data_ingham`
- sum over different locations in the county for each year: use `groupby` method to group the dataframe by `data_year`, then calculate the sum `population`, `male_officer_ct`, `female_officer_ct` for each year. Assign the resulting dataframe to a new variable `data_ingham_byyear`. (This can be accomplished with one line of code. You will need to Google the documentation of the `groupby` method for `pandas`.)

In [None]:
data_ingham = data_allcounty[data_allcounty['county_name'] == 'INGHAM'].copy()

data_ingham_byyear = data_ingham.groupby('data_year').agg(
    total_population=('population', 'sum'),
    total_male_officer_ct=('male_officer_ct', 'sum'),
    total_female_officer_ct=('female_officer_ct', 'sum')
).reset_index()

print(data_ingham_byyear)

In [None]:
data_wayne = data_allcounty[data_allcounty['county_name'] == 'WAYNE'].copy()

data_ingham_byyear = data_ingham.groupby('data_year').agg(
    total_population=('population', 'sum'),
    total_male_officer_ct=('male_officer_ct', 'sum'),
    total_female_officer_ct=('female_officer_ct', 'sum')
).reset_index()

print(data_ingham_byyear)

&#9989; **Question 1.2.4 (2 points)** In your own words describe what `groupby` does. 

'groupby' allows the user to perform data analysis by specified columns (one or more). Within these columns the user is able to apply functions such as "mean" or "sum" to obtain useful information.

### 1.3: Basic statistics (9 points)
Now having a couple of smaller datasets organized, you are ready for some statistics!

&#9989; **Task 1.3.1 (6 points)**
In the cell below, write some code to obtain the summary statistics (like the count, min, max, and mean) for each column in `data_allcounty` and `data_ingham`, and answer the following questions:
- On average, does the Ingham county have more officers per agency than that of the state? What statistics support your conclusion?
- Does the Ingham county has the largest law enforcement agency (highest # officers) in the whole state? What statistics support your conclusion?

In [None]:
print(data_allcounty.describe())
print('\n', "Ingam County",'\n')
print(data_ingham.describe())
print('\n', "WAYNE County",'\n')
print(data_wayne.describe())

**does the Ingham county have more officers per agency than that of the state?** From the years 1960-2021, Ingam county has a mean population of 26529.345 with a mean of 41 officers altogether. So 41 / 26529 * 100% = .155% Officers in the population. For the state, this calculation is about 32 / 19340 = .165%. So the state has more officers per agency than ingam county alone.

**Does the Ingham county has the largest law enforcement agency (highest # officers) in the whole state?** Comparing the means of state vs. ingham, ingham does in fact have more officers on average than the state's average.However, this does not mean that Ingham has the highest total amount of officers. In fact, the max count of officers that ingham has had is around 300 while wayne county peaks out at 6500 officers altogether.

&#9989; **Task 1.3.2 (3 points)**
You found it a little misleading to just look at the officer counts without considering the population in the respective jurisdiction. Thus, you set out to calculate the employment _rate_ of officers per 1000 people in the Ingham county. 

In the cell below, write some code to create two new columns `male_rate` and `female_rate` in the dataframe `data_ingham_byyear` that contains the employment rates per 1000 people for male and female officers in each year. 

In [None]:
data_ingham_byyear['male_rate'] = (data_ingham_byyear['total_male_officer_ct'] / data_ingham_byyear['total_population']) * 1000
data_ingham_byyear['female_rate'] = (data_ingham_byyear['total_female_officer_ct'] / data_ingham_byyear['total_population']) * 1000

# Print the modified DataFrame
print(data_ingham_byyear)

### 1.4 Visualizing the trend in officer employment rates (8 points)
Now it's time to look at how the employment rate of police officers in the Ingham county has changed over time. You will be using the `data_ingham_byyear` dataset you created.

&#9989; **Task 1.4.1 (5 points)**
In the cell below, write some code to plot male and female officer employment rates you just calculated in 1.3.2 as a function of year in the same figure. Don't forget to add axis labels and legends. 

In [None]:
import matplotlib.pyplot as plt

plt.plot(data_ingham_byyear['data_year'], data_ingham_byyear['male_rate'], label='Male Rate')
plt.plot(data_ingham_byyear['data_year'], data_ingham_byyear['female_rate'], label='Female Rate')
plt.xlabel('Year')
plt.ylabel('Employment Rate per 1000 People')
plt.title('Ingham County: Male and Female Officer Employment Rates')
plt.legend()
plt.grid(True)
plt.show()

&#9989; **Task 1.4.2 (3 points)** We are finally ready to answer some questions.
Observe the plot and answer the following questions:
- If there is a great crime decline in the 90s in the Ingham county, do you think it's due to an increase in policing? Why?
- Around which year did female officers start to be employed in Ingham county?

In the early 90s there was a signifigant rise in employment rate for both male and female officers, followed by more gradual employment of female officers. This could indicate a correlation between the total increase in police force, the greater percentage of female officers (gradually increasing from 1970s-mid2000s) and a lower crime rate. Although, it is worth mentioning that prior to the 90s, there was a sharp decrease in police force following a long period of signigantly higher percentage male police employment. Overall, I believe this data alone can only show correlations between policing and male/female rates of employment and cannot state a causal effect between policing and crime rate.

## &#128721; STOP

Wait a minute! We still haven't answered the original question! Does crime and level of policing correlate in Ingham county? Well this homework is getting very long so I will leave the rest of the work to you. This work now involves **obtaining** a dataset with yearly crime statistics for Ingham county. Note that for your project if you want to show that two things are correlated you not only need to make plots but also to calculate the correlation coefficient! 


---
<a id="part_2"></a>

## Part 2: Datasaurus, or the perils of summary statistics (34 total points)

[Back to Top](#toc)

In the 1970s, the statistician F. J. Anscombe published the article ["Graphs in Statistical Analysis"](https://www.jstor.org/stable/2682899), which demonstrated the importance of actually graphing your data, instead of just looking at the summary statistics: mean, median, inter-quartile ranges, etc.  

To hammer this point home, he created "Anscombe's quartet", a group of four graphs that are completely different, yet have the same means, ranges, and lines of linear regression.

<img src="https://upload.wikimedia.org/wikipedia/commons/3/30/Anscombe%27s_quartet_3_cropped.jpg"
     alt="Image of four very different graphs with the same summary statistics"
     align="center" 
     height="100" 
     width="600" />
     
Image credit: https://en.wikipedia.org/wiki/Anscombe%27s_quartet

More recently, a Twitter user named Albert Cairo developed a [different dataset](http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html) to illuminate the same point.  Here you will examine this data, calculate its properties and finally make some plots!

### 2.1: Loading and inspecting the data (11 points)

&#9989; **Task 2.1.1: (2 points)** Read in data from `datasaurus_data.csv` file as a pandas dataframe called `data` and display the first 5 rows and the last 5 rows.

In [None]:
data = pd.read_csv("datasaurus.txt")
print(data.iloc[0:5])
print(data.iloc[5:11])


&#9989; **Task 2.1.2: (3 points)** Make a list (called `datasets`) of all of the *unique* values in the `dataset` column.  Then print the list.

**Hint: This can be done using a built-in function.**

In [None]:
datasets = data['dataset'].unique()
print(datasets)

If the prior steps were done correctly, the following code will take the information in the dataframe and convert it to another format for easier access:

In [None]:
data_dict = {}
for dataset_name in datasets:
    data_dict[dataset_name] = {}
    data_dict[dataset_name]['x'] = data['x'][data['dataset'] == dataset_name].values  
    data_dict[dataset_name]['y'] = data['y'][data['dataset'] == dataset_name].values  

&#9989; **Task 2.1.3: (4 points)** Describe what `data_dict` is, how it is structured, and how you can access the data that it contains.

data_dict is a dictionary used to store data from the original dataframe in a nested format. Specifically, using the sub-dictionary keys 'x' and 'y'. You can access data in it by using the dataset name with it's dictionary key (such a x).

&#9989; **Task: (2 points)** Print out all of the x-coordinates in the `'star'` dataset using the `data_dict` object. 

In [None]:
print(data_dict["star"]['x'])

### 2.2: Getting summary statistics (11 points)

Now let's compute the following statistics for each dataset:

- mean of x coordinates
- mean of y coordinates
- standard deviation of x coordinates
- standard deviation of y coordinates
- correlation between x and y  (**Hint: check out `np.corrcoef`**)

&#9989; **Task 2.2.1: (9 points)** Make a loop that prints this information for all datasets in the following format:

```
Dataset: dino       x_mean = 54.26 | x_std = 16.71 | y_mean = 47.83 | y_std = 26.84 | corr = -0.06
Dataset: away       x_mean = 54.27 | x_std = 16.71 | y_mean = 47.83 | y_std = 26.84 | corr = -0.06
...
```

In [None]:
import numpy as np

data.describe()

for data_s in datasets:
    L1 = data_dict[data_s]["x"]
    L2 = data_dict[data_s]["y"]
    
    correlation_m = np.corrcoef(L1,L2)
    correlation = round(correlation_m[0,1],4)
    
    meanx = round(np.mean(L1),4)
    meany = round(np.mean(L2),4)
    
    stdx = round(np.std(L1),4)
    stdy = round(np.std(L1),4)
    
    print('Dataset: ', data_s,' | ',
          "x_mean =", meanx, ' | ',
          "y_mean =", meany, ' | ',
          "x_std =", stdx, ' | ',
          "y_std =", stdy, ' | ', 
          "corr =", correlation)
    

&#9989; **Task 2.2.2: (2 points)** What do you observe about these statistics?

All the statistics are very consistent across the datasets. Most changes for x and y are very margianl and fall within the 2-3 decimal places.

### 2.3: Making scatter plots (14 points)

&#9989; **Task 2.3.1: (14 points)** Plot the x and y data of each dataset using the following guidelines for formatting:

- Make a set of subplots with five rows and three columns
- Use a `for` loop to make the plots. **Note:** this is not easy and you will have to think a little. A suggestion is try to make the first few plots without the loop and then look for those numbers that can easily be incremented via a loop. Make sure to divide your code into steps _e.g._ write a pseudocode.
- Title each plot with the dataset name, and the correlation coefficient
- Plot each set of points with a different color, using circles for each (no lines)
- Label the axes
- Be sure to size your figure so it is big enough and nothing looks cramped! _Hint:_ You know you have the right size when you can clearly see the dinosaur.

In [None]:
color_names = ['indigo','magenta','peachpuff','slateblue','lightseagreen','goldenrod','teal','midnightblue','mediumorchid','gainsboro','thistle','bisque','salmon']

In [None]:
num_rows = 5
num_cols = 3

# Create a figure and axes for subplots
fig, axs = plt.subplots(num_rows, num_cols, figsize=(20, 25))

# Iterate over datasets and plot each on a subplot
for i, data_s in enumerate(datasets):
    L1 = data_dict[data_s]["x"]
    L2 = data_dict[data_s]["y"]

    row = i // num_cols
    col = i % num_cols

    axs[row, col].scatter(L1, L2, color = color_names[i])
    axs[row, col].set_title(data_s)
    axs[row, col].set_xlabel('x')
    axs[row, col].set_ylabel('y')

# Remove any empty subplots
for i in range(len(datasets), num_rows*num_cols):
    row = i // num_cols
    col = i % num_cols
    fig.delaxes(axs[row, col])

plt.tight_layout()
plt.show()

---
## Wait! Before you submit your notebook, read the following important recommendations

When your TA opens your notebook to grade the assignment, it will be really useful if your notebook is saved in a fully executed state so that they can see the output of all your code. Additionally, it may be necessary from time to time for the TA to actually run your notebook, but when they run the notebook, it is important that you are certain that all the code will actually run for them!

You should get into the following habit: **before you save and submit your final notebook for your assignments, go to the "Kernel" tab at the top of the notebook and select "Restart and Run all".** This will restart your notebook and try to run the code cell-by-cell from the top to the bottom. Once it finished, review you notebook and make sure there weren't any errors that popped up. Sometimes, when working with notebooks, we accidentally change code in one cell that break our code in another cell. If your TA stumbles across code cells that don't work, **they will likely have to give you a zero for those portions of the notebook that don't work**. Testing your notebook one last time is a good way to make sure this doesn't happen.

**Once you've tested your whole notebook again, make sure you save it one last time before you upload it to D2L.**

---

### Congratulations, you're done! ###

Submit this assignment by uploading it to the course Desire2Learn web page. Go to the "Homework Assignments" section, find the submission folder link for Homework #3, and upload it there.

&#169; 2024 Copyright the Department of Computational Mathematics, Science and Engineering.