# Week 3 - Principles of data visualization

## Part 1: Fundamentals of data visualization

**Exercise:**
- Explain in your own words how the Pearson correlation works and write down its mathematical formulation. Can you think of an example where it fails (and visualization works)?
- What is the difference between a bar-chart and a histogram?
- How do you choose the right bin-size in histograms? Do a Google search to find a criterion you like and explain it.

**Answer:**
- f
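
As a starting point for the first question, here is a minimal sketch. For paired samples the Pearson correlation is

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$$

and it only measures linear association. The toy data below is made up for illustration (it is not from the course dataset), and the helper name `pearson_r` is hypothetical. The second series depends perfectly on `x`, yet its correlation comes out as zero, while a scatter plot would show the parabola right away.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of summed squared deviations
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data, made up for illustration
xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]      # straight line
quadratic = [x ** 2 for x in xs]      # symmetric parabola

r_linear = pearson_r(xs, linear)
r_quadratic = pearson_r(xs, quadratic)
print(f"linear:    r = {r_linear:.3f}")
print(f"quadratic: r = {r_quadratic:.3f}")
```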

## Part 2: Reading about the theory of visualization

**Exercise**: Questions for DAOST 
- Explain in your own words the point of the jitter plot.
- Explain in your own words the point of Figure 2-3.
- When can KDEs be misleading? 
- Janert writes "CDFs have less intuitive appeal than histograms or KDEs". What does he mean by that?
- What is a *Quantile plot*? What is it good for?
- How is a *Probability plot* defined? What is it useful for? Have you ever seen one before?
- One of the reasons we like DAOST is that Janert is so suspicious of mean, median, and related summary statistics. Explain why one has to be careful when using those - and why visualization of the full data is always better. 
- When are box plots most useful?
- Are violin plots better or worse than box plots? Why?
- Explain in your own words how this video illustrates potential issues even with box plots. Do violin plots help with that issue?
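
The caution about mean, median, and related summary statistics in the list above can be illustrated with a minimal sketch. Both samples are made up for illustration (they are not taken from DAOST or the course data): one clusters around 5, the other is split between 0 and 10, yet they report the same mean and median. Only the per-value counts, a crude text histogram, reveal the difference.

```python
from collections import Counter
from statistics import mean, median

# Made-up samples for illustration only
unimodal = [3, 4, 5, 5, 5, 5, 6, 7]
bimodal = [0, 0, 0, 0, 10, 10, 10, 10]

for name, sample in [("unimodal", unimodal), ("bimodal", bimodal)]:
    counts = Counter(sample)
    print(f"{name}: mean={mean(sample)}, median={median(sample)}")
    # Text histogram: one '#' per observation at each distinct value
    for value in sorted(counts):
        print(f"  {value:>2} | {'#' * counts[value]}")
```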

## Part 3: Visualizations based on the book

In [1]:
# Imports
import matplotlib.pyplot as plt
import numpy as np  # used by the jitter plot below (np.random, np.arange)
import pandas as pd
import os
import math

# Load the DataFrame
data_path = os.path.abspath(os.path.join(os.pardir, "data"))
cleaned_data_path = os.path.join(data_path, "Police_Department_Incident_Reports_Complete.csv")
df = pd.read_csv(cleaned_data_path)

# Define focus crimes
focuscrimes = {'WEAPON LAWS', 'PROSTITUTION', 'DRIVING UNDER THE INFLUENCE', 'ROBBERY', 
               'BURGLARY', 'ASSAULT', 'DRUNKENNESS', 'DRUG/NARCOTIC', 'TRESPASS', 
               'LARCENY/THEFT', 'VANDALISM', 'VEHICLE THEFT', 'STOLEN PROPERTY', 'DISORDERLY CONDUCT'}

# Define the order of the days of the week
days_of_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Filter data for focus crimes
df_focus = df[df['Category'].isin(focuscrimes)]

In [None]:
# Making a jitter-plot based on SF Police data of the arrest times during a single hour (18-19) between Nov 2017 and April 2018

# Filter data for weapon laws and the desired time interval:
weapon_df = df_focus[df_focus['Category'] == 'WEAPON LAWS']
weapon_df = weapon_df[((weapon_df['Year'] == 2017) & (weapon_df['Month'].isin([11, 12]))) | 
                      ((weapon_df['Year'] == 2018) & (weapon_df['Month'].isin([1, 2, 3, 4])))]

# Further filter to 18:00 - 18:59
weapon_hour = weapon_df[weapon_df['Hour'] == 18]

# Since we don't have minutes in the data, simulate a minute value (0 to 60) for each incident
weapon_hour = weapon_hour.copy()  # to avoid SettingWithCopyWarning
weapon_hour['SimulatedMinute'] = np.random.uniform(0, 60, size=len(weapon_hour))

# Create a vertical jitter value (using a small random offset) so points don't overlap vertically
weapon_hour['Jitter'] = np.random.normal(0, 0.1, size=len(weapon_hour))

# Now create the jitter plot
plt.figure(figsize=(10,6))
plt.scatter(weapon_hour['SimulatedMinute'], weapon_hour['Jitter'], 
            alpha=0.7, color='purple', edgecolor='black')
plt.xlabel('Minute within hour (18:00 - 18:59)')
plt.ylabel('Jitter (arbitrary offset)')
plt.title('Jitter Plot of Weapon Laws Arrests\n(Nov 2017 - Apr 2018, Hour 18)')
plt.xticks(np.arange(0, 61, 10))
plt.yticks([])  # remove y-ticks since they only represent random jitter
plt.show()


**Exercise Part 1**: Connecting the dots and recreating plots from DAOST

- Now grab 25 random timepoints from the dataset you've just plotted (out of the 1,000-10,000 original data points) and create a version of Figure 2-4 based on those 25 data points. Does this shed light on why I think KDEs can be misleading?
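
A minimal sketch of the idea, using only the standard library. The 25 points are drawn uniformly from a 0-60 minute window as a stand-in for the real sample (an assumption, not actual police data), and the helper names `kde_at` and `count_modes` as well as the two bandwidths are hypothetical choices for illustration. A Gaussian KDE is a sum of bell curves, one per data point, so with few points the bandwidth largely decides how many peaks appear.

```python
import math
import random

def kde_at(x, points, bandwidth):
    """Gaussian kernel density estimate at x (hypothetical helper)."""
    norm = len(points) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm

def count_modes(density):
    """Count strict local maxima in a list of density values."""
    return sum(
        1
        for i in range(1, len(density) - 1)
        if density[i - 1] < density[i] > density[i + 1]
    )

random.seed(0)  # fixed seed so the sketch is repeatable
# Stand-in sample: 25 uniformly random minutes (an assumption, not real data)
points = [random.uniform(0, 60) for _ in range(25)]

step = 0.1
grid = [-20 + i * step for i in range(1001)]  # evaluation grid from -20 to 80

narrow = [kde_at(x, points, bandwidth=1.0) for x in grid]
wide = [kde_at(x, points, bandwidth=10.0) for x in grid]

print("modes with bandwidth 1: ", count_modes(narrow))
print("modes with bandwidth 10:", count_modes(wide))
```

If the number of peaks changes with the bandwidth, as it typically does for a sample this small, the peaks say more about the smoothing than about the 25 observations.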

**Exercise Part 2**:

-  What does this plot reveal that you can't see in the plots from last time?

**Exercise**: Let's plot a map with some random values in it.

In [2]:
randomdata = {
    'CENTRAL': 0.4821,
    'SOUTHERN': 0.9153,
    'BAYVIEW': 0.3674,
    'MISSION': 0.7542,
    'PARK': 0.6285,
    'RICHMOND': 0.2147,
    'INGLESIDE': 0.05391,
    'TARAVAL': 0.007846,
    'NORTHERN': 0.4938,
    'TENDERLOIN': 0.08127
}

**Exercise:**

Main goal: *determine the districts where you should (and should not) leave your car on Sundays*. (Or stated differently, count up the number of thefts.)

- Based on your map and analysis, where should you park the car for it to be safest on a Sunday? And where's the worst place?
- Try to change the range of data-values in the plot above. Is there a way to make the difference between districts less evident?
- Why do you think perceptual errors are a problem? Try to think of a few examples. 
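
For the question about changing the range of data-values, a minimal sketch of why the range matters. It assumes the map colors each district by linearly rescaling its value between a lower and an upper bound, which is how a typical `vmin`/`vmax`-style color scale behaves; the helper names `normalize` and `color_spread` and the bounds tried are hypothetical. The values are the `randomdata` entries from the cell above.

```python
randomdata = {
    'CENTRAL': 0.4821, 'SOUTHERN': 0.9153, 'BAYVIEW': 0.3674,
    'MISSION': 0.7542, 'PARK': 0.6285, 'RICHMOND': 0.2147,
    'INGLESIDE': 0.05391, 'TARAVAL': 0.007846, 'NORTHERN': 0.4938,
    'TENDERLOIN': 0.08127,
}

def normalize(value, vmin, vmax):
    """Map value linearly onto [0, 1] for the range [vmin, vmax], clipping outside."""
    scaled = (value - vmin) / (vmax - vmin)
    return min(1.0, max(0.0, scaled))

def color_spread(data, vmin, vmax):
    """Fraction of the color scale covered between the lowest and highest district."""
    scaled = [normalize(v, vmin, vmax) for v in data.values()]
    return max(scaled) - min(scaled)

tight = color_spread(randomdata, 0.0, 1.0)     # range matched to the data
loose = color_spread(randomdata, -10.0, 10.0)  # range far wider than the data

print(f"range [0, 1]:    districts span {tight:.1%} of the color scale")
print(f"range [-10, 10]: districts span {loose:.1%} of the color scale")
```

With the wider range all districts land on nearly the same color, so real differences become hard to see even though the underlying numbers are unchanged.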
