# Part 1: More lecturing on dataviz

Exercise: Some questions about the video.

* Mention 10 examples of ways we can encode data.
    * Position
    * Length
    * Area
    * Shape
    * Color
    * Angle
    * Line weight
    * Line ending
    * Texture
    * Pattern
    
    * Length of a rectangle (barchart)
    * Nodes and links
    * Geodata with maps
    * Piecharts
    * Time series graphs
    * Area encoding, such as size of some circles
    * Color intensity encoding


* Are all encodings created equally? Why not? Can you think of an example from the previous lectures?
    * No. Encodings can reveal different aspects and serve different purposes. An example from the previous lecture is the size of an area on a map vs the number of vehicle thefts. It is misleading to just look at the colors on the map and make a decision based on the color map.



* Mention 3 encodings that are difficult for the human eye to parse. Can you find an example of a visualization online that uses one of those three?
    * Angle differences in pie charts
    * Color intensity, which is hard to draw conclusions from
    * Areas of bubbles/circles
    * In general: position and length are good for representing numbers, but angle, area/size and color intensity are not
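
One reason area encodings are easy to misread can be shown with a little arithmetic. The sketch below is not from the video; the values are made up, and `radius_for_area` is a hypothetical helper. It compares what happens when the radius of a bubble is scaled with the value versus when the area is.

```python
import math

def radius_for_area(value, scale=1.0):
    """Radius of a circle whose AREA (not radius) is proportional to value."""
    return math.sqrt(scale * value / math.pi)

values = [10, 20, 40]  # made-up counts, for illustration only

# Naive encoding: radius proportional to value, so area grows with value squared.
naive_areas = [math.pi * v ** 2 for v in values]

# Area encoding: area proportional to value.
fair_areas = [math.pi * radius_for_area(v) ** 2 for v in values]

# How much bigger the largest bubble is drawn compared to the smallest.
naive_ratio = naive_areas[-1] / naive_areas[0]
fair_ratio = fair_areas[-1] / fair_areas[0]
print(round(naive_ratio, 2), round(fair_ratio, 2))
```

With radius scaling, a value that is 4 times larger is drawn with 16 times the area, so the picture exaggerates the difference even before the eye's own difficulty with judging areas comes in.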


* Explain in your own words: What is the problem with pie-charts?
    * It is very difficult to see whether one slice of the chart is bigger than another unless the difference is obviously large. And you can't really tell how much they differ either. It is not easy for the eye.
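
To make the point concrete, here is a small sketch with made-up shares (the category names and numbers are hypothetical). It converts values into the pie-slice angles a reader would have to compare.

```python
# Made-up shares that sum to 100, for illustration only
values = {"A": 23, "B": 25, "C": 27, "D": 25}

total = sum(values.values())

# Each slice gets a share of the full 360 degrees.
angles = {name: 360 * v / total for name, v in values.items()}

for name, angle in angles.items():
    print(name, round(angle, 1))
```

Slices A and B differ by only about 7 degrees here, and they do not share a common baseline, whereas in a bar chart the same values would sit side by side with bar lengths of 23 and 25.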

# Part 2: Visualizing geo-data


Exercise: A different take on geospatial data. It's OK to use your LLM for all of the Folium exercises.

A couple of weeks ago (Part 3 of Week 2), we worked with spatial data by using color-intensity of shapefiles to show the counts of certain crimes within those individual areas. Today, we look at studying geospatial data by plotting raw data points as well as heatmaps on top of actual maps.


* First start by plotting a map of San Francisco with a nice tight zoom. Simply use the command folium.Map([lat, lon], zoom_start=13), where you'll have to look up San Francisco's longitude and latitude.

In [1]:
import folium
SF = [37.774, -122.431297]
SF_MAP = folium.Map(SF, zoom_start=13)
SF_MAP

* Next, use the coordinates for SF City Hall 37.77919, -122.41914 to indicate its location on the map with a nice, pop-up enabled marker. (In the screenshot below, I used the black & white Stamen tiles, because they look cool. UPDATE 2024: Note that the Stamen tiles are no longer available, but there are many other tile options. Link for more options on Stamen here).

In [23]:
SF_CH = [37.77919, -122.41914]
folium.Marker(
    location=SF_CH,
    popup='San Francisco City Hall',
    icon=folium.Icon(color='red')
).add_to(SF_MAP)

# Display the map
SF_MAP

* Now, let's plot some more data (no need for pop-ups this time). Select a couple of months of data for 'DRUG/NARCOTIC' and draw a little dot for each arrest for those two months. You could, for example, choose June-July 2016, but you can choose anything you like - the main concern is to not have too many points as this uses a lot of memory and makes Folium behave non-optimally. We can call this kind of visualization a point scatter plot.


In [3]:
import pandas as pd
data = pd.read_csv("Police_Department_Incident_Reports__Historical_2003_to_May_2018_20240204.csv") 

In [4]:
# Filter only drug/narcotic data and make a copy to avoid SettingWithCopyWarning
DRUG_data = data[data['Category'] == 'DRUG/NARCOTIC'].copy()

# Convert 'Date' column to datetime format
DRUG_data['Date'] = pd.to_datetime(DRUG_data['Date'])

# Filter data for June and July
DRUG_data_jj = DRUG_data[DRUG_data['Date'].dt.month.isin([6, 7])]

# Filter data for the year 2016
data_filtered = DRUG_data_jj[DRUG_data_jj['Date'].dt.year == 2016]

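# Extract the two coordinates from the 'location' column using regex.
# NOTE: the strings have the form 'POINT (lon lat)', so capture group 0 is the
# longitude and capture group 1 is the latitude.
data_points = data_filtered['location'].str.extract(r'POINT \(([-\d.]+) ([-\d.]+)\)', expand=True)

# From ChatGPT:
# Convert extracted values to float and create a DataFrame.
# NOTE: the column labels below are swapped relative to their contents
# ('Latitude' holds the longitudes and vice versa). The plotting cells further
# down compensate by passing [row['Longitude'], row['Latitude']] to folium.
df_points = pd.DataFrame({
    'Latitude': pd.to_numeric(data_points[0], errors='coerce'),
    'Longitude': pd.to_numeric(data_points[1], errors='coerce')
})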
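
Since the coordinate order is easy to mix up, here is a minimal standalone sketch of the parsing step. It assumes the `location` strings look like `POINT (lon lat)`, as the code in this notebook implies; `parse_point` is a hypothetical helper, not something from pandas or folium.

```python
import re

# Same pattern as in the cell above: two numbers inside 'POINT (...)'
POINT_RE = re.compile(r"POINT \(([-\d.]+) ([-\d.]+)\)")

def parse_point(wkt):
    """Return (lat, lon) from a 'POINT (lon lat)' string, or None if it does not match."""
    if not isinstance(wkt, str):
        return None  # e.g. a missing value
    match = POINT_RE.search(wkt)
    if match is None:
        return None
    lon, lat = float(match.group(1)), float(match.group(2))
    return lat, lon  # folium expects latitude first

# City Hall coordinates from above, written longitude-first as in the dataset
sample = "POINT (-122.41914 37.77919)"
print(parse_point(sample))
```

Returning the pair in folium's `[lat, lon]` order in one place means the plotting code does not have to remember which column holds which coordinate.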



In [213]:
SF_MAP = folium.Map(location=[37.7749, -122.4194], zoom_start=12)

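# Plot a small circle marker for each point in the DataFrame.
# folium expects [lat, lon]. The 'location' strings are 'POINT (lon lat)' and
# df_points labels the first value 'Latitude', so row['Longitude'] is the
# actual latitude here. Rows with missing coordinates are dropped first.
for _, row in df_points.dropna().iterrows():
    folium.CircleMarker(
        location=[row['Longitude'], row['Latitude']],
        radius=2,
        color='red',
        fill=True,
        fill_color='red'
    ).add_to(SF_MAP)
    #print([row['Latitude'], row['Longitude']])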
    
# Østerbro point (Copenhagen):
folium.CircleMarker(
    location=[55.694294,12.586799],
    radius=2,
    color='red',
    fill=True,
    fill_color='red'
).add_to(SF_MAP)
# Display the map
SF_MAP


Exercise: Heatmaps.

* Now, let's play with heatmaps. You can figure out the appropriate commands by grabbing code from the main tutorial and modifying it to suit your needs.
* To create your first heatmap, grab all arrests for the category 'SEX OFFENSES, NON FORCIBLE' across all time. Play with parameters to get plots you like.


In [18]:
from folium import plugins
from folium.plugins import HeatMap

SF_MAP = folium.Map(location=[37.7749, -122.4194], zoom_start=12)


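# List comprehension to build our list of [lat, lon] pairs.
# NOTE: as defined earlier in this notebook, df_points holds the June-July 2016
# DRUG/NARCOTIC points with swapped column labels, so row['Longitude'] is the
# actual latitude. Rows with missing coordinates are dropped first.
heat_data = [[row['Longitude'], row['Latitude']] for _, row in df_points.dropna().iterrows()]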

heatmap = HeatMap(heat_data,
                  radius=15,  # Adjust the radius of influence
                  blur=30,    # Adjust the blur
                  gradient={0.3: 'blue', 0.45: 'lime', 1: 'red'}  # Customize the gradient
                 )

SF_MAP.add_child(heatmap)
#HeatMap(heat_data).add_to(SF_MAP)

# Display the map
SF_MAP

* Now, comment on the differences between scatter plots and heatmaps. What can you see using the scatter plots that you can't see using the heatmaps? And vice versa: what do the heatmaps help you see that's difficult to distinguish in the scatter plots?
    * The scatter plot shows very clearly how many incidents there are in each area, whereas the heatmap can mislead you into thinking there are many more in an area because of the 'clouds' of color. On the other hand, the heatmap quickly gives an overview of where crimes are most concentrated, whereas the scatter plot takes the eye longer to analyze.

* Play around with the various parameters for heatmaps. You can find a list here: https://python-visualization.github.io/folium/plugins.html
* Comment on the effect of the various parameters for the heatmaps. How do they change the picture? (at least talk about the radius and blur).
    * The radius strongly changes your impression of how many crimes there are.
    * With too much blur you lose detail, but with too little
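
The effect of the radius can be illustrated with a toy model. This is only a sketch under simplifying assumptions: it works on a line instead of a map, uses a triangular bump per point, and is not how the heatmap plugin actually renders. The point positions, `intensity` and `count_blobs` are all made up for the illustration.

```python
def intensity(x, points, radius):
    """Sum of triangular bumps of half-width `radius` centred on each point."""
    return sum(max(0.0, 1.0 - abs(x - p) / radius) for p in points)

def count_blobs(points, radius, threshold=0.5):
    """Count contiguous stretches of a 0..8 grid where the intensity exceeds the threshold."""
    blobs, inside = 0, False
    for i in range(81):  # grid positions 0.0, 0.1, ..., 8.0
        hot = intensity(i / 10, points, radius) > threshold
        if hot and not inside:
            blobs += 1  # entering a new hot stretch
        inside = hot
    return blobs

points = [2.0, 2.2, 2.4, 6.0, 6.2]  # two made-up clusters on a line

small = count_blobs(points, radius=1.0)
large = count_blobs(points, radius=3.0)
print(small, large)
```

With the small radius the two clusters show up as separate hot spots, while with the large radius they merge into one, which matches the impression that a big radius makes an area look more crime-dense than the raw points suggest.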

Exercise: Heatmap movies. This exercise is a bit more independent than above - you get to make all the choices.


* For the final element of working with heatmaps, let's now use the cool Folium functionality HeatMapWithTime to create a visualization of how the patterns of your favorite crime type change over time.



* Start by choosing your favorite crime type. Preferably one with spatial patterns that change over time (use your data exploration from the previous lectures to choose a good one).
* Now, choose a time resolution. You could use daily, weekly, or monthly slices of the data as the frames in your movie. Again, the goal is to find interesting temporal patterns to display. We want at least 20 frames though.
* Create the movie using HeatMapWithTime.

In [35]:
import folium
from folium.plugins import HeatMapWithTime
import pandas as pd

# Step 1: Choose your favorite crime type and filter the dataset accordingly
favorite_crime_type = 'BURGLARY'  # Example: 'BURGLARY'
favorite_crime_data = data[data['Category'] == favorite_crime_type].copy()

# Step 2: Ensure 'Date' column is in datetime format
favorite_crime_data['Date'] = pd.to_datetime(favorite_crime_data['Date'])

# Step 3: Filter data for January 2016
january_2016_data = favorite_crime_data[(favorite_crime_data['Date'].dt.year == 2016) & 
                                        (favorite_crime_data['Date'].dt.month == 1)]

# Step 4: Group the data by Date and Location
grouped_data = january_2016_data.groupby([pd.Grouper(key='Date', freq='D'), 'location']).size().reset_index(name='Count')

# Step 5: Prepare data for HeatMapWithTime
heat_data = []
for date, frame_data in grouped_data.groupby('Date'):
    frame_heat_data = []
    for index, row in frame_data.iterrows():
        # Extract latitude and longitude from 'location' column
        lon, lat = map(float, row['location'].split('(')[1].split(')')[0].split())
        # Append data to frame_heat_data with correct count value
        frame_heat_data.append([lat, lon, row['Count']])
    heat_data.append(frame_heat_data)

# Step 6: Create base map
SF_MAP = folium.Map(location=[37.7749, -122.4194], zoom_start=12)

# Step 7: Create HeatMapWithTime
HeatMapWithTime(heat_data, radius=15).add_to(SF_MAP)  # Adjust radius as needed

# Step 8: Display the map
SF_MAP
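
The frame-building step can also be sketched without pandas. The version below uses made-up records and a hypothetical `build_frames` helper. Two things differ from the cell above, both as assumptions: it scales the counts into the range (0, 1], which is how I understand the folium docs to describe the weights, and it returns date labels that could be passed through the `index` argument of HeatMapWithTime.

```python
from collections import defaultdict

# Made-up records: (ISO date, latitude, longitude)
records = [
    ("2016-01-02", 37.78, -122.41),
    ("2016-01-01", 37.77, -122.42),
    ("2016-01-01", 37.77, -122.42),
    ("2016-01-02", 37.76, -122.43),
]

def build_frames(records):
    """Group records by date and return (labels, frames) with weights scaled to (0, 1]."""
    counts = defaultdict(lambda: defaultdict(int))
    for date, lat, lon in records:
        counts[date][(lat, lon)] += 1
    # Largest count over all frames, used to normalise the weights
    max_count = max(n for per_day in counts.values() for n in per_day.values())
    labels = sorted(counts)  # ISO dates sort chronologically as strings
    frames = [
        [[lat, lon, n / max_count] for (lat, lon), n in counts[day].items()]
        for day in labels
    ]
    return labels, frames

labels, frames = build_frames(records)
print(labels)
print(frames)

# With folium installed, these could then be used as:
# HeatMapWithTime(frames, index=labels, radius=15).add_to(SF_MAP)
```

Normalising by the largest count over all frames, rather than per frame, keeps the colors comparable from one day to the next.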


# Part 3: Errors in the data. The importance of looking at raw (or close to raw) data.
We started the course by plotting simple histograms and bar plots that showed a lot of cool patterns. But sometimes the binning can hide imprecision, irregularity, and simple errors in the data that could be misleading. In the work we've done so far, we've already come across at least three examples of this in the SF data. It's 100% OK to use your LLM for this one.

This last exercise for today has two parts:

* In each of the examples above, describe in your own words how the data-errors I call attention to above can bias the binned versions of the data. Also, briefly mention how not noticing these errors can result in misconceptions about the underlying patterns of what's going on in San Francisco (and our modeling).
    * The way the data is collected is very important. It can distort the analysis and create bias or misunderstandings. For example, if police are often present in specific areas, more crimes will be recorded there. That doesn't mean these places are the ones with the most crime. And if officers are often lazy about writing down the exact time (and start rounding up or down), then how you define your bins matters a lot (whether they span whole hours, and whether they start on the hour or the half hour); this can change the interpretation.
* Find your own example of human noise in the data and visualize it.
* Were you able to use your LLM for anything in this exercise?
    * No
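
One possible way to look for the rounding described above is to count how often each minute value occurs. The sketch below uses made-up `HH:MM` strings standing in for the dataset's time column (assuming it is stored in that format), so the numbers say nothing about the real data.

```python
from collections import Counter

# Made-up "HH:MM" values, for illustration only
times = ["12:00", "12:00", "08:30", "17:45", "23:00",
         "09:13", "14:30", "12:00", "03:27", "18:15"]

# Count how often each minute-of-the-hour appears
minute_counts = Counter(int(t.split(":")[1]) for t in times)

# Share of records that fall exactly on a quarter hour
round_share = sum(minute_counts[m] for m in (0, 15, 30, 45)) / len(times)

# Share we would expect if minutes were spread evenly over the hour
expected_share = 4 / 60

print(minute_counts.most_common(3))
print(round(round_share, 2), round(expected_share, 2))
```

If the real data shows a share far above the roughly 7% expected from evenly spread minutes, that points to times being rounded by whoever recorded them, and a bar plot of `minute_counts` would be one way to visualize it.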