# EDA Image Acquisition for Grid Model
In this notebook we prepare the data for a model that uses satellite images to classify the probability of a traffic accident.
To do so, we take the following steps:
- Choose City: We have to choose a city that we want to train our model with
- Overlay a grid on the city
- Count the number of accidents in each grid cell
- Define the number of categories and the criteria for assigning grid cells to them
- Assign each grid cell to a category



## Choose City

In [1]:
import pandas as pd
import numpy as np
from rtree import index
import geopandas as gpd
from shapely.geometry import Point


In [4]:
data = pd.read_csv('C:/Projekte/TDS/TDS2324-TrafficAccidents/Data/TrafficAccidentData/all_16_22.csv',dtype={'ags': str})
data.head()

Unnamed: 0.1,Unnamed: 0,land,regbez,kreis,gemeinde,jahr,monat,stunde,wochentag,kategorie,...,ist_pkw,ist_fuss,ist_krad,ist_gkfz,ist_sonstige,linrefx,linrefy,xgcswgs84,ygcswgs84,ags
0,0,1,0,53,120,2016,1,9,5,2,...,1,0,0,0.0,0,606982.394,5954660.0,10.621659,53.729615,1053120
1,1,1,0,57,10,2016,1,17,3,3,...,1,0,0,0.0,0,574882.533,6011441.0,10.149176,54.245453,1057010
2,2,1,0,62,8,2016,1,0,5,3,...,1,0,0,0.0,0,599934.6875,5964609.0,10.518094,53.820403,1062008
3,3,1,0,3,0,2016,1,15,5,3,...,0,0,0,0.0,1,610709.3487,5968284.0,10.683021,53.851243,1003000
4,4,1,0,55,28,2016,1,14,1,3,...,1,0,0,0.0,0,605690.7904,6009152.0,10.620986,54.219459,1055028


In [5]:
# check which German city has the most accidents
most_accidents_ags = data['ags'].value_counts()
most_accidents_ags.head()


ags
09162000    33806
06412000    17220
05315000    16671
04011000    15672
14612000    14513
Name: count, dtype: int64

The AGS codes (Amtlicher Gemeindeschlüssel, the official municipality key) stand for:
1. Munich
2. Frankfurt
3. Cologne
4. Bremen
5. Dresden
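
The AGS can also be rebuilt from the `land`, `regbez`, `kreis` and `gemeinde` columns shown above, which illustrates why the column is read as a string: the leading zero would otherwise be lost. The following is a minimal sketch; the helper name `build_ags` is hypothetical and not part of the original notebook.

```python
def build_ags(land, regbez, kreis, gemeinde):
    """Compose the 8-digit AGS: 2 digits state, 1 digit district,
    2 digits county, 3 digits municipality (zero-padded)."""
    return f"{land:02d}{regbez:01d}{kreis:02d}{gemeinde:03d}"

# Values taken from the Munich rows of the accident data
munich_ags = build_ags(land=9, regbez=1, kreis=62, gemeinde=0)
print(munich_ags)  # 09162000
```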

Since we have very good satellite images and a lot of accident data for Munich, we will train our model on the city of Munich.

## Data Introduction Munich

In [6]:
# keep only the accidents in the city of Munich
munich = data[data['ags'] == '09162000']
munich.head()

Unnamed: 0.1,Unnamed: 0,land,regbez,kreis,gemeinde,jahr,monat,stunde,wochentag,kategorie,...,ist_pkw,ist_fuss,ist_krad,ist_gkfz,ist_sonstige,linrefx,linrefy,xgcswgs84,ygcswgs84,ags
60594,60599,9,1,62,0,2016,1,16,1,3,...,1,0,0,0.0,0,694645.8088,5332073.0,11.615121,48.112151,9162000
60680,60685,9,1,62,0,2016,1,14,1,3,...,1,0,0,0.0,0,687820.0791,5333578.0,11.524172,48.127725,9162000
60853,60858,9,1,62,0,2016,1,15,1,2,...,0,0,0,0.0,0,688689.197,5335742.0,11.536799,48.146921,9162000
60953,60958,9,1,62,0,2016,1,3,6,3,...,1,1,0,0.0,0,700587.1593,5332651.0,11.695126,48.115499,9162000
61071,61076,9,1,62,0,2016,1,14,3,3,...,1,0,0,0.0,0,694222.71,5340374.0,11.61323,48.186879,9162000


In [7]:
munich.shape

(33806, 25)

## Building the Grid
- To get a rectangular bounding box around Munich, we take the minimum and maximum longitude and latitude of all accidents and build our grid inside these bounds.


In [8]:
# Finding the boundaries of the grid 
min_longitude = munich['xgcswgs84'].min()
max_longitude = munich['xgcswgs84'].max()
min_latitude = munich['ygcswgs84'].min()
max_latitude = munich['ygcswgs84'].max()


- Afterwards we estimate how many meters correspond to one degree of latitude and longitude, so that we can build a grid with cells of roughly 40 m x 40 m.

In [9]:
# Constants for conversion
METERS_PER_DEGREE_LATITUDE = 111000  # approximately 111 kilometers
METERS_PER_DEGREE_LONGITUDE = 71000  # approximately 71 kilometers (rough estimate; the spherical value at about 48.1° N is closer to 74 km)

# Calculating degree increments for a 40 meter cell
degree_increment_latitude = 40 / METERS_PER_DEGREE_LATITUDE
degree_increment_longitude = 40 / METERS_PER_DEGREE_LONGITUDE
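
The two constants above can be cross-checked with a spherical-Earth approximation. The sketch below is only an illustration and not part of the original pipeline; `EARTH_RADIUS_M`, `meters_per_degree` and the reference latitude of 48.14° are assumptions introduced here. Under these assumptions the east-west value comes out near 74 km rather than 71 km, which would make the cells slightly wider than 40 m in the east-west direction.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius (spherical approximation, assumption)

def meters_per_degree(latitude_deg):
    """Return (meters per degree latitude, meters per degree longitude)."""
    per_degree_lat = math.pi * EARTH_RADIUS_M / 180
    # Meridians converge towards the poles, hence the cosine factor
    per_degree_lon = per_degree_lat * math.cos(math.radians(latitude_deg))
    return per_degree_lat, per_degree_lon

# 48.14 is an assumed reference latitude for Munich
lat_m, lon_m = meters_per_degree(48.14)
print(round(lat_m), round(lon_m))

# East-west width in meters of a cell built with the 71 km constant
cell_width_m = 40 / 71000 * lon_m
print(round(cell_width_m, 1))
```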

Now we can generate the grid.

In [10]:
# Generating the grid
latitude_range = np.arange(min_latitude, max_latitude, degree_increment_latitude)
longitude_range = np.arange(min_longitude, max_longitude, degree_increment_longitude)

# Creating a DataFrame to represent the grid
grid_cells = pd.DataFrame([(lat, lon) for lat in latitude_range for lon in longitude_range], 
                          columns=['Latitude', 'Longitude'])

grid_cells.head()

Unnamed: 0,Latitude,Longitude
0,48.065872,11.381973
1,48.065872,11.382536
2,48.065872,11.3831
3,48.065872,11.383663
4,48.065872,11.384226


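The coordinates mark the lower-left (south-west) corner of a grid cell, since the grid starts at the minimum latitude and longitude and each cell extends by one degree increment from there.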

In [11]:
grid_cells.shape

(290772, 2)
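
For larger grids, the nested list comprehension can be replaced by a vectorized construction. The following is a minimal sketch with a hypothetical helper `build_grid` and made-up bounds; with `indexing='ij'` the rows come out in the same order as above (latitude varies slowest, longitude fastest).

```python
import numpy as np
import pandas as pd

def build_grid(min_lat, max_lat, min_lon, max_lon, d_lat, d_lon):
    """Build the grid of lower-left cell corners without Python loops."""
    lats = np.arange(min_lat, max_lat, d_lat)
    lons = np.arange(min_lon, max_lon, d_lon)
    # 'ij' indexing: first axis is latitude, second axis is longitude
    lat_grid, lon_grid = np.meshgrid(lats, lons, indexing='ij')
    return pd.DataFrame({'Latitude': lat_grid.ravel(),
                         'Longitude': lon_grid.ravel()})

# Small made-up bounding box, using the same increments as the notebook
demo_grid = build_grid(48.0, 48.002, 11.0, 11.003, 40 / 111000, 40 / 71000)
print(demo_grid.shape)
```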

We have 290,772 grid cells. We export them as a CSV file and use QGIS to check that the grid cells look as expected. The QGIS project can be found in Data\QGIS\Munich grid.

In [12]:
# export munich grid as csv
grid_cells.to_csv('C:/Projekte/TDS/TDS2324-TrafficAccidents/Data/Munich/munich_grid.csv', index=False)

## Categorize grid cells

Now we take a closer look at the grid cells and analyse the distribution of accidents across them, so that we can define categories to sort the cells into. To speed up the lookup of the cell that contains an accident, we use an R-tree.

In [13]:
from rtree import index

# Create an R-tree index
idx = index.Index()

# Populate the index with the bounding boxes of the grid cells
for i, row in grid_cells.iterrows():
    # 'Longitude' and 'Latitude' are the lower bounds of the cell;
    # the upper bounds are obtained by adding one degree increment
    idx.insert(i, (row['Longitude'], row['Latitude'], row['Longitude']+degree_increment_longitude, row['Latitude']+degree_increment_latitude))

# Work on an explicit copy so that adding a column does not trigger
# pandas' SettingWithCopyWarning
munich = munich.copy()

# Add a 'grid_cell' column to the munich DataFrame
munich['grid_cell'] = np.nan

# Assign a grid cell to each accident
for i, row in munich.iterrows():
    # Use the first grid cell returned by the index for the accident's location
    for j in idx.intersection((row['xgcswgs84'], row['ygcswgs84'])):
        munich.at[i, 'grid_cell'] = j
        break

munich.head()


Unnamed: 0.1,Unnamed: 0,land,regbez,kreis,gemeinde,jahr,monat,stunde,wochentag,kategorie,...,ist_fuss,ist_krad,ist_gkfz,ist_sonstige,linrefx,linrefy,xgcswgs84,ygcswgs84,ags,grid_cell
60594,60599,9,1,62,0,2016,1,16,1,3,...,0,0,0.0,0,694645.8088,5332073.0,11.615121,48.112151,9162000,76061.0
60680,60685,9,1,62,0,2016,1,14,1,3,...,0,0,0.0,0,687820.0791,5333578.0,11.524172,48.127725,9162000,101313.0
60853,60858,9,1,62,0,2016,1,15,1,2,...,0,0,0.0,0,688689.197,5335742.0,11.536799,48.146921,9162000,132658.0
60953,60958,9,1,62,0,2016,1,3,6,3,...,1,0,0.0,0,700587.1593,5332651.0,11.695126,48.115499,9162000,81522.0
61071,61076,9,1,62,0,2016,1,14,3,3,...,0,0,0.0,0,694222.71,5340374.0,11.61323,48.186879,9162000,198395.0
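
Because the grid is regular and stored in row-major order, the cell index could also be computed arithmetically instead of querying an R-tree. The following is a minimal sketch of that alternative, not the method used above; the helper `assign_grid_cells` and the demo coordinates are hypothetical. Points that fall exactly on a cell border may end up in a different neighbouring cell than with the R-tree lookup, and points on the outer edge may need clipping.

```python
import numpy as np
import pandas as pd

def assign_grid_cells(points, min_lat, min_lon, d_lat, d_lon, n_lon):
    """Return the row-major grid cell index for each accident location."""
    lat_idx = np.floor((points['ygcswgs84'].to_numpy() - min_lat) / d_lat).astype(int)
    lon_idx = np.floor((points['xgcswgs84'].to_numpy() - min_lon) / d_lon).astype(int)
    # Cells are enumerated latitude by latitude, n_lon cells per latitude row
    return lat_idx * n_lon + lon_idx

# Made-up bounding box and points for illustration
min_lat, max_lat, min_lon, max_lon = 48.0, 48.002, 11.0, 11.003
d_lat, d_lon = 40 / 111000, 40 / 71000
latitude_range = np.arange(min_lat, max_lat, d_lat)
longitude_range = np.arange(min_lon, max_lon, d_lon)

points = pd.DataFrame({'xgcswgs84': [11.0001, 11.0020],
                       'ygcswgs84': [48.0001, 48.0015]})
points['grid_cell'] = assign_grid_cells(points, min_lat, min_lon,
                                        d_lat, d_lon, len(longitude_range))
print(points)
```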


In [14]:
# save munich with grid cell as csv
munich.to_csv('C:/Projekte/TDS/TDS2324-TrafficAccidents/Data/Munich/munich.csv', index=False)

Now we find out how many accidents happened in each grid cell.

In [15]:
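# add a feature named count to grid_cells: the number of accidents per grid cell
# (vectorized; accidents without an assigned grid cell are ignored)
accident_counts = munich['grid_cell'].dropna().astype(int).value_counts()
grid_cells['count'] = accident_counts.reindex(grid_cells.index, fill_value=0)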

grid_cells.head()

Unnamed: 0,Latitude,Longitude,count
0,48.065872,11.381973,0
1,48.065872,11.382536,0
2,48.065872,11.3831,0
3,48.065872,11.383663,0
4,48.065872,11.384226,0


Now we find out how many cells share each accident count.

In [22]:
# make new empty df
count_cells = pd.DataFrame(columns=['accident_count', 'amount_cells'])

# fill accident_count with the unique values of the count column of grid_cells
count_cells['accident_count'] = grid_cells['count'].unique()

# fill amount_cells with the amount of cells with the same accident_count
for i, row in count_cells.iterrows():
    count_cells.at[i, 'amount_cells'] = grid_cells[grid_cells['count'] == row['accident_count']].shape[0]

count_cells.sort_values(by=['accident_count'], inplace=True)
count_cells.head(40)





Unnamed: 0,accident_count,amount_cells
0,0,277152
1,1,7422
2,2,2627
3,3,1203
5,4,703
6,5,440
7,6,304
4,7,226
9,8,125
11,9,121
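
The same frequency table can be obtained without an explicit loop. The following is a minimal sketch on made-up counts (the Series `counts` is hypothetical); note that the resulting row index is renumbered, so it differs from the index shown in the output above.

```python
import pandas as pd

# Made-up accident counts per grid cell
counts = pd.Series([0, 0, 0, 1, 1, 2, 5], name='count')

count_cells_demo = (
    counts.value_counts()              # number of cells per accident count
          .sort_index()                # order by accident count
          .rename_axis('accident_count')
          .reset_index(name='amount_cells')
)
print(count_cells_demo)
```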


Since the distribution is heavily skewed towards cells without accidents, we define the categories so that each one contains enough cells to provide around 1000 pictures per category.

In [23]:
# categorize each grid cell by the number of accidents it contains
# category 0 = 0 accidents
# category 1 = 1 accident
# category 2 = 2 accidents
# category 3 = 3-4 accidents
# category 4 = 5 or more accidents

grid_cells['category'] = 0

for i, row in grid_cells.iterrows():
    if row['count'] == 0:
        grid_cells.at[i, 'category'] = 0
    elif row['count'] == 1:
        grid_cells.at[i, 'category'] = 1
    elif row['count'] == 2:
        grid_cells.at[i, 'category'] = 2
    elif row['count'] == 3 or row['count'] == 4:
        grid_cells.at[i, 'category'] = 3
    else:
        grid_cells.at[i, 'category'] = 4

grid_cells.head()


Unnamed: 0,Latitude,Longitude,count,category
0,48.065872,11.381973,0,0
1,48.065872,11.382536,0,0
2,48.065872,11.3831,0,0
3,48.065872,11.383663,0,0
4,48.065872,11.384226,0,0
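
The category rules above can also be expressed as bin edges. The following is a minimal sketch using `pd.cut` on made-up counts; the names `demo_counts`, `bins` and `labels` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up accident counts per grid cell
demo_counts = pd.Series([0, 1, 2, 3, 4, 5, 9])

# Right-closed bins: (-1, 0], (0, 1], (1, 2], (2, 4], (4, inf)
bins = [-1, 0, 1, 2, 4, np.inf]
labels = [0, 1, 2, 3, 4]

demo_categories = pd.cut(demo_counts, bins=bins, labels=labels).astype(int)
print(demo_categories.tolist())  # [0, 1, 2, 3, 3, 4, 4]
```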


In [24]:
# how often does each category occur?
grid_cells['category'].value_counts()


category
0    277152
1      7422
2      2627
3      1906
4      1665
Name: count, dtype: int64

In [19]:
# Filter the munich DataFrame to include only the WGS coordinates
munich_wgs = munich[['xgcswgs84', 'ygcswgs84']]

# Save the filtered DataFrame as a CSV file
munich_wgs.to_csv('C:/Projekte/TDS/TDS2324-TrafficAccidents/Data/Munich/munich_coord.csv', index=False)



## Selecting cells
Here we select the cells for which we download the pictures.

In [25]:
# Create a DataFrame with 1500 cells of each category
# (sampled with replacement, so a cell can be selected more than once)
category_0 = grid_cells[grid_cells['category'] == 0].sample(n=1500, replace=True)
category_1 = grid_cells[grid_cells['category'] == 1].sample(n=1500, replace=True)
category_2 = grid_cells[grid_cells['category'] == 2].sample(n=1500, replace=True)
category_3 = grid_cells[grid_cells['category'] == 3].sample(n=1500, replace=True)
category_4 = grid_cells[grid_cells['category'] == 4].sample(n=1500, replace=True)


# Concatenate the DataFrames
selected_img = pd.concat([category_0, category_1, category_2, category_3, category_4])

# Reset the index of the DataFrame
selected_img.reset_index(drop=True, inplace=True)

# Print the resulting DataFrame
print(selected_img)

# count how often each category occurs (not displayed, since it is not the last expression in the cell)
selected_img['category'].value_counts()

# save selected_img as csv
selected_img.to_csv('C:/Projekte/TDS/TDS2324-TrafficAccidents/Data/Images/selected_img_grid.csv', index=False)



       Latitude  Longitude  count  category
0     48.101908  11.605635      0         0
1     48.229836  11.509297      0         0
2     48.165691  11.498029      0         0
3     48.209295  11.442255      0         0
4     48.238845  11.496902      0         0
...         ...        ...    ...       ...
7495  48.145151  11.478311      6         4
7496  48.117043  11.656339      6         4
7497  48.202088  11.605071      5         4
7498  48.124971  11.510424      7         4
7499  48.141547  11.556057      7         4

[7500 rows x 4 columns]
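
If duplicate cells (and therefore duplicate images) are not wanted, the cells could instead be drawn without replacement, as long as the sample size does not exceed the size of the smallest category (1665 cells here). The following is a minimal sketch on a made-up DataFrame; `demo_cells`, the sample size of 4 and the fixed `random_state` are assumptions for illustration.

```python
import pandas as pd

# Made-up grid cells: 50 cells, 10 per category
demo_cells = pd.DataFrame({'cell_id': range(50),
                           'category': [i % 5 for i in range(50)]})

# Draw the same number of distinct cells from every category;
# random_state makes the selection reproducible
demo_selected = (demo_cells.groupby('category')
                           .sample(n=4, random_state=42)
                           .reset_index(drop=True))
print(demo_selected['category'].value_counts().sort_index())
```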
