<a href="https://colab.research.google.com/github/Fatis092/repo52/blob/main/Clustering_Task.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clustering and Dimensionality Reduction Exam
Welcome to the weekly project on clustering and dimensionality reduction. You will be working with a dataset of traffic accidents.

## Dataset
The dataset used in this task is `Traffic_Accidents.csv`.

## Instructions
- Follow the steps outlined below.
- Write your code in the empty code cells.
- Comment on your code to explain your reasoning.

## Dataset Overview
The dataset contains information about traffic accidents, including location, weather conditions, road conditions, and more. Below is a sample of the columns:

* `Location_Easting_OSGR`: Easting coordinate of the accident location.
* `Location_Northing_OSGR`: Northing coordinate of the accident location.
* `Longitude`: Longitude of the accident site.
* `Latitude`: Latitude of the accident site.
* `Police_Force`: Identifier for the police force involved.
* `Accident_Severity`: Severity of the accident.
* `Number_of_Vehicles`: Number of vehicles involved in the accident.
* `Number_of_Casualties`: Number of casualties in the accident.
* `Date`: Date of the accident.
* `Day_of_Week`: Day of the week when the accident occurred.
* `Speed_limit`: Speed limit in the area where the accident occurred.
* `Weather_Conditions`: Weather conditions at the time of the accident.
* `Road_Surface_Conditions`: Condition of the road surface during the accident.
* `Urban_or_Rural_Area`: Whether the accident occurred in an urban or rural area.
* `Year`: Year when the accident was recorded.
* Additional attributes related to road type, pedestrian crossing, light conditions, etc.

## Goal
The primary goal is to analyze the accidents based on their geographical location.


## Import Libraries

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
# Tools for the clustering workflow: scaling, dimensionality reduction, K-Means, and evaluation.
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

## Load the Data

In [3]:

df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Exam week 1/Traffic_Accidents.csv')


## Exploratory Data Analysis (EDA)
Perform EDA to understand the data better. This involves several steps to summarize the main characteristics, uncover patterns, and establish relationships:
* Find the dataset information and observe the datatypes.
* Check the shape of the data to understand its structure.
* View the data with various functions to get an initial sense of its contents.
* Perform summary statistics on the dataset to grasp central tendencies and variability.
* Check for duplicated data.
* Check for null values.

And apply more if needed!


In [4]:

df.describe()

Unnamed: 0,Location_Easting_OSGR,Location_Northing_OSGR,Longitude,Latitude,Police_Force,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Day_of_Week,Local_Authority_(District),1st_Road_Class,1st_Road_Number,Speed_limit,2nd_Road_Class,2nd_Road_Number,Urban_or_Rural_Area,Year
count,52000.0,52000.0,52000.0,52000.0,52000.0,51678.0,52000.0,50959.0,52000.0,52000.0,52000.0,52000.0,52000.0,52000.0,52000.0,51912.0,52000.0
mean,440284.256846,299861.7,-1.427193,52.586684,30.401712,2.837145,1.834327,1.354756,4.130712,349.542558,4.080519,997.078077,39.148558,2.672673,384.503058,1.359397,2009.401788
std,95109.751221,161362.4,1.398249,1.453049,25.545581,0.402582,0.727856,0.85522,1.926217,259.504721,1.428056,1806.405065,14.212826,3.20508,1304.989395,0.479868,3.006997
min,98480.0,19030.0,-6.895268,50.026153,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,15.0,-1.0,-1.0,1.0,2005.0
25%,375540.0,178010.0,-2.36619,51.487676,7.0,3.0,1.0,1.0,2.0,112.0,3.0,0.0,30.0,-1.0,0.0,1.0,2006.0
50%,440950.0,267180.0,-1.391202,52.295042,30.0,3.0,2.0,1.0,4.0,323.0,4.0,128.5,30.0,3.0,0.0,1.0,2010.0
75%,523500.0,398149.2,-0.214666,53.478016,46.0,3.0,2.0,1.0,6.0,530.0,6.0,716.0,50.0,6.0,0.0,2.0,2012.0
max,654960.0,1203900.0,1.753632,60.714774,98.0,3.0,34.0,51.0,7.0,941.0,6.0,9999.0,70.0,6.0,9999.0,3.0,2014.0


In [5]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52000 entries, 0 to 51999
Data columns (total 26 columns):
 #   Column                                       Non-Null Count  Dtype  
---  ------                                       --------------  -----  
 0   Location_Easting_OSGR                        52000 non-null  float64
 1   Location_Northing_OSGR                       52000 non-null  float64
 2   Longitude                                    52000 non-null  float64
 3   Latitude                                     52000 non-null  float64
 4   Police_Force                                 52000 non-null  int64  
 5   Accident_Severity                            51678 non-null  float64
 6   Number_of_Vehicles                           52000 non-null  int64  
 7   Number_of_Casualties                         50959 non-null  float64
 8   Date                                         52000 non-null  object 
 9   Day_of_Week                                  52000 non-null  int64  
 10

In [6]:

df.head()

Unnamed: 0,Location_Easting_OSGR,Location_Northing_OSGR,Longitude,Latitude,Police_Force,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,...,2nd_Road_Class,2nd_Road_Number,Pedestrian_Crossing-Human_Control,Pedestrian_Crossing-Physical_Facilities,Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Urban_or_Rural_Area,Did_Police_Officer_Attend_Scene_of_Accident,Year
0,560530.0,103950.0,0.277298,50.812789,47,3.0,1,1.0,27/11/2009,6,...,-1,0,None within 50 metres,No physical crossing within 50 meters,Darkeness: No street lighting,Raining with high winds,Flood (Over 3cm of water),2.0,Yes,2009
1,508860.0,187170.0,-0.430574,51.572846,1,3.0,2,1.0,10/10/2010,1,...,6,0,None within 50 metres,No physical crossing within 50 meters,Darkness: Street lights present and lit,Fine without high winds,Dry,1.0,Yes,2010
2,314460.0,169130.0,-3.231459,51.414661,62,3.0,2,1.0,14/09/2005,4,...,3,4055,None within 50 metres,No physical crossing within 50 meters,Daylight: Street light present,Fine without high winds,Dry,1.0,Yes,2005
3,341700.0,408330.0,-2.8818,53.568318,4,3.0,1,2.0,18/08/2007,7,...,6,0,None within 50 metres,No physical crossing within 50 meters,Daylight: Street light present,Raining without high winds,Wet/Damp,1.0,Yes,2007
4,386488.0,350090.0,-2.20302,53.047882,21,3.0,2,2.0,06/08/2013,3,...,-1,0,None within 50 metres,No physical crossing within 50 meters,Daylight: Street light present,Fine without high winds,Dry,1.0,Yes,2013


In [7]:

df.tail()

Unnamed: 0,Location_Easting_OSGR,Location_Northing_OSGR,Longitude,Latitude,Police_Force,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,...,2nd_Road_Class,2nd_Road_Number,Pedestrian_Crossing-Human_Control,Pedestrian_Crossing-Physical_Facilities,Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Urban_or_Rural_Area,Did_Police_Officer_Attend_Scene_of_Accident,Year
51995,475125.0,319380.0,-0.888006,52.766777,33,3.0,2,1.0,31/08/2012,6,...,6,6485,None within 50 metres,Pedestrian phase at traffic signal junction,Daylight: Street light present,Fine without high winds,Dry,1.0,Yes,2012
51996,456682.0,127058.0,-1.192915,51.04003,44,3.0,1,1.0,08/05/2013,4,...,-1,0,None within 50 metres,No physical crossing within 50 meters,Darkeness: No street lighting,Fine without high winds,Dry,2.0,Yes,2013
51997,540510.0,152250.0,0.012032,51.252055,45,3.0,3,1.0,01/11/2011,3,...,6,0,None within 50 metres,No physical crossing within 50 meters,Daylight: Street light present,Raining without high winds,Wet/Damp,1.0,Yes,2011
51998,434720.0,334000.0,-1.485264,52.902301,30,3.0,2,2.0,22/07/2011,6,...,5,81,None within 50 metres,No physical crossing within 50 meters,Daylight: Street light present,Raining without high winds,Wet/Damp,1.0,Yes,2011
51999,454710.0,185430.0,-1.212104,51.56505,43,3.0,3,1.0,24/05/2010,2,...,-1,0,None within 50 metres,No physical crossing within 50 meters,Daylight: Street light present,Fine without high winds,Dry,2.0,Yes,2010


In [8]:

df.sample(2)

Unnamed: 0,Location_Easting_OSGR,Location_Northing_OSGR,Longitude,Latitude,Police_Force,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,...,2nd_Road_Class,2nd_Road_Number,Pedestrian_Crossing-Human_Control,Pedestrian_Crossing-Physical_Facilities,Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Urban_or_Rural_Area,Did_Police_Officer_Attend_Scene_of_Accident,Year
25265,441843.0,400728.0,-1.370646,53.501581,14,3.0,2,2.0,14/09/2013,7,...,4,6097,None within 50 metres,No physical crossing within 50 meters,Daylight: Street light present,Fine without high winds,Dry,1.0,Yes,2013
17954,530110.0,181410.0,-0.12623,51.516583,1,3.0,1,1.0,07/04/2011,5,...,6,0,None within 50 metres,No physical crossing within 50 meters,Daylight: Street light present,Fine without high winds,Dry,1.0,Yes,2011


In [9]:

df.dtypes

Unnamed: 0,0
Location_Easting_OSGR,float64
Location_Northing_OSGR,float64
Longitude,float64
Latitude,float64
Police_Force,int64
Accident_Severity,float64
Number_of_Vehicles,int64
Number_of_Casualties,float64
Date,object
Day_of_Week,int64


In [11]:

# Dates are stored as day/month/year (e.g. 27/11/2009), so parse them day-first.
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)


In [12]:

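# Split the columns by dtype; `Date` is already datetime here, so it lands in neither frame.
numrical_df = df.select_dtypes(include='number')
categorical_df = df.select_dtypes(include=['object'])

# One-hot encode the nominal columns, prefixing each dummy with its source column name.
encoded_categorical_df = pd.get_dummies(categorical_df, prefix=categorical_df.columns)

# Recombine the encoded and numeric columns into a single feature frame.
final_data = pd.concat([encoded_categorical_df, numrical_df], axis=1)
final_data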

Unnamed: 0,Local_Authority_(Highway)_E06000001,Local_Authority_(Highway)_E06000002,Local_Authority_(Highway)_E06000003,Local_Authority_(Highway)_E06000004,Local_Authority_(Highway)_E06000005,Local_Authority_(Highway)_E06000006,Local_Authority_(Highway)_E06000007,Local_Authority_(Highway)_E06000008,Local_Authority_(Highway)_E06000009,Local_Authority_(Highway)_E06000010,...,Number_of_Casualties,Day_of_Week,Local_Authority_(District),1st_Road_Class,1st_Road_Number,Speed_limit,2nd_Road_Class,2nd_Road_Number,Urban_or_Rural_Area,Year
0,False,False,False,False,False,False,False,False,False,False,...,1.0,6,556,3,22,70,-1,0,2.0,2009
1,False,False,False,False,False,False,False,False,False,False,...,1.0,1,26,4,466,30,6,0,1.0,2010
2,False,False,False,False,False,False,False,False,False,False,...,1.0,4,746,6,0,30,3,4055,1.0,2005
3,False,False,False,False,False,False,False,False,False,False,...,2.0,7,84,6,0,30,6,0,1.0,2007
4,False,False,False,False,False,False,False,False,False,False,...,2.0,3,257,6,0,30,-1,0,1.0,2013
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51995,False,False,False,False,False,False,False,False,False,False,...,1.0,6,365,3,607,30,6,6485,1.0,2012
51996,False,False,False,False,False,False,False,False,False,False,...,1.0,4,502,3,272,60,-1,0,2.0,2013
51997,False,False,False,False,False,False,False,False,False,False,...,1.0,3,516,5,85,40,6,0,1.0,2011
51998,False,False,False,False,False,False,False,False,False,False,...,2.0,6,323,5,81,30,5,81,1.0,2011


In [13]:

co_matrix = numrical_df.corr()
co_matrix

Unnamed: 0,Location_Easting_OSGR,Location_Northing_OSGR,Longitude,Latitude,Police_Force,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Day_of_Week,Local_Authority_(District),1st_Road_Class,1st_Road_Number,Speed_limit,2nd_Road_Class,2nd_Road_Number,Urban_or_Rural_Area,Year
Location_Easting_OSGR,1.0,-0.4261,0.999358,-0.4281,-0.355791,0.011899,0.013283,-0.036557,-0.006266,-0.378165,-0.066523,-0.084217,-0.056254,0.045012,-0.003933,-0.087152,0.0326
Location_Northing_OSGR,-0.4261,1.0,-0.436512,0.999974,0.17644,-0.030637,-0.041982,0.028132,0.001804,0.128939,0.036908,0.044823,0.045996,-0.034241,0.019856,0.050624,-0.011934
Longitude,0.999358,-0.436512,1.0,-0.438409,-0.369331,0.012545,0.014655,-0.035299,-0.006195,-0.388518,-0.064988,-0.081901,-0.055575,0.04561,-0.002466,-0.08549,0.032355
Latitude,-0.4281,0.999974,-0.438409,1.0,0.174347,-0.030641,-0.041782,0.028327,0.001804,0.127175,0.036912,0.045675,0.045712,-0.033955,0.020501,0.050162,-0.012018
Police_Force,-0.355791,0.17644,-0.369331,0.174347,1.0,-0.033097,-0.015102,0.002525,0.004726,0.981991,0.047616,-0.006392,0.199189,-0.117069,-0.019011,0.237511,-0.031013
Accident_Severity,0.011899,-0.030637,0.012545,-0.030641,-0.033097,1.0,0.079257,-0.100917,0.015712,-0.03304,0.019119,-0.010151,-0.074867,0.06285,0.024364,-0.080621,-0.007567
Number_of_Vehicles,0.013283,-0.041982,0.014655,-0.041782,-0.015102,0.079257,1.0,0.267989,-0.005239,-0.009094,-0.139214,0.00046,0.080439,0.067722,0.027258,0.038217,-0.008036
Number_of_Casualties,-0.036557,0.028132,-0.035299,0.028327,0.002525,-0.100917,0.267989,1.0,0.003392,0.010768,-0.082221,0.0082,0.14063,-0.032779,0.00264,0.118086,-0.014588
Day_of_Week,-0.006266,0.001804,-0.006195,0.001804,0.004726,0.015712,-0.005239,0.003392,1.0,0.004856,0.008415,0.002395,-0.01501,0.003579,0.002457,-0.015958,-0.000973
Local_Authority_(District),-0.378165,0.128939,-0.388518,0.127175,0.981991,-0.03304,-0.009094,0.010768,0.004856,1.0,0.05851,0.005139,0.205801,-0.116678,-0.01762,0.25119,-0.035152


## Data Preprocessing
Apply whatever preprocessing you think the data needs, such as:
* Remove the outliers
* Impute missing data
* Scale the data
* Reduce dimensions using PCA
* Implement One-Hot Encoding for nominal categorical variables.
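
The outlier, scaling, and PCA steps listed above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `demo` frame is synthetic stand-in data (not the real `Traffic_Accidents.csv`), the 1.5 × IQR rule is one common outlier heuristic among several, and keeping two principal components is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for a few numeric columns of the accident data.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Longitude": rng.normal(-1.4, 1.4, 500),
    "Latitude": rng.normal(52.6, 1.5, 500),
    "Speed_limit": rng.choice([20, 30, 40, 50, 60, 70], 500).astype(float),
})

# 1. Remove outliers with the 1.5 * IQR rule, applied per column.
q1, q3 = demo.quantile(0.25), demo.quantile(0.75)
iqr = q3 - q1
inside = ((demo >= q1 - 1.5 * iqr) & (demo <= q3 + 1.5 * iqr)).all(axis=1)
trimmed = demo[inside]

# 2. Standardize so every column has zero mean and unit variance.
scaled = StandardScaler().fit_transform(trimmed)

# 3. Project onto the first two principal components.
pca = PCA(n_components=2)
reduced = pca.fit_transform(scaled)

print(reduced.shape)
print(pca.explained_variance_ratio_.round(3))
```

Scaling before PCA matters here because the columns are on very different scales; without it, the component directions would be dominated by whichever column has the largest raw variance.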

In [14]:

df.isnull().sum()

Unnamed: 0,0
Location_Easting_OSGR,0
Location_Northing_OSGR,0
Longitude,0
Latitude,0
Police_Force,0
Accident_Severity,322
Number_of_Vehicles,0
Number_of_Casualties,1041
Date,0
Day_of_Week,0


In [15]:

df.duplicated().sum()

43
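
The checks above report missing values in `Accident_Severity` and `Number_of_Casualties` plus some duplicated rows. One possible way to handle both is sketched below on a tiny hand-made frame (the `toy` values are invented for illustration). Filling the severity code with the mode and the casualty count with the median is an assumption, not a requirement of the task; dropping the affected rows would be another reasonable option.

```python
import numpy as np
import pandas as pd

# Invented miniature frame: rows 0 and 1 are exact duplicates,
# and two cells are missing.
toy = pd.DataFrame({
    "Accident_Severity": [3.0, 3.0, np.nan, 2.0, 3.0],
    "Number_of_Casualties": [1.0, 1.0, 2.0, np.nan, 1.0],
    "Longitude": [-0.43, -0.43, -3.23, -2.88, -1.20],
})

# Drop exact duplicate rows, keeping the first occurrence.
deduped = toy.drop_duplicates().copy()

# Severity is a categorical code, so fill it with the most frequent value.
severity_mode = deduped["Accident_Severity"].mode()[0]
deduped["Accident_Severity"] = deduped["Accident_Severity"].fillna(severity_mode)

# Casualties is a skewed count, so the median is a robust fill value.
casualty_median = deduped["Number_of_Casualties"].median()
deduped["Number_of_Casualties"] = deduped["Number_of_Casualties"].fillna(casualty_median)

print(deduped)
```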

## Feature Selection
Select relevant features for clustering. Explain your choice of features.
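
Since the goal is geographical, `Longitude` and `Latitude` are natural candidates. The correlation matrix computed earlier shows `Location_Easting_OSGR` and `Location_Northing_OSGR` correlating above 0.999 with them, so they carry essentially the same information. The sketch below shows one way to flag such redundant pairs before choosing features. The helper `redundant_pairs`, the 0.95 threshold, and the synthetic `demo` frame (including the made-up linear relation between longitude and easting) are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def redundant_pairs(frame, threshold=0.95):
    """Return (col_a, col_b, |r|) for column pairs with |correlation| >= threshold."""
    corr = frame.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = float(corr.iloc[i, j])
            if r >= threshold:
                pairs.append((cols[i], cols[j], round(r, 3)))
    return pairs

# Synthetic stand-in: easting is built as a noisy linear function of longitude.
rng = np.random.default_rng(1)
lon = rng.uniform(-6, 2, 300)
lat = rng.uniform(50, 60, 300)
demo = pd.DataFrame({
    "Longitude": lon,
    "Latitude": lat,
    "Location_Easting_OSGR": 440000 + 68000 * lon + rng.normal(0, 500, 300),
    "Day_of_Week": rng.integers(1, 8, 300),
})

pairs = redundant_pairs(demo)
print(pairs)

# Keep one representative of each redundant pair for clustering.
features = demo[["Longitude", "Latitude"]]
```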


## Data Visualization
Visualize the data with appropriate plots to gain insight into the dataset. Include the following:
- Scatter plot of accidents based on Longitude and Latitude.
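
A minimal sketch of such a scatter plot is shown below. The coordinates are randomly generated placeholders; in the notebook they would presumably be replaced by `df['Longitude']` and `df['Latitude']`. The marker size and transparency are arbitrary choices meant to keep dense areas readable.

```python
import matplotlib.pyplot as plt
import numpy as np

# Placeholder coordinates standing in for the accident locations.
rng = np.random.default_rng(2)
longitude = rng.normal(-1.4, 1.4, 1000)
latitude = rng.normal(52.6, 1.5, 1000)

fig, ax = plt.subplots(figsize=(6, 8))
# Small, semi-transparent markers so overlapping points show up as darker regions.
ax.scatter(longitude, latitude, s=2, alpha=0.3)
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("Accident locations")
```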

## Clustering
Apply K-Means clustering. Determine the optimal number of clusters and justify your choice.
* Find the `n_clusters` parameter using the elbow method.
* Train the model.
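
The elbow method can be sketched as follows: fit K-Means for a range of `n_clusters` values, record the inertia (within-cluster sum of squared distances) for each, and look for the point where the improvement flattens. The blob data from `make_blobs` is a synthetic stand-in for the scaled coordinates, and the range 1 to 8 is an arbitrary choice for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D points standing in for the (scaled) accident coordinates.
points, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.6, random_state=0)

# Fit one model per candidate k and record its inertia.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_

# Print the relative drop in inertia when going from k-1 to k clusters;
# the "elbow" is roughly where this drop becomes small.
for k in range(2, 9):
    drop = 1 - inertias[k] / inertias[k - 1]
    print(f"k={k}  inertia={inertias[k]:.1f}  drop={drop:.0%}")
```

Plotting `inertias` against `k` gives the usual elbow curve; reading the elbow is a judgment call, so it is worth cross-checking the choice with a second criterion such as the silhouette score.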

## Evaluation
Evaluate the clustering result using appropriate metrics.
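
Because clustering has no ground-truth labels here, internal metrics are the usual choice. The sketch below computes two of them from scikit-learn on synthetic blob data (a stand-in for the real features): the silhouette score, which ranges from -1 to 1 with higher meaning better-separated clusters, and the Davies-Bouldin index, where lower is better.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic stand-in data with four groups.
points, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.6, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(points)

# Silhouette: how close each point is to its own cluster versus the nearest other one.
sil = silhouette_score(points, labels)

# Davies-Bouldin: average similarity of each cluster to its most similar neighbor.
dbi = davies_bouldin_score(points, labels)

print(f"silhouette={sil:.3f}  davies_bouldin={dbi:.3f}")
```

The silhouette score needs pairwise distances, which becomes expensive on tens of thousands of rows; `silhouette_score` accepts a `sample_size` argument to evaluate on a random subset instead.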


## Plot the Data Points with Their Predicted Cluster Centers
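
One possible way to draw this plot is sketched below: color each point by its predicted cluster label and overlay the fitted centers from `cluster_centers_`. The data is again synthetic, and the axis labels assume the two clustering features are longitude and latitude.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the two clustering features.
points, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.6, random_state=0)
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(points)

fig, ax = plt.subplots(figsize=(6, 6))
# Points colored by the cluster each one was assigned to.
ax.scatter(points[:, 0], points[:, 1], c=model.labels_, s=5, cmap="viridis")
# Fitted cluster centers drawn on top as large red crosses.
centers = model.cluster_centers_
ax.scatter(centers[:, 0], centers[:, 1], c="red", marker="X", s=120,
           label="Cluster centers")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.legend()
```

If the model was trained on scaled features, the centers live in the scaled space too, so either plot the scaled points or map the centers back with the scaler's `inverse_transform` before overlaying them on raw coordinates.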

## Exam Questions
* **Justify Your Feature Selection:**
   - Which features did you choose for clustering and why?
* **Number of Clusters Choices:**
   - How did you determine the optimal number?
* **Evaluation:**
   - Which metrics did you use to evaluate the clustering results, and why?
   - How do these metrics help in understanding the effectiveness of your clustering approach?
* **Improvements and Recommendations:**
   - Suggest any improvements or future work that could be done with this dataset. What other methods or algorithms would you consider applying?