# Clustering and Dimensionality Reduction Exam
Welcome to the weekly project on clustering and dimensionality reduction. You will be working with a dataset of traffic accidents.

## Dataset
The dataset used in this task is `Traffic_Accidents.csv`.

## Instructions
- Follow the steps outlined below.
- Write your code in the empty code cells.
- Comment on your code to explain your reasoning.

## Dataset Overview
The dataset contains information about traffic accidents, including location, weather conditions, road conditions, and more. Below is a sample of the columns:

* `Location_Easting_OSGR`: Easting coordinate of the accident location.
* `Location_Northing_OSGR`: Northing coordinate of the accident location.
* `Longitude`: Longitude of the accident site.
* `Latitude`: Latitude of the accident site.
* `Police_Force`: Identifier for the police force involved.
* `Accident_Severity`: Severity of the accident.
* `Number_of_Vehicles`: Number of vehicles involved in the accident.
* `Number_of_Casualties`: Number of casualties in the accident.
* `Date`: Date of the accident.
* `Day_of_Week`: Day of the week when the accident occurred.
* `Speed_limit`: Speed limit in the area where the accident occurred.
* `Weather_Conditions`: Weather conditions at the time of the accident.
* `Road_Surface_Conditions`: Condition of the road surface during the accident.
* `Urban_or_Rural_Area`: Whether the accident occurred in an urban or rural area.
* `Year`: Year when the accident was recorded.
* Additional attributes related to road type, pedestrian crossing, light conditions, etc.

## Goal
The primary goal is to analyze the accidents based on their geographical location.


## Import Libraries

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


## Load the Data

In [6]:
df1=pd.read_csv('/content/Traffic_Accidents.csv')
df1.head()

Unnamed: 0,Location_Easting_OSGR,Location_Northing_OSGR,Longitude,Latitude,Police_Force,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,...,2nd_Road_Class,2nd_Road_Number,Pedestrian_Crossing-Human_Control,Pedestrian_Crossing-Physical_Facilities,Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Urban_or_Rural_Area,Did_Police_Officer_Attend_Scene_of_Accident,Year
0,560530.0,103950.0,0.277298,50.812789,47,3.0,1,1.0,27/11/2009,6,...,-1,0,None within 50 metres,No physical crossing within 50 meters,Darkeness: No street lighting,Raining with high winds,Flood (Over 3cm of water),2.0,Yes,2009.0
1,508860.0,187170.0,-0.430574,51.572846,1,3.0,2,1.0,10/10/2010,1,...,6,0,None within 50 metres,No physical crossing within 50 meters,Darkness: Street lights present and lit,Fine without high winds,Dry,1.0,Yes,2010.0
2,314460.0,169130.0,-3.231459,51.414661,62,3.0,2,1.0,14/09/2005,4,...,3,4055,None within 50 metres,No physical crossing within 50 meters,Daylight: Street light present,Fine without high winds,Dry,1.0,Yes,2005.0
3,341700.0,408330.0,-2.8818,53.568318,4,3.0,1,2.0,18/08/2007,7,...,6,0,None within 50 metres,No physical crossing within 50 meters,Daylight: Street light present,Raining without high winds,Wet/Damp,1.0,Yes,2007.0
4,386488.0,350090.0,-2.20302,53.047882,21,3.0,2,2.0,06/08/2013,3,...,-1,0,None within 50 metres,No physical crossing within 50 meters,Daylight: Street light present,Fine without high winds,Dry,1.0,Yes,2013.0


## Exploratory Data Analysis (EDA)
Perform EDA to understand the data better. This involves several steps to summarize the main characteristics, uncover patterns, and establish relationships:
* Find the dataset information and observe the datatypes.
* Check the shape of the data to understand its structure.
* View the data with various functions to get an initial sense of it.
* Perform summary statistics on the dataset to grasp central tendencies and variability.
* Check for duplicated data.
* Check for null values.

And apply more if needed!


In [7]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17275 entries, 0 to 17274
Data columns (total 26 columns):
 #   Column                                       Non-Null Count  Dtype  
---  ------                                       --------------  -----  
 0   Location_Easting_OSGR                        17275 non-null  float64
 1   Location_Northing_OSGR                       17275 non-null  float64
 2   Longitude                                    17275 non-null  float64
 3   Latitude                                     17275 non-null  float64
 4   Police_Force                                 17275 non-null  int64  
 5   Accident_Severity                            17146 non-null  float64
 6   Number_of_Vehicles                           17275 non-null  int64  
 7   Number_of_Casualties                         16894 non-null  float64
 8   Date                                         17275 non-null  object 
 9   Day_of_Week                                  17275 non-null  int64  
 10

In [8]:
df1.shape

(17275, 26)

In [9]:
df1.sample()

Unnamed: 0,Location_Easting_OSGR,Location_Northing_OSGR,Longitude,Latitude,Police_Force,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,...,2nd_Road_Class,2nd_Road_Number,Pedestrian_Crossing-Human_Control,Pedestrian_Crossing-Physical_Facilities,Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Urban_or_Rural_Area,Did_Police_Officer_Attend_Scene_of_Accident,Year
6546,518940.0,244580.0,-0.265477,52.086759,40,2.0,1,1.0,28/09/2011,4,...,5,160,None within 50 metres,non-junction pedestrian crossing,Daylight: Street light present,Fine without high winds,Dry,1.0,Yes,2011.0


In [10]:
df1.tail()

Unnamed: 0,Location_Easting_OSGR,Location_Northing_OSGR,Longitude,Latitude,Police_Force,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,...,2nd_Road_Class,2nd_Road_Number,Pedestrian_Crossing-Human_Control,Pedestrian_Crossing-Physical_Facilities,Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Urban_or_Rural_Area,Did_Police_Officer_Attend_Scene_of_Accident,Year
17270,320040.0,470220.0,-3.224923,54.121606,3,3.0,2,1.0,09/08/2012,5,...,6,0,None within 50 metres,No physical crossing within 50 meters,Darkness: Street lights present and lit,Fine without high winds,Dry,1.0,Yes,2012.0
17271,445300.0,240320.0,-1.34069,52.059375,43,3.0,2,1.0,01/08/2007,4,...,-1,0,None within 50 metres,No physical crossing within 50 meters,Darkness: Street lights present and lit,Fine without high winds,Dry,1.0,No,2007.0
17272,392880.0,340920.0,-2.107464,52.965574,21,3.0,2,1.0,15/12/2010,4,...,6,116,None within 50 metres,No physical crossing within 50 meters,Darkness: Street lights present and lit,Raining without high winds,Wet/Damp,2.0,No,2010.0
17273,326130.0,673100.0,-3.184341,55.945285,95,3.0,1,1.0,04/12/2007,3,...,6,0,None within 50 metres,No physical crossing within 50 meters,Darkness: Street lights present and lit,Fine without high winds,Dry,1.0,Yes,2007.0
17274,522990.0,428500.0,-0.136782,53.738379,16,3.0,2,2.0,25/04/2011,2,...,6,0,None within 50 metres,,,,,,,


In [11]:
df1.head()

Unnamed: 0,Location_Easting_OSGR,Location_Northing_OSGR,Longitude,Latitude,Police_Force,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,...,2nd_Road_Class,2nd_Road_Number,Pedestrian_Crossing-Human_Control,Pedestrian_Crossing-Physical_Facilities,Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Urban_or_Rural_Area,Did_Police_Officer_Attend_Scene_of_Accident,Year
0,560530.0,103950.0,0.277298,50.812789,47,3.0,1,1.0,27/11/2009,6,...,-1,0,None within 50 metres,No physical crossing within 50 meters,Darkeness: No street lighting,Raining with high winds,Flood (Over 3cm of water),2.0,Yes,2009.0
1,508860.0,187170.0,-0.430574,51.572846,1,3.0,2,1.0,10/10/2010,1,...,6,0,None within 50 metres,No physical crossing within 50 meters,Darkness: Street lights present and lit,Fine without high winds,Dry,1.0,Yes,2010.0
2,314460.0,169130.0,-3.231459,51.414661,62,3.0,2,1.0,14/09/2005,4,...,3,4055,None within 50 metres,No physical crossing within 50 meters,Daylight: Street light present,Fine without high winds,Dry,1.0,Yes,2005.0
3,341700.0,408330.0,-2.8818,53.568318,4,3.0,1,2.0,18/08/2007,7,...,6,0,None within 50 metres,No physical crossing within 50 meters,Daylight: Street light present,Raining without high winds,Wet/Damp,1.0,Yes,2007.0
4,386488.0,350090.0,-2.20302,53.047882,21,3.0,2,2.0,06/08/2013,3,...,-1,0,None within 50 metres,No physical crossing within 50 meters,Daylight: Street light present,Fine without high winds,Dry,1.0,Yes,2013.0


In [12]:
df1.describe()

Unnamed: 0,Location_Easting_OSGR,Location_Northing_OSGR,Longitude,Latitude,Police_Force,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Day_of_Week,Local_Authority_(District),1st_Road_Class,1st_Road_Number,Speed_limit,2nd_Road_Class,2nd_Road_Number,Urban_or_Rural_Area,Year
count,17275.0,17275.0,17275.0,17275.0,17275.0,17146.0,17275.0,16894.0,17275.0,17275.0,17275.0,17275.0,17275.0,17275.0,17275.0,17244.0,17274.0
mean,440163.371867,299881.2,-1.428948,52.586804,30.465355,2.837163,1.835137,1.361134,4.130941,349.920347,4.078379,1014.402142,39.224891,2.657192,370.892446,1.362213,2009.378314
std,95512.594682,161505.5,1.40349,1.454458,25.662876,0.403785,0.714821,0.850088,1.926305,260.459377,1.432473,1827.955771,14.248821,3.21109,1283.091684,0.480654,2.997507
min,127210.0,19030.0,-6.352237,50.026153,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,20.0,-1.0,-1.0,1.0,2005.0
25%,374709.5,177930.0,-2.378323,51.486666,7.0,3.0,1.0,1.0,2.0,112.0,3.0,0.0,30.0,-1.0,0.0,1.0,2006.0
50%,441610.0,266440.0,-1.383216,52.287226,30.0,3.0,2.0,1.0,4.0,323.0,4.0,130.0,30.0,3.0,0.0,1.0,2010.0
75%,523480.0,398533.0,-0.213639,53.481946,46.0,3.0,2.0,1.0,6.0,531.0,6.0,728.5,50.0,6.0,0.0,2.0,2012.0
max,654950.0,1183525.0,1.753632,60.53288,98.0,3.0,10.0,42.0,7.0,940.0,6.0,9999.0,70.0,6.0,9999.0,2.0,2014.0


In [13]:
df1.duplicated().sum()

2

In [14]:
df1.isnull().sum()

Unnamed: 0,0
Location_Easting_OSGR,0
Location_Northing_OSGR,0
Longitude,0
Latitude,0
Police_Force,0
Accident_Severity,129
Number_of_Vehicles,0
Number_of_Casualties,381
Date,0
Day_of_Week,0


## Data Preprocessing
Apply whatever preprocessing you think is needed, such as:
* Remove the outliers
* Impute missing data
* Scale the data
* Reduce dimensions using PCA
* Implement One-Hot Encoding for nominal categorical variables.
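
The steps listed above can be sketched roughly as follows. This is a minimal illustration, not the required solution: the `demo` frame is made-up data that only borrows a few column names from the dataset, and the choices of median/mode imputation and two PCA components are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Tiny made-up frame (hypothetical values) reusing a few dataset column names
demo = pd.DataFrame({
    "Longitude": [0.28, -0.43, -3.23, -2.88, -2.20, -0.27],
    "Latitude": [50.81, 51.57, 51.41, 53.57, 53.05, 52.09],
    "Speed_limit": [30, 30, 60, np.nan, 70, 30],
    "Weather_Conditions": [
        "Raining with high winds", "Fine without high winds", None,
        "Fine without high winds", "Raining without high winds",
        "Fine without high winds",
    ],
})

numeric_cols = ["Longitude", "Latitude", "Speed_limit"]
nominal_cols = ["Weather_Conditions"]

# Impute: median for numeric columns, most frequent value for nominal ones
demo[numeric_cols] = demo[numeric_cols].fillna(demo[numeric_cols].median())
for col in nominal_cols:
    demo[col] = demo[col].fillna(demo[col].mode()[0])

# One-hot encode the nominal columns (one 0/1 column per category)
encoded = pd.get_dummies(demo, columns=nominal_cols, dtype=float)

# Scale every feature to zero mean and unit variance before PCA
scaled = StandardScaler().fit_transform(encoded)

# Project onto the first two principal components
pca = PCA(n_components=2, random_state=0)
reduced = pca.fit_transform(scaled)

print(reduced.shape)
print(pca.explained_variance_ratio_)
```

On the real data the same steps would be applied to `df1` with whichever numeric and nominal columns are kept for the analysis.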

In [15]:
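def remove_outlier(df, column):
  # Quartiles and interquartile range of the chosen numeric column
  Q1 = df[column].quantile(0.25)
  Q3 = df[column].quantile(0.75)
  IQR = Q3 - Q1
  # A value is an outlier if it lies below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
  lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
  # Keep only the in-range rows (rows with NaN in this column are dropped too)
  return df[df[column].between(lower, upper)]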

In [16]:
# Columns with missing values: Light_Conditions, Urban_or_Rural_Area, Accident_Severity, Number_of_Casualties, Weather_Conditions, Road_Surface_Conditions, Did_Police_Officer_Attend_Scene_of_Accident

In [31]:
# Forward-fill missing values; fillna(method='ffill') is deprecated in recent pandas
df1.ffill(inplace=True)
df1.isnull().sum()


Unnamed: 0,0
Location_Easting_OSGR,0
Location_Northing_OSGR,0
Longitude,0
Latitude,0
Police_Force,0
Accident_Severity,0
Number_of_Vehicles,0
Number_of_Casualties,0
Date,0
Day_of_Week,0


## Feature Selection
Select relevant features for clustering. Explain your choice of features.


In [None]:
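# The goal is to group accidents by geographical location, so keep the coordinate columns
FS = df1[['Longitude', 'Latitude']]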

## Data Visualization
Visualize the data using appropriate plots to gain insights into the dataset. Use the following:
- Scatter plot of accidents based on Longitude and Latitude.

In [40]:
import matplotlib.pyplot as plt

# Column names are case-sensitive: the dataset uses 'Longitude' and 'Latitude'
plt.scatter(df1['Longitude'], df1['Latitude'], color='blue')

plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Scatter Plot of Longitude and Latitude')
plt.show()

## Clustering
Apply K-Means clustering. Determine the optimal number of clusters and justify your choice.
* Find the `n_clusters` parameter using the elbow method.
* Train the model.
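
The elbow method mentioned above can be sketched as follows. This is a hedged illustration: `elbow_inertias` is a hypothetical helper, and `X_demo` is synthetic data standing in for (Longitude, Latitude) pairs, not the accident dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def elbow_inertias(X, k_values, random_state=0):
    """Fit K-Means once per k and return each fit's inertia
    (the within-cluster sum of squared distances)."""
    inertias = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        model.fit(X)
        inertias.append(model.inertia_)
    return inertias


# Synthetic stand-in for coordinate pairs: three made-up, well-separated blobs
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[-3.0, 51.5], [0.0, 52.0], [-1.5, 54.0]],
    cluster_std=0.2,
    random_state=0,
)

k_values = list(range(1, 8))
inertias = elbow_inertias(X_demo, k_values)
for k, inertia in zip(k_values, inertias):
    print(f"k={k}: inertia={inertia:.2f}")
```

Plotting `inertias` against `k_values` and picking the `k` where the curve stops dropping sharply gives the elbow. On the real data, `X_demo` would be replaced by the selected feature columns of `df1`.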

In [37]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17275 entries, 0 to 17274
Data columns (total 25 columns):
 #   Column                                       Non-Null Count  Dtype  
---  ------                                       --------------  -----  
 0   Location_Easting_OSGR                        17275 non-null  float64
 1   Location_Northing_OSGR                       17275 non-null  float64
 2   Longitude                                    17275 non-null  float64
 3   Latitude                                     17275 non-null  float64
 4   Police_Force                                 17275 non-null  int64  
 5   Accident_Severity                            17275 non-null  float64
 6   Number_of_Vehicles                           17275 non-null  int64  
 7   Number_of_Casualties                         17275 non-null  float64
 8   Day_of_Week                                  17275 non-null  int64  
 9   Local_Authority_(District)                   17275 non-null  int64  
 10

In [38]:
# Restrict the correlation to numeric columns; string columns such as 'E10000011' cannot be converted to float
df1.corr(numeric_only=True)

In [36]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0, n_init="auto")
df1 = df1.drop(columns='Date', errors='ignore')
# Fit on the numeric location columns only; K-Means cannot handle string columns
X_geo = df1[['Longitude', 'Latitude']]
kmeans.fit(X_geo)

## Evaluation
Evaluate the clustering result using appropriate metrics.
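
As one possible approach, the silhouette score and the Davies-Bouldin index from `sklearn.metrics` can be computed from the features and the predicted labels. The sketch below uses synthetic data (`X_demo`) and an assumed `n_clusters=3`, so the numbers it prints say nothing about the accident dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic stand-in for coordinate pairs: three made-up, well-separated blobs
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[-3.0, 51.5], [0.0, 52.0], [-1.5, 54.0]],
    cluster_std=0.2,
    random_state=0,
)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

# Silhouette: ranges from -1 to 1, higher means tighter, better-separated clusters
sil = silhouette_score(X_demo, labels)
# Davies-Bouldin: non-negative, lower means better separation
dbi = davies_bouldin_score(X_demo, labels)

print(f"Silhouette score: {sil:.3f}")
print(f"Davies-Bouldin index: {dbi:.3f}")
```

Neither metric needs ground-truth labels, which suits this task because the accidents have no known cluster assignment.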


## Plot the data points with their predicted cluster center
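
A minimal sketch of such a plot, assuming a fitted K-Means model on two coordinate features. `plot_clusters` is a hypothetical helper and `X_demo` is synthetic data used only for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def plot_clusters(X, labels, centers):
    """Scatter the points colored by cluster label and overlay the cluster centers."""
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=10, alpha=0.6)
    ax.scatter(centers[:, 0], centers[:, 1], c="red", marker="X", s=200,
               label="Cluster centers")
    ax.set_xlabel("Longitude")
    ax.set_ylabel("Latitude")
    ax.set_title("K-Means clusters and their centers")
    ax.legend()
    return ax


# Synthetic stand-in for coordinate pairs: three made-up, well-separated blobs
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[-3.0, 51.5], [0.0, 52.0], [-1.5, 54.0]],
    cluster_std=0.2,
    random_state=0,
)

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)
ax = plot_clusters(X_demo, model.labels_, model.cluster_centers_)
```

In a notebook, finish the cell with `plt.show()`. When the features live in a DataFrame, pass `df[['Longitude', 'Latitude']].to_numpy()` as `X` so that the positional indexing inside the helper works.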

## Exam Questions
* **Justify Your Feature Selection:**
   - Which features did you choose for clustering and why?
* **Number of Clusters Choices:**
   - How did you determine the optimal number?
* **Evaluation:**
   - Which metrics did you use to evaluate the clustering results, and why?
   - How do these metrics help in understanding the effectiveness of your clustering approach?
* **Improvements and Recommendations:**
   - Suggest any improvements or future work that could be done with this dataset. What other methods or algorithms would you consider applying?