## <Center> Detailed Report of the Exercise

# Table of Contents

* ### KMeans_Clustering_UsingGEE Steps
* ### Assignment Discussion


### <Center> KMeans_Clustering_UsingGEE Steps

In this exercise, our task is to apply the K-means clustering algorithm to land use clustering in two different countries, using Sentinel-2 imagery and the functionality of Google Earth Engine (GEE). The steps carried out are discussed below, along with a rough flowchart that makes them easier to follow.

<Center> 
  <img src="./images/Flowchart.jpg" alt="Image 1" style="width: 49%; margin: 0 10px;">
</Center>

## 1. Importing the Google Earth Engine API and Initializing Authentication

This step imports the Google Earth Engine functionality into Python. The authentication step lets the user log in to Google Earth Engine with their account credentials.


In [None]:
import ee
import geemap

In [None]:
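try:
    # Reuse stored credentials if they are available.
    ee.Initialize()
except Exception:
    # Otherwise run the sign-in flow once, then initialize again.
    ee.Authenticate()
    ee.Initialize()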

## 2. Dask

As the sample dataset consists of 10 CSV files, Dask is used to read all of the files at once and to reduce the processing time.

In [None]:
import dask.dataframe as dd

## 3. Reading Sample Dataset and Reindexing

Because each sample CSV file has its own index, the index has to be reset after the files are combined, so that the compiled table has a single continuous index.


In [None]:
df = dd.read_csv('Data/SamplesSet*.csv')
df = df.compute()
df = df.reset_index(drop=True)

## 4. Exploratory Data Analysis

In this step, several exploratory data analysis operations were carried out.
 * **Dropping Duplicates** - To remove duplicate values.
 * **Distinct_Value_Count** - To find the number of distinct land cover classes. Five land cover classes were found.
 * **Minimum and Maximum Latitude and Longitude Value** - To check the geographic extent of the samples. The minimum and maximum latitude and longitude show that the dataset was collected in Europe.
 * **Plotting the top 20 countries with the most locations in the dataset** - This step was performed to check which countries most of the sample points belong to, which helped us choose the study areas for this exercise. We were asked to choose two countries, one with a higher correspondence to the dataset and one with a lower correspondence. We chose **France** for higher similarity and **Poland** for lower similarity.

In [None]:
distinct_values = df['landcover'].drop_duplicates()
distinct_value_counts = df['landcover'].value_counts()
min_value_lat = df['lat'].min()
max_value_lat = df['lat'].max()
min_value_lon = df['lon'].min()
max_value_lon= df['lon'].max()

## 5. Importing Bounding Box List 

A list of country bounding boxes was loaded and used to find the bounding boxes that contain the majority of the sample reference points.



In [None]:
import json
import pandas as pd  # needed for pd.DataFrame in the loop below

with open('bounding_boxes.json', 'r') as file:
    bounding_boxes = json.load(file)

def is_in_bounding_box(lat, lon, bounding_box):
    sw = bounding_box['sw']
    ne = bounding_box['ne']
    return sw['lat'] <= lat <= ne['lat'] and sw['lon'] <= lon <= ne['lon']


dfs = []

# count locations in each bounding box
for code, box in bounding_boxes.items():
    country_name = code
    bounding_box = box
    min_lat, min_lon = bounding_box['sw']['lat'], bounding_box['sw']['lon']
    max_lat, max_lon = bounding_box['ne']['lat'], bounding_box['ne']['lon']
    filtered_df = df[df.apply(lambda row: is_in_bounding_box(row['lat'], row['lon'], bounding_box), axis=1)]
    count = len(filtered_df)
    bounding_box_df = dd.from_pandas(pd.DataFrame({'Country': [country_name], 'Code': [code], 'Count': [count]}), npartitions=1)
    dfs.append(bounding_box_df)

counts_df = dd.concat(dfs)
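
The counting logic above can also be sketched without any DataFrame machinery. The following is a minimal illustration, not the code used in the exercise: the bounding boxes and sample points are hypothetical values chosen for the example, and `in_box` and `count_points_per_box` are helper names introduced here.

```python
from collections import Counter

# Hypothetical bounding boxes in the same {'sw': ..., 'ne': ...} layout
# used by bounding_boxes.json; the coordinates are rough illustrations.
example_boxes = {
    'FR': {'sw': {'lat': 42.0, 'lon': -5.0}, 'ne': {'lat': 51.0, 'lon': 8.0}},
    'PL': {'sw': {'lat': 49.0, 'lon': 14.0}, 'ne': {'lat': 55.0, 'lon': 24.0}},
}

# Hypothetical sample points as (lat, lon) pairs.
example_points = [(48.8, 2.3), (45.7, 4.8), (52.2, 21.0), (60.0, 10.0)]


def in_box(lat, lon, box):
    """Return True if the point lies inside the box (edges included)."""
    return (box['sw']['lat'] <= lat <= box['ne']['lat']
            and box['sw']['lon'] <= lon <= box['ne']['lon'])


def count_points_per_box(points, boxes):
    """Count how many points fall inside each bounding box."""
    counts = Counter({code: 0 for code in boxes})
    for lat, lon in points:
        for code, box in boxes.items():
            if in_box(lat, lon, box):
                counts[code] += 1
    return counts


box_counts = count_points_per_box(example_points, example_boxes)
# Boxes sorted by descending count, limited to the top 20.
top_countries = box_counts.most_common(20)
```

Because the bounding boxes of neighbouring countries can overlap, a single point may be counted for more than one country with this approach.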

## 6. Filtering the Points within the Bounding Box of the Study Area

With this operation, the reference points are cropped to the bounding boxes of France and Poland. The cropped points are used for the accuracy assessment.

In [None]:
min_latitude = 43.303
max_latitude = 49.124
min_longitude = -3.142
max_longitude = 6.561

# Filter points within the bounding box
france_points_small = df[
    (df['lat'] >= min_latitude) & 
    (df['lat'] <= max_latitude) & 
    (df['lon'] >= min_longitude) & 
    (df['lon'] <= max_longitude)
]

france_points_small = france_points_small.reset_index(drop=True)

## 7. Sentinel-2 Image Access

Sentinel-2 images for summer 2023 (2023-06-01 to 2023-08-30) with cloud cover below 30% were accessed for this exercise. A cloud cover threshold of 30% is considered sufficient here because the remaining clouds are compensated for by the median calculation. The GEE Python API was used because it offers a massive archive of datasets and gives direct access to them. Downloading the images to a local computer would instead increase the processing time and would require fetching each image manually from the server.


In [None]:
map_france = ee.ImageCollection("COPERNICUS/S2_SR_HARMONIZED")\
  .filterBounds(bounding_box)\
  .filterDate('2023-06-01', '2023-08-30')\
  .filterMetadata('CLOUDY_PIXEL_PERCENTAGE', 'less_than', 30) \
  .sort('CLOUDY_PIXEL_PERCENTAGE', False)
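
Why a median can compensate for residual clouds can be seen on a single pixel. The sketch below uses made-up reflectance values and plain Python rather than Earth Engine; it only illustrates the idea and assumes that cloudy observations are a minority of the time series.

```python
from statistics import median

# Hypothetical reflectance values of ONE pixel in ONE band across five
# acquisitions; the two large values stand in for cloud-contaminated scenes.
pixel_time_series = [0.08, 0.09, 0.62, 0.07, 0.55]

mean_value = sum(pixel_time_series) / len(pixel_time_series)
median_value = median(pixel_time_series)

print(f"mean   = {mean_value:.3f}")    # pulled upward by the bright cloudy values
print(f"median = {median_value:.3f}")  # stays at a clear-sky value
```

If more than half of the observations of a pixel were cloudy, the median itself would be a cloudy value, which is why the cloud cover filter is still applied first.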

## 8. Data Interrogation

Several image interrogation steps were performed to get information about the accessed images and to reduce the size of the image data.
 * Number of Images
 * Band Names
 * Median Image Calculation
 * Band Visualization


In [None]:
listOfImages = map_france.aggregate_array('system:index').getInfo()
# Compute the median composite first; the band names are read from it.
median_image = map_france.median()
bandNames = median_image.bandNames()

## 9. Create Training Dataset and Display

In this step, a training dataset of about 10,000 pixels was sampled from the selected study area. The clustering is based on the spectral characteristics of the pixels in feature space and on their distance from the randomly initialized cluster centers. In later experiments, the number of pixels was increased to 100,000 to check the effect on the result.



In [None]:
training = median_image.sample(**{
    'region':bounding_box,
    'scale': 30,
    'numPixels': 10000,
    'seed': 0,
    'tileScale' : 2,
    'geometries': True  # include the point geometry of each sampled pixel
})

## 10. Clustering and Displaying the Clusters

In this step, clustering was carried out for several different numbers of clusters, using the training data sampled randomly from the satellite imagery. The result is an image covering the whole study area in which every pixel is assigned to one of the predefined number of clusters.

In [None]:
n_clusters = 5
clusterer = ee.Clusterer.wekaKMeans(n_clusters).train(training)

# Cluster the input using the trained clusterer.
result = median_image.cluster(clusterer)
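
The train-then-apply pattern used above can be illustrated with a small pure-Python K-means. This is a sketch of the plain Lloyd's algorithm under simplifying assumptions, not the Weka implementation that Earth Engine runs; the two-band pixel values and the function names are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(pixel, centers):
    """Index of the cluster center closest to the pixel."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(pixel, centers[i]))


def train_kmeans(samples, k, seed=0, iterations=20):
    """Plain Lloyd's K-means: random initial centers, then assign and update."""
    rng = random.Random(seed)
    centers = rng.sample(samples, k)  # random initialization from the samples
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for pixel in samples:
            groups[nearest_center(pixel, centers)].append(pixel)
        for i, group in enumerate(groups):
            if group:  # keep the old center if a cluster ends up empty
                centers[i] = tuple(sum(band) / len(group)
                                   for band in zip(*group))
    return centers


# Hypothetical two-band pixels (red, NIR): a dark group and a bright-NIR group.
training_sample = [(0.05, 0.04), (0.06, 0.05), (0.04, 0.03),
                   (0.05, 0.45), (0.06, 0.50), (0.04, 0.48)]
centers = train_kmeans(training_sample, k=2)

# Apply the trained centers to pixels that were not part of the sample.
all_pixels = [(0.05, 0.05), (0.05, 0.47), (0.07, 0.06), (0.03, 0.52)]
labels = [nearest_center(pixel, centers) for pixel in all_pixels]
```

The cluster numbers themselves carry no meaning: which group receives label 0 or 1 depends on the random initialization, so only the grouping of the pixels should be interpreted.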

## 11. Performance Assessment

This step checks how well the clusters agree with the ground truth data from the sample dataset. The comparison shows that the clustering result is inaccurate.

In [None]:
def get_cluster_for_coordinates(lat, lon):
    point = ee.Geometry.Point(lon, lat)
    cluster = result.reduceRegion(ee.Reducer.first(), point, 80)  # Adjust the scale as needed
    return cluster.get('cluster').getInfo()

france_points['cluster'] = france_points.apply(lambda row: get_cluster_for_coordinates(row['lat'], row['lon']), axis=1)
france_points.head()

## 12. Histogram Operation 

A histogram was plotted to inspect the clusters and to check how many points of each land cover class fall into each cluster.

In [None]:
import matplotlib.pyplot as plt

grouped_data = france_points.groupby(['cluster', 'landcover']).size().reset_index(name='count')

# Pivot the data to create a table suitable for plotting
pivot_data = grouped_data.pivot(index='cluster', columns='landcover', values='count').fillna(0)

colors = {
    'Water': 'blue',
    'ForestNaturalAreas': 'green',
    'ArtificialSurfaces': 'grey',
    'Wetlands': 'purple',
    'AgriculturalArea': '#FFD700'
}

# Plotting with custom colors
pivot_data.plot(kind='bar', stacked=True, color=[colors[col] for col in pivot_data.columns])
plt.xlabel('Cluster')
plt.ylabel('Count')
plt.title('Distribution of Points per Cluster for Each Landcover')
plt.show()
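
The cluster-by-land-cover counts can also be summarized as a single number. One possible approach, shown here only as an assumption and not as the method used in the exercise, is to label each cluster with its most frequent land cover class and measure the share of points that match that label (often called cluster purity). The `(cluster, landcover)` pairs below are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (cluster, landcover) pairs, standing in for the 'cluster' and
# 'landcover' columns of france_points.
pairs = [
    (0, 'Water'), (0, 'Water'), (0, 'Wetlands'),
    (1, 'AgriculturalArea'), (1, 'AgriculturalArea'), (1, 'ForestNaturalAreas'),
    (2, 'ForestNaturalAreas'), (2, 'ForestNaturalAreas'),
    (2, 'AgriculturalArea'), (2, 'ArtificialSurfaces'),
]

# Count land cover labels inside each cluster (the same table the pivot builds).
counts_per_cluster = defaultdict(Counter)
for cluster, landcover in pairs:
    counts_per_cluster[cluster][landcover] += 1

# Label each cluster with its most frequent land cover class.
cluster_to_class = {
    cluster: counts.most_common(1)[0][0]
    for cluster, counts in counts_per_cluster.items()
}

# Share of points whose land cover matches the label of their cluster.
matches = sum(cluster_to_class[cluster] == landcover
              for cluster, landcover in pairs)
overall_purity = matches / len(pairs)
```

With these example pairs, 6 of the 10 points match the majority class of their cluster, giving a purity of 0.6. A low value indicates that the clusters mix several land cover classes.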

### <Center> Assignment Discussion

## 1. Why do we need to train an Unsupervised Classifier in GEE? 

Unlike a supervised classification algorithm, which requires labeled training data, a clustering algorithm in Google Earth Engine is trained on unlabeled data. Training is still needed because GEE holds very large datasets from all over the world, and running unsupervised clustering directly on data of that scale would be difficult. In an unsupervised task we generally ask the machine to find the patterns or clusters in the given dataset, which is hard to do on the full dataset in Earth Engine. Instead, the clusterer is trained on a smaller, manageable sample, and the clusters learned from that sample are then applied to the whole dataset for the predefined number of clusters.

## 2. Impact of the Number of Clusters on the Result

<div style="display: flex; justify-content: center;">
  <img src="images/Pol2.png" alt="Image 1" style="width: 49%; margin: 0 10px;">
  <img src="images/Pol3.png" alt="Image 2" style="width: 49%; margin: 0 10px;">
</div>

<div style="display: flex; justify-content: center;">
  <img src="images/Pol4.png" alt="Image 1" style="width: 49%; margin: 0 10px;">
  <img src="images/Pol5.png" alt="Image 1" style="width: 49%; margin: 0 10px;">
</div>

<div style="display: flex; justify-content: center;">
  <img src="images/Pol10.png" alt="Image 1" style="width: 49%; margin: 0 10px;">
  <img src="images/Pol12.png" alt="Image 1" style="width: 49%; margin: 0 10px;">
</div>

We tried several different numbers of clusters to check the impact on the result. Even so, we found that each cluster is a combination of several land cover classes. There are several potential reasons for this.
* **Cluster center initialization** - In K-means, the cluster centers are initialized randomly, and this initialization influences the rest of the clustering because pixels are assigned according to their distance from those centers. We believe this contributes to different land cover classes being grouped into one cluster.
* **Use of all bands** - Generally, the red, green, blue, and NIR bands are what is needed to distinguish land cover classes, because the reflectance of most classes depends strongly on these bands. Here all bands were used, and each band adds one dimension to the feature space in which the pixels are located. This high dimensionality causes problems when calculating the distances for K-means clustering, so each cluster takes in nearby pixels even if they belong to different classes.
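
The effect of additional bands on the distance calculation can be shown with a toy example. All values below are made up for illustration: `core_bands` stands for blue, green, red, and NIR, and `extra_bands` stands for four further bands whose values are assumed to vary independently of the land cover class.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical reflectances: blue, green, red, NIR.
core_bands = {
    'water':    (0.05, 0.06, 0.04, 0.02),
    'forest_a': (0.04, 0.06, 0.03, 0.40),
    'forest_b': (0.05, 0.07, 0.04, 0.35),
}

# Hypothetical values in four further bands, unrelated to the class.
extra_bands = {
    'water':    (0.25, 0.20, 0.25, 0.20),
    'forest_a': (0.50, 0.50, 0.50, 0.50),
    'forest_b': (0.20, 0.20, 0.20, 0.20),
}

# Concatenate both sets into one eight-dimensional vector per pixel.
all_bands = {name: core_bands[name] + extra_bands[name] for name in core_bands}


def nearest_neighbour(target, vectors):
    """Name of the pixel closest to the target pixel."""
    others = [name for name in vectors if name != target]
    return min(others,
               key=lambda name: euclidean(vectors[target], vectors[name]))


neighbour_core = nearest_neighbour('forest_b', core_bands)
neighbour_all = nearest_neighbour('forest_b', all_bands)
```

With the four core bands, the nearest neighbour of `forest_b` is `forest_a`; with all eight dimensions it becomes `water`, because the differences in the extra bands outweigh the NIR difference. Selecting a subset of bands or rescaling them before clustering are possible remedies, although neither was tested in this exercise.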

## 3. Impact of Performing Clustering in Different Regions

## 4. Computational Time