# Students
- GHAITH Sarahnour (M2QF & ENSIIE)
- ROISEUX Thomas (M2QF & ENSIIE)

# Introduction
## Context

The goal of this project is to study the temporal evolution of temperature and wind in France over one year.

## Required packages
- `pandas`: to manipulate dataframes.
- `numpy`: to manipulate arrays.
- `matplotlib`: to plot graphs.
- `cartopy`: to plot maps.
- `IPython`: to display dataframes in Jupyter Notebook.
- `scikit-learn`: to use machine learning algorithms.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature
from datetime import datetime, timedelta
from typing import Dict

from IPython.display import display

from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.decomposition import PCA

## Data importation
### Preparing GPS dataframe

In [None]:
gps_df = pd.read_csv("dataGPS.csv", header=None, sep=";")
gps_df.columns = ["ID", "Lattitude", "Longitude"]
gps_df["ID"] = gps_df["ID"].str.replace("TEMP", "")
gps_df.set_index("ID", inplace=True)

display(gps_df.head())
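
As a side note, the column names and the index can also be set directly while reading the file. The sketch below uses a made-up two-row sample in place of `dataGPS.csv` (the values are hypothetical and only illustrate the `ID;value;value` layout assumed from the cell above).

```python
import io

import pandas as pd

# Hypothetical sample with the same layout as the GPS file: ID;value;value
sample = io.StringIO("TEMP1;1.5;43.2\nTEMP2;2.5;48.9\n")

sample_df = pd.read_csv(
    sample,
    sep=";",
    header=None,
    names=["ID", "Lattitude", "Longitude"],  # same column names as in the notebook
    index_col="ID",
)
# Strip the "TEMP" prefix from every station identifier in one call.
sample_df.index = sample_df.index.str.replace("TEMP", "", regex=False)

print(sample_df)
```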

### Preparing temperature dataframe and wind dataframe

In [None]:
year = 2019
hours = [datetime(year, 1, 1, 0, 0, 0) + timedelta(hours=i) for i in range(8760)]
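
The same hourly axis can be built with `pandas` directly. This is a minimal sketch, assuming the data covers every hour of 2019 (which is not a leap year); `hours_index` is a name introduced here for illustration.

```python
import pandas as pd

year = 2019  # same year as in the notebook

# One timestamp per hour, from 1 January 00:00 to 31 December 23:00.
hours_index = pd.date_range(
    start=f"{year}-01-01 00:00",
    end=f"{year}-12-31 23:00",
    freq=pd.Timedelta(hours=1),
)

print(len(hours_index))  # 365 days * 24 hours
```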

In [None]:
temp_df = pd.read_csv("dataTemp.csv", header=None, sep=";", index_col=0)
temp_df.index.name = "ID"
# Strip the "TEMP" prefix from every station ID in a single call.
temp_df.rename(index=lambda key: key.replace("TEMP", ""), inplace=True)
temp_df.columns = hours


display(temp_df.head())


wind_df = pd.read_csv("dataWind.csv", header=None, sep=";", index_col=0)
wind_df.index.name = "ID"
# Strip the "VVENT" prefix from every station ID in a single call.
wind_df.rename(index=lambda key: key.replace("VVENT", ""), inplace=True)
wind_df.columns = hours


display(wind_df.head())
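
Later cells combine station coordinates with wind and temperature rows, which only works if the dataframes list the stations in the same order. The sketch below shows one way to check and enforce this, using small made-up frames (`toy_gps`, `toy_wind`) rather than the real data.

```python
import pandas as pd

# Toy frames standing in for gps_df and wind_df (all values are made up).
toy_gps = pd.DataFrame(
    {"Lattitude": [1.0, 2.0, 3.0], "Longitude": [4.0, 5.0, 6.0]},
    index=["1", "2", "3"],
)
toy_wind = pd.DataFrame(
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    index=["3", "1", "2"],  # same stations, different order
)

same_order = toy_gps.index.equals(toy_wind.index)
print("Same station order:", same_order)

# Reorder the measurements so that row i of both frames is the same station.
aligned_wind = toy_wind.reindex(toy_gps.index)
print(aligned_wind)
```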

## Example: weather in Paris
We are going to study the weather in Paris, the capital of France, as an example.
It is located at approximately 48.86° N, 2.35° E.
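
Coordinates are often published in degrees, minutes and seconds (for Paris, roughly 48°51′24″ N, 2°21′08″ E), whereas the plotting code works in decimal degrees. The helper below is a small sketch of the conversion; `dms_to_decimal` is a name introduced here, not part of any library.

```python
def dms_to_decimal(degrees: float, minutes: float, seconds: float, negative: bool = False) -> float:
    """Convert degrees/minutes/seconds to decimal degrees.

    Set negative=True for southern latitudes or western longitudes.
    """
    value = degrees + minutes / 60 + seconds / 3600
    return -value if negative else value


paris_lat = dms_to_decimal(48, 51, 24)
paris_lon = dms_to_decimal(2, 21, 8)
print(round(paris_lat, 4), round(paris_lon, 4))
```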

Let's first place it on a map.

In [None]:
fig = plt.figure(figsize=(20, 10))
ax = fig.add_subplot(1, 2, 1, projection=ccrs.PlateCarree())
ax.set_extent([-5, 9, 42, 52])
ax.set_title("France")
ax.stock_img()

x, y = 2.352222, 48.856614  # longitude, latitude of Paris in decimal degrees
ax.plot(x, y, "r*", markersize=15)
ax.text(x, y, "Paris")
plt.show()

Now, we are going to plot the evolution of temperature and wind in Paris, across one year.

In [None]:
plt.figure(figsize=(20, 10))
plt.title("Temperature in Paris")
plt.plot(temp_df.columns, temp_df.iloc[33, :], color="blue")
plt.show()

In [None]:
plt.figure(figsize=(20, 10))
plt.title("Wind in Paris")
plt.plot(wind_df.columns, wind_df.iloc[33, :], color="blue")
plt.show()

# Preliminaries
## Cities selection
We are going to select 3 more cities in France, to study the weather in different regions.

We chose:
- Strasbourg (48.57° N, 7.75° E);
- Nice (43.70° N, 7.27° E);
- Brest (48.39° N, 4.49° W).

In [None]:
def find_closest_point(x: float, y: float, df: pd.DataFrame) -> Dict[str, float | str]:
    """Get the closest point to the given coordinates in the given dataframe.

    Args:
        x (float): longitude
        y (float): latitude
        df (pd.DataFrame): dataframe with columns Longitude and Lattitude

    Returns:
        dict[str, float | str]: closest point
    """
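    # Euclidean distance in degree space, which is enough to pick a nearest grid point.
    distances = np.sqrt((df["Longitude"] - x) ** 2 + (df["Lattitude"] - y) ** 2)
    closest = int(np.argmin(distances))  # positional index of the nearest row
    dic = df.iloc[closest].to_dict()
    dic["ID"] = df.index[closest]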
    return dic


strasbourg = find_closest_point(48.5734053, 7.7521113, gps_df)
print("Strasbourg:", (strasbourg["Longitude"], strasbourg["Lattitude"]))
nice = find_closest_point(43.7009358, 7.2683912, gps_df)
print("Nice:", (nice["Longitude"], nice["Lattitude"]))
brest = find_closest_point(48.390528, -4.486008, gps_df)
print("Brest:", (brest["Longitude"], brest["Lattitude"]))
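
The function above measures distance directly in degrees, which treats one degree of longitude like one degree of latitude. A possible refinement, not used in this notebook, is the haversine great-circle distance. The sketch below assumes a spherical Earth of radius 6371 km; `haversine_km` and the three approximate city coordinates are introduced here for illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Approximate coordinates of Strasbourg, Nice and Brest (latitude, longitude).
candidate_lats = np.array([48.5734, 43.7009, 48.3905])
candidate_lons = np.array([7.7521, 7.2684, -4.4860])

# Distances from Paris to each candidate, computed in one vectorised call.
distances = haversine_km(48.8566, 2.3522, candidate_lats, candidate_lons)
closest = int(np.argmin(distances))
print(distances.round(0), closest)
```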

In [None]:
fig = plt.figure(figsize=(20, 10))
ax = fig.add_subplot(1, 2, 1, projection=ccrs.PlateCarree())
ax.set_extent([-5, 9, 42, 52])
ax.set_title("France")
ax.stock_img()

x, y = 2.352222, 48.856614  # longitude, latitude of Paris in decimal degrees
ax.plot(x, y, "r*", markersize=15)
ax.text(x, y, "Paris")
ax.plot(strasbourg["Lattitude"], strasbourg["Longitude"], "r*", markersize=15)
ax.text(strasbourg["Lattitude"], strasbourg["Longitude"], "Strasbourg")
ax.plot(nice["Lattitude"], nice["Longitude"], "r*", markersize=15)
ax.text(nice["Lattitude"], nice["Longitude"], "Nice")
ax.plot(brest["Lattitude"], brest["Longitude"], "r*", markersize=15)
ax.text(brest["Lattitude"], brest["Longitude"], "Brest")
plt.show()

We are now going to plot the evolution of temperature and wind in these cities, across one year.

In [None]:
plt.figure(figsize=(20, 10))
plt.title("Temperature")
plt.xlabel("Time")
plt.ylabel("Temperature")
for city, name in zip((strasbourg, nice, brest), ("Strasbourg", "Nice", "Brest")):
    # Each city dictionary stores the ID of its closest station.
    plt.plot(temp_df.columns, temp_df.loc[city["ID"], :], label=name)

plt.legend()
plt.show()

In [None]:
plt.figure(figsize=(20, 10))
plt.title("Wind")
plt.xlabel("Time")
plt.ylabel("Wind speed")
for city, name in zip((strasbourg, nice, brest), ("Strasbourg", "Nice", "Brest")):
    # Each city dictionary stores the ID of its closest station.
    plt.plot(wind_df.columns, wind_df.loc[city["ID"], :], label=name)

plt.legend()
plt.show()
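
Hourly curves over a full year are dense and hard to compare by eye. One option is to aggregate each series to daily means before plotting. The sketch below does this on a synthetic series (a seasonal cycle, a daily cycle and noise, all made up); with the notebook's data, a row such as `temp_df.loc[some_id]` would play the role of `hourly`.

```python
import numpy as np
import pandas as pd

hours = pd.date_range("2019-01-01", periods=8760, freq=pd.Timedelta(hours=1))
rng = np.random.default_rng(0)

# Synthetic hourly signal: seasonal cycle + daily cycle + noise (made-up values).
t = np.arange(8760)
hourly = pd.Series(
    12
    - 8 * np.cos(2 * np.pi * t / 8760)
    + 3 * np.sin(2 * np.pi * t / 24)
    + rng.normal(0, 1, size=8760),
    index=hours,
)

daily_mean = hourly.resample("D").mean()  # one value per calendar day
weekly_smooth = daily_mean.rolling(window=7, center=True).mean()  # 7-day moving average

print(len(hourly), len(daily_mean))
```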

## Clustering
We are going to cluster the cities in France, using the temperature and wind data.
We will use 2 different clustering algorithms:
- $K$-means;
- hierarchical clustering.
### $K$-means
We will use $K=4$ for the number of clusters.

In [None]:
k_means_wind, k_means_temp = KMeans(n_clusters=4), KMeans(n_clusters=4)
k_means_wind.fit(wind_df)
k_means_temp.fit(temp_df)

print("Wind clusters:", k_means_wind.cluster_centers_)
print("Temperature clusters:", k_means_temp.cluster_centers_)

classifications = pd.DataFrame(
    k_means_wind.predict(wind_df), index=wind_df.index, columns=["K Wind"]
)
classifications["K Temperature"] = k_means_temp.predict(temp_df)

display(classifications.head())
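
The value $K=4$ is fixed by hand above. One common heuristic for checking such a choice is the silhouette score, computed for several values of $K$. The sketch below runs on synthetic, well-separated blobs rather than on the weather data, so it only illustrates the procedure; the fixed `random_state` is an assumption made for reproducibility.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four clearly separated groups (not the weather data).
centers = [[-10, -10], [-10, 10], [10, -10], [10, 10]]
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # higher means better-separated clusters

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On the real data the curve is usually less clear-cut, so the score is a guide rather than a rule.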

We are going to plot the clusters on a map, first for the wind.

In [None]:
fig = plt.figure(figsize=(20, 10))
ax = fig.add_subplot(1, 2, 1, projection=ccrs.PlateCarree())
ax.set_extent([-5, 9, 42, 52])
ax.set_title("France")
ax.stock_img()

plt.scatter(
    gps_df["Lattitude"],
    gps_df["Longitude"],
    c=classifications["K Wind"],
    cmap="viridis",
    transform=ccrs.PlateCarree(),
)
plt.show()

Now we are going to do the same for the temperature.

In [None]:
fig = plt.figure(figsize=(20, 10))
ax = fig.add_subplot(1, 2, 1, projection=ccrs.PlateCarree())
ax.set_extent([-5, 9, 42, 52])
ax.set_title("France")
ax.stock_img()

plt.scatter(
    gps_df["Lattitude"],
    gps_df["Longitude"],
    c=classifications["K Temperature"],
    cmap="viridis",
    transform=ccrs.PlateCarree(),
)
plt.show()

### Hierarchical clustering
After using the $K$-means algorithm, we are going to use hierarchical clustering, with the same number of clusters.

In [None]:
agg_wind = AgglomerativeClustering(n_clusters=4)
agg_temp = AgglomerativeClustering(n_clusters=4)
agg_wind = agg_wind.fit(wind_df)
agg_temp = agg_temp.fit(temp_df)

classifications["A Wind"] = agg_wind.labels_
classifications["A Temperature"] = agg_temp.labels_

display(classifications.head())
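
Cluster labels are arbitrary integers, so the $K$-means and hierarchical columns cannot be compared value by value. A possible way to quantify how much two partitions agree is the adjusted Rand index, which ignores label permutations. This is a sketch on synthetic blobs; with the notebook's data, two columns of `classifications` would be passed instead.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data with four well-separated groups (not the weather data).
centers = [[-10, -10], [-10, 10], [10, -10], [10, 10]]
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=1.0, random_state=0)

kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
agglo_labels = AgglomerativeClustering(n_clusters=4).fit_predict(X)

# 1.0 means identical partitions; values near 0 mean agreement no better than chance.
ari = adjusted_rand_score(kmeans_labels, agglo_labels)
print("Adjusted Rand index:", ari)
```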

We are going to plot the clusters on a map, first for the wind.

In [None]:
fig = plt.figure(figsize=(20, 10))
ax = fig.add_subplot(1, 2, 1, projection=ccrs.PlateCarree())
ax.set_extent([-5, 9, 42, 52])
ax.set_title("France")
ax.stock_img()

plt.scatter(
    gps_df["Lattitude"],
    gps_df["Longitude"],
    c=classifications["A Wind"],
    cmap="viridis",
    transform=ccrs.PlateCarree(),
)
plt.show()

And now for the temperature.

In [None]:
fig = plt.figure(figsize=(20, 10))
ax = fig.add_subplot(1, 2, 1, projection=ccrs.PlateCarree())
ax.set_extent([-5, 9, 42, 52])
ax.set_title("France")
ax.stock_img()

plt.scatter(
    gps_df["Lattitude"],
    gps_df["Longitude"],
    c=classifications["A Temperature"],
    cmap="viridis",
    transform=ccrs.PlateCarree(),
)
plt.show()

## Average clustering
We are going to compute the average of the temperature and wind data, for each city, then classify the cities using the average data.
### $K$-means

In [None]:
averages = (wind_df + temp_df) / 2

In [None]:
k_means = KMeans(n_clusters=4)
k_means.fit(averages)  # fit on the averaged data, which is also what is predicted below

print("Clusters:", k_means.cluster_centers_)

classifications["K Average"] = k_means.predict(averages)

display(classifications.head())

In [None]:
fig = plt.figure(figsize=(20, 10))
ax = fig.add_subplot(1, 2, 1, projection=ccrs.PlateCarree())
ax.set_extent([-5, 9, 42, 52])
ax.set_title("France")
ax.stock_img()

plt.scatter(
    gps_df["Lattitude"],
    gps_df["Longitude"],
    c=classifications["K Average"],
    cmap="viridis",
    transform=ccrs.PlateCarree(),
)
plt.show()

### Hierarchical clustering

In [None]:
agg_average = AgglomerativeClustering(n_clusters=4)
agg_average.fit(averages)  # cluster the averaged data, not the raw wind data

classifications["A Average"] = agg_average.labels_

display(classifications.head())

In [None]:
fig = plt.figure(figsize=(20, 10))
ax = fig.add_subplot(1, 2, 1, projection=ccrs.PlateCarree())
ax.set_extent([-5, 9, 42, 52])
ax.set_title("France")
ax.stock_img()

plt.scatter(
    gps_df["Lattitude"],
    gps_df["Longitude"],
    c=classifications["A Average"],
    cmap="viridis",
    transform=ccrs.PlateCarree(),
)
plt.show()

### Conclusion
Unfortunately, this approach is naive: wind speed and temperature are different physical quantities, so their average is not meaningful, and neither is the resulting classification.
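
Averaging wind speed and temperature mixes two quantities with different units. One possible alternative, which this notebook does not use, is to standardise each variable separately and concatenate the two blocks of features before clustering. The sketch below uses small made-up arrays (`toy_wind`, `toy_temp`) and is only meant to show the mechanics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
toy_wind = rng.normal(5, 2, size=(6, 4))  # 6 stations, 4 time stamps (made up)
toy_temp = rng.normal(285, 8, size=(6, 4))  # same stations, very different scale

# Standardise every column of each block to zero mean and unit variance.
wind_scaled = StandardScaler().fit_transform(toy_wind)
temp_scaled = StandardScaler().fit_transform(toy_temp)

# Each station is now described by both variables on a comparable scale.
combined = np.hstack([wind_scaled, temp_scaled])
print(combined.shape)
```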
# Wind clustering
We are now going to cluster only wind data, but we will use new algorithms.
## Raw data
Clustering on the raw time series was already done in the previous part, under the title "Clustering".
Here is the resulting plot as a reminder.

In [None]:
fig = plt.figure(figsize=(20, 10))
ax = fig.add_subplot(1, 2, 1, projection=ccrs.PlateCarree())
ax.set_extent([-5, 9, 42, 52])
ax.set_title("France")
ax.stock_img()

plt.scatter(
    gps_df["Lattitude"],
    gps_df["Longitude"],
    c=classifications["K Wind"],
    cmap="viridis",
    transform=ccrs.PlateCarree(),
)
plt.show()

We notice that the wind clusters are not very relevant, and seem related to the different climates in France.
## Principal component analysis
We will now perform a principal component analysis (PCA) on the wind data, to reduce its dimension while keeping the most relevant information.
We have $n=259$ cities and $p=8760$ time stamps.
Consecutive time stamps are only one hour apart, so we can expect the features to be highly correlated.
Keeping all the time stamps would therefore be redundant: we will keep only the first $k$ principal components, where $k$ is chosen below.
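
To make the projection step concrete, here is a minimal sketch on random synthetic data (not the wind data). It centres the data, takes a singular value decomposition, and projects onto the first $k$ right singular vectors, then checks that this matches `sklearn`'s `PCA` up to the sign of each component. The sizes and `k = 5` are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 50))  # 30 samples, 50 features (synthetic)
k = 5

# PCA by hand: centre the columns, then decompose.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

scores_svd = X_centered @ Vt[:k].T  # coordinates on the first k components
variance_ratio_svd = S**2 / np.sum(S**2)  # share of variance per component

# Same computation with scikit-learn.
pca = PCA(n_components=k, svd_solver="full")
scores_sklearn = pca.fit_transform(X)

# Components are defined up to sign, so compare absolute values.
print(np.allclose(np.abs(scores_svd), np.abs(scores_sklearn)))
```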

In [None]:
corr_matrix = wind_df.corr()
corr_matrix_diff = corr_matrix.diff(axis=1).dropna(axis=1)

As our features describe the wind hour by hour, we evaluate the differences between neighbouring columns of the correlation matrix, as they give a good idea of the correlation between two consecutive time stamps.

In [None]:
# Compute each statistic over all differences at once, ignoring NaN values.
diff_values = corr_matrix_diff.to_numpy()
print("Correlation matrix differences mean:", np.nanmean(diff_values))
print("Correlation matrix differences std:", np.nanstd(diff_values))
print("Correlation matrix differences min:", np.nanmin(np.abs(diff_values)))
print("Correlation matrix differences max:", np.nanmax(np.abs(diff_values)))
print("Correlation matrix differences median:", np.nanmedian(diff_values))

The statistics of the correlation differences are given just above: the mean of the differences along the columns is close to $0$, which suggests that consecutive features are highly correlated.
Before the final PCA, we first fit a PCA with a large number of components, to determine how many we must keep in order to retain $95\,\%$ of the variance.
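
As a complement to the correlation statistics above, the correlation between each time stamp and the next one can also be computed directly, without building the full $p \times p$ matrix. The sketch below does this on a synthetic table of random walks (`toy`, made up for the illustration), with stations as rows and time stamps as columns, as in `wind_df`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_stations, n_hours = 50, 200

# Synthetic smooth signals: one random walk per station (made-up data).
toy = pd.DataFrame(np.cumsum(rng.normal(size=(n_stations, n_hours)), axis=1))

current = toy.iloc[:, :-1].to_numpy()  # columns t
following = toy.iloc[:, 1:].to_numpy()  # columns t + 1

# Pearson correlation, across stations, between column t and column t + 1.
cur_c = current - current.mean(axis=0)
fol_c = following - following.mean(axis=0)
lag1 = (cur_c * fol_c).sum(axis=0) / np.sqrt(
    (cur_c**2).sum(axis=0) * (fol_c**2).sum(axis=0)
)

print(lag1.shape, lag1[-1])
```

Values close to $1$ for most time stamps would support the claim that consecutive hours carry largely redundant information.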

## Computing the best number of components

In [None]:
pca = PCA(n_components=200)

pca.fit(wind_df)

variance_ratio = pca.explained_variance_ratio_

cumulative_variance_ratio = np.cumsum(variance_ratio)

plt.figure(figsize=(20, 10))
plt.title("Variance ratio")
plt.plot(cumulative_variance_ratio)
plt.xlabel("Principal components number")
plt.ylabel("Amount of explained variance")
plt.show()

print(
    "Minimum number of components to explain 90% of the variance:",
    np.argmax(cumulative_variance_ratio > 0.90) + 1,
)
print(
    "Minimum number of components to explain 95% of the variance:",
    np.argmax(cumulative_variance_ratio > 0.95) + 1,
)
print(
    "Minimum number of components to explain 99% of the variance:",
    np.argmax(cumulative_variance_ratio > 0.99) + 1,
)

As we want to keep at least $95\,\%$ of the variance, we will keep the first $k=38$ principal components.
As a reminder, we had $p=8760$ features, so we reduced the dimension of the data by a factor of about $230$.
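
The search for the number of components can also be delegated to `scikit-learn`: when `n_components` is a float between 0 and 1 and the full SVD solver is used, `PCA` keeps just enough components to exceed that share of the variance. The sketch below checks this against the manual cumulative-sum approach on synthetic low-rank data (not the wind data).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 60 samples, 300 features, mostly explained by 8 latent factors.
X = rng.normal(size=(60, 8)) @ rng.normal(size=(8, 300)) + 0.1 * rng.normal(size=(60, 300))

# Let scikit-learn pick the number of components for 95% of the variance.
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X)

# Manual equivalent: fit all components and look at the cumulative variance.
full = PCA(svd_solver="full").fit(X)
cumulative = np.cumsum(full.explained_variance_ratio_)
manual_k = int(np.argmax(cumulative > 0.95) + 1)

print(pca_95.n_components_, manual_k)
```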

## PCA and clustering
We are now going to do the PCA with $k=10$ components, and then cluster the data using $K$-means and hierarchical clustering.
### PCA

In [None]:
pca = PCA(n_components=10)
pca.fit(wind_df)

reduced_wind_df = pd.DataFrame(pca.transform(wind_df), index=wind_df.index)
display(reduced_wind_df.head())