### **Introduction**

This notebook aims to identify regions with similar weather conditions by analyzing the average temperature recorded for each month of the year. Using Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP) for dimensionality reduction and visualization, we can uncover patterns and clusters within the data.

The visualizations produced by PCA and UMAP help reduce the complexity of the dataset, making it easier to identify regions with similar weather conditions based on multiple metrics. These insights can be valuable in various business applications, such as targeting markets with similar characteristics, optimizing supply chains, and tailoring marketing strategies to specific regional weather patterns.

In [None]:
pip install umap-learn

In [2]:
# Importing necessary Libraries

# Computation libraries
import numpy as np
import pandas as pd

# Visualization libraries
import plotly.graph_objs as go
import plotly.figure_factory as ff

# Preprocessing Libraries
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import umap

In [4]:
# Loading the dataset
dataset = pd.read_csv("/content/Avg_Temperatures.csv")
dataset.head()

,Regions,April,August,December,February,January,July,June,March,May,November,October,September
0,Ain,12.29,19.03,5.95,2.93,5.35,20.84,18.54,6.87,13.82,7.47,11.92,15.93
1,Aisne,13.35,21.57,11.12,9.29,8.57,21.5,20.76,10.78,15.26,12.62,16.51,19.75
2,Allier,17.99,25.19,8.72,6.64,9.09,26.38,26.19,12.22,20.92,11.48,19.51,22.47
3,Alpes-de-Haute-Provence,11.99,20.2,6.69,3.9,6.18,21.09,18.29,7.61,13.31,8.32,11.9,16.58
4,Alpes-Maritimes,12.08,23.63,7.48,7.42,5.65,23.73,21.63,8.61,16.06,10.44,14.59,21.68


### **Exploratory Data Analysis (EDA)**
This project does not explore the dataset in depth. As good practice, however, we run a basic EDA before modeling.

In [5]:
# Getting the full information of the dataset
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93 entries, 0 to 92
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Regions    93 non-null     object 
 1   April      93 non-null     float64
 2   August     93 non-null     float64
 3   December   93 non-null     float64
 4   February   93 non-null     float64
 5   January    93 non-null     float64
 6   July       93 non-null     float64
 7   June       93 non-null     float64
 8   March      93 non-null     float64
 9   May        93 non-null     float64
 10  November   93 non-null     float64
 11  October    93 non-null     float64
 12  September  93 non-null     float64
dtypes: float64(12), object(1)
memory usage: 9.6+ KB


In [9]:
# Checking for null values
dataset.isna().sum()

Regions      0
April        0
August       0
December     0
February     0
January      0
July         0
June         0
March        0
May          0
November     0
October      0
September    0
dtype: int64

In [16]:
# Counting and listing the unique regions (a notebook cell displays only its last expression)
dataset['Regions'].nunique()
dataset['Regions'].unique()

array(['Ain', 'Aisne', 'Allier', 'Alpes-de-Haute-Provence',
       'Alpes-Maritimes', 'Ardeche', 'Ardennes', 'Ariege', 'Aube', 'Aude',
       'Aveyron', 'Bas-Rhin', 'Bouches-du-Rhone', 'Calvados', 'Cantal',
       'Charente', 'Charente-Maritime', 'Cher', 'Correze', "Cote-d'Or",
       "Cotes-d'Armor", 'Creuse', 'Deux-Sevres', 'Dordogne', 'Doubs',
       'Drome', 'Essonne', 'Eure', 'Eure-et-Loir', 'Finistere', 'Gard',
       'Gers', 'Gironde', 'Haut-Rhin', 'Haute-Garonne', 'Haute-Loire',
       'Haute-Marne', 'Haute-Saone', 'Haute-Savoie', 'Haute-Vienne',
       'Hautes-Alpes', 'Hautes-Pyrenees', 'Hauts-de-Seine', 'Herault',
       'Ille-et-Vilaine', 'Indre', 'Indre-et-Loire', 'Isere', 'Landes',
       'Loir-et-Cher', 'Loire', 'Loire-Atlantique', 'Loiret', 'Lot',
       'Lot-et-Garonne', 'Lozere', 'Maine-et-Loire', 'Manche', 'Marne',
       'Mayenne', 'Meurthe-et-Moselle', 'Meuse', 'Morbihan', 'Moselle',
       'Nievre', 'Nord', 'Oise', 'Orne', 'Paris', 'Pas-de-Calais',
       'Puy-de-D

In [17]:
# Getting the overall statistics of the dataset
dataset.describe()

,April,August,December,February,January,July,June,March,May,November,October,September
count,93.0,93.0,93.0,93.0,93.0,93.0,93.0,93.0,93.0,93.0,93.0,93.0
mean,13.188495,21.734409,9.54828,7.17043,8.178495,22.735269,20.04043,10.023333,15.64043,10.685591,15.682258,19.120645
std,1.727715,2.734134,1.991274,2.044392,1.957351,3.000885,2.544448,1.883158,2.064988,2.105559,1.91695,2.344849
min,10.81,17.54,5.07,2.67,4.13,17.94,15.31,6.42,12.36,6.0,11.48,14.73
25%,11.82,19.55,8.24,5.96,6.61,20.34,18.54,8.64,14.13,9.85,14.58,17.52
50%,12.91,21.16,10.07,7.71,8.62,22.28,19.88,10.24,15.34,11.3,15.92,18.98
75%,14.41,23.63,10.83,8.41,9.52,24.7,21.53,11.33,16.93,12.08,17.01,20.9
max,17.99,28.67,13.55,11.92,12.45,29.53,26.33,14.32,21.25,14.52,19.51,24.37


#### ***Observation:***
The dataset contains no null values. It has 93 entries, one per region, and 13 columns: the region name plus 12 columns holding the monthly average temperature from January to December, stored as floats.

Moreover, the summary statistics show the mean, minimum, and maximum value for each month, along with the standard deviation and quartiles.

### **Preprocessing**

In [18]:
# Dividing dataset into label and feature sets
X = dataset.drop(['Regions'], axis = 1) # Features
Y = dataset['Regions'] # Labels

In [19]:
# Getting the shape of each set
print(X.shape)
print(Y.shape)

(93, 12)
(93,)


In [20]:
# Normalizing numerical features so that each feature has mean 0 and variance 1
feature_scaler = StandardScaler()
X_scaled = feature_scaler.fit_transform(X)
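
To make the scaling step concrete, here is a minimal sketch of the same z-score computation done by hand in NumPy. The array `X_demo` is a made-up stand-in for the temperature table, not the real data, and the variable names are illustrative only.

```python
import numpy as np

# Hypothetical stand-in for the temperature table: 5 regions x 3 months.
X_demo = np.array([
    [12.3, 19.0, 6.0],
    [13.4, 21.6, 11.1],
    [18.0, 25.2, 8.7],
    [12.0, 20.2, 6.7],
    [12.1, 23.6, 7.5],
])

# Z-score each column: subtract the column mean, then divide by the
# column standard deviation (population form, ddof=0, which is what
# StandardScaler uses).
col_mean = X_demo.mean(axis=0)
col_std = X_demo.std(axis=0, ddof=0)
X_demo_scaled = (X_demo - col_mean) / col_std

# Each column should now have mean close to 0 and standard deviation close to 1.
print(X_demo_scaled.mean(axis=0))
print(X_demo_scaled.std(axis=0))
```

Scaling matters here because PCA and UMAP both work on distances and variances, so a month with a wider temperature range would otherwise dominate the result.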

### **Principal Component Analysis (PCA)**
PCA is a statistical technique used for dimensionality reduction, data visualization, and feature extraction. It transforms a large set of variables into a smaller set of uncorrelated components that still contains most of the information in the original set, allowing analysts to visualize datasets with many variables more easily. It is also helpful for noise reduction and for identifying patterns and correlations in the data.

Read more about it [here](https://builtin.com/data-science/step-step-explanation-principal-component-analysis)
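
For intuition about what PCA computes, the following is a rough sketch that derives principal components from a singular value decomposition in plain NumPy. The data is synthetic (93 rows and 12 correlated columns, chosen only to mirror the shape of this dataset), and the names `X_demo`, `Z`, and `scores` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 12 correlated columns driven by 2 hidden factors plus noise.
latent = rng.normal(size=(93, 2))
mixing = rng.normal(size=(2, 12))
X_demo = latent @ mixing + 0.3 * rng.normal(size=(93, 12))

# Standardize the columns (this also centers them, which PCA requires).
Z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# SVD of the centered data: the rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)

# Variance captured along each direction, and its share of the total.
explained_variance = S**2 / (Z.shape[0] - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# Project the data onto the first two directions to get 2-D coordinates.
scores = Z @ Vt[:2].T

print(explained_variance_ratio[:2], explained_variance_ratio[:2].sum())
```

The coordinates should agree with what `sklearn.decomposition.PCA` returns up to the sign of each component, since the direction of a principal axis is arbitrary.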

In [21]:
# Applying PCA for initial dimensionality reduction
pca = PCA(n_components = 2)
pca.fit(X_scaled)
x_pca = pca.transform(X_scaled)
print("Variance explained by each of the n_components: ",pca.explained_variance_ratio_)
print("Total variance explained by the n_components: ",sum(pca.explained_variance_ratio_))

Variance explained by each of the n_components:  [0.63551772 0.23651075]
Total variance explained by the n_components:  0.8720284679265535


#### ***Observation***
The explained variance ratio of a component is the share of the original data's variability that it captures. Here the first component captures about 63.6% of the variance and the second about 23.7%, so the two components together retain 87.2%. The 2-D projection therefore preserves most of the variation in the dataset.
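
The total is simply the running sum of the per-component ratios. The sketch below reproduces that arithmetic from the two values printed above and adds a small helper, `n_components_for`, which is a hypothetical function (not part of scikit-learn) for picking the number of components that reaches a chosen threshold.

```python
import numpy as np

# Ratios printed by the notebook for the first two components.
reported = np.array([0.63551772, 0.23651075])

# Running total: the last entry is the variance retained by both components.
cumulative = np.cumsum(reported)
print(cumulative)

def n_components_for(ratios, threshold):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    cum = np.cumsum(ratios)
    if cum[-1] < threshold:
        raise ValueError("the given components never reach the threshold")
    # searchsorted returns the first index where cum >= threshold.
    return int(np.searchsorted(cum, threshold) + 1)

print(n_components_for(reported, 0.85))
```

In practice the helper would be fed the full `explained_variance_ratio_` of a PCA fitted without `n_components`, so that all 12 ratios are available.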


In [32]:
# Creating a DataFrame for PCA results
pca_df = pd.DataFrame(data=x_pca, columns=['PC1', 'PC2'])
pca_df['Region'] = Y.values

pca_df.head()

,PC1,PC2,Region
0,-4.475762,-1.634579,Ain
1,1.066846,1.188511,Aisne
2,4.520382,-2.599654,Allier
3,-3.831168,-1.085755,Alpes-de-Haute-Provence
4,-0.356467,-1.707702,Alpes-Maritimes


##### ***Plotting PCA Results***

In [34]:
# Visualizing PCA results
digits = list(dataset['Regions'])  # region names, used as hover labels
data = [go.Scatter(x=x_pca[:,0], y=x_pca[:,1], mode='markers',
                    marker = dict(color=None, colorscale='Rainbow', opacity=0.5),
                    text=[f'Region: {a}' for a in digits],
                    hoverinfo='text')]

layout = go.Layout(title = 'PCA Dimensionality Reduction', width = 700, height = 700,
                    xaxis = dict(title='First Principal Component'),
                    yaxis = dict(title='Second Principal Component'))
fig = go.Figure(data=data, layout=layout)
fig.show()

#### ***Observations***
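In the scatter plot above, regions that lie close together have similar monthly temperature profiles, and several such groups are visible. Hovering over a point shows the name of the region.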
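
Reading neighbors off a plot can be complemented by a direct lookup. The sketch below ranks regions by Euclidean distance in the standardized feature space, which avoids the information lost in a 2-D projection. The region names and values are made up, and `most_similar` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np

# Hypothetical standardized features for five made-up regions.
names = ["A", "B", "C", "D", "E"]
Z_demo = np.array([
    [-1.0, -0.9, -1.1],
    [-0.9, -1.0, -1.0],
    [ 1.2,  1.0,  0.9],
    [ 1.1,  1.1,  1.0],
    [ 0.0,  0.1, -0.1],
])

def most_similar(Z, labels, query, k=2):
    """Return the k labels closest to `query`, with their Euclidean distances."""
    idx = labels.index(query)
    dists = np.linalg.norm(Z - Z[idx], axis=1)
    order = np.argsort(dists)
    order = order[order != idx][:k]  # drop the query itself
    return [(labels[i], float(dists[i])) for i in order]

print(most_similar(Z_demo, names, "A", k=2))
```

With the notebook's own variables, a call along the lines of `most_similar(X_scaled, list(Y), "Ain")` should list the regions whose temperature profile is closest to that of Ain.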





### **Uniform Manifold Approximation and Projection (UMAP)**
UMAP is a nonlinear dimensionality reduction technique used for visualization and clustering of high-dimensional data. It is designed to preserve the local neighborhood structure of the data while retaining some of its global structure, making it a popular choice for exploring datasets whose structure a linear projection does not capture well.

PCA, by contrast, is a linear technique: it excels at capturing global variance and preserving the large-scale structure of the data, but it can miss nonlinear relationships between features.

Read more about it [here](https://pair-code.github.io/understanding-umap/).

In [24]:
# Applying UMAP for initial dimensionality reduction
u = umap.UMAP(n_components = 2, n_neighbors=15, min_dist=0.1)
x_umap = u.fit_transform(X_scaled)

In [35]:
# Creating a DataFrame for UMAP results
umap_df = pd.DataFrame(data=x_umap, columns=['UMAP1', 'UMAP2'])
umap_df['Region'] = Y.values

umap_df.head()

,UMAP1,UMAP2,Region
0,-7.104587,17.855692,Ain
1,5.410969,16.717253,Aisne
2,16.099657,8.25625,Allier
3,-6.40457,17.489391,Alpes-de-Haute-Provence
4,-11.35603,-2.795272,Alpes-Maritimes


##### ***Plotting UMAP Results***

In [25]:
# Visualizing UMAP results
data = [go.Scatter(x=x_umap[:,0], y=x_umap[:,1], mode='markers',
                    marker = dict(color=None, colorscale='Rainbow', opacity=0.5),
                    text=[f'Region: {a}' for a in digits],  # digits holds the region names
                    hoverinfo='text')]

layout = go.Layout(title = 'UMAP Dimensionality Reduction', width = 700, height = 700,
                    xaxis = dict(title='First Dimension'),
                    yaxis = dict(title='Second Dimension'))
fig = go.Figure(data=data, layout=layout)
fig.show()


#### ***Observation***
In the chart above, the clusters are more clearly separated than in the PCA plot, grouping regions with similar temperature profiles.
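
A visual impression of cluster quality can be backed with a number. One possible check, sketched below, measures how many of each point's nearest neighbors in the original space are still among its nearest neighbors in the 2-D embedding. This score is not computed anywhere in the notebook; `knn_indices` and `neighborhood_overlap` are hypothetical helpers, and the demo data is random.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbors of every row (excluding the row itself)."""
    diff = X[:, None, :] - X[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)   # pairwise squared Euclidean distances
    np.fill_diagonal(d2, np.inf)    # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]

def neighborhood_overlap(X_high, X_low, k=5):
    """Average fraction of k-nearest neighbors shared between two spaces (0 to 1)."""
    high = knn_indices(X_high, k)
    low = knn_indices(X_low, k)
    shared = [len(set(h) & set(l)) for h, l in zip(high, low)]
    return float(np.mean(shared)) / k

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(93, 12))            # synthetic stand-in for X_scaled
unrelated = rng.normal(size=(93, 2))          # an embedding with no relation to X_demo

print(neighborhood_overlap(X_demo, X_demo))   # identical spaces give 1.0
print(neighborhood_overlap(X_demo, unrelated))
```

Applied to this notebook, comparing `neighborhood_overlap(X_scaled, x_pca)` with `neighborhood_overlap(X_scaled, x_umap)` would indicate which embedding keeps local neighborhoods more faithfully, keeping in mind that UMAP results vary between runs unless `random_state` is fixed.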