<a href="https://colab.research.google.com/github/GuillaumeArp/Wild_Notebooks/blob/main/PCA_Guillaume_Arp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Imports

In [None]:
# Run this to update Plotly to the latest version, required to display trendlines

! pip install plotly --upgrade

# Then restart the kernel if needed, from the menu bar or the button below the red text
# Comment out the install command above after the restart

Collecting plotly
  Downloading plotly-5.4.0-py2.py3-none-any.whl (25.3 MB)
Collecting tenacity>=6.2.0
  Downloading tenacity-8.0.1-py3-none-any.whl (24 kB)
Installing collected packages: tenacity, plotly
  Attempting uninstall: plotly
    Found existing installation: plotly 4.4.1
    Uninstalling plotly-4.4.1:
      Successfully uninstalled plotly-4.4.1
Successfully installed plotly-5.4.0 tenacity-8.0.1


In [None]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
import plotly.express as px
import plotly.graph_objects as go

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/murpi/wilddata/master/quests/weather2020.csv')

## EDA

In [None]:
# Quick look at the dataset first

df.head(10)

Unnamed: 0,DATE,MAX_TEMPERATURE_C,MIN_TEMPERATURE_C,WINDSPEED_MAX_KMH,TEMPERATURE_MORNING_C,TEMPERATURE_NOON_C,TEMPERATURE_EVENING_C,PRECIP_TOTAL_DAY_MM,HUMIDITY_MAX_PERCENT,VISIBILITY_AVG_KM,PRESSURE_MAX_MB,CLOUDCOVER_AVG_PERCENT,HEATINDEX_MAX_C,DEWPOINT_MAX_C,WINDTEMP_MAX_C,WEATHER_CODE_MORNING,WEATHER_CODE_NOON,WEATHER_CODE_EVENING,TOTAL_SNOW_MM,UV_INDEX,SUNHOUR,OPINION,MONTH,DAY
0,2020-01-01,11,10,12,10,11,10,3.9,97,7.875,1029,85.75,11,10,8,353,248,353,0,1,3.3,bad,1,1
1,2020-01-02,12,9,21,9,11,10,0.1,91,8.625,1029,95.125,12,9,8,122,122,122,0,1,3.3,bad,1,2
2,2020-01-03,12,10,24,11,12,10,0.6,94,9.375,1032,77.0,12,10,8,176,116,176,0,1,5.1,bad,1,3
3,2020-01-04,9,5,7,5,8,7,0.0,90,10.0,1038,12.375,9,6,3,113,116,116,0,1,8.7,very bad,1,4
4,2020-01-05,9,4,10,4,7,7,0.0,88,10.0,1038,18.625,9,5,3,116,116,116,0,1,8.7,very bad,1,5
5,2020-01-06,10,2,17,2,7,10,0.0,91,10.0,1031,32.0,10,8,-1,116,116,119,0,1,8.7,very bad,1,6
6,2020-01-07,12,11,18,11,12,12,0.4,99,3.0,1029,99.875,12,12,10,248,248,266,0,1,3.3,bad,1,7
7,2020-01-08,13,12,17,11,12,12,1.7,95,5.0,1028,98.25,13,12,10,266,266,122,0,1,3.3,bad,1,8
8,2020-01-09,13,10,34,12,13,10,7.2,91,9.375,1024,91.375,13,12,6,122,296,353,0,1,3.3,bad,1,9
9,2020-01-10,9,7,18,8,8,9,0.0,84,10.0,1032,57.5,9,6,4,116,116,122,0,1,5.1,bad,1,10


In [None]:
# And the column types

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 366 entries, 0 to 365
Data columns (total 24 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   DATE                    366 non-null    object 
 1   MAX_TEMPERATURE_C       366 non-null    int64  
 2   MIN_TEMPERATURE_C       366 non-null    int64  
 3   WINDSPEED_MAX_KMH       366 non-null    int64  
 4   TEMPERATURE_MORNING_C   366 non-null    int64  
 5   TEMPERATURE_NOON_C      366 non-null    int64  
 6   TEMPERATURE_EVENING_C   366 non-null    int64  
 7   PRECIP_TOTAL_DAY_MM     366 non-null    float64
 8   HUMIDITY_MAX_PERCENT    366 non-null    int64  
 9   VISIBILITY_AVG_KM       366 non-null    float64
 10  PRESSURE_MAX_MB         366 non-null    int64  
 11  CLOUDCOVER_AVG_PERCENT  366 non-null    float64
 12  HEATINDEX_MAX_C         366 non-null    int64  
 13  DEWPOINT_MAX_C          366 non-null    int64  
 14  WINDTEMP_MAX_C          366 non-null    in

There are 22 numerical columns and two object columns: one for the date (which we can parse as datetime later if necessary) and one for the opinion.
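If the date does need parsing, a minimal sketch could look like the following. It uses a small stand-in frame called `sample` (a hypothetical name, not part of this notebook) with the same `YYYY-MM-DD` format as the `DATE` column, and assumes that format holds for every row.

```python
import pandas as pd

# Small stand-in frame with the same DATE format as the dataset (illustrative rows)
sample = pd.DataFrame({"DATE": ["2020-01-01", "2020-01-02", "2020-02-29"]})

# Convert the object column to datetime; an explicit format avoids ambiguous parsing
sample["DATE"] = pd.to_datetime(sample["DATE"], format="%Y-%m-%d")

# The .dt accessor then exposes calendar parts directly
sample["MONTH"] = sample["DATE"].dt.month
sample["DAY"] = sample["DATE"].dt.day

print(sample.dtypes)
```

The same two lines applied to `df['DATE']` should work on the real dataset, as long as no row deviates from that format.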

In [None]:
# Now a pairplot

fig = px.scatter_matrix(df.iloc[:,1:22])

fig.update_layout(width=2200, height=2200, title='Dataset Pairplot')
fig.show()

As usual, a pairplot with that many variables is pretty hard to read, but we can already see some significant correlations, which will probably be clearer with a heatmap.

In [None]:
# Heatmap

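# Restrict to numeric columns: recent pandas versions raise an error on object columns
corr = df.select_dtypes(include='number').corr()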

fig = go.Figure()

fig.add_trace(go.Heatmap(
    z = corr,
    x = corr.columns.values,
    y = corr.columns.values,
    colorscale=px.colors.diverging.RdBu,
    zmid=0
))

fig.update_layout(width=1200, height=900, title='Correlation Heatmap')
fig.show()

This is much clearer. There was no snow recorded this year, so that variable is constant and shows up blank in the heatmap. The temperature variables are all very strongly positively correlated with each other, and with the Heat Index as well.

The Wind Temperature and Dewpoint Temperature are, as their names suggest, temperature measures, and they are correlated with the Morning, Noon and Evening temperature recordings.

On the negative correlation side, the most notable pair is cloud coverage and sun hours, which is also expected, as more cloud coverage tends to mean fewer sun hours.
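Reading the strongest pairs off a heatmap by eye is error-prone, so here is one possible way to rank them numerically. This is only a sketch on synthetic data: the `toy` frame, its column names and its values are made up for illustration, and only mimic the temperature/heat index and cloud cover/sun hours relationships described above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a few weather columns (all values are made up)
rng = np.random.default_rng(0)
temp = rng.normal(15, 6, 200)
cloud = rng.uniform(0, 100, 200)
toy = pd.DataFrame({
    "TEMP_NOON": temp,
    "HEATINDEX": temp + rng.normal(0, 0.5, 200),
    "CLOUDCOVER": cloud,
    "SUNHOUR": 14 - 0.1 * cloud + rng.normal(0, 0.5, 200),
})

corr = toy.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Sort by absolute strength while keeping the sign of each correlation
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)

print(ranked.head())
```

Replacing `toy.corr()` with the `corr` matrix computed for the heatmap should give the same kind of ranking for the real data.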


In [None]:
# A look at the summary statistics

df.iloc[:,1:22].describe()

Unnamed: 0,MAX_TEMPERATURE_C,MIN_TEMPERATURE_C,WINDSPEED_MAX_KMH,TEMPERATURE_MORNING_C,TEMPERATURE_NOON_C,TEMPERATURE_EVENING_C,PRECIP_TOTAL_DAY_MM,HUMIDITY_MAX_PERCENT,VISIBILITY_AVG_KM,PRESSURE_MAX_MB,CLOUDCOVER_AVG_PERCENT,HEATINDEX_MAX_C,DEWPOINT_MAX_C,WINDTEMP_MAX_C,WEATHER_CODE_MORNING,WEATHER_CODE_NOON,WEATHER_CODE_EVENING,TOTAL_SNOW_MM,UV_INDEX,SUNHOUR
count,366.0,366.0,366.0,366.0,366.0,366.0,366.0,366.0,366.0,366.0,366.0,366.0,366.0,366.0,366.0,366.0,366.0,366.0,366.0,366.0
mean,16.997268,11.259563,20.991803,10.631148,15.770492,15.275956,3.169126,86.603825,9.237022,1019.562842,50.01776,17.311475,11.256831,9.70765,160.92623,166.172131,181.278689,0.0,3.639344,9.269126
std,6.369232,4.51067,8.546565,4.601133,6.018857,6.450046,6.408436,7.501151,1.059369,8.309106,28.935057,6.688467,4.225123,5.728018,74.537276,81.604763,93.009523,0.0,1.681807,3.399074
min,5.0,0.0,3.0,-1.0,3.0,3.0,0.0,47.0,3.0,982.0,0.0,5.0,0.0,-5.0,113.0,113.0,113.0,0.0,1.0,3.3
25%,12.0,8.0,14.0,8.0,11.0,10.0,0.0,83.0,8.75,1016.0,24.75,12.0,8.0,6.0,116.0,116.0,116.0,0.0,3.0,6.7
50%,16.0,11.0,20.0,11.0,15.0,14.0,0.4,88.5,9.875,1020.0,53.9375,16.0,12.0,10.0,119.0,116.0,122.0,0.0,4.0,9.1
75%,21.0,15.0,26.0,14.0,20.0,20.0,2.9,92.0,10.0,1024.0,74.0,23.0,14.0,14.0,176.0,176.0,176.0,0.0,5.0,11.6
max,38.0,23.0,50.0,23.0,35.0,37.0,53.3,99.0,10.0,1044.0,100.0,38.0,22.0,23.0,353.0,389.0,389.0,0.0,8.0,14.5


Oddly enough, the `MIN_TEMPERATURE_C` variable doesn't actually hold the lowest temperature in the dataset: its minimum is 0 °C, while `TEMPERATURE_MORNING_C` goes down to -1 °C.

This also confirms that there was no snow at all that year: `TOTAL_SNOW_MM` is 0 for every statistic.

Dispersion seems relatively small for most variables, and there are no obvious outliers.
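A column that never changes, like the snow one, carries no information for PCA. As a hedged sketch, constant columns could be spotted and dropped like this. The `toy` frame reuses the first four rows of three columns shown in the head of the dataset; `constant_cols` and `toy_reduced` are hypothetical names used only here, and this notebook itself keeps the column.

```python
import pandas as pd

# First four rows of three columns, copied from the head of the dataset above
toy = pd.DataFrame({
    "MAX_TEMPERATURE_C": [11, 12, 12, 9],
    "TOTAL_SNOW_MM": [0, 0, 0, 0],
    "SUNHOUR": [3.3, 3.3, 5.1, 8.7],
})

# A column with a single distinct value has zero variance
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
print(constant_cols)

# Dropping it before scaling and PCA removes a feature that cannot explain any variance
toy_reduced = toy.drop(columns=constant_cols)
print(toy_reduced.columns.tolist())
```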

## Preprocessing

In [None]:
# Define X and y

X = df.select_dtypes(include='number')
y = df['OPINION']

X.shape

(366, 22)

In [None]:
# Standardize the data

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

X_scaled.shape

(366, 22)

In [None]:
# PCA initialization

pca = PCA()
pca.fit(X_scaled)

PCA()

In [None]:
# Share of the total variance explained by each component

variance = pca.explained_variance_ratio_
np.around(variance, 3)

array([0.41 , 0.209, 0.064, 0.06 , 0.046, 0.039, 0.03 , 0.028, 0.025,
       0.022, 0.021, 0.017, 0.011, 0.006, 0.005, 0.004, 0.002, 0.001,
       0.001, 0.   , 0.   , 0.   ])

In [None]:
# Getting 70% of the variance

np.sum(variance[:4]) * 100

74.28512609184794

In [None]:
# Other method

pca = PCA(n_components=0.7)
pca.fit(X_scaled)
X_pca = pca.transform(X_scaled)
X_pca.shape

(366, 4)

We can get more than 70% of the variance with 4 principal components.

In [None]:
# Getting 80% of the variance

np.sum(variance[:6]) * 100

82.77173684351605

In [None]:
# Other method

pca = PCA(n_components=0.8)
pca.fit(X_scaled)
X_pca = pca.transform(X_scaled)
X_pca.shape

(366, 6)

We can get more than 80% of the variance with 6 principal components.
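Both thresholds can also be read from the cumulative sum of the explained variance ratios. The sketch below reuses the rounded ratios printed earlier in this notebook (so the sums are approximate), and `components_needed` is a hypothetical helper written for this example, not a scikit-learn function.

```python
import numpy as np

# Rounded explained variance ratios, copied from the output above
variance = np.array([0.41, 0.209, 0.064, 0.06, 0.046, 0.039, 0.03, 0.028, 0.025,
                     0.022, 0.021, 0.017, 0.011, 0.006, 0.005, 0.004, 0.002, 0.001,
                     0.001, 0.0, 0.0, 0.0])

# Running total: entry k is the variance explained by the first k + 1 components
cumulative = np.cumsum(variance)


def components_needed(threshold):
    """Smallest number of leading components whose cumulative ratio reaches the threshold."""
    # argmax on a boolean array returns the position of the first True
    return int(np.argmax(cumulative >= threshold) + 1)


print(components_needed(0.7))
print(components_needed(0.8))
```

This follows the same idea as passing a float to `n_components`, and gives a single place to test several thresholds without refitting the PCA each time.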

## Transforming the Data

In [None]:
X_pca = pca.transform(X_scaled)

X_pca.shape

(366, 6)

In [None]:
# Plotting the first two transformed columns

fig = px.scatter(
    x=X_pca[:,0],
    y=X_pca[:,1],
    color=y,
    labels={'color': 'OPINION'}
)

fig.update_layout(width=900, height=650, template='plotly_dark')
fig.show()

## Classification with KNN

In [None]:
# Fit the KNN model on the scaled data

modelKNN = KNeighborsClassifier().fit(X_scaled, y)
modelKNN


KNeighborsClassifier()

In [None]:
# Score of the KNN on the scaled dataset

print("\nScore for the scaled dataset :", modelKNN.score(X_scaled, y))


Score for the scaled dataset : 0.8661202185792349


In [None]:
# PCA reinitialisation with 2 components

pca = PCA(n_components=2)
pca.fit(X_scaled)

X_pca = pca.transform(X_scaled)

In [None]:
# Fit the KNN model on the PCA data with 2 columns

modelKNN_pca = KNeighborsClassifier().fit(X_pca, y)
modelKNN_pca

KNeighborsClassifier()

In [None]:
# Score of the KNN on the PCA dataset

print("\nScore for the PCA dataset :", modelKNN_pca.score(X_pca, y))


Score for the PCA dataset : 0.8469945355191257


The scores are very close, with the PCA dataset slightly lower (0.847 vs. 0.866) even though it uses only 2 columns. We could also try 2 to 6 components to see which gives the best result, and whether any of them gets close enough to the score on the full scaled data.

In [None]:
# With 2 to 6 components

for i in range(2, 7):
    pca = PCA(n_components=i)
    pca.fit(X_scaled)
    X_pca = pca.transform(X_scaled)
    modelKNN_pca = KNeighborsClassifier().fit(X_pca, y)

    print(f"Score for the PCA dataset with {i} components: {modelKNN_pca.score(X_pca, y)}")

Score for the PCA dataset with 2 components: 0.8469945355191257
Score for the PCA dataset with 3 components: 0.8415300546448088
Score for the PCA dataset with 4 components: 0.7950819672131147
Score for the PCA dataset with 5 components: 0.8005464480874317
Score for the PCA dataset with 6 components: 0.8360655737704918


It looks like 2 components do indeed give the best result here.
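All the scores above are computed on the same rows the models were fitted on, so they measure training accuracy. One possible way to compare the numbers of components on held-out data is cross-validation with a pipeline, sketched below. The data here is synthetic (`make_classification` stands in for the weather features and the `OPINION` labels), so the printed scores say nothing about the real dataset, and the 5 folds are an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as X (366 rows, 22 features)
X_demo, y_demo = make_classification(
    n_samples=366, n_features=22, n_informative=6, n_classes=3, random_state=0
)

cv_scores = {}
for n in range(2, 7):
    # Scaler and PCA live inside the pipeline, so they are refitted on each training fold
    pipe = make_pipeline(StandardScaler(), PCA(n_components=n), KNeighborsClassifier())
    cv_scores[n] = cross_val_score(pipe, X_demo, y_demo, cv=5).mean()

for n, score in cv_scores.items():
    print(f"{n} components: mean CV accuracy = {score:.3f}")
```

Swapping `X_demo` and `y_demo` for `X` and `y` would run the same comparison on the weather data. Keeping the scaler and the PCA inside the pipeline matters, because fitting them on the full dataset first would leak information from the validation folds.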