<p style="font-family: Arial; font-size:2.75em;color:purple; font-style:bold"><br>

# Clustering with scikit-learn

<br><br></p>

In this notebook, we will learn how to use K-Means with scikit-learn in Python.

We will use cluster analysis to build a climate model from minute-resolution weather data. The data set has more than 1.5 million records. How do we group them into 12 clusters?

NOTE: The data set is stored in a large CSV file, referred to in this notebook as minute.csv; the code below loads it from `./meteo/minuto.csv`.

In [1]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import pandas as pd
import numpy as np
from itertools import cycle, islice
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
import datetime

In [2]:
data = pd.read_csv('./meteo/minuto.csv')

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold">Weather data per minute</p>
<br>
The **minute** weather data set comes from the same source as the daily weather data set used in the decision-tree classifier notebook. The main difference is that the minute data set contains raw sensor measurements captured at one-minute intervals, whereas the daily data set contains processed (averaged) data.

As with daily weather data, this data comes from a weather station. The weather station is equipped with sensors that capture weather-related measurements such as air temperature, air pressure, and relative humidity. The data was collected over a three-year period, from September 2011 to September 2014, to ensure that sufficient data is captured for different seasons and weather conditions.

Each row in **minute.csv** contains weather data captured for a one-minute interval. Each row, or sample, consists of the following variables:

* **rowID:** unique row key
* **hpwren_timestamp:** measurement timestamp (*year-month-day hour:minute:second*)
* **air_pressure:** atmospheric pressure (*hectopascals*)
* **air_temp:** air temperature (*degrees Fahrenheit*)
* **avg_wind_direction:** average wind direction during the minute before the timestamp
* **avg_wind_speed:** average wind speed during the minute before the timestamp (*meters per second*)
* **max_wind_direction:** maximum wind direction
* **max_wind_speed:** maximum wind speed
* **min_wind_direction:** minimum wind direction
* **min_wind_speed:** minimum wind speed
* **rain_accumulation:** rain accumulation at the timestamp
* **rain_duration:** rain duration
* **relative_humidity:** relative humidity measured at the timestamp

In [3]:
data.shape

(1587257, 13)

In [4]:
data.head()

| | rowID | hpwren_timestamp | air_pressure | air_temp | avg_wind_direction | avg_wind_speed | max_wind_direction | max_wind_speed | min_wind_direction | min_wind_speed | rain_accumulation | rain_duration | relative_humidity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2011-09-10 00:00:49 | 912.3 | 64.76 | 97.0 | 1.2 | 106.0 | 1.6 | 85.0 | 1.0 | NaN | NaN | 60.5 |
| 1 | 1 | 2011-09-10 00:01:49 | 912.3 | 63.86 | 161.0 | 0.8 | 215.0 | 1.5 | 43.0 | 0.2 | 0.0 | 0.0 | 39.9 |
| 2 | 2 | 2011-09-10 00:02:49 | 912.3 | 64.22 | 77.0 | 0.7 | 143.0 | 1.2 | 324.0 | 0.3 | 0.0 | 0.0 | 43.0 |
| 3 | 3 | 2011-09-10 00:03:49 | 912.3 | 64.4 | 89.0 | 1.2 | 112.0 | 1.6 | 12.0 | 0.7 | 0.0 | 0.0 | 49.5 |
| 4 | 4 | 2011-09-10 00:04:49 | 912.3 | 64.4 | 185.0 | 0.4 | 260.0 | 1.0 | 100.0 | 0.1 | 0.0 | 0.0 | 58.8 |


<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

Data Sampling<br></p>

The data set has too many rows, so we will keep only 10% of them. <br>

In [6]:
# Keep one of every 10 rows, using the modulo operation on rowID.
sampled_df = data[data['rowID'] % 10 == 0]
sampled_df.shape

(158726, 13)
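The modulo filter above is a form of systematic sampling, and it only gives an even 1-in-10 sample if rowID is sequential, as the `data.head()` output suggests. As a rough sketch on a made-up frame (the `toy` data and the variable names are illustrative, not part of the notebook), it can be compared with a positional slice and with a random sample:

```python
import pandas as pd

# Toy stand-in for the real data: 100 rows with a sequential rowID (made-up values).
toy = pd.DataFrame({
    "rowID": range(100),
    "air_temp": [60.0 + i * 0.1 for i in range(100)],
})

# Systematic sampling: keep rows whose rowID is a multiple of 10.
by_modulo = toy[toy["rowID"] % 10 == 0]

# Same rows via a positional slice, valid when rowID follows the row order.
by_slice = toy.iloc[::10]

# Alternative: a random 10% sample; random_state makes it repeatable.
by_random = toy.sample(frac=0.1, random_state=0)

print(by_modulo.shape, by_slice.shape, by_random.shape)
```

Systematic sampling keeps the time ordering and an even spacing between samples, while a random sample avoids any accidental alignment with periodic patterns in the data.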

Statistics

In [9]:
# Summarize the sample; transposing puts one variable per row, which is easier to read.
sampled_df.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| rowID | 158726.0 | 793625.0 | 458203.937509 | 0.0 | 396812.5 | 793625.0 | 1190437.5 | 1587250.0 |
| air_pressure | 158726.0 | 916.830161 | 3.051717 | 905.0 | 914.8 | 916.7 | 918.7 | 929.5 |
| air_temp | 158726.0 | 61.851589 | 11.833569 | 31.64 | 52.7 | 62.24 | 70.88 | 99.5 |
| avg_wind_direction | 158680.0 | 162.1561 | 95.278201 | 0.0 | 62.0 | 182.0 | 217.0 | 359.0 |
| avg_wind_speed | 158680.0 | 2.775215 | 2.057624 | 0.0 | 1.3 | 2.2 | 3.8 | 31.9 |
| max_wind_direction | 158680.0 | 163.462144 | 92.452139 | 0.0 | 68.0 | 187.0 | 223.0 | 359.0 |
| max_wind_speed | 158680.0 | 3.400558 | 2.418802 | 0.1 | 1.6 | 2.7 | 4.6 | 36.0 |
| min_wind_direction | 158680.0 | 166.774017 | 97.441109 | 0.0 | 76.0 | 180.0 | 212.0 | 359.0 |
| min_wind_speed | 158680.0 | 2.134664 | 1.742113 | 0.0 | 0.8 | 1.6 | 3.0 | 31.6 |
| rain_accumulation | 158725.0 | 0.000318 | 0.011236 | 0.0 | 0.0 | 0.0 | 0.0 | 3.12 |


In [10]:
# Count the rows where rain_accumulation is exactly zero (rain_duration is checked in the next cell).
sampled_df[sampled_df['rain_accumulation'] == 0].shape

(157812, 13)

In [11]:
sampled_df[sampled_df['rain_duration'] == 0].shape

(157237, 13)
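Compared with the 158726 sampled rows, these counts mean that roughly 99% of the values in each rain column are zero. A minimal sketch of how that share could be computed directly, using a small made-up frame (`toy` and `zero_share` are illustrative names, not from the notebook):

```python
import pandas as pd

# Made-up rain readings standing in for the two columns of interest.
toy = pd.DataFrame({
    "rain_accumulation": [0.0, 0.0, 0.0, 0.2, 0.0],
    "rain_duration": [0.0, 0.0, 10.0, 20.0, 0.0],
})

# Comparing with 0 gives a boolean frame; the mean of booleans is the share of True values.
zero_share = (toy == 0).mean()
print(zero_share)
```

Applied to `sampled_df[['rain_accumulation', 'rain_duration']]`, the same expression would give the per-column fraction of zeros in one step.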

Cleaning this data

In [12]:
# Both columns are almost entirely zeros, so we drop them.
del sampled_df['rain_accumulation']
del sampled_df['rain_duration']

In [13]:
rows_before = sampled_df.shape[0]
# Drop rows that contain missing values.
sampled_df = sampled_df.dropna()
rows_after = sampled_df.shape[0]

In [14]:
rows_before - rows_after

46
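The 46 dropped rows match the gap in the statistics table, where the wind columns have a count of 158680 against 158726 for the other variables. To see which columns hold the missing values before dropping anything, a per-column null count helps. The following is a small sketch on made-up data (`toy`, `nulls_per_column`, and `rows_dropped` are illustrative names):

```python
import numpy as np
import pandas as pd

# Made-up readings with a few gaps.
toy = pd.DataFrame({
    "air_temp": [64.8, 63.9, np.nan, 64.4],
    "avg_wind_speed": [1.2, np.nan, 0.7, 1.2],
})

# Number of missing values in each column.
nulls_per_column = toy.isna().sum()

# dropna() removes every row that has at least one missing value.
cleaned = toy.dropna()
rows_dropped = len(toy) - len(cleaned)

print(nulls_per_column)
print(rows_dropped)
```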

In [15]:
sampled_df.columns

Index(['rowID', 'hpwren_timestamp', 'air_pressure', 'air_temp',
       'avg_wind_direction', 'avg_wind_speed', 'max_wind_direction',
       'max_wind_speed', 'min_wind_direction', 'min_wind_speed',
       'relative_humidity'],
      dtype='object')

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

Select the features of interest for clustering
<br><br></p>

In [16]:
# Columns used as clustering features.
features = ['air_pressure', 'air_temp', 'avg_wind_direction', 'avg_wind_speed', 'max_wind_direction', 
        'max_wind_speed','relative_humidity']

In [17]:
select_df = sampled_df[features]

In [18]:
select_df.columns

Index(['air_pressure', 'air_temp', 'avg_wind_direction', 'avg_wind_speed',
       'max_wind_direction', 'max_wind_speed', 'relative_humidity'],
      dtype='object')

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

Scale the Features using StandardScaler
<br><br></p>


In [19]:
X = StandardScaler().fit_transform(select_df)
X

array([[-1.48456281,  0.24544455, -0.68385323, ..., -0.62153592,
        -0.74440309,  0.49233835],
       [-1.48456281,  0.03247142, -0.19055941, ...,  0.03826701,
        -0.66171726, -0.34710804],
       [-1.51733167,  0.12374562, -0.65236639, ..., -0.44847286,
        -0.37231683,  0.40839371],
       ...,
       [-0.30488381,  1.15818654,  1.90856325, ...,  2.0393087 ,
        -0.70306017,  0.01538018],
       [-0.30488381,  1.12776181,  2.06599745, ..., -1.67073075,
        -0.74440309, -0.04948614],
       [-0.30488381,  1.09733708, -1.63895404, ..., -1.55174989,
        -0.62037434, -0.05711747]])
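`StandardScaler` rescales each column to zero mean and unit variance, so that variables measured on large scales (such as air pressure) do not dominate the distance computation in k-means. As a sketch on a few made-up rows (the `values` array is illustrative), the transform is equivalent to subtracting the column mean and dividing by the population standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows with two columns on very different scales (pressure-like, temperature-like).
values = np.array([
    [912.3, 64.76],
    [915.0, 70.00],
    [918.7, 52.70],
    [920.1, 61.00],
])

scaled = StandardScaler().fit_transform(values)

# Manual version: z = (x - mean) / std, with the population std (ddof=0, NumPy's default).
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```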

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

Implementing k-Means Clustering
<br><br></p>


In [24]:
# Set the number of clusters; n_init is the number of runs with different centroid seeds.
kmeans = KMeans(n_clusters=12, n_init = 8)
# Fit the model to the scaled features.
model = kmeans.fit(X)
print("model\n", model)

model
 KMeans(n_clusters=12, n_init=8)
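Printing the fitted estimator only echoes its parameters; the results of the fit live in attributes such as `cluster_centers_`, `labels_`, and `inertia_`. The sketch below uses a small synthetic data set rather than the weather data (`points`, `blob_centers`, and `toy_model` are illustrative names), and sets `random_state`, which the notebook does not, so that repeated runs give the same result:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Three well-separated 2-D blobs of 50 points each (synthetic data).
blob_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers])

# random_state fixes the centroid initialization so the fit is repeatable.
toy_model = KMeans(n_clusters=3, n_init=8, random_state=0).fit(points)

# One row per cluster, one column per feature.
print(toy_model.cluster_centers_.shape)
# Cluster index assigned to each input row.
print(toy_model.labels_[:5])
# Sum of squared distances from each point to its nearest center.
print(toy_model.inertia_)
```

For the weather model, the same attributes on `model` would hold 12 centers with one coordinate per selected feature, expressed in the scaled units produced by `StandardScaler`.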


<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>


What are the centers of the 12 clusters we formed?
<br><br></p>