<p style="font-family: Arial; font-size:3.75vw;color:purple; font-style:bold"><br>
Clustering Exercise Notebook
</p><br>

# Exercise Notebook Instructions

### 1. Important: Only modify the cells which instruct you to modify them - leave "do not modify" cells alone.  

The code which tests your responses assumes you have run the startup/read-only code exactly.

### 2. Work through the notebook in order.

Some steps depend on earlier ones, so move through the notebook in order.

### 3. It is okay to use NumPy library functions.

You may find that some of these questions are fairly straightforward to answer using built-in NumPy functions.  That's totally okay - part of the point of these exercises is to familiarize you with the commonly used NumPy functions.

### 4. Seek help if stuck

If you get stuck, don't worry!  You can either review the videos/notebooks from this week, ask in the course forums, or look to the solutions for the correct answer.  BUT, be careful about looking to the solutions too quickly.  Struggling to get the right answer is an important part of the learning process.

In [2]:
# DO NOT MODIFY

# import appropriate libraries

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import pandas as pd
import numpy as np

In [3]:
# DO NOT MODIFY

# We will use the minute weather dataset for this exercise.

def get_data():
    return pd.read_csv('../weather/minute_weather.csv')

df = get_data()

<p style="font-family: Arial; font-size:2.75vw;color:purple; font-style:bold"><br>

Exercise 1: Sampling Down a Time Series Dataset<br><br></p>


In the cell below, modify the function to RETURN a new dataframe that is 
sampled down by keeping only every k-th row. For example, if k = 5, the 
function keeps every 5th row and skips the rows in between.

The inputs to the function are a dataframe and an integer k.

In [6]:
# modify this cell

def down_sample(df, k):
    ### BEGIN SOLUTION
    new_df = df[(df['rowID'] % k) == 0]
    return new_df
    ### END SOLUTION
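
The solution above filters on the `rowID` column, which works when `rowID` counts rows contiguously from 0. As a minimal sketch of an alternative, positional slicing with `iloc` gives the same result without relying on that column. The toy dataframe and the helper name `down_sample_by_position` are hypothetical and used only for illustration.

```python
import pandas as pd

def down_sample_by_position(df, k):
    """Keep the rows at positions 0, k, 2k, ... regardless of rowID values."""
    return df.iloc[::k]

# Hypothetical toy data with a contiguous rowID, standing in for the weather file
toy = pd.DataFrame({"rowID": range(12), "air_temp": range(60, 72)})

by_position = down_sample_by_position(toy, 5)
by_row_id = toy[toy["rowID"] % 5 == 0]  # same approach as the solution above

print(by_position["rowID"].tolist())
print(by_row_id["rowID"].tolist())
```

The two approaches agree here because `rowID` has no gaps. If rows had been removed before sampling, the `rowID` filter and the positional slice could select different rows.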

In [7]:
# DO NOT MODIFY
ans = 15873

try:
    sampled_df = down_sample(df, 100)
    assert sampled_df.shape[0] == ans
except AssertionError as e: print("Try again, your output did not match the expected answer above")

<p style="font-family: Arial; font-size:2.75vw;color:purple; font-style:bold"><br>

Exercise 2: Data Cleaning<br><br></p>


In the cell below, modify the function to do the following tasks and RETURN a new dataframe (do not modify the input dataframe): <br><br>

- delete column 'rain_accumulation'
- delete column 'rain_duration'
- delete rows that contain at least one NULL

In [15]:
# modify this cell

def clean_data(df):
    ### BEGIN SOLUTION
    df_copy = df.copy()
    del df_copy['rain_accumulation']
    del df_copy['rain_duration']
    return df_copy.dropna()
    ### END SOLUTION
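
The order of the steps matters: dropping the two rain columns first means a row is removed only if it has a NULL in one of the remaining columns. The sketch below illustrates this on a hypothetical toy dataframe, using `drop(columns=...)` in place of `del`. The helper name `clean_data_sketch` is an assumption made for this example.

```python
import numpy as np
import pandas as pd

def clean_data_sketch(df, drop_cols=("rain_accumulation", "rain_duration")):
    """Return a cleaned copy; drop() and dropna() return new objects by default."""
    return df.drop(columns=list(drop_cols)).dropna()

# Hypothetical toy data: one NaN in a kept column, one NaN in a dropped column
toy = pd.DataFrame({
    "air_temp": [60.0, np.nan, 62.0],
    "rain_accumulation": [0.0, 0.0, np.nan],
    "rain_duration": [0.0, 0.0, 0.0],
})

cleaned = clean_data_sketch(toy)
print(cleaned.shape)  # (2, 1): only the row with a NaN in air_temp is removed
print(toy.shape)      # (3, 3): the input dataframe is left unchanged
```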

In [16]:
# DO NOT MODIFY

try: 
    cleaned_df = clean_data(sampled_df)
    assert sampled_df.shape == (15873, 13)
    assert cleaned_df.shape == (15870, 11)
except AssertionError as e: print("Try again! Check both clean_data and down_sample functions")

In [17]:
cleaned_df.describe()

| | rowID | air_pressure | air_temp | avg_wind_direction | avg_wind_speed | max_wind_direction | max_wind_speed | min_wind_direction | min_wind_speed | relative_humidity |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 15870.0 | 15870.0 | 15870.0 | 15870.0 | 15870.0 | 15870.0 | 15870.0 | 15870.0 | 15870.0 | 15870.0 |
| mean | 793624.0 | 916.829464 | 61.857384 | 161.287524 | 2.792804 | 162.700945 | 3.414625 | 166.644297 | 2.152268 | 47.591084 |
| std | 458219.1 | 3.051662 | 11.834858 | 95.313161 | 2.070506 | 92.269601 | 2.428906 | 97.824836 | 1.758114 | 26.18968 |
| min | 0.0 | 905.1 | 32.36 | 0.0 | 0.1 | 0.0 | 0.1 | 0.0 | 0.0 | 1.5 |
| 25% | 396825.0 | 914.8 | 52.7 | 61.0 | 1.3 | 68.0 | 1.6 | 75.0 | 0.8 | 24.7 |
| 50% | 793650.0 | 916.7 | 62.42 | 182.0 | 2.2 | 186.0 | 2.8 | 180.0 | 1.7 | 44.6 |
| 75% | 1190375.0 | 918.7 | 70.88 | 217.0 | 3.8 | 223.0 | 4.6 | 212.0 | 3.0 | 67.9 |
| max | 1587200.0 | 929.4 | 96.44 | 359.0 | 20.1 | 359.0 | 20.9 | 359.0 | 19.5 | 92.9 |


<p style="font-family: Arial; font-size:2.75vw;color:purple; font-style:bold"><br>

Exercise 3: Perform Scaling of input data<br><br></p>

In the cell below, modify the function that takes two inputs: a dataframe D and a list of features F.
<br>
The function should do the following:
<br>
- Pick out columns corresponding to features F from dataframe D
- Use StandardScaler to perform scaling and return the scaled data frame

In [18]:
# modify this cell

def scale_task(D, F):
    ### BEGIN SOLUTION
    df_copy = D.copy()
    df_copy = df_copy[F]
    X = StandardScaler().fit_transform(df_copy)  
    return pd.DataFrame(X, columns=df_copy.columns)
    ### END SOLUTION
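
One detail worth seeing in isolation: `StandardScaler` normalizes each column with the population standard deviation (`ddof=0`), while pandas' `std()` defaults to the sample standard deviation (`ddof=1`). This is why the check for this exercise compares the standard deviation to 1 with a tolerance rather than exactly. The sketch below uses a hypothetical four-row dataframe so the difference is easy to see.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with two of the weather feature names
toy = pd.DataFrame({
    "air_pressure": [910.0, 915.0, 920.0, 925.0],
    "air_temp": [50.0, 60.0, 70.0, 80.0],
})

scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

pop_std = scaled.std(axis=0, ddof=0)     # population std: what StandardScaler sets to 1
sample_std = scaled.std(axis=0, ddof=1)  # pandas default: sqrt(n / (n - 1)) times larger

print(scaled.mean(axis=0))
print(pop_std)
print(sample_std)
```

With only four rows the gap between the two is large. With thousands of rows, as in the cleaned weather data, the ratio `sqrt(n / (n - 1))` is very close to 1.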

In [19]:
# DO NOT MODIFY

try: 
    features = ['air_pressure', 'air_temp', 'relative_humidity']
    scaled_data = scale_task(cleaned_df, features)
    assert np.allclose(scaled_data.mean(axis=0), 0)
    assert np.allclose(scaled_data.std(axis=0),1, atol=1e-04)
except AssertionError as e: print("Try again, your solution did not produce the expected output above")

<p style="font-family: Arial; font-size:2.75vw;color:purple; font-style:bold"><br>

Exercise 4: Perform Clustering<br><br></p>

In the cell below, modify the function that takes two inputs: a dataframe D (already scaled using your previous function) and an integer N, the number of clusters to create.
<br>
The function should do the following:
<br>
- Perform clustering using your algorithm of choice
- Create N clusters
- Return a **dataframe** of N rows, where each row is a cluster center.

In [20]:
# modify this cell

def clustering_task(D, N):
    ### BEGIN SOLUTION
    kmeans = KMeans(n_clusters=N)
    model  = kmeans.fit(D)
    return pd.DataFrame(model.cluster_centers_, columns=D.columns)
    ### END SOLUTION
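
K-means starts from randomly chosen initial centers, so the solution above can return different centers (or the same centers in a different row order) from run to run. As a minimal sketch, assuming reproducible output is wanted, `random_state` and `n_init` can be set explicitly. The helper name `clustering_sketch`, the `seed` parameter, and the synthetic two-blob data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def clustering_sketch(D, N, seed=0):
    """Fit K-means with a fixed seed and return the N cluster centers as a dataframe."""
    kmeans = KMeans(n_clusters=N, n_init=10, random_state=seed).fit(D)
    return pd.DataFrame(kmeans.cluster_centers_, columns=D.columns)

# Hypothetical data: two well-separated blobs centered near -5 and +5
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    np.vstack([
        rng.normal(loc=-5.0, scale=0.5, size=(50, 2)),
        rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
    ]),
    columns=["air_pressure", "air_temp"],
)

centers_a = clustering_sketch(toy, 2)
centers_b = clustering_sketch(toy, 2)  # same seed, so the same centers are expected

print(centers_a)
```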

In [21]:
# DO NOT MODIFY

try: 
    ct = clustering_task(scaled_data, 3)
    assert ct.shape == (3, 3)
except AssertionError as e: print("Keep trying - ensure all previous functions also work correctly")

try:
    print(40*'-')
    print(ct.describe())
except AttributeError as e: print("Ensure clustering_task returns a Pandas dataframe")

----------------------------------------
       air_pressure  air_temp  relative_humidity
count      3.000000  3.000000           3.000000
mean       0.138806 -0.071897          -0.052474
std        0.891352  0.905230           1.043905
min       -0.449104 -0.970887          -0.853360
25%       -0.373985 -0.527568          -0.642783
50%       -0.298866 -0.084250          -0.432207
75%        0.432762  0.377598           0.347968
max        1.164389  0.839446           1.128144
