# Lab-Assignment

Nigel Sjölin Grech MA661E - VT2021

## Wind Power Forecasting 


### TODO:
- [X] 1.1 Data Preparation
    - [X] 1.1.1 Reading Data 
    - [ ] 1.1.2 Explaining Column Names
- [ ] 1.2. Manipulating Data
    - [ ] 1.2.1 Finding and handling Missing Values
        - [ ] 1.2.1.1 Visualizing Missing Data
        - [ ] 1.2.1.2 Experimenting with different fill strategies
        - [ ] 1.2.1.3 Removing rows and cols to deal with nulls
    - [ ] 1.2.2 ~~Converting Categorical Data~~ (no categorical data in the data set)
    - [ ] 1.2.3 Aggregation of Data
        - [ ] 1.2.3.1 Yearly grouping
        - [ ] 1.2.3.2 Monthly grouping
        - [ ] 1.2.3.3 Lagging
- [ ] 2 Exploring Data
    - [ ] 2.1 Analyzing the feasibility of values
    - [ ] 2.2 Univariate data analysis
        - [ ] 2.2.1 Box plots and handling of outliers
            - [ ] 2.2.1.1 Box plots on groupings
    - [ ] 2.3 Bivariate data analysis (some of these can be replaced with time series techniques)
        - [ ] 2.3.1 Heat maps
        - [ ] 2.3.2 Scatter plots
        - [ ] 2.3.3 Joint distribution plots with regression fit
        - [ ] 2.3.4 Plotting Category dependencies
    - [ ] 2.4 Hypothesis testing (t-test)
- [ ] 3 Clustering
    - [ ] 3.1 Identify number of clusters
    - [ ] 3.2 Clustering with k-means
- [ ] 4 Dimensionality reduction
    - [ ] 4.1 Reduce dims with PCA
    - [ ] 4.2 Scatter plot of reduced dims
    - [ ] 4.3 Cluster labeling

## Imports

- os - operating system functions, used to build platform-independent paths
- numpy - numerical helpers, used for rounding when summarizing missing values
- pandas - for data manipulation
- missingno - library for visualizing missing data


In [1]:
import os

import missingno as msno
import numpy as np
import pandas as pd

pd.set_option('display.max_columns', 50)

## 1. Data Preparation

### 1.1 Reading data 

The data is read with pandas' read_csv method, and the path is built with os.path.join so that it works across platforms. Displaying the head and tail of the data gives a first observation: the first rows contain only the time stamp and the turbine ID, while the last rows are fully populated. This may indicate inconsistent data collection at the beginning of the recording period.

In [2]:
raw_turbine_path = os.path.join(os.pardir, 'data', 'Turbine_Data.csv')
raw_turbine_data = pd.read_csv(raw_turbine_path)

In [3]:
raw_turbine_data.head()

Unnamed: 0.1,Unnamed: 0,ActivePower,AmbientTemperatue,BearingShaftTemperature,Blade1PitchAngle,Blade2PitchAngle,Blade3PitchAngle,ControlBoxTemperature,GearboxBearingTemperature,GearboxOilTemperature,GeneratorRPM,GeneratorWinding1Temperature,GeneratorWinding2Temperature,HubTemperature,MainBoxTemperature,NacellePosition,ReactivePower,RotorRPM,TurbineStatus,WTG,WindDirection,WindSpeed
0,2017-12-31 00:00:00+00:00,,,,,,,,,,,,,,,,,,,G01,,
1,2017-12-31 00:10:00+00:00,,,,,,,,,,,,,,,,,,,G01,,
2,2017-12-31 00:20:00+00:00,,,,,,,,,,,,,,,,,,,G01,,
3,2017-12-31 00:30:00+00:00,,,,,,,,,,,,,,,,,,,G01,,
4,2017-12-31 00:40:00+00:00,,,,,,,,,,,,,,,,,,,G01,,


In [4]:
raw_turbine_data.tail()

Unnamed: 0.1,Unnamed: 0,ActivePower,AmbientTemperatue,BearingShaftTemperature,Blade1PitchAngle,Blade2PitchAngle,Blade3PitchAngle,ControlBoxTemperature,GearboxBearingTemperature,GearboxOilTemperature,GeneratorRPM,GeneratorWinding1Temperature,GeneratorWinding2Temperature,HubTemperature,MainBoxTemperature,NacellePosition,ReactivePower,RotorRPM,TurbineStatus,WTG,WindDirection,WindSpeed
118219,2020-03-30 23:10:00+00:00,70.044465,27.523741,45.711129,1.515669,1.950088,1.950088,0.0,59.821165,55.193793,1029.870744,59.060367,58.148777,39.008931,36.476562,178.0,13.775785,9.234004,2.0,G01,178.0,3.533445
118220,2020-03-30 23:20:00+00:00,40.833474,27.602882,45.598573,1.702809,2.136732,2.136732,0.0,59.142038,54.798545,1030.160478,58.452003,57.550367,39.006759,36.328125,178.0,8.088928,9.22937,2.0,G01,178.0,3.261231
118221,2020-03-30 23:30:00+00:00,20.77779,27.560925,45.462045,1.706214,2.139664,2.139664,0.0,58.439439,54.380456,1030.137822,58.034071,57.099335,39.003815,36.131944,178.0,4.355978,9.236802,2.0,G01,178.0,3.331839
118222,2020-03-30 23:40:00+00:00,62.091039,27.810472,45.343827,1.575352,2.009781,2.009781,0.0,58.205413,54.079014,1030.178178,57.795387,56.847239,39.003815,36.007805,190.0,12.018077,9.237374,2.0,G01,190.0,3.284468
118223,2020-03-30 23:50:00+00:00,68.664425,27.915828,45.23161,1.499323,1.933124,1.933124,0.0,58.581716,54.080505,1029.834789,57.694813,56.74104,39.003815,35.914062,203.0,14.439669,9.235532,2.0,G01,203.0,3.475205
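
The head and tail only show ten rows, so they cannot tell where each sensor actually starts reporting. A minimal sketch of one way to check this, using pandas' `first_valid_index`, is given below. The `toy` frame and the helper name `first_valid_rows` are assumptions made for illustration; they stand in for `raw_turbine_data` and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for raw_turbine_data: two sensor columns that
# start reporting at different rows, plus an always-filled turbine ID.
toy = pd.DataFrame({
    'ActivePower': [np.nan, np.nan, 10.0, 12.0, np.nan, 9.0],
    'WindSpeed': [np.nan, np.nan, np.nan, 3.1, 3.3, 3.2],
    'WTG': ['G01'] * 6,
})


def first_valid_rows(df):
    """Return, for every column, the row label of its first non-null value."""
    return pd.Series({col: df[col].first_valid_index() for col in df.columns})


first_rows = first_valid_rows(toy)
print(first_rows)
```

Applied to the real data, `first_valid_rows(raw_turbine_data)` would list the first populated row for each column, which helps decide how many leading rows could be dropped.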


#### Explaining the column names 

| Column Name                  | Description                                                                                |
|------------------------------|--------------------------------------------------------------------------------------------|
| Unnamed: 0 (time stamp)      | Time stamp of the recording, at 10-minute intervals, from 31 Dec 2017 to 30 March 2020     |
| ActivePower                  | The power generated by the turbine                                                         |
| AmbientTemperatue            | The ambient temperature around the turbine (the column name is misspelled in the data set) |
| BearingShaftTemperature      | The temperature of the turbine's bearing shaft                                             |
| Blade1PitchAngle             | The pitch angle for the turbine's blade 1                                                  |
| Blade2PitchAngle             | The pitch angle for the turbine's blade 2                                                  |
| Blade3PitchAngle             | The pitch angle for the turbine's blade 3                                                  |
| ControlBoxTemperature        | The temperature of the turbine's control box                                               |
| GearboxBearingTemperature    | The temperature of the turbine's gearbox bearing                                           |
| GearboxOilTemperature        | The temperature of the turbine's gearbox oil                                               |
| GeneratorRPM                 | The rotational speed of the turbine's generator, in revolutions per minute                 |
| GeneratorWinding1Temperature | The temperature of the generator's winding 1                                               |
| GeneratorWinding2Temperature | The temperature of the generator's winding 2                                               |
| HubTemperature               | The temperature of the turbine's hub                                                       |
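
Because the time stamp lives in an unnamed column and is read as plain text, later time-based grouping will need it as a proper datetime index. The following is a sketch of one way to do that, assuming the column is called `Unnamed: 0` as in the output above. The name `Timestamp` and the `toy` frame are chosen here for illustration and do not come from the data set.

```python
import pandas as pd

# Hypothetical rows mimicking the raw file's unnamed time stamp column.
toy = pd.DataFrame({
    'Unnamed: 0': [
        '2017-12-31 00:00:00+00:00',
        '2017-12-31 00:10:00+00:00',
        '2017-12-31 00:20:00+00:00',
    ],
    'ActivePower': [None, 70.0, 40.8],
})

# 'Timestamp' is a name picked for this sketch, not one from the data set.
indexed = (
    toy.rename(columns={'Unnamed: 0': 'Timestamp'})
       .assign(Timestamp=lambda d: pd.to_datetime(d['Timestamp'], utc=True))
       .set_index('Timestamp')
)
print(indexed.index.dtype)
```

An alternative would be to pass `index_col` and `parse_dates` to `pd.read_csv` directly; the explicit version above keeps the raw frame untouched for comparison.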

In [5]:
# df.info() reports similar information, but this table is easier to read
info_df = (
    pd.concat([raw_turbine_data.dtypes, raw_turbine_data.count(), raw_turbine_data.isna().sum()], axis=1)
    .reset_index()
    .rename(columns={'index': 'feature', 0: 'dtype', 1: '# values', 2: '# na'})
)
# share of missing values per column, rounded up to the next whole percent
info_df['% missing'] = np.ceil((info_df['# na'] * 100) / len(raw_turbine_data))
info_df

Unnamed: 0,feature,dtype,# values,# na,% missing
0,Unnamed: 0,object,118224,0,0.0
1,ActivePower,float64,94750,23474,20.0
2,AmbientTemperatue,float64,93817,24407,21.0
3,BearingShaftTemperature,float64,62518,55706,48.0
4,Blade1PitchAngle,float64,41996,76228,65.0
5,Blade2PitchAngle,float64,41891,76333,65.0
6,Blade3PitchAngle,float64,41891,76333,65.0
7,ControlBoxTemperature,float64,62160,56064,48.0
8,GearboxBearingTemperature,float64,62540,55684,48.0
9,GearboxOilTemperature,float64,62438,55786,48.0


In [9]:
msno.matrix(raw_turbine_data)
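
The matrix plot shows where values are missing but not how much is missing in a given period. A possible numeric companion is sketched below: the share of missing values per column and calendar month, using pandas only. It assumes the frame already has a datetime index; the `toy` data and the helper name `missing_share_by_month` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical 10-minute series spanning two months: two rows fall in
# January 2018 and four rows in February 2018.
idx = pd.date_range('2018-01-31 23:40', periods=6, freq='10min', tz='UTC')
toy = pd.DataFrame(
    {
        'ActivePower': [np.nan, np.nan, 5.0, 6.0, 7.0, np.nan],
        'WindSpeed': [np.nan, 3.0, 3.1, 3.2, 3.3, 3.4],
    },
    index=idx,
)


def missing_share_by_month(df):
    """Fraction of null values per column, grouped by calendar month."""
    month = df.index.strftime('%Y-%m')
    return df.isna().groupby(month).mean()


share = missing_share_by_month(toy)
print(share)
```

A table like this makes it easier to check whether the gaps are concentrated at the start of the recording period, as the head of the data suggests.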