<a href='https://ai.meng.duke.edu'><img align="left" style="padding-top:10px;" src="https://storage.googleapis.com/aipi_datasets/Duke-AIPI-Logo.png"></a>

# Univariate Feature Selection

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression,f_classif
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings("ignore")

In [None]:
# Run this before any other code cell
# This downloads the csv data files into the same directory where you have saved this notebook

import urllib.request
from pathlib import Path
import os
path = Path()

# Dictionary of file names and download links
files = {'CCPP_data.csv':'https://storage.googleapis.com/aipi_datasets/CCPP_data.csv'}

# Download each file
for key,value in files.items():
    filename = path/key
    url = value
    # If the file does not already exist in the directory, download it
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url,filename)

## Regression Feature Selection
Data available at https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant

The dataset contains data collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the plant was set to work at full load. The features are hourly averages of four ambient variables, Ambient Temperature (AT), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V), which are used to predict the net hourly electrical energy output (PE) of the plant.
A combined cycle power plant (CCPP) is composed of gas turbines (GT), steam turbines (ST) and heat recovery steam generators. In a CCPP, the electricity is generated by gas and steam turbines, which are combined in one cycle, and is transferred from one turbine to another. The Exhaust Vacuum is collected from, and affects, the steam turbine, while the other three ambient variables affect GT performance.

The columns in the data file are:
- AT: Ambient Temperature in °C
- AP: Ambient Pressure in millibar
- RH: Relative Humidity in %
- V: Exhaust Vacuum in cm Hg
- PE (target): Net hourly electrical energy output in MW

The averages are taken from various sensors located around the plant that record the ambient variables every second. The variables are given without normalization.

In [None]:
# Read in the data
ccpp_data = pd.read_csv('CCPP_data.csv')
ccpp_data.head()

In [None]:
# Create feature matrix X and target array y
X = ccpp_data.drop('PE',axis=1)
y = ccpp_data['PE']

# Split data into training and test sets
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)
print("Shape of X_train, y_train:",X_train.shape,y_train.shape)
print("Shape of X_test, y_test:",X_test.shape,y_test.shape)

In [None]:
# Display a pairplot to look at relationships between variables
# (sns.pairplot creates its own figure, so a separate plt.figure() call would only add an empty one)
sns.pairplot(data=pd.concat([X_train,y_train],axis=1),diag_kind='kde')
plt.show()

### Univariate feature importance

In [None]:
# Score each continuous feature with a univariate F-test (score_func=f_regression),
# which is derived from the feature's Pearson correlation with the target
ftest = SelectKBest(score_func=f_regression, k='all')
ftest.fit(X_train,y_train)
# Collect the feature names and their F-scores in one table, highest score first
f_scores = pd.DataFrame({'Feature': X_train.columns, 'F-Score': ftest.scores_})
f_scores = f_scores.sort_values(by='F-Score',ascending=False)
f_scores
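
For a single feature, the F-statistic returned by `f_regression` is a monotone function of the squared Pearson correlation between that feature and the target: F = r² / (1 − r²) × (n − 2). Ranking features by F-score is therefore equivalent to ranking them by |r|. The sketch below checks this relationship on synthetic data; the arrays `X_demo` and `y_demo` are made up for illustration and are not the CCPP data.

```python
import numpy as np
from sklearn.feature_selection import f_regression

# Synthetic data (hypothetical, for illustration only): 200 samples, 3 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(size=200)

# F-statistics and p-values as computed by scikit-learn
f_sklearn, p_values = f_regression(X_demo, y_demo)

# The same statistics derived by hand from Pearson's r
n = len(y_demo)
r = np.array([np.corrcoef(X_demo[:, j], y_demo)[0, 1] for j in range(X_demo.shape[1])])
f_manual = r**2 / (1 - r**2) * (n - 2)

# True if the two calculations agree
print(np.allclose(f_sklearn, f_manual))
```

Because the score depends only on r², it captures linear relationships; a feature with a strong but non-linear effect on the target can still receive a low F-score.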

In [None]:
# Plot scores
plt.figure(figsize=(15,5))
plt.bar(x=f_scores['Feature'],height=f_scores['F-Score'])
plt.xticks(rotation=90)
plt.title('F-score of each feature')
plt.show()
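
The cells above use `k='all'` so that every feature is scored and none is dropped. To actually reduce the feature set, `SelectKBest` can be given an integer `k`. A minimal sketch on synthetic data follows; the DataFrame `X_demo`, its column names and the choice `k=2` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (hypothetical): only columns 'a' and 'b' drive the target
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 4)), columns=['a', 'b', 'c', 'd'])
y_demo = 3.0 * X_demo['a'] - 2.0 * X_demo['b'] + rng.normal(size=300)

# Keep the k=2 features with the highest F-scores
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X_demo, y_demo)

# get_support() returns a boolean mask over the original columns
selected_features = list(X_demo.columns[selector.get_support()])
print(selected_features, X_selected.shape)
```

In a train/test workflow, the selector would be fitted on the training set only and then applied to the test set with `selector.transform`, so that the test data does not influence which features are kept.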

### Correlation

In [None]:
# Examine correlations between variables using correlation matrix
plt.figure(figsize=(10,8))
train_data = pd.concat([X_train,y_train],axis=1)
cm = train_data.corr(method='pearson')
sns.heatmap(cm, annot=True, cmap = 'RdBu_r',linewidth=0.5,square=True)
plt.show()
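
A heatmap is convenient for a handful of features, but the same correlation matrix can also be scanned programmatically for strongly correlated pairs, which may carry redundant information. The sketch below uses synthetic data; the column names and the cutoff of 0.8 are illustrative assumptions, not recommended values.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical): 'x2' is nearly a copy of 'x1', 'x3' is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
demo = pd.DataFrame({
    'x1': x1,
    'x2': x1 + 0.1 * rng.normal(size=500),
    'x3': rng.normal(size=500),
})

# Absolute Pearson correlations between all pairs of columns
corr = demo.corr(method='pearson').abs()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.8  # illustrative cutoff, an assumption rather than a rule
pairs = upper.stack()
high_pairs = pairs[pairs > threshold]
print(high_pairs)
```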

## Classification Feature Selection

In [None]:
# Load the iris dataset using a helper function in Seaborn
iris = sns.load_dataset('iris')
iris.head()

### To-do
Now it's your turn. In the cell below, do the following:
- Split the data into the input features X and target array y.
- Next, split X and y into training and test sets, using 80% of the data for training and 20% for testing. Be sure to set `random_state=0`.
- Perform a univariate feature selection analysis on the input features in X_train. Use an ANOVA F-test (`score_func=f_classif`). Calculate the F-score for each feature, and then plot a bar chart of the scores.

In [None]:
### BEGIN SOLUTION ###



### END SOLUTION ###