<h1> <center> ENSF 519.01 Applied Data Science </center></h1>
<h2> <center> Sample questions on feature engineering and model evaluation</center></h2>



<h1>Part A. Feature Selection </h1>
<br>

In this section, we are going to select the most informative features from the given NASA dataset.


<h2>Part A.1. Feature selection using ANOVA </h2>

Steps:
    
- Read data from the NASA.csv
- Using ANOVA, select the top K features, with K = 1..10
- Build LogisticRegression models: one on the original data, and one for each K (using the selected subset of features)
- Compare the accuracies and find the best K (based on the median of 30 runs with random_state = 0 to 29)
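
To make the ANOVA step concrete, here is a minimal sketch of how `SelectKBest` ranks features by their one-way ANOVA F-statistic (`f_classif`, which is also its default score function). The data is a synthetic stand-in generated with `make_classification`, not the NASA dataset, and `k=3` is chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data (assumption: numeric features and a class label)
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0,
    random_state=0
)

# f_classif computes the one-way ANOVA F-statistic of each feature against the label
selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)

# Rank the features by F-score, highest first, and keep the top 3 indices
ranking = np.argsort(selector.scores_)[::-1]
top_k = sorted(ranking[:3].tolist())

# Indices that SelectKBest actually kept
selected = np.flatnonzero(selector.get_support()).tolist()

print("F-scores:", np.round(selector.scores_, 2))
print("Selected feature indices:", selected)
```

The features kept by the selector are exactly the ones with the highest F-scores, so sweeping K from 1 to 10 simply extends this ranking one feature at a time.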


In [None]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest

## Part A.1 ANOVA 
nasa_data = pd.read_csv("NasaData.csv")
features = nasa_data.drop(columns="label")
labels = nasa_data["label"]

scores = {"log_reg_{}".format(i):[] for i in range(0, 11)}
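# Baseline: logistic regression on all features.
# Note: this split uses the default test_size (0.25), while the loop below uses test_size=0.2.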
for split_seed in range(0, 30):
    X_train, X_test, y_train, y_test = train_test_split(
        features,
        labels,
        random_state=split_seed
    )
    
    lr = LogisticRegression(random_state=0).fit(X_train, y_train)
    scores["log_reg_0"].append(lr.score(X_test, y_test))

for i in range(1, 11):
    for split_seed in range(0, 30):
        X_train, X_test, y_train, y_test = train_test_split(
            features,
            labels,
            test_size=0.2,
            random_state=split_seed
        )
        
        # Select K best features using ANOVA (f_classif is SelectKBest's default score function)
        k_best = SelectKBest(k=i)
        k_best.fit(X_train, y_train)
        
        X_train = k_best.transform(X_train)
        X_test = k_best.transform(X_test)
    
        lr = LogisticRegression(random_state=0).fit(X_train, y_train)
        scores["log_reg_{}".format(i)].append(lr.score(X_test, y_test))

print(scores)

In [13]:
import numpy as np

for i in scores.keys():
    print("The median accuracy of {} is {}".format(i, np.median(scores[i])))

The median accuracy of log_reg_0 is 0.7869989722507709
The median accuracy of log_reg_1 is 0.7861271676300579
The median accuracy of log_reg_2 is 0.7874116891457932
The median accuracy of log_reg_3 is 0.7861271676300579
The median accuracy of log_reg_4 is 0.7864482980089917
The median accuracy of log_reg_5 is 0.7864482980089917
The median accuracy of log_reg_6 is 0.7861271676300579
The median accuracy of log_reg_7 is 0.7867694283879255
The median accuracy of log_reg_8 is 0.7861271676300579
The median accuracy of log_reg_9 is 0.7854849068721901
The median accuracy of log_reg_10 is 0.7854849068721901
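
The task asks for the best K, which can be read off the medians programmatically. The sketch below uses a small hypothetical dictionary (`example_scores`) in place of the real `scores` dict; the selection rule is simply the largest median.

```python
import numpy as np

# Hypothetical accuracy lists standing in for the `scores` dict built above
example_scores = {
    "log_reg_0": [0.78, 0.79, 0.80],
    "log_reg_1": [0.77, 0.78, 0.79],
    "log_reg_2": [0.79, 0.80, 0.81],
}

# Median accuracy per configuration
medians = {name: float(np.median(vals)) for name, vals in example_scores.items()}

# Configuration with the highest median accuracy
best_name = max(medians, key=medians.get)
print("Best configuration: {} (median accuracy {:.4f})".format(best_name, medians[best_name]))
```

Applied to the medians printed above, this rule would pick `log_reg_2` (K=2), although the gaps between configurations are only in the third or fourth decimal place, so the ranking should be read with caution.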


<h2>Part A.2. Compare feature selection models </h2>

Now apply SelectFromModel and RFE and compare them with SelectKBest, as follows:

- Apply the three techniques so that each reduces the data to only 6 features (note that 6 is not necessarily the best K from Part A.1)
- Report the prediction scores of a LogisticRegression model on the features selected by each technique.
- Print the names of the features selected by each technique.


In [32]:
##  Part A.2 Compare feature selection models 
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(
    features,
    labels,
    random_state=0
)

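# Create select from model feature selector. The threshold is relative to the
# mean feature importance, so it does not by itself guarantee exactly 6 features.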
select = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="1.25*mean")
select.fit(X_train, y_train)
select_X_train = select.transform(X_train)
select_X_test = select.transform(X_test)
print(LogisticRegression(random_state=0).fit(select_X_train, y_train).score(select_X_test, y_test))

# Create RFE feature selector
rfe_select = RFE(
    RandomForestClassifier(n_estimators=100, random_state=0),
    n_features_to_select=6
)
rfe_select.fit(X_train, y_train)
rfe_select_X_train = rfe_select.transform(X_train)
rfe_select_X_test = rfe_select.transform(X_test)
print(LogisticRegression(random_state=0).fit(rfe_select_X_train, y_train).score(rfe_select_X_test, y_test))

# Select 6 best features using ANOVA
k_best = SelectKBest(k=6)
k_best.fit(X_train, y_train)
k_best_X_train = k_best.transform(X_train)
k_best_X_test = k_best.transform(X_test)
print(LogisticRegression(random_state=0).fit(k_best_X_train, y_train).score(k_best_X_test, y_test))

0.7929085303186023
0.7934224049331963
0.7959917780061665


In [31]:
mask = rfe_select.get_support()
print("Features selected: **{}**".format(list(features.columns[mask])))

Features selected: **['LOC_BLANK', 'BRANCH_COUNT', 'LOC_CODE_AND_COMMENT', 'LOC_COMMENTS', 'CYCLOMATIC_COMPLEXITY', 'DESIGN_COMPLEXITY', 'ESSENTIAL_COMPLEXITY', 'LOC_EXECUTABLE', 'HALSTEAD_CONTENT', 'HALSTEAD_DIFFICULTY', 'HALSTEAD_EFFORT', 'HALSTEAD_ERROR_EST', 'HALSTEAD_LENGTH', 'HALSTEAD_LEVEL', 'HALSTEAD_PROG_TIME', 'HALSTEAD_VOLUME', 'NUM_OPERANDS', 'NUM_OPERATORS', 'NUM_UNIQUE_OPERANDS', 'NUM_UNIQUE_OPERATORS', 'LOC_TOTAL']**
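
The task asks for the feature names chosen by each of the three selectors. A minimal sketch of one way to do that follows, on synthetic data with hypothetical column names `f0`..`f11` rather than the NASA columns. It also assumes a different `SelectFromModel` configuration from the cell above: `threshold=-np.inf` combined with `max_features=6`, which selects exactly the 6 most important features instead of relying on a tuned threshold.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif

# Synthetic stand-in data with hypothetical column names f0..f11
X_arr, y = make_classification(
    n_samples=300, n_features=12, n_informative=4, random_state=0
)
X = pd.DataFrame(X_arr, columns=["f{}".format(i) for i in range(12)])

selectors = {
    "SelectKBest": SelectKBest(score_func=f_classif, k=6),
    # threshold=-np.inf with max_features=6 keeps exactly the 6 most important features
    "SelectFromModel": SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0),
        threshold=-np.inf,
        max_features=6,
    ),
    "RFE": RFE(
        RandomForestClassifier(n_estimators=50, random_state=0),
        n_features_to_select=6,
    ),
}

# Fit each selector and map its boolean support mask back to column names
selected_names = {}
for name, selector in selectors.items():
    selector.fit(X, y)
    selected_names[name] = list(X.columns[selector.get_support()])
    print("{}: {}".format(name, selected_names[name]))
```

Every selector exposes `get_support()`, so the same loop works for all three; with the notebook's data, `X` would be `X_train`.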


<h1>Part B. Data Transformation </h1>

In this part, you are going to work with a new data set that contains some of the features of a house collected over time.
The objective of this part is to improve a linear model's predictions using data transformation.

<h2>Part B.1 Binning </h2>

Our first try is using binning, as follows:

- Read from MyHouse.csv (take 'Light' as the target and the remaining columns as features)
- First apply a LinearRegression on the original data to predict the target and report the model's score on the test set.
- Now apply binning to all three columns
 (5 bins for Temperature, 10 bins for Humidity, and 11 bins for CO2)
- Print the data shape before and after binning.
- Apply LinearRegression again on the binned data and report the score.



In [41]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import KBinsDiscretizer

house_data = pd.read_csv("MyHouse.csv")
features = house_data.drop(columns="Light")
labels = house_data["Light"]

X_train, X_test, y_train, y_test = train_test_split(
    features,
    labels,
    random_state=0
)

lin_reg = LinearRegression().fit(X_train, y_train)
print("Score of original lin reg = {}".format(lin_reg.score(X_test, y_test)))

# Apply binning to each column. KBinsDiscretizer expects 2D input, so each
# column is selected with double brackets (a one-column DataFrame, not a Series).
temp_binner = KBinsDiscretizer(n_bins=5).fit(X_train[["Temperature"]])
humidity_binner = KBinsDiscretizer(n_bins=10).fit(X_train[["Humidity"]])
co2_binner = KBinsDiscretizer(n_bins=11).fit(X_train[["CO2"]])

Score of original lin reg = 0.6920532277961199
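
`KBinsDiscretizer` expects a 2D input, so each column has to be passed as a one-column table rather than a 1D Series. A minimal sketch of the full binning step under that constraint is given below. It assumes a `ColumnTransformer` to apply one discretizer per column, `encode="onehot-dense"` so the result is a plain array, and synthetic data in place of MyHouse.csv (the values and the target formula are made up), so the printed score says nothing about the real data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic stand-in for MyHouse.csv (values are made up for illustration)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Temperature": rng.uniform(19, 24, 600),
    "Humidity": rng.uniform(20, 40, 600),
    "CO2": rng.uniform(400, 2000, 600),
})
demo_target = 50 * np.sin(demo["Temperature"]) + 0.1 * demo["CO2"] + rng.normal(0, 5, 600)

X_tr, X_te, y_tr, y_te = train_test_split(demo, demo_target, random_state=0)

# One discretizer per column; a list of column names keeps each input 2D
binner = ColumnTransformer([
    ("temp", KBinsDiscretizer(n_bins=5, encode="onehot-dense"), ["Temperature"]),
    ("hum", KBinsDiscretizer(n_bins=10, encode="onehot-dense"), ["Humidity"]),
    ("co2", KBinsDiscretizer(n_bins=11, encode="onehot-dense"), ["CO2"]),
])

# Fit the bin edges on the training split only, then reuse them on the test split
X_tr_binned = binner.fit_transform(X_tr)
X_te_binned = binner.transform(X_te)

print("Shape before binning:", X_tr.shape)
print("Shape after binning:", X_tr_binned.shape)

binned_reg = LinearRegression().fit(X_tr_binned, y_tr)
print("Score of binned lin reg = {}".format(binned_reg.score(X_te_binned, y_te)))
```

With one-hot encoding, the three columns expand to 5 + 10 + 11 = 26 indicator columns, which is the shape change the exercise asks you to print.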

<h2>Part B.2 Polynomials</h2>

To compare polynomials and binning, apply polynomial features to all three features.
- Use degree=6.
- Print the data shape before and after the transformation.
- Apply LinearRegression on the new data and report the score again.
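
The steps above can be sketched as follows. This is an illustrative outline on synthetic data in place of MyHouse.csv (the values and the target formula are made up), not a reference solution. It assumes a `StandardScaler` in front of `PolynomialFeatures`, an added choice that keeps the degree-6 terms in a numerically reasonable range, and `include_bias=False`, since `LinearRegression` fits its own intercept.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for MyHouse.csv (values are made up for illustration)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Temperature": rng.uniform(19, 24, 600),
    "Humidity": rng.uniform(20, 40, 600),
    "CO2": rng.uniform(400, 2000, 600),
})
demo_target = 50 * np.sin(demo["Temperature"]) + 0.1 * demo["CO2"] + rng.normal(0, 5, 600)

X_tr, X_te, y_tr, y_te = train_test_split(demo, demo_target, random_state=0)

# Scale first, then expand to all polynomial terms up to degree 6
poly = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=6, include_bias=False),
)
X_tr_poly = poly.fit_transform(X_tr)
X_te_poly = poly.transform(X_te)

print("Shape before transformation:", X_tr.shape)
print("Shape after transformation:", X_tr_poly.shape)

poly_reg = LinearRegression().fit(X_tr_poly, y_tr)
print("Score of polynomial lin reg = {}".format(poly_reg.score(X_te_poly, y_te)))
```

Three input features at degree 6 expand to 84 polynomial terms including the constant, so 83 columns remain once the bias column is dropped.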


In [None]:
## Part B.2 Polynomials
