# Objective: Feature Subset Selection to Improve Software Cost Estimation

## Dataset
This data set comes from the PROMISE Software Engineering Repository, which is made publicly available to encourage repeatable, verifiable, refutable, and/or improvable predictive models of software engineering. The main objective is to estimate software cost (the actual effort, ACT_EFFORT) using feature subset selection techniques.

## Attributes
1.	RELY {Nominal,Very_High,High,Low} 
2.	DATA {High,Low,Nominal,Very_High} 
3.	CPLX {Very_High,High,Nominal,Extra_High,Low} 
4.	TIME {Nominal,Very_High,High,Extra_High} 
5.	STOR {Nominal,Very_High,High,Extra_High} 
6.	VIRT {Low,Nominal,High}
7.	TURN {Nominal,High,Low}
8.	ACAP {High,Very_High,Nominal} 
9.	AEXP {Nominal,Very_High,High} 
10.	PCAP {Very_High,High,Nominal}
11.	VEXP {Low,Nominal,High}
12.	LEXP {Nominal,High,Very_Low,Low} 
13.	MODP {High,Nominal,Very_High,Low}
14.	TOOL {Nominal,High,Very_High,Very_Low,Low} 
15.	SCED {Low,Nominal,High}
16.	LOC numeric 

## Target Class
ACT_EFFORT numeric (attribute 17)

### Source: http://promise.site.uottawa.ca/SERepository/datasets/cocomonasa_v1.arff

Tasks:
1.	Obtain the software cost estimation dataset
2.	Apply pre-processing techniques (if any)
3.	Apply feature subset selection techniques such as correlation analysis, forward selection, backward elimination, and recursive feature elimination. Find the best possible subset of features from each method.
4.	Divide the dataset into training and testing sets.
5.	Implement support vector regression (SVR), linear regression, and a decision tree.
6.	Build an ensemble of SVR, linear regression, and the decision tree.
7.	Evaluate the coefficient of determination and root mean square error for all models, including the ensemble.
8.	Conclude the results

Helpful links: https://scikit-learn.org/stable/modules/ensemble.html
https://www.analyticsvidhya.com/blog/2020/03/support-vector-regression-tutorial-for-machine-learning/
https://medium.com/pursuitnotes/support-vector-regression-in-6-steps-with-python-c4569acd062d
https://scikit-learn.org/stable/auto_examples/svm/plot_svm_regression.html


## Task 1: Implementation of regression models 

In [101]:
# Load the libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler,OrdinalEncoder
from scipy.io import arff
from sklearn.metrics import mean_squared_error
from sklearn.feature_selection import SelectKBest,f_classif

In [102]:
# Load the dataset 

data,meta=arff.loadarff('cocomonasa_v1.arff')
df=pd.DataFrame(data)
# Nominal attributes are loaded as bytes; decode the 15 cost-driver columns to str
# (the last two columns, LOC and ACT_EFFORT, are numeric)
for col in df.columns[:-2]:
    df[col]=df[col].str.decode('utf-8')
df.head()

Unnamed: 0,RELY,DATA,CPLX,TIME,STOR,VIRT,TURN,ACAP,AEXP,PCAP,VEXP,LEXP,MODP,TOOL,SCED,LOC,ACT_EFFORT
0,Nominal,High,Very_High,Nominal,Nominal,Low,Nominal,High,Nominal,Very_High,Low,Nominal,High,Nominal,Low,70.0,278.0
1,Very_High,High,High,Very_High,Very_High,Nominal,Nominal,Very_High,Very_High,Very_High,Nominal,High,High,High,Low,227.0,1181.0
2,Nominal,High,High,Very_High,High,Low,High,High,Nominal,High,Low,High,High,Nominal,Low,177.9,1248.0
3,High,Low,High,Nominal,Nominal,Low,Low,Nominal,Nominal,Nominal,Nominal,High,High,Nominal,Low,115.8,480.0
4,High,Low,High,Nominal,Nominal,Low,Low,Nominal,Nominal,Nominal,Nominal,High,High,Nominal,Low,29.5,120.0
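
The decoding step can be tried without the dataset file. The sketch below is only an illustration: it feeds a small hypothetical ARFF snippet (three attributes, two rows copied from the table above) to `arff.loadarff` through an in-memory buffer and decodes the byte-valued nominal column. The names `ARFF_TEXT` and `demo` are assumptions made for this example.

```python
import io

import pandas as pd
from scipy.io import arff

# Hypothetical ARFF snippet in the same style as the dataset file
ARFF_TEXT = """@relation demo
@attribute RELY {Nominal,Very_High,High,Low}
@attribute LOC numeric
@attribute ACT_EFFORT numeric
@data
Nominal,70,278
Very_High,227,1181
"""

# loadarff accepts a file-like object and returns (records, metadata)
data, meta = arff.loadarff(io.StringIO(ARFF_TEXT))
demo = pd.DataFrame(data)

# Nominal attributes come back as bytes objects; decode only the object columns
for col in demo.select_dtypes(include="object").columns:
    demo[col] = demo[col].str.decode("utf-8")

print(demo)
```

Selecting columns by dtype rather than by position avoids hard-coding how many numeric columns sit at the end of the frame.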


In [103]:
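# Preprocessing
# Encoding categorical variables (if any)
# Feature Scaling
# Filling missing values (if any)
# Note: each column's categories are listed in order of first appearance,
# so the integer codes do not follow the Very_Low < ... < Extra_High rating order
categories = [list(df[i].unique()) for i in df.columns[:-2]]
enc=OrdinalEncoder(categories=categories)
df.iloc[:,:-2]=enc.fit_transform(df.iloc[:,:-2])
df.head()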

Unnamed: 0,RELY,DATA,CPLX,TIME,STOR,VIRT,TURN,ACAP,AEXP,PCAP,VEXP,LEXP,MODP,TOOL,SCED,LOC,ACT_EFFORT
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,70.0,278.0
1,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,227.0,1181.0
2,0.0,0.0,1.0,1.0,2.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,177.9,1248.0
3,2.0,1.0,1.0,0.0,0.0,0.0,2.0,2.0,0.0,2.0,1.0,1.0,0.0,0.0,0.0,115.8,480.0
4,2.0,1.0,1.0,0.0,0.0,0.0,2.0,2.0,0.0,2.0,1.0,1.0,0.0,0.0,0.0,29.5,120.0
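
In the output above, `Nominal` is coded as 0 in the RELY column simply because it appears first. If the integer codes should instead reflect the COCOMO rating scale, one option is to pass an explicit ordered category list to `OrdinalEncoder`. The following is a minimal sketch on a small hypothetical frame; `RATING_ORDER` and `ratings` are names introduced for this example only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# COCOMO rating scale from lowest to highest
RATING_ORDER = ["Very_Low", "Low", "Nominal", "High", "Very_High", "Extra_High"]

# Small hypothetical frame with two of the cost-driver columns
ratings = pd.DataFrame({
    "RELY": ["Nominal", "Very_High", "Low"],
    "CPLX": ["Extra_High", "High", "Nominal"],
})

# One category list per column, all sharing the same ordered scale
encoder = OrdinalEncoder(categories=[RATING_ORDER] * ratings.shape[1])
encoded = pd.DataFrame(encoder.fit_transform(ratings), columns=ratings.columns)
print(encoded)
```

With a shared scale, a larger code always means a higher rating, which makes the encoded values easier to interpret in a linear model.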


In [104]:
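# Standardize LOC to zero mean and unit variance
# Note: the scaler is fitted on all rows here, before the train/test split
scaler=StandardScaler()
df['LOC']=scaler.fit_transform(np.array(df['LOC']).reshape(-1,1))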

In [105]:
X=df.drop('ACT_EFFORT',axis=1)
y=df['ACT_EFFORT']

In [106]:
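# Apply feature subset selection techniques 
# Keep the 15 highest-scoring of the 16 input columns.
# Note: f_classif is an ANOVA F-test intended for class labels; with the continuous
# ACT_EFFORT target it treats every distinct value as its own class, which is the
# likely source of the divide-by-zero warning. f_regression is the score function
# designed for numeric targets.
sel=SelectKBest(f_classif,k=15)
X=sel.fit_transform(X,y)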


divide by zero encountered in true_divide
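
The task list also names forward selection, backward elimination, and recursive feature elimination. The sketch below shows how these could be set up with scikit-learn on synthetic stand-in data (not the COCOMO dataset), using `f_regression` for the univariate filter because the target is numeric. The data, the helper `selected`, and the choice of two features are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE, SelectKBest, SequentialFeatureSelector, f_regression
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data: 6 features, only columns 0 and 3 drive the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 6))
y_demo = 5.0 * X_demo[:, 0] - 3.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=80)

def selected(selector):
    """Fit a selector and return the indices of the columns it keeps."""
    return sorted(np.flatnonzero(selector.fit(X_demo, y_demo).get_support()).tolist())

lin = LinearRegression()
subsets = {
    "univariate F-test": selected(SelectKBest(f_regression, k=2)),
    "forward selection": selected(
        SequentialFeatureSelector(lin, n_features_to_select=2, direction="forward")),
    "backward elimination": selected(
        SequentialFeatureSelector(lin, n_features_to_select=2, direction="backward")),
    "recursive feature elimination": selected(RFE(lin, n_features_to_select=2)),
}
for name, cols in subsets.items():
    print(name, cols)
```

On the real data, each selector would be fitted on the training split only, and the number of features to keep would be a tuning choice rather than a fixed 2.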



In [107]:
# Divide the dataset to training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
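
In the cells above, scaling and feature selection are fitted on the full dataset before this split, so statistics from the test rows influence preprocessing. One way to avoid that is to split first and put the preprocessing inside a pipeline. The sketch below uses synthetic stand-in data; `X_demo`, `y_demo`, and the pipeline layout are assumptions for illustration, not the notebook's method.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR

# Synthetic stand-in for the feature matrix and effort target
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=100.0, scale=50.0, size=(60, 3))
y_demo = 4.0 * X_demo[:, 0] + rng.normal(scale=5.0, size=60)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=42)

# The scaler is fitted inside the pipeline, i.e. on the training rows only
pipe = make_pipeline(StandardScaler(), LinearSVR(random_state=0, max_iter=10000))
pipe.fit(X_tr, y_tr)

fitted_scaler = pipe.named_steps["standardscaler"]
print("scaler means:      ", fitted_scaler.mean_)
print("training-row means:", X_tr.mean(axis=0))
print("test R^2:", pipe.score(X_te, y_te))
```

Because the pipeline applies the already-fitted scaler at prediction time, the test rows are transformed with training statistics only.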

In [108]:
# Build regression models 
# (1) linear regression, (2) decision tree, (3) linear support vector regression
model=LinearRegression()
model1=DecisionTreeRegressor()
model2=LinearSVR()
model.fit(X_train,y_train)
model1.fit(X_train,y_train)
model2.fit(X_train,y_train)
# For regressors, score() returns the coefficient of determination (R^2), not accuracy
print('Training R^2 (1) - ',model.score(X_train,y_train))
print('Training R^2 (2) - ',model1.score(X_train,y_train))
print('Training R^2 (3) - ',model2.score(X_train,y_train))

Training R^2 (1) -  0.9658112787040599
Training R^2 (2) -  1.0
Training R^2 (3) -  -0.12339691972310973


In [109]:
# Evaluate the built models on the test dataset
# For regressors, score() returns the coefficient of determination (R^2), not accuracy
print('Testing R^2 (1) - ',model.score(X_test,y_test))
print('Testing R^2 (2) - ',model1.score(X_test,y_test))
print('Testing R^2 (3) - ',model2.score(X_test,y_test))

Testing R^2 (1) -  0.8890984128477836
Testing R^2 (2) -  0.6305570560629776
Testing R^2 (3) -  -0.1931437346637499
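
The values printed by `score()` are coefficients of determination, R^2 = 1 - SS_res / SS_tot, which is why the third model can produce a negative number: its squared error is larger than that of always predicting the mean. The sketch below computes R^2 by hand on hypothetical effort values (not taken from the dataset) and compares it with `sklearn.metrics.r2_score`; the helper name `r_squared` is an assumption for this example.

```python
import numpy as np
from sklearn.metrics import r2_score

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

# Hypothetical effort values, not taken from the dataset
y_true = np.array([100.0, 200.0, 300.0, 400.0])
good = np.array([110.0, 190.0, 310.0, 390.0])   # close to the truth
bad = np.array([400.0, 300.0, 200.0, 100.0])    # worse than predicting the mean

print(r_squared(y_true, good), r2_score(y_true, good))
print(r_squared(y_true, bad), r2_score(y_true, bad))
```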


In [129]:
# Evaluate training and testing root mean square error (R^2 is reported by score() above)
print('Training RMSE (1) - ',mean_squared_error(y_train,model.predict(X_train))**0.5)
print('Training RMSE (2) - ',mean_squared_error(y_train,model1.predict(X_train))**0.5)
print('Training RMSE (3) - ',mean_squared_error(y_train,model2.predict(X_train))**0.5)
print()
print('Testing RMSE (1) - ',mean_squared_error(y_test,model.predict(X_test))**0.5)
print('Testing RMSE (2) - ',mean_squared_error(y_test,model1.predict(X_test))**0.5)
print('Testing RMSE (3) - ',mean_squared_error(y_test,model2.predict(X_test))**0.5)


Training RMSE (1) -  100.53450948463174
Training RMSE (2) -  0.0
Training RMSE (3) -  576.2890248476601

Testing RMSE (1) -  279.2110177739933
Testing RMSE (2) -  509.60931659022435
Testing RMSE (3) -  915.8198521843149


## Task 2: Ensemble regression models


In [130]:
# Ensemble the regression models
# The ensemble prediction is the unweighted mean of the three models' predictions
ensembletrain=np.vstack((model.predict(X_train),model1.predict(X_train),model2.predict(X_train)))
ensembletrain=np.mean(ensembletrain,axis=0)
ensembletest=np.vstack((model.predict(X_test),model1.predict(X_test),model2.predict(X_test)))
ensembletest=np.mean(ensembletest,axis=0)
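
The manual averaging above corresponds to what scikit-learn's `VotingRegressor` does with its default uniform weights. The sketch below checks that equivalence on synthetic stand-in data; the data and the estimator settings (`random_state`, `max_iter`) are assumptions for illustration. Note that `VotingRegressor` refits its own clones of the estimators rather than reusing already-fitted models.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import LinearSVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.2, size=50)

voter = VotingRegressor(estimators=[
    ("lr", LinearRegression()),
    ("tree", DecisionTreeRegressor(random_state=0)),
    ("svr", LinearSVR(random_state=0, max_iter=10000)),
])
voter.fit(X_demo, y_demo)

# Unweighted voting equals the mean of the fitted members' predictions
member_preds = np.vstack([est.predict(X_demo) for est in voter.estimators_])
manual_average = member_preds.mean(axis=0)
voting_average = voter.predict(X_demo)
print(np.allclose(manual_average, voting_average))
```

A practical advantage of the estimator form is that it exposes `score()` and can be passed to cross-validation utilities like any other regressor.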

In [131]:
# Evaluate root mean square error of the averaged (ensemble) predictions
print('Training RMSE (4) - ',mean_squared_error(y_train,ensembletrain)**0.5)
print('Testing RMSE (4) - ',mean_squared_error(y_test,ensembletest)**0.5)

Training RMSE (4) -  200.67403168981772
Testing RMSE (4) -  496.98888567523176
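
Only RMSE is reported for the ensemble, while the task list also asks for the coefficient of determination. Since the ensemble is just an array of averaged predictions, both metrics can be computed from the predictions directly. The sketch below uses hypothetical targets and predictions; the helper `evaluate` is a name introduced for this example.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    """Return (R^2, RMSE) for a vector of predictions."""
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    return float(r2_score(y_true, y_pred)), rmse

# Hypothetical targets and predictions from three models
y_true = np.array([10.0, 20.0, 30.0, 40.0])
preds = np.array([
    [13.0, 19.0, 33.0, 41.0],
    [8.0, 22.0, 27.0, 42.0],
    [12.0, 22.0, 30.0, 40.0],
])
averaged = preds.mean(axis=0)  # same averaging rule as the ensemble cell

r2, rmse = evaluate(y_true, averaged)
print("Ensemble R^2: ", r2)
print("Ensemble RMSE:", rmse)
```

In the notebook, the same helper could be called as `evaluate(y_test, ensembletest)` and `evaluate(y_train, ensembletrain)`.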



## Task 3: Conclude the results
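
As a starting point for the conclusion, the metrics printed in the cells above can be collected into one table and ranked. The sketch below copies those printed values verbatim (the `score()` outputs are R^2 values; no R^2 was computed for the ensemble, so it is left empty). The column names and the `rmse_gap` column are assumptions introduced for this summary.

```python
import pandas as pd

# Metrics copied from the cell outputs above; the ensemble R^2 was not computed
results = pd.DataFrame(
    {
        "train_r2": [0.9658112787040599, 1.0, -0.12339691972310973, None],
        "test_r2": [0.8890984128477836, 0.6305570560629776, -0.1931437346637499, None],
        "train_rmse": [100.53450948463174, 0.0, 576.2890248476601, 200.67403168981772],
        "test_rmse": [279.2110177739933, 509.60931659022435, 915.8198521843149, 496.98888567523176],
    },
    index=["Linear regression", "Decision tree", "LinearSVR", "Averaging ensemble"],
)

# Rank by test RMSE (lower is better) and show the train/test gap
results["rmse_gap"] = results["test_rmse"] - results["train_rmse"]
ranked = results.sort_values("test_rmse")
print(ranked.round(3))
```

Read this way, the recorded numbers suggest that linear regression generalizes best on this split, that the decision tree fits the training set exactly (training RMSE 0.0) but does much worse on the test set, and that the LinearSVR model, with negative R^2 on both sets, pulls the simple average below the linear model alone.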
