# Artificially Generated Sensor Data


Task description
----------------
The file task_data.csv contains an example data set that has been artificially
generated. The set consists of 400 samples where for each sample there are 10
different sensor readings available. The samples have been divided into two
classes where the class label is either `1 or -1`. The class labels define to what
particular class a particular sample belongs.

Your task is to rank the sensors according to their importance/predictive power
with respect to the class labels of the samples. Your solution should be a
Python script or a Jupyter notebook file that generates a ranking of the sensors
from the provided CSV file. The ranking should be in decreasing order where the
first sensor is the most important one.

Additionally, please include an analysis of your method and results, with
possible topics including:

* your process of thought, i.e., how did you come to your solution?
* properties of the artificially generated data set
* strengths of your method: why does it produce a reasonable result?
* weaknesses of your method: when would the method produce inaccurate results?
* scalability of your method with respect to number of features and/or samples
* alternative methods and their respective strengths, weaknesses, scalability

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [None]:
df = pd.read_csv('../input/artificially-generatedsensorsdata/task_data.csv')
df.head()

Three things catch my attention:
1) There is a sample index column that cannot be used for classification.

2) `class_label` is our class label.

3) We have 10 sensors (`sensor0` to `sensor9`) for our classification.

Therefore, we drop the `sample_index` feature.

In [None]:
df.info()

In [None]:
# check shape and missing values
print("*_*"*20,"\nData Set")
print(f'There are {df.shape[0]} rows and {df.shape[1]} columns')  # f-string
# total number of missing values across all columns
print(f'There are {df.isnull().sum().sum()} missing values')

Our data has 0 NaNs.

* Next, let's check the categorical and numerical features.

In [None]:
# rename the column to get rid of the space
df.rename(columns={"sample index":"sample_index"},inplace=True)

In [None]:
# Categorical features
cat_col=df.select_dtypes(include='object').columns.to_list()
cat_col


In [None]:
# Numerical features
num_col=df.select_dtypes(include='number').columns.to_list()
num_col

In [None]:
ax = sns.countplot(x=df.class_label)
# look the counts up by label so the result does not depend on value_counts() ordering
counts = df.class_label.value_counts()
One, NegOne = counts.loc[1], counts.loc[-1]
print('Number of Ones: ',One)
print('Number of -Ones : ',NegOne)

* Our data is balanced (each class contains 200 samples).

Okay, now we have 10 features, but what do they mean beyond being sensor readings, and how much do we actually need to know about them?

The answer is that we do not need to know the meaning of these features. However, to build a mental picture of the data we should know summary statistics such as variance, standard deviation, number of samples (count), and max/min values.

This type of information helps us understand what is going on in the data.

In [None]:
df.describe()

In [None]:
df.describe().T.style.bar(subset=['mean'], color='#FF595E')\
                           .background_gradient(subset=['50%'], cmap='PiYG') # highlight median

# Visualization

In [None]:
cols = 4
rows = len(num_col) // cols+1
fig, axs = plt.subplots(ncols=cols, nrows=rows, figsize=(19,30), sharex=False) #subplot with all rows
plt.subplots_adjust(hspace = 0.4)
i=0

for r in np.arange(0, rows, 1):
    for c in np.arange(0, cols, 1):
        if i >= len(num_col):
            axs[r, c].set_visible(False)
        else:
            axs[r,c].hist(df[num_col[i]].values,
                                   color="#59c8ff",
                                   edgecolor="black",
                                   alpha=0.7,
                                   label="Train Dataset",bins=40)
            axs[r, c].set_title(num_col[i], fontsize=17, pad=4)
            axs[r, c].tick_params(axis="y", labelsize=11)
            axs[r, c].tick_params(axis="x", labelsize=11)
            axs[r,c].spines['right'].set_visible(False)
            axs[r,c].spines['top'].set_visible(False)

        i+=1

plt.show();

In [None]:
df['sample_index'].value_counts()

# each sample index occurs exactly once

In [None]:
df.head()

In [None]:
sns.jointplot(x=df.loc[:,'sensor0'], y=df.loc[:,'sensor1'], kind="reg", color="#ce1414")

In [None]:
sns.jointplot(x=df.loc[:,'sensor0'], y=df.loc[:,'sensor2'], kind="reg", color="#ce1414")

In [None]:
sns.jointplot(x=df.loc[:,'sensor0'], y=df.loc[:,'sensor3'], kind="reg", color="#ce1414")

In [None]:
sns.jointplot(x=df.loc[:,'sensor0'], y=df.loc[:,'sensor4'], kind="reg", color="#ce1414")

In [None]:
sns.jointplot(x=df.loc[:,'sensor0'], y=df.loc[:,'sensor5'], kind="reg", color="#ce1414")

In [None]:
sns.jointplot(x=df.loc[:,'sensor0'], y=df.loc[:,'sensor6'], kind="reg", color="#ce1414")

In [None]:
sns.jointplot(x=df.loc[:,'sensor0'], y=df.loc[:,'sensor7'], kind="reg", color="#ce1414")

In [None]:
sns.jointplot(x=df.loc[:,'sensor0'], y=df.loc[:,'sensor8'], kind="reg", color="#ce1414")

In [None]:
sns.jointplot(x=df.loc[:,'sensor0'], y=df.loc[:,'sensor9'], kind="reg", color="#ce1414")

__Observation:__ Features are not correlated.
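
The visual impression can also be checked numerically. The following is a minimal sketch, assuming a hypothetical stand-in DataFrame `demo` with 10 independent random columns instead of the real `task_data.csv`; it reports the largest absolute pairwise correlation while ignoring the diagonal.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the sensor columns: 400 samples, 10 independent features.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((400, 10)),
                    columns=[f"sensor{i}" for i in range(10)])

corr = demo.corr().to_numpy()
# Mask the diagonal, since every feature correlates perfectly with itself.
off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
max_abs_corr = np.abs(off_diag).max()
print(f"largest absolute pairwise correlation: {max_abs_corr:.3f}")
```

A value close to 0 supports the observation; a value close to 1 would point to redundant sensors.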

In [None]:
sns.set(style="white")
g = sns.PairGrid(df, diag_sharey=False)
g.map_lower(sns.kdeplot, cmap="Blues_d")
g.map_upper(plt.scatter)
g.map_diag(sns.kdeplot, lw=3)

In [None]:
# avoid shadowing the built-in `list`
drop_cols = ['sample_index','class_label']
x = df.drop(columns=drop_cols)
x.head()

In [None]:
#correlation map
f,ax = plt.subplots(figsize=(18, 18))
sns.heatmap(x.corr(), annot=True, linewidths=.5, fmt= '.1f',ax=ax)

## Feature Selection and Random Forest Classification

Today our purpose is to try new `coffee`. Suppose we are finally in the coffee shop and want to taste different flavors. To do that, we compare the ingredients of the drinks. If one drink includes milk, then after drinking it we eliminate the other drinks that include milk, so that we experience truly different tastes. Feature selection works the same way: features that carry the same information are redundant.

In this part we will select features with different methods:

* feature selection with correlation,
* univariate feature selection,
* recursive feature elimination (RFE),
* recursive feature elimination with cross validation (RFECV) and
* tree based feature selection.

We will use random forest classification to train our model and predict.

### 1) Feature selection with correlation and random forest classification

In [None]:
y = df.class_label

In [None]:
x.head()

## Here we try all 10 sensors.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score,confusion_matrix
from sklearn.metrics import accuracy_score

# split data train 70 % and test 30 %
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

# random forest classifier with default hyperparameters
# (n_estimators defaults to 100 since scikit-learn 0.22; it was 10 before)
clf_rf = RandomForestClassifier(random_state=43)      
clr_rf = clf_rf.fit(x_train,y_train)

ac = accuracy_score(y_test,clf_rf.predict(x_test))
print('Accuracy is: ',ac)
cm = confusion_matrix(y_test,clf_rf.predict(x_test))
sns.heatmap(cm,annot=True,fmt="d")

In [None]:
# feature importances based on analysis using random forest

featureImp = pd.DataFrame({  
                'feature': x_train.columns,
                'Score': clf_rf.feature_importances_
              })
    
sortedFeatureImp = featureImp.sort_values('Score', axis=0, ascending=False, inplace=False, kind='quicksort', na_position='last')
# Feature importance
sortedFeatureImp.style.highlight_max(axis=0)

__Observation__: Accuracy is almost 98% and, as the confusion matrix shows, we make 2 wrong predictions. Now let's look at other feature selection methods to see whether we can get better results.

### 2) Univariate feature selection and random forest classification

In univariate feature selection, we use SelectKBest, which removes all but the k highest-scoring features. http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest



__Let's first keep the 9 highest-scoring of the 10 features, then afterward we will stick with 5 features only.__
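
To make the mechanics concrete, here is a minimal sketch on made-up data (the arrays `X_demo` and `y_demo` are assumptions, not the sensor data). It shows that `scores_` holds one chi2 score per column regardless of `k`, and that `get_support` reports which columns survive. Note that `chi2` only accepts non-negative feature values.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
y_demo = np.repeat([1, -1], 100)          # two balanced classes
X_demo = rng.random((200, 4))             # non-negative values, as chi2 requires
X_demo[:, 2] += (y_demo == 1) * 1.0       # make column 2 depend on the label

selector = SelectKBest(chi2, k=2).fit(X_demo, y_demo)
order = np.argsort(selector.scores_)[::-1]   # column indices, best score first
kept = selector.get_support(indices=True)    # kept columns, in original column order
print("ranking by chi2 score:", order)
print("columns kept:", kept)
```

Because the scores are computed for every column, sorting `scores_` already gives a full univariate ranking; `k` only decides where the list is cut.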

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# score all features with chi2 and keep the 9 best (chi2 requires non-negative feature values)
select_feature_9ft = SelectKBest(chi2, k=9).fit(x_train, y_train)

In [None]:
# chi2 scores from SelectKBest (not random forest importances)
feature_select_9ft = pd.DataFrame({'feature': x_train.columns,
                'Score': select_feature_9ft.scores_
            })
sortedFeatureImp = feature_select_9ft.sort_values('Score', axis=0, ascending=False, inplace=False, kind='quicksort', na_position='last')
sortedFeatureImp.style.highlight_max(axis=0)

In [None]:
x_train_2 = select_feature_9ft.transform(x_train)
x_test_2 = select_feature_9ft.transform(x_test)
# random forest classifier with default hyperparameters
clf_rf_2 = RandomForestClassifier()      
clr_rf_2 = clf_rf_2.fit(x_train_2,y_train)
ac_2 = accuracy_score(y_test,clf_rf_2.predict(x_test_2))
print('Accuracy is: ',ac_2)
cm_2 = confusion_matrix(y_test,clf_rf_2.predict(x_test_2))
sns.heatmap(cm_2,annot=True,fmt="d")

__Observation__: The method differs, but the accuracy with 9 features is the same.

__Remark:__ If we reduce k to 5, the accuracy drops noticeably.


__So let's see what happens if we use only the 5 best-scoring features.__
Now, I will not try all combinations; I will only choose k = 5 and find the best 5 features.
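
If you do want to compare several values of k, one possible approach is a short loop with cross-validation. This is only a sketch on made-up data (`X_demo`, `y_demo` and the chosen values of `k` are assumptions); putting the selector inside a pipeline means the chi2 scores are recomputed on each training fold.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
y_demo = np.repeat([1, -1], 100)
X_demo = rng.random((200, 6))             # non-negative, as chi2 requires
X_demo[:, 0] += 1.0 * (y_demo == 1)       # only column 0 carries label information

cv_accuracy = {}
for k in (1, 3, 6):
    pipe = make_pipeline(
        SelectKBest(chi2, k=k),
        RandomForestClassifier(n_estimators=50, random_state=0),
    )
    # mean accuracy over 5 stratified folds
    cv_accuracy[k] = cross_val_score(pipe, X_demo, y_demo,
                                     cv=5, scoring="accuracy").mean()

for k, acc in cv_accuracy.items():
    print(f"k={k}: mean CV accuracy {acc:.3f}")
```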

In [None]:
# find best scored 5 features
select_feature_5ft = SelectKBest(chi2, k=5).fit(x_train, y_train)

In [None]:
# chi2 scores from SelectKBest (not random forest importances)
feature_select_5ft = pd.DataFrame({'feature': x_train.columns,
                'Score': select_feature_5ft.scores_
            })
sortedFeatureImp = feature_select_5ft.sort_values('Score', axis=0, ascending=False, inplace=False, kind='quicksort', na_position='last')
sortedFeatureImp.style.highlight_max(axis=0)

In [None]:
x_train_2 = select_feature_5ft.transform(x_train)
x_test_2 = select_feature_5ft.transform(x_test)
# random forest classifier with default hyperparameters
clf_rf_2 = RandomForestClassifier()      
clr_rf_2 = clf_rf_2.fit(x_train_2,y_train)
ac_2 = accuracy_score(y_test,clf_rf_2.predict(x_test_2))
print('Accuracy is: ',ac_2)
cm_2 = confusion_matrix(y_test,clf_rf_2.predict(x_test_2))
sns.heatmap(cm_2,annot=True,fmt="d")

In [None]:
# random forest importances of the 5 features kept by SelectKBest

featureImp = pd.DataFrame({  
                # transform() keeps the original column order, so read the names from the selector
                'feature': x_train.columns[select_feature_5ft.get_support()],
                'Score': clf_rf_2.feature_importances_
              })
    
sortedFeatureImp = featureImp.sort_values('Score', axis=0, ascending=False, inplace=False, kind='quicksort', na_position='last')
# Feature importance
sortedFeatureImp.style.highlight_max(axis=0)

__Observation__: The accuracy dropped to 92.5%, so these are not the best features.

### 3) Recursive feature elimination (RFE) with random forest

http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html Basically, RFE uses a classification method (random forest in our example) to assign a weight to each feature. 
The features whose absolute weights are the smallest are pruned from the current feature set. That procedure is repeated recursively on the pruned set until the desired number of features is reached.

Like the previous method, we will use 5 features. However, which 5 features will we use? We will choose them with the RFE method.
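
Since the task asks for a full ranking rather than a subset, it may help to know that RFE also exposes a `ranking_` attribute. The sketch below runs on made-up data (`X_demo`, `y_demo` are assumptions) and eliminates features one at a time down to a single survivor, so `ranking_` becomes a complete order.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
y_demo = np.repeat([1, -1], 100)
X_demo = rng.normal(size=(200, 5))
X_demo[:, 3] += 3.0 * (y_demo == 1)       # only column 3 is informative

rfe_demo = RFE(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=1,               # keep eliminating until one feature is left
    step=1,                               # drop one feature per iteration
).fit(X_demo, y_demo)

# ranking_ is 1 for the surviving feature, 2 for the one eliminated last, and so on
elimination_order = np.argsort(rfe_demo.ranking_)
print("ranking_ per column:", rfe_demo.ranking_)
print("columns from most to least important:", elimination_order)
```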

In [None]:
from sklearn.feature_selection import RFE
# Create the RFE object and rank each feature
clf_rf_3 = RandomForestClassifier()      
rfe = RFE(estimator=clf_rf_3, n_features_to_select=5, step=1)
rfe = rfe.fit(x_train, y_train)

In [None]:
print('Chosen best 5 feature by rfe:',x_train.columns[rfe.support_])


The 5 best features chosen by RFE are:
`'sensor0', 'sensor3', 'sensor4', 'sensor6', 'sensor8'`. 
They are nearly the same as those from the previous (SelectKBest) method: four of the five overlap, and sensor6 replaces sensor1.
Therefore we do not calculate the accuracy again. In short, RFE and SelectKBest make broadly consistent selections. However, there is a problem: granted, two different methods found almost the same 5 features, but why 5? Let's see how many features we actually need with the __RFECV method.__

### 4) Recursive feature elimination with cross validation and random forest classification

http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html Now we will not only find the best features but also how many features we need for the best accuracy.
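
As a small illustration of what RFECV returns, here is a sketch on made-up data (`X_demo`, `y_demo` are assumptions). After fitting, `n_features_` is the number of features with the best cross-validated score and `support_` is a boolean mask over the original columns.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

rng = np.random.default_rng(0)
y_demo = np.repeat([1, -1], 100)
X_demo = rng.normal(size=(200, 5))
X_demo[:, 1] += 3.0 * (y_demo == 1)       # only column 1 is informative

rfecv_demo = RFECV(
    estimator=RandomForestClassifier(n_estimators=25, random_state=0),
    step=1,
    cv=5,
    scoring="accuracy",
).fit(X_demo, y_demo)

selected = np.flatnonzero(rfecv_demo.support_)   # indices of the kept columns
print("optimal number of features:", rfecv_demo.n_features_)
print("selected columns:", selected)
```

The optimal count can vary between runs and folds when several subset sizes score almost equally, so it is worth looking at the whole score curve and not only at `n_features_`.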

In [None]:
from sklearn.feature_selection import RFECV

# The "accuracy" scoring is proportional to the number of correct classifications
clf_rf_4 = RandomForestClassifier() 
rfecv = RFECV(estimator=clf_rf_4, step=1, cv=5,scoring='accuracy')   #5-fold cross-validation
rfecv = rfecv.fit(x_train, y_train)

print('Optimal number of features :', rfecv.n_features_)
print('Best features :', x_train.columns[rfecv.support_])

Finally, we find that the best 2 features for classification are `'sensor6', 'sensor8'`. Let's look at the best accuracy with a plot.

In [None]:
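# Plot number of features vs. cross-validation scores
import matplotlib.pyplot as plt
# grid_scores_ was removed in scikit-learn 1.2; newer versions expose cv_results_
if hasattr(rfecv, "cv_results_"):
    cv_scores = rfecv.cv_results_["mean_test_score"]
else:
    cv_scores = rfecv.grid_scores_
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score of number of selected features")
plt.plot(range(1, len(cv_scores) + 1), cv_scores)
plt.show()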

__Observation:__ Let's look at what we did up to this point. 
Let's accept that this data is very easy to classify. However, our main purpose is not actually to reach a high accuracy. Our purpose is to learn how to do feature selection and to understand the data. Now let's apply our last feature selection method.

### 5) Tree based feature selection and random forest classification

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html The random forest classifier has a `feature_importances_` attribute that holds the feature importances (the higher, the more important the feature). 

__!!! To use the `feature_importances_` attribute, the training data should not contain correlated features. Random forest draws random subsets at each iteration, therefore the order of the feature importance list can change between runs.__
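
One way to reduce the run-to-run variation is to average the importances over several seeds before sorting. This is a minimal sketch on made-up data (the column names `s0` to `s3` and the number of seeds are assumptions), not the procedure used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
y_demo = np.repeat([1, -1], 100)
X_demo = pd.DataFrame(rng.normal(size=(200, 4)),
                      columns=["s0", "s1", "s2", "s3"])
X_demo["s1"] += 3.0 * (y_demo == 1)       # only s1 is informative

# Fit one forest per seed and collect its importances.
runs = [
    RandomForestClassifier(n_estimators=50, random_state=seed)
    .fit(X_demo, y_demo)
    .feature_importances_
    for seed in range(5)
]
mean_imp = pd.Series(np.mean(runs, axis=0), index=X_demo.columns)
ranking = mean_imp.sort_values(ascending=False)   # most important feature first
print(ranking)
```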

In [None]:
clf_rf_5 = RandomForestClassifier()      
clr_rf_5 = clf_rf_5.fit(x_train,y_train)
importances = clr_rf_5.feature_importances_
# per-tree spread of the importances; use the forest fitted in this cell (clf_rf_5)
std = np.std([tree.feature_importances_ for tree in clf_rf_5.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(x_train.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances of the forest

plt.figure(1, figsize=(14, 13))
plt.title("Feature importances")
plt.bar(range(x_train.shape[1]), importances[indices],
       color="g", yerr=std[indices], align="center")
plt.xticks(range(x_train.shape[1]), x_train.columns[indices],rotation=90)
plt.xlim([-1, x_train.shape[1]])
plt.show()

__Observation:__ As you can see in the plot above, after the 5 best features the importance of the features decreases. Therefore we can focus on these 5 features. As I said before, I care about understanding the features and finding the best of them.

### 6) Feature Extraction with PCA


http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html We will use principal component analysis (PCA) for feature extraction. Before PCA, we need to normalize the data for better PCA performance.
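
The sketch below illustrates the normalize-then-PCA idea on made-up data (`X_demo` is an assumption). The test split is scaled with the statistics of the training split so that both live on the same scale, and the cumulative explained variance shows how many components are needed to reach a given share.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.random((200, 6))
X_tr, X_te = train_test_split(X_demo, test_size=0.3, random_state=42)

# Mean-center and divide by the range, using training statistics only.
train_mean = X_tr.mean(axis=0)
train_range = X_tr.max(axis=0) - X_tr.min(axis=0)
X_tr_n = (X_tr - train_mean) / train_range
X_te_n = (X_te - train_mean) / train_range   # reuse the training statistics

pca_demo = PCA().fit(X_tr_n)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print("cumulative explained variance:", np.round(cumulative, 3))
```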

In [None]:
# split data train 70 % and test 30 %
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
# normalization: mean-center and scale by the range, using training statistics for both sets
x_train_N = (x_train-x_train.mean())/(x_train.max()-x_train.min())
x_test_N = (x_test-x_train.mean())/(x_train.max()-x_train.min())

from sklearn.decomposition import PCA
pca = PCA()
pca.fit(x_train_N)

plt.figure(1, figsize=(14, 13))
plt.clf()
plt.axes([.2, .2, .7, .7])
plt.plot(pca.explained_variance_ratio_, linewidth=2)
plt.axis('tight')
plt.xlabel('n_components')
plt.ylabel('explained_variance_ratio_')

__Observation:__ 

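* According to the explained variance ratio, **2 components can be chosen.**

__Observation:__ 

* The __RFECV method__ selected `sensor6` and `sensor8`, and __feature extraction with PCA__ likewise suggests that two dimensions suffice (PCA components are combinations of sensors, so PCA itself does not name individual sensors). Let's use these two sensors and see if the accuracy rises.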

In [None]:
yy = df.class_label
# keep only sensor6 and sensor8 (and avoid shadowing the built-in `list`)
drop_cols_final = ['sample_index','class_label','sensor0', 'sensor1', 'sensor2',
       'sensor3', 'sensor4', 'sensor5', 'sensor7', 'sensor9']
XX_f = df.drop(columns=drop_cols_final)
XX_f.head()


In [None]:
xx_train, xx_test, yy_train, yy_test = train_test_split(XX_f, yy, test_size=0.3, random_state=42)
clf_rf_final = RandomForestClassifier()      
clf_rf_final = clf_rf_final.fit(xx_train,yy_train)
importances = clf_rf_final.feature_importances_
# per-tree spread of the importances; use the two-feature forest fitted above (clf_rf_final)
std = np.std([tree.feature_importances_ for tree in clf_rf_final.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(xx_train.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances of the forest

plt.figure(1, figsize=(14, 13))
plt.title("Feature importances")
plt.bar(range(xx_train.shape[1]), importances[indices],
       color="g", yerr=std[indices], align="center")
plt.xticks(range(xx_train.shape[1]), xx_train.columns[indices],rotation=90)
plt.xlim([-1, xx_train.shape[1]])
plt.show()

In [None]:
ac_f = accuracy_score(yy_test,clf_rf_final.predict(xx_test))
print('Accuracy is: ',ac_f)
cm_f = confusion_matrix(yy_test,clf_rf_final.predict(xx_test))
sns.heatmap(cm_f,annot=True,fmt="d")

# Conclusion

In short, I tried to show the importance of feature selection and data visualization. 

The original data includes 10 features, but after feature selection we reduce this number from 10 to 2 with an accuracy of __99%__. 

In this kernel we only tried basic things. I am sure that with these data visualization and feature selection methods, you can easily exceed 99% accuracy. 

Or maybe you can use other classification methods.

# Your remarks and suggestions are much appreciated.  💡💡💡

## ⬆️⬆️ Upvote this notebook if you find it useful ⬆️⬆️
