<h1><center>Exploratory Data Analysis (EDA)</center></h1>

<center><h2><i>Tabular data</i></h2></center>

This notebook lets users explore their data quickly. The goal is to automate this part of the data science lifecycle and to provide insights and intuitions about the data. It uses pandas-profiling, dataprep.eda and sweetviz for the exploration. The following parts then cover feature selection, feature extraction and feature importance with different methods and algorithms.

<h2>Content</h2>

- [Sand box to load and clean data](#part_0)
- [I - EDA](#part_1)
    - [I-1 pandas-profiling](#part_1_1)
    - [I-2 Dataprep EDA](#part_1_2)
    - [I-3 Train/test comparison with sweetviz](#part_1_4)
    - [I-4 Target identification](#part_1_3)
- [II - Feature selection](#part_2)
    - [II-1 Removing features with low variance](#part_2_1)
    - [II-2 Univariate Selection](#part_2_2)
    - [II-3 Recursive Feature Elimination (RFE)](#part_2_3)
- [III - Feature Extraction](#part_3)
    - [III-1 Principal Component Analysis (PCA)](#part_3_1)
    - [III-2 Independent Component Analysis (ICA)](#part_3_2)
    - [III-3 Linear Discriminant Analysis (LDA)](#part_3_3)
    - [III-4 Locally Linear Embedding (LLE)](#part_3_4)
    - [III-5 t-distributed Stochastic Neighbor Embedding (t-SNE)](#part_3_5)
- [IV - Feature Importance](#part_4)
    - [IV-1 Tree method](#part_4_1)
    - [IV-2 Permutation Method](#part_4_2)

---

<h2><a id="part_0">Sand box to load and clean data</a></h2>

---

In [2]:
#!pip install rfpimp 
#!pip install eli5
#!pip install pandas-profiling
#!pip install dataprep

Defaulting to user installation because normal site-packages is not writeable
Collecting dataprep
ERROR: Could not install packages due to an EnvironmentError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Max retries exceeded with url: /packages/d0/3a/0e5c9c6645f7d470b96c82f86a9067919bc71abbc331e45c2ec0bb53133f/dataprep-0.2.7-py3-none-any.whl (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])")))


In [4]:
%matplotlib inline
import pandas as pd
from pandas_profiling import ProfileReport
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
#importing all the libraries
from dataprep.eda import plot, plot_correlation, plot_missing
import sweetviz

In [5]:
# Use the Boston housing prices dataset to test the pipeline
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.boston_housing.load_data()

df = pd.DataFrame(X_train)
df.loc[:, "target"] = y_train
df_ = pd.DataFrame(X_test)
df_.loc[:, "target"] = y_test
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df = pd.concat([df, df_])

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/boston_housing.npz


In [6]:
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,target
0,1.23247,0.0,8.14,0.0,0.538,6.142,91.7,3.9769,4.0,307.0,21.0,396.90,18.72,15.2
1,0.02177,82.5,2.03,0.0,0.415,7.610,15.7,6.2700,2.0,348.0,14.7,395.38,3.11,42.3
2,4.89822,0.0,18.10,0.0,0.631,4.970,100.0,1.3325,24.0,666.0,20.2,375.52,3.26,50.0
3,0.03961,0.0,5.19,0.0,0.515,6.037,34.5,5.9853,5.0,224.0,20.2,396.90,8.01,21.1
4,3.69311,0.0,18.10,0.0,0.713,6.376,88.4,2.5671,24.0,666.0,20.2,391.43,14.65,17.7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
97,3.47428,0.0,18.10,1.0,0.718,8.780,82.9,1.9047,24.0,666.0,20.2,354.55,5.29,21.9
98,0.07896,0.0,12.83,0.0,0.437,6.273,6.0,4.2515,5.0,398.0,18.7,394.92,6.78,24.1
99,1.83377,0.0,19.58,1.0,0.605,7.802,98.2,2.0407,5.0,403.0,14.7,389.61,1.92,50.0
100,0.35809,0.0,6.20,1.0,0.507,6.951,88.5,2.8617,8.0,307.0,17.4,391.70,9.71,26.7


---

<h2><a id="part_1">I - EDA</a></h2>

<h3><a id="part_1_1">I-1 Pandas-profiling</a></h3>

---

In [None]:
profile = ProfileReport(df, title='Pandas Profiling Report', explorative=True) 

In [None]:
# show the results in widget
profile.to_widgets()

In [None]:
# show the html inside the notebook
profile.to_notebook_iframe()

---

<h3><a id="part_1_2">I-2 Dataprep EDA</a></h3>

---

In [None]:
#API Plot
plot(df) 

In [None]:
#API Correlation
plot_correlation(df)

In [None]:
#API Missing Value
plot_missing(df) 

---

<h3><a id="part_1_4">I-3 Train/test comparison with sweetviz</a></h3>

---

In [7]:
# Split the data in 80 / 20 
train = df[:int(len(df)*0.8)]
test = df[int(len(df)*0.8):]

In [8]:
my_report = sweetviz.compare([train, "Train"], [test, "Test"], "target")

:FEATURES DONE:                    |█████████████████████| [100%]   00:13  -> (00:00 left)
:PAIRWISE DONE:                    |█████████████████████| [100%]   00:00  -> (00:00 left)


Creating Associations graph... DONE!


In [9]:
my_report.show_html("Report.html") # Not providing a filename will default to SWEETVIZ_REPORT.html

---

<h3><a id="part_1_3">I-4 Target identification</a></h3>

---

In [None]:
df.columns

In [None]:
TARGET = "target"

In [None]:
y = np.array(df[TARGET])

In [None]:
X = df.loc[:, ~df.columns.isin([TARGET])]

In [None]:
X.columns

In [None]:
continuous = False
if y.dtype==float:
    print("The data is continuous")
    continuous=True
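
The dtype test above treats any float array as continuous, including class labels stored as floats (for example `0.0` / `1.0`). As a hedged alternative, not part of the original notebook, scikit-learn's `type_of_target` inspects the values themselves. The helper name `is_continuous_target` and the two small arrays are made up for illustration.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target


def is_continuous_target(y):
    """Return True when scikit-learn would treat y as a regression target."""
    # Hypothetical helper: wraps type_of_target for use as a boolean flag
    return type_of_target(y) in ("continuous", "continuous-multioutput")


# Illustrative targets: real-valued prices vs. class labels stored as floats
y_prices = np.array([15.2, 42.3, 50.0, 21.1])
y_labels = np.array([0.0, 1.0, 1.0, 0.0])

print(type_of_target(y_prices), is_continuous_target(y_prices))
print(type_of_target(y_labels), is_continuous_target(y_labels))
```

With such a helper, the flag could be set with `continuous = is_continuous_target(y)` instead of comparing dtypes.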

---

<h2><a id="part_2">II - Feature Selection</a></h2>

<h3><u><a id="part_2_1">Removing features with low variance</a></u></h3>

---

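Remove features whose variance falls below a threshold. The threshold used here, `.8 * (1 - .8)`, is the variance p(1 - p) of a boolean feature with p = 0.8, so boolean features that take the same value in more than 80% of the samples are dropped.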
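
As a minimal sketch of how this threshold behaves, the toy array `X_toy` below (made up for illustration, not part of the notebook's dataset) has a first column that is zero in 5 of 6 rows. Its variance is 5/36, about 0.14, which is below 0.8 * 0.2 = 0.16, so `VarianceThreshold` drops it and keeps the other two columns.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy boolean data (assumed for illustration): column 0 is zero in 5 of 6 rows
X_toy = np.array([[0, 0, 1],
                  [0, 1, 0],
                  [1, 0, 0],
                  [0, 1, 1],
                  [0, 1, 0],
                  [0, 1, 1]])

# Variance of a Bernoulli variable with p = 0.8
threshold = 0.8 * (1 - 0.8)
selector = VarianceThreshold(threshold=threshold)
X_reduced = selector.fit_transform(X_toy)

print("Per-feature variances:", selector.variances_)
print("Kept column positions:", selector.get_support(indices=True))
print("Shape after selection:", X_reduced.shape)
```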

In [None]:
from sklearn.feature_selection import VarianceThreshold

In [None]:
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
X_variance = sel.fit_transform(X)

In [None]:
print(f"The new number of features is: {X_variance.shape[1]}")

---

<h3><u><a id="part_2_2">Univariate Selection</a></u></h3>


Univariate selection is done here with the chi-squared test. The goal is to select the features that have the strongest relationship with the output variable. The chi-squared score applies to classification targets and requires non-negative feature values, so the cell below only runs when the target is not continuous.

In the scikit-learn package, the class that does this is <a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html">SelectKBest</a>.

---

In [None]:
if not continuous:
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import chi2

    #apply SelectKBest class to extract top 10 best features
    bestfeatures = SelectKBest(score_func=chi2, k=10)
    fit = bestfeatures.fit(X,y)
    dfscores = pd.DataFrame(fit.scores_)
    dfcolumns = pd.DataFrame(X.columns)
    #concat two dataframes for better visualization 
    featureScores = pd.concat([dfcolumns,dfscores],axis=1)
    featureScores.columns = ['Specs','Score']  #naming the dataframe columns
    print(featureScores.nlargest(10,'Score'))  #print 10 best features
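
When the target is continuous, as with the housing prices loaded earlier, the chi-squared cell above is skipped. One possible alternative, which is not what the original notebook does, is to keep `SelectKBest` and swap the score function for `f_regression`. The sketch below runs on synthetic data from `make_regression`; the names `X_demo`, `y_demo` and the column labels `f0`..`f7` are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic regression data standing in for X and y (illustration only)
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 n_informative=3, random_state=0)
X_demo = pd.DataFrame(X_demo, columns=[f"f{i}" for i in range(8)])

# Score each feature against the continuous target with a univariate F-test
selector = SelectKBest(score_func=f_regression, k=3).fit(X_demo, y_demo)

scores = pd.DataFrame({"Feature": X_demo.columns,
                       "Score": selector.scores_,
                       "p-value": selector.pvalues_})
print(scores.nlargest(3, "Score"))

# Names of the k features retained by the selector
selected = list(X_demo.columns[selector.get_support()])
print("Selected features:", selected)
```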

---

<h3><u><a id="part_2_3">Recursive Feature Elimination (RFE)</a></u></h3>

---

<i><b>SVM</b></i>

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV

In [None]:
if not continuous:
    svc = SVC(kernel="linear")
    # The "accuracy" scoring is proportional to the number of correct
    # classifications
    rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(2),
                  scoring='accuracy')
    rfecv.fit(X, y)

    print("Optimal number of features : %d" % rfecv.n_features_)

    # Plot number of features VS. cross-validation scores
    # (grid_scores_ was removed in scikit-learn 1.2; cv_results_ replaces it)
    mean_scores = rfecv.cv_results_["mean_test_score"]
    plt.figure()
    plt.xlabel("Number of features selected")
    plt.ylabel("Cross validation score (accuracy)")
    plt.plot(range(1, len(mean_scores) + 1), mean_scores)
    plt.show()

<i><b>Logistic Regression</b></i>

In [None]:
# Feature selection with RFE
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

In [None]:
if not continuous:
    # Keep the 3 best features found by recursive elimination
    model = LogisticRegression(solver='lbfgs', max_iter=5000)
    rfe = RFE(model, n_features_to_select=3)
    fit = rfe.fit(X, y)
    print("Num Features: %d" % fit.n_features_)
    print("Selected Features: %s" % fit.support_)
    print("Feature Ranking: %s" % fit.ranking_)

---

<h2><a id="part_3">III - Feature Extraction</a></h2>

---

<h3><u><a id="part_3_1">Principal Component Analysis (PCA)</a></u></h3>

---

In [1]:
from sklearn.decomposition import PCA

In [None]:
N_var = 2

In [None]:
pca = PCA(n_components=N_var)
X_pca = pca.fit_transform(X)
# Name the components PC1..PC{N_var} so the cell still works if N_var changes
df_pca = pd.DataFrame(data=X_pca, columns=[f"PC{i + 1}" for i in range(N_var)])
#df_pca.loc[:, TARGET]=df.loc[:, TARGET]
#df_pca[TARGET] = LabelEncoder().fit_transform(df_pca[TARGET])
#df_pca.head()

In [None]:
df_pca

In [None]:
plt.figure(figsize=(10,8))
plt.plot(df_pca["PC1"],df_pca["PC2"], ".")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.grid(True)
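
PCA looks for directions of maximum variance, so features measured on large scales can dominate the components when the data is not standardized. The cells above apply PCA to the raw features. The sketch below, on synthetic data that is not part of the notebook, compares the explained variance ratio with and without a `StandardScaler` step. The names `X_demo`, `raw_ratio` and `scaled_ratio` are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three independent synthetic features on very different scales
X_demo = np.column_stack([
    rng.normal(0, 1, 300),
    rng.normal(0, 100, 300),
    rng.normal(0, 0.01, 300),
])

# PCA on raw values: the large-scale feature dominates the first component
raw_pca = PCA(n_components=2).fit(X_demo)
raw_ratio = raw_pca.explained_variance_ratio_

# PCA after standardization: each feature contributes on an equal footing
scaled_pipe = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X_demo)
scaled_ratio = scaled_pipe.named_steps["pca"].explained_variance_ratio_

print("Explained variance ratio (raw):   ", raw_ratio)
print("Explained variance ratio (scaled):", scaled_ratio)
```

In the notebook, the same pipeline could be fitted on `X` before plotting PC1 against PC2 if the features are on different scales.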

---

<h3><u><a id="part_3_2">Independent Component Analysis (ICA)</a></u></h3>

---

In [None]:
from sklearn.decomposition import FastICA

In [None]:
ica = FastICA(n_components=N_var)
X_ica = ica.fit_transform(X)

In [None]:
X_ica

In [None]:
plt.figure(figsize=(10,8))
plt.plot(X_ica[:, 0],X_ica[:,1], ".")
plt.xlabel("ICA0")
plt.ylabel("ICA1")
plt.grid(True)

---

<h3><u><a id="part_3_3">Linear Discriminant Analysis (LDA)</a></u></h3>

---

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

In [None]:
if not continuous:
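    # LDA yields at most min(n_features, n_classes - 1) components
    n_lda = min(N_var, X.shape[1], len(np.unique(y)) - 1)
    lda = LinearDiscriminantAnalysis(n_components=n_lda)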

    # run an LDA and use it to transform the features
    X_lda = lda.fit(X, y).transform(X)
    print('Original number of features:', X.shape[1])
    print('Reduced number of features:', X_lda.shape[1])

---

<h3><u><a id="part_3_4">Locally Linear Embedding (LLE)</a></u></h3>

---

In [None]:
from sklearn.manifold import locally_linear_embedding

In [None]:
lle, error = locally_linear_embedding(X, n_neighbors=5, n_components=N_var, random_state=42, n_jobs=-1)

In [None]:
print(f"The squared reconstruction error is: {error}")

In [None]:
plt.figure(figsize=(10,8))
# Colour the points by target value; cmap has no effect without c
plt.scatter(lle[:, 0], lle[:, 1], c=y, cmap=plt.cm.Spectral)
plt.xlabel("lle0")
plt.ylabel("lle1")
plt.grid(True)

---

<h3><u><a id="part_3_5">t-distributed Stochastic Neighbor Embedding (t-SNE)</a></u></h3>

---

In [None]:
from sklearn.manifold import TSNE

In [None]:
X_embedded = TSNE(n_components=N_var).fit_transform(X)

In [None]:
plt.figure(figsize=(10,8))
# Colour the points by target value; cmap has no effect without c
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y, cmap=plt.cm.Spectral)
plt.xlabel("t-SNE0")
plt.ylabel("t-SNE1")
plt.grid(True)

---

<h2><a id="part_4">IV - Feature Importance</a></h2>

---

<h3><u><a id="part_4_1">Tree method</a></u></h3>

---

In [None]:
from sklearn.ensemble import ExtraTreesClassifier

In [None]:
if not continuous:
    # Build a forest and compute the impurity-based feature importances
    forest = ExtraTreesClassifier(n_estimators=250,
                                  random_state=0)

    forest.fit(X, y)
    importances = forest.feature_importances_
    std = np.std([tree.feature_importances_ for tree in forest.estimators_],
                 axis=0)
    indices = np.argsort(importances)[::-1]

    # Print the feature ranking
    print("Feature ranking:")

    for f in range(X.shape[1]):
        print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

    # Plot the impurity-based feature importances of the forest
    plt.figure()
    plt.title("Feature importances")
    plt.bar(range(X.shape[1]), importances[indices],
            color="b", yerr=std[indices], align="center")
    plt.xticks(range(X.shape[1]), indices)
    plt.xlim([-1, X.shape[1]])
    plt.show()

In [None]:
if continuous:
    from sklearn.ensemble import RandomForestRegressor

    rf = RandomForestRegressor(n_estimators = 100,
                               n_jobs = -1,
                               oob_score = True,
                               bootstrap = True,
                               random_state = 42)
    rf.fit(X, y)

    print('R^2 Training Score: {:.2f} \nOOB Score: {:.2f} '.format(rf.score(X, y), rf.oob_score_))

    importances = rf.feature_importances_
    std = np.std([tree.feature_importances_ for tree in rf.estimators_],
                 axis=0)
    # Column positions sorted from most to least important
    indices = np.argsort(importances)[::-1]

    # sort_values returns a new DataFrame, so keep the result
    results = pd.DataFrame(data=importances, index=X.columns, columns=["Importance"])
    results = results.sort_values(by=["Importance"], ascending=False)

    # Plot bars, error bars and tick labels in the same sorted order
    plt.figure(figsize=(10,8))
    plt.title("Feature importances")
    plt.bar(range(X.shape[1]), importances[indices],
            color="b", yerr=std[indices], align="center")
    plt.xticks(range(X.shape[1]), X.columns[indices])
    plt.xlim([-1, X.shape[1]])
    plt.grid(True)
    plt.show()

---

<h3><u><a id="part_4_2">Permutation Method</a></u></h3>

---

In [None]:
if continuous:
    from sklearn.metrics import r2_score
    from rfpimp import permutation_importances

    def r2(rf, X_train, y_train):
        return r2_score(y_train, rf.predict(X_train))

    perm_imp_rfpimp = permutation_importances(rf, X, y, r2)
    # Sort explicitly so the bars go from most to least important
    importances = perm_imp_rfpimp["Importance"].sort_values(ascending=False)

    # Plot the permutation importances, labelled by feature
    plt.figure(figsize=(10,8))
    plt.title("Permutation feature importances (rfpimp)")
    plt.bar(range(len(importances)), importances.values,
            color="b", align="center")
    plt.xticks(range(len(importances)), importances.index)
    plt.xlim([-1, len(importances)])
    plt.grid(True)
    plt.show()

In [None]:
if continuous:
    import eli5
    from eli5.sklearn import PermutationImportance

    perm = PermutationImportance(rf, cv = None, refit = False, n_iter = 50).fit(X, y)
    importances = perm.feature_importances_
    # Column positions sorted from most to least important
    indices = np.argsort(importances)[::-1]

    # sort_values returns a new DataFrame, so keep the result
    results = pd.DataFrame(data=importances, index=X.columns, columns=["Importance"])
    results = results.sort_values(by=["Importance"], ascending=False)

    # Plot bars and tick labels in the same sorted order
    plt.figure(figsize=(10,8))
    plt.title("Permutation feature importances (eli5)")
    plt.bar(range(X.shape[1]), importances[indices],
            color="b", align="center")
    plt.xticks(range(X.shape[1]), X.columns[indices])
    plt.xlim([-1, X.shape[1]])
    plt.grid(True)
    plt.show()
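
The two cells above rely on the third-party packages rfpimp and eli5. As a hedged alternative that the original notebook does not use, scikit-learn ships its own `permutation_importance` in `sklearn.inspection`. The sketch below runs on synthetic data and scores a held-out split, which is an assumption of this example (the cells above score the training data). The names `X_demo`, `model` and `perm_results` are made up for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for X and y (illustration only)
X_demo, y_demo = make_regression(n_samples=300, n_features=6,
                                 n_informative=2, random_state=0)
X_demo = pd.DataFrame(X_demo, columns=[f"f{i}" for i in range(6)])
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each column n_repeats times and measure the drop in R^2
result = permutation_importance(model, X_val, y_val, scoring="r2",
                                n_repeats=10, random_state=0)

perm_results = pd.DataFrame({"Importance": result.importances_mean,
                             "Std": result.importances_std},
                            index=X_demo.columns)
perm_results = perm_results.sort_values("Importance", ascending=False)
print(perm_results)
```

The `Std` column can be passed as `yerr` to `plt.bar` to show the spread across repeats.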