## Exploratory Data Analysis

<b>Goal</b>: produce a statistical summary of the Iris dataset<br>
You can download the Iris data files from [this link](https://datahub.io/machine-learning/iris#data)

### 1) Loading data

Task: use <em>pandas</em> and <em>scipy</em> to load Iris from CSV, ARFF and JSON formats and convert them into dataframes

In [None]:
import pandas as pd
  
# Reading the CSV file
df = pd.read_csv("../example_datasets/iris.csv")
  
# Printing top 5 rows
df.head()

In [None]:
# Reading the JSON file
df = pd.read_json("../example_datasets/iris.json")
df.head()

In [None]:
from scipy.io.arff import loadarff

# Reading the ARFF file
data = loadarff('../example_datasets/iris_arff.arff')
df = pd.DataFrame(data[0])
df['class'] = df['class'].str.decode('utf-8')
df.head()
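
`loadarff` returns nominal attributes as byte strings, which is why the cell above decodes the *class* column. The sketch below repeats that step on an in-memory file. It assumes a tiny hypothetical ARFF sample (`arff_text`, not the real dataset) and decodes every byte-string column rather than naming one explicitly.

```python
from io import StringIO

import pandas as pd
from scipy.io.arff import loadarff

# Hypothetical two-row sample in ARFF format (not the real Iris file)
arff_text = """@relation iris_sample
@attribute sepallength numeric
@attribute sepalwidth numeric
@attribute class {Iris-setosa,Iris-virginica}
@data
5.1,3.5,Iris-setosa
6.3,3.3,Iris-virginica
"""

# loadarff accepts a path or a file-like object and returns (records, metadata)
records, meta = loadarff(StringIO(arff_text))
sample = pd.DataFrame(records)

# Nominal attributes arrive as bytes, e.g. b'Iris-setosa'
raw_value = sample.loc[0, "class"]

# Decode every column that holds bytes, leaving numeric columns untouched
for col in sample.columns:
    if isinstance(sample[col].iloc[0], bytes):
        sample[col] = sample[col].str.decode("utf-8")

print(type(raw_value), sample["class"].tolist())
```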

### 2) Summaries

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe()

How many observations do we have for each species?

In [None]:
df.value_counts("class")
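
Relative frequencies are often easier to compare than raw counts. The sketch below uses the `normalize=True` option of `value_counts`, assuming a hypothetical toy frame (`toy`) in place of the real data.

```python
import pandas as pd

# Hypothetical toy data standing in for the Iris dataframe
toy = pd.DataFrame(
    {"class": ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-virginica"]}
)

# Absolute number of observations per class
counts = toy["class"].value_counts()

# Share of observations per class (values sum to 1)
shares = toy["class"].value_counts(normalize=True)

print(counts)
print(shares)
```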

Let us replace two values of *sepalwidth* with NaN to simulate missing data

In [None]:
import numpy as np
df.loc[0:1,'sepalwidth'] = np.nan

Are there missing values in each variable?

In [None]:
df.isnull().sum()

We can choose to remove the affected rows or impute the missing values (e.g., *median* for numeric variables or *mode* for categorical ones)

In [None]:
df.dropna().head()

In [None]:
from sklearn.impute import SimpleImputer

# imputation strategies: mean, median, most_frequent
imp = SimpleImputer(strategy='mean', missing_values=np.nan, copy=True) 
df[['sepalwidth']] = imp.fit_transform(df[['sepalwidth']])
df.head()
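
The cell above imputes with the mean. The median and mode strategies can also be sketched in plain pandas, as an alternative to `SimpleImputer`. The frame `toy` and its values are hypothetical and chosen so that one extreme value makes the median visibly more robust than the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing numeric and one missing categorical value
toy = pd.DataFrame({
    "sepalwidth": [3.5, np.nan, 3.2, 2.9, 10.0],  # 10.0 is a deliberate extreme value
    "class": ["Iris-setosa", None, "Iris-setosa", "Iris-virginica", "Iris-setosa"],
})

filled = toy.copy()

# Numeric column: the median is less sensitive to the extreme value than the mean
filled["sepalwidth"] = filled["sepalwidth"].fillna(filled["sepalwidth"].median())

# Categorical column: the mode is the most frequent category
filled["class"] = filled["class"].fillna(filled["class"].mode().iloc[0])

print(filled)
```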

Are there duplicates?<br>
Let us check how many observations have a distinct combination of sepal length and width.

In [None]:
unique_sepal = df.drop_duplicates(subset=["sepallength","sepalwidth"])
unique_sepal.shape
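
`drop_duplicates` returns the surviving rows; `duplicated` flags the repeated ones, which helps when the goal is to count them. The sketch below uses a hypothetical toy frame (`toy`) and compares full-row duplicates with duplicates on the sepal columns only.

```python
import pandas as pd

# Hypothetical toy data: row 2 repeats row 0 entirely, row 3 repeats only its sepal values
toy = pd.DataFrame({
    "sepallength": [5.1, 4.9, 5.1, 5.1, 5.1],
    "sepalwidth": [3.5, 3.0, 3.5, 3.5, 3.8],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-setosa",
              "Iris-versicolor", "Iris-setosa"],
})

# Rows that repeat an earlier row in every column
full_duplicates = int(toy.duplicated().sum())

# Rows that repeat an earlier (sepallength, sepalwidth) pair
sepal_duplicates = int(toy.duplicated(subset=["sepallength", "sepalwidth"]).sum())

# Number of distinct sepal pairs, matching drop_duplicates(...).shape[0]
unique_pairs = toy.drop_duplicates(subset=["sepallength", "sepalwidth"]).shape[0]

print(full_duplicates, sepal_duplicates, unique_pairs)
```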

### 3) Simple visualization

Let us use <em>matplotlib</em>, <em>seaborn</em> and <em>plotly</em> to visualize the data

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.countplot(x='class', data=df)
plt.show()

Let us explore basic relationships between variables:<br>
- <em>setosa</em> has smaller sepal lengths but larger sepal widths, while <em>virginica</em> has larger sepal lengths but smaller sepal widths (first image)
- petal lengths and widths vary from smaller to larger for <em>setosa</em>, <em>versicolor</em> and <em>virginica</em> species (second image)

In [None]:
sns.scatterplot(x='sepallength', y='sepalwidth', hue='class', data=df)
  
# Placing Legend outside the Figure
plt.legend(bbox_to_anchor=(1, 1), loc=2)  
plt.show()

In [None]:
sns.scatterplot(x='petallength', y='petalwidth', hue='class', data=df)
plt.legend(bbox_to_anchor=(1, 1), loc=2)
plt.show()
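
The visual impressions from the two scatter plots can be cross-checked numerically with per-class means. This sketch assumes the copy of Iris bundled with <em>sklearn</em> (`load_iris`) rather than the downloaded files, so the class labels read `setosa` instead of `Iris-setosa`; the columns are renamed here to match the names used in this notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: use the Iris copy bundled with sklearn instead of the downloaded file
iris = load_iris()
cols = ["sepallength", "sepalwidth", "petallength", "petalwidth"]
bundled = pd.DataFrame(iris.data, columns=cols)
bundled["class"] = iris.target_names[iris.target]

# Mean of every numeric feature within each species
class_means = bundled.groupby("class").mean()
print(class_means)
```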

What about histogram views per variable?

In [None]:
plot = sns.FacetGrid(df, hue="class")
plot.map(sns.histplot, "sepallength").add_legend()

plot = sns.FacetGrid(df, hue="class")
plot.map(sns.histplot, "sepalwidth").add_legend()
  
plt.show()

And boxplots?

In [None]:
sns.boxplot(x="class", y='sepallength', data=df)      
plt.show()
sns.boxplot(x="class", y='sepalwidth', data=df)      
plt.show()
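
By default, seaborn draws boxplot whiskers at 1.5 times the interquartile range (IQR) and plots points beyond them individually. The sketch below applies the same rule by hand to hypothetical values (`values`). Quartile conventions can differ slightly between libraries, so treat the bounds as approximate.

```python
import pandas as pd

# Hypothetical measurements with one low and one high extreme value
values = pd.Series([3.9, 4.8, 5.0, 5.0, 5.1, 5.2, 5.4, 5.8, 7.9])

# Quartiles and interquartile range
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Conventional 1.5 * IQR fences
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the fences would be drawn as individual markers
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers.tolist())
```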

How can we display all pairwise variable relationships at once?

In [None]:
sns.pairplot(df, hue='class', height=2)

Are input variables correlated?

In [None]:
df.corr(method='pearson', numeric_only=True)

In [None]:
sns.heatmap(df.corr(method='pearson', numeric_only=True), annot=True)
plt.show()
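
To see what the Pearson coefficient measures, it can be recomputed from its definition: the sum of products of centered values divided by the square root of the product of their sums of squares. The sketch below does this on hypothetical petal measurements (`toy`) and compares the result with `DataFrame.corr`.

```python
import numpy as np
import pandas as pd

# Hypothetical petal measurements
toy = pd.DataFrame({
    "petallength": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "petalwidth": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
})

# Center both variables on their means
x = toy["petallength"] - toy["petallength"].mean()
y = toy["petalwidth"] - toy["petalwidth"].mean()

# Pearson r from its definition
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# Same coefficient as computed by pandas
r_pandas = toy.corr(method="pearson").loc["petallength", "petalwidth"]

print(r_manual, r_pandas)
```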

### 4) Keep going...

Some of the <em>sklearn</em> facilities require input data to be separated from output data

In [None]:
X = df.drop('class', axis=1)
y = df['class']

Let us, for instance, check the discriminative power of each feature according to the <em>f_classif</em> criterion (the ANOVA F-test)

In [None]:
from sklearn.feature_selection import f_classif

fimportance = f_classif(X, y)

print('features', X.columns.values)
print('scores', fimportance[0])
print('pvalues', fimportance[1])
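
`f_classif` scores each feature with a one-way ANOVA F-statistic across the classes. The sketch below reproduces the scores with `scipy.stats.f_oneway` and ranks the features from most to least discriminative. It assumes the Iris copy bundled with <em>sklearn</em>; the names `X_demo` and `y_demo` are illustrative.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

# Assumption: use the Iris copy bundled with sklearn
iris = load_iris()
X_demo, y_demo = iris.data, iris.target

# F-statistics and p-values, one per feature
f_scores, p_values = f_classif(X_demo, y_demo)

# Same statistic from a per-feature one-way ANOVA over the class groups
anova_scores = np.array([
    f_oneway(*(X_demo[y_demo == label, j] for label in np.unique(y_demo))).statistic
    for j in range(X_demo.shape[1])
])

# Feature indices ordered from highest to lowest F-statistic
ranking = np.argsort(f_scores)[::-1]
print([iris.feature_names[j] for j in ranking])
```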

Now it is your turn to unlock the world of <em>sklearn</em> facilities. Enjoy the journey!