
# Exploratory Data Analysis

EDA is about understanding your data through summary statistics and visualization. At a high level, EDA has two parts: univariate analysis and multivariate analysis.
Let's work through an example data set to learn this in practice.


### Iris Data

The Iris dataset is one of the best-known datasets in the pattern recognition literature. It is hosted at the UC Irvine Machine Learning Repository. The data set contains petal length, petal width, sepal length, and sepal width measurements for 3 types of Iris flowers: Setosa, Versicolor, and Virginica.

In [0]:
# Code to read csv file into Colaboratory:
!pip install -U -q PyDrive
import os
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [0]:
# Relies on `os` and the authenticated `drive` client created in the previous cell.

class download_data_from_folder(object):
    """Recursively copy a shared Google Drive folder into the local working directory."""

    def __init__(self, path):
        # The folder id is everything after 'id=' in the sharing URL
        path_id = path[path.find('id=') + 3:]
        self.file_list = self.get_files_in_location(path_id)
        self.unwrap_data(self.file_list)

    def get_files_in_location(self, folder_id):
        # List every non-trashed item whose parent is the given folder
        query = "'{}' in parents and trashed=false".format(folder_id)
        return drive.ListFile({'q': query}).GetList()

    def unwrap_data(self, file_list, directory='.'):
        for i, file in enumerate(file_list):
            print(str((i + 1) / len(file_list) * 100) + '% done copying')
            target = os.path.join(directory, file['title'])
            if file['mimeType'].find('folder') != -1:
                # Sub-folder: create it locally, then recurse into it
                if not os.path.exists(target):
                    os.makedirs(target)
                print('Copying folder ' + target)
                self.unwrap_data(self.get_files_in_location(file['id']), target)
            elif not os.path.exists(target):
                # Regular file: download it unless a local copy already exists
                downloaded = drive.CreateFile({'id': file['id']})
                downloaded.GetContentFile(target)

In [0]:
data_path = 'https://drive.google.com/open?id=13hFQ09ptYr-Ud5xOJ0Xx4cV0akc1RnZw'
download_data_from_folder(data_path)

In [0]:
from IPython.display import Image
Image(filename='Iris_Flower.png', width=500)

### Load Data

The Iris dataset ships with the scikit-learn datasets package, which contains some of the popular datasets from the machine learning literature.

In [0]:
import warnings
warnings.filterwarnings('ignore')

from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
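
Before converting anything, it can help to look at what `load_iris()` returns. The sketch below is a minimal illustration, assuming the default dictionary-like return value; the name `bunch` is hypothetical and is used only so that the notebook's `iris` variable is left untouched.

```python
from sklearn import datasets

# Hypothetical name `bunch`, chosen so the notebook's `iris` variable is not overwritten
bunch = datasets.load_iris()

print(bunch.data.shape)     # feature matrix: one row per flower, one column per measurement
print(bunch.feature_names)  # labels for the columns of the feature matrix
print(bunch.target_names)   # species names, indexed by the integer codes in bunch.target
print(bunch.target[:5])     # first few integer class codes
```

The feature matrix and the integer targets are separate arrays, which is why the next step joins them into a single DataFrame.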

Before exploring the data, convert it to a pandas DataFrame with readable column names and species labels.

In [0]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Let's convert to dataframe
iris = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
                     columns= iris['feature_names'] + ['species'])

# let's remove spaces from column name
iris.columns = iris.columns.str.replace(' ','')

# replace the values with class labels
iris.species = np.where(iris.species == 0.0, 'setosa', np.where(iris.species==1.0,'versicolor', 'virginica'))

# data dimension 
print(iris.shape)

# Peek at the 1st few records
iris.head()

### Univariate Analysis

Individual variables are analyzed in isolation to get a better understanding of them. 

Pandas provides the describe function, which creates summary statistics in tabular format for all numeric variables. These statistics help reveal quality issues such as missing values (through the count row) and outliers (through the min, max, and quartile rows).

In [0]:
iris.describe()
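
The two quality checks mentioned above can also be made explicit. The following is a rough sketch rather than a definitive procedure: it rebuilds the data under the hypothetical name `df` so that it runs on its own, counts missing values per column, and counts values that fall more than 1.5 times the interquartile range beyond the quartiles. The 1.5 × IQR cutoff is an assumption borrowed from the usual box plot whisker convention, not something the notebook prescribes.

```python
import pandas as pd
from sklearn import datasets

# Hypothetical names `bunch` and `df`; the data is rebuilt so the sketch is self-contained
bunch = datasets.load_iris()
df = pd.DataFrame(bunch.data, columns=bunch.feature_names)

# Count missing values per column
missing = df.isnull().sum()
print(missing)

# Flag values more than 1.5 * IQR below the first or above the third quartile
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1
outside = (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)
outlier_counts = outside.sum()
print(outlier_counts)
```

A nonzero count is only a prompt to look closer; whether a flagged value is an error or a genuine extreme depends on the data.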

The column 'species' is categorical, so let's check the frequency distribution of its categories.

In [0]:
print(iris['species'].value_counts())
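
When the interest is class balance rather than raw counts, the same call can report proportions. This is a small sketch, assuming a species column rebuilt from scikit-learn under the hypothetical name `species`.

```python
import pandas as pd
from sklearn import datasets

# Hypothetical names; rebuild the species labels so the sketch runs on its own
bunch = datasets.load_iris()
species = pd.Series(bunch.target_names[bunch.target], name="species")

# normalize=True returns the share of each category instead of its count
shares = species.value_counts(normalize=True)
print(shares)
```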

The counts show that 'species' has 3 categories with 50 records each. Pandas also provides plotting functions for quick visualization of the attributes.

In [0]:
# Plot a histogram of each numeric column; figsize sets the size of the figure
iris.hist(figsize=(15, 8))
plt.suptitle("Histogram", fontsize=12)  # use suptitle to add a title above all subplots
plt.tight_layout(pad=1)
plt.show()

# Plot a boxplot of each numeric column
iris.boxplot()
plt.title("Box Plot", fontsize=16)
plt.tight_layout(pad=1)
plt.show()

### Multivariate Analysis

In multivariate analysis you try to establish how the variables relate to one another.

Let's compare the distribution and the mean of each feature by species.

In [0]:
# We can quickly make a boxplot with Pandas on each feature split out by species
iris.boxplot(by="species", figsize=(12, 6))

In [0]:
# print the mean for each column by species
iris.groupby(by = "species").mean()
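
The mean alone hides how much each species varies. As a hedged extension of the cell above, the sketch below reports the mean and the standard deviation side by side; `df` and `stats` are hypothetical names, and the data is rebuilt so the block can run independently.

```python
import pandas as pd
from sklearn import datasets

# Hypothetical names; rebuild a labeled DataFrame so the sketch is self-contained
bunch = datasets.load_iris()
df = pd.DataFrame(bunch.data, columns=bunch.feature_names)
df["species"] = bunch.target_names[bunch.target]

# One row per species, with a (feature, statistic) pair for each column
stats = df.groupby("species").agg(["mean", "std"])
print(stats.round(2))
```

A small standard deviation relative to the gap between two species' means suggests that the feature separates them well.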

In [0]:
# plot for mean of each feature for each label class
iris.groupby(by = "species").mean().plot(kind="bar")

plt.title('Class vs Measurements')
plt.ylabel('mean measurement (cm)')
plt.xticks(rotation=0)  # manage the xticks rotation
plt.grid(True)

# Use bbox_to_anchor option to place the legend outside plot area to be tidy 
plt.legend(loc="upper left", bbox_to_anchor=(1,1))

### Correlation matrix
The corr function uses the Pearson correlation coefficient by default, which yields a number between -1 and 1. A coefficient near -1 indicates a strong negative relationship, and a coefficient near 1 indicates a strong positive relationship.

In [0]:
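# create correlation matrix from the numeric columns only
# ('species' holds text labels, which recent pandas versions refuse to correlate)
corr = iris.drop(columns='species').corr()
print(corr)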
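
To make the coefficient less of a black box, the sketch below computes Pearson's r for one pair of features directly from its definition, the sum of products of centered values divided by the square root of the product of the centered sums of squares, and compares it with the pandas result. The names `df`, `r_manual`, and `r_pandas` are hypothetical, and the data is rebuilt so the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn import datasets

# Hypothetical names; rebuild the numeric data so the sketch is self-contained
bunch = datasets.load_iris()
df = pd.DataFrame(bunch.data, columns=bunch.feature_names)

x = df["petal length (cm)"].to_numpy()
y = df["petal width (cm)"].to_numpy()

# Center both variables, then apply the Pearson formula
xc = x - x.mean()
yc = y - y.mean()
r_manual = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

# The same quantity as computed by pandas
r_pandas = df["petal length (cm)"].corr(df["petal width (cm)"])
print(r_manual, r_pandas)
```

The coefficient describes how closely the points follow a straight line; it is not the fraction of observations that move together.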

In [0]:
import statsmodels.api as sm
sm.graphics.plot_corr(corr, xnames=list(corr.columns))
plt.show()

### Pair plot

You can understand the relationships between attributes by looking at the joint distribution of each pair of attributes.

This uses a built-in function to create a matrix of scatter plots of every attribute against every other attribute.

In [0]:
from pandas.plotting import scatter_matrix
scatter_matrix(iris, figsize=(10, 10))
plt.suptitle("Pair Plot", fontsize=20) # use suptitle to add a title above all subplots
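
A pair plot becomes more informative when the points are colored by species. The sketch below is one possible way to do this, assuming that extra keyword arguments such as `c` are forwarded to the underlying matplotlib scatter call; `df` and `axes` are hypothetical names.

```python
import pandas as pd
from pandas.plotting import scatter_matrix
from sklearn import datasets

# Hypothetical names; rebuild the numeric data so the sketch is self-contained
bunch = datasets.load_iris()
df = pd.DataFrame(bunch.data, columns=bunch.feature_names)

# Color each point by its integer species code (0, 1, or 2)
axes = scatter_matrix(df, c=bunch.target, figsize=(10, 10), diagonal="hist", alpha=0.8)
```

With the colors in place, it is easier to see which pairs of features separate the species and which pairs leave them overlapping.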

### Findings of EDA

* There are no missing values
* Sepals are longer than petals. Sepal length ranges from 4.3 to 7.9 cm with an average of about 5.84 cm, whereas petal length ranges from 1 to 6.9 cm with an average of about 3.76 cm
* Sepals are also wider than petals. Sepal width ranges from 2 to 4.4 cm with an average of about 3.06 cm, whereas petal width ranges from 0.1 to 2.5 cm with an average of about 1.20 cm

* The average petal length of setosa is much smaller than that of versicolor and virginica; however, the average sepal width of setosa is higher than that of versicolor and virginica

* Petal length and petal width are strongly positively correlated, with a correlation coefficient of about 0.96. The coefficient measures how closely the two follow a straight-line relationship; it is not a percentage of observations

* Petal length has a moderate negative correlation with sepal width, with a coefficient of roughly -0.4: flowers with wider sepals tend to have shorter petals
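
The ranges and averages quoted in the findings can be reproduced in a single table. This is a minimal sketch, assuming the data is rebuilt from scikit-learn under the hypothetical names `df` and `summary`; exact averages may differ slightly between versions of the dataset.

```python
import pandas as pd
from sklearn import datasets

# Hypothetical names; rebuild the numeric data so the sketch is self-contained
bunch = datasets.load_iris()
df = pd.DataFrame(bunch.data, columns=bunch.feature_names)

# One row per statistic, one column per measurement
summary = df.agg(["min", "max", "mean"]).round(2)
print(summary)
```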

#### Initial conclusion from the data:
Based on the length and width of sepals and petals alone, versicolor and virginica appear similar in size, whereas setosa looks noticeably different from the other two.
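
One rough way to put a number on this impression is to compare the distances between the species' average measurements. The sketch below is an illustration rather than part of the original analysis: `centroid_distance` is a hypothetical helper, and plain Euclidean distance on the unscaled measurements is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn import datasets

# Hypothetical names; rebuild a labeled DataFrame so the sketch is self-contained
bunch = datasets.load_iris()
df = pd.DataFrame(bunch.data, columns=bunch.feature_names)
df["species"] = bunch.target_names[bunch.target]

# Mean measurement vector (centroid) of each species
centroids = df.groupby("species").mean()

def centroid_distance(a, b):
    """Euclidean distance between the mean measurement vectors of species a and b."""
    return float(np.linalg.norm(centroids.loc[a] - centroids.loc[b]))

pairs = [("setosa", "versicolor"), ("setosa", "virginica"), ("versicolor", "virginica")]
for a, b in pairs:
    print(a, b, round(centroid_distance(a, b), 2))
```

A smaller distance between versicolor and virginica than between either of them and setosa would be consistent with the conclusion above.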

### Let's look at the actual flowers to see if the finding makes sense!

In [0]:
Image(filename='iris_photo.png', width=900)