# Introduction to Exploratory Data Analysis
#### Agenda


*   **Loading Libraries**
*   **Loading Data** 
*   **Data Analysis**
*   **Data Visualization**



In this tutorial, you will learn one of the important steps of a Data Science pipeline, i.e. Exploratory Data Analysis.

**Exploratory Data Analysis (EDA):** In this step, we try to understand the data and the underlying interactions between different variables.

The dataset we will be using here is the wine quality data. Kaggle provides only the red wine quality dataset, so the data I have used is taken from the UCI Machine Learning Repository. Both the red wine and the white wine data are used in this notebook.

**Data Set Information:**
The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. Quality is based on sensory scores (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).

## Loading Libraries
Not all Python capabilities are loaded into our working environment by default (even if they are already installed on your system). So, we import every library that we want to use.

In data science, numpy and pandas are the most commonly used libraries. Numpy is used for calculations such as means, medians, and square roots. Pandas is used for data processing and data frames. We choose alias names for our libraries for convenience (numpy --> np and pandas --> pd).

In [None]:
import numpy as np        # Fundamental package for linear algebra and multidimensional arrays
import pandas as pd       # Data analysis and manipulation tool

## Loading Data
The pandas module is used for reading files. Our data is in '.csv' format, so we will use the 'read_csv()' function to load it.

In [None]:
# In read_csv() function, we have passed the location to where the files are located in the UCI website. The data is separated by ';'
# so we used separator as ';' (sep = ";")
red_wine_data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";")
white_wine_data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", sep=";")
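To see why the separator matters, here is a minimal sketch that parses a tiny in-memory sample instead of the UCI files. The sample text and the names `sample_csv`, `wrong`, and `right` are made up for illustration; only `pd.read_csv` and its `sep` parameter come from the cell above.

```python
import io
import pandas as pd

# Hypothetical two-row sample laid out like the UCI files (values separated by ';')
sample_csv = (
    "fixed acidity;pH;quality\n"
    "7.4;3.51;5\n"
    "7.8;3.20;5\n"
)

# With the default separator (','), every line is read as one single field
wrong = pd.read_csv(io.StringIO(sample_csv))

# With sep=";", the three columns are split correctly
right = pd.read_csv(io.StringIO(sample_csv), sep=";")

print(wrong.shape)
print(right.shape)
print(list(right.columns))
```

If a freshly loaded dataframe shows only one column, a wrong separator is a likely cause.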

Let's check how our data looks. This can be done using the head() method.

In [None]:
# Red Wine
red_wine_data.head()    # we can also pass the number of records we want in the brackets (). By default it displays first 5 records.

In [None]:
# White Wine
white_wine_data.head(6)      # we will get first 6 records from white wine data

Let's explore the attributes / columns of the datasets. Both red and white wines have the same columns.



In [None]:
# Columns / Attribute
red_wine_data.columns

### Target Variable:
The target variable of a dataset is the feature of a dataset about which you want to gain a deeper understanding. It is the variable that is, or should be the output.

Here **quality** is the target variable, as we're trying to find out which of the two types of wine has the better quality.




### Input Variables:
One or more variables that are used to determine (or predict) the 'Target Variable' are known as Input Variables. They are sometimes called Predictor Variables as well.

In our example, the input variables are: 'fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density','pH', 'sulphates', and 'alcohol'. 

All of these will help us predict the quality of the wine.
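In code, this split is commonly written as one dataframe of inputs and one series for the target. The sketch below uses a made-up three-row table; the names `sample`, `X`, and `y` are assumptions for illustration and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical miniature of the wine table (values are made up for illustration)
sample = pd.DataFrame({
    "alcohol": [9.4, 9.8, 11.2],
    "pH": [3.51, 3.20, 3.16],
    "quality": [5, 5, 6],
})

X = sample.drop(columns="quality")   # input variables: every column except the target
y = sample["quality"]                # target variable

print(list(X.columns))
print(y.tolist())
```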


### Variables/Features:
Variables and features are the same thing, and the two terms are often used interchangeably. All the columns in a dataset are variables.

## Exploratory Data Analysis (EDA)
After loading the data, it is important to examine it. It is usually not recommended to throw all the data into a model without understanding it first. This step usually helps in improving our model.

Let's concatenate the two datasets.

In [None]:
# Add a column to separate whether the wine is red or white.
red_wine_data['color'] = 'r'
white_wine_data['color'] = 'w'


In [None]:
wine_data = pd.concat([red_wine_data, white_wine_data])
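One detail worth knowing about `pd.concat`: by default it keeps the row labels of each input, so the combined dataframe has repeated index values. The sketch below shows this on two made-up frames (`red`, `white`, `stacked`, and `renumbered` are illustrative names) and shows `ignore_index=True` as one possible way to renumber the rows.

```python
import pandas as pd

# Hypothetical stand-ins for the red and white wine dataframes
red = pd.DataFrame({"quality": [5, 6], "color": ["r", "r"]})
white = pd.DataFrame({"quality": [6, 7, 5], "color": ["w", "w", "w"]})

stacked = pd.concat([red, white])                        # keeps the original row labels
renumbered = pd.concat([red, white], ignore_index=True)  # builds a fresh 0..n-1 index

print(stacked.index.tolist())      # [0, 1, 0, 1, 2]
print(renumbered.index.tolist())   # [0, 1, 2, 3, 4]
```

Repeated labels are harmless for the summaries in this notebook, but they can be surprising when selecting rows with `.loc`.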

Let's rename the columns which contain spaces in their names.

In [None]:
# rename() function is used to rename the columns

# wine_data (red and white combined)
wine_data.rename(columns={'fixed acidity': 'fixed_acidity', 'citric acid':'citric_acid', 'volatile acidity':'volatile_acidity',
                          'residual sugar':'residual_sugar', 'free sulfur dioxide':'free_sulfur_dioxide', 'total sulfur dioxide':'total_sulfur_dioxide'},
                 inplace = True)

# red_wine_data
red_wine_data.rename(columns={'fixed acidity': 'fixed_acidity', 'citric acid':'citric_acid', 'volatile acidity':'volatile_acidity',
                          'residual sugar':'residual_sugar', 'free sulfur dioxide':'free_sulfur_dioxide', 'total sulfur dioxide':'total_sulfur_dioxide'},
                 inplace = True)    # inplace = True makes changes in the dataframe itself

# white_wine_data
white_wine_data.rename(columns={'fixed acidity': 'fixed_acidity', 'citric acid':'citric_acid', 'volatile acidity':'volatile_acidity',
                          'residual sugar':'residual_sugar', 'free sulfur dioxide':'free_sulfur_dioxide', 'total sulfur dioxide':'total_sulfur_dioxide'},
                 inplace = True)
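The explicit mapping above can also be expressed as a rule. As a sketch, `rename()` accepts a function that is applied to every column name, so all spaces can be replaced in one step; the empty dataframe `sample` below is a hypothetical stand-in with a few of the wine column names.

```python
import pandas as pd

# Hypothetical empty frame carrying a few of the wine column names
sample = pd.DataFrame(columns=["fixed acidity", "free sulfur dioxide", "pH"])

# rename() calls the function once per column name and returns a new dataframe
renamed = sample.rename(columns=lambda name: name.replace(" ", "_"))

print(list(renamed.columns))   # ['fixed_acidity', 'free_sulfur_dioxide', 'pH']
```

This variant returns a new dataframe instead of using `inplace = True`, so the result has to be assigned to a name.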

In [None]:
red_wine_data.head(2)

In [None]:
# concise summary about dataset
red_wine_data.info()

In [None]:
white_wine_data.info()

It is always interesting to know the basic statistical characteristics of each numerical variable.

In [None]:
# Basic Statistical details 
red_wine_data.describe()

In [None]:
white_wine_data.describe()

Let's explore the different statistical measures that we get from describe().


*   **count:** total count of non-null values in the column
*   **mean**: the average of all the values in that column
*   **min:** the minimum value in the column
*   **max:** the maximum value in the column
*   **25%:** first quartile in the column after we arrange those values in ascending order
*   **50%:** this is the median or the second quartile
*   **75%:** the third quartile 
*   **std:** this is the standard deviation, i.e. a measure of dispersion that tells how spread out the values are around the mean (covered in the basics of statistics study material)

**Note:** 25%, 50%, and 75% are nothing but corresponding percentile values
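To make these measures concrete, here is a small sketch that recomputes each of them by hand on a toy series and compares the result with describe(). The series `values` is made up for illustration. Note that pandas reports the sample standard deviation (it divides by n - 1).

```python
import numpy as np
import pandas as pd

# Toy data, chosen so that the quartiles are easy to check by eye
values = pd.Series([1, 2, 3, 4, 10], name="toy")
summary = values.describe()

# The same measures computed one by one
manual = {
    "count": values.count(),
    "mean": values.mean(),
    "std": values.std(),             # sample standard deviation (ddof=1)
    "min": values.min(),
    "25%": values.quantile(0.25),    # first quartile
    "50%": values.median(),          # second quartile
    "75%": values.quantile(0.75),    # third quartile
    "max": values.max(),
}

for name, value in manual.items():
    print(name, summary[name], value, np.isclose(summary[name], value))
```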


Our brains are good at spotting patterns in pictures. Let's play around with different types of data visualizations.

In [None]:
# first import data visualizations libraries
import matplotlib.pyplot as plt
import seaborn as sns

# To ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# red wines      
red_wine_data.hist(bins=10, figsize=(16,12))
plt.show()

The distributions of several attributes appear to be positively skewed. The attributes '**density**' and '**pH**' are approximately normally distributed (the normal distribution is covered in the basic statistics study material). Looking at the attribute **quality**, we can observe that wines of average quality (i.e. quality rating 5 to 7) outnumber the bad and good quality wines.
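Skewness can also be checked numerically rather than only by eye. The sketch below uses the pandas `skew()` method on two made-up series (`right_skewed` and `symmetric` are illustrative names): a positive value indicates a longer tail on the right, and a value near zero indicates a roughly symmetric distribution.

```python
import pandas as pd

# Hypothetical data: one long right tail, one perfectly symmetric series
right_skewed = pd.Series([1, 1, 2, 2, 3, 3, 4, 15])
symmetric = pd.Series([1, 2, 3, 4, 5, 6, 7])

print(right_skewed.skew())   # positive: the tail extends to the right
print(symmetric.skew())      # approximately zero
```

Assuming the dataframes above, the same call on a whole table, e.g. `red_wine_data.skew(numeric_only=True)`, would give one value per numerical column.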

In [None]:
# white wine
white_wine_data.hist(bins=10, figsize=(16,12))
plt.show()

For white wines, the attribute '**pH**' is approximately normally distributed. Wines of average **quality** outnumber the good and bad quality wines in the white wine data too. Most of the wines seem to have an **alcohol** percentage in the range 8.5% to nearly 13%.

We can use pivot tables to observe the values of different features for each quality of wines.

In [None]:
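# Creating pivot table for red wine
# list.remove() returns None, so we build the list of feature columns explicitly.
# 'color' is a text column and has no median, so it is excluded as well.
columns = [col for col in red_wine_data.columns if col not in ('quality', 'color')]
red_wine_data.pivot_table(values=columns, index='quality', aggfunc='median')    # By default the aggfunc is mean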

For each quality of red wines, we can observe the median values of different features.

In [None]:
# Creating pivot table for white wine
columns = [col for col in white_wine_data.columns if col not in ('quality', 'color')]
white_wine_data.pivot_table(values=columns, index='quality', aggfunc='median')
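A pivot table with a single index column is closely related to a group-by. As a rough sketch on made-up data (the frame `sample` and its values are hypothetical), the following checks that `pivot_table` with the median and `groupby(...).median()` produce the same numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the wine table
sample = pd.DataFrame({
    "quality": [5, 5, 5, 6, 6],
    "alcohol": [9.4, 9.8, 10.0, 11.2, 10.8],
    "pH": [3.51, 3.20, 3.26, 3.16, 3.30],
})

features = [col for col in sample.columns if col != "quality"]

# Median of every feature for each quality level, computed two ways
pivot = sample.pivot_table(values=features, index="quality", aggfunc="median")
grouped = sample.groupby("quality")[features].median()

# Sort the columns first so that both tables are compared in the same order
same = np.allclose(
    pivot.sort_index(axis=1).to_numpy(),
    grouped.sort_index(axis=1).to_numpy(),
)
print(pivot)
print(same)
```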

We can check how each feature is related to the others using the corr() function.

The correlation coefficient ranges from -1 to 1. When it is close to 1, there is a strong positive correlation. When it is close to -1, there is a strong negative correlation. Finally, coefficients close to zero mean that there is no linear correlation. We can observe the detailed information using the correlation matrix.

In [None]:
# red wines
red_wine_data.drop(columns='color').corr()    # 'color' is a text column, so it is left out of the correlation

From the above correlation matrix, we can observe that there is a relatively high positive correlation between **fixed_acidity** and **citric_acid**, **fixed_acidity** and **density**. Similarly we can observe there is a relatively high negative correlation between **fixed_acidity** and **pH**. There is relatively high positive correlation between **alcohol** presence and quality of the wines.
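By default, corr() computes the Pearson correlation coefficient. As a minimal sketch, the function below (the name `pearson_r` and the toy series `x` and `y` are made up for illustration) computes the same coefficient by hand: the sum of the products of the centered values, divided by the square root of the product of their sums of squares.

```python
import numpy as np
import pandas as pd

# Toy data with an almost perfectly linear, increasing relationship
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.1, 3.9, 6.2, 8.0, 9.8])

def pearson_r(a, b):
    """Pearson correlation coefficient of two equally long series."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return (a_centered * b_centered).sum() / np.sqrt(
        (a_centered ** 2).sum() * (b_centered ** 2).sum()
    )

r_manual = pearson_r(x, y)
r_pandas = x.corr(y)          # pandas uses Pearson by default
r_flipped = pearson_r(x, -y)  # reversing the trend flips the sign

print(r_manual, r_pandas, r_flipped)
```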

In [None]:
# white wines
white_wine_data.drop(columns='color').corr()

We can plot the above correlation matrix using heatmaps too. A heatmap encodes each value as a color, which makes strong correlations easy to spot.

In [None]:
# red wines
plt.figure(figsize=(16, 12))
sns.heatmap(red_wine_data.drop(columns='color').corr(), cmap='bwr', annot=True)     # annot = True: to display the correlation value in the graph

In [None]:
# white wines
plt.figure(figsize=(16, 12))
sns.heatmap(white_wine_data.drop(columns='color').corr(), cmap='bwr', annot=True)
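A heatmap shows every pair twice, because the correlation matrix is symmetric. As a complement, here is a hedged sketch of a helper that lists the strongest pairs from the upper triangle only. The helper name `strongest_pairs` and the toy dataframe are hypothetical and are not part of pandas or seaborn.

```python
import numpy as np
import pandas as pd

def strongest_pairs(df, top=3):
    """Return the `top` column pairs with the largest absolute correlation."""
    corr = df.select_dtypes(include="number").corr()
    # Indices of the upper triangle, excluding the diagonal (k=1)
    rows, cols = np.triu_indices(len(corr), k=1)
    pairs = pd.DataFrame({
        "feature_a": corr.index.to_numpy()[rows],
        "feature_b": corr.columns.to_numpy()[cols],
        "r": corr.to_numpy()[rows, cols],
    })
    # Rank by absolute value so strong negative correlations are kept too
    order = pairs["r"].abs().sort_values(ascending=False).index
    return pairs.loc[order].head(top).reset_index(drop=True)

# Made-up data: b is exactly 2 * a, c tends to move in the opposite direction
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
    "label": ["r", "r", "w", "w", "w"],   # text column, ignored by the helper
})

result = strongest_pairs(toy, top=3)
print(result)
```

Assuming the dataframes above, `strongest_pairs(red_wine_data)` would list the pairs that stand out most in the red wine heatmap.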

#### Discrete Categorical Attributes
The attribute quality is categorical in nature, and we can visualize this type of attribute using a barplot or countplot.

In [None]:
# Countplot for quality of wines present in different category of wines (red and white)
plt.figure(figsize=(12,8))
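# Recent seaborn versions expect keyword arguments; reset_index avoids the repeated row labels left by concat
sns.countplot(data=wine_data.reset_index(drop=True), x='quality', hue='color')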

From the above countplot, we can observe that wines of average quality outnumber the good and bad quality wines in both variants.
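The numbers behind a countplot can be obtained as a table with `pd.crosstab`, which counts how often each combination of two columns occurs. The sketch below uses a made-up six-row sample (`sample` and `counts` are illustrative names).

```python
import pandas as pd

# Hypothetical sample with the same two columns used in the countplot
sample = pd.DataFrame({
    "quality": [5, 5, 6, 6, 6, 7],
    "color":   ["r", "w", "r", "w", "w", "w"],
})

# Rows are quality ratings, columns are wine colors, cells are counts
counts = pd.crosstab(sample["quality"], sample["color"])
print(counts)
```

Assuming the combined dataframe above, `pd.crosstab(wine_data['quality'], wine_data['color'])` would give the bar heights of the countplot.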

We can use a scatterplot matrix to better understand the relationship between each pair of variables. It plots every numerical attribute against every other. The 'pairplot' function of seaborn helps to achieve this.

In [None]:
# red wine
sns.pairplot(red_wine_data)

The correlation between **fixed_acidity** and **citric_acid** is 0.67 (you can find this value in the correlation matrix of red wines). Looking at the scatterplot for this pair of variables, we can see the positive linear correlation between them: there is an upward trend, and the points are not too dispersed.

Similarly, we can plot the scatterplot matrix for white wines.

In [None]:
# Pairplot
sns.pairplot(white_wine_data)

#### **References:**
1. [Step-by-step guide for predicting Wine Preferences using Scikit-Learn by Nataliia Rastoropova](https://medium.com/analytics-vidhya/step-by-step-guide-for-predicting-wine-quality-using-scikit-learn-de5869f8f91a)
2. [A tutorial for COMPLETE BEGINNERS](https://www.kaggle.com/drgilermo/a-tutorial-for-complete-beginners)
3. [Introduction to Descriptive Statistics and Exploratory Data Analysis by Joinal Ahmed](https://www.youtube.com/watch?v=5CoETeAdi9A)
4. Also, I would like to thank Julian Miranda and Gunnika Batra for their contribution to improve this notebook.

## Thanks for reading this notebook. Hope you enjoyed it.