# Exploratory Data Analysis (EDA) 

EDA is the crucial first step in any data analysis project. It involves exploring and visualizing data to:

- Understand the data's structure and characteristics.
- Identify patterns, trends, and relationships.
- Detect outliers and anomalies.
- Formulate hypotheses for further analysis.
- Prepare the data for modeling.

## 1. Importing necessary libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from sklearn.decomposition import PCA 
from sklearn.preprocessing import MinMaxScaler, StandardScaler

## 2. Loading and Initial Exploration
Load the dataset: Use pandas (pd.read_csv()) to load your data. The file used here is semicolon-delimited, so pass sep=';'.

data = pd.read_csv('winequality-red.csv', sep=';')

In [None]:
# write your code here


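First look: Use data.head(), data.tail(), data.shape, and data.info() to get an initial overview of the data: the first and last rows, the number of rows and columns, and each column's dtype and non-null count.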
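Beyond the four calls above, a first look often also counts missing values and duplicate rows. The following is a minimal sketch on a small made-up DataFrame (the `toy` frame and its values are assumptions for illustration, not the wine data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value and one repeated row
toy = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, np.nan, 7.4],
    "quality": [5, 5, 6, 5],
})

print(toy.shape)    # (rows, columns)
print(toy.dtypes)   # data type of each column

# Count missing values per column and fully duplicated rows
missing_per_column = toy.isna().sum()
duplicate_rows = toy.duplicated().sum()

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
```

The same two checks can be run on the loaded wine DataFrame once the cells below are filled in.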

In [None]:
# write your code here


In [None]:
# write your code here


In [None]:
# write your code here


In [None]:
# write your code here


In [None]:
# Discuss the outcome 

Summary statistics: Use data.describe() to get descriptive statistics for numerical columns: count, mean, standard deviation, minimum, quartiles (the 50% row is the median), and maximum.
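One way to read the describe() output is to compare the mean with the median: a mean well above the median hints at a right-skewed column. The sketch below uses a small made-up column (the `toy` frame and its values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical values: mostly small, with one large value pulling the mean up
toy = pd.DataFrame({"residual sugar": [1.8, 2.0, 2.1, 2.2, 2.4, 2.6, 9.0]})

summary = toy.describe()
mean = summary.loc["mean", "residual sugar"]
median = summary.loc["50%", "residual sugar"]  # describe() reports the median as the 50% row

print(summary)
print(f"mean - median = {mean - median:.2f}")
```

A clearly positive difference, as here, is a cue to look at the histogram and box plot for that column in the next section.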

In [None]:
# write your code here


## 3. Univariate Analysis

Histograms: Visualize the distribution of numerical data using plt.hist().

plt.hist(data['fixed acidity'])

plt.xlabel('Fixed Acidity')

plt.ylabel('Frequency')

plt.title('Distribution of Fixed Acidity')

plt.show()
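The shape a histogram shows depends on the number of bins. As a rough sketch, the code below counts the same values with different bin settings using np.histogram(); the synthetic `values` array is an assumption standing in for a real column, and 'fd' is NumPy's name for the Freedman-Diaconis rule:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical right-skewed data standing in for a wine feature
values = rng.lognormal(mean=2.0, sigma=0.3, size=500)

for bins in (5, 10, "fd"):
    # np.histogram returns the count per bin and the bin edges
    counts, edges = np.histogram(values, bins=bins)
    print(f"bins={bins!r}: {len(counts)} bars, tallest bar holds {counts.max()} values")
```

plt.hist() accepts the same bins argument, so it is worth trying two or three settings before drawing conclusions from the shape.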

In [None]:
# write your code here



Box plots: Identify the median, quartiles, and potential outliers using plt.boxplot() or sns.boxplot().


In [None]:
plt.figure(figsize=(8, 6)) 
plt.boxplot(data['alcohol'])


plt.xlabel('Alcohol Content')
plt.ylabel('Value')
plt.title('Box Plot of Alcohol Content in Red Wine')
plt.show()

In [None]:
plt.figure(figsize=(8, 6))
sns.boxplot(x='quality', y='pH', data=data, notch=True) 


plt.xlabel('Wine Quality')
plt.ylabel('pH Value')
plt.title('Box Plot of pH Value by Wine Quality')
plt.show()

Bar charts: Explore the frequency of categorical data using plt.bar() or sns.countplot().
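A bar chart of a categorical column is a plot of its value counts. The sketch below uses a short made-up series of quality scores (the `quality` values are assumptions for illustration); on the wine data the same idea would apply to the quality column:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical quality scores
quality = pd.Series([5, 6, 5, 7, 6, 5, 4, 6, 5, 8], name="quality")

# Count how often each score occurs, ordered by score rather than by frequency
counts = quality.value_counts().sort_index()

plt.figure(figsize=(8, 6))
plt.bar(counts.index.astype(str), counts.values)
plt.xlabel('Wine Quality')
plt.ylabel('Count')
plt.title('Number of Wines per Quality Score')
plt.show()
```

sns.countplot() performs the counting step itself, so it can be called directly on the raw column.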

## 4. Bivariate Analysis

Scatter plots: Visualize the relationship between two numerical variables using plt.scatter().

plt.figure(figsize=(8, 6))

plt.scatter(data['fixed acidity'], data['citric acid'])

plt.xlabel('Fixed Acidity')

plt.ylabel('Citric Acid')

plt.title('Scatter Plot of Fixed Acidity vs. Citric Acid')

plt.show()

In [None]:
# write your code here



Correlation matrices: Calculate and visualize pairwise correlations between numerical variables using data.corr() (Pearson correlation by default) and sns.heatmap().

correlations = data.corr()

plt.figure(figsize=(10, 8))

sns.heatmap(correlations, annot=True, cmap='coolwarm', fmt=".2f")

plt.title('Correlation Matrix of Wine Quality Features')

plt.show()
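When one column is the target, it can help to pull out just that column of the correlation matrix and rank the features by strength. This is a minimal sketch on synthetic data: the `toy` frame, its coefficients, and the noise level are all assumptions chosen so that the target rises with one feature and falls with the other.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200
alcohol = rng.normal(10.4, 1.0, n)
volatile_acidity = rng.normal(0.5, 0.15, n)
# Hypothetical target: rises with alcohol, falls with volatile acidity, plus noise
quality = 0.8 * alcohol - 4.0 * volatile_acidity + rng.normal(0, 0.5, n)

toy = pd.DataFrame({
    "alcohol": alcohol,
    "volatile acidity": volatile_acidity,
    "quality": quality,
})

# Correlation of every feature with the target, excluding the target itself
target_corr = toy.corr()["quality"].drop("quality")

# Order by absolute strength while keeping the sign visible
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Correlation only captures linear association, so a weak value here does not rule out a nonlinear relationship.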

In [None]:
# write your code here



Grouped analyses: Use data.groupby() to compute statistics separately for each category, for example the mean alcohol content at each quality level.

average_alcohol_by_quality = data.groupby('quality')['alcohol'].mean()

print(average_alcohol_by_quality)


In [None]:
# write your code here



plt.figure(figsize=(8, 6))

average_alcohol_by_quality.plot(kind='bar')

plt.xlabel('Wine Quality')

plt.ylabel('Average Alcohol Content')

plt.title('Average Alcohol Content by Wine Quality')

plt.show()
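A group mean is easier to trust when the group size is shown next to it. The sketch below computes several statistics per group in one call with agg(); the `toy` frame and its values are assumptions for illustration:

```python
import pandas as pd

# Hypothetical data: three wines of quality 5, two of quality 6, one of quality 7
toy = pd.DataFrame({
    "quality": [5, 5, 5, 6, 6, 7],
    "alcohol": [9.4, 9.8, 9.6, 10.5, 11.1, 12.0],
})

# One row per quality level, one column per statistic
summary = toy.groupby("quality")["alcohol"].agg(["count", "mean", "median"])
print(summary)
```

Groups with very few rows (quality 7 here) yield means that can shift a lot with a single additional observation.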

In [None]:
# write your code here



## 5. Multivariate Analysis

Pair plots: Visualize pairwise relationships between multiple variables using sns.pairplot().

features = ['fixed acidity', 'volatile acidity', 'citric acid', 'alcohol', 'quality']

sns.pairplot(data[features], hue='quality')

plt.show()

In [None]:
# write your code here



3D plots: Explore relationships in three dimensions.

In [None]:
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')

# Select three features for the 3D plot
x = data['alcohol']
y = data['sulphates']
z = data['total sulfur dioxide']
c = data['quality']  # Color points by quality

ax.scatter(x, y, z, c=c, cmap='viridis')
ax.set_xlabel('Alcohol')
ax.set_ylabel('Sulphates')
ax.set_zlabel('Total Sulfur Dioxide')
plt.title('3D Scatter Plot of Wine Features')
plt.show()

Dimensionality reduction: Use techniques like Principal Component Analysis (PCA) to reduce the number of variables while preserving important information.

In [None]:
features_for_pca = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 
                    'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 
                    'density', 'pH', 'sulphates', 'alcohol'] 

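X = data[features_for_pca]

# Standardize first: PCA is variance-based, so unscaled features with large
# ranges (e.g. total sulfur dioxide) would dominate the components
X_scaled = StandardScaler().fit_transform(X)

# Apply PCA with 2 components
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_scaled)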
pca_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])

# Add the 'quality' column back to the PCA DataFrame for visualization
pca_df['quality'] = data['quality']

# Visualize the PCA results
plt.figure(figsize=(8, 6))
sns.scatterplot(x='PC1', y='PC2', hue='quality', data=pca_df, palette='viridis')
plt.title('PCA: Wine Quality Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

# Explained variance ratio
print("Explained Variance Ratio:", pca.explained_variance_ratio_)
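The explained variance ratio can also guide how many components to keep. One common heuristic, sketched below, is to fit PCA with all components and take the smallest number whose cumulative ratio reaches a chosen threshold. The synthetic `X_toy` matrix and the 95% threshold are assumptions for illustration, not values from this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: 300 rows, 5 features, two of which are noisy copies of others
base = rng.normal(size=(300, 3))
X_toy = np.column_stack([
    base,
    base[:, 0] + 0.1 * rng.normal(size=300),
    base[:, 1] + 0.1 * rng.normal(size=300),
])

# Standardize, then fit PCA without limiting the number of components
X_std = StandardScaler().fit_transform(X_toy)
pca_full = PCA().fit(X_std)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.95
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print("Cumulative explained variance:", np.round(cumulative, 3))
print(f"Components needed for {threshold:.0%}:", n_keep)
```

If two components explain only a modest share of the variance, a 2D PCA scatter plot should be read as a partial view of the data.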

## 6. Handling Outliers

Identification: Use box plots, scatter plots, or statistical methods (e.g., IQR) to identify outliers.


numerical_data = data.select_dtypes(include=[np.number])

plt.figure(figsize=(15, 10))  

sns.boxplot(data=numerical_data, orient="h", palette="Set2") 

plt.title('Box Plots of All Numerical Variables in Wine Quality Dataset')

plt.show()

In [None]:
# write your code here



Identification using IQR

In [None]:
def find_outliers_iqr(data, column):
  """
  Identifies outliers in a DataFrame column using the IQR method.

  Args:
      data: Pandas DataFrame.
      column: Name of the column to check for outliers.

  Returns:
      A Series of boolean values indicating outliers.
  """
  Q1 = data[column].quantile(0.25)
  Q3 = data[column].quantile(0.75)
  IQR = Q3 - Q1
  lower_bound = Q1 - 1.5 * IQR
  upper_bound = Q3 + 1.5 * IQR
  return (data[column] < lower_bound) | (data[column] > upper_bound)

# --- Identify outliers in all numerical columns ---

# Select only the numerical columns
numerical_data = data.select_dtypes(include=[np.number])

for column in numerical_data.columns:
    outliers = find_outliers_iqr(data, column)
    print(f"Number of outliers in '{column}': {outliers.sum()}")
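To make the IQR rule concrete, here is the same calculation worked through on ten made-up numbers (the `values` series is an assumption for illustration). With pandas' default linear interpolation, Q1 is 3.25 and Q3 is 7.75, so the IQR is 4.5 and the fences sit at -3.5 and 14.5.

```python
import pandas as pd

# Hypothetical values: nine small numbers and one extreme value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], name="toy")

q1 = values.quantile(0.25)       # 3.25
q3 = values.quantile(0.75)       # 7.75
iqr = q3 - q1                    # 4.5
lower_bound = q1 - 1.5 * iqr     # -3.5
upper_bound = q3 + 1.5 * iqr     # 14.5

# Flag anything outside the fences
is_outlier = (values < lower_bound) | (values > upper_bound)

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower_bound}, {upper_bound}]")
print("Outliers:", values[is_outlier].tolist())  # [50]
```

The 1.5 multiplier is a convention rather than a law; a larger multiplier flags fewer points.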

Treatment: Decide whether to remove, transform, or keep outliers based on their nature and the analysis goals.

Removing outliers 

In [None]:
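# The loop above leaves `outliers` holding the mask for the last column only,
# so recompute it for the column being treated
outliers = find_outliers_iqr(data, 'total sulfur dioxide')
# Boolean indexing returns a new DataFrame; .copy() keeps it independent of the original
wine_data_no_outliers = data[~outliers].copy()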

Transforming outliers (e.g., capping)

In [None]:
# (Create another copy for this method)
wine_data_capped = data.copy()
upper_cap = data['total sulfur dioxide'].quantile(0.95)  # Cap at 95th percentile
wine_data_capped['total sulfur dioxide'] = np.where(
    wine_data_capped['total sulfur dioxide'] > upper_cap,
    upper_cap,
    wine_data_capped['total sulfur dioxide']
)

Visualize the results

In [None]:
plt.figure(figsize=(12, 6))

plt.subplot(1, 3, 1)
sns.histplot(data['total sulfur dioxide'], kde=True)
plt.title('Original Data')

plt.subplot(1, 3, 2)
sns.histplot(wine_data_no_outliers['total sulfur dioxide'], kde=True)
plt.title('Outliers Removed')

plt.subplot(1, 3, 3)
sns.histplot(wine_data_capped['total sulfur dioxide'], kde=True)
plt.title('Outliers Capped')

plt.tight_layout()
plt.show()

## 7. Normalise the Data

Min-Max Scaling

In [None]:
features_to_scale = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 
                    'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 
                    'density', 'pH', 'sulphates', 'alcohol']

X = data[features_to_scale]


# Initialize the MinMaxScaler
scaler = MinMaxScaler()

# Fit and transform the data
X_scaled_minmax = scaler.fit_transform(X)

# Create a new DataFrame with the scaled data
wine_data_minmax = pd.DataFrame(X_scaled_minmax, columns=features_to_scale)

Standard Scaling

In [None]:
# Initialize the StandardScaler
scaler = StandardScaler()

# Fit and transform the data
X_scaled_standard = scaler.fit_transform(X)

# Create a new DataFrame with the scaled data
wine_data_standard = pd.DataFrame(X_scaled_standard, columns=features_to_scale)
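Both scalers apply a simple per-column formula: min-max scaling computes (x - min) / (max - min), and standard scaling computes (x - mean) / std using the population standard deviation. The sketch below checks the manual formulas against scikit-learn on a tiny made-up matrix (`X_toy` is an assumption for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical 3x2 matrix: two features on very different scales
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 60.0]])

# Min-max scaling: map each column onto [0, 1]
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
manual_minmax = (X_toy - col_min) / (col_max - col_min)

# Standard scaling: zero mean, unit variance per column (np.std defaults to ddof=0)
manual_standard = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

sk_minmax = MinMaxScaler().fit_transform(X_toy)
sk_standard = StandardScaler().fit_transform(X_toy)

print("Min-max matches:", np.allclose(manual_minmax, sk_minmax))
print("Standard matches:", np.allclose(manual_standard, sk_standard))
```

Min-max scaling is sensitive to extreme values because they define the range, whereas standard scaling does not bound the result to a fixed interval.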

Print the first few rows of the scaled DataFrames

In [None]:
print("Min-Max Scaled Data:")
print(wine_data_minmax.head())

print("\nStandard Scaled Data:")
print(wine_data_standard.head())

Plot box plots

In [None]:
plt.figure(figsize=(15, 10))

# Original Data
plt.subplot(1, 3, 1)
sns.boxplot(data=data[features_to_scale], orient="h", palette="Set2")
plt.title('Original Data')

# Min-Max Scaled Data
plt.subplot(1, 3, 2)
sns.boxplot(data=wine_data_minmax, orient="h", palette="Set2")
plt.title('Min-Max Scaled Data')

# Standard Scaled Data
plt.subplot(1, 3, 3)
sns.boxplot(data=wine_data_standard, orient="h", palette="Set2")
plt.title('Standard Scaled Data')

plt.tight_layout()
plt.show()