# Module 6: Exploratory Data Analysis (EDA)


## 1. EDA Fundamentals


Exploratory Data Analysis (EDA) is a crucial step in the data analysis process. It involves the initial examination and exploration of a dataset to understand its key characteristics and gain insights into the data. EDA helps data scientists and analysts make informed decisions about data cleaning, feature engineering, and the choice of modeling techniques.

**Major Components of EDA**

EDA typically consists of the following major components:

1. **Data Collection**: Gathering the dataset from various sources, such as databases, APIs, or public datasets like the Titanic dataset used throughout this module.

2. **Data Cleaning**: Identifying and handling missing values, outliers, and any inconsistencies in the data. This step ensures that the dataset is suitable for analysis.

3. **Data Visualization**: Creating visualizations, such as histograms, scatter plots, and box plots, to understand the distribution of data, relationships between variables, and potential patterns.

4. **Summary Statistics**: Calculating summary statistics, including measures like mean, median, standard deviation, and percentiles, to describe the central tendency and variability of the data.

5. **Feature Engineering**: Creating new features or transforming existing ones to enhance the dataset's information content and improve model performance.

6. **Hypothesis Testing**: Formulating hypotheses about the data and conducting statistical tests to validate or reject these hypotheses.

7. **Correlation Analysis**: Examining the relationships between variables, often using correlation coefficients, to identify dependencies.

8. **Dimensionality Reduction**: Reducing the number of variables or features through techniques like Principal Component Analysis (PCA) or feature selection methods.

9. **Data Reporting**: Communicating the findings and insights from the EDA process through reports, visualizations, and presentations.

10. **Iterative Process**: EDA is often an iterative process, where data analysts revisit previous steps as they gain more insights and refine their understanding of the dataset.

Overall, EDA plays a crucial role in setting the stage for subsequent data analysis tasks, such as machine learning model building, and helps in making data-driven decisions. It allows data professionals to uncover patterns, detect anomalies, and generate hypotheses about the data using real-world datasets like the Titanic dataset.



In [None]:
# First, we will need to load a public dataset to conduct EDA
import pandas as pd

# Load the Titanic dataset for EDA
url = 'https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv'
df = pd.read_csv(url)

# Display the first few rows of the dataset to get an overview
print(df.head())

## 2. Summary Statistics and Descriptive Analysis


**Summary Statistics and Descriptive Analysis in the Titanic Dataset**

Summary statistics and descriptive analysis are fundamental techniques used to explore datasets and understand their key characteristics. In this code example, we'll apply these concepts to the Titanic dataset, which contains information about passengers on the Titanic.

**When and why is it used?**

- **When**: Summary statistics and descriptive analysis are typically the first steps in the exploratory data analysis (EDA) process. They are used at the beginning of data analysis to gain initial insights into the dataset.

- **Why**: Here are the primary reasons for using summary statistics and descriptive analysis in the Titanic dataset:

  1. **Data Understanding**: To get a sense of the data's overall structure, such as the number of rows and columns, as well as the initial values.

  2. **Basic Statistics**: To compute key statistics for numeric columns, such as measures of central tendency (mean, median), dispersion (standard deviation), and quartiles.

  3. **Categorical Data**: To count unique values and understand the distribution of categorical columns.

  4. **Correlation Analysis**: To calculate the correlation matrix for numeric columns, helping identify potential relationships between variables.

**Necessary Information for Summary Statistics and Descriptive Analysis in the Titanic Dataset**:

1. **Summary Statistics**: We use the `describe()` function to obtain key statistics for numeric columns:
   - **Count**: Number of non-null observations.
   - **Mean**: Average value.
   - **Std**: Standard deviation.
   - **Min**: Minimum value.
   - **25%**: First quartile (25th percentile).
   - **50%**: Median (50th percentile).
   - **75%**: Third quartile (75th percentile).
   - **Max**: Maximum value.

2. **Categorical Data**: We count unique values in categorical columns using a loop. This helps us understand the variety within these columns.

3. **Correlation Analysis**: We compute the correlation matrix to identify relationships between numeric columns. A correlation matrix shows how pairs of variables are related.

By performing summary statistics and descriptive analysis, we can quickly grasp important characteristics of the Titanic dataset, such as basic statistics and potential correlations. These initial insights provide a foundation for further analysis and modeling.

Summary statistics are often followed by data visualization to gain deeper insights into the data's patterns and relationships.

In [None]:
# Import necessary libraries
import pandas as pd

# Load the Titanic dataset for EDA
url = 'https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv'
df = pd.read_csv(url)

# Display the first few rows of the dataset to get an overview
df.head()

In [None]:
# Generating summary statistics for the Titanic dataset
summary_stats = df.describe()

In [None]:
# Displaying the summary statistics
print("\nSummary Statistics:")
summary_stats

In [None]:
# Count unique values in categorical columns
categorical_columns = df.select_dtypes(include=['object']).columns
for col in categorical_columns:
    unique_values = df[col].nunique()
    print(f"Unique {col}: {unique_values}")
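
Counting unique values shows how many categories a column has, but not how the rows are spread across them. As a minimal sketch, `value_counts()` gives that distribution. The toy column below is hypothetical and only stands in for a categorical field such as the passenger's sex.

```python
import pandas as pd

# Hypothetical toy column standing in for a categorical field (not real Titanic rows)
toy = pd.DataFrame({"Sex": ["male", "female", "male", "male", "female"]})

# Absolute number of rows per category, most frequent first
counts = toy["Sex"].value_counts()

# Relative frequencies: fractions that sum to 1
shares = toy["Sex"].value_counts(normalize=True)

print(counts)
print(shares)
```

The same two calls could be applied to any column returned by `select_dtypes(include=['object'])`.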

In [None]:
# Calculate the correlation matrix for numeric columns
numeric_df = df.select_dtypes(include=['number'])
correlation_matrix = numeric_df.corr()
numeric_df

In [None]:
# Display the correlation matrix
print("\nCorrelation Matrix:")
correlation_matrix

**Inverse Relationship: Survived vs. Pclass**

In the Titanic dataset, "Pclass" represents the passenger class, where 1st class is the highest and 3rd class is the lowest. On the other hand, "Survived" indicates whether a passenger survived (1) or did not survive (0).

The relationship is inverse because "Pclass" is coded numerically: 1 for 1st class, 2 for 2nd class, and 3 for 3rd class. Passengers in lower classes (e.g., 3rd class, `Pclass = 3`) had a lower chance of survival, while passengers in higher classes (e.g., 1st class, `Pclass = 1`) had a higher chance. As the numeric value of "Pclass" increases (from 1 to 3), the likelihood of survival decreases. In other words, passengers in 1st class had a better chance of surviving compared to those in 3rd class.

This can be summarized as follows:

- 1st Class (lower "Pclass" value) → Higher Chance of Survival
- 3rd Class (higher "Pclass" value) → Lower Chance of Survival

The correlation coefficient quantifies this relationship and is negative, indicating the inverse relationship between "Survived" and "Pclass."
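
One way to check this kind of relationship is to compare the survival rate per class with the sign of the correlation coefficient. The sketch below uses a small hypothetical table, not the real Titanic records, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy data: four passengers per class (not the real Titanic records)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the share of survivors within each class
survival_by_class = toy.groupby("Pclass")["Survived"].mean()

# Pearson correlation between the class number and survival
r = toy["Pclass"].corr(toy["Survived"])

print(survival_by_class)
print(f"Correlation: {r:.3f}")
```

In this toy table the survival rate falls as the class number rises, so the coefficient comes out negative. On the real data, the same `groupby` could be run on `df`.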


## 3. Distribution Analysis


Distribution analysis is a crucial component of exploratory data analysis (EDA). It involves examining how data points are spread or distributed within a dataset. This analysis helps us understand the central tendency (mean, median, mode) and the spread or variability of the data. Histograms are a common visualization tool for distribution analysis.

**In the provided Python code example:**

- We reuse the Titanic DataFrame `df` loaded earlier with the Pandas library; it contains various passenger information.

- Next, we extract the "Age" column from the dataset, which represents the ages of Titanic passengers. This column is stored in the variable `ages`.

- We create a histogram using the Matplotlib library with `plt.hist()`. Here are the key parameters used:
  - `ages`: The data to be plotted, which is the passenger ages.
  - `bins=10`: Specifies that we want the data divided into 10 bins or intervals.
  - `edgecolor='k'`: Adds black borders to the histogram bars for clarity.
  - `alpha=0.7`: Sets the opacity of the bars to 0.7, making them slightly transparent.
  - `color='skyblue'`: Defines the color of the bars as sky blue.

- We include a title on the plot using `plt.title()` to indicate that we are visualizing the distribution of passenger ages.

- The x-axis is labeled as 'Age' with `plt.xlabel()`, and the y-axis is labeled as 'Frequency' with `plt.ylabel()`.

- We add a grid along the y-axis using `plt.grid()` to improve readability and facilitate interpreting the frequencies.

- To provide a summary statistic reference, a vertical dashed red line is drawn at the mean age using `plt.axvline()`. This line helps us see where the mean age falls within the distribution.

- A legend is included with `plt.legend()` to label the mean age line as 'Mean Age' and position it in the upper right corner.

Running this code generates a histogram that visually represents the distribution of passenger ages in the Titanic dataset. The histogram allows us to observe the central tendency (mean age) and the spread of ages among passengers.


In [None]:
import matplotlib.pyplot as plt

# Extracting the 'Age' column from the Titanic dataset
# dropna() guards against missing ages, which would make the histogram range undefined
ages = df['Age'].dropna()

# Plotting a histogram for Age
plt.hist(ages, bins=10, edgecolor='k', alpha=0.7, color='skyblue')  # Create a histogram with 10 bins, edgecolor for bar borders, and alpha for transparency
plt.title('Distribution of Passenger Ages')  # Set the title of the plot
plt.xlabel('Age')  # Label the x-axis
plt.ylabel('Frequency')  # Label the y-axis
plt.grid(axis='y', linestyle='--', alpha=0.5)  # Add a grid for better readability
plt.axvline(ages.mean(), color='r', linestyle='dashed', linewidth=1, label='Mean Age')  # Add a vertical line for the mean age
plt.legend(loc='upper right')  # Display the legend
plt.show()  # Show the histogram plot
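
The numbers behind a histogram can also be inspected without plotting. As a rough sketch, `numpy.histogram` returns the bin counts and bin edges directly. The ages below are hypothetical toy values, and four bins are used only to keep the output short.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages, including one missing value
ages_toy = pd.Series([2, 18, 22, 25, 29, 31, 35, 40, 58, None])

# Missing values must be dropped before binning
valid = ages_toy.dropna()

# counts[i] is the number of values in [edges[i], edges[i+1]);
# the last bin also includes its right edge
counts, edges = np.histogram(valid, bins=4)

mean_age = valid.mean()
median_age = valid.median()

print("Bin edges:", edges)
print("Counts:   ", counts)
print(f"Mean: {mean_age:.2f}, Median: {median_age:.1f}")
```

Comparing the mean with the median gives a quick hint about skew: a mean well above the median suggests a longer right tail.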

## 4. Correlation and Heatmaps


Correlation measures how one variable changes in relation to another. It is central to understanding relationships between multiple variables in a dataset. A heatmap is a convenient way to represent a correlation matrix visually.


In [None]:
import seaborn as sns

# Compute the correlation matrix
corr = numeric_df.corr()

# Draw a heatmap with the correlation values
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.title('Correlation Heatmap')
plt.show()
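
To make the values in the heatmap less of a black box, the sketch below computes a Pearson correlation coefficient by hand and compares it with pandas. It assumes the default `method='pearson'` of `corr()`; the columns `x` and `y` are hypothetical toy data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a positive but imperfect linear relationship
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, 5.0, 4.0, 5.0],
})

# Deviations of each value from its column mean
dx = toy["x"] - toy["x"].mean()
dy = toy["y"] - toy["y"].mean()

# Pearson r: sum of co-deviations divided by the product of the deviation norms
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same quantity computed by pandas
r_pandas = toy["x"].corr(toy["y"])

print(f"manual: {r_manual:.4f}, pandas: {r_pandas:.4f}")
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one. A coefficient near 0 does not rule out a non-linear relationship.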

## 5. Outlier Detection and Handling


Outliers are extreme values that may affect your analysis and statistical tests. Identifying and handling outliers is an essential step in EDA. A boxplot is a standardized way of displaying the distribution of data and can be used to detect outliers visually.


In [None]:
# Plotting a boxplot for Age to detect outliers
sns.boxplot(x=df['Age'])  # Pass the column as x explicitly so the box is drawn horizontally
plt.title('Boxplot of Age')
plt.xlabel('Age')
plt.show()

To identify and handle outliers in the `Age` feature:
1. Identify outliers as values more than 1.5 times the IQR below Q1 or above Q3, where IQR = Q3 - Q1.
2. Remove the identified outliers from the dataset.
3. Visualize the updated box and whisker plot to confirm the changes.

Here's how to do it:


In [None]:
# Step 1: Identify outliers
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Step 2: Remove outliers
# Note: rows with a missing Age are dropped too, because comparisons with NaN evaluate to False
df_cleaned = df[(df['Age'] >= lower_bound) & (df['Age'] <= upper_bound)]

# Step 3: Visualize the updated box and whisker plot
plt.figure(figsize=(8, 6))
sns.boxplot(x=df_cleaned['Age'])  # Pass the column as x explicitly so the box is drawn horizontally
plt.title('Boxplot of Age (Outliers Removed)')
plt.xlabel('Age')
plt.show()
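
Removing rows is only one way to handle outliers. As a hedged sketch, the IQR rule can be wrapped in a small helper and combined with capping (clipping values to the bounds) so that no rows are lost. The helper name `iqr_bounds` and the toy ages are hypothetical and used only for illustration.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) outlier bounds based on the IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy ages with one extreme value
toy_ages = pd.Series([22.0, 24.0, 25.0, 26.0, 27.0, 28.0, 30.0, 31.0, 80.0])

low, high = iqr_bounds(toy_ages)

# Values flagged as outliers by the rule
outliers = toy_ages[(toy_ages < low) | (toy_ages > high)]

# Alternative to removal: cap extreme values at the bounds, keeping every row
capped = toy_ages.clip(lower=low, upper=high)

print(f"Bounds: [{low}, {high}]")
print("Outliers:", outliers.tolist())
print("Capped max:", capped.max())
```

Whether to remove, cap, or keep outliers depends on whether they are data errors or genuine extreme observations. An elderly passenger, for instance, is a valid record.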

## 6. Handling Missing Data


Real-world data often has missing values. Handling missing data is essential to ensure your analyses are valid and conclusions are accurate. Data can be missing for various reasons, and the method to handle it depends on the nature of your data and the missing data.


In [100]:
# Creating a dataframe with missing data for demonstration
data_with_missing = {
    'Age': [25, 27, 30, None, 24, 31, 34, 28, 26, 23],
    'Salary': [50000, 54000, None, 58000, 51000, 62000, None, 57000, 59000, 52000]
}
df_missing = pd.DataFrame(data_with_missing)

# Displaying the dataframe
df_missing

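|   | Age | Salary |
|---|-----|--------|
| 0 | 25.0 | 50000.0 |
| 1 | 27.0 | 54000.0 |
| 2 | 30.0 | NaN |
| 3 | NaN | 58000.0 |
| 4 | 24.0 | 51000.0 |
| 5 | 31.0 | 62000.0 |
| 6 | 34.0 | NaN |
| 7 | 28.0 | 57000.0 |
| 8 | 26.0 | 59000.0 |
| 9 | 23.0 | 52000.0 |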


Notice the `None` values in our dataset. These represent missing data. Let's check how many missing data points we have.


In [101]:
# Checking for missing data
missing_data = df_missing.isnull().sum()

# Displaying missing data count
missing_data

Age       1
Salary    2
dtype: int64

Missing data can cause a range of problems, from skewing statistical measures to making some algorithms fail outright. For instance, many machine learning algorithms cannot handle missing values directly and require a complete dataset.
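
Raw counts are easier to judge relative to the size of the dataset. The sketch below rebuilds the same demonstration data under the hypothetical name `toy` and reports the share of missing values per column, plus the number of rows affected.

```python
import pandas as pd

# Same demonstration values as above, rebuilt so the snippet is self-contained
toy = pd.DataFrame({
    "Age": [25, 27, 30, None, 24, 31, 34, 28, 26, 23],
    "Salary": [50000, 54000, None, 58000, 51000, 62000, None, 57000, 59000, 52000],
})

# The mean of a boolean mask is the fraction of True values, i.e. the missing share
missing_share = toy.isnull().mean()

# Number of rows that contain at least one missing value
rows_with_any_missing = int(toy.isnull().any(axis=1).sum())

print(missing_share)
print("Rows with at least one missing value:", rows_with_any_missing)
```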

Let's see two common strategies for handling missing data:

1. Removing missing data entries.
2. Imputing missing data based on other entries.


In [102]:
# Strategy 1: Removing missing data entries
df_dropped = df_missing.dropna()

# Displaying the dataframe after dropping missing values
df_dropped

|   | Age | Salary |
|---|-----|--------|
| 0 | 25.0 | 50000.0 |
| 1 | 27.0 | 54000.0 |
| 4 | 24.0 | 51000.0 |
| 5 | 31.0 | 62000.0 |
| 7 | 28.0 | 57000.0 |
| 8 | 26.0 | 59000.0 |
| 9 | 23.0 | 52000.0 |


Dropping missing values is straightforward, but it has its pitfalls. It can lead to loss of valuable data, especially if many rows contain at least one missing value. This can significantly reduce the sample size and potentially bias the analysis.
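
The size of that loss can be measured directly. As a small sketch on the same demonstration values (rebuilt under the hypothetical name `toy`), the code below compares dropping every incomplete row with dropping only rows that lack a value in a column the analysis actually needs, using the `subset` parameter of `dropna()`.

```python
import pandas as pd

# Same demonstration values as above, rebuilt so the snippet is self-contained
toy = pd.DataFrame({
    "Age": [25, 27, 30, None, 24, 31, 34, 28, 26, 23],
    "Salary": [50000, 54000, None, 58000, 51000, 62000, None, 57000, 59000, 52000],
})

# Drop every row that has at least one missing value
kept_all = toy.dropna()

# Drop only rows where 'Age' is missing; rows with a missing Salary are kept
kept_age = toy.dropna(subset=["Age"])

retained = len(kept_all) / len(toy)
print(f"Rows kept after dropna(): {len(kept_all)} of {len(toy)} ({retained:.0%})")
print(f"Rows kept after dropna(subset=['Age']): {len(kept_age)} of {len(toy)}")
```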

Let's move on to the second strategy.

In [103]:
# Strategy 2: Imputing missing data based on other entries
# Here, we will use mean imputation as an example

df_imputed = df_missing.copy()
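# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which is deprecated in recent pandas versions and may not modify the DataFrame
df_imputed['Age'] = df_imputed['Age'].fillna(df_imputed['Age'].mean())
df_imputed['Salary'] = df_imputed['Salary'].fillna(df_imputed['Salary'].mean())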

# Displaying the dataframe after imputation
df_imputed

|   | Age | Salary |
|---|-----|--------|
| 0 | 25.0 | 50000.0 |
| 1 | 27.0 | 54000.0 |
| 2 | 30.0 | 55375.0 |
| 3 | 27.555556 | 58000.0 |
| 4 | 24.0 | 51000.0 |
| 5 | 31.0 | 62000.0 |
| 6 | 34.0 | 55375.0 |
| 7 | 28.0 | 57000.0 |
| 8 | 26.0 | 59000.0 |
| 9 | 23.0 | 52000.0 |


Imputation replaces missing values with substituted values. The method of substitution can vary. We used the mean value of each column for our demonstration. Other methods include using the median, mode, or even sophisticated model-based methods.

By handling the missing data, we've ensured that our dataset remains suitable for analysis and various algorithms. However, it's crucial to remember that the method used to handle missing data should align with the nature and reason for the missingness. Improper handling can introduce bias or inaccuracies.
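
As one illustration of an alternative method, the sketch below uses median imputation, which is less sensitive to extreme values than the mean. It also records which rows were imputed in an indicator column. The column name `Salary_was_missing` is a hypothetical choice for this example.

```python
import pandas as pd

# Same Salary values as the demonstration above, rebuilt so the snippet is self-contained
toy = pd.DataFrame({
    "Salary": [50000, 54000, None, 58000, 51000, 62000, None, 57000, 59000, 52000],
})

# Record which rows were originally missing before overwriting them
toy["Salary_was_missing"] = toy["Salary"].isnull()

# Median imputation: fill gaps with the column median, assigning the result back
salary_median = toy["Salary"].median()
toy["Salary"] = toy["Salary"].fillna(salary_median)

print(f"Median used for imputation: {salary_median}")
print(toy)
```

Keeping the indicator column lets later analysis check whether the imputed rows behave differently from the observed ones.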