# Versatile Exploratory Data Analysis Code

This notebook defines a set of functions for exploratory data analysis in Python, using the Pandas, Matplotlib, Seaborn, and NumPy libraries. Each function wraps one common inspection or plotting step, so you can get a first look at a dataset quickly.

The code starts by importing four libraries: Pandas for data manipulation, Matplotlib for plotting, Seaborn for statistical visualization, and NumPy for numerical operations.

It then defines a series of functions to perform common data analysis tasks:

- load_data: This function loads data from a CSV file and returns it as a Pandas DataFrame.

- display_head: It displays the first few rows of the DataFrame to provide a quick overview of the dataset's structure.

- get_shape: This function prints the dimensions (number of rows and columns) of the DataFrame.

- check_missing_values: It identifies and prints the number of missing values in each column of the DataFrame.

- drop_missing_values: This function returns a new DataFrame without the rows that contain at least one missing value. The DataFrame passed in is not modified.

- get_data_types: It displays the data types of each column in the DataFrame, helping users understand their data's structure.

- summary_stats_numeric: This function calculates and displays summary statistics for numerical columns: count, mean, standard deviation, minimum, quartiles, and maximum.

- unique_values_categorical: It lists unique values and their frequencies for columns of dtype object, which this code treats as the categorical columns.

- plot_histograms, plot_boxplots, plot_correlation_matrix, and plot_scatter_plots: These functions generate various plots, including histograms, box plots, correlation matrices, and scatter plots, allowing users to visualize and analyze their data's distribution, relationships, and patterns.

Together, these functions cover the usual first pass over a new dataset: its size, column types, missing values, distributions, and pairwise relationships.

## 1. Importing Modules

In [None]:
# Import the pandas library and alias it as 'pd' for convenience.
import pandas as pd

# Import the matplotlib.pyplot library and alias it as 'plt' for convenience.
import matplotlib.pyplot as plt

# Import the seaborn library and alias it as 'sns' for convenience.
import seaborn as sns

# Import the numpy library and alias it as 'np' for convenience.
import numpy as np

## 2. Loading Data

In [None]:
# Define a function called 'load_data' that takes a 'file_path' as its parameter.
# This function is used to load data from a CSV file and return it as a DataFrame.

def load_data(file_path):
    # Use the 'pd.read_csv' function from the pandas library to read the data from the CSV file
    # located at the given 'file_path' and store it in the variable 'data'.
    data = pd.read_csv(file_path)
    
    # Return the 'data' as a DataFrame to the caller.
    return data
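
To try the loading step without a file on disk, the sketch below reads a small CSV from memory. The column names and values are made up for illustration and are not taken from the stroke dataset; the point is only that `pd.read_csv` accepts file-like objects as well as paths and turns empty fields into missing values.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file (values are invented).
csv_text = """id,age,gender,bmi
1,67,Male,36.6
2,61,Female,
3,80,Male,32.5
"""

# pd.read_csv accepts a path or any file-like object, so StringIO works here.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # the empty 'bmi' field is read as NaN, so 'bmi' is float
```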

## 3. Preview First Rows

In [None]:
# Define a function called 'display_head' that takes 'data' and an optional 'n' parameter.
# This function is used to display the first 'n' rows of the provided data.

def display_head(data, n=5):
    # Print a message indicating that we are displaying the first 'n' rows of the data.
    print(f"First {n} rows of the data:")
    
    # Use the 'head' method of the DataFrame 'data' to display the first 'n' rows.
    print(data.head(n))

## 4. Get Data Shape

In [None]:
# Define a function called 'get_shape' that takes 'data' as its parameter.
# This function is used to display the shape (number of rows and columns) of the provided data.

def get_shape(data):
    # Print a message indicating that we are displaying the shape of the data.
    print("Shape of the data:")
    
    # Use the 'shape' attribute of the DataFrame 'data' to get and print the number of rows and columns.
    print(data.shape)

## 5. Check Missing Values

In [None]:
# Define a function called 'check_missing_values' that takes 'data' as its parameter.
# This function is used to count and display the number of missing values in each column of the provided data.

def check_missing_values(data):
    # Print a message indicating that we are displaying the number of missing values.
    print("Number of missing values:")
    
    # Use the 'isna()' method of the DataFrame 'data' to create a boolean mask
    # indicating whether each element in the DataFrame is missing (True) or not (False).
    # Then, use the 'sum()' method to count the True values (missing values) in each column and print the result.
    print(data.isna().sum())
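
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on an invented toy DataFrame (not part of the original functions) that reports both: `isna().sum()` gives the count per column, and `isna().mean()` gives the fraction, because the mean of a boolean mask is the proportion of `True` values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few gaps.
toy = pd.DataFrame({
    "age": [67, 61, np.nan, 49],
    "bmi": [36.6, np.nan, np.nan, 34.4],
    "gender": ["Male", "Female", "Male", None],
})

# Count of missing values per column.
missing_count = toy.isna().sum()

# Percentage of rows that are missing in each column.
missing_percent = toy.isna().mean() * 100

# Combine both views into one small report table.
report = pd.DataFrame({"missing": missing_count, "percent": missing_percent})
print(report)
```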

## 6. Drop Missing Values

In [None]:
# Define a function called 'drop_missing_values' that takes 'data' as its parameter.
# This function is used to remove rows with missing values from the provided data.

def drop_missing_values(data):
    # Use the 'dropna()' method of the DataFrame 'data' to remove rows with missing values,
    # and store the resulting DataFrame back in the variable 'data'.
    data = data.dropna()
    
    # Return the modified 'data' DataFrame without missing values to the caller.
    return data
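
Dropping every row with any gap can discard a lot of data. As a hedged illustration on invented data, the sketch below compares the default `dropna()` used above with two standard alternatives: `subset`, which looks only at chosen columns, and `thresh`, which keeps rows that have at least a given number of non-missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only 'age' and 'bmi' contain gaps.
toy = pd.DataFrame({
    "age": [67, 61, np.nan, 49],
    "bmi": [36.6, np.nan, np.nan, 34.4],
    "gender": ["Male", "Female", "Male", "Female"],
})

# Default behaviour: drop a row if any column is missing.
any_missing = toy.dropna()

# Only require 'age' to be present.
only_age = toy.dropna(subset=["age"])

# Keep rows that have at least two non-missing values.
at_least_two = toy.dropna(thresh=2)

# dropna returns new objects, so the original frame keeps all four rows.
print(len(toy), len(any_missing), len(only_age), len(at_least_two))
```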

## 7. Get Data Types

In [None]:
# Define a function called 'get_data_types' that takes 'data' as its parameter.
# This function is used to display the data types of each column in the provided data.

def get_data_types(data):
    # Print a message indicating that we are displaying the data types of the columns.
    print("Data types of the columns:")
    
    # Use the 'dtypes' attribute of the DataFrame 'data' to get and print the data types of each column.
    print(data.dtypes)

## 8. Summary Statistics

In [None]:
# Define a function called 'summary_stats_numeric' that takes 'data' as its parameter.
# This function is used to display summary statistics for the numerical columns in the provided data.

def summary_stats_numeric(data):
    # Print a message indicating that we are displaying summary statistics of numerical columns.
    print("Summary statistics of the numerical columns:")
    
    # Use the 'select_dtypes' method with 'include=[np.number]' to select only the numerical columns
    # in the DataFrame 'data', and then use 'describe()' to calculate and print summary statistics for those columns.
    print(data.select_dtypes(include=[np.number]).describe())
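
To make the output of this step concrete, here is a small sketch on invented data. It shows that `select_dtypes(include=[np.number])` leaves out the text column and that `describe()` returns one row per statistic.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two numeric columns and one text column.
toy = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "glucose": [80.0, 90.0, 100.0, 110.0],
    "gender": ["Male", "Female", "Male", "Female"],
})

# Keep only the numeric columns; 'gender' is excluded.
numeric = toy.select_dtypes(include=[np.number])

# One row per statistic: count, mean, std, min, 25%, 50%, 75%, max.
stats = numeric.describe()
print(stats)
```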

## 9. Unique Value Count

In [None]:
# Define a function called 'unique_values_categorical' that takes 'data' as its parameter.
# This function is used to display unique values and their frequencies for categorical columns in the provided data.

def unique_values_categorical(data):
    # Print a message indicating that we are displaying unique values and their frequencies for categorical columns.
    print("Unique values and their frequencies for categorical columns:")
    
    # Iterate through columns that have data type 'object' (categorical columns).
    for column in data.select_dtypes(include=[object]):
        # Print the name of the current column.
        print(column, ":")
        
        # Use the 'value_counts()' method to count unique values and their frequencies in the current column,
        # and then print the result.
        print(data[column].value_counts())
        
        # Print an empty line for separation between columns.
        print()
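
By default `value_counts()` skips missing values and reports absolute counts. The sketch below, on an invented column, shows two standard options that may be useful when exploring categories: `dropna=False` to include the missing entries, and `normalize=True` to report proportions instead of counts.

```python
import pandas as pd

# Hypothetical categorical column with one missing entry.
genders = pd.Series(["Male", "Female", "Female", None, "Female"], name="gender")

# Default: missing values are ignored.
counts = genders.value_counts()

# Include missing values as their own category.
with_missing = genders.value_counts(dropna=False)

# Proportions of the non-missing values instead of raw counts.
shares = genders.value_counts(normalize=True)

print(counts)
print(with_missing)
print(shares)
```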

## 10. Plot Histogram

In [None]:
# Define a function called 'plot_histograms' that takes 'data' as its parameter.
# This function is used to plot histograms for the numerical columns in the provided data.

def plot_histograms(data):
    # Print a message indicating that we are plotting histograms for numerical columns.
    print("Histograms of the numerical columns:")
    
    # Select the numerical columns using 'select_dtypes' and 'include=[np.number]'.
    numerical_columns = data.select_dtypes(include=[np.number]).columns
    
    # Iterate through the numerical columns and create a histogram for each one.
    for column in numerical_columns:
        # Create a new figure and set its size.
        plt.figure(figsize=(4, 3))
        
        # Use 'sns.histplot' to create a histogram with specified options (e.g., number of bins, KDE, color).
        sns.histplot(data[column], bins=20, kde=True, color="brown")
        
        # Set labels and a title for the plot.
        plt.xlabel("Value")
        plt.ylabel("Count")
        plt.title(f"Histogram of {column}")
        
        # Display the plot.
        plt.show()
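
A histogram is a plot of counts per bin, so it can help to look at those numbers directly. The sketch below uses `np.histogram` on randomly generated values (an assumption for illustration, not the stroke data) to show what 20 equal-width bins produce. It illustrates the binning idea only and makes no claim about how Seaborn computes its bins internally.

```python
import numpy as np

# Hypothetical sample: 200 draws from a normal distribution, seeded for repeatability.
rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=200)

# Split the value range into 20 equal-width bins and count the values in each.
counts, edges = np.histogram(values, bins=20)

# 20 bins are delimited by 21 edges, and every value falls into exactly one bin.
print(len(counts), len(edges), counts.sum())
```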

## 11. Box Plot

In [None]:
# Define a function called 'plot_boxplots' that takes 'data' as its parameter.
# This function is used to plot box plots for the numerical columns in the provided data.

def plot_boxplots(data):
    # Print a message indicating that we are plotting box plots for numerical columns.
    print("Box plots of the numerical columns:")
    
    # Select the numerical columns using 'select_dtypes' and 'include=[np.number]'.
    numerical_columns = data.select_dtypes(include=[np.number]).columns
    
    # Iterate through the numerical columns and create a box plot for each one.
    for column in numerical_columns:
        # Create a new figure and set its size.
        plt.figure(figsize=(4, 3))
        
        # Use 'sns.boxplot' to create a box plot for the current numerical column with specified options (e.g., color).
        sns.boxplot(x=data[column], color="lightgreen")
        
        # Label the axes. Passing 'x=' draws a horizontal box plot, so the values lie along the x-axis.
        plt.xlabel("Value")
        plt.ylabel("")
        plt.title(f"Box Plot of {column}")
        
        # Display the plot.
        plt.show()
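
Points drawn beyond the whiskers of a box plot are, with the default settings, those lying more than 1.5 times the interquartile range (IQR) outside the quartiles. The sketch below applies that rule by hand to an invented series, so the flagged values can be listed as well as seen in the plot.

```python
import pandas as pd

# Hypothetical measurements with one obviously extreme value.
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])

# Quartiles and interquartile range.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are the ones a box plot would mark as outliers.
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```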

## 12. Correlation Analysis

In [None]:
# Define a function called 'plot_correlation_matrix' that takes 'data' as its parameter.
# This function is used to plot a correlation matrix for the numerical columns in the provided data.

def plot_correlation_matrix(data):
    # Print a message indicating that we are plotting a correlation matrix for numerical columns.
    print("Correlation matrix of the numerical columns:")
    
    # Calculate the correlation matrix for the numerical columns using 'corr()'.
    corr_matrix = data.select_dtypes(include=[np.number]).corr()
    
    # Create a new figure and set its size.
    plt.figure(figsize=(6, 4))
    
    # Use 'sns.heatmap' to create a heatmap of the correlation matrix with specified options (e.g., colormap, annotations).
    sns.heatmap(corr_matrix, cmap="coolwarm", annot=True, fmt=".2f", cbar=True)
    
    # Set a title for the plot.
    plt.title("Correlation Matrix")
    
    # Display the plot.
    plt.show()
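
The heatmap colors the values returned by `corr()`, which computes Pearson correlation by default. As a minimal sketch on invented columns, the example below shows the two extremes: a column that is an exact multiple of another correlates at 1, and a column that runs in the opposite direction correlates at -1.

```python
import pandas as pd

# Hypothetical columns with known linear relationships.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "double_x": [2, 4, 6, 8, 10],   # perfectly increasing with x
    "reversed_x": [5, 4, 3, 2, 1],  # perfectly decreasing with x
})

# Pairwise Pearson correlation between all numeric columns.
corr = toy.corr()
print(corr.round(2))
```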

## 13. Scatter Plot

In [None]:
# Define a function called 'plot_scatter_plots' that takes 'data' and 'target_column' as its parameters.
# This function is used to plot scatter plots of the numerical columns against the target variable.

def plot_scatter_plots(data, target_column):
    # Print a message indicating that we are plotting scatter plots against the target variable.
    print("Scatter plots of the numerical columns against the target variable:")
    
    # Select the numerical columns using 'select_dtypes' and 'include=[np.number]'.
    numerical_columns = data.select_dtypes(include=[np.number]).columns
    
    # Exclude the 'target_column' from the list of numerical columns.
    numerical_columns = numerical_columns[numerical_columns != target_column]
    
    # Iterate through the numerical columns and create a scatter plot for each one against the target variable.
    for column in numerical_columns:
        # Create a new figure and set its size.
        plt.figure(figsize=(4, 3))
        
        # Use 'sns.scatterplot' to create a scatter plot with specified options (e.g., color palette).
        sns.scatterplot(x=column, y=target_column, data=data, hue=column, palette="viridis")
        
        # Set labels and a title for the plot.
        plt.xlabel(column)
        plt.ylabel(target_column)
        plt.title(f"Scatter Plot of {column} vs {target_column}")
        
        # Adjust the layout and display the plot.
        plt.tight_layout()
        plt.show()
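
The column selection in this function can be checked without drawing anything. The sketch below repeats that step on an invented DataFrame, with a hypothetical target column named `stroke`. It also shows `Index.drop` as an alternative: the comparison used above silently returns all columns if the target name is misspelled, whereas `drop` raises a `KeyError`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two numeric features, one text column, one numeric target.
toy = pd.DataFrame({
    "age": [67, 61, 80],
    "bmi": [36.6, 28.1, 32.5],
    "gender": ["Male", "Female", "Male"],
    "stroke": [1, 0, 1],
})

# Numeric columns only; 'gender' is excluded.
numeric_columns = toy.select_dtypes(include=[np.number]).columns

# Same filtering as in plot_scatter_plots: keep everything except the target.
features = numeric_columns[numeric_columns != "stroke"]

# Alternative that fails loudly if the target name does not exist.
features_alt = numeric_columns.drop("stroke")

print(list(features))
```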

## 14. Example Usage

In [None]:
# Load the data
data = load_data("healthcare-dataset-stroke-data.csv")

In [None]:
# Display the head of the data
display_head(data)

In [None]:
# Get the shape of the data
get_shape(data)

In [None]:
# Check for missing values
check_missing_values(data)

In [None]:
# Drop missing values
data = drop_missing_values(data)

In [None]:
# Get the data types of the columns
get_data_types(data)

In [None]:
# Get the summary statistics of the numerical columns
summary_stats_numeric(data)

In [None]:
# Get the unique values and their frequencies for categorical columns
unique_values_categorical(data)

In [None]:
# Plot histograms of the numerical columns
plot_histograms(data)

In [None]:
# Plot box plots of the numerical columns
plot_boxplots(data)

In [None]:
# Plot correlation matrix of the numerical columns
plot_correlation_matrix(data)

In [None]:
# Plot scatter plots of the numerical columns against the target variable
plot_scatter_plots(data, "age")