
# Comprehensive EDA Template for Categorical Target

This Jupyter notebook is a template for Exploratory Data Analysis (EDA) on a dataset whose target variable is categorical. It covers data loading, cleaning, visualization, and basic statistical tests. Treat it as a starting point: replace the placeholders `your_data.csv` and `target_column` with your own file and column name, and adapt the steps to your dataset and analysis needs.

## Import Necessary Libraries

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency
```

## Load the Dataset

```python
# Replace 'your_data.csv' with the path to your dataset
df = pd.read_csv('your_data.csv')
```

## Basic Data Overview

```python
# Display the first few rows of the dataset
print(df.head())

# Display dataset information (info() prints its report directly and returns None)
df.info()

# Summary statistics for numerical features
print(df.describe())

# Summary statistics for categorical features
print(df.describe(include='object'))
```

## Data Cleaning

```python
# Check for missing values
print(df.isnull().sum())

# Drop or impute missing values
# df.dropna(inplace=True) # To drop rows with missing values
# df.fillna(0, inplace=True) # Example: Fill missing values with 0

# Convert data types if necessary
# df['your_column'] = df['your_column'].astype('int')
```
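Filling every gap with 0 can distort a feature's distribution. A common alternative is to impute numerical columns with the median and categorical columns with the most frequent value. The following is a minimal sketch on a made-up DataFrame; the column names `age` and `city` are hypothetical and the right strategy depends on why values are missing.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; the column names are hypothetical
toy = pd.DataFrame({
    'age': [25.0, np.nan, 40.0, 31.0],
    'city': ['Paris', 'Rome', None, 'Rome'],
})

imputed = toy.copy()
for col in imputed.columns:
    if pd.api.types.is_numeric_dtype(imputed[col]):
        # The median is less sensitive to outliers than the mean
        imputed[col] = imputed[col].fillna(imputed[col].median())
    else:
        # mode() can return several values; take the first one
        imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

print(imputed)
```

Working on a copy keeps the original values available for comparison before and after imputation.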

## Univariate Analysis

### Target Variable Distribution

```python
sns.countplot(x='target_column', data=df)
plt.title('Distribution of Target Categories')
plt.show()
```
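A count plot shows class imbalance visually, but exact proportions are often easier to report. The sketch below computes class shares and a simple largest-to-smallest ratio on a made-up target column; the labels and the `imbalance_ratio` name are illustrative assumptions.

```python
import pandas as pd

# Toy target column for illustration only; the labels are made up
toy = pd.DataFrame({'target_column': ['a', 'a', 'a', 'b', 'b', 'c']})

# Absolute counts and relative shares per class
counts = toy['target_column'].value_counts()
proportions = toy['target_column'].value_counts(normalize=True)

# Ratio of the largest class to the smallest class
imbalance_ratio = counts.max() / counts.min()

print(proportions.round(3))
print(f'Imbalance ratio (largest / smallest): {imbalance_ratio:.1f}')
```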

### Numerical Features Distribution

```python
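# Histograms for numerical features (the target is dropped in case it is numerically encoded)
num_features = df.select_dtypes(include=[np.number]).columns.drop('target_column', errors='ignore').tolist()
df[num_features].hist(bins=15, figsize=(15, 6), layout=(2, -1))
plt.tight_layout()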
plt.show()
```

### Categorical Features Distribution

```python
# Bar plots for categorical features (covers both object and pandas 'category' dtypes)
cat_features = df.select_dtypes(include=['object', 'category']).columns.tolist()
for col in cat_features:
    if col != 'target_column':  # Exclude the target variable
        sns.countplot(x=col, data=df)
        plt.title(f'Distribution of {col}')
        plt.xticks(rotation=45)
        plt.show()
```

## Bivariate Analysis

### Numerical Features vs Target

```python
# Box plots for numerical features vs target
for col in num_features:
    sns.boxplot(x='target_column', y=col, data=df)
    plt.title(f'{col} Distribution by Target Category')
    plt.show()
```
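Box plots show group differences visually. If a numerical summary is also wanted, one option (an addition, not part of the original template) is the Kruskal-Wallis test, which checks whether a numerical feature's distribution differs across target categories without assuming normality. A minimal sketch on made-up data, where the column name `feature` is hypothetical:

```python
import pandas as pd
from scipy.stats import kruskal

# Toy data for illustration only; 'feature' is a hypothetical column name
toy = pd.DataFrame({
    'target_column': ['a'] * 5 + ['b'] * 5 + ['c'] * 5,
    'feature': [1, 2, 3, 4, 5, 11, 12, 13, 14, 15, 21, 22, 23, 24, 25],
})

# One array of feature values per target category
groups = [g['feature'].to_numpy() for _, g in toy.groupby('target_column')]

# Kruskal-Wallis H-test: a rank-based alternative to one-way ANOVA
stat, p = kruskal(*groups)
print(f'H = {stat:.3f}, p-value = {p:.4g}')
```

As with any test run across many features, p-values here are exploratory hints rather than confirmed findings.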

### Categorical Features vs Target

```python
# Chi-square test of independence for each categorical feature vs. the target.
# Note: the test is only approximate when expected cell counts are small (a common rule of thumb is below 5).
for col in cat_features:
    if col != 'target_column':
        contingency_table = pd.crosstab(df['target_column'], df[col])
        _, p, _, _ = chi2_contingency(contingency_table)
        verdict = 'is' if p < 0.05 else 'is not'
        print(f'{col} {verdict} significantly associated with the target variable (p-value = {p:.4g}).')
```
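A small p-value says an association is unlikely to be due to chance, not how strong it is. One common effect-size measure for two categorical variables is Cramér's V, which ranges from 0 (no association) to 1 (perfect association). The helper below is a hypothetical sketch, not part of the original template; it uses the uncorrected chi-square statistic, and the example series are made up.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(x, y):
    """Cramér's V for two categorical series (hypothetical helper)."""
    table = pd.crosstab(x, y)
    k = min(table.shape) - 1
    if k == 0:
        return float('nan')  # undefined when either variable has a single category
    # correction=False disables Yates' continuity correction,
    # which SciPy applies to 2x2 tables by default
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    return float(np.sqrt(chi2 / (n * k)))


# Made-up examples: one perfectly associated pair, one independent pair
target = pd.Series(['a'] * 10 + ['b'] * 10)
aligned = pd.Series(['x'] * 10 + ['y'] * 10)
v_strong = cramers_v(target, aligned)

v_none = cramers_v(pd.Series(['a', 'a', 'b', 'b']), pd.Series(['x', 'y', 'x', 'y']))

print(f'Perfect association: V = {v_strong:.2f}')
print(f'No association:      V = {v_none:.2f}')
```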

## Correlation Analysis

```python
# Pearson correlation matrix for numerical features (captures linear relationships only)
corr_matrix = df[num_features].corr()
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix of Numerical Features')
plt.show()
```

## Conclusion

This template is a starting point for EDA with a categorical target variable. Depending on your data and the questions you want to answer, you may need further analysis or visualization steps.

