# Data Exploration and Visualization Notebook

This notebook is designed for exploring and visualizing datasets to gain insights before training AI models. It includes steps for data loading, cleaning, statistical analysis, and visualization using various Python libraries. The goal is to understand the structure, patterns, and potential issues in the data.

**Note**: Replace the dataset path with your own data file (e.g., CSV, JSON) or use a sample dataset as shown below.

## 1. Import Required Libraries

Let's start by importing the necessary libraries for data manipulation and visualization.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from scipy import stats
import warnings
# Suppress library warnings to keep cell output readable; remove this line when debugging
warnings.filterwarnings('ignore')

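# Set visualization styles
# The bare 'seaborn' style name is no longer available in recent Matplotlib releases,
# so apply seaborn's default theme directly instead.
sns.set_theme()
%matplotlib inline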

## 2. Load the Dataset

Load the dataset into a Pandas DataFrame. For this example, we'll use a sample dataset. Replace the path or dataset as needed.

In [None]:
# Load a sample dataset (replace with your own dataset path)
try:
    # Example: Using a publicly available dataset or local file
    df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
    print("Dataset loaded successfully!")
except Exception as e:
    print(f"Error loading dataset: {e}")
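    # Fallback to scikit-learn's bundled copy, renamed so the columns match the CSV
    from sklearn.datasets import load_iris
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
    df['species'] = iris.target_names[iris.target]  # map integer codes to species names
    print("Fallback dataset loaded successfully!")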

# Display the first few rows
df.head()
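
The note at the top mentions both CSV and JSON files. As a minimal sketch, the loader below picks a pandas reader from the file extension. `load_dataset` is a hypothetical helper (not part of pandas), and the tiny `toy` frame and temporary file names exist only to demonstrate the round trip.

```python
import os
import tempfile
from pathlib import Path

import pandas as pd


def load_dataset(path):
    """Hypothetical helper: choose a pandas reader from the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix == ".json":
        return pd.read_json(path)
    raise ValueError(f"Unsupported file type: {suffix!r}")


# Round-trip a tiny illustrative frame through both formats.
toy = pd.DataFrame({"x": [1, 2, 3], "label": ["a", "b", "a"]})
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "toy.csv")
    json_path = os.path.join(tmp, "toy.json")
    toy.to_csv(csv_path, index=False)
    toy.to_json(json_path, orient="records")
    from_csv = load_dataset(csv_path)
    from_json = load_dataset(json_path)

print(from_csv.shape, from_json.shape)
```

Nested JSON usually needs flattening (for example with `pd.json_normalize`) before it fits this pattern.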

## 3. Basic Data Inspection

Let's inspect the dataset to understand its structure, data types, and basic statistics.

In [None]:
# Display basic information about the dataset
print("Dataset Info:")
df.info()

# Display basic statistics
print("\nDataset Description:")
df.describe(include='all')
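
`df.info()` prints its report rather than returning it, so the result cannot be sorted or filtered. One possible alternative, sketched here on a small made-up frame (`toy` and its columns are illustrative only), is to collect the same facts into a DataFrame:

```python
import numpy as np
import pandas as pd

# Small illustrative frame; replace with your own DataFrame.
toy = pd.DataFrame({
    "length": [5.1, 4.9, np.nan, 6.3],
    "count": [1, 2, 2, 3],
    "species": ["setosa", "setosa", "virginica", None],
})

overview = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "non_null": toy.notna().sum(),
    "n_unique": toy.nunique(),  # missing values are not counted as a distinct value
})
print(overview)
```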

## 4. Check for Missing Values

Identify and visualize missing values in the dataset.

In [None]:
# Check for missing values
print("Missing Values:")
print(df.isnull().sum())

# Visualize missing values using a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()
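
Raw counts are hard to compare across datasets of different sizes. The sketch below, run on a made-up frame, reports the share of missing values per column and keeps only the columns that have any. The column names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative frame with gaps in two of its three columns.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})

missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,  # mean of a boolean mask is the fraction of True
})

# Keep only columns with at least one gap, worst first.
missing = missing[missing["n_missing"] > 0].sort_values("pct_missing", ascending=False)
print(missing)
```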

## 5. Data Distribution Analysis

Visualize the distribution of numerical columns to understand their spread and potential skewness.

In [None]:
# Select numerical columns ('number' covers every integer and float width, not only 64-bit)
numerical_cols = df.select_dtypes(include='number').columns

# Plot histograms for numerical columns
for col in numerical_cols:
    plt.figure(figsize=(8, 5))
    sns.histplot(df[col], kde=True, bins=30)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.show()

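# Check skewness and excess kurtosis (a normal distribution scores 0 on both)
for col in numerical_cols:
    values = df[col].dropna()  # SciPy returns NaN if any value is missing
    skewness = stats.skew(values)
    kurtosis = stats.kurtosis(values)
    print(f"{col} - Skewness: {skewness:.2f}, Kurtosis: {kurtosis:.2f}")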
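
SciPy and pandas can report different skewness for the same column, for two reasons: they handle missing values differently, and pandas applies a small-sample bias correction by default. The sketch below uses a made-up series to show both effects.

```python
import numpy as np
import pandas as pd
from scipy import stats

values = pd.Series([1.0, 2.0, 2.0, 3.0, 10.0, np.nan])
arr = values.to_numpy()

# SciPy propagates NaN by default, so this result is NaN.
skew_default = stats.skew(arr)

# Either ask SciPy to omit missing values or drop them first.
skew_omit = float(stats.skew(arr, nan_policy="omit"))
skew_dropna = float(stats.skew(values.dropna().to_numpy()))

# pandas skips NaN and applies a bias correction,
# so it agrees with SciPy only when bias=False.
skew_pandas = values.skew()
skew_unbiased = float(stats.skew(values.dropna().to_numpy(), bias=False))

print(skew_default, skew_omit, skew_pandas)
```

On small samples the corrected and uncorrected values can differ noticeably, so it helps to state which estimator a reported number comes from.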

## 6. Categorical Data Analysis

Analyze and visualize categorical columns to understand their distribution.

In [None]:
# Select categorical columns
categorical_cols = df.select_dtypes(include=['object', 'category']).columns

# Plot count plots for categorical columns
for col in categorical_cols:
    plt.figure(figsize=(8, 5))
    sns.countplot(data=df, x=col)
    plt.title(f'Count Plot of {col}')
    plt.xticks(rotation=45)
    plt.show()

# Display value counts
for col in categorical_cols:
    print(f"\nValue Counts for {col}:")
    print(df[col].value_counts())
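
Proportions often show class imbalance more clearly than raw counts. As a small sketch on made-up data, `value_counts` can return shares directly, and `dropna=False` keeps missing values in the tally:

```python
import pandas as pd

# Illustrative column with one missing label.
toy = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "versicolor", None],
})

# Proportions instead of raw counts; missing values get their own row.
shares = toy["species"].value_counts(normalize=True, dropna=False)
print(shares)
```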

## 7. Correlation Analysis

Explore linear relationships between numerical variables using a Pearson correlation matrix (the pandas default) and a heatmap.

In [None]:
# Compute correlation matrix
correlation_matrix = df[numerical_cols].corr()

# Plot correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix Heatmap')
plt.show()
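
With many columns, a heatmap is hard to scan for the strongest relationships. One possible complement, sketched here on synthetic data (the column names and noise levels are assumptions for illustration), is to flatten the upper triangle of the matrix into a ranked list of pairs:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "x": x,
    "x_noisy": x + rng.normal(scale=0.1, size=200),  # strongly related to x
    "z": rng.normal(size=200),                       # independent noise
})

corr = toy.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=np.abs, ascending=False)  # rank by absolute strength
)
print(pairs)
```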

## 8. Pairwise Relationships

Visualize pairwise relationships between variables using scatter plots and pair plots.

In [None]:
# Pair plot for numerical columns (limit to first 5 for performance)
sns.pairplot(df[numerical_cols[:5]])
plt.suptitle('Pairwise Relationships', y=1.05)
plt.show()

## 9. Outlier Detection

Identify potential outliers in numerical columns using box plots and the IQR method, which flags values more than 1.5 × IQR below the first quartile or above the third quartile.

In [None]:
# Plot box plots for numerical columns
for col in numerical_cols:
    plt.figure(figsize=(8, 5))
    sns.boxplot(data=df, y=col)
    plt.title(f'Box Plot of {col} for Outlier Detection')
    plt.show()

# Calculate IQR and detect outliers
for col in numerical_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)][col]
    print(f"\nOutliers in {col}:")
    print(f"Number of outliers: {len(outliers)}")
    if len(outliers) > 0:
        print(outliers.head())
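
The IQR rule used above can be wrapped in a reusable function. This is a minimal sketch: `iqr_outlier_mask` is a hypothetical helper name, the multiplier `k=1.5` is the conventional default rather than a requirement, and the data is made up.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Hypothetical helper: True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Illustrative data: column "a" has one extreme value, column "b" has none.
toy = pd.DataFrame({
    "a": [10, 11, 12, 13, 14, 100],
    "b": [1.0, 1.1, 0.9, 1.0, 1.2, 1.1],
})

# Count flagged values per column.
outlier_counts = toy.apply(lambda s: iqr_outlier_mask(s).sum())
print(outlier_counts)

# The boolean mask can also be used to inspect the flagged rows.
print(toy[iqr_outlier_mask(toy["a"])])
```

Returning a mask rather than the filtered values keeps the original index, so the flagged rows can be inspected, removed, or capped later.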

## 10. Interactive Visualizations with Plotly

Create interactive plots for deeper exploration using Plotly.

In [None]:
# Interactive scatter plot (adjust columns as needed)
# Color by 'target' if present, otherwise by the first categorical column
color_col = 'target' if 'target' in df.columns else next(iter(categorical_cols), None)
if len(numerical_cols) >= 2:
    fig = px.scatter(df, x=numerical_cols[0], y=numerical_cols[1], color=color_col,
                     title='Interactive Scatter Plot')
    fig.show()

# Interactive box plot
if len(numerical_cols) > 0:
    fig = px.box(df, y=numerical_cols[0], title=f'Interactive Box Plot of {numerical_cols[0]}')
    fig.show()

## 11. Summary and Insights

Summarize key findings from the exploration and note any actions needed for data preprocessing or modeling.

In [None]:
# Print a summary of key insights
print("Key Insights from Data Exploration:")
print("1. Dataset Shape:", df.shape)
print("2. Missing Values:", df.isnull().sum().sum())
print("3. Numerical Columns:", list(numerical_cols))
print("4. Categorical Columns:", list(categorical_cols))
print("5. Potential Issues: Check for outliers and skewed distributions as shown in plots.")
print("6. Next Steps: Handle missing values, encode categorical variables, and normalize numerical features if needed.")
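
The next steps listed above can be sketched with pandas alone. This is one possible approach, not a recommendation for every dataset: median imputation, z-score standardization, and one-hot encoding are assumed choices, and the `toy` frame is made up.

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing numeric value.
toy = pd.DataFrame({
    "length": [5.0, np.nan, 7.0, 6.0],
    "species": ["setosa", "virginica", "setosa", "virginica"],
})

numeric = toy.select_dtypes(include="number").columns
categorical = toy.columns.difference(numeric)

prepared = toy.copy()

# 1. Fill missing numeric values with the column median.
prepared[numeric] = prepared[numeric].fillna(prepared[numeric].median())

# 2. Standardize numeric columns to zero mean and unit variance.
prepared[numeric] = (
    (prepared[numeric] - prepared[numeric].mean()) / prepared[numeric].std(ddof=0)
)

# 3. One-hot encode the categorical columns.
prepared = pd.get_dummies(prepared, columns=list(categorical))
print(prepared)
```

When the data feeds a model, the median, mean, and standard deviation would normally be computed on the training split only and then reused on validation and test data.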