# Exploratory Data Analysis (EDA) - Number 1

## Objective
This notebook demonstrates a clean, well-summarized exploratory data analysis (EDA) using NumPy and pandas for data handling, with Matplotlib and seaborn for visualization. It walks through data inspection, cleaning, univariate and multivariate analysis, and a summary of findings.

---

## 1. Library Imports
Import essential libraries for data manipulation, analysis, and visualization.

In [None]:
# Data manipulation and analysis
import numpy as np
import pandas as pd

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Configuration
plt.style.use('default')
sns.set_palette('husl')
%matplotlib inline

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

print("Libraries imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

## 2. Data Loading and Overview
Load the dataset and perform initial inspection.

In [None]:
# Create a sample dataset for demonstration
# Replace this with your actual data loading: df = pd.read_csv('your_data.csv')

np.random.seed(42)
n_samples = 100

df = pd.DataFrame({
    'id': range(1, n_samples + 1),
    'age': np.random.randint(18, 65, n_samples),
    'income': np.random.normal(50000, 15000, n_samples),
    'score': np.random.uniform(0, 100, n_samples),
    'category': np.random.choice(['A', 'B', 'C'], n_samples),
    'is_active': np.random.choice([True, False], n_samples)
})

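# Introduce some missing values for demonstration
# replace=False samples distinct rows, so exactly 5 and 3 values are blanked
df.loc[np.random.choice(df.index, 5, replace=False), 'income'] = np.nan
df.loc[np.random.choice(df.index, 3, replace=False), 'score'] = np.nan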

print("Dataset loaded successfully!")
print(f"\nDataset shape: {df.shape}")
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")
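
When the synthetic frame is swapped for a real file, it can help to state the expected dtypes up front rather than rely on type inference. The following is a minimal sketch, assuming a CSV with the same six columns as the sample data; the inline `csv_text` is a hypothetical stand-in for `your_data.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for 'your_data.csv'
csv_text = """id,age,income,score,category,is_active
1,34,52000.5,71.2,A,True
2,45,,88.0,B,False
3,29,61000.0,,C,True
"""

# Explicit dtypes avoid surprises from type inference; empty fields become NaN
df_example = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"id": "int64", "age": "int64", "category": "category"},
)

print(df_example.dtypes)
print(df_example.isna().sum())
```

With a real file, `io.StringIO(csv_text)` would be replaced by the file path.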

In [None]:
# Display first few rows
print("First 5 rows of the dataset:")
df.head()

In [None]:
# Display last few rows
print("Last 5 rows of the dataset:")
df.tail()

In [None]:
# Dataset information
print("Dataset Information:")
df.info()

In [None]:
# Data types
print("Data types of each column:")
df.dtypes

## 3. Data Cleaning and Preprocessing
Identify and handle data quality issues.

In [None]:
# Check for missing values
print("Missing values in each column:")
missing_values = df.isnull().sum()
print(missing_values)
print(f"\nTotal missing values: {missing_values.sum()}")
print(f"Percentage of missing values: {(missing_values.sum() / df.size) * 100:.2f}%")
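
A per-column breakdown is often more useful than a single overall percentage, because it shows where the gaps are concentrated. A minimal sketch on a small hypothetical frame (`df_example` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with known gaps
df_example = pd.DataFrame({
    "income": [50000.0, np.nan, 42000.0, np.nan],
    "score": [71.0, 88.5, np.nan, 64.0],
    "category": ["A", "B", "C", "A"],
})

missing_summary = pd.DataFrame({
    "missing_count": df_example.isna().sum(),
    "missing_pct": df_example.isna().mean() * 100,
})

# Keep only columns that actually have gaps, worst first
missing_summary = (
    missing_summary[missing_summary["missing_count"] > 0]
    .sort_values("missing_pct", ascending=False)
)
print(missing_summary)
```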

In [None]:
# Visualize missing values
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis', yticklabels=False)
plt.title('Missing Values Heatmap', fontsize=14, fontweight='bold')
plt.xlabel('Columns')
plt.tight_layout()
plt.show()

In [None]:
# Check for duplicate rows
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")
if duplicates > 0:
    print("\nDuplicate rows:")
    print(df[df.duplicated()])
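
Note that `df.duplicated()` only flags rows in which every column matches. In a frame with a unique `id` column, such as the sample data, that check cannot fire, so it may be worth checking the key column separately. A sketch on a hypothetical frame where one `id` appears twice:

```python
import pandas as pd

# Hypothetical frame: id 2 appears twice with different scores
df_example = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "age": [30, 41, 41, 25],
    "score": [70.0, 55.0, 58.0, 90.0],
})

# Fully identical rows: none here, because the two id=2 rows differ in score
full_dupes = df_example.duplicated().sum()

# Rows sharing a key; keep=False marks every member of a duplicated group
key_dupes = df_example[df_example.duplicated(subset=["id"], keep=False)]

print(f"Fully duplicated rows: {full_dupes}")
print(key_dupes)
```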

In [None]:
# Handle missing values (example: fill with median for numerical columns)
df_cleaned = df.copy()

# Fill numerical missing values with median
numerical_cols = df_cleaned.select_dtypes(include=[np.number]).columns
for col in numerical_cols:
    if df_cleaned[col].isnull().sum() > 0:
        median_value = df_cleaned[col].median()
        # Assign the result back; inplace=True on a column selection is deprecated in pandas 2.x
        df_cleaned[col] = df_cleaned[col].fillna(median_value)
        print(f"Filled missing values in '{col}' with median: {median_value:.2f}")

print(f"\nMissing values after cleaning: {df_cleaned.isnull().sum().sum()}")
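
Median filling overwrites the gaps without leaving a trace of which rows were imputed. One common option, shown here as a sketch rather than a required step, is to store a boolean indicator column before filling. The frame and the `_was_missing` suffix are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap per column
df_example = pd.DataFrame({
    "income": [40000.0, np.nan, 60000.0, 50000.0],
    "score": [10.0, 20.0, np.nan, 90.0],
})

df_filled = df_example.copy()
for col in ["income", "score"]:
    # Record which rows were imputed before overwriting them
    df_filled[f"{col}_was_missing"] = df_filled[col].isna()
    df_filled[col] = df_filled[col].fillna(df_filled[col].median())

print(df_filled)
```

The indicator columns make it possible to exclude or down-weight imputed rows later, for example when inspecting histograms that show a spike at the median.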

### Data Cleaning Summary
- **Missing Values**: Identified and filled numerical missing values with median
- **Duplicates**: Checked for duplicate rows
- **Data Types**: Verified all columns have appropriate data types
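
The data type check can also be made explicit in code instead of being read off `df.dtypes` by eye. A minimal sketch, assuming a frame with `id`, `category`, and `is_active` columns like the sample data; converting `category` to the pandas `category` dtype is an optional choice made here, not something the notebook does.

```python
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical frame with three of the sample columns
df_example = pd.DataFrame({
    "id": [1, 2, 3],
    "category": ["A", "B", "A"],
    "is_active": [True, False, True],
})

# Low-cardinality strings are often stored more compactly as 'category'
df_example["category"] = df_example["category"].astype("category")

# One boolean per column: does the dtype match what we expect?
checks = {
    "id": ptypes.is_integer_dtype(df_example["id"]),
    "category": isinstance(df_example["category"].dtype, pd.CategoricalDtype),
    "is_active": ptypes.is_bool_dtype(df_example["is_active"]),
}
print(checks)
```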

## 4. Univariate Analysis
Analyze individual variables to understand their distributions and characteristics.

In [None]:
# Statistical summary of numerical columns
print("Statistical Summary:")
df_cleaned.describe()
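
`describe()` reports location and spread but not the shape of a distribution. Skewness and kurtosis can be added alongside it; the sketch below uses two hypothetical columns, one symmetric and one with a long right tail.

```python
import pandas as pd

# Hypothetical columns chosen to contrast a symmetric and a skewed shape
df_example = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_skewed": [1.0, 1.0, 2.0, 2.0, 20.0],
})

shape_stats = pd.DataFrame({
    "skew": df_example.skew(),
    "kurtosis": df_example.kurt(),  # excess kurtosis: 0 for a normal distribution
})
print(shape_stats)
```

A skew near zero suggests symmetry, while a clearly positive value points to a long right tail.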

In [None]:
# Distribution of numerical variables
numerical_cols = ['age', 'income', 'score']
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for idx, col in enumerate(numerical_cols):
    axes[idx].hist(df_cleaned[col], bins=20, color='skyblue', edgecolor='black', alpha=0.7)
    axes[idx].set_title(f'Distribution of {col.capitalize()}', fontweight='bold')
    axes[idx].set_xlabel(col.capitalize())
    axes[idx].set_ylabel('Frequency')
    axes[idx].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Box plots for outlier detection
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for idx, col in enumerate(numerical_cols):
    # Vertical orientation is the default, so no orientation argument is needed
    axes[idx].boxplot(df_cleaned[col].dropna())
    axes[idx].set_title(f'Box Plot: {col.capitalize()}', fontweight='bold')
    axes[idx].set_ylabel(col.capitalize())
    axes[idx].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()
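
Box plots show outliers visually; the same rule can be applied numerically to count or list them. Matplotlib's `boxplot` places whiskers at 1.5 times the interquartile range (IQR) by default, and the sketch below applies that rule directly. The helper name `iqr_outliers` and the sample values are hypothetical.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Hypothetical scores with one obviously extreme value
values = pd.Series([10, 12, 11, 13, 12, 11, 95], name="score")
mask = iqr_outliers(values)
print(values[mask])
```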

In [None]:
# Categorical variable analysis
print("Category Distribution:")
category_counts = df_cleaned['category'].value_counts()
print(category_counts)
print("\nCategory Percentages:")
print(df_cleaned['category'].value_counts(normalize=True) * 100)
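
The `is_active` column is not otherwise examined in this notebook. One way to look at it together with `category` is a cross-tabulation, sketched here on a small hypothetical frame:

```python
import pandas as pd

# Hypothetical labels for six rows
df_example = pd.DataFrame({
    "category": ["A", "A", "B", "B", "B", "C"],
    "is_active": [True, False, True, True, False, False],
})

# Raw counts per (category, is_active) combination
counts = pd.crosstab(df_example["category"], df_example["is_active"])

# normalize='index' turns each row into shares that sum to 1
shares = pd.crosstab(
    df_example["category"], df_example["is_active"], normalize="index"
)
print(counts)
print(shares.round(2))
```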

In [None]:
# Visualize categorical distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Bar plot
df_cleaned['category'].value_counts().plot(kind='bar', ax=axes[0], color='coral', edgecolor='black')
axes[0].set_title('Category Distribution (Bar Chart)', fontweight='bold')
axes[0].set_xlabel('Category')
axes[0].set_ylabel('Count')
axes[0].grid(axis='y', alpha=0.3)

# Pie chart
df_cleaned['category'].value_counts().plot(kind='pie', ax=axes[1], autopct='%1.1f%%', startangle=90)
axes[1].set_title('Category Distribution (Pie Chart)', fontweight='bold')
axes[1].set_ylabel('')

plt.tight_layout()
plt.show()

## 5. Bivariate and Multivariate Analysis
Explore relationships between variables.

In [None]:
# Correlation matrix
print("Correlation Matrix:")
correlation_matrix = df_cleaned[numerical_cols].corr()
print(correlation_matrix)
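
`corr()` defaults to Pearson correlation, which measures linear association only. Spearman rank correlation can pick up monotonic but non-linear relationships. The sketch below compares the two on hypothetical columns and lists each pair once, strongest first.

```python
import numpy as np
import pandas as pd

# Hypothetical columns: x_cubed is monotonic but non-linear in x
df_example = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "x_cubed": [1, 8, 27, 64, 125],
    "noise": [3, 1, 4, 1, 5],
})

pearson = df_example.corr(method="pearson")
spearman = df_example.corr(method="spearman")

# Keep the upper triangle only so each pair appears once
upper = np.triu(np.ones(spearman.shape, dtype=bool), k=1)
pairs = (
    spearman.where(upper)
    .stack()
    .dropna()
    .sort_values(key=abs, ascending=False)
)
print(pairs)
```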

In [None]:
# Correlation heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Heatmap', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Scatter plots to visualize relationships
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Age vs Income
axes[0].scatter(df_cleaned['age'], df_cleaned['income'], alpha=0.6, color='blue')
axes[0].set_title('Age vs Income', fontweight='bold')
axes[0].set_xlabel('Age')
axes[0].set_ylabel('Income')
axes[0].grid(alpha=0.3)

# Age vs Score
axes[1].scatter(df_cleaned['age'], df_cleaned['score'], alpha=0.6, color='green')
axes[1].set_title('Age vs Score', fontweight='bold')
axes[1].set_xlabel('Age')
axes[1].set_ylabel('Score')
axes[1].grid(alpha=0.3)

# Income vs Score
axes[2].scatter(df_cleaned['income'], df_cleaned['score'], alpha=0.6, color='red')
axes[2].set_title('Income vs Score', fontweight='bold')
axes[2].set_xlabel('Income')
axes[2].set_ylabel('Score')
axes[2].grid(alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Group analysis by category
print("Statistical Summary by Category:")
grouped_stats = df_cleaned.groupby('category')[numerical_cols].agg(['mean', 'median', 'std'])
print(grouped_stats)
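
Group means and medians are easier to judge when the group size is shown next to them. Named aggregation gives flat, readable column names; a sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical incomes for two categories
df_example = pd.DataFrame({
    "category": ["A", "A", "B", "B", "B"],
    "income": [40000.0, 60000.0, 30000.0, 50000.0, 70000.0],
})

# Each keyword becomes an output column: name=(source column, aggregation)
summary = df_example.groupby("category").agg(
    n=("income", "size"),
    income_mean=("income", "mean"),
    income_median=("income", "median"),
)
print(summary)
```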

In [None]:
# Visualize distributions by category
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for idx, col in enumerate(numerical_cols):
    df_cleaned.boxplot(column=col, by='category', ax=axes[idx])
    axes[idx].set_title(f'{col.capitalize()} by Category', fontweight='bold')
    axes[idx].set_xlabel('Category')
    axes[idx].set_ylabel(col.capitalize())
    axes[idx].get_figure().suptitle('')  # Remove the automatic title

plt.tight_layout()
plt.show()

In [None]:
# Pair plot for comprehensive multivariate analysis
print("Generating pair plot...")
sns.pairplot(df_cleaned[numerical_cols + ['category']], hue='category', diag_kind='kde', corner=False)
plt.suptitle('Pair Plot: Multivariate Analysis', y=1.02, fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## 6. Key Findings and Insights

### Data Quality
- Dataset contains **100 observations** with **6 variables**
- Filled **missing values** in the `income` and `score` columns with each column's median
- No fully duplicated rows found (expected, since every row has a unique `id`)
- All data types are appropriate for analysis

### Distribution Insights
- **Age**: Integers drawn uniformly from 18 to 64 (`np.random.randint` excludes the upper bound of 65)
- **Income**: Drawn from a normal distribution with mean $50,000 and standard deviation $15,000
- **Score**: Drawn uniformly between 0 and 100
- **Category**: Three categories (A, B, C) sampled with equal probability, so counts are roughly balanced

### Relationships
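- The numerical variables are generated independently of one another, so sample correlations are expected to be close to zero
- No strong linear correlations observed in the sample data
- Category labels are assigned at random, so differences between categories in the numerical variables reflect sampling noise rather than real effects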

### Recommendations for Further Analysis
1. **Feature Engineering**: Consider creating derived variables based on existing features
2. **Outlier Treatment**: Investigate and handle outliers identified in box plots
3. **Advanced Analytics**: Apply statistical tests to validate observed patterns
4. **Predictive Modeling**: Use cleaned data for machine learning applications
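
As one possible starting point for the statistical tests mentioned in item 3, a permutation test checks whether a difference in group means could plausibly arise by chance. This is a sketch using NumPy only; the function name, the number of permutations, and the two sample groups are hypothetical choices, and other tests may suit a given dataset better.

```python
import numpy as np

def permutation_test_mean_diff(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means between two samples."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the difference
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:a.size].mean() - shuffled[a.size:].mean())
        hits += diff >= observed
    # Add-one correction keeps the estimate strictly above zero
    return (hits + 1) / (n_permutations + 1)

# Hypothetical scores for two categories
group_a = [52, 48, 50, 51, 49, 53, 47, 50]
group_b = [71, 69, 70, 72, 68, 73, 70, 69]
p_value = permutation_test_mean_diff(group_a, group_b)
print(f"p-value: {p_value:.4f}")
```

A small p-value suggests the observed difference is unlikely under random label assignment. On the synthetic data in this notebook, where categories are assigned at random, large p-values would be the expected outcome.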

---

## Conclusion
This EDA provides a comprehensive overview of the dataset, following best practices for data exploration. The analysis includes data quality checks, univariate analysis, bivariate relationships, and multivariate patterns. The cleaned dataset is now ready for further statistical analysis or machine learning applications.

**Note**: This notebook uses sample data for demonstration. Replace the data loading section with your actual dataset to perform a real analysis.
