# Data Analysis Notebook
**Purpose**: Standardized exploratory analysis for [Dataset].  
**Author**: Eric    
**Key Tools**: Python, ydata_profiling, scipy, seaborn  
**Industry Standards**: CRISP-DM framework, Kaggle Survey Best Practices (2023)

### 1. Packages & Settings
*Why this matters*: Configurations ensure reproducibility and readability.  
*Industry Standard*: Always set random seeds, display limits, and visualization themes upfront.

In [None]:
# Core libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Advanced analysis
from scipy import stats
from ydata_profiling import ProfileReport

# Interactive tables (optional)
from itables import init_notebook_mode, show
init_notebook_mode(all_interactive=True)

# Configuration
pd.set_option('display.max_columns', 30)
sns.set_theme(style='whitegrid')
%config InlineBackend.figure_format = 'retina'
np.random.seed(42)  # Reproducibility

### 2. Importing Data
*Why this matters*: Raw data is the foundation of all analysis.  
*Industry Standard*: Always check for encoding and delimiter mismatches before trusting the parsed columns.

In [11]:
# 2. Data Loading
df = pd.read_csv(r"[Dataset Path]")
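
The encoding and delimiter check mentioned above could be sketched as follows. This is a minimal illustration, not the notebook's actual loader: `load_csv_with_fallback`, the candidate encoding list, and the demo file are all assumptions made for the example.

```python
import csv
import tempfile
from pathlib import Path

import pandas as pd


def load_csv_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Hypothetical helper: try each encoding in turn and sniff the delimiter."""
    last_error = None
    for encoding in encodings:
        try:
            # Read a small text sample so csv.Sniffer can guess the delimiter.
            with open(path, newline="", encoding=encoding) as handle:
                sample = handle.read(4096)
            try:
                delimiter = csv.Sniffer().sniff(sample, delimiters=",;\t|").delimiter
            except csv.Error:
                delimiter = ","  # sniffing failed; fall back to a comma
            frame = pd.read_csv(path, encoding=encoding, sep=delimiter)
            return frame, encoding, delimiter
        except UnicodeDecodeError as error:
            last_error = error  # wrong encoding; try the next candidate
    raise last_error


# Demo on a throwaway file: semicolon-delimited and latin-1 encoded.
demo_path = Path(tempfile.mkdtemp()) / "demo.csv"
demo_path.write_bytes("name;city\nZoé;Orléans\nAna;Málaga\n".encode("latin-1"))

demo_df, used_encoding, used_delimiter = load_csv_with_fallback(demo_path)
print(used_encoding, repr(used_delimiter), demo_df.shape)
```

Note that `latin-1` decodes any byte sequence without raising, so it works as a last resort but can silently mis-decode text; inspect a few rows after loading.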

### 3. First Look
*What you’ll learn*:  
- Dataset size and column types  
- Immediate red flags (e.g., 90% missing values in a column)  
- Example: If `df.shape` returns `(1000, 50)`, the dataset has 1,000 rows and 50 columns, which is small enough to profile in full without sampling.

In [None]:
# Shape and types
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
display(df.dtypes.to_frame(name='Data Type'))

# Missing values
null_summary = df.isna().sum().to_frame(name='Missing Values')
null_summary['% Missing'] = (null_summary['Missing Values'] / len(df)) * 100
display(null_summary.sort_values('% Missing', ascending=False))

# Sample data
show(df.sample(min(5, len(df))))  # Random rows to avoid bias; capped so datasets with fewer than 5 rows do not raise
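
The "90% missing values" red flag can also be checked programmatically. The sketch below assumes a hypothetical helper, `flag_sparse_columns`, and a toy DataFrame; the 0.9 threshold mirrors the example in the text and is not a fixed rule.

```python
import numpy as np
import pandas as pd


def flag_sparse_columns(frame, threshold=0.9):
    """Return columns whose share of missing values is at or above `threshold`."""
    missing_share = frame.isna().mean()  # fraction of NaN per column
    return missing_share[missing_share >= threshold].sort_values(ascending=False)


# Toy data for illustration only.
toy = pd.DataFrame({
    "id": range(10),
    "mostly_empty": [1.0] + [np.nan] * 9,  # 9 of 10 values missing
    "complete": list("abcdefghij"),
})

sparse = flag_sparse_columns(toy)
print(sparse)
```

In the notebook itself, calling `flag_sparse_columns(df)` would list candidate columns to drop or investigate before profiling.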

### 4. Automated Analysis
*What you’ll learn*:  
- Correlations between variables (e.g., "Sales increase with Marketing Spend")  
- Skewed or imbalanced distributions (e.g., 80% of users are from the USA)  
- Duplicate rows or constant-value columns  
*Industry Trade-off*: Fast but surface-level – use to guide deeper analysis.

In [None]:
profile = ProfileReport(df, title="Automated EDA", explorative=True)
profile.to_notebook_iframe()
# Save to HTML for later review (optional)
# profile.to_file("automated_eda_report.html")
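
Some of the report's findings are easy to confirm by hand with plain pandas. The following sketch assumes a hypothetical helper, `quick_quality_checks`, and toy data; it covers only duplicate rows, constant columns, and the share of the most frequent category.

```python
import pandas as pd


def quick_quality_checks(frame):
    """Count fully duplicated rows and list columns with a single distinct value."""
    duplicate_rows = int(frame.duplicated().sum())
    constant_columns = [
        col for col in frame.columns if frame[col].nunique(dropna=False) <= 1
    ]
    return {"duplicate_rows": duplicate_rows, "constant_columns": constant_columns}


# Toy data for illustration only.
toy = pd.DataFrame({
    "country": ["USA", "USA", "USA", "USA", "FR"],
    "plan": ["free"] * 5,          # constant column
    "spend": [10, 20, 20, 30, 5],  # rows 1 and 2 are identical across all columns
})

checks = quick_quality_checks(toy)
top_share = toy["country"].value_counts(normalize=True).iloc[0]
print(checks, f"top country share: {top_share:.0%}")
```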

### 5. Hypothesis-Driven Analysis
*What you’ll learn*:  
- Statistical significance of observed patterns  
- Relationships not caught by automated tools (e.g., interaction effects)  
*Industry Standard*: Always validate automated findings manually.

*Note*: The cell below is only an example of code for statistical analysis. It is not used directly.

In [None]:
# ----------------------------------
### A. Normality Check (Numerical Data)
# Example: If p < 0.05, data is non-normal -> use non-parametric tests
# ----------------------------------
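for col in df.select_dtypes(include=np.number):
    values = df[col].dropna()
    if len(values) < 3:
        continue  # Shapiro-Wilk needs at least 3 observations
    values = values.sample(min(len(values), 5000))  # Cap sample size; sample(5000) alone fails on shorter columns
    stat, p = stats.shapiro(values)
    print(f"{col}: Shapiro-Wilk p = {p:.4f}")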


# ----------------------------------
### B. Categorical Relationships
# Example: "Chi2 p < 0.05 suggests Region and Product Preference are associated (not necessarily causally)"
# ----------------------------------
def plot_categorical_association(df, col1, col2):
    contingency = pd.crosstab(df[col1], df[col2])
    chi2, p, _, _ = stats.chi2_contingency(contingency)
    
    plt.figure(figsize=(8,4))
    sns.heatmap(contingency, annot=True, fmt='d', cmap='Blues')
    plt.title(f"{col1} vs {col2}\nChi2 p-value: {p:.4f}")
    plt.show() 
# Usage: plot_categorical_association(df, 'Gender', 'Purchase_Status')


# ----------------------------------
### C. Correlation Significance
# Example: "Price and Sales have r=-0.7 (p<0.001) – strong negative relationship"
# ----------------------------------
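corr_matrix = df.corr(numeric_only=True)
# pearsonr returns (statistic, p-value), so index 1 is the p-value.
# pandas puts 1 on the diagonal for callable methods; subtracting the identity sets those p-values to 0.
p_values = df.corr(method=lambda x, y: stats.pearsonr(x, y)[1], numeric_only=True) - np.eye(corr_matrix.shape[1])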

plt.figure(figsize=(10,6))
sns.heatmap(corr_matrix, annot=True, mask=p_values > 0.05, cmap='coolwarm')
plt.title('Statistically Significant Correlations (p < 0.05)')
plt.show()
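
The rule of thumb in part A (non-normal data, so use a non-parametric test) might look like this for a two-group comparison. This is a sketch under stated assumptions: `compare_two_groups` is a hypothetical helper, the data is synthetic, and choosing a test from a normality pre-check is a common heuristic rather than a universal recommendation.

```python
import numpy as np
from scipy import stats


def compare_two_groups(a, b, alpha=0.05):
    """Welch's t-test if both samples pass Shapiro-Wilk, otherwise Mann-Whitney U."""
    looks_normal = all(stats.shapiro(x).pvalue >= alpha for x in (a, b))
    if looks_normal:
        result = stats.ttest_ind(a, b, equal_var=False)
        return "welch_t", result.pvalue
    result = stats.mannwhitneyu(a, b, alternative="two-sided")
    return "mann_whitney_u", result.pvalue


# Synthetic, strongly right-skewed samples for illustration only.
rng = np.random.default_rng(42)
skewed_a = rng.exponential(scale=1.0, size=200)
skewed_b = rng.exponential(scale=3.0, size=200)

test_name, p_value = compare_two_groups(skewed_a, skewed_b)
print(f"{test_name}: p = {p_value:.4g}")
```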

### 6. Focused Investigation
*When to use*:  
- Drill into subgroups (e.g., "Why do users aged 30-40 have higher churn?")  
- Export specific slices for stakeholder reviews  
*Industry Standard*: Never explore blindly – start with hypotheses from Sections 4-5.

In [None]:
# Example: Inspect high-income rows (the Income column and the 70000 threshold are illustrative)
show(
    df.query("Income > 70000"),
    column_filters="footer",
    buttons=["copy", "csv"],
    scrollY="300px",
    classes="compact"
)
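
A subgroup drill-down like the age and churn question above could be sketched with `pd.cut` and `groupby`. The `Age` and `Churned` columns, the bin edges, and the data are assumptions made for the example.

```python
import pandas as pd

# Toy data for illustration only; column names are hypothetical.
toy = pd.DataFrame({
    "Age": [22, 35, 38, 41, 33, 56, 29, 31],
    "Churned": [0, 1, 1, 0, 1, 0, 0, 0],
})

# Left-closed bins: [20, 30), [30, 40), [40, 50), [50, 60)
age_band = pd.cut(toy["Age"], bins=[20, 30, 40, 50, 60], right=False)
churn_by_band = toy.groupby(age_band, observed=True)["Churned"].agg(["mean", "size"])
print(churn_by_band)

# Export one slice as CSV text for a stakeholder review (pass a path to write a file).
slice_30s = toy[toy["Age"].between(30, 39)]
csv_text = slice_30s.to_csv(index=False)
```

Reporting the group size next to the rate helps avoid over-reading bands with very few rows.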