# Exploratory Data Analysis: Cardiotocographic Dataset

This notebook performs EDA on the dataset located at `D:\DATA SCIENCE\ASSIGNMENTS\5 EDA1\EDA1\Cardiotocographic.csv`.

It includes:
- Data loading & cleaning
- Statistical summary
- Outlier detection
- Visualizations
- Pattern recognition
- Conclusion

---

In [None]:
# 1. Load dataset (Windows path)
import numpy as np
import pandas as pd

# Raw string: backslashes are literal, so they must not be doubled
input_path = r"D:\DATA SCIENCE\ASSIGNMENTS\5 EDA1\EDA1\Cardiotocographic.csv"
df = pd.read_csv(input_path)
df.head()

In [None]:
# 2. Inspect dataset
df.info()
print("\nShape:", df.shape)
df.describe().T

In [None]:
# 3. Data cleaning
# Remove duplicates
dupes = df.duplicated().sum()
if dupes > 0:
    df = df.drop_duplicates()
print("Duplicates dropped:", dupes)

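# Convert text columns to numeric where possible.
# Columns where nothing parses are left alone so they are not wiped to NaN.
for col in df.select_dtypes(include="object").columns:
    converted = pd.to_numeric(df[col].astype(str).str.strip(), errors="coerce")
    if converted.notna().any():
        df[col] = converted

# Handle missing values: median for numeric columns, mode for the rest.
# Assign the result back; fillna(inplace=True) on a column selection is unreliable.
for col in df.columns[df.isna().any()]:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna(df[col].mode().iloc[0])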

print("Missing values handled.")
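
Because imputation overwrites the gaps, it can be useful to record how much of each column was missing before filling it. The sketch below is one way to do that; `missing_report` is a hypothetical helper and the `toy` frame is made-up data, not rows from the Cardiotocographic file.

```python
import numpy as np
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values for columns that have any."""
    counts = frame.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only affected columns, worst first
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Made-up example data (assumption, for illustration only)
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1, 2, 3, 4],
    "c": ["x", None, "y", "z"],
})
report = missing_report(toy)
print(report)
```

In the notebook, this would be called as `missing_report(df)` before the imputation loop so the report reflects the raw data.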

In [None]:
# 4. Outlier detection (IQR method)
outlier_summary = {}
for col in df.select_dtypes(include=[np.number]).columns:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outliers = df[(df[col] < lower) | (df[col] > upper)]
    outlier_summary[col] = {'lower': lower, 'upper': upper, 'outliers': outliers.shape[0]}
outlier_summary
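
To make the IQR rule concrete, here is a small worked example of the same fences (Q1 - 1.5·IQR and Q3 + 1.5·IQR) on a toy series. `iqr_bounds` is a hypothetical helper name and the values are invented for illustration; they are not taken from the dataset.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Invented values with one obvious extreme point
toy = pd.Series([10, 12, 12, 13, 14, 15, 16, 50])
lower, upper = iqr_bounds(toy)

# Anything outside the fences is flagged, not removed
flagged = toy[(toy < lower) | (toy > upper)]
print(lower, upper, flagged.tolist())  # 7.125 20.125 [50]
```

Here Q1 = 12 and Q3 = 15.25, so IQR = 3.25 and the fences sit at 7.125 and 20.125; only the value 50 falls outside them.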

In [None]:
# 5. Statistical summary
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
summary = pd.DataFrame(index=num_cols)
summary['count'] = df[num_cols].count()
summary['mean'] = df[num_cols].mean()
summary['median'] = df[num_cols].median()
summary['std'] = df[num_cols].std()
summary['IQR'] = df[num_cols].quantile(0.75) - df[num_cols].quantile(0.25)
summary['skew'] = df[num_cols].skew()
summary['kurtosis'] = df[num_cols].kurtosis()
summary.round(3)
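
As a rough guide to reading the `skew` column: a positive value indicates a long right tail, which typically pulls the mean above the median. The sketch below illustrates this on invented numbers, not on the dataset's columns.

```python
import pandas as pd

# Invented right-skewed sample: most values are small, one is large
toy = pd.Series([1, 1, 1, 2, 2, 3, 10])

skew = toy.skew()          # sample skewness, as used in the summary table
mean_above_median = toy.mean() > toy.median()

print(round(skew, 3), mean_above_median)
```

A column with skewness near zero is roughly symmetric; large positive or negative values suggest the median and IQR describe it better than the mean and standard deviation.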

In [None]:
# 6. Visualizations
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

num_cols = df.select_dtypes(include=[np.number]).columns.tolist()

# Histograms
df[num_cols].hist(figsize=(14, 12), bins=20)
plt.tight_layout()

# Boxplots (first six numeric columns only)
plt.figure(figsize=(12, 8))
for i, col in enumerate(num_cols[:6], 1):
    plt.subplot(2, 3, i)
    sns.boxplot(y=df[col])
    plt.title(col)
plt.tight_layout()

# Correlation heatmap
plt.figure(figsize=(10, 8))
corr = df[num_cols].corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()

In [None]:
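# 7. Pattern recognition: strongest pairwise correlations
# Use the upper triangle so each pair appears once and self-correlations are excluded,
# then rank by absolute value so strong negative correlations are kept too.
upper_tri = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
corr_pairs = upper_tri.stack().dropna()
corr_pairs = corr_pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(20)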
corr_pairs
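
When ranking correlations, two details are easy to miss: every pair occurs twice in a full correlation matrix, and strong negative relationships vanish if values are sorted by sign instead of magnitude. The sketch below shows one way to handle both on made-up data; `top_correlated_pairs` is a hypothetical helper, not part of pandas.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(frame: pd.DataFrame, n: int = 5) -> pd.Series:
    """Unique column pairs ranked by absolute Pearson correlation."""
    corr = frame.corr()
    # True strictly above the diagonal: one entry per pair, no self-pairs
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Made-up data (assumption, for illustration only)
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],  # perfectly positively correlated with x
    "z": [5, 3, 4, 1, 2],   # negatively correlated with x and y
})
pairs = top_correlated_pairs(toy)
print(pairs)
```

The three columns yield exactly three pairs: (x, y) ranks first, followed by the two negative pairs with correlation -0.8.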

In [None]:
# 8. Save cleaned dataset to same Windows folder
output_clean = r"D:\DATA SCIENCE\ASSIGNMENTS\5 EDA1\EDA1\Cardiotocographic_cleaned.csv"
df.to_csv(output_clean, index=False)
print("Cleaned dataset saved to:", output_clean)

---
# Conclusion
- Dataset cleaned: missing values imputed, duplicates dropped, data types corrected.
- Outliers flagged via IQR method; extreme cases need domain review.
- Statistical summary highlighted skewness in FM and UC, and variability in ASTV/ALTV.
- Visualizations confirmed skewness, presence of outliers, and correlations among variability features.
- Cleaned dataset saved alongside original for reproducibility.