
- EXPLORATORY DATA ANALYSIS ON A DATASET

- Objective:
The main goal of this assignment is to conduct a thorough exploratory analysis of the "Cardiotocographic.csv" dataset to uncover insights, identify patterns, and understand the dataset's underlying structure. You will use statistical summaries, visualizations, and data manipulation techniques to explore the dataset comprehensively.

- Dataset:
1.	LB - Likely stands for "Baseline Fetal Heart Rate (FHR)" which represents the average fetal heart rate over a period.
2.	AC - Could represent "Accelerations" in the FHR. Accelerations are usually a sign of fetal well-being.
3.	FM - May indicate "Fetal Movements" detected by the monitor.
4.	UC - Likely denotes "Uterine Contractions", which can impact the FHR pattern.
5.	DL - Could stand for "Decelerations Late" with respect to uterine contractions, which can be a sign of fetal distress.
6.	DS - May represent "Decelerations Short" or decelerations of brief duration.
7.	DP - Could indicate "Decelerations Prolonged", or long-lasting decelerations.
8.	ASTV - Might refer to "Percentage of Time with Abnormal Short Term Variability" in the FHR.
9.	MSTV - Likely stands for "Mean Value of Short Term Variability" in the FHR.
10.	ALTV - Could represent "Percentage of Time with Abnormal Long Term Variability" in the FHR.
11.	MLTV - Might indicate "Mean Value of Long Term Variability" in the FHR.


- Tools and Libraries:

1. Python or R programming language
2. Data manipulation libraries
3. Data visualization libraries (Matplotlib and Seaborn in Python)
4. Jupyter Notebook for documenting your analysis


- Tasks:
1.	Data Cleaning and Preparation:

●	Load the dataset into a DataFrame or equivalent data structure.

●	Handle missing values appropriately (e.g., imputation, deletion).

●	Identify and correct any inconsistencies in data types (e.g., numerical values stored as strings).

●	Detect and treat outliers if necessary.

2.	Statistical Summary:

●	Provide a statistical summary for each variable in the dataset, including measures of central tendency (mean, median) and dispersion (standard deviation, interquartile range).

●	Highlight any interesting findings from this summary.


In [20]:
import pandas as pd
import numpy as np

In [2]:
df=pd.read_csv('/content/Cardiotocographic.csv')

In [3]:
df

Unnamed: 0,LB,AC,FM,UC,DL,DS,DP,ASTV,MSTV,ALTV,MLTV,Width,Tendency,NSP
0,120.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,73.0,0.5,43.000000,2.4,64.0,0.999926,2.000000
1,132.000000,0.006380,0.000000,0.006380,0.003190,0.0,0.0,17.0,2.1,0.000000,10.4,130.0,0.000000,1.000000
2,133.000000,0.003322,0.000000,0.008306,0.003322,0.0,0.0,16.0,2.1,0.000000,13.4,130.0,0.000000,1.000000
3,134.000000,0.002561,0.000000,0.007742,0.002561,0.0,0.0,16.0,2.4,0.000000,23.0,117.0,1.000000,1.000000
4,131.948232,0.006515,0.000000,0.008143,0.000000,0.0,0.0,16.0,2.4,0.000000,19.9,117.0,1.000000,1.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2121,140.000000,0.000000,0.961268,0.007426,0.000000,0.0,0.0,79.0,0.2,25.000000,7.2,40.0,0.000000,2.000000
2122,140.000000,0.000775,0.000000,0.006979,0.000000,0.0,0.0,78.0,0.4,22.000000,7.1,66.0,1.000000,2.000000
2123,140.000000,0.000980,0.000000,0.006863,0.000000,0.0,0.0,79.0,0.4,20.000000,6.1,67.0,1.000000,1.990464
2124,140.000000,0.000679,0.000000,0.006110,0.000000,0.0,0.0,78.0,0.4,27.000000,7.0,66.0,1.000000,2.000000


In [4]:
df.head()

Unnamed: 0,LB,AC,FM,UC,DL,DS,DP,ASTV,MSTV,ALTV,MLTV,Width,Tendency,NSP
0,120.0,0.0,0.0,0.0,0.0,0.0,0.0,73.0,0.5,43.0,2.4,64.0,0.999926,2.0
1,132.0,0.00638,0.0,0.00638,0.00319,0.0,0.0,17.0,2.1,0.0,10.4,130.0,0.0,1.0
2,133.0,0.003322,0.0,0.008306,0.003322,0.0,0.0,16.0,2.1,0.0,13.4,130.0,0.0,1.0
3,134.0,0.002561,0.0,0.007742,0.002561,0.0,0.0,16.0,2.4,0.0,23.0,117.0,1.0,1.0
4,131.948232,0.006515,0.0,0.008143,0.0,0.0,0.0,16.0,2.4,0.0,19.9,117.0,1.0,1.0


In [5]:
df.tail()

Unnamed: 0,LB,AC,FM,UC,DL,DS,DP,ASTV,MSTV,ALTV,MLTV,Width,Tendency,NSP
2121,140.0,0.0,0.961268,0.007426,0.0,0.0,0.0,79.0,0.2,25.0,7.2,40.0,0.0,2.0
2122,140.0,0.000775,0.0,0.006979,0.0,0.0,0.0,78.0,0.4,22.0,7.1,66.0,1.0,2.0
2123,140.0,0.00098,0.0,0.006863,0.0,0.0,0.0,79.0,0.4,20.0,6.1,67.0,1.0,1.990464
2124,140.0,0.000679,0.0,0.00611,0.0,0.0,0.0,78.0,0.4,27.0,7.0,66.0,1.0,2.0
2125,142.0,0.001616,-0.000188,0.008078,0.0,0.0,0.0,74.0,0.4,35.857183,5.0,42.0,0.0,1.0


In [6]:
df.shape

(2126, 14)

- Data Cleaning and Preparation

In [8]:
df.describe()

Unnamed: 0,LB,AC,FM,UC,DL,DS,DP,ASTV,MSTV,ALTV,MLTV,Width,Tendency,NSP
count,2105.0,2106.0,2126.0,2126.0,2126.0,2105.0,2105.0,2126.0,2126.0,2126.0,2105.0,2105.0,2105.0,2105.0
mean,133.343598,0.003219,0.009894,0.004391,0.001895,3e-06,0.000175,46.995984,1.364378,10.285964,8.284887,70.42926,0.316371,1.304507
std,11.270154,0.004391,0.06754,0.00334,0.003343,0.000142,0.00084,18.813973,1.173632,21.205041,7.772858,42.931822,0.645622,0.644619
min,51.842487,-0.019284,-0.480634,-0.014925,-0.015393,-0.001353,-0.005348,-63.0,-6.6,-91.0,-50.7,-174.0,-3.0,-1.025988
25%,126.0,0.0,0.0,0.001851,0.0,0.0,0.0,32.0,0.7,0.0,4.6,37.0,0.0,1.0
50%,133.0,0.001634,0.0,0.004484,0.0,0.0,0.0,49.0,1.2,0.0,7.4,67.0,0.0,1.0
75%,140.0,0.00565,0.002567,0.006536,0.003289,0.0,0.0,61.0,1.7,11.0,10.9,100.0,1.0,1.0
max,214.0,0.038567,0.961268,0.030002,0.030769,0.002706,0.010695,162.0,13.8,182.0,101.4,357.0,3.0,5.0
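
Several values in this summary look implausible: ASTV and ALTV are described as percentages, yet their minima are negative (-63 and -91) and their maxima exceed 100. One common way to flag such values is the 1.5 x IQR rule. The following is a minimal sketch on a small made-up frame, not the real data; `iqr_outlier_mask` is a hypothetical helper name, and the 1.5 multiplier is a convention rather than something the assignment prescribes.

```python
import pandas as pd

def iqr_outlier_mask(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Return a boolean frame that is True where a value lies outside
    [Q1 - k*IQR, Q3 + k*IQR] for its own column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return frame.lt(lower, axis=1) | frame.gt(upper, axis=1)

# Toy data (made up for illustration only)
toy = pd.DataFrame({
    "LB": [120.0, 132.0, 133.0, 134.0, 131.0, 214.0],
    "ASTV": [73.0, 17.0, 16.0, 16.0, 16.0, -63.0],
})

mask = iqr_outlier_mask(toy)
outlier_counts = mask.sum()   # number of flagged values per column
print(outlier_counts)

# One possible treatment: clip flagged values to the IQR fences
q1, q3 = toy.quantile(0.25), toy.quantile(0.75)
iqr = q3 - q1
toy_clipped = toy.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr, axis=1)
print(toy_clipped)
```

Clipping keeps every row; dropping flagged rows or replacing them with the median are alternatives, and which one is appropriate depends on whether the extreme values are measurement errors or genuine observations.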


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2126 entries, 0 to 2125
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   LB        2105 non-null   float64
 1   AC        2106 non-null   float64
 2   FM        2126 non-null   float64
 3   UC        2126 non-null   float64
 4   DL        2126 non-null   float64
 5   DS        2105 non-null   float64
 6   DP        2105 non-null   float64
 7   ASTV      2126 non-null   float64
 8   MSTV      2126 non-null   float64
 9   ALTV      2126 non-null   float64
 10  MLTV      2105 non-null   float64
 11  Width     2105 non-null   float64
 12  Tendency  2105 non-null   float64
 13  NSP       2105 non-null   float64
dtypes: float64(14)
memory usage: 232.7 KB
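
All 14 columns are already `float64`, so no type conversion is needed for this dataset. If a numeric column had been read as text instead (for example because of a stray non-numeric entry), one option is `pd.to_numeric` with `errors="coerce"`. The sketch below uses a made-up frame to show the idea; the column values and the `raw` and `non_numeric` names are illustrative assumptions.

```python
import pandas as pd

# Toy frame (made up): "LB" was read as text because of a stray entry
raw = pd.DataFrame({
    "LB": ["120", "132", "n/a", "134"],
    "AC": [0.0, 0.006, 0.003, 0.002],
})

# Columns whose dtype is not numeric
non_numeric = raw.select_dtypes(exclude="number").columns.tolist()

# Convert them; entries that cannot be parsed become NaN
# and can then be handled like any other missing value
for col in non_numeric:
    raw[col] = pd.to_numeric(raw[col], errors="coerce")

print(raw.dtypes)
```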


In [9]:
missing_values=df.isnull().sum()
missing_values

Unnamed: 0,0
LB,21
AC,20
FM,0
UC,0
DL,0
DS,21
DP,21
ASTV,0
MSTV,0
ALTV,0


- LB: 21 missing values
- AC: 20 missing values
- DS, DP, MLTV, Width, Tendency, NSP: 21 missing values each

Since relatively few values are missing (less than 1% of the rows in each affected column), imputation is a reasonable strategy.
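
The "less than 1%" figure can be checked directly: `isnull().mean()` returns the fraction of missing rows per column. A minimal sketch, using a made-up frame in place of the real data (the `toy` values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame (made up) standing in for the real data
toy = pd.DataFrame({
    "LB": [120.0, np.nan, 133.0, 134.0],
    "FM": [0.0, 0.0, 0.0, 0.0],
})

# Fraction of missing rows per column, expressed as a percentage
missing_pct = toy.isnull().mean() * 100
print(missing_pct.round(2))

# The same arithmetic for the counts reported above:
# 21 missing values out of 2126 rows
pct_21 = 21 / 2126 * 100
print(round(pct_21, 2))
```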

In [10]:
# Mean imputation for every column that has missing values.
# Assigning the result back avoids the chained `inplace=True` pattern,
# which recent pandas versions warn about and which may not modify df.
cols_with_missing = ['LB', 'AC', 'DS', 'DP', 'MLTV', 'Width', 'Tendency', 'NSP']
for col in cols_with_missing:
    df[col] = df[col].fillna(df[col].mean())

In [11]:
missing_value_after_imputation=df.isnull().sum()
missing_value_after_imputation

Unnamed: 0,0
LB,0
AC,0
FM,0
UC,0
DL,0
DS,0
DP,0
ASTV,0
MSTV,0
ALTV,0


After imputation, no missing values remain in the columns shown.

- Statistical Summary

In [15]:
stat_summary=df.describe().T   #transpose for easier viewing

In [16]:
stat_summary

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
LB,2126.0,133.343598,11.214328,51.842487,126.0,133.0,140.0,214.0
AC,2126.0,0.003219,0.00437,-0.019284,0.0,0.001668,0.005606,0.038567
FM,2126.0,0.009894,0.06754,-0.480634,0.0,0.0,0.002567,0.961268
UC,2126.0,0.004391,0.00334,-0.014925,0.001851,0.004484,0.006536,0.030002
DL,2126.0,0.001895,0.003343,-0.015393,0.0,0.0,0.003289,0.030769
DS,2126.0,3e-06,0.000141,-0.001353,0.0,0.0,0.0,0.002706
DP,2126.0,0.000175,0.000836,-0.005348,0.0,0.0,0.0,0.010695
ASTV,2126.0,46.995984,18.813973,-63.0,32.0,49.0,61.0,162.0
MSTV,2126.0,1.364378,1.173632,-6.6,0.7,1.2,1.7,13.8
ALTV,2126.0,10.285964,21.205041,-91.0,0.0,0.0,11.0,182.0
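
Compared with the `df.describe()` output before imputation, the mean of LB is unchanged (133.343598) while its standard deviation is slightly lower (11.270154 before, 11.214328 after). This is the expected side effect of mean imputation: the filled values sit exactly at the mean, so they add nothing to the sum of squared deviations while increasing the count. The sketch below reproduces the effect on made-up values and shows median imputation as one alternative that is often considered for skewed columns; the numbers are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy values (made up) with two missing entries
s = pd.Series([120.0, 132.0, np.nan, 134.0, np.nan, 140.0])

mean_filled = s.fillna(s.mean())
median_filled = s.fillna(s.median())

# Mean imputation keeps the mean but shrinks the sample standard deviation
print("mean:", s.mean(), "->", mean_filled.mean())
print("std: ", s.std(), "->", mean_filled.std())

# Median imputation fills with the middle value instead
print("median-filled values:", median_filled.tolist())
```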


In [17]:
# Add the median, IQR, and skewness to the summary
stat_summary['Median'] = df.median()
stat_summary['IQR'] = df.quantile(0.75) - df.quantile(0.25)
stat_summary['Skewness'] = df.skew()

In [18]:
stat_summary[['Median','IQR','Skewness']]

Unnamed: 0,Median,IQR,Skewness
LB,133.0,14.0,0.322341
AC,0.001668,0.005606,2.026328
FM,0.0,0.002567,6.75307
UC,0.004484,0.004685,0.974239
DL,0.0,0.003289,2.01039
DS,0.0,0.0,8.460815
DP,0.0,0.0,6.454377
ASTV,49.0,29.0,0.055872
MSTV,1.2,1.0,4.142518
ALTV,0.0,11.0,2.981199


In [19]:
stat_summary

Unnamed: 0,count,mean,std,min,25%,50%,75%,max,Median,IQR,Skewness
LB,2126.0,133.343598,11.214328,51.842487,126.0,133.0,140.0,214.0,133.0,14.0,0.322341
AC,2126.0,0.003219,0.00437,-0.019284,0.0,0.001668,0.005606,0.038567,0.001668,0.005606,2.026328
FM,2126.0,0.009894,0.06754,-0.480634,0.0,0.0,0.002567,0.961268,0.0,0.002567,6.75307
UC,2126.0,0.004391,0.00334,-0.014925,0.001851,0.004484,0.006536,0.030002,0.004484,0.004685,0.974239
DL,2126.0,0.001895,0.003343,-0.015393,0.0,0.0,0.003289,0.030769,0.0,0.003289,2.01039
DS,2126.0,3e-06,0.000141,-0.001353,0.0,0.0,0.0,0.002706,0.0,0.0,8.460815
DP,2126.0,0.000175,0.000836,-0.005348,0.0,0.0,0.0,0.010695,0.0,0.0,6.454377
ASTV,2126.0,46.995984,18.813973,-63.0,32.0,49.0,61.0,162.0,49.0,29.0,0.055872
MSTV,2126.0,1.364378,1.173632,-6.6,0.7,1.2,1.7,13.8,1.2,1.0,4.142518
ALTV,2126.0,10.285964,21.205041,-91.0,0.0,0.0,11.0,182.0,0.0,11.0,2.981199
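
One way to pick out findings from this table is to flag the strongly skewed variables. The sketch below copies the skewness values shown above into a Series and keeps those whose absolute value exceeds 1; that cut-off is a rule of thumb assumed here for illustration, not a fixed standard.

```python
import pandas as pd

# Skewness values copied from the summary table above
skewness = pd.Series({
    "LB": 0.322341,
    "AC": 2.026328,
    "FM": 6.75307,
    "UC": 0.974239,
    "DL": 2.01039,
    "DS": 8.460815,
    "DP": 6.454377,
    "ASTV": 0.055872,
    "MSTV": 4.142518,
    "ALTV": 2.981199,
})

threshold = 1.0  # rule-of-thumb cut-off (an assumption)
highly_skewed = skewness[skewness.abs() > threshold].sort_values(ascending=False)
print(highly_skewed)
```

On these figures, LB, UC, and ASTV are roughly symmetric, while DS, FM, and DP are the most right-skewed. This is consistent with their medians and IQRs being zero or close to zero: most recordings have a value of 0 for these variables, with a small number of larger values.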
