# Polycystic Ovary Syndrome | Dataset

Polycystic Ovary Syndrome (PCOS) is a hormonal imbalance disorder affecting women of reproductive age. 
It is characterized by irregular menstrual cycles, excess androgen (male hormone) levels, and the presence 
of multiple small cysts in the ovaries.
Women with PCOS may experience various symptoms, including weight gain, acne, and difficulty getting pregnant.

In [2]:
#importing libraries

import numpy as np
import pandas as pd

# Load the dataset into Jupyter Notebook

Datasets come in various formats, such as .csv, .json, and .xlsx, and can be stored on your local machine or online. In our case, the PCOS dataset is hosted online (on Kaggle) and is in CSV (comma-separated values) format.

Dataset name: PCOS_data.csv

The pandas library lets us read datasets in these formats into a DataFrame. The notebook environment used here comes with pandas preinstalled, so all we need to do is import it.
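
As a small illustration of reading different formats, the sketch below parses the same two records from CSV text and from JSON text. The in-memory strings stand in for real files and are made up for the example; only the column names are borrowed from the PCOS dataset.

```python
import io

import pandas as pd

# Illustrative in-memory data standing in for files on disk (not the real dataset).
csv_text = "PCOS (Y/N),BMI\n0,19.3\n1,25.3\n"
json_text = '[{"PCOS (Y/N)": 0, "BMI": 19.3}, {"PCOS (Y/N)": 1, "BMI": 25.3}]'

# read_csv and read_json both accept a path, a URL, or a file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```

Excel files can be read in the same way with `pd.read_excel`, which additionally needs an Excel engine such as openpyxl to be installed.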

In [3]:
#Load the dataset
#Dataset source - https://www.kaggle.com/datasets/shreyasvedpathak/pcos-dataset
#Read the dataset into the variable pcos

pcos = pd.read_csv('PCOS_data.csv')
pcos

Unnamed: 0,Sl. No,Patient File No.,PCOS (Y/N),Age (yrs),Weight (Kg),Height(Cm),BMI,Blood Group,Pulse rate(bpm),RR (breaths/min),...,Fast food (Y/N),Reg.Exercise(Y/N),BP _Systolic (mmHg),BP _Diastolic (mmHg),Follicle No. (L),Follicle No. (R),Avg. F size (L) (mm),Avg. F size (R) (mm),Endometrium (mm),Unnamed: 44
0,1,1,0,28,44.6,152.000,19.3,15,78,22,...,1.0,0,110,80,3,3,18.0,18.0,8.5,
1,2,2,0,36,65.0,161.500,24.9,15,74,20,...,0.0,0,120,70,3,5,15.0,14.0,3.7,
2,3,3,1,33,68.8,165.000,25.3,11,72,18,...,1.0,0,120,80,13,15,18.0,20.0,10.0,
3,4,4,0,37,65.0,148.000,29.7,13,72,20,...,0.0,0,120,70,2,2,15.0,14.0,7.5,
4,5,5,0,25,52.0,161.000,20.1,11,72,18,...,0.0,0,120,80,3,4,16.0,14.0,7.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
536,537,537,0,35,50.0,164.592,18.5,17,72,16,...,0.0,0,110,70,1,0,17.5,10.0,6.7,
537,538,538,0,30,63.2,158.000,25.3,15,72,18,...,0.0,0,110,70,9,7,19.0,18.0,8.2,
538,539,539,0,36,54.0,152.000,23.4,13,74,20,...,0.0,0,110,80,1,0,18.0,9.0,7.3,
539,540,540,0,27,50.0,150.000,22.2,15,74,20,...,0.0,0,110,70,7,6,18.0,16.0,11.5,


In [7]:
#Show the first 5 rows of the dataset

pcos.head()

Unnamed: 0,Sl. No,Patient File No.,PCOS (Y/N),Age (yrs),Weight (Kg),Height(Cm),BMI,Blood Group,Pulse rate(bpm),RR (breaths/min),...,Fast food (Y/N),Reg.Exercise(Y/N),BP _Systolic (mmHg),BP _Diastolic (mmHg),Follicle No. (L),Follicle No. (R),Avg. F size (L) (mm),Avg. F size (R) (mm),Endometrium (mm),Unnamed: 44
0,1,1,0,28,44.6,152.0,19.3,15,78,22,...,1.0,0,110,80,3,3,18.0,18.0,8.5,
1,2,2,0,36,65.0,161.5,24.9,15,74,20,...,0.0,0,120,70,3,5,15.0,14.0,3.7,
2,3,3,1,33,68.8,165.0,25.3,11,72,18,...,1.0,0,120,80,13,15,18.0,20.0,10.0,
3,4,4,0,37,65.0,148.0,29.7,13,72,20,...,0.0,0,120,70,2,2,15.0,14.0,7.5,
4,5,5,0,25,52.0,161.0,20.1,11,72,18,...,0.0,0,120,80,3,4,16.0,14.0,7.0,


In [8]:
#Show the last 5 rows of the dataset

pcos.tail()

Unnamed: 0,Sl. No,Patient File No.,PCOS (Y/N),Age (yrs),Weight (Kg),Height(Cm),BMI,Blood Group,Pulse rate(bpm),RR (breaths/min),...,Fast food (Y/N),Reg.Exercise(Y/N),BP _Systolic (mmHg),BP _Diastolic (mmHg),Follicle No. (L),Follicle No. (R),Avg. F size (L) (mm),Avg. F size (R) (mm),Endometrium (mm),Unnamed: 44
536,537,537,0,35,50.0,164.592,18.5,17,72,16,...,0.0,0,110,70,1,0,17.5,10.0,6.7,
537,538,538,0,30,63.2,158.0,25.3,15,72,18,...,0.0,0,110,70,9,7,19.0,18.0,8.2,
538,539,539,0,36,54.0,152.0,23.4,13,74,20,...,0.0,0,110,80,1,0,18.0,9.0,7.3,
539,540,540,0,27,50.0,150.0,22.2,15,74,20,...,0.0,0,110,70,7,6,18.0,16.0,11.5,
540,541,541,1,23,82.0,165.0,30.1,13,80,20,...,1.0,0,120,70,9,10,19.0,18.0,6.9,


In [6]:
# List of column names

for col in list(pcos.columns):
    print(col)

Sl. No
Patient File No.
PCOS (Y/N)
 Age (yrs)
Weight (Kg)
Height(Cm) 
BMI
Blood Group
Pulse rate(bpm) 
RR (breaths/min)
Hb(g/dl)
Cycle(R/I)
Cycle length(days)
Marraige Status (Yrs)
Pregnant(Y/N)
No. of abortions
  I   beta-HCG(mIU/mL)
II    beta-HCG(mIU/mL)
FSH(mIU/mL)
LH(mIU/mL)
FSH/LH
Hip(inch)
Waist(inch)
Waist:Hip Ratio
TSH (mIU/L)
AMH(ng/mL)
PRL(ng/mL)
Vit D3 (ng/mL)
PRG(ng/mL)
RBS(mg/dl)
Weight gain(Y/N)
hair growth(Y/N)
Skin darkening (Y/N)
Hair loss(Y/N)
Pimples(Y/N)
Fast food (Y/N)
Reg.Exercise(Y/N)
BP _Systolic (mmHg)
BP _Diastolic (mmHg)
Follicle No. (L)
Follicle No. (R)
Avg. F size (L) (mm)
Avg. F size (R) (mm)
Endometrium (mm)
Unnamed: 44


In [26]:
#gives the size/shape of the dataset
#How many columns and rows are there in the dataset.

num_rows, num_columns = pcos.shape

print("Number of rows:", num_rows)
print("Number of columns:", num_columns)

Number of rows: 541
Number of columns: 45


Inspecting the dataset shows a column named "Unnamed: 44" that contains only null values. It comes from the extra comma after the last value of each row in the CSV file, which pandas reads as one more, empty, column. The column carries no information and should not be part of our dataset.
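
A minimal sketch of how such a column arises and one way to remove it, using a made-up three-line CSV with trailing commas rather than the real file. Dropping columns that are entirely null is one option among several and is not necessarily the approach the notebook takes later. The sketch also strips padding from column names, since the column listing above shows names such as " Age (yrs)" with stray spaces.

```python
import io

import pandas as pd

# Made-up CSV: every row, including the header, ends with an extra comma.
raw = " Age (yrs),Height(Cm) ,\n28,152.0,\n36,161.5,\n"
df = pd.read_csv(io.StringIO(raw))
print(list(df.columns))  # the empty trailing field becomes an "Unnamed: ..." column

# Drop every column whose values are all null.
cleaned = df.dropna(axis=1, how="all")

# Optional tidy-up: remove leading and trailing spaces from column names.
cleaned.columns = cleaned.columns.str.strip()
print(list(cleaned.columns))
```

On the real data the same idea would be `pcos.dropna(axis=1, how="all")`, assuming no other column is completely empty.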

In [14]:
#Check the data type of each column in the pcos dataframe

pcos.dtypes

Sl. No                      int64
Patient File No.            int64
PCOS (Y/N)                  int64
 Age (yrs)                  int64
Weight (Kg)               float64
Height(Cm)                float64
BMI                       float64
Blood Group                 int64
Pulse rate(bpm)             int64
RR (breaths/min)            int64
Hb(g/dl)                  float64
Cycle(R/I)                  int64
Cycle length(days)          int64
Marraige Status (Yrs)     float64
Pregnant(Y/N)               int64
No. of abortions            int64
  I   beta-HCG(mIU/mL)    float64
II    beta-HCG(mIU/mL)     object
FSH(mIU/mL)               float64
LH(mIU/mL)                float64
FSH/LH                    float64
Hip(inch)                   int64
Waist(inch)                 int64
Waist:Hip Ratio           float64
TSH (mIU/L)               float64
AMH(ng/mL)                 object
PRL(ng/mL)                float64
Vit D3 (ng/mL)            float64
PRG(ng/mL)                float64
RBS(mg/dl)    

In [27]:
#Information about the dataset
#info() prints a summary of the DataFrame: the index dtype, each column's dtype and non-null count, and memory usage.

pcos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541 entries, 0 to 540
Data columns (total 45 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Sl. No                  541 non-null    int64  
 1   Patient File No.        541 non-null    int64  
 2   PCOS (Y/N)              541 non-null    int64  
 3    Age (yrs)              541 non-null    int64  
 4   Weight (Kg)             541 non-null    float64
 5   Height(Cm)              541 non-null    float64
 6   BMI                     541 non-null    float64
 7   Blood Group             541 non-null    int64  
 8   Pulse rate(bpm)         541 non-null    int64  
 9   RR (breaths/min)        541 non-null    int64  
 10  Hb(g/dl)                541 non-null    float64
 11  Cycle(R/I)              541 non-null    int64  
 12  Cycle length(days)      541 non-null    int64  
 13  Marraige Status (Yrs)   540 non-null    float64
 14  Pregnant(Y/N)           541 non-null    in

# Description

This dataset comprises information on 541 patients, primarily focusing on factors related to polycystic ovary syndrome (PCOS).

The patients have a mean age of 31.43 years, a mean weight of 59.64 kg, and a mean height of 156.48 cm. The mean Body Mass Index (BMI) is 24.31, which falls within the normal range (18.5 to 24.9).
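
BMI is weight in kilograms divided by the square of height in metres. As a quick sanity check, the sketch below recomputes it for the five rows shown by `pcos.head()` and compares the result with the recorded values. The short column names are illustrative and are not the dataset's own.

```python
import pandas as pd

# Weight, height, and recorded BMI copied from the first five rows of the dataset.
# Column names here are hypothetical shorthand, not the dataset's headers.
sample = pd.DataFrame({
    "weight_kg": [44.6, 65.0, 68.8, 65.0, 52.0],
    "height_cm": [152.0, 161.5, 165.0, 148.0, 161.0],
    "bmi_recorded": [19.3, 24.9, 25.3, 29.7, 20.1],
})

# BMI = weight (kg) / height (m) squared, rounded to one decimal like the dataset.
height_m = sample["height_cm"] / 100
sample["bmi_computed"] = (sample["weight_kg"] / height_m ** 2).round(1)

print(sample)
```

For these five rows the computed and recorded values agree to one decimal place.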

Approximately 32.7% of patients have been diagnosed with PCOS, a hormonal disorder common among women of reproductive age. Additionally, 38.0% of patients are currently pregnant, while the average number of abortions among patients is 0.29.
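
Percentages like these follow directly from the 0/1 coding of the Y/N columns: the mean of a binary column is the share of rows equal to 1. A minimal sketch, assuming 177 positives out of 541 patients; those counts are inferred from the reported mean of 0.327172 and are not read from the file.

```python
import pandas as pd

# 177 ones and 364 zeros: counts inferred from the reported mean, not from the file.
pcos_flag = pd.Series([1] * 177 + [0] * 364, name="PCOS (Y/N)")

# The mean of a 0/1 column is the proportion of positives.
share = pcos_flag.mean() * 100
print(f"{share:.1f}% of patients have PCOS")
```

The same calculation applies to any other 0/1 column, such as `Pregnant(Y/N)`.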

In [29]:
#To get a statistical summary of each numeric column (count, mean, standard deviation, min, quartiles, max),
#we use the describe method; .T transposes the result so that each column becomes a row:

pcos.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Sl. No,541.0,271.0,156.317519,1.0,136.0,271.0,406.0,541.0
Patient File No.,541.0,271.0,156.317519,1.0,136.0,271.0,406.0,541.0
PCOS (Y/N),541.0,0.327172,0.469615,0.0,0.0,0.0,1.0,1.0
Age (yrs),541.0,31.430684,5.411006,20.0,28.0,31.0,35.0,48.0
Weight (Kg),541.0,59.637153,11.028287,31.0,52.0,59.0,65.0,108.0
Height(Cm),541.0,156.484835,6.033545,137.0,152.0,156.0,160.0,180.0
BMI,541.0,24.307579,4.055129,12.4,21.6,24.2,26.6,38.9
Blood Group,541.0,13.802218,1.840812,11.0,13.0,14.0,15.0,18.0
Pulse rate(bpm),541.0,73.247689,4.430285,13.0,72.0,72.0,74.0,82.0
RR (breaths/min),541.0,19.243993,1.688629,16.0,18.0,18.0,20.0,28.0


This dataset provides a comprehensive overview of patient demographics, medical history, laboratory results, and physical characteristics, offering valuable insights into factors relevant to PCOS and gynecological health.