# Building a Classification Model

## Problem Statement

Polycystic ovary syndrome (PCOS) is a disorder involving infrequent, irregular or prolonged menstrual periods, and often excess male hormone (androgen) levels. The ovaries develop numerous small collections of fluid, called follicles, and may fail to release eggs regularly.

PCOS is becoming an increasingly common condition among women. Women with PCOS are also believed to experience weight gain, pregnancy issues, hair loss, and related problems, and the condition may affect fertility.

In this project, we build a classification model to predict whether a woman can get pregnant, based on multiple physical and clinical features.

## Success Criteria

**Business Success Criteria:** Reduce diagnosis time by 20% to 40%.

**ML Success Criteria:** Achieve an accuracy of at least 0.8.

## Data Collection

Data: The dataset contains the physical and clinical parameters used to determine PCOS and infertility-related issues. The data were collected from 10 different hospitals across Kerala, India.

Source: Prasoon Kottarathil, "Polycystic ovary syndrome (PCOS)", Kaggle Dataset, 2020. https://www.kaggle.com/prasoonkottarathil/polycystic-ovary-syndrome-pcos

**Importing required packages**

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

In [7]:
url = 'https://github.com/Dharshana03/pcos_classification/blob/main/Data/PCOS_data_without_infertility.xlsx?raw=true'

In [10]:
df = pd.read_excel(url,sheet_name=1)

In [11]:
df.head(10)

Unnamed: 0,Sl. No,Patient File No.,PCOS (Y/N),Age (yrs),Weight (Kg),Height(Cm),BMI,Blood Group,Pulse rate(bpm),RR (breaths/min),...,Fast food (Y/N),Reg.Exercise(Y/N),BP _Systolic (mmHg),BP _Diastolic (mmHg),Follicle No. (L),Follicle No. (R),Avg. F size (L) (mm),Avg. F size (R) (mm),Endometrium (mm),Unnamed: 44
0,1,1,0,28,44.6,152.0,19.3,15,78,22,...,1.0,0,110,80,3,3,18.0,18.0,8.5,
1,2,2,0,36,65.0,161.5,24.921163,15,74,20,...,0.0,0,120,70,3,5,15.0,14.0,3.7,
2,3,3,1,33,68.8,165.0,25.270891,11,72,18,...,1.0,0,120,80,13,15,18.0,20.0,10.0,
3,4,4,0,37,65.0,148.0,29.674945,13,72,20,...,0.0,0,120,70,2,2,15.0,14.0,7.5,
4,5,5,0,25,52.0,161.0,20.060954,11,72,18,...,0.0,0,120,80,3,4,16.0,14.0,7.0,
5,6,6,0,36,74.1,165.0,27.217631,15,78,28,...,0.0,0,110,70,9,6,16.0,20.0,8.0,
6,7,7,0,34,64.0,156.0,26.298488,11,72,18,...,0.0,0,120,80,6,6,15.0,16.0,6.8,
7,8,8,0,33,58.5,159.0,23.139907,13,72,20,...,0.0,0,120,80,7,6,15.0,18.0,7.1,
8,9,9,0,32,40.0,158.0,16.023073,11,72,18,...,0.0,0,120,80,5,7,17.0,17.0,4.2,
9,10,10,0,36,52.0,150.0,23.111111,15,80,20,...,0.0,0,110,80,1,1,14.0,17.0,2.5,


## Exploratory Data Analysis (EDA) / Descriptive Statistics

In [12]:
df.describe()

Unnamed: 0,Sl. No,Patient File No.,PCOS (Y/N),Age (yrs),Weight (Kg),Height(Cm),BMI,Blood Group,Pulse rate(bpm),RR (breaths/min),...,Pimples(Y/N),Fast food (Y/N),Reg.Exercise(Y/N),BP _Systolic (mmHg),BP _Diastolic (mmHg),Follicle No. (L),Follicle No. (R),Avg. F size (L) (mm),Avg. F size (R) (mm),Endometrium (mm)
count,541.0,541.0,541.0,541.0,541.0,541.0,541.0,541.0,541.0,541.0,...,541.0,540.0,541.0,541.0,541.0,541.0,541.0,541.0,541.0,541.0
mean,271.0,271.0,0.327172,31.430684,59.637153,156.484835,24.311285,13.802218,73.247689,19.243993,...,0.489834,0.514815,0.247689,114.661738,76.927911,6.12939,6.641405,15.018115,15.451701,8.475915
std,156.317519,156.317519,0.469615,5.411006,11.028287,6.033545,4.056399,1.840812,4.430285,1.688629,...,0.500359,0.500244,0.43207,7.384556,5.574112,4.229294,4.436889,3.566839,3.318848,2.165381
min,1.0,1.0,0.0,20.0,31.0,137.0,12.417882,11.0,13.0,16.0,...,0.0,0.0,0.0,12.0,8.0,0.0,0.0,0.0,0.0,0.0
25%,136.0,136.0,0.0,28.0,52.0,152.0,21.641274,13.0,72.0,18.0,...,0.0,0.0,0.0,110.0,70.0,3.0,3.0,13.0,13.0,7.0
50%,271.0,271.0,0.0,31.0,59.0,156.0,24.238227,14.0,72.0,18.0,...,0.0,1.0,0.0,110.0,80.0,5.0,6.0,15.0,16.0,8.5
75%,406.0,406.0,1.0,35.0,65.0,160.0,26.634958,15.0,74.0,20.0,...,1.0,1.0,0.0,120.0,80.0,9.0,10.0,18.0,18.0,9.8
max,541.0,541.0,1.0,48.0,108.0,180.0,38.9,18.0,82.0,28.0,...,1.0,1.0,1.0,140.0,100.0,22.0,20.0,24.0,24.0,18.0


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541 entries, 0 to 540
Data columns (total 45 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Sl. No                  541 non-null    int64  
 1   Patient File No.        541 non-null    int64  
 2   PCOS (Y/N)              541 non-null    int64  
 3    Age (yrs)              541 non-null    int64  
 4   Weight (Kg)             541 non-null    float64
 5   Height(Cm)              541 non-null    float64
 6   BMI                     541 non-null    float64
 7   Blood Group             541 non-null    int64  
 8   Pulse rate(bpm)         541 non-null    int64  
 9   RR (breaths/min)        541 non-null    int64  
 10  Hb(g/dl)                541 non-null    float64
 11  Cycle(R/I)              541 non-null    int64  
 12  Cycle length(days)      541 non-null    int64  
 13  Marraige Status (Yrs)   540 non-null    float64
 14  Pregnant(Y/N)           541 non-null    in
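Some column names in this dataset carry stray whitespace (for example, ` Age (yrs)` begins with a space). The snippet below is a minimal sketch of one way to normalise such headers, shown on a small hypothetical frame rather than the real data. The rest of this notebook keeps the original names, so applying this to `df` would mean updating the later column references as well.

```python
import pandas as pd

# Hypothetical frame reusing two header spellings found in this dataset
toy = pd.DataFrame({" Age (yrs)": [28, 36], "Height(Cm) ": [152.0, 161.5]})

# Strip leading/trailing whitespace from every column name
toy.columns = toy.columns.str.strip()

print(list(toy.columns))
```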

## Data Preprocessing

### Typecasting

**Changing all the (Y/N)-encoded variables to the object datatype**

**Blood Group is a categorical variable whose codes denote different blood groups, so it is typecast to the object datatype**

**Converting Sl. No, Patient File No., and No. of aborptions to objects, since they are treated as labels rather than numeric quantities**

In [14]:
for i in df.columns:
    if "(Y/N)" in i :
        df[i]=df[i].astype('object')

df['Blood Group']=df['Blood Group'].astype('object')

df['Sl. No']=df['Sl. No'].astype('object')
df['Patient File No.']=df['Patient File No.'].astype('object')
df['No. of aborptions']=df['No. of aborptions'].astype('object')


**Converting columns that were read as objects because of junk values back to numeric**

In [15]:
def to_numeric(input_col):
    return pd.to_numeric(input_col,errors='coerce')

In [16]:
df["II    beta-HCG(mIU/mL)"] = to_numeric(df["II    beta-HCG(mIU/mL)"])
df["AMH(ng/mL)"] = to_numeric(df["AMH(ng/mL)"])
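To illustrate what `errors='coerce'` does, here is a small sketch on a hypothetical column (the values are made up, not taken from the dataset). Entries that cannot be parsed become `NaN`, so they would show up as missing values in later null counts.

```python
import pandas as pd

# Hypothetical column: two clean numbers and one junk entry
raw = pd.Series(["1.99", "494.08", "a"])

# errors='coerce' turns anything unparseable into NaN instead of raising
clean = pd.to_numeric(raw, errors="coerce")

print(clean.isnull().sum())  # number of entries that became NaN
```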

### Datatypes after preprocessing

In [17]:
df.dtypes

Sl. No                     object
Patient File No.           object
PCOS (Y/N)                 object
 Age (yrs)                  int64
Weight (Kg)               float64
Height(Cm)                float64
BMI                       float64
Blood Group                object
Pulse rate(bpm)             int64
RR (breaths/min)            int64
Hb(g/dl)                  float64
Cycle(R/I)                  int64
Cycle length(days)          int64
Marraige Status (Yrs)     float64
Pregnant(Y/N)              object
No. of aborptions          object
  I   beta-HCG(mIU/mL)    float64
II    beta-HCG(mIU/mL)    float64
FSH(mIU/mL)               float64
LH(mIU/mL)                float64
FSH/LH                    float64
Hip(inch)                   int64
Waist(inch)                 int64
Waist:Hip Ratio           float64
TSH (mIU/L)               float64
AMH(ng/mL)                float64
PRL(ng/mL)                float64
Vit D3 (ng/mL)            float64
PRG(ng/mL)                float64
RBS(mg/dl)    

In [18]:
no_of_records = df.shape[0]
no_of_records

541

### Setting the target variable as 'Pregnant'

In [19]:
target = 'Pregnant(Y/N)'

### Dropping columns with at least 90% missing data

In [20]:
null_col_dict = dict([(i,sum(df[i].isnull())) for i in df.columns])
null_col_dict

{'Sl. No': 0,
 'Patient File No.': 0,
 'PCOS (Y/N)': 0,
 ' Age (yrs)': 0,
 'Weight (Kg)': 0,
 'Height(Cm) ': 0,
 'BMI': 0,
 'Blood Group': 0,
 'Pulse rate(bpm) ': 0,
 'RR (breaths/min)': 0,
 'Hb(g/dl)': 0,
 'Cycle(R/I)': 0,
 'Cycle length(days)': 0,
 'Marraige Status (Yrs)': 1,
 'Pregnant(Y/N)': 0,
 'No. of aborptions': 0,
 '  I   beta-HCG(mIU/mL)': 0,
 'II    beta-HCG(mIU/mL)': 1,
 'FSH(mIU/mL)': 0,
 'LH(mIU/mL)': 0,
 'FSH/LH': 0,
 'Hip(inch)': 0,
 'Waist(inch)': 0,
 'Waist:Hip Ratio': 0,
 'TSH (mIU/L)': 0,
 'AMH(ng/mL)': 1,
 'PRL(ng/mL)': 0,
 'Vit D3 (ng/mL)': 0,
 'PRG(ng/mL)': 0,
 'RBS(mg/dl)': 0,
 'Weight gain(Y/N)': 0,
 'hair growth(Y/N)': 0,
 'Skin darkening (Y/N)': 0,
 'Hair loss(Y/N)': 0,
 'Pimples(Y/N)': 0,
 'Fast food (Y/N)': 1,
 'Reg.Exercise(Y/N)': 0,
 'BP _Systolic (mmHg)': 0,
 'BP _Diastolic (mmHg)': 0,
 'Follicle No. (L)': 0,
 'Follicle No. (R)': 0,
 'Avg. F size (L) (mm)': 0,
 'Avg. F size (R) (mm)': 0,
 'Endometrium (mm)': 0,
 'Unnamed: 44': 539}

In [21]:
columns_removed = []
for i in null_col_dict:
    if null_col_dict[i] >= 0.90*no_of_records:
        df=df.drop(i,axis=1)
        columns_removed.append(i)
    

In [22]:
columns_removed

['Unnamed: 44']
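The same 90% rule can also be written without an explicit loop. This is a sketch on a hypothetical frame, using `isnull().mean()` to get the fraction of missing values per column; the column names here are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column "c" is empty in 9 of 10 rows
toy = pd.DataFrame({
    "a": range(10),
    "b": [1.0, np.nan] + [2.0] * 8,
    "c": [np.nan] * 9 + [1.0],
})

# isnull().mean() gives the fraction of missing values per column
missing_frac = toy.isnull().mean()
to_drop = missing_frac[missing_frac >= 0.90].index.tolist()
toy = toy.drop(columns=to_drop)

print(to_drop)
```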

### Removing zero-variance numerical variables

In [23]:
# {target} is a one-element set; set(target) would split the string into single characters
numerical_col = list(set(df.select_dtypes(exclude='object')) - {target})
categorical_col = list(set(df.select_dtypes(include='object')) - {target})

In [24]:
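# restrict to the numeric columns so object-typed columns are never passed to std()
num_std = df[numerical_col].std()
zero_var_numerical_col = num_std[num_std.round(3) == 0]
df = df.drop(zero_var_numerical_col.index, axis=1)

In [25]:
# subtract the column names (the Series index), not the std values
numerical_col = list(set(numerical_col) - set(zero_var_numerical_col.index))
zero_var_numerical_col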

Series([], dtype: float64)

### Removing zero-variance categorical variables

In [26]:
zero_var_categorical_col = [i for i in categorical_col if len(df[i].value_counts().index)==1]
df=df.drop(zero_var_categorical_col,axis=1)

In [27]:
categorical_col = list(set(categorical_col)-set(zero_var_categorical_col))
zero_var_categorical_col

[]

In [28]:
multi_var_categorical_col = [i for i in categorical_col if len(df[i].value_counts().index)>=200]
df=df.drop(multi_var_categorical_col,axis=1)


In [29]:
categorical_col = list(set(categorical_col)-set(multi_var_categorical_col))
multi_var_categorical_col

['Sl. No', 'Patient File No.']
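Both categorical filters above come down to counting distinct values per column: exactly one distinct value means zero variance, and a very large number (as with the identifier columns) means the column carries no generalisable signal. A minimal sketch with `nunique()` on a hypothetical frame follows; the cutoff of 6 is chosen only to suit this toy example.

```python
import pandas as pd

# Hypothetical categorical frame
toy = pd.DataFrame({
    "constant": ["x"] * 6,                 # one distinct value
    "row_id": [str(i) for i in range(6)],  # distinct in every row
    "flag": ["0", "1", "0", "1", "1", "0"],
})

n_unique = toy.nunique()
zero_var = n_unique[n_unique == 1].index.tolist()
# Assumed cutoff for this toy example only; the notebook uses >= 200
high_card = n_unique[n_unique >= 6].index.tolist()

kept = toy.drop(columns=zero_var + high_card)
print(list(kept.columns))
```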

In [30]:
len(numerical_col)

31

In [31]:
def find_outliers_IQR(df):
    q1=df.quantile(0.25)
    q3=df.quantile(0.75)
    IQR=q3-q1
    outliers = df[((df<(q1-1.5*IQR)) | (df>(q3+1.5*IQR)))]
    return outliers

In [32]:
outlier_dict={}
for i in numerical_col:
    outliers = find_outliers_IQR(df[i])
    # record only columns where more than 3% of the records are outliers
    if len(outliers) > 0.03 * len(df[i]):
        outlier_dict[str(i)] = len(outliers)

In [33]:
print(len(outlier_dict))
print(outlier_dict)
out_list = list(outlier_dict.keys())

15
{'AMH(ng/mL)': 52, 'Pulse rate(bpm) ': 94, 'Weight (Kg)': 18, 'Vit D3 (ng/mL)': 31, 'Cycle length(days)': 77, 'Waist(inch)': 17, 'II    beta-HCG(mIU/mL)': 78, 'Hip(inch)': 21, 'TSH (mIU/L)': 27, 'PRG(ng/mL)': 39, 'LH(mIU/mL)': 24, 'FSH/LH': 48, 'RBS(mg/dl)': 30, '  I   beta-HCG(mIU/mL)': 46, 'PRL(ng/mL)': 21}


In [34]:
outliers = find_outliers_IQR(df[' Age (yrs)'])
len(outliers)

5
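As a worked example of the IQR rule, the sketch below applies the same logic to a short hypothetical series. With pandas' default linear interpolation, Q1 = 12.75 and Q3 = 16.5, so IQR = 3.75 and the fences are 12.75 - 5.625 = 7.125 and 16.5 + 5.625 = 22.125.

```python
import pandas as pd

def find_outliers_iqr(series):
    # Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series[(series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)]

# Hypothetical values; only 40 lies beyond the upper fence of 22.125
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 40])
outliers = find_outliers_iqr(values)
print(outliers.tolist())  # [40]
```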

### Number of women who have PCOS and got pregnant

In [35]:
df[((df.loc[:,'PCOS (Y/N)']==1) & (df.loc[:,target]==1))].shape[0]

64

### Number of women who have PCOS and did not get pregnant

In [36]:
df[((df.loc[:,'PCOS (Y/N)']==1) & (df.loc[:,target]==0))].shape[0]

113
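These two counts are two cells of a 2x2 contingency table. As a sketch, `pd.crosstab` produces all four cells at once; the frame below is hypothetical and only reuses the two column names.

```python
import pandas as pd

# Hypothetical 0/1 data with the same two column names
toy = pd.DataFrame({
    "PCOS (Y/N)":    [1, 1, 1, 0, 0, 1],
    "Pregnant(Y/N)": [1, 0, 0, 1, 0, 1],
})

# Rows: PCOS status, columns: pregnancy status
table = pd.crosstab(toy["PCOS (Y/N)"], toy["Pregnant(Y/N)"])
print(table)
```

Here `table.loc[1, 1]` is the number of women with PCOS who got pregnant, and `table.loc[1, 0]` the number with PCOS who did not.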

### Standardisation 

In [37]:
array = df[numerical_col].values
std_ins = StandardScaler().fit(array)
df[numerical_col] = pd.DataFrame(std_ins.transform(array))
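`StandardScaler` rescales each column to zero mean and unit variance, i.e. z = (x - mean) / std, where std is the population standard deviation. The sketch below checks this by hand on a hypothetical single feature.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature
x = np.array([[50.0], [60.0], [70.0], [80.0]])

scaled = StandardScaler().fit_transform(x)

# Same result by hand: z = (x - mean) / std, with the population std (ddof=0)
manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(np.allclose(scaled, manual))
```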

### Imputing the mode for missing categorical variables and the mean for missing numerical variables

#### Before Imputing

In [38]:
null_col_dict = dict([(i,sum(df[i].isnull())) for i in df.columns])
null_col_dict

{'PCOS (Y/N)': 0,
 ' Age (yrs)': 0,
 'Weight (Kg)': 0,
 'Height(Cm) ': 0,
 'BMI': 0,
 'Blood Group': 0,
 'Pulse rate(bpm) ': 0,
 'RR (breaths/min)': 0,
 'Hb(g/dl)': 0,
 'Cycle(R/I)': 0,
 'Cycle length(days)': 0,
 'Marraige Status (Yrs)': 1,
 'Pregnant(Y/N)': 0,
 'No. of aborptions': 0,
 '  I   beta-HCG(mIU/mL)': 0,
 'II    beta-HCG(mIU/mL)': 1,
 'FSH(mIU/mL)': 0,
 'LH(mIU/mL)': 0,
 'FSH/LH': 0,
 'Hip(inch)': 0,
 'Waist(inch)': 0,
 'Waist:Hip Ratio': 0,
 'TSH (mIU/L)': 0,
 'AMH(ng/mL)': 1,
 'PRL(ng/mL)': 0,
 'Vit D3 (ng/mL)': 0,
 'PRG(ng/mL)': 0,
 'RBS(mg/dl)': 0,
 'Weight gain(Y/N)': 0,
 'hair growth(Y/N)': 0,
 'Skin darkening (Y/N)': 0,
 'Hair loss(Y/N)': 0,
 'Pimples(Y/N)': 0,
 'Fast food (Y/N)': 1,
 'Reg.Exercise(Y/N)': 0,
 'BP _Systolic (mmHg)': 0,
 'BP _Diastolic (mmHg)': 0,
 'Follicle No. (L)': 0,
 'Follicle No. (R)': 0,
 'Avg. F size (L) (mm)': 0,
 'Avg. F size (R) (mm)': 0,
 'Endometrium (mm)': 0}

### Using SimpleImputer from sklearn to impute categorical variables

In [39]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='most_frequent', 
                        missing_values=np.nan)
imputer = imputer.fit(df[categorical_col])
df[categorical_col] = imputer.transform(df[categorical_col])


### Using fillna function to impute numerical variables

In [40]:
df[numerical_col] = df[numerical_col].fillna(df[numerical_col].mean())

#### After Imputing

In [41]:
null_col_dict = dict([(i,sum(df[i].isnull())) for i in df.columns])
null_col_dict

{'PCOS (Y/N)': 0,
 ' Age (yrs)': 0,
 'Weight (Kg)': 0,
 'Height(Cm) ': 0,
 'BMI': 0,
 'Blood Group': 0,
 'Pulse rate(bpm) ': 0,
 'RR (breaths/min)': 0,
 'Hb(g/dl)': 0,
 'Cycle(R/I)': 0,
 'Cycle length(days)': 0,
 'Marraige Status (Yrs)': 0,
 'Pregnant(Y/N)': 0,
 'No. of aborptions': 0,
 '  I   beta-HCG(mIU/mL)': 0,
 'II    beta-HCG(mIU/mL)': 0,
 'FSH(mIU/mL)': 0,
 'LH(mIU/mL)': 0,
 'FSH/LH': 0,
 'Hip(inch)': 0,
 'Waist(inch)': 0,
 'Waist:Hip Ratio': 0,
 'TSH (mIU/L)': 0,
 'AMH(ng/mL)': 0,
 'PRL(ng/mL)': 0,
 'Vit D3 (ng/mL)': 0,
 'PRG(ng/mL)': 0,
 'RBS(mg/dl)': 0,
 'Weight gain(Y/N)': 0,
 'hair growth(Y/N)': 0,
 'Skin darkening (Y/N)': 0,
 'Hair loss(Y/N)': 0,
 'Pimples(Y/N)': 0,
 'Fast food (Y/N)': 0,
 'Reg.Exercise(Y/N)': 0,
 'BP _Systolic (mmHg)': 0,
 'BP _Diastolic (mmHg)': 0,
 'Follicle No. (L)': 0,
 'Follicle No. (R)': 0,
 'Avg. F size (L) (mm)': 0,
 'Avg. F size (R) (mm)': 0,
 'Endometrium (mm)': 0}
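The two imputation strategies can be seen side by side on a small hypothetical frame: the categorical gap is filled with the most frequent value, and the numerical gap with the column mean. Column names and values below are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with one gap in each column
toy = pd.DataFrame({
    "flag": np.array(["1", "1", "0", np.nan], dtype=object),
    "level": [2.0, 4.0, np.nan, 6.0],
})

# Mode imputation for the categorical column
cat_imputer = SimpleImputer(strategy="most_frequent", missing_values=np.nan)
toy[["flag"]] = cat_imputer.fit_transform(toy[["flag"]])

# Mean imputation for the numerical column: (2 + 4 + 6) / 3 = 4.0
toy["level"] = toy["level"].fillna(toy["level"].mean())

print(toy)
```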

### Balancing Dataset

In [42]:
df[target].value_counts()

0    335
1    206
Name: Pregnant(Y/N), dtype: int64

In [43]:
df[target]=df[target].astype('uint8')

In [44]:
from imblearn.over_sampling import SMOTE

input_var = list(set(df.columns) - set([target]))

over_sam = SMOTE(random_state=0)
X,Y = over_sam.fit_resample(df[input_var],df[target]) 

X = pd.DataFrame(X, columns = input_var)
y = pd.DataFrame(Y, columns = [target])

In [45]:
y[target].value_counts()

1    335
0    335
Name: Pregnant(Y/N), dtype: int64
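SMOTE creates synthetic minority samples by interpolating between a minority-class point and one of its nearest minority-class neighbours. The snippet below is a simplified, hand-written illustration of that single interpolation step in NumPy. It is not imblearn's implementation, and the points and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical minority-class samples in a 2-feature space
x = np.array([1.0, 2.0])
neighbour = np.array([3.0, 6.0])

# Draw a random position along the segment between the two samples
lam = rng.uniform(0.0, 1.0)
synthetic = x + lam * (neighbour - x)

print(synthetic)
```

Because the new point lies somewhere on the segment between two real samples, 0/1-encoded columns can take fractional values after resampling.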

In [46]:
# split the balanced data into 70% training and 30% test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30,random_state=40)
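With 670 balanced rows and `test_size=0.30`, the split yields 469 training rows and 201 test rows. The sketch below reproduces those sizes on placeholder arrays. The `stratify` argument is an assumption added here for illustration (the notebook does not use it); it keeps the class ratio the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the balanced dataset's row count (335 + 335)
X_demo = np.zeros((670, 3))
y_demo = np.array([0] * 335 + [1] * 335)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=40,
    stratify=y_demo,  # assumption: not used in the notebook
)

print(X_tr.shape, X_te.shape)
```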

In [47]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

In [48]:
sel_ = SelectFromModel(LogisticRegression(C=1, penalty='l2', solver='liblinear'))
sel_.fit(X_train, np.ravel(y_train,order='C'))

In [49]:
selected_feat = X_train.columns[(sel_.get_support())]
selected_feat= list(selected_feat)

In [50]:
selected_feat.append('PCOS (Y/N)')

In [51]:
selected_feat

['Fast food (Y/N)',
 'Weight (Kg)',
 'Skin darkening (Y/N)',
 'Hip(inch)',
 'Pimples(Y/N)',
 'Reg.Exercise(Y/N)',
 'hair growth(Y/N)',
 'Hb(g/dl)',
 'II    beta-HCG(mIU/mL)',
 'PCOS (Y/N)']

In [52]:
X_train_selected = X_train[selected_feat]
X_test_selected = X_test[selected_feat]
X_train_selected.shape, X_test_selected.shape

((469, 10), (201, 10))
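For an L2-penalised logistic regression with no explicit `threshold`, `SelectFromModel` keeps the features whose absolute coefficient is at least the mean absolute coefficient. The sketch below checks this on synthetic stand-in data from `make_classification` (not the PCOS dataset); the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the PCOS dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=3, random_state=0)

# L2 is LogisticRegression's default penalty
sel = SelectFromModel(LogisticRegression(C=1, solver='liblinear'))
sel.fit(X_demo, y_demo)

# For a binary model the importance of each feature is |coefficient|
importance = np.abs(sel.estimator_.coef_).ravel()

# Features at or above the fitted threshold are kept
manual_mask = importance >= sel.threshold_
print(np.isclose(sel.threshold_, importance.mean()))
print(np.array_equal(manual_mask, sel.get_support()))
```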

In [53]:
# Logistic Regression
log_reg = LogisticRegression(random_state=0, solver='lbfgs')  # binary target, so no multi_class setting is needed
log_reg.fit(X_train_selected, np.ravel(y_train))  # ravel: pass y as a 1-D array


In [54]:
# Decision Trees
dec_tree = DecisionTreeClassifier(criterion = 'gini', splitter='best', max_depth=15)
dec_tree.fit(X_train_selected, y_train)

In [55]:
# Random Forests
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
rf.fit(X_train_selected, np.ravel(y_train))  # ravel: pass y as a 1-D array


In [56]:
# K-NN
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train_selected, np.ravel(y_train))  # ravel: pass y as a 1-D array


In [57]:

def get_performance_metrics(actual, predict):
    # confusion_matrix rows are actual classes and columns are predicted classes,
    # so with labels 0/1 the layout is [[TN, FP], [FN, TP]]
    c_matrix = confusion_matrix(actual, predict)
    total = c_matrix.sum()
    accuracy = (c_matrix[0,0] + c_matrix[1,1]) / total
    # sensitivity: share of actual positives (class 1) predicted correctly
    sensitivity = c_matrix[1,1] / (c_matrix[1,0] + c_matrix[1,1])
    # specificity: share of actual negatives (class 0) predicted correctly
    specificity = c_matrix[0,0] / (c_matrix[0,0] + c_matrix[0,1])
    dict_ = {"accuracy": accuracy, "sensitivity": sensitivity, "specificity": specificity}
    return dict_

### Predict the target variable

In [58]:
predict_lr=log_reg.predict(X_test_selected)
predict_dt=dec_tree.predict(X_test_selected)
predict_rf=rf.predict(X_test_selected)
predict_knn=knn.predict(X_test_selected)

In [59]:
# accuracy, sensitivity, and specificity for logistic regression
performance_lr = get_performance_metrics(y_test, predict_lr)

# accuracy, sensitivity, and specificity for decision trees
performance_dt = get_performance_metrics(y_test, predict_dt)

# accuracy, sensitivity, and specificity for random forests
performance_rf = get_performance_metrics(y_test, predict_rf)

# accuracy, sensitivity, and specificity for k-nearest neighbors
performance_knn = get_performance_metrics(y_test, predict_knn)

In [60]:
perf_df = pd.DataFrame([performance_lr, performance_dt, performance_rf, performance_knn],
                       index=['Logistic Regression', 'Decision Trees', 'Random Forest', 'K-NN'])
perf_df.round(2)

Model,accuracy,sensitivity,specificity
Logistic Regression,0.74,0.67,0.79
Decision Trees,0.8,0.78,0.81
Random Forest,0.86,0.82,0.88
K-NN,0.62,0.7,0.55
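For reference, here is a worked example of the three metrics on hypothetical labels, assuming class 1 (pregnant) is the positive class. With scikit-learn's `confusion_matrix`, rows are actual classes and columns are predicted classes, so flattening the 2x2 matrix gives TN, FP, FN, TP in that order.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 4 actual positives (1) and 4 actual negatives (0)
actual  = [1, 1, 1, 1, 0, 0, 0, 0]
predict = [1, 1, 1, 0, 0, 0, 1, 1]

# ravel() on the 2x2 matrix yields TN, FP, FN, TP for labels ordered [0, 1]
tn, fp, fn, tp = confusion_matrix(actual, predict).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 5 / 8
sensitivity = tp / (tp + fn)                 # 3 / 4
specificity = tn / (tn + fp)                 # 2 / 4
print(accuracy, sensitivity, specificity)
```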
