## Data Preparation:

Loading the dataset from a CSV file.

In [1]:
import pandas as pd

data = pd.read_csv('/content/pcos_prediction_dataset.csv')

In [2]:
print(data.columns)
data.head()  # last expression in the cell, so the notebook displays it

Unnamed: 0,Country,Age,BMI,Menstrual Regularity,Hirsutism,Acne Severity,Family History of PCOS,Insulin Resistance,Lifestyle Score,Stress Levels,Urban/Rural,Socioeconomic Status,Awareness of PCOS,Fertility Concerns,Undiagnosed PCOS Likelihood,Ethnicity,Diagnosis
0,Madagascar,26,Overweight,Regular,Yes,Severe,Yes,Yes,2,Low,Rural,High,Yes,No,0.107938,Hispanic,Yes
1,Vietnam,16,Underweight,Regular,Yes,,No,Yes,4,High,Rural,Middle,Yes,No,0.156729,Other,No
2,Somalia,41,Normal,Regular,No,Moderate,No,No,7,Medium,Urban,Middle,Yes,Yes,0.202901,Other,No
3,Malawi,27,Normal,Irregular,No,Mild,No,No,10,Low,Urban,High,Yes,No,0.073926,Caucasian,Yes
4,France,26,Overweight,Irregular,Yes,,No,No,7,Medium,Urban,Middle,No,No,0.229266,Caucasian,No


Dropping irrelevant features from the dataset.

In [13]:
r_data = data.drop(['Country', 'Urban/Rural', 'Awareness of PCOS', 'Ethnicity', 'Socioeconomic Status'], axis=1)
r_data.head()

Unnamed: 0,Age,BMI,Menstrual Regularity,Hirsutism,Acne Severity,Family History of PCOS,Insulin Resistance,Lifestyle Score,Stress Levels,Fertility Concerns,Undiagnosed PCOS Likelihood,Diagnosis
0,26,Overweight,Regular,Yes,Severe,Yes,Yes,2,Low,No,0.107938,Yes
1,16,Underweight,Regular,Yes,,No,Yes,4,High,No,0.156729,No
2,41,Normal,Regular,No,Moderate,No,No,7,Medium,Yes,0.202901,No
3,27,Normal,Irregular,No,Mild,No,No,10,Low,No,0.073926,Yes
4,26,Overweight,Irregular,Yes,,No,No,7,Medium,No,0.229266,No


Checking the unique values of each categorical feature (the numeric columns are skipped).

In [21]:
exclude_columns = ['Age','Lifestyle Score','Undiagnosed PCOS Likelihood']

for col in r_data.columns:
    if col not in exclude_columns:
        unique_values = r_data[col].unique()
        print(f"For {col} unique values are: {unique_values}")

For BMI unique values are: ['Overweight' 'Underweight' 'Normal' 'Obese']
For Menstrual Regularity unique values are: ['Regular' 'Irregular']
For Hirsutism unique values are: ['Yes' 'No']
For Acne Severity unique values are: ['Severe' nan 'Moderate' 'Mild']
For Family History of PCOS unique values are: ['Yes' 'No']
For Insulin Resistance unique values are: ['Yes' 'No']
For Stress Levels unique values are: ['Low' 'High' 'Medium']
For Fertility Concerns unique values are: ['No' 'Yes']
For Diagnosis unique values are: ['Yes' 'No']
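
The listing above shows that `Acne Severity` contains `nan`, but not how many rows are affected. A minimal sketch of how that could be counted, using a small made-up frame (`toy` and its values are illustrative assumptions, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for r_data; the values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "Acne Severity": ["Severe", np.nan, "Moderate", "Mild", np.nan],
    "Stress Levels": ["Low", "High", "Medium", "Low", "Medium"],
})

# Count each category, keeping NaN as its own entry.
acne_counts = toy["Acne Severity"].value_counts(dropna=False)
print(acne_counts)

# Number of missing entries per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

The same two calls should work unchanged on `r_data`.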


Replacing the NaN values in the Acne Severity feature with the most frequent value (the mode).

In [23]:
# Replace NaN with the mode
r_data['Acne Severity'] = r_data['Acne Severity'].fillna(r_data['Acne Severity'].mode()[0])
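
To make the behaviour of the imputation step concrete, here is a small sketch on a made-up series (`toy` is a hypothetical example, not dataset rows). It shows why `[0]` is needed: `mode()` returns a Series, since ties can produce more than one value.

```python
import numpy as np
import pandas as pd

# Illustrative values only.
toy = pd.Series(["Mild", np.nan, "Severe", "Mild", np.nan], name="Acne Severity")

# mode() ignores NaN by default and returns a Series,
# so [0] picks the first most frequent value.
most_common = toy.mode()[0]
filled = toy.fillna(most_common)

print(most_common)      # Mild
print(filled.tolist())  # ['Mild', 'Mild', 'Severe', 'Mild', 'Mild']
```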

Encoding the categorical features and labels as integer values for further training purposes.

In [24]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

exclude_columns = ['Age','Lifestyle Score','Undiagnosed PCOS Likelihood']

for col in r_data.columns:
    if col not in exclude_columns:
        # Check if column is categorical
        if r_data[col].dtype == 'object':
            r_data[col] = label_encoder.fit_transform(r_data[col])

In [25]:
r_data.head()

Unnamed: 0,Age,BMI,Menstrual Regularity,Hirsutism,Acne Severity,Family History of PCOS,Insulin Resistance,Lifestyle Score,Stress Levels,Fertility Concerns,Undiagnosed PCOS Likelihood,Diagnosis
0,26,2,1,1,2,1,1,2,1,0,0.107938,1
1,16,3,1,1,0,0,1,4,0,0,0.156729,0
2,41,0,1,0,1,0,0,7,2,1,0.202901,0
3,27,0,0,0,0,0,0,10,1,0,0.073926,1
4,26,2,0,1,0,0,0,7,2,0,0.229266,0
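
In the encoded table, `Stress Levels` maps Low to 1, High to 0 and Medium to 2, because `LabelEncoder` assigns codes in alphabetical order. The sketch below reproduces that mapping on toy values and, as one possible alternative (an assumption, not what the notebook does), uses `OrdinalEncoder` with an explicit category order so the integers follow Low < Medium < High.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Illustrative values only.
stress = pd.DataFrame({"Stress Levels": ["Low", "High", "Medium", "Low"]})

# LabelEncoder sorts the classes alphabetically: High=0, Low=1, Medium=2.
le = LabelEncoder()
alphabetical_codes = le.fit_transform(stress["Stress Levels"])
print(list(le.classes_), list(alphabetical_codes))

# OrdinalEncoder with an explicit order: Low=0, Medium=1, High=2.
oe = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
ordered_codes = oe.fit_transform(stress[["Stress Levels"]]).ravel()
print(ordered_codes)
```

Whether the ordering matters depends on the model: tree-based models are largely indifferent to it, while linear and distance-based models treat the codes as magnitudes.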


Splitting the dataset into features and labels, then into training and testing sets (10% held out for testing).

In [28]:
from sklearn.model_selection import train_test_split

X = r_data[['Age',
            'BMI',
            'Menstrual Regularity',
            'Hirsutism',
            'Acne Severity',
            'Family History of PCOS',
            'Insulin Resistance',
            'Lifestyle Score',
            'Stress Levels',
            'Fertility Concerns',
            'Undiagnosed PCOS Likelihood']]
y = r_data['Diagnosis']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(108000, 11)
(108000,)
(12000, 11)
(12000,)
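
The split above is purely random. If the `Diagnosis` classes turn out to be imbalanced, passing `stratify=y` to `train_test_split` keeps the class proportions the same in both splits. A minimal sketch on synthetic labels (`X_toy`, `y_toy` and the 10% positive rate are assumptions made for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (10% positives); not the PCOS data.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 100 + [0] * 900)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=42, stratify=y_toy
)

# Both splits keep the 10% positive rate.
print(y_tr.mean(), y_te.mean())
```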


## Model Preparation:

Logistic Regression:

In [29]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Initialize and train the model
log_reg = LogisticRegression(max_iter=200)
log_reg.fit(X_train, y_train)

# Predictions
y_pred = log_reg.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Logistic Regression Accuracy:", accuracy)

Logistic Regression Accuracy: 0.8931666666666667


Decision Trees:

In [30]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Initialize and train the model
dtree = DecisionTreeClassifier(random_state=42)
dtree.fit(X_train, y_train)

# Predictions
y_pred = dtree.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Decision Tree Accuracy:", accuracy)

Decision Tree Accuracy: 0.7910833333333334


Random Forest:

In [31]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Initialize and train the model
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# Predictions
y_pred = rf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Random Forest Accuracy:", accuracy)

Random Forest Accuracy: 0.8875833333333333


Gradient Boosting:

In [32]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Initialize and train the model
gb = GradientBoostingClassifier(random_state=42)
gb.fit(X_train, y_train)

# Predictions
y_pred = gb.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Gradient Boosting Accuracy:", accuracy)

Gradient Boosting Accuracy: 0.8931666666666667


Support Vector Machines (SVM):

In [33]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Initialize and train the model
svm = SVC(kernel='rbf', random_state=42)
svm.fit(X_train, y_train)

# Predictions
y_pred = svm.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("SVM Accuracy:", accuracy)

SVM Accuracy: 0.8931666666666667
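
The features sit on different scales (Age in years, Lifestyle Score up to 10, Undiagnosed PCOS Likelihood below 1), and an RBF-kernel SVM is sensitive to feature scale. One common option, shown here as a sketch on synthetic data rather than as the notebook's method, is to standardise the features inside a pipeline (`X_toy`, `y_toy` and `scaled_svm` are hypothetical names):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data for illustration; not the PCOS dataset.
X_toy, y_toy = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# The scaler is fitted on the training split only, then applied to the test split.
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=42))
scaled_svm.fit(X_tr, y_tr)

svm_accuracy = scaled_svm.score(X_te, y_te)
print("Scaled SVM accuracy:", svm_accuracy)
```

The same pipeline pattern applies to other scale-sensitive models such as KNN and the MLP.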


K-Nearest Neighbors (KNN):

In [34]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Initialize and train the model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Predictions
y_pred = knn.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("K-Nearest Neighbors Accuracy:", accuracy)

K-Nearest Neighbors Accuracy: 0.88625


Naive Bayes: MultinomialNB

In [35]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Initialize and train the Multinomial Naive Bayes model
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# Predictions
y_pred = nb_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Multinomial Naive Bayes Accuracy:", accuracy)

Multinomial Naive Bayes Accuracy: 0.8931666666666667


Ensemble: Bagging

In [39]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Initialize and train the Bagging model with a Decision Tree as base estimator
bagging_model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42)
bagging_model.fit(X_train, y_train)

# Predictions
y_pred = bagging_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Bagging Accuracy:", accuracy)

Bagging Accuracy: 0.8924166666666666


Ensemble: Stacking

In [37]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Define base learners and final estimator
base_learners = [
    ('dt', DecisionTreeClassifier(random_state=42)),
    ('lr', LogisticRegression(max_iter=200))
]
final_estimator = LogisticRegression()

# Initialize and train the Stacking model
stacking_model = StackingClassifier(estimators=base_learners, final_estimator=final_estimator)
stacking_model.fit(X_train, y_train)

# Predictions
y_pred = stacking_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Stacking Accuracy:", accuracy)

Stacking Accuracy: 0.8931666666666667


Multi-Layer Perceptron:

In [38]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Initialize and train the Multi-Layer Perceptron model
mlp_model = MLPClassifier(hidden_layer_sizes=(100,), max_iter=200, random_state=42)
mlp_model.fit(X_train, y_train)

# Predictions
y_pred = mlp_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("MLP Accuracy:", accuracy)

MLP Accuracy: 0.8931666666666667
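
Logistic Regression, Gradient Boosting, SVM, Multinomial Naive Bayes, Stacking and the MLP all report exactly 0.8931666666666667. Identical scores from such different models can indicate that each one predicts a single class for every test row; this is a hypothesis, not something verified here. A sketch of how it could be checked with a majority-class baseline and a confusion matrix, on synthetic labels (`X_toy`, `y_toy` and the roughly 10% positive rate are assumptions):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic labels with roughly 10% positives; not the PCOS data.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = (rng.random(1000) < 0.1).astype(int)

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_pred = baseline.predict(X_toy)

baseline_accuracy = accuracy_score(y_toy, baseline_pred)
cm = confusion_matrix(y_toy, baseline_pred, labels=[0, 1])

print("Baseline accuracy:", baseline_accuracy)
print(cm)  # rows are true classes, columns are predicted classes
```

Applied to the notebook's data, fitting the baseline on `X_train, y_train` and scoring on `X_test, y_test` would show whether 0.8931666666666667 is simply the majority-class rate, and `confusion_matrix(y_test, y_pred)` for any model would show whether its second column is empty.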
