# Project Title: Genetic Disorder Data Analysis & Model Comparison

# Objective:

The objective of this assignment is to analyze a **Genetic Disorder Dataset**, clean and preprocess the data, and compare the performance of three machine learning models, **Support Vector Machine (SVM)**, **Random Forest**, and **Decision Tree**, to determine which model best fits the dataset.

**Use the Status column as the dependent variable (y).**

# Instructions

# 1. Data Importing & Inspection

# 2. Data Cleaning:

**Hint**

a.	Rename the column names to remove all spaces and special characters.

**example**

-	Rename "Autopsy shows birth defect (if applicable)" to "Autopsy_shows_birth_defect".

-	Rename "H/O serious maternal illness" to "HO_serious_maternal_illness".

-	Rename "H/O radiation exposure (x-ray)" to "HO_radiation_exposure", etc.
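The renaming rule in the hint can also be applied programmatically. The sketch below is one possible approach, not the method used later in this notebook: `clean_column_name` is a hypothetical helper that drops parenthetical notes, removes special characters, and joins words with underscores. It reproduces the three examples above, but for some other columns it gives a slightly different result than a hand-written mapping (for example, "Mother's age" becomes "Mothers_age").

```python
import re


def clean_column_name(name: str) -> str:
    """Return a column name without parenthetical notes, spaces, or special characters."""
    # Drop parenthetical notes such as "(if applicable)"; the closing
    # parenthesis is optional so that truncated names are handled too.
    name = re.sub(r"\s*\([^)]*\)?", "", name)
    # Treat hyphens as word separators, then remove the remaining special characters.
    name = name.replace("-", " ")
    name = re.sub(r"[^0-9A-Za-z\s]", "", name)
    # Collapse runs of whitespace into single underscores.
    return re.sub(r"\s+", "_", name.strip())


examples = [
    "Autopsy shows birth defect (if applicable)",
    "H/O serious maternal illness",
    "H/O radiation exposure (x-ray)",
]
for raw in examples:
    print(raw, "->", clean_column_name(raw))
```

With a DataFrame loaded, the helper could be applied as `df.columns = [clean_column_name(c) for c in df.columns]`.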


b.	Drop all irrelevant columns.

c.	Check the unique values in each column to identify and remove irrelevant values.

# 3. Exploratory Data Analysis (EDA)
   
●	Perform **univariate analysis** on key features (e.g., bar charts, histograms, boxplots).

●	Conduct **bivariate analysis** (e.g., correlation heatmap, scatterplots).

●	Perform **multivariate analysis** (e.g., PCA if applicable).

●	Identify patterns or trends from the dataset.


# 4. Data Preprocessing:

# 5. Model Training & Evaluation
   
●	Train the following models:

  **○	Support Vector Machine (SVM)**

  **○	Random Forest**

  **○	Decision Tree**

●	Determine the best model based on performance metrics and justify your choice.


# 6. Conclusion & Recommendations 

●	Summarize key findings from the analysis.

●	Justify the best-performing model based on the results.

●	Provide recommendations for improving model performance.

●	Discuss any limitations of the dataset or models.


# Importing the libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Importing the dataset

In [None]:
df = pd.read_csv("C:/Users/Oceande/Genetic_disorder.csv")
df.head()

# DATASET DESCRIPTION

1.	Patient Id – A special code given to each patient.
2.	Patient Age – How old the patient is.
3.	Genes in mother's side – Did the patient get a certain gene from their mom? (Yes/No)
4.	Inherited from father – Did the patient get a certain gene from their dad? (Yes/No)
5.	Maternal gene – Extra info about genes from the mother (Yes/No).
6.	Paternal gene – Extra info about genes from the father (Yes/No).
7.	Blood cell count (mcL) – The number of blood cells in a tiny drop of blood.
8.	Patient First Name – The first name of the patient.
9.	Family Name – The last name of the patient.
10.	Father's name – The name of the patient’s father.
11.	Mother's age – How old the mother is.
12.	Father's age – How old the father is.
13.	Institute Name – The name of the hospital or lab where the patient was treated.
14.	Location of Institute – The place where the hospital or lab is.
15.	Status – The condition or progress of the patient 
16.	Respiratory Rate (breaths/min) – How many breaths the patient takes in a minute.
17.	Heart Rate (rates/min) – How many times the heart beats in a minute.
18.	Test 1 – Result of the first medical test.
19.	Test 2 – Result of the second medical test.
20.	Test 3 – Result of the third medical test.
21.	Test 4 – Result of the fourth medical test.
22.	Test 5 – Result of the fifth medical test.
23.	Parental consent – Did the parents give permission for medical tests? (Yes/No)
24.	Follow-up – Did the patient return for more checkups?
25.	Gender – Whether the patient is male or female.
26.	Birth asphyxia – Did the baby have trouble breathing at birth? (Yes/No)
27.	Autopsy shows birth defect (if applicable) – If the patient passed away, did they have birth defects?
28.	Place of birth – Where the patient was born (e.g., hospital, home).
29.	Folic acid details (peri-conceptional) – Did the mother take folic acid before/during pregnancy?
30.	H/O serious maternal illness – Did the mother have any serious illness? (H/O = history of)
31.	H/O radiation exposure (x-ray) – Was the mother exposed to X-rays during pregnancy?
32.	H/O substance abuse – Did the mother use harmful substances during pregnancy?
33.	Assisted conception IVF/ART – Was the baby conceived using medical help (like IVF)?
34.	History of anomalies in previous pregnancies – Did the mother have any pregnancy problems before?
35.	No. of previous abortion – How many times the mother had a miscarriage or abortion before.
36.	Birth defects – Does the patient have any birth defects?
37.	White Blood cell count (thousand per microliter) – The number of white blood cells, which fight infections.
38.	Blood test result – The outcome of a blood test (e.g., normal, inconclusive).
39.	Symptom 1 – A sign of illness (1 = present, 0 = not present).
40.	Symptom 2 – Another sign of illness.
41.	Symptom 3 – Another sign of illness.
42.	Symptom 4 – Another sign of illness.
43.	Symptom 5 – Another sign of illness.
44.	Genetic Disorder – The type of disorder the patient has due to their genes.
45.	Disorder Subclass – A more specific name for the disorder.


# Summarizing the numerical values

In [None]:
df.describe()

# Summarizing the categorical values

In [None]:
df.describe(include="object")

# Checking the number of rows and columns in the dataset

In [None]:
df.shape

# Checking for a brief overview of the dataset

In [None]:
df.info()

# Checking the column names

In [None]:
df.columns

# Handling the Unneeded Columns

In [None]:
df.drop(["Patient Id", "Patient First Name", "Family Name", "Father's name", "Institute Name", "Location of Institute"], axis=1, inplace=True)

In [None]:
df.columns

In [None]:
df.shape

# Editing the column names

In [None]:
df.rename(columns = {"Patient Age": "Patient_Age", "Genes in mother's side": "Genes_in_mother's_side", "Inherited from father": "Inherited_from_father",
                   "Maternal gene": "Maternal_gene", "Paternal gene": "Paternal_gene", "Blood cell count (mcL)": "Blood_cell_count",
                   "Mother's age": "Mother's_age", "Father's age": "Father's_age", "Status": "Status", "Respiratory Rate (breaths/min)": 
                   "Respiratory_Rate", "Heart Rate (rates/min": "Heart_Rate", "Test 1": "Test_1", "Test 2": "Test_2", "Test 3": "Test_3", 
                   "Test 4": "Test_4", "Test 5": "Test_5", "Parental consent": "Parental_consent", "Follow-up": "Follow_up", "Gender": "Gender",
                   "Birth asphyxia": "Birth_asphyxia", "Autopsy shows birth defect (if applicable)": "Autopsy_shows_birth_defect", 
                   "Place of birth": "Place_of_birth", "Folic acid details (peri-conceptional)": "Folic_acid_details", "H/O serious maternal illness":
                   "HO_serious_maternal_illness", "H/O radiation exposure (x-ray)": "HO_radiation_exposure", "H/O substance abuse": 
                   "HO_substance_abuse", "Assisted conception IVF/ART": "Assisted_conception_IVF_ART", "History of anomalies in previous pregnancies": 
                   "History_of_anomalies_in_previous_pregnancies", "No. of previous abortion": "No_of_previous_abortion", "Birth defects": 
                   "Birth_defects", "White Blood cell count (thousand per microliter)": "White_Blood_cell_count", "Blood test result": 
                   "Blood_test_result", "Symptom 1": "Symptom_1", "Symptom 2": "Symptom_2", "Symptom 3": "Symptom_3", "Symptom 4": "Symptom_4", 
                   "Symptom 5": "Symptom_5", "Genetic Disorder": "Genetic_Disorder", "Disorder Subclass": "Disorder_Subclass"}, inplace = True)

In [None]:
df.columns

In [None]:
df.head()

# Checking for the data types

In [None]:
df.dtypes

# Checking for unique/inconsistent values

In [None]:
df["Genes_in_mother's_side"].unique()

In [None]:
df["Inherited_from_father"].unique()

In [None]:
df["Maternal_gene"].unique()

In [None]:
df["Paternal_gene"].unique()

In [None]:
df["Status"].unique()

In [None]:
df["Respiratory_Rate"].unique()

In [None]:
df["Heart_Rate"].unique()

In [None]:
df["Parental_consent"].unique()

In [None]:
df["Follow_up"].unique()

In [None]:
df["Gender"].unique()

In [None]:
df["Autopsy_shows_birth_defect"].unique()

In [None]:
df["Birth_asphyxia"].unique()

In [None]:
df["Place_of_birth"].unique()

In [None]:
df["Folic_acid_details"].unique()

In [None]:
df["HO_serious_maternal_illness"].unique()

In [None]:
df["HO_radiation_exposure"].unique()

In [None]:
df["HO_substance_abuse"].unique()

In [None]:
df["Assisted_conception_IVF_ART"].unique()

In [None]:
df["History_of_anomalies_in_previous_pregnancies"].unique()

In [None]:
df["Birth_defects"].unique()

In [None]:
df["Blood_test_result"].unique()

In [None]:
df["Genetic_Disorder"].unique()

In [None]:
df["Disorder_Subclass"].unique()

# Editing the "Normal (30-60)" value in Respiratory_Rate

In [None]:
df["Respiratory_Rate"] = df["Respiratory_Rate"].str.replace(r"\s*\(30-60\)", "", regex=True)

In [None]:
print(df["Respiratory_Rate"].unique())

# Checking for missing values

In [None]:
df.isnull().sum()

# Checking for duplicates

In [None]:
df.duplicated().sum()
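Raw counts are easier to interpret next to percentages. The following sketch assumes a small made-up DataFrame; `missing_summary` is a hypothetical helper that reports, for each column with gaps, the number and share of missing values.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only columns with at least one missing value, worst first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Illustrative data only; the Status labels here are placeholders.
demo = pd.DataFrame({
    "Patient_Age": [4.0, np.nan, 2.0, 9.0],
    "Gender": ["Male", None, None, "Female"],
    "Status": ["Alive", "Deceased", "Alive", "Alive"],
})

print(missing_summary(demo))
print("duplicate rows:", demo.duplicated().sum())
```

Knowing the share of missing values per column helps decide whether dropping rows, as done in the next step, discards too much data.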

# Data cleaning

In [None]:
df.dropna(inplace=True)

In [None]:
df.shape

In [None]:
df = df[df["Gender"] != "Ambiguous"]

In [None]:
df.shape

In [None]:
df = df[~df["Birth_asphyxia"].isin(["Not available", "No record"])]

In [None]:
df.shape


In [None]:
df = df[df["Blood_test_result"] != "inconclusive"]

In [None]:
df.shape

In [None]:
print(df["Gender"].unique())
print(df["Birth_asphyxia"].unique())
print(df["Blood_test_result"].unique())
print(df["Autopsy_shows_birth_defect"].unique())
print(df["HO_radiation_exposure"].unique())
print(df["HO_substance_abuse"].unique())

# Handling the placeholder values ('_' and '-') in the columns HO_radiation_exposure and HO_substance_abuse

In [None]:
df["HO_radiation_exposure"] = df["HO_radiation_exposure"].astype(str).replace({"_": "", "-": None})
df["HO_substance_abuse"] = df["HO_substance_abuse"].astype(str).replace({"_": "", "-": None})

In [None]:
print(df["HO_radiation_exposure"].unique())
print(df["HO_substance_abuse"].unique())
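An alternative way to handle placeholder symbols is to convert them straight to missing values. The sketch below assumes that both "-" and "_" mean "no value recorded", which differs slightly from the cell above (there "_" becomes an empty string). `placeholders_to_missing` and `PLACEHOLDERS` are hypothetical names.

```python
import pandas as pd

PLACEHOLDERS = ["-", "_"]  # assumption: both symbols mean "no value recorded"


def placeholders_to_missing(series: pd.Series) -> pd.Series:
    """Replace whole-cell placeholder symbols with missing values."""
    # Series.where keeps entries where the condition is True and
    # inserts a missing value everywhere else.
    return series.where(~series.isin(PLACEHOLDERS))


# Illustrative values only.
demo = pd.Series(["Yes", "-", "No", "_", "Not applicable"])
cleaned = placeholders_to_missing(demo)

print(cleaned.isna().sum())
print(cleaned.dropna().tolist())
```

Because `isin` matches whole cell values, a legitimate entry that merely contains a hyphen is left untouched.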


# Dropping the None

In [None]:
df.dropna(subset=["HO_radiation_exposure", "HO_substance_abuse", "Blood_test_result"], inplace=True)


In [None]:
print(df["HO_radiation_exposure"].unique())
print(df["HO_substance_abuse"].unique())

In [None]:
df.shape

# Exploratory Data Analysis (EDA)

## Univariate Analysis



### Categorical Variables

### Genes in mother's side

In [None]:
sns.barplot(x = df["Genes_in_mother's_side"].value_counts().index, y = df["Genes_in_mother's_side"].value_counts())
plt.title("Genes_in_mother's_side")
plt.show()

# Inherited from father

In [None]:
sns.barplot(x = df["Inherited_from_father"].value_counts().index, y = df["Inherited_from_father"].value_counts())
plt.title("Inherited_from_father")
plt.show()

# Maternal gene

In [None]:
sns.barplot(x = df["Maternal_gene"].value_counts().index, y = df["Maternal_gene"].value_counts())
plt.title("Maternal_gene")
plt.show()

# Paternal gene

In [None]:
sns.barplot(x = df["Paternal_gene"].value_counts().index, y = df["Paternal_gene"].value_counts())
plt.title("Paternal_gene")
plt.show()

# Status

In [None]:
sns.barplot(x = df["Status"].value_counts().index, y = df["Status"].value_counts())
plt.title("Status")
plt.show()

# Respiratory Rate

In [None]:
sns.barplot(x = df["Respiratory_Rate"].value_counts().index, y = df["Respiratory_Rate"].value_counts())
plt.title("Respiratory_Rate")
plt.show()

# Heart Rate

In [None]:
sns.barplot(x = df["Heart_Rate"].value_counts().index, y = df["Heart_Rate"].value_counts())
plt.title("Heart_Rate")
plt.show()

# Parental_consent

In [None]:
sns.barplot(x = df["Parental_consent"].value_counts().index, y = df["Parental_consent"].value_counts())
plt.title("Parental_consent")
plt.show()

# Follow Up

In [None]:
sns.barplot(x = df["Follow_up"].value_counts().index, y = df["Follow_up"].value_counts())
plt.title("Follow_up")
plt.show()

# Gender

In [None]:
sns.barplot(x = df['Gender'].value_counts().index, y = df['Gender'].value_counts())
plt.title('Gender')
plt.show()

# Birth Asphyxia

In [None]:
sns.barplot(x = df["Birth_asphyxia"].value_counts().index, y = df["Birth_asphyxia"].value_counts())
plt.title("Birth_asphyxia")
plt.show()

# Autopsy_shows_birth_defect

In [None]:
sns.barplot(x = df["Autopsy_shows_birth_defect"].value_counts().index, y = df["Autopsy_shows_birth_defect"].value_counts())
plt.title("Autopsy_shows_birth_defect")
plt.show()

# Place_of_birth

In [None]:
sns.barplot(x = df["Place_of_birth"].value_counts().index, y = df["Place_of_birth"].value_counts())
plt.title("Place_of_birth")
plt.show()

# Folic_acid_details

In [None]:
sns.barplot(x = df["Folic_acid_details"].value_counts().index, y = df["Folic_acid_details"].value_counts())
plt.title("Folic_acid_details")
plt.show()

# HO_serious_maternal_illness 

In [None]:
sns.barplot(x = df["HO_serious_maternal_illness"].value_counts().index, 
            y = df["HO_serious_maternal_illness"].value_counts())
plt.title("HO_serious_maternal_illness")
plt.show()



### HO_radiation_exposure

In [None]:
sns.barplot(x = df["HO_radiation_exposure"].value_counts().index, y = df["HO_radiation_exposure"].value_counts())
plt.title("HO_radiation_exposure")
plt.show()

### HO_substance_abuse

In [None]:
sns.barplot(x = df["HO_substance_abuse"].value_counts().index, y = df["HO_substance_abuse"].value_counts())
plt.title("HO_substance_abuse")
plt.show()

# Assisted_conception_IVF_ART

In [None]:
sns.barplot(x = df["Assisted_conception_IVF_ART"].value_counts().index, y = df["Assisted_conception_IVF_ART"].value_counts())
plt.title("Assisted_conception_IVF_ART")
plt.show()

# History_of_anomalies_in_previous_pregnancies

In [None]:
sns.barplot(x = df["History_of_anomalies_in_previous_pregnancies"].value_counts().index, 
            y = df["History_of_anomalies_in_previous_pregnancies"].value_counts())
plt.title("History_of_anomalies_in_previous_pregnancies")
plt.show()

# Birth_defects 

In [None]:
sns.barplot(x = df["Birth_defects"].value_counts().index, y = df["Birth_defects"].value_counts())
plt.title("Birth_defects")
plt.show()

# Blood_test_result

In [None]:
sns.barplot(x = df["Blood_test_result"].value_counts().index, y = df["Blood_test_result"].value_counts())
plt.title("Blood_test_result")
plt.show()

# Genetic_Disorder 

In [None]:
sns.barplot(y = df["Genetic_Disorder"].value_counts().index, 
            x = df["Genetic_Disorder"].value_counts())
plt.title("Genetic_Disorder")
plt.xlabel("Count")
plt.ylabel("Genetic Disorder")
plt.show()
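The bar charts above all follow the same pattern, so they could be drawn in one loop. This is a sketch using plain matplotlib on a made-up DataFrame; `plot_category_counts` is a hypothetical helper, and the notebook itself uses `sns.barplot` cell by cell.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_category_counts(frame, columns, ncols=2):
    """Draw one bar chart of value counts per column on a shared grid."""
    nrows = -(-len(columns) // ncols)  # ceiling division
    fig, axes = plt.subplots(nrows, ncols, figsize=(5 * ncols, 3.5 * nrows), squeeze=False)
    flat_axes = axes.ravel()
    for ax, col in zip(flat_axes, columns):
        counts = frame[col].value_counts()
        ax.bar(counts.index.astype(str), counts.values)
        ax.set_title(col)
        ax.set_ylabel("Count")
    # Hide any grid cells that were not needed.
    for ax in flat_axes[len(columns):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig


# Illustrative data only.
demo = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
    "Follow_up": ["High", "Low", "Low", "High", "Low"],
    "Status": ["Alive", "Deceased", "Alive", "Alive", "Deceased"],
})

fig = plot_category_counts(demo, ["Gender", "Follow_up", "Status"])
plt.show()
```

A grid keeps related charts on one figure, which makes class balance across columns easier to compare than scrolling through separate cells.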

### Numerical Variables

### Patient Age

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
sns.histplot(df["Patient_Age"], bins=5, kde=True, color='purple')
plt.title("Patient_Age")
plt.show()

# Blood cell count

In [None]:
sns.histplot(df["Blood_cell_count"], bins=5, kde=True, color='red')
plt.title("Distribution of Blood_cell_count")
plt.show()

# Mother's_Age

In [None]:
sns.histplot(df["Mother's_age"], bins=5, kde=True, color="purple")
plt.title("Mother's_age")
plt.show()

# Father's_Age

In [None]:
sns.histplot(df["Father's_age"], bins=5, kde=True, color="purple")
plt.title("Father's_age")
plt.show()

# Test_1 

In [None]:
sns.histplot(df["Test_1"], bins=5, kde=True, color='purple')
plt.title("Test_1")
plt.show()

# Test_2

In [None]:
sns.histplot(df["Test_2"], bins=5, kde=True, color='purple')
plt.title("Test_2")
plt.show()

# Test_3

In [None]:
sns.histplot(df["Test_3"], bins=5, kde=True, color='purple')
plt.title("Test_3")
plt.show()

# Test_4

In [None]:
sns.histplot(df["Test_4"], bins=5, kde=True, color='purple')
plt.title("Test_4")
plt.show()

# Test_5

In [None]:
sns.histplot(df["Test_5"], bins=5, kde=True, color='purple')
plt.title("Test_5")
plt.show()

# No of previous abortion 

In [None]:
sns.histplot(df["No_of_previous_abortion"], bins=5, kde=True, color='purple')
plt.title("No_of_previous_abortion")
plt.show()

# White Blood cell count

In [None]:
sns.histplot(df["White_Blood_cell_count"], bins=5, kde=True, color="red")
plt.title("Distribution of White_Blood_cell_count")
plt.show()

# Symptom 1 

In [None]:
sns.histplot(df["Symptom_1"], bins=5, kde=True, color='purple')
plt.title("Symptom_1")
plt.show()

# Symptom 2

In [None]:
sns.histplot(df["Symptom_2"], bins=5, kde=True, color='purple')
plt.title("Symptom_2")
plt.show()

# Symptom 3

In [None]:
sns.histplot(df["Symptom_3"], bins=5, kde=True, color='purple')
plt.title("Symptom_3")
plt.show()

# Symptom 4

In [None]:
sns.histplot(df["Symptom_4"], bins=5, kde=True, color='purple')
plt.title("Symptom_4")
plt.show()

# Symptom 5

In [None]:
sns.histplot(df["Symptom_5"], bins=5, kde=True, color='purple')
plt.title("Symptom_5")
plt.show()

## Bivariate Analysis

In [None]:
sns.pairplot(df)
plt.show()
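The assignment brief also suggests a correlation heatmap for bivariate analysis. The sketch below computes the underlying correlation matrix on synthetic data; `numeric_correlations` is a hypothetical helper and `Doubled_Age` is an invented column used only to show a perfect correlation.

```python
import numpy as np
import pandas as pd


def numeric_correlations(frame: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation matrix over the numeric columns only."""
    return frame.select_dtypes(include="number").corr()


rng = np.random.default_rng(0)
age = rng.integers(0, 15, size=50)

# Synthetic data for illustration.
demo = pd.DataFrame({
    "Patient_Age": age,
    "Doubled_Age": age * 2,        # hypothetical column, perfectly correlated with age
    "Noise": rng.normal(size=50),  # unrelated random values
    "Gender": ["Male", "Female"] * 25,
})

corr = numeric_correlations(demo)
print(corr.round(2))
```

The resulting matrix can be passed to `sns.heatmap(corr, annot=True)` to draw the heatmap.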

## Multivariate Analysis

In [None]:
sns.scatterplot(x = "Patient_Age", y = "Blood_cell_count", hue = "Status", data = df)
plt.show()

In [None]:
sns.scatterplot(x = "Patient_Age", y = "Blood_cell_count", hue = "Genetic_Disorder", data = df)
plt.show()

In [None]:
sns.scatterplot(x = "Patient_Age", y = "Mother's_age", hue = "No_of_previous_abortion", style="No_of_previous_abortion", data = df)
plt.show()
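The brief lists PCA as an optional multivariate technique. The following is a minimal sketch on synthetic numeric data (not the dataset itself): features are standardized first, then projected onto two principal components.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))

# Four synthetic features; the second is nearly a multiple of the first.
features = np.column_stack([
    base[:, 0],
    base[:, 0] * 2 + rng.normal(scale=0.1, size=100),
    base[:, 1],
    rng.normal(size=100),
])

# PCA is sensitive to scale, so standardize before projecting.
scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

print(components.shape)
print(pca.explained_variance_ratio_.round(3))
```

The two component columns can then be plotted with `sns.scatterplot`, colored by `Status`, in the same way as the scatterplots above.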

# Data Pre-Processing

## Encoding Categorical data

### Encoding the categorical variables with two categories

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df["Genes_in_mother's_side"] = le.fit_transform(df["Genes_in_mother's_side"])
df["Inherited_from_father"] = le.fit_transform(df["Inherited_from_father"])
df["Maternal_gene"] = le.fit_transform(df["Maternal_gene"])
df["Paternal_gene"] = le.fit_transform(df["Paternal_gene"])
df["Status"] = le.fit_transform(df["Status"])
df["Respiratory_Rate"] = le.fit_transform(df["Respiratory_Rate"])
df["Heart_Rate"] = le.fit_transform(df["Heart_Rate"])
df["Parental_consent"] = le.fit_transform(df["Parental_consent"])
df["Follow_up"] = le.fit_transform(df["Follow_up"])
df["Gender"] = le.fit_transform(df["Gender"])
df["Place_of_birth"] = le.fit_transform(df["Place_of_birth"])
df["Folic_acid_details"] = le.fit_transform(df["Folic_acid_details"])
df["HO_serious_maternal_illness"] = le.fit_transform(df["HO_serious_maternal_illness"])
df["Assisted_conception_IVF_ART"] = le.fit_transform(df["Assisted_conception_IVF_ART"])
df["History_of_anomalies_in_previous_pregnancies"] = le.fit_transform(df["History_of_anomalies_in_previous_pregnancies"])
df["Birth_defects"] = le.fit_transform(df["Birth_defects"])

In [None]:
df
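Reusing a single `LabelEncoder` works, but it only remembers the last column it was fitted on. The sketch below, on made-up data, keeps one encoder per column so that codes can be mapped back to labels later; `label_encode_columns` is a hypothetical helper.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def label_encode_columns(frame, columns):
    """Label-encode the given columns; return the new frame and the fitted encoders."""
    encoded = frame.copy()
    encoders = {}
    for col in columns:
        encoder = LabelEncoder()
        encoded[col] = encoder.fit_transform(encoded[col])
        encoders[col] = encoder  # kept so the codes can be decoded later
    return encoded, encoders


# Illustrative data only.
demo = pd.DataFrame({
    "Maternal_gene": ["Yes", "No", "Yes"],
    "Gender": ["Male", "Female", "Male"],
})

encoded, encoders = label_encode_columns(demo, ["Maternal_gene", "Gender"])
print(encoded)
print(list(encoders["Gender"].classes_))
```

`LabelEncoder` assigns codes in sorted label order, so `encoders["Gender"].inverse_transform(...)` recovers the original text when reporting results.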

### Encoding the independent variables with more than two categories

In [None]:
x = df.drop('Status', axis=1).values

y = df['Status'].values


In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [18, 19, 23, 24, 30, 36, 37])], remainder = 'passthrough')
x = np.array(ct.fit_transform(x))

In [None]:
x[0] 

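# Encoding the Dependent Variable

In [None]:
from sklearn.preprocessing import LabelEncoder
# Status was already label-encoded in the first encoding step, so this
# call leaves the integer labels unchanged; it is kept for completeness.
le = LabelEncoder()
y = le.fit_transform(y)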

In [None]:
y

# Splitting the dataset into the Training set and Test set

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 5)

In [None]:
x_train.shape

In [None]:
x_test.shape
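When the classes are not equally frequent, a stratified split keeps the class proportions the same in both subsets. This sketch uses synthetic labels with a 60/40 split and the `stratify` argument of `train_test_split`; the `demo_` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 60 of class 0 and 40 of class 1.
demo_features = np.arange(200).reshape(100, 2)
demo_labels = np.array([0] * 60 + [1] * 40)

demo_x_train, demo_x_test, demo_y_train, demo_y_test = train_test_split(
    demo_features,
    demo_labels,
    test_size=0.2,
    random_state=5,
    stratify=demo_labels,  # preserve the 60/40 class ratio in both subsets
)

print(np.bincount(demo_y_train), np.bincount(demo_y_test))
```

Without `stratify`, a small test set can end up with a noticeably different class mix than the training set.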

# Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train[:, [26, 31, 32, 33, 51]] = sc.fit_transform(x_train[:, [26, 31, 32, 33, 51]])
x_test[:, [26, 31, 32, 33, 51]] = sc.transform(x_test[:, [26, 31, 32, 33, 51]])

In [None]:
x_train[0]
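The scaler is fitted on the training data only and then reused on the test data, so that no information from the test set influences the transformation. The sketch below shows the effect on synthetic numbers.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)

# Synthetic train/test features drawn from the same distribution.
train = rng.normal(loc=50, scale=10, size=(80, 2))
test = rng.normal(loc=50, scale=10, size=(20, 2))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(test_scaled.mean(axis=0).round(3))
```

The scaled training columns have mean 0 and standard deviation 1 by construction; the test columns are close to that but not exact, which is expected.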

# Model building

# SVM

In [None]:
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 5)
classifier.fit(x_train, y_train)

# Predicting the Test set result

In [None]:
y_pred = classifier.predict(x_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

# Checking the Confusion matrix

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

# Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 5)
classifier.fit(x_train, y_train)

In [None]:
y_pred = classifier.predict(x_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

# Making the Confusion matrix

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

# Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 5)
classifier.fit(x_train, y_train)

In [None]:
y_pred = classifier.predict(x_test)
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)),1))

# Making the Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)
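The three train/predict/evaluate blocks above share the same structure, so they could be run in one loop that stores the scores side by side. This sketch uses a synthetic dataset from `make_classification` rather than the notebook's data, with the same model settings as above; the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=5, stratify=y_demo
)

models = {
    "SVM": SVC(kernel="linear", random_state=5),
    "Random Forest": RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=5),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=5),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    results[name] = {
        "accuracy": accuracy_score(y_te, pred),
        "f1": f1_score(y_te, pred),
    }

for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.3f}, f1={scores['f1']:.3f}")
```

Keeping each model under its own name also avoids overwriting `classifier` and `y_pred`, so all results remain available for the comparison.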

# Performance Analysis and Model Comparison

I evaluated three models, **Support Vector Machine (SVM), Random Forest, and Decision Tree**, using their confusion matrices and accuracy scores. All three achieved perfect accuracy on the 168-sample test set (98 + 70 samples in the confusion matrices).

## Accuracy Comparison

| Model | Accuracy |
|---|---|
| SVM | 1.0 (100%) |
| Random Forest | 1.0 (100%) |
| Decision Tree | 1.0 (100%) |

## Confusion Matrix Insights

### SVM

    [98  0]
    [ 0 70]

### Random Forest

    [98  0]
    [ 0 70]

### Decision Tree

    [98  0]
    [ 0 70]

# Key Findings

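1. All three models perfectly classified the test set, achieving 100% accuracy.

2. There were no False Positives (FP) or False Negatives (FN) for any of the models.

3. The dataset appears to be easy to classify. However, identical perfect scores from three different models can also point to target leakage, where a feature directly reveals the Status value. Autopsy_shows_birth_defect is a candidate, since it is recorded only "if applicable", so this should be checked before the result is trusted.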
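When every model scores perfectly, it is worth checking whether a single feature already determines the target. The sketch below is one quick way to do that, on fabricated illustrative data: `category_purity` is a hypothetical helper that reports, for each value of a feature, the share of rows that fall in its most common target class. The Status labels used here are placeholders.

```python
import pandas as pd


def category_purity(frame, feature, target):
    """For each value of `feature`, the share of rows in its most common target class."""
    table = pd.crosstab(frame[feature], frame[target], normalize="index")
    return table.max(axis=1)


# Fabricated example data, built so that one feature mirrors the target.
demo = pd.DataFrame({
    "Autopsy_shows_birth_defect": ["Not applicable", "Yes", "Not applicable", "No", "Not applicable", "Yes"],
    "Gender": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "Status": ["Alive", "Deceased", "Alive", "Deceased", "Alive", "Deceased"],
})

for feature in ["Autopsy_shows_birth_defect", "Gender"]:
    print(feature, category_purity(demo, feature, "Status").round(2).to_dict())
```

A feature whose purity is 1.0 for every value predicts the target on its own; if that happens on the real data, the models should be retrained without that feature to see how much accuracy remains.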

# Best Performing Model and Justification

Since all models achieved a perfect score, we need to evaluate them based on other factors:

| Model | Accuracy | Interpretability | Overfitting Risk | Scalability |
|---|---|---|---|---|
| SVM | 1.0 | Low | Low | Moderate |
| Random Forest | 1.0 | Moderate | Low | High |
| Decision Tree | 1.0 | High | High | High |

## Chosen Model: Random Forest

### Justification

1. Provides a balance between accuracy and robustness by averaging multiple decision trees.

2. Reduces overfitting, unlike a single Decision Tree.

3. Scalable for larger datasets.

4. Performs well even when some features are less important.

# Recommendations for Improving Model Performance

1. Cross-Validation:

    ● Implement k-fold cross-validation to ensure robustness and prevent overfitting.

2. Hyperparameter Tuning:

    ● For Random Forest, adjust parameters like n_estimators, max_depth, and min_samples_split.

    ● For SVM, fine-tune C and gamma parameters.

3. Feature Selection:

    ● Ensure relevant features are included to improve model generalization.

4. Ensemble Methods:

    ● Consider using boosting techniques (e.g., XGBoost, AdaBoost) to enhance performance.
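The first two recommendations can be sketched together. The example below runs 5-fold cross-validation and a small grid search over the Random Forest parameters named above, on a synthetic dataset; the grid values are arbitrary choices for illustration, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic binary classification data for illustration.
X_demo, y_demo = make_classification(n_samples=150, n_features=8, random_state=5)

# k-fold cross-validation: one accuracy score per fold.
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=10, random_state=5), X_demo, y_demo, cv=5
)
print("mean CV accuracy:", round(cv_scores.mean(), 3))

# Grid search over the parameters mentioned in the recommendations.
param_grid = {
    "n_estimators": [10, 30],
    "max_depth": [3, None],
    "min_samples_split": [2, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=5), param_grid, cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

The spread of the per-fold scores in `cv_scores` gives a sense of how stable the model is, which a single train/test split cannot show.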

# Limitations of the Dataset or Models

1. Dataset Imbalance:

   ● The confusion matrices show 98 and 70 test samples in the two classes (about 58% and 42%), so the data is only mildly imbalanced; the perfect accuracy may instead be due to an easy classification problem.

   ● Testing on a more challenging dataset or applying synthetic noise would provide a clearer evaluation.

2. Overfitting Risk (Decision Tree):

   ● A single Decision Tree may overfit, especially if the dataset is large or noisy.

3. Lack of Diversity in Evaluation Metrics:

   ● Relying solely on accuracy is not enough. Metrics like Precision, Recall, and F1-Score should be considered for a more thorough analysis.

4. Scalability Concerns:

   ● While Random Forest is robust, its performance can degrade with very large datasets if not tuned properly.
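Precision, recall, and F1-score can be derived directly from a 2x2 confusion matrix. The sketch below assumes the scikit-learn layout (rows are true classes, columns are predicted classes, class 1 is the positive class); `binary_metrics` is a hypothetical helper. The second matrix is an invented example with some errors, for contrast.

```python
import numpy as np


def binary_metrics(cm):
    """Precision, recall, and F1 for the positive class of a 2x2 confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    # scikit-learn layout: [[TN, FP], [FN, TP]]
    tn, fp, fn, tp = cm.ravel()
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


print(binary_metrics([[98, 0], [0, 70]]))  # the matrix reported for all three models
print(binary_metrics([[8, 2], [1, 9]]))    # invented matrix with a few errors
```

On real data, `sklearn.metrics.classification_report(y_test, y_pred)` reports the same metrics for every class in one call.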

# Conclusion

1. Random Forest is the most reliable model considering accuracy, robustness against overfitting, and scalability.

2. Improvements could include cross-validation, hyperparameter tuning, and testing on more complex datasets.

3. It is essential to consider precision, recall, and F1-score for a more comprehensive evaluation.