### Introduction 

In this project, I applied my knowledge of machine learning (ML) to analyze the Disease Symptoms and Patient Profile Dataset in depth. The dataset records disease symptoms together with an outcome labeled as either positive or negative. It also includes demographic details of patients, such as age and gender, along with their medical history, specifically chronic diseases.

The analysis centers on the outcome variable, which represents the test result for an unidentified outbreak. The outbreak is characterized by common symptoms such as fever, cough, fatigue, and difficulty breathing. The aim is to understand how the test result (positive or negative) relates to both the patients' medical history (captured in the Disease column) and patient profile features (such as blood pressure or age).

# Disease Symptoms and Patient Profile Dataset

The Disease Symptom and Patient Profile Dataset links symptoms, demographics, and health indicators. It combines four symptoms (fever, cough, fatigue, and difficulty breathing) with age, gender, blood pressure, and cholesterol level. Medical researchers, healthcare professionals, and data enthusiasts can use it to explore patterns and symptom profiles across medical conditions.

### Columns:

- Disease: The name of the disease or medical condition.
- Fever: Indicates whether the patient has a fever (Yes/No).
- Cough: Indicates whether the patient has a cough (Yes/No).
- Fatigue: Indicates whether the patient experiences fatigue (Yes/No).
- Difficulty Breathing: Indicates whether the patient has difficulty breathing (Yes/No).
- Age: The age of the patient in years.
- Gender: The gender of the patient (Male/Female).
- Blood Pressure: The blood pressure level of the patient (Low/Normal/High).
- Cholesterol Level: The cholesterol level of the patient (Low/Normal/High).
- Outcome Variable: The result of the diagnosis or assessment for the specific disease (Positive/Negative).


### Usage:
This dataset can be used by various stakeholders, including:

- Healthcare Professionals: Medical practitioners, doctors, and researchers can use this dataset for clinical analysis, research studies, and epidemiological investigations related to different diseases. It can aid in understanding the prevalence and patterns of symptoms among patients with specific medical conditions.
- Medical Researchers: Researchers focused on specific diseases or conditions in the dataset can use it to explore relationships between symptoms, age, gender, and other variables. This data can contribute to new insights, treatment strategies, and preventive measures.
- Healthcare Technology Companies: Companies developing healthcare applications, diagnostic tools, or AI algorithms can use this dataset to train and validate their models. The data can support predictive models for disease diagnosis or monitoring based on symptoms and patient characteristics.



## Data Preparation:
- Handle missing values: Check for missing values and decide whether to impute or remove them based on how much data is missing.
- Encode categorical variables: Convert categorical variables like "Gender" to numerical format using one-hot encoding.
- Split the dataset: Separate the features (X) from the outcome variable (y).
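
The preparation steps listed above could be sketched as follows. This is only a minimal illustration on a small made-up frame: the rows in `toy` are invented and are not records from the dataset, and only a few of the columns are reproduced.

```python
import pandas as pd

# Hypothetical toy rows that mimic some of the dataset's columns (not real records).
toy = pd.DataFrame({
    "Fever": ["Yes", "No", "Yes", "No"],
    "Age": [19, 25, 40, 63],
    "Gender": ["Female", "Male", "Female", "Male"],
    "Blood Pressure": ["Low", "Normal", "High", "Normal"],
    "Outcome Variable": ["Positive", "Negative", "Positive", "Negative"],
})

# 1. Handle missing values: count them per column, then drop incomplete rows.
missing_per_column = toy.isnull().sum()
toy = toy.dropna()

# 2. Split into features (X) and the outcome variable (y, 1 = Positive).
X = toy.drop(columns=["Outcome Variable"])
y = (toy["Outcome Variable"] == "Positive").astype(int)

# 3. One-hot encode the categorical feature columns.
X_encoded = pd.get_dummies(X, drop_first=True)
```

With `drop_first=True`, one category per column is dropped, so a two-level column such as "Gender" becomes a single indicator column.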

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score,confusion_matrix


In [2]:
df = pd.read_csv("Disease_symptom_and_patient_profile_dataset.csv")
print(df.head())
df.head()

       Disease Fever Cough Fatigue Difficulty Breathing  Age  Gender  \
0    Influenza   Yes    No     Yes                  Yes   19  Female   
1  Common Cold    No   Yes     Yes                   No   25  Female   
2       Eczema    No   Yes     Yes                   No   25  Female   
3       Asthma   Yes   Yes      No                  Yes   25    Male   
4       Asthma   Yes   Yes      No                  Yes   25    Male   

  Blood Pressure Cholesterol Level Outcome Variable  
0            Low            Normal         Positive  
1         Normal            Normal         Negative  
2         Normal            Normal         Negative  
3         Normal            Normal         Positive  
4         Normal            Normal         Positive  


Unnamed: 0,Disease,Fever,Cough,Fatigue,Difficulty Breathing,Age,Gender,Blood Pressure,Cholesterol Level,Outcome Variable
0,Influenza,Yes,No,Yes,Yes,19,Female,Low,Normal,Positive
1,Common Cold,No,Yes,Yes,No,25,Female,Normal,Normal,Negative
2,Eczema,No,Yes,Yes,No,25,Female,Normal,Normal,Negative
3,Asthma,Yes,Yes,No,Yes,25,Male,Normal,Normal,Positive
4,Asthma,Yes,Yes,No,Yes,25,Male,Normal,Normal,Positive


In [3]:
print(df.info())
# What is the size of dataset
df.shape
df.dtypes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 349 entries, 0 to 348
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Disease               349 non-null    object
 1   Fever                 349 non-null    object
 2   Cough                 349 non-null    object
 3   Fatigue               349 non-null    object
 4   Difficulty Breathing  349 non-null    object
 5   Age                   349 non-null    int64 
 6   Gender                349 non-null    object
 7   Blood Pressure        349 non-null    object
 8   Cholesterol Level     349 non-null    object
 9   Outcome Variable      349 non-null    object
dtypes: int64(1), object(9)
memory usage: 27.4+ KB
None


Disease                 object
Fever                   object
Cough                   object
Fatigue                 object
Difficulty Breathing    object
Age                      int64
Gender                  object
Blood Pressure          object
Cholesterol Level       object
Outcome Variable        object
dtype: object

In [4]:
#let's check for number of unique values
df.nunique()

Disease                 116
Fever                     2
Cough                     2
Fatigue                   2
Difficulty Breathing      2
Age                      26
Gender                    2
Blood Pressure            3
Cholesterol Level         3
Outcome Variable          2
dtype: int64

In [5]:
df['Disease'].unique()
df['Blood Pressure'].unique()
df['Cholesterol Level'].unique()
df['Outcome Variable'].unique()


array(['Positive', 'Negative'], dtype=object)

In [6]:
## Data Cleaning
df.isnull().sum()

Disease                 0
Fever                   0
Cough                   0
Fatigue                 0
Difficulty Breathing    0
Age                     0
Gender                  0
Blood Pressure          0
Cholesterol Level       0
Outcome Variable        0
dtype: int64

In [7]:
df.columns

Index(['Disease', 'Fever', 'Cough', 'Fatigue', 'Difficulty Breathing', 'Age',
       'Gender', 'Blood Pressure', 'Cholesterol Level', 'Outcome Variable'],
      dtype='object')

In [8]:
## MODEL
df_for_model = df.copy()
df_for_model.replace({'Yes':1,'No':0},inplace=True)
df_for_model.replace({'Positive':1,'Negative':0},inplace=True)
df_for_model.replace({'Low':0,'Normal':1,'High':2},inplace=True)
df_for_model.replace({'Female':0,'Male':1},inplace=True)
df_for_model

Unnamed: 0,Disease,Fever,Cough,Fatigue,Difficulty Breathing,Age,Gender,Blood Pressure,Cholesterol Level,Outcome Variable
0,Influenza,1,0,1,1,19,0,0,1,1
1,Common Cold,0,1,1,0,25,0,1,1,0
2,Eczema,0,1,1,0,25,0,1,1,0
3,Asthma,1,1,0,1,25,1,1,1,1
4,Asthma,1,1,0,1,25,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...
344,Stroke,1,0,1,0,80,0,2,2,1
345,Stroke,1,0,1,0,85,1,2,2,1
346,Stroke,1,0,1,0,85,1,2,2,1
347,Stroke,1,0,1,0,90,0,2,2,1


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 349 entries, 0 to 348
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Disease               349 non-null    object
 1   Fever                 349 non-null    object
 2   Cough                 349 non-null    object
 3   Fatigue               349 non-null    object
 4   Difficulty Breathing  349 non-null    object
 5   Age                   349 non-null    int64 
 6   Gender                349 non-null    object
 7   Blood Pressure        349 non-null    object
 8   Cholesterol Level     349 non-null    object
 9   Outcome Variable      349 non-null    object
dtypes: int64(1), object(9)
memory usage: 27.4+ KB


## Train & Test Split



In [10]:
x_train,X_test,y_train,Y_test = train_test_split(pd.get_dummies(df_for_model.iloc[:,:-1],drop_first=True),df_for_model.iloc[:,-1])
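
The split above relies on the defaults of `train_test_split`, so the rows that land in each part change from run to run. A reproducible, stratified variant might look like the sketch below. The arrays `X_demo` and `y_demo` are hypothetical stand-ins for the encoded features and the outcome, and the seed value is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 20 samples, 2 features, balanced binary labels.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0, 1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.25,     # same share as the default
    stratify=y_demo,    # keep the class ratio similar in both parts
    random_state=42,    # fixed seed so the split is repeatable
)
```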

In [11]:
X_train = df.drop(columns=['Outcome Variable'],axis=1)
y_train= df['Outcome Variable']
X_train.shape , y_train.shape

((349, 9), (349,))

In [12]:
X_train

Unnamed: 0,Disease,Fever,Cough,Fatigue,Difficulty Breathing,Age,Gender,Blood Pressure,Cholesterol Level
0,Influenza,Yes,No,Yes,Yes,19,Female,Low,Normal
1,Common Cold,No,Yes,Yes,No,25,Female,Normal,Normal
2,Eczema,No,Yes,Yes,No,25,Female,Normal,Normal
3,Asthma,Yes,Yes,No,Yes,25,Male,Normal,Normal
4,Asthma,Yes,Yes,No,Yes,25,Male,Normal,Normal
...,...,...,...,...,...,...,...,...,...
344,Stroke,Yes,No,Yes,No,80,Female,High,High
345,Stroke,Yes,No,Yes,No,85,Male,High,High
346,Stroke,Yes,No,Yes,No,85,Male,High,High
347,Stroke,Yes,No,Yes,No,90,Female,High,High


## Random Forest

In [13]:
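# Cell 11 overwrote y_train with all 349 labels, so re-create the split here
# to keep the training features and labels the same length.
x_train, X_test, y_train, Y_test = train_test_split(
    pd.get_dummies(df_for_model.iloc[:, :-1], drop_first=True),
    df_for_model.iloc[:, -1],
)

model = RandomForestClassifier()
model.fit(x_train, y_train)

RFC_TrainSet_Prediction = model.predict(x_train)
RFC_TestSet_Prediction = model.predict(X_test)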

In [None]:
RFC_predict_evaluation = {'Random Forest Classification Predictions Evaluation (All Features)':
    {'Train MSE': mean_squared_error(y_train, RFC_TrainSet_Prediction),
     'Test MSE' : mean_squared_error(Y_test, RFC_TestSet_Prediction),
     # RMSE is the square root of MSE; np.sqrt avoids the deprecated `squared` argument.
     'Train RMSE': np.sqrt(mean_squared_error(y_train, RFC_TrainSet_Prediction)),
     'Test RMSE' : np.sqrt(mean_squared_error(Y_test, RFC_TestSet_Prediction)),
     'Train Accuracy' : round(accuracy_score(y_train, RFC_TrainSet_Prediction), 3),
     'Test Accuracy' : round(accuracy_score(Y_test, RFC_TestSet_Prediction), 3)}
    }
RFC_predict_evaluation = pd.DataFrame(RFC_predict_evaluation)
RFC_predict_evaluation
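
MSE and RMSE are regression metrics. For 0/1 class labels the MSE is simply the share of wrong predictions, that is, 1 minus the accuracy, so it adds little to the table. A confusion matrix (already imported in the first cell) may be more informative. The sketch below uses hypothetical labels and predictions, not output from the model above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error

# Hypothetical true labels and predictions (1 = Positive, 0 = Negative).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

acc = accuracy_score(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)  # equals 1 - acc for 0/1 labels

# Rows are true classes, columns are predicted classes (order: 0, 1).
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
```

The four counts separate false positives from false negatives, which a single accuracy or MSE value cannot do.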

In [None]:
# Create separate DataFrames for patients who report each symptom.
# The symptoms are columns of their own; "Outcome Variable" only holds Positive/Negative.
fever_df = df[df['Fever'] == 'Yes']
fatigue_df = df[df['Fatigue'] == 'Yes']
difficulty_breathing_df = df[df['Difficulty Breathing'] == 'Yes']

# Plot the age distribution of each symptom group
fever_df['Age'].plot(kind='hist', title='Fever Data')
plt.show()

fatigue_df['Age'].plot(kind='hist', title='Fatigue Data')
plt.show()

difficulty_breathing_df['Age'].plot(kind='hist', title='Difficulty Breathing Data')
plt.show()

In [None]:
# Use scikit-learn's LabelEncoder to convert the categorical text columns of df into integer codes.
# LabelEncoder assigns codes in sorted (alphabetical) order, e.g. "No" -> 0 and "Yes" -> 1.

from sklearn.preprocessing import LabelEncoder

LE = LabelEncoder()
categorical_columns = ["Fever", "Cough", "Fatigue", "Difficulty Breathing", "Gender",
                       "Blood Pressure", "Cholesterol Level", "Outcome Variable"]
for column in categorical_columns:
    df[column] = LE.fit_transform(df[column])
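
Because `LabelEncoder` sorts the classes alphabetically, an ordered column such as "Blood Pressure" ends up coded as High=0, Low=1, Normal=2, which does not follow the natural order. If that order matters, `OrdinalEncoder` with an explicit category list is one option. The sketch below runs on a hypothetical four-row sample rather than on `df`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical sample of one ordered column.
levels = pd.DataFrame({"Blood Pressure": ["Low", "Normal", "High", "Normal"]})

# LabelEncoder sorts the classes alphabetically: High=0, Low=1, Normal=2.
alphabetical = LabelEncoder().fit_transform(levels["Blood Pressure"])

# OrdinalEncoder accepts an explicit order: Low=0, Normal=1, High=2.
encoder = OrdinalEncoder(categories=[["Low", "Normal", "High"]])
ordered = encoder.fit_transform(levels[["Blood Pressure"]]).ravel()
```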

In [None]:
df.head()

In [None]:
df.info()

In [None]:
# Build a Series called category_counts with value_counts().
# Each unique disease in the "Disease" column becomes an index label,
# and its value is the number of rows in df with that disease.

category_counts = df["Disease"].value_counts()

# Add a new column, "Disease_freq", by mapping each disease name
# to its count in category_counts (frequency encoding).
# Every row then holds the number of occurrences of its own disease.

df["Disease_freq"] = df["Disease"].map(category_counts)

# Display the number of occurrences of each unique disease in the dataset.
category_counts
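
To make the frequency encoding above concrete, here is a minimal sketch on a hypothetical six-row "Disease" column. The disease names are reused from the dataset, but the rows themselves are invented.

```python
import pandas as pd

# Hypothetical miniature of the "Disease" column.
toy = pd.DataFrame(
    {"Disease": ["Asthma", "Stroke", "Asthma", "Eczema", "Asthma", "Stroke"]}
)

# Count how often each disease occurs ...
counts = toy["Disease"].value_counts()

# ... and replace each disease name by its count.
toy["Disease_freq"] = toy["Disease"].map(counts)
```

When the encoded column feeds a model, it may be safer to compute the counts on the training rows only, so that no information from the test rows leaks into the feature.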

In [None]:
# Remove the "Disease" column from df.

df = df.drop(columns="Disease")
df.head()

In [None]:
df.info()

In [None]:
from sklearn.linear_model import LinearRegression

df

In [None]:
# Seaborn's catplot with kind="swarm" draws a swarm plot for categorical data.
# The points are spread along the categorical axis so they do not overlap,
# which gives a clear view of the Age distribution for each outcome.
sns.catplot(x='Outcome Variable', y='Age', data=df, kind="swarm")

In [None]:
# Older patients appear more likely to test positive; the oldest patients show up as outliers in the dataset.
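
One way to check the age observation above numerically is to compare summary statistics per outcome and the share of positive results per age band. The sketch below uses hypothetical ages and outcomes, and the cut-off of 40 years is an arbitrary assumption for illustration.

```python
import pandas as pd

# Hypothetical ages and outcomes (1 = Positive, 0 = Negative), for illustration only.
toy = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45, 50, 60, 70],
    "Outcome Variable": [0, 0, 1, 0, 1, 1, 1, 1],
})

# Median age per outcome class.
median_age = toy.groupby("Outcome Variable")["Age"].median()

# Share of positive results in each age band.
bands = pd.cut(toy["Age"], bins=[0, 40, 100], labels=["<=40", ">40"])
positive_rate = toy.groupby(bands, observed=True)["Outcome Variable"].mean()
```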

In [None]:
lr=LinearRegression()

In [None]:
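# Select the target by name: once "Disease_freq" is appended, the last column is no longer the outcome.
x_train,X_test,y_train,Y_test = train_test_split(pd.get_dummies(df.drop(columns=['Outcome Variable']),drop_first=True),df['Outcome Variable'])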

In [None]:
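# Assuming the cells above ran in order ("Disease" dropped, "Disease_freq" appended),
# position 8 is "Outcome Variable" and position 5 is "Gender" ("Age" is now at position 4).
x=df.iloc[:,8:9]
y=df.iloc[:,5:6]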

In [None]:
lr.fit(x,y)

In [None]:
b0=lr.predict([[0]])
print(b0)

In [None]:
print(lr.predict([[15]]))
print(lr.predict([[7]]))

In [None]:
A=np.arange(20).reshape(-1,1)
plt.scatter(x,y)
y_pred=lr.predict(A)
plt.plot(A,y_pred,color="red")
plt.show()

In [None]:
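# Assuming the cells above ran in order, position 9 is "Disease_freq"
# and position 5 is "Gender" ("Age" is now at position 4).
x=df.iloc[:,9:10]
y=df.iloc[:,5:6]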

In [None]:
mlr=LinearRegression()
mlr.fit(x,y)

In [None]:
print(mlr.intercept_,mlr.coef_)

In [None]:
print(mlr.predict(np.array([[22]])))

In [None]:
from sklearn.preprocessing import PolynomialFeatures
plr=PolynomialFeatures(degree=3)

x_pol=plr.fit_transform(x)
lr_pol=LinearRegression()
lr_pol.fit(x_pol,y)

In [None]:
# Reuse the already-fitted transformer; new inputs should be transformed, not refitted.
xnew=plr.transform(np.array([[22]]))
print(lr_pol.predict(xnew))
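
The expand-then-fit steps above can also be bundled in a scikit-learn pipeline, so that the same fitted expansion is applied automatically at prediction time. The sketch below uses hypothetical data that follows y = x^3 exactly, which is unrelated to the dataset. The names `x_demo`, `y_demo`, and `poly_model` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data following y = x^3 exactly, so a degree-3 fit can recover it.
x_demo = np.arange(1, 11).reshape(-1, 1).astype(float)
y_demo = x_demo.ravel() ** 3

# The pipeline expands the features and fits the regression in one object;
# predict() reuses the fitted expansion instead of refitting it.
poly_model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
poly_model.fit(x_demo, y_demo)

pred = poly_model.predict(np.array([[12.0]]))
```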

In [None]:
import sklearn
print(sklearn.__version__)