# Mini project 1


Student name: Klajdi Bodurri
    
Github link: https://github.com/Klainti/Machine-Learning

## Introduction (project statement)

We have a dataset of patients treated in a cardiology department. Our goal is to build a **logistic regression** model to predict whether a patient takes part in the rehabilitation program or not.

## Descriptive analysis of the data

First, let's read 'Data.csv' and print a few rows to see which features we have.

In [7]:
# import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# read csv file
data = pd.read_csv("Data.csv", sep=";")

# print 5 rows of data
data.head(5)

| Unnamed: 0 | Reason | Gender | Age | Mobility | Distance | Participation |
|---|---|---|---|---|---|---|
| 0 | Hospital readmission | M | 61.3 | No car | 68.7 | 0 |
| 1 | Hospital readmission | M | 85.8 | Car | 86.3 | 0 |
| 2 | Hospital readmission | F | 65.0 | No car | 46.2 | 1 |
| 3 | Hospital readmission | F | 72.5 | No car | 39.7 | 0 |
| 4 | Hospital readmission | M | 93.0 | No car | 73.3 | 0 |


*Note: only 5 of the 500 rows are printed.*


From the output above, we can see that we have 5 features (X) and one target variable (y). The `Unnamed: 0` column appears to be just a row index, and it is not used as a feature.

**Features (predictors)**
1. **Reason**   (Hospital readmission, Other obligations, Resumed work, Medical reasons, etc.)
2. **Gender**   (Male or Female)
3. **Age**      (numeric, float)
4. **Mobility** (Car or No car)
5. **Distance** (numeric, float)
    
**Target variable**
1. **Participation** (1 or 0: whether the patient participates or not)

These 5 features will be used to predict whether a patient takes part in the rehabilitation program or not.

## Short explanation on what the data is about and how we are planning to use it in doing the task of the project

*What each feature is about:*
* **Reason** is the patient's reason for not taking part in the rehab program.
* **Gender** is the gender of the patient.
* **Age** is the age of the patient.
* **Mobility** is the patient's means of transportation (car or no car).
* **Distance** is the distance from the rehab center to the patient's home.
* **Participation** indicates whether the patient participates in the rehab program.

We are going to use all the features to build our model, because each of them may carry information about the output, and excluding one could lose some of that information. However, some features (**Reason**, **Mobility** and **Gender**) hold strings rather than numbers. Logistic regression needs numeric inputs, so we **cannot** fit the model on these values directly: we have to convert them to numbers with one-hot encoding.

More details: https://stackoverflow.com/questions/34007308/linear-regression-analysis-with-string-categorical-features-variables

## The steps we are taking in order to perform the task

#### 1. Separate features (predictors) from the target variable.

In [8]:
# get features!
X = data[["Reason","Gender","Age","Mobility","Distance"]]

# get output!
y = data[["Participation"]]

#### 2. Apply one-hot encoding to the values of Reason, Mobility and Gender.

In [9]:
# one-hot encode the Reason, Gender and Mobility features, since they hold strings rather than numbers
X = pd.get_dummies(X, columns=['Reason', 'Gender', 'Mobility'])
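To make the effect of one-hot encoding concrete, here is a minimal sketch on a made-up three-row frame. The frame `toy` and its values are hypothetical and only mimic the kinds of columns found in Data.csv.

```python
import pandas as pd

# Hypothetical toy frame with the same kinds of columns as Data.csv
toy = pd.DataFrame({
    "Gender": ["M", "F", "M"],
    "Mobility": ["Car", "No car", "No car"],
    "Age": [61.3, 65.0, 93.0],
})

# Each listed column is replaced by one indicator column per category
encoded = pd.get_dummies(toy, columns=["Gender", "Mobility"])

# The numeric column is kept as-is; the string columns become 0/1 indicators
print(list(encoded.columns))
print(encoded)
```

Each string column is replaced by one indicator column per category, which is why the encoded frame has more columns than the original.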

#### 3. Splitting X and y into train and test sets.

In [10]:
# 67% of our data trains the model and the other 33% is for testing!
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
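As a quick check of what this split produces, the sketch below applies the same call to placeholder arrays with 500 rows. `X_demo` and `y_demo` are stand-ins, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for the 500-row X and y
X_demo = np.arange(1000).reshape(500, 2)
y_demo = np.arange(500) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# Number of rows used for training and for testing
print(len(X_tr), len(X_te))

# With a fixed random_state, repeating the call gives the same split
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)
print(np.array_equal(X_te, X_te2))
```

Fixing `random_state` makes the split reproducible, so the reported accuracy does not change between runs.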

#### 4. Training our model

In [11]:
# init regression class
regression_model = LogisticRegression()

# calculate coefficients and intercept (fit our model)
# y_train is a one-column DataFrame; ravel() flattens it into the 1D array that fit() expects
# (passing y_train directly triggers a DataConversionWarning)
fit = regression_model.fit(X_train, y_train.values.ravel())

print("Intercept: {}".format(fit.intercept_))
print("Coefficients: {}".format(fit.coef_))

Intercept: [ 1.92696799]
Coefficients: [[-0.015975   -0.06108842  0.1285306  -0.00785177 -0.03079547  0.8579193
  -0.38755738 -0.29048704 -0.06926309 -0.20969305 -0.11282085 -0.41452161
   0.91727591  1.00969208  1.82690876  0.10005923]]


We might expect 5 coefficients + 1 intercept, since we use 5 features, but the model reports 16 coefficients. The reason is that the function get_dummies (see step 2) replaces each categorical feature with one 0/1 column per category, so Reason, Gender and Mobility each contribute several columns, while Age and Distance contribute one each. The model still uses the same 5 underlying features: one-hot encoding changes how the information is represented, not what information the model receives.

More details: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
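To see which coefficient belongs to which encoded column, one option is to pair `coef_` with the column names. The sketch below does this on synthetic data; the frame `toy`, the random target `y_toy` and the name `coef_table` are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the real data: one numeric and one categorical feature
toy = pd.DataFrame({
    "Age": rng.uniform(40, 90, n),
    "Mobility": rng.choice(["Car", "No car"], n),
})
y_toy = rng.integers(0, 2, n)  # random labels, for illustration only

# dtype=float keeps the indicator columns numeric on every pandas version
X_toy = pd.get_dummies(toy, columns=["Mobility"], dtype=float)

model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# One coefficient per encoded column, labelled with the column name
coef_table = pd.Series(model.coef_[0], index=X_toy.columns)
print(coef_table)
```

Applied to the real model, the same pattern would be `pd.Series(fit.coef_[0], index=X_train.columns)`.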

#### 5. Checking the accuracy of our model by giving as input the test set.

In [12]:
# Measure the score! 
accuracy = fit.score(X_test, y_test)

print("Accuracy: {}".format(accuracy))

Accuracy: 0.806060606060606


**What does 0.806060606060606 mean?**
It is the fraction of patients in the test set whose participation the model predicted correctly, about 80.6%. If new patients resemble those in the test set, we can expect the model to be right for roughly 4 out of 5 of them.
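The score reported by `score` is plain classification accuracy: the share of predictions that match the true labels. A minimal sketch with made-up labels (`y_true` and `y_pred` are hypothetical) shows the calculation:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions for 10 hypothetical patients
y_true = np.array([0, 0, 1, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 0])

# Accuracy is the mean of the element-wise "prediction is correct" flags
manual = np.mean(y_true == y_pred)

# scikit-learn's helper computes the same quantity
print(manual, accuracy_score(y_true, y_pred))
```

Here 8 of the 10 predictions match, so both values are 0.8.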

Is this accuracy high or low?

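Well, it really depends on the problem you are facing. For example, if you want to determine whether someone has a malignant or a benign tumor, a wrong prediction is very costly, so you need the accuracy to be as high as possible. In our rehab participation case the stakes are lower, so I think an accuracy of about 80% is very good. A fairer judgment would also compare it with the share of the majority class in the data, which we did not check here.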
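One common sanity check, not part of the original analysis, is to compare the accuracy with a baseline that always predicts the most frequent class. The sketch below uses made-up labels (70% zeros) purely to show the mechanics; the real class balance of Data.csv is not known here.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 70 hypothetical non-participants and 30 participants
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.zeros((100, 1))  # this baseline ignores the features

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_acc = baseline.score(X_demo, y_demo)
print(baseline_acc)
```

If such a baseline already reached an accuracy close to that of the model, the model would add little. The further the model is above the baseline, the more it has learned from the features.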

## Conclusion

We saw how to read a dataset, split it into train and test sets, train a model and calculate its accuracy. We used a high-level library (scikit-learn) to do the task, so we did not see the math behind logistic regression.
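For reference, the math the library hides is compact: the predicted probability is the sigmoid of the intercept plus the weighted sum of the features. The sketch below checks this against scikit-learn on synthetic data; `X_demo` and `y_demo` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic two-feature data with a noisy linear decision rule
X_demo = rng.normal(size=(100, 2))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

model = LogisticRegression().fit(X_demo, y_demo)

# Linear part: intercept + coefficients . features
z = X_demo @ model.coef_[0] + model.intercept_[0]

# Sigmoid turns the linear score into a probability of class 1
p_manual = 1.0 / (1.0 + np.exp(-z))

# Compare with the probabilities scikit-learn reports
p_sklearn = model.predict_proba(X_demo)[:, 1]
print(np.allclose(p_manual, p_sklearn))
```

A patient is then assigned to class 1 when this probability is above 0.5.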

In summary, we applied the knowledge from the course to a real project and achieved good performance: about 80% accuracy on the test set.