# COMP3160 Assignment 1

## Part 2: Classification: Logistic regression - 50 marks

This notebook contains the second task required for the first assignment in COMP3160. The focus of this notebook is on classification, so while some questions may look similar to those in the first notebook, please read each question carefully, as the requirements can be quite different.

This task contains 5 questions; the total marks available for this task is **50**. 

10% of the mark is awarded for well-structured and efficient code.

In [96]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression


### Task 1 -  Import CSV into pandas (5 Marks)

1. Create a function to read the CSV file provided into a DataFrame.
2. You MUST place the CSV file in the same directory/folder where your notebook is located. The method below should work without change when you give the file name "stroke-data.csv".
3. The imported file has NA (not available) values in some columns. Rows containing them must be dropped, as machine learning algorithms cannot process data with missing values. Remember that when rows are dropped, some (row) indexes will be missing.
4. The first step in processing data is to review the data types of the features (columns).
5. Use the **pandas** attributes *columns* and *dtypes* to create a dictionary with column names as keys and the data types as values.
6. The function then returns the new DataFrame (df) and the dictionary (df_types), in which each key-value pair maps a column name to that column's dtype.
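
The steps above can be tried on a small frame first. The following is a minimal sketch using a hypothetical toy DataFrame (its values are invented for illustration and are not taken from the stroke dataset); it shows that `dropna()` leaves a gap in the row index and how a column-to-dtype dictionary can be built.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the CSV file (not the stroke dataset)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "gender": ["Male", "Female", "Female", "Male"],
    "bmi": [22.5, np.nan, 30.1, 27.8],
})

# Drop every row that contains at least one NA value
clean = toy.dropna()

# The row with label 1 is gone, so the index is no longer 0, 1, 2, 3
remaining_labels = list(clean.index)

# Map each column name to the dtype of that column
toy_types = {col: clean[col].dtype for col in clean.columns}

print(remaining_labels)
print(toy_types)
```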

In [97]:
def process_data(fl):
    
    # Import the CSV file (fl)
    df2 = pd.read_csv(fl)
    
    # Review the column data types, then drop all rows with NA values
    print(df2.dtypes)
    df2 = df2.dropna()

    # Create a dictionary with the column names as keys and each column's dtype as value
    df2_types = {}
    for col in df2.columns:
        df2_types[col] = df2[col].dtype
    
    return df2, df2_types


### Task 2  - Convert categorical (non-numeric) variables (20 marks)

Many machine learning algorithms are designed to process numeric data and cannot natively handle categorical data. Therefore as part of the model building process, we must apply pre-processing steps to convert the data into an encoded format which the algorithms can handle.

1. In the following function you will identify categorical variables and convert them to a numeric data type.
2. You will need the Python *dictionary* "df2_types" returned by the function "process_data" created in Task 1. We can use it to identify columns stored in a categorical (non-numeric) format.
3. Create a list "cat_ls" of the names of the non-numeric columns.
4. Process each column named in "cat_ls" separately.
5. For a column name, say "col_name", find the *distinct* categories. For example, the column "gender" has 2 categories, "Male" and "Female".
6. For a (categorical) column 'C' with *k* categories, *k-1* new columns are created and 'C' is replaced by these new columns. For example, the "*gender*" column will be replaced by one numerical column, and the "*smoking_status*" column is to be replaced with 2 numerical columns. This process is referred to as *one-hot encoding*.
7. The encoding is done as follows. Suppose there are 3 categories "cat1", "cat2", "cat3" in column 'C'. Create 2 columns with distinct names, say "cat_level1" and "cat_level2". If the observation in a row is 'cat1', put 1 in 'cat_level1' and 0 in 'cat_level2' in the same row. If it is 'cat2', put 0 in 'cat_level1' and 1 in 'cat_level2'. If it is 'cat3', put 0 in both.
8. It is simpler if the column has only 2 categories (like "gender"): it is replaced by 1 column of 1's and 0's.
9. The number of columns in the new DataFrame will generally be greater than in the original. For the *stroke-dataset* this number is 11. Remember to **drop** the old non-numeric columns.
10. Depending on how you do it, the column ordering may change. This is important for identifying the output column "stroke".
11. You may reorder the columns. Suggestion: move "stroke" to the last column in the new DataFrame.
12. You should NOT use any feature-processing modules from **sklearn** or pandas.get_dummies() for this part. If they are used, the maximum mark for this task will not exceed 60%.
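
The encoding rule in steps 6 and 7 can be sketched generically as follows. This is only an illustration under assumptions: the helper name `encode_column`, the generated column names, and the choice of the alphabetically last category as the all-zeros baseline are hypothetical and not prescribed by the assignment.

```python
import pandas as pd

def encode_column(df, col_name):
    """Replace col_name by k-1 indicator columns; the last category is the baseline."""
    categories = sorted(df[col_name].unique())
    out = df.copy()
    # One 0/1 column per category except the last (baseline) one
    for cat in categories[:-1]:
        out[f"{col_name}_{cat}"] = (out[col_name] == cat).astype(int)
    # The original categorical column is no longer needed
    return out.drop(columns=[col_name])

# Toy frame mirroring the 3-category example in the text
toy = pd.DataFrame({"C": ["cat1", "cat2", "cat3", "cat1"], "x": [1.0, 2.0, 3.0, 4.0]})
encoded = encode_column(toy, "C")
print(encoded)
```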


In [98]:
def convert_to_numeric():
    
    # Read the appropriate file, should be in the same directory as the notebook
    df, dict_types = process_data("Stroke_data.csv")

    # Collect the names of the non-numeric (categorical) columns
    cat_ls = []
    for key, value in dict_types.items():
        print(value)
        if value != np.int64 and value != np.float64:
            cat_ls.append(key)

    # Apply the one hot encoding process outlined to the new dataframe df2
    # Work on a copy so the frame returned by process_data is left untouched
    df2 = df.copy()

    # Two-category columns become a single 0/1 indicator column
    df2["sex"] = (df2["gender"] == "Male").astype(int)
    df2["Marital_status"] = (df2["ever_married"] == "Yes").astype(int)

    # Three-category column becomes two indicators; any other value is the baseline (0, 0)
    df2["Smokes"] = (df2["smoking_status"] == "smokes").astype(int)
    df2["NTSmokes"] = (df2["smoking_status"] == "never smoked").astype(int)

    # Drop the original categorical columns
    df2 = df2.drop(columns=cat_ls)

    # Move "stroke" to the last column
    col = df2.pop("stroke")
    df2[col.name] = col

    return df2


### Task 3 - Generate ndarrays for train and test data (5 marks)

1. Convert all columns except "id" and "stroke" into a numerical feature matrix **X**. The size of the matrix will be *no_of_rows* $\times$ *(no_of_columns-2)*. The number of columns should be 9.
2. Put the values in the "stroke" column in the array **y**.
3. Use the sklearn [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) method to generate *X_train, X_test, y_train, y_test*.
4. In the *train_test_split()* method the fraction of data to be split off for testing has to be specified. Vary this fraction between 0.2 and 0.33. Run your program a few times to choose an optimum value. The optimum will correspond to the fraction giving the best accuracy/precision (see Task 5).
5. Return the 4 arrays.
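
As a rough illustration of how the test fraction changes the split, the sketch below runs `train_test_split` on synthetic stand-in data (randomly generated, not the stroke dataset) for a few fractions in the suggested range and records the resulting shapes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 9 feature columns, binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 9))
y = rng.integers(0, 2, size=100)

split_shapes = {}
for frac in (0.2, 0.25, 0.33):
    # random_state fixes the shuffle so repeated runs give the same split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=frac, random_state=42
    )
    split_shapes[frac] = (X_train.shape, X_test.shape)
    print(frac, X_train.shape, X_test.shape)
```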

In [99]:
# Import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split

def create_arrays():
    
    # Call the function created in Task 2 to source the encoded data frame
    df2 = convert_to_numeric()

    # Create the X and y objects
    # Suppress scientific notation when the arrays are printed
    np.set_printoptions(suppress=True)

    # Feature matrix: the 9 columns between "id" (first) and "stroke" (last)
    X = np.array(df2.iloc[:, 1:10])
    # Output vector: the "stroke" column
    y = np.array(df2.iloc[:, 10])

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)

    # Function returns the four newly created objects
    return X_train, X_test, y_train, y_test


### Task 4 - Create the logistic regression model (5 marks)
1. In the following function you will use [linear_model.LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) from sklearn to create and train a logistic regression model.
2. The model should be trained on the train set created in Task 3. **Do not use the full dataset or test set for training**.
3. As this is a binary classification problem (2 classes: "stroke", "no-stroke"), the default model does not need significant adjustment.
4. You should refer to the documentation and experiment with changing the hyperparameters of the model.
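
One way to experiment with a hyperparameter is sketched below, under stated assumptions: the data are synthetic (not the stroke dataset) and only the inverse regularization strength `C` is varied, with every other setting left at its default. Smaller `C` means a stronger penalty, which shows up as a smaller coefficient norm.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: the label depends mostly on the first feature, plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    # C is the inverse of the regularization strength
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X, y)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(C, coef_norms[C])
```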


Once you have trained a model, answer the questions below:
1. In the LogisticRegression class, the first keyword argument is *penalty='l2'*. What is penalized and why? Explain this in 2 sentences.  
2. Instead of the $l_2$ penalty one may use the $l_1$ penalty. What is the difference between the two?  

# Answers

1. The L2 penalty adds the sum of the squared model coefficients (weights) to the training loss, so large coefficients are penalized. This shrinks the coefficients towards zero, which limits how strongly any single feature can influence the predictions and reduces overfitting.

2. The L1 penalty uses the sum of the absolute values of the coefficients instead of the sum of their squares. L1 regularization can drive some coefficients to exactly zero, so it performs feature selection and may produce sparse models, whereas L2 regularization shrinks the weights towards zero without removing them, so its models are not sparse.
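
To make the difference between the two penalties concrete, the two penalty terms can be computed by hand. The coefficient vector below is hypothetical and chosen only for illustration; the point is that the one large weight dominates the L2 term, while the L1 term grows in proportion to each weight's magnitude.

```python
# Hypothetical coefficient vector, chosen only for illustration
weights = [3.0, -0.5, 0.0, 0.1]

# L1 penalty term: sum of the absolute values of the coefficients
l1_penalty = sum(abs(w) for w in weights)

# L2 penalty term: sum of the squared coefficients
l2_penalty = sum(w ** 2 for w in weights)

# Share of each penalty that comes from the largest weight
l1_share = abs(weights[0]) / l1_penalty
l2_share = weights[0] ** 2 / l2_penalty

print(l1_penalty, l2_penalty)
print(l1_share, l2_share)
```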

In [100]:
def fit_logitmodel(X, y):
    
    # Create the logitmodel_stroke model
    # All parameters not specified are set to their defaults.
    # Note: penalty='none' disables regularization; newer scikit-learn
    # releases expect penalty=None instead of the string 'none'.
    logitmodel_stroke = LogisticRegression(random_state=0, multi_class='multinomial', penalty='none', solver='newton-cg')

    # Train the logitmodel_stroke model
    logitmodel_stroke.fit(X, y)

    return logitmodel_stroke


### Task 5 - Model evaluation (15 marks)
The process for evaluating a classification model is different from that for a regression model. In regression the target takes a wide range of values, so we measure variance; classification has a much smaller problem space, so we measure how often the correct prediction is made. There are multiple metrics for measuring this; [this article](https://www.mage.ai/blog/definitive-guide-to-accuracy-precision-recall-for-product-developers) and the [Wikipedia page](https://en.wikipedia.org/wiki/Precision_and_recall) provide additional context.

1. As this is binary classification there are 2 classes. Class 1 indicates positive stroke risk and class 0 indicates negative stroke risk. 
2. When testing we use a separate dataset which the model was not trained on. This is essential to observe how the model performs on data it has not seen before.
3. In the function below *X_ts* represents the data used to generate test predictions and *y_obs* represents the actual values we are trying to predict. 
4. We can evaluate a classification model by having it make a set of predictions for a test set (X_ts) and comparing these with the actual values (y_obs).
5. Suppose *y_pred* is a predicted value when run on a sample from *X_ts*. We compare it to the corresponding observed value in *y_obs*. There are four potential outcomes from this comparison:

    1. *y_pred* = 1 (positive) and *y_obs* = 1 (positive): counted as *true positive*.
    2. *y_pred* = 1 (positive) and *y_obs* = 0 (negative): counted as *false positive*. 
    3. *y_pred* = 0 (negative) and *y_obs* = 0 (negative): counted as *true negative*. 
    4. *y_pred* = 0 (negative) and *y_obs* = 1 (positive): counted as *false negative*. 
    
6. Count all 4 cases over the entire sample input to the function *evaluate_logitmodel* and store them in 4 variables: *tp*, *fp*, *tn* and *fn*. For example, *tp* will give the total number of true positives and *fn* the total number of false negatives. 
7. The two metrics we will be using for evaluation are *accuracy* and *precision*. The formulas for these are below. 
$$acc = \frac{tp+tn}{tp+tn+fp+fn} \quad\text{(accuracy)}, \quad prec = \frac{tp}{tp + fp} \quad\text{(precision)}$$

8. Run the model training/evaluation process for 5 different test/train split ratios (see Task 3). Add a paragraph below outlining:
    1. The results of your different test/train splits
    2. How the different split sizes affected model evaluation
    3. Whether there was a difference in accuracy/precision and, if so, what could be causing it
    4. For a fixed train/test split, evaluate the metrics on the train data (*X_train*) and test data (*X_test*) separately and record the values of the metrics.  
9. This task is designed to test your understanding of model evaluation. **No built-in evaluation functions or metrics should be used**. 
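
The counting and the two formulas can be checked by hand on a tiny example. The labels below are hypothetical and chosen so that each of the four outcomes occurs at least once; no built-in metric functions are used.

```python
# Hypothetical observed and predicted labels for 8 samples
y_obs = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]

tp = fp = tn = fn = 0
for pred, obs in zip(y_pred, y_obs):
    if pred == 1 and obs == 1:
        tp += 1   # predicted positive, actually positive
    elif pred == 1 and obs == 0:
        fp += 1   # predicted positive, actually negative
    elif pred == 0 and obs == 0:
        tn += 1   # predicted negative, actually negative
    else:
        fn += 1   # predicted negative, actually positive

acc = (tp + tn) / (tp + tn + fp + fn)
prec = tp / (tp + fp)

print(tp, fp, tn, fn)   # 2 1 4 1
print(acc, prec)
```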

In [101]:
# The model object is the output of the function fit_logitmodel; it is used to obtain the predictions
def evaluate_logitmodel(model, X_ts,  y_obs):
    # Use the .predict() method of the model to generate a set of predictions for X_ts
    predictions = model.predict(X_ts)

    # Determine the tp, fp, tn and fn values for the prediction set
    tp = 0
    fp = 0
    tn = 0
    fn = 0
    for i in range(len(predictions)):
        if predictions[i] == 1 and y_obs[i] == 1:
            tp += 1
        elif predictions[i] == 1 and y_obs[i] == 0:
            fp += 1
        elif predictions[i] == 0 and y_obs[i] == 0:
            tn += 1
        elif predictions[i] == 0 and y_obs[i] == 1:
            fn += 1

    # Calculate the accuracy and precision values
    acc = (tp + tn) / (tp + tn + fp + fn)
    if tp + fp == 0:
        # No positive predictions were made, so precision is undefined (0/0);
        # report 1.0 in that case, since there were no false positives
        prec = 1.0
    else:
        prec = tp / (tp + fp)
    print("Accuracy: ", acc)
    print("Precision: ", prec)
    
    return acc, prec


X_train, X_test, y_train, y_test = create_arrays()
logic_model = fit_logitmodel(X_train, y_train)
evaluate_logitmodel(logic_model, X_test, y_test)
evaluate_logitmodel(logic_model, X_train, y_train)

id                     int64
gender                object
age                  float64
hypertension           int64
heart_disease          int64
ever_married          object
avg_glucose_level    float64
bmi                  float64
smoking_status        object
stroke                 int64
dtype: object
int64
object
float64
int64
int64
object
float64
float64
object
int64
Accuracy:  0.956268221574344
Precision:  1.0
Accuracy:  0.9467878001297858
Precision:  1.0


(0.9467878001297858, 1.0)

# Answers

A)
For a test size of 0.25 

Accuracy and precision of the test data.

Accuracy:  0.9521586931155193
Precision:  1.0

For a test size of 0.50 

Accuracy and precision of the test data.

Accuracy:  0.9497956800934034
Precision:  0.5

For a test size of 0.10 

Accuracy and precision of the test data.

Accuracy:  0.956268221574344
Precision:  1.0

B)
The accuracy increased as the test set became smaller and the training set larger. Furthermore, the precision stayed the same for test fractions of 0.25 and 0.10 but dropped to half when the test and training sets were of equal size.

C)
Based on the evaluations shown, the precision of the model decreases when the training set is reduced to 50 percent of the data: half of the model's positive predictions are now wrong, which suggests that the training set has become insufficient for good predictions.
For test fractions of 0.25 and 0.10 the model gives a precision of 1, which means that it produced no false positives; note that the evaluation code also reports 1.0 when the model makes no positive predictions at all. In addition, the accuracy of the model is high, above 90% in every run; however, it decreases as the training set shrinks, which suggests that a model trained on less data will give lower accuracy.


D)
For a test size of 0.25 and a training size of 0.75:

Accuracy and precision of the test data.

Accuracy:  0.9521586931155193
Precision:  1.0

Accuracy and precision of the train data.

Accuracy:  0.9462616822429907
Precision:  1.0