## 1. Prequisite - Preprocessing Data

Perform the following steps before trying the exercises:

a) Import pandas as "pd" and load the lab1 dataset into "df"

b) Print dataset information to refresh your memory

c) Run preprocess_data function on the dataframe to perform preprocessing steps 
discussed last week

d) Split your data into training and test with 70:30 distribution, stratified, random state 0.

================================================================================================================

#### a) Import pandas as "pd" and load the lab1 dataset into "df"

In [290]:
import pandas as pd
import numpy as np

df = pd.read_csv('lab1.csv')

#### b) Print dataset information to refresh your memory

In [291]:
df.head()

Unnamed: 0,TargetB,ID,TargetD,GiftCnt36,GiftCntAll,GiftCntCard36,GiftCntCardAll,GiftAvgLast,GiftAvg36,GiftAvgAll,...,PromCntCardAll,StatusCat96NK,StatusCatStarAll,DemCluster,DemAge,DemGender,DemHomeOwner,DemMedHomeValue,DemPctVeterans,DemMedIncome
0,0,14974,,2,4,1,3,17.0,13.5,9.25,...,13,A,0,0,,F,U,0,0,0
1,0,6294,,1,8,0,3,20.0,20.0,15.88,...,24,A,0,23,67.0,F,U,186800,85,0
2,1,46110,4.0,6,41,3,20,6.0,5.17,3.73,...,22,S,1,0,,M,U,87600,36,38750
3,1,185937,10.0,3,12,3,8,10.0,8.67,8.5,...,16,E,1,0,,M,U,139200,27,38942
4,0,29637,,1,1,1,1,20.0,20.0,20.0,...,6,F,0,35,53.0,M,U,168100,37,71509


In [292]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9686 entries, 0 to 9685
Data columns (total 28 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   TargetB           9686 non-null   int64  
 1   ID                9686 non-null   int64  
 2   TargetD           4843 non-null   float64
 3   GiftCnt36         9686 non-null   int64  
 4   GiftCntAll        9686 non-null   int64  
 5   GiftCntCard36     9686 non-null   int64  
 6   GiftCntCardAll    9686 non-null   int64  
 7   GiftAvgLast       9686 non-null   float64
 8   GiftAvg36         9686 non-null   float64
 9   GiftAvgAll        9686 non-null   float64
 10  GiftAvgCard36     7906 non-null   float64
 11  GiftTimeLast      9686 non-null   int64  
 12  GiftTimeFirst     9686 non-null   int64  
 13  PromCnt12         9686 non-null   int64  
 14  PromCnt36         9686 non-null   int64  
 15  PromCntAll        9686 non-null   int64  
 16  PromCntCard12     9686 non-null   int64  


#### c) Run preprocess_data function on the dataframe to perform preprocessing steps discussed last week 

In [293]:
# Dropping unnecessary variable TargetD and ID 
df.drop(['ID', 'TargetD'], axis=1, inplace=True)
    
# Change DemCluster from integer to nominal/str
df['DemCluster'] = df['DemCluster'].astype(str)

# Change DemHomeOwner into binary variable (Integer Encoding)
dem_home_owner_map = {'U':0, 'H': 1}
df['DemHomeOwner'] = df['DemHomeOwner'].map(dem_home_owner_map)

# Denote errorneous values in DemMidIncome
mask = df['DemMedIncome'] < 1   # filter out any DemMedIncome with value less than 1
df.loc[mask, 'DemMedIncome'] = np.nan  # denote zeroes in DemMedIncome as missing now

# Impute missing values DemAge, DemMedIncome and GiftAveCard36 with mean value
df['DemAge'] = df['DemAge'].fillna(value=df['DemAge'].mean(), inplace=True)
df['DemMedIncome'] = df['DemMedIncome'].fillna(value=df['DemMedIncome'].mean(), inplace=True)
df['GiftAvgCard36'].fillna(df['GiftAvgCard36'].mean(), inplace=True)  # different syntax

# Machine learning algorithms cannot work with categorical data directly because sklearn models only accept numerical matrices as input.
# Change data from categorical variables to binary variables. Categorical data must be converted to numbers.
# This converting process is commonly referred as one-hot encoding. 
# We do one hot encoding using .get_dummies().
df['DemGender'] = df['DemGender'].astype('category')  

# one hot encoding all categorical variables
# all numerical variables are automatically excluded

# one hot encoding
df = pd.get_dummies(df)

#### d) Split your data into training and test with 70:30 distribution, stratified, random state 0

In [294]:
from sklearn.model_selection import train_test_split

y = df['TargetB']
X = df.drop(['TargetB'], axis=1)

# Setting random state = 0
rs = 0

# Training set = 70%
# Test Set = 30%
# Stratify = Yes
X_mat = np.asmatrix(X)
X_train, X_test, y_train, y_test = train_test_split(X_mat, y, test_size=0.3, 
                                                    stratify=y, random_state=rs)

## 1. Standardisation and Logistic Regression

Perform following operations and answer the following questions:

a) What is the difference between logistic regression and linear regression?  

b) Describe how logistic regression perform its prediction. 

c) Write code to perform standardisation on your training and test dataset. 

d) What does standardisation do to your data? How does it benefit your regression model? 

e) Write code to fit a logistic regression model to your training data. How does it perform on the training and test data? Do you see any indication of overfitting?  

f) Write code to find the most important features in your model.

================================================================================================================


#### a) What is the difference between logistic regression and linear regression?  

1) Linear Regression is a regression algorithm while Logistic Regression is a classification algorithm for machine learning.

2) Linear Regression resolve the problem of predicting/estimating the output value for a given element X (say f(x)). The result of the prediction is a continuous function where the values may be positive or negative. For example, predicting children's height given their age, weight, and other factors. In general, it is about fitting a straight line in the data 

3) Logistic Regression resolve classification problems where given an element that we have to classify. The outcome of dependent variable is discrete (not continuous). For example, given a mail to classify whether it as spam or not. Generally, Logistic Regression have binary target variables, but there can be two more categories of target variables that can be predicted by it. It is about fitting a curve to the data. 

4) Linear regression gives an equation which is of the form Y = mX + C, means equation with degree 1. However, logistic regression gives an equation which is of the form Y = 1 / (1 + e^-x)

<img src="pic 1.PNG" width="800" height="200" />


#### b) Describe how logistic regression perform its prediction. 

Logistic Regression is a supervised learning classification algorithm used to predict the probability of a target variable. The nature of target or dependent variable is dichotomous, which means there would be only two possible classes. 

It do not require a linear relationship between the dependent and independent variables. The outcome is discrete and not continuous. The dependent variable is binary in nature having data coded as either 1 (stands for success/yes) or 0 (stands for failure/no).

The Logistic Regression algorithm learns the 'weight' associated with each feature. The model will try to minimize the cost function, which basically says how far off the current predictions to the ground truth.

Assumption in Logistic Regression:
a) In case of binary logistic regression, the target variables must be binary and the desired outcome is represented by the factor level 1.

b) There should not be any multi-collinearity in the model, which means the independent variables must be independent of each other.

c) We must include meaningful variables in our model.

#### c) Write code to perform standardisation on your training and test dataset

In [295]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train, y_train)
X_test = scaler.transform(X_test)

#### d) What does standardisation do to your data? How does it benefit your regression model?

Standardization is the process of putting different variables on the same scale. It means these variables originally do not give equal contribution to the analysis. Hence, it is required to transform the data to comparable scales. The idea is to rescale an original variable to have equal range and/or variance.

Regression models are sensitive to extreme or outlying values in the input space. To avoid this problem, we should scale our inputs first before building our logistic regression model. In sklearn, this can easily be done using StandardScaler.

Standardizing independent variables can also help us to determine which variables are the most important.

#### e)(i) Write code to fit a logistic regression model to your training data. 

In [296]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

model = LogisticRegression()
model.fit(X_train, y_train)

print("Train accuracy:", model.score(X_train, y_train),
      "\nTest accuracy:", model.score(X_test, y_test))

y_pred = model.predict(X_test)
print("\n", classification_report(y_test, y_pred))

Train accuracy: 0.5859882005899705 
Test accuracy: 0.5763936682725396

               precision    recall  f1-score   support

           0       0.58      0.58      0.58      1453
           1       0.58      0.57      0.57      1453

    accuracy                           0.58      2906
   macro avg       0.58      0.58      0.58      2906
weighted avg       0.58      0.58      0.58      2906



#### e)(ii) How does it perform on the training and test data? Do you see any indication of overfitting?  


Overfitting occurs if a model performs well on training data but does not well on test data. In this case, the performance of the model on training and testing datasets has no significance difference hence the model is not overfitting.

#### f) Write code to find the most important features in your model.

Positive coefficient means positive change in the input feature will have positive correlation to the prediction value. 

Negative coefficient does the reverse. 

Using largest absolute coefficients to give us the clues of which variables are important. I sorting the absolute coefficient in descending order and limit to 20 for visualization purposes.

In [297]:
# First, print out all feature name associated with each coefficient to see all the coefficient number

feature_names = X.columns
coef = model.coef_[0]

coef = coef[:]
for i in range(len(coef)):
    print(feature_names[i], ':', coef[i])

GiftCnt36 : 0.07439448463481192
GiftCntAll : -0.016924292563642366
GiftCntCard36 : 0.10663312417862549
GiftCntCardAll : -0.025465675047160373
GiftAvgLast : -0.02845281190648024
GiftAvg36 : -0.032996633990946676
GiftAvgAll : 0.02932758922770193
GiftAvgCard36 : -0.045924192614627844
GiftTimeLast : -0.19003922054478953
GiftTimeFirst : 0.3076543735487482
PromCnt12 : -0.1198626983812884
PromCnt36 : 0.09004725480280376
PromCntAll : 0.12455561361580184
PromCntCard12 : 0.013220043918111607
PromCntCard36 : 0.09998473570690312
PromCntCardAll : -0.3658332240396197
StatusCatStarAll : 0.06520836152519284
DemHomeOwner : 0.02801469683666773
DemMedHomeValue : 0.08500853358859292
DemPctVeterans : 0.005757326410809153
StatusCat96NK_A : -0.03118081218104886
StatusCat96NK_E : 0.06750649878169673
StatusCat96NK_F : 0.007897533415741297
StatusCat96NK_L : 0.016240227060780047
StatusCat96NK_N : 0.023675630046334484
StatusCat96NK_S : -0.009117654009009948
DemCluster_0 : 0.05864622808416183
DemCluster_1 : 0.0247

In [298]:
# Then, sort features in descending order based on absolute coefficient number
indices = np.argsort(np.absolute(coef))
indices = np.flip(indices, axis=0)

# 20 most important features
indices = indices[:20]

for i in indices:
    print(feature_names[i], ':', coef[i])

PromCntCardAll : -0.3658332240396197
GiftTimeFirst : 0.3076543735487482
GiftTimeLast : -0.19003922054478953
PromCntAll : 0.12455561361580184
PromCnt12 : -0.1198626983812884
GiftCntCard36 : 0.10663312417862549
PromCntCard36 : 0.09998473570690312
PromCnt36 : 0.09004725480280376
DemMedHomeValue : 0.08500853358859292
GiftCnt36 : 0.07439448463481192
StatusCat96NK_E : 0.06750649878169673
StatusCatStarAll : 0.06520836152519284
DemCluster_0 : 0.05864622808416183
DemCluster_44 : -0.05853839559582289
DemCluster_10 : -0.054192682098912204
DemCluster_30 : -0.05243976728990517
DemCluster_32 : -0.05022873601776453
DemCluster_40 : 0.0462735130596608
DemCluster_16 : -0.04617406365282512
GiftAvgCard36 : -0.045924192614627844


#### The above 20 features are the most important features in this model. However this only applies to regression on standardised features. Without standardisation, each feature can have different scales, where coefficients do not help in interpreting the impact a feature has on the model.