# Lab Exercise 2

Tan Bing Shien WQD180104

## 1. Prerequisite
Import dataset from **Lab Exercise 1**.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [2]:
# Import dataset
df = pd.read_csv('lab1.csv')

In [3]:
df.head()

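| | TargetB | ID | TargetD | GiftCnt36 | GiftCntAll | GiftCntCard36 | GiftCntCardAll | GiftAvgLast | GiftAvg36 | GiftAvgAll | ... | PromCntCardAll | StatusCat96NK | StatusCatStarAll | DemCluster | DemAge | DemGender | DemHomeOwner | DemMedHomeValue | DemPctVeterans | DemMedIncome |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| 0 | 0 | 14974 | NaN | 2 | 4 | 1 | 3 | 17.0 | 13.5 | 9.25 | ... | 13 | A | 0 | 0 | NaN | F | U | 0 | 0 | 0 |
| 1 | 0 | 6294 | NaN | 1 | 8 | 0 | 3 | 20.0 | 20.0 | 15.88 | ... | 24 | A | 0 | 23 | 67.0 | F | U | 186800 | 85 | 0 |
| 2 | 1 | 46110 | 4.0 | 6 | 41 | 3 | 20 | 6.0 | 5.17 | 3.73 | ... | 22 | S | 1 | 0 | NaN | M | U | 87600 | 36 | 38750 |
| 3 | 1 | 185937 | 10.0 | 3 | 12 | 3 | 8 | 10.0 | 8.67 | 8.5 | ... | 16 | E | 1 | 0 | NaN | M | U | 139200 | 27 | 38942 |
| 4 | 0 | 29637 | NaN | 1 | 1 | 1 | 1 | 20.0 | 20.0 | 20.0 | ... | 6 | F | 0 | 35 | 53.0 | M | U | 168100 | 37 | 71509 |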


In [4]:
# Dataset info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9686 entries, 0 to 9685
Data columns (total 28 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   TargetB           9686 non-null   int64  
 1   ID                9686 non-null   int64  
 2   TargetD           4843 non-null   float64
 3   GiftCnt36         9686 non-null   int64  
 4   GiftCntAll        9686 non-null   int64  
 5   GiftCntCard36     9686 non-null   int64  
 6   GiftCntCardAll    9686 non-null   int64  
 7   GiftAvgLast       9686 non-null   float64
 8   GiftAvg36         9686 non-null   float64
 9   GiftAvgAll        9686 non-null   float64
 10  GiftAvgCard36     7906 non-null   float64
 11  GiftTimeLast      9686 non-null   int64  
 12  GiftTimeFirst     9686 non-null   int64  
 13  PromCnt12         9686 non-null   int64  
 14  PromCnt36         9686 non-null   int64  
 15  PromCntAll        9686 non-null   int64  
 16  PromCntCard12     9686 non-null   int64  


## 2. Data Cleaning

In [5]:
# Define a pre-processing function (same as Lab 1)
def preprocess_data(df):

    # Convert `DemHomeOwner` to a binary integer variable (H = 1, U = 0)
    df['DemHomeOwner'] = df['DemHomeOwner'].replace({'H': 1, 'U': 0})

    # Store `DemCluster` as an integer code
    # (the earlier cast to 'category' was immediately overwritten, so it is dropped)
    df['DemCluster'] = df['DemCluster'].astype(int)

    # Replace invalid values (0) in `DemAge` with the median
    df['DemAge'] = df['DemAge'].replace(0, np.nanmedian(df['DemAge']))

    # Replace invalid values (0) in `DemMedIncome`, `GiftAvgCard36` with the mean
    cols = ['DemMedIncome', 'GiftAvgCard36']
    for c in cols:
        df[c] = df[c].replace(0, df[c].mean())

    # Impute the median for missing values in `DemAge`
    df['DemAge'] = df['DemAge'].fillna(np.nanmedian(df['DemAge']))

    # Impute the mean for missing values in `DemMedIncome`, `GiftAvgCard36`
    for c in cols:
        df[c] = df[c].fillna(df[c].mean())

    # Drop the `ID` and `TargetD` columns (note: `df` is modified in place)
    df.drop(columns=['ID', 'TargetD'], inplace=True)

    return df
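
The function above computes the replacement statistic while the invalid zeros are still in the column, so the zeros pull the mean or median down. A possible alternative, shown here only as a sketch on a small made-up `DemAge` column (the values are illustrative and not taken from `lab1.csv`), is to mark zeros as missing first and then impute once:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: one invalid 0 and one missing value
demo = pd.DataFrame({"DemAge": [0.0, 67.0, np.nan, 53.0]})

# Treat 0 as missing so it does not distort the statistic
age = demo["DemAge"].replace(0, np.nan)

# Series.median() skips NaN, so only the valid ages (67 and 53) are used
median_age = age.median()

# Fill both the former zeros and the original NaN in one step
demo["DemAge"] = age.fillna(median_age)
print(demo)
```

Whether this is preferable depends on what a zero actually means in the source data, which should be checked against the Lab 1 data dictionary.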

In [6]:
# Clean dataset
df_clean = preprocess_data(df)

In [7]:
df_clean.head()

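| | TargetB | GiftCnt36 | GiftCntAll | GiftCntCard36 | GiftCntCardAll | GiftAvgLast | GiftAvg36 | GiftAvgAll | GiftAvgCard36 | GiftTimeLast | ... | PromCntCardAll | StatusCat96NK | StatusCatStarAll | DemCluster | DemAge | DemGender | DemHomeOwner | DemMedHomeValue | DemPctVeterans | DemMedIncome |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| 0 | 0 | 2 | 4 | 1 | 3 | 17.0 | 13.5 | 9.25 | 17.0 | 21 | ... | 13 | A | 0 | 0 | 60.0 | F | 0 | 0 | 0 | 40491.444249 |
| 1 | 0 | 1 | 8 | 0 | 3 | 20.0 | 20.0 | 15.88 | 14.224431 | 26 | ... | 24 | A | 0 | 23 | 67.0 | F | 0 | 186800 | 85 | 40491.444249 |
| 2 | 1 | 6 | 41 | 3 | 20 | 6.0 | 5.17 | 3.73 | 5.0 | 18 | ... | 22 | S | 1 | 0 | 60.0 | M | 0 | 87600 | 36 | 38750.0 |
| 3 | 1 | 3 | 12 | 3 | 8 | 10.0 | 8.67 | 8.5 | 8.67 | 9 | ... | 16 | E | 1 | 0 | 60.0 | M | 0 | 139200 | 27 | 38942.0 |
| 4 | 0 | 1 | 1 | 1 | 1 | 20.0 | 20.0 | 20.0 | 20.0 | 21 | ... | 6 | F | 0 | 35 | 53.0 | M | 0 | 168100 | 37 | 71509.0 |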


In [8]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9686 entries, 0 to 9685
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   TargetB           9686 non-null   int64  
 1   GiftCnt36         9686 non-null   int64  
 2   GiftCntAll        9686 non-null   int64  
 3   GiftCntCard36     9686 non-null   int64  
 4   GiftCntCardAll    9686 non-null   int64  
 5   GiftAvgLast       9686 non-null   float64
 6   GiftAvg36         9686 non-null   float64
 7   GiftAvgAll        9686 non-null   float64
 8   GiftAvgCard36     9686 non-null   float64
 9   GiftTimeLast      9686 non-null   int64  
 10  GiftTimeFirst     9686 non-null   int64  
 11  PromCnt12         9686 non-null   int64  
 12  PromCnt36         9686 non-null   int64  
 13  PromCntAll        9686 non-null   int64  
 14  PromCntCard12     9686 non-null   int64  
 15  PromCntCard36     9686 non-null   int64  
 16  PromCntCardAll    9686 non-null   int64  


## 3. Data Modeling

### Data Partitioning

In [9]:
# Create X values
X = df_clean.drop('TargetB', axis=1)

In [10]:
# Create Y values
y = df_clean['TargetB']

In [11]:
# Instantiate LabelEncoder object
le = LabelEncoder()

# Apply le on categorical feature columns
X[['StatusCat96NK', 'DemGender']] = X[['StatusCat96NK', 'DemGender']].apply(lambda col: le.fit_transform(col))
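
`LabelEncoder` maps each category to an integer, and a linear model will read those integers as an ordered scale. For nominal features, one-hot encoding is a common alternative. The sketch below uses `pd.get_dummies` on a small made-up frame; the category values shown (`F`, `M`, `U`) are assumptions for illustration and may not match the actual levels of `DemGender`.

```python
import pandas as pd

# Hypothetical toy data; not taken from lab1.csv
demo = pd.DataFrame({
    "DemGender": ["F", "M", "U", "F"],
    "GiftCnt36": [2, 1, 6, 3],
})

# One indicator column per category, stored as 0/1 integers
encoded = pd.get_dummies(demo, columns=["DemGender"], dtype=int)
print(encoded)
```

Each category gets its own weight in the model this way, at the cost of more columns.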

In [12]:
# Split into training set and testing set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.3, stratify = y, random_state = 0)

**Source:** [What is stratify in data partitioning?](https://stackoverflow.com/questions/34842405/parameter-stratify-from-method-train-test-split-scikit-learn)
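
To illustrate what `stratify` does, here is a minimal sketch on synthetic labels (an assumed 80/20 class split, not the lab dataset). It checks that the share of class 1 is preserved in both partitions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 80 samples of class 0, 20 of class 1
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

# Same split settings as the lab cell, applied to the toy data
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0)

# For 0/1 labels, the mean equals the share of class 1
train_share = y_tr.mean()
test_share = y_te.mean()
print("Share of class 1 in train:", train_share)
print("Share of class 1 in test: ", test_share)
```

Without `stratify`, the two shares can drift apart by chance, which matters most when one class is rare.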

### Standardisation and Logistic Regression

1. **What is the difference between logistic regression and linear regression?**

| Linear regression             | Logistic regression           |
| :---------------------------- | :---------------------------- |
| Predicts a continuous dependent variable | Predicts a categorical dependent variable |
| Solves regression problems | Solves classification problems |
| Parameters are estimated by least squares | Parameters are estimated by maximum likelihood |
| Requires a linear relationship between the dependent and independent variables | Does not require a linear relationship between the dependent and independent variables (the log-odds are modelled as linear instead) |

Source: [Linear Regression vs Logistic Regression](https://www.javatpoint.com/linear-regression-vs-logistic-regression-in-machine-learning#:~:text=Linear%20regression%20is%20used%20to,given%20set%20of%20independent%20variables.&text=In%20logistic%20Regression%2C%20we%20predict%20the%20values%20of%20categorical%20variables.)

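2. **Describe how logistic regression performs its prediction.**

 It multiplies each feature by a learned weight, sums the results together with a bias term, and passes that sum through the sigmoid function to obtain a probability between 0 and 1. The observation is assigned to the positive class when the probability reaches a threshold (0.5 by default).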
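
The prediction step can be sketched in a few lines of NumPy. The weights, bias, and input below are made-up numbers chosen for illustration, not values from the fitted model:

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and a single observation with two features
weights = np.array([0.8, -0.5])
bias = 0.1
x = np.array([1.0, 2.0])

# Weighted sum of the features plus the bias (the log-odds)
z = np.dot(weights, x) + bias

# Convert the log-odds to a probability, then apply the 0.5 threshold
prob = sigmoid(z)
label = int(prob >= 0.5)

print("log-odds:", z)
print("probability of class 1:", prob)
print("predicted label:", label)
```

Here the weighted sum is slightly negative, so the probability falls just below 0.5 and the observation is assigned to class 0.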

3. **Write code to perform standardisation on your training and test dataset.**

In [13]:
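scaler = StandardScaler()
# Learn the mean and standard deviation from the training set only
X_train = scaler.fit_transform(X_train)
# Reuse the training statistics on the test set; do not refit
X_test = scaler.transform(X_test)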

**Source:** [fit_transform vs transform](https://datascience.stackexchange.com/questions/12321/difference-between-fit-and-fit-transform-in-scikit-learn-models#:~:text=%22transform%22%20uses%20a%20previously%20computed,of%20code%20instead%20of%202.)

4. **What does standardisation do to your data? How does it benefit your regression model?** 

Standardisation rescales each feature to a mean of 0 and a standard deviation of 1. This matters when features are measured in different units: variables on larger scales would otherwise dominate the analysis and could introduce a bias.

Standardisation tends to work best when a feature is roughly Gaussian (bell-shaped), although this is not strictly required. It is most useful when the features have very different scales and the algorithm is sensitive to scale, which is often the case for linear regression, logistic regression, and linear discriminant analysis.

For regression models, the main benefit is that gradient-based solvers update all weights at similar rates and so converge more reliably. The L2 penalty that scikit-learn's `LogisticRegression` applies by default also treats all weights alike only when the features share a scale. In addition, standardised input features allow us to compare their regression weights and identify the most important variables.
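
As a small check of what `StandardScaler` computes, the sketch below compares it with the manual z-score `(x - mean) / std` on made-up numbers (two features on very different scales). The arrays are illustrative only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test data with features on different scales
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[4.0, 400.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean/std from train
test_scaled = scaler.transform(test)        # reuses the training mean/std

# Manual z-score; StandardScaler uses the population std (ddof=0)
manual = (train - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(train_scaled, manual))
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

After scaling, both training columns have mean 0 and standard deviation 1, and the test row is expressed in training-set standard deviations rather than being rescaled on its own.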

5. **Write code to fit a logistic regression model to your training data. How does it perform on the training and test data? Do you see any indication of overfitting?**

In [14]:
# Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)  # fit() needs both the features and the target

In [None]:
print("Training accuracy:", logreg.score(X_train, y_train))
print("Test accuracy:", logreg.score(X_test, y_test))

In [None]:
# Classification report on test data
y_pred = logreg.predict(X_test)
print(classification_report(y_test, y_pred))

*The accuracy score is not particularly high, but the model does not appear to overfit, since overfitting would show up as training accuracy well above test accuracy.*
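
One simple way to look for overfitting is to compute the gap between training and test accuracy. The sketch below does this on synthetic data from `make_classification`, so the numbers it prints say nothing about the lab dataset; the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (assumption: stands in for real data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

train_acc = model.score(X_tr, y_tr)
test_acc = model.score(X_te, y_te)
gap = train_acc - test_acc  # a large positive gap suggests overfitting

print(f"train={train_acc:.3f}  test={test_acc:.3f}  gap={gap:.3f}")
```

There is no fixed cut-off for the gap; it is usually judged relative to the variation seen across different random splits or cross-validation folds.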

6. **Write code to find the most important features in your model.**

In [None]:
logreg.coef_

In [None]:
# Get the coefficients from the model and the feature names from the dataset columns
feature_names = X.columns
coef = logreg.coef_[0]

# Sort coefficients by absolute value, in descending order
indices = np.argsort(np.absolute(coef))
indices = np.flip(indices, axis=0)

for i in indices:
    print(feature_names[i], ':', coef[i])

*A logistic regression* ***coefficient*** *is not a correlation. It is the change in the log-odds of the positive class for a one-unit increase in the feature, holding the other features fixed; because the features were standardised, one unit is one standard deviation. A* ***positive coefficient*** *means the predicted probability rises as the feature increases, whereas a* ***negative coefficient*** *means it falls. The larger the absolute value, the stronger the feature's influence on the prediction.*
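
Coefficients are often easier to read as odds ratios, obtained by exponentiating them. The sketch below uses made-up coefficients and placeholder feature names (`feature_a`, etc.), not the output of `logreg.coef_`:

```python
import numpy as np

# Hypothetical coefficients for three standardised features
names = ["feature_a", "feature_b", "feature_c"]
coef = np.array([0.7, -0.3, 0.0])

# exp(coef) is the multiplicative change in the odds of the positive class
# for a one-standard-deviation increase in the feature
odds_ratios = np.exp(coef)

for name, c, ratio in zip(names, coef, odds_ratios):
    print(f"{name}: coef={c:+.2f}  odds ratio={ratio:.3f}")
```

An odds ratio above 1 raises the odds, below 1 lowers them, and exactly 1 means the feature has no effect on the prediction.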