# Contents
[Logistic Regression](#Logistic-Regression)<br>
[Oversampling vs No Oversampling](#Oversampling-vs-No-Oversampling)<br>
[More features vs less features](#More-features-vs-less-features)<br>
[Increase max_iter](#Increase-max_iter)<br>
[Model optimization by hyperparameter selection](#Model-optimization-by-hyperparameter-selection)<br>
[Competition outcome](#Competition-outcome)<br>
[Final Conclusion](#Final-Conclusion)<br>

# Logistic Regression
[Back to top](#Contents)<br>

Logistic regression is a linear machine learning classification algorithm that predicts the probability of each class of a categorical dependent variable.
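
To make "predict the probability" concrete, here is a minimal sketch on synthetic data. The data from `make_classification` and the relabelling to classes 1, 2 and 3 are assumptions for illustration only; this is not the competition dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class data standing in for the three damage grades (illustrative only)
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
y = y + 1  # relabel classes 0, 1, 2 as 1, 2, 3

model = LogisticRegression(max_iter=1000).fit(X, y)

proba = model.predict_proba(X[:5])  # one row per sample, one column per class
labels = model.predict(X[:5])       # class with the highest predicted probability

print(proba.round(3))
print(labels)
```

Each row of `proba` sums to 1, and `predict` returns the class whose column holds the largest probability.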

#### Note:
I will use the terms test data and validation data interchangeably. Both refer to the same data, which is used to evaluate the models' performance internally. The ACTUAL test data from the competition will be referred to as competition data.

In [1]:
import pickle
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression 
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score
import time 
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform
from sklearn.model_selection import cross_val_score

In [10]:
#implement logistic regression, display results of the model
def implement_logreg(X_train,y_train,X_test,y_test,Max_iter=None):
    
    start = time.time()
    if Max_iter is None:
        logreg = LogisticRegression()
    else:
        logreg = LogisticRegression(max_iter=Max_iter)        
    logreg.fit(X_train, y_train)
    
    #print the results: time, accuracy, classification report, confusion matrix, f1-micro
    print("Time taken to run the model: ",time.time()-start)
    print("Model accuracy on train data: {:.5f}%".format(logreg.score(X_train, y_train)*100))
    print("Model accuracy on test data: {:.5f}%".format(logreg.score(X_test, y_test)*100))
    print_report(logreg,X_train,y_train,X_test,y_test)
    return logreg

#implement random search, display results of the model with the best hyperparameter found by random search
def run_random_search(X_train,y_train,X_test,y_test,C,Max_iter):
    
    start = time.time()
    hyperparameters = dict(C=C)
    logistic = LogisticRegression(max_iter=Max_iter)
    
    #obtain logistic regression model with the best hyperparameter C found by random search
    clf = RandomizedSearchCV(logistic, hyperparameters, random_state=1, n_iter=300,\
                             cv=10, verbose=0, n_jobs=-1, scoring = 'f1_micro')    
    random_search = clf.fit(X_train,y_train)
    
    #print the results: time, best C value, accuracy, classification report, confusion matrix, f1-micro
    print("Time taken to run the model: ",time.time()-start)
    print('Best C:', random_search.best_estimator_.get_params()['C'])
    print("Model accuracy on train data: {:.2f}%".format(random_search.score(X_train, y_train)*100))
    print("Model accuracy on test data: {:.2f}%".format(random_search.score(X_test, y_test)*100))
    print_report(random_search,X_train,y_train,X_test,y_test)
    return random_search

#print the results of the model: classification report, confusion matrix, f1-micro
def print_report(model,X_train,y_train,X_test,y_test):
    y_train_pred = model.predict(X_train)
    print("\n\n=============   classification_report on train data   =============\n")
    print(classification_report(y_train, y_train_pred))
    print("Confusion matrix\n")
    print(metrics.confusion_matrix(y_train, y_train_pred))
    print()
    print("F1-Micro: ",f1_score(y_train, y_train_pred, average='micro'))
    
    y_test_pred = model.predict(X_test)
    print("\n\n=============   classification_report on test data   =============\n")
    print(classification_report(y_test, y_test_pred))
    print("Confusion matrix\n")
    print(metrics.confusion_matrix(y_test, y_test_pred))
    print()
    print("F1-Micro: ",f1_score(y_test, y_test_pred, average='micro'))

#store predicted result into csv file
def save_result_to_csv(test_pred, file_name):
    
    #get a list of the predicted damage grade
    pred = []
    for i in range(test_pred.shape[0]):
        if test_pred[i][0] >= test_pred[i][1] and test_pred[i][0] >= test_pred[i][2]:
            pred.append(1)
        elif test_pred[i][1] >= test_pred[i][0] and test_pred[i][1] >= test_pred[i][2]:
            pred.append(2)
        else:
            pred.append(3)
    
    #create a csv file with the building ids and the predicted damage grades
    new_col = pd.DataFrame(pred,columns=['damage_grade'])
    df = pd.read_csv("data/submission_format.csv")
    df["damage_grade"] = new_col
    df.to_csv("{}.csv".format(file_name), index=False) 

#save the final model as a pickle file
def save_model_as_pickle(model, filename):
    #the context manager closes the file even if pickling fails
    with open(filename, "wb") as outfile:
        pickle.dump(model, outfile)

#load the model stored in a pickle file
def load_pickle_model(file_name):
    with open(file_name, "rb") as infile:
        return pickle.load(infile)
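
The loop in `save_result_to_csv` picks, for each row, the damage grade with the highest predicted probability. A possible vectorised equivalent is sketched below. The helper name `proba_to_damage_grade` is hypothetical, and the sketch assumes, as the loop does, that the columns of `test_pred` are ordered as grades 1, 2 and 3.

```python
import numpy as np

def proba_to_damage_grade(test_pred):
    # np.argmax returns the index of the first maximum in each row,
    # so ties resolve to the lower grade, exactly as in the loop
    return np.argmax(test_pred, axis=1) + 1

example = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.6, 0.3],
                    [0.2, 0.3, 0.5],
                    [0.4, 0.4, 0.2]])  # last row is a tie between grades 1 and 2
print(proba_to_damage_grade(example))  # [1 2 3 1]
```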
    

# Oversampling vs No Oversampling
[Back to top](#Contents)<br>

#### No oversampling, all less important features removed

In [3]:
#get preprocessed train, test and competition data, all less important features removed
%run ./Preprocess.ipynb

Using TensorFlow backend.


In [4]:
# logistic reg without oversampling, all less important features removed
logreg = implement_logreg(train_x, train_y, test_x, test_y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Time taken to run the model:  6.206761598587036
Model accuracy on train data: 66.67978%
Model accuracy on test data: 66.66795%



              precision    recall  f1-score   support

           1       0.60      0.32      0.42     20072
           2       0.67      0.81      0.74    118670
           3       0.66      0.52      0.58     69738

    accuracy                           0.67    208480
   macro avg       0.64      0.55      0.58    208480
weighted avg       0.66      0.67      0.65    208480

Confusion matrix

[[ 6445 13296   331]
 [ 4178 96473 18019]
 [  192 33450 36096]]

F1-Micro:  0.6667977743668457



              precision    recall  f1-score   support

           1       0.61      0.32      0.42      5052
           2       0.67      0.82      0.74     29589
           3       0.67      0.51      0.58     17480

    accuracy                           0.67     52121
   macro avg       0.65      0.55      0.58     52121
weighted avg       0.66      0.67      0.65    

#### Oversampling, all less important features removed

In [5]:
# logistic reg with oversampling, all less important features removed
logreg_over = implement_logreg(train_x_over, train_y_over, test_x, test_y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Time taken to run the model:  11.505195140838623
Model accuracy on train data: 65.25659%
Model accuracy on test data: 58.12053%



              precision    recall  f1-score   support

           1       0.73      0.77      0.75    118670
           2       0.53      0.46      0.49    118670
           3       0.67      0.73      0.70    118670

    accuracy                           0.65    356010
   macro avg       0.65      0.65      0.65    356010
weighted avg       0.65      0.65      0.65    356010

Confusion matrix

[[91251 23213  4206]
 [26358 54926 37386]
 [ 6760 25767 86143]]

F1-Micro:  0.6525659391590124



              precision    recall  f1-score   support

           1       0.34      0.77      0.47      5052
           2       0.74      0.47      0.57     29589
           3       0.57      0.72      0.64     17480

    accuracy                           0.58     52121
   macro avg       0.55      0.65      0.56     52121
weighted avg       0.65      0.58      0.58   

### Outcome

##### Train data comparison

With oversampling:

The F1-score of the minority classes (damage grades 1 and 3) improved by 33 and 12 percentage points respectively (0.42 to 0.75 and 0.58 to 0.70).

The F1-score of the majority class (damage grade 2) decreased by 25 percentage points (0.74 to 0.49).

The F1-micro score decreased by 1.4 percentage points (0.6668 to 0.6526).

##### Validation data comparison

With oversampling:

The F1-score of the minority classes (damage grades 1 and 3) improved by 5 and 6 percentage points respectively (0.42 to 0.47 and 0.58 to 0.64).

The F1-score of the majority class (damage grade 2) decreased by 17 percentage points (0.74 to 0.57).

The F1-micro score decreased by 8.5 percentage points (0.6667 to 0.5812).
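
For single-label multiclass problems such as this one, the micro-averaged F1 score equals accuracy, which is why the "F1-Micro" and accuracy lines in the reports agree. As a quick check, the sketch below recomputes it from the train confusion matrix printed for the model without oversampling.

```python
import numpy as np

# Train confusion matrix of the model without oversampling
# (rows: true damage grade, columns: predicted damage grade)
cm = np.array([[ 6445, 13296,   331],
               [ 4178, 96473, 18019],
               [  192, 33450, 36096]])

# Correct predictions sit on the diagonal
micro_f1 = np.trace(cm) / cm.sum()
print(f"{micro_f1:.5f}")  # 0.66680
```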

##### Conclusion:

Oversampling should only be used if we want more accurate predictions on the minority classes and are willing to sacrifice accuracy on the majority class.

Oversampling reduced the model's overall performance, so we will not use it from now on.
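
For reference, random oversampling can be sketched as follows. This is a minimal NumPy version under stated assumptions: the helper `random_oversample` is hypothetical and is not the code used in `Preprocess.ipynb`, which is not shown here. It resamples every class up to the size of the largest class, which matches the equal class supports in the oversampled train report. Only the train data should be resampled; the validation data keeps its original class distribution.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Resample every class with replacement up to the size of the largest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx_parts = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        # keep all original rows, then draw the shortfall with replacement
        extra = rng.choice(cls_idx, size=target - count, replace=True)
        idx_parts.append(np.concatenate([cls_idx, extra]))
    idx = np.concatenate(idx_parts)
    return X[idx], y[idx]

X = np.arange(20).reshape(10, 2)
y = np.array([1, 2, 2, 2, 2, 2, 3, 3, 3, 2])
X_over, y_over = random_oversample(X, y)
print(np.unique(y_over, return_counts=True)[1])  # [6 6 6]
```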

# More features vs less features
[Back to top](#Contents)<br>

### Possible underfitting issue

Model without oversampling:

train f1-micro = 0.6668

validation f1-micro = 0.6667

Both the train and validation results were poor and very similar, which suggests that our model might be underfitting.


Since underfitting can be reduced by adding more features, we will add the features previously identified as less important back into our model.

#### No oversampling, keep all features except for geo_index_level_2 and 3

In [6]:
#get preprocessed train, test and competition data, keep all features except for geo_index_level_2 and 3
%run ./Preprocess_keep_features_except_geo_2_3.ipynb

In [8]:
# logistic reg without oversampling, keep all features except for geo_index_level_2 and 3
logreg_except_geo_2_3 = implement_logreg(train_x_keep, train_y_keep, test_x_keep, test_y_keep)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Time taken to run the model:  7.200513601303101
Model accuracy on train data: 66.93016%
Model accuracy on test data: 67.58312%



              precision    recall  f1-score   support

           1       0.60      0.36      0.45     20204
           2       0.68      0.81      0.74    118318
           3       0.67      0.52      0.58     69958

    accuracy                           0.67    208480
   macro avg       0.65      0.56      0.59    208480
weighted avg       0.67      0.67      0.66    208480

Confusion matrix

[[ 7265 12595   344]
 [ 4537 95785 17996]
 [  249 33223 36486]]

F1-Micro:  0.6693016116653876



              precision    recall  f1-score   support

           1       0.59      0.34      0.43      4920
           2       0.68      0.82      0.75     29941
           3       0.67      0.52      0.59     17260

    accuracy                           0.68     52121
   macro avg       0.65      0.56      0.59     52121
weighted avg       0.67      0.68      0.66    

### Outcome

##### Adding categorical features previously identified as less important improved model performance

f1-micro increased from 0.667 to 0.669 on train data

f1-micro increased from 0.667 to 0.676 on validation data

Even though more features were added and the f1-micro scores on both train and validation data improved slightly, the underfitting issue was still not solved. The f1-micro score on the validation data even exceeded that on the train data. Still, keeping more features improved the model's performance by slightly increasing its learning capacity. We will therefore base further optimization on this model.

# Increase max_iter 
[Back to top](#Contents)<br>

Previously when we ran the models, we noticed from the warning messages that the solver failed to converge when the default max_iter value of 100 was used.

To ensure the solver converges, we increased max_iter from 100 to 500.

In [9]:
# logistic reg without oversampling, keep all features except for geo_index_2 and 3 ids, max iter increase from 100 to 500
logreg_max_iter = implement_logreg(train_x_keep, train_y_keep, test_x_keep, test_y_keep, Max_iter = 500)

Time taken to run the model:  26.934953451156616
Model accuracy on train data: 66.95414%
Model accuracy on test data: 67.57929%



              precision    recall  f1-score   support

           1       0.60      0.36      0.45     20204
           2       0.68      0.81      0.74    118318
           3       0.67      0.52      0.58     69958

    accuracy                           0.67    208480
   macro avg       0.65      0.56      0.59    208480
weighted avg       0.67      0.67      0.66    208480

Confusion matrix

[[ 7261 12603   340]
 [ 4505 95882 17931]
 [  247 33268 36443]]

F1-Micro:  0.6695414428242518



              precision    recall  f1-score   support

           1       0.59      0.33      0.43      4920
           2       0.68      0.82      0.75     29941
           3       0.67      0.52      0.59     17260

    accuracy                           0.68     52121
   macro avg       0.65      0.56      0.59     52121
weighted avg       0.67      0.68      0.66   

### Outcome 

##### When the number of iterations increased

The F1-micro on train data increased from 0.6693 to 0.6695

The F1-micro on validation data decreased from 0.67583 to 0.67579

##### Increased running time

The running time increased from 7s to 27s

##### Conclusion

The solver converged with 500 iterations and obtained parameters that decreased the error on the train data.

However, this may have slightly increased overfitting: the model fitted the train data even better but generalized marginally worse to the validation data, as the decrease in the validation f1-micro score suggests. The decrease is very small (two validation predictions out of 52121), so this is only a weak indication.

Because of this, and the significant increase in execution time, we will keep max_iter at 100 for hyperparameter selection.
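
Instead of inferring convergence from the absence of a warning, the fitted model's `n_iter_` attribute can be compared with `max_iter`. The sketch below also standardises the features first, which is the other remedy the warning message suggests. It runs on synthetic data, and the pipeline is an assumption for illustration, not something the notebook did.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data (illustrative only)
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

max_iter = 500
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=max_iter))
pipe.fit(X, y)

# n_iter_ holds the number of iterations the solver actually ran
n_iter = pipe[-1].n_iter_
print("iterations used:", n_iter)
print("converged before the limit:", n_iter.max() < max_iter)
```

If `n_iter_` equals `max_iter`, the solver stopped at the limit rather than at its convergence tolerance.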

# Model optimization by hyperparameter selection
[Back to top](#Contents)<br>

We will tune the hyperparameter C using random search.

#### Hyperparameter C 
C is the inverse of the regularization strength (lambda). The lower the value of C, the stronger the regularization. Tuning this parameter is generally used to address overfitting.
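
The effect of C can be seen in the size of the fitted coefficients. The sketch below is a small illustration on synthetic data (an assumption; it does not use the competition data): a smaller C penalises large weights more strongly and shrinks the coefficient norm.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class data (illustrative only)
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

norms = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, max_iter=2000).fit(X, y)
    norms[C] = np.linalg.norm(model.coef_)  # overall size of the weight matrix
    print(f"C={C:<6} coefficient norm={norms[C]:.3f}")
```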

#### Random search
The random search function tries random C values, evaluates each model's performance using k-fold cross-validation and returns the C value that maximises the f1-micro score.

#### Cross Validation
Cross-validation generally gives us a better indication of how well the model will perform on unseen data, while tuning against a single train-test split may overfit to that split and fail to generalize to new data.
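
The two ideas above can be sketched together on synthetic data. Note the swapped-in choice: the notebook samples C from a uniform distribution, while this sketch uses a log-uniform distribution, a common alternative when the useful range of a regularization parameter spans several orders of magnitude. The range `1e-3` to `1e2`, the 10 sampled candidates and the 5 folds are assumptions chosen to keep the example fast.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic 3-class data (illustrative only)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": loguniform(1e-3, 1e2)},  # assumed range, sampled on a log scale
    n_iter=10, cv=5, scoring="f1_micro", random_state=1,
)
search.fit(X, y)

print("Best C:", search.best_params_["C"])
print("Mean cross-validated f1-micro:", search.best_score_)
```

`best_score_` is the mean score over the validation folds for the best candidate, so it is an estimate of performance on unseen data rather than a train score.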

In [11]:
# random search
C = uniform(loc=0, scale=4)
random_search = run_random_search(train_x_keep, train_y_keep, test_x_keep, test_y_keep, C, Max_iter=100)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Time taken to run the model:  6699.820973157883
Best C: 3.2295651548380953
Model accuracy on train data: 66.93%
Model accuracy on test data: 67.58%



              precision    recall  f1-score   support

           1       0.60      0.36      0.45     20204
           2       0.68      0.81      0.74    118318
           3       0.67      0.52      0.59     69958

    accuracy                           0.67    208480
   macro avg       0.65      0.56      0.59    208480
weighted avg       0.67      0.67      0.66    208480

Confusion matrix

[[ 7250 12606   348]
 [ 4516 95751 18051]
 [  247 33166 36545]]

F1-Micro:  0.6693495778971604



              precision    recall  f1-score   support

           1       0.59      0.33      0.42      4920
           2       0.68      0.82      0.75     29941
           3       0.67      0.53      0.59     17260

    accuracy                           0.68     52121
   macro avg       0.65      0.56      0.59     52121
weighted avg       0.67   

### Outcome 

##### Negligible change in performance 

The F1-micro on train data increased from 0.66930 to 0.66935

The F1-micro on validation data decreased from 0.67583 to 0.67581

##### Reason for the negligible change in performance

Tuning the regularization parameter (C) is used to address overfitting.

Our model might suffer from underfitting instead of overfitting. The default C value used before random search (C=1.0 in scikit-learn, with an L2 penalty) already applies only light regularization on a training set of this size, so there was little to gain from tuning it.

Even with 300 random search iterations, we still could not find any C value that performed better on the validation data than the default C value.

# Competition outcome
[Back to top](#Contents)<br>

#### Currently, the top 3 models with the highest f1-micro score on the validation data are:

1. logreg_except_geo_2_3 (no oversampling, increased features)

F1-Micro:  0.67583

2. logreg_max_iter (no oversampling, increased features, increased max_iter)

F1-Micro:  0.67579

3. random_search (no oversampling, increased features, hyperparameter tuning)

F1-Micro:  0.67581

#### We will predict the competition test data with all 3 models and submit the predictions to the competition



In [12]:
#save logreg_except_geo_2_3's prediction to csv
test_pred = logreg_except_geo_2_3.predict_proba(test_values_keep)
save_result_to_csv(test_pred, 'logreg_except_geo_2_3')

In [13]:
#save logreg_max_iter's prediction to csv
test_pred = logreg_max_iter.predict_proba(test_values_keep)
save_result_to_csv(test_pred, 'logreg_max_iter')

In [14]:
#save random_search's prediction to csv
test_pred = random_search.predict_proba(test_values_keep)
save_result_to_csv(test_pred, 'random_search')

#### Competition score:

logreg_except_geo_2_3: 0.6711

logreg_max_iter: 0.6708

random_search: 0.6711

There was a tie between logreg_except_geo_2_3 and random_search. Since logreg_except_geo_2_3 had a slightly better f1-micro score on the validation data, it will be the final model for logistic regression.

In [15]:
#save the model in a pickle file 
save_model_as_pickle(logreg_except_geo_2_3, 'logreg.pickle')
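
Before relying on the saved file, it can be worth checking that a reloaded model reproduces the original predictions. The sketch below does this round trip on a small synthetic model in a temporary directory; the data and the file location are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic model standing in for the final logistic regression model
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "logreg.pickle")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should give identical predictions
same = np.array_equal(model.predict(X), restored.predict(X))
print(same)  # True
```

A pickled scikit-learn model should be loaded with the same scikit-learn version that saved it.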

# Final Conclusion
[Back to top](#Contents)<br>

We have tried many different ways to optimize the logistic regression model. Our discoveries are summarised here.

1. Random oversampling increased the weight of the minority classes, but reduced the overall performance.

2. Adding features previously identified as less important improved models' performance. Firstly, this implied that these features still had some relation with damage grade. Secondly, this showed that adding more features could reduce the issue of underfitting.

3. Increasing max_iter might increase the risk of overfitting.

4. Optimizing the regularization parameter would not help if overfitting was not an issue.

Additionally, logistic regression is a simple machine learning algorithm that can be trained very quickly. It can therefore be used as a quickly implemented baseline model to compare with more complex algorithms.

Since our models still lacked learning capacity, we could also one-hot encode geo_index_level_2 and level_3 to train our models. I would be very happy to try this if I can find a more powerful computer in the future, such as some of the computers at Lee Wee Nam library.