# Machine Learning Concepts and Principles
## Software Defect Detection

> Lazaros Panitsidis & Konstantinos Kravaritis<br />
> MSc Data Science <br />
> International Hellenic University <br />
> lpanitsidis@ihu.edu.gr & kkravaritis@ihu.edu.gr

## Contents
1. [Useful Python Libraries](#0)
1. [Data Content](#1)
1. [Feature Engineering](#2)
     1. [Data Preprocessing](#3)
     1. [Visualization & Analysis](#4)
1. [Feature Selection and Random Forest Classification](#5)
     1. [Feature Selection by Correlation](#6)
     1. [Univariate feature selection (SelectKbest)](#7)
     1. [Recursive Feature Elimination (RFE)](#8)
     1. [Recursive Feature Elimination with Cross-Validation (RFECV)](#9)
     1. [Feature importances with a forest of trees](#10)
     1. [XGBoost Feature Importances](#11)
     1. [Minimum Redundancy & Maximum Relevance](#12)
1. [Feature extraction with PCA](#11)
1. [Summary](#12)

<a id='0'></a>
## Useful Python Libraries

In [1]:
### Numeric operations & plots ###
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import numpy as np # linear algebra
import scipy.stats as stats
import seaborn as sns # data visualization library  
import matplotlib.pyplot as plt

### Validation & Normalization methods ###
from sklearn.model_selection import cross_val_score , GridSearchCV , StratifiedKFold, RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler, StandardScaler

### ML models ###
from sklearn.linear_model import LogisticRegression # C1
from sklearn.linear_model import SGDClassifier # C1 loss: log_loss => LogisticRegression with SGD
from sklearn.linear_model import Perceptron # C2
from sklearn.svm import SVC # C3
from sklearn.tree import DecisionTreeClassifier # C4
from sklearn.ensemble import RandomForestClassifier # C5
from sklearn.neural_network import MLPClassifier # C6
from sklearn.preprocessing import LabelEncoder

### Metrics ###
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, make_scorer, classification_report
from imblearn.metrics import geometric_mean_score # https://imbalanced-learn.org/stable/references/generated/imblearn.metrics.geometric_mean_score.html
import time
import timeit # https://stackoverflow.com/questions/17579357/time-time-vs-timeit-timeit

### Pipeline ###
from sklearn.pipeline import make_pipeline , Pipeline # https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

### Custom Modules ###
import sys
sys.path.append("..")
from functions.data_types import optimize_dtypes
from functions.dataframe_actions import df_info

### Settings ###
pd.set_option('display.max_columns', None)
pd.options.mode.chained_assignment = None  # default='warn'
#import warnings library
import warnings
# ignore all warnings
warnings.filterwarnings('ignore')

In [2]:
# dataframe information
def df_info(dataframes):
    """
    Print some useful information about every dataframe passed to the function.

    Usage: pass a list of dataframes into the function
    dataframes = [df1, df2, ...]
    """
    for df in dataframes:
        # Look up the first global variable bound to this dataframe to get its name.
        # This is resolved before the emptiness check so both branches can print it.
        df_name = next((name for name, obj in globals().items() if obj is df), "<unnamed>")

        # Skip dataframes that have no rows or no columns
        if df.empty:
            print(df_name, ': The dataframe is empty.')
            continue

        print("----- information for ", df_name, " -----")
        print(df_name, " : ", df.shape, "(rows, columns)")
        print(df_name, " : ", df.isna().sum().sum(), "missing values")
        # df.duplicated() flags fully duplicated rows, so this counts rows
        print(df_name, " : ", df.duplicated().sum(), "duplicated values")

        # Identify the last column (the class label) and count its values
        last_column = df.columns[-1]
        value_counts = df[last_column].value_counts()

        print(df_name, " : Value counts for ", last_column)
        print(value_counts)

## Data preprocessing

In [3]:
# to read .csv files from another directory
data_location = "../../"

### read the .csv files and make dataframes

In [4]:
jm1 = pd.read_csv(data_location + "jm1.csv")
mc1 = pd.read_csv(data_location + "mc1.csv")
pc3 = pd.read_csv(data_location + "pc3.csv")

### extract useful information about the dataframes

In [5]:
dataframes = [jm1, mc1, pc3]
df_info(dataframes)

----- information for  jm1  -----
jm1  :  (10885, 22) (rows, columns)
jm1  :  0 missing values
jm1  :  1973 duplicated values
jm1  : Value counts for  defects
defects
False    8779
True     2106
Name: count, dtype: int64
----- information for  mc1  -----
mc1  :  (9466, 39) (rows, columns)
mc1  :  0 missing values
mc1  :  7450 duplicated values
mc1  : Value counts for  c
c
False    9398
True       68
Name: count, dtype: int64
----- information for  pc3  -----
pc3  :  (1563, 38) (rows, columns)
pc3  :  0 missing values
pc3  :  124 duplicated values
pc3  : Value counts for  c
c
False    1403
True      160
Name: count, dtype: int64


#### label encoding

In [7]:
# Label encoding with sklearn's LabelEncoder
class_le = LabelEncoder()
jm1['defects'] = class_le.fit_transform(jm1['defects'].values)
mc1['c'] = class_le.fit_transform(mc1['c'].values)
pc3['c'] = class_le.fit_transform(pc3['c'].values)
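
As a quick check of what the encoder produces, the sketch below applies `LabelEncoder` to a small made-up boolean array (the toy values and the `toy_` names are assumptions, not taken from the datasets). `LabelEncoder` sorts the classes it finds, so `False` is expected to map to 0 and `True` to 1.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy boolean labels standing in for a 'defects' column
toy_labels = np.array([False, True, True, False])

toy_le = LabelEncoder()
toy_encoded = toy_le.fit_transform(toy_labels)

print(toy_le.classes_)  # classes in sorted order; the index is the encoded value
print(toy_encoded)
```

Note that refitting the same encoder object on each dataset, as done above, overwrites its `classes_`; this is harmless here because every label column holds the same two values.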

#### find optimal data types for faster computation

In [8]:
jm1 = optimize_dtypes(jm1)
mc1 = optimize_dtypes(mc1)
pc3 = optimize_dtypes(pc3)
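
`optimize_dtypes` comes from the custom `functions` module and its implementation is not shown here. The sketch below is only a rough idea of how such downcasting could work, using `pd.to_numeric` with its `downcast` option; the helper name `downcast_numeric` and the toy frame are hypothetical. Unlike the author's function, this version never goes below `float32`. That can be a safer floor, since `float16` keeps only about three significant decimal digits and overflows above 65504.

```python
import pandas as pd

def downcast_numeric(df):
    """Hypothetical helper: shrink numeric columns to the smallest safe dtype."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Unsigned types only fit columns without negative values
            kind = "unsigned" if out[col].min() >= 0 else "integer"
            out[col] = pd.to_numeric(out[col], downcast=kind)
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Toy frame with made-up values
toy = pd.DataFrame({
    "lOCode": [10, 250, 3],
    "delta": [-5, 7, 100],
    "v": [1.5, 20.25, 300.125],
})
small = downcast_numeric(toy)
print(small.dtypes)
```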

In [9]:
jm1.dtypes

loc                  float16
v(g)                 float16
ev(g)                float16
iv(g)                float16
n                    float16
v                    float32
l                    float16
d                    float16
i                    float16
e                    float32
b                    float16
t                    float32
lOCode                uint16
lOComment             uint16
lOBlank               uint16
locCodeAndComment      uint8
uniq_Op               object
uniq_Opnd             object
total_Op              object
total_Opnd            object
branchCount           object
defects                uint8
dtype: object
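
Five columns (`uniq_Op` through `branchCount`) are still reported as `object`, which usually means they contain at least one non-numeric entry that `isna()` does not count as missing. A minimal sketch of one way to handle this, assuming the offending entries are a placeholder string such as `'?'` (an assumption; the actual values should be inspected first), is to coerce the columns to numeric so the placeholders become `NaN`:

```python
import pandas as pd

# Toy frame with a made-up '?' placeholder in otherwise numeric columns
toy = pd.DataFrame({
    "uniq_Op": ["1", "?", "3"],
    "branchCount": ["5", "7", "?"],
    "defects": [0, 1, 0],
})

text_cols = ["uniq_Op", "branchCount"]

# errors="coerce" turns anything that cannot be parsed into NaN
toy[text_cols] = toy[text_cols].apply(pd.to_numeric, errors="coerce")

n_missing = int(toy.isna().sum().sum())
print(toy.dtypes)
print(n_missing, "missing values after coercion")
```

After coercion the new `NaN` entries show up in the missing-value count and can be dropped or imputed before modelling.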

### setting up scoring and cv

In [None]:
# Define multiple metrics
scoring = {'Accuracy': make_scorer(accuracy_score),
           'F1-score': make_scorer(f1_score, average='binary'),
           'G-Mean score': make_scorer(geometric_mean_score, average='binary')
          }
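
To illustrate how a multi-metric `scoring` dictionary like this one plugs into cross-validation, here is a minimal sketch on synthetic data. Several things in it are assumptions: the `demo_` names, the synthetic dataset, the fold settings, and the classifier are all illustrative. To keep the example dependent on scikit-learn only, a small hypothetical `g_mean` helper stands in for `imblearn`'s `geometric_mean_score`; for binary labels it computes the square root of sensitivity times specificity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, make_scorer, recall_score
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

def g_mean(y_true, y_pred):
    """Hypothetical stand-in: geometric mean of sensitivity and specificity."""
    sensitivity = recall_score(y_true, y_pred, pos_label=1)
    specificity = recall_score(y_true, y_pred, pos_label=0)
    return np.sqrt(sensitivity * specificity)

demo_scoring = {
    "Accuracy": make_scorer(accuracy_score),
    "F1-score": make_scorer(f1_score, average="binary"),
    "G-Mean score": make_scorer(g_mean),
}

# Imbalanced synthetic data (about 15% positives), not one of the real datasets
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.85, 0.15], random_state=0
)

# Stratified folds preserve the class ratio in every split
demo_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)

demo_results = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X_demo, y_demo, scoring=demo_scoring, cv=demo_cv,
)

for metric in demo_scoring:
    print(metric, demo_results[f"test_{metric}"].mean())
```

`cross_validate` returns one array per metric under the key `test_<name>`, with one entry per fold (here 5 splits × 2 repeats = 10 values).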