# **Machine Learning Project**
Hananel Mandeleyl

***
## Part 1 | Exploration
***

First, the imports:

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### 1. **A sneak peek at our data**


Before any visualization and statistics, let's see what we are dealing with:

In [9]:
data = pd.read_csv('train.csv')
data.head()

| Unnamed: 0 | Feature_0 | Feature_1 | Evaporation | Feature_3 | Feature_4 | Feature_5 | Feature_6 | MaxTemp | Feature_8 | Feature_9 | ... | Feature_16 | Feature_17 | Feature_18 | Feature_19 | WindGustSpeed | Feature_21 | Year | Feature_23 | Feature_24 | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.896902 | 6.084509 | 0.6 | 80.0 | 76.0 | D | a21 | 1.107143 | 0.692857 | 5 | ... | 13.9 | 12.2 | D | D | 28.0 | 7.0 | 2011 | 40.0 | 10.693188 | 1 |
| 1 | 2.63269 | 23.441093 | 6.4 | 43.0 | 64.0 | N | a9 | 1.7 | 0.614286 | 11 | ... | 18.6 | 16.5 | N | I | 61.0 | 43.0 | 2012 | 110.0 | 57.225409 | 0 |
| 2 | 1.133413 | 5.994495 | 0.4 | 63.0 | 100.0 | C | a4 | 1.242857 | 0.428571 | 6 | ... | 16.5 | 9.6 | M |  | 15.0 | 7.0 | 2012 | 0.0 | 146.400294 | 0 |
| 3 | 2.387702 | 18.165247 | 4.2 | 65.0 | 71.0 | K | a15 | 1.05 | 0.671429 | 10 | ... | 14.2 | 11.4 | K | D | 39.0 | 24.0 | 2010 | 130.0 | 217.614788 | 0 |
| 4 | 2.101356 | 16.652846 | 3.2 | 40.0 | 62.0 | F | a1 | 1.95 | 1.085714 | 3 | ... | 26.6 | 23.4 | C |  | 30.0 | 20.0 | 2011 | 0.0 | 81.49078 | 0 |


<br>


### 2. **Statistics and visualization**

#### 2.1. **Shape:**

In [None]:
print("Rows: {}, columns: {}.".format(data.shape[0], data.shape[1]))

As we can see, we have 22,161 entries, consisting of 25 features and 1 label column.  
<br>

#### 2.2. **Feature data types:**

We would like to know:
1. How many different data types do we have?
2. How many columns of each type do we have?
3. Of what type is each column?

In [None]:
print("Number of different types:", data.dtypes.nunique(),
      "\n\nType names:", ', '.join([str(dtype) for dtype in data.dtypes.unique()]),
      "\n\nNumber of columns of each type:\n{}".format(data.dtypes.value_counts().to_string()),
      "\n\nDetailed description:\n{}".format(data.dtypes.to_string()))

  \* The `object`-type features are probably categorical, but we have to verify this.  
  <br>

#### 2.3. **How many categorical features do we really have?**

We would like to find out which features are *truly* categorical.  
To do that, we iterate through all the `object`-type features and inspect the structure of each one's entries:  
a feature whose entries consist only of `char`s is truly categorical, while a feature that mixes `char`s and numbers has to be processed differently.


In [None]:
total_true_categ_cols = 0
total_false_categ_cols = 0
true_categ_cols_dict = {}
chars_ints_mix_dict = {}

for col in data.select_dtypes(["object"]):
  col_index = data.columns.get_loc(col)
  # we have to figure out whether the feature is a mixture of digits and chars or not
  # to do that, we need to find a non-NaN entry in the column to examine.
  # we iterate through entries until we find the row index of a non-NaN entry
  i = 0
  while pd.isna(data.iloc[i, col_index]) == True:
    i += 1
  # now we know for sure the i is the row index of a non-NaN entry
  # now we check if the entry is a mixture of chars and ints or not
  contains_digits = any(map(str.isdigit, data.iloc[i, col_index]))
  if contains_digits:
    chars_ints_mix_dict[col] = col_index
    total_false_categ_cols += 1
  else:
    true_categ_cols_dict[col] = col_index
    total_true_categ_cols += 1

print("Number of true categorical features: {}"
      "\nTheir names and indices:"
      "\n{}"
      "\n\nNumber of false categorical features: {}"
      "\nTheir names and indices:"
      "\n{}"
      .format(total_true_categ_cols,
              '\n'.join("{}:\t{}".format(feature, index) for feature, index in true_categ_cols_dict.items()),
              total_false_categ_cols,
              '\n'.join("{}:\t{}".format(feature, index) for feature, index in chars_ints_mix_dict.items())))

<br>
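
The check in section 2.3 inspects only the first non-NaN entry of each column. A minimal sketch of a variant that looks at every non-NaN entry follows; the helper name `split_object_columns` and the toy frame are illustrative assumptions, not part of the project code.

```python
import pandas as pd

def split_object_columns(df):
    """Split object columns into (true categorical, digit-containing) name lists."""
    true_categ, mixed = [], []
    for col in df.select_dtypes(["object"]):
        values = df[col].dropna().astype(str)
        # A column counts as "mixed" if any entry contains at least one digit.
        if values.str.contains(r"\d").any():
            mixed.append(col)
        else:
            true_categ.append(col)
    return true_categ, mixed

# Hypothetical toy data: one purely categorical column, one with digits and units.
toy = pd.DataFrame({
    "wind_dir": pd.Series(["N", None, "SW"], dtype=object),
    "rain": pd.Series([None, "3.2mm", "nanmm"], dtype=object),
    "temp": [1.0, 2.0, 3.0],
})
true_categ_cols, mixed_cols = split_object_columns(toy)
print(true_categ_cols, mixed_cols)
```

Scanning every entry costs more than sampling one, but it does not depend on which row happens to come first.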

#### 2.4. **Numerical features (raw) statistics:**

In [None]:
feature_stats = data.describe().T
feature_stats

<br>

#### 2.5. **NaN entries:**

We would like to know how many NaN values we have for each column.  
Furthermore, obtaining the NaN entries' percentage might be helpful as well;  
We can use it to remove features whose NaN percentage is higher than the allowed threshold.

In [None]:
columns_nans_dict = {"NaN entries": data.isnull().sum()}
columns_nans_df = pd.DataFrame(columns_nans_dict)

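# Multiply by 100 so the column really holds a percentage rather than a fraction.
columns_nans_df["Percentage of NaN entries"] = columns_nans_df["NaN entries"] / data.shape[0] * 100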

print(columns_nans_df.sort_values(by = ['Percentage of NaN entries'], ascending = False))

Fortunately, we conclude that our data is pretty full.  
The *Sunshine* feature is topping the list with only $\approx$ 8% missing values.  
<br>
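
Section 2.5 mentions removing features whose NaN share exceeds a threshold. A minimal sketch of that idea follows, assuming a hypothetical helper `drop_sparse_columns` and an arbitrary threshold of 50%; neither is taken from the project.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_nan_fraction=0.5):
    """Keep only the columns whose fraction of NaN entries is at most max_nan_fraction."""
    nan_fraction = df.isna().mean()  # per-column share of missing values
    keep = nan_fraction[nan_fraction <= max_nan_fraction].index
    return df[keep]

# Hypothetical toy data: column "b" is 75% NaN and should be dropped.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1, 2, 3, 4],
})
reduced = drop_sparse_columns(toy, max_nan_fraction=0.5)
print(list(reduced.columns))
```

With the dataset described above, where the worst column is missing about 8% of its values, any reasonable threshold would keep every feature.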

#### 2.6. **Feature correlations:**

We would like to see how correlated the original features are.  
We'll compute a correlation matrix and plot it as a heatmap.

In [None]:
plt.figure(figsize = (12, 10))
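# Only numeric columns can be correlated; recent pandas versions raise on object columns.
corr_matrix = data.select_dtypes("number").corr()
# Hide the upper triangle (including the diagonal), which mirrors the lower one.
mask = np.triu(np.ones_like(corr_matrix, dtype = bool))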
sns.heatmap(corr_matrix, annot = True, mask = mask, square = True)

plt.show()

As we can see, some features are highly correlated.  
My initial thought was to remove each pair of correlated features and replace it with an engineered mean feature. However, the high correlations also indicate that the dimensionality is higher than it needs to be.  
Based on the above, I've decided to leave this to our PCA/feature selector.

<br>
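
To read the strongest correlations off the matrix instead of the heatmap, the pairs can be listed directly. A minimal sketch follows, assuming a hypothetical helper `highly_correlated_pairs` and an arbitrary cut-off of 0.9.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.9):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.select_dtypes("number").corr().abs()
    # Keep the strict upper triangle so each pair appears once and the diagonal is ignored.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Hypothetical toy data: "y" is an exact multiple of "x", "z" is only weakly related.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 1, 4, 2, 3],
})
pairs = highly_correlated_pairs(toy)
print(pairs)
```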

***
## Part 2 | Preprocessing
***

### Defining our configurations

We would like to build a configuration file/text to control the entire workflow:

In [None]:
import json
# config_text = open('exp101', 'r').read()
config_text = """{
                    "nan_fill_method": "pad",
                    "file_name": "first",
                    "scaling_method": "standard",
                    "pca": {"use": true, "number": 15},
                    "feature_selection": {"use": false, "method": "f_classif", "number": 10},
                    "models": {"gnb": true, "knn": false, "log_reg": true, "mlp": true, "svm": true},
                    "mode": "test"
                  }"""
config_dict = json.loads(config_text)

class Config:
  def __init__(self, config_dict):
    try:
      self.target_file = config_dict.get('file_name', 'def_file')
      # Strings are compared by value with ==; 'is' tests object identity.
      if self.target_file == 'def_file':
        print('Warning! No file name, saving to default.')

      self.nan_fill_method = config_dict['nan_fill_method']

      self.scaling_method = config_dict['scaling_method']

      self.use_pca = config_dict['pca'].get('use', False)
      if self.use_pca:
        self.pca_number_of_componnents = config_dict['pca']['number']

      self.use_feature_select = config_dict['feature_selection'].get('use', False)
      if self.use_feature_select:
        self.fs_method = config_dict['feature_selection']['method']
        self.fs_number_of_features = config_dict['feature_selection']['number']
      
      self.use_gnb = config_dict['models']['gnb']
      self.use_knn = config_dict['models']['knn']
      self.use_log_reg = config_dict['models']['log_reg']
      self.use_mlp = config_dict['models']['mlp']
      self.use_svm = config_dict['models']['svm']
      
      self.mode = config_dict['mode']
    # A missing dictionary key raises KeyError, not ValueError.
    except KeyError as e:
      print('Missing required value in config file.')
      raise e

config = Config(config_dict)
print("Selected configuration:\n",
      "\nNaN filling method:", config.nan_fill_method,
      "\nScaling method:", config.scaling_method,
      "\nUse PCA:", config.use_pca,
      "\nPCA number of components: {}".format(config.pca_number_of_componnents) if config.use_pca else "",
      "\nUse feature selection:", config.use_feature_select,
      "\nFeature selection method: {}, number of features: {}".format(config.fs_method, config.fs_number_of_features) if config.use_feature_select else "",
      "\nUse Gaussian Naive Bayes:", config.use_gnb,
      "\nUse KNN:", config.use_knn,
      "\nUse Logistic Regression:", config.use_log_reg,
      "\nUse MLP:", config.use_mlp,
      "\nUse SVM:", config.use_svm)

<br>
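
The `Config` class stops at the first missing key. A minimal sketch of an up-front check that reports every missing top-level key at once follows; `REQUIRED_KEYS` and `missing_config_keys` are hypothetical names introduced here for illustration.

```python
import json

# Top-level keys that the Config class reads without a default value.
REQUIRED_KEYS = ("nan_fill_method", "scaling_method", "pca",
                 "feature_selection", "models", "mode")

def missing_config_keys(config_dict):
    """Return the required top-level keys that are absent from config_dict."""
    return [key for key in REQUIRED_KEYS if key not in config_dict]

# Hypothetical, deliberately incomplete configuration text.
example = json.loads('{"nan_fill_method": "pad", "scaling_method": "standard", "mode": "train"}')
missing = missing_config_keys(example)
print(missing)
```

Such a check could run right after `json.loads`, before the `Config` object is built.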

### Some helper functions

In [None]:
def is_int_or_dot(some_char):
  if some_char.isdigit() or some_char == '.':
    return True
  else:
    return False

def remove_chars(some_string):
  # We filter out every char that isn't a digit or a '.'
  some_string_wo_chars = ''.join(filter(is_int_or_dot, some_string))
  # If the entry doesn't contain any digits, e.g. 'nanmm', we return NaN
  if len(some_string_wo_chars) != 0:
    return float(some_string_wo_chars)
  else:
    return np.nan

def remove_chars_from_col(df, column):
  # We apply remove_chars() on every entry that isn't already NaN
  df[column] = [remove_chars(entry) if not pd.isna(entry) else np.nan for entry in df[column]]

def categ_col_to_num(df, column):
  # We change column's data type to 'category' in order to use pandas.DataFrame.cat.codes
  df[column] = df[column].astype('category')
  # We convert categorical values to numerical values using pandas.DataFrame.cat.codes
  # Note: pandas.DataFrame.cat.codes replaces NaN entries with -1, so we fill them in with most common value beforehand
  df[column] = df[column].fillna(df[column].mode()[0]).cat.codes

def find_max_categs(df):
  max_unique_vals = 0
  for col in df.select_dtypes(["object"]):
      # We first find a non-NaN entry in the column to examine whether it's a real categorical or not.
      # We do this by iterating through entries until we find the row index (i) of a non-NaN entry.
      col_index = df.columns.get_loc(col)
      i = 0
      while pd.isna(df.iloc[i, col_index]) == True:
        i += 1
      # Now that we know for sure the i is the row index of a non-NaN entry,
      # we check if it is a mixture of chars and ints or not by checking if it contains digits.
      contains_digits = any(map(str.isdigit, df.iloc[i, col_index]))
      if contains_digits == False:
        unique_vals = df[col].nunique()
        max_unique_vals = unique_vals if unique_vals > max_unique_vals else max_unique_vals
  return max_unique_vals

def fill_discrete_cols(df, max_vals):
  for col in df.columns:
    if df[col].nunique() <= max_vals:
      df[col] = df[col].fillna(df[col].mode()[0])

<br>
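
Two details of the helpers are easy to check in isolation. The sketch below first shows a regex-based alternative to the character filter (the name `extract_number` is an assumption introduced here); unlike joining all digits and dots, it takes the first well-formed number, so an entry such as `'1.2.3'` does not break the `float()` conversion. It then shows how `cat.codes` encodes NaN as -1 unless the column is filled first.

```python
import re
import numpy as np
import pandas as pd

def extract_number(entry):
    """Return the first decimal number found in entry, or NaN if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", str(entry))
    return float(match.group()) if match else np.nan

number_with_unit = extract_number("3.2mm")
no_number = extract_number("nanmm")

# Hypothetical categorical column with one missing value.
wind = pd.Series(["N", None, "S", "N"], dtype=object).astype("category")
raw_codes = wind.cat.codes.tolist()                             # NaN becomes -1
filled_codes = wind.fillna(wind.mode()[0]).cat.codes.tolist()   # NaN replaced by the mode first
print(number_with_unit, no_number, raw_codes, filled_codes)
```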

### Processing classes

Now, we will build processing classes, leaving it to the user to decide which processors to use and with which parameters.

*NOTE: After checking for outliers, I have decided not to remove any entries and to use standard scaling by default.* 

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif



class Processor:
  def __init__(self):
    pass
  def process_data(self, data: pd.DataFrame):
    return data


class CategToNumConverter(Processor):
  def process_data(self, data):
    # Because this process involves sub-processes that build on top of each other,
    # we need to make a copy of the data, to pass between all the steps.
    data_wo_categs = data.copy()
    # We iterate through all the non-numerical columns and process each differently,
    # depending on whether it's a real categorical feature or a mixture of digits and chars.
    for col in data_wo_categs.select_dtypes(["object"]):
      # We first find a non-NaN entry in the column to examine whether it's a real categorical or not.
      # We do this by iterating through entries until we find the row index (i) of a non-NaN entry.
      col_index = data_wo_categs.columns.get_loc(col)
      i = 0
      while pd.isna(data_wo_categs.iloc[i, col_index]) == True:
        i += 1
      # Now that we know for sure the i is the row index of a non-NaN entry,
      # we check if it is a mixture of chars and ints or not by checking if it contains digits.
      contains_digits = any(map(str.isdigit, data_wo_categs.iloc[i, col_index]))
      if contains_digits:
        remove_chars_from_col(data_wo_categs, col)
      else:
        categ_col_to_num(data_wo_categs, col)
    return data_wo_categs


class NanFiller(Processor):
  def __init__(self, filler):
    self.filler = filler
  def process_data(self, data):
    data_wo_nans = data.copy()
    # We would like to make sure that categorical and discrete features are only
    # filled with the most common value, and not with the mean/median, which
    # don't represent those features accurately. To do that, we'll first use
    # the find_max_categs() function in order to find the maximum amount of
    # unique values for a categorical or discrete feature in our dataset,
    # to be able to identify which columns are categorical/discrete and which
    # are not.
    max_categs = find_max_categs(data)
    # Now, we fill all the categorical/discrete columns with most common value
    # using the fill_discrete_cols() function.
    fill_discrete_cols(data_wo_nans, max_categs)
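    # We fill all the rest with a filler of choice, passed via the
    # self.filler attribute. fillna(method=...) is deprecated in recent pandas
    # versions, so we call the equivalent ffill()/bfill() directly.
    if self.filler in ('pad', 'ffill'):
      data_wo_nans = data_wo_nans.ffill()
    elif self.filler in ('backfill', 'bfill'):
      data_wo_nans = data_wo_nans.bfill()
    else:
      raise ValueError('Unknown NaN filling method: {}'.format(self.filler))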
    return data_wo_nans


class Scaler(Processor):
  def __init__(self, scaling_method):
    self.scaler = scaling_method
  def process_data(self, data):
    # Our default scaler is the standard scaler, unless stated otherwise.
    if self.scaler == 'standard':
      scaler = StandardScaler()
      data_scaled = scaler.fit_transform(data)
      return pd.DataFrame(data_scaled)
    else:
      scaler = MinMaxScaler()
      data_scaled = scaler.fit_transform(data)
      return pd.DataFrame(data_scaled)


class PCAHandler(Processor):
  def __init__(self, num_of_components):
    self.num_of_components = num_of_components
  def process_data(self, data):
    pca = PCA(n_components = self.num_of_components, copy = False)
    data_post_pca = pca.fit_transform(data)
    return pd.DataFrame(data_post_pca)


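class FeatureSelect(Processor):
  # SelectKBest is supervised, so the labels have to be supplied explicitly,
  # either through the constructor or by setting self.labels before processing.
  def __init__(self, fs_method, num_of_features, labels = None):
    self.fs_method = fs_method
    self.num_of_features = num_of_features
    self.labels = labels
  def process_data(self, data):
    if self.labels is None:
      raise ValueError('FeatureSelect needs labels; set the labels attribute first.')
    # 'f_classif' and 'chi2' are chosen by name; anything else falls back to
    # mutual information. Note that chi2 requires non-negative feature values.
    score_funcs = {'f_classif': f_classif, 'chi2': chi2}
    score_func = score_funcs.get(self.fs_method, mutual_info_classif)
    selector = SelectKBest(score_func, k = self.num_of_features)
    data_post_fs = selector.fit_transform(data, np.ravel(self.labels))
    return pd.DataFrame(data_post_fs)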

<br>
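
The processors above call `fit_transform` every time they run, so statistics are re-estimated on whatever data is passed in. A minimal sketch of an alternative follows, in which a processor fits once and only transforms afterwards, so that new data is scaled with the training statistics. The class name `FittedScaler` is an assumption introduced for illustration and is not part of the project.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

class FittedScaler:
    """Standard scaler that learns its statistics on the first call only."""
    def __init__(self):
        self.scaler = StandardScaler()
        self.is_fitted = False

    def process_data(self, data):
        if not self.is_fitted:
            scaled = self.scaler.fit_transform(data)  # first call: learn mean and std
            self.is_fitted = True
        else:
            scaled = self.scaler.transform(data)      # later calls: reuse them
        return pd.DataFrame(scaled, columns=data.columns, index=data.index)

# Hypothetical toy data: the training mean of column "a" is 2.0.
train = pd.DataFrame({"a": [0.0, 2.0, 4.0]})
new = pd.DataFrame({"a": [2.0]})

proc = FittedScaler()
train_scaled = proc.process_data(train)
new_scaled = proc.process_data(new)
print(new_scaled)
```

The value 2.0 maps to 0 because it equals the training mean; refitting on the single new row would instead discard the training statistics.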

### Building a processors list

Now, we shall build a processors list according to the user's specification:

In [None]:
# We define the basic pre-processors, which are a must, according to the user's
# methods of choice. 
categoric_handler = CategToNumConverter()
nan_handler = NanFiller(config.nan_fill_method)
scaler = Scaler(config.scaling_method)
# We use them to create a basic pre-processors list.
processors = [categoric_handler, nan_handler, scaler]
# PCA and/or feature selection is up to the user.
if config.use_pca:
  processors.append(PCAHandler(config.pca_number_of_componnents))
if config.use_feature_select:
  processors.append(FeatureSelect(config.fs_method, config.fs_number_of_features))

<br>
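
The number of PCA components is fixed in the configuration. One way to sanity-check such a number is to ask how many components are needed to reach a target share of explained variance. The sketch below assumes a hypothetical helper `components_for_variance` and a 95% target, and uses synthetic data rather than the project's dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

def components_for_variance(X, target=0.95):
    """Smallest number of principal components whose cumulative explained variance reaches target."""
    cumulative = np.cumsum(PCA().fit(X).explained_variance_ratio_)
    n_needed = int(np.searchsorted(cumulative, target)) + 1
    return min(n_needed, len(cumulative))  # guard against rounding when target is close to 1

# Synthetic data: three columns that are almost exact multiples of one signal.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X_demo = np.hstack([base, 2 * base, -base]) + rng.normal(scale=1e-3, size=(200, 3))

n_components = components_for_variance(X_demo, target=0.95)
print(n_components)
```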

### Applying the pre-processing

Finally, all we have to do is define a pre-processing function and run it!

In [None]:
def apply_pre_processors(processors, data, config):
  # Initially y is set to None, unless the configuration states it's train mode.
  y = None
  if config.mode == 'train':
    y = pd.DataFrame(data['label'])
    data = data.drop(columns=['label'])
  print(data, '\n')
  # We iterate through all the accumulated processors and apply each on our data.
  for proc in processors:
    data = proc.process_data(data)
    print(data, '\n')
  
  return data, y


X, y = apply_pre_processors(processors, data, config)

<br>

***
## Part 3 | **Model Construction**
***

### Building a list of optimizers

For this part, we would like to build a list of parameter searchers according to the classifiers chosen in the user's configuration.  
The interesting part is that for each chosen classifier, we create a cross-validated grid searcher.  
This tool finds the best parameter combination out of a given list of options **using cross-validation**, rather than scoring on the training data alone!

In [None]:
import scipy
import sklearn
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn import neighbors
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn import svm


def get_clfs_searcher(config):
  # We initialize an empty classifiers array, then check for each possible model
  # if it was chosen in the configuration file.
  searchers = []

  if config.use_gnb:
    gnb_clf_parameters = {'priors': [[0.5, 0.5], None]}
    gnb_clf = GaussianNB()
    gnb_clf_searcher = GridSearchCV(gnb_clf, gnb_clf_parameters)
    searchers.append(gnb_clf_searcher)

  if config.use_knn:
    knn_clf_parameters = {'n_neighbors': [1, 5, 10, 15, 20]}
    knn_clf = neighbors.KNeighborsClassifier()
    knn_clf_searcher = GridSearchCV(knn_clf, knn_clf_parameters)
    searchers.append(knn_clf_searcher)

  if config.use_log_reg:
    # C must be strictly positive, so the grid starts at 2 rather than 0.
    log_reg_clf_parameters = {'C': range(2, 11, 2)}
    log_reg_clf = LogisticRegression()
    log_reg_clf_searcher = GridSearchCV(log_reg_clf, log_reg_clf_parameters)
    searchers.append(log_reg_clf_searcher)

  if config.use_mlp:
    mlp_clf_parameters = {'hidden_layer_sizes': [(100,), (10, 10)],
                          'activation': ['logistic', 'relu'],
                          'alpha': [0.0001, 0.0005]}
    mlp_clf = MLPClassifier()
    mlp_clf_searcher = GridSearchCV(mlp_clf, mlp_clf_parameters)
    searchers.append(mlp_clf_searcher)

  if config.use_svm:
    svc_parameters = {'kernel': ['rbf', 'linear'],
                      'C': [1, 2],
                      'probability': [True]}
                      
    svc = svm.SVC()
    svc_searcher = GridSearchCV(svc, svc_parameters)
    searchers.append(svc_searcher)
  
  return searchers

<br>
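
To make the behaviour of a single searcher concrete, here is a minimal, self-contained sketch on synthetic data. The dataset, the 3-fold setting and the `precision` scoring are assumptions chosen for the example; the searchers above rely on the `GridSearchCV` defaults.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification problem standing in for the real data.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid = {"n_neighbors": [1, 5, 10]}
searcher = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3, scoring="precision")
searcher.fit(X_demo, y_demo)

# One mean validation score per candidate, plus the winning combination.
print(searcher.cv_results_["mean_test_score"])
print(searcher.best_params_)
```

After fitting, `best_estimator_` is already refitted on all the data passed to `fit`, because `refit=True` is the default.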

### Training, optimization, and results export functions

Now, we will define two functions:  
1. The first receives a classifier's searcher, uses it to find the classifier's best parameters, and fits it to the data. It returns a classifier.
2. The second receives a list of searchers and uses the first function to train and tune all of their classifiers. It returns a list of trained classifiers.

In [None]:
def train_and_search_clf(searcher, X, y):
  searcher.fit(X, y)
  print(searcher.best_params_)
  clf = searcher.best_estimator_.fit(X, y)
  print("train precision of {} is {}".format(type(clf), sklearn.metrics.precision_score(y, clf.predict(X))))
  return clf


def train_and_search_clfs(srchrs, X, y):
  results = []
  for srchr in srchrs:
      results.append(train_and_search_clf(srchr, X, y))
  return results

Now, we shall define functions that will define and export our results:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn.metrics
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import KFold
import pickle


def plot_roc_curve(ax, fpr, tpr, title):
    ax.plot(fpr, tpr, color = 'darkorange', label = 'ROC')
    ax.plot([0, 1], [0, 1], color = 'red', linestyle = '--')
    ax.set_xlabel('False Positive Rate (FPR)')
    ax.set_ylabel('True Positive Rate (TPR)')
    ax.set_title(title)


def save(results, X, y):
  print("Saving started.")
  for clf in results:
    y_predicted = clf.predict(X)

    fig = plt.figure(figsize = (20, 16))
    fig.tight_layout()

    ax = fig.add_subplot(221)
    # Rows of the confusion matrix are the true labels, columns the predicted ones.
    conf_mat = sklearn.metrics.confusion_matrix(y, y_predicted)
    cax = ax.matshow(conf_mat, interpolation = 'nearest')
    fig.colorbar(cax)

    ax.set_xticks(range(conf_mat.shape[1]))
    ax.set_yticks(range(conf_mat.shape[0]))
    ax.set_xlabel('Predicted label')
    ax.set_ylabel('True label')

    for (i, j), z in np.ndenumerate(conf_mat):
      ax.text(j, i, '{:d}'.format(z), ha = 'center', va = 'center')

    # Each fold is fitted on a clone, so the classifier trained on the training
    # set is neither overwritten nor refitted on the evaluation data.
    from sklearn.base import clone
    kf = KFold(n_splits = 3, random_state = None, shuffle = False)
    curr_k = 2

    for train_index, test_index in kf.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        
        fold_clf = clone(clf)
        fold_clf.fit(X_train, y_train)
        y_prob = fold_clf.predict_proba(X_test)
        
        curr_fpr, curr_tpr, curr_thresholds = roc_curve(y_test, y_prob[:, 1])
        ax = fig.add_subplot(2,2,curr_k)
        plot_roc_curve(ax, curr_fpr, curr_tpr,
                       'ROC Curve for fold %i of 3-fold CV\nwith AUC of %.4f' %(curr_k - 1, auc(curr_fpr, curr_tpr)))
        
        curr_k += 1
    
    # type(clf).__name__ gives a file-system-safe name such as 'GaussianNB'.
    name = config.target_file + type(clf).__name__ + ".jpg"
    plt.savefig(name)
    print("val precision of {} is {}".format(type(clf), sklearn.metrics.precision_score(y, clf.predict(X))))
    with open(config.target_file + type(clf).__name__ + ".p", 'wb') as model_file:
      pickle.dump(clf, model_file)
    

<br>
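
The per-fold ROC curves can also be summarised in a single curve built from out-of-fold probabilities. The sketch below is a minimal illustration on synthetic data, assuming a logistic regression model and 3 folds; it is not the evaluation used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_predict

# Synthetic binary classification problem standing in for the real data.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Every sample is scored by a model that did not see it during fitting.
oof_proba = cross_val_predict(LogisticRegression(max_iter=1000), X_demo, y_demo,
                              cv=3, method="predict_proba")[:, 1]

fpr, tpr, _ = roc_curve(y_demo, oof_proba)
oof_auc = roc_auc_score(y_demo, oof_proba)
print(round(oof_auc, 4))
```

`cross_val_predict` clones the estimator for each fold, so the model object passed in is left unfitted and unchanged.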

***
## Part 4 | **Model Training Evaluation**
***

As you probably noticed, we train, evaluate, optimize and export our results in one step:

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 42)
clfs = get_clfs_searcher(config)
res = train_and_search_clfs(clfs, X_train, y_train)

save(res, X_test, y_test)

<br>

***
## Part 5 | **Prediction Output**
***

**Before running this step we have to change the running mode to test in our configurations, and choose a model to use on the test!**

In [None]:
clf = res[1]
test_data = pd.read_csv('test_without_target.csv')
# The configuration must be in test mode here, since the test file has no label column.
X, _ = apply_pre_processors(processors, test_data, config)
pred = clf.predict_proba(X)

pd.DataFrame(pred[:, 1], columns = ["pred_proba"]).to_csv("311160808.csv", columns = ["pred_proba"])
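
Since the trained models are also written to disk with `pickle`, a prediction run does not have to happen in the same session as training. A minimal sketch of the save and reload round trip follows, using a synthetic model and a temporary directory; the file name `demo_model.p` is a placeholder, not one of the project's files.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic data and model standing in for a trained project classifier.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)
clf_demo = GaussianNB().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.p")
    with open(model_path, "wb") as model_file:
        pickle.dump(clf_demo, model_file)
    with open(model_path, "rb") as model_file:
        restored = pickle.load(model_file)

# The restored model should reproduce the original probabilities exactly.
same_predictions = np.array_equal(clf_demo.predict_proba(X_demo),
                                  restored.predict_proba(X_demo))
print(same_predictions)
```

Pickled models should only be loaded from trusted sources and, ideally, with the same scikit-learn version that wrote them.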