# Logistic Regression and Decision Tree

In this assignment we'll build a few machine learning algorithms and learn how they work. Let's start with the basics.

## Part 1: warm up

The goal of this part is to build linear and tree-based classification models and verify them on the famous Iris dataset.

Let's start with the libraries.

In [None]:
import numpy as np
import pandas as pd
import os

In [None]:
from sklearn import datasets
import matplotlib.pyplot as plt

Prepare the dataset as well.

In [None]:
iris = datasets.load_iris()

In [None]:
X = iris.data

In [None]:
y = iris.target

Check the targets. We should see 3 classes with 50 objects each.

In [None]:
y

### Tree model

Let's move on to the tree model. As you probably remember, a decision tree makes splits and uses the **information gain** criterion to choose them.
Information gain is defined as the difference between the entropy before the split and the weighted entropy after it.
So, let's define the entropy function, where $p_i$ is the fraction of objects of class $i$ and $C$ is the number of classes:
$$S = -\sum_{i=1}^{C} p_i \log(p_i)$$

_Hint_: [np.unique](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.unique.html) could help you here.
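As a minimal sketch of how `np.unique` fits in, class counts can be turned into probabilities and plugged into the entropy with its usual negative sign, $S = -\sum_i p_i \log(p_i)$. The name `entropy_sketch` is hypothetical and not part of the assignment template; the natural logarithm is assumed because the smoke tests expect `log 2 ~ 0.693`.

```python
import numpy as np

def entropy_sketch(labels):
    """Shannon entropy (natural log) of a 1-D array of class labels."""
    # return_counts=True gives how many times each distinct label occurs
    _, counts = np.unique(labels, return_counts=True)
    probabilities = counts / counts.sum()
    # log(1 / p) equals -log(p); every p here is > 0, so the log is defined
    return float(np.sum(probabilities * np.log(1.0 / probabilities)))

print(entropy_sketch(np.array([1, 1, 1, 1, 2, 2, 2, 2])))
```

Only labels that actually occur are counted, so the sketch never evaluates `log(0)`.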

In [None]:
def calculate_entropy(array):
    #YOUR CODE HERE
    pass

Smoke test

In [None]:
calculate_entropy(np.array([1,1,1,1,1,1,1,1])) # should be 0

In [None]:
calculate_entropy(np.array([1,1,1,1,2,2,2,2])) # should be log 2 ~ 0.693

And the initial entropy of the dataset:

In [None]:
calculate_entropy(y) # S(y) ~ 1.0986

Next, implement the split function. It should check every possible way to split the dataset into two parts and choose the best one.

Return the predicate that achieves the maximal information gain.
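One possible shape for such a search is sketched below. It is an illustration under stated assumptions, not the reference solution: candidate thresholds are taken as midpoints between consecutive distinct feature values, the predicate has the form `row[feature] > threshold`, and the names `_entropy` and `best_split_sketch` are hypothetical. Information gain is computed as the parent entropy minus the size-weighted entropy of the two children.

```python
import numpy as np

def _entropy(labels):
    # Natural-log Shannon entropy of a label array
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log(1.0 / p)))

def best_split_sketch(X, y):
    """Return a row -> bool predicate for the threshold split with maximal gain."""
    n_samples, n_features = X.shape
    parent_entropy = _entropy(y)
    best_gain, best_feature, best_threshold = -np.inf, None, None
    for feature in range(n_features):
        values = np.unique(X[:, feature])
        # Midpoints between consecutive distinct values keep both sides non-empty
        for threshold in (values[:-1] + values[1:]) / 2.0:
            mask = X[:, feature] > threshold
            n_true = mask.sum()
            children_entropy = (
                n_true * _entropy(y[mask])
                + (n_samples - n_true) * _entropy(y[~mask])
            ) / n_samples
            gain = parent_entropy - children_entropy
            if gain > best_gain:
                best_gain, best_feature, best_threshold = gain, feature, threshold
    if best_feature is None:
        raise ValueError("no valid split: every feature is constant")
    return lambda row: row[best_feature] > best_threshold

demo_X = np.array([1, 2, 3, 4, 5, 6]).reshape(6, 1)
demo_y = np.array([0, 0, 0, 1, 1, 1])
demo_predicate = best_split_sketch(demo_X, demo_y)
print(np.apply_along_axis(demo_predicate, 1, np.array([[2], [4]])))
```

The exhaustive scan costs one entropy evaluation per candidate threshold, which is fine for Iris-sized data but would need sorting and incremental counts on larger datasets.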

In [None]:
def split(X,y):
    #YOUR CODE HERE
    # iterate over columns of X and examine every split
    # return the best one
    pass

And check it on this trivial case

In [None]:
simple_data = np.array([1,2,3,4,5,6]).reshape(6,1)
simple_label = np.array([0,0,0,1,1,1])
predicate = split(simple_data, simple_label)
simple_test = np.array([2, 4]).reshape(2,1)
print(np.apply_along_axis(predicate, 1, simple_test)) # should be [False  True]

So, we are ready to implement our first decision tree!
Let's start from the template below.

In [None]:
class DecisionTreeNode:
    left = None
    right = None
    predicate = None
    outcome = None
    
    def train(self, X, y):
        if (len(np.unique(y)) == 1):
            self.outcome = y[0]
            return
        if len(X) < 6:
            self.outcome = np.median(y)
            return
        self.predicate = split(X, y)
        index_left = np.apply_along_axis(self.predicate, 1, X)
        index_right = np.invert(index_left)
        self.left = DecisionTreeNode()
        self.left.train(X[index_left], y[index_left])
        self.right = DecisionTreeNode()
        self.right.train(X[index_right,:], y[index_right])
        
    def predict(self, X):
        def single_predict(x):
            node = self
            while node.predicate is not None:
                out = node.predicate(x)
                if out:
                    node = node.left
                else:
                    node = node.right
            return node.outcome
        return np.apply_along_axis(single_predict, 1, X)
            
        

And see how it works on iris data.

In [None]:
sc = DecisionTreeNode()
sc.train(X, y)

In [None]:
sc.predict(X)

Nice! Let's move on to the logistic regression part.

### Logistic regression

A key part of the logistic model is the sigmoid function. Let's implement it.

$$\sigma(x)=\frac{1}{1 + e^{-x}}$$

In [None]:
def sigmoid(x):
    #YOUR CODE HERE
    pass

And its derivative as well.

Hint: use the following property of the sigmoid function:
$$\frac{d \sigma(x)}{d x} = \sigma(x) (1 - \sigma(x))$$
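The identity can be checked numerically before relying on it. The sketch below compares $\sigma(x)(1 - \sigma(x))$ with a central finite difference; the names `sigmoid_sketch` and `sigmoid_derivative_sketch` are hypothetical stand-ins, not the functions you are asked to write.

```python
import numpy as np

def sigmoid_sketch(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative_sketch(x):
    s = sigmoid_sketch(x)
    return s * (1.0 - s)

grid = np.linspace(-5, 5, 11)
step = 1e-5
# Central difference approximates the derivative with O(step**2) error
numeric = (sigmoid_sketch(grid + step) - sigmoid_sketch(grid - step)) / (2 * step)
print(np.max(np.abs(numeric - sigmoid_derivative_sketch(grid))))
```

The printed maximum discrepancy should be tiny compared with the derivative's peak value of 0.25 at `x = 0`.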

In [None]:
def sigmoid_derivative(x):
    #YOUR CODE HERE
    pass

The graph below will show whether the functions are implemented correctly.

In [None]:
x = np.linspace(-5, 5, 100)
plt.plot(x, sigmoid(x),label='sigmoid')
plt.plot(x, sigmoid_derivative(x),label='sigmoid derivative')
plt.legend()

OK. Now implement the logistic regression training loop. Use gradient descent for parameter adjustment.
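For orientation, here is a rough sketch of minibatch gradient descent on a toy problem. It makes several assumptions that the template does not fix: the loss is the mean log loss, whose gradient with respect to the logit is `sigmoid(z) - y`; training runs for a fixed number of epochs instead of using a stop threshold; and parameters start at zero. The function name `fit_logreg_sketch` and all of its parameters are hypothetical.

```python
import numpy as np

def fit_logreg_sketch(X, y, learning_rate=0.1, minibatch_size=20, n_epochs=200, seed=0):
    """Minibatch gradient descent on the mean log loss; returns (weights, bias)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    weights, bias = np.zeros(n_features), 0.0
    for _ in range(n_epochs):
        order = rng.permutation(n_samples)  # reshuffle the data every epoch
        for start in range(0, n_samples, minibatch_size):
            batch = order[start:start + minibatch_size]
            probabilities = 1.0 / (1.0 + np.exp(-(X[batch] @ weights + bias)))
            error = probabilities - y[batch]  # d(loss)/d(logit) for each sample
            weights -= learning_rate * (X[batch].T @ error) / len(batch)
            bias -= learning_rate * error.mean()
    return weights, bias

# Toy, linearly separable data: class 1 when the single feature is positive
toy_rng = np.random.default_rng(1)
X_toy = toy_rng.normal(size=(200, 1))
y_toy = (X_toy[:, 0] > 0).astype(float)
toy_weights, toy_bias = fit_logreg_sketch(X_toy, y_toy)
toy_accuracy = np.mean(((X_toy @ toy_weights + toy_bias) > 0) == (y_toy == 1))
print(toy_accuracy)
```

A stop threshold, as in the class template, could replace the fixed epoch count by ending the loop once the change in loss between epochs falls below it.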

In [None]:
class CustomLogisticRegression:
    coef_ = None
    bias_ = None
    learning_rate_= None
    minibatch_size_ = None
    stop_threshold_ = None

    def __init__(self, learning_rate=1e-3, minibatch_size=100, stop_threshold=0.1) -> None:
        self.learning_rate_ = learning_rate
        self.minibatch_size_ = minibatch_size
        self.stop_threshold_ = stop_threshold

    def fit(self, X : np.ndarray, y : np.ndarray):
        self.coef_ = np.random.randn(1, X.shape[1])
        self.bias_ = np.random.randn()
        #YOUR CODE HERE
        # create the training loop
        pass

    def predict(self, X : np.ndarray):
        return sigmoid(self.coef_ @ X.T + self.bias_)
    

Smoke test to check that everything works as expected.

In [None]:
logreg = CustomLogisticRegression()
logreg.fit(X[:100,:], y[:100])

In [None]:
logreg.predict(X[:100, :]) > 0.5

## Part 2: ticket data

### Data Examination

Here we are going to analyze a set of project tickets and find out whether there is any dependency between the ticket text and the team that will handle it. Of course such a dependency exists, but will it be visible to our relatively simple models?

Let's import data and examine it.

In [None]:
from preprocess import DATA_DIR

Read the data. See [pandas read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) for reference.

In [None]:
tickets = pd.read_csv(os.path.join(DATA_DIR, 'tickets.csv'))

In [None]:
len(tickets)

In [None]:
tickets.head()

As you can see, there are some NaNs. Let's replace them with blanks for now.

In [None]:
tickets = tickets.fillna(' ')

Examine teams working on tickets

In [None]:
tickets['team'].value_counts()

Note that we can't fit our models on raw text. We have to transform each ticket's text into a vector, the natural input format for both logistic regression and decision trees. There are many ways to do this; we'll show the simplest one, the bag-of-words approach. It works in three steps: building the corpus dictionary, enumerating its words, and assigning each text a vector of word counts.

scikit-learn provides a [bag-of-words implementation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).
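The three steps can be sketched in plain Python on a made-up corpus. The example texts are invented for illustration, and tokenization is simplified to lowercasing and splitting on whitespace, which is cruder than a real vectorizer.

```python
from collections import Counter

texts = ["login page crash", "fix login bug", "crash crash report"]

# Step 1: build the corpus dictionary (the set of distinct tokens)
tokenized = [text.lower().split() for text in texts]
dictionary = sorted({token for tokens in tokenized for token in tokens})

# Step 2: enumerate it, giving each word a column index
word_index = {word: index for index, word in enumerate(dictionary)}

# Step 3: build a word-count vector for each text
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts.get(word, 0) for word in dictionary])

print(dictionary)
for vector in vectors:
    print(vector)
```

Word order is lost in this representation: only the counts per dictionary entry survive, which is why it is called a "bag" of words.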

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from itertools import chain

In [None]:
cv = CountVectorizer()
#PLEASE CHECK DOCUMENTATION AND ADJUST SOME REASONABLE PARAMETERS

In [None]:
cv.fit(chain(tickets['summary'], tickets['description']))

Let's look at the number of words in the resulting dictionary.

In [None]:
len(cv.vocabulary_)

With the vectorizer fitted, we are finally able to create numeric features.

In [None]:
summary_array = cv.transform(tickets['summary'])

In [None]:
comment_array = cv.transform(tickets['description'])

In [None]:
summary_array.shape

In [None]:
comment_array.shape

In [None]:
features = summary_array + comment_array

OK. Now we have our features and are almost ready to go ahead with the analysis. We still have to define the target variable. For demonstration purposes, let's check whether a ticket is assigned to a particular team. We'll build a predicate for it:

In [None]:
target = tickets['team'].apply(lambda x : int('3' in x))

In [None]:
target

As we have plenty of data, we'll use the classic train-test scheme.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.4, random_state=42)

### Logistic Regression

Let's start with [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
lr = LogisticRegression()
# CHECK DOCUMENTATION AND ADJUST SOME REASONABLE PARAMETERS

In [None]:
lr.fit(X_train, y_train)

In [None]:
predicted = lr.predict(X_test)

Examine the classification result using a [Confusion Matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html).

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
cm = confusion_matrix(y_test, predicted)

In [None]:
cm

Take a look at the false positive / false negative statistics.

In [None]:
tn, fp, fn, tp = cm.ravel()

In [None]:
(tn, fp, fn, tp)

And check the [F1 score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html), which is a reasonable metric for imbalanced data.

In [None]:
from sklearn.metrics import f1_score

In [None]:
f1_score(y_test, predicted)
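To connect the confusion-matrix counts with the F1 score, the sketch below computes precision, recall and F1 directly from `tp`, `fp` and `fn`. The helper name `f1_from_counts` is hypothetical, and the numbers are invented for illustration; they are not results from the ticket data.

```python
def f1_from_counts(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: precision = 30/40, recall = 30/50
print(f1_from_counts(tp=30, fp=10, fn=20))
```

True negatives do not appear anywhere in the formula, so a large majority class cannot inflate the score the way it inflates accuracy.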

And, as a final step, let's see which words are most important in the logistic regression model:

In [None]:
def visualize_primary_features(features, feature_names, n_top_features=25):
    # inspired by https://github.com/Yorko/mlcourse.ai/blob/master/jupyter_english/topic04_linear_models/topic4_linear_models_part4_good_bad_logit_movie_reviews_XOR.ipynb
    # get the indices of the largest (most positive) coefficients
    positive_coefficients = np.argsort(features)[-n_top_features:]
    # plot them
    plt.figure(figsize=(15, 5))
    colors = ['red']*len(positive_coefficients)
    plt.bar(np.arange(n_top_features), features[positive_coefficients], color=colors)
    feature_names = np.array(feature_names)
    plt.xticks(np.arange(0, n_top_features), feature_names[positive_coefficients], rotation=60, ha="right");


In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
visualize_primary_features(lr.coef_.ravel(), cv.get_feature_names_out())

### Decision Tree

For comparison, we'll train another model here: a [Decision Tree classifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
tree = DecisionTreeClassifier()
# CHECK DOCUMENTATION AND ADJUST SOME REASONABLE PARAMETERS

In [None]:
tree.fit(X_train, y_train)

In [None]:
tree_predicted = tree.predict(X_test)

And repeat the model examination phase.

In [None]:
cm_tree = confusion_matrix(y_test, tree_predicted)

In [None]:
tn, fp, fn, tp = cm_tree.ravel()

In [None]:
(tn, fp, fn, tp)

In [None]:
f1_score(y_test, tree_predicted)

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
visualize_primary_features(tree.feature_importances_, cv.get_feature_names_out())

## Additional Tasks

**Task 1: examine model parameters and use 3-split strategy to choose the best ones**
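Assuming "3-split" refers to a train / validation / test split (this reading is an interpretation, not something the task states), one way to obtain it is to call `train_test_split` twice. The arrays below are synthetic placeholders and the `_demo` names are hypothetical; the 60/20/20 proportions are only an example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 2

# First cut off the test set, then split the remainder into train and validation
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)
print(len(X_train_demo), len(X_val_demo), len(X_test_demo))
```

Parameters would then be compared on the validation part only, with the test part touched once at the very end to report the final score.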

**Task 2: compare the best models using a cross-validation scheme. Note the score variance and decide whether the best model has a statistically significant advantage**
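A possible starting point for the comparison is `cross_val_score`, which returns one score per fold so that both the mean and the spread can be inspected. The sketch below uses the Iris data as a stand-in and accuracy as the metric; both choices, the 5 folds, and the model settings are assumptions for illustration. For the ticket data you would pass your own features and target, and probably `scoring="f1"`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_cv, y_cv = load_iris(return_X_y=True)
candidate_models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
}
cv_scores = {}
for name, model in candidate_models.items():
    # One accuracy value per fold; the spread hints at how stable the score is
    cv_scores[name] = cross_val_score(model, X_cv, y_cv, cv=5, scoring="accuracy")
    print(f"{name}: mean={cv_scores[name].mean():.3f}, std={cv_scores[name].std():.3f}")
```

If the difference between the two means is small relative to the per-fold standard deviations, the apparent winner may simply reflect how the folds happened to be drawn.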