
#1.

Consider the subset of (x,y) points shown below. These are actually a subset of the data points found in the sklearn diabetes dataset.

x | y
--- | ---
 0.08 | 233
-0.04 | 91
 0.01 | 111
-0.04 | 152
-0.03 | 120
 0.01 | 67
 0.09 | 310
-0.03 | 94
-0.06 | 183
-0.03 | 66
 0.06 | 173
-0.06 | 72
 0.00 | 49
-0.02 | 64
-0.07 | 48


For a candidate linear regressor with parameters $\Theta_0 = 75.1$ and $\Theta_1 = -0.001$, calculate the mean squared error with respect to the data points and perform one iteration of gradient descent, assuming $\alpha = 0.01$. Show all of your work.

---

Answer: The mean squared error is 7524.174. $\Theta_0$ is updated by first computing the sum of the errors $y - h(x)$, multiplying this sum by $\alpha$, and adding the result to the previous value, giving $\Theta_0 = 82.165$.

$\Theta_1$ is updated by computing the sum of each error multiplied by the corresponding value of $x$, multiplying the sum by $\alpha$, and adding the result to the previous value, giving $\Theta_1 = 0.312$. The update moves the regressor closer to the optimum, as indicated by the new mean squared error of 6907.257.
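The calculation can be checked with a short script. This is a minimal sketch in plain Python, assuming the error is defined as $y - h(x)$ and that the update uses the summed (not averaged) errors, as the answer describes; the function names `mse` and `gradient_step` are made up for illustration. Because the $x$ values in the table are rounded to two decimals, the printed numbers may differ slightly in the last digits from the figures quoted above.

```python
# (x, y) pairs copied from the table above.
points = [
    (0.08, 233), (-0.04, 91), (0.01, 111), (-0.04, 152), (-0.03, 120),
    (0.01, 67), (0.09, 310), (-0.03, 94), (-0.06, 183), (-0.03, 66),
    (0.06, 173), (-0.06, 72), (0.00, 49), (-0.02, 64), (-0.07, 48),
]


def mse(theta0, theta1, data):
    """Mean squared error of the line h(x) = theta0 + theta1 * x."""
    return sum((y - (theta0 + theta1 * x)) ** 2 for x, y in data) / len(data)


def gradient_step(theta0, theta1, data, alpha):
    """One batch update using summed (not averaged) errors, as in the answer."""
    errors = [y - (theta0 + theta1 * x) for x, y in data]
    new_theta0 = theta0 + alpha * sum(errors)
    new_theta1 = theta1 + alpha * sum(e * x for e, (x, _) in zip(errors, data))
    return new_theta0, new_theta1


theta0, theta1 = 75.1, -0.001
print("MSE before:", round(mse(theta0, theta1, points), 3))
theta0, theta1 = gradient_step(theta0, theta1, points, alpha=0.01)
print("updated parameters:", round(theta0, 3), round(theta1, 3))
print("MSE after:", round(mse(theta0, theta1, points), 3))
```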

---

#2.

Explain what value is generated by the equation $h(x) = \frac{1}{1+e^{-(\Theta_0 + \Theta_1 x)}}$ in logistic regression. What steps are taken to convert the linear regression algorithm into the classification-based logistic regression algorithm?

---

Answer: $h(x)$ represents the probability that $x$ belongs to the positive class. This value is typically compared to a threshold to determine whether to output $+$ or $-$.

Like linear regression, logistic regression learns parameters $\Theta_0$ and $\Theta_1$. Linear regression learns a line, but this is not effective for classification when the line dips below 0 or rises above 1. For this reason, we pass the linear output through a sigmoid function so that the hypothesis flattens out as it approaches $y=0$ and $y=1$. Unlike linear regression, the output represents a probability that $y=1$ rather than a continuous value of $y$. Additionally, we have to modify the linear regression cost function to make it convex so we can apply gradient descent. The cost function for linear regression is the sum of squared errors. Logistic regression uses logistic loss, or $Cost(h(x), y) = -y \log(h(x)) - (1-y) \log(1-h(x))$.
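The hypothesis, the thresholding step, and the logistic loss can be sketched in a few lines of plain Python. The parameter values and the threshold of 0.5 below are hypothetical, chosen only to show how the sigmoid squashes the linear score into the interval (0, 1); the function names are likewise made up for this illustration.

```python
import math


def h(x, theta0, theta1):
    """Sigmoid of the linear score: estimated probability that y = 1."""
    return 1.0 / (1.0 + math.exp(-(theta0 + theta1 * x)))


def logistic_cost(p, y):
    """Logistic loss for one example with label y in {0, 1} and prediction p."""
    return -y * math.log(p) - (1 - y) * math.log(1 - p)


def classify(x, theta0, theta1, threshold=0.5):
    """Map the probability to a class label by thresholding."""
    return "+" if h(x, theta0, theta1) >= threshold else "-"


# Hypothetical parameters, chosen only to illustrate the shape of the function.
theta0, theta1 = -1.0, 2.0
for x in (-3.0, 0.5, 4.0):
    p = h(x, theta0, theta1)
    print(x, round(p, 3), classify(x, theta0, theta1), round(logistic_cost(p, 1), 3))
```

The loss grows without bound as a confident prediction turns out to be wrong, which is what penalizes the model more heavily than squared error would.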

---

#3.

Regularization is introduced in class as a numeric term that can be incorporated into a loss function. This is straightforward for algorithms such as neural networks that seek to optimize a numeric loss function. Here, explain possible ways regularization can be used to fine tune other types of algorithms, specifically decision trees and naive Bayes classifiers.

---

Answer: For decision trees, regularization takes the form of pruning the learned tree. Aggressive pruning will result in underfitting the data, while little to no pruning may result in overfitting the data.
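One common way to make the link to a loss-function penalty explicit is cost-complexity pruning, as used in CART. The notation here is introduced for illustration and is not taken from the question: $R(T)$ is the training error of tree $T$, $|\tilde{T}|$ is its number of leaves, and $\alpha \ge 0$ is the regularization strength.

$$R_\alpha(T) = R(T) + \alpha\,|\tilde{T}|$$

Pruning selects the subtree that minimizes $R_\alpha(T)$. With $\alpha = 0$ the full tree is kept, and larger values of $\alpha$ favor smaller trees. In sklearn this strength is exposed as the `ccp_alpha` parameter of `DecisionTreeClassifier`.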

Naive Bayes normally includes every feature when it computes the most likely output. $L_1$ regularization offers a valuable variable selection property that we can use to prune irrelevant and/or redundant predictors.

---

#4.

The goal of this program is to give you detailed experience with naive Bayes classifiers as well as with text mining methods and libraries. In this program, you are tasked with writing a naive Bayes classifier to classify phone messages as spam versus not spam (ham). To test your program, use the labeled data available from the UCI machine learning repository at https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection. The data is also available at https://drive.google.com/file/d/1nn2baOauApGbxrOeTb4l-30Qz1prPIS3/view?usp=sharing.

You will need to read in each message and convert it to a set of features. Use the sklearn libraries to

- remove punctuation
- convert to lower case
- create a bag of words vector that stores term frequency
- filter out words from the stop list (stop_words)
- include 1-grams and 2-grams
- normalize frequencies based on document length (tfidf)
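The steps in this list map fairly directly onto the options of sklearn's `TfidfVectorizer`. The following is a sketch under a few assumptions rather than the required solution: it relies on the vectorizer's default token pattern (which keeps only runs of two or more word characters, so punctuation is dropped), on sklearn's built-in English stop list, and on two made-up messages.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up messages, used only to show the transformation.
messages = ["Win the CASH prize!", "Lunch tomorrow? Win or lose, lunch is on me."]

vectorizer = TfidfVectorizer(
    lowercase=True,          # convert to lower case
    stop_words="english",    # drop words on sklearn's built-in English stop list
    ngram_range=(1, 2),      # include 1-grams and 2-grams
    norm="l2",               # scale each document vector to unit length
)
X = vectorizer.fit_transform(messages)

print(sorted(vectorizer.vocabulary_))
print(X.shape)
```

Stop words are removed before the 2-grams are formed, so a 2-gram can join two words that were separated by a stop word in the original message.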

Report performance of your classifier on 3-fold cross validation in terms of accuracy and macro f1 score.

Additionally, notice that the class distribution is imbalanced. There are 4,827 legitimate messages and 747 spam messages. Experiment with three alternative methods for addressing this issue and report impact of each method on performance.

- Undersample the majority class so they are balanced.
- Oversample the minority class so they are balanced.
- Weight the data points based on class imbalance.
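The three strategies can be sketched without any library support. The helper names below (`undersample`, `oversample`, `balanced_weights`) are hypothetical, and the label list is synthetic, built only from the class counts stated above. The weighting formula, n_samples / (n_classes * class_count), is the one sklearn's `compute_sample_weight("balanced", y)` uses.

```python
import random
from collections import Counter


def undersample(labels, seed=0):
    """Return indices keeping only as many examples per class as the rarest class has."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    n = min(len(idx) for idx in by_class.values())
    return sorted(i for idx in by_class.values() for i in rng.sample(idx, n))


def oversample(labels, seed=0):
    """Return indices repeating minority examples (with replacement) up to the majority size."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    n = max(len(idx) for idx in by_class.values())
    picked = []
    for idx in by_class.values():
        picked.extend(idx + rng.choices(idx, k=n - len(idx)))
    return sorted(picked)


def balanced_weights(labels):
    """Weight each example by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    return [len(labels) / (len(counts) * counts[label]) for label in labels]


# Synthetic labels with the class counts given in the assignment.
labels = ["ham"] * 4827 + ["spam"] * 747
print(Counter(labels[i] for i in undersample(labels)))
print(Counter(labels[i] for i in oversample(labels)))
print(sorted(set(balanced_weights(labels))))
```

When these are combined with cross validation, resampling or weighting only the training portion of each fold keeps duplicated or discarded examples from distorting the test scores.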

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.utils.class_weight import compute_sample_weight
from nltk.tokenize import TreebankWordTokenizer
import numpy as np
import pandas as pd
import pprint
import string
import copy

from google.colab import drive
drive.mount('/content/gdrive')


def read_data():
  # each line of the file is "<label>\t<message>"
  data = []
  y = []
  with open('/content/gdrive/My Drive/ML/HW/SMSSpamCollection', encoding='utf-8') as infile:
    for line in infile:
      vector = line.strip().lower().split('\t')
      data.append(vector[1])
      y.append(vector[0])
  return data, y


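def learn(X, y, weights):
  # train a naive Bayes classifier on data and report 3-fold cross-validation scores
  # note: cross_val_score clones clf and refits it on each training fold
  # without sample_weight, so the weights passed to fit() here do not
  # affect the reported scores
  clf = MultinomialNB().fit(X, y, sample_weight=weights)
  scores = cross_val_score(clf, X, y, cv=3, scoring='accuracy')
  print('Accuracy over 3 folds:', np.mean(scores))
  scores = cross_val_score(clf, X, y, cv=3, scoring='f1_macro')
  print('Macro f1 score over 3 folds:', np.mean(scores))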


def main():
  data, y = read_data()
  # tokenize with NLTK's Treebank tokenizer; include 1-grams and 2-grams
  tokenizer = TreebankWordTokenizer()
  count_vect = CountVectorizer(tokenizer=tokenizer.tokenize, ngram_range=(1, 2))

  X_counts = count_vect.fit_transform(data)

  # normalize counts based on document length
  # weight common words less (is, a, an, the)
  tfidf_transformer = TfidfTransformer()
  X_tfidf = tfidf_transformer.fit_transform(X_counts)

  # learn with equal weights, no sampling
  print('unweighted')
  weights = np.full(len(y), 1.0, dtype=float)
  learn(X_tfidf, y, weights)

  # learn with weights inversely proportional to class size
  weights = compute_sample_weight("balanced", y)
  print('weights', weights)
  learn(X_tfidf, y, weights)

  return


main()

unweighted
Accuracy over 3 folds: 0.923394330821672
Macro f1 score over 3 folds: 0.7786465483571515
weights [0.57737725 0.57737725 3.73092369 ... 0.57737725 0.57737725 0.57737725]
Accuracy over 3 folds: 0.923394330821672
Macro f1 score over 3 folds: 0.7786465483571515
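
The unweighted and weighted runs above report identical scores because `cross_val_score` clones the estimator and refits it on each training fold without any sample weights, so the weights given to the initial `fit` call are discarded. One way to let the weights take effect is to run the folds by hand. The sketch below assumes that approach; `weighted_cv` is a hypothetical helper, not part of the original program, and its scores on the SMS data are not reported here.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.utils.class_weight import compute_sample_weight


def weighted_cv(X, y, n_splits=3, balanced=True):
    """Cross-validate MultinomialNB, computing sample weights inside each training fold."""
    y = np.asarray(y)
    folds = StratifiedKFold(n_splits=n_splits)
    accs, f1s = [], []
    for train_idx, test_idx in folds.split(X, y):
        # Weights come from the training fold only; None means equal weights.
        weights = compute_sample_weight("balanced", y[train_idx]) if balanced else None
        clf = MultinomialNB().fit(X[train_idx], y[train_idx], sample_weight=weights)
        pred = clf.predict(X[test_idx])
        accs.append(accuracy_score(y[test_idx], pred))
        f1s.append(f1_score(y[test_idx], pred, average="macro"))
    return float(np.mean(accs)), float(np.mean(f1s))
```

Inside `main()`, a call such as `weighted_cv(X_tfidf, y)` could stand in for `learn(X_tfidf, y, weights)`; row indexing with an index array works for both NumPy arrays and the sparse matrix returned by the TF-IDF transformer.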
