
#### Week 4: Machine Learning 

**OBJECTIVES**
- Implement the Pegasos algorithm
- Compare the Perceptron, Average Perceptron, and Pegasos algorithms for classification
- Explore Perceptron and SVM with `scikit-learn`

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd 

from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, GridSearchCV

#### PEGASOS Algorithm

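if $y^{(i)}(\theta \cdot x^{(i)} + \theta_0) \leq 1$:

  $$\theta = (1 - \eta \lambda)\theta + \eta y^{(i)}x^{(i)}$$

  $$\theta_0 = \theta_0 + \eta y^{(i)}$$

else:

  $$\theta = (1 - \eta \lambda)\theta$$

Here $\lambda$ is the regularization parameter (`L` in the code), $\eta$ is the step size, and the offset $\theta_0$ is left unchanged in the `else` case.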

```
pegasos_single_step_update input:
feature_vector: [-0.44344863 -0.415128    0.33549888  0.23596999  0.1697304  -0.19186354
  0.10594417  0.10680173  0.08120402 -0.34161713]
label: 1
L: 0.9560342718892494
eta: 0.9478274870593494
theta: [ 0.13847134  0.30366944  0.3602441   0.00906745 -0.1279397   0.43571169
  0.00206729  0.4012078   0.37103297 -0.13598557]
theta_0: 1.9819714009584377
pegasos_single_step_update output is (['0.0129948', '0.0284977', '0.0338069', '0.0008509', '-0.0120064', '0.0408891', '0.0001940', '0.0376511', '0.0348194', '-0.0127615'], '1.9819714')
```

In [2]:
import numpy as np

In [11]:
feature_vector = [-0.44344863 , -0.415128 ,  0.33549888 , 0.23596999,  0.1697304 , -0.19186354, 0.10594417 , 0.10680173 , 0.08120402, -0.34161713]
label = 1
L =  0.9560342718892494
eta =  0.9478274870593494
theta =  [ 0.13847134 , 0.30366944 , 0.3602441  , 0.00906745 ,-0.1279397   ,0.43571169, 0.00206729,  0.4012078 ,  0.37103297, -0.13598557]
theta_0 = 1.9819714009584377

theta = np.array(theta) 
feature_vector = np.array(feature_vector)

In [12]:
#test condition
label * (theta@feature_vector)

-0.05012484554876653

In [13]:
#include bias term
label * (theta@feature_vector + theta_0)

1.931846555409671

In [14]:
#implement the update
theta = (1-eta*L)*theta
theta

array([ 0.01299477,  0.02849769,  0.03380691,  0.00085093, -0.01200643,
        0.04088912,  0.000194  ,  0.03765112,  0.03481938, -0.01276149])

**EXAMPLE 2**

Test the example below to be sure your algorithm is performing correctly.

```
feature_vector: [ 0.06631939 -0.3264986  -0.10624803 -0.09306145  0.12834772  0.06241767
 -0.12983977 -0.02200566 -0.4283686  -0.4063009 ]
label: -1
L: 0.4399370508011895
eta: 0.3521151798545281
theta: [ 0.10688052  0.2692996  -0.48399076 -0.21459193 -0.07127003 -0.31128647
 -0.29462128 -0.37442059  0.12293973  0.40547262]
theta_0: -0.8768832328685103
pegasos_single_step_update output is (['0.0903238', '0.2275828', '-0.4090165', '-0.1813498', '-0.0602297', '-0.2630655', '-0.2489819', '-0.3164197', '0.1038953', '0.3426615'], '-0.8768832')
```

In [20]:
feature_vector = [ 0.06631939 ,-0.3264986 , -0.10624803 ,-0.09306145 , 0.12834772 , 0.06241767, -0.12983977 , -0.02200566,  -0.4283686 , -0.4063009 ]
label = -1
L = 0.4399370508011895
eta = 0.3521151798545281
theta = [ 0.10688052 , 0.2692996 , -0.48399076, -0.21459193, -0.07127003 ,-0.31128647,-0.29462128, -0.37442059,  0.12293973 , 0.40547262]
theta_0 = -0.8768832328685103

feature_vector = np.array(feature_vector)
theta = np.array(theta)

In [21]:
label * (theta@feature_vector)

0.20893599157672338

In [22]:
label * (theta@feature_vector + theta_0)

1.0858192244452338

In [23]:
theta = (1-eta*L)*theta
theta

array([ 0.09032382,  0.2275828 , -0.40901647, -0.18134981, -0.0602297 ,
       -0.26306555, -0.24898194, -0.31641965,  0.10389532,  0.34266146])

**Pegasos Single Step**

In [28]:
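def pegasos_single_step_update(feature_vector, label, L, eta, theta, theta_0):
    if label * (theta @ feature_vector + theta_0) > 1:
        # margin above 1: only shrink theta, leave theta_0 unchanged
        theta = (1 - eta * L) * theta
    else:
        # margin at most 1: shrink theta, then step toward the example
        theta = (1 - eta * L) * theta + eta * label * feature_vector
        theta_0 = theta_0 + eta * label
    return theta, theta_0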
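
As a quick check, the sketch below re-declares the function so that it runs on its own and compares its result with the expected output of the first example above. The second case uses made-up values (an assumption, not part of the original exercise) to exercise the branch that neither of the two given examples reaches, since both have a margin above 1.

```python
import numpy as np

def pegasos_single_step_update(feature_vector, label, L, eta, theta, theta_0):
    if label * (theta @ feature_vector + theta_0) > 1:
        theta = (1 - eta * L) * theta
    else:
        theta = (1 - eta * L) * theta + eta * label * feature_vector
        theta_0 = theta_0 + eta * label
    return theta, theta_0

# Example 1 from above: the margin is about 1.93, so theta is only shrunk.
x = np.array([-0.44344863, -0.415128, 0.33549888, 0.23596999, 0.1697304,
              -0.19186354, 0.10594417, 0.10680173, 0.08120402, -0.34161713])
theta = np.array([0.13847134, 0.30366944, 0.3602441, 0.00906745, -0.1279397,
                  0.43571169, 0.00206729, 0.4012078, 0.37103297, -0.13598557])
new_theta, new_theta_0 = pegasos_single_step_update(
    x, 1, 0.9560342718892494, 0.9478274870593494, theta, 1.9819714009584377)

expected = np.array([0.0129948, 0.0284977, 0.0338069, 0.0008509, -0.0120064,
                     0.0408891, 0.0001940, 0.0376511, 0.0348194, -0.0127615])
print(np.allclose(new_theta, expected, atol=1e-6))

# Hypothetical case (values chosen for illustration): the margin is 0,
# so both theta and theta_0 move toward the example.
toy_theta, toy_theta_0 = pegasos_single_step_update(
    np.array([1.0, 2.0]), 1, 0.5, 1.0, np.zeros(2), 0.0)
print(toy_theta, toy_theta_0)
```

In the hypothetical case, `theta` starts at zero, so the shrink factor has no effect and the update reduces to `eta * label * feature_vector`.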


**Pegasos**

Now we apply the single-step update to every example, making `T` passes over the data. In addition, the step size shrinks as we go: on the $t$-th update we use $\eta = \frac{1}{\sqrt{t}}$.

In [30]:
def pegasos(feature_matrix, labels, T, L):
    theta = np.zeros(feature_matrix.shape[1])
    theta_0 = 0
    counter = 1
    for n in range(T):
        for i in range(feature_matrix.shape[0]):
            # eta shrinks with the number of updates made so far
            eta = 1 / np.sqrt(counter)
            # update using the i-th row of the data and its label
            theta, theta_0 = pegasos_single_step_update(feature_matrix[i], labels[i], L, eta, theta, theta_0)
            # increment count
            counter = counter + 1
    return theta, theta_0
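
To make the step-size schedule concrete, here is a minimal self-contained sketch that runs one pass over two hypothetical points and compares the result with the same two updates worked out by hand. The data, `T=1`, and `L=0.1` are illustrative assumptions only.

```python
import numpy as np

def pegasos_single_step_update(feature_vector, label, L, eta, theta, theta_0):
    if label * (theta @ feature_vector + theta_0) > 1:
        theta = (1 - eta * L) * theta
    else:
        theta = (1 - eta * L) * theta + eta * label * feature_vector
        theta_0 = theta_0 + eta * label
    return theta, theta_0

def pegasos(feature_matrix, labels, T, L):
    theta = np.zeros(feature_matrix.shape[1])
    theta_0 = 0
    counter = 1
    for n in range(T):
        for i in range(feature_matrix.shape[0]):
            eta = 1 / np.sqrt(counter)
            theta, theta_0 = pegasos_single_step_update(
                feature_matrix[i], labels[i], L, eta, theta, theta_0)
            counter = counter + 1
    return theta, theta_0

# Two made-up points, one per class.
X_toy = np.array([[1.0, 0.0], [-1.0, 0.0]])
y_toy = np.array([1, -1])
theta, theta_0 = pegasos(X_toy, y_toy, T=1, L=0.1)

# By hand: update 1 uses eta = 1 and gives theta = [1, 0], theta_0 = 1.
# Update 2 uses eta = 1/sqrt(2); the margin is 0, so both parameters move.
eta_2 = 1 / np.sqrt(2)
by_hand_theta = np.array([(1 - eta_2 * 0.1) * 1.0 + eta_2, 0.0])
by_hand_theta_0 = 1.0 - eta_2
print(theta, theta_0)
print(by_hand_theta, by_hand_theta_0)
```

Because `counter` keeps increasing across passes, later passes take smaller steps than earlier ones.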

#### Sentiment Analysis

In [33]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [34]:
train = pd.read_csv('/content/drive/My Drive/Colab Notebooks/reviews_train.tsv', encoding = 'ISO-8859-1', sep = '\t')

In [35]:
train.head()

| | sentiment | productId | userId | summary | text | helpfulY | helpfulN |
|---|---|---|---|---|---|---|---|
| 0 | -1 | B000EQYQBO | A2JZVE0Y19VLL0 | blue chips | The chips are okay Not near as flavorful as th... | 0 | 0 |
| 1 | -1 | B000LKVHYC | A3NAKOMAS0I5L9 | Bad even for 'healthy' | I had high hopes for this, but it was bad. Re... | 0 | 0 |
| 2 | -1 | B003QRQRY2 | ARBO3XW14MNGA | Alot of money for one can | I guess it's only one can since there is nothi... | 1 | 1 |
| 3 | -1 | B008EG58V8 | A1IQXGT4MJUYJ8 | The Box says "OATMEAL SQUARES" which I believe... | "Oatmeal Squares" is in about the largest prin... | 0 | 0 |
| 4 | 1 | B004WZZY8M | A2TBL6WAZGXB9P | Delicious! | I really enjoyed this flavor, this has a very ... | 1 | 0 |


In [37]:
test = pd.read_csv('/content/drive/My Drive/Colab Notebooks/reviews_test.tsv', encoding = 'ISO-8859-1', sep = '\t')

In [38]:
val = pd.read_csv('/content/drive/My Drive/Colab Notebooks/reviews_val.tsv', encoding = 'ISO-8859-1', sep = '\t')

#### Bag of Words Representation

In [39]:
from sklearn.feature_extraction.text import CountVectorizer

In [40]:
#instantiate vectorizer
vectorizer = CountVectorizer()

In [41]:
X = train.head()['text']
y = train.head()['sentiment']

In [44]:
#create bag of words
X_vect = vectorizer.fit_transform(X)

In [45]:
#examine in dataframe
pd.DataFrame(X_vect.todense(), columns = vectorizer.get_feature_names_out())

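| | about | all | an | and | are | as | back | bad | bag | be | ... | we | what | wheat | when | which | who | whole | why | wonder | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 2 | 1 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | 2 | ... | 0 | 1 | 3 | 2 | 1 | 1 | 0 | 1 | 1 | 2 |
| 4 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |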


In [46]:
#compare to X
X

0    The chips are okay Not near as flavorful as th...
1    I had high hopes for this, but it was bad.  Re...
2    I guess it's only one can since there is nothi...
3    "Oatmeal Squares" is in about the largest prin...
4    I really enjoyed this flavor, this has a very ...
Name: text, dtype: object

#### Learning with Data

In [None]:
thetas = ''
theta_0 = ''

In [None]:
theta, theta_0 = ''

In [None]:
theta

In [None]:
theta_0

#### Vectorizing all the Data

In [52]:
#training data
cvect = CountVectorizer(max_features=1000, stop_words = 'english')
X_train = train['text']
y_train = train['sentiment']
X_train_vect = cvect.fit_transform(X_train)
X_train_vect

<4000x1000 sparse matrix of type '<class 'numpy.int64'>'
	with 85455 stored elements in Compressed Sparse Row format>

In [53]:
#testing data
X_test = test['text']
y_test = test['sentiment']
X_test_vect = cvect.transform(X_test)
X_test_vect

<500x1000 sparse matrix of type '<class 'numpy.int64'>'
	with 10662 stored elements in Compressed Sparse Row format>

In [62]:
#validation data
X_val = val['text']
y_val = val['sentiment']
X_val_vect = cvect.transform(X_val)
X_val_vect

<500x1000 sparse matrix of type '<class 'numpy.int64'>'
	with 10893 stored elements in Compressed Sparse Row format>

#### Using `sklearn`

`sklearn` models share a common interface, so the workflow is similar from one model to the next. Typically there are three steps:

- Instantiate
- Fit
- Predict
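
The three steps above can be sketched on synthetic data as follows. The two Gaussian clusters and the names `X_toy`, `y_toy`, and `clf` are assumptions made for illustration; the notebook itself applies the same calls to the vectorized reviews.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical data: two well-separated clusters labeled +1 and -1.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(2.0, 0.5, size=(50, 2)),
                   rng.normal(-2.0, 0.5, size=(50, 2))])
y_toy = np.array([1] * 50 + [-1] * 50)

clf = SGDClassifier(random_state=42)  # 1. instantiate
clf.fit(X_toy, y_toy)                 # 2. fit
preds = clf.predict(X_toy)            # 3. predict

# score() reports mean accuracy on the data it is given.
accuracy = clf.score(X_toy, y_toy)
print(preds[:5], accuracy)
```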

In [None]:
#define X and y


In [58]:
#instantiate SGDClassifier
sgd = SGDClassifier(random_state=42)

In [59]:
#fit
sgd.fit(X_train_vect, y_train)

SGDClassifier(random_state=42)

In [60]:
#score on training
sgd.score(X_train_vect, y_train)

0.8985

In [61]:
#score on testing
sgd.score(X_test_vect, y_test)

0.764

In [63]:
#score on validation
sgd.score(X_val_vect, y_val)

0.732

#### Improving the Models

As in the recitation, the model has several parameters worth tuning, including the regularization strength (`alpha`), the penalty, and the loss function.

In [84]:
#SGD Parameters?
sgd = SGDClassifier(random_state=42, alpha=0.0003, loss='hinge')

In [85]:
sgd.fit(X_train_vect, y_train)

SGDClassifier(alpha=0.0003, random_state=42)

In [86]:
#score on training
sgd.score(X_train_vect, y_train)

0.912

In [87]:
#score on testing
sgd.score(X_test_vect, y_test)

0.774

In [88]:
#score on validation
sgd.score(X_val_vect, y_val)

0.744

#### Exercise

Set up an experiment that compares different loss functions and penalty terms. Select the parameters that perform best on the *validation* data, and keep the *test* data for the final score.
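
One way to organize such an experiment is a cross-validated grid search with `GridSearchCV`. The sketch below is only a starting point: it runs on synthetic data from `make_classification`, and the grid values are assumptions chosen for illustration rather than recommended settings. Cross-validation here stands in for a separate validation set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the vectorized reviews.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

# Hypothetical grid over loss, penalty, and regularization strength.
param_grid = {
    "loss": ["hinge", "modified_huber", "perceptron"],
    "penalty": ["l2", "l1", "elasticnet"],
    "alpha": [1e-4, 1e-3, 1e-2],
}

# Each combination is scored by 5-fold cross-validation on the training data.
search = GridSearchCV(SGDClassifier(random_state=42), param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

After the search, `search.best_estimator_` holds a model refit on all of the data passed to `fit`, which can then be scored once on held-out data.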

#### Up Next: The Regression Setting



In [None]:
import seaborn as sns

In [None]:
tips = sns.load_dataset('tips')
tips.head()

In [None]:
#X and y


In [None]:
#parameters


In [None]:
#prediction


In [None]:
#loss


$$w_{i + 1} = w_i + \alpha (y - w_i \cdot X)X$$

In [None]:
#update the weight


#### Using `sklearn`

In [None]:
from sklearn.linear_model import SGDRegressor

In [None]:
#instantiate


In [None]:
#fit


In [None]:
#predict


In [None]:
#score
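
The empty cells above follow the same instantiate, fit, predict pattern used for classification. A minimal sketch with `SGDRegressor`, again on synthetic data rather than the `tips` dataset (the data and variable names are assumptions), might look like this:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Hypothetical data with a known linear relationship.
rng = np.random.default_rng(0)
X_lin = rng.normal(size=(200, 1))
y_lin = 3.0 * X_lin[:, 0] + rng.normal(scale=0.1, size=200)

reg = SGDRegressor(random_state=42)  # instantiate
reg.fit(X_lin, y_lin)                # fit
y_hat = reg.predict(X_lin)           # predict

# For regressors, score() returns R^2 rather than accuracy.
r2 = reg.score(X_lin, y_lin)
print(reg.coef_, reg.intercept_, r2)
```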
