# Applying logistic regression and SVM

## KNN classification

In this exercise you'll explore a subset of the [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/). The variables `X_train`, `X_test`, `y_train`, and `y_test` are already loaded into the environment. The X variables contain features based on the words in the movie reviews, and the y variables contain labels for whether the review sentiment is positive (+1) or negative (-1). In the local copy loaded below (`labeledTrainData.tsv`), the `sentiment` column uses 1 for positive and 0 for negative instead.

This course touches on a lot of concepts you may have forgotten, so if you ever need a quick refresher, download the Scikit-Learn Cheat Sheet and keep it handy!

**Instructions**

* Create a KNN model with default hyperparameters.
* Fit the model.
* Print out the prediction for the test example 0.

In [8]:
import pandas as pd

# Load the labeled reviews from a tab-separated file
movie_rev=pd.read_table("labeledTrainData.tsv")
movie_rev

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...
...,...,...,...
24995,3453_3,0,It seems like more consideration has gone into...
24996,5064_1,0,I don't believe they made this film. Completel...
24997,10905_3,0,"Guy is a loser. Can't get girls, needs to buil..."
24998,10194_3,0,This 30 minute documentary Buñuel made in the ...


In [12]:
#movie_rev_train=movie_rev.iloc[0:2000,:]
X_train,y_train=movie_rev['review'][0:2000],movie_rev['sentiment'][0:2000]

print(X_train.shape)
print(y_train.shape)

(2000,)
(2000,)


In [13]:
#X_test=pd.read_table("testData.tsv")
X_test,y_test=movie_rev['review'][2000:4000],movie_rev['sentiment'][2000:4000]

print(X_test.shape)
print(y_test.shape)

(2000,)
(2000,)


### Bag of Words

In [19]:

from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer(lowercase=True,token_pattern = '(?u)\\b\\w\\w+\\b',stop_words='english')

# Fit the training data and then return the sparse matrix
training_data = count_vector.fit_transform(X_train)

# Transform testing data and return the sparse matrix. Note we are not fitting the testing data into the CountVectorizer()
testing_data = count_vector.transform(X_test)
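The difference between `fit_transform` on the training text and `transform` on the test text is easier to see on a tiny corpus. The sketch below uses made-up sentences (`train_docs`, `test_docs`, and `toy_vectorizer` are hypothetical names, not part of the notebook) and default `CountVectorizer` settings rather than the ones configured above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only
train_docs = ["the movie was great", "the movie was terrible", "great acting"]
test_docs = ["great movie with unseen words"]

toy_vectorizer = CountVectorizer()

# fit_transform learns the vocabulary from the training documents
train_counts = toy_vectorizer.fit_transform(train_docs)

# transform reuses that vocabulary; words never seen in training are dropped
test_counts = toy_vectorizer.transform(test_docs)

vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)
print(train_counts.toarray())
print(test_counts.toarray())
```

Only "great" and "movie" from the test sentence are counted, because the other words are not in the vocabulary learned from the training documents.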

In [22]:
training_data

<2000x25549 sparse matrix of type '<class 'numpy.int64'>'
	with 181607 stored elements in Compressed Sparse Row format>

In [23]:
from sklearn.neighbors import KNeighborsClassifier

# Create and fit the model
knn = KNeighborsClassifier()
knn.fit(training_data,y_train)

# Predict on the test features, print the results
pred = knn.predict(testing_data[0])
print("Prediction for test example 0:", pred)

Prediction for test example 0: [1]


In [28]:
movie_rev.iloc[2000:2001,:]

Unnamed: 0,id,sentiment,review
2000,1766_10,1,I got a free pass to a preview of this movie l...


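The prediction is correct: the model predicts 1 for test example 0, and the true `sentiment` label for that review is also 1.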
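To make concrete what a KNN prediction is doing, here is a minimal sketch on invented 2D points (`X_toy`, `y_toy`, and `toy_knn` are hypothetical and unrelated to the movie review data). It checks that the predicted label is the majority label among the nearest neighbors.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated clusters, invented for illustration
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_knn = KNeighborsClassifier(n_neighbors=3)
toy_knn.fit(X_toy, y_toy)

query = np.array([[0.2, 0.2]])

# kneighbors returns distances and indices of the nearest training points
_, neighbor_idx = toy_knn.kneighbors(query)
neighbor_labels = y_toy[neighbor_idx[0]]

# The prediction is the most common label among those neighbors
majority = np.bincount(neighbor_labels).argmax()
toy_pred = toy_knn.predict(query)

print("Neighbor labels:", neighbor_labels)
print("Majority label:", majority, "| predict():", toy_pred[0])
```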

An example of overfitting, where the model fits the training data far better than unseen data:
* Training accuracy 95%, testing accuracy 50%

**Logistic Regression Example**

In [31]:
import sklearn.datasets
wine = sklearn.datasets.load_wine()
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver='liblinear')
lr.fit(wine.data, wine.target)
lr.score(wine.data, wine.target)

0.9719101123595506

`predict_proba` returns the model's confidence in each prediction as class probabilities.

In [6]:
wine.data.shape

(178, 13)

In [9]:
lr.predict_proba(wine.data[:1])

array([[9.95108707e-01, 4.35738010e-03, 5.33913399e-04]])

The classifier reports over 99% confidence on the first class label and low probabilities for the other 2.
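As a rough sketch of how these probabilities relate to the predicted label, the block below checks two properties: each row of `predict_proba` sums to 1, and the predicted class is the one with the highest probability. It is an illustration under assumptions, not the notebook's model: it uses the default solver with standardized features (`proba_model` is a hypothetical name) rather than `solver='liblinear'`.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

wine_X, wine_y = load_wine(return_X_y=True)

# Assumption: standardize features and use the default solver
proba_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
proba_model.fit(wine_X, wine_y)

proba = proba_model.predict_proba(wine_X)

# One probability per class, so each row sums to 1
row_sums = proba.sum(axis=1)

# Picking the most probable class reproduces predict()
labels_from_proba = proba_model[-1].classes_[np.argmax(proba, axis=1)]

print(proba[:1])
print(np.allclose(row_sums, 1.0))
print(np.array_equal(labels_from_proba, proba_model.predict(wine_X)))
```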

In scikit-learn, the basic SVM classifier is `LinearSVC`.

**Using Linear SVC**

In [24]:
import sklearn.datasets
#wine = sklearn.datasets.load_wine()
from sklearn.svm import LinearSVC
svm = LinearSVC()
svm.fit(wine.data, wine.target)
svm.score(wine.data, wine.target)



0.5449438202247191
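One plausible reason for the low training accuracy above is that the wine features are on very different scales, which makes the linear SVM hard to optimize. This is an assumption; the notebook does not investigate it. A minimal sketch of the same fit with standardized features (`scaled_svm` is a hypothetical name, and `max_iter=10000` is an arbitrary generous limit):

```python
from sklearn.datasets import load_wine
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

wine_X, wine_y = load_wine(return_X_y=True)

# Standardize each feature to zero mean and unit variance before the SVM
scaled_svm = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000))
scaled_svm.fit(wine_X, wine_y)

scaled_train_acc = scaled_svm.score(wine_X, wine_y)
print("Training accuracy with scaling:", scaled_train_acc)
```

Putting the scaler inside a pipeline means that, with a train/test split, the scaling statistics are learned from the training data only.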

**Using Non Linear SVM**

A nonlinear SVM is more flexible than a linear one, so it can overfit the training data.

In [25]:
from sklearn.svm import SVC
svm = SVC() # default hyperparameters
svm.fit(wine.data, wine.target)
svm.score(wine.data, wine.target)

0.7078651685393258

* Underfitting: model is too simple, low training accuracy.
* Overfitting: model is too complex, high training accuracy but low test accuracy.
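The gap between training and test accuracy can be illustrated with an RBF-kernel `SVC` on the handwritten digits data by varying `gamma`. This is a sketch under assumptions: the two `gamma` values and `random_state=0` are arbitrary choices for illustration, and `gamma_results` is a hypothetical name.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_d, y_d = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, random_state=0)

gamma_results = {}
for gamma in (0.001, 1.0):
    # Larger gamma makes the decision boundary more flexible
    model = SVC(gamma=gamma).fit(X_tr, y_tr)
    gamma_results[gamma] = {
        "train": model.score(X_tr, y_tr),
        "test": model.score(X_te, y_te),
    }
    print(gamma, gamma_results[gamma])
```

With the larger `gamma`, the model should score far better on the training set than on the test set, which is the overfitting pattern described above.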

## Running LogisticRegression and SVC

In this exercise, you'll apply logistic regression and a support vector machine to classify images of handwritten digits.

**Instructions**

* Apply logistic regression and SVM (using SVC()) to the handwritten digits data set using the provided train/validation split.
* For each classifier, print out the training and validation accuracy.

In [31]:
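import random

# Note: random.seed() only seeds Python's built-in generator. train_test_split
# draws from NumPy's, so pass random_state=... to get a reproducible split.
random.seed(42)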
from sklearn import datasets
from sklearn.model_selection import train_test_split 
digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target)

# Apply logistic regression and print scores
lr = LogisticRegression()
lr.fit(X_train,y_train)
print('Logistic train score ',lr.score(X_train,y_train))
print('Logistic test score ',lr.score(X_test,y_test))

# Apply SVM and print scores
svm = SVC()
svm.fit(X_train,y_train)
print('SVM train score ',svm.score(X_train,y_train))
print('SVM test score ',svm.score(X_test,y_test))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Logistic train score  1.0
Logistic test score  0.9466666666666667
SVM train score  0.9962880475129918
SVM test score  0.9911111111111112
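The warning printed above suggests two remedies: raising `max_iter` or scaling the data. A minimal sketch that applies both follows. The value `max_iter=1000`, the `random_state=42` split, and the name `scaled_lr` are assumptions for illustration, so the scores will not match the run above.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_dig, y_dig = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_dig, y_dig, random_state=42)

# Scale the pixel features and give the solver a larger iteration budget
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)

lr_train_acc = scaled_lr.score(X_tr, y_tr)
lr_test_acc = scaled_lr.score(X_te, y_te)

print("Iterations used:", scaled_lr[-1].n_iter_)
print("Logistic train score", lr_train_acc)
print("Logistic test score", lr_test_acc)
```

Comparing the printed iteration count with `max_iter` shows whether the solver stopped before hitting the limit.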


## Sentiment analysis for movie reviews

In this exercise you'll explore the probabilities that logistic regression outputs on a subset of the Large Movie Review Dataset.

The variables X and y are already loaded into the environment. X contains features based on the number of times words appear in the movie reviews, and y contains labels for whether the review sentiment is positive (+1) or negative (-1).

**Instructions**

* Train a logistic regression model on the movie review data.
* Predict the probabilities of negative vs. positive for the two given reviews.
* Feel free to write your own reviews and get probabilities for those too!

In [56]:
def get_features(rev):
    # Vectorize a single review string with the CountVectorizer fitted earlier
    return count_vector.transform([rev])


In [57]:

# Instantiate logistic regression and train
lr = LogisticRegression()
lr.fit(training_data,y_train)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()

In [60]:
review1 = "LOVED IT! This movie was amazing. Top 10 this year."
[review1]

['LOVED IT! This movie was amazing. Top 10 this year.']

In [58]:
# Predict sentiment for a glowing review
#review1 = "LOVED IT! This movie was amazing. Top 10 this year."
review1_features = get_features(review1)
print("Review:", review1)
print(lr.predict(review1_features))
print("Probability of positive review:", lr.predict_proba(review1_features)[0,1])


Review: LOVED IT! This movie was amazing. Top 10 this year.
[1]
Probability of positive review: 0.8604539102359844


In [59]:

# Predict sentiment for a poor review
review2 = "Total junk! I'll never watch a film by that director again, no matter how good the reviews."
review2_features = get_features(review2)
print("Review:", review2)
print("Probability of positive review:", lr.predict_proba(review2_features)[0,1])

Review: Total junk! I'll never watch a film by that director again, no matter how good the reviews.
Probability of positive review: 0.34915605471632144


The second probability would have been even lower, but the word "good" trips it up a bit, since that's considered a "positive" word.
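The effect of a single "positive" word can be reproduced on a toy example. The sketch below trains on six invented reviews (`toy_reviews`, `toy_sentiment`, and `positive_probability` are hypothetical names, not part of the exercise) and compares the predicted probability for a negative phrase with and without a word that only appears in positive training reviews.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented mini-corpus: 1 = positive, 0 = negative
toy_reviews = [
    "amazing movie loved it",
    "great film amazing acting",
    "loved the great story",
    "total junk terrible movie",
    "boring film terrible acting",
    "junk story boring",
]
toy_labels = [1, 1, 1, 0, 0, 0]

toy_sentiment = make_pipeline(CountVectorizer(), LogisticRegression())
toy_sentiment.fit(toy_reviews, toy_labels)

def positive_probability(text):
    # Column 1 of predict_proba corresponds to the positive class (label 1)
    return toy_sentiment.predict_proba([text])[0, 1]

p_negative = positive_probability("total junk")
p_mixed = positive_probability("total junk great")

print("P(positive | 'total junk'):", p_negative)
print("P(positive | 'total junk great'):", p_mixed)
```

Because a bag-of-words model scores each word independently of context, adding "great" raises the positive probability even though the phrase is still negative overall.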