
# Building a classifier with scikit-learn

## Getting the dataset

In [1]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

--2019-04-15 15:53:33--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2019-04-15 15:53:37 (20.3 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



In [0]:
!tar -xzf aclImdb_v1.tar.gz

## Using the precomputed features

In [0]:
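from sklearn.datasets import load_svmlight_file

# Returns a tuple: (sparse bag-of-words matrix, array of star ratings)
dataset = load_svmlight_file("aclImdb/train/labeledBow.feat")
dataset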

(<25000x89527 sparse matrix of type '<class 'numpy.float64'>'
 	with 3456685 stored elements in Compressed Sparse Row format>,
 array([9., 7., 9., ..., 4., 2., 2.]))
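
The `labeledBow.feat` file is stored in the svmlight/LIBSVM text format: each line holds a label followed by `index:value` pairs for the non-zero features only. As a rough illustration of what `load_svmlight_file` reads, here is a minimal pure-Python parser for one line. The function name `parse_svmlight_line` and the sample line are hypothetical; they are not part of scikit-learn or of the dataset.

```python
def parse_svmlight_line(line):
    """Split one svmlight-formatted line into (label, {index: value})."""
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return float(label), features


# Made-up line: a 9-star review in which word 0 appears 7 times
sample = "9 0:7 1:4 47:1"
label, features = parse_svmlight_line(sample)
print(label, features)
```

This matches the output above: the label array holds the star ratings, while the sparse matrix has one row per review and one column per vocabulary word.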

In [0]:
import numpy as np
from sklearn.datasets import load_svmlight_file

def get_xy(file, dictsize=5000, binary=True):
    # Load the sparse bag-of-words matrix and the star ratings
    X, y = load_svmlight_file(file)

    # Never ask for more columns than the file provides
    dictsize = min(dictsize, X.shape[1])

    # Slice while the matrix is still sparse, then densify only the kept columns
    X = X[:, :dictsize].toarray()

    if binary:
        # Ratings above 5 become the positive class (1), the others negative (0)
        y = np.where(y > 5, 1.0, 0.0)

    return X, y

In [0]:
X_train, y_train = get_xy("aclImdb/train/labeledBow.feat")
X_test, y_test = get_xy("aclImdb/test/labeledBow.feat")
print(X_train.shape)
print(y_train.shape)

(25000, 5000)
(25000,)


In [0]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
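
`StandardScaler` learns the mean and standard deviation of each column on the training set and reuses them on the test set, which is why only `transform` is called on `X_test`. Below is a minimal sketch for a single column, with made-up numbers; the helper names `fit_column` and `transform_column` are illustrative only.

```python
import math


def fit_column(values):
    """Return the mean and population standard deviation of one feature column."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    # A constant column has zero spread; divide by 1 to avoid a division by zero
    return mean, (std if std > 0 else 1.0)


def transform_column(values, mean, std):
    """Center and scale values using statistics learned elsewhere."""
    return [(v - mean) / std for v in values]


train_col = [0.0, 2.0, 4.0]  # made-up word counts in three training reviews
test_col = [2.0, 6.0]        # made-up word counts in two test reviews

mean, std = fit_column(train_col)
train_scaled = transform_column(train_col, mean, std)
test_scaled = transform_column(test_col, mean, std)
print(train_scaled, test_scaled)
```

The scaled training column has zero mean by construction. The test column generally does not, because it is scaled with the training statistics.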

In [0]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(C=0.001)
lr.fit(X_train, y_train)

LogisticRegression(C=0.001, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
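
In scikit-learn, `C` is the inverse of the regularization strength, so `C=0.001` applies a strong L2 penalty to the weights. The sketch below illustrates this with a common way of writing the objective: half the squared norm of the weights plus `C` times the summed log loss. It is a simplified illustration with no intercept and hypothetical function names, not the solver's actual code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def penalized_objective(w, X, y, C):
    """0.5 * ||w||^2 + C * sum of per-sample log losses (labels are 0 or 1)."""
    penalty = 0.5 * sum(wj * wj for wj in w)
    data_loss = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        data_loss -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return penalty + C * data_loss


# Toy data: two samples, two features
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1, 0]
w_small, w_large = [0.5, -0.5], [5.0, -5.0]

for C in (0.001, 1000.0):
    prefers_small = penalized_objective(w_small, X, y, C) < penalized_objective(w_large, X, y, C)
    print(C, prefers_small)
```

With a small `C` the penalty term dominates and the small weights give the lower objective. With a large `C` the data term dominates and the large weights, which fit the toy data more confidently, give the lower objective.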

In [0]:
from sklearn.metrics import accuracy_score, log_loss

train_pred = lr.predict(X_train) 
train_pred_proba = lr.predict_proba(X_train)

train_accuracy = accuracy_score(y_train, train_pred)
train_loss = log_loss(y_train, train_pred_proba)

test_pred = lr.predict(X_test)
test_pred_proba = lr.predict_proba(X_test)

test_accuracy = accuracy_score(y_test, test_pred)
test_loss = log_loss(y_test, test_pred_proba)

print("Train Accuracy %.4f - Train Loss %.4f" % (train_accuracy, train_loss)) 
print("Test Accuracy %.4f - Test Loss %.4f" % (test_accuracy, test_loss)) 

Train Accuracy 0.9440 - Train Loss 0.1978
Test Accuracy 0.8776 - Test Loss 0.3126
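
The two metrics printed above can be computed by hand for a binary problem. The sketch below uses made-up labels and probabilities; `p_positive` plays the role of the second column of `predict_proba` (the probability of class 1), and the function names are illustrative.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_log_loss(y_true, p_positive):
    """Mean negative log-likelihood of the true labels (0 or 1)."""
    total = 0.0
    for t, p in zip(y_true, p_positive):
        total -= math.log(p) if t == 1 else math.log(1 - p)
    return total / len(y_true)


y_true = [1, 0, 1, 0]              # made-up labels
p_positive = [0.9, 0.2, 0.4, 0.1]  # made-up probabilities of class 1
y_pred = [int(p >= 0.5) for p in p_positive]

print("Accuracy %.4f - Loss %.4f" % (accuracy(y_true, y_pred), binary_log_loss(y_true, p_positive)))
```

Accuracy only counts whether the thresholded prediction is right, while log loss also penalizes confident mistakes, so the two can move independently.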


## Extracting the features

In [9]:
import nltk
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [22]:
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from os import listdir

tokenizer = RegexpTokenizer(r'\w+')
# Use a separate name so the nltk `stopwords` module is not overwritten
stop_words = set(stopwords.words('english'))

dictionary = set()

for file in listdir("aclImdb/train/pos"):
    # Read a single review
    with open("aclImdb/train/pos/" + file, encoding="utf-8") as review_file:
        review = review_file.read()

    print(review)

    # Lowercase the text and split it into word tokens
    tokens = tokenizer.tokenize(review.lower())

    # Count each non-stop word and add it to the global dictionary
    words = {}
    for word in tokens:
        if word not in stop_words:
            words[word] = words.get(word, 0) + 1
            dictionary.add(word)

    # Stop after the first file: this cell only demonstrates the procedure
    break

print(words)
print(dictionary)

The 1967 In Cold Blood was perhaps more like "the real thing" (Think about it: would we really want to see the real thing?), but it was black and white in a color world, and a lot of people didn't even know what it was, and there was an opportunity to remake it for television. Plus, if you remake it, you can show some stuff not shown in the original. The book In Cold Blood by Truman Capote was the first "nonfiction novel". Truman's book was in fact not 100% true to the real story. I thought the Canadian location sufficed for Kansas pretty much for a TV movie. Look for the elements of sex, drugs and rock 'n' roll: Dick's womanizing, Perry being an aspirin junkie, Perry playing blues guitar.
{'1967': 1, 'cold': 2, 'blood': 2, 'perhaps': 1, 'like': 1, 'real': 3, 'thing': 2, 'think': 1, 'would': 1, 'really': 1, 'want': 1, 'see': 1, 'black': 1, 'white': 1, 'color': 1, 'world': 1, 'lot': 1, 'people': 1, 'even': 1, 'know': 1, 'opportunity': 1, 'remake': 2, 'television': 1, 'plus': 1, 'show': 
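
The same word counts for one review can be sketched with the standard library alone. Here `re.findall(r"\w+", ...)` stands in for `RegexpTokenizer(r'\w+')`, and `STOP_WORDS` is a tiny hypothetical list used in place of NLTK's English stop words.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list standing in for NLTK's English list
STOP_WORDS = {"the", "was", "a", "it", "in", "and", "of"}


def count_words(text, stop_words=STOP_WORDS):
    """Lowercase, split on runs of word characters, drop stop words, count."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)


counts = count_words("The book In Cold Blood was the first nonfiction novel.")
print(counts)
```

`Counter` replaces the manual "increment or initialize" logic of the loop, and its keys are the words that would be added to `dictionary`.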

In [0]:
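from os import listdir
from os.path import join
from sklearn.utils import shuffle


def get_xy(files_path, labels=("pos", "neg")):
    # The first label is mapped to class 1, the second to class 0
    label_map = {labels[0]: 1, labels[1]: 0}

    reviews = []
    y = []

    for label in labels:
        path = join(files_path, label)
        for file in listdir(path):
            # Each file contains the raw text of one review
            with open(join(path, file), encoding="utf-8") as review_file:
                reviews.append(review_file.read())
            y.append(label_map[label])

    # Shuffle so that the two classes are mixed instead of stored one after the other
    reviews, y = shuffle(reviews, y)

    return reviews, y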

In [0]:
reviews_train, y_train = get_xy("aclImdb/train/")
reviews_test, y_test = get_xy("aclImdb/test/")

In [30]:
from sklearn.feature_extraction.text import CountVectorizer

bow = CountVectorizer(max_features=5000)

bow_train = bow.fit_transform(reviews_train)
bow_test = bow.transform(reviews_test)

X_train = bow_train.toarray()
X_test = bow_test.toarray()

X_train.shape

(25000, 5000)
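
`CountVectorizer(max_features=5000)` builds its vocabulary from the training reviews only, keeping the most frequent terms, and then maps any text to a vector of counts over that vocabulary. Words not seen during fitting are ignored at transform time. The following is a simplified pure-Python sketch of that idea; the function names are hypothetical, and tokenization and tie-breaking may differ from scikit-learn's.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())


def fit_vocabulary(docs, max_features):
    """Map the most frequent terms in docs to column indices."""
    totals = Counter(token for doc in docs for token in tokenize(doc))
    kept = sorted(term for term, _ in totals.most_common(max_features))
    return {term: i for i, term in enumerate(kept)}


def transform(docs, vocabulary):
    """Turn each document into a list of counts, one per vocabulary term."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for token in tokenize(doc):
            if token in vocabulary:  # words unseen during fitting are skipped
                row[vocabulary[token]] += 1
        rows.append(row)
    return rows


train_docs = ["great movie great cast", "bad movie", "great plot"]  # made-up corpus
vocab = fit_vocabulary(train_docs, max_features=2)
print(vocab)
print(transform(["great great movie, awful plot"], vocab))
```

Fitting on the training reviews and only transforming the test reviews keeps information from the test set out of the vocabulary, which mirrors the `fit_transform` / `transform` split above.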

In [31]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)



In [32]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(C=0.001)
lr.fit(X_train, y_train)



LogisticRegression(C=0.001, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [33]:
from sklearn.metrics import accuracy_score, log_loss

train_pred = lr.predict(X_train) 
train_pred_proba = lr.predict_proba(X_train)

train_accuracy = accuracy_score(y_train, train_pred)
train_loss = log_loss(y_train, train_pred_proba)

test_pred = lr.predict(X_test)
test_pred_proba = lr.predict_proba(X_test)

test_accuracy = accuracy_score(y_test, test_pred)
test_loss = log_loss(y_test, test_pred_proba)

print("Train Accuracy %.4f - Train Loss %.4f" % (train_accuracy, train_loss)) 
print("Test Accuracy %.4f - Test Loss %.4f" % (test_accuracy, test_loss)) 

Train Accuracy 0.9432 - Train Loss 0.1975
Test Accuracy 0.8772 - Test Loss 0.3125
