# Sentiment Analyse / Opinion Mining:
<br>
    -  ist Teilgebiet der Verarbeitung natürlicher Sprache <b>(Natural Language processing    NLP)</b><br>
-  Texte anhand Tonalität klassifizieren<br>
-  Ziel dieses Notebooks: Unterscheidung Filmbewerteungen in positive oder negative Bewertungen

In [22]:
import pandas as pd
import numpy as np
import pyprind # progress bar library
import os

##### Load data ( ~ 10min CPU)

In [13]:
basepath = 'aclImdb'
pbar = pyprind.ProgBar(50000) # progressbar with 50k steps (= number of total documents in imdb)
labels = {'pos':1, 'neg':0}
df = pd.DataFrame()

for s in {'test', 'train'}:
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        print(path)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r', encoding = 'utf-8') as infile:
                txt = infile.read()
            df = df.append([[txt, labels[l]]], ignore_index = True)
            pbar.update() # update progress bar after each file appending
            
df.columns = ['review', 'sentiment'] # add column names

aclImdb\test\pos


0% [#######                       ] 100% | ETA: 00:01:21

aclImdb\test\neg


0% [###############               ] 100% | ETA: 00:00:59

aclImdb\train\pos


0% [######################        ] 100% | ETA: 00:00:34

aclImdb\train\neg


0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:02:23


##### Save created dataframe to csv-file

In [23]:
np.random.seed(0) # same set of numbers will appear every time
# so far dataset is sorted, permutation randomly shuffles the set
# df = df.reindex(np.random.permutation(df.index)) 
# df.to_csv('movie_data.csv', index = False, encoding = 'utf-8')
df = pd.read_csv('movie_data.csv')
df.head()

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0


##### Bag-of-words-Modell

-  Text muss in numerische Daten umgewandelt werden um von Lernalgorithmen verarbeitet werden zu können
-  Das bag-of-words-Modell wandelt Text in numerischen Merkmalsvektor um:
    - Hierfür wird zuerst ein Vokabular eindeutiger Tokens angelegt
    - Danach wird den Tokens ein Index zugeordnet
    - Die Zahl an Index n gibt die Vorkommenshäufigkeit des entsprechenden Wortes im entsprechenden Dokument an
-  Default zerlegt CountVectorizer in Monogramm (jeder Token des Vokabulars 1 Wort), Bigramme, N-Gramme sind möglich

In [8]:
from sklearn.feature_extraction.text import CountVectorizer # erstellt Bag-of-words Modell

count = CountVectorizer()
docs = np.array(['The sun in shining',
                 'Hello how are you',
                  'Is this true'])

bag = count.fit_transform(docs)
print(count.vocabulary_)
print("")
print('Merkmalsvektor mit Raw term frequencies: \n', bag.toarray())

{'the': 7, 'sun': 6, 'in': 3, 'shining': 5, 'hello': 1, 'how': 2, 'are': 0, 'you': 10, 'is': 4, 'this': 8, 'true': 9}

Merkmalsvektor mit Raw term frequencies: 
 [[0 0 0 1 0 1 1 1 0 0 0]
 [1 1 1 0 0 0 0 0 0 0 1]
 [0 0 0 0 1 0 0 0 1 1 0]]


#### Beurteilung der Wortrelevanz

-  Tf-idf-Maß zur Gewichtung Wörter in Merkmalsvektor (Term frequency / inverse document frequency)
-  Idee dahinter: Häufig Auftauchende Wörter liefern keine Informationen bzw. eignen sich nicht für Unterscheidung der Dokumente, weil sie in allen Texten auftauchen 

In [12]:
from sklearn.feature_extraction.text import TfidfTransformer

# l2-Norm: Normierung liefertn Vektor mit Länge 1
tfidf = TfidfTransformer(use_idf = True, norm = 'l2', smooth_idf = True)
np.set_printoptions(precision = 2) # Float output auf 2 Nachkommastellen
print(tfidf.fit_transform(bag).toarray())


[[0.   0.   0.   0.5  0.   0.5  0.5  0.5  0.   0.   0.  ]
 [0.5  0.5  0.5  0.   0.   0.   0.   0.   0.   0.   0.5 ]
 [0.   0.   0.   0.   0.58 0.   0.   0.   0.58 0.58 0.  ]]


#### Preprocessing / Bereinigung Textdaten

-  Bewertungen beeinhalten:
   - HTML-Codes
   - Emoticons (sinnvoll für Bestimmung Sentiment)
   - Satzzeichen (können sinnvoll sein)

In [22]:
test=df.loc[0,'review'][-50:]
test

'is seven.<br /><br />Title (Brazil): Not Available'

In [90]:
import re # Evtl. nicht die Perfekte Lösung um HTML-Ausdrücke zu parsen, hier ausreichend

def preprocessor(text):
    text = re.sub('<[^>]*>', '', text) # Muster ersetzen
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text) # Emoticons finden und abspeichern
    text = re.sub('[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text
                                                
preprocessor(test)

NameError: name 'test' is not defined

In [28]:
# Anwendung Preprocessor auf gesamten Datensatz
df['review'] = df['review'].apply(preprocessor)
df.to_csv('movie_data_clean.csv', index = False, encoding = 'utf-8')
df.head()

Unnamed: 0,review,sentiment
0,in 1974 the teenager martha moxley maggie grac...,1
1,ok so i really like kris kristofferson and his...,0
2,spoiler do not read this if you think about w...,0
3,hi for all the people who have seen this wonde...,1
4,i recently bought the dvd forgetting just how ...,0


##### Tokenisierung

-  verschiedene Möglichkeiten z.B. Text am Whitespace trennen oder
-  <b>Stemming (Stammformreduktion):</b> Wörter auf Stammform zurückführen, implementiert in NLTK

In [75]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

##### Stoppwörter

-  Wörter wie "is", "and", "has" entfernen

In [36]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner has shoes and runners like running and runs a lot') if w not in stop]

['runner', 'ha', 'shoe', 'runner', 'like', 'run', 'run', 'lot']

## Logistisches Regressionmodell Klassifikation Filmbewertungen

-  Modell zur Klassifizierung
-  gut geeignet bei nicht linear trennbaren Klassen
-  nur für binäre Klassifizierungsaufgaben geeignet [S.85]
-  Rückgabetyp Wahrscheinlichkeiten

In [161]:
import pandas as pd

df = pd.read_csv('movie_data._clean.csv') # Indexspalte nicht mit einlesen
df.head(5)

Unnamed: 0,review,sentiment
0,in 1974 the teenager martha moxley maggie grac...,1
1,ok so i really like kris kristofferson and his...,0
2,spoiler do not read this if you think about w...,0
3,hi for all the people who have seen this wonde...,1
4,i recently bought the dvd forgetting just how ...,0


In [162]:
x = 100

X_train = df.loc[:x, 'review'].values
y_train = df.loc[:x, 'sentiment'].values
X_test = df.loc[x:, 'review'].values
y_test = df.loc[x:, 'sentiment'].values

In [159]:
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
stop = stopwords.words('english') # Stopwörter 

def tokenizer(text):
    return text.split()

porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

def remove_stopwords(text):
    return [w for w in text if w not in stop]

#### Training

-  Hyperparameteroptimierung mit Rastersuche  
-  Auswahl L1 oder L2 Regularisierung um Komplexität des Modells zu reduzieren [S.141]

In [163]:
# Optimale Parameterkombination finden
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer # vereint CountVectorizer + TfidfTransformer

tfidf = TfidfVectorizer(strip_accents = None, lowercase = False, preprocessor = None)

param_grid = [
               {'vect__ngram_range': [(1,1)], # Monogramm
               'vect__stop_words': [stop, None], # 'english or None is applied'
               'vect__tokenizer': [tokenizer, tokenizer_porter],# Wörter am Leerzeichen trennen oder auch Stemming anwenden
               'clf__penalty': ['l1', 'l2'], # L1 oder L2 Regularisierung Logisitische regressin
               'clf__C': [1.0, 10.0, 100.0] # Je größer C, desto weniger Regularisierung
              },
    
              {'vect__ngram_range': [(1,1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf' : [False],
               'vect__norm': [None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]
              }
             ]
              
pipeline = Pipeline([
                    ('vect', tfidf),
                    ('clf', LogisticRegression(random_state = 0))
                    ])
              
search = GridSearchCV(pipeline, param_grid,
                      scoring = 'accuracy',
                      #cv = 5, # 5-fold stratified cross validation
                      verbose = 1,
                      n_jobs = -1 # n_jobs = -1 (alle Prozessorkerne nutzen)
                          )
              
search.fit(X_train, y_train)    

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


Fitting 3 folds for each of 48 candidates, totalling 144 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   12.9s
[Parallel(n_jobs=-1)]: Done 144 out of 144 | elapsed:   37.9s finished
  'stop_words.' % sorted(inconsistent))


GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=False, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,...e, penalty='l2', random_state=0, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid=[{'vect__ngram_range': [(1, 1)], 'vect__stop_words': [['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's...se_idf': [False], 'vect__norm': [None], 'clf__penalty': ['l1', 'l2'], 'clf__C': [1.0, 10.0, 100.0]}],
       pre_dispatch='2*n_jobs', re

##### Optimale Parameterkombination + Testing

In [165]:
print('Beste Parameterkombination: {}'.format(search.best_params_))

Beste Parameterkombination: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both'

In [168]:
print("Cross Value Score: {}".format(search.best_score_)) # Mean cross-validated score of the best_estimator
best_model = search.best_estimator_ # Estimator that was chosen by the search, i.e. estimator which gave highest score
print("Best Model Cross Value Score: {}".format(best_model.score(X_test, y_test))) # Mean cross-validated score of best_model

Cross Value Score: 0.6435643564356436
Cross Value Score: 0.7068336673346693


#### Naive Bayer-Klassifikatior possible as well for Text-Classification tasks

In [None]:
s