***
# Document Classification using Natural Language Processing
***

Natural Language Processing (NLP) is a branch of computational linguistics that uses machine learning algorithms to analyse how people communicate.
Sentiment Analysis (also known as opinion mining or emotion AI) is a sub-field of NLP that tries to identify and extract opinions from text such as blogs, reviews, social media, forums and news.

***

### **In this project we analyse restaurant reviews and classify them as Positive or Negative**

***

### The steps are as follows:
1. **Finding commonly used words specific to negative reviews and positive reviews (with spaCy)**
2. **Training a model to classify the reviews (with sklearn)**
3. **Measuring emotional affect from the reviews (with text2emotion)**
4. **Testing the model & emotional affect library with custom reviews**

## Finding commonly used words
We can count how often each noun appears in the reviews. Studying these frequencies helps us see which words are more commonly used in positive and in negative reviews.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
sns.set()

In [2]:
# loading the dataset
reviews = pd.read_csv('datasets/Restaurant_Reviews.tsv', sep='\t')
reviews.head()

| | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


### Separating the negative and positive reviews

In [3]:
# join all positive reviews into one text
positive_texts = reviews[reviews['Liked']== 1]['Review'].values
separator = ', '
joined_positive_texts = separator.join(positive_texts)

In [4]:
# join all negative reviews into one text
negative_texts = reviews[reviews['Liked']== 0]['Review'].values
separator = ', '
joined_negative_texts = separator.join(negative_texts)

In [5]:
import spacy
from collections import Counter

In [6]:
nlp = spacy.load("en_core_web_lg")

### Function that returns a dataframe with the most common nouns from given texts

In [7]:
def get_nouns(texts, n=20):
    # turn text into nlp object
    doc = nlp(texts)

    # each nlp object contains added information about each word
    # (e.g. base form of a word, a part-of-speech tag)
    # keep the base form (lemma) of every NOUN that is not a stop word or punctuation
    nouns = [token.lemma_ for token in doc
             if (not token.is_stop and not token.is_punct
                 and token.pos_ == "NOUN")]

    # count the occurrences of each noun
    noun_freq = Counter(nouns)

    # get the n most common nouns
    common_nouns = noun_freq.most_common(n)

    # create a dataframe from the nouns and the number of occurrences
    nouns_df = pd.DataFrame(common_nouns, columns=['nouns', 'count'])
    return nouns_df

In [8]:
# get most commonly used nouns for positive reviews
positive_df = get_nouns(joined_positive_texts)
# get most commonly used nouns for negative reviews
negative_df = get_nouns(joined_negative_texts)
positive_df.head()

| | nouns | count |
|---|---|---|
| 0 | place | 60 |
| 1 | food | 53 |
| 2 | service | 39 |
| 3 | time | 26 |
| 4 | restaurant | 17 |


### Plotting most frequent nouns

In [26]:
# bar chart of the 20 most frequent nouns in positive reviews
fig = px.bar(positive_df, x='nouns', y='count', color='nouns',
             title='Top 20 nouns used in positive reviews')

fig.update_xaxes(tickangle=45)
fig.show()

In [27]:
fig=px.bar(negative_df, x = 'nouns', y = 'count', color = 'nouns', 
             title='Top 20 nouns used in negative reviews')
fig.update_xaxes(tickangle=45)
fig.show()

***

### **Looking at the foods mentioned in both plots, steak, pizza and beer stand out in the positive reviews, while burger, salad and chicken stand out in the negative reviews.**

***
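
The comparison above is made by eye from the two bar charts. A minimal sketch of how it could be made explicit is to merge the two frequency tables and compute, for each noun, the share of its mentions that come from positive reviews. The function name `compare_noun_counts`, the column `positive_share` and the toy counts are assumptions for illustration and are not taken from the dataset.

```python
import pandas as pd

def compare_noun_counts(positive_df, negative_df):
    """Merge two noun-frequency tables and compute the share of positive mentions."""
    merged = positive_df.merge(
        negative_df, on="nouns", how="outer", suffixes=("_pos", "_neg")
    ).fillna(0)  # a noun missing from one table counts as 0 there
    total = merged["count_pos"] + merged["count_neg"]
    merged["positive_share"] = merged["count_pos"] / total
    return merged.sort_values("positive_share", ascending=False).reset_index(drop=True)

# toy counts for illustration only, not taken from the dataset
toy_positive = pd.DataFrame({"nouns": ["steak", "pizza", "burger"], "count": [8, 6, 2]})
toy_negative = pd.DataFrame({"nouns": ["burger", "salad", "pizza"], "count": [6, 4, 2]})

comparison = compare_noun_counts(toy_positive, toy_negative)
print(comparison)
```

With the notebook's own tables this would be called as `compare_noun_counts(positive_df, negative_df)`. Since those tables only hold the top 20 nouns of each class, a noun absent from one table may still occur in that class, so the shares are only a rough guide.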

## Training a model to classify the reviews

Two commonly used algorithms for document classification are **Logistic Regression** and **SVC (Support Vector Classification)**.

The reviews were vectorized beforehand using spaCy and saved as .csv files, which are loaded in the next step.
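
The vectorization step itself is not shown in this notebook. The sketch below is one plausible way to build such a table, assuming one vector per review and a label column named `y` (the name used when the files are loaded). The helper `build_feature_table`, the stand-in `toy_vectorize` and the output file name are hypothetical; the split into training and testing files is not covered.

```python
import numpy as np
import pandas as pd

def build_feature_table(texts, labels, vectorize):
    """Stack one vector per text into a DataFrame and append the labels as column 'y'."""
    features = pd.DataFrame(np.vstack([vectorize(text) for text in texts]))
    features["y"] = list(labels)
    return features

# stand-in vectorizer so the sketch runs without a spaCy model;
# with spaCy one would pass something like `lambda text: nlp(text).vector`
def toy_vectorize(text):
    return np.array([len(text), text.count(" ")], dtype=float)

table = build_feature_table(
    ["Loved this place.", "Crust is not good."], [1, 0], toy_vectorize
)
print(table)
# table.to_csv("vectorized_reviews.csv", index=False)  # hypothetical file name
```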

### 1. Training and testing data

In [46]:
# training dataset divided into X and y
train = pd.read_csv('datasets/train.csv')
X_train = train.drop(['y'], axis=1)
y_train = train['y']

In [47]:
# testing dataset divided into X and y
test = pd.read_csv('datasets/test.csv')
X_test = test.drop(['y'], axis=1)
y_test = test['y']

### 2. Training a model with SVC algorithm

In [49]:
from sklearn.svm import SVC

In [52]:
svc = SVC(random_state=42,C=1, kernel='linear',probability=True)
_ = svc.fit(X_train, y_train)

In [54]:
print(f'Model test accuracy: {svc.score(X_test, y_test)*100:.3f}%')

Model test accuracy: 85.500%
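
Accuracy alone does not show which class is misclassified more often. A minimal sketch of a confusion matrix and a per-class report follows. The labels are toy values for illustration; in the notebook they would be `y_test` and `svc.predict(X_test)`, and the class names `Negative` and `Positive` are assumed to correspond to the labels 0 and 1.

```python
from sklearn.metrics import classification_report, confusion_matrix

# toy labels for illustration only, not results from the dataset
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# rows are the true classes (0, 1), columns are the predicted classes (0, 1)
matrix = confusion_matrix(y_true, y_pred)
print(matrix)

# precision, recall and F1 score for each class
print(classification_report(y_true, y_pred, target_names=["Negative", "Positive"]))
```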


### 3. Training a model with Logistic Regression

In [None]:
## TODO
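
This step is still open in the notebook. As a rough sketch of what it could look like, the block below trains scikit-learn's `LogisticRegression` on synthetic data that stands in for the vectorized reviews. The synthetic data, the variable names and `max_iter=1000` are assumptions, and the printed accuracy says nothing about the restaurant reviews.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# synthetic stand-in for the vectorized reviews (not the real dataset)
X, y = make_classification(n_samples=200, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# max_iter is raised so the solver has room to converge
log_reg = LogisticRegression(max_iter=1000, random_state=42)
log_reg.fit(X_tr, y_tr)

accuracy = log_reg.score(X_te, y_te)
print(f"Synthetic-data test accuracy: {accuracy * 100:.3f}%")
```

With the notebook's data, the same two calls would be `log_reg.fit(X_train, y_train)` and `log_reg.score(X_test, y_test)`.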

## Measuring emotional affect from the reviews

In [29]:
import text2emotion as te



In [36]:
# all texts
# join all reviews into one text
texts = reviews['Review'].values
separator = ', '
all_texts = separator.join(texts)

### text2emotion is a library that processes text, recognizes the emotions expressed in it, and returns the result as a dictionary

In [40]:
em = te.get_emotion(all_texts)
Emotion_df = pd.DataFrame.from_dict(em, orient='index').sort_values(by=0, ascending=False).reset_index()
Emotion_df.columns = ['Emotion', 'Frequency']
Emotion_df.head()

| | Emotion | Frequency |
|---|---|---|
| 0 | Fear | 0.29 |
| 1 | Happy | 0.23 |
| 2 | Surprise | 0.22 |
| 3 | Sad | 0.2 |
| 4 | Angry | 0.05 |


In [41]:
fig = px.pie(Emotion_df, values = 'Frequency', names='Emotion',
             title='Emotion Frequency in all of the reviews',
             hover_data=['Emotion'], labels={'Emotion':'Emotion'})
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

## Testing the model & emotional affect library with custom reviews

### Every new review needs to be transformed into vectors before it's used by the model

In [42]:
# spacy has a premade vocabulary that enables us to transform text into vectors
nlp = spacy.load("en_core_web_lg")
# function for vectorizing text
def vectorizer(text):
    with nlp.disable_pipes():
        vector = np.array(nlp(text).vector)
    return vector

### Trying a good review

In [43]:
good_review = "I like this restaurant"
df = pd.DataFrame(vectorizer(good_review).reshape(1,-1))

# TODO
# linear_model.predict(df)
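
The name `linear_model` in the commented line is not defined anywhere in this notebook. The sketch below assumes that any fitted scikit-learn classifier exposing `predict_proba` could be used instead, for example the `svc` trained earlier with `probability=True`. The helper `classify_review` and the synthetic demo model are hypothetical and only show the shape of the call.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

def classify_review(model, vector):
    """Return (predicted label, probability of the positive class) for one review vector."""
    row = np.asarray(vector).reshape(1, -1)  # the model expects a 2D array
    label = int(model.predict(row)[0])
    positive_index = list(model.classes_).index(1)
    probability = float(model.predict_proba(row)[0][positive_index])
    return label, probability

# synthetic stand-in for the trained model and a vectorized review
X, y = make_classification(n_samples=100, n_features=10, random_state=42)
demo_model = SVC(kernel="linear", C=1, probability=True, random_state=42).fit(X, y)

label, probability = classify_review(demo_model, X[0])
print(label, round(probability, 3))
```

In the notebook this would be called as `classify_review(svc, vectorizer(good_review))`. For `SVC`, the probabilities come from a separate calibration step, so they can occasionally disagree with the label returned by `predict`.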

In [44]:
# sentence = te.get_emotion(good_review)

# EmotionDF = pd.DataFrame.from_dict(sentence, orient='index').sort_values(by=0, ascending=False).reset_index()
# EmotionDF.columns = ['Emotion', 'Frequency']
# EmotionDF

| | Emotion | Frequency |
|---|---|---|
| 0 | Happy | 0 |
| 1 | Angry | 0 |
| 2 | Surprise | 0 |
| 3 | Sad | 0 |
| 4 | Fear | 0 |


In [50]:
# TODO