# Knowledge integration

This section adds 4 columns. Each value is the number of matching words divided by the total number of words in the text, rounded to two decimals:

- numberPositiveReview: share of positive words in each reviewText
- numberNegativeReview: share of negative words in each reviewText
- numberPositiveSummary: share of positive words in each summary
- numberNegativeSummary: share of negative words in each summary

Positive and negative words are identified using two files, which contain a list of the words considered positive and a list of the words considered negative, respectively.

The two files are available at the following links:

- https://ptrckprry.com/course/ssd/data/positive-words.txt
- https://ptrckprry.com/course/ssd/data/negative-words.txt
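A minimal sketch of how such a word list could be loaded. It assumes, without confirmation from this notebook, that a downloaded file has one word per line and may begin with comment lines starting with `;`. The helper name `load_lexicon` and the `latin-1` default encoding are hypothetical choices made for illustration. The result is sorted so that a binary search over it is valid.

```python
from pathlib import Path


def load_lexicon(path, encoding="latin-1"):
    """Return a sorted list of the words in a one-word-per-line file.

    Hypothetical helper: blank lines and lines starting with ';' are
    skipped, on the assumption that ';' marks a comment header.
    """
    words = set()
    for line in Path(path).read_text(encoding=encoding).splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue
        words.add(line.lower())
    # Sorting removes any dependence on the order of the file itself.
    return sorted(words)
```

If the local files under `words/` have already been cleaned by hand, this filtering changes nothing and only the sorting matters.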

### Importing the required libraries and modules

In [1]:
import numpy as np
import pandas as pd

from bisect import bisect_left

In [2]:
clean_dataset = pd.read_csv("datasets/clean_dataset.csv", index_col=0)
clean_dataset.dropna(axis='index', how='any', inplace=True)

In [4]:
clean_dataset.head()

| Unnamed: 0 | rating | reviewText | summary | sentiment |
|---|---|---|---|---|
| 0 | 3 | jace rankin may short nothing mess man haul sa... | entertaining average | 0 |
| 1 | 5 | great short read want put read one sit sex sce... | terrific menage scene | 1 |
| 2 | 3 | ill start say first four book expect 34conclud... | snapdragon alley | 0 |
| 3 | 3 | aggie angela lansbury carry pocketbook instead... | light murder cozy | 0 |
| 4 | 4 | expect type book library please find price right | book | 1 |


# 1. Creating the columns

In [5]:
# create the numberPositiveReview column (skipped if it already exists)
if 'numberPositiveReview' not in clean_dataset:
    clean_dataset.insert(4, 'numberPositiveReview', 0)

# create the numberNegativeReview column
if 'numberNegativeReview' not in clean_dataset:
    clean_dataset.insert(5, 'numberNegativeReview', 0)

# create the numberPositiveSummary column
if 'numberPositiveSummary' not in clean_dataset:
    clean_dataset.insert(6, 'numberPositiveSummary', 0)

# create the numberNegativeSummary column
if 'numberNegativeSummary' not in clean_dataset:
    clean_dataset.insert(7, 'numberNegativeSummary', 0)

In [6]:
clean_dataset.head()

| Unnamed: 0 | rating | reviewText | summary | sentiment | numberPositiveReview | numberNegativeReview | numberPositiveSummary | numberNegativeSummary |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 | jace rankin may short nothing mess man haul sa... | entertaining average | 0 | 0 | 0 | 0 | 0 |
| 1 | 5 | great short read want put read one sit sex sce... | terrific menage scene | 1 | 0 | 0 | 0 | 0 |
| 2 | 3 | ill start say first four book expect 34conclud... | snapdragon alley | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | aggie angela lansbury carry pocketbook instead... | light murder cozy | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | expect type book library please find price right | book | 1 | 0 | 0 | 0 | 0 |


# 2. Computing the number of positive/negative words

## 2.1 Defining the counting function

In [7]:
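def countWords(list_word, text):
    """Return the fraction of words in text found in list_word, rounded to 2 decimals.

    list_word must be sorted, because membership is tested with binary search.
    """
    splitted_text = text.split()

    # empty text: return early to avoid dividing by zero
    if len(splitted_text) == 0:
        return 0

    count = 0
    for word in splitted_text:
        # binary search for word in the sorted list
        i = bisect_left(list_word, word)
        if i != len(list_word) and list_word[i] == word:
            count = count + 1

    # normalize by the total number of words in the text
    return round(count / len(splitted_text), 2)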
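The binary search in `countWords` gives correct results only on a sorted list. As a sketch of an equivalent approach under the same definition (fraction of whitespace-separated tokens found in the word list, rounded to two decimals), the hypothetical helper `word_fraction` below uses a `set`, which does not depend on ordering. It is an illustration, not the notebook's method.

```python
def word_fraction(lexicon, text):
    """Fraction of whitespace-separated tokens of `text` found in `lexicon`."""
    tokens = text.split()
    if not tokens:
        return 0
    # In real use, build the set once outside the function instead of per call.
    lexicon = set(lexicon)
    hits = sum(1 for token in tokens if token in lexicon)
    return round(hits / len(tokens), 2)


# Toy word list, not the real lexicon: 2 of the 4 tokens match.
print(word_fraction(["good", "great"], "great short read good"))  # 0.5
```

Both versions count repeated words each time they occur, so a review that says "good good good" scores 1.0.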

## 2.2 Counting for the reviewText column

In [8]:
# compute the share of positive/negative words in the reviewText column
# the lists are sorted because countWords relies on binary search (bisect_left)
with open('words/positive_words.txt') as file:
    list_word_positive = sorted(file.read().split())
clean_dataset["numberPositiveReview"] = clean_dataset["reviewText"].apply(lambda text: countWords(list_word_positive, text))

with open('words/negative_words.txt') as file:
    list_word_negative = sorted(file.read().split())
clean_dataset["numberNegativeReview"] = clean_dataset["reviewText"].apply(lambda text: countWords(list_word_negative, text))


## 2.3 Counting for the summary column

In [9]:
# compute the share of positive/negative words in the summary column
# the lists are sorted because countWords relies on binary search (bisect_left)
with open('words/positive_words.txt') as file:
    list_word_positive = sorted(file.read().split())
clean_dataset["numberPositiveSummary"] = clean_dataset["summary"].apply(lambda text: countWords(list_word_positive, text))

with open('words/negative_words.txt') as file:
    list_word_negative = sorted(file.read().split())
clean_dataset["numberNegativeSummary"] = clean_dataset["summary"].apply(lambda text: countWords(list_word_negative, text))



### Overview of the dataset with the count columns (normalized by the total number of words in each review/summary)

In [10]:
clean_dataset.head()

| Unnamed: 0 | rating | reviewText | summary | sentiment | numberPositiveReview | numberNegativeReview | numberPositiveSummary | numberNegativeSummary |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 | jace rankin may short nothing mess man haul sa... | entertaining average | 0 | 0.06 | 0.11 | 0.5 | 0.0 |
| 1 | 5 | great short read want put read one sit sex sce... | terrific menage scene | 1 | 0.19 | 0.03 | 0.33 | 0.0 |
| 2 | 3 | ill start say first four book expect 34conclud... | snapdragon alley | 0 | 0.05 | 0.02 | 0.0 | 0.0 |
| 3 | 3 | aggie angela lansbury carry pocketbook instead... | light murder cozy | 0 | 0.1 | 0.13 | 0.33 | 0.33 |
| 4 | 4 | expect type book library please find price right | book | 1 | 0.12 | 0.0 | 0.0 | 0.0 |
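The summary columns can be checked by hand, since summaries are only a few words long. As a sketch, assuming exactly one word of each summary below is in the positive list (the notebook does not show which words matched), the values follow from dividing by the word count. The helper name `normalized` is hypothetical.

```python
def normalized(hits, text):
    """Share of matching words: hits divided by the number of words in text."""
    return round(hits / len(text.split()), 2)


# Assumption: one positive word per summary.
print(normalized(1, "entertaining average"))   # 1 of 2 words
print(normalized(1, "terrific menage scene"))  # 1 of 3 words
```

Under that assumption the two calls give 0.5 and 0.33, matching the numberPositiveSummary values in the first two rows.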


In [11]:
# save the dataset
clean_dataset.to_csv('datasets/clean_dataset.csv')