# Sentiment Analysis with Python

### Import libraries

In [27]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import re

In [7]:
# Load the IMDB reviews dataset
db = pd.read_csv("IMDB Dataset.csv")
db.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


### Data preprocessing

#### 1. Convert to lowercase and remove special characters

In [31]:
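def preprocessor(text):
    # Remove HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    # Store the emoticons before punctuation is stripped
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, collapse non-word characters into spaces, then re-append
    # the emoticons (without the "-" nose), separated by spaces
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', ''))

    return text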

For more information about regular expressions, see the following [link](https://developers.google.com/edu/python/regular-expressions).
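
To see what these cleaning steps do, here is a minimal, self-contained sketch applied to a made-up sample string. The function name `clean_review` and the sample text are illustrative assumptions, and the emoticon pattern assumes emoticons written without inner spaces, such as `:)` or `:-(`.

```python
import re

def clean_review(text):
    """Illustrative sketch: strip HTML, lowercase, keep emoticons at the end."""
    # Remove HTML tags such as <br /> or </a>
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons first, because the next step removes punctuation
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase and replace every run of non-word characters with one space
    words = re.sub(r'\W+', ' ', text.lower()).strip()
    # Re-append the emoticons without the "-" nose
    parts = [words] + [e.replace('-', '') for e in emoticons]
    return ' '.join(p for p in parts if p)

sample = "</a>This :) is :( a test :-)!"  # made-up example text
cleaned = clean_review(sample)
print(cleaned)  # this is a test :) :( :)
```

The emoticons are moved to the end of the string because they can carry sentiment, while the remaining punctuation is discarded.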

In [32]:
db['review'] = db['review'].apply(preprocessor)

In [35]:
db.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production the filming tech...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there s a family where a little boy ...,negative
4,petter mattei s love in the time of money is a...,positive


### Creating the bag of words

For this project, we use the **_CountVectorizer_** class from the sklearn library to build a model known as the **bag of words**. This model represents text as numeric vectors, a form that machine learning algorithms can work with.

#### Steps in CountVectorizer
* **Tokenization**: Text is split into individual words or tokens. By default, the text is lowercased, punctuation is treated as a separator, and tokens are runs of two or more alphanumeric characters (configurable).
* **Building Vocabulary**: Generates a vocabulary of unique words/tokens from the entire set of documents. Each word becomes a feature, identified by its index in the vocabulary.
* **Counting Occurrences**: For each document, generates a vector of word counts over the vocabulary. Vector length = vocabulary size; each element = count of the word at that index.
* **Sparse Matrix**: The resulting matrix is typically sparse because most counts are zero, so it is stored in a sparse format that is efficient for storage and computation.
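
The steps above can be sketched in plain Python on a small toy corpus. This is only an illustration that mimics the documented default behavior of `CountVectorizer` (lowercasing and tokens of two or more word characters), not scikit-learn's implementation; the corpus and all names (`toy_docs`, `tokenize`, `count_vector`) are assumptions made for the example.

```python
import re

# Toy corpus (made up for illustration)
toy_docs = [
    "The sun is shining",
    "The weather is sweet",
    "The sun is shining, the weather is sweet, and one and one is two",
]

def tokenize(text):
    # Tokenization: lowercase, then keep runs of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in toy_docs]

# Building the vocabulary: every unique token gets an index (alphabetical order)
unique_tokens = sorted({tok for doc in tokenized for tok in doc})
vocabulary = {tok: idx for idx, tok in enumerate(unique_tokens)}

def count_vector(tokens):
    # Counting occurrences: one slot per vocabulary entry
    vec = [0] * len(vocabulary)
    for tok in tokens:
        vec[vocabulary[tok]] += 1
    return vec

bag_dense = [count_vector(doc) for doc in tokenized]

# Sparse view: store only the non-zero entries as {column_index: count}
bag_sparse = [{i: c for i, c in enumerate(vec) if c} for vec in bag_dense]

print(vocabulary)
for vec in bag_dense:
    print(vec)
# [0, 1, 0, 1, 1, 0, 1, 0, 0]
# [0, 1, 0, 0, 0, 1, 1, 0, 1]
# [2, 3, 2, 1, 1, 1, 2, 1, 1]
```

In the last vector, the value 3 at index 1 means that the word at index 1 of the vocabulary (`is`) occurs three times in the third document.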

In [13]:
count = CountVectorizer()
docs = np.array(db["review"].tolist())
bag = count.fit_transform(docs) #bag of words

In [19]:
print("Words:",list(count.vocabulary_.keys())[0:10])
print("Index:",list(count.vocabulary_.values())[0:10])

Words: ['one', 'of', 'the', 'other', 'reviewers', 'has', 'mentioned', 'that', 'after', 'watching']
Index: [64131, 63757, 90160, 64776, 75511, 40745, 57558, 90137, 2970, 98226]


As we see above, each word in the vocabulary is mapped to a unique integer index in a dictionary. For instance, the word `one` has the index 64131, which is the position in each feature vector that holds the number of occurrences of `one` in a document.

In [21]:
# Convert only the first rows to a dense array; the full matrix may not fit in memory
print(bag[0:10].toarray())

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


As we can see, most of the counts are **zeros**, because each document uses only a small fraction of the words in the complete vocabulary. Each value in these feature vectors is known as the **raw term frequency**, **$tf(t,d)$**: the number of times a term $t$ occurs in a document $d$. However, raw counts alone are not enough to measure how relevant a word is, since words that are frequent in every document carry little discriminative information. That is why we apply the **$tf$-$idf(t,d)$** technique.
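
The sparsity and the role of the vocabulary index can be checked without converting the whole matrix to a dense array. The sketch below uses a made-up toy corpus (`toy_docs`, `toy_count`, and `toy_bag` are illustrative names, not part of the notebook) and relies on the `shape` and `nnz` attributes of the SciPy sparse matrix returned by `fit_transform`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up for illustration)
toy_docs = [
    "The sun is shining",
    "The weather is sweet",
    "The sun is shining, the weather is sweet, and one and one is two",
]

toy_count = CountVectorizer()
toy_bag = toy_count.fit_transform(toy_docs)  # sparse document-term matrix

n_docs, n_terms = toy_bag.shape
nonzero = toy_bag.nnz                        # number of stored non-zero counts
density = nonzero / (n_docs * n_terms)
print(toy_bag.shape, nonzero, f"{density:.2f}")

# The vocabulary index of a word selects its column in the matrix
sun_col = toy_count.vocabulary_["sun"]
sun_counts = toy_bag[:, sun_col].toarray().ravel()
print(sun_counts)  # counts of "sun" in each of the three documents
```

On a real corpus with tens of thousands of terms the density is far lower than in this toy case, which is why the sparse representation matters.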

### Term Frequency-Inverse Document Frequency (TF-IDF)

**Term Frequency-Inverse Document Frequency (TF-IDF)** is a technique used to evaluate the importance of a term in a document within a collection (corpus) of documents. It's calculated as the product of TF and IDF.

$$
\text{tf-idf}(t,d) = tf(t,d) \times idf(t,d)
$$

* **Purpose**: Highlights terms that are important in a document but not necessarily common across all documents in the corpus.
* **Key Points**:
  * TF-IDF considers not only the frequency of a term in a document (TF) but also its rarity across all documents (IDF).
  * High TF-IDF values occur when a term has a high frequency within a specific document but is rare across other documents.
  * Common terms across documents receive lower TF-IDF scores, as they are less discriminative.

In this context, $idf(t,d)$ is the inverse document frequency, which can be calculated with the following equation:

$$
idf(t,d) = \log\frac{n_d}{1 + df(d,t)}
$$

Here, $n_d$ is the total number of documents and $df(d,t)$ is the number of documents $d$ that contain the term $t$.
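
As a worked example of the equation above, the following sketch computes $tf$, $df$, and $idf$ by hand for a small, already tokenized toy corpus. The corpus and the helper names are assumptions made for illustration, and the logarithm is assumed to be the natural logarithm.

```python
import math

# Toy corpus, already tokenized (made up for illustration)
toy_docs = [
    ["the", "sun", "is", "shining"],
    ["the", "weather", "is", "sweet"],
    ["the", "sun", "is", "shining", "the", "weather", "is", "sweet",
     "and", "one", "and", "one", "is", "two"],
]
n_d = len(toy_docs)  # total number of documents

def tf(term, doc):
    # Raw term frequency: occurrences of the term in one document
    return doc.count(term)

def df(term):
    # Document frequency: number of documents that contain the term
    return sum(1 for doc in toy_docs if term in doc)

def idf(term):
    # Inverse document frequency as defined above (natural log assumed)
    return math.log(n_d / (1 + df(term)))

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

last = toy_docs[2]
print(round(tf_idf("is", last), 3))   # 3 * ln(3/4) = -0.863
print(round(tf_idf("and", last), 3))  # 2 * ln(3/2) = 0.811
```

Although `is` occurs more often than `and` in the last document, it appears in every document and therefore receives the lower score. With this formula, a term present in all documents even gets a slightly negative weight.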

#### Case example

In [25]:
tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
np.set_printoptions(precision=2)
# Slice before calling toarray() so that only a few rows are made dense
print(tfidf.fit_transform(count.fit_transform(docs))[0:10]
    .toarray())

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
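
The `TfidfTransformer` settings used in this cell do not apply the textbook equation directly. According to the scikit-learn documentation, with `smooth_idf=True` the weight is $idf(t) = \ln\frac{1 + n_d}{1 + df(t)} + 1$, and `norm='l2'` then scales each document vector to unit length. The following is a hand computation of those two steps on a made-up toy corpus, written in plain Python as an illustration; it is not scikit-learn's code, and all names are assumptions.

```python
import math

# Toy corpus, already tokenized (made up for illustration)
toy_docs = [
    ["the", "sun", "is", "shining"],
    ["the", "weather", "is", "sweet"],
    ["the", "sun", "is", "shining", "the", "weather", "is", "sweet",
     "and", "one", "and", "one", "is", "two"],
]
n_d = len(toy_docs)
vocabulary = sorted({term for doc in toy_docs for term in doc})

def smooth_idf(term):
    # Smoothed idf: ln((1 + n_d) / (1 + df)) + 1
    df = sum(1 for doc in toy_docs if term in doc)
    return math.log((1 + n_d) / (1 + df)) + 1

def tfidf_l2(doc):
    # Raw tf-idf for every vocabulary term, then L2 normalization
    raw = [doc.count(term) * smooth_idf(term) for term in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

weights = tfidf_l2(toy_docs[2])
for term, w in zip(vocabulary, weights):
    print(f"{term:8s}{w:.2f}")
# "and" and "one" get 0.50, "is" gets 0.45
```

After smoothing and normalization all weights are non-negative, and `is` (three occurrences, present in every document) ends up below `and` (two occurrences, present in one document).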
