In [3]:
import pandas as pd

# IMDB_DATASET_URL comes from the project's URLS module
from URLS import IMDB_DATASET_URL

# Load the IMDB reviews into a DataFrame
df = pd.read_csv(IMDB_DATASET_URL)


In [4]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


## Preprocessing Data

In [6]:
import spacy
import re
import pandas as pd

# Load the SpaCy English language model
nlp = spacy.load('en_core_web_sm')

def preprocess_reviews(df):
    def clean_text(text):
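        # Remove HTML tags such as <br />.
        # Note: substituting '' can fuse two words that were separated only
        # by a tag; substitute ' ' instead if they should stay apart.
        text = re.sub(r'<.*?>', '', text)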
        
        # Remove special characters and punctuation
        text = re.sub(r'[^\w\s]', '', text)
        
        # Process the text using SpaCy
        doc = nlp(text.lower())
        
        # Remove stopwords and lemmatize
        tokens = [
            token.lemma_ 
            for token in doc 
            if not token.is_stop and not token.is_punct
        ]
        
        # Join tokens back to string
        return ' '.join(tokens)

    # Apply the cleaning function to the 'review' column
    df['cleaned_review'] = df['review'].apply(clean_text)
    return df

### Preprocess a sample of reviews

In [8]:
sample_df = df.sample(n=100, random_state=42)
preprocessed_df = preprocess_reviews(sample_df)
preprocessed_df.head()

Unnamed: 0,review,sentiment,cleaned_review
33553,I really liked this Summerslam due to the look...,positive,like summerslam look arena curtain look overal...
9427,Not many television shows appeal to quite as m...,positive,television show appeal different kind fan like...
199,The film quickly gets to a major chase scene w...,negative,film quickly get major chase scene increase de...
12447,Jane Austen would definitely approve of this o...,positive,jane austen definitely approve onegwyneth palt...
39489,Expectations were somewhat high for me when I ...,negative,expectation somewhat high go movie think steve...


# Text Data Representation and Vectorization for AI/ML Models

## There are different ways to represent text data
### 1 -> One-Hot Encoding
One-hot encoding finds the unique words in the corpus and represents each word as a vector of 0s and 1s: a 1 at the index assigned to that word and 0 everywhere else.
 
#### Problems
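1. Out-of-vocabulary issues: a word that was not seen when the vocabulary was built has no vector.
2. It does not capture semantic meaning: every pair of different words is equally far apart.
3. It is computationally costly: each vector is as long as the whole vocabulary.
4. The vectors are sparse: almost every entry is a 0 that carries no information but still costs memory and computation.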
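
As a rough illustration of the idea above, here is a minimal pure-Python sketch. The helper names `build_vocabulary` and `one_hot` and the two toy sentences are assumptions made for this example; they are not part of the notebook.

```python
def build_vocabulary(sentences):
    """Map each unique word to an index, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def one_hot(word, vocab):
    """Return a vector of 0s with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    if word in vocab:  # an out-of-vocabulary word stays all zeros
        vector[vocab[word]] = 1
    return vector


# Toy corpus (hypothetical, for illustration only)
sentences = ["great film", "bad film"]
vocab = build_vocabulary(sentences)

great_vec = one_hot("great", vocab)     # known word
unknown_vec = one_hot("boring", vocab)  # word never seen in the corpus
```

The all-zero vector for the unseen word shows the out-of-vocabulary problem, and the vector length grows with every new word in the vocabulary.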
### 2 -> Bag Of Words
Bag of Words first finds the unique words, then counts how often each one occurs in a document. It is well suited to classification tasks, for example
deciding whether a sentence is negative or positive based on the frequency of the words used in it.
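
The counting step can be sketched in a few lines of pure Python. The function name `bag_of_words`, the alphabetical vocabulary order, and the toy documents are assumptions for this example only.

```python
from collections import Counter


def bag_of_words(documents):
    """Return the sorted vocabulary and one count vector per document."""
    vocab = sorted({word for doc in documents for word in doc.split()})
    vectors = []
    for doc in documents:
        counts = Counter(doc.split())
        # Counter returns 0 for words that do not occur in this document
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


# Toy documents (hypothetical, for illustration only)
docs = ["good good film", "bad film"]
bow_vocab, bow_vectors = bag_of_words(docs)
```

Each document becomes a single vector whose entries are word counts, so word order is lost but frequency is kept.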

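#### Problems
The problems are the same as for one-hot encoding, but it works better and uses fewer 0s, because a document becomes a single count vector instead of one vector per word.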
### 3 -> TF-IDF
TF-IDF finds important words by measuring their frequency in a document (TF) and how uncommon they are across all documents (IDF), highlighting unique terms.

Formula:
TF-IDF = TF * log(N / DF)

Where:

    TF = term frequency
    N = total number of documents
    DF = number of documents containing the term
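
A minimal sketch of the formula above in pure Python. Two details are assumptions of this sketch, since the text does not fix them: TF is taken as the term count divided by the document length, and the logarithm is the natural log. The function name `tf_idf` and the toy documents are also illustrative.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document using TF * log(N / DF)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)

    # DF: number of documents that contain each term
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            # TF assumed to be count / document length
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Toy documents (hypothetical, for illustration only)
tfidf_docs = ["good film", "bad film"]
tfidf_scores = tf_idf(tfidf_docs)
```

A term that appears in every document, such as "film" here, gets log(N / DF) = log(1) = 0, so only the distinguishing terms keep a non-zero score.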

### 4 -> N-Grams 
Choose N, the number of consecutive words that are grouped into one unit; these units are then counted the same way single words are. For example, if N=3 you use sequences of 3 words.
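
For illustration, a small sketch that splits a sentence into N-grams with a sliding window. The function name `ngrams` and the example sentence are assumptions for this example.

```python
def ngrams(text, n):
    """Return all runs of n consecutive words as tuples."""
    tokens = text.split()
    # A window of size n fits len(tokens) - n + 1 times (zero times if too short)
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


trigrams = ngrams("the film was not good", 3)
```

With N=3, the phrase "was not good" is kept together as one unit, which a single-word count would split apart.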

### 5 -> Word2Vec
Word2Vec uses a shallow neural network to map words to dense vectors in a continuous space, capturing semantic relationships.

    Input Layer: Takes one-hot encoded words.
    Projection Layer: Holds the weights that become the embeddings.
    Output Layer: Predicts context words from the target word (skip-gram) or the target word from its context (CBOW).
    Training: Uses gradient descent and backpropagation to optimize the embeddings.

Result: Dense word vectors that reflect word meanings and relationships.


  ![word2vec](../images/word2vecsteps.png)


### 6 -> Transformers

Transformers are deep learning models designed for sequential data, using self-attention mechanisms to weigh the importance of different input parts. They process data in parallel, allowing for efficient handling of long-range dependencies. This architecture underlies many state-of-the-art natural language processing tasks, such as translation and text generation.
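
The self-attention step can be sketched in pure Python as scaled dot-product attention. This is a simplified illustration, not a full transformer layer: it assumes the queries, keys, and values are the input vectors themselves (no learned projection matrices, no multiple heads), and the names `softmax`, `self_attention`, and the toy vectors are made up for this example.

```python
import math


def softmax(values):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Similarity of this position to every position in the sequence
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Output is the weighted average of all value vectors
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


# Three toy token vectors (hypothetical, for illustration only)
token_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(token_vectors, token_vectors, token_vectors)
```

Each output row mixes information from every position in the sequence, and no row depends on another row's result, which is why the positions can be computed in parallel.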

# All that is left now is to train an ML model and get results on the movie data
