## Feature engineering

- Learned about feature engineering
- Text vectorization: converting a text column into numbers
- Core idea: the numerical representation must convey the semantic meaning of the sentence
- Techniques:
  - One-hot encoding
  - Bag of words
  - N-grams
  - TF-IDF
  - Custom features
  - Word2Vec

In [1]:
import pandas as pd

In [2]:
imdb_data = pd.read_csv('IMDB Dataset.csv')

In [3]:
imdb_data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [4]:
imdb_data['review'][0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

In [5]:
import re
def remove_tags(text):
    return re.sub(r'<.*?>','',text)
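
Replacing tags with an empty string joins the words on either side of a `<br /><br />`, which is where tokens such as `methe` and `wordit` in the later output come from. A minimal sketch of a variant that substitutes a space and then collapses repeated whitespace; `remove_tags_keep_space` is a hypothetical helper, not the function used in this notebook.

```python
import re

def remove_tags_keep_space(text):
    # Replace each tag with a space so neighbouring words stay separate
    no_tags = re.sub(r'<.*?>', ' ', text)
    # Collapse the runs of whitespace this leaves behind
    return re.sub(r'\s+', ' ', no_tags).strip()

sample = "happened with me.<br /><br />The first thing"

fused = re.sub(r'<.*?>', '', sample)      # behaviour of remove_tags
spaced = remove_tags_keep_space(sample)   # space-preserving variant

print(fused)
print(spaced)
```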

In [6]:
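# Apply the cleaner to the whole column at once; assigning through
# imdb_data['review'][i] is chained indexing, which is slow and can trigger
# pandas' SettingWithCopyWarning.
imdb_data['review'] = imdb_data['review'].apply(remove_tags)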

In [7]:
imdb_data['review'].head()

0    One of the other reviewers has mentioned that ...
1    A wonderful little production. The filming tec...
2    I thought this was a wonderful way to spend ti...
3    Basically there's a family where a little boy ...
4    Petter Mattei's "Love in the Time of Money" is...
Name: review, dtype: object

In [8]:
imdb_data['review'][7]

"This show was an amazing, fresh & innovative idea in the 70's when it first aired. The first 7 or 8 years were brilliant, but things dropped off after that. By 1990, the show was not really funny anymore, and it's continued its decline further to the complete waste of time it is today.It's truly disgraceful how far this show has fallen. The writing is painfully bad, the performances are almost as bad - if not for the mildly entertaining respite of the guest-hosts, this show probably wouldn't still be on the air. I find it so hard to believe that the same creator that hand-selected the original cast also chose the band of hacks that followed. How can one recognize such brilliance and then see fit to replace it with such mediocrity? I felt I must give 2 stars out of respect for the original cast that made this show such a huge success. As it is now, the show is just awful. I can't believe it's still on the air."

In [9]:
from string import punctuation

In [10]:
punctuations = punctuation

In [11]:
def remove_puns(text):
    return text.translate(str.maketrans('','',punctuations))

In [12]:
remove_puns("Hello! my , name?")

'Hello my  name'

In [13]:
imdb_data['review'] = imdb_data['review'].apply(remove_puns)

In [14]:
imdb_data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production The filming tech...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically theres a family where a little boy J...,negative
4,Petter Matteis Love in the Time of Money is a ...,positive


In [15]:
imdb_data['review'][9]

'If you like original gut wrenching laughter you will like this movie If you are young or old then you will love this movie hell even my mom liked itGreat Camp'

In [16]:
imdb_data['review'] = imdb_data['review'].str.lower()

In [17]:
imdb_data['review']

0        one of the other reviewers has mentioned that ...
1        a wonderful little production the filming tech...
2        i thought this was a wonderful way to spend ti...
3        basically theres a family where a little boy j...
4        petter matteis love in the time of money is a ...
                               ...                        
49995    i thought this movie did a down right good job...
49996    bad plot bad dialogue bad acting idiotic direc...
49997    i am a catholic taught in parochial elementary...
49998    im going to have to disagree with the previous...
49999    no one expects the star trek movies to be high...
Name: review, Length: 50000, dtype: object

In [26]:
#pip install contractions

In [19]:
import contractions
import re

# Dictionary for informal/missing apostrophes
informal_fixes = {
    "im": "i am",
    "dont": "do not",
    "cant": "can not",
    "wont": "will not",
    "didnt": "did not",
    "isnt": "is not",
    "wasnt": "was not",
    "wouldnt": "would not",
    "couldnt": "could not",
    "shouldnt": "should not",
    "hasnt": "has not",
    "havent": "have not",
    "hadnt": "had not",
    "thats": "that is",
    "whats": "what is",
    "heres": "here is",
    "theres": "there is"
}

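def clean_review(text):
    # First, expand informal contractions written without apostrophes.
    # The reviews are already lowercased, so IGNORECASE is only a safeguard.
    for old, new in informal_fixes.items():
        pattern = r'\b' + old + r'\b'
        text = re.sub(pattern, new, text, flags=re.IGNORECASE)
    # Then let the contractions package expand whatever forms it still recognizes.
    # Note: remove_puns already stripped ASCII apostrophes, so forms like "don't"
    # no longer appear in the text at this point.
    text = contractions.fix(text)
    return text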
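
The loop above runs one `re.sub` per dictionary entry for every review. A minimal sketch of a single-pass alternative that compiles one alternation pattern and looks up each match in the dictionary. It assumes lowercased input, repeats only a small subset of the dictionary so that it is self-contained, and leaves out the `contractions` package; `informal_fixes_demo` and `expand_informal` are hypothetical names.

```python
import re

# Subset of the apostrophe-free forms, repeated here for a self-contained example
informal_fixes_demo = {
    "im": "i am",
    "dont": "do not",
    "theres": "there is",
}

# One pattern that matches any key as a whole word: \b(?:im|dont|theres)\b
informal_pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, informal_fixes_demo)) + r")\b"
)

def expand_informal(text):
    # Each match is replaced by its dictionary value in a single scan of the text
    return informal_pattern.sub(lambda m: informal_fixes_demo[m.group(0)], text)

expanded = expand_informal("im sure theres a reason they dont say")
print(expanded)
```

The word boundaries keep the pattern from firing inside longer words, so `time` is left alone even though it contains `im`.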

In [20]:
imdb_data['review'] = imdb_data['review'].apply(clean_review)

In [21]:
imdb_data['review']

0        one of the other reviewers has mentioned that ...
1        a wonderful little production the filming tech...
2        i thought this was a wonderful way to spend ti...
3        basically there is a family where a little boy...
4        petter matteis love in the time of money is a ...
                               ...                        
49995    i thought this movie did a down right good job...
49996    bad plot bad dialogue bad acting idiotic direc...
49997    i am a catholic taught in parochial elementary...
49998    i am going to have to disagree with the previo...
49999    no one expects the star trek movies to be high...
Name: review, Length: 50000, dtype: object

In [24]:
#pip install spacy

In [25]:
#pip install contextualSpellCheck

In [41]:
# import spacy
# import contextualSpellCheck

In [42]:
from nltk import word_tokenize

In [43]:
demo_data = word_tokenize(imdb_data['review'][0])

In [44]:
demo_data

['one',
 'of',
 'the',
 'other',
 'reviewers',
 'has',
 'mentioned',
 'that',
 'after',
 'watching',
 'just',
 '1',
 'oz',
 'episode',
 'you',
 'will',
 'be',
 'hooked',
 'they',
 'are',
 'right',
 'as',
 'this',
 'is',
 'exactly',
 'what',
 'happened',
 'with',
 'methe',
 'first',
 'thing',
 'that',
 'struck',
 'me',
 'about',
 'oz',
 'was',
 'its',
 'brutality',
 'and',
 'unflinching',
 'scenes',
 'of',
 'violence',
 'which',
 'set',
 'in',
 'right',
 'from',
 'the',
 'word',
 'go',
 'trust',
 'me',
 'this',
 'is',
 'not',
 'a',
 'show',
 'for',
 'the',
 'faint',
 'hearted',
 'or',
 'timid',
 'this',
 'show',
 'pulls',
 'no',
 'punches',
 'with',
 'regards',
 'to',
 'drugs',
 'sex',
 'or',
 'violence',
 'its',
 'is',
 'hardcore',
 'in',
 'the',
 'classic',
 'use',
 'of',
 'the',
 'wordit',
 'is',
 'called',
 'oz',
 'as',
 'that',
 'is',
 'the',
 'nickname',
 'given',
 'to',
 'the',
 'oswald',
 'maximum',
 'security',
 'state',
 'penitentary',
 'it',
 'focuses',
 'mainly',
 'on',
 'em

In [51]:
unique_data = list(set(demo_data))

In [52]:
len(demo_data), len(unique_data)

(307, 189)
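
The pair above is (total tokens, distinct tokens) for the first review; the second number is the vocabulary size that the one-hot encoder will use. A minimal sketch of the same bookkeeping on a made-up token list, using `collections.Counter` so that per-word frequencies are available as well. The token list is an assumption for illustration.

```python
from collections import Counter

# Hypothetical token list standing in for demo_data
toy_tokens = ["the", "show", "is", "the", "best", "the", "show"]

token_counts = Counter(toy_tokens)

total_tokens = len(toy_tokens)      # every occurrence counts
vocab_size = len(token_counts)      # each distinct word counts once
top_two = token_counts.most_common(2)

print(total_tokens, vocab_size)
print(top_two)
```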

In [55]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

In [68]:
encoder = OneHotEncoder(sparse_output=False)

In [69]:
unique_encoder = encoder.fit(np.array(unique_data).reshape(-1,1))

In [70]:
demo_encoder = encoder.transform(np.array(demo_data).reshape(-1,1))

In [74]:
demo_encoder[0]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0.])

In [78]:
%pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [79]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [82]:
from nltk.corpus import stopwords

In [83]:
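# Load the stop-word list once and keep it as a set for fast membership tests.
# Calling stopwords.words('english') inside the loop re-reads the list for every
# token, which is likely why the run over 50,000 reviews below was interrupted.
english_stopwords = set(stopwords.words('english'))

def remove_stopword(text):
    new_text = []
    for word in text.split():
        if word not in english_stopwords:
            new_text.append(word)
    return " ".join(new_text)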

In [84]:
imdb_data['review'] = imdb_data['review'].apply(remove_stopword)

KeyboardInterrupt: 
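
The interrupted run above is consistent with the stop-word list being rebuilt for every token of every review. A minimal sketch of the faster pattern, in which the stop words are held in a set that is built once. The tiny hand-written set is an assumption standing in for `stopwords.words('english')`, so the example runs without the NLTK data; the two-row frame is also made up.

```python
import pandas as pd

# Stand-in stop-word set (assumption); in the notebook this would be
# set(stopwords.words('english')), built once outside the function
demo_stop_words = {"the", "is", "a", "of", "and", "this"}

def remove_stopword_fast(text, stop_words=demo_stop_words):
    # Set membership is a constant-time lookup, so cost grows only with text length
    return " ".join(word for word in text.split() if word not in stop_words)

# Hypothetical two-row frame standing in for imdb_data
toy_reviews = pd.DataFrame({
    "review": [
        "this is a wonderful little production",
        "the plot is bad and the acting is bad",
    ]
})

toy_reviews["review"] = toy_reviews["review"].apply(remove_stopword_fast)
cleaned = toy_reviews["review"].tolist()
print(cleaned)
```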