## Feature engineering

- Learned about feature engineering for text data
- Text vectorization: converting a text (categorical) column into numbers
- Core idea: the numerical representation must convey the semantic meaning of the sentence
- Techniques:
  - One Hot encoding
  - Bag of words
  - n-gram
  - Tf-Idf
  - Custom features
  - Word2Vec

In [1]:
import pandas as pd

In [2]:
imdb_data = pd.read_csv('IMDB Dataset.csv')

In [3]:
imdb_data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [4]:
imdb_data['review'][0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

In [5]:
import re
def remove_tags(text):
    return re.sub(r'<.*?>','',text)

In [6]:
# Apply to the whole column at once; element-wise chained assignment
# (imdb_data['review'][i] = ...) can trigger SettingWithCopyWarning
imdb_data['review'] = imdb_data['review'].apply(remove_tags)

In [7]:
imdb_data['review'].head()

0    One of the other reviewers has mentioned that ...
1    A wonderful little production. The filming tec...
2    I thought this was a wonderful way to spend ti...
3    Basically there's a family where a little boy ...
4    Petter Mattei's "Love in the Time of Money" is...
Name: review, dtype: object
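Replacing a tag with an empty string glues together the words on either side of it, which is where tokens such as `methe` and `wordit` further down come from. A minimal sketch of an alternative, assuming it is acceptable to substitute a space and then collapse repeated whitespace (`remove_tags_keep_space` is a hypothetical name, not part of the notebook):

```python
import re

TAG_RE = re.compile(r'<.*?>')

def remove_tags_keep_space(text):
    # Replace each tag with a space so neighbouring words stay separate
    no_tags = TAG_RE.sub(' ', text)
    # Collapse the runs of whitespace left behind by consecutive tags
    return re.sub(r'\s+', ' ', no_tags).strip()

sample = "happened with me.<br /><br />The first thing"
cleaned = remove_tags_keep_space(sample)
print(cleaned)
```

Swapping this in for `remove_tags` would change every downstream output in the notebook, so it is shown only as a comparison.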

In [8]:
imdb_data['review'][7]

"This show was an amazing, fresh & innovative idea in the 70's when it first aired. The first 7 or 8 years were brilliant, but things dropped off after that. By 1990, the show was not really funny anymore, and it's continued its decline further to the complete waste of time it is today.It's truly disgraceful how far this show has fallen. The writing is painfully bad, the performances are almost as bad - if not for the mildly entertaining respite of the guest-hosts, this show probably wouldn't still be on the air. I find it so hard to believe that the same creator that hand-selected the original cast also chose the band of hacks that followed. How can one recognize such brilliance and then see fit to replace it with such mediocrity? I felt I must give 2 stars out of respect for the original cast that made this show such a huge success. As it is now, the show is just awful. I can't believe it's still on the air."

In [9]:
from string import punctuation

In [10]:
punctuations = punctuation

In [11]:
def remove_puns(text):
    return text.translate(str.maketrans('','',punctuations))

In [12]:
remove_puns("Hello! my , name?")

'Hello my  name'
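The output above keeps a double space where the comma was, and deleting punctuation outright can also merge words that were separated only by a full stop. A hedged sketch of one alternative, assuming punctuation should become a space rather than vanish (`punct_to_space` is a hypothetical helper):

```python
from string import punctuation

# Map every punctuation character to a space instead of deleting it
_PUNCT_TO_SPACE = str.maketrans(punctuation, ' ' * len(punctuation))

def punct_to_space(text):
    # split() with no arguments drops empty strings, so runs of spaces collapse
    return ' '.join(text.translate(_PUNCT_TO_SPACE).split())

print(punct_to_space("Hello! my , name?"))
print(punct_to_space("liked it.Great Camp"))
```

The trade-off is that apostrophes also become spaces, so `there's` turns into `there s`. Contractions would need to be expanded before this step.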

In [13]:
imdb_data['review'] = imdb_data['review'].apply(remove_puns)

In [14]:
imdb_data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production The filming tech...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically theres a family where a little boy J...,negative
4,Petter Matteis Love in the Time of Money is a ...,positive


In [15]:
imdb_data['review'][9]

'If you like original gut wrenching laughter you will like this movie If you are young or old then you will love this movie hell even my mom liked itGreat Camp'

In [16]:
imdb_data['review'] = imdb_data['review'].str.lower()

In [17]:
imdb_data['review']

0        one of the other reviewers has mentioned that ...
1        a wonderful little production the filming tech...
2        i thought this was a wonderful way to spend ti...
3        basically theres a family where a little boy j...
4        petter matteis love in the time of money is a ...
                               ...                        
49995    i thought this movie did a down right good job...
49996    bad plot bad dialogue bad acting idiotic direc...
49997    i am a catholic taught in parochial elementary...
49998    im going to have to disagree with the previous...
49999    no one expects the star trek movies to be high...
Name: review, Length: 50000, dtype: object

In [18]:
#pip install contractions

In [19]:
import contractions
import re

# Dictionary for informal/missing apostrophes
informal_fixes = {
    "im": "i am",
    "dont": "do not",
    "cant": "can not",
    "wont": "will not",
    "didnt": "did not",
    "isnt": "is not",
    "wasnt": "was not",
    "wouldnt": "would not",
    "couldnt": "could not",
    "shouldnt": "should not",
    "hasnt": "has not",
    "havent": "have not",
    "hadnt": "had not",
    "thats": "that is",
    "whats": "what is",
    "heres": "here is",
    "theres": "there is"
}

def clean_review(text):
    # First, fix informal contractions (without apostrophes)
    for old, new in informal_fixes.items():
        pattern = r'\b' + old + r'\b'
        text = re.sub(pattern, new, text, flags=re.IGNORECASE)
    # Then, fix standard contractions (with apostrophes)
    text = contractions.fix(text)
    return text
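The `informal_fixes` dictionary is needed because punctuation, including apostrophes, was stripped before contractions were expanded. A minimal sketch of the opposite order, using a tiny hypothetical mapping (`SAMPLE_CONTRACTIONS`) in place of the `contractions` package so that it runs on its own:

```python
import re
from string import punctuation

# Hypothetical three-entry mapping, for illustration only
SAMPLE_CONTRACTIONS = {"can't": "cannot", "i'm": "i am", "there's": "there is"}

# One compiled alternation is applied in a single pass over the text
_CONTRACTION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in SAMPLE_CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand(text):
    return _CONTRACTION_RE.sub(lambda m: SAMPLE_CONTRACTIONS[m.group(0).lower()], text)

def strip_punct(text):
    return text.translate(str.maketrans('', '', punctuation))

raw = "I'm sure there's a plot, but I can't find it."
# Expand first, while the apostrophes are still present, then clean up
result = strip_punct(expand(raw).lower())
print(result)
```

With this ordering no separate table of apostrophe-less forms is required, although reviews that were typed without apostrophes in the first place would still need one.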

In [20]:
imdb_data['review'] = imdb_data['review'].apply(clean_review)

In [21]:
imdb_data['review']

0        one of the other reviewers has mentioned that ...
1        a wonderful little production the filming tech...
2        i thought this was a wonderful way to spend ti...
3        basically there is a family where a little boy...
4        petter matteis love in the time of money is a ...
                               ...                        
49995    i thought this movie did a down right good job...
49996    bad plot bad dialogue bad acting idiotic direc...
49997    i am a catholic taught in parochial elementary...
49998    i am going to have to disagree with the previo...
49999    no one expects the star trek movies to be high...
Name: review, Length: 50000, dtype: object

In [22]:
#pip install spacy

In [23]:
#pip install contextualSpellCheck

In [24]:
# import spacy
# import contextualSpellCheck

In [25]:
from nltk import word_tokenize

In [26]:
demo_data = word_tokenize(imdb_data['review'][0])

In [27]:
demo_data

['one',
 'of',
 'the',
 'other',
 'reviewers',
 'has',
 'mentioned',
 'that',
 'after',
 'watching',
 'just',
 '1',
 'oz',
 'episode',
 'you',
 'will',
 'be',
 'hooked',
 'they',
 'are',
 'right',
 'as',
 'this',
 'is',
 'exactly',
 'what',
 'happened',
 'with',
 'methe',
 'first',
 'thing',
 'that',
 'struck',
 'me',
 'about',
 'oz',
 'was',
 'its',
 'brutality',
 'and',
 'unflinching',
 'scenes',
 'of',
 'violence',
 'which',
 'set',
 'in',
 'right',
 'from',
 'the',
 'word',
 'go',
 'trust',
 'me',
 'this',
 'is',
 'not',
 'a',
 'show',
 'for',
 'the',
 'faint',
 'hearted',
 'or',
 'timid',
 'this',
 'show',
 'pulls',
 'no',
 'punches',
 'with',
 'regards',
 'to',
 'drugs',
 'sex',
 'or',
 'violence',
 'its',
 'is',
 'hardcore',
 'in',
 'the',
 'classic',
 'use',
 'of',
 'the',
 'wordit',
 'is',
 'called',
 'oz',
 'as',
 'that',
 'is',
 'the',
 'nickname',
 'given',
 'to',
 'the',
 'oswald',
 'maximum',
 'security',
 'state',
 'penitentary',
 'it',
 'focuses',
 'mainly',
 'on',
 'em

In [28]:
unique_data = list(set(demo_data))

In [29]:
len(demo_data), len(unique_data)

(307, 189)

In [30]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

In [31]:
encoder = OneHotEncoder(sparse_output=False)

In [32]:
unique_encoder = encoder.fit(np.array(unique_data).reshape(-1,1))

In [33]:
demo_encoder = encoder.transform(np.array(demo_data).reshape(-1,1))

In [34]:
demo_encoder[0]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0.])
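The single `1.` in the vector above marks the position of the first token within the encoder's sorted list of categories. A small pure-Python sketch of the same idea on a made-up five-token sentence (the names `tokens`, `index` and `one_hot` are illustrative):

```python
tokens = ["one", "of", "the", "other", "the"]

# OneHotEncoder sorts its categories by default; mirror that here
vocab = sorted(set(tokens))
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # A vector of zeros with a single 1 at the word's vocabulary position
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

encoded = [one_hot(t) for t in tokens]
for token, vec in zip(tokens, encoded):
    print(f"{token:>5} -> {vec}")
```

Each vector is as long as the vocabulary, and repeated words get identical vectors. This is why one-hot encoding becomes very sparse and very wide on a real corpus.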

In [39]:
#pip install nltk

In [36]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [37]:
from nltk.corpus import stopwords

In [41]:
stop_words = set(stopwords.words('english'))

In [42]:
def remove_stopword(text):
    new_text = []
    for i in text.split():
        if i not in stop_words:
            new_text.append(i)
    return " ".join(new_text)

In [40]:
# Try the function on the first 1,001 reviews (labels 0..1000; .loc slices include the end label)
imdb_data.loc[:1000, 'review'] = imdb_data.loc[:1000, 'review'].apply(remove_stopword)

In [43]:
from nltk.corpus import stopwords
import re

stop_words = set(stopwords.words('english'))
# Create regex pattern to match stopwords as whole words
pattern = r'\b(' + '|'.join(re.escape(word) for word in stop_words) + r')\b'

# Remove stopwords in one vectorized operation
imdb_data['review'] = imdb_data['review'].str.replace(pattern, '', regex=True, case=False)
# Clean up extra spaces
imdb_data['review'] = imdb_data['review'].str.replace(r'\s+', ' ', regex=True).str.strip()
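To make the regex approach easier to follow, here is the same idea on a single string. This is a sketch with a four-word stand-in stop list (`toy_stop_words`), not the NLTK list:

```python
import re

toy_stop_words = {"the", "is", "a", "of"}  # stand-in for stopwords.words('english')

# \b ... \b makes sure only whole words are matched, so "of" does not hit "often"
toy_pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(w) for w in sorted(toy_stop_words)) + r')\b',
    flags=re.IGNORECASE,
)

def drop_stopwords(text):
    without = toy_pattern.sub('', text)
    # Removing words leaves gaps; squeeze them back to single spaces
    return re.sub(r'\s+', ' ', without).strip()

print(drop_stopwords("The plot of the movie is a mess"))
print(drop_stopwords("often a theme"))
```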

In [46]:
corpus = imdb_data['review'].tolist()

In [47]:
corpus

['one reviewers mentioned watching 1 oz episode hooked right exactly happened methe first thing struck oz brutality unflinching scenes violence set right word go trust show faint hearted timid show pulls punches regards drugs sex violence hardcore classic use wordit called oz nickname given oswald maximum security state penitentary focuses mainly emerald city experimental section prison cells glass fronts face inwards privacy high agenda city home manyaryans muslims gangstas latinos christians italians irish moreso scuffles death stares dodgy dealings shady agreements never far awayi would say main appeal show due fact goes shows would dare forget pretty pictures painted mainstream audiences forget charm forget romanceoz mess around first episode ever saw struck nasty surreal could say ready watched developed taste oz got accustomed high levels graphic violence violence injustice crooked guards sold nickel inmates kill order get away well mannered middle class inmates turned prison bit

In [62]:
import itertools

In [63]:
words = [i.split() for i in corpus]

In [75]:
words

[['one',
  'reviewers',
  'mentioned',
  'watching',
  '1',
  'oz',
  'episode',
  'hooked',
  'right',
  'exactly',
  'happened',
  'methe',
  'first',
  'thing',
  'struck',
  'oz',
  'brutality',
  'unflinching',
  'scenes',
  'violence',
  'set',
  'right',
  'word',
  'go',
  'trust',
  'show',
  'faint',
  'hearted',
  'timid',
  'show',
  'pulls',
  'punches',
  'regards',
  'drugs',
  'sex',
  'violence',
  'hardcore',
  'classic',
  'use',
  'wordit',
  'called',
  'oz',
  'nickname',
  'given',
  'oswald',
  'maximum',
  'security',
  'state',
  'penitentary',
  'focuses',
  'mainly',
  'emerald',
  'city',
  'experimental',
  'section',
  'prison',
  'cells',
  'glass',
  'fronts',
  'face',
  'inwards',
  'privacy',
  'high',
  'agenda',
  'city',
  'home',
  'manyaryans',
  'muslims',
  'gangstas',
  'latinos',
  'christians',
  'italians',
  'irish',
  'moreso',
  'scuffles',
  'death',
  'stares',
  'dodgy',
  'dealings',
  'shady',
  'agreements',
  'never',
  'far',
  

In [70]:
# Flatten the per-review token lists into one list of all tokens (repeats included)
vocab = list(itertools.chain.from_iterable(words))

In [72]:
vocabulary = set(vocab)

In [78]:
vocabulary

{'sterile',
 'llthird',
 'fastprobably',
 'releaseif',
 'stammering',
 'moneda',
 'familyshowever',
 'evertytime',
 'fairytale',
 'zierings',
 'gameyou',
 'permit',
 'anjanette',
 'vampireits',
 'greatlively',
 'coronet',
 'violentwell',
 'terrorizing',
 'versionso',
 'federica',
 'trywhile',
 'historybtw',
 'rationalein',
 'quartermain',
 'distractingly',
 'flyboys',
 '399',
 'racoons',
 'quasiguru',
 'believabilityhe',
 'shandi',
 'furrierpawn',
 'neitheri',
 'waistcoat',
 'terriblea',
 'planethis',
 'abstracts',
 'mysterieswatch',
 'insultingfirst',
 'koln',
 'stinkeroonie',
 'foundeven',
 'contributiondebt',
 'propagandas',
 'deeperbirthday',
 'pixarthe',
 'prose',
 'clashed',
 'effect8',
 'iceopen',
 'fuflo',
 'rula',
 '3dfx',
 'durai',
 'ideasin',
 'nathans',
 'utrillo',
 'tooproper',
 'cooking',
 'disappointed7',
 'makeshift',
 'rizzuto',
 'chowringhee',
 'supermarketits',
 'bounds',
 'lordi',
 'anymorein',
 'funclean',
 'cloningand',
 'lineno',
 'wellstar',
 'plane»hasta',
 'kl

In [73]:
len(vocab), len(vocabulary)

(5884099, 222305)
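The two numbers above are the total token count and the number of distinct tokens. A `collections.Counter` gives both, plus the per-word frequencies, in one pass. A minimal sketch on a two-document toy corpus (`toy_docs` is made up for illustration):

```python
from collections import Counter

toy_docs = ["bad plot bad acting", "good plot"]

# Count every whitespace-separated token across all documents
word_counts = Counter(word for doc in toy_docs for word in doc.split())

total_tokens = sum(word_counts.values())  # plays the role of len(vocab)
distinct_tokens = len(word_counts)        # plays the role of len(vocabulary)

print(total_tokens, distinct_tokens)
print(word_counts.most_common(2))
```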

In [76]:
vocab2 = {word for i in words for word in i}

In [77]:
vocab2

{'sterile',
 'llthird',
 'fastprobably',
 'releaseif',
 'stammering',
 'moneda',
 'familyshowever',
 'evertytime',
 'fairytale',
 'zierings',
 'gameyou',
 'permit',
 'anjanette',
 'vampireits',
 'greatlively',
 'coronet',
 'violentwell',
 'terrorizing',
 'versionso',
 'federica',
 'trywhile',
 'historybtw',
 'rationalein',
 'quartermain',
 'distractingly',
 'flyboys',
 '399',
 'racoons',
 'quasiguru',
 'believabilityhe',
 'shandi',
 'furrierpawn',
 'neitheri',
 'waistcoat',
 'terriblea',
 'planethis',
 'abstracts',
 'mysterieswatch',
 'insultingfirst',
 'koln',
 'stinkeroonie',
 'foundeven',
 'contributiondebt',
 'propagandas',
 'deeperbirthday',
 'pixarthe',
 'prose',
 'clashed',
 'effect8',
 'iceopen',
 'fuflo',
 'rula',
 '3dfx',
 'durai',
 'ideasin',
 'nathans',
 'utrillo',
 'tooproper',
 'cooking',
 'disappointed7',
 'makeshift',
 'rizzuto',
 'chowringhee',
 'supermarketits',
 'bounds',
 'lordi',
 'anymorein',
 'funclean',
 'cloningand',
 'lineno',
 'wellstar',
 'plane»hasta',
 'kl

In [79]:
# Apply bag of words: find the vocabulary and the number of times each word occurs

In [97]:
sample_corpus = corpus[10:15]

In [80]:
from sklearn.feature_extraction.text import CountVectorizer

In [91]:
vectorizer = CountVectorizer()

In [98]:
X = vectorizer.fit_transform(sample_corpus)

In [104]:
v1 = vectorizer.get_feature_names_out()

In [100]:
X.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [2, 0, 0, ..., 2, 0, 1],
       [0, 1, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0]], dtype=int64)

In [101]:
print(X.toarray()[0])

[0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 1 0 1 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 2 1 1 0 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0
 0 1 2 0 1 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 2 0 1 0 1 0 1 1 0 1 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]


In [114]:
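# Manual count matrix: one row per vocabulary term, one column per document
# (the transpose of X). Note: str.count matches substrings, so 'boll' is also
# counted inside 'bolls' and some counts can exceed CountVectorizer's.
count_vector = []
for term in v1:
    count_vector.append([doc.count(term) for doc in sample_corpus])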

In [115]:
count_vector

[[0, 2, 0, 0, 0],
 [0, 0, 1, 0, 0],
 [0, 0, 0, 0, 1],
 [1, 0, 1, 0, 0],
 [0, 0, 1, 0, 0],
 [0, 1, 0, 0, 0],
 [0, 0, 1, 0, 0],
 [0, 0, 1, 0, 0],
 [0, 0, 1, 0, 0],
 [0, 1, 0, 0, 0],
 [1, 0, 0, 0, 0],
 [0, 0, 1, 0, 0],
 [0, 1, 0, 0, 0],
 [0, 0, 1, 0, 0],
 [1, 0, 0, 0, 1],
 [1, 0, 0, 0, 0],
 [0, 0, 1, 0, 0],
 [1, 0, 0, 0, 0],
 [0, 0, 0, 1, 0],
 [0, 0, 1, 0, 0],
 [1, 0, 0, 0, 0],
 [0, 0, 1, 0, 1],
 [0, 0, 1, 0, 0],
 [0, 0, 0, 0, 1],
 [1, 0, 0, 0, 0],
 [0, 1, 0, 0, 0],
 [0, 0, 0, 0, 1],
 [1, 0, 0, 0, 0],
 [0, 0, 1, 0, 0],
 [0, 1, 2, 0, 0],
 [0, 1, 0, 0, 0],
 [0, 0, 1, 0, 0],
 [0, 0, 1, 0, 0],
 [0, 0, 7, 0, 0],
 [0, 0, 2, 0, 0],
 [0, 0, 1, 0, 0],
 [0, 0, 0, 2, 0],
 [0, 0, 0, 1, 0],
 [0, 0, 1, 0, 0],
 [0, 0, 0, 1, 0],
 [1, 0, 0, 0, 0],
 [1, 0, 0, 0, 0],
 [0, 1, 0, 0, 0],
 [0, 0, 1, 0, 0],
 [0, 2, 0, 0, 0],
 [0, 0, 1, 1, 0],
 [0, 1, 0, 0, 0],
 [0, 0, 5, 0, 0],
 [0, 0, 0, 1, 0],
 [0, 0, 0, 1, 0],
 [1, 0, 0, 0, 0],
 [0, 1, 0, 0, 0],
 [0, 0, 0, 0, 1],
 [0, 0, 1, 0, 0],
 [0, 0, 0, 1, 0],
 [0, 0, 0,

In [116]:
len(count_vector)

319
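The manual loop above counts substrings, whereas a bag of words counts whole tokens. A short sketch of token-level counting on a toy corpus, laid out with documents as rows like `X` (the two documents are invented for the example):

```python
toy_docs = ["boll bolls film", "film film bad"]

# Vocabulary: sorted distinct tokens
toy_vocab = sorted({word for doc in toy_docs for word in doc.split()})

# Token-level counts: split first, then count exact matches
token_matrix = [[doc.split().count(term) for term in toy_vocab] for doc in toy_docs]

# Substring counts for comparison: 'boll' is also found inside 'bolls'
substring_matrix = [[doc.count(term) for term in toy_vocab] for doc in toy_docs]

print(toy_vocab)
print(token_matrix)
print(substring_matrix)
```

Only the token-level version treats `boll` and `bolls` as separate words, which is how `CountVectorizer` counts them.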

In [117]:
# Apply bag of bi-grams and bag of tri-grams, and note how the dimensionality of the vocabulary changes

In [118]:
vector2 = CountVectorizer(ngram_range=(2,2))

In [120]:
Y = vector2.fit_transform(sample_corpus).toarray()

In [122]:
len(Y[0])

413

In [125]:
Y.shape

(5, 413)

In [126]:
X.shape

(5, 319)

In [133]:
vector3 = CountVectorizer(ngram_range=(3,3))

In [134]:
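# Use vector3 here; the (5, 413) recorded below came from refitting vector2 (bi-grams), so rerun for the tri-gram shape
Z = vector3.fit_transform(sample_corpus).toarray()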

In [135]:
Z.shape

(5, 413)

In [136]:
vector4 = CountVectorizer(ngram_range=(4,4))
W = vector4.fit_transform(sample_corpus).toarray()
W.shape

(5, 409)
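A document with m tokens yields m − n + 1 word n-grams. On a sample this small almost every n-gram is unique, which is a likely reason the vocabulary grows from 319 unigrams to 413 bi-grams and then shrinks slightly to 409 for 4-grams. A hand-rolled sketch of n-gram extraction on one made-up sentence (`ngrams` is an illustrative helper, not a scikit-learn function):

```python
def ngrams(tokens, n):
    # Slide a window of width n over the token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

doc_tokens = "bad plot bad dialogue bad acting".split()

vocab_sizes = {}
for n in (1, 2, 3):
    grams = ngrams(doc_tokens, n)
    vocab_sizes[n] = len(set(grams))
    print(n, len(grams), vocab_sizes[n], grams)
```

Repeated words collapse in the unigram vocabulary, but the bi-grams around them are mostly distinct, so the bi-gram vocabulary is larger even though there are fewer bi-gram positions.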

In [137]:
# Apply tf-idf and find out the idf scores of words, also find out the vocabulary.

In [138]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [147]:
tfidf = TfidfVectorizer(use_idf=True)

In [148]:
X_1 = tfidf.fit_transform(sample_corpus)

In [149]:
tfidf.get_feature_names_out(),len(tfidf.get_feature_names_out())

(array(['12', 'acting', 'actors', 'actual', 'actually', 'afternoons',
        'agenda', 'ago', 'ahead', 'air', 'alien', 'along', 'angle',
        'annoying', 'another', 'anymoreits', 'apparently', 'appeal',
        'appreciate', 'areas', 'around', 'bad', 'badass', 'bart', 'based',
        'beautiful', 'become', 'better', 'beyond', 'big', 'bird', 'biz',
        'boat', 'boll', 'bolls', 'bought', 'bowdler', 'bowdlerization',
        'bratwurst', 'bring', 'brother', 'budget', 'bumped', 'called',
        'came', 'cannot', 'care', 'carver', 'cast', 'certain',
        'characters', 'cheesy', 'clooney', 'complained', 'composition',
        'constant', 'countrymen', 'cromedalbino', 'cry', 'currently',
        'cut', 'dangling', 'daughter', 'delivers', 'demented', 'died',
        'disappointed', 'dr', 'dudes', 'early', 'eating', 'end', 'english',
        'enjoyed', 'enters', 'erain', 'even', 'eventually', 'everybody',
        'everything', 'evil', 'experience', 'famous', 'fan', 'fantastic',
   

In [150]:
X_1.toarray().shape

(5, 319)

In [155]:
idf_values = tfidf.idf_

In [154]:
sorted_vocab = sorted(tfidf.vocabulary_)

In [159]:
# Use `word` as the loop variable so the earlier `vocab` list is not overwritten
for idf, word in zip(idf_values, sorted_vocab):
    print(f'The word "{word}" has idf score of "{idf}"')

The word "12" has idf score of "2.09861228866811"
The word "acting" has idf score of "2.09861228866811"
The word "actors" has idf score of "2.09861228866811"
The word "actual" has idf score of "2.09861228866811"
The word "actually" has idf score of "2.09861228866811"
The word "afternoons" has idf score of "2.09861228866811"
The word "agenda" has idf score of "2.09861228866811"
The word "ago" has idf score of "2.09861228866811"
The word "ahead" has idf score of "2.09861228866811"
The word "air" has idf score of "2.09861228866811"
The word "alien" has idf score of "2.09861228866811"
The word "along" has idf score of "2.09861228866811"
The word "angle" has idf score of "2.09861228866811"
The word "annoying" has idf score of "2.09861228866811"
The word "another" has idf score of "1.6931471805599454"
The word "anymoreits" has idf score of "2.09861228866811"
The word "apparently" has idf score of "2.09861228866811"
The word "appeal" has idf score of "2.09861228866811"
The word "appreciate" h
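The two distinct scores in the listing can be reproduced by hand. With its default `smooth_idf=True`, scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing the term. A small check for the 5-document sample (`smooth_idf` here is a local helper function, not the library parameter):

```python
import math

def smooth_idf(n_docs, doc_freq):
    # Smoothed inverse document frequency, as used by TfidfVectorizer by default
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 5  # sample_corpus holds five reviews

idf_one_doc = smooth_idf(n_docs, 1)   # term appears in a single review
idf_two_docs = smooth_idf(n_docs, 2)  # term appears in two reviews, e.g. "another"

print(idf_one_doc)
print(idf_two_docs)
```

Words that occur in only one of the five reviews all share the higher score, while `another`, which occurs in two of them, is weighted down.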

In [160]:
corpus_tfidf = tfidf.fit_transform(corpus)

In [163]:
corpus_tfidf.shape

(50000, 221169)

In [166]:
type(corpus_tfidf)

scipy.sparse._csr.csr_matrix

In [167]:
corpus_tfidf.nnz

4884771
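`nnz` is the number of stored non-zero entries, so together with the shape it shows how sparse the matrix is. A quick calculation using the figures reported above:

```python
rows, cols = 50000, 221169  # corpus_tfidf.shape
nnz = 4884771               # corpus_tfidf.nnz

# Fraction of cells that actually hold a value
density = nnz / (rows * cols)

# Average number of distinct vocabulary terms per review
per_doc = nnz / rows

print(f"density: {density:.6%}")
print(f"average non-zero terms per review: {per_doc:.1f}")
```

Fewer than one cell in a thousand is non-zero, which is why the result is kept as a `csr_matrix`. Calling `.toarray()` on the full corpus would allocate all 50,000 × 221,169 cells.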