
# text representation

In [2]:
import numpy as np
import pandas as pd

In [3]:
df = pd.DataFrame({"text":["people watch campusx","campusx watch campusx","people write comment","campusx write comment"],"output":[1,1,0,0]})

## one hot encoding

In [4]:
from sklearn.preprocessing import OneHotEncoder

In [5]:
OHE = OneHotEncoder()

In [6]:
# note: fitting on the whole DataFrame treats each full sentence and each label as one category
ohe = OHE.fit_transform(df)


In [7]:
ohe.toarray()

array([[0., 0., 1., 0., 0., 1.],
       [1., 0., 0., 0., 0., 1.],
       [0., 0., 0., 1., 1., 0.],
       [0., 1., 0., 0., 1., 0.]])
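The cells above encode each whole sentence (and each label) as a single category. For text, one hot encoding usually means one vector per word. A minimal sketch of that idea in plain Python follows; `build_vocab` and `one_hot_sentence` are hypothetical helper names, not part of scikit-learn.

```python
corpus = [
    "people watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]

def build_vocab(docs):
    """Map every distinct word to a column index, in alphabetical order."""
    words = sorted({w for doc in docs for w in doc.split()})
    return {w: i for i, w in enumerate(words)}

def one_hot_sentence(sentence, vocab):
    """Return one one-hot vector per word of the sentence."""
    vectors = []
    for word in sentence.split():
        vec = [0] * len(vocab)
        vec[vocab[word]] = 1  # a single 1 at the word's column
        vectors.append(vec)
    return vectors

vocab = build_vocab(corpus)
encoded = one_hot_sentence(corpus[0], vocab)
print(vocab)
print(encoded)
```

Each sentence becomes a list of vectors whose length equals the vocabulary size, which is why this representation grows quickly on a real corpus.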

## bag of words

In [8]:
df

| | text | output |
|---|---|---|
| 0 | people watch campusx | 1 |
| 1 | campusx watch campusx | 1 |
| 2 | people write comment | 0 |
| 3 | campusx write comment | 0 |


In [9]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [10]:
bow = cv.fit_transform(df['text'])

In [11]:
# vocabulary_ maps each word to its column index in the count matrix
# indices are assigned in alphabetical order of the words
cv.vocabulary_

{'people': 2, 'watch': 3, 'campusx': 0, 'write': 4, 'comment': 1}

In [12]:
print(bow[0].toarray())
print(bow[2].toarray())

[[1 0 1 1 0]]
[[0 1 1 0 1]]


In [13]:
cv.transform(["people hello people watch campusx"]).toarray()

array([[1, 0, 2, 1, 0]])
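The result above has no column for `hello`: a word that was not seen during fitting is simply ignored at transform time. The sketch below reproduces that counting logic in plain Python. `bow_vector` is a hypothetical helper and it splits on whitespace only, so it does not replicate CountVectorizer's tokenisation rules.

```python
from collections import Counter

# Vocabulary as learned from the four training sentences
vocabulary = {"campusx": 0, "comment": 1, "people": 2, "watch": 3, "write": 4}

def bow_vector(text, vocabulary):
    """Count known words; words missing from the vocabulary are dropped."""
    counts = Counter(text.split())
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        index = vocabulary.get(word)
        if index is not None:
            vector[index] = count
    return vector

vec = bow_vector("people hello people watch campusx", vocabulary)
print(vec)
```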

Refer to the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and try changing some of the hyperparameters, for example:

`binary` : bool, default=False

If True, all non-zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.

`max_features` : int, default=None

If not None, build a vocabulary that only considers the top `max_features` terms ordered by term frequency across the corpus. Otherwise, all features are used.

## n-grams

In [14]:
cv = CountVectorizer(ngram_range=(2,2))


In [15]:
bow = cv.fit_transform(df['text'])

In [16]:
cv.vocabulary_

{'people watch': 2,
 'watch campusx': 4,
 'campusx watch': 0,
 'people write': 3,
 'write comment': 5,
 'campusx write': 1}

In [17]:
bow[0].toarray()

array([[0, 0, 1, 0, 1, 0]])

In [18]:
cv = CountVectorizer(ngram_range=(2,3))

In [19]:
bow = cv.fit_transform(df['text'])

In [20]:
cv.vocabulary_

{'people watch': 4,
 'watch campusx': 8,
 'people watch campusx': 5,
 'campusx watch': 0,
 'campusx watch campusx': 1,
 'people write': 6,
 'write comment': 9,
 'people write comment': 7,
 'campusx write': 2,
 'campusx write comment': 3}
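To see where these vocabulary entries come from, here is a minimal sketch of word n-gram generation in plain Python. `word_ngrams` is a hypothetical helper that splits on whitespace and mirrors the meaning of `ngram_range=(n_min, n_max)`, not scikit-learn's actual implementation.

```python
def word_ngrams(text, n_min, n_max):
    """Return all word n-grams with n_min <= n <= n_max, shortest first."""
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        # slide a window of n tokens over the sentence
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

bigrams = word_ngrams("people watch campusx", 2, 2)
bi_and_tri = word_ngrams("people watch campusx", 2, 3)
print(bigrams)
print(bi_and_tri)
```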

## tf-idf

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()

In [22]:
tf_idf = tfidf.fit_transform(df['text'])

In [23]:
tf_idf.toarray()

array([[0.49681612, 0.        , 0.61366674, 0.61366674, 0.        ],
       [0.8508161 , 0.        , 0.        , 0.52546357, 0.        ],
       [0.        , 0.57735027, 0.57735027, 0.        , 0.57735027],
       [0.49681612, 0.61366674, 0.        , 0.        , 0.61366674]])

In [24]:
print(tfidf.idf_)
print(tfidf.get_feature_names_out())

[1.22314355 1.51082562 1.51082562 1.51082562 1.51082562]
['campusx' 'comment' 'people' 'watch' 'write']
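The numbers above are consistent with scikit-learn's default settings: a smoothed idf, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalisation of each row. The sketch below recomputes them in plain Python; `smoothed_idf` and `tfidf_row` are hypothetical helper names.

```python
import math

docs = [
    "people watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]
n_docs = len(docs)
vocab = sorted({w for d in docs for w in d.split()})

def smoothed_idf(term):
    """ln((1 + n) / (1 + df)) + 1, where df = number of documents containing term."""
    df = sum(term in d.split() for d in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

idf = {t: smoothed_idf(t) for t in vocab}

def tfidf_row(doc):
    """Raw term count times idf, then scale the row to unit L2 norm."""
    tokens = doc.split()
    raw = [tokens.count(t) * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

row0 = tfidf_row(docs[0])
print(idf)
print(row0)
```

`campusx` appears in three of the four sentences, so it gets the lowest idf and the smallest weight in the first row.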


# assignment

In [25]:
!kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Dataset URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
License(s): other
imdb-dataset-of-50k-movie-reviews.zip: Skipping, found more recently modified local copy (use --force to force download)


In [26]:
!kaggle datasets download -d heptapod/titanic


Dataset URL: https://www.kaggle.com/datasets/heptapod/titanic
License(s): DbCL-1.0
titanic.zip: Skipping, found more recently modified local copy (use --force to force download)


In [27]:
!unzip /content/imdb-dataset-of-50k-movie-reviews.zip -d data


Archive:  /content/imdb-dataset-of-50k-movie-reviews.zip
replace data/IMDB Dataset.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: data/IMDB Dataset.csv   


In [28]:
import pandas as pd

In [29]:
df = pd.read_csv("/content/data/IMDB Dataset.csv")

In [30]:
df

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. \<br />\<br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |
| ... | ... | ... |
| 49995 | I thought this movie did a down right good job... | positive |
| 49996 | Bad plot, bad dialogue, bad acting, idiotic di... | negative |
| 49997 | I am a Catholic taught in parochial elementary... | negative |
| 49998 | I'm going to have to disagree with the previou... | negative |


## problem 1 - preprocessing

### lowercasing

In [31]:
df['review'] = df['review'].str.lower()
df

| | review | sentiment |
|---|---|---|
| 0 | one of the other reviewers has mentioned that ... | positive |
| 1 | a wonderful little production. \<br />\<br />the... | positive |
| 2 | i thought this was a wonderful way to spend ti... | positive |
| 3 | basically there's a family where a little boy ... | negative |
| 4 | petter mattei's "love in the time of money" is... | positive |
| ... | ... | ... |
| 49995 | i thought this movie did a down right good job... | positive |
| 49996 | bad plot, bad dialogue, bad acting, idiotic di... | negative |
| 49997 | i am a catholic taught in parochial elementary... | negative |
| 49998 | i'm going to have to disagree with the previou... | negative |


### removing html, urls and punctuation


In [32]:
import re
def remove_html_tags(text):
  pattern = re.compile('<.*?>')
  return pattern.sub(r'',text)


In [33]:
df['review'] = df['review'].apply(remove_html_tags)

In [34]:
def remove_url(text):
  pattern = re.compile(r'https?://\S+|www\.\S+')
  return pattern.sub(r'',text)

In [35]:
df['review'] = df['review'].apply(remove_url)


In [36]:
import string
exclude = string.punctuation
def remove_punc(text):
  return text.translate(str.maketrans('','',exclude))

df['review'] = df['review'].apply(remove_punc)
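One side effect of the steps above is worth checking on a small sample: removing a tag outright joins the words on either side of it, which may explain tokens such as `methe` and `wordit` that show up in the vocabulary later. The sketch below compares that with replacing tags by a space. `clean` is a hypothetical helper that chains the same regular expressions, and the sample string and URL are made up for illustration.

```python
import re
import string

def clean(text, tag_replacement):
    """Lowercase, strip HTML tags and URLs, drop punctuation, squeeze spaces."""
    text = text.lower()
    text = re.sub(r"<.*?>", tag_replacement, text)
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

sample = "A wonderful film!<br /><br />See https://example.com for more."

glued = clean(sample, "")    # tags removed outright, as in the cells above
spaced = clean(sample, " ")  # tags replaced by a space instead

print(glued)
print(spaced)
```

The first variant yields the merged token `filmsee`; the second keeps `film` and `see` apart.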

### handling slang

In [37]:
import pandas as pd

# URL of the raw file
url = "https://raw.githubusercontent.com/rishabhverma17/sms_slang_translator/refs/heads/master/slang.txt"

# Load data into a DataFrame
slang_df = pd.read_csv(url, delimiter="=", header=None, names=["Key", "Value"])  # Change delimiter based on file format (e.g., "," for CSV)
common_slangs = slang_df.set_index("Key")["Value"].to_dict()

def expand_slangs(text):
  new_text=[]
  for w in text.split():
    if w.upper() in common_slangs:
      new_text.append(common_slangs[w.upper()])
    else:
      new_text.append(w)
  return " ".join(new_text)

df['review'] = df['review'].apply(expand_slangs)

### spelling correction

In [38]:
# skipped: spelling correction over all 50k reviews would take a long time
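For reference, a rough standard-library sketch of the idea, tried on one short string rather than the full dataset. It uses `difflib.get_close_matches` against a tiny, made-up word list as a stand-in for a real spelling corrector; `known_words`, `correct_word` and `correct_text` are hypothetical names, and the 0.85 cutoff is an arbitrary assumption.

```python
import difflib

# Tiny illustrative dictionary; a real corrector would use a full word list
known_words = ["wonderful", "production", "acting", "movie", "dialogue"]

def correct_word(word, vocabulary, cutoff=0.85):
    """Replace a word with its closest dictionary entry if one is similar enough."""
    if word in vocabulary:
        return word
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

def correct_text(text, vocabulary):
    return " ".join(correct_word(w, vocabulary) for w in text.split())

fixed = correct_text("wonderfull prodution with bad actng", known_words)
print(fixed)
```

Words with no close match (`with`, `bad`) are left unchanged. Comparing every token against a large dictionary this way is slow, which is the cost the comment above refers to.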

### removing stop words

In [39]:
!pip install nltk



In [40]:
import nltk
from nltk.corpus import stopwords

In [41]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [42]:
from nltk.corpus import stopwords

# Convert stopwords to a set for fast lookup
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    return " ".join([word for word in text.split() if word not in stop_words])

# Apply function to DataFrame
df['review'] = df['review'].apply(remove_stopwords)


### removing emojis

In [43]:
import re

# Function to remove emojis
def remove_emoji(text):
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # Emoticons
        "\U0001F300-\U0001F5FF"  # Symbols & Pictographs
        "\U0001F680-\U0001F6FF"  # Transport & Map Symbols
        "\U0001F700-\U0001F77F"  # Alchemical Symbols
        "\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
        "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
        "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
        "\U0001FA00-\U0001FA6F"  # Chess Symbols
        "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
        "\U00002702-\U000027B0"  # Dingbats
        "\U000024C2-\U0001F251"  # Enclosed Characters; note this wide range also spans CJK and other non-emoji scripts
        "]+",
        flags=re.UNICODE,
    )
    return emoji_pattern.sub(r"", text)


df['review'] = df['review'].apply(remove_emoji)

## problem 2 - corpus size and vocabulary

Find the total number of words in the entire corpus and the number of unique words (the vocabulary) using only Python.

In [44]:
df.head()

| | review | sentiment |
|---|---|---|
| 0 | one reviewers mentioned watching 1 oz episode ... | positive |
| 1 | wonderful little production filming technique ... | positive |
| 2 | thought wonderful way spend time hot summer we... | positive |
| 3 | basically theres family little boy jake thinks... | negative |
| 4 | petter matteis love time money visually stunni... | positive |


In [45]:
total_count = 0   # running count of all tokens in the corpus
my_set = set()    # distinct tokens seen so far (the vocabulary)
for review in df['review']:
  my_list = review.split()
  total_count += len(my_list)
  my_set.update(my_list)

print(f"total words : {total_count}")
print(f"unique words: {len(my_set)}")
# average number of occurrences per unique word
print(f"factor : {total_count/len(my_set)}")

total words : 5995453
unique words: 222453
factor : 26.951549316035297
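The same two totals can also be obtained with `collections.Counter`, which additionally keeps a per-word frequency. This is a sketch on three made-up review snippets rather than the real DataFrame.

```python
from collections import Counter

# Stand-in for df['review']; the real corpus has 50k reviews
reviews = [
    "one reviewers mentioned watching",
    "wonderful little production",
    "one wonderful way",
]

word_freq = Counter()
for review in reviews:
    word_freq.update(review.split())

total_words = sum(word_freq.values())  # every token, with repeats
unique_words = len(word_freq)          # size of the vocabulary

print(f"total words : {total_words}")
print(f"unique words: {unique_words}")
print(word_freq.most_common(2))
```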


## problem 3 - one hot encoding

In [46]:
from sklearn.preprocessing import OneHotEncoder

In [47]:
ohe = OneHotEncoder()

In [48]:
# note: as before, this treats each whole review and each sentiment label as one category
myobj = ohe.fit_transform(df)

In [49]:
myobj.toarray()

array([[0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       ...,
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 1., 0.]])

## problem 4 - bag of words

In [50]:
from sklearn.feature_extraction.text import CountVectorizer

In [51]:
cv = CountVectorizer()

In [53]:
bow = cv.fit_transform(df['review'])

In [54]:
bow[0].toarray()

array([[0, 0, 0, ..., 0, 0, 0]])

In [56]:
cv.vocabulary_

{'one': 139356,
 'reviewers': 163068,
 'mentioned': 123346,
 'watching': 211960,
 'oz': 143101,
 'episode': 64516,
 'youll': 219877,
 'hooked': 93648,
 'right': 163840,
 'exactly': 66472,
 'happened': 87796,
 'methe': 123951,
 'first': 73552,
 'thing': 195446,
 'struck': 186997,
 'brutality': 29009,
 'unflinching': 205102,
 'scenes': 169298,
 'violence': 209640,
 'set': 173627,
 'word': 217012,
 'go': 82251,
 'trust': 202016,
 'show': 176004,
 'faint': 69010,
 'hearted': 89425,
 'timid': 197914,
 'pulls': 155768,
 'punches': 155851,
 'regards': 160465,
 'drugs': 58753,
 'sex': 173893,
 'hardcore': 88093,
 'classic': 38379,
 'use': 207109,
 'wordit': 217031,
 'called': 31219,
 'nickname': 134037,
 'given': 81791,
 'oswald': 141224,
 'maximum': 121370,
 'security': 171498,
 'state': 184580,
 'penitentary': 145991,
 'focuses': 74958,
 'mainly': 118336,
 'emerald': 62394,
 'city': 38075,
 'experimental': 67589,
 'section': 171463,
 'prison': 153661,
 'cells': 34252,
 'glass': 81917,
 'fron

In [58]:

# Get the feature names (words in the vocabulary)
words = cv.get_feature_names_out()

# Get the word counts (sum of occurrences of each word across all documents)
word_counts = bow.sum(axis=0).A1  # Convert to a 1D array

# Combine words and counts in a dictionary
word_count_dict = dict(zip(words, word_counts))

# Print the word counts
for word, count in word_count_dict.items():
    print(f"{word}: {count}")

Streaming output truncated to the last 5000 lines.
listtalk: 1
listthat: 1
listthe: 6
listthen: 1
listthere: 2
listtromadudes: 1
listvan: 1
listwhich: 1
listwthere: 1
listyes: 1
liswood: 1
liszhen: 2
liszt: 2
lisztomania: 1
liszts: 2
lit: 169
lita: 7
litanies: 1
litany: 12
lite: 30
litecharacters: 1
liteif: 1
litel: 6
litely: 2
litening: 1
liter: 1
literacy: 7
literal: 105
literalism: 1
literalist: 1
literalized: 1
literalizing: 1
literally: 908
literallybakshi: 1
literallydont: 1
literallyi: 2
literallyjon: 1
literallyminded: 1
literallysuperman: 1
literallythe: 1
literallywhen: 1
literallyzerobudget: 1
literalminded: 4
literalness: 1
literarly: 1
literary: 148
literate: 46
literately: 1
literates: 1
literateshe: 1
literati: 2
literature: 160
literaturea: 1
literaturedont: 1
literaturei: 1
literatureinstead: 1
literatureon: 1
literatures: 5
literaturethe: 1
literaturethis: 1
literaturetofilm: 1
literlly: 1
liteversion: 1
litfrom: 1
litghow: 1
lithe: 8
lithely: 1
lithgow:

KeyboardInterrupt: 
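Printing every entry of a vocabulary this large floods the output, which is why the cell was interrupted. A lighter alternative is to print only the most frequent words. The sketch below uses `heapq.nlargest` on a handful of counts copied from the output above; `top_k_words` is a hypothetical helper.

```python
import heapq

# A few (word, count) pairs taken from the truncated output above
word_count_dict = {
    "lit": 169,
    "literal": 105,
    "literally": 908,
    "literary": 148,
    "literature": 160,
    "lithe": 8,
}

def top_k_words(counts, k):
    """Return the k (word, count) pairs with the highest counts."""
    return heapq.nlargest(k, counts.items(), key=lambda item: item[1])

top3 = top_k_words(word_count_dict, 3)
for word, count in top3:
    print(f"{word}: {count}")
```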

## problem 5 - bi-gram, tri-gram

In [61]:
cv_2 = CountVectorizer(ngram_range=(2,2))
cv_3 = CountVectorizer(ngram_range=(3,3))


In [62]:
bg = cv_2.fit_transform(df['review'])
tg = cv_3.fit_transform(df['review'])

In [63]:
bg[0].toarray().shape

(1, 3310669)

In [64]:
tg[0].toarray().shape

(1, 5472936)

## problem 6 - tf-idf

In [66]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [67]:
tfidf = TfidfVectorizer()

In [68]:
tf = tfidf.fit_transform(df['review'])

In [70]:
print(tfidf.idf_)
print(tfidf.get_feature_names_out())

[ 9.87388814 10.721186   11.1266511  ... 11.1266511  11.1266511
 11.1266511 ]
['00' '000' '0000000000001' ... 'þór' 'יגאל' 'כרמון']
