# ***CampusX Tutorial***

https://www.youtube.com/watch?v=Svpy_ZbHShU

NLP stands for **Natural Language Processing**, which is a field of artificial intelligence (AI) that focuses on the interaction between computers and human language. The goal of NLP is to enable computers to understand, interpret, and generate human language in a way that is both meaningful and useful.

NLP combines linguistics and computer science to process and analyze large amounts of natural language data. Some common tasks in NLP include:

1. **Text classification**: Categorizing text into predefined categories (e.g., spam vs. non-spam emails).
2. **Sentiment analysis**: Determining the sentiment (positive, negative, or neutral) of a piece of text.
3. **Named entity recognition (NER)**: Identifying and classifying proper names, such as people, organizations, and locations, in a text.
4. **Machine translation**: Translating text from one language to another (e.g., Google Translate).
5. **Speech recognition**: Converting spoken language into text (e.g., voice assistants like Siri or Alexa).
6. **Question answering**: Extracting answers from a body of text in response to specific queries.

In essence, NLP helps bridge the gap between human communication and machine understanding.

In [3]:

import numpy as np
import pandas as pd

# ***Import Dataset from Kaggle***

In [4]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

cp: cannot stat 'kaggle.json': No such file or directory


In [5]:
!kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Dataset URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
License(s): other
Downloading imdb-dataset-of-50k-movie-reviews.zip to /content
 74% 19.0M/25.7M [00:00<00:00, 174MB/s]
100% 25.7M/25.7M [00:00<00:00, 178MB/s]


In [6]:
import zipfile
# The context manager closes the archive even if extraction fails.
with zipfile.ZipFile('/content/imdb-dataset-of-50k-movie-reviews.zip', 'r') as zip_ref:
  zip_ref.extractall('/content')

In [7]:
df = pd.read_csv(r"/content/IMDB Dataset.csv")

In [8]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [9]:
df.shape

(50000, 2)

# 1. Convert all text data to lowercase.

In [10]:
df['review'][3]

"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them."

In [11]:
df['review'][3].lower()

"basically there's a family where a little boy (jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />this movie is slower than a soap opera... and suddenly, jake decides to become rambo and kill the zombie.<br /><br />ok, first of all when you're going to make a film you must decide if its a thriller or a drama! as a drama the movie is watchable. parents are divorcing & arguing like in real life. and then we have jake with his closet which totally ruins all the film! i expected to see a boogeyman similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. as for the shots with jake: just ignore them."

In [12]:
df['review']= df['review'].str.lower()

In [13]:
df

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. <br /><br />the...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive
...,...,...
49995,i thought this movie did a down right good job...,positive
49996,"bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,i am a catholic taught in parochial elementary...,negative
49998,i'm going to have to disagree with the previou...,negative


# Remove HTML tags

In [14]:
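# We use a regular expression to remove HTML tags.
import re
def remove_html_tags(text):
  # '<.*?>' is non-greedy, so each match covers a single tag rather than
  # everything between the first '<' and the last '>'.
  pattern = re.compile('<.*?>')
  return pattern.sub(r'', text)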

In [15]:
text = "<html><body><p> Movie 1</p><p> Actor - Aamir Khan</p><p> Click here to <a href='http://google.com'>download</a></p></body></html>"

In [16]:
remove_html_tags(text)

' Movie 1 Actor - Aamir Khan Click here to download'
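The `?` in `<.*?>` is what stops the match from swallowing the text between tags. A minimal sketch of the difference, using a made-up `sample` string (an assumption for illustration, not taken from the dataset):

```python
import re

sample = "<p>Movie 1</p><p>Actor</p>"  # hypothetical input

# Non-greedy: each match stops at the first '>' it meets, so only the tags are removed.
non_greedy = re.sub(r'<.*?>', '', sample)

# Greedy: one match runs from the first '<' to the last '>', taking the text with it.
greedy = re.sub(r'<.*>', '', sample)

print(repr(non_greedy))
print(repr(greedy))
```

Note that the tags are replaced with an empty string here, so words on either side of a tag can end up joined together.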

In [17]:
# apply regular expression to remove html tags from the dataset.

df['review'] = df['review'].apply(remove_html_tags)

In [18]:
df['review'][5]

'probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it\'s not preachy or boring. it just never gets old, despite my having seen it some 15 or more times in the last 25 years. paul lukas\' performance brings tears to my eyes, and bette davis, in one of her very few truly sympathetic roles, is a delight. the kids are, as grandma says, more like "dressed-up midgets" than children, but that only makes them more fun to watch. and the mother\'s slow awakening to what\'s happening in the world and under her own roof is believable and startling. if i had a dozen thumbs, they\'d all be "up" for this movie.'

# Remove URLs from the text dataset.

In [19]:
def remove_url(text):
  # \S+ (non-whitespace) is needed in both branches; \s+ would never match the rest of a URL.
  pattern = re.compile(r'https?://\S+|www\.\S+')
  return pattern.sub(r'', text)

In [20]:
text1 = 'Check out my notebook https://www.kaggle.com/campusx/notebook8223fc1abb'
text2 = 'Check out my notebook http://www.kaggle.com/campusx/notebook8223fc1abb'
text3 = 'Google search here www.google.com'
text4 = 'For notebook click https://www.kaggle.com/campusx/notebook8223fc1abb to search check www.google.com'

In [21]:
remove_url(text1)

'Check out my notebook '

In [22]:
remove_url(text2)

'Check out my notebook '

In [23]:
remove_url(text3)

'Google search here '

In [24]:
remove_url(text4)

'For notebook click  to search check '
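The case of the character class in the `www` branch decides whether bare `www.` links are removed: `\s` matches whitespace, while `\S` matches anything that is not whitespace. A small side-by-side check (the variable names are illustrative):

```python
import re

text = 'Google search here www.google.com'

# '\s+' requires whitespace right after 'www.', which a URL never has, so nothing matches.
with_lower_s = re.compile(r'https?://\S+|www\.\s+').sub('', text)

# '\S+' consumes the run of non-whitespace characters that makes up the URL.
with_upper_s = re.compile(r'https?://\S+|www\.\S+').sub('', text)

print(repr(with_lower_s))
print(repr(with_upper_s))
```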

# Removing punctuation from the text.

In [25]:
import string,time
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [26]:
exclude = string.punctuation

In [27]:
def remove_punc(text):
  for char in exclude:
    text = text.replace(char,' ')
  return text

In [28]:
text = 'string.with.Punctuation?'

In [29]:
# We use the time module to measure the execution time.

start = time.time()
print(remove_punc(text))
time1 = time.time() - start
# Scale the single-call time by 50,000 to roughly estimate the cost over the whole dataset.
print(time1*50000)

string with Punctuation 
38.6357307434082


# An alternative approach to removing punctuation.
# This approach is faster, so it is better suited to large datasets.

In [30]:
def remove_punc1(text):
  # Note: this deletes punctuation outright, whereas remove_punc replaces it with a space.
  return text.translate(str.maketrans('','', exclude))

In [31]:
start = time.time()
remove_punc1(text)
time2 = time.time() - start
print(time2*50000)

5.710124969482422


In [32]:
time1/time2

6.766179540709812
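A single `time.time()` measurement of one short call is noisy, and the first measurement above also includes the `print`. One way to get a steadier comparison is the standard `timeit` module. The sketch below is an assumption about how one might re-measure, not the notebook's method; the names `remove_punc_loop` and `remove_punc_translate` are illustrative, and both variants delete punctuation so their outputs can be compared directly.

```python
import string
import timeit

exclude = string.punctuation
table = str.maketrans('', '', exclude)  # translation table built once and reused

def remove_punc_loop(text):
    # One full pass over the string per punctuation character.
    for char in exclude:
        text = text.replace(char, '')
    return text

def remove_punc_translate(text):
    # A single pass over the string using the prebuilt table.
    return text.translate(table)

sample = 'string.with.Punctuation?' * 100  # hypothetical longer input

loop_seconds = timeit.timeit(lambda: remove_punc_loop(sample), number=1000)
translate_seconds = timeit.timeit(lambda: remove_punc_translate(sample), number=1000)
print(f"loop: {loop_seconds:.4f}s, translate: {translate_seconds:.4f}s")
```

The exact ratio depends on the machine and the input length, so treat any single number as indicative only.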

In [33]:
df['review'][5]

'probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it\'s not preachy or boring. it just never gets old, despite my having seen it some 15 or more times in the last 25 years. paul lukas\' performance brings tears to my eyes, and bette davis, in one of her very few truly sympathetic roles, is a delight. the kids are, as grandma says, more like "dressed-up midgets" than children, but that only makes them more fun to watch. and the mother\'s slow awakening to what\'s happening in the world and under her own roof is believable and startling. if i had a dozen thumbs, they\'d all be "up" for this movie.'

In [34]:
remove_punc1(df['review'][5])

'probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie'

# Chat Word Treatment

In [97]:
import pandas as pd

# Read the CSV file
chat_words = pd.read_csv(r"/content/slang.txt", sep=';')

# Convert the DataFrame to a dictionary
# You can choose different orientations; here are a few examples:
# 1. Default (dict of columns)
dict_default = chat_words.to_dict()  # This will create a dictionary where columns are keys and values are lists.
# 2. If you want a list of dictionaries (each row as a dictionary)
dict_records = chat_words.to_dict(orient='records')  # Each row will be a dictionary.

# Print the dictionary
print(dict_default)  # or print(dict_records) depending on your preference


{'AFAIK=As Far As I Know': {0: 'AFK=Away From Keyboard', 1: 'ASAP=As Soon As Possible', 2: 'ATK=At The Keyboard', 3: 'ATM=At The Moment', 4: 'A3=Anytime, Anywhere, Anyplace', 5: 'BAK=Back At Keyboard', 6: 'BBL=Be Back Later', 7: 'BBS=Be Back Soon', 8: 'BFN=Bye For Now', 9: 'B4N=Bye For Now', 10: 'BRB=Be Right Back', 11: 'BRT=Be Right There', 12: 'BTW=By The Way', 13: 'B4=Before', 14: 'B4N=Bye For Now', 15: 'CU=See You', 16: 'CUL8R=See You Later', 17: 'CYA=See You', 18: 'FAQ=Frequently Asked Questions', 19: 'FC=Fingers Crossed', 20: "FWIW=For What It's Worth", 21: 'FYI=For Your Information', 22: 'GAL=Get A Life', 23: 'GG=Good Game', 24: 'GN=Good Night', 25: 'GMTA=Great Minds Think Alike', 26: 'GR8=Great!', 27: 'G9=Genius', 28: 'IC=I See', 29: 'ICQ=I Seek you (also a chat program)', 30: 'ILU=ILU: I Love You', 31: 'IMHO=In My Honest/Humble Opinion', 32: 'IMO=In My Opinion', 33: 'IOW=In Other Words', 34: 'IRL=In Real Life', 35: 'KISS=Keep It Simple, Stupid', 36: 'LDR=Long Distance Relation

In [98]:
print(dict_records)

[{'AFAIK=As Far As I Know': 'AFK=Away From Keyboard'}, {'AFAIK=As Far As I Know': 'ASAP=As Soon As Possible'}, {'AFAIK=As Far As I Know': 'ATK=At The Keyboard'}, {'AFAIK=As Far As I Know': 'ATM=At The Moment'}, {'AFAIK=As Far As I Know': 'A3=Anytime, Anywhere, Anyplace'}, {'AFAIK=As Far As I Know': 'BAK=Back At Keyboard'}, {'AFAIK=As Far As I Know': 'BBL=Be Back Later'}, {'AFAIK=As Far As I Know': 'BBS=Be Back Soon'}, {'AFAIK=As Far As I Know': 'BFN=Bye For Now'}, {'AFAIK=As Far As I Know': 'B4N=Bye For Now'}, {'AFAIK=As Far As I Know': 'BRB=Be Right Back'}, {'AFAIK=As Far As I Know': 'BRT=Be Right There'}, {'AFAIK=As Far As I Know': 'BTW=By The Way'}, {'AFAIK=As Far As I Know': 'B4=Before'}, {'AFAIK=As Far As I Know': 'B4N=Bye For Now'}, {'AFAIK=As Far As I Know': 'CU=See You'}, {'AFAIK=As Far As I Know': 'CUL8R=See You Later'}, {'AFAIK=As Far As I Know': 'CYA=See You'}, {'AFAIK=As Far As I Know': 'FAQ=Frequently Asked Questions'}, {'AFAIK=As Far As I Know': 'FC=Fingers Crossed'}, {'A

In [101]:
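# slang.txt holds one "ABBR=Expansion" entry per line, so read_csv above put the first
# entry in the column name and the rest in the rows. Build a lookup dict from both.
entries = [chat_words.columns[0]] + chat_words.iloc[:, 0].tolist()
chat_dict = dict(e.split('=', 1) for e in entries if '=' in str(e))

def chat_conversion(text):
  new_text = []
  for w in text.split():
    # Replace a known abbreviation with its expansion; keep any other word unchanged.
    new_text.append(chat_dict.get(w.upper(), w))
  return " ".join(new_text)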
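To see the idea of chat-word expansion in isolation, here is a small self-contained sketch. The dictionary `sample_chat_words` is a hand-written, hypothetical subset of abbreviations (three entries that also appear in the slang list), and `expand_chat_words` is an illustrative name, not part of the notebook.

```python
# Hypothetical three-entry lookup table; a real one would be loaded from a slang file.
sample_chat_words = {
    'ASAP': 'As Soon As Possible',
    'FYI': 'For Your Information',
    'BRB': 'Be Right Back',
}

def expand_chat_words(text):
    # Look each whitespace-separated token up in upper case; keep unknown tokens as they are.
    return " ".join(sample_chat_words.get(w.upper(), w) for w in text.split())

expanded = expand_chat_words('fyi the build is ready, reply asap')
print(expanded)
```

Because the lookup is on whole tokens, an abbreviation with punctuation attached (for example `asap.`) would not be expanded unless punctuation is removed first.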

# ***Tokenization***

# 1. Using the split() function

In [36]:
# word tokenization
sent1 = 'I am going to Jaipur!'
sent1.split()

['I', 'am', 'going', 'to', 'Jaipur!']

In [37]:
# sentence tokenization.

sent2 = 'I am going to jaipur. I will stay there for 3 days. Let\'s hope the trip to be great'
sent2.split('.')

['I am going to jaipur',
 ' I will stay there for 3 days',
 " Let's hope the trip to be great"]

In [38]:
# Problem with the split() function.
sent4 = 'Where do think I should go? I have 3 day holiday'
sent4.split('.')
# split('.') cannot separate the sentences in sent4 because it contains no '.',
# so we solve this with a regular expression instead.

['Where do think I should go? I have 3 day holiday']

# Regular Expression

In [39]:
import re
sent3 = 'I am going to delhi!'
tokens = re.findall(r"\w+", sent3)
tokens

# The '!' is still not kept as a separate token; \w+ simply drops it.

['I', 'am', 'going', 'to', 'delhi']

In [40]:
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry?
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""

sentences = re.compile('[.!?]').split(text)
sentences

['Lorem Ipsum is simply dummy text of the printing and typesetting industry',
 "\nLorem Ipsum has been the industry's standard dummy text ever since the 1500s,\nwhen an unknown printer took a galley of type and scrambled it to make a type specimen book",
 '']
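Splitting on `[.!?]` discards the terminator and leaves an empty trailing string. One possible refinement, offered as a sketch rather than a complete sentence splitter, is to split on the whitespace that follows a terminator by using a lookbehind. The input string is made up for illustration.

```python
import re

text = "Where do you think I should go? I have a 3 day holiday. Let's see!"

# (?<=[.!?]) asserts that the previous character is a terminator without consuming it,
# so each sentence keeps its own punctuation; \s+ consumes the gap between sentences.
sentences = re.split(r'(?<=[.!?])\s+', text)
print(sentences)
```

This still breaks on abbreviations followed by a space (for example "Dr. Smith"), which is one reason to prefer a trained tokenizer.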

# Using NLTK for Tokenization

In [41]:
import nltk

In [42]:
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize, sent_tokenize

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [43]:
sent1 = 'I am going to delhi to attend the family function!'
word_tokenize(sent1)

['I',
 'am',
 'going',
 'to',
 'delhi',
 'to',
 'attend',
 'the',
 'family',
 'function',
 '!']

In [44]:
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry?
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""

sent_tokenize(text)

['Lorem Ipsum is simply dummy text of the printing and typesetting industry?',
 "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,\nwhen an unknown printer took a galley of type and scrambled it to make a type specimen book."]

In [45]:
sent5 = 'I have a Ph.D in A.I'
sent6 = "WE're here to help! mail us at nks@gmail.com"
sent7 ='A 5kmn ride cost $10.50'

word_tokenize(sent5)

['I', 'have', 'a', 'Ph.D', 'in', 'A.I']

In [46]:
word_tokenize(sent6)
# The problem here is that it splits the email address into pieces,
# which makes no sense for an email address.

['WE',
 "'re",
 'here',
 'to',
 'help',
 '!',
 'mail',
 'us',
 'at',
 'nks',
 '@',
 'gmail.com']
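If only a few special cases such as email addresses matter, a hand-written regular expression can keep them whole. The pattern below is a rough illustration under simplifying assumptions (it is not how NLTK or spaCy tokenize, and it does not cover every valid email address); `simple_tokenize` is a hypothetical helper name.

```python
import re

TOKEN_PATTERN = re.compile(
    r"[\w.+-]+@[\w-]+\.[\w.-]+"   # email-like tokens are tried first
    r"|\w+(?:'\w+)?"              # words, optionally with one apostrophe part (WE're)
    r"|[^\w\s]"                   # any other single non-space symbol
)

def simple_tokenize(text):
    # findall returns the full matches because the pattern has no capturing groups.
    return TOKEN_PATTERN.findall(text)

tokens = simple_tokenize("WE're here to help! mail us at nks@gmail.com")
print(tokens)
```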

In [47]:
word_tokenize(sent7)
# Here '5kmn' is not separated into the number and the unit.

['A', '5kmn', 'ride', 'cost', '$', '10.50']

# Using the spaCy Library

In [48]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [49]:
# First convert the sentences into spaCy documents.
sent5 = 'I have a Ph.D in A.I'
sent6 = "WE're here to help! mail us at nks@gmail.com"
sent7 ='A 5kmn ride cost $10.50'
doc2 = nlp(sent6)
doc3 = nlp(sent7)
doc4 = nlp(sent5)

In [50]:
for token in doc4:
  print(token)

I
have
a
Ph
.
D
in
A.I


In [52]:
for token in doc2:
  print(token)

WE're
here
to
help
!
mail
us
at
nks@gmail.com


In [54]:
for token in doc3:
  print(token)

A
5kmn
ride
cost
$
10.50


# **Stemming**

In [55]:
from nltk.stem.porter import PorterStemmer

In [56]:
ps = PorterStemmer()
def stem_words(text):
  return " ".join([ps.stem(word) for word in text.split()])

In [57]:
sample = "walk walks walking walked"
stem_words(sample)

'walk walk walk walk'

In [58]:
text = 'probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie'
print(text)

probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie


In [59]:
stem_words(text)

# Stemming can produce stems that are not real words (e.g. 'probabl', 'movi'), so we use lemmatization to address this.

'probabl my alltim favorit movi a stori of selfless sacrific and dedic to a nobl caus but it not preachi or bore it just never get old despit my have seen it some 15 or more time in the last 25 year paul luka perform bring tear to my eye and bett davi in one of her veri few truli sympathet role is a delight the kid are as grandma say more like dressedup midget than children but that onli make them more fun to watch and the mother slow awaken to what happen in the world and under her own roof is believ and startl if i had a dozen thumb theyd all be up for thi movi'
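To see why rule-based stemming yields non-words, here is a deliberately naive suffix stripper. It is not the Porter algorithm (which applies a much larger, ordered rule set); `SUFFIXES` and `naive_stem` are hypothetical names chosen for this sketch.

```python
SUFFIXES = ('ing', 'ed', 'ly', 'es', 's')  # checked in this order

def naive_stem(word):
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ['walks', 'walking', 'walked', 'movies', 'probably']
stems = [naive_stem(w) for w in words]
print(stems)
```

The rules know nothing about vocabulary, so `movies` and `probably` are cut down to strings that are not dictionary words.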

# **Lemmatization**

In [60]:
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer


[nltk_data] Downloading package wordnet to /root/nltk_data...


In [61]:
wordnet_lemmatizer = WordNetLemmatizer()
sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."
punctuation = "?;!.,:"
sentence_words = nltk.word_tokenize(sentence)
sentence_words


['He',
 'was',
 'running',
 'and',
 'eating',
 'at',
 'same',
 'time',
 '.',
 'He',
 'has',
 'bad',
 'habit',
 'of',
 'swimming',
 'after',
 'playing',
 'long',
 'hours',
 'in',
 'the',
 'Sun',
 '.']

In [62]:
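# Build a new list instead of removing items while iterating, which can skip elements.
sentence_words = [word for word in sentence_words if word not in punctuation]
print("{0:20}{1:20}".format("word","Lemma"))
for word in sentence_words:
  # pos='v' treats every word as a verb, which is why 'hours' is left unchanged.
  print("{0:20}{1:20}".format(word,wordnet_lemmatizer.lemmatize(word,pos='v')))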


word                Lemma               
He                  He                  
was                 be                  
running             run                 
and                 and                 
eating              eat                 
at                  at                  
same                same                
time                time                
He                  He                  
has                 have                
bad                 bad                 
habit               habit               
of                  of                  
swimming            swim                
after               after               
playing             play                
long                long                
hours               hours               
in                  in                  
the                 the                 
Sun                 Sun                 


# **Video 4**

# **Text Representation | NLP Lecture 4 | Bag of Words | Tf-Idf | N-grams, Bi-grams and Uni-grams**

# Feature Extraction
# Bag of Words (BoW)

# This is an example of unigrams

In [63]:
import numpy as np
import pandas as pd

In [64]:
df = pd.DataFrame({'text':['people watch campusx','campusx watch campusx','people write comment','campusx write comment'], 'output':[1,1,0,0]})
df

Unnamed: 0,text,output
0,people watch campusx,1
1,campusx watch campusx,1
2,people write comment,0
3,campusx write comment,0


In [65]:
# use scikit learn

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()

In [66]:
bow = cv.fit_transform(df['text'])

In [67]:
# vocab
print(cv.vocabulary_)

{'people': 2, 'watch': 3, 'campusx': 0, 'write': 4, 'comment': 1}


In [68]:
print(bow[0].toarray())
print(bow[1].toarray())
print(bow[2].toarray())
print(bow[3].toarray())

[[1 0 1 1 0]]
[[2 0 0 1 0]]
[[0 1 1 0 1]]
[[1 1 0 0 1]]


In [69]:
# use new sentence
cv.transform(['campusx watch and write comment of campusx']).toarray()

array([[2, 1, 0, 1, 1]])
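The counts above can be reproduced without scikit-learn, which makes the mechanics explicit. This is a simplified sketch that tokenizes with `split()`; `CountVectorizer` applies its own tokenization rules, so results can differ on real text. `bow_vector` is an illustrative helper name.

```python
from collections import Counter

docs = ['people watch campusx', 'campusx watch campusx',
        'people write comment', 'campusx write comment']

# The vocabulary is the sorted set of words seen in the training documents.
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(text):
    counts = Counter(text.split())
    # Words outside the vocabulary ('and', 'of') are simply ignored.
    return [counts[word] for word in vocab]

print(vocab)
print(bow_vector('campusx watch and write comment of campusx'))
```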

# Bag of N-grams: Unigrams, Bigrams, Trigrams

In [70]:
import numpy as np
import pandas as pd

In [71]:
df = pd.DataFrame({'text':['people watch campusx','campusx watch campusx','people write comment','campusx write comment'], 'output':[1,1,0,0]})
df

Unnamed: 0,text,output
0,people watch campusx,1
1,campusx watch campusx,1
2,people write comment,0
3,campusx write comment,0


In [72]:
## use scikit learn

from sklearn.feature_extraction.text import CountVectorizer

# cv = CountVectorizer(ngram_range=(1,1))   # unigrams only
# cv = CountVectorizer(ngram_range=(1,2))   # unigrams and bigrams: 11 features
# cv = CountVectorizer(ngram_range=(1,3))   # unigrams to trigrams: 15 features
# cv = CountVectorizer(ngram_range=(1,4))   # same 15 features, since no sentence has 4 words
# cv = CountVectorizer(ngram_range=(2,2))   # bigrams only
cv = CountVectorizer(ngram_range=(3,3))   # trigrams only
# cv = CountVectorizer(ngram_range=(4,4))   # raises an error because no sentence has 4 words

In [73]:
bow = cv.fit_transform(df['text'])

In [74]:
# vocab
print(cv.vocabulary_)

{'people watch campusx': 2, 'campusx watch campusx': 0, 'people write comment': 3, 'campusx write comment': 1}


In [75]:
print(len(cv.vocabulary_))

4


In [76]:
print(bow[0].toarray())
print(bow[1].toarray())
print(bow[2].toarray())
print(bow[3].toarray())

[[0 0 1 0]]
[[1 0 0 0]]
[[0 0 0 1]]
[[0 1 0 0]]


In [77]:
# use new sentence
cv.transform(['campusx watch and write comment of campusx']).toarray()

array([[0, 0, 0, 0]])
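The vocabulary sizes for different `ngram_range` settings can be checked by generating the n-grams by hand. This sketch uses plain `split()` tokenization; `ngrams` and `ngram_vocabulary` are hypothetical helper names.

```python
docs = ['people watch campusx', 'campusx watch campusx',
        'people write comment', 'campusx write comment']

def ngrams(text, n):
    tokens = text.split()
    # Slide a window of n tokens over the sentence; too-short sentences give no n-grams.
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_vocabulary(documents, low, high):
    # Collect every distinct n-gram for n in [low, high], mirroring ngram_range=(low, high).
    return sorted({g for d in documents for n in range(low, high + 1) for g in ngrams(d, n)})

for low, high in [(1, 1), (1, 2), (1, 3), (3, 3), (4, 4)]:
    print((low, high), len(ngram_vocabulary(docs, low, high)))
```

The new sentence `'campusx watch and write comment of campusx'` maps to all zeros with trigrams because none of its three-word windows appears in the training vocabulary.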

# TF-IDF (Term Frequency - Inverse Document Frequency)

In [78]:
import numpy as np
import pandas as pd


In [79]:
df = pd.DataFrame({'text':['people watch campusx','campusx watch campusx','people write comment','campusx write comment'], 'output':[1,1,0,0]})
df

Unnamed: 0,text,output
0,people watch campusx,1
1,campusx watch campusx,1
2,people write comment,0
3,campusx write comment,0


In [80]:
## use scikit learn

from sklearn.feature_extraction.text import TfidfVectorizer
t_1 = TfidfVectorizer()
t_1.fit_transform(df['text']).toarray()




array([[0.49681612, 0.        , 0.61366674, 0.61366674, 0.        ],
       [0.8508161 , 0.        , 0.        , 0.52546357, 0.        ],
       [0.        , 0.57735027, 0.57735027, 0.        , 0.57735027],
       [0.49681612, 0.61366674, 0.        , 0.        , 0.61366674]])

In [81]:
print(t_1.idf_)

# Read this documentation:
# https://scikit-learn.org/1.5/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer


[1.22314355 1.51082562 1.51082562 1.51082562 1.51082562]


In [82]:
print(t_1.get_feature_names_out())

['campusx' 'comment' 'people' 'watch' 'write']
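The `idf_` values and the matrix rows can be recomputed by hand. The sketch below assumes the smoothed formula that scikit-learn documents for its default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row; tokenization is simplified to `split()`, and `tfidf_row` is an illustrative name.

```python
import math

docs = ['people watch campusx', 'campusx watch campusx',
        'people write comment', 'campusx write comment']
vocab = sorted({word for doc in docs for word in doc.split()})
n_docs = len(docs)

idf = {}
for term in vocab:
    # Document frequency: the number of documents that contain the term at least once.
    df = sum(term in doc.split() for doc in docs)
    idf[term] = math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(text):
    tokens = text.split()
    raw = [tokens.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    # Divide by the Euclidean length so every row has unit norm.
    return [value / norm for value in raw]

print([round(idf[term], 8) for term in vocab])
print([round(value, 8) for value in tfidf_row(docs[0])])
```

`campusx` appears in three of the four documents, so it gets the lowest idf; the other four words each appear in two documents and share the same value.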


# **Video 5**
# Word2Vec
- https://github.com/campusx-official/game-of-thrones-word2vec/blob/main/game-of-thrones-word2vec.ipynb

- https://colab.research.google.com/drive/1aes3A6AumwokaSmdHL4F477Eb-PBdWLD?usp=sharing


 - Word2Vec applied to Game of Thrones data

Dataset Link: https://www.kaggle.com/khulasasndh/game-of-thrones-books

In [83]:
import numpy as np
import pandas as pd

In [84]:
# !pip install gensim

In [85]:
import gensim
import os

In [86]:
# Use this code only when you want to work on all the files from the full zip archive.
'''
from nltk import sent_tokenize
from gensim.utils import simple_preprocess

story = []
for filename in os.listdir('data'):

    f = open(os.path.join('data',filename))
    corpus = f.read()
    raw_sent = sent_tokenize(corpus)
    for sent in raw_sent:
        story.append(simple_preprocess(sent))


'''

"\nfrom nltk import sent_tokenize\nfrom gensim.utils import simple_preprocess\n\nstory = []\nfor filename in os.listdir('data'):\n\n    f = open(os.path.join('data',filename))\n    corpus = f.read()\n    raw_sent = sent_tokenize(corpus)\n    for sent in raw_sent:\n        story.append(simple_preprocess(sent))\n\n\n"

In [87]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [103]:
import os
from nltk import sent_tokenize
from gensim.utils import simple_preprocess
from google.colab import files


# If the file is uploaded directly to Colab via the file uploader:
uploaded = files.upload()

# Path of the uploaded text file
filename = '/content/gameofthrones.txt'  # Change to your actual file name

# Initialize an empty list to hold tokenized sentences
story = []

# Read the content of the uploaded text file
with open(filename, 'r', encoding='utf-8') as f:
    corpus = f.read()

# Tokenize the corpus into sentences
raw_sent = sent_tokenize(corpus)

# Process each sentence: clean and tokenize using simple_preprocess
for sent in raw_sent:
    story.append(simple_preprocess(sent))

# Now, `story` contains the tokenized sentences from your text file
print(story[:5])  # Print the first 5 tokenized sentences for review


Saving gameofthrones.txt to gameofthrones (1).txt
[['we', 'should', 'start', 'back', 'gared', 'urged', 'as', 'the', 'woods', 'began', 'to', 'grow', 'dark', 'around', 'them'], ['the', 'wildlings', 'are', 'dead', 'do', 'the', 'dead', 'frighten', 'you', 'ser', 'waymar', 'royce', 'asked', 'with', 'just', 'the', 'hint', 'of', 'smile'], ['gared', 'did', 'not', 'rise', 'to', 'the', 'bait'], ['he', 'was', 'an', 'old', 'man', 'past', 'fifty', 'and', 'he', 'had', 'seen', 'the', 'lordlings', 'come', 'and', 'go'], ['dead', 'is', 'dead', 'he', 'said']]


In [104]:
story

[['we',
  'should',
  'start',
  'back',
  'gared',
  'urged',
  'as',
  'the',
  'woods',
  'began',
  'to',
  'grow',
  'dark',
  'around',
  'them'],
 ['the',
  'wildlings',
  'are',
  'dead',
  'do',
  'the',
  'dead',
  'frighten',
  'you',
  'ser',
  'waymar',
  'royce',
  'asked',
  'with',
  'just',
  'the',
  'hint',
  'of',
  'smile'],
 ['gared', 'did', 'not', 'rise', 'to', 'the', 'bait'],
 ['he',
  'was',
  'an',
  'old',
  'man',
  'past',
  'fifty',
  'and',
  'he',
  'had',
  'seen',
  'the',
  'lordlings',
  'come',
  'and',
  'go'],
 ['dead', 'is', 'dead', 'he', 'said'],
 ['we',
  'have',
  'no',
  'business',
  'with',
  'the',
  'dead',
  'are',
  'they',
  'dead',
  'royce',
  'asked',
  'softly'],
 ['what', 'proof', 'have', 'we', 'will', 'saw', 'them', 'gared', 'said'],
 ['if',
  'he',
  'says',
  'they',
  'are',
  'dead',
  'that',
  'proof',
  'enough',
  'for',
  'me',
  'will',
  'had',
  'known',
  'they',
  'would',
  'drag',
  'him',
  'into',
  'the',
  'qu

In [105]:
len(story)

69340

In [106]:
model = gensim.models.Word2Vec(
    window=10,
    min_count=2
)

In [107]:
model.build_vocab(story)

In [108]:
model.train(story,total_examples=model.corpus_count, epochs=model.epochs)

(3832044, 5053720)

In [109]:
model.wv.most_similar('queen')

[('margaery', 0.8511649966239929),
 ('princess', 0.8463174104690552),
 ('prince', 0.8405818939208984),
 ('myrcella', 0.7999573945999146),
 ('son', 0.796017050743103),
 ('wife', 0.795843243598938),
 ('wedding', 0.7854769825935364),
 ('sister', 0.785374104976654),
 ('cersei', 0.7852895855903625),
 ('lady', 0.7822032570838928)]

In [110]:
model.wv.doesnt_match(['jon','arya','sansa','bran'])
# Plausibly because Jon is the adopted son, while the others are the family's own children.

'jon'

In [111]:
model.wv.doesnt_match(['cersei', 'jaime', 'bronn', 'tyrion'])

'bronn'

In [112]:
model.wv['king']

array([ 1.9223019 , -0.2879819 , -2.0181117 ,  0.4479232 , -2.7426596 ,
       -0.9359759 ,  0.00822008,  0.46971244, -1.971389  ,  2.885412  ,
        1.0433332 ,  2.7008479 ,  1.1565878 ,  1.8025534 , -0.65418863,
       -0.17459047,  1.038172  , -1.1572726 , -1.3268249 , -0.4014035 ,
        2.3685923 , -1.4192103 , -1.3574281 ,  0.58783245, -0.5645755 ,
       -0.28150243,  2.395748  ,  1.873136  ,  0.6633944 ,  2.3919547 ,
        2.287387  ,  1.2627022 ,  1.9283987 , -0.41570362,  0.9122332 ,
       -0.8622801 ,  1.0826526 ,  0.9121959 ,  1.7829633 , -2.6028335 ,
        3.1073768 , -1.0308994 ,  1.1495705 ,  0.8002043 , -0.5189259 ,
        0.2704996 ,  1.4322042 , -1.6500543 ,  3.2427416 , -0.06947897,
       -1.6367288 ,  1.4745736 , -0.34389472, -3.3206522 ,  2.1089091 ,
        0.08599988, -0.44747075,  2.3294828 ,  0.8709477 ,  0.89417344,
        2.0770357 , -0.55201846, -0.1397929 , -0.5683991 , -2.0916378 ,
        0.22736204,  1.3930175 ,  1.315946  , -3.8110452 ,  0.82

In [113]:
model.wv.similarity('arya','sansa')

0.82627165

In [114]:
model.wv.similarity('cersei','sansa')

0.7070886

In [115]:
model.wv.similarity('tywin','sansa')

0.20312007
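The scores returned by `similarity` are cosine similarities between word vectors. As a sketch of what that measure computes, here is a plain-Python version applied to tiny made-up vectors (the real embeddings are much longer, and `cosine_similarity` here is a hypothetical helper, not the gensim function):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

The score depends only on direction, not length: values near 1 mean the vectors point the same way, values near 0 mean they are unrelated, and negative values mean they point in opposing directions.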

In [116]:
model.wv.get_normed_vectors()

array([[ 0.02713661,  0.04519505,  0.03458809, ..., -0.13926364,
        -0.09399639,  0.10018145],
       [-0.11874198,  0.13368411,  0.19491315, ...,  0.09674258,
        -0.13030227, -0.10984587],
       [ 0.0913324 , -0.03046454, -0.12717643, ..., -0.01373477,
         0.18925923, -0.11317864],
       ...,
       [ 0.10862397,  0.2402007 ,  0.13891834, ...,  0.06350525,
         0.08091044, -0.0321219 ],
       [-0.0376674 ,  0.16063496,  0.14694567, ..., -0.10173275,
        -0.05601796, -0.01042343],
       [ 0.05049824,  0.20688161,  0.07393984, ..., -0.00732647,
        -0.04432392, -0.05973356]], dtype=float32)

In [117]:
y = model.wv.index_to_key

In [118]:
len(y)

13673

In [119]:
y

['the',
 'and',
 'to',
 'of',
 'he',
 'his',
 'was',
 'you',
 'in',
 'it',
 'her',
 'had',
 'she',
 'that',
 'as',
 'with',
 'him',
 'but',
 'not',
 'for',
 'they',
 'said',
 'at',
 'on',
 'my',
 'is',
 'lord',
 'have',
 'be',
 'no',
 'them',
 'from',
 'me',
 'were',
 'would',
 'all',
 'your',
 'when',
 'ser',
 'so',
 'if',
 'one',
 'will',
 'could',
 'their',
 'there',
 'we',
 'man',
 'are',
 'up',
 'king',
 'what',
 'this',
 'did',
 'out',
 'back',
 'do',
 'been',
 'by',
 'jon',
 'or',
 'more',
 'men',
 'down',
 'well',
 'than',
 'like',
 'who',
 'tyrion',
 'only',
 'father',
 'hand',
 'now',
 'see',
 'off',
 'even',
 'never',
 'before',
 'old',
 'know',
 'into',
 'too',
 'an',
 'black',
 'told',
 'eyes',
 'made',
 'll',
 'thought',
 'lady',
 'then',
 'arya',
 'some',
 'how',
 'long',
 'time',
 'through',
 'here',
 'can',
 'over',
 'brother',
 'come',
 'face',
 'head',
 'boy',
 'bran',
 'where',
 'sansa',
 'might',
 'still',
 'us',
 'way',
 'has',
 'red',
 'must',
 'took',
 'night',


In [120]:
# To reduce the dimensionality, we use PCA.
from sklearn.decomposition import PCA

In [121]:
pca = PCA(n_components=3)

In [123]:
X = pca.fit_transform(model.wv.get_normed_vectors())

In [124]:
X.shape

(13673, 3)

In [125]:
import plotly.express as px
fig = px.scatter_3d(X[200:300],x=0,y=1,z=2, color=y[200:300])
fig.show()

# **Video 6**


In [None]:
import numpy as np
import pandas as pd

In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

In [None]:
import zipfile
zip_ref = zipfile.ZipFile('/content/imdb-dataset-of-50k-movie-reviews.zip', 'r')
zip_ref.extractall('/content')
zip_ref.close()

In [None]:
df = pd.read_csv(r"/content/IMDB Dataset.csv")

In [None]:
df.head()

In [None]:
df['review'][1]

In [None]:
df['sentiment'].value_counts()

In [None]:
df.isnull().sum()

In [None]:
df.duplicated().sum()

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
df.duplicated().sum()

# Basic Preprocessing
# Remove tags
# Lowercase
# Remove stopwords


In [None]:
# remove html tags
# using regular expression.

import re
def remove_tags(raw_text):
  cleaned_text = re.sub(re.compile('<.*?>'),' ', raw_text)
  return cleaned_text


In [None]:
df['review'] = df['review'].apply(remove_tags)

In [None]:
df


# convert into lowercase

In [None]:
df['review'] = df['review'].apply(lambda x:x.lower())

In [None]:
df

# Removing Stopwords

In [None]:
import nltk
nltk.download('stopwords')


In [None]:
from nltk.corpus import stopwords
# A set makes each membership test O(1) instead of scanning the whole list.
sw_list = set(stopwords.words('english'))
df['review'] = df['review'].apply(lambda x: " ".join(item for item in x.split() if item not in sw_list))
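Stopword removal is a per-word membership test, so storing the stopwords in a set keeps each lookup cheap when the step runs over tens of thousands of reviews. A minimal sketch with a hand-picked, hypothetical stopword subset (the NLTK English list is much longer):

```python
# Hypothetical subset for illustration only.
SAMPLE_STOPWORDS = {'a', 'an', 'the', 'is', 'of', 'and', 'this'}

def remove_stopwords(text):
    # Keep only the words that are not in the stopword set.
    return " ".join(word for word in text.split() if word not in SAMPLE_STOPWORDS)

cleaned = remove_stopwords('this movie is a waste of time')
print(cleaned)
```

The text should already be lowercased at this point, since the comparison is case-sensitive.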

In [None]:
df

# Separate data into input and output columns.

In [None]:
x = df.iloc[:,0:1]
y = df['sentiment']

In [None]:
x

In [None]:
y

# Encoding

In [None]:
from sklearn.preprocessing import LabelEncoder


In [None]:
encoder = LabelEncoder()
y = encoder.fit_transform(y)
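`LabelEncoder` assigns integer codes to the class names in sorted order, so with the two sentiment labels `negative` becomes 0 and `positive` becomes 1. A plain-Python equivalent, shown on a short made-up label list:

```python
labels = ['positive', 'negative', 'positive', 'negative', 'negative']  # hypothetical sample

# Sorted unique classes define the code of each label.
classes = sorted(set(labels))
mapping = {label: code for code, label in enumerate(classes)}
encoded = [mapping[label] for label in labels]

print(mapping)
print(encoded)
```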

In [None]:
y

# Train test split

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test, y_train, y_test = train_test_split(x,y,test_size=0.2, random_state=1)

In [None]:
x_train.shape

# Applying BoW

In [None]:

from sklearn.feature_extraction.text import CountVectorizer

In [None]:
cv = CountVectorizer()

In [None]:
# Note: .toarray() builds a dense matrix, which can need a lot of memory for a large vocabulary.
x_train_bow = cv.fit_transform(x_train['review']).toarray()
x_test_bow = cv.transform(x_test['review']).toarray()

In [None]:
x_train_bow.shape