### Cleaning the News Dataset

In [6]:
import pandas as pd
import re
from bs4 import BeautifulSoup

In [7]:
df = pd.read_csv("scrappedData.csv", index_col=0)
print(df.head())

                                               title  \
0             Pakistan travel - Lonely Planet | Asia   
1           Pakistan Tourism Development Corporation   
2          Pakistan International Travel Information   
3    Pakistan Travel Advice & Safety | Smartraveller   
4  Travel advice and advisories for Pakistan - Tr...   

                                         description  \
0  Pakistan is one of Asia's most affordable dest...   
1  PTDC is owned by the Government of Pakistan (9...   
2  Entry, Exit and Visa Requirements · Obtain you...   
3  Australian Government travel advice for Pakist...   
4  Travel Advice and Advisories from the Governme...   

                       time_ago  \
0            Time not specified   
1               Careers in PTDC   
2  Dual Nationals: Be aware ...   
3            Time not specified   
4            Time not specified   

                                                link  
0              https://www.lonelyplanet.com/pakistan  
1    

In [8]:
def clean_html_and_urls(text):
    soup = BeautifulSoup(text, "html.parser")  # parse so HTML tags can be stripped
    cleaned_text = soup.get_text(separator=" ")

    cleaned_text = re.sub(r"http\S+|www\S+", "", cleaned_text)  # remove URLs

    return cleaned_text.strip()


def remove_special_characters(text):
    pattern = r"[^\w\s.,]"  # drop everything except word characters, whitespace, '.' and ','
    cleaned_text = re.sub(pattern, "", text)
    
    return cleaned_text


def convert_to_lowercase(text):
    return text.lower()
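
The pattern in `remove_special_characters` deletes characters such as `-`, `|` and `'` without replacing them, which can leave double spaces between words, and the regex helpers expect strings, so a missing cell (a float NaN in pandas) would raise a `TypeError`. Below is a minimal sketch of one way to handle both. The names `collapse_whitespace` and `clean_text` are hypothetical and not part of the original notebook, and the HTML step is left out so the block runs with the standard library alone.

```python
import re


def remove_special_characters(text):
    # Same pattern as the cell above: keep word characters, whitespace, '.' and ','.
    return re.sub(r"[^\w\s.,]", "", text)


def collapse_whitespace(text):
    # Hypothetical helper: squeeze runs of whitespace left behind by deletions.
    return re.sub(r"\s+", " ", text).strip()


def clean_text(value):
    # Treat missing or non-string cells (for example NaN floats) as empty text.
    if not isinstance(value, str):
        return ""
    return collapse_whitespace(remove_special_characters(value)).lower()


print(clean_text("Pakistan travel - Lonely Planet | Asia"))
print(repr(clean_text(float("nan"))))
```

If this fits the data, it could be applied with `df['title'].apply(clean_text)` in place of the three separate `apply` calls.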

In [9]:
df['title'] = df['title'].apply(clean_html_and_urls)
df['title'] = df['title'].apply(remove_special_characters)
df['title'] = df['title'].apply(convert_to_lowercase)

df['description'] = df['description'].apply(clean_html_and_urls)
df['description'] = df['description'].apply(remove_special_characters)
df['description'] = df['description'].apply(convert_to_lowercase)

print(df)

                                                 title  \
0                 pakistan travel  lonely planet  asia   
1             pakistan tourism development corporation   
2            pakistan international travel information   
3        pakistan travel advice  safety  smartraveller   
4    travel advice and advisories for pakistan  tra...   
..                                                 ...   
306  travel to pakistan updated travel advisory l u...   
307  female travelers visit pakistan for the first ...   
308  people are deprived of affordable travel facil...   
309     pakistan entry  exit new travel rules  youtube   
310                                  pakistan  youtube   

                                           description  \
0    pakistan is one of asias most affordable desti...   
1    ptdc is owned by the government of pakistan 99...   
2    entry, exit and visa requirements  obtain your...   
3    australian government travel advice for pakist...   
4    travel a



In [10]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ibzcl\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ibzcl\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ibzcl\AppData\Roaming\nltk_data...


True

In [14]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def stem_text(text):
    tokens = word_tokenize(text)
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    return ' '.join(stemmed_tokens)

def lemmatize_text(text):
    tokens = word_tokenize(text)
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return ' '.join(lemmatized_tokens)

df['description_stemmed'] = df['description'].apply(stem_text)
df['description_lemmatized'] = df['description'].apply(lemmatize_text)


print(df.head())

                                               title  \
0               pakistan travel  lonely planet  asia   
1           pakistan tourism development corporation   
2          pakistan international travel information   
3      pakistan travel advice  safety  smartraveller   
4  travel advice and advisories for pakistan  tra...   

                                         description  \
0  pakistan is one of asias most affordable desti...   
1  ptdc is owned by the government of pakistan 99...   
2  entry, exit and visa requirements  obtain your...   
3  australian government travel advice for pakist...   
4  travel advice and advisories from the governme...   

                       time_ago  \
0            Time not specified   
1               Careers in PTDC   
2  Dual Nationals: Be aware ...   
3            Time not specified   
4            Time not specified   

                                                link  \
0              https://www.lonelyplanet.com/pakistan   
1  

In [18]:
stop_words = set(stopwords.words('english'))
def remove_stop_words(text):
    tokens = word_tokenize(text)
    filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
    return ' '.join(filtered_tokens)

df['title_no_stopwords'] = df['title'].apply(remove_stop_words)
df['description_no_stopwords'] = df['description'].apply(remove_stop_words)

print(df.head())

                                               title  \
0               pakistan travel  lonely planet  asia   
1           pakistan tourism development corporation   
2          pakistan international travel information   
3      pakistan travel advice  safety  smartraveller   
4  travel advice and advisories for pakistan  tra...   

                                         description  \
0  pakistan is one of asias most affordable desti...   
1  ptdc is owned by the government of pakistan 99...   
2  entry, exit and visa requirements  obtain your...   
3  australian government travel advice for pakist...   
4  travel advice and advisories from the governme...   

                       time_ago  \
0            Time not specified   
1               Careers in PTDC   
2  Dual Nationals: Be aware ...   
3            Time not specified   
4            Time not specified   

                                                link  \
0              https://www.lonelyplanet.com/pakistan   
1  

In [21]:
# Tokenize the stop-word-free text
df['title_tokens'] = df['title_no_stopwords'].apply(word_tokenize)
df['description_tokens'] = df['description_no_stopwords'].apply(word_tokenize)

def extract_ngrams(tokens, n):
    ngrams = []
    for i in range(len(tokens) - n + 1):
        ngrams.append(tuple(tokens[i:i+n]))
    return ngrams


df['title_unigrams'] = df['title_tokens'].apply(lambda x: extract_ngrams(x, 1)) #unigram
df['title_bigrams'] = df['title_tokens'].apply(lambda x: extract_ngrams(x, 2)) #bigram
df['title_trigrams'] = df['title_tokens'].apply(lambda x: extract_ngrams(x, 3)) #trigram

df['description_unigrams'] = df['description_tokens'].apply(lambda x: extract_ngrams(x, 1))
df['description_bigrams'] = df['description_tokens'].apply(lambda x: extract_ngrams(x, 2))
df['description_trigrams'] = df['description_tokens'].apply(lambda x: extract_ngrams(x, 3))


In [27]:
print(df['title_unigrams'].head(1))
print(df['description_unigrams'].head(1))

0    [(pakistan,), (travel,), (lonely,), (planet,),...
Name: title_unigrams, dtype: object
0    [(pakistan,), (one,), (asias,), (affordable,),...
Name: description_unigrams, dtype: object


In [28]:
print(df['title_bigrams'].head(1))
print(df['description_bigrams'].head(1))

0    [(pakistan, travel), (travel, lonely), (lonely...
Name: title_bigrams, dtype: object
0    [(pakistan, one), (one, asias), (asias, afford...
Name: description_bigrams, dtype: object


In [29]:
print(df['title_trigrams'].head(1))
print(df['description_trigrams'].head(1))

0    [(pakistan, travel, lonely), (travel, lonely, ...
Name: title_trigrams, dtype: object
0    [(pakistan, one, asias), (one, asias, affordab...
Name: description_trigrams, dtype: object
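
The sliding-window extraction behind the unigram, bigram and trigram columns can be checked on a small token list. The sketch below is an alternative formulation using `zip` over staggered slices, not the notebook's own loop; the sample tokens are taken from the first title after stop-word removal.

```python
def extract_ngrams(tokens, n):
    # zip over n staggered views of the list; zip stops at the shortest view,
    # so a list shorter than n yields no n-grams.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ["pakistan", "travel", "lonely", "planet", "asia"]

print(extract_ngrams(tokens, 1))
print(extract_ngrams(tokens, 2))
print(extract_ngrams(tokens, 3))
print(extract_ngrams(["pakistan"], 2))  # fewer tokens than n
```

A list of `k` tokens yields `k - n + 1` n-grams, so the five tokens here give four bigrams and three trigrams.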


In [31]:
from collections import Counter

def calculate_ngram_frequency(ngram_series):
    all_ngrams = [ngram for sublist in ngram_series for ngram in sublist]
    frequency_counter = Counter(all_ngrams)
    return frequency_counter

title_unigram_frequency = calculate_ngram_frequency(df['title_unigrams'])
title_bigram_frequency = calculate_ngram_frequency(df['title_bigrams'])
title_trigram_frequency = calculate_ngram_frequency(df['title_trigrams'])

description_unigram_frequency = calculate_ngram_frequency(df['description_unigrams'])
description_bigram_frequency = calculate_ngram_frequency(df['description_bigrams'])
description_trigram_frequency = calculate_ngram_frequency(df['description_trigrams'])
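
The counting step flattens one list of n-grams per row into a single stream and tallies it. A minimal sketch of the same idea with `itertools.chain.from_iterable`, which avoids building the intermediate flat list; the sample rows are made up for illustration, and the function should also accept a pandas Series since it only iterates over its argument.

```python
from collections import Counter
from itertools import chain


def calculate_ngram_frequency(ngram_lists):
    # chain.from_iterable walks the nested lists lazily, one n-gram at a time.
    return Counter(chain.from_iterable(ngram_lists))


# Illustrative rows only, not taken from the dataset.
rows = [
    [("pakistan", "travel"), ("travel", "guide")],
    [("pakistan", "travel"), ("travel", "advice")],
    [],  # a row with no n-grams contributes nothing
]

freq = calculate_ngram_frequency(rows)
print(freq.most_common(1))
```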

In [32]:
print("Top 10 Most Common Unigrams in Titles:")
print(title_unigram_frequency.most_common(10))
print("\nTop 10 Most Common Bigrams in Titles:")
print(title_bigram_frequency.most_common(10))
print("\nTop 10 Most Common Trigrams in Titles:")
print(title_trigram_frequency.most_common(10))

print("\nTop 10 Most Common Unigrams in Descriptions:")
print(description_unigram_frequency.most_common(10))
print("\nTop 10 Most Common Bigrams in Descriptions:")
print(description_bigram_frequency.most_common(10))
print("\nTop 10 Most Common Trigrams in Descriptions:")
print(description_trigram_frequency.most_common(10))

Top 10 Most Common Unigrams in Titles:
[(('pakistan',), 245), (('travel',), 134), (('youtube',), 37), (('...',), 32), ((',',), 32), (('visa',), 29), (('guide',), 25), (('tourism',), 23), (('2024',), 17), (('tourist',), 13)]

Top 10 Most Common Bigrams in Titles:
[(('pakistan', 'travel'), 31), (('travel', 'guide'), 20), (('travel', 'pakistan'), 12), (('pakistan', 'youtube'), 12), (('travel', 'advice'), 8), (('traveling', 'pakistan'), 8), (('pakistan', 'tourism'), 7), (('travel', 'advisory'), 6), (('visit', 'pakistan'), 6), (('embassy', 'pakistan'), 6)]

Top 10 Most Common Trigrams in Titles:
[(('pakistan', 'travel', 'guide'), 12), (('know', 'traveling', 'pakistan'), 5), (('pakistan', 'travel', 'advice'), 4), (('department', 'foreign', 'affairs'), 4), (('need', 'know', 'traveling'), 4), (('health', 'alert', 'u.s.'), 4), (('alert', 'u.s.', 'mission'), 4), (('u.s.', 'mission', 'pakistan'), 4), (('mission', 'pakistan', 'january'), 4), (('pakistan', 'january', '8'), 4)]

Top 10 Most Common U
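
The title unigram counts above include `'...'` and `','`, because the cleaning step keeps periods and commas and the stop-word filter only removes words. One possible fix is to drop tokens that consist entirely of punctuation before counting. The helper names `is_punctuation` and `drop_punctuation` below are hypothetical, and the sample tokens are illustrative.

```python
import string


def is_punctuation(token):
    # True when every character is ASCII punctuation, e.g. "," or "...".
    return bool(token) and all(ch in string.punctuation for ch in token)


def drop_punctuation(tokens):
    # Keep tokens that contain at least one letter or digit, such as "u.s.".
    return [t for t in tokens if not is_punctuation(t)]


tokens = ["pakistan", ",", "travel", "...", "u.s.", "2024"]
print(drop_punctuation(tokens))
```

Applied to `df['title_tokens']` and `df['description_tokens']` before n-gram extraction, this would keep punctuation out of the bigrams and trigrams as well.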

In [33]:
from nltk.tokenize import word_tokenize, sent_tokenize

df['description_word_tokens'] = df['description'].apply(word_tokenize)
df['description_sentence_tokens'] = df['description'].apply(sent_tokenize)

print(df[['description', 'description_word_tokens', 'description_sentence_tokens']].head())

                                         description  \
0  pakistan is one of asias most affordable desti...   
1  ptdc is owned by the government of pakistan 99...   
2  entry, exit and visa requirements  obtain your...   
3  australian government travel advice for pakist...   
4  travel advice and advisories from the governme...   

                             description_word_tokens  \
0  [pakistan, is, one, of, asias, most, affordabl...   
1  [ptdc, is, owned, by, the, government, of, pak...   
2  [entry, ,, exit, and, visa, requirements, obta...   
3  [australian, government, travel, advice, for, ...   
4  [travel, advice, and, advisories, from, the, g...   

                         description_sentence_tokens  
0  [pakistan is one of asias most affordable dest...  
1  [ptdc is owned by the government of pakistan 9...  
2  [entry, exit and visa requirements  obtain you...  
3  [australian government travel advice for pakis...  
4  [travel advice and advisories from the governm..

In [34]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

df['description_bert_tokens'] = df['description'].apply(tokenizer.tokenize)

print(df[['description', 'description_bert_tokens']])

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.



To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development



                                           description  \
0    pakistan is one of asias most affordable desti...   
1    ptdc is owned by the government of pakistan 99...   
2    entry, exit and visa requirements  obtain your...   
3    australian government travel advice for pakist...   
4    travel advice and advisories from the governme...   
..                                                 ...   
306  13jul2021  travel to pakistan updated travel a...   
307                                          19aug2020   
308                                          02dec2023   
309  20aug2022  pakistan entry  exit new travel rul...   
310  4 days ago  your browser cant play this video....   

                               description_bert_tokens  
0    [pakistan, is, one, of, asia, ##s, most, affor...  
1    [pt, ##dc, is, owned, by, the, government, of,...  
2    [entry, ,, exit, and, visa, requirements, obta...  
3    [australian, government, travel, advice, for, ...  
4    [travel, advi

### Question 7

#### Contextualized Tokenization:
##### Explore contextualized tokenization by discussing its significance and demonstrating how models like BERT process text differently compared to traditional methods. Will this type of tokenization help you?

#### Answer:
###### Yes, BERT-style tokenization can help here. BERT's WordPiece tokenizer splits words that are missing from its vocabulary into known subword pieces, as the output above shows: "asias" becomes "asia", "##s" and "ptdc" becomes "pt", "##dc". Rare or misspelled words in the scraped descriptions are therefore still represented by meaningful pieces instead of being collapsed into a single unknown token, which gives downstream tasks richer features to work with.

###### Traditional tokenization methods, such as splitting text on whitespace or punctuation, treat each word as an independent unit. The word "bank" is the same token whether it means a financial institution or a river bank, and features built directly on those tokens (counts, n-grams) cannot tell the two senses apart.

###### The WordPiece step itself is also context-independent: "bank" maps to the same token in "He deposited money in the bank" and in "He sat by the river bank." The context-awareness comes from the BERT model, which reads the whole token sequence and produces a different vector for "bank" in each sentence. Getting that benefit requires running the model on the tokens, not only the tokenizer as in the cell above.
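
To make the subword splitting concrete, here is a simplified sketch of greedy longest-match-first splitting in the style of WordPiece. It is not BERT's actual implementation, and `toy_vocab` is a hypothetical vocabulary chosen to reproduce two splits seen in the notebook output. It also shows that the token for "bank" is the same in both example sentences; in BERT, context enters later, when the model turns tokens into vectors.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    # Repeatedly take the longest vocabulary entry that matches at the
    # current position; pieces after the first carry a "##" prefix.
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        tokens.append(piece)
        start = end
    return tokens


def tokenize(sentence, vocab):
    # Whitespace split followed by subword splitting of each word.
    return [p for w in sentence.lower().split() for p in wordpiece_tokenize(w, vocab)]


# Hypothetical toy vocabulary, not BERT's real one.
toy_vocab = {"asia", "##s", "pt", "##dc", "bank", "river", "the", "money", "in", "by"}

print(wordpiece_tokenize("asias", toy_vocab))
print(wordpiece_tokenize("ptdc", toy_vocab))
print(tokenize("money in the bank", toy_vocab))
print(tokenize("by the river bank", toy_vocab))
```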