# **NLTK**

NLTK is a leading platform for building Python programs that work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. NLTK also includes graphical demonstrations and sample data.

In [3]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
import pandas as pd
import seaborn as sns
from nltk import ne_chunk
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.util import ngrams
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('words')
nltk.download('vader_lexicon')
nltk.download('maxent_ne_chunker_tab')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker_tab.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzippin

True

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# **Reading the file as DataFrame**

In [5]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data.csv')

In [6]:
print(df.columns)

Index(['Sentence', 'Sentiment'], dtype='object')


In [7]:
# declaring lemmatizer and stop words
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# **Processing Data**

In [8]:
# Function to clean and lemmatize
def preprocess(sentence):
    # Lowercase and tokenize, then keep alphabetic tokens that are not stop words
    tokens = word_tokenize(sentence.lower())
    tokens = [word for word in tokens if word.isalpha() and word not in stop_words]
    # Every token is lemmatized as a verb (pos='v'), so POS tags are not needed here
    lemmatized = [lemmatizer.lemmatize(word, pos='v') for word in tokens]
    return ' '.join(lemmatized)

# Apply to the 'Sentence' column
df['processed'] = df['Sentence'].apply(preprocess)
print(df[['Sentence', 'processed']])

                                               Sentence  \
0     The GeoSolutions technology will leverage Bene...   
1     $ESI on lows, down $1.50 to $2.50 BK a real po...   
2     For the last quarter of 2010 , Componenta 's n...   
3     According to the Finnish-Russian Chamber of Co...   
4     The Swedish buyout firm has sold its remaining...   
...                                                 ...   
5837  RISING costs have forced packaging producer Hu...   
5838  Nordic Walking was first used as a summer trai...   
5839  According shipping company Viking Line , the E...   
5840  In the building and home improvement trade , s...   
5841  HELSINKI AFX - KCI Konecranes said it has won ...   

                                              processed  
0     geosolutions technology leverage benefon gps s...  
1                           esi low bk real possibility  
2     last quarter componenta net sales double perio...  
3     accord chamber commerce major construction com...  
4
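The `preprocess` function above passes `pos='v'` for every token, so nouns and adjectives are only reduced when they happen to match a verb form. One possible refinement, sketched here as an assumption rather than as part of the original pipeline, maps each Penn Treebank tag from `pos_tag` to the single-letter part of speech that `WordNetLemmatizer.lemmatize` accepts. The helper names `penn_to_wordnet` and `lemmatize_tagged` are hypothetical.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (as returned by pos_tag) to a WordNet POS letter."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # default to noun, which is also lemmatize()'s default


def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (word, tag) pairs using the POS implied by each tag."""
    return [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
            for word, tag in tagged_tokens]
```

Inside `preprocess`, this could be called as `lemmatize_tagged(pos_tag(tokens), lemmatizer)`. The resulting `processed` column would then differ from the output shown above.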

# **Sentiment Classification using NLTK**

In [9]:
sia = SentimentIntensityAnalyzer()
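# Score each sentence with VADER's compound score, which ranges from -1 to 1.
# Note: the input is the preprocessed text, from which negations such as 'not' were removed as stop words.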
def get_sentiment_score(sentence):
    return sia.polarity_scores(sentence)['compound']

df['sentiment_score'] = df['processed'].apply(get_sentiment_score)

# Optional: Convert score to label
def label_sentiment(score):
    if score > 0.05:
        return 'positive'
    elif score < -0.05:
        return 'negative'
    else:
        return 'neutral'

df['sentiment'] = df['sentiment_score'].apply(label_sentiment)
print(df[['Sentence', 'sentiment_score', 'sentiment']])

                                               Sentence  sentiment_score  \
0     The GeoSolutions technology will leverage Bene...           0.5423   
1     $ESI on lows, down $1.50 to $2.50 BK a real po...          -0.2732   
2     For the last quarter of 2010 , Componenta 's n...           0.1531   
3     According to the Finnish-Russian Chamber of Co...           0.0000   
4     The Swedish buyout firm has sold its remaining...           0.0000   
...                                                 ...              ...   
5837  RISING costs have forced packaging producer Hu...          -0.1027   
5838  Nordic Walking was first used as a summer trai...           0.0000   
5839  According shipping company Viking Line , the E...           0.2023   
5840  In the building and home improvement trade , s...           0.4588   
5841  HELSINKI AFX - KCI Konecranes said it has won ...           0.0000   

     sentiment  
0     positive  
1     negative  
2     positive  
3      neutral  
4 
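Because the dataset already ships with a `Sentiment` column, the VADER labels can be checked against it. The sketch below assumes that column holds the strings `positive`, `negative`, and `neutral`, which the notebook output does not show. The helpers `agreement_rate` and `confusion_counts` are hypothetical names, and the labels in the example are toy values.

```python
from collections import Counter


def agreement_rate(predicted, gold):
    """Return the fraction of positions where the two label sequences match."""
    if len(predicted) != len(gold):
        raise ValueError("label sequences must have the same length")
    if not predicted:
        return 0.0
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(predicted)


def confusion_counts(predicted, gold):
    """Count (gold, predicted) label pairs to see where the two disagree."""
    return Counter(zip(gold, predicted))


# Toy labels for illustration only; they are not taken from the dataset.
vader_labels = ['positive', 'negative', 'neutral', 'neutral']
gold_labels = ['positive', 'negative', 'positive', 'neutral']

print(agreement_rate(vader_labels, gold_labels))
print(confusion_counts(vader_labels, gold_labels))
```

On the real data, the call might look like `agreement_rate(df['sentiment'].tolist(), df['Sentiment'].str.lower().tolist())`, again assuming the label vocabulary matches.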

# **TF-IDF and N-grams**

In [10]:
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(df['processed'])
print("\nTF-IDF Matrix:")
print(X.toarray())


TF-IDF Matrix:
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
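The printed matrix shows only zeros because TF-IDF matrices are sparse and NumPy abbreviates large arrays to their corners. To make the weighting itself visible, here is a small pure-Python sketch of the formula that `TfidfVectorizer` uses with its default settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization per document. The function `tfidf_rows` is a hypothetical helper, and it splits on whitespace rather than using scikit-learn's token pattern, so it is an illustration and not a drop-in replacement.

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """Compute L2-normalized TF-IDF weights for whitespace-tokenized documents."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
           for term, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return rows


# Toy documents for illustration only
docs = ["profit rise", "profit fall", "company say profit rise"]
for row in tfidf_rows(docs):
    # Print each document's terms from highest to lowest weight
    print(sorted(row.items(), key=lambda item: -item[1]))
```

In the toy corpus, `profit` appears in every document and therefore receives the lowest IDF, so rarer terms such as `rise` outweigh it within a document.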


# **Extract bigrams from a single example**

In [11]:
example_text = df['processed'][0]
bigrams = list(ngrams(example_text.split(), 2))
print("\nExample Bigrams:")
print(bigrams)


Example Bigrams:
[('geosolutions', 'technology'), ('technology', 'leverage'), ('leverage', 'benefon'), ('benefon', 'gps'), ('gps', 'solutions'), ('solutions', 'provide'), ('provide', 'location'), ('location', 'base'), ('base', 'search'), ('search', 'technology'), ('technology', 'communities'), ('communities', 'platform'), ('platform', 'location'), ('location', 'relevant'), ('relevant', 'multimedia'), ('multimedia', 'content'), ('content', 'new'), ('new', 'powerful'), ('powerful', 'commercial'), ('commercial', 'model')]
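A single sentence rarely shows which bigrams matter. One way to extend the idea, sketched below under the assumption that the texts are space-separated tokens like the `processed` column, is to count bigrams across many texts. The helper `bigram_counts` is hypothetical and uses plain `zip` so it runs without NLTK.

```python
from collections import Counter


def bigram_counts(texts):
    """Count adjacent token pairs across a collection of space-separated texts."""
    counts = Counter()
    for text in texts:
        tokens = text.split()
        # zip(tokens, tokens[1:]) yields the same pairs as nltk.util.ngrams(tokens, 2)
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Toy texts for illustration; in the notebook this would be df['processed'].
sample = ["net sales double", "net sales rise", "operate profit rise"]
print(bigram_counts(sample).most_common(2))
```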


# **NER Classification**

In [12]:
# Function to extract named entities
def extract_ner(sentence):
    tokens = word_tokenize(sentence)
    tags = pos_tag(tokens)
    tree = ne_chunk(tags)
    return tree

# Apply to dataset (slow: ne_chunk runs on every sentence)
df['ner'] = df['Sentence'].apply(extract_ner)
print("\nNamed Entities (first 2 rows):")
print(df['ner'].head(2))

KeyboardInterrupt: 
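The cell above was interrupted, presumably because chunking every sentence in the dataset takes a long time. A lighter option, offered as a sketch, is to run on a small sample and keep only `(entity text, label)` pairs instead of whole trees. The helper `entities_from_tree` is hypothetical. It relies on the common idiom that entity chunks in the tree returned by `ne_chunk` expose `.label()` and `.leaves()`, while ordinary tokens are plain `(word, tag)` tuples.

```python
def entities_from_tree(tree):
    """Collect (entity text, entity label) pairs from an ne_chunk result."""
    entities = []
    for node in tree:
        # Entity chunks are subtrees with a label; other tokens are (word, tag) tuples
        if hasattr(node, 'label'):
            text = ' '.join(word for word, _tag in node.leaves())
            entities.append((text, node.label()))
    return entities
```

With the notebook's `extract_ner`, a sample run could look like `df['Sentence'].head(20).apply(lambda s: entities_from_tree(extract_ner(s)))`; the sample size of 20 is arbitrary.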

# **Latent Semantic Analysis (LSA)**

LSA (Latent Semantic Analysis) is a dimensionality reduction technique that uncovers hidden semantic structure in a document-term matrix. It helps identify patterns and relationships between words and documents. LSA is often applied to TF-IDF vectors and uses Singular Value Decomposition (SVD) to reduce the matrix to a smaller number of latent topics (dimensions).
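To make the SVD step concrete, the following sketch applies a plain NumPy SVD to a toy dense matrix and keeps the top two components. The function `lsa_project` and the matrix `X_toy` are illustrative assumptions. scikit-learn's `TruncatedSVD` works on sparse input with an approximate solver, so its output should agree with this only up to sign and numerical error.

```python
import numpy as np


def lsa_project(matrix, n_components):
    """Project rows of a document-term matrix onto the top singular directions."""
    # Thin SVD: matrix = U @ diag(S) @ Vt, singular values in descending order
    U, S, Vt = np.linalg.svd(matrix, full_matrices=False)
    # Keep the first k components; each row becomes a k-dimensional document vector
    return U[:, :n_components] * S[:n_components], Vt[:n_components]


# Toy 4-document x 3-term matrix, for illustration only
X_toy = np.array([[1.0, 1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 1.0, 1.0],
                  [0.0, 0.0, 1.0]])
doc_vectors, term_directions = lsa_project(X_toy, 2)
print(doc_vectors.shape)
```

Each row of `doc_vectors` equals the original row multiplied by the transposed `term_directions`, which is the same projection that a fitted `TruncatedSVD` applies in `transform`.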

In [14]:
# Apply LSA on TF-IDF matrix
lsa = TruncatedSVD(n_components=2)
X_lsa = lsa.fit_transform(X)
print("\nLSA Reduced Matrix:")
print(X_lsa)



LSA Reduced Matrix:
[[ 0.00388309 -0.02530209]
 [ 0.00096297 -0.00662548]
 [ 0.19952914 -0.25447918]
 ...
 [ 0.01448858 -0.06341392]
 [ 0.34349713  0.03023724]
 [ 0.01118625 -0.04620756]]


# **Latent Dirichlet Allocation (LDA)**

LDA (Latent Dirichlet Allocation) is a generative probabilistic model for topic modeling. It assumes that each document is a mixture of topics and that each topic is a distribution over words, and it tries to infer the hidden topic structure that most plausibly generated the corpus. It is widely used to discover abstract topics in large document collections.
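The generative story behind LDA can be sketched in a few lines of NumPy: draw a topic mixture for the document, then for each word position draw a topic and a word from that topic. Everything below is illustrative: `generate_document`, the two toy topics, and the prior `alpha=0.5` are assumptions, not values from the notebook.

```python
import numpy as np


def generate_document(topic_word, alpha, n_words, rng):
    """Sample one document (as word indices) from the LDA generative story."""
    n_topics, vocab_size = topic_word.shape
    # Per-document topic mixture drawn from a symmetric Dirichlet prior
    theta = rng.dirichlet(np.full(n_topics, alpha))
    # For each word position, pick a topic, then pick a word from that topic
    topics = rng.choice(n_topics, size=n_words, p=theta)
    words = np.array([rng.choice(vocab_size, p=topic_word[z]) for z in topics])
    return theta, words


# Two hypothetical topics over a four-word vocabulary; each row sums to 1
topic_word = np.array([[0.5, 0.4, 0.05, 0.05],
                       [0.05, 0.05, 0.4, 0.5]])
rng = np.random.default_rng(0)
theta, words = generate_document(topic_word, alpha=0.5, n_words=10, rng=rng)
print(theta, words)
```

Fitting LDA runs this story in reverse: given only the observed words, it estimates the topic mixtures and the topic-word distributions.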

In [15]:
# Convert processed text into a document-term matrix
count_vectorizer = CountVectorizer()
X_counts = count_vectorizer.fit_transform(df['processed'])

# Apply LDA
lda = LatentDirichletAllocation(n_components=2, random_state=0)
X_topics = lda.fit_transform(X_counts)

# Show topic distribution for each sentence
print("\nLDA Topic Distribution:")
print(X_topics)

# Show top words per topic
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic #{topic_idx + 1}: ",
              " ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))

print("\nTop Words Per LDA Topic:")
display_topics(lda, count_vectorizer.get_feature_names_out(), 5)


LDA Topic Distribution:
[[0.93867166 0.06132834]
 [0.09889036 0.90110964]
 [0.95909773 0.04090227]
 ...
 [0.17225938 0.82774062]
 [0.93536382 0.06463618]
 [0.03346456 0.96653544]]

Top Words Per LDA Topic:
Topic #1:  eur mn profit net operate
Topic #2:  company share say million finnish
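Each row of the topic distribution sums to 1, so a simple way to summarize a document is to take its highest-probability topic. The sketch below uses the first three rows printed above; the variable names are illustrative.

```python
import numpy as np

# First three rows of the topic distribution printed above
X_topics_sample = np.array([[0.93867166, 0.06132834],
                            [0.09889036, 0.90110964],
                            [0.95909773, 0.04090227]])

# Index of the highest-probability topic per row (0 -> Topic #1, 1 -> Topic #2)
dominant_topic = X_topics_sample.argmax(axis=1)
print(dominant_topic + 1)  # shifted to match the 1-based topic numbers printed above
```

On the full data, `X_topics.argmax(axis=1)` could be stored as a new column to group sentences by their dominant topic.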
