1. **Text Extraction**:
   - The `Text` field is extracted from each loaded record; these strings are the input to every later natural language processing step.

2. **Preprocessing and NLP Operations**:
   - **Stopword Removal**: Common stopwords are removed from the text with the `remove_stopwords` function from `gensim`.
   - **Stemming**: The `PorterStemmer` from `nltk` reduces each word to its stem, which shrinks the vocabulary. The stemmed texts are not used by the later steps, which work from the lemmatized texts.
   - **Lemmatization**: The `WordNetLemmatizer` from `nltk` maps each word to its dictionary base form (lemma). The vectorizers below are fit on these lemmatized texts.

3. **Vectorization**:
   - **TF-IDF** (Term Frequency-Inverse Document Frequency): A document-term matrix of TF-IDF weights is built with `TfidfVectorizer`; each weight reflects how important a word is to a document relative to the whole corpus.
   - **Bag of Words**: A document-term matrix of raw term counts is built with `CountVectorizer`.

4. **Topic Modeling (LDA)**:
   - Latent Dirichlet Allocation (LDA), via scikit-learn's `LatentDirichletAllocation`, is fit with five topics on the TF-IDF matrix, and the 10 highest-weighted words of each topic are printed.

5. **Sentiment Analysis**:
   - `TextBlob` computes a polarity score for each text, from -1 (most negative) to 1 (most positive), and the scores are averaged across all texts.

6. **Document Similarity**:
   - Pairwise cosine similarity is computed on the TF-IDF matrix, and the five documents most similar to a chosen document are listed.

7. **Keyword Extraction**:
   - `TfidfVectorizer` is refit on the lemmatized texts, each term's TF-IDF weights are summed across all documents, and the 10 terms with the highest totals are displayed.
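
For step 1, the extraction can be made tolerant of blank lines and of records that lack a `Text` field. The following is a minimal sketch, assuming the input is in JSON Lines form (one JSON object per line), as the loading code in this notebook implies. `load_texts` is a hypothetical helper name, and the sample records are made up for illustration.

```python
import json
from typing import Iterable, List


def load_texts(lines: Iterable[str], field: str = "Text") -> List[str]:
    """Parse JSON Lines input and collect the string stored under `field`.

    Blank lines and records without a string `field` are skipped.
    `load_texts` is a hypothetical helper, not part of the notebook.
    Each non-blank line is assumed to hold one JSON object.
    """
    texts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        value = record.get(field)
        if isinstance(value, str):
            texts.append(value)
    return texts


# Illustrative records (made up for this example)
sample = [
    '{"Text": "2002 Toyota Camry LE, good condition"}',
    "",
    '{"Price": 7999}',
    '{"Text": "Tacoma 4x4 double cab"}',
]
print(load_texts(sample))
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, for example `with open(file_path, encoding="utf-8") as f: texts = load_texts(f)`.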

In [4]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /Users/a1234/nltk_data...


True

In [6]:
import json
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from nltk.stem import PorterStemmer, WordNetLemmatizer
from gensim.parsing.preprocessing import remove_stopwords

# Step 1: Load the data line by line
file_path = '/Users/a1234/Desktop/BU/677 PYTHON/project/combined_data/combined_02134_Toyota_100_car_info.json'
data = []
with open(file_path, 'r', encoding='utf-8') as file:
    for line in file:
        line = line.strip()
        if line:  # skip blank lines, which json.loads would reject
            data.append(json.loads(line))

# Step 2: Extract the 'Text' field
texts = [item['Text'] for item in data]

# Step 3: Preprocessing and NLP operations
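# Remove stopwords (this overwrites `texts`, so later cells that reuse
# `texts` see the stopword-stripped versions, not the raw listings)
texts = [remove_stopwords(text) for text in texts]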

# Stemming (computed for reference; texts_stemmed is not used below)
stemmer = PorterStemmer()
texts_stemmed = [' '.join(stemmer.stem(word) for word in text.split()) for text in texts]

# Lemmatization (lemmatize() treats every word as a noun unless a pos is given)
lemmatizer = WordNetLemmatizer()
texts_lemmatized = [' '.join(lemmatizer.lemmatize(word) for word in text.split()) for text in texts]

# TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(texts_lemmatized)

# Bag of Words
count_vectorizer = CountVectorizer()
bow_matrix = count_vectorizer.fit_transform(texts_lemmatized)

# Print the first 50 features of each vectorizer.
# Feature names are sorted, so tokens starting with digits come first.
print("TF-IDF Feature names:", tfidf_vectorizer.get_feature_names_out()[:50])
print("Bag of Words Feature names:", count_vectorizer.get_feature_names_out()[:50])


TF-IDF Feature names: ['00' '000' '0008fa19b63f4f2dbc68f1e3088caace' '000k' '000lb' '000lbs'
 '001' '001b61acf917474f9ff76a72c794cb70' '002' '003' '0036' '005'
 '005529' '007' '0076' '0077' '007788' '0083' '0084' '009' '00am' '00pm'
 '01' '01013' '01028' '01040' '0104241' '010455' '01060' '01075' '01077'
 '01085' '01089' '01109' '0111' '0111235' '0111248' '01129' '011636'
 '0118244' '011877c' '012331' '0125249' '0128ba4e5056a981' '012917' '013'
 '01301' '0131245' '013691' '01398ba45056a981']
Bag of Words Feature names: ['00' '000' '0008fa19b63f4f2dbc68f1e3088caace' '000k' '000lb' '000lbs'
 '001' '001b61acf917474f9ff76a72c794cb70' '002' '003' '0036' '005'
 '005529' '007' '0076' '0077' '007788' '0083' '0084' '009' '00am' '00pm'
 '01' '01013' '01028' '01040' '0104241' '010455' '01060' '01075' '01077'
 '01085' '01089' '01109' '0111' '0111235' '0111248' '01129' '011636'
 '0118244' '011877c' '012331' '0125249' '0128ba4e5056a981' '012917' '013'
 '01301' '0131245' '013691' '01398ba45056a981']


In [7]:
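# Topic modeling using Latent Dirichlet Allocation (LDA)
from sklearn.decomposition import LatentDirichletAllocation

# Define the number of topics
num_topics = 5

# Perform LDA
# Note: no random_state is set, so the topics vary from run to run.
# LDA is a model of word counts, so bow_matrix is the more conventional
# input; this cell fits it on the TF-IDF matrix instead.
lda = LatentDirichletAllocation(n_components=num_topics)
lda.fit(tfidf_matrix)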

# Get the topic-word distributions
topic_word_distributions = lda.components_

# Get the top words for each topic
feature_names = tfidf_vectorizer.get_feature_names_out()
n_top_words = 10
for topic_idx, topic_words in enumerate(topic_word_distributions):
    top_words = [feature_names[i] for i in topic_words.argsort()[:-n_top_words - 1:-1]]
    print(f"Topic {topic_idx + 1}: {', '.join(top_words)}")


Topic 1: new, east, hartford, vehicle, haven, windsor, toyota, price, auto, best
Topic 2: na, cab, diesel, 4x4, 2015, 4dr, sb, tacoma, dieselland, 603
Topic 3: north, west, new, brookfield, 978, hadley, bridgewater, online, east, braintree
Topic 4: toyota, new, power, with, clean, car, miles, rear, vehicle, great
Topic 5: lease, own, blue, car, 7999, you, 6999, honda, need, weekly
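
LDA models each document as a mixture of topics over word counts, so it is more commonly fit on a bag-of-words matrix than on TF-IDF weights, and fixing `random_state` makes the resulting topics reproducible. The following is a minimal sketch on a made-up corpus using scikit-learn's `CountVectorizer` and `LatentDirichletAllocation`; the documents, the topic count, and the seed are illustrative choices, not the notebook's settings.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus for illustration only
docs = [
    "toyota tacoma cab diesel truck",
    "tacoma truck cab 4x4 diesel",
    "lease car weekly payment credit",
    "credit lease payment car own",
]

# Raw term counts are the input LDA is designed for
counts = CountVectorizer().fit_transform(docs)

# A fixed seed makes repeated runs return the same topics
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one row of topic proportions per document

print(doc_topics.shape)       # (number of documents, number of topics)
print(lda.components_.shape)  # (number of topics, vocabulary size)
```

Each row of `doc_topics` sums to 1, so it can be read as the share of each topic in that document.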


In [8]:
# Sentiment analysis using TextBlob
from textblob import TextBlob

# Perform sentiment analysis on each text
sentiments = [TextBlob(text).sentiment.polarity for text in texts]

# Print the average sentiment score
avg_sentiment = sum(sentiments) / len(sentiments)
print("Average Sentiment:", avg_sentiment)


Average Sentiment: 0.16730260226571336
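
A single average can hide how the scores are distributed: many neutral listings plus a few strongly positive ones can produce the same mean as uniformly mild ones. The following is a minimal sketch of a fuller summary using only the standard library. `summarize_polarity` is a hypothetical helper, and the scores below are made up rather than taken from the dataset.

```python
from statistics import mean, median


def summarize_polarity(scores):
    """Return the mean, the median, and the share of positive, neutral,
    and negative scores. Hypothetical helper, not part of the notebook."""
    n = len(scores)
    if n == 0:
        raise ValueError("scores must not be empty")
    return {
        "mean": mean(scores),
        "median": median(scores),
        "positive": sum(s > 0 for s in scores) / n,
        "neutral": sum(s == 0 for s in scores) / n,
        "negative": sum(s < 0 for s in scores) / n,
    }


# Made-up polarity scores for illustration only
example_scores = [0.0, 0.0, 0.0, 0.8, -0.2, 0.5]
summary = summarize_polarity(example_scores)
print(summary)
```

In the notebook, `summarize_polarity(sentiments)` would apply the same summary to the TextBlob scores.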


In [9]:
# Document similarity using cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

# Calculate the cosine similarity between every pair of documents
similarity_matrix = cosine_similarity(tfidf_matrix)

# Get the five documents most similar to a given document.
# The query document is excluded by index, since exact duplicates can tie
# with it at similarity 1.0 and displace it from the first rank.
document_index = 0
ranked = similarity_matrix[document_index].argsort()[::-1]
similar_documents = [i for i in ranked if i != document_index][:5]
for doc_index in similar_documents:
    print("Similar Document:", texts[doc_index])


Similar Document: Toyota 4 Runner TRD Pro Roof Rack - Fits 2010 - 2023 Models All mounting hardware included. Toyota OEM. 2 years old.
Similar Document: 2 keys Toyota camry le 2002 Good condtion
Similar Document: Sunroof , leather seats , Bluetooth , camera Clean title 2 owners 3 months warranty 2 original keys
Similar Document: Fuel Injection Idle Air Control Valve, pulled good running 2001 Camry CE, good working condition, works Camry models, check number sure.
Similar Document: keys Toyota camry le Also Mannuel 2002 Toyota camry le
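
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it compares direction rather than magnitude. The following is a minimal NumPy sketch of that computation on dense vectors, together with a top-k lookup that leaves out the query row. `cosine_similarity_matrix` and `top_k_similar` are hypothetical helpers, and the vectors are made up for illustration.

```python
import numpy as np


def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of a dense 2-D array."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # keep all-zero rows as zeros instead of dividing by 0
    unit = X / norms
    return unit @ unit.T


def top_k_similar(sim, index, k=5):
    """Indices of the k rows most similar to `index`, excluding `index` itself."""
    order = np.argsort(sim[index])[::-1]
    return [int(i) for i in order if i != index][:k]


# Made-up vectors for illustration only
X = np.array([
    [1.0, 1.0, 0.0],
    [2.0, 2.0, 0.0],  # same direction as row 0, twice the length
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 3.0],  # orthogonal to row 0
])
sim = cosine_similarity_matrix(X)
print(top_k_similar(sim, index=0, k=2))
```

Row 1 ranks first for row 0 even though it is twice as long, which illustrates why cosine similarity suits documents of different lengths.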


In [10]:
# Keyword extraction: rank terms by their total TF-IDF weight across the
# corpus, as a rough guide to its dominant themes

from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the lemmatized texts into a TF-IDF matrix
# (same defaults as before, so this reproduces the earlier tfidf_matrix)
tfidf_matrix = tfidf_vectorizer.fit_transform(texts_lemmatized)

# Get the feature names (candidate keywords)
feature_names = tfidf_vectorizer.get_feature_names_out()

# Sum each term's TF-IDF weight over all documents
# (.A1 flattens the resulting 1 x n_features matrix into a 1-D array)
tfidf_scores = tfidf_matrix.sum(axis=0).A1

# Sort the keywords based on their TF-IDF scores
top_keywords_indices = tfidf_scores.argsort()[::-1][:10]  # Get top 10 keywords

# Print the top keywords and their TF-IDF scores
for index in top_keywords_indices:
    keyword = feature_names[index]
    tfidf_score = tfidf_scores[index]
    print(f"Keyword: {keyword}, TF-IDF Score: {tfidf_score}")


Keyword: na, TF-IDF Score: 378.0
Keyword: toyota, TF-IDF Score: 122.93005684427112
Keyword: new, TF-IDF Score: 94.66869604403516
Keyword: cab, TF-IDF Score: 86.37138534506313
Keyword: tacoma, TF-IDF Score: 77.5955236424751
Keyword: 4x4, TF-IDF Score: 75.42622345376009
Keyword: power, TF-IDF Score: 74.41829467892178
Keyword: credit, TF-IDF Score: 72.25356678785614
Keyword: 4dr, TF-IDF Score: 69.62573302668747
Keyword: car, TF-IDF Score: 68.21537221762354
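
The top keyword `na` with a score of exactly 378.0 deserves a second look. `TfidfVectorizer` L2-normalizes each document by default, so a document whose only token is `na` contributes exactly 1.0 to that term's total. A round total therefore suggests that roughly 378 records carry a placeholder such as `NA` instead of a description. This is an inference from the output, not something verified against the data. If it holds, such records could be dropped before vectorizing. The following is a minimal sketch, where `PLACEHOLDERS` is a hypothetical set of spellings to adjust after inspecting the raw data and `drop_placeholders` is a hypothetical helper.

```python
# Hypothetical placeholder spellings; adjust after inspecting the raw data
PLACEHOLDERS = {"", "na", "n/a", "none", "null"}


def drop_placeholders(texts):
    """Keep only texts that are not empty or a known missing-value marker."""
    return [t for t in texts if t.strip().lower() not in PLACEHOLDERS]


# Made-up texts for illustration only
sample = ["NA", "2002 Toyota Camry LE", "  n/a ", "", "Tacoma 4x4"]
print(drop_placeholders(sample))
```

Applying a filter like this before fitting the vectorizers would keep placeholder records from dominating the keyword ranking and the topics.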
