# **Gen-AI Bootcamp 24** *Natural Language Processing Course Assignment*

_________________________________________________________________________

## **Part 1: Text Collection and Loading**
**Objective:** *Collect and load a text dataset from a selected domain into a suitable format for
processing.*

#### **Domain**: *Automotive* 

#### **Kaggle Dataset**: *https://www.kaggle.com/datasets/ankkur13/edmundsconsumer-car-ratings-and-reviews?select=Scraped_Car_Review_dodge.csv*
This dataset contains consumers' written reviews and star ratings of cars, organized by manufacturer, model, and type.

#### **Loading Dataset**:

In [24]:
import pandas as pd

# Load the data into a DataFrame
df = pd.read_csv('Scraped_Car_Review_dodge.csv')
# Show the shape of the DataFrame as (rows, columns)
df.shape

(1020, 7)

#### **Displaying the first few rows**

In [25]:
# Print the first five rows of the DataFrame
df.head()

| Unnamed: 0 | Id | Review_Date | Author_Name | Vehicle_Title | Review_Title | Review | Rating |
|---|---|---|---|---|---|---|---|
| 0 | 0 | on 10/13/05 15:30 PM (PDT) | roadking | 2002 Dodge Ram Cargo Van 1500 3dr Van (3.9L 6c... | Great delivery vehicle | It's been a great delivery vehicle for my caf... | 4.625 |
| 1 | 1 | on 07/17/05 21:59 PM (PDT) | Mark | 2002 Dodge Ram Cargo Van 3500 3dr Ext Van (5.2... | Disappointmnet | Bought this car as a commuter vehicle for a v... | 2.125 |
| 2 | 2 | on 07/16/02 00:00 AM (PDT) | Tom Sheer | 2002 Dodge Ram Cargo Van 3500 Maxi 3dr Ext Van... | Sweet van | This van rocks its the best, lots of room. I ... | 5.0 |
| 3 | 3 | on 12/29/07 21:57 PM (PST) | Keven Smith | 2001 Dodge Ram Cargo Van 2500 Maxi 3dr Ext Van... | Keven Smith | Great work vehicle. Drives nice. has lots of ... | 4.5 |
| 4 | 4 | on 02/09/05 18:52 PM (PST) | VanMan | 2001 Dodge Ram Cargo Van 1500 3dr Van (3.9L 6c... | Not what Dodge used to be | Good solid frame and suspension. Well equipp... | 2.875 |
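
The `Unnamed: 0` column in the preview is the name pandas gives to a column whose header cell is empty, which usually means a saved row index. The sketch below shows one way to load such a file so that column becomes the index instead. The two-row CSV string is a hypothetical stand-in for the real file, and the assumption that the first header cell is empty is not confirmed by the source.

```python
import io

import pandas as pd

# Hypothetical two-row sample; the leading comma leaves the first header cell
# empty, which is what makes pandas name the column "Unnamed: 0".
sample_csv = (
    ",Review_Title,Rating\n"
    "0,Great delivery vehicle,4.625\n"
    "1,Sweet van,5.0\n"
)

# Default load: the unnamed first column is kept as ordinary data.
default_df = pd.read_csv(io.StringIO(sample_csv))

# index_col=0 tells pandas to use the first column as the row index instead.
indexed_df = pd.read_csv(io.StringIO(sample_csv), index_col=0)

print(list(default_df.columns))
print(list(indexed_df.columns))
```

With the real file, the same `index_col=0` argument could be passed to the `pd.read_csv('Scraped_Car_Review_dodge.csv')` call, assuming its layout matches the sample.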


_________________________________________________________________________
## **Part 2: Text Preprocessing**
**Objective:** *Gain hands-on experience with text preprocessing techniques.*

### **Step 1: Import the Necessary Libraries and Corpus**

In [26]:
import nltk
nltk.download('brown')      # Brown Corpus (source text)
nltk.download('punkt')      # Punkt models used by sent_tokenize and word_tokenize
nltk.download('stopwords')  # Stop word lists
nltk.download('wordnet')    # WordNet, used by WordNetLemmatizer
nltk.download('omw-1.4')    # Open Multilingual Wordnet, a companion resource to WordNet
from nltk.corpus import brown
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import string

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


### **Step 2: Load the Brown Corpus**

In [27]:
# Load the text from the 'science_fiction' category of the Brown Corpus
text = ' '.join(brown.words(categories='science_fiction'))
print("Original Text:", text[:500])  # Print the first 500 characters for reference

Original Text: Now that he knew himself to be self he was free to grok ever closer to his brothers , merge without let . Self's integrity was and is and ever had been . Mike stopped to cherish all his brother selves , the many threes-fulfilled on Mars , corporate and discorporate , the precious few on Earth -- the unknown powers of three on Earth that would be his to merge with and cherish now that at last long waiting he grokked and cherished himself . Mike remained in trance ; ; there was much to grok , loos


### **Step 3: Tokenization**
Tokenization splits the text into individual words and sentences.

**Impact of Tokenization:**
* Sentence Tokenization: Breaks down the text into manageable units (sentences) for further processing.
* Word Tokenization: Provides the basic units (words) needed for subsequent analysis steps.

In [28]:
# Sentence Tokenization
sentences = sent_tokenize(text)
print("First 5 sentences:", sentences[:5])

# Word Tokenization
words = word_tokenize(text)
print("First 20 words:", words[:20])

First 5 sentences: ['Now that he knew himself to be self he was free to grok ever closer to his brothers , merge without let .', "Self's integrity was and is and ever had been .", 'Mike stopped to cherish all his brother selves , the many threes-fulfilled on Mars , corporate and discorporate , the precious few on Earth -- the unknown powers of three on Earth that would be his to merge with and cherish now that at last long waiting he grokked and cherished himself .', "Mike remained in trance ; ; there was much to grok , loose ends to puzzle over and fit into his growing -- all that he had seen and heard and been at the Archangel Foster Tabernacle ( not just cusp when he and Digby had come face to face alone ) why Bishop Senator Boone made him warily uneasy , how Miss Dawn Ardent tasted like a water brother when she was not , the smell of goodness he had incompletely grokked in the jumping up and down and wailing -- Jubal's conversations coming and going -- Jubal's words troubled him mo
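
To make the two tokenization levels concrete, here is a minimal regex-based sketch using only the standard library. It is an approximation for illustration, not the Punkt model that `sent_tokenize` and `word_tokenize` rely on, and the helper names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical.

```python
import re


def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def simple_word_tokenize(sentence):
    """Return words (keeping internal apostrophes) and single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)


sample = "Mike stopped to cherish his brothers. Self's integrity was and is!"
sentences = simple_sent_tokenize(sample)
tokens = [simple_word_tokenize(s) for s in sentences]

print(sentences)
print(tokens)
```

One visible difference from NLTK: this sketch keeps "Self's" as a single token, whereas `word_tokenize` splits off the clitic, which is why a separate `'s` token shows up later in this notebook's output.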

### **Step 4: Stemming**
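Stemming reduces words to a stem by stripping suffixes with heuristic rules; the stem is not guaranteed to be a dictionary word.

**Impact of Stemming:**
* Reduction of Variants: Regular inflections such as "running" and "runs" are reduced to "run," which shrinks the vocabulary. Irregular forms such as "ran" are not covered by suffix rules and stay unchanged.
* Potential Loss of Meaning: Stemming can strip too much and produce non-words, as in the output below, where "was" becomes "wa," "his" becomes "hi," and "merge" becomes "merg."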

In [29]:
# Initialize the Porter Stemmer
stemmer = PorterStemmer()

# Apply stemming to each word
stemmed_words = [stemmer.stem(word) for word in words]
print("First 20 stemmed words:", stemmed_words[:20])

First 20 stemmed words: ['now', 'that', 'he', 'knew', 'himself', 'to', 'be', 'self', 'he', 'wa', 'free', 'to', 'grok', 'ever', 'closer', 'to', 'hi', 'brother', ',', 'merg']
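
To see how the Porter stemmer treats regular and irregular forms, the following sketch stems a few hand-picked words. `PorterStemmer` is purely rule-based, so this runs without any downloaded corpus data. The word lists are chosen for illustration; the last group reuses words whose stems appear in this notebook's output.

```python
from nltk.stem import PorterStemmer  # rule-based; needs no nltk.download call

stemmer = PorterStemmer()

# Regular inflections collapse to a shared stem.
regular = {w: stemmer.stem(w) for w in ["running", "runs", "brothers"]}

# Irregular or derived forms are not covered by the suffix rules.
untouched = {w: stemmer.stem(w) for w in ["ran", "runner"]}

# Some stems are not dictionary words at all.
non_words = {w: stemmer.stem(w) for w in ["was", "his", "merge", "integrity"]}

for group in (regular, untouched, non_words):
    print(group)
```

The non-word stems matter for later steps: any lookup against a dictionary or stop word list will no longer recognize "wa" or "hi".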


### **Step 5: Lemmatization**
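Lemmatization reduces words to their base or dictionary form (the lemma) by looking them up in a lexicon, here WordNet.

**Impact of Lemmatization:**
* Dictionary-Based Reduction: Usually more accurate than stemming because the result is a real word. Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a part-of-speech tag is passed through its `pos` argument.
* Improved Meaning Preservation: Maintains the integrity of the words better than stemming, provided it is applied to the original tokens rather than to stems.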

In [32]:
# Initialize the WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

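# Apply lemmatization to each token. Note that the input here is `stemmed_words`,
# not the original `words`, so stems that are not dictionary words (e.g. 'merg')
# pass through unchanged and the output below matches Step 4. Lemmatizing `words`
# instead would operate on the original tokens.
lemmatized_words = [lemmatizer.lemmatize(word) for word in stemmed_words]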
print("First 20 lemmatized words:", lemmatized_words[:20])

First 20 lemmatized words: ['now', 'that', 'he', 'knew', 'himself', 'to', 'be', 'self', 'he', 'wa', 'free', 'to', 'grok', 'ever', 'closer', 'to', 'hi', 'brother', ',', 'merg']
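
The output above is identical to the stemmed output because a dictionary lookup cannot recover a lemma from a stem that is not a word. The toy sketch below illustrates both this and the role of the part-of-speech tag. It uses a tiny hand-written table in place of WordNet, so `TOY_LEMMAS`, `toy_lemmatize`, and the POS tags are all hypothetical and only stand in for the real lexicon.

```python
# Toy lookup table standing in for a real lexicon such as WordNet.
# Keys are (surface form, part of speech); every entry is hand-written
# for illustration only.
TOY_LEMMAS = {
    ("was", "v"): "be",
    ("knew", "v"): "know",
    ("brothers", "n"): "brother",
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma if the (word, pos) pair is known, else the word unchanged."""
    return TOY_LEMMAS.get((word.lower(), pos), word)


# Original tokens with hand-assigned POS tags versus their Porter stems.
original = [("knew", "v"), ("was", "v"), ("brothers", "n")]
stems = [("knew", "v"), ("wa", "v"), ("brother", "n")]

from_original = [toy_lemmatize(w, p) for w, p in original]
from_stems = [toy_lemmatize(w, p) for w, p in stems]

print(from_original)  # ['know', 'be', 'brother']
print(from_stems)     # ['know', 'wa', 'brother']

# The same surface form maps to different lemmas depending on the tag.
print(toy_lemmatize("saw", "v"), toy_lemmatize("saw", "n"))  # see saw
```

With NLTK, the analogous step would be to pass a tag to `lemmatizer.lemmatize(word, pos=...)` and to feed it the original tokens.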


### **Step 6: Stop Word Removal**
Stop words are common words (e.g., "the," "is") that may not add significant meaning to text analysis.

**Impact of Stop Word Removal:**
* Noise Reduction: Eliminates common but uninformative words, reducing the size of the text data.
* Focus on Meaningful Words: Enhances the focus on significant words that contribute more to the text analysis.

In [33]:
# Get the list of stop words
stop_words = set(stopwords.words('english'))

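# Remove stop words and single-character punctuation from the processed tokens.
# Because filtering runs after stemming, stems such as 'wa' (was) and 'hi' (his)
# no longer match the stop word list and remain in the output.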
filtered_words = [word for word in lemmatized_words if word.lower() not in stop_words and word not in string.punctuation]
print("First 20 words after stop word removal:", filtered_words[:20])

First 20 words after stop word removal: ['knew', 'self', 'wa', 'free', 'grok', 'ever', 'closer', 'hi', 'brother', 'merg', 'without', 'let', 'self', "'s", 'integr', 'wa', 'ever', 'mike', 'stop', 'cherish']
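
Two details of the filter above are worth a closer look: the order of the steps, and the punctuation test. `word not in string.punctuation` is a substring check, so multi-character tokens such as `--` are not caught. The sketch below filters the original tokens of the first sentence before any stemming and uses a character-level punctuation test. `STOP_WORDS` is a hand-picked subset for illustration (NLTK's English list is much longer), and the helper names are hypothetical.

```python
import string

# Hand-picked subset for illustration only.
STOP_WORDS = {"now", "that", "he", "himself", "to", "be", "was", "his"}
PUNCT = set(string.punctuation)


def is_punctuation(token):
    """True when every character is punctuation, so '--' and ';;' also count."""
    return bool(token) and all(ch in PUNCT for ch in token)


def remove_stop_words(tokens):
    """Drop stop words (case-insensitive) and punctuation-only tokens."""
    return [t for t in tokens if t.lower() not in STOP_WORDS and not is_punctuation(t)]


tokens = (
    "Now that he knew himself to be self he was free to grok ever "
    "closer to his brothers , merge without let ."
).split()

filtered = remove_stop_words(tokens)
print(filtered)

# The substring test misses the double dash; the character-level test does not.
print("--" in string.punctuation, is_punctuation("--"))  # False True
```

Filtering before stemming removes "was" and "his" outright, so the stems "wa" and "hi" never reach the final list.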
