# NLP Hands-on Project 

# Restaurant Review Dataset:

1. Primary Goal: This dataset is designed to train and evaluate machine learning models for understanding and analyzing customer sentiment and opinions expressed in restaurant reviews.
2. Applications: The processed data can be used for various tasks, including:
-  Sentiment Analysis: Classifying reviews as positive, negative, or neutral to assess customer satisfaction.
-  Aspect Extraction: Identifying specific aspects of the restaurant experience that customers mention, such as food quality, service, ambiance, etc.
-  Recommendation Systems: Recommending restaurants to users based on their preferences and historical review data.
-  Topic Modeling: Discovering underlying themes and topics that frequently arise in reviews, potentially revealing areas for improvement or highlighting popular features.
# Key Considerations:

- Data Cleaning and Preprocessing: Real-world datasets often require careful cleaning and preprocessing to remove irrelevant information (e.g., stop words, punctuation) and prepare the data for modeling.
- Textual Nature: NLP techniques are essential to process this textual data in order to extract meaningful insights.
- Understanding Context: Sentiment understanding goes beyond individual words. Context within sentences and the overall review should be considered for accurate sentiment analysis.
# Additional Insights from Ratings:

1. Data Quality: The quality and representativeness of the dataset are crucial for training effective models. It's essential to have a diverse collection of reviews that reflects the real-world distribution of customer experiences.
2. Model Selection and Training: Choosing the appropriate NLP algorithms and tuning them effectively through training is crucial to achieve good performance in sentiment analysis and other NLP tasks.
3. Evaluation and Interpretation: Evaluating model performance on unseen data and interpreting the results are vital to assess the model's generalizability and real-world applicability.
# Incorporating Feedback:

- Headings: While the dataset may or may not contain explicit headings, identifying clear categories or labels within the data can be helpful for organizing and structuring the reviews.
- Missing Values: Missing values can affect the analysis. Techniques like data imputation or model-based strategies might be needed to address them.
- Ethical Considerations: When collecting and using data, it's essential to prioritize user privacy and ensure data anonymization as much as possible.

By understanding these core points and addressing the considerations above, you can effectively leverage NLP techniques to extract valuable insights from restaurant review data for various applications.

# How to Clean Text for Machine Learning with Python


You cannot go straight from raw text to fitting a machine learning or deep learning model.

You must clean your text first, which means splitting it into words and handling punctuation and case.

In fact, there is a whole suite of text preparation methods that you may need to use, and the choice of methods really depends on your natural language processing task.

In this tutorial, you will discover how you can clean and prepare your text ready for modeling with machine learning.

After completing this tutorial, you will know:

- How to get started by developing your own very simple text cleaning tools.
- How to take a step up and use the more sophisticated methods in the NLTK library.
- How to prepare text when using modern text representation methods like word embeddings.
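
As a first taste of a hand-rolled cleaning tool, here is a minimal sketch that uses only the Python standard library. The function name `simple_clean` and the sample sentence are made up for illustration; they are not part of the dataset or of any library.

```python
import string

def simple_clean(text):
    """Lowercase the text, strip ASCII punctuation, and split on whitespace."""
    # Translation table that deletes every ASCII punctuation character
    table = str.maketrans('', '', string.punctuation)
    return text.lower().translate(table).split()

# Hypothetical sample sentence
print(simple_clean("Wow... Loved this place!"))
```

This approach is deliberately naive: it treats contractions and hyphenated words crudely, which is one reason the later sections switch to NLTK.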

# To apply NLP techniques to this dataset, you would typically follow these steps:

1. Data Preprocessing:

Clean the text data by removing any unwanted characters, converting text to lowercase, and handling any missing or inconsistent values.
Tokenize the review text to break it down into individual words or tokens. Apply stemming or lemmatization to reduce words to their base form and remove stopwords if necessary.

2. Feature Extraction:

Convert the tokenized text data into numerical representations that can be used by machine learning models. This can be done using techniques such as TF-IDF vectorization or word embeddings. Prepare the feature matrix (X) containing the numerical representations of the review text.

3. Model Training:

Split the dataset into training and testing sets to evaluate the performance of the model. Train a binary classification model (e.g., logistic regression, support vector machine, or neural network) using the training data.

4. Model Evaluation:

Evaluate the trained model using the testing data to assess its performance in predicting whether reviewers liked the restaurant.
Calculate evaluation metrics such as accuracy, precision, recall, and F1-score to measure the model's effectiveness.
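
The four steps above can be sketched end to end on a handful of toy reviews. Everything in this block is an assumption made for illustration: the reviews and labels are invented, and `TfidfVectorizer` with `LogisticRegression` is just one of the combinations the steps mention, not necessarily the best choice for the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews: 1 = liked, 0 = not liked
reviews = [
    "great food and friendly service", "loved the pasta",
    "amazing place", "tasty burgers and quick service",
    "terrible food", "the service was slow and rude",
    "awful experience", "cold and bland pasta",
]
liked = [1, 1, 1, 1, 0, 0, 0, 0]

# Step 3: hold out a test set (stratified so both classes appear in it)
X_train, X_test, y_train, y_test = train_test_split(
    reviews, liked, test_size=0.25, random_state=0, stratify=liked
)

# Steps 1-2 (tokenizing, lowercasing, vectorizing) happen inside the vectorizer
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)

# Step 4: predict on the held-out reviews
predictions = model.predict(X_test)
print(list(zip(X_test, predictions)))
```

With so few examples the predictions carry no weight; the point is only the order of the steps.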


# Import Libraries:

1. nltk: Provides tools for NLP tasks, including the stopwords corpus, which offers lists of common stop words in various languages.
2. re: Enables regular expression operations for robust text cleaning.
3. Download Resources: The nltk.download() calls fetch the NLTK resources used below (the stop word lists and the punkt tokenizer models). You can comment them out once the resources have been downloaded.

In [1]:
"""   Round 1 libraries """
import numpy as np
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords  # Importing the stop word data
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
nltk.download('punkt')
nltk.download('stopwords')
all_stopwords=set(stopwords.words('english')) 

[nltk_data] Downloading package punkt to C:\Users\Odai For
[nltk_data]     Computer\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Odai For
[nltk_data]     Computer\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Read the Required Dataset



In [None]:

data = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t')

# Data Information:
You can use the pandas library in Python to load your dataset and view its structure using the head() and info() functions. Here's an example:

In [None]:

# Display the first few rows of the dataset
print(data.head())

# Display information about the dataset
print(data.info())


# Missing Value Handling:
To handle missing values in your dataset, you can use techniques like imputation (replacing missing values with a sensible estimate) or deletion (removing rows or columns with missing values). Here's how you can check for missing values and handle them using pandas:

In [None]:
# Check for missing values
print(data.isnull().sum())

# Remove rows with missing values
data.dropna(inplace=True)

# Alternative to dropping rows: fill missing values with a specific value
# data.fillna(value, inplace=True)


# Balanced Dataset:
A balanced dataset is one in which the distribution of classes (or labels) is approximately equal or evenly distributed. In other words, each class or label in the dataset has a similar number of instances or samples.

Characteristics:
- Equal Distribution: Each class or label in the dataset has a comparable number of instances. For example, in a binary classification problem, if one class has 500 instances, the other class also has around 500 instances.
- No Bias Towards Specific Classes: There is no significant bias towards any particular class in the dataset. This ensures that the model trained on the dataset is not skewed towards predicting one class over others.
- Reliable Model Evaluation: Since each class is represented fairly equally, model evaluation metrics such as accuracy, precision, recall, and F1-score provide reliable performance measures.

Example:
Consider a dataset containing images of cats and dogs for a classification task. If the dataset contains 500 images of cats and 500 images of dogs, it would be considered a balanced dataset.

# Imbalanced Dataset:
An imbalanced dataset is one in which the distribution of classes is skewed, with one or more classes having significantly fewer instances compared to others. This often occurs in real-world scenarios where certain classes are rare or less frequent.

Characteristics:
- Skewed Class Distribution: One or more classes have a disproportionately large or small number of instances compared to other classes. This results in an unequal distribution of classes.
- Minority Class: The class with fewer instances is often referred to as the minority class, while the class with more instances is called the majority class.
- Model Bias: Imbalanced datasets can lead to biased models that favor the majority class and perform poorly on minority classes.
- Challenges in Model Evaluation: Traditional evaluation metrics such as accuracy may not accurately reflect the model's performance, as a model can achieve high accuracy by simply predicting the majority class for all instances.

Example:
In a credit card fraud detection dataset, where fraudulent transactions are rare compared to legitimate transactions, the dataset might contain 95% legitimate transactions and only 5% fraudulent transactions. This would be considered an imbalanced dataset.
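
To see why accuracy can mislead on such data, here is a small sketch in plain Python. The labels are invented to match the 95% / 5% split in the example, and the "model" is a stand-in that always predicts the majority class.

```python
# Hypothetical labels: 95 legitimate (0) and 5 fraudulent (1) transactions
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall for the fraud class: share of actual frauds that were caught
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives

print(f"accuracy = {accuracy:.2f}")
print(f"recall   = {recall:.2f}")
```

The accuracy is 0.95 even though not a single fraudulent transaction is detected (recall 0.00).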

# Handling Imbalanced Datasets:
Handling imbalanced datasets requires special attention to ensure that the model does not become biased towards the majority class. Techniques for addressing imbalance include:

- Resampling: This involves either oversampling the minority class (adding more instances of the minority class) or undersampling the majority class (removing instances of the majority class).
- Synthetic Minority Over-sampling Technique (SMOTE): SMOTE generates synthetic samples for the minority class to balance the dataset.
- Cost-sensitive Learning: Assigning higher misclassification costs to minority class instances to encourage the model to focus on correctly predicting the minority class.
- Ensemble Methods: Using ensemble methods such as bagging or boosting, typically combined with resampling or class weighting, since they are not immune to class imbalance on their own.
- Different Evaluation Metrics: Using evaluation metrics such as precision, recall, F1-score, or area under the ROC curve (AUC-ROC) that are more suitable for imbalanced datasets.

By understanding the characteristics of balanced and imbalanced datasets and employing appropriate techniques, you can effectively train models that generalize well to real-world scenarios.
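
As an illustration of the simplest resampling option, the following sketch performs random oversampling with the standard library only. The sample data and variable names are hypothetical, and this is plain duplication of minority samples, not SMOTE.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical imbalanced data: (text, label) pairs, label 1 is the minority
samples = [("ok", 0)] * 8 + [("bad", 1)] * 2

majority = [s for s in samples if s[1] == 0]
minority = [s for s in samples if s[1] == 1]

# Draw minority samples with replacement until both classes have equal size
extra = random.choices(minority, k=len(majority) - len(minority))
balanced = majority + minority + extra

counts = Counter(label for _, label in balanced)
print(counts)
```

Resampling should normally be applied to the training split only, so that the test set keeps the real-world class distribution.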

- You can use a histogram to visualize the distribution of the output values, as shown below:

In [None]:

# Unique values in the 'Liked' column
print("\nUnique values in the 'Liked' column:")
print(data['Liked'].unique())
# Count the number of reviews in each class of the 'Liked' column
liked_counts = data['Liked'].value_counts()
print(liked_counts)

# Visualization
data['Liked'].hist(bins=10)

# Tokenization:

Purpose: Tokenization breaks down text into individual units like words, numbers, or punctuation marks. This allows further processing and analysis.
Example (NLTK Library):

In [None]:
without_cleaning=data['Review'][111]
sentences = word_tokenize(without_cleaning)

# Regular Expressions (RegEx):

Purpose: RegEx allows you to define patterns for efficiently searching, matching, and manipulating text.
Syntax: It uses a special syntax with metacharacters such as + (matches one or more occurrences) and . (matches any single character), and character classes such as [] (matches any one character listed within the brackets). A ^ placed first inside the brackets negates the class, so [^a-zA-Z ] matches every character that is not a letter or a space. Be careful: re.sub works on a string, so convert the token list back to text with the join function first.

In [None]:
after_cleaning_1 = re.sub('[^a-zA-Z ]', '', " ".join(sentences))
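
To make the effect of the pattern concrete, here is a short sketch on an invented sample string (the string is an assumption, not a row from the dataset).

```python
import re

# Hypothetical sample text
sample = "Wow... the food was 10/10, AMAZING!!"

# [^a-zA-Z ] matches any character that is NOT a letter or a space,
# so digits and punctuation are deleted
letters_only = re.sub('[^a-zA-Z ]', '', sample)
print(repr(letters_only))

# + matches one or more occurrences: collapse runs of spaces into one
single_spaced = re.sub(' +', ' ', letters_only)
print(repr(single_spaced))
```

Deleting "10/10," leaves two adjacent spaces behind, which is why the second substitution is sometimes useful; in the pipeline here, the later call to `word_tokenize` absorbs the extra whitespace anyway.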

# Lowercasing:

Purpose: Converting all characters to lowercase ensures consistency in word representation. This is helpful for NLP tasks because otherwise the same word in different cases (e.g., "Good" and "good") is treated as two distinct tokens.
Example:

In [None]:
after_lower=after_cleaning_1.lower()

# Stop Words:

Purpose: Stop words are common words in a language that carry little meaning on their own. Removing them can help focus on the more content-rich words in a text.


# Stemming:

Purpose: Stemming reduces a word to its base form (stem) by removing suffixes. This helps group related words together for analysis.


In [None]:

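# Keep "not" so that negations such as "not good" survive stop word removal.
# discard() is safe to re-run; remove() raises KeyError on the second run.
all_stopwords.discard("not")
ps = PorterStemmer()
after_stopwords = [ps.stem(word) for word in word_tokenize(after_lower) if word not in all_stopwords]
After_cleaning = ' '.join(after_stopwords)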
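
The reason for taking "not" out of the stop word list is easiest to see on a tiny example. The sketch below uses a hypothetical five-word stop list as a stand-in for NLTK's English list, and the helper `remove_stop_words` is made up for illustration.

```python
# Hypothetical mini stop word list, standing in for NLTK's English list
stop_words = {"the", "was", "is", "a", "not"}

def remove_stop_words(text, stop_words):
    """Return the lowercased words of `text` that are not in `stop_words`."""
    return [word for word in text.lower().split() if word not in stop_words]

review = "The food was not good"

with_not_removed = remove_stop_words(review, stop_words)
with_not_kept = remove_stop_words(review, stop_words - {"not"})

print(with_not_removed)  # negation is lost
print(with_not_kept)     # negation is kept
```

With "not" removed, a negative review collapses to the same words as a positive one, which is exactly what a sentiment classifier cannot afford.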

   



# Apply the preceding steps to the entire dataset:

In [None]:
corpus = []
ps = PorterStemmer()  # create the stemmer once, outside the loop
# Iterate over the reviews directly, so the loop does not depend on a
# hard-coded row count or on index labels (which can have gaps after dropna)
for without_cleaning in data['Review']:
    sentences = word_tokenize(without_cleaning)
    after_cleaning_1 = re.sub('[^a-zA-Z ]', '', " ".join(sentences))
    after_lower = after_cleaning_1.lower()
    after_stopwords = [ps.stem(word) for word in word_tokenize(after_lower) if word not in all_stopwords]
    After_cleaning = ' '.join(after_stopwords)
    corpus.append(After_cleaning)

# Revised Round:

In [None]:
"""  revised-round """
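from collections import Counter

# Flatten the cleaned reviews into a single list of words
revised = []
for review in corpus:
    revised.extend(review.split())  # split() also skips empty strings

# Count the words and keep the 10 most frequent ones
count = Counter(revised)
top_10 = count.most_common(10)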



# CountVectorizer and TfidfVectorizer: 


1. Feature Extraction:
- Conversion of Text Data to Numerical Representations: Machine learning algorithms require numerical input data. CountVectorizer and TfidfVectorizer convert textual data into numerical feature vectors that can be used for modeling.
- Sparse Matrix Representation: They return sparse matrices, which efficiently store only the non-zero word counts or weights of each document.
2. Handling Text Data:
- Preprocessing: They handle common text preprocessing tasks such as tokenization and lowercasing, and can optionally remove stop words.
- Normalization: By default, TfidfVectorizer scales each document vector to unit length, which can be essential for comparing documents of different lengths.
- Vocabulary Size: The resulting matrix has one column per vocabulary word and is therefore high-dimensional. Parameters such as max_features and min_df limit the vocabulary to keep the matrix manageable for machine learning algorithms.
3. Importance of Words:
- Frequency Analysis: CountVectorizer counts the occurrences of each word in a document, providing insights into the importance of words within individual documents.
- Term Importance: TfidfVectorizer assigns higher weights to terms that are important in a document but not frequent across all documents. This helps in identifying significant terms that differentiate documents.
4. Applications in NLP Tasks:
- Text Classification: They are widely used for sentiment analysis, spam detection, topic classification, and other text classification tasks.
- Information Retrieval: In search engines, they help in ranking documents based on relevance to a query.
- Document Clustering: They assist in clustering similar documents together based on their textual content.
- Keyword Extraction: They aid in identifying important keywords or phrases within documents.
5. Flexibility and Customization:
- Parameter Tuning: Both CountVectorizer and TfidfVectorizer offer various parameters that can be tuned to customize their behavior, such as n-gram ranges, stop word lists, and token patterns.
- Integration with Pipelines: They integrate with scikit-learn's pipeline framework, allowing for easy integration into machine learning workflows.

In summary, CountVectorizer and TfidfVectorizer are indispensable tools for processing and analyzing textual data in a wide range of applications. They facilitate the conversion of raw text into a format that can be effectively used by machine learning algorithms, enabling various NLP tasks and text mining analyses.

# CountVectorizer:
CountVectorizer is a feature extraction technique used to convert text data into numerical representations. It counts the frequency of each word (token) in the document and creates a sparse matrix where each row represents a document and each column represents a unique word in the entire corpus.

Example:
Consider the following two documents:

- "The cat sat on the mat."
- "The dog played with the ball."

The CountVectorizer would convert these documents into the following sparse matrix:


![image-2.png](attachment:image-2.png)

Each row corresponds to a document, and each column corresponds to a unique word in the corpus. The values represent the count of each word in the respective document.

Code Example:
Here's how you can use CountVectorizer in Python using scikit-learn:

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = [
    "The cat sat on the mat.",
    "The dog played with the ball."
]

# Create CountVectorizer object
count_vectorizer = CountVectorizer()

# Fit and transform the documents
X_count = count_vectorizer.fit_transform(documents)

# Convert sparse matrix to array for readability (optional)
X_count_array = X_count.toarray()

# Display the vocabulary (unique words)
print("Vocabulary (unique words):", count_vectorizer.get_feature_names_out())

# Display the document-term matrix
print("Document-Term Matrix:")
print(X_count_array)

Vocabulary (unique words): ['ball' 'cat' 'dog' 'mat' 'on' 'played' 'sat' 'the' 'with']
Document-Term Matrix:
[[0 1 0 1 1 0 1 2 0]
 [1 0 1 0 0 1 0 2 1]]
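
The matrix printed above can also be rebuilt by hand, which shows what the vectorizer is doing. This is a rough standard-library sketch, not scikit-learn's actual implementation; the `tokenize` helper is a made-up name that imitates CountVectorizer's default behavior of lowercasing and keeping tokens of two or more word characters.

```python
import re
from collections import Counter

documents = [
    "The cat sat on the mat.",
    "The dog played with the ball."
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # mirroring CountVectorizer's default token pattern
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token in the corpus, in alphabetical order
vocabulary = sorted({token for doc in documents for token in tokenize(doc)})

# One row per document, one column per vocabulary word
matrix = []
for doc in documents:
    counts = Counter(tokenize(doc))
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

The vocabulary and both rows agree with the CountVectorizer output shown above.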


# TfidfVectorizer:
TfidfVectorizer stands for Term Frequency-Inverse Document Frequency Vectorizer. It is similar to CountVectorizer but takes into account not only the frequency of a word in a document but also its importance in the entire corpus. It penalizes words that occur frequently across all documents and assigns higher weights to words that are unique to specific documents.

Example:
Consider the same two documents as before. The TfidfVectorizer would convert them into the following sparse matrix:
![image.png](attachment:image.png)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents (same as before)
documents = [
    "The cat sat on the mat.",
    "The dog played with the ball."
]

# Create TfidfVectorizer object
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the documents
X_tfidf = tfidf_vectorizer.fit_transform(documents)

# Convert sparse matrix to array for readability (optional)
X_tfidf_array = X_tfidf.toarray()

# Display the vocabulary (unique words)
print("Vocabulary (unique words):", tfidf_vectorizer.get_feature_names_out())

# Display the TF-IDF matrix
print("TF-IDF Matrix:")
print(X_tfidf_array)
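
For readers who want to see where the TF-IDF weights come from, the sketch below recomputes them with the standard library. It assumes scikit-learn's default settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); the `tokenize` helper is a made-up name and the code is an approximation of the idea, not the library's implementation.

```python
import math
import re
from collections import Counter

documents = [
    "The cat sat on the mat.",
    "The dog played with the ball."
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in documents]
vocabulary = sorted({t for tokens in tokenized for t in tokens})
n_docs = len(documents)

# Document frequency: in how many documents each word appears
df = {w: sum(w in tokens for tokens in tokenized) for w in vocabulary}
# Smoothed inverse document frequency (assumed scikit-learn default)
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocabulary}

tfidf = []
for tokens in tokenized:
    counts = Counter(tokens)
    row = [counts[w] * idf[w] for w in vocabulary]
    norm = math.sqrt(sum(v * v for v in row))  # L2 norm of the row
    tfidf.append([v / norm for v in row])

for w, weight in zip(vocabulary, tfidf[0]):
    print(f"{w:>7}: {weight:.3f}")
```

Note that "the" receives the lowest IDF because it occurs in both documents, yet it still has the largest weight in the first document: it appears twice there, and with only two documents the IDF discount is small.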


# Apply what has been explained to the dataset and create an input (X, the bag of words after the NLP process) and an output (y, the Liked column)

In [None]:
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1573)  # keep at most the 1573 most frequent terms
X = cv.fit_transform(corpus).toarray()
y = data['Liked'].values  # 1-D label array, the shape scikit-learn expects for y

# Apply machine learning models such as Naive Bayes and Logistic Regression

In [None]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

# Training the Naive Bayes model on the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Training the Logistic Regression model on the Training set
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Predicting the Test set results with both models
from sklearn.metrics import accuracy_score
y_pred_nb = classifier.predict(X_test)
y_pred = logreg.predict(X_test)

print("Naive Bayes accuracy:", accuracy_score(y_test, y_pred_nb))
print("Logistic Regression accuracy:", accuracy_score(y_test, y_pred))
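
Accuracy is only one of the metrics listed in the Model Evaluation step. The sketch below shows how the confusion matrix, precision, recall, and F1-score could be computed with scikit-learn. The label arrays `y_true_demo` and `y_pred_demo` are invented for illustration; in the notebook you would pass `y_test` and `y_pred` instead.

```python
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical true labels and predictions (1 = liked, 0 = not liked)
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

precision = precision_score(y_true_demo, y_pred_demo)
recall = recall_score(y_true_demo, y_pred_demo)
f1 = f1_score(y_true_demo, y_pred_demo)

print("precision:", precision)
print("recall:   ", recall)
print("F1-score: ", f1)
```

Precision answers "of the reviews predicted as liked, how many really were?", while recall answers "of the reviews that were liked, how many did the model find?".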



# Here are some resources to delve deeper into NLP:

- Books:
  - "Speech and Language Processing" by Jurafsky and Martin
  - "Natural Language Processing with Python" by Bird, Klein, and Loper
- Online Courses:
  - https://online.stanford.edu/courses/cs230-deep-learning (Stanford)
  - https://course.fast.ai/ (fast.ai)

Remember, this is a simplified overview. As you progress, you'll encounter more advanced NLP techniques and delve deeper into specific areas like sentiment analysis or topic modeling.