## Introduction
In this notebook, we will explore various techniques and models for Natural Language Processing (NLP). We will cover topics such as text preprocessing, feature extraction, sentiment analysis, and text classification.

## Dataset
For this project, we will use the "Sentiment Analysis on Movie Reviews" dataset from Kaggle. It contains movie reviews along with their corresponding sentiment labels (positive or negative).

## Data Preprocessing
1. Load the dataset and perform basic exploratory data analysis.
2. Clean the text data by removing special characters, numbers, and stopwords.
3. Perform tokenization and lemmatization to convert text into a usable format.

## Feature Extraction
1. Implement Bag-of-Words (BoW) representation using CountVectorizer or TfidfVectorizer.
2. Generate word embeddings using pre-trained models such as Word2Vec or GloVe.
3. Explore feature engineering techniques like n-grams, term frequency-inverse document frequency (TF-IDF), and word frequency.
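
The n-gram and TF-IDF ideas listed above can be illustrated with a small standard-library sketch. It uses the textbook definitions (term frequency as count divided by document length, inverse document frequency as ln(N / document frequency)); scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ. The helper names `ngrams` and `tf_idf` and the toy documents are assumptions made for illustration only.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs):
    """Textbook TF-IDF: tf = count / doc length, idf = ln(N / document frequency)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears at least once
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Toy data (hypothetical), already tokenized
docs = ["good movie".split(), "bad movie".split()]
bigrams = ngrams("not a good movie".split(), 2)
weights = tf_idf(docs)

print(bigrams)
print(weights[0])
```

A term that occurs in every document ("movie" here) receives a weight of zero, which is the point of the IDF factor: it down-weights words that do not help distinguish one review from another.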

## Sentiment Analysis
1. Split the dataset into training and testing sets.
2. Train a sentiment analysis model using various algorithms like Naive Bayes, Support Vector Machines (SVM), or Recurrent Neural Networks (RNNs).
3. Evaluate the model's performance using metrics such as accuracy, precision, recall, and F1-score.
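
To make the four metrics in step 3 concrete, here is a minimal sketch that derives them from true/false positive and negative counts for a binary task. The function name `binary_metrics`, the `positive` argument, and the toy labels are hypothetical; in the notebook itself these values come from scikit-learn's metric functions.

```python
def binary_metrics(y_true, y_pred, positive="positive"):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels (hypothetical): 3 positive and 5 negative reviews
y_true = ["positive"] * 3 + ["negative"] * 5
y_pred = ["positive", "positive", "negative", "positive"] + ["negative"] * 4
scores = binary_metrics(y_true, y_pred)
print(scores)
```

In this toy case there are 2 true positives, 1 false positive, 1 false negative, and 4 true negatives, so accuracy is 0.75 while precision, recall, and F1 are each 2/3.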

## Text Classification
1. Convert the text data into numerical features using the chosen feature extraction technique.
2. Split the dataset into training and testing sets.
3. Train a text classification model using algorithms like Logistic Regression, Random Forest, or Convolutional Neural Networks (CNNs).
4. Evaluate the model's performance using metrics such as accuracy, precision, recall, and F1-score.
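
When the sentiment classes are imbalanced, a plain random split in step 2 can leave the test set with a different class ratio than the training set. The sketch below shows one way a stratified split could work, using only the standard library; `stratified_split` is a hypothetical helper, not part of the notebook. With scikit-learn, a similar effect is available by passing `stratify=y` to `train_test_split`.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), splitting each class in the same proportion."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # Seeded so the split is reproducible
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels (hypothetical): 10 positive and 5 negative reviews
labels = ["positive"] * 10 + ["negative"] * 5
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Here the test set receives 2 of the 10 positive reviews and 1 of the 5 negative ones, preserving the 2:1 class ratio of the full dataset.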

## Conclusion
This notebook walks through a basic NLP workflow: text preprocessing, Bag-of-Words feature extraction, and sentiment classification with Naive Bayes and Logistic Regression, evaluated using accuracy, precision, recall, and F1-score. The same workflow extends to the other techniques outlined above, such as TF-IDF features, word embeddings, and neural models.


# **Data Preprocessing**

In [8]:
# %pip install nltk  # uncomment if NLTK is not already installed

In [9]:
# Load the dataset
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

df = pd.read_csv("/kaggle/input/movie-review/movie_review.csv")

# Basic exploratory data analysis
print(df.head())
print(df.shape)

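# Build the stopword set and the lemmatizer once; recreating them for every
# word or review makes cleaning a large dataset extremely slow.
STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Keep letters only, then lowercase
    text = re.sub(r"[^a-zA-Z]", " ", text).lower()
    # Tokenize, drop stopwords, and lemmatize the remaining words
    tokens = [word for word in word_tokenize(text) if word not in STOP_WORDS]
    return " ".join(lemmatizer.lemmatize(word) for word in tokens)

df["cleaned_text"] = df["text"].apply(clean_text)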


[nltk_data] Error loading punkt: <urlopen error [Errno -3] Temporary
[nltk_data]     failure in name resolution>
[nltk_data] Error loading stopwords: <urlopen error [Errno -3]
[nltk_data]     Temporary failure in name resolution>



KeyboardInterrupt
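
The log above shows the NLTK resource downloads failing because the environment could not resolve the download host. As a fallback for offline sessions, the cleaning step can be approximated with the standard library alone. This is only a rough sketch: `SAMPLE_STOPWORDS` is a tiny hand-picked sample rather than NLTK's English stopword list, tokens are split on whitespace, and no lemmatization is performed. Both names are hypothetical.

```python
import re

# Assumption: a tiny illustrative stopword sample, not NLTK's English list
SAMPLE_STOPWORDS = frozenset({"a", "an", "and", "is", "it", "of", "the", "this", "was"})

def clean_text_offline(text):
    """Keep letters only, lowercase, split on whitespace, and drop stopwords."""
    text = re.sub(r"[^a-zA-Z]", " ", text).lower()
    tokens = [tok for tok in text.split() if tok not in SAMPLE_STOPWORDS]
    return " ".join(tokens)

print(clean_text_offline("This movie was GREAT, 10/10!"))
```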



# **Feature Extraction (Bag of Words)**

In [None]:
# Implement Bag-of-Words representation
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df["cleaned_text"])
y = df["sentiment"]

# Print the feature matrix shape
print(X.shape)
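
To show what the feature matrix contains, here is a rough pure-Python sketch of a Bag-of-Words count matrix. It is not how `CountVectorizer` is implemented: the real class returns a sparse matrix and, by default, ignores single-character tokens. The helper `bag_of_words` and the toy reviews are hypothetical.

```python
def bag_of_words(docs):
    """Return (vocabulary, count matrix) for whitespace-tokenized documents."""
    # One column per distinct token, in alphabetical order
    vocabulary = sorted({token for doc in docs for token in doc.split()})
    index = {token: i for i, token in enumerate(vocabulary)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for token in doc.split():
            row[index[token]] += 1
        matrix.append(row)
    return vocabulary, matrix

vocabulary, matrix = bag_of_words(["good good movie", "bad movie"])
print(vocabulary)
print(matrix)
```

Each row corresponds to one review and each column to one vocabulary word, so the matrix shape is (number of reviews, vocabulary size), which is what `X.shape` reports.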

# **Sentiment Analysis (Naive Bayes)**

In [None]:
# Split the dataset into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Naive Bayes model
from sklearn.naive_bayes import MultinomialNB

nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# Evaluate the model's performance
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = nb_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
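# The sentiment labels are strings, so the positive class must be named explicitly
precision = precision_score(y_test, y_pred, pos_label="positive")
recall = recall_score(y_test, y_pred, pos_label="positive")
f1 = f1_score(y_test, y_pred, pos_label="positive")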

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
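
For intuition about what the Naive Bayes model learns from word counts, the following is a minimal sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, which corresponds to the default `alpha=1.0` of `MultinomialNB`. It is an illustration under simplifying assumptions, not scikit-learn's implementation; `train_nb`, `predict_nb`, and the toy reviews are hypothetical, and words unseen during training are simply skipped at prediction time.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes with add-alpha smoothing on tokenized docs."""
    vocab = {tok for doc in docs for tok in doc}
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Smoothed denominator: all tokens in the class plus alpha per vocabulary word
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            tok: math.log((token_counts[c][tok] + alpha) / total) for tok in vocab
        }
    return log_prior, log_likelihood

def predict_nb(model, doc):
    """Return the class with the highest log-posterior score."""
    log_prior, log_likelihood = model
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][tok] for tok in doc if tok in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy training data (hypothetical)
train_docs = [["great", "fun"], ["great", "acting"], ["boring", "plot"], ["bad", "boring"]]
train_labels = ["positive", "positive", "negative", "negative"]
model = train_nb(train_docs, train_labels)
print(predict_nb(model, ["great", "plot"]))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together for long reviews.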

# **Text Classification (Logistic Regression)**

In [None]:
# Convert text data into numerical features using Bag-of-Words representation
X = vectorizer.transform(df["cleaned_text"])

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Logistic Regression model
from sklearn.linear_model import LogisticRegression

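# Raise max_iter from the default of 100; high-dimensional Bag-of-Words features
# often need more iterations for the solver to converge
lr_model = LogisticRegression(max_iter=1000)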
lr_model.fit(X_train, y_train)

# Evaluate the model's performance
y_pred = lr_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
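# The sentiment labels are strings, so the positive class must be named explicitly
precision = precision_score(y_test, y_pred, pos_label="positive")
recall = recall_score(y_test, y_pred, pos_label="positive")
f1 = f1_score(y_test, y_pred, pos_label="positive")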

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)

# **Exploratory Data Analysis (Word Cloud)**

In [None]:
# Visualize word cloud for positive sentiment
from wordcloud import WordCloud
import matplotlib.pyplot as plt

positive_reviews = df[df["sentiment"] == "positive"]["cleaned_text"]
positive_text = " ".join(positive_reviews)

wordcloud = WordCloud(width=800, height=400, background_color="white").generate(positive_text)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.title("Word Cloud - Positive Sentiment")
plt.show()

# **Model Performance Evaluation (Confusion Matrix)**

In [None]:
# Confusion matrix for the Logistic Regression model
# (y_pred was last assigned from lr_model.predict in the previous section)
from sklearn.metrics import confusion_matrix
import seaborn as sns

cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, cmap="Blues", fmt="d")
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

# **Model Performance Comparison (Bar Plot)**

In [None]:
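# Compare model performance using a grouped bar plot
import numpy as np

models = {"Naive Bayes": nb_model, "Logistic Regression": lr_model}

# Recompute the metrics per model: the earlier cells reuse the same variable
# names, so the Naive Bayes scores were overwritten by Logistic Regression.
# Assumes the positive class is labeled "positive", as in the word cloud cell.
predictions = {name: model.predict(X_test) for name, model in models.items()}
metrics = {
    "Accuracy": [accuracy_score(y_test, p) for p in predictions.values()],
    "Precision": [precision_score(y_test, p, pos_label="positive") for p in predictions.values()],
    "Recall": [recall_score(y_test, p, pos_label="positive") for p in predictions.values()],
    "F1-Score": [f1_score(y_test, p, pos_label="positive") for p in predictions.values()],
}

plt.figure(figsize=(10, 6))
x_pos = np.arange(len(models))
bar_width = 0.2
colors = ["blue", "green", "orange", "purple"]
# Offset each metric's bars so they sit side by side instead of overlapping
for i, (label, scores) in enumerate(metrics.items()):
    plt.bar(x_pos + i * bar_width, scores, width=bar_width, color=colors[i], alpha=0.7, label=label)
# Center the model names under each group of four bars
plt.xticks(x_pos + 1.5 * bar_width, list(models))
plt.xlabel("Model")
plt.ylabel("Scores")
plt.title("Model Performance Comparison")
plt.legend()
plt.show()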