# Introduction to NLP

**Natural Language Processing (NLP)** is a subfield of **Artificial Intelligence** and computational linguistics that focuses on the interaction between computers and human (natural) language. It involves developing algorithms, models, and tools that enable computers to understand, interpret, and generate human language. Common NLP applications include:

1.   Sentiment analysis
2.   Machine translation
3.   Automatic summarization
4.   Chatbots
5.   Email filtering



# Importing Libraries

Several libraries and modules are used throughout this project, so the first step is to import them.

1.   **Pandas** - Pandas is a powerful data manipulation and analysis library for Python. It provides data structures and functions for working with structured data such as tables.
2.   **Numpy** - Numpy is a fundamental package for numerical computation in Python. It provides support for arrays and matrices, along with mathematical functions that operate on them efficiently.
3.   **NLTK** - NLTK stands for Natural Language Toolkit, a library for working with human language data. It provides easy-to-use interfaces to linguistic data and models for natural language processing tasks.
4.   **Gzip** - Gzip is a module that provides support for reading and writing GZIP-compressed files. It is used in this project because the input dataset is a GZIP-compressed JSON file.
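5.   **Json** - The json module provides methods for working with JSON data, a common format for storing and exchanging structured data.
6.   **Scikit-learn** - Scikit-learn is a machine learning library for Python. It is used in this project for splitting the data, TF-IDF vectorization, the SVM classifier, and the evaluation metrics.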
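
As a small illustration of how the gzip and json modules fit together, the sketch below compresses a few JSON records (one object per line) in memory and reads them back. The two records are made up for the example; only the field names `overall` and `reviewText` come from the dataset used in this project.

```python
import gzip
import json

# Hypothetical records, invented for this example
records = [
    {"overall": 5.0, "reviewText": "Great magazine"},
    {"overall": 2.0, "reviewText": "Too many ads"},
]

# Serialize as JSON lines: one JSON object per line
payload = "\n".join(json.dumps(r) for r in records).encode("utf-8")

# Compress to GZIP bytes, then decompress and parse each line again
compressed = gzip.compress(payload)
lines = gzip.decompress(compressed).decode("utf-8").splitlines()
restored = [json.loads(line) for line in lines]

print(restored == records)
```

Reading a file on disk works the same way, except that `gzip.open` is used to stream the lines instead of decompressing everything at once.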



In [1]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
import gzip
import json

In [None]:
# Load the JSON GZ dataset
def load_json_gz_dataset(filename):
    data = []
    with gzip.open(filename, 'rt', encoding='utf-8') as f:
        for line in f:
            data.append(json.loads(line))
    return data

# Load and preprocess the JSON GZ dataset
dataset = load_json_gz_dataset("Magazine_Subscriptions.json.gz")

# Convert the dataset to a Pandas DataFrame
df = pd.DataFrame(dataset)

# Display some sample data
print("Sample Data:")
print(df.head())

# Preprocessing Data

The preprocessing phase of this project involves two steps: stopword removal and stemming. **Stopwords** are common words (e.g. "the", "and", "in") that are often removed from text data because they carry little meaning on their own. A set of English stopwords from the NLTK stopwords corpus is used to filter them out of the text. **Stemming** reduces words to their root or base form (e.g. "improving", "improvement" -> "improv"), which helps text analysis by mapping inflected and derived forms of a word to a common form.
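
To make the two steps concrete, here is a dependency-free toy sketch. It is not the NLTK pipeline used in this project: `TOY_STOPWORDS` is a tiny hand-picked list and `toy_stem` is a crude suffix stripper, both invented for illustration and far simpler than NLTK's stopword corpus and the Porter stemmer.

```python
# Tiny hand-picked stopword list (an assumption; NLTK's English list is much longer)
TOY_STOPWORDS = {"the", "and", "in", "is", "a", "of", "to"}

# Suffixes stripped by the toy stemmer, tried in this order
# (much cruder than the rules of the Porter stemmer)
TOY_SUFFIXES = ("ement", "ing", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_preprocess(text):
    """Lowercase, drop stopwords, then stem each remaining token."""
    tokens = text.lower().split()
    kept = [t for t in tokens if t not in TOY_STOPWORDS]
    return " ".join(toy_stem(t) for t in kept)

result = toy_preprocess("The improvement in the articles is improving")
print(result)  # improv article improv
```

Both "improvement" and "improving" end up as the same token, which is the effect stemming is meant to achieve.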

In [None]:
# Preprocess the text data
nltk.download('stopwords')
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess_text(text):
    if isinstance(text, str):  # Exclude non-string 'text'
        text = text.lower()
        text = " ".join([word for word in text.split() if word not in stop_words])
        text = " ".join([stemmer.stem(word) for word in text.split()])
    return text

df["reviewText"] = df["reviewText"].apply(preprocess_text)

# Map 'overall' ratings (1-5) to 'sentiment' labels ("positive", "neutral", "negative")
def map_rating_to_sentiment(rating):
    if rating <= 2:
        return 'negative'
    elif rating == 3:
        return 'neutral'
    else:
        return 'positive'

df["sentiment"] = df["overall"].apply(map_rating_to_sentiment)

# Preparing Model Training Data

Before the data can be used to train a model, **TF-IDF (Term Frequency-Inverse Document Frequency)** vectorization has to be performed on the text. TF-IDF converts text documents into numerical vectors that machine learning algorithms can work with, giving higher weight to terms that are frequent in a document but rare across the whole collection.
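
The weighting idea can be sketched by hand with the textbook formula, tf(t, d) * log(N / df(t)). This is only an illustration on three made-up documents: scikit-learn's `TfidfVectorizer` uses raw counts, a smoothed variant of the idf term, and L2 normalization by default, so its numbers will differ from the ones computed here.

```python
import math
from collections import Counter

# Three made-up documents, invented for this example
docs = [
    "great magazine great price",
    "terrible magazine",
    "great magazine articles",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter()
for tokens in tokenized:
    doc_freq.update(set(tokens))

def tfidf(tokens):
    """Textbook TF-IDF: relative term frequency times log(N / df)."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

weights = [tfidf(tokens) for tokens in tokenized]
for doc, w in zip(docs, weights):
    print(doc, "->", {t: round(v, 3) for t, v in w.items()})
```

In this toy corpus "magazine" appears in every document, so its weight is zero, while "price" appears in only one document and receives the highest weight in the first document.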

In [4]:
# Drop rows with missing or NaN values in the "reviewText" column
df.dropna(subset=["reviewText"], inplace=True)

# Split the data into training and testing sets
X = df["reviewText"]
y = df["sentiment"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [5]:
# Vectorize the text data using TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Training Model

**Support Vector Machine (SVM)** is used in this project because it is a popular and powerful machine learning algorithm for classification and regression tasks. SVMs are particularly **effective when dealing with high-dimensional feature spaces**, which makes them well suited to text data, images, and other data types with many features. The objective of an SVM is to **find a decision boundary (hyperplane) that separates data points of different classes** while **maximizing the margin between the hyperplane and the nearest data points of each class**.
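
The notion of margin can be illustrated without training anything. The sketch below compares two hand-picked hyperplanes on four made-up 2D points; the helper `geometric_margin` and the data are assumptions for illustration, not part of scikit-learn. Both hyperplanes separate the classes, and an SVM with a linear kernel would prefer the one with the larger margin.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    A positive result means every point lies on the correct side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Made-up, linearly separable toy data with labels +1 and -1
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

# Two candidate separating hyperplanes
margin_a = geometric_margin((1.0, 1.0), -2.0, points, labels)  # x + y - 2 = 0
margin_b = geometric_margin((1.0, 0.0), -1.0, points, labels)  # x - 1 = 0

print("margin A:", round(margin_a, 3))
print("margin B:", round(margin_b, 3))
```

Hyperplane A keeps the nearest points farther away than hyperplane B does, so it is the better boundary under the maximum-margin criterion.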

In [None]:
# # Perform hyperparameter tuning
# param_grid = {'C': [0.1, 1, 10],
#               'kernel': ['linear']}

# svm_classifier = SVC(random_state=42)
# grid_search = GridSearchCV(svm_classifier, param_grid, cv=3, scoring='accuracy', verbose=2)
# grid_search.fit(X_train_tfidf, y_train)

# # Get the best parameters and model
# best_params = grid_search.best_params_
# svm_classifier = grid_search.best_estimator_

# print("\nBest Hyperparameters:")
# print(best_params)

In [None]:
# Train an SVM classifier
svm_classifier = SVC(kernel='linear', random_state=42)
svm_classifier.fit(X_train_tfidf, y_train)

In [None]:
# Evaluate the model
y_pred = svm_classifier.predict(X_test_tfidf)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Model Evaluation:")
print("Accuracy:", accuracy)
print("Classification Report:\n" + report)

In [None]:
# Test model's correctness
def predict_sentiment(text):
    preprocessed_text = preprocess_text(text)
    text_tfidf = tfidf_vectorizer.transform([preprocessed_text])
    sentiment = svm_classifier.predict(text_tfidf)  # use the classifier trained above
    return sentiment[0]

sample_texts = [
    "This product is amazing and I love it!",
    "The service was terrible and I'm highly disappointed.",
    "Neutral sentiment for this one."
]

print("\nTesting the Model on Sample Texts:")
for text in sample_texts:
    sentiment = predict_sentiment(text)
    print(f"Text: '{text}'")
    print(f"Predicted Sentiment: {sentiment}\n")