# M6- W3 Assignment: NLP

Natural language processing (NLP) is an important and rapidly developing part of machine learning. Powerful new models (the so-called transformers) appear regularly, and each tends to outperform its predecessors on fundamental NLP tasks such as question answering and named-entity recognition. However, simple classical methods often work quite well and are a good first approach to many NLP problems.

In this assignment, you will work with a famous data set for sentiment analysis, namely the Amazon reviews data set. One place where the data can be found is here: https://www.kaggle.com/bittlingmayer/amazonreviews

Download and import the training and testing data sets. The data is not in the usual .csv format, so the code below can help you transform it into a pandas data frame.

    import bz2
    import pandas as pd

    train_file = bz2.BZ2File("train.ft.txt.bz2")

    # Load and decode
    lines = [x.decode('utf-8') for x in train_file.readlines()]

    # Split in two: sentiment and review
    score_review_list = [l.strip('__label__').split(' ', 1) for l in lines]

    df = pd.DataFrame(score_review_list, columns=['score', 'review'])

- Bonus points for extracting reviews and labels using regular expressions and named groups.
- Create a new feature called ‘n_tokens’ that counts how many tokens (words) there are in a review. In other words, a feature for the length of a review.
- Create a new feature called ‘language’ that detects the language of each review, so the feature takes its own value for each row (review) of the data.
- Transform each review into a numeric vector of tokens using a bag-of-words. You can use the CountVectorizer class from sklearn, but limit the maximum number of features to 1000 to avoid memory issues (you can decrease it further if you still have memory issues). Explore the other parameters of the class as well.
- Using the fitted and transformed vector and the features created above, train a model that predicts the sentiment of a review. Note that this will be a classification problem. Evaluate your model and motivate your choice of a performance metric. (Hint: the feature for language is of type ‘object’; you may want to transform it to binary, such that it is 1 if the language is English and 0 otherwise.)

## Step 1: Download and Import the Data

In [3]:
import bz2
import os
import pandas as pd

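# Adjust this path to the folder that contains the downloaded data
os.chdir("C:\\Users\\ManosIeronymakisProb\\OneDrive - Probability\\Bureaublad\\ELU\\M6- W3 Assignment NLP")

# Load and decode; the context manager closes the file when done
with bz2.open("train.ft.txt.bz2", "rt", encoding="utf-8") as train_file:
    lines = train_file.readlines()

# Split into sentiment and review. str.strip('__label__') removes a set of
# characters from both ends rather than a prefix, so slice the prefix off instead.
prefix = '__label__'
score_review_list = [l[len(prefix):].rstrip('\n').split(' ', 1) for l in lines]

df = pd.DataFrame(score_review_list, columns=['score', 'review'])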

## Step 2: Extract Reviews and Labels using Regular Expressions

In [None]:
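import re

# Named groups (?P<name>...) label the two fields, as the assignment asks
pattern = re.compile(r"__label__(?P<score>\d+) (?P<review>.*)")

# Skip any line that does not match instead of failing on None
matches = (pattern.match(l) for l in lines)
score_review_list = [m.groupdict() for m in matches if m]

df = pd.DataFrame(score_review_list, columns=['score', 'review'])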
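
As an alternative sketch, the same named-group extraction can be done in pandas with `Series.str.extract`, which turns each named group into a column. The two sample lines are made up for illustration and only mimic the `__label__<digit> <text>` layout of the data file.

```python
import pandas as pd

# Two made-up lines in the "__label__<digit> <text>" layout of the data file
sample_lines = [
    "__label__2 Great product: works exactly as described.\n",
    "__label__1 Broke after a week: would not buy again.\n",
]

# Each named group becomes a column of the resulting DataFrame
pattern = r"^__label__(?P<score>\d+) (?P<review>.*)"
parsed = pd.Series(sample_lines).str.extract(pattern)

# Lines that do not match come back as NaN rows; drop them before converting
parsed = parsed.dropna().reset_index(drop=True)
parsed["score"] = parsed["score"].astype(int)
print(parsed)
```

Because `.` does not match a newline by default, the trailing `\n` of each line is left out of the `review` column.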


## Step 3: Create the 'n_tokens' Feature

In [None]:
df['n_tokens'] = df['review'].apply(lambda x: len(x.split()))
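
A small sketch of the same token count on toy data, using the vectorized `.str` accessor instead of `apply`. The sample reviews are invented for illustration; "token" here simply means a whitespace-separated chunk, so punctuation stays attached to words.

```python
import pandas as pd

# Toy reviews, including an empty one and one padded with spaces
reviews = pd.DataFrame({"review": ["Great product, works well", "", "  Bad  "]})

# str.split() with no arguments splits on any run of whitespace,
# and str.len() then counts the resulting tokens per row
reviews["n_tokens"] = reviews["review"].str.split().str.len()
print(reviews)
```

An empty review gets a count of 0, which is worth keeping in mind because language detection on such rows will not return a usable result.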

## Step 4: Create the 'language' Feature

In [None]:
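from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException
import numpy as np

# langdetect is non-deterministic by default; fix the seed for reproducible results
DetectorFactory.seed = 0

def detect_language(text):
    # Return NaN for empty reviews
    if not text.strip():
        return np.nan
    try:
        return detect(text)
    except LangDetectException:
        # Return NaN if language detection fails
        return np.nan

df['language'] = df['review'].apply(detect_language)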
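
The following is a minimal sketch of how detection and the later binary encoding could be wrapped in one helper. The names `add_language_features`, `is_english`, and `toy_detector` are hypothetical and not part of the assignment; the detector is passed in as an argument, so `langdetect.detect` could be supplied in practice while the demo below uses a stand-in.

```python
import pandas as pd


def add_language_features(df, detector, text_col="review"):
    """Return a copy of df with 'language' and binary 'is_english' columns.

    `detector` is any callable that maps a string to a language code,
    for example langdetect.detect.
    """
    def safe_detect(text):
        # Empty or non-string reviews cannot be classified
        if not isinstance(text, str) or not text.strip():
            return None
        try:
            return detector(text)
        except Exception:
            # The detector is arbitrary here, so any failure maps to "unknown"
            return None

    out = df.copy()
    out["language"] = out[text_col].map(safe_detect)
    # Unknown languages compare unequal to "en" and therefore become 0
    out["is_english"] = (out["language"] == "en").astype(int)
    return out


def toy_detector(text):
    """Hypothetical stand-in for a real detector, used only in this demo."""
    if "!" in text:
        raise ValueError("cannot detect")
    return "es" if "muy" in text.split() else "en"


demo = pd.DataFrame({"review": ["Works very well", "Funciona muy bien", "   ", "???!"]})
demo = add_language_features(demo, toy_detector)
print(demo)
```

With the real library, the call would presumably look like `add_language_features(df, detect)` after `from langdetect import detect`.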

## Step 5: Transform Reviews into Numeric Vectors using Bag-of-Words

In [None]:
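from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=1000)
# Note: .toarray() builds a dense array, which needs a lot of memory on the full data set
X = vectorizer.fit_transform(df['review']).toarray()

# Create a new DataFrame with the transformed vectors.
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
# The 'bow_' prefix avoids clashes with existing columns such as 'review' or 'language'.
bow_columns = ['bow_' + token for token in vectorizer.get_feature_names_out()]
df_transformed = pd.DataFrame(X, columns=bow_columns, index=df.index)

# Concatenate the transformed DataFrame with the original DataFrame
df = pd.concat([df, df_transformed], axis=1)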
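
If memory is a concern, one possible alternative is to keep the bag-of-words matrix sparse and append the extra features with `scipy.sparse.hstack`. This is a sketch on made-up reviews, not the approach used in the cells here; the `is_english` values are placeholders.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Made-up reviews standing in for df['review']
reviews = [
    "great product works well",
    "terrible quality broke fast",
    "works well great value",
]

# Extra features computed outside the vectorizer
n_tokens = np.array([len(r.split()) for r in reviews])
is_english = np.array([1, 1, 1])  # placeholder values for the demo

vectorizer = CountVectorizer(max_features=1000)
X_bow = vectorizer.fit_transform(reviews)  # stays a scipy sparse matrix

# Stack the two extra columns to the right of the bag-of-words counts
extra = csr_matrix(np.column_stack([n_tokens, is_english]))
X_sparse = hstack([X_bow, extra], format="csr")
print(X_sparse.shape)
```

scikit-learn estimators such as `LogisticRegression` accept sparse input directly, so a matrix like `X_sparse` could be passed to `fit` without ever calling `.toarray()`.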


## Step 6: Prepare the Data for Model Training

In [None]:
df['language'] = df['language'].apply(lambda x: 1 if x == 'en' else 0)

X = df.drop(['score', 'review'], axis=1)
y = df['score']


## Step 7: Train a Model for Sentiment Classification and Evaluate Performance

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

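# Split the data into training and testing sets; stratify to keep the class ratio in both
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Create and train the logistic regression model.
# max_iter is raised from the default of 100 because unscaled count features can be slow to converge.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model's accuracy. Accuracy is a reasonable metric when the two
# classes are roughly balanced (check with y.value_counts()); otherwise prefer F1 or ROC AUC.
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)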
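
To help motivate the choice of metric, accuracy can be compared with a confusion matrix and a macro-averaged F1 score. The labels below are invented toy values, not results from the model above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Invented true and predicted labels in the same "1"/"2" coding as the data
y_true = ["1", "1", "2", "2", "2", "1"]
y_hat = ["1", "2", "2", "2", "1", "1"]

acc = accuracy_score(y_true, y_hat)

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true, y_hat, labels=["1", "2"])

# Macro F1 averages the per-class F1 scores, so both classes count equally
f1_macro = f1_score(y_true, y_hat, average="macro")

print("Accuracy:", acc)
print("Confusion matrix:\n", cm)
print("Macro F1:", f1_macro)
```

When the classes are balanced, accuracy and macro F1 tend to tell a similar story; a large gap between them is a hint that one class is being predicted much better than the other.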
