# M6-W3 Assignment: NLP

Student: Loai Siwas

Natural language processing (NLP) is an important and rapidly developing part of machine learning. Powerful new models (of the so-called transformer type) appear regularly, and each new one tends to outperform its predecessors on fundamental NLP tasks such as question answering and named-entity recognition. However, simple, classical methods often work quite well and are a good first approach to many NLP problems.

In this assignment, you will work with a famous data set for sentiment analysis, namely the Amazon reviews data set. One place where the data can be found is here: https://www.kaggle.com/bittlingmayer/amazonreviews


1. Download and import the training and testing data sets. The data is not in the usual .csv format, so the code below can help you transform it into a pandas data frame.


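```python
import bz2
import pandas as pd

train_file = bz2.BZ2File("train.ft.txt.bz2")
# Load and decode
lines = [x.decode('utf-8') for x in train_file.readlines()]
# Split in two: sentiment and review.
# str.strip('__label__') strips a set of characters, not a prefix,
# so drop the newline and remove the prefix explicitly instead.
score_review_list = [l.strip().replace('__label__', '', 1).split(' ', 1) for l in lines]
df = pd.DataFrame(score_review_list, columns=['score', 'review'])
```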

2. Bonus points for extracting reviews and labels using regular expressions and named groups.
3. Create a new feature, called ‘n_tokens’, that counts how many tokens (words) there are in a review. In other words, a feature for the length of a review.
4. Create a new feature, called ‘language’, which detects the language of each review. This feature will have one value per row (review) of the data.
5. Transform each review into a numeric vector of tokens using a bag-of-words. You can use the CountVectorizer module from sklearn, but limit the maximum number of features to 1000 to avoid memory issues (you can decrease it further if you still have memory issues). Explore the other parameters of the function as well.
6. Using the fitted and transformed vector and the features created above, train a model that predicts the sentiment of a review. Note that this will be a classification problem. Evaluate your model and motivate your choice of a performance metric. (Hint: the feature for language is of type ‘object’; you may want to transform it to binary, such that it is 1 if the language is English, 0 otherwise.)

Submit your solution in a Jupyter notebook and optionally, a link to a GitHub repo where you have also uploaded your code.

---
#### Part 1: Downloading the dataset
---

In [1]:
# import libraries and packages here
import pandas as pd
import bz2

In [2]:
# importing the train dataset

with bz2.BZ2File("train.ft.txt.bz2") as train_file: # open the file and close it when done
    # Load and decode
    lines = [x.decode('utf-8') for x in train_file.readlines()] # decode the file

# Split in two: sentiment and review
# str.strip('__label__') strips a set of characters, not a prefix, so remove the prefix explicitly
score_review_list = [l.strip().replace('__label__', '', 1).split(' ', 1) for l in lines] # drop the newline and the label prefix, then split the score and review

train_df = pd.DataFrame(score_review_list, columns = ['score', 'review'] ) # create a dataframe with the score and review

In [3]:
train_df.shape # check the shape of the train data

(3600000, 2)

In [4]:
train_df.head(2) # check the first 2 rows of the train data

|   | score | review |
|---|-------|--------|
| 0 | 2 | Stuning even for the non-gamer: This sound tra... |
| 1 | 2 | The best soundtrack ever to anything.: I'm rea... |


In [5]:
# importing the test dataset

with bz2.BZ2File("test.ft.txt.bz2") as test_file: # open the file and close it when done
    # Load and decode
    lines = [x.decode('utf-8') for x in test_file.readlines()] # decode the file

# Split in two: sentiment and review
# str.strip('__label__') strips a set of characters, not a prefix, so remove the prefix explicitly
score_review_list = [l.strip().replace('__label__', '', 1).split(' ', 1) for l in lines] # drop the newline and the label prefix, then split the score and review

test_df = pd.DataFrame(score_review_list, columns = ['score', 'review'] ) # create a dataframe with the score and review

In [6]:
test_df.shape # check the shape of the test data

(400000, 2)

In [7]:
test_df.head(5) # check the first 5 rows of the test data

|   | score | review |
|---|-------|--------|
| 0 | 2 | Great CD: My lovely Pat has one of the GREAT v... |
| 1 | 2 | One of the best game music soundtracks - for a... |
| 2 | 1 | Batteries died within a year ...: I bought thi... |
| 3 | 2 | works fine, but Maha Energy is better: Check o... |
| 4 | 2 | Great for the non-audiophile: Reviewed quite a... |
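
One caveat about removing the `__label__` prefix: `str.strip` takes a set of characters rather than a prefix, so `strip('__label__')` removes any of `_`, `l`, `a`, `b`, `e` from both ends of the string. On the raw file this goes unnoticed only because each line still ends in a newline, which stops the stripping on the right-hand side. The short sketch below uses a made-up line to show the difference; `parse_line` is a hypothetical helper for illustration, not part of the assignment code.

```python
# A made-up line in the '__label__N text' format of the data set (no trailing newline)
line = "__label__1 terrible"

# str.strip removes *characters* from both ends, so the review loses its tail
stripped = line.strip("__label__")
print(repr(stripped))  # '1 terri'


def parse_line(raw):
    """Split one '__label__N text' line into (score, review)."""
    prefix = "__label__"
    text = raw.strip()                 # drop surrounding whitespace, e.g. the newline
    if text.startswith(prefix):
        text = text[len(prefix):]      # remove the prefix itself, not a character set
    score, review = text.split(" ", 1)
    return score, review


print(parse_line(line + "\n"))  # ('1', 'terrible')
```

Removing the prefix by position (or with a regular expression, as in Part 2) keeps the review text intact regardless of which letters it ends with.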


---
#### Part 2: Extracting reviews and labels using regular expressions and named groups (for the bonus points)
---

In [1]:
# import libraries and packages here
import pandas as pd
import bz2
import re

In [9]:
# Define the regular expression pattern to extract the labels and reviews.
# The named groups 'score' and 'review' keep their original positions (groups 2 and 3).
pattern = r"(__label__(?P<score>\d+)\s+)(?P<review>.*)"

In [10]:
# Open the compressed file and read its content for the training data
with bz2.open('train.ft.txt.bz2', 'rt', encoding='utf-8') as file:
    content = file.readlines()

# Initialize an empty list to store the extracted data
data = []

# Extract the labels and reviews using the named groups
for line in content:
    match = re.match(pattern, line)
    if match:
        label = int(match.group('score'))
        review = match.group('review')
        data.append({'score': label, 'review': review})

# Create a pandas DataFrame from the extracted data
train_df = pd.DataFrame(data)

In [11]:
train_df.shape # check the shape of the train data

(3600000, 2)

In [12]:
train_df.head(2) # check the first 2 rows of the train data

|   | score | review |
|---|-------|--------|
| 0 | 2 | Stuning even for the non-gamer: This sound tra... |
| 1 | 2 | The best soundtrack ever to anything.: I'm rea... |


In [13]:
# Open the compressed file and read its content for the test data
with bz2.open('test.ft.txt.bz2', 'rt', encoding='utf-8') as file:
    content = file.readlines()

# Initialize an empty list to store the extracted data
data = []

# Extract the labels and reviews using regular expressions
for line in content:
    match = re.match(pattern, line)
    if match:
        label = int(match.group(2))
        review = match.group(3)
        data.append({'score': label, 'review': review})

# Create a pandas DataFrame from the extracted data
test_df = pd.DataFrame(data)

In [14]:
test_df.shape # check the shape of the test data

(400000, 2)

In [15]:
test_df.head(2) # check the first 2 rows of the test data

|   | score | review |
|---|-------|--------|
| 0 | 2 | Great CD: My lovely Pat has one of the GREAT v... |
| 1 | 2 | One of the best game music soundtracks - for a... |
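
As an alternative to looping over the lines, pandas can apply a pattern with named groups directly: `Series.str.extract` turns each named group into a column. The sketch below is a minimal illustration on two made-up lines, not the code that produced the results above; the names `lines`, `named_pattern` and `demo_df` are assumptions for the example.

```python
import pandas as pd

# Two made-up lines in the same '__label__N text' format as the data files
lines = pd.Series([
    "__label__2 Great CD: lovely voice\n",
    "__label__1 Batteries died: would not buy again\n",
])

# Each named group (?P<name>...) becomes a column of the extracted DataFrame.
# '.' does not match a newline, so the trailing '\n' is left out of the review.
named_pattern = r"^__label__(?P<score>\d+)\s+(?P<review>.*)"
demo_df = lines.str.extract(named_pattern)

# The extracted groups are strings; convert the score to an integer
demo_df["score"] = demo_df["score"].astype(int)

print(demo_df)
```

Lines that do not match the pattern would come back as missing values, so a `dropna()` before the integer conversion may be needed on messier input.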


---
#### Part 3: Create a new feature, called ‘n_tokens’, that counts how many tokens (words) there are in a review. In other words, a feature for the length of a review.
---

In [16]:
# Create the 'n_tokens' column by counting the number of tokens in each review
train_df['n_tokens'] = train_df['review'].apply(lambda x: len(x.split()))

In [17]:
train_df.shape # check the shape of the train data

(3600000, 3)

In [18]:
# Print the first 2 rows of the DataFrame with the 'n_tokens' column
train_df.head(2)

|   | score | review | n_tokens |
|---|-------|--------|----------|
| 0 | 2 | Stuning even for the non-gamer: This sound tra... | 80 |
| 1 | 2 | The best soundtrack ever to anything.: I'm rea... | 97 |
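
The same count can be written with pandas string methods instead of `apply`. The sketch below, on a made-up two-row frame, also shows how much the result depends on the definition of a token: whitespace splitting counts punctuation runs such as `...` as tokens, whereas a word pattern like the default `token_pattern` of `CountVectorizer` (`\b\w\w+\b`) ignores punctuation and single-character words. The column name `n_word_tokens` is an assumption for the example.

```python
import pandas as pd

demo = pd.DataFrame({
    "review": [
        "Great CD: lovely voice",
        "Batteries  died within a year ...",
    ]
})

# Whitespace tokenization, equivalent to len(x.split()) per review
demo["n_tokens"] = demo["review"].str.split().str.len()

# Word-pattern tokenization: words of two or more alphanumeric characters
demo["n_word_tokens"] = demo["review"].str.count(r"\b\w\w+\b")

print(demo)
```

For the second review the whitespace count is 6 and the word-pattern count is 4, because `a` and `...` are not counted as words.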


---
#### Part 4: Create a new feature, called ‘language’, which detects the language of each review. This feature will have one value per row (review) of the data.
---

In [10]:
# import libraries and packages here
from langdetect import detect # or we can use from polyglot.detect import Detector

In [11]:
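def detect_language(text):
    """Return the detected language code, or 0 when detection fails."""
    try:
        return detect(text)
    except Exception:  # a bare except would also swallow KeyboardInterrupt during this long run
        print("This row throws an error:", text)
        return 0

train_df['language'] = train_df['review'].apply(detect_language)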

This row throws an error: ........: ............ ..... ..... ...... ...... ....... ..... ....... ....... ........ ......... ........ ........ ..... ......... ..... ..... ...... .. ......
This row throws an error: 47382 75983 37483 83740!: 38493 34740 47383 37054 48624 78568? 18581 28682 18558 24866 22584 24995 26484 14589 15648 15486 73893. 77504 03478 47589 43705 47309 67490 27348 57490 57409 37405 40978, 39794 (39847 57303 57049 32740 57403) 75093 47309 47328 54798 68978 97231 23473 34785 34097 34097. 34987: 34908 74309 34709 34908 40700 34087 45709 39874 97865 53586 97423 64987 36549 $54868 97668 52585 58855 93633 48457 20385 49884. 57430 34094 08908 34098 & 30409 08745 72009 23730 40508%. 24094 32098 00908 34042 20835 27789 29735 #93487 98743 32907 75928 29873 29723 54097 29735 29357 40540 34094 21009 08423 40755 09725. 25097 45097 23099 23073@03550.23405 23098, 50983 20408 60846 23974 90766.
This row throws an error: &#4321;&#4304;&#4311;&#4304;&#4323;&#4320;&#4312;: &#4315;&#4304

In [12]:
train_df.shape # check the shape of the train data

(3600000, 4)

In [13]:
# Print the first 2 rows of the DataFrame with the 'language' column
train_df.head(2)

|   | score | review | n_tokens | language |
|---|-------|--------|----------|----------|
| 0 | 2 | Stuning even for the non-gamer: This sound tra... | 80 | en |
| 1 | 2 | The best soundtrack ever to anything.: I'm rea... | 97 | en |


---

**Note:**
The language detection above took 326m 58.7s (about five and a half hours). To avoid rerunning it after every modification, I saved the data frame to a CSV file and reloaded it before continuing, so the file serves as a checkpoint.

In [15]:
# train_df.to_csv('train_df.csv', index=False) # save the train data to a csv file

In [20]:
train_df = pd.read_csv('train_df.csv') # read the train data from the csv file

In [21]:
train_df.shape # check the shape of the train data

(3600000, 4)

In [22]:
# Print the first 2 rows of the reloaded DataFrame
train_df.head(2)

|   | score | review | n_tokens | language |
|---|-------|--------|----------|----------|
| 0 | 2 | Stuning even for the non-gamer: This sound tra... | 80 | en |
| 1 | 2 | The best soundtrack ever to anything.: I'm rea... | 97 | en |
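
The checkpoint idea can be wrapped in a small helper so that the expensive step runs only when no saved file exists yet. This is a minimal sketch under stated assumptions: `load_or_compute` and `slow_step` are hypothetical names, and the tiny frame stands in for the real language-detection result.

```python
import os
import tempfile

import pandas as pd


def load_or_compute(path, compute):
    """Return the DataFrame cached at `path`, computing and saving it on first use."""
    if os.path.exists(path):
        return pd.read_csv(path)
    df = compute()
    df.to_csv(path, index=False)
    return df


calls = []  # records how often the expensive step really runs


def slow_step():
    """Stand-in for a long computation such as language detection."""
    calls.append(1)
    return pd.DataFrame({"review": ["Great CD", "Muy bueno"], "language": ["en", "es"]})


checkpoint = os.path.join(tempfile.mkdtemp(), "train_df.csv")
first = load_or_compute(checkpoint, slow_step)   # computes and writes the file
second = load_or_compute(checkpoint, slow_step)  # reads the file, skips the computation

print(len(calls))  # 1
```

Deleting the checkpoint file is then the only step needed to force a fresh run.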


---
#### Part 5: Transform each review into a numeric vector of tokens using a bag-of-words. You can use the CountVectorizer module from sklearn, but limit the maximum number of features to 1000 to avoid memory issues (you can decrease it further if you still have memory issues). Explore the other parameters of the function as well.
---

In [23]:
# import libraries and packages here
from sklearn.feature_extraction.text import CountVectorizer

In [24]:
# Initialize the CountVectorizer with parameters
vectorizer = CountVectorizer(max_features=250)  # keep only the 250 most frequent tokens (the assignment allows up to 1000)

In [25]:
# Fit and transform the reviews into bag-of-words vectors
X = vectorizer.fit_transform(train_df['review'])

In [26]:
# Convert the transformed vectors to a pandas DataFrame
df_bow = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

In [27]:
df_bow.head(2) # check the first 2 rows of the bag of words dataframe

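|   | about | actually | after | again | album | all | also | always | am | amazon | ... | work | works | world | worth | would | written | year | years | you | your |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |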
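
Converting the bag-of-words matrix to a dense DataFrame is convenient for inspection, but it is memory-hungry: 3,600,000 rows times 250 columns of 64-bit integers comes to roughly 7 GB. A possible alternative, sketched below on three made-up reviews, keeps the matrix sparse and appends the extra features with `scipy.sparse.hstack`. The sketch also tries a few other `CountVectorizer` parameters (`stop_words`, `ngram_range`, `min_df`). The variable names and parameter values are assumptions for illustration, not the settings used in this notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up reviews standing in for train_df['review']
reviews = [
    "Great CD, I love this album",
    "Batteries died within a year, not worth the money",
    "Works fine and is worth the price",
]
n_tokens = np.array([len(r.split()) for r in reviews])
is_english = np.array([1, 1, 1])

vectorizer = CountVectorizer(
    max_features=1000,       # cap on the vocabulary size
    stop_words="english",    # drop very common English words
    ngram_range=(1, 2),      # single words and two-word phrases
    min_df=1,                # minimum number of reviews a term must appear in
)
bow = vectorizer.fit_transform(reviews)  # scipy sparse matrix, never densified

# Stack the two extra feature columns in front of the sparse bag-of-words columns
extra = csr_matrix(np.column_stack([is_english, n_tokens]))
features = hstack([extra, bow], format="csr")

print(bow.shape, features.shape)
```

Scikit-learn estimators such as `RandomForestClassifier` accept a sparse matrix like `features` directly, so the dense conversion can be skipped when the DataFrame view is not needed.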


---
#### Part 6: Using the fitted and transformed vector and the features created above, train a model that predicts the sentiment of a review. Note that this will be a classification problem. Evaluate your model and motivate your choice of a performance metric. (Hint: the feature for language is of type ‘object’; you may want to transform it to binary, such that it is 1 if the language is English, 0 otherwise.)
---

In [2]:
# import libraries and packages here
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [30]:
# Transform the 'language' feature to binary (English or not English)
train_df['language'] = train_df['language'].apply(lambda x: 1 if x == 'en' else 0)

In [31]:
train_df.shape # check the shape of the train data

(3600000, 4)

In [32]:
# Print the first 2 rows of the DataFrame with the binary 'language' column
train_df.head(2)

|   | score | review | n_tokens | language |
|---|-------|--------|----------|----------|
| 0 | 2 | Stuning even for the non-gamer: This sound tra... | 80 | 1 |
| 1 | 2 | The best soundtrack ever to anything.: I'm rea... | 97 | 1 |


In [33]:
# Prepare the features and target variables
X = pd.concat([train_df[['language', 'n_tokens']], df_bow], axis=1)
y = train_df['score']

In [34]:
X.shape # check the shape of the features

(3600000, 252)

In [35]:
y.shape # check the shape of the target

(3600000,)

In [18]:
X.head(2) # check the first 2 rows of the features dataframe

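|   | language | n_tokens | about | actually | after | again | album | all | also | always | ... | work | works | world | worth | would | written | year | years | you | your |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 80 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 97 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |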


In [4]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # alternatively, the separately imported test set (test_df) could serve as the hold-out data

In [5]:
# Train a random forest classifier
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

In [7]:
# Save the trained classifier with joblib so it can be reused later without retraining

from joblib import dump # import dump from joblib

# Default location (disabled):
# dump(clf, 'trained_classifier.joblib')

# Save to the E: drive instead, because the C: drive did not have enough free space
dump(clf, 'E:/trained_classifier.joblib')

['E:/trained_classifier.joblib']

In [None]:
# Load the saved classifier from file using joblib

from joblib import load # import load from joblib

# Load the saved classifier from file
clf = load('E:/trained_classifier.joblib')

In [8]:
# Make predictions on the testing set
y_pred = clf.predict(X_test)

In [9]:
# Evaluate the model using accuracy as the performance metric
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.8095805555555555
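
Accuracy is a reasonable headline metric when the two classes are of similar size, which `y.value_counts()` would show for the real data. It is still worth looking at the confusion matrix and the per-class precision and recall, since accuracy alone hides which class the model gets wrong. The sketch below illustrates these checks on made-up labels only; the arrays `y_true` and `y_hat` are assumptions and do not come from the model above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up labels for illustration only (1 = negative, 2 = positive)
y_true = np.array([1, 1, 1, 1, 2, 2, 2, 2])
y_hat = np.array([1, 1, 1, 2, 2, 2, 2, 1])

# Class balance: accuracy is a fair summary only when the classes are of similar size
labels, counts = np.unique(y_true, return_counts=True)
balance = dict(zip(labels.tolist(), counts.tolist()))

acc = accuracy_score(y_true, y_hat)
cm = confusion_matrix(y_true, y_hat, labels=[1, 2])  # rows: true class, columns: predicted class

print("Class counts:", balance)
print("Accuracy:", acc)
print(cm)
print(classification_report(y_true, y_hat, labels=[1, 2]))
```

If the classes turned out to be strongly imbalanced, the per-class F1 scores from `classification_report` would be a more informative choice than accuracy.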


---

# End of Assignment
