## Module submission header
### Submission preparation instructions 
_Completion of this header is mandatory, subject to a 2-point deduction to the assignment._ Only add plain text in the designated areas, i.e., replacing the relevant 'NA's. You must fill out all group member names and Drexel email addresses in the markdown list below, under the header __Module submission group__. You are required to fill out descriptive notes pertaining to any tutoring support received in the completion of this submission under the __Additional submission comments__ section at the bottom of the header. If no tutoring support was received, leave NA in place. You may also list other optional comments pertaining to the submission at the bottom. _Any disruption of this header's formatting will make your group liable to the 2-point deduction._

### Module submission group
- Group member 1
    - Name: Edward Day
    - Email: ED558@drexel.edu
- Group member 2
    - Name: Sahar Siddiqi
    - Email: ss5226@drexel.edu
- Group member 3
    - Name: NA
    - Email: NA
- Group member 4
    - Name: NA
    - Email: NA

### Additional submission comments
- Tutoring support received: Jacob Rosen jkr58@drexel.edu Ali Jazayeri aj629@drexel.edu
- Other (other): NA

# Assignment group 4: Machine learning and regression

## Module B _(62 pts)_ Exploring Classifier Transferability
### Data sets

__Data Set 1:__ There's a lot more than e-mail text out there, and malicious SPAM-like text-based deception is pervasive in other domains. One domain of particular interest to a few companies is called _opinion SPAM_, in which product and business reviews are spoofed, either to help or hurt a business.

An interesting data set for purposes of studying opinion SPAM was produced by a researcher named Myle Ott. In addition to collecting real reviews on hotels from the web and TripAdvisor, Ott et al. ran Amazon Mechanical Turk surveys to have real people write both positive and negative fake reviews of the hotels:

- http://myleott.com/op-spam.html

The goal with the data set was to train computers to detect which reviews were real vs. fake. These are provided in the following nested file structure:

- `./data/op_spam_v1.4/negative_polarity/deceptive_from_MTurk/fold[1-5]/*.txt`
- `./data/op_spam_v1.4/positive_polarity/deceptive_from_MTurk/fold[1-5]/*.txt`
- `./data/op_spam_v1.4/negative_polarity/truthful_from_Web/fold[1-5]/*.txt`
- `./data/op_spam_v1.4/positive_polarity/truthful_from_TripAdvisor/fold[1-5]/*.txt`
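
Because the directory names carry the metadata, each path can be unpacked into its polarity, veracity, and fold. The following is a minimal sketch that assumes the layout listed above. The helper name `parse_review_path` is hypothetical, and backslashes are normalized so that Windows-style paths parse the same way:

```python
def parse_review_path(path):
    """Split an Op SPAM file path into its metadata parts (hypothetical helper)."""
    # normalize Windows separators so both path styles split identically
    parts = path.replace("\\", "/").split("/")
    # the last four components are: polarity dir, veracity dir, fold dir, file name
    polarity_dir, veracity_dir, fold_dir, file_name = parts[-4:]
    return {
        "polarity": polarity_dir.split("_")[0],  # 'positive' or 'negative'
        "veracity": veracity_dir.split("_")[0],  # 'deceptive' or 'truthful'
        "fold": int(fold_dir.replace("fold", "")),
        "file": file_name,
    }

example = parse_review_path(
    "./data/op_spam_v1.4/negative_polarity/deceptive_from_MTurk/fold1/d_hilton_1.txt"
)
print(example)
```

Keeping polarity and veracity as separate fields makes it possible to build either set of labels from the same list of paths.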

__Data Set 2:__ The big picture of what we're trying to do here is to train an Opinion SPAM classifier on the _curated_ __Data Set 1__ and apply it to get an idea of how prevalent SPAM is in a completely different, _real-world_ data set from a hotel [booking website](https://www.booking.com). The data from this website live in the assignment's data directory, too:

- `./data/Hotel_Reviews.csv`
    
and were taken from [Kaggle](https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe).

__B1.__ _(2 pts)_ To load the Op SPAM data we'll be using `sklearn`, but as a requirement we'll need a full list of all the different review files in the data set. To compile a list of file paths, review the data's directory structure and use the `glob` module's `.glob(regex)` method to output a list, `all_files`, of all files matching the provided `regex` pattern (a shell-style wildcard pattern rather than a true regular expression).

When this is complete, print the first 5 files to show your code's function.

In [2]:
import numpy as np
import pandas as pd
import glob
from matplotlib import pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.pipeline import Pipeline

In [3]:
import glob

## a glob (wildcard) pattern that specifies all possible file locations
file_paths = "./data/op_spam_v1.4/*/*/*/*.txt"

## grab all file paths matching the pattern
all_files = glob.glob(file_paths)
x = all_files  ## short alias used in the later cells

## let's see how many files there are, and the first five file paths
print(len(x))
for filename in x[:5]:
    print(filename)

1600
./data/op_spam_v1.4\negative_polarity\deceptive_from_MTurk\fold1\d_hilton_1.txt
./data/op_spam_v1.4\negative_polarity\deceptive_from_MTurk\fold1\d_hilton_10.txt
./data/op_spam_v1.4\negative_polarity\deceptive_from_MTurk\fold1\d_hilton_11.txt
./data/op_spam_v1.4\negative_polarity\deceptive_from_MTurk\fold1\d_hilton_12.txt
./data/op_spam_v1.4\negative_polarity\deceptive_from_MTurk\fold1\d_hilton_13.txt


__B2.__ _(3 pts)_ Since this is supervised learning, we'll need labels, too. To construct them, use a regex match on `all_files`. In particular, since we're doing sentiment classification, use the string 'positive_polarity' in the file path to indicate a positive label (of value `1`) and otherwise use a negative label (value `0`). Store these values in an `np.array()` called `labels`.

When this is done, compute and print the size of positive and negative portions of the data set and discuss the imbalance you observe in the response box below. 

_Response._ 
There is no imbalance: the positive and negative portions are split 50/50, with 800 files each.

In [3]:
import re

## make an empty list for our class labels
y = []
## loop through all files
for filename in x:
    ## if the file path has the string "positive_polarity"
    ## then it's a positive review (y = 1)
    if re.search("positive_polarity", filename):
        y.append(1)
    else:
        y.append(0)
y = np.array(y)
print((y == 0).sum())
print((y == 1).sum())

800
800


__B3.__ _(3 pts)_ Now, `import` `sklearn`'s TDM-maker `CountVectorizer` from `sklearn.feature_extraction.text`. Initialize an instance of 
- `CountVectorizer(input = 'filename')` 

called `vectorizer`, and apply its `.fit()` and `.transform()` methods to `all_files` to produce a `TDM`.

When this is complete, exhibit its shape, and be sure to apply `TDM.toarray()` to convert the matrix to a dense representation.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

## read each file from disk and build the vocabulary
vectorizer = CountVectorizer(input = 'filename')
vectorizer.fit(x)
matrix = vectorizer.transform(x)
## print the shape explicitly, since only the last expression is echoed
print(matrix.shape)
matrix.toarray()

(1600, 9571)

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

__B4.__ _(2 pts)_ Now, use `train_test_split` to split the `TDM` and `labels` into $75\%$ training and $25\%$ test sets, importing the function from `sklearn.model_selection`. Also, be sure to use `random_state = 0`.

In [5]:
from sklearn.model_selection import train_test_split

## hold out 25% of the rows as a test set, with a fixed seed for reproducibility
matrix_train, matrix_test, y_train, y_test = train_test_split(
    matrix, y, test_size=0.25, random_state=0)
print(len(y_train))
print(len(y_test))

1200
400


__B5.__ _(5 pts)_ Now, `import`, initialize, and `.fit()` a binary classifier of your choosing (from __Chapter 8.__) with `sklearn` on the training data split. After training, apply and print `.predict()` and `.score()` to review the model's accuracy.

In [6]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(matrix_train, y_train)
print(clf.predict(matrix_test))
print(clf.score(matrix_test, y_test))

[1 0 1 1 1 0 1 0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 0 1 1 0 1 0 1 1 1 0 1 1 0 0 1
 1 0 0 1 1 0 1 1 0 0 1 0 1 1 0 1 0 1 0 1 0 0 1 1 1 1 1 1 0 1 0 1 0 0 0 0 1
 0 0 1 0 1 0 0 1 1 1 0 1 0 0 0 1 1 1 1 0 1 1 1 0 1 0 1 1 0 0 1 1 0 1 0 0 0
 0 0 1 1 0 0 1 0 1 1 1 0 0 0 0 0 1 1 0 0 0 0 1 0 1 1 0 1 1 1 1 0 1 0 1 1 1
 0 0 0 0 1 1 1 1 0 0 0 1 1 1 0 0 0 0 0 1 1 1 1 0 1 0 1 0 1 0 1 1 1 1 1 1 0
 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 1 0
 0 1 1 0 0 1 1 1 0 0 1 0 1 1 1 1 1 0 1 0 0 1 1 0 1 0 1 1 1 1 0 1 1 0 0 0 1
 1 0 0 1 0 0 0 1 0 1 0 0 1 1 0 1 0 1 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 0 1 0
 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 1 1 1 1 0 0 1 0 1 0 0 0 0 0 1 0 1 1 0 1 0 0
 1 0 1 1 0 0 1 0 1 1 0 1 1 1 0 0 0 1 0 0 1 0 1 1 1 1 0 0 1 0 1 1 1 1 0 0 0
 1 0 0 1 1 1 1 0 1 0 1 1 1 1 1 0 1 1 0 1 0 0 1 0 0 0 0 0 1 0]
0.955


__B6.__ _(5 pts)_ Now, determine precision, recall, and $F_1$ for the classifier's performance on the test set. Do these results provide any different information as compared to accuracy? If not, why do you think that is? Provide discussion in the markdown cell below.

_Response._
Precision is the fraction of predicted positives that are truly positive, TP/(TP+FP), and recall is the fraction of actual positives that are found, TP/(TP+FN); $F_1$ is their harmonic mean. Because the two classes are balanced (800 each) and precision and recall come out close to one another, these metrics tell much the same story as accuracy and don't really provide additional information here.
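
For reference, the three metrics can be computed directly from the confusion counts. The sketch below uses only the standard library and made-up toy labels (not the assignment's data), and the function name `precision_recall_f1` is hypothetical. It also shows that swapping the true and predicted arguments swaps precision and recall, which matters because `sklearn`'s metric functions take `y_true` first and `y_pred` second.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# toy labels, made up for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

print(precision_recall_f1(y_true, y_pred))  # correct argument order
print(precision_recall_f1(y_pred, y_true))  # swapped: precision and recall trade places
```

$F_1$ is symmetric in precision and recall, so it is the one value that survives an accidental argument swap unchanged.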

In [7]:
from sklearn.metrics import f1_score, precision_score, recall_score

## note: this fits a logistic regression model on the same training split
logistic_SPAM = LogisticRegression(solver='lbfgs')
logistic_SPAM.fit(matrix_train, y_train)

predictions = logistic_SPAM.predict(matrix_test)

## sklearn's metrics expect the true labels first, then the predictions
print("Precision, recall, and F1 were:")
print(precision_score(y_test, predictions))
print(recall_score(y_test, predictions))
print(f1_score(y_test, predictions))
print("")


Precision, recall, and F1 were:
0.9270833333333334
0.9128205128205128
0.9198966408268734





__B7.__ _(2 pts)_ Let's see how well our sentiment polarity classifier does on a different data set:

- `./data/Hotel_Reviews.csv`

which was hosted on Kaggle, but came from Booking.com:

- https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe

There's a decent description of the data there, where it seems a customer can comment with positive and negative reviews, in parallel. To get started, load these data in with pandas, print out the column names and identify (in the markdown cell, below) which have the positive and the negative reviews.

_Response._ 
`Positive_Review` and `Negative_Review` have the positive and negative reviews, respectively.

In [11]:
reviews = pd.read_csv('./data/Hotel_Reviews.csv.zip')
reviews.keys()


Index(['Hotel_Address', 'Additional_Number_of_Scoring', 'Review_Date',
       'Average_Score', 'Hotel_Name', 'Reviewer_Nationality',
       'Negative_Review', 'Review_Total_Negative_Word_Counts',
       'Total_Number_of_Reviews', 'Positive_Review',
       'Review_Total_Positive_Word_Counts',
       'Total_Number_of_Reviews_Reviewer_Has_Given', 'Reviewer_Score', 'Tags',
       'days_since_review', 'lat', 'lng'],
      dtype='object')

__B8.__ _(1 pt)_ Sometimes, a reviewer won't leave a positive or negative review in one of the categories. However, what's left in its place is not a conventional N/A value. Refer back to the data dictionary:

- https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe

and determine what we should match for to filter out any missing/null reviews.

_Response._
We should match for 'No Positive' and 'No Negative'. More specifically, we should keep only the values that don't equal 'No Positive' (in `Positive_Review`) or 'No Negative' (in `Negative_Review`).

__B9.__ _(7 pts)_ Use your observation from __B8.__ to create a single list with all of the non null reviews, as well as a parallel list of labels: $1$s (for positive review texts) and $0$s (for the negative review texts).

In [13]:
reviews = pd.read_csv('./data/Hotel_Reviews.csv.zip')

## keep only the two review-text columns
all_reviews = reviews[['Positive_Review', 'Negative_Review']]

## drop the placeholder strings that mark a missing review
new_pos = [r for r in all_reviews['Positive_Review'] if r != 'No Positive']
new_neg = [r for r in all_reviews['Negative_Review'] if r != 'No Negative']

## a single list of review texts, with a parallel list of labels
total_reviews = new_pos + new_neg
parallel = [1] * len(new_pos) + [0] * len(new_neg)

__B10.__ _(2 pts)_ How many positive and negatives were there? Does this data set have a class imbalance? Specifically, determine the percentage of reviews that were positive and comment on the presence of any imbalance in the markdown cell below.

_Response._  There are 479792 positive reviews and 387848 negative reviews. There is a small imbalance: about 55.3% of the reviews are positive and 44.7% are negative, so the positive share is about 5 percentage points above an even split.
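
The percentage follows from the two counts. Here is a small worked check using the counts reported above (the variable names are illustrative):

```python
n_pos = 479792  # non-null positive review texts
n_neg = 387848  # non-null negative review texts
n_total = n_pos + n_neg

pct_pos = 100 * n_pos / n_total
pct_neg = 100 * n_neg / n_total

print(f"positive: {pct_pos:.1f}%")
print(f"negative: {pct_neg:.1f}%")
print(f"gap: {pct_pos - pct_neg:.1f} percentage points")
```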

In [15]:
print(len(new_pos))
print(len(new_neg))
review_percentage = len(new_pos)/len(total_reviews)*100
review_percentage

479792
387848


55.29851090313955

__B11.__ _(5 pts)_ Use `CountVectorizer()` again, now to create a TDM for the new hotel data. Note: You must use the same initialized vectorizer from __B3.__, i.e., after it has run `.fit()`. So, here you must start from the `.transform()` step. If you re-initialize the vectorizer, you will wind up with a different vocabulary! Note: you also have to change the input format with `vectorizer.input`. It was equal to `'filename'`, which creates a TDM from a list of files. Now we want it to work off of a list of strings. This will work if we set:
- `vectorizer.input = 'content'`
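
The reason for reusing the fitted vectorizer can be seen in a simplified stand-in for the fit/transform split. The sketch below is not `CountVectorizer`'s implementation. It uses only the standard library, the helper names (`tokenize`, `fit_vocabulary`, `transform_counts`) are hypothetical, and the tokenizer (lowercased runs of two or more word characters) is an assumption chosen to resemble the default behavior. Words that were not seen during fitting are dropped at transform time, so the columns stay aligned with what the classifier was trained on.

```python
import re
from collections import Counter

def tokenize(text):
    # assumption: lowercase, keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(documents):
    """Map every token seen in the training documents to a column index."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: col for col, tok in enumerate(tokens)}

def transform_counts(documents, vocabulary):
    """Count tokens per document, ignoring tokens outside the vocabulary."""
    rows = []
    for doc in documents:
        counts = Counter(tok for tok in tokenize(doc) if tok in vocabulary)
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

# toy texts, made up for illustration
train_docs = ["the room was clean", "the staff was rude"]
new_docs = ["clean room and friendly staff"]

vocab = fit_vocabulary(train_docs)
print(vocab)
print(transform_counts(new_docs, vocab))
```

In this toy case the new text's words "and" and "friendly" have no column, so they contribute nothing to the row. Refitting on the new text would instead produce a different set of columns that the trained classifier could not interpret.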

In [14]:
vectorizer.input = 'content'
hotel_matrix = vectorizer.transform(total_reviews)
hotel_matrix.shape 



(867640, 9571)

__B12.__ _(5 pts)_ Apply your Classifier to this new, Booking.com TDM and compute accuracy, precision, recall, and $F_1$. What do you notice? Is there any more of a class imbalance now? Comment in the markdown cell below.

_Response._
All four metrics fall between roughly 0.78 and 0.85, noticeably lower than on the Opinion SPAM test split in __B6__. There is also somewhat more of a class imbalance now: about 55% of the Booking.com review texts are positive, whereas the Opinion SPAM data was split 50/50.

In [15]:
from sklearn.metrics import accuracy_score

## probability of the positive class (column 1) for every review
Hotel_Preds = logistic_SPAM.predict_proba(hotel_matrix)

## classify as positive at a threshold of 0.5
new_predictions = [1 if p >= 0.5 else 0 for p in Hotel_Preds[:,1]]

## sklearn's metrics expect the true labels first, then the predictions
accuracy = accuracy_score(parallel, new_predictions)

print("Precision, recall, F1, and accuracy were:")
print(precision_score(parallel, new_predictions))
print(recall_score(parallel, new_predictions))
print(f1_score(parallel, new_predictions))
print(accuracy)

Precision, recall, F1, and accuracy were:
0.7849057838892087
0.8496869476773269
0.8160126823614894
0.7881183440136462


__B13.__ _(2 pts)_  Compare these results with the results from __B6__. Is the performance better or worse in some areas (e.g., precision vs. recall) than others? Do you think our sentiment polarity classifier transferred well from the one Opinion SPAM dataset to this one from Booking.com? Place your discussion in the markdown box below.

_Response._
Overall, the performance is worse on the Booking.com data: precision, recall, and $F_1$ all dropped from roughly 0.91 to 0.93 in __B6__ to roughly 0.78 to 0.85 here, and accuracy was about 0.79. The drop is not uniform across precision and recall, so the classifier's errors are not evenly split between false positives and false negatives. The classifier was trained only on the Opinion SPAM reviews, so it has never seen Booking.com text. Although the sentiment classifier as a whole is a great tool for getting a general sense of polarity, the two sets of hotel reviews differ considerably in content and style, so it transferred only moderately well.

__B14.__ _(3 pts)_ Go back to the Opinion SPAM data and rebuild the _SPAM_ (no longer sentiment polarity) labels for that dataset's classification, in particular using the pattern `deceptive` inside of the file names to produce positive-valued (`1`) labels, and `0`s, otherwise.

In [17]:
import re
x = glob.glob(file_paths)
## make an empty list for our class labels
y = []
## loop through all files
for filename in x:
    ## if the file path has the word "deceptive"
    ## then it's spam (positive/y = 1)
    if re.search("deceptive", filename):
        y.append(1)
    else:
        y.append(0)
y = np.array(y)

print(len(y))

1600


__B15.__ _(2 pts)_ Now, train your classifier on _all_ of the Opinion SPAM labels. Note: you _must_ initialize a new classifier in order to classify _SPAM_, instead of polarity. However, we can just reuse our `TDM` from __B3__.

In [18]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(matrix, y)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

__B16.__ _(3 pts)_ Run the classifier you just trained on the new hotel reviews data set. Classify at a threshold of $0.5$ and report the percentage of the new data set that our classifier thinks is SPAM.
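
Thresholding turns each predicted probability into a hard label, and the share of `1`s is the estimated SPAM rate. The following is a minimal sketch with made-up probabilities (not the classifier's output), and `spam_rate` is a hypothetical helper:

```python
def spam_rate(probabilities, threshold=0.5):
    """Return the percentage of items whose SPAM probability meets the threshold."""
    flags = [1 if p >= threshold else 0 for p in probabilities]
    return 100 * sum(flags) / len(flags)

# toy probabilities, made up for illustration
toy_probs = [0.91, 0.12, 0.50, 0.49, 0.03, 0.77, 0.30, 0.08]
print(spam_rate(toy_probs))       # default threshold of 0.5
print(spam_rate(toy_probs, 0.9))  # a stricter threshold flags fewer items
```

The reported rate depends directly on the threshold, so the same classifier can yield quite different SPAM percentages.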

In [19]:
hotel_probabilities = []
for x in clf.predict_proba(hotel_matrix)[:,1]:
    if x >= 0.5:
        hotel_probabilities.append(1)
    else:
        hotel_probabilities.append(0)
sum(hotel_probabilities)/len(hotel_probabilities)

0.22310289982020193

__B17.__ _(2 pts)_ Interpret the output percentage from __B16__. Is this a big number? If correct, what would it mean for Booking.com? Do you think our classification was a reliable assessment? Why or why not? Place your discussion in the markdown cell, below.

_Response._ 
B16 indicates that about 22% of the Booking.com reviews are classified as SPAM. That is almost a quarter of all reviews, which would be a large share of the hundreds of thousands of reviews in the data set if it were correct. Given that volume, though, the exact figure seems somewhat arbitrary. Our classification is based only on the counts of particular words in the training data, so it is probably not a reliable assessment; it might be more accurate if we could classify based on a richer analysis of the words.

__B18.__ _(2 pts)_ Sort the Booking.com reviews by their prediction probabilities from high to low. Either use `sorted()` on a list of `(probability, review)` tuples, or create a pandas data frame with the two columns and use the `.sort_values()` method.
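
Either approach amounts to ordering the pairs by their first element. Here is a small sketch with made-up probabilities and placeholder review strings. Sorting with an explicit `key` orders by probability only, so ties are not broken by comparing the review text:

```python
# toy (probability, review) pairs, made up for illustration
pairs = [
    (0.65, "review A"),
    (0.02, "review B"),
    (0.99, "review C"),
    (0.65, "review D"),
]

# sort from most to least spammy, comparing on the probability alone
ranked = sorted(pairs, key=lambda pair: pair[0], reverse=True)

most_spammy = ranked[:2]    # highest probabilities
least_spammy = ranked[-2:]  # lowest probabilities

print(most_spammy)
print(least_spammy)
```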

In [38]:
## probability of the SPAM class (column 1) for every review
h = clf.predict_proba(hotel_matrix)
p_column = [probs[1] for probs in h]

## pair each probability with its review text
joined_list = list(zip(p_column, total_reviews))
print(joined_list[:5])

## sort from most to least spammy
sorted(joined_list, reverse = True)

[(0.6461448628057155, ' Only the park outside of the hotel was beautiful '), (0.7272031783945109, ' No real complaints the hotel was great great location surroundings rooms amenities and service Two recommendations however firstly the staff upon check in are very confusing regarding deposit payments and the staff offer you upon checkout to refund your original payment and you can make a new one Bit confusing Secondly the on site restaurant is a bit lacking very well thought out and excellent quality food for anyone of a vegetarian or vegan background but even a wrap or toasted sandwich option would be great Aside from those minor minor things fantastic spot and will be back when i return to Amsterdam '), (0.01893239591520274, ' Location was good and staff were ok It is cute hotel the breakfast range is nice Will go back '), (0.0037967191536536667, ' Great location in nice surroundings the bar and restaurant are nice and have a lovely outdoor area The building also has quite some charac

[(0.9999999999922693,
  ' I had seen the propertys availability on Booking and since I have had positive experiences in spa hotels I was looking forward to it Unlike most 4 hotels there was no real welcome Eventually a young woman asked me what I wanted I said to make a reservation for two nights Without consulting any kind of file she announced nous sommes complete I looked back at the page I had opened where the hotels availability was still plain to see At that moment I should have left already the hotel had displayed poor customer service Eternal optimist I booked it on line and cheerily announced that I had done so Rather than apologizing or offering me a coffee while they put the situation to rights they turned away from me and fumbled with their computers After a while a shabby room was ready No charm rather bare soulless room I thought never mind I ll go and relax in the spa only to hear that the hammam was out of order again with scarcely an apology so all that was available w

__B19.__ _(2 pts)_ We really don't have SPAM labels for the Booking.com data. So, inspect the first few most and least spammy reviews. What observations can you draw? Do you see any qualitative differences between the most and least spammy reviews? Do you think the classifier is working?  Place your discussion in the markdown cell, below.

_Response._ 
The first (most spammy) review is littered with spelling errors and complaints that don't really seem to go anywhere. The least spammy seems to be a well-thought-out review that is short and to the point. Based on this, it would seem that the spam filter is working nicely.

__B20.__ _(2 pts)_ What aspects of our classifier could we modify to potentially improve our SPAM classifier's performance? Specifically, discuss the potential effects on this experiment of selecting or transforming our features, or of optimizing any criteria for our predictions, in the markdown cell, below.

_Response._ 
We could perhaps add a feature with a counter for words with positive connotations and a feature for words with negative connotations. We could train the classifier on labeling reviews with a specific number of words with either type of connotation.
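
One of the prediction criteria that could be tuned is the decision threshold. The sketch below assumes a small labeled validation set is available (the Booking.com data has no SPAM labels, so the values here are made up) and picks the threshold with the best $F_1$. The helpers `f1_at` and `best_threshold` are hypothetical.

```python
def f1_at(y_true, probabilities, threshold):
    """F1 for the positive class when predicting 1 at or above the threshold."""
    y_pred = [1 if p >= threshold else 0 for p in probabilities]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(y_true, probabilities, candidates):
    """Return the candidate threshold with the highest F1 (first one wins ties)."""
    return max(candidates, key=lambda th: f1_at(y_true, probabilities, th))

# toy validation labels and probabilities, made up for illustration
val_labels = [1, 1, 1, 0, 0, 0, 0, 1]
val_probs = [0.9, 0.8, 0.4, 0.45, 0.2, 0.1, 0.6, 0.35]

candidates = [0.3, 0.5, 0.7]
choice = best_threshold(val_labels, val_probs, candidates)
print(choice, f1_at(val_labels, val_probs, choice))
```

Choosing the threshold on held-out labeled data, rather than fixing it at $0.5$, lets the trade-off between precision and recall be set deliberately.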

__B21.__ _(2 pts)_ How could we get an evaluation out of this experiment and _really_ know if the classifier is working? What would we have to do with the Booking.com data in order to get a strong sense of our performance on SPAM? Is there _any_ reasonable labeling of this data that we could come up with, or would we have to get some new data that we have more control over? Place your discussion in the markdown cell, below.

_Response._
We could find all of the true positives or true negatives and narrow the data set down even further. On this data, the classifier seems to label/find spam quite nicely. We could possibly extend this further and check emails for spam as well.
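
One way to obtain labels would be to hand-label a random sample of the reviews and score the classifier on that sample. This is an assumption about how such an evaluation might be set up, not part of the assignment. The sketch draws a reproducible sample of indices with the standard library, and `sample_for_labeling` is a hypothetical helper.

```python
import random

def sample_for_labeling(n_items, sample_size, seed=0):
    """Pick a reproducible random sample of item indices to label by hand."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_items), sample_size))

# 867640 is the number of Booking.com review texts in the hotel TDM
indices = sample_for_labeling(n_items=867640, sample_size=10)
print(indices)
```

Fixing the seed means the same reviews are drawn each time, so several annotators could label an identical sample.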