# Assignment group 4: Machine learning and regression

#### Kiana Montazeri

## Module B _(62 pts)_ Exploring Classifier Transferability
### Data sets

__Data Set 1:__ There's a lot more than e-mail text out there, and malicious SPAM-like text-based deception is pervasive in other domains. One domain of particular interest to a few companies is called _opinion SPAM_, in which product and business reviews are spoofed, either to help or hurt a business.

An interesting data set for purposes of studying opinion SPAM was produced by a researcher named Myle Ott. In addition to collecting real reviews on hotels from the web and TripAdvisor, Ott et al. ran Amazon Mechanical Turk surveys to have real people write both positive and negative fake reviews of the hotels:

- http://myleott.com/op-spam.html

The goal with the data set was to train computers to detect which reviews were real vs. fake. These are provided in the following nested file structure:

- `./data/op_spam_v1.4/negative_polarity/deceptive_from_MTurk/fold[1-5]/*.txt`
- `./data/op_spam_v1.4/positive_polarity/deceptive_from_MTurk/fold[1-5]/*.txt`
- `./data/op_spam_v1.4/negative_polarity/truthful_from_Web/fold[1-5]/*.txt`
- `./data/op_spam_v1.4/positive_polarity/truthful_from_TripAdvisor/fold[1-5]/*.txt`

__Data Set 2:__ The big picture of what we're trying to do here is train an Opinion SPAM classifier on the _curated_ __Data Set 1__, and apply it to get an idea of how prolific SPAM is on this completely different, _real-world_ hotel [booking website's](booking.com) data. The data from this website live in the assignment's data directory, too:

- `./data/Hotel_Reviews.csv`
    
and were taken from [Kaggle](https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe).

In [1]:
#Libraries in use:
from pprint import pprint
%matplotlib inline
from collections import Counter
import csv
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import re
from os.path import join
from glob import glob
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, precision_score, recall_score

__B1.__ _(2 pts)_ To load the Op SPAM data we'll be using `sklearn`, but as a requirement we'll need a full list of all the different review files in the data set. To compile a list of file paths, review the data's directory structure and use the `glob` module's `.glob(pattern)` function to output a list, `all_files`, of all files matching the provided wildcard `pattern`.

When this is complete, print the first 5 files to show your code's function.

In [2]:
file_paths = "./data/op_spam_v1.4/*/*/*/*.txt"
all_files = glob(file_paths)

In [3]:
type(all_files)

list

In [4]:
all_files[:5]

['./data/op_spam_v1.4/positive_polarity/deceptive_from_MTurk/fold2/d_talbott_9.txt',
 './data/op_spam_v1.4/positive_polarity/deceptive_from_MTurk/fold2/d_talbott_8.txt',
 './data/op_spam_v1.4/positive_polarity/deceptive_from_MTurk/fold2/d_affinia_20.txt',
 './data/op_spam_v1.4/positive_polarity/deceptive_from_MTurk/fold2/d_hardrock_18.txt',
 './data/op_spam_v1.4/positive_polarity/deceptive_from_MTurk/fold2/d_hardrock_19.txt']

In [5]:
len(all_files) #4*5(fold)*80(each folder)

1600
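
One detail worth keeping in mind for __B1__: `glob` patterns are shell-style wildcards rather than regular expressions, and each `*` matches within a single path component. The sketch below illustrates this on a throwaway directory tree; the folder and file names are placeholders that only mimic the layout of `op_spam_v1.4`, not the real data.

```python
import os
import tempfile
from glob import glob

# Build a tiny, hypothetical stand-in for ./data/op_spam_v1.4 in a temp directory
root = tempfile.mkdtemp()
layout = [
    ("negative_polarity", "deceptive_from_MTurk"),
    ("negative_polarity", "truthful_from_Web"),
    ("positive_polarity", "deceptive_from_MTurk"),
    ("positive_polarity", "truthful_from_TripAdvisor"),
]
for polarity, source in layout:
    for fold in ("fold1", "fold2"):
        folder = os.path.join(root, polarity, source, fold)
        os.makedirs(folder)
        with open(os.path.join(folder, "review_1.txt"), "w") as handle:
            handle.write("placeholder review text")

# One * per directory level: polarity / source / fold / file name
demo_files = sorted(glob(os.path.join(root, "*", "*", "*", "*.txt")))
print(len(demo_files))
print(demo_files[:2])
```

Sorting the result is a deliberate choice here: `glob` makes no promise about ordering, so sorting keeps the file list (and any labels derived from it) reproducible across machines.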

__B2.__ _(3 pts)_ Since this is supervised learning, we'll need labels, too. To construct them, use a regex match on `all_files`. In particular, since we're doing sentiment classification, utilize the word 'positive_polarity' in the file path to indicate a positive label (of value `1`) and otherwise use a negative label (value `0`). Store these values in a `np.array()` called `labels`.

When this is done, compute and print the size of positive and negative portions of the data set and discuss the imbalance you observe in the response box below. 

<font color=blue>The positive and negative reviews are balanced, with 800 of each, so there is no imbalance.</font>

In [6]:
polarity_labels = np.array([1 if "positive_polarity" in x else 0 for x in all_files]) 

In [7]:
Counter(polarity_labels)

Counter({1: 800, 0: 800})
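
The prompt asks for a regex match, while the cell above uses a plain substring test; for a fixed string the two give the same labels. A minimal sketch of the regex variant, assuming a handful of made-up paths that follow the data set's directory naming (the file names `a.txt` to `d.txt` are hypothetical):

```python
import re
from collections import Counter

# Hypothetical paths that follow the data set's directory naming
demo_paths = [
    "./data/op_spam_v1.4/positive_polarity/deceptive_from_MTurk/fold1/a.txt",
    "./data/op_spam_v1.4/positive_polarity/truthful_from_TripAdvisor/fold1/b.txt",
    "./data/op_spam_v1.4/negative_polarity/deceptive_from_MTurk/fold1/c.txt",
    "./data/op_spam_v1.4/negative_polarity/truthful_from_Web/fold1/d.txt",
]

# Anchor on the directory name so a stray match in a file name cannot flip a label
polarity_pattern = re.compile(r"/positive_polarity/")
demo_labels = [1 if polarity_pattern.search(path) else 0 for path in demo_paths]

counts = Counter(demo_labels)
positive_share = counts[1] / len(demo_labels)
print(counts, positive_share)
```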

__B3.__ _(3 pts)_ Now, `import` `sklearn`'s TDM-maker `CountVectorizer` from `sklearn.feature_extraction.text`. Initialize an instance of 
- `CountVectorizer(input = 'filename')` 

called `vectorizer`, and apply its `.fit()` and `.transform()` methods to `all_files` to produce a `TDM`.

When this is complete, exhibit its shape, and be sure to apply `TDM.toarray()` to convert the matrix to a dense representation.

In [8]:
## initialize the vectorizer
vectorizer = CountVectorizer(input = 'filename')
## tokenize and build a vocab that spans all files
## note, this establishes the TDM's tracked words and their indices
# create the TDM (it's sparse)
SPAM_TDM = vectorizer.fit_transform(all_files)
# let's check out a little to see if it worked!
print(SPAM_TDM.shape)
print(type(SPAM_TDM))
print(SPAM_TDM.toarray())

(1600, 9571)
<class 'scipy.sparse.csr.csr_matrix'>
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
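
To make the `input='filename'` mode easier to picture, here is a small sketch on two throwaway files. The review texts and file names are invented for illustration; only the `CountVectorizer` calls mirror the cell above.

```python
import os
import tempfile
from sklearn.feature_extraction.text import CountVectorizer

# Two throwaway files with made-up review text
folder = tempfile.mkdtemp()
texts = ["The room was clean", "The room was loud and the bed was small"]
demo_files = []
for i, text in enumerate(texts):
    path = os.path.join(folder, f"review_{i}.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    demo_files.append(path)

# With input='filename' the vectorizer opens and reads each path itself
demo_vectorizer = CountVectorizer(input="filename")
demo_tdm = demo_vectorizer.fit_transform(demo_files)  # sparse, one row per file

print(demo_tdm.shape)
print(sorted(demo_vectorizer.vocabulary_))
print(demo_tdm.toarray())
```

Each row is a document and each column a vocabulary word, so the row sums equal the number of counted tokens in each file.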


__B4.__ _(2 pts)_ Now, use `train_test_split` to split the `TDM` and `labels` into $75\%$ training and $25\%$ test sets, importing the function from `sklearn.model_selection`. Also, be sure to use `random_state = 0`.

In [9]:
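# Split the file paths rather than the TDM; the Pipeline in B5 vectorizes them itself.
# Note: random_state=42 is used here, not the 0 requested above.
x_train, x_test, y_train, y_test = train_test_split(all_files, polarity_labels, test_size=0.25, random_state=42)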

__B5.__ _(5 pts)_ Now, `import`, initialize, and `.fit()` a binary classifier of your choosing (from __Chapter 8.__) with `sklearn` on the training data split. After training, apply and print `.predict()` and `.score()` to review the model's accuracy.

In [10]:
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

logistic_polarity = Pipeline([('vect', vectorizer),
                          ('clf', LogisticRegression(solver='lbfgs'))])
logistic_polarity.fit(x_train, y_train)

predictions = logistic_polarity.predict(x_test)
score = accuracy_score(y_test, predictions)
print("Accuracy is:")
print(score)

Accuracy is:
0.935
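
The same split-then-pipeline pattern can be tried end to end on a toy corpus. This is only a sketch under stated assumptions: the eight review texts and their labels are invented, and `stratify` is added so that both classes appear in each split of such a small sample.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Made-up miniature corpus: 1 = positive, 0 = negative
demo_texts = [
    "great room and friendly staff", "lovely view great breakfast",
    "friendly service lovely stay", "great location great value",
    "dirty room and rude staff", "terrible noise dirty bathroom",
    "rude service terrible stay", "awful smell awful value",
]
demo_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw texts first; stratify keeps both classes in each split
train_x, test_x, train_y, test_y = train_test_split(
    demo_texts, demo_labels, test_size=0.25, random_state=0, stratify=demo_labels
)

demo_model = Pipeline([
    ("vect", CountVectorizer()),      # vocabulary is learned from train_x only
    ("clf", LogisticRegression()),
])
demo_model.fit(train_x, train_y)

demo_predictions = demo_model.predict(test_x)
demo_accuracy = demo_model.score(test_x, test_y)  # mean accuracy on the test split
print(demo_predictions, demo_accuracy)
```

Putting the vectorizer inside the pipeline means the vocabulary is built from the training split alone, so no information from the test texts leaks into the features.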


__B6.__ _(5 pts)_ Now, determine precision, recall, and $F_1$ for the classifier's performance on the test set. Do these results provide any different information as compared to accuracy? If not, why do you think? Provide discussion in the markdown cell below.

<font color=blue>Precision and recall both measure model quality in terms of true and false positives and negatives, and $F_1$ is the harmonic mean of the two. Accuracy, by contrast, is simply the fraction of correct predictions. Which measure matters most depends on the outcome we care about. For example, if we want to catch fake negative reviews (ones that wrongfully damage a business), recall is more informative than precision. $F_1$ is a good summary measure because it combines recall and precision.<br><br>
Precision is the measure to watch when the cost of a false positive is high. Here, a false positive means that a non-destructive review has been flagged as destructive, so a business could lose valuable legitimate reviews if precision is low. On the other hand, in fake-review detection, if a fake review is accepted as genuine (a false negative), the consequences can be severe and can destroy a reputation.
</font>
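
The definitions above can be written out directly from the counts of true positives, false positives, and false negatives. The sketch below uses made-up labels and a hypothetical helper, `precision_recall_f1`; it also shows why argument order matters, since scikit-learn's metric functions expect `(y_true, y_pred)` and passing them the other way round exchanges precision and recall.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Made-up labels: 3 true positives, 2 false positives, 1 false negative
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
swapped_precision, swapped_recall, swapped_f1 = precision_recall_f1(y_pred, y_true)

print(precision, recall, f1)
print(swapped_precision, swapped_recall, swapped_f1)
```

Because $F_1$ is symmetric in precision and recall, it is unaffected by the swap, which is why a reversed argument order can go unnoticed if only $F_1$ is checked.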

In [11]:
print("Precision, recall, and F1 were:")
# scikit-learn metrics take (y_true, y_pred), in that order
print(precision_score(y_test, predictions))
print(recall_score(y_test, predictions))
print(f1_score(y_test, predictions))
print("")

Precision, recall, and F1 were:
0.9545454545454546
0.9174757281553398
0.9356435643564357



__B7.__ _(2 pts)_ Let's see how well our sentiment polarity classifier does on a different data set:

- `./data/Hotel_Reviews.csv`

which was hosted on a Kaggle competition, but came from Booking.com:

- https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe

There's a decent description of the data there, where it seems a customer can comment with positive and negative reviews, in parallel. To get started, load these data in with pandas, print out the column names and identify (in the markdown cell, below) which have the positive and the negative reviews.

<font color=blue>Negative_Review column contains the negative ones as opposed to Positive_Review column.</font>

In [12]:
df_Hotel_review = pd.read_csv("./data/Hotel_Reviews.csv")

In [13]:
df_Hotel_review.columns

Index(['Hotel_Address', 'Additional_Number_of_Scoring', 'Review_Date',
       'Average_Score', 'Hotel_Name', 'Reviewer_Nationality',
       'Negative_Review', 'Review_Total_Negative_Word_Counts',
       'Total_Number_of_Reviews', 'Positive_Review',
       'Review_Total_Positive_Word_Counts',
       'Total_Number_of_Reviews_Reviewer_Has_Given', 'Reviewer_Score', 'Tags',
       'days_since_review', 'lat', 'lng'],
      dtype='object')

In [14]:
df_Hotel_review.shape

(515738, 17)

__B8.__ _(1 pts)_ Sometimes, a reviewer won't leave a positive or negative review in one of the categories. However, what's left is not a conventional N/A or anything. Refer back to the data dictionary:

- https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe

and determine what we should match for to filter out any missing/null reviews.

<font color=blue>Missing reviews are recorded as the literal strings "No Negative" and "No Positive". One row also contains "Please see above".</font>

__B9.__ _(7 pts)_ Use your observation from __B8.__ to create a single list with all of the non null reviews, as well as a parallel list of labels: $1$s (for positive review texts) and $0$s (for the negative review texts).

In [15]:
def get_labels_and_reviews(df):
    # Placeholder strings the site records when a reviewer left a field blank
    missing = {"No Negative", "No Positive"}
    # unique() drops duplicate texts within each column; build new lists
    # instead of popping while iterating, which can skip elements
    neg = [rev for rev in df['Negative_Review'].unique() if rev not in missing]
    pos = [rev for rev in df['Positive_Review'].unique() if rev not in missing]
    all_reviews = neg + pos
    labels = [0] * len(neg) + [1] * len(pos)
    return all_reviews, labels

In [16]:
all_reviews, df_labels = get_labels_and_reviews(df_Hotel_review)

In [17]:
len(all_reviews), len(df_labels)

(742610, 742610)

In [18]:
all_reviews[30]

' The rooms were cold Although nice the room decor was basic and unwelcoming It could use some design help an area rug or two '

In [19]:
all_reviews[-30]

' Rooms are very clean and comfortabile'

In [20]:
df_labels[30], df_labels[-30]

(0, 1)
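
Filtering placeholder strings is safest when a new list is built, because removing items from a list while iterating over it can skip elements. A small sketch of both behaviours, using invented rows that only mimic the `Negative_Review` and `Positive_Review` columns:

```python
# Pitfall: popping during iteration skips the element that slides into the gap
items = ["x", "x", "a"]
for index, value in enumerate(items):
    if value == "x":
        items.pop(index)
print(items)  # one "x" survives

# Safer pattern: build fresh, parallel lists of texts and labels
MISSING = {"No Negative", "No Positive"}
demo_rows = [  # made-up (negative, positive) review pairs
    ("No Negative", " Great location"),
    (" Room was cold", "No Positive"),
    (" Noisy street", " Friendly staff"),
]

demo_reviews, demo_labels = [], []
for negative_text, positive_text in demo_rows:
    if negative_text not in MISSING:
        demo_reviews.append(negative_text)
        demo_labels.append(0)
    if positive_text not in MISSING:
        demo_reviews.append(positive_text)
        demo_labels.append(1)

print(demo_reviews)
print(demo_labels)
```

Unlike the notebook's function, this sketch keeps duplicates and interleaves negative and positive texts; it is meant only to show the filtering pattern.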

__B10.__ _(2 pts)_ How many positive and negatives were there? Does this data set have a class imbalance? Specifically, determine the percentage of reviews that were positive and comment on the presence of any imbalance in the markdown cell below.

In [21]:
counter_pos_neg = Counter(df_labels)
print(counter_pos_neg)

Counter({1: 412600, 0: 330010})


In [22]:
pos_percentage = 100*counter_pos_neg[1]/(counter_pos_neg[1]+counter_pos_neg[0])
pos_percentage

55.56079234052868

In [23]:
neg_percentage = 100*counter_pos_neg[0]/(counter_pos_neg[1]+counter_pos_neg[0])
neg_percentage

44.43920765947132

<font color=blue>There is a mild imbalance in the data: about 55.6% of the reviews are positive and about 44.4% are negative.</font>
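
One way to put the imbalance in context is to compare it with a trivial baseline that always predicts the majority class. The sketch below reuses the two counts printed by `Counter(df_labels)`; the variable names are illustrative only.

```python
# Label counts reported by Counter(df_labels) in the notebook
n_positive, n_negative = 412600, 330010
n_total = n_positive + n_negative

positive_pct = 100 * n_positive / n_total
negative_pct = 100 * n_negative / n_total

# Accuracy of a trivial classifier that always predicts the majority class
majority_baseline = max(n_positive, n_negative) / n_total

print(round(positive_pct, 2), round(negative_pct, 2), round(majority_baseline, 4))
```

Any classifier evaluated on this data should be judged against that baseline rather than against 50%.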

__B11.__ _(5 pts)_ Use `CountVectorizer()` again&mdash;now to create a TDM for the new hotel data. Note: You must use the same initialized vectorizer from __B3.__, i.e., after it has run `.fit()`. So, here you must start from the `.transform()` step. If you re-initialize the vectorizer, you will wind up with a different vocabulary! Note: you also have to change the input format with `vectorizer.input`. It was equal to `'filename'`, which creates a TDM from a list of files. Now we want it to work off of a list of strings. This will work if we set:
- `vectorizer.input = 'content'`

In [24]:
vectorizer.input = 'content'
Hotel_TDM = vectorizer.transform(all_reviews)
# let's check out a little to see if it worked!
print(Hotel_TDM.shape)
print(type(Hotel_TDM))
print(Hotel_TDM.toarray())

(742610, 8358)
<class 'scipy.sparse.csr.csr_matrix'>
[[1 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
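
A short sketch of why the fitted vectorizer must be reused: `.transform()` keeps the columns learned during `.fit()` and silently ignores words it has never seen. The documents below are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["clean room friendly staff", "dirty room rude staff"]
new_docs = ["friendly staff but tiny balcony", "room"]

demo_vectorizer = CountVectorizer()   # default input='content': a list of strings
demo_vectorizer.fit(train_docs)       # the vocabulary is fixed at this point
new_tdm = demo_vectorizer.transform(new_docs)

print(sorted(demo_vectorizer.vocabulary_))  # words learned from train_docs
print(new_tdm.toarray())                    # 'but', 'tiny', 'balcony' are ignored
```

The column count of `new_tdm` therefore matches the training vocabulary, which is what a classifier trained on the original TDM expects.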


__B12.__ _(5 pts)_ Apply your Classifier to this new, Booking.com TDM and compute accuracy, precision, recall, and $F_1$. What do you notice? Is there any more of a class imbalance now? Comment in the markdown cell below.

<font color=blue>Now the differences between the performance measures are more pronounced. Precision and recall provide an effective means of evaluating model performance in the presence of class imbalance, whereas accuracy alone can sometimes be misleading.</font>

In [25]:
predictionsHotel = logistic_polarity.predict(all_reviews)
scoreHotel = accuracy_score(df_labels, predictionsHotel)
print("Accuracy is:")
print(scoreHotel)

Accuracy is:
0.7900297599008901


In [26]:
print("Precision, recall, and F1 were:")
# scikit-learn metrics take (y_true, y_pred), in that order
print(precision_score(df_labels, predictionsHotel))
print(recall_score(df_labels, predictionsHotel))
print(f1_score(df_labels, predictionsHotel))
print("")

Precision, recall, and F1 were:
0.7445212917976565
0.9470722249151721
0.8336700624033281



__B13.__ _(2 pts)_ Compare these results with the results from __B6__. Is the performance better or worse in some areas (e.g., precision vs. recall) than others? Do you think our sentiment polarity classifier transferred well from the one Opinion SPAM dataset to this one from Booking.com? Place your discussion in the markdown box below.

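<font color=blue>Overall performance is lower than in __B6__: accuracy falls from about 0.94 to 0.79 and $F_1$ from about 0.94 to 0.83. Precision and recall, which were within four points of each other in __B6__, now diverge: one stays near 0.95 while the other drops to about 0.74, so the classifier's errors on the Booking.com data fall mostly on one class. Which of the two matters more depends on the application. If all we need is a high-level picture of the sentiment expressed in the reviews, the result is still usable, so the model transferred reasonably well to the hotel data, though not without loss.</font>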

__B14.__ _(3 pts)_ Go back to the Opinion SPAM data and rebuild the _SPAM_ (no longer sentiment polarity) labels for that dataset's classification, in particular using the pattern `deceptive` inside of the file names to produce positive-valued (`1`) labels, and `0`s, otherwise.

In [27]:
spam_labels = np.array([1 if "deceptive" in x else 0 for x in all_files]) 

In [28]:
Counter(spam_labels)

Counter({1: 800, 0: 800})

__B15.__ _(2 pts)_ Now, train your classifier on _all_ of the Opinion SPAM labels. Note: you _must_ initialize a new classifier in order to classify _SPAM_, instead of polarity. However, we can just reuse our `TDM` from __B3__.

In [31]:
logistic_SPAM = Pipeline([('vect', vectorizer),
                          ('clf', LogisticRegression(solver='lbfgs', max_iter=200))])

In [32]:
vectorizer.input = 'filename'
logistic_SPAM.fit(all_files, spam_labels)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='filename',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        stri...enalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False))])

__B16.__ _(3 pts)_ Run the classifier you just trained on the new hotel reviews data set. Make classification at a threshold of $0.5$ and report the percentage of the new data set that our classifier thinks is SPAM. 

In [33]:
vectorizer.input = 'content'
predictionsHotelSPAM = logistic_SPAM.predict(all_reviews) #default threshold is 0.5

In [34]:
CounterSpam = Counter(predictionsHotelSPAM)

In [37]:
CounterSpam

Counter({0: 637998, 1: 104612})

In [35]:
SPAM_Percentage = CounterSpam[1]/(CounterSpam[1] + CounterSpam[0])

In [36]:
SPAM_Percentage

0.14087071275635932
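
Note that the value above is a fraction, about 14.1% when expressed as a percentage. The sketch below shows how the flagged share depends on the decision threshold, using made-up probabilities in place of the positive-class column of `predict_proba`; `share_flagged` is a hypothetical helper, not part of scikit-learn.

```python
# Made-up spam probabilities, standing in for predict_proba(...)[:, 1]
demo_probabilities = [0.97, 0.62, 0.51, 0.49, 0.30, 0.12, 0.08, 0.03]

def share_flagged(probabilities, threshold=0.5):
    """Fraction of items whose probability exceeds the threshold."""
    flagged = [p > threshold for p in probabilities]
    return sum(flagged) / len(flagged)

share_at_half = share_flagged(demo_probabilities)          # 3 of 8 items
share_at_ninety = share_flagged(demo_probabilities, 0.9)   # 1 of 8 items

# The notebook's own prediction counts, expressed as a percentage
notebook_pct = 100 * 104612 / (104612 + 637998)

print(share_at_half, share_at_ninety, round(notebook_pct, 2))
```

Raising the threshold trades recall for precision, so the reported spam share is as much a property of the chosen cutoff as of the data.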

__B17.__ _(2 pts)_ Interpret the output percentage from __B16__. Is this a big number? If correct, what would it mean for Booking.com? Do you think our classification was a reliable assessment? Why or why not? Place your discussion in the markdown cell, below.

<font color=blue>Since label `1` marks spam, the share of `1`s among all predictions is the estimated proportion of spam reviews. The classifier flags about 14 percent of the Booking.com reviews as fake, which I believe is quite high. If this number were correct, it would mean that many reviews are not sincere and that the Booking.com website should probably not be trusted.</font>

__B18.__ _(2 pts)_ Sort the Booking.com reviews by their prediction probabilities from high to low. Either use `sorted()` on a list of `(probability, review)` tuples, or create a pandas data frame with the two columns and use the `.sort_values()` method.

In [71]:
prob_predictionsHotelSPAM = logistic_SPAM.predict_proba(all_reviews)

In [72]:
def make_review_prob_sorted_df(predictionProb):
    prob_df = pd.DataFrame(predictionProb)
    prob_df.drop(0, axis=1, inplace=True)
    prob_df['reviews'] = np.array(all_reviews)
    df_sorted = prob_df.sort_values(1, ascending=False)
    return df_sorted

In [73]:
df_sorted = make_review_prob_sorted_df(prob_predictionsHotelSPAM)

In [74]:
df_sorted.head()

| | 1 | reviews |
|---|---|---|
| 204742 | 0.999996 | My husband and I stayed in the penthouse The ... |
| 387872 | 0.999759 | I would never stay in this hotel i stay here ... |
| 157820 | 0.999757 | Warning for international guests Keep your pa... |
| 578087 | 0.99966 | Hotel was just amazing myself and my boyfrien... |
| 129260 | 0.999624 | The bathroom was tired the bath panel had see... |
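
The other route mentioned in the prompt, `sorted()` on `(probability, review)` tuples, can be sketched as follows. The probabilities and review texts are invented for illustration.

```python
# Made-up (probability, review) pairs
demo_scored = [
    (0.12, "Clean room, short walk to the station"),
    (0.98, "Best hotel ever, my husband and I will return every year"),
    (0.55, "Staff were fine, breakfast was average"),
]

# Sort on the probability only, highest first
ranked = sorted(demo_scored, key=lambda pair: pair[0], reverse=True)

most_spammy = ranked[0]
least_spammy = ranked[-1]
print(most_spammy)
print(least_spammy)
```

Passing an explicit `key` avoids falling back to comparing the review texts when two probabilities tie.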


__B19.__ _(2 pts)_ We really don't have SPAM labels for the Booking.com data. So, inspect the first few most and least spammy reviews. What observations can you draw? Do you see any qualitative differences between the most and least spammy reviews? Do you think the classifier is working?  Place your discussion in the markdown cell, below.

<font color=blue>In my opinion, the reviews flagged as probably fake are more exaggerated than the probably genuine ones. They dwell on an unusual situation that happened to the reviewer during the stay. The sincere ones, on the other hand, are more focused on the quality of the rooms, service, etc.</font>

In [80]:
print("Most spammy reviews are: \n")
for x in df_sorted['reviews'][:3]:
    print(x)
    print('\n')

Most spammy reviews are: 

 My husband and I stayed in the penthouse The room had a strong sewage gas Oder Someone should have known right from the start and should have NOT have hat room booked The room was dusty I m allergic to dust the headboard had mold and the refrigerator was not getting cold I took pictures My husband called the receptionist right away and they said there was nothing that they could do until the morning My husband wanted to talk to the manager but the manager is not available on the weekends which is ridiculous because a manager at any hotel should be available 24 7 The receptionist said there was nothing they could do until the morning They said they would bring a plumber to check it out That night my husband and I went out and bought candles to get rid of the smell It didn t really help It made me so sick The strong Oder I m actually sick now nauseous fatigued and a high fewer Well my husband called the next morning at 8 30am to see when the plumber was coming

In [81]:
print("Least spammy reviews are: \n")
for x in df_sorted['reviews'][-3:]:
    print(x)
    print('\n')

Least spammy reviews are: 

 Bedroom needs painting there was a gap between bedroom carpet and bathroom door the bin in room was as all bent to different shapes stain on the table and seat In all the management need to invest heavily in the property Bedroom needs painting there was a gap between bedroom carpet and bathroom door the bin in room was as all bent to different shapes stain on the table and seat In all the management need to invest heavily in the property Bedroom needs painting there was a gap between bedroom carpet and bathroom door the bin in room was as all bent to different shapes stain on the table and seat In all the management need to invest heavily in the property Bedroom needs painting there was a gap between bedroom carpet and bathroom door the bin in room was as all bent to different shapes stain on the table and seat In all the management need to invest heavily in the property Bedroom needs painting there was a gap between bedroom carpet and bathroom door the bin

__B20.__ (2 pts) What aspects of our classifier could we modify to potentially improve our SPAM classifier's performance? Specifically, discuss the potential effects to this experiment in selecting or transforming our features, or optimizing any criteria for our predictions in the markdown cell, below.

<font color=blue>Right now, our model is based on the number of occurrences of each word in each review. Instead, we could perform a sentiment analysis on the text to extract more specifically oriented features from each review, which might make the predictions more accurate. Another option is to take groups of adjacent words and use each group as a single feature, so that more meaningful phrases are captured.</font>
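
The "groups of adjacent words" idea corresponds to word n-grams. In scikit-learn this is controlled by the `ngram_range` parameter of `CountVectorizer`, which appears as `(1, 1)` in the pipeline printout above. The pure-Python sketch below only illustrates what a bigram feature looks like; `word_ngrams` is a hypothetical helper and the review text is invented.

```python
import re
from collections import Counter

def word_ngrams(text, n=2):
    """Lowercase the text, split it into word tokens, and join each run of n tokens."""
    tokens = re.findall(r"\w+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

demo_review = "The room was not clean and the staff was not friendly"
bigram_counts = Counter(word_ngrams(demo_review))

print(bigram_counts.most_common(3))
```

Single-word counts cannot distinguish "not clean" from "clean", whereas the bigram keeps the negation attached to the word it modifies.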

__B21.__ _(2 pts)_ How could we get an evaluation out of this experiment and _really_ know if the classifier is working? What would we have to do with the Booking.com data in order to get a strong sense of our performance on SPAM? Is there _any_ reasonable labeling of this data that we could come up with, or would we have to get some new data that we have more control over? Place your discussion in the markdown cell, below.

<font color=blue>Maybe we can use the `Reviewer_Score` column to come up with a new labeling scheme. Under this scheme, a reviewer with a very poor score would be considered more likely to have left a fake review.</font>