# Feature Engineering Homework 
***
**Name**: $Ahmed Al Hasani$ 

**Kaggle Username**: $<mandazi11>$
***

This assignment is due on Moodle by **5pm on Friday February 23rd**. Additionally, you must make at least one submission to the **Kaggle** competition before it closes at **4:59pm on Friday February 23rd**. Submit only this Jupyter notebook to Moodle. Do not compress it using tar, rar, zip, etc. Your solutions to analysis questions should be done in Markdown directly below the associated question.  Remember that you are encouraged to discuss the problems with your instructors and classmates, but **you must write all code and solutions on your own**.  For a refresher on the course **Collaboration Policy** click [here](https://github.com/chrisketelsen/CSCI5622-Machine-Learning/blob/master/resources/syllabus.md#collaboration-policy)



## Overview 
***

When people are discussing popular media, there’s a concept of spoilers. That is, critical information about the plot of a TV show, book, or movie that “ruins” the experience for people who haven’t read / seen it yet.

The goal of this assignment is to do text classification on forum posts from the website [tvtropes.org](http://tvtropes.org/), to predict whether a post is a spoiler or not. We'll be using the logistic regression classifier provided by sklearn.

Unlike previous assignments, the code provided with this assignment has all of the functionality required. Your job is to make the functionality better by improving the features the code uses for text classification.

**NOTE**: Because the goal of this assignment is feature engineering, not classification algorithms, you may not change the underlying algorithm or its parameters.

This assignment is structured in a way that approximates how classification works in the real world: Features are typically underspecified (or not specified at all). You, the data digger, have to articulate the features you need. You then compete against others to provide useful predictions.

It may seem straightforward, but do not start this at the last minute. There are often many things that go wrong in testing out features, and you'll want to make sure your features work well once you've found them.


## Kaggle In-Class Competition 
***

In addition to turning in this notebook on Moodle, you'll also need to submit your predictions on Kaggle, an online tournament site for machine learning competitions. The competition page can be found here:  

[https://www.kaggle.com/c/feature-engineering-csci-5622-spring-2018](https://www.kaggle.com/c/feature-engineering-csci-5622-spring-2018)

Additionally, a private invite link for the competition has been posted to Piazza. 

The starter code below has a `model_predict` method which produces a two column CSV file that is correctly formatted for Kaggle (predictions.csv). It should have the example Id as the first column and the prediction (`True` or `False`) as the second column. If you change this format your submissions will be scored as zero accuracy on Kaggle. 
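
As a minimal sketch of that layout, the snippet below writes three made-up predictions into an in-memory buffer instead of a file. The `Id` and `spoiler` column names follow the starter code's `model_predict`; `write_submission` is a hypothetical helper introduced only for illustration.

```python
import csv
import io

def write_submission(predictions, handle):
    """Write one row per prediction: a running Id and True/False."""
    writer = csv.DictWriter(handle, fieldnames=["Id", "spoiler"], lineterminator="\n")
    writer.writeheader()
    for example_id, is_spoiler in enumerate(predictions):
        writer.writerow({"Id": example_id, "spoiler": bool(is_spoiler)})

# Made-up predictions, written to a string buffer so nothing touches disk.
buffer = io.StringIO()
write_submission([True, False, True], buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

The header row and the literal `True`/`False` strings are the parts of the format that must stay unchanged.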

**Note**: You may only submit **THREE** predictions to Kaggle per day.  Instead of using the public leaderboard as your sole evaluation process, it is highly recommended that you perform local evaluation using a validation set or cross-validation. 

In [30]:
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.metrics import accuracy_score
from csv import DictReader, DictWriter
from sklearn.model_selection import KFold
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression 
import nltk
import math
import string
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import word_tokenize 
from nltk import ngrams
from sklearn.base import BaseEstimator, TransformerMixin
from scipy.sparse import csr_matrix
from nltk.stem.porter import *
from nltk.tag import pos_tag
from sklearn import preprocessing
%matplotlib inline 

### [25 points] Problem 1: Feature Engineering 
***

The `FeatEngr` class is where the magic happens.  In its current form it will read in the training data and vectorize it using simple Bag-of-Words.  It then trains a model and makes predictions.  

25 points of your grade will be generated from your performance on the classification competition on Kaggle. The performance will be evaluated on accuracy on the held-out test set. Half of the test set is used to evaluate accuracy on the public leaderboard.  The other half of the test set is used to evaluate accuracy on the private leaderboard (which you will not be able to see until the close of the competition). 

You should be able to significantly improve on the baseline system (i.e. the predictions made by the starter code we've provided) as reported by the Kaggle system.  Additionally, the top **THREE** students from the **PRIVATE** leaderboard at the end of the contest will receive 5 extra credit points towards their Problem 1 score.


In [59]:
#Tokenizing and Stemming
class TDIDF_Stemming():
    def __init__(self):
        self.stemmer = PorterStemmer()
    
    def stem_tokens(self, tokens, stemmer):
        stemmed = []
        
        for item in tokens:
            stemmed.append(stemmer.stem(item))
        
        return stemmed

    def __call__(self, examples):
        tokens = word_tokenize(examples)
        tokens = [i for i in tokens if i not in string.punctuation and i != "michael" ]
        stemmed = self.stem_tokens(tokens, self.stemmer)
        return stemmed

#Count the length of a sentence 
class sentence_length_transformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, examples, y=None):
        return self

    def transform(self, examples):
        X = np.zeros((len(examples), 1))
        
        for ii, x in enumerate(examples):
            X[ii,:] = np.array([len(x)])

        return csr_matrix(X)

#Part of Speech Tokenizer
class POS_tokenizer(object):
    def __call__(self, text):
        words = word_tokenize(text)
        words_and_pos_tags = pos_tag(words)
        # Drop singular nouns (NN), prepositions (IN) and possessive pronouns (PRP$)
        excluded_tags = {"NN", "IN", "PRP$"}
        return [word for word, tag in words_and_pos_tags if tag not in excluded_tags]

def find_word_spoiler(data, word, information):
        text = data [information]
        tags = data ["spoiler"]
        count_true = 0 
        count_false = 0
        count = 0

        key_words = [word]
        print(key_words)
        for sentence, tag in zip(text, tags):
            if any(word in sentence.lower() for word in key_words):
                if str(tag)=='True':
                    count_true+=1
                elif str(tag)=='False':
                    count_false+=1
                count+=1

        print("Total Sentences: " + str(count))
        print("Total Spoilers: " + str(count_true))
        print("Total Non-Spoilers: " + str(count_false))

In [54]:
'''
TF-IDF and CountVectorizer parameters
binary = False (default): use term counts (bag-of-words) rather than 0/1 presence
min_df = minimum document frequency; ignore terms that occur in fewer documents than this
max_df = maximum document frequency; ignore terms that occur in more documents than this
strip_accents = 'ascii' / 'unicode' / None
sublinear_tf = sublinear tf scaling; replace tf with 1 + log(tf)
ngram_range = range of n-gram sizes to extract; I used (1, 2), i.e. unigrams and bigrams
norm = normalization applied to each row; default is 'l2'
'''

'''
Function Transformer
validate: checks the array X beforehand.  True or False
accept_sparse: True or False
'''

class FeatEngr:
    def __init__(self):
        self.Y = ['True', 'False']
        self.vectorizer = FeatureUnion([(
            "sentence_tfidfVect", Pipeline([('sentece', FunctionTransformer(lambda x:x[0], validate = False)),
                ('tfid', TfidfVectorizer(ngram_range=(1,2), lowercase=True, stop_words='english'))])),
            ("trope_countVect", Pipeline([('trope', FunctionTransformer(lambda x:x[1], validate = False)), 
                ('countvectorizer', CountVectorizer())]))
            ])
        
        self.data = pd.read_csv("../data/spoilers/train.csv")

    def build_train_features(self, examples):
        """
        Method to take in training text features and do further feature engineering 
        Most of the work in this homework will go here, or in similar functions  
        :param examples: currently just a list of forum posts  
        """
        return self.vectorizer.fit_transform(examples)

    def get_test_features(self, examples):
        """
        Method to take in test text features and transform the same way as train features 
        :param examples: currently just a list of forum posts  
        """
        return self.vectorizer.transform(examples)

    def show_top10(self):
        """
        prints the top 10 features for the positive class and the 
        top 10 features for the negative class. 
        """
        feature_names = np.asarray(self.vectorizer.get_feature_names())
        top10 = np.argsort(self.logreg.coef_[0])[-10:]
        bottom10 = np.argsort(self.logreg.coef_[0])[:10]
        print("Pos: %s" % " ".join(feature_names[top10]))
        print("Neg: %s" % " ".join(feature_names[bottom10]))
        
    def train_model(self, random_state=1234):
        """ 
        Method to read in training data from file, and 
        train Logistic Regression classifier. 
        
        :param random_state: seed for random number generator 
        """
        #write data in dictionary, convert to list (the with-block closes the file)
        with open("../data/spoilers/train.csv") as train_file:
            dfTrain = list(DictReader(train_file))

        #grab different information from dictionary above
        self.X_train = self.build_train_features([[x["sentence"] for x in dfTrain], [x["trope"] for x in dfTrain]])
        
        #grab spoiler labels and convert them to class indices that match self.Y: 'True' -> 0, 'False' -> 1
        self.y_train = np.array(list(['True', 'False'].index(x["spoiler"]) for x in dfTrain))
        
        k_folds_test = KFold(n_splits=10, shuffle=True)
        accuracy = []
        for train_index, test_index in k_folds_test.split(self.X_train):
            local_x_train, local_x_test = self.X_train[train_index], self.X_train[test_index]
            local_y_train, local_y_test = self.y_train[train_index], self.y_train[test_index]

            self.logreg = LogisticRegression(random_state=1230)
            self.logreg.fit(local_x_train, local_y_train)
            local_y_pred = self.logreg.predict(local_x_test)
            accurate = accuracy_score(local_y_test, local_y_pred)

            accuracy.append(accurate)
            print('Local Accuracy: ', accurate)
        
        print('Avg Accuracy is: ', sum(accuracy) / len(accuracy))

        #train logistic regression model.  !!You MAY NOT CHANGE THIS!! 
        self.logreg = LogisticRegression(random_state=random_state)
        self.logreg.fit(self.X_train, self.y_train)

        scores = cross_val_score(self.logreg, self.X_train, self.y_train, cv =10)
        print(scores)
        
    def model_predict(self):
        """
        Method to read in test data from file, make predictions
        using trained model, and dump results to file 
        """
        # read in test data (the with-block closes the file)
        with open("../data/spoilers/test.csv") as test_file:
            dfTest = list(DictReader(test_file))
        
        # featurize test data 
        self.X_test = self.get_test_features([[x["sentence"] for x in dfTest], [x["trope"] for x in dfTest]])
        
        # make predictions on test data 
        pred = self.logreg.predict(self.X_test)

        #increment id as each line is written
        id_csv = 0
        
        # dump predictions to file for submission to Kaggle  
        with open("prediction.csv", "w") as output:
            wr = DictWriter(output, fieldnames=["Id", "spoiler"], lineterminator = '\n')
            wr.writeheader()

            for p in pred:
                d = {"Id": id_csv, "spoiler": self.Y[p]}
                wr.writerow(d)
                id_csv+=1
        
    def computeLength(self):
        count_true = 0.0
        count_false = 0.0
        total_true_length = 0.0
        total_false_length = 0.0
        for index, row  in self.data.iterrows():
            if row["spoiler"] == True:
                count_true += 1.0
                total_true_length += len(row["sentence"])
            else:
                count_false += 1.0
                total_false_length += len(row["sentence"])

        print("Avg Length For Spoilers: ", end="")
        print(total_true_length/count_true)
        print("Avg Length For Non-Spoilers: ", end="")
        print(total_false_length/count_false)
        print("Total No. of Spoilers: " + str(count_true))
        print("Total No. of Non-Spoilers: " + str(count_false))
    

In [76]:
# Instantiate the FeatEngr class 
feat = FeatEngr()

# Train your Logistic Regression classifier 
feat.train_model(random_state=1230)

# Shows the top 10 features for each class 
#feat.show_top10()

# Make prediction on test data and produce Kaggle submission file 
feat.model_predict()

Local Accuracy:  0.769423558897
Local Accuracy:  0.74269005848
Local Accuracy:  0.738512949039
Local Accuracy:  0.750208855472
Local Accuracy:  0.770258980785
Local Accuracy:  0.74269005848
Local Accuracy:  0.751879699248
Local Accuracy:  0.758563074353
Local Accuracy:  0.763575605681
Local Accuracy:  0.747702589808
Avg Accuracy is:  0.753550543024
[ 0.67111853  0.6903172   0.66583124  0.67919799  0.60818713  0.6566416
  0.62406015  0.62907268  0.63963211  0.64966555]


### [25 points] Problem 2: Motivation and Analysis 
***

The job of the written portion of the homework is to convince the grader that:

- Your new features work
- You understand what the new features are doing
- You had a clear methodology for incorporating the new features

Make sure that you have examples and quantitative evidence that your features are working well. Be sure to explain how you used the data (e.g., did you have a validation set? did you do cross-validation?) and how you inspected the results. In addition, it is very important that you show some kind of an **error analysis** throughout your process.  That is, you should demonstrate that you've looked at misclassified examples and put thought into how you can craft new features to improve your model. 

A sure way of getting a low grade is simply listing what you tried and reporting the Kaggle score for each. You are expected to pay more attention to what is going on with the data and take a data-driven approach to feature engineering.

###  My Approach
Initially, using only CountVectorizer as my transformer, the top feature names were: 
Pos: tear freya dies harvey sebastian regina morgana olivia moriarty destiny
Neg: cory johnny tim drew often hilarious meant cody disney fed

Even before submitting the prediction csv file built from these features, I could tell that they would not be helpful in deciding whether a sentence is a spoiler: they focus heavily on character names rather than on verbs or other words that signal a plot twist, which is the information that actually makes a sentence a spoiler. 

Submitting the file gave the baseline score of ~0.62231.

I wanted to test several approaches.
First: key words.
The key words found in spoilers vary, but I guessed they would be words like "kill", "spoiler", "dead", and "end".
To check this, I used a simple for loop to count, for each word, how many of the sentences containing it are spoilers (True) vs. non-spoilers (False). The counts are listed below. 

#key_words = ["spoiler", "spoilers"] #7 vs 1
#key_words = ["death"] #208 vs 55
#key_words = ["died"] #92 vs 33
#key_words = ["killed"] #178 vs 39
#key_words = ["kills"] #77 vs 19
#key_words = ['stabs'] #7 vs 2
#key_words = ['ending'] #97 vs 37
#key_words = ['revenge'] #37 vs 9
#key_words = ['suicide'] #49 vs 10
#key_words = ['chances'] #9 vs 1
#key_words = ['tear'] #23 vs 3
#key_words = ['spoiler', 'spoilers', 'death', 'killed', 'kills', 'stabs', 'suicide', 'revenge'] #462 vs 115

This supported my assumption that verbs and other words describing a change in situation are strongly associated with spoilers. An example is shown below.


In [36]:
word = 'spoiler'
information = 'sentence'
find_word_spoiler(feat.data, word, information )

['spoiler']
Total Sentences: 8
Total True: 7
Total False: 1
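
To put the keyword counts in context, they can be compared with the overall share of spoilers in the training data. The sketch below uses only numbers reported in this notebook (the keyword counts listed above and the class totals printed by `computeLength` further down); `spoiler_rate` is a hypothetical helper, not part of the homework code.

```python
# (spoilers, non-spoilers) among sentences containing each key word,
# copied from the counts listed in this notebook.
keyword_counts = {
    "spoiler/spoilers": (7, 1),
    "death": (208, 55),
    "died": (92, 33),
    "killed": (178, 39),
    "kills": (77, 19),
    "ending": (97, 37),
    "tear": (23, 3),
}

# Class totals printed by computeLength(): 6288 spoilers, 5682 non-spoilers.
base_rate = 6288 / (6288 + 5682)

def spoiler_rate(n_spoiler, n_non_spoiler):
    """Fraction of the matching sentences that are labeled as spoilers."""
    return n_spoiler / (n_spoiler + n_non_spoiler)

rates = {word: spoiler_rate(*counts) for word, counts in keyword_counts.items()}
for word, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(f"{word:18s} {rate:.2f}  (base rate {base_rate:.2f})")
```

Each of these key words sits well above the base rate of roughly 0.53, which is what makes them plausible features. The rarer words (for example 8 matching sentences for "spoiler") carry far less evidence than the frequent ones.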


As a result, I wanted a vectorizer whose top features are words like these. I tried setting self.vectorizer to a TF-IDF vectorizer (one of the vectorizers Chris suggested in class) to see what the top features would be. 

In [66]:
class FeatEngr_TFIDF:
    def __init__(self):
        self.vectorizer = TfidfVectorizer(use_idf = False)
    def build_train_features(self, examples):
        return self.vectorizer.fit_transform(examples)
    def get_test_features(self, examples):
        return self.vectorizer.transform(examples)
    def show_top10(self):
        feature_names = np.asarray(self.vectorizer.get_feature_names())
        top10 = np.argsort(self.logreg.coef_[0])[-10:]
        bottom10 = np.argsort(self.logreg.coef_[0])[:10]
        print("Pos: %s" % " ".join(feature_names[top10]))
        print("Neg: %s" % " ".join(feature_names[bottom10]))
    def train_model(self, random_state=1234):
        dfTrain = pd.read_csv("../data/spoilers/train.csv")
        self.X_train = self.build_train_features(list(dfTrain["sentence"]))
        self.y_train = np.array(dfTrain["spoiler"], dtype=int)
        self.logreg = LogisticRegression(random_state=random_state)
        self.logreg.fit(self.X_train, self.y_train)

feat_tfidf = FeatEngr_TFIDF()
feat_tfidf.train_model(random_state=1230)
feat_tfidf.show_top10()

Pos: end die dies kills dead olivia revealed death killed finale
Neg: show often usually tim drew always started cast like than
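
Because the cell above passes `use_idf = False` and leaves `norm` at its default of l2, the features it produces are, as far as I understand the scikit-learn behavior, plain term counts rescaled so that each row has unit length. The sketch below reproduces that arithmetic by hand for one made-up sentence. It assumes simple whitespace tokenization (the real vectorizer uses a regular-expression tokenizer and lowercases), and `l2_normalized_tf` is a hypothetical helper.

```python
import math
from collections import Counter

def l2_normalized_tf(tokens):
    """Term counts divided by the Euclidean norm of the count vector."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(count * count for count in counts.values()))
    return {term: count / norm for term, count in counts.items()}

# Made-up example sentence, already lowercased.
weights = l2_normalized_tf("he dies and she dies".split())
for term, weight in sorted(weights.items()):
    print(f"{term:5s} {weight:.3f}")
```

A word that repeats inside a post gets a larger weight, but no word is down-weighted for being common across posts, since the idf factor is switched off.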


Looking at the top features from the TF-IDF vectorizer, nearly all of the words fit what I would expect in a spoiler, and they agree with the small keyword count of spoilers vs. non-spoilers that I ran above. 
I submitted the predictions from this model, and it resulted in a Kaggle score of ~0.67.

Every word in the top 10 features fits, except for olivia. It is undesirable to have a name among the highest weights for classifying a sentence as a spoiler. A name might indicate a spoiler only for a certain show, which is not helpful, because the model must be able to detect spoilers for many different shows. This must be fixed.
Additionally, tokenizing and stemming might give me an improved transformer. Rather than having die and dies among the top words as separate terms, stemming with stem() would merge them into one term, freeing a slot for another term that fits better. 

Additionally, I read this paper (http://www.umiacs.umd.edu/~jbg/docs/2013_spoiler.pdf) on their approach to detecting spoilers through feature engineering. They noted that the longer a post is, the more likely it is to contain a spoiler. There are two reasons for this. First, a longer post contains more description and more detail about a situation, so it is more likely to be a spoiler. Second, longer posts tend to describe lengthy shows: lighter shows such as sitcoms are less likely to be the subject of such a post, while shows that take a long time to develop their plot and story are more likely to be. Hence, it is also important to consider the genre of the show or the length of the post.

I wanted to find the average length of posts that are spoilers and of posts that are non-spoilers, to determine whether this is a worthwhile approach.

In [38]:
feat.computeLength()

Avg Length For Spoilers: 120.85846055979644
Avg Length For Non-Spoilers: 104.15047518479409
Total No. of Spoilers: 6288.0
Total No. of Non-Spoilers: 5682.0


At first glance, a custom transformer that counts the length of posts did not seem worthwhile, because spoilers are only slightly longer on average than non-spoilers (about 121 vs. 104 characters). However, the spoiler class is also the larger one (6288 vs. 5682 posts), and its average is still higher, so the gap is unlikely to come from a handful of unusually long posts. I decided the feature was worth trying. 
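
As a quick worked check of how large the gap is, the snippet below recomputes it from the four numbers printed by `computeLength` above. Nothing here touches the data; the variable names are only illustrative.

```python
# Values copied from the computeLength() output in this notebook.
avg_len_spoiler = 120.85846055979644
avg_len_non_spoiler = 104.15047518479409
n_spoiler = 6288
n_non_spoiler = 5682

# How much longer spoilers are, relative to non-spoilers.
relative_gap = (avg_len_spoiler - avg_len_non_spoiler) / avg_len_non_spoiler

# Average length over all posts, weighting each class by its size.
total_length = avg_len_spoiler * n_spoiler + avg_len_non_spoiler * n_non_spoiler
overall_avg = total_length / (n_spoiler + n_non_spoiler)

print(f"Spoilers are {relative_gap:.1%} longer on average")
print(f"Overall average length: {overall_avg:.1f} characters")
```

A gap of about 16% in the means says nothing about how much the two length distributions overlap, so on its own it is weak evidence that length will separate the classes.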

I created a vectorizer using FeatureUnion that combines TF-IDF with stemming and the post-length transformer class I wrote, as shown below. 

In [39]:
'''
self.vectorizer = FeatureUnion([("TfidfVectorizer", TfidfVectorizer(ngram_range=(1, 2), stop_words='english', tokenizer=TDIDF_Stemming())), 
        ("sentence_length_transformer", sentence_length_transformer())])
'''


At that time, I did not use cross validation to test my vectorizer; I simply submitted the predictions produced with the self.vectorizer above. I only earned a score of ~0.63, and this continued over several submissions with similar vectorizers, each time adjusting the TF-IDF parameters while still using the sentence_length_transformer. Scores varied between 0.60 and 0.64. Only when I was running short on submissions and time did I start using cross validation and consider different approaches. 

In [40]:
class FeatEngr3:
    def __init__(self):
        
        from sklearn.feature_extraction.text import CountVectorizer
        
        self.vectorizer = FeatureUnion([("TDIDF:" , TfidfVectorizer(min_df = 110, use_idf = False, tokenizer = TDIDF_Stemming(), stop_words='english')), 
           ("Count:", CountVectorizer(ngram_range=(1, 2), stop_words='english', tokenizer=POS_tokenizer()))])

    def build_train_features(self, examples):
        return self.vectorizer.fit_transform(examples)

    def get_test_features(self, examples):
        return self.vectorizer.transform(examples)

    def show_top10(self):
        feature_names = np.asarray(self.vectorizer.get_feature_names())
        top10 = np.argsort(self.logreg.coef_[0])[-10:]
        bottom10 = np.argsort(self.logreg.coef_[0])[:10]
        print("Pos: %s" % " ".join(feature_names[top10]))
        print("Neg: %s" % " ".join(feature_names[bottom10]))
                
    def train_model(self, random_state=1234):
        from sklearn.linear_model import LogisticRegression 
        dfTrain = pd.read_csv("../data/spoilers/train.csv")
        self.X_train = self.build_train_features(list(dfTrain["sentence"]))
        self.y_train = np.array(dfTrain["spoiler"], dtype=int)
        

        k_folds_test = KFold(n_splits=10, shuffle=True)
        accuracy = []
        for train_index, test_index in k_folds_test.split(self.X_train):
            local_x_train, local_x_test = self.X_train[train_index], self.X_train[test_index]
            local_y_train, local_y_test = self.y_train[train_index], self.y_train[test_index]

            self.logreg = LogisticRegression(random_state=1230)
            self.logreg.fit(local_x_train, local_y_train)
            local_y_pred = self.logreg.predict(local_x_test)
            accurate = accuracy_score(local_y_test, local_y_pred)

            accuracy.append(accurate)
            print('Local Accuracy: ', accurate)
        
        print('Avg Accuracy is: ', sum(accuracy) / len(accuracy))

        #train logistic regression model.  !!You MAY NOT CHANGE THIS!! 
        self.logreg = LogisticRegression(random_state=random_state)
        self.logreg.fit(self.X_train, self.y_train)

        scores = cross_val_score(self.logreg, self.X_train, self.y_train, cv =10)
        print(scores)
        
    def model_predict(self):
        
        # read in test data 
        dfTest  = pd.read_csv("../data/spoilers/test.csv")
        
        # featurize test data 
        self.X_test = self.get_test_features(list(dfTest["sentence"]))
        
        # make predictions on test data 
        pred = self.logreg.predict(self.X_test)
        
        # dump predictions to file for submission to Kaggle  
        pd.DataFrame({"spoiler": np.array(pred, dtype=bool)}).to_csv("prediction.csv", index=True, index_label="Id")

In [65]:
feat3 = FeatEngr3()

feat3.train_model(random_state=1230)

feat3.show_top10()
print("---------------------------------------------")

Local Accuracy:  0.644110275689
Local Accuracy:  0.644110275689
Local Accuracy:  0.637426900585
Local Accuracy:  0.624060150376
Local Accuracy:  0.644945697577
Local Accuracy:  0.637426900585
Local Accuracy:  0.634920634921
Local Accuracy:  0.654135338346
Local Accuracy:  0.623224728488
Local Accuracy:  0.621553884712
Avg Accuracy is:  0.636591478697
[ 0.62186978  0.61769616  0.62823726  0.61403509  0.58228906  0.61236424
  0.61236424  0.59314954  0.590301    0.63628763]
Pos: Count:__sebastian Count:__starting Count:__kills Count:__turns TDIDF:__kill TDIDF:__die TDIDF:__reveal TDIDF:__death TDIDF:__end TDIDF:__final
Neg: Count:__small Count:__actually , Count:__hilarious Count:__meant TDIDF:__live Count:__frequently TDIDF:__like Count:__problems Count:__drew Count:__# 2
---------------------------------------------


By now I had exhausted tokenizing, stemming, CountVectorizer and TF-IDF and their parameters without progress. The parameters I experimented with included bigrams and longer n-grams, the minimum document frequency (to keep only words that occur frequently), and others. It is important to note, though, that using bigrams, stop words, and lower case was the most helpful. Additionally, combining CountVectorizer and TF-IDF did not result in meaningful features; in fact, CountVectorizer was not a meaningful addition to the results I found from using TF-IDF alone. Using sentence length did not help either. The accuracy scores and the cross validation scores were also low. 
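
To make the combination of lower case, stop words, and bigrams concrete, here is a minimal sketch of the features it yields for one made-up sentence. It follows the order I believe scikit-learn uses (stop words are dropped first, then n-grams are built from what remains). The two-word stop list is a tiny stand-in for the built-in 'english' list, and `word_ngrams` is a hypothetical helper.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all n-grams of the requested sizes, joined by spaces."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

# Made-up sentence; lowercase first, then split on whitespace.
tokens = "The hero dies in the finale".lower().split()

# Tiny hypothetical stop list (not sklearn's 'english' list).
stop_words = {"the", "in"}
content_tokens = [token for token in tokens if token not in stop_words]

features = word_ngrams(content_tokens, ngram_range=(1, 2))
print(features)
```

Because the stop words are removed before pairing, a bigram such as "dies finale" can join two words that were not adjacent in the original sentence.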

I watched a YouTube Video that discussed general approaches to extract information, which included Stemming and bigrams (https://www.youtube.com/watch?v=oYe03Y1WQaI&t=328s)

Below is an example of using Tf Idf with stop words, bigrams, and lower case.

In [47]:
class FeatEngr_TFIDF2:
    def __init__(self):
        self.vectorizer = TfidfVectorizer(ngram_range=(1,2), lowercase=True, stop_words='english')
    def build_train_features(self, examples):
        return self.vectorizer.fit_transform(examples)
    def get_test_features(self, examples):
        return self.vectorizer.transform(examples)
    def show_top10(self):
        feature_names = np.asarray(self.vectorizer.get_feature_names())
        top10 = np.argsort(self.logreg.coef_[0])[-10:]
        bottom10 = np.argsort(self.logreg.coef_[0])[:10]
        print("Pos: %s" % " ".join(feature_names[top10]))
        print("Neg: %s" % " ".join(feature_names[bottom10]))
    def train_model(self, random_state=1234):
        dfTrain = pd.read_csv("../data/spoilers/train.csv")
        self.X_train = self.build_train_features(list(dfTrain["sentence"]))
        self.y_train = np.array(dfTrain["spoiler"], dtype=int)
        self.logreg = LogisticRegression(random_state=random_state)
        self.logreg.fit(self.X_train, self.y_train)
        k_folds_test = KFold(n_splits=10, shuffle=True)
        accuracy = []
        for train_index, test_index in k_folds_test.split(self.X_train):
            local_x_train, local_x_test = self.X_train[train_index], self.X_train[test_index]
            local_y_train, local_y_test = self.y_train[train_index], self.y_train[test_index]

            self.logreg = LogisticRegression(random_state=1230)
            self.logreg.fit(local_x_train, local_y_train)
            local_y_pred = self.logreg.predict(local_x_test)
            accurate = accuracy_score(local_y_test, local_y_pred)

            accuracy.append(accurate)
            print('Local Accuracy: ', accurate)
        
        print('Avg Accuracy is: ', sum(accuracy) / len(accuracy))

        #train logistic regression model.  !!You MAY NOT CHANGE THIS!! 
        self.logreg = LogisticRegression(random_state=random_state)
        self.logreg.fit(self.X_train, self.y_train)

        scores = cross_val_score(self.logreg, self.X_train, self.y_train, cv =10)
        print(scores)

In [68]:
feat_tfidf2 = FeatEngr_TFIDF2()
feat_tfidf2.train_model(random_state=1230)
feat_tfidf2.show_top10()
print("--------------------------------------------------------------------------")

Local Accuracy:  0.664160401003
Local Accuracy:  0.66081871345
Local Accuracy:  0.680033416876
Local Accuracy:  0.697577276525
Local Accuracy:  0.680033416876
Local Accuracy:  0.68253968254
Local Accuracy:  0.702589807853
Local Accuracy:  0.674185463659
Local Accuracy:  0.695906432749
Local Accuracy:  0.69089390142
Avg Accuracy is:  0.682873851295
[ 0.63105175  0.64607679  0.64578112  0.63324979  0.62155388  0.62489557
  0.60568087  0.54636591  0.57943144  0.66220736]
Pos: kill kills dies dead end revealed finale death turns killed
Neg: usually like tim drew cast tv cory seasons meant ryan
--------------------------------------------------------------------------


I have now verified that these parameters produce more accurate features and also remove names such as Olivia from the positive class. However, this does not solve the meaningless addition of CountVectorizer. 

I then considered using CountVectorizer on the tropes alone, to see whether I could find a correlation with the theme of the sentence rather than with its length alone. In the cell below, I create a class FeatEngr4 that uses only CountVectorizer, and instead of building self.X_train from the sentence column, I build it from the trope column. 

In [69]:
class FeatEngr4:
    def __init__(self):
        self.vectorizer = CountVectorizer()

    def build_train_features(self, examples):
        return self.vectorizer.fit_transform(examples)

    def get_test_features(self, examples):
        return self.vectorizer.transform(examples)

    def show_top10(self):
        feature_names = np.asarray(self.vectorizer.get_feature_names())
        top10 = np.argsort(self.logreg.coef_[0])[-10:]
        bottom10 = np.argsort(self.logreg.coef_[0])[:10]
        print("Pos: %s" % " ".join(feature_names[top10]))
        print("Neg: %s" % " ".join(feature_names[bottom10]))
                
    def train_model(self, random_state=1234):
        from sklearn.linear_model import LogisticRegression 
        dfTrain = pd.read_csv("../data/spoilers/train.csv")
        self.X_train = self.build_train_features(list(dfTrain["trope"]))
        self.y_train = np.array(dfTrain["spoiler"], dtype=int)
        

        k_folds_test = KFold(n_splits=10, shuffle=True)
        accuracy = []
        for train_index, test_index in k_folds_test.split(self.X_train):
            local_x_train, local_x_test = self.X_train[train_index], self.X_train[test_index]
            local_y_train, local_y_test = self.y_train[train_index], self.y_train[test_index]

            self.logreg = LogisticRegression(random_state=1230)
            self.logreg.fit(local_x_train, local_y_train)
            local_y_pred = self.logreg.predict(local_x_test)
            accurate = accuracy_score(local_y_test, local_y_pred)

            accuracy.append(accurate)
            print('Local Accuracy: ', accurate)
        
        print('Avg Accuracy is: ', sum(accuracy) / len(accuracy))

        #train logistic regression model.  !!You MAY NOT CHANGE THIS!! 
        self.logreg = LogisticRegression(random_state=random_state)
        self.logreg.fit(self.X_train, self.y_train)

        scores = cross_val_score(self.logreg, self.X_train, self.y_train, cv =10)
        print(scores)


In [71]:
feat4 = FeatEngr4()

feat4.train_model(random_state=1230)

feat4.show_top10()
print("------------------------------------------------------------------------------------------------------------")

Local Accuracy:  0.727652464495
Local Accuracy:  0.736006683375
Local Accuracy:  0.739348370927
Local Accuracy:  0.727652464495
Local Accuracy:  0.749373433584
Local Accuracy:  0.723475355054
Local Accuracy:  0.736006683375
Local Accuracy:  0.744360902256
Local Accuracy:  0.739348370927
Local Accuracy:  0.732664995823
Avg Accuracy is:  0.735588972431
[ 0.63439065  0.67278798  0.61487051  0.6449457   0.60651629  0.61236424
  0.6031746   0.6374269   0.65384615  0.61789298]
Pos: neverfoundthebody foreshadowing killedoffforreal heroicbsod backfromthedead ohcrap bittersweetending xanatosgambit thereveal whamepisode
Neg: catchphrase deadpansnarker thecastshowoff abc flanderization spinoff gameshow stockfootage thebbc domcom
------------------------------------------------------------------------------------------------------------


This is where I made my biggest breakthrough. The top 10 tropes for the positive class (spoilers) describe plot events that are likely to be spoilers, which makes the trope a suitable alternative to sentence length. Instead of using sentence length to guess what kind of content a sentence holds, why not use the trope directly? My accuracy was also higher than in the previous tests.

The local accuracy values were the highest so far, although the cross-validation scores were lower than with TF-IDF alone. Nonetheless, using TF-IDF on sentences and CountVectorizer on tropes seemed more promising.
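One possible reason the two sets of numbers disagree is that they are not computed on the same folds: the manual loop uses `KFold` with `shuffle=True`, while `cross_val_score` with `cv=10` on a classifier uses stratified folds without shuffling. If the rows in `train.csv` are grouped in some way (an assumption, not something verified here), unshuffled folds can score lower. A minimal sketch of a like-for-like comparison, using synthetic stand-in data rather than the assignment data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data; the real features come from the notebook's vectorizers.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
model = LogisticRegression(random_state=1230)

# cv=10 on a classifier: stratified folds, rows kept in their original order.
unshuffled_scores = cross_val_score(model, X, y, cv=10)

# Passing a shuffled splitter mirrors the manual KFold loop.
shuffled_cv = KFold(n_splits=10, shuffle=True, random_state=1230)
shuffled_scores = cross_val_score(model, X, y, cv=shuffled_cv)

print("unshuffled mean:", unshuffled_scores.mean())
print("shuffled mean:  ", shuffled_scores.mean())
```

On this synthetic data the two means should be close, since the rows have no particular order; a gap would only be expected on data whose row order carries structure.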

Further, I wanted to check that the tropes suggested above are really related to spoilers and not just a fluke or a mistake by CountVectorizer, so I ran a few tests.

In [72]:
word = 'neverfoundthebody'
information = 'trope'
find_word_spoiler(feat.data, word, information)

['neverfoundthebody']
Total Sentences: 14
Total Spoilers: 14
Total Non-Spoilers: 0


In [73]:
word = 'foreshadowing'
information = 'trope'
find_word_spoiler(feat.data, word, information)

['foreshadowing']
Total Sentences: 66
Total Spoilers: 58
Total Non-Spoilers: 8


What about the negative class? I also wanted to test the features suggested by CountVectorizer. 

In [74]:
word = 'catchphrase'
information = 'trope'
find_word_spoiler(feat.data, word, information)

['catchphrase']
Total Sentences: 50
Total Spoilers: 3
Total Non-Spoilers: 47


In [75]:
word = 'deadpansnarker'
information = 'trope'
find_word_spoiler(feat.data, word, information)

['deadpansnarker']
Total Sentences: 24
Total Spoilers: 1
Total Non-Spoilers: 23
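The `find_word_spoiler` helper used in these cells is defined elsewhere in the notebook. As a rough sketch of the kind of count it reports, the hypothetical function below (`count_spoilers_for_trope` is an illustrative name, not the notebook's) tallies spoilers and non-spoilers for one trope, assuming each row is a dictionary with `trope` and `spoiler` keys like the columns in `train.csv`:

```python
def count_spoilers_for_trope(rows, trope):
    """Return (total, spoilers, non_spoilers) for rows tagged with `trope`."""
    target = trope.lower()
    matches = [row for row in rows if row["trope"].lower() == target]
    spoilers = sum(1 for row in matches if row["spoiler"])
    return len(matches), spoilers, len(matches) - spoilers


# Toy rows invented for illustration; the real rows come from train.csv.
toy_rows = [
    {"trope": "CatchPhrase", "spoiler": False},
    {"trope": "CatchPhrase", "spoiler": False},
    {"trope": "CatchPhrase", "spoiler": True},
    {"trope": "NeverFoundTheBody", "spoiler": True},
]

total, spoilers, non_spoilers = count_spoilers_for_trope(toy_rows, "catchphrase")
print("Total Sentences:", total)
print("Total Spoilers:", spoilers)
print("Total Non-Spoilers:", non_spoilers)
```

Matching is case-insensitive here because CountVectorizer lowercases the trope names it reports.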


After further tests with different tropes, this seemed like a viable option for my transformer. Using tropes is also promising because each one names a general idea or theme that gathers sentences under a specific "umbrella", so sentences under the trope neverfoundthebody are very likely to hold the key information that marks a sentence as a spoiler.

I wanted to feed my TfidfVectorizer with sentences only and my CountVectorizer with tropes only. These two pages were a helpful guide to selecting only certain columns when building a customized transformer, and both suggest using FunctionTransformer.
1) https://stackoverflow.com/questions/43274423/use-sklearns-functiontransformer-with-string-data
2) http://scikit-learn.org/stable/auto_examples/preprocessing/plot_function_transformer.html

With Pipeline, I can pair one FunctionTransformer with the TfidfVectorizer and another with the CountVectorizer, and then all I need to do is combine the two Pipelines with FeatureUnion. The get_feature_names function will not work on the pipeline, but the previous tests had already shown that this combination yields tropes and key words that accurately classify a sentence as a spoiler, so all I am doing now is feeding the right inputs to those vectorizers using FunctionTransformer via Pipeline. Even though I can't call get_feature_names, the high accuracy scores validated my assumption that combining the two vectorizers gave me better features. This was my final adjustment, and it produced the highest accuracy and the highest cross-validation scores.
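A minimal sketch of that arrangement is shown below. It is not the final program from Part 1; the selector functions, step names, and toy rows are all made up for illustration, and the rows are assumed to be dictionaries with `sentence` and `trope` keys. The sketch also suggests one way the feature names might still be recovered, by reading them from the vectorizer step inside each Pipeline rather than from the union itself.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer


def select_sentences(rows):
    """Pull the sentence text out of each row."""
    return [row["sentence"] for row in rows]


def select_tropes(rows):
    """Pull the trope name out of each row."""
    return [row["trope"] for row in rows]


def vectorizer_feature_names(vec):
    """Read feature names on both newer and older scikit-learn versions."""
    if hasattr(vec, "get_feature_names_out"):
        return list(vec.get_feature_names_out())
    return list(vec.get_feature_names())


# validate=False lets FunctionTransformer pass the list of dicts through as-is.
features = FeatureUnion([
    ("sentence_tfidf", Pipeline([
        ("select", FunctionTransformer(select_sentences, validate=False)),
        ("vec", TfidfVectorizer()),
    ])),
    ("trope_counts", Pipeline([
        ("select", FunctionTransformer(select_tropes, validate=False)),
        ("vec", CountVectorizer()),
    ])),
])

# Toy rows invented for illustration. In the notebook they could come from
# something like dfTrain[["sentence", "trope"]].to_dict("records") (assumption).
rows = [
    {"sentence": "The hero dies in the finale", "trope": "KilledOffForReal"},
    {"sentence": "He repeats his catch phrase every episode", "trope": "CatchPhrase"},
    {"sentence": "The body was never recovered after the crash", "trope": "NeverFoundTheBody"},
    {"sentence": "She is sarcastic in every scene", "trope": "DeadpanSnarker"},
]
labels = np.array([1, 0, 1, 0])

X = features.fit_transform(rows)
model = LogisticRegression(random_state=1230).fit(X, labels)

# FeatureUnion stacks columns in transformer order, so the names line up with X.
names = []
for _, pipeline in features.transformer_list:
    names.extend(vectorizer_feature_names(pipeline.named_steps["vec"]))

print(X.shape, len(names))
```

Because `names` follows the same column order as `X`, it could be indexed with the sorted positions of `model.coef_[0]` in the same way `show_top10` does.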

I changed the dfTrain variable given to us initially so that I could create a dictionary with information from the sentence and trope columns only, then convert it to a list. I did this for both the training part and the predicting part. The resulting accuracy and cross-validation scores were:

Local Accuracy:  0.743525480368
Local Accuracy:  0.756056808688
Local Accuracy:  0.774436090226
Local Accuracy:  0.770258980785
Local Accuracy:  0.741019214703
Local Accuracy:  0.738512949039
Local Accuracy:  0.739348370927
Local Accuracy:  0.741854636591
Local Accuracy:  0.758563074353
Local Accuracy:  0.730158730159
Avg Accuracy is:  0.749373433584
[ 0.67111853  0.6903172   0.66583124  0.67919799  0.60818713  0.6566416
  0.62406015  0.62907268  0.63963211  0.64966555]
  
Both the local accuracy and the cross-validation scores were higher than in all previous tests. With three submissions left on Thursday night, I felt confident submitting the resulting CSV file, and it earned a score of ~0.71. The final program is the one I posted in Part 1 of this notebook.

### Hints 
***

- Don't use all the data until you're ready. 

- Examine the features that are being used.

- Do error analyses.

- If you have questions that aren’t answered in this list, feel free to ask them on Piazza.

### FAQs 
***

> Can I heavily modify the FeatEngr class? 

Totally.  This was just a starting point.  The only thing you cannot modify is the LogisticRegression classifier.  

> Can I look at TV Tropes?

To gain insight into the data, yes. However, your feature extraction cannot use any additional data (beyond what I've given you) from the TV Tropes webpage.

> Can I use IMDB, Wikipedia, or a dictionary?

Yes, but you are not required to. So long as your features are fully automated, they can use any dataset other than TV Tropes. Be careful, however, that your dataset does not somehow include TV Tropes (e.g. using all webpages indexed by Google will likely include TV Tropes).

> Can I combine features?

Yes, and you probably should. This will likely be quite effective.

> Can I use Mechanical Turk?

That is not fully automatic, so no. You should be able to run your feature extraction without any human intervention. If you want to collect data from Mechanical Turk to train a classifier that you can then use to generate your features, that is fine. (But that’s way too much work for this assignment.)

> Can I use a Neural Network to automatically generate derived features? 

No. This assignment is about your ability to extract meaningful features from the data using your own experimentation and experience.

> What sort of improvement is “good” or “enough”?

If you have a 10-15% improvement over the baseline (on the Public Leaderboard) with your features, that’s more than sufficient. If you fail to get that improvement but have tried reasonable features, that satisfies the requirements of the assignment. However, the extra credit for “winning” the class competition depends on the performance of other students.

> Where do I start?  

It might be a good idea to look at the in-class notebook associated with the Feature Engineering lecture where we did similar experiments. 


> Can I use late days on this assignment? 

You can use late days for the write-up submission, but the Kaggle competition closes at **4:59pm on Friday February 23rd**.

> Why does it say that the competition ends at 11:59pm when the assignment says 4:59pm? 

The end time/date are in UTC.  11:59pm UTC is equivalent to 4:59pm MST.  Kaggle In-Class does not allow us to change this. 