# CSCI 4622 - Spring 2018 - Practicum 
***


This practicum is due on Moodle by **11:59pm on Thursday May 3rd**. 

**Here are the rules:** 

1. Your work must be done entirely on your own. You may **NOT** collaborate with classmates or anyone else.  
2. You may **NOT** post to message boards or other online resources asking for help. 
3. You may **NOT** use late days on the practicum nor can you drop your practicum grade. 
4. You may use your course notes, posted lecture slides, in-class notebooks, and homework solutions as resources. 
5. You may consult alternate sources like blog posts or technical papers, but you may **NOT** copy code from these sources. 
6. Any additional non-course sources that you use should be clearly cited (with links) in the **References** section at the bottom of this notebook. 
7. Submit only this Jupyter notebook to Moodle.  Do not compress it using tar, rar, zip, etc. 

Violation of the above rules will result in an **F** in the course and a trip to **Honor Council**. 

***

**By writing your name below you agree to abide by the given rules:**

**Name**: $<$Chen Hao Cheng$>$

**Kaggle Username**: $<$insert username here$>$

***


**NOTES**: 

- You do not need to implement everything from scratch.  At this point you should be leveraging Sklearn as much as you can. 
- If you have a clarifying question, please post it as a **PRIVATE** message to me on Piazza. 
- Part of the goal of this assignment is to see if you can stand on your own.  Please do not ask me to help you debug code or check if your answers are correct. Most of the implementation details necessary to complete this practicum can be found in the Hands-On notebooks or the Sklearn documentation.  
- You'll notice that the point totals below do not add up to 100.  This is because 10 out of the 100 points will be attributed to **style**.  To earn full credit for style your analysis should be concise and well-organized, your code should be readable and well-commented, and you should use plots and graphics to support your conclusions whenever appropriate.  

In [3]:
import pickle, gzip 
import numpy as np
import pandas as pd

### [35 points] Problem 1: Building Classifiers for Fashion MNIST 
***

The classic MNIST Handwritten Digit data set has been a staple in the machine learning literature since the beginning of time (i.e. the late 90's).  However, machine learning practitioners have grown tired of the rusty digits and have recently begun to create and explore new, more interesting data sets. Some popular alternatives to emerge recently are [EMNIST](https://www.kaggle.com/crawford/emnist), [Sign Language MNIST](https://www.kaggle.com/datamunge/sign-language-mnist), and [Fashion MNIST](https://github.com/zalandoresearch/fashion-mnist). In this problem you will explore the latter. 

Fashion MNIST is comprised of $28 \times 28$ pixel gray-scale images of clothing, with classes corresponding to things like tops, trousers, coats, dresses, and various types of shoes.  The data set that we'll work with corresponds to a small subset of Fashion MNIST with 1500 examples from each of five distinct classes (tops, trousers, coats, sneakers, and ankle boots). 

Execute the following cell to load the data. 

In [None]:
with gzip.open('fashion_mnist_subset.pklz', 'rb') as f:
    X_all, y_all = pickle.load(f)

In **Parts A-C** you will construct various tuned classifiers for making predictions on Fashion MNIST.  For each classifier you should: 
- Describe and motivate any transformations on the pixel data that you found helpful/necessary to make your model work well. 
- Describe and justify your process for determining optimal hyperparameters for each model. Support your decisions with validation studies and associated graphics.  Do **NOT** just report the hyperparameters that worked best.  
- Describe how you evaluated your models during your process (i.e. did you use a validation set, did you do cross-validation, etc). 
- Report the final optimal hyperparameters that you used as well as the accuracy of your final model. 

**Part A**: Construct a K-Nearest Neighbors classifier to make predictions on the data. 

In [None]:
print(X_all)
print(y_all)
print(len(X_all))
print(len(X_all[0]))
print(len(y_all))

import collections
from sklearn.neighbors import BallTree

class KNN:
    def __init__(self, X_train, y_train, K=5, distance_weighted=False):
        self.balltree = BallTree(X_train)
        self.y_train = y_train
        self.K = K
        self.distance_weighted = distance_weighted

    def majority(self, neighbor_indices, neighbor_distances=None):
        """Return the winning label among the neighbors of a single query point."""
        labels = [self.y_train[i] for i in neighbor_indices[0]]

        if not self.distance_weighted:
            counts = collections.Counter(labels).most_common()
            # A unique most-common label wins outright
            if len(counts) == 1 or counts[0][1] != counts[1][1]:
                return counts[0][0]
            # On a tie, drop the farthest neighbor and vote again
            return self.majority([neighbor_indices[0][:-1]])

        # Distance-weighted vote: each neighbor contributes 1 / (distance + 0.001);
        # the small constant avoids division by zero for exact matches
        votes = collections.defaultdict(float)
        for label, dist in zip(labels, neighbor_distances[0]):
            votes[label] += 1 / (dist + 0.001)
        return max(votes, key=votes.get)

    def classify(self, x):
        distance, indices = self.balltree.query(x.reshape(1, -1), self.K)
        return self.majority(indices, neighbor_distances = distance)

    def predict(self, X):
        return list(map(self.classify, X))
    


In [None]:
# predictor = KNN(X_all, y_all, 3, False)
# yHatValid = predictor.predict(X_all)
# print(yHatValid)

In [None]:
# from sklearn.neighbors import KNeighborsClassifier
# knn = KNeighborsClassifier(15).fit(X_all, y_all)

# from sklearn.metrics import confusion_matrix
# yHat_valid = knn.predict(X_all)
# print(yHat_valid)

# C = confusion_matrix(y_all, yHat_valid)
# print("Confusion matrix: ")
# print(C)

In [None]:
import pickle, gzip
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline

def view_digit(x, label=None):
#     print("x is: ", x)

    fig = plt.figure(figsize=(3,3))
    plt.imshow(x.reshape(28,28), cmap='gray');
    
    plt.xticks([]); plt.yticks([]);
    # Compare against None so that class label 0 is still displayed
    if label is not None: plt.xlabel("true: {}".format(label), fontsize=16)

In [None]:
# from sklearn.neighbors import BallTree
# predictor = KNN(X_all, y_all, 3, False)
# distance, indices = predictor.balltree.query(X_all[6].reshape(1, -1) , 3)
# print(indices)

# view_digit(X_all[6])
# view_digit(X_all[1304])
# view_digit(X_all[2321])

In [None]:
# UnweightedAccuracy = []
# WeightedAccuracy = []
# X = []

# i = 1
# for i in range(1, 5):
#     X.append(i)
    
#     KNN_Unweighted = KNN(X_all, y_all, i , False) # Unweighted KNN
#     yHatValid = KNN_Unweighted.predict(X_all)
#     UnweightedConfusionMatrix = confusion_matrix(y_all, yHatValid)
    
#     Accuracy = (np.sum(np.diag(UnweightedConfusionMatrix)))/UnweightedConfusionMatrix.sum()
# #     error = 1 - Accuracy
    
#     UnweightedAccuracy.append(Accuracy)
#     print("Un: ", UnweightedAccuracy)
    
    
#     KNN_Weighted = KNN(X_all, y_all, i , True) # Weighted
#     yHatValid = KNN_Weighted.predict(X_all)
#     WeightedConfusionMatrix = confusion_matrix(y_all, yHatValid)
    
#     Accuracy = (np.sum(np.diag(WeightedConfusionMatrix)))/WeightedConfusionMatrix.sum()
# #     error = 1 - Accuracy    
    
#     WeightedAccuracy.append(Accuracy)
#     print("W: ", WeightedAccuracy)



In [None]:
### Part A: KNN
### Scale the input data to zero mean and unit variance, then keep the first 200 principal components
from sklearn import preprocessing
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
scaler = preprocessing.StandardScaler().fit(X_all)
x_all_scaled = pd.DataFrame(scaler.transform(X_all))
pca = PCA(n_components=200)
X_all_scaled_pca = pca.fit(x_all_scaled).transform(x_all_scaled)

In [None]:
modelCV1 = KNeighborsClassifier(n_neighbors = 12,weights = 'distance') #best parameters found using gridsearch
results11 = cross_validate(modelCV1, X_all, y_all, cv=10, scoring= 'accuracy', return_train_score=False)
results12 = cross_validate(modelCV1, x_all_scaled, y_all, cv=10, scoring= 'accuracy', return_train_score=False)
results13 = cross_validate(modelCV1, X_all_scaled_pca, y_all, cv=10, scoring= 'accuracy', return_train_score=False)
print("Average Accuracy without scaling= ", results11['test_score'].mean(), ' std= ' ,results11['test_score'].std(), "Max= ",results11['test_score'].max(), "Min= ", results11['test_score'].min())
print("Average Accuracy with scaling= ", results12['test_score'].mean(), ' std= ' ,results12['test_score'].std(), "Max= ",results12['test_score'].max(), "Min= ", results12['test_score'].min() )
print("Average Accuracy after applying PCA= ", results13['test_score'].mean(), ' std= ' ,results13['test_score'].std(), "Max= ",results13['test_score'].max(), "Min= ", results13['test_score'].min() )

In [None]:
### Hyperparameter search: sklearn's GridSearchCV evaluates every combination of the supplied
### hyperparameter values with cross-validation and reports the best-scoring combination.

In [None]:
# from sklearn.model_selection import GridSearchCV
# parameters = {'n_neighbors': np.arange(6, 12, 2),'weights':('uniform', 'distance')}
# scoring = {'Accuracy': 'accuracy', 'Log_loss': 'neg_log_loss'}
# gs = GridSearchCV(KNeighborsClassifier(), return_train_score=True,param_grid=parameters, scoring=scoring, cv=10, refit='Accuracy')
# gs.fit(X_all_scaled_pca,y_all)
# results = gs.cv_results_
# print("best params: " + str(gs.best_estimator_))
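
The validation study for K can be made explicit with a validation curve. The following is a minimal sketch, not the search actually run above: it assumes a stand-in data set (a slice of sklearn's bundled digits, chosen only so the cell runs on its own) and puts the scaler inside a pipeline so that it is re-fit on the training part of every fold. The names `X_demo`, `y_demo`, `knn_pipe`, and `k_values` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): a slice of sklearn's digits set, not Fashion MNIST
X_demo, y_demo = load_digits(return_X_y=True)
X_demo, y_demo = X_demo[:600], y_demo[:600]

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(weights="distance"))

k_values = np.arange(1, 16, 2)
train_scores, valid_scores = validation_curve(
    knn_pipe, X_demo, y_demo,
    param_name="kneighborsclassifier__n_neighbors",
    param_range=k_values,
    cv=5,
    scoring="accuracy",
)

# One row per K, one column per fold
mean_valid = valid_scores.mean(axis=1)
std_valid = valid_scores.std(axis=1)
for k, m, s in zip(k_values, mean_valid, std_valid):
    print("K = {:2d}: validation accuracy = {:.3f} +/- {:.3f}".format(k, m, s))

best_k = int(k_values[np.argmax(mean_valid)])
print("Best K on the stand-in data:", best_k)
```

Plotting `mean_valid` against `k_values` (with `std_valid` as error bars) would give the kind of graphic the problem statement asks for.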

**Part B**: Construct a Linear Support Vector Machine classifier to make predictions on the data. 

In [None]:
from sklearn import svm
modelCV2 = svm.SVC(kernel='linear')
results21 = cross_validate(modelCV2, X_all, y_all, cv=10, scoring= 'accuracy', return_train_score=False)
results22 = cross_validate(modelCV2, x_all_scaled, y_all, cv=10, scoring= 'accuracy', return_train_score=False)
print("Average Accuracy without scaling= ", results21['test_score'].mean(), ' std= ' ,results21['test_score'].std(), "Max= ",results21['test_score'].max(), "Min= ", results21['test_score'].min())
print("Average Accuracy with scaling= ", results22['test_score'].mean(), ' std= ' ,results22['test_score'].std(), "Max= ",results22['test_score'].max(), "Min= ", results22['test_score'].min() )

In [None]:
### Hyperparameter search: sklearn's GridSearchCV evaluates every combination of the supplied
### hyperparameter values with cross-validation and reports the best-scoring combination.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
parameters = {'C': np.arange(1, 7, 1)}
scoring = {'Accuracy': 'accuracy', 'Log_loss': 'neg_log_loss'}
gs1 = GridSearchCV(svm.SVC(kernel='linear',probability=True), return_train_score=True,param_grid=parameters, scoring=scoring, cv=5, refit='Accuracy')
gs1.fit(x_all_scaled,y_all)
results = gs1.cv_results_
print("best params: " + str(gs1.best_estimator_))
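
The regularization strength `C` is often searched on a logarithmic scale rather than a linear one, because its effect spans orders of magnitude. Below is a hedged sketch of such a search. It swaps in `LinearSVC` (a different solver from `svm.SVC(kernel='linear')`) and a stand-in digits subset so that it runs quickly; `X_demo`, `svm_pipe`, and `c_grid` are illustrative names, and the grid bounds are an assumption.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Stand-in data (assumption): a slice of sklearn's digits set, not Fashion MNIST
X_demo, y_demo = load_digits(return_X_y=True)
X_demo, y_demo = X_demo[:600], y_demo[:600]

# dual=False solves the primal problem, which suits n_samples > n_features
svm_pipe = make_pipeline(StandardScaler(), LinearSVC(dual=False))

# Log-spaced candidates for C: 0.001, 0.01, 0.1, 1, 10
c_grid = {"linearsvc__C": np.logspace(-3, 1, 5)}

svm_search = GridSearchCV(svm_pipe, param_grid=c_grid, cv=5, scoring="accuracy")
svm_search.fit(X_demo, y_demo)

for c, score in zip(c_grid["linearsvc__C"], svm_search.cv_results_["mean_test_score"]):
    print("C = {:g}: mean CV accuracy = {:.3f}".format(c, score))
print("Best C on the stand-in data:", svm_search.best_params_["linearsvc__C"])
```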

**Part C**: Construct a Feed-Forward Neural Network classifier to make predictions on the data. We recommend using Sklearn's [MLPClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier) rather than the code you wrote in Homework 4. In our experiments we found training an MLPClassifier to take no more than a minute for reasonable choices of architectures. 

In [None]:
from sklearn.neural_network import MLPClassifier
modelCV3 = MLPClassifier(hidden_layer_sizes=(784,))  # one hidden layer with 784 units
results31 = cross_validate(modelCV3, X_all, y_all, cv=5, scoring= 'accuracy', return_train_score=False)
results32 = cross_validate(modelCV3, x_all_scaled, y_all, cv=5, scoring= 'accuracy', return_train_score=False)
# results33 = cross_validate(modelCV3, X_all_scaled_pca, y_all, cv=10, scoring= 'accuracy', return_train_score=False)
print("Average Accuracy without scaling= ", results31['test_score'].mean(), ' std= ' ,results31['test_score'].std(), "Max= ",results31['test_score'].max(), "Min= ", results31['test_score'].min())
print("Average Accuracy with scaling= ", results32['test_score'].mean(), ' std= ' ,results32['test_score'].std(), "Max= ",results32['test_score'].max(), "Min= ", results32['test_score'].min() )
# print("Average Accuracy after applying PCA= ", results33['test_score'].mean(), ' std= ' ,results33['test_score'].std(), "Max= ",results33['test_score'].max(), "Min= ", results33['test_score'].min() )

In [None]:
### Running a grid search over neural network architectures can be a very costly operation.
### Since the data has 784 features, I used a single hidden layer with 784 units.
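
A full architecture search is expensive, but a small grid over a few hidden-layer sizes and regularization strengths is usually affordable. The sketch below is an illustration under assumptions, not the procedure used above: it runs on a stand-in digits subset, and the candidate sizes and `alpha` values are arbitrary choices.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): a slice of sklearn's digits set, not Fashion MNIST
X_demo, y_demo = load_digits(return_X_y=True)
X_demo, y_demo = X_demo[:600], y_demo[:600]

mlp_pipe = make_pipeline(
    StandardScaler(),
    MLPClassifier(max_iter=300, random_state=0),
)

# Hypothetical candidates: two single-layer nets, one two-layer net, two L2 penalties
mlp_grid = {
    "mlpclassifier__hidden_layer_sizes": [(32,), (64,), (32, 32)],
    "mlpclassifier__alpha": [1e-4, 1e-2],
}

mlp_search = GridSearchCV(mlp_pipe, param_grid=mlp_grid, cv=3, scoring="accuracy")
mlp_search.fit(X_demo, y_demo)

print("Best setting on the stand-in data:", mlp_search.best_params_)
print("Mean CV accuracy: {:.3f}".format(mlp_search.best_score_))
```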


**Part D**: Which of the three models above performed the best on the data set?  Were you surprised or not surprised by your results?  Discuss. 

In [None]:
# Neural networks gave the best accuracy, as expected.
# When facing a new ML problem, we need to decide which algorithm to apply. I used the following rules of thumb:
# 1) Regression algorithms: the number of features is small and the number of training records is large.
# 2) SVM: the number of features is large and the number of training records is small.
# 3) Neural network: the number of features is large and the number of training records is large as well.
# In this case we have 784 features and 7.5k records, which is why the neural network gives better accuracy.


**Part E**: For the best model you identified in **Part D**, perform a train-validation split and construct a confusion matrix based on predictions on the validation set.  Which classes tend to get confused with each other the most? Are there any classes for which your model performs exceptionally well?  Plot at least one misclassified example from each of the often-confused classes and suggest reasons why this behavior might occur.   

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X_all, y_all, test_size=0.15, random_state=42)
from sklearn.neural_network import MLPClassifier
model = MLPClassifier(hidden_layer_sizes=(784,))  # one hidden layer with 784 units
model.fit(x_train, y_train)
print(model.score(x_train, y_train))
predicted = model.predict(x_test)
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, predicted,labels = [0, 1, 4, 7, 9]))

In [None]:
confusion_matrix(y_test, predicted,labels = [0, 1, 4, 7, 9] )

In [None]:
### The confusion matrix shows that classes 7 (sneakers) and 9 (ankle boots) are confused with each other most often.
### Class 4 (coats) performs exceptionally well.
### The shapes of classes 7 and 9 are very similar, hence they are the most confused.
index7 = np.where(y_all==7)
index9 = np.where(y_all==9)
print(index7)
print(index9)
view_digit(X_all[5])
view_digit(X_all[14])
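
To plot examples that were actually misclassified, the validation predictions can be compared with the true labels. The following sketch shows one way to locate the most confused (true, predicted) pair and the validation rows that fall into it. It assumes a stand-in digits data set and a KNN classifier purely so that it runs on its own; `off_diag` and `confused_idx` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data and model (assumptions), not the notebook's Fashion MNIST MLP
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.8, random_state=42, stratify=y_demo)

clf = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
y_pred = clf.predict(X_val)

labels = np.unique(y_demo)
C = confusion_matrix(y_val, y_pred, labels=labels)

# Zero the diagonal so that only mistakes remain, then find the largest entry
off_diag = C.copy()
np.fill_diagonal(off_diag, 0)
i, j = np.unravel_index(np.argmax(off_diag), off_diag.shape)
true_label, pred_label = labels[i], labels[j]

# Validation rows with that true label that received that wrong prediction
confused_idx = np.where((y_val == true_label) & (y_pred == pred_label))[0]
print("Most confused pair: true {} predicted as {} ({} cases)".format(
    true_label, pred_label, len(confused_idx)))
```

With the notebook's own split, a row such as `x_test[confused_idx[0]]` could then be passed to `view_digit` to display a misclassified image.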

### [30 points] Problem 2: Predicting Authors of Presidential Election Tweets 
***

For the first time in history, the run-up to the 2016 presidential election saw candidates move a large portion of their campaigns from the traditional debating lectern to the Twitterverse. In this problem you will construct various classifiers to predict whether a tweet was sent by @HillaryClinton ($y=0$) or @realDonaldTrump ($y=1$). 

The data set contains $4000$ tweets that have been cleaned by converting all text to lowercase, removing punctuation, and removing hypertext links. In order to preserve hashtags we've replaced the typical # with the string `hashtag` (e.g. `#GiantMeteor` would be converted to `hashtaggiantmeteor`).  

Execute the following cell to load the data. 

In [4]:
with gzip.open('clean_tweets.pklz','rb') as f:
    text_all, y_all = pickle.load(f)

**Part A**: Vectorize the text features using the Bag-of-Words text model **while removing stop words**.  Then answer the following questions: 

- How many distinct text features are there in the data after stop words are removed? 
- How many distinct **HashTags** are there in the data? 
- Which candidate uses HashTags the most frequently? 

In [5]:
allstopwords = {}
allfeatures = {}
total_hashtags = 0
hillary_hashtags = 0
trump_hashtags = 0

# Load stop words (one per line; single-character entries are skipped)
with open('stopwords.txt','r') as f:
    for line in f:
        word = line.strip()
        if len(word) > 1:
            allstopwords[word] = True

# Bag of words: count every non-stop word and tally "hashtag" occurrences per author
for row, label in zip(text_all, y_all):
    cc = row.lower().count("hashtag")
    total_hashtags += cc
    # Labels follow the problem statement: y=0 is @HillaryClinton, y=1 is @realDonaldTrump
    if str(label) == '1':
        trump_hashtags += cc
    else:
        hillary_hashtags += cc
    for word in row.split():
        if word not in allstopwords:
            allfeatures[word] = allfeatures.get(word, 0) + 1
print("Number of distinct text features in the data after stop words are removed = ", len(allfeatures))

# Map the n_top_features most frequent words to column indices (and back)
s = [(k, allfeatures[k]) for k in sorted(allfeatures, key=allfeatures.get, reverse=True)]
top_features = {}
inverse_top_features = {}
n_top_features = 3500
for count, (word, _) in enumerate(s[:n_top_features]):
    top_features[word] = count
    inverse_top_features[count] = word

print("Total Hashtags in the data = ", total_hashtags)
print("Hashtags for Hillary Clinton = ", hillary_hashtags)
print("Hashtags for Donald Trump = ", trump_hashtags)
if hillary_hashtags > trump_hashtags:
    print("Hillary Clinton uses hashtags most frequently!")
elif hillary_hashtags < trump_hashtags:
    print("Donald Trump uses hashtags most frequently!")
else:
    print("Both Donald Trump and Hillary Clinton use hashtags equally!")


Number of distinct text features in the data after stop words are removed =  7486
Total Hashtags in the data =  1062
Hashtags for Hillary Clinton =  258
Hashtags for Donald Trump =  804
Donald Trump uses hashtags most frequently!


In [6]:
# Prepare x_features
x_features = np.zeros((len(text_all),n_top_features))
for comment in range(0,len(text_all)):
    words = text_all[comment].split()
    wordcount={}
    for w in words:
        if(w in wordcount):
            wordcount[w] += 1
        else:
            wordcount[w] = 1
    for w in wordcount:
        if(w in top_features):
            x_features[comment][top_features[w]] = wordcount[w]

In [7]:
### There are 7486 distinct words in the dataset after stop-word removal.
### Using all of them as features is not a good idea: some words occur only once in the whole dataset,
### and such words have little power to classify the tweets.
### Trying [2000, 2500, 3000, 3500, 4000, 4500] as the number of features, I found that the best accuracy is achieved at 3500.

**Part B**: Construct a Logistic Regression classifier with L2 regularization to make predictions on the data. Exactly as in **Problem 1**, you should clearly detail your process for picking optimal hyperparameters and evaluating your model, and report the details of your best model along with final validation accuracy. 

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
modelCV4 = LogisticRegression(C=1.2000100000000002, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
results211 = cross_validate(modelCV4, x_features, y_all, cv=10, scoring= 'accuracy', return_train_score=False)
print("Average Accuracy using Logistic Regression=", results211['test_score'].mean(), ',std=' ,results211['test_score'].std(), ",Max=",results211['test_score'].max(), ",Min=", results211['test_score'].min())

Average Accuracy using Logistic Regression= 0.9225000000000001 ,std= 0.016007810593582125 ,Max= 0.9475 ,Min= 0.89


In [10]:
# This section uses grid search to find the best hyperparameters
from sklearn.model_selection import GridSearchCV
param_grid = {'C': np.arange(1e-05, 3, 0.1)}
scoring = {'Accuracy': 'accuracy', 'AUC': 'roc_auc', 'Log_loss': 'neg_log_loss'}
gs = GridSearchCV(LogisticRegression(), return_train_score=True,
                  param_grid=param_grid, scoring=scoring, cv=10, refit='Accuracy')
gs.fit(x_features, y_all)
results = gs.cv_results_
print("best params: " + str(gs.best_estimator_))
# best params found using gridsearch: LogisticRegression(C=1.2000100000000002, class_weight=None, dual=False,
#           fit_intercept=True, intercept_scaling=1, max_iter=100,
#           multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
#           solver='liblinear', tol=0.0001, verbose=0, warm_start=False) 

**Part C**: Determine and report the 10 words that are the best predictors for @HillaryClinton and the 10 words that are the best predictors for @realDonaldTrump in your Logistic Regression model. In addition, you should briefly discuss how you found these best features mathematically. 

**Part D**: Construct a Naive Bayes classifier to make predictions on the data. Again, you should clearly detail your process for picking optimal hyperparameters and evaluating your model, and report the details of your best model along with final validation accuracy. **Hint**: Since text features are discrete, you'll want to use Sklearn's [MultinomialNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) classifier. 

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate
modelCV5 = MultinomialNB(alpha = 0.85, fit_prior = True)
results221 = cross_validate(modelCV5, x_features, y_all, cv=10, scoring= 'accuracy', return_train_score=False)
print("Average Accuracy using Naive Bayes= ", results221['test_score'].mean(), ' std= ' ,results221['test_score'].std(), "Max= ",results221['test_score'].max(), "Min= ", results221['test_score'].min())

In [None]:
# This section uses grid search to find the best hyperparameters
from sklearn.model_selection import GridSearchCV
# fit_prior takes real booleans; the strings 'True' and 'False' are both truthy
param_grid = {'alpha': np.arange(0.1, 1.2, 0.2), 'fit_prior': (True, False)}
scoring = {'Accuracy': 'accuracy', 'AUC': 'roc_auc', 'Log_loss': 'neg_log_loss'}
gs = GridSearchCV(MultinomialNB(), return_train_score=True,
                  param_grid=param_grid, scoring=scoring, cv=10, refit='Accuracy')
gs.fit(x_features, y_all)
results = gs.cv_results_
print("best params: " + str(gs.best_estimator_))
# best params found: MultinomialNB(alpha=0.30000000000000004, class_prior=None, fit_prior='True')

**Part E**: Determine and report the 10 words that are the best predictors for @HillaryClinton and the 10 words that are the best predictors for @realDonaldTrump in your Naive Bayes model. In addition, you should briefly discuss how you found these best features mathematically. 

In [22]:
### Recursive Feature Elimination (RFE) is a common feature selection approach.
### I used RFE on the 200 most frequent words to select the 10 features with the
### highest potential to classify a tweet (overall, not per candidate).
from sklearn.naive_bayes import MultinomialNB
x_features_top200 = x_features[:,0:200]
from sklearn.feature_selection import RFE
model = MultinomialNB()
# create the RFE model and select 10 attributes
rfe = RFE(model, n_features_to_select=10)
rfe = rfe.fit(x_features_top200, y_all)
ranks = rfe.ranking_
most_predictive={}
for index in range(0,200):
    if(ranks[index] <= 10):
        most_predictive[index] = ranks[index]
s = [(k, most_predictive[k]) for k in sorted(most_predictive, key=most_predictive.get, reverse=False)]
print("Top 10 predictive features in case of Naive bayes are:")
for i in range(1,11):
    print(inverse_top_features[s[i][0]])
class TweetFeaturizer2:
    def __init__(self):
        
        from sklearn.feature_extraction.text import CountVectorizer
        
        self.vectorizer = CountVectorizer()
        
    def add_text_features(self, examples):
        """
        Method for looping over original text and adding new text 
        features. 
        :param examples: the list of raw tweets 
        """
        
        new_examples = [] 
        for ex in examples:
            # here is where you might try to add new features 
            # currently this does nothing.  
            new_examples.append(ex)
            
        return new_examples

    def build_train_features(self, examples):
        """
        Method to take in training text features and do further feature engineering 
        Most of the work in this homework will go here, or in similar functions  
        :param examples: the list of raw tweets 
        """
        
        new_examples = self.add_text_features(examples)
        return self.vectorizer.fit_transform(new_examples)

    def get_test_features(self, examples):
        """
        Method to take in test text features and transform the same way as train features 
        :param examples: the list of raw tweets
        """
        new_examples = self.add_text_features(examples)
        return self.vectorizer.transform(new_examples)

    def show_top10(self):
        """
        prints the top 10 features for the positive class and the 
        top 10 features for the negative class. 
        """
        feature_names = np.asarray(self.vectorizer.get_feature_names())
        top10 = np.argsort(self.NB.coef_[0])[-10:]
        bottom10 = np.argsort(self.NB.coef_[0])[:10]
        print("10 words that are the best predictors for @DonaldTrump: %s" % " ".join(feature_names[top10]))
        print("10 words that are the best predictors for @HillaryClintonHC: %s" % " ".join(feature_names[bottom10]))
                
    def train_model(self, random_state=1234):
        """
        Method to read in training data from file, and 
        train a Multinomial Naive Bayes classifier. 
        
        :param random_state: seed for random number generator (unused by MultinomialNB) 
        """
        
        from sklearn.naive_bayes import MultinomialNB
        
        # load data 
        with gzip.open('raw_tweets_train.pklz','rb') as f:
            text_all, y_all = pickle.load(f)
        
        # get training features and labels 
        self.X_train = self.build_train_features(text_all)
        self.y_train = y_all
        
        # train the Naive Bayes model 
        self.NB = MultinomialNB()
        self.NB.fit(self.X_train, self.y_train)
        

# Instantiate the class 
feat = TweetFeaturizer2()

# Train your NB 
feat.train_model(random_state=1234)

# Show the top 10 features for each class 
feat.show_top10()


Top 10 predictive features in case of Naive bayes are:
hillary_esp
weve
billclinton
flotus
joebiden
thebriefing2016
deserve
timkaine
hfa
kids
10 words that are the best predictors for @DonaldTrump: will of is you in and to https co the
10 words that are the best predictors for @HillaryClintonHC: ⁰⁰i rescue foot footage requires repudiate republicano republicana republic forcefully


**Part F**: Which of the two models above performed the best on the data set?  Were you surprised or not surprised by your results?  Discuss. 

In [None]:
# Naive Bayes performed better than Logistic Regression. I am not surprised by this; it was expected. Naive Bayes is
# commonly used for text classification tasks such as spam detection, topic mining, and categorization, and this problem
# is a form of text classification.

### [25 points] Problem 3: Feature Engineering and Presidential Tweets 
***

In this problem you will again work with the Twitter election data from **Problem 2**, but this time in its unadulterated raw form. Unlike in **Problem 2**, you will only be allowed to use Logistic Regression as your classifier.  Instead of using a fancier model, you will attempt to improve performance by crafting better features.  One way you might do this is to explore text models that are more sophisticated than simple Bag-of-Words. Alternatively, you might explore the training data and identify characteristics of tweets by a particular author that you can then turn into a feature. 

The class `TweetFeaturizer` shown below is already fully functional.  Your goal in this problem is to make it better.  In its current state, the class reads in the training and test data, fits a Logistic Regression model using Bag-of-Words, makes predictions on the test set, and then dumps the predictions to a csv file that can be uploaded to Kaggle. You are free to modify this class in any way that you see fit, but we've given you some helpful functionality that will prove sufficient for most of you.  The `add_text_features` method currently loops over each tweet in the data set, copies it to a new array, and then passes that array into the text vectorizer.  One way to create new features is to append distinct word-indicators onto the string representing the tweet.  These will then be turned into features by the vectorizer. 

As an example (that is intentionally silly and probably unhelpful): Suppose you think a potentially helpful feature is whether or not the tweet contains more than 10 instances of the letter `z`.  In `add_text_features` you could count the number of `z`'s in a tweet and if there are more than 10, you could append the word `MoreThanTenZs` to the tweet.  Then, when the tweet is passed into the vectorizer, this will turn into a numerical feature.  
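
The `MoreThanTenZs` example could look roughly like the sketch below. It is written as a standalone function rather than a method so that it runs on its own, it counts `z` case-insensitively (an assumption), and the two toy tweets are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

def add_text_features(examples):
    """Append the indicator word MoreThanTenZs to tweets with more than 10 z's."""
    new_examples = []
    for ex in examples:
        if ex.lower().count("z") > 10:
            ex = ex + " MoreThanTenZs"
        new_examples.append(ex)
    return new_examples

# Hypothetical tweets: the first has 13 z's, the second has none
toy_tweets = ["zzzzzzzzzzzz what a snooze", "a perfectly ordinary tweet"]
augmented = add_text_features(toy_tweets)

# The vectorizer lowercases tokens, so the indicator becomes the feature 'morethantenzs'
vectorizer = CountVectorizer()
X_toy = vectorizer.fit_transform(augmented)

print(augmented)
print("Indicator is a feature:", "morethantenzs" in vectorizer.vocabulary_)
```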

In addition to competing against yourself to craft the best features that you can, you'll also compete against your classmates in a Kaggle competition.  The competition page can be found here: 

https://www.kaggle.com/c/4622-election-tweet-authorship

A private invite link will be available on Piazza which will get you into the competition. Note that the test data has been partitioned into a public leaderboard set and a private leaderboard set.  While the competition is open, submitting to Kaggle will tell you your score on the public leaderboard.  Your scores on the private leaderboard will become available at the end of the competition.   The top **THREE** students on the **Private** leaderboard at the end of the competition will earn 10 extra credit points on the Practicum. Note that to prevent the machine learning-equivalent of button mashing, we've limited you to **10** submissions per day.  You should be evaluating your features locally with cross-validation and then submitting to Kaggle when you think you have something that works.  

**Part A**: **Feature Engineering**:  What you need to do: 

- Explore and experiment with the data to try to find good features 
- Implement these features in the `TweetFeaturizer` class  
- Implement some evaluation methods to see how well your features improve your model (*cough* cross-validation *cough*) 
- Make submissions to the Kaggle competition and see how you compared to your classmates 
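
A local evaluation loop can be as simple as cross-validating the same Logistic Regression pipeline with and without a candidate feature. The sketch below assumes a hypothetical feature (an appended `hasexclaim` token for tweets containing `!`, which the default tokenizer would otherwise discard) and a handful of made-up tweets, so the scores it prints say nothing about the real data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def add_exclaim_token(examples):
    """Hypothetical feature: flag tweets that contain an exclamation mark."""
    return [ex + " hasexclaim" if "!" in ex else ex for ex in examples]

# Made-up raw tweets; label 1 = Trump, label 0 = Clinton
toy_tweets = [
    "We will win big!", "Crooked media again!", "Thank you Ohio!",
    "Great rally tonight!", "Make it great again", "Jobs are coming back!",
    "Families deserve better", "We are stronger together", "Kids need good schools",
    "Join us in Iowa!", "Health care is a right", "Every vote matters",
]
toy_labels = np.array([1] * 6 + [0] * 6)

def cv_accuracy(texts):
    # Vectorizer and classifier are re-fit inside every fold
    model = make_pipeline(CountVectorizer(), LogisticRegression())
    return cross_val_score(model, texts, toy_labels, cv=3, scoring="accuracy")

baseline_scores = cv_accuracy(toy_tweets)
engineered_scores = cv_accuracy(add_exclaim_token(toy_tweets))

print("Baseline   mean CV accuracy: {:.3f}".format(baseline_scores.mean()))
print("Engineered mean CV accuracy: {:.3f}".format(engineered_scores.mean()))
```

Because this feature is computed from each tweet alone, it can safely be added before the split; features that depend on the labels need to be built inside each training fold.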

In [None]:
# In this part we add new marker words to each tweet to help our logistic regression model improve its performance.
# The questions are: how should we add words to a tweet, and which words should be added?
# To improve the accuracy of the model, the plan is to guess the class of a tweet with a simple rule and append
# distinctive words to it so that logistic regression can pick up on them.

# How do we guess the class of a tweet?
# We are given the function show_top10. It shows the words that are the strongest indicators of the Hillary Clinton class
# and the words that are the strongest indicators of the Donald Trump class. Depending on which of these words are present, we append new words to the tweet.

# Which words should be added to a tweet?
# I keep a list of the top 100 words of each class, obtained using show_top10().
# For each tweet I count how many of its words appear in each top-100 list.
# Ex. 1: say 10 Trump indicators and 4 Clinton indicators occur in a tweet.
# Then we append the word "dtindicator" (10 - 4) + 1 = 7 times to the end of the tweet.
# Ex. 2: say 7 Trump indicators and 11 Clinton indicators occur in the same tweet.
# Then we append the word "hcindicator" (11 - 7) + 1 = 5 times to the end of the tweet.
# If the two counts are equal, nothing is appended.
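
The counting rule can be sketched on its own as follows. This is an illustrative stand-alone version: the function name `indicator_markers` and the tiny word sets are hypothetical, and the repeat count (difference plus one) follows the loops in `add_text_features` in this notebook.

```python
def indicator_markers(tweet, dt_words, hc_words):
    """Return the list of marker words to append to a cleaned tweet."""
    tokens = tweet.split()
    dt_count = sum(1 for t in tokens if t in dt_words)
    hc_count = sum(1 for t in tokens if t in hc_words)
    diff = dt_count - hc_count
    if diff > 0:
        # More Trump indicators: repeat the marker (difference + 1) times.
        return ["dtindicator"] * (diff + 1)
    if diff < 0:
        # More Clinton indicators: same rule with the other marker.
        return ["hcindicator"] * (-diff + 1)
    # Tie: no marker is added.
    return []


# Tiny hypothetical word sets, far shorter than the real top-100 lists.
dt_words = {"crooked", "maga", "rally"}
hc_words = {"potus", "flotus"}

markers = indicator_markers("crooked maga rally potus tonight",
                            dt_words, hc_words)
print(markers)
```

Repeating the marker makes its count in the bag-of-words vector grow with the margin between the two indicator counts, so a lopsided tweet carries a stronger signal than a near-tie.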


**Part B**: **Motivation and Analysis**: What you need to do: 

Convince me that:

- Your new features work
- You understand what the new features are doing
- You had a clear methodology for incorporating the new features


In [None]:
# Apart from the processing described above, I clean the data with the following steps:
# 1) convert to lowercase
# 2) remove URLs
# 3) remove punctuation and digits
# 4) lemmatize
# 5) remove stopwords
# I split the input data into 95% for training and 5% for testing and got 92.5% accuracy on the held-out 5%.
# The new marker words give the classifier a direct hint about the class.
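
As a sketch of the URL-removal step using only the standard library: the pattern below matches an `http` or `https` URL anywhere in the text. It is an illustrative pattern, not necessarily the one used in the class below, and `strip_urls` is a hypothetical helper name.

```python
import re

# Match "http://" or "https://" followed by any run of non-space characters.
URL_PATTERN = re.compile(r"https?://\S+")


def strip_urls(text):
    """Remove URLs wherever they appear, then tidy the whitespace."""
    without_urls = URL_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", without_urls).strip()


print(strip_urls("Join us tonight https://t.co/abc123 at 7pm"))
# prints: Join us tonight at 7pm
```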

In [87]:
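import gzip
import operator
import pickle
import re

import nltk
import numpy as np
import pandas as pd
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split

nltk.download('wordnet')

# Load the stopword list (one word per line); words of two characters or fewer are skipped.
allstopwords = set()
with open('stopwords.txt', 'r') as f:
    for line in f:
        word = line.strip()
        if len(word) > 2:
            allstopwords.add(word)
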
class TweetFeaturizer:
    def __init__(self):
        
        from sklearn.feature_extraction.text import CountVectorizer
        
        self.vectorizer = CountVectorizer()
    def remove_stopwords(self, tweet):
        """Lemmatize each word, then drop stopwords and single-character tokens."""
        lemmatizer = WordNetLemmatizer()
        words = tweet.split()
        outwords = []
        for w in words:
            ww = lemmatizer.lemmatize(w)
            if ww not in allstopwords and len(ww) > 1:
                outwords.append(ww)
        return " ".join(outwords)
    def clean_tweet(self, ex):
        """Lowercase a tweet and replace URLs, punctuation, and digits with spaces."""
        ex = ex.lower()
        # Note: '^' with re.MULTILINE only matches URLs that start a line;
        # URLs in the middle of a line are left for the punctuation pass below.
        ex = re.sub(r'^https?:\/\/.*[\r\n]*', '', ex, flags=re.MULTILINE)
        # Replace the listed punctuation characters and all digits with a space.
        tmp_ex = re.sub(r'[!@#$%^&*(),./\\?0-9_\-+:;<>]', ' ', ex)
        # Collapse runs of spaces into a single space.
        tmp_ex = re.sub(' +', ' ', tmp_ex)
        return tmp_ex
    def add_text_features(self, examples):
        DT_indicators = {}
        HC_indicators = {}
        DT_top50 = "zteiwnqz beat tomorrow watching directly vote obama amazing deal weak failing touch system benghazi radical isi makeamericasafeagain bernie sander wonderful press donaldtrump dem newtgingrich hv join vega statement danscavino gopconvention looking primary anncoulter enjoy trade continue smart condolence bring mail melania interviewed totally trumptrain ltg erictrump morning interesting ted trumpocrats prayer tired supporter nice story bad drudge goofy win corrupt dtindicator imwithyou terrorism victim follower pm cruz rally foxnews safe teamtrump votetrump gop lyin washington record wisconsin ivankatrump via speech amp great delegate trumppence crookedhillary donaldjtrumpjr report york maga realdonaldtrump poll thank email cnn border clinton americafirst medium makeamericagreatagain crooked"
        HC_top50 = "http barack timkaine potus joebiden there let demsinphilly rt flotus ve hillary berniesanders idea billclinton donald voice dream good the this and willing re history friend business here mother chelseaclinton union proud ztggmmfhqg tax foreign help you thebriefing deserve prepared serve debatenight build demconvention lnrjcakpjw carrier pigeon ll company america college chief start ago what life running refusing week we pay lgbt gold who stronger inspired nationalvoterregistrationday disability child su mom real rncinclehttps can latino none greater gabbygiffords diplomacy election relationship make teamusa womensequalityday hfa parent muslim matter nc heart tweet white income paying enough enquirer corybooker they that elizabethforma"
        for w in DT_top50.split():
            DT_indicators[w] = True
        for w in HC_top50.split():
            HC_indicators[w] = True        
        new_examples = []
        for ex in examples:
            tmp_ex = self.clean_tweet(ex)
            # Count how many tokens belong to each indicator list.
            count_DT_indicators = 0
            count_HC_indicators = 0
            for w in tmp_ex.split():
                if w in DT_indicators:
                    count_DT_indicators += 1
                if w in HC_indicators:
                    count_HC_indicators += 1
            # Append the marker of the leading class (difference + 1) times;
            # on a tie, append nothing.
            diff = count_DT_indicators - count_HC_indicators
            if diff > 0:
                tmp_ex += " dtindicator" * (diff + 1)
            elif diff < 0:
                tmp_ex += " hcindicator" * (-diff + 1)
            # Lemmatize and drop stopwords last.
            tmp_ex = self.remove_stopwords(tmp_ex)
            new_examples.append(tmp_ex)
        return new_examples

    def build_train_features(self, examples):
        """
        Method to take in training text features and do further feature engineering 
        Most of the work in this homework will go here, or in similar functions  
        :param examples: the list of raw tweets 
        """
        
        new_examples = self.add_text_features(examples)
        return self.vectorizer.fit_transform(new_examples)

    def get_test_features(self, examples):
        """
        Method to take in test text features and transform the same way as train features 
        :param examples: the list of raw tweets
        """
        new_examples = self.add_text_features(examples)
        return self.vectorizer.transform(new_examples)

    def show_top10(self):
        """
        prints the top 10 features for the positive class and the 
        top 10 features for the negative class. 
        """
        feature_names = np.asarray(self.vectorizer.get_feature_names())
        top10 = np.argsort(self.logreg.coef_[0])[-10:]
        bottom10 = np.argsort(self.logreg.coef_[0])[:10]
        print("DT: %s" % " ".join(feature_names[top10]))
        print("HC: %s" % " ".join(feature_names[bottom10]))
                
    def train_model(self, random_state=1234):
        """
        Method to read in training data from file, and 
        train Logistic Regression classifier. 
        
        :param random_state: seed for random number generator 
        """
        
        from sklearn.linear_model import LogisticRegression 
        
        # load data 
        f = gzip.open('raw_tweets_train.pklz','rb')
        text_train, y_train = pickle.load(f)
        f.close()
        
        # get training features and labels 
        self.x = self.build_train_features(text_train)
        self.y = y_train
        # hold out 5% of the training data for a local accuracy check
        nx_train, nx_test, ny_train, ny_test = train_test_split(self.x, self.y, test_size=0.05, random_state=42)
        self.nx_train = nx_train
        self.ny_train = ny_train
        self.nx_test = nx_test
        self.ny_test = ny_test
        # train logistic regression model.  !!MUST USE LogisticRegression!! 
        self.logreg = LogisticRegression(random_state=random_state)
        self.logreg.fit(self.nx_train, self.ny_train)

    def model_predict(self):
        """
        Method to read in test data from file, make predictions
        using trained model, and dump results to file 
        """
        
        # read in test data 
        f = gzip.open('raw_tweets_test.pklz','rb')
        text_valid = pickle.load(f)
        f.close()
        
        # featurize test data 
        self.X_test = self.get_test_features(text_valid)
        
        # make predictions on test data 
        pred = self.logreg.predict(self.X_test)
        
        # dump predictions to file for submission to Kaggle  
        pd.DataFrame({"realDonaldTrump": np.array(pred, dtype=bool)}).to_csv("prediction.csv", index=True, index_label="Id")
        

# Instantiate the class 
feat = TweetFeaturizer()


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Akash\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [88]:
# Train your Logistic Regression classifier 
feat.train_model(random_state=1234)

In [89]:
# Show the top 10 features for each class 
feat.show_top10()
print(feat.logreg.score(feat.nx_test, feat.ny_test))

DT: border realdonaldtrump clinton email cnn americafirst makeamericagreatagain medium thank crooked
HC: barack timkaine potus let joebiden rt http demsinphilly there ve
0.925
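
The two lines above are produced by sorting the logistic regression coefficients in ascending order, which is why the strongest Trump indicator (`crooked`) appears last on the `DT` line and the strongest Clinton indicator (`barack`) appears first on the `HC` line. A plain-Python sketch of the same ranking, generalized to any `n`, is shown below; the function `top_features` and the coefficient values are hypothetical and only illustrate the sorting.

```python
def top_features(coef_by_word, n=3):
    """Return (most positive words, most negative words), strongest first."""
    # Sort words by coefficient, most negative first.
    ranked = sorted(coef_by_word, key=coef_by_word.get)
    most_positive = ranked[-n:][::-1]
    most_negative = ranked[:n]
    return most_positive, most_negative


# Hypothetical coefficients; real values would come from the fitted model.
coefs = {"crooked": 2.1, "thank": 1.4, "maga": 1.0,
         "the": 0.0, "potus": -1.2, "timkaine": -1.8}

dt_top, hc_top = top_features(coefs, n=2)
print("DT:", dt_top)
print("HC:", hc_top)
```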


In [90]:
# Make prediction on test data and produce Kaggle submission file 
feat.model_predict()