# Twitter Tweets Classification Project

- Add the ability to classify tweets with llama3
- Use word2vec next time for word embeddings instead of TF-IDF

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
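import sklearn
# A bare `import sklearn` may not load these submodules, so import them explicitly
import sklearn.feature_extraction.text
import sklearn.metrics
import sklearn.model_selection
import sklearn.svm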
import warnings
import re
from IPython.display import display
warnings.filterwarnings('ignore')
%matplotlib inline

# Classifying tweets

In this project, I analyze Twitter data extracted using the [Twitter API](https://dev.twitter.com/overview/api). The data contains tweets posted by the following six Twitter accounts: `realDonaldTrump`, `mike_pence`, `GOP`, `HillaryClinton`, `timkaine`, `TheDemocrats`.

For every tweet, there are two pieces of information:
- `screen_name`: the Twitter handle of the user tweeting and
- `text`: the content of the tweet.

The tweets are split into a training set and a test set, each stored in a CSV file. The training set provides both the `screen_name` and `text` attributes; in the test set, `screen_name` is hidden.

The overarching goal is to predict the political inclination (Republican/Democratic) of a Twitter user from one of their tweets. The ground truth (i.e., the true class label) is determined from the `screen_name` of the tweet as follows:
- `realDonaldTrump, mike_pence, GOP` are Republicans
- `HillaryClinton, timkaine, TheDemocrats` are Democrats

Thus, this is a binary classification problem. 

The problem proceeds in three stages: text processing, feature construction, and tweet classification using an SVM classifier.
- **Text processing**: We will clean up the raw tweet text using the various functions offered by the [nltk](http://www.nltk.org/genindex.html) package.
- **Feature construction**: In this part, we will construct bag-of-words feature vectors and training labels from the processed text of the tweets and the `screen_name` column, respectively.
- **Classification**: Using the derived features, we will use the [sklearn](http://scikit-learn.org/stable/modules/classes.html) package to learn a model that classifies the tweets as desired.


I will be using two Python packages in this problem, `nltk` and `sklearn`, both of which should be available with Anaconda. However, NLTK comes with many corpora, toy grammars, trained models, etc., which have to be downloaded manually. This project requires NLTK's stopwords list, POS tagger, and WordNetLemmatizer. I install them using:

In [None]:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
# nltk.word_tokenize() also needs the Punkt tokenizer models
nltk.download('punkt')
nltk.download('omw-1.4')

lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
stopwords = nltk.corpus.stopwords.words('english')

## 1. Text Processing

Build a function which processes and tokenizes raw text. The generated list of tokens should meet the following specifications:
1. The tokens must all be in lower case.
2. The tokens should appear in the same order as in the raw text.
3. The tokens must be in their lemmatized form. If a word cannot be lemmatized (i.e., you get an exception), simply catch it and ignore it. These words will not appear in the token list.
4. The tokens must not contain any punctuation. Punctuation should be handled as follows: (a) Apostrophes of the form `'s` must be ignored, e.g., `She's` becomes `she`. (b) Other apostrophes should be omitted, e.g., `don't` becomes `dont`. (c) Words must be broken at hyphens and other punctuation.
5. The tokens must not contain any part of a url.
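
The case, punctuation, and URL rules above can be checked in isolation before the lemmatizer is involved. The following is a minimal sketch using only the standard `re` module. The helper name `strip_urls_and_punctuation` is hypothetical, and the patterns are simplified assumptions (for instance, `[^\w\s]` also drops Unicode symbols such as `…`), so they are not necessarily the exact patterns used in `process()`.

```python
import re

def strip_urls_and_punctuation(text):
    """Hypothetical helper: apply rules 1, 4 and 5 without lemmatizing."""
    text = text.lower()                        # rule 1: lower case
    text = re.sub(r"https?://\S+", "", text)   # rule 5: drop URLs
    text = re.sub(r"'s\b", "", text)           # rule 4a: ignore 's
    text = text.replace("'", "")               # rule 4b: omit other apostrophes
    text = re.sub(r"[^\w\s]|_", " ", text)     # rule 4c: break at punctuation
    return text.split()                        # order is preserved (rule 2)

print(strip_urls_and_punctuation(
    "She's sure you don't lack self-confidence! http://t.co/dummyurl"))
# ['she', 'sure', 'you', 'dont', 'lack', 'self', 'confidence']
```

The order of the substitutions matters: URLs have to go before punctuation is replaced, otherwise fragments such as `http` and `t` would survive as tokens.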

For `lemmatize()` to give me the root form of a word, I have to provide the context in which to lemmatize it through the `pos` parameter: `lemmatizer.lemmatize(word, pos=SOMEVALUE)`. The context is the part of speech (POS) of that word. I don't need to manually write out the lexical category of each word because [nltk.pos_tag()](https://www.nltk.org/book/ch05.html) does this for me; I then use the results from `pos_tag()` for the `pos` parameter.
However, the POS tag returned by `pos_tag()` is in a different format from the `pos` value that the lemmatizer expects:

> pos (syntactic category): n for noun files, v for verb files, a for adjective files, r for adverb files.

I therefore need to map these tags appropriately. `nltk.help.upenn_tagset()` provides a description of each tag returned by `pos_tag()`, which is helpful for building the mapping.

## Part1 (Base function):

In [None]:
# Converting part of speech tag from nltk.pos_tag to word net compatible format
# Simple mapping based on first letter of return tag to make grading consistent
# Everything else will be considered noun 'n'
posMapping = {
# "First_Letter by nltk.pos_tag":"POS_for_lemmatizer"
    "N":'n',
    "V":'v',
    "J":'a',
    "R":'r'
}

def process(text, lemmatizer=nltk.stem.wordnet.WordNetLemmatizer()):
    """ Normalizes case and handles punctuation
    Inputs:
        text: str: raw text
        lemmatizer: an instance of a class implementing the lemmatize() method
                    (the default argument is of type nltk.stem.wordnet.WordNetLemmatizer)
    Outputs:
        list(str): tokenized text
    """

    # Step 1: Standardizing tokens to lower case
    text = text.lower()

    # Step 2: Capturing and removing URL using regular expressions.
    text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)

    ######################################################
    # Step 3: Removing all punctuations with regards
    # to the rules given.

    # Remove 's
    text = re.sub(r"'s", "", text)

    # Removing apostrophes
    text = re.sub(r"[']", "", text)

    # Removing any left punctuation and breaking word by space
    text = re.sub(r"[!\"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]", " ", text)
    ######################################################
    
    # Tokenizing the text using nltk.word_tokenize()
    tokenized_text = nltk.word_tokenize(text)

    # Part-of-speech-tagging every token in tokenized_text
    pos_tagged_tokens = nltk.pos_tag(tokenized_text)

    # Converting the pos tagged tokens to wordnet format for
    # standardization and so they can be used in the lemmatize()
    # function.
    pos_tagged_tokens = [[tag[0], posMapping.get(tag[1][0], 'n')] for tag in pos_tagged_tokens]
    
    # Creating a list of lemmas to return through lemmatizer.
    list_of_lemmas = []
    for pos_tags in pos_tagged_tokens:
        try:
            list_of_lemmas.append(lemmatizer.lemmatize(pos_tags[0], pos_tags[1]))
        except Exception:
            continue


    return list_of_lemmas

Testing the function to see if it works as expected.

In [None]:
print(process("I'm doing well! How about you?"))
# ['im', 'do', 'well', 'how', 'about', 'you']

print(process("Education is the ability to listen to almost anything without losing your temper or your self-confidence."))
# ['education', 'be', 'the', 'ability', 'to', 'listen', 'to', 'almost', 'anything', 'without', 'lose', 'your', 'temper', 'or', 'your', 'self', 'confidence']

print(process("been had done languages cities mice"))
# ['be', 'have', 'do', 'language', 'city', 'mice']

print(process("It's hilarious. Check it out http://t.co/dummyurl"))
# ['it', 'hilarious', 'check', 'it', 'out']

print(process("See it Sunday morning at 8:30a on RTV6 and our RTV6 app. http:…"))
# ['see', 'it', 'sunday', 'morning', 'at', '8', '30a', 'on', 'rtv6', 'and', 'our', 'rtv6', 'app', 'http', '…']
# Here '…' is a special unicode character not in string.punctuation and it is still present in processed text

## Part2(Processing):
I will now use the `process()` function to process the pandas dataframe loaded from `tweets_train.csv`. Ideally, this function should handle any dataframe that contains the column `text`. My `process_all()` function should return a dataframe in which every string in `text` is replaced with the result of `process()`, while all other columns keep their original values.

In [None]:
# Loading df

tweets = pd.read_csv("tweets_train.csv", na_filter=False)
display(tweets.head())

In [None]:
def process_all(df, lemmatizer=nltk.stem.wordnet.WordNetLemmatizer()):
    """ process all text in the dataframe using process() function.
    Inputs
        df: pd.DataFrame: dataframe containing a column 'text' loaded from the CSV file
        lemmatizer: an instance of a class implementing the lemmatize() method
                    (the default argument is of type nltk.stem.wordnet.WordNetLemmatizer)
    Outputs
        pd.DataFrame: dataframe in which the values of text column have been changed from str to list(str),
                        the output from process() function. Other columns are unaffected.
    """

    # Working on a copy so the caller's dataframe is not modified in place
    # (re-running the cell would otherwise feed token lists back into process()).
    df = df.copy()

    # Processing every tweet with the supplied lemmatizer and storing
    # the resulting list of lemmas back in the text column.
    df['text'] = [process(text, lemmatizer) for text in df['text']]

    return df

Testing the function to see if it works as expected.

In [None]:
processed_tweets = process_all(tweets)
print(processed_tweets.head())

#       screen_name                                               text
# 0             GOP  [rt, gopconvention, oregon, vote, today, that,...
# 1    TheDemocrats  [rt, dwstweets, the, choice, for, 2016, be, cl...
# 2  HillaryClinton  [trump, call, for, trillion, dollar, tax, cut,...
# 3  HillaryClinton  [timkaine, guide, principle, the, belief, that...
# 4        timkaine  [glad, the, senate, could, pass, a, thud, milc...

## 2. Feature Construction

The next step is to derive feature vectors from the tokenized tweets. I will construct a bag-of-words TF-IDF feature vector. However, the number of possible words is prohibitively large, and not all of them are useful for the classification task. To determine which words to retain and which to omit, I will use a common heuristic: construct a frequency distribution of the words in the corpus and prune the head and tail of the distribution. The intuition is as follows. Very common words (i.e., stopwords) add almost no information about the similarity of two pieces of text, and the same holds for very rare words. NLTK has a built-in list of stop words, which is a good substitute for the head of the distribution. I will consider a word rare if it occurs in only a single document (row) in the whole of `tweets_train.csv`.
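
To make the pruning heuristic concrete, here is a minimal sketch in plain Python. The function name `prune_vocabulary` and the toy documents are made up for illustration. It counts in how many documents each token appears, then keeps only tokens that are not stop words and that occur in at least `min_df` documents.

```python
from collections import Counter

def prune_vocabulary(tokenized_docs, stop_words, min_df=2):
    """Hypothetical helper: drop stop words (head) and rare words (tail)."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        # set() so that a word repeated within one document counts once
        doc_freq.update(set(tokens))
    stop = set(stop_words)
    return sorted(word for word, df in doc_freq.items()
                  if df >= min_df and word not in stop)

# Toy corpus, invented for this example
docs = [["the", "tax", "cut"], ["the", "tax", "plan"], ["the", "vote", "today"]]
print(prune_vocabulary(docs, stop_words=["the"]))
# ['tax']
```

Here `the` is removed as a stop word, and `cut`, `plan`, `vote`, and `today` are removed because each occurs in only one document.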

## Part1 (Feature Matrix Construction):

Now I will construct a sparse matrix of features for each tweet with the help of `sklearn.feature_extraction.text.TfidfVectorizer` ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)). Passing the parameter `min_df=2` filters out words that occur in only one document in the whole training set. I leave the other optional parameters (e.g., `vocabulary`, `norm`) at their default values, and use `lowercase` and `tokenizer` so that the vectorizer can handle `processed_tweets`, where each tweet is a `list` of tokens rather than raw text.
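
For intuition about what the vectorizer computes, the sketch below reimplements the weighting in plain Python under the default settings described in the scikit-learn documentation (`smooth_idf=True`, `norm='l2'`, raw term counts): the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`, and each document vector is scaled to unit length. The function name `tfidf_weights` and the toy documents are hypothetical, and the sketch ignores `min_df` and stop words.

```python
import math
from collections import Counter

def tfidf_weights(tokenized_docs):
    """Hypothetical helper: per-document {token: weight} dictionaries."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))

    weighted = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        # term count times smoothed inverse document frequency
        raw = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
               for t, c in counts.items()}
        # L2-normalize so that every document vector has length 1
        norm = math.sqrt(sum(v * v for v in raw.values()))
        weighted.append({t: v / norm for t, v in raw.items()})
    return weighted

weights = tfidf_weights([["tax", "cut"], ["tax", "plan"]])
print(weights[0])
```

In this toy corpus `tax` appears in both documents and therefore receives a lower weight than `cut` or `plan`, which each appear in only one.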

In [None]:


def create_features(processed_tweets, stop_words):
    """ creates the feature matrix using the processed tweet text
    Inputs:
        processed_tweets: pd.DataFrame: processed tweets read from train/test csv file, containing the column 'text'
        stop_words: list(str): stop_words by nltk stopwords (after processing)
    Outputs:
        sklearn.feature_extraction.text.TfidfVectorizer: the TfidfVectorizer object used;
            we need this to transform test tweets in the same way as train tweets
        scipy.sparse.csr_matrix: sparse bag-of-words TF-IDF feature matrix
    """

    # The tweets are already tokenized, so the tokenizer is an identity
    # function that overrides the default tokenizer and returns the
    # token list unchanged instead of tokenizing it again.
    def tokenize(text):
        return text
    
    vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(min_df=2, stop_words=stop_words, tokenizer=tokenize, lowercase=False)
    
    # Fit the vectorizer and transform the processed tweets into a feature matrix in one step.
    feature_matrix = vectorizer.fit_transform(processed_tweets['text'])

    return vectorizer, feature_matrix

In [None]:
# It is recommended to process stopwords according to our data cleaning rules
processed_stopwords = list(np.concatenate([process(word) for word in stopwords]))
(tfidf, X) = create_features(processed_tweets, processed_stopwords)
# Ignore warning
tfidf, X

# Output (should be similar):
# (TfidfVectorizer(lowercase=False, min_df=2,
#                  stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours',
#                              'ourselves', 'you', 'youre', 'youve', 'youll',
#                              'youd', 'your', 'yours', 'yourself', 'yourselves',
#                              'he', 'him', 'his', 'himself', 'she', 'shes', 'her',
#                              'hers', 'herself', 'it', 'it', 'it', 'itself', ...],
#                  tokenizer=<function create_features.<locals>.<lambda> at 0x2af726660>),
#  <17298x8115 sparse matrix of type '<class 'numpy.float64'>'
#  	with 169163 stored elements in Compressed Sparse Row format>)

## Part2 (Creation of Labels)
For each tweet I assign a class label (0 or 1) based on its `screen_name`: 0 for `realDonaldTrump`, `mike_pence`, and `GOP` (Republicans), and 1 for the rest (Democrats).

In [None]:
def create_labels(processed_tweets):
    """ creates the class labels from screen_name
    Inputs:
        processed_tweets: pd.DataFrame: tweets read from train file, containing the column 'screen_name'
    Outputs:
        numpy.ndarray(int): dense binary numpy array of class labels
    """
    # Creating a boolean array where True corresponds to the republicans
    republicans = processed_tweets['screen_name'].isin(['realDonaldTrump', 'mike_pence', 'GOP'])

    # Using np.where() to assign 0 or 1 based on condition.
    labels = np.where(republicans, 0, 1).astype(int)
    
    return labels

In [None]:
y = create_labels(processed_tweets)
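y
# Expected values (shown in pandas Series form; create_labels() itself returns a plain NumPy array):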
# 0        0
# 1        1
# 2        1
# 3        1
# 4        1
#         ..
# 17293    0
# 17294    0
# 17295    0
# 17296    1
# 17297    0
# Name: screen_name, Length: 17298, dtype: int32

## 3. Classification

Now I put everything together and learn a model for classifying tweets. The classifier that I am using is [`sklearn.svm.SVC`](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) (Support Vector Machine).

At the heart of SVMs is the concept of kernel functions, which determine how the similarity/distance between two data points is computed. `sklearn`'s SVM provides four kernel functions: `linear`, `poly`, `rbf`, `sigmoid` ([details](http://scikit-learn.org/stable/modules/svm.html#svm-kernels)).
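
As a rough illustration of what the four kernels compute for a pair of points, here is a plain-Python sketch of the formulas given in the scikit-learn kernel documentation. The function names are made up for this example, and `gamma=1.0` is an arbitrary choice for readability (by default `SVC` derives `gamma` from the data via `gamma='scale'`), so the numbers are illustrative only.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(u, v):
    return dot(u, v)

def poly_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * dot(u, v) + coef0) ** degree

def rbf_kernel(u, v, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)

u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v))   # 11.0
print(poly_kernel(u, v))     # 1331.0
print(rbf_kernel(u, u))      # 1.0, a point is maximally similar to itself
```

Each kernel turns the same pair of feature vectors into a different similarity score, which is why the choice of kernel can change the decision boundary the SVM learns.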

Through the functions that I implement in this part, I will be able to learn a classifier, score it based on how well it performs, use it for prediction tasks, and compare it to a baseline.

Specifically, I will carry out the following tasks in order:

1. Implement and evaluate a simple baseline classifier MajorityLabelClassifier.
2. Implement the `learn_classifier()` function assuming `kernel` is always one of {`linear`, `poly`, `rbf`, `sigmoid`}. 
3. Implement the `evaluate_classifier()` function which scores a classifier based on accuracy of a given dataset.
4. Implement `best_model_selection()` to perform cross-validation by calling `learn_classifier()` and `evaluate_classifier()` for different folds and determine which of the four kernels performs the best.
5. Go back to `learn_classifier()` and fill in the best kernel. 

## Part1 (Establishing Baseline):

To determine whether the classifier is performing well, I will compare it to a baseline classifier. My classifier should beat the baseline in terms of a performance measure such as accuracy.

In order to establish a baseline I will implement a classifier called `MajorityLabelClassifier` that always predicts the class equal to **mode** of the labels (i.e., the most frequent label) in training data.
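
The idea behind the baseline can be sketched without any classifier machinery: its accuracy on a set of labels is simply the share of the most frequent label. The helper name `majority_baseline_accuracy` and the toy labels are hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Hypothetical helper: the mode of the labels and the accuracy of always predicting it."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)

label, accuracy_of_mode = majority_baseline_accuracy([0, 1, 1, 0, 1])
print(label, accuracy_of_mode)
# 1 0.6
```

On a balanced two-class data set this value is close to 0.5, so any useful classifier has to score clearly above that.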

In [None]:
class MajorityLabelClassifier():
    """
    A classifier that predicts the mode of training labels
    """
    def __init__(self):
        """
        Initialize your parameter here
        """
        self.theta = None

    def fit(self, X, y):
        """
        Implement fit by taking training data X and their labels y and finding the mode of y
        i.e. store your learned parameter
        """
        self.theta = pd.Series(y).mode()[0]

    def predict(self, X):
        """
        Implement to give the mode of training labels as a prediction for each data instance in X
        return labels
        """
        return np.full(shape=X.shape[0], fill_value=self.theta)



baselineClf = MajorityLabelClassifier()
baselineClf.fit(X, y)
predictions = baselineClf.predict(X)
accuracy = np.mean(predictions == y)

print(accuracy)
# print(training accuracy) should give 0.5001734304543878

## Part2 (learn_classifier()):
This function assumes `kernel` is always one of {`linear`, `poly`, `rbf`, `sigmoid`}. I stick to the default values for all other optional parameters.

In [None]:
def learn_classifier(X_train, y_train, kernel):
    """ learns a classifier from the input features and labels using the kernel function supplied
    Inputs:
        X_train: scipy.sparse.csr.csr_matrix: sparse matrix of features, output of create_features()
        y_train: numpy.ndarray(int): dense binary vector of class labels, output of create_labels()
        kernel: str: kernel function to be used with classifier. [linear|poly|rbf|sigmoid]
    Outputs:
        sklearn.svm.SVC: classifier learnt from data
    """

    clf = sklearn.svm.SVC(kernel=kernel)

    clf.fit(X_train, y_train)

    return clf

In [None]:
classifier = learn_classifier(X, y, 'linear')

## Part3 (Evaluate Classifier)
The next step is to evaluate the classifier, i.e., to characterize how good its classification performance is. This is done to select the best model among several candidates, or later to tune the hyperparameters of a given model.

**Cross-validation**: This is the approach I will use. It divides the data set into $k$ groups (so-called folds). In each round, one group is used as the validation set and the remaining groups as the training set, so every group serves as the validation set exactly once. The model or hyperparameter setting with the best average performance across all $k$ folds is chosen. For this part I will perform 4-fold cross-validation to determine the best kernel, keeping all other hyperparameters at their defaults for now. This approach makes the estimate more robust to bias from any single validation split; however, it takes more time.

**Metric**: I will use accuracy as my model evaluation metric. The accuracy of a classifier is the fraction of all data points that it classifies correctly, i.e., the ratio of the number of correct classifications to the total number of (correct and incorrect) classifications. `sklearn.metrics` provides a number of other performance metrics as well.
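
The accuracy definition translates directly into a few lines of plain Python. This sketch uses a hypothetical helper named `accuracy` and invented labels; it mirrors what `sklearn.metrics.accuracy_score` returns with its default settings.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))
# 0.75
```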

In [None]:
def evaluate_classifier(classifier, X_validation, y_validation):
    """ evaluates a classifier based on a supplied validation data
    Inputs:
        classifier: sklearn.svm.SVC: classifier to evaluate
        X_validation: scipy.sparse.csr_matrix: sparse matrix of features
        y_validation: numpy.ndarray(int): dense binary vector of class labels
    Outputs:
        float: accuracy of classifier on the validation data
    """
    
    y_pred = classifier.predict(X_validation)

    accuracy = sklearn.metrics.accuracy_score(y_validation, y_pred)

    return accuracy

In [None]:
accuracy = evaluate_classifier(classifier, X, y)
print(accuracy)
# should give around 0.9545612209503989

## Part4 (Cross Validation)
Now I will decide which kernel works best by using cross-validation. The code shuffles the training data randomly and splits it into 4 folds, so that each round trains on 75% of the data and validates on the remaining 25%. For each kernel I record the average accuracy over all folds and pick the kernel with the highest average. Since our dataset is balanced, [`sklearn.model_selection.KFold`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) can be used for cross-validation.
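
To show what a shuffled 4-fold split looks like at the index level, here is a small standard-library sketch. The generator name `k_fold_indices` is hypothetical, and it does not reproduce the exact shuffling order of `KFold(random_state=1)`; it only illustrates the mechanics of the split.

```python
import random

def k_fold_indices(n_samples, n_splits=4, seed=1):
    """Hypothetical helper: yield (train_indices, validation_indices) per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # The first n_samples % n_splits folds receive one extra sample
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(10))
for train, validation in folds:
    print(len(train), len(validation))
```

Every sample lands in exactly one validation fold, so averaging the per-fold accuracies uses each labelled tweet for validation exactly once.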

In [None]:
kf = sklearn.model_selection.KFold(n_splits=4, random_state=1, shuffle=True)
kf

I then use the following code to determine which kernel is the best.

In [None]:
def best_model_selection(kf, X, y):
    """
    Select the kernel giving best results using k-fold cross-validation.
    Other parameters should be left default.
    Input:
    kf (sklearn.model_selection.KFold): kf object defined above
    X (scipy.sparse.csr.csr_matrix): training data
    y (array(int)): training labels
    Return:
    best_kernel (string)

    Steps:
    # split X and y (from create_features() and create_labels()) with kf.split()
    # call learn_classifier() on the training split of each fold
    # evaluate on the validation split of the same fold
    # record average accuracies and determine the best model (kernel)
    """
    kernel_accuracy = {}
    for kernel in ['linear', 'rbf', 'poly', 'sigmoid']:
        accuracies = []

        for train_index, test_index in kf.split(X):

            # Splitting data into training and testing sets for the 
            # current fold.
            X_train, y_train= X[train_index], y[train_index]
            X_test, y_test = X[test_index], y[test_index]
            classifier = learn_classifier(X_train, y_train, kernel)
            accuracy = evaluate_classifier(classifier, X_test, y_test)

            # Recording this fold's accuracy so that the mean can be
            # computed once all folds are done.
            accuracies.append(accuracy)

        # Calculating the average accuracy for the current kernel
        # across all folds
        kernel_accuracy[kernel] = np.mean(accuracies)

    # Getting the best kernel with the highest average accuracy.
    best_kernel = max(kernel_accuracy, key=kernel_accuracy.get)

    # Returning the best kernel as a string.
    return best_kernel

# Test the function
best_kernel = best_model_selection(kf, X, y)
best_kernel

Now I am going to write a small wrapper function that uses my model to classify unlabeled tweets from the `tweets_test.csv` file.

In [None]:
def classify_tweets(tfidf, classifier, unlabeled_tweets):
    """ predicts class labels for raw tweet text
    Inputs:
        tfidf: sklearn.feature_extraction.text.TfidfVectorizer: the TfidfVectorizer object used on training data
        classifier: sklearn.svm.SVC: classifier learned
        unlabeled_tweets: pd.DataFrame: tweets read from tweets_test.csv
    Outputs:
        numpy.ndarray(int): dense binary vector of class labels for unlabeled tweets
    """

    # Processing all the test tweets.
    processed_tweets = process_all(unlabeled_tweets)

    # Transforming the processed tweets to create a feature matrix
    # for the test
    X_test = tfidf.transform(processed_tweets['text'])

    y_pred = classifier.predict(X_test)

    return y_pred

In [None]:
# Now I plug the best kernel into learn_classifier() and retrain the classifier using all training data
classifier = learn_classifier(X, y, 'poly')
unlabeled_tweets = pd.read_csv("tweets_test.csv", na_filter=False)
y_pred = classify_tweets(tfidf, classifier, unlabeled_tweets)
print(y_pred)
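
The printed predictions are raw 0/1 labels. A small lookup makes them easier to read, assuming the encoding used in `create_labels()` (0 for the Republican accounts, 1 for the Democratic ones). The names `PARTY_BY_LABEL` and `label_names` are hypothetical, and the input below is a made-up list rather than real model output.

```python
# Encoding assumed from create_labels(): 0 = Republican, 1 = Democrat
PARTY_BY_LABEL = {0: "Republican", 1: "Democrat"}

def label_names(predictions):
    """Hypothetical helper: map numeric class labels to party names."""
    return [PARTY_BY_LABEL[int(label)] for label in predictions]

print(label_names([0, 1, 1]))
# ['Republican', 'Democrat', 'Democrat']
```

The same call should work on the NumPy array returned by `classify_tweets()`, since `int()` converts NumPy integer scalars.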


In [None]:
accuracy = evaluate_classifier(classifier, X, y)
print(accuracy)

The classifier performs much better than the baseline on the training data. The kernel was chosen using k-fold cross-validation. When I retrain the classifier with that kernel, `poly`, on all training data and score it against the training labels, it gives me an accuracy of 99.7%, compared with about 50% for the majority-label baseline.