<a href="https://colab.research.google.com/github/anilkeshwani/StatLearnProj/blob/master/50pc.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Master Notebook

The aim of this notebook is to create a single strand of analysis with a coherent narrative. 

[Update the results on the Google Sheet](https://docs.google.com/spreadsheets/d/1tFsScPgzPsGqZCGhDM3cJgXysmp_lce1kUN76B2Rha8/edit?usp=sharing)

### Aim for Final Version

- Table of accuracies for each combination of vectorisation method × classification method × additional methods, as detailed in _Table of Analyses.xlsx_ (see `organisation/` directory)
- Explanation/Exposition of methods
- EDA: visualise the word-vector representations produced by the different pre-processing options; basic descriptive statistics on the final _input dataset_
- Clear and clean pre-processing pipeline
- Clear and clean grid search methods

### Modelling Combinations

#### Pre-processing

- Components (methods) of `CleanText`
    - In particular stemming

#### Word Representations

- Bag-of-Words - One-Hot (BOW)
    - BOW n-grams with $n > 1$
- Bag-of-Words - Frequencies (FBOW)
- Term Frequency–Inverse Document Frequency (TF-IDF)
- Word2Vec
    - Skip-grams (SG)
    - Continuous-Bag-of-Words (CBOW)
- FastText
- BERT

#### Classifiers

- Logistic Regression (Elastic Net)
    - Search across penalisation weights (C) and l1-l2 ratios (l1_ratio)
- Support Vector Machines (SVM)
- Naive Bayes (NB)
- Random Forests (RF)
- Gradient Boosting (GB)
- (Perceptron) (MLP)
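
For reference, the elastic-net penalised objective for binary labels $y_i \in \{-1, 1\}$, in the form we understand the scikit-learn user guide to give for `LogisticRegression` (worth checking against the installed version), is

$$\min_{w, c} \; \frac{1 - \rho}{2} w^\top w + \rho \lVert w \rVert_1 + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i (x_i^\top w + c)\right) + 1\right)$$

where $\rho$ is `l1_ratio`: $\rho = 1$ gives the pure $\ell_1$ penalty and $\rho = 0$ the pure $\ell_2$ penalty. `C` multiplies the data-fit term, so a smaller `C` means stronger regularisation.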

#### Additional Modelling Considerations

- Scaled versus Unscaled data

### Questions for the Team

- Logistic Regression: Thoughts on mean accuracy as given by `LogisticRegression.score()`?
- Logistic regression was previously fitted on individual words in Felicie's notebook. Because we have a limited number of accounts, and each account may tend to reuse the same words, our low train and test errors might stem from the model picking up account-specific vocabulary. Stemming reduces the training and test accuracies. **We should check whether the components explaining a high degree of variation are individual words used by certain accounts.**
- What should our cut-off for the minimum _document frequency_ of words be? The value 10 has been used so far, but the `CountVectorizer` default is 1.
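
To make the minimum document frequency question concrete, here is a minimal plain-Python sketch of the idea behind `min_df`. It is not the `CountVectorizer` implementation; the toy corpus and the helper `vocabulary_at` are hypothetical.

```python
from collections import Counter

# Toy corpus standing in for cleaned, stemmed tweets (hypothetical data)
docs = [
    "climat chang real",
    "climat chang hoax",
    "climat polici debat",
    "global warm real",
]

# Document frequency: number of documents containing each term at least once
doc_freq = Counter(term for doc in docs for term in set(doc.split()))

def vocabulary_at(min_df):
    """Return the sorted terms whose document frequency is at least min_df."""
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

# The vocabulary shrinks quickly as the cut-off rises
for cutoff in (1, 2, 3):
    print(cutoff, vocabulary_at(cutoff))
```

Running the same comparison on the real training set (vocabulary size at several cut-offs) would be one way to choose the value with evidence.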

### Messages to the Team

- The repo is public - This allows us to directly read data in by passing Pandas a URL
- You can run code via AWS as if you're working locally. Follow [this tutorial](https://chrisalbon.com/aws/basics/run_project_jupyter_on_amazon_ec2/).

### TODO

- How has Felicie's approach to BOW vectorisation been realised by `sklearn.feature_extraction.text.CountVectorizer`? In particular, has any stemming been performed and if so via which algorithm? See the [docs](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer)
- Which dimensions explain the most variation? Inspect model coefficients; run PCA

# Code

In [None]:
# Start Fresh

%reset -f

In [None]:
pip install emoji



In [None]:
pip install catboost

Collecting catboost
  Downloading https://files.pythonhosted.org/packages/b2/aa/e61819d04ef2bbee778bf4b3a748db1f3ad23512377e43ecfdc3211437a0/catboost-0.23.2-cp36-none-manylinux1_x86_64.whl (64.8MB)
Installing collected packages: catboost
Successfully installed catboost-0.23.2


In [None]:
# Imports and Set Options

import csv  # for slang
import os
import re  # regex
import string  # punct
from pprint import pprint

import emoji  # for emoji
import gensim
import keras
import lightgbm as lgb
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn
from gensim.models import Word2Vec
from IPython.display import Image
from nltk.corpus import stopwords  # stopwords
from nltk.stem import PorterStemmer  # stemming
from nltk.stem import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn import svm, tree
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier, RandomForestRegressor,
                              StackingClassifier)
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import (accuracy_score, auc, average_precision_score,
                             brier_score_loss, classification_report,
                             confusion_matrix, f1_score, fbeta_score,
                             make_scorer, plot_precision_recall_curve,
                             precision_recall_curve, precision_score,
                             recall_score, roc_auc_score, roc_curve)
from sklearn.model_selection import (GridSearchCV, KFold, RandomizedSearchCV,
                                     cross_val_score, train_test_split)
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler
from sklearn.svm import SVC  # "Support vector classifier"
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

import catboost as cb
import xgboost as xgb

# pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

%matplotlib inline

Using TensorFlow backend.
  import pandas.util.testing as tm


## Homemade Classes and Functions

In [None]:
# Clean Text Class

class CleanText(BaseEstimator, TransformerMixin):
    
    def remove_mentions(self, input_text):
        '''
        Remove mentions, like @Mplamplampla
        '''
        # Match the whole handle, not only the "@" symbol
        return re.sub(r'@\w+', '', input_text)
    
    def remove_urls(self, input_text):
        '''
        Remove the URLs mentioned in a tweet
        '''
        input_text  = ' '.join([w for w in input_text.split(' ') if '.com' not in w])
        return re.sub(r'http.?://[^\s]+[\s]?', '', input_text)
    
    def emoji_oneword(self, input_text):
        # By compressing the underscore, the emoji is kept as one word
        input_text = emoji.demojize(input_text)
        input_text = input_text.replace('_','')
        input_text = input_text.replace(':','')
        return input_text
    
    def possessive_pronouns(self, input_text):
        '''
        Remove possessive endings ('s), because otherwise after tokenization we end up with a word and an s
        Example: government's --> ["government", "s"]
        '''
        return input_text.replace("'s", "")
    
    def characters(self, input_text):
        '''
        Remove special and redundant characters that may appear on a tweet and that don't really help in our analysis
        '''
        input_text = input_text.replace("\r", " ") # Carriage Return
        input_text = input_text.replace("\n", " ") # Newline
        input_text = " ".join(input_text.split()) # Double space
        input_text = input_text.replace('"', '') # Quotes
        return input_text
    
    def remove_punctuation(self, input_text):
        '''
        Remove punctuation and specifically these symbols '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
        '''
        punct = string.punctuation # string with all the punctuation symbols '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
        trantab = str.maketrans(punct, len(punct)*' ')  # Every punctuation symbol will be replaced by a space
        return input_text.translate(trantab)
    
    def remove_digits(self, input_text):
        '''
        Remove numbers
        '''
        return re.sub(r'\d+', '', input_text)  # raw string avoids an invalid-escape warning
    
    def to_lower(self, input_text):
        '''
        Convert all the sentences(words) to lowercase
        '''
        return input_text.lower()
    
    def remove_stopwords(self, input_text):
        '''
        Remove stopwords (refers to the most common words in a language)
        '''
        stopwords_list = stopwords.words('english')
        # Some words which might indicate a certain sentiment are kept via a whitelist
        whitelist = ["n't", "not", "no"]
        words = input_text.split() 
        clean_words = [word for word in words if (word not in stopwords_list or word in whitelist) and len(word) > 1] 
        return " ".join(clean_words) 
    
    def stemming(self, input_text):
        '''
        Reduce the words to their stem
        '''
        porter = PorterStemmer()
        words = input_text.split() 
        stemmed_words = [porter.stem(word) for word in words]
        return " ".join(stemmed_words)
    
    def encode_decode(self, input_text):
        '''
        Remove weird characters that are result of encoding problems
        '''
        return  " ".join([k.encode("ascii", "ignore").decode() for k in input_text.split(" ")])
    
    
    def translator(self, input_text):
        '''
        Transform abbreviations to normal words
        Example: asap --> as soon as possible
        '''
        # Read the abbreviation file once per call instead of once per word.
        # Each line of slang.txt has the form ABBREVIATION=phrase, so the
        # abbreviation is in row[0] and the phrase in row[1].
        with open("slang.txt", "r") as slang_file:
            slang = {row[0]: row[1]
                     for row in csv.reader(slang_file, delimiter="=")
                     if len(row) >= 2}
        words = input_text.split(" ")
        for j, word in enumerate(words):
            # Remove special characters before looking the word up
            key = re.sub(r'[^a-zA-Z0-9-_.]', '', word).upper()
            if key in slang:
                # Replace a matched abbreviation with its full phrase
                words[j] = slang[key]
        return ' '.join(words)
    
    def fit(self, X, y=None, **fit_params):
        return self
    
    def transform(self, X, **transform_params):
        clean_X = (X.apply(self.translator)
                    .apply(self.remove_mentions)
                    .apply(self.remove_urls)
                    .apply(self.emoji_oneword)
                    .apply(self.possessive_pronouns)
                    .apply(self.remove_punctuation)
                    .apply(self.remove_digits)
                    .apply(self.encode_decode)
                    .apply(self.characters)
                    .apply(self.to_lower)
                    .apply(self.remove_stopwords)
                    .apply(self.stemming))
        return clean_X
    
    def transform_no_stem(self, X, **transform_params):
        clean_X = (X.apply(self.translator)
                    .apply(self.remove_mentions)
                    .apply(self.remove_urls)
                    .apply(self.emoji_oneword)
                    .apply(self.possessive_pronouns)
                    .apply(self.remove_punctuation)
                    .apply(self.remove_digits)
                    .apply(self.encode_decode)
                    .apply(self.characters)
                    .apply(self.to_lower)
                    .apply(self.remove_stopwords))
        return clean_X
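
The docstring of `remove_mentions` describes removing mentions such as @Mplamplampla. A quick standalone check, using a made-up tweet, shows how two candidate patterns differ: one strips only the `@` symbol, the other strips the whole handle.

```python
import re

# Made-up example tweet (not from the dataset)
tweet = "Thanks @climateWWF for the update, @UN please act"

# Pattern '@+' deletes only the "@" characters; the handle text survives
only_symbol = re.sub(r'@+', '', tweet)

# Pattern '@\w+' deletes the "@" together with the handle that follows it
whole_handle = re.sub(r'@\w+', '', tweet)

print(only_symbol)
print(whole_handle)
```

Which behaviour is wanted matters for the account-identity concern raised earlier: if handles stay in the text, they become features.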

## Read in Data and Create Train and Test Sets

In [None]:
# Read in data (Raw copy for reference; copy for processing)

tweets_raw = pd.read_csv('https://github.com/anilkeshwani/StatLearnProj/raw/master/Iason/climate_change_tweets_sample-2020-05-16-17-57.csv')
tweets = pd.read_csv('https://github.com/anilkeshwani/StatLearnProj/raw/master/Iason/climate_change_tweets_sample-2020-05-16-17-57.csv')
tweets.head()

| Unnamed: 0 | username | user_handle | date | retweets | favorites | text | label |
|---|---|---|---|---|---|---|---|
| 0 | WWF Climate & Energy | climateWWF | 2020-04-28 | 11 | 22 | Economic recovery and national climate pledges... | 0 |
| 1 | WWF Climate & Energy | climateWWF | 2020-04-22 | 6 | 16 | In this difficult time, it’s hard to connect w... | 0 |
| 2 | WWF Climate & Energy | climateWWF | 2020-04-01 | 43 | 69 | The decision to postpone # COP26, is unavoidab... | 0 |
| 3 | WWF Climate & Energy | climateWWF | 2020-03-30 | 24 | 30 | Japan - the world’s fifth largest emitter of g... | 0 |
| 4 | WWF Climate & Energy | climateWWF | 2020-03-30 | 22 | 40 | How can countries include # NatureBasedSolutio... | 0 |


## Clean Dataset

Applies the class methods (leveraging `sklearn` API):

- translator
- remove_mentions
- remove_urls
- emoji_oneword
- possessive_pronouns
- remove_punctuation
- remove_digits
- encode_decode
- characters
- to_lower
- remove_stopwords
- stemming (via Porter Algorithm)
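
Because the steps run in the listed order, punctuation is already gone by the time stop words are filtered. The sketch below is a simplified stand-in for the punctuation and lower-casing steps only (the helper `strip_punctuation` and the sentence are hypothetical). It suggests that a contraction such as "don't" reaches the stop-word filter as the separate tokens `don` and `t`, so a whitelist entry like `n't` would not be expected to match.

```python
import string

def strip_punctuation(text):
    # Same idea as CleanText.remove_punctuation: each punctuation mark becomes a space
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)

text = "We don't need delay, we need action!"
tokens = strip_punctuation(text).lower().split()
print(tokens)
```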

In [None]:
# Text Cleaning

# ct = CleanText()
# tweets["text"] = ct.fit_transform(tweets["text"])
# tweets.to_csv("clean_tweets.csv") # save once processed
tweets = pd.read_csv("clean_tweets.csv") # read in instead
tweets = tweets.loc[(~tweets.text.isnull()), :]

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(tweets.text, tweets.label, 
                                                    test_size=0.2, random_state=17, 
                                                    shuffle=True) # explicit default

[print(dat.head(3), dat.shape, end="\n"*2) for dat in [X_train, X_test, Y_train, Y_test]];

3642     might progress climat area not enough global s...
12695    trump crackdown politic scienc nasa climat div...
8451     no one would believ human panick fals climat m...
Name: text, dtype: object (14406,)

8376     nazi root environment climat chang fraud bbcne...
6111     interest democrat candid compar mani aspect cl...
13983    ittrademark imposs see global warm signal minn...
Name: text, dtype: object (3602,)

3642     0
12695    1
8451     1
Name: label, dtype: int64 (14406,)

8376     1
6111     0
13983    1
Name: label, dtype: int64 (3602,)



In [None]:
print(f"Training label counts: \n{Y_train.value_counts()}", end="\n"*2)
print(f"Test label counts: \n{Y_test.value_counts()}")

Training label counts: 
1    8433
0    5973
Name: label, dtype: int64

Test label counts: 
1    2138
0    1464
Name: label, dtype: int64
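
The label counts above give a reference point for reading accuracies: a classifier that always predicts the more common label already scores the majority share. A small sketch of that arithmetic, using only the counts printed above (the helper `majority_share` is hypothetical):

```python
# Label counts copied from the output above
train_counts = {1: 8433, 0: 5973}
test_counts = {1: 2138, 0: 1464}

def majority_share(counts):
    """Fraction of samples belonging to the most common label."""
    return max(counts.values()) / sum(counts.values())

print(f"Train majority share: {majority_share(train_counts):.4f}")
print(f"Test majority share:  {majority_share(test_counts):.4f}")
```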


In [None]:
# Save set of workspace objects' names to enable periodic clean-up

necessities = set(dir())

## Word Vectorisations

### Bag of Words (BOW) Binary ("One-Hot") Representation

In [None]:
# Bag of Words Representation (One Hot, i.e. binary)

BOW_vectorizer = CountVectorizer(stop_words = 'english', 
                                 binary=True, # Creates 0/1 "One Hot" vector; 
                                              # np.unique(BOW_train.toarray())
                                 min_df = 10)
BOW_vectorizer.fit(X_train)
BOW_train = BOW_vectorizer.transform(X_train)
BOW_test = BOW_vectorizer.transform(X_test)

# Construct Scaled Datasets

scaler_BOW = MaxAbsScaler()
BOW_train_scaled = scaler_BOW.fit_transform(BOW_train)
BOW_test_scaled = scaler_BOW.transform(BOW_test)
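
`MaxAbsScaler` rescales each feature by its maximum absolute value. The plain-Python sketch below illustrates that rule; it is not scikit-learn's implementation, and `max_abs_scale` is a hypothetical helper. It suggests that for 0/1 features where every column contains at least one 1 (which `min_df = 10` should ensure on the training set), scaling leaves the matrix unchanged, so the scaled and unscaled binary runs would be expected to behave alike.

```python
def max_abs_scale(rows):
    """Divide every column by its largest absolute value (all-zero columns are left as-is)."""
    n_cols = len(rows[0])
    col_max = [max(abs(row[j]) for row in rows) for j in range(n_cols)]
    return [[row[j] / col_max[j] if col_max[j] else row[j] for j in range(n_cols)]
            for row in rows]

binary = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]   # 0/1 indicators
counts = [[2, 0, 1], [0, 1, 4], [1, 1, 0]]   # raw frequencies

print(max_abs_scale(binary) == binary)  # binary features are unchanged
print(max_abs_scale(counts))            # frequency features are rescaled per column
```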

In [None]:
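# Last 20 vocabulary terms by feature index.
# Note: vocabulary_ maps each term to its column index, not to its frequency,
# so this lists the terms with the highest indices (the alphabetically last ones).

[(word, index) for word, index in sorted(BOW_vectorizer.vocabulary_.items(), key=lambda item: item[1], reverse=True)][:20]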

[('zero', 2206),
 ('yr', 2205),
 ('youtub', 2204),
 ('youthvgov', 2203),
 ('youthtopow', 2202),
 ('youthclimatesummit', 2201),
 ('youth', 2200),
 ('young', 2199),
 ('york', 2198),
 ('yesterday', 2197),
 ('year', 2196),
 ('yeah', 2195),
 ('ye', 2194),
 ('yall', 2193),
 ('yale', 2192),
 ('ya', 2191),
 ('wwf', 2190),
 ('wsj', 2189),
 ('wrote', 2188),
 ('wrong', 2187)]
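
The `vocabulary_` attribute of a fitted `CountVectorizer` maps each term to its column index, and the listing above runs from `zero` backwards through the alphabet, which is consistent with indices following alphabetical order rather than frequency. A plain-Python sketch on a made-up corpus contrasts sorting by index with counting documents per term:

```python
from collections import Counter

# Made-up corpus of cleaned tweets
docs = ["climat chang real", "climat chang hoax", "zero climat polici"]

# Column indices assigned in alphabetical order of the terms
terms = sorted({term for doc in docs for term in doc.split()})
vocabulary = {term: index for index, term in enumerate(terms)}

# Sorting by index (descending) returns the alphabetically last terms
by_index = sorted(vocabulary.items(), key=lambda item: item[1], reverse=True)[:2]

# Counting documents per term gives the genuinely most frequent terms
doc_freq = Counter(term for doc in docs for term in set(doc.split()))
by_frequency = doc_freq.most_common(2)

print(by_index)
print(by_frequency)
```

On the real data, the analogous frequency ranking would come from summing the columns of the document-term matrix rather than from `vocabulary_`.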

### Bag of Words with Frequencies Representation (FBOW)

In [None]:
# Bag of Words Representation (Frequencies; binary=False)

FBOW_vectorizer = CountVectorizer(stop_words = 'english', 
                                  binary=False, # Creates Word Frequency Vector; 
                                                # np.unique(FBOW_train.toarray())
                                  min_df = 10)
FBOW_vectorizer.fit(X_train)
FBOW_train = FBOW_vectorizer.transform(X_train)
FBOW_test = FBOW_vectorizer.transform(X_test)

# Construct Scaled Datasets

scaler_FBOW = MaxAbsScaler()
FBOW_train_scaled = scaler_FBOW.fit_transform(FBOW_train)
FBOW_test_scaled = scaler_FBOW.transform(FBOW_test)

In [None]:
# Word use (per tweet) frequencies

print(np.unique(FBOW_train.toarray(), return_counts=True))

# Feature_Index: Word Mapping

# {v: k for k, v in FBOW_vectorizer.vocabulary_.items()}

(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 13]), array([31661370,   127070,     5161,      379,       45,        7,
              5,        1,        2,        1,        1]))
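
The counts above also show how sparse the matrix is, which matters because the grid-search function later in this notebook converts its input with `.toarray()`. A short arithmetic sketch using only the printed numbers (the 8-bytes-per-cell figure assumes 64-bit integers):

```python
# Value counts copied from the np.unique output above
values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 13]
counts = [31661370, 127070, 5161, 379, 45, 7, 5, 1, 2, 1, 1]

total_cells = sum(counts)               # rows x columns of the dense matrix
nonzero_cells = total_cells - counts[0]
density = nonzero_cells / total_cells

print(f"cells: {total_cells}, non-zero: {nonzero_cells}, density: {density:.4%}")
# A dense copy with 8-byte integers needs 8 bytes per cell
print(f"dense size: {total_cells * 8 / 1e6:.0f} MB")
```

The total matches 14406 training tweets times 2207 vocabulary terms.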


### Bag of Words Bigrams (bigram)

In [None]:
bigram_vectorizer = CountVectorizer(stop_words = 'english', 
                                    binary=True, 
                                    min_df = 10,
                                    ngram_range = (1,2)) # unigrams and bigrams
bigram_vectorizer.fit(X_train)

bigram_train = bigram_vectorizer.transform(X_train)
bigram_test = bigram_vectorizer.transform(X_test)

# Construct Scaled Datasets

scaler_bigram = MaxAbsScaler()
bigram_train_scaled = scaler_bigram.fit_transform(bigram_train)
bigram_test_scaled = scaler_bigram.transform(bigram_test)
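
With `ngram_range = (1,2)` the features include single words as well as adjacent word pairs. A plain-Python sketch of that enumeration (the helper `word_ngrams` is hypothetical and ignores stop-word filtering and `min_df`):

```python
def word_ngrams(text, min_n=1, max_n=2):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    tokens = text.split()
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

print(word_ngrams("global warm signal"))        # unigrams and bigrams
print(word_ngrams("global warm signal", 2, 2))  # bigrams only
```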

### Term Frequency–Inverse Document Frequency Representation (tf-idf)

In [None]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english', 
                                   min_df=10) # used for now for consistency
tfidf_vectorizer.fit(X_train)
tfidf_train = tfidf_vectorizer.transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)

# Construct Scaled Datasets

scaler_tfidf = MaxAbsScaler()
tfidf_train_scaled = scaler_tfidf.fit_transform(tfidf_train)
tfidf_test_scaled = scaler_tfidf.transform(tfidf_test)
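
As a worked illustration of the weighting, the sketch below computes tf-idf by hand on a made-up corpus. It assumes the defaults documented for `TfidfVectorizer` (raw term counts, smoothed idf $\ln\frac{1+n}{1+\mathrm{df}} + 1$, then L2 normalisation of each document) and is not the library's code; `smooth_idf` and `tfidf` are hypothetical helpers.

```python
import math
from collections import Counter

# Made-up tokenised documents
docs = [["climat", "chang", "climat"], ["chang", "hoax"], ["climat", "polici"]]
n_docs = len(docs)

# Document frequency of each term
doc_freq = Counter(term for doc in docs for term in set(doc))

def smooth_idf(term):
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    """Raw term counts times smoothed idf, then scaled to unit Euclidean length."""
    counts = Counter(doc)
    weights = {term: tf * smooth_idf(term) for term, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vector = tfidf(docs[0])
print(vector)
```

Rarer terms receive a larger idf, and each document vector has unit length before any further scaling is applied.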

### Word2Vec - Continuous-Bag-of-Words

## Grid-Searches

In [None]:
def var_name(var):
    for name,value in globals().items() :
        if value is var :
            return name
    return '?????' 
 
def printv(var):
    print("##",var_name(var))

In [None]:
kfcv = KFold(n_splits=5,shuffle=True,random_state=101)
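
For orientation, a small sketch of how many rows land in each validation fold when the 14406 training tweets are split five ways. It assumes the usual rule that folds differ in size by at most one row; `fold_sizes` is a hypothetical helper, not part of scikit-learn.

```python
def fold_sizes(n_samples, n_splits):
    """Validation-fold sizes when n_samples rows are split as evenly as possible."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]

sizes = fold_sizes(14406, 5)
print(sizes)                       # rows held out in each fold
print([14406 - s for s in sizes])  # rows used for fitting in each fold
```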

In [None]:
def GS(X):

  ### X TRAIN
  X_train = X.toarray()

  ### Gaussian Naive Bayes
  clf = GaussianNB()
  var_smoothing = [pow(10,k)/1000000000 for k in range(10)]
  param_grid = {
           'var_smoothing': var_smoothing
        }
  grid_search_NB = GridSearchCV(estimator = clf, param_grid = param_grid, cv=kfcv ,n_jobs = -1, verbose = 2, scoring='accuracy')
  grid_search_NB.fit(X_train,Y_train)

  ### Logistic regression
  clf = LogisticRegression()
  param_grid = {
            'penalty': ['elasticnet','l1','l2','none'],
            'C': [.001, .01, .1, 1, 10, 100, 1000],
            'solver': ['liblinear', "saga", "lbfgs", "newton-cg", "sag"],
            'multi_class': ['ovr'],
            'max_iter' : [1000]
        }
  grid_search_LR = GridSearchCV(estimator = clf, param_grid = param_grid, cv=kfcv ,n_jobs = -1, verbose = 2)
  grid_search_LR.fit(X_train,Y_train)

  ### Random Forest
  clf = RandomForestClassifier(oob_score=True)
  param_grid = {
    'bootstrap': [True],
    'max_depth': [80, 100, None],
    'max_features': ['sqrt', 'log2'],
    'min_samples_leaf': [1, 3, 5],
    'min_samples_split': [2, 5, 10],
    'n_estimators': [50, 100, 300]
  }
  grid_search_RF = GridSearchCV(estimator = clf, param_grid = param_grid, cv=kfcv ,n_jobs = -1, verbose = 2)
  grid_search_RF.fit(X_train,Y_train)

  ### Support Vector Machine
  clf = SVC()
  param_grid = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1,2,10,100]},
                    {'kernel': ['linear'], 'C': [1,2,10,100]}]
  grid_search_SVM = GridSearchCV(estimator = clf, param_grid = param_grid, cv=kfcv ,n_jobs = -1, verbose = 2, scoring='accuracy')
  grid_search_SVM.fit(X_train,Y_train)

  ### Yandex CatBoost
  clf = cb.CatBoostClassifier(random_state=17,thread_count=4,verbose=0)
  param_grid = {'n_estimators':[100,250,500,1000],
              'depth':list(range(1,10)),  # GridSearchCV needs explicit lists; sp_randint was never imported
              'learning_rate':[0.001,0.01,0.05,0.1,0.2,0.3], 
              'l2_leaf_reg':[1,5,10,100],
              'border_count':[5,10,20,50,100,200]}
  grid_search_CB = GridSearchCV(estimator = clf, param_grid = param_grid, cv=kfcv ,n_jobs = -1, verbose = 2, scoring='accuracy')
  grid_search_CB.fit(X_train,Y_train)
  
  ### Print the best parameters
  print("#####################################\n#####################################")
  printv(X)
  print("#####################################\n##")
  print("## Best Naive Bayes parameters :",grid_search_NB.best_params_)
  print("## Best Logistic Regression parameters :", grid_search_LR.best_params_)
  print("## Best Random Forest parameters :", grid_search_RF.best_params_)
  print("## Best Support Vector Machines parameters :", grid_search_SVM.best_params_)
  print("## Best Cat Boost parameters :", grid_search_CB.best_params_)
  print("##\n#####################################\n#####################################\n")
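
The logistic-regression grid in `GS` crosses every penalty with every solver, but not every pair is supported. The sketch below counts the supported pairs and builds a grid restricted to them. The compatibility table is an assumption based on the scikit-learn 0.22/0.23 documentation (newer releases spell the unpenalised option `None` rather than `'none'`), and the `l1_ratio` values are illustrative only.

```python
from itertools import product

penalties = ['elasticnet', 'l1', 'l2', 'none']
solvers = ['liblinear', 'saga', 'lbfgs', 'newton-cg', 'sag']

# Penalties each solver accepts (assumption: scikit-learn 0.22/0.23 documentation)
supported = {
    'liblinear': {'l1', 'l2'},
    'saga': {'elasticnet', 'l1', 'l2', 'none'},
    'lbfgs': {'l2', 'none'},
    'newton-cg': {'l2', 'none'},
    'sag': {'l2', 'none'},
}

all_pairs = list(product(penalties, solvers))
valid_pairs = [(p, s) for p, s in all_pairs if p in supported[s]]
print(f"{len(valid_pairs)} of {len(all_pairs)} penalty/solver pairs are supported")

# One grid per solver, listing only the penalties that solver accepts
C_values = [.001, .01, .1, 1, 10, 100, 1000]
param_grid = []
for solver in solvers:
    plain = sorted(supported[solver] - {'elasticnet'})
    param_grid.append({'solver': [solver], 'penalty': plain, 'C': C_values})
    if 'elasticnet' in supported[solver]:
        # elasticnet additionally needs an l1_ratio; these values are illustrative
        param_grid.append({'solver': [solver], 'penalty': ['elasticnet'],
                           'C': C_values, 'l1_ratio': [0.25, 0.5, 0.75]})
```

A list of dictionaries like `param_grid` can be passed to `GridSearchCV` in place of a single dictionary, so unsupported combinations are never fitted. Note also that with no penalty the value of `C` has no effect, so those candidates are redundant across `C`.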

### Bag of Words (Unscaled)

In [None]:
GS(BOW_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:   19.1s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   24.8s finished


Fitting 5 folds for each of 140 candidates, totalling 700 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:    9.7s


### Bag of Words (Scaled)

In [None]:
GS(BOW_train_scaled)

### Frequency Bag of Words (Unscaled)

In [None]:
GS(FBOW_train)

### Frequency Bag of Words (Scaled)

In [None]:
GS(FBOW_train_scaled)

### Bag of Words Bigrams (Unscaled)

In [None]:
GS(bigram_train)

### Bag of Words Bigrams (Scaled)

In [None]:
GS(bigram_train_scaled)

### TF-IDF (Unscaled)

In [None]:
GS(tfidf_train)

### TF-IDF (Scaled)

In [None]:
GS(tfidf_train_scaled)