# CISC 873 Data Mining Competition #3
name: Asmaa Qindeel


Competition #3: https://www.kaggle.com/c/cisc-873-dm-f22-a3

# TOPICS:
* [Questions & Answers](#Questions)
* [Problem and Protocol](#intro)

**Code Part**
* [Preparing workspace: gathering info about the data](#pre)
    * [Data Balance](#balance)
* [Trial 1: no preprocessing / TF-IDF / XGBoost](#t1)
* [Trial 2: clean text / TF-IDF / XGBoost](#t2)
* [Trial 3: no preprocessing / character-level TF-IDF / XGBoost](#t3)
* [Trial 4: best hyperparameters from above / `char_wb` vectorizer](#t4)
* [Trial 5: grid search with Logistic Regression](#t5)
    * [5.2](#t5.2)
* [Trial 6](#t6)


[**Conclusion**](#conc)

# Questions:  
<a class="anchor" id="Questions"></a>

1. What is the difference between character n-grams and word n-grams? Which one tends to suffer more from the OOV issue?
    - Character n-grams are sequences of n consecutive characters, while word n-grams are sequences of n consecutive words. Word n-grams suffer more from the out-of-vocabulary (OOV) issue: a word that never appeared in training produces no known feature, whereas its character n-grams usually overlap with n-grams seen in other words.
2. What is the difference between stop word removal and stemming? Are these techniques language-dependent?
    - Both are text preprocessing steps. Stop word removal drops words that occur so often that they carry little information, while stemming doesn't remove words but reduces them to their stem. Both are language-dependent: stemming follows the morphological rules of each language, and stop word lists differ per language, although they can be found with simple frequency analysis rather than with the rules of the language itself.
3. Are tokenization techniques language-dependent? Why?
    - Yes. You need to know the rules of the language to decide where one token ends and the next begins, and which pieces carry meaning. E.g. in Arabic one letter attached to a word/verb can change its meaning greatly, so splitting on whitespace alone loses information.
4. What is the difference between count vectorizer and tf-idf vectorizer? Would it be feasible to use all possible n-grams? If not, how should you select them?
    - The count vectorizer only records how often a term occurs in each document; it says nothing about how informative the term is across the whole dataset. TF-IDF multiplies that count by the inverse document frequency, so terms that occur in many documents are down-weighted and rarer, more distinctive terms get more weight. Both produce the same number of features.
    - Using all possible n-grams is not feasible: the vocabulary grows very quickly with n, and so do memory and training cost. Narrow the search to a few values of n based on your understanding of the problem and the model, prune very rare and very common n-grams (`min_df`, `max_df`), and use tuning on a validation set to pick the best n.
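
The answers to questions 1 and 4 can be checked on a toy example. This is only a sketch: the three headlines below are made up for illustration (they are not from the competition data), and the variable names are mine.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (hypothetical headlines, not taken from the competition data).
train_docs = [
    "government announces new budget",
    "scientists announce new discovery",
    "new budget angers local government",
]
unseen_doc = ["announcing budgets"]  # neither word occurs in train_docs

# Word-level unigrams: both unseen words are out-of-vocabulary, so nothing matches.
word_vec = CountVectorizer(analyzer="word", ngram_range=(1, 1)).fit(train_docs)
word_hits = int(word_vec.transform(unseen_doc).sum())

# Character trigrams: the unseen words still share pieces with the training vocabulary.
char_vec = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3)).fit(train_docs)
char_hits = int(char_vec.transform(unseen_doc).sum())

print("word-level matches:", word_hits)
print("char-level matches:", char_hits)

# Count vs. TF-IDF on the first headline: "new" occurs in every document,
# "announces" in only one, so TF-IDF weights "new" lower although both counts are 1.
counts = CountVectorizer().fit(train_docs)
tfidf = TfidfVectorizer().fit(train_docs)
row_counts = counts.transform(train_docs[:1]).toarray()[0]
row_tfidf = tfidf.transform(train_docs[:1]).toarray()[0]

count_new = row_counts[counts.vocabulary_["new"]]
count_rare = row_counts[counts.vocabulary_["announces"]]
tfidf_new = row_tfidf[tfidf.vocabulary_["new"]]
tfidf_rare = row_tfidf[tfidf.vocabulary_["announces"]]

print("counts:", count_new, count_rare)
print("tf-idf:", round(tfidf_new, 3), round(tfidf_rare, 3))
```

The number of columns is the same for both vectorizers; only the weights differ.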

# Problem Formulation:<a class="anchor" id="intro"></a>

https://www.kaggle.com/c/cisc-873-dm-f22-a3

The problem is an NLP binary classification task: separating real news headlines from fake ones. The input space is a set of 60k news headlines with target 0/1, 0 being fake news and 1 being real news. The data is raw, as we'll see.
In this project I'll explore text preprocessing, word/character vectorizers and the effect of n-grams, all with the use of a pipeline. The objective is to gain experience with NLP and to tune text preprocessing hyperparameters.

I think the ideal solution will need no preprocessing (no data cleaning), because the data is headlines, and news headlines are supposed to be clean, short text. For the same reason, I expect the count vectorizer and the TF-IDF vectorizer to perform similarly. I expect the best model to be linear (LogReg, SVC, ...).


Challenges 

Metric used in this problem is roc_auc.
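
Because `roc_auc` measures how well the predictions rank real headlines above fake ones, it matters that probabilities (not hard 0/1 labels) are scored and submitted. A small sketch with made-up labels and probabilities:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities of class 1 (real news).
y_true = [0, 0, 1, 1, 1]
y_prob = [0.2, 0.6, 0.55, 0.7, 0.9]

# AUC on probabilities: fraction of (real, fake) pairs ranked in the right order.
auc_prob = roc_auc_score(y_true, y_prob)

# AUC on hard 0/1 predictions (threshold 0.5) throws the ranking information away.
y_hard = [int(p >= 0.5) for p in y_prob]
auc_hard = roc_auc_score(y_true, y_hard)

print(f"AUC from probabilities: {auc_prob:.3f}")
print(f"AUC from hard labels:   {auc_hard:.3f}")
```

This is why the submission cells in this notebook write `predict_proba(...)[:, 1]` to the CSV rather than `predict(...)`.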

## The Impact:
For me: learning. For social media companies: a better society. Blocking fake, provocative news would reduce conflict in social gatherings; I have seen people fight over the most trivial piece of news, even to the point of violence. It would also increase awareness: in the absence of fake news people turn to real news, and awareness of their reality is the first step in changing it.



# Experimental Protocol:

A pipeline contains the preprocessor (vectorizer) and the model.

Bag of words (BoW) is the text representation.

I will first check my hypothesis about the data by trying cleaned and uncleaned text with a fixed model and vectorizer. Then I'll change the vectorizer to see which is better (word vs. character vectorizer); both use TF-IDF weighting. Then I'll switch the model between XGBoost and LogisticRegression.

For preprocessing I'll try cleaning with stemming once and cleaning without stemming once (the `word_embedding` function in Trial 6).

I also use random search with a predefined validation set, or grid search when the grid is small enough. The metric `'roc_auc'` will be the decision maker.
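
The predefined validation set works through a `test_fold` array: `-1` marks rows that are always used for training, and `0` marks rows that belong to the single validation fold. A minimal sketch on six hypothetical rows:

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# One entry per row: -1 = training only, 0 = member of validation fold 0.
test_fold = np.array([-1, -1, 0, -1, 0, -1])
split = PredefinedSplit(test_fold=test_fold)

# There is exactly one train/validation split.
n_splits = split.get_n_splits()
train_idx, val_idx = next(split.split())

print("number of splits:", n_splits)
print("train rows:", train_idx)
print("validation rows:", val_idx)
```

Passing such a split as `cv=` to `GridSearchCV` or `RandomizedSearchCV` scores every candidate on the validation rows only; with the default `refit=True` the best candidate is then refit on all rows passed to `fit`.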

# References
Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O’Reilly Media Inc.

873_nlp lab

In [1]:
# libs for text preprocessing
import re
import sklearn
import pandas as pd
import numpy as np
import nltk 

In [2]:
#libs for modeling

# NLTK tools
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# vectorizers
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# model building and tuning tools
from sklearn.model_selection import train_test_split,  PredefinedSplit
from sklearn.metrics import roc_auc_score, classification_report
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

#import models
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

np.random.seed(0)

In [4]:
#load the data

train_file = pd.read_csv('train.csv' , header = 0 )#, index_col=0)
test_file = pd.read_csv('test.csv' , header = 0 )#, index_col=0)

#train_file.head()
train_file.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      60000 non-null  int64 
 1   text    60000 non-null  object
 2   label   60000 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 1.4+ MB


In [5]:
# always make a copy 
df = train_file.copy()
df_test = test_file.copy()
# check the count of null values
null_count = df.isnull().sum().sort_values(ascending = False)
null_count

id       0
text     0
label    0
dtype: int64

In [5]:
df.head(10)

| | id | text | label |
|---|---|---|---|
| 0 | 265723 | A group of friends began to volunteer at a hom... | 0 |
| 1 | 284269 | British Prime Minister @Theresa_May on Nerve A... | 0 |
| 2 | 207715 | In 1961, Goodyear released a kit that allows P... | 0 |
| 3 | 551106 | Happy Birthday, Bob Barker! The Price Is Right... | 0 |
| 4 | 8584 | Obama to Nation: 聙"Innocent Cops and Unarmed Y... | 0 |
| 5 | 117912 | In the 1920鈥檚, Hitler was forbidden to address... | 0 |
| 6 | 213064 | Nerd Wins Scrabble with word you've never hear... | 0 |
| 7 | 398923 | Why 95.8% of Female Newscasters Have the Same ... | 1 |
| 8 | 314798 | Donald Trump Says He'll Do This If More 'Inapp... | 0 |
| 9 | 20243 | 5 crazy facts about Lamborghini's outrageous e... | 0 |


In [5]:
#see the target distribution
df.label.value_counts()

0    32172
1    27596
2      232
Name: label, dtype: int64

There shouldn't be a target 2, but 232 of 60000 is a small ratio (about 0.4%), so I'll treat those rows as wrong data and drop them.

In [6]:
#drop the rows where label=2 
#df[df.label == 2]
df.drop(df[df.label == 2].index, axis = 0, inplace=True)
df.label.value_counts()

0    32172
1    27596
Name: label, dtype: int64

# Trial 1:
<a class="anchor" id="t1"></a>
## No Preprocessing :

Feed the data into the model as it is, to get a feel for the benefit of preprocessing. Use random search over the hyperparameters of the vectorizer and the model.
- model: XGBoost
- feature extraction: TfidfVectorizer, on word level



In [7]:
X = df['text']
y = df['label']
X_train, X_val, y_train, y_val = train_test_split(X, y , test_size = 0.2 , stratify = y, random_state = 42)

In [51]:
#create the pipeline
full_pipe = Pipeline(steps=[
        # analyzer = 'word' means word-level vectorizer.
    ('vectorizor', TfidfVectorizer(stop_words='english', analyzer='word')) , 
    ('my_model', XGBClassifier(use_label_encoder=False, verbosity = 0)) 
])

#define parameter grid for the tuning
param_grid = {
    #parameters of XGBoost
    'my_model__max_depth':[11, 13, 17],
    'my_model__learning_rate':[0.1],
    
    #parameters for the Vectorizer
    'vectorizor__ngram_range': [(1, 2), (1, 3)],
    # NOTE: np.arange(0.3, 0.8) uses the default step of 1, so it yields only [0.3]
    'vectorizor__max_df': np.arange(0.3, 0.8),
    "vectorizor__min_df": np.arange(5, 50)
}

#create predefined split
split_index = [-1 if x in X_train.index else 0 for x in X.index]
pre_defined_split = PredefinedSplit(test_fold = split_index )

#cast y as int for the XGBoost
y = y.astype(int)

In [52]:
%%time
# define random search
random_search = RandomizedSearchCV(
    full_pipe, param_grid, cv = pre_defined_split , scoring="roc_auc", n_iter=15)
random_search.fit(X, y)

print('best score of the cv:', random_search.best_score_)
print('best Hyper set:', random_search.best_params_)

best score of the cv: 0.8386354379650266
best Hyper set: {'vectorizor__ngram_range': (1, 2), 'vectorizor__min_df': 23, 'vectorizor__max_df': 0.3, 'my_model__max_depth': 17, 'my_model__learning_rate': 0.1}
Wall time: 2min 11s


**Model performance (recorded from a different run of the random search, so it differs slightly from the output above):**

best score of the cv: 0.8365894860912074


`best Hyper set: {'vectorizor__ngram_range': (1, 3), 'vectorizor__min_df': 34, 'vectorizor__max_df': 0.3, 'my_model__max_depth': 17, 'my_model__learning_rate': 0.1}`
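
Every run reports `max_df` = 0.3, which follows from how the grid is built: `np.arange(0.3, 0.8)` uses a step of 1, so 0.3 is the only candidate. A short sketch of the difference, with `np.linspace` as one possible way (my suggestion, not what the runs above used) to get several candidates:

```python
import numpy as np

# Default step is 1, so only the start value fits below the stop value.
single = np.arange(0.3, 0.8)

# Five evenly spaced candidates between 0.3 and 0.7 (inclusive).
several = np.linspace(0.3, 0.7, 5)

print("np.arange(0.3, 0.8)     ->", single)
print("np.linspace(0.3, 0.7, 5) ->", several)
```

Widening the `max_df` candidates multiplies the size of the search, so the wall times reported here would grow accordingly.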

In [53]:
#predict for the test file data
# and save to file
y_out = random_search.predict_proba(df_test['text'])

dummy = pd.DataFrame({'id': df_test['id'],'label': y_out[:,1]})
dummy.to_csv("nopre_T1.csv", index = False)

# Trial 2: 
<a class="anchor" id="t2"></a>

An AUC of about 0.83 with no preprocessing is pretty good.

Now let's try some simple preprocessing. I'll clean the text to create a BoW (bag of words): remove tags such as `<>`, remove special characters so that only letters and numbers remain (I thought numbers could be a good signal for telling fake from real news), remove single letters, then drop stop words and stem.
Use the best model parameters from Trial 1, and search the vectorizer parameters with a grid search on the validation set.

**Setting:**

- model: use the best settings of the model from Trial 1
- vectorizer: word-level vectorizer
- tuning: a grid search this time, narrowing the search space around the results of Trial 1 so the search finishes faster


In [37]:
#download the package for tokenization and stopwords
nltk.download('punkt')
nltk.download('stopwords')

# get the stemmer for English
stemmer = SnowballStemmer("english")
stop_words = set(stopwords.words("english"))

# this function was adapted from the original in the 873_nlp lab

# regular expressions for the things to remove, compiled once
RE_WSPACE = re.compile(r"\s+")                   # runs of whitespace
RE_TAGS = re.compile(r"<[^>]+>")                 # tags such as <br>
RE_ASCII = re.compile(r"[^A-Za-z0-9 ]")          # anything that is not an English letter, a digit (0-9) or a space
RE_SINGLECHAR = re.compile(r"\b[A-Za-zÀ-ž]\b")   # single letters

def clean_text(text):
    """Strip tags, special characters and single letters, then lowercase, drop stop words and stem."""
    # replace the unwanted patterns with a space
    text = RE_TAGS.sub(" ", text)
    text = RE_ASCII.sub(" ", text)
    text = RE_SINGLECHAR.sub(" ", text)
    text = RE_WSPACE.sub(" ", text)

    # tokenize the text (separate it into words)
    word_tokens = word_tokenize(text)
    # lower case
    words_tokens_lower = [word.lower() for word in word_tokens]

    # drop the stop words and stem the remaining words
    words_filtered = [
        stemmer.stem(word) for word in words_tokens_lower if word not in stop_words
    ]

    # put the words back together
    text_clean = " ".join(words_filtered)
    return text_clean

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\asmaa\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\asmaa\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [38]:
#first clean the train file
# copy the text column to a new column
df['clean_txt'] = df['text']
# remap the new column with the clean_text function
df['clean_txt'] = df['clean_txt'].map(
    lambda x: clean_text(x ) if isinstance(x, str) else x
)
#also clean the test file
df_test['clean_txt'] = df_test['text']
df_test['clean_txt'] = df_test['clean_txt'].map(
    lambda x: clean_text(x ) if isinstance(x, str) else x
)

In [19]:
# check out the result
df.head()

| | id | text | label | clean_txt |
|---|---|---|---|---|
| 0 | 265723 | A group of friends began to volunteer at a hom... | 0 | group friend began volunt homeless shelter nei... |
| 1 | 284269 | British Prime Minister @Theresa_May on Nerve A... | 0 | british prime minist theresa may nerv attack f... |
| 2 | 207715 | In 1961, Goodyear released a kit that allows P... | 0 | 1961 goodyear releas kit allow ps2s brought he... |
| 3 | 551106 | Happy Birthday, Bob Barker! The Price Is Right... | 0 | happi birthday bob barker price right host lik... |
| 4 | 8584 | Obama to Nation: 聙"Innocent Cops and Unarmed Y... | 0 | obama nation innoc cop unarm young black men d... |


In [20]:
X = df['clean_txt']
y = df['label']
X_train, X_val, y_train, y_val = train_test_split(X, y , test_size = 0.3 , stratify = y, random_state = 42)

In [21]:
#create the pipeline with tfidf vectorizer and model:XGBoost

full_pipe = Pipeline(steps=[
    # analyzer = 'word' means word-level vectorizer.
    ('vectorizor', TfidfVectorizer(stop_words='english', analyzer='word')) , 
    ('my_model', XGBClassifier(use_label_encoder=False, verbosity = 0)) 
])

#define parameter grid for the tuning
param_grid = {
    #parameters of XGBoost
    #use the best settings from Trial_1
    'my_model__max_depth':[17],
    'my_model__learning_rate':[0.1],
  
    'vectorizor__ngram_range': [(1, 3)],
    # NOTE: np.arange(0.3, 0.8) uses the default step of 1, so it yields only [0.3]
    'vectorizor__max_df': np.arange(0.3, 0.8),
    # the best min_df recorded for Trial_1 was 34, so I narrowed the range down a little
    "vectorizor__min_df": np.arange(10, 47)
}

# define the predefined validation set
split_index = [-1 if x in X_train.index else 0 for x in X.index]
pre_defined_split = PredefinedSplit(test_fold = split_index )

In [8]:
grid_search = GridSearchCV(
    full_pipe, param_grid, cv = pre_defined_split , scoring="roc_auc" )
grid_search.fit(X, y)

print('best score of the cv:', grid_search.best_score_)
print('best Hyper set:', grid_search.best_params_)

**Result:**

best score of the cv: 0.8245970148910057

`best Hyper set: {'my_model__learning_rate': 0.1, 'my_model__max_depth': 17, 'vectorizor__max_df': 0.3, 'vectorizor__min_df': 13, 'vectorizor__ngram_range': (1, 3)}
Wall time: 6min 42s`

In [48]:
#predict for the test file data
# and save to file
y_out = grid_search.predict_proba(df_test['clean_txt'])

dummy = pd.DataFrame({'id': df_test['id'],'label': y_out[:,1]})
dummy.to_csv("clean_xgboost_T2.csv", index = False)

# Trial 3:
<a class="anchor" id="t3"></a>

From Trials 1 and 2 it is actually better not to clean the data?! There goes the logic.

Now I'll focus on tuning the model and vectorizer, and try another preprocessing afterwards. So I'll use the raw data and tune the vectorizer type (word, char).

**new setting:**
- preprocessing: none
- model: XGBoost
- Tuning: grid search to tune the vectorizer type (word/char)
- vectorizer: char level tfidf

In [23]:
X = df['text']
y = df['label']
# make the test size 0.3 because we have plenty of data
X_train, X_val, y_train, y_val = train_test_split(X, y , test_size = 0.3 , stratify = y, random_state = 42)

In [23]:
#create the pipeline
full_pipe = Pipeline(steps=[
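    # stop_words only applies to analyzer='word'; it is ignored for the 'char' analyzer set in the grid
    ('vectorizor', TfidfVectorizer(stop_words='english')) , 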
    ('my_model', XGBClassifier(use_label_encoder=False, verbosity = 0)) 
])

#define parameter grid for the tuning
param_grid = {
    #parameters of XGBoost
    #use the best settings from Trial_1 
    'my_model__max_depth':[17],
    'my_model__learning_rate':[0.1],
  
    'vectorizor__analyzer': ['char'],  # ,'char_wb'],
    'vectorizor__ngram_range': [(3,5)], #[(3,5), (7,11)],
    'vectorizor__max_df': [0.3],  # np.arange(0.3, 0.5, 0.1),
    "vectorizor__min_df": [23] #np.arange(11, 23) #best min in last trial
}

# define the predefined validation set
split_index = [-1 if x in X_train.index else 0 for x in X.index]
pre_defined_split = PredefinedSplit(test_fold = split_index )

In [24]:
# define grid search and feed it the pipeline
grid_search = GridSearchCV(
    full_pipe, param_grid, cv = pre_defined_split , scoring="roc_auc" )
#fit the search
grid_search.fit(X, y)

print('best score of the cv:', grid_search.best_score_)
print('best Hyper set:', grid_search.best_params_)

best score of the cv: 0.9251783918759096
best Hyper set: {'my_model__learning_rate': 0.1, 'my_model__max_depth': 17, 'vectorizor__analyzer': 'char', 'vectorizor__max_df': 0.3, 'vectorizor__min_df': 23, 'vectorizor__ngram_range': (3, 5)}
Wall time: 9min 34s


**Some of the n-gram ranges I tried, one at a time so each grid finishes fast:**


- ngram (11,13): best score of the cv: 0.7335391881465831
- ngram (9,9): best score of the cv: 0.7914074548484631, Wall time: 9min 34s
- ngram (7,7): best score of the cv: 0.84
- ngram (5,5): best score of the cv: 0.8918609987762567, Wall time: 9min 34s
- ngram (3,5): best score of the cv: 0.9251783918759096, Wall time: 9min 34s
- ngram (3,4): best score of the cv: 0.9235048800817051, Wall time: 6min 19s
- ngram (3,3): best score of the cv: 0.9169495846445554, Wall time: 9min 34s
- ngram (2,2): best score of the cv: 0.9033028582996028, Wall time: 9min 34s


In [25]:
#predict for the test file data
# and save to file
y_out = grid_search.predict_proba(df_test['text'])

dummy = pd.DataFrame({'id': df_test['id'],'label': y_out[:,1]})
dummy.to_csv("char_vector_T3_b.csv", index = False)

# Trial 4:
<a class="anchor" id="t4"></a>

Trial 3 was overfitting: it scored > 0.92 on the validation set with `ngram_range` (3,5), but did poorly on the Kaggle test set.

I should have tested it on a separate test set and looked at its f1-score.
I think this is because the character-level analyzer doesn't capture information that is crucial in our problem: a set of letters such as 'sem' doesn't have the same effect as a whole word like 'assemble', and it repeats differently. The model probably learned the pattern of the letters in the words and nothing more, which is pointless unless this is a translation problem.

**Next plan:**

One more check of the character vectorizer: I want to try the `char_wb` option of the analyzer, with the same set of parameters as Trial 3.

**Setting (same as Trial 3 except for the analyzer):**
- preprocessing: none
- model: XGBoost
- Tuning: grid search to tune the vectorizer type (word/char)
- vectorizer: char_wb level tfidf

In [8]:
#create the pipeline
full_pipe = Pipeline(steps=[
    #'char_wb' creates character n-grams only from text inside word boundaries
    ('vectorizor', TfidfVectorizer(stop_words='english')) ,
    ('my_model', XGBClassifier(use_label_encoder=False, verbosity = 0)) 
])

#define parameter grid for the tuning
param_grid = {
    #parameters of XGBoost
    #use the best settings from Trial_1 
    'my_model__max_depth':[17],
    'my_model__learning_rate':[0.1],
  
     #best combination from Trial 3
    'vectorizor__analyzer': ['char_wb'], #char_wb means characters only inside words
    'vectorizor__ngram_range': [(3,4)], #[(3,5), (7,11)],
    'vectorizor__max_df': [0.3],  
    "vectorizor__min_df": [23] 
}

# define the predefined validation set
split_index = [-1 if x in X_train.index else 0 for x in X.index]
pre_defined_split = PredefinedSplit(test_fold = split_index )

In [9]:
# define the grid search model and feed it the pipeline
grid_search = GridSearchCV(
    full_pipe, param_grid, cv = pre_defined_split , scoring="roc_auc" )
#train the search
grid_search.fit(X, y)

print('best score of the cv:', grid_search.best_score_)
print('best Hyper set:', grid_search.best_params_)

best score of the cv: 0.9252480904772987
best Hyper set: {'my_model__learning_rate': 0.1, 'my_model__max_depth': 17, 'vectorizor__analyzer': 'char_wb', 'vectorizor__max_df': 0.3, 'vectorizor__min_df': 23, 'vectorizor__ngram_range': (3, 4)}


**some of the ngram i tried with `char_wb`, each at a time so the grid ends fast:**


- ngram (7,7):  best score of the cv: 0.8850190331721215, Wall time: 54.6 s
- ngram (5,7):  best score of the cv: 0.9038577714014062, Wall time: 3min 25s
- ngram (3,4): best score of the cv: 0.92297878103903, Wall time: 4min 47s
 


In [38]:
#predict for the test file data
# and save to file
y_out = grid_search.predict_proba(df_test['text'])

dummy = pd.DataFrame({'id': df_test['id'],'label': y_out[:,1]})
dummy.to_csv("char_wb_vector_T4.csv", index = False)

# Trial 5:
<a class="anchor" id="t5"></a>
`char_wb ngram (3,4): best score of the cv: 0.92297878103903`

Although this seems to be a good result, I doubted it would perform well on test data, and I was right: on the Kaggle results it was no better than the previous models.

The **best settings** so far are:

- no text cleaning
- word-level vectorizer
- ngram_range (1,3)
- max_df: 0.3 
- min_df: 23

So the **plan** is to use those settings with a new model, because XGBoost tends to overfit here: it is an ensemble of boosted decision trees, and with `max_depth` 17 each tree makes a lot of splits.
I'll also hold out a test set, 0.2 of the data.
- model: LogisticRegression
- tuning: grid search over the logistic regression hyperparameters
- vectorizer: word level, ngram (1,3), with the best parameters from before
- testing: a separate test set, so I can check what happened in the model

In [11]:
X1 = df['text']
y1 = df['label']
# hold out 0.2 of the data as a test set; we have plenty of data
X, X_ts, y, y_ts = train_test_split( X1, y1 , test_size = 0.2 , stratify = y1, random_state = 42) 
X_train, X_val, y_train, y_val = train_test_split(X, y , test_size = 0.2 , stratify = y, random_state = 42) 

In [12]:
#create the pipeline
full_pipe = Pipeline(steps=[
    ('vectorizor', TfidfVectorizer(stop_words='english')) , 
    ('my_model', LogisticRegression(verbose=0)) 
])

#define parameter grid for the tuning
param_grid = {
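     #hypers of the Logistic Reg
     # NOTE: the default 'lbfgs' solver does not support 'l1' (that needs solver='liblinear' or 'saga'),
     # and scikit-learn < 1.2 spells "no penalty" as 'none' rather than None
    'my_model__penalty': [None, 'l1', 'l2'],
    'my_model__C':[1.0, 0.8],
    'my_model__max_iter': [10000], # so the model can converge
   
     #best vectorizer settings from the word-level trials above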
    'vectorizor__analyzer': ['word'],
    'vectorizor__ngram_range': [(1,3)],  
    'vectorizor__max_df': [0.3],  
    "vectorizor__min_df": [23] 
}

# define the predefined validation set
split_index = [-1 if x in X_train.index else 0 for x in X.index]
pre_defined_split = PredefinedSplit(test_fold = split_index )

In [13]:
grid_search = GridSearchCV(
    full_pipe, param_grid, cv = pre_defined_split , scoring="roc_auc", verbose=0 )
grid_search.fit(X, y)

print('best score of the cv:', grid_search.best_score_)
print('best Hyper set:', grid_search.best_params_)

4 fits failed out of a total of 6.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
2 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\asmaa\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\asmaa\anaconda3\lib\site-packages\sklearn\pipeline.py", line 394, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "C:\Users\asmaa\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "C:\Users\asmaa\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 441, i

best score of the cv: 0.8761954196552159
best Hyper set: {'my_model__C': 1.0, 'my_model__max_iter': 10000, 'my_model__penalty': 'l2', 'vectorizor__analyzer': 'word', 'vectorizor__max_df': 0.3, 'vectorizor__min_df': 23, 'vectorizor__ngram_range': (1, 3)}
Wall time: 34.6 s
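
The traceback above is cut off, so the exact error message is not visible, but the failed fits are consistent with two known restrictions: the default `lbfgs` solver rejects the `'l1'` penalty, and older scikit-learn versions expect the string `'none'` instead of `None`. A minimal sketch of a grid in which every candidate can be fitted, assuming the `liblinear` solver (which supports both `'l1'` and `'l2'`) and synthetic data from `make_classification` in place of the headlines:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the real notebook would pass the pipeline and the headlines.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

demo_grid = {
    "penalty": ["l1", "l2"],  # both are supported by liblinear
    "C": [1.0, 0.8],
}

demo_search = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),
    demo_grid,
    scoring="roc_auc",
    cv=3,
    error_score="raise",  # fail loudly instead of silently scoring nan
)
demo_search.fit(X_demo, y_demo)

scores = demo_search.cv_results_["mean_test_score"]
print("candidates:", len(scores))
print("any nan scores:", bool(np.isnan(scores).any()))
```

In the pipeline used here the same idea would go through the step prefix, e.g. a `'my_model__solver'` entry next to `'my_model__penalty'`.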


In [15]:
#testing and checking with more metrics
y_hat = grid_search.predict(X_ts)
report = classification_report(y_ts, y_hat)
print(report)

              precision    recall  f1-score   support

           0       0.82      0.80      0.81      6435
           1       0.77      0.79      0.78      5519

    accuracy                           0.80     11954
   macro avg       0.79      0.79      0.79     11954
weighted avg       0.80      0.80      0.80     11954



In [16]:
#predict for the test file data
# and save to file
y_out = grid_search.predict_proba(df_test['text'])

dummy = pd.DataFrame({'id': df_test['id'],'label': y_out[:,1]})
dummy.to_csv("logistic_T5_noclean.csv", index = False)

# Trial 5.2:
<a class="anchor" id="t5.2"></a>
The Trial 5 macro f1-score is 0.79 with logistic regression and a word-level vectorizer.
I think this is good enough, but let's see if it gets better when we go back to a char-level vectorizer.

Trial 5.2 setting, again at character level:
- no text cleaning
    - 'vectorizor__analyzer': ['char_wb'],
    - 'vectorizor__ngram_range': [(3,4)],
    - 'vectorizor__max_df': [0.3],  
    - "vectorizor__min_df": [23] 

In [17]:
X1 = df['text']
y1 = df['label']
# make the test size 0.2 because we have plenty of data
X, X_ts, y, y_ts = train_test_split( X1, y1 , test_size = 0.2 , stratify = y1, random_state = 42) 
X_train, X_val, y_train, y_val = train_test_split(X, y , test_size = 0.2 , stratify = y, random_state = 42) 

In [24]:
#create the pipeline
full_pipe = Pipeline(steps=[
    #'char_wb' creates character n-grams only from text inside word boundaries
    ('vectorizor', TfidfVectorizer(stop_words='english')) , 
    ('my_model', LogisticRegression(verbose=0)) 
])

#define parameter grid for the tuning
param_grid = {
     #hypers of the Logistic Reg
     # NOTE: scikit-learn < 1.2 spells "no penalty" as 'none' rather than None
    'my_model__penalty': [None, 'l2'],
    'my_model__C':[1.0, 0.8],
    'my_model__max_iter': [10000], # so the model can converge
   
     #best combination from Trial 3
    'vectorizor__analyzer': ['char_wb'],
    'vectorizor__ngram_range': [(3,4)], #[(3,5), (7,11)],
    'vectorizor__max_df': [0.3],  
    "vectorizor__min_df": [23] 
}

# define the predefined validation set
split_index = [-1 if x in X_train.index else 0 for x in X.index]
pre_defined_split = PredefinedSplit(test_fold = split_index )

In [25]:
%%time 
#show how much time this will take

grid_search = GridSearchCV(
    full_pipe, param_grid, cv = pre_defined_split , scoring="roc_auc", verbose=0 )
grid_search.fit(X, y)

print('best score of the cv:', grid_search.best_score_)
print('best Hyper set:', grid_search.best_params_)

2 fits failed out of a total of 4.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
2 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\asmaa\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\asmaa\anaconda3\lib\site-packages\sklearn\pipeline.py", line 394, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "C:\Users\asmaa\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "C:\Users\asmaa\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 441, i

best score of the cv: 0.9206754363039752
best Hyper set: {'my_model__C': 1.0, 'my_model__max_iter': 10000, 'my_model__penalty': 'l2', 'vectorizor__analyzer': 'char_wb', 'vectorizor__max_df': 0.3, 'vectorizor__min_df': 23, 'vectorizor__ngram_range': (3, 4)}
Wall time: 1min 12s



In [26]:
#testing with test set
y_hat = grid_search.predict(X_ts)
report = classification_report(y_ts, y_hat)
print(report)

              precision    recall  f1-score   support

           0       0.87      0.84      0.86      6435
           1       0.82      0.85      0.84      5519

    accuracy                           0.85     11954
   macro avg       0.85      0.85      0.85     11954
weighted avg       0.85      0.85      0.85     11954



![image](https://sayingimages.com/wp-content/uploads/jackie-chan-wait-what-meme.jpg)

## 5.2 summary
So:
- model: logistic regression with
- char_wb tfidf vectorizer, ngram (3,4),
- AAAAAAnd no preprocessing, NONE

performed better than all of the above?!

`roc_auc` = 0.92 

`macro f1-score` = 0.85 (0.86 for class 0 and 0.84 for class 1)


It performed better than Trial 2, which had preprocessing and a word-level vectorizer?!
WHAAAT?!

That doesn't make sense to me: a word-level vectorizer should be better in this problem, so why is it not?!
>I thought maybe char_wb with ngram (3,4) was just reproducing the whole words again, so I tried word-level with ngram (1,1), but it got a lower roc_auc score: 0.87.
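
To see what `char_wb` with ngram (3,4) actually produces, the features can be listed for a single short headline. This is just an illustration on three words taken from one of the headlines above; it shows the features, it does not explain the score difference.

```python
from sklearn.feature_extraction.text import CountVectorizer

headline = ["Nerd Wins Scrabble"]

# char_wb pads every word with a space on each side and never crosses word boundaries.
char_vec = CountVectorizer(analyzer="char_wb", ngram_range=(3, 4)).fit(headline)
word_vec = CountVectorizer(analyzer="word", ngram_range=(1, 1)).fit(headline)

char_features = sorted(char_vec.vocabulary_)
word_features = sorted(word_vec.vocabulary_)

print("word features:", word_features)
print("char_wb features:", char_features)
```

Short words such as "nerd" and "wins" still appear whole among the 4-grams, but a longer word like "scrabble" only appears as overlapping fragments ("scra", "crab", "rabb", ...), so the two feature sets are not the same thing.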

In [80]:
#predict for the test file data
# and save to file
y_out = grid_search.predict_proba(df_test['text'])

dummy = pd.DataFrame({'id': df_test['id'],'label': y_out[:,1]})
dummy.to_csv("logistic_T5_b_char.csv", index = False)

# Trial 6: 
<a class="anchor" id="t6"></a>
That last Trial 5.2 left me puzzled, but I still have one more technique I want to explore: cleaning/preprocessing without stemming or stop-word removal. I call this a simple "word embedding" here, although it is really just light text cleaning; a real word embedding is something like word2vec.

**Setting:**
- Preprocessing: light cleaning with `word_embedding` (no stemming)
- model: logistic regression
- tuning: grid search
- vectorizer: char_wb level tfidf vectorizer with ngram(3,4)

In [34]:
def word_embedding(text ):
##plan:
## define the things to remove as RegEx
## use re.sub function to replace them in text
## 

#   1. define the whitespaces
    RE_WSPACE = re.compile(r"\s+", re.IGNORECASE)
#   2. define the tags
    RE_TAGS = re.compile(r"<[^>]+>")
#   3. define all that is not a letter or basic punctuation (digits are removed too)
    RE_ASCII = re.compile(r"[^A-Za-zÀ-ž,.!? ]", re.IGNORECASE)
#   4. define single characters
    RE_SINGLECHAR = re.compile(r"\b[A-Za-zÀ-ž,.!?]\b", re.IGNORECASE)

    # now replace them in the text
    text = re.sub(RE_TAGS, " ", text)
    text = re.sub(RE_ASCII, " ", text)
    text = re.sub(RE_SINGLECHAR, " ", text)
    text = re.sub(RE_WSPACE, " ", text)

    #tokenize the text (separate it into words)    
    word_tokens = word_tokenize(text)
  #  print(word_tokens)
    #put the words back together
    text_clean = " ".join(word_tokens)
    return text_clean

In [35]:
%%time
#clean the train file
df['clean_txt2'] = df['text']
df['clean_txt2'] = df['clean_txt2'].map(
    lambda x: word_embedding(x) if isinstance(x, str) else x
)
# also clean the test file
df_test['clean_txt2'] = df_test['text']
df_test['clean_txt2'] = df_test['clean_txt2'].map(
    lambda x: word_embedding(x) if isinstance(x, str) else x
)

Wall time: 34.4 s


In [43]:
df[['text', 'clean_txt', 'clean_txt2']].head()

| | text | clean_txt | clean_txt2 |
|---|---|---|---|
| 0 | A group of friends began to volunteer at a hom... | group friend began volunt homeless shelter nei... | group of friends began to volunteer at homeles... |
| 1 | British Prime Minister @Theresa_May on Nerve A... | british prime minist theresa may nerv attack f... | British Prime Minister Theresa May on Nerve At... |
| 2 | In 1961, Goodyear released a kit that allows P... | 1961 goodyear releas kit allow ps2s brought he... | In , Goodyear released kit that allows PS to b... |
| 3 | Happy Birthday, Bob Barker! The Price Is Right... | happi birthday bob barker price right host lik... | Happy Birthday , Bob Barker ! The Price Is Rig... |
| 4 | Obama to Nation: 聙"Innocent Cops and Unarmed Y... | obama nation innoc cop unarm young black men d... | Obama to Nation Innocent Cops and Unarmed Youn... |


In [45]:
# hold out 20% as a test set, then take 20% of the remainder for validation
# (roughly 64% train / 16% validation / 20% test overall)
X1 = df['clean_txt2']
y1 = df['label']
X, X_ts, y, y_ts = train_test_split(X1, y1, test_size=0.2, stratify=y1, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

In [46]:
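#create the pipeline
# note: stop_words only takes effect with analyzer='word',
# so it is ignored once the grid below switches to 'char_wb'
full_pipe = Pipeline(steps=[
    ('vectorizor', TfidfVectorizer(stop_words='english')),
    ('my_model', LogisticRegression(verbose=0))
])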

#define the parameter grid for the tuning
param_grid = {
    # hypers of the Logistic Reg
    # note: penalty=None is only accepted by newer scikit-learn releases;
    # older releases expect the string 'none', so the None fits can fail there
    'my_model__penalty': [None, 'l2'],
    'my_model__C': [1.0, 0.8],
    'my_model__max_iter': [10000],  # so the model can converge

    # best combination from Trial 3
    'vectorizor__analyzer': ['char_wb'],
    'vectorizor__ngram_range': [(3, 4)],
    'vectorizor__max_df': [0.3],
    'vectorizor__min_df': [23],
}

# define the predefined validation split (-1 = always train, 0 = validation fold)
split_index = [-1 if x in X_train.index else 0 for x in X.index]
pre_defined_split = PredefinedSplit(test_fold = split_index )
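
The `-1`/`0` encoding can be checked on a tiny example. The sketch below assumes scikit-learn and NumPy are installed; the six fold labels are toy values, not the notebook's data. Rows marked `-1` are never used for validation and rows marked `0` form the single validation fold.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# toy fold labels: -1 = always in training, 0 = validation fold number 0
test_fold = np.array([-1, -1, 0, -1, 0, -1])
ps = PredefinedSplit(test_fold=test_fold)

n_splits = ps.get_n_splits()
train_idx, val_idx = next(ps.split())

print("number of splits:", n_splits)
print("train rows:", train_idx)
print("validation rows:", val_idx)
```

With only one split, the grid search fits each hyperparameter combination exactly once, which is why the 4 combinations in `param_grid` produce a total of 4 fits.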

In [47]:
%%time
grid_search = GridSearchCV(
    full_pipe, param_grid, cv = pre_defined_split , scoring="roc_auc" , verbose=0)
grid_search.fit(X, y)

print('best score of the cv:', grid_search.best_score_)
print('best Hyper set:', grid_search.best_params_)

2 fits failed out of a total of 4.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
2 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\asmaa\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\asmaa\anaconda3\lib\site-packages\sklearn\pipeline.py", line 394, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "C:\Users\asmaa\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "C:\Users\asmaa\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 441, i

best score of the cv: 0.8721561933852436
best Hyper set: {'my_model__C': 1.0, 'my_model__max_iter': 10000, 'my_model__penalty': 'l2', 'vectorizor__analyzer': 'char_wb', 'vectorizor__max_df': 0.3, 'vectorizor__min_df': 23, 'vectorizor__ngram_range': (3, 4)}
Wall time: 1min 7s


In [48]:
y_hat = grid_search.predict(X_ts)
report = classification_report(y_ts, y_hat)
print(report)

              precision    recall  f1-score   support

           0       0.80      0.80      0.80      6435
           1       0.77      0.77      0.77      5519

    accuracy                           0.79     11954
   macro avg       0.79      0.79      0.79     11954
weighted avg       0.79      0.79      0.79     11954



In [48]:
#predict for the test file data
# and save to file
# use the search fitted in this trial and the column it was trained on
y_out = grid_search.predict_proba(df_test['clean_txt2'])

dummy = pd.DataFrame({'id': df_test['id'],'label': y_out[:,1]})
dummy.to_csv("T5.3.csv", index = False)

# Conclusion:
<a class="anchor" id="conc"></a>

Nothing makes sense, what you think you know is probably not true, and studying is never enough.

That being said, the best model across these 6 trials was:
- Logistic regression with C=1.0, penalty=l2, and a char_wb-level vectorizer with ngram(3,4), all with no preprocessing, just raw data. Its grid search took 2 minutes to finish, which was faster than the others, which took as long as 4 and 6 minutes.

The embedding I did was weak. In coming trials I hope to try something like word2vec. I just learned about it today, and it will be very interesting to create vectors that capture the similarities between words. Word vectors don't quite mean the machine can understand the meaning of words, only the similarities between them,