## Data mining questions

1- What is the difference between Character n-gram and Word n-gram? Which one tends to suffer more from the OOV issue?

- The main difference is the unit: a character n-gram is a sequence of n consecutive characters, whereas a word n-gram is a sequence of n consecutive words.
- Word n-grams suffer more from the OOV (out-of-vocabulary) issue. Character n-grams capture sub-word information, so an unseen word still shares many of its character n-grams with words seen during training.
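
The OOV point can be illustrated with a minimal sketch in plain Python. The helper names (`word_ngrams`, `char_ngrams`) and the sample sentence are made up for this illustration and are not part of the notebook's pipeline.

```python
def word_ngrams(text, n):
    """Return the list of word n-grams (as strings) in `text`."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(text, n):
    """Return the list of character n-grams in `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Hypothetical "training" text used to build both vocabularies.
train_vocab_words = set(word_ngrams("fake news spreads fast", 1))
train_vocab_chars = set(char_ngrams("fake news spreads fast", 3))

# "spreading" never appeared in training, so the word-level view has nothing
# to match, while several of its character trigrams were already seen.
unseen = "spreading"
word_overlap = set(word_ngrams(unseen, 1)) & train_vocab_words
char_overlap = set(char_ngrams(unseen, 3)) & train_vocab_chars

print(word_overlap)
print(sorted(char_overlap))
```

The unseen word contributes no known word unigram, but it still shares trigrams with "spreads", which is why character n-grams degrade more gracefully on new vocabulary.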

---


2- What is the difference between stop word removal and stemming? Are these techniques language-dependent?

- The main difference is that stop word removal completely eliminates words that appear on a predetermined list, while stemming keeps every word but reduces it to its root by stripping affixes (mostly suffixes).
- Both techniques are language-dependent, since stop word lists and inflection rules differ from one language to another.
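
A small sketch of the contrast, in plain Python. The stop-word list is a tiny illustrative one and `toy_stem` is a deliberately crude suffix stripper written for this example; it is not the Snowball stemmer used later in the notebook.

```python
STOP_WORDS = {"the", "is", "are", "a", "of"}  # tiny illustrative list (assumption)


def remove_stop_words(tokens):
    """Drop whole tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(token):
    """Strip one common English suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


tokens = "the reporters are checking claims".split()
no_stops = remove_stop_words(tokens)      # tokens disappear entirely
stemmed = [toy_stem(t) for t in tokens]   # every token stays, but shortened
print(no_stops)
print(stemmed)
```

Stop word removal changes how many tokens there are, while stemming changes what each token looks like. Both the word list and the suffix rules here are English-specific, which is the language dependence mentioned above.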

---


3- Are tokenization techniques language-dependent? Why?
- Yes, tokenization techniques are language-dependent, because each language has its own rules for where one unit ends and the next begins. For example, splitting on whitespace works reasonably well for English but not for languages written without spaces between words.
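
As a quick illustration, whitespace splitting can be applied to an English sentence and to a Chinese one. The Chinese sentence is only a rough translation chosen for this sketch.

```python
english = "fake news spreads fast"
chinese = "假新闻传播很快"  # roughly the same sentence, written without spaces

english_tokens = english.split()
chinese_tokens = chinese.split()

print(english_tokens)  # one token per word
print(chinese_tokens)  # a single token: there are no spaces to split on
```

A tokenizer for the second sentence needs language-specific knowledge (a dictionary or a trained segmenter) to find word boundaries.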

---

4- What is the difference between count vectorizer and tf-idf vectorizer? Would it be feasible to use all possible n-grams? If not, how should you select them?

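- Count vectorizer is a simple technique that converts a collection of text documents into a matrix of token counts.
- TF-IDF (term frequency-inverse document frequency) vectorizer, on the other hand, weights each term by how often it occurs in a document, scaled down by how many documents contain it, so terms that are common across the whole corpus count for less.
- Using all possible n-grams is usually not feasible: the number of features grows very quickly with n, leading to the curse of dimensionality and high computational cost. A common way to select them is to cap the n-gram range and filter by document frequency, for example with the min_df and max_df thresholds used in the searches later in this notebook.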
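
The difference between raw counts and TF-IDF weights, and selection by document frequency, can be sketched in plain Python. The three documents are invented for illustration, and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` is the one scikit-learn documents for `smooth_idf=True`. The L2 normalization that `TfidfVectorizer` applies by default is left out to keep the sketch short.

```python
import math
from collections import Counter

docs = [
    "trump wins vote",
    "trump loses vote",
    "aliens land in ohio",
]
tokenized = [d.split() for d in docs]

# Count vectorizer view: raw term counts per document.
counts = [Counter(tokens) for tokens in tokenized]

# Document frequency: in how many documents each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))
n_docs = len(docs)


def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1


# TF-IDF view: the same counts, reweighted so rarer terms score higher.
tfidf = [{t: c * idf(t) for t, c in doc.items()} for doc in counts]

# Selecting terms by document frequency, in the spirit of min_df / max_df.
min_df, max_df = 2, 0.9  # absolute count and fraction of documents
kept = sorted(t for t, d in df.items() if d >= min_df and d / n_docs <= max_df)

print(counts[0]["trump"], counts[0]["wins"])
print(round(tfidf[0]["trump"], 3), round(tfidf[0]["wins"], 3))
print(kept)
```

In the first document "trump" and "wins" have the same count, but "wins" gets the larger TF-IDF weight because it appears in fewer documents. The frequency filter keeps only the terms that occur in at least two documents.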

## Problem formulation

### Problem definition

- The task is to build a model that classifies news as real or fake from its title alone. The inputs are news titles (60001 observations in the training dataset), and the output is one label per title (60001 labels).
---

### Data mining function
- Text preprocessing -> tokenization and vectorization of each text -> model training -> classification and prediction.

---

### Challenges

- The dataset is large, so preprocessing and training may take a long time.

- The input data consists of unclean text observations that contain numerous punctuation marks, non-English letters, misspellings, and grammatical errors. These titles were typed by humans, thus requiring an appropriate text cleaning technique to be selected. To determine the optimal method, various techniques will be evaluated, and the one that produces the best results will be chosen.


---

### Model impact

- Help prevent rumors from spreading quickly on social media.

---

### The ideal solution

- In the end, the deep learning approach scored 0.83 on Kaggle, while SVM scored 0.85 and logistic regression 0.84. That was the best I could achieve in the given time.

## Experimental protocol

### Data preprocessing
---
- Cleaning the input text of both the training set and the test set, then stemming or lemmatizing each text.
- Removing the rows with label 2.
- Using standard scaling.

---
### Building models

- Building our pipelines (each contains the vectorizer and a machine learning model).
- Building the search space and searching for the best hyperparameter combination by trying many fits.

- Finally, I tried a deep neural network with an LSTM.

In [17]:
import re
import pickle
import nltk 
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import PredefinedSplit
from sklearn.naive_bayes import MultinomialNB
from nltk.stem import WordNetLemmatizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from xgboost import XGBClassifier
from sklearn.svm import SVC
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100
pd.options.display.max_colwidth = 100

In [18]:
# pip install tensorflow_hub

In [19]:
# pip install tensorflow_text

# Read the training and test datasets


In [20]:
data = pd.read_csv('xy_train.csv',index_col='id')
data = data.drop(data[data.label == 2].index)
data

id,text,label
265723,"A group of friends began to volunteer at a homeless shelter after their neighbors protested. ""Se...",0
284269,"British Prime Minister @Theresa_May on Nerve Attack on Former Russian Spy: ""The government has c...",0
207715,"In 1961, Goodyear released a kit that allows PS2s to be brought to heel. https://m.youtube.com/w...",0
551106,"Happy Birthday, Bob Barker! The Price Is Right Host on How He'd Like to Be Remembered | ""As the ...",0
8584,"Obama to Nation: 聙""Innocent Cops and Unarmed Young Black Men Should Not be Dying Before Magic Jo...",0
...,...,...
70046,"Finish Sniper Simo H盲yh盲 during the invasion of Finland by the USSR (1939, colorized)",0
189377,"Nigerian Prince Scam took $110K from Kansas man; 10 years later, he's getting it back",1
93486,Is It Safe To Smoke Marijuana During Pregnancy? You鈥檇 Be Surprised Of The Answer | no,0
140950,Julius Caesar upon realizing that everyone in the room has a knife except him (44 bc),0


In [21]:
test = pd.read_csv('x_test.csv',index_col='id')

# Text cleaning

- The first method cleans the text and lemmatizes it.
- The second method cleans the text and stems it.

In [22]:
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('omw-1.4')
stemmer = SnowballStemmer("english")
stop_words = set(stopwords.words("english"))
print(stop_words)

def cleaning_text(text, for_embedding):
    """
        - remove any HTML tags (<br/> is often found)
        - keep only ASCII and European letters plus digits (no digits when for_embedding is True)
        - intended to remove single-character tokens (see the note on RE_SINGLECHAR)
        - collapse all whitespace (tabs etc.) into a single space

    """
    RE_WSPACE = re.compile(r"\s+", re.IGNORECASE)  # one or more whitespace characters
    RE_TAGS = re.compile(r"<.*?>")  # anything between < and >, i.e. HTML-like tags
    RE_ASCII = re.compile(r"[^A-Za-zÀ-ž0-9]+", re.IGNORECASE)  # runs of characters that are not letters or digits
    # Note: as written (leading ^ plus a negated class), this pattern effectively never
    # matches after the RE_ASCII substitution, so single-character tokens are kept.
    RE_SINGLECHAR = re.compile(r"\b^[^A-Za-zÀ-ž0-9]+\b", re.IGNORECASE)
    if for_embedding:
        # Keep punctuation
        RE_ASCII = re.compile(r"[^A-Za-zÀ-ž,.!? ]", re.IGNORECASE)  # any character that is not a letter, basic punctuation, or a space
        RE_SINGLECHAR = re.compile(r"\b[A-Za-zÀ-ž,.!?]\b", re.IGNORECASE)  # a single letter or punctuation mark between word boundaries

    text = re.sub(RE_TAGS, " ", text)  # replace HTML tags with a space
    text = re.sub(RE_ASCII, " ", text)  # replace disallowed characters with a space
    text = re.sub(RE_SINGLECHAR, " ", text)  # replace single-character matches with a space
    text = re.sub(RE_WSPACE, " ", text)  # collapse whitespace runs into one space
    word_tokens = word_tokenize(text)
    
    return word_tokens

def lemmatize_clean_text(text ,for_embedding=False):

    """ steps:
        if not for embedding (but e.g. tf-idf):
        - all lowercase
        - remove stopwords, punctuation and lemmatize
    """

    lemmatizer = WordNetLemmatizer()

    word_tokens = cleaning_text(text, for_embedding)

    if for_embedding:
        # for embeddings: skip stemming/lemmatization, lowercasing, and stop-word removal
        words_filtered = word_tokens
    else:
        words_tokens_lower = [word.lower() for word in word_tokens]
        words_filtered = [lemmatizer.lemmatize(word) for word in words_tokens_lower if word not in stop_words]

    clean_text = " ".join(words_filtered)
    return clean_text



def stemming_clean_text(text ,for_embedding=False):
    """ steps:
        if not for embedding (but e.g. tf-idf):
        - all lowercase
        - remove stopwords, punctuation and stemming
    """

    word_tokens = cleaning_text(text,for_embedding)

    if for_embedding:
        # for embeddings: skip stemming/lemmatization, lowercasing, and stop-word removal
        words_filtered = word_tokens
    else:
        words_tokens_lower = [word.lower() for word in word_tokens]
        words_filtered = [stemmer.stem(word) for word in words_tokens_lower if word not in stop_words]

    clean_text = " ".join(words_filtered)
    return clean_text

{'on', 'can', 'during', 't', 'his', 'into', 'ain', 'do', 'up', 'own', 'where', 'that', 'doing', 'after', 'weren', 'an', "it's", 'because', 'don', 'for', 'haven', 'ours', 'before', 'she', 'aren', "mightn't", 'against', "you're", 'who', 'then', 'been', 'have', 'some', 'was', 'himself', 'shouldn', 'more', 'further', 'did', 'all', 'm', 'no', 'there', 'needn', 'how', "haven't", 'should', 'her', 'of', 'him', 'our', 'so', 'herself', "you've", 'only', 'whom', 'were', 'doesn', 'its', 'by', 'are', "hasn't", 'o', 'over', 'your', 's', "should've", "wasn't", 'to', 'as', 'we', "weren't", 'yourselves', 'and', 'when', "mustn't", "won't", 're', 'these', 'a', 'between', 'll', "you'll", 'shan', 'what', 'being', "hadn't", "wouldn't", 'wasn', 'if', 'in', 'be', 'ourselves', 'this', 'again', 'such', 'until', 'hasn', 'they', 'while', "that'll", 'very', 'about', "she's", 'through', 'had', 'does', 'or', 'mightn', "shouldn't", 'too', 'theirs', 'once', 'is', 'here', "don't", 'below', 'those', 've', 'with', 'you',

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
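
To see what the substitutions in `cleaning_text` do to a single title, the following sketch applies the non-embedding patterns one at a time. The sample title and the helper `show_cleaning_steps` are made up for illustration, and plain whitespace splitting stands in for `word_tokenize` so that the sketch has no NLTK dependency.

```python
import re

RE_TAGS = re.compile(r"<.*?>")                   # HTML-like tags
RE_NON_ALNUM = re.compile(r"[^A-Za-zÀ-ž0-9]+")   # anything except letters and digits
RE_WSPACE = re.compile(r"\s+")                   # runs of whitespace


def show_cleaning_steps(text):
    """Apply the substitutions one at a time and return every intermediate string."""
    steps = [text]
    for pattern in (RE_TAGS, RE_NON_ALNUM, RE_WSPACE):
        steps.append(pattern.sub(" ", steps[-1]))
    return steps


title = "Breaking:<br/>Man's   dog wins $110K!!"  # hypothetical title
for step in show_cleaning_steps(title):
    print(repr(step))

# Whitespace splitting stands in for nltk's word_tokenize here (an assumption
# made to keep the sketch dependency-free).
tokens = show_cleaning_steps(title)[-1].lower().split()
print(tokens)
```

The apostrophe in "Man's" is replaced by a space, which leaves a stray one-letter token "s". Single-character tokens such as "u" also show up among the most frequent words further down.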


In [23]:
data["text"]

id
265723    A group of friends began to volunteer at a homeless shelter after their neighbors protested. "Se...
284269    British Prime Minister @Theresa_May on Nerve Attack on Former Russian Spy: "The government has c...
207715    In 1961, Goodyear released a kit that allows PS2s to be brought to heel. https://m.youtube.com/w...
551106    Happy Birthday, Bob Barker! The Price Is Right Host on How He'd Like to Be Remembered | "As the ...
8584      Obama to Nation: 聙"Innocent Cops and Unarmed Young Black Men Should Not be Dying Before Magic Jo...
                                                         ...                                                 
70046                   Finish Sniper Simo H盲yh盲 during the invasion of Finland by the USSR (1939, colorized)
189377                  Nigerian Prince Scam took $110K from Kansas man; 10 years later, he's getting it back
93486                   Is It Safe To Smoke Marijuana During Pregnancy? You鈥檇 Be Surprised Of The Answer | no
140950 

In [24]:
data_lemmatized = data["text"].map(lambda x: lemmatize_clean_text(x ,for_embedding=False) if isinstance(x, str) else x).copy() ## clean and lemmatize the training set
data_stemmed = data["text"].map(lambda x: stemming_clean_text(x, for_embedding=False) if isinstance(x, str) else x).copy() ## clean and stem the training set
test_lemmatized = test["text"].map(lambda x: lemmatize_clean_text(x ,for_embedding=False) if isinstance(x, str) else x).copy() ## clean and lemmatize the test set
test_stemmed = test["text"].map(lambda x: stemming_clean_text(x ,for_embedding=False) if isinstance(x, str) else x).copy() ## clean and stem the test set

In [63]:
data_lemmatized

id
265723    group friend began volunteer homeless shelter neighbor protested seeing another person also need...
284269    british prime minister theresa may nerve attack former russian spy government concluded highly l...
207715    1961 goodyear released kit allows ps2s brought heel http youtube com watch v alxulk0t8cg 0 72 0 ...
551106    happy birthday bob barker price right host like remembered man said ave pet spayed neutered 0 92...
8584      obama nation innocent cop unarmed young black men dying magic johnson 1 0 0 2 1 jimbobshawobodob...
                                                         ...                                                 
70046                                            finish sniper simo h yh invasion finland ussr 1939 colorized
189377                                    nigerian prince scam took 110k kansa man 10 year later getting back
93486                                                         safe smoke marijuana pregnancy surprised answer
140950 

In [26]:
data_lemmatized.head(1)

id
265723    group friend began volunteer homeless shelter neighbor protested seeing another person also need...
Name: text, dtype: object

In [27]:
data_stemmed.head(1)

id
265723    group friend began volunt homeless shelter neighbor protest see anoth person also need natur lik...
Name: text, dtype: object

In [28]:
# Word Frequency of most common words in data_lemmatized
word_freq_lemmatized = pd.Series(" ".join(data_lemmatized).split()).value_counts()
word_freq_lemmatized[0:10]

0            4307
year         4121
one          3285
new          2998
like         2949
man          2705
trump        2577
u            2513
colorized    2430
people       2315
dtype: int64

In [29]:
# Word Frequency of most common words in data_stemmed
word_freq_stemmed = pd.Series(" ".join(data_stemmed).split()).value_counts()
word_freq_stemmed[0:10]

0        4293
year     4125
one      3285
like     3128
new      2998
look     2847
color    2737
man      2728
get      2602
trump    2578
dtype: int64

In [30]:
data["label"].value_counts(normalize=True)

0    0.538281
1    0.461719
Name: label, dtype: float64

Here I tried converting label 2 to 0 instead, but it gave lower accuracy than removing those rows.

In [31]:
# data["label"] = [0 if x == 2 else x for x in data["label"]]

In [32]:
data["label"].value_counts(normalize=True)

0    0.538281
1    0.461719
Name: label, dtype: float64

# Stemmed

## Trial 0: the first thing I wanted to try was logistic regression (word analyzer)

In [66]:
# feature creation and modelling in a single function
pipe_lg = Pipeline([("tfidf", TfidfVectorizer()), ("lg", LogisticRegression(max_iter=10000,random_state=42,n_jobs=-1))])


# define parameter space to test
params = {
    "tfidf__ngram_range": [(1, 2), (1, 3), (1,4), (1,5)],
    "tfidf__max_df": np.arange(0.2, 1.0),  # note: the default step is 1, so this yields only [0.2]
    "tfidf__min_df": np.arange(5, 100),
    "tfidf__analyzer": ['word'],
    "tfidf__strip_accents":[None,'ascii','unicode'],
    'tfidf__smooth_idf':[False,True],
    "tfidf__sublinear_tf":[True]
}

# here we use data_stemmed; the random search uses 5-fold cross-validation (cv=5) internally to decide which samples are held out for validation

pipe_lg_clf = RandomizedSearchCV(pipe_lg, params, n_jobs=-1,cv=5, scoring="roc_auc", n_iter=100, verbose=1)
pipe_lg_clf.fit(data_stemmed, data['label'])


Fitting 5 folds for each of 100 candidates, totalling 500 fits


RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('tfidf', TfidfVectorizer()),
                                             ('lg',
                                              LogisticRegression(max_iter=10000,
                                                                 n_jobs=-1,
                                                                 random_state=42))]),
                   n_iter=100, n_jobs=-1,
                   param_distributions={'tfidf__analyzer': ['word'],
                                        'tfidf__max_df': array([0.2]),
                                        'tfidf__min_df': array([ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
       22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33...
       39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55,
       56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72,
       73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
       90

In [67]:
print('best score {}'.format(pipe_lg_clf.best_score_))
print('best params {}'.format(pipe_lg_clf.best_params_))
# best score 0.8776370315149837
# best params {'tfidf__sublinear_tf': True, 'tfidf__strip_accents': None, 'tfidf__smooth_idf': True, 'tfidf__ngram_range': (1, 4), 'tfidf__min_df': 6, 'tfidf__max_df': 0.2, 'tfidf__analyzer': 'word'}

best score 0.8776370315149837
best params {'tfidf__sublinear_tf': True, 'tfidf__strip_accents': None, 'tfidf__smooth_idf': True, 'tfidf__ngram_range': (1, 4), 'tfidf__min_df': 6, 'tfidf__max_df': 0.2, 'tfidf__analyzer': 'word'}
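
The best `tfidf__max_df` is 0.2 because that was the only value on offer: `np.arange(0.2, 1.0)` uses a default step of 1, which the `array([0.2])` in the search's repr above also shows. A minimal sketch of the behaviour follows. The `np.linspace` call is only a hypothetical alternative for sweeping the range and was not used for the results reported here.

```python
import numpy as np

# With only start and stop given, the step defaults to 1, so a single value remains.
as_written = np.arange(0.2, 1.0)
print(as_written)

# A hypothetical alternative that would sweep 0.2 to 0.9 in eight evenly spaced steps.
wider = np.linspace(0.2, 0.9, 8)
print(wider)
```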


In [62]:
params = {
    'tfidf__sublinear_tf':[True],
    'tfidf__strip_accents':[None],
    'tfidf__smooth_idf':[True],
    'tfidf__ngram_range': [(1, 4)],
    'tfidf__analyzer':['word'],
    'tfidf__min_df': [6], 
    'tfidf__max_df': [0.2],
    'lg__class_weight':['balanced',None],
    "lg__solver" : ['sag','newton-cg', 'lbfgs','liblinear','saga'],
    'lg__C': [1.0,0.1,0.005,1.5,2.0,3.5,4,5],
    'lg__fit_intercept':[False, True]

}

pipe_lg_clf = RandomizedSearchCV(pipe_lg, params, n_jobs=-1,cv=5, scoring="roc_auc", n_iter=200, verbose=1)
pipe_lg_clf.fit(data_stemmed, data['label'])

Fitting 5 folds for each of 100 candidates, totalling 500 fits


RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('tfidf', TfidfVectorizer()),
                                             ('lg',
                                              LogisticRegression(max_iter=10000,
                                                                 n_jobs=-1,
                                                                 random_state=42))]),
                   n_iter=100, n_jobs=-1,
                   param_distributions={'lg__C': [1.0, 0.1, 0.005, 1.5, 2.0,
                                                  3.5, 4, 5],
                                        'lg__class_weight': ['balanced', None],
                                        'lg__fit_intercept': [False, True],
                                        'lg__solver': ['sag', 'newton-cg',
                                                       'lbfgs', 'liblinear',
                                                       'saga'],
                                        'tfidf__anal

In [63]:
print('best score {}'.format(pipe_lg_clf.best_score_))
print('best params {}'.format(pipe_lg_clf.best_params_))
#best score 0.8784508358625386
#{'tfidf__sublinear_tf': False, 'tfidf__strip_accents': None, 'tfidf__smooth_idf': True, 'tfidf__ngram_range': (1, 5), 'tfidf__min_df': 6, 'tfidf__max_df': 0.2, 'tfidf__analyzer': 'word', 'lg__solver': 'lbfgs', 'lg__fit_intercept': False, 'lg__class_weight': 'balanced', 'lg__C': 2.0}

best score 0.8784508358625386
best score {'tfidf__sublinear_tf': False, 'tfidf__strip_accents': None, 'tfidf__smooth_idf': True, 'tfidf__ngram_range': (1, 5), 'tfidf__min_df': 6, 'tfidf__max_df': 0.2, 'tfidf__analyzer': 'word', 'lg__solver': 'lbfgs', 'lg__fit_intercept': False, 'lg__class_weight': 'balanced', 'lg__C': 2.0}


In [77]:
#create submission file
submission = pd.DataFrame()

submission['id'] = test_stemmed.index
print(len(test['text']))
submission['label'] = pipe_lg_clf.predict_proba(test_stemmed)[:,1]
submission.to_csv('lg_stem2.csv', index=False)

59151


It was good enough. I think it can still be improved, but tuning with grid search would take a very long time, so for the next trial I wanted to try XGBoost (char analyzer).
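
To make the cost of an exhaustive grid concrete, the sketch below counts the candidates in the second logistic regression search space above. Only the parameters with more than one value are listed, and the fold count matches `cv=5`.

```python
from math import prod

# Only the parameters that actually vary in the second logistic regression search.
varying = {
    "lg__class_weight": ["balanced", None],
    "lg__solver": ["sag", "newton-cg", "lbfgs", "liblinear", "saga"],
    "lg__C": [1.0, 0.1, 0.005, 1.5, 2.0, 3.5, 4, 5],
    "lg__fit_intercept": [False, True],
}
cv_folds = 5

n_candidates = prod(len(values) for values in varying.values())
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)
```

A full grid over these values means 160 candidates and 800 fits. Every extra TF-IDF option multiplies that number again, which is why the randomized search with a fixed `n_iter` was used.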

## Trial 1: XGBoost (char); I expect it to score above 0.80 as well

In [66]:
# feature creation and modelling in a single function
pipe_xgb = Pipeline([("tfidf", TfidfVectorizer()), ("xgb", XGBClassifier(random_state=42,n_jobs=-1))])


# define parameter space to test
params = {
    "tfidf__ngram_range": [(1, 2), (1, 3), (1,4), (1,5)],
    "tfidf__max_df": np.arange(0.2, 1.0),  # note: the default step is 1, so this yields only [0.2]
    "tfidf__min_df": np.arange(5, 100),
    "tfidf__strip_accents":[None,'ascii','unicode'],
    'tfidf__analyzer':['word','char','char_wb'],
    'tfidf__smooth_idf':[False,True],
    "tfidf__sublinear_tf":[True,False]
}

# here we use data_stemmed; the random search uses 5-fold cross-validation (cv=5) internally to decide which samples are held out for validation

pipe_xgb_clf = RandomizedSearchCV(pipe_xgb, params, n_jobs=-1,cv=5, scoring="roc_auc", n_iter=50, verbose=1)
pipe_xgb_clf.fit(data_stemmed, data['label'])

RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('tfidf', TfidfVectorizer()),
                                             ('xgb',
                                              XGBClassifier(base_score=None,
                                                            booster=None,
                                                            callbacks=None,
                                                            colsample_bylevel=None,
                                                            colsample_bynode=None,
                                                            colsample_bytree=None,
                                                            early_stopping_rounds=None,
                                                            enable_categorical=False,
                                                            eval_metric=None,
                                                            feature_types=None,
                                      

In [67]:
print('best score {}'.format(pipe_xgb_clf.best_score_))
print('best score {}'.format(pipe_xgb_clf.best_params_))
#0.8545179954285592
#{'tfidf__sublinear_tf': False, 'tfidf__strip_accents': 'unicode', 'tfidf__smooth_idf': False, 'tfidf__ngram_range': (1, 5), 'tfidf__min_df': 61, 'tfidf__max_df': 0.2, 'tfidf__analyzer': 'char'}

best score 0.8545179954285592
best score {'tfidf__sublinear_tf': False, 'tfidf__strip_accents': 'unicode', 'tfidf__smooth_idf': False, 'tfidf__ngram_range': (1, 5), 'tfidf__min_df': 61, 'tfidf__max_df': 0.2, 'tfidf__analyzer': 'char'}


In [68]:
params = {
    'tfidf__sublinear_tf':[False], 
    'tfidf__strip_accents':['unicode'],
    'tfidf__smooth_idf':[False],
    'tfidf__ngram_range': [(1, 5)], #(1,2)
    'tfidf__min_df': [6], 
    'tfidf__max_df': [0.2],
    'tfidf__analyzer':['char'],
    'xgb__booster':['gbtree','gblinear', 'dart'],
    'xgb__n_estimators': [400,450, 500,600],
    'xgb__max_depth': [5,8,10,15,30,60],
    'xgb__learning_rate': [0.01, 0.1, 0.2],
    'xgb__gamma': [0, 0.1, 0.2],
    'xgb__subsample': [0.8, 0.9, 1]

}
   
pipe_xgb_clf = RandomizedSearchCV(pipe_xgb, params, n_jobs=-1,cv=5, scoring="roc_auc", n_iter=10, verbose=1)
pipe_xgb_clf.fit(data_stemmed, data['label'])

Fitting 5 folds for each of 10 candidates, totalling 50 fits


RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('tfidf', TfidfVectorizer()),
                                             ('xgb',
                                              XGBClassifier(base_score=None,
                                                            booster=None,
                                                            callbacks=None,
                                                            colsample_bylevel=None,
                                                            colsample_bynode=None,
                                                            colsample_bytree=None,
                                                            early_stopping_rounds=None,
                                                            enable_categorical=False,
                                                            eval_metric=None,
                                                            feature_types=None,
                                      

In [69]:
print('best score {}'.format(pipe_xgb_clf.best_score_))
print('best score {}'.format(pipe_xgb_clf.best_params_))

best score 0.8741512318381581
best score {'xgb__subsample': 0.8, 'xgb__n_estimators': 500, 'xgb__max_depth': 60, 'xgb__learning_rate': 0.1, 'xgb__gamma': 0.1, 'xgb__booster': 'dart', 'tfidf__sublinear_tf': False, 'tfidf__strip_accents': 'unicode', 'tfidf__smooth_idf': False, 'tfidf__ngram_range': (1, 5), 'tfidf__min_df': 6, 'tfidf__max_df': 0.2, 'tfidf__analyzer': 'char'}


In [81]:
#create submission file
submission = pd.DataFrame()

submission['id'] = test_stemmed.index
print(len(test['text']))
submission['label'] = pipe_xgb_clf.predict_proba(test_stemmed)[:,1]
submission.to_csv('xgb_stem2.csv', index=False)

59151


It took very long, and it still scored lower than logistic regression :(

## Trial 2: SVM
I guess it should outperform logistic regression if we searched for the best hyperparameters with a grid search. SVM took longer than nearly any other model, but I think it has the potential.

In [28]:
# feature creation and modelling in a single function
pipe_svc = Pipeline([("tfidf", TfidfVectorizer()), ("svc", SVC(random_state=42,probability=True))])


# define parameter space to test
params = {
    'tfidf__sublinear_tf':[True], 
    'tfidf__strip_accents':['unicode'],
    'tfidf__smooth_idf':[False],
    'tfidf__ngram_range': [(1, 2)], #(1,2)
    'tfidf__min_df': [5], 
    'tfidf__max_df': [0.2],
    'tfidf__analyzer':['word']
}

# here we use data_stemmed; the grid search uses 5-fold cross-validation (cv=5) internally to decide which samples are held out for validation

pipe_svc_clf = GridSearchCV(pipe_svc, params, n_jobs=-1,cv=5, scoring="roc_auc", verbose=1)
pipe_svc_clf.fit(data_stemmed, data['label'])

Fitting 5 folds for each of 1 candidates, totalling 5 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tfidf', TfidfVectorizer()),
                                       ('svc',
                                        SVC(probability=True,
                                            random_state=42))]),
             n_jobs=-1,
             param_grid={'tfidf__analyzer': ['word'], 'tfidf__max_df': [0.2],
                         'tfidf__min_df': [5], 'tfidf__ngram_range': [(1, 2)],
                         'tfidf__smooth_idf': [False],
                         'tfidf__strip_accents': ['unicode'],
                         'tfidf__sublinear_tf': [True]},
             scoring='roc_auc', verbose=1)

In [29]:
print('best score {}'.format(pipe_svc_clf.best_score_))
print('best score {}'.format(pipe_svc_clf.best_params_))
#best score 0.8810183459981413
#{'tfidf__sublinear_tf': True, 'tfidf__strip_accents': 'unicode', 'tfidf__smooth_idf': False, 'tfidf__ngram_range': (1, 2), 'tfidf__min_df': 5, 'tfidf__max_df': 0.2, 'tfidf__analyzer': 'word'}

best score 0.8810183459981413
best score {'tfidf__analyzer': 'word', 'tfidf__max_df': 0.2, 'tfidf__min_df': 5, 'tfidf__ngram_range': (1, 2), 'tfidf__smooth_idf': False, 'tfidf__strip_accents': 'unicode', 'tfidf__sublinear_tf': True}


In [30]:
#create submission file
submission = pd.DataFrame()

submission['id'] = test_stemmed.index
print(len(test['text']))
submission['label'] = pipe_svc_clf.predict_proba(test_stemmed)[:,1]
submission.to_csv('svc_stem_final.csv', index=False)

59151


In [19]:
# feature creation and modelling in a single function
pipe_svc = Pipeline([("tfidf", TfidfVectorizer()), ("svc", SVC(random_state=42))])

params = {
    'tfidf__sublinear_tf':[True], 
    'tfidf__strip_accents':['unicode'],
    'tfidf__smooth_idf':[False],
    'tfidf__ngram_range': [(1, 2)], #(1,2)
    'tfidf__min_df': [5], 
    'tfidf__max_df': [0.2],
    'tfidf__analyzer':['word'],
    'svc__C':[10**-2, 10**-1, 10**0, 10**1, 10**2],
    'svc__kernel': ['linear', 'poly','rbf', 'sigmoid'],
    'svc__gamma': [0.5,0.1,1]

}
   
pipe_svc_clf = RandomizedSearchCV(pipe_svc, params, n_jobs=-1,cv=5, scoring="roc_auc", n_iter=10, verbose=1)
pipe_svc_clf.fit(data_stemmed, data['label'])

Fitting 5 folds for each of 10 candidates, totalling 50 fits


RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('tfidf', TfidfVectorizer()),
                                             ('svc', SVC(random_state=42))]),
                   n_jobs=-1,
                   param_distributions={'svc__C': [0.01, 0.1, 1, 10, 100],
                                        'svc__gamma': [0.5, 0.1, 1],
                                        'svc__kernel': ['linear', 'poly', 'rbf',
                                                        'sigmoid'],
                                        'tfidf__analyzer': ['word'],
                                        'tfidf__max_df': [0.2],
                                        'tfidf__min_df': [5],
                                        'tfidf__ngram_range': [(1, 2)],
                                        'tfidf__smooth_idf': [False],
                                        'tfidf__strip_accents': ['unicode'],
                                        'tfidf__sublinear_tf': [True]},
             

In [20]:
print('best score {}'.format(pipe_svc_clf.best_score_))
print('best score {}'.format(pipe_svc_clf.best_params_))
#best score 0.8789690913235946
#{'tfidf__sublinear_tf': True, 'tfidf__strip_accents': 'unicode', 'tfidf__smooth_idf': False, 'tfidf__ngram_range': (1, 2), 'tfidf__min_df': 5, 'tfidf__max_df': 0.2, 'tfidf__analyzer': 'word', 'svc__kernel': 'rbf', 'svc__gamma': 0.5, 'svc__C': 1}

best score 0.8789690913235946
best score {'tfidf__sublinear_tf': True, 'tfidf__strip_accents': 'unicode', 'tfidf__smooth_idf': False, 'tfidf__ngram_range': (1, 2), 'tfidf__min_df': 5, 'tfidf__max_df': 0.2, 'tfidf__analyzer': 'word', 'svc__kernel': 'rbf', 'svc__gamma': 0.5, 'svc__C': 1}


I think SVM would give better results than logistic regression if I had enough resources and time, but it is much easier to try tuning the logistic regression model.

# Lemmatized

## Trial 3: logistic regression with lemmatized data

In [68]:
# feature creation and modelling in a single function
pipe_lg = Pipeline([("tfidf", TfidfVectorizer()), ("lg", LogisticRegression(max_iter=10000,random_state=42,n_jobs=-1))])


# define parameter space to test
params = {
    "tfidf__ngram_range": [(1, 2), (1, 3), (1,4), (1,5)],
    "tfidf__max_df": np.arange(0.2, 1.0),  # note: the default step is 1, so this yields only [0.2]
    "tfidf__min_df": np.arange(5, 100),
    "tfidf__analyzer": ['word','char','char_wb'],
    "tfidf__strip_accents":[None,'ascii','unicode'],
    'tfidf__smooth_idf':[False,True],
    "tfidf__sublinear_tf":[True,False]
}

# here we use data_lemmatized; the random search uses 5-fold cross-validation (cv=5) internally to decide which samples are held out for validation

pipe_lg_clf2 = RandomizedSearchCV(pipe_lg, params, n_jobs=-1,cv=5, scoring="roc_auc", n_iter=200, verbose=1)
pipe_lg_clf2.fit(data_lemmatized, data['label'])


Fitting 5 folds for each of 200 candidates, totalling 1000 fits


RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('tfidf', TfidfVectorizer()),
                                             ('lg',
                                              LogisticRegression(max_iter=10000,
                                                                 n_jobs=-1,
                                                                 random_state=42))]),
                   n_iter=200, n_jobs=-1,
                   param_distributions={'tfidf__analyzer': ['word', 'char',
                                                            'char_wb'],
                                        'tfidf__max_df': array([0.2]),
                                        'tfidf__min_df': array([ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
       22, 23, 24, 25, 26, 27,...
       39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55,
       56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72,
       73, 74, 75, 76, 77, 7

In [69]:
print('best score {}'.format(pipe_lg_clf2.best_score_))
print('best params {}'.format(pipe_lg_clf2.best_params_))
# best score 0.8796225479855799
# best params {'tfidf__sublinear_tf': False, 'tfidf__strip_accents': 'unicode', 'tfidf__smooth_idf': False, 'tfidf__ngram_range': (1, 4), 'tfidf__min_df': 5, 'tfidf__max_df': 0.2, 'tfidf__analyzer': 'word'}

best score 0.8796225479855799
best params {'tfidf__sublinear_tf': False, 'tfidf__strip_accents': 'unicode', 'tfidf__smooth_idf': False, 'tfidf__ngram_range': (1, 4), 'tfidf__min_df': 5, 'tfidf__max_df': 0.2, 'tfidf__analyzer': 'word'}
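The search-space repr above lists `tfidf__max_df` as `array([0.2])`: `np.arange(0.2, 1.0)` uses a default step of 1, so only one value was ever sampled. A minimal sketch of a wider grid, assuming evenly spaced fractions are what was intended (`max_df_grid` is a hypothetical name, not something from the notebook):

```python
import numpy as np

# What the notebook actually searched: the step defaults to 1, so one value only.
searched = np.arange(0.2, 1.0)

# Hypothetical alternative: 8 evenly spaced document-frequency cut-offs.
# np.linspace takes the number of points, so there is no float-step surprise.
max_df_grid = np.linspace(0.2, 0.9, 8)

print(searched)
print(max_df_grid)
```

Passing `max_df_grid` (or a continuous distribution) to the search would let `max_df` actually vary; the scores recorded in this notebook were all obtained with `max_df=0.2`.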


In [109]:
# feature creation and modelling in a single function
pipe_lg = Pipeline([("tfidf", TfidfVectorizer()), ("lg", LogisticRegression(max_iter=10000,random_state=42,n_jobs=-1))])

params = {
    'tfidf__sublinear_tf':[False],
    'tfidf__strip_accents':["unicode"],
    'tfidf__smooth_idf':[False],
    'tfidf__ngram_range': [(1, 4)],
    'tfidf__analyzer':['word'],
    'tfidf__min_df': [5], 
    'tfidf__max_df': [0.2],
    'lg__class_weight':['balanced',None],
    "lg__solver" : ['sag','newton-cg', 'lbfgs','liblinear','saga', 'sag'],  # note: 'sag' is listed twice, so it is fitted twice
    'lg__C': [1.0,0.1,0.005,1.5,2.0,3.5,4,5],
    'lg__fit_intercept':[False, True],

}

pipe_lg_clf2 = GridSearchCV(pipe_lg, params, n_jobs=-1,cv=5, scoring="roc_auc", verbose=1)
pipe_lg_clf2.fit(data_lemmatized, data['label'])

Fitting 5 folds for each of 192 candidates, totalling 960 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tfidf', TfidfVectorizer()),
                                       ('lg',
                                        LogisticRegression(max_iter=10000,
                                                           n_jobs=-1,
                                                           random_state=42))]),
             n_jobs=-1,
             param_grid={'lg__C': [1.0, 0.1, 0.005, 1.5, 2.0, 3.5, 4, 5],
                         'lg__class_weight': ['balanced', None],
                         'lg__fit_intercept': [False, True],
                         'lg__solver': ['sag', 'newton-cg', 'lbfgs',
                                        'liblinear', 'saga', 'sag'],
                         'tfidf__analyzer': ['word'], 'tfidf__max_df': [0.2],
                         'tfidf__min_df': [5], 'tfidf__ngram_range': [(1, 4)],
                         'tfidf__smooth_idf': [False],
                         'tfidf__strip_accents': ['unicode'],
     

In [110]:
print('best score {}'.format(pipe_lg_clf2.best_score_))
print('best params {}'.format(pipe_lg_clf2.best_params_))
#best score 0.8845699688823505
# best params {'lg__C': 2.0, 'lg__class_weight': 'balanced', 'lg__fit_intercept': False, 'lg__solver': 'newton-cg', 'tfidf__analyzer': 'word', 'tfidf__max_df': 0.2, 'tfidf__min_df': 5, 'tfidf__ngram_range': (1, 4), 'tfidf__smooth_idf': False, 'tfidf__strip_accents': 'unicode', 'tfidf__sublinear_tf': False}

best score 0.8845699688823505
best params {'lg__C': 2.0, 'lg__class_weight': 'balanced', 'lg__fit_intercept': False, 'lg__solver': 'newton-cg', 'tfidf__analyzer': 'word', 'tfidf__max_df': 0.2, 'tfidf__min_df': 5, 'tfidf__ngram_range': (1, 4), 'tfidf__smooth_idf': False, 'tfidf__strip_accents': 'unicode', 'tfidf__sublinear_tf': False}
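The 192 candidates reported above include the `'sag'` solver twice, because it appears twice in the `lg__solver` list. A small sketch of how the count comes about, using only the multi-valued entries of the grid (`n_candidates` is a helper made up for this illustration):

```python
from math import prod

# Only grid entries with more than one option change the count;
# every tfidf__* entry in this grid is a single fixed value.
grid = {
    "lg__class_weight": ["balanced", None],
    "lg__solver": ["sag", "newton-cg", "lbfgs", "liblinear", "saga", "sag"],
    "lg__C": [1.0, 0.1, 0.005, 1.5, 2.0, 3.5, 4, 5],
    "lg__fit_intercept": [False, True],
}

def n_candidates(param_grid):
    """Number of combinations a full grid search evaluates."""
    return prod(len(values) for values in param_grid.values())

as_written = n_candidates(grid)

# Same grid with the repeated solver dropped (order preserved).
deduped = dict(grid, lg__solver=list(dict.fromkeys(grid["lg__solver"])))
without_duplicate = n_candidates(deduped)

print(as_written, as_written * 5)                # candidates, fits with cv=5
print(without_duplicate, without_duplicate * 5)
```

Removing the duplicate would not change the best score, it would only save the repeated fits.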


In [111]:
#create submission file
submission = pd.DataFrame()

submission['id'] = test_lemmatized.index
print(len(test['text']))
submission['label'] = pipe_lg_clf2.predict_proba(test_lemmatized)[:,1]
submission.to_csv('lg_lemmatized_finalGR_nodrpzero.csv', index=False)

59151


## trial 4 SVM with lemmatized

It didn't look promising and would take a very long time to tune, so I decided not to continue with it.

In [85]:
# feature creation and modelling in a single function
pipe_svc = Pipeline([("tfidf", TfidfVectorizer()), ("svc", SVC(random_state=42))])


# define parameter space to test
params = {
    "tfidf__ngram_range": [(1, 2), (1, 3), (1,4), (1,5)],
    "tfidf__max_df": np.arange(0.2, 1.0),  # note: the default step is 1, so this yields only [0.2]
    "tfidf__min_df": np.arange(5, 100),
    "tfidf__strip_accents":[None,'ascii','unicode'],
    'tfidf__analyzer':['word','char','char_wb'],
    'tfidf__smooth_idf':[False,True],
    "tfidf__sublinear_tf":[True,False]
}

# here we use data_lemmatized; the random search runs 5-fold cross-validation (cv=5) internally to choose the validation samples

pipe_svc_clf2 = RandomizedSearchCV(pipe_svc, params, n_jobs=-1,cv=5, scoring="roc_auc", n_iter=10, verbose=1)
pipe_svc_clf2.fit(data_lemmatized, data['label'])

Fitting 5 folds for each of 10 candidates, totalling 50 fits


RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('tfidf', TfidfVectorizer()),
                                             ('svc', SVC(random_state=42))]),
                   n_jobs=-1,
                   param_distributions={'tfidf__analyzer': ['word', 'char',
                                                            'char_wb'],
                                        'tfidf__max_df': array([0.2]),
                                        'tfidf__min_df': array([ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
       22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
       39, 40, 41, 42, 43, 44...48, 49, 50, 51, 52, 53, 54, 55,
       56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72,
       73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
       90, 91, 92, 93, 94, 95, 96, 97, 98, 99]),
                                        'tfidf__ngram_range': [(1, 2), (1, 3),
                        

In [86]:
print('best score {}'.format(pipe_svc_clf2.best_score_))
print('best params {}'.format(pipe_svc_clf2.best_params_))

best score 0.8794486916772799
best params {'tfidf__sublinear_tf': False, 'tfidf__strip_accents': None, 'tfidf__smooth_idf': False, 'tfidf__ngram_range': (1, 2), 'tfidf__min_df': 11, 'tfidf__max_df': 0.2, 'tfidf__analyzer': 'word'}


In [None]:
# params = {
#     'tfidf__sublinear_tf':[False], 
#     'tfidf__strip_accents':['unicode'],
#     'tfidf__smooth_idf':[False],
#     'tfidf__ngram_range': [(1, 5)], #(1,2)
#     'tfidf__min_df': [6], 
#     'tfidf__max_df': [0.2],
#     'tfidf__analyzer':['char'],
#     'svc__C':[10**-2, 10**-1, 10**0, 10**1, 10**2],
#     'svc__kernel': ['linear', 'poly','rbf', 'sigmoid'],
#     'svc__gamma': [0.5,0.1,1]

# }
   
# pipe_svc_clf2 = RandomizedSearchCV(pipe_svc, params, n_jobs=-1,cv=5, scoring="roc_auc", n_iter=20, verbose=1)
# pipe_svc_clf2.fit(data_lemmatized, data['label'])

In [None]:
# print('best score {}'.format(pipe_svc_clf2.best_score_))
# print('best params {}'.format(pipe_svc_clf2.best_params_))

## trial 5 PassiveAggressiveClassifier

I looked for another classifier that could outperform logistic regression, but that was not the case :(

In [70]:
# feature creation and modelling in a single function
pipe_pa = Pipeline([("tfidf", TfidfVectorizer()), ("pa", PassiveAggressiveClassifier(random_state=42))])


# define parameter space to test
params = {
    "tfidf__ngram_range": [(1, 2), (1, 3), (1,4), (1,5)],
    "tfidf__max_df": np.arange(0.2, 1.0),  # note: the default step is 1, so this yields only [0.2]
    "tfidf__min_df": np.arange(5, 100),
    "tfidf__analyzer": ['word','char'],
    "tfidf__strip_accents":[None,'ascii','unicode'],
    'tfidf__smooth_idf':[False,True],
    "tfidf__sublinear_tf":[False,True]
}

# here we use data_stemmed; the random search runs 5-fold cross-validation (cv=5) internally to choose the validation samples

pipe_pa_clf = RandomizedSearchCV(pipe_pa, params, n_jobs=-1,cv=5, scoring="roc_auc", n_iter=100, verbose=1)
pipe_pa_clf.fit(data_stemmed, data['label'])


Fitting 5 folds for each of 100 candidates, totalling 500 fits


RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('tfidf', TfidfVectorizer()),
                                             ('pa',
                                              PassiveAggressiveClassifier(random_state=42))]),
                   n_iter=100, n_jobs=-1,
                   param_distributions={'tfidf__analyzer': ['word', 'char'],
                                        'tfidf__max_df': array([0.2]),
                                        'tfidf__min_df': array([ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
       22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36...
       39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55,
       56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72,
       73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
       90, 91, 92, 93, 94, 95, 96, 97, 98, 99]),
                                        'tfidf__ngram_range': [(1, 2), (1, 3),
         

In [71]:
print('best score {}'.format(pipe_pa_clf.best_score_))
print('best params {}'.format(pipe_pa_clf.best_params_))

best score 0.8486310776986358
best params {'tfidf__sublinear_tf': True, 'tfidf__strip_accents': None, 'tfidf__smooth_idf': True, 'tfidf__ngram_range': (1, 5), 'tfidf__min_df': 15, 'tfidf__max_df': 0.2, 'tfidf__analyzer': 'word'}


In [62]:
params = {
    'tfidf__sublinear_tf':[True],
    'tfidf__strip_accents':['unicode'],
    'tfidf__smooth_idf':[False],
    'tfidf__ngram_range': [(1, 2)],
    'tfidf__analyzer':['word'],
    'tfidf__min_df': [14], 
    'tfidf__max_df': [0.2],
    'pa__class_weight':['balanced',None],
    "pa__loss" : ['hinge','squared_hinge'],
    'pa__C': [1.0,0.1,0.005,1.5,2.0,3.5,4,5, 10.0],
    'pa__fit_intercept':[False, True],
    'pa__average':[False, True],
}

pipe_pa_clf = RandomizedSearchCV(pipe_pa, params, n_jobs=-1,cv=5, scoring="roc_auc", n_iter=100, verbose=1)
pipe_pa_clf.fit(data_stemmed, data['label'])

Fitting 5 folds for each of 100 candidates, totalling 500 fits


RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('tfidf', TfidfVectorizer()),
                                             ('pa',
                                              PassiveAggressiveClassifier(random_state=42))]),
                   n_iter=100, n_jobs=-1,
                   param_distributions={'pa__C': [1.0, 0.1, 0.005, 1.5, 2.0,
                                                  3.5, 4, 5, 10.0],
                                        'pa__average': [False, True],
                                        'pa__class_weight': ['balanced', None],
                                        'pa__fit_intercept': [False, True],
                                        'pa__loss': ['hinge', 'squared_hinge'],
                                        'tfidf__analyzer': ['word'],
                                        'tfidf__max_df': [0.2],
                                        'tfidf__min_df': [14],
                                        'tfidf__ngram_range': [(

In [63]:
print('best score {}'.format(pipe_pa_clf.best_score_))
print('best params {}'.format(pipe_pa_clf.best_params_))

best score 0.875545069320939
best params {'tfidf__sublinear_tf': True, 'tfidf__strip_accents': 'unicode', 'tfidf__smooth_idf': False, 'tfidf__ngram_range': (1, 2), 'tfidf__min_df': 14, 'tfidf__max_df': 0.2, 'tfidf__analyzer': 'word', 'pa__loss': 'squared_hinge', 'pa__fit_intercept': True, 'pa__class_weight': 'balanced', 'pa__average': True, 'pa__C': 0.005}


## trial 6 Naive bayes

That was my last option among the classical machine learning classifiers. I guess the best one for me was logistic regression after all.

In [22]:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB())
])

# Define the parameters to be tuned
parameters = {
    "tfidf__ngram_range": [(1, 2), (1, 3), (1,4), (1,5),(2,2)],
    "tfidf__max_df": np.arange(0.2, 1.0),  # note: the default step is 1, so this yields only [0.2]
    "tfidf__min_df": np.arange(5, 100),
    "tfidf__analyzer": ['word'],
    "tfidf__strip_accents":[None,'ascii','unicode'],
    'tfidf__smooth_idf':[False,True],
    "tfidf__sublinear_tf":[False,True],
    "clf__alpha": (0.1, 0.5, 1.0)
}
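# note: no scoring argument is passed here, so best_score_ is accuracy, not ROC AUC as in the other trials
grid_search = RandomizedSearchCV(pipeline, parameters, cv=5, n_jobs=-1, n_iter=200, verbose=1)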
grid_search.fit(data_lemmatized, data['label'])

Fitting 5 folds for each of 200 candidates, totalling 1000 fits


RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('tfidf', TfidfVectorizer()),
                                             ('clf', MultinomialNB())]),
                   n_iter=200, n_jobs=-1,
                   param_distributions={'clf__alpha': (0.1, 0.5, 1.0),
                                        'tfidf__analyzer': ['word'],
                                        'tfidf__max_df': array([0.2]),
                                        'tfidf__min_df': array([ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
       22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
       3...44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55,
       56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72,
       73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
       90, 91, 92, 93, 94, 95, 96, 97, 98, 99]),
                                        'tfidf__ngram_range': [(1, 2), (1, 3),
                              

In [23]:
print('best score {}'.format(grid_search.best_score_))
print('best params {}'.format(grid_search.best_params_))

best score 0.7812333333333333
best params {'tfidf__sublinear_tf': True, 'tfidf__strip_accents': 'ascii', 'tfidf__smooth_idf': False, 'tfidf__ngram_range': (1, 5), 'tfidf__min_df': 5, 'tfidf__max_df': 0.2, 'tfidf__analyzer': 'word', 'clf__alpha': 1.0}
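This search was run without a `scoring` argument, so the 0.78 above is the classifier's default score (accuracy), not ROC AUC as in the other trials. The two numbers are not directly comparable, as this toy sketch shows (the data and both helper functions are made up for illustration):

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    hits = sum((s >= threshold) == bool(y) for y, s in zip(labels, scores))
    return hits / len(labels)

def roc_auc(labels, scores):
    """Chance that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities, invented for this example.
labels = [0, 0, 1, 1, 1, 0]
scores = [0.1, 0.6, 0.55, 0.8, 0.4, 0.3]

acc = accuracy(labels, scores)
auc = roc_auc(labels, scores)
print(f"accuracy={acc:.3f}  roc_auc={auc:.3f}")
```

To compare Naive Bayes with the other trials on equal terms, the search would need `scoring="roc_auc"` like the others.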


## trial 7

One last trial: a grid search scored on a single predefined hold-out split instead of 5-fold cross-validation.

In [33]:
X_train_2, X_val_2 ,Y_train_2, Y_val_2 = train_test_split(data_stemmed,data['label'],stratify=data['label'], random_state=42, test_size=0.25, shuffle=True)


split_index_stemmed = [-1 if x in X_train_2.index else 0 for x in data_stemmed.index]
ps = PredefinedSplit(split_index_stemmed)
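The list passed to `PredefinedSplit` follows a simple convention: `-1` means the sample is never used for validation, and any other integer names the validation fold it belongs to. A minimal sketch of the same construction on made-up ids (`all_ids` and `train_ids` are hypothetical stand-ins for the DataFrame index and the training part of the split):

```python
# Hypothetical sample ids; in the notebook these are the DataFrame index values.
all_ids = list(range(8))
train_ids = {0, 1, 2, 4, 5, 7}   # ids that train_test_split put in training

# PredefinedSplit convention: -1 = always training,
# any other integer k = member of validation fold k.
test_fold = [-1 if i in train_ids else 0 for i in all_ids]

n_train = test_fold.count(-1)
n_val = test_fold.count(0)
n_folds = len({f for f in test_fold if f != -1})

print(test_fold)
print(n_train, n_val, n_folds)
```

With only the values `-1` and `0` there is exactly one fold, so each candidate is fitted once.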

In [35]:
pipe_lg = Pipeline([("tfidf", TfidfVectorizer()), ("lg", LogisticRegression(max_iter=10000,random_state=42,n_jobs=-1))])

params = {
    'tfidf__sublinear_tf':[True,False], 
    'tfidf__strip_accents':[None],
    'tfidf__smooth_idf':[False],
    'tfidf__ngram_range': [(1, 5),(1,2)], 
    'tfidf__analyzer':['char','word'],
    'tfidf__min_df': [11], 
    'tfidf__max_df': [0.2],
    'lg__class_weight':['balanced',None],
    "lg__solver" : ['sag','newton-cg', 'lbfgs','liblinear','saga'],
    'lg__C': [1.0,0.1,0.001,1.5,2.0,3.5,4,5],
    'lg__fit_intercept':[False, True],

}

pipe_lg_clf = GridSearchCV(pipe_lg, params, n_jobs=-1,cv=ps, scoring="roc_auc", verbose=1)
pipe_lg_clf.fit(data_stemmed, data['label'])

Fitting 1 folds for each of 1280 candidates, totalling 1280 fits


GridSearchCV(cv=PredefinedSplit(test_fold=array([-1, -1, ..., -1, -1])),
             estimator=Pipeline(steps=[('tfidf', TfidfVectorizer()),
                                       ('lg',
                                        LogisticRegression(max_iter=10000,
                                                           n_jobs=-1,
                                                           random_state=42))]),
             n_jobs=-1,
             param_grid={'lg__C': [1.0, 0.1, 0.001, 1.5, 2.0, 3.5, 4, 5],
                         'lg__class_weight': ['balanced', None],
                         'lg__fit_intercept': [False, True],
                         'lg__solver': ['sag', 'newton-cg', 'lbfgs',
                                        'liblinear', 'saga'],
                         'tfidf__analyzer': ['char', 'word'],
                         'tfidf__max_df': [0.2], 'tfidf__min_df': [11],
                         'tfidf__ngram_range': [(1, 5), (1, 2)],
                         'tfidf__

In [37]:
print('best score {}'.format(pipe_lg_clf.best_score_))
print('best params {}'.format(pipe_lg_clf.best_params_))

best score 0.890334974227255
best params {'lg__C': 3.5, 'lg__class_weight': 'balanced', 'lg__fit_intercept': True, 'lg__solver': 'sag', 'tfidf__analyzer': 'char', 'tfidf__max_df': 0.2, 'tfidf__min_df': 11, 'tfidf__ngram_range': (1, 5), 'tfidf__smooth_idf': False, 'tfidf__strip_accents': None, 'tfidf__sublinear_tf': True}
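The best model here uses `class_weight='balanced'`. As far as I know, scikit-learn computes these weights as `n_samples / (n_classes * count_of_class)`. A minimal sketch with a made-up label distribution (the helper `balanced_class_weights` is hypothetical, not a scikit-learn function):

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight[c] = n_samples / (n_classes * count[c]), the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

# Made-up label distribution: 6 real (0) and 4 fake (1) titles.
labels = [0] * 6 + [1] * 4
weights = balanced_class_weights(labels)
print(weights)
```

The rarer class gets the larger weight, so mistakes on it cost more during training.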


In [39]:
pipe_lg.set_params(**pipe_lg_clf.best_params_).fit(data_stemmed, data['label'])

Pipeline(steps=[('tfidf',
                 TfidfVectorizer(analyzer='char', max_df=0.2, min_df=11,
                                 ngram_range=(1, 5), smooth_idf=False,
                                 sublinear_tf=True)),
                ('lg',
                 LogisticRegression(C=3.5, class_weight='balanced',
                                    max_iter=10000, n_jobs=-1, random_state=42,
                                    solver='sag'))])
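The selected vectorizer combines `sublinear_tf=True` with `smooth_idf=False`. To my understanding, that replaces the raw term count with `1 + ln(tf)` and uses `idf = ln(n_docs / df) + 1`. A rough sketch of the un-normalised weight for a single term, with hypothetical counts (the final L2 normalisation that `TfidfVectorizer` applies by default is left out):

```python
import math

def tfidf_weight(tf, df, n_docs, sublinear_tf=True, smooth_idf=False):
    """Un-normalised tf-idf weight of one term in one document."""
    if tf == 0:
        return 0.0
    tf_part = 1.0 + math.log(tf) if sublinear_tf else float(tf)
    if smooth_idf:
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
    else:
        idf = math.log(n_docs / df) + 1.0
    return tf_part * idf

# Hypothetical counts: a character n-gram seen 3 times in one title
# and present in 100 of 10,000 titles.
w_sub = tfidf_weight(tf=3, df=100, n_docs=10_000)
w_raw = tfidf_weight(tf=3, df=100, n_docs=10_000, sublinear_tf=False)
print(round(w_sub, 3), round(w_raw, 3))
```

Sublinear scaling dampens repeated n-grams, which matters more for character n-grams because they repeat often within a single title.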

In [43]:
#create submission file
submission = pd.DataFrame()

submission['id'] = test_stemmed.index
print(len(test['text']))
submission['label'] = pipe_lg.predict_proba(test_stemmed)[:,1]
submission.to_csv('bestLGfinal.csv', index=False)

59151


# Deep learning approach

## BERT

In [62]:
# Import the necessary modules
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
from keras.callbacks import EarlyStopping
from keras.callbacks import ReduceLROnPlateau
from keras.optimizers import Nadam
# Load the BERT model from TensorFlow Hub
bert_model = hub.KerasLayer('https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4')

# Load the BERT preprocessing layer from TensorFlow Hub
bert_preprocess = hub.KerasLayer('https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3')
X_train, X_test, y_train, y_test = train_test_split(data_lemmatized,data["label"],stratify=data["label"],test_size = 0.05,random_state=1)
# Define the input and output layers for the classifier
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
encoder_inputs = bert_preprocess(text_input)
encoder_outputs = bert_model(encoder_inputs)
pooled_output = encoder_outputs['pooled_output']
dropout = tf.keras.layers.Dropout(0.5)(pooled_output)
output = tf.keras.layers.Dense(1, activation='sigmoid', name='output')(dropout)

# Create a Keras model for fake-news classification
model = tf.keras.Model(inputs=[text_input], outputs=[output])
opt = Nadam(learning_rate=0.002)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=3, min_lr=0.001)
# Compile the model with binary crossentropy loss and the Nadam optimizer
model.compile(optimizer=opt,loss='binary_crossentropy',metrics=['AUC'])

# # Assume data is a data frame with two columns: text and label
# X_train = data['text']
# y_train = data['label']
early_stopping = EarlyStopping(monitor='val_auc', patience=3, verbose=1, mode='auto', restore_best_weights=True)
# Fit the model on the training data
model.fit(X_train, y_train, validation_split=0.2, epochs=50, batch_size=32, callbacks=[early_stopping,reduce_lr])

In [47]:
test_loss, test_auc = model.evaluate(X_test, y_test)
print('Test loss:', test_loss)
print('Test AUC:', test_auc)

Test loss: 0.41137224435806274
Test AUC: 0.8980545997619629
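One detail of the BERT cell: the `ReduceLROnPlateau` callback multiplies the learning rate by `factor` but never goes below `min_lr`. With a start of 0.002, `factor=0.2` and `min_lr=0.001`, the rate can only drop once, to 0.001. A small arithmetic sketch (`lr_schedule` is a helper made up for this illustration, and the lower floor is a hypothetical value):

```python
def lr_schedule(initial_lr, factor, min_lr, n_plateaus):
    """Learning rate after each plateau: new_lr = max(lr * factor, min_lr)."""
    lrs = [initial_lr]
    for _ in range(n_plateaus):
        lrs.append(max(lrs[-1] * factor, min_lr))
    return lrs

# Values from the BERT cell above.
as_configured = lr_schedule(0.002, factor=0.2, min_lr=0.001, n_plateaus=3)

# Hypothetical lower floor, only to show what a smaller min_lr would allow.
lower_floor = lr_schedule(0.002, factor=0.2, min_lr=1e-6, n_plateaus=3)

print(as_configured)
print(lower_floor)
```

So the callback as configured effectively halves the learning rate once and then has no further effect.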


## LSTM

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import regularizers
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
# Load the dataset
df = pd.read_csv('xy_train.csv',index_col='id')
df = df.drop(df[df.label == 2].index)
# Tokenize the text and convert it to sequences
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(df['text'])
sequences = tokenizer.texts_to_sequences(df['text'])

# Pad the sequences to a fixed length
X = pad_sequences(sequences, maxlen=50)

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, df['label'],stratify=df["label"],test_size=0.2, random_state=42)

# Build the model with LSTM layers
model = Sequential()
model.add(Embedding(10000, 128, input_length=50))
model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(64, activation='relu', kernel_regularizer=regularizers.l2(0.01)))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['AUC'])

# Train the model for 3 epochs, validating on the held-out split (no class weights or early stopping are used here)
history = model.fit(X_train, y_train, epochs=3, batch_size=32, validation_data=(X_test, y_test))

# Predict the probabilities of the test set
y_pred = model.predict(X_test)

# Calculate the ROC AUC score
roc_auc = roc_auc_score(y_test, y_pred)
print("ROC AUC Score:", roc_auc)





Epoch 1/3
 161/1495 [==>...........................] - ETA: 4:15 - loss: 0.8782 - auc: 0.6988

In [47]:
# Load the testing dataset
test_df = pd.read_csv('x_test.csv')

# Reuse the tokenizer fitted on the training data so the word indices match what the model saw during training
sequences = tokenizer.texts_to_sequences(test_df['text'])
X_test = pad_sequences(sequences, maxlen=50)
y_pred = model.predict(X_test)

# Create a new pandas DataFrame with the predicted probabilities and the ID column
submission_df = pd.DataFrame()
submission_df['id'] = test_df['id']
submission_df['label'] = y_pred

# Save the submission DataFrame as a CSV file
submission_df.to_csv("submissionfinaaaaaaaaal.csv", index=False)
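A caveat on this cell: the test texts have to be encoded with the tokenizer that was fitted on the training data. A tokenizer fitted on the test set assigns its own indices, so the same word can map to a different embedding row than the one the model learned. The sketch below uses a simplified frequency-ranked vocabulary as a stand-in for the Keras `Tokenizer` (the functions `fit_vocab` and `encode` and the example titles are invented for illustration):

```python
from collections import Counter

def fit_vocab(texts):
    """Map each word to an index by descending frequency, starting at 1 (0 is padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(texts, vocab):
    """Replace words with their indices; unknown words are dropped."""
    return [[vocab[w] for w in text.lower().split() if w in vocab] for text in texts]

# Made-up titles for illustration.
train_texts = ["fake news spreads fast", "news about fake cures"]
test_texts = ["cures spread fast", "fast fake cures"]

train_vocab = fit_vocab(train_texts)
test_vocab = fit_vocab(test_texts)   # what refitting on the test set would do

print(train_vocab["fast"], test_vocab["fast"])   # same word, different index
print(encode(test_texts, train_vocab))           # indices the trained model understands
```

Encoding with `train_vocab` keeps the indices consistent; words the training vocabulary has never seen are simply dropped.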



## LSTM + CNN

In [37]:
import pandas as pd
import numpy as np
from tensorflow.keras.layers import Input, Embedding, LSTM, Conv1D, GlobalMaxPooling1D, Dense, concatenate, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import regularizers
from tensorflow.keras.callbacks import EarlyStopping

# Load the data
train_df =  pd.read_csv('xy_train.csv',index_col='id')
train_df = train_df.drop(train_df[train_df.label == 2].index)
test_df = pd.read_csv('x_test.csv')
# Convert the text column of the DataFrames to a list of strings
train_texts = train_df['text'].to_numpy().tolist()
test_texts = test_df['text'].to_numpy().tolist()

# Convert the label column of the DataFrames to a NumPy array
train_labels = train_df['label'].to_numpy()

# Initialize a tokenizer with a vocabulary size of 10,000
tokenizer = Tokenizer(num_words=10000)

# Fit the tokenizer on the training data
tokenizer.fit_on_texts(train_texts)

# Convert the texts to sequences of word indices
train_sequences = tokenizer.texts_to_sequences(train_texts)
test_sequences = tokenizer.texts_to_sequences(test_texts)

# Pad the sequences to a maximum length of 256
max_length = 256
train_data = pad_sequences(train_sequences, maxlen=max_length, padding='post', truncating='post')
test_data = pad_sequences(test_sequences, maxlen=max_length, padding='post', truncating='post')

# Define the input layer
input_layer = Input(shape=(max_length,), dtype='int32')

# Add an embedding layer to convert the input indices to dense vectors
embedding_layer = Embedding(input_dim=10000, output_dim=128, input_length=max_length)(input_layer)

# Add a LSTM layer with 64 units and L2 regularization
lstm_layer = LSTM(units=64, kernel_regularizer=regularizers.l2(0.01))(embedding_layer)

# Add a 1D convolutional layer with 128 filters and a kernel size of 3, and L2 regularization
conv_layer = Conv1D(filters=128, kernel_size=3, activation='relu', kernel_regularizer=regularizers.l2(0.01))(embedding_layer)

# Add a global max pooling layer to extract the most important features
pooling_layer = GlobalMaxPooling1D()(conv_layer)

# Add a dropout layer to prevent overfitting
dropout_layer = Dropout(rate=0.5)(lstm_layer)

# Concatenate the outputs of the dropout and pooling layers
concat_layer = concatenate([dropout_layer, pooling_layer])

# Add a dense layer with L2 regularization for classification
dense_layer = Dense(1, activation='sigmoid', kernel_regularizer=regularizers.l2(0.01))(concat_layer)

# Create the Keras model
model = Model(inputs=input_layer, outputs=dense_layer)

# Compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['AUC'])

# Stop training once validation AUC has not improved for 4 epochs and keep the best weights
early_stopping = EarlyStopping(monitor='val_auc', mode='max', patience=4, restore_best_weights=True)

# Train the model with early stopping (no class weights are passed)
history = model.fit(x=train_data, y=train_labels, epochs=50, batch_size=128,validation_split=0.2, callbacks=[early_stopping])


Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50


In [38]:
from sklearn.metrics import roc_auc_score
val_preds = model.predict(train_data[int(len(train_data)*0.8):])[:, 0]
val_labels = train_labels[int(len(train_labels)*0.8):]
roc_auc = roc_auc_score(val_labels, val_preds)
print('ROC AUC score on the validation set:', roc_auc)

ROC AUC score on the validation set: 0.8912908127223991
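Slicing off the last 20% works here because, as far as I know, Keras takes the `validation_split` fraction from the end of the arrays before any shuffling. A small sketch of that index arithmetic with a hypothetical row count (`keras_style_split` is a made-up helper, not a Keras function):

```python
import math

def keras_style_split(n_samples, validation_split):
    """Index where the validation tail starts: the last fraction of the data, unshuffled."""
    return int(math.floor(n_samples * (1.0 - validation_split)))

n = 1000                       # hypothetical number of training rows
split_at = keras_style_split(n, 0.2)
manual = int(n * 0.8)          # the slice start used in the cell above

train_idx = range(0, split_at)
val_idx = range(split_at, n)
print(split_at, manual, len(train_idx), len(val_idx))
```

The score above is therefore computed on the same rows Keras used for `val_auc`. If the training rows are ordered in some way, for example by date or label, this tail may not be representative.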


In [41]:
# Predict the probabilities of the test set using the trained model
y_pred = model.predict(test_data)[:, 0]

# Create a new pandas DataFrame with the predicted probabilities and the ID column
submission_df = pd.DataFrame()
submission_df['id'] = test_df['id']
submission_df['label'] = y_pred

# Save the submission DataFrame as a CSV file
submission_df.to_csv('submissionlstmcnn2fff.csv', index=False)



In [42]:
submission_df

| | id | label |
|---|---|---|
| 0 | 0 | 0.548615 |
| 1 | 1 | 0.267323 |
| 2 | 2 | 0.389484 |
| 3 | 3 | 0.719527 |
| 4 | 4 | 0.374875 |
| ... | ... | ... |
| 59146 | 59146 | 0.845922 |
| 59147 | 59147 | 0.658412 |
| 59148 | 59148 | 0.001467 |
| 59149 | 59149 | 0.299196 |


In [74]:
import pandas as pd
import numpy as np
from tensorflow.keras.layers import Input, Embedding, LSTM, Conv1D, GlobalMaxPooling1D, Dense, concatenate, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import regularizers
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# Load the data
train_df =  pd.read_csv('xy_train.csv',index_col='id')
train_df = train_df.drop(train_df[train_df.label == 2].index)
test_df = pd.read_csv('x_test.csv')
# Use the lemmatized training and test texts prepared earlier
train_texts = data_lemmatized
test_texts = test_lemmatized

# Convert the label column of the DataFrames to a NumPy array
train_labels = train_df['label'].to_numpy()

# Initialize a tokenizer with a vocabulary size of 10,000
tokenizer = Tokenizer(num_words=10000)

# Fit the tokenizer on the training data
tokenizer.fit_on_texts(train_texts)

# Convert the texts to sequences of word indices
train_sequences = tokenizer.texts_to_sequences(train_texts)
test_sequences = tokenizer.texts_to_sequences(test_texts)

# Pad the sequences to a maximum length of 256
max_length = 256
train_data = pad_sequences(train_sequences, maxlen=max_length, padding='post', truncating='post')
test_data = pad_sequences(test_sequences, maxlen=max_length, padding='post', truncating='post')

# Define the input layer
input_layer = Input(shape=(max_length,), dtype='int32')

# Add an embedding layer to convert the input indices to dense vectors
embedding_layer = Embedding(input_dim=10000, output_dim=128, input_length=max_length)(input_layer)

# Add a LSTM layer with 64 units and L2 regularization
lstm_layer = LSTM(units=64, kernel_regularizer=regularizers.l2(0.01))(embedding_layer)

# Add a 1D convolutional layer with 128 filters and a kernel size of 3, and L2 regularization
conv_layer = Conv1D(filters=128, kernel_size=3, activation='relu', kernel_regularizer=regularizers.l2(0.01))(embedding_layer)

# Add a global max pooling layer to extract the most important features
pooling_layer = GlobalMaxPooling1D()(conv_layer)

# Add a dropout layer to prevent overfitting
dropout_layer = Dropout(rate=0.5)(lstm_layer)

# Concatenate the outputs of the dropout and pooling layers
concat_layer = concatenate([dropout_layer, pooling_layer])

# Add a dense layer with L2 regularization for classification
dense_layer = Dense(1, activation='sigmoid', kernel_regularizer=regularizers.l2(0.01))(concat_layer)

# Create the Keras model
model = Model(inputs=input_layer, outputs=dense_layer)

# Compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['AUC'])

# Stop early or lower the learning rate when validation AUC stalls; mode='max' because a higher AUC is better
early_stopping = EarlyStopping(monitor='val_auc', mode='max', patience=5)
reduce_lr = ReduceLROnPlateau(monitor='val_auc', mode='max', factor=0.2, patience=3, min_lr=1e-6)

# Train the model with early stopping and learning-rate reduction (no class weights are passed)
history = model.fit(x=train_data, y=train_labels, epochs=50, batch_size=32,validation_split=0.1, callbacks=[early_stopping, reduce_lr])


Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50


In [75]:
# Predict the probabilities of the test set using the trained model
y_pred = model.predict(test_data)[:, 0]

# Create a new pandas DataFrame with the predicted probabilities and the ID column
submission_df = pd.DataFrame()
submission_df['id'] = test_df['id']
submission_df['label'] = y_pred

# Save the submission DataFrame as a CSV file
submission_df.to_csv('submissionlstmcnn2finallast.csv', index=False)



In the end this model only scored 0.83 on Kaggle, while SVM got me 0.85 and logistic regression 0.84. That was the maximum I could get from each approach :)