1_Define the problem: 

The problem is to develop a machine learning model that can accurately classify Reddit posts as fake news or not based on 
their title. This task can help mitigate the spread of false information on social media platforms.

2_What is the input? 

The input is a text string that represents the title of a Reddit post. The text may contain various 
forms of words, including slang, misspellings, abbreviations, and other variations that can make text classification challenging.

3_What is the output? 

The output is a binary classification label that indicates whether the Reddit post is fake news or not. 
This label can be represented as a 0 or 1, where 0 represents a genuine post and 1 represents a fake news post.

4_What data mining function is required? 

A text classification algorithm is required to perform the classification task. This involves preprocessing the text data, extracting relevant features, 
and training a machine learning model to classify the posts. Some of the data mining techniques that can be used in this task include text preprocessing, 
feature engineering, model selection, and hyperparameter tuning.

5_What could be the challenges? 

Some of the challenges include dealing with the unstructured nature of the text data, identifying relevant features that can
help discriminate between fake and real news, and addressing issues of class imbalance or bias in the dataset. Additionally, the text data may contain noise, 
ambiguity, and subjectivity that can make it difficult to develop a robust classification model.

6_What is the impact? 

The impact of this project is to help identify and mitigate the spread of fake news on social media platforms, which can have significant social and political consequences.
By accurately identifying fake news posts, this project can help prevent the spread of misinformation, promote media literacy, and enhance the credibility of online information
sources.

7_What is an ideal solution? 

Model: Random Forest

Validation : Random Search

Vectorizer: tfidf(word)



8-What is the difference between Character n-gram and Word n-gram?  Which one tends to suffer more from the OOV issue?

Character n-grams are sequences of n consecutive characters, whereas word n-grams are sequences of n consecutive words. Character n-grams capture information about the morphology and spelling of words, while word n-grams capture more of the semantics and syntax of the language.

Word n-grams tend to suffer more from the OOV (out-of-vocabulary) issue. A word that never appeared in the training data, for example a misspelling, a slang term, or an abbreviation, produces no known word-level feature at prediction time. Character n-grams are less affected, because even an unseen word is usually made up of character sequences that did occur in the training data, so part of its signal is preserved. The trade-off is that character n-grams can generate a much larger number of features.
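
To make the OOV comparison concrete, here is a minimal sketch in plain Python. The helper names (`word_ngrams`, `char_ngrams`, `oov_rate`) and the toy sentences are illustrative assumptions, not part of the notebook's pipeline; the sketch only measures which share of the test n-grams never occurred in the training texts.

```python
def word_ngrams(text, n):
    """Return the word n-grams of a whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return the character n-grams of a lowercased text (spaces included)."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def oov_rate(train_texts, test_text, extractor, n):
    """Fraction of the test n-grams that never occur in the training texts."""
    vocabulary = set()
    for text in train_texts:
        vocabulary.update(extractor(text, n))
    test_grams = extractor(test_text, n)
    unseen = [gram for gram in test_grams if gram not in vocabulary]
    return len(unseen) / len(test_grams)


# Hypothetical toy data: the test title has one misspelling and one unseen plural.
train = ["president signs new law", "new study shows results"]
test = "presidnt signs new laws"

word_oov = oov_rate(train, test, word_ngrams, 1)
char_oov = oov_rate(train, test, char_ngrams, 3)
print(f"word unigram OOV rate: {word_oov:.2f}")
print(f"char trigram OOV rate: {char_oov:.2f}")
```

On this toy input the word-level OOV rate comes out higher than the character-level one, because the misspelled word still shares most of its trigrams with the correctly spelled training word.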

9-What is the difference between stop word removal and stemming? Are these techniques language-dependent?

Stop word removal is the process of removing frequently occurring words, such as "the", "and", and "a", from a text corpus. These words usually carry little information for text classification tasks and can often be removed without losing much signal.

Stemming, on the other hand, is the process of reducing words to a base or root form by stripping affixes, mostly suffixes. For example, the words "running", "runs", and "run" would all be reduced to the stem "run". Stemming can help reduce the dimensionality of the data and may improve the accuracy of text classification models.

Both techniques are language-dependent, as the list of stop words and the rules for stemming can vary depending on the language and the context of the text corpus. For example, stop words in English may differ from stop words in Spanish, and the rules for stemming may need to be adapted to account for irregular verbs and noun forms in different languages.
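
The two steps can be sketched in a few lines of plain Python. This is a toy illustration only: the stop word list and the suffix rules below are made-up assumptions that cover just this example, and they are not the rules of a real stemmer such as Porter or Snowball.

```python
# Tiny illustrative stop word list (English only).
STOP_WORDS = {"the", "and", "a", "is", "in", "of"}

# Toy suffix rules, checked in this order (English only).
SUFFIXES = ("ning", "ing", "s")


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


tokens = "the man is running and the dog runs in a park".split()
filtered = remove_stop_words(tokens)
stems = [toy_stem(token) for token in filtered]
print(filtered)
print(stems)
```

Both the word list and the suffix rules are specific to English, which is the language dependence described above: another language would need its own list and its own rules.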

10-Are tokenization techniques language-dependent? Why?

Tokenization is the process of breaking down a text corpus into individual units, or tokens, such as words, punctuation marks, or other symbols. Tokenization techniques can be language-dependent, as different languages may have different rules for tokenizing text.

For example, in English, words are typically separated by spaces, while Chinese and Japanese are written without spaces between words, so a tokenizer needs a dictionary or a statistical model to find word boundaries. Other languages raise different issues: in Arabic and Hebrew, for instance, particles such as conjunctions and prepositions are attached directly to the following word, so a single space-delimited token can contain several words.

Tokenization techniques may also need to be adapted to account for the specific context of the text corpus, such as the presence of abbreviations, acronyms, or other non-standard forms of text.
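
A small sketch of why space-based rules do not transfer across languages. The two tokenizers below are simplified stand-ins written for this illustration (they are not the NLTK tokenizer used later in the notebook), and the Chinese sentence is an assumed example.

```python
import re


def whitespace_tokenize(text):
    """Split on runs of whitespace only."""
    return text.split()


def regex_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


english = "Trump blows up GOP's formula."
chinese = "我喜欢机器学习"  # "I like machine learning", written without spaces

print(whitespace_tokenize(english))
print(regex_tokenize(english))
print(whitespace_tokenize(chinese))
print(regex_tokenize(chinese))
```

Both tokenizers return the Chinese sentence as a single token, since neither rule knows where its word boundaries are; a language-specific segmenter would be needed there.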

11-What is the difference between count vectorizer and tf-idf vectorizer? Would it be feasible to use all possible n-grams? If not, how should you select them?

Count vectorizer is a technique for converting a text corpus into a matrix of word counts, where each row represents a document and each column represents a unique word in the corpus. Tf-idf vectorizer, on the other hand, is a technique for weighting the word counts based on their frequency and importance in the corpus. Tf-idf vectorizer assigns higher weights to words that are more frequent in a particular document but less frequent in the overall corpus.

It would not be feasible to use all possible n-grams, as this would generate a very large number of features and increase the computational complexity of the classification task. Instead, it is common to limit the number of n-grams based on their frequency or information gain, or to use a technique such as feature selection or dimensionality reduction to identify the most informative n-grams.

To select the most informative n-grams, one approach is to use techniques such as mutual information or chi-squared tests to identify the n-grams that are most strongly associated with the target variable. Another approach is to use domain knowledge or heuristic rules to select n-grams that are likely to be informative based on the specific context of the text corpus.
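
The difference between raw counts and tf-idf weights, and a simple frequency-based way to prune the vocabulary, can be sketched in plain Python. The three toy documents are assumptions made for this example. The idf formula used is the smoothed variant `ln((1 + N) / (1 + df)) + 1`, which is the default documented for scikit-learn's `TfidfVectorizer`; the L2 row normalization that scikit-learn also applies by default is left out to keep the numbers easy to follow.

```python
import math
from collections import Counter

# Hypothetical toy corpus.
docs = [
    "trump says new law",
    "new photo of old house",
    "new law new rules",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({token for doc in tokenized for token in doc})

# Count vectorizer view: one term-count mapping per document.
counts = [Counter(doc) for doc in tokenized]

# Document frequency: in how many documents each term appears.
n_docs = len(docs)
df = {term: sum(1 for c in counts if term in c) for term in vocab}

# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

# tf-idf view: counts re-weighted by idf (no normalization in this sketch).
tfidf = [{term: c[term] * idf[term] for term in c} for c in counts]

# Frequency-based selection, similar in spirit to a min_df threshold of 2.
kept = [term for term in vocab if df[term] >= 2]

print(counts[2])
print(tfidf[2])
print(kept)
```

In the third document, "new" has twice the count of "rules", but because "new" appears in every document its idf is the minimum value of 1, so the gap between the two weights shrinks. The `kept` list shows how a document-frequency threshold discards terms that occur in only one document.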

In [24]:
import re
import pickle
import sklearn
import pandas as pd
import numpy as np
import holoviews as hv
import nltk 
from bokeh.io import output_notebook
import scipy.stats as stats
output_notebook()

from pathlib import Path

# display settings for pandas and NumPy

pd.options.display.max_columns = 100
pd.options.display.max_rows = 300
pd.options.display.max_colwidth = 100
np.set_printoptions(threshold=2000)

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from xgboost import XGBClassifier
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

In [3]:
df_train = pd.read_csv('xy_train.csv', sep=",", na_values=[""])
df_test = pd.read_csv('x_test.csv', sep=",", na_values=[""])
df_test

Unnamed: 0,id,text
0,0,stargazer
1,1,yeah
2,2,PD: Phoenix car thief gets instructions from YouTube video
3,3,"As Trump Accuses Iran, He Has One Problem: His Own Credibility"
4,4,"""Believers"" - Hezbollah 2011"
...,...,...
59146,59146,Bicycle taxi drivers of New Delhi
59147,59147,Trump blows up GOP's formula for winning House races
59148,59148,"Napoleon returns from his exile on the island of Elba. (March 1815), Colourised"
59149,59149,Deep down he always wanted to be a ballet dancer


# Preprocessing performed:
1- Lemmatization to reduce words to their base form

2- Remove HTML tags and non-letter characters (this also breaks URLs into plain letter tokens)

3- Lowercase conversion

4- Remove punctuation and stop words

5- Remove extra white spaces

6- Remove single-letter tokens


In [4]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

# helper function for preprocessing
def clean_text(text, for_embedding=False):
    RE_WSPACE = re.compile(r"\s+", re.IGNORECASE)
    RE_TAGS = re.compile(r"<[^>]+>")
    RE_ASCII = re.compile(r"[^A-Za-zÀ-ž ]", re.IGNORECASE)
    RE_SINGLECHAR = re.compile(r"\b[A-Za-zÀ-ž]\b", re.IGNORECASE)
    if for_embedding:
        # Keep punctuation
        RE_ASCII = re.compile(r"[^A-Za-zÀ-ž,.!? ]", re.IGNORECASE)
        RE_SINGLECHAR = re.compile(r"\b[A-Za-zÀ-ž,.!?]\b", re.IGNORECASE)

    text = re.sub(RE_TAGS, " ", text)
    text = re.sub(RE_ASCII, " ", text)
    text = re.sub(RE_SINGLECHAR, " ", text)
    text = re.sub(RE_WSPACE, " ", text)

    word_tokens = word_tokenize(text)
    words_tokens_lower = [word.lower() for word in word_tokens]

    if for_embedding:
        # keep original casing; no lemmatization or stop word removal
        words_filtered = word_tokens
    else:
        words_filtered = [
            lemmatizer.lemmatize(word) for word in words_tokens_lower if word not in stop_words
        ]

    text_clean = " ".join(words_filtered)
    return text_clean

[nltk_data] Downloading package punkt to C:\Users\Aly
[nltk_data]     Abdelkader\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Aly
[nltk_data]     Abdelkader\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Aly
[nltk_data]     Abdelkader\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Create a new column containing the updated text

In [5]:
# Clean train text
df_train["text_clean"] = df_train["text"]
df_train["text_clean"] = df_train["text_clean"].map(
    lambda x: clean_text(x, for_embedding=False) if isinstance(x, str) else x
)
# Clean test text
df_test["text_clean"] = df_test["text"]
df_test["text_clean"] = df_test["text_clean"].map(
    lambda x: clean_text(x, for_embedding=False) if isinstance(x, str) else x
)


In [6]:
df_train["label"].value_counts()

0    32172
1    27596
2      232
Name: label, dtype: int64

There appears to be an extra class (label 2) with very low frequency, so we will remove it to turn this into a binary classification task.

In [7]:
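# Set every row with label 2 to NaN so that dropna below removes it
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
df_train.loc[df_train["label"]==2] = np.nan


# Drop rows whose cleaned text is empty or the literal string "null"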
df_train = df_train[(df_train["text_clean"] != "") & (df_train["text_clean"] != "null")]


df_train = df_train.dropna(
    axis="index", subset=["label", "text", "text_clean"]
).reset_index(drop=True)




In [8]:
from bokeh.models import NumeralTickFormatter
data_clean=df_train.copy()
# Word frequencies, most common first (the slice below starts at index 1, so the top word is not shown)
word_freq = pd.Series(" ".join(data_clean["text_clean"]).split()).value_counts()
word_freq[1:40]

one          3285
new          2998
like         2949
man          2706
trump        2577
colorized    2430
people       2315
first        2247
old          2222
look         2214
say          2147
get          2072
time         2011
poster       1999
found        1959
day          1935
woman        1892
war          1858
life         1769
make         1727
world        1570
u            1506
american     1498
psbattle     1468
state        1387
post         1384
two          1364
school       1339
back         1325
photo        1324
made         1314
right        1301
circa        1249
child        1216
know         1201
president    1199
see          1181
house        1175
way          1164
dtype: int64

In [9]:
df_test.shape

(59151, 3)

In [10]:
# list most uncommon words
word_freq[-10:].reset_index(name="freq")

Unnamed: 0,index,freq
0,veja,1
1,pibulsonggram,1
2,baring,1
3,jockstrap,1
4,hybernate,1
5,tardigrade,1
6,upriver,1
7,rohl,1
8,squinted,1
9,wahre,1


In [11]:
# Distribution of labels
data_clean["label"].value_counts(normalize=True)

0.0    0.538221
1.0    0.461779
Name: label, dtype: float64

In [12]:
# separate the features (cleaned text) from the target (label)
X_train = df_train["text_clean"]
Y_train = df_train["label"]
X_test = df_test["text_clean"]
print(X_train.shape)
print(X_test.shape)

(59758,)
(59151,)


# Pipeline
This tunable pipeline contains the models and vectorizers used in all our trials.

* Models: Logistic Regression, Passive Aggressive Classifier, XGBoost Classifier, Random Forest.

* Vectorizers: TfidfVectorizer, CountVectorizer.
* Hyperparameter optimization: grid search, random search

In [28]:
# TF-IDF vectorizer
preprocessortf = Pipeline(
    steps=[
        ("tfidf", TfidfVectorizer())]
)
#Count vectorizer
preprocessorcnt = Pipeline(
    steps=[
        ("count", CountVectorizer())]
)

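# Logistic Regression
# As written, this pipeline uses the count vectorizer, which Trial 2 tunes.
# Trial 1 tunes "preprocessor__tfidf__*" parameters, so it needs preprocessortf here instead.
full_piplineLOG = Pipeline(  
    steps=[
        ('preprocessor', preprocessorcnt), 
        ('my_classifier', 
           LogisticRegression(), # Logistic Regression
        )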
    ]
)

#xgboost
full_piplineXGB = Pipeline(  
    steps=[
        ('preprocessor', preprocessortf), 
        ('my_classifier', 
           XGBClassifier(), # XGBClassifier.
        )
    ]
)
# Passive Aggressive Classifier
full_piplinePAC = Pipeline(
    steps=[
        ('preprocessor', preprocessortf),
        ('my_classifier', 
           PassiveAggressiveClassifier(),
        )
    ]
)
# Random Forest
full_piplineRAN = Pipeline(
    steps=[
        ('preprocessor', preprocessortf),
        ('my_classifier', 
           RandomForestClassifier(),
        )
    ]
)

# Trial 1
We will use Logistic Regression with the TF-IDF vectorizer and tune hyperparameters using random search.
* Logistic Regression: a good classifier to start with because of its simplicity and suitability for binary classification.


* TF-IDF: stands for "term frequency-inverse document frequency", meaning the weight assigned to each token depends not only on its frequency in a document but also on how common that term is across the entire corpus.


* Random search: since it doesn't take too long, we will use it to determine whether the model is promising and worth a full grid search.
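
Random search can only sample from the values each parameter list actually contains, so it is worth checking what `np.arange` returns for float ranges such as the `max_df` ranges used in the trials. A minimal check, with variable names chosen only for this illustration:

```python
import numpy as np

# np.arange uses a step of 1 unless told otherwise, so a float range
# narrower than 1 contains only its start value.
single = np.arange(0.3, 0.8)

# np.linspace takes the number of points instead of a step, which avoids
# both the default-step surprise and float rounding at the endpoint.
spaced = np.linspace(0.3, 0.7, 5)

# Integer ranges behave as expected: 5, 6, ..., 99.
min_df_values = np.arange(5, 100)

print(single)
print(spaced)
print(len(min_df_values))
```

With the ranges written as `np.arange(0.3, 0.8)`, the search therefore only ever tries `max_df = 0.3`, which matches the best parameters reported in the trial outputs. Whether a wider range would help is not something this notebook tests.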


In [15]:
# Define parameters
params = {
    "preprocessor__tfidf__ngram_range": [(1, 5), (1, 3)],
    # np.arange uses a step of 1 by default, so this yields only [0.3]
    "preprocessor__tfidf__max_df": np.arange(0.3, 0.8),
    "preprocessor__tfidf__min_df": np.arange(5, 100),  # integers 5..99
    'my_classifier__penalty': ['l2'],  # L2 regularization
    'my_classifier__C': [1.4, 1.6, 1.8, 2.0],  # C is the inverse of the regularization strength
    'my_classifier__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
}

log_rnd = RandomizedSearchCV(
    full_piplineLOG, params, cv=5, verbose=1, n_jobs=-1, 
    # number of random trials
    n_iter=10,
    scoring='roc_auc')

log_rnd.fit(X_train, Y_train)

print('best score {}'.format(log_rnd.best_score_))
print('best score {}'.format(log_rnd.best_params_))

Fitting 5 folds for each of 10 candidates, totalling 50 fits
best score 0.8716812338211024
best score {'preprocessor__tfidf__ngram_range': (1, 5), 'preprocessor__tfidf__min_df': 6, 'preprocessor__tfidf__max_df': 0.3, 'my_classifier__solver': 'lbfgs', 'my_classifier__penalty': 'l2', 'my_classifier__C': 1.8}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Observation: 

The model had a decent start, with a cross-validated ROC AUC of about 0.87 using TfidfVectorizer and random search. 
In the next trial we will try CountVectorizer and grid search.
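
The searches in these trials are scored with `scoring='roc_auc'`, which measures how well the model ranks positives above negatives rather than how many labels it gets right at a fixed threshold. The sketch below computes both metrics by hand on made-up scores to show that they can differ; the function names and data are assumptions for illustration only.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


def accuracy(y_true, scores, threshold=0.5):
    """Share of correct labels after thresholding the scores."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, y_true)) / len(y_true)


# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 0, 1, 1, 1]
scores = [0.10, 0.20, 0.45, 0.30, 0.40, 0.90]

auc = roc_auc(y_true, scores)
acc = accuracy(y_true, scores)
print(f"ROC AUC:  {auc:.3f}")
print(f"accuracy: {acc:.3f}")
```

Because the two numbers answer different questions, a `best_score_` of 0.87 from these searches should be read as a ROC AUC, not as 87% of posts classified correctly.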

# Trial 2
Logistic Regression showed promising performance, so we will use it again, this time with grid search to try all possible hyperparameter combinations, and we will also try the count vectorizer.


* Count Vectorizer: it counts the number of times a token shows up in the document and uses this value as its weight.

In [19]:
params_cnt_log = {
    
    "preprocessor__count__ngram_range": [(1, 5), (1, 3)],
    "preprocessor__count__max_df": np.arange(0.3, 1),  # default step of 1, so only [0.3]
    "preprocessor__count__min_df": np.arange(5,20),
    # 'l1' is supported only by 'liblinear' among the solvers listed below;
    # the other l1 combinations fail to fit
    'my_classifier__penalty': ['l2','l1'],
    'my_classifier__C' : [1.6,1.8,2.0],  # C is the inverse of the regularization strength
    'my_classifier__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag']
}
log_grid = GridSearchCV(
    full_piplineLOG, params_cnt_log, cv=5, verbose=1, n_jobs=-1, 
    scoring='roc_auc')

log_grid.fit(X_train, Y_train)

print('best score {}'.format(log_grid.best_score_))
print('best score {}'.format(log_grid.best_params_))

Fitting 5 folds for each of 720 candidates, totalling 3600 fits


1350 fits failed out of a total of 3600.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
450 fits failed with the following error:
Traceback (most recent call last):
  File "D:\Anaconda\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "D:\Anaconda\lib\site-packages\sklearn\pipeline.py", line 405, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py", line 1162, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py", line 54, in _check_solver
    raise ValueError(
ValueEr

best score 0.8593779248470911
best score {'my_classifier__C': 1.6, 'my_classifier__penalty': 'l2', 'my_classifier__solver': 'lbfgs', 'preprocessor__count__max_df': 0.3, 'preprocessor__count__min_df': 5, 'preprocessor__count__ngram_range': (1, 5)}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Observation:
Some fits failed because the grid included the "l1" penalty, which several of the listed solvers (newton-cg, lbfgs, sag) do not support.
The count vectorizer underperforms TF-IDF (ROC AUC of about 0.86 versus 0.87), so we will continue with TF-IDF in the next trials and also try character-level vectorization.
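
The number of failed fits can be reproduced by enumerating the grid by hand. The sketch below is plain Python and does not call scikit-learn; the short key names and `L1_SOLVERS` are assumptions for this illustration, based on the documented rule that only the `liblinear` and `saga` solvers of `LogisticRegression` accept the `l1` penalty.

```python
from itertools import product

# Same value lists as the Trial 2 grid, with shortened key names.
grid = {
    "ngram_range": [(1, 5), (1, 3)],
    "max_df": [0.3],
    "min_df": list(range(5, 20)),
    "penalty": ["l2", "l1"],
    "C": [1.6, 1.8, 2.0],
    "solver": ["newton-cg", "lbfgs", "liblinear", "sag"],
}
L1_SOLVERS = {"liblinear", "saga"}  # solvers that accept penalty='l1'
cv_folds = 5

keys = list(grid)
candidates = [dict(zip(keys, values)) for values in product(*grid.values())]
invalid = [
    c for c in candidates
    if c["penalty"] == "l1" and c["solver"] not in L1_SOLVERS
]

print(len(candidates), "candidates")
print(len(invalid) * cv_folds, "fits expected to fail")

# One way to avoid the failures: a list of grids, which GridSearchCV accepts,
# pairing each penalty only with solvers that support it.
valid_classifier_grid = [
    {"my_classifier__penalty": ["l2"],
     "my_classifier__solver": ["newton-cg", "lbfgs", "liblinear", "sag"]},
    {"my_classifier__penalty": ["l1"],
     "my_classifier__solver": ["liblinear"]},
]
```

The counts agree with the log above: 720 candidates, of which 270 are invalid, giving 1350 failed fits over 5 folds. The `valid_classifier_grid` fragment would still need the vectorizer and `C` entries added to each dictionary before use.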

# Trial 3
For the next two trials we will try XGBoost, first with random search and then with grid search.

In [21]:
xgb_param = {
    "preprocessor__tfidf__ngram_range": [(1, 5)],
    "preprocessor__tfidf__max_df": np.arange(0.3,0.4),  # default step of 1, so only [0.3]
    "preprocessor__tfidf__min_df": np.arange(5,7),  # [5, 6]
    'my_classifier__n_estimators': [425,450],
    'my_classifier__max_depth': [12,15],
    'my_classifier__learning_rate': [0.1],
    'my_classifier__gamma': [0,0.5],
    'my_classifier__subsample': [0.5,0.8],
}
xgb_rnd = RandomizedSearchCV(
    full_piplineXGB, xgb_param, cv=5, verbose=1, n_jobs=-1, 
    # number of random trials
    n_iter=10,
    scoring='roc_auc')

xgb_rnd.fit(X_train, Y_train)

print('best score {}'.format(xgb_rnd.best_score_))
print('best score {}'.format(xgb_rnd.best_params_))

Fitting 5 folds for each of 10 candidates, totalling 50 fits
best score 0.8490724708783874
best score {'preprocessor__tfidf__ngram_range': (1, 5), 'preprocessor__tfidf__min_df': 5, 'preprocessor__tfidf__max_df': 0.3, 'my_classifier__subsample': 0.8, 'my_classifier__n_estimators': 450, 'my_classifier__max_depth': 15, 'my_classifier__learning_rate': 0.1, 'my_classifier__gamma': 0}


Observation:
XGBoost did not outperform Logistic Regression (ROC AUC of about 0.85 versus 0.87), but we will try it again with grid search in the next trial.

# Trial 4
We will try grid search this time, over the same parameter space, to see if there is any change in performance.

In [22]:
xgb_grid = GridSearchCV(
    full_piplineXGB, xgb_param, cv=5, verbose=1, n_jobs=-1, 
    scoring='roc_auc')

xgb_grid.fit(X_train, Y_train)

print('best score {}'.format(xgb_grid.best_score_))
print('best score {}'.format(xgb_grid.best_params_))

Fitting 5 folds for each of 32 candidates, totalling 160 fits
best score 0.8501072515409712
best score {'my_classifier__gamma': 0.5, 'my_classifier__learning_rate': 0.1, 'my_classifier__max_depth': 15, 'my_classifier__n_estimators': 450, 'my_classifier__subsample': 0.8, 'preprocessor__tfidf__max_df': 0.3, 'preprocessor__tfidf__min_df': 5, 'preprocessor__tfidf__ngram_range': (1, 5)}


Observation:
Grid search gives a slightly better score (0.850 versus 0.849), but XGBoost does not look like the optimal solution, since the score dropped from about 0.85 in cross-validation to 0.81 on submission.

# Trial 5
For this trial we will try a model we have not used before but that is commonly used for text classification: the Passive Aggressive Classifier.

For its first run we will use grid search.

In [27]:
pac_param = {
    "preprocessor__tfidf__ngram_range": [(1, 5),(1,3)],
    "preprocessor__tfidf__max_df": np.arange(0.3,0.7),  # default step of 1, so only [0.3]
    "preprocessor__tfidf__min_df": np.arange(5,8),  # [5, 6, 7]
    'my_classifier__C': [1.0,1.5,2], # The regularization term C
    'my_classifier__loss': ['hinge', 'squared_hinge'] # PA-I or PA-II
}
pac_grid = GridSearchCV(
    full_piplinePAC, pac_param, cv=5, verbose=1, n_jobs=-1, 
    scoring='roc_auc')

pac_grid.fit(X_train, Y_train)

print('best score {}'.format(pac_grid.best_score_))
print('best score {}'.format(pac_grid.best_params_))

Fitting 5 folds for each of 36 candidates, totalling 180 fits
best score 0.8257529104901618
best score {'my_classifier__C': 1.0, 'my_classifier__loss': 'hinge', 'preprocessor__tfidf__max_df': 0.3, 'preprocessor__tfidf__min_df': 7, 'preprocessor__tfidf__ngram_range': (1, 5)}


Observation: the model gave a ROC AUC of about 0.826, the lowest score of all trials.

# Trial 6
Our final classifier to try is Random Forest. We will use grid search to find a suitable parameter space.

In [38]:
param_rndf = {
     "preprocessor__tfidf__ngram_range": [(1, 5),(1,3)],
    "preprocessor__tfidf__max_df": [0.3],
    "preprocessor__tfidf__min_df": [5],
    "preprocessor__tfidf__analyzer": ['word', 'char', 'char_wb'],
    # "my_classifier__n_estimators" addresses the n_estimators parameter of the "my_classifier" step
    'my_classifier__n_estimators': [700,750,800],  
    'my_classifier__max_depth':[20,25,30],
    'my_classifier__max_features': ['log2'],
    'my_classifier__criterion': ['entropy'], 
    'my_classifier__min_samples_split':[12,13,14]   
}
#rf_grid = GridSearchCV(
    #full_piplineRAN, param_rndf, cv=5, verbose=1, n_jobs=-1, 
    #scoring='roc_auc')

#rf_grid.fit(X_train, Y_train)

#print('best score {}'.format(rf_grid.best_score_))
#print('best score {}'.format(rf_grid.best_params_))

Observation: the best number of estimators was 700, which was the highest value in the parameter space searched in this trial, so the optimum may lie higher. We therefore update the parameter space around the best values found here and use it in the next trial (the grid shown above appears to be this updated version).

# Trial 7 
This time we will use random search with a held-out validation set and the updated parameter space. We also added the vectorizer's analyzer parameter to the search to identify which analyzer is most suitable.

In [40]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import PredefinedSplit

# Further split the original training set to a train and a validation set
X_train2, X_val, y_train2, y_val = train_test_split(
    X_train, Y_train, train_size = 0.8, stratify = Y_train, random_state = 2022)

# Mark each row of X_train: -1 if it belongs to X_train2 (always used for training),
# 0 if it belongs to the validation fold
split_index = [-1 if x in X_train2.index else 0 for x in X_train.index]

# Use the list to create PredefinedSplit
pds = PredefinedSplit(test_fold = split_index)

rd_rand= RandomizedSearchCV(
    full_piplineRAN, param_rndf, cv=pds, verbose=1, n_jobs=-1, 
    # number of random trials
    n_iter=50,
    scoring='roc_auc')
# here we still fit on X_train; the randomized search
# uses our predefined split internally to determine 
# which samples belong to the validation set
rd_rand.fit(X_train, Y_train)

print('best score {}'.format(rd_rand.best_score_))
print('best score {}'.format(rd_rand.best_params_))

Fitting 1 folds for each of 50 candidates, totalling 50 fits
best score 0.8540925294969737
best score {'preprocessor__tfidf__ngram_range': (1, 3), 'preprocessor__tfidf__min_df': 5, 'preprocessor__tfidf__max_df': 0.3, 'preprocessor__tfidf__analyzer': 'word', 'my_classifier__n_estimators': 750, 'my_classifier__min_samples_split': 13, 'my_classifier__max_features': 'log2', 'my_classifier__max_depth': 30, 'my_classifier__criterion': 'entropy'}


Observation: Random Forest has the highest score on submission, and the word analyzer is the best-suited analyzer.
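
The `split_index` list built in Trial 7 can be read as follows: `-1` marks rows that are always used for training, and each non-negative value marks membership in one validation fold. The plain-Python sketch below mimics that reading on ten made-up row indices. `predefined_splits` is a hypothetical helper written for this illustration, not scikit-learn's `PredefinedSplit` itself.

```python
def predefined_splits(test_fold):
    """Yield (train_idx, val_idx) pairs from a fold-assignment list.

    -1 means "never in a validation fold"; every other value names a fold.
    """
    fold_ids = sorted({fold for fold in test_fold if fold != -1})
    for fold in fold_ids:
        val_idx = [i for i, f in enumerate(test_fold) if f == fold]
        train_idx = [i for i, f in enumerate(test_fold) if f != fold]
        yield train_idx, val_idx


# Hypothetical data: ten rows, eight of which were kept for training.
all_index = list(range(10))
train_index = {0, 1, 2, 4, 5, 7, 8, 9}

# Same construction as in Trial 7: -1 for training rows, 0 for validation rows.
split_index = [-1 if i in train_index else 0 for i in all_index]

splits = list(predefined_splits(split_index))
print(split_index)
print(splits)
```

Since only fold `0` is defined, there is exactly one train/validation split, which is why the search log reports "Fitting 1 folds for each of 50 candidates".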

In [41]:
submission = pd.DataFrame()

submission['id'] = df_test['id'].astype(int)

submission['label'] = rd_rand.predict_proba(X_test)[:,1]

submission.to_csv('rd_rand.csv', index=False)

In [None]:
X_train.to_csv('cleaned.csv', index=False)