Problem Definition
---
I think one of the most important steps when starting a new machine learning project is defining the problem, which means understanding the underlying business problem (problem formalization).

> We will be predicting whether a question asked on Quora is sincere or not

Source : https://www.kaggle.com/mjbahmani/a-data-science-framework-for-quora

Data Source : https://www.kaggle.com/c/quora-insincere-questions-classification/data

**About Quora**

Quora is a platform that empowers people to learn from each other. On Quora, people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge is to weed out insincere questions -- those founded upon false premises, or that intend to make a statement rather than look for helpful answers.

**Business View**

An existential problem for any major website today is how to handle toxic and divisive content. Quora wants to tackle this problem head-on to keep their platform a place where users can feel safe sharing their knowledge with the world.

**What is an insincere question?**

An insincere question is defined as a question intended to make a statement rather than look for helpful answers.

![Quora_moderation_warning](images/Quora_moderation_warning.png)

**Feature Set**

We use train.csv and test.csv as input and upload a submission.csv as output.

The training set contains the following 3 columns (for supervised learning):
1. qid - unique question identifier
2. question_text - Quora question text
3. target - a question labeled "insincere" has a value of 1, otherwise 0

**Coding a solution to the above problem**

In [0]:
import pandas as pd
import numpy as np
import itertools
from time import time

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import f1_score, accuracy_score, recall_score, make_scorer

from nltk.stem.snowball import SnowballStemmer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

In [0]:
from google.colab import files
uploaded = files.upload()



Saving train.csv to train.csv


In [0]:
from google.colab import files
uploaded = files.upload()

Saving test.csv to test.csv


In [0]:
data_train = pd.read_csv("train.csv")
data_test = pd.read_csv("test.csv")

In [0]:
# Check out data
print(f"Observations in training / test data: {len(data_train)} / {len(data_test)}")
print("Number of sincere / insincere questions in training data:",
      f"{len(data_train.loc[data_train.target == 0])} / {len(data_train.loc[data_train.target == 1])}")
data_train.head()

Observations in training / test data: 1306122 / 375806
Number of sincere / insincere questions in training data: 1225312 / 80810


| | qid | question_text | target |
|---|---|---|---|
| 0 | 00002165364db923c7e6 | How did Quebec nationalists see their province... | 0 |
| 1 | 000032939017120e6e44 | Do you have an adopted dog, how would you enco... | 0 |
| 2 | 0000412ca6e4628ce2cf | Why does velocity affect time? Does velocity a... | 0 |
| 3 | 000042bf85aa498cd78e | How did Otto von Guericke used the Magdeburg h... | 0 |
| 4 | 0000455dfa3e01eae3af | Can I convert montra helicon D to a mountain b... | 0 |
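
The counts printed above show a strongly imbalanced target. As a minimal sketch (the variable names are illustrative and the counts are copied from the output above), the share of insincere questions and the accuracy of a trivial classifier that always predicts "sincere" can be computed as follows:

```python
# Class counts copied from the notebook output above
n_sincere = 1225312
n_insincere = 80810
n_total = n_sincere + n_insincere

# Share of the positive (insincere) class
insincere_share = n_insincere / n_total

# Accuracy of a hypothetical baseline that labels every question as sincere
baseline_accuracy = n_sincere / n_total

print(f"Insincere share: {insincere_share:.4f}")
print(f"Always-sincere baseline accuracy: {baseline_accuracy:.4f}")
```

Such a baseline already reaches roughly 94% accuracy while finding no insincere questions at all, which is why the cells below compare models by F1-score on the insincere class rather than by accuracy.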


In [0]:
# Split training data into training and validation set

X_train, X_valid, y_train, y_valid = train_test_split(data_train["question_text"], data_train["target"],
                                                    test_size = 0.2, random_state = 19)

print(f"Observations in training / validation data: {len(X_train)} / {len(X_valid)}")

Observations in training / validation data: 1044897 / 261225


In [0]:
# Transform text into the bag-of-words representation with CountVectorizer's default options
vect = CountVectorizer().fit(X_train) 
X_train_vect = vect.transform(X_train) 
X_valid_vect = vect.transform(X_valid) 
print("Number of unique words/features:", X_train_vect.shape[1])

# Train and evaluate the model with MultinomialNB's default options
model = MultinomialNB()
model.fit(X_train_vect, y_train)
y_valid_pred = model.predict(X_valid_vect)
print("F1-Score: {:.4f}".format(f1_score(y_valid, y_valid_pred)))

Number of unique words/features: 173196
F1-Score: 0.5612
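
To make the bag-of-words step more concrete, here is a small sketch on two made-up questions (not taken from the data set). It only uses `CountVectorizer` with its default options, as in the cell above; the `toy_` names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up questions, used only for illustration
toy_questions = ["How does photosynthesis work?", "How does a plant work?"]

toy_vect = CountVectorizer().fit(toy_questions)
toy_counts = toy_vect.transform(toy_questions)

# The vocabulary maps each lowercased token to a column index;
# the default token pattern drops one-character tokens such as "a"
toy_vocab = sorted(toy_vect.vocabulary_)
print(toy_vocab)

# One row per question, one column per vocabulary entry, values are word counts
print(toy_counts.toarray())
```

Each question becomes a sparse row of word counts, and word order is discarded. This is the representation that `MultinomialNB` is trained on.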


In [0]:
# Define parameter combinations to iterate over
stemmer = SnowballStemmer("english") 
analyzer = CountVectorizer().build_analyzer()
def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))
analyzer_i = ["word", stemmed_words]
min_df_i = [1, 3, 5, 10]
ngram_range_i = [(1,1), (1,2)]
combinations = list(itertools.product(min_df_i, ngram_range_i, analyzer_i))

print("Number of combinations to iterate over:", len(combinations))

Number of combinations to iterate over: 16


In [0]:
# Transform text and train the basic Naive Bayes model
start = time()
for i in combinations:
    
    vect = CountVectorizer(min_df = i[0], ngram_range = i[1], analyzer = i[2], strip_accents = "unicode").fit(X_train)
    X_train_vect = vect.transform(X_train)
    X_valid_vect = vect.transform(X_valid)
    
    model = MultinomialNB()
    model.fit(X_train_vect, y_train)
    y_valid_pred = model.predict(X_valid_vect)
    
    print(f"min_df: {i[0]}; ngram_range: {i[1]}; analyzer: {i[2]}; number of features: {X_train_vect.shape[1]}")
    print("F1-Score: {:.4f}".format(f1_score(y_valid, y_valid_pred)))
    
end = time()
print("Training time: {:.1f} seconds".format(end-start))

min_df: 1; ngram_range: (1, 1); analyzer: word; number of features: 172561
F1-Score: 0.5611
min_df: 1; ngram_range: (1, 1); analyzer: <function stemmed_words at 0x7feb21972ae8>; number of features: 135436
F1-Score: 0.5555
min_df: 1; ngram_range: (1, 2); analyzer: word; number of features: 2735085
F1-Score: 0.5187




min_df: 1; ngram_range: (1, 2); analyzer: <function stemmed_words at 0x7feb21972ae8>; number of features: 135436
F1-Score: 0.5555
min_df: 3; ngram_range: (1, 1); analyzer: word; number of features: 60985
F1-Score: 0.5431
min_df: 3; ngram_range: (1, 1); analyzer: <function stemmed_words at 0x7feb21972ae8>; number of features: 44127
F1-Score: 0.5387
min_df: 3; ngram_range: (1, 2); analyzer: word; number of features: 507105
F1-Score: 0.5307




min_df: 3; ngram_range: (1, 2); analyzer: <function stemmed_words at 0x7feb21972ae8>; number of features: 44127
F1-Score: 0.5387
min_df: 5; ngram_range: (1, 1); analyzer: word; number of features: 44804
F1-Score: 0.5413
min_df: 5; ngram_range: (1, 1); analyzer: <function stemmed_words at 0x7feb21972ae8>; number of features: 31792
F1-Score: 0.5375
min_df: 5; ngram_range: (1, 2); analyzer: word; number of features: 296067
F1-Score: 0.5107




min_df: 5; ngram_range: (1, 2); analyzer: <function stemmed_words at 0x7feb21972ae8>; number of features: 31792
F1-Score: 0.5375
min_df: 10; ngram_range: (1, 1); analyzer: word; number of features: 30611
F1-Score: 0.5391
min_df: 10; ngram_range: (1, 1); analyzer: <function stemmed_words at 0x7feb21972ae8>; number of features: 21487
F1-Score: 0.5365
min_df: 10; ngram_range: (1, 2); analyzer: word; number of features: 152080
F1-Score: 0.4979




min_df: 10; ngram_range: (1, 2); analyzer: <function stemmed_words at 0x7feb21972ae8>; number of features: 21487
F1-Score: 0.5365
Training time: 3141.0 seconds
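
In the output above, every run with the stemming analyzer reports the same number of features and the same F1-score for `ngram_range` (1, 1) and (1, 2). This matches scikit-learn's documented behavior: `ngram_range` only applies when `analyzer` is not a callable. The sketch below reproduces this on two made-up documents and shows one possible workaround, passing the stemming step as `tokenizer` instead. `toy_stem` is a hypothetical stand-in for `SnowballStemmer`, used so that the example has no NLTK dependency.

```python
from sklearn.feature_extraction.text import CountVectorizer

def toy_stem(word):
    # Hypothetical stand-in for stemmer.stem(): strips a trailing "s"
    return word[:-1] if word.endswith("s") else word

docs = ["cats chase dogs", "dogs chase cats"]  # made-up documents

# Callable analyzer, as in the loop above: ngram_range is ignored
base_analyzer = CountVectorizer().build_analyzer()
def stemmed_words(doc):
    return [toy_stem(w) for w in base_analyzer(doc)]

uni = CountVectorizer(analyzer=stemmed_words, ngram_range=(1, 1)).fit(docs)
bi = CountVectorizer(analyzer=stemmed_words, ngram_range=(1, 2)).fit(docs)

# Possible workaround: stem inside a tokenizer so that the built-in
# "word" analyzer still builds the n-grams from the stemmed tokens
base_tokenizer = CountVectorizer().build_tokenizer()
def stemmed_tokens(doc):
    return [toy_stem(t) for t in base_tokenizer(doc)]

# token_pattern=None because the pattern is unused once a tokenizer is given
bi_stemmed = CountVectorizer(tokenizer=stemmed_tokens, token_pattern=None,
                             ngram_range=(1, 2)).fit(docs)

print(len(uni.vocabulary_), len(bi.vocabulary_), len(bi_stemmed.vocabulary_))
```

With the callable analyzer, both vectorizers end up with the same unigram vocabulary, whereas the tokenizer-based variant also contains stemmed bigrams. Under this reading, the 16 combinations above effectively cover only 12 distinct configurations.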


In [0]:
# Vectorize data with tf-idf
vect = TfidfVectorizer(min_df = 10, strip_accents = "unicode").fit(X_train)
X_train_vect = vect.transform(X_train)
X_valid_vect = vect.transform(X_valid)
print("Number of features:", X_train_vect.shape[1])

# Train and evaluate the model with MultinomialNB's default options
model = MultinomialNB()
model.fit(X_train_vect, y_train)
y_valid_pred = model.predict(X_valid_vect)
print("F1-Score: {:.4f}".format(f1_score(y_valid, y_valid_pred)))

Number of features: 30611
F1-Score: 0.3909


In [0]:
# Split training data into training and validation set
X_train, X_valid, y_train, y_valid = train_test_split(data_train["question_text"], data_train["target"],
                                                    test_size = 0.2, random_state = 19)
print(f"Observations in training / validation data: {len(X_train)} / {len(X_valid)}")

vect = CountVectorizer(min_df = 10, strip_accents = "unicode").fit(X_train)
X_train_vect = vect.transform(X_train)
X_valid_vect = vect.transform(X_valid)
print(f"Number of features: {X_train_vect.shape[1]}")

Observations in training / validation data: 1044897 / 261225
Number of features: 30611


In [0]:
def train_validate(model, grid):
    
    start = time()
    
    scorer = make_scorer(f1_score)

    grid_obj = GridSearchCV(model, grid, scoring = scorer, cv = 5, n_jobs = -1)  # 5-fold CV over the grid passed in
    grid_fit = grid_obj.fit(X_train_vect, y_train)
    best_model = grid_fit.best_estimator_
    best_model.fit(X_train_vect, y_train)
    y_valid_pred = best_model.predict(X_valid_vect)
    
    end = time()
 
    print("Best model:", best_model)
    print("Training time: {:.1f} seconds".format(end-start))
    print("F1-Score: {:.4f};".format(f1_score(y_valid, y_valid_pred)),
          "Accuracy: {:.4};".format(accuracy_score(y_valid, y_valid_pred)),
          "Recall: {:.4}".format(recall_score(y_valid, y_valid_pred)))

In [0]:
parameters = {"alpha": list(np.arange(0, 11))}

train_validate(model = MultinomialNB(), grid = parameters)

Best model: MultinomialNB(alpha=6, class_prior=None, fit_prior=True)
Training time: 19.1 seconds
F1-Score: 0.5511; Accuracy: 0.9287; Recall: 0.7091


In [0]:
parameters = {"alpha": list(np.arange(5, 7, 0.1))}

train_validate(model = MultinomialNB(), grid = parameters)

Best model: MultinomialNB(alpha=5.9999999999999964, class_prior=None, fit_prior=True)
Training time: 32.0 seconds
F1-Score: 0.5511; Accuracy: 0.9287; Recall: 0.7091
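
The reported `alpha=5.9999999999999964` is a floating-point artifact of `np.arange` with a non-integer step, and is effectively the same value as `alpha=6` from the previous search. One way to get clean grid values, sketched here with an illustrative variable name, is to build the grid with `np.linspace` and round it:

```python
import numpy as np

# 20 evenly spaced values from 5.0 to 6.9, rounded to one decimal place
alpha_grid = np.round(np.linspace(5.0, 6.9, 20), 1).tolist()
parameters = {"alpha": alpha_grid}
print(alpha_grid)
```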


In [0]:
parameters = {"C": [0.1, 1, 10]}

train_validate(model = LogisticRegression(solver = "liblinear"), grid = parameters)



Best model: LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)
Training time: 1165.9 seconds
F1-Score: 0.5500; Accuracy: 0.9529; Recall: 0.4669
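
The cells above report F1 and recall but not precision. Since F1 is the harmonic mean of precision and recall, `F1 = 2 * P * R / (P + R)`, precision can be recovered as `P = F1 * R / (2 * R - F1)`. The sketch below applies this to the rounded scores printed above, so the results are approximate; the helper name is illustrative.

```python
def precision_from_f1_recall(f1, recall):
    # Invert F1 = 2 * P * R / (P + R) for P
    return f1 * recall / (2 * recall - f1)

# Rounded validation scores copied from the outputs above
nb_precision = precision_from_f1_recall(f1=0.5511, recall=0.7091)
lr_precision = precision_from_f1_recall(f1=0.5500, recall=0.4669)

print(f"Naive Bayes precision (approx.): {nb_precision:.2f}")
print(f"Logistic regression precision (approx.): {lr_precision:.2f}")
```

On these numbers, Naive Bayes comes out at a precision of roughly 0.45 and logistic regression at roughly 0.67. The two models reach a similar F1 through opposite trade-offs: Naive Bayes favors recall, logistic regression favors precision.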


In [0]:
parameters = {"C": [3, 4, 5, 6, 7]}

train_validate(model = LogisticRegression(solver = "liblinear"), grid = parameters)


In [0]:
parameters = {"C": [2, 2.5, 3, 3.5]}

train_validate(model = LogisticRegression(solver = "liblinear"), grid = parameters)


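In [0]:
# Retrain on the full training data for the interactive predictions below
X_train = data_train["question_text"]
y_train = data_train["target"]
# Note: alpha = 6 was tuned with min_df = 10 above, while this cell uses min_df = 3
vect = CountVectorizer(min_df = 3, strip_accents = "unicode").fit(X_train)
X_train_vect = vect.transform(X_train)
# X_valid is now part of the training data, so it no longer gives an unbiased score
X_valid_vect = vect.transform(X_valid)
model = MultinomialNB(alpha = 6)
model.fit(X_train_vect, y_train)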

MultinomialNB(alpha=6, class_prior=None, fit_prior=True)

In [0]:
def predict(question):
    question_vect = vect.transform([question])
    predicted_label = model.predict(question_vect)   
    a = ("sincere" if predicted_label[0] == 0 else "insincere")
    print(f"Question: {question} -- Predicted label: {predicted_label[0]} / {a}")

In [0]:
predict(question = "why do you think are women stupid")

Question: why do you think are women stupid -- Predicted label: 1 / insincere


In [0]:
predict(question = "How does photosynthesis work?")

Question: How does photosynthesis work? -- Predicted label: 0 / sincere


In [0]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# load data 
data_train = pd.read_csv("train.csv") # update to kaggle directory in cloud kernel
data_test = pd.read_csv("test.csv") # update to kaggle directory in cloud kernel

# transform data
X_train = data_train["question_text"]
y_train = data_train["target"]
X_test = data_test["question_text"]
vect = CountVectorizer(min_df = 10, strip_accents = "unicode").fit(X_train) 
X_train_vect = vect.transform(X_train) 
X_test_vect = vect.transform(X_test) 

# train model
model = MultinomialNB(alpha = 6)
model.fit(X_train_vect, y_train)

# predict labels
y_test_pred = model.predict(X_test_vect)

# create output file
submission = pd.DataFrame({"qid": data_test["qid"], "prediction": y_test_pred}, columns = ["qid", "prediction"])
submission.to_csv("submission.csv", index = False)