# Predict-Closed-Question-StackOverFlow

Millions of programmers use Stack Overflow to get high-quality answers to their programming questions every day. We take quality very seriously, and have evolved an effective culture of moderation to safeguard it.

With more than six thousand new questions asked on Stack Overflow every weekday, we're looking to add more sophisticated software solutions to our moderation toolbox.

### Closing Questions

Currently about 6% of all new questions end up "closed". Questions can be closed as off topic, not constructive, not a real question, or too localized. More in depth descriptions of each reason can be found in the Stack Overflow FAQ. The exact duplicate close reason has been excluded from this contest, since it depends on previous questions.

Your goal is to build a classifier that predicts whether or not a question will be closed given the question as submitted, along with the reason that the question was closed. Additional data about the user at question creation time is also available.

# Dataset Link

https://www.kaggle.com/competitions/predict-closed-questions-on-stack-overflow/data

# Dataset Description

The training data contains data through July 31st UTC, and the public leaderboard data goes from August 1 UTC to August 14 UTC.

The train.csv file contains post text and associated metadata at the time of post creation which will serve as inputs to your solution. The state of the post as of July 31st is also included. It contains the following fields (not in this order):

Input:

- PostCreationDate
- OwnerUserId
- OwnerCreationDate
- ReputationAtPostCreation
- OwnerUndeletedAnswerCountAtPostTime
- Title
- BodyMarkdown
- Tag1
- Tag2
- Tag3
- Tag4
- Tag5

Output:

- OpenStatus

Additional Data:

- PostId
- PostClosedDate

The public leaderboard data contains all of the above fields except the target field OpenStatus and PostClosedDate.

The file train-sample.csv is a stratified sample of the training data: it contains every closed question and an equally-sized random sample of the open questions in the training data.
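As a rough illustration of how such a balanced sample could be built, here is a minimal sketch on made-up toy data. The helper name `balanced_sample`, its parameters, and the toy frame are assumptions for illustration only; they are not part of the competition files.

```python
import pandas as pd

def balanced_sample(df, status_col="OpenStatus", open_label="open", seed=0):
    """Keep every closed row plus an equally sized random sample of open rows."""
    closed = df[df[status_col] != open_label]
    open_rows = df[df[status_col] == open_label]
    n = min(len(closed), len(open_rows))
    sampled_open = open_rows.sample(n=n, random_state=seed)
    return pd.concat([closed, sampled_open]).sort_index()

# Toy frame standing in for train.csv (values are invented).
toy = pd.DataFrame({
    "PostId": range(10),
    "OpenStatus": ["open"] * 7 + ["off topic", "too localized", "not constructive"],
})

sample = balanced_sample(toy)
print(sample["OpenStatus"].value_counts())
```

With this construction roughly half of the sampled rows are open, which is worth keeping in mind when reading accuracy figures later on.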

All questions will have a value in Tag1, but Tags 2 through 5 are optional.

To convert the user submitted Markdown found in BodyMarkdown to HTML if desired, our open source implementations in C# and Javascript may be useful.

Additional data can be found in "2012-07 Stack Overflow.7z", which contains an entire public data dump of Stack Overflow. Descriptions of the values can be found in the archive itself as well as on Meta Stack Overflow. This data will not be available as inputs, but may be useful in building your solution. As it is rather large (6GB), you may find it easier to download as a torrent; more details can be found in this forum post.

Only 80,000 randomly sampled rows of train-sample.csv are used, so that the data fits in memory.

# CODE IMPLEMENTATION

Steps involved:

- Step-1: Basic imports
- Step-2: Exploratory Data Analysis
- Step-3: Feature Engineering
- Step-4: Generate Test & Train
- Step-5: Training Model
- Step-6: Model Evaluation

# Step-1:Basic imports

In [2]:
import numpy as np  # Linear algebra.
import pandas as pd  # Data processing, CSV file I/O (e.g. pd.read_csv).
import re  # Regular expressions.
import spacy  # NLP library, used here for tokenization and lemmatization.
import matplotlib.pyplot as plt  # Plotting.


# Step-2:  Exploratory Data Analysis

### Importing & Reading data

In [3]:
data = pd.read_csv("C:\\Users\\Snigdha\\New folder\\Studies\\PROJECTS\\Predict closed questions on Stack Overflow\\train-sample.csv")

### View data

In [4]:
# The head() method returns the first rows of the DataFrame (5 by default).
data.head()

Unnamed: 0,PostId,PostCreationDate,OwnerUserId,OwnerCreationDate,ReputationAtPostCreation,OwnerUndeletedAnswerCountAtPostTime,Title,BodyMarkdown,Tag1,Tag2,Tag3,Tag4,Tag5,PostClosedDate,OpenStatus
0,6046168,05/18/2011 14:14:05,543315,09/17/2010 10:15:06,1,2,For Mongodb is it better to reference an objec...,I am building a corpus of indexed sentences in...,mongodb,,,,,,open
1,4873911,02/02/2011 11:30:10,465076,10/03/2010 09:30:58,192,24,How to insert schemalocation in a xml document...,i create a xml document with JAXP and search a...,dom,xsd,jaxp,,,,open
2,3311559,07/22/2010 17:21:54,406143,07/22/2010 16:58:20,1,0,Too many lookup tables,What are the adverse effects of having too man...,sql-server,database-design,enums,,,,open
3,9990413,04/03/2012 09:18:39,851755,07/19/2011 10:22:40,4,1,What is this PHP code in VB.net,I am looking for the vb.net equivalent of this...,php,vb.net,,,,04/15/2012 21:12:48,too localized
4,10421966,05/02/2012 21:25:01,603588,02/04/2011 18:05:34,334,14,Spring-Data mongodb querying multiple classes ...,"With Spring-Data, you can use the @Document an...",mongodb,spring-data,,,,,open


### Info of the data 

In [5]:
#The info() method prints information about the DataFrame
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140272 entries, 0 to 140271
Data columns (total 15 columns):
 #   Column                               Non-Null Count   Dtype 
---  ------                               --------------   ----- 
 0   PostId                               140272 non-null  int64 
 1   PostCreationDate                     140272 non-null  object
 2   OwnerUserId                          140272 non-null  int64 
 3   OwnerCreationDate                    140272 non-null  object
 4   ReputationAtPostCreation             140272 non-null  int64 
 5   OwnerUndeletedAnswerCountAtPostTime  140272 non-null  int64 
 6   Title                                140272 non-null  object
 7   BodyMarkdown                         140272 non-null  object
 8   Tag1                                 140262 non-null  object
 9   Tag2                                 113021 non-null  object
 10  Tag3                                 75914 non-null   object
 11  Tag4                      

### Shape of the data

In [6]:
#The shape property returns a tuple containing the shape of the DataFrame.
data.shape

(140272, 15)

### nunique() Method

In [7]:
#Return the number of unique values for each column.
data.nunique()

PostId                                 140272
PostCreationDate                       140118
OwnerUserId                             94215
OwnerCreationDate                       94149
ReputationAtPostCreation                 6423
OwnerUndeletedAnswerCountAtPostTime       965
Title                                  140192
BodyMarkdown                           140270
Tag1                                     5209
Tag2                                     9292
Tag3                                    11080
Tag4                                    10027
Tag5                                     7605
PostClosedDate                          70070
OpenStatus                                  5
dtype: int64

### DataFrame.sample

In [8]:
# Draw a random sample of 80,000 rows; random_state makes the draw reproducible.
data = data.sample(80000, random_state = 234)
data.head()

Unnamed: 0,PostId,PostCreationDate,OwnerUserId,OwnerCreationDate,ReputationAtPostCreation,OwnerUndeletedAnswerCountAtPostTime,Title,BodyMarkdown,Tag1,Tag2,Tag3,Tag4,Tag5,PostClosedDate,OpenStatus
122715,1296097,08/18/2009 19:51:02,135646,07/09/2009 13:13:22,146,13,How to serialize Color property as ARGB values?,I'm working with Windows Forms designer. It se...,winforms,windows-form-designer,colors,rgb,serialization,,open
29350,9349765,02/19/2012 13:52:14,1216512,02/17/2012 14:57:14,1,0,Sending large files in java,How to send large files (2-3 GB) over the netw...,networking,,,,,02/20/2012 14:17:34,not a real question
6938,10974978,06/11/2012 05:47:32,748164,05/11/2011 07:12:53,44,1,How to Create personal online radio station?,"I want to create my own radio station, that is...",asp.net,,,,,06/11/2012 07:35:28,off topic
131897,11364721,07/06/2012 15:00:12,623694,02/18/2011 19:35:31,677,32,how to name my android software legally?,Im new to android programming and i found a pr...,android,android-market,google-play,,,07/06/2012 15:54:05,off topic
10420,11076945,06/18/2012 04:31:24,1314162,04/05/2012 01:24:25,1,0,Broke Emacs 24 on Lion 10.7.4,I just managed to break my beloved Emacs on Li...,emacs,osx-lion,macports,ncurses,,06/25/2012 15:57:29,off topic


### Checking for Missing Values 

In [9]:
data.isnull().sum()

PostId                                     0
PostCreationDate                           0
OwnerUserId                                0
OwnerCreationDate                          0
ReputationAtPostCreation                   0
OwnerUndeletedAnswerCountAtPostTime        0
Title                                      0
BodyMarkdown                               0
Tag1                                       6
Tag2                                   15642
Tag3                                   36753
Tag4                                   57231
Tag5                                   71055
PostClosedDate                         40024
OpenStatus                                 0
dtype: int64

In [10]:
data = data[['Title', 'BodyMarkdown', 'OpenStatus']]

In [11]:
data.head()

Unnamed: 0,Title,BodyMarkdown,OpenStatus
122715,How to serialize Color property as ARGB values?,I'm working with Windows Forms designer. It se...,open
29350,Sending large files in java,How to send large files (2-3 GB) over the netw...,not a real question
6938,How to Create personal online radio station?,"I want to create my own radio station, that is...",off topic
131897,how to name my android software legally?,Im new to android programming and i found a pr...,off topic
10420,Broke Emacs 24 on Lion 10.7.4,I just managed to break my beloved Emacs on Li...,off topic


In [12]:
data = data.dropna()

In [13]:
data.isnull().sum()

Title           0
BodyMarkdown    0
OpenStatus      0
dtype: int64

# Step-3:Feature Engineering

In [14]:
from sklearn.preprocessing import LabelEncoder
# Let's convert words to numbers using TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
from keras.layers import Dense, Flatten, Input
from keras.models import Sequential
from keras.callbacks import EarlyStopping, ModelCheckpoint

### Removing digits and punctuation

In [15]:
def remove_num_punc(text):
    # Replace every character that is not an ASCII letter with a space, then lowercase.
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    return text.lower()
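To make the effect of this cleaning step concrete, the sketch below applies the same regular expression to one of the sample titles. The extra whitespace-collapsing step (`normalized`) is an optional addition for illustration and is not part of the notebook's pipeline.

```python
import re

def remove_num_punc(text):
    # Same logic as the cell above: non-letters become spaces, then lowercase.
    return re.sub(r'[^a-zA-Z]', ' ', text).lower()

raw = "Broke Emacs 24 on Lion 10.7.4"
cleaned = remove_num_punc(raw)

# Each removed character leaves a space behind, so the length is unchanged.
# Splitting and re-joining collapses those runs of spaces.
normalized = ' '.join(cleaned.split())

print(repr(cleaned))
print(repr(normalized))
```

Note that digits are discarded along with punctuation, so version numbers such as "24" or "10.7.4" no longer reach the model.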

### Lemmatization

In [16]:
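def lemmitization(text):
    # `text` must be a spaCy Doc (e.g. nlp("some text")), not a plain string:
    # iterating a Doc yields tokens, each exposing its lemma as `lemma_`.
    words = ''
    for word in text:
        words += ' ' + word.lemma_
    return words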

In [17]:
nlp = spacy.load('en_core_web_sm')

### Clean the selected text features (Title and BodyMarkdown)

In [18]:
data['Title'] = data['Title'].apply(remove_num_punc)
data['BodyMarkdown'] = data['BodyMarkdown'].apply(remove_num_punc)

In [19]:
data.head()

Unnamed: 0,Title,BodyMarkdown,OpenStatus
122715,how to serialize color property as argb values,i m working with windows forms designer it se...,open
29350,sending large files in java,how to send large files gb over the netw...,not a real question
6938,how to create personal online radio station,i want to create my own radio station that is...,off topic
131897,how to name my android software legally,im new to android programming and i found a pr...,off topic
10420,broke emacs on lion,i just managed to break my beloved emacs on li...,off topic


### Encode Labels using LabelEncoder

In [20]:
encoder = LabelEncoder()
# data['Tag1'] = encoder.fit_transform(data['Tag1'])
data['OpenStatus'] = encoder.fit_transform(data['OpenStatus'])

In [21]:
data.head()

Unnamed: 0,Title,BodyMarkdown,OpenStatus
122715,how to serialize color property as argb values,i m working with windows forms designer it se...,3
29350,sending large files in java,how to send large files gb over the netw...,0
6938,how to create personal online radio station,i want to create my own radio station that is...,2
131897,how to name my android software legally,im new to android programming and i found a pr...,2
10420,broke emacs on lion,i just managed to break my beloved emacs on li...,2
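The integers in the table come from `LabelEncoder`, which sorts the distinct labels and numbers them in that order. The sketch below, on a small hand-written list of the five statuses, shows how the mapping can be inspected and reversed; the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

statuses = ["open", "not a real question", "off topic",
            "not constructive", "too localized", "open"]

encoder = LabelEncoder()
codes = encoder.fit_transform(statuses)

# classes_ is sorted, so the position of each label is its integer code.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns predicted integers back into status strings.
decoded = list(encoder.inverse_transform([3, 0, 2]))
print(decoded)
```

Keeping the fitted encoder around (or saving `classes_`) makes it possible to report predictions as status names rather than bare integers.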


In [22]:
target = data['OpenStatus']
data = data.drop(['OpenStatus'], axis=1)

In [23]:
data.head()

Unnamed: 0,Title,BodyMarkdown
122715,how to serialize color property as argb values,i m working with windows forms designer it se...
29350,sending large files in java,how to send large files gb over the netw...
6938,how to create personal online radio station,i want to create my own radio station that is...
131897,how to name my android software legally,im new to android programming and i found a pr...
10420,broke emacs on lion,i just managed to break my beloved emacs on li...


# Step-4:Generate train and test

In [24]:
data = data['Title'] + ' '+ data['BodyMarkdown']

In [25]:
from sklearn.model_selection import train_test_split

### x_train, y_train

In [26]:

xtrain, xtest, ytrain, ytest = train_test_split(data, target, test_size = 0.3, random_state = 42)
data = []

In [27]:
xtrain.shape

(56000,)
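The split above is purely random. When some classes are rare, one option is to pass `stratify` so that each class keeps the same share in both parts. The following is a minimal sketch on synthetic labels, not the notebook's actual data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with an uneven class distribution (invented for illustration).
X = np.arange(100).reshape(-1, 1)
y = np.array([3] * 50 + [0] * 20 + [2] * 20 + [4] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Each class appears in the test set in proportion to its overall frequency.
test_counts = np.bincount(y_te, minlength=5)
print(test_counts)
```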

### Let's convert words to numbers using TF-IDF 

In [28]:
vectorizer = TfidfVectorizer(max_features = 1000)  # keep only the 1,000 most frequent terms

xtrain = vectorizer.fit_transform(xtrain).toarray()  # converting words to numbers for train data 
xtest = vectorizer.transform(xtest).toarray()        # converting words to numbers for test data 

In [29]:
xtrain.shape

(56000, 1000)
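Calling `.toarray()` turns the sparse TF-IDF output into a dense matrix, and its size explains why both the row sample and `max_features` are kept small. A back-of-the-envelope calculation, assuming the default `float64` output of `TfidfVectorizer`:

```python
rows = 56_000          # training rows after the 70/30 split
bytes_per_value = 8    # float64

# Compare the 1,000 features used here with a hypothetical 10,000.
for features in (1_000, 10_000):
    dense_bytes = rows * features * bytes_per_value
    print(f"{features:>6} features -> {dense_bytes / 1e9:.2f} GB dense")

dense_1k_bytes = rows * 1_000 * bytes_per_value
```

Estimators that accept sparse input can skip the `.toarray()` call and avoid most of this cost; `GaussianNB`, used below, requires a dense array.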

In [30]:
set(ytrain)

{0, 1, 2, 3, 4}

In [31]:
import pickle
# Save the fitted vectorizer so the same vocabulary can be reused at prediction time.
with open("vectorizer.pickle", "wb") as f:
    pickle.dump(vectorizer, f)

# Step-5:Training Model

## A)Naive Bayes 

In [32]:
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()

In [33]:
clf.fit(xtrain, ytrain)


### Predict

In [34]:
predicted_naive = clf.predict(xtest)

### Metrics

In [35]:
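from sklearn.metrics import accuracy_score, classification_report, confusion_matrix 

# Note: scikit-learn expects (y_true, y_pred). Predictions are passed first here, so
# accuracy is unaffected, but the confusion matrix rows are predictions and the
# precision and recall columns of the report are swapped.
print('Accuracy Score \n',accuracy_score(predicted_naive, ytest))
print('Confusion Matrix \n', confusion_matrix(predicted_naive, ytest))
print('Classification Report \n', classification_report(predicted_naive, ytest))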

Accuracy Score 
 0.369375
Confusion Matrix 
 [[ 967  133  120 1309  110]
 [1289 1855  821 1616  109]
 [1347  478 1797 2590  226]
 [ 658  107  157 3844  209]
 [ 967  104  143 2642  402]]
Classification Report 
               precision    recall  f1-score   support

           0       0.18      0.37      0.25      2639
           1       0.69      0.33      0.44      5690
           2       0.59      0.28      0.38      6438
           3       0.32      0.77      0.45      4975
           4       0.38      0.09      0.15      4258

    accuracy                           0.37     24000
   macro avg       0.43      0.37      0.33     24000
weighted avg       0.48      0.37      0.35     24000
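The metric functions in scikit-learn take `(y_true, y_pred)`, while the cell above passes the predictions first. The toy example below (invented labels) shows what that swap does: the confusion matrix is transposed, and what is reported as recall is really precision, and vice versa.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 1, 2])

cm = confusion_matrix(y_true, y_pred)          # rows = true labels
cm_swapped = confusion_matrix(y_pred, y_true)  # rows = predictions

precision = precision_score(y_true, y_pred, average=None)
recall_when_swapped = recall_score(y_pred, y_true, average=None)

print(cm)
print(cm_swapped)
print(precision, recall_when_swapped)
```

Accuracy is symmetric in its arguments, so the headline accuracy figure is not affected by the order.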



# Accuracy Score: 36.9%

### Saving the model

In [36]:
import pickle
filename = 'g_naive.sav'
with open(filename, 'wb') as f:
    pickle.dump(clf, f)

### Load the model from disk

In [37]:

with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
naive = loaded_model.predict(xtest)
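A saved model is only useful together with the vectorizer that produced its features. The sketch below shows the round trip on a tiny invented corpus, using in-memory `pickle.dumps`/`pickle.loads` instead of files so it runs anywhere; the texts and labels are made up for illustration.

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

# Tiny invented corpus; 3 and 1 stand in for two of the encoded statuses.
texts = ["how to parse json in python", "best laptop for programming",
         "segfault when freeing pointer in c", "which language should i learn"]
labels = [3, 1, 3, 1]

vec = TfidfVectorizer()
clf = GaussianNB().fit(vec.fit_transform(texts).toarray(), labels)

# Serialize and restore both objects, as the notebook does with files.
vec2 = pickle.loads(pickle.dumps(vec))
clf2 = pickle.loads(pickle.dumps(clf))

# New text must go through the restored vectorizer before prediction.
new_question = ["how to read a json file in python"]
prediction = clf2.predict(vec2.transform(new_question).toarray())
print(prediction)

original_preds = clf.predict(vec.transform(texts).toarray())
restored_preds = clf2.predict(vec2.transform(texts).toarray())
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.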

## B)MLPClassifier

In [38]:
from sklearn.neural_network import MLPClassifier

mlp_cv=MLPClassifier(early_stopping=True, verbose=2)
mlp_cv.fit(xtrain, ytrain)

Iteration 1, loss = 1.12605801
Validation score: 0.625179
Iteration 2, loss = 0.95931869
Validation score: 0.629464
Iteration 3, loss = 0.93611904
Validation score: 0.632500
Iteration 4, loss = 0.92413648
Validation score: 0.629464
Iteration 5, loss = 0.91371776
Validation score: 0.629821
Iteration 6, loss = 0.90329094
Validation score: 0.631607
Iteration 7, loss = 0.89224038
Validation score: 0.634286
Iteration 8, loss = 0.88095649
Validation score: 0.629821
Iteration 9, loss = 0.86942608
Validation score: 0.633036
Iteration 10, loss = 0.85770150
Validation score: 0.631429
Iteration 11, loss = 0.84530667
Validation score: 0.632500
Iteration 12, loss = 0.83303977
Validation score: 0.633929
Iteration 13, loss = 0.82093376
Validation score: 0.636250
Iteration 14, loss = 0.80820280
Validation score: 0.625714
Iteration 15, loss = 0.79590109
Validation score: 0.631607
Iteration 16, loss = 0.78283875
Validation score: 0.626786
Iteration 17, loss = 0.77017199
Validation score: 0.623214
Iterat

### Predict

In [39]:
predicted_mlp = mlp_cv.predict(xtest)

### Metrics

In [40]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix 

def metrics(predicted): 
    # Predictions are passed first (scikit-learn expects y_true first), as in the
    # Naive Bayes cell, so precision and recall appear swapped in the report.
    print('Accuracy Score \n',accuracy_score(predicted, ytest))
    print('Confusion Matrix \n', confusion_matrix(predicted, ytest))
    print('Classification Report \n', classification_report(predicted, ytest))

In [41]:
metrics(predicted_mlp)

Accuracy Score 
 0.6255
Confusion Matrix 
 [[ 2098   302   326  1001   180]
 [  368  1474   394   380    29]
 [  358   341  1365   552    66]
 [ 2362   556   948 10039   745]
 [   42     4     5    29    36]]
Classification Report 
               precision    recall  f1-score   support

           0       0.40      0.54      0.46      3907
           1       0.55      0.56      0.55      2645
           2       0.45      0.51      0.48      2682
           3       0.84      0.69      0.75     14650
           4       0.03      0.31      0.06       116

    accuracy                           0.63     24000
   macro avg       0.45      0.52      0.46     24000
weighted avg       0.69      0.63      0.65     24000
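Because the metrics cell passes predictions as the first argument, the rows of the printed matrix are predicted classes and the columns are true classes. The sketch below re-reads the printed numbers under that assumption to recover the true class counts, per-class recall, and a majority-class baseline. Class 3 is taken to be "open", following the alphabetical order that `LabelEncoder` uses.

```python
import numpy as np

# Confusion matrix exactly as printed above (rows = predicted, columns = true).
printed = np.array([
    [2098,  302,  326,  1001, 180],
    [ 368, 1474,  394,   380,  29],
    [ 358,  341, 1365,   552,  66],
    [2362,  556,  948, 10039, 745],
    [  42,    4,    5,    29,  36],
])

true_counts = printed.sum(axis=0)                 # questions per true class
recall_per_class = printed.diagonal() / true_counts
accuracy = printed.trace() / printed.sum()
majority_baseline = true_counts.max() / printed.sum()

print("true class counts:", true_counts)
print("recall per true class:", recall_per_class.round(2))
print("accuracy:", round(accuracy, 4))
print("majority-class baseline:", round(majority_baseline, 4))
```

Comparing the accuracy with the majority-class baseline gives a fairer sense of how much the model has learned than the accuracy figure alone.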



# Accuracy Score: 62.5%

### Saving the model

In [42]:
import pickle
filename = 'mlp.sav'
with open(filename, 'wb') as f:
    pickle.dump(mlp_cv, f)

### Load the model from disk

In [43]:
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
predicted_mlp2 = loaded_model.predict(xtest)


## C)RandomForest

In [44]:
from sklearn.ensemble import RandomForestClassifier

### Train Random Forest model

In [45]:
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(xtrain, ytrain)


### Predict

In [46]:
rf_pred = rf.predict(xtest)


### Accuracy

In [47]:
rf_acc = accuracy_score(ytest, rf_pred)

In [48]:
print('Random Forest accuracy:', rf_acc)

Random Forest accuracy: 0.578


# Accuracy Score: 57.8%

In [49]:
import pickle
filename = 'rf.sav'
with open(filename, 'wb') as f:
    pickle.dump(rf, f)

In [50]:
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
RandomForest = loaded_model.predict(xtest)

## D)ANN

In [51]:
ann = Sequential()


In [52]:
ann.add(Input(shape=(xtrain.shape[1],)))
# Add the hidden layers
ann.add(Dense(1024, activation='relu'))
ann.add(Dense(1024, activation='relu'))
ann.add(Dense(512, activation='relu'))
ann.add(Dense(512, activation='relu'))
ann.add(Dense(256, activation='relu'))
ann.add(Dense(128, activation='relu'))
ann.add(Dense(64, activation='relu'))
ann.add(Dense(32, activation='relu'))
ann.add(Dense(16, activation='relu'))
# Add an output layer with one unit per class (5 statuses).
# Softmax normalizes the outputs into a probability distribution of
# K probabilities proportional to the exponentials of the inputs.
ann.add(Dense(5, activation='softmax'))
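For reference, the softmax used in the output layer can be written in a few lines of NumPy. This is a standalone sketch with made-up logits, independent of Keras; subtracting the maximum is a common trick for numerical stability and does not change the result.

```python
import numpy as np

def softmax(logits):
    # Shift by the max so np.exp never overflows; the ratios are unchanged.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Five made-up scores, one per class.
logits = np.array([2.0, 1.0, 0.1, -1.0, 0.5])
probs = softmax(logits)

print(probs.round(3))
print("predicted class:", int(probs.argmax()))
```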


In [53]:
ann.summary()


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 1024)              1025024   
                                                                 
 dense_1 (Dense)             (None, 1024)              1049600   
                                                                 
 dense_2 (Dense)             (None, 512)               524800    
                                                                 
 dense_3 (Dense)             (None, 512)               262656    
                                                                 
 dense_4 (Dense)             (None, 256)               131328    
                                                                 
 dense_5 (Dense)             (None, 128)               32896     
                                                                 
 dense_6 (Dense)             (None, 64)                8

In [54]:
ann.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])



In [55]:
# Note: early_stopping is created here but not passed to fit() below, so it has no effect.
early_stopping = EarlyStopping(patience=10, verbose=1)
model_checkpoint = ModelCheckpoint("ann_model.h5", save_best_only=True, verbose=1)



In [56]:
ann.fit(xtrain, ytrain, epochs=5, batch_size=32, validation_data=(xtest, ytest), callbacks=[model_checkpoint])

Epoch 1/5
Epoch 1: val_loss improved from inf to 1.02774, saving model to ann_model.h5
Epoch 2/5
Epoch 2: val_loss improved from 1.02774 to 0.99496, saving model to ann_model.h5
Epoch 3/5
Epoch 3: val_loss did not improve from 0.99496
Epoch 4/5
Epoch 4: val_loss did not improve from 0.99496
Epoch 5/5
Epoch 5: val_loss did not improve from 0.99496


<keras.callbacks.History at 0x1c92e8ba800>
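The log shows validation loss improving for two epochs and then stalling. The sketch below mimics the patience rule that an early-stopping callback applies, in plain Python. The first two losses are taken from the log above; the last three are hypothetical placeholders, since the log only says they did not improve.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0   # improvement: reset the counter
        else:
            wait += 1              # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Epochs 1-2 from the log; epochs 3-5 are invented values above 0.99496.
val_losses = [1.02774, 0.99496, 1.00, 1.01, 1.02]

print(early_stopping_epoch(val_losses, patience=10))  # never triggers in 5 epochs
print(early_stopping_epoch(val_losses, patience=3))   # stops at epoch 5
```

With only 5 epochs, a patience of 10 can never trigger, so a smaller patience or a larger epoch budget would be needed for early stopping to matter.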

In [57]:
ann.save_weights('stackoverflow_weights.h5')

In [58]:
score = ann.evaluate(xtest, ytest, verbose=0)
print("accuracy ANN", score[1] * 100)

accuracy ANN 59.10000205039978


# Accuracy Score: 59.1%

## E)CNN

### Import necessary libraries:

In [59]:

import numpy as np
import pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

### Load the data:

In [60]:
data = pd.read_csv("C:\\Users\\Snigdha\\New folder\\Studies\\PROJECTS\\Predict closed questions on Stack Overflow\\train-sample.csv")

In [61]:
text_data = data['BodyMarkdown'].values

In [62]:
labels = data['OpenStatus'].values

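### Encode the labels as integers

In [63]:
# LabelEncoder maps the five statuses to the integers 0-4.
# Despite the variable name, the result is not binary.
encoder = LabelEncoder()
binary_labels = encoder.fit_transform(labels)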
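`LabelEncoder` assigns one integer per distinct status, so five statuses become 0 to 4 rather than 0/1. If a true open-versus-closed target is wanted, one simple option is a direct comparison against the "open" label, sketched here on a few invented values:

```python
import numpy as np

# A few example statuses (invented ordering, real label names).
labels = np.array(["open", "off topic", "open",
                   "too localized", "not a real question"])

# 1 = closed for any reason, 0 = still open.
is_closed = (labels != "open").astype(int)
print(is_closed)
```

A target built this way matches a single sigmoid output with `binary_crossentropy`; the five-class encoding instead calls for a softmax output.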

### Split data into training and testing sets

In [64]:
x_train, x_test, y_train, y_test = train_test_split(text_data, binary_labels, test_size=0.3, random_state=42)

### Preprocess text data

In [65]:
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(x_train)
sequences_train = tokenizer.texts_to_sequences(x_train)
sequences_test = tokenizer.texts_to_sequences(x_test)
word_index = tokenizer.word_index
# Pad every sequence to the length of the longest question in the data.
max_len = max(len(s) for s in sequences_train + sequences_test)
x_train = pad_sequences(sequences_train, maxlen=max_len)
x_test = pad_sequences(sequences_test, maxlen=max_len)
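Padding everything to the single longest question makes each input as long as the worst outlier. A common alternative is to cap the length and truncate the few long sequences. The NumPy sketch below is a stand-in that mirrors the default "pre" padding and "pre" truncation of `pad_sequences`; the helper name `pad_or_truncate` and the cap value are assumptions for illustration.

```python
import numpy as np

def pad_or_truncate(seqs, maxlen):
    """Left-pad with zeros; for long sequences keep only the last `maxlen` tokens."""
    out = np.zeros((len(seqs), maxlen), dtype="int32")
    for i, seq in enumerate(seqs):
        kept = seq[-maxlen:]             # drop tokens from the front if too long
        if kept:
            out[i, -len(kept):] = kept   # zeros stay on the left
    return out

sequences = [[1, 2, 3], [4], [5, 6, 7, 8, 9, 10]]
lengths = [len(s) for s in sequences]

# Toy cap; on real data something like int(np.percentile(lengths, 95)) is one option.
cap = 4
padded = pad_or_truncate(sequences, cap)
print(padded)
```

A shorter padded length reduces both the memory needed for the input arrays and the amount of computation in the convolution layer.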

### Build the CNN model

In [66]:
embedding_dim = 100

ann = Sequential()
ann.add(Embedding(len(word_index) + 1, embedding_dim, input_length=max_len))
ann.add(Conv1D(128, 5, activation='relu'))
ann.add(GlobalMaxPooling1D())
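# Caution: the labels take five values (0-4), but a single sigmoid unit with
# binary_crossentropy models only two classes. A 5-unit softmax output with
# sparse_categorical_crossentropy (as in the ANN above) would match the labels.
ann.add(Dense(1, activation='sigmoid'))
ann.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])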

In [67]:
ann.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 11305, 100)        30685200  
                                                                 
 conv1d (Conv1D)             (None, 11301, 128)        64128     
                                                                 
 global_max_pooling1d (Globa  (None, 128)              0         
 lMaxPooling1D)                                                  
                                                                 
 dense_10 (Dense)            (None, 1)                 129       
                                                                 
Total params: 30,749,457
Trainable params: 30,749,457
Non-trainable params: 0
_________________________________________________________________


### Train the model

In [68]:
history = ann.fit(x_train, y_train, epochs=3, batch_size=32, validation_data=(x_test, y_test))

Epoch 1/3
Epoch 2/3
Epoch 3/3


### Evaluate the model:

In [82]:
loss, accuracy = ann.evaluate(x_test, y_test)
print('Test accuracy:', accuracy)

Test accuracy: 0.11071241647005081


# Accuracy Score: 11.07%

# Model Evaluation

1) MLP Classifier: accuracy = 62.5%
2) Naive Bayes: accuracy = 36.9%
3) ANN: accuracy = 59.1%
4) CNN: accuracy = 11.07%
5) Random Forest: accuracy = 57.8%
6) LSTM: accuracy = 11.07%

# Conclusion

In this notebook, I applied several machine learning and deep learning algorithms to the dataset above: MLP, Naive Bayes, ANN, Random Forest, CNN, and LSTM. Performance could be improved further by:

-> Using other deep learning architectures such as GRU and BERT, which are well suited to text classification problems.
-> Using pretrained word embeddings such as GloVe, or contextual embeddings from BERT.