<h2><u>STEP: 01</u></h2>
<p>As the first step, download the "20 Newsgroups" dataset from http://qwone.com/~jason/20Newsgroups/.</p>

<h2><u>STEP: 02</u></h2>
<p>As the second step, create four main classes (comp, rec, sci and talk) by combining the documents from the matching subclasses in both the training and testing folders.</p>

<p>Then import all the required modules from <b>os</b>, <b>pandas</b>, <b>sklearn</b> and <b>nltk</b>.</p>

In [1]:
import os
import pandas as pd
from string import punctuation


# scikit-learn imports
# CountVectorizer: convert a collection of text documents to a matrix of token counts
from sklearn.feature_extraction.text import CountVectorizer
# train_test_split: split arrays or matrices into random train and test subsets
from sklearn.model_selection import train_test_split


# Natural Language Toolkit (NLTK) imports
# NLTK data (stopwords, wordnet, punkt) must be available locally; fetch it once with nltk.download()
import nltk
# Stop words such as 'the', 'is' and 'are' can be filtered out of the text; NLTK ships a list of them.
from nltk.corpus import stopwords
# The Treebank tokenizer treats most punctuation characters as separate tokens and splits off
# standard contractions, e.g. "don't go" -> ['do', "n't", 'go'].
# The Treebank detokenizer reverses the regex operations of the Treebank tokenizer.
from nltk.tokenize.treebank import TreebankWordDetokenizer, TreebankWordTokenizer
# Stemmers strip suffixes algorithmically, e.g. 'getting' -> 'get'. The result need not be a
# dictionary word, and stemming also works on non-words, e.g. 'xyzing' -> 'xyze'.
from nltk.stem.snowball import EnglishStemmer
# The lemmatizer looks words up in WordNet and works best when told the part of speech.
# A word it cannot find, e.g. 'xyzing', is returned unchanged.
from nltk.stem.wordnet import WordNetLemmatizer
# WordNet: look up word meanings, synonyms, antonyms and more.
from nltk.corpus import wordnet as wn
# word_tokenize: split a string into word and punctuation tokens with NLTK's recommended tokenizer
from nltk.tokenize import word_tokenize


<p>The method os.walk() generates the file names in a directory tree by walking the tree either top-down or bottom-up. For each directory it visits, it yields a 3-tuple:</p>
<ul>
    <li><b>roots</b> : The path of the directory currently being visited, starting with the directory you specified.</li>
    <li><b>dirs</b> : The names of the sub-directories in that directory.</li>
    <li><b>files</b> : The names of the files (not directories) in that directory.</li>
</ul>
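
<p>As a small illustration of the walk described above, the sketch below builds a throwaway directory tree and lists every file in it. The helper name <b>list_files</b> and the file names are made up for this example and are not part of the notebook.</p>

```python
import os
import tempfile


def list_files(top):
    """Return the path of every file below `top`, sorted for a stable order."""
    paths = []
    # os.walk yields (current directory path, sub-directory names, file names)
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)


# Build a tiny temporary tree: <top>/comp/a.txt and <top>/rec/b.txt (hypothetical names)
with tempfile.TemporaryDirectory() as top:
    for sub, name in [("comp", "a.txt"), ("rec", "b.txt")]:
        os.makedirs(os.path.join(top, sub))
        with open(os.path.join(top, sub, name), "w") as fh:
            fh.write("hello")
    # Store paths relative to the top directory so the result is reproducible
    found = [os.path.relpath(p, top) for p in list_files(top)]

print(found)
```

<p>Joining the directory path with each file name is what turns the bare names in <b>files</b> into paths that can be opened later.</p>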

In [2]:
# method to collect the paths of all files below a directory into a list
def read_class_files(directory):
    fl = []
    # roots is the directory being visited, files the file names inside it
    for roots, dirs, files in os.walk(directory):
        for name in files:
            fl.append(os.path.join(roots, name))
    return fl

In [3]:
text_files = read_class_files('./')

<p><b>Each file is opened as UTF-8 text; errors='ignore' silently drops any bytes that cannot be decoded.</b></p>

In [4]:
# method to categorize the files by path and read their contents
def categorize_class_files(file_paths):
    comp_data = []
    rec_data = []
    sci_data = []
    talk_data = []
    for file in file_paths:
        # pick the target list from the class name found in the file path
        if "comp" in file:
            target = comp_data
        elif "rec" in file:
            target = rec_data
        elif "sci" in file:
            target = sci_data
        elif "talk" in file:
            target = talk_data
        else:
            continue
        # store the document text (not the path) and close the file afterwards
        with open(file, encoding='utf-8', errors='ignore') as w:
            target.append(w.read())
    return comp_data, rec_data, sci_data, talk_data
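
<p>To show what errors='ignore' does in the cell above, here is a minimal sketch that decodes a byte string which is not valid UTF-8. The sample bytes are invented for the illustration; the same error handlers apply when they are passed to open().</p>

```python
# "café" encoded as Latin-1 ends in the single byte 0xE9, which is not valid UTF-8 on its own
raw = "café".encode("latin-1")

# errors="ignore" drops the undecodable byte, so the character is lost without any warning
ignored = raw.decode("utf-8", errors="ignore")

# errors="replace" keeps a visible marker (U+FFFD) where the bad byte was
replaced = raw.decode("utf-8", errors="replace")

# The default (strict) handler raises instead of guessing
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(ignored), repr(replaced), strict_failed)
```

<p>Ignoring errors keeps the loading loop from stopping on a badly encoded post, at the cost of silently losing a few characters.</p>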

In [5]:
comp_fl, rec_fl, sci_fl, talk_fl = categorize_class_files(text_files)

In [6]:
print("comp  "+str(len(comp_fl)))
print("sci "+str(len(sci_fl)))
print("talk  "+str(len(talk_fl)))
print("rec "+str(len(rec_fl)))

comp  4891
sci 3952
talk  3253
rec 3979


<h2><u>STEP: 03</u></h2>
<p>Randomly split the documents in each class into two separate sets: 70% for training and 30% for testing.</p>

In [8]:
comp_train,comp_test = train_test_split(comp_fl, train_size=0.7,test_size=0.3, random_state=42)
rec_train, rec_test = train_test_split(rec_fl, train_size=0.7,test_size=0.3, random_state=42)
sci_train, sci_test = train_test_split(sci_fl, train_size=0.7,test_size=0.3, random_state=42)
talk_train,talk_test = train_test_split(talk_fl, train_size=0.7,test_size=0.3, random_state=42)
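
<p>The effect of the 70/30 split above can be checked on a toy list. This is only a sketch: the ten document names are placeholders, not data from the notebook.</p>

```python
from sklearn.model_selection import train_test_split

# Ten placeholder "documents" standing in for one class
docs = [f"doc{i}" for i in range(10)]

# Same arguments as in the notebook: 70% train, 30% test, fixed seed
train, test = train_test_split(docs, train_size=0.7, test_size=0.3, random_state=42)

# Running the split again with the same seed gives the same partition
train_again, test_again = train_test_split(docs, train_size=0.7, test_size=0.3, random_state=42)

print(len(train), len(test))
print(train == train_again and test == test_again)
```

<p>Fixing random_state makes the split reproducible, so the CSV files generated later stay the same between runs.</p>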

In [9]:
All_train_data_file =[ comp_train, rec_train, sci_train, talk_train]
All_test_data_file =[ comp_test, rec_test, sci_test, talk_test]

<h2><u>STEP: 04</u></h2>
<p>Pre-process each document and transform it to a feature vector.</p>

In [10]:
# get the set of stop words
_stop_words = set(stopwords.words('english') + list(punctuation))

# initialize english stemmer
stemmer = EnglishStemmer()

# initialize word net lemmatizer
lemmatizer = WordNetLemmatizer()

# word list used in english language
_word_list = set([word for word in wn.words(lang='eng')])

In [11]:
# collect the unique cleaned words from a list of document lists
def process_class_files(data_list):
    word_set = set()
    for docs in data_list:          # each entry is a list of documents
        for doc in docs:            # each document is a string
            for word in word_tokenize(doc):
                word = stemmer.stem(word)
                word = lemmatizer.lemmatize(word)
                # keep alphabetic English words that are not stop words
                if word.isalpha() and word not in _stop_words and word in _word_list:
                    word_set.add(word)
    return list(word_set)
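
<p>The filtering logic of this step can be tried without the NLTK corpora. The sketch below is a simplified stand-in, not the notebook's pipeline: it uses a regular expression instead of word_tokenize, skips stemming and lemmatization, and the tiny STOP_WORDS and KNOWN_WORDS sets are hypothetical replacements for the NLTK stop-word list and the WordNet word list.</p>

```python
import re
from string import punctuation

# Hypothetical miniature replacements for the NLTK stop words and the WordNet word list
STOP_WORDS = {"the", "is", "are", "a", "of"} | set(punctuation)
KNOWN_WORDS = {"disk", "drive", "fast", "game"}


def clean_tokens(text):
    """Lower-case the text, split it into tokens and keep only the useful words."""
    # Runs of letters become word tokens; any other non-space character is its own token
    tokens = re.findall(r"[A-Za-z]+|[^\sA-Za-z]", text.lower())
    return [
        t for t in tokens
        if t.isalpha() and t not in STOP_WORDS and t in KNOWN_WORDS
    ]


tokens = clean_tokens("The disk drive is fast; the game is not!")
vocabulary = sorted(set(tokens))
print(tokens)
print(vocabulary)
```

<p>Taking the set of the cleaned tokens is what turns the documents into a vocabulary of unique words.</p>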

In [13]:
voc_train = process_class_files(All_train_data_file)

In [16]:
# process_class_files expects a list of document lists, so each class is wrapped in a list
clean_train_comp = process_class_files([comp_train])

In [17]:
clean_train_rec = process_class_files([rec_train])

In [18]:
clean_train_sci = process_class_files([sci_train])

In [19]:
clean_train_talk = process_class_files([talk_train])

In [20]:
clean_test_comp = process_class_files([comp_test])

In [21]:
clean_test_rec = process_class_files([rec_test])

In [22]:
clean_test_sci = process_class_files([sci_test])

In [23]:
clean_test_talk = process_class_files([talk_test])

<p><b>Convert a collection of text documents to a matrix of token counts.</b></p>

In [28]:
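def vectorize_csv_files(cleaned_data, csv_file_name, vocabulary_set):
    # the vocabulary is fixed, so every file gets the same columns
    # (max_features is ignored when a vocabulary is given, so it is omitted)
    # stop_words is passed as a list, which recent scikit-learn versions require
    vectorizer = CountVectorizer(stop_words=list(_stop_words), vocabulary=vocabulary_set)
    x = vectorizer.fit_transform(cleaned_data)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    y = vectorizer.get_feature_names_out()
    
    df = pd.DataFrame(data=x.toarray(), columns=y)
    
    df.to_csv(csv_file_name+'.csv')
    
    print(csv_file_name+'.csv is successfully created!')
    return df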

<p><b>This generates eight CSV files: four train_*.csv files and four test_*.csv files, one per class.</b></p>

In [31]:
vectorize_csv_files(clean_train_comp, 'train_comp', voc_train)
vectorize_csv_files(clean_train_rec, 'train_rec', voc_train)
vectorize_csv_files(clean_train_sci, 'train_sci', voc_train)
vectorize_csv_files(clean_train_talk, 'train_talk', voc_train)

train_comp.csv is successfully created!
train_rec.csv is successfully created!
train_sci.csv is successfully created!
train_talk.csv is successfully created!


Unnamed: 0,aphrodisiac,walkout,lee,gao,hygienist,imu,nit,nourish,sac,eec,...,goldwyn,philadelphia,bran,tailspin,monochromat,overreact,boost,disco,spruce,forgotten
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [32]:
vectorize_csv_files(clean_test_comp, 'test_comp', voc_train)
vectorize_csv_files(clean_test_rec, 'test_rec', voc_train)
vectorize_csv_files(clean_test_sci, 'test_sci', voc_train)
vectorize_csv_files(clean_test_talk, 'test_talk', voc_train)

test_comp.csv is successfully created!
test_rec.csv is successfully created!
test_sci.csv is successfully created!
test_talk.csv is successfully created!


Unnamed: 0,aphrodisiac,walkout,lee,gao,hygienist,imu,nit,nourish,sac,eec,...,goldwyn,philadelphia,bran,tailspin,monochromat,overreact,boost,disco,spruce,forgotten
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
