Document classification using SVM

The dataset used here is the <b>20 Newsgroups dataset</b>, which contains around 18,000 newsgroup posts spread across 20 topics.

In [1]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
import seaborn as sns
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
import re
import nltk
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [2]:
# Load the train and test splits together; headers, footers and quotes are kept.
ng_data = fetch_20newsgroups(subset="all", random_state=42)

In [3]:
nltk.download('stopwords')
stop_words = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/hrishikesh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/hrishikesh/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [5]:
print(ng_data["data"][1])

From: mblawson@midway.ecn.uoknor.edu (Matthew B Lawson)
Subject: Which high-performance VLB video card?
Summary: Seek recommendations for VLB video card
Nntp-Posting-Host: midway.ecn.uoknor.edu
Organization: Engineering Computer Network, University of Oklahoma, Norman, OK, USA
Keywords: orchid, stealth, vlb
Lines: 21

  My brother is in the market for a high-performance video card that supports
VESA local bus with 1-2MB RAM.  Does anyone have suggestions/ideas on:

  - Diamond Stealth Pro Local Bus

  - Orchid Farenheit 1280

  - ATI Graphics Ultra Pro

  - Any other high-performance VLB card


Please post or email.  Thank you!

  - Matt

-- 
    |  Matthew B. Lawson <------------> (mblawson@essex.ecn.uoknor.edu)  |   
  --+-- "Now I, Nebuchadnezzar, praise and exalt and glorify the King  --+-- 
    |   of heaven, because everything he does is right and all his ways  |   
    |   are just." - Nebuchadnezzar, king of Babylon, 562 B.C.           |   
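
The post printed above still carries its header block (From, Subject, Organization and so on) and a signature, and the notebook passes all of it on to the later steps. If header-free text were wanted, one option would be the `remove=('headers', 'footers', 'quotes')` argument of `fetch_20newsgroups`. The sketch below illustrates only the header part with the standard library, under the assumption that the header ends at the first blank line. `strip_header` is a hypothetical helper, not part of this notebook, and `post` is a shortened copy of the output above.

```python
def strip_header(post):
    """Return the text after the first blank line (assumed end of the header)."""
    # partition() returns an empty body when no blank line is found.
    _header, _sep, body = post.partition("\n\n")
    return body

# Shortened copy of the post printed above (some header lines omitted).
post = (
    "From: mblawson@midway.ecn.uoknor.edu (Matthew B Lawson)\n"
    "Subject: Which high-performance VLB video card?\n"
    "Lines: 21\n"
    "\n"
    "  My brother is in the market for a high-performance video card\n"
)
print(strip_header(post))
```

A post without any blank line would come back empty under this rule, so the result is worth checking before the original text is overwritten.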



Preprocessing

In [6]:
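# Build the stop-word set once; rebuilding it on every call is slow for ~18,000 posts.
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    """Lowercase, strip non-letters, tokenize, and drop English stop words."""
    text = text.lower()
    # Delete (not replace) every character that is not a letter or whitespace.
    text = re.sub(r'[^a-z\s]', '', text)
    tokens = nltk.word_tokenize(text)
    return ' '.join(word for word in tokens if word not in STOP_WORDS)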

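The function above preprocesses each post. It:
1. Converts all text to lowercase.
2. Removes every character that is not a letter or whitespace, i.e. punctuation and digits. The characters are deleted rather than replaced by a space, so an address such as mblawson@midway.ecn.uoknor.edu collapses into the single token mblawsonmidwayecnuoknoredu.
3. Tokenizes the text with nltk.word_tokenize.
4. Removes English stopwords and joins the remaining tokens back into one string.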
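
The four steps can be traced on a single string. The sketch below is a standard-library approximation, not the notebook's function: it assumes a tiny hypothetical stop-word set (`DEMO_STOP_WORDS`) in place of NLTK's English list and `str.split` in place of `nltk.word_tokenize`, so it runs without any NLTK data. It also shows a side effect of step 2: because punctuation is deleted rather than replaced by a space, an e-mail address is fused into one token.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
DEMO_STOP_WORDS = {"my", "is", "in", "the", "for", "a", "that", "with", "or"}

def preprocess_demo(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, drop non-letters, split on whitespace, remove stop words."""
    text = text.lower()                    # step 1: lowercase
    text = re.sub(r"[^a-z\s]", "", text)   # step 2: delete punctuation and digits
    tokens = text.split()                  # step 3: whitespace split (assumption)
    # step 4: drop stop words and rejoin into one string
    return " ".join(t for t in tokens if t not in stop_words)

sample = "Please post or email: mblawson@midway.ecn.uoknor.edu, 1-2MB RAM!"
print(preprocess_demo(sample))
# prints: please post email mblawsonmidwayecnuoknoredu mb ram
```

Substituting a space in step 2 (`re.sub(r"[^a-z\s]", " ", text)`) would keep the parts of the address as separate tokens; which behaviour is preferable depends on the features wanted downstream.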

In [7]:
ng_data_df = pd.DataFrame({'text' : ng_data.data, 'target': ng_data.target})

In [8]:
ng_data_df

index,text,target
0,From: Mamatha Devineni Ratnam <mr47+@andrew.cm...,10
1,From: mblawson@midway.ecn.uoknor.edu (Matthew ...,3
2,From: hilmi-er@dsv.su.se (Hilmi Eren)\nSubject...,17
3,From: guyd@austin.ibm.com (Guy Dawson)\nSubjec...,3
4,From: Alexander Samuel McDiarmid <am2o+@andrew...,4
...,...,...
18841,From: jim.zisfein@factory.com (Jim Zisfein) \n...,13
18842,From: rdell@cbnewsf.cb.att.com (richard.b.dell...,12
18843,From: westes@netcom.com (Will Estes)\nSubject:...,3
18844,From: steve@hcrlgw (Steven Collins)\nSubject: ...,1


In [9]:
ng_data_df["text"] = ng_data_df["text"].apply(preprocess_text)

In [10]:
ng_data_df

index,text,target
0,mamatha devineni ratnam mrandrewcmuedu subject...,10
1,mblawsonmidwayecnuoknoredu matthew b lawson su...,3
2,hilmierdsvsuse hilmi eren subject armenia says...,17
3,guydaustinibmcom guy dawson subject ide vs scs...,3
4,alexander samuel mcdiarmid amoandrewcmuedu sub...,4
...,...,...
18841,jimzisfeinfactorycom jim zisfein subject migra...,13
18842,rdellcbnewsfcbattcom richardbdell subject ques...,12
18843,westesnetcomcom estes subject mounting cpu coo...,3
18844,stevehcrlgw steven collins subject sphere poin...,1
