# Activity 2: Extracting specific features from texts

Create a Bag of Words model and a TF-IDF model from sklearn’s ‘The 20 newsgroups text dataset’, using the 20 most frequently occurring words. Compare and contrast the two models. Which terms of the 2nd document are the most informative according to each model? Are the two sets of most informative terms different from each other? If so, what do you think are the reasons for the difference? <br>
(Note: Assume the first instance to be the 0th document.)
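
Before working on the full dataset, it may help to see how the two models differ on a tiny corpus. The sketch below is only an illustration: the three sentences and the names `toy_corpus`, `to_frame`, `toy_bow` and `toy_tfidf` are hypothetical and not part of the activity. Both vectorizers are used with their default settings.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical three-document corpus, used only for illustration.
toy_corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

def to_frame(vectorizer, docs):
    """Fit the vectorizer and return a documents-by-terms DataFrame."""
    matrix = vectorizer.fit_transform(docs)
    # Feature indices follow the sorted order of the terms,
    # so the sorted vocabulary gives the column names.
    return pd.DataFrame(matrix.toarray(), columns=sorted(vectorizer.vocabulary_))

toy_bow = to_frame(CountVectorizer(), toy_corpus)      # raw term counts
toy_tfidf = to_frame(TfidfVectorizer(), toy_corpus)    # counts reweighted by rarity

print(toy_bow)
print(toy_tfidf.round(3))
```

In document 0, 'cat' and 'mat' both have a count of 1, so the Bag of Words model cannot tell them apart. 'mat' occurs in only one document while 'cat' occurs in two, so TF-IDF gives 'mat' the higher weight.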


In [3]:
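import nltk
# word_tokenize and WordNetLemmatizer additionally rely on NLTK's tokenizer models
# and the WordNet corpus; download them the same way if they are missing.
nltk.download('stopwords')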
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
import re
import string
import pandas as pd

newsgroups_data_sample = fetch_20newsgroups(subset='train')
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sohom.ghosh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
newsgroups_text_df = pd.DataFrame({'text' : newsgroups_data_sample['data']})
newsgroups_text_df.head()

,text
0,From: lerxst@wam.umd.edu (where's my thing)\nS...
1,From: guykuo@carson.u.washington.edu (Guy Kuo)...
2,From: twillis@ec.ecn.purdue.edu (Thomas E Will...
3,From: jgreen@amber (Joe Green)\nSubject: Re: W...
4,From: jcm@head-cfa.harvard.edu (Jonathan McDow...


In [5]:
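# Stop words plus every individual printable character, so that stray
# single-character tokens are removed along with the stop words.
# A set keeps the membership test below fast.
stop_words = set(stopwords.words('english')) | set(string.printable)

def clean_text(text):
    # Replace punctuation and underscores with spaces, then tokenize.
    tokens = word_tokenize(re.sub(r'([^\s\w]|_)+', ' ', str(text)))
    # Drop stop words (case-insensitively) and lemmatize the lowercased rest.
    return ' '.join(lemmatizer.lemmatize(token.lower())
                    for token in tokens if token.lower() not in stop_words)

newsgroups_text_df['cleaned_text'] = newsgroups_text_df['text'].apply(clean_text)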

In [6]:
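# max_features=20 keeps only the 20 terms with the highest counts across the corpus.
bag_of_words_model = CountVectorizer(max_features=20)
bag_of_word_df = pd.DataFrame(
    bag_of_words_model.fit_transform(newsgroups_text_df['cleaned_text']).toarray())
# Feature indices follow the sorted order of the terms, so the sorted vocabulary gives the column names.
bag_of_word_df.columns = sorted(bag_of_words_model.vocabulary_)
bag_of_word_df.head()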

,article,ax,com,edu,get,host,know,like,line,max,nntp,one,organization,people,posting,subject,time,university,would,writes
0,0,0,0,2,0,1,1,0,1,0,1,0,1,0,1,1,0,1,0,0
1,1,0,0,3,0,1,0,0,1,0,1,0,1,0,1,1,0,1,0,0
2,0,0,0,2,1,0,1,1,2,0,0,1,1,1,0,1,1,1,0,0
3,1,0,2,2,1,1,1,1,1,0,1,0,1,0,1,1,0,0,0,1
4,2,0,2,3,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,1


In [7]:
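# Same 20-term vocabulary as above, but each count is reweighted by how rare the term is across documents.
tfidf_model = TfidfVectorizer(max_features=20)
tfidf_df = pd.DataFrame(
    tfidf_model.fit_transform(newsgroups_text_df['cleaned_text']).toarray())
# Feature indices follow the sorted order of the terms, so the sorted vocabulary gives the column names.
tfidf_df.columns = sorted(tfidf_model.vocabulary_)
tfidf_df.head()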

,article,ax,com,edu,get,host,know,like,line,max,nntp,one,organization,people,posting,subject,time,university,would,writes
0,0.0,0.0,0.0,0.522408,0.0,0.338589,0.399935,0.0,0.183761,0.0,0.341216,0.0,0.190375,0.0,0.327584,0.183242,0.0,0.353795,0.0,0.0
1,0.282739,0.0,0.0,0.691591,0.0,0.298827,0.0,0.0,0.162181,0.0,0.301146,0.0,0.168019,0.0,0.289115,0.161723,0.0,0.312248,0.0,0.0
2,0.0,0.0,0.0,0.411404,0.325661,0.0,0.314954,0.308844,0.289428,0.0,0.0,0.277716,0.149923,0.35893,0.0,0.144306,0.345624,0.278619,0.0,0.0
3,0.234838,0.0,0.499523,0.382949,0.303137,0.248201,0.29317,0.287482,0.134705,0.0,0.250127,0.0,0.139553,0.0,0.240134,0.134325,0.0,0.0,0.0,0.225159
4,0.493319,0.0,0.52467,0.60334,0.0,0.0,0.0,0.0,0.141486,0.0,0.0,0.0,0.146579,0.0,0.0,0.141087,0.0,0.0,0.0,0.236494


In [8]:
# Row 2 is the 2nd document (counting from 0). List every term tied for its highest count.
rw = 2
list(bag_of_word_df.columns[bag_of_word_df.iloc[rw,:] == bag_of_word_df.iloc[rw,:].max()])

['edu', 'line']

In [9]:
# Same document, now ranked by TF-IDF weight instead of raw count.
rw = 2
list(tfidf_df.columns[tfidf_df.iloc[rw,:] == tfidf_df.iloc[rw,:].max()])

['edu']
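
The two cells above return only the terms tied for the single highest score. To compare the models on more than one term, a ranked list can be more useful. The sketch below assumes a hypothetical helper, `top_terms`, and applies it to a one-row excerpt of the TF-IDF table above (four of the values from document 2) so that it runs on its own.

```python
import pandas as pd

def top_terms(term_df, row, k=3):
    """Return the k highest-scoring terms of one document, best first."""
    scores = term_df.iloc[row]
    # Ignore absent terms; keep="all" retains every term tied at the cutoff.
    return scores[scores > 0].nlargest(k, keep="all")

# Excerpt of document 2 from the TF-IDF table above (four columns only).
demo_df = pd.DataFrame(
    [[0.411404, 0.289428, 0.358930, 0.345624]],
    columns=["edu", "line", "people", "time"],
)

ranked = top_terms(demo_df, row=0, k=3)
print(ranked)
```

In the notebook itself the same helper could be called as `top_terms(tfidf_df, 2)` or `top_terms(bag_of_word_df, 2)`.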

In [10]:
# Number of documents that contain 'line' at least once (its document frequency).
bag_of_word_df[bag_of_word_df['line']!=0].shape[0]

11282

In [11]:
# Number of documents that contain 'edu' at least once (its document frequency).
bag_of_word_df[bag_of_word_df['edu']!=0].shape[0]

7393
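
These two document frequencies suggest why the models disagree on document 2. 'edu' and 'line' both occur twice there, so they tie under Bag of Words, but 'line' appears in far more documents and therefore gets a lower inverse document frequency. The sketch below works this out with the smoothed formula that `TfidfVectorizer` uses by default (`smooth_idf=True`), idf = ln((1 + n) / (1 + df)) + 1. The corpus size `n_documents = 11314` is an assumption about the 'train' subset and should be checked with `len(newsgroups_text_df)`. The helper name `smoothed_idf` is hypothetical.

```python
import math

# Assumption: the 'train' subset holds 11,314 posts; verify with len(newsgroups_text_df).
n_documents = 11314

# Document frequencies reported by the two cells above.
document_frequency = {"line": 11282, "edu": 7393}

def smoothed_idf(df, n):
    """Inverse document frequency with add-one smoothing (scikit-learn's default)."""
    return math.log((1 + n) / (1 + df)) + 1

idf = {term: smoothed_idf(df, n_documents) for term, df in document_frequency.items()}
print(idf)

# Both terms have a count of 2 in document 2, and L2 normalization scales the
# whole row by one factor, so the ratio of their TF-IDF weights equals this ratio.
idf_ratio = idf["edu"] / idf["line"]
print(round(idf_ratio, 4))
```

Under that assumption the IDF of 'edu' comes out at roughly 1.43 against roughly 1.00 for 'line'. Their ratio is consistent with the ratio of the two TF-IDF weights shown for document 2 (0.411404 and 0.289428).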