## Learning Objective


At the end of the experiment, you will be able to:

*  Pre-process the data
*  Represent text documents using Bag of Words and Word2Vec

In [None]:
#@title Experiment Walkthrough Video
#@markdown BoW vs W2V
from IPython.display import HTML

HTML("""<video width="854" height="480" controls>
  <source src="https://cdn.exec.talentsprint.com/non-processed/Bag_of_Words_Vs_Word2Vec.mp4">
</video>
""")

## Dataset

This dataset contains around 200k news headlines from 2012 to 2018, obtained from [HuffPost](https://www.huffpost.com/). A model trained on this dataset could be used to identify tags for untracked news articles or to identify the type of language used in different news articles.

Each news headline has a corresponding category. The categories and their article counts are as follows:


    POLITICS: 32739
    WELLNESS: 17827
    ENTERTAINMENT: 16058
    TRAVEL: 9887
    STYLE & BEAUTY: 9649
    PARENTING: 8677
    HEALTHY LIVING: 6694
    QUEER VOICES: 6314
    FOOD & DRINK: 6226
    BUSINESS: 5937
    COMEDY: 5175
    SPORTS: 4884
    BLACK VOICES: 4528
    HOME & LIVING: 4195
    PARENTS: 3955
    THE WORLDPOST: 3664
    WEDDINGS: 3651
    WOMEN: 3490
    IMPACT: 3459
    DIVORCE: 3426
    CRIME: 3405
    MEDIA: 2815
    WEIRD NEWS: 2670
    GREEN: 2622
    WORLDPOST: 2579
    RELIGION: 2556
    STYLE: 2254
    SCIENCE: 2178
    WORLD NEWS: 2177
    TASTE: 2096
    TECH: 2082
    MONEY: 1707
    ARTS: 1509
    FIFTY: 1401
    GOOD NEWS: 1398
    ARTS & CULTURE: 1339
    ENVIRONMENT: 1323
    COLLEGE: 1144
    LATINO VOICES: 1129
    CULTURE & ARTS: 1030
    EDUCATION: 1004


#### Description
This dataset has the following columns:
1. **Category:** Category the article belongs to
2. **Headline:** Headline of the article
3. **Authors:** Person(s) who authored the article
4. **Link:** Link to the post
5. **Short_description:** Short description of the article
6. **Date:** Date the article was published

Out of the 41 categories in the News_Category_Dataset, we consider four (Travel, Tech, Science, College) for this experiment.

In [None]:
! wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/News_Category_Dataset_v2.csv
! wget https://cdn.talentsprint.com/talentsprint1/archives/sc/aiml/experiment_related_data/AIML_DS_GOOGLENEWS-VECTORS-NEGATIVE-300_STD.rar
! unrar e /content/AIML_DS_GOOGLENEWS-VECTORS-NEGATIVE-300_STD.rar
    

## Import packages


In [None]:
import re
import nltk
import pandas as pd
import numpy as np
import gensim
from nltk.corpus import stopwords 
nltk.download('stopwords')
import warnings
warnings.filterwarnings('ignore')

## Load the data 


In [None]:
# Load the data
df = pd.read_csv('News_Category_Dataset_v2.csv')
df.head()

In [None]:
# Count the classes in category
df['category'].value_counts()

## Data Pre-processing

We consider four categories (Travel, Tech, Science, College) for this experiment.

In [None]:
# List of the manually selected categories
category = ['TRAVEL','TECH','SCIENCE','COLLEGE']

# Keep only the rows whose category is in the list
# (.isin checks whether each element of the column is contained in the given values;
#  .copy() lets us add columns later without a SettingWithCopyWarning)
df = df[df['category'].isin(category)].copy()
df.shape

In [None]:
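# Combine the headline and short description into a single text column
# (fill missing values so the concatenation never yields NaN, and join with ', '
#  so the two parts stay separate words once punctuation is stripped later)
df['text'] = df['headline'].fillna('') + ', ' + df['short_description'].fillna('')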
df['label'] = df['category']

Drop the unwanted columns

In [None]:
df = df.drop(['headline','short_description','date','authors','link','category','Unnamed: 0'], axis=1)
df.shape

Use the `text` column as the feature and the `label` column as the target variable. Convert the labels to numeric values.

Hint: Use a Label Encoder to obtain a numeric representation; refer to the [link](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).

In [None]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
df['label']=le.fit_transform(df['label'])
df['label'].head()
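
To make the encoding above easier to interpret, here is a minimal plain-Python sketch of the mapping a label encoder produces: the unique classes are sorted and numbered from 0. The `labels` list is a hypothetical sample, not rows from the dataset.

```python
# Hypothetical sample of labels, in arbitrary order
labels = ['TRAVEL', 'TECH', 'SCIENCE', 'COLLEGE', 'TRAVEL']

# Sort the unique classes and number them from 0, as LabelEncoder does
classes = sorted(set(labels))
class_to_id = {name: idx for idx, name in enumerate(classes)}

# Encode the labels, then decode them again to check the round trip
encoded = [class_to_id[name] for name in labels]
decoded = [classes[idx] for idx in encoded]

print(class_to_id)
print(encoded)
```

On the fitted encoder itself, `le.classes_` lists the classes in the same sorted order, and `le.inverse_transform` maps the integers back to category names.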

In [None]:
df['text'].shape, df['label'].shape

## BoW

### TF-IDF
TF-IDF weighs the number of times a given word appears in a document (a news headline plus its short description in our case) against the number of documents in the corpus that contain the word. Words that appear in many documents receive low weights, while words that appear in few documents receive higher weights. With scikit-learn's default settings each document vector is also L2-normalized, so individual weights lie between 0 and 1.
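
As a rough illustration of the weighting, the sketch below computes TF-IDF by hand on a toy corpus. It assumes the smoothed formula documented for scikit-learn's defaults, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization. The three headlines and the helper `tfidf_vector` are hypothetical, and the whitespace tokenization is a simplification of what `TfidfVectorizer` does.

```python
import math
from collections import Counter

# Toy corpus (hypothetical headlines, not taken from the dataset)
docs = [
    "new phone launch",
    "new travel deals",
    "new science study",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each word appears in
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

# Smoothed inverse document frequency
idf = {word: math.log((1 + n_docs) / (1 + df_t)) + 1
       for word, df_t in doc_freq.items()}

def tfidf_vector(tokens):
    """Return an L2-normalized {word: weight} mapping for one document."""
    counts = Counter(tokens)
    raw = {word: count * idf[word] for word, count in counts.items()}
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {word: weight / norm for word, weight in raw.items()}

first_doc = tfidf_vector(tokenized[0])
for word, weight in sorted(first_doc.items()):
    print(f"{word}: {weight:.3f}")
```

Here "new" occurs in every document, so it gets the smallest weight, while "phone" and "launch" occur in only one document and are weighted more heavily.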




In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
alltext = df['text'].astype(str)
tfidf_feature = tfidf_vectorizer.fit_transform(alltext)  

### Split the data into train and test sets 

Hint: Refer to [Train-Test split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test,y_train,y_test = train_test_split(tfidf_feature,df['label'],test_size = 0.2,random_state=42)
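
To show what `test_size=0.2` and a fixed random seed mean in practice, here is a simplified sketch of a shuffled 80/20 split over row indices. The function `simple_train_test_split` is hypothetical, and its shuffling is not the same as scikit-learn's, so it will not reproduce the exact rows chosen by `train_test_split`.

```python
import math
import random

def simple_train_test_split(n_samples, test_size=0.2, seed=42):
    """Shuffle the row indices reproducibly and hold out a fraction for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # same seed, same shuffle
    n_test = math.ceil(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = simple_train_test_split(10)
print("train rows:", sorted(train_idx))
print("test rows: ", sorted(test_idx))
```

Fixing the seed makes the split repeatable, so accuracy figures from different runs remain comparable.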

### Apply the Classification 


In [None]:
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

# Create an object for all the algorithms
model1 = DecisionTreeClassifier()
model2 = KNeighborsClassifier(n_neighbors=8)
model3 = SGDClassifier()
model4 = SVC(kernel='linear')

models = [model1, model2, model3, model4]

for clf in models:
    clf.fit(X_train, y_train)            # fit the model
    y_pred = clf.predict(X_test)         # then predict on the test set
    accuracy = accuracy_score(y_test, y_pred)
    # accuracy_score returns a fraction, so scale it to a percentage
    print("Accuracy (in %) of", clf, "is", round(accuracy * 100, 2))
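
Because the four selected classes are imbalanced, it can help to read these accuracies against a trivial baseline that always predicts the most frequent class. The sketch below computes that baseline from the article counts listed in the Dataset section, assuming the filtered DataFrame matches those counts.

```python
# Article counts for the four selected categories (from the Dataset section)
counts = {"TRAVEL": 9887, "SCIENCE": 2178, "TECH": 2082, "COLLEGE": 1144}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a model that always predicts the majority class
baseline_accuracy = counts[majority_class] / total

print(f"{majority_class}: {baseline_accuracy * 100:.1f}% of {total} articles")
```

A classifier only adds value to the extent that it beats this figure.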


## Word2Vec

### Load pre-trained Word2Vec

Let's now load the pre-trained vectors. The `limit=500000` argument loads only the first 500,000 word vectors from the file rather than the complete set.

In [None]:
model = gensim.models.KeyedVectors.load_word2vec_format('AIML_DS_GOOGLENEWS-VECTORS-NEGATIVE-300_STD.bin', binary=True, limit=500000)

### Word2Vec representation

Convert each document into the average of the Word2Vec vectors of all valid words in the document. A word is valid if it is not a stop word and is present in the pre-trained vocabulary.
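
The averaging can be sketched on a toy vocabulary as follows. The 3-dimensional `toy_embeddings`, the small `toy_stop_words` set, and the helper `document_vector` are all hypothetical stand-ins for the 300-dimensional pre-trained vectors and the NLTK stop-word list. Returning a zero vector when no word is valid is an assumption of this sketch.

```python
import re

# Hypothetical 3-dimensional embeddings; the real vectors have 300 dimensions
toy_embeddings = {
    "rocket": [0.9, 0.1, 0.0],
    "launch": [0.7, 0.3, 0.2],
    "beach":  [0.0, 0.8, 0.6],
}
toy_stop_words = {"the", "a", "on", "of"}  # stand-in for the NLTK list

def document_vector(text, embeddings, stop_words, dim=3):
    """Average the embeddings of all non-stop words found in the vocabulary."""
    # Lowercase, then drop everything except letters and spaces
    cleaned = re.sub(r"[^a-z ]", "", text.lower())
    vectors = [embeddings[w] for w in cleaned.split()
               if w not in stop_words and w in embeddings]
    if not vectors:
        return [0.0] * dim  # assumption: zero vector when no word is valid
    # Average each dimension across the collected word vectors
    return [sum(column) / len(vectors) for column in zip(*vectors)]

vec = document_vector("The rocket launch!", toy_embeddings, toy_stop_words)
print(vec)
```

In this example "the" is removed as a stop word, so the result is the element-wise mean of the vectors for "rocket" and "launch".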

Note: The code cell below takes some time to run.

In [None]:
# Build the stop-word set once (set membership is faster than a list lookup)
stop_words = set(stopwords.words('english'))
text = df['text'].astype(str)

doc_vectors = []
# Lowercase each document and strip everything except letters and spaces
for doc in text.str.lower().str.replace('[^a-z ]', '', regex=True):
    # Keep the vectors of words that are not stop words and exist in the embeddings
    word_vecs = [model[word] for word in doc.split()
                 if word not in stop_words and word in model]
    if word_vecs:
        # Average the word vectors to get one vector per document
        doc_vectors.append(np.mean(word_vecs, axis=0))
    else:
        # No valid words: use a zero vector so the row contains no NaN values
        doc_vectors.append(np.zeros(model.vector_size))

# One row per document, one column per embedding dimension
docs_vectors = pd.DataFrame(doc_vectors)
docs_vectors.shape



### Split the data into train and test sets 

Hint: Refer to [Train-Test split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test,y_train,y_test = train_test_split(docs_vectors,df['label'],test_size = 0.2,random_state=42)

### Apply the Classification 


In [None]:
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

# Create an object for all the algorithms
model1 = DecisionTreeClassifier()
model2 = KNeighborsClassifier(n_neighbors=8)
model3 = SGDClassifier()
model4 = SVC(kernel='linear')

models = [model1, model2, model3, model4]

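# Use `clf` as the loop variable so the Word2Vec `model` is not overwritten
for clf in models:
    clf.fit(X_train, y_train)            # fit the model
    y_pred = clf.predict(X_test)         # then predict on the test set
    accuracy = accuracy_score(y_test, y_pred)
    # accuracy_score returns a fraction, so scale it to a percentage
    print("Accuracy (in %) of", clf, "is", round(accuracy * 100, 2))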