<a href="https://colab.research.google.com/github/Shubham04689/colab_notebooks/blob/main/News_Category_Dataset_BoW_vs_W2V.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Learning Objective


At the end of the experiment, you will be able to:

*  Pre-process the data
*  Representation of  text document using Bag of Words & Word2Vec

## Dataset

   This dataset contains around 200k news headlines from the year 2012 to 2018 obtained from [HuffPost](https://www.huffpost.com/). The model trained on this dataset could be used to identify tags for untracked news articles or to identify the type of language used in different news articles.

Each news headline has a corresponding category. Categories and corresponding article counts as follows:


    POLITICS: 32739
    WELLNESS: 17827
    ENTERTAINMENT: 16058
    TRAVEL: 9887
    STYLE & BEAUTY: 9649
    PARENTING: 8677
    HEALTHY LIVING: 6694
    QUEER VOICES: 6314
    FOOD & DRINK: 6226
    BUSINESS: 5937
    COMEDY: 5175
    SPORTS: 4884
    BLACK VOICES: 4528
    HOME & LIVING: 4195
    PARENTS: 3955
    THE WORLDPOST: 3664
    WEDDINGS: 3651
    WOMEN: 3490
    IMPACT: 3459
    DIVORCE: 3426
    CRIME: 3405
    MEDIA: 2815
    WEIRD NEWS: 2670
    GREEN: 2622
    WORLDPOST: 2579
    RELIGION: 2556
    STYLE: 2254
    SCIENCE: 2178
    WORLD NEWS: 2177
    TASTE: 2096
    TECH: 2082
    MONEY: 1707
    ARTS: 1509
    FIFTY: 1401
    GOOD NEWS: 1398
    ARTS & CULTURE: 1339
    ENVIRONMENT: 1323
    COLLEGE: 1144
    LATINO VOICES: 1129
    CULTURE & ARTS: 1030
    EDUCATION: 1004


#### Description
This dataset has the following columns:
1. **Category:** Category article belongs to
2. **Headline:** Determines the Headline of the article
3. **Authors:** Person authored the article
4. **Link:** Link to the post
5. **Short_description:** Short description of the article
6. **Date:** Date the article was published

Out of 41 category's from the News_Category_Dataset, we consider four category's (Travel, Tech, Science, College) for this experiment

In [23]:
import requests
import os
import zipfile

def download_and_extract(url, extract_path):
  """Downloads a file from a given URL and extracts it to a specified path.

  Args:
    url: The URL of the file to download.
    extract_path: The path to extract the downloaded file to.
  """

  if not os.path.exists(extract_path):
    os.makedirs(extract_path)

  filename = url.split('/')[-1]
  filepath = os.path.join(extract_path, filename)

  try:
    response = requests.get(url, stream=True)
    response.raise_for_status()  # Raise an exception for bad status codes

    with open(filepath, 'wb') as f:
      for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)

    print(f"Downloaded file: {filename}")

    if filename.endswith(".zip") or filename.endswith(".rar"):
      if filename.endswith(".zip"):
        with zipfile.ZipFile(filepath, 'r') as zip_ref:
          zip_ref.extractall(extract_path)
      elif filename.endswith(".rar"):
          # Use unrar command to extract .rar files
          !unrar x "{filepath}" "{extract_path}"
      print(f"Extracted files to: {extract_path}")

  except requests.exceptions.RequestException as e:
    print(f"Error downloading file: {e}")


# Download and extract News_Category_Dataset_v2.csv
download_and_extract("https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/News_Category_Dataset_v2.csv", "./data")

# Download and extract AIML_DS_GOOGLENEWS-VECTORS-NEGATIVE-300_STD.rar
download_and_extract("https://cdn.talentsprint.com/talentsprint1/archives/sc/aiml/experiment_related_data/AIML_DS_GOOGLENEWS-VECTORS-NEGATIVE-300_STD.rar", "./data")

Downloaded file: News_Category_Dataset_v2.csv
Downloaded file: AIML_DS_GOOGLENEWS-VECTORS-NEGATIVE-300_STD.rar

UNRAR 6.11 beta 1 freeware      Copyright (c) 1993-2022 Alexander Roshal


Extracting from ./data/AIML_DS_GOOGLENEWS-VECTORS-NEGATIVE-300_STD.rar

Extracting  ./data/AIML_DS_GOOGLENEWS-VECTORS-NEGATIVE-300_STD.bin         0%  1%  2%  3%  4%  5%  6%  7%  8%  9% 10% 11% 12% 13% 14% 15% 16% 17% 18% 19% 20% 21% 22% 23% 24% 25% 26% 27% 28% 29% 30% 31% 32% 33% 34% 35% 36% 37% 38% 39% 40% 41% 42% 43% 44% 45% 46% 47% 48% 49% 50% 51% 52% 53% 54% 55% 56% 57% 58% 59% 60% 61% 62% 63% 64% 65% 66% 67% 68% 69% 70% 71% 72% 73% 74% 75% 76% 77% 78% 79% 80% 81% 82%

## Import packages


In [24]:
import re
import nltk
import pandas as pd
import numpy as np
import gensim
from nltk.corpus import stopwords
nltk.download('stopwords')
import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Load the data


In [25]:
# Load the data
df = pd.read_csv('/content/data/News_Category_Dataset_v2.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,category,headline,authors,link,short_description,date
0,0,CRIME,There Were 2 Mass Shootings In Texas Last Week...,Melissa Jeltsen,https://www.huffingtonpost.com/entry/texas-ama...,She left her husband. He killed their children...,2018-05-26
1,1,ENTERTAINMENT,Will Smith Joins Diplo And Nicky Jam For The 2...,Andy McDonald,https://www.huffingtonpost.com/entry/will-smit...,Of course it has a song.,2018-05-26
2,2,ENTERTAINMENT,Hugh Grant Marries For The First Time At Age 57,Ron Dicker,https://www.huffingtonpost.com/entry/hugh-gran...,The actor and his longtime girlfriend Anna Ebe...,2018-05-26
3,3,ENTERTAINMENT,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,Ron Dicker,https://www.huffingtonpost.com/entry/jim-carre...,The actor gives Dems an ass-kicking for not fi...,2018-05-26
4,4,ENTERTAINMENT,Julianna Margulies Uses Donald Trump Poop Bags...,Ron Dicker,https://www.huffingtonpost.com/entry/julianna-...,"The ""Dietland"" actress said using the bags is ...",2018-05-26


In [26]:
# Count the classes in category
df['category'].value_counts()

Unnamed: 0_level_0,count
category,Unnamed: 1_level_1
POLITICS,32739
WELLNESS,17827
ENTERTAINMENT,16058
TRAVEL,9887
STYLE & BEAUTY,9649
PARENTING,8677
HEALTHY LIVING,6694
QUEER VOICES,6314
FOOD & DRINK,6226
BUSINESS,5937


## Data Pre-processing

we are considering four category's (Travel, Tech, Science, College) for this experiment

In [27]:
# Create a list of manually selected category
category = ['TRAVEL','TECH','SCIENCE','COLLEGE']

# Load the dataset based on the category
df = df[df['category'].isin(category)]      # .isin whether each element in the DataFrame is contained in values.
df.shape

(15291, 7)

In [28]:
# Add the two columns into text column
df['text'] = df['headline'] +','+ df['short_description']
df['label'] = df['category']

Drop the unwanted columns

In [29]:
df = df.drop(['headline','short_description','date','authors','link','category','Unnamed: 0'], axis=1)
df.shape

(15291, 2)

Consider text column as feature and label as target variable. Convert label into numerical.

Hint: Label Encoder for obtaining a numeric representation, refer to the [link](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).

In [30]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
df['label']=le.fit_transform(df['label'])
df['label'].head()

Unnamed: 0,label
126,3
137,2
138,2
155,1
205,3


In [31]:
df['text'].shape, df['label'].shape

((15291,), (15291,))

## BoW

### TF IDF
 tf-idf aims to represent the number of times a given word appears in a document (a movie review in our case) relative to the number of documents in the corpus that the word appears in — where, words that appear in many documents have a value closer to zero and words that appear in less documents have values closer to 1.




In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
alltext = df['text'].astype(str)
tfidf_feature = tfidf_vectorizer.fit_transform(alltext)

### Split the data into train and test sets

Hint: Refer to[Train-Test split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

In [33]:
from sklearn.model_selection import train_test_split
X_train, X_test,y_train,y_test = train_test_split(tfidf_feature,df['label'],test_size = 0.2,random_state=42)

### Apply the Classification


In [34]:
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

# Create an object for all the algorithms
model1 = DecisionTreeClassifier()
model2 = KNeighborsClassifier(n_neighbors=8)
model3 = SGDClassifier()
model4 = SVC(kernel='linear')

models = [model1, model2, model3, model4]

for model in models:
    model.fit(X_train, y_train)         # fit the model
    y_pred= model.predict(X_test)       # then predict on the test set
    accuracy= accuracy_score(y_test, y_pred)
    print("Accuracy (in %):", model, "is", accuracy)


Accuracy (in %): DecisionTreeClassifier() is 0.7724746649231775
Accuracy (in %): KNeighborsClassifier(n_neighbors=8) is 0.8336057535142203
Accuracy (in %): SGDClassifier() is 0.8692383131742399
Accuracy (in %): SVC(kernel='linear') is 0.8823144818568159


## Word2Vec

###Load pre-trained Word2Vec

Lets now proceed to load the complete pretrained vectors.

In [38]:
model = gensim.models.KeyedVectors.load_word2vec_format('/content/data/AIML_DS_GOOGLENEWS-VECTORS-NEGATIVE-300_STD.bin', binary=True, limit=500000)

### Word2Vec representation

Convert each document into average of the word2vec vectors of all valid words in document

Note: Below code cell take some time to compile

In [42]:
# Creating empty final dataframe
docs_vectors = pd.DataFrame()

# Removing stop words
stopwords = nltk.corpus.stopwords.words('english')
text = df['text'].astype(str)
# Looping through each document and cleaning it
for doc in text.str.lower().str.replace('[^a-z ]', ''):
    temp = pd.DataFrame()
    for word in doc.split(' '):
      # If word is not present in stopwords then (try)
        if word not in stopwords:
            try:
                # If word is present in embeddings then get the vector representation and append it to temporary dataframe
                word_vec = model[word]#: Call the genism model apply word
                temp = temp.append(pd.Series(word_vec), ignore_index = True)
            except:
                pass
    # Take the average of vectors for each word
    doc_vector = temp.mean() # Find the mean of temp
    # Append each document value to the final dataframe
    docs_vectors = pd.concat([docs_vectors, pd.DataFrame([doc_vector])], ignore_index=True)

docs_vectors.shape



(15291, 0)

### Split the data into train and test sets

Hint: Refer to[Train-Test split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

In [43]:
from sklearn.model_selection import train_test_split
X_train, X_test,y_train,y_test = train_test_split(docs_vectors,df['label'],test_size = 0.2,random_state=42)

### Apply the Classification


In [45]:
print(docs_vectors.isnull().sum())
print(docs_vectors.shape)
print(y_train.isnull().sum())


Series([], dtype: float64)
(15291, 0)
0


In [52]:
print(docs_vectors.shape)
print(docs_vectors.head())


(15291, 0)
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]


In [46]:
docs_vectors = docs_vectors.fillna(0)


In [48]:
print(f"docs_vectors shape: {docs_vectors.shape}")
print(f"Labels shape: {df['label'].shape}")


docs_vectors shape: (15291, 0)
Labels shape: (15291,)


In [49]:
print(docs_vectors.dtypes)


Series([], dtype: object)


In [51]:
print("X_train shape:", X_train.shape)
print("X_train types:", X_train.dtypes)
print("X_train head:", X_train.head())

print("y_train shape:", y_train.shape)
print("y_train head:", y_train.head())


X_train shape: (12232, 0)
X_train types: Series([], dtype: object)
X_train head: Empty DataFrame
Columns: []
Index: [5562, 12752, 11949, 245, 3017]
y_train shape: (12232,)
y_train head: 120731    3
180657    1
174373    3
13244     3
81273     1
Name: label, dtype: int64
