# **Feature Extraction**

    --> The mapping from textual data to real-valued vectors is called feature extraction.

<br>

- _**Bag of Words (BoW)**_ : the list of unique words in the text corpus; each document is then described by how often each of those words occurs in it
- **_Term Frequency-Inverse Document Frequency (TF-IDF)_** : weights each word by how often it appears in a document, scaled down by how many documents contain it (TF \* IDF)

  Term Frequency (TF) : (number of times term t appears in a document) / (number of terms in the document)

  Inverse Document Frequency (IDF) : log(N/n), where N is the number of documents and n is the number of documents in which term t appears
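
The two definitions above can be worked through by hand on a toy corpus. The sketch below is only an illustration: the three documents and the helper names `tf`, `idf`, and `tf_idf` are made up for this example and are not part of the notebook's pipeline.

```python
import math

# Hypothetical three-document corpus, already lower-cased and tokenised.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "a bird sang".split(),
]

# Bag of words: the sorted list of unique words across the corpus.
vocabulary = sorted({word for doc in docs for word in doc})

def tf(term, doc):
    """Term frequency: occurrences of `term` divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Inverse document frequency: log(N / n).

    Assumes `term` occurs in at least one document, otherwise n would be 0.
    """
    n = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n)

def tf_idf(term, doc, corpus):
    """Product of the two quantities defined above."""
    return tf(term, doc) * idf(term, corpus)

print(vocabulary)
print(tf("the", docs[0]))            # 2 occurrences out of 6 tokens
print(idf("the", docs))              # log(3 / 2): "the" is in 2 of 3 documents
print(tf_idf("cat", docs[0], docs))  # (1 / 6) * log(3 / 1)
```

A word such as "the", which appears in most documents, gets a small IDF, so its TF-IDF weight stays low even when it is frequent within one document.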


In [None]:
# import Dependencies

import numpy as np
import pandas as pd
import re  # regular expression
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
nltk.download('stopwords')

In [None]:
# print the English stopwords

print(stopwords.words('english'))

# Data Pre-processing


In [None]:
# loading the dataset
newsData = pd.read_csv('../DataSet_Collection/train.csv')

In [None]:
newsData.shape

In [None]:
newsData.head()

In [None]:
# count the missing values in each column, if there are any

newsData.isnull().sum()

In [None]:
# replace the null values with the placeholder string 'Null'
newsData = newsData.fillna('Null')

In [None]:

newsData.isnull().sum()

In [None]:
# merge the author and title columns into a single text column to work with
newsData['content'] = newsData['author'] + ' ' + newsData['title']
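
Filling the missing values first matters for this merge: in pandas, concatenating a string with a missing value yields a missing value. The sketch below shows this on a hypothetical two-row frame (`demo` and its contents are invented for illustration and are not taken from `train.csv`).

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame; the second row has no author.
demo = pd.DataFrame({
    "author": ["Jane Doe", np.nan],
    "title": ["First headline", "Second headline"],
})

# Without filling, the missing author makes the whole combined value missing.
unfilled = demo["author"] + " " + demo["title"]
print(unfilled.isnull().tolist())

# After filling, every row yields a usable string.
filled = demo.fillna("Null")
combined = filled["author"] + " " + filled["title"]
print(combined.tolist())
```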

In [None]:
print(newsData['content'])

In [None]:
# separating the data and label
X = newsData.drop(columns='label')
Y = newsData['label']

# 👍Stemming :

    Stemming is the process of reducing a word to its root word (stem).
    Example : acting, acted, acts --> act
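
As a quick check of what the Porter stemmer does, the sketch below stems a few inflected forms of one verb. The word list is a hypothetical example; `PorterStemmer` needs no corpus download, so this runs with only NLTK installed.

```python
from nltk.stem.porter import PorterStemmer

port_stem = PorterStemmer()

# Inflected forms of the same verb collapse to one stem.
words = ["acting", "acted", "acts"]
stems = [port_stem.stem(word) for word in words]
print(stems)
```

Porter stemming strips suffixes by rule, so a stem is not guaranteed to be a dictionary word, and related nouns do not always reduce to the same stem as the verb.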


In [None]:
port_stem = PorterStemmer()

In [None]:
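# build the stopword set once; calling stopwords.words() for every word is very slow
stop_words = set(stopwords.words('english'))

def stemming(content):
    # keep letters only, then lower-case and split into words
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    stemmed_content = stemmed_content.lower()
    stemmed_content = stemmed_content.split()
    # drop stopwords and stem the remaining words
    stemmed_content = [port_stem.stem(word) for word in stemmed_content
                       if word not in stop_words]
    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content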

In [None]:
newsData['content'] = newsData['content'].apply(stemming)
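
To see what the cleaning steps do before any stemming happens, here is a minimal standard-library sketch. The function name `clean_text`, the sample headline, and the tiny `demo_stopwords` set are assumptions made for illustration; the notebook itself uses the full NLTK English stopword list.

```python
import re

# Hypothetical stand-in for the NLTK stopword list, kept tiny so the
# example runs without downloading anything.
demo_stopwords = {"the", "of", "in", "a", "is"}

def clean_text(content):
    """Apply the non-stemming steps: strip non-letters, lower-case, drop stopwords."""
    letters_only = re.sub('[^a-zA-Z]', ' ', content)  # digits and punctuation become spaces
    tokens = letters_only.lower().split()
    return ' '.join(token for token in tokens if token not in demo_stopwords)

print(clean_text("Jane Doe: The 5 Biggest Stories of 2016!"))
```

Digits and punctuation are removed by the regular expression, so only the author name and the content words of the title survive.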

In [None]:
# separating the data and label
X = newsData['content'].values
Y = newsData['label'].values

# **TF-IDF :**


In [None]:
# convert the textual data to feature vectors
vectorizer = TfidfVectorizer()

In [None]:
vectorizer.fit(X)

X = vectorizer.transform(X)
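
The result of the transform is a sparse matrix with one row per document and one column per vocabulary term. The sketch below shows this on a hypothetical mini-corpus standing in for the stemmed `content` column. Note that with its default settings `TfidfVectorizer` uses a smoothed IDF, ln((1 + N) / (1 + n)) + 1, and L2-normalises each row, so its values will not match the plain log(N/n) definition given at the start of this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for the stemmed `content` column.
corpus = ["cat sat mat", "dog sat", "bird sang"]

vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(corpus)  # fit and transform in one call

print(features.shape)          # (number of documents, vocabulary size)
print(vectorizer.vocabulary_)  # maps each term to its column index
print(features.toarray().round(2))
```

`fit_transform` is equivalent to calling `fit` and then `transform` on the same data, which is what the cells above do in two steps.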