<a href="https://colab.research.google.com/github/AhsenRiaz/ML-Data/blob/main/04_feature_extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature Extraction
By using TF-IDF, you can quickly find the important words in each document and create a numerical representation of the text data that is suitable for machine learning tasks.

In this example, we will use the TF-IDF technique.

**Term Frequency (TF)**: For each word in a document, TF counts how many times it appears. Words that appear more often in a document get higher TF scores for that document.

**Inverse Document Frequency (IDF)**: IDF measures how unique or rare a word is across all documents. If a word appears in many documents, it gets a lower IDF score because it's common. If a word appears in only a few documents, it gets a higher IDF score because it's rare.

**TF-IDF Score**: To get the final TF-IDF score for a word in a document, you multiply its TF score (how often it appears in the document) by its IDF score (how unique it is across all documents).
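
As a rough illustration of the three definitions above, the sketch below computes TF-IDF by hand on a tiny made-up corpus. It assumes the textbook form `idf = ln(N / df)`, where `N` is the number of documents and `df` is the number of documents containing the word. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ. The corpus and the names `docs` and `tf_idf` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not taken from the news dataset.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each word.
df = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(word, tokens):
    """Textbook TF-IDF: raw count in the document times ln(N / df)."""
    tf = tokens.count(word)
    idf = math.log(n_docs / df[word])
    return tf * idf

# Score every distinct word of the first document.
scores = {word: tf_idf(word, tokenized[0]) for word in set(tokenized[0])}
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word:>4}  {score:.3f}")
```

In this toy corpus, "cat" and "mat" score highest for the first document because each appears in only one document, while "the" is frequent in the document but also appears in two of the three documents.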

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

In [12]:
news_dataset = pd.read_csv('/content/train.csv')

In [15]:
news_dataset.head()

|   | label | news |
|---|-------|------|
| 0 | False | Says the Annies List political group supports ... |
| 1 | True | When did the decline of coal start? It started... |
| 2 | True | Hillary Clinton agrees with John McCain "by vo... |
| 3 | False | Health care reform legislation is likely to ma... |
| 4 | True | The economic turnaround started at the end of ... |


In [16]:
news_dataset.isnull().sum()

# If any values are missing, replace them with empty strings:
# news_dataset = news_dataset.fillna('')


label    0
news     0
dtype: int64
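
The check above reports no missing values, so no fix is needed for this dataset. For a dataset that does have gaps, a minimal sketch of the fill step might look like the following. The frame `toy` is hypothetical, and filling only the text column (rather than the whole frame) is an assumption made here so that the labels stay untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing text entry.
toy = pd.DataFrame({
    "label": [True, False, True],
    "news": ["first claim", np.nan, "third claim"],
})

# Count missing values per column before the fix.
print(toy.isnull().sum())

# Fill only the text column, leaving the label column as it is.
toy["news"] = toy["news"].fillna("")

# Count missing values in the text column after the fix.
missing_after = int(toy["news"].isnull().sum())
print(missing_after)
```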

In [31]:
X = news_dataset['news'].values
Y = news_dataset['label'].values

In [32]:
print(X)

['Says the Annies List political group supports third-trimester abortions on demand.'
 'When did the decline of coal start? It started when natural gas took off that started to begin in (President George W.) Bushs administration.'
 'Hillary Clinton agrees with John McCain "by voting to give George Bush the benefit of the doubt on Iran."'
 ...
 'Says an alternative to Social Security that operates in Galveston County, Texas, has meant that participants will retire with a whole lot more money than under Social Security.'
 'On lifting the U.S. Cuban embargo and allowing travel to Cuba.'
 "The Department of Veterans Affairs has a manual out there telling our veterans stuff like, 'Are you really of value to your community?' You know, encouraging them to commit suicide."]


In [33]:
print(Y)

[False  True  True ...  True False False]


In [34]:
vectorizer = TfidfVectorizer()

In [35]:
vectorized_text = vectorizer.fit_transform(X)

In [36]:
print(vectorized_text)

  (0, 3278)	0.3399228124530313
  (0, 7728)	0.13423338099593773
  (0, 615)	0.2886717774483849
  (0, 11296)	0.40886628948153914
  (0, 11036)	0.2747906206356454
  (0, 10709)	0.2672566797277703
  (0, 5115)	0.2918323060577216
  (0, 8376)	0.2847775347384892
  (0, 6639)	0.3217759953815641
  (0, 1044)	0.4270131065530063
  (0, 10988)	0.06789988196273675
  (0, 9676)	0.11063502017569249
  (1, 751)	0.18344925558463465
  (1, 1964)	0.2637044974582498
  (1, 4910)	0.20345578434235373
  (1, 8554)	0.13824225599506434
  (1, 5687)	0.06817998938144511
  (1, 1532)	0.2738716739054018
  (1, 11110)	0.07294419071852533
  (1, 10980)	0.10044548692913545
  (1, 7674)	0.19430717148304494
  (1, 11138)	0.184608104001816
  (1, 4860)	0.20468163128243344
  (1, 7418)	0.2540938948798941
  (1, 10426)	0.45919377613538964
  :	:
  (10239, 6853)	0.2728576807709281
  (10239, 10594)	0.2642652775131411
  (10239, 3989)	0.2642652775131411
  (10239, 10918)	0.24004465308468284
  (10239, 8996)	0.20614540733281858
  (10239, 10660)	0.229