# **ARTIFICIAL INTELLIGENCE**
## **_PROJECT-1_**
### **_NEWS CLASSIFICATION_**

`NAME:` **_ISHIKA SHARMA_**

`EMAIL ID:` ishikasharma.aug2001@gmail.com


In [4]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [2]:
import pandas as pd

#reading the files into dataframes
fake = pd.read_csv("Fake.csv")
genuine = pd.read_csv("True.csv")

#adding a 'target' column as the label for our model (0 = fake, 1 = genuine)
fake['target'] = 0
genuine['target'] = 1

#concatenating data to make a single dataFrame
data = pd.concat([fake, genuine], axis=0)

#resetting the index of entire dataset
data = data.reset_index(drop=True)

#dropping extra columns
data = data.drop(['subject','date','title'], axis=1)

#check data
print(data.columns)

Index(['text', 'target'], dtype='object')


### TOKENIZATION
Tokenization divides a continuous piece of text into distinct units called tokens (here, words and punctuation marks).
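
To make the idea concrete, here is a minimal sketch of word-level tokenization using only a regular expression. The `simple_tokenize` helper is a hypothetical name and a simplification: NLTK's `word_tokenize` applies more elaborate rules (for example for contractions), so the two can disagree on some inputs.

```python
import re

def simple_tokenize(text):
    # Hypothetical helper: runs of word characters become tokens, and each
    # remaining non-space character (punctuation) becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("On Friday, it was revealed")
print(tokens)
```

Keeping punctuation as separate tokens makes it easy to filter out in a later step.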

In [5]:
from nltk.tokenize import word_tokenize #, sent_tokenize

data['text'] = data['text'].apply(word_tokenize)

print(data.head(10))

                                                text  target
0  [Donald, Trump, just, couldn, t, wish, all, Am...       0
1  [House, Intelligence, Committee, Chairman, Dev...       0
2  [On, Friday, ,, it, was, revealed, that, forme...       0
3  [On, Christmas, day, ,, Donald, Trump, announc...       0
4  [Pope, Francis, used, his, annual, Christmas, ...       0
5  [The, number, of, cases, of, cops, brutalizing...       0
6  [Donald, Trump, spent, a, good, portion, of, h...       0
7  [In, the, wake, of, yet, another, court, decis...       0
8  [Many, people, have, raised, the, alarm, regar...       0
9  [Just, when, you, might, have, thought, we, d,...       0


### STEMMING
Stemming strips suffixes from a word so that its different forms reduce to a common root, for example "revealed" becomes "reveal".
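
The suffix-stripping idea can be sketched with a toy function. This is not the Snowball algorithm used in this notebook: `naive_stem` and its suffix list are assumptions chosen for illustration, and a real stemmer applies many more rules.

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    # Hypothetical toy stemmer: lowercase the word, then strip the first
    # matching suffix, provided at least 3 characters remain.
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["Revealed", "regarding", "cases", "his"]]
print(stems)
```

The minimum-length check is what keeps a short word such as "his" from being cut down to "hi".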

In [6]:
from nltk.stem.snowball import SnowballStemmer

#Snowball ("Porter2") stemmer for English
stemmer = SnowballStemmer("english")

def stem_it(text):
    #stem every token in the list
    return [stemmer.stem(word) for word in text]

data['text'] = data['text'].apply(stem_it)

print(data.head(10))

                                                text  target
0  [donald, trump, just, couldn, t, wish, all, am...       0
1  [hous, intellig, committe, chairman, devin, nu...       0
2  [on, friday, ,, it, was, reveal, that, former,...       0
3  [on, christma, day, ,, donald, trump, announc,...       0
4  [pope, franci, use, his, annual, christma, day...       0
5  [the, number, of, case, of, cop, brutal, and, ...       0
6  [donald, trump, spent, a, good, portion, of, h...       0
7  [in, the, wake, of, yet, anoth, court, decis, ...       0
8  [mani, peopl, have, rais, the, alarm, regard, ...       0
9  [just, when, you, might, have, thought, we, d,...       0


### STOPWORD REMOVAL
A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that carries little information on its own and is therefore usually filtered out before further processing.
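
As a rough sketch, list-based stop word removal can be compared with the length-based filter applied later in this section. The `STOP_WORDS` set and the `remove_stop_words` helper are hypothetical and far smaller than a real stop word list.

```python
# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = {"the", "a", "an", "in", "on", "it", "was", "that"}

def remove_stop_words(tokens):
    # Keep only tokens that are not in the stop word set.
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = ["on", "friday", ",", "it", "was", "reveal", "that", "former"]
by_list = remove_stop_words(tokens)
by_length = [tok for tok in tokens if len(tok) > 2]
print(by_list)
print(by_length)
```

The two approaches trade off differently: the list keeps punctuation but drops "was" and "that", while the length filter drops punctuation and very short words but keeps any stop word of three or more characters.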

In [None]:
#first way: NLTK's stop word list (needs nltk.download('stopwords'))
#from nltk.corpus import stopwords
#print(stopwords.words('english'))

In [7]:
#another way: approximate stop word removal by dropping tokens of 2 characters or fewer
def stop_it(t):
  dt = [word for word in t if len(word) > 2]
  return dt

data['text'] = data['text'].apply(stop_it)

print(data.head(10))

                                                text  target
0  [donald, trump, just, couldn, wish, all, ameri...       0
1  [hous, intellig, committe, chairman, devin, nu...       0
2  [friday, was, reveal, that, former, milwauke, ...       0
3  [christma, day, donald, trump, announc, that, ...       0
4  [pope, franci, use, his, annual, christma, day...       0
5  [the, number, case, cop, brutal, and, kill, pe...       0
6  [donald, trump, spent, good, portion, his, day...       0
7  [the, wake, yet, anoth, court, decis, that, de...       0
8  [mani, peopl, have, rais, the, alarm, regard, ...       0
9  [just, when, you, might, have, thought, get, b...       0


In [8]:
#joining the tokens back into a single string per article for the vectorizer
data['text'] = data['text'].apply(' '.join)

print(data.head(10))

                                                text  target
0  donald trump just couldn wish all american hap...       0
1  hous intellig committe chairman devin nune hav...       0
2  friday was reveal that former milwauke sheriff...       0
3  christma day donald trump announc that would b...       0
4  pope franci use his annual christma day messag...       0
5  the number case cop brutal and kill peopl colo...       0
6  donald trump spent good portion his day his go...       0
7  the wake yet anoth court decis that derail don...       0
8  mani peopl have rais the alarm regard the fact...       0
9  just when you might have thought get break fro...       0


### SPLITTING dataset for training our model

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data['text'], data['target'], test_size=0.25)

# display(X_train.head(10))
# print('\n')
# display(y_train.head(10))
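
For intuition, a hold-out split can be sketched in plain Python as a shuffle followed by a cut. This is a simplified illustration, not scikit-learn's implementation; the `simple_split` helper and its `seed` parameter are assumptions.

```python
import random

def simple_split(items, test_size=0.25, seed=0):
    # Hypothetical helper: shuffle a copy of the indices with a fixed seed,
    # then reserve the first `test_size` fraction as the test set.
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(items) * test_size))
    test = [items[i] for i in indices[:n_test]]
    train = [items[i] for i in indices[n_test:]]
    return train, test

train, test = simple_split(list(range(100)))
print(len(train), len(test))
```

Fixing the seed makes the split repeatable, so accuracy figures can be compared across runs.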

### VECTORIZATION

Vectorization converts text into numerical form. It produces a matrix in which each row represents a document (here, a news article) and each column represents a feature (a term from the vocabulary).

#### _TF (Term Frequency)_

Term Frequency measures how often a word appears in a document.
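
A minimal sketch of term frequency, assuming the length-normalized definition (count divided by the number of tokens in the document). Raw counts are an equally common definition; the `term_frequency` helper is a hypothetical name.

```python
from collections import Counter

def term_frequency(tokens):
    # Relative frequency: count of each term divided by document length.
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

tf = term_frequency(["trump", "wish", "american", "trump"])
print(tf)
```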

#### _Term Frequency-Inverse Document Frequency(TF-IDF)_

TF-IDF measures how important a word is to a document within the corpus (dataset).
- It is the product of Term Frequency and Inverse Document Frequency.
- Inverse Document Frequency down-weights words that occur in many documents, on the premise that rarer words are more informative.

__IDF(t) = log_e(Total number of documents / Number of documents with term t in it)__
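
The formula above can be worked through on a toy corpus. The three documents and the helper name below are assumptions for illustration. Note that scikit-learn's `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each row, so values from this sketch will not match the matrix printed in this notebook.

```python
import math

# Hypothetical toy corpus of three already-tokenized documents.
docs = [
    ["trump", "wish", "american"],
    ["trump", "announc", "christma"],
    ["pope", "christma", "messag"],
]

def inverse_document_frequency(docs):
    # IDF(t) = log_e(total documents / documents containing t)
    n_docs = len(docs)
    vocabulary = {term for doc in docs for term in doc}
    return {
        term: math.log(n_docs / sum(term in doc for doc in docs))
        for term in vocabulary
    }

idf = inverse_document_frequency(docs)

# TF-IDF of each term in the first document: relative frequency times IDF.
first = docs[0]
tfidf_first = {t: (first.count(t) / len(first)) * idf[t] for t in set(first)}
print(idf["trump"], idf["pope"])
print(tfidf_first)
```

Here "trump" appears in two of the three documents and so receives a lower IDF than "pope", which appears in only one.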





In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
my_tfidf = TfidfVectorizer(max_df=0.7)

tfidf_train = my_tfidf.fit_transform(X_train)
tfidf_test = my_tfidf.transform(X_test)

print(tfidf_train)

  (0, 87044)	0.022145025191346538
  (0, 17945)	0.0369924568867777
  (0, 73398)	0.0384114481757907
  (0, 41051)	0.023598486798303443
  (0, 31388)	0.024787670870778336
  (0, 81455)	0.020931363444334314
  (0, 12263)	0.06766956891071564
  (0, 36790)	0.037904780260715885
  (0, 11295)	0.045599295727874536
  (0, 11732)	0.054265897911665834
  (0, 63171)	0.08354541137314743
  (0, 22108)	0.04739412271770294
  (0, 66856)	0.025985293181999148
  (0, 31242)	0.03579136278425623
  (0, 9886)	0.020992245344637648
  (0, 76586)	0.0411444879923586
  (0, 61503)	0.02063179534053521
  (0, 26780)	0.024975066271438487
  (0, 60083)	0.054463048008756544
  (0, 57498)	0.038972552367131534
  (0, 87286)	0.03615113404541332
  (0, 65357)	0.02679535657201108
  (0, 81430)	0.04971184069552565
  (0, 55143)	0.049854107569574285
  (0, 48989)	0.052962090943056164
  :	:
  (33672, 85467)	0.06411896595956618
  (33672, 30037)	0.0687000305930678
  (33672, 51264)	0.04362551842361566
  (33672, 17565)	0.038062811400065164
  (33672, 5

### LOGISTIC REGRESSION

In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model_1 = LogisticRegression(max_iter=900)
model_1.fit(tfidf_train, y_train)

pred_1 = model_1.predict(tfidf_test)
cr1 = accuracy_score(y_test, pred_1)
print(cr1*100)

98.80623608017818
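
For reference, accuracy is the fraction of test articles whose predicted label matches the true label. A minimal sketch with made-up labels (the `accuracy` helper and the label lists are hypothetical):

```python
def accuracy(y_true, y_pred):
    # Fraction of positions where the predicted label equals the true label.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels: 0 = fake, 1 = genuine.
score = accuracy([0, 1, 1, 0], [0, 1, 0, 0])
print(score * 100)
```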


### PASSIVE AGGRESSIVE CLASSIFIER

In [19]:
from sklearn.linear_model import PassiveAggressiveClassifier

model = PassiveAggressiveClassifier(max_iter=50)
model.fit(tfidf_train, y_train)

y_pred = model.predict(tfidf_test)
accscore = accuracy_score(y_test, y_pred)
print('The accuracy of prediction is: ', accscore*100)

The accuracy of prediction is:  99.60801781737194
