# Scope


The scope of the project is to develop a system that recognizes fake news.
To do so, we train a model that classifies news items as true or fake.
After training, we evaluate the model on a set of metrics: classification accuracy, training time, and classification time.
Next, we implement an application that accepts a news text as input and returns its category (fake or true). Finally, we apply relationship extraction methods to the news collection.

#### Import libraries

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# List every file under the local project directory

import os
for dirname, _, filenames in os.walk('C:/Users/Alex/Desktop/Assessment3'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

C:/Users/Alex/Desktop/Assessment3\covid_data.csv
C:/Users/Alex/Desktop/Assessment3\Fake_News_Classifier (1).ipynb
C:/Users/Alex/Desktop/Assessment3\Fake_News_Classifier.ipynb
C:/Users/Alex/Desktop/Assessment3\submit.csv
C:/Users/Alex/Desktop/Assessment3\test.csv
C:/Users/Alex/Desktop/Assessment3\train.csv


In [2]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import re

In [3]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Alex\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

#### Import Data - Data manipulation

In [5]:
news_dataset_train = pd.read_csv('C:/Users/Alex/Desktop/Assessment3/train.csv')
news_dataset_test = pd.read_csv('C:/Users/Alex/Desktop/Assessment3/test.csv')

In [6]:
news_dataset_train.shape

(20800, 5)

In [7]:
news_dataset_train.head(5)

| | id | title | author | text | label |
|---|---|---|---|---|---|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
| 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |
| 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 |
| 4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 |


In [8]:
news_dataset_train.isnull().sum()

id           0
title      558
author    1957
text        39
label        0
dtype: int64

In [9]:
news_dataset_train=news_dataset_train.fillna('')

In [10]:
news_dataset_train['content'] = news_dataset_train['title'] + ' ' + news_dataset_train['author']
print(news_dataset_train['content'])

0        House Dem Aide: We Didn’t Even See Comey’s Let...
1        FLYNN: Hillary Clinton, Big Woman on Campus - ...
2        Why the Truth Might Get You Fired Consortiumne...
3        15 Civilians Killed In Single US Airstrike Hav...
4        Iranian woman jailed for fictional unpublished...
                               ...                        
20795    Rapper T.I.: Trump a ’Poster Child For White S...
20796    N.F.L. Playoffs: Schedule, Matchups and Odds -...
20797    Macy’s Is Said to Receive Takeover Approach by...
20798    NATO, Russia To Hold Parallel Exercises In Bal...
20799              What Keeps the F-35 Alive David Swanson
Name: content, Length: 20800, dtype: object


In [11]:
X=news_dataset_train.drop(columns='label',axis=1)
y=news_dataset_train['label']

In [12]:
print(X)

          id                                              title  \
0          0  House Dem Aide: We Didn’t Even See Comey’s Let...   
1          1  FLYNN: Hillary Clinton, Big Woman on Campus - ...   
2          2                  Why the Truth Might Get You Fired   
3          3  15 Civilians Killed In Single US Airstrike Hav...   
4          4  Iranian woman jailed for fictional unpublished...   
...      ...                                                ...   
20795  20795  Rapper T.I.: Trump a ’Poster Child For White S...   
20796  20796  N.F.L. Playoffs: Schedule, Matchups and Odds -...   
20797  20797  Macy’s Is Said to Receive Takeover Approach by...   
20798  20798  NATO, Russia To Hold Parallel Exercises In Bal...   
20799  20799                          What Keeps the F-35 Alive   

                                          author  \
0                                  Darrell Lucus   
1                                Daniel J. Flynn   
2                             Consortiu

In [13]:
print(y)

0        1
1        0
2        1
3        1
4        1
        ..
20795    0
20796    0
20797    0
20798    1
20799    1
Name: label, Length: 20800, dtype: int64


In [14]:
port_stem=PorterStemmer()

The following function replaces every non-alphabetic character (digits, punctuation) with a space, converts the text to lower case, splits it into words, drops English stop words, stems each remaining word with the Porter stemmer, and joins the stems back into a single string.

In [15]:
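# Build the stop-word set once; a set lookup avoids reloading the list for every word
english_stopwords = set(stopwords.words('english'))

def stemming(content):
    # Replace every non-letter character with a space
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    # Lowercase the text and split it into words
    stemmed_content = stemmed_content.lower().split()
    # Drop stop words and stem the remaining words
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in english_stopwords]
    # Join the stems back into one string
    return ' '.join(stemmed_content)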
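
As a quick check of the steps described above, the following sketch applies the same cleaning logic to one of the titles from the training data. The small `demo_stopwords` set is a hypothetical stand-in for `stopwords.words('english')`, used only so the example runs without downloading the NLTK corpus; `stemming_demo` is likewise an illustrative name, not part of the notebook.

```python
import re
from nltk.stem.porter import PorterStemmer

port_stem = PorterStemmer()

# Assumption: a tiny stand-in for stopwords.words('english'), for illustration only
demo_stopwords = {"in", "the", "a", "of", "and"}

def stemming_demo(content):
    # Replace non-letters with spaces, lowercase, and split into words
    words = re.sub('[^a-zA-Z]', ' ', content).lower().split()
    # Drop stop words and stem the remaining words
    return ' '.join(port_stem.stem(w) for w in words if w not in demo_stopwords)

cleaned = stemming_demo("15 Civilians Killed In Single US Airstrike")
print(cleaned)
```

The digits disappear, "In" is dropped as a stop word, and the remaining words are reduced to stems such as `civilian` and `kill`.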

In [16]:
news_dataset_train['content'] = news_dataset_train['content'].apply(stemming)

In [17]:
print(news_dataset_train['content'])

0        hous dem aid even see comey letter jason chaff...
1        flynn hillari clinton big woman campu breitbar...
2                   truth might get fire consortiumnew com
3        civilian kill singl us airstrik identifi jessi...
4        iranian woman jail fiction unpublish stori wom...
                               ...                        
20795    rapper trump poster child white supremaci jero...
20796    n f l playoff schedul matchup odd new york tim...
20797    maci said receiv takeov approach hudson bay ne...
20798    nato russia hold parallel exercis balkan alex ...
20799                            keep f aliv david swanson
Name: content, Length: 20800, dtype: object


## Modeling

In [18]:
X = news_dataset_train['content'].values
Y = news_dataset_train['label'].values

In [19]:
print(X)

['hous dem aid even see comey letter jason chaffetz tweet darrel lucu'
 'flynn hillari clinton big woman campu breitbart daniel j flynn'
 'truth might get fire consortiumnew com' ...
 'maci said receiv takeov approach hudson bay new york time michael j de la merc rachel abram'
 'nato russia hold parallel exercis balkan alex ansari'
 'keep f aliv david swanson']


In [20]:
print(Y)

[1 0 1 ... 0 1 1]


`TfidfVectorizer` is a scikit-learn class that converts a collection of text documents into numerical vectors that machine learning algorithms can take as input. It is commonly used in natural language processing (NLP) tasks such as text classification, clustering, and information retrieval.

It first tokenizes the input text, dividing it into individual words (tokens), and then computes a term frequency-inverse document frequency (TF-IDF) score for each token in each document. The term frequency is the number of times a token appears in a document, and the inverse document frequency measures how rare the token is across the entire collection. The TF-IDF score is the product of the two. With scikit-learn's default settings the inverse document frequency is smoothed, `idf(t) = ln((1 + n) / (1 + df(t))) + 1` for `n` documents, and each document vector is scaled to unit Euclidean length.

The scores form a sparse matrix with one row per document and one column per vocabulary token. This matrix is the numerical input passed to the classifiers below.
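
To make the calculation concrete, here is a minimal sketch that recomputes the scores by hand on a toy corpus and compares them with `TfidfVectorizer`. It assumes scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row); the three documents and the `_demo` names are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus of already-stemmed text
docs = ["trump poster child", "nato russia exercis", "trump nato"]

vectorizer_demo = TfidfVectorizer()
tfidf = vectorizer_demo.fit_transform(docs).toarray()

# Vocabulary terms ordered by their column index in the matrix
vocab = sorted(vectorizer_demo.vocabulary_, key=vectorizer_demo.vocabulary_.get)

# Term frequency: raw count of each term in each document
counts = np.array(
    [[doc.split().count(term) for term in vocab] for doc in docs], dtype=float
)

# Smoothed inverse document frequency (scikit-learn default)
n_docs = len(docs)
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# TF-IDF product, then scale each row to unit Euclidean length
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

matches = np.allclose(manual, tfidf)
print(matches)
```

Terms that occur in fewer documents (for example `poster`) receive a larger weight within their row than terms shared across documents (for example `trump`).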

In [21]:
vectorizer=TfidfVectorizer()
vectorizer.fit(X)
X=vectorizer.transform(X)

In [22]:
print(X)

  (0, 15686)	0.28485063562728646
  (0, 13473)	0.2565896679337957
  (0, 8909)	0.3635963806326075
  (0, 8630)	0.29212514087043684
  (0, 7692)	0.24785219520671603
  (0, 7005)	0.21874169089359144
  (0, 4973)	0.233316966909351
  (0, 3792)	0.2705332480845492
  (0, 3600)	0.3598939188262559
  (0, 2959)	0.2468450128533713
  (0, 2483)	0.3676519686797209
  (0, 267)	0.27010124977708766
  (1, 16799)	0.30071745655510157
  (1, 6816)	0.1904660198296849
  (1, 5503)	0.7143299355715573
  (1, 3568)	0.26373768806048464
  (1, 2813)	0.19094574062359204
  (1, 2223)	0.3827320386859759
  (1, 1894)	0.15521974226349364
  (1, 1497)	0.2939891562094648
  (2, 15611)	0.41544962664721613
  (2, 9620)	0.49351492943649944
  (2, 5968)	0.3474613386728292
  (2, 5389)	0.3866530551182615
  (2, 3103)	0.46097489583229645
  :	:
  (20797, 13122)	0.2482526352197606
  (20797, 12344)	0.27263457663336677
  (20797, 12138)	0.24778257724396507
  (20797, 10306)	0.08038079000566466
  (20797, 9588)	0.174553480255222
  (20797, 9518)	0.295420

In [23]:
X_train,X_val,y_train,y_val=train_test_split(X,Y,test_size=0.2,stratify=Y,random_state=2)

### Logistic regression

In [24]:
model_lg=LogisticRegression()
model_lg.fit(X_train,y_train)

In [25]:
X_train_prediction=model_lg.predict(X_train)
train_accuracy=accuracy_score(y_train,X_train_prediction)
print(f'Train set accuracy: {train_accuracy}')
X_val_prediction=model_lg.predict(X_val)
val_accuracy=accuracy_score(y_val,X_val_prediction)
print(f'Validation set accuracy: {val_accuracy}')

Train set accuracy: 0.9865985576923076
Validation set accuracy: 0.9790865384615385


### SVM

In [26]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
model_svc = SVC()
model_svc.fit(X_train, y_train)

In [27]:
X_train_prediction = model_svc.predict(X_train)
X_val_prediction = model_svc.predict(X_val)
# accuracy_score expects the true labels first, then the predictions
train_accuracy = accuracy_score(y_train, X_train_prediction)
print(f'Train set accuracy: {train_accuracy}')
val_accuracy = accuracy_score(y_val, X_val_prediction)
print(f'Validation set accuracy: {val_accuracy}')

Train set accuracy: 0.9990985576923077
Validation set accuracy: 0.9889423076923077
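
The scope also lists training time and classification time as metrics. One possible way to measure them is sketched below with `time.perf_counter`. The helper `time_model` and the toy texts and labels are hypothetical and stand in for the real `X_train`, `y_train`, and `X_val`; timings on such a small sample say nothing about the real dataset.

```python
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def time_model(model, X_fit, y_fit, X_eval):
    # Time the training step
    start = time.perf_counter()
    model.fit(X_fit, y_fit)
    fit_seconds = time.perf_counter() - start
    # Time the classification step
    start = time.perf_counter()
    model.predict(X_eval)
    predict_seconds = time.perf_counter() - start
    return fit_seconds, predict_seconds

# Assumption: toy stemmed texts and arbitrary labels, for illustration only
toy_texts = [
    "trump poster child white supremaci",
    "nato russia hold parallel exercis",
    "truth might get fire",
    "playoff schedul matchup odd",
    "civilian kill singl airstrik",
    "keep aliv swanson",
]
toy_labels = [0, 1, 1, 0, 1, 1]
X_toy = TfidfVectorizer().fit_transform(toy_texts)

timings = {}
for name, model in [("LogisticRegression", LogisticRegression()), ("SVC", SVC())]:
    timings[name] = time_model(model, X_toy, toy_labels, X_toy)
    fit_s, predict_s = timings[name]
    print(f"{name}: fit {fit_s:.4f}s, predict {predict_s:.4f}s")
```

In the notebook itself, the same helper could be called with `X_train`, `y_train`, and `X_val` to compare the two models on equal terms.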


# Predict on new data

This section uses the fitted SVC to predict whether each news item in the unseen test set is fake or not.
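
The application described in the scope, which takes a news text and returns its category, could be sketched roughly as follows. This is only an illustration on made-up data: the function name `classify_news`, the toy texts, and the mapping of label `1` to "fake" and `0` to "true" are assumptions, and a real version would apply the same preprocessing (stemming) that was used on the training text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Assumption: toy stemmed texts with arbitrary labels, for illustration only
toy_texts = [
    "truth might get fire consortiumnew com",
    "flynn hillari clinton big woman campu breitbart",
    "civilian kill singl us airstrik identifi",
    "playoff schedul matchup odd new york time",
]
toy_labels = [1, 0, 1, 0]

# Bundle vectorizer and classifier so raw text can be passed in directly
pipeline_demo = make_pipeline(TfidfVectorizer(), SVC())
pipeline_demo.fit(toy_texts, toy_labels)

def classify_news(content, model=pipeline_demo):
    # Assumption: label 1 means fake and label 0 means true
    label = int(model.predict([content])[0])
    return "fake" if label == 1 else "true"

result = classify_news("nato russia hold parallel exercis balkan")
print(result)
```

Wrapping the vectorizer and the classifier in one pipeline keeps the text-to-vector step and the prediction step together, so the caller only has to supply a string.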

In [28]:
news_dataset_test.head()

| | id | title | author | text |
|---|---|---|---|---|
| 0 | 20800 | Specter of Trump Loosens Tongues, if Not Purse... | David Streitfeld | PALO ALTO, Calif. — After years of scorning... |
| 1 | 20801 | Russian warships ready to strike terrorists ne... | | Russian warships ready to strike terrorists ne... |
| 2 | 20802 | #NoDAPL: Native American Leaders Vow to Stay A... | Common Dreams | Videos #NoDAPL: Native American Leaders Vow to... |
| 3 | 20803 | Tim Tebow Will Attempt Another Comeback, This ... | Daniel Victor | If at first you don’t succeed, try a different... |
| 4 | 20804 | Keiser Report: Meme Wars (E995) | Truth Broadcast Network | 42 mins ago 1 Views 0 Comments 0 Likes 'For th... |


In [29]:
news_dataset_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5200 entries, 0 to 5199
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      5200 non-null   int64 
 1   title   5078 non-null   object
 2   author  4697 non-null   object
 3   text    5193 non-null   object
dtypes: int64(1), object(3)
memory usage: 162.6+ KB


In [30]:
news_dataset_test.shape

(5200, 4)

In [31]:
news_dataset_test.isnull().sum()

id          0
title     122
author    503
text        7
dtype: int64

In [32]:
news_dataset_test = news_dataset_test.fillna(" ")

In [33]:
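news_dataset_test['content'] = news_dataset_test['title'] + ' ' + news_dataset_test['author']
# Note: unlike the training text, this content is not passed through stemming(),
# so some of its tokens may not match the vocabulary the vectorizer was fitted on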

In [34]:
X_test = news_dataset_test['content'].values

In [35]:
X_test = vectorizer.transform(X_test)


In [36]:
X_test_predictions = model_svc.predict(X_test)
print(X_test_predictions)

[0 1 1 ... 0 1 0]


In [37]:
predictions_df = pd.DataFrame({'label': X_test_predictions})
predictions_df.head()

| | label |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 1 |
| 3 | 0 |
| 4 | 1 |


In [38]:
df = pd.concat([news_dataset_test["id"], predictions_df], axis=1)
df.head()

| | id | label |
|---|---|---|
| 0 | 20800 | 0 |
| 1 | 20801 | 1 |
| 2 | 20802 | 1 |
| 3 | 20803 | 0 |
| 4 | 20804 | 1 |
