In [1]:
import pandas as pd

In [2]:
# Data collection

In [3]:
data_set_fake = pd.read_csv("Fake.csv")
data_set_true = pd.read_csv("True.csv")

In [4]:
data_set_fake.head(5)

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [5]:
data_set_true.head(5)

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


### Merge the two datasets

#### Before merging, we add a new column that flags each article as fake (0) or true (1); this will be our target feature

In [7]:
data_set_fake['True'] = 0
data_set_true['True'] = 1


In [8]:
data_set_true.shape, data_set_fake.shape



((21417, 5), (23481, 5))

In [9]:
data_set_fake.head(5)

Unnamed: 0,title,text,subject,date,True
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0


In [10]:
data_set = pd.concat([data_set_fake, data_set_true], axis=0)

In [11]:
data_set.head(5)

Unnamed: 0,title,text,subject,date,True
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0


#### Data preprocessing

In [12]:
data_set.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44898 entries, 0 to 21416
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    44898 non-null  object
 1   text     44898 non-null  object
 2   subject  44898 non-null  object
 3   date     44898 non-null  object
 4   True     44898 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 2.1+ MB


______________________                        
**Observations** - There are no null values.           
Next, we do feature selection.

In [13]:
data_set_true['subject'].value_counts()

politicsNews    11272
worldnews       10145
Name: subject, dtype: int64

In [14]:
data_set_fake['subject'].value_counts()

News               9050
politics           6841
left-news          4459
Government News    1570
US_News             783
Middle-east         778
Name: subject, dtype: int64

From the above, every true article has the subject politicsNews or worldnews,        
while every fake article has one of the other six subjects.         
The subject is therefore perfectly correlated with the target (the True column): a model could separate the two classes from this column alone, without reading the article. We exclude it so that the model has to learn from the content of the news what makes it true or fake. 
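
To make the leakage concrete, here is a minimal sketch in plain Python that reuses the subject counts printed above. The helper `label_from_subject` is a hypothetical name introduced only for this illustration; it is not part of the notebook.

```python
# Subject counts copied from the two value_counts() outputs above.
true_counts = {"politicsNews": 11272, "worldnews": 10145}
fake_counts = {
    "News": 9050,
    "politics": 6841,
    "left-news": 4459,
    "Government News": 1570,
    "US_News": 783,
    "Middle-east": 778,
}

# No subject appears in both datasets.
shared_subjects = set(true_counts) & set(fake_counts)

def label_from_subject(subject):
    """Hypothetical rule: predict 1 (true) purely from the subject name."""
    return 1 if subject in true_counts else 0

total = sum(true_counts.values()) + sum(fake_counts.values())
correct = sum(n for s, n in true_counts.items() if label_from_subject(s) == 1)
correct += sum(n for s, n in fake_counts.items() if label_from_subject(s) == 0)
accuracy = correct / total

print(shared_subjects, total, accuracy)
```

A rule that never looks at the text scores perfectly on this data, which is why the column has to go.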

### Data cleaning/ Feature selection

In [16]:
data_set = data_set.drop(['title', 'subject', 'date'], axis = 1)

In [17]:
data_set.head(5)

Unnamed: 0,text,True
0,Donald Trump just couldn t wish all Americans ...,0
1,House Intelligence Committee Chairman Devin Nu...,0
2,"On Friday, it was revealed that former Milwauk...",0
3,"On Christmas day, Donald Trump announced that ...",0
4,Pope Francis used his annual Christmas Day mes...,0


**Reshuffle the rows**
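Because the fake rows were concatenated ahead of the true rows, the data is currently ordered by class. Shuffling removes that ordering, which helps prevent bias, supports generalization, and keeps any batch selection random. 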
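
The following sketch shows why the ordering matters, using only the class sizes reported earlier (23481 fake, 21417 true). It assumes a plain positional 70/30 cut, which is a simplification: the `train_test_split` call used later shuffles by default. The seed `0` is an arbitrary choice for reproducibility.

```python
import random

# Labels in concatenation order: all fake (0) rows first, then all true (1) rows.
labels = [0] * 23481 + [1] * 21417

# Positional cut: first 70% for training, the rest for testing.
split = int(len(labels) * 0.7)
tail_unshuffled = labels[split:]

# Same cut after shuffling a copy of the labels.
shuffled = labels[:]
random.Random(0).shuffle(shuffled)
tail_shuffled = shuffled[split:]

print(set(tail_unshuffled))  # classes present in the unshuffled test part
print(set(tail_shuffled))    # classes present after shuffling
```

Without shuffling, the held-out part contains only true articles, so any score computed on it would say nothing about fake ones.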

In [18]:
data_set = data_set.sample(frac=1).reset_index(drop=True)


In [19]:
data_set.head(10)

Unnamed: 0,text,True
0,"Whenever we have a terror attack in America, t...",0
1,WASHINGTON (Reuters) - The Clinton Foundation ...,1
2,Because what s funnier than tying up the lines...,0
3,NEW YORK (Reuters) - New Jersey Governor Chris...,1
4,BEIJING (Reuters) - China s naval chief has to...,1
5,One of the biggest mysteries in Constitutional...,0
6,"If one were to best describe Donald Trump, esp...",0
7,GAZA (Reuters) - Rival Palestinian factions Fa...,1
8,If Donald Trump was watching The Late Show wit...,0
9,"WASHINGTON (Reuters) - In late October, Presi...",1


In [20]:
# ML models work on numerical input, so we need to convert the text into numerical vectors.
# We use the TF-IDF technique, which gives more weight to words that are important to a document and less to words that appear everywhere.
# This is the feature extraction step.
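
As a rough illustration of what TF-IDF computes, here is a small pure-Python sketch on three made-up documents. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is how scikit-learn's `TfidfVectorizer` behaves with default settings; the documents and the helper `tfidf_row` are hypothetical.

```python
import math
from collections import Counter

# Three toy documents (made up for this illustration).
docs = ["trump said", "reuters said", "reuters reported"]
tokenized = [d.split() for d in docs]

# Vocabulary: one column per distinct word, in sorted order.
vocab = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(docs)

# Document frequency: in how many documents each word occurs.
df = Counter(tok for doc in tokenized for tok in set(doc))

# Smoothed inverse document frequency: rarer words get a larger weight.
idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF vector of one tokenized document."""
    tf = Counter(tokens)
    raw = [tf[tok] * idf[tok] for tok in vocab]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]

rows = [tfidf_row(doc) for doc in tokenized]
print(vocab)
print(rows[0])
```

In the first document, "trump" (which occurs in one document) ends up with a higher weight than "said" (which occurs in two).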

In [21]:
# After inspecting some texts, we want to remove punctuation, normalize the casing, and remove links and other symbols. 


In [22]:
import re

def cleantext(text):
    """Lowercase one article and strip URLs, HTML tags, punctuation, digits, and newlines."""

    # convert to lowercase
    text = text.lower()

    # remove URLs wherever they occur in the text
    text = re.sub(r'https?://\S+', '', text)

    # remove HTML tags such as <p> or </div>
    text = re.sub(r'<[^>]+>', '', text)

    # remove punctuation (anything that is neither a word character nor whitespace)
    text = re.sub(r'[^\w\s]', '', text)

    # remove digits
    text = re.sub(r'\d', '', text)

    # remove newline characters
    text = re.sub(r'\n', '', text)

    return text
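
To see what each cleaning step does, here is a self-contained sketch applied to one made-up string. `clean_text_sketch` is a hypothetical variant written for this illustration, not the notebook's function: it replaces URLs and tags with a space and collapses whitespace at the end, so that neighbouring words are not glued together.

```python
import re

def clean_text_sketch(text: str) -> str:
    """Illustrative cleaner: lowercase, drop URLs, tags, punctuation, digits."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs anywhere in the text
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags
    text = re.sub(r"[^\w\s]", "", text)                 # punctuation
    text = re.sub(r"\d", "", text)                      # digits
    text = re.sub(r"\s+", " ", text).strip()            # newlines and repeated spaces
    return text

sample = "Read <b>THIS</b> at https://example.com/a?b=1 now!\nIt's 100% real."
cleaned = clean_text_sketch(sample)
print(cleaned)  # read this at now its real
```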


    

In [23]:
data_set['text'] = data_set['text'].apply(cleantext)
data_set['text'].get(0)

'whenever we have a terror attack in america the media goes to great lengths to make sure no one assumes of the killer was a muslim meanwhile the media has gone out of their way to convince americans that all hispanics hate trump isn t it interesting how quickly they print a story about the murderer of two muslim men who witnesses describe as a  tall hispanic man  and then blame trumpan imam and his assistant were shot and killed in broad daylight as they walked home from a mosque in queens that s not what america is about  khairul islam  a local resident told the daily news  we blame donald trump for this trump and his drama has created islamophobia another imam whose name is unknown at the moment also blamed the real estate mogul and former nyc mayor for the shooting  for those in leadership like trump and mr giuliani and other members of other institutions that project islam and muslims as the enemy this is the end result of their wickedness  the imam said at a gathering of muslims 

### Separate dependent and independent features


In [79]:
X = data_set['text']
Y = data_set['True']

### Split into training and test data


In [25]:
from sklearn.model_selection import train_test_split

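# train_test_split returns its outputs in the order (X_train, X_test, y_train, y_test).
# Here the names are swapped and test_size is 0.7, so the two effects cancel out:
# x_train/y_train receive the 70% share and x_test/y_test the remaining 30%.
x_test, x_train, y_test, y_train = train_test_split(X,Y, test_size=0.7)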

print(y_train)

42889    0
25120    0
7521     0
20241    0
41523    0
        ..
4910     0
41879    0
18038    1
29295    1
13049    0
Name: True, Length: 31429, dtype: int64
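
The length printed above can be checked by hand. The sketch below assumes that scikit-learn rounds a fractional `test_size` up and gives the remainder to the other part; that assumption reproduces the 31429 rows shown for `y_train` and the support of 13469 in the classification reports.

```python
import math

n_samples = 44898  # rows in the merged data set
test_size = 0.7    # value passed to train_test_split

# Size of the part that train_test_split calls "test" (rounded up).
n_test_share = math.ceil(test_size * n_samples)

# The remainder is what train_test_split calls "train".
n_train_share = n_samples - n_test_share

print(n_test_share, n_train_share)  # 31429 13469
```

Because the variable names in the split cell are in the opposite order from the return values, the 31429-row share ends up in `x_train`/`y_train` and the 13469-row share in `x_test`/`y_test`.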


### Feature extraction -> we now convert the text to numerical data


In [26]:

from sklearn.feature_extraction.text import TfidfVectorizer
vectorization = TfidfVectorizer()

xv_train = vectorization.fit_transform(x_train)
xv_test = vectorization.transform(x_test) # we only transform the test data and do not fit() on it, i.e. we do not calculate the parameters again. 
# Refitting would learn a new vocabulary from the test set, so new words would be observed and the feature/column count would no longer match what the models were trained on.
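
The fit-then-transform idea can be shown with a bag-of-words sketch in plain Python. This is a simplification (raw counts instead of TF-IDF weights), and the documents and the helper `transform` are made up for the illustration.

```python
train_docs = ["fake news spreads fast", "reuters reports news"]
test_docs = ["reuters reports aliens"]

# "fit": the vocabulary (word -> column index) is learned from the training data only.
vocab = {
    tok: i
    for i, tok in enumerate(sorted({t for d in train_docs for t in d.split()}))
}

def transform(doc):
    """Map a document onto the fixed training vocabulary; unseen words are ignored."""
    vec = [0] * len(vocab)
    for tok in doc.split():
        idx = vocab.get(tok)
        if idx is not None:
            vec[idx] += 1
    return vec

train_matrix = [transform(d) for d in train_docs]
test_matrix = [transform(d) for d in test_docs]

print(len(vocab), test_matrix[0])  # 6 [0, 0, 0, 1, 1, 0]
```

The word "aliens" never occurred in training, so it is dropped, and the test row keeps the same six columns as the training rows.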


In [63]:
xv_test
xv_train

<31429x177683 sparse matrix of type '<class 'numpy.float64'>'
	with 6436453 stored elements in Compressed Sparse Row format>

### Train models and find the accuracy of the prediction 

In [29]:
def train_model(model, X_train, X_test, y_train, y_test):
    # fit on the training split, then report accuracy and per-class metrics on the test split
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"score {model.score(X_test, y_test)}")
    print(f"classification report \n {classification_report(y_test, y_pred)}")


In [30]:
# LogisticRegression 

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

lr = LogisticRegression()
train_model(lr, xv_train, xv_test, y_train, y_test)

score 0.9872299354072315
classification report 
               precision    recall  f1-score   support

           0       0.99      0.99      0.99      7102
           1       0.98      0.99      0.99      6367

    accuracy                           0.99     13469
   macro avg       0.99      0.99      0.99     13469
weighted avg       0.99      0.99      0.99     13469
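
For readers unfamiliar with the columns of the report, here is a minimal sketch that computes accuracy, precision, recall, and F1 by hand for class 1 on a made-up set of eight predictions. The numbers are illustrative only and unrelated to the results above.

```python
# Toy ground truth and predictions (made up for this illustration).
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 0, 1, 1, 1, 0, 1, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true articles predicted true
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # fake articles predicted true
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # true articles predicted fake

accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
precision = tp / (tp + fp)  # of everything predicted true, how much really is
recall = tp / (tp + fn)     # of everything really true, how much was found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)  # 0.75 0.75 0.75 0.75
```

The `support` column in the report is simply the number of test rows that belong to each class.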



In [31]:
# Decision Tree Classifier 

from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
train_model(dtc, xv_train, xv_test, y_train, y_test)

score 0.9960650382359492
classification report 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      7102
           1       1.00      1.00      1.00      6367

    accuracy                           1.00     13469
   macro avg       1.00      1.00      1.00     13469
weighted avg       1.00      1.00      1.00     13469



In [32]:
# Random Forest Classifier 

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier() 
train_model(rfc, xv_train, xv_test, y_train, y_test)

score 0.989680005939565
classification report 
               precision    recall  f1-score   support

           0       0.99      0.99      0.99      7102
           1       0.99      0.99      0.99      6367

    accuracy                           0.99     13469
   macro avg       0.99      0.99      0.99     13469
weighted avg       0.99      0.99      0.99     13469



In [33]:
# Gradient Boost

from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()
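# fit the gradient boosting model (gbc); the output recorded below came from a run that passed rfc here
train_model(gbc, xv_train, xv_test, y_train, y_test)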

score 0.9881951147078476
classification report 
               precision    recall  f1-score   support

           0       0.99      0.99      0.99      7102
           1       0.99      0.99      0.99      6367

    accuracy                           0.99     13469
   macro avg       0.99      0.99      0.99     13469
weighted avg       0.99      0.99      0.99     13469



### Create Predictive Model

In [73]:
def output_status(true):
    if true == 0:
        return "It's a Fake news"
    elif true == 1:
        return "It's a True news"
        
def manual_testing(news):
    testing_news = {"text": [news]} # dictionary
    df_test = pd.DataFrame(testing_news)
    df_test['text'] = df_test['text'].apply(cleantext)
    new_x_test = df_test['text']
    new_xv_test = vectorization.transform(new_x_test) # vectorization is the TF-IDF object fitted above
    # each model makes its own prediction; all four must have been fitted first, including gbc via train_model(gbc, ...)
    y_pred_lr = lr.predict(new_xv_test)
    y_pred_dtc = dtc.predict(new_xv_test)
    y_pred_rfc = rfc.predict(new_xv_test)
    y_pred_gbc = gbc.predict(new_xv_test)
    print("prediction  \n\nLogistic Regression {}, \nDecision tree {}, \nRandom forest {}, \nGradient boost {}".format(output_status(y_pred_lr[0]), 
                                                                                                       output_status(y_pred_dtc[0]), 
                                                                                                       output_status(y_pred_rfc[0]), 
                                                                                                       output_status(y_pred_gbc[0])))
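
One way to keep the per-model predictions from getting mixed up is to loop over a dictionary of fitted models instead of writing one line per model. The sketch below is only an illustration: `predict_all` and `ConstantModel` are hypothetical names, and `ConstantModel` is a stand-in for any fitted classifier with a `predict` method.

```python
LABELS = {0: "It's a Fake news", 1: "It's a True news"}

def predict_all(models, features):
    """Return {model name: readable label} for the first row of features."""
    return {
        name: LABELS[int(model.predict(features)[0])]
        for name, model in models.items()
    }

class ConstantModel:
    """Hypothetical stand-in for a fitted classifier; always predicts one label."""

    def __init__(self, label):
        self.label = label

    def predict(self, features):
        return [self.label for _ in features]

results = predict_all(
    {"Logistic Regression": ConstantModel(1), "Decision tree": ConstantModel(0)},
    [["dummy feature row"]],
)
for name, status in results.items():
    print(f"{name}: {status}")
```

With the notebook's objects, the dictionary would map the four display names to `lr`, `dtc`, `rfc`, and `gbc`, and `features` would be the transformed article.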

In [74]:
news_article = "Three sons of Hamas political leader Ismail Haniyeh were killed in an Israeli airstrike in Gaza Wednesday, an assassination that threatens to complicate ongoing negotiations aiming to secure a ceasefire and hostage deal.The Israeli military confirmed it carried out the attack, describing the men as “three Hamas military operatives that conducted terrorist activity in the central Gaza Strip.”According to the Israel Defense Forces (IDF) and Israel Security Agency (ISA), those killed were Amir Haniyeh, a cell commander in Hamas’ military wing, and Hamas military operatives Mohammad Haniyeh and Hazem Haniyeh.CNN is not able to independently confirm the IDF’s claims.The three were killed when the vehicle they were driving in was bombed in the Al Shati refugee camp, northwest of Gaza City, Hamas political leader Haniyeh told Al Jazeera.At least three of Haniyeh’s grandchildren were also killed, as was the driver, according to a journalist working for CNN in Gaza.The Israeli military statement did not mention anyone else being killed in the strike.The Hamas-run government media office (GMO) said Wednesday that the Haniyeh family had been “carrying out social and family visits on the occasion of Eid al-Fitr,” before the vehicle was struck.Eid al-Fitr marks the end of Ramadan and is one of the most important holidays on the Islamic calendar.Haniyeh in a statement said killing the sons of leaders would only make Hamas “more steadfast in our principles and adherence to our land.”Whoever thinks that by targeting my kids during the negotiation talks and before a deal is agreed upon that it will force Hamas to back down on its demands, is delusional,” Haniyeh added."

In [75]:
manual_testing(news_article)

prediction  

Logistic Regression It's a True news, 
Decision tree It's a True news, 
Random forest It's a True news, 
Gradient boost It's a True news


In [77]:
news_article_fake = "Title: Global Leaders Take Action on Climate Change at COP26 Summit In a historic moment for climate action, world leaders gathered at the COP26 summit in Glasgow to address the pressing issue of climate change. With the recent release of alarming reports on the state of the planet, the urgency to take bold steps to curb greenhouse gas emissions and limit global warming has never been greater. During the summit, countries pledged to ramp up their efforts to reduce carbon emissions and transition to renewable energy sources. The United States announced ambitious targets to cut emissions by 50-52% below 2005 levels by 2030, while China committed to peak its carbon emissions before 2030 and achieve carbon neutrality by 2060. One of the key topics of discussion at the summit was the need for wealthy nations to provide financial support to developing countries to help them adapt to the effects of climate change. A new agreement was reached to increase funding for climate adaptation and resilience in vulnerable regions, signaling a step towards global solidarity in the fight against climate change. Environmental activists and youth advocates were also present at the summit, pushing for stronger commitments from world leaders to protect the planet for future generations. Greta Thunberg, the renowned climate activist, delivered a powerful speech urging leaders to act with urgency and prioritize the well-being of the planet over short-term economic interests. As the COP26 summit drew to a close, there was a sense of hope and determination among participants that real progress had been made towards addressing the climate crisis. While the road ahead is challenging, the collective efforts of global leaders and activists have set a new course towards a more sustainable and resilient future for all."

In [78]:
manual_testing(news_article_fake)

prediction  

Logistic Regression It's a Fake news, 
Decision tree It's a Fake news, 
Random forest It's a Fake news, 
Gradient boost It's a Fake news
