In [74]:
import pandas as pd
import seaborn as sns

In [75]:
# Data collection

In [76]:
data_set_fake = pd.read_csv("Fake.csv")
data_set_true = pd.read_csv("True.csv")

In [77]:
data_set_fake.head(5)

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [78]:
data_set_true.head(5)

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


### Merge the two datasets

#### Before merging, add a new `label` column that flags each article as Fake (0) or True (1); this will be our target feature

In [80]:
data_set_fake['label'] = 0
data_set_true['label'] = 1


In [81]:
data_set_true.shape, data_set_fake.shape



((21417, 5), (23481, 5))

In [82]:
data_set_fake.head(5)

Unnamed: 0,title,text,subject,date,label
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0


In [83]:
data_set = pd.concat([data_set_fake, data_set_true], axis=0)
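A side note on the merge step: a plain `pd.concat` keeps each frame's own row index, so index values repeat in the merged frame. The sketch below uses two made-up toy frames (not the real CSVs) to show labelling plus a merge with `ignore_index=True`, which renumbers the rows straight away. The names `toy_fake`, `toy_true`, and `merged` are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the two CSV files; the rows are invented for illustration.
toy_fake = pd.DataFrame({"text": ["fake story one", "fake story two"]})
toy_true = pd.DataFrame({"text": ["true story one", "true story two", "true story three"]})

# Flag each source before merging: 0 = fake, 1 = true.
toy_fake["label"] = 0
toy_true["label"] = 1

# ignore_index=True discards the per-frame indexes and numbers rows 0..n-1.
merged = pd.concat([toy_fake, toy_true], axis=0, ignore_index=True)
print(merged)
```

Whether to renumber at merge time or later is a matter of taste; a later `reset_index(drop=True)` reaches the same result.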

In [84]:
data_set.head(5)

Unnamed: 0,title,text,subject,date,label
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0


#### Data preprocessing

In [85]:
# import matplotlib.pyplot as plt
# sns.countplot(x=data_set['subject'])
# plt.title('Number of articles per subject');

In [86]:
data_set.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44898 entries, 0 to 21416
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    44898 non-null  object
 1   text     44898 non-null  object
 2   subject  44898 non-null  object
 3   date     44898 non-null  object
 4   label    44898 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 2.1+ MB


______________________                        
**Observations** - We do not have any null values.           
Next, we do feature selection.

In [87]:
data_set_true['subject'].value_counts()

politicsNews    11272
worldnews       10145
Name: subject, dtype: int64

In [88]:
data_set_fake['subject'].value_counts()

News               9050
politics           6841
left-news          4459
Government News    1570
US_News             783
Middle-east         778
Name: subject, dtype: int64

From the above, we can see that all the true news has the subject politicsNews or worldnews,        
and the fake news has all the other subjects.         
This means the subject is perfectly correlated with the type of news (the target, `label`). We exclude this column so that the model has to read the content of the news to find out what makes an article True or Fake. 
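One way to make this kind of check explicit is a cross-tabulation of subject against label. The sketch below runs on a small invented frame rather than the real data; `toy`, `subject_by_label`, and `leaks_label` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the merged frame; the rows are made up for illustration.
toy = pd.DataFrame({
    "subject": ["politicsNews", "worldnews", "News", "politics", "left-news", "politicsNews"],
    "label":   [1, 1, 0, 0, 0, 1],
})

# Rows are subjects, columns are labels, cells are article counts.
subject_by_label = pd.crosstab(toy["subject"], toy["label"])
print(subject_by_label)

# True when every subject occurs under exactly one label,
# i.e. knowing the subject alone gives the label away.
leaks_label = bool(((subject_by_label > 0).sum(axis=1) == 1).all())
print(leaks_label)
```

If such a check comes back `True` on real data, a model trained with that column could reach a high score without reading the article text at all.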

### Data cleaning/ Feature selection

In [89]:
data_set = data_set.drop(['title', 'subject', 'date'], axis = 1)

In [90]:
data_set.head(5)

Unnamed: 0,text,label
0,Donald Trump just couldn t wish all Americans ...,0
1,House Intelligence Committee Chairman Devin Nu...,0
2,"On Friday, it was revealed that former Milwauk...",0
3,"On Christmas day, Donald Trump announced that ...",0
4,Pope Francis used his annual Christmas Day mes...,0


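**Reshuffle the rows**
After the merge, all fake articles come first and all true articles after them. Shuffling interleaves the two classes, which prevents ordering bias, helps generalization, and makes any batch selection random.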
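A shuffle without a seed gives a different row order on every run. The sketch below, on an invented six-row frame, shows the same `sample(frac=1)` idiom with `random_state` set so the order is repeatable; the seed value 42 is an arbitrary choice, not something taken from this notebook.

```python
import pandas as pd

# Toy frame with the two classes stacked in blocks, as after a concat.
toy = pd.DataFrame({"text": list("abcdef"), "label": [0, 0, 0, 1, 1, 1]})

# frac=1 samples every row once (a full shuffle); random_state fixes the order.
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)
reshuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled)
print(shuffled.equals(reshuffled))  # same seed, same order
```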

In [91]:
data_set = data_set.sample(frac=1).reset_index(drop=True)


In [92]:
data_set.head(10)

Unnamed: 0,text,label
0,BRAS LIA (Reuters) - Brazilian police raided t...,1
1,HARARE (Reuters) - Zimbabwe s ruling ZANU-PF p...,1
2,Hours after helping begin to repeal America s ...,0
3,The idea that a tax-exempt radical organizatio...,0
4,The music icon spoke at the North Minneapolis ...,0
5,WASHINGTON (Reuters) - President Barack Obama ...,1
6,"President Trump visits Florida hospital, prai...",0
7,WASHINGTON (Reuters) - A drop in the U.S. unem...,1
8,WASHINGTON (Reuters) - U.S. Senator Mark Warne...,1
9,Former Vermont Governor Howard Dean has alread...,0


In [93]:
# ML models need numerical input, so we have to convert the text into numerical vectors.
# We use the TF-IDF technique for this: it weights each word by how important it is to a document relative to the whole corpus.
# This step is the feature extraction.
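To give a feel for what TF-IDF does, here is a minimal sketch on three invented sentences. It only inspects the inverse document frequency part: a word that appears in every document gets a lower weight than a word that appears in one. The names `docs` and `idf` are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up documents: "the" is in all three, "bill" in two, "storm" in one.
docs = [
    "the senate passed the bill",
    "the president signed the bill",
    "the storm hit the coast",
]

vec = TfidfVectorizer()
matrix = vec.fit_transform(docs)  # one row per document, one column per word

# Map each word to its learned inverse document frequency.
idf = {word: vec.idf_[col] for word, col in vec.vocabulary_.items()}
for word in ("the", "bill", "storm"):
    print(word, round(idf[word], 3))
```

The rarer the word across documents, the larger its IDF, which is why common filler words contribute little to the final vectors.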

In [94]:
# After inspecting some texts, we want to lowercase them and remove punctuation, links, and other symbols.


In [95]:
import re

def cleantext(text):
    
    # convert to lowercase
    text = text.lower()
    
    # remove URLs (http:// or https:// up to the next whitespace)
    text = re.sub(r'https?://\S+', '', text)
    
    # remove HTML tags such as <br> or <p class="x">
    text = re.sub(r'<[^>]*>', '', text)
    
    # remove punctuation (anything that is not a word character or whitespace)
    text = re.sub(r'[^\w\s]', '', text)
    
    # remove digits
    text = re.sub(r'\d', '', text)
    
    # remove newline characters
    text = re.sub(r'\n', '', text)
    
    return text
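One detail worth checking in any regex-based cleaner is whether the URL pattern catches links in the middle of a line. The sketch below compares a pattern anchored with `^` against an unanchored one on a made-up sentence; `sample`, `anchored`, and `unanchored` are illustrative names.

```python
import re

# Made-up sentence with a link that does not start the line.
sample = "Read more at https://example.com/story today"

# Anchored with ^: only matches a URL at the very start of a line.
anchored = re.sub(r'^https?://\S+', '', sample, flags=re.MULTILINE)

# Unanchored: matches a URL wherever it occurs.
unanchored = re.sub(r'https?://\S+', '', sample)

print(repr(anchored))
print(repr(unanchored))
```

The anchored version leaves the sentence untouched, while the unanchored version removes the link and leaves a double space behind.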


    

In [96]:
data_set['text'] = data_set['text'].apply(cleantext)
data_set['text'].get(0)

'bras lia reuters  brazilian police raided the offices and homes of two members of congress on wednesday in the country s latest corruption probe as the government makes a lastditch effort to vote on an overhaul of the national pension system dubbed  operation  pia  the probe centers on alleged bribery of civil servants and politicians in return for rigged bids on road work totaling  million reais  million in the state of tocantins in central brazil federal police said in a statement they were serving  search warrants and delivering subpoenas to eight people in connection with the probe dulce miranda and carlos gaguim lawmakers from tocantins are implicated in the investigation police said gaguim denied any wrongdoing noting the accusations against him are baseless miranda s representatives said she would cooperate with the investigation president michel temer has said the lower house of congress would vote by tuesday on his proposed pension reform which many consider crucial to reinin

### Separate dependent and independent features


In [97]:
X = data_set['text']
Y = data_set['label']

### Split into training and test data


In [98]:
from sklearn.model_selection import train_test_split

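# train_test_split returns (X_train, X_test, y_train, y_test) in that order.
# The names below are bound in swapped order, so with test_size=0.7 the variables
# x_train/y_train hold the 70% portion and x_test/y_test hold the 30% portion.
# A more conventional spelling of the same 70/30 split would be:
#   x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3)
x_test, x_train, y_test, y_train = train_test_split(X, Y, test_size=0.7)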

print(y_train)

13170    1
22058    1
26968    1
20308    0
31011    1
        ..
16       1
13195    1
15292    0
35459    1
39872    0
Name: label, Length: 31429, dtype: int64
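For comparison, here is a sketch of a seeded, stratified 70/30 split on invented data. `stratify` keeps the class proportions equal in both parts and `random_state` makes the split repeatable; both are optional extras that the cell above does not use, and the variable names are hypothetical.

```python
from sklearn.model_selection import train_test_split

# Twenty made-up documents, ten per class.
texts = [f"doc {i}" for i in range(20)]
labels = [0] * 10 + [1] * 10

# Conventional unpacking order: train first, then test.
x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.3, stratify=labels, random_state=0
)

print(len(x_tr), len(x_te))  # sizes of the two parts
print(sum(y_tr), sum(y_te))  # number of label-1 rows in each part
```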


### Feature extraction: convert the text to numerical data


In [99]:

from sklearn.feature_extraction.text import TfidfVectorizer
vectorization = TfidfVectorizer()

xv_train = vectorization.fit_transform(x_train)
xv_test = vectorization.transform(x_test)  # only transform the test data; do not call fit() on it
# Fitting again would learn a new vocabulary from the test set, so new words would be added and the feature/column count would no longer match at prediction time
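The effect of fitting only on the training data can be seen on a tiny example. In this sketch (made-up sentences, illustrative names), the test sentence contains words the vectorizer never saw; they are dropped, and the column count stays the same as for the training matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["police raided the offices", "the senate passed the bill"]
test_docs = ["police questioned the senator"]  # "questioned" and "senator" are unseen

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_docs)  # learns the vocabulary and IDF weights
test_matrix = vec.transform(test_docs)        # reuses them, learns nothing new

print(train_matrix.shape, test_matrix.shape)
print("senator" in vec.vocabulary_)
```

Only the words shared with the training vocabulary ("police" and "the") end up as non-zero entries in the test row.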


In [100]:
xv_test
xv_train

<31429x177911 sparse matrix of type '<class 'numpy.float64'>'
	with 6453878 stored elements in Compressed Sparse Row format>

### Train models and find the accuracy of the prediction 

In [101]:
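from sklearn.metrics import classification_report

def train_model(model, X_train, X_test, y_train, y_test):
    # Fit on the training data, then report accuracy and per-class metrics on the test data
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"score {model.score(X_test, y_test)}")
    print(f"classification report \n {classification_report(y_test, y_pred)}")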
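An alternative way to organize the same steps is a scikit-learn `Pipeline`, which bundles the vectorizer and the model so that `fit` and `predict` accept raw text. This is a sketch on four invented sentences, not the approach used in this notebook; `pipe`, `train_texts`, and the sample inputs are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data: 0 = fake, 1 = true.
train_texts = [
    "shocking secret they hide",
    "you will not believe this shocking trick",
    "senate passes budget bill",
    "central bank raises interest rates",
]
train_labels = [0, 0, 1, 1]

# The pipeline fits the vectorizer and the classifier in one call.
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipe.fit(train_texts, train_labels)

# predict() vectorizes the raw strings with the fitted vocabulary first.
preds = pipe.predict(["shocking secret trick", "senate budget bill"])
print(preds)
```

The design benefit is that the vectorizer can never be accidentally refitted on test data, because it is only ever fitted inside `pipe.fit`.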


In [102]:
# LogisticRegression 

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

lr = LogisticRegression()
train_model(lr, xv_train, xv_test, y_train, y_test)

score 0.98797238102309
classification report 
               precision    recall  f1-score   support

           0       0.99      0.99      0.99      7044
           1       0.99      0.99      0.99      6425

    accuracy                           0.99     13469
   macro avg       0.99      0.99      0.99     13469
weighted avg       0.99      0.99      0.99     13469



In [103]:
# Decision Tree Classifier 

from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
train_model(dtc, xv_train, xv_test, y_train, y_test)

score 0.9955453263048482
classification report 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      7044
           1       1.00      0.99      1.00      6425

    accuracy                           1.00     13469
   macro avg       1.00      1.00      1.00     13469
weighted avg       1.00      1.00      1.00     13469



In [104]:
# Random Forest Classifier 

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier() 
train_model(rfc, xv_train, xv_test, y_train, y_test)

score 0.9888633157621204
classification report 
               precision    recall  f1-score   support

           0       0.99      0.99      0.99      7044
           1       0.99      0.99      0.99      6425

    accuracy                           0.99     13469
   macro avg       0.99      0.99      0.99     13469
weighted avg       0.99      0.99      0.99     13469



In [106]:
# Gradient Boost

from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()
train_model(gbc, xv_train, xv_test, y_train, y_test)

score 0.9957680599896057
classification report 
               precision    recall  f1-score   support

           0       1.00      0.99      1.00      7044
           1       0.99      1.00      1.00      6425

    accuracy                           1.00     13469
   macro avg       1.00      1.00      1.00     13469
weighted avg       1.00      1.00      1.00     13469



### Create Predictive Model

In [107]:
def output_status(label):
    # Map the numeric label back to a readable verdict
    if label == 0:
        return "It's a Fake news"
    elif label == 1:
        return "It's a True news"

def manual_testing(news):
    testing_news = {"text": [news]}  # dictionary with a single row
    df_test = pd.DataFrame(testing_news)
    df_test['text'] = df_test['text'].apply(cleantext)
    new_x_test = df_test['text']
    new_xv_test = vectorization.transform(new_x_test)  # reuse the TF-IDF vectorizer fitted above
    y_pred_lr = lr.predict(new_xv_test)
    y_pred_dtc = dtc.predict(new_xv_test)
    y_pred_rfc = rfc.predict(new_xv_test)
    y_pred_gbc = gbc.predict(new_xv_test)
    print("prediction  \n\nLogistic Regression {}, \nDecision tree {}, \nRandom forest {}, \nGradient boost {}".format(
        output_status(y_pred_lr[0]),
        output_status(y_pred_dtc[0]),
        output_status(y_pred_rfc[0]),
        output_status(y_pred_gbc[0])))

In [108]:
news_article = "Three sons of Hamas political leader Ismail Haniyeh were killed in an Israeli airstrike in Gaza Wednesday, an assassination that threatens to complicate ongoing negotiations aiming to secure a ceasefire and hostage deal.The Israeli military confirmed it carried out the attack, describing the men as “three Hamas military operatives that conducted terrorist activity in the central Gaza Strip.”According to the Israel Defense Forces (IDF) and Israel Security Agency (ISA), those killed were Amir Haniyeh, a cell commander in Hamas’ military wing, and Hamas military operatives Mohammad Haniyeh and Hazem Haniyeh.CNN is not able to independently confirm the IDF’s claims.The three were killed when the vehicle they were driving in was bombed in the Al Shati refugee camp, northwest of Gaza City, Hamas political leader Haniyeh told Al Jazeera.At least three of Haniyeh’s grandchildren were also killed, as was the driver, according to a journalist working for CNN in Gaza.The Israeli military statement did not mention anyone else being killed in the strike.The Hamas-run government media office (GMO) said Wednesday that the Haniyeh family had been “carrying out social and family visits on the occasion of Eid al-Fitr,” before the vehicle was struck.Eid al-Fitr marks the end of Ramadan and is one of the most important holidays on the Islamic calendar.Haniyeh in a statement said killing the sons of leaders would only make Hamas “more steadfast in our principles and adherence to our land.”Whoever thinks that by targeting my kids during the negotiation talks and before a deal is agreed upon that it will force Hamas to back down on its demands, is delusional,” Haniyeh added."

In [109]:
manual_testing(news_article)

prediction  

Logistic Regression It's a True news, 
Decision tree It's a Fake news, 
Random forest It's a True news, 
Gradient boost It's a Fake news


In [110]:
news_article_fake = "Title: Global Leaders Take Action on Climate Change at COP26 Summit In a historic moment for climate action, world leaders gathered at the COP26 summit in Glasgow to address the pressing issue of climate change. With the recent release of alarming reports on the state of the planet, the urgency to take bold steps to curb greenhouse gas emissions and limit global warming has never been greater. During the summit, countries pledged to ramp up their efforts to reduce carbon emissions and transition to renewable energy sources. The United States announced ambitious targets to cut emissions by 50-52% below 2005 levels by 2030, while China committed to peak its carbon emissions before 2030 and achieve carbon neutrality by 2060. One of the key topics of discussion at the summit was the need for wealthy nations to provide financial support to developing countries to help them adapt to the effects of climate change. A new agreement was reached to increase funding for climate adaptation and resilience in vulnerable regions, signaling a step towards global solidarity in the fight against climate change. Environmental activists and youth advocates were also present at the summit, pushing for stronger commitments from world leaders to protect the planet for future generations. Greta Thunberg, the renowned climate activist, delivered a powerful speech urging leaders to act with urgency and prioritize the well-being of the planet over short-term economic interests. As the COP26 summit drew to a close, there was a sense of hope and determination among participants that real progress had been made towards addressing the climate crisis. While the road ahead is challenging, the collective efforts of global leaders and activists have set a new course towards a more sustainable and resilient future for all."

In [111]:
manual_testing(news_article_fake)

prediction  

Logistic Regression It's a Fake news, 
Decision tree It's a Fake news, 
Random forest It's a True news, 
Gradient boost It's a Fake news
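Since the four models can disagree, one simple way to summarize their outputs is a majority vote. The helper below is a hypothetical sketch, not part of the notebook: it takes a list of 0/1 predictions and returns the most common label, or `None` when the vote is tied.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label, or None if the top two labels tie."""
    counts = Counter(predictions).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no clear majority
    return counts[0][0]

# The two prediction sets printed above, in the order LR, DT, RF, GB (0 = fake, 1 = true).
print(majority_vote([1, 0, 1, 0]))  # first article: two votes each
print(majority_vote([0, 0, 1, 0]))  # second article: three votes for fake
```

A tie, as with the first article, is a signal that the models are not reliable on that input rather than something to resolve arbitrarily.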
