## Import the libraries
### Data manipulation libraries
- **Pandas** is a Python library for data manipulation; here it is used to load the data from the CSV file into a DataFrame (table).
- **NumPy** is a general-purpose array-processing package, mostly used to process n-dimensional arrays and matrices in Python.

### Machine Learning libraries
#### `sklearn` (scikit-learn) is a Python library that provides a range of machine learning algorithms. The following functions and classes belong to it.
  - `train_test_split` is a function that splits the data into training and testing sets.
  - `TfidfVectorizer` converts a collection of raw documents to a matrix containing TF-IDF features.
  - `PassiveAggressiveClassifier` is a linear classifier. It is similar to the Perceptron in that it does not require a learning rate. However, contrary to the Perceptron, it includes a regularization parameter `C`.
  - `accuracy_score` computes the fraction of predictions that match the true labels. In multilabel classification, it computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in `y_true`.
  - `confusion_matrix` computes the confusion matrix to evaluate the accuracy of a classification.

In [0]:
# Importing all the libraries
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

## Load Fake News data
We use pandas to load the data from the CSV file into a DataFrame.


In [0]:
# Read the data
df = pd.read_csv('fake_news_data.csv')

#### Get a snapshot of the data using the `head()` function.

In [0]:
#Get shape and head
print(df.shape)
df.head()

In [0]:
# Convert every entry of the text column to a unicode string.
# This makes sure that missing or non-string values do not cause encoding issues in the classifier.
df['text'] = df['text'].apply(lambda x: np.str_(x))

# Get the labels
labels=df.label
labels.head()

## Split data into training and test sets
Split the data into 80% training and 20% testing (`test_size=0.2`) using the `train_test_split` function:

- `random_state` is used for initializing the internal random number generator, which will decide the splitting of data into train and test indices.
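
The effect of `random_state` can be checked with a small sketch. The toy lists below are hypothetical stand-ins for the real text and label columns; the point is only that two calls with the same seed return the same split.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten "articles" with alternating labels.
toy_texts = [f"article {i}" for i in range(10)]
toy_labels = ["FAKE" if i % 2 == 0 else "REAL" for i in range(10)]

# Two calls with the same random_state produce the same train/test indices.
a_train, a_test, _, _ = train_test_split(toy_texts, toy_labels, test_size=0.2, random_state=7)
b_train, b_test, _, _ = train_test_split(toy_texts, toy_labels, test_size=0.2, random_state=7)

print(len(a_train), len(a_test))  # 8 2
print(a_test == b_test)           # True
```

Leaving `random_state` unset would give a different split on each run, which makes accuracy figures harder to compare between runs.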

In [0]:
# Split the dataset
x_train,x_test,y_train,y_test=train_test_split(df['text'], labels, test_size=0.2, random_state=7)

## Machine learning code

In [0]:
# Initialize a TfidfVectorizer
tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)
# Fit and transform train set, transform test set
tfidf_train=tfidf_vectorizer.fit_transform(x_train) 
tfidf_test=tfidf_vectorizer.transform(x_test)

# Initialize a PassiveAggressiveClassifier
pac=PassiveAggressiveClassifier(max_iter=50)
pac.fit(tfidf_train,y_train)
# Predict on the test set and calculate accuracy
y_pred=pac.predict(tfidf_test)
score=accuracy_score(y_test,y_pred)
print(f'Accuracy: {round(score*100,2)}%')

In [0]:
# Build confusion matrix
confusion_matrix(y_test,y_pred, labels=['FAKE','REAL'])

In [0]:
# Compute the matrix once. Rows are true labels and columns are predicted labels,
# both in the order ['FAKE', 'REAL'], so FAKE is treated as the positive class.
cm = confusion_matrix(y_test, y_pred, labels=['FAKE', 'REAL'])
print(f"In this model, we have {cm[0][0]} true positives (FAKE news articles), "
      f"{cm[1][1]} true negatives (REAL news articles),\n "
      f"{cm[1][0]} false positives (REAL news articles predicted as FAKE), "
      f"{cm[0][1]} false negatives (FAKE news articles predicted as REAL).")
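
As a sanity check on how the cells of the confusion matrix are laid out, the following sketch uses made-up labels (not model output) chosen so that every cell holds a different count. It assumes FAKE is treated as the positive class, which follows from passing `labels=['FAKE','REAL']`.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels: 5 truly FAKE articles followed by 5 truly REAL ones.
toy_true = ["FAKE"] * 5 + ["REAL"] * 5
toy_pred = ["FAKE"] * 4 + ["REAL"] + ["FAKE"] * 2 + ["REAL"] * 3

toy_cm = confusion_matrix(toy_true, toy_pred, labels=["FAKE", "REAL"])

# Rows are true labels, columns are predicted labels, in the order of `labels`.
tp, fn = toy_cm[0]  # truly FAKE: predicted FAKE, predicted REAL
fp, tn = toy_cm[1]  # truly REAL: predicted FAKE, predicted REAL

print(toy_cm.tolist())  # [[4, 1], [2, 3]]
print(tp, fn, fp, tn)   # 4 1 2 3
```

So `[0][0]` counts FAKE articles correctly flagged as FAKE, and `[1][0]` counts REAL articles wrongly flagged as FAKE.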

## Prediction Block
Predictions can be made either by uploading the provided csv file `(option-1)` or by typing the **title** and **text** of a news article at the prompt `(option-2)`. In both cases the model predicts whether the news article is FAKE or REAL.
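
The core of both options is the same: transform the new text with the already-fitted vectorizer, then call the classifier. A minimal sketch of that step is shown here. The helper name `predict_label` and the four-sentence training set are hypothetical and exist only to make the example self-contained; they are not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier


def predict_label(text, vectorizer, classifier):
    """Return the predicted label for one article body (hypothetical helper)."""
    # Only transform here: refitting would change the vocabulary the model was trained on.
    features = vectorizer.transform([str(text)])
    return classifier.predict(features)[0]


# Hypothetical toy training data standing in for the real dataset.
toy_train_texts = [
    "shocking miracle cure doctors hate",
    "aliens secretly control the government",
    "parliament approves annual budget",
    "central bank holds interest rates",
]
toy_train_labels = ["FAKE", "FAKE", "REAL", "REAL"]

demo_vectorizer = TfidfVectorizer()
demo_clf = PassiveAggressiveClassifier(max_iter=50, random_state=7)
demo_clf.fit(demo_vectorizer.fit_transform(toy_train_texts), toy_train_labels)

demo_label = predict_label("miracle cure shocks doctors", demo_vectorizer, demo_clf)
print(demo_label)
```

In the notebook itself the same call pattern would use the fitted `tfidf_vectorizer` and `pac` objects instead of the toy ones.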


In [0]:
option = int(input("Enter 1 or 2\n 1. If you want to upload the prediction data file \n 2. Write your own Prediction data\n:"))
if(option==1):
  # Loading prediction.csv file
  df_1=pd.read_csv('fake_news_data_prediction.csv')
  # Separating features from the csv file
  x_pre = df_1.drop('label', axis=1)
  # Convert every entry of the text column to a unicode string.
  # This makes sure that missing or non-string values do not cause encoding issues in the classifier.
  df_1['text'] = df_1['text'].apply(lambda x: np.str_(x))
  # Transform the prediction set with the already-fitted vectorizer (no refitting)
  fake_1 = tfidf_vectorizer.transform(df_1['text'])
  # Predicting news with given features
  fake_pred_1 = pac.predict(fake_1)
  # Adding the label column to the dataframe. predict() already returns one label
  # per row, so no reshape with a hard-coded row count is needed.
  df_1['label'] = fake_pred_1
  # Print dataframe
  print('Here is the dataframe or table for prediction dataset with values:\n',df_1)
  df_1.to_csv('Prediction_Output.csv')

else:
  print("Here are the links to a few of the websites:\n\nREAL News:\n- http://www.bbc.com\n- http://money.cnn.com\n- http://edition.cnn.com\n- http://abcnews.go.com\n- http://www.bbc.co.uk\
  \n\nFAKE News:\n- http://beforeitsnews.com\n- https://www.activistpost.com\n- http://dailybuzzlive.com\n- http://www.disclose.tv")
  title=input("\nCopy the headline of the news article here: ")
  text=input("\nCopy the body of the news article here: \n")
  real_time_data = [text]  
  # Create the pandas DataFrame 
  df_2 = pd.DataFrame(real_time_data, columns = ['text']) 
  df_2['text'] = df_2['text'].apply(lambda x: np.str_(x))
  # Transform the prediction text with the already-fitted vectorizer (no refitting)
  fake_2=tfidf_vectorizer.transform(df_2['text'])
  # Predicting news with given features
  fake_pred_2=pac.predict(fake_2)
  label=fake_pred_2[0]
  # Here we are converting data into pandas dataframe which is nothing but a table format
  real_time_data_2 = [[title,text,label]]  
  # Create the pandas DataFrame 
  df_3 = pd.DataFrame(real_time_data_2, columns = ['title','text','label']) 
  # Print dataframe with prediction results. 
  print('\n\nHere is the dataframe or table for above entered values with prediction:\n',df_3)
  df_3.to_csv('Prediction_Output.csv')