<a href="https://colab.research.google.com/github/0AlphaZero0/IBM-Coursera-AdvancedAI/blob/master/Advanced_Data_Science_Capstone_(IBM_Coursera).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Fake news and Real news classification</h1>

*(Logistic Regression with PySpark and a dataset from Kaggle)*

The Kaggle dataset is "[*Fake and real news dataset / Classifying the news*](https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset)" by [**Clément Bisaillon**](https://www.kaggle.com/clmentbisaillon).

The goal is to build a Spark machine learning pipeline that classifies news articles as fake or real.

## Setup

First, we need to set up the environment with **Kaggle credentials**. <br><br>In Google Colab you can keep these credentials in Google Drive. In the API section of your Kaggle account page, create a new API token; this downloads the file `kaggle.json`. You can either upload this file at the start of every session or store it in a folder of your Google Drive and copy it from there.
<br><br>
The second method is the one used here.

### Kaggle

First, let's install kaggle!

In [1]:
!pip install kaggle



Then, using the `drive` module of `google.colab`, let's mount your Google Drive.

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


Now we can copy your `kaggle.json` into the Colab environment.

In [3]:
!cp '/content/gdrive/My Drive/Credentials/kaggle.json' 'kaggle.json'
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!ls ~/.kaggle
!chmod 600 /root/.kaggle/kaggle.json  # set permission

kaggle.json
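
As a quick sanity check, the sketch below verifies that a credentials file contains the two fields a Kaggle API token normally holds (`username` and `key`) and that its permissions are `600`. The helper name `check_kaggle_credentials` and the throwaway demo file are illustrative assumptions, not part of the original notebook.

```python
import json
import os
import stat
import tempfile

def check_kaggle_credentials(path):
    """Return True if the file has 'username' and 'key' and is owner read/write only."""
    with open(path) as handle:
        credentials = json.load(handle)
    has_fields = {"username", "key"} <= set(credentials)
    mode = stat.S_IMODE(os.stat(path).st_mode)  # permission bits only
    return has_fields and mode == 0o600

# Demo on a throwaway file with placeholder values (not real credentials)
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "kaggle.json")
with open(demo_path, "w") as handle:
    json.dump({"username": "someone", "key": "0000"}, handle)
os.chmod(demo_path, 0o600)
print(check_kaggle_credentials(demo_path))
```

In Colab the same function could be pointed at `/root/.kaggle/kaggle.json` after the copy step.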


### Spark

Let's install pyspark

In [4]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.1.2.tar.gz (212.4 MB)
Collecting py4j==0.10.9
  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.1.2-py2.py3-none-any.whl size=212880768 sha256=ba721cfa101773bf5b68a7f78368acaa60d946e99def84a7a98b51b43be100a9
  Stored in directory: /root/.cache/pip/wheels/a5/0a/c1/9561f6fecb759579a7d863dcd846daaa95f598744e71b02c77
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.1.2


## Dataset

To download the dataset, run the following commands:

In [5]:
!kaggle datasets download -d clmentbisaillon/fake-and-real-news-dataset # download dataset
!mkdir fakenews # create a directory in Google Colab environment
!unzip fake-and-real-news-dataset.zip -d fakenews/ # unzip the dataset in the directory previously created
!rm fake-and-real-news-dataset.zip # remove the zip file

Downloading fake-and-real-news-dataset.zip to /content
 85% 35.0M/41.0M [00:00<00:00, 80.1MB/s]
100% 41.0M/41.0M [00:00<00:00, 92.1MB/s]
Archive:  fake-and-real-news-dataset.zip
  inflating: fakenews/Fake.csv       
  inflating: fakenews/True.csv       


Import modules

In [6]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql.functions import col,lit,to_date

Create Spark session

In [7]:
spark = (SparkSession.builder
                  .appName('Fake news detector')
                  .enableHiveSupport()
                  .config("spark.executor.memory", "4G")
                  .config("spark.driver.memory","18G")
                  .config("spark.executor.cores","7")
                  .config("spark.python.worker.memory","4G")
                  .config("spark.driver.maxResultSize","0")
                  .config("spark.sql.crossJoin.enabled", "true")
                  .config("spark.serializer","org.apache.spark.serializer.KryoSerializer")
                  .config("spark.default.parallelism","2")
                  .getOrCreate())

spark.sparkContext.setLogLevel('INFO')

Let's load *Fake* and *True* datasets

In [8]:
# True Dataset
df_true = spark.read.csv('./fakenews/True.csv',header=True)
df_true = df_true.withColumn("Label",lit("True")) # add a column named "Label" with all values as True
print("There are",df_true.count(),"rows in the True dataset.")
# False Dataset
df_false = spark.read.csv('./fakenews/Fake.csv',header=True)
df_false = df_false.withColumn("Label",lit("False")) # add a column named "Label" with all values as False
print("There are",df_false.count(),"rows in the False dataset.")
# Build a unique dataset
df = df_true.union(df_false)

# Convert the date strings to a date column and drop the original string column
# Note: "yyyy" is the calendar year; "YYYY" would be the week-based year
df = df.withColumn("Datetime",to_date(df.date,"MMMM dd, yyyy")) # create a column named Datetime of type date
df = df.drop("date") # remove the column date, which contains dates as strings
count = df.count()
print("In total, there are",count,"rows in the whole dataset.")

# Remove rows without title or text
columns = ['text','title']
df = df.na.drop(subset=columns) # remove rows where any of these columns is null
print(count-df.count(),"rows has been deleted because of Nan values in columns : "+" and ".join(columns),"=>",df.count(),"rows in the dataset")

There are 21417 rows in the True dataset.
There are 23489 rows in the False dataset.
In total, there are 44906 rows in the whole dataset.
8 rows has been deleted because of Nan values in columns : text and title => 44898 rows in the dataset


In [9]:
# A look at the dataframe schema
df.printSchema()

root
 |-- title: string (nullable = true)
 |-- text: string (nullable = true)
 |-- subject: string (nullable = true)
 |-- Label: string (nullable = false)
 |-- Datetime: date (nullable = true)



Let's take a look at *subjects*

In [12]:
df_tmp = df.groupBy('subject').count()
df_tmp = df_tmp.sort(df_tmp['count'].desc())

df_tmp.show()
print("There are",df_tmp.count(),'subjects in the dataset')

+--------------------+-----+
|             subject|count|
+--------------------+-----+
|        politicsNews|11209|
|           worldnews|10115|
|                News| 8501|
|            politics| 6525|
|           left-news| 4216|
|     Government News| 1543|
|             US_News|  767|
|         Middle-east|  762|
|     fjs);}(document|  124|
|               2017"|   26|
|               2016"|   18|
| 2016Featured ima...|    8|
| i'm sure it wasn...|    8|
|              2017 "|    7|
| 2017Ranking memb...|    7|
|                   s|    7|
| and?"" he says. ...|    6|
| make it into art...|    5|
| fry em like baco...|    5|
|      2017So all day|    4|
+--------------------+-----+
only showing top 20 rows

There are 823 subjects in the dataset


As shown above, some subjects seem corrupted, so we will keep only the relevant subjects (those represented by more than 500 rows).

In [12]:
df_tmp = df.groupBy('subject').count().where("count>500")
df_tmp = df_tmp.sort(df_tmp['count'].desc())

df_tmp.show()

+---------------+-----+
|        subject|count|
+---------------+-----+
|   politicsNews|11209|
|      worldnews|10115|
|           News| 8501|
|       politics| 6525|
|      left-news| 4216|
|Government News| 1543|
|        US_News|  767|
|    Middle-east|  762|
+---------------+-----+



Let's remove corrupted subjects

In [13]:
# Remove corrupted subjects
count = df.count()
subjects = [str(x.subject) for x in df_tmp.select('subject').collect()]
res_df = df.where(df.subject.isin(subjects))
    
print(count-res_df.count(),"rows has been deleted because of corrupted subjects =>",res_df.count(),"rows in the dataset")

1260 rows has been deleted because of corrupted subjects => 43638 rows in the dataset
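
The stray subjects above look like fragments of article text. One plausible explanation (an assumption, not something verified in this notebook) is that some articles contain line breaks inside quoted fields, and a line-by-line CSV read splits them into several rows. The sketch below reproduces that effect with Python's standard `csv` module on a made-up two-article sample.

```python
import csv
import io

# Made-up sample: the first article's text contains a line break inside quotes
raw = (
    'title,text,subject\n'
    '"Title A","first line\nsecond line, then more, and a fragment",News\n'
    '"Title B","plain text",politics\n'
)

def third_field(row):
    # Return the value in the "subject" position, or None if the row is too short
    return row[2] if len(row) > 2 else None

# Naive reading: split on line breaks first, then on commas
naive_rows = [line.split(",") for line in raw.strip().split("\n")[1:]]

# Quote-aware reading: the quoted line break stays inside a single field
proper_rows = list(csv.reader(io.StringIO(raw)))[1:]

naive_subjects = [third_field(row) for row in naive_rows]
proper_subjects = [third_field(row) for row in proper_rows]

print(len(naive_rows), naive_subjects)    # 3 rows, with a text fragment as "subject"
print(len(proper_rows), proper_subjects)  # 2 rows, with the real subjects
```

If this is indeed the cause, the `multiLine`, `quote` and `escape` options of Spark's CSV reader might be worth trying as an alternative to filtering rows afterwards.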


In [14]:
df = res_df

## Model

Our dataset is now clean and ready for model building.

In [15]:
# selection of feature/text and class/label
df = df.select('text','Label')
df.show(5)

+--------------------+-----+
|                text|Label|
+--------------------+-----+
|WASHINGTON (Reute...| True|
|WASHINGTON (Reute...| True|
|WASHINGTON (Reute...| True|
|WASHINGTON (Reute...| True|
|SEATTLE/WASHINGTO...| True|
+--------------------+-----+
only showing top 5 rows



Import libraries

In [16]:
from pyspark.ml.feature import Tokenizer,StopWordsRemover,CountVectorizer,IDF
from pyspark.ml.feature import StringIndexer

Model skeleton

In [17]:
tokenizer = Tokenizer(inputCol='text',outputCol='mytokens')
stopwords_remover = StopWordsRemover(inputCol='mytokens',outputCol='filtered_tokens')
vectorizer = CountVectorizer(inputCol='filtered_tokens',outputCol='rawFeatures')
idf = IDF(inputCol='rawFeatures',outputCol='vectorizedFeatures')
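
To make these four stages more concrete, here is a plain-Python sketch of what they compute on a toy corpus: lowercase whitespace tokenization, stop-word removal, term counts, and IDF weighting. It uses the smoothed formula documented for Spark's `IDF`, log((N + 1) / (df + 1)). The tiny stop-word set and the helper names are illustrative assumptions, and Spark itself produces sparse vectors rather than dictionaries.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "is", "of"}  # illustrative subset, not Spark's default list

def tokenize(text):
    # Lowercase and split on whitespace, similar in spirit to Tokenizer
    return text.lower().split()

def remove_stop_words(tokens):
    return [token for token in tokens if token not in STOP_WORDS]

def tf_idf(documents):
    token_lists = [remove_stop_words(tokenize(doc)) for doc in documents]
    counts = [Counter(tokens) for tokens in token_lists]     # term frequency per document
    doc_freq = Counter(term for c in counts for term in c)   # number of documents containing each term
    n_docs = len(documents)
    idf = {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}
    return [{term: tf * idf[term] for term, tf in c.items()} for c in counts]

docs = ["The senate vote is today", "The vote of the year", "A shocking secret"]
weights = tf_idf(docs)
print(weights[0])
```

Terms that occur in many documents (here `vote`) receive a lower weight than terms that occur in only one.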

Transform label column from string to float

In [18]:
labelEncoder = StringIndexer(inputCol='Label',outputCol='label').fit(df)
labelEncoder.transform(df).show(5)
label_dict = {"True":1.0,"False":0.0}

+--------------------+-----+
|                text|label|
+--------------------+-----+
|WASHINGTON (Reute...|  1.0|
|WASHINGTON (Reute...|  1.0|
|WASHINGTON (Reute...|  1.0|
|WASHINGTON (Reute...|  1.0|
|SEATTLE/WASHINGTO...|  1.0|
+--------------------+-----+
only showing top 5 rows
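
By default `StringIndexer` orders labels by descending frequency, so the most frequent label receives index 0.0. The hard-coded `label_dict` therefore matches the encoder only while "False" remains the more frequent class. A plain-Python sketch of that ordering rule, using toy counts rather than the real dataset (the helper `frequency_index` is illustrative):

```python
from collections import Counter

def frequency_index(labels):
    """Map each label to a float index, most frequent label first."""
    # Sort by descending count, then alphabetically so that ties are deterministic
    ordered = sorted(Counter(labels).items(), key=lambda item: (-item[1], item[0]))
    return {label: float(index) for index, (label, _) in enumerate(ordered)}

toy_labels = ["False"] * 5 + ["True"] * 4
print(frequency_index(toy_labels))
```

In the notebook itself, the fitted encoder exposes the learned order through `labelEncoder.labels`, which may be a safer source for the mapping than a literal dictionary.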



Transform the dataset label 

In [19]:
df = labelEncoder.transform(df)

Split the dataset into train and test set

In [20]:
trainDF,testDF = df.randomSplit((0.7,0.3),seed=42)

Import Logistic Regression

In [21]:
from pyspark.ml.classification import LogisticRegression

Model

In [22]:
lr = LogisticRegression(featuresCol='vectorizedFeatures',labelCol='label')

Import pipeline

In [23]:
from pyspark.ml import Pipeline

Finally build the model pipeline

In [24]:
pipeline = Pipeline(stages=[tokenizer,stopwords_remover,vectorizer,idf,lr])

Let's train the model on the training set

In [25]:
lr_model = pipeline.fit(trainDF)

Then predict on test set.

In [26]:
predictions = lr_model.transform(testDF)
predictions.show()

+--------------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|                text|label|            mytokens|     filtered_tokens|         rawFeatures|  vectorizedFeatures|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
| ((This December ...|  1.0|[, ((this, decemb...|[, ((this, decemb...|(262144,[0,1,3,4,...|(262144,[0,1,3,4,...|[-35.096204393460...|[5.72680105339041...|       1.0|
| (Corrects Comey ...|  1.0|[, (corrects, com...|[, (corrects, com...|(262144,[0,1,2,3,...|(262144,[0,1,2,3,...|[-34.956903176082...|[6.58278765780648...|       1.0|
| (Corrects Feb. 2...|  1.0|[, (corrects, feb...|[, (corrects, feb...|(262144,[0,1,2,3,...|(262144,[0,1,2,3,...|[-38.724634743910...|[1.52091455826497...|       1.0|
| (C
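
The `rawPrediction` and `probability` columns are linked by the logistic (sigmoid) function: for binary logistic regression, each probability is the sigmoid of the matching raw score. A small check on the first row shown above, with the raw score rounded to -35.0962 (the helper name `sigmoid` is an illustrative choice):

```python
import math

def sigmoid(score):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-score))

raw_score_class_0 = -35.0962  # first rawPrediction entry of the first row, rounded
p_class_0 = sigmoid(raw_score_class_0)
p_class_1 = 1.0 - p_class_0

print(p_class_0)  # about 5.73e-16, in line with the probability column
print(p_class_1)  # very close to 1, hence prediction 1.0
```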

Import model evaluator

In [27]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

Declare evaluator

In [28]:
evaluator = MulticlassClassificationEvaluator(labelCol='label',predictionCol='prediction',metricName='accuracy')

Compute the accuracy of the model

In [29]:
accuracy = evaluator.evaluate(predictions)
accuracy

0.9923142613151152

### Hyperparameters and Cross Validation

Here we use cross-validation for hyperparameter tuning, to try to improve model performance.

In [30]:
# Import modules
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

Building the hyperparameter Grid

In [31]:
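# Parentheses are used for the method chain because a comment after a backslash continuation is a syntax error
paramGrid = (ParamGridBuilder()
    .addGrid(vectorizer.minDF,[1.0,1.5]) # minDF hyperparameter of the vectorizer
    .addGrid(lr.regParam,[0.1,0.01]) # regParam hyperparameter of the logistic regression
    .build())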
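
`ParamGridBuilder` builds the Cartesian product of the value lists, so this grid contains 2 × 2 = 4 candidate settings. The same enumeration in plain Python (the dictionary keys are just labels here, not Spark `Param` objects):

```python
from itertools import product

grid_values = {
    "minDF": [1.0, 1.5],
    "regParam": [0.1, 0.01],
}

# One dictionary per combination of hyperparameter values
param_maps = [dict(zip(grid_values, combo)) for combo in product(*grid_values.values())]

for param_map in param_maps:
    print(param_map)
print(len(param_maps), "combinations")
```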

Create the cross-validator with 2 folds (in practice, 3 or more folds would be better)

In [32]:
crossVal = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=2)

Launch the cross validation on training data

In [33]:
cvModel = crossVal.fit(trainDF)
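
With `numFolds=2`, each of the 4 parameter settings is trained on one half of the training data and scored on the other half, and then the roles swap. That gives 4 × 2 = 8 fits before the best setting is refit on the full training set. The sketch below shows the fold mechanics on row indices only; the helper `k_fold_indices` is illustrative, and Spark assigns rows to folds randomly rather than by position.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_indices, validation_indices) for each fold."""
    indices = list(range(n_rows))
    for fold in range(n_folds):
        validation = indices[fold::n_folds]  # every n_folds-th row, offset by the fold number
        held_out = set(validation)
        train = [i for i in indices if i not in held_out]
        yield train, validation

folds = list(k_fold_indices(n_rows=6, n_folds=2))
for train, validation in folds:
    print("train:", train, "validation:", validation)

n_settings = 4  # size of the 2 x 2 hyperparameter grid
print("fits during the search:", n_settings * len(folds))
```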

Get the test-set accuracy of the best model found by cross-validation and hyperparameter tuning.

In [41]:
cvPrediction = cvModel.transform(testDF)
cvModel.getEvaluator().evaluate(cvPrediction)

0.9955748777268846

<h2> Done by Arthur Tanquerel Thouvenin <img src="https://media.giphy.com/media/dWlLf9EAC8u5Nd0ku4/giphy.gif" width="50"></h2>

<img align='right' src="https://drive.google.com/uc?export=view&id=1EjS2yq_Onqz4gbyz-Xu9MvdP6arFvqX3" width="130">

<p>
  <em>
    Data Scientist 
    <img src="https://media.giphy.com/media/QtOt8WyYCGQBiJJ4ZJ/giphy.gif" width="15">
  </em>
</p>

[![LinkedIn: Arthur Tanquerel Thouvenin](https://img.shields.io/badge/-Arthur_Tanquerel_Thouvenin-blue?style=flat-square&logo=Linkedin&logoColor=white&link=https://www.linkedin.com/in/arthur-thouvenin-133822135/)](https://www.linkedin.com/in/arthur-thouvenin-133822135/)
[![GitHub: 0AlphaZero0](https://img.shields.io/github/followers/0AlphaZero0?label=follow&style=social)](https://github.com/0AlphaZero0)
                        

<h3> <img src="https://media.giphy.com/media/W7EAM6hdYrw7E3NBgE/giphy.gif" width="30"> If you want to know more about me, you are in the right place!</h3>


- 🔭 I’m currently working on personal projects involving photos of second hand items and AI processing
- 🌱 I’m currently learning AI, Data Science and much more
- 🔒 Ardent supporter of Open Source, Open Sciences, Open Data
- 💬 Ask me about Persistent Identifiers (ROR IDs), Biomedical Literature, NLP
- 📫 How to reach me: athouvenin [at] outlook.com
- 😄 Pronouns: He/Him

<br>

<p align=center>
    <img height=160 align="center" src="https://github-readme-stats.vercel.app/api?username=0AlphaZero0&show_icons=true&theme=midnight-purple">
    <img height=160 align="center" src="https://github-readme-stats.vercel.app/api/top-langs/?username=0AlphaZero0&layout=compact&theme=midnight-purple">
</p>




<h3> May the Force be with you !</h3>
<p align="center">
  <img src="https://media.giphy.com/media/GExBk9r9lP9LN5j2H5/giphy.gif">
</p>

