# Natural Language Processing Project

In this notebook, you will apply what you have learned about natural language processing in a new dataset. As we could see, most of the time, when dealing with natural language, we can extract most of the features directly from text.
Thus, it is up to us identifying what type of features are more interesting to use from the text.
In this Project, you will work with a different dataset about fake and real news.
Your goal is to have a classifier model that can identify when are fake or not.
This is quite hard, so we are not expecting you to obtain great results. The main idea here is to practice some NLP techniques to process text.

### Fake or Real News

You are going to work on a [Kaggle dataset](https://www.kaggle.com/datasets/jillanisofttech/fake-or-real-news) to train a model for classifying whether news are fake or real.
In the link, you will find the description of each column in the dataset. Notice, that the dataset is balanced, *i.e.*, it has the same number of samples for each class (Fake and Real).
The dataset provides only the *title* and *text* of each news, from that, you can think on strategies on how to represent such text.
Some interesting questions to answer are:
1. Is the title more important than the rest of the text?
2. If we combine both title and text as a single feature, does it improve training results?
3. Is there any feature we can extract from the text (such as the size of the text, for example) that improves training?

In [1]:
# Import here your libraries.
import nltk
import pandas as pd
import sklearn

In [2]:
data_path = 'datasets/fake_or_real_news.csv'
news = pd.read_csv(data_path)
news.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


From now, you may know it, but it is important to say the obvious, you shall train your model without the *label* column as it is what we want the model to learn. So, here, what matters is the *title* and *text* columns. Have a good coding day :D

> *Filipe: MY ATTEMPT STARTS FROM HERE*

# As a first attempt, I will be using the Linear Support Vector Classification model(*choice was based on sklearn "cheat sheet" for ML models* ). The only feature I decided to use was 'text' for now.

## 1) Selecting features and splitting the data into training and testing data.

In [51]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X = news['text']
y = news['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

## 2) Building simple pipeline for easier future modifications or changes on data.

In [58]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

pipeline = Pipeline([
    ('bow', CountVectorizer()),  # strings to token integer counts
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
    ('classifier', LinearSVC()),  # train on bow vectors w/ Linear SVC
])

## 3) Training model

In [59]:
pipeline.fit(X_train, y_train)

Pipeline(steps=[('bow', CountVectorizer()), ('tfidf', TfidfTransformer()),
                ('classifier', LinearSVC())])

## 4) Predictions

In [60]:
from sklearn.metrics import classification_report

predictions = pipeline.predict(X_test)
print(classification_report(predictions, y_test))

              precision    recall  f1-score   support

        FAKE       0.95      0.92      0.93       987
        REAL       0.91      0.95      0.93       914

    accuracy                           0.93      1901
   macro avg       0.93      0.93      0.93      1901
weighted avg       0.93      0.93      0.93      1901

