TfidfVectorizer from sklearn.feature_extraction.text converts 
the text articles into a numeric format that a machine learning model can use. 
It does this by turning the text into a matrix of TF-IDF (Term Frequency-Inverse Document Frequency) features. 
TF-IDF is a numerical statistic that reflects how important a word is to a document in a collection or corpus: the weight rises with how often the word appears in that document and falls with how many documents in the corpus contain it.

In [62]:
# Importing necessary libraries

# NumPy: support for large, multi-dimensional arrays and matrices, plus mathematical functions that operate on them.
import numpy as np

# pandas: data manipulation and analysis, with data structures for numerical tables and time series.
import pandas as pd

# scikit-learn (sklearn): a free machine learning library with classification, regression and clustering algorithms,
# designed to interoperate with NumPy and SciPy.

# train_test_split divides the data into training and testing sets.
# By default, 25% of the data is used for testing; test_size overrides that proportion.
from sklearn.model_selection import train_test_split

# TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF features.
# It is equivalent to CountVectorizer followed by TfidfTransformer.
from sklearn.feature_extraction.text import TfidfVectorizer

# LinearSVC is a linear support vector classifier; it is the model trained below.
from sklearn.svm import LinearSVC

In [99]:
data = pd.read_csv('fake.csv')


In [100]:
data

Unnamed: 0,id,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL
...,...,...,...,...
6330,4490,State Department says it can't find emails fro...,The State Department told the Republican Natio...,REAL
6331,8062,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,FAKE
6332,8622,Anti-Trump Protesters Are Tools of the Oligarc...,Anti-Trump Protesters Are Tools of the Oligar...,FAKE
6333,4021,"In Ethiopia, Obama seeks progress on peace, se...","ADDIS ABABA, Ethiopia —President Obama convene...",REAL


In [101]:
data['fake'] = data['label'].apply(lambda x: 0 if x == "REAL" else 1) 

In [102]:
data = data.drop("label", axis=1)

In [103]:
X, y = data['text'], data['fake']

In [104]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
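
The split above is random, so the exact numbers further down will change from run to run. A minimal sketch of a reproducible, class-balanced split follows, assuming stand-in data: `X_demo` and `y_demo` are hypothetical placeholders for `data['text']` and `data['fake']`, and the seed 42 is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data with invented values; in the notebook these would be X and y
X_demo = pd.Series([f"article {i}" for i in range(10)])
y_demo = pd.Series([0, 1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.2,
    random_state=42,   # arbitrary seed that fixes the shuffle
    stratify=y_demo,   # keep the REAL/FAKE ratio the same in both splits
)

print(len(X_tr), len(X_te))
```

Passing `random_state` makes the accuracy figure repeatable, and `stratify` guards against one split ending up with a skewed share of fake articles.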

In [105]:
len(X_train)

5068

In [106]:
vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

In [107]:
clf = LinearSVC()
clf.fit(X_train_vectorized, y_train)

In [108]:
clf.score(X_test_vectorized, y_test)  # mean accuracy on the held-out test set

0.9344909234411997
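
Accuracy alone does not show which kind of mistake the classifier makes. A small sketch using scikit-learn's `confusion_matrix` is given here; the label lists are invented for illustration, and in the notebook they would be `y_test` and `clf.predict(X_test_vectorized)`.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = REAL, 1 = FAKE (same encoding as the 'fake' column)
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)

# For binary labels the matrix flattens to: true neg, false pos, false neg, true pos
tn, fp, fn, tp = cm.ravel()
print("real articles flagged as fake:", fp)
print("fake articles passed as real:", fn)
```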

In [109]:
len(y_test)*0.9400  # len(y_test) is 1267. Rounding the accuracy up to 0.94 estimates about 1191 correctly classified articles.
# With the exact score above (0.9345), the count is 1184.

1190.98

In [110]:
len(y_test)  # 1267 - 1191 = 76 misclassified under the rounded 0.94 estimate (1267 - 1184 = 83 with the exact score)

1267
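
The same arithmetic can be done from the exact score rather than a rounded one. This sketch hard-codes the two numbers printed above; the variable names are only for illustration.

```python
n_test = 1267                    # len(y_test) from the cell above
accuracy = 0.9344909234411997    # clf.score(...) from the cell above

# accuracy = correct / total, so correct = accuracy * total (rounded to absorb float error)
n_correct = round(accuracy * n_test)
n_wrong = n_test - n_correct

print(n_correct, n_wrong)
```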

What if I want to predict a separate article?
Take the article text and put it in a text file.

In [111]:
with open("fakeElon.txt", "w", encoding="utf-8") as f:
    f.write(X_test.iloc[10])

In [112]:
with open("fakeElon.txt", "r", encoding="utf-8") as f:
    text = f.read()

In [113]:
vectorized_text = vectorizer.transform([text])

In [114]:
clf.predict(vectorized_text)

array([1])

In [116]:
y_test.iloc[10]

1
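
One way to make single-article prediction less error-prone is to bundle the vectorizer and the classifier so raw text goes in directly. The following is a sketch under assumptions, not the notebook's method: the training sentences are invented, `max_df` is left out because the toy corpus is tiny, and `predict_article` is a hypothetical helper name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny invented training set; 0 = REAL, 1 = FAKE
train_texts = [
    "senate committee approves annual budget report",
    "state department releases official travel statement",
    "shocking secret cure doctors refuse to reveal",
    "aliens secretly control world banks insider claims",
]
train_labels = [0, 0, 1, 1]

# The pipeline applies TF-IDF and then the classifier in one call
model = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
model.fit(train_texts, train_labels)

def predict_article(text):
    """Return 'FAKE' or 'REAL' for one raw article string."""
    return "FAKE" if model.predict([text])[0] == 1 else "REAL"

print(predict_article("officials release budget statement"))
```

With a pipeline there is no separate `vectorizer.transform` step to forget, which matters when the prediction code lives in a different file from the training code.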

To run in Streamlit:
1.) cd /Users/shanecupid/Desktop/MachineLearningProjects/FakeNews/
2.) python app.py
3.) streamlit run app.py