The 'label' is the field in the CSV file that comes after the end of each text.

# Tasks

For this exercise, you will build a classifier that can distinguish real news from fake news.

The fields you are interested in are ‘text’ and ‘label’ with the obvious interpretations.

Here are the action items for this exercise:

* Use the vector model and text preprocessing techniques from class to construct
a training data set.

* Determine the dimensions of your vector model and print out the first 10 dimensions.

* Use that training data set to construct a Naive Bayes classifier.

* Compute the accuracy and 95% CI for the classifier.

* Try your analysis with and without data preprocessing. Is there a difference in the accuracies of the models?

* The data set contains a large number of articles, so training takes a long time. You can downsample it to about 1,000 articles to speed up training and evaluation (hint: use shuffle).
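
The comparison asked for in the list above can be sketched on a toy corpus. This is a minimal sketch, not the notebook's solution: the two template sentences, the helper name `evaluate`, and the 75/25 split are assumptions made for illustration. Each model is scored on held-out articles so that the two accuracies are comparable.

```python
import math

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for the 'text' and 'label' columns.
texts = (["shocking secret exposed about candidate number {}".format(i) for i in range(20)]
         + ["the senate committee released report number {}".format(i) for i in range(20)])
labels = ["FAKE"] * 20 + ["REAL"] * 20


def evaluate(vectorizer, texts, labels):
    """Fit Naive Bayes on a training split; return (dimensions, accuracy, lb, ub) for the test split."""
    train_x, test_x, train_y, test_y = train_test_split(
        texts, labels, test_size=0.25, random_state=42, stratify=labels)
    train_vecs = vectorizer.fit_transform(train_x)  # vocabulary comes from the training split only
    test_vecs = vectorizer.transform(test_x)
    model = MultinomialNB().fit(train_vecs, train_y)
    acc = accuracy_score(test_y, model.predict(test_vecs))
    half_width = 1.96 * math.sqrt(acc * (1 - acc) / len(test_y))
    return train_vecs.shape[1], acc, max(0.0, acc - half_width), min(1.0, acc + half_width)


# Without preprocessing: default tokenizer, every token kept.
raw = evaluate(CountVectorizer(binary=True), texts, labels)

# With preprocessing: letters only, English stop words removed, rare words dropped.
pre = evaluate(CountVectorizer(binary=True, token_pattern="[a-zA-Z]+",
                               stop_words="english", min_df=2), texts, labels)

for name, (dims, acc, lb, ub) in [("raw", raw), ("preprocessed", pre)]:
    print("{}: {} dimensions, accuracy {:3.2f} ({:3.2f}, {:3.2f})".format(name, dims, acc, lb, ub))
```

On the real data, `news['text']` and `news['label']` would take the place of the toy lists.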

# Task code

In [20]:
import pandas as pd
from sklearn.utils import shuffle

# Load your dataset
url = "https://raw.githubusercontent.com/lutzhamel/fake-news/master/data/fake_or_real_news.csv"
news = pd.read_csv(url).drop(columns=['id'])

# Shuffle the dataset
shuffled_news = shuffle(news, random_state=42)

# Downsample the dataset to 1,000 articles
downsampled_news = shuffled_news.head(1000)

# Check the shape of the downsampled dataset
print(downsampled_news.shape)


(1000, 3)


In [1]:
import pandas as pd

url = "https://raw.githubusercontent.com/lutzhamel/fake-news/master/data/fake_or_real_news.csv"
news = pd.read_csv(url).drop(columns=['id'])
news.head(n=10)

Unnamed: 0,title,text,label
0,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL
5,"Tehran, USA","\nI’m not an immigrant, but my grandparents ...",FAKE
6,Girl Horrified At What She Watches Boyfriend D...,"Share This Baylee Luciani (left), Screenshot o...",FAKE
7,‘Britain’s Schindler’ Dies at 106,A Czech stockbroker who saved more than 650 Je...,REAL
8,Fact check: Trump and Clinton at the 'commande...,Hillary Clinton and Donald Trump made some ina...,REAL
9,Iran reportedly makes new push for uranium con...,Iranian negotiators reportedly have made a las...,REAL


In [2]:
news.shape

(6335, 3)

In [3]:
news['label'].value_counts()

label,count
REAL,3171
FAKE,3164


## Task 1 & 2

In [8]:
import pandas
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

# process documents
vectorizer = CountVectorizer(analyzer = "word", binary = True)
docarray = vectorizer.fit_transform(news['text']).toarray()
print("docarray shape: {}".format(docarray.shape))
print("first 10 coords: {}".format(vectorizer.get_feature_names_out()[:10]))



docarray shape: (6335, 67659)
first 10 coords: ['00' '000' '0000' '000000031' '00000031' '000035' '00006' '0001' '0001pt'
 '0002']


In [9]:
# process documents
vectorizer = CountVectorizer(analyzer = "word",
                             token_pattern = "[a-zA-Z]+", # only words
                             binary = True,
                             min_df=2) # each word has to appear at least twice
docarray = vectorizer.fit_transform(news['text']).toarray()

print("docarray shape: {}".format(docarray.shape))
print("first 10 coords: {}".format(vectorizer.get_feature_names_out()[:10]))

docarray shape: (6335, 38548)
first 10 coords: ['a' 'aa' 'aaa' 'aab' 'aadmi' 'aaib' 'aam' 'aamaq' 'aap' 'aaron']


In [15]:
#STOP WORDS

# process documents
vectorizer = CountVectorizer(analyzer = "word",
                             token_pattern = "[a-zA-Z]+", # only words
                             binary = True,
                             stop_words = 'english',
                             min_df=2) # each word has to appear at least twice
docarray = vectorizer.fit_transform(news['text']).toarray()

print("docarray shape: {}".format(docarray.shape))
print("first 10 coords: {}".format(vectorizer.get_feature_names_out()[:10]))

docarray shape: (6335, 38240)
first 10 coords: ['aa' 'aaa' 'aab' 'aadmi' 'aaib' 'aam' 'aamaq' 'aap' 'aaron' 'aarp']


In [16]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer

# add doc names so that later analysis becomes more readable
doc_names = ['doc{}'.format(i) for i in range(news.shape[0])]
news = pd.DataFrame(news.values, index=doc_names,columns=news.columns)
print(news.head(n=10))

# build the stemmer object
stemmer = PorterStemmer()

# get the default text analyzer from CountVectorizer
analyzer = CountVectorizer(analyzer = "word",
                           stop_words = 'english',
                           token_pattern = "[a-zA-Z]+").build_analyzer()

# build a new analyzer that stems using the default analyzer to create the words to be stemmed
def stemmed_words(doc):
    return [stemmer.stem(w) for w in analyzer(doc)]

vectorizer = CountVectorizer(analyzer=stemmed_words,
                                 binary=True,
                                 min_df=2)
docarray = vectorizer.fit_transform(news['text']).toarray()

print("docarray shape: {}".format(docarray.shape))
print("Dimensions of the vector model:", docarray.shape[1])
print("first 10 coords: {}".format(vectorizer.get_feature_names_out()[:10]))

                                                  title  \
doc0                       You Can Smell Hillary’s Fear   
doc1  Watch The Exact Moment Paul Ryan Committed Pol...   
doc2        Kerry to go to Paris in gesture of sympathy   
doc3  Bernie supporters on Twitter erupt in anger ag...   
doc4   The Battle of New York: Why This Primary Matters   
doc5                                        Tehran, USA   
doc6  Girl Horrified At What She Watches Boyfriend D...   
doc7                  ‘Britain’s Schindler’ Dies at 106   
doc8  Fact check: Trump and Clinton at the 'commande...   
doc9  Iran reportedly makes new push for uranium con...   

                                                   text label  
doc0  Daniel Greenfield, a Shillman Journalism Fello...  FAKE  
doc1  Google Pinterest Digg Linkedin Reddit Stumbleu...  FAKE  
doc2  U.S. Secretary of State John F. Kerry said Mon...  REAL  
doc3  — Kaydee King (@KaydeeKing) November 9, 2016 T...  FAKE  
doc4  It's primary day in New 

In [12]:
distances = euclidean_distances(docarray)
doc_names = ['doc{}'.format(i) for i in range(docarray.shape[0])]
distances_df = pandas.DataFrame(data=distances,index=doc_names,columns=doc_names)
distances_df

Unnamed: 0,doc0,doc1,doc2,doc3,doc4,doc5,doc6,doc7,doc8,doc9,...,doc6325,doc6326,doc6327,doc6328,doc6329,doc6330,doc6331,doc6332,doc6333,doc6334
doc0,0.000000,21.610183,22.000000,21.307276,20.712315,28.583212,21.771541,20.420578,26.115130,22.847319,...,21.517435,20.639767,21.470911,19.416488,21.095023,21.563859,30.133038,27.092434,24.637370,23.043437
doc1,21.610183,0.000000,17.291616,16.822604,15.748016,26.870058,17.635192,14.966630,24.062419,18.841444,...,18.601075,16.703293,16.911535,13.341664,16.613248,17.146428,28.653098,25.039968,21.540659,19.131126
doc2,22.000000,17.291616,0.000000,17.204651,15.524175,26.476405,17.204651,14.317821,23.280893,17.606817,...,18.788294,17.029386,16.217275,12.767145,16.155494,16.462078,28.000000,24.413111,20.174241,19.621417
doc3,21.307276,16.822604,17.204651,0.000000,13.892444,26.925824,17.262677,14.177447,23.537205,18.708287,...,17.972201,16.309506,16.031220,12.449900,15.842980,16.643317,28.425341,24.454039,21.424285,18.681542
doc4,20.712315,15.748016,15.524175,13.892444,0.000000,26.115130,15.779734,12.409674,22.781571,17.117243,...,17.320508,15.394804,14.832397,10.488088,14.142136,15.748016,27.694765,23.937418,20.000000,17.435596
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
doc6330,21.563859,17.146428,16.462078,16.643317,15.748016,26.944387,17.000000,14.000000,23.000000,18.083141,...,17.944358,16.155494,16.309506,12.489996,15.491933,0.000000,27.658633,24.677925,20.976177,19.339080
doc6331,30.133038,28.653098,28.000000,28.425341,27.694765,33.090784,28.284271,27.440845,30.331502,28.740216,...,28.407745,28.354894,27.802878,26.851443,27.730849,27.658633,0.000000,30.919250,28.896367,28.687977
doc6332,27.092434,25.039968,24.413111,24.454039,23.937418,30.870698,24.779023,23.895606,28.106939,25.416530,...,25.436195,24.413111,24.637370,23.130067,23.769729,24.677925,30.919250,0.000000,26.400758,25.436195
doc6333,24.637370,21.540659,20.174241,21.424285,20.000000,28.390139,21.748563,19.748418,25.039968,21.702534,...,22.449944,21.142375,20.928450,18.867962,20.542639,20.976177,28.896367,26.400758,0.000000,22.715633


## Task 3 & 4: Naive Bayes classifier, accuracy, and 95% CI




In [17]:
from sklearn.naive_bayes import MultinomialNB
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer

# compute the 95% confidence interval for a classification problem

def classification_confint(acc, n):
    '''
    Compute the 95% confidence interval for a classification problem.
      acc -- classification accuracy
      n   -- number of observations used to compute the accuracy
    Returns a tuple (lb,ub)
    '''
    import math
    interval = 1.96*math.sqrt(acc*(1-acc)/n)
    lb = max(0, acc - interval)
    ub = min(1.0, acc + interval)
    return (lb,ub)
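
For reference, `classification_confint` implements the normal-approximation interval for a proportion, where $\mathit{acc}$ is the observed accuracy and $n$ is the number of observations. The function then clips the bounds to $[0, 1]$:

```latex
\[
  \left[\;
    \mathit{acc} - 1.96\sqrt{\frac{\mathit{acc}\,(1-\mathit{acc})}{n}},
    \;\;
    \mathit{acc} + 1.96\sqrt{\frac{\mathit{acc}\,(1-\mathit{acc})}{n}}
  \;\right]
\]
```

The approximation is generally considered reasonable when $n$ is large and $\mathit{acc}$ is not very close to 0 or 1.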

In [19]:

## Naive Bayes

print("******** model **********")


# Naive Bayes
model = MultinomialNB()
# NOTE: MultinomialNB is used with its default smoothing (alpha=1.0), so no search over a parameter space is performed.
model.fit(docarray, news['label'])


print("******** Accuracy **********")

# training-set accuracy with confidence interval
# (optimistic: the model is scored on the same data it was fit on)
best_model = model
predict_y = best_model.predict(docarray)
acc = accuracy_score(news['label'], predict_y)
lb,ub = classification_confint(acc,docarray.shape[0])
print("Accuracy of Naive Bayes with confidence interval: {:3.2f} ({:3.2f}, {:3.2f})".format(acc,lb,ub))

print("******** confusion matrix **********")

# build the confusion matrix
cats = ['FAKE','REAL']
cm = confusion_matrix(news['label'], predict_y, labels=cats)
cm_df = pd.DataFrame(cm, index=cats, columns=cats)
print("Confusion Matrix:\n{}".format(cm_df))

******** model **********
******** Accuracy **********
Accuracy of Naive Bayes with confidence interval: 0.92 (0.91, 0.93)
******** confusion matrix **********
Confusion Matrix:
      FAKE  REAL
FAKE  2856   308
REAL   197  2974
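
As a sanity check, the reported accuracy can be recomputed from the confusion matrix printed above. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels; the variable names are illustrative.

```python
import numpy as np

# Counts copied from the confusion matrix output (rows: true, columns: predicted).
cats = ["FAKE", "REAL"]
cm = np.array([[2856, 308],
               [197, 2974]])

n = cm.sum()                           # total number of articles scored
acc = np.trace(cm) / n                 # correct predictions sit on the diagonal
recall = np.diag(cm) / cm.sum(axis=1)  # fraction of each true class that was recovered

print("n = {}, accuracy = {:3.2f}".format(n, acc))
for cat, r in zip(cats, recall):
    print("recall for {}: {:3.2f}".format(cat, r))
```

The row sums, 3164 and 3171, match the label counts shown earlier, which is consistent with the model having been scored on the same articles it was trained on.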
