## Task 1 & 2: Preprocessing techniques and determining the dimensions of the vector model

In [27]:
import pandas as pd
from sklearn.utils import shuffle

# Load your dataset
url = "https://raw.githubusercontent.com/lutzhamel/fake-news/master/data/fake_or_real_news.csv"
news = pd.read_csv(url).drop(columns=['id'])

# Shuffle the dataset
shuffled_news = shuffle(news, random_state=42)

# Downsample the dataset to 1,000 articles
downsampled_news = shuffled_news.head(1000)

# Check the shape of the downsampled dataset
print(downsampled_news.shape)


(1000, 3)


In [28]:
downsampled_news.head(n=10)

Unnamed: 0,title,text,label
1357,"American Dream, Revisited",Will Trump pull a Brexit times ten? What would...,FAKE
2080,Clintons Are Under Multiple FBI Investigations...,Clintons Are Under Multiple FBI Investigations...,FAKE
2718,The FBI Can’t Actually Investigate a Candidate...,Dispatches from Eric Zuesse This piece is cros...,FAKE
812,Confirmed: Public overwhelmingly (10-to-1) say...,Print \n[Ed. – Every now and then the facade c...,FAKE
4886,Nanny In Jail After Force Feeding Baby To Death,Nanny In Jail After Force Feeding Baby To Deat...,FAKE
4890,Media Roll Out Welcome Mat for ‘Humanitarian’ ...,By Belén Fernández | FAIR PHOTO ABOVE: Hillary...,FAKE
4714,Hillary Clinton accepts nomination with 'bound...,"The words, when they came, had lost no power o...",REAL
1782,Police Turn In Badges Rather Than Incite Viole...,By Amanda Froelich It should be evident if you...,FAKE
2445,South Carolina police officer charged with mur...,"A white police officer in North Charleston, S....",REAL
3574,Tony Blair helpfully describes Remain voters a...,Tony Blair helpfully describes Remain voters a...,FAKE


In [29]:
downsampled_news['label'].value_counts()

label,count
REAL,513
FAKE,487


**Document-Term Matrix**

In [30]:
import pandas
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

# process documents
vectorizer = CountVectorizer(analyzer = "word", binary = True)
docarray = vectorizer.fit_transform(downsampled_news['text']).toarray()
print("docarray shape: {}".format(docarray.shape))
print("first 10 coords: {}".format(vectorizer.get_feature_names_out()[:10]))


docarray shape: (1000, 31297)
first 10 coords: ['00' '000' '00006' '0001pt' '000x' '001' '005' '005s' '006s' '009']


With the default tokenizer there are 31,297 distinct features, many of them numbers or number-like tokens.
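The `binary = True` setting used above records only whether a term occurs in a document, not how often it occurs. A minimal sketch on a made-up two-document corpus (the documents are illustrative and not taken from the news data) shows the difference:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, not from the news dataset)
docs = ["fbi fbi fbi email", "email vote"]

# Raw term counts: repeated words accumulate
counts = CountVectorizer(analyzer="word").fit_transform(docs).toarray()

# Binary indicators: any count above zero becomes 1
flags = CountVectorizer(analyzer="word", binary=True).fit_transform(docs).toarray()

# Columns are ordered alphabetically: email, fbi, vote
print(counts.tolist())  # [[1, 3, 0], [1, 0, 1]]
print(flags.tolist())   # [[1, 1, 0], [1, 0, 1]]
```

With binary features, a long article that repeats a word many times is treated the same as a short one that mentions it once.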

In [62]:
# process documents
vectorizer = CountVectorizer(analyzer = "word",
                             token_pattern = "[a-zA-Z]+", # only words
                             binary = True,
                             min_df=4) # keep only words that appear in at least four documents
docarray = vectorizer.fit_transform(downsampled_news['text']).toarray()

print("docarray shape: {}".format(docarray.shape))
print("first 10 coords: {}".format(vectorizer.get_feature_names_out()[:10]))

docarray shape: (1000, 10499)
first 10 coords: ['a' 'aaron' 'abandon' 'abandoned' 'abandoning' 'abandonment' 'abbott'
 'abc' 'abdeslam' 'abedin']


Restricting tokens to alphabetic words and requiring each word to appear in at least four documents cuts the feature space to 10,499 dimensions.

In [63]:
#STOP WORDS

# process documents
vectorizer = CountVectorizer(analyzer = "word",
                             token_pattern = "[a-zA-Z]+", # only words
                             binary = True,
                             stop_words = 'english',
                             min_df=4) # keep only words that appear in at least four documents
docarray = vectorizer.fit_transform(downsampled_news['text']).toarray()

print("docarray shape: {}".format(docarray.shape))
print("first 10 coords: {}".format(vectorizer.get_feature_names_out()[:10]))

docarray shape: (1000, 10209)
first 10 coords: ['aaron' 'abandon' 'abandoned' 'abandoning' 'abandonment' 'abbott' 'abc'
 'abdeslam' 'abedin' 'abetting']


Removing English stop words reduces the number of features further, to 10,209.

In [64]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer

# add doc names so that later analysis becomes more readable
doc_names = ['doc{}'.format(i) for i in range(downsampled_news.shape[0])]
downsampled_news = pd.DataFrame(downsampled_news.values, index=doc_names,columns=downsampled_news.columns)
print(downsampled_news.head(n=10))

# build the stemmer object
stemmer = PorterStemmer()

# get the default text analyzer from CountVectorizer
analyzer = CountVectorizer(analyzer = "word",
                           stop_words = 'english',
                           token_pattern = "[a-zA-Z]+").build_analyzer()

# build a new analyzer that stems using the default analyzer to create the words to be stemmed
def stemmed_words(doc):
    return [stemmer.stem(w) for w in analyzer(doc)]

vectorizer = CountVectorizer(analyzer=stemmed_words,
                                 binary=True,
                                 min_df=4)
docarray = vectorizer.fit_transform(downsampled_news['text']).toarray()

print("docarray shape: {}".format(docarray.shape))
print("Dimensions of the vector model:", docarray.shape[1])
print("first 10 coords: {}".format(vectorizer.get_feature_names_out()[:10]))

                                                  title  \
doc0                          American Dream, Revisited   
doc1  Clintons Are Under Multiple FBI Investigations...   
doc2  The FBI Can’t Actually Investigate a Candidate...   
doc3  Confirmed: Public overwhelmingly (10-to-1) say...   
doc4    Nanny In Jail After Force Feeding Baby To Death   
doc5  Media Roll Out Welcome Mat for ‘Humanitarian’ ...   
doc6  Hillary Clinton accepts nomination with 'bound...   
doc7  Police Turn In Badges Rather Than Incite Viole...   
doc8  South Carolina police officer charged with mur...   
doc9  Tony Blair helpfully describes Remain voters a...   

                                                   text label  
doc0  Will Trump pull a Brexit times ten? What would...  FAKE  
doc1  Clintons Are Under Multiple FBI Investigations...  FAKE  
doc2  Dispatches from Eric Zuesse This piece is cros...  FAKE  
doc3  Print \n[Ed. – Every now and then the facade c...  FAKE  
doc4  Nanny In Jail After Forc

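Stemming with the Porter stemmer merges inflected forms of a word into a single feature, which shrinks the vector model further.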
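To see why stemming reduces the number of dimensions, consider the features 'abandon', 'abandoned', 'abandoning' and 'abandonment' from the earlier vocabulary listing. A small sketch, assuming NLTK is installed, shows that the Porter stemmer maps all four to the same stem:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Four separate features in the unstemmed vocabulary
words = ["abandon", "abandoned", "abandoning", "abandonment"]

# Map each word to its Porter stem
stems = {w: stemmer.stem(w) for w in words}
print(stems)

# The number of distinct stems is the number of columns these words occupy after stemming
distinct = set(stems.values())
print(len(words), "words ->", len(distinct), "feature(s)")
```

Four columns of the document-term matrix collapse into one, and the same happens for every other family of inflected words.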

In [54]:
distances = euclidean_distances(docarray)
doc_names = ['doc{}'.format(i) for i in range(docarray.shape[0])]
distances_df = pandas.DataFrame(data=distances,index=doc_names,columns=doc_names)
distances_df

Unnamed: 0,doc0,doc1,doc2,doc3,doc4,doc5,doc6,doc7,doc8,doc9,...,doc990,doc991,doc992,doc993,doc994,doc995,doc996,doc997,doc998,doc999
doc0,0.000000,20.371549,26.664583,18.654758,19.697716,24.392622,23.832751,18.055470,23.065125,18.520259,...,22.427661,24.596748,19.646883,20.976177,18.574176,20.396078,21.071308,22.737634,21.023796,24.637370
doc1,20.371549,0.000000,25.099801,15.000000,15.132746,23.065125,22.516660,13.379088,19.974984,15.033296,...,18.165902,22.583180,15.779734,18.357560,14.560220,17.233688,18.303005,19.544820,17.804494,23.366643
doc2,26.664583,25.099801,0.000000,24.799194,25.436195,28.213472,27.513633,24.596748,26.739484,25.019992,...,25.961510,27.640550,25.278449,25.980762,24.859606,25.475478,25.670995,26.381812,25.865034,27.820855
doc3,18.654758,15.000000,24.799194,0.000000,12.649111,21.189620,20.688161,9.380832,19.026298,11.618950,...,17.691806,21.095023,12.083046,16.186414,10.535654,14.696938,16.124515,18.248288,16.248077,22.293497
doc4,19.697716,15.132746,25.436195,12.649111,0.000000,21.794495,21.307276,10.295630,19.026298,12.369317,...,18.947295,21.840330,13.341664,16.911535,11.532563,15.874508,16.733201,19.000000,16.613248,22.561028
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
doc995,20.396078,17.233688,25.475478,14.696938,15.874508,22.203603,21.587033,13.416408,20.199010,14.594520,...,19.974984,21.931712,15.556349,17.378147,14.106736,0.000000,17.888544,19.672316,18.055470,23.043437
doc996,21.071308,18.303005,25.670995,16.124515,16.733201,22.956481,22.494444,14.832397,21.118712,15.779734,...,20.273135,22.605309,16.431677,19.026298,15.329710,17.888544,0.000000,20.518285,18.275667,23.515952
doc997,22.737634,19.544820,26.381812,18.248288,19.000000,23.790755,23.685439,17.521415,21.702534,18.439089,...,21.633308,23.916521,18.841444,20.615528,18.439089,19.672316,20.518285,0.000000,20.760539,24.698178
doc998,21.023796,17.804494,25.865034,16.248077,16.613248,22.248595,22.315914,14.282857,21.071308,15.779734,...,20.322401,22.781571,16.431677,18.761663,15.524175,18.055470,18.275667,20.760539,0.000000,23.259407
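
Because the document vectors are binary, each squared distance in this matrix is simply the number of terms that occur in exactly one of the two documents. For instance, the distance of 15.0 between doc1 and doc3 corresponds to 225 such terms. The sketch below checks this relationship on two invented documents, using only the standard library:

```python
import math

# Hypothetical documents, each represented by the set of terms it contains
doc_a = {"clinton", "fbi", "email"}
doc_b = {"clinton", "email", "trump", "vote"}

# Binary vectors over the shared vocabulary
vocab = sorted(doc_a | doc_b)
vec_a = [1 if term in doc_a else 0 for term in vocab]
vec_b = [1 if term in doc_b else 0 for term in vocab]

# Euclidean distance between the two binary vectors
dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(vec_a, vec_b)))

# Number of terms present in exactly one of the two documents
differing = len(doc_a ^ doc_b)

print(dist, math.sqrt(differing))
```

One consequence is that long documents tend to be far from everything else, since they contain many terms that shorter documents lack.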


### In summary, the vector model has 6,975 dimensions, and the first 10 dimensions are ['aaron' 'abandon' 'abbott' 'abc' 'abdeslam' 'abdic' 'abduct' 'abedin' 'aberr' 'abet']

## Task 3 & 4: Naive Bayes classifier and computing its accuracy and 95% CI



In [35]:
from sklearn.naive_bayes import MultinomialNB
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer

# compute a 95% confidence interval for the accuracy of a classifier

def classification_confint(acc, n):
    '''
    Compute the 95% confidence interval for a classification problem.
      acc -- classification accuracy
      n   -- number of observations used to compute the accuracy
    Returns a tuple (lb,ub)
    '''
    import math
    interval = 1.96*math.sqrt(acc*(1-acc)/n)
    lb = max(0, acc - interval)
    ub = min(1.0, acc + interval)
    return (lb,ub)
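
The function above implements the normal-approximation interval $acc \pm 1.96\sqrt{acc(1-acc)/n}$, so the width of the interval depends on the sample size as well as on the accuracy. A short worked example, using a hypothetical accuracy of 0.90 rather than a result from this notebook, shows how the interval narrows as `n` grows:

```python
import math

def classification_confint(acc, n):
    # Normal-approximation 95% interval, same formula as the cell above
    interval = 1.96 * math.sqrt(acc * (1 - acc) / n)
    return (max(0, acc - interval), min(1.0, acc + interval))

# Hypothetical accuracy of 0.90 measured on samples of different sizes
for n in (100, 1000, 10000):
    lb, ub = classification_confint(0.90, n)
    print("n={}: ({:.3f}, {:.3f})".format(n, lb, ub))
```

The half-width falls from about 0.059 at n=100 to about 0.019 at n=1000 and about 0.006 at n=10000, so accuracies measured on samples of different sizes come with intervals of different widths.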

In [65]:

## Naive Bayes

print("******** model **********")


# Naive Bayes
model = MultinomialNB()
# NOTE: MultinomialNB is used with its default settings (smoothing parameter alpha=1.0), so no search over a parameter space is performed
model.fit(docarray, downsampled_news['label'])


print("******** Accuracy **********")

# accuracy on the training data (the documents the model was fit on) with confidence interval
best_model = model
predict_y = best_model.predict(docarray)
acc = accuracy_score(downsampled_news['label'], predict_y)
lb,ub = classification_confint(acc,docarray.shape[0])
print("Accuracy of Naive Bayes with confidence interval: {:3.2f} ({:3.2f}, {:3.2f})".format(acc,lb,ub))

******** model **********
******** Accuracy **********
Accuracy of Naive Bayes with confidence interval: 0.94 (0.92, 0.95)
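
The accuracy above is measured on the same 1,000 documents the model was fit on, which tends to give an optimistic estimate. One common alternative, sketched here on an invented toy corpus rather than the news data, is to hold out part of the data, fit the vectorizer and the model on the training part only, and score on the held-out part:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for the 'text' and 'label' columns
fake_texts = ["shocking secret exposed", "you will not believe this secret",
              "exposed shocking truth they hide", "secret truth finally exposed",
              "shocking claims you will not believe", "they hide the shocking truth"]
real_texts = ["senate passes budget bill", "court rules on voting law",
              "president signs budget bill", "senate debates voting law",
              "court hears budget case", "president meets senate leaders"]
texts = fake_texts + real_texts
labels = ["FAKE"] * len(fake_texts) + ["REAL"] * len(real_texts)

# Hold out 25% of the documents, keeping the class balance
train_texts, test_texts, train_y, test_y = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42)

# Fit the vocabulary on the training documents only, then reuse it for the test documents
vectorizer = CountVectorizer(analyzer="word", binary=True)
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

model = MultinomialNB().fit(X_train, train_y)
test_acc = accuracy_score(test_y, model.predict(X_test))
print("held-out accuracy on {} documents: {:.2f}".format(len(test_y), test_acc))
```

When this approach is used, the confidence interval should be computed with `n` set to the number of held-out documents.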


In [66]:
print("******** confusion matrix **********")

# build the confusion matrix
cats = ['FAKE','REAL']
cm = confusion_matrix(downsampled_news['label'], predict_y, labels=cats)
cm_df = pd.DataFrame(cm, index=cats, columns=cats)
print("Confusion Matrix:\n{}".format(cm_df))

******** confusion matrix **********
Confusion Matrix:
      FAKE  REAL
FAKE   444    43
REAL    21   492
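
In this matrix the rows are the actual labels and the columns are the predicted labels, so the diagonal holds the correctly classified articles. As a brief sketch, per-class precision and recall can be derived from the four counts with plain arithmetic:

```python
# Counts copied from the confusion matrix above (rows = actual, columns = predicted)
cm = {"FAKE": {"FAKE": 444, "REAL": 43},
      "REAL": {"FAKE": 21, "REAL": 492}}
cats = ["FAKE", "REAL"]

total = sum(cm[a][p] for a in cats for p in cats)
accuracy = sum(cm[c][c] for c in cats) / total

metrics = {}
for c in cats:
    # recall: share of actual class-c articles that were predicted as c
    recall = cm[c][c] / sum(cm[c][p] for p in cats)
    # precision: share of articles predicted as c that really are c
    precision = cm[c][c] / sum(cm[a][c] for a in cats)
    metrics[c] = {"precision": precision, "recall": recall}
    print("{}: precision={:.3f} recall={:.3f}".format(c, precision, recall))

print("accuracy={:.3f}".format(accuracy))
```

The 43 FAKE articles labeled REAL lower the recall for FAKE, while the 21 REAL articles labeled FAKE lower its precision.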


## Model without data preprocessing

In [38]:
import pandas as pd
from sklearn.utils import shuffle

# Load your dataset
url = "https://raw.githubusercontent.com/lutzhamel/fake-news/master/data/fake_or_real_news.csv"
news = pd.read_csv(url).drop(columns=['id'])

# Shuffle the dataset
shuffled_news = shuffle(news, random_state=42)

# Downsample the dataset to 1,000 articles
downsampled_news = shuffled_news.head(1000)

# Check the shape of the downsampled dataset
print(downsampled_news.shape)


(1000, 3)


In [39]:
import pandas as pd

url = "https://raw.githubusercontent.com/lutzhamel/fake-news/master/data/fake_or_real_news.csv"
news = pd.read_csv(url).drop(columns=['id'])
news.head(n=10)

Unnamed: 0,title,text,label
0,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL
5,"Tehran, USA","\nI’m not an immigrant, but my grandparents ...",FAKE
6,Girl Horrified At What She Watches Boyfriend D...,"Share This Baylee Luciani (left), Screenshot o...",FAKE
7,‘Britain’s Schindler’ Dies at 106,A Czech stockbroker who saved more than 650 Je...,REAL
8,Fact check: Trump and Clinton at the 'commande...,Hillary Clinton and Donald Trump made some ina...,REAL
9,Iran reportedly makes new push for uranium con...,Iranian negotiators reportedly have made a las...,REAL


In [40]:
import pandas
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

# process documents
vectorizer = CountVectorizer(analyzer = "word", binary = True)
docarray = vectorizer.fit_transform(news['text']).toarray()
print("docarray shape: {}".format(docarray.shape))
print("Dimensions of the vector model:", docarray.shape[1])
print("first 10 coords: {}".format(vectorizer.get_feature_names_out()[:10]))



docarray shape: (6335, 67659)
Dimensions of the vector model: 67659
first 10 coords: ['00' '000' '0000' '000000031' '00000031' '000035' '00006' '0001' '0001pt'
 '0002']
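
A dense array of this shape is large: assuming 8-byte integer entries, 6,335 x 67,659 values take roughly 3.4 GB. `MultinomialNB` also accepts the sparse matrix that `fit_transform` returns, so the `.toarray()` call can be skipped when only the classifier is needed. The sketch below, on an invented toy corpus, checks that both routes give the same predictions:

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Size of the dense array reported above, assuming 8-byte integer entries
rows, cols = 6335, 67659
dense_gb = rows * cols * 8 / 1e9
print("dense array: about {:.1f} GB".format(dense_gb))

# Hypothetical toy corpus (not from the news dataset)
docs = ["shocking secret exposed", "senate passes budget bill",
        "secret truth they hide", "court rules on budget law"]
labels = ["FAKE", "REAL", "FAKE", "REAL"]

vectorizer = CountVectorizer(analyzer="word", binary=True)
X_sparse = vectorizer.fit_transform(docs)  # scipy sparse matrix, no .toarray()
print("sparse input:", issparse(X_sparse))

# Fit once on the sparse matrix and once on its dense copy
sparse_model = MultinomialNB().fit(X_sparse, labels)
dense_model = MultinomialNB().fit(X_sparse.toarray(), labels)

same = (sparse_model.predict(X_sparse) == dense_model.predict(X_sparse.toarray())).all()
print("identical predictions:", same)
```

The dense form is still needed for steps that inspect the full array, but the classifier itself does not require it.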


In [41]:
distances = euclidean_distances(docarray)
doc_names = ['doc{}'.format(i) for i in range(docarray.shape[0])]
distances_df = pandas.DataFrame(data=distances,index=doc_names,columns=doc_names)
distances_df

Unnamed: 0,doc0,doc1,doc2,doc3,doc4,doc5,doc6,doc7,doc8,doc9,...,doc6325,doc6326,doc6327,doc6328,doc6329,doc6330,doc6331,doc6332,doc6333,doc6334
doc0,0.000000,24.819347,25.000000,24.657656,24.000000,32.588341,24.919872,23.537205,30.149627,26.134269,...,25.039968,23.979158,24.758837,22.891046,24.186773,24.657656,34.510868,30.708305,28.354894,25.980762
doc1,24.819347,0.000000,19.723083,20.049938,18.439089,30.724583,20.420578,17.549929,27.982137,21.517435,...,21.977261,19.773720,19.570386,16.370706,19.467922,20.000000,32.664966,28.407745,25.139610,22.293497
doc2,25.000000,19.723083,0.000000,19.974984,18.027756,30.215890,19.899749,16.340135,27.459060,20.591260,...,21.587033,19.697716,18.493242,15.000000,18.275667,18.841444,31.874755,28.071338,23.515952,22.494444
doc3,24.657656,20.049938,19.974984,0.000000,17.435596,30.951575,20.322401,16.970563,27.946377,21.702534,...,21.656408,19.723083,19.416488,15.556349,19.052559,19.748418,32.603681,28.372522,25.258662,22.022716
doc4,24.000000,18.439089,18.027756,17.435596,0.000000,29.866369,18.627936,14.966630,27.037012,20.074860,...,20.469489,18.627936,17.691806,13.114877,17.000000,18.708287,31.701735,27.694765,23.832751,20.615528
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
doc6330,24.657656,20.000000,18.841444,19.748418,18.708287,30.757113,19.924859,16.852300,27.221315,21.000000,...,21.142375,19.313208,19.313208,15.937377,18.627936,0.000000,31.764760,28.372522,24.083189,22.203603
doc6331,34.510868,32.664966,31.874755,32.603681,31.701735,37.881394,32.341923,31.384710,35.242020,32.924155,...,32.680269,32.155870,31.843367,31.000000,31.685959,31.764760,0.000000,35.972211,33.361655,32.954514
doc6332,30.708305,28.407745,28.071338,28.372522,27.694765,35.000000,28.565714,27.477263,32.588341,28.948230,...,29.086079,28.106939,28.142495,27.000000,27.459060,28.372522,35.972211,0.000000,30.413813,29.291637
doc6333,28.354894,25.139610,23.515952,25.258662,23.832751,32.802439,25.159491,23.237900,29.681644,25.278449,...,26.362853,24.879711,24.433583,22.538855,23.811762,24.083189,33.361655,30.413813,0.000000,26.438608


In [42]:

## Naive Bayes

print("******** model **********")


# Naive Bayes
model = MultinomialNB()
# NOTE: MultinomialNB is used with its default settings (smoothing parameter alpha=1.0), so no search over a parameter space is performed
model.fit(docarray, news['label'])


print("******** Accuracy **********")

# accuracy on the training data (the documents the model was fit on) with confidence interval
best_model = model
predict_y = best_model.predict(docarray)
acc = accuracy_score(news['label'], predict_y)
lb,ub = classification_confint(acc,docarray.shape[0])
print("Accuracy of Naive Bayes with confidence interval: {:3.2f} ({:3.2f}, {:3.2f})".format(acc,lb,ub))

print("******** confusion matrix **********")

# build the confusion matrix
cats = ['FAKE','REAL']
cm = confusion_matrix(news['label'], predict_y, labels=cats)
cm_df = pd.DataFrame(cm, index=cats, columns=cats)
print("Confusion Matrix:\n{}".format(cm_df))

******** model **********
******** Accuracy **********
Accuracy of Naive Bayes with confidence interval: 0.95 (0.94, 0.95)
******** confusion matrix **********
Confusion Matrix:
      FAKE  REAL
FAKE  2898   266
REAL    78  3093


## Question: Is there a difference in the accuracies of the models?


### Answer

The model without data preprocessing reaches a slightly higher accuracy (0.95) than the model with preprocessing (0.94). However, the 95% confidence intervals, (0.94, 0.95) and (0.92, 0.95), overlap, so the difference is not statistically significant. Two caveats apply: both accuracies are computed on the same data the models were trained on, and the model without preprocessing was fit on all 6,335 articles while the preprocessed model was fit on the 1,000-article sample.

The confusion matrices tell the same story. The preprocessed model classifies 444 of 487 FAKE articles and 492 of 513 REAL articles correctly, with 43 FAKE articles labeled REAL and 21 REAL articles labeled FAKE (64 errors in 1,000 articles, or 6.4%). The model without preprocessing classifies 2,898 of 3,164 FAKE articles and 3,093 of 3,171 REAL articles correctly, with 266 FAKE articles labeled REAL and 78 REAL articles labeled FAKE (344 errors in 6,335 articles, or 5.4%). Its raw error counts are larger only because it was evaluated on more articles. The preprocessing steps (restricting tokens to alphabetic words, removing stop words, stemming, and setting a minimum document frequency) therefore do not improve accuracy here, but they achieve comparable accuracy with a much smaller vector model: 6,975 dimensions instead of 67,659 for the full corpus, or 31,297 for the same 1,000 articles without preprocessing.
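
The comparison of the two intervals can be checked directly from the confusion matrices. The following sketch recomputes both accuracies and their 95% intervals and tests whether the intervals overlap; the helper `intervals_overlap` is a name introduced for this illustration only:

```python
import math

def classification_confint(acc, n):
    # Normal-approximation 95% interval, same formula as earlier in the notebook
    interval = 1.96 * math.sqrt(acc * (1 - acc) / n)
    return (max(0, acc - interval), min(1.0, acc + interval))

def intervals_overlap(a, b):
    # Two closed intervals overlap when each starts before the other ends
    return a[0] <= b[1] and b[0] <= a[1]

# Accuracies recomputed from the diagonals of the two confusion matrices
with_prep = classification_confint((444 + 492) / 1000, 1000)
without_prep = classification_confint((2898 + 3093) / 6335, 6335)

print("with preprocessing:    ({:.3f}, {:.3f})".format(*with_prep))
print("without preprocessing: ({:.3f}, {:.3f})".format(*without_prep))
print("overlap:", intervals_overlap(with_prep, without_prep))
```

Overlapping intervals are a conservative indication that the difference is not significant. Since both accuracies come from training data and the two document sets overlap, this should be read as a rough check rather than a formal test.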

## Extra Credit

In [43]:
import pandas as pd
from sklearn.utils import shuffle

# Load your dataset
url1 = "https://raw.githubusercontent.com/lutzhamel/fake-news/master/data/fake_or_real_news.csv"
news1 = pd.read_csv(url1).drop(columns=['id'])

# Shuffle the dataset
shuffled_news1 = shuffle(news1, random_state=42)

# Downsample the dataset to 1,000 articles
downsampled_news1 = shuffled_news1.head(1000)

# Check the shape of the downsampled dataset
print(downsampled_news1.shape)


(1000, 3)


In [44]:
import pandas
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

# process documents
vectorizer = CountVectorizer(analyzer = "word", binary = True)
docarray = vectorizer.fit_transform(downsampled_news1['title']).toarray()
print("docarray shape: {}".format(docarray.shape))
print("first 10 coords: {}".format(vectorizer.get_feature_names_out()[:10]))



docarray shape: (1000, 3514)
first 10 coords: ['000' '10' '100' '100k' '11' '117' '12' '126' '13' '130']


In [45]:
# process documents
vectorizer = CountVectorizer(analyzer = "word",
                             token_pattern = "[a-zA-Z]+", # only words
                             binary = True,
                             min_df=4) # keep only words that appear in at least four documents
docarray = vectorizer.fit_transform(downsampled_news1['title']).toarray()

print("docarray shape: {}".format(docarray.shape))
print("first 10 coords: {}".format(vectorizer.get_feature_names_out()[:10]))



docarray shape: (1000, 486)
first 10 coords: ['a' 'about' 'accused' 'action' 'actually' 'admits' 'after' 'against'
 'ahead' 'air']


In [46]:
#STOP WORDS

# process documents
vectorizer = CountVectorizer(analyzer = "word",
                             token_pattern = "[a-zA-Z]+", # only words
                             binary = True,
                             stop_words = 'english',
                             min_df=4) # keep only words that appear in at least four documents
docarray = vectorizer.fit_transform(downsampled_news1['title']).toarray()

print("docarray shape: {}".format(docarray.shape))
print("first 10 coords: {}".format(vectorizer.get_feature_names_out()[:10]))

docarray shape: (1000, 362)
first 10 coords: ['accused' 'action' 'actually' 'admits' 'ahead' 'air' 'airport' 'al'
 'america' 'american']


In [47]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer


# add doc names so that later analysis becomes more readable
doc_names = ['doc{}'.format(i) for i in range(downsampled_news1.shape[0])]
downsampled_news1 = pd.DataFrame(downsampled_news1.values, index=doc_names,columns=downsampled_news1.columns)
print(downsampled_news1.head(n=10))


# build the stemmer object
stemmer = PorterStemmer()


# get the default text analyzer from CountVectorizer
analyzer = CountVectorizer(analyzer = "word",
                           stop_words = 'english',
                           token_pattern = "[a-zA-Z]+").build_analyzer()


# build a new analyzer that stems using the default analyzer to create the words to be stemmed
def stemmed_words(doc):
    return [stemmer.stem(w) for w in analyzer(doc)]


vectorizer = CountVectorizer(analyzer=stemmed_words,
                                 binary=True,
                                 min_df=4)
docarray = vectorizer.fit_transform(downsampled_news1['title']).toarray()


print("docarray shape: {}".format(docarray.shape))
print("Dimensions of the vector model:", docarray.shape[1])
print("first 10 coords: {}".format(vectorizer.get_feature_names_out()[:10]))




                                                  title  \
doc0                          American Dream, Revisited   
doc1  Clintons Are Under Multiple FBI Investigations...   
doc2  The FBI Can’t Actually Investigate a Candidate...   
doc3  Confirmed: Public overwhelmingly (10-to-1) say...   
doc4    Nanny In Jail After Force Feeding Baby To Death   
doc5  Media Roll Out Welcome Mat for ‘Humanitarian’ ...   
doc6  Hillary Clinton accepts nomination with 'bound...   
doc7  Police Turn In Badges Rather Than Incite Viole...   
doc8  South Carolina police officer charged with mur...   
doc9  Tony Blair helpfully describes Remain voters a...   

                                                   text label  
doc0  Will Trump pull a Brexit times ten? What would...  FAKE  
doc1  Clintons Are Under Multiple FBI Investigations...  FAKE  
doc2  Dispatches from Eric Zuesse This piece is cros...  FAKE  
doc3  Print \n[Ed. – Every now and then the facade c...  FAKE  
doc4  Nanny In Jail After Forc

In [48]:
distances = euclidean_distances(docarray)
doc_names = ['doc{}'.format(i) for i in range(docarray.shape[0])]
distances_df = pandas.DataFrame(data=distances,index=doc_names,columns=doc_names)
distances_df

Unnamed: 0,doc0,doc1,doc2,doc3,doc4,doc5,doc6,doc7,doc8,doc9,...,doc990,doc991,doc992,doc993,doc994,doc995,doc996,doc997,doc998,doc999
doc0,0.000000,2.000000,2.828427,2.645751,2.000000,2.236068,2.645751,2.828427,2.449490,1.732051,...,2.236068,2.645751,2.000000,1.732051,2.449490,2.000000,2.449490,2.236068,2.236068,2.000000
doc1,2.000000,0.000000,2.000000,3.000000,2.449490,2.645751,2.645751,3.162278,2.828427,2.236068,...,2.236068,2.645751,2.449490,2.236068,2.828427,2.449490,2.828427,2.645751,2.645751,2.449490
doc2,2.828427,2.000000,0.000000,3.316625,3.162278,3.316625,3.000000,3.741657,3.464102,3.000000,...,3.000000,3.316625,3.162278,3.000000,3.464102,2.828427,3.464102,3.316625,3.316625,3.162278
doc3,2.645751,3.000000,3.316625,0.000000,3.000000,2.828427,3.162278,3.605551,3.316625,2.828427,...,3.162278,3.464102,2.645751,2.828427,3.316625,3.000000,3.316625,3.162278,3.162278,3.000000
doc4,2.000000,2.449490,3.162278,3.000000,0.000000,2.645751,3.000000,3.162278,2.828427,2.236068,...,2.645751,3.000000,2.449490,2.236068,2.828427,2.449490,2.828427,2.645751,2.645751,2.449490
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
doc995,2.000000,2.449490,2.828427,3.000000,2.449490,2.645751,2.645751,3.162278,2.828427,2.236068,...,2.645751,3.000000,2.449490,1.732051,2.828427,0.000000,2.449490,2.645751,2.645751,2.000000
doc996,2.449490,2.828427,3.464102,3.316625,2.828427,3.000000,3.000000,3.464102,3.162278,2.645751,...,3.000000,3.316625,2.828427,2.236068,3.162278,2.449490,0.000000,3.000000,3.000000,2.000000
doc997,2.236068,2.645751,3.316625,3.162278,2.645751,2.828427,3.162278,3.316625,3.000000,2.449490,...,2.828427,3.162278,2.645751,2.449490,3.000000,2.645751,3.000000,0.000000,2.828427,2.645751
doc998,2.236068,2.645751,3.316625,3.162278,2.645751,2.828427,3.162278,3.316625,3.000000,2.449490,...,2.828427,3.162278,2.645751,2.449490,3.000000,2.645751,3.000000,2.828427,0.000000,2.645751


In [49]:
from sklearn.naive_bayes import MultinomialNB
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer

# compute a 95% confidence interval for the accuracy of a classifier

def classification_confint(acc, n):
    '''
    Compute the 95% confidence interval for a classification problem.
      acc -- classification accuracy
      n   -- number of observations used to compute the accuracy
    Returns a tuple (lb,ub)
    '''
    import math
    interval = 1.96*math.sqrt(acc*(1-acc)/n)
    lb = max(0, acc - interval)
    ub = min(1.0, acc + interval)
    return (lb,ub)

In [50]:

## Naive Bayes

print("******** model **********")


# Naive Bayes
model = MultinomialNB()
# NOTE: MultinomialNB is used with its default settings (smoothing parameter alpha=1.0), so no search over a parameter space is performed
model.fit(docarray, downsampled_news1['label'])


print("******** Accuracy **********")

# accuracy on the training data (the documents the model was fit on) with confidence interval
best_model = model
predict_y = best_model.predict(docarray)
acc = accuracy_score(downsampled_news1['label'], predict_y)
lb,ub = classification_confint(acc,docarray.shape[0])
print("Accuracy of Naive Bayes with confidence interval: {:3.2f} ({:3.2f}, {:3.2f})".format(acc,lb,ub))


******** model **********
******** Accuracy **********
Accuracy of Naive Bayes with confidence interval: 0.83 (0.80, 0.85)


In [51]:
# build the confusion matrix
print("******** confusion matrix **********")
cats = ['FAKE','REAL']
cm = confusion_matrix(downsampled_news1['label'], predict_y, labels=cats)
cm_df = pd.DataFrame(cm, index=cats, columns=cats)
print("Confusion Matrix:\n{}".format(cm_df))

******** confusion matrix **********
Confusion Matrix:
      FAKE  REAL
FAKE   392    95
REAL    77   436


Question: Try the same thing but instead of ‘text’ use ‘title’ for your training text. How does a classifier built on this data set compare to the original classifier?

Answer: The original classifier, trained on the preprocessed 'text' column, is clearly better than the extra credit classifier trained on 'title' alone. Its accuracy is 0.94 versus 0.83, and the 95% confidence intervals, (0.92, 0.95) and (0.80, 0.85), do not overlap, which indicates a real difference on this sample. The confusion matrices agree: the 'text' model classifies 444 FAKE and 492 REAL articles correctly and makes 64 errors (43 FAKE articles labeled REAL, 21 REAL articles labeled FAKE), while the 'title' model classifies 392 FAKE and 436 REAL articles correctly and makes 172 errors (95 FAKE articles labeled REAL, 77 REAL articles labeled FAKE). The full text carries far more information than the title: the 'text' vector model has 6,975 dimensions, whereas the 'title' vocabulary has only 362 features after stop-word removal and before stemming.