
# Decision Trees Part 2

https://tinyurl.com/4mfb63zw

In [None]:
#Read and print the dataset
import pandas

df = pandas.read_csv("data.csv")

print(df)

    Age  Experience  Rank Nationality   Go
0    36          10     9          UK   NO
1    42          12     4         USA   NO
2    23           4     6           N   NO
3    52           4     4         USA   NO
4    43          21     8         USA  YES
5    44          14     5          UK   NO
6    66           3     7           N  YES
7    35          14     9          UK  YES
8    52          13     7           N  YES
9    35           5     9           N  YES
10   24           3     5         USA   NO
11   18           3     7          UK  YES
12   45           9     9          UK  YES


To build a decision tree, all data has to be numerical.

We have to convert the non-numerical columns 'Nationality' and 'Go' into numerical values.

Pandas has a map() method that takes a dictionary describing how to convert the values.

{'UK': 0, 'USA': 1, 'N': 2}

This means: convert the value 'UK' to 0, 'USA' to 1, and 'N' to 2.

In [None]:
#Change string values into numerical values:
d = {'UK': 0, 'USA': 1, 'N': 2}
df['Nationality'] = df['Nationality'].map(d)
d = {'YES': 1, 'NO': 0}
df['Go'] = df['Go'].map(d)

print(df)

    Age  Experience  Rank  Nationality  Go
0    36          10     9            0   0
1    42          12     4            1   0
2    23           4     6            2   0
3    52           4     4            1   0
4    43          21     8            1   1
5    44          14     5            0   0
6    66           3     7            2   1
7    35          14     9            0   1
8    52          13     7            2   1
9    35           5     9            2   1
10   24           3     5            1   0
11   18           3     7            0   1
12   45           9     9            0   1


Then we have to separate the feature columns from the target column.

The feature columns are the columns that we try to predict from, and the target column is the column with the values we try to predict.

In [None]:
#X is the feature columns, y is the target column:
features = ['Age', 'Experience', 'Rank', 'Nationality']

X = df[features]
y = df['Go']

print(X)
print(y)

    Age  Experience  Rank  Nationality
0    36          10     9            0
1    42          12     4            1
2    23           4     6            2
3    52           4     4            1
4    43          21     8            1
5    44          14     5            0
6    66           3     7            2
7    35          14     9            0
8    52          13     7            2
9    35           5     9            2
10   24           3     5            1
11   18           3     7            0
12   45           9     9            0
0     0
1     0
2     0
3     0
4     1
5     0
6     1
7     1
8     1
9     1
10    0
11    1
12    1
Name: Go, dtype: int64


In [None]:
#Create and display a Decision Tree:
import pandas
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

df = pandas.read_csv("data.csv")

d = {'UK': 0, 'USA': 1, 'N': 2}
df['Nationality'] = df['Nationality'].map(d)
d = {'YES': 1, 'NO': 0}
df['Go'] = df['Go'].map(d)

features = ['Age', 'Experience', 'Rank', 'Nationality']

X = df[features]
y = df['Go']

dtree = DecisionTreeClassifier()
dtree = dtree.fit(X, y)

#tree.plot_tree(dtree, feature_names=features)

## Result Explained
The decision tree uses your earlier decisions to estimate whether you would want to go see a comedian or not.

Let us read the different aspects of the decision tree:

Rank <= 6.5 means that every comedian with a rank of 6.5 or lower will follow the True arrow (to the left), and the rest will follow the False arrow (to the right).

gini = 0.497 refers to the quality of the split, and with two classes it is always a number between 0.0 and 0.5, where 0.0 means all of the samples in the node got the same result, and 0.5 means the samples are divided evenly between the two results.

samples = 13 means that there are 13 comedians left at this point in the decision, which is all of them since this is the first step.

value = [6, 7] means that of these 13 comedians, 6 will get a "NO", and 7 will get a "YES".
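The gini numbers shown in the tree can be checked by hand from the value counts. The sketch below assumes the standard Gini impurity formula (1 minus the sum of squared class proportions), which is the default criterion of scikit-learn's DecisionTreeClassifier. The helper name `gini_impurity` is made up for this illustration.

```python
def gini_impurity(class_counts):
    """Return 1 - sum(p_k^2), where p_k is the share of each class in a node."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# Root node: 6 "NO" and 7 "YES" out of 13 comedians (roughly 0.497)
print(gini_impurity([6, 7]))

# A pure node: every sample has the same label, so the impurity is 0.0
print(gini_impurity([5, 0]))

# A node with 1 "NO" and 7 "YES": exactly 0.21875, displayed in the tree as 0.219
print(gini_impurity([1, 7]))
```

The closer the counts are to an even split, the closer the value gets to 0.5.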

The next step contains two boxes, one box for the comedians with a 'Rank' of 6.5 or lower, and one box with the rest.

### True - 5 Comedians End Here:
gini = 0.0 means all of the samples got the same result.

samples = 5 means that there are 5 comedians left in this branch (5 comedians with a Rank of 6.5 or lower).

value = [5, 0] means that 5 will get a "NO" and 0 will get a "YES".

### False - 8 Comedians Continue:

Nationality <= 0.5 means that the comedians with a nationality value of less than 0.5 will follow the arrow to the left (which means everyone from the UK), and the rest will follow the arrow to the right.

gini = 0.219 is the impurity of this node: with 1 "NO" and 7 "YES" samples, 1 - (1/8)² - (7/8)² ≈ 0.219, so the node is close to pure.

samples = 8 means that there are 8 comedians left in this branch (8 comedians with a Rank higher than 6.5).

value = [1, 7] means that of these 8 comedians, 1 will get a "NO" and 7 will get a "YES".


... AND SO ON

## Predict Values
We can use the Decision Tree to predict new values.

Example: Should I go see a show starring a 40-year-old American comedian with 10 years of experience and a comedy ranking of 7?

In [None]:
#Use predict() method to predict new values:
print(dtree.predict([[40, 10, 7, 1]]))


[1]




# Cleaning News Data
We need to clean the news data so it is usable by the decision tree algorithm/model.

In [None]:
#unzip fake news data
!unzip "News _dataset.zip"

Archive:  News _dataset.zip
  inflating: Fake.csv                
  inflating: True.csv                


In [None]:
#Load the data
import pandas as pd

df = pd.read_csv("Fake.csv")

print(df.head())

                                               title  \
0   Donald Trump Sends Out Embarrassing New Year’...   
1   Drunk Bragging Trump Staffer Started Russian ...   
2   Sheriff David Clarke Becomes An Internet Joke...   
3   Trump Is So Obsessed He Even Has Obama’s Name...   
4   Pope Francis Just Called Out Donald Trump Dur...   

                                                text subject  \
0  Donald Trump just couldn t wish all Americans ...    News   
1  House Intelligence Committee Chairman Devin Nu...    News   
2  On Friday, it was revealed that former Milwauk...    News   
3  On Christmas day, Donald Trump announced that ...    News   
4  Pope Francis used his annual Christmas Day mes...    News   

                date  
0  December 31, 2017  
1  December 31, 2017  
2  December 30, 2017  
3  December 29, 2017  
4  December 25, 2017  


## Bag of Words
### Step 1: Collect Data
Below is a snippet of the first few lines of text from the book “A Tale of Two Cities” by Charles Dickens, taken from Project Gutenberg.

It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,

For this small example, let’s treat each line as a separate “document” and the 4 lines as our entire corpus of documents.

### Step 2: Design the Vocabulary
Now we can make a list of all of the words in our model vocabulary.

The unique words here (ignoring case and punctuation) are:


*   "it"
*   "was"
* "the"
* "best"
* "of"
* "times"
* "worst"
* "age"
* "wisdom"
* "foolishness"

That is a vocabulary of 10 words from a corpus containing 24 words.
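As a quick check of those numbers, here is a minimal sketch that builds the vocabulary from the four lines. It assumes that stripping ASCII punctuation and lowercasing is enough for this tiny corpus. The helper name `tokenize` is made up for this illustration.

```python
import string

corpus = [
    "It was the best of times,",
    "it was the worst of times,",
    "it was the age of wisdom,",
    "it was the age of foolishness,",
]

def tokenize(line):
    """Lowercase a line, strip ASCII punctuation, and split on whitespace."""
    cleaned = line.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

vocabulary = []   # unique words, in the order they are first seen
total_words = 0   # every word in the corpus, repeats included
for line in corpus:
    for word in tokenize(line):
        total_words += 1
        if word not in vocabulary:
            vocabulary.append(word)

print(vocabulary)
print(len(vocabulary), total_words)  # 10 24
```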

### Step 3: Create Document Vectors
The next step is to score the words in each document.

The objective is to turn each document of free text into a vector that we can use as input or output for a machine learning model.

Because we know the vocabulary has 10 words, we can use a fixed-length document representation of 10, with one position in the vector to score each word.

The simplest scoring method is to mark the presence of words as a boolean value, 0 for absent, 1 for present.

Using the arbitrary ordering of words listed above in our vocabulary, we can step through the first document (“It was the best of times“) and convert it into a binary vector.

The scoring of the document would look as follows:

*   "it" = 1
*   "was" = 1
* "the" = 1
* "best" = 1
* "of" = 1
* "times" = 1
* "worst" = 0
* "age" = 0
* "wisdom" = 0
* "foolishness" = 0

As a binary vector, this would look as follows:
```
[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```

The other three documents would look as follows:

```
"it was the worst of times" = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
"it was the age of wisdom" = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
"it was the age of foolishness" = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]
```
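The same vectors can be produced in a few lines of Python. This is only a sketch: `binary_vector` and `count_vector` are names invented here, and the last sample document ("the best of the best") is made up to show how counting differs from marking presence.

```python
from collections import Counter

vocabulary = ["it", "was", "the", "best", "of", "times",
              "worst", "age", "wisdom", "foolishness"]

def binary_vector(document, vocab):
    """1 if the vocabulary word appears in the document, otherwise 0."""
    words = set(document.lower().split())
    return [1 if word in words else 0 for word in vocab]

def count_vector(document, vocab):
    """Number of times each vocabulary word appears in the document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocab]

documents = [
    "it was the best of times",
    "it was the worst of times",
    "it was the age of wisdom",
    "it was the age of foolishness",
]
for doc in documents:
    print(doc, "=", binary_vector(doc, vocabulary))

# Counting instead of marking presence (made-up document):
print(count_vector("the best of the best", vocabulary))
# [0, 0, 2, 2, 1, 0, 0, 0, 0, 0]
```

Count vectors carry more information than binary ones when a word is repeated inside a document.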

In [None]:
def clean_article(text):
  #lowercase, then remove punctuation and digits in a single pass
  text = text.lower()
  chars_to_remove = '.,!"\'?:/@()[]_*0123456789-#;'
  text = text.translate(str.maketrans('', '', chars_to_remove))

  #split into words
  text = text.strip().split()

  #remove links
  text = [ x for x in text if "www" not in x and "http" not in x ]

  return text
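
To see what these cleaning rules do to a sentence, the sketch below applies the same character removals and link filter to a made-up headline. The function name `clean_article_sketch` and the sample text are invented for this illustration; the function is a compact restatement of the rules above so that the block runs on its own.

```python
# Same characters that the cleaning function above removes
CHARS_TO_REMOVE = '.,!"\'?:/@()[]_*0123456789-#;'

def clean_article_sketch(text):
    """Lowercase, delete the listed characters, split, and drop link-like tokens."""
    text = text.lower().translate(str.maketrans('', '', CHARS_TO_REMOVE))
    words = text.strip().split()
    return [w for w in words if "www" not in w and "http" not in w]

sample = 'BREAKING: The so-called "deal" cost $20 million! Read more at http://example.com/story'
tokens = clean_article_sketch(sample)
print(tokens)
# ['breaking', 'the', 'socalled', 'deal', 'cost', '$', 'million', 'read', 'more', 'at']
```

Two side effects are visible here: hyphenated words are fused ("socalled"), and characters that are not in the removal list, such as "$", survive as their own tokens.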

In [None]:
#Turn Fake and Real news into bag of words (use the text from the article)

#You will need to use both sets of data to create a common vocabulary

#I recommend using a dictionary, rather than a list, to keep track of your BOW

#Save your data in a new CSV file (news_data.csv) that has the vocabulary as
#the header and the counts as the features for each article

#We're going to use this data csv on Thursday to train our decision tree model

import pandas as pd

df_fake = pd.read_csv("Fake.csv")
df_real = pd.read_csv("True.csv")

df_fake = df_fake['text']
df_real = df_real['text']

word_dict = {}

#Get Vocab Words for Fake
cnt = 0
for text in df_fake:
  text = clean_article(text)
  for word in text:
    try:
      word_dict[word] += 1
    except KeyError:
      word_dict[word] = 1 #first time we see this word, so its count starts at 1
  cnt += 1
  if cnt > 1000:
    break

#Get Vocab Words for Real
cnt = 0
for text in df_real:
  text = clean_article(text)
  for word in text:
    try:
      word_dict[word] += 1
    except KeyError:
      word_dict[word] = 1 #first time we see this word, so its count starts at 1
  cnt += 1
  if cnt > 1000:
    break

#Remove words that occur min_thresh times or fewer, or more than max_thresh times
vocab = list(word_dict)
print("Vocabulary Length Before Min/Max Removal:", len(vocab))

min_thresh = 100
max_thresh = 1000
for word in vocab:
  if word_dict[word] <= min_thresh or word_dict[word] > max_thresh:
    word_dict.pop(word)

vocab = list(word_dict)
print("Vocabulary Length After Min/Max Removal:", len(vocab))


Vocabulary Length Before Min/Max Removal: 36871
Vocabulary Length After Min/Max Removal: 885
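
The effect of the two thresholds is easier to see on a toy corpus. The sketch below uses made-up token lists and made-up thresholds, and builds a new dictionary with a comprehension rather than popping entries; it keeps only words whose count is greater than `min_thresh` and at most `max_thresh`, which is the same condition as in the cell above.

```python
from collections import Counter

# Made-up, already-cleaned articles
toy_articles = [
    ["the", "senate", "vote", "the", "bill"],
    ["the", "president", "vote", "the", "senate"],
    ["the", "bill", "vote", "typo"],
]

# Count every word across all articles
word_counts = Counter()
for tokens in toy_articles:
    word_counts.update(tokens)

min_thresh = 1   # drop words seen this many times or fewer (rare words, typos)
max_thresh = 4   # drop words seen more often than this (very common words)

kept = {word: count for word, count in word_counts.items()
        if min_thresh < count <= max_thresh}

print(sorted(kept))  # ['bill', 'senate', 'vote']
```

Here "the" is dropped for being too common, while "president" and "typo" are dropped for being too rare.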


In [None]:
#Now write out BOW for each article

#create empty article dictionary
article_dict = word_dict.copy()
article_dict = dict.fromkeys(article_dict, 0) #This is faster than looping through and setting each count to 0

#Open output file and write the vocab out as the header line
fout = open('news_data.csv', 'w')
vocab_str = ','.join(vocab)
fout.write(vocab_str + ',target_label\n') #add target_label header for label column

cnt = 0
for text in df_fake:
  text = clean_article(text)
  for word in text:
    try:                        # try/except is faster than if/else
      article_dict[word] += 1
    except:
      continue #word not in dictionary, go to next word (just an error catch)

  #Turn count list into a string of comma separated values
  article_list = list(article_dict.values())
  str_list = ','.join(str(e) for e in article_list)
  fout.write(str_list + ',1\n') #add 1 at the end for label for fake

  #reset article dictionary to 0 counts
  article_dict = dict.fromkeys(article_dict, 0)

  #only keep the first 1000 articles
  cnt += 1
  if cnt >= 1000:
    break

#Repeat process for real articles (label of 0 for true)
cnt = 0
for text in df_real:
  text = clean_article(text)
  for word in text:
    try:
      article_dict[word] += 1
    except:
      continue

  article_list = list(article_dict.values())
  str_list = ','.join(str(e) for e in article_list)
  fout.write(str_list + ',0\n')

  article_dict = dict.fromkeys(article_dict, 0)
  cnt += 1
  if cnt >= 1000:
    break
fout.close()
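
Joining the counts with commas by hand works here because the cleaning step removes commas and quotes from every word. As an alternative sketch, Python's built-in csv module can write the same layout and handles quoting on its own. The vocabulary and articles below are made up, and an in-memory buffer stands in for the real output file.

```python
import csv
import io
from collections import Counter

# Made-up vocabulary and (tokens, label) pairs; 1 = fake, 0 = real
vocab = ["senate", "vote", "bill"]
articles = [
    (["the", "senate", "vote", "vote"], 1),
    (["bill", "passed"], 0),
]

buffer = io.StringIO()  # stand-in for open('news_data.csv', 'w', newline='')
writer = csv.writer(buffer)

# Header: the vocabulary followed by the label column
writer.writerow(vocab + ["target_label"])

for tokens, label in articles:
    counts = Counter(tokens)  # words missing from the article count as 0
    writer.writerow([counts[word] for word in vocab] + [label])

print(buffer.getvalue())
```

Words that are not in the vocabulary ("the", "passed") are simply ignored, which mirrors the try/except in the cell above.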

In [None]:
#Create and display a Decision Tree:
import pandas as pd
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

df_train = pd.read_csv('news_data.csv')
#get feature names (every column except the final target_label column)
#get X and y (use dataframe.values so you don't get all those warnings)
list_of_features = list(df_train.columns[:-1])
X = df_train[list_of_features].values
y = df_train['target_label']

#train DT

dtree = DecisionTreeClassifier()
dtree = dtree.fit(X, y)

#print DT
tree.plot_tree(dtree, feature_names=list_of_features)


In [None]:
#Test our decision tree on more data from Fake.csv and True.csv

df_fake = pd.read_csv("Fake.csv")
df_real = pd.read_csv("True.csv")
df_fake = df_fake['text']
df_real = df_real['text']


df_fake = df_fake[1000:10000]
df_real = df_real[1000:10000]

#test samples 1000-10000 from Fake and True and get accuracies for both
correct = 0
total = 0
for text in df_fake:
  text = clean_article(text)
  for word in text:
    try:                        # try/except is faster than if/else
      article_dict[word] += 1
    except:
      continue #word not in dictionary, go to next word (just an error catch)
  article_list = list(article_dict.values())
  article_dict = dict.fromkeys(article_dict, 0)

  if dtree.predict([article_list]) == 1:
    correct += 1
  total += 1
print(f'Fake Data Test Accuracy:  {round((correct / total) * 100, 2)}%')

#Repeat process for real articles (label of 0 for true)
correct = 0
total = 0
for text in df_real:
  text = clean_article(text)
  for word in text:
    try:
      article_dict[word] += 1
    except:
      continue
  article_list = list(article_dict.values())
  article_dict = dict.fromkeys(article_dict, 0)

  if dtree.predict([article_list]) == 0:
    correct += 1
  total += 1
print(f'Real Data Test Accuracy:  {round((correct / total) * 100, 2)}%')

Fake Data Test Accuracy:  93.91%
Real Data Test Accuracy:  98.27%
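
The two loops above report accuracy separately for fake and real articles. The same bookkeeping can be sketched as one helper that takes lists of true and predicted labels. The name `per_class_accuracy` and the label lists below are made up for illustration and are not the notebook's actual predictions.

```python
def per_class_accuracy(true_labels, predicted_labels):
    """Return {label: fraction of samples with that true label predicted correctly}."""
    totals = {}
    correct = {}
    for truth, guess in zip(true_labels, predicted_labels):
        totals[truth] = totals.get(truth, 0) + 1
        if truth == guess:
            correct[truth] = correct.get(truth, 0) + 1
    return {label: correct.get(label, 0) / totals[label] for label in totals}

# Made-up labels: 1 = fake, 0 = real
true_labels = [1, 1, 1, 1, 0, 0, 0, 0]
predicted   = [1, 1, 1, 0, 0, 0, 0, 0]

result = per_class_accuracy(true_labels, predicted)
for label, accuracy in result.items():
    print(f"label {label}: {round(accuracy * 100, 2)}%")
```

Reporting each class on its own makes it obvious when a model does well on one class and poorly on the other, which a single overall accuracy can hide.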


In [None]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

df_train = pd.read_csv('news_data.csv')

list_of_features = df_train.keys()[:-1]
X = df_train[list_of_features].values
y = df_train['target_label']

#tree.plot_tree(dtree, feature_names=list_of_features)

#Train KNN
KNN = KNeighborsClassifier(n_neighbors = 3)
KNN = KNN.fit(X,y)


In [None]:
df_fake = pd.read_csv("Fake.csv")
df_real = pd.read_csv("True.csv")
df_fake = df_fake['text']
df_real = df_real['text']


df_fake = df_fake[1000:10000]
df_real = df_real[1000:10000]

correct = 0
total = 0
for text in df_fake:
  text = clean_article(text)
  for word in text:
    try:                        # try/except is faster than if/else
      article_dict[word] += 1
    except:
      continue #word not in dictionary, go to next word (just an error catch)
  article_list = list(article_dict.values())
  article_dict = dict.fromkeys(article_dict, 0)

  if KNN.predict([article_list]) == 1:
    correct += 1
  total += 1
print(f'Fake Data Test Accuracy:  {round((correct / total) * 100, 2)}%')

#Repeat process for real articles (label of 0 for true)
correct = 0
total = 0
for text in df_real:
  text = clean_article(text)
  for word in text:
    try:
      article_dict[word] += 1
    except:
      continue
  article_list = list(article_dict.values())
  article_dict = dict.fromkeys(article_dict, 0)

  if KNN.predict([article_list]) == 0:
    correct += 1
  total += 1
print(f'Real Data Test Accuracy:  {round((correct / total) * 100, 2)}%')

Fake Data Test Accuracy:  16.62%
Real Data Test Accuracy:  95.63%
