# Final Project

You have made it to the end of the course, and you have worked hard to develop your DSA perspectives and skills.  So far we have been internally focused on the operations of performing data science and analytics.  Now we will extend our work to the development of a data story that is externally focused.

In the Module 8 labs, you saw simplified examples of constructing data stories. Module 4 (Database) also included an abbreviated example data story.  Throughout the course, you have encountered components that can serve as a basis for developing a short, unique, focused data story.


For this final project, you will:

- Step 0: Choose your Language for this Adventure

- Step 1: Find a Story

- Step 2: Remember your Audience

- Step 3: Find and Stage Your Data

- Step 4: Vet Data Sources

- Step 5: Filter Results and Build/Validate Models

- Step 6: Visualize Results

- Step 7: Communicate the Story to your intended audience using visualizations and narratives

- Final Step: Connect your workflow/process to the DSA-Project Life Cycle

---
Here are some recommendations for managing the scope and quality of this project:

- Narrow down the issue, problem, question, or hypothesis for your data story to a single, relatively simple perspective.

- Identify already available data that affords addressing your problem.  If using completely new data, know it well.

- Address the data relative to the statistical/machine learning model(s) chosen to minimize any issues.

- Internally document your code using comments that explain the purpose of the operation(s).


Make your project unique by

- Comparing two or more different statistical/machine learning models using the same data.
- Refraining from identically replicating any existing projects obtained from external sources.
- Running a single model multiple times and changing a different single parameter each time for comparison.
- Changing the sampling proportions for building the hold-out data and comparing the same model performance repeatedly.
- Selecting something you find interesting or unique in the data and writing a story around it.
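
As a rough illustration of two of the ideas in this list (comparing models on the same data, and varying the hold-out proportion), the sketch below fits two models on identical splits at several test sizes. The synthetic data from `make_classification`, the two model choices (`LogisticRegression` and `LinearSVC`), and the names `compare_models` and `results` are illustrative assumptions, not part of the assignment; substitute your own data and models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption): replace with your own features and labels
X, y = make_classification(n_samples=500, n_features=10, random_state=0)


def compare_models(X, y, test_sizes=(0.2, 0.3, 0.4)):
    """Fit two models on identical splits and collect hold-out accuracy."""
    results = []
    for test_size in test_sizes:
        # Same split for both models so the comparison is like-for-like
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=0, stratify=y
        )
        # Fresh model instances for every split so no fitted state carries over
        models = {
            "logistic_regression": LogisticRegression(max_iter=1000),
            "linear_svc": LinearSVC(dual=False),
        }
        for name, model in models.items():
            model.fit(X_train, y_train)
            acc = accuracy_score(y_test, model.predict(X_test))
            results.append((test_size, name, acc))
    return results


results = compare_models(X, y)
for test_size, name, acc in results:
    print(f"test_size={test_size:.1f}  {name:<20} accuracy={acc:.3f}")
```

Keeping the split and `random_state` fixed across models means any difference in accuracy comes from the model rather than from a different draw of the hold-out data.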




## Step 0: Choose your Language for this Adventure

You can do this project in either *R* or *Python*.

To change the kernel of this notebook, do the following with the `Kernel` menu.

 * `Kernel > Change Kernel > Python 3`
 * `Kernel > Change Kernel > R`

![FP_Change_Kernel.png MISSING](../images/FP_Change_Kernel.png)


---
## Step 1: Find a Story

Think back to any of the data files we have used in this class. 
Alternatively, you can search online for potential data and story ideas.

In the cell below, please detail the source of your data (with a link).
Additionally, preview the story you hope to uncover.

## Step 2: Remember your Audience

In the cell below, describe your audience!
 * Who will the audience be?
 * What value will they derive from your story?

## Step 3: Find and Stage Your Data

If your data is from another source, such as Kaggle, you must download it to your local computer, then upload the data to JupyterHub.

#### If you are uploading files:
 * Use the folder navigation in your first Jupyter tab to get to the course's `/modules/module8/exercises/` folder.
![FP_Folder_Navigation.png MISSING](../images/FP_Folder_Navigation.png)
 * Click the Upload Button and Choose File(s)
![FP_Upload_Button.png MISSING](../images/FP_Upload_Button.png)
 * Activate the upload
![FP_UploadFile_2.png MISSING](../images/FP_UploadFile_2.png)
 

### In the cell below, please list the name(s) of the file(s) that are now accessible in the JupyterHub environment.

**Note**: 
If you uploaded a file to your `module8/exercises` folder, the file name is all you need to load it into a data frame in the usual manner.
If you are using a file from another module of the course, you should be able to copy the full pathname and use it as is in this notebook.

## Step 4: Vet Data Sources

Use the cells below to load the data, inspect it, conduct data carpentry and shaping, and perform exploratory data analysis.  

Add more cells (`Insert > Insert Cell Below`) as needed.

In [1]:
# Load packages and libraries
import numpy as np
import pandas as pd
import itertools
from sklearn.model_selection import train_test_split
# Converts text into TF-IDF features (term frequency weighted by inverse document frequency)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
#from sklearn.model_selection import train_test_split

# Read the CSV into a pandas DataFrame using UTF-8 encoding
news = pd.read_csv('news.csv', encoding = 'utf8')

# Change text in the title and text columns to all uppercase
news['title'] = news['title'].str.upper()
news['text'] = news['text'].str.upper()

# Display the shape, data types, and first few records of data set
print(news.shape)
print(news.dtypes)
news.head()

(27135, 4)
id       object
title    object
text     object
label    object
dtype: object


Unnamed: 0,id,title,text,label
0,29275,YOU CAN SMELL HILLARY’S FEAR,"DANIEL GREENFIELD, A SHILLMAN JOURNALISM FELLO...",FAKE
1,31093,WATCH THE EXACT MOMENT PAUL RYAN COMMITTED POL...,GOOGLE PINTEREST DIGG LINKEDIN REDDIT STUMBLEU...,FAKE
2,24407,KERRY TO GO TO PARIS IN GESTURE OF SYMPATHY,U.S. SECRETARY OF STATE JOHN F. KERRY SAID MON...,REAL
3,30941,BERNIE SUPPORTERS ON TWITTER ERUPT IN ANGER AG...,"— KAYDEE KING (@KAYDEEKING) NOVEMBER 9, 2016 T...",FAKE
4,21674,THE BATTLE OF NEW YORK: WHY THIS PRIMARY MATTERS,IT'S PRIMARY DAY IN NEW YORK AND FRONT-RUNNERS...,REAL


In [2]:
# Attempted conversion to fixed-width byte strings; left disabled because it raises an
# error when the text contains a UTF-8 character that has no ASCII equivalent.
# The object columns already support pandas string operations (see the .str.upper() calls above),
# so this conversion is not needed.
#news[['title', 'text', 'label']] = news[['title', 'text', 'label']].astype('|S')

#news.dtypes

In [3]:
# Drop id field as it is not useful
news.drop('id', axis = 1, inplace = True)

news.head()

Unnamed: 0,title,text,label
0,YOU CAN SMELL HILLARY’S FEAR,"DANIEL GREENFIELD, A SHILLMAN JOURNALISM FELLO...",FAKE
1,WATCH THE EXACT MOMENT PAUL RYAN COMMITTED POL...,GOOGLE PINTEREST DIGG LINKEDIN REDDIT STUMBLEU...,FAKE
2,KERRY TO GO TO PARIS IN GESTURE OF SYMPATHY,U.S. SECRETARY OF STATE JOHN F. KERRY SAID MON...,REAL
3,BERNIE SUPPORTERS ON TWITTER ERUPT IN ANGER AG...,"— KAYDEE KING (@KAYDEEKING) NOVEMBER 9, 2016 T...",FAKE
4,THE BATTLE OF NEW YORK: WHY THIS PRIMARY MATTERS,IT'S PRIMARY DAY IN NEW YORK AND FRONT-RUNNERS...,REAL


In [4]:
# Check for any NaNs
news.isnull().values.any()

True

In [5]:
#Checking which columns have NaNs
print(news['title'].isnull().sum())
print(news['text'].isnull().sum())
print(news['label'].isnull().sum())

558
127
0


In [6]:
# There is no way to fill NaNs in this data set so I will remove all rows with NaNs
news = news.dropna()

# Check to make sure NaNs were removed
print(news.isnull().values.any())

# Check new shape of dataframe
news.shape

False


(26450, 3)

In [7]:
# Checking for duplicate articles by title
print("Duplicate titles: ", news.duplicated(subset = ['title']).sum())
# Checking for duplicate articles by text
print("Duplicate text: ", news.duplicated(subset = ['text']).sum())

Duplicate titles:  3111
Duplicate text:  3072


In [8]:
# Sorting by title
news.sort_values("title", inplace = True)

# Dropping duplicate titles
news.drop_duplicates(subset = "title", keep = 'first', inplace = True)

news.shape

(23339, 3)

In [9]:
# Sorting by text
news.sort_values("text", inplace = True)

# Dropping duplicate text
news.drop_duplicates(subset = "text", keep = 'first', inplace = True)

news.shape

(22892, 3)

In [10]:
# Check the values in label field
news.label.unique()

array(['1', 'FAKE', 'REAL', '0'], dtype=object)

In [11]:
# Replace 0 with REAL
news['label'] = news['label'].replace(['0'],'REAL')
# Replace 1 with FAKE
news['label'] = news['label'].replace(['1'],'FAKE')

news.label.unique()

array(['FAKE', 'REAL'], dtype=object)
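
Before modeling, it may help to confirm how balanced the two classes are once the labels are consistent. The sketch below uses a small hypothetical `labels` series in place of `news['label']`, and the dictionary name `label_map` is an illustrative choice. It also shows that both replacements can be done in a single `replace` call.

```python
import pandas as pd

# Hypothetical stand-in for news['label'], containing the four raw values seen above
labels = pd.Series(['1', 'FAKE', 'REAL', '0', 'REAL', '1'])

# Map the numeric codes onto the text labels in a single call
label_map = {'0': 'REAL', '1': 'FAKE'}
labels = labels.replace(label_map)

# Count and proportion of each class
counts = labels.value_counts()
proportions = labels.value_counts(normalize=True)
print(counts)
print(proportions)
```

A strongly lopsided result here would suggest looking beyond plain accuracy when evaluating a classifier.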

## Step 5: Filter Results and Build/Validate Models


Perform any additional data carpentry, filter the data as needed, and then build, validate, and describe your model(s). 

Add more cells (`Insert > Insert Cell Below`) as needed.

In [12]:
# Create the training and testing data sets (70% training, 30% testing)
train = news.sample(frac = 0.7)
test = news.drop(train.index)

display(train)
display(test)

Unnamed: 0,title,text,label
12885,CALIFORNIA TODAY: THE TALE OF THE LAGUNA BEACH...,GOOD MORNING. (WANT TO GET CALIFORNIA TODAY BY...,REAL
17922,COURT IN EGYPT OVERTURNS MOHAMED MORSI’S DEATH...,CAIRO — ONE OF EGYPT’S HIGHEST COURTS OVERT...,REAL
1248,PROJECT VERITAS: SCOTT FOVAL REVEALS WHO WAS R...,PROJECT VERITAS: SCOTT FOVAL REVEALS WHO WAS R...,FAKE
22916,MEXICAN WOMAN ‘POSSESSED’ BY DEMON WHILE EATIN...,A VIDEO FROM MEXICO ALLEGES TO SHOW THE MOMENT...,FAKE
24222,"WITH ONE CASTRO GONE, QUESTIONS ABOUT WHAT THE...","MEXICO CITY — FOR HALF A CENTURY, AS FIDEL ...",REAL
...,...,...,...
1080,THE OVERWHELMING STRESS OF BEING DENIED A BATH...,A RECENT VIRAL VIDEO SHOWED A WOMAN WIELDING A...,REAL
7785,MENENDEZ: ’NOT SURE’ WHAT MESSAGE IS SENT COMM...,ON TUESDAY’S BROADCAST OF CNN’S “SITUATION ROO...,REAL
14371,ASPARTAME CORPORATION SEARLE CREATED FIRST BIR...,ASPARTAME CORPORATION SEARLE CREATED FIRST BIR...,FAKE
6123,WILL INTEREST RATES GO UP? FIVE THINGS TO EXPECT,THE FEDERAL RESERVE IS LIKELY TO RAISE THE FED...,REAL


Unnamed: 0,title,text,label
26472,OATHKEEPERS TO PREVENT VOTER FRAUD- OPERATION ...,\n\nI RECENTLY INTERVIEWED OATHKEEPERS STEWART...,FAKE
5483,"PROJECT VERITAS VIDEO 4 - 20K BRIBERY TO DNC, ...",\n\nIN THE EFFORT TO PROVE THE CREDIBILITY OF ...,FAKE
12785,HOW TO BREAK THE CYCLE OF HUMAN COMPLACENCY AN...,\n\nNEW VIDEO: HOW TO BREAK THE CYCLE OF HUMAN...,FAKE
10427,ABEDIN & WEINER TO TESTIFY AGAINST CLINTON,"\n \n\nHUMA ABEDIN, HILLARY’S CLINTON’S TOP AI...",FAKE
20733,MEXICO’S RICHEST OLIGARCH LOSES BILLIONS ON NE...,\n21ST CENTURY WIRE SAYS… \nMEXICO’S BILLIONAI...,FAKE
...,...,...,...
14558,TRUMP SAYS NORDSTROM TREATED HIS DAUGHTER ‘SO ...,■ A VOTE ON JEFF SESSIONS’S NOMINATION AS ATTO...,REAL
21407,N.F.L.: HERE’S WHAT WE LEARNED IN WEEK 2 - THE...,■ OPTIMISM IS EASY TO COME BY IN THE OFFSEASON...,REAL
11543,"TRUMP RESPONDS TO LOUVRE ATTACK IN PARIS, URGI...",■ PRESIDENT TRUMP RESPONDED TO THE FAILED ATTA...,REAL
24685,MAR-A-LAGO DOUBLES ITS INITIATION FEE AS MEMBE...,■ THE RESORT DOUBLES ITS INITIATION FEE — ...,REAL
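
The split above draws a different random sample on every run, so results will not be repeatable. One possible alternative, sketched here on a tiny hypothetical frame named `demo`, is `train_test_split` (already imported in this notebook) with a fixed `random_state` and with `stratify` so that the FAKE/REAL proportions are preserved in both sets. The value `random_state=42` is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature of the cleaned news frame (10 FAKE rows, 10 REAL rows)
demo = pd.DataFrame({
    'title': [f'TITLE {i}' for i in range(20)],
    'text': [f'TEXT {i}' for i in range(20)],
    'label': ['FAKE', 'REAL'] * 10,
})

# 70/30 split; stratify keeps class proportions, random_state makes it repeatable
train_demo, test_demo = train_test_split(
    demo, test_size=0.3, random_state=42, stratify=demo['label']
)

print(train_demo.shape, test_demo.shape)
print(test_demo['label'].value_counts())
```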


In [13]:
# Create training and testing numpy arrays for both the input variables and the target variables
train_X = np.asarray(train.drop('label', axis = 1))
train_y = np.asarray(train.label)

test_X = np.asarray(test.drop('label', axis = 1))
test_y = np.asarray(test.label)
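
The arrays above still hold raw text, which a classifier cannot use directly. As one possible next step, and not necessarily the approach this project goes on to take, the sketch below chains the imported `TfidfVectorizer` and `LinearSVC` with `make_pipeline` and scores the result with `accuracy_score` and `confusion_matrix`. The toy documents and the names `docs_train`, `docs_test`, and `model` are made up for illustration. Note that `TfidfVectorizer` expects a one-dimensional sequence of strings, so with the real data the title and text columns would first need to be combined, or one of them selected.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up toy documents (assumption): stand-ins for the real title/text strings
docs_train = [
    "SHOCKING SECRET THEY DO NOT WANT YOU TO KNOW",
    "YOU WILL NOT BELIEVE THIS SHOCKING MIRACLE CURE",
    "SENATE PASSES BUDGET BILL AFTER LONG DEBATE",
    "CITY COUNCIL APPROVES NEW BUDGET FOR SCHOOLS",
]
labels_train = ["FAKE", "FAKE", "REAL", "REAL"]
docs_test = ["SHOCKING MIRACLE SECRET REVEALED", "COUNCIL DEBATE ON SCHOOLS BUDGET"]
labels_test = ["FAKE", "REAL"]

# The pipeline vectorizes the text and then fits the classifier in one step
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs_train, labels_train)

predicted = model.predict(docs_test)
accuracy = accuracy_score(labels_test, predicted)
matrix = confusion_matrix(labels_test, predicted, labels=["FAKE", "REAL"])
print(accuracy)
print(matrix)
```

Wrapping both steps in a pipeline means the vectorizer learns its vocabulary from the training documents only, which avoids leaking information from the test set.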

## Step 6: Visualize Results

Build up your key visual story elements!

Add more cells (`Insert > Insert Cell Below`) as needed.
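
For a classification story, one candidate visual is a confusion-matrix heat map. The sketch below is only an illustration: the `y_true` and `y_pred` lists are hypothetical labels rather than results from this data set, and the colormap and figure size are arbitrary choices.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels; replace with your model's output
y_true = ["FAKE", "FAKE", "REAL", "REAL", "REAL", "FAKE"]
y_pred = ["FAKE", "REAL", "REAL", "REAL", "FAKE", "FAKE"]
class_names = ["FAKE", "REAL"]

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true, y_pred, labels=class_names)

fig, ax = plt.subplots(figsize=(4, 4))
image = ax.imshow(cm, cmap="Blues")
ax.set_xticks(range(len(class_names)))
ax.set_xticklabels(class_names)
ax.set_yticks(range(len(class_names)))
ax.set_yticklabels(class_names)
ax.set_xlabel("Predicted label")
ax.set_ylabel("True label")
ax.set_title("Confusion matrix (illustrative data)")

# Write the count inside each cell
for row in range(cm.shape[0]):
    for col in range(cm.shape[1]):
        ax.text(col, row, str(cm[row, col]), ha="center", va="center")

fig.colorbar(image, ax=ax)
fig.tight_layout()
```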

## Step 7: Communicate the Story to your intended audience using visualizations and narrative


In a few paragraphs, describe the story the data tells. 

Additionally, post your most compelling visual, with a brief description of what it conveys, to our mutual aid channel (the course Slack channel). 

Feel free to post more examples for people to look at and provide feedback. Your classmates will be vital providers of feedback in this process. Utilize them.

# Final Step: Connect your workflow/process to the DSA-Project Life Cycle
- List and briefly discuss how important details from each stage of the [DSA-PLC](../../module1/resources/DSA-ProjectLifecycle-slidedeck.pdf) played a role in your story development.
- Use markdown to provide this overview below:
<hr/>

<h1 align="center"><u>DSA-Project Life Cycle Discussion</u></h1>



# Save your notebook, then `File > Close and Halt`