<a href="https://colab.research.google.com/github/IkeKobby/fatima-fellowship/blob/main/Data_Preprocessing_Fatima_Fellowship.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fatima Fellowship Quick Coding Challenge (Pick 1)

Thank you for applying to the Fatima Fellowship. To help us select the Fellows and assess your ability to do machine learning research, we are asking that you complete a short coding challenge. Please pick **1 of these 5** coding challenges, whichever is most aligned with your interests. 

**Due date: 1 week**

**How to submit**: Please make a copy of this colab notebook, add your code and results, and submit your colab notebook to the submission link below. If you have never used a colab notebook, [check out this video](https://www.youtube.com/watch?v=i-HnvsehuSw).

**Submission link**: https://airtable.com/shrXy3QKSsO2yALd3

# 2. Deep Learning for NLP

**Fake news classifier**: Train a text classification model to detect fake news articles!

* Download the dataset here: https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset
* Develop an NLP model for classification that uses a pretrained language model
* Fine-tune your model on the dataset, and generate a ROC curve (reporting the AUC) for your model on the test set of your choice.
* [Upload the model to the Hugging Face Hub](https://huggingface.co/docs/hub/adding-a-model), and add a link to your model below.
* *Answer the following question*: Look at some of the news articles that were classified incorrectly. Please explain what you might do to improve your model's performance on these news articles in the future (you do not need to implement these suggestions).
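
A minimal sketch of the evaluation step from the list above: computing a ROC curve and its area (AUC) with scikit-learn. The labels and scores are made-up toy values, not results from this notebook. In practice `y_true` would hold the test labels and `y_score` the model's predicted probability of the positive class.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Toy values for illustration only (assumption): 1 = real, 0 = fake
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])  # hypothetical predicted probability of label 1

# roc_curve returns the false positive rate and true positive rate at each threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# auc integrates the curve with the trapezoidal rule
roc_auc = auc(fpr, tpr)
print(f"AUC = {roc_auc:.2f}")
```

The `fpr` and `tpr` arrays can then be passed to a plotting library to draw the curve itself.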

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
from IPython.display import clear_output
!pip install opendatasets
clear_output()

In [None]:
import opendatasets as od

In [None]:
dataset_url = 'https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset'

od.download(dataset_url)

In [None]:
import pandas as pd
import numpy as np

In [None]:
dataset_path = '/content/drive/MyDrive/fake-and-real-news-dataset'

fake_dataset = pd.read_csv(dataset_path + '/Fake.csv')
true_dataset = pd.read_csv(dataset_path + '/True.csv')

### Data exploration

In [None]:
fake_dataset.head(5)

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [None]:
fake_dataset.subject.value_counts()

News               9050
politics           6841
left-news          4459
Government News    1570
US_News             783
Middle-east         778
Name: subject, dtype: int64

In [None]:
true_dataset.head(5)

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


In [None]:
true_dataset.subject.value_counts()

politicsNews    11272
worldnews       10145
Name: subject, dtype: int64

In [None]:
fake_news_body = fake_dataset['text']
true_news_body = true_dataset['text']

In [None]:
def randomly_read_some_articles(texts):
  # sample without replacement so the same article is not printed twice
  five_random_texts = np.random.choice(texts, 5, replace=False)
  for index, text in enumerate(five_random_texts):
    print(index, ':', text)
    print('\n')

print("SOME FAKE NEWS ARTICLES:  \n")
randomly_read_some_articles(fake_news_body)

SOME FAKE NEWS ARTICLES:  

0 : How much more criminal activity are American voters willing to overlook? Have we, as Americans gotten to the point that no crime committed by Hillary or anyone attached to Hillary will disqualify her from being elected to the highest office in our nation? Are we willing to accept that we can no longer trust anyone working for a government organization in America? The political organization of Virginia Gov. Terry McAuliffe, an influential Democrat with longstanding ties to Bill and Hillary Clinton, gave nearly $500,000 to the election campaign of the wife of an official at the Federal Bureau of Investigation who later helped oversee the investigation into Mrs. Clinton s email use.Campaign finance records show Mr. McAuliffe s political-action committee donated $467,500 to the 2015 state Senate campaign of Dr. Jill McCabe, who is married to Andrew McCabe, now the deputy director of the FBI.The Virginia Democratic Party, over which Mr. McAuliffe exerts consi

In [None]:
print("SOME TRUE NEWS ARTICLES:  \n")
randomly_read_some_articles(true_news_body)

SOME TRUE NEWS ARTICLES:  

0 : WASHINGTON/LONG BEACH, California (Reuters) - If Hillary Clinton ends up losing California to Bernie Sanders, it will be because of voters like Nallely Perez. Perez personifies what a Clinton supporter was supposed to look like: a 24-year-old Latina who grew up idolizing the former first lady as a groundbreaking woman in politics. But when she votes in California’s Democratic presidential nominating contest on Tuesday, Perez will be supporting Sanders.   “Everything that I would stand for, he has said it,” said Perez, a student at California State University, Long Beach, who said she likes Sanders’ promises of tuition-free college and universal healthcare. “We found our voice in him.”  California is the final big contest in the long, bitter fight for the Democratic nomination. Opinion polls show the Democratic race there tightening in recent weeks. Where Clinton, a former secretary of state, once held a big lead over Sanders, a U.S. senator from Vermont,

In [None]:
# combine title, text and subject into one 'article' column; the spaces keep adjacent words from merging
fake_news_body_dataframe = (fake_dataset['title'] + ' ' + fake_dataset['text'] + ' ' + fake_dataset['subject']).to_frame('article')
true_news_body_dataframe = (true_dataset['title'] + ' ' + true_dataset['text'] + ' ' + true_dataset['subject']).to_frame('article')

# create labels
fake_news_body_dataframe['label'] = 0 # `0` if fake
true_news_body_dataframe['label'] = 1 # `1` if real

# concat both datasets
news_article_dataset = pd.concat([fake_news_body_dataframe, true_news_body_dataframe], axis = 0).sample(frac = 1).reset_index(drop = True)

In [None]:
news_article_dataset.head()

Unnamed: 0,article,label
0,Turkey threatens legal action after lawmaker c...,1
1,OBAMA MAKES STUNNING 11th Hour Gift Of Massive...,0
2,FBI Director Comey’s ‘Leaked’ Memo Explains Wh...,0
3,Newspapers aim to ride 'Trump Bump' to reach r...,1
4,THIS IS GREAT! ANTI-HILLARY STREET ART POPS UP...,0


In [None]:
print(f"New dataset shape: {news_article_dataset.shape}")

# check data distribution
news_article_dataset.label.value_counts()

New dataset shape: (44898, 2)


0    23481
1    21417
Name: label, dtype: int64
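
Since the two classes are not perfectly balanced, one option for the later train/test split is to stratify on the label. The sketch below uses scikit-learn's `train_test_split` on a small toy frame; `toy_dataset`, the 25% test size and the random seed are assumptions for illustration, not choices made in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for news_article_dataset (assumption): 12 fake (0) and 8 real (1) rows
toy_dataset = pd.DataFrame({
    'article': [f'article {i}' for i in range(20)],
    'label': [0] * 12 + [1] * 8,
})

# stratify keeps the fake/real ratio the same in both splits
train_df, test_df = train_test_split(
    toy_dataset,
    test_size=0.25,
    stratify=toy_dataset['label'],
    random_state=42,
)

print(train_df['label'].value_counts().to_dict())
print(test_df['label'].value_counts().to_dict())
```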

### Preprocessing

In [None]:
# check nan
news_article_dataset.isna().sum()

article    0
label      0
dtype: int64

In [None]:
# !pip install autocorrect
# clear_output()

In [None]:
# imports for text processing: regex cleanup, tokenization, stop-word removal, spell correction
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from autocorrect import Speller

In [None]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /Users/sugar/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/sugar/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
# instance of spell checker
spell  = Speller(lang='en')

In [None]:
# Build the stop-word set once ('and', 'the', 'i', 'yourself', 'is', ...); these words carry little value as keywords
STOP_WORDS = set(stopwords.words('english'))

def clean_text(text):
    text = str(text)

    # Remove a prefixed 'b' (left over from byte-string literals)
    text = re.sub(r'^b\s+', '', text)

    # Replace punctuation, special characters and non-English characters with a space
    text = re.sub(r'[^a-zA-Z0-9]+', ' ', text)

    # Get rid of all the words that contain numbers
    text = re.sub(r'\w*\d\w*', ' ', text)

    # Remove all single characters
    text = re.sub(r'\b[a-zA-Z]\b', ' ', text)

    # Substitute multiple spaces with a single space
    text = re.sub(r'\s+', ' ', text).strip()

    # Split the text into words, lowercase them and drop the stop words
    keywords = [word.lower() for word in word_tokenize(text) if word.lower() not in STOP_WORDS]

    # Join the keywords back into one string and spell-correct it
    return spell(' '.join(keywords))

In [None]:
# This cell took 1304m 5.6s to complete. The cleaned text has been saved to my Google Drive.
clean_texts = []
for txt in news_article_dataset['article']:
  clean_texts.append(clean_text(txt))

 - You can request access to the cleaned data [here](https://drive.google.com/file/d/1uLZoXqARX35CPmLMeO0VaVAEu_Wl1c72/view?usp=sharing)
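
Because the cleaning run is so long, one way to make it safer is to process the articles in chunks and save each chunk to disk, so an interrupted run can resume where it stopped. This is only a sketch under assumptions: `clean_in_chunks`, the chunk file names and the toy cleaner are hypothetical and were not used to produce the saved data.

```python
import os
import tempfile
import pandas as pd

def clean_in_chunks(df, clean_fn, out_dir, chunk_size=1000):
    """Clean df['article'] chunk by chunk, skipping chunks already saved in out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    parts = []
    for start in range(0, len(df), chunk_size):
        path = os.path.join(out_dir, f"chunk_{start}.csv")
        if not os.path.exists(path):
            # Only do the expensive cleaning when no saved result exists
            chunk = df.iloc[start:start + chunk_size].copy()
            chunk['clean_article'] = chunk['article'].map(clean_fn)
            chunk.to_csv(path, index=False)
        parts.append(pd.read_csv(path))
    return pd.concat(parts, ignore_index=True)

# Toy data and a toy cleaner that records how often it is called (assumptions)
toy = pd.DataFrame({
    'article': ['Hello World', 'FAKE News!', 'Reuters report'],
    'label': [1, 0, 1],
})
calls = []

def toy_clean(text):
    calls.append(text)
    return text.lower()

with tempfile.TemporaryDirectory() as tmp:
    first = clean_in_chunks(toy, toy_clean, tmp, chunk_size=2)
    second = clean_in_chunks(toy, toy_clean, tmp, chunk_size=2)  # reuses the saved chunks

print(second['clean_article'].tolist())
print(len(calls))  # the cleaner ran once per article, not twice
```

In the notebook, `clean_text` would take the place of `toy_clean`. Note that an article cleaned down to an empty string is read back from CSV as a missing value.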

In [None]:
news_article_dataset['clean_article'] = clean_texts
news_article_dataset.to_csv('cleaned_text.csv', index = False)

**Write up**: 
* Link to the model on Hugging Face Hub: 
* Include some examples of misclassified news articles. Please explain what you might do to improve your model's performance on these news articles in the future (you do not need to implement these suggestions).
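
A minimal sketch of how misclassified articles could be collected for the write-up, assuming the model's test-set outputs are stored in a DataFrame. The `predictions` frame, the `prob_real` column and the 0.5 threshold are hypothetical; the values are toy data, not results from a trained model.

```python
import pandas as pd

# Toy predictions for illustration only (assumption)
predictions = pd.DataFrame({
    'article': ['story a', 'story b', 'story c', 'story d'],
    'label': [0, 1, 1, 0],                  # 0 = fake, 1 = real
    'prob_real': [0.20, 0.30, 0.90, 0.70],  # hypothetical model output
})

# Threshold the probability to get a predicted label
predictions['predicted'] = (predictions['prob_real'] >= 0.5).astype(int)

# Keep only the rows where the prediction disagrees with the true label
misclassified = predictions[predictions['predicted'] != predictions['label']]

print(misclassified[['article', 'label', 'predicted']])
```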