<a href="https://colab.research.google.com/github/AlexHarry17/CSCI_491/blob/master/Assignment3_Alexander_Harry.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Part 1

The goal of this part is to explore some of the main [scikit-learn](http://scikit-learn.org/stable/index.html) tools on a single practical task: analysing a collection of text documents (newsgroups posts) on twenty different topics.

In this section we will see how to:

1. load the file contents and the categories
2. extract feature vectors suitable for machine learning
3. train a model to perform text classification
4. evaluate the performance of the trained model
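
The four steps above can be sketched end to end on a tiny corpus before touching the real dataset. This is only an illustration: the texts, labels, and the `toy_*` names are made up for this example, and `make_pipeline` is used here simply to chain the vectorizer and the classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Step 1: "load" contents and categories (hypothetical toy data, not the newsgroups posts).
toy_train_texts = ["rocket launch orbit", "shuttle orbit moon",
                   "pixel render image", "image texture shader"]
toy_train_labels = ["sci.space", "sci.space", "comp.graphics", "comp.graphics"]
toy_test_texts = ["moon rocket", "render texture"]
toy_test_labels = ["sci.space", "comp.graphics"]

# Steps 2 and 3: turn text into count vectors and train a Naive Bayes classifier.
toy_model = make_pipeline(CountVectorizer(), MultinomialNB())
toy_model.fit(toy_train_texts, toy_train_labels)

# Step 4: evaluate on held-out texts.
toy_pred = toy_model.predict(toy_test_texts)
toy_f1 = f1_score(toy_test_labels, toy_pred, average="macro")
print(toy_pred, toy_f1)
```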


### Loading the 20 newsgroups dataset
The 20 newsgroups dataset comprises around 20,000 newsgroups posts on 20 topics, split into two subsets: one for training (or development) and the other for testing (or performance evaluation).

In the following we will use the built-in dataset loader for 20 newsgroups from scikit-learn.

The **sklearn.datasets.fetch_20newsgroups** function is a data fetching/caching function that downloads the data archive from the original 20 newsgroups website, extracts the archive contents into a local folder, and calls **sklearn.datasets.load_files** on the training set folder, the testing set folder, or both. Here, we load only 4 categories.

In [0]:
cats = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space'] #select only 4 categories for fast running times

from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats, remove=('headers', 'footers', 'quotes'))


We can now list the 4 categories as follows:

In [0]:

print(newsgroups_train.target_names)

The file paths and labels are in the **filenames** and **target** attributes, and the raw text of each post is in **data**. The target attribute is the integer index of the category:

In [0]:
print(newsgroups_train.filenames[0]) #print the name of the first file
print(newsgroups_train.target[0]) #print the category of the first example
print(newsgroups_train.data[0]) #print the text of the first example

### Converting text to vectors
To feed machine learning models with text data, one first needs to turn the text into vectors of numerical values suitable for statistical analysis. This can be achieved with the utilities of **sklearn.feature_extraction.text**, as demonstrated in the following example.




In [0]:
from sklearn.feature_extraction.text import CountVectorizer #tokenizer
vectorizer = CountVectorizer(stop_words='english') #remove english stop words
vectors = vectorizer.fit_transform(newsgroups_train.data)
print (vectors.shape) #print the size

### Training a machine learning model and evaluating its performance
Let’s use a multinomial Naive Bayes classifier, as discussed in class.

In [0]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(vectors, newsgroups_train.target)



Then let's print the F1 score on the test data.

In [0]:
from sklearn import metrics

newsgroups_test = fetch_20newsgroups(subset='test', categories=cats, remove=('headers', 'footers', 'quotes'))  #test data
vectors_test = vectorizer.transform(newsgroups_test.data)  #generate vectors from test data (using the same vectorizer)
pred = clf.predict(vectors_test) #predict categories for the test data using the above trained classifier

print("macro F1:",metrics.f1_score(newsgroups_test.target, pred, average='macro'))
print("micro F1:",metrics.f1_score(newsgroups_test.target, pred, average='micro'))
print("\n",metrics.classification_report(newsgroups_test.target, pred, target_names=newsgroups_test.target_names))
cm = metrics.confusion_matrix(newsgroups_test.target, pred)
print("Confusion Matrix:\n",cm)
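
To make the difference between the two averages concrete, here is a minimal pure-Python sketch (the helper `f1_scores` and the toy labels are made up for this illustration). Macro F1 averages the per-class F1 values, so every class counts equally; micro F1 pools the true positives, false positives, and false negatives over all classes, so frequent classes dominate.

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Return (per-class F1 dict, macro F1, micro F1) for single-label predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted but wrong
            fn[t] += 1  # t was missed
    per_class = {}
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        per_class[c] = 2 * tp[c] / denom if denom else 0.0
    macro = sum(per_class.values()) / len(classes)
    total_tp, total_fp, total_fn = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = 2 * total_tp / (2 * total_tp + total_fp + total_fn)
    return per_class, macro, micro

# Hypothetical imbalanced example: class 0 is frequent, class 1 is rare.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]
per_class, macro, micro = f1_scores(y_true, y_pred)
print(per_class, macro, micro)
```

In this example the rare class is predicted worse than the frequent one, so the macro average comes out lower than the micro average.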

Let’s take a look at what the most informative features are:

In [0]:
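import numpy as np
def show_top10(classifier, vectorizer, categories):
  # get_feature_names() was removed in recent scikit-learn releases; get_feature_names_out() replaces it
  feature_names = vectorizer.get_feature_names_out()
  for i, category in enumerate(categories):
    # MultinomialNB no longer exposes coef_; feature_log_prob_ holds the per-class log-probabilities of each feature
    top10 = np.argsort(classifier.feature_log_prob_[i])[-10:]
    print("%s: %s" % (category, " ".join(feature_names[top10])))

show_top10(clf, vectorizer, newsgroups_train.target_names)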

Instead of a train-test setup, you can also perform cross-validation (CV). The following code shows CV results on the train set (although you could do this with the complete dataset).

In [0]:
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(clf, vectors, newsgroups_train.target, cv=10, scoring="f1_macro")
print("Avg. macro F1:", np.mean(cv_scores))

## Part 2

The goal of this part is to write your own code to train a model to classify the given test dataset, using Part 1 as inspiration.

First, upload the given dataset ("diseases-train.csv") using the following cell. It contains 900 scientific articles (identified by their PubMed IDs) and their labels. This is a multi-class problem.



In [0]:
from google.colab import files
uploaded = files.upload()

Then load the CSV file using pandas as below. 

In [0]:
import pandas as pd
df_train = pd.read_csv("diseases-train.csv")
print(df_train.head())

Then you can iterate over the lines as follows. Each line has the format: pmid, category.

In [0]:
for index, row in df_train.iterrows():
    pmid = row.iloc[0]  # first column holds the PubMed ID; positional row[0] is deprecated in recent pandas
    print(pmid)
    break

Then you can get the other information (e.g. title, abstract) associated with each of these articles using the [biopython](https://biopython.org/) library. First install the library as below.

In [0]:
!pip install biopython

You can fetch the information for an article with the **efetch** function as below. Find more information [here](http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:efetch).

(Note: You can search for PubMed articles by keyword using [esearch](http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc123). This may be useful for your projects.)

In [0]:
from Bio import Entrez, Medline
Entrez.email = "alexander.harry1@ecat1.montana.edu"  # Always tell NCBI who you are
handle = Entrez.efetch(db="pubmed", id=pmid, rettype="medline", retmode="text")
print(handle.read())

The description of the MEDLINE format is [here](https://www.nlm.nih.gov/bsd/disted/pubmedtutorial/030_080.html). You can parse this *handle* using the **Medline.parse** function. More information is [here](https://biopython.org/DIST/docs/api/Bio.Medline-module.html).
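
As a rough offline illustration of what **Medline.parse** yields, the sketch below parses a hand-written record from a string instead of a network handle. The record content is invented for this example (it is not a real article), and `io.StringIO` stands in for the handle returned by `Entrez.efetch`. Each parsed record behaves like a dictionary keyed by the MEDLINE field tags, such as `PMID`, `TI` (title), and `AB` (abstract).

```python
import io
from Bio import Medline

# A shortened, made-up record in MEDLINE layout: a tag padded to four characters,
# then "- " and the value; continuation lines start with six spaces.
sample = (
    "PMID- 12345678\n"
    "TI  - An example title that continues\n"
    "      on a second line.\n"
    "AB  - An example abstract.\n"
)

records = list(Medline.parse(io.StringIO(sample)))
record = records[0]
title = record.get("TI", "")  # .get avoids a KeyError when a field is missing
print(record["PMID"], "|", title)
```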

Once you grab enough information about each article from PubMed, your task is to train a model using the given data and make predictions for the articles in the test data ("diseases-test-without-labels.csv"). To find the best parameter values for your models, you will split the given data into train-1 and train-2. Then you will use train-1 to train your models and train-2 to test them (typically, train-2 is called the dev set or validation set). This way, you can find the best model for making the final predictions.
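
One possible way to make such a split is scikit-learn's `train_test_split` with `stratify`, which shuffles the rows and keeps the label proportions the same in both parts. The sketch below uses a made-up stand-in for the CSV (the `toy_*` names and values are hypothetical); it is one option, not the split this notebook necessarily uses.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for diseases-train.csv: 20 rows, 4 labels, 5 rows each.
toy_df = pd.DataFrame({
    "pmid": range(1000, 1020),
    "category": ["a", "b", "c", "d"] * 5,
})

# stratify keeps each label's share equal in both parts; random_state makes the shuffle repeatable.
toy_train_1, toy_train_2 = train_test_split(
    toy_df, test_size=0.2, stratify=toy_df["category"], random_state=0
)
print(len(toy_train_1), len(toy_train_2))
print(toy_train_2["category"].value_counts().to_dict())
```

A plain slice of the first 80% of rows also works, but if the file happens to be ordered by label the two parts can end up with different label distributions.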

You can add one or more code/text blocks to answer the following.



1. Details about datasets: What is the label distribution in the full dataset? What are the sizes of the train-1 and train-2 datasets you used (and their individual label distributions)? Also, show the distribution(s) visually.

(Note: you can create bar plots or pie charts using the following:
[matplotlib.pyplot.bar](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.bar.html), 
[matplotlib.pyplot.pie](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.pie.html)) 


In [0]:
labels = df_train['category'].unique() # List of unique labels

def label_distrobution(distro_description, data): # Calculates and plots the label distribution.
  label_distro = {}
  print("\n------------------------", distro_description, "Label Distribution------------------------")
  for label in labels: # Loops through the labels to create a dictionary of counts for each label
    label_distro[label] = list(data['category']).count(label) # creates a key/count value for each label
    percent = 100 * label_distro[label] / len(data) # Share of this label as a percentage (the raw fraction was previously printed with a % sign)
    print("Label: ", label, " Count: ", label_distro[label], " Distribution: ", "%.1f%%" % percent) # Prints the distribution.

  print("\n------------------------", distro_description, "Label Distribution Plot------------------------")
  # Plotting dictionary sourced from https://stackoverflow.com/questions/21195179/plot-a-histogram-from-a-dictionary/21195331 from User: Alvaro Fuentes
  import matplotlib.pyplot as plt
  plt.bar(list(label_distro.keys()), list(label_distro.values())) # Plots the bar chart of label counts. -- Source above--
  plt.show()

In [0]:

# Split the training sets
split = int(len(df_train) * 0.8) # Split the training data 80/20

train_1 = df_train[:split][:] 
train_2 = df_train[split:][:]

print("\n------------------------Training Split------------------------")
print("Train-1 size: ", len(train_1), " Train-2 size: ", len(train_2))

# Print the label distributions for each data set.
label_distrobution("Full Data Set", df_train)
label_distrobution("Train-1", train_1)
label_distrobution("Train-2", train_2)



In [0]:
from Bio import Entrez, Medline

def parseData(data, train): # Fetches the title of each article; also returns the labels for training data
  is_train = str(train) == "True" # Accepts both True and "True"
  data_set = [] # Titles
  label = [] # Labels (training data only)
  for _, row in data.iterrows(): # Iterate through the rows of the data passed in
    pmid_val = int(row.iloc[0]) if is_train else int(row['pmid'])
    handle = Entrez.efetch(db="pubmed", id=pmid_val, rettype="medline", retmode="text") # Sourced from the code provided above. Grabs the handle of the specific pmid.
    records = list(Medline.parse(handle)) # Parses the handle
    handle.close()
    for record in records:
      if is_train:
        if 'TI' in record: # Skips training articles that have no title
          data_set.append(record['TI'])
          label.append(row.iloc[1])
      else:
        # Keeps an empty placeholder for a missing title so predictions stay aligned with the pmid list
        data_set.append(record.get('TI', ''))
  if is_train:
    return data_set, label
  return data_set


In [0]:
# train_1_data, train_1_labels = parseData(train_1, True)
# print("Text: ", train_1_data)
# print("Labels: ", train_1_labels)

In [0]:
# train_2_data, train_2_labels = parseData(train_2, True)
# print("Text: ", train_2_data)
# print("Labels: ", train_2_labels)

2. Data preprocessing:
How did you preprocess your data? You can do stemming/lemmatization using the [NLTK](https://www.nltk.org/) library:
[stemming](https://www.nltk.org/api/nltk.stem.html), 
[lemmatization](https://www.nltk.org/_modules/nltk/stem/wordnet.html). Make sure to apply the same pre-processing to both train-1 and train-2.


ANSWER: I preprocessed the data by parsing each article with Medline.parse. I built a list of texts from the 'TI' (title) key and a matching list of labels. Punctuation characters such as '.' are then removed from each title, and the result is lemmatized with the WordNetLemmatizer and stemmed with the SnowballStemmer.

In [0]:
# Help with stemming and lemmatization from source https://www.nltk.org/api/nltk.stem.html#module-nltk.stem

# temp_train_1 = train_1_data.copy()
# temp_train_2 = train_2_data.copy()

# temp_labels_1 = train_1_labels.copy()
# temp_labels_2 = train_2_labels.copy()

from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer
import nltk 

nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
# Help with LancasterStemmer and SnowballStemmer code sourced from https://www.nltk.org/api/nltk.stem.html#module-nltk.stem
# stemmer = LancasterStemmer()
stemmer = SnowballStemmer('english')


# Help with stemming and lemmatization from source https://www.nltk.org/api/nltk.stem.html#module-nltk.stem
def stemming(data): # Method for stemming the data
  for text in range(len(data)): # Loops through the text.
    line = data[text]
    line = line.split() # Splits the line at ' ' into a list
    new_line = [] # Temp list to join later
    for word in line: # Loops through the list of words in the sentence
      new_line.append(stemmer.stem(word)) # Stems and appends to the list.
    data[text] = ' '.join(new_line) # Join method sourced from https://stackoverflow.com/questions/29642188/removing-the-square-brackets-commas-and-single-quote, user: halex 
  return data
  
def replace(data): # Removes the listed punctuation characters from each string.
  chars = ['.', '[',']', "'", '?', '!', '(', ')', ':', ';' ] # The chars to remove from the sentence
  for text in range(len(data)): # Loops through the text 
    for char in chars: # Loops through the items to replace
      data[text] = data[text].replace(char, '') # Replace method sourced from https://www.journaldev.com/23674/python-remove-character-from-string
  return data

def lemmatization(data): # Method to lemmatize the data.  Sourced from https://www.nltk.org/api/nltk.stem.html#module-nltk.stem
  for text in range(len(data)): # Loops through the text.
    line = data[text]
    line = line.split() # Splits the line at ' ' into a list
    new_line = [] # Temp list to join later
    for word in line: # Loops through the list of words in the sentence
      new_line.append(lemmatizer.lemmatize(word)) # Lemmatizes and appends to the list.
    data[text] = ' '.join(new_line) # Join method sourced from https://stackoverflow.com/questions/29642188/removing-the-square-brackets-commas-and-single-quote, user: halex 
  return data
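
As an aside, the punctuation step can also be written with `str.translate`, which removes all listed characters in one pass. The sketch below is an alternative under the assumption that the same ten characters should be stripped; `strip_punctuation` and the sample titles are made up for this illustration. Unlike `replace()` above, it returns a new list and leaves its input unchanged.

```python
PUNCTUATION = ".[]'?!():;"  # the same characters as the chars list in replace()
STRIP_TABLE = str.maketrans("", "", PUNCTUATION)  # maps each listed character to "delete"

def strip_punctuation(texts):
    """Return new strings with the listed punctuation removed."""
    return [text.translate(STRIP_TABLE) for text in texts]

titles = ["Psoriasis: a review (2019).", "What's new in glaucoma?"]  # hypothetical titles
cleaned = strip_punctuation(titles)
print(cleaned)
```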



In [0]:
# Replace, lemmatize, and stem the text
def prep_data(data, has_labels): # Method to prep the data
  if has_labels == "True":
    data, labels = parseData(data, has_labels) # Parses the data
  else:
    data = parseData(data, has_labels) # Parses the data

  data = replace(data) # Replace chars we don't want in the string
  data = lemmatization(data) # Lemmatize the data
  data = stemming(data) # Stem the data.

  if has_labels == "True":
    return data, labels
  else:
    return data

In [0]:
train_1_data, train_1_labels = prep_data(train_1, "True")
train_2_data, train_2_labels = prep_data(train_2, "True")

3. Features used:
What feature model was used? Check out all the options for [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). Also, instead of the bag-of-words (CountVectorizer) model you can optionally use [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). You can read more about both these approaches [here](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction). Regardless of the vectorizer used, list all parameter values you used.


ANSWER: I tried both the CountVectorizer and the TfidfVectorizer feature models. The two gave similar results, but on this dataset the TfidfVectorizer performed slightly better.
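
To show how the two feature models differ, the sketch below vectorizes three made-up titles both ways (the `docs` list and variable names are hypothetical). CountVectorizer stores raw term counts, while TfidfVectorizer down-weights terms that occur in many documents and, with its default settings, scales each row to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the eye and the retina", "the skin and the rash", "the eye exam"]  # hypothetical titles

count_vec = CountVectorizer(stop_words="english")
tfidf_vec = TfidfVectorizer(stop_words="english")
counts = count_vec.fit_transform(docs)
tfidf = tfidf_vec.fit_transform(docs)

print(sorted(count_vec.vocabulary_))  # stop words such as "the" are dropped
print(counts.toarray())
print(tfidf.toarray().round(2))

# In the first title, "eye" occurs in two documents and "retina" in only one,
# so TF-IDF gives "retina" the larger weight even though both counts are 1.
eye_col = tfidf_vec.vocabulary_["eye"]
retina_col = tfidf_vec.vocabulary_["retina"]
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)
```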

In [0]:
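# The following code for the TfidfVectorizer is sourced and used from the examples at the source: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html.

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
# stop_words must be passed by keyword: a bare 'english' is taken as the `input` parameter
# (rejected by current scikit-learn), so no stop words would be removed.
vectorizer = TfidfVectorizer(stop_words='english')
# vectorizer = CountVectorizer(stop_words='english')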

data_vectors = vectorizer.fit_transform(train_1_data) # Turn the data into a vector
print(data_vectors.shape)

4. NB Model performance:
Report the performance values using Naive Bayes here. What is represented by the *alpha* and *fit_prior* parameters? Which value pair for *alpha* (try 0.0 or 1.0) and *fit_prior* (try True or False) gives you the best overall performance in terms of macro-averaged F1 (train using train-1 and test using train-2)? For the best performing model, show the confusion matrix. Show individual F1 values for each category in a bar chart. Which categories are easiest/hardest to predict?



ANSWER: According to the [MultinomialNB documentation](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html), the *alpha* parameter is the additive (Laplace/Lidstone) smoothing parameter, which keeps words never seen in a class from getting zero probability, while the *fit_prior* parameter is a boolean that decides "Whether to learn class prior probabilities or not"; if it is False, a uniform prior is used. Changing alpha and fit_prior from their defaults (alpha=1.0, fit_prior=True) slightly decreased the performance of my model. The easiest category to predict is eye diseases. The hardest category to predict is skin diseases.
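
The effect of the two parameters can be sketched in plain Python. This is a simplified illustration of the idea, not scikit-learn's implementation; the helper names and the counts are made up. With additive smoothing, the probability of word w in class c is (count(w, c) + alpha) / (total count in c + alpha * vocabulary size).

```python
from collections import Counter

def smoothed_word_probs(word_counts, vocabulary, alpha):
    """P(word | class) with additive smoothing; alpha=0 means no smoothing."""
    total = sum(word_counts.get(w, 0) for w in vocabulary)
    denom = total + alpha * len(vocabulary)
    return {w: (word_counts.get(w, 0) + alpha) / denom for w in vocabulary}

def class_priors(labels, fit_prior):
    """Class priors learned from label frequencies, or uniform when fit_prior is False."""
    classes = sorted(set(labels))
    if not fit_prior:
        return {c: 1 / len(classes) for c in classes}
    counts = Counter(labels)
    return {c: counts[c] / len(labels) for c in classes}

vocabulary = ["cornea", "retina", "rash"]
eye_counts = {"cornea": 3, "retina": 1}  # "rash" never occurs in this class (hypothetical counts)

no_smoothing = smoothed_word_probs(eye_counts, vocabulary, alpha=0.0)
laplace = smoothed_word_probs(eye_counts, vocabulary, alpha=1.0)
learned = class_priors(["eye", "eye", "eye", "skin"], fit_prior=True)
uniform = class_priors(["eye", "eye", "eye", "skin"], fit_prior=False)
print(no_smoothing, laplace, learned, uniform)
```

Without smoothing, a single unseen word gives the whole document zero probability for that class; with alpha=1.0 the unseen word keeps a small non-zero probability.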

In [0]:
#Naive bayes sourced from the code block above in part 1
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB() # Create the naive bayes classifier
classifier.fit(data_vectors, train_1_labels) # Fit the data to the classifier.

In [0]:
# Code sourced from code block above in part 1.
data_vectors_test = vectorizer.transform(train_2_data)  # Turn the data into a vector
predicted = classifier.predict(data_vectors_test) # Get the predictions on the test set.


# The Following five commented blocks are from testing different stemmers/vectorizers.

In [0]:
# # The following code is sourced from the above code block in part 1.
# from sklearn import metrics
# # Snowball Stemmer
# # TfidVectorizer
# #Alpha - 1.0
# # Fit_prior = False

# print("macro F1:",metrics.f1_score(train_2_labels, predicted, average='macro'))
# print("micro F1:",metrics.f1_score(train_2_labels, predicted, average='micro'))
# print("\n",metrics.classification_report(train_2_labels, predicted))
# cm = metrics.confusion_matrix(train_2_labels, predicted)
# print("Confusion Matrix:\n",cm)

In [0]:
# # The following code is sourced from the above code block in part 1.
# from sklearn import metrics
# # Snowball Stemmer
# # TfidVectorizer
# # Alpha - 0.1
# # Fit_prior = True


# print("macro F1:",metrics.f1_score(train_2_labels, predicted, average='macro'))
# print("micro F1:",metrics.f1_score(train_2_labels, predicted, average='micro'))
# print("\n",metrics.classification_report(train_2_labels, predicted))
# cm = metrics.confusion_matrix(train_2_labels, predicted)
# print("Confusion Matrix:\n",cm)

In [0]:
# from sklearn import metrics
# # Lancaster Stemmer
# # CountVectorizer
# The following code is sourced from the above code block in part 1.

# print("macro F1:",metrics.f1_score(temp_labels_2, predicted, average='macro'))
# print("micro F1:",metrics.f1_score(temp_labels_2, predicted, average='micro'))
# print("\n",metrics.classification_report(temp_labels_2, predicted))
# cm = metrics.confusion_matrix(temp_labels_2, predicted)
# print("Confusion Matrix:\n",cm)

In [0]:
# from sklearn import metrics
# # Lancaster Stemmer
# # TfidVectorizer
# The following code is sourced from the above code block in part 1.

# print("macro F1:",metrics.f1_score(temp_labels_2, predicted, average='macro'))
# print("micro F1:",metrics.f1_score(temp_labels_2, predicted, average='micro'))
# print("\n",metrics.classification_report(temp_labels_2, predicted))
# cm = metrics.confusion_matrix(temp_labels_2, predicted)
# print("Confusion Matrix:\n",cm)

In [0]:
# # This code is sourced from above in part 1.
# from sklearn import metrics # Import the metrics library
# # Snowball Stemmer
# # CountVectorizer

# print("macro F1:",metrics.f1_score(train_2_labels, predicted, average='macro'))
# print("micro F1:",metrics.f1_score(train_2_labels, predicted, average='micro'))
# print("\n",metrics.classification_report(train_2_labels, predicted))
# cm = metrics.confusion_matrix(train_2_labels, predicted)
# print("Confusion Matrix:\n",cm)

In [0]:
from sklearn import metrics
# Snowball Stemmer
# TfidfVectorizer
# alpha = 1.0
# fit_prior = True

# The following code is sourced from the above code block in part 1.
# Print metrics of the predictions
print("macro F1:",metrics.f1_score(train_2_labels, predicted, average='macro'))
print("micro F1:",metrics.f1_score(train_2_labels, predicted, average='micro'))
print("\n",metrics.classification_report(train_2_labels, predicted))
cm = metrics.confusion_matrix(train_2_labels, predicted)
print("Confusion Matrix:\n",cm)

In [0]:
import numpy as np

def get_fScore(f_label, prediction): # Returns the f-score and labels in two arrays
  f_score = metrics.precision_recall_fscore_support(f_label, prediction) # Grabs the metrics from the fscore report with help of https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html#sklearn.metrics.precision_recall_fscore_support
  f_score = f_score[2] # Array of f_scores pulled from the metrics library
  # code for unique values as a np array sourced from: https://www.geeksforgeeks.org/python-get-unique-values-list/
  labels_fscore = np.unique(np.array(f_label)) # Gets the unique values of the labels
  return f_score, labels_fscore

import matplotlib.pyplot as plt

f_score_NB, f_score_labels_NB = get_fScore(train_2_labels, predicted) 

# Label axis of graph
plt.xlabel("Class")
plt.ylabel("F1-Score")
plt.bar(f_score_labels_NB, f_score_NB) # Plots the bar chart of the per-class F1 scores
plt.show() # Display the chart

5. Other Model performance:
Compare NB performance to at least one other model mentioned in class (e.g. KNN). Which model did you pick? List all parameter values you selected. Show the individual F1 values for each category for the two models side-by-side in a bar chart.



ANSWER: The model I picked was a KNeighborsClassifier. All of the parameters are the defaults except n_neighbors=10 and weights='distance'. I set K to 10 because that gave the best result I found. I tried the algorithm options auto, ball_tree, kd_tree, and brute without much change in results. I also tried the manhattan, euclidean, and minkowski distance metrics: euclidean and minkowski were very close (with the default p=2, minkowski is the euclidean distance), while manhattan reduced performance. Of the two weight options, 'uniform' and 'distance', I found 'distance' to perform slightly better.

In [0]:
# Source for how to do the following KNN code : https://www.datacamp.com/community/tutorials/k-nearest-neighbor-classification-scikit-learn
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier(n_neighbors=10, weights='distance') # Create the KNN classifier
classifier.fit(data_vectors, train_1_labels) # Fit the data to the classifier.

predicted = classifier.predict(data_vectors_test) # predict categories for the test data using the above trained classifier

from sklearn import metrics
# Snowball Stemmer
# TfidfVectorizer

# The following code is sourced from the above code block in part 1.
# Print prediction metrics
print("macro F1:",metrics.f1_score(train_2_labels, predicted, average='macro'))
print("micro F1:",metrics.f1_score(train_2_labels, predicted, average='micro'))
print("\n",metrics.classification_report(train_2_labels, predicted))
cm = metrics.confusion_matrix(train_2_labels, predicted)
print("Confusion Matrix:\n",cm)
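
The manual search over K and the weight option could also be automated with `GridSearchCV`. The sketch below runs on made-up toy titles rather than the assignment data, and the grid values are only examples; it assumes macro F1 is the score to optimize, as in the rest of this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy titles; each one is repeated so every CV fold contains both classes.
toy_texts = ["retina cornea lens", "cornea glaucoma eye",
             "rash eczema skin", "skin psoriasis rash"] * 3
toy_labels = ["eye", "eye", "skin", "skin"] * 3

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("knn", KNeighborsClassifier())])
param_grid = {
    "knn__n_neighbors": [1, 3, 5],            # example values, kept small for the toy data
    "knn__weights": ["uniform", "distance"],
}
search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=3)
search.fit(toy_texts, toy_labels)
print(search.best_params_, search.best_score_)
```

Putting the vectorizer inside the pipeline means it is refit on the training part of every fold, so no vocabulary statistics leak from the validation part.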


In [0]:
f_score_KNN, f_score_labels_KNN = get_fScore(train_2_labels, predicted)  # Grab the KNN F-Scores

# Following code for how to do the side by side bar chart sourced from : https://pythonspot.com/matplotlib-bar-chart/
x_location = np.arange(len(f_score_NB)) # One x location per class
width = 0.3 # Width of the bars
rect_nb = plt.bar(x_location, f_score_NB, color="blue", label = 'Naive Bayes', width=width) # Plots the naive bayes bars
rect_knn = plt.bar(x_location + width, f_score_KNN, color="red", label = 'KNN', width=width) # Plots the KNN bars

# Label the chart.
plt.ylabel('F1-Score')
plt.xlabel('Class')
plt.xticks(x_location + (width/2), f_score_labels_NB) # Adds the class labels to the bottom of the chart
plt.legend() # Adds the legend to the chart.
plt.tight_layout() # Looks better with this.
plt.show() # Show the plot.


6. Predict the categories for articles in the given test data set ("diseases-test-without-labels.csv"). Save the predictions into "diseases-test-preds.csv" (it should have the exact same format as the "diseases-train.csv" file) and upload it along with the completed ipynb file. We will use your generated "diseases-test-preds.csv" file for evaluating the final performance of your code (using macro-averaged F1).

The three best performing submissions on test data will get **bonus points** (5% of the assignment grade for 1st place, 3% for 2nd, and 2% for 3rd). The winner will be announced in class after the final evaluation.

In [0]:
# Code sourced from the upload block above in part 2.
from google.colab import files
uploaded = files.upload()

In [0]:
df_test = pd.read_csv("diseases-test-without-labels.csv") # Reads the csv file

In [0]:
test_data = prep_data(df_test, "False") # Prepares the test Data

In [0]:
# Following sourced from code above in part 1
vector = vectorizer.transform(test_data) # Create a vector of the test
predicted_values = classifier.predict(vector) # Predict classes
pmid = list(df_test['pmid']) 

In [0]:
row_list = [['pmid', 'category']] # Initializes the list with labels
for row in range(len(predicted_values)): # Appends a list to add to a csv
  row_list.append([pmid[row], predicted_values[row]])

# The following code to write to csv is written with the help from source : https://www.programiz.com/python-programming/working-csv-files
import csv
with open('diseases-test-preds.csv', 'w', newline='') as file: # newline='' keeps the csv module from writing blank rows on Windows
  csv.writer(file).writerows(row_list) # Writes to the csv file.  Each row is an instance in the row_list created prior.
# The with block closes the file automatically, so no explicit close() is needed.

