# CASE STUDY SOLUTIONS
This Python notebook evaluates and compares the solutions for the workshop example. There are four solutions in this notebook (A through D), each with direct references to code we've seen in Python Notebooks I and II.

Each solution shares the same "Setup", which is highlighted below. Even the setup code is tagged with where it was first mentioned.

## SETUP
All solutions require this setup code, especially if the solution is being presented in a different cell or notebook.

In [None]:
!pip3 install nltk
!pip3 install pandas
# Installing libraries is not a requirement for the solution to function, but it's good practice.

In [None]:
import pandas as pd

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
#First Appearance: Notebook I, block 1 (A simple python script) lines 1-2
#The practice of importing libraries and collections is mentioned there. Specific items such as NLTK's collections and pandas are mentioned in Notebook II across multiple cells.

nltk.download('stopwords')
#First Appearance: Notebook I, block 1 (A simple python script) lines 11-25
#Stopwords are further streamlined in Notebook II, code block 12 (NLTK) as a shortcut to that first one.

pd.options.display.max_rows = 20
#First Appearance: Notebook II, block 2 (Pandas) lines 3-4.
#Completely optional; limits the number of rows shown whenever a dataframe is displayed.



In [None]:
#note that variables are declared throughout all code in both notebooks, they can be ...variable to user preference.

litrev_df = pd.read_csv('digihum-lit-rev.csv', delimiter=",")
#First Appearance: Notebook I, block 1 (A simple python script) line 28
#This is actually the pandas application of the same function, first seen in Notebook II, code block 2 (Pandas) line 7

litrev_df
#First Appearance: Notebook II, block 2 (Pandas) line 8
#Declaring a variable (done in the line above) is mentioned on multiple occasions; it is basically a practice or indicator that a variable or bucket has the file attached and readable for further analysis.
#Though optional, displaying it is perfect for learning what the column names are.

Above are the minimum requirements for all solutions, or the "Setup" for them. 
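
If you only want the column names (and a quick count of empty descriptions) rather than the whole table, a minimal sketch like the one below may help. It uses a hypothetical two-row stand-in for the workshop file, so `sample_csv`, `sample_df`, and the values in it are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for 'digihum-lit-rev.csv'.
sample_csv = io.StringIO(
    "Title,Description\n"
    "First article,An abstract about maps\n"
    "Second article,\n"
)
sample_df = pd.read_csv(sample_csv)

# List the column names without displaying the whole table.
column_names = list(sample_df.columns)
print(column_names)

# Count rows whose Description is empty; empty cells load as NaN.
missing_descriptions = int(sample_df["Description"].isna().sum())
print(missing_descriptions)
```

With the real file, the same two checks would run on `litrev_df` instead of `sample_df`.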

# SOLUTION A

This solution is a "brute force" method, where the .csv file is forced out as a .txt file and there isn't much preprocessing involved. You will get from Point A to Point B without removing stopwords or checking for punctuation. This is a very NLTK-oriented approach. Though this isn't a clean "solution", it can help you figure out your next steps.

In [None]:
nltk.download('punkt_tab')
#First appearance: Notebook II, code block 3 (NLTK), required for tokenization. 
#Punkt is a resource in the NLTK library that provides the pretrained sentence-tokenizer data word_tokenize relies on. Tokenizing splits punctuation into separate tokens, which makes it possible to leave punctuation out of various analyses or text frequencies.

from nltk import FreqDist
#First appearance: Notebook II, code block 7 (NLTK), required to run a frequency distribution. 
#FreqDist is a class whose functions and tools streamline working with a frequency distribution.

description_list = list(litrev_df["Description"])
#First Appearance: Notebook II, block 4 (Pandas), creates a new variable that looks at the dataframe items as a list, easier for export.

litrev_df = litrev_df.loc[litrev_df["Description"].notna()]
#First Appearance: Notebook II, block 7 (Pandas), only keeps the rows that don't have an NA description. Useful for strings.

litrev_df["Description"] = litrev_df["Description"].str.lower()
#First Appearance: Notebook II, block 13 (Pandas), changes all text to lowercase strings to help NLTK work with it. 

litrev_df.to_csv("test.txt")
#First Appearance: Notebook II, block 15 (Pandas), forces a file to export as a .txt file for easier NLTK reading

description = open('test.txt', encoding="utf-8").read().lower()
#First Appearance: Notebook I, block 1 (A simple python script) line 28, text exclusive

description_words = word_tokenize(description)
#First Appearance: Notebook II, block 5 (NLTK), demonstration of tokenizing words

descriptionfrequency = FreqDist(description_words)
descriptionfrequency.most_common(10)
#First Appearance: Notebook II, block 7 (NLTK), using the frequency distribution functions in NLTK to get a pretty mediocre result.
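
One likely reason the result above is mediocre is that `to_csv` with default settings writes the row index, the header row, and every column, so all of that ends up being tokenized too. The sketch below illustrates the difference on a hypothetical two-row dataframe (`sample_df` and its contents are made up for illustration); it is one possible next step, not the workshop's official fix.

```python
import pandas as pd

# Hypothetical stand-in for litrev_df.
sample_df = pd.DataFrame({
    "Title": ["First article", "Second article"],
    "Description": ["maps and archives", "archives of text"],
})

# With no path given, to_csv returns the text it would have written:
# the index, the header row, and every column are all included.
full_export = sample_df.to_csv()
print(full_export)

# Joining only the Description values keeps the text limited to the abstracts.
descriptions_only = "\n".join(sample_df["Description"])
print(descriptions_only)
```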

# SOLUTION B
Here are applications of the more primitive loops for actual results. This can be streamlined further, though!

In [None]:
import re
from collections import Counter
#First Appearance: Notebook I, block 1 (A simple python script) lines 1-2, the only mention and usage of the regex library and the collections library with the Counter function. This is later replaced by the Frequency Distribution functions in NLTK

description_list = list(litrev_df["Description"])
#First Appearance: Notebook II, block 4 (Pandas), creates a new variable that looks at the dataframe items as a list, easier for export.

litrev_df = litrev_df.loc[litrev_df["Description"].notna()]
#First Appearance: Notebook II, block 7 (Pandas), only keeps the rows that don't have an NA description. Useful for strings.

litrev_df["Description"] = litrev_df["Description"].str.lower()
#First Appearance: Notebook II, block 13 (Pandas), changes all text to lowercase strings to help NLTK work with it. 

litrev_df.to_csv("test.txt")
#First Appearance: Notebook II, block 15 (Pandas), forces a file to export as a .txt file for easier NLTK reading

description = open('test.txt', encoding="utf-8").read().lower()
#First Appearance: Notebook I, block 1 (A simple python script) line 28, text exclusive


word_count = 50
#First Appearance: Notebook I, block 1 (A simple python script) line 8. This is a variable cap for the number of words that can be viewed in the output

stop_words = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", 
             "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 
             'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 
             'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 
             'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'said', 
             'say', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at',
             'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 
             'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 
             'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'would', 'could', 'should', 
             'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 
             'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', 
             "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 
             'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', 
             "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', 
             "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
#First Appearance: Notebook I, block 1 (A simple python script) lines 11-25. This is a brute stopwords list to be used instead of the NLTK dictionary. 

text_split_into_words = re.split(r"\W+", description)
#First Appearance: Notebook I, block 1 (A simple python script) line 31. This uses the regex (Import re) library's split function to put the text file into words that are countable. 

significant_words = []
#First Appearance: Notebook I, block 1 (A simple python script) line 35. This is the first use of an empty list, to be used in a loop.

for word in text_split_into_words:
    if word not in stop_words and word.isalpha():
        significant_words.append(word)
#First Appearance: Notebook I, block 1 (A simple python script) lines 37-39. This is the first loop that processes and updates the list; it also checks whether the words are actually words, thus creating a list of terms.

significant_words_tally = Counter(significant_words)
order_significant_words = significant_words_tally.most_common(word_count)
#First Appearance: Notebook I, block 1 (A simple python script) lines 42-45. This tallies the words and displays them by most common, using tools found in the collections library. This is later replaced by NLTK's Frequency Distribution functions.

order_significant_words
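
As one idea for the streamlining mentioned at the top of this solution, the split, filter, and count steps can be folded into a single expression. The sketch below is only an illustration: `sample_text` and the three-word `stop_set` are hypothetical stand-ins for the real description text and the full stopword list.

```python
import re
from collections import Counter

# Hypothetical sample text and a deliberately short stop list for illustration.
sample_text = "The archive of maps and the archive of letters"
stop_set = {"the", "of", "and"}  # a set gives faster membership checks than a list

# Split, filter, and count in one expression instead of a loop plus an empty list.
tally = Counter(
    word
    for word in re.split(r"\W+", sample_text.lower())
    if word.isalpha() and word not in stop_set
)

top_words = tally.most_common(2)
print(top_words)
```

Converting the stopword list to a set matters more as the text grows, because every word is checked against it once.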

# SOLUTION C

There is more pre-processing involved which makes this more functional and the ideal middle ground of coding comprehension, problem solving, and creativity 👏!

In [None]:
nltk.download('punkt_tab')
nltk.download('stopwords')
from nltk.corpus import stopwords
#First appearance: Notebook II, code block 3 (NLTK), required for tokenization and stopwords library

from nltk import FreqDist
#First appearance: Notebook II, code block 7 (NLTK), required to run a frequency distribution. 

from string import punctuation
punctuation = list(punctuation)
#First Appearance: Notebook II, block 8 (NLTK), lines 4-5, adds a list of punctuation to ignore in the frequency distribution

description_list = list(litrev_df["Description"])
#First Appearance: Notebook II, block 4 (Pandas), creates a new variable that looks at the dataframe items as a list, easier for export.

litrev_df = litrev_df.loc[litrev_df["Description"].notna()]
#First Appearance: Notebook II, block 7 (Pandas), only keeps the rows that don't have an NA description. Useful for strings.

litrev_df["Description"] = litrev_df["Description"].str.lower()
#First Appearance: Notebook II, block 13 (Pandas), changes all text to lowercase strings to help NLTK work with it. 

litrev_df.to_csv("test.txt")
#First Appearance: Notebook II, block 15 (Pandas), forces a file to export as a .txt file for easier NLTK reading

description = open('test.txt', encoding="utf-8").read().lower()
#First Appearance: Notebook I, block 1 (A simple python script) line 28, text exclusive

description_words = word_tokenize(description)
#First Appearance: Notebook II, block 5 (NLTK), demonstration of tokenizing words

## HERE IS WHERE THE MAJOR DIFFERENCES ARE

stop_words = stopwords.words('english')
#First Appearance: Notebook II, block 6 (NLTK), declaring the stopwords list as a variable for a loop and declaring using the english dictionary

stop_words.extend(["n't", "'s", 'would',])
#First Appearance: Notebook II, block 8 (NLTK), line 10, adds more words to the stopword list

filtered_transcript_words = []
#First Appearance: Notebook I, block 1 (A simple python script) line 35. This creates an empty list to store all the processed words in

for word in description_words:
    if word not in stop_words and word not in punctuation:
        filtered_transcript_words.append(word)
#First Appearance: Notebook I, block 1 (A simple python script) lines 37-39, perfected in Notebook II, block 9 (NLTK), lines 2-5. Filters through the words, removing stopwords and punctuation to create a clean list

descriptionfrequency = FreqDist(filtered_transcript_words)
descriptionfrequency.most_common(10)
#First Appearance: Notebook II, block 7 (NLTK), using the frequency distribution functions in NLTK, this time on the filtered list so that stopwords and punctuation are left out of the result.
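
The filter loop above compares each token against a list of single punctuation characters. The sketch below, which uses a hypothetical list of tokens of the kind a tokenizer can emit, shows how that check differs from the `word.isalpha()` check used in Solutions B and D. The token list is an assumption made up for illustration.

```python
from string import punctuation

punctuation_list = list(punctuation)

# Hypothetical tokens: multi-character punctuation such as "..." or "--"
# is not an element of the single-character punctuation list.
sample_tokens = ["maps", ",", "``", "...", "archives", "--", "1990s"]

# Membership check: removes only tokens that are exactly one punctuation character.
kept_by_list_check = [t for t in sample_tokens if t not in punctuation_list]

# isalpha check: keeps only tokens made up entirely of letters.
kept_by_isalpha = [t for t in sample_tokens if t.isalpha()]

print(kept_by_list_check)
print(kept_by_isalpha)
```

Neither check is strictly better: `isalpha()` is stricter and also drops tokens containing digits, such as "1990s", which may or may not be what you want.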

# SOLUTION D

Chantal's original solution, a full display of using libraries to their maximum potential! This is above and beyond so we don't expect you to get here, but will be very impressed if you do!

In [None]:
litrev_df = litrev_df[litrev_df["Description"].notna()]
#First Appearance: Notebook II, block 7 (Pandas), only keeps the rows that don't have an NA description. Useful for strings.

abstracts_as_text = ""
#First Appearance: Notebook I, block 4 (Naming Variables), examines how to declare an empty string, but there is no further mention of this as an application

for i in litrev_df["Description"]:
    abstracts_as_text = abstracts_as_text + i + "\n"
#NO previous appearance - This loop looks at each cell item as text and joins the strings together, one per line. This is the gap most participants run into when reaching for this solution.
    
abstractTokens = word_tokenize(abstracts_as_text.lower())
#MULTIPLE APPEARANCES - word_tokenize uses code from Notebook II, block 5 (NLTK); abstracts_as_text is a variable, and the .lower() function applied to it is used when reading files and converting cells to strings in Notebook II, block 13 (Pandas).

cleaned_abstractTokens = []
#First Appearance: Notebook I, block 1 (A simple python script) line 35. This is the first use of an empty list, to be used in a loop.

for word in list(abstractTokens):
    if word not in stopwords.words("english") and word.isalpha():
        cleaned_abstractTokens.append(word)
#MULTIPLE APPEARANCES - This is a cleaned-up combination of the loops found in Notebook I, block 1 (A simple python script) lines 37-39 AND Notebook II, block 9 (NLTK), lines 2-5. The combination is done using the stopwords.words('english') dictionary, and checking word.isalpha() instead of using punctuation.

abstracts_df = pd.DataFrame(cleaned_abstractTokens, columns =['uniqueWords'])
#NO previous appearance - This line of code creates a new dataframe to house all the cleaned tokens from the previous loop, placing those tokens (words) in a column named "uniqueWords". This is part of the vital gap in connecting pandas dataframe cells to NLTK.
        
keywords = abstracts_df["uniqueWords"].value_counts()
#First Appearance: Notebook II, block 9 (Pandas), this counts the values using a pandas function, meaning that this version never uses NLTK besides the tokenization and stopwords libraries.

keywords
#Declaring the variable for printing
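
For anyone reaching for this solution, the two "no previous appearance" steps can be tried out on a tiny example first. The sketch below is a simplified variant under stated assumptions: `sample_df`, the two-word `stop_set`, and the use of `str.split` in place of `word_tokenize` are all stand-ins so that it runs without NLTK. It also swaps the concatenation loop for `"\n".join(...)`, which is a different technique from the one in the solution above.

```python
import pandas as pd

# Hypothetical stand-in for the workshop dataframe.
sample_df = pd.DataFrame(
    {"Description": ["Digital maps of archives", None, "Archives and maps"]}
)
sample_df = sample_df[sample_df["Description"].notna()]

# Join every description into one string, one description per line.
abstracts_as_text = "\n".join(sample_df["Description"])

# Hypothetical short stop list; building it once as a set avoids
# rebuilding the list on every pass through the loop.
stop_set = {"of", "and"}

# str.split stands in for word_tokenize here (an assumption for this sketch).
cleaned_tokens = [
    word
    for word in abstracts_as_text.lower().split()
    if word.isalpha() and word not in stop_set
]

# Count each word with pandas, as in the final step above.
keywords = pd.Series(cleaned_tokens, name="uniqueWords").value_counts()
print(keywords)
```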