# DSFB Assignment 4

In this assignment, you will begin to work with text data and natural language processing. You will analyze aspects of th DonorsChoose.org program. Aspects of this project were first posed as a Kaggle challenge and the data comes from [Kaggle DonorsChoose.org Application Screening challenge](https://www.kaggle.com/c/donorschoose-application-screening/data). We have changed the nature of what you need to do in this assignment (so it does not track what was done in the Kaggle Challenge), but nevertheless using or referring to the Kaggle Challenge repository is not allowed for the assignment.

###  DonorsChoose.org  
  
Founded in 2000 by a high school teacher in the Bronx, DonorsChoose.org empowers public school teachers from across the country to request much-needed materials and experiences for their students. At any given time, there are thousands of classroom requests that can be brought to life with a gift of any amount. DonorsChoose.org receives hundreds of thousands of project proposals each year for classroom projects in need of funding. Right now, a large number of volunteers is needed to manually screen each submission before it's approved to be posted on the DonorsChoose.org website. In this assignment, you will analyze the text of the essays and requirements from each proposal.

<img src="https://cached.imagescaler.hbpl.co.uk/resize/scaleWidth/580/cached.offlinehbpl.hbpl.co.uk/news/NST/C8B9CC1D-03B0-9B80-4CFE78B5B539240F.jpg" width="500" height="500" align="center"/>

Image source: https://cached.imagescaler.hbpl.co.uk/resize/scaleWidth/580/cached.offlinehbpl.hbpl.co.uk/news/NST/C8B9CC1D-03B0-9B80-4CFE78B5B539240F.jpg

### Data

As you will see, this dataset includes many different kinds of features with structured and unstructured data. The dataset consists of application materials (see *application_data.csv*) and resources requested (see *resource_data.csv*). The application materials (see *application_data.csv*) contain the following features.

| Feature name  | Description  |
|----------------|--------------|
| id  | Unique id of the project application    |
| teacher_id    | id of the teacher submitting the application  |
| teacher_prefix    | title of the teacher's name (Ms., Mr., etc.)    |
| school_state    | US state of the teacher's school    |
| project_submitted_datetime    | application submission timestamp    |
| project_grade_category    | school grade levels (PreK-2, 3-5, 6-8, and 9-12)   |
| project_subject_categories   | category of the project (e.g., "Music & The Arts")    |
| project_subject_subcategories    | sub-category of the project (e.g., "Visual Arts")    |
| project_title    | title of the project    |
| project_essay_1    | first essay*   |
| project_essay_2    | second essay*    |
| project_essay_3    | third essay*   |
| project_essay_4    | fourth essay*  |
| project_resource_summary    | summary of the resources needed for the project    |
| teacher_number_of_previously_posted_projects   | number of previously posted applications by the submitting teacher    |
| project_is_approved    | whether DonorsChoose proposal was accepted (0="rejected", 1="accepted"); train.csv only    |


\*Note: Prior to May 17, 2016, the prompts for the essays were as follows:

  * project_essay_1: "Introduce us to your classroom"  

  * project_essay_2: "Tell us more about your students"  

  * project_essay_3: "Describe how your students will use the materials you're requesting"  

  * project_essay_4: "Close by sharing why your project will make a difference"  

Starting on May 17, 2016, the number of essays was reduced from 4 to 2, and the prompts for the first 2 essays were changed to the following:

  * project_essay_1: "Describe your students: What makes your students special? Specific details about their background, your neighborhood, and your school are all helpful."  

  * project_essay_2: "About your project: How will these materials make a difference in your students' learning and improve their school lives?"  

For all projects with project_submitted_datetime of 2016-05-17 and later, the values of project_essay_3 and project_essay_4 will be missing (i.e. NaN).


### Special NLP Libraries

We will use several new libraries for this assignment - so be sure to first install those on your machine by with `pip` in a terminal:

    pip install --user -U nltk
    pip install -U gensim
    pip install -U spacy
    pip install -U pyldavis

## IMPORTS

In [1]:
# Standard imports
import numpy  as np
import pandas as pd

import itertools
import random
import math  
import copy

from pprint import pprint  # nicer printing

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# Other NLP
import re
import spacy
import nltk
from nltk.corpus import stopwords

# General Plotting
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import matplotlib.patches as patches
%matplotlib inline  
import seaborn as sns
sns.set(style="white")

# Special Plotting
import pyLDAvis
import pyLDAvis.gensim  # don't skip this

# ignore some warnings 
import warnings
warnings.filterwarnings('ignore')

# Set the maximum number of rows displayed by pandas
pd.options.display.max_rows = 1000

# Set some CONSTANTS that will be used later
SEED    = 41  # base to generate a random number
SCORE   = 'roc_auc'
FIGSIZE = (16, 10)

# PART 1: Prep

**PROBLEM**: To use a particular model in the `spacy` package, you need to manually download and install that particular model. You will need to run the following code from a terminal: `python -m spacy download en_core_web_sm`. Rather than doing that manually from bash in a separate terminal program, do it inline below using a "magic" command in jupyter. HINT: Use *!* followed by a bash command in a cell to run a bash command.

In [2]:
!python -m spacy download en_core_web_sm

[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_sm')


**PROBLEM**: To confirm that `spacy` is working (and `en_core_web_sm` is installed on your computer), you should be able to use `spacy.load()` to build a `Language` object to perform some basic nlp. Do that below:

In [3]:
# Test use of spacy by using the spacy.load() function
text = "I love Data Science!"
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
for token in doc:
    print(token.text, token.pos_, token.dep_)

I PRON nsubj
love VERB ROOT
Data NOUN compound
Science PROPN dobj
! PUNCT punct


**PROBLEM**: Use nltk.download() to download a list of raw stopwords. (see NLTK documentation)

In [4]:
# Download NLTK stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/younge/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

**PROBLEM**: Use the `stopwords` object from `nltk` to build a list of English stopwords. 

In [5]:
# Get English Stopwords from NLTK
stop_words = stopwords.words('english')

**PROBLEM**: Extend your `stop_words` list with some additional stopwords that you believe should be ignored in this particular context.

In [6]:
# Extend the stop word list  
stop_words.extend(['project', 'teacher', 'subject', 'school', 'application'])

### Download the Data

Unlike other projects, this project includes a training set too big for GitHub. Through the terminal lab of Jupyter lab, download the data using the *wget* command, unzip it using the *zip* command and check that it's in the root directory of the project. 

Locations : 

    Applications dataset: https://storage.googleapis.com/dsfm/application/application_data.csv.zip
    Resources dataset: https://storage.googleapis.com/dsfm/application/resource_data.csv.zip
    
Hint: Use *wget* and *unzip* commands. Use *!* followed by a bash command in a cell to run a bash command.

**PROBLEM**: wget the data

In [None]:
# wget the data
!wget -N 'https://storage.googleapis.com/dsfm/application/application_data.csv.zip'
!wget -N 'https://storage.googleapis.com/dsfm/application/resource_data.csv.zip'

**PROBLEM**: unzip the data

In [None]:
# unzip the data
!unzip -o application_data.csv.zip
!unzip -o resource_data.csv.zip

# PART 2: Load Data

**PROBLEM**: Load `application_data.csv` and investigate it a bit.

In [7]:
# Load applications
applications = pd.read_csv('application_data.csv')
print('Application dataset')
print(applications.shape)
applications.head()

Application dataset
(182080, 16)


Unnamed: 0,id,teacher_id,teacher_prefix,school_state,project_submitted_datetime,project_grade_category,project_subject_categories,project_subject_subcategories,project_title,project_essay_1,project_essay_2,project_essay_3,project_essay_4,project_resource_summary,teacher_number_of_previously_posted_projects,project_is_approved
0,p036502,484aaf11257089a66cfedc9461c6bd0a,Ms.,NV,2016-11-18 14:45:59,Grades PreK-2,Literacy & Language,Literacy,Super Sight Word Centers,Most of my kindergarten students come from low...,I currently have a differentiated sight word c...,,,My students need 6 Ipod Nano's to create and d...,26,1
1,p039565,df72a3ba8089423fa8a94be88060f6ed,Mrs.,GA,2017-04-26 15:57:28,Grades 3-5,"Music & The Arts, Health & Sports","Performing Arts, Team Sports",Keep Calm and Dance On,Our elementary school is a culturally rich sch...,We strive to provide our diverse population of...,,,My students need matching shirts to wear for d...,1,0
2,p233823,a9b876a9252e08a55e3d894150f75ba3,Ms.,UT,2017-01-01 22:57:44,Grades 3-5,"Math & Science, Literacy & Language","Applied Sciences, Literature & Writing",Lets 3Doodle to Learn,Hello;\r\nMy name is Mrs. Brotherton. I teach ...,We are looking to add some 3Doodler to our cla...,,,My students need the 3doodler. We are an SEM s...,5,1
3,p185307,525fdbb6ec7f538a48beebaa0a51b24f,Mr.,NC,2016-08-12 15:42:11,Grades 3-5,Health & Sports,Health & Wellness,"\""Kid Inspired\"" Equipment to Increase Activit...",My students are the greatest students but are ...,"The student's project which is totally \""kid-i...",,,My students need balls and other activity equi...,16,0
4,p013780,a63b5547a7239eae4c1872670848e61a,Mr.,CA,2016-08-06 09:09:11,Grades 6-8,Health & Sports,Health & Wellness,We need clean water for our culinary arts class!,My students are athletes and students who are ...,For some reason in our kitchen the water comes...,,,My students need a water filtration system for...,42,1


**PROBLEM**: Load `resource_data.csv` and investigate it a bit.

In [8]:
# Load resources
resources = pd.read_csv('resource_data.csv')
print('Resources dataset')
print(resources.shape)
resources.head()

Resources dataset
(1541272, 4)


Unnamed: 0,id,description,quantity,price
0,p233245,LC652 - Lakeshore Double-Space Mobile Drying Rack,1,149.0
1,p069063,Bouncy Bands for Desks (Blue support pipes),3,14.95
2,p069063,Cory Stories: A Kid's Book About Living With Adhd,1,8.45
3,p069063,"Dixon Ticonderoga Wood-Cased #2 HB Pencils, Bo...",2,13.59
4,p069063,EDUCATIONAL INSIGHTS FLUORESCENT LIGHT FILTERS...,3,24.95


**PROBLEM**: Some of the essays are NA. Replace NAs with empty strings.

In [9]:
# Replace NA values in essay columns with ''
for i, row in applications.iterrows():
    if pd.isna(row["project_essay_1"]): row["project_essay_1"]=''
    if pd.isna(row["project_essay_2"]): row["project_essay_2"]=''
    if pd.isna(row["project_essay_3"]): row["project_essay_3"]=''
    if pd.isna(row["project_essay_4"]): row["project_essay_4"]=''

**PROBLEM**: To simplify matters, combine all essays into just one feature called "essays"

In [10]:
# Combine essays
applications["essays"] = (
    applications["project_essay_1"] + ' ' + 
    applications["project_essay_2"] + ' ' +
    applications["project_essay_3"] + ' ' +
    applications["project_essay_1"] )

**PROBLEM**: Merge the resources and application datasets on the *id* feature.

In [11]:
# Merge two datasets
merged = pd.merge(applications, resources, how='inner', on='id')

# Check the data to confirm it worked
print('Merged shape: {}\n'.format(merged.shape))
merged.dtypes

Merged shape: (1081830, 20)



id                                               object
teacher_id                                       object
teacher_prefix                                   object
school_state                                     object
project_submitted_datetime                       object
project_grade_category                           object
project_subject_categories                       object
project_subject_subcategories                    object
project_title                                    object
project_essay_1                                  object
project_essay_2                                  object
project_essay_3                                  object
project_essay_4                                  object
project_resource_summary                         object
teacher_number_of_previously_posted_projects      int64
project_is_approved                               int64
essays                                           object
description                                     

**PROBLEM**: Keep the following data for additional analysis (the id and the text features): `id`, `school_state`, `project_subject_categories`, `project_subject_subcategories`, `essays`, `description`

In [12]:
FEATURE_NAMES = ['school_state', 'project_subject_categories', 'project_subject_subcategories', 'essays', 'description']

In [13]:
# Keep the Text Features
merged = merged[FEATURE_NAMES]
merged.head()

Unnamed: 0,school_state,project_subject_categories,project_subject_subcategories,essays,description
0,NV,Literacy & Language,Literacy,,Apple - iPod nano� 16GB MP3 Player (8th Genera...
1,NV,Literacy & Language,Literacy,,Apple - iPod nano� 16GB MP3 Player (8th Genera...
2,GA,"Music & The Arts, Health & Sports","Performing Arts, Team Sports",,Reebok Girls' Fashion Dance Graphic T-Shirt - ...
3,UT,"Math & Science, Literacy & Language","Applied Sciences, Literature & Writing",,3doodler Start Full Edu Bundle
4,NC,Health & Sports,Health & Wellness,,BALL PG 4'' POLY SET OF 6 COLORS


# PART 3: Preprocess Text

Make an independent copy of the data so we can restart here when testing...

In [14]:
data = copy.copy(merged)

**PROBLEM**: Define a custom function `clean_punctuation()` to remove some punctuation from your text data. You don't have to do absolutely everything one might want to do - just show that you can do it. Start with each some easy operations with `str.replace()`.

In [15]:
# Define a custom function to clean punctuation from  given text
def clean_punctuation(txt):
    txt = str(txt).replace('\r',' ')
    txt = str(txt).replace('.',' ')
    txt = str(txt).replace(',',' ')
    txt = str(txt).replace(';',' ')
    txt = str(txt).replace('$',' ')
    txt = str(txt).replace('(',' ')
    txt = str(txt).replace(')',' ')
    txt = str(txt).replace('?',' ')
    txt = str(txt).replace('!',' ')
    return txt

**PROBLEM**: Use the `apply()` function from pandas to _apply_ that function down the `essays` column of your data.

In [16]:
# Apply your function to clean the essays column
for feature in FEATURE_NAMES:
    data[feature] = data[feature].apply(clean_punctuation)
data.head()

Unnamed: 0,school_state,project_subject_categories,project_subject_subcategories,essays,description
0,NV,Literacy & Language,Literacy,,Apple - iPod nano� 16GB MP3 Player 8th Genera...
1,NV,Literacy & Language,Literacy,,Apple - iPod nano� 16GB MP3 Player 8th Genera...
2,GA,Music & The Arts Health & Sports,Performing Arts Team Sports,,Reebok Girls' Fashion Dance Graphic T-Shirt - ...
3,UT,Math & Science Literacy & Language,Applied Sciences Literature & Writing,,3doodler Start Full Edu Bundle
4,NC,Health & Sports,Health & Wellness,,BALL PG 4'' POLY SET OF 6 COLORS


**PROBLEM**: Define **another** custom function called `clean_re()` to clean your text data using regular expressions. Do at least two "cleanings" (i.e., show that you can use the `re` library).

In [17]:
# Define a custom function to clean some given text
def clean_re(txt):
    txt = re.sub('[^a-zA-Z]', ' ', txt)   # Remove non-letters 
    txt = re.sub('\s+', ' ', txt)         # Remove multiple white space   
    return txt

In [18]:
# Apply clean_re() to all features
for feature in FEATURE_NAMES:
    data[feature] = data[feature].apply(clean_re)
data.head()

Unnamed: 0,school_state,project_subject_categories,project_subject_subcategories,essays,description
0,NV,Literacy Language,Literacy,,Apple iPod nano GB MP Player th Generation Lat...
1,NV,Literacy Language,Literacy,,Apple iPod nano GB MP Player th Generation Lat...
2,GA,Music The Arts Health Sports,Performing Arts Team Sports,,Reebok Girls Fashion Dance Graphic T Shirt Dd ...
3,UT,Math Science Literacy Language,Applied Sciences Literature Writing,,doodler Start Full Edu Bundle
4,NC,Health Sports,Health Wellness,,BALL PG POLY SET OF COLORS


**PROBLEM**: Remove stopwords. (Hint: use stopwords from nltk's `stopwords()` plus any additions you'd like to make. Then, again, define a custom function and then apply it to all features.)

In [19]:
# Define custom function to remove stopwords
def remove_stopwords(txt):
    return ' '.join([word for word in txt.split() if word not in stop_words])

In [20]:
# Apply function to remove stopwords  
for feature in FEATURE_NAMES:
    data[feature] = data[feature].apply(remove_stopwords)
data.head()

Unnamed: 0,school_state,project_subject_categories,project_subject_subcategories,essays,description
0,NV,Literacy Language,Literacy,,Apple iPod nano GB MP Player th Generation Lat...
1,NV,Literacy Language,Literacy,,Apple iPod nano GB MP Player th Generation Lat...
2,GA,Music The Arts Health Sports,Performing Arts Team Sports,,Reebok Girls Fashion Dance Graphic T Shirt Dd ...
3,UT,Math Science Literacy Language,Applied Sciences Literature Writing,,doodler Start Full Edu Bundle
4,NC,Health Sports,Health Wellness,,BALL PG POLY SET OF COLORS


**PROBLEM**: Now use Gensim’s `simple_preprocess()` function to tokenize and clean up your text data. TIP: `simple_preprocess()` returns a list of words, so we want to wrap it with a function that joins the list back together into a string.

In [21]:
# Define custom function to wrap simple_preprocess() from gensim
def preprocess(txt):
    return ' '.join(simple_preprocess(txt))

In [22]:
# Apply simple_preprocess() to all features
for feature in FEATURE_NAMES:
    data[feature] = data[feature].apply(preprocess)
data.head()

Unnamed: 0,school_state,project_subject_categories,project_subject_subcategories,essays,description
0,nv,literacy language,literacy,,apple ipod nano gb mp player th generation lat...
1,nv,literacy language,literacy,,apple ipod nano gb mp player th generation lat...
2,ga,music the arts health sports,performing arts team sports,,reebok girls fashion dance graphic shirt dd da...
3,ut,math science literacy language,applied sciences literature writing,,doodler start full edu bundle
4,nc,health sports,health wellness,,ball pg poly set of colors


**PROBLEM**: Lemmatize the text. (Hint: Define a custom function and then apply it to all features.)

In [23]:
# Write a lemmatization function based on nltk.stem.WordNetLemmatizer()
lemmatizer = nltk.stem.WordNetLemmatizer()
def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in text.split()]

In [24]:
# Apply lemmatize_text() to all features  
for feature in FEATURE_NAMES:
    data[feature] = data[feature].apply(lemmatize_text)
data.head()

Unnamed: 0,school_state,project_subject_categories,project_subject_subcategories,essays,description
0,[nv],"[literacy, language]",[literacy],[nan],"[apple, ipod, nano, gb, mp, player, th, genera..."
1,[nv],"[literacy, language]",[literacy],[nan],"[apple, ipod, nano, gb, mp, player, th, genera..."
2,[ga],"[music, the, art, health, sport]","[performing, art, team, sport]",[nan],"[reebok, girl, fashion, dance, graphic, shirt,..."
3,[ut],"[math, science, literacy, language]","[applied, science, literature, writing]",[nan],"[doodler, start, full, edu, bundle]"
4,[nc],"[health, sport]","[health, wellness]",[nan],"[ball, pg, poly, set, of, color]"


**PROBLEM**: What happened to the data in the pandas dataframe>

ANSWER: In was converted from long text into a list of individual words.

# PART 4:  Make an LDA topic model for the ESSAYS.

Define an LDA topic model for the `essays`. Compute the "Coherence score." Visually inspect the topic model by inspecting the top keywords from each model. Gensim provides functions for all of these tasks.  

In [25]:
# 





If you use gensim and the following three variables, then you can visualize topics & keywords with the code below.

    lda_model:    this is an LDA model generated by gensim.models.ldamodel.LdaModel()
    id2word:      this is the dictionary term IDs from corpora.Dictionary()
    corpus:       this is the collection of "documents"


In [None]:
# Visualize topics-keywords
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis

# PART 5:  Make an LDA topic model for the DESCRIPTIONS.

Using the same K (and any other hyperparameters from Part 4), recompute a model for Descriptions. Compare the two sets of results. Do they vary? How? Why? Explain what you find. 

In [None]:
# 



