# Text Analysis
Checkpoints by SiqiWang, originally designed by Maryah Garner

## Table of Contents
* [Learning Outcomes](#CLearningOutcomes)
* [Glossary of Terms](#Glossary)
* [Setup - Load Python Packages](#setup)
* [Motivation: Grant Proposal Abstracts ](#Motivation)
    * [Load the Data](#Load)
* [Prepare the Data](#Prepairing)
* [Preparing Text Data for Natural Language Processing (NLP)](#Topic)
* [Latent Dirichlet Allocation (LDA)](#LDA)
* [N-grams](#N-grams)
* [TF-IDF : Weighting terms based on frequency  ](#TF-IDF)
* [Visualize the project by topic ](#Visualize)
* [Checkpoints](#Checkpoints)

# Checkpoints <a class="Checkpoints" id="NLP"></a>
I would like you to turn in a notebook with only the checkpoints, ensuring that the notebook can run all the way through (you can hit the double arrow at the top of the notebook to restart the kernel and run the whole notebook). 3 points will be automatically deducted if your notebook does not run all the way through after I change the Path.
- For the name of the notebook you submit to brightspace, please use your first and last name followed by Text Analysis (for example: Maryah Garner Text Analysis.ipynb)

### 1) Install and import libraries (0 points, but if you don't include this your notebook will not run)

In [1]:
# Install package for natural language processing
%pip install nltk

# data manipulation
import pandas as pd
import numpy as np
import os

# text analysis tools
import nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn import preprocessing
from nltk import SnowballStemmer
import string

# visualization tools
import matplotlib as mplib
import matplotlib.pyplot as plt 
import seaborn as sns  

Note: you may need to restart the kernel to use updated packages.


### 2) Create a path and read in the abstract data and project data for a year of your choice (0.5 points)

#### Read in abstract data of year 2021

In [2]:
# Specify a path with the data folder
# Change "NAME" to your name as recorded on your computer
# path = 'C:/Users/NAME/PADM-GP_2505/Data/'
Path = "/Users/wsq/Desktop/Advanced Data Analytics and Evidence Building/PADM-GP_2505/Data"

# Read-in a CSV file
Abstracts_2021 = pd.read_csv(Path + '/Abstracts/RePORTER_PRJABS_C_FY2021.csv', encoding='latin-1')

#### Read in project data of year 2021

In [3]:
# Read-in the 2021 projects data (built from Path so only one line needs to change)
grants_2021 = pd.read_csv(Path + '/Projects/RePORTER_PRJ_C_FY2021_new.csv',
                          usecols=['APPLICATION_ID','IC_NAME', 'TOTAL_COST'], encoding='latin-1')

# View the first 10 observations 
grants_2021.head(10)

Unnamed: 0,APPLICATION_ID,IC_NAME,TOTAL_COST
0,10595864,NATIONAL INSTITUTE OF DIABETES AND DIGESTIVE A...,
1,10101643,NATIONAL INSTITUTE ON DRUG ABUSE,618444.0
2,10189622,FOOD AND DRUG ADMINISTRATION,74000.0
3,10189608,FOOD AND DRUG ADMINISTRATION,52000.0
4,10076833,NATIONAL EYE INSTITUTE,540597.0
5,10084900,NATIONAL INSTITUTE OF BIOMEDICAL IMAGING AND B...,644204.0
6,10119627,NATIONAL INSTITUTE OF NEUROLOGICAL DISORDERS A...,655081.0
7,10485753,NATIONAL CENTER FOR ADVANCING TRANSLATIONAL SC...,201957.0
8,10490689,NATIONAL EYE INSTITUTE,31027.0
9,10449884,FOOD AND DRUG ADMINISTRATION,225000.0


### 3) Join together the abstract and projects data and filter for projects funded by the IC most appropriate for your class project (0.5 points)

#### Subset for NCI projects, because NCI is the most appropriate IC for our group's class project.

In [4]:
# Use a conditional subset to select projects that have NATIONAL CANCER INSTITUTE as the IC_NAME
NCI_projects = grants_2021[grants_2021['IC_NAME'] =='NATIONAL CANCER INSTITUTE']

# Reset index
NCI_projects = NCI_projects.reset_index()

# view the first 5 observations
NCI_projects.head()

Unnamed: 0,index,APPLICATION_ID,IC_NAME,TOTAL_COST
0,18,10406126,NATIONAL CANCER INSTITUTE,219626.0
1,29,10246936,NATIONAL CANCER INSTITUTE,636455.0
2,31,10131000,NATIONAL CANCER INSTITUTE,316405.0
3,35,10433611,NATIONAL CANCER INSTITUTE,35000.0
4,38,10118122,NATIONAL CANCER INSTITUTE,


In [5]:
NCI_projects.shape

(12595, 4)

#### Connect the Abstract data to the NCI_projects dataframe

In [6]:
# Create a new dataframe by using a left merge to join the NCI_projects and Abstracts_2021 dataframes
# merging on APPLICATION_ID
nci_abstracts = pd.merge(NCI_projects, Abstracts_2021, on='APPLICATION_ID', how = 'left')
nci_abstracts.shape

(12595, 5)
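
A left merge keeps every row of the left dataframe, so the row count staying at 12595 suggests that each `APPLICATION_ID` matched at most one abstract. A minimal sketch of how that could be checked, using small made-up frames (the IDs and text below are hypothetical, not taken from the RePORTER files); `validate` and `indicator` are standard `pd.merge` parameters:

```python
import pandas as pd

# Toy stand-ins for NCI_projects and Abstracts_2021 (all values are made up)
projects = pd.DataFrame({"APPLICATION_ID": [1, 2, 3],
                         "TOTAL_COST": [100.0, 200.0, None]})
abstracts = pd.DataFrame({"APPLICATION_ID": [1, 3, 4],
                          "ABSTRACT_TEXT": ["text a", "text c", "text d"]})

# how="left" keeps all projects; validate raises an error if an ID repeats
# on either side; indicator adds a _merge column with each row's origin
merged = pd.merge(projects, abstracts, on="APPLICATION_ID", how="left",
                  validate="one_to_one", indicator=True)

# Rows marked "left_only" are projects with no matching abstract
n_unmatched = int((merged["_merge"] == "left_only").sum())
print(merged.shape, n_unmatched)
```

In this toy case the project with ID 2 has no abstract, which is the same situation the next checkpoint counts with `pd.isnull`.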

In [7]:
nci_abstracts.head()

Unnamed: 0,index,APPLICATION_ID,IC_NAME,TOTAL_COST,ABSTRACT_TEXT
0,18,10406126,NATIONAL CANCER INSTITUTE,219626.0,Project Summary/Abstract Title: Caribbean Inve...
1,29,10246936,NATIONAL CANCER INSTITUTE,636455.0,ABSTRACT Glioblastoma multiforme (GBM) is the ...
2,31,10131000,NATIONAL CANCER INSTITUTE,316405.0,Gene silencing using small interfering RNA (si...
3,35,10433611,NATIONAL CANCER INSTITUTE,35000.0,Project Summary/Abstract This application is b...
4,38,10118122,NATIONAL CANCER INSTITUTE,,ABSTRACT ? BIOREPOSITORY & PRECISION PATHOLOGY...


### 4) Identify how many projects have missing abstracts and then remove the projects with missing abstracts from your dataframe (0.5 points)


In [8]:
nci_abstracts[pd.isnull(nci_abstracts['ABSTRACT_TEXT'])].count()

index             205
APPLICATION_ID    205
IC_NAME           205
TOTAL_COST        139
ABSTRACT_TEXT       0
dtype: int64

#### Out of the 12595 Cancer projects, 205 of them do not have an abstract. This is about 1.6% of the total projects. 

In [9]:
# Drop the projects with missing abstracts
nci_abstracts = nci_abstracts.dropna(subset = ['ABSTRACT_TEXT'])

# Look at the number of rows and columns of the data frame
nci_abstracts.shape

(12390, 5)
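
The count and the percentage can also be computed directly from the one column of interest. The sketch below uses a made-up four-row frame (hypothetical values) and also shows that `dropna` keeps the original index labels, so the index has gaps afterwards:

```python
import pandas as pd

# Toy frame with two missing abstracts (values are made up)
df = pd.DataFrame({"APPLICATION_ID": [1, 2, 3, 4],
                   "ABSTRACT_TEXT": ["a", None, "c", None]})

# Number and share of rows with a missing abstract
n_missing = int(df["ABSTRACT_TEXT"].isna().sum())
share_missing = n_missing / len(df)

# Drop those rows; the surviving rows keep their old index labels
cleaned = df.dropna(subset=["ABSTRACT_TEXT"])
print(n_missing, share_missing, list(cleaned.index))
```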

### 5) Save your abstracts to list (0.5 points)

In [10]:
# Save the abstracts to a list
abstracts_list = nci_abstracts['ABSTRACT_TEXT'].values.tolist()

# look at the first element of the list, which is the first abstract
abstracts_list[0]

'Project Summary/Abstract Title: Caribbean Investigation of Cancer Stigma and its effect on Cervical Cancer Screening and HPV Vaccination This application is being submitted in response to the Notice of Special Interest (NOSI) identified as NOT-CA-21-026 Cancer stigma is an understudied barrier to cancer treatment seeking, early diagnosis, screening and other prevention practices. In particular, cervical cancer (CCA) stigma must be prioritized for research as CCA is among the most deadly but highly preventable cancers. CCA is the fourth most common cancer in the world despite an accurate screening test and a preventive HPV vaccine. The Human papilloma Virus (HPV) causes 99% of cervical, as well as much of anal, penile and oral cancers. The burden of CCA persists, and CCA incidence and mortality are even increasing among Black women in low resourced countries globally. Specifically, CCA incidence is highest, and is the leading cause of premature mortality among Caribbean women. Despite 

### 6) Set stemmer at SnowballStemmer (0.5 points)

In [11]:
stemmer = SnowballStemmer("english")
# few examples of how SnowballStemmer works
print(stemmer.stem('stigma'))
print(stemmer.stem('molecular'))
print(stemmer.stem('screening'))
print(stemmer.stem('proximity'))
print(stemmer.stem('culturally'))
print(stemmer.stem('checkpoint'))


stigma
molecular
screen
proxim
cultur
checkpoint


### 7) Create a tokenize function (0.5 points)

In [12]:
# Create a tokenize function
def tokenize(text):
    # translator that replaces punctuation with empty spaces
    translator = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    # stemmer and tokenizing into words
    return [stemmer.stem(i) for i in text.translate(translator).split()]

In [13]:
tokenize(abstracts_list[0])[:10]

['project',
 'summari',
 'abstract',
 'titl',
 'caribbean',
 'investig',
 'of',
 'cancer',
 'stigma',
 'and']
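
To see what the translator does on its own, here is a small sketch of the punctuation step without the stemmer. The helper name `split_words` and the sample sentence are made up for illustration:

```python
import string

# Map every punctuation character to a space
translator = str.maketrans(string.punctuation, " " * len(string.punctuation))

def split_words(text):
    """Replace punctuation with spaces, then split on whitespace (no stemming)."""
    return text.translate(translator).split()

words = split_words("Project Summary/Abstract Title: HPV-related cancers.")
print(words)
```

Because punctuation becomes a space rather than being deleted, `Summary/Abstract` and `HPV-related` each split into two tokens instead of being glued together.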

### 8) Import and download stopwords from nltk.corpus (0.5 points)

In [14]:
# Import stopwords from nltk.corpus
from nltk.corpus import stopwords
# Download the stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/wsq/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### 9) Set the correct stop words and Tokenize them (0.5 points)

In [15]:
# Set the correct stop words 
stop_words = set(stopwords.words('english'))

# Tokenize stop words and store them in a stop_words list
stop_words = [tokenize(s)[0] for s in stop_words]

In [16]:
type(stop_words)

list

In [17]:
print(stop_words)

['each', 'him', 'himself', 'herself', 'you', 'aren', 'hasn', 'into', 'needn', 'your', 'it', 'these', 'shouldn', 'ourselv', 'is', 'your', 'themselv', 'by', 'veri', 'd', 'doe', 'abov', 'shouldn', 'in', 'yourselv', 'won', 'onli', 'haven', 'me', 'dure', 'do', 'at', 'itself', 'which', 'other', 'that', 'weren', 'do', 'should', 'out', 'some', 'o', 'after', 'nor', 'then', 'than', 'haven', 'm', 'for', 'won', 'or', 'doesn', 'hadn', 'her', 'this', 'weren', 'just', 'whi', 'off', 'was', 'as', 'befor', 'wouldn', 'hadn', 's', 'it', 'where', 'no', 'up', 'be', 'she', 'onc', 'now', 'when', 'them', 'had', 'from', 'that', 'shan', 'down', 'my', 'a', 'below', 'needn', 'but', 'on', 'wasn', 'been', 'further', 'an', 'not', 'what', 'mightn', 'did', 'doesn', 'the', 'own', 've', 'it', 'they', 'to', 'with', 'hasn', 'am', 'shan', 'her', 'here', 'couldn', 'most', 'wasn', 'too', 'ain', 'while', 'over', 'wouldn', 'and', 'she', 'against', 'have', 'are', 'you', 'be', 'myself', 'mightn', 'ani', 'will', 'y', 'there', 'if'
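
The list above contains repeats such as `'it'` and `'shouldn'` because `tokenize(s)[0]` keeps only the first piece of each stop word, so contractions collapse onto their stem. A rough sketch of that effect without the stemmer, using a few entries of the kind found in the NLTK list (the short list here is an illustrative assumption), plus one way to drop the duplicates:

```python
import string

translator = str.maketrans(string.punctuation, " " * len(string.punctuation))

# A few entries like those in NLTK's English stop word list
raw_stop_words = ["it", "it's", "aren't", "aren", "the"]

# Keep only the first piece of each word after punctuation becomes a space
first_tokens = [w.translate(translator).split()[0] for w in raw_stop_words]

# dict.fromkeys drops duplicates while keeping first-seen order
unique_tokens = list(dict.fromkeys(first_tokens))
print(first_tokens, unique_tokens)
```

The duplicates are harmless for the vectorizer, so removing them is optional.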

### 10) Create a list of other stopwords that you think might be useful to remove in order to create more meaningful topics (0.5 points)


In [18]:
stop = stop_words + ['checkpoint','emphas','cultur','applic','report','provid','use','studi', 'research','program','abstract', 'project', 'e', 'g','propos','such','intervent','implement' ]
full_stopwords = [tokenize(s)[0] for s in stop]

### 11) Create the `vectorizer` object with the following specifications (0.5 points)
    - unit of features are single words rather than characters
    - function to create tokens
    - allow for bigrams
    - remove accent characters
    - remove stopwords
    - only include words with minimum frequency of 0.05
    - only include words with maximum frequency of 0.95

In [19]:
vectorizer = CountVectorizer(analyzer="word",        # unit of features are single words rather than characters
                            tokenizer=tokenize,      # function to create tokens
                            ngram_range=(1,2),       # unigrams and bigrams (one- and two-word terms)
                            strip_accents='unicode', # remove accent characters
                            stop_words = stop_words, # remove stopwords
                            min_df = 0.05,           # only include words with minimum frequency of 0.05
                            max_df = 0.95)           # only include words with maximum frequency of 0.95
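
When `min_df` and `max_df` are floats, they are read as proportions of documents, not raw counts: a term is kept only if the share of abstracts containing it falls between the two thresholds. A pure-Python sketch of that filter on a made-up, already tokenized corpus (the thresholds are loosened here because the toy corpus has only four documents):

```python
# Toy corpus of already tokenized, stemmed documents (made up)
docs = [
    ["cancer", "cell", "tumor"],
    ["cancer", "immun", "cell"],
    ["cancer", "health", "care"],
    ["cancer", "cell", "drug"],
]
min_df, max_df = 0.5, 0.95  # the notebook uses 0.05 and 0.95

n_docs = len(docs)
vocabulary = sorted({tok for doc in docs for tok in doc})

# Document frequency: share of documents in which the term appears at least once
doc_freq = {t: sum(t in doc for doc in docs) / n_docs for t in vocabulary}

# Keep terms that are neither too rare nor too common
kept = [t for t in vocabulary if min_df <= doc_freq[t] <= max_df]
print(doc_freq, kept)
```

Here `cancer` appears in every document and is dropped by `max_df`, much as a word like "cancer" could be filtered out of NCI abstracts if it appeared in more than 95% of them.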

### 12) Create a bag of words/bigrams by fitting the vectorizer to your abstract list (0.5 points)

In [20]:
# Creating bag of words
bag_of_words = vectorizer.fit_transform(abstracts_list)  # transform our corpus as a bag of words
features = vectorizer.get_feature_names_out()            # get features (words); get_feature_names() was removed in scikit-learn 1.2



In [21]:
type(bag_of_words)

scipy.sparse._csr.csr_matrix

In [22]:
print(bag_of_words)

  (0, 541)	1
  (0, 666)	1
  (0, 20)	1
  (0, 370)	1
  (0, 107)	18
  (0, 236)	1
  (0, 612)	11
  (0, 68)	1
  (0, 596)	2
  (0, 640)	1
  (0, 366)	1
  (0, 332)	1
  (0, 83)	1
  (0, 708)	1
  (0, 614)	1
  (0, 234)	1
  (0, 208)	2
  (0, 530)	4
  (0, 521)	2
  (0, 492)	1
  (0, 589)	5
  (0, 57)	7
  (0, 323)	1
  (0, 153)	1
  (0, 199)	2
  :	:
  (12389, 35)	1
  (12389, 166)	4
  (12389, 439)	1
  (12389, 706)	1
  (12389, 371)	1
  (12389, 412)	3
  (12389, 60)	1
  (12389, 43)	1
  (12389, 321)	1
  (12389, 58)	1
  (12389, 466)	1
  (12389, 560)	1
  (12389, 299)	1
  (12389, 747)	2
  (12389, 190)	1
  (12389, 524)	5
  (12389, 31)	1
  (12389, 253)	1
  (12389, 213)	2
  (12389, 16)	1
  (12389, 398)	1
  (12389, 143)	1
  (12389, 74)	1
  (12389, 648)	1
  (12389, 78)	1


### 13) Use TfidfTransformer to re-weight your bag of words (0.5 points)

In [23]:
# Use TfidfTransformer to re-weight bag of words
transformer = TfidfTransformer(norm = None, smooth_idf = True, sublinear_tf = True)
tfidf = transformer.fit_transform(bag_of_words)
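
With these settings each count is re-weighted as `(1 + ln(count)) * (ln((1 + n) / (1 + df)) + 1)`, where `n` is the number of documents and `df` is the number of documents containing the term; `norm = None` means the rows are not rescaled afterwards. The sketch below writes that formula out for a single term (the function name `tfidf_weight` is made up for illustration):

```python
import math

def tfidf_weight(count, doc_freq, n_docs):
    """Weight of one term in one document with smooth_idf=True,
    sublinear_tf=True and norm=None."""
    if count == 0:
        return 0.0                                          # absent terms stay at zero
    tf = 1.0 + math.log(count)                              # sublinear term frequency
    idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0     # smoothed inverse document frequency
    return tf * idf

# Same count, but the rarer term (found in 1 of 4 documents) gets more weight
rare = tfidf_weight(3, doc_freq=1, n_docs=4)
common = tfidf_weight(3, doc_freq=3, n_docs=4)
print(rare, common)
```

A term that appears once in a document and occurs in every document gets a weight of exactly 1, so the re-weighting never pushes a term that is present down to zero.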

### 14) Fit your transformed data to an LDA model and store your results (0.5 points)

In [24]:
# Fitting LDA model to your data
# We set n_components = 10 to produce 10 topics
lda = LatentDirichletAllocation(n_components = 10, learning_method='online')

# store your results 
doctopic = lda.fit_transform(tfidf)
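
LDA starts from a random initialization, so the topics and their order can change every time the notebook is rerun. Passing `random_state` is one way to make a run repeatable. A minimal sketch on a made-up count matrix (the seed value 42 and the toy data are arbitrary assumptions):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Made-up count matrix: 20 documents x 12 terms
rng = np.random.default_rng(0)
counts = rng.integers(0, 5, size=(20, 12))

def fit_topics(seed):
    """Fit a 3-topic LDA model with a fixed seed and return the document-topic matrix."""
    lda = LatentDirichletAllocation(n_components=3, learning_method="online",
                                    random_state=seed)
    return lda.fit_transform(counts)

run_a = fit_topics(42)
run_b = fit_topics(42)

# With the same seed, the two runs give the same document-topic weights
same = np.allclose(run_a, run_b)
print(same, run_a.shape)
```

Each row of the returned matrix sums to 1, which is why the values can be read as the share of each topic in a document.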

### 15) Pull out the top keywords in each topic (0.5 points)

In [25]:
# Pull out the top 8 keywords in each topic
ls_keywords = []
for i, topic in enumerate(lda.components_):
    word_idx = np.argsort(topic)[::-1][:8]               # indices of the 8 largest weights
    keywords = ', '.join(features[j] for j in word_idx)  # j avoids reusing the topic index i
    ls_keywords.append(keywords)
    print(i, keywords)

0 imag, technolog, method, tissu, analysi, use, quantit, sampl
1 research, program, fund, train, career, member, cancer research, center
2 patient, breast, breast cancer, biomark, tumor, predict, lung, clinic
3 health, intervent, care, dispar, communiti, implement, outcom, cancer
4 cell, genom, singl cell, gene, understand, biolog, tumor, singl
5 regul, cell, signal, role, mechan, express, protein, pathway
6 risk, women, infect, associ, studi, age, e, relat
7 immun, tumor, anti, immunotherapi, therapi, cell, respons, efficaci
8 core, administr, support, resourc, share, servic, manag, provid
9 inhibitor, drug, resist, prostat, target, leukemia, prostat cancer, molecul
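
The key step is `np.argsort(topic)[::-1][:8]`: `argsort` returns the positions that would sort the weights in ascending order, `[::-1]` reverses them, and the slice keeps the largest ones. A small sketch with made-up weights and a top 2 instead of a top 8:

```python
import numpy as np

# Made-up vocabulary and one topic's weights over it
features = np.array(["cell", "drug", "health", "tumor"])
topic = np.array([0.9, 0.1, 0.3, 0.7])

# Positions of the two largest weights, largest first
top_idx = np.argsort(topic)[::-1][:2]

# Look up the matching terms and join them into one label
keywords = ", ".join(features[j] for j in top_idx)
print(top_idx, keywords)
```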


### 16) Save the results as a data frame with the top words for each topic as the column names (0.5 points)

In [26]:
# Save the topics to a dataframe
Topics = pd.DataFrame(ls_keywords)

# Rename the column Topics
Topics.rename(columns={0:'Topics'}, inplace = True)

# View the dataframe
Topics

Unnamed: 0,Topics
0,"imag, technolog, method, tissu, analysi, use, ..."
1,"research, program, fund, train, career, member..."
2,"patient, breast, breast cancer, biomark, tumor..."
3,"health, intervent, care, dispar, communiti, im..."
4,"cell, genom, singl cell, gene, understand, bio..."
5,"regul, cell, signal, role, mechan, express, pr..."
6,"risk, women, infect, associ, studi, age, e, relat"
7,"immun, tumor, anti, immunotherapi, therapi, ce..."
8,"core, administr, support, resourc, share, serv..."
9,"inhibitor, drug, resist, prostat, target, leuk..."


In [27]:
# Save the results as a data frame with the top words for each topic as the column names 
topics_doc = pd.DataFrame(doctopic, columns = ls_keywords)

# View the dataframe
topics_doc

Unnamed: 0,"imag, technolog, method, tissu, analysi, use, quantit, sampl","research, program, fund, train, career, member, cancer research, center","patient, breast, breast cancer, biomark, tumor, predict, lung, clinic","health, intervent, care, dispar, communiti, implement, outcom, cancer","cell, genom, singl cell, gene, understand, biolog, tumor, singl","regul, cell, signal, role, mechan, express, protein, pathway","risk, women, infect, associ, studi, age, e, relat","immun, tumor, anti, immunotherapi, therapi, cell, respons, efficaci","core, administr, support, resourc, share, servic, manag, provid","inhibitor, drug, resist, prostat, target, leukemia, prostat cancer, molecul"
0,0.094005,0.000216,0.000216,0.528912,0.000216,0.000216,0.353162,0.000216,0.022623,0.000216
1,0.000234,0.088802,0.000234,0.000234,0.096908,0.000234,0.025144,0.724016,0.000234,0.063962
2,0.094845,0.000254,0.000254,0.000254,0.000254,0.244047,0.000254,0.435670,0.000254,0.223914
3,0.000445,0.652497,0.000445,0.000445,0.000445,0.000445,0.000445,0.073519,0.270870,0.000445
4,0.362122,0.138811,0.087666,0.000221,0.063087,0.000221,0.018844,0.000221,0.328585,0.000221
...,...,...,...,...,...,...,...,...,...,...
12385,0.033939,0.000290,0.000290,0.000290,0.184027,0.603770,0.128300,0.000290,0.000290,0.048514
12386,0.000215,0.000215,0.064937,0.000215,0.000215,0.265210,0.056512,0.612050,0.000215,0.000215
12387,0.181798,0.001075,0.001075,0.001075,0.001075,0.751026,0.001075,0.059650,0.001075,0.001075
12388,0.075344,0.000700,0.000700,0.554045,0.000700,0.000700,0.000700,0.000700,0.365714,0.000700


### 17) Join together the dataframe from 16 with your abstracts dataframe (filtered for the IC) (0.5 points)

In [28]:
# Reset the display options
pd.reset_option('^display.', silent=True)

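# join together the topics_doc with the nci_abstracts dataframe
# reset the index first: dropna left gaps in nci_abstracts' index, and concat aligns rows on index labels
topics_project = pd.concat([topics_doc, nci_abstracts.reset_index(drop=True)], axis=1)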

# View the first 5 observations 
topics_project.head(5)

Unnamed: 0,"imag, technolog, method, tissu, analysi, use, quantit, sampl","research, program, fund, train, career, member, cancer research, center","patient, breast, breast cancer, biomark, tumor, predict, lung, clinic","health, intervent, care, dispar, communiti, implement, outcom, cancer","cell, genom, singl cell, gene, understand, biolog, tumor, singl","regul, cell, signal, role, mechan, express, protein, pathway","risk, women, infect, associ, studi, age, e, relat","immun, tumor, anti, immunotherapi, therapi, cell, respons, efficaci","core, administr, support, resourc, share, servic, manag, provid","inhibitor, drug, resist, prostat, target, leukemia, prostat cancer, molecul",index,APPLICATION_ID,IC_NAME,TOTAL_COST,ABSTRACT_TEXT
0,0.094005,0.000216,0.000216,0.528912,0.000216,0.000216,0.353162,0.000216,0.022623,0.000216,18.0,10406126.0,NATIONAL CANCER INSTITUTE,219626.0,Project Summary/Abstract Title: Caribbean Inve...
1,0.000234,0.088802,0.000234,0.000234,0.096908,0.000234,0.025144,0.724016,0.000234,0.063962,29.0,10246936.0,NATIONAL CANCER INSTITUTE,636455.0,ABSTRACT Glioblastoma multiforme (GBM) is the ...
2,0.094845,0.000254,0.000254,0.000254,0.000254,0.244047,0.000254,0.43567,0.000254,0.223914,31.0,10131000.0,NATIONAL CANCER INSTITUTE,316405.0,Gene silencing using small interfering RNA (si...
3,0.000445,0.652497,0.000445,0.000445,0.000445,0.000445,0.000445,0.073519,0.27087,0.000445,35.0,10433611.0,NATIONAL CANCER INSTITUTE,35000.0,Project Summary/Abstract This application is b...
4,0.362122,0.138811,0.087666,0.000221,0.063087,0.000221,0.018844,0.000221,0.328585,0.000221,38.0,10118122.0,NATIONAL CANCER INSTITUTE,,ABSTRACT ? BIOREPOSITORY & PRECISION PATHOLOGY...
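
`pd.concat(..., axis=1)` matches rows by index label, not by position. A frame that has had rows dropped keeps its old labels, so it only lines up with a freshly built 0, 1, 2, ... frame once its index is reset. A sketch with made-up data (column names and values are hypothetical):

```python
import pandas as pd

# Scores indexed 0..2, like a frame built from a NumPy array
scores = pd.DataFrame({"topic_a": [0.9, 0.2, 0.6]})

# Text frame whose index has a gap after one row was dropped: labels 0, 2, 3
texts = pd.DataFrame({"ABSTRACT_TEXT": ["a", None, "c", "d"]}).dropna()

# Aligning on labels pairs the wrong rows and introduces NaN
misaligned = pd.concat([scores, texts], axis=1)

# Resetting the index makes the rows pair up by position
aligned = pd.concat([scores, texts.reset_index(drop=True)], axis=1)
print(misaligned.shape, aligned.shape)
```

The introduced NaN values are also what turns integer columns into floats, as with `index` and `APPLICATION_ID` showing up as `18.0` and `10406126.0`.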


### 19) Choose one topic per document with the highest score (0.5 points)

In [29]:
# Idxmax function and axis=1: return the column name of the max value in a row
topics_doc.idxmax(axis=1)

0        health, intervent, care, dispar, communiti, im...
1        immun, tumor, anti, immunotherapi, therapi, ce...
2        immun, tumor, anti, immunotherapi, therapi, ce...
3        research, program, fund, train, career, member...
4        imag, technolog, method, tissu, analysi, use, ...
                               ...                        
12385    regul, cell, signal, role, mechan, express, pr...
12386    immun, tumor, anti, immunotherapi, therapi, ce...
12387    regul, cell, signal, role, mechan, express, pr...
12388    health, intervent, care, dispar, communiti, im...
12389    core, administr, support, resourc, share, serv...
Length: 12390, dtype: object
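
`idxmax(axis=1)` returns the label of the winning column, while `max(axis=1)` returns the winning score itself, which helps to tell a clear winner from a near tie. A small sketch with made-up scores and shortened topic labels:

```python
import pandas as pd

# Made-up document-topic scores with shortened column labels
scores = pd.DataFrame({"cell, tumor": [0.7, 0.1],
                       "health, care": [0.2, 0.8],
                       "drug, resist": [0.1, 0.1]})

# Column label of the largest score in each row
dominant = scores.idxmax(axis=1)

# The largest score itself, as a rough measure of how dominant the topic is
strength = scores.max(axis=1)
print(dominant.tolist(), strength.tolist())
```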

In [30]:
# Set the display option to see more of the abstract
pd.set_option('display.max_colwidth', 200)

# join the abstract with the topic with the greatest weight
# reset the index of nci_abstracts so rows line up with topics_doc, then rename the first column topic
topics_project_max = pd.concat([topics_doc.idxmax(axis=1), nci_abstracts.reset_index(drop=True)], axis=1).rename(columns={0:'topic'})

topics_project_max.head()

Unnamed: 0,topic,index,APPLICATION_ID,IC_NAME,TOTAL_COST,ABSTRACT_TEXT
0,"health, intervent, care, dispar, communiti, implement, outcom, cancer",18.0,10406126.0,NATIONAL CANCER INSTITUTE,219626.0,Project Summary/Abstract Title: Caribbean Investigation of Cancer Stigma and its effect on Cervical Cancer Screening and HPV Vaccination This application is being submitted in response to the Noti...
1,"immun, tumor, anti, immunotherapi, therapi, cell, respons, efficaci",29.0,10246936.0,NATIONAL CANCER INSTITUTE,636455.0,"ABSTRACT Glioblastoma multiforme (GBM) is the most lethal primary brain cancer, with standard treatments based on surgery, radiotherapy, and chemotherapy promoting an overall survival of approxima..."
2,"immun, tumor, anti, immunotherapi, therapi, cell, respons, efficaci",31.0,10131000.0,NATIONAL CANCER INSTITUTE,316405.0,"Gene silencing using small interfering RNA (siRNA) is a viable therapeutic approach but, limited in translation due to lack of effective delivery systems. Developing effective and non-toxic delive..."
3,"research, program, fund, train, career, member, cancer research, center",35.0,10433611.0,NATIONAL CANCER INSTITUTE,35000.0,Project Summary/Abstract This application is being submitted to continue accrual participation in the Early Drug Development Opportunity (EDDOP). The Rogel Cancer Center first received National Ca...
4,"imag, technolog, method, tissu, analysi, use, quantit, sampl",38.0,10118122.0,NATIONAL CANCER INSTITUTE,,ABSTRACT ? BIOREPOSITORY & PRECISION PATHOLOGY CENTER SHARED RESOURCE The BioRepository & Precision Pathology Center (BRPC) provides patient tissue and blood-based research support for the Duke C...
