<a href="https://colab.research.google.com/github/DigitalHugManitees/DH_Topic_Workshop/blob/main/LDA_with_ngrams_on_Colab_v16.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LDA using spaCy and Gensim v 16
## with n-grams for more accurate topics

- [x] set working directory in Google Drive
- [x] install libraries and dependencies
- [ ] gather all files in pwd as df
- [ ] run through df
- [ ] assign topics to files in df
- [ ] export csv for IR?



### Limitations
- [ ] results still need a lot of human interpretation - many generic scientific-writing terms remain in the topics
- [ ] data needs to be a corpus, so is one newspaper a corpus of topics? Or a bunch of newspapers? How do you define your corpus?
- [ ] how good is LDA compared with other methods? 

### development
- [ ] is it possible to automate dividing up a newspaper/gazette by its articles, so that one issue may be a corpus of many articles? 
- [ ] or do we work on a larger scale and years of Gazette are lumped together with many years forming a corpus?


### Source:
Anika Nissen (University of Duisburg-Essen (UDE)) and Dominic Rosati (scite.ai) 2022

### References:
- [ ] need to add any references here

## Step 1: Mount your Google Drive to Colab so that files can be saved. 
<br>
This will create a working directory that you can place files in for analyzing, and then download output files.

In [None]:
#mount google drive here
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

import os 

# Set your working directory to a folder in your Google Drive. 
# the base Google Drive directory
root_dir = "/content/drive/My Drive/"

# choose where you want your project files to be saved AND
# what you want the folder to be called. This is your working directory
project_folder = "Colab Notebooks/LDA_Project_Folder/"

def create_and_set_working_directory(project_folder):
  # check if your project folder exists. if not, it will be created.
  if not os.path.isdir(root_dir + project_folder):
    os.makedirs(root_dir + project_folder)  # makedirs also creates any missing parent folders
    print(root_dir + project_folder + ' did not exist but was created.')

  # change the OS to use your project folder as the working directory
  os.chdir(root_dir + project_folder)

  # create a test file to make sure it shows up in the right place
  !touch 'new_file_in_working_directory.txt'
  print('\nYour working directory was changed to ' + root_dir + project_folder + \
        "\n\nAn empty text file was created there. You can also run !pwd to confirm the current working directory." )

create_and_set_working_directory(project_folder)

#source: https://robertbrucecarter.com/writing/2020/06/setting-your-working-directory-to-google-drive-in-a-colab-notebook/

Mounted at /content/drive
/content/drive/My Drive/Colab Notebooks/LDA_Project_Folder/ did not exist but was created.

Your working directory was changed to /content/drive/My Drive/Colab Notebooks/LDA_Project_Folder/

An empty text file was created there. You can also run !pwd to confirm the current working directory.


In [None]:
# You can check to make sure your working directory is correct!
! pwd

/content/drive/My Drive/Colab Notebooks/LDA_Project_Folder


### Now, you have a working directory!
 Go back to GitHub (https://github.com/poppy-nicolette/Digital_Huge_Manitees.git) and download the corpus file.
 <br>Then, go to your Google Drive and locate the LDA_Project_Folder you created in the cells above.
 <br>Move the corpus file to this Google folder (your working directory).
<br>You will repeat this process with every corpus you want to analyze.

## Step 2: Install libraries and dependencies
Then import them. 

In [None]:
# Install libraries and dependencies
!pip install pyLDAvis -qq 
!pip install -qq -U gensim
!pip install spacy -qq
!pip install matplotlib -qq
!pip install seaborn -qq
!python -m spacy download en_core_web_md -qq
!pip install fsspec

In [None]:
import warnings
warnings.filterwarnings('ignore') # this ignores warnings
# Import
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import spacy
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()# Visualise inside a notebook
import en_core_web_md
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaMulticore
from gensim.models import CoherenceModel
import itertools

pd.set_option('display.max_columns', None) # this allows you see all columns in pandas



### Run the next cell, **if** you find that you need to change the processor. 
**Otherwise, skip it.** <br>
In the menu, under Runtime -> Change runtime type, you can select a different processor depending on your needs. If the CPU is taking too long, switch to a GPU. Keep in mind that heavy GPU use may require you to upgrade your account.

In [None]:
from tensorflow.python.client import device_lib
device_lib.list_local_devices()

[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 16117904028319291522
 xla_global_id: -1]

## Step 3: Import your data and load the spaCy model
Your file will be the entire corpus you want to analyze.
- [ ] need to determine best format for the data considering we're getting .txt files from the OCR notebook
- [ ] example: format our .csv with the columns `key`, `publication_year`, and `content`

In [None]:
"""
consider putting a glob here to collect all .txt files into a df. 
The df should have the original file name (preserved in the .txt file name) and the .txt string contents. two columns. 
add a third column later for the cluster # so that topics can be sorted and retrieved.
"""
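A minimal sketch of the glob idea noted above, assuming the OCR notebook leaves plain UTF-8 `.txt` files in a single folder. The function name `collect_txt_files` and the keys `file_name` and `content` are illustrative choices, not part of the original notebook.

```python
import glob
import os

def collect_txt_files(folder):
    """Return one record per .txt file: its original file name and its text."""
    records = []
    # sorted() keeps the row order stable between runs
    for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
        # errors="replace" is an assumption: it keeps going if OCR output has odd bytes
        with open(path, encoding="utf-8", errors="replace") as handle:
            records.append({
                "file_name": os.path.basename(path),
                "content": handle.read(),
            })
    return records

# Hypothetical usage: gather every .txt file in the current working directory
records = collect_txt_files(".")
print(len(records), "text files found")
```

Passing the result to `pd.DataFrame(records)` should give the two-column table described above; the third column for the cluster number can be added once the model has run.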


In [None]:
# Read the data
path = '/content/drive/MyDrive/Colab Notebooks/LDA_Project_Folder/TE SLR Corpus.csv'
reports = pd.read_csv(path)
reports.head()
reports.info()

# Our spaCy model:
nlp = en_core_web_md.load() # this pipeline is used below to tokenize, tag, and lemmatize the text

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118 entries, 0 to 117
Data columns (total 87 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Key                   118 non-null    object 
 1   Item Type             118 non-null    object 
 2   Publication Year      117 non-null    float64
 3   Author                118 non-null    object 
 4   Title                 118 non-null    object 
 5   Publication Title     115 non-null    object 
 6   ISBN                  0 non-null      float64
 7   ISSN                  113 non-null    object 
 8   DOI                   110 non-null    object 
 9   Url                   112 non-null    object 
 10  summary               113 non-null    object 
 11  Date                  117 non-null    object 
 12  Date Added            118 non-null    object 
 13  Date Modified         118 non-null    object 
 14  Access Date           112 non-null    object 
 15  Pages                 1

### Convert all data in your content column to string data types.

In [None]:
# change this to match your content column name. 
reports['summary'] = reports['summary'].astype(str)
reports.head()
print(reports.dtypes)

Key                  object
Item Type            object
Publication Year    float64
Author               object
Title                object
                     ...   
Section             float64
Session             float64
Committee           float64
History             float64
Legislative Body    float64
Length: 87, dtype: object


## Step 4: Set the number of n_grams here
- [ ] reminder of what n_grams are and how these can be helpful. Where's a good place to start?

In [None]:
from nltk import ngrams

def compile_ngrams(text, number_of_n=3, include_unigrams=False):
  ngram_list = []
  # number_of_n controls the largest n we build ngrams for
  # 2 being bigrams, 3 being trigrams, etc.
  for n in range(number_of_n):
    if n == 0 and not include_unigrams:
      continue
    for ngram in ngrams(text.split(), n + 1):
      ngram_list.append(' '.join(ngram))
  return ngram_list

"""this is a test to make sure the function works"""
print(compile_ngrams("There was a cloud computing conference about big data and natural language processing"))

['There was', 'was a', 'a cloud', 'cloud computing', 'computing conference', 'conference about', 'about big', 'big data', 'data and', 'and natural', 'natural language', 'language processing', 'There was a', 'was a cloud', 'a cloud computing', 'cloud computing conference', 'computing conference about', 'conference about big', 'about big data', 'big data and', 'data and natural', 'and natural language', 'natural language processing']


## Step 5: Data cleaning, setting your controlled vocabulary, and tokenizing
- [ ] discuss author_assigned_keywords - these are terms that should be preserved together that may be contextual for your field, topic, etc. 
- [ ] this step also removes words that are not helpful for determining topics, such as pronouns, conjunctions, and punctuation.
- [ ] explore the dictionary - how might this be explorable? useful? critical?
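One way to start exploring the dictionary is to look at document frequency: in how many documents each token appears. The sketch below uses plain Python on toy token lists standing in for `reports['tokens']`. The helper names are hypothetical, and the thresholds follow the documented meaning of the `no_below` and `no_above` arguments that `filter_extremes` receives in Step 6 (`keep_n` is left out).

```python
from collections import Counter

def document_frequencies(token_lists):
    """Count, for each token, how many documents contain it at least once."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(set(tokens))  # set() so repeats within one document count once
    return counts

def tokens_within_thresholds(token_lists, no_below=5, no_above=0.5):
    """List tokens found in at least no_below documents and in at most
    no_above (a fraction) of all documents, most frequent first."""
    n_docs = len(token_lists)
    doc_freq = document_frequencies(token_lists)
    kept = [(tok, df) for tok, df in doc_freq.items()
            if df >= no_below and df <= no_above * n_docs]
    return sorted(kept, key=lambda item: (-item[1], item[0]))

# Toy token lists (made-up data) in the same shape as reports['tokens']
toy_tokens = [
    ["big data", "cloud computing"],
    ["big data"],
    ["big data", "self efficacy"],
    ["self efficacy"],
]
print(document_frequencies(toy_tokens).most_common())
print(tokens_within_thresholds(toy_tokens, no_below=2, no_above=0.5))
```

Tokens that appear in nearly every document or in only one or two are the ones the filter drops, so this view can hint at which bigrams are likely to survive into the model.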

In [None]:
# Tags I want to remove from the text
removal= ['ADV','PRON','CCONJ','PUNCT','PART','DET','ADP','SPACE', 'NUM', 'SYM']
tokens = []
# words I really care about that should certainly be in the dictionary
author_assigned_keywords = ["teaching effectiveness"]

for summary in nlp.pipe(reports['summary']):
   # build up tokens here
   # using the author's heuristics: lemmatize, lowercase, and drop unwanted tags, stop words, and non-alphabetic tokens
   unigrams = [token.lemma_.lower() for token in summary if token.pos_ not in removal and not token.is_stop and token.is_alpha]
   # using ngrams
   # the cleaned unigrams, joined back into one string, serve as the "paragraph"
   proj_tok = compile_ngrams(" ".join(unigrams), number_of_n=2, include_unigrams=False)
   # using "author assigned keywords" by checking if they are in the text
   proj_tok += [keyword for keyword in author_assigned_keywords if keyword in summary.text]
   tokens.append(proj_tok)

# Add tokens to new column
reports['tokens'] = tokens
reports['tokens']

# Create dictionary
# The Dictionary object from Gensim maps each token to its unique ID:
dictionary = Dictionary(reports['tokens'])
print(dictionary.token2id)

{'academic growth': 0, 'access modern': 1, 'accord property': 2, 'achievement advance': 3, 'achievement datum': 4, 'achievement evaluate': 5, 'achievement find': 6, 'achievement growth': 7, 'acquire valuable': 8, 'advance understanding': 9, 'affect teacher': 10, 'align grow': 11, 'analysis describe': 12, 'apparent relationship': 13, 'appear contribute': 14, 'article combine': 15, 'ask research': 16, 'assign public': 17, 'attend mounting': 18, 'attend place': 19, 'attention define': 20, 'background contexteducational': 21, 'base condition': 22, 'begin recognize': 23, 'body work': 24, 'broad conception': 25, 'build body': 26, 'build strong': 27, 'career intention': 28, 'career plan': 29, 'challenge pose': 30, 'challenge school': 31, 'characteristic condition': 32, 'characteristic school': 33, 'characteristic wide': 34, 'choose leave': 35, 'clean maintain': 36, 'combine statewide': 37, 'common school': 38, 'compare school': 39, 'conceive working': 40, 'conception context': 41, 'conclusion

## Step 6: Set number of topics and run the LDA analysis
Look for `num_topics =` in the cell below to change the number of topics.

This step also saves out an interactive .html file. Look for this in your working directory.
<br>
Coherence scores, computed by the commented-out loops in the cell below, may be useful in validating the number of topics chosen.
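To give a feel for what a coherence score measures, here is a toy, pure-Python sketch of the UMass idea: for each pair of a topic's top words, take the log of how often the two co-occur in documents (plus 1) relative to how often the higher-ranked word appears. This is an illustration only. It averages over pairs, and Gensim's `CoherenceModel` with `coherence='u_mass'` handles smoothing and other details differently, so the numbers will not match. The function name and toy data are assumptions.

```python
import math

def umass_coherence(top_words, documents):
    """Toy UMass-style coherence for one topic.

    top_words: the topic's words, most probable first.
    documents: list of token lists.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_count(*words):
        # number of documents that contain every word in `words`
        return sum(1 for d in doc_sets if all(w in d for w in words))

    total, pairs = 0.0, 0
    for m in range(1, len(top_words)):
        for l in range(m):
            higher, lower = top_words[l], top_words[m]
            d_higher = doc_count(higher)
            if d_higher == 0:
                continue  # skip words that never occur, to avoid dividing by zero
            total += math.log((doc_count(lower, higher) + 1) / d_higher)
            pairs += 1
    return total / pairs if pairs else 0.0

# Made-up documents: "a" and "b" always appear together, "c" and "d" never do
toy_docs = [["a", "b"], ["a", "b"], ["c"], ["d"]]
print(umass_coherence(["a", "b"], toy_docs))
print(umass_coherence(["c", "d"], toy_docs))
```

Words that tend to appear in the same documents push the score up, which is why a higher coherence is usually read as a more interpretable topic.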

In [None]:

# Filter dictionary
dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=1000)

# Create corpus
corpus = [dictionary.doc2bow(doc) for doc in reports['tokens']]

# LDA model building
# lda_model = LdaMulticore(corpus=corpus, id2word=dictionary, iterations=50, num_topics=10, workers = 4, passes=10)

# # Coherence score using u_mass:
# topics = []
# score = []
# for i in range(1,20,1):
#    lda_model = LdaMulticore(corpus=corpus, id2word=dictionary, iterations=10, num_topics=i, workers = 4, passes=10, random_state=100)
#    cm = CoherenceModel(model=lda_model, corpus=corpus, dictionary=dictionary, coherence='u_mass')
#    topics.append(i)
#    score.append(cm.get_coherence())
# _=plt.plot(topics, score)
# _=plt.xlabel('Number of Topics')
# _=plt.ylabel('Coherence Score')
# plt.show()

# # Coherence score using C_v:
# topics = []
# score = []
# for i in range(1,20,1):
#    lda_model = LdaMulticore(corpus=corpus, id2word=dictionary, iterations=10, num_topics=i, workers = 4, passes=10, random_state=100)
#    cm = CoherenceModel(model=lda_model, texts = reports['tokens'], corpus=corpus, dictionary=dictionary, coherence='c_v')
#    topics.append(i)
#    score.append(cm.get_coherence())
# _=plt.plot(topics, score)
# _=plt.xlabel('Number of Topics')
# _=plt.ylabel('Coherence Score')
# plt.show()

# Optimal model
lda_model = LdaMulticore(corpus=corpus, id2word=dictionary, iterations=100, num_topics=5, workers = 4, passes=100)

# Print topics
lda_model.print_topics(-1)

# Where does a text belong to
lda_model[corpus][0]
reports['summary'][0]

# Visualize topics
lda_display = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(lda_display)

# Save the report
pyLDAvis.save_html(lda_display, 'index.html')

## That's it! 
Go to your working directory in Google Drive and look for the index.html file. <br>
Download this and then open with a browser locally. You should see a visualization of words and the clusters. 
<br>
You can continue to analyze more in depth with the following steps. 


## Step 7: Print topics
- [ ] how might someone use this? 
- [ ] cut-n-paste into excel
- [ ] include in a report


In [None]:
lda_model.print_topics()

[(0,
  '0.162*"teaching effectiveness" + 0.151*"teacher education" + 0.104*"effect size" + 0.094*"student evaluation" + 0.058*"control group" + 0.048*"content knowledge" + 0.048*"effectiveness student" + 0.048*"teach effectiveness" + 0.037*"student teacher" + 0.036*"improve teaching"'),
 (1,
  '0.102*"medical school" + 0.089*"effective teaching" + 0.087*"medical education" + 0.062*"self efficacy" + 0.053*"teaching effectiveness" + 0.052*"teaching method" + 0.044*"c apa" + 0.044*"database record" + 0.044*"psycinfo database" + 0.044*"right reserve"'),
 (2,
  '0.205*"effective teacher" + 0.147*"student achievement" + 0.096*"teacher effectiveness" + 0.074*"quality teacher" + 0.052*"teacher school" + 0.044*"student teacher" + 0.038*"student performance" + 0.038*"teacher quality" + 0.038*"purpose study" + 0.036*"teacher student"'),
 (3,
  '0.392*"effective teaching" + 0.105*"teaching strategy" + 0.058*"future research" + 0.054*"self efficacy" + 0.049*"effectiveness research" + 0.047*"teachin

## Interpretive analysis:
This is where the weighting of latent topics comes in: for each paper/document, the model gives a probability that it fits within each topic.

### development:
- [ ] clean this up so there is a better way of exploring the lda_model object

In [None]:
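# lda_model[corpus][1] is a list of (topic_id, probability) pairs for document 1
doc_topics = lda_model[corpus][1]
# pick the pair with the largest probability rather than hard-coding a position
top_topic, top_prob = max(doc_topics, key=lambda pair: pair[1])
print('cluster ' + str(top_topic) + ' has the highest value')
print()
print(doc_topics)

cluster 3 has the highest value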

[(0, 0.040001195), (1, 0.04028713), (2, 0.040353164), (3, 0.83935714), (4, 0.040001404)]


## Interpretive analysis part 2: 
Load the original text to see how well you think the topics fit.

In [None]:
reports['summary'][1]

"Effective teaching skills consist of high levels of student engagement based on good classroom and time management skills; the ability to scaffold learning that is adapted to students' current levels of understanding; cognitively engaging students in higher-order thinking; and encouraging and supporting success. The research reported here suggests that in elementary classrooms, effective teaching skills are effective for all students, both with and without special education needs. Drawing on a research programme extending over nearly two decades, we make the case that effective inclusionary practices, and therefore overall effective teaching, depend in part on the beliefs of teachers about the nature of disability, and about their roles and responsibilities in working with students with special education needs. Elementary classroom teachers who believe students with special needs are their responsibility tend to be more effective overall with all of their students. We provide evidence

## Step X: put cluster back in the df and export csv
- [ ] add cluster column
- [ ] use lda_model[corpus][i] max value?
- [ ] fill column with lda_model[corpus][i] values for more detail
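A minimal sketch of the checklist above, assuming each document's result has the shape shown earlier for `lda_model[corpus][1]`: a list of `(topic_id, probability)` pairs. The helper names, the column names, and the toy data are hypothetical.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the largest probability."""
    return max(doc_topics, key=lambda pair: pair[1])

def topic_columns(all_doc_topics, num_topics):
    """Build one row per document: dominant cluster plus every topic probability."""
    rows = []
    for doc_topics in all_doc_topics:
        cluster, prob = dominant_topic(doc_topics)
        row = {"cluster": cluster, "cluster_probability": float(prob)}
        probs = dict(doc_topics)
        for topic_id in range(num_topics):
            # topics the model left out of the list are recorded as 0.0
            row["topic_" + str(topic_id)] = float(probs.get(topic_id, 0.0))
        rows.append(row)
    return rows

# Toy distributions (made-up data) in the same shape as lda_model[corpus][i]
toy_topics = [
    [(0, 0.04), (1, 0.04), (2, 0.04), (3, 0.84), (4, 0.04)],
    [(0, 0.7), (2, 0.3)],
]
for row in topic_columns(toy_topics, num_topics=5):
    print(row)
```

With the real model, something like `rows = topic_columns(lda_model[corpus], num_topics=5)` followed by `reports.join(pd.DataFrame(rows)).to_csv('reports_with_clusters.csv', index=False)` could write the result to the working directory; the output file name is an assumption.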