Alexjmsherman/nlp_practicum_cohort1

Machine Learning Guild - NLP Practicum

1. Configuration

  • Topics: course overview, git bash, python config.ini files, conda virtual environments
  • Technology: git bash, configparser, conda
  • Homework: use the command line to search for data across thousands of server configuration files
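
As a rough illustration of the config.ini topic above, here is a minimal sketch that reads settings with the standard-library configparser module. The section names, keys, and the load_settings helper are hypothetical and are not taken from the course materials; the INI text is inlined so the example runs without a file on disk.

```python
import configparser

# Hypothetical config.ini contents, inlined so the example is self-contained.
SAMPLE_INI = """
[database]
host = localhost
port = 3306

[paths]
data_dir = ./data
"""


def load_settings(ini_text):
    """Parse INI text and return the settings as a plain nested dict."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    # parser.sections() lists the named sections (the DEFAULT section is excluded).
    return {section: dict(parser[section]) for section in parser.sections()}


settings = load_settings(SAMPLE_INI)
# configparser returns every value as a string, so convert explicitly.
db_port = int(settings["database"]["port"])
```

For a real project, parser.read("config.ini") would load the file from disk instead of the inlined string.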

2. Automation

  • Topics: automate the process to collect data from https://www.annualreports.com
  • Technology: requests, Jupyter Notebooks, BeautifulSoup, Scrapy
  • Homework: automate the process to identify and download company 10-K annual reports

3. Databases

  • Topics: use sqlalchemy to create and populate a database, locally and on AWS
  • Technology: sqlalchemy, SQLite, AWS RDS (MySQL)
  • Homework: create and populate a database with sqlalchemy

4. Text Extraction

  • Topics: use docx to extract text from Microsoft Word documents; discuss the PyCharm debugger
  • Technology: docx, pdfminer.six, subprocess, PyCharm
  • Homework: structure the annual reports into sections

5. AWS Data Processing

  • Topics: refactor the automation homework, use task scheduler to automate the script locally, discuss AWS technologies to automate the script in the cloud
  • Technology: Python, __init__.py, Spyder, AWS S3, AWS Lambda, AWS DynamoDB, AWS CloudWatch

6. Text Preprocessing

  • Topics: lemmatization, POS tagging, dependency parsing, rule-based matching
  • Technology: spaCy

7. Phrase (collocation) Detection

  • Topics: acronyms, POS phrases, phrase detection
  • Technology: spaCy, gensim

8. Text Vectorization (count-based methods)

  • Topics: vector space model, TF-IDF, BM25, co-occurrence matrix
  • Technology: scikit-learn
  • Homework: clean text from annual reports
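
To make the TF-IDF topic above concrete, here is a minimal pure-Python sketch. It assumes one common textbook weighting, term frequency (count divided by document length) times log(N / document frequency); scikit-learn's TfidfVectorizer uses a smoothed variant by default, so its numbers will differ. The tf_idf function and the sample sentences are illustrative only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting: (count / doc length) * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical sentences standing in for annual report text.
docs = [
    "revenue increased this year".split(),
    "revenue decreased this year".split(),
    "the board approved a dividend".split(),
]
weights = tf_idf(docs)
```

In this toy corpus, "increased" appears in only one document, so it receives a higher weight in the first document than "revenue", which appears in two.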

9. Object Oriented Python

  • Topics: reconstruct scikit-learn's CountVectorizer codebase
  • Technology: scikit-learn, object oriented Python
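
The exercise above could start from something like the following toy class. This is a sketch under stated assumptions, not scikit-learn's implementation: SimpleCountVectorizer is a hypothetical name, it returns dense Python lists rather than a sparse matrix, and it only mirrors the fit / transform / fit_transform interface and the vocabulary_ attribute naming.

```python
import re
from collections import Counter


class SimpleCountVectorizer:
    """Toy bag-of-words vectorizer with a fit/transform interface."""

    def __init__(self, lowercase=True):
        self.lowercase = lowercase
        self.vocabulary_ = {}  # term -> column index, filled in by fit()

    def _tokenize(self, text):
        if self.lowercase:
            text = text.lower()
        # Keep tokens made of two or more word characters.
        return re.findall(r"\b\w\w+\b", text)

    def fit(self, documents):
        """Learn the vocabulary, assigning column indices in sorted order."""
        terms = sorted({tok for doc in documents for tok in self._tokenize(doc)})
        self.vocabulary_ = {term: idx for idx, term in enumerate(terms)}
        return self

    def transform(self, documents):
        """Return one row of term counts per document (dense lists)."""
        rows = []
        for doc in documents:
            row = [0] * len(self.vocabulary_)
            for term, count in Counter(self._tokenize(doc)).items():
                idx = self.vocabulary_.get(term)
                if idx is not None:  # terms unseen during fit are ignored
                    row[idx] = count
            rows.append(row)
        return rows

    def fit_transform(self, documents):
        return self.fit(documents).transform(documents)


vectorizer = SimpleCountVectorizer()
matrix = vectorizer.fit_transform([
    "Net income rose",
    "Net income fell, income tax rose",
])
```

Keeping the learned state in vocabulary_ and returning self from fit is what lets fit_transform chain the two steps.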

10. Word Embeddings

  • Topics: PCA, latent semantic indexing (LSI), latent Dirichlet allocation (LDA), topic coherence metrics, and Word2Vec
  • Technology: scikit-learn, gensim
  • Homework: read TamingTextwiththeSVD (ftp://ftp.sas.com/techsup/download/EMiner/) and create topic models for annual report sections

11. Text Similarity

  • Topics: cosine similarity, distance metrics, L1 and L2 norms, recommendation engines
  • Technology: scikit-learn, spaCy, gensim
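
The norms and cosine similarity listed above can be written out in a few lines of plain Python. This is a minimal sketch for intuition; the function names are illustrative, and returning 0.0 for a zero vector is a convention chosen here, not a mathematical requirement.

```python
import math


def l1_norm(vector):
    """Sum of absolute values (Manhattan length)."""
    return sum(abs(x) for x in vector)


def l2_norm(vector):
    """Square root of the sum of squares (Euclidean length)."""
    return math.sqrt(sum(x * x for x in vector))


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their L2 norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    denominator = l2_norm(a) * l2_norm(b)
    if denominator == 0:
        return 0.0  # assumed convention for zero vectors
    return sum(x * y for x, y in zip(a, b)) / denominator
```

Because cosine similarity divides by both lengths, two documents with the same word proportions score close to 1 even when one is much longer than the other.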

12. Document Classification

  • Topics: TBD
  • Technology: scikit-learn

13. Pipelines and Custom Transformers

  • Topics: capture, format, and send logging messages to a variety of outputs; exception handling; create an executable from a Python package for deployment
  • Technology: scikit-learn, logging, python exceptions, pyinstaller, argparse
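
The logging, exception handling, and argparse topics above can be sketched together in one small command-line script. Everything named here (build_logger, count_words, the "practicum" logger name, the path argument) is hypothetical and chosen for illustration; only standard-library calls are used.

```python
import argparse
import logging
import sys


def build_logger(name="practicum", level=logging.INFO, stream=None):
    """Create a logger that formats messages and writes them to a stream."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeat calls
        handler = logging.StreamHandler(stream or sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


def parse_args(argv=None):
    """Parse command-line arguments; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser(description="Count words in a text file.")
    parser.add_argument("path", help="text file to read")
    parser.add_argument("--verbose", action="store_true", help="log at DEBUG level")
    return parser.parse_args(argv)


def count_words(path, logger):
    """Return the word count, or None if the file cannot be read."""
    try:
        with open(path, encoding="utf-8") as handle:
            return len(handle.read().split())
    except OSError:
        # logger.exception records the message together with the traceback.
        logger.exception("Could not read %s", path)
        return None


def main(argv=None):
    args = parse_args(argv)
    logger = build_logger(level=logging.DEBUG if args.verbose else logging.INFO)
    total = count_words(args.path, logger)
    if total is None:
        return 1  # non-zero exit code signals failure to the caller
    logger.info("%s contains %d words", args.path, total)
    return 0
```

Calling main(["report.txt", "--verbose"]) runs the script against a file named report.txt. Returning an exit code from main, rather than exiting inside it, keeps the function easy to call from tests or from a packaged entry point.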