# Week 3 Session 1 activity

## Part 1

Please work through the "NLP Week 3.1-Lecture.ipynb" notebook from GitHub before returning here.

Then, **choose whether to attempt Parts 2/3 (applying LSA and/or LDA to movie reviews) or Part 4 (getting started with a web scraper)**. (We encourage you to attempt all parts before next week, but there will only be time for one of these during class.)


## Part 2

In this section, you'll be applying LSA to a new dataset: movie reviews from IMDB.

This section uses a data file constructed using code in [this GitHub project](https://github.com/rasbt/python-machine-learning-book-2nd-edition/blob/master/code/ch08/ch08.py).

**First, import the necessary libraries:**

In [None]:
import pandas as pd
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize.casual import casual_tokenize
# Older scikit-learn releases exposed this list via sklearn.feature_extraction.stop_words;
# that module is no longer importable in recent releases, so use the public name instead
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.metrics.pairwise import cosine_similarity as cosine
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.decomposition import TruncatedSVD
from nlpia.data.loaders import get_data

**Next, load the movie reviews.** (This requires movie_data.csv to be in the same directory as this notebook.)

In [None]:
data_from_csv = pd.read_csv('movie_data.csv', encoding='utf-8') #Reads a CSV file into a pandas dataframe
data_from_csv.head(3) #show us the first 3 rows

In [None]:
data_from_csv.shape #how many rows & columns?

In [None]:
#Let's rename the review column to "text" so we can easily access it later
df = data_from_csv.rename(columns={'0': 'text'})
df

**Now, let's apply LSA.**

Grab some code from the "NLP Week 3.1-Lecture.ipynb" notebook to do the following:

* Create count vectors for your corpus, using a tokenizer of your choice (e.g., Louis' custom tokenizer, or another one you think will suit this task)
* Create tf-idf vectors from the count vectors
* Centre the vectorized documents by subtracting the mean
* Use TruncatedSVD to compute the LSA topic vectors. Use 10 topics and 100 iterations to start with.
* Explore the results.
  * Do the topics seem to correspond to movie genres, to types of reviews, or anything else? 
  * Do the topic weights for individual reviews seem to make sense to you?
* Try changing the number of topics to fewer (e.g., 3?) or more (e.g., 20) and explore how this changes the results.
* Optional bonus: Can you write some code that allows you to construct a query for this corpus? It should use cosine similarity to find the movie review(s) that best match a text string you provide.
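
If you get stuck, the steps above can be sketched roughly as follows. This is only a minimal illustration on a tiny made-up corpus, not the lecture notebook's code: `toy_reviews`, the default scikit-learn tokenizer, the 2-topic setting, and the helper `best_match` are all assumptions chosen to keep the example small. For the real task you would pass in `df['text']`, your own tokenizer, and 10 topics.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for the 'text' column of the reviews dataframe
toy_reviews = [
    "A scary horror film with a terrifying monster",
    "The monster in this horror movie was not scary at all",
    "A sweet romantic comedy with a charming couple",
    "This romantic comedy about a couple made me laugh",
]

# Count vectors, then tf-idf vectors built from those counts
counter = CountVectorizer()  # default tokenizer; pass tokenizer=... to use your own
counts = counter.fit_transform(toy_reviews)
tfidf_transformer = TfidfTransformer()
tfidf = tfidf_transformer.fit_transform(counts).toarray()

# Centre the documents by subtracting the mean of each term column
term_means = tfidf.mean(axis=0)
centred = tfidf - term_means

# LSA topic vectors (2 topics for this toy corpus; the exercise suggests 10)
svd = TruncatedSVD(n_components=2, n_iter=100, random_state=0)
doc_topics = svd.fit_transform(centred)  # shape: (n_documents, n_topics)

# Show the terms with the largest absolute weight in each topic
terms = sorted(counter.vocabulary_, key=counter.vocabulary_.get)
for topic_idx, weights in enumerate(svd.components_):
    top_terms = [terms[i] for i in np.argsort(np.abs(weights))[::-1][:5]]
    print(f"topic {topic_idx}: {top_terms}")


def best_match(query):
    """Return the index of the closest review and all cosine similarities."""
    # Apply exactly the same transformations to the query as to the corpus
    query_tfidf = tfidf_transformer.transform(counter.transform([query])).toarray()
    query_topics = svd.transform(query_tfidf - term_means)
    sims = cosine_similarity(query_topics, doc_topics)[0]
    return int(np.argmax(sims)), sims


idx, sims = best_match("scary monster horror")
print(toy_reviews[idx], sims.round(2))
```

The query here is compared with the reviews in topic space rather than in raw tf-idf space; either is a reasonable choice for the bonus, and comparing the two can be informative.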

In [None]:
# Your code here (feel free to add cells)


## Part 3: Apply Latent Dirichlet Allocation

Again copying code from the lecture notebook, apply Latent Dirichlet Allocation (LDiA, also commonly abbreviated LDA) to this data.

You'll want to specifically do the following:
* Compute count vectors for your corpus, using a tokenizer of your choice (probably the same one you used for LSA)
* Apply LDiA to the count vectors. This can take a while to compute, so you may want to grab a coffee or make a start on Part 4 below while you wait.
* Explore the results. 
  * Do the topics seem to correspond to movie genres, to types of reviews, or anything else? 
  * Do the topic weights for individual reviews seem to make sense to you?
  * How do these topics compare to the topics LSA found? Do they seem to be more or less coherent? 
  * How do the document distributions over topics compare to those from LSA? Do you find LDiA's distributions to be more sparse (i.e., with more topics having a near-0 probability within a given document)?
* Try changing the number of topics and explore how this changes the results
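
As a rough guide, the steps above might look like the sketch below. Again, this is a toy illustration rather than the lecture notebook's code: `toy_reviews`, the default tokenizer, and the choice of 2 topics are assumptions made to keep it quick to run. On the full review corpus, expect fitting to take considerably longer.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the 'text' column of the reviews dataframe
toy_reviews = [
    "A scary horror film with a terrifying monster",
    "The monster in this horror movie was not scary at all",
    "A sweet romantic comedy with a charming couple",
    "This romantic comedy about a couple made me laugh",
]

# LDiA works on raw counts, not tf-idf weights
counter = CountVectorizer()  # default tokenizer; pass tokenizer=... to use your own
counts = counter.fit_transform(toy_reviews)

# Fit LDiA with 2 topics (an arbitrary choice for this tiny corpus)
ldia = LatentDirichletAllocation(n_components=2, learning_method='batch', random_state=0)
doc_topic_dist = ldia.fit_transform(counts)  # one row per document; each row sums to 1

# Show the most heavily weighted terms in each topic
terms = sorted(counter.vocabulary_, key=counter.vocabulary_.get)
for topic_idx, weights in enumerate(ldia.components_):
    top_terms = [terms[i] for i in np.argsort(weights)[::-1][:5]]
    print(f"topic {topic_idx}: {top_terms}")

# Inspect each document's distribution over topics to judge how sparse it is
print(doc_topic_dist.round(2))
```

Because each row of `doc_topic_dist` is a probability distribution, it can be compared directly with the signed LSA topic weights when thinking about the sparsity question above.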

In [None]:
# Your code here (feel free to add cells)

## Part 4: Experiment with Web Scraping


**First,** visit a webpage that you might want to scrape (e.g., perhaps a site with things for sale, music lyrics, reviews, biographies of people, dating profiles, ... ?). Use the developer tools in your browser to identify the HTML tags, classes, and/or ids that correspond to the text you want to scrape.
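
To give a sense of how the tags, classes, and ids you identify translate into scraping code, here is a minimal sketch using BeautifulSoup (introduced later in this part). The HTML fragment, the id `listings`, and the class names `item`, `title`, and `price` are all invented for illustration; substitute whatever you found in the developer tools for your own page.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment, similar to what the developer tools might show
html = """
<div id="listings">
  <div class="item">
    <a class="title" href="/item/1">Red bicycle</a>
    <span class="price">$120</span>
  </div>
  <div class="item">
    <a class="title" href="/item/2">Blue kettle</a>
    <span class="price">$15</span>
  </div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Look up the container by id, then each entry by tag name and class
listings = soup.find(id='listings')
rows = []
for item in listings.find_all('div', class_='item'):
    title = item.find('a', class_='title')
    price = item.find('span', class_='price')
    rows.append((title.get_text(strip=True), title.get('href'), price.get_text(strip=True)))

print(rows)
```

Working on a saved HTML string like this is a convenient way to check your selectors before requesting live pages.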


**Then, work through a tutorial for one (or both) of the following web scraping tools.**

[Webscraper.io](https://webscraper.io/) is a GUI-based tool that lets you crawl from a starting page and scrape without any coding. You'll probably want to start by watching a tutorial video, e.g., [this one](https://www.youtube.com/watch?v=n7fob_XVsbY&feature=emb_logo).

Alternatively, you can use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to build a scraper in Python. Install it using:


In [None]:
!pip install beautifulsoup4

You'll also want to install `requests` and `lxml`:

In [None]:
!pip install requests

In [None]:
!pip install lxml

There is a lot of good documentation for BeautifulSoup online. For instance, you might want to start with [this tutorial](https://www.digitalocean.com/community/tutorials/how-to-scrape-web-pages-with-beautiful-soup-and-python-3), though note that you will probably want to type the Python code into this notebook rather than saving it in a separate .py file and running it outside Jupyter.

For your convenience, the final code from this tutorial is provided below.

In [None]:
import requests
import csv
from bs4 import BeautifulSoup


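# Build the list of archived index pages to scrape (anZ1.htm to anZ4.htm)
pages = []

for i in range(1, 5):
    url = 'https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ' + str(i) + '.htm'
    pages.append(url)


# Small adaptation of the tutorial code: open the output file in a with-block so it is
# flushed and closed when scraping finishes; newline='' avoids blank rows on Windows
with open('z-artist-names.csv', 'w', newline='', encoding='utf-8') as csv_file:
    f = csv.writer(csv_file)
    f.writerow(['Name', 'Link'])

    for item in pages:
        page = requests.get(item)
        soup = BeautifulSoup(page.text, 'html.parser')

        # Remove the AlphaNav links so they are not scraped as artist names
        last_links = soup.find(class_='AlphaNav')
        last_links.decompose()

        # Collect every <a> tag inside the BodyText element
        artist_name_list = soup.find(class_='BodyText')
        artist_name_list_items = artist_name_list.find_all('a')

        for artist_name in artist_name_list_items:
            names = artist_name.contents[0]
            links = 'https://web.archive.org' + artist_name.get('href')

            f.writerow([names, links])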