# Hackerschool: Python Automation - Web Scraping

This notebook is part of the materials used for the Hackerschool: Python Automation workshop held on 5 Sept 2020 by NUS Hackers.

Created by Christopher Goh (email: chris@nushackers.org)

**Remember to make a copy of this notebook, or else your changes won't save.**

**Go to "File" > "Save a copy in Drive"**

## Problem Statement

We are going to use Python to help us count who sent the most messages in the "NUS Hackers Chat" public chat group!

First, let's download the exported chat logs that we will use for this tutorial. Run the command below.

In [None]:
!git clone --depth=1 https://github.com/chrisgzf/hackerschool_python_automation.git hackerschool

If you click on the folder icon on the left, you'll now see that the files are downloaded to your Colab notebook.

In [None]:
from glob import glob

`glob` is a built-in Python module. It finds files whose paths match a pattern, which helps us pick out just the HTML files.

In [None]:
exported_message_files = glob("hackerschool/web_scraping/ChatExport_2020-09-05/*.html")
exported_message_files

In [None]:
len(exported_message_files)
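
To see what the `*.html` pattern does in isolation, here is a minimal sketch that uses a throwaway directory. The file names are made up for illustration and are not taken from the export.

```python
import os
import tempfile
from glob import glob

# Hypothetical file names, created in a temporary directory for illustration only
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ["messages.html", "messages2.html", "notes.txt"]:
        open(os.path.join(tmp_dir, name), "w").close()

    # "*" matches any run of characters, so "*.html" keeps only the HTML files
    matched = sorted(
        os.path.basename(path) for path in glob(os.path.join(tmp_dir, "*.html"))
    )

print(matched)
```

Note that `notes.txt` is left out because it does not end in `.html`.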

Let's see what one of the message files looks like.

In [None]:
with open("hackerschool/web_scraping/ChatExport_2020-09-05/messages.html", "r") as f:
    print(''.join(f.readlines()[:20]))

Ooooh what is this weird "code" thing? I think it's **HTML**

**back to slides**

In [None]:
from bs4 import BeautifulSoup

BeautifulSoup is a third-party Python library that helps us parse HTML files.

Let's try just parsing 1 file first!

We'll start with `messages.html`


In [None]:
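# open the file in a with-block so it gets closed, and name the parser explicitly
with open("hackerschool/web_scraping/ChatExport_2020-09-05/messages.html", "r") as f:
    soup = BeautifulSoup(f, "html.parser")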

Let's start with something simple first! Let's try to figure out who has spoken in the chat before.

In [None]:
soup.select(".from_name")[:10]

Hmmm, still a little messy. Let's try to really just extract the names.

In [None]:
raw_names = [div.contents[0].strip() for div in soup.select(".from_name")]
raw_names
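
To see what `.contents[0].strip()` is doing, here is a minimal sketch on a made-up HTML snippet. The markup is an assumption, loosely modelled on the export rather than copied from it, and `soup_demo` and `names_demo` are hypothetical names used only here.

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration; the real export has much more markup
html = """
<div class="from_name">
Alice
</div>
<div class="from_name">
Bob
</div>
"""

soup_demo = BeautifulSoup(html, "html.parser")

for div in soup_demo.select(".from_name"):
    raw = div.contents[0]  # first child of the tag: a text node with newlines around the name
    print(repr(raw), "->", repr(raw.strip()))

# same list comprehension as above, applied to the demo snippet
names_demo = [div.contents[0].strip() for div in soup_demo.select(".from_name")]
```

`.contents` is the list of a tag's children, and `.strip()` removes the surrounding whitespace, leaving just the name.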

Wow! Those are a lot of names! 

**Challenge 1:** A lot of these names are repeated. How do I get a list of unique names in the chat?

**Challenge 2:** How do I count how many unique names I have?

In [None]:
# write challenge 1 answer here

In [None]:
# write challenge 2 answer here

Nice! 😄😄 So we managed to use Python to **help us count how many people have spoken in our chat group**!

But that was just a short introduction to the powers of BeautifulSoup. Let's get back to our task. **We need to count how many messages each user has sent.**

In [None]:
raw_names[:15]

Let's look at this closely. Can we just take `raw_names`, count it, and then we'll get the right number of messages?

In [None]:
all_messages = soup.select(".message.default")
all_messages[:3]

What does one message look like?

In [None]:
one_message = all_messages[2]
one_message

In [None]:
one_message.select(".from_name")

In [None]:
one_message.select(".from_name")[0].contents[0].strip()

Let's print every message's sender

In [None]:
for message in all_messages:
    print(message.select(".from_name")[0].contents[0].strip())

In [None]:
for message in all_messages:
    print(message.select(".from_name"))

In [None]:
for message in all_messages:
    sender = message.select(".from_name")
    if not sender:
        # message has no sender
        print("someone sent this")
    else:
        # message has sender
        print(sender[0].contents[0].strip())
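
Why does the `if not sender` check work, and why can't we simply count `raw_names`? Here is a minimal sketch on made-up markup in which the second message has no `.from_name` element. The snippet is an assumption for illustration, not a copy of the export.

```python
from bs4 import BeautifulSoup

# Made-up markup: the second message has no .from_name element
html = """
<div class="message default">
  <div class="from_name">Alice</div>
  <div class="text">hello</div>
</div>
<div class="message default">
  <div class="text">anyone here?</div>
</div>
"""

soup_demo = BeautifulSoup(html, "html.parser")
messages_demo = soup_demo.select(".message.default")

# select() returns an empty list when nothing matches, and an empty list is falsy
has_sender = [bool(m.select(".from_name")) for m in messages_demo]
print(has_sender)
print(len(soup_demo.select(".from_name")), "names for", len(messages_demo), "messages")
```

There are fewer names than messages here, so counting names alone would undercount anyone who sends several messages in a row.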

Great! We are getting somewhere!

**Challenge 3:** Let's change all the "someone sent this" to the last seen name. How do we do this?

In [None]:
# same code snippet as just now, edit the code here!

for message in all_messages:
    sender = message.select(".from_name")
    if not sender:
        # message has no sender
        print("someone sent this")
    else:
        # message has sender
        print(sender[0].contents[0].strip())

Now that we know how to get the right names associated with each message, let's try to count them.

First, let's convert the names into a proper list of names.

In [None]:
last_seen_sender = ''
all_names = []

for message in all_messages:
    sender = message.select(".from_name")
    if not sender:
        # message has no sender
        all_names.append(last_seen_sender)
    else:
        # message has sender
        last_seen_sender = sender[0].contents[0].strip()
        all_names.append(last_seen_sender)

all_names

In [None]:
len(all_names) == len(all_messages)
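
The "remember the last seen name" idea does not depend on BeautifulSoup. As a sketch, here it is as a small function on a plain list, where `None` stands in for a message with no sender. `fill_missing_senders` is a hypothetical helper name, and the names are made up.

```python
from typing import List, Optional


def fill_missing_senders(senders: List[Optional[str]]) -> List[str]:
    """Replace each None with the most recent non-None name (hypothetical helper)."""
    filled = []
    last_seen = ""  # same starting value as last_seen_sender above
    for sender in senders:
        if sender is not None:
            last_seen = sender
        filled.append(last_seen)
    return filled


example = ["Alice", None, None, "Bob", None]
print(fill_missing_senders(example))
```

The output list always has the same length as the input, which is the same property the `len(all_names) == len(all_messages)` check confirms.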

So now we have a proper list of names. How do we count them?

In [None]:
import pandas as pd

Pandas is a very popular data wrangling library in Python. It makes counting things like this super easy.

In [None]:
df = pd.DataFrame(all_names)
df

In [None]:
df[0].value_counts()
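
If you are curious how the same tally looks without pandas, here is a sketch using `collections.Counter` from the standard library. The names are made up. This is an alternative for comparison, not what the rest of this notebook uses.

```python
from collections import Counter

names_demo = ["Alice", "Bob", "Alice", "Carol", "Alice", "Bob"]  # made-up names

# Counter maps each distinct name to how many times it appears
counts = Counter(names_demo)

# most_common() lists (name, count) pairs from the highest count to the lowest
for name, n in counts.most_common():
    print(name, n)
```

Pandas is still the more convenient choice here, because the result of `value_counts()` can be plotted directly.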

Yay, we achieved our task! But wait, now your boss wants to see your work, and you can't possibly just send it over in this format, right?

In [None]:
df[0].value_counts().plot(kind='bar', figsize=(12, 5))

Are we done yet?

**No! We only did it on 1/9 files!**

But relax! This is Python, so it should be very easy for you to repeat the same thing 9 times, right?

In [None]:
exported_message_files

In [None]:
# we keep this outside the for loop, so that it keeps
# track of all names across all files
all_names = []

last_seen_sender = ''

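# glob() does not guarantee any order, so sort the paths first:
# last_seen_sender carries over from one file to the next
for message_file in sorted(exported_message_files):
    # use bsoup to open up individual HTML file
    with open(message_file, "r") as f:
        soup = BeautifulSoup(f, "html.parser")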

    # select all messages
    all_messages = soup.select(".message.default")

    # process every individual message
    for message in all_messages:
        sender = message.select(".from_name")
        if not sender:
            # message has no sender
            all_names.append(last_seen_sender)
        else:
            # message has sender
            last_seen_sender = sender[0].contents[0].strip()
            all_names.append(last_seen_sender)

len(all_names)
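
Because `last_seen_sender` carries over from one file to the next, the files need to be processed in chat order. Plain alphabetical order would put a hypothetical `messages10.html` before `messages2.html`, so a larger export would need a numeric sort. The sketch below assumes the files are named `messages.html`, `messages2.html`, `messages3.html` and so on. `file_number` is a hypothetical helper, and the paths are made up.

```python
import re


def file_number(path: str) -> int:
    """Hypothetical helper: messages.html -> 1, messages2.html -> 2, messages10.html -> 10."""
    match = re.search(r"messages(\d*)\.html$", path)
    if match is None:
        raise ValueError(f"unexpected file name: {path}")
    # the first file has no number in its name, so treat it as 1
    return int(match.group(1) or 1)


paths = ["export/messages10.html", "export/messages2.html", "export/messages.html"]
ordered = sorted(paths, key=file_number)
print(ordered)
```

With only 9 files, alphabetical and numeric order agree, so this only matters for exports with 10 or more files.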

In [None]:
df = pd.DataFrame(all_names)
df[0].value_counts()

In [None]:
df[0].value_counts().plot(kind='bar', figsize=(24, 5))

Can you spot your name there? 😇😇

😡😡 If you don't spot your name there 😡😡

Do join us at our Telegram chat at t.me/nushackers_chat! 😇😇

Great job on completing this! There are many other cool things you can do with this:
1. Can you generate a word cloud of the messages?
1. Can you generate a chart of the group's average chat activity at different times of the day?
1. Can you feed all the messages in the chat into a machine learning model, and make the model "talk like an NUS Hacker"?