# DSC 80: Lab 07

### Due Date: Tuesday, February 25 11:59PM

## Instructions
Much like in DSC 10, this Jupyter Notebook contains the statements of the problems and provides code and markdown cells to display your answers to the problems. Unlike DSC 10, the notebook is *only* for displaying a readable version of your final answers. The coding work will be developed in an accompanying `lab*.py` file that is imported into the current notebook.

Labs and programming assignments will be graded in (at most) two ways:
1. The functions and classes in the accompanying python file will be tested (a la DSC 20),
2. The notebook will be graded (for graphs and free response questions).

**Do not change the function names in the `*.py` file**
- The functions in the `*.py` file are how your assignment is graded, and they are graded by their name. The dictionary at the end of the file (`GRADED FUNCTIONS`) contains the "grading list". The final function in the file allows your doctests to check that all the necessary functions exist.
- If you changed something you weren't supposed to, just use git to revert!

**Tips for working in the Notebook**:
- The notebooks present the questions and give you a place to record your results for later review.
- The notebooks for *lab assignments* are not graded (only the `.py` file is).
- Notebooks for PAs will serve as a final report for the assignment, and contain conclusions and answers to open-ended questions that are graded.
- The notebook serves as a nice environment for 'pre-development' and experimentation before designing your function in your `.py` file.

**Tips for developing in the .py file**:
- Do not change the function names in the starter code; grading is done using these function names.
- Do not change the docstrings in the functions. These are there to tell you if your work is on the right track!
- You are encouraged to write your own additional functions to solve the lab! 
    - Developing in python usually consists of larger files, with many short functions.
    - You may write your other functions in an additional `.py` file that you import in `lab**.py` (much like we do in the notebook).
- Always document your code!

### Importing code from `lab**.py`

* We import our `.py` file that's contained in the same directory as this notebook.
* We use the `autoreload` notebook extension to make changes to our `lab**.py` file immediately available in our notebook. Without this extension, we would need to restart the notebook kernel to see any changes to `lab**.py` in the notebook.
    - `autoreload` is necessary because, upon first import, Python caches the loaded module (in `sys.modules`; the compiled bytecode is also stored in the directory `__pycache__`). Subsequent imports of `lab**` merely reuse the cached module rather than re-executing the file.
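
The caching behaviour can be seen without the notebook extension. The following is a minimal sketch, assuming a throwaway module named `demo_lab` (a hypothetical stand-in for `lab07`) written to a temporary directory: a second `import` does not pick up an edit, while `importlib.reload` does. `autoreload` automates roughly this reload step for you.

```python
import importlib
import os
import sys
import tempfile

sys.dont_write_bytecode = True  # keep the demo from leaving a __pycache__ behind

with tempfile.TemporaryDirectory() as tmp:
    # `demo_lab` is a hypothetical module name used only for this sketch.
    path = os.path.join(tmp, 'demo_lab.py')
    with open(path, 'w') as f:
        f.write('ANSWER = 1\n')

    sys.path.insert(0, tmp)
    import demo_lab
    first = demo_lab.ANSWER

    # Edit the file on disk, as you would while developing.
    with open(path, 'w') as f:
        f.write('ANSWER = 2  # edited\n')

    import demo_lab              # served from sys.modules; the file is not re-read
    second = demo_lab.ANSWER

    importlib.reload(demo_lab)   # re-executes the source file
    third = demo_lab.ANSWER

    sys.path.remove(tmp)

print(first, second, third)
```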

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import lab07 as lab

In [3]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob
import os
import time
import re

In [4]:
import requests
import json

# Practice with regular expressions (Regex)

**Question 1**

You start with some basic regular expression exercises to get some practice using them. You will find function stubs and related doctests in the starter code. 

**Exercise 1:** A string that has a `[` as the third character and `]` as the sixth character.

**Exercise 2:** Phone numbers that start with '(858)' and follow the format '(xxx) xxx-xxxx' (x represents a digit).

*Notice: There is a space between (xxx) and xxx-xxxx*

**Exercise 3:** A string whose length is between 6 and 10 (inclusive) and contains only word characters, whitespace, and `?`. This string must have `?` as its last character.

**Exercise 4:** A string that begins with '\\$' and with another '\\$' within, where:
   - Characters between the two '\\$' can be anything (including nothing) except the letters 'a', 'b', 'c' (lower case).
   - Characters after the second '\\$' can only have any number of the letters 'a', 'b', 'c' (upper or lower case), with every 'a' before every 'b', and every 'b' before every 'c'.
       - E.g. 'AaBbbC' works, 'ACB' doesn't.

**Exercise 5:** A string that represents a valid Python file name including the extension. 

*Notice*: For simplicity, assume that the file name contains only letters, numbers and an underscore `_`.

**Exercise 6:** Find patterns of lowercase letters joined with an underscore.

**Exercise 7:** Find patterns that start with and end with a `_`.

**Exercise 8:**  Apple registration numbers and Apple hardware product serial numbers might have the number '0' (zero), but never the letter 'O'. Serial numbers don't have the number '1' (one) or the letter 'i'. Write a regular expression that checks whether the given serial number belongs to a genuine Apple product.

**Exercise 9:** Check if a given ID number is from Los Angeles (LAX), San Diego (SAN) or the state of New York (NY). ID numbers have the following format `SC-NN-CCC-NNNN`. 
   - SC represents state code in uppercase 
   - NN represents a number with 2 digits 
   - CCC represents a three letter city code in uppercase
   - NNNN represents a number with 4 digits

**Exercise 10:**  Given an input string, cast it to lower case, remove spaces/punctuation, and return a list of every 3-character substring that satisfies the following:
   - The substring does not start with 'a' or 'A'
   - The last substring (and only the last substring) can be shorter than 3 characters, depending on the length of the input string.
   - The substrings cannot overlap
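
Every stub below tests its pattern with `prog.search(string) is not None`. Because `search` looks for a match anywhere in the string, exercises that describe the *whole* string need explicit anchors. The sketch below uses a made-up pattern (exactly three digits, not one of the graded answers) to show the difference, along with one subtlety of `$`.

```python
import re

# Hypothetical pattern, not one of the graded answers: exactly three digits.
unanchored = re.compile(r'\d{3}')
anchored = re.compile(r'^\d{3}$')

def check(prog, string):
    """Same test the starter code applies to every `pattern`."""
    return prog.search(string) is not None

results = {
    'unanchored, padded': check(unanchored, 'ab123cd'),   # digits found mid-string
    'anchored, padded': check(anchored, 'ab123cd'),       # extra characters rejected
    'anchored, exact': check(anchored, '123'),
    'anchored, newline': check(anchored, '123\n'),        # `$` tolerates one trailing newline
    'fullmatch, newline': re.fullmatch(r'\d{3}', '123\n') is not None,
}
print(results)
```

If a trailing newline should be rejected as well, `re.fullmatch` (or `\Z` in place of `$`) is the stricter option.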

In [5]:
# Exercise 1
def match_1(string):
    """
    A string that has a [ as the third character and ] as the sixth character.
    >>> match_1("abcde]")
    False
    >>> match_1("ab[cde")
    False
    >>> match_1("a[cd]")
    False
    >>> match_1("ab[cd]")
    True
    >>> match_1("1ab[cd]")
    False
    >>> match_1("ab[cd]ef")
    True
    >>> match_1("1b[#d] _")
    True
    """
    #Your Code Here
    pattern = r'^.{2}\[.{2}\]'

    #Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None


In [6]:
# match_1("abcde]") == False
# match_1("ab[cde") == False
# match_1("a[cd]") == False
# match_1("ab[cd]") == True
# match_1("1ab[cd]") == False
# match_1("ab[cd]ef") == True
# match_1("1b[#d] _") == True

In [7]:
# Exercise 2
def match_2(string):
    """
    Phone numbers that start with '(858)' and
    follow the format '(xxx) xxx-xxxx' (x represents a digit)
    Notice: There is a space between (xxx) and xxx-xxxx

    >>> match_2("(123) 456-7890")
    False
    >>> match_2("858-456-7890")
    False
    >>> match_2("(858)45-7890")
    False
    >>> match_2("(858) 456-7890")
    True
    >>> match_2("(858)456-789")
    False
    >>> match_2("(858)456-7890")
    False
    >>> match_2("a(858) 456-7890")
    False
    >>> match_2("(858) 456-7890b")
    False
    """
    #Your Code Here
    pattern = r'^\(858\) \d{3}-\d{4}$'

    #Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None

In [8]:
# match_2("(123) 456-7890") == False
# match_2("858-456-7890") == False
# match_2("(858)45-7890") == False
# match_2("(858) 456-7890") == True
# match_2("(858)456-789") == False
# match_2("(858)456-7890") == False
# match_2("a(858) 456-7890") == False
# match_2("(858) 456-7890b") == False

In [9]:
# Exercise 3
def match_3(string):
    """
    Find a pattern whose length is between 6 to 10
    and contains only word character, white space and ?.
    This string must have ? as its last character.

    >>> match_3("qwertsd?")
    True
    >>> match_3("qw?ertsd?")
    True
    >>> match_3("ab c?")
    False
    >>> match_3("ab   c ?")
    True
    >>> match_3(" asdfqwes ?")
    False
    >>> match_3(" adfqwes ?")
    True
    >>> match_3(" adf!qes ?")
    False
    >>> match_3(" adf!qe? ")
    False
    """
    #Your Code Here

    pattern = r'^[\w ?]{5,9}\?$'

    #Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None

In [10]:
# match_3("qwertsd?") == True
# match_3("qw?ertsd?") == True
# match_3("ab c?") == False
# match_3("ab   c ?") == True
# match_3(" asdfqwes ?") == False
# match_3(" adfqwes ?") == True
# match_3(" adf!qes ?") == False
# match_3(" adf!qe? ") == False 


In [11]:
# Exercise 4
def match_4(string):
    """
    A string that begins with '$' and with another '$' within, where:
        - Characters between the two '$' can be anything except the 
        letters 'a', 'b', 'c' (lower case).
        - Characters after the second '$' can only have any number 
        of the letters 'a', 'b', 'c' (upper or lower case), with every 
        'a' before every 'b', and every 'b' before every 'c'.
            - E.g. 'AaBbbC' works, 'ACB' doesn't.

    >>> match_4("$$AaaaaBbbbc")
    True
    >>> match_4("$!@#$aABc")
    True
    >>> match_4("$a$aABc")
    False

    >>> match_4("$iiuABc")
    False
    >>> match_4("123$Abc")
    False
    >>> match_4("$$Abc")
    True
    >>> match_4("$qw345t$AAAc")
    False
    >>> match_4("$s$Bca")
    False
    """
    #Your Code Here
    pattern = r'^\$[^abc]*\$[Aa]+[Bb]+[Cc]+$'

    #Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None

### ASK!

In [12]:
# match_4("$$AaaaaBbbbc") == True
# match_4("$!@#$aABc") == True
# match_4("$a$aABc") == False
# match_4("$iiuABc") == False
# match_4("123$Abc") == False
# match_4("$$Abc") == True
# match_4("$qw345t$AAAc") == False # Is this False because there is only A and c after the second $? ASK!!!!!!!!!
# match_4("$s$Bca") == False

In [13]:
# Exercise 5
def match_5(string):
    """
    A string that represents a valid Python file name including the extension.
    *Notice*: For simplicity, assume that the file name contains only letters, numbers and an underscore `_`.

    >>> match_5("dsc80.py")
    True
    >>> match_5("dsc80py")
    False
    >>> match_5("dsc80..py")
    False
    >>> match_5("dsc80+.py")
    False
    """

    #Your Code Here
    pattern = r'^\w+\.py$'

    #Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None

In [14]:
# match_5("dsc80.py") == True
# match_5("dsc80py") == False
# match_5("dsc80..py") == False
# match_5("dsc80+.py") == False

In [15]:
# Exercise 6
def match_6(string):
    """
    Find patterns of lowercase letters joined with an underscore.
    >>> match_6("aab_cbb_bc")
    False
    >>> match_6("aab_cbbbc")
    True
    >>> match_6("aab_Abbbc")
    False
    >>> match_6("abcdef")
    False
    >>> match_6("ABCDEF_ABCD")
    False
    """

    #Your Code Here
    pattern = r'^[a-z]+_[a-z]+$'

    #Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None

In [16]:
# match_6("aab_cbb_bc") == False
# match_6("aab_cbbbc") == True
# match_6("aab_Abbbc") == False
# match_6("abcdef") == False
# match_6("ABCDEF_ABCD") == False


In [17]:
# Exercise 7
def match_7(string):
    """
    Find patterns that start with and end with a _
    >>> match_7("_abc_")
    True
    >>> match_7("abd")
    False
    >>> match_7("bcd")
    False
    >>> match_7("_ncde")
    False
    """
    
    #Your Code Here
    pattern = r'^_.*_$'

    #Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None

### ASK!

In [None]:
# match_7("_abc_") == True
# match_7("abd") == False
# match_7("bcd") == False
# match_7("_ncde") == False
# match_7("__") == True # This would be a match right???????????????????? ASK!!!!!!!!!
# _ false

In [19]:
# Exercise 8
def match_8(string):
    """
    Apple registration numbers and Apple hardware product serial numbers
    might have the number "0" (zero), but never the letter "O".
    Serial numbers don't have the number "1" (one) or the letter "i".

    Write a line of regex expression that checks
    if the given Serial number belongs to a genuine Apple product.

    >>> match_8("ASJDKLFK10ASDO")
    False
    >>> match_8("ASJDKLFK0ASDo")
    True
    >>> match_8("JKLSDNM01IDKSL")
    False
    >>> match_8("ASDKJLdsi0SKLl")
    False
    >>> match_8("ASDJKL9380JKAL")
    True
    """

    #Your Code Here
    # Parentheses are literal inside a character class, so they are left out here
    pattern = r'^[^O1i]*$'

    #Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None

### ASK!

In [20]:
# match_8("ASJDKLFK10ASDO") == False
# match_8("ASJDKLFK0ASDo") == True
# match_8("JKLSDNM01IDKSL") == False
# match_8("ASDKJLdsi0SKLl") == False
# match_8("ASDJKL9380JKAL") == True
# match_8("ASDJKL9380JKALIIIo000") == True # So, I, o does not count right? ASK!!!!!!!!!!!!!!!!!!

In [21]:
# Exercise 9
def match_9(string):
    """
    Check if a given ID number is from Los Angeles (LAX), San Diego(SAN) or
    the state of New York (NY). ID numbers have the following format SC-NN-CCC-NNNN.
        - SC represents state code in uppercase
        - NN represents a number with 2 digits
        - CCC represents a three letter city code in uppercase
        - NNNN represents a number with 4 digits
    
    >>> match_9('NY-32-NYC-1232')
    True
    >>> match_9('ca-23-SAN-1231')
    False
    >>> match_9('MA-36-BOS-5465')
    False
    >>> match_9('CA-56-LAX-7895')
    True
    """

    #Your Code Here
    # CA IDs must be from LAX or SAN; any three-letter city code is accepted for NY
    pattern = r'^(CA-\d{2}-(LAX|SAN)|NY-\d{2}-[A-Z]{3})-\d{4}$'

    #Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None

### ASK!

In [22]:
# match_9('NY-32-NYC-1232') == True
# match_9('ca-23-SAN-1231') == False
# match_9('MA-36-BOS-5465') == False# 
# match_9('CA-56-LAX-7895') == True # My solution seems too redundant?????????? ASK!!!!!!!!!


In [63]:
# Exercise 10
def match_10(string):
    """
    Given an input string, cast it to lower case, remove spaces/punctuations, 
    and return a list of every 3-character substring that satisfy the following:
        - The first character doesn't start with 'a' or 'A'
        - The last substring (and only the last substring) can be shorter than 
        3 characters, depending on the length of the input string.
        - The substrings cannot overlap
    
    >>> match_10('ABCdef')
    ['def']
    >>> match_10(' DEFaabc !g ')
    ['def', 'cg']
    >>> match_10('Come ti chiami?')
    ['com', 'eti', 'chi']
    >>> match_10('and')
    []
    >>> match_10( "Ab..DEF")
    ['def']
    """
    no_space = string.replace(' ', '') # Remove spaces
    lower_case = no_space.lower() # Cast to lower case
    
    # If the length of the string is 10, are we disregarding the last character or not?
    # Example: string = 'ade$ffr% *', are we ignoring * in this first step?
    sequence = re.findall('..?.?', lower_case) # Find all three-character sequence
    
    # Drop the chunks that start with 'a'
    not_a = [se for se in sequence if se[0] != 'a']

    complete = ''.join(not_a) # Join back into string
    no_punc = re.sub('[^A-Za-z0-9 ]', '', complete) # Remove punctuation
    com_sequence = re.findall('..?.?', no_punc) # Re-chunk into three-character sequences
    return com_sequence

In [64]:
import string
string.punctuation
# temp = !@#$%^&*()-=_+|;':",.<>?'
# str.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

### ASK!

In [None]:
# match_10('ABCdef') == ['def']
# match_10(' DEFaabc !g ') == ['def', 'cg']
# match_10('Come ti chiami?') == ['com', 'eti', 'chi']
# match_10('and') == []
# match_10( "Ab..DEF") == ['def']
# match_10( "Ab..AEF")  == ['aef'] # Does it mean it would return this?????????? ASK!!!!!!!!!!!!
# Check!!!!!!!!!!!!!!!!

## Regex groups: extracting personal information from messy data

**Question 2**

The file in `data/messy.txt` contains personal information from a fictional website that a user scraped from webserver logs. Within this dataset, there are four fields that interest you:
1. Email Addresses (assume they are alphanumeric user-names and domain-names),
2. [Social Security Numbers](https://en.wikipedia.org/wiki/Social_Security_number#Structure)
3. Bitcoin Addresses (alpha-numeric strings of long length)
4. Street Addresses

Create a function `extract_personal` that takes in a string like `open('data/messy.txt').read()` and returns a tuple of four separate lists containing values of the 4 pieces of information listed above (in the order given). Do **not** keep empty values.

*Hint*: There are multiple "delimiters" in use in the file; there are few enough of them that you can safely determine what they are.

*Note:* Since this data is messy/corrupted, your function will be allowed to miss ~5% of the records in each list. Good spot checking using certain useful substrings (e.g. `@` for emails) should help assure correctness! Your function will be tested on a sample of the file `messy.txt`.
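
As a starting point, here is a minimal sketch run on the first record shown in the file preview (cell `In [27]`). It assumes that SSNs and bitcoin addresses are always introduced by the labels `ssn:` and `bitcoin:`, which holds in the preview but should be spot-checked against the full file. Note the non-capturing group `(?:...)` in the email pattern: with it, `re.findall` returns whole matches instead of tuples of groups.

```python
import re

# First record of data/messy.txt as shown in the preview, shortened.
sample = ('dottewell0@gnu.org\toR1mOq,!@#$%^&*(),'
          '[{bitcoin:18A8rBU3wvbLTSxMjqrPNc9mvonpA4XMiv\tIP:192.232.9.210'
          '\tccn:3563354617955160|ssn:380-09-9403}]|05-6609813,814 Monterey Court\n')

# Assumption: the `ssn:` and `bitcoin:` labels always precede the values.
emails = re.findall(r'[A-Za-z0-9]+@[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)+', sample)
ssns = re.findall(r'ssn:(\d{3}-\d{2}-\d{4})', sample)
bitcoins = re.findall(r'bitcoin:([A-Za-z0-9]+)', sample)
addresses = re.findall(r'\d+ [A-Za-z]+ [A-Za-z]+', sample)

print(emails, ssns, bitcoins, addresses)
```

Anchoring on the labels sidesteps the delimiter question for two of the four fields; emails and street addresses carry no label, so their patterns have to describe the value itself.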

In [26]:
fp = os.path.join('data', 'messy.txt')
s = open(fp, encoding='utf8').read()

In [27]:
test = s[:1000]
test
# test = '#kdumphreyh@hc360.com|dtr04K,(╯°□°）╯︵ ┻━┻)'

'1\t4/12/2018\tLorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin risus. Praesent lectus.\n\nVestibulum quam sapien| varius ut, blandit non, interdum in, ante. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Duis faucibus accumsan odio. Curabitur convallis.|dottewell0@gnu.org\toR1mOq,!@#$%^&*(),[{bitcoin:18A8rBU3wvbLTSxMjqrPNc9mvonpA4XMiv\tIP:192.232.9.210\tccn:3563354617955160|ssn:380-09-9403}]|05-6609813,814 Monterey Court\n2\t12/18/2018\tSuspendisse potenti. In eleifend quam a odio. In hac habitasse platea dictumst.\n\nMaecenas ut massa quis augue luctus tincidunt. Nulla mollis molestie lorem. Quisque ut erat.,bassiter1@sphinn.com\tc5KvmarHX3o,test\u2060test\u202b,[{bitcoin:1EB7kYpnfJSqS7kUFpinsmPF3uiH9sfRf1,IP:20.73.13.197|ccn:3542723823957010\tssn:118-12-8276}#{bitcoin:1E5fev4boabWZmXvHGVkHcNJZ2tLnpM6Zv*IP:238.206.212.148\tccn:337941898369615,ssn:427-22-9352}#{bitcoin:1DqG3WcmGw74PjptjzcAmxGFuQdvWL7RCC,IP:171.241.15.98\tccn:3574

### ASK!

In [28]:
# No need to actually split????????????????? ASK!!!!!!!!!!!!!!!!
# re.split(r'\t|\n', test)

### ASK!

In [29]:
# What are some valid domain names? Would upper case be involved???? .com???.edu??? ASK!!!!!!!!!!!!!!!!!
email_pat = r'([A-Za-z0-9]+@[A-Za-z0-9]+(\.[A-Za-z0-9]+)+)'
# Exception emails like egristonr7@pagesperso-orange.fr, jpitcaithleyre@t-online.de

# ssn_pat = '(?!(000)|(666))[0-8][0-9]{2}-(?!(00))[0-9]{2}-(?!(0000))[0-9]{4}' # 423-00-9575 not match this, 00 for second group??????????????? ASK!!!!!!!!!
ssn_pat = '((?!(000)|(666))[0-8][0-9]{2}-[0-9]{2}-(?!(0000))[0-9]{4})'

# Online, something about the length of bitcoin, and not including O,I,l?????????? Need to consider this?
bit_pat = '[13][a-km-zA-HJ-NP-Z0-9]{26,33}' 

add_pat = r'\d+ [A-Za-z]+ [A-Za-z]+' # [A-z] would also match [ \ ] ^ _ `. Too simple? What about St. Amerbse street?? Or things like this
# Some weird address such as 0 Veith Drive, 04 Westport Lane, 04232 Monterey Circle?????

### ASK!

In [30]:
# email_group = re.findall(email_pat, s)
# emails = [group[0] for group in email_group]
# len(emails) # 938 results
# Search for @ only, has 963 results, within 5%

ssn_group = re.findall('[0-9]{3}-[0-9]{2}-[0-9]{4}', s)
ssn_group = re.findall(ssn_pat, s) # No difference???? Why bother??? hhh
ssns = [group[0] for group in ssn_group]
len(ssns) # 2571
# Search for ssn, has 2858 results, null results has 286, n = 2572, within 5%

# bits = re.findall(bit_pat, s)
# len(bits) # 2781
# Search for bitcoin, has 2857 results, null results has 76, n = 2781 (exact)

adds = re.findall(add_pat, s)
# len(bits) # 2781
# Seems to make sense?? But when bitcoin is null, address does not necessary has to be null


In [31]:
# re.findall('\d+ [A-z]+ [A-z]+', s)

In [32]:
def extract_personal(s):
    """
    :Example:
    >>> fp = os.path.join('data', 'messy.test.txt')
    >>> s = open(fp, encoding='utf8').read()
    >>> emails, ssn, bitcoin, addresses = extract_personal(s)
    >>> emails[0] == 'test@test.com'
    True
    >>> ssn[0] == '423-00-9575'
    True
    >>> bitcoin[0] == '1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2'
    True
    >>> addresses[0] == '530 High Street'
    True
    """
    # Pattern of each category
    email_pat = r'([A-Za-z0-9]+@[A-Za-z0-9]+(\.[A-Za-z0-9]+)+)'
    ssn_pat = '((?!(000)|(666))[0-8][0-9]{2}-[0-9]{2}-(?!(0000))[0-9]{4})'
    bit_pat = '[13][a-km-zA-HJ-NP-Z0-9]{26,33}' 
    add_pat = r'\d+ [A-Za-z]+ [A-Za-z]+'

    # Get email address
    email_group = re.findall(email_pat, s)
    emails = [group[0] for group in email_group]

    # Get social security number
    ssn_group = re.findall(ssn_pat, s) 
    ssns = [group[0] for group in ssn_group]

    # Get bitcoin address
    bits = re.findall(bit_pat, s)

    # Get address
    adds = re.findall(add_pat, s)
    return (emails, ssns, bits, adds)

## Content in Amazon review data

**Question 3**

The dataset `reviews.txt` contains [Amazon reviews](http://jmcauley.ucsd.edu/data/amazon/) for ~200k phones and phone accessories. This dataset has been "cleaned" for you. The goal of this section is to create a function that takes in the review dataset and a review and returns the word that "best summarizes the review" using TF-IDF.

1. Create a function `tfidf_data(review, reviews)` that takes a review as well as the review data and returns a dataframe:
    - indexed by the words in `review`,
    - with columns given by (a) the number of times each word is found in the review (`cnt`), (b) the term frequency for each word (`tf`), (c) the inverse document frequency for each word (`idf`), and (d) the TF-IDF for each word (`tfidf`).
    
2. Create a function `relevant_word(tfidf_data)` which takes in a dataframe as above and returns the word that "best summarizes the review" described by `tfidf_data`.


*Note:* Use this function to "cluster" review types: run it on a sample of reviews and see which words come up most. Unfortunately, you will likely have to change your code from your answer above to run it on the entire dataset. To do this, compute as many of the frequencies as possible "ahead of time" and look them up when needed; you should also likely filter out words that occur "rarely".
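
For orientation, here is a minimal sketch of the computation on a three-review toy corpus (the corpus and the name `tfidf_sketch` are made up for illustration; this is not the graded solution). It counts whole words with `\b` word boundaries, takes `tf` as count over total words in the review, and `idf` as the log of (number of reviews / number of reviews containing the word). It assumes every word of `review` occurs in at least one entry of `reviews`; otherwise the `idf` denominator is zero.

```python
import re

import numpy as np
import pandas as pd

def tfidf_sketch(review, reviews):
    """Illustrative TF-IDF table for the words of one review."""
    words = review.split()
    total = len(words)
    rows = {}
    for word in dict.fromkeys(words):            # unique words, original order
        pat = r'\b%s\b' % re.escape(word)        # match the word as a whole token
        cnt = len(re.findall(pat, review))
        tf = cnt / total
        idf = np.log(len(reviews) / reviews.str.contains(pat).sum())
        rows[word] = {'cnt': cnt, 'tf': tf, 'idf': idf, 'tfidf': tf * idf}
    return pd.DataFrame.from_dict(rows, orient='index')

# Hypothetical toy data.
reviews = pd.Series(['great case', 'this case is slim', 'slim phone'])
review = 'this case is slim'
out = tfidf_sketch(review, reviews)
print(out)
```

With whole-word matching, `is` is counted once in the toy review; a plain substring count would also pick up the `is` inside `this`.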

### ASK! (not so sure what this is doing)

In [33]:
# Small dataset testing
fp = os.path.join('data', 'tests.txt')
reviews = pd.read_csv(fp, header=None, squeeze=True)
review = open(os.path.join('data', 'review.txt'), encoding='utf8').read().strip()

In [34]:
re_pat = '\\b%s\\b' % 'this'
review.count(re_pat)
#'\\b%s\\b' % 'this'

0
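
The result is `0` because `str.count` treats its argument as a literal substring, so it searches for the characters `\bthis\b` rather than applying a regular expression. A small sketch on a made-up sentence shows the three behaviours side by side:

```python
import re

text = 'this is a thistle'   # hypothetical example text
pat = r'\bthis\b'

literal_count = text.count(pat)               # looks for the literal characters \bthis\b
substring_count = text.count('this')          # also counts the 'this' inside 'thistle'
whole_word_count = len(re.findall(pat, text)) # regex word boundaries applied

print(literal_count, substring_count, whole_word_count)
```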

In [35]:
words = pd.Series(review.split()).unique() # Get the unique word (as index)
df = pd.DataFrame(columns=['cnt', 'tf', 'idf', 'tfidf'], index=words)  # dataframe of documents
for word in words:
    # print(word)
    # re_pat = '\\b%s\\b' % word
    cnt = review.count(word)# .values[0]
    tf = cnt / (review.count(' ') + 1)# .values[0]
    idf = np.log(len(reviews) / reviews.str.contains(word).sum())#.values[0]
    # print(tf.values[0])
    # df['cnt'], df['tf'], df['idf'], df['tfidf'] = cnt, tf, idf, tf * idf
    df.loc[word, 'cnt'], df.loc[word, 'tfidf'] = cnt, tf * idf
    df.loc[word, 'tf'], df.loc[word, 'idf'] = tf, idf
    # df.loc[word] = pd.Series([cnt, tf, idf, tf * idf])
# df.loc['this', 'cnt'] = 1
df.sort_values('tfidf', ascending=False)

| | cnt | tf | idf | tfidf |
|---|---|---|---|---|
| skin | 2 | 0.0227273 | 3.2581 | 0.0740476 |
| cover | 3 | 0.0340909 | 2.15948 | 0.0736188 |
| different | 2 | 0.0227273 | 2.56495 | 0.0582943 |
| case | 3 | 0.0340909 | 1.64866 | 0.0562043 |
| combinations | 1 | 0.0113636 | 3.2581 | 0.0370238 |
| silicone | 1 | 0.0113636 | 3.2581 | 0.0370238 |
| polycarbonate | 1 | 0.0113636 | 3.2581 | 0.0370238 |
| damage | 1 | 0.0113636 | 3.2581 | 0.0370238 |
| spills | 1 | 0.0113636 | 3.2581 | 0.0370238 |
| interchangeable | 1 | 0.0113636 | 3.2581 | 0.0370238 |


In [36]:
words = pd.Series(review.split()).unique()
words

array(['this', 'is', 'a', 'great', 'new', 'case', 'design', 'that', 'i',
       'have', 'not', 'seen', 'before', 'it', 'has', 'slim', 'silicone',
       'skin', 'really', 'locks', 'in', 'the', 'phone', 'to', 'cover',
       'and', 'protect', 'your', 'from', 'spills', 'such', 'also', 'hard',
       'polycarbonate', 'outside', 'shell', 'guard', 'against', 'damage',
       'comes', 'with', 'different', 'interchangeable', 'skins', 'covers',
       'create', 'multiple', 'color', 'combinations', 'kind', 'of',
       'than', 'usual', 'chunk', 'plastic', 'innovative', 'suits',
       'iphone', '5', 'perfectly'], dtype=object)

In [37]:
fp = os.path.join('data', 'reviews.txt')
reviews = pd.read_csv(fp, header=None, squeeze=True)
review = open(os.path.join('data', 'review.txt'), encoding='utf8').read().strip()

In [38]:
%%time
words = pd.Series(review.split()).unique() # Get the unique word (as index)
df = pd.DataFrame(columns=['cnt', 'tf', 'idf', 'tfidf'], index=words)  # dataframe of documents
for word in words:
    # print(word)
    # re_pat = '\\b%s\\b' % word
    cnt = review.count(word)# .values[0]
    tf = cnt / (review.count(' ') + 1)# .values[0]
    idf = np.log(len(reviews) / reviews.str.contains(word).sum())#.values[0]
    # print(tf.values[0])
    # df['cnt'], df['tf'], df['idf'], df['tfidf'] = cnt, tf, idf, tf * idf
    df.loc[word, 'cnt'], df.loc[word, 'tfidf'] = cnt, tf * idf
    df.loc[word, 'tf'], df.loc[word, 'idf'] = tf, idf
    # df.loc[word] = pd.Series([cnt, tf, idf, tf * idf])
# df.loc['this', 'cnt'] = 1
df

CPU times: user 3.96 s, sys: 0 ns, total: 3.96 s
Wall time: 3.96 s


In [39]:
df[(df['cnt'] != 1) & (df['cnt'] != 2) & (df['cnt'] != 3) & (df['cnt'] != 5)]['cnt'].sum()

83

In [40]:
'before' in df.index

True

In [41]:
def tfidf_data(review, reviews):
    """
    :Example:
    >>> fp = os.path.join('data', 'reviews.txt')
    >>> reviews = pd.read_csv(fp, header=None, squeeze=True)
    >>> review = open(os.path.join('data', 'review.txt'), encoding='utf8').read().strip()
    >>> out = tfidf_data(review, reviews)
    >>> out['cnt'].sum()
    85
    >>> 'before' in out.index
    True
    """
    return ...

In [42]:
def relevant_word(out):
    """
    :Example:
    >>> fp = os.path.join('data', 'reviews.txt')
    >>> reviews = pd.read_csv(fp, header=None, squeeze=True)
    >>> review = open(os.path.join('data', 'review.txt'), encoding='utf8').read().strip()
    >>> out = tfidf_data(review, reviews)
    >>> relevant_word(out) in out.index
    True
    """
    return ...

### Tweet Analysis: Internet Research Agency

The dataset `data/ira.csv` contains tweets tagged by Twitter as likely being posted by the *Internet Research Agency* (the tweet factory facing allegations for attempting to influence US political elections).

The questions in this section will focus on the following:
1. We will look at the hashtags present in the text and trends in their makeup.
2. We will prepare this dataset for modeling by creating features out of the text fields.

**Question 4 (HashTags)**

You may assume that a hashtag is any string without whitespace following a `#` (this is more permissive than Twitter's rules for hashtags; you are encouraged to go down this rabbit-hole to better figure out how to clean your data!).
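
To make the difference concrete, the sketch below compares the permissive definition (`\S+`, any run of non-whitespace) with a stricter word-character pattern (`\w+`) on a made-up tweet. Which one suits your data is a cleaning decision; neither is presented here as the required answer.

```python
import re

# Hypothetical tweet text for illustration.
tweet = 'RT @rapstationradio: #NowPlaying: new #hip-hop track'

permissive = re.findall(r'#(\S+)', tweet)   # everything up to the next whitespace
wordlike = re.findall(r'#(\w+)', tweet)     # letters, digits and underscore only

print(permissive, wordlike)
```

The permissive pattern keeps trailing punctuation such as the colon, while the stricter one cuts `hip-hop` at the hyphen.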

* Create a function `hashtag_list` that takes in a column of tweet-text and returns a column containing the list of hashtags present in the tweet text. If a tweet doesn't contain a hashtag, the function should return an empty list.

* Create a function `most_common_hashtag` that takes in a column of hashtag-lists (the output above) and returns a column consisting of a single hashtag from the tweet-text. 
    - If the text has no hashtags, the entry should be `NaN`,
    - If the text has one distinct hashtag, the entry should contain that hashtag,
    - If the text has more than one hashtag, the entry should be the most common hashtag (among all hashtags in the column). If there is a tie for most common, any of the most common can be returned.
        - E.g. if the input column was: `pd.Series([[1, 2, 2], [3, 2, 3]])`, the output would be: `pd.Series([2, 2])`. Even though `3` was more common in the second list, `2` is the most common among all hashtags in the column.
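
The example above can be reproduced with a short sketch. It assumes pandas 0.25 or later (for `Series.explode`), and the name `most_common_sketch` is made up for illustration. Counts are computed once over the whole column, then each list picks its member with the highest global count.

```python
import numpy as np
import pandas as pd

def most_common_sketch(tag_lists):
    """For each list, return the member that is most frequent column-wide."""
    # Flatten all lists and count every tag once, up front.
    counts = tag_lists.explode().value_counts().to_dict()

    def pick(tags):
        if len(tags) == 0:
            return np.nan
        return max(tags, key=lambda t: counts[t])

    return tag_lists.apply(pick)

example = pd.Series([[1, 2, 2], [3, 2, 3]])
result = most_common_sketch(example)
print(result.tolist())
```

Flattening with `explode` avoids concatenating Python lists one at a time, which matters on the full `ira` dataset.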

In [43]:
fp = os.path.join('data', 'ira.csv')
ira = pd.read_csv(fp, names=['id', 'name', 'date', 'text'])
ira

| | id | name | date | text |
|---|---|---|---|---|
| 0 | 3906258 | ea85ac8be1e8ab479064ca4c0fe3ac6587f76b1ef97452... | 2016-11-16 09:04 | The Best Exercise To Lose Belly Fat In 2 weeks... |
| 1 | 1051443 | 8e58ab0f46d273103d9e71aa92cdaffb6e330ec7d15ae5... | 2016-12-24 04:31 | RT @Philanthropy: Dozens of ‘hate groups’ have... |
| 2 | 2823399 | Room Of Rumor | 2016-08-18 20:26 | Artificial intelligence can find, map poverty,... |
| 3 | 272878 | San Francisco Daily | 2016-03-18 19:28 | Uber balks at rules proposed by world’s busies... |
| 4 | 7697802 | 41bb9ae5991f53996752a0ab8dd36b543821abca8d5aed... | 2016-07-30 15:44 | RT @dirtroaddiva1: #IHatePokemonGoBecause he ... |
| 5 | 1409274 | New York City Today | 2016-01-04 19:02 | Chick-fil-A remains closed after health violat... |
| 6 | 2973541 | ce7b9f8c86dfbf9b2bd03eda62f0d42ac1c2b1b593ba0b... | 2016-05-20 14:56 | RT @SenSanders: We cannot afford to wait to ad... |
| 7 | 1042655 | Andy Sparks | 2016-04-13 14:52 | RT @MatthewGellert: #IWouldPreferToForget that... |
| 8 | 7838616 | 40bd0ff013b85c7646ca07ad238bc4dc865ce2cc87034a... | 2016-10-08 10:19 | RT @rapstationradio: #NowPlaying: RJ (OMMIO) "... |
| 9 | 8005939 | 0512ea612cfe45a7d9c8c0fd42466e8a8068a6fb3efb34... | 2016-08-15 09:57 | Hill Street Vida Blues. #AthleticsTVShows @sus... |


In [44]:
tweet_text = ira['text']
hashtag_pat = r'#(\w+)'
prog = re.compile(hashtag_pat)
tweet_text.apply(lambda x: prog.findall(x))

0               [Exercise, LoseBellyFat, CatTV, TeenWolf]
1                                                      []
2                                                  [tech]
3                                                  [news]
4                  [IHatePokemonGoBecause, PokesAreJokes]
5                                                [health]
6                                                      []
7                                  [IWouldPreferToForget]
8                        [NowPlaying, rap, hiphop, music]
9                                      [AthleticsTVShows]
10                                 [HillaryRottenClinton]
11                                                     []
12                                           [TrumpTapes]
13                                        [entertainment]
14                                                     []
15                                                     []
16                                                     []
17            

In [45]:
def hashtag_list(tweet_text):
    """
    :Example:
    >>> testdata = [['RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1']]
    >>> test = pd.DataFrame(testdata, columns=['text'])
    >>> out = hashtag_list(test['text'])
    >>> (out.iloc[0] == ['NLP', 'NLP1', 'NLP1'])
    True
    """
    hashtag_pat = r'#(\w+)' # Hash Tag pattern
    prog = re.compile(hashtag_pat) # Compile
    tags = tweet_text.apply(lambda x: prog.findall(x)) # Find all
    return tags


In [46]:
testdata = [['RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1']]
test = pd.DataFrame(testdata, columns=['text'])
out = hashtag_list(test['text'])
(out.iloc[0] == ['NLP', 'NLP1', 'NLP1'])

True

In [47]:
tweet_lists = hashtag_list(ira['text'])
freq = pd.Series(tweet_lists.sum())

In [48]:
counts = freq.value_counts()

In [49]:
def most_common(tweet_list):
    """Helper function to compute one single list"""
    if len(tweet_list) == 0: # Empty list
        return np.nan
    
    if len(tweet_list) == 1: # One elem
        return tweet_list[0]
    
    for com in counts.index: # Highest to lowest freq (iterate hashtags, not their counts)
        if com in tweet_list: # Check if in tweet
            return com

In [50]:
def most_common_hashtag(tweet_lists):
    """
    :Example:
    >>> testdata = [['RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1']]
    >>> test = hashtag_list(pd.DataFrame(testdata, columns=['text'])['text'])
    >>> most_common_hashtag(test).iloc[0]
    'NLP1'
    """
    freq = pd.Series(tweet_lists.sum()) # Total Frequency Series
    counts = freq.value_counts() # Count occurrences

    def most_common(tweet_list):
        """Helper function to compute one single list"""
        if len(tweet_list) == 0: # Empty list
            return np.nan

        if len(tweet_list) == 1: # One elem
            return tweet_list[0]

        for com in counts.index: # Highest to lowest freq
            if com in tweet_list: # Check if in tweet
                return com
    
    return tweet_lists.apply(most_common)

In [51]:
testdata = [['RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1']]
test = hashtag_list(pd.DataFrame(testdata, columns=['text'])['text'])
most_common_hashtag(test).iloc[0]

'NLP1'
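
Summing a Series of lists concatenates them one by one, which can get slow on the full dataset. A possible alternative, sketched below, counts hashtags in a single pass with `collections.Counter`. The function name `most_common_hashtag_counter` and the sample lists are hypothetical, and the tie-breaking rule (the first tag in the tweet wins) is an assumption that may differ from the ordering `value_counts` gives.

```python
from collections import Counter
from itertools import chain

import numpy as np
import pandas as pd

def most_common_hashtag_counter(tweet_lists):
    """For each list of hashtags, return the tag that is most frequent in the whole Series."""
    # Count every hashtag occurrence across all tweets in one linear pass.
    counts = Counter(chain.from_iterable(tweet_lists))

    def pick(tags):
        if len(tags) == 0:
            return np.nan  # No hashtag in this tweet
        # max keeps the first tag in the tweet when counts are tied (an assumption).
        return max(tags, key=counts.__getitem__)

    return tweet_lists.apply(pick)

# Hypothetical sample: 'NLP1' appears three times overall, the other tags once.
lists = pd.Series([['NLP', 'NLP1', 'NLP1'], [], ['NLP1', 'DSC80']])
result = most_common_hashtag_counter(lists)
print(result.tolist())  # ['NLP1', nan, 'NLP1']
```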

**Question 5 (Features)**

Now create a dataframe of features from the `ira` data. That is, create a function `create_features` that takes in the `ira` data and returns a dataframe with the same index as `ira` (i.e. the rows correspond to the same tweets) and the following columns:
* `num_hashtags` gives the number of hashtags present in a tweet,
* `mc_hashtags` gives the most common hashtag associated to a tweet (as given by the problem above),
* `num_tags` gives the number of tags a given tweet has (look for the presence of `@`),
* `num_links` gives the number of hyper-links present in a given tweet
    - (a hyper-link is a string that starts with `http://` or `https://` and runs until the next whitespace character),
* A boolean column `is_retweet` that describes if the given tweet is a retweet (i.e. `RT`),
* A 'clean' text field `text` that contains the tweet text with:
    - The non-alphanumeric characters removed (except spaces),
    - All words should be separated by exactly one space,
    - The characters all lowercase,
    - All the meta-information above (Retweet info, tags, hyperlinks, hashtags) removed.

*Note:* You should make a helper function for each column.

*Note:* This will take a while to run on the entire dataset -- test it on a small sample first!
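
Before writing the helpers, it may help to try candidate patterns on the doctest tweet. The patterns below are one reading of the feature descriptions above, not an official answer key; the names `PATTERNS`, `found`, and `is_rt` are made up for this sketch. Note that `#\w+` would also match a `#fragment` inside a URL, which is one reason to handle links separately.

```python
import re

# Candidate patterns (assumptions based on the feature descriptions).
PATTERNS = {
    'hashtags': r'#\w+',      # '#' followed by word characters
    'tags': r'@\w+',          # '@' followed by word characters
    'links': r'https?://\S+', # scheme, then everything up to the next whitespace
}

tweet = 'RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1'

# Collect every match of each pattern in the sample tweet.
found = {name: re.findall(pat, tweet) for name, pat in PATTERNS.items()}
# Treat a tweet as a retweet when it starts with the standalone token 'RT'.
is_rt = bool(re.match(r'RT\b', tweet))

for name, matches in found.items():
    print(name, len(matches), matches)
print('is_retweet', is_rt)
```

For this tweet the sketch finds three hashtags, one tag, and one link, which agrees with the counts in the `create_features` doctest.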

In [52]:
fp = os.path.join('data', 'ira.csv')
ira = pd.read_csv(fp, names=['id', 'name', 'date', 'text'])

In [53]:
def hashtag_list2(tweet_text): # Possibly redundant with hashtag_list
    """Helper function to get the hashtag list (with the leading '#')"""
    hashtag_pat = r'(#(\w+))' # Outer group: full hashtag; inner group: tag text
    prog = re.compile(hashtag_pat) # Compile
    tup = tweet_text.apply(prog.findall) # List of (full, text) tuples per tweet
    tags = tup.apply(lambda x: [group[0] for group in x]) # Keep the full hashtags
    return tags

In [54]:
# hashtag_list2(tweet_text) # Done 

### ASK!

In [55]:
def tag_list(tweet_text): # Open question: does every '@' followed by word characters count as a tag?
    """Helper function to get the tag list"""
    tag_pat = r'@\w+' # Tag pattern
    prog = re.compile(tag_pat) # Compile
    tags = tweet_text.apply(prog.findall) # Find all
    return tags

In [56]:
tag_list(tweet_text) # Done

0                                                       []
1                                          [@Philanthropy]
2                                                       []
3                                                       []
4                                         [@dirtroaddiva1]
5                                                       []
6                                            [@SenSanders]
7                                        [@MatthewGellert]
8                                       [@rapstationradio]
9                                          [@susanslusser]
10       [@c982f7295cf57508a8d39bae6310c9546492d4105cac...
11                                                      []
12                                       [@shannoncoulter]
13                                                      []
14                                                      []
15                                            [@WarfareWW]
16                          [@NiqueTatted_721, @2ficmusi

### ASK!

In [57]:
# Open question: does a truncated link such as 'https://t.…"' also count as a hyperlink?
# Note: a search of the data found no 'http(s)://' immediately followed by whitespace.
def link_list(tweet_text):
    """Helper function to get the hyperlink list"""
    link_pat = r'https?://\S+' # Link pattern: scheme plus everything up to the next whitespace
    prog = re.compile(link_pat) # Compile
    links = tweet_text.apply(prog.findall) # Find all
    return links

In [58]:
def is_retweet(tweet_text):
    """Helper function to check if retweet"""
    rt_pat = r'^RT' # Retweet pattern: text starts with 'RT'
    prog = re.compile(rt_pat) # Compile
    return tweet_text.apply(lambda x: prog.match(x) is not None) # True if the text starts with 'RT'

In [59]:
tweet_text = ira['text']
tweet_lists = hashtag_list(tweet_text) # List of hashtags
num_hashtags = tweet_lists.apply(len) # Number of hashtags
mc_hashtags = most_common_hashtag(tweet_lists) # Most common hashtags
num_tags = tag_list(tweet_text).apply(len) # Number of tags
num_links = link_list(tweet_text).apply(len) # Number of links
is_rt = is_retweet(tweet_text) # Whether each tweet is a retweet (separate name, so the function is not overwritten)

In [61]:
def clean_text(tweet_text):
    """Helper function to clean single text string"""
    remove_rt = re.sub(r'^RT', ' ', tweet_text) # Remove retweet marker
    remove_link = re.sub(r'https?://\S+', ' ', remove_rt) # Remove links
    remove_hash = re.sub(r'#\w+', ' ', remove_link) # Remove hashtags
    remove_tags = re.sub(r'@\w+', ' ', remove_hash) # Remove tags
    substitute = re.sub(r'[^A-Za-z0-9 ]', ' ', remove_tags) # Replace non-alphanumeric with spaces
    space = re.sub(r' +', ' ', substitute).strip() # Collapse runs of spaces, trim the ends
    lower = space.lower() # Lowercase
    return lower

### ASK!

In [62]:
# Extracting the link: \S+ stops at the first whitespace, so the hashtags after the link are not captured
string = 'http://t.co/JESLKxfiu1 #AfricanArchitecture #Pyramids'
link_pat = r'https?://\S+' # Link pattern
prog = re.compile(link_pat) # Compile
links = prog.findall(string) # Find all
links

['http://t.co/JESLKxfiu1']

In [None]:
def create_features(ira):
    """
    :Example:
    >>> testdata = [['RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1']]
    >>> test = pd.DataFrame(testdata, columns=['text'])
    >>> out = create_features(test)
    >>> anscols = ['text', 'num_hashtags', 'mc_hashtags', 'num_tags', 'num_links', 'is_retweet']
    >>> ansdata = [['text cleaning is cool', 3, 'NLP1', 1, 1, True]]
    >>> ans = pd.DataFrame(ansdata, columns=anscols)
    >>> (out == ans).all().all()
    True
    """

    return ...
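
One way the pieces above could fit together is sketched below. It is self-contained, so it does not rely on the helper cells earlier in the notebook. The names `create_features_sketch`, `clean_one`, `LINK`, `HASHTAG`, and `TAG` are made up for this sketch, and the patterns are assumptions about what counts as a link, hashtag, or tag. Treat it as a starting point to check against the doctest, not as the reference solution.

```python
import re
from collections import Counter

import numpy as np
import pandas as pd

# Assumed patterns for the three kinds of meta-information.
LINK = r'https?://\S+'
HASHTAG = r'#(\w+)'
TAG = r'@\w+'

def clean_one(text):
    """Strip meta-information, then keep lowercase alphanumerics separated by single spaces."""
    text = re.sub(r'^RT\b', ' ', text)          # Retweet marker at the start
    text = re.sub(LINK, ' ', text)              # Links first, so '#' or '@' inside a URL go with it
    text = re.sub(r'#\w+|@\w+', ' ', text)      # Hashtags and tags
    text = re.sub(r'[^A-Za-z0-9 ]', ' ', text)  # Everything that is not alphanumeric or a space
    return re.sub(r' +', ' ', text).strip().lower()

def create_features_sketch(ira):
    """Build the feature dataframe from a dataframe with a 'text' column."""
    text = ira['text']
    hashtags = text.str.findall(HASHTAG)
    # Dataset-wide hashtag frequencies, counted in one pass.
    counts = Counter(tag for tags in hashtags for tag in tags)
    # Ties go to the first tag in the tweet (an assumption).
    mc = hashtags.apply(lambda tags: max(tags, key=counts.__getitem__) if tags else np.nan)

    return pd.DataFrame({
        'text': text.apply(clean_one),
        'num_hashtags': hashtags.apply(len),
        'mc_hashtags': mc,
        'num_tags': text.str.count(TAG),
        'num_links': text.str.count(LINK),
        'is_retweet': text.str.contains(r'^RT\b'),
    }, index=ira.index)

testdata = [['RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1']]
out = create_features_sketch(pd.DataFrame(testdata, columns=['text']))
print(out.iloc[0].tolist())
```

On the doctest tweet this should give the cleaned text `text cleaning is cool`, three hashtags, `NLP1` as the most common hashtag, one tag, one link, and `True` for `is_retweet`. The columns are built in the same order as `anscols` because the doctest compares the two dataframes with `==`.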

## Congratulations! You're done!

* Submit the lab on Gradescope