# DSC 80: Lab 07

### Due Date: Tuesday, May 18th 11:59PM

## Instructions
Much like in DSC 10, this Jupyter Notebook contains the statements of the problems and provides code and markdown cells to display your answers to the problems. Unlike DSC 10, the notebook is *only* for displaying a readable version of your final answers. You will develop your code in an accompanying `lab*.py` file, which is imported into this notebook.

Labs and programming assignments will be graded in (at most) two ways:
1. The functions and classes in the accompanying python file will be tested (a la DSC 20),
2. The notebook will be graded (for graphs and free response questions).

**Do not change the function names in the `*.py` file**
- The functions in the `*.py` file are how your assignment is graded, and they are graded by their name. The dictionary at the end of the file (`GRADED FUNCTIONS`) contains the "grading list". The final function in the file allows your doctests to check that all the necessary functions exist.
- If you changed something you weren't supposed to, just use git to revert!

**Tips for working in the Notebook**:
- The notebooks present the questions to you and give you a place to present your results for later review.
- The notebooks for *lab assignments* are not graded (only the `.py` file is).
- Notebooks for PAs will serve as a final report for the assignment, and contain conclusions and answers to open ended questions that are graded.
- The notebook serves as a nice environment for 'pre-development' and experimentation before designing your function in your `.py` file.

**Tips for developing in the .py file**:
- Do not change the function names in the starter code; grading is done using these function names.
- Do not change the docstrings in the functions. These are there to tell you if your work is on the right track!
- You are encouraged to write your own additional functions to solve the lab! 
    - Developing in python usually consists of larger files, with many short functions.
    - You may write your other functions in an additional `.py` file that you import in `lab**.py` (much like we do in the notebook).
- Always document your code!

### Importing code from `lab**.py`

* We import our `.py` file that's contained in the same directory as this notebook.
* We use the `autoreload` notebook extension to make changes to our `lab**.py` file immediately available in our notebook. Without this extension, we would need to restart the notebook kernel to see any changes to `lab**.py` in the notebook.
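    - `autoreload` is necessary because Python caches imported modules: after the first import, later imports of `lab**` in the same kernel reuse the module that is already loaded instead of re-reading the file. (The compiled bytecode that Python stores in `__pycache__` is a separate, on-disk cache.)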
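
The caching behaviour described above can be observed outside the notebook with the standard library's `importlib.reload`, which is roughly what `autoreload` automates. This is a minimal sketch: the module name `demo_lab` and its `answer` function are hypothetical and exist only for this demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Skip .pyc files so the demo always compiles from the source on disk.
old_flag = sys.dont_write_bytecode
sys.dont_write_bytecode = True
sys.modules.pop("demo_lab", None)  # start from a clean slate

with tempfile.TemporaryDirectory() as tmp:
    module_path = Path(tmp) / "demo_lab.py"  # hypothetical module
    module_path.write_text("def answer():\n    return 1\n")
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()

    import demo_lab
    first = demo_lab.answer()

    # Edit the file on disk, then import again: nothing changes,
    # because the module is already loaded in this process.
    module_path.write_text("def answer():\n    return 22\n")
    import demo_lab
    stale = demo_lab.answer()

    # An explicit reload re-executes the edited source.
    importlib.reload(demo_lab)
    fresh = demo_lab.answer()

    sys.path.remove(tmp)

sys.dont_write_bytecode = old_flag
```

With `%autoreload 2` enabled, the notebook performs the reload step for you before running each cell.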

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import lab07 as lab

In [3]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob
import os
import time
import re

In [4]:
import requests
import json

# Practice with regular expressions (Regex)

**Question 1**

You start with some basic regular expression exercises to get some practice using them. You will find function stubs and related doctests in the starter code. 

**Exercise 1:** A string that has a `[` as the third character and `]` as the sixth character.

In [5]:
def match_1(string):
    """
    >>> match_1("abcde]")
    False
    >>> match_1("ab[cde")
    False
    >>> match_1("a[cd]")
    False
    >>> match_1("ab[cd]")
    True
    >>> match_1("1ab[cd]")
    False
    >>> match_1("ab[cd]ef")
    True
    >>> match_1("1b[#d] _")
    True
    """
    #Your Code Here
    # Two arbitrary characters, then '[', two more characters, then ']'.
    pattern = r'^.{2}\[.{2}\]'
    
    #Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None

In [6]:
match_1("abcde]")

False

In [7]:
match_1("ab[cde")

False

In [8]:
match_1("a[cd]")

False

In [9]:
match_1("ab[cd]")

True

In [10]:
match_1("1ab[cd]")

False

In [11]:
match_1("ab[cd]ef")

True

In [12]:
match_1("1b[#d] _")

True
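
Because the graded helper calls `search`, which scans the whole string, positional requirements such as "third character" have to come from anchors in the pattern itself. The sketch below contrasts an unanchored pattern with an anchored one; the variable names are illustrative only.

```python
import re

pattern = r'\[..\]'  # '[' and ']' with two characters between them, anywhere

# search() scans the whole string, so the bracket pair may sit anywhere.
unanchored = re.search(pattern, "1ab[cd]") is not None

# Prefixing '^..' pins '[' to the third character.
anchored = re.search(r'^..' + pattern, "1ab[cd]") is not None

# match() only tries position 0, so it acts like an implicit '^'.
via_match = re.match(r'..' + pattern, "ab[cd]ef") is not None
```

Here `unanchored` is `True`, `anchored` is `False`, and `via_match` is `True`.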

**Exercise 2:** Phone numbers that start with '(858)' and follow the format '(xxx) xxx-xxxx' (x represents a digit).

*Notice: There is a space between (xxx) and xxx-xxxx*

In [13]:
def match_2(string):
    """
    Phone numbers that start with '(858)' and
    follow the format '(xxx) xxx-xxxx' (x represents a digit)
    Notice: There is a space between (xxx) and xxx-xxxx

    >>> match_2("(123) 456-7890")
    False
    >>> match_2("858-456-7890")
    False
    >>> match_2("(858)45-7890")
    False
    >>> match_2("(858) 456-7890")
    True
    >>> match_2("(858)456-789")
    False
    >>> match_2("(858)456-7890")
    False
    >>> match_2("a(858) 456-7890")
    False
    >>> match_2("(858) 456-7890b")
    False
    """
    #Your Code Here
    pattern = r'^\(858\) [0-9]{3}-[0-9]{4}$'

    #Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None

In [14]:
match_2("(123) 456-7890")

False

In [15]:
match_2("858-456-7890")

False

In [16]:
match_2("(858)45-7890")

False

In [17]:
match_2("(858) 456-7890")

True

In [18]:
match_2("(858)456-789")

False

In [19]:
match_2("(858)456-7890")

False

In [20]:
match_2("a(858) 456-7890")

False

In [21]:
match_2("(858) 456-7890b")

False
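
When you control the matching call yourself, `re.fullmatch` is an alternative to wrapping a pattern in `^` and `$`. The sketch below (variable names are illustrative) also shows one subtle difference: `$` accepts a trailing newline, while `fullmatch` does not.

```python
import re

phone = r'\(858\) \d{3}-\d{4}'

exact = re.fullmatch(phone, "(858) 456-7890") is not None
extra_char = re.fullmatch(phone, "(858) 456-7890b") is not None

# '$' also matches just before a newline at the end of the string.
dollar_newline = re.search('^' + phone + '$', "(858) 456-7890\n") is not None
full_newline = re.fullmatch(phone, "(858) 456-7890\n") is not None
```

In the lab the calling code is fixed to `search`, so the anchors must stay in the pattern.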

**Exercise 3:** A string whose length is between 6 and 10 characters and that contains only word characters, whitespace and `?`. The string must have `?` as its last character.

In [22]:
def match_3(string):
    """
    Find a pattern whose length is between 6 to 10
    and contains only word character, white space and ?.
    This string must have ? as its last character.

    >>> match_3("qwertsd?")
    True
    >>> match_3("qw?ertsd?")
    True
    >>> match_3("ab c?")
    False
    >>> match_3("ab   c ?")
    True
    >>> match_3(" asdfqwes ?")
    False
    >>> match_3(" adfqwes ?")
    True
    >>> match_3(" adf!qes ?")
    False
    >>> match_3(" adf!qe? ")
    False
    """
    #Your Code Here

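    # 5 to 9 word characters, whitespace or '?', followed by a final '?'.
    # \w (not just [a-zA-Z]) so that digits and '_' count as word characters.
    pattern = r'^[\w\s?]{5,9}\?$'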

    #Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None

In [23]:
match_3("qwertsd?")

True

In [24]:
match_3("qw?ertsd?")

True

In [25]:
match_3("ab c?")

False

In [26]:
match_3("ab   c ?")

True

In [27]:
match_3(" asdfqwes ?")

False

In [28]:
match_3(" adfqwes ?")

True

In [29]:
match_3(" adf!qes ?")

False

In [30]:
match_3(" adf!qe? ")

False

**Exercise 4:** A string that begins with '\\$' and with another '\\$' within, where:
   - Characters between the two '\\$' can be anything (including nothing) except the letters 'a', 'b', 'c' (lower case).
   - Characters after the second '\\$' can only have any number of the letters 'a', 'b', 'c' (upper or lower case), with every 'a' before every 'b', and every 'b' before every 'c'.
       - E.g. 'AaBbbC' works, 'ACB' doesn't.

In [31]:
def match_4(string):
    """
    A string that begins with '$' and with another '$' within, where:
        - Characters between the two '$' can be anything except the 
        letters 'a', 'b', 'c' (lower case).
        - Characters after the second '$' can only have any number 
        of the letters 'a', 'b', 'c' (upper or lower case), with every 
        'a' before every 'b', and every 'b' before every 'c'.
            - E.g. 'AaBbbC' works, 'ACB' doesn't.

    >>> match_4("$$AaaaaBbbbc")
    True
    >>> match_4("$!@#$aABc")
    True
    >>> match_4("$a$aABc")
    False
    >>> match_4("$iiuABc")
    False
    >>> match_4("123$Abc")
    False
    >>> match_4("$$Abc")
    True
    >>> match_4("$qw345t$AAAc")
    False
    >>> match_4("$s$Bca")
    False
    """
    #Your Code Here
    # Anchored at both ends; the doctests require at least one of each letter.
    pattern = r'^\$[^abc$]*\$[Aa]+[Bb]+[Cc]+$'
    
    #Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None

In [32]:
match_4("$$AaaaaBbbbc")

True

In [33]:
match_4("$!@#$aABc")

True

In [34]:
match_4("$a$aABc")

False

In [35]:
match_4("$iiuABc")

False

In [36]:
match_4("123$Abc")

False

In [37]:
match_4("$$Abc")

True

In [38]:
match_4("$qw345t$AAAc")

False

In [39]:
match_4("$s$Bca")

False
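
Patterns with several parts are easier to check when written with `re.VERBOSE`, which ignores whitespace and allows comments inside the pattern. The following is a sketch of one possible reading of Exercise 4, assuming (as the doctests suggest) that at least one of each letter is required after the second `$`. The name `MATCH_4` is hypothetical.

```python
import re

MATCH_4 = re.compile(r"""
    ^\$         # the string starts with a literal dollar sign
    [^abc$]*    # anything except lowercase a, b, c or another dollar sign
    \$          # the second literal dollar sign
    [Aa]+       # one or more a/A ...
    [Bb]+       # ... then one or more b/B ...
    [Cc]+       # ... then one or more c/C
    $           # end of the string
""", re.VERBOSE)

accepted = MATCH_4.search("$!@#$aABc") is not None
missing_b = MATCH_4.search("$qw345t$AAAc") is not None
bad_start = MATCH_4.search("123$Abc") is not None
```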

**Exercise 5:** A string that represents a valid Python file name including the extension. 

*Notice*: For simplicity, assume that the file name contains only letters, numbers and an underscore `_`.

In [40]:
def match_5(string):
    """
    A string that represents a valid Python file name including the extension.
    *Notice*: For simplicity, assume that the file name contains only letters, numbers and an underscore `_`.

    >>> match_5("dsc80.py")
    True
    >>> match_5("dsc80py")
    False
    >>> match_5("dsc80..py")
    False
    >>> match_5("dsc80+.py")
    False
    """

    #Your Code Here
    # One or more word characters (letters, digits, '_'), a literal dot, then 'py'.
    pattern = r'^\w+\.py$'

    #Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None

In [41]:
match_5("dsc80.py")

True

In [42]:
match_5("dsc80py")

False

In [43]:
match_5("dsc80..py")

False

In [44]:
match_5("dsc80+.py")

False

**Exercise 6:** Find patterns of lowercase letters joined with an underscore.

In [45]:
def match_6(string):
    """
    Find patterns of lowercase letters joined with an underscore.
    >>> match_6("aab_cbb_bc")
    False
    >>> match_6("aab_cbbbc")
    True
    >>> match_6("aab_Abbbc")
    False
    >>> match_6("abcdef")
    False
    >>> match_6("ABCDEF_ABCD")
    False
    """

    #Your Code Here
    # '+' rather than '*' so that a bare '_' is not accepted.
    pattern = r'^[a-z]+_[a-z]+$'

    #Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None

In [46]:
match_6("aab_cbb_bc")

False

In [47]:
match_6("aab_cbbbc")

True

In [48]:
match_6("aab_Abbbc")

False

In [49]:
match_6("abcdef")

False

In [50]:
match_6("ABCDEF_ABCD")

False

**Exercise 7:** Find patterns that start with and end with a `_`.

In [51]:
def match_7(string):
    """
    Find patterns that start with and end with a _
    >>> match_7("_abc_")
    True
    >>> match_7("abd")
    False
    >>> match_7("bcd")
    False
    >>> match_7("_ncde")
    False
    """

    pattern = '^_.*_$'

    #Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None

In [52]:
match_7("_abc_")

True

In [53]:
match_7("abd")

False

In [54]:
match_7("bcd")

False

In [55]:
match_7("_ncde")

False

**Exercise 8:**  Apple registration numbers and Apple hardware product serial numbers might have the number '0' (zero), but never the letter 'O'. Serial numbers don't have the number '1' (one) or the letter 'i'. Write a line of regex expression that checks if the given Serial number belongs to a genuine Apple product.

In [56]:
def match_8(string):
    """
    Apple registration numbers and Apple hardware product serial numbers
    might have the number "0" (zero), but never the letter "O".
    Serial numbers don't have the number "1" (one) or the letter "i".

    Write a line of regex expression that checks
    if the given Serial number belongs to a genuine Apple product.

    >>> match_8("ASJDKLFK10ASDO")
    False
    >>> match_8("ASJDKLFK0ASDo")
    True
    >>> match_8("JKLSDNM01IDKSL")
    False
    >>> match_8("ASDKJLdsi0SKLl")
    False
    >>> match_8("ASDJKL9380JKAL")
    True
    """

    # Reject the letter 'O', the letter 'i' and the digit '1' anywhere in the string.
    pattern = r'^[^Oi1]*$'

    #Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None

In [57]:
match_8("ASJDKLFK10ASDO")

False

In [58]:
match_8("ASJDKLFK0ASDo")

True

In [59]:
match_8("JKLSDNM01IDKSL")

False

In [60]:
match_8("ASDKJLdsi0SKLl")

False

In [61]:
match_8("ASDJKL9380JKAL")

True
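
A "must not contain these characters" rule can be written either with a negative lookahead applied at every position or with a negated character class. The sketch below checks that both forms agree on the doctest inputs; the variable names are illustrative.

```python
import re

# Negative lookahead tested before every character.
lookahead = re.compile(r'^(?:(?![Oi1]).)*$')
# Negated character class: every character is something other than O, i, 1.
negated = re.compile(r'^[^Oi1]*$')

samples = [
    "ASJDKLFK10ASDO",
    "ASJDKLFK0ASDo",
    "JKLSDNM01IDKSL",
    "ASDKJLdsi0SKLl",
    "ASDJKL9380JKAL",
]
by_lookahead = [lookahead.search(s) is not None for s in samples]
by_negation = [negated.search(s) is not None for s in samples]
```

The negated class is usually the easier of the two to read when only single characters are forbidden.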

**Exercise 9:** Check if a given ID number is from Los Angeles (LAX), San Diego (SAN) or the state of New York (NY). ID numbers have the following format `SC-NN-CCC-NNNN`. 
   - SC represents state code in uppercase 
   - NN represents a number with 2 digits 
   - CCC represents a three letter city code in uppercase
   - NNNN represents a number with 4 digits

In [62]:
def match_9(string):
    '''
    >>> match_9('NY-32-NYC-1232')
    True
    >>> match_9('ca-23-SAN-1231')
    False
    >>> match_9('MA-36-BOS-5465')
    False
    >>> match_9('CA-56-LAX-7895')
    True
    '''

    # Any city in NY state, or the LAX / SAN city codes (assumed to belong to CA).
    pattern = r'^(NY-[0-9]{2}-[A-Z]{3}|CA-[0-9]{2}-(LAX|SAN))-[0-9]{4}$'

    #Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None

In [63]:
match_9('NY-32-NYC-1232')

True

In [64]:
match_9('ca-23-SAN-1231')

False

In [65]:
match_9('MA-36-BOS-5465')

False

In [66]:
match_9('CA-56-LAX-7895')

True
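
An alternative to encoding every rule in one pattern is to parse the ID into named groups and apply the rules in Python. This is only a sketch: the helper names `parse_id` and `is_target` are hypothetical, and it assumes that LAX and SAN IDs carry the state code `CA`.

```python
import re

ID_PATTERN = re.compile(
    r'^(?P<state>[A-Z]{2})-(?P<num>\d{2})-(?P<city>[A-Z]{3})-(?P<serial>\d{4})$'
)

def parse_id(id_string):
    """Return the four fields as a dict, or None if the format is wrong."""
    m = ID_PATTERN.search(id_string)
    return m.groupdict() if m else None

def is_target(id_string):
    """True for any NY state ID, or for a CA ID with city LAX or SAN."""
    parts = parse_id(id_string)
    if parts is None:
        return False
    if parts['state'] == 'NY':
        return True
    return parts['state'] == 'CA' and parts['city'] in {'LAX', 'SAN'}
```

For example, `parse_id('CA-56-LAX-7895')` returns the four fields, and `is_target('MA-36-BOS-5465')` is `False`.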

**Exercise 10:**  Given an input string, cast it to lower case, remove spaces/punctuation, and return a list of every 3-character substring following this logic:
   - The first character of a substring is not 'a' or 'A'
   - The last substring (and only the last substring) can be shorter than 3 characters, depending on the length of the input string.
   - The substrings cannot overlap
   
Here's an example with one of the doctests:

`>>> match_10("Ab..DEF")`
`['def']`

1. convert it to a lowercase string resulting in "ab..def"
2. delete any 3 letter sequence that starts with the letter 'a', so delete "ab." from the string, leaving us with ".def"
3. delete the punctuation resulting in "def"
4. finally, we get `["def"]`

(Only split in the last step, everything else is removing from the string)
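
The four steps above can be traced one at a time. This is a sketch of one way to carry them out, with each intermediate value kept in its own (illustrative) variable.

```python
import re

raw = "Ab..DEF"

lowered = raw.lower()                              # step 1
no_a_chunks = re.sub(r'a.{2}', '', lowered)        # step 2: drop 'a' plus two more characters
letters_only = re.sub(r'[\W_]', '', no_a_chunks)   # step 3: drop spaces and punctuation
chunks = re.findall(r'.{1,3}', letters_only)       # step 4: split into pieces of at most 3
```

The intermediate strings are `'ab..def'`, `'.def'` and `'def'`, and the final list is `['def']`.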

In [67]:
s = re.sub('[\W]','',"Ab..DEF").lower()
re.findall('[\w]{1,3}',s)

['abd', 'ef']

In [68]:
def match_10(string):
    '''
    Given an input string, cast it to lower case, remove spaces/punctuation, 
    and return a list of every 3-character substring that satisfy the following:
        - The first character doesn't start with 'a' or 'A'
        - The last substring (and only the last substring) can be shorter than 
        3 characters, depending on the length of the input string.
    
    >>> match_10('ABCdef')
    ['def']
    >>> match_10(' DEFaabc !g ')
    ['def', 'cg']
    >>> match_10('Come ti chiami?')
    ['com', 'eti', 'chi']
    >>> match_10('and')
    []
    >>> match_10( "Ab..DEF")
    ['def']
    '''
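    # Drop every 3-character run that starts with 'a' (the string is already lowercase).
    s = re.sub(r'a.{2}', '', string.lower())
    # Remove whitespace and all punctuation, not only a fixed handful of symbols.
    s = re.sub(r'[\W_]', '', s)
    # Split into non-overlapping pieces of 3; only the last piece may be shorter.
    chunks = re.findall(r'.{1,3}', s)
    return chunks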

In [69]:
match_10('ABCdef')

['def']

In [70]:
match_10(' DEFaabc !g ')

['def', 'cg']

In [71]:
match_10('Come ti chiami?')

['com', 'eti', 'chi']

In [72]:
match_10('and')

[]

In [73]:
match_10( "Ab..DEF")

['def']

## Regex groups: extracting personal information from messy data

**Question 2**

The file in `data/messy.txt` contains personal information from a fictional website that a user scraped from webserver logs. Within this dataset, there are four fields that interest you:
1. Email Addresses (assume they are alphanumeric user-names and domain-names),
2. [Social Security Numbers](https://en.wikipedia.org/wiki/Social_Security_number#Structure)
3. Bitcoin Addresses (alpha-numeric strings of long length)
4. Street Addresses

Create a function `extract_personal` that takes in a string like `open('data/messy.txt').read()` and returns a tuple of four separate lists containing values of the 4 pieces of information listed above (in the order given). Do **not** keep empty values.

*Hint*: There are multiple "delimiters" in use in the file; there are few enough of them that you can safely determine what they are.

*Note:* Since this data is messy/corrupted, your function will be allowed to miss ~5% of the records in each list. Good spot checking using certain useful substrings (e.g. `@` for emails) should help assure correctness! Your function will be tested on a sample of the file `messy.txt`.
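
One way to act on the hint is to split a record on a small set of candidate delimiters and then spot-check the pieces. The sketch below uses part of the first record shown in the file preview and assumes that tab, comma and `|` are delimiters; the real file may use others.

```python
import re

record = (
    "Curabitur convallis.|dottewell0@gnu.org\toR1mOq,!@#$%^&*(),"
    "[{bitcoin:18A8rBU3wvbLTSxMjqrPNc9mvonpA4XMiv\tIP:192.232.9.210\t"
    "ccn:3563354617955160|ssn:380-09-9403}]|05-6609813,814 Monterey Court"
)

# Split on the assumed delimiters: tab, comma and pipe.
fields = re.split(r'[\t,|]', record)

# Spot check: '@' alone is not enough, since junk fields can contain it too.
at_fields = [f for f in fields if '@' in f]
emails = [
    f for f in fields
    if re.fullmatch(r'[A-Za-z0-9]+@[A-Za-z0-9]+\.[A-Za-z]+', f)
]
```

Here two fields contain `@`, but only `dottewell0@gnu.org` has the shape of an email address.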

In [74]:
fp = os.path.join('data', 'messy.txt')
s = open(fp, encoding='utf8').read()

In [75]:
s[:1000]

'1\t4/12/2018\tLorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin risus. Praesent lectus.\n\nVestibulum quam sapien| varius ut, blandit non, interdum in, ante. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Duis faucibus accumsan odio. Curabitur convallis.|dottewell0@gnu.org\toR1mOq,!@#$%^&*(),[{bitcoin:18A8rBU3wvbLTSxMjqrPNc9mvonpA4XMiv\tIP:192.232.9.210\tccn:3563354617955160|ssn:380-09-9403}]|05-6609813,814 Monterey Court\n2\t12/18/2018\tSuspendisse potenti. In eleifend quam a odio. In hac habitasse platea dictumst.\n\nMaecenas ut massa quis augue luctus tincidunt. Nulla mollis molestie lorem. Quisque ut erat.,bassiter1@sphinn.com\tc5KvmarHX3o,test\u2060test\u202b,[{bitcoin:1EB7kYpnfJSqS7kUFpinsmPF3uiH9sfRf1,IP:20.73.13.197|ccn:3542723823957010\tssn:118-12-8276}#{bitcoin:1E5fev4boabWZmXvHGVkHcNJZ2tLnpM6Zv*IP:238.206.212.148\tccn:337941898369615,ssn:427-22-9352}#{bitcoin:1DqG3WcmGw74PjptjzcAmxGFuQdvWL7RCC,IP:171.241.15.98\tccn:3574

In [76]:
def extract_personal(s):
    """
    :Example:
    >>> fp = os.path.join('data', 'messy.test.txt')
    >>> s = open(fp, encoding='utf8').read()
    >>> emails, ssn, bitcoin, addresses = extract_personal(s)
    >>> emails[0] == 'test@test.com'
    True
    >>> ssn[0] == '423-00-9575'
    True
    >>> bitcoin[0] == '1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2'
    True
    >>> addresses[0] == '530 High Street'
    True
    """
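    # Alphanumeric user name and domain name, as stated in the question.
    email_pattern = r'[A-Za-z0-9]+@[A-Za-z0-9]+\.[a-z]{3}'
    emails = re.findall(email_pattern,s)
    
    # Three digits, two digits, four digits (a stray ']' was removed from the middle group).
    ssns_pattern = r'[0-9]{3}-[0-9]{2}-[0-9]{4}'
    ssns = re.findall(ssns_pattern,s)
    
    # '+' rather than '*' so that empty values are not kept.
    btc_pattern = r'(?<=bitcoin:)\w+'
    bcs = re.findall(btc_pattern,s)
    
    # A street number followed by two words, e.g. '530 High Street'.
    address_pattern = r'[0-9]+ [A-Za-z]+ [A-Za-z]+'
    addresses = re.findall(address_pattern,s)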
    
    return emails,ssns,bcs,addresses

In [77]:
fp = os.path.join('data', 'messy.test.txt')
s = open(fp, encoding='utf8').read()
emails, ssn, bitcoin, addresses = extract_personal(s)

In [78]:
emails[0] == 'test@test.com'

True

In [79]:
ssn[0] == '423-00-9575'

True

In [80]:
bitcoin[0] == '1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2'

True

In [81]:
addresses[0] == '530 High Street'

True

## Content in Amazon review data

**Question 3**

The dataset `reviews.txt` contains [Amazon reviews](http://jmcauley.ucsd.edu/data/amazon/) for ~200k phones and phone accessories. This dataset has been "cleaned" for you. The goal of this section is to create a function that takes in the review dataset and a review and returns the word that "best summarizes the review" using TF-IDF.

1. Create a function `tfidf_data(review, reviews)` that takes a review as well as the review data and returns a dataframe:
    - indexed by the words in `review`,
    - with columns given by (a) the number of times each word is found in the review (`cnt`), (b) the term frequency for each word (`tf`), (c) the inverse document frequency for each word (`idf`), and (d) the TF-IDF for each word (`tfidf`).
    
2. Create a function `relevant_word(tfidf_data)` which takes in a dataframe as above and returns the word that "best summarizes the review" described by `tfidf_data`.


*Note:* Use this function to "cluster" review types: run it on a sample of reviews and see which words come up most. Unfortunately, you will likely have to change your code from your answer above to run it on the entire dataset. To do this, compute as many of the frequencies as possible "ahead of time" and look them up when needed; you should also likely filter out words that occur "rarely".
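
As a reminder of the quantities involved: the term frequency of a word is its count in the review divided by the number of words in the review, the inverse document frequency is the log of the number of reviews divided by the number of reviews containing the word, and TF-IDF is their product. The sketch below works through a made-up three-document corpus in plain Python and computes the document frequencies once, ahead of time, as the note suggests.

```python
import math
from collections import Counter

# Toy corpus (hypothetical data, not from reviews.txt).
docs = [
    "great case great price",
    "case cracked fast",
    "great battery",
]
doc = docs[0]

words = doc.split()
counts = Counter(words)  # cnt: occurrences of each word in this document

# Document frequency for every word, computed once for the whole corpus.
doc_freq = Counter(w for d in docs for w in set(d.split()))

tfidf = {
    w: (c / len(words)) * math.log(len(docs) / doc_freq[w])
    for w, c in counts.items()
}
best = max(tfidf, key=tfidf.get)
```

Although `great` appears twice in the first document, `price` gets the highest score because it appears in no other document.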

In [82]:
fp = os.path.join('data', 'reviews.txt')
reviews = pd.read_csv(fp, header=None, squeeze=True)
review = open(os.path.join('data', 'review.txt'), encoding='utf8').read().strip()

In [83]:
review

'this is a great new case design that i have not seen before it has a slim silicone skin that really locks in the phone to cover and protect your phone from spills and such and also a hard polycarbonate outside shell cover to guard it against damage  this case also comes with different interchangeable skins and covers to create multiple color combinations  this is a different kind of case than the usual chunk of plastic  it is innovative and suits the iphone 5 perfectly'

In [84]:
#list(set(review.split()))

In [85]:
pd.Series(review.split()).unique()

array(['this', 'is', 'a', 'great', 'new', 'case', 'design', 'that', 'i',
       'have', 'not', 'seen', 'before', 'it', 'has', 'slim', 'silicone',
       'skin', 'really', 'locks', 'in', 'the', 'phone', 'to', 'cover',
       'and', 'protect', 'your', 'from', 'spills', 'such', 'also', 'hard',
       'polycarbonate', 'outside', 'shell', 'guard', 'against', 'damage',
       'comes', 'with', 'different', 'interchangeable', 'skins', 'covers',
       'create', 'multiple', 'color', 'combinations', 'kind', 'of',
       'than', 'usual', 'chunk', 'plastic', 'innovative', 'suits',
       'iphone', '5', 'perfectly'], dtype=object)

In [86]:
reviews

0        works great  i called t mobile and had this si...
1        these items looked to be of good quality and h...
2        this product arrive faster than i expected  i ...
3        i brought this for my sister who has a g2 but ...
4         i am both delighted and disappointed with del...
                               ...                        
58297    bought i for my husband's phone a week ago  ni...
58298    the spring loaded adjustable holder looks nift...
58299    received on time  everything working less spea...
58300    i bought this battery pack with the mindset of...
58301              that work    but thats not the original
Name: 0, Length: 58302, dtype: object

In [87]:
pd.DataFrame([[1,2,3],[1,2,3],[1,2,3]],index=['a','b','c'])[0]['a']#[0]

1

In [88]:
# def tfidf_data(review, reviews):
#     """
#     :Example:
#     >>> fp = os.path.join('data', 'reviews.txt')
#     >>> reviews = pd.read_csv(fp, header=None, squeeze=True)
#     >>> review = open(os.path.join('data', 'review.txt'), encoding='utf8').read().strip()
#     >>> out = tfidf_data(review, reviews)
#     >>> out['cnt'].sum()
#     85
#     >>> 'before' in out.index
#     True
#     """
#     words = reviews.str.lower().str.split().sum().unique()
#     df_dict = {'cnt':[],'tf':[],'idf':[]}
#     df = pd.DataFrame(index=words)
    
#     for w in words:
#         re_pat = '\\b%s\\b' % w
         
#         cnt = review.count(re_pat) 
#         df['cnt'][word] = cnt
        
#         tf = cnt / (review.count(' ') + 1)
#         df['tf'][word] = tf
        
#         idf = np.log(len(reviews) / reviews.str.contains(word).sum())
#         df['idf'][word] = idf
#     df['tfidf'] = df['tf'] * df['idf']
#     return df

In [89]:
w_dict = {}
split = review.split()
for w in split:
    if w in w_dict.keys():
        w_dict[w] += 1
    else:
        w_dict[w] = 1
#pd.DataFrame.from_dict(w_dict,orient='index',columns=['cnt'])

In [90]:
def tfidf_data(review, reviews):
    """
    :Example:
    >>> fp = os.path.join('data', 'reviews.txt')
    >>> reviews = pd.read_csv(fp, header=None, squeeze=True)
    >>> review = open(os.path.join('data', 'review.txt'), encoding='utf8').read().strip()
    >>> out = tfidf_data(review, reviews)
    >>> out['cnt'].sum()
    85
    >>> 'before' in out.index
    True
    """
    w_dict = {}
    split = review.split()
    for w in split:
        if w in w_dict.keys():
            w_dict[w] += 1
        else:
            w_dict[w] = 1
    df = pd.DataFrame.from_dict(w_dict,columns=['cnt'],orient='index')
    
    keys = list(w_dict.keys())
    vals = list(w_dict.values())
    
    df = df.assign(tf = np.array(vals) / sum(w_dict.values()))
    df = df.assign(idf = [np.log(len(reviews) / reviews.str.contains(w).sum()) for w in keys])
    df = df.assign(tfidf = df['tf'] * df['idf'])
    
    return df

In [91]:
def tfidf_data(review, reviews):
    """
    :Example:
    >>> fp = os.path.join('data', 'reviews.txt')
    >>> reviews = pd.read_csv(fp, header=None, squeeze=True)
    >>> review = open(os.path.join('data', 'review.txt'), encoding='utf8').read().strip()
    >>> out = tfidf_data(review, reviews)
    >>> out['cnt'].sum()
    85
    >>> 'before' in out.index
    True
    """
    split = review.split()
    words = list(set(split))
    # Lowercase the corpus once instead of once per word.
    lowered = reviews.str.lower()
    cnts = []
    tfs = []
    idfs = []
    for w in words:
        cnt = split.count(w)
        cnts.append(cnt)
        
        # Divide by the number of words; counting spaces overcounts here
        # because the review contains double spaces.
        tf = cnt / len(split)
        tfs.append(tf) 
        
        # Note: str.contains(w) matches substrings, so 'i' also matches 'is'.
        # A pattern such as r'\b%s\b' % w would match whole words only.
        idf = np.log(len(reviews) / lowered.str.contains(w).sum())
        idfs.append(idf)
    df_dict = {'cnt':cnts,'tf':tfs,'idf':idfs}
    df = pd.DataFrame(df_dict,index=words)
    df['tfidf'] = df['tf'] * df['idf']
    return df

In [92]:
fp = os.path.join('data', 'reviews.txt')
reviews = pd.read_csv(fp, header=None, squeeze=True)
review = open(os.path.join('data', 'review.txt'), encoding='utf8').read().strip()
out = tfidf_data(review, reviews)

In [93]:
out.head()

Unnamed: 0,cnt,tf,idf,tfidf
and,5,0.058824,0.219198,0.012894
new,1,0.011765,2.444468,0.028758
plastic,1,0.011765,2.72591,0.03207
i,1,0.011765,0.002679,3.2e-05
covers,1,0.011765,3.742829,0.044033


In [94]:
 out['cnt'].sum()

85

In [95]:
'before' in out.index

True

In [96]:
out.sort_values('tfidf',ascending=False).index[0]

'spills'

In [97]:
def relevant_word(out):
    """
    :Example:
    >>> fp = os.path.join('data', 'reviews.txt')
    >>> reviews = pd.read_csv(fp, header=None, squeeze=True)
    >>> review = open(os.path.join('data', 'review.txt'), encoding='utf8').read().strip()
    >>> out = tfidf_data(review, reviews)
    >>> relevant_word(out) in out.index
    True
    """

    return out.sort_values('tfidf',ascending=False).index[0]

In [98]:
fp = os.path.join('data', 'reviews.txt')
reviews = pd.read_csv(fp, header=None, squeeze=True)
review = open(os.path.join('data', 'review.txt'), encoding='utf8').read().strip()
out = tfidf_data(review, reviews)

In [99]:
relevant_word(out) in out.index

True

### Tweet Analysis: Internet Research Agency

The dataset `data/ira.csv` contains tweets tagged by Twitter as likely being posted by the *Internet Research Agency* (the tweet factory facing allegations for attempting to influence US political elections).

The questions in this section will focus on the following:
1. We will look at the hashtags present in the text and trends in their makeup.
2. We will prepare this dataset for modeling by creating features out of the text fields.

**Question 4 (HashTags)**

You may assume that a hashtag is any string without whitespace following a `#` (this is more permissive than Twitter's rules for hashtags; you are encouraged to go down this rabbit-hole to better figure out how to clean your data!).

* Create a function `hashtag_list` that takes in a column of tweet-text and returns a column containing the list of hashtags present in the tweet text. If a tweet doesn't contain a hashtag, the function should return an empty list.

* Create a function `most_common_hashtag` that takes in a column of hashtag-lists (the output above) and returns a column consisting of a single hashtag from the tweet-text. 
    - If the text has no hashtags, the entry should be `NaN`,
    - If the text has one distinct hashtag, the entry should contain that hashtag,
    - If the text has more than one hashtag, the entry should be the most common hashtag (among all hashtags in the column). If there is a tie for most common, any of the most common can be returned.
        - E.g. if the input column was: `pd.Series([[1, 2, 2], [3, 2, 3]])`, the output would be: `pd.Series([2, 2])`. Even though `3` was more common in the second list, `2` is the most common among all hashtags in the column.
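
The rule in the example above (rank by frequency across the whole column, then pick within each row) can be sketched in plain Python. The names `overall` and `pick` are illustrative, and an empty third row is added to show the `NaN` case.

```python
import math
from collections import Counter

hashtag_lists = [[1, 2, 2], [3, 2, 3], []]

# Frequency of every hashtag across the whole column.
overall = Counter(tag for tags in hashtag_lists for tag in tags)

def pick(tags):
    """Return the tag from this row that is most common column-wide."""
    if not tags:
        return math.nan
    return max(tags, key=lambda tag: overall[tag])

result = [pick(tags) for tags in hashtag_lists]
```

The second row yields `2` rather than `3` because `2` occurs three times across the column while `3` occurs only twice.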

In [100]:
fp = os.path.join('data', 'ira.csv')
ira = pd.read_csv(fp, names=['id', 'name', 'date', 'text'])

In [101]:
ira.head()

Unnamed: 0,id,name,date,text
0,3906258,ea85ac8be1e8ab479064ca4c0fe3ac6587f76b1ef97452...,2016-11-16 09:04,The Best Exercise To Lose Belly Fat In 2 weeks...
1,1051443,8e58ab0f46d273103d9e71aa92cdaffb6e330ec7d15ae5...,2016-12-24 04:31,RT @Philanthropy: Dozens of ‘hate groups’ have...
2,2823399,Room Of Rumor,2016-08-18 20:26,"Artificial intelligence can find, map poverty,..."
3,272878,San Francisco Daily,2016-03-18 19:28,Uber balks at rules proposed by world’s busies...
4,7697802,41bb9ae5991f53996752a0ab8dd36b543821abca8d5aed...,2016-07-30 15:44,RT @dirtroaddiva1: #IHatePokemonGoBecause he ...


In [102]:
ira.iloc[:5]['text'].sum()

'The Best Exercise To Lose Belly Fat In 2 weeks  https://t.co/oHFToG7rh6 #Exercise #LoseBellyFat #CatTV #TeenWolf… https://t.co/b4pr9gEx38RT @Philanthropy: Dozens of ‘hate groups’ have charity status, Chronicle study finds https://t.co/FxUBBHNlKyArtificial intelligence can find, map poverty, researchers say  #techUber balks at rules proposed by world’s busiest airport  #newsRT @dirtroaddiva1: #IHatePokemonGoBecause he  didn\'t let me do "that" for a Klondike bar.    Screw you Pokemon.  #PokesAreJokes. https://t.…'

In [103]:
def ht_grabber(row):
    # A hashtag is every run of non-whitespace characters following a '#'.
    hashtags = re.findall(r'#(\S+)', row)
    return hashtags

In [104]:
def hashtag_list(tweet_text):
    """
    :Example:
    >>> testdata = [['RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1']]
    >>> test = pd.DataFrame(testdata, columns=['text'])
    >>> out = hashtag_list(test['text'])
    >>> (out.iloc[0] == ['NLP', 'NLP1', 'NLP1'])
    True
    """
    
    return tweet_text.apply(ht_grabber)

In [105]:
testdata = [['RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1']]
test = pd.DataFrame(testdata, columns=['text'])
out = hashtag_list(test['text'])
out

0    [NLP, NLP1, NLP1]
Name: text, dtype: object

In [106]:
(out.iloc[0] == ['NLP', 'NLP1', 'NLP1'])

True

In [107]:
pd.Series(['NLP', 'NLP1', 'NLP1']).value_counts().index[0]

'NLP1'

In [108]:
def most_common_hashtag(tweet_lists):
    """
    :Example:
    >>> testdata = [['RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1']]
    >>> test = hashtag_list(pd.DataFrame(testdata, columns=['text'])['text'])
    >>> most_common_hashtag(test).iloc[0]
    'NLP1'
    """  
    hashtags = [tag for ht in tweet_lists for tag in ht]
    hashtags = pd.Series(hashtags)
    hashtags = hashtags.value_counts()
    def assign_helper(data):
        if len(data) == 0:
            return np.nan
        elif len(data) == 1:
            return data[0]
        else:
            return hashtags.loc[data].idxmax()
    
    return tweet_lists.apply(assign_helper)

In [109]:
testdata = [['RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1']]
test = hashtag_list(pd.DataFrame(testdata, columns=['text'])['text'])
most_common_hashtag(test).iloc[0]

'NLP1'

**Question 5 (Features)**

Now create a dataframe of features from the `ira` data. That is, create a function `create_features` that takes in the `ira` data and returns a dataframe with the same index as `ira` (i.e. the rows correspond to the same tweets) and the following columns:
* `num_hashtags` gives the number of hashtags present in a tweet,
* `mc_hashtags` gives the most common hashtag associated to a tweet (as given by the problem above),
* `num_tags` gives the number of tags a given tweet has (look for the presence of `@`),
* `num_links` gives the number of hyper-links present in a given tweet 
    - (a hyper-link is a string starting with `http(s)://` followed by one or more non-whitespace characters),
* A boolean column `is_retweet` that describes if the given tweet is a retweet (i.e. `RT`),
* A 'clean' text field `text` that contains the tweet text with:
    - The non-alphanumeric characters removed (except spaces),
    - All words should be separated by exactly one space,
    - The characters all lowercase,
    - All the meta-information above (Retweet info, tags, hyperlinks, hashtags) removed.

*Note:* You should make a helper function for each column.

*Note:* This will take a while to run on the entire dataset -- test it on a small sample first!

In [110]:
ira.head()

Unnamed: 0,id,name,date,text
0,3906258,ea85ac8be1e8ab479064ca4c0fe3ac6587f76b1ef97452...,2016-11-16 09:04,The Best Exercise To Lose Belly Fat In 2 weeks...
1,1051443,8e58ab0f46d273103d9e71aa92cdaffb6e330ec7d15ae5...,2016-12-24 04:31,RT @Philanthropy: Dozens of ‘hate groups’ have...
2,2823399,Room Of Rumor,2016-08-18 20:26,"Artificial intelligence can find, map poverty,..."
3,272878,San Francisco Daily,2016-03-18 19:28,Uber balks at rules proposed by world’s busies...
4,7697802,41bb9ae5991f53996752a0ab8dd36b543821abca8d5aed...,2016-07-30 15:44,RT @dirtroaddiva1: #IHatePokemonGoBecause he ...


In [111]:
prog = re.compile('^RT')
prog.search('RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1') is not None

True

In [112]:
re.sub('http(s)?://[^\s]+','','RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1')

'RT @DSC80: Text-cleaning is cool! #NLP  #NLP1 #NLP1'

In [113]:
re.sub('@+?([^\s]+)', '', 'RT @DSC80: Text-cleaning is cool! #NLP  #NLP1 #NLP1')

'RT  Text-cleaning is cool! #NLP  #NLP1 #NLP1'

In [114]:
re.sub('#+?([^\s]+)', '', 'RT  Text-cleaning is cool! #NLP  #NLP1 #NLP1')

'RT  Text-cleaning is cool!    '

In [115]:
re.sub('^RT','','RT  Text-cleaning is cool!    ')

'  Text-cleaning is cool!    '

In [116]:
re.sub(r'[^\w\s]', ' ', '  Text-cleaning is cool!    ')

'  Text cleaning is cool     '

In [117]:
'  Text cleaning is cool     '.lower().strip()

'text cleaning is cool'

In [118]:
def ht_cnt(row):
    # Same pattern as in hashtag_list, so the count agrees with the extracted list.
    hashtags = re.findall(r'#(\S+)', row)
    return len(hashtags)

In [119]:
def mc_ht(row):
    # Most common hashtag within a single tweet (not across the whole column).
    ht = re.findall(r'#(\S+)', row)
    data = pd.Series(ht)
    if len(data) == 0:
        return np.nan
    elif len(data.unique()) == 1:
        return data[0]
    else:
        return data.value_counts().index[0]

In [120]:
def tag_cnt(row):
    tags = re.findall(r'@(\S+)', row)
    return len(tags)

In [121]:
def link_cnt(row):
    tags = re.findall(r'https?://\S+', row)
    return len(tags)

In [122]:
def rt_helper(row):
    prog = re.compile('^RT')
    return prog.search(row) is not None

In [123]:
def cleaner(row):
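    s = re.sub(r'https?://\S+', '', row)   # hyperlinks
    s = re.sub(r'@\S+', '', s)             # tags
    s = re.sub(r'#\S+', '', s)             # hashtags
    s = re.sub(r'^RT', '', s)              # retweet marker
    s = re.sub(r'[^\w\s]|_', ' ', s)       # non-alphanumeric characters
    s = re.sub(r'\s+', ' ', s)             # separate words by exactly one space
    s = s.lower().strip()
    return s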

In [124]:
def create_features(ira):
    """
    :Example:
    >>> testdata = [['RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1']]
    >>> test = pd.DataFrame(testdata, columns=['text'])
    >>> out = create_features(test)
    >>> anscols = ['text', 'num_hashtags', 'mc_hashtags', 'num_tags', 'num_links', 'is_retweet']
    >>> ansdata = [['text cleaning is cool', 3, 'NLP1', 1, 1, True]]
    >>> ans = pd.DataFrame(ansdata, columns=anscols)
    >>> (out == ans).all().all()
    True
    """
    
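    num_hashtags = ira['text'].apply(ht_cnt)
    # Most common hashtag as defined in Question 4 (ranked across the whole column).
    mc_hashtags = most_common_hashtag(hashtag_list(ira['text']))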
    num_tags = ira['text'].apply(tag_cnt)
    num_links = ira['text'].apply(link_cnt)
    is_retweet = ira['text'].apply(rt_helper)
    text = ira['text'].apply(cleaner)
    
    df_dict = {'text':text, 'num_hashtags':num_hashtags, 'mc_hashtags':mc_hashtags, 'num_tags':num_tags, 'num_links':num_links, 'is_retweet':is_retweet}
    df = pd.DataFrame(df_dict)
    
    return df

In [125]:
testdata = [['RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1']]
test = pd.DataFrame(testdata, columns=['text'])
test

Unnamed: 0,text
0,RT @DSC80: Text-cleaning is cool! #NLP https:/...


In [126]:
out = create_features(test)
out

Unnamed: 0,text,num_hashtags,mc_hashtags,num_tags,num_links,is_retweet
0,text cleaning is cool,3,NLP1,1,1,True


In [127]:
anscols = ['text', 'num_hashtags', 'mc_hashtags', 'num_tags', 'num_links', 'is_retweet']
ansdata = [['text cleaning is cool', 3, 'NLP1', 1, 1, True]]
ans = pd.DataFrame(ansdata, columns=anscols)
(out == ans).all().all()

True

## Congratulations! You're done!

* Submit the lab on Gradescope