## Module submission header
### Submission preparation instructions 
_Completion of this header is mandatory; omitting it incurs a 2-point deduction on the assignment._ Only add plain text in the designated areas, i.e., replacing the relevant 'NA's. You must fill out all group member names and Drexel email addresses in the markdown list below, under the header __Module submission group__. You are required to describe any tutoring support received in completing this submission under the __Additional submission comments__ section at the bottom of the header. If no tutoring support was received, leave NA in place. You may also list other optional comments pertaining to the submission at the bottom. _Any disruption of this header's formatting will make your group liable to the 2-point deduction._

### Module submission group
- Group member 1
    - Name: Luke Hill
    - Email: lh967@drexel.edu


### Additional submission comments
- Tutoring support received: NA
- Other (other): NA

# Assignment Group 2

## Module D _(20 points)_

For this assignment, we'll be working with a previously-accessed subset of the [Lending Club Loan Dataset](https://www.kaggle.com/wordsforthewise/lending-club) located via the local path `'./data/loan_extra-small.csv'`.

__D1.__ _(3 points)_  First, complete the function to read the CSV file, noting that the first line contains the header. Return the `header` (a list of strings) and `loan_data` (a list of lists) as a pair for the function's output.

In [3]:
# D1:Function(3/3)

import csv

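def read_loans(file_path):
    # Open with newline="" as the csv module recommends; the with-block closes the file
    with open(file_path, "r", newline="") as loan_file:
        loan_reader = csv.reader(loan_file)
        header = next(loan_reader)      # first line holds the column names
        loan_data = list(loan_reader)   # remaining lines are the loan records
    return header, loan_data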

For reference, your output should be:
```
loan status:  Current

[('id', ''),
 ('member_id', ''),
 ('loan_amnt', '2500'),
 ('funded_amnt', '2500'),
 ('funded_amnt_inv', '2500'),
 ('term', '36 months'),
 ('int_rate', '13.56'),
 ('installment', '84.92'),
 ('grade', 'C'),
 ('sub_grade', 'C1'),
 ('emp_title', 'Chef'),
 ('emp_length', '10+ years'),
 ('home_ownership', 'RENT'),
 ('annual_inc', '55000'),
 ('verification_status', 'Not Verified'),
 ('issue_d', 'Dec-2018'),
 ('loan_status', 'Current'),
 ('pymnt_plan', 'n'),
 ('url', ''),
 ('desc', ''),
 ('purpose', 'debt_consolidation'),
 ('title', 'Debt consolidation'),
 ('zip_code', '109xx'),
 ('addr_state', 'NY'),
 ('dti', '18.24')]
```

In [4]:
# D1:SanityCheck

header, loan_data = read_loans('./data/loan_extra-small.csv')

print("loan status: ", loan_data[0][header.index("loan_status")])
list(zip(header,loan_data[0]))[:25]

loan status:  Current


[('id', ''),
 ('member_id', ''),
 ('loan_amnt', '2500'),
 ('funded_amnt', '2500'),
 ('funded_amnt_inv', '2500'),
 ('term', '36 months'),
 ('int_rate', '13.56'),
 ('installment', '84.92'),
 ('grade', 'C'),
 ('sub_grade', 'C1'),
 ('emp_title', 'Chef'),
 ('emp_length', '10+ years'),
 ('home_ownership', 'RENT'),
 ('annual_inc', '55000'),
 ('verification_status', 'Not Verified'),
 ('issue_d', 'Dec-2018'),
 ('loan_status', 'Current'),
 ('pymnt_plan', 'n'),
 ('url', ''),
 ('desc', ''),
 ('purpose', 'debt_consolidation'),
 ('title', 'Debt consolidation'),
 ('zip_code', '109xx'),
 ('addr_state', 'NY'),
 ('dti', '18.24')]
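
The header/data split in D1 can be tried without the real file. The following is a minimal sketch, assuming a made-up three-column sample held in memory; `sample_csv`, `read_rows`, `header_demo`, and `rows_demo` are hypothetical names used only for illustration.

```python
import csv
import io

# Hypothetical two-row sample standing in for the real CSV file.
sample_csv = "id,loan_amnt,loan_status\n,2500,Current\n,30000,Fully Paid\n"

def read_rows(handle):
    """Split a CSV stream into its header row and the remaining data rows."""
    reader = csv.reader(handle)
    header = next(reader)   # first line holds the column names
    rows = list(reader)     # everything after it is data
    return header, rows

header_demo, rows_demo = read_rows(io.StringIO(sample_csv))

# Look a column up by name instead of hard-coding its position.
status_col = header_demo.index("loan_status")
print(header_demo)
print(rows_demo[0][status_col])
```

Looking the column up with `header_demo.index(...)` mirrors the sanity check above and keeps the code independent of column order.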

__D2.__ _(4 points)_ Now complete the function below to create a dictionary named `statuses` whose keys are the entries in the `loan_status` field and whose values are boolean indicators (`1` or `0`) describing `1`- or `0`-loans, where loans that have `"Current"`, `"Fully Paid"`, or `"Issued"` in the `loan_status` field are `1`-loans, and all others are `0`-loans.

In [14]:
# D2:Function(4/4)

import re 

def categorize_satuses(loan_data, header):
    statuses = {}

    # Regular expression matching the 1-loan statuses; anything else is a 0-loan
    one_loan_regex = re.compile(r'^(Current|Fully Paid|Issued)$')
    
    # Extract the 'loan_status' column index
    loan_status_index = header.index('loan_status')
    
    # Iterate through 'loan_data' and categorize the loans
    for row in loan_data:
        loan_status = row[loan_status_index]
        is_one_loan = bool(one_loan_regex.match(loan_status))
        statuses[loan_status] = int(is_one_loan)
    
    return statuses

For reference, your output should be:
```
{'Current': 1,
 'Fully Paid': 1,
 'Charged Off': 0,
 'Late (16-30 days)': 0,
 'Late (31-120 days)': 0}
```

In [15]:
# D2:SanityCheck

statuses = categorize_satuses(loan_data, header)
statuses

{'Current': 1,
 'Fully Paid': 1,
 'Charged Off': 0,
 'Late (16-30 days)': 0,
 'Late (31-120 days)': 0}
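
Because the three `1`-loan labels are exact strings, the same mapping can also be built with plain set membership instead of a regular expression. This is only an alternative sketch, not the submitted approach; `ONE_LOAN_STATUSES`, `categorize_by_membership`, and the sample list are hypothetical.

```python
# Assumption: the three "1"-loan labels named in the D2 prompt.
ONE_LOAN_STATUSES = {"Current", "Fully Paid", "Issued"}

def categorize_by_membership(status_values):
    """Map each distinct status string to 1 (1-loan) or 0 (0-loan)."""
    return {status: int(status in ONE_LOAN_STATUSES) for status in status_values}

# Hypothetical status column with a repeated entry.
demo_statuses = categorize_by_membership(
    ["Current", "Charged Off", "Fully Paid", "Late (16-30 days)", "Current"]
)
print(demo_statuses)
```

Repeated statuses collapse into one key, just as they do in the dictionary returned by the function above.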

__D3.__ _(8 points)_ The `desc` field contains text descriptions of loans. Complete the function to tokenize each loan description and count the words for `1`- and `0`-loan descriptions, putting counts into two separate `Counter()` data structures according to each loan's status in `statuses`.

In [16]:
# D3:Function(6/8)

from nltk.tokenize import word_tokenize
from collections import Counter, defaultdict

def count_desc_words(loan_data, header):
    counts = defaultdict(Counter)
    loan_status_index = header.index('loan_status')
    desc_index = header.index('desc')
    one_loan_regex = re.compile(r'^(Current|Fully Paid|Issued)$')
    # Iterate through the loan data
    for row in loan_data:
        loan_status = row[loan_status_index]
        loan_desc = row[desc_index]

        # Tokenize the loan description
        words = word_tokenize(loan_desc)

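        # Classify the loan with the same status rule as D2 (True for 1-loans);
        # True/False hash like 1/0, so counts[1] and counts[0] reach these counters
        is_one_loan = bool(one_loan_regex.match(loan_status))
        counts[is_one_loan].update(words)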
    
    return counts

For reference, your output should be:
```
The top 25 'good/bad'-loan words are: 
1  (good, bad):  ('>', 102166) ('>', 18838)
2  (good, bad):  ('on', 55911) ('<', 10153)
3  (good, bad):  ('<', 54586) ('br', 10153)
4  (good, bad):  ('br', 54585) ('on', 10089)
5  (good, bad):  ('to', 50785) ('to', 8819)
6  (good, bad):  ('added', 47661) ('added', 8702)
7  (good, bad):  ('Borrower', 47584) ('Borrower', 8685)
8  (good, bad):  ('I', 39943) ('I', 6778)
9  (good, bad):  ('and', 31902) ('and', 5873)
10  (good, bad):  ('.', 30952) ('my', 5176)
11  (good, bad):  ('credit', 28660) ('.', 4902)
12  (good, bad):  ('my', 27339) ('credit', 4844)
13  (good, bad):  ('a', 25077) ('a', 3856)
14  (good, bad):  ('the', 19275) ('pay', 3538)
15  (good, bad):  ('pay', 19037) ('off', 3274)
16  (good, bad):  ('off', 18624) ('loan', 2992)
17  (good, bad):  ('loan', 18028) ('the', 2860)
18  (good, bad):  ('debt', 17377) ('debt', 2801)
19  (good, bad):  (',', 16079) (',', 2636)
20  (good, bad):  ('of', 14980) ('of', 2613)
21  (good, bad):  ('cards', 14459) ('cards', 2593)
22  (good, bad):  ('for', 13588) ('for', 2340)
23  (good, bad):  ('have', 13222) ('have', 2330)
24  (good, bad):  ('interest', 12357) ('card', 1861)
25  (good, bad):  ('card', 12345) ('consolidate', 1722)
```

In [18]:
# D3:SanityCheck
import nltk
nltk.download('punkt')
counts = count_desc_words(loan_data, header)
print('The top 25 \'good/bad\'-loan words are: ')
for good_word, bad_word, i in zip(counts[1].most_common(25), 
                                  counts[0].most_common(25),
                                  range(25)):
    print(i+1," (good, bad): ", 
          good_word, bad_word)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\lukeh\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


The top 25 'good/bad'-loan words are: 
1  (good, bad):  ('>', 102166) ('>', 18838)
2  (good, bad):  ('on', 55911) ('<', 10153)
3  (good, bad):  ('<', 54586) ('br', 10153)
4  (good, bad):  ('br', 54585) ('on', 10089)
5  (good, bad):  ('to', 50785) ('to', 8819)
6  (good, bad):  ('added', 47661) ('added', 8702)
7  (good, bad):  ('Borrower', 47584) ('Borrower', 8685)
8  (good, bad):  ('I', 39943) ('I', 6778)
9  (good, bad):  ('and', 31902) ('and', 5873)
10  (good, bad):  ('.', 30952) ('my', 5176)
11  (good, bad):  ('credit', 28660) ('.', 4902)
12  (good, bad):  ('my', 27339) ('credit', 4844)
13  (good, bad):  ('a', 25077) ('a', 3856)
14  (good, bad):  ('the', 19275) ('pay', 3538)
15  (good, bad):  ('pay', 19037) ('off', 3274)
16  (good, bad):  ('off', 18624) ('loan', 2992)
17  (good, bad):  ('loan', 18028) ('the', 2860)
18  (good, bad):  ('debt', 17377) ('debt', 2801)
19  (good, bad):  (',', 16079) (',', 2636)
20  (good, bad):  ('of', 14980) ('of', 2613)
21  (good, bad):  ('cards', 14459) 
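
The grouping pattern behind these counts, one `Counter` per label inside a `defaultdict`, can be seen on a toy input. This is a rough sketch under stated assumptions: `simple_tokenize` is a hypothetical regex stand-in for `word_tokenize` (it is not NLTK's algorithm), and `demo_loans` is invented data.

```python
import re
from collections import Counter, defaultdict

# Hypothetical (label, description) pairs; 1 = 1-loan, 0 = 0-loan.
demo_loans = [
    (1, "I want to pay off my credit cards"),
    (0, "I need help to pay my bills"),
    (1, "lower rate on my credit card"),
]

def simple_tokenize(text):
    """Stand-in tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

# A missing label gets a fresh Counter the first time it is touched.
demo_counts = defaultdict(Counter)
for label, desc in demo_loans:
    demo_counts[label].update(simple_tokenize(desc))

print(demo_counts[1].most_common(3))
print(simple_tokenize("a <br> b"))
```

With this stand-in, an HTML line break such as `<br>` splits into `<`, `br`, and `>`, which is consistent with those three tokens ranking so highly in the output above.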

Reviewing the most common words output for `1`- and `0`-loans, determine if there is an apparent difference in the words used and respond to the `Inline(2/8)` question below.

In [19]:
# D3:Inline(2/8)

# Just looking at the top 25 words for each category, 
# are more words the same than different?
# print either "Same" or "Different"
print("Same")

Same
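
The "Same" answer can be checked by counting rather than by eye. The sketch below copies the two top-25 word lists from the reference output for D3 and intersects them; the variable names are hypothetical.

```python
# Top-25 words copied from the D3 reference output (counts omitted).
good_top25 = [
    ">", "on", "<", "br", "to", "added", "Borrower", "I", "and", ".",
    "credit", "my", "a", "the", "pay", "off", "loan", "debt", ",", "of",
    "cards", "for", "have", "interest", "card",
]
bad_top25 = [
    ">", "<", "br", "on", "to", "added", "Borrower", "I", "and", "my",
    ".", "credit", "a", "pay", "off", "loan", "the", "debt", ",", "of",
    "cards", "for", "have", "card", "consolidate",
]

shared_words = set(good_top25) & set(bad_top25)      # in both lists
differing_words = set(good_top25) ^ set(bad_top25)   # in exactly one list

print(len(shared_words), "shared of", len(good_top25))
print(sorted(differing_words))
```

Only `interest` and `consolidate` fall outside the overlap, so 24 of the 25 words are shared.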


__D4.__ _(5 points)_ Considering the lack of discernibility between the `top_n` (a function parameter) word lists for each category, you must now complete the function, which is an exercise in sets, determining which words each category _exclusively_ uses.

In particular, you must remove the `0`-loan set's words from the `1`-loan set, and vice versa, storing these in the `good_notin_bad` and `bad_notin_good` sets, respectively. If the data are meaningful (an integrity check), one might expect more positive words in the `good_notin_bad` set and more negative words in the `bad_notin_good` set.

In [20]:
# D4:Function(4/5)

def get_set_differences(counts, top_n):
    good_notin_bad, bad_notin_good = set(), set()
    
    # Extract the top words from '1' (good loans) and '0' (bad loans) categories
    top_good_words = {word for word, _ in counts[1].most_common(top_n)}
    top_bad_words = {word for word, _ in counts[0].most_common(top_n)}
    
    # Calculate words exclusive to each category
    good_notin_bad = top_good_words - top_bad_words
    bad_notin_good = top_bad_words - top_good_words
    
    return good_notin_bad, bad_notin_good

For reference, your output should be:
```
({'!', 'it', 'lower', 'rate'}, {'bills', 'consolidation', 'help', 'need'})
```

In [21]:
# D4:SanityCheck

good_notin_bad, bad_notin_good = get_set_differences(counts, 50)
good_notin_bad, bad_notin_good

({'!', 'it', 'lower', 'rate'}, {'bills', 'consolidation', 'help', 'need'})
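
One caveat when reading these sets: a word is "exclusive" only relative to the `top_n` cutoff, so a word both groups use can still land in a difference set if it ranks just below the cutoff on one side. The following is an illustrative sketch with invented counts; `exclusive_top_words`, `good_toy`, and `bad_toy` are hypothetical.

```python
from collections import Counter

def exclusive_top_words(counter_a, counter_b, top_n):
    """Words in counter_a's top_n that are absent from counter_b's top_n."""
    top_a = {word for word, _ in counter_a.most_common(top_n)}
    top_b = {word for word, _ in counter_b.most_common(top_n)}
    return top_a - top_b

# Hypothetical toy counts: "rate" is used by both groups, just ranked differently.
good_toy = Counter({"pay": 9, "debt": 8, "rate": 7, "card": 6})
bad_toy = Counter({"pay": 9, "debt": 8, "card": 7, "rate": 6})

narrow = exclusive_top_words(good_toy, bad_toy, 3)  # cutoff splits "rate" and "card"
wide = exclusive_top_words(good_toy, bad_toy, 4)    # cutoff includes both words
print(narrow, wide)
```

With a cutoff of 3, `rate` looks exclusive to the first group; with a cutoff of 4, the difference is empty. Trying a few `top_n` values is one way to gauge how stable the exclusive words are.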

In [23]:
# D4:Inline(1/5)

# Now that we can see the separately-used words, do they appear
# meaningfully-different with respect to the `loan_status` field?
# Print either "Meaningful" or "Not meaningful"
print("Meaningful")

Meaningful
