## Lab 1: Data in Python

The data for this activity can be downloaded here (under "Data Sets"): https://www.statlearning.com/resources-python

As a first step, create a folder called "data" in the same directory as this notebook and place the Credit.csv file inside it.

If you open the ".gitignore" file, you'll notice that one of the entries is "data", which keeps your data files from being shared on GitHub.

In [158]:
import pandas as pd
import numpy as np
Credit = pd.read_csv('data/Credit.csv')
Credit

### What are the feature names?

In [159]:
print(list(Credit.columns))

['Income', 'Limit', 'Rating', 'Cards', 'Age', 'Education', 'Own', 'Student', 'Married', 'Region', 'Balance']


### What are the data types of the feature values (i.e., numeric, categorical)?

In [183]:
numeric = ['Income', 'Limit', 'Rating', 'Cards']
ordinal = ['Age', 'Education']
categorical = ['Region']
binary = ['Own', 'Student', 'Married']
print(f"Numeric: {numeric}\nOrdinal: {ordinal}\nCategorical: {categorical}\nBinary: {binary}")

Numeric: ['Income', 'Limit', 'Rating', 'Cards']
Ordinal: ['Age', 'Education']
Categorical: ['Region']
Binary: ['Own', 'Student', 'Married']


### Find the mean, median, and mode of the feature "Age."

In [161]:
# mean
age_array = Credit["Age"].tolist()
def mean(array):
    return sum(array) / len(array)

print(mean(age_array))

55.6675


In [162]:
# median
age_array.sort()

def median(array):
    # Assumes array is already sorted in ascending order.
    n = len(array)
    if n % 2 == 1:
        return array[n // 2]
    else:
        # Even length: average the two middle elements.
        middle_1 = array[n // 2 - 1]
        middle_2 = array[n // 2]
        return (middle_1 + middle_2) / 2

print(median(age_array))

56.0


In [163]:
# mode

def mode(array):
    # Assumes array is sorted, so equal values sit in one contiguous run.
    # Ties go to the smallest value: only a strictly longer run replaces the mode.
    mode_index = 0
    best_count = 0
    run_start = 0

    while run_start < len(array):
        # Advance run_end to the first element that differs from array[run_start].
        run_end = run_start
        while run_end < len(array) and array[run_end] == array[run_start]:
            run_end += 1
        count = run_end - run_start
        if count > best_count:
            mode_index = run_start
            best_count = count
        run_start = run_end

    return array[mode_index]

print(mode(age_array))

44
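
The hand-written `mean`, `median`, and `mode` functions can be cross-checked against Python's built-in `statistics` module. The sketch below uses a small made-up list of ages (not values read from Credit.csv), so the numbers are only illustrative.

```python
import statistics

# Hypothetical sample of ages, not taken from Credit.csv.
ages = [34, 82, 71, 36, 68, 44, 44, 30]

sample_mean = statistics.mean(ages)      # arithmetic mean
sample_median = statistics.median(ages)  # averages the two middle values for even lengths
sample_mode = statistics.mode(ages)      # most common value

print(sample_mean, sample_median, sample_mode)
```

Unlike the functions above, `statistics.median` and `statistics.mode` do not require the input to be sorted first.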


### Create a subset of Credit that only contains the columns Age, Region, and Income. Call it "Subset."

In [164]:
Subset = Credit[["Age", "Region", "Income"]]
print(Subset)

     Age Region   Income
0     34  South   14.891
1     82   West  106.025
2     71   West  104.593
3     36   West  148.924
4     68  South   55.882
..   ...    ...      ...
395   32  South   12.096
396   65   East   13.364
397   67  South   57.872
398   44  South   37.728
399   64   West   18.701

[400 rows x 3 columns]


### From this subset, create two partitions: 
- One that contains records for Ages below 50, and one that contains Ages 50+
- Call them "Younger" and "Older"

In [165]:
Younger = Subset[Subset['Age'].lt(50)]
print(Younger)

     Age Region   Income
0     34  South   14.891
3     36   West  148.924
6     37   East   20.996
9     41   East   71.061
10    30  South   63.095
..   ...    ...      ...
391   43  South   73.327
392   24   West   25.974
394   40  South   49.794
395   32  South   12.096
398   44  South   37.728

[161 rows x 3 columns]


In [166]:
Older = Subset[Subset['Age'].ge(50)]
print(Older)

     Age Region   Income
1     82   West  106.025
2     71   West  104.593
4     68  South   55.882
5     77  South   80.180
7     87   West   71.408
..   ...    ...      ...
390   81   West  135.118
393   65   East   17.316
396   65   East   13.364
397   67  South   57.872
399   64   West   18.701

[239 rows x 3 columns]
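
As a quick sanity check, the two partitions should be complementary: every row belongs to exactly one of them (here 161 + 239 = 400). The sketch below shows the same idea on a made-up frame, using a boolean mask and its negation. The `toy` frame and its values are assumptions for illustration, and it assumes there are no missing ages.

```python
import pandas as pd

# Hypothetical toy frame; the column names mirror Subset.
toy = pd.DataFrame({"Age": [34, 82, 50, 49], "Income": [14.9, 106.0, 55.9, 20.0]})

is_younger = toy["Age"] < 50   # boolean mask, one entry per row
younger = toy[is_younger]
older = toy[~is_younger]       # ~ negates the mask, i.e. Age >= 50 when no ages are missing

# Every row should land in exactly one partition.
print(len(younger), len(older), len(younger) + len(older) == len(toy))
```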


### Use these partitions to compare the mean and median Income for each age group (less than 50 vs. 50+)

In [167]:
# mean of younger
younger_income = Younger["Income"].tolist()
younger_income.sort()
print(mean(younger_income))

41.65203726708074


In [168]:
# mean of older
older_income = Older["Income"].tolist()
older_income.sort()
print(mean(older_income))

47.62165690376569


In [169]:
# median of younger
print(median(younger_income))

31.861


In [170]:
# median of older
print(median(older_income))

33.694


### What does the difference between mean and median suggest about the distribution of income?

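In both groups the mean income is well above the median (41.65 vs. 31.86 for the younger group, 47.62 vs. 33.69 for the older group). This suggests that income is skewed to the right: a small number of high earners pull the mean up, while the median stays close to the typical value. The gap is larger in the older group (about 13.9 vs. 9.8), so its right tail appears somewhat heavier. The older group also has a slightly higher mean and median, but that difference alone does not show that income rises with age.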
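
To see how a long right tail separates the mean from the median, here is a minimal sketch with made-up incomes (not values from Credit.csv). A single large value moves the mean a lot but barely affects the median.

```python
import statistics

# Hypothetical incomes; the one large value creates a long right tail.
incomes = [20, 25, 30, 35, 40, 150]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

# A mean above the median is consistent with right (positive) skew.
print(mean_income, median_income, mean_income > median_income)
```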

### What are the possible values for Region?

In [171]:
region_array = Credit["Region"].tolist()
options = list()
for region in region_array:
    if region not in options:
        options.append(region)
print(options)

['South', 'West', 'East']


### Use the value_counts method to build a frequency table for the possible values of Region.

In [172]:
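# Compute the counts and proportions once instead of once per region.
counts = Credit["Region"].value_counts()
proportions = Credit["Region"].value_counts(normalize=True)

row_data = list()
for region in options:
    row_data.append({"Count": counts[region], "Frequency": proportions[region]})

Frequency = pd.DataFrame(row_data, index=options)
print(Frequency)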

       Count  Frequency
South    199     0.4975
West     102     0.2550
East      99     0.2475


### Compare the average income for students vs. non-students.

In [173]:
students = Credit[Credit['Student'].eq("Yes")]
non_students = Credit[Credit['Student'].eq("No")]

In [174]:
print(f"The average income for students is {mean(students['Income'].tolist())}")

The average income for students is 47.29205


In [175]:
print(f"The average income for non-students is {mean(non_students['Income'].tolist())}")

The average income for non-students is 44.98853333333333
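
The same comparison can be made in one step with `groupby`, which splits the rows by a column and aggregates within each group. The sketch below runs on a made-up frame (the `toy` data is an assumption, not taken from Credit.csv). On the real data, the equivalent call would be `Credit.groupby("Student")["Income"].mean()`.

```python
import pandas as pd

# Hypothetical toy data with the same column names as Credit.
toy = pd.DataFrame({
    "Student": ["Yes", "No", "No", "Yes", "No"],
    "Income": [10.0, 20.0, 30.0, 30.0, 40.0],
})

# Split by Student, then compute the mean and median Income within each group.
summary = toy.groupby("Student")["Income"].agg(["mean", "median"])
print(summary)
```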


Often, banks will use a variable called _utilization_ to assess credit risk.
Utilization is the ratio of balance to limit (i.e., the fraction of the credit limit that is being used).

### Add a feature to the Credit dataset to measure utilization

In [176]:
Credit["Utilization"] = Credit["Balance"] / Credit["Limit"]

In [177]:
print(Credit['Utilization'])

0      0.092346
1      0.135892
2      0.081979
3      0.101431
4      0.067592
         ...   
395    0.136585
396    0.125065
397    0.033086
398    0.000000
399    0.174873
Name: Utilization, Length: 400, dtype: float64


### What is the maximum utilization value in this set? 

In [178]:
print(Credit['Utilization'].max())

0.27693008426326576
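
To see which record has the maximum rather than only its value, `idxmax` returns the index label of the largest entry. This is a minimal sketch on made-up balances and limits (the `toy` frame is an assumption for illustration).

```python
import pandas as pd

# Hypothetical balances and limits, not taken from Credit.csv.
toy = pd.DataFrame({"Balance": [100, 300, 0], "Limit": [1000, 1500, 2000]})
toy["Utilization"] = toy["Balance"] / toy["Limit"]

top = toy["Utilization"].idxmax()   # index label of the row with the largest utilization
print(top)
print(toy.loc[top])
```

On the real data, `Credit.loc[Credit["Utilization"].idxmax()]` would show the full row behind the maximum.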
