
# IFI 8420 - Assignment 3: Logistic Regression

## (Group Submission)

**Note:** Quiz 3B in iCollege will be based on this assignment. Have your R program ready to run when you take the quiz, because you will need its output to answer the questions. Quiz 3B does not require the lockdown browser.
    


## Part 1 (100 points)

Analyze the data in the **CreditCard** dataset in the `AER` package. (Note that you have to install the `AER` package and any other additional packages required by `AER`.)
    


### Variables in the dataset:

1. **card**: Was the application for a card accepted? (Binary: yes/no) - Response Variable
2. **reports**: Number of major derogatory reports 
3. **income**: Yearly income (in USD 10,000)
4. **age**: Age in years plus 12ths of a year 
5. **owner**: Does the individual own their home? 
6. **dependents**: Number of dependents 
7. **months**: Months living at the current address
8. **share**: Ratio of monthly credit card expenditure to yearly income
9. **selfemp**: Is the individual self-employed?
10. **majorcards**: Number of major credit cards held
11. **active**: Number of active credit accounts
12. **expenditure**: Average monthly credit card expenditure
    

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

df = pd.read_csv(r"C:\Users\chase\Dropbox (Old)\My PC (LAPTOP-P3ARPLF9)\Downloads\CreditCard.csv")


### A. Provide summary statistics of the predictors. (5 points)

In [11]:

# Display summary statistics of predictors
predictors = ['reports', 'income', 'age', 'owner', 'dependents', 'months', 'share']
df[predictors].describe()
    

| | reports | income | age | dependents | months | share |
|---|---|---|---|---|---|---|
| count | 1319.0 | 1319.0 | 1319.0 | 1319.0 | 1319.0 | 1319.0 |
| mean | 0.456406 | 3.365376 | 33.213103 | 0.993935 | 55.267627 | 0.068732 |
| std | 1.345267 | 1.693902 | 10.142783 | 1.247745 | 66.271746 | 0.094656 |
| min | 0.0 | 0.21 | 0.166667 | 0.0 | 0.0 | 0.000109 |
| 25% | 0.0 | 2.24375 | 25.41667 | 0.0 | 12.0 | 0.002316 |
| 50% | 0.0 | 2.9 | 31.25 | 1.0 | 30.0 | 0.038827 |
| 75% | 0.0 | 4.0 | 39.41667 | 2.0 | 72.0 | 0.093617 |
| max | 14.0 | 13.5 | 83.5 | 6.0 | 540.0 | 0.90632 |


### B. Consider only data with `age > 18` for the rest of the analysis. (5 points)

In [12]:

# .copy() lets later cells add columns without a SettingWithCopyWarning
df = df[df['age'] > 18].copy()
    

### C. Plot of income vs. reports: mark individuals with card application accepted as blue, and not accepted as red. (5 points)

In [13]:

import matplotlib.pyplot as plt
import seaborn as sns

# `card` holds the strings 'yes'/'no', so the palette keys must match those values
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='income', y='reports', hue='card', palette={'yes': 'blue', 'no': 'red'})
plt.title('Income vs. Reports')
plt.show()

### D. Boxplots of income and reports as a function of card acceptance status. (5 points)

In [None]:

fig, axes = plt.subplots(1, 2, figsize=(12, 6))
# Palette keys match the 'yes'/'no' values stored in `card`
sns.boxplot(x='card', y='income', data=df, ax=axes[0], palette={'yes': 'blue', 'no': 'red'})
sns.boxplot(x='card', y='reports', data=df, ax=axes[1], palette={'yes': 'blue', 'no': 'red'})
axes[0].set_title('Income by Card Acceptance')
axes[1].set_title('Reports by Card Acceptance')
plt.show()
    

### E. Construct the histogram for the predictors. (5 points)

In [None]:

df[predictors].hist(figsize=(12,10), bins=20)
plt.show()
    

### F. Transform `share` and `reports` due to skewness.

In [None]:

import numpy as np

df['log_share'] = np.log(df['share'])
df['log_reports'] = np.log(df['reports'] + 1)
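
The effect of the log transform on skewness can be checked numerically. The sketch below is a dependency-free illustration, not part of the assignment solution: the `skewness` helper and the `raw` values are hypothetical, chosen only to resemble a count variable such as `reports` with many zeros and one large value.

```python
import math

def skewness(values):
    """Population skewness: third central moment over the 1.5 power of the variance."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed counts (not taken from the CreditCard data)
raw = [0, 0, 0, 0, 0, 0, 1, 1, 2, 14]

# log1p(v) equals log(v + 1), the same shift used for `reports` so zeros stay finite
logged = [math.log1p(v) for v in raw]

print(f"skewness before: {skewness(raw):.2f}")
print(f"skewness after:  {skewness(logged):.2f}")
```

The transformed values are still right-skewed, but less so, which is the usual motivation for applying the transform before fitting.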
    

### G. Logistic Regression with predictors 2 to 8. (5 points)

In [None]:

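from sklearn.linear_model import LogisticRegression

# Predictors 2 to 8; `share` enters on the log scale from Part F
X = df[['reports', 'income', 'age', 'owner', 'dependents', 'months', 'log_share']].copy()
# Encode the yes/no columns as 1/0 so scikit-learn can fit on them
X['owner'] = (X['owner'] == 'yes').astype(int)
y = (df['card'] == 'yes').astype(int)

# Raise max_iter so the solver has room to converge on unscaled predictors
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

print(model.intercept_, dict(zip(X.columns, model.coef_[0])))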
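
A fitted logistic regression turns a linear score into a probability through the logistic function. The sketch below shows that mapping in plain Python. The intercept and coefficients are hypothetical placeholders, not values fitted to the CreditCard data; with scikit-learn the fitted values would come from `model.intercept_` and `model.coef_`.

```python
import math

def logistic_probability(intercept, coefficients, features):
    """Return P(y = 1 | features) for a logistic regression with the given parameters."""
    linear_score = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-linear_score))

# Hypothetical parameters for two predictors (reports, income); not fitted values
intercept = 1.0
coefficients = [-1.2, 0.3]

# Same income, different number of derogatory reports
p_clean = logistic_probability(intercept, coefficients, [0, 3.0])
p_risky = logistic_probability(intercept, coefficients, [4, 3.0])

print(f"no reports:   {p_clean:.3f}")
print(f"four reports: {p_risky:.3f}")
```

A negative coefficient lowers the predicted acceptance probability as the predictor grows, which is how the sign of each fitted coefficient can be read.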
    

### H. Convert probabilities into class labels and compute confusion matrix. (15 points)

In [None]:

from sklearn.metrics import confusion_matrix, accuracy_score

y_pred = (model.predict_proba(X)[:, 1] > 0.5).astype(int)
print(confusion_matrix(y, y_pred))
print(f'Accuracy: {accuracy_score(y, y_pred)}')
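
The thresholding and counting behind the confusion matrix can be written out by hand. This is a minimal sketch with hypothetical labels and probabilities; `confusion_counts` is an illustrative helper, not a scikit-learn function. The returned order follows the `[[tn, fp], [fn, tp]]` layout that `confusion_matrix` uses for 0/1 labels.

```python
def confusion_counts(y_true, probabilities, threshold=0.5):
    """Threshold probabilities into 0/1 labels and count (tn, fp, fn, tp)."""
    tn = fp = fn = tp = 0
    for actual, p in zip(y_true, probabilities):
        predicted = 1 if p > threshold else 0
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Hypothetical true labels (1 = accepted) and predicted acceptance probabilities
y_true = [1, 1, 1, 0, 0, 1]
probs = [0.9, 0.6, 0.4, 0.2, 0.7, 0.8]

tn, fp, fn, tp = confusion_counts(y_true, probs)
accuracy = (tn + tp) / len(y_true)

print([[tn, fp], [fn, tp]])
print(f"Accuracy: {accuracy:.3f}")
```

Changing `threshold` trades false positives against false negatives, so the 0.5 cutoff is a convention rather than a requirement.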
    

### I. Fit logistic regression model using training data (1-1000), test on remaining. (20 points)

In [None]:

# Rows 1-1000 form the training set; the remaining rows form the test set
X_train, y_train = X.iloc[:1000], y.iloc[:1000]
X_test, y_test = X.iloc[1000:], y.iloc[1000:]

model.fit(X_train, y_train)
y_test_pred = (model.predict_proba(X_test)[:, 1] > 0.5).astype(int)

print(confusion_matrix(y_test, y_test_pred))
print(f'Accuracy: {accuracy_score(y_test, y_test_pred)}')
    

### J. Apply Discriminant Analysis, Nearest Neighbors, and Naïve Bayes, compare models. (20 points)

In [None]:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

models = {
    'LDA': LinearDiscriminantAnalysis(),
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'Naive Bayes': GaussianNB()
}

for name, mdl in models.items():
    mdl.fit(X_train, y_train)
    preds = mdl.predict(X_test)
    print(f'--- {name} ---')
    print(confusion_matrix(y_test, preds))
    print(f'Accuracy: {accuracy_score(y_test, preds)}')
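
Nearest-neighbor classifiers rely on distances, so predictors measured on large scales (such as `months`) can dominate predictors on small scales (such as `income`). The sketch below illustrates this with four hypothetical applicants. The helpers are illustrative only; `standardize` mirrors what the `StandardScaler` imported at the top of the notebook does to each column (subtract the mean, divide by the population standard deviation).

```python
import math
import statistics

def standardize(column):
    """Rescale a column to mean 0 and population standard deviation 1."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(v - mean) / std for v in column]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_to_first(points):
    """Index of the point closest to points[0], excluding points[0] itself."""
    return min(range(1, len(points)), key=lambda i: euclidean(points[0], points[i]))

# Hypothetical applicants: income (USD 10,000) and months at the current address
income = [2.0, 9.0, 2.1, 5.0]
months = [12.0, 14.0, 30.0, 60.0]

raw = list(zip(income, months))
scaled = list(zip(standardize(income), standardize(months)))

print("nearest neighbor of applicant 0 (raw):   ", nearest_to_first(raw))
print("nearest neighbor of applicant 0 (scaled):", nearest_to_first(scaled))
```

On the raw scale the neighbor is chosen almost entirely by `months`; after standardizing, the applicant with a similar income becomes the nearest. Whether scaling improves the KNN results on the actual data would need to be checked on the test set.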
    

### K. Comparing models - Final selection and validation. (15 points)

In [None]:

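# Model comparison summary: includes the logistic regression fitted in Part I
def compare_models():
    candidates = {'Logistic Regression': model, **models}
    # Test-set accuracy for every candidate model
    scores = {name: accuracy_score(y_test, mdl.predict(X_test)) for name, mdl in candidates.items()}
    best_model = max(scores, key=scores.get)
    return f'The best performing model is {best_model} (test accuracy {scores[best_model]:.3f}).'

print(compare_models())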
    