# Introduction to Data Science, CS 5963 / Math 3900
## Lab 9: Regression

In this lab, we'll use the Python package [Statsmodels](http://statsmodels.sourceforge.net/) to study a dataset related to credit cards.
We'll use the 'Credit' dataset, available 
[here](http://www-bcf.usc.edu/~gareth/ISL/data.html). 
This dataset consists of some credit card information for 400 people. 

Of course, a *credit card* is a card issued to a person ("cardholder"), typically by a bank, that can be used as a method of payment. The card allows the cardholder to borrow money from the bank to pay for goods and services. Credit cards have a *limit*, the maximum amount you can borrow, which is determined by the bank. The limit is determined from information collected from the cardholder (income, age, ...) and especially (as we will see) the cardholder's "credit rating". The *credit rating* is an evaluation of (1) the ability of the cardholder to pay back the borrowed money and (2) the likelihood of the cardholder defaulting on the borrowed money. 

Our focus will be on the use of regression tools to study this dataset. Ideally, we'd like to understand what factors determine *credit ratings* and *credit limits*. We can think about this either from the point of view of (1) a bank that wants to protect its investments by minimizing credit defaults or (2) a person who is trying to increase their credit rating and/or credit limit. A difficulty we will encounter is incorporating categorical data into regression models.  

In [None]:
import pandas as pd
import statsmodels.formula.api as sm
import scipy as sc

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (10, 6)

##  Import data from Credit.csv file

In [None]:
credit = pd.read_csv('Credit.csv',index_col=0) #load data
print(credit[:10])

## Summarize and describe the data

In [None]:
print(credit.dtypes, '\n') 
print(credit.describe(), '\n')
print(credit['Gender'].value_counts(), '\n')
print(credit['Student'].value_counts(), '\n')
print(credit['Married'].value_counts(), '\n')
print(credit['Ethnicity'].value_counts())

The column names of this data are:  
1. Income 
+ Limit  
+ Rating  
+ Cards
+ Age  
+ Education  
+ Gender (categorical: Male, Female)
+ Student (categorical: Yes, No)
+ Married (categorical: Yes, No)
+ Ethnicity (categorical: Caucasian, Asian, African American) 
+ Balance

**Question:** What is wrong with the income data? How can it be fixed? 

The file 'Credit.csv' is a comma-separated file. Perhaps a period was used instead of a comma as the thousands separator in Income so that it wouldn't be confused with the field separator. Or maybe this dataset follows a European number convention. Or maybe income is simply measured in units of \$1k. To fix the income data, we can use the Pandas Series 'map' function. 


In [None]:
credit["Income"] = credit["Income"].map(lambda x: 1000*x)
print(credit[:10])

We can also look at how the variables vary together. There are two ways to do this:
1. Visually: Make a scatter matrix of the data
+ Quantitatively: Compute the correlation matrix. For each pair of variables $x, y$ with observations $(x_i, y_i)$, $i = 1, \dots, n$, we compute $\frac{\sum_i (x_i - \bar x) (y_i - \bar y)}{(n-1)\, s_x s_y}$, where $\bar x, \bar y$ are the sample means and $s_x, s_y$ are the sample standard deviations
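To make the correlation computation concrete, here is a minimal sketch that evaluates the sample correlation coefficient by hand (summed cross-deviations divided by $(n-1)$ times the two sample standard deviations) and compares it with NumPy's built-in. The helper name `pearson_r` and the numbers are made up for illustration; they are not part of the Credit data.

```python
import numpy as np

def pearson_r(x, y):
    """Sample correlation coefficient of two equal-length 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # Sum of products of deviations from the sample means
    cross = np.sum((x - x.mean()) * (y - y.mean()))
    # ddof=1 gives the sample standard deviation (divides by n - 1)
    return cross / ((n - 1) * x.std(ddof=1) * y.std(ddof=1))

# Made-up numbers, only to exercise the function
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]  # NumPy's correlation matrix, off-diagonal entry
print(r_manual, r_numpy)
```

The two numbers should agree up to rounding, which is a quick way to check that the formula is being read correctly.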

In [None]:
print(credit.select_dtypes('number').corr())  # correlations among the numeric columns only
pd.plotting.scatter_matrix(credit, figsize=(10, 10), diagonal='hist'); # trick: semi-colon prevents output

**Observations:**
1. Limit and Rating are highly correlated ($99.7\%$)  
+ Income strongly correlates with Limit ($79\%$) and Rating ($79\%$)
+ Balance correlates with Limit ($86\%$) and Rating ($86\%$)
+ There are "weird stripes" in some of the data. Why? 
+ Categorical information doesn't appear in this plot. Why? How can I visualize the categorical variables?

In [None]:
# Plot Categorical variables: Gender, Student, Married, Ethnicity
fig, axes = plt.subplots(nrows=2, ncols=2,figsize=(10,10))
credit["Gender"].value_counts().plot(kind='pie',ax=axes[0,0]);
credit["Student"].value_counts().plot(kind='pie',ax=axes[1,0]);
credit["Married"].value_counts().plot(kind='pie',ax=axes[0,1]);
credit["Ethnicity"].value_counts().plot(kind='pie',ax=axes[1,1]);

## A first regression model

We first regress Limit on Rating: 
$$
\text{Limit} = \beta_0 + \beta_1 \text{Rating}. 
$$
Since credit ratings are primarily used by banks to determine credit limits, we expect Rating to be highly predictive of Limit, so this regression should fit very well. 

We use the 'ols' function from the statsmodels python library. 


In [None]:
model = sm.ols(formula="Limit ~ Rating", data=credit)
model_result = model.fit()
print(model_result.summary())

As we might have expected, the credit limit is almost entirely predicted by credit rating in this dataset.
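For intuition about what a one-predictor least-squares fit computes, the coefficients have a closed form: $\beta_1 = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2}$ and $\beta_0 = \bar y - \beta_1 \bar x$. The sketch below evaluates these on synthetic stand-in data (not the Credit data; the slope, intercept, and noise level are arbitrary assumptions) and also computes $R^2$, the quantity reported as `R-squared` in the regression summary.

```python
import numpy as np

# Synthetic stand-in data (NOT the Credit data): y is nearly linear in x
rng = np.random.default_rng(0)
x = rng.uniform(100, 900, size=200)
y = -500 + 15 * x + rng.normal(0, 50, size=200)

# Closed-form least-squares estimates for y = beta0 + beta1 * x
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
residuals = y - (beta0 + beta1 * x)
r_squared = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
print(beta0, beta1, r_squared)
```

An $R^2$ close to 1 means the predictor explains nearly all of the variation in the response, which is the situation we see for Limit and Rating.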

Since Rating and Limit are almost the same variable, next we'll forget about Limit and just try to predict Rating from the real-valued variables (non-categorical variables): Income, Cards, Age, Education, Balance. 

**Exercise:** Develop a multilinear regression model to predict Rating. 
Interpret the results.

In [None]:
# your code here 


Which independent variables are good/bad predictors? 

**Observations:**

1. ... 
+ ...

## Incorporating categorical variables into regression models

We have four categorical variables (Gender, Student, Married, Ethnicity). How can we include them in a regression model? 

Let's start with a categorical variable with only 2 categories: Gender (Male, Female).

Idea: Create a "dummy variable" that turns Gender into a real value: 
$$
\text{Gender_num}_i = \begin{cases} 
1 & \text{if $i$-th person is female} \\
0 & \text{if $i$-th person is male}
\end{cases}. 
$$
Then we could try the model 
$$
\text{Balance} = \beta_0 + \beta_1 \text{Gender_num}. 
$$

In [None]:
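# Encode the two-level categorical variables as 0/1 dummy variables.
# The leading space in ' Male' is intentional: the key must match the string
# exactly as it is stored in the file (check with credit["Gender"].unique()).
credit["Gender_num"] = credit["Gender"].map({' Male':0, 'Female':1})
credit["Student_num"] = credit["Student"].map({'Yes':1, 'No':0})
credit["Married_num"] = credit["Married"].map({'Yes':1, 'No':0})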

model = sm.ols(formula="Balance ~ Gender_num", data=credit)
model_result = model.fit()
print(model_result.summary())

Since the $p$-value for the Gender_num coefficient is very large, we cannot reject the null hypothesis $\beta_1 = 0$: the data provide no evidence of a difference in credit card balance between genders.
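It can help to see what the coefficients of a regression on a single 0/1 dummy variable mean: the intercept $\beta_0$ is the mean response in the baseline group (dummy = 0), and $\beta_1$ is the difference between the two group means. The sketch below checks this on a few made-up balances (not the Credit data), using NumPy's least-squares solver rather than statsmodels.

```python
import numpy as np

# Made-up balances for two groups (NOT the Credit data)
balance = np.array([520.0, 480.0, 610.0, 450.0, 530.0, 500.0, 470.0])
dummy = np.array([0, 0, 0, 1, 1, 1, 1])  # 0 = baseline group, 1 = other group

# Least-squares fit of balance = beta0 + beta1 * dummy
X = np.column_stack([np.ones_like(balance), dummy])
(beta0, beta1), *_ = np.linalg.lstsq(X, balance, rcond=None)

mean_baseline = balance[dummy == 0].mean()
mean_other = balance[dummy == 1].mean()
print(beta0, mean_baseline)               # intercept vs. baseline mean
print(beta1, mean_other - mean_baseline)  # slope vs. difference in means
```

So testing whether $\beta_1 = 0$ amounts to testing whether the two groups have the same mean balance.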

**Exercise**: Try to find a meaningful relationship in the data, for example, age vs. education, gender vs. married, etc...

In [None]:
# your code here 


## What about a categorical variable with 3 categories? 

The Ethnicity variable takes three values: Caucasian, Asian, and African American. 

What's wrong with the following?  
$$
\text{Ethnicity_num}_i = \begin{cases} 
0 & \text{if $i$-th person is Caucasian} \\
1 & \text{if $i$-th person is Asian} \\ 
2 & \text{if $i$-th person is African American}
\end{cases}. 
$$

Hint: Recall the Nominal, Ordinal, Interval, and Ratio variable types from Lecture 2.

We'll need more than one dummy variable:  
$$
\text{Asian}_i = \begin{cases} 
1 & \text{if $i$-th person is Asian} \\
0 & \text{otherwise}
\end{cases}. 
$$
$$
\text{Caucasian}_i = \begin{cases} 
1 & \text{if $i$-th person is Caucasian} \\
0 & \text{otherwise}
\end{cases}. 
$$
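The category with no dummy variable, here African American, is called the *baseline*. A person in the baseline category has every dummy variable equal to 0, so each dummy coefficient measures a difference relative to the baseline.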
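As an alternative to writing the indicator columns by hand, pandas can generate them. The sketch below is one possible approach on a tiny made-up sample (not the Credit data): `pd.get_dummies` with `drop_first=True` drops the first category in sorted order, which for these three labels happens to be African American, so it plays the role of the baseline.

```python
import pandas as pd

# Tiny made-up sample using the three category labels from the lab
ethnicity = pd.Series(
    ["Caucasian", "Asian", "African American", "Asian", "Caucasian"],
    name="Ethnicity",
)

# drop_first=True drops the first category in sorted order
# ("African American"), which then serves as the baseline
dummies = pd.get_dummies(ethnicity, drop_first=True, dtype=int)
print(dummies)
```

Each row has at most one 1, and the baseline row is all zeros. With $k$ categories we need only $k-1$ dummy variables, since the intercept already accounts for the baseline.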

**Exercise**: Can you find a relationship in the data involving the variable ethnicity? 

In [None]:
credit["Asian_num"] = credit["Ethnicity"].map({'Caucasian':0, 'Asian':1, 'African American':0})
credit["Caucasian_num"] = credit["Ethnicity"].map({'Caucasian':1, 'Asian':0, 'African American':0})

In [None]:
# your code here 
