In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
%matplotlib inline
url = "https://raw.githubusercontent.com/ga-students/DS-SF-24/master/Data/Credit.csv"
CreditData = pd.read_csv(url)
CreditData.head(10)

Unnamed: 0.1,Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity,Balance
0,1,14.891,3606,283,2,34,11,Male,No,Yes,Caucasian,333
1,2,106.025,6645,483,3,82,15,Female,Yes,Yes,Asian,903
2,3,104.593,7075,514,4,71,11,Male,No,No,Asian,580
3,4,148.924,9504,681,3,36,11,Female,No,No,Asian,964
4,5,55.882,4897,357,2,68,16,Male,No,Yes,Caucasian,331
5,6,80.18,8047,569,4,77,10,Male,No,No,Caucasian,1151
6,7,20.996,3388,259,2,37,12,Female,No,No,African American,203
7,8,71.408,7114,512,2,87,9,Male,No,No,Asian,872
8,9,15.125,3300,266,5,66,13,Female,No,No,Caucasian,279
9,10,71.061,6819,491,3,41,19,Female,Yes,Yes,African American,1350


In [2]:
del CreditData['Unnamed: 0']

#### Let's look at correlation matrix. This time, we only explore the quantitative variables that affect Credit Balance. From your preliminary analysis, which 3 variables seem to affect Balance the most? In our goal is interpretation; can we use these 3 variables simultaneously? Why?

In [3]:
CreditData.corr()

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Balance
Income,1.0,0.792088,0.791378,-0.018273,0.175338,-0.027692,0.463656
Limit,0.792088,1.0,0.99688,0.010231,0.100888,-0.023549,0.861697
Rating,0.791378,0.99688,1.0,0.053239,0.103165,-0.030136,0.863625
Cards,-0.018273,0.010231,0.053239,1.0,0.042948,-0.051084,0.086456
Age,0.175338,0.100888,0.103165,0.042948,1.0,0.003619,0.001835
Education,-0.027692,-0.023549,-0.030136,-0.051084,0.003619,1.0,-0.008062
Balance,0.463656,0.861697,0.863625,0.086456,0.001835,-0.008062,1.0


Answer: Limit, Rating, and Income seem to affect Credit Balance the most.  

#### There are few categorical variables, let's first create dummy variables for them


In [4]:

RaceDummy = pd.get_dummies(CreditData.Ethnicity, prefix = 'Race')
del RaceDummy['Race_African American']

GenderDummy = pd.get_dummies(CreditData.Gender, prefix = 'Gender')
del GenderDummy['Gender_ Male']  

MarriedDummy = pd.get_dummies(CreditData.Married, prefix = 'Married')
del MarriedDummy['Married_No']

StudentDummy = pd.get_dummies(CreditData.Student, prefix = 'Student')
del StudentDummy['Student_No']

CreditData = pd.concat([CreditData, RaceDummy,GenderDummy,MarriedDummy,StudentDummy], axis=1)

CreditData.head()

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity,Balance,Race_Asian,Race_Caucasian,Gender_Female,Married_Yes,Student_Yes
0,14.891,3606,283,2,34,11,Male,No,Yes,Caucasian,333,0.0,1.0,0.0,1.0,0.0
1,106.025,6645,483,3,82,15,Female,Yes,Yes,Asian,903,1.0,0.0,1.0,1.0,1.0
2,104.593,7075,514,4,71,11,Male,No,No,Asian,580,1.0,0.0,0.0,0.0,0.0
3,148.924,9504,681,3,36,11,Female,No,No,Asian,964,1.0,0.0,1.0,0.0,0.0
4,55.882,4897,357,2,68,16,Male,No,Yes,Caucasian,331,0.0,1.0,0.0,1.0,0.0


# Now it's time for some fun!

#### By a regression line, use Education, Ethnicity, Gender, Age, Cards, and Income to predict Balance. 

First Step, find the coefficients of your regression line

In [5]:
from sklearn.linear_model import LinearRegression

linreg = LinearRegression()

In [6]:
X = CreditData[['Education', 'Race_Asian', 'Race_Caucasian', 'Gender_Female', 'Age', 'Cards', 'Income']]
y = CreditData['Balance']

linreg.fit(X,y)

print (linreg.intercept_)
print (linreg.coef_)

230.042354393
[  1.64553607  -6.54603078   3.47497641  27.12543123  -2.32970547
  33.62953508   6.27995894]


Second Step, find the p-values of your estimates. You have a few variables try to show your p-values along side the names of the variables.

In [7]:
import statsmodels.formula.api as smf
lm = smf.ols(formula='y ~ X', data=CreditData).fit()
zip()
lm.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.232
Model:,OLS,Adj. R-squared:,0.219
Method:,Least Squares,F-statistic:,16.95
Date:,"Thu, 23 Jun 2016",Prob (F-statistic):,1.41e-19
Time:,21:15:33,Log-Likelihood:,-2966.5
No. Observations:,400,AIC:,5949.0
Df Residuals:,392,BIC:,5981.0
Df Model:,7,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5
,coef,std err,t,P>|t|,[95.0% Conf. Int.]
Intercept,230.0424,130.247,1.766,0.078,-26.028 486.113
X[0],1.6455,6.527,0.252,0.801,-11.187 14.478
X[1],-6.5460,57.531,-0.114,0.909,-119.654 106.562
X[2],3.4750,50.071,0.069,0.945,-94.967 101.917
X[3],27.1254,40.695,0.667,0.505,-52.883 107.134
X[4],-2.3297,1.202,-1.938,0.053,-4.694 0.034
X[5],33.6295,14.881,2.260,0.024,4.373 62.887
X[6],6.2800,0.587,10.696,0.000,5.126 7.434

0,1,2,3
Omnibus:,36.209,Durbin-Watson:,1.968
Prob(Omnibus):,0.0,Jarque-Bera (JB):,18.357
Skew:,0.349,Prob(JB):,0.000103
Kurtosis:,2.216,Cond. No.,511.0


In [14]:
zip(['Education', 'Race_Asian', 'Race_Caucasian', 'Gender_Female', 'Age', 'Cards', 'Income'],lm.pvalues[1:7])

[('Education', 0.80109320438053111),
 ('Race_Asian', 0.90946857316642182),
 ('Race_Caucasian', 0.94470607269767004),
 ('Gender_Female', 0.50545317234523912),
 ('Age', 0.053391728919999722),
 ('Cards', 0.024378108899920221)]

**Which of your coefficients are significant at significance level 5%?**

Answer: 

#### What is the R-Squared of your model?

#### How do we interpret this value?

Answer: 

#### Now focus on two of the most significant variables from your previous model and re-run your regression model. 

**In comparison to the previous model, did our R-Squared increase or decrease? Why?**

Answer: 

#### Now let's regress Balance on Gender alone. After running your regression lines, do you have enough evidence to claim that females having more balance than males? (Hint: Look at the p-value of the Gender coefficient. If it is significant then you will have evidence to support that claim, otherwise you cannot support the statement.

Answer: 

#### Now let's regress Balance on Ethnicity. After running your regression lines, do you have enough evidence to claim that some ethnic groups carry more balance than others? (Hint: Look at the p-value of  your dummy variables. If it is significant then you will have evidence to support that claim, otherwise you cannot support that statement.

Answer: 

#### I know you get tired of this but for the last time regress Balance on Studentship status. After running your regression lines, do you have enough evidence to claim that students  carry more balance than others? (Hint: Look at the p-value of the your dummy variables. If it is significant then you will have evidence to support that claim, otherwise you cannot support the statement.


Answer: 

#### No let's consider effect of students and income on balance simoltanously. Let's start with a regression line.

#### Are all of our regression coefficients significant? If yes, interpret them.

Answer: 

#### Now let's explore interaction between income and studentship. Let's start with a regression line

In [8]:
# First generate a column for interation term


#### Are our coefficients signifincant? It they are write down your regression line below:

Answer:

#### Assume all coefficients in above regression were significant. Is there any income level at which students and non-students on average carry same level of balance?

Answer: 

