# PDS: Assignment 2

### Deadline: April 27 (until 23:59)


**Instructions:** All answers should be filled in the notebook and then submitted to Moodle. For theoretical questions you can use markdown and LaTeX. The notebook name should follow this format:
 - Name_group_Assignment_2.ipynb (e.g. Bill_Gates_1905_Assignment_2.ipynb)
 - Try to use fewer cells for compilation (for example, you can use print for several answers, instead of printing each answer on a separate cell)
 - **PLEASE:** submit only one Jupyter notebook (no zip or csv files), named as described in the instructions. Otherwise, there will be a 2-point penalty.

## ✤ *Importing needed libraries:*

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression 
from sklearn.feature_selection import RFE
from sklearn.ensemble import GradientBoostingRegressor

from sklearn import metrics

import warnings  # do not show matching warnings
warnings.filterwarnings('ignore')

## Q1 (2 points)
Use the Credit.csv dataset and only the parameters Income and Ethnicity to predict Balance. Do the following:
1. You are allowed to use sklearn, but not statsmodels.
2. Build a model that takes Ethnicity with dummy variables.
3. Show that building three separate models, one for each possible Ethnicity value, results in the same models as using Ethnicity with dummy variables in one model.

In [None]:
data = pd.read_csv('Credit.csv')

data = data[['Income', 'Ethnicity', 'Balance']]

data.head()

Unnamed: 0,Income,Ethnicity,Balance
0,14.891,Caucasian,333
1,106.025,Asian,903
2,104.593,Asian,580
3,148.924,Asian,964
4,55.882,Caucasian,331


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Income     400 non-null    float64
 1   Ethnicity  400 non-null    object 
 2   Balance    400 non-null    int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 9.5+ KB


In [None]:
# there are 3 types of Ethnicity:
# data['Ethnicity'].unique()
data['Ethnicity'].value_counts()

Caucasian           199
Asian               102
African American     99
Name: Ethnicity, dtype: int64

### ✤ Creating dummy variables for the 3 types of Ethnicity:

In [None]:
# ethnicity = pd.get_dummies(data['Ethnicity'])
ethnicity_dummies = pd.get_dummies(data.select_dtypes(include=[object]))

new_data = pd.concat([data, ethnicity_dummies], axis=1)
new_data.head()

Unnamed: 0,Income,Ethnicity,Balance,Ethnicity_African American,Ethnicity_Asian,Ethnicity_Caucasian
0,14.891,Caucasian,333,0,0,1
1,106.025,Asian,903,0,1,0
2,104.593,Asian,580,0,1,0
3,148.924,Asian,964,0,1,0
4,55.882,Caucasian,331,0,0,1


### ✤ Also, label-encoding the categorical Ethnicity column for later models:

In [None]:
label_en = LabelEncoder()

new_data['Ethnicity'] = label_en.fit_transform(new_data['Ethnicity'].astype(str))

new_data.head()

Unnamed: 0,Income,Ethnicity,Balance,Ethnicity_African American,Ethnicity_Asian,Ethnicity_Caucasian
0,14.891,2,333,0,0,1
1,106.025,1,903,0,1,0
2,104.593,1,580,0,1,0
3,148.924,1,964,0,1,0
4,55.882,2,331,0,0,1


In [None]:
# correlation in a form of plot or in dataframe:
# cols = ['Income', 'Ethnicity', 'Balance']
# hm = sns.heatmap(data[cols].corr(), cbar=True,annot=True)
new_data.corr()

Unnamed: 0,Income,Ethnicity,Balance,Ethnicity_African American,Ethnicity_Asian,Ethnicity_Caucasian
Income,1.0,-0.032888,0.463656,0.040132,-0.017137,-0.019701
Ethnicity,-0.032888,1.0,-0.009157,-0.867747,-0.177044,0.903313
Balance,0.463656,-0.009157,1.0,0.01372,-0.009812,-0.003288
Ethnicity_African American,0.040132,-0.867747,0.01372,1.0,-0.335526,-0.570641
Ethnicity_Asian,-0.017137,-0.177044,-0.009812,-0.335526,1.0,-0.582131
Ethnicity_Caucasian,-0.019701,0.903313,-0.003288,-0.570641,-0.582131,1.0


### ✤ All models in the 1st task were built with sklearn's LinearRegression ✤
+ **✤ Let's make predictions and check all metrics (MAE, MSE, RMSE, MAX, R2) using the function below:**

In [None]:
# let's create a function for all metrics at the same time:

def showMetricsEval(true, predicted):  
    r2_square = metrics.r2_score(true, predicted)
    mse = metrics.mean_squared_error(true, predicted)
    rmse = np.sqrt(metrics.mean_squared_error(true, predicted))
    mae = metrics.mean_absolute_error(true, predicted)
    maxx = metrics.max_error(true, predicted)
    
    print('R-squared', np.round(r2_square,3))
    print('MSE:', mse)
    print('RMSE:', rmse)
    print('MAE:', mae)
    print('MAX:', maxx)

### ✤  First model with general Ethnicity column:

In [None]:
X_train = new_data[['Income','Ethnicity']]
y_train = new_data['Balance']

# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

model = LinearRegression()
model.fit(X_train, y_train)

train_pred_1 = model.predict(X_train)
showMetricsEval(y_train, train_pred_1)

print('\nAccuracy model for train:',round(model.score(X_train, y_train),2))
print('Intercept is: ', round(model.intercept_, 2))
print('Coefficients are: ', model.coef_)

R-squared 0.215
MSE: 165514.0292938048
RMSE: 406.83415453204617
MAE: 348.9566328801873
MAX: 1097.7349141598215

Accuracy model for train: 0.22
Intercept is:  242.16
Coefficients are:  [6.05097953 3.38937433]


In [None]:
from scipy import stats # to find p-values

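def findStat():
    # uses the global model, X_train and y_train from the most recent fit
    params = np.append(model.intercept_,model.coef_)
    predictions = model.predict(X_train)

    # design matrix with a leading column of ones for the intercept
    newX = pd.DataFrame({"Constant":np.ones(len(X_train))}).join(pd.DataFrame(X_train))
    dof = len(newX)-len(newX.columns)  # residual degrees of freedom, n - p
    MSE = (sum((y_train-predictions)**2))/dof

    var_b = MSE*(np.linalg.inv(np.dot(newX.T,newX)).diagonal())
    sd_b = np.sqrt(var_b)
    ts_b = params/ sd_b

    # two-sided p-values from the t distribution with n - p degrees of freedom
    p_values =[2*(1-stats.t.cdf(np.abs(i),dof)) for i in ts_b]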
    p_values = np.round(p_values,3)

    # cols = ['Balance','Income','Ethnicity']
    params = np.round(params,4)
    sd_b = np.round(sd_b,3)
    ts_b = np.round(ts_b,3)

    StatDF = pd.DataFrame()
    StatDF["Coefficients"],StatDF["Standard Errors"],StatDF["T-values"],StatDF["P-values"]= [params,sd_b,ts_b,p_values]
    return StatDF

In [None]:
findStat()

Unnamed: 0,Coefficients,Standard Errors,T-values,P-values
0,242.1597,45.984,5.266,0.0
1,6.051,0.58,10.426,0.0
2,3.3894,24.729,0.137,0.891


+ I built the table above with numpy to get the same kind of statistical summary that statsmodels gave us in the lectures.
+ It has 3 rows: the intercept (row 0), Income (row 1) and Ethnicity (row 2). The intercept and Income are statistically significant (p-values of 0.0 after rounding).
+ For the Ethnicity column the t-value is low (0.137) and the p-value is high (0.891), so it is not statistically significant.
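The quantities in the table follow the standard OLS formulas. As a sketch, in notation that is not in the original notebook ($n$ observations, $p$ estimated parameters including the intercept, $X$ the design matrix with a leading column of ones, $F_{t,\nu}$ the CDF of the t distribution with $\nu$ degrees of freedom):

$$
\hat\sigma^2 = \frac{\sum_{i=1}^{n} (y_i - \hat y_i)^2}{n - p}, \qquad
\mathrm{SE}(\hat\beta_j) = \sqrt{\hat\sigma^2 \left[(X^\top X)^{-1}\right]_{jj}}, \qquad
t_j = \frac{\hat\beta_j}{\mathrm{SE}(\hat\beta_j)}, \qquad
p_j = 2\left(1 - F_{t,\nu}(|t_j|)\right)
$$

The conventional choice is $\nu = n - p$. With $n = 400$, using $n - 1$ instead changes the p-values only marginally.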

### ✤ Second model with all 3 dummies:

In [None]:
X_train = new_data[['Income','Ethnicity_African American','Ethnicity_Asian','Ethnicity_Caucasian']]
y_train = new_data['Balance']

#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

model = LinearRegression()
model.fit(X_train, y_train)

train_pred_2 = model.predict(X_train)
showMetricsEval(y_train, train_pred_2)

print('\nAccuracy model for train:',round(model.score(X_train, y_train),2))
print('Intercept is: ', round(model.intercept_, 2))
print('Coefficients are: ', model.coef_)

R-squared 0.215
MSE: 165513.89554684315
RMSE: 406.83399015672614
MAE: 348.9468596386069
MAX: 1097.574595883287

Accuracy model for train: 0.22
Intercept is:  245.51
Coefficients are:  [ 6.05073683 -3.02512703 -0.56850741  3.59363444]


In [None]:
coef_df = pd.DataFrame(model.coef_, X_train.columns, columns=['Coefficient'])
coef_df

Unnamed: 0,Coefficient
Income,6.050737
Ethnicity_African American,-3.025127
Ethnicity_Asian,-0.568507
Ethnicity_Caucasian,3.593634


+ **As I remember, the coefficients can be interpreted like this:**

+ Holding all other features fixed:
+ **Income**: a 1-unit increase is associated with an increase of 6.05 in Balance, on average.
+ **African American**: the dummy shifts the predicted Balance down by 3.025.
+ **Asian**: the dummy shifts the predicted Balance down by 0.569.
+ **Caucasian**: the dummy shifts the predicted Balance up by 3.594.
+ Because all three dummies are used together with an intercept, these shifts are relative to the intercept (245.51), and they sum to approximately zero.
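Using all three dummies together with an intercept makes one column redundant. A common alternative is to drop one category and read the remaining dummy coefficients as differences from that baseline. The sketch below illustrates this on synthetic data; the `demo` frame, its coefficients and its noise level are made up for illustration and are not Credit.csv.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the Income / Ethnicity / Balance columns (assumption).
rng = np.random.default_rng(1)
n = 200
demo = pd.DataFrame({
    "Income": rng.uniform(10, 150, n),
    "Ethnicity": rng.choice(["African American", "Asian", "Caucasian"], n),
})
demo["Balance"] = 250 + 6 * demo["Income"] + rng.normal(0, 100, n)

# Design 1: all three dummies (redundant with the intercept).
X_all = pd.concat(
    [demo[["Income"]], pd.get_dummies(demo["Ethnicity"], dtype=float)], axis=1
)
# Design 2: first category dropped, so it becomes the baseline.
X_ref = pd.concat(
    [demo[["Income"]],
     pd.get_dummies(demo["Ethnicity"], drop_first=True, dtype=float)],
    axis=1,
)

model_all = LinearRegression().fit(X_all, demo["Balance"])
model_ref = LinearRegression().fit(X_ref, demo["Balance"])

# Both designs describe the same fitted values; only the coding differs.
same_fit = np.allclose(model_all.predict(X_all), model_ref.predict(X_ref))
print("Identical fitted values:", same_fit)
print(pd.Series(model_ref.coef_, index=X_ref.columns))
```

In the second design the coefficients for Asian and Caucasian are offsets relative to the dropped category (African American), which is usually easier to interpret.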

## I experimented with different feature sets in X
### ✤ Third, 3 separate models, each using Income and one Ethnicity dummy:
+ 'Ethnicity_African American'
+ 'Ethnicity_Asian'
+ 'Ethnicity_Caucasian'

In [None]:
X_train = new_data[['Income','Ethnicity_African American']]
y_train = new_data['Balance']

model = LinearRegression()
model.fit(X_train, y_train)

train_pred_3_1 = model.predict(X_train)
showMetricsEval(y_train, train_pred_3_1)

print('\nAccuracy model for train:',round(model.score(X_train, y_train),2))
print('Intercept is: ', round(model.intercept_, 2))
print('Coefficients are: ', model.coef_)

R-squared 0.215
MSE: 165516.81602464267
RMSE: 406.8375794154747
MAE: 349.018265908979
MAX: 1098.9871020029461

Accuracy model for train: 0.22
Intercept is:  247.69
Coefficients are:  [ 6.05092635 -5.20895502]


In [None]:
# Using my previously created function:
findStat()

Unnamed: 0,Coefficients,Standard Errors,T-values,P-values
0,247.6881,34.91,7.095,0.0
1,6.0509,0.581,10.423,0.0
2,-5.209,47.352,-0.11,0.912


In [None]:
X_train = new_data[['Income','Ethnicity_Asian']]
y_train = new_data['Balance']

model = LinearRegression()
model.fit(X_train, y_train)

train_pred_3_2 = model.predict(X_train)
showMetricsEval(y_train, train_pred_3_2)

print('\nAccuracy model for train:',round(model.score(X_train, y_train),2))
print('Intercept is: ', round(model.intercept_, 2))
print('Coefficients are: ', model.coef_)

R-squared 0.215
MSE: 165521.12634758317
RMSE: 406.842876732017
MAE: 349.0330592936825
MAX: 1099.7395799073315

Accuracy model for train: 0.21
Intercept is:  247.04
Coefficients are:  [ 6.04794599 -1.96715398]


In [None]:
# Using my previously created function:
findStat()

Unnamed: 0,Coefficients,Standard Errors,T-values,P-values
0,247.0352,35.478,6.963,0.0
1,6.0479,0.58,10.425,0.0
2,-1.9672,46.854,-0.042,0.967


In [None]:
X_train = new_data[['Income','Ethnicity_Caucasian']]
y_train = new_data['Balance']

model = LinearRegression()
model.fit(X_train, y_train)

train_pred_3_3 = model.predict(X_train)
showMetricsEval(y_train, train_pred_3_3)

print('\nAccuracy model for train:',round(model.score(X_train, y_train),2))
print('Intercept is: ', round(model.intercept_, 2))
print('Coefficients are: ', model.coef_)

R-squared 0.215
MSE: 165514.65258383207
RMSE: 406.834920556031
MAE: 348.9398145052079
MAX: 1097.5649468122601

Accuracy model for train: 0.22
Intercept is:  243.77
Coefficients are:  [6.04986637 5.37091139]


In [None]:
# Using my previously created function:
findStat()

Unnamed: 0,Coefficients,Standard Errors,T-values,P-values
0,243.7748,39.232,6.214,0.0
1,6.0499,0.58,10.428,0.0
2,5.3709,40.845,0.131,0.895


## ✤ Explanation of My findings:

+ The main thing that did not change across the models is the R-squared (0.215); the rounded score is 0.21 or 0.22.
+ This means that the model gives essentially the same score with the label-encoded column as with the dummies.
+ The other evaluation metrics (MSE, RMSE, MAE) are also almost identical.
+ As an example, the MSE only ranges from 165513.90 to 165521.13 across all models.
+ Because the dataset is small, I did not split it into train and test sets, although I considered it.
+ To conclude, with or without dummy variables there are almost no changes in the statistical scores.
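Task 3 of Q1 can also be checked by fitting one regression per Ethnicity subset and comparing it with a single model that uses dummies. As a hedged sketch, the code below does this on synthetic data (the `demo` frame, its coefficients and noise level are made up and are not Credit.csv). It assumes the single model has one dummy per group, no global intercept, and an Income × dummy interaction, which is the setup in which the two approaches give identical intercepts and slopes.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the Income / Ethnicity / Balance columns (assumption).
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "Income": rng.uniform(10, 150, n),
    "Ethnicity": rng.choice(["African American", "Asian", "Caucasian"], n),
})
demo["Balance"] = 250 + 6 * demo["Income"] + rng.normal(0, 100, n)

# Three separate models: Balance ~ Income, fitted on each Ethnicity subset.
separate = {}
for group, part in demo.groupby("Ethnicity"):
    m = LinearRegression().fit(part[["Income"]], part["Balance"])
    separate[group] = (m.intercept_, m.coef_[0])

# One joint model: a dummy per group (acting as that group's intercept)
# plus an Income x dummy interaction (acting as that group's slope).
dummies = pd.get_dummies(demo["Ethnicity"], dtype=float)
interactions = dummies.mul(demo["Income"], axis=0).add_suffix(" x Income")
X_joint = pd.concat([dummies, interactions], axis=1)
joint = LinearRegression(fit_intercept=False).fit(X_joint, demo["Balance"])
joint_coef = pd.Series(joint.coef_, index=X_joint.columns)

# Compare intercept and slope of each separate model with the joint model.
for group, (b0, b1) in separate.items():
    print(group,
          "| intercept:", round(b0, 3), "vs", round(joint_coef[group], 3),
          "| slope:", round(b1, 3), "vs", round(joint_coef[group + " x Income"], 3))
```

With only additive dummies and a shared Income slope, the per-group fits would generally differ slightly from the joint fit, because the joint model then forces the same slope on every group.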

## Q2 (3 points)
Build your best model to predict Balance. Do the following steps:
1. Use 10% of data for testing set with random seed = 2021, i.e. you will get 40 observations for testing. 
2. You can use any available parameters, also you can do feature engineering. 
3. Evaluate performance of your model/models on test set and use MSE and R-squared as evaluation metrics.
4. Describe every step you do and show obtained results at the end.

*Note: if you apply (correctly) more techniques, you will get higher mark.*

### ✤ I decided to make 3 models:
+ 1. With all columns
+ 2. With highly correlated columns
+ 3. With Feature Selection

In [None]:
data = pd.read_csv('Credit.csv')

data.head()

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity,Balance
0,14.891,3606,283,2,34,11,Male,No,Yes,Caucasian,333
1,106.025,6645,483,3,82,15,Female,Yes,Yes,Asian,903
2,104.593,7075,514,4,71,11,Male,No,No,Asian,580
3,148.924,9504,681,3,36,11,Female,No,No,Asian,964
4,55.882,4897,357,2,68,16,Male,No,Yes,Caucasian,331


### ✤ Data preparation:
+ Since there are some categorical columns, we should encode them as numbers:

In [None]:
label_en = LabelEncoder()

data['Gender'] = label_en.fit_transform(data['Gender'].astype(str))
data['Student'] = label_en.fit_transform(data['Student'].astype(str))
data['Married'] = label_en.fit_transform(data['Married'].astype(str))
data['Ethnicity'] = label_en.fit_transform(data['Ethnicity'].astype(str))

data.head()

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity,Balance
0,14.891,3606,283,2,34,11,0,0,1,2,333
1,106.025,6645,483,3,82,15,1,1,1,1,903
2,104.593,7075,514,4,71,11,0,0,0,1,580
3,148.924,9504,681,3,36,11,1,0,0,1,964
4,55.882,4897,357,2,68,16,0,0,1,2,331


In [None]:
# In case the dataset needs to be standardized:
# from sklearn.preprocessing import StandardScaler

# scaler = StandardScaler().fit(data)
# standardizedData = scaler.transform(data)
# standardizedData = pd.DataFrame(standardizedData, index = data.index, columns = data.columns)
# standardizedData

In [None]:
# Correlation matrix as a DataFrame (it could also be shown as a plot):
data.corr()
# From the full matrix it is not easy to see which columns correlate strongly with Balance

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity,Balance
Income,1.0,0.792088,0.791378,-0.018273,0.175338,-0.027692,-0.010738,0.019632,0.035652,-0.032888,0.463656
Limit,0.792088,1.0,0.99688,0.010231,0.100888,-0.023549,0.009397,-0.006015,0.031155,-0.020837,0.861697
Rating,0.791378,0.99688,1.0,0.053239,0.103165,-0.030136,0.008885,-0.002028,0.036751,-0.020288,0.863625
Cards,-0.018273,0.010231,0.053239,1.0,0.042948,-0.051084,-0.022658,-0.026164,-0.009695,-0.003867,0.086456
Age,0.175338,0.100888,0.103165,0.042948,1.0,0.003619,0.004015,-0.029844,-0.073136,-0.032451,0.001835
Education,-0.027692,-0.023549,-0.030136,-0.051084,0.003619,1.0,-0.005049,0.072085,0.048911,-0.030055,-0.008062
Gender,-0.010738,0.009397,0.008885,-0.022658,0.004015,-0.005049,1.0,0.055034,0.012452,0.001514,0.021474
Student,0.019632,-0.006015,-0.002028,-0.026164,-0.029844,0.072085,0.055034,1.0,-0.076974,-0.030261,0.259018
Married,0.035652,0.031155,0.036751,-0.009695,-0.073136,0.048911,0.012452,-0.076974,1.0,0.060563,-0.005673
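Ethnicity,-0.032888,-0.020837,-0.020288,-0.003867,-0.032451,-0.030055,0.001514,-0.030261,0.060563,1.0,-0.009157
Balance,0.463656,0.861697,0.863625,0.086456,0.001835,-0.008062,0.021474,0.259018,-0.005673,-0.009157,1.0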


### ✤ Checking which columns have the highest correlation with our target column:

In [None]:
# convert series to dataframe so it can be sorted
correlation = data.corr()['Balance']
correlation_df = pd.DataFrame(correlation)
# correct column label from Balance to correlation
correlation_df.columns = ["Correlation"]
# sort correlation
corr_sorted = correlation_df.sort_values(by=['Correlation'], ascending=False)
corr_sorted

Unnamed: 0,Correlation
Balance,1.0
Rating,0.863625
Limit,0.861697
Income,0.463656
Student,0.259018
Cards,0.086456
Gender,0.021474
Age,0.001835
Married,-0.005673
Education,-0.008062
Ethnicity,-0.009157


## ✤ First Model: Linear Regression with all columns:

In [None]:
# STEP-1:
X = data.drop("Balance", axis = 1) # All columns except this column.
y = data["Balance"] # Only this column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=2021)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(360, 10)
(40, 10)
(360,)
(40,)


In [None]:
model_lin_reg = LinearRegression().fit(X_train,y_train)
print('Intercept of the model:', np.round(model_lin_reg.intercept_,5))

Intercept of the model: -468.64334


In [None]:
coef_df = pd.DataFrame(model_lin_reg.coef_, X.columns, columns=['Coefficient'])
coef_df

Unnamed: 0,Coefficient
Income,-7.786658
Limit,0.183377
Rating,1.234905
Cards,16.845494
Age,-0.610406
Education,-1.13066
Gender,-15.1935
Student,428.150079
Married,-6.843849
Ethnicity,3.291397


In [None]:
test_pred = model_lin_reg.predict(X_test)
print('R-squared:', round(metrics.r2_score(y_test, test_pred), 3))
print('RMSE:', round(np.sqrt(metrics.mean_squared_error(y_test, test_pred)), 3))  # root of the MSE, in Balance units

R-squared: 0.961
RMSE: 87.141


+ According to the R-squared on the test data, 96.1% of the variation in Balance is explained by the explanatory variables used in the model, which is very high. I suppose that it is the best model :)

## ✤ Second Model: Linear Regression with the columns most correlated with Balance
+ As found above, they are: Rating, Limit, Income, Student.
+ The other columns have relatively low correlation with Balance.

In [None]:
# STEP-1:
X = data[['Rating','Limit','Income','Student']] 
y = data["Balance"] # Only this column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=2021)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(360, 4)
(40, 4)
(360,)
(40,)


In [None]:
model_lin_reg = LinearRegression().fit(X_train,y_train)
print('Intercept of the model:', np.round(model_lin_reg.intercept_,5))

Intercept of the model: -515.67547


In [None]:
test_pred = model_lin_reg.predict(X_test)
print('R-squared:', round(metrics.r2_score(y_test, test_pred), 3))
print('RMSE:', round(np.sqrt(metrics.mean_squared_error(y_test, test_pred)), 3))  # root of the MSE, in Balance units

R-squared: 0.958
RMSE: 90.853


+ According to the R-squared on the test data, 95.8% of the variation in Balance is explained by the explanatory variables used in the model, which is very high.
+ The evaluation metrics differ only slightly from those of the previous model.

### ✤ Third Model: feature selection to choose the best columns:
+ with Recursive Feature Elimination

In [None]:
X = data.drop("Balance", axis = 1) # All columns except this column.
y = data["Balance"] # Only this column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=2021)

rfe_gb = RFE(estimator=GradientBoostingRegressor())

rfe_gb.fit(X_train, y_train)
# transform train input data
X_train_grb = rfe_gb.transform(X_train)
# transform test input data
X_test_grb = rfe_gb.transform(X_test)

In [None]:
print("Num Features: %d" % rfe_gb.n_features_)
print("Selected Features: %s" % rfe_gb.support_)
print("Feature Ranking: %s" % rfe_gb.ranking_)

Num Features: 5
Selected Features: [ True  True  True False  True False False  True False False]
Feature Ranking: [1 1 1 2 1 4 5 1 3 6]
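Applying the boolean mask above to the column order of `X` gives Income, Limit, Rating, Age and Student as the selected features. The sketch below shows how the mask can be mapped to column names directly, and how fixing `random_state` makes the selection repeatable between runs. It uses synthetic data from `make_regression`; the column names `f0`…`f5`, the helper `select_features` and the parameter values are illustrative assumptions, not the notebook's setup.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFE

# Synthetic regression data with named columns (assumption, not Credit.csv).
X_arr, y_demo = make_regression(
    n_samples=200, n_features=6, n_informative=3, noise=5.0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])


def select_features(seed):
    """Run RFE with a seeded gradient boosting model; return selected names."""
    rfe = RFE(
        estimator=GradientBoostingRegressor(n_estimators=50, random_state=seed),
        n_features_to_select=3,
    )
    rfe.fit(X_demo, y_demo)
    # support_ is a boolean mask aligned with the columns of X_demo
    return list(X_demo.columns[rfe.support_])


selected_a = select_features(seed=0)
selected_b = select_features(seed=0)
print("Selected:", selected_a)
print("Same selection on rerun:", selected_a == selected_b)
```

Without a fixed `random_state`, gradient boosting can give slightly different scores from run to run, which may explain small differences between a printed metric and a value copied into a summary table earlier.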


In [None]:
#  Model with SELECTION:
model_recfel3 = GradientBoostingRegressor().fit(X_train_grb, y_train)
# evaluate the model
print('With Feature Selection -Score is => ', model_recfel3.score(X_test_grb, y_test))

With Feature Selection -Score is =>  0.9722647092243117


In [None]:
# WITH feature selection evaluation:
test_pred_3 = model_recfel3.predict(X_test_grb)
# showing metrics by using my previous function:
showMetricsEval(y_test, test_pred_3)

R-squared 0.972
MSE: 5432.329027362661
RMSE: 73.70433520060175
MAE: 59.72073438008815
MAX: 226.80147710837844


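**Summary Evaluation Metrics of 3 Models (test set)**

|           | Model_1  | Model_2  | Model_3 |
|:---------:|:--------:|:--------:|:-------:|
| R-squared | 0.961    | 0.958    | 0.972   |
| MSE       | ≈ 7593.6 | ≈ 8254.3 | 5432.33 |
| RMSE      | 87.141   | 90.853   | 73.704  |

The MSE for Models 1 and 2 is obtained by squaring the values 87.141 and 90.853 printed for those models, which are root mean squared errors, so it is approximate.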

### ✤ Conclusion

+ To conclude, MSE and R-squared are commonly used to evaluate prediction error and model performance in regression analysis.


+ All 3 models have a high R-squared on the test data, above 95%.
+ R-squared is not limited to linear regression: it can be computed for any regression model, as done here for the gradient boosting model.
+ R-squared gives an indication of the goodness of fit of the predictions of the 3 models.


+ Higher R-squared values mean smaller differences between the predicted and actual values.
+ R-squared measures the share of the variation of the response variable around its mean that the model explains; a value of 1 means all of it is explained.
+ So the higher the R-squared, the better the regression model fits the observations.
+ Very high R-squared values can also be a sign of overfitting. Since these scores come from the test set, checking this would require comparing them with the scores on the training set.


+ The MSE (mean squared error) is the average of the squared differences between predicted and actual target values.
+ The values 87.141 and 90.853 reported for Models 1 and 2 are root mean squared errors (RMSE), so they cannot be compared directly with the MSE of about 5432 printed for Model_3.
+ On the same scale, Model_3 has the lowest error: its RMSE is 73.704, against 87.141 for Model_1 and 90.853 for Model_2, which agrees with its higher R-squared.
+ Models 1 and 2 are both linear regressions and their errors are quite close.
+ Ideally the MSE would be 0, which would mean that every prediction matches the actual value exactly.
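MSE and RMSE are on different scales, so it helps to check which one a printed number is before comparing models. The sketch below uses four made-up values (not taken from Credit.csv) to show that RMSE is the square root of MSE. The same relation holds in the Model 3 output, where 73.704² ≈ 5432.3.

```python
import numpy as np
from sklearn import metrics

# Made-up true and predicted values, for illustration only.
y_true = np.array([300.0, 900.0, 600.0, 950.0])
y_pred = np.array([350.0, 850.0, 700.0, 900.0])

mse = metrics.mean_squared_error(y_true, y_pred)  # in squared target units
rmse = np.sqrt(mse)                               # back in target units

print("MSE: ", mse)
print("RMSE:", round(rmse, 3))
```

Taking the square root explicitly with `np.sqrt` matches how `showMetricsEval` computes RMSE and does not depend on the `squared` argument of `mean_squared_error`.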

---

### Evaluation

| Question | Mark     | Comment   
|:-------:|:--------:|:----------------------
| 1       |   1.5/2    |     
| 2       |   3/3    | 
|**Total**|**4.5/5**  | 
