## Problem statement:


* Use the Root Mean Square Error (RMSE).
* If the absolute error of a prediction is greater than 4.0, I regard the prediction as "wrong". Otherwise, it is "correct".
* Any other evaluation measure that you believe is appropriate.
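
The first two measures can be sketched in a few lines of NumPy. This is only an illustration: the helper names `rmse` and `fraction_correct` and the sample numbers are made up for this example, and only the 4.0 tolerance comes from the problem statement.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def fraction_correct(y_true, y_pred, tolerance=4.0):
    """Share of predictions whose absolute error does not exceed the tolerance."""
    errors = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return float(np.mean(errors <= tolerance))

# Made-up values, purely for illustration (absolute errors: 3, 5, 0, 4)
y_true = [600.0, 650.0, 700.0, 720.0]
y_pred = [603.0, 645.0, 700.0, 724.0]

print("RMSE:", rmse(y_true, y_pred))
print("Fraction correct:", fraction_correct(y_true, y_pred))
```

An error of exactly 4.0 counts as "correct" here, because the statement only calls errors greater than 4.0 "wrong".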


We will use pandas and scikit-learn to load and explore the dataset. The dataset can easily be loaded from CSV files using the pandas read_csv function.

We will also try several regression models, benchmark them, and predict with the best one.

## Descriptive Analysis

Let's check the files CreditScore_train.csv and CreditScore_test.csv under /kaggle/input/credit-score-prediction/.

Descriptive analysis is an important first step in any statistical analysis. It gives an idea of the distribution of the data, helps us detect outliers and typos, and identifies associations among variables, thus preparing us for further statistical analyses.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt # plotting
import numpy as np # linear algebra
import os # accessing directory structure
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")


List the files available in the input directory:

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
# Load the complete training and test files (no row limit is applied)
df1 = pd.read_csv('/kaggle/input/credit-score-prediction/CreditScore_train.csv', delimiter=',')
df1.dataframeName = 'CreditScore_train.csv'
nRow, nCol = df1.shape
print(f'TRAIN DATA : There are {nRow} rows and {nCol} columns')

df2 = pd.read_csv('/kaggle/input/credit-score-prediction/CreditScore_test.csv', delimiter=',')
df2.dataframeName = 'CreditScore_test.csv'
nRow, nCol = df2.shape
print(f'TEST DATA : There are {nRow} rows and {nCol} columns')

df1["source"] = "train"
df2["source"] = "test"

merged_df = pd.concat([df1,df2])
merged_df.dataframeName = 'Merged_DF'

nRow, nCol = merged_df.shape
print(f'MERGED DATA : There are {nRow} rows and {nCol} columns')

Let's take a quick look at what the data looks like:

The data is already in a pandas dataframe, so we can go straight to exploratory data analysis. We can view the first 5 rows using the head() function.

In [None]:
merged_df.head(5)

Find the dimensions and column types of the given data:


In [None]:
merged_df.info()

In [None]:
merged_df.columns

We can check the datatype of each column using dtypes to make sure every column has a numeric datatype. If a column has a different datatype, such as string, we need to map it to a numeric datatype such as integer or float. Apart from the source column we added above to tell train rows from test rows, this dataset luckily has no such column.

In [None]:
merged_df.dtypes

In [None]:
merged_df.isnull().any()

In [None]:
merged_df.isna().sum()

In [None]:
merged_df.duplicated().sum()

Now, we will understand the statistical summary of the dataset using the describe() function. Using this function, we can understand the count, min, max, mean and standard deviation for each attribute (column) in the dataset. 

Each of these can also be displayed individually using df.count(), df.min(), df.max(), df.median() and df.quantile(q).
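
As a small illustration of these individual calls, the sketch below uses a made-up two-column dataframe (`toy` is not part of the credit score data). It also shows that count() only counts non-null values, which matters for the missing-data section that follows.

```python
import pandas as pd

# Toy data, invented for this example; column "a" has one missing value
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, None], "b": [10, 20, 30, 40, 50]})

counts = toy.count()       # non-null values per column
medians = toy.median()     # missing values are skipped by default
q25 = toy.quantile(0.25)   # first quartile per column

print(counts)
print(medians)
print(q25)
```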

In [None]:
merged_df.describe(include='all').T

## Missing data
Important questions when thinking about missing data:

* How prevalent is the missing data?
* Is missing data random or does it have a pattern?

The answer to these questions is important for practical reasons because missing data can imply a reduction of the sample size. This can prevent us from proceeding with the analysis. Moreover, from a substantive perspective, we need to ensure that the missing data process is not biased and hiding an inconvenient truth.

Sometimes a dataset has missing values, such as NaN or an empty string in a cell. We need to handle them so that our machine learning model doesn’t break. Three common approaches are:

* Replace the missing value with a large negative number (e.g. -999).
* Replace the missing value with the mean of the column.
* Replace the missing value with the median of the column.

To find out whether a column has missing values, you can use pd.isnull(df).any(), which returns a boolean for each column telling whether it contains any missing value. The cells below go further and compute the percentage of missing values per column.
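
The three replacement strategies can be compared on a tiny example. The dataframe `toy` and its column names are invented for illustration and are not taken from the credit score data.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value in "income" (made up for this sketch)
toy = pd.DataFrame({"income": [10.0, 20.0, np.nan, 90.0], "age": [30, 40, 50, 60]})

# Which columns contain at least one missing value?
has_missing = toy.isnull().any()
print(has_missing)

sentinel_filled = toy.fillna(-999)         # large negative sentinel
mean_filled = toy.fillna(toy.mean())       # column mean (40.0 for income)
median_filled = toy.fillna(toy.median())   # column median (20.0 for income)

print(mean_filled)
print(median_filled)
```

The mean is pulled up by the large value 90, while the median is not, so the median is often the safer choice for skewed columns.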

In [None]:
# Missing data summary per column
total = merged_df.count()               # number of non-null values
sumcol = merged_df.isnull().sum()       # number of missing values
countcol = merged_df.isnull().count()   # total number of rows

percent = (sumcol / countcol * 100).sort_values(ascending=False)
missing_data = pd.concat([total, percent, sumcol, countcol], axis=1, keys=['Total', 'Percent', 'Sumcol', 'countcol'])

# Sort so that the columns with the most missing values come first
miss_perc = missing_data.sort_values(['Percent'], axis=0, ascending=False)
miss_perc

In [None]:


# Columns with more than 60% missing values (miss_perc comes from the previous cell)
m_per = miss_perc[miss_perc.Percent > 60]
print(m_per)


In [None]:
# Drop the columns with more than 60% missing values
drop_cols = m_per.index
print(drop_cols)
filtered_df = merged_df.drop(columns=drop_cols)
print(filtered_df.shape)

In [None]:
filtered_df.head()

In [None]:
filtered_df['y']

## Exploratory Analysis
To begin this exploratory analysis, first import libraries and define functions for plotting the data using `matplotlib`. Depending on the data, not all plots will be made.

In [None]:
# Distribution graphs (histogram/bar graph) of column data
def plotPerColumnDistribution(df, nGraphShown, nGraphPerRow):
    nunique = df.nunique()
    df = df[[col for col in df if nunique[col] > 1 and nunique[col] < 50]] # For displaying purposes, pick columns that have between 1 and 50 unique values
    nRow, nCol = df.shape
    columnNames = list(df)
    nGraphRow = (nCol + nGraphPerRow - 1) // nGraphPerRow # integer number of rows, as plt.subplot requires
    plt.figure(num = None, figsize = (6 * nGraphPerRow, 8 * nGraphRow), dpi = 80, facecolor = 'w', edgecolor = 'k')
    for i in range(min(nCol, nGraphShown)):
        plt.subplot(nGraphRow, nGraphPerRow, i + 1)
        columnDf = df.iloc[:, i]
        if (not np.issubdtype(type(columnDf.iloc[0]), np.number)):
            valueCounts = columnDf.value_counts()
            valueCounts.plot.bar()
        else:
            columnDf.hist()
        plt.ylabel('counts')
        plt.xticks(rotation = 90)
        plt.title(f'{columnNames[i]} (column {i})')
    plt.tight_layout(pad = 1.0, w_pad = 1.0, h_pad = 1.0)
    plt.show()


## Correlation

Finding the correlation between attributes is a highly useful way to check for patterns in the dataset. Pandas offers three different methods to compute the correlation between attributes (columns). The output of each of them falls within the range [-1, 1]:

* 1 - Perfectly positively correlated.
* -1 - Perfectly negatively correlated.
* 0 - Not correlated.
    
We will use the df.corr() function to compute the correlation between attributes.


In [None]:
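# Correlate numeric columns only; the 'source' label is a string column
filtered_df.select_dtypes(include=[np.number]).corr()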

We have 3 methods

* PEARSON CORRELATION
* SPEARMAN CORRELATION
* KENDALL CORRELATION

Let's do only PEARSON CORRELATION
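
To see how the methods can differ, here is a small sketch on made-up data where y grows monotonically but not linearly with x. Pearson measures linear association, while Spearman works on ranks. Kendall is left out of the sketch, but it is selected in the same way with method="kendall".

```python
import pandas as pd

# Toy data, invented for this example: y = x^3 is monotonic but not linear
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["y"] = toy["x"] ** 3

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print("Pearson :", round(pearson, 3))   # below 1, the relation is not linear
print("Spearman:", round(spearman, 3))  # 1.0, the ranks agree perfectly
```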


In [None]:
#PEARSON CORRELATION

plt.figure(figsize = (15,10))
sns.heatmap(filtered_df.select_dtypes(include=[np.number]).corr(method="pearson"))
plt.title('PEARSON CORRELATION', fontsize=15)

In [None]:
# Correlation matrix
def plotCorrelationMatrix(df, graphWidth):
    filename = df.dataframeName
    df = df.dropna(axis='columns') # drop columns with NaN
    df = df[[col for col in df if df[col].nunique() > 1]] # keep columns where there are more than 1 unique values
    if df.shape[1] < 2:
        print(f'No correlation plots shown: The number of non-NaN or constant columns ({df.shape[1]}) is less than 2')
        return
    corr = df.select_dtypes(include=[np.number]).corr() # correlate numeric columns only
    plt.figure(num=None, figsize=(graphWidth, graphWidth), dpi=80, facecolor='w', edgecolor='k')
    corrMat = plt.matshow(corr, fignum = 1)
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
    plt.yticks(range(len(corr.columns)), corr.columns)
    plt.gca().xaxis.tick_bottom()
    plt.colorbar(corrMat)
    plt.title(f'Correlation Matrix for {filename}', fontsize=15)
    plt.show()


In [None]:
# Scatter and density plots
def plotScatterMatrix(df, plotSize, textSize):
    df = df.select_dtypes(include =[np.number]) # keep only numerical columns
    # Remove rows and columns that would lead to df being singular
    df = df.dropna(axis='columns')
    df = df[[col for col in df if df[col].nunique() > 1]] # keep columns where there are more than 1 unique values
    columnNames = list(df)
    if len(columnNames) > 10: # reduce the number of columns for matrix inversion of kernel density plots
        columnNames = columnNames[:10]
    df = df[columnNames]
    ax = pd.plotting.scatter_matrix(df, alpha=0.75, figsize=[plotSize, plotSize], diagonal='kde')
    corrs = df.corr().values
    for i, j in zip(*np.triu_indices_from(ax, k = 1)):
        ax[i, j].annotate('Corr. coef = %.3f' % corrs[i, j], (0.8, 0.2), xycoords='axes fraction', ha='center', va='center', size=textSize)
    plt.suptitle('Scatter and Density Plot')
    plt.show()


## Visualize the dataset

Now you're ready to read in the data and use the plotting functions to visualize the data.

We will use two visualization strategies: univariate plots and bivariate plots. As the names suggest, a univariate plot visualizes a single column (attribute), whereas a bivariate plot visualizes two columns.

## Box plot

A box-whisker plot is a univariate plot used to visualize a data distribution.

* The ends of the whiskers mark the minimum and maximum of the data, excluding points drawn individually as outliers (by default, points more than 1.5 times the interquartile range beyond the box).
* The central line in the box is the median of the entire data distribution.
* The two edges of the box are the first and third quartiles, i.e. roughly the medians of the lower and upper halves of the data.
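
The quantities behind a box plot can be computed by hand. The sketch below uses made-up values and the common 1.5 × IQR whisker rule, and the quartiles come from NumPy's default linear interpolation, so they may differ slightly from "median of each half".

```python
import numpy as np

# Toy values, invented for this example; 50 is an obvious outlier
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], dtype=float)

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points outside these limits are drawn individually as outliers
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

inside = values[(values >= lower_limit) & (values <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = values[(values < lower_limit) | (values > upper_limit)]

print("box:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```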


Set figsize so that all the boxes are clearly visible:

In [None]:
%matplotlib inline
plt.figure(figsize = (28,8))
sns.boxplot(data=filtered_df)

Check which of the remaining columns still contain missing values:

In [None]:
print(pd.isnull(filtered_df).any())

## Feature Extraction

Not applicable

In [None]:
filtered_df.head()

In [None]:
##filtered_df = filtered_df.drop(columns=['source'])

In [None]:
filtered_df.isnull().any()

In [None]:
filtered_df.head()

In [None]:
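# Absolute correlation of each numeric column with the output variable
cor_target = abs(filtered_df.select_dtypes(include=[np.number]).corr()["y"])
# Select the weakly correlated features (|corr| < 0.3); they are dropped below
relevant_features = cor_target[cor_target<0.3]
relevant_features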

In [None]:
list(relevant_features.index)  # names of the weakly correlated columns

In [None]:
filtered_df.shape

In [None]:
# Columns to drop: the weakly correlated features selected above
lst_key = list(relevant_features.index)
null_key = []

final_df = filtered_df.drop(columns=lst_key)
print(final_df.shape)

In [None]:
a = final_df.isnull().any()  # True for columns that still contain missing values

In [None]:
# Collect the names of the columns that still contain missing values
null_key = [col for col, has_null in a.items() if has_null]
print(null_key)

In [None]:
type(a)

In [None]:
# Fill the remaining missing values with the column mean
for i in null_key:
    final_df[i] = final_df[i].fillna(final_df[i].mean())
final_df.shape

In [None]:
# Split back into train and test rows and drop the helper column
train_final = final_df[final_df.source == "train"].drop(columns="source")
test_final = final_df[final_df.source == "test"].drop(columns="source")

print(train_final.shape)
print(test_final.shape)

In [None]:
X = train_final.drop("y", axis=1)
Y = train_final["y"]
print(X.shape)
print(Y.shape)

Because the attributes have very different ranges, we will rescale the dataset using the MinMaxScaler class in scikit-learn, which maps each attribute to the range [0, 1]. A common alternative is StandardScaler, which transforms each attribute to have a mean of 0 and a standard deviation of 1.
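
The difference between the two scalers is easy to see on a small made-up array (`toy` is not part of the dataset). MinMaxScaler squeezes each column into [0, 1], while StandardScaler centres each column at 0 with unit standard deviation. Neither changes the shape of the distribution.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data, invented for this example: two columns on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 600.0]])

minmax = MinMaxScaler().fit_transform(toy)
standard = StandardScaler().fit_transform(toy)

print("MinMaxScaler:\n", minmax)
print("StandardScaler:\n", standard)
print("column means after StandardScaler:", standard.mean(axis=0))
print("column stds after StandardScaler :", standard.std(axis=0))
```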

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
scaler = MinMaxScaler().fit(X)
scaled_X = scaler.transform(X)

Now, we will split the data into train and test set. We can easily do this using scikit-learn’s train_test_split() function using a test_size parameter.

In [None]:
from sklearn.model_selection import train_test_split

seed      = 42
test_size = 0.20

X_train, X_test, Y_train, Y_test = train_test_split(scaled_X, Y, test_size = test_size, random_state = seed)

print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

Let’s dive into Regression. We will use different Regression models offered by scikit-learn to produce a baseline accuracy for this problem. 

We will use the MAE (Mean Absolute Error) as the performance metric for the regression models.
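
As a quick sketch of the metric, the example below computes MAE on three made-up predictions and then runs cross_val_score on synthetic data. Note that scikit-learn's "neg_mean_absolute_error" scorer returns the negated MAE, so that higher is always better. The arrays `X_toy` and `y_toy` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score

# MAE by hand: absolute errors are 1, 0 and 2, so the mean is 1.0
mae = mean_absolute_error([3.0, 5.0, 8.0], [2.0, 5.0, 10.0])
print("MAE:", mae)

# Synthetic regression data, invented for this example
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=60)

scores = cross_val_score(LinearRegression(), X_toy, y_toy,
                         cv=5, scoring="neg_mean_absolute_error")
print("negated MAE per fold:", scores)
print("mean MAE:", -scores.mean())
```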

## Training Regression Model

By looking at the dataset, we simply can’t suggest the best Regression Model for this problem. So, we will try out different Regression models available in scikit-learn with a k-fold cross validation method.

Let's use k = 10 (10-fold cross validation).

This means the training data is split into 10 folds. Each fold is used once as the validation set while the model is trained on the remaining 9, and no sample appears in more than one fold. This way, we can thoroughly evaluate each model on different samples in the dataset.

The dataframe was already split into features X and target Y above, so we can go straight to the comparison.
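
How the folds partition the data can be checked directly. The sketch below uses 10 made-up rows and only 5 splits to keep the output short. `X_toy` is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy feature matrix with 10 rows, invented for this example
X_toy = np.arange(20).reshape(10, 2)

# shuffle=True is needed for random_state to have an effect
k_fold = KFold(n_splits=5, shuffle=True, random_state=42)

validation_rows = []
for fold, (train_idx, val_idx) in enumerate(k_fold.split(X_toy)):
    # Training and validation rows never overlap within a fold
    assert set(train_idx).isdisjoint(val_idx)
    validation_rows.extend(val_idx.tolist())
    print(f"fold {fold}: train={train_idx.tolist()} validation={val_idx.tolist()}")

# Across all folds, every row is used for validation exactly once
print(sorted(validation_rows))
```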

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

import time
import datetime

start = 0
end = 0
start = time.time()

# user variables to tune
folds   = 10
metric  = "neg_mean_absolute_error"

# hold different regression models in a single dictionary
models = {}
models["Linear"]        = LinearRegression()
models["Lasso"]         = Lasso()
models["Ridge"]         = Ridge()
models["ElasticNet"]    = ElasticNet()
models["DecisionTree"]  = DecisionTreeRegressor()
models["KNN"]           = KNeighborsRegressor()
models["RandomForest"]  = RandomForestRegressor()
models["AdaBoost"]      = AdaBoostRegressor()
models["GradientBoost"] = GradientBoostingRegressor()
models["XGBoost"] = XGBRegressor()

# 10-fold cross validation for each model
model_results = []
model_names   = []
for model_name in models:
    model = models[model_name]
    model_start = time.time()
    # shuffle=True is required when random_state is set
    k_fold = KFold(n_splits=folds, shuffle=True, random_state=seed)
    results = cross_val_score(model, X_train, Y_train, cv=k_fold, scoring=metric)

    model_results.append(results)
    model_names.append(model_name)
    print("{}: {}, {}".format(model_name, round(results.mean(), 3), round(results.std(), 3)))
    # time spent on this model only
    list_lapse = time.time() - model_start
    print("Time taken for processing {}: {}".format(model_name, str(datetime.timedelta(seconds=list_lapse))))

# box-whisker plot to compare regression models
figure = plt.figure(figsize = (20,8))

figure.suptitle('Regression models comparison')
axis = figure.add_subplot(111)
plt.boxplot(model_results)
axis.set_xticklabels(model_names, rotation = 45, ha="right")
axis.set_ylabel("Negative Mean Absolute Error (higher is better)")
plt.margins(0.05, 0.1)

## Choosing the best model

Based on the above comparison, the Gradient Boosting Regression model outperforms all the other regression models. The cells below fit XGBRegressor, XGBoost's implementation of gradient boosting, as the final model for this problem.

In [None]:
model = XGBRegressor(objective ='reg:squarederror')
model.fit(X_train,Y_train)

#Predicting TEST & TRAIN DATA
train_predict = model.predict(X_train)
test_predict = model.predict(X_test)

error_percent = np.mean(np.abs((Y_train - train_predict) / Y_train)) * 100
print("MAPE - Mean Absolute Percentage Error (TRAIN DATA): ",error_percent )
Y_train, train_predict = np.array(Y_train), np.array(train_predict)

In [None]:
# Evaluate the model fitted on the training split against the held-out split
# (the model must not be refitted on the test data)
test_predict = model.predict(X_test)

error_percent = np.mean(np.abs((Y_test - test_predict) / Y_test)) * 100
print("MAPE - Mean Absolute Percentage Error (TEST DATA): ",error_percent )
Y_test, test_predict = np.array(Y_test), np.array(test_predict)

We can visualize the predictions made by our best model and the original targets Y_test using the below code.

In [None]:
# plot between predictions and Y_test
x_axis = np.array(range(0, test_predict.shape[0]))
plt.figure(figsize=(20,10))
plt.plot(x_axis, test_predict, linestyle="--", marker="o", alpha=0.7, color='r', label="predictions")
plt.plot(x_axis, Y_test, linestyle="--", marker="o", alpha=0.7, color='g', label="Y_test")
plt.xlabel('Row number')
plt.ylabel('y (target)')
plt.title('Predictions vs Y_test')
plt.legend(loc='lower right')

We could still tune different regression models used in this example using scikit-learn’s GridSearchCV() function. By tuning, we mean trying out different hyper-parameters for each model.
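
A minimal GridSearchCV sketch is shown below. It tunes the alpha parameter of Ridge on synthetic data. The data, the grid values and the choice of Ridge are assumptions for illustration, not settings tuned for this dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, invented for this example
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(80, 3))
y_toy = X_toy @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.2, size=80)

# Candidate hyper-parameter values to try (hypothetical grid)
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="neg_mean_absolute_error")
search.fit(X_toy, y_toy)

print("best parameters:", search.best_params_)
print("best cross-validated MAE:", -search.best_score_)
```

The same pattern applies to the other models: replace Ridge() with the estimator of interest and list its own hyper-parameters in param_grid.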

## Feature Importance

Once we have a trained model, we can inspect its feature importance (or variable importance), which tells us how much each feature contributes to predicting the target. The chart below shows the relative importance of the features according to the XGBRegressor model fitted above.

In [None]:
feature_importance = model.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.max())

sorted_idx = np.argsort(feature_importance)
pos        = np.arange(sorted_idx.shape[0]) + .5

plt.figure(figsize = (15,18))

#Make a horizontal bar plot.
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, X.columns[sorted_idx]) # X holds the feature columns the model was trained on
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
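
As a sanity check on what these numbers mean, the sketch below fits scikit-learn's GradientBoostingRegressor (used here instead of XGBoost only to keep the example dependency-free) on synthetic data where only the first feature drives the target. The data and names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data, invented for this example: only feature 0 influences y
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 5.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=200)

gbr = GradientBoostingRegressor(random_state=0).fit(X_toy, y_toy)

importance = gbr.feature_importances_              # sums to 1 across features
relative = 100.0 * importance / importance.max()   # rescale so the top feature is 100
order = np.argsort(relative)                       # least to most important

for idx in order[::-1]:
    print(f"feature {idx}: {relative[idx]:.1f}")
```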

## Steps for Further Improvement

Some additional steps that may improve the score:

* Do additional pre-processing on the given data.
* Decide whether a regularized linear regression is good enough, or whether ensemble methods (or something else) work better.
* Introduce a greater variety of base models for learning. The more uncorrelated their results, the better the final score.

## It's up to you to find out.😉
