<a href="https://colab.research.google.com/github/SALAH-VECTICE/SDK_notebooks/blob/main/Notebooks/Vanilla/German_Credit_Analysis/German-Credit-Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip3 install -q fsspec
!pip3 install -q gcsfs
!pip3 install -q vectice
!pip3 install -q mlflow
!pip3 install -q google-cloud-storage
!pip3 install -q chart_studio

In [None]:
!pip3 show vectice

The main entry point of the SDK is the high-level API, which provides several ways to track your runs:

* A procedural approach with two methods: vectice.create_run() and vectice.save_after_run().

* A more powerful approach based on the vectice.Vectice class, which can be used in two ways:

  * Use an instance of vectice.Vectice to call create_run(), start_run() and end_run() (fluent API).

  * Use the context manager syntax (the Python `with` keyword), in which case the end of the run is managed automatically.
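
To illustrate why the context manager form ends the run automatically, here is a minimal sketch built on a hypothetical stand-in class, `DemoTracker`. It is not the Vectice SDK and none of its names come from the SDK; it only contrasts explicit start/end calls with a `with` block that runs the end step on exit, even when the body raises.

```python
from contextlib import contextmanager


class DemoTracker:
    """Hypothetical stand-in for a run tracker (NOT the Vectice API)."""

    def __init__(self):
        self.events = []

    def start_run(self, name):
        self.events.append(f"start:{name}")

    def end_run(self, name):
        self.events.append(f"end:{name}")

    @contextmanager
    def run(self, name):
        # The finally clause guarantees end_run is called on exit.
        self.start_run(name)
        try:
            yield self
        finally:
            self.end_run(name)


tracker = DemoTracker()

# Explicit style: the caller must remember to end the run.
tracker.start_run("explicit")
tracker.end_run("explicit")

# Context-manager style: the run is closed even if the body fails.
try:
    with tracker.run("managed"):
        raise RuntimeError("simulated failure inside the run")
except RuntimeError:
    pass

print(tracker.events)
```

The design point is that cleanup lives in one place (the `finally` clause) instead of being repeated at every call site.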

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from plotly import tools
import chart_studio.plotly as py
import plotly.figure_factory as ff
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.offline as pyo
import warnings
import os

from google.cloud import storage
import mlflow
from vectice import Vectice
from vectice.models import JobType
from vectice.entity.model import ModelType
from vectice.entity.model_version import ModelVersionStatus

pyo.init_notebook_mode()
init_notebook_mode(connected=True)
warnings.filterwarnings("ignore")
%matplotlib inline

The Python SDK is documented here: [Python SDK Documentation](https://storage.googleapis.com/sdk-documentation/index.html)

### Goals for this Project
* Explore our data and detect key patterns.
* Develop a model (a support vector classifier in this notebook) to predict whether a loan will be a good or bad risk.
* Most importantly, have fun while doing this project.
### Brief Overview:
The first phase of this project is to see what our data is made of: which variables are numerical or categorical, and which columns have "Null" values, which is something we will address in the feature engineering phase.

#### Summary:
* We have four numeric and four categorical features.
* The average age of people in our dataset is 35.54
* The average credit amount borrowed is 3271

Make sure you declare the GCS credentials as an environment variable, as shown below, or in any other way you prefer, so that we can access the files in GCS.

```
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'FILE.json'
```

In [None]:
# In Google Colab you can upload the JSON key file that was provided with your tutorial account to access GCS.
# The name should be something like test.json.
from google.colab import files
uploaded = files.upload()

## Vectice Credentials 

To connect to the Vectice App through the SDK you'll need the Project Token, the Vectice API Endpoint and the Vectice API Token, all of which you'll find in the Vectice App: the Workspace lets you create the Vectice API Token, and Projects gives you the Project Token. The Vectice API Endpoint is 'https://be-beta.vectice.com'. You're also provided with the GCS Service Account JSON, which lets you connect to the GCS Bucket in the Vectice App and get the data needed for the example.

## Credentials Setup:
The Vectice API Endpoint and Token are needed to connect to the Vectice UI. Furthermore, a Google Cloud Storage credential JSON is needed to connect to Google Cloud Storage to retrieve and upload the datasets. A project token links the runs to the relevant project and is needed to create runs.
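
Before creating any run, it can help to confirm that these settings are actually present. The sketch below assumes the three environment variable names used in this notebook's setup cell; `missing_settings` is a hypothetical helper written for this example and is not part of any SDK.

```python
import os

# Variable names taken from the setup cell in this notebook.
REQUIRED_SETTINGS = (
    "VECTICE_API_ENDPOINT",
    "VECTICE_API_TOKEN",
    "GOOGLE_APPLICATION_CREDENTIALS",
)


def missing_settings(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


# A plain dict stands in for os.environ in this example.
example_env = {"VECTICE_API_ENDPOINT": "https://beta.vectice.com"}
print(missing_settings(example_env))
```

Calling `missing_settings()` with no argument checks the real environment, so an empty list means all three settings are in place.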

In [None]:
os.environ['VECTICE_API_ENDPOINT'] = 'https://beta.vectice.com'
# Workspace -> API Tokens from Vectice App
os.environ['VECTICE_API_TOKEN'] = "CONNECTION_TOKEN"
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "FILE.json"
# Project token from Vectice App
PROJECT_TOKEN = "PROJECT_TOKEN"

In [None]:
# Create Vectice instance 
vectice = Vectice(project_token=PROJECT_TOKEN)
# It will specify the dataset version we just created as the run's input. *NB - You need to add a dataset called "German-Credit-Data" in your Vectice Project.
ds_version = [vectice.create_dataset_version().with_parent_name("German-Credit-Data")]
# Create a run 
run = vectice.create_run('Data Cleaning', JobType.PREPARATION)
# Start a run to track this data cleaning job
vectice.start_run(run, inputs = ds_version)

This is an example of how you would push your data into your GCS bucket.

```
data.to_csv("gs://BUCKET/FILE_PATH/FILENAME.csv")
```


The dataset used in this tutorial is retrieved from a Google Cloud Storage Bucket. Throughout the notebook you will notice that interacting with the Google Cloud Storage Bucket is relatively easy.

In [None]:
# Get data from GCS Storage bucket
df = pd.read_csv("gs://vectice-examples-samples/German_Credit/german_credit_data.csv", index_col=0)
# Create a copy of the dataframe to use later
original_df = df.copy()

In [None]:
# Rename the column
df = df.rename(columns={"Credit amount": "Credit_amount"})

In [None]:
df.head()

You could then call ```df.to_csv("gs://BUCKET/FILE_PATH/FILENAME.csv")``` to push the dataset to the GCS Bucket, and then add it to the Vectice App through the UI so that it can be used in the Exploratory Data Analysis.

In [None]:
# Create outputs
outputs = [vectice.create_dataset_version().with_parent_name("EDA data")]
# End the run and save the new dataset version.
# Set the original_cleaned as an output.
vectice.end_run(outputs = outputs)

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.columns

In [None]:
# Check missing values
df.isnull().sum().sort_values(ascending=False)

In [None]:
df.head()

In [None]:
# Distribution of Credit_Amount for each Gender
male_credit = df["Credit_amount"].loc[df["Sex"] == "male"].values
female_credit = df["Credit_amount"].loc[df["Sex"] == "female"].values
total_credit = df['Credit_amount'].values

fig, ax = plt.subplots(1, 3, figsize=(16,4))

sns.distplot(male_credit, ax=ax[0], color="#FE642E")
ax[0].set_title("Male Credit Distribution", fontsize=16)
sns.distplot(female_credit, ax=ax[1], color="#F781F3")
ax[1].set_title("Female Credit Distribution", fontsize=16)
sns.distplot(total_credit, ax=ax[2], color="#2E64FE")
ax[2].set_title("Total Credit Distribution", fontsize=16)
plt.show()

In [None]:
plt.figure(figsize=(13,6)) #figure size
g = sns.boxplot(x='Purpose', y='Credit_amount', 
                   data=df, palette="RdBu")


g.set_title("Credit Distribution by Purpose", fontsize=16)
g.set_xticklabels(g.get_xticklabels(), rotation=45)  # Rotate the x tick labels so long purpose names stay readable
g.set_xlabel('Purpose', fontsize=18)  # x axis shows the loan purpose
g.set_ylabel('Credit amount', fontsize=18)  # y axis shows the amount borrowed
plt.show()

### Analysis by Group:
#### Gender Analysis:
In this section we analyze the gender column of our dataset.

#### Objectives:
* Find the distribution of genders in our dataset.
* See the distribution of each gender by age (for instance, we have a higher number of young males than young females).
* What were the main application reasons for a credit loan? Does it vary by Gender?
* How many jobs does each gender have? How many are Unemployed?

#### Summary:
* There are 2x more males than females in our dataset.
* Most females that applied for a credit loan were under 30.
* Most of the males that applied for a loan were in their 20s to 40s.
* Females were more likely to apply for a credit loan to buy furniture and equipment (10% more than males).
* Males applied 2x more than females for a credit loan to invest in a business.
* 2x as many females were unemployed compared to males.
* 2x as many males worked 3 jobs compared to females.
* Surprisingly, most people that applied for a credit loan have two jobs!

In [None]:
# We have 2x more German males applying for Credit Loans than Females.
df["Sex"].value_counts()

In [None]:
from IPython.display import HTML

by_age = df['Age'].values.tolist()
male_age = df['Age'].loc[df['Sex'] == 'male'].values.tolist()
female_age = df['Age'].loc[df['Sex'] == 'female'].values.tolist()

trace0 = go.Histogram(
    x=male_age,
    histnorm='probability',
    name="German Male",
    marker = dict(
        color = 'rgba(100, 149, 237, 0.6)',
    )
)
trace1 = go.Histogram(
    x=female_age,
    histnorm='probability',
    name="German Female",
    marker = dict(
        color = 'rgba(255, 182, 193, 0.6)',
    )
)
trace2 = go.Histogram(
    x=by_age,
    histnorm='probability',
    name="Overall Gender",
     marker = dict(
        color = 'rgba(169, 169, 169, 0.6)',
    )
)
from plotly.subplots import make_subplots  # plotly.tools.make_subplots is deprecated

fig = make_subplots(rows=2, cols=2, specs=[[{}, {}], [{'colspan': 2}, None]],
                    subplot_titles=('Males','Female', 'All Genders'))

fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)
fig.append_trace(trace2, 2, 1)

fig['layout'].update(showlegend=True, title='Age Distribution by Gender', bargap=0.05)
# iplot(fig, filename='custom-sized-subplot-with-subplot-titles')
HTML(fig.to_html())

In [None]:
# Gender vs Purpose let's see the purpose of having credit loans for each gender.
df["Purpose"].unique()
sex_purpose = pd.crosstab(df['Purpose'], df['Sex']).apply(lambda x: x/x.sum() * 100)
sex_purpose

### Age Groups:
In this section we will create categorical groups based on the age column. The following categorical variables will belong to the "Age_Group" column:

* Young: clients aged 19 to 29.
* Young Adults: clients aged 30 to 40.
* Senior: clients aged 41 to 55.
* Elder: clients older than 55.
 
#### What we want to accomplish:
* Create different age groups based on their age.
* See the Credit amounts borrowed by clients belonging to each age group.
* Go deeper in our analysis, determine which loans were high risk and see if there are any patterns with regard to age groups.

#### Summary:
* The younger age group tended to ask for slightly higher loans compared to the older age groups.
* The young and elderly groups had the highest ratio of high risk loans, with 45.29% of all the clients in the young age group considered high risk.
* Within the elderly group, 44.28% of clients were considered high risk.
* Interestingly, these are the groups most likely to be unemployed or working part-time: the youngest clients either lack the experience to get a job or are studying at a university and do not have time for a full-time job.
* Clients in the elderly group are most likely living on their pensions, meaning they are most likely unemployed or working part-time.
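
A note on how such percentages can be computed: the chart cell in this notebook sums `Credit_amount` per group, which gives an amount-weighted share, while the wording above reads as a share of clients. The two can differ. The sketch below shows both on a small toy table whose values are invented for illustration and are not taken from the German credit data.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Age_Group": ["Young", "Young", "Young", "Elder", "Elder"],
    "Risk": ["good", "bad", "good", "bad", "good"],
    "Credit_amount": [1000, 5000, 2000, 3000, 3000],
})

is_bad = toy["Risk"] == "bad"

# Share of clients in each group whose loan is bad (count-based).
bad_share_by_count = is_bad.groupby(toy["Age_Group"]).mean() * 100

# Share of borrowed money in each group that sits in bad loans (amount-weighted).
bad_amount = toy.loc[is_bad].groupby("Age_Group")["Credit_amount"].sum()
total_amount = toy.groupby("Age_Group")["Credit_amount"].sum()
bad_share_by_amount = bad_amount / total_amount * 100

print(bad_share_by_count.round(2))
print(bad_share_by_amount.round(2))
```

In the toy "Young" group one client in three is a bad risk, yet that single loan holds most of the money borrowed, so the amount-weighted share is much higher than the count-based one.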

In [None]:
# Assign each client to an age group so we can compare risky and non-risky loans per group.
# Bins are right-inclusive: (18, 29], (29, 40], (40, 55] and above 55.
df['Age_Group'] = pd.cut(
    df['Age'],
    bins=[18, 29, 40, 55, np.inf],
    labels=['Young', 'Young Adults', 'Senior', 'Elder'],
).astype(object)

df.head()

In [None]:
# Lets find loans by age group and by the level of risk and plot them in a bar chart.

# Age Group Segments
young_good = df['Credit_amount'].loc[(df['Age_Group'] == 'Young') & (df['Risk'] == 'good')].sum()
young_bad = df['Credit_amount'].loc[(df['Age_Group'] == 'Young') & (df['Risk'] == 'bad')].sum()
young_adult_good = df['Credit_amount'].loc[(df['Age_Group'] == 'Young Adults') & (df['Risk'] == 'good')].sum()
young_adult_bad = df['Credit_amount'].loc[(df['Age_Group'] == 'Young Adults') & (df['Risk'] == 'bad')].sum()
senior_good = df['Credit_amount'].loc[(df['Age_Group'] == 'Senior') & (df['Risk'] == 'good')].sum()
senior_bad = df['Credit_amount'].loc[(df['Age_Group'] == 'Senior') & (df['Risk'] == 'bad')].sum()
elder_good = df['Credit_amount'].loc[(df['Age_Group'] == 'Elder') & (df['Risk'] == 'good')].sum()
elder_bad = df['Credit_amount'].loc[(df['Age_Group'] == 'Elder') & (df['Risk'] == 'bad')].sum()

# Percents
young_good_p = young_good/(young_good + young_bad) * 100
young_bad_p = young_bad/(young_good + young_bad) * 100
young_adult_good_p = young_adult_good/(young_adult_good + young_adult_bad) * 100
young_adult_bad_p = young_adult_bad/(young_adult_good + young_adult_bad) * 100
senior_good_p = senior_good/(senior_good + senior_bad) * 100
senior_bad_p =  senior_bad/(senior_good + senior_bad) * 100
elder_good_p = elder_good/(elder_good + elder_bad) * 100
elder_bad_p = elder_bad/(elder_good + elder_bad) * 100

# Round Percents
young_good_p = str(round(young_good_p, 2))
young_bad_p = str(round(young_bad_p, 2))
young_adult_good_p = str(round(young_adult_good_p, 2))
young_adult_bad_p = str(round(young_adult_bad_p, 2))
senior_good_p = str(round(senior_good_p, 2))
senior_bad_p = str(round(senior_bad_p, 2))
elder_good_p = str(round(elder_good_p, 2))
elder_bad_p = str(round(elder_bad_p, 2))



x = ["Young", "Young Adults", "Senior", "Elder"]

good_loans = go.Bar(
    x=x,
    y=[young_good, young_adult_good, senior_good, elder_good],
    name="Good Loans",
    text=[young_good_p + '%', young_adult_good_p + '%', senior_good_p + '%', elder_good_p + '%'],
    textposition = 'auto',
    marker=dict(
        color='rgb(111, 235, 146)',
        line=dict(
            color='rgb(60, 199, 100)',
            width=1.5),
        ),
    opacity=0.6
)

bad_loans =  go.Bar(
    x=x,
    y=[young_bad, young_adult_bad, senior_bad, elder_bad],
    name="Bad Loans",
    text=[young_bad_p + '%', young_adult_bad_p + '%', senior_bad_p + '%', elder_bad_p + '%'],
    textposition = 'auto',
    marker=dict(
        color='rgb(247, 98, 98)',
        line=dict(
            color='rgb(225, 56, 56)',
            width=1.5),
        ),
    opacity=0.6
)

data = [good_loans, bad_loans]

layout = dict(
    title="Type of Loan by Age Group", 
    xaxis = dict(title="Age Group"),
    yaxis= dict(title="Credit Amount")
)

fig = go.Figure(data=data, layout=layout)  # pass the layout so the title and axis labels are shown

HTML(fig.to_html())

### Wealth Analysis:
In this section we will analyse the wealth of our clients through their checking accounts, and whether a client's wealth status contributes to the risk of the loans issued to customers.

#### Summary:
* Individuals belonging to the "little wealth" group had a higher probability of being bad risk loans than the other groups.
* The higher the wealth, the lower the probability of being a bad risk loan.
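
The per-group probabilities behind this summary come from normalizing a cross-tabulation column by column. As a minimal sketch on invented toy data (not the real dataset), pandas can do the normalization directly through the `normalize` argument of `pd.crosstab`:

```python
import pandas as pd

# Toy data, invented for illustration only; None marks a missing checking account.
toy = pd.DataFrame({
    "Risk": ["bad", "bad", "good", "bad", "good", "good", "good"],
    "Checking account": ["little", "little", "little",
                         "moderate", "moderate", "rich", None],
})

# normalize="columns" divides each column by its own total,
# so every column sums to 100 after scaling.
share = pd.crosstab(toy["Risk"], toy["Checking account"], normalize="columns") * 100
print(share.round(2))
```

Each column then reads as "of the clients in this wealth group, what percentage are bad or good risk", which is the comparison the summary describes.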

In [None]:
# We have some missing value so we will just ignore the missing values in this analysis.
df["Checking account"].unique()
df.columns

In [None]:
cross_checking = pd.crosstab(df['Risk'], df['Checking account']).apply(lambda x: x/x.sum() * 100)
decimals = pd.Series([2,2,2], index=['little', 'moderate', 'rich'])

cross_checking = cross_checking.round(decimals)
cross_checking

### High Risk Loans vs Low Risk Loans:
In this section we will analyze both high and low risk loans. The most important thing is to find patterns that could describe some sort of correlation with these output values.

#### Correlation (Our intent):
In this part of the analysis, we want to look at which features directly affect the risk of the loan. To see these patterns, we first create a new column named "Risk_int" (risk in integer form) and include it in the correlation heatmap plot. "0" stands for "bad risk" loans and "1" stands for "good risk" loans.

#### Summary:
* The higher the credit amount borrowed, the more likely the loan will end up bad.
* The longer the duration of the loan, the more likely the loan will turn out to be bad.
* Loans over 12k requested by Seniors and Elders have a high chance of becoming bad loans.
* If the credit amount borrowed is 11,000 or more, the probability of the loan being a bad one increases drastically. (Observe the Correlation of Risk with Credit Amount Borrowed.)
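
The correlation step described above can be sketched as follows. The toy values are invented for illustration; only the column names and the 0/1 encoding follow this notebook.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Credit_amount": [1000, 2000, 3000, 9000, 12000, 15000],
    "Duration": [6, 12, 12, 36, 48, 60],
    "Risk": ["good", "good", "good", "bad", "bad", "bad"],
})

# 0 = bad risk, 1 = good risk, as described above.
toy["Risk_int"] = toy["Risk"].map({"bad": 0, "good": 1})

# Pearson correlation of every numeric column with Risk_int.
corr_with_risk = toy.corr(numeric_only=True)["Risk_int"].drop("Risk_int")
print(corr_with_risk.round(2))
```

A negative value means the feature tends to be larger for bad loans. To draw the heatmap itself, the full matrix can be passed to seaborn, for example `sns.heatmap(toy.corr(numeric_only=True), annot=True)`.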

In [None]:
# Encode the target as an integer: 0 = bad risk, 1 = good risk.
df['Risk_int'] = df['Risk'].map({'bad': 0, 'good': 1}).astype(int)
df.head()

In [None]:
# The higher the credit amount the higher the risk of the loan. Scatter plot?
# The higher the duration of the loan the higher the risk of the loan?

bad_credit_amount = df["Credit_amount"].loc[df['Risk'] == 'bad'].values.tolist()
good_credit_amount = df["Credit_amount"].loc[df['Risk'] == 'good'].values.tolist()
bad_duration = df['Duration'].loc[df['Risk'] == 'bad'].values.tolist()
good_duration = df['Duration'].loc[df['Risk'] == 'good'].values.tolist()


bad_loans = go.Scatter(
    x = bad_duration,
    y = bad_credit_amount,
    name = 'Bad Loans',
    mode = 'markers',
    marker = dict(
        size = 10,
        color = 'rgba(152, 0, 0, .8)',
        line = dict(
            width = 2,
            color = 'rgb(0, 0, 0)'
        )
    )
)

good_loans = go.Scatter(
    x = good_duration,
    y = good_credit_amount,
    name = 'Good Loans',
    mode = 'markers',
    marker = dict(
        size = 10,
        color = 'rgba(34, 139, 34, .9)',
        line = dict(
            width = 2,
        )
    )
)

data = [bad_loans, good_loans]

layout = dict(title = 'Correlation of Risk with <br> Credit Amount Borrowed',
              yaxis = dict(zeroline = False),
              xaxis = dict(zeroline = False)
             )

fig = go.Figure(data=data, layout=layout)  # pass the layout so the title is shown

HTML(fig.to_html())

### Exploring Purposes of Loans:
In this section my main aim is to see which purposes were most likely to bring the most risk, in other words which of these purposes were more likely to be considered high risk loans. I would also like to explore the operative side of the business by determining which purposes contributed the most towards loans issued.

#### Summary:
* Cars, Radio/TV and Furniture and Equipment made up more than 50% of the total risk and have the highest distribution of credit issued.
* The rest of the purposes were not frequent reasons for applying for a loan.
* Cars and Radio/TV were the least risky purposes from the operative perspective since they had the widest gap between good and bad risk.

In [None]:
df['Purpose'].unique()

cross_purpose = pd.crosstab(df['Purpose'], df['Risk']).apply(lambda x: x/x.sum() * 100)
cross_purpose = cross_purpose.round(decimals=2)
cross_purpose.sort_values(by=['bad'])

### Predictive Modelling:

In [None]:
# Create inputs 
inputs = [vectice.create_dataset_version().with_parent_name('German-Credit-Data')]
# Create a run
run = vectice.create_run('Data Cleaning', JobType.PREPARATION)
vectice.start_run(run, inputs = inputs)

### Data Cleaning
In machine learning, if the data is irrelevant or error-prone then it leads to an incorrect model being built.

In [None]:
# Check missing values in our dataframe
original_df.isnull().sum().sort_values(ascending=False)

In [None]:
# We will drop the columns that have missing values, although we will be losing some information. Hopefully this does not cause
# the model to underfit in the future.
original_df.drop(['Checking account', 'Saving accounts'], axis=1, inplace=True)

In [None]:
original_df.isnull().sum().sort_values(ascending=False)

In [None]:
# In a real world scenario you could upload datasets into the GCS Bucket and update runs with the new dataset and add it through the Vectice App.
# For example -> original_df.to_csv(GCS_URI)
# Create outputs 
outputs = [vectice.create_dataset_version().with_parent_name("data cleaned")]
# End run
vectice.end_run(outputs = outputs)

### Train-Test Split Evaluation 
The procedure involves taking a dataset and dividing it into two subsets. The first subset is used to fit the model and is referred to as the training dataset. The second subset is not used to train the model; instead, the input element of the dataset is provided to the model, then predictions are made and compared to the expected values. This second dataset is referred to as the test dataset.
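
As a minimal sketch of a stratified version of this procedure, the example below uses scikit-learn's `train_test_split` with the `stratify` argument on toy labels. The 70/30 class ratio mirrors the comment in the next cell; the features are placeholders invented for the example.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy labels with a 70/30 class ratio; features are just row numbers.
labels = ["good"] * 70 + ["bad"] * 30
features = [[i] for i in range(len(labels))]

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.3, stratify=labels, random_state=42
)

# Both subsets keep the original class ratio.
print(Counter(y_tr))
print(Counter(y_te))
```

Stratifying matters here because a purely random split of a small, imbalanced dataset can leave the test set with a noticeably different share of bad loans than the training set.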

In [None]:
# Feature Engineering (We cannot delete the missing values because we have too little data)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedShuffleSplit

original_df["Risk"].value_counts() # 70% is good risk and 30% is bad risk.

stratified = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)

for train, test in stratified.split(original_df, original_df["Risk"]):
    # split() yields positional indices, so select rows with iloc
    strat_train = original_df.iloc[train]
    strat_test = original_df.iloc[test]
    

# The main purpose of this code is to keep an approximate ratio
# of 70% good risk and 30% bad risk in both training and testing sets.
# Divide by each subset's own size so the values read as class ratios.
print(strat_train["Risk"].value_counts() / len(strat_train))
print(strat_test["Risk"].value_counts() / len(strat_test))

### Imbalanced Classification Problem
The number of examples that belong to each class may be referred to as the class distribution.

Imbalanced classification refers to a classification predictive modeling problem where the number of examples in the training dataset for each class label is not balanced.

That is, where the class distribution is not equal or close to equal, and is instead biased or skewed.
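
To make the idea concrete, the sketch below computes the class distribution, the imbalance ratio and one common set of per-class weights for toy labels with a 70/30 split. The weight formula, `n_samples / (n_classes * class_count)`, is the heuristic scikit-learn documents for `class_weight="balanced"`; this notebook does not apply it, so treat it as an optional illustration.

```python
from collections import Counter

# Toy labels with a 70/30 split, invented for illustration.
labels = ["good"] * 700 + ["bad"] * 300

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# Fraction of examples in each class.
distribution = {cls: n / n_samples for cls, n in counts.items()}

# "Balanced" heuristic: rarer classes get proportionally larger weights.
balanced_weights = {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# How many majority examples there are per minority example.
imbalance_ratio = max(counts.values()) / min(counts.values())

print(distribution, balanced_weights, round(imbalance_ratio, 2))
```

With these weights each class contributes the same total weight, which is one way to keep the majority class from dominating training.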

In [None]:
# Create inputs 
inputs = [vectice.create_dataset_version().with_parent_name('German-Credit-Data')]
# Create a run
run = vectice.create_run('Train Test Split', JobType.PREPARATION)
# Start the run
vectice.start_run(run, inputs = inputs)

In [None]:
# Have our new train and test data
train = strat_train
test = strat_test


# Our features
X_train = train.drop('Risk', axis=1)
X_test = test.drop('Risk', axis=1)

# Our Labels we will use them later
y_train = train["Risk"]
y_test = test["Risk"]

In [None]:
# In a real world scenario you could upload datasets into the GCS Bucket and update runs with the new dataset and add it through the Vectice App.
# For example -> train.to_csv(GCS_URI) & test.to_csv(GCS_URI)
# Create outputs, in the Vectice App you can add the train and test data together and name it "train test data"
outputs = [vectice.create_dataset_version().with_parent_name("train test data")]
# End run
vectice.end_run(outputs = outputs)

In [None]:
# This is just a custom encoder that will be used in the Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import check_array
from sklearn.preprocessing import LabelEncoder
from scipy import sparse

class CategoricalEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, encoding='onehot', categories='auto', dtype=np.float64,
                 handle_unknown='error'):
        self.encoding = encoding
        self.categories = categories
        self.dtype = dtype
        self.handle_unknown = handle_unknown

    def fit(self, X, y=None):
        if self.encoding not in ['onehot', 'onehot-dense', 'ordinal']:
            template = ("encoding should be either 'onehot', 'onehot-dense' "
                        "or 'ordinal', got %s")
            raise ValueError(template % self.encoding)

        if self.handle_unknown not in ['error', 'ignore']:
            template = ("handle_unknown should be either 'error' or "
                        "'ignore', got %s")
            raise ValueError(template % self.handle_unknown)

        if self.encoding == 'ordinal' and self.handle_unknown == 'ignore':
            raise ValueError("handle_unknown='ignore' is not supported for"
                             " encoding='ordinal'")

        # np.object was removed in NumPy 1.24; the builtin object is equivalent
        X = check_array(X, dtype=object, accept_sparse='csc', copy=True)
        n_samples, n_features = X.shape

        self._label_encoders_ = [LabelEncoder() for _ in range(n_features)]

        for i in range(n_features):
            le = self._label_encoders_[i]
            Xi = X[:, i]
            if self.categories == 'auto':
                le.fit(Xi)
            else:
                valid_mask = np.in1d(Xi, self.categories[i])
                if not np.all(valid_mask):
                    if self.handle_unknown == 'error':
                        diff = np.unique(Xi[~valid_mask])
                        msg = ("Found unknown categories {0} in column {1}"
                               " during fit".format(diff, i))
                        raise ValueError(msg)
                le.classes_ = np.array(np.sort(self.categories[i]))

        self.categories_ = [le.classes_ for le in self._label_encoders_]

        return self

    def transform(self, X):
        # The np.object, np.int and np.bool aliases were removed in NumPy 1.24; use the builtins
        X = check_array(X, accept_sparse='csc', dtype=object, copy=True)
        n_samples, n_features = X.shape
        X_int = np.zeros_like(X, dtype=int)
        X_mask = np.ones_like(X, dtype=bool)

        for i in range(n_features):
            valid_mask = np.in1d(X[:, i], self.categories_[i])

            if not np.all(valid_mask):
                if self.handle_unknown == 'error':
                    diff = np.unique(X[~valid_mask, i])
                    msg = ("Found unknown categories {0} in column {1}"
                           " during transform".format(diff, i))
                    raise ValueError(msg)
                else:
                    # Set the problematic rows to an acceptable value and
                    # continue. The rows are marked in `X_mask` and will be
                    # removed later.
                    X_mask[:, i] = valid_mask
                    X[:, i][~valid_mask] = self.categories_[i][0]
            X_int[:, i] = self._label_encoders_[i].transform(X[:, i])

        if self.encoding == 'ordinal':
            return X_int.astype(self.dtype, copy=False)

        mask = X_mask.ravel()
        n_values = [cats.shape[0] for cats in self.categories_]
        n_values = np.array([0] + n_values)
        indices = np.cumsum(n_values)

        column_indices = (X_int + indices[:-1]).ravel()[mask]
        row_indices = np.repeat(np.arange(n_samples, dtype=np.int32),
                                n_features)[mask]
        data = np.ones(n_samples * n_features)[mask]

        out = sparse.csc_matrix((data, (row_indices, column_indices)),
                                shape=(n_samples, indices[-1]),
                                dtype=self.dtype).tocsr()
        if self.encoding == 'onehot-dense':
            return out.toarray()
        else:
            return out

In [None]:
# Scikit-Learn does not handle dataframes in pipeline so we will create our own class.
# Reference: Hands-On Machine Learning
from sklearn.base import BaseEstimator, TransformerMixin
# Create a class to select numerical or categorical columns.
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit (self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

In [None]:
# Create a run
run = vectice.create_run("SVC", job_type=JobType.TRAINING)
# Create inputs 
inputs = [vectice.create_dataset_version().with_parent_name("train test data")]
# Start run 
vectice.start_run(run, inputs = inputs)

### Pipelines
In most machine learning projects the data that you have to work with is unlikely to be in the ideal format for producing the best performing model. There are quite often a number of transformational steps such as encoding categorical variables, feature scaling and normalisation that need to be performed. Scikit-learn has built-in functions for most of these commonly used transformations in its preprocessing package.
However, in a typical machine learning workflow you will need to apply all these transformations at least twice. Once when training the model and again on any new data you want to predict on. Of course you could write a function to apply them and reuse that but you would still need to run this first and then call the model separately. Scikit-learn pipelines are a tool to simplify this process. They have several key benefits:
* They make your workflow much easier to read and understand.
* They enforce the implementation and order of steps in your project.
* These in turn make your work much more reproducible.
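
The sketch below shows the "fit once on training data, reuse on new data" idea with a single pipeline object. It swaps in scikit-learn's built-in `ColumnTransformer` and `OneHotEncoder` rather than the custom `CategoricalEncoder` and `DataFrameSelector` classes defined in this notebook, and the toy rows are invented for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Toy data, invented for illustration only.
train = pd.DataFrame({
    "Age": [22, 35, 47, 58, 31, 44],
    "Purpose": ["car", "radio/TV", "car", "education", "radio/TV", "car"],
})
y_train = [1, 1, 0, 0, 1, 0]
new_data = pd.DataFrame({"Age": [29, 61], "Purpose": ["car", "business"]})

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["Age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Purpose"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

model = Pipeline([("preprocess", preprocess), ("clf", SVC(kernel="linear"))])

# fit() learns the scaling statistics and categories from the training data only.
model.fit(train, y_train)

# predict() reuses those fitted transformations on the new rows.
predictions = model.predict(new_data)
encoded_new = model.named_steps["preprocess"].transform(new_data)
print(predictions, encoded_new.shape)
```

Because preprocessing and the classifier live in one object, new data can only be transformed with statistics learned from the training set. The unseen category "business" is encoded as all zeros thanks to `handle_unknown="ignore"`.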

In [None]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler

numeric_train_df = X_train.select_dtypes(exclude=['object'])
numeric_test_df = X_test.select_dtypes(exclude=['object'])

categorical_train_df = X_train.select_dtypes(['object'])
categorical_test_df = X_test.select_dtypes(['object'])

numerical_pipeline = Pipeline([
    ("select_numeric", DataFrameSelector(numeric_train_df.columns.values.tolist())),
    ("std_scaler", StandardScaler()),
])

categorical_pipeline = Pipeline([
    ('select_categoric', DataFrameSelector(categorical_train_df.columns.values.tolist())),
    ('encoding', CategoricalEncoder(encoding='onehot-dense'))
])

# Combine both pipelines
main_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', numerical_pipeline),
    ('cat_pipeline', categorical_pipeline)
])

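# Fit the transformations on the training data only, then reuse them on the
# test data so that no information from the test set leaks into preprocessing.
X_train_scaled = main_pipeline.fit_transform(X_train)
X_test_scaled = main_pipeline.transform(X_test)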

In [None]:
from sklearn.preprocessing import LabelEncoder

encode = LabelEncoder()
y_train_scaled = encode.fit_transform(y_train)
# Reuse the mapping learned on the training labels
y_test_scaled = encode.transform(y_test)

### Support Vector Machine: C-Support Vector Classification

Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outliers detection.

![Image](https://scikit-learn.org/stable/_images/sphx_glr_plot_iris_svc_0011.png)

[More info](https://scikit-learn.org/stable/modules/svm.html#svm-classification)

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Implement GridSearchCV to see which are our best parameters

params = {'C': [0.75, 0.85, 0.95, 1], 
          'kernel': ['linear', 'poly', 'rbf', 'sigmoid'], 
          'degree': [3, 4, 5]}

svc_clf = SVC(random_state=42)

grid_search_cv = GridSearchCV(svc_clf, params)
grid_search_cv.fit(X_train_scaled, y_train_scaled)

### GridSearchCV: Hyper Parameter Tuning
It's an exhaustive search over specified parameter values for an estimator.

It takes an estimator (e.g. SVC) and a parameter grid like the one below. Cross-validation is then performed to find the best combination of parameters. Other methods such as RandomizedSearchCV and genetic algorithms can be used as well.

```
params = {'C': [0.75, 0.85, 0.95, 1], 
          'kernel': ['linear', 'poly', 'rbf', 'sigmoid'], 
          'degree': [3, 4, 5]}
```
The parameters of the estimator used to apply these methods are optimized by cross-validated grid-search over a parameter grid.
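
To get a feel for the cost of an exhaustive search, the sketch below counts the candidates in the grid above with scikit-learn's `ParameterGrid`. The fold count of 5 is an assumption based on the default `cv` of recent scikit-learn versions, since the notebook does not set `cv` explicitly.

```python
from sklearn.model_selection import ParameterGrid

params = {'C': [0.75, 0.85, 0.95, 1],
          'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
          'degree': [3, 4, 5]}

grid = ParameterGrid(params)
n_candidates = len(grid)   # 4 * 4 * 3 combinations
n_folds = 5                # assumed: default cv in recent scikit-learn
n_fits = n_candidates * n_folds

# 'degree' is only used by the 'poly' kernel, so for the other kernels
# candidates that differ only in degree behave identically.
effective = {
    (p['C'], p['kernel'], p['degree'] if p['kernel'] == 'poly' else None)
    for p in grid
}
print(n_candidates, n_fits, len(effective))
```

Half of the candidates are effectively duplicates. Passing a list of dictionaries to `GridSearchCV`, with `degree` listed only alongside `kernel='poly'`, would avoid the redundant fits.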

In [None]:
grid_search_cv.best_estimator_

In [None]:
grid_search_cv.best_params_

In [None]:
svc_clf = grid_search_cv.best_estimator_
svc_clf.fit(X_train_scaled, y_train_scaled)

In [None]:
score = svc_clf.score(X_train_scaled, y_train_scaled)
score

In [None]:
from sklearn.model_selection import cross_val_score

# Let's make sure the model is not overfitting
svc_clf = SVC(kernel='rbf', C=1, random_state=42)
scores = cross_val_score(svc_clf, X_train_scaled, y_train_scaled)
scores.mean()

In [None]:
# Create model version
model_version = [
    vectice.create_model_version()
    .with_parent_name('SVC')
    .with_user_version("Run")
    .with_algorithm('Classification')
    .with_properties([(x, str(y)) for x, y in grid_search_cv.best_params_.items()])
    # Close the metrics list before chaining the remaining calls
    .with_metrics([("Score", score), ("CV score", scores.mean())])
    .with_type(ModelType.CLASSIFICATION)
    .with_status(ModelVersionStatus.EXPERIMENTATION)
]
# End run 
vectice.end_run(outputs = model_version)

#### End!

Congratulations and as Jake Peralta would say:

![Image](https://i.imgur.com/I1wR7mE.gif?noredirect)