# Introduction to Scikit-Learn and Pandas

Artificial Intelligence and Machine Learning Symposium at OU  
University of Oklahoma Memorial Union Ballroom  
September 25, 2019  
Author: Keerti Banweer <keerti.banweer@ou.edu>

## Overview: Regression (Boston housing dataset)

Below are the topics that will be covered:
1. Load the dataset using sklearn.datasets
2. Describe the dataset using DESCR
3. Check for missing values using numpy functions isnan() and any()
4. Scale the data using sklearn scaler (we will be using min max scaler)
5. Regression
    1. Build the models using sklearn packages: linear regression, SGD regressor, LASSO and Elastic net
    2. Evaluate the predictions and check accuracy
    3. Compare different models using cross validation (sklearn.model_selection.cross_validate)
    
    
### General References
* [scikit-learn API](https://scikit-learn.org/stable/modules/classes.html)
    

## Imports

In [1]:
"""
This section will import all the required packages

"""

# Index of sklearn datasets https://scikit-learn.org/stable/datasets/index.html#datasets
# https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets
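# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import requires an older release.
from sklearn.datasets import load_boston, load_iris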
from sklearn import cluster, datasets

import numpy as np
import pandas as pd
import itertools 
import time

from matplotlib import rcParams, pyplot as plt
# Though the following import is not directly being used, it is required
# for 3D projection to work
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.ticker import NullFormatter

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet

from sklearn.metrics import make_scorer
from sklearn.metrics import roc_auc_score
from sklearn.metrics import auc, roc_curve

from sklearn.metrics.cluster import contingency_matrix 
from sklearn.metrics.pairwise import paired_euclidean_distances
from sklearn.metrics.pairwise import cosine_similarity

from sklearn.preprocessing import StandardScaler, MinMaxScaler

from sklearn.manifold import TSNE

from sklearn.cluster import KMeans

%matplotlib inline
%reload_ext autoreload
%autoreload 2

rcParams['figure.figsize'] = (8, 8)

globalStart = time.time()

## Load dataset 

In [2]:
"""
We will be using the Boston housing dataset,
loaded with load_boston(), and inspecting it
with keys() and DESCR
"""


# keys() displays the keys in the dataset
## dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])


# The DESCR attribute holds the dataset description, including the list of features and the target


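The loading step described in this cell might look like the sketch below. One assumption to flag: `load_boston` was removed from scikit-learn in version 1.2, so the sketch uses the bundled `load_diabetes` regression dataset as a stand-in. On an older release, `load_boston()` can be swapped in, since it returns the same kind of object with `keys()` and `DESCR`.

```python
from sklearn.datasets import load_diabetes

# Stand-in for load_boston(), which recent scikit-learn releases no longer ship.
dataset = load_diabetes()

# The returned object behaves like a dict; keys() lists the available fields.
print(dataset.keys())

# DESCR is a string attribute holding the dataset description.
# Only the first 500 characters are printed to keep the output short.
print(dataset.DESCR[:500])
```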

In [3]:
# In this section, we will set up variables and store the data and target information
# The data key contains the data of all the features


# List of all the attributes
# Target column name

"""
In this section, we will print out the dimensions of the data
"""


"""
Convert the loaded dataset to a DataFrame using pandas.
pandas provides a range of functionality for accessing
and using data efficiently
"""


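A minimal sketch of the steps outlined in this cell: store the features and target, print the dimensions, and build a DataFrame. It assumes `load_diabetes` as a stand-in for the Boston data; the variable names (`X`, `y`, `feature_names`, `target_name`) are illustrative choices, not taken from the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

dataset = load_diabetes()  # stand-in for the Boston data

X = dataset.data                             # feature matrix, shape (n_samples, n_features)
y = dataset.target                           # target vector, shape (n_samples,)
feature_names = list(dataset.feature_names)  # list of all the attributes
target_name = "target"                       # hypothetical target column name

# Print the dimensions of the data
print("X shape:", X.shape)
print("y shape:", y.shape)

# Unpack the number of samples and the number of features from the shape
n_samples, n_features = X.shape

# Convert the feature matrix to a DataFrame with named columns
df = pd.DataFrame(X, columns=feature_names)
print(df.describe())
```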

In [4]:
# This section checks for any missing values
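One way to run the missing-value check with `isnan()` and `any()`, sketched on the `load_diabetes` stand-in data (an assumption, since `load_boston` is unavailable in recent scikit-learn releases). The per-feature count is an extra illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes

X = load_diabetes().data  # stand-in for the Boston feature matrix

# isnan() marks each NaN entry; any() reduces the mask to a single boolean
has_missing = np.isnan(X).any()
print("Any missing values:", has_missing)

# Counting NaNs per column shows where missing values are, if there are any
missing_per_feature = np.isnan(X).sum(axis=0)
print("Missing values per feature:", missing_per_feature)
```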


In [5]:
""" 
Store the number of samples and the number of features, by
accessing the values from the shape of X
"""



In [6]:
"""
Visualize the data using histograms.
pandas provides a convenient hist() method;
see the documentation for more info:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.hist.html
"""


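The histogram step could be sketched as follows, again on the `load_diabetes` stand-in (an assumption). The `bins` and `figsize` values are arbitrary illustrative choices.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_diabetes

dataset = load_diabetes()  # stand-in for the Boston data
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)

# DataFrame.hist() draws one histogram per numeric column
# and returns an array of matplotlib Axes.
axes = df.hist(bins=20, figsize=(10, 10))
plt.tight_layout()
```

With `%matplotlib inline` active, as in the imports cell, the figure is displayed automatically when the cell finishes.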

In [7]:
"""
Use a boxplot to display the data in each feature
"""


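A possible version of the boxplot step, assuming the same `load_diabetes` stand-in. `DataFrame.boxplot()` draws one box per column on a shared axis, which makes differences in scale between features easy to spot.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

dataset = load_diabetes()  # stand-in for the Boston data
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)

# One box per feature; rot rotates the tick labels so they stay readable
ax = df.boxplot(rot=45, figsize=(10, 6))
ax.set_title("Feature distributions before scaling")
```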

In [8]:
"""
Use min-max scaler to scale the data
"""


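Min-max scaling maps each feature to the range [0, 1] via x' = (x - min) / (max - min), computed per column. A minimal sketch with `MinMaxScaler`, assuming the `load_diabetes` stand-in data and illustrative variable names:

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import MinMaxScaler

dataset = load_diabetes()  # stand-in for the Boston data
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)

# fit_transform() learns each column's min and max, then rescales the column
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(df)

# Wrap the result in a DataFrame again so the column names are kept
df_scaled = pd.DataFrame(X_scaled, columns=df.columns)
print(df_scaled.describe().loc[["min", "max"]])
```

In a modeling workflow, the scaler is usually fit on the training split only and then applied to the test split, so that no information from the test data leaks into training.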

In [9]:
"""
Display the scaled dataset
"""




In [10]:
"""
Use a boxplot to display the data in each feature
"""



In [11]:
# add target column to the dataframe


In [12]:
# Display top 5 rows
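Adding the target column and displaying the first rows might look like this sketch. It assumes the `load_diabetes` stand-in data, and the column name `"target"` is a hypothetical choice.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

dataset = load_diabetes()  # stand-in for the Boston data
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)

# Add the target as a new column (hypothetical column name)
target_name = "target"
df[target_name] = dataset.target

# head() returns the first 5 rows by default
print(df.head())
```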


## Regression
Update the models 

In [13]:
"""
Create a linear regression model
"""



In [22]:
# train_test_split using sklearn
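A sketch of the split, assuming the `load_diabetes` stand-in data. The `test_size=0.35` and `random_state=42` values are borrowed from the Lasso cell later in this notebook; the values used in the original linear regression cell are not shown.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)  # stand-in for the Boston data

# Hold out 35% of the samples for testing; a fixed random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.35, random_state=42
)

# Shapes of the train and test data
print("Train:", X_train.shape, y_train.shape)
print("Test: ", X_test.shape, y_test.shape)
```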



In [14]:
# shape of train data



In [15]:
# shape of test data



(339,)

In [26]:
## after the train test split, use the linear regression 
 

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [27]:
# Calculate the predictive accuracy of training dataset


In [28]:
# Calculate the predictive accuracy of test dataset
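The fit-and-score steps could be sketched as below, on the `load_diabetes` stand-in (an assumption) with split parameters borrowed from the Lasso cell. For regressors, `score()` returns the coefficient of determination R^2 rather than a classification accuracy.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)  # stand-in for the Boston data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.35, random_state=42
)

# Create and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# score() returns R^2: 1.0 is a perfect fit, 0.0 matches predicting the mean
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)

print("Train R^2 %.02f" % train_r2)
print("Test R^2 %.02f" % test_r2)
```

A training score much higher than the test score would suggest overfitting.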


In [16]:
# Prediction using Linear regression model



In [17]:
# Print the stats 


In [18]:
# Plot scatter plots to show the predictions
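The prediction, statistics, and scatter-plot cells above might be filled in along these lines. This is a sketch on the `load_diabetes` stand-in data; the choice of mean squared error as a reported statistic and the plot labels are assumptions, not taken from the original notebook.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)  # stand-in for the Boston data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.35, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# Predict on the held-out data
y_pred = model.predict(X_test)

# Print the stats
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Test MSE %.02f" % mse)
print("Test R^2 %.02f" % r2)

# Scatter plot of predicted vs. actual values; points on the dashed
# diagonal are predicted exactly
fig, ax = plt.subplots()
ax.scatter(y_test, y_pred, alpha=0.6)
low = min(y_test.min(), y_pred.min())
high = max(y_test.max(), y_pred.max())
ax.plot([low, high], [low, high], "k--")
ax.set_xlabel("Actual target")
ax.set_ylabel("Predicted target")
ax.set_title("Linear regression: predicted vs. actual")
```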



In [19]:
## Linear Model trained with L1 prior as regularizer (aka the Lasso)
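The next cell uses a `lasso_model` object whose definition is not shown. A minimal sketch of how it could be created follows; `alpha=0.1` is an assumed value, since the original setting is unknown, and the data is the `load_diabetes` stand-in. Larger `alpha` values strengthen the L1 penalty and push more coefficients to exactly zero.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)  # stand-in for the Boston data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.35, random_state=42
)

# alpha controls the strength of the L1 penalty (assumed value, not from the original)
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train, y_train)

# The L1 penalty can set some coefficients exactly to zero
n_zero = int((lasso_model.coef_ == 0).sum())
print("Coefficients set to zero: %d of %d" % (n_zero, lasso_model.coef_.size))
```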


In [46]:
# In this section we will use the Lasso regression model to train and predict
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=42)

# train the model
lasso_model.fit(X_train, y_train)

# Calculate the predictive accuracy on the training dataset
# The score method returns the coefficient of determination R^2 of the prediction.
training_acc_lasso = lasso_model.score(X_train, y_train)

# Calculate the predictive accuracy of test dataset
test_acc_lasso = lasso_model.score(X_test, y_test)

# Print the stats 
print("Train Accuracy %.02f" % training_acc_lasso)
print("Test Accuracy %.02f" % test_acc_lasso)

print(lasso_model.coef_)

pred_lasso_model = lasso_model.predict(X_test)

Train Accuracy 0.73
Test Accuracy 0.70
[-0.11621593  0.04163929 -0.0218949   1.73588239 -0.          3.68084761
 -0.02707188 -1.24007073  0.21427949 -0.01045183 -0.70264226  0.01235152
 -0.60766973]


In [20]:
# Plot scatter plots to show the predictions

