## Loading a Sample Dataset

scikit-learn comes with a number of popular datasets for you to use:

load_boston
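Contains 506 observations on Boston housing prices. It is a good dataset for exploring regression algorithms. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so it is only available in older versions.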

load_iris
Contains 150 observations on the measurements of Iris flowers. It is a good dataset for exploring classification algorithms.

load_digits
Contains 1,797 observations from images of handwritten digits. It is a good dataset for teaching image classification.

In [249]:
# Load scikit-learn's datasets
from sklearn import datasets

# Load digits dataset
digits = datasets.load_digits()

# Create features matrix
features = digits.data

# Create target vector
target = digits.target

# View first observation
features[0]

array([ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
       15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
       12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
        0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
       10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.])
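Each row of `digits.data` is a flattened 8x8 grayscale image, which is why the first observation above has 64 values. As a quick sanity check, the following sketch confirms the shapes and reshapes the first observation back into its image grid. The variable name `first_image` is only illustrative.

```python
import numpy as np
from sklearn import datasets

# Load the digits dataset and split it into features and target
digits = datasets.load_digits()
features = digits.data
target = digits.target

# One row per observation, one column per pixel
print(features.shape)
print(target.shape)

# Reshape the flattened first observation back into an 8x8 grid
first_image = features[0].reshape(8, 8)
print(first_image)

# The dataset also ships the unflattened images; they should match
print(np.array_equal(first_image, digits.images[0]))
```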

## Creating a Simulated Dataset

In [250]:
# Load library
from sklearn.datasets import make_regression

# Generate features matrix, target vector, and the true coefficients
features, target, coefficients = make_regression(n_samples = 100,
                                                 n_features = 3,
                                                 n_informative = 3,
                                                 n_targets = 1,
                                                 noise = 0.0,
                                                 coef = True,
                                                 random_state = 1)

# View feature matrix and target vector
print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])

Feature Matrix
 [[ 1.29322588 -0.61736206 -0.11044703]
 [-2.793085    0.36633201  1.93752881]
 [ 0.80186103 -0.18656977  0.0465673 ]]
Target Vector
 [-10.37865986  25.5124503   19.67705609]
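Because `noise = 0.0` and the default `bias` is zero, the target above should be an exact linear combination of the features and the returned coefficients. The sketch below checks that, then repeats the call with an assumed `noise = 10.0` to show how the residuals change. The names `reconstructed` and `residuals` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression

# Same call as above: no noise, so the relationship is exactly linear
features, target, coefficients = make_regression(n_samples=100,
                                                 n_features=3,
                                                 n_informative=3,
                                                 n_targets=1,
                                                 noise=0.0,
                                                 coef=True,
                                                 random_state=1)

# Rebuild the target from the features and the true coefficients
reconstructed = features @ coefficients
print(np.allclose(reconstructed, target))

# With noise > 0, Gaussian noise is added to the target
noisy_features, noisy_target, noisy_coefficients = make_regression(
    n_samples=100, n_features=3, n_informative=3, n_targets=1,
    noise=10.0, coef=True, random_state=1)

# The residuals are the added noise, so their spread is roughly `noise`
residuals = noisy_target - noisy_features @ noisy_coefficients
print(residuals.std())
```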


In [251]:
# Load library
from sklearn.datasets import make_classification

# Generate features matrix and target vector
features, target = make_classification(n_samples = 100,
                                       n_features = 3,
                                       n_informative = 3,
                                       n_redundant = 0,
                                       n_classes = 2,
                                       weights = [.25, .75],
                                       random_state = 1)

# View feature matrix and target vector
print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])


Feature Matrix
 [[ 1.06354768 -1.42632219  1.02163151]
 [ 0.23156977  1.49535261  0.33251578]
 [ 0.15972951  0.83533515 -0.40869554]]
Target Vector
 [1 0 0]
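The `weights` argument controls the class balance, so roughly 25% of the observations should belong to class 0 and 75% to class 1. A minimal sketch to check this follows; the proportions are only approximate because `make_classification` randomly reassigns a small fraction of labels by default (`flip_y`).

```python
import numpy as np
from sklearn.datasets import make_classification

# Same call as above
features, target = make_classification(n_samples=100,
                                       n_features=3,
                                       n_informative=3,
                                       n_redundant=0,
                                       n_classes=2,
                                       weights=[.25, .75],
                                       random_state=1)

# Count the observations in each class
class_counts = np.bincount(target)
print(class_counts)

# Convert counts to proportions to compare against `weights`
print(class_counts / class_counts.sum())
```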


In [252]:
# Load library
from sklearn.datasets import make_blobs

# Generate feature matrix and target vector
features, target = make_blobs(n_samples = 100,
                              n_features = 2,
                              centers = 3,
                              cluster_std = 0.5,
                              shuffle = True,
                              random_state = 1)

# View feature matrix and target vector
print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])



Feature Matrix
 [[ -1.22685609   3.25572052]
 [ -9.57463218  -4.38310652]
 [-10.71976941  -4.20558148]]
Target Vector
 [0 1 1]
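To see what `centers` and `cluster_std` do, it can help to summarize each cluster. The sketch below prints the size, mean, and standard deviation of every cluster; with `cluster_std = 0.5` the per-feature spread should be close to 0.5. The variable name `members` is illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same call as above
features, target = make_blobs(n_samples=100,
                              n_features=2,
                              centers=3,
                              cluster_std=0.5,
                              shuffle=True,
                              random_state=1)

# Summarize each cluster: size, center estimate, and spread
for label in np.unique(target):
    members = features[target == label]
    print(label,
          len(members),
          members.mean(axis=0).round(2),
          members.std(axis=0).round(2))
```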


## Loading CSV Files

In [253]:
# Load library
import pandas as pd

# Create URL
#url = 'https://raw.githubusercontent.com/chrisalbon/sim_data/master/data.csv'
url = '/Users/tthekkum/Downloads/results-2020-09-04T092001.csv'

# If the file has no header row, pass header=None to read_csv,
# e.g. pd.read_csv(url, header=None)

# Load dataset
dataframe = pd.read_csv(url)

# View first two rows
#dataframe.head(2)

# To view the full data frame
dataframe



Unnamed: 0,PROFILE
0,WARNING: The PROFILE query requires plan compi...
1,Run PROFILE query again to get accurate execut...
2,Gather partitions:single alias:remote_1 actual...
3,"Project [garEvent_1.uUId, garEvent_1.dtlBusDt,..."
4,Top limit:[?] actual_rows: 1 exec_time: 0ms
5,Filter [garEvent_1.uUId = $0] actual_rows: 1 e...
6,:subselect $0 correlated:no
7,Project [remote_0.uUid] actual_rows: 1...
8,Top limit:[?] actual_rows: 1 exec_time...
9,Gather partitions:all alias:remote_0 a...
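The effect of `header=None` is easiest to see on a small file without a header row. The sketch below uses hypothetical in-memory CSV text (not the file loaded above) and assumed column names passed through `names`.

```python
import io
import pandas as pd

# Hypothetical CSV content with no header row
csv_text = "5,2015-01-01 00:00:00,0\n5,2015-01-01 00:00:01,0\n"

# header=None keeps every row as data; names supplies column labels
no_header = pd.read_csv(io.StringIO(csv_text),
                        header=None,
                        names=["integer", "datetime", "category"])
print(no_header)

# Without header=None, the first data row is consumed as column names
inferred_header = pd.read_csv(io.StringIO(csv_text))
print(inferred_header.shape)
```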


## Loading Excel Files

In [254]:
# Create URL
#url = 'https://raw.githubusercontent.com/chrisalbon/sim_data/master/data.xlsx'
url = '/Users/tthekkum/Downloads/Application Inventory.xlsx'
# Load data
dataframe = pd.read_excel(url, sheet_name=0, header=0)

# sheet_name=[0, 1, 2, "Monthly Sales"] will return a dictionary
# of pandas DataFrames containing the first, second,
# and third sheets and the sheet named "Monthly Sales"

# View the first three rows
dataframe.head(3)

Unnamed: 0,Application Name,AIM ID,DBMS,SP Repository,Template
0,TC Portal,600000437.0,Postgres,https://stash.aexp.com/stash/projects/AIM60000...,
1,olaf,200000190.0,Oracle,https://stash.aexp.com/stash/projects/AIM20000...,
2,,,,,
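Passing a list to `sheet_name` returns a dictionary keyed by the entries of that list. The sketch below builds a small hypothetical workbook in memory to show this. It assumes the `openpyxl` package is installed, and the sheet names and data are made up for illustration.

```python
import io
import pandas as pd

# Write a hypothetical two-sheet workbook to an in-memory buffer
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    pd.DataFrame({"month": ["Jan", "Feb"], "sales": [10, 12]}).to_excel(
        writer, sheet_name="Monthly Sales", index=False)
    pd.DataFrame({"region": ["East"], "sales": [22]}).to_excel(
        writer, sheet_name="Regional", index=False)
buffer.seek(0)

# A list mixes positions and names; the result is a dict of DataFrames
sheets = pd.read_excel(buffer, sheet_name=[0, "Regional"])
print(list(sheets.keys()))
print(sheets[0])
print(sheets["Regional"])
```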


## Loading a JSON File

In [255]:
# Create URL
#url = 'http://raw.githubusercontent.com/chrisalbon/sim_data/master/data.json'
url = '/Users/tthekkum/Downloads/test.json'
#url = '/Users/tthekkum/Downloads/plan-2020-09-04T091334.json'
# Load data
dataframe = pd.read_json(url, orient='columns')
# It might take some experimenting to figure out which orient argument
# (split, records, index, columns, or values) is the right one


# View the first two rows
dataframe.head(2)


Unnamed: 0,integer,datetime,category
0,5,2015-01-01 00:00:00,0
1,5,2015-01-01 00:00:01,0
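The right `orient` depends on how the JSON is laid out. As a rough illustration, the sketch below encodes the same small, hypothetical table in three layouts and reads each one with the matching `orient`. The JSON strings are made up for this example.

```python
import io
import pandas as pd

# The same hypothetical table in three JSON layouts
records_json = '[{"integer": 5, "category": 0}, {"integer": 7, "category": 1}]'
columns_json = '{"integer": {"0": 5, "1": 7}, "category": {"0": 0, "1": 1}}'
split_json = ('{"columns": ["integer", "category"], '
              '"index": [0, 1], "data": [[5, 0], [7, 1]]}')

# Read each layout with the orient that matches its structure
frames = {
    "records": pd.read_json(io.StringIO(records_json), orient="records"),
    "columns": pd.read_json(io.StringIO(columns_json), orient="columns"),
    "split": pd.read_json(io.StringIO(split_json), orient="split"),
}

# All three should produce the same values
for orient, frame in frames.items():
    print(orient, frame.values.tolist())
```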


## Querying SQL Database

In [256]:
from sqlalchemy import create_engine

# Create a connection to the database.
# An absolute path needs four slashes after "sqlite:". With only three,
# the path is treated as relative and SQLite raises
# "OperationalError: unable to open database file".
database_connection = create_engine('sqlite:////Users/tthekkum/Documents/LnD/BV/540/data-wrangling-master/code/chp6-db/data_wrangling.db')

# Load data
dataframe = pd.read_sql_query('SELECT * FROM data_sources', database_connection)

# View first two rows
dataframe.head(2)
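To try `read_sql_query` without depending on a database file on disk, one option is an in-memory SQLite database. The sketch below uses Python's built-in `sqlite3` module rather than SQLAlchemy. The contents of the `data_sources` table here are hypothetical, since the real table's columns are not shown.

```python
import sqlite3
import pandas as pd

# Create an in-memory SQLite database (nothing is written to disk)
connection = sqlite3.connect(":memory:")

# Populate a hypothetical data_sources table
pd.DataFrame({"source": ["census", "survey"],
              "row_count": [100, 250]}).to_sql(
    "data_sources", connection, index=False)

# Load the whole table
dataframe = pd.read_sql_query("SELECT * FROM data_sources", connection)
print(dataframe.head(2))

# Use a parameterized query instead of string formatting
filtered = pd.read_sql_query(
    "SELECT source FROM data_sources WHERE row_count > ?",
    connection, params=(150,))
print(filtered)

connection.close()
```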