<h2><font color="#004D7F" size=6>Module 2. Data Analysis</font></h2>

<h1><font color="#004D7F" size=5>1. Load a dataset</font></h1>

<h2><font color="#004D7F" size=5>Table of Contents</font></h2>
<a id="indice"></a>

* 1. Introduction
* 2. Load a CSV
    * 2.1. From the standard library
    * 2.2. From NumPy
    * 2.3. From Pandas
* 3. Dataset Description
    * 3.1. Multiclass Classification: IRIS
    * 3.2. Binary Classification: Sonar, Mines vs. Rocks
    * 3.3. Regression: Boston House Price
* 4. Conclusions

---
# <font color="#004D7F"> 1. Introduction</font>

In this first part of the module, we will see how to load a dataset that is in tidy format, and how to load the main datasets that we will work with throughout the course.

----
# <font color="#004D7F"> 2. Load a CSV</font>

You must be able to load your data before starting your machine learning project. CSV is the most common file format for machine learning data. There are several ways to load a CSV file in Python:

* Load CSV files with the Python standard library.
* Load CSV files with NumPy.
* Load CSV files with Pandas.

## <font color="#004D7F">2.1. From the standard library</font>

The Python standard library provides the `csv` module, whose `reader()` function can be used to load CSV files. Once loaded, you can convert the CSV data to a NumPy array and use it for machine learning. For example, you can download the Pima Indians diabetes dataset and save it as `data/pima-indians-diabetes.csv`, relative to your working directory. All fields in this dataset are numeric, and there is no header line. The example creates a reader object that iterates over the data rows; the rows are then collected into a list and converted into a NumPy array. Running the example prints the array and its shape.

In [6]:
# Load CSV Using Python Standard Library
import csv  
import numpy as np  

# Define the path to the CSV file to be read
filename = 'data/pima-indians-diabetes.csv'

# Open the CSV file in read mode; the 'with' block closes the file automatically
# newline='' is the setting the csv module documentation recommends for file objects
with open(filename, 'r', newline='') as raw_data:
    # This reader will read data from the 'raw_data' file object
    # Set the delimiter to ',' since it's a comma-separated file
    reader = csv.reader(raw_data, delimiter=',', quoting=csv.QUOTE_NONE)

    # Convert the data into a list of lists and assign it to 'x'
    x = list(reader)

# Convert the list of lists 'x' into a numpy array using np.array()
# Convert the data type of the array elements to 'float' using astype('float')
data = np.array(x).astype('float')

# Print the numpy array containing the CSV data
print(data)

print(data.shape)

[[  6. 148.  72. ... 627.  50.   1.]
 [  1.  85.  66. ... 351.  31.   0.]
 [  8. 183.  64. ... 672.  32.   1.]
 ...
 [  5. 121.  72. ... 245.  30.   0.]
 [  1. 126.  60. ... 349.  47.   1.]
 [  1.  93.  70. ... 315.  23.   0.]]
(768, 9)
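
If NumPy is not available, the same rows can be converted to numbers with the standard library alone. The following is a minimal sketch, not the course's method: it reads from an in-memory string instead of the Pima file, and the sample values are made up purely for illustration.

```python
import csv
import io

# Hypothetical in-memory CSV standing in for a real file (values are made up)
sample = io.StringIO("1,2.5,0\n3,4.5,1\n")

# csv.reader yields each row as a list of strings
reader = csv.reader(sample, delimiter=',')

# Convert every field to float, row by row, without NumPy
rows = [[float(value) for value in row] for row in reader]

# Number of rows and number of columns, similar to a NumPy shape
n_rows = len(rows)
n_cols = len(rows[0])

print(rows)
print((n_rows, n_cols))
```

The result is a plain list of lists, which is enough for small files; for numerical work, a NumPy array is usually more convenient.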


## <font color="#004D7F">2.2. From NumPy</font>

You can load your CSV data using NumPy and the `numpy.loadtxt()` function. This function does not assume a header row, and it expects every value to have the same data type (float by default). The following example assumes that the file is saved as `data/pima-indians-diabetes.csv`, relative to your working directory. Running the example loads the file as a `numpy.ndarray` and prints the array.

In [8]:
# Load CSV using NumPy
import numpy as np

# Defining the filename variable to store the path to the CSV file
filename = 'data/pima-indians-diabetes.csv'

# Using NumPy's loadtxt function to load the CSV data into a NumPy array,
# specifying the delimiter as a comma. loadtxt accepts the file path directly,
# so the file does not need to be opened (or closed) by hand
data = np.loadtxt(filename, delimiter=',')

# Printing the loaded data array
print(data)

[[  6. 148.  72. ... 627.  50.   1.]
 [  1.  85.  66. ... 351.  31.   0.]
 [  8. 183.  64. ... 672.  32.   1.]
 ...
 [  5. 121.  72. ... 245.  30.   0.]
 [  1. 126.  60. ... 349.  47.   1.]
 [  1.  93.  70. ... 315.  23.   0.]]
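
Because `numpy.loadtxt()` does not assume a header row, a file that does have one needs the header skipped explicitly. The sketch below illustrates this with the `skiprows` parameter. It reads from an in-memory string rather than a file, and the column names and values are invented for the example.

```python
import io
import numpy as np

# Hypothetical CSV text with a header line (names and values are made up)
sample = io.StringIO("a,b,label\n1.0,2.0,0\n3.0,4.0,1\n")

# skiprows=1 tells loadtxt to ignore the first line (the header)
data = np.loadtxt(sample, delimiter=',', skiprows=1)

# Two data rows, three columns, all parsed as floats
print(data)
print(data.shape)
```

Without `skiprows=1`, the header text could not be parsed as a number and the call would raise an error.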


## <font color="#004D7F">2.3. From Pandas</font>

You can load your CSV data using Pandas and the `pandas.read_csv()` function. This function is very flexible and is the approach I recommend for loading your machine learning data. The function returns a pandas [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) that you can begin summarizing and plotting immediately. The following example assumes that the file is saved as `data/pima-indians-diabetes.csv`, relative to your working directory. Note that in this example we explicitly pass the name of each attribute to the DataFrame, because the file has no header row.

In [13]:
# Load CSV using Pandas
import pandas as pd

# Define the path to the CSV file to be read
filename = 'data/pima-indians-diabetes.csv'

# Define the column names for the dataframe
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# Read the CSV file into a pandas dataframe using pd.read_csv()
# Set the 'names' parameter to the list of column names defined above
data = pd.read_csv(filename, names=names)

data

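|     | preg | plas | pres | skin | test | mass | pedi    | age | class |
|-----|------|------|------|------|------|------|---------|-----|-------|
| 0   | 6    | 148  | 72   | 35   | 0    | 33.6 | 627.00  | 50  | 1     |
| 1   | 1    | 85   | 66   | 29   | 0    | 26.6 | 351.00  | 31  | 0     |
| 2   | 8    | 183  | 64   | 0    | 0    | 23.3 | 672.00  | 32  | 1     |
| 3   | 1    | 89   | 66   | 23   | 94   | 28.1 | 167.00  | 21  | 0     |
| 4   | 0    | 137  | 40   | 35   | 168  | 43.1 | 2288.00 | 33  | 1     |
| ... | ...  | ...  | ...  | ...  | ...  | ...  | ...     | ... | ...   |
| 763 | 10   | 101  | 76   | 48   | 180  | 32.9 | 171.00  | 63  | 0     |
| 764 | 2    | 122  | 70   | 27   | 0    | 36.8 | 0.34    | 27  | 0     |
| 765 | 5    | 121  | 72   | 23   | 112  | 26.2 | 245.00  | 30  | 0     |
| 766 | 1    | 126  | 60   | 0    | 0    | 30.1 | 349.00  | 47  | 1     |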
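
To illustrate what "summarizing immediately" can look like, here is a minimal sketch that loads a headerless CSV with explicit column names and inspects the result. It reads from an in-memory string instead of the Pima file; the column names `feat_a`, `feat_b`, and `label` and all values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV text with no header row (values are made up)
sample = io.StringIO("1,2.5,0\n3,4.5,1\n")

# Hypothetical column names, supplied because the data has no header line
names = ['feat_a', 'feat_b', 'label']

# Passing 'names' makes read_csv treat the first line as data, not as a header
df = pd.read_csv(sample, names=names)

print(df.shape)       # number of rows and columns
print(df.dtypes)      # data type inferred for each column
print(df.describe())  # summary statistics for the numeric columns
```

Unlike `numpy.loadtxt()`, `read_csv()` infers a type per column, so integer and floating-point columns can coexist in the same DataFrame.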


`pandas.read_csv()` also accepts a URL in place of a file path, so the same approach can load CSV data directly from the web.

In [17]:
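# Load CSV from a URL using Pandas
import pandas as pd

# URL of the abalone dataset in the UCI Machine Learning Repository
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'

# header=None because the file has no header row; columns are numbered 0 to 8
dataframe = pd.read_csv(url, header=None)
dataframe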

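|      | 0   | 1     | 2     | 3     | 4      | 5      | 6      | 7      | 8   |
|------|-----|-------|-------|-------|--------|--------|--------|--------|-----|
| 0    | M   | 0.455 | 0.365 | 0.095 | 0.5140 | 0.2245 | 0.1010 | 0.1500 | 15  |
| 1    | M   | 0.350 | 0.265 | 0.090 | 0.2255 | 0.0995 | 0.0485 | 0.0700 | 7   |
| 2    | F   | 0.530 | 0.420 | 0.135 | 0.6770 | 0.2565 | 0.1415 | 0.2100 | 9   |
| 3    | M   | 0.440 | 0.365 | 0.125 | 0.5160 | 0.2155 | 0.1140 | 0.1550 | 10  |
| 4    | I   | 0.330 | 0.255 | 0.080 | 0.2050 | 0.0895 | 0.0395 | 0.0550 | 7   |
| ...  | ... | ...   | ...   | ...   | ...    | ...    | ...    | ...    | ... |
| 4172 | F   | 0.565 | 0.450 | 0.165 | 0.8870 | 0.3700 | 0.2390 | 0.2490 | 11  |
| 4173 | M   | 0.590 | 0.440 | 0.135 | 0.9660 | 0.4390 | 0.2145 | 0.2605 | 10  |
| 4174 | M   | 0.600 | 0.475 | 0.205 | 1.1760 | 0.5255 | 0.2875 | 0.3080 | 9   |
| 4175 | F   | 0.625 | 0.485 | 0.150 | 1.0945 | 0.5310 | 0.2610 | 0.2960 | 10  |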
