### **Machine Learning with Python: An Introduction**

#### **Lessons**
1. Lesson 1: **Python Ecosystem for Machine Learning**
2. Lesson 2: **Python and SciPy**
3. Lesson 3: **Load Datasets from CSV**
4. Lesson 4: **Analyze Data**
    * Understand Data with Descriptive Statistics
    * Understand Data with Visualization
5. Lesson 5: **Prepare Data**
    * Pre-Process Data
    * Feature Selection
6. Lesson 6: **Evaluate Algorithms**
    * Resampling Methods
    * Algorithm Evaluation Metrics
    * Spot-Check Classification Algorithms
    * Spot-Check Regression Algorithms
    * Model Selection
    * Pipelines
7. Lesson 7: **Improve Results**
    * Ensemble Methods
    * Algorithm Parameter Tuning
8. Lesson 8: **Present Results**
    * Model Finalization

#### **Lesson 1: Python Ecosystem for Machine Learning**
**1.1 Python:** Python is a dynamic language that is widely used for machine learning and data science because of its excellent library
support. It is useful both for research and development and for building production systems.

**1.2 SciPy:** SciPy is an ecosystem of Python libraries for mathematics, science and engineering. The ecosystem includes:
* `NumPy:` to efficiently work with data in arrays.
* `Matplotlib:` to create 2D charts and plots from data.
* `Pandas:` to load, organize and analyze the data.

**1.3 Scikit-Learn:** Scikit-learn is built upon and requires the SciPy ecosystem. The library focuses on machine learning algorithms for classification, regression, clustering and so on. It also provides tools for related tasks such as evaluating models, tuning parameters and pre-processing data.

**1.4 Installing the Ecosystem** [For Windows]
* Python: Download the Python installer (`.exe`) for Windows (a recent version is recommended), install it on your machine, and add its path to the `PATH` environment variable.
* SciPy: `pip install scipy`
* Numpy: `pip install numpy`
* Matplotlib: `pip install matplotlib`
* Pandas: `pip install pandas`
* Scikit-Learn: `pip install scikit-learn`

Once installed, we can confirm that the installation was successful. To check, open any Python code editor and run the following code:

In [1]:
# Check that the libraries are installed and print their versions
import sys
print(f"Python      : {sys.version}")

import scipy
print(f"Scipy       : {scipy.__version__}")

import numpy
print(f"Numpy       : {numpy.__version__}")

import matplotlib
print(f"Matplotlib  : {matplotlib.__version__}")

import pandas
print(f"Pandas      : {pandas.__version__}")

import sklearn
print(f"Scikit-Learn: {sklearn.__version__}")

Python      : 3.11.4 (tags/v3.11.4:d2340ef, Jun  7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)]
Scipy       : 1.14.0
Numpy       : 1.26.4
Matplotlib  : 3.9.0
Pandas      : 2.2.2
Scikit-Learn: 1.5.1


#### **Lesson 2: Python and SciPy**
This lesson covers Python and the SciPy libraries (NumPy, Matplotlib, Pandas) as a crash course. We assume that the basics of the following are already known:
* Coding in Python
* NumPy basics: array structure and operations
* Matplotlib basics: plotting with pyplot
* Pandas basics: loading and manipulating DataFrames

#### **Lesson 3: Load Datasets from CSV**
The most common format for machine learning data is the `CSV (comma-separated values)` file. Before loading CSV data, we have to consider a few properties of the file.
1. **File Header:** If the data has a file header, it can be used to assign names to each column automatically. If not, we need to name the attributes manually.
2. **Comments:** Comments in a CSV file are commonly indicated by a hash (#) at the start of a line. If the file contains comments, we may need to tell the loader to expect them and which character marks a comment line.
3. **Delimiter:** The standard separator in a CSV file is the comma (,). Some files use a different delimiter, such as a tab or white space, in which case we must specify the separator explicitly.
4. **Quotes:** Field values that contain spaces are sometimes quoted. The default quote character is the double quotation mark ("). Other characters may be used, in which case we must specify the quote character when loading the file.
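
The four considerations above map onto arguments of `pandas.read_csv()`. The following is a minimal sketch on a small in-memory sample; the sample text and the column names `id`, `name` and `score` are made up for illustration and have nothing to do with the Pima dataset.

```python
import io
import pandas as pd

# Hypothetical sample: a comment line, no header row, ';' as the delimiter,
# and single-quoted fields that contain the delimiter.
sample = (
    "# illustrative data, not from the Pima dataset\n"
    "1;'Smith; John';3.5\n"
    "2;'Doe; Jane';4.0\n"
)

df = pd.read_csv(
    io.StringIO(sample),
    header=None,                    # the sample has no header row
    names=["id", "name", "score"],  # so column names are supplied manually
    comment="#",                    # ignore lines that start with '#'
    sep=";",                        # non-default delimiter
    quotechar="'",                  # non-default quote character
)
print(df)
```

Because the `name` field is quoted, the `;` inside it is kept as part of the value rather than treated as a separator.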

**`Pima Indians Dataset`**

To demonstrate data loading here we will use the 'Pima Indians' dataset. The dataset is good for demonstration because all the attributes are numeric and the output variable is binary (0 or 1), hence it is a classification problem. The dataset is available in the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes).

##### **`Load CSV file with Python Standard Library`**

This approach uses a reader object that iterates over each row of the dataset. The rows are then converted into a NumPy array.

In [2]:
# Load CSV file (using the Python Standard Library)
import csv
import numpy as np
filepath = "F:/courses/mlds_nactar/dataset/pima_indians_diabetes.csv"
# Open in text mode: csv.reader expects strings, not bytes
with open(filepath, 'r', newline='') as raw_data:
    reader = csv.reader(raw_data, delimiter=',', quoting=csv.QUOTE_NONE)
    next(reader)        # Skip the header row (this copy of the file has column names)
    x = list(reader)
data = np.array(x).astype(float)
print(data.shape)

(768, 9)

##### **`Load CSV file with Numpy`**

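NumPy provides the `numpy.loadtxt()` function, which assumes there is no header row and that all the data has the same format. If the file does have a header row, it can be skipped with `skiprows=1`. The function can also load a dataset directly from a URL.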

In [6]:
# Load CSV using NumPy
import numpy as np
filepath = "F:/courses/mlds_nactar/dataset/pima_indians_diabetes.csv"
# skiprows=1 skips the header row present in this copy of the file
data = np.loadtxt(filepath, delimiter=',', skiprows=1)
print(data.shape)

(768, 9)

In [7]:
# Load CSV from URL using NumPy
import numpy as np
from urllib.request import urlopen   # In Python 3, urlopen lives in urllib.request, not urllib
url = 'https://goo.gl/vhm1eU'   # Url of pima_indians_diabetes dataset
raw_data = urlopen(url)
dataset = np.loadtxt(raw_data, delimiter=",")
print(dataset.shape)

##### **`Load CSV File with Pandas`**
Pandas provides the `pandas.read_csv()` function, which is very flexible and is the recommended approach for loading machine learning data. The function returns a DataFrame, which is convenient for summarizing and plotting data. It can also load data directly from a URL.

In [9]:
# Load local csv Data using Pandas
import pandas as pd
filepath = "F:/courses/mlds_nactar/dataset/pima_indians_diabetes.csv"
data = pd.read_csv(filepath)
print(data.shape)

(768, 9)


In [11]:
# Load csv Data using Pandas from URL
import pandas as pd
url = 'https://goo.gl/vhm1eU/'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']    # Explicitly specify the column names
data = pd.read_csv(url, names=names)
print(data.shape)

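URLError: <urlopen error [Errno 11002] getaddrinfo failed>

[NB] `getaddrinfo failed` means the host name could not be resolved (for example, because there is no network connection), so this error comes from the network rather than from the code.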

#### **Lesson 4: Analyze Data**

##### **4.1 Understand Data with Descriptive Statistics**
To get the best results we have to understand the data. To better understand machine learning data, we will follow 7 recipes, using the 'Pima Indians Diabetes' dataset throughout.

**`4.1.1. Take a Peek at Raw Data`**

Looking at the raw data can reveal insights about it and suggest ideas for how to better pre-process and handle the data for our machine learning tasks.

In [17]:
# Take a Look at Data
import pandas as pd
pd.set_option('display.width', 200)

filepath = "F:/courses/mlds_nactar/dataset/pima_indians_diabetes.csv"
data = pd.read_csv(filepath)
print(data.head())      # Displays the first 5 rows by default; another count can be given, e.g. 'data.head(10)'
print(data.tail())      # Displays the last 5 rows by default

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1
     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
763           10      101             76             48      180  32.9                     0.171   63        0
764            2      122             70             27        0  36.8                     0.340   27        0
765          

**`4.1.2. Dimensions of the Data`**

Dimensions refer to how many rows and columns there are in the dataset. It is important to know the dimensions of the data for two reasons:
* With too many rows, algorithms may take too long to train; with too few rows, we may not have enough data to train them.
* Too many features (columns) relative to the number of instances (rows) can lead to poor performance due to the curse of dimensionality.

[NB] The 'Pima Indians Diabetes' dataset has 768 rows and 9 columns. The `shape` property lists rows first, then columns: (rows, columns).

In [18]:
# Shape of the Data
import pandas as pd
filepath = "F:/courses/mlds_nactar/dataset/pima_indians_diabetes.csv"
data = pd.read_csv(filepath)
print(data.shape)

(768, 9)


**`4.1.3. Data Type of Each Attribute`**

Knowing the data type of each attribute is necessary because most algorithms require integer or floating point input. Strings and categorical or ordinal values therefore need to be converted to numeric values. We can get an idea of the types by looking at the raw data, but we can also check the data type of each attribute explicitly.

In [19]:
# Types of the Data
import pandas as pd
filepath = "F:/courses/mlds_nactar/dataset/pima_indians_diabetes.csv"
data = pd.read_csv(filepath)
print(data.dtypes)

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object
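
When an attribute is not numeric, it has to be encoded before training. The following is a small sketch on a made-up DataFrame (the columns `color`, `size` and `weight` are hypothetical and not part of the Pima dataset). It uses an explicit mapping for an ordinal attribute and one-hot encoding for a nominal one, which is one common choice among several.

```python
import pandas as pd

# Hypothetical data with two non-numeric attributes
df = pd.DataFrame({
    "color": ["red", "green", "red"],        # nominal (no natural order)
    "size": ["small", "large", "medium"],    # ordinal (has a natural order)
    "weight": [1.2, 3.4, 2.2],
})
print(df.dtypes)   # 'color' and 'size' are reported as non-numeric

# Ordinal: map each level to an integer that preserves the order
df["size"] = df["size"].map({"small": 0, "medium": 1, "large": 2})

# Nominal: one-hot encode into separate 0/1 indicator columns
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(encoded)
print(encoded.dtypes)   # every column is now numeric
```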


**`4.1.4. Descriptive Statistics`**

Descriptive statistics give great insight into each attribute: the total number of instances, whether there are missing values, central tendency, range and dispersion. The pandas `describe()` function lists 8 statistical properties of each attribute:
* `Count`: Total number of non-missing instances
* `Mean`: Average value of instances
* `Standard Deviation`: Spread of the values around the mean
* `Minimum Value`: Lowest value among instances
* `25th Percentile`: Value below which 25% of the instances fall
* `50th Percentile (Median)`: Middle value among instances
* `75th Percentile`: Value below which 75% of the instances fall
* `Maximum Value`: Highest value among instances

In [22]:
# Descriptive Statistics
import pandas as pd
pd.set_option('display.width', 200)
# pd.set_option('display.precision', 3)   # Optional: limit the output to 3 decimal places

filepath = "F:/courses/mlds_nactar/dataset/pima_indians_diabetes.csv"
data = pd.read_csv(filepath)
print(data.describe())

       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin         BMI  DiabetesPedigreeFunction         Age     Outcome
count   768.000000  768.000000     768.000000     768.000000  768.000000  768.000000                768.000000  768.000000  768.000000
mean      3.845052  120.894531      69.105469      20.536458   79.799479   31.992578                  0.471876   33.240885    0.348958
std       3.369578   31.972618      19.355807      15.952218  115.244002    7.884160                  0.331329   11.760232    0.476951
min       0.000000    0.000000       0.000000       0.000000    0.000000    0.000000                  0.078000   21.000000    0.000000
25%       1.000000   99.000000      62.000000       0.000000    0.000000   27.300000                  0.243750   24.000000    0.000000
50%       3.000000  117.000000      72.000000      23.000000   30.500000   32.000000                  0.372500   29.000000    0.000000
75%       6.000000  140.250000      80.000000      32.0
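
To make the eight properties concrete, the sketch below computes each one separately on a small made-up Series and compares the results with `describe()`. The values are illustrative only.

```python
import pandas as pd

values = pd.Series([2, 4, 4, 5, 7, 9, 10, 15])   # made-up values

# Each statistic computed on its own
summary = {
    "count": values.count(),
    "mean": values.mean(),
    "std": values.std(),            # sample standard deviation (ddof=1)
    "min": values.min(),
    "25%": values.quantile(0.25),
    "50%": values.median(),
    "75%": values.quantile(0.75),
    "max": values.max(),
}
for name, stat in summary.items():
    print(f"{name:>5}: {stat:.3f}")

# describe() reports the same eight numbers in one call
described = values.describe()
print(described)
```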

**`4.1.5. Class Distribution (for Classification only)`**

Class distribution refers to how many instances belong to each class. In a classification problem we need to know how balanced the class distribution is. Highly imbalanced datasets are common and may need special handling during data pre-processing. The 'Pima Indians Diabetes' dataset is a binary classification problem with 500 instances in class 0 and 268 instances in class 1 (0 = No Diabetes, 1 = Diabetes).

In [23]:
# Class Distribution
import pandas as pd
filepath = "F:/courses/mlds_nactar/dataset/pima_indians_diabetes.csv"
data = pd.read_csv(filepath)
class_counts = data.groupby('Outcome').size()
print(class_counts)

Outcome
0    500
1    268
dtype: int64
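
Raw counts are easier to judge as proportions. The following sketch rebuilds the class column from the counts reported above (500 and 268) rather than reading the file, and derives the class proportions and a simple majority-to-minority ratio. The ratio is just one rough way to gauge imbalance, not a standard threshold.

```python
import pandas as pd

# Counts taken from the output above: 500 instances of class 0, 268 of class 1
outcome = pd.Series([0] * 500 + [1] * 268, name="Outcome")

counts = outcome.value_counts().sort_index()
proportions = outcome.value_counts(normalize=True).sort_index()
print(counts)
print(proportions.round(3))

# Ratio of majority to minority class, a rough gauge of imbalance
imbalance_ratio = counts.max() / counts.min()
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```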


**`4.1.6. Correlation Between Attributes`**

Correlation refers to the relationship between two attributes and how they may or may not change together. Pearson's Correlation Coefficient, the most commonly used measure, ranges from -1 to 1: -1 or 1 indicates a full negative or positive correlation respectively, and 0 indicates no correlation at all. `Highly correlated attributes can cause poor performance in linear or logistic regression.`

The pandas `corr()` function lists all attributes across the top and down the side and gives the correlation coefficient for every pair of attributes. The diagonal of the matrix shows the perfect correlation of each attribute with itself.

In [24]:
# Pairwise Pearson Correlations
import pandas as pd
pd.set_option('display.width', 200)
# pd.set_option('display.precision', 3)   # Optional: limit the output to 3 decimal places

filepath = "F:/courses/mlds_nactar/dataset/pima_indians_diabetes.csv"
data = pd.read_csv(filepath)
correlations = data.corr(method='pearson')
print(correlations)

                          Pregnancies   Glucose  BloodPressure  SkinThickness   Insulin       BMI  DiabetesPedigreeFunction       Age   Outcome
Pregnancies                  1.000000  0.129459       0.141282      -0.081672 -0.073535  0.017683                 -0.033523  0.544341  0.221898
Glucose                      0.129459  1.000000       0.152590       0.057328  0.331357  0.221071                  0.137337  0.263514  0.466581
BloodPressure                0.141282  0.152590       1.000000       0.207371  0.088933  0.281805                  0.041265  0.239528  0.065068
SkinThickness               -0.081672  0.057328       0.207371       1.000000  0.436783  0.392573                  0.183928 -0.113970  0.074752
Insulin                     -0.073535  0.331357       0.088933       0.436783  1.000000  0.197859                  0.185071 -0.042163  0.130548
BMI                          0.017683  0.221071       0.281805       0.392573  0.197859  1.000000                  0.140647  0.036242  0
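
To show what each entry of the matrix means, the sketch below computes Pearson's coefficient for one pair of made-up variables directly from its definition (the covariance divided by the product of the standard deviations) and checks it against pandas. The arrays `x` and `y` are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up paired measurements, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson's r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from the respective means
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same value from pandas
r_pandas = pd.Series(x).corr(pd.Series(y), method="pearson")
print(f"manual: {r_manual:.4f}, pandas: {r_pandas:.4f}")
```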

**`4.1.7. Skew of Univariate Distribution`**

Many machine learning algorithms assume that the data has a `Gaussian` distribution and tend to give better results when it does. Skew refers to a distribution that is assumed Gaussian (Normal or Bell Curve) 