<font size = 6> <center> 01 - Getting Started </center> </font>

![https://images.unsplash.com/photo-1461896836934-ffe607ba8211?ixlib=rb-4.0.3&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=870&q=80](attachment:image.png)

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Description" data-toc-modified-id="Description-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Description</a></span></li><li><span><a href="#Setup" data-toc-modified-id="Setup-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Setup</a></span><ul class="toc-item"><li><span><a href="#Google-Colab-Configuration" data-toc-modified-id="Google-Colab-Configuration-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Google Colab Configuration</a></span></li><li><span><a href="#Essential" data-toc-modified-id="Essential-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Essential</a></span></li><li><span><a href="#Save-Figures" data-toc-modified-id="Save-Figures-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Save Figures</a></span></li></ul></li><li><span><a href="#Fetch-the-data" data-toc-modified-id="Fetch-the-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Fetch the data</a></span><ul class="toc-item"><li><span><a href="#Download-the-data-then-clone-it-in-local" data-toc-modified-id="Download-the-data-then-clone-it-in-local-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Download the data then clone it in local</a></span></li><li><span><a href="#Other-ways-to-read-and-fetch-data" data-toc-modified-id="Other-ways-to-read-and-fetch-data-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Other ways to read and fetch data</a></span></li><li><span><a href="#Generate-your-own-data-set-with-random-number" data-toc-modified-id="Generate-your-own-data-set-with-random-number-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Generate your own data set with random number</a></span></li></ul></li></ul></div>

# Description

1. Define business goals and context
2. Define performance metric or KPIs
3. Define your goal and the baseline for this project to be relevant
4. Enclosed dataset documentation

# Setup

## Google Colab Configuration

In [None]:
# Clone the repository to get access to all the data and files
# The name must not contain spaces: it is used both in the URL and as the local folder name
repository_name = "Machine_Learning-Pipeline_-_Complete_Overview"
repository_url = "https://github.com/TKovaks78/" + repository_name

In [None]:
! git clone $repository_url

In [None]:
#Install Requirements
! pip install -Uqqr $repository_name/requirements.txt

⚠️ Restart the kernel after running these cells for the first time
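
The cells above are only needed on Google Colab. As a minimal sketch, a guard like the one below could tell the two environments apart. It assumes that the `google.colab` module is already loaded in a Colab kernel, which is a common idiom rather than a documented guarantee, and the helper name `running_in_colab` is hypothetical.

```python
import sys

def running_in_colab() -> bool:
    """Return True when the google.colab module is loaded (assumed to mean Colab)."""
    return "google.colab" in sys.modules

if running_in_colab():
    print("Colab detected: clone the repository and install the requirements.")
else:
    print("Not on Colab: assuming the repository files are already available locally.")
```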

## Essential

In [11]:
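# Import the libraries required for the project
import numpy as np # for scientific computing
import pandas as pd # for data analysis
import matplotlib # for visualization
import seaborn as sns # for visualization
import sklearn # ML library
import os
import re

# Scikit-Learn ≥0.20 is required
# Compare (major, minor) as integers: comparing version strings is lexicographic and unreliable
sklearn_version = tuple(int(n) for n in re.findall(r"\d+", sklearn.__version__)[:2])
assert sklearn_version >= (0, 20)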

# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Increase pandas display limit of columns to 500 
pd.options.display.max_columns = 500 

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# No warning of any kind please!
import warnings
# will ignore any warnings
warnings.filterwarnings("ignore")
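
Silencing every warning globally is convenient in a tutorial, but it can also hide useful messages. As an alternative, here is a minimal sketch of suppressing one warning category for a single block only, using the standard library's `warnings.catch_warnings`. The function `noisy` is a made-up stand-in for any call that emits a warning.

```python
import warnings

def noisy():
    """Hypothetical function that emits a DeprecationWarning and returns a value."""
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42

# Inside this block, DeprecationWarning is ignored; other categories still show
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=DeprecationWarning)
    value = noisy()

# On exit the previous filters are restored; record the warning to confirm it is raised again
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy()
```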

## Save Figures

**Method 1**: a helper function that saves each figure to a dedicated folder, keeping the images organized

In [12]:
import os

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "01_-_Getting Started"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

To call the function, insert the line below in a plot cell (we will see an example later)

In [3]:
#save_fig("input figure name")

**Method 2**: Matplotlib's built-in `savefig` method (easier but more limited, especially if you are working with GitHub)

In [None]:
#Insert this code in a plot cell
#fig.savefig('path/to/save/image/to.png')
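
To make Method 2 concrete, here is a minimal sketch that builds a figure with the object-oriented API and saves it with `fig.savefig`. The sine curve, the file name `sine_wave.png` and the temporary output folder are illustrative choices, not part of the original notebook.

```python
import os
import tempfile

import numpy as np
import matplotlib.pyplot as plt

# Build a simple figure to have something to save
x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x))
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")

# Save into a temporary folder (replace with your own path)
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "sine_wave.png")
fig.savefig(out_path, dpi=300, bbox_inches="tight")
plt.close(fig)
```

Unlike the `save_fig` helper, this approach leaves the choice of folder and file name to every call.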

# Fetch the data

## Download the data then clone it in local

In [13]:
import os
import tarfile
import urllib.request

# Define the path from which to download the data
# Use the raw-content host: a github.com/.../blob/... URL returns an HTML page, not the archive
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/TKovaks78/Machine_Learning-Pipeline_-_Complete_Overview/main/"
PATH = os.path.join("datasets", "housing")
URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

# Function to fetch the data from the URL
def fetch_data(url=URL, path=PATH):
    os.makedirs(path, exist_ok=True)
    tgz_path = os.path.join(path, "housing.tgz")
    urllib.request.urlretrieve(url, tgz_path)
    # The context manager closes the archive even if extraction fails
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=path)
    
#Call the function
fetch_data()

#Function to load the data
def load_data(path=PATH):
    csv_path = os.path.join(path, "housing.csv")
    return pd.read_csv(csv_path)

#Call the function
df = load_data()

#Read the data
df.head(10)

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
| 5 | -122.25 | 37.85 | 52.0 | 919.0 | 213.0 | 413.0 | 193.0 | 4.0368 | 269700.0 | NEAR BAY |
| 6 | -122.25 | 37.84 | 52.0 | 2535.0 | 489.0 | 1094.0 | 514.0 | 3.6591 | 299200.0 | NEAR BAY |
| 7 | -122.25 | 37.84 | 52.0 | 3104.0 | 687.0 | 1157.0 | 647.0 | 3.12 | 241400.0 | NEAR BAY |
| 8 | -122.26 | 37.84 | 42.0 | 2555.0 | 665.0 | 1206.0 | 595.0 | 2.0804 | 226700.0 | NEAR BAY |
| 9 | -122.25 | 37.84 | 52.0 | 3549.0 | 707.0 | 1551.0 | 714.0 | 3.6912 | 261100.0 | NEAR BAY |
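
One failure mode of a download step like this is that the URL returns an HTML page instead of the archive, which only surfaces later as an extraction error. A minimal sketch of a check that could run before extraction is shown below, using `tarfile.is_tarfile` from the standard library. The helper name `is_valid_archive` and the demo files are hypothetical and created only for illustration.

```python
import os
import tarfile
import tempfile

def is_valid_archive(path):
    """Return True if `path` exists and tarfile can read it as an archive."""
    return os.path.isfile(path) and tarfile.is_tarfile(path)

with tempfile.TemporaryDirectory() as tmp:
    # A small CSV file to put inside a real archive
    csv_path = os.path.join(tmp, "demo.csv")
    with open(csv_path, "w") as f:
        f.write("a,b\n1,2\n")

    # A genuine .tgz archive
    good = os.path.join(tmp, "good.tgz")
    with tarfile.open(good, "w:gz") as tar:
        tar.add(csv_path, arcname="demo.csv")

    # An HTML page saved under a .tgz name, as a wrong URL would produce
    bad = os.path.join(tmp, "bad.tgz")
    with open(bad, "w") as f:
        f.write("<!DOCTYPE html><html><body>Not an archive</body></html>")

    good_ok = is_valid_archive(good)
    bad_ok = is_valid_archive(bad)

print("good.tgz is an archive:", good_ok)
print("bad.tgz is an archive:", bad_ok)
```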


## Other ways to read and fetch data

**Pandas Documentation**: <br>
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
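
As a minimal sketch of a few `read_csv` options, the example below reads CSV text from an in-memory buffer, so no file is needed. The first two rows reuse values from the housing data; the third row is altered on purpose to leave `median_house_value` blank and show how missing values are read.

```python
import io

import pandas as pd

csv_text = """longitude,latitude,median_house_value,ocean_proximity
-122.23,37.88,452600.0,NEAR BAY
-122.22,37.86,358500.0,NEAR BAY
-122.24,37.85,,NEAR BAY
"""

df_demo = pd.read_csv(
    io.StringIO(csv_text),  # any file path or URL works here as well
    usecols=["longitude", "latitude", "median_house_value"],  # load only these columns
    dtype={"longitude": "float64"},  # force a column type
)

print(df_demo)
print(df_demo.isna().sum())  # the blank cell is read as NaN
```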

**Sklearn Documentation**: <br>
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html

Example of fetching data with sklearn:

In [16]:
from sklearn.datasets import fetch_openml

data = fetch_openml("house_sales", as_frame=True)
df_sk = data.frame

df_sk.head()

| | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20141013T000000 | 221900.0 | 3.0 | 1.0 | 1180.0 | 5650.0 | 1.0 | 0.0 | 0.0 | 3.0 | 7.0 | 1180.0 | 0.0 | 1955.0 | 0.0 | 98178.0 | 47.5112 | -122.257 | 1340.0 | 5650.0 |
| 1 | 20141209T000000 | 538000.0 | 3.0 | 2.25 | 2570.0 | 7242.0 | 2.0 | 0.0 | 0.0 | 3.0 | 7.0 | 2170.0 | 400.0 | 1951.0 | 1991.0 | 98125.0 | 47.721 | -122.319 | 1690.0 | 7639.0 |
| 2 | 20150225T000000 | 180000.0 | 2.0 | 1.0 | 770.0 | 10000.0 | 1.0 | 0.0 | 0.0 | 3.0 | 6.0 | 770.0 | 0.0 | 1933.0 | 0.0 | 98028.0 | 47.7379 | -122.233 | 2720.0 | 8062.0 |
| 3 | 20141209T000000 | 604000.0 | 4.0 | 3.0 | 1960.0 | 5000.0 | 1.0 | 0.0 | 0.0 | 5.0 | 7.0 | 1050.0 | 910.0 | 1965.0 | 0.0 | 98136.0 | 47.5208 | -122.393 | 1360.0 | 5000.0 |
| 4 | 20150218T000000 | 510000.0 | 3.0 | 2.0 | 1680.0 | 8080.0 | 1.0 | 0.0 | 0.0 | 3.0 | 8.0 | 1680.0 | 0.0 | 1987.0 | 0.0 | 98074.0 | 47.6168 | -122.045 | 1800.0 | 7503.0 |
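
`fetch_openml` downloads the data, so it needs a network connection. For quick experiments offline, the datasets bundled with sklearn are an alternative. Below is a minimal sketch using `load_iris`, the function behind the documentation link above. It assumes a sklearn version recent enough to support `as_frame=True`.

```python
from sklearn.datasets import load_iris

# Load the bundled iris dataset as pandas objects (no download involved)
iris = load_iris(as_frame=True)
df_iris = iris.frame  # the four feature columns plus a "target" column

print(df_iris.shape)
print(df_iris.head())
```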


## Generate your own data set with random number

I included this part because generating your own dataset is often more convenient when exploring new techniques. <br><br>
However, my notebooks will use real datasets.

**Documentation**: <br>
https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.html

In [17]:
# Generate random integers from 0 to 99 (the upper bound 100 is exclusive), with 100 rows and 4 columns
df_rand = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
df_rand.head()

| | A | B | C | D |
|---|---|---|---|---|
| 0 | 42 | 31 | 71 | 14 |
| 1 | 34 | 18 | 4 | 79 |
| 2 | 45 | 86 | 59 | 48 |
| 3 | 20 | 6 | 46 | 73 |
| 4 | 89 | 49 | 87 | 99 |
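
The call above draws different numbers on every run because no seed is set. If reproducible data is wanted, one option is NumPy's `Generator` API. This is a minimal sketch; the seed value 42 and the variable names are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# A seeded generator returns the same numbers on every run
rng = np.random.default_rng(seed=42)
df_seeded = pd.DataFrame(rng.integers(0, 100, size=(100, 4)), columns=list("ABCD"))

# A second generator with the same seed reproduces the same table
rng_again = np.random.default_rng(seed=42)
df_again = pd.DataFrame(rng_again.integers(0, 100, size=(100, 4)), columns=list("ABCD"))

print(df_seeded.equals(df_again))
```

As with `np.random.randint`, the upper bound of `integers` is exclusive by default, so the values range from 0 to 99.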


**Documentation**: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html

In [50]:
from sklearn.datasets import make_blobs

# Generates the dataset
X, y = make_blobs(n_samples=50, centers=1, random_state=4, cluster_std=2)

<u>Note:</u> you can also use sklearn to generate other kinds of datasets, such as a multiclass classification dataset
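
The multiclass case mentioned in the note can be sketched with `make_classification`. All parameter values below (sample count, number of features and classes, seed) are illustrative choices, not values taken from this notebook.

```python
from sklearn.datasets import make_classification

# Generate 200 samples with 4 features spread over 3 classes
X_clf, y_clf = make_classification(
    n_samples=200,
    n_features=4,
    n_informative=3,         # features that actually carry class information
    n_redundant=0,           # no linear combinations of informative features
    n_classes=3,
    n_clusters_per_class=1,
    random_state=42,         # fixed seed for reproducibility
)

print(X_clf.shape, y_clf.shape)
print(sorted(set(y_clf)))    # the class labels present in y_clf
```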