# SciKit Learn for Predictions

These notes use this specific version of scikit-learn:
!pip install --upgrade scikit-learn==0.23.0

In [4]:
! pip install scikit-learn

Collecting scikit-learn
  Using cached scikit_learn-1.4.1.post1-cp311-cp311-win_amd64.whl.metadata (11 kB)
Collecting numpy<2.0,>=1.19.5 (from scikit-learn)
  Using cached numpy-1.26.4-cp311-cp311-win_amd64.whl.metadata (61 kB)
Collecting scipy>=1.6.0 (from scikit-learn)
  Using cached scipy-1.12.0-cp311-cp311-win_amd64.whl.metadata (60 kB)
Collecting joblib>=1.2.0 (from scikit-learn)
  Using cached joblib-1.3.2-py3-none-any.whl.metadata (5.4 kB)
Collecting threadpoolctl>=2.0.0 (from scikit-learn)
  Using cached threadpoolctl-3.3.0-py3-none-any.whl.metadata (13 kB)
Using cached scikit_learn-1.4.1.post1-cp311-cp311-win_amd64.whl (10.6 MB)
Using cached joblib-1.3.2-py3-none-any.whl (302 kB)
Using cached numpy-1.26.4-cp311-cp311-win_amd64.whl (15.8 MB)
Using cached scipy-1.12.0-cp311-cp311-win_amd64.whl (46.2 MB)
Using cached threadpoolctl-3.3.0-py3-none-any.whl (17 kB)
Installing collected packages: threadpoolctl, numpy, joblib, scipy, scikit-learn
Successfully installed joblib-1.3.2 num

1. Start with data and split it into X and y. X holds everything used to make a prediction; y is the target we are trying to predict.
2. Give X and y to a model, which learns from them.
3. The model makes predictions.


## Data

In [None]:
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2, so this needs an older release
X, y = load_boston(return_X_y=True)  # X: feature array, y: one array of house prices
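
On scikit-learn 1.2 and later, `load_boston` is no longer available. A minimal sketch of the same loading step, assuming a newer release and swapping in the bundled `load_diabetes` dataset (a substitute chosen here for illustration, not the dataset these notes were written with):

```python
from sklearn.datasets import load_diabetes

# return_X_y=True returns the feature matrix and the target as two arrays
X, y = load_diabetes(return_X_y=True)

print(X.shape)  # one row per sample, one column per feature
print(y.shape)  # one target value per row of X
```

The later cells only rely on `X` and `y`, so they should work unchanged with this substitute, although the plots and scores will differ.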

## Model

2 phases:
    - Create the model <- python object
    - Model learns from data <- .fit(x,y)

In [None]:
from sklearn.neighbors import KNeighborsRegressor
mod = KNeighborsRegressor()
mod.fit(X, y)  # learns from the data; mod.predict(X) then returns one prediction per row of the X array

In [None]:
from sklearn.linear_model import LinearRegression
mod = LinearRegression()
mod.fit(X,y)
# Basically all model APIs are set up the same way: create the model object, then call .fit on the data.
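
To make the shared create-then-fit pattern concrete, here is a rough sketch of a toy estimator written from scratch. `MeanRegressor` is a hypothetical class invented for this illustration; it is not part of scikit-learn and only mimics the `.fit` / `.predict` convention.

```python
from statistics import mean


class MeanRegressor:
    """Hypothetical toy estimator: always predicts the mean of the training targets."""

    def fit(self, X, y):
        # "Learning" here is just remembering the average target value
        self.mean_ = mean(y)
        return self  # returning self mirrors scikit-learn's convention

    def predict(self, X):
        # One prediction per row of X
        return [self.mean_ for _ in X]


X = [[1.0], [2.0], [3.0], [4.0]]
y = [10.0, 20.0, 30.0, 40.0]

mod = MeanRegressor().fit(X, y)
pred = mod.predict(X)
print(pred)  # [25.0, 25.0, 25.0, 25.0]
```

Any object that exposes these two methods can be used the same way, which is why swapping `KNeighborsRegressor` for `LinearRegression` needs no other changes.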

## Scale

In [None]:
# Working with true values and predicted values
import matplotlib.pyplot as plt
pred = mod.predict(X)
plt.scatter(pred, y)
# Be cautious when comparing the variables: they may be on different scales, so assumptions may not always be correct

## Pipeline

In [None]:
# Definition: Pipelines bundle preprocessing and modeling steps so you can use them as a single step in your machine learning workflow.

# When variables are in different units, scale X before passing the data into KNN
#   - The "model" will include the scaling, and scikit-learn is set up to .fit and .predict from the entire pipeline

from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline  # chains processing steps so they are trained one after another


# under mod fit
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", KNeighborsRegressor())])

pipe.fit(X, y)
pred = pipe.predict(X)
plt.scatter(pred, y)  # predictions now come from scaled features, which gives a better scatter plot
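
Why scaling matters for a distance-based model can be seen with a small from-scratch sketch. The three rows and their column meanings (rooms, tax) are made-up values, and `standardize` is a hypothetical helper that follows the usual mean-0, standard-deviation-1 recipe; it is an illustration, not scikit-learn's own code.

```python
from math import dist
from statistics import mean, pstdev

# Hypothetical rows: [rooms, tax], two columns on very different scales
X = [[3.0, 200.0], [4.0, 800.0], [8.0, 210.0]]


def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


X_scaled = standardize(X)

# Raw distances from row 0: the tax column dominates the result
print(dist(X[0], X[1]), dist(X[0], X[2]))
# Scaled distances from row 0: both columns contribute
print(dist(X_scaled[0], X_scaled[1]), dist(X_scaled[0], X_scaled[2]))
```

In raw units row 2 looks far closer to row 0 than row 1 does, purely because tax is measured in larger numbers. After scaling, the two distances are of similar size, so the choice of nearest neighbor no longer depends on the units.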

## Settings

In [None]:
# We want to avoid using an original data point as part of its own prediction.

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", KNeighborsRegressor(n_neighbors=1))])

pipe.fit(X, y)
pred = pipe.predict(X)
plt.scatter(pred, y)  # with n_neighbors=1 the nearest neighbor is the original data point itself, so of course the plot is a perfectly correlated line.

pipe.get_params()
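
The effect of the `n_neighbors=1` setting can be reproduced with a few lines of plain Python. This is a simplified sketch with one feature and made-up numbers; `knn_predict` is a hypothetical helper, not scikit-learn's implementation.

```python
def knn_predict(X_train, y_train, X_query, k):
    """Average the targets of the k training rows closest to each query value."""
    predictions = []
    for q in X_query:
        # Sort training indices by distance to the query point (1-D feature)
        order = sorted(range(len(X_train)), key=lambda i: abs(X_train[i] - q))
        nearest = order[:k]
        predictions.append(sum(y_train[i] for i in nearest) / k)
    return predictions


X = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.5, 1.0, 4.0, 3.0, 6.0]

pred_k1 = knn_predict(X, y, X, k=1)
pred_k3 = knn_predict(X, y, X, k=3)

print(pred_k1 == y)  # True: with k=1 every training row is its own nearest neighbor
print(pred_k3 == y)  # False: averaging three neighbors no longer copies the target
```

A perfect score on the training rows therefore says nothing about how the model behaves on rows it has not seen.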

## Grid

We have the pipeline, and we want to find the settings with which the model gives the best predictions.
    - We need to avoid making predictions on the same data the model was trained on
To avoid this, split the data into three parts and fit the model three times. Each time, two parts are used for training and the remaining part for predictions, so every "side" of the data is predicted once by a model that never saw it during training.


      fit() = train
    
      predict() = predict



GridSearchCV (object) - you can give it a pipeline and a grid (e.g. number of neighbors 1-10), and it conducts cross-validation for each setting in the grid.
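
The three-way split described above can be sketched with plain index lists. `kfold_indices` is a hypothetical helper written for this note; scikit-learn has its own splitters, which may assign rows to folds differently.

```python
def kfold_indices(n_rows, n_folds=3):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    indices = list(range(n_rows))
    fold_size = n_rows // n_folds
    for fold in range(n_folds):
        start = fold * fold_size
        # The last fold absorbs any leftover rows
        stop = n_rows if fold == n_folds - 1 else start + fold_size
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test


for train, test in kfold_indices(9):
    # fit() would see only the train rows, predict() only the test rows
    print("train:", train, "test:", test)
```

Each row lands in the test part exactly once, so every prediction that gets scored comes from a model that did not train on that row.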

## GridSearch

In [None]:
from sklearn.model_selection import GridSearchCV
mod = GridSearchCV(estimator=pipe,  # the estimator has a .fit and .predict
    # step name and parameter name are joined by a double underscore
    param_grid={'model__n_neighbors': [1,2,3,4,5,6,7,8,9,10]}, cv=3)
mod.fit(X, y)
mod.cv_results_  # for every cross-validation run it keeps track of the numbers in a dictionary. Putting it into a df lets you see training time, score, and rank.

## Using Scikit-learn the wrong way

In [None]:
# IT IS ALWAYS IMPORTANT TO INSPECT THE DATASET BEFORE MAKING PREDICTIONS
print(load_boston()['DESCR'])  # the description reveals an offensive and biased data column


## ML vs Reality

There is a danger in using grid search:
    - Optimistic insights and "well" performing models can give you a blind spot in your development
    - The data and output of a machine learning algorithm are the user's responsibility

# Cookiecutter

Goal: Optimizing the base for all of your future projects

cookiecutter 
- json file
- folder and directory structure for automatic filling 

In a venv, create a file called cookiecutter.json
and insert:
{
    "project":"project"
}
Create a folder called {{cookiecutter.project}}
    - inside the folder create a readme.md
    - insert # readme of {{cookiecutter.project}}

When it runs, cookiecutter reads the .json file, prompts for each value, and replaces the placeholders with your answer (here "hello-world"). It then creates a folder called hello-world containing a readme.md that mentions hello-world.

pip install cookiecutter
cookiecutter .
# the computer will prompt you for an input; type
hello-world
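
What the placeholder replacement amounts to can be sketched in a few lines of Python. This is a deliberately simplified stand-in: `render` is a hypothetical helper that uses a regular expression, whereas Cookiecutter itself renders templates with a full templating engine.

```python
import re


def render(template, context):
    """Replace each {{cookiecutter.key}} placeholder with its value from context."""
    def lookup(match):
        return str(context[match.group(1)])
    return re.sub(r"\{\{\s*cookiecutter\.(\w+)\s*\}\}", lookup, template)


context = {"project": "hello-world"}  # the answer typed at the prompt

folder_name = render("{{cookiecutter.project}}", context)
readme_text = render("# readme of {{cookiecutter.project}}", context)

print(folder_name)  # hello-world
print(readme_text)  # # readme of hello-world
```

The same substitution is applied to folder names, file names, and file contents, which is why one answer at the prompt shows up in all three places.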

## Adding an Author

In cookiecutter.json:

{
    "project":"project",
    "author":"author"
}

In the readme file for cookiecutter, type:
This package was made by {{cookiecutter.author}}. 

In the terminal, type cookiecutter .

Ultimately you can use the {{}} as a dynamic field.

## Adding Folders

Create another folder called {{cookiecutter.project_slug}}.
In the cookiecutter.json, type:

{
    "project":"project",
    "author":"author",
    "project_slug":"{{cookiecutter.project.lower().replace(' ','_')}}"
}
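Anything inside "{{}}" is evaluated as a template expression, so Python string methods such as .lower() and .replace() can be used on the value. When you input the project name, the default for project_slug is derived from it by the rules on the project_slug line.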
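
The project_slug rule is ordinary Python string handling, so it can be tried outside of Cookiecutter. The project name below is a made-up example answer.

```python
project = "My New Project"  # hypothetical answer to the project prompt

# Same expression as in cookiecutter.json, applied to a plain Python string
project_slug = project.lower().replace(" ", "_")

print(project_slug)  # my_new_project
```

Lowercasing and replacing spaces gives a name that is safe to use as a folder or package name.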

## Conditionals and sets

If you want the end user to select one option out of a set,
in the cookiecutter.json type:
{
    "project":"project",
    "author":"author",
    "project_slug":"{{cookiecutter.project.lower().replace(' ','_')}}",
    "license": ["MIT","BSD"]
}

In the readme, add:

{%- if cookiecutter.license == "MIT" -%}

This is an MIT license.

{%- elif cookiecutter.license == "BSD" -%}

This is a BSD license.

{% endif %}


Cookiecutter uses Jinja, a Python templating tool, for the templating and its syntax.

When you run the "cookiecutter ." command in terminal you will see the choices for MIT or BSD. 

in the readme file you will see the answers from the input!
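
The branching in the readme template behaves like a plain if/elif in Python. The sketch below mirrors it with a hypothetical `license_line` function; it illustrates the logic only and does not run the template itself.

```python
def license_line(choice):
    """Return the README sentence the template would emit for a license choice."""
    if choice == "MIT":
        return "This is an MIT license."
    elif choice == "BSD":
        return "This is a BSD license."
    return ""  # no branch matched, so nothing is written


choices = ["MIT", "BSD"]  # the list from cookiecutter.json
default = choices[0]      # Cookiecutter offers the first item as the default

print(license_line(default))  # This is an MIT license.
print(license_line("BSD"))    # This is a BSD license.
```

Because the first list item is the default, the order of the options in cookiecutter.json decides what a user gets by pressing Enter at the prompt.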

## Setup.py

Create a setup.py 

from setuptools import setup 

setup(
    name="{{cookiecutter.project_slug}}",
    version ="0.0.1"
)

Now running cookiecutter . in the terminal will set the package name in setup.py dynamically from the project name you give at the first prompt.

This is great for automation.

## Sharing cookiecutter

You can push the {{cookiecutter.project}} folder and cookiecutter.json to GitLab or GitHub.

Instead of typing cookiecutter ., you can type cookiecutter (link to the Git repository).

Great for not storing the template locally!

# Final Thoughts

You can always find other cookiecutters online. 
Check the [official cookiecutter GitHub](https://github.com/cookiecutter/cookiecutter).