<a href="https://colab.research.google.com/github/Rohanrathod7/my-ml-labs/blob/main/14_Model_Validation_in_Python/01_Basic_Modeling_in_scikit-learn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### 1. Basic Modeling in scikit-learn


Machine learning models are easier to implement now than ever before. Without proper validation, however, the results of running new data through a model might not be as accurate as expected. Model validation allows analysts to confidently answer the question: how good is your model? We will answer this question for classification models using the complete set of tic-tac-toe endgame scenarios, and for regression models using FiveThirtyEight's ultimate Halloween candy power ranking dataset. In this course, we will cover the basics of model validation, discuss various validation techniques, and begin to develop tools for creating validated and high-performing models.

### Introduction to model validation

In [9]:
# Standard library, data handling, and plotting
import datetime as dt

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.cluster.hierarchy import dendrogram, linkage

# scikit-learn models, metrics, and model-selection helpers
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, LogisticRegression, Ridge, SGDClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Read the candy dataset from the course repository
url = "https://raw.githubusercontent.com/Rohanrathod7/my-ml-labs/main/14_Model_Validation_in_Python/dataset/candy-data.csv"
candy = pd.read_csv(url)

# Preview the first five rows (display() is available in Jupyter and Colab)
display(candy.head())

Unnamed: 0,competitorname,chocolate,fruity,caramel,peanutyalmondy,nougat,crispedricewafer,hard,bar,pluribus,sugarpercent,pricepercent,winpercent
0,100 Grand,1,0,1,0,0,1,0,1,0,0.732,0.86,66.971725
1,3 Musketeers,1,0,0,0,1,0,0,1,0,0.604,0.511,67.602936
2,One dime,0,0,0,0,0,0,0,0,0,0.011,0.116,32.261086
3,One quarter,0,0,0,0,0,0,0,0,0,0.011,0.511,46.116505
4,Air Heads,0,1,0,0,0,0,0,0,0,0.906,0.511,52.341465


**Seen vs. unseen data**  
Models tend to have higher accuracy on observations they have seen before. In the candy dataset, a prediction of the popularity of Skittles will likely be more accurate than a prediction of the popularity of Andes Mints, because Skittles is in the dataset and Andes Mints is not.

You've built a model based on 50 candies using the dataset X_train and need to report how well the model predicts the popularity of the 50 candies it was built on, and of the 35 candies (X_test) it has never seen. (The code below holds out 30% of the rows with test_size=0.3, so its exact counts differ.) You will use the mean absolute error, mae(), as the metric; lower values mean better predictions.

In [14]:
# Define features (X) and target (y)
# Note: the 'Unnamed: 0' index column is not dropped, so it stays in X as a feature
X = candy.drop(['competitorname', 'winpercent'], axis=1)
y = candy['winpercent']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1111)

# Define and initialize the model (a random forest regressor with default parameters)
model = RandomForestRegressor()

# The model is fit using X_train and y_train
model.fit(X_train, y_train)

# Create vectors of predictions
train_predictions = model.predict(X_train)
test_predictions = model.predict(X_test)

# Define the Mean Absolute Error function
def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

# Train/Test Errors
train_error = mae(y_true=y_train, y_pred=train_predictions)
test_error = mae(y_true=y_test, y_pred=test_predictions)

# Print the error for seen and unseen data
print("Model error on seen data: {0:.2f}.".format(train_error))
print("Model error on unseen data: {0:.2f}.".format(test_error))

# When models perform differently on training and testing data, you should look to model validation to ensure you have the best performing model.
# In the next lesson, you will start building models to validate.

Model error on seen data: 3.57.
Model error on unseen data: 9.70.
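
The gap between seen and unseen error can be reproduced without downloading anything. The following is a minimal sketch, assuming a synthetic regression dataset from `make_regression` as a stand-in for the candy data (the names `X_demo`, `y_demo`, and `rfr_demo` are hypothetical). It also checks the hand-written `mae()` against scikit-learn's `mean_absolute_error`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the candy data (assumption: 85 rows, 11 features)
X_demo, y_demo = make_regression(
    n_samples=85, n_features=11, noise=10.0, random_state=1111
)

# Same 70/30 split settings as the cell above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1111
)

# Fixing random_state makes the fitted forest repeatable
rfr_demo = RandomForestRegressor(n_estimators=100, random_state=1111)
rfr_demo.fit(X_tr, y_tr)


def mae(y_true, y_pred):
    """Mean absolute error: the average size of the prediction errors."""
    return np.mean(np.abs(y_true - y_pred))


train_mae = mae(y_tr, rfr_demo.predict(X_tr))
test_mae = mae(y_te, rfr_demo.predict(X_te))

print("Rows in train/test: {} / {}".format(len(X_tr), len(X_te)))
print("MAE on seen data: {0:.2f}".format(train_mae))
print("MAE on unseen data: {0:.2f}".format(test_mae))

# The hand-written function should agree with scikit-learn's metric
print(np.isclose(test_mae, mean_absolute_error(y_te, rfr_demo.predict(X_te))))
```

With 85 rows and `test_size=0.3`, `train_test_split` rounds the test set up, giving 59 training rows and 26 test rows.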


### Regression models


**Set parameters and fit a model**  
Predictive tasks fall into one of two categories: regression or classification. In the candy dataset, the outcome is a continuous variable describing how often the candy was chosen over another candy in a series of 1-on-1 match-ups. To predict this value (the win-percentage), you will use a regression model.

In this exercise, you will set a few parameters on the random forest regression model created earlier, which is stored here as model.

In [16]:
# Set the number of trees
model.n_estimators = 100

# Add a maximum depth
model.max_depth = 6

# Set the random state
model.random_state = 1111

# Fit the model
model.fit(X_train, y_train)

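# You have updated parameters _after_ the model was initialized.
# This approach is helpful when you need to change parameters on an existing model;
# the new values take effect the next time fit() is called.
# A natural next step, before making predictions, is to check which candy
# characteristics were most important to the model (feature_importances_).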
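
As a hedged illustration of both ideas, the sketch below updates parameters after initialization with `set_params()` (an alternative to assigning attributes directly) and then ranks features by `feature_importances_`. It assumes synthetic data with hypothetical column names `feature_0` to `feature_4`, not the candy columns.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with hypothetical feature names (not the candy columns)
feature_names = ["feature_{}".format(i) for i in range(5)]
X_arr, y_demo = make_regression(
    n_samples=85, n_features=5, n_informative=2, random_state=1111
)
X_demo = pd.DataFrame(X_arr, columns=feature_names)

# Initialize with defaults, then update parameters afterwards
rfr_demo = RandomForestRegressor()
rfr_demo.set_params(n_estimators=100, max_depth=6, random_state=1111)

# The updated parameters are used when the model is fit
rfr_demo.fit(X_demo, y_demo)

# Pair each importance with its column name and sort, largest first
importances = pd.Series(
    rfr_demo.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

`set_params()` rejects misspelled parameter names with an error, whereas assigning a misspelled attribute such as `model.n_estimator = 100` silently creates an unused attribute.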

### Classification models

In [17]:

url = "https://raw.githubusercontent.com/Rohanrathod7/my-ml-labs/main/14_Model_Validation_in_Python/dataset/tic-tac-toe.csv"

# Read the tic-tac-toe CSV file.
# Note: the DataFrame reuses the name `candy`, overwriting the candy data;
# it is copied into `tictactoe` in a later cell.
candy = pd.read_csv(url)

# Preview the first five rows
display(candy.head())

Unnamed: 0,Top-Left,Top-Middle,Top-Right,Middle-Left,Middle-Middle,Middle-Right,Bottom-Left,Bottom-Middle,Bottom-Right,Class
0,x,x,x,x,o,o,x,o,o,positive
1,x,x,x,x,o,o,o,x,o,positive
2,x,x,x,x,o,o,o,o,x,positive
3,x,x,x,x,o,o,o,b,b,positive
4,x,x,x,x,o,o,b,o,b,positive


**Classification predictions**  
In model validation, it is often important to know more about the predictions than just the final classification. When predicting who will win a game, most people are also interested in how likely it is that a team will win.

| Probability | Prediction | Meaning |
|---|---|---|
| 0 to < .50 | 0 | Team loses |
| .50 + | 1 | Team wins |

In this exercise, you look at the methods .predict() and .predict_proba() using the tic_tac_toe dataset. The first method predicts whether Player One will win the game, and the second returns the predicted probability of each class, including the probability of Player One winning. Use rfc as the random forest classification model.

In [21]:
from sklearn.ensemble import RandomForestClassifier

# The tic-tac-toe data was loaded into the `candy` DataFrame in the previous cell;
# copy it under a clearer name
tictactoe = candy.copy()

# Define features (X) and target (y) for the tic-tac-toe dataset
X = tictactoe.drop('Class', axis=1)
y = tictactoe['Class']

# The board squares hold the strings 'x', 'o', or 'b', so one-hot encode them
X = pd.get_dummies(X)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define and initialize the RandomForestClassifier model
rfc = RandomForestClassifier(random_state=1111)

In [23]:
# Fit the rfc model.
rfc.fit(X_train, y_train)

# Create arrays of predictions
classification_predictions = rfc.predict(X_test)
probability_predictions = rfc.predict_proba(X_test)

# Print out count of binary predictions
print(pd.Series(classification_predictions).value_counts())

# Print the first value from probability_predictions
print('The first predicted probabilities are: {}'.format(probability_predictions[0]))

positive    194
negative     94
Name: count, dtype: int64
The first predicted probabilities are: [0.57 0.43]
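
To read an array such as `[0.57 0.43]`, you need to know which column belongs to which class. scikit-learn orders the columns of `.predict_proba()` by the fitted model's `classes_` attribute, which is sorted, so with string labels `'negative'` comes before `'positive'`. The sketch below shows this on synthetic data, assuming hypothetical names `X_demo`, `y_demo`, and `rfc_demo` rather than the tic-tac-toe frame.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary problem with string labels (stand-in for the Class column)
X_demo, y_num = make_classification(n_samples=300, n_features=9, random_state=1111)
y_demo = np.where(y_num == 1, "positive", "negative")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

rfc_demo = RandomForestClassifier(random_state=1111)
rfc_demo.fit(X_tr, y_tr)

# classes_ gives the column order used by predict_proba
print(rfc_demo.classes_)

# Label the probability columns so each row is unambiguous
proba = pd.DataFrame(rfc_demo.predict_proba(X_te), columns=rfc_demo.classes_)
print(proba.head())

# predict() returns the class with the highest predicted probability
labels = rfc_demo.predict(X_te)
from_proba = rfc_demo.classes_[np.argmax(proba.to_numpy(), axis=1)]
print((labels == from_proba).all())
```

Read this way, the first row of the output above would assign 0.57 to `'negative'` and 0.43 to `'positive'`.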


**Reusing model parameters**  
Replicating model performance is vital in model validation. Replication is also important when sharing models with co-workers, reusing models on new data or asking questions on a website such as Stack Overflow. You might use such a site to ask other coders about model errors, output, or performance. The best way to do this is to replicate your work by reusing model parameters.

In this exercise, you use various methods to recall which parameters were used in a model.

In [24]:
rfc = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)

# Print the classification model
print(rfc)

# Print the classification model's random state parameter
print('The random state is: {}'.format(rfc.random_state))

# Print all parameters
print('Printing the parameters dictionary: {}'.format(rfc.get_params()))

# Recalling which parameters were used will be helpful going forward.
# Model validation and performance rely heavily on which parameters were used,
# and there is no way to replicate a model without keeping track of the parameters used!

RandomForestClassifier(max_depth=6, n_estimators=50, random_state=1111)
The random state is: 1111
Printing the parameters dictionary: {'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 6, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'monotonic_cst': None, 'n_estimators': 50, 'n_jobs': None, 'oob_score': False, 'random_state': 1111, 'verbose': 0, 'warm_start': False}
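
One way to reuse recorded parameters is to pass the dictionary from `get_params()` back into the constructor, or to call `sklearn.base.clone()`, which builds a new unfitted estimator with the same parameters. This is a minimal sketch; the names `rfc_original`, `rfc_rebuilt`, and `rfc_cloned` are hypothetical.

```python
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier

rfc_original = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)

# Save the full parameter dictionary (this is what you would share with a co-worker)
saved_params = rfc_original.get_params()

# Option 1: rebuild a model from the saved dictionary
rfc_rebuilt = RandomForestClassifier(**saved_params)

# Option 2: clone() copies the parameters into a new, unfitted estimator
rfc_cloned = clone(rfc_original)

print(rfc_rebuilt.get_params() == saved_params)
print(rfc_cloned.get_params() == saved_params)
```

Because `random_state` is part of the dictionary, a model rebuilt this way and fit on the same data should give the same predictions as the original.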


**Random forest classifier**  
This exercise reviews the four modeling steps discussed throughout this chapter using a random forest classification model. You will:

- Create a random forest classification model.
- Fit the model using the tic_tac_toe dataset.
- Make predictions on whether Player One will win or lose the current game (the Class column, labeled positive or negative in this dataset).
- Finally, you will evaluate the overall accuracy of the model.

Let's get started!

In [25]:
from sklearn.ensemble import RandomForestClassifier

# Create a random forest classifier
rfc = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)

# Fit rfc using X_train and y_train
rfc.fit(X_train, y_train)

# Create predictions on X_test
predictions = rfc.predict(X_test)
print(predictions[0:5])

# Print model accuracy using score() and the testing data
print(rfc.score(X_test, y_test))

# Notice that three of the first five predictions are 'positive' (Player One is predicted to win) and two are 'negative'. The model accuracy on the test set is about 91%.

# Let's move on to Chapter 2 and increase our model validation toolbox by learning about splitting datasets, standard accuracy metrics, and the bias-variance tradeoff.

['positive' 'positive' 'positive' 'negative' 'negative']
0.90625
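
For a classifier, `.score()` reports accuracy: the fraction of test rows whose predicted label matches the true label. The sketch below, which assumes synthetic data and hypothetical names (`X_demo`, `y_demo`, `rfc_demo`), computes the same number three ways to make that explicit.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (stand-in for the encoded tic-tac-toe board)
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=1111)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Same four steps as above: create, fit, predict, evaluate
rfc_demo = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)
rfc_demo.fit(X_tr, y_tr)
preds = rfc_demo.predict(X_te)

score_acc = rfc_demo.score(X_te, y_te)    # built-in accuracy
manual_acc = np.mean(preds == y_te)       # fraction of matching labels
metric_acc = accuracy_score(y_te, preds)  # scikit-learn metric function

print(score_acc, manual_acc, metric_acc)
```

Applied to the output above, an accuracy of 0.90625 on the 288 test games corresponds to 261 correct predictions.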
