<h1><center>MACHINE LEARNING</center></h1>
<center><img src="https://www.fsm.ac.in/blog/wp-content/uploads/2022/08/ml-e1610553826718.jpg" align="center"/></center>

# Introduction
Welcome to "Data Science with Machine Learning," your practical guide to unlocking insights from data using cutting-edge machine learning techniques. In today's data-driven world, organizations across industries are leveraging the power of data science and machine learning to extract valuable insights, make informed decisions, and drive innovation.

This notebook is designed to provide you with a hands-on approach to mastering the intersection of data science and machine learning. Whether you're a beginner looking to build a solid foundation in data science principles or an experienced practitioner seeking advanced techniques and strategies, this notebook will equip you with the skills and knowledge you need to succeed in today's competitive landscape.

## What is Machine Learning
Machine Learning is a subset of artificial intelligence that focuses on building systems that can learn from data. The goal is to enable computers to make predictions or decisions without being explicitly programmed.
## Machine Learning Chart
![Image](https://miro.medium.com/v2/resize:fit:720/format:webp/0*botktOR526S9maYd)
## Types of Machine Learning
1. **Supervised Learning**: In supervised learning, the algorithm learns from labeled data, which means it's provided with input-output pairs. The goal is to learn a mapping from inputs to outputs, so when presented with new, unseen data, it can predict the correct output. Common algorithms in supervised learning include linear regression, decision trees, support vector machines (SVM), and neural networks.
![Image](https://databasetown.com/wp-content/uploads/2023/05/Supervised-Learning-1024x726.jpg)
2. **Unsupervised Learning**: Unsupervised learning involves learning from unlabeled data. The algorithm tries to find patterns or intrinsic structures in the input data. Clustering and dimensionality reduction are typical tasks in unsupervised learning. Clustering algorithms like k-means, hierarchical clustering, and density-based clustering, along with dimensionality reduction techniques like principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE), fall under this category.
![Image](https://databasetown.com/wp-content/uploads/2023/05/Unsupervised-Learning-1024x726.jpg)
3. **Reinforcement Learning**: Reinforcement learning (RL) is about learning to make decisions sequentially. It learns from feedback in the form of rewards or punishments. The agent, which makes decisions, interacts with an environment and learns to choose actions that maximize cumulative reward over time. Reinforcement learning has been successfully applied in various domains such as game playing, robotics, and autonomous vehicle control.
Algorithms like Q-learning, deep Q-networks (DQN), and policy gradients are commonly used in reinforcement learning.
![Image](https://www.3nions.com/wp-content/uploads/2021/03/Reinforcement-Learning-in-ML-TV.jpeg)
These categories are not mutually exclusive, and there are also hybrid approaches that combine elements of these types, such as semi-supervised learning, where the algorithm learns from a combination of labeled and unlabeled data. Additionally, there are specialized forms of learning, such as online learning and transfer learning, which adapt the learning process to specific scenarios and tasks.
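
To make the difference between the first two categories concrete, here is a minimal sketch. It assumes the small Iris dataset bundled with scikit-learn, and the choice of `LogisticRegression` and `KMeans` is only illustrative; other algorithms from the lists above would serve equally well.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Iris ships with scikit-learn, so nothing has to be downloaded.
X, y = load_iris(return_X_y=True)

# Supervised: the model sees both the features and the labels while training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
supervised_accuracy = clf.score(X_test, y_test)
print("Supervised test accuracy:", supervised_accuracy)

# Unsupervised: the model sees only the features and groups similar rows.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X)
print("Cluster assignments for the first 10 rows:", cluster_ids[:10])
```

Note that the cluster ids are arbitrary group numbers; they are not the species labels, because the clustering model never saw them.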
## AI Machine Learning
It is important to note that Machine Learning (ML) is a subset of Artificial Intelligence (AI), and Deep Learning (DL) is in turn a subset of ML. Both ML and DL therefore fall under AI.
![Image](https://idm.net.au/sites/idm.net.au/files/Hexagon1-600.png)
## Tools for implementing ML
- Pandas: for data manipulation and analysis; offers data structures such as DataFrames that are crucial for handling structured data
- NumPy: the fundamental package for scientific computing in Python; provides support for multi-dimensional arrays and matrices
- Matplotlib: a 2D plotting library for creating static, animated, and interactive visualizations; useful for visualizing data and model performance
- Seaborn: built on Matplotlib; provides a high-level interface for statistical data visualization and simplifies creating informative and attractive statistical graphics
- Scikit-Learn: provides simple and efficient tools for data mining and data analysis; includes machine learning algorithms for classification, regression, clustering, and more

# What is Scikit-Learn?
Scikit-learn is an open-source Python library used to implement machine learning models. It was previously known as scikits.learn and was created in 2007 as a Google Summer of Code project by David Cournapeau.
Along with scikit-learn, we will be using a few other libraries such as NumPy, pandas, and Matplotlib.
## What kind of Dataset won't be recognized by Scikit Learn
A dataset that won't work well with scikit-learn is one with missing values or NaNs (Not a Number) that haven't been handled or imputed. Most scikit-learn estimators expect complete datasets with no missing values. If your dataset has missing values, you need to preprocess it by either imputing them (replacing them with values estimated from other data points) or removing the affected rows or columns. Otherwise, the estimator may raise an error or produce unreliable results. You can fill missing values with pandas or with scikit-learn itself.
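
As a rough illustration of both options, the sketch below fills missing values once with pandas and once with scikit-learn's `SimpleImputer`. The tiny DataFrame and its column names (`rooms`, `age`) are made up for this example, and mean imputation is just one possible strategy.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with two missing entries.
df = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 4.0],
    "age": [10.0, 20.0, np.nan, 40.0],
})

# Option 1 (pandas): fill each column with that column's mean.
filled_with_pandas = df.fillna(df.mean())

# Option 2 (scikit-learn): fit() learns the column means, transform() applies them.
imputer = SimpleImputer(strategy="mean")
filled_with_sklearn = imputer.fit_transform(df)

print(filled_with_pandas)
print(filled_with_sklearn)
```

In a real project the imputer would normally be fitted on the training split only and then applied to the test split, so that no information from the test data leaks into training.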
## Scikit Learn Workflow
- Get data ready
- Pick a model to suit your problem
- Fit the model to the data and make predictions
- Evaluate the model
- Improve through experimentation
- Save!!!
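
The six steps above can be sketched end to end as follows. This is only an outline under stated assumptions: the data is synthetic (`make_regression`), `Ridge` and the `alpha` grid are arbitrary choices, and the file name `ridge_model.joblib` is hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# 1. Get data ready (synthetic stand-in data, not a real dataset).
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Pick a model to suit the problem (regression here).
model = Ridge()

# 3. Fit the model to the data and make predictions.
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# 4. Evaluate: for regressors, score() returns R^2.
baseline_score = model.score(X_test, y_test)
print("Baseline R^2:", baseline_score)

# 5. Improve through experimentation: try a few regularization strengths.
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)
print("Best alpha:", search.best_params_["alpha"])

# 6. Save the best model, then load it back to confirm it still works.
model_path = os.path.join(tempfile.gettempdir(), "ridge_model.joblib")
joblib.dump(search.best_estimator_, model_path)
loaded_model = joblib.load(model_path)
print("Reloaded model R^2:", loaded_model.score(X_test, y_test))
```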
## Where Can You Get Help?
- Follow along with the code
- Try it yourself
- Press Shift+Tab to read the docstring in Jupyter Notebook
- Search Stack Overflow and/or read the library's documentation
- Try again
- Ask

# Choosing the Right Model for your Data
![image](https://scikit-learn.org/stable/_static/ml_map.png)

Scikit Learn Documentation: https://scikit-learn.org/stable/tutorial/index.html

This diagram is a flowchart: you answer a series of questions about your data, and each answer leads you toward a family of models. It will guide you, to an extent, toward the right model for your dataset. Let's work through an example use case below.

## Use Case

In [1]:
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
print(housing.data.shape, housing.target.shape)
print(housing.feature_names[0:6],'\n')
housing

(20640, 8) (20640,)
['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup'] 



{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'frame': None,
 'target_names': ['MedHouseVal'],
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n

In [2]:
# Let's turn this into a DataFrame so it is easier to use and understand
import pandas as pd
housing_df = pd.DataFrame(housing['data'], columns=housing['feature_names'])
housing_df['target'] = housing['target']
housing_df.head()

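|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|--------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |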


In [3]:
# Now that our dataset is visible and easier to understand, we start our exercise

#setup random seed
import numpy as np
np.random.seed(42)

# extracting the dependent and independent variables
x = housing_df.iloc[:, :-1].values
y = housing_df.iloc[:, 8].values

# split into train and test
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Now comes the moment to figure out which model to use for this dataset.
# Go back to the picture/map above. You will see a circle labelled START.
# That is where you start from.
# I will walk you through it step by step from the beginning.

# From START, follow the arrow down to ">50 samples". This asks whether our dataset has more than 50 samples (rows). In our case: YES
# Follow the YES arrow to "predicting a category". Are we predicting a category? In our case: NO
# Follow the NO arrow to "predicting a quantity". Are we predicting a quantity? In our case: YES
# This takes us straight to the REGRESSION area of the chart.
# Following the arrow again, we reach "<100K samples". This asks whether our dataset has fewer than 100k samples (rows). In our case: YES
# Then you pick the model

# Based on the chart, I'll be picking the Ridge regression model
from sklearn.linear_model import Ridge
# fitting Ridge Regression to the Training set
model = Ridge() # creates an instance of Ridge regression
model.fit(x_train, y_train)

# check the score of the model on the test set
model.score(x_test, y_test)

# The score we got (0.5758549611440142) is pretty low. For a regressor, score() returns R^2; we should be aiming for 0.8 and above (the maximum is 1.0)
# Because of the low score, we try another model. Based on the chart, the next option would be SVR(kernel='linear')

0.5758549611440142

In [13]:
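# make predictions
y_pred = model.predict(x_test)
print(y_pred)
# Note: checking exact equality is an accuracy measure for classifiers.
# A regressor predicts continuous values that almost never match the targets
# exactly, so this comes out as 0.0 and says little about model quality.
np.mean(y_pred == y_test)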

[0.71923978 1.76395141 2.70909238 ... 4.46864495 1.18785499 2.00912494]


0.0
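
The `0.0` above comes from comparing continuous predictions with the targets for exact equality, which is an accuracy-style check meant for classification. For regression, error-based metrics are more informative. Here is a small sketch using hand-made arrays (`y_true` and `y_hat` are hypothetical values, not the housing predictions):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true and predicted values, for illustration only.
y_true = np.array([3.0, 2.5, 4.0, 1.5])
y_hat = np.array([2.5, 2.5, 3.0, 2.0])

# Fraction of predictions that match exactly: one out of four here.
exact_match_rate = np.mean(y_hat == y_true)

# Regression metrics measure how far off the predictions are.
mae = mean_absolute_error(y_true, y_hat)  # average absolute error
mse = mean_squared_error(y_true, y_hat)   # average squared error
r2 = r2_score(y_true, y_hat)              # same quantity that score() returns

print("Exact match rate:", exact_match_rate)  # 0.25
print("MAE:", mae)                            # 0.5
print("MSE:", mse)                            # 0.375
print("R^2:", round(r2, 4))                   # 0.5385
```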

In [9]:
# trying to get the predicted probabilities
# Ridge is a regressor, so it has no 'predict_proba' method;
# predict_proba is provided by classifiers that can estimate class probabilities
y_pred_pro = model.predict_proba(x_test)
y_pred_pro

AttributeError: 'Ridge' object has no attribute 'predict_proba'
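
One way to avoid this error is to check for the method before calling it. The sketch below assumes the bundled Iris dataset and uses `LogisticRegression` as an example of a classifier that does provide `predict_proba`; fitting `Ridge` on class labels is done here purely to have a fitted regressor to inspect.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression, Ridge

X, y = load_iris(return_X_y=True)

# A regressor (labels treated as plain numbers, for demonstration only)
# and a classifier, both fitted on the same data.
regressor = Ridge().fit(X, y)
classifier = LogisticRegression(max_iter=1000).fit(X, y)

# hasattr() tells us whether an estimator offers predict_proba at all.
for est in (regressor, classifier):
    print(type(est).__name__, "has predict_proba:", hasattr(est, "predict_proba"))

# For the classifier, each row holds one probability per class and sums to 1.
probabilities = classifier.predict_proba(X[:3])
print(probabilities)
```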

In [5]:
'''# Import necessary libraries
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR  # Import SVR for regression

# Set random seed
np.random.seed(42)

# Assuming 'housing_df' is your DataFrame containing the dataset

# Extracting the dependent and independent variables
x = housing_df.iloc[:, :-1].values
y = housing_df.iloc[:, -1].values  # Assuming the last column is the target variable

# Splitting into train and test sets
try:
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
except ValueError as e:
    print("Error:", e)
    # Add additional error handling or modify the code accordingly

# Creating and training the model
try:
    model = SVR(kernel='linear', C=1)  # Use SVR for regression
    model.fit(x_train, y_train)
except ValueError as e:
    print("Error:", e)
    # Add additional error handling or modify the code accordingly
else:
    # Evaluating the model
    score = model.score(x_test, y_test)
    print("R-squared Score:", score)'''


In [6]:
'''# Import necessary libraries
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR  # Import SVR for regression

# Set random seed
np.random.seed(42)

# Assuming 'housing_df' is your DataFrame containing the dataset

# Extracting the dependent and independent variables
x = housing_df.iloc[:, :-1].values
y = housing_df.iloc[:, -1].values  # Assuming the last column is the target variable

# Splitting into train and test sets
try:
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
except ValueError as e:
    print("Error:", e)
    # Add additional error handling or modify the code accordingly

# Creating and training the model
try:
    model = SVR(kernel='rbf', C=1)  # Use SVR for regression
    model.fit(x_train, y_train)
except ValueError as e:
    print("Error:", e)
    # Add additional error handling or modify the code accordingly
else:
    # Evaluating the model
    score = model.score(x_test, y_test)
    print("R-squared Score:", score)'''


In [7]:
'''# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingRegressor  # Use regression model

# Set random seed
np.random.seed(42)

# Assuming 'housing_df' is your DataFrame containing the dataset

# Extracting the dependent and independent variables
x = housing_df.iloc[:, :-1].values
y = housing_df.iloc[:, -1].values  # Assuming the last column is the target variable

# Splitting into train and test sets
try:
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
except ValueError as e:
    print("Error:", e)
    # Add additional error handling or modify the code accordingly

# Creating and training the model
try:
    model = HistGradientBoostingRegressor(max_iter=100)  # Use regression model
    model.fit(x_train, y_train)
except ValueError as e:
    print("Error:", e)
    # Add additional error handling or modify the code accordingly
else:
    # Evaluating the model
    score = model.score(x_test, y_test)
    print("R-squared Score:", score)
'''
